In [1]:
from sklearn.datasets import make_multilabel_classification
from sklearn.preprocessing import MultiLabelBinarizer
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier

In [2]:
X_train = np.array(["empleos",
                    "numero de unidades economias",
                    "desempleados",
                    "gente de 8 a 9 anios",
                    "gente de 10 a 23 anios",
                    "gente de 23 a 103 anios que tiene empleos",
                    "numero de unidades economicas que fabrican",
                    "producto interno bruto del pais",
                    "produccion total de mineria",
                    "produccion total de plata y oro y bronze",
                    "gente de 103 a 500 anios que no tiene empleos",
                    "poblacion que tiene carencias",
                    "unidades economicas de edicion",
                    "promedio de poblacion que vive en iztapalapa",
                    "esta mujeres, genero, violencia de genero y goles y encuestas",
                    "mujeres y violencia de genero que trabajan en unidades ecnomicas que tiene entre 4 y 105 anios"])

In [3]:
y_train = [[0],[0],[0],[1],[1],[0,1],[0],[0],[0],[0],[0,1],[1],[0],[1],[2],[0,2,1]]

In [148]:
# MultiLabelBinarizer().fit_transform(Y)

array([[1, 1, 1, 0, 0],
       [1, 1, 1, 0, 1],
       [1, 1, 0, 0, 1],
       [1, 1, 0, 0, 0],
       [0, 0, 1, 1, 0]])

In [27]:
X_test = np.array(['unidades economicas que fabrican celulares',
                   'gente de 45 a 102 anios',
                   'gente de 3 a 75 anios que tiene empleos en sector manufacturero',
                  'adios de 3 a 5768 anios',
                  'gente desempleada que tiene empleos',
                  'alquiler de maquinaria y equipo para construccion, mineria y actividades forestales',
                  'mujeres que reportan violencia de genero que trabajan en unidades economias de entre 4 y 789 anios']) 

In [28]:
target_names = ['economia', 'demografia','violencia de genero']

## What will we do to the data, and in what order?


**CountVectorizer** Converts a collection of text documents to a matrix of token counts.

1. Produces a sparse representation of the counts using a scipy.sparse matrix (scipy.sparse.csr_matrix in current scikit-learn releases; older documentation cites scipy.sparse.coo_matrix).

2. The number of features will be equal to the vocabulary size found by analyzing the data.
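
As a minimal sketch of the two points above, the cell below fits a CountVectorizer on three short documents borrowed from the training data and checks the shape of the result. The variable names (`docs`, `counts`) are illustrative only. It assumes the default tokenizer settings, under which tokens shorter than two characters (such as "a", "8" or "9") are dropped.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Three short documents taken from X_train (illustrative subset).
docs = ["gente de 8 a 9 anios",
        "produccion total de mineria",
        "gente que tiene empleos"]

vectorizer = CountVectorizer(lowercase=True, strip_accents='ascii')
counts = vectorizer.fit_transform(docs)

# The result is a sparse matrix: one row per document,
# one column per term in the learned vocabulary.
is_sparse = sparse.issparse(counts)
n_docs, n_features = counts.shape
vocabulary = sorted(vectorizer.vocabulary_)

print(is_sparse, n_docs, n_features)
print(vocabulary)
```

Note that single-character tokens never reach the vocabulary with these settings, so the ages "8" and "9" in the training sentences contribute nothing, while "10", "23" and "103" do.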


**TfidfTransformer** Transforms a count matrix to a normalized tf or tf-idf representation.

1. Tf means term-frequency, while tf-idf means term-frequency times inverse document-frequency. This is a common term weighting scheme in information retrieval that has also found good use in document classification.

2. The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus.

3. The actual formula used for tf-idf is tf*(idf + 1) = tf + tf*idf instead of tf*idf. The effect of this is that terms with zero idf, i.e. terms that occur in all documents of a training set, will not be entirely ignored. The formulas used to compute tf and idf depend on parameter settings that correspond to the SMART notation used in IR.
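
The weighting described above can be checked on a tiny example. This is a sketch under stated assumptions: the count matrix is hypothetical, and TfidfTransformer is used with its default settings, which in current scikit-learn releases smooth the idf and L2-normalize each row.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical counts: 3 documents x 3 terms.
# Term 0 occurs in every document, term 1 in two, term 2 in only one.
counts = np.array([[2, 1, 0],
                   [1, 0, 0],
                   [3, 1, 4]])

transformer = TfidfTransformer()
tfidf = transformer.fit_transform(counts).toarray()

# Learned idf weights: rarer terms get larger weights.
idf = transformer.idf_

# Each document vector has unit Euclidean length after normalization.
row_norms = np.linalg.norm(tfidf, axis=1)

print(idf)
print(tfidf)
```

Term 0 appears in all three documents, yet its column in `tfidf` stays non-zero, which is the "+1" effect described in point 3.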


**OneVsRestClassifier**

1. This strategy, also known as one-vs-all, is implemented in OneVsRestClassifier. The strategy consists of fitting one classifier per class. For each classifier, the class is fitted against all the other classes. In addition to its computational efficiency (only n_classes classifiers are needed), one advantage of this approach is its interpretability. Since each class is represented by one and only one classifier, it is possible to gain knowledge about the class by inspecting its corresponding classifier. This is the most commonly used strategy and is a fair default choice.
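
To illustrate the "one classifier per class" idea, here is a small sketch on made-up two-dimensional features (the arrays `X` and `labels` are hypothetical, not the notebook's text data). It assumes the labels are first converted to a binary indicator matrix with MultiLabelBinarizer, and then inspects the fitted per-class estimators.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Hypothetical features: 6 samples, 2 features each.
X = np.array([[1.0, 0.0], [0.9, 0.1],
              [0.0, 1.0], [0.1, 0.9],
              [1.0, 1.0], [0.9, 0.8]])
# Each sample may carry one or both labels.
labels = [[0], [0], [1], [1], [0, 1], [0, 1]]

# Binary indicator matrix: one column per class.
Y = MultiLabelBinarizer().fit_transform(labels)

ovr = OneVsRestClassifier(LinearSVC(random_state=0)).fit(X, Y)

# One fitted binary classifier per class, each with its own weights.
n_classifiers = len(ovr.estimators_)
coef_shapes = [est.coef_.shape for est in ovr.estimators_]

print(n_classifiers, coef_shapes)
```

Looking at `ovr.estimators_[k].coef_` shows which features push a sample towards class k, which is the interpretability benefit mentioned above.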

In [29]:
classifier = Pipeline([
    ('vectorizer', CountVectorizer(lowercase=True,strip_accents='ascii')),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC(random_state=0)))])

In [34]:
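# train and predict
# recent scikit-learn releases need a binary indicator matrix,
# not a list of label lists, so binarize the labels first
mlb = MultiLabelBinarizer()
classifier.fit(X_train, mlb.fit_transform(y_train))
predicted = mlb.inverse_transform(classifier.predict(X_test))
predicted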

[(0,), (1,), (0, 1), (1,), (0, 1), (0,), (0, 1, 2)]

### Prediction results:

In [32]:
for item, labels in zip(X_test, predicted):
    print('%s => %s' % (item, ', '.join(target_names[x] for x in labels)))

unidades economicas que fabrican celulares => economia
gente de 45 a 102 anios => demografia
gente de 3 a 75 anios que tiene empleos en sector manufacturero => economia, demografia
adios de 3 a 5768 anios => demografia
gente desempleada que tiene empleos => economia, demografia
alquiler de maquinaria y equipo para construccion, mineria y actividades forestales => economia
mujeres que reportan violencia de genero que trabajan en unidades economias de entre 4 y 789 anios => economia, demografia, violencia de genero
