# Sentiment analysis para clasificar críticas de películas

## Introducción
La idea de este notebook es establecer un método general adaptable a otros datasets. 

Primero vamos a ver las diferentes features que podemos extraer de un texto relativas a _sentiment analysis_. Construiremos un extractor de features generalizado que pueda valer para la mayoría de problemas

Luego aplicaremos las features a diferentes modelos, como _Support Vector Machines_ (__SVM__) o un clasificador _Naive Bayes_.

Lo interesante de esta estrategia es que nos permitirá saber automáticamente cuáles son las mejores features para nuestro problema mediante una búsqueda de parámetros.


## Sentiment analysis

La idea del _sentiment analysis_ es extraer información de opinión de un texto. Puede ser la polarización del texto (si habla bien o mal de algún tema), clasificación de sentimientos: si el emisor del mensaje está contento, enfadado, triste, etc.

En este caso vamos a clasificar críticas de películas en inglés. Podríamos intentar cuantificar la crítica con una puntuación de cero a diez, pero vamos a simplificarlo clasificando las críticas como buenas (puntuación mayor o igual a cinco) o malas (puntuación menor que cinco).

## Extracción de features

Nuestro modelo va a ser entrenado en base a ciertas features del texto. En vez de usar todo el texto, debemos usar algo mucho más sencillo de entender para el modelo. Trabajaremos sobre diferentes ideas para alcanzar un modelo lo más certero posible

Vamos a cargar un ejemplo del conjunto de entrenamiento para ver paso a paso los cambios que van ocurriendo. Para este ejemplo vamos a cargar la review negativa \#3886

In [169]:
raw_text = open('data/train/neg/3886_1.txt').read()
print(raw_text)

Is this a stupid movie? You bet!! I could not find any moment in this film that was creepy or scary. Stupid moments? Plenty. Stupid characters? You bet. Bad effects? Everywhere! Rick Baker may have gone and done bigger and better things, this is not one of them. Oh well people gotta start somewhere. Dr. Ted Nelson is cheesed. He is the most whiny doctor I've ever seen. He's got a melting man running amok out in Ventura County somewhere, he's not overly happy that his wife is pregnant (probably cause she's 55 years old and weighs 90 lbs) and there's no crackers to be found anywhere. Plus he's got the not-too-helpful general on his hinder wanting to find astronaut Steve. And the local sheriff wants to know what's going on even though Mr. Nelson can't tell him anything. There also some random characters thrown in for good measure who encounter the melting man. Eventually the movie ends and out monster gets scooped into a trash can to become compost. In the end it's just what you need for 

### Unigramas

Vamos a usar unigramas como base. Es decir, separaremos las palabras

In [108]:
from nltk.tokenize import word_tokenize

In [111]:
word_tokenize(raw_text)

['Is',
 'this',
 'a',
 'stupid',
 'movie',
 '?',
 'You',
 'bet',
 '!',
 '!',
 'I',
 'could',
 'not',
 'find',
 'any',
 'moment',
 'in',
 'this',
 'film',
 'that',
 'was',
 'creepy',
 'or',
 'scary',
 '.',
 'Stupid',
 'moments',
 '?',
 'Plenty',
 '.',
 'Stupid',
 'characters',
 '?',
 'You',
 'bet',
 '.',
 'Bad',
 'effects',
 '?',
 'Everywhere',
 '!',
 'Rick',
 'Baker',
 'may',
 'have',
 'gone',
 'and',
 'done',
 'bigger',
 'and',
 'better',
 'things',
 ',',
 'this',
 'is',
 'not',
 'one',
 'of',
 'them',
 '.',
 'Oh',
 'well',
 'people',
 'got',
 'ta',
 'start',
 'somewhere',
 '.',
 'Dr.',
 'Ted',
 'Nelson',
 'is',
 'cheesed',
 '.',
 'He',
 'is',
 'the',
 'most',
 'whiny',
 'doctor',
 'I',
 "'ve",
 'ever',
 'seen',
 '.',
 'He',
 "'s",
 'got',
 'a',
 'melting',
 'man',
 'running',
 'amok',
 'out',
 'in',
 'Ventura',
 'County',
 'somewhere',
 ',',
 'he',
 "'s",
 'not',
 'overly',
 'happy',
 'that',
 'his',
 'wife',
 'is',
 'pregnant',
 '(',
 'probably',
 'cause',
 'she',
 "'s",
 '55',
 'ye

Se puede ver que `word_tokenize` ha separado todas las palabras del texto, con algunos errores. Podemos ver que conceptos como _not-too-helpful_ no ha sido capaz de separarlo en diferentes palabras al llevar un guión.

Usando esto podríamos trabajar con la frecuencia de cada palabra en el texto. Esto ya nos es muy útil, pero podemos buscar algo mucho más fino.

### Stemming

Es posible que nos encontremos palabras con significados muy parecidos a la hora de clasificar en _sentiment analysis_ pero que al ser palabras distintas vayan a contar por separado. Un ejemplo de esto podría ser __*happy*__ (feliz) junto a __*happier*__ (más feliz que). Ambos ejemplos aportan la misma idea: felicidad, pero si solo usamos la frecuencia de palabras van a ser analizadas por separado.

Una posible aproximación para arreglar esto es guardar solo la raíces _(stems)_ de las palabras. De esta manera _happy_ y _happier_ se guardarían como __*happi*__. También van a ayudar a transformarlo todo a minúsculas.

Los diferentes _stemmers_ disponibles en NLTK no son perfectos, pero podemos ver como funcionan sobre un conjunto de palabras que ya hemos sacado en el ejemplo anterior

In [114]:
from nltk.stem.snowball import SnowballStemmer

In [115]:
snow_stemmer = SnowballStemmer(language='english')

Vemos cada palabra con su _stem_

In [123]:
[(word, snow_stemmer.stem(word)) for word in tokens]

[('Is', 'is'),
 ('this', 'this'),
 ('a', 'a'),
 ('stupid', 'stupid'),
 ('movie', 'movi'),
 ('?', '?'),
 ('You', 'you'),
 ('bet', 'bet'),
 ('!', '!'),
 ('!', '!'),
 ('I', 'i'),
 ('could', 'could'),
 ('not', 'not'),
 ('find', 'find'),
 ('any', 'ani'),
 ('moment', 'moment'),
 ('in', 'in'),
 ('this', 'this'),
 ('film', 'film'),
 ('that', 'that'),
 ('was', 'was'),
 ('creepy', 'creepi'),
 ('or', 'or'),
 ('scary', 'scari'),
 ('.', '.'),
 ('Stupid', 'stupid'),
 ('moments', 'moment'),
 ('?', '?'),
 ('Plenty', 'plenti'),
 ('.', '.'),
 ('Stupid', 'stupid'),
 ('characters', 'charact'),
 ('?', '?'),
 ('You', 'you'),
 ('bet', 'bet'),
 ('.', '.'),
 ('Bad', 'bad'),
 ('effects', 'effect'),
 ('?', '?'),
 ('Everywhere', 'everywher'),
 ('!', '!'),
 ('Rick', 'rick'),
 ('Baker', 'baker'),
 ('may', 'may'),
 ('have', 'have'),
 ('gone', 'gone'),
 ('and', 'and'),
 ('done', 'done'),
 ('bigger', 'bigger'),
 ('and', 'and'),
 ('better', 'better'),
 ('things', 'thing'),
 (',', ','),
 ('this', 'this'),
 ('is', 'is'),

### Part of speech (POS) tagging

Por último podemos aprovechar toda la potencia que da el análisis POS. Al analizar cada palabra y saber que tipo es (verbo, nombre, adjetivo), podemos afinar mucho más en aquellos sitios donde sabemos que está nuestra información.

En este caso, sabemos que las opiniones se producen sobre todo en adjetivos y adverbios (bueno, malo, peor que, etc), por lo tanto podemos hacer un filtrado por este tipo de palabras.

In [124]:
from nltk import pos_tag

In [125]:
tagged_tokens = pos_tag(tokens)
tagged_tokens

[('Is', 'VBZ'),
 ('this', 'DT'),
 ('a', 'DT'),
 ('stupid', 'JJ'),
 ('movie', 'NN'),
 ('?', '.'),
 ('You', 'PRP'),
 ('bet', 'VBP'),
 ('!', '.'),
 ('!', '.'),
 ('I', 'PRP'),
 ('could', 'MD'),
 ('not', 'RB'),
 ('find', 'VB'),
 ('any', 'DT'),
 ('moment', 'NN'),
 ('in', 'IN'),
 ('this', 'DT'),
 ('film', 'NN'),
 ('that', 'WDT'),
 ('was', 'VBD'),
 ('creepy', 'NN'),
 ('or', 'CC'),
 ('scary', 'JJ'),
 ('.', '.'),
 ('Stupid', 'JJ'),
 ('moments', 'NNS'),
 ('?', '.'),
 ('Plenty', 'NNP'),
 ('.', '.'),
 ('Stupid', 'NNP'),
 ('characters', 'NNS'),
 ('?', '.'),
 ('You', 'PRP'),
 ('bet', 'RB'),
 ('.', '.'),
 ('Bad', 'NNP'),
 ('effects', 'NNS'),
 ('?', '.'),
 ('Everywhere', 'RB'),
 ('!', '.'),
 ('Rick', 'NNP'),
 ('Baker', 'NNP'),
 ('may', 'MD'),
 ('have', 'VB'),
 ('gone', 'VBN'),
 ('and', 'CC'),
 ('done', 'VBN'),
 ('bigger', 'JJR'),
 ('and', 'CC'),
 ('better', 'JJR'),
 ('things', 'NNS'),
 (',', ','),
 ('this', 'DT'),
 ('is', 'VBZ'),
 ('not', 'RB'),
 ('one', 'CD'),
 ('of', 'IN'),
 ('them', 'PRP'),
 ('.', '

Cada palabra ha sido identificada con un tag. Podemos ver el significado de cada tag en 

In [126]:
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

En nuestro caso queremos quedarnos con adjetivos __JJ\*__ y adverbios __RB\*__ 

In [127]:
[(token, tag) for token, tag in tagged_tokens
 if tag[:2] == 'JJ' or tag[:2] == 'RB']

[('stupid', 'JJ'),
 ('not', 'RB'),
 ('scary', 'JJ'),
 ('Stupid', 'JJ'),
 ('bet', 'RB'),
 ('Everywhere', 'RB'),
 ('bigger', 'JJR'),
 ('better', 'JJR'),
 ('not', 'RB'),
 ('well', 'RB'),
 ('ta', 'JJ'),
 ('somewhere', 'RB'),
 ('most', 'RBS'),
 ('whiny', 'JJ'),
 ('ever', 'RB'),
 ('amok', 'RB'),
 ('somewhere', 'RB'),
 ('not', 'RB'),
 ('overly', 'RB'),
 ('happy', 'JJ'),
 ('pregnant', 'JJ'),
 ('probably', 'RB'),
 ('old', 'JJ'),
 ('anywhere', 'RB'),
 ('not-too-helpful', 'JJ'),
 ('general', 'JJ'),
 ('astronaut', 'JJ'),
 ('local', 'JJ'),
 ('even', 'RB'),
 ("n't", 'RB'),
 ('also', 'RB'),
 ('good', 'JJ'),
 ('Eventually', 'RB'),
 ('monster', 'JJR'),
 ('just', 'RB'),
 ('great', 'JJ')]

### Composición de pasos
Resumimos todos los pasos anteriores en una misma función que extraiga las features que queremos.

Lo bueno de hacerlo así es que, en el momento que queramos experimentar con otras features, solo tenemos que cambiar esta función dejando el resto del código tal cual

In [231]:
def custom_tokenizer(text, use_stem=True, stemmer=snow_stemmer, use_pos=True, use_only_adj=False):
    # Separate words
    words = word_tokenize(text)
    # PoS tagging words
    if use_pos:
        pos_tags = nltk.pos_tag(words)
    else:
        pos_tags = zip(words, [''] * len(words))
    
    tokens = []
    for word, tag in pos_tags:
        res_word = word
        use_word = True
        # Convert to stem
        if use_stem:
            res_word = stemmer.stem(res_word)
        # Use POS tag with the word
        if use_pos and not use_only_adj:
            res_word += '_' + tag
        # Only use adv and adj
        if use_only_adj and not (tag[:2] == 'JJ' or tag[:2] == 'RB'):
            use_word = False
        # Append the word to the tokenizer
        if use_word:
            tokens.append(res_word)
    return tokens
#word_tokenize es el primero
def text_stems_tok(text):
    return custom_tokenizer(text, use_stem=True, use_pos=False)
def pos_tok(text):
    return custom_tokenizer(text, use_stem=False, use_pos=True)
def pos_stems_tok(text):
    return custom_tokenizer(text, use_stem=True, use_pos=True)
def adj_tok(text):
    return custom_tokenizer(text, use_stem=False, use_pos=True, use_only_adj=True)
def adj_stems_tok(text):
    return custom_tokenizer(text, use_stem=True, use_pos=True, use_only_adj=True)

In [233]:
features = adj_stems_tok(raw_text)
features

['stupid',
 'not',
 'scari',
 'stupid',
 'bet',
 'everywher',
 'bigger',
 'better',
 'not',
 'well',
 'ta',
 'somewher',
 'most',
 'whini',
 'ever',
 'amok',
 'somewher',
 'not',
 'over',
 'happi',
 'pregnant',
 'probabl',
 'old',
 'anywher',
 'not-too-help',
 'general',
 'astronaut',
 'local',
 'even',
 "n't",
 'also',
 'good',
 'eventu',
 'monster',
 'just',
 'great']

In [182]:
nltk.download('tagsetstoken')

[nltk_data] Error loading tagsetstoken: Package 'tagsetstoken' not
[nltk_data]     found in index


False

# SKLearn approach

In [136]:
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split

In [137]:
dataset = load_files('data/train')
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target,
    test_size=0.2, random_state=0
)

In [160]:
test_dataset = load_files('data/test')

In [167]:
test_dataset.target
test_dataset.data

[b"Don't hate Heather Graham because she's beautiful, hate her because she's fun to watch in this movie. Like the hip clothing and funky surroundings, the actors in this flick work well together. Casey Affleck is hysterical and Heather Graham literally lights up the screen. The minor characters - Goran Visnjic {sigh} and Patricia Velazquez are as TALENTED as they are gorgeous. Congratulations Miramax & Director Lisa Krueger!",
 b'I don\'t know how this movie has received so many positive comments. One can call it "artistic" and "beautifully filmed", but those things don\'t make up for the empty plot that was filled with sexual innuendos. I wish I had not wasted my time to watch this movie. Rather than being biographical, it was a poor excuse for promoting strange and lewd behavior. It was just another Hollywood attempt to convince us that that kind of life is normal and OK. From the very beginning I asked my self what was the point of this movie,and I continued watching, hoping that it

In [138]:
y_train

array([0, 0, 0, ..., 0, 0, 1])

In [71]:
for t in range(len(dataset.target_names)):
    print(t, dataset.target_names[t])

0 neg
1 pos


In [147]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(tokenizer=custom_tokenizer)
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts[0]

<1x34327 sparse matrix of type '<class 'numpy.int64'>'
	with 24 stored elements in Compressed Sparse Row format>

In [155]:
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape

(20000, 34327)

In [149]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(20000, 34327)

In [150]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, y_train)


In [119]:
docs_new = [""]
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print(doc, '=>', dataset.target_names[category])


Hello, my name is Marco and I want to talk about weapons => weapons


In [205]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier

text_clf = Pipeline([('vect', CountVectorizer(tokenizer=custom_tokenizer)),
                     ('tfidf', TfidfTransformer(use_idf=False)),
                     ('clf', SGDClassifier())
                    ])
                     

In [206]:
text_clf = text_clf.fit(dataset.data, dataset.target)

Evaluation of the performance

In [207]:
# without idf
import numpy as np
docs_test = X_test
predicted = text_clf.predict(test_dataset.data)
np.mean(predicted == test_dataset.target)

0.84311999999999998

In [204]:
# Only words
import numpy as np
docs_test = X_test
predicted = text_clf.predict(test_dataset.data)
np.mean(predicted == test_dataset.target)

0.88551999999999997

In [197]:
# Without PoS tagging
import numpy as np
docs_test = X_test
predicted = text_clf.predict(test_dataset.data)
np.mean(predicted == test_dataset.target)

0.88312000000000002

In [191]:
# SVM with all in
import numpy as np
docs_test = X_test
predicted = text_clf.predict(test_dataset.data)
np.mean(predicted == test_dataset.target)

0.84823999999999999

In [152]:
X_test[0]

b'I really wanted to like The Pillow Book. Intriguing story, interesting character outlines, Ewan Macgregor in the utterly glorious altogether. Unfortunately, I hated every minute of it. Greenaway got so enamoured with presenting the movie uniquely, and not to the film\'s benefit. I won\'t even get into Vivian Wu\'s abysmal acting.<br /><br />You get distracted from the story with 4 billion teeny windows and calligraphy that rolls on the bottom of the screen displaying the lyrics of the music that\'s playing. It seems he lost sight of presenting the actual story and developing the plot, and got entangled with foo-foo embellishments that have nothing to do with anything. It\'s a bit like presenting a John Singer Sargeant portrait in a chintzy Hallmark frame that says "GRANDMA LOVES ME!" in big sparkly letters.<br /><br />This movie seems to be a casualty of the director auteur\'s ego instead of what it could have been - disturbingly and horrifyingly beautiful. In another director\'s han

# SVM

Vamos a probar nuestros diferentes modelos.

Lo interesante de 

In [236]:
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

text_clf_svm = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf', SGDClassifier()),
                        ])

parameters = {
    'vect__tokenizer': (word_tokenize, text_stems_tok, pos_tok, pos_stems_tok, adj_tok, adj_stems_tok),
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-3, 1e-4),
}

gs_clf = GridSearchCV(text_clf_svm, parameters, n_jobs=-1)
gs_clf.fit(X_train, y_train)
predicted = gs_clf.predict(test_dataset.data)
np.mean(predicted == test_dataset.target)

0.88207999999999998

In [240]:
import pandas as pd

In [241]:
df = pd.DataFrame(gs_clf.cv_results_)

In [242]:
df

Unnamed: 0,mean_fit_time,mean_score_time,mean_test_score,mean_train_score,param_clf__alpha,param_tfidf__use_idf,param_vect__tokenizer,params,rank_test_score,split0_test_score,split0_train_score,split1_test_score,split1_train_score,split2_test_score,split2_train_score,std_fit_time,std_score_time,std_test_score,std_train_score
0,42.809086,20.595768,0.83835,0.8619,0.001,True,<function word_tokenize at 0x7fd443731e18>,"{'clf__alpha': 0.001, 'vect__tokenizer': <func...",8,0.835158,0.861397,0.835608,0.861697,0.844284,0.862607,0.676449,0.193819,0.0042,0.000515
1,92.014162,44.660706,0.8456,0.864775,0.001,True,<function text_stems_tok at 0x7fd434280bf8>,"{'clf__alpha': 0.001, 'vect__tokenizer': <func...",5,0.838908,0.867622,0.846108,0.867247,0.851785,0.859457,6.561049,0.289353,0.005269,0.003764
2,291.978515,156.644288,0.8299,0.853275,0.001,True,<function pos_tok at 0x7fd434280c80>,"{'clf__alpha': 0.001, 'vect__tokenizer': <func...",11,0.832758,0.859446,0.826459,0.854571,0.830483,0.845808,8.0233,4.274308,0.002605,0.005643
3,361.926162,183.052437,0.8289,0.8535,0.001,True,<function pos_stems_tok at 0x7fd434280ea0>,"{'clf__alpha': 0.001, 'vect__tokenizer': <func...",12,0.826909,0.858171,0.828409,0.855921,0.831383,0.846408,5.289329,11.005596,0.001859,0.005099
4,299.92471,149.12938,0.82605,0.848325,0.001,True,<function adj_tok at 0x7fd434280d90>,"{'clf__alpha': 0.001, 'vect__tokenizer': <func...",14,0.825109,0.848796,0.823309,0.849696,0.829733,0.846483,7.587796,4.65508,0.002706,0.001354
5,373.022203,182.126453,0.8277,0.850125,0.001,True,<function adj_stems_tok at 0x7fd434280f28>,"{'clf__alpha': 0.001, 'vect__tokenizer': <func...",13,0.825109,0.852621,0.825709,0.850296,0.832283,0.847458,2.828221,4.024111,0.00325,0.002112
6,32.689242,16.357067,0.76085,0.768325,0.001,False,<function word_tokenize at 0x7fd443731e18>,"{'clf__alpha': 0.001, 'vect__tokenizer': <func...",22,0.767812,0.772669,0.754762,0.774019,0.759976,0.758287,0.748353,0.439473,0.005363,0.007119
7,87.496522,44.40005,0.7713,0.777725,0.001,False,<function text_stems_tok at 0x7fd434280bf8>,"{'clf__alpha': 0.001, 'vect__tokenizer': <func...",21,0.779061,0.78482,0.763312,0.773494,0.771527,0.774861,1.278681,1.15098,0.006432,0.005048
8,284.839681,141.716291,0.7543,0.763375,0.001,False,<function pos_tok at 0x7fd434280c80>,"{'clf__alpha': 0.001, 'vect__tokenizer': <func...",24,0.761962,0.769519,0.749063,0.768469,0.751875,0.752137,3.571355,1.26366,0.005538,0.007958
9,356.889944,181.822442,0.7589,0.768225,0.001,False,<function pos_stems_tok at 0x7fd434280ea0>,"{'clf__alpha': 0.001, 'vect__tokenizer': <func...",23,0.763462,0.769894,0.749363,0.770119,0.763876,0.764662,13.240786,2.247186,0.006746,0.002521


# X-validation

In [189]:
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

text_clf_svm = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf', SGDClassifier()),
                        ])

parameters = {
    #'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
}

gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1, cv=5)
gs_clf = gs_clf.fit(X_train, y_train)




In [190]:
predicted = gs_clf.predict(X_test)
np.mean(predicted == y_test)

0.97175866495507057

In [191]:
gs_clf.best_score_, gs_clf.best_params_

(0.95921644187540145, {'clf__alpha': 0.01})

# Results

In [153]:
clf.predict(X_test)



TypeError: Cannot cast array data from dtype('float64') to dtype('S32') according to the rule 'safe'

In [192]:
from sklearn import metrics
cr = metrics.classification_report(
    y_test, predicted,
    target_names=dataset.target_names
)
print(cr)

                precision    recall  f1-score   support

   exploration       0.96      0.97      0.96       122
   headhunters       0.99      0.98      0.98       135
  intelligence       0.92      0.99      0.95        72
     logistics       1.00      0.98      0.99       102
      politics       0.98      1.00      0.99       128
transportation       0.99      0.93      0.96       121
       weapons       0.95      0.96      0.95        99

   avg / total       0.97      0.97      0.97       779

