## 2. Analysis of movie reviews

### a) Building the DataFrame

In [34]:
import urllib
import pandas as pd
train_data_url = "http://www.inf.utfsm.cl/~jnancu/stanford-subset/polarity.train"
test_data_url = "http://www.inf.utfsm.cl/~jnancu/stanford-subset/polarity.dev"
train_data_f = urllib.urlretrieve(train_data_url, "train_data.csv")
test_data_f = urllib.urlretrieve(test_data_url, "test_data.csv")
ftr = open("train_data.csv", "r")
fts = open("test_data.csv", "r")
rows = [line.split(" ",1) for line in ftr.readlines()]
train_df = pd.DataFrame(rows, columns=['Sentiment','Text'])
train_df['Sentiment'] = pd.to_numeric(train_df['Sentiment'])
rows = [line.split(" ",1) for line in fts.readlines()]
test_df = pd.DataFrame(rows, columns=['Sentiment','Text'])
test_df['Sentiment'] = pd.to_numeric(test_df['Sentiment'])
print train_df.shape
print test_df.shape

(3554, 2)
(3554, 2)


Both the training set and the test set contain 3554 records. Each DataFrame has two columns: "Sentiment", which holds the numeric class label, and "Text", which holds the review sentence.
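The parsing step used above, splitting each line at the first space only, can be sketched without downloading anything. This is a minimal illustration in Python 3 syntax that uses plain tuples instead of a DataFrame; the sample lines and the "+1"/"-1" label format are assumptions, and unlike the cell above it strips the trailing newline from the text.

```python
# Hypothetical sample in the "<label> <text>" layout assumed for the polarity files.
SAMPLE_LINES = [
    "+1 a charming and often affecting journey .\n",
    "-1 simplistic , silly and tedious .\n",
]

def parse_line(line):
    """Split a raw line into (sentiment, text) at the first space only."""
    label, text = line.split(" ", 1)
    return int(label), text.rstrip("\n")

records = [parse_line(line) for line in SAMPLE_LINES]
for sentiment, text in records:
    print(sentiment, text)
```

Limiting the split to one occurrence keeps the spaces inside the review text intact, which is why the notebook passes `1` as the second argument of `split`.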

### b) The word_extractor function

In [35]:
import re, time
from nltk.corpus import stopwords
from nltk import WordNetLemmatizer, word_tokenize
from nltk.stem.porter import PorterStemmer
def word_extractor(text, stemming=True):
    if stemming is True:
        wordstemmer = PorterStemmer()
        commonwords = stopwords.words('english')
        text = re.sub(r'([a-z])\1+', r'\1\1',text)#substitute multiple letter by two
        words = ""
        wordtokens = [ wordstemmer.stem(word.lower()) for word in word_tokenize(text.decode('utf-8', 'ignore')) ]
        for word in wordtokens:
            if word not in commonwords:
                words+=" "+word
        return words
    else:
        commonwords = stopwords.words('english')
        text = re.sub(r'([a-z])\1+', r'\1\1',text)#substitute multiple letter by two
        words = ""
        wordtokens = [ word.lower() for word in word_tokenize(text.decode('utf-8', 'ignore')) ]
        for word in wordtokens:
            if word not in commonwords:
                words+=" "+word
        return words
        
# With stemming
print word_extractor("I love to eat cake")
print word_extractor("I love eating cake")
print word_extractor("I loved eating the cake")
print word_extractor("I do not love eating cake")
print word_extractor("I don't love eating cake")

 love eat cake
 love eat cake
 love eat cake
 love eat cake
 n't love eat cake


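The function tokenizes a sentence, lowercases and stems each token with the Porter stemmer, and drops English stop words. In the output, inflected forms such as "eating" and "loved" are reduced to the stems "eat" and "love", so the first four sentences map to the same token sequence. Stemming therefore shrinks the vocabulary by merging inflections, not by discarding those words. Stop-word removal also drops "do" and "not", so the negated sentence becomes indistinguishable from the affirmative ones; only the contraction token "n't" survives in the last example.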

In [36]:
# Without stemming
print word_extractor("I love to eat cake", stemming=False)
print word_extractor("I love eating cake", stemming=False)
print word_extractor("I loved eating the cake", stemming=False)
print word_extractor("I do not love eating cake", stemming=False)
print word_extractor("I don't love eating cake", stemming=False)

 love eat cake
 love eating cake
 loved eating cake
 love eating cake
 n't love eating cake


Here word_extractor is run without stemming, so the surface forms are kept ("eating", "loved"). The vocabulary is larger and the first three sentences no longer collapse to the same tokens. Whether this helps classification cannot be decided from these examples alone; it is evaluated with the classifiers in the later sections.
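Both branches of word_extractor start with the same regular-expression step, which shortens runs of a repeated lowercase letter to two. The sketch below isolates that substitution; the helper name `collapse_repeats` and the sample strings are illustrative only.

```python
import re

def collapse_repeats(text):
    """Replace every run of two or more identical lowercase letters with exactly two."""
    return re.sub(r'([a-z])\1+', r'\1\1', text)

print(collapse_repeats("sooo goood"))   # elongated words are shortened
print(collapse_repeats("cool"))         # a legitimate double letter is kept
print(collapse_repeats("LOOOVE"))       # uppercase runs are not matched
```

Because the pattern only covers `[a-z]` and the notebook applies it before lowercasing, elongated words written in capitals pass through unchanged.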

### c) The word_extractor2 function

In [37]:
# The stopWords argument controls whether English stop words are removed after lemmatization.
def word_extractor2(text, stopWords=True):
    wordlemmatizer = WordNetLemmatizer()
    if stopWords is True:
        commonwords = stopwords.words('english')
    text = re.sub(r'([a-z])\1+', r'\1\1',text)#substitute multiple letter by two
    words = ""
    wordtokens = [ wordlemmatizer.lemmatize(word.lower()) \
                  for word in word_tokenize(text.decode('utf-8','ignore')) ]
    for word in wordtokens:
        if stopWords is True:
            if word not in commonwords:
                words+=" "+word
        else:
            words+=" "+word
    return words
print word_extractor2("I love to eat cake")
print word_extractor2("I love eating cake")
print word_extractor2("I loved eating the cake")
print word_extractor2("I do not love eating cake")
print word_extractor2("I don't love eating cake")

 love eat cake
 love eating cake
 loved eating cake
 love eating cake
 n't love eating cake


Unlike word_extractor, which stems each token, word_extractor2 uses the WordNet lemmatizer. Called without a part-of-speech tag, the lemmatizer treats every token as a noun, which is why "eating" and "loved" are left unchanged in the output. The result keeps a larger vocabulary than stemming does.
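The effect of the stopWords switch can be sketched without NLTK. The following is a simplified stand-in, not the notebook's function: it splits on whitespace instead of using word_tokenize, skips lemmatization, and uses a five-word stop list that is only a small assumed subset of NLTK's English list.

```python
# Small stand-in for NLTK's English stop-word list (assumption: these five
# words are part of it; the real list is much longer).
STOP_WORDS = {"i", "to", "the", "do", "not"}

def extract_tokens(text, remove_stop_words=True):
    """Lowercase, split on whitespace, and optionally drop stop words."""
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

affirmative = extract_tokens("I love eating cake")
negated = extract_tokens("I do not love eating cake")
print(affirmative == negated)  # the negation is lost after filtering
print(extract_tokens("I do not love eating cake", remove_stop_words=False))
```

This mirrors what the output above shows for "I do not love eating cake": with stop-word removal the negation disappears, which matters for a sentiment task.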

### d) CountVectorizer

In [38]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
texts_train = [word_extractor2(text) for text in train_df.Text]
texts_test = [word_extractor2(text) for text in test_df.Text]
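# NOTE: binary='False' is a non-empty string, which Python treats as true; where
# scikit-learn does not validate it, features become 0/1 presence indicators.
vectorizer = CountVectorizer(ngram_range=(1, 1), binary='False')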
vectorizer.fit(np.asarray(texts_train))
features_train = vectorizer.transform(texts_train)
features_test = vectorizer.transform(texts_test)
labels_train = np.asarray((train_df.Sentiment.astype(float)+1)/2.0)
labels_test = np.asarray((test_df.Sentiment.astype(float)+1)/2.0)
vocab = vectorizer.get_feature_names()

# Training set
tags_train = []
dist=list(np.array(features_train.sum(axis=0)).reshape(-1,))
for tag, count in zip(vocab, dist):
    tags_train.append((count, tag))
    print count, tag

6 10
4 100
2 101
1 105
2 10th
4 11
2 110
1 11th
1 12
2 13
1 13th
1 14
1 140
1 146
3 15
1 16
1 163
2 170
1 18th
4 19
2 1915
1 1934
1 1938
1 1940s
2 1950
1 1950s
1 1954
1 1955
1 1958
1 1959
2 1960s
1 1967
1 1972
3 1975
1 1978
1 1979
2 1980
1 1991
1 1992
2 1995
3 19th
4 20
2 2000
7 2002
4 20th
2 21st
1 22
1 24
2 25
1 25s
2 30
1 300
2 3000
1 30s
1 37
1 3d
5 40
1 400
1 401
1 40s
1 451
1 48
1 49
4 4ever
3 50
2 51
1 53
1 5ths
2 60s
1 65
1 65th
4 70s
1 71
1 77
1 78
1 79
4 80
1 800
3 80s
2 84
2 85
1 88
1 8th
13 90
1 90s
1 93
1 94
2 95
1 96
1 97
2 99
1 aaliyah
2 abandon
1 abandono
1 abbass
1 abbreviated
1 abc
2 abel
1 abhorrent
1 abiding
11 ability
6 able
3 ably
1 aboul
3 above
1 abrahams
1 abrams
1 abrasive
2 abroad
1 abruptly
2 absolute
7 absolutely
1 absorb
1 absorbed
10 absorbing
1 absorption
5 abstract
2 absurd
3 absurdist
4 absurdity
1 absurdly
1 aburrido
4 abuse
1 abysmally
1 acabamos
1 academic
2 academy
7 accent
1 accentuating
2 accept
1 accepting
10 accessible
2 accident
1 accidental
1

In [39]:
# Test set
tags_test = []
dist=list(np.array(features_test.sum(axis=0)).reshape(-1,))
for tag, count in zip(vocab, dist):
    tags_test.append((count, tag))
    print count, tag

9 10
4 100
2 101
0 105
0 10th
5 11
1 110
0 11th
4 12
6 13
0 13th
0 14
0 140
0 146
6 15
0 16
1 163
1 170
0 18th
4 19
0 1915
1 1934
0 1938
2 1940s
1 1950
0 1950s
0 1954
0 1955
1 1958
0 1959
1 1960s
0 1967
0 1972
1 1975
1 1978
1 1979
0 1980
0 1991
0 1992
1 1995
1 19th
6 20
2 2000
11 2002
1 20th
1 21st
0 22
1 24
1 25
0 25s
1 30
1 300
1 3000
0 30s
0 37
3 3d
3 40
0 400
0 401
0 40s
0 451
1 48
0 49
2 4ever
4 50
4 51
0 53
0 5ths
2 60s
2 65
0 65th
5 70s
0 71
2 77
0 78
0 79
1 80
1 800
0 80s
2 84
0 85
3 88
0 8th
7 90
0 90s
2 93
2 94
2 95
0 96
0 97
0 99
0 aaliyah
5 abandon
0 abandono
1 abbass
0 abbreviated
1 abc
0 abel
1 abhorrent
0 abiding
10 ability
12 able
1 ably
0 aboul
1 above
0 abrahams
0 abrams
0 abrasive
0 abroad
1 abruptly
2 absolute
6 absolutely
0 absorb
2 absorbed
4 absorbing
0 absorption
2 abstract
5 absurd
2 absurdist
6 absurdity
3 absurdly
0 aburrido
3 abuse
0 abysmally
0 acabamos
1 academic
2 academy
2 accent
0 accentuating
1 accept
1 accepting
4 accessible
0 accident
0 accidental
0 

The output lists, for each vocabulary entry, its total over the training or test set: the first column is the count and the second is the word or number. Note that the vectorizer receives binary='False', a non-empty string that Python treats as true; if scikit-learn applies it as a flag, each document contributes at most 1 per term and the totals are the number of documents containing the term. The most frequent words in each set are:

In [40]:
# Top 5 most frequent words in the training set
tags_train.sort()
tags_train[:] = tags_train[::-1]

print "Most frequent words in the training set:\n"
for i in range(0,5):
    print str(i+1)+") %s (%d)"%(tags_train[i][1], tags_train[i][0])


Most frequent words in the training set:

1) film (566)
2) movie (481)
3) one (246)
4) like (245)
5) ha (224)


In [41]:
# Top 5 most frequent words in the test set
tags_test.sort()
tags_test[:] = tags_test[::-1]

print "Most frequent words in the test set:\n"
for i in range(0,5):
    print str(i+1)+") %s (%d)"%(tags_test[i][1], tags_test[i][0])


Most frequent words in the test set:

1) film (558)
2) movie (540)
3) one (250)
4) ha (238)
5) like (230)
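The per-term totals listed in this section can be reproduced in miniature with the standard library. The sketch below is not scikit-learn's implementation; the three documents and the helper `term_totals` are hypothetical, and it only contrasts raw term counts with per-document presence counts (what a binary vectorizer would sum to).

```python
from collections import Counter

# Hypothetical pre-processed documents (already tokenised and space-joined).
DOCS = ["good film good cast", "bad film", "good movie"]

def term_totals(docs, binary):
    """Sum per-document term counts; with binary=True each document adds at most 1."""
    totals = Counter()
    for doc in docs:
        tokens = doc.split()
        totals.update(set(tokens) if binary else tokens)
    return totals

raw = term_totals(DOCS, binary=False)
presence = term_totals(DOCS, binary=True)
print(raw.most_common(2))
print(presence["good"], raw["good"])
```

The two totals differ only for terms that occur more than once in the same document, such as "good" in the first sample document.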


### e) Classifier performance

In [42]:
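from sklearn.metrics import classification_report
def score_the_model(model,x,y,xt,yt,text):
    # Accuracy on the training set and on the test set
    acc_tr = model.score(x,y)
    # NOTE: the slices drop the last test sample, so this accuracy is computed
    # on one row fewer than the classification report below.
    acc_test = model.score(xt[:-1],yt[:-1])
    print "Training Accuracy %s: %f"%(text,acc_tr)
    print "Test Accuracy %s: %f"%(text,acc_test)
    print "Detailed Analysis Testing Results ..."
    # NOTE: target_names follow the sorted labels (0.0, then 1.0); with the
    # (Sentiment+1)/2 mapping used here, the first row ('+') is the class
    # encoded as 0.0, i.e. the original -1 label.
    print(classification_report(yt, model.predict(xt), target_names=['+','-']))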

The classification_report method computes precision, recall, and F1-score for each class, as shown in the outputs produced by the Bernoulli and multinomial Naive Bayes classifiers.

- Precision is the number of correct positive predictions divided by the total number of positive predictions.
- Recall is the number of correct positive predictions divided by the number of samples that are actually positive.
- The F1-score is the harmonic mean of precision and recall. The best score is 1 and the worst is 0. It is computed as:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
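The three definitions above can be checked by hand on a small example. This is a minimal sketch in plain Python, independent of scikit-learn; the function name `precision_recall_f1` and the toy labels are illustrative assumptions.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one class from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)  # true positives
    fp = sum(1 for t, p in pairs if p == positive and t != positive)  # false positives
    fn = sum(1 for t, p in pairs if p != positive and t == positive)  # false negatives
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels: 3 positives and 3 negatives, with one error in each direction.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

With two true positives, one false positive and one false negative, precision and recall are both 2/3, and so is their harmonic mean.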

### f) Naive Bayes classifier (binary)

In [43]:
from sklearn.naive_bayes import BernoulliNB
import random
def do_NAIVE_BAYES(x,y,xt,yt):
    model = BernoulliNB()
    model = model.fit(x, y)
    score_the_model(model,x,y,xt,yt,"BernoulliNB")
    return model

# Lemmatizer (stop words removed)
texts_train = [word_extractor2(text) for text in train_df.Text]
texts_test = [word_extractor2(text) for text in test_df.Text]
vectorizer = CountVectorizer(ngram_range=(1, 1), binary='False')
vectorizer.fit(np.asarray(texts_train))
features_train = vectorizer.transform(texts_train)
features_test = vectorizer.transform(texts_test)
labels_train = np.asarray((train_df.Sentiment.astype(float)+1)/2.0)
labels_test = np.asarray((test_df.Sentiment.astype(float)+1)/2.0)
vocab = vectorizer.get_feature_names()

model=do_NAIVE_BAYES(features_train,labels_train,features_test,labels_test)
test_pred = model.predict_proba(features_test)
spl = random.sample(xrange(len(test_pred)), 15)

for text, sentiment in zip(test_df.Text[spl], test_pred[spl]):
    print sentiment, text

Training Accuracy BernoulliNB: 0.958638
Test Accuracy BernoulliNB: 0.738531
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.75      0.73      0.74      1803
          -       0.73      0.75      0.74      1751

avg / total       0.74      0.74      0.74      3554

[ 0.64182793  0.35817207] the diversity of the artists represented , both in terms of style and ethnicity , prevents the proceedings from feeling repetitious , as does the appropriately brief 40-minute running time .

[ 0.65373927  0.34626073] go see it and enjoy .

[ 0.49598993  0.50401007] it gets the details of its time frame right but it completely misses its emotions .

[ 0.28227058  0.71772942] fessenden's narrative is just as much about the ownership and redefinition of myth as it is about a domestic unit finding their way to joy .

[ 0.14416706  0.85583294] a surprisingly 'solid' achievement by director malcolm d . lee and writer john ridley .

[ 0.010679

In [44]:
# Lemmatizer without stop-word removal (stopWords=False keeps the stop words)
texts_train_stopwords = [word_extractor2(text, stopWords = False) for text in train_df.Text]
texts_test_stopwords = [word_extractor2(text, stopWords = False) for text in test_df.Text]
vectorizer = CountVectorizer(ngram_range=(1, 1), binary='False')
vectorizer.fit(np.asarray(texts_train_stopwords))
features_train = vectorizer.transform(texts_train_stopwords)
features_test = vectorizer.transform(texts_test_stopwords)
labels_train = np.asarray((train_df.Sentiment.astype(float)+1)/2.0)
labels_test = np.asarray((test_df.Sentiment.astype(float)+1)/2.0)
vocab = vectorizer.get_feature_names()

model=do_NAIVE_BAYES(features_train,labels_train,features_test,labels_test)
test_pred = model.predict_proba(features_test)
spl = random.sample(xrange(len(test_pred)), 15)

for text, sentiment in zip(test_df.Text[spl], test_pred[spl]):
    print sentiment, text

Training Accuracy BernoulliNB: 0.955262
Test Accuracy BernoulliNB: 0.748663
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.76      0.74      0.75      1803
          -       0.74      0.76      0.75      1751

avg / total       0.75      0.75      0.75      3554

[ 0.08160359  0.91839641] it's an old story , but a lively script , sharp acting and partially animated interludes make just a kiss seem minty fresh .

[ 0.00487904  0.99512096] dogtown & z-boys evokes the blithe rebel fantasy with the kind of insouciance embedded in the sexy demise of james dean .

[ 0.66809853  0.33190147] never again , while nothing special , is pleasant , diverting and modest -- definitely a step in the right direction .

[ 0.50481812  0.49518188] davis has energy , but she doesn't bother to make her heroine's book sound convincing , the gender-war ideas original , or the comic scenes fly .

[ 0.96270248  0.03729752] unfortunately , neither s

In [45]:
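# Stemming
texts_train_stemming = [word_extractor(text) for text in train_df.Text]
texts_test_stemming = [word_extractor(text) for text in test_df.Text]
vectorizer = CountVectorizer(ngram_range=(1, 1), binary='False')
# NOTE: the vocabulary is fitted on texts_train_stopwords (lemmatized, stop words
# kept) rather than on texts_train_stemming, so stems absent from it are ignored.
vectorizer.fit(np.asarray(texts_train_stopwords))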
features_train = vectorizer.transform(texts_train_stemming)
features_test = vectorizer.transform(texts_test_stemming)
labels_train = np.asarray((train_df.Sentiment.astype(float)+1)/2.0)
labels_test = np.asarray((test_df.Sentiment.astype(float)+1)/2.0)
vocab = vectorizer.get_feature_names()

model=do_NAIVE_BAYES(features_train,labels_train,features_test,labels_test)
test_pred = model.predict_proba(features_test)
spl = random.sample(xrange(len(test_pred)), 15)

for text, sentiment in zip(test_df.Text[spl], test_pred[spl]):
    print sentiment, text

Training Accuracy BernoulliNB: 0.878728
Test Accuracy BernoulliNB: 0.701098
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.72      0.68      0.70      1803
          -       0.69      0.73      0.71      1751

avg / total       0.70      0.70      0.70      3554

[ 0.82913844  0.17086156] love may have been in the air onscreen , but i certainly wasn't feeling any of it .

[ 0.66049872  0.33950128] what's most memorable about circuit is that it's shot on digital video , whose tiny camera enables shafer to navigate spaces both large . . . and small . . . with considerable aplomb .

[ 0.03103459  0.96896541] it has the charm of the original american road movies , feasting on the gorgeous , ramshackle landscape of the filmmaker's motherland .

[ 0.65358748  0.34641252] . . . blade ii is more enjoyable than the original .

[ 0.98405629  0.01594371] mckay deflates his piece of puffery with a sour cliche and heavy doses of mean-

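The following comparative table shows the metrics obtained in each case with the binary (Bernoulli) Naive Bayes classifier; the values are taken from the outputs above:

| Preprocessing | Training accuracy | Test accuracy | Precision (avg) | Recall (avg) | F1-score (avg) |
|---|---|---|---|---|---|
| Lemmatization, stop words removed | 0.958638 | 0.738531 | 0.74 | 0.74 | 0.74 |
| Lemmatization, stop words kept | 0.955262 | 0.748663 | 0.75 | 0.75 | 0.75 |
| Stemming, stop words removed | 0.878728 | 0.701098 | 0.70 | 0.70 | 0.70 |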

### g) Multinomial Naive Bayes classifier

In [46]:
from sklearn.naive_bayes import MultinomialNB
import random
def do_MULTINOMIAL(x,y,xt,yt):
    model = MultinomialNB()
    model = model.fit(x, y)
    score_the_model(model,x,y,xt,yt,"MULTINOMIAL")
    return model

# Lemmatizer (stop words removed)
texts_train = [word_extractor2(text) for text in train_df.Text]
texts_test = [word_extractor2(text) for text in test_df.Text]
vectorizer = CountVectorizer(ngram_range=(1, 1), binary='False')
vectorizer.fit(np.asarray(texts_train))
features_train = vectorizer.transform(texts_train)
features_test = vectorizer.transform(texts_test)
labels_train = np.asarray((train_df.Sentiment.astype(float)+1)/2.0)
labels_test = np.asarray((test_df.Sentiment.astype(float)+1)/2.0)
vocab = vectorizer.get_feature_names()

model=do_MULTINOMIAL(features_train,labels_train,features_test,labels_test)
test_pred = model.predict_proba(features_test)
spl = random.sample(xrange(len(test_pred)), 15)
for text, sentiment in zip(test_df.Text[spl], test_pred[spl]):
    print sentiment, text

Training Accuracy MULTINOMIAL: 0.959482
Test Accuracy MULTINOMIAL: 0.740782
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.75      0.73      0.74      1803
          -       0.73      0.75      0.74      1751

avg / total       0.74      0.74      0.74      3554

[ 0.70049651  0.29950349] the bottom line with nemesis is the same as it has been with all the films in the series : fans will undoubtedly enjoy it , and the uncommitted needn't waste their time on it .

[ 0.36799322  0.63200678] one of the best looking and stylish animated movies in quite a while . . .

[ 0.0789181  0.9210819] there's something fishy about a seasonal holiday kids' movie . . . that derives its moment of most convincing emotional gravity from a scene where santa gives gifts to grownups .

[ 0.91131429  0.08868571] an ill-conceived jumble that's not scary , not smart and not engaging .

[ 0.97917188  0.02082812] shouldn't have been allowed to use t

In [47]:
# Lemmatizer without stop-word removal (stopWords=False keeps the stop words)
texts_train_stopwords = [word_extractor2(text, stopWords = False) for text in train_df.Text]
texts_test_stopwords = [word_extractor2(text, stopWords = False) for text in test_df.Text]
vectorizer = CountVectorizer(ngram_range=(1, 1), binary='False')
vectorizer.fit(np.asarray(texts_train_stopwords))
features_train = vectorizer.transform(texts_train_stopwords)
features_test = vectorizer.transform(texts_test_stopwords)
labels_train = np.asarray((train_df.Sentiment.astype(float)+1)/2.0)
labels_test = np.asarray((test_df.Sentiment.astype(float)+1)/2.0)
vocab = vectorizer.get_feature_names()

model=do_MULTINOMIAL(features_train,labels_train,features_test,labels_test)
test_pred = model.predict_proba(features_test)
spl = random.sample(xrange(len(test_pred)), 15)

for text, sentiment in zip(test_df.Text[spl], test_pred[spl]):
    print sentiment, text

Training Accuracy MULTINOMIAL: 0.955543
Test Accuracy MULTINOMIAL: 0.747537
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.75      0.75      0.75      1803
          -       0.74      0.74      0.74      1751

avg / total       0.75      0.75      0.75      3554

[ 0.61025075  0.38974925] these characters become wearisome .

[ 0.24655917  0.75344083] both an admirable reconstruction of terrible events , and a fitting memorial to the dead of that day , and of the thousands thereafter .

[ 0.05144267  0.94855733] a moving and solidly entertaining comedy/drama that should bolster director and co-writer juan josé campanella's reputation in the united states .

[ 0.04705733  0.95294267] solondz creates some effective moments of discomfort for character and viewer alike .

[ 0.20568148  0.79431852] gere gives a good performance in a film that doesn't merit it .

[ 0.1865272  0.8134728] woven together handsomely , recalling sixt

In [48]:
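# Stemming
texts_train_stemming = [word_extractor(text) for text in train_df.Text]
texts_test_stemming = [word_extractor(text) for text in test_df.Text]
vectorizer = CountVectorizer(ngram_range=(1, 1), binary='False')
# NOTE: the vocabulary is fitted on texts_train_stopwords (lemmatized, stop words
# kept) rather than on texts_train_stemming, so stems absent from it are ignored.
vectorizer.fit(np.asarray(texts_train_stopwords))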
features_train = vectorizer.transform(texts_train_stemming)
features_test = vectorizer.transform(texts_test_stemming)
labels_train = np.asarray((train_df.Sentiment.astype(float)+1)/2.0)
labels_test = np.asarray((test_df.Sentiment.astype(float)+1)/2.0)
vocab = vectorizer.get_feature_names()

model=do_MULTINOMIAL(features_train,labels_train,features_test,labels_test)
test_pred = model.predict_proba(features_test)
spl = random.sample(xrange(len(test_pred)), 15)

for text, sentiment in zip(test_df.Text[spl], test_pred[spl]):
    print sentiment, text

Training Accuracy MULTINOMIAL: 0.882949
Test Accuracy MULTINOMIAL: 0.705319
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.71      0.70      0.71      1803
          -       0.70      0.71      0.70      1751

avg / total       0.71      0.71      0.71      3554

[ 0.66134678  0.33865322] excessive , profane , packed with cartoonish violence and comic-strip characters .

[ 0.73692667  0.26307333] a sloppy slapstick throwback to long gone bottom-of-the-bill fare like the ghost and mr . chicken .

[ 0.08222643  0.91777357] waydowntown may not be an important movie , or even a good one , but it provides a nice change of mindless pace in collision with the hot oscar season currently underway .

[ 0.85363522  0.14636478] i can't recommend it . but it's surprisingly harmless .

[ 0.16511424  0.83488576] 'de niro . . . is a veritable source of sincere passion that this hollywood contrivance orbits around . '

[ 0.95424594  0.045

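The following comparative table shows the metrics obtained in each case with the multinomial Naive Bayes classifier; the values are taken from the outputs above:

| Preprocessing | Training accuracy | Test accuracy | Precision (avg) | Recall (avg) | F1-score (avg) |
|---|---|---|---|---|---|
| Lemmatization, stop words removed | 0.959482 | 0.740782 | 0.74 | 0.74 | 0.74 |
| Lemmatization, stop words kept | 0.955543 | 0.747537 | 0.75 | 0.75 | 0.75 |
| Stemming, stop words removed | 0.882949 | 0.705319 | 0.71 | 0.71 | 0.71 |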

### h) Regularized logistic regression

In [53]:
from sklearn.linear_model import LogisticRegression
def do_LOGIT(x,y,xt,yt):
    # C is the inverse of the regularization strength: smaller C, stronger L2 penalty
    Cs = [0.01,0.1,10,100,1000]
    for C in Cs:
        print "Usando C= %f"%C
        model = LogisticRegression(penalty='l2',C=C)
        model = model.fit(x, y)
        score_the_model(model,x,y,xt,yt,"LOGISTIC")

# Lemmatizer (stop words removed)
texts_train = [word_extractor2(text) for text in train_df.Text]
texts_test = [word_extractor2(text) for text in test_df.Text]
vectorizer = CountVectorizer(ngram_range=(1, 1), binary='False')
vectorizer.fit(np.asarray(texts_train))
features_train = vectorizer.transform(texts_train)
features_test = vectorizer.transform(texts_test)
labels_train = np.asarray((train_df.Sentiment.astype(float)+1)/2.0)
labels_test = np.asarray((test_df.Sentiment.astype(float)+1)/2.0)
vocab = vectorizer.get_feature_names()
do_LOGIT(features_train,labels_train,features_test,labels_test)

Usando C= 0.010000
Training Accuracy LOGISTIC: 0.784468
Test Accuracy LOGISTIC: 0.678863
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.67      0.73      0.70      1803
          -       0.69      0.63      0.66      1751

avg / total       0.68      0.68      0.68      3554

Usando C= 0.100000
Training Accuracy LOGISTIC: 0.892234
Test Accuracy LOGISTIC: 0.719111
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.72      0.72      0.72      1803
          -       0.72      0.71      0.71      1751

avg / total       0.72      0.72      0.72      3554

Usando C= 10.000000
Training Accuracy LOGISTIC: 1.000000
Test Accuracy LOGISTIC: 0.718548
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.73      0.72      0.72      1803
          -       0.71      0.72      0.72      1751

avg / total       0.72     

In [54]:
# Lemmatizer without stop-word removal (stopWords=False keeps the stop words)
texts_train_stopwords = [word_extractor2(text, stopWords = False) for text in train_df.Text]
texts_test_stopwords = [word_extractor2(text, stopWords = False) for text in test_df.Text]
vectorizer = CountVectorizer(ngram_range=(1, 1), binary='False')
vectorizer.fit(np.asarray(texts_train_stopwords))
features_train = vectorizer.transform(texts_train_stopwords)
features_test = vectorizer.transform(texts_test_stopwords)
labels_train = np.asarray((train_df.Sentiment.astype(float)+1)/2.0)
labels_test = np.asarray((test_df.Sentiment.astype(float)+1)/2.0)
vocab = vectorizer.get_feature_names()
do_LOGIT(features_train,labels_train,features_test,labels_test)

Usando C= 0.010000
Training Accuracy LOGISTIC: 0.734102
Test Accuracy LOGISTIC: 0.671827
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.68      0.68      0.68      1803
          -       0.67      0.66      0.67      1751

avg / total       0.67      0.67      0.67      3554

Usando C= 0.100000
Training Accuracy LOGISTIC: 0.879572
Test Accuracy LOGISTIC: 0.718548
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.72      0.72      0.72      1803
          -       0.71      0.72      0.72      1751

avg / total       0.72      0.72      0.72      3554

Usando C= 10.000000
Training Accuracy LOGISTIC: 1.000000
Test Accuracy LOGISTIC: 0.731495
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.74      0.72      0.73      1803
          -       0.72      0.75      0.73      1751

avg / total       0.73     

In [55]:
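# Stemming
texts_train_stemming = [word_extractor(text) for text in train_df.Text]
texts_test_stemming = [word_extractor(text) for text in test_df.Text]
vectorizer = CountVectorizer(ngram_range=(1, 1), binary='False')
# NOTE: the vocabulary is fitted on texts_train_stopwords (lemmatized, stop words
# kept) rather than on texts_train_stemming, so stems absent from it are ignored.
vectorizer.fit(np.asarray(texts_train_stopwords))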
features_train = vectorizer.transform(texts_train_stemming)
features_test = vectorizer.transform(texts_test_stemming)
labels_train = np.asarray((train_df.Sentiment.astype(float)+1)/2.0)
labels_test = np.asarray((test_df.Sentiment.astype(float)+1)/2.0)
vocab = vectorizer.get_feature_names()
do_LOGIT(features_train,labels_train,features_test,labels_test)

Usando C= 0.010000
Training Accuracy LOGISTIC: 0.723129
Test Accuracy LOGISTIC: 0.654095
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.64      0.72      0.68      1803
          -       0.67      0.58      0.62      1751

avg / total       0.66      0.65      0.65      3554

Usando C= 0.100000
Training Accuracy LOGISTIC: 0.814856
Test Accuracy LOGISTIC: 0.689840
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.69      0.71      0.70      1803
          -       0.69      0.67      0.68      1751

avg / total       0.69      0.69      0.69      3554

Usando C= 10.000000
Training Accuracy LOGISTIC: 0.977209
Test Accuracy LOGISTIC: 0.668731
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.67      0.67      0.67      1803
          -       0.66      0.67      0.66      1751

avg / total       0.67     

### i) Linear Support Vector Machine (SVM)

In [57]:
from sklearn.svm import LinearSVC
def do_SVM(x,y,xt,yt):
    # C weighs the training error against the margin: smaller C, stronger regularization
    Cs = [0.01,0.1,10,100,1000]
    for C in Cs:
        print "El valor de C que se esta probando: %f"%C
        model = LinearSVC(C=C)
        model = model.fit(x, y)
        score_the_model(model,x,y,xt,yt,"SVM")

# Lemmatizer (stop words removed)
texts_train = [word_extractor2(text) for text in train_df.Text]
texts_test = [word_extractor2(text) for text in test_df.Text]
vectorizer = CountVectorizer(ngram_range=(1, 1), binary='False')
vectorizer.fit(np.asarray(texts_train))
features_train = vectorizer.transform(texts_train)
features_test = vectorizer.transform(texts_test)
labels_train = np.asarray((train_df.Sentiment.astype(float)+1)/2.0)
labels_test = np.asarray((test_df.Sentiment.astype(float)+1)/2.0)
vocab = vectorizer.get_feature_names()
do_SVM(features_train,labels_train,features_test,labels_test)

El valor de C que se esta probando: 0.010000
Training Accuracy SVM: 0.884637
Test Accuracy SVM: 0.715170
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.72      0.72      0.72      1803
          -       0.71      0.71      0.71      1751

avg / total       0.72      0.72      0.72      3554

El valor de C que se esta probando: 0.100000
Training Accuracy SVM: 0.989589
Test Accuracy SVM: 0.723614
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.73      0.72      0.73      1803
          -       0.72      0.73      0.72      1751

avg / total       0.72      0.72      0.72      3554

El valor de C que se esta probando: 10.000000
Training Accuracy SVM: 1.000000
Test Accuracy SVM: 0.702786
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.71      0.69      0.70      1803
          -       0.69      0.71 

In [58]:
# Lemmatizer without stop-word removal (stopWords=False keeps the stop words)
texts_train_stopwords = [word_extractor2(text, stopWords = False) for text in train_df.Text]
texts_test_stopwords = [word_extractor2(text, stopWords = False) for text in test_df.Text]
vectorizer = CountVectorizer(ngram_range=(1, 1), binary='False')
vectorizer.fit(np.asarray(texts_train_stopwords))
features_train = vectorizer.transform(texts_train_stopwords)
features_test = vectorizer.transform(texts_test_stopwords)
labels_train = np.asarray((train_df.Sentiment.astype(float)+1)/2.0)
labels_test = np.asarray((test_df.Sentiment.astype(float)+1)/2.0)
vocab = vectorizer.get_feature_names()
do_SVM(features_train,labels_train,features_test,labels_test)

El valor de C que se esta probando: 0.010000
Training Accuracy SVM: 0.873382
Test Accuracy SVM: 0.719111
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.72      0.72      0.72      1803
          -       0.71      0.72      0.72      1751

avg / total       0.72      0.72      0.72      3554

El valor de C que se esta probando: 0.100000
Training Accuracy SVM: 0.987901
Test Accuracy SVM: 0.738249
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.75      0.73      0.74      1803
          -       0.73      0.75      0.74      1751

avg / total       0.74      0.74      0.74      3554

El valor de C que se esta probando: 10.000000
Training Accuracy SVM: 1.000000
Test Accuracy SVM: 0.713763
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.73      0.69      0.71      1803
          -       0.70      0.74 

In [59]:
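# Stemming
texts_train_stemming = [word_extractor(text) for text in train_df.Text]
texts_test_stemming = [word_extractor(text) for text in test_df.Text]
vectorizer = CountVectorizer(ngram_range=(1, 1), binary='False')
# NOTE: the vocabulary is fitted on texts_train_stopwords (lemmatized, stop words
# kept) rather than on texts_train_stemming, so stems absent from it are ignored.
vectorizer.fit(np.asarray(texts_train_stopwords))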
features_train = vectorizer.transform(texts_train_stemming)
features_test = vectorizer.transform(texts_test_stemming)
labels_train = np.asarray((train_df.Sentiment.astype(float)+1)/2.0)
labels_test = np.asarray((test_df.Sentiment.astype(float)+1)/2.0)
vocab = vectorizer.get_feature_names()
do_SVM(features_train,labels_train,features_test,labels_test)

El valor de C que se esta probando: 0.010000
Training Accuracy SVM: 0.808385
Test Accuracy SVM: 0.688432
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.69      0.71      0.70      1803
          -       0.69      0.67      0.68      1751

avg / total       0.69      0.69      0.69      3554

El valor de C que se esta probando: 0.100000
Training Accuracy SVM: 0.921497
Test Accuracy SVM: 0.698283
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.70      0.70      0.70      1803
          -       0.69      0.70      0.69      1751

avg / total       0.70      0.70      0.70      3554

El valor de C que se esta probando: 10.000000
Training Accuracy SVM: 0.991559
Test Accuracy SVM: 0.650155
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.65      0.66      0.66      1803
          -       0.65      0.64 