## 2. Analysis of movie reviews

### a) Building the DataFrame

In [62]:
import urllib
import pandas as pd
train_data_url = "http://www.inf.utfsm.cl/~jnancu/stanford-subset/polarity.train"
test_data_url = "http://www.inf.utfsm.cl/~jnancu/stanford-subset/polarity.dev"
# Download both files to the working directory (Python 2 urllib API)
train_data_f = urllib.urlretrieve(train_data_url, "train_data.csv")
test_data_f = urllib.urlretrieve(test_data_url, "test_data.csv")
# Each line is split on the first space only: label first, review text second
with open("train_data.csv", "r") as ftr:
    rows = [line.split(" ",1) for line in ftr.readlines()]
train_df = pd.DataFrame(rows, columns=['Sentiment','Text'])
train_df['Sentiment'] = pd.to_numeric(train_df['Sentiment'])
with open("test_data.csv", "r") as fts:
    rows = [line.split(" ",1) for line in fts.readlines()]
test_df = pd.DataFrame(rows, columns=['Sentiment','Text'])
test_df['Sentiment'] = pd.to_numeric(test_df['Sentiment'])
print train_df.shape
print test_df.shape

(3554, 2)
(3554, 2)


Both the training set and the test set contain 3554 records, each with two columns: "Sentiment" (the numeric polarity label) and "Text" (the review text).
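
The parsing step can be tried without downloading anything. The sketch below assumes each line has the layout `<label> <text>` with labels `+1` and `-1`; the two sample lines are invented for illustration and are not taken from the dataset.

```python
# Hypothetical sample lines in the assumed "<label> <text>" layout
sample_lines = [
    "+1 a charming and often affecting journey .\n",
    "-1 a dull and lifeless film .\n",
]

def parse_line(line):
    # Split on the first space only, so the review keeps its own spaces
    label, text = line.split(" ", 1)
    # Strip the trailing newline that readlines() leaves on each line
    return int(label), text.rstrip("\n")

rows = [parse_line(line) for line in sample_lines]
for sentiment, text in rows:
    print("%d | %s" % (sentiment, text))
```

Unlike this sketch, the notebook cell keeps the trailing newline in the Text column, which is harmless for the tokenizers used later.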

### b) The word_extractor function

In [63]:
import re, time
from nltk.corpus import stopwords
from nltk import WordNetLemmatizer, word_tokenize
from nltk.stem.porter import PorterStemmer
def word_extractor(text, stemming=True):
    commonwords = stopwords.words('english')
    # Collapse runs of a repeated lowercase letter to two occurrences
    text = re.sub(r'([a-z])\1+', r'\1\1',text)
    wordtokens = [ word.lower() for word in word_tokenize(text.decode('utf-8', 'ignore')) ]
    if stemming:
        # Reduce each token to its Porter stem before filtering stopwords
        wordstemmer = PorterStemmer()
        wordtokens = [ wordstemmer.stem(word) for word in wordtokens ]
    words = ""
    for word in wordtokens:
        if word not in commonwords:
            words+=" "+word
    return words
        
#With stemming
print word_extractor("I love to eat cake")
print word_extractor("I love eating cake")
print word_extractor("I loved eating the cake")
print word_extractor("I do not love eating cake")
print word_extractor("I don't love eating cake")

 love eat cake
 love eat cake
 love eat cake
 love eat cake
 n't love eat cake


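The function extracts the content words of a sentence. With stemming enabled, the Porter stemmer reduces inflected forms to a common stem, so "eating" becomes "eat" and "loved" becomes "love". This shrinks the vocabulary, because several surface forms collapse into a single token. Note also that "do" and "not" are in the stopword list, so the sentence "I do not love eating cake" produces the same output as the affirmative sentences; only the "n't" token of the last example keeps a trace of the negation.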

In [64]:
#Without stemming
print word_extractor("I love to eat cake", stemming=False)
print word_extractor("I love eating cake", stemming=False)
print word_extractor("I loved eating the cake", stemming=False)
print word_extractor("I do not love eating cake", stemming=False)
print word_extractor("I don't love eating cake", stemming=False)

 love eat cake
 love eating cake
 loved eating cake
 love eating cake
 n't love eating cake


Here word_extractor was run without stemming, and each word is kept in its original form ("eating", "loved"), so the vocabulary is larger than with stemming. Whether the larger vocabulary gives better classification results cannot be concluded from these examples alone; it has to be checked against the classifier results in the later sections.
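
Both runs above drop "do" and "not", so the negated and affirmative sentences become identical. The sketch below reproduces that effect and the repeated-letter rule in isolation. It is a simplification: it tokenizes on whitespace instead of using `word_tokenize`, and `TOY_STOPWORDS` is a small hypothetical list, not NLTK's.

```python
import re

# Hypothetical, tiny stopword list for illustration only
TOY_STOPWORDS = {"i", "do", "not", "to", "the"}

def collapse_repeats(text):
    # Same pattern as word_extractor: a run of one lowercase letter shrinks to two
    return re.sub(r'([a-z])\1+', r'\1\1', text)

def content_words(text):
    tokens = collapse_repeats(text.lower()).split()
    return [t for t in tokens if t not in TOY_STOPWORDS]

affirmative = content_words("I love eating cake")
negated = content_words("I do not love eating cake")
print(affirmative)
print(negated)
print(collapse_repeats("sooooo goooood"))
```

If negation matters for the classifier, one option is to keep "not" out of the stopword list; whether that helps on this dataset is not tested here.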

### c) The word_extractor2 function

In [65]:
#The stopWords flag controls whether English stopwords are removed after lemmatization.
def word_extractor2(text, stopWords=True):
    wordlemmatizer = WordNetLemmatizer()
    if stopWords is True:
        commonwords = stopwords.words('english')
    text = re.sub(r'([a-z])\1+', r'\1\1',text)#substitute multiple letter by two
    words = ""
    wordtokens = [ wordlemmatizer.lemmatize(word.lower()) \
                  for word in word_tokenize(text.decode('utf-8','ignore')) ]
    for word in wordtokens:
        if stopWords is True:
            if word not in commonwords:
                words+=" "+word
        else:
            words+=" "+word
    return words
print word_extractor2("I love to eat cake")
print word_extractor2("I love eating cake")
print word_extractor2("I loved eating the cake")
print word_extractor2("I do not love eating cake")
print word_extractor2("I don't love eating cake")

 love eat cake
 love eating cake
 loved eating cake
 love eating cake
 n't love eating cake


Unlike word_extractor with stemming, which cuts words down to their stems, word_extractor2 uses lemmatization. Because WordNetLemmatizer.lemmatize is called without a part-of-speech tag, it treats every token as a noun, so verb forms such as "eating" and "loved" are returned unchanged. The resulting vocabulary is therefore larger than with stemming.

### d) CountVectorizer

In [66]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
texts_train = [word_extractor2(text) for text in train_df.Text]
texts_test = [word_extractor2(text) for text in test_df.Text]
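# Caution: binary='False' is a non-empty string and therefore truthy,
# so each term is counted at most once per document
vectorizer = CountVectorizer(ngram_range=(1, 1), binary='False')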
vectorizer.fit(np.asarray(texts_train))
features_train = vectorizer.transform(texts_train)
features_test = vectorizer.transform(texts_test)
labels_train = np.asarray((train_df.Sentiment.astype(float)+1)/2.0)
labels_test = np.asarray((test_df.Sentiment.astype(float)+1)/2.0)
vocab = vectorizer.get_feature_names()

#Training set
tags_train = []
dist=list(np.array(features_train.sum(axis=0)).reshape(-1,))
for tag, count in zip(vocab, dist):
    tags_train.append((count, tag))
    #print count, tag

In [67]:
#Test set
tags_test = []
dist=list(np.array(features_test.sum(axis=0)).reshape(-1,))
for tag, count in zip(vocab, dist):
    tags_test.append((count, tag))
    #print count, tag

The CountVectorizer code counts how often each vocabulary term (word or number) occurs in the training and test sets. Because binary is set to the non-empty string 'False', which Python treats as true, each document contributes at most 1 per term, so the totals are the number of documents that contain the term. The most frequent terms in each set are:

In [68]:
#Top 5 most frequent words in the training set
tags_train.sort()
tags_train[:] = tags_train[::-1]

print "Most frequent words in the training set:\n"
for i in range(0,5):
    print str(i+1)+") %s (%d)"%(tags_train[i][1], tags_train[i][0])


Most frequent words in the training set:

1) film (566)
2) movie (481)
3) one (246)
4) like (245)
5) ha (224)


In [69]:
#Top 5 most frequent words in the test set
tags_test.sort()
tags_test[:] = tags_test[::-1]

print "Most frequent words in the test set:\n"
for i in range(0,5):
    print str(i+1)+") %s (%d)"%(tags_test[i][1], tags_test[i][0])


Most frequent words in the test set:

1) film (558)
2) movie (540)
3) one (250)
4) ha (238)
5) like (230)
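
The cells above pass `binary='False'` to CountVectorizer. A minimal sketch of what that implies, using only the standard library and a hypothetical three-document corpus: a non-empty string is truthy, and binary counting yields document frequencies rather than total occurrences.

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized by whitespace
docs = ["film film good", "film bad", "good movie"]

term_counts = Counter()  # total occurrences of each term
doc_counts = Counter()   # number of documents containing each term
for doc in docs:
    tokens = doc.split()
    term_counts.update(tokens)
    doc_counts.update(set(tokens))

print(term_counts["film"], doc_counts["film"])
# Any non-empty string is truthy, including the string 'False'
print(bool('False'))
```

Under this reading, a figure such as "film (566)" is the number of training reviews that mention "film", not the number of times the word appears.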


### e) Classifier performance

In [70]:
from sklearn.metrics import classification_report
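def score_the_model(model,x,y,xt,yt,text):
    acc_tr = model.score(x,y)
    # Note: xt[:-1] leaves out the last test sample, so this accuracy
    # is computed on one row fewer than the report below
    acc_test = model.score(xt[:-1],yt[:-1])
    print "Training Accuracy %s: %f"%(text,acc_tr)
    print "Test Accuracy %s: %f"%(text,acc_test)
    print "Detailed Analysis Testing Results ..."
    # Labels are 0 (Sentiment -1) and 1 (Sentiment +1); classification_report lists them
    # in sorted order, so with target_names=['+','-'] the row printed as '+' is label 0
    print(classification_report(yt, model.predict(xt), target_names=['+','-']))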

The classification_report function computes the following metrics for each class: precision, recall and F1-score, as shown in the outputs generated with the Bernoulli and Multinomial Naive Bayes classifiers.

- Precision is the number of correctly predicted positives divided by the total number of samples predicted as positive.
- Recall is the number of correctly predicted positives divided by the number of samples that are actually positive.
- The F1-score is the harmonic mean of precision and recall. The best score is 1 and the worst is 0. It is computed as follows:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
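
As a worked illustration of the three definitions, the sketch below computes them from raw counts. The function name and the counts (30 true positives, 10 false positives, 20 false negatives) are hypothetical and unrelated to the results in this report.

```python
def precision_recall_f1(tp, fp, fn):
    # tp: predicted positive and actually positive
    # fp: predicted positive but actually negative
    # fn: predicted negative but actually positive
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f = precision_recall_f1(tp=30, fp=10, fn=20)
print("precision=%.2f recall=%.2f f1=%.2f" % (p, r, f))
```

With these counts precision is 30/40, recall is 30/50, and the F1-score falls between the two, closer to the lower value, which is the behaviour expected of a harmonic mean.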

### f) Naive Bayes Classifier (Binary)

In [71]:
from sklearn.naive_bayes import BernoulliNB
import random
def do_NAIVE_BAYES(x,y,xt,yt):
    model = BernoulliNB()
    model = model.fit(x, y)
    score_the_model(model,x,y,xt,yt,"BernoulliNB")
    return model

#Lemmatizer with stopword removal
texts_train = [word_extractor2(text) for text in train_df.Text]
texts_test = [word_extractor2(text) for text in test_df.Text]
vectorizer = CountVectorizer(ngram_range=(1, 1), binary='False')
vectorizer.fit(np.asarray(texts_train))
features_train = vectorizer.transform(texts_train)
features_test = vectorizer.transform(texts_test)
labels_train = np.asarray((train_df.Sentiment.astype(float)+1)/2.0)
labels_test = np.asarray((test_df.Sentiment.astype(float)+1)/2.0)
vocab = vectorizer.get_feature_names()

model=do_NAIVE_BAYES(features_train,labels_train,features_test,labels_test)
test_pred = model.predict_proba(features_test)
spl = random.sample(xrange(len(test_pred)), 15)

for text, sentiment in zip(test_df.Text[spl], test_pred[spl]):
    print sentiment, text

Training Accuracy BernoulliNB: 0.958638
Test Accuracy BernoulliNB: 0.738531
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.75      0.73      0.74      1803
          -       0.73      0.75      0.74      1751

avg / total       0.74      0.74      0.74      3554

[ 0.01685004  0.98314996] god help the poor woman if attal is this insecure in real life : his fictional yvan's neuroses are aggravating enough to exhaust the patience of even the most understanding spouse .

[ 0.74278439  0.25721561] those 24-and-unders looking for their own caddyshack to adopt as a generational signpost may have to keep on looking .

[ 0.74010191  0.25989809] melodrama with a message .

[ 0.99700411  0.00299589] there is an almost poignant dimension to the way that every major stunt seagal's character . . . performs is shot from behind , as if it could fool us into thinking that we're not watching a double .

[ 0.38605591  0.61394409] this may 

In [72]:
#Lemmatizer without stopword removal (stopwords are kept)
texts_train_stopwords = [word_extractor2(text, stopWords = False) for text in train_df.Text]
texts_test_stopwords = [word_extractor2(text, stopWords = False) for text in test_df.Text]
vectorizer = CountVectorizer(ngram_range=(1, 1), binary='False')
vectorizer.fit(np.asarray(texts_train_stopwords))
features_train = vectorizer.transform(texts_train_stopwords)
features_test = vectorizer.transform(texts_test_stopwords)
labels_train = np.asarray((train_df.Sentiment.astype(float)+1)/2.0)
labels_test = np.asarray((test_df.Sentiment.astype(float)+1)/2.0)
vocab = vectorizer.get_feature_names()

model=do_NAIVE_BAYES(features_train,labels_train,features_test,labels_test)
test_pred = model.predict_proba(features_test)
spl = random.sample(xrange(len(test_pred)), 15)

for text, sentiment in zip(test_df.Text[spl], test_pred[spl]):
    print sentiment, text

Training Accuracy BernoulliNB: 0.955262
Test Accuracy BernoulliNB: 0.748663
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.76      0.74      0.75      1803
          -       0.74      0.76      0.75      1751

avg / total       0.75      0.75      0.75      3554

[ 0.93974526  0.06025474] the sweetest thing leaves a bitter taste .

[ 0.17674497  0.82325503] more a load of enjoyable , conan-esque claptrap than the punishing , special-effects soul assaults the mummy pictures represent .

[ 0.88266112  0.11733888] instead of hiding pinocchio from critics , miramax should have hidden it from everyone .

[ 0.03847473  0.96152527] what might have been a predictably heartwarming tale is suffused with complexity .

[ 0.05373896  0.94626104] " spider-man is better than any summer blockbuster we had to endure last summer , and hopefully , sets the tone for a summer of good stuff . if you're a comic fan , you can't miss it . if you'

In [73]:
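#Stemming
texts_train_stemming = [word_extractor(text) for text in train_df.Text]
texts_test_stemming = [word_extractor(text) for text in test_df.Text]
vectorizer = CountVectorizer(ngram_range=(1, 1), binary='False')
# Caution: the vocabulary is fitted on texts_train_stopwords (lemmatized text), not on
# texts_train_stemming, so stems missing from that vocabulary are ignored in the results below
vectorizer.fit(np.asarray(texts_train_stopwords))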
features_train = vectorizer.transform(texts_train_stemming)
features_test = vectorizer.transform(texts_test_stemming)
labels_train = np.asarray((train_df.Sentiment.astype(float)+1)/2.0)
labels_test = np.asarray((test_df.Sentiment.astype(float)+1)/2.0)
vocab = vectorizer.get_feature_names()

model=do_NAIVE_BAYES(features_train,labels_train,features_test,labels_test)
test_pred = model.predict_proba(features_test)
spl = random.sample(xrange(len(test_pred)), 15)

for text, sentiment in zip(test_df.Text[spl], test_pred[spl]):
    print sentiment, text

Training Accuracy BernoulliNB: 0.878728
Test Accuracy BernoulliNB: 0.701098
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.72      0.68      0.70      1803
          -       0.69      0.73      0.71      1751

avg / total       0.70      0.70      0.70      3554

[ 0.59967244  0.40032756] possibly the most irresponsible picture ever released by a major film studio .

[ 0.42926542  0.57073458] [fessenden] is much more into ambiguity and creating mood than he is for on screen thrills

[ 0.19825405  0.80174595] one of those rare films that come by once in a while with flawless amounts of acting , direction , story and pace .

[ 0.48441011  0.51558989] it's not original enough .

[ 0.37252962  0.62747038] a naturally funny film , home movie makes you crave chris smith's next movie .

[ 0.59637639  0.40362361] interesting , but not compelling .

[ 0.15885719  0.84114281] it offers little beyond the momentary joys of pretty and

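With the Bernoulli (binary) Naive Bayes classifier, the lemmatizer that keeps stopwords reaches the highest test accuracy (0.748663), followed by the lemmatizer with stopword removal (0.738531) and stemming (0.701098). The stemming run is not directly comparable, because its vectorizer is fitted on texts_train_stopwords rather than on the stemmed texts.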
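
To make the Bernoulli model more concrete, here is a minimal from-scratch sketch on a hypothetical toy matrix. It uses the usual Laplace-smoothed estimate of the probability that a word is present in a class; it is an illustration under those assumptions, not the scikit-learn implementation used in the cells above.

```python
import math

# Hypothetical binary document-term matrix over the vocabulary [good, bad, movie]
X = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
y = [1, 1, 0, 0]

def fit_bernoulli_nb(X, y, alpha=1.0):
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        n_c = len(rows)
        # Smoothed probability that each word is present in a document of class c
        p = [(sum(col) + alpha) / (n_c + 2.0 * alpha) for col in zip(*rows)]
        model[c] = (math.log(n_c / float(len(y))), p)
    return model

def predict_proba(model, x):
    log_joint = {}
    for c, (log_prior, p) in model.items():
        # Present and absent words both contribute to the likelihood
        log_joint[c] = log_prior + sum(
            math.log(pj) if xj else math.log(1.0 - pj) for xj, pj in zip(x, p))
    top = max(log_joint.values())
    unnorm = {c: math.exp(v - top) for c, v in log_joint.items()}
    total = sum(unnorm.values())
    return {c: v / total for c, v in unnorm.items()}

model = fit_bernoulli_nb(X, y)
proba = predict_proba(model, [1, 0, 1])
print(proba)
```

For the document containing "good" and "movie" but not "bad", the likelihoods are 0.75 · 0.75 · 0.5 for class 1 and 0.25 · 0.25 · 0.5 for class 0, which gives a posterior of 0.9 for class 1. The absence of "bad" counts as evidence, which is the main difference from the multinomial model.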

### g) Multinomial Naive Bayes Classifier

In [74]:
from sklearn.naive_bayes import MultinomialNB
import random
def do_MULTINOMIAL(x,y,xt,yt):
    model = MultinomialNB()
    model = model.fit(x, y)
    score_the_model(model,x,y,xt,yt,"MULTINOMIAL")
    return model

#Lemmatizer with stopword removal
texts_train = [word_extractor2(text) for text in train_df.Text]
texts_test = [word_extractor2(text) for text in test_df.Text]
vectorizer = CountVectorizer(ngram_range=(1, 1), binary='False')
vectorizer.fit(np.asarray(texts_train))
features_train = vectorizer.transform(texts_train)
features_test = vectorizer.transform(texts_test)
labels_train = np.asarray((train_df.Sentiment.astype(float)+1)/2.0)
labels_test = np.asarray((test_df.Sentiment.astype(float)+1)/2.0)
vocab = vectorizer.get_feature_names()

model=do_MULTINOMIAL(features_train,labels_train,features_test,labels_test)
test_pred = model.predict_proba(features_test)
spl = random.sample(xrange(len(test_pred)), 15)
for text, sentiment in zip(test_df.Text[spl], test_pred[spl]):
    print sentiment, text

Training Accuracy MULTINOMIAL: 0.959482
Test Accuracy MULTINOMIAL: 0.740782
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.75      0.73      0.74      1803
          -       0.73      0.75      0.74      1751

avg / total       0.74      0.74      0.74      3554

[ 0.31177543  0.68822457] i still can't relate to stuart : he's a mouse , for cryin' out loud , and all he does is milk it with despondent eyes and whine that nobody treats him human enough .

[ 0.7651984  0.2348016] a muddy psychological thriller rife with miscalculations . it makes me say the obvious : abandon all hope of a good movie ye who enter here .

[ 0.88877297  0.11122703] passable enough for a shoot-out in the o . k . court house of life type of flick . strictly middle of the road .

[ 0.93182264  0.06817736] the fact that the 'best part' of the movie comes from a 60-second homage to one of demme's good films doesn't bode well for the rest of it .

[ 

In [75]:
#Lemmatizer without stopword removal (stopwords are kept)
texts_train_stopwords = [word_extractor2(text, stopWords = False) for text in train_df.Text]
texts_test_stopwords = [word_extractor2(text, stopWords = False) for text in test_df.Text]
vectorizer = CountVectorizer(ngram_range=(1, 1), binary='False')
vectorizer.fit(np.asarray(texts_train_stopwords))
features_train = vectorizer.transform(texts_train_stopwords)
features_test = vectorizer.transform(texts_test_stopwords)
labels_train = np.asarray((train_df.Sentiment.astype(float)+1)/2.0)
labels_test = np.asarray((test_df.Sentiment.astype(float)+1)/2.0)
vocab = vectorizer.get_feature_names()

model=do_MULTINOMIAL(features_train,labels_train,features_test,labels_test)
test_pred = model.predict_proba(features_test)
spl = random.sample(xrange(len(test_pred)), 15)

for text, sentiment in zip(test_df.Text[spl], test_pred[spl]):
    print sentiment, text

Training Accuracy MULTINOMIAL: 0.955543
Test Accuracy MULTINOMIAL: 0.747537
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.75      0.75      0.75      1803
          -       0.74      0.74      0.74      1751

avg / total       0.75      0.75      0.75      3554

[ 0.96573087  0.03426913] this is a movie that starts out like heathers , then becomes bring it on , then becomes unwatchable .

[ 0.97825858  0.02174142] consists of a plot and jokes done too often by people far more talented than ali g

[ 0.96318404  0.03681596] an ambitious , serious film that manages to do virtually everything wrong ; sitting through it is something akin to an act of cinematic penance .

[ 0.03591092  0.96408908] stripped almost entirely of such tools as nudity , profanity and violence , labute does manage to make a few points about modern man and his problematic quest for human connection .

[ 0.46269107  0.53730893] home alone goes hollywoo

In [76]:
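#Stemming
texts_train_stemming = [word_extractor(text) for text in train_df.Text]
texts_test_stemming = [word_extractor(text) for text in test_df.Text]
vectorizer = CountVectorizer(ngram_range=(1, 1), binary='False')
# Caution: the vocabulary is fitted on texts_train_stopwords (lemmatized text), not on
# texts_train_stemming, so stems missing from that vocabulary are ignored in the results below
vectorizer.fit(np.asarray(texts_train_stopwords))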
features_train = vectorizer.transform(texts_train_stemming)
features_test = vectorizer.transform(texts_test_stemming)
labels_train = np.asarray((train_df.Sentiment.astype(float)+1)/2.0)
labels_test = np.asarray((test_df.Sentiment.astype(float)+1)/2.0)
vocab = vectorizer.get_feature_names()

model=do_MULTINOMIAL(features_train,labels_train,features_test,labels_test)
test_pred = model.predict_proba(features_test)
spl = random.sample(xrange(len(test_pred)), 15)

for text, sentiment in zip(test_df.Text[spl], test_pred[spl]):
    print sentiment, text

Training Accuracy MULTINOMIAL: 0.882949
Test Accuracy MULTINOMIAL: 0.705319
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.71      0.70      0.71      1803
          -       0.70      0.71      0.70      1751

avg / total       0.71      0.71      0.71      3554

[ 0.4849595  0.5150405] an edgy thriller that delivers a surprising punch .

[ 0.53475197  0.46524803] a fiercely clever and subtle film , capturing the precarious balance between the extravagant confidence of the exiled aristocracy and the cruel earnestness of the victorious revolutionaries .

[ 0.0311196  0.9688804] the film's strength isn't in its details , but in the larger picture it paints - of a culture in conflict with itself , with the thin veneer of nationalism that covers our deepest , media-soaked fears .

[ 0.10967081  0.89032919] it's the chemistry between the women and the droll scene-stealing wit and wolfish pessimism of anna chancellor that makes

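The following table compares the metrics obtained in each case with the Multinomial Naive Bayes classifier (precision, recall and F1-score are taken from the avg / total row of each report):

| Preprocessing | Training accuracy | Test accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| Lemmatizer, stopwords removed | 0.959482 | 0.740782 | 0.74 | 0.74 | 0.74 |
| Lemmatizer, stopwords kept | 0.955543 | 0.747537 | 0.75 | 0.75 | 0.75 |
| Stemming | 0.882949 | 0.705319 | 0.71 | 0.71 | 0.71 |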

### h) Regularized Logistic Regression

In [77]:
from sklearn.linear_model import LogisticRegression
def do_LOGIT(x,y,xt,yt):
    start_t = time.time()
    Cs = [0.01,0.1,10,100,1000]
    for C in Cs:
        print "Usando C= %f"%C
        model = LogisticRegression(penalty='l2',C=C)
        model = model.fit(x, y)
        score_the_model(model,x,y,xt,yt,"LOGISTIC")

#Lemmatizer with stopword removal
texts_train = [word_extractor2(text) for text in train_df.Text]
texts_test = [word_extractor2(text) for text in test_df.Text]
vectorizer = CountVectorizer(ngram_range=(1, 1), binary='False')
vectorizer.fit(np.asarray(texts_train))
features_train = vectorizer.transform(texts_train)
features_test = vectorizer.transform(texts_test)
labels_train = np.asarray((train_df.Sentiment.astype(float)+1)/2.0)
labels_test = np.asarray((test_df.Sentiment.astype(float)+1)/2.0)
vocab = vectorizer.get_feature_names()
do_LOGIT(features_train,labels_train,features_test,labels_test)

Usando C= 0.010000
Training Accuracy LOGISTIC: 0.784468
Test Accuracy LOGISTIC: 0.678863
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.67      0.73      0.70      1803
          -       0.69      0.63      0.66      1751

avg / total       0.68      0.68      0.68      3554

Usando C= 0.100000
Training Accuracy LOGISTIC: 0.892234
Test Accuracy LOGISTIC: 0.719111
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.72      0.72      0.72      1803
          -       0.72      0.71      0.71      1751

avg / total       0.72      0.72      0.72      3554

Usando C= 10.000000
Training Accuracy LOGISTIC: 1.000000
Test Accuracy LOGISTIC: 0.718548
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.73      0.72      0.72      1803
          -       0.71      0.72      0.72      1751

avg / total       0.72     

In [78]:
#Lemmatizer without stopword removal (stopwords are kept)
texts_train_stopwords = [word_extractor2(text, stopWords = False) for text in train_df.Text]
texts_test_stopwords = [word_extractor2(text, stopWords = False) for text in test_df.Text]
vectorizer = CountVectorizer(ngram_range=(1, 1), binary='False')
vectorizer.fit(np.asarray(texts_train_stopwords))
features_train = vectorizer.transform(texts_train_stopwords)
features_test = vectorizer.transform(texts_test_stopwords)
labels_train = np.asarray((train_df.Sentiment.astype(float)+1)/2.0)
labels_test = np.asarray((test_df.Sentiment.astype(float)+1)/2.0)
vocab = vectorizer.get_feature_names()
do_LOGIT(features_train,labels_train,features_test,labels_test)

Usando C= 0.010000
Training Accuracy LOGISTIC: 0.734102
Test Accuracy LOGISTIC: 0.671827
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.68      0.68      0.68      1803
          -       0.67      0.66      0.67      1751

avg / total       0.67      0.67      0.67      3554

Usando C= 0.100000
Training Accuracy LOGISTIC: 0.879572
Test Accuracy LOGISTIC: 0.718548
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.72      0.72      0.72      1803
          -       0.71      0.72      0.72      1751

avg / total       0.72      0.72      0.72      3554

Usando C= 10.000000
Training Accuracy LOGISTIC: 1.000000
Test Accuracy LOGISTIC: 0.731495
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.74      0.72      0.73      1803
          -       0.72      0.75      0.73      1751

avg / total       0.73     

In [79]:
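#Stemming
texts_train_stemming = [word_extractor(text) for text in train_df.Text]
texts_test_stemming = [word_extractor(text) for text in test_df.Text]
vectorizer = CountVectorizer(ngram_range=(1, 1), binary='False')
# Caution: the vocabulary is fitted on texts_train_stopwords (lemmatized text), not on
# texts_train_stemming, so stems missing from that vocabulary are ignored in the results below
vectorizer.fit(np.asarray(texts_train_stopwords))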
features_train = vectorizer.transform(texts_train_stemming)
features_test = vectorizer.transform(texts_test_stemming)
labels_train = np.asarray((train_df.Sentiment.astype(float)+1)/2.0)
labels_test = np.asarray((test_df.Sentiment.astype(float)+1)/2.0)
vocab = vectorizer.get_feature_names()
do_LOGIT(features_train,labels_train,features_test,labels_test)

Usando C= 0.010000
Training Accuracy LOGISTIC: 0.723129
Test Accuracy LOGISTIC: 0.654095
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.64      0.72      0.68      1803
          -       0.67      0.58      0.62      1751

avg / total       0.66      0.65      0.65      3554

Usando C= 0.100000
Training Accuracy LOGISTIC: 0.814856
Test Accuracy LOGISTIC: 0.689840
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.69      0.71      0.70      1803
          -       0.69      0.67      0.68      1751

avg / total       0.69      0.69      0.69      3554

Usando C= 10.000000
Training Accuracy LOGISTIC: 0.977209
Test Accuracy LOGISTIC: 0.668731
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.67      0.67      0.67      1803
          -       0.66      0.67      0.66      1751

avg / total       0.67     

### i) Linear Support Vector Machine (SVM)

In [80]:
from sklearn.svm import LinearSVC
def do_SVM(x,y,xt,yt):
    Cs = [0.01,0.1,10,100,1000]
    for C in Cs:
        print "El valor de C que se esta probando: %f"%C
        model = LinearSVC(C=C)
        model = model.fit(x, y)
        score_the_model(model,x,y,xt,yt,"SVM")

#Lemmatizer with stopword removal
texts_train = [word_extractor2(text) for text in train_df.Text]
texts_test = [word_extractor2(text) for text in test_df.Text]
vectorizer = CountVectorizer(ngram_range=(1, 1), binary='False')
vectorizer.fit(np.asarray(texts_train))
features_train = vectorizer.transform(texts_train)
features_test = vectorizer.transform(texts_test)
labels_train = np.asarray((train_df.Sentiment.astype(float)+1)/2.0)
labels_test = np.asarray((test_df.Sentiment.astype(float)+1)/2.0)
vocab = vectorizer.get_feature_names()
do_SVM(features_train,labels_train,features_test,labels_test)

El valor de C que se esta probando: 0.010000
Training Accuracy SVM: 0.884637
Test Accuracy SVM: 0.715170
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.72      0.72      0.72      1803
          -       0.71      0.71      0.71      1751

avg / total       0.72      0.72      0.72      3554

El valor de C que se esta probando: 0.100000
Training Accuracy SVM: 0.989589
Test Accuracy SVM: 0.723614
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.73      0.72      0.73      1803
          -       0.72      0.73      0.72      1751

avg / total       0.72      0.72      0.72      3554

El valor de C que se esta probando: 10.000000
Training Accuracy SVM: 1.000000
Test Accuracy SVM: 0.702786
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.71      0.69      0.70      1803
          -       0.69      0.71 

In [81]:
#Lemmatizer without stopword removal (stopwords are kept)
texts_train_stopwords = [word_extractor2(text, stopWords = False) for text in train_df.Text]
texts_test_stopwords = [word_extractor2(text, stopWords = False) for text in test_df.Text]
vectorizer = CountVectorizer(ngram_range=(1, 1), binary='False')
vectorizer.fit(np.asarray(texts_train_stopwords))
features_train = vectorizer.transform(texts_train_stopwords)
features_test = vectorizer.transform(texts_test_stopwords)
labels_train = np.asarray((train_df.Sentiment.astype(float)+1)/2.0)
labels_test = np.asarray((test_df.Sentiment.astype(float)+1)/2.0)
vocab = vectorizer.get_feature_names()
do_SVM(features_train,labels_train,features_test,labels_test)

El valor de C que se esta probando: 0.010000
Training Accuracy SVM: 0.873382
Test Accuracy SVM: 0.719111
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.72      0.72      0.72      1803
          -       0.71      0.72      0.72      1751

avg / total       0.72      0.72      0.72      3554

El valor de C que se esta probando: 0.100000
Training Accuracy SVM: 0.987901
Test Accuracy SVM: 0.738249
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.75      0.73      0.74      1803
          -       0.73      0.75      0.74      1751

avg / total       0.74      0.74      0.74      3554

El valor de C que se esta probando: 10.000000
Training Accuracy SVM: 1.000000
Test Accuracy SVM: 0.713763
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.73      0.69      0.71      1803
          -       0.70      0.74 

In [82]:
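#Stemming
texts_train_stemming = [word_extractor(text) for text in train_df.Text]
texts_test_stemming = [word_extractor(text) for text in test_df.Text]
vectorizer = CountVectorizer(ngram_range=(1, 1), binary='False')
# Caution: the vocabulary is fitted on texts_train_stopwords (lemmatized text), not on
# texts_train_stemming, so stems missing from that vocabulary are ignored in the results below
vectorizer.fit(np.asarray(texts_train_stopwords))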
features_train = vectorizer.transform(texts_train_stemming)
features_test = vectorizer.transform(texts_test_stemming)
labels_train = np.asarray((train_df.Sentiment.astype(float)+1)/2.0)
labels_test = np.asarray((test_df.Sentiment.astype(float)+1)/2.0)
vocab = vectorizer.get_feature_names()
do_SVM(features_train,labels_train,features_test,labels_test)

El valor de C que se esta probando: 0.010000
Training Accuracy SVM: 0.808385
Test Accuracy SVM: 0.688432
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.69      0.71      0.70      1803
          -       0.69      0.67      0.68      1751

avg / total       0.69      0.69      0.69      3554

El valor de C que se esta probando: 0.100000
Training Accuracy SVM: 0.921497
Test Accuracy SVM: 0.698283
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.70      0.70      0.70      1803
          -       0.69      0.70      0.69      1751

avg / total       0.70      0.70      0.70      3554

El valor de C que se esta probando: 10.000000
Training Accuracy SVM: 0.991559
Test Accuracy SVM: 0.650155
Detailed Analysis Testing Results ...
             precision    recall  f1-score   support

          +       0.65      0.66      0.66      1803
          -       0.65      0.64 

### j) Comparing the results of the classification methods