## GUIDED PRACTICE: Text Classification and Sentiment Analysis

### Example using the Naive Bayes Classifier

In [1]:
import nltk
nltk.download('subjectivity')

[nltk_data] Downloading package subjectivity to
[nltk_data]     C:\Users\mbeati\AppData\Roaming\nltk_data...
[nltk_data]   Package subjectivity is already up-to-date!


True

In [2]:
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import subjectivity
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import *

We will work with the *subjectivity* corpus.

Let's explore it briefly.

https://www.nltk.org/api/nltk.corpus.reader.html

Each sentence in the subjectivity corpus is labeled as either objective ('obj') or subjective ('subj').

In [3]:
subjectivity.categories()

['obj', 'subj']

Let's look at the first sentence labeled as objective ('obj').

In [4]:
subjectivity.sents(categories='obj')[0]

['the',
 'movie',
 'begins',
 'in',
 'the',
 'past',
 'where',
 'a',
 'young',
 'boy',
 'named',
 'sam',
 'attempts',
 'to',
 'save',
 'celebi',
 'from',
 'a',
 'hunter',
 '.']

Let's look at the 11th sentence labeled as objective ('obj').

In [5]:
subjectivity.sents(categories='obj')[10]

['women', 'craved', 'him', 'and', 'men', 'wanted', 'to', 'be', 'him', '.']

Let's look at the first sentence labeled as subjective ('subj').

In [6]:
subjectivity.sents(categories='subj')[0]

['smart',
 'and',
 'alert',
 ',',
 'thirteen',
 'conversations',
 'about',
 'one',
 'thing',
 'is',
 'a',
 'small',
 'gem',
 '.']

Let's look at the 12th sentence labeled as subjective ('subj').

In [7]:
subjectivity.sents(categories='subj')[11]

['directed',
 'by',
 'david',
 'twohy',
 'with',
 'the',
 'same',
 'great',
 'eye',
 'for',
 'eerie',
 'understatement',
 'that',
 'he',
 'brought',
 'to',
 'pitch',
 'black',
 '.']

In [8]:
# Build a small corpus with 100 objective and 100 subjective sentences

n_instances = 100

# Take the first 100 subjective sentences
subj_docs = [(sent, 'subj') for sent in
             subjectivity.sents(categories='subj')[:n_instances]]

# Take the first 100 objective sentences
obj_docs = [(sent, 'obj') for sent in
            subjectivity.sents(categories='obj')[:n_instances]]

len(subj_docs), len(obj_docs)

(100, 100)

In [9]:
# Each document is represented by a tuple (sentence, label).
# The sentence is tokenized and stored as a list of strings.
subj_docs[0]

(['smart',
  'and',
  'alert',
  ',',
  'thirteen',
  'conversations',
  'about',
  'one',
  'thing',
  'is',
  'a',
  'small',
  'gem',
  '.'],
 'subj')

In [10]:
# Split the data into balanced train and test sets (80/20 sentences per class)

train_subj_docs = subj_docs[:80]
test_subj_docs = subj_docs[80:100]
train_obj_docs = obj_docs[:80]
test_obj_docs = obj_docs[80:100]

training_docs = train_subj_docs + train_obj_docs
testing_docs = test_subj_docs + test_obj_docs

#### SentimentAnalyzer

https://www.nltk.org/api/nltk.sentiment.html#module-nltk.sentiment.sentiment_analyzer



nltk.sentiment.util.mark_negation(document, double_neg_flip=False, shallow=False): 
Append _NEG suffix to words that appear in the scope between a negation and a punctuation mark.

all_words(documents, labeled=None):
Return all words/tokens from the documents (with duplicates).

unigram_word_feats(words, top_n=None, min_freq=0):
Return most common top_n word features.

add_feat_extractor(function, **kwargs): 
Add a new function to extract features from a document. This function will be used in extract_features(). Important: in this step our kwargs are only representing additional parameters, and NOT the document we have to parse. The document will always be the first parameter in the parameter list, and it will be added in the extract_features() function.

nltk.sentiment.util.extract_unigram_feats(document, unigrams, handle_negation=False):
Populate a dictionary of unigram features, reflecting the presence/absence in the document of each of the tokens in unigrams.
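
The behaviour of mark_negation and extract_unigram_feats described above can be pictured with a simplified, pure-Python sketch. This is not NLTK's implementation: NLTK relies on regular expressions that cover many more negation words and punctuation marks. The word lists `NEGATIONS` and `CLAUSE_PUNCT` and the function names ending in `_sketch` are assumptions made only for this illustration.

```python
# Simplified illustration of negation marking and unigram presence features.
NEGATIONS = {"not", "no", "never", "isn't", "don't", "won't"}   # assumed, incomplete list
CLAUSE_PUNCT = {".", ",", ":", ";", "!", "?"}                    # assumed scope delimiters

def mark_negation_sketch(tokens):
    """Append _NEG to tokens between a negation word and the next punctuation mark."""
    marked, in_scope = [], False
    for tok in tokens:
        if tok in CLAUSE_PUNCT:
            in_scope = False          # punctuation closes the negation scope
            marked.append(tok)
        elif in_scope:
            marked.append(tok + "_NEG")
        else:
            marked.append(tok)
            if tok in NEGATIONS:
                in_scope = True       # the tokens that follow are negated
    return marked

def unigram_feats_sketch(tokens, unigrams):
    """Presence/absence in the document of each selected unigram."""
    present = set(tokens)
    return {"contains({})".format(w): w in present for w in unigrams}

tokens = ["the", "plot", "is", "not", "very", "good", ",", "but", "fun", "."]
marked = mark_negation_sketch(tokens)
print(marked)
print(unigram_feats_sketch(marked, ["good", "good_NEG", "fun"]))
```

In this sketch "good" becomes "good_NEG" because it falls between "not" and the comma, so a classifier can treat the negated word as a feature of its own.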

---

N-grams of text are widely used in text mining and natural language processing. An n-gram is a sequence of N words that occur consecutively within a sliding window; when computing n-grams, the window usually advances one word at a time (although in more advanced scenarios it can advance X words at a time).

Example: "The cow jumps over the moon".
With N=2 (bigrams), the n-grams are:

the cow

cow jumps

jumps over

over the

the moon

We get 5 n-grams. Note that we move one word forward to generate the next bigram:

the->cow, then cow->jumps, then jumps->over

With N=3 (trigrams), the n-grams are:

the cow jumps

cow jumps over

jumps over the

over the moon

And we get 4 n-grams.

When N=1, they are called unigrams, which are simply the individual words of the sentence.

When N=2, they are called bigrams.
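
The sliding-window procedure described above can be sketched in a few lines. The helper name `ngrams` is chosen here for illustration (NLTK ships its own n-gram utilities); the sentence is the one from the example.

```python
def ngrams(tokens, n):
    """Return the n-grams obtained by sliding a window of size n one token at a time."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cow jumps over the moon".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(len(bigrams), bigrams)
print(len(trigrams), trigrams)
```

A sentence with 6 tokens yields 6 - N + 1 n-grams, which gives the 5 bigrams and 4 trigrams counted above.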

Goal: build a classifier that predicts whether a sentence is objective or subjective.
The classifier will be the result of training sentim_analyzer with Naive Bayes.

In [11]:
# Mark negated words and collect all training tokens using SentimentAnalyzer

sentim_analyzer = SentimentAnalyzer()

all_words_neg = sentim_analyzer.all_words([mark_negation(doc) for doc in training_docs])

# Nothing is fitted here: unigram_word_feats only counts the tokens
# and keeps the 200 most frequent ones as candidate features
unigram_feats = sentim_analyzer.unigram_word_feats(all_words_neg, top_n=200)

print(unigram_feats)

len(unigram_feats)

['.', 'the', ',', 'a', 'and', 'of', 'to', 'is', 'in', 'with', 'it', 'that', 'his', 'on', 'for', 'an', 'who', 'by', 'he', 'from', 'her', '"', 'film', 'as', 'this', 'movie', 'their', 'but', 'one', 'at', 'about', 'the_NEG', 'a_NEG', 'to_NEG', 'are', "there's", '(', 'story', 'when', 'so', 'be', ',_NEG', ')', 'they', 'you', 'not', 'have', 'like', 'will', 'all', 'into', 'out', 'she', 'what', 'life', 'has', 'its', 'only', 'more', 'even', '--', ':', 'can', ';', 'home', 'look', "it's", 'if', 'where', 'most', 'him', 'search', 'but_NEG', 'love', 'both', 'make', 'begins', 'some', 'two', 'of_NEG', 'made', 'which', 'them', 'just', 'wife', 'much', 'get', 'through', 'time', 'gets', 'it_NEG', 'very', 'i', 'feel', 'really', 'own', 'how', 'other', 'dark', 'lacks', 'then', 'work', 'as_NEG', 'and_NEG', 'young', 'old', '?', 'far', 'come', 'years', 'something', 'called', 'family', 'daughter', 'up', 'take', 'back', 'thing', 'while', 'could', 'been', 'job', 'documentary', 'farm', 'characters', 'script', 'mater

200

In [12]:
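# add_feat_extractor registers a feature-extraction function on the analyzer.
# The function is called later, when features are extracted, with the document as
# first argument plus the keyword arguments given here (unigrams=unigram_feats).
sentim_analyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigram_feats)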

In [13]:
# Apply all feature extractor functions to the documents:
training_set = sentim_analyzer.apply_features(training_docs)
test_set = sentim_analyzer.apply_features(testing_docs)
test_set[:1]

[({'contains(.)': True, 'contains(the)': True, 'contains(,)': False, 'contains(a)': True, 'contains(and)': False, 'contains(of)': True, 'contains(to)': False, 'contains(is)': False, 'contains(in)': False, 'contains(with)': True, 'contains(it)': False, 'contains(that)': False, 'contains(his)': False, 'contains(on)': False, 'contains(for)': True, 'contains(an)': False, 'contains(who)': False, 'contains(by)': False, 'contains(he)': False, 'contains(from)': False, 'contains(her)': False, 'contains(")': False, 'contains(film)': False, 'contains(as)': False, 'contains(this)': False, 'contains(movie)': False, 'contains(their)': False, 'contains(but)': False, 'contains(one)': False, 'contains(at)': False, 'contains(about)': False, 'contains(the_NEG)': False, 'contains(a_NEG)': False, 'contains(to_NEG)': False, 'contains(are)': False, "contains(there's)": False, 'contains(()': False, 'contains(story)': False, 'contains(when)': False, 'contains(so)': False, 'contains(be)': False, 'contains(,_NEG

As a reminder, recall is computed as follows, where tp is the number of true positives and fn the number of false negatives:

Recall: tp / (tp + fn)

In [14]:
# Train the predictive model and check its performance

trainer = NaiveBayesClassifier.train

# train(trainer, training_set, save_classifier=None, **kwargs):
# Train classifier on the training set, optionally saving the output in the file specified by save_classifier. 
# Additional arguments depend on the specific trainer used. 

# Training classifier
classifier = sentim_analyzer.train(trainer, training_set)

# sentim_analyzer.evaluate: Evaluate and print classifier performance on the test set.
for key,value in sorted(sentim_analyzer.evaluate(test_set).items()):
    print('{0}: {1}'.format(key, value))

Training classifier
Evaluating NaiveBayesClassifier results...
Accuracy: 0.8
F-measure [obj]: 0.7894736842105263
F-measure [subj]: 0.8095238095238095
Precision [obj]: 0.8333333333333334
Precision [subj]: 0.7727272727272727
Recall [obj]: 0.75
Recall [subj]: 0.85
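
The metrics printed above can be checked by hand. The confusion counts below are not printed by NLTK; they are inferred from the reported recall values and from the 20 + 20 test sentences, so treat them as an assumption that happens to reproduce every figure in the output.

```python
# Confusion counts inferred from the reported recall values (assumption, not NLTK output)
n_obj, n_subj = 20, 20          # test sentences per class
tp_obj = 15                     # 'obj' sentences predicted 'obj'   (recall 0.75 * 20)
tp_subj = 17                    # 'subj' sentences predicted 'subj' (recall 0.85 * 20)
fn_obj = n_obj - tp_obj         # 'obj' sentences predicted 'subj'
fn_subj = n_subj - tp_subj      # 'subj' sentences predicted 'obj'

def prf(tp, fp, fn):
    """Precision, recall and F-measure from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# In a two-class problem, a false positive for one class is a false negative of the other
p_obj, r_obj, f_obj = prf(tp_obj, fp=fn_subj, fn=fn_obj)
p_subj, r_subj, f_subj = prf(tp_subj, fp=fn_obj, fn=fn_subj)
accuracy = (tp_obj + tp_subj) / (n_obj + n_subj)

print("Accuracy:", accuracy)
print("obj :", p_obj, r_obj, f_obj)
print("subj:", p_subj, r_subj, f_subj)
```

With only 40 test sentences, each misclassified sentence moves accuracy by 2.5 points, so these figures should be read as rough estimates.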


In [15]:
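# Example of classifying a sentence.
# Caution: the analyzer was trained on tokenized documents (lists of words).
# A raw string like this one is iterated character by character when the
# unigram features are extracted, so it should be tokenized first.
sentim_analyzer.classify("i want to ride my bicycle.")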

'subj'

In [16]:
sentim_analyzer.classify("sun is shining")

'obj'
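
One caveat about the two calls above: the unigram feature extractor tests membership in `set(document)`, and the training documents were lists of tokens. The sketch below uses plain Python, with no NLTK calls, to show what a raw string looks like to such a membership test compared with a tokenized sentence. Splitting on whitespace is a simplifying assumption; a proper tokenizer would also separate punctuation.

```python
sentence = "sun is shining"

# A raw string is iterated character by character
as_characters = set(sentence)
# A tokenized document is iterated word by word
as_tokens = set(sentence.split())

print(sorted(as_characters))
print(sorted(as_tokens))
print("is" in as_characters, "is" in as_tokens)
```

With a raw string, only single-character features such as 'a', 'i' or '.' can ever match, so passing `sentence.split()` (or the output of a tokenizer) to classify is closer to how the model was trained.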

### VADER Sentiment Analysis

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. A sentiment lexicon is a list of lexical features (e.g., words) which are generally labelled according to their semantic orientation as either positive or negative.


http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf

https://medium.com/analytics-vidhya/simplifying-social-media-sentiment-analysis-using-vader-in-python-f9e6ec6fc52f

In [17]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sentences = ["VADER is smart, handsome, and funny.", # positive sentence example
             "VADER is smart, handsome, and funny!", # punctuation emphasis handled correctly (sentiment intensity adjusted)
             "VADER is very smart, handsome, and funny.",  # booster words handled correctly (sentiment intensity adjusted)
             "VADER is VERY SMART, handsome, and FUNNY.",  # emphasis for ALLCAPS handled
             "VADER is VERY SMART, handsome, and FUNNY!!!",# combination of signals - VADER appropriately adjusts intensity
             "VADER is VERY SMART, really handsome, and INCREDIBLY FUNNY!!!", # booster words & punctuation make this close to
                                                                              # ceiling for score
             "The book was good.", # positive sentence
             "The book was kind of good.", # qualified positive sentence is handled correctly (intensity adjusted)
             "The plot was good, but the characters are uncompelling and the dialog is not great.", # mixed negation sentence
             "A really bad, horrible book.", # negative sentence with booster words
             "At least it isn't a horrible book.", # negated negative sentence with contraction
             ":) and :D", # emoticons handled
             "", # an empty string is correctly handled
             "Today sux", #  negative slang handled
             "Today sux!", #  negative slang with punctuation emphasis handled
             "Today SUX!", #  negative slang with capitalization emphasis
             "Today kinda sux! But I'll get by, lol"] # mixed sentiment example with slang and contrastive conjunction "but"

In [18]:
paragraph = "It was one of the worst movies I've seen, despite good reviews. \
Unbelievably bad acting!! Poor direction. VERY poor production. \
The movie was bad. Very bad movie. VERY bad movie. VERY BAD movie. VERY BAD movie!"

In [19]:
nltk.download('punkt')
from nltk import tokenize

lines_list = tokenize.sent_tokenize(paragraph)
sentences.extend(lines_list)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mbeati\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [20]:
tricky_sentences = ["Most automated sentiment analysis tools are shit.",
                    "VADER sentiment analysis is the shit.",
                    "Sentiment analysis has never been good.",
                    "Sentiment analysis with VADER has never been this good.",
                    "Warren Beatty has never been so entertaining.",
                    "I won't say that the movie is astounding and I wouldn't claim that the movie is too banal either.",
                    "I like to hate Michael Bay films, but I couldn't fault this one",
                    "It's one thing to watch an Uwe Boll film, but another thing entirely to pay for it",
                    "The movie was too good",
                    "This movie was actually neither that funny, nor super witty.",
                    "This movie doesn't care about cleverness, wit or any other kind of intelligent humor.",
                    "Those who find ugly meanings in beautiful things are corrupt without being charming.",
                    "There are slow and repetitive parts, BUT it has just enough spice to keep it interesting.",
                    "The script is not fantastic, but the acting is decent and the cinematography is EXCELLENT!",
                    "Roger Dodger is one of the most compelling variations on this theme.",
                    "Roger Dodger is one of the least compelling variations on this theme.",
                    "Roger Dodger is at least compelling as a variation on the theme.",
                    "they fall in love with the product",
                    "but then it breaks",
                    "usually around the time the 90 day warranty expires",
                    "the twin towers collapsed today",
                    "However, Mr. Carter solemnly argues, his client carried out the kidnapping \
                     under orders and in the ''least offensive way possible.''"]

sentences.extend(tricky_sentences)

In [21]:
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\mbeati\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [22]:
sid = SentimentIntensityAnalyzer()  # Lexicon- and rule-based analyzer: nothing is trained here

In [23]:
for sentence in sentences:
    print(sentence)
    # polarity_scores returns a dict with the proportions 'neg', 'neu' and 'pos'
    # and a 'compound' score normalized between -1 and 1
    ss = sid.polarity_scores(sentence)
    for k in sorted(ss):
        print('{0}: {1}, '.format(k, ss[k]), end='')
    print()

VADER is smart, handsome, and funny.
compound: 0.8316, neg: 0.0, neu: 0.254, pos: 0.746, 
VADER is smart, handsome, and funny!
compound: 0.8439, neg: 0.0, neu: 0.248, pos: 0.752, 
VADER is very smart, handsome, and funny.
compound: 0.8545, neg: 0.0, neu: 0.299, pos: 0.701, 
VADER is VERY SMART, handsome, and FUNNY.
compound: 0.9227, neg: 0.0, neu: 0.246, pos: 0.754, 
VADER is VERY SMART, handsome, and FUNNY!!!
compound: 0.9342, neg: 0.0, neu: 0.233, pos: 0.767, 
VADER is VERY SMART, really handsome, and INCREDIBLY FUNNY!!!
compound: 0.9469, neg: 0.0, neu: 0.294, pos: 0.706, 
The book was good.
compound: 0.4404, neg: 0.0, neu: 0.508, pos: 0.492, 
The book was kind of good.
compound: 0.3832, neg: 0.0, neu: 0.657, pos: 0.343, 
The plot was good, but the characters are uncompelling and the dialog is not great.
compound: -0.7042, neg: 0.327, neu: 0.579, pos: 0.094, 
A really bad, horrible book.
compound: -0.8211, neg: 0.791, neu: 0.209, pos: 0.0, 
At least it isn't a horrible book.
compound
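
To make the compound score less of a black box: in the VADER reference implementation, the valences of the words (after the rule-based adjustments) are summed, and the sum x is squashed with x / sqrt(x^2 + alpha), using alpha = 15. The sketch below reproduces only that normalization step. The valence 1.9 for "good" is an assumption about the lexicon entry, chosen because it reproduces the 0.4404 printed above for "The book was good."

```python
import math

def normalize(score, alpha=15):
    """Map an unbounded sum of valences into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)

good_valence = 1.9   # assumed lexicon valence for "good"
print(round(normalize(good_valence), 4))   # 0.4404
```

Because of this squashing, adding more positive words pushes the compound score toward 1 without ever reaching it, which is why the boosted "VADER is VERY SMART..." examples approach a ceiling.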

---

### Sentiment Analysis with Sklearn

#### Preparing the data

In [24]:
import pandas as pd
import numpy as np

# Read the data from the csv file
df = pd.read_csv('../Data/Amazon_Unlocked_Mobile.csv')

# Work on a 10% sample of the data to speed up the computations
df = df.sample(frac=0.1, random_state=10)

df.head()

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
394349,Sony XPERIA Z2 D6503 FACTORY UNLOCKED Internat...,,244.95,5,Very good one! Better than Samsung S and iphon...,0.0
34377,Apple iPhone 5c 8GB (Pink) - Verizon Wireless,Apple,194.99,1,"The phone needed a SIM card, would have been n...",1.0
248521,Motorola Droid RAZR MAXX XT912 M Verizon Smart...,Motorola,174.99,5,I was 3 months away from my upgrade and my Str...,3.0
167661,CNPGD [U.S. Office Extended Warranty] Smartwat...,CNPGD,49.99,1,an experience i want to forget,0.0
73287,Apple iPhone 7 Unlocked Phone 256 GB - US Vers...,Apple,922.0,5,GREAT PHONE WORK ACCORDING MY EXPECTATIONS.,1.0


In [25]:
# Drop rows with missing data
df.dropna(inplace=True)

# Remove ratings equal to 3, since they are neutral
# (.copy() avoids pandas' SettingWithCopyWarning when the new column is added)
df = df[df['Rating'] != 3].copy()

# Encode ratings of 4 and 5 as 1 (positive)
# Encode ratings of 1 and 2 as 0 (negative)
df['Positivos'] = np.where(df['Rating'] > 3, 1, 0)
df.head(10)

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes,Positivos
34377,Apple iPhone 5c 8GB (Pink) - Verizon Wireless,Apple,194.99,1,"The phone needed a SIM card, would have been n...",1.0,0
248521,Motorola Droid RAZR MAXX XT912 M Verizon Smart...,Motorola,174.99,5,I was 3 months away from my upgrade and my Str...,3.0,1
167661,CNPGD [U.S. Office Extended Warranty] Smartwat...,CNPGD,49.99,1,an experience i want to forget,0.0,0
73287,Apple iPhone 7 Unlocked Phone 256 GB - US Vers...,Apple,922.0,5,GREAT PHONE WORK ACCORDING MY EXPECTATIONS.,1.0,1
277158,Nokia N8 Unlocked GSM Touch Screen Phone Featu...,Nokia,95.0,5,I fell in love with this phone because it did ...,0.0,1
100311,Blackberry Torch 2 9810 Unlocked Phone with 1....,BlackBerry,77.49,5,I am pleased with this Blackberry phone! The p...,0.0,1
251669,Motorola Moto E (1st Generation) - Black - 4 G...,Motorola,89.99,5,"Great product, best value for money smartphone...",0.0,1
279878,OtterBox 77-29864 Defender Series Hybrid Case ...,OtterBox,9.99,5,I've bought 3 no problems. Fast delivery.,0.0,1
406017,Verizon HTC Rezound 4G Android Smarphone - 8MP...,HTC,74.99,4,Great phone for the price...,0.0,1
302567,"RCA M1 Unlocked Cell Phone, Dual Sim, 5Mp Came...",RCA,159.99,5,My mom is not good with new technoloy but this...,4.0,1


In [26]:
# The classes are imbalanced: the mean of the label is the share of positive reviews

df['Positivos'].mean()

0.7471776686078667

In [27]:
from sklearn.model_selection import train_test_split

# Split the data into training and test sets, keeping the class proportions (stratify)
X_train, X_test, y_train, y_test = train_test_split(df['Reviews'], 
                                                    df['Positivos'], 
                                                    random_state=0,
                                                   stratify=df['Positivos'])

In [28]:
print('First observation in X_train:\n\n', X_train.iloc[0])
print('\n\nX_train shape: ', X_train.shape)

First observation in X_train:

 I guess you get what you pay for. Within 30 days the screen went blank. Now it makes a screeching sound so bad that the person I am talking to can not understand what I am saying


X_train shape:  (23052,)


#### CountVectorizer

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

Converts a collection of text documents into a matrix of token counts.

Returns a sparse representation, an instance of the scipy.sparse.csr_matrix class.

If you do not pass an a-priori dictionary and do not use any kind of feature selection, the number of features equals the size of the vocabulary found when analyzing the data.
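
To see what such a document-term matrix contains, here is a minimal pure-Python sketch on three made-up reviews. It is not scikit-learn's implementation: it only mimics what we understand to be the default behaviour (lowercasing, tokens of two or more word characters, vocabulary sorted alphabetically) and it builds a dense list of lists instead of a sparse matrix.

```python
import re
from collections import Counter

docs = ["Great phone, great price", "Bad phone", "The screen is great"]  # made-up reviews

def tokenize(text):
    # Lowercase and keep tokens made of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct token, sorted, mapped to a column index
all_tokens = sorted({tok for d in docs for tok in tokenize(d)})
vocabulary = {tok: i for i, tok in enumerate(all_tokens)}

# Document-term matrix: one row per document, one column per vocabulary entry
matrix = []
for d in docs:
    counts = Counter(tokenize(d))
    matrix.append([counts.get(tok, 0) for tok in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

The number of columns equals the vocabulary size (7 here), and most entries are zero, which is why the real vectorizer returns a sparse matrix.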


In [29]:
from sklearn.feature_extraction.text import CountVectorizer

# Fit the CountVectorizer to the training data
vect = CountVectorizer().fit(X_train)

In [30]:
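# Only the value of the last expression is displayed by the notebook.
# Note: get_feature_names() was removed in scikit-learn 1.2;
# newer versions provide get_feature_names_out() instead.
len(vect.get_feature_names())
type(vect.vocabulary_)
len(vect.stop_words_)
#vect.vocabulary_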


0

In [31]:
# Return one feature name out of every 2000 using slice notation [start:stop:step]:
# start and stop are left empty, step is 2000
vect.get_feature_names()[::2000]

['00',
 'asks',
 'committing',
 'e973',
 'gos',
 'laughed',
 'onset',
 'realizing',
 'sneak',
 'unbooted']

In [32]:
len(vect.get_feature_names())

19446

In [33]:
# Transform the documents of the training set into a document-term matrix:

X_train_vectorized = vect.transform(X_train)

X_train_vectorized

<23052x19446 sparse matrix of type '<class 'numpy.int64'>'
	with 607398 stored elements in Compressed Sparse Row format>
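
The figures in this output explain why a sparse format is used. A quick calculation with the reported shape and number of stored elements:

```python
n_docs, n_features = 23052, 19446   # shape reported above
stored = 607398                     # stored (non-zero) elements reported above

density = stored / (n_docs * n_features)
avg_terms_per_review = stored / n_docs

print("density: {:.4%}".format(density))
print("distinct terms per review (average): {:.1f}".format(avg_terms_per_review))
```

Roughly 0.14% of the cells are non-zero, about 26 distinct terms per review, so a dense array would spend almost all of its memory on zeros.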

In [34]:
from sklearn.linear_model import LogisticRegression

# Train the model
model = LogisticRegression(solver='liblinear')
model.fit(X_train_vectorized, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [35]:
from sklearn.metrics import roc_auc_score

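# Make predictions on the test set.
# Note: predict() returns hard 0/1 labels, so the value printed below is the AUC
# of a single operating point; scoring with model.predict_proba(...)[:, 1]
# would give the usual ranking-based ROC AUC.
predictions = model.predict(vect.transform(X_test))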

print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.8927592965163733
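
A note on how this number should be read: ROC AUC measures how well a model ranks positives above negatives, so it is normally computed from continuous scores rather than from hard 0/1 predictions. The sketch below uses a small hand-written `auc` function (an illustration, not scikit-learn's code) and made-up scores to show that the same model can obtain different values depending on which of the two is passed.

```python
def auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
probabilities = [0.1, 0.6, 0.4, 0.9]                        # hypothetical model scores
hard_labels = [1 if p >= 0.5 else 0 for p in probabilities]  # thresholded at 0.5

auc_from_scores = auc(y_true, probabilities)
auc_from_labels = auc(y_true, hard_labels)
print(auc_from_scores, auc_from_labels)
```

With hard labels the result reduces to the average of the true positive rate and the true negative rate at one threshold, so the AUC values reported in this notebook are best compared only with each other.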


In [36]:
# Get the feature names as a numpy array
feature_names = np.array(vect.get_feature_names())

# Sort the indices of the model coefficients in ascending order
sorted_coef_index = model.coef_[0].argsort()

# Look at the 10 smallest and the 10 largest coefficients:
print('Smallest coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest coefs:
['junk' 'worst' 'terrible' 'garbage' 'sucks' 'slow' 'defective' 'poor'
 'sucked' 'disappointed']

Largest coefs: 
['excelente' 'excellent' 'excelent' 'love' 'loves' 'perfectly' 'perfect'
 'exactly' 'great' 'amazing']


#### TfIdf

TfidfVectorizer

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

Converts a collection of documents into a matrix of TF-IDF features.

It is equivalent to CountVectorizer followed by TfidfTransformer.

---

Tf-idf is the product of two measures: term frequency and inverse document frequency.

There are several ways to define each of them. For the term frequency tf(t, d), the simplest option is the raw frequency of term t in document d, that is, the number of times t occurs in d. If we denote the raw frequency of t by f(t,d), the simple tf scheme is tf(t, d) = f(t,d).

To avoid a bias toward long documents, the raw frequency can be divided by the frequency of the most frequent term in the document:

$ tf(t,d) = \frac{f(t,d)}{\max\{ f(t',d) : t' \in d \}} $

The inverse document frequency measures whether the term is common or rare across the collection of documents. It is obtained by dividing the total number of documents by the number of documents that contain the term, and taking the logarithm of that quotient:

$ idf(t, D) = \log \frac{|D|}{|\{d \in D : t \in d \}|} $

where

$|D|$: cardinality of D, i.e. the number of documents in the collection.

$|\{d \in D : t \in d \}|$: number of documents in which term t appears. If the term does not occur in the collection, this leads to a division by zero. It is therefore common to adjust the denominator to avoid it.

Then tf-idf is computed as:

$ tfidf(t,d,D) = tf(t,d) \cdot idf(t,D) $
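
The formulas above can be checked on a toy collection. The sketch below follows them literally (normalized term frequency, unsmoothed logarithmic idf) on three made-up tokenized documents. It is not what TfidfVectorizer computes: by default scikit-learn uses a smoothed idf and normalizes each row to unit length, so its numbers will differ.

```python
import math
from collections import Counter

docs = [["great", "phone", "great", "price"],   # made-up tokenized documents
        ["bad", "phone"],
        ["great", "screen"]]

def tf(term, doc):
    """Raw count divided by the count of the most frequent term in the document."""
    counts = Counter(doc)
    return counts[term] / max(counts.values())

def idf(term, docs):
    """log(|D| / number of documents containing the term); fails for unseen terms."""
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ["great", "phone", "price"]:
    print(term, round(tfidf(term, docs[0], docs), 4))
```

In the first document "great" occurs twice but appears in two of the three documents, while "price" occurs once and appears nowhere else, so "price" ends up with the higher tf-idf weight.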

In [37]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit the TfidfVectorizer to the training set, setting min_df=5
# (terms that appear in fewer than 5 documents are ignored)
vect = TfidfVectorizer(min_df=5).fit(X_train)
len(vect.get_feature_names())

5419

In [38]:
X_train_vectorized = vect.transform(X_train)

model = LogisticRegression(solver='liblinear')
model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.8862895553580062


In [39]:
# vect is now the TfidfVectorizer, so the feature names must be rebuilt from it:
# the array built from the earlier CountVectorizer no longer lines up with model.coef_
feature_names = np.array(vect.get_feature_names())

sorted_coef_index = model.coef_[0].argsort()

print('Smallest coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))


In [40]:
# The model cannot tell the following two examples apart: both are predicted as negative

print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))

[0 0]
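
The reason both sentences receive the same prediction can be seen without any model. They contain exactly the same words, only in a different order, so any representation based on single-word counts maps them to the same feature vector. The sketch below uses hypothetical helper functions and a tokenization that mimics the vectorizer defaults as we understand them; it compares the unigrams and the bigrams of the two sentences.

```python
import re
from collections import Counter

def tokens(text):
    # Lowercase and keep tokens made of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def bigrams(toks):
    # Join each pair of consecutive tokens into one feature
    return [" ".join(pair) for pair in zip(toks, toks[1:])]

a = tokens("not an issue, phone is working")
b = tokens("an issue, phone is not working")

same_unigrams = Counter(a) == Counter(b)
only_in_one = sorted(set(bigrams(a)) ^ set(bigrams(b)))

print(same_unigrams)
print(only_in_one)
```

The unigram counts are identical, while bigrams such as "not an" and "not working" occur in only one of the two sentences, which is the information a model needs to separate them.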


#### n-grams

In [41]:
# Fit the CountVectorizer to the training set, setting min_df=5 and
# extracting 1-grams and 2-grams

vect = CountVectorizer(min_df=5, ngram_range=(1,2)).fit(X_train)

X_train_vectorized = vect.transform(X_train)

len(vect.get_feature_names())

28611

In [42]:
model = LogisticRegression(solver = 'liblinear')
model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.9092520677698237


In [43]:
feature_names = np.array(vect.get_feature_names())

sorted_coef_index = model.coef_[0].argsort()

print('Smallest coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest coefs:
['junk' 'no good' 'poor' 'sucks' 'not good' 'defective' 'slow' 'garbage'
 'broken' 'terrible']

Largest coefs: 
['excellent' 'excelente' 'excelent' 'perfect' 'great' 'love' 'no problems'
 'not bad' 'awesome' 'amazing']


In [44]:
print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))

[1 0]


Additional reading
https://www.nltk.org/book/ch02.html