# Sentiment analysis

In [46]:
import pandas as pd

df = pd.read_csv('movie_data.csv')
df.head(5)

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0


In [47]:
df.shape

(50000, 2)

In [48]:
df.sentiment.value_counts()

0    25000
1    25000
Name: sentiment, dtype: int64

In [49]:
df.review[5000]

'This game was terrible. I think they worked too hard on the visuals and didn\'t do much with the gameplay, which is the most important part. I mean, the visuals look incredible, but is the game really "fun"? NO! I mean it\'s like "hey let\'s jump off buildings" and all I\'m doing is holding up and A/X. The game play just isn\'t there, and I don\'t agree with what Ubisoft did, because they had this hot girl (the producer of the game, Jade Raymond), and they were like "OK we\'ve got this hot girl, let\'s pimp her" and if you go to gaming websites, you\'re not gonna see gameplay stuff of Assassin\'s Creed, you\'ll see her face with a microphone and it\'ll be like "We interviewed Jade Raymond about her favorite cookies!" It\'s like man, shut the F*@K UP WHO CARES?! Apparently...a lot of people do, because they bought the game and like it...I mean compare this game with Super Mario Galaxy. A Wii game that really doesn\'t abuse the Wii Remote, but STILL is very innovative and delivers in th

## Creating the feature vectors

We use the fit_transform method of CountVectorizer to build the bag-of-words vocabulary and transform the following example sentences into vectors.

1. The sun is shining
2. The weather is sweet
3. The sun is shining, the weather is sweet, and one and one is two


### The bag-of-words model lets us represent text as numerical feature vectors.

**Algorithm:**

1. Create a vocabulary of unique tokens from the entire set of documents (in this case, reviews).
2. Build a feature vector for each document that contains the count of how often each word appears in that particular document.
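
The two steps above can be sketched in plain Python for three short example sentences. This is only an illustration: the `tokenize` helper and its regular expression (lowercase tokens of two or more word characters) are assumptions made for this sketch, and the vocabulary is simply indexed in alphabetical order.

```python
import re
from collections import Counter

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two',
]

def tokenize(text):
    # Assumed tokenization rule: lowercase, keep tokens of 2+ word characters
    return re.findall(r'\b\w\w+\b', text.lower())

# Step 1: vocabulary of unique tokens, indexed in alphabetical order
unique_tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
vocabulary = {tok: idx for idx, tok in enumerate(unique_tokens)}

# Step 2: one count vector per document, following the vocabulary order
def to_vector(text):
    counts = Counter(tokenize(text))
    return [counts[tok] for tok in vocabulary]

vectors = [to_vector(doc) for doc in docs]
for vec in vectors:
    print(vec)
```

Each position in a vector corresponds to one vocabulary entry, so most positions are zero for short documents, which is why libraries usually store the result as a sparse matrix.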


In [51]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
frases = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(frases)

In [52]:
bag

<3x9 sparse matrix of type '<class 'numpy.int64'>'
	with 17 stored elements in Compressed Sparse Row format>

In [53]:
frases

array(['The sun is shining', 'The weather is sweet',
       'The sun is shining, the weather is sweet, and one and one is two'],
      dtype='<U64')

In [55]:
print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


In [54]:
# The resulting matrix is also known as a unigram (1-gram) model
bag1 = bag.toarray()
print(bag1)  # Shows the frequency of each token in the 3 sentences

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]


In [10]:
print(frases)

['The sun is shining' 'The weather is sweet'
 'The sun is shining, the weather is sweet, and one and one is two']


## tf-idf: term frequency-inverse document frequency

Term frequency-inverse document frequency (tf-idf) is used to downweight words that occur frequently across multiple documents. It is defined as:

$$\text{tf-idf}(t,d)=\text{tf}(t,d)\times \text{idf}(t,d)$$

* It is the product of the term frequency $\text{tf}(t,d)$ and the inverse document frequency $\text{idf}(t,d)$, where ***t*** is a term and ***d*** is a document.

* ***idf(t,d)*** is computed with the following equation:


$$\text{idf}(t,d) = \log\frac{n_d}{1+\text{df}(d, t)}$$

* $n_d$ is the total number of documents.
* ***df(d,t)*** is the number of documents ***d*** that contain the term ***t***.
* The logarithm ensures that low document frequencies are not given too much weight.
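
As a small sketch of the idf definition above, the snippet below evaluates it for a few terms of the three example sentences. The documents are written as hand-built sets of tokens, and the `idf` helper is a name introduced only for this illustration; the natural logarithm is assumed, as in the notebook's own calculations.

```python
import math

# The three example sentences as sets of lowercase tokens (hand-built)
docs = [
    {'the', 'sun', 'is', 'shining'},
    {'the', 'weather', 'is', 'sweet'},
    {'the', 'sun', 'is', 'shining', 'weather', 'sweet', 'and', 'one', 'two'},
]
n_d = len(docs)

def idf(term):
    # df: number of documents that contain the term
    df = sum(term in doc for doc in docs)
    return math.log(n_d / (1 + df))

for term in ('is', 'sun', 'two'):
    print(term, round(idf(term), 3))
```

Rarer terms receive larger values. On such a tiny corpus a term that appears in every document, such as "is", even gets a negative value, because the denominator $1+\text{df}$ exceeds $n_d$.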

### Example:
$$\text{idf}(\text{"is"}, d_3) = \log \frac{3}{1+3} = -0.2877$$


$$\text{tf-idf}(\text{"is"}, d_3)= 3 \times (-0.2877) = -0.863$$

In [11]:
#count.vocabulary_

In [12]:
bag1

array([[0, 1, 0, 1, 1, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 1, 1, 0, 1],
       [2, 3, 2, 1, 1, 1, 2, 1, 1]])

In [56]:
import math
math.log(3/4)

-0.2876820724517809

In [57]:
3* math.log(3/(1+3))

-0.8630462173553427

To transform the count matrix above into tf-idf values, we use the **TfidfTransformer** class from scikit-learn.

In [58]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer()
np.set_printoptions(precision=2)
print(tfidf.fit_transform(bag).toarray())

[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]



The idf and tf-idf equations implemented in scikit-learn are:

$$\text{idf}(t,d) = \log\frac{1 + n_d}{1 + \text{df}(d, t)}$$

$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times (\text{idf}(t,d)+1)$$

By default, scikit-learn also applies L2 normalization to the tf-idf values, which returns a vector of length 1 by dividing an unnormalized feature vector ***v*** by its ***L2 norm***:

$$v_{\text{norm}} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v_{1}^{2} + v_{2}^{2} + \dots + v_{n}^{2}}} = \frac{v}{\big (\sum_{i=1}^{n} v_{i}^{2}\big)^\frac{1}{2}}$$
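
The equations above can be combined in a few lines of plain Python. The sketch below applies them to the third example sentence; the count and document-frequency lists are hand-copied from the bag-of-words matrix shown earlier (vocabulary order: and, is, one, shining, sun, sweet, the, two, weather), and the variable names are chosen only for this illustration.

```python
import math

n_d = 3                                # total number of documents
tf_d3 = [2, 3, 2, 1, 1, 1, 2, 1, 1]    # term counts in the third sentence
df = [1, 3, 1, 2, 2, 2, 3, 1, 2]       # documents containing each term

# Unnormalized tf-idf: tf * (idf + 1), with idf = log((1 + n_d) / (1 + df))
raw = [tf * (math.log((1 + n_d) / (1 + d)) + 1) for tf, d in zip(tf_d3, df)]

# L2 normalization: divide every component by the Euclidean length
l2_norm = math.sqrt(sum(v ** 2 for v in raw))
normalized = [v / l2_norm for v in raw]

print([round(v, 2) for v in raw])
print([round(v, 2) for v in normalized])
```

After normalization the squared components sum to 1, so documents of different lengths become comparable.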

### Computing tf-idf for the term *is*:

In [66]:
import math
tf_is = 3
n_docs = 3
idf_is = math.log((n_docs+1) / (1+3))
tfidf_is = tf_is * (idf_is + 1)
print('tf-idf("is") = %.2f' % tfidf_is)

tf-idf("is") = 3.00


In [63]:
idf_is


0.0

In [67]:
bag2 = np.zeros([3,9])
for i in range(bag1.shape[0]):
    for j in range(bag1.shape[1]):
        tf = bag1[i,j]
        n_docs=bag1.shape[0]
        df= np.count_nonzero(bag1[:,j])
        idf = math.log((n_docs+1) / (1+df))
        tfidf = tf * (idf + 1)
        bag2[i,j]=tfidf

In [68]:
bag2

array([[0.  , 1.  , 0.  , 1.29, 1.29, 0.  , 1.  , 0.  , 0.  ],
       [0.  , 1.  , 0.  , 0.  , 0.  , 1.29, 1.  , 0.  , 1.29],
       [3.39, 3.  , 3.39, 1.29, 1.29, 1.29, 2.  , 1.69, 1.29]])

In [69]:
# Not normalized
tfidf = TfidfTransformer(norm=None)
tfidf = tfidf.fit_transform(bag).toarray()
tfidf

array([[0.  , 1.  , 0.  , 1.29, 1.29, 0.  , 1.  , 0.  , 0.  ],
       [0.  , 1.  , 0.  , 0.  , 0.  , 1.29, 1.  , 0.  , 1.29],
       [3.39, 3.  , 3.39, 1.29, 1.29, 1.29, 2.  , 1.69, 1.29]])

In [70]:
bag2[2]/np.sqrt((bag2[2]**2).sum())

array([0.5 , 0.45, 0.5 , 0.19, 0.19, 0.19, 0.3 , 0.25, 0.19])

In [71]:
bag2[0]/np.sqrt((bag2[0]**2).sum())

array([0.  , 0.43, 0.  , 0.56, 0.56, 0.  , 0.43, 0.  , 0.  ])

In [72]:
bag2[1]/np.sqrt((bag2[1]**2).sum())

array([0.  , 0.43, 0.  , 0.  , 0.  , 0.56, 0.43, 0.  , 0.56])

$$\text{tf-idf}_{norm} = \frac{[3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0, 1.69, 1.29]}{\sqrt{3.39^2 + 3.0^2 + 3.39^2 + 1.29^2 + 1.29^2 + 1.29^2 + 2.0^2 + 1.69^2 + 1.29^2}}$$

$$= [0.5, 0.45, 0.5, 0.19, 0.19, 0.19, 0.3, 0.25, 0.19]$$

$$\Rightarrow \text{tf-idf}_{norm}(\text{"is"}, d_3) = 0.45$$

In [73]:
print(tfidf[0]/np.sqrt((tfidf**2).sum(axis=1))[0])
print(tfidf[1]/np.sqrt((tfidf**2).sum(axis=1))[1])
print(tfidf[2]/np.sqrt((tfidf**2).sum(axis=1))[2])


[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
[0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
[0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]


### Data preparation

In [74]:
df = pd.read_csv('movie_data.csv')
df.head(5)

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0


In [75]:
df.review[2]

'***SPOILER*** Do not read this, if you think about watching that movie, although it would be a waste of time. (By the way: The plot is so predictable that it does not make any difference if you read this or not anyway)<br /><br />If you are wondering whether to see "Coyote Ugly" or not: don\'t! It\'s not worth either the money for the ticket or the VHS / DVD. A typical "Chick-Feel-Good-Flick", one could say. The plot itself is as shallow as it can be, a ridiculous and uncritical version of the American Dream. The young good-looking girl from a small town becoming a big success in New York. The few desperate attempts of giving the movie any depth fail, such as the "tragic" accident of the father, the "difficulties" of Violet\'s relationship with her boyfriend, and so on. McNally (Director) tries to arouse the audience\'s pity and sadness put does not have any chance to succeed in this attempt due to the bad script and the shallow acting. Especially Piper Perabo completely fails in conv

In [76]:
import re
def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text)  # Remove HTML tags
    text = re.sub(r'[\W]+', ' ', text.lower())  # Lowercase and replace runs of non-word characters with a space
    return text

In [77]:
preprocessor(df.loc[2, 'review'])

' spoiler do not read this if you think about watching that movie although it would be a waste of time by the way the plot is so predictable that it does not make any difference if you read this or not anyway if you are wondering whether to see coyote ugly or not don t it s not worth either the money for the ticket or the vhs dvd a typical chick feel good flick one could say the plot itself is as shallow as it can be a ridiculous and uncritical version of the american dream the young good looking girl from a small town becoming a big success in new york the few desperate attempts of giving the movie any depth fail such as the tragic accident of the father the difficulties of violet s relationship with her boyfriend and so on mcnally director tries to arouse the audience s pity and sadness put does not have any chance to succeed in this attempt due to the bad script and the shallow acting especially piper perabo completely fails in convincing one of jersey s fear of singing in front of 
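
The cleaned review above starts with a space, because the leading `***` is replaced by a single blank. A possible variant, shown here only as a sketch, also trims leading and trailing whitespace; the name `preprocessor_stripped` and the sample string are hypothetical and not part of the original notebook.

```python
import re

def preprocessor_stripped(text):
    # Remove HTML tags
    text = re.sub(r'<[^>]*>', '', text)
    # Lowercase and replace runs of non-word characters with one space
    text = re.sub(r'[\W]+', ' ', text.lower())
    # Assumed extra step: drop the leading/trailing space left behind
    return text.strip()

sample = "<br />***SPOILER*** Don't read this!"
cleaned = preprocessor_stripped(sample)
print(repr(cleaned))
```

The extra space is harmless for whitespace-based tokenization, so this step is optional.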

In [78]:
df['review'] = df['review'].apply(preprocessor)

In [29]:
df.loc[:,'review'] = df['review'].apply(preprocessor)

In [79]:
df.review[45050]

'with all the excessive violence in this film it could ve been nc 17 but the gore could ve been pg 13 and there were quite a lot of swears when the mum had the original jackass bad hairdewed boy friend there was a lot of character development which made the film better to watch then after the kid came back to life as the scarecrow there was a mindless hour and ten minutes of him killing people the violence was overly excessive and i think the bodycount was higher than twelve which is a large number for movies like this almost every character in the film is stabbed or gets their head chopped off but the teacher who called him white trash and hoodlum though the character lester is anything but a hoodlum not even close i know hoods and am part hood they don t draw in class they sit there and throw stuff at the teacher the teacher deserved a more gruesome death than anyone of the characters but was just stabbed in the back there were two suspenseful scenes in the film but didn t last long 

In [95]:
len(set((' ').join(df.review[:50000].values).split()))

104132

### Tokenization

In [96]:
import nltk
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer(text):
    return nltk.word_tokenize(text,"english")


def tokenizer_porter(text):
    return [porter.stem(word) for word in tokenizer(text)]

In [97]:
tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

In [98]:
tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

In [99]:
stop = nltk.corpus.stopwords.words("english")
[w for w in tokenizer_porter('a runners likes running and runs they a lot') if w not in stop]

['runner', 'like', 'run', 'run', 'lot']

### Document classification

In [35]:
df

Unnamed: 0,review,sentiment
0,in 1974 the teenager martha moxley maggie grac...,1
1,ok so i really like kris kristofferson and his...,0
2,spoiler do not read this if you think about w...,0
3,hi for all the people who have seen this wonde...,1
4,i recently bought the dvd forgetting just how ...,0
...,...,...
49995,ok lets start with the best the building altho...,0
49996,the british heritage film industry is out of c...,0
49997,i don t even know where to begin on this one i...,0
49998,richard tyler is a little boy who is scared of...,0


In [100]:
df["review"][0]

'in 1974 the teenager martha moxley maggie grace moves to the high class area of belle haven greenwich connecticut on the mischief night eve of halloween she was murdered in the backyard of her house and her murder remained unsolved twenty two years later the writer mark fuhrman christopher meloni who is a former la detective that has fallen in disgrace for perjury in o j simpson trial and moved to idaho decides to investigate the case with his partner stephen weeks andrew mitchell with the purpose of writing a book the locals squirm and do not welcome them but with the support of the retired detective steve carroll robert forster that was in charge of the investigation in the 70 s they discover the criminal and a net of power and money to cover the murder murder in greenwich is a good tv movie with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a kennedy the powerful and rich family used their influence to cover the mur

In [101]:
" ".join([w for w in tokenizer_porter(df["review"][2]) if w not in stop])

'spoiler read thi think watch movi although would wast time way plot predict doe make ani differ read thi anyway wonder whether see coyot ugli worth either money ticket vh dvd typic chick feel good flick one could say plot shallow ridicul uncrit version american dream young good look girl small town becom big success new york desper attempt give movi ani depth fail tragic accid father difficulti violet relationship boyfriend mcnalli director tri arous audienc piti sad put doe ani chanc succeed thi attempt due bad script shallow act especi piper perabo complet fail convinc one jersey fear sing front audienc onli good quit funni thing coyot ugli john goodman repres small ray hope thi movi wa veri astonish jerri bruckheim produc thi movi first gone 60 second thi happen great movi like rock con air wa true bruckheim stuff look superfici movi good look women relax even better go see charli angel much funni entertain self iron instead thi flick two thumb 3 10'

In [102]:
def preprocessor2(text):
    return " ".join([w for w in tokenizer_porter(text) if w not in stop])

df['review'] = df['review'].apply(preprocessor2)

In [103]:
from sklearn.model_selection import train_test_split

X = np.asarray(df.review)
y = np.asarray(df.sentiment)

X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.3, stratify=y)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

Train set: (35000,) (35000,)
Test set: (15000,) (15000,)
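
The split above is random, so the exact rows in each set (and the score obtained later) change between runs. A minimal sketch on toy data, assuming a fixed `random_state` is acceptable, shows how to make the split reproducible while `stratify` preserves the class balance; the `X_toy` and `y_toy` arrays are made up for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the reviews and their balanced labels
X_toy = np.array(['review %d' % i for i in range(20)])
y_toy = np.array([0] * 10 + [1] * 10)

# random_state fixes the shuffle; stratify keeps the 50/50 class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0)

print('Train class counts:', np.bincount(y_tr))
print('Test class counts:', np.bincount(y_te))
```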


In [45]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

tfidf = TfidfVectorizer()
LR = LogisticRegression()
pipe = Pipeline([('vect', tfidf),('clf', LR)])

#[('vect', tfidf),('clf', LR)]

pipe.fit(X_train,y_train)
pipe.score(X_test, y_test)

0.8918666666666667

In [103]:
pipe.predict(X_test)

array([0, 1, 0, ..., 0, 0, 1])
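
Because the pipeline was trained on text that had already been cleaned, any new raw review would need the same cleaning before calling `predict`. One possible way to avoid forgetting that step, sketched here on made-up toy data, is to pass the cleaning function through the `preprocessor` parameter of `TfidfVectorizer`. The `clean` function mirrors the notebook's HTML and punctuation removal but omits stemming and stop-word removal, and the reviews and labels are invented for illustration.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def clean(text):
    # Same idea as the notebook's preprocessor: drop HTML, keep word characters
    text = re.sub(r'<[^>]*>', '', text)
    return re.sub(r'[\W]+', ' ', text.lower())

# Invented toy reviews (1 = positive, 0 = negative)
toy_reviews = [
    '<br />A wonderful, moving film',
    'Terrible plot and awful acting',
    'I loved it, wonderful cast!',
    'Awful. A terrible waste of time',
]
toy_labels = [1, 0, 1, 0]

toy_pipe = Pipeline([
    ('vect', TfidfVectorizer(preprocessor=clean)),  # cleaning happens inside
    ('clf', LogisticRegression()),
])
toy_pipe.fit(toy_reviews, toy_labels)

# Raw, uncleaned text can now be passed directly
proba = toy_pipe.predict_proba(['<br />What a WONDERFUL film!'])
print(proba)
```

`predict_proba` returns one row per review with the estimated probability of each class, which is often more informative than the hard labels returned by `predict`.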