## Sentiment analysis
Given a set of reviews, we would like to determine whether each text expresses a positive or a negative opinion.
### Model
To do this, we first have to model the problem mathematically, and then apply the ML tool that seems most appropriate.
#### Keywords
Online there are datasets (dictionaries, in this case) listing words with a positive or negative connotation. Counting them is clearly a naive approach: the meaning of a sentence cannot be fully inferred from how many positive or negative words it contains, because natural language is often ambiguous and some words change meaning entirely depending on context. For short reviews about concrete situations, however, it is an acceptable approximation.

What we can do is count how many positive and negative words appear in a review, and use those counts to define its overall connotation.
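
The counting step can be sketched in a few lines. This is a minimal illustration, assuming a tiny hand-made lexicon (`POSITIVE` and `NEGATIVE` are hypothetical word sets, not a real dataset) and a naive regex tokenizer; the notebook cells further down rely on NLTK's VADER analyzer instead.

```python
import re

# Hypothetical toy lexicons, for illustration only; a real run would load
# a published sentiment dictionary instead.
POSITIVE = {"great", "wonderful", "love", "reliable"}
NEGATIVE = {"sad", "lousy", "sorry", "worst"}


def count_sentiment_words(text):
    """Return (positive_count, negative_count) for a piece of text."""
    # Lowercase the text and keep runs of letters/apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    return pos, neg


pos, neg = count_sentiment_words("Wonderful! A really great movie despite a sad ending")
print(pos, neg)  # 2 1
```

A word-level lookup like this ignores negation ("not great") and context, which is exactly the limitation described above.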

Given, for example, a table of reviews structured as follows:

| **oid** |                                                   **text** | **stars** |
|--------:|-----------------------------------------------------------:|----------:|
| **527** |  Christopher Reeve is the definitive screen "Superman" ... |    5.0    |
| **540** |  Sorry, never watched the movie. If has ANYTHING to do ... |    2.0    |
| **527** | I won't bother going over the plot, as other have done ... |    4.0    |

It can be extended with two columns, **pos_words** and **neg_words**, which count the positive and negative words in each review:

| **oid** |                                                   **text** | **stars** | **pos_words** | **neg_words** |
|--------:|-----------------------------------------------------------:|----------:|--------------:|--------------:|
| **527** |  Christopher Reeve is the definitive screen "Superman" ... |    5.0    |      10.0     |      3.0      |
| **540** |  Sorry, never watched the movie. If has ANYTHING to do ... |    2.0    |      1.0      |      3.0      |
| **527** | I won't bother going over the plot, as other have done ... |    4.0    |      41.0     |      33.0     |

By feeding the last three columns to a linear regression model, with **stars** as the target and the two word counts as features, we can predict the score of a new review from the number of words of each kind.
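
Written out, this is an ordinary linear regression with two features. The symbols $\beta_0$, $\beta_1$ and $\beta_2$ are notation introduced here for illustration; they are the quantities the fit estimates from the table:

$$
\widehat{\text{stars}} = \beta_0 + \beta_1 \cdot \text{pos\_words} + \beta_2 \cdot \text{neg\_words}
$$

If the keyword counts carry any signal, one would expect $\beta_1 > 0$ and $\beta_2 < 0$.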

In [1]:
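import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# One-time setup, if the resources are missing:
# nltk.download('vader_lexicon'); nltk.download('punkt')
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')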

In [2]:
sia = SentimentIntensityAnalyzer()

In [3]:
sentence = "Wonderful! A really great movie despite a sad ending"

In [4]:
# Count the tokens that VADER scores as positive / negative when taken in isolation
tokens = nltk.word_tokenize(sentence)
num_positive_words = sum(sia.polarity_scores(w)['pos'] > 0 for w in tokens)
num_negative_words = sum(sia.polarity_scores(w)['neg'] > 0 for w in tokens)


In [5]:
num_negative_words

1

In [6]:
num_positive_words

2

In [7]:
import pandas as pd

In [8]:
df = pd.read_csv('datasets/amznrviews/20191226-reviews.csv')

In [9]:
df

Unnamed: 0,asin,name,rating,date,verified,title,body,helpfulVotes
0,B0000SX2UC,Janet,3,"October 11, 2005",False,"Def not best, but not worst",I had the Samsung A600 for awhile which is abs...,1.0
1,B0000SX2UC,Luke Wyatt,1,"January 7, 2004",False,Text Messaging Doesn't Work,Due to a software issue between Nokia and Spri...,17.0
2,B0000SX2UC,Brooke,5,"December 30, 2003",False,Love This Phone,"This is a great, reliable phone. I also purcha...",5.0
3,B0000SX2UC,amy m. teague,3,"March 18, 2004",False,"Love the Phone, BUT...!","I love the phone and all, because I really did...",1.0
4,B0000SX2UC,tristazbimmer,4,"August 28, 2005",False,"Great phone service and options, lousy case!",The phone has been great for every purpose it ...,1.0
...,...,...,...,...,...,...,...,...
67981,B081H6STQQ,jande,5,"August 16, 2019",False,"Awesome Phone, but finger scanner is a big mis...",I love the camera on this phone. The screen is...,1.0
67982,B081H6STQQ,2cool4u,5,"September 14, 2019",False,Simply Amazing!,I've been an Xperia user for several years and...,1.0
67983,B081H6STQQ,simon,5,"July 14, 2019",False,"great phon3, but many bugs need to fix. still ...",buy one more for my cousin,
67984,B081TJFVCJ,Tobiasz Jedrysiak,5,"December 24, 2019",True,Phone is like new,Product looks and works like new. Very much re...,


In [10]:
df['date'] = pd.to_datetime(df['date'])

In [44]:
df

Unnamed: 0,rating,date,body,helpfulVotes
0,3,2005-10-11,I had the Samsung A600 for awhile which is abs...,1.0
1,1,2004-01-07,Due to a software issue between Nokia and Spri...,17.0
2,5,2003-12-30,"This is a great, reliable phone. I also purcha...",5.0
3,3,2004-03-18,"I love the phone and all, because I really did...",1.0
4,4,2005-08-28,The phone has been great for every purpose it ...,1.0
...,...,...,...,...
67981,5,2019-08-16,I love the camera on this phone. The screen is...,1.0
67982,5,2019-09-14,I've been an Xperia user for several years and...,1.0
67983,5,2019-07-14,buy one more for my cousin,
67984,5,2019-12-24,Product looks and works like new. Very much re...,


In [49]:
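# Replace missing helpfulVotes with the column mean (plain assignment avoids chained in-place modification)
df['helpfulVotes'] = df['helpfulVotes'].fillna(df['helpfulVotes'].mean())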

In [50]:
df

Unnamed: 0,rating,date,body,helpfulVotes
0,3,2005-10-11,I had the Samsung A600 for awhile which is abs...,1.00000
1,1,2004-01-07,Due to a software issue between Nokia and Spri...,17.00000
2,5,2003-12-30,"This is a great, reliable phone. I also purcha...",5.00000
3,3,2004-03-18,"I love the phone and all, because I really did...",1.00000
4,4,2005-08-28,The phone has been great for every purpose it ...,1.00000
...,...,...,...,...
67981,5,2019-08-16,I love the camera on this phone. The screen is...,1.00000
67982,5,2019-09-14,I've been an Xperia user for several years and...,1.00000
67983,5,2019-07-14,buy one more for my cousin,8.22969
67984,5,2019-12-24,Product looks and works like new. Very much re...,8.22969


In [51]:
df.dropna(subset=['body'], inplace=True)

In [52]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 67965 entries, 0 to 67985
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   rating        67965 non-null  int64         
 1   date          67965 non-null  datetime64[ns]
 2   body          67965 non-null  object        
 3   helpfulVotes  67965 non-null  float64       
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 2.6+ MB


In [58]:
# Count positive and negative tokens in each review body (one VADER lookup per token)
df['pos_words'] = df['body'].apply(lambda text: sum(sia.polarity_scores(w)['pos'] > 0 for w in nltk.word_tokenize(text)))
df['neg_words'] = df['body'].apply(lambda text: sum(sia.polarity_scores(w)['neg'] > 0 for w in nltk.word_tokenize(text)))

In [59]:
df

Unnamed: 0,rating,date,body,helpfulVotes,pos_words,neg_words
0,3,2005-10-11,I had the Samsung A600 for awhile which is abs...,1.00000,17,13
1,1,2004-01-07,Due to a software issue between Nokia and Spri...,17.00000,5,2
2,5,2003-12-30,"This is a great, reliable phone. I also purcha...",5.00000,6,3
3,3,2004-03-18,"I love the phone and all, because I really did...",1.00000,3,0
4,4,2005-08-28,The phone has been great for every purpose it ...,1.00000,6,3
...,...,...,...,...,...,...
67981,5,2019-08-16,I love the camera on this phone. The screen is...,1.00000,5,3
67982,5,2019-09-14,I've been an Xperia user for several years and...,1.00000,4,5
67983,5,2019-07-14,buy one more for my cousin,8.22969,0,0
67984,5,2019-12-24,Product looks and works like new. Very much re...,8.22969,2,0


In [60]:
df.loc[3, 'body']

"I love the phone and all, because I really did need one, but I didn't expect the price of the bill when I received one. Also, I've had my phone for a little over two months now and still have yet to receive my free accessories that were supposed to come with the phone. Every time I call the company, they keep telling me to wait a couple of weeks, and that I should be receiving it shortly. Other than that, I do love the phone and all that I am able to do with it; and I'm not just talking about making the phone calls! :)"

In [69]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np

In [64]:
lrm = LinearRegression()

Setting *random_state=42* fixes the seed of the shuffle, so the split is reproducible across runs.

In [137]:
features = ['helpfulVotes', 'pos_words', 'neg_words']
X_train, X_val, y_train, y_val = train_test_split(
    df[features], df['rating'], train_size=0.7, random_state=42
)
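
To see why a fixed seed gives a reproducible split, here is a minimal standard-library sketch. It is not scikit-learn's implementation; `split_indices` is a hypothetical helper that only mimics the idea of shuffling row indices with a seeded generator and cutting at 70%.

```python
import random


def split_indices(n_rows, train_size=0.7, seed=42):
    """Shuffle row indices with a seeded generator and split them."""
    rng = random.Random(seed)  # private generator: same seed, same shuffle
    indices = list(range(n_rows))
    rng.shuffle(indices)
    cut = round(n_rows * train_size)
    return indices[:cut], indices[cut:]


train_a, val_a = split_indices(10)
train_b, val_b = split_indices(10)
print(train_a == train_b and val_a == val_b)  # True: identical splits
```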

In [138]:
lrm.fit(X_train, y_train)

LinearRegression()

In [139]:
lrm.predict(X_val)

array([3.88608999, 3.92541537, 3.92541537, ..., 8.72570379, 3.79014593,
       3.88374846])

The $R^2$ is very low on both the training and the validation data (about 0.15 in each case), so something is not working.

In [140]:
lrm.score(X_train, y_train)

0.1502861538933309

In [141]:
lrm.score(X_val, y_val)

0.14611076826991076
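
For reference, `score` on a regressor returns the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$. A minimal hand-rolled version is sketched below; the toy values in `y_true` and `always_mean` are made up for illustration and are not taken from the dataset.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1, 2, 3, 4, 5]
always_mean = [3, 3, 3, 3, 3]  # predicting the mean rating everywhere
print(r_squared(y_true, always_mean))  # 0.0: no better than the mean
print(r_squared(y_true, y_true))       # 1.0: perfect predictions
```

Read this way, a value near 0.15 means the three features explain only about 15% of the variance in the ratings.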

- Intercept (bias): the baseline rating is about 3.81 stars, very close to the result obtained by the professor's system.
- Coefficients:
    - **Helpful votes**: each additional vote lowers the predicted rating by 0.00190416 stars.
    - **Positive words**: each one raises the predicted rating by 0.13526944 stars.
    - **Negative words**: oddly, each one removes 0.31220579 stars, more than twice what a positive word adds.

In [130]:
lrm.coef_, lrm.intercept_

(array([-0.00190416,  0.13526944, -0.31220579]), 3.8058166132372144)
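
As a worked example, the fitted equation can be applied by hand. The sketch below plugs the coefficients printed above into the linear formula for review 3 (1 helpful vote, 3 positive words, 0 negative words). `predict_rating` is a helper name introduced here, and the sketch assumes these coefficients belong to the same fit that produced the predictions shown earlier.

```python
# Values copied from the output of lrm.coef_, lrm.intercept_
COEF = (-0.00190416, 0.13526944, -0.31220579)  # helpfulVotes, pos_words, neg_words
INTERCEPT = 3.8058166132372144


def predict_rating(helpful_votes, pos_words, neg_words):
    """Linear prediction: intercept + sum of coefficient * feature."""
    features = (helpful_votes, pos_words, neg_words)
    return INTERCEPT + sum(c * x for c, x in zip(COEF, features))


print(round(predict_rating(1, 3, 0), 2))   # 4.21 for review 3
print(round(predict_rating(1, 40, 0), 2))  # 9.21: many positive words push past 5 stars
```

Nothing bounds the output to the 1 to 5 star range, which is consistent with values such as 8.73 in the `predict` output.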

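On the training set, the thresholded prediction agrees with the true label 76.2% of the time (a rating of at least 3.5 stars counts as positive).

In [144]:
# Both label vectors are computed on the training split
labels_true = np.where(y_train >= 3.5, "pos", "neg")
labels_pred = np.where(lrm.predict(X_train) >= 3.5, "pos", "neg")

In [145]:
(labels_true == labels_pred).mean()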

0.7619548081975828
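
The thresholding step turns the regression into a two-class classifier. Below is a self-contained sketch with made-up ratings and predictions (the two lists are illustrative, not dataset values, and `to_label` and `accuracy` are names introduced here). Running the same comparison on `X_val` and `y_val` would give a held-out estimate rather than a training-set one.

```python
def to_label(stars, threshold=3.5):
    """Map a (possibly fractional) star value to "pos" or "neg"."""
    return "pos" if stars >= threshold else "neg"


def accuracy(true_stars, predicted_stars, threshold=3.5):
    """Fraction of items whose thresholded prediction matches the truth."""
    matches = [
        to_label(t, threshold) == to_label(p, threshold)
        for t, p in zip(true_stars, predicted_stars)
    ]
    return sum(matches) / len(matches)


true_stars = [5, 1, 4, 2]
predicted_stars = [4.2, 3.9, 3.6, 3.1]
print(accuracy(true_stars, predicted_stars))  # 0.75: only the 1-star review is misclassified
```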