# Sentiment Analysis of Movie Opinions on Twitter

**Reference:**
https://github.com/rizalespe/Dataset-Sentimen-Analisis-Bahasa-Indonesia/blob/master/dataset_tweet_sentiment_opini_film.csv


**The algorithms:**
Logistic Regression, Random Forest Classifier, Linear SVC, and Multinomial NB (Naive Bayes)

**Dataset:** 
dataset_tweet_sentiment_opini_film.csv

The dataset contains 200 tweets: 100 with negative sentiment and 100 with positive sentiment.

# **Training Step**



### Read the data

In [1]:
import pandas as pd

df = pd.read_csv("/content/dataset_tweet_sentiment_opini_film.csv", nrows=30000)
df.head()

Unnamed: 0,Id,Sentiment,Text_Tweet
0,1,negative,Jelek filmnya... apalagi si ernest gak mutu bg...
1,2,negative,Film king Arthur ini film paling jelek dari se...
2,3,negative,@beexkuanlin Sepanjang film gwa berkata kasar ...
3,4,negative,Ane ga suka fast and furious..menurutku kok je...
4,5,negative,"@baekhyun36 kan gua ga tau film nya, lu bilang..."


The columns:

*   `Id` column is a running identifier for each tweet.
*   `Sentiment` column says whether the tweet is `positive` or `negative`.
*   `Text_Tweet` column is the text of the tweet.


How many rows?

In [2]:
df.shape

(200, 3)

How many **positive** and **negative** tweets?

In [3]:
df.Sentiment.value_counts()

negative    100
positive    100
Name: Sentiment, dtype: int64

## Train the algorithms


### Vectorize the tweets

Create a `TfidfVectorizer` and use it to vectorize the tweets.
Use `max_features` to keep only the 1000 most frequent terms.

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [5]:
vectorizer = TfidfVectorizer(max_features=1000)
vectors = vectorizer.fit_transform(df.Text_Tweet)
words_df = pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names_out())
words_df.head()

Unnamed: 0,000,10,13,13szi5lbc,15,18,1982,1uxomgqjs,20,2010,2017,207,21theguysquiz,22,2o2irpy,30,41,46i52na21l,4_oaoo,637,690,6tqqdgglj,90,95,abis,abiss,acclaim,actingnya,action,ada,adalah,adanya,adaptasi,adegan,adinia,aduh,aduk,after,agak,agama,...,uda,udah,udh,ujian,ulang,umur,unaazizah,unlocked,unsur,untuk,us,utk,video,visual,vlog,wajib,waktu,walau,warganet,watch,waw,weekend,weird,what,wib,wkwk,woman,wonder,worth,ya,yah,yakin,yang,yanskii,yaoi,yg,you,youtu,youtube,ziarah
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.393599,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.238215,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.241932,0.0,0.0,0.0,0.0,0.0
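
To make the effect of `max_features` concrete, here is a minimal sketch on a hypothetical three-sentence corpus. The sentences and the names `toy_tweets` and `toy_vectorizer` are illustrative assumptions, not part of the dataset. The corpus has four distinct words, so with `max_features=3` the vectorizer should keep only the three most frequent ones:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: "jelek" appears once, every other word twice.
toy_tweets = ["filmnya bagus banget", "filmnya jelek banget", "bagus"]

# Keep only the 3 terms with the highest frequency across the corpus.
toy_vectorizer = TfidfVectorizer(max_features=3)
toy_vectors = toy_vectorizer.fit_transform(toy_tweets)

print(toy_vectors.shape)                   # one row per sentence, one column per kept term
print(sorted(toy_vectorizer.vocabulary_))  # the terms that survived the cut
```

The rarest word is dropped, which is the same mechanism that limits the tweet matrix above to 1000 columns.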


### Setting the variables

There are two variables: `X` and `y`.

`X` = **features**

`y` = **labels**.

In [6]:
X = words_df
y = df.Sentiment

# **The algorithms:**

1.   Logistic Regression
2.   Random Forest Classifier
3.   Linear SVM
4.   Multinomial NB


**You can pick just ONE or ALL OF THEM.**

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

# **Training the algorithms**


**Logistic Regression**

- C: float, default = 1.0
  - Inverse of regularization strength; must be a positive float. 
  - Smaller values specify stronger regularization.

- solver = {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}, default = ’lbfgs’
  - Algorithm to use in the optimization problem.
  - For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and ‘saga’ are faster for large ones.
  - For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle multinomial loss; ‘liblinear’ is limited to one-versus-rest schemes.
  - ‘newton-cg’, ‘lbfgs’, ‘sag’ and ‘saga’ handle L2 or no penalty
  - ‘liblinear’ and ‘saga’ also handle L1 penalty
  - ‘saga’ also supports ‘elasticnet’ penalty
  - ‘liblinear’ does not support setting penalty='none'


In [8]:
%%time
# Create and train a logistic regression
logreg = LogisticRegression(C=1e9, solver='lbfgs', max_iter=1000)
logreg.fit(X, y)

CPU times: user 31.1 ms, sys: 7.02 ms, total: 38.1 ms
Wall time: 27 ms
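
The cell above sets `C=1e9`, which by the parameter notes means very weak regularization. The following sketch illustrates that relationship on hypothetical one-dimensional toy data (the arrays `X_toy` and `y_toy` are assumptions made up for this example): a small `C` should shrink the learned coefficient, while a huge `C` leaves it close to the unregularized fit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical 1-D toy data with overlapping classes (not from the tweet dataset).
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 1, 0, 1, 1])

# Small C = strong regularization; large C = almost none.
strong_reg = LogisticRegression(C=0.01, solver="lbfgs", max_iter=1000).fit(X_toy, y_toy)
weak_reg = LogisticRegression(C=1e9, solver="lbfgs", max_iter=1000).fit(X_toy, y_toy)

print("C=0.01 coefficient:", strong_reg.coef_[0, 0])
print("C=1e9  coefficient:", weak_reg.coef_[0, 0])
```

With only 200 tweets and 1000 features, a nearly unregularized model can fit the training data very closely, which is worth keeping in mind when reading its probabilities later.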


**Random Forest Classifier**

- n_estimators: int, default = 100
  - The number of trees in the forest.

In [9]:
%%time
# Create and train a random forest classifier
forest = RandomForestClassifier(n_estimators=50)
forest.fit(X, y)

CPU times: user 142 ms, sys: 71.9 ms, total: 214 ms
Wall time: 150 ms
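
As a rough illustration of what `n_estimators` controls, the sketch below trains a 50-tree forest on hypothetical toy data (`X_toy`, `y_toy`, and `toy_forest` are assumed names, not part of the notebook) and checks that the forest's `predict_proba` matches the average of the individual trees' probabilities:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy features and labels, standing in for the TF-IDF matrix.
X_toy = np.array([[1, 0], [2, 0], [1, 1], [0, 1], [0, 2], [1, 2]], dtype=float)
y_toy = np.array(["negative", "negative", "negative",
                  "positive", "positive", "positive"])

toy_forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# Average the per-tree class probabilities by hand.
per_tree = np.array([tree.predict_proba(X_toy) for tree in toy_forest.estimators_])
manual_average = per_tree.mean(axis=0)

print(len(toy_forest.estimators_))  # number of fitted trees
print(np.allclose(manual_average, toy_forest.predict_proba(X_toy)))
```

More trees give a finer-grained average at the cost of longer training, which is one reason the forest is the slowest of the four models here.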


**Linear SVC**

LinearSVC is another (faster) implementation of **Support Vector Classification** for the case of a **linear kernel**. 

Note that **LinearSVC** does not accept a `kernel` parameter, as the kernel is assumed to be linear.

In [10]:
%%time
# Create and train a linear support vector classifier (LinearSVC)
svc = LinearSVC()
svc.fit(X, y)

CPU times: user 5.75 ms, sys: 141 µs, total: 5.89 ms
Wall time: 5.61 ms
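
Unlike the other three classifiers used here, `LinearSVC` has no `predict_proba` method; it exposes signed distances to the separating hyperplane through `decision_function` instead. The sketch below shows this on hypothetical toy data (`X_toy`, `y_toy`, and `toy_svc` are assumed names):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical, linearly separable toy data.
X_toy = np.array([[2.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 1.0]])
y_toy = np.array(["negative", "negative", "positive", "positive"])

toy_svc = LinearSVC(random_state=0).fit(X_toy, y_toy)

# LinearSVC offers no probability estimates.
print(hasattr(toy_svc, "predict_proba"))

# A positive score means the second class in classes_, a negative score the first.
scores = toy_svc.decision_function(X_toy)
labels_from_scores = np.where(scores > 0, toy_svc.classes_[1], toy_svc.classes_[0])
print(scores)
print(labels_from_scores)
```

The magnitude of a score indicates how far a sample lies from the boundary, but it is not a calibrated probability.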


**Multinomial NB**

MultinomialNB implements the **naive Bayes algorithm** for multinomially distributed data, and is one of the two classic naive Bayes variants used in text classification (*where the data are typically represented as word vector counts, although tf-idf vectors are also known to work well in practice*).

In [11]:
%%time
# Create and train a multinomial naive bayes classifier (MultinomialNB)
bayes = MultinomialNB()
bayes.fit(X, y)

CPU times: user 6.1 ms, sys: 6.02 ms, total: 12.1 ms
Wall time: 12.2 ms
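
The wall times reported by the four `%%time` cells above can be compared directly. The sketch below copies those numbers into a dictionary (the name `wall_ms` is an assumption for this example) and expresses each one relative to the fastest model. These are single-run timings, so treat the ratios as rough indications only.

```python
# Wall times in milliseconds, copied from the %%time outputs above.
wall_ms = {
    "LogisticRegression": 27.0,
    "RandomForestClassifier": 150.0,
    "LinearSVC": 5.61,
    "MultinomialNB": 12.2,
}

# Express every timing as a multiple of the fastest one.
fastest = min(wall_ms, key=wall_ms.get)
ratios = {name: ms / wall_ms[fastest] for name, ms in wall_ms.items()}

for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name}: {ratio:.1f}x the time of {fastest}")
```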


# Discussion 1

How much faster was each algorithm compared to the others?

## Use the models

We will use the models to predict whether a tweet is positive or negative.

### Testing Data

**You can add the testing data below.** 

In [12]:
# Create some test data

pd.set_option("display.max_colwidth", 200)

datatest = pd.DataFrame({'content': [
    "Jelek banget filmnya",
    "Ini jadi film kesukaan",
    "filmnya bagus banget",
    "kecewa sama filmnya",
    "bosen nonton filmnya",
    "filmnya keren",
    "menarik sih buat ditonton",
    "Seru banget filmnya",
    "Recommended buat ditonton",
    "Film ini wajib ditonton",
]})
datatest

Unnamed: 0,content
0,Jelek banget filmnya
1,Ini jadi film kesukaan
2,filmnya bagus banget
3,kecewa sama filmnya
4,bosen nonton filmnya
5,filmnya keren
6,menarik sih buat ditonton
7,Seru banget filmnya
8,Recommended buat ditonton
9,Film ini wajib ditonton




First we need to **vectorize** the sentences into numbers, so the algorithms can understand them.

Our vectorizer only knows **certain words.** 
Run `vectorizer.get_feature_names_out()` to see the list of words it knows (older scikit-learn releases call this `get_feature_names()`).

In [13]:
print(vectorizer.get_feature_names_out().tolist())

['000', '10', '13', '13szi5lbc', '15', '18', '1982', '1uxomgqjs', '20', '2010', '2017', '207', '21theguysquiz', '22', '2o2irpy', '30', '41', '46i52na21l', '4_oaoo', '637', '690', '6tqqdgglj', '90', '95', 'abis', 'abiss', 'acclaim', 'actingnya', 'action', 'ada', 'adalah', 'adanya', 'adaptasi', 'adegan', 'adinia', 'aduh', 'aduk', 'after', 'agak', 'agama', 'ah', 'air', 'aj', 'aja', 'ajaa', 'akan', 'akhir', 'akhirnya', 'akhrnya', 'akting', 'aktingnya', 'aktor', 'aktornya', 'aktrisnya', 'aku', 'ale', 'alfi', 'alien', 'alitalit_', 'alur', 'ama', 'amat', 'amazing', 'ambigu', 'amira', 'an', 'anak', 'ancur', 'and', 'anda', 'andibowooo', 'ane', 'aneh', 'anjlok', 'anya', 'apa', 'apa2', 'apalagi', 'april', 'arah', 'arthur', 'artinya', 'artis', 'asih', 'asik', 'askmenfess', 'asli', 'astagah', 'atas', 'auratmu', 'awal', 'awalnya', 'awisuryadi', 'baca', 'baekhyun36', 'bagi', 'bagian', 'bagus', 'bagussss', 'bahagia', 'bahas', 'bahasa', 'bahwa', 'baik', 'bajakan', 'bakal', 'balap2', 'bandung', 'bang', 

**Because we already have the list of words we know, we only want to count them.** So instead of `.fit_transform`, we just use `.transform`:

```python
datatest_vectors = vectorizer.transform(datatest.content)
datatest_words_df = ......
```


In [14]:
# Put it through the vectoriser

# transform, not fit_transform, because we already learned all our words
datatest_vectors = vectorizer.transform(datatest.content)
datatest_words_df = pd.DataFrame(datatest_vectors.toarray(), columns=vectorizer.get_feature_names_out())
datatest_words_df.head()

Unnamed: 0,000,10,13,13szi5lbc,15,18,1982,1uxomgqjs,20,2010,2017,207,21theguysquiz,22,2o2irpy,30,41,46i52na21l,4_oaoo,637,690,6tqqdgglj,90,95,abis,abiss,acclaim,actingnya,action,ada,adalah,adanya,adaptasi,adegan,adinia,aduh,aduk,after,agak,agama,...,uda,udah,udh,ujian,ulang,umur,unaazizah,unlocked,unsur,untuk,us,utk,video,visual,vlog,wajib,waktu,walau,warganet,watch,waw,weekend,weird,what,wib,wkwk,woman,wonder,worth,ya,yah,yakin,yang,yanskii,yaoi,yg,you,youtu,youtube,ziarah
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Confirm the shape of `datatest_words_df`: 10 test sentences by 1000 features.

In [15]:
datatest_words_df.shape

(10, 1000)
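
The difference between `.fit_transform` and `.transform` matters most for words the vectorizer has never seen. In the minimal sketch below (the two training sentences and the name `toy_vec` are hypothetical), a sentence made only of unknown words is transformed into an all-zero row, because `.transform` reuses the learned vocabulary and silently ignores everything else:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Learn a vocabulary of three words from a hypothetical training corpus.
toy_vec = TfidfVectorizer()
toy_vec.fit(["filmnya bagus", "filmnya jelek"])

# Both words are in the vocabulary, so two entries are non-zero.
known = toy_vec.transform(["filmnya bagus"])

# Neither word was seen during fit, so the row is entirely zero.
unknown = toy_vec.transform(["kecewa sekali"])

print(known.shape, known.nnz)
print(unknown.shape, unknown.nnz)
```

A test sentence that maps to all zeros gives the classifiers nothing to work with, so their prediction for it reflects only what they learned about the classes overall.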

# **Predicting with the models**


We use `.predict` to predict a class label for each sentence. Because the training labels are strings, it returns `negative` or `positive`:

```python
datatest['pred_logreg'] = logreg.predict(datatest_words_df)
```

We use `.predict_proba` to estimate the probability of each class.

The returned columns follow the order of the model's `classes_` attribute (here `negative`, then `positive`), so `[:,1]` selects the probability that a sentence is `positive`:


```python
datatest['pred_logreg_proba'] = logreg.predict_proba(datatest_words_df)[:,1]
```
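
To check which column of `.predict_proba` belongs to which label, you can inspect `classes_` on a fitted model. The sketch below does this for a `MultinomialNB` trained on hypothetical toy counts (`X_toy`, `y_toy`, and `toy_model` are assumed names) and looks up the `positive` column by name instead of hard-coding index `1`:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word counts: column 0 is a "negative" word, column 1 a "positive" word.
X_toy = np.array([[2, 0], [1, 0], [0, 2], [0, 1]])
y_toy = np.array(["negative", "negative", "positive", "positive"])

toy_model = MultinomialNB().fit(X_toy, y_toy)

# String labels are stored in sorted order.
print(toy_model.classes_)

# Find the column for "positive" by name, then read its probability.
proba = toy_model.predict_proba(np.array([[0, 3]]))
positive_col = list(toy_model.classes_).index("positive")
print(proba[0, positive_col])
```

Looking the column up by name keeps the code correct even if the label set or its ordering changes.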



In [16]:
# Predict using all our models. 

# Logistic Regression predictions + probabilities
datatest['pred_logreg'] = logreg.predict(datatest_words_df)
datatest['pred_logreg_proba'] = logreg.predict_proba(datatest_words_df)[:,1]

# Random forest predictions + probabilities
datatest['pred_forest'] = forest.predict(datatest_words_df)
datatest['pred_forest_proba'] = forest.predict_proba(datatest_words_df)[:,1]

# SVC predictions
datatest['pred_svc'] = svc.predict(datatest_words_df)

# Bayes predictions + probabilities
datatest['pred_bayes'] = bayes.predict(datatest_words_df)
datatest['pred_bayes_proba'] = bayes.predict_proba(datatest_words_df)[:,1]

In [17]:
datatest

Unnamed: 0,content,pred_logreg,pred_logreg_proba,pred_forest,pred_forest_proba,pred_svc,pred_bayes,pred_bayes_proba
0,Jelek banget filmnya,negative,6.51238e-06,negative,0.32,negative,negative,0.343761
1,Ini jadi film kesukaan,negative,5.740527e-09,negative,0.08,negative,negative,0.343915
2,filmnya bagus banget,positive,0.9999976,positive,0.6,positive,positive,0.64666
3,kecewa sama filmnya,negative,3.399591e-12,negative,0.28,negative,negative,0.240218
4,bosen nonton filmnya,positive,0.9835608,negative,0.48,positive,positive,0.585866
5,filmnya keren,positive,1.0,positive,0.88,positive,positive,0.768228
6,menarik sih buat ditonton,positive,1.0,positive,0.58,positive,positive,0.742361
7,Seru banget filmnya,positive,1.0,positive,0.56,positive,positive,0.705564
8,Recommended buat ditonton,positive,0.9999994,positive,0.54,positive,positive,0.743204
9,Film ini wajib ditonton,positive,0.9998311,negative,0.36,positive,positive,0.726947
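
The probability columns carry more information than the labels. As a minimal sketch, assuming a label is simply the probability compared against a cutoff, the code below takes the `pred_bayes_proba` values from the table above and shows which rows change when the cutoff moves from 0.5 to 0.6. The helper `label_at` and the 0.6 cutoff are assumptions chosen for illustration.

```python
# pred_bayes_proba values copied from the table above, in row order.
bayes_proba = [0.343761, 0.343915, 0.64666, 0.240218, 0.585866,
               0.768228, 0.742361, 0.705564, 0.743204, 0.726947]

def label_at(threshold, probabilities):
    """Map each positive-class probability to a label using a cutoff."""
    return ["positive" if p >= threshold else "negative" for p in probabilities]

default_labels = label_at(0.5, bayes_proba)   # reproduces the pred_bayes column
stricter_labels = label_at(0.6, bayes_proba)  # demands more confidence for "positive"

# Rows whose label changes when the cutoff is raised.
changed_rows = [i for i, (a, b) in enumerate(zip(default_labels, stricter_labels)) if a != b]
print(changed_rows)
```

Only row 4 ("bosen nonton filmnya") flips, which is also the sentence where the random forest disagrees with the other models.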


# Discussion 2


1.   What do the numbers mean? What's the difference between 0, 1, and 0.5?
2.   Were there any sentences that the classifiers disagreed about? Give your analysis!
3.   What's the difference between using a hard label (0 or 1) in sentiment analysis and using a score in the range 0 to 1? When might you use one rather than the other?


# **Testing our models**

Which model performs the best?

In [18]:
df.head()

Unnamed: 0,Id,Sentiment,Text_Tweet
0,1,negative,Jelek filmnya... apalagi si ernest gak mutu bgt actingnya... film sampah
1,2,negative,Film king Arthur ini film paling jelek dari seluruh cerita King Arthur
2,3,negative,@beexkuanlin Sepanjang film gwa berkata kasar terus pada bapaknya
3,4,negative,Ane ga suka fast and furious..menurutku kok jelek ya tu film
4,5,negative,"@baekhyun36 kan gua ga tau film nya, lu bilang perang perangan/? Perang""an disebut ama rp yaoi jadi ambigu :v"


Our dataframe is a list of many tweets. We turned this into `X` (the vectorized words) and `y` (whether the tweet is negative or positive).

Previously, we used `.fit(X, y)` to train on all of our data. 

Instead, **we can test our models** by doing a train/test split and checking whether the predictions on the held-out tweets match the actual labels.

In [19]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)
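
The split above is random, so the class balance of the test set and the exact scores can change from run to run. A hedged sketch of one way to make it reproducible and balanced is shown below, using hypothetical toy data (`toy_X` and `toy_y` are assumed names). The `stratify` and `random_state` arguments are real `train_test_split` parameters, but the notebook itself does not use them.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 10 negative and 10 positive samples.
toy_X = [[i] for i in range(20)]
toy_y = ["negative"] * 10 + ["positive"] * 10

# stratify keeps the class ratio in both parts; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=42
)

print(len(X_tr), len(X_te))
print(Counter(y_te))
```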

In [20]:
%%time

print("Training logistic regression")
logreg.fit(X_train, y_train)

print("Training random forest")
forest.fit(X_train, y_train)

print("Training SVC")
svc.fit(X_train, y_train)

print("Training Naive Bayes")
bayes.fit(X_train, y_train)

Training logistic regression
Training random forest
Training SVC
Training Naive Bayes
CPU times: user 180 ms, sys: 96.6 ms, total: 277 ms
Wall time: 163 ms


### Confusion matrices


In [21]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score 
from sklearn.metrics import classification_report 

#### Logistic Regression

In [22]:
y_true1 = y_test
y_pred1 = logreg.predict(X_test)
matrix1 = confusion_matrix(y_true1, y_pred1)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix1,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,18,2
Is positive,2,18


#### Random forest

In [23]:
y_true2 = y_test
y_pred2 = forest.predict(X_test)
matrix2 = confusion_matrix(y_true2, y_pred2)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix2,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,16,4
Is positive,3,17


#### SVC

In [24]:
y_true3 = y_test
y_pred3 = svc.predict(X_test)
matrix3 = confusion_matrix(y_true3, y_pred3)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix3,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,18,2
Is positive,2,18


#### Multinomial Naive Bayes

In [25]:
y_true4 = y_test
y_pred4 = bayes.predict(X_test)
matrix4 = confusion_matrix(y_true4, y_pred4)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix4,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,19,1
Is positive,2,18


### Percentage-based confusion matrices
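
Dividing each row of a confusion matrix by its row sum turns raw counts into per-class rates: each row then shows what fraction of the tweets with that true label went to each prediction. The sketch below does this with NumPy for the random forest counts shown earlier (the names `counts` and `rates` are assumptions for this example):

```python
import numpy as np

# Random forest counts from the table above: rows are true labels, columns are predictions.
counts = np.array([[16, 4],
                   [3, 17]])

# Divide every row by its own total so each row sums to 1.
rates = counts / counts.sum(axis=1, keepdims=True)
print(rates)
```

The cells below do the same thing with `DataFrame.div(..., axis=0)`.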



#### Logistic Regression

In [26]:
y_true1 = y_test
y_pred1 = logreg.predict(X_test)
matrix1 = confusion_matrix(y_true1, y_pred1)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix1,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix1.sum(axis=1), axis=0)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,0.9,0.1
Is positive,0.1,0.9


In [27]:
print('Confusion Matrix :')
print(matrix1) 
print('Accuracy Score :',accuracy_score(y_true1, y_pred1))
print('Report : ')
print(classification_report(y_true1, y_pred1))

Confusion Matrix :
[[18  2]
 [ 2 18]]
Accuracy Score : 0.9
Report : 
              precision    recall  f1-score   support

    negative       0.90      0.90      0.90        20
    positive       0.90      0.90      0.90        20

    accuracy                           0.90        40
   macro avg       0.90      0.90      0.90        40
weighted avg       0.90      0.90      0.90        40



#### Random forest

In [28]:
y_true2 = y_test
y_pred2 = forest.predict(X_test)
matrix2 = confusion_matrix(y_true2, y_pred2)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix2,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix2.sum(axis=1), axis=0)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,0.8,0.2
Is positive,0.15,0.85


In [29]:
print('Confusion Matrix :')
print(matrix2) 
print('Accuracy Score :',accuracy_score(y_true2, y_pred2))
print('Report : ')
print(classification_report(y_true2, y_pred2))

Confusion Matrix :
[[16  4]
 [ 3 17]]
Accuracy Score : 0.825
Report : 
              precision    recall  f1-score   support

    negative       0.84      0.80      0.82        20
    positive       0.81      0.85      0.83        20

    accuracy                           0.82        40
   macro avg       0.83      0.82      0.82        40
weighted avg       0.83      0.82      0.82        40



#### SVC

In [30]:
y_true3 = y_test
y_pred3 = svc.predict(X_test)
matrix3 = confusion_matrix(y_true3, y_pred3)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix3,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix3.sum(axis=1), axis=0)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,0.9,0.1
Is positive,0.1,0.9


In [31]:
print('Confusion Matrix :')
print(matrix3) 
print('Accuracy Score :',accuracy_score(y_true3, y_pred3))
print('Report : ')
print(classification_report(y_true3, y_pred3))

Confusion Matrix :
[[18  2]
 [ 2 18]]
Accuracy Score : 0.9
Report : 
              precision    recall  f1-score   support

    negative       0.90      0.90      0.90        20
    positive       0.90      0.90      0.90        20

    accuracy                           0.90        40
   macro avg       0.90      0.90      0.90        40
weighted avg       0.90      0.90      0.90        40



#### Multinomial Naive Bayes

In [32]:
y_true4 = y_test
y_pred4 = bayes.predict(X_test)
matrix4 = confusion_matrix(y_true4, y_pred4)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix4,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix4.sum(axis=1), axis=0)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,0.95,0.05
Is positive,0.1,0.9


In [33]:
print('Confusion Matrix :')
print(matrix4) 
print('Accuracy Score :',accuracy_score(y_true4, y_pred4))
print('Report : ')
print(classification_report(y_true4, y_pred4))

Confusion Matrix :
[[19  1]
 [ 2 18]]
Accuracy Score : 0.925
Report : 
              precision    recall  f1-score   support

    negative       0.90      0.95      0.93        20
    positive       0.95      0.90      0.92        20

    accuracy                           0.93        40
   macro avg       0.93      0.93      0.92        40
weighted avg       0.93      0.93      0.92        40
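
The numbers in a classification report can be recomputed by hand from the confusion matrix. As a worked example, the sketch below takes the Multinomial Naive Bayes matrix printed above and derives accuracy, precision, recall, and F1 for each class (the variable names are assumptions for this example):

```python
# Multinomial Naive Bayes confusion matrix from above:
# rows are true labels, columns are predicted labels.
matrix = [[19, 1],
          [2, 18]]
labels = ["negative", "positive"]

total = sum(sum(row) for row in matrix)
accuracy = (matrix[0][0] + matrix[1][1]) / total
print(f"accuracy: {accuracy:.3f}")

metrics = {}
for k, label in enumerate(labels):
    true_positive = matrix[k][k]
    predicted_as_k = matrix[0][k] + matrix[1][k]  # column sum
    actually_k = sum(matrix[k])                   # row sum
    precision = true_positive / predicted_as_k
    recall = true_positive / actually_k
    f1 = 2 * precision * recall / (precision + recall)
    metrics[label] = (precision, recall, f1)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Rounded to two decimals, these values agree with the report for this model.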



## Review

- Step 1: Use a **vectorizer** to convert the tweets into numbers a computer can understand.
- Step 2: Train on the full dataset to **build** the models.
- Step 3: Split the dataset into **train** and **test** sets and retrain on the training part.
- Step 4: Test the models against the held-out labels.

## Discussion 3

* Which models performed the best? Explain the big differences between them!
* Do you think it's more important to be sensitive to negativity or positivity? Do we want more positive things incorrectly marked as negative, or more negative things marked as positive?
* They all had very different training times. Which ones offer the best combination of speed and accuracy?
* What's good accuracy? Do you think 75% is good enough?
* If there are 2 classifiers and both of them are 75% accurate, which is the best?
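
As a starting point for the last question, the sketch below compares two hypothetical classifiers whose confusion matrices are invented for illustration (`classifier_a` and `classifier_b` are not results from this notebook). Both are 75% accurate, yet they make opposite kinds of mistakes:

```python
# Hypothetical confusion matrices (rows: true negative/positive, columns: predicted).
classifier_a = [[20, 0],
                [10, 10]]   # never mislabels a negative tweet, misses half the positives
classifier_b = [[10, 10],
                [0, 20]]    # never mislabels a positive tweet, misses half the negatives

def accuracy(matrix):
    """Share of all samples on the diagonal."""
    correct = matrix[0][0] + matrix[1][1]
    return correct / sum(sum(row) for row in matrix)

def recall_per_class(matrix):
    """Share of each true class that was predicted correctly."""
    return [matrix[k][k] / sum(matrix[k]) for k in range(2)]

for name, matrix in [("A", classifier_a), ("B", classifier_b)]:
    print(name, accuracy(matrix), recall_per_class(matrix))
```

Which one is better depends on which error is more costly for the application, which accuracy alone cannot tell you.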