# Sentiment analysis in Twitter

**Reference:**
http://www.sentiment140.com/


**The algorithms:**
1.   Linear Regression
2.   Logistic Regression
3.   Random Forest Classifier
4.   Linear SVM
5.   Multinomial NB

**Dataset:** 
sentiment140.csv (Ref: sentiment140.com).

The dataset contains 499,031 tweets. A `polarity` of '0' represents negative sentiment and '1' represents positive sentiment.

Labelling process: Assume that any tweet with positive emoticons is positive, and any tweet with negative emoticons is negative.
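
The labelling rule above can be sketched in a few lines. This is only an illustration: the emoticon lists and the `label_by_emoticon` helper are assumptions made for this example, not the exact rules used to build the dataset.

```python
# Hypothetical emoticon lists; the ones used to build the dataset may differ.
POSITIVE_EMOTICONS = (":)", ":-)", ":D", "=)")
NEGATIVE_EMOTICONS = (":(", ":-(")


def label_by_emoticon(tweet):
    """Return 1 (positive), 0 (negative), or None when the tweet is ambiguous."""
    has_pos = any(e in tweet for e in POSITIVE_EMOTICONS)
    has_neg = any(e in tweet for e in NEGATIVE_EMOTICONS)
    if has_pos == has_neg:  # no emoticon at all, or both kinds at once
        return None
    return 1 if has_pos else 0


for tweet in ["Great show tonight :)", "Missed my bus :(", "On my way home"]:
    print(label_by_emoticon(tweet), tweet)
```

Labels produced this way are noisy (a sarcastic tweet with a smiley is still marked positive), which is worth keeping in mind when judging the accuracy numbers later on.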

# **Training Step**



### Read the data

In [78]:
import pandas as pd

df = pd.read_csv("/content/sentiment140.csv", nrows=30000)
df.head()

Unnamed: 0,polarity,text
0,0,@kconsidder You never tweet
1,0,Sick today coding from the couch.
2,1,"@ChargerJenn Thx for answering so quick,I was afraid I was gonna crash twitter with all the spamming I did 2 RR..sorry bout that"
3,1,Wii fit says I've lost 10 pounds since last time
4,0,@MrKinetik Not a thing!!! I don't really have a life.....


The columns:

*   `polarity` column is whether the tweet is positive (`1`) or negative (`0`).
*   `text` column is the text of the tweet.


How many rows?

In [79]:
df.shape

(30000, 2)

How many **positive** and **negative** tweets?

In [80]:
df.polarity.value_counts()

1    15064
0    14936
Name: polarity, dtype: int64

## Train the algorithms


### Vectorize the tweets

Create a `TfidfVectorizer` and use it to vectorize the tweets. 
Use `max_features` to keep only the 1000 terms that occur most often across the whole corpus.
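
To make the TF-IDF weights less of a black box, here is a hand-rolled sketch on a toy corpus. It assumes the formula `TfidfVectorizer` uses with its default settings (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row) and a simplified whitespace tokenizer; the three documents are made up for this example.

```python
import math
from collections import Counter

docs = ["good day", "bad day", "good good game"]  # toy corpus, not from the dataset
tokenized = [d.split() for d in docs]             # simplified tokenizer
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(w for doc in tokenized for w in set(doc))
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}


def tfidf_row(tokens):
    """Term count times idf for every vocabulary word, scaled to unit length."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]


rows = [tfidf_row(tokens) for tokens in tokenized]
for doc, row in zip(docs, rows):
    print(doc, [round(v, 3) for v in row])
```

In the second document, the rare word "bad" gets a larger weight than the common word "day", which is the effect TF-IDF is designed to produce.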

In [81]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [82]:
vectorizer = TfidfVectorizer(max_features=1000)
vectors = vectorizer.fit_transform(df.text)
words_df = pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names())
words_df.head()

Unnamed: 0,10,100,11,12,15,1st,20,2day,2nd,30,able,about,account,actually,add,after,afternoon,again,ago,agree,ah,ahh,ahhh,air,album,all,almost,alone,already,alright,also,although,always,am,amazing,amp,an,and,annoying,another,...,work,worked,working,works,world,worried,worry,worse,worst,worth,would,wouldn,wow,write,writing,wrong,wtf,www,xd,xoxo,xx,xxx,ya,yay,yea,yeah,year,years,yep,yes,yesterday,yet,yo,you,young,your,yourself,youtube,yum,yup
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.334095,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.22101,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.427465,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Setting the variables

There are two variables: `X` and `y`.

`X` = **features**

`y` = **labels**.

In [83]:
X = words_df
y = df.polarity

# **The algorithms:**

1.   Linear Regression
2.   Logistic Regression
3.   Random Forest Classifier
4.   Linear SVM
5.   Multinomial NB


**You can pick just ONE or ALL OF THEM.**

In [84]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

# **Training the algorithms**


**Logistic Regression**

- C: float, default = 1.0
  - Inverse of regularization strength; must be a positive float. 
  - Smaller values specify stronger regularization.

- solver = {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}, default = ‘lbfgs’
  - Algorithm to use in the optimization problem.
  - For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and ‘saga’ are faster for large ones.
  - For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle multinomial loss; ‘liblinear’ is limited to one-versus-rest schemes.
  - ‘newton-cg’, ‘lbfgs’, ‘sag’ and ‘saga’ handle L2 or no penalty
  - ‘liblinear’ and ‘saga’ also handle L1 penalty
  - ‘saga’ also supports ‘elasticnet’ penalty
  - ‘liblinear’ does not support setting penalty='none'
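
The effect of `C` can be illustrated without scikit-learn. The sketch below runs plain gradient descent on a one-feature logistic loss with an L2 penalty weighted by `1 / C`, which should match scikit-learn's objective up to a constant factor. The data, the learning rate, and the `fit_weight` helper are assumptions chosen for this example.

```python
import math

# Toy 1-D data (hypothetical): negative x -> class -1, positive x -> class +1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [-1, -1, 1, 1]


def fit_weight(C, lr=0.001, steps=5000):
    """Gradient descent on w**2 / (2 * C) + sum(log(1 + exp(-y * x * w)))."""
    w = 0.0
    for _ in range(steps):
        grad = w / C  # gradient of the L2 penalty
        for x, y in zip(xs, ys):
            margin = y * x * w
            grad -= y * x / (1.0 + math.exp(margin))  # gradient of the log loss
        w -= lr * grad
    return w


w_strong = fit_weight(C=0.01)   # small C: strong regularization
w_weak = fit_weight(C=100.0)    # large C: weak regularization
print(f"C=0.01 -> w={w_strong:.3f}, C=100 -> w={w_weak:.3f}")
```

A small `C` keeps the weight close to zero, while a large `C` lets it grow. A very large value such as `C=1e9` effectively switches regularization off.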


In [85]:
%%time
# Create and train a logistic regression
logreg = LogisticRegression(C=1e9, solver='lbfgs', max_iter=1000)
logreg.fit(X, y)

CPU times: user 14.9 s, sys: 865 ms, total: 15.8 s
Wall time: 8.07 s


**Random Forest Classifier**

- n_estimators: int, default = 100
  - The number of trees in the forest.

In [86]:
%%time
# Create and train a random forest classifier
forest = RandomForestClassifier(n_estimators=50)
forest.fit(X, y)

CPU times: user 29.3 s, sys: 34.8 ms, total: 29.4 s
Wall time: 29.4 s


**Linear SVC**

LinearSVC is another (faster) implementation of **Support Vector Classification** for the case of a **linear kernel**. 

Note that **LinearSVC** does not accept a `kernel` parameter, because the kernel is assumed to be linear.

In [87]:
%%time
# Create and train a linear support vector classifier (LinearSVC)
svc = LinearSVC()
svc.fit(X, y)

CPU times: user 378 ms, sys: 5.99 ms, total: 384 ms
Wall time: 386 ms


**Multinomial NB**

MultinomialNB implements the **naive Bayes algorithm** for multinomially distributed data, and is one of the two classic naive Bayes variants used in text classification (*where the data are typically represented as word vector counts, although tf-idf vectors are also known to work well in practice*).
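
As a rough illustration of the idea, the sketch below implements a tiny multinomial naive Bayes on word counts with add-one (Laplace) smoothing. It is a teaching toy under simplified assumptions (whitespace tokenizer, made-up training sentences, hypothetical `fit_nb` and `predict_nb` helpers), not scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict

# Made-up training sentences: 1 = positive, 0 = negative.
train = [
    ("love this game", 1),
    ("love the food", 1),
    ("hate this game", 0),
    ("hate the wifi", 0),
]


def fit_nb(samples, alpha=1.0):
    """Count documents per class and words per class."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        class_docs[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_docs, word_counts, vocab, alpha


def predict_nb(model, text):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    class_docs, word_counts, vocab, alpha = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        total_words = sum(word_counts[label].values())
        score = math.log(class_docs[label] / total_docs)
        for word in text.split():
            if word in vocab:  # words never seen in training are ignored
                smoothed = word_counts[label][word] + alpha
                score += math.log(smoothed / (total_words + alpha * len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


model = fit_nb(train)
print(predict_nb(model, "love the wifi"))
print(predict_nb(model, "hate the food"))
```

Smoothing matters here: without `alpha`, a word that never appeared in one class would give that class a probability of zero no matter what the other words say.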

In [88]:
%%time
# Create and train a multinomial naive bayes classifier (MultinomialNB)
bayes = MultinomialNB()
bayes.fit(X, y)

CPU times: user 218 ms, sys: 4.93 ms, total: 223 ms
Wall time: 150 ms


# Discussion 1

How much faster was each algorithm compared to the others?
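
One way to start the comparison is to divide each wall time by the fastest one. The numbers below are copied from the `%%time` outputs above; they come from a single run, so treat the ratios as rough.

```python
# Wall times in seconds, copied from the %%time outputs above.
wall_seconds = {
    "LogisticRegression": 8.07,
    "RandomForestClassifier": 29.4,
    "LinearSVC": 0.386,
    "MultinomialNB": 0.150,
}

fastest = min(wall_seconds, key=wall_seconds.get)
for name, seconds in sorted(wall_seconds.items(), key=lambda item: item[1]):
    ratio = seconds / wall_seconds[fastest]
    print(f"{name:<24} {seconds:>6.2f} s  ({ratio:.1f}x the fastest)")
```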

## Use the models

We will use the trained models to predict whether a tweet is positive or negative.

### Testing Data

**You can add the testing data below.** 

In [98]:
# Create some test data

pd.set_option("display.max_colwidth", 200)

datatest = pd.DataFrame({'content': [
    "I love the food",
    "I hate hate hate hate this game",
    "I'm not sure the taste of eggs",
    "Did you see the news yesterday?",
    "He is upset that he can't update his apps",
    "The package was delivered late and the contents were broken",
    "The wifi getting worst",
    "I watch the video from my phone",
    "I'm fine with this food",
    "not good",
]})
datatest

Unnamed: 0,content
0,I love the food
1,I hate hate hate hate this game
2,I'm not sure the taste of eggs
3,Did you see the news yesterday?
4,He is upset that he can't update his apps
5,The package was delivered late and the contents were broken
6,The wifi getting worst
7,I watch the video from my phone
8,I'm fine with this food
9,not good




First we need to **vectorize** the sentences into numbers, so the algorithms can understand them.

Our algorithm only knows **certain words.** 
Run `vectorizer.get_feature_names()` to show you the list of the words it knows.

In [99]:
print(vectorizer.get_feature_names())

['10', '100', '11', '12', '15', '1st', '20', '2day', '2nd', '30', 'able', 'about', 'account', 'actually', 'add', 'after', 'afternoon', 'again', 'ago', 'agree', 'ah', 'ahh', 'ahhh', 'air', 'album', 'all', 'almost', 'alone', 'already', 'alright', 'also', 'although', 'always', 'am', 'amazing', 'amp', 'an', 'and', 'annoying', 'another', 'any', 'anymore', 'anyone', 'anything', 'anyway', 'app', 'apparently', 'apple', 'appreciate', 'are', 'around', 'art', 'as', 'ask', 'asleep', 'ass', 'at', 'ate', 'aw', 'awake', 'awards', 'away', 'awesome', 'aww', 'awww', 'baby', 'back', 'bad', 'band', 'bbq', 'bday', 'be', 'beach', 'beautiful', 'because', 'bed', 'been', 'beer', 'before', 'behind', 'being', 'believe', 'best', 'bet', 'better', 'big', 'bike', 'birthday', 'bit', 'bitch', 'black', 'blip', 'blog', 'blue', 'body', 'boo', 'book', 'books', 'bored', 'boring', 'both', 'bought', 'bout', 'box', 'boy', 'boys', 'break', 'breakfast', 'bring', 'bro', 'broke', 'broken', 'brother', 'brothers', 'btw', 'bus', 'bu

**Because we already have the list of words we know, we only want to count them.** So instead of `.fit_transform`, we just use `.transform`:

```python
datatest_vectors = vectorizer.transform(datatest.content)
datatest_words_df = ......
```
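
The difference between fitting and transforming can be shown with a toy vectorizer. `TinyCountVectorizer` below is a hypothetical stand-in written for this example (it counts words instead of computing TF-IDF), not scikit-learn code. The point is that `fit` learns the vocabulary once, and `transform` reuses it and silently drops words it has never seen.

```python
class TinyCountVectorizer:
    """Toy stand-in for illustration only; not how scikit-learn is implemented."""

    def fit(self, texts):
        words = {w for text in texts for w in text.lower().split()}
        # Fix one column index per known word.
        self.vocabulary_ = {w: i for i, w in enumerate(sorted(words))}
        return self

    def transform(self, texts):
        rows = []
        for text in texts:
            row = [0] * len(self.vocabulary_)
            for w in text.lower().split():
                idx = self.vocabulary_.get(w)
                if idx is not None:  # words never seen during fit are dropped
                    row[idx] += 1
            rows.append(row)
        return rows


toy = TinyCountVectorizer().fit(["i love the food", "i hate this game"])
new_rows = toy.transform(["i love eggs"])
print(sorted(toy.vocabulary_))
print(new_rows[0])
```

Because the columns are fixed at fit time, new sentences always produce rows of the same width as the training data, which is what the trained models expect.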


In [100]:
# Put it through the vectoriser

# transform, not fit_transform, because we already learned all our words
datatest_vectors = vectorizer.transform(datatest.content)
datatest_words_df = pd.DataFrame(datatest_vectors.toarray(), columns=vectorizer.get_feature_names())
datatest_words_df.head()

Unnamed: 0,10,100,11,12,15,1st,20,2day,2nd,30,able,about,account,actually,add,after,afternoon,again,ago,agree,ah,ahh,ahhh,air,album,all,almost,alone,already,alright,also,although,always,am,amazing,amp,an,and,annoying,another,...,work,worked,working,works,world,worried,worry,worse,worst,worth,would,wouldn,wow,write,writing,wrong,wtf,www,xd,xoxo,xx,xxx,ya,yay,yea,yeah,year,years,yep,yes,yesterday,yet,yo,you,young,your,yourself,youtube,yum,yup
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.521289,0.0,0.0,0.237644,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Confirm `datatest_words_df`.

In [101]:
datatest_words_df.shape

(10, 1000)

# **Predicting with the models**


We use `.predict` to predict class labels for each sentence and it will give a `0` (*negative*) or a `1` (*positive*) class:

```python
datatest['pred_logreg'] = logreg.predict(datatest_words_df)
```

We use `.predict_proba` to estimate the probability of the prediction.

The returned estimates cover all classes, ordered by class label (`0`, then `1`), so `[:,1]` selects the probability of the positive class:


```python
datatest['pred_logreg_proba'] = logreg.predict_proba(datatest_words_df)[:,1]
```



In [102]:
# Predict using all our models. 

# Logistic Regression predictions + probabilities
datatest['pred_logreg'] = logreg.predict(datatest_words_df)
datatest['pred_logreg_proba'] = logreg.predict_proba(datatest_words_df)[:,1]

# Random forest predictions + probabilities
datatest['pred_forest'] = forest.predict(datatest_words_df)
datatest['pred_forest_proba'] = forest.predict_proba(datatest_words_df)[:,1]

# SVC predictions
datatest['pred_svc'] = svc.predict(datatest_words_df)

# Bayes predictions + probabilities
datatest['pred_bayes'] = bayes.predict(datatest_words_df)
datatest['pred_bayes_proba'] = bayes.predict_proba(datatest_words_df)[:,1]

In [103]:
datatest

Unnamed: 0,content,pred_logreg,pred_logreg_proba,pred_forest,pred_forest_proba,pred_svc,pred_bayes,pred_bayes_proba
0,I love the food,1,0.868454,1,0.98,1,1,0.652036
1,I hate hate hate hate this game,0,0.01187,0,0.0,0,0,0.131163
2,I'm not sure the taste of eggs,0,0.368384,0,0.24,0,1,0.520409
3,Did you see the news yesterday?,1,0.5287,1,0.56,1,0,0.465336
4,He is upset that he can't update his apps,0,0.154716,0,0.34,0,0,0.266105
5,The package was delivered late and the contents were broken,0,0.058225,0,0.44,0,0,0.219788
6,The wifi getting worst,0,0.021557,0,0.14,0,0,0.178278
7,I watch the video from my phone,1,0.519172,0,0.36,1,0,0.475156
8,I'm fine with this food,1,0.709365,1,0.826667,1,1,0.620767
9,not good,0,0.478604,0,0.175,0,1,0.532576
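
The class label and the probability are two views of the same prediction. As a small check, the sketch below takes the `pred_logreg_proba` values from the table above and applies a cut-off of 0.5, which reproduces the `pred_logreg` column. The `labels_at` helper is a name made up for this example.

```python
# Values copied from the pred_logreg_proba and pred_logreg columns above.
logreg_proba = [0.868454, 0.01187, 0.368384, 0.5287, 0.154716,
                0.058225, 0.021557, 0.519172, 0.709365, 0.478604]
logreg_pred = [1, 0, 0, 1, 0, 0, 0, 1, 1, 0]


def labels_at(threshold, probabilities):
    """Turn positive-class probabilities into 0/1 labels."""
    return [1 if p > threshold else 0 for p in probabilities]


print(labels_at(0.5, logreg_proba) == logreg_pred)
print(labels_at(0.7, logreg_proba))  # a stricter cut-off keeps fewer positives
```

Raising the threshold trades positives for caution: sentences with probabilities near 0.5 (such as rows 3 and 7) flip to negative first.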


# Discussion 2


1.   What do the numbers mean? What's the difference between 0, 1, and 0.5?
2.   Were there any sentences that the classifiers disagreed about? Give your analysis!
3.   What's the difference between using a binary label (0 or 1) in sentiment analysis and using a score in the range 0 to 1? When might you use one rather than the other?
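
For question 2, a quick way to find the disagreements is to compare the four prediction columns row by row. The lists below are copied from the results table above.

```python
sentences = [
    "I love the food",
    "I hate hate hate hate this game",
    "I'm not sure the taste of eggs",
    "Did you see the news yesterday?",
    "He is upset that he can't update his apps",
    "The package was delivered late and the contents were broken",
    "The wifi getting worst",
    "I watch the video from my phone",
    "I'm fine with this food",
    "not good",
]

# Predicted labels copied from the results table above.
predictions = {
    "logreg": [1, 0, 0, 1, 0, 0, 0, 1, 1, 0],
    "forest": [1, 0, 0, 1, 0, 0, 0, 0, 1, 0],
    "svc":    [1, 0, 0, 1, 0, 0, 0, 1, 1, 0],
    "bayes":  [1, 0, 1, 0, 0, 0, 0, 0, 1, 1],
}

disagreements = [
    i for i in range(len(sentences))
    if len({labels[i] for labels in predictions.values()}) > 1
]
for i in disagreements:
    votes = {name: labels[i] for name, labels in predictions.items()}
    print(i, sentences[i], votes)
```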


# **Testing our models**

Which model performs best?

In [104]:
df.head()

Unnamed: 0,polarity,text
0,0,@kconsidder You never tweet
1,0,Sick today coding from the couch.
2,1,"@ChargerJenn Thx for answering so quick,I was afraid I was gonna crash twitter with all the spamming I did 2 RR..sorry bout that"
3,1,Wii fit says I've lost 10 pounds since last time
4,0,@MrKinetik Not a thing!!! I don't really have a life.....


Our dataframe is a list of many tweets. We turned this into `X` (the vectorized words) and `y` (whether each tweet is negative or positive).

Previously, we used `.fit(X, y)` to train on all of our data. 

Instead, **we can test our models** by doing a train/test split: train on one part of the data, then check whether the predictions on the held-out part match the actual labels.

In [105]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
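
Since no `test_size` is passed, the split relies on scikit-learn's default of holding out 25% of the rows. The arithmetic below is a sketch of what that means for this notebook's 30,000 rows; the rounding rule (ceiling for the test set) is stated here as an assumption about the library's behaviour, and it makes no difference for these numbers.

```python
import math

n_rows = 30000          # rows loaded with nrows=30000
test_fraction = 0.25    # default used when test_size and train_size are both unset

n_test = math.ceil(n_rows * test_fraction)
n_train = n_rows - n_test
print(n_train, n_test)
```

The split is random, so the confusion matrices change from run to run unless a fixed `random_state` is passed to `train_test_split`.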

In [106]:
%%time

print("Training logistic regression")
logreg.fit(X_train, y_train)

print("Training random forest")
forest.fit(X_train, y_train)

print("Training SVC")
svc.fit(X_train, y_train)

print("Training Naive Bayes")
bayes.fit(X_train, y_train)

Training logistic regression
Training random forest
Training SVC
Training Naive Bayes
CPU times: user 40.3 s, sys: 853 ms, total: 41.2 s
Wall time: 34.8 s


### Confusion matrices


In [107]:
from sklearn.metrics import confusion_matrix

#### Logistic Regression

In [108]:
y_true = y_test
y_pred = logreg.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,2751,993
Is positive,875,2881


#### Random forest

In [109]:
y_true = y_test
y_pred = forest.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,2776,968
Is positive,1058,2698


#### SVC

In [110]:
y_true = y_test
y_pred = svc.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,2745,999
Is positive,857,2899


#### Multinomial Naive Bayes

In [111]:
y_true = y_test
y_pred = bayes.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,2818,926
Is positive,989,2767
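
The four matrices are easier to compare once they are reduced to a few summary numbers. The sketch below copies the counts from the tables above and computes accuracy plus the recall of each class; the helper names are made up for this example.

```python
# Rows are the true class (negative, positive); columns are the predicted class.
matrices = {
    "logreg": [[2751, 993], [875, 2881]],
    "forest": [[2776, 968], [1058, 2698]],
    "svc":    [[2745, 999], [857, 2899]],
    "bayes":  [[2818, 926], [989, 2767]],
}


def accuracy(matrix):
    (tn, fp), (fn, tp) = matrix
    return (tn + tp) / (tn + fp + fn + tp)


def recalls(matrix):
    """Share of true negatives and true positives that were found."""
    (tn, fp), (fn, tp) = matrix
    return tn / (tn + fp), tp / (tp + fn)


for name, matrix in matrices.items():
    neg_recall, pos_recall = recalls(matrix)
    print(f"{name:<7} accuracy={accuracy(matrix):.4f} "
          f"negative recall={neg_recall:.4f} positive recall={pos_recall:.4f}")

best = max(matrices, key=lambda name: accuracy(matrices[name]))
print("highest accuracy:", best)
```

The gaps between the models are small (roughly 73% to 75% accuracy on this particular split), so a different random split could change the ranking.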


### Percentage-based confusion matrices



#### Logistic Regression

In [112]:
y_true = y_test
y_pred = logreg.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,0.734776,0.265224
Is positive,0.232961,0.767039
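
The `.div(matrix.sum(axis=1), axis=0)` call divides every row by its own total, so each row sums to 1 and reads as "of the tweets that really are negative (or positive), what share was predicted as each class". The same arithmetic in plain Python, using the logistic regression counts from the earlier table:

```python
# Logistic regression counts: rows are the true class, columns the predicted class.
matrix = [[2751, 993], [875, 2881]]

# Divide each cell by the total of its row.
normalized = [[cell / sum(row) for cell in row] for row in matrix]
for row in normalized:
    print([round(value, 6) for value in row])
```

Recent scikit-learn versions can also do this directly with `confusion_matrix(y_true, y_pred, normalize='true')`.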


#### Random forest

In [114]:
y_true = y_test
y_pred = forest.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,0.741453,0.258547
Is positive,0.281683,0.718317


#### SVC

In [115]:
y_true = y_test
y_pred = svc.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,0.733173,0.266827
Is positive,0.228168,0.771832


#### Multinomial Naive Bayes

In [116]:
y_true = y_test
y_pred = bayes.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,0.752671,0.247329
Is positive,0.263312,0.736688


## Review

- Step 1: Use a **vectorizer** to convert the tweets into numbers a computer can understand.
- Step 2: Train the algorithms to **build** the models.
- Step 3: Split the dataset into **train** and **test** sets.
- Step 4: Test the models on the held-out data.

# Discussion 3

* Which models performed the best? Explain the biggest differences between them!
* Do you think it's more important to be sensitive to negativity or positivity? Do we want more positive things incorrectly marked as negative, or more negative things incorrectly marked as positive?
* They all had very different training times. Which ones offer the best combination of speed and accuracy?
* What's good accuracy? Do you think 75% is good enough?
* If there are 2 classifiers and both of them are 75% accurate, which one is better?
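
For the last question, it helps to see that two classifiers can share the same accuracy while making very different mistakes. The two confusion matrices below are invented for illustration and do not come from the models in this notebook.

```python
# Hypothetical confusion matrices: rows are the true class (negative, positive),
# columns are the predicted class.
classifier_a = [[90, 10], [40, 60]]
classifier_b = [[60, 40], [10, 90]]


def summarize(matrix):
    (tn, fp), (fn, tp) = matrix
    return {
        "accuracy": (tn + tp) / (tn + fp + fn + tp),
        "negative_recall": tn / (tn + fp),
        "positive_recall": tp / (tp + fn),
    }


summary_a = summarize(classifier_a)
summary_b = summarize(classifier_b)
print("A:", summary_a)
print("B:", summary_b)
```

Both reach 75% accuracy, but A catches most negative tweets and misses many positive ones, while B does the opposite. Which one is "better" depends on which kind of error costs more in your application.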