# Tweet sentiment analysis with BERT

1. Build sentence embeddings for each tweet with pre-trained BERT.
2. Train an SVM model to classify tweets by sentiment (negative, neutral, or positive).

This notebook follows this introduction to using BERT:
`https://github.com/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb`

The data is a CSV file with emoticons removed. Each row has 6 fields:

0. the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
1. the id of the tweet (2087)
2. the date of the tweet (Sat May 16 23:58:44 UTC 2009)
3. the query (lyx). If there is no query, then this value is NO_QUERY.
4. the user that tweeted (robotickilldozr)
5. the text of the tweet (Lyx is cool)
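
The file has no header row, so it can help to give the six fields names while reading. The sketch below parses one illustrative row with Python's standard `csv` module. The column names in `COLUMNS` and the `POLARITY_LABELS` mapping are hypothetical names chosen for this example, and the polarity value of the sample row is an assumption; only the other five values come from the field list above.

```python
import csv
import io

# Hypothetical column names for the six header-less fields described above.
COLUMNS = ["polarity", "tweet_id", "date", "query", "user", "text"]

# Polarity codes as documented: 0 = negative, 2 = neutral, 4 = positive.
POLARITY_LABELS = {0: "negative", 2: "neutral", 4: "positive"}

# One illustrative row built from the example values in the field list.
# The polarity (4) is an assumption made for this sketch.
sample = '"4","2087","Sat May 16 23:58:44 UTC 2009","lyx","robotickilldozr","Lyx is cool"\n'


def parse_rows(handle):
    """Yield one dict per CSV row, keyed by COLUMNS, with polarity as int."""
    for fields in csv.reader(handle):
        row = dict(zip(COLUMNS, fields))
        row["polarity"] = int(row["polarity"])
        yield row


rows = list(parse_rows(io.StringIO(sample)))
print(rows[0]["text"], "->", POLARITY_LABELS[rows[0]["polarity"]])
```

With pandas, the same list could be passed as `names=COLUMNS` to `pd.read_csv`. The cells in this notebook keep the default integer column labels instead, which is why the text column is addressed as `df[5]`.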

In [41]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re

In [32]:
df = pd.read_csv('training.1600000.processed.noemoticon.csv', header=None, encoding = "ISO-8859-1")

In [7]:
df.head()

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


In [8]:
df.shape

(1600000, 6)

# Preprocessing

- Remove user tags (preceded by '@').
- Remove links.
- Remove the remaining non-alphanumeric characters and convert the text to lowercase.

In [33]:
# Raw string, so the regex escapes (\(, \w, \S, ...) reach the re module unchanged.
pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+|(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)')
s = df.iloc[0,5]
pattern.sub('', s)

'   Awww thats a bummer  You shoulda got David Carr of Third Day to do it D'
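
The combined pattern is hard to read, so here is a sketch that splits the same cleaning into one small regex per step. It is not a drop-in equivalent. As deliberate deviations, it allows underscores in user names (the original `@[A-Za-z0-9]+` stops at the first underscore, so the rest of such a name would likely survive) and it collapses the whitespace left behind. The name `clean_tweet` and the sample tweet are hypothetical.

```python
import re

# Each cleaning step as its own pattern, for readability.
URL = re.compile(r"https?://\S+")            # links
MENTION = re.compile(r"@[A-Za-z0-9_]+")      # user tags such as @some_user
NON_ALNUM = re.compile(r"[^0-9A-Za-z \t]")   # punctuation and other symbols
SPACES = re.compile(r"\s+")                  # runs of whitespace left behind


def clean_tweet(text):
    """Strip links, user tags and punctuation, then trim and lowercase."""
    text = URL.sub("", text)
    text = MENTION.sub("", text)
    text = NON_ALNUM.sub("", text)
    return SPACES.sub(" ", text).strip().lower()


# Hypothetical tweet, not taken from the dataset.
cleaned = clean_tweet("@some_user see https://example.com/a?b=1 - It's GREAT, really!")
print(cleaned)
```

Applying the steps in this order matters: links and user tags have to be removed before the punctuation pass, otherwise `://` and `@` are stripped first and the remaining fragments can no longer be recognised.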

In [34]:
df[5] = df[5].apply(lambda x: pattern.sub('', x).lower())

In [35]:
df.head()

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | awww thats a bummer you shoulda got david ... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he cant update his facebook by t... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | i dived many times for the ball managed to sa... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | no its not behaving at all im mad why am i he... |


- Take a sample to speed things up.

In [36]:
df = df.sample(frac=0.05, replace=False, random_state=1)

In [48]:
df.shape

(80000, 6)

- Take the first 10,000 tweets as our first batch.

In [49]:
first_batch = df.iloc[0:10000,:]
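
Only the first batch is used in the rest of this notebook. If the remaining rows were processed the same way, the slice boundaries could be generated as in the sketch below. `batch_bounds` is a hypothetical helper, not something the notebook defines.

```python
def batch_bounds(n_rows, batch_size):
    """Yield (start, stop) index pairs that cover n_rows in order."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


# 80,000 sampled rows in batches of 10,000, as in the cells above.
bounds = list(batch_bounds(80_000, 10_000))
print(len(bounds), bounds[0], bounds[-1])
```

Each pair could then be used as `df.iloc[start:stop, :]`, so the first pair reproduces `first_batch`.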

# 1. BERT

In [64]:
import torch
import transformers as ppb # pytorch transformers
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix 
from sklearn.model_selection import GridSearchCV

- Downloading pretrained model weights and tokenizer.

In [20]:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

In [22]:
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

- Applying tokenizer.

In [50]:
tokenized = first_batch[5].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

- Padding.

In [51]:
max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])

- Masking.

In [52]:
attention_mask = np.where(padded != 0, 1, 0)

In [53]:
attention_mask.shape

(10000, 60)
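
To make the padding and masking steps concrete, here is a small sketch on two toy sequences, using plain lists so the logic is visible without numpy. The helper name `pad_and_mask` is hypothetical. The ids 101 and 102 are `[CLS]` and `[SEP]` and 0 is `[PAD]` in the `distilbert-base-uncased` vocabulary; the ids in between are arbitrary placeholders.

```python
PAD_ID = 0  # id of [PAD] in the distilbert-base-uncased vocabulary


def pad_and_mask(sequences, pad_id=PAD_ID):
    """Right-pad token-id lists to a common length and build the 0/1 mask."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


# Toy ids: 101 = [CLS], 102 = [SEP]; the ids in between are placeholders.
toy = [[101, 7592, 102], [101, 2023, 2003, 2307, 102]]
padded, mask = pad_and_mask(toy)
print(padded)
print(mask)
```

The sketch derives the mask from the original lengths, whereas the cell above derives it from `padded != 0`. Both give the same result as long as no real token has id 0.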

- Make embeddings.

In [54]:
input_ids = torch.tensor(padded)  
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

In [55]:
features = last_hidden_states[0][:,0,:].numpy()
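
`last_hidden_states[0]` has shape (batch, tokens, hidden size), and `[:,0,:]` keeps only token position 0 of every sequence, which is where the tokenizer placed `[CLS]`. That single vector per tweet is what serves as the sentence embedding. A toy sketch of the same selection follows, with nested lists standing in for the tensor. The values are made up, the hidden size is 3 rather than DistilBERT's 768, and `first_token_vectors` is a hypothetical helper.

```python
def first_token_vectors(hidden):
    """Return the vector at token position 0 for every sequence in the batch."""
    return [sequence[0] for sequence in hidden]


# Stand-in for a (batch=2, tokens=3, hidden=3) tensor; values are made up.
hidden = [
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]],
    [[1.1, 1.2, 1.3], [1.4, 1.5, 1.6], [1.7, 1.8, 1.9]],
]
toy_features = first_token_vectors(hidden)
print(toy_features)
```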

# 2. Classifier

- Splitting dataset.

In [56]:
labels = first_batch[0]

In [57]:
X_train, X_test, y_train, y_test = train_test_split(features, labels)

- Grid search for best parameters.

In [67]:
#param_grid = {'C': [0.1,1, 10, 100], 'gamma': [1,0.1,0.01,0.001],'kernel': ['rbf', 'poly', 'sigmoid']}
param_grid = {'C': [0.1,1, 10, 100]}

grid = GridSearchCV(SVC(),param_grid,refit=True,verbose=2)
grid.fit(X_train, y_train)
print(grid.best_estimator_)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Fitting 3 folds for each of 4 candidates, totalling 12 fits
[CV] C=0.1 ...........................................................
[CV] ............................................ C=0.1, total=  40.1s
[CV] C=0.1 ...........................................................
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   40.1s remaining:    0.0s
[CV] ............................................ C=0.1, total=  38.4s
[CV] C=0.1 ...........................................................
[CV] ............................................ C=0.1, total=  35.1s
[CV] C=1 .............................................................
[CV] .............................................. C=1, total=  29.0s
[CV] C=1 .............................................................
[CV] .............................................. C=1, total=  29.3s
[CV] C=1 .............................................................
[CV] .............................................. C=1, total=  29.4s
[CV] C=10 ............................................................
[CV] ............................................. C=10, total=  22.9s
[CV] C=10 ............................................................
[CV] ............................................. C=10, total=  23.4s
[CV] C=10 ............................................................
[CV] ............................................. C=10, total=  23.4s
[CV] C=100 ...........................................................
[CV] ............................................ C=100, total=  21.0s
[CV] C=100 ...........................................................
[CV] ............................................ C=100, total=  21.6s
[CV] C=100 ...........................................................
[CV] ............................................ C=100, total=  21.8s
[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed:  5.6min finished


SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)
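
The commented-out grid was presumably dropped because of its cost. The sketch below counts how many fits each grid requires with the 3 folds reported in the log; `count_fits` is a hypothetical helper that only does the arithmetic and does not call scikit-learn.

```python
from itertools import product


def count_fits(param_grid, n_folds):
    """Number of model fits a full grid search performs (excluding the refit)."""
    n_candidates = len(list(product(*param_grid.values())))
    return n_candidates * n_folds


small_grid = {"C": [0.1, 1, 10, 100]}
full_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [1, 0.1, 0.01, 0.001],
    "kernel": ["rbf", "poly", "sigmoid"],
}

print(count_fits(small_grid, n_folds=3))  # 4 candidates x 3 folds
print(count_fits(full_grid, n_folds=3))   # 48 candidates x 3 folds
```

The small grid gives the 12 fits shown in the log, while the full grid would need 144. At the 20 to 40 s per fit seen above, and assuming the other kernels cost about the same, that would be on the order of an hour.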


In [68]:
clf = SVC(C=100)
clf.fit(X_train, y_train)



SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [61]:
clf.score(X_test, y_test)

0.7392

In [69]:
y_pred = clf.predict(X_test)

In [70]:
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[946 308]
 [288 958]]
              precision    recall  f1-score   support

           0       0.77      0.75      0.76      1254
           4       0.76      0.77      0.76      1246

    accuracy                           0.76      2500
   macro avg       0.76      0.76      0.76      2500
weighted avg       0.76      0.76      0.76      2500
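
The figures in the report can be recomputed by hand from the confusion matrix, which is a useful check on how to read it. In scikit-learn's convention, rows are true classes and columns are predicted classes. `per_class_metrics` is a hypothetical helper written for this sketch.

```python
def per_class_metrics(matrix, labels):
    """Compute (precision, recall, F1) per class from a confusion matrix.

    matrix[i][j] counts samples whose true class is labels[i] and whose
    predicted class is labels[j].
    """
    metrics = {}
    for k, label in enumerate(labels):
        true_pos = matrix[k][k]
        predicted = sum(row[k] for row in matrix)  # column sum
        actual = sum(matrix[k])                    # row sum
        precision = true_pos / predicted
        recall = true_pos / actual
        f1 = 2 * precision * recall / (precision + recall)
        metrics[label] = (precision, recall, f1)
    return metrics


matrix = [[946, 308], [288, 958]]
total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[k][k] for k in range(len(matrix))) / total

metrics = per_class_metrics(matrix, labels=[0, 4])
for label, (precision, recall, f1) in metrics.items():
    print(label, round(precision, 2), round(recall, 2), round(f1, 2))
print("accuracy", round(accuracy, 2))
```

For class 0, for example, precision is 946 / (946 + 288) and recall is 946 / (946 + 308), which round to the 0.77 and 0.75 in the report.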

