# Math 456 Project: Sentiment Analysis of Covid-19

## Data

The training data contains 5000 labeled tweets, while 
the released validation data contains 2500 unlabeled tweets. 

The training data has 3 columns: tweet ID (`ID`), tweet text (`Tweet`), and labels (`Labels`).

Each label is encoded as an integer, in the following order:

- Optimistic (0), 
- Thankful (1), 
- Empathetic (2),
- Pessimistic (3), 
- Anxious (4), 
- Sad (5), 
- Annoyed (6), 
- Denial (7), 
- Surprise (8), 
- Official report (9),
- Joking (10). 

For example, if the labels are 3 and 6, 
the tweet is labeled as both Pessimistic and Annoyed.
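
The encoding above can be sketched as a small lookup. This is a minimal illustration, assuming the labels of one tweet are stored as a space-separated string such as `"3 6"` (the format the parsing code in this notebook splits on); `LABEL_NAMES` and `decode_labels` are hypothetical names introduced here.

```python
# Label names in index order, copied from the list above.
LABEL_NAMES = [
    "Optimistic", "Thankful", "Empathetic", "Pessimistic", "Anxious", "Sad",
    "Annoyed", "Denial", "Surprise", "Official report", "Joking",
]

def decode_labels(label_string):
    """Turn a space-separated label string like "3 6" into label names."""
    return [LABEL_NAMES[int(token)] for token in label_string.split()]

print(decode_labels("3 6"))  # ['Pessimistic', 'Annoyed']
```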

## Goal

Build a mathematical model for sentiment analysis of tweets. 
You may want to test your sentiment predictions on the validation dataset. 
However, note that the validation dataset does not contain labels. 
We recommend holding out a few lines (e.g. 50 lines) of the training set as test data. 
You may also first assign labels subjectively to tweets in the validation dataset 
and then compare them with the labels predicted by your model.

In [30]:
# extracting the data
import csv
import numpy as np

data = {}
with open('training.csv', 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        # row contains fields: ID, Tweet, Labels
        data[row['ID']] = [row['Tweet'], row['Labels']]

tweets, labels = np.transpose([data[k] for k in data])

In [31]:
# We probably want to examine the data
# eg. find the distributions of the labels
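
One possible way to examine the label distribution is to count how often each label index occurs. The sketch below uses a few made-up label strings in place of the real `labels` array, so the counts shown are illustrative only; `label_counts` is a hypothetical helper.

```python
from collections import Counter

def label_counts(label_strings, num_labels=11):
    """Count how many tweets carry each label index (0..num_labels-1)."""
    counts = Counter()
    for label_string in label_strings:
        # each entry is a space-separated list of label indices, e.g. "3 6"
        counts.update(int(token) for token in label_string.split())
    return [counts[i] for i in range(num_labels)]

# Made-up examples, not taken from training.csv.
example_labels = ["3 6", "0", "4 5 6", "10"]
print(label_counts(example_labels))
# [1, 0, 0, 1, 1, 1, 2, 0, 0, 0, 1]
```

With the notebook's data loaded, `label_counts(labels)` should give the per-label totals, assuming every entry of `labels` follows the same space-separated format.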

In [32]:
# Text cleaning definition
import re
from nltk.corpus import stopwords

REPLACE_BY_SPACE_RE = re.compile(r'[/(){ }\[\]\|@,;]')  # raw string avoids invalid-escape warnings
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    text = text.lower() 
    text = REPLACE_BY_SPACE_RE.sub(' ', text)
    text = BAD_SYMBOLS_RE.sub('', text)
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)
    return text

In [33]:
# create x, y split
x = tweets
y = []
for label in labels:
    label_vector = [1 if (str(i) in label.split(" ")) else 0 for i in range(11)]
    y.append(label_vector)

# train-test split
cutoff = 50
x_train = x[cutoff:]
y_train = y[cutoff:]
x_test = x[:cutoff]
y_test = y[:cutoff]
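
The split above takes the first 50 rows as the test set. If the file happens to be ordered in some way, a shuffled split may be more representative. A minimal sketch, assuming a fixed seed for reproducibility; `shuffled_split` is a hypothetical helper, not part of the original notebook, and the toy data stands in for the tweets and label vectors.

```python
import random

def shuffled_split(x, y, test_size=50, seed=0):
    """Return (x_train, y_train, x_test, y_test) after shuffling row indices."""
    assert len(x) == len(y)
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    test_idx, train_idx = indices[:test_size], indices[test_size:]
    x_train = [x[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    x_test = [x[i] for i in test_idx]
    y_test = [y[i] for i in test_idx]
    return x_train, y_train, x_test, y_test

# Toy data standing in for the tweets and label vectors.
toy_x = [f"tweet {i}" for i in range(10)]
toy_y = [[i % 2] for i in range(10)]
x_tr, y_tr, x_te, y_te = shuffled_split(toy_x, toy_y, test_size=3)
print(len(x_tr), len(x_te))  # 7 3
```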

In [34]:
# use sklearn's logistic regression (one classifier per label) on TF-IDF features
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.neighbors import KNeighborsClassifier

model = Pipeline([
        ('vect', CountVectorizer(preprocessor=clean_text)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(LogisticRegression())),
    ]
)

model.fit(x_train, y_train)

Pipeline(steps=[('vect',
                 CountVectorizer(preprocessor=<function clean_text at 0x000001855165DA60>)),
                ('tfidf', TfidfTransformer()),
                ('clf', MultiOutputClassifier(estimator=LogisticRegression()))])

In [41]:
predictions_test = model.predict(x_test)
predictions_train = model.predict(x_train)

In [42]:
# compare
def arr_equal(a, b):
    if (len(a) != len(b)):
        return False
    for x, y in zip(a, b):
        if(x != y):
            return False
    return True

def get_labels(a):
    return ','.join((str(i) for i, x in enumerate(a) if x == 1))

def compare(predictions, actuals):
    num_correct = 0
    num_total = len(predictions)
    assert len(predictions) == len(actuals)
    for pred, act in zip(predictions, actuals):
        if arr_equal(pred, act):
            num_correct += 1
        # else:
            # print(f"{get_labels(pred)}::\t::{get_labels(act)}")
            # print(get_labels(pred),"\t", get_labels(act))
            # print(pred, act)
            # pass
    return num_correct / num_total

correct_test = compare(predictions_test, y_test)
correct_train = compare(predictions_train, y_train)
print(correct_test, correct_train)
            

0.16 0.20505050505050504
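
The `compare` function counts a prediction as correct only when all 11 label entries match, which is a strict criterion for multi-label data. A complementary view is the fraction of individual label entries that are wrong (the Hamming loss). The sketch below computes both on small made-up vectors; `exact_match` and `hamming_loss` are hypothetical helpers written here for illustration.

```python
def exact_match(predictions, actuals):
    """Fraction of rows where every label entry agrees."""
    assert len(predictions) == len(actuals)
    hits = sum(1 for p, a in zip(predictions, actuals) if list(p) == list(a))
    return hits / len(predictions)

def hamming_loss(predictions, actuals):
    """Fraction of individual label entries that disagree."""
    assert len(predictions) == len(actuals)
    wrong = total = 0
    for p, a in zip(predictions, actuals):
        for p_i, a_i in zip(p, a):
            wrong += int(p_i != a_i)
            total += 1
    return wrong / total

# Made-up 3-label examples, not results from the model above.
pred = [[1, 0, 0], [0, 1, 0], [0, 0, 0], [1, 1, 0]]
true = [[1, 0, 0], [0, 1, 1], [0, 0, 1], [1, 1, 0]]
print(exact_match(pred, true))   # 0.5 (rows 0 and 3 match exactly)
print(hamming_loss(pred, true))  # 2 wrong entries out of 12
```

A model can have low exact-match accuracy while still getting most individual label entries right, so reporting both numbers may give a fuller picture.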
