# Text Classification with Deep Learning Models

The Twitter dataset (`tweets.csv`) contains tweets about US airlines that were scraped in February 2015 for sentiment analysis. Contributors were asked to first classify each tweet as positive, negative, or neutral, and then to categorize the reason behind each negative tweet (such as "late flight" or "rude service"). The dataset is available [on Kaggle](https://www.kaggle.com/crowdflower/twitter-airline-sentiment).

We want to train a supervised machine learning model that, given a new tweet, predicts its sentiment class (positive, negative, or neutral). You should choose a deep learning model and build a text classifier. In particular, you can use the `simpletransformers` library, which makes it straightforward to fine-tune a pre-trained transformer (such as BERT) on this dataset.

## Importing Modules

In [1]:
import numpy
import pandas
import sklearn.metrics
import sklearn.model_selection
import simpletransformers.classification

2021-11-18 10:24:10.815988: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-11-18 10:24:10.816019: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


## Loading the Dataset

In [2]:
df = pandas.read_csv("../../datasets/tweets.csv")
df.head()

| Unnamed: 0 | tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 570306133677760513 | neutral | 1.0 | | | Virgin America | | cairdin | | 0 | @VirginAmerica What @dhepburn said. | | 2015-02-24 11:35:52 -0800 | | Eastern Time (US & Canada) |
| 1 | 570301130888122368 | positive | 0.3486 | | 0.0 | Virgin America | | jnardino | | 0 | @VirginAmerica plus you've added commercials t... | | 2015-02-24 11:15:59 -0800 | | Pacific Time (US & Canada) |
| 2 | 570301083672813571 | neutral | 0.6837 | | | Virgin America | | yvonnalynn | | 0 | @VirginAmerica I didn't today... Must mean I n... | | 2015-02-24 11:15:48 -0800 | Lets Play | Central Time (US & Canada) |
| 3 | 570301031407624196 | negative | 1.0 | Bad Flight | 0.7033 | Virgin America | | jnardino | | 0 | @VirginAmerica it's really aggressive to blast... | | 2015-02-24 11:15:36 -0800 | | Pacific Time (US & Canada) |
| 4 | 570300817074462722 | negative | 1.0 | Can't Tell | 1.0 | Virgin America | | jnardino | | 0 | @VirginAmerica and it's a really big bad thing... | | 2015-02-24 11:14:45 -0800 | | Pacific Time (US & Canada) |


## Splitting Data into Training and Test Sets

In [3]:
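# Encode each sentiment name as an integer id, in order of first appearance.
classes = df["airline_sentiment"].unique().tolist()
label_to_id = {c: i for i, c in enumerate(classes)}
df["airline_sentiment"] = df["airline_sentiment"].map(label_to_id)
# Hold out a test set (train_test_split defaults to a 75/25 split).
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(df["text"], df["airline_sentiment"])
# Build the data frames with the "text" and "labels" columns used for training and evaluation.
train_df = pandas.DataFrame({"text": x_train, "labels": y_train})
test_df = pandas.DataFrame({"text": x_test, "labels": y_test})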

## Loading the Pre-Trained RoBERTa Model and Fine-Tuning It

In [4]:
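# Load the pre-trained roberta-base checkpoint with a classification head
# sized for the sentiment classes; use_cuda=False keeps everything on the CPU.
model = simpletransformers.classification.ClassificationModel(
    "roberta", "roberta-base", use_cuda=False, num_labels=len(classes),
    args={
        "reprocess_input_data": True,
        "overwrite_output_dir": True,
        "save_model_every_epoch": False,
        "num_train_epochs": 1,
        "early_stopping_consider_epochs": True,
        "use_early_stopping": True,
    },
)
# Fine-tune on the training split.
model.train_model(train_df)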

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'roberta.pooler.dense.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.weight', 'classifie

  0%|          | 0/10980 [00:00<?, ?it/s]

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/1373 [00:00<?, ?it/s]

(1373, 0.5154268828831825)

## Testing the Trained Model

In [5]:
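# Evaluate on the held-out tweets; extra metrics such as accuracy are passed as keyword arguments.
result, model_outputs, wrong_predictions = model.eval_model(test_df, acc=sklearn.metrics.accuracy_score)
# Rows of the confusion matrix are true labels, columns are predicted labels.
cm = sklearn.metrics.confusion_matrix(y_test, numpy.argmax(model_outputs, axis=1))
print(result)
print(classes)
print(cm)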

  0%|          | 0/3660 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/458 [00:00<?, ?it/s]

{'mcc': 0.7066757007816017, 'acc': 0.846448087431694, 'eval_loss': 0.45154602171174013}
['neutral', 'positive', 'negative']
[[ 455   76  237]
 [  52  483   59]
 [  87   51 2160]]
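
The overall accuracy hides how differently the three classes behave. As a rough sketch, the per-class precision and recall can be recomputed from the confusion matrix printed above using plain Python. The matrix values and class order are copied from this run; the variable names are chosen here for illustration, and the rows-are-true-labels layout follows scikit-learn's convention.

```python
classes = ["neutral", "positive", "negative"]
# Rows are true classes, columns are predicted classes.
cm = [
    [455, 76, 237],
    [52, 483, 59],
    [87, 51, 2160],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(classes)))
accuracy = correct / total

recall = {}
precision = {}
for i, name in enumerate(classes):
    true_count = sum(cm[i])                      # tweets whose true label is `name`
    predicted_count = sum(row[i] for row in cm)  # tweets predicted as `name`
    recall[name] = cm[i][i] / true_count
    precision[name] = cm[i][i] / predicted_count

print(f"accuracy = {accuracy:.4f}")  # accuracy = 0.8464
for name in classes:
    print(f"{name:>8}: precision = {precision[name]:.3f}, recall = {recall[name]:.3f}")
```

The recomputed accuracy matches the reported `acc`. Recall is lowest for the neutral class (about 0.59), mostly because 237 neutral tweets were predicted as negative, while the negative class, which dominates the test set, reaches a recall of about 0.94.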
