# Text Classification with Deep Learning Models

The Twitter dataset (`tweets.csv`) was collected in February 2015. Contributors were asked to first classify each tweet as positive, negative, or neutral, and then to categorize the reason behind each negative tweet (such as "late flight" or "rude service"). The dataset can be found [here](https://www.kaggle.com/crowdflower/twitter-airline-sentiment).

You should build an end-to-end NLP pipeline to predict the sentiment class (i.e., positive, negative, or neutral) given a tweet. In particular, you should do the following:
- Load the `tweets` dataset using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). You can find this dataset in the datasets folder.
- Split the dataset into training and test sets using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
- Build an end-to-end NLP pipeline, including a deep learning classification model, such as [RoBERTa](https://simpletransformers.ai/docs/classification-specifics/). You can use [Simple Transformers](https://simpletransformers.ai/) to load pre-trained transformer models (like BERT) and fine-tune them on your own dataset.
- Optimize your pipeline by validating your design decisions. 
- Test the best pipeline on the test set and report various [evaluation metrics](https://scikit-learn.org/0.15/modules/model_evaluation.html).  
- Check the documentation to identify the most important hyperparameters, attributes, and methods of the model. Use them in practice.
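
For the hyperparameter step, one option is to write the candidate settings down as plain dictionaries before any model is trained. The following is a minimal sketch: the key names are Simple Transformers model arguments, but the specific values, the seed, and the learning-rate grid are illustrative assumptions rather than tuned recommendations.

```python
# Hypothetical starting configuration; the values are illustrative, not tuned.
candidate_args = {
    "num_train_epochs": 3,
    "learning_rate": 4e-5,
    "train_batch_size": 8,
    "max_seq_length": 128,
    "overwrite_output_dir": True,
    "manual_seed": 42,
}

# A small grid over the learning rate; each entry is a full args dict
# that could be passed as `args=` and compared on a validation set.
learning_rates = [1e-5, 2e-5, 4e-5]
configs = [{**candidate_args, "learning_rate": lr} for lr in learning_rates]

print(len(configs))
```

Each dictionary in `configs` differs only in `learning_rate`, so any difference in validation score can be attributed to that one setting.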

In [None]:
pip install simpletransformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting simpletransformers
  Downloading simpletransformers-0.63.9-py3-none-any.whl (250 kB)
Collecting streamlit
  Downloading streamlit-1.15.1-py2.py3-none-any.whl (10.3 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting transformers>=4.6.0
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
Collecting datasets
  Downloading datasets-2.7.1-py3-none-any.whl (451 kB)
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)

In [None]:
import numpy as np
import pandas as pd
import sklearn.metrics 
import sklearn.preprocessing
import sklearn.model_selection 
import simpletransformers
import simpletransformers.classification

In [None]:
from google.colab import files
uploaded = files.upload()

Saving tweets.csv to tweets.csv


In [None]:
df = pd.read_csv("tweets.csv")
df.head(2)

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)


In [None]:
le = sklearn.preprocessing.LabelEncoder()
df["airline_sentiment_encoded"] = le.fit_transform(df["airline_sentiment"])

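# Hold out a test set (the scikit-learn default is a 75/25 split).
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(df["text"], df["airline_sentiment_encoded"])
# Simple Transformers expects columns named "text" and "labels" (lowercase).
train_df = pd.DataFrame({"text": x_train, "labels": y_train})
test_df = pd.DataFrame({"text": x_test, "labels": y_test})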
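
The split above relies on the scikit-learn defaults: no fixed seed and no stratification. As a minimal sketch, assuming a separate validation set is wanted for comparing design decisions, a reproducible, stratified three-way split could look like the following. The toy DataFrame, `random_state=42`, and the split fractions are illustrative assumptions and are not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for tweets.csv (hypothetical rows, for illustration only).
toy = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(32)],
    "airline_sentiment": ["negative"] * 16 + ["neutral"] * 10 + ["positive"] * 6,
})

# LabelEncoder assigns integer ids in sorted order of the class names.
le = LabelEncoder()
toy["labels"] = le.fit_transform(toy["airline_sentiment"])
data = toy[["text", "labels"]]

# Carve off 25% as the test set, then 25% of the remainder as validation.
# `stratify` keeps the class proportions similar across the splits.
train_val, test = train_test_split(
    data, test_size=0.25, stratify=data["labels"], random_state=42
)
train, val = train_test_split(
    train_val, test_size=0.25, stratify=train_val["labels"], random_state=42
)

print(list(le.classes_))
print(len(train), len(val), len(test))
```

With this layout, design decisions are compared on `val`, and `test` is touched only once for the final report.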

In [None]:
model = simpletransformers.classification.ClassificationModel(
    "roberta", "roberta-base", use_cuda=True, num_labels=df["airline_sentiment_encoded"].nunique(),
    args={
        "overwrite_output_dir": True,
        "save_model_every_epoch": False,
        "num_train_epochs": 3,
    }
)
model.train_model(train_df)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias', 'lm_head.decoder.weight', 'lm_head.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.weight', 'classi

  0%|          | 0/10980 [00:00<?, ?it/s]

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Running Epoch 0 of 3:   0%|          | 0/1373 [00:00<?, ?it/s]

Running Epoch 1 of 3:   0%|          | 0/1373 [00:00<?, ?it/s]

Running Epoch 2 of 3:   0%|          | 0/1373 [00:00<?, ?it/s]

(4119, 0.4270416379380093)
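
The returned tuple appears to be the total number of optimization steps followed by the average training loss. As a quick sanity check, assuming the Simple Transformers default `train_batch_size` of 8 (the notebook does not set it explicitly), the progress-bar totals above can be reproduced with a little arithmetic:

```python
import math

n_train = 10980   # training examples, from the first progress bar above
batch_size = 8    # assumed default train_batch_size (not set in the notebook)
num_epochs = 3    # "num_train_epochs" in the model args

# The last batch of each epoch is partial, hence the ceiling.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch)  # 1373
print(total_steps)      # 4119
```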

In [None]:
# Evaluate the fine-tuned model on the held-out test set

result, model_outputs, wrong_predictions = model.eval_model(test_df, acc=sklearn.metrics.accuracy_score)
cm = sklearn.metrics.confusion_matrix(y_test, np.argmax(model_outputs, axis=1))
print(result)
print(le.classes_)
print(cm)

  "Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels."


  0%|          | 0/3660 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/458 [00:00<?, ?it/s]

{'mcc': 0.7378258671335762, 'acc': 0.860655737704918, 'eval_loss': 0.5871783044660976}
['negative' 'neutral' 'positive']
[[2133  108   41]
 [ 179  504   62]
 [  56   64  513]]
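
Accuracy and MCC summarize performance in a single number each, but the task asks for several evaluation metrics. As a sketch, per-class precision, recall, and F1 can be derived directly from the confusion matrix printed above, assuming the scikit-learn convention that rows are true classes and columns are predicted classes, in the order of `le.classes_`.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows: true class, columns: predicted class.
classes = ["negative", "neutral", "positive"]
cm = np.array([[2133, 108, 41],
               [179, 504, 62],
               [56, 64, 513]])

true_pos = np.diag(cm)
precision = true_pos / cm.sum(axis=0)  # column sums: predicted count per class
recall = true_pos / cm.sum(axis=1)     # row sums: true count per class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm.sum()

for name, p, r, f in zip(classes, precision, recall, f1):
    print(f"{name:>8}  precision={p:.3f}  recall={r:.3f}  f1={f:.3f}")
print(f"accuracy={accuracy:.4f}  macro-F1={f1.mean():.3f}")
```

On these numbers, recall for the neutral class (504 of 745) is noticeably lower than for the negative class (2133 of 2282), a gap that the overall accuracy of about 0.86 does not show.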
