<a href="https://colab.research.google.com/github/navneetkrc/Flair_SOTA_NLP/blob/master/Flair_text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Dataset
We’ll be working on the [AV Twitter Sentiment Analysis](https://datahack.analyticsvidhya.com/contest/practice-problem-twitter-sentiment-analysis/) practice problem. Go ahead and download the dataset from there (you’ll need to register/log in first).

The problem statement posed by this challenge is:

> The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has a racist or sexist sentiment associated with it. So, the task is to classify racist or sexist tweets from other tweets.


## Steps Involved
**1. Text Classification Using Flair Embeddings**

Overview of steps:

Step 1: Import the data into the local environment of Colab

Step 2: Installing Flair

Step 3: Preparing text to work with Flair

Step 4: Word Embeddings with Flair

Step 5: Vectorizing the text

Step 6: Partitioning the data for Train and Test Sets

Step 7: Time for predictions!

### Step 1: Import the data into the local environment of Colab

In [1]:
# Install the PyDrive wrapper & import libraries.
# This only needs to be done once per notebook.

!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials


# Authenticate and create the PyDrive client.
# This only needs to be done once per notebook.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# Download a file based on its file ID.
# A file ID looks like: laggVyWshwcyP6kEI-y_W3P8D26sz
file_id = '1GhyH4k9C4uPRnMAMKhJYOqa-V9Tqt4q8' ### File ID ###
data = drive.CreateFile({'id': file_id})
#print('Downloaded content "{}"'.format(data.GetContentString()))


In [2]:
#Import Dataset in Colab Notebook
import io
import pandas as pd
data = pd.read_csv(io.StringIO(data.GetContentString())) 
data.head()

| Unnamed: 0.1 | Unnamed: 0 | label | tweet |
|---|---|---|---|
| 0 | 0 | 0.0 | user when a father is dysfunctional and is s... |
| 1 | 1 | 0.0 | user user thanks for lyft credit i can t us... |
| 2 | 2 | 0.0 | bihday your majesty |
| 3 | 3 | 0.0 | model i love u take with u all the time in ... |
| 4 | 4 | 0.0 | factsguide society now motivation |


### Step 2: Installing Flair

In [0]:
# download flair library #
import torch
!pip install flair
import flair

In [4]:
from flair.data import Sentence
# create a sentence #
sentence = Sentence('Blogs of Analytics Vidhya are Awesome.')
# print the sentence to see what’s in it. #
print(sentence)

Sentence: "Blogs of Analytics Vidhya are Awesome." - 6 Tokens
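
The count of 6 tokens is consistent with splitting the text on whitespace, which leaves the final full stop attached to `Awesome.`. A minimal plain-Python sketch of that idea follows; it is an illustration only and not Flair's own tokenizer, whose behaviour may differ between versions.

```python
# Whitespace splitting: an illustration, not Flair's tokenizer.
text = 'Blogs of Analytics Vidhya are Awesome.'
tokens = text.split()

# The punctuation stays attached to the last word.
print(len(tokens), tokens)
```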


### Step 3: Preparing text to work with Flair


In [5]:
# extracting the tweet column #
text = data['tweet']
## txt is a list of tweets ##
txt = text.tolist()
print(txt[:10])

['  user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction     run', ' user  user thanks for  lyft credit i can t use cause they don t offer wheelchair vans in pdx      disapointed  getthanked', '  bihday your majesty', ' model   i love u take with u all the time in ur                                      ', ' factsguide  society now     motivation', '      huge fan fare and big talking before they leave  chaos and pay disputes when they get there   allshowandnogo  ', '  user camping tomorrow  user  user  user  user  user  user  user danny   ', 'the next school year is the year for exams      can t think about that       school  exams    hate  imagine  actorslife  revolutionschool  girl', 'we won    love the land     allin  cavs  champions  cleveland  clevelandcavaliers      ', '  user  user welcome here    i m   it s so  gr    ']
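
The printed tweets contain leading, trailing and repeated spaces. The notebook passes them to Flair as they are; the sketch below shows one optional way to tidy them first. The helper name `normalise_whitespace` is hypothetical, and whether this step changes the final scores is not something the notebook measures.

```python
def normalise_whitespace(tweets):
    """Collapse runs of whitespace and trim both ends of every tweet."""
    return [" ".join(tweet.split()) for tweet in tweets]

# Toy sample in the same style as the tweets printed above.
sample = ['  bihday your majesty', ' factsguide  society now     motivation', '   ']
cleaned = normalise_whitespace(sample)

# Positions of tweets left empty, which would yield no tokens to embed.
empty_positions = [i for i, tweet in enumerate(cleaned) if not tweet]
print(cleaned)
print(empty_positions)
```

The list keeps its original length and order, so it stays aligned with the labels.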


### Step 4: Word Embeddings with Flair

In [6]:

## Importing the Embeddings ##
from flair.embeddings import WordEmbeddings
from flair.embeddings import CharacterEmbeddings
from flair.embeddings import StackedEmbeddings
from flair.embeddings import FlairEmbeddings
from flair.embeddings import BertEmbeddings
from flair.embeddings import ELMoEmbeddings

### Initialising embeddings (un-comment to use others) ###
#glove_embedding = WordEmbeddings('glove')
#character_embeddings = CharacterEmbeddings()
flair_forward  = FlairEmbeddings('news-forward-fast')
flair_backward = FlairEmbeddings('news-backward-fast')
#bert_embedding = BertEmbeddings()
#elmo_embedding = ELMoEmbeddings()

stacked_embeddings = StackedEmbeddings( embeddings = [ 
                                                       flair_forward, 
                                                       flair_backward
                                                      ])

In [7]:
#Testing the stacked embeddings:

# create a sentence #
sentence = Sentence('Blogs of Analytics Vidhya are Awesome.')
# embed words in sentence #
stacked_embeddings.embed(sentence)

for token in sentence:
  print(token.embedding)
# data type and size of embedding #
print(type(token.embedding))
# storing size (length) #
z = token.embedding.size()[0]

tensor([ 3.0279e-04, -1.4077e-07,  2.6455e-06,  ..., -1.1807e-07,
        -4.5203e-06,  3.4654e-03])
tensor([-7.3398e-03, -4.8201e-05,  1.2195e-07,  ..., -1.3866e-08,
        -1.9298e-04,  5.3008e-03])
tensor([ 2.1015e-03, -5.1521e-06,  9.0945e-08,  ..., -3.9210e-09,
         1.5152e-05,  1.3080e-02])
tensor([-3.6214e-03, -1.4667e-06,  6.8676e-07,  ..., -3.8634e-08,
         2.1911e-04,  1.7681e-02])
tensor([ 2.5456e-03,  3.4033e-06,  3.1239e-06,  ..., -5.5899e-08,
        -1.3424e-04,  6.0970e-03])
tensor([-1.0973e-04,  6.7579e-07,  4.5737e-08,  ..., -7.0676e-09,
        -8.7311e-04,  4.5264e-03])
<class 'torch.Tensor'>
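
Stacking, as done with `StackedEmbeddings` in Step 4, concatenates the vectors that each embedder produces for a token, so the stacked length is the sum of the individual lengths. A toy NumPy sketch of that idea, with made-up vector sizes rather than the real model dimensions:

```python
import numpy as np

# Toy per-token vectors from two hypothetical embedders (sizes are made up).
forward_vec = np.array([0.1, 0.2, 0.3])
backward_vec = np.array([0.4, 0.5])

# Stacking joins the vectors end to end, so the lengths add up.
stacked_vec = np.concatenate([forward_vec, backward_vec])
print(stacked_vec.shape)
```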


### Step 5: Vectorizing the text

**We’ll showcase this using two approaches.**

### Mean of Word Embeddings within a Tweet

In this approach, for each tweet we:

1.   Generate a word embedding for each word

2.   Take the mean of those word embeddings to obtain the embedding of the tweet
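
The two steps above can be sketched with NumPy on toy numbers. The helper name `mean_pool` and the 2-dimensional vectors are illustrative assumptions; the real Flair vectors are much longer.

```python
import numpy as np

def mean_pool(token_vectors):
    """Average a (num_tokens, dim) array of word vectors into one (dim,) vector."""
    token_vectors = np.asarray(token_vectors, dtype=float)
    return token_vectors.mean(axis=0)

# Three toy word vectors standing in for the tokens of one tweet.
toy_tweet = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
tweet_vector = mean_pool(toy_tweet)
print(tweet_vector)
```

Each tweet ends up with a vector of the same length, whatever its number of words, which is what a tabular model needs as input.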


In [0]:
from tqdm import tqdm ## tracks progress of loop ##

# creating a tensor for storing sentence embeddings #
s = torch.zeros(0,z)

# iterating Sentence (tqdm tracks progress) #
for tweet in tqdm(txt):   
  # empty tensor for words #
  w = torch.zeros(0,z)   
  sentence = Sentence(tweet)
  stacked_embeddings.embed(sentence)
  # for every word #
  for token in sentence:
    # storing word Embeddings of each word in a sentence #
    w = torch.cat((w,token.embedding.view(-1,z)),0)
  # storing sentence Embeddings (mean of embeddings of all words)   #
  s = torch.cat((s, w.mean(dim = 0).view(-1, z)),0)


 85%|████████▍ | 41613/49159 [48:55<14:29,  8.67it/s]

### Document Embedding: Vectorizing the entire Tweet



In [0]:
from flair.embeddings import DocumentPoolEmbeddings

### initialize the document embeddings (mean pooling is the default) ###
document_embeddings = DocumentPoolEmbeddings([
                                              flair_backward,
                                              flair_forward
                                             ])
# Storing Size of embedding #
z = document_embeddings.embedding_length

### Vectorising text ###
# creating a tensor for storing sentence embeddings
s = torch.zeros(0,z)
# iterating Sentences #
for tweet in tqdm(txt):   
  sentence = Sentence(tweet)
  document_embeddings.embed(sentence)
  # Adding Document embeddings to list #
  s = torch.cat((s, sentence.embedding.view(-1,z)),0)


You can choose either approach for your model. Now that our text is vectorised, we can feed it to our machine learning model!
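
Embedding tens of thousands of tweets can take a long time, so it may be worth saving the feature matrix once it is computed. The sketch below uses a small stand-in array and a hypothetical file name; in the notebook the array would come from converting `s` to NumPy.

```python
import os
import tempfile
import numpy as np

# Stand-in for the real feature matrix (one row per tweet).
features = np.arange(12, dtype=np.float32).reshape(4, 3)

# Hypothetical file name; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "tweet_embeddings.npy")
np.save(path, features)

# Later (for example in a new session), reload instead of re-embedding.
restored = np.load(path)
print(restored.shape)
```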


### Step 6: Partitioning the data for Train and Test Sets


In [0]:
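## tensor to numpy array ##
X = s.numpy()

## the first 31962 rows are the labelled training tweets; the rest form the test set ##
train = X[:31962,:]
test = X[31962:,:]

# extracting labels of the training set (rows where the label is present) #
target = data['label'][data['label'].notnull()].values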
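
The split can also be derived from the labels themselves instead of a hard-coded row count. This is a sketch on toy arrays, assuming the feature rows and the label column are in the same order and that unlabelled (test) rows carry a missing label, as in this dataset.

```python
import numpy as np

# Toy stand-ins: 5 feature rows, the last 2 without a label (NaN).
features = np.arange(10, dtype=float).reshape(5, 2)
labels = np.array([0.0, 1.0, 0.0, np.nan, np.nan])

# Rows with a label go to training, the others to the test set.
has_label = ~np.isnan(labels)
train_rows = features[has_label]
test_rows = features[~has_label]
train_target = labels[has_label].astype(int)
print(train_rows.shape, test_rows.shape, train_target)
```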

### Step 6 (continued): Building the Model and Defining a Custom Evaluator (for F1 Score)


In [0]:
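#Defining custom F1 evaluator for XGBoost
import numpy as np
from sklearn.metrics import f1_score

def custom_eval(preds, dtrain):
    # np.int was removed from recent NumPy releases, so use the built-in int #
    labels = dtrain.get_label().astype(int)
    preds = (preds >= 0.3).astype(int)
    return [('f1_score', f1_score(labels, preds))]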
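
To see what the evaluator measures, here is a plain-Python sketch of the F1 score of the positive class at a given probability threshold. The function name `f1_at_threshold` and the toy numbers are illustrative assumptions; the notebook itself relies on scikit-learn's `f1_score`.

```python
def f1_at_threshold(labels, probabilities, threshold):
    """F1 score of the positive class after thresholding probabilities."""
    preds = [1 if p >= threshold else 0 for p in probabilities]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

toy_labels = [1, 0, 1, 0, 1]
toy_probs = [0.9, 0.35, 0.25, 0.1, 0.6]
f1_low = f1_at_threshold(toy_labels, toy_probs, 0.3)
f1_high = f1_at_threshold(toy_labels, toy_probs, 0.5)
print(round(f1_low, 3), round(f1_high, 3))
```

On these toy values the score changes with the threshold, which is why the cut-off is worth tuning on the validation set.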

**Building XGBoost Model**

In [0]:
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

### Splitting training set ###
x_train, x_valid, y_train, y_valid = train_test_split(train, target,
                                                      random_state=42,
                                                      test_size=0.3)

### XGBoost compatible data ###
dtrain = xgb.DMatrix(x_train, label = y_train)
dvalid = xgb.DMatrix(x_valid, label = y_valid)

### defining parameters ###
params = {
          # 'colsample' is not among XGBoost's documented parameters and is likely ignored #
          'colsample': 0.9,
          'colsample_bytree': 0.5,
          'eta': 0.1,
          'max_depth': 8,
          'min_child_weight': 6,
          'objective': 'binary:logistic',
          'subsample': 0.9
          }

### Training the model ###
# newer XGBoost releases deprecate `feval` in favour of `custom_metric` #
xgb_model = xgb.train(
                      params,
                      dtrain,
                      feval= custom_eval,
                      num_boost_round= 1000,
                      maximize=True,
                      evals=[(dvalid, "Validation")],
                      early_stopping_rounds=30
                      )
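
The `early_stopping_rounds` and `maximize` arguments stop training once the validation metric has gone a set number of rounds without improving. The sketch below is a simplified plain-Python illustration of that rule on made-up scores; it is not XGBoost's implementation, and the function name is hypothetical.

```python
def best_round_with_early_stopping(scores, patience):
    """Return (best_round, rounds_run) for a metric that should be maximised."""
    best_round, best_score = 0, float("-inf")
    for round_index, score in enumerate(scores):
        if score > best_score:
            # A strictly better score resets the waiting period.
            best_round, best_score = round_index, score
        elif round_index - best_round >= patience:
            # No improvement for `patience` rounds: stop here.
            return best_round, round_index + 1
    return best_round, len(scores)

toy_f1 = [0.40, 0.45, 0.50, 0.49, 0.48, 0.50, 0.47]
best, rounds_run = best_round_with_early_stopping(toy_f1, patience=3)
print(best, rounds_run)
```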

### Step 7: Time for predictions!

In [0]:

### Reformatting test set for XGB ###
dtest = xgb.DMatrix(test)

### Predicting ###
predict = xgb_model.predict(dtest) # predicting

I uploaded the predictions to the practice problem page with 0.2 as the probability threshold:

| Word Embedding | F1 Score |
|---|---|
| GloVe | 0.53 |
| flair-forward-fast | 0.45 |
| flair-backward-fast | 0.48 |
| Stacked (flair-forward-fast + flair-backward-fast) | 0.54 |
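
The model outputs probabilities, so they need to be turned into 0/1 labels before uploading. Here is a minimal sketch of applying the 0.2 threshold and writing a CSV. The helper name, the toy ids and the column names `id` and `label` are assumptions about the expected submission format, not something the notebook shows.

```python
import csv
import io

def to_submission_rows(ids, probabilities, threshold=0.2):
    """Pair each id with a 0/1 label obtained by thresholding its probability."""
    return [(i, int(p >= threshold)) for i, p in zip(ids, probabilities)]

# Toy ids and probabilities; real ids would come from the test file.
rows = to_submission_rows([101, 102, 103], [0.05, 0.2, 0.71])

# Column names "id" and "label" are assumptions about the expected format.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["id", "label"])
writer.writerows(rows)
submission_csv = buffer.getvalue()
print(submission_csv)
```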
 