# Introduction

<div class="alert alert-block alert-warning">
<font color=black><br>

**What?** Model building. Our final goal is to deploy the model via Flask.

<br></font>
</div>

# Import modules

In [40]:
import pandas as pd
from joblib import dump, load
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Read-in the dataset

<div class="alert alert-block alert-info">
<font color=black><br>

- The dataset has 31,962 rows and 3 columns:
- id: unique number for each row
- **label = 0** for a normal tweet
- **label = 1** for a racist or sexist tweet
- There are 29,720 zeros and 2,242 ones
- tweet: the tweet posted on Twitter

<br></font>
</div>

In [2]:
data = pd.read_csv('dataset/twitter_sentiments.csv')

In [3]:
data.head()

   id  label  tweet
0   1      0  @user when a father is dysfunctional and is s...
1   2      0  @user @user thanks for #lyft credit i can't us...
2   3      0  bihday your majesty
3   4      0  #model i love u take with u all the time in ...
4   5      0  factsguide: society now #motivation


In [4]:
data.shape

(31962, 3)

In [27]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31962 entries, 0 to 31961
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      31962 non-null  int64 
 1   label   31962 non-null  int64 
 2   tweet   31962 non-null  object
dtypes: int64(2), object(1)
memory usage: 749.2+ KB


In [5]:
data.label.value_counts()

0    29720
1     2242
Name: label, dtype: int64

In [38]:
data[data.label == 1].count()

id       2242
label    2242
tweet    2242
dtype: int64

# Split the dataset

<div class="alert alert-block alert-info">
<font color=black><br>

- We hold out 20 percent of the data for testing.
- We **stratify** the split on the label column so that the distribution of the target label is the same in both the train and test data.

<br></font>
</div>
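To make the effect of stratification concrete, here is a minimal sketch on a toy label array rather than the tweet data. The arrays `features` and `labels` are made up for illustration (90 zeros and 10 ones); only `train_test_split` and its `stratify` argument come from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 90 negative and 10 positive labels (10% positives).
labels = np.array([0] * 90 + [1] * 10)
features = np.arange(len(labels)).reshape(-1, 1)

# Same call pattern as in the notebook: 20% test split, stratified on the label.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=21
)

# With stratification, both splits keep the 10% share of positives.
train_ratio = y_train.mean()
test_ratio = y_test.mean()
print(f"positives in train: {train_ratio:.2f}, in test: {test_ratio:.2f}")
```

Without `stratify`, the share of positives in a small test set could drift away from the overall share purely by chance, which matters when the positive class is as rare as it is here.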

In [6]:
train, test = train_test_split(data, test_size=0.2, stratify=data['label'], random_state=21)

In [7]:
train.shape, test.shape

((25569, 3), (6393, 3))

In [8]:
train.label.value_counts(normalize=True)

0    0.929837
1    0.070163
Name: label, dtype: float64

In [9]:
test.label.value_counts(normalize=True)

0    0.929923
1    0.070077
Name: label, dtype: float64

# Build a TF-IDF word representation

<div class="alert alert-block alert-info">
<font color=black><br>

- **TF-IDF**: Term Frequency-Inverse Document Frequency
- The TF-IDF score is the product of the two terms: TF-IDF = TF * IDF
- The frequency of each word in a document is rescaled by how often the word appears across all documents, so that words like “the”, which are frequent in every document, are penalized.
- We set the parameter lowercase to True so that the text is converted to lowercase first.
- We also limit max features to 1000 and pass the predefined list of English stop words that ships with scikit-learn.

<br></font>
</div>
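The rescaling described above can be seen on a tiny corpus. This is a sketch under stated assumptions: the three sentences are invented for illustration, and stop-word removal is left off so that “the” stays in the vocabulary. Note that scikit-learn's default is a smoothed variant, IDF = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row, so the numbers are not a plain TF * IDF product.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "the" appears in every document, "bird" in only one.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]

vectorizer = TfidfVectorizer(lowercase=True)
tfidf = vectorizer.fit_transform(corpus)

# Look up the learned inverse document frequency of two words.
vocab = vectorizer.vocabulary_
idf_the = vectorizer.idf_[vocab["the"]]
idf_bird = vectorizer.idf_[vocab["bird"]]

# A word present in every document gets the lowest possible weight.
print(f"idf('the')  = {idf_the:.3f}")
print(f"idf('bird') = {idf_bird:.3f}")
```

The word that occurs in every document receives the smallest IDF, while the word that occurs in a single document receives a larger one, which is the penalty on ubiquitous words that the bullet points describe.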

In [10]:
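# Note: newer scikit-learn releases may reject a frozenset for stop_words;
# stop_words='english' selects the same built-in list.
tfidf_vectorizer = TfidfVectorizer(lowercase=True, max_features=1000, stop_words=ENGLISH_STOP_WORDS)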

In [11]:
tfidf_vectorizer.fit(train.tweet)

TfidfVectorizer(max_features=1000,
                stop_words=frozenset({'a', 'about', 'above', 'across', 'after',
                                      'afterwards', 'again', 'against', 'all',
                                      'almost', 'alone', 'along', 'already',
                                      'also', 'although', 'always', 'am',
                                      'among', 'amongst', 'amoungst', 'amount',
                                      'an', 'and', 'another', 'any', 'anyhow',
                                      'anyone', 'anything', 'anyway',
                                      'anywhere', ...}))

In [12]:
train_idf = tfidf_vectorizer.transform(train.tweet)
test_idf  = tfidf_vectorizer.transform(test.tweet)

# Build an ML model

<div class="alert alert-block alert-info">
<font color=black><br>

- We are going to build a classifier.
- Our choice is a simple logistic regression model.
- We are not going to optimise this model much.
- For a more finely tuned model, see this [link](https://www.analyticsvidhya.com/blog/2018/07/hands-on-sentiment-analysis-dataset-python/?utm_source=blog&utm_medium=streaming-data-pyspark-machine-learning-model?utm_source=blog&utm_medium=how-to-deploy-machine-learning-model-flask).

<br></font>
</div>

In [13]:
model_LR = LogisticRegression()

In [14]:
model_LR.fit(train_idf, train.label)

LogisticRegression()

In [15]:
predict_train = model_LR.predict(train_idf)

In [16]:
predict_test = model_LR.predict(test_idf)

In [28]:
# f1 score on train data
f1_score(y_true= train.label, y_pred= predict_train)

0.4888178913738019

In [29]:
# f1 score on test data
f1_score(y_true=test.label, y_pred=predict_test)

0.45751633986928114

# Build a pipeline

<div class="alert alert-block alert-info">
<font color=black><br>

- What we did above was broken into several steps, but we can streamline the process by constructing a pipeline.
- The ultimate goal is to write cleaner, shorter code.
- This pipeline is made of **two steps**: [1] a TF-IDF vectorizer that processes the text and [2] a logistic regression model.

<br></font>
</div>
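As a quick check of the claim that a pipeline only packages the manual steps, the sketch below fits both variants on a toy corpus and compares their predictions. The `texts` and `labels` are invented for illustration and default estimator settings are assumed; this is not the notebook's tweet data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus with binary labels.
texts = [
    "good morning friends",
    "have a lovely day",
    "you are awful",
    "awful hateful words",
    "lovely morning walk",
    "hateful and awful",
]
labels = [0, 0, 1, 1, 0, 1]

# Manual route: vectorize first, then fit the classifier on the features.
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(texts)
manual_model = LogisticRegression().fit(features, labels)
manual_pred = manual_model.predict(vectorizer.transform(texts))

# Pipeline route: the same two steps chained behind one fit/predict interface.
toy_pipeline = Pipeline(steps=[("tfidf", TfidfVectorizer()),
                               ("model", LogisticRegression())])
toy_pipeline.fit(texts, labels)
pipeline_pred = toy_pipeline.predict(texts)

# Both routes should give identical predictions.
same = np.array_equal(manual_pred, pipeline_pred)
print(same)
```

The practical benefit is that the pipeline accepts raw text directly, so the vectorizer and the model can be saved and served as a single object.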

In [30]:
pipeline = Pipeline(steps=[('tfidf', TfidfVectorizer(lowercase=True,
                                                     max_features=1000,
                                                     stop_words=ENGLISH_STOP_WORDS)),
                           ('model', LogisticRegression())])

In [31]:
pipeline.fit(train.tweet, train.label)

Pipeline(steps=[('tfidf',
                 TfidfVectorizer(max_features=1000,
                                 stop_words=frozenset({'a', 'about', 'above',
                                                       'across', 'after',
                                                       'afterwards', 'again',
                                                       'against', 'all',
                                                       'almost', 'alone',
                                                       'along', 'already',
                                                       'also', 'although',
                                                       'always', 'am', 'among',
                                                       'amongst', 'amoungst',
                                                       'amount', 'an', 'and',
                                                       'another', 'any',
                                                       'anyhow', 'anyone',
           

In [32]:
# predict on the training data
pipeline.predict(train.tweet)

array([0, 0, 0, ..., 0, 0, 0])

In [33]:
# f1 score on train data, just double-checking we get the same result as before
f1_score(y_true= train.label, y_pred = pipeline.predict(train.tweet))

0.4888178913738019

In [34]:
# Now, we will test the pipeline with a sample tweet
text = ["Virat Kohli, AB de Villiers set to auction their 'Green Day' kits from 2016 IPL match to raise funds"]

In [35]:
# The tweet is not offensive!
pipeline.predict(text)

array([0])

<div class="alert alert-block alert-info">
<font color=black><br>

- We have now built a pipeline.
- We have checked that it works as expected.
- The next step is to dump the trained pipeline with the **joblib** library.

<br></font>
</div>

In [24]:
# Dump the pipeline
dump(pipeline, filename = "text_classification.joblib")

['text_classification.joblib']

# Loading the joblib model

<div class="alert alert-block alert-info">
<font color=black><br>

- We'll now load the dumped model.
- Then, we'll check that the prediction is still the same.
- As you can see, we are good to go.

<br></font>
</div>

In [41]:
loadedPipeline = load("text_classification.joblib")

In [42]:
loadedPipeline.predict(text)

array([0])
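A single sample is a fairly weak sanity check. A slightly stricter variant, sketched here on a hypothetical toy pipeline and a temporary file (neither is part of the original notebook), compares the predicted probabilities of the in-memory and reloaded pipelines over several inputs.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the tweet data.
texts = [
    "good morning friends",
    "have a lovely day",
    "you are awful",
    "awful hateful words",
    "lovely morning walk",
    "hateful and awful",
]
labels = [0, 0, 1, 1, 0, 1]

toy_pipeline = Pipeline(steps=[("tfidf", TfidfVectorizer()),
                               ("model", LogisticRegression())])
toy_pipeline.fit(texts, labels)

# Dump to a temporary directory and load the pipeline back into memory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.joblib")
    dump(toy_pipeline, path)
    reloaded = load(path)

# The reloaded pipeline should reproduce the original probabilities.
original_proba = toy_pipeline.predict_proba(texts)
reloaded_proba = reloaded.predict_proba(texts)
round_trip_ok = np.allclose(original_proba, reloaded_proba)
print(round_trip_ok)
```

Comparing probabilities rather than hard labels also catches small numerical differences that would not flip a prediction.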

# References

<div class="alert alert-block alert-info">
<font color=black><br>

- [Reference article](https://www.analyticsvidhya.com/blog/2020/04/how-to-deploy-machine-learning-model-flask/)
- [Reference code](https://github.com/lakshay-arora/Hate-Speech-Classification-deployed-using-Flask/tree/master)

<br></font>
</div>

# Conclusion

<div class="alert alert-block alert-danger">
<font color=black><br>

- We followed these steps:
- #1 Create a model
- #2 Create a pipeline and train it
- #3 Dump the trained pipeline
- #4 Reload the dumped pipeline and run a sanity check

<br></font>
</div>