![BTS](https://github.com/vfp1/bts-mbds-data-science-foundations-2019/blob/master/sessions/img/Logo-BTS.jpg?raw=1)

# Session 10: Text classification and Sentiment analysis

### Victor F. Pajuelo Madrigal <victor.pajuelo@bts.tech> - Data Science Foundations (2019-11-07)

Open this notebook in Google Colaboratory: [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vfp1/bts-mbds-data-science-foundations-2019/blob/master/sessions/10_Text_classification_and_Sentiment_analysis.ipynb)



# spaCy installation

```
$ conda activate bts36
$ conda install -c conda-forge spacy
```



## Import language models 



```
$ python -m spacy download en_core_web_sm
$ python -m spacy download en
```



In [1]:
!python -m spacy download en_core_web_lg

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


In [2]:
!python -m spacy download en

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/Users/ketevani/anaconda3/envs/bts36/lib/python3.6/site-packages/en_core_web_sm
-->
/Users/ketevani/anaconda3/envs/bts36/lib/python3.6/site-packages/spacy/data/en
You can now load the model via spacy.load('en')


Once the model is downloaded and installed, we can load it as follows:

In [3]:
import spacy
nlp = spacy.load("en_core_web_sm")

# Text classification (Sentiment analysis pipeline) 

Today we will use a `scikit learn` pipeline for the first time, both to prepare the data and to classify it.

We will start with three different datasets where we have a collection of user reactions:

*   IMDb - Movie reviews
*   Amazon - Technology product reviews
*   Yelp - Restaurant food reviews

The dataset is coded with a `0` when the review is bad and with a `1` when the review is good. 



## ETL: Extract Transform Load

The first process in our pipeline is an ETL one, i.e. Extract, Transform and Load. We will prepare our dataset to be cleaned and to be passed to the processing pipeline.

### Load datasets

In [2]:
!wget "https://github.com/vfp1/bts-mbds-data-science-foundations-2019/raw/master/sessions/data/amazon_cells_labelled.txt"

--2019-11-07 05:04:13--  https://github.com/vfp1/bts-mbds-data-science-foundations-2019/raw/master/sessions/data/amazon_cells_labelled.txt
Resolving github.com (github.com)... 192.30.253.112
Connecting to github.com (github.com)|192.30.253.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/vfp1/bts-mbds-data-science-foundations-2019/master/sessions/data/amazon_cells_labelled.txt [following]
--2019-11-07 05:04:13--  https://raw.githubusercontent.com/vfp1/bts-mbds-data-science-foundations-2019/master/sessions/data/amazon_cells_labelled.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 58226 (57K) [text/plain]
Saving to: ‘amazon_cells_labelled.txt’


2019-11-07 05:04:13 (2.25 MB/s) - ‘amazon_cells_l

In [3]:
!wget "https://github.com/vfp1/bts-mbds-data-science-foundations-2019/raw/master/sessions/data/imdb_labelled.txt"

--2019-11-07 05:04:14--  https://github.com/vfp1/bts-mbds-data-science-foundations-2019/raw/master/sessions/data/imdb_labelled.txt
Resolving github.com (github.com)... 192.30.253.112
Connecting to github.com (github.com)|192.30.253.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/vfp1/bts-mbds-data-science-foundations-2019/master/sessions/data/imdb_labelled.txt [following]
--2019-11-07 05:04:14--  https://raw.githubusercontent.com/vfp1/bts-mbds-data-science-foundations-2019/master/sessions/data/imdb_labelled.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 85285 (83K) [text/plain]
Saving to: ‘imdb_labelled.txt’


2019-11-07 05:04:14 (3.31 MB/s) - ‘imdb_labelled.txt’ saved [85285/85285]



In [11]:
!wget "https://github.com/vfp1/bts-mbds-data-science-foundations-2019/raw/master/sessions/data/yelp_labelled.txt"

--2019-11-07 05:05:21--  https://github.com/vfp1/bts-mbds-data-science-foundations-2019/raw/master/sessions/data/yelp_labelled.txt
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/vfp1/bts-mbds-data-science-foundations-2019/master/sessions/data/yelp_labelled.txt [following]
--2019-11-07 05:05:21--  https://raw.githubusercontent.com/vfp1/bts-mbds-data-science-foundations-2019/master/sessions/data/yelp_labelled.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 61320 (60K) [text/plain]
Saving to: ‘yelp_labelled.txt’


2019-11-07 05:05:21 (2.42 MB/s) - ‘yelp_labelled.txt’ saved [61320/61320]



Once the labelled data is downloaded (please take a minute to look at the source files), we can read it with `Pandas`.

In [4]:
import pandas as pd

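# Load our datasets. The files are tab-separated and have no header row, so
# read_table uses the first review of each file as column names (see the output
# below); pass header=None if you want to keep that first review as data.
df_yelp = pd.read_table('yelp_labelled.txt')
df_imdb = pd.read_table('imdb_labelled.txt')
df_amz = pd.read_table('amazon_cells_labelled.txt')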
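
Each of these files stores one review per line, as a sentence and a `0`/`1` label separated by a tab. As a rough sketch of that layout, the snippet below parses two lines copied from the Yelp output shown in this notebook using only the standard library. The in-memory `sample` string and the `parse_labelled_lines` helper are assumptions made for illustration; they stand in for the real file and are not part of the original notebook.

```python
import csv
import io

# Two sample lines in the "sentence<TAB>label" layout (no header row).
sample = "Wow... Loved this place.\t1\nCrust is not good.\t0\n"

def parse_labelled_lines(handle):
    """Return a list of (message, target) pairs from tab-separated lines."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(message, int(target)) for message, target in reader]

rows = parse_labelled_lines(io.StringIO(sample))
print(rows)
```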

Then we can concatenate all the datasets into a single one that contains movie, technology and food reviews.

In [5]:
# Concatenate our Datasets
frames = [df_yelp,df_imdb,df_amz]
frames

[                              Wow... Loved this place.  1
 0                                   Crust is not good.  0
 1            Not tasty and the texture was just nasty.  0
 2    Stopped by during the late May bank holiday of...  1
 3    The selection on the menu was great and so wer...  1
 4       Now I am getting angry and I want my damn pho.  0
 ..                                                 ... ..
 994  I think food should have flavor and texture an...  0
 995                           Appetite instantly gone.  0
 996  Overall I was not impressed and would not go b...  0
 997  The whole experience was underwhelming, and I ...  0
 998  Then, as if I hadn't wasted enough of my life ...  0
 
 [999 rows x 2 columns],
     A very, very, very slow-moving, aimless movie about a distressed, drifting young man.    \
 0    Not sure who was more lost - the flat characte...                                        
 1    Attempting artiness with black & white and cle...                  

In [6]:
# Renaming the column headers of every DataFrame in the list
for frame in frames:
    frame.columns = ["Message","Target"]
frames

[                                               Message  Target
 0                                   Crust is not good.       0
 1            Not tasty and the texture was just nasty.       0
 2    Stopped by during the late May bank holiday of...       1
 3    The selection on the menu was great and so wer...       1
 4       Now I am getting angry and I want my damn pho.       0
 ..                                                 ...     ...
 994  I think food should have flavor and texture an...       0
 995                           Appetite instantly gone.       0
 996  Overall I was not impressed and would not go b...       0
 997  The whole experience was underwhelming, and I ...       0
 998  Then, as if I hadn't wasted enough of my life ...       0
 
 [999 rows x 2 columns],
                                                Message  Target
 0    Not sure who was more lost - the flat characte...       0
 1    Attempting artiness with black & white and cle...       0
 2         Ve

In [9]:
# Assign a key to each dataset so we can tell the sources apart after merging
keys = ['Yelp','IMDB','Amazon']

In [10]:
# Merge or Concat our Datasets
df = pd.concat(frames,keys=keys)
df

| | | Message | Target |
|---|---|---|---|
| Yelp | 0 | Crust is not good. | 0 |
| Yelp | 1 | Not tasty and the texture was just nasty. | 0 |
| Yelp | 2 | Stopped by during the late May bank holiday of... | 1 |
| Yelp | 3 | The selection on the menu was great and so wer... | 1 |
| Yelp | 4 | Now I am getting angry and I want my damn pho. | 0 |
| ... | ... | ... | ... |
| Amazon | 994 | The screen does get smudged easily because it ... | 0 |
| Amazon | 995 | What a piece of junk.. I lose more calls on th... | 0 |
| Amazon | 996 | Item Does Not Match Picture. | 0 |
| Amazon | 997 | The only thing that disappoint me is the infra... | 0 |


In [11]:
# Length and Shape 
df.shape

(2745, 2)

In [12]:
df.head()

| | | Message | Target |
|---|---|---|---|
| Yelp | 0 | Crust is not good. | 0 |
| Yelp | 1 | Not tasty and the texture was just nasty. | 0 |
| Yelp | 2 | Stopped by during the late May bank holiday of... | 1 |
| Yelp | 3 | The selection on the menu was great and so wer... | 1 |
| Yelp | 4 | Now I am getting angry and I want my damn pho. | 0 |


Last, we can save our raw dataset as a CSV file in our system.

In [13]:
df.to_csv("sentimentdataset.csv")

### Cleaning dataset with spaCy

* Removing Stopwords
* Removing punctuation
* Lemmatizing

In [14]:
# Data Cleaning
df.columns

Index(['Message', 'Target'], dtype='object')

In [15]:
# Checking for Missing Values
df.isnull().sum()

Message    0
Target     0
dtype: int64

In [16]:
# Checking column types and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 2745 entries, (Yelp, 0) to (Amazon, 998)
Data columns (total 2 columns):
Message    2745 non-null object
Target     2745 non-null int64
dtypes: int64(1), object(1)
memory usage: 58.8+ KB


In [17]:
# Checking for the balance of our dataset
df.Target.value_counts()

1    1385
0    1360
Name: Target, dtype: int64
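
To put these counts in perspective, the short calculation below turns them into shares. It is only a sketch that reuses the two numbers printed above; the point is that a model which always predicted the majority label would already be right about half of the time on the full dataset, which is a useful baseline to keep in mind when reading accuracy scores (the exact share in a random test split may differ slightly).

```python
# Counts copied from the value_counts() output above.
counts = {1: 1385, 0: 1360}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
majority_label = max(counts, key=counts.get)

print(total)
print(majority_label, round(shares[majority_label], 3))
```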

#### Tokenizing our dataset with spaCy

We will remove from our dataset what we do not need, such as stopwords and punctuation.

This time we will also import the `string` module, which has a good list of punctuation symbols.

The function we will create takes a sentence as input and splits it into tokens, lemmatizing and lowercasing each one and dropping stopwords and punctuation.

In [18]:
import string
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

# Create our list of punctuation marks
punctuations = string.punctuation

# Load the English model and create our list of stopwords
nlp = spacy.load('en')
stop_words = STOP_WORDS

# Create a blank English pipeline: it only tokenizes (no tagger, parser or NER)
parser = English()

# Creating our tokenizer function
def spacy_tokenizer(sentence):
    # Creating our token object, which is used to create documents with linguistic annotations.
    mytokens = parser(sentence)

    # Lemmatizing and lowercasing each token; spaCy 2.x lemmatizes pronouns
    # to "-PRON-", so for those we keep the lowercased original text instead
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]

    # Removing stop words
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]

    # return preprocessed list of tokens
    return mytokens
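
To see the shape of the output without loading spaCy, here is a rough standard-library stand-in for the same idea. It is a sketch under clear assumptions: `toy_tokenizer` and `TOY_STOP_WORDS` are hypothetical names, the stop-word list is deliberately tiny, tokens are split on whitespace only, and there is no lemmatization, so it will not reproduce what `spacy_tokenizer` returns.

```python
import string

# Hypothetical, deliberately tiny stop-word list; spaCy's STOP_WORDS is much larger.
TOY_STOP_WORDS = {"the", "is", "and", "was", "a", "i", "my"}

def toy_tokenizer(sentence):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = []
    for raw in sentence.lower().split():
        word = raw.strip(string.punctuation)
        if word and word not in TOY_STOP_WORDS:
            tokens.append(word)
    return tokens

print(toy_tokenizer("Not tasty and the texture was just nasty."))
```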

#### Defining a Transformer

In `scikit learn`, a transformer is any object that implements `fit` and `transform` and converts its input into another representation. It should not be confused with the attention-based Transformer networks (such as BERT) that are heavily used in language processing.

We will be using the class `TransformerMixin` from `scikit learn` to create our own transformer class.

Our class will define its own `transform`, `fit` and `get_params` methods. We will also pass a function that strips leading and trailing whitespace and converts the text into lowercase for an easier analysis.

In [19]:
from sklearn.base import TransformerMixin 

# This function will clean the text
def clean_text(text):     
    return text.strip().lower()
    
# Custom transformer that applies clean_text to every document
class predictors(TransformerMixin):

    def transform(self, X, **transform_params):
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}
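
One reason to inherit from `TransformerMixin` is that it supplies a `fit_transform` method once `fit` and `transform` exist. The sketch below illustrates that idea in plain Python. `ToyTransformerMixin` and `ToyCleaner` are hypothetical names, and the mixin is a simplified stand-in written for this example, not scikit-learn's actual code.

```python
class ToyTransformerMixin:
    """Simplified stand-in for the idea behind sklearn.base.TransformerMixin."""

    def fit_transform(self, X, y=None, **fit_params):
        # Fit on X, then transform the same X.
        return self.fit(X, y, **fit_params).transform(X)


class ToyCleaner(ToyTransformerMixin):
    """Stateless transformer: strips whitespace and lowercases each text."""

    def fit(self, X, y=None, **fit_params):
        return self  # nothing to learn from the data

    def transform(self, X, **transform_params):
        return [text.strip().lower() for text in X]


cleaned = ToyCleaner().fit_transform(["  Great Pork Sandwich.  ", "Was NOT happy. "])
print(cleaned)
```

Because the cleaner learns nothing from the data, `fit` simply returns `self`; that is also why the `predictors` class above can do the same.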

## Feature engineering

### Vectorization with Bag of Words and TF-IDF

When we classify text, we end up with text snippets matched with their respective labels. Classifying text into positive and negative labels is called **sentiment analysis**. However, a classifier cannot work on raw text, so we first need to represent our text numerically.

There are different tools for that, e.g. **Bag of Words** and **TF-IDF**.

#### Bag of Words

The first one converts text into a matrix of word occurrences: one row per document, one column per word, and each entry counts how often that word occurs in that document. It ignores word order, and the resulting matrix is often referred to as a BoW matrix or a document-term matrix.
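
As a toy illustration of such a matrix, the snippet below builds one by hand with the standard library. The three example sentences are made up for this sketch, and tokens are obtained by a plain whitespace split rather than by the spaCy tokenizer.

```python
from collections import Counter

docs = ["the food was good", "the service was not good", "good good food"]

# Vocabulary: every distinct word, in sorted order, one column per word.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row per document; each entry is how often that word occurs in it.
bow_matrix = []
for doc in docs:
    counts = Counter(doc.split())
    bow_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in bow_matrix:
    print(row)
```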

We can generate a BoW matrix for our text data by using scikit-learn's `CountVectorizer`. In the code below, we're telling `CountVectorizer` to use the custom `spacy_tokenizer` function we built as its tokenizer, and defining the n-gram range we want.

N-grams are combinations of adjacent words in a given text, where n is the number of words included in each token. For example, in the sentence "Who will win the football world cup in 2022?" unigrams would be a sequence of single words such as "who", "will", "win" and so on. Bigrams would be a sequence of 2 contiguous words such as "who will", "will win", and so on. So the `ngram_range` parameter we'll use in the code below sets the lower and upper bounds of our n-grams (we'll be using unigrams). Then we'll assign the vectorizer to `bow`.

*Source: Dataquest*
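
The n-gram idea can be sketched in a few lines of plain Python. The `ngrams` helper below is a hypothetical function written for this illustration; it is not what `CountVectorizer` runs internally, but it produces the same kind of word windows for the example sentence.

```python
def ngrams(tokens, n):
    """Return all runs of n adjacent tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "who will win the football world cup in 2022".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(unigrams[:3])
print(bigrams[:3])
```

With `ngram_range=(1,1)` only the unigrams are used as features; a range of `(1,2)` would use unigrams and bigrams together.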

In [20]:
from sklearn.feature_extraction.text import CountVectorizer

# We create our bag of words (bow) using our tokenizer and defining an ngram range
bow = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1)) 

#### TF-IDF

In short, TF-IDF is a way of normalizing the BoW by weighting each word's frequency in a document against the number of documents that contain it, so words that appear almost everywhere count for less.

We will skip the sweat from the past class and use the `scikit learn` TF-IDF functionality.

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Using Tfidf
tfvectorizer = TfidfVectorizer(tokenizer = spacy_tokenizer)
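
To make the weighting concrete, here is a hand-rolled sketch of the textbook formula, term frequency times `log(N / df)`, on three made-up token lists. This is an assumption chosen for clarity: scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ from the ones printed here.

```python
import math
from collections import Counter

docs = [["good", "food"], ["bad", "food"], ["good", "good", "service"]]
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Textbook weighting: (term count / doc length) * log(n_docs / doc_freq)."""
    counts = Counter(doc)
    return {
        term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }

for doc in docs:
    print({term: round(weight, 3) for term, weight in tfidf(doc).items()})
```

In the second document, "bad" gets a higher weight than "food" because it appears in only one document, while "food" appears in two.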

## Train-Test split

In machine learning, we always need to split our datasets into train and test sets. We will use one for training the model and the other one to check how the model performs on data it has not seen. Luckily, `sklearn` comes with a built-in function for this.

The split is done randomly, but we can set a seed value to make it reproducible during development. The usual split is 20% test and 80% train.

In [22]:
# Splitting Data Set
from sklearn.model_selection import train_test_split

In [23]:
# Features and Labels
X = df['Message']
ylabels = df['Target']

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.2, random_state=42)

In [25]:
X_train.shape

(2196,)

In [26]:
y_train.shape

(2196,)

In [27]:
X_test.shape

(549,)

In [28]:
y_test.shape

(549,)
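
The shapes above are consistent with the 80/20 split: 2196 + 549 = 2745 reviews. What a seeded random split does can be sketched with the standard library alone. `toy_train_test_split` is a hypothetical helper written for this illustration; it does not reproduce the exact shuffling of `train_test_split`, only the idea of a reproducible random hold-out.

```python
import random

def toy_train_test_split(items, labels, test_size=0.2, seed=42):
    """Shuffle indices with a fixed seed and hold out the first share as test."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(items) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [items[i] for i in train_idx]
    X_test = [items[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

reviews = list("abcdefghij")  # ten placeholder "reviews"
targets = [0, 1] * 5
X_tr, X_te, y_tr, y_te = toy_train_test_split(reviews, targets)
print(len(X_tr), len(X_te))
```

Calling the function again with the same seed returns the same split, which is the role `random_state=42` plays in the notebook.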

## The classifier

By choosing a classifier, we are choosing the strategy our model uses to learn. Since we are trying to do a classification (good and bad), we need to choose algorithms that are classifiers.

We can play with the [classifiers from sklearn](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model).

In [29]:
# SVC classifier
from sklearn.svm import LinearSVC

classifier_SVC = LinearSVC(verbose=True)

In [30]:
# Logistic Regression Classifier
from sklearn.linear_model import LogisticRegression

classifier_LG = LogisticRegression(verbose=True)

In [31]:
# Multi layer perceptron
from sklearn.neural_network import MLPClassifier

classifier_MLP =  MLPClassifier(verbose=True)

## The pipeline

We are going to create an `sklearn` pipeline that:

1. Cleans and preprocesses the text using our `predictors` class from above.
2. Vectorizes the words with either BoW or TF-IDF to create word matrices from our text.
3. Loads the classifier, which runs the algorithm we have chosen to classify the sentiments.

![alt text](https://www.dataquest.io/wp-content/uploads/2019/04/text-classification-python-spacy.png)
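
The chaining logic behind such a pipeline can be sketched in plain Python: every step but the last transforms the data, and the last step learns from the transformed result. Everything in this block (`ToyPipeline`, `Lowercaser`, `KeywordClassifier`, `pipe_sketch`) is hypothetical and written for illustration; it is a simplified stand-in for the idea, not scikit-learn's implementation, and the toy classifier is far cruder than the ones defined above.

```python
class ToyPipeline:
    """Simplified stand-in for the chaining idea of a pipeline."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, object) pairs

    def fit(self, X, y):
        # Fit and apply every step except the last, then fit the final estimator.
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        # Re-apply the already fitted transformations, then predict.
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)


class Lowercaser:
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [text.lower() for text in X]


class KeywordClassifier:
    """Toy classifier: remembers words seen only in positive training texts."""

    def fit(self, X, y):
        pos = {w for text, label in zip(X, y) if label == 1 for w in text.split()}
        neg = {w for text, label in zip(X, y) if label == 0 for w in text.split()}
        self.positive_words_ = pos - neg
        return self

    def predict(self, X):
        return [int(any(w in self.positive_words_ for w in text.split())) for text in X]


pipe_sketch = ToyPipeline([("cleaner", Lowercaser()), ("classifier", KeywordClassifier())])
pipe_sketch.fit(["Great food", "Terrible food"], [1, 0])
print(pipe_sketch.predict(["GREAT service", "terrible service"]))
```

The benefit of this design is that raw text goes in at both training and prediction time, and the same preprocessing is applied in both cases without any extra bookkeeping.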

In [32]:
from sklearn.pipeline import Pipeline

In [33]:
# Create the  pipeline to clean, tokenize, vectorize, and classify 
pipe = Pipeline([("cleaner", predictors()),
                 ('vectorizer', bow),
                 ('classifier', classifier_SVC)], verbose=True)

In [34]:
# Fit our data
pipe.fit(X_train,y_train)

[Pipeline] ........... (step 1 of 3) Processing cleaner, total=   0.0s
[Pipeline] ........ (step 2 of 3) Processing vectorizer, total=   0.6s
[LibLinear][Pipeline] ........ (step 3 of 3) Processing classifier, total=   0.0s


Pipeline(memory=None,
         steps=[('cleaner', <__main__.predictors object at 0x11089dba8>),
                ('vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function spacy_tokenizer at 0x10e282ae8>,
                                 vocabulary=None)),
                ('classifier',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
                      

## Model evaluation

Now that we have trained our model, let's look at how it performs! First of all we need to predict the labels of the test set using our model:

In [35]:
# Predicting with a test dataset
sample_prediction = pipe.predict(X_test)

Let's check the results for each sample:

In [36]:
# Prediction Results
# 1 = Positive review
# 0 = Negative review
for (sample,pred) in zip(X_test,sample_prediction):
    print(sample,"Prediction=>",pred)

Great pork sandwich. Prediction=> 1
It is a true classic.   Prediction=> 1
It's close to my house, it's low-key, non-fancy, affordable prices, good food. Prediction=> 0
Audio Quality is poor, very poor. Prediction=> 0
We loved the biscuits!!! Prediction=> 1
I don't have very many words to say about this place, but it does everything pretty well. Prediction=> 0
Was not happy. Prediction=> 1
The headsets are easy to use and everyone loves them. Prediction=> 1
I miss it and wish they had one in Philadelphia! Prediction=> 0
Still it's quite interesting and entertaining to follow.   Prediction=> 1
All three broke within two months of use. Prediction=> 0
Oh yeah, and the storyline was pathetic too.   Prediction=> 0
IT'S REALLY EASY. Prediction=> 1
Every element of this story was so over the top, excessively phony and contrived that it was painful to sit through.   Prediction=> 0
The food was outstanding and the prices were very reasonable. Prediction=> 1
I am so tired of clichés that is just

Now we can evaluate the model using the three main performance metrics:

* **Accuracy**: the fraction of all predictions our model makes that are correct.
* **Precision**: the ratio of true positives to true positives plus false positives, i.e. how many of the reviews predicted as positive really are positive.
* **Recall**: the ratio of true positives to true positives plus false negatives, i.e. how many of the truly positive reviews our model found.
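
These three definitions can be worked through by hand on a tiny example. The labels below are made up for this sketch, and `confusion_counts` is a hypothetical helper; on real data the `sklearn.metrics` functions do the same arithmetic.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

y_true_toy = [1, 0, 1, 1, 0]
y_pred_toy = [1, 0, 0, 1, 1]
tp, tn, fp, fn = confusion_counts(y_true_toy, y_pred_toy)

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(accuracy, precision, recall)
```

Here three of five predictions are correct, two of the three predicted positives are truly positive, and two of the three true positives were found.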

![alt text](https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Precisionrecall.svg/800px-Precisionrecall.svg.png)

In [63]:
from sklearn import metrics

# Accuracy, precision and recall on the test set
print("Accuracy:",metrics.accuracy_score(y_test, sample_prediction))
print("Precision:",metrics.precision_score(y_test, sample_prediction))
print("Recall:",metrics.recall_score(y_test, sample_prediction))

Accuracy: 0.7978142076502732
Precision: 0.8215613382899628
Recall: 0.778169014084507


## Let's use our model!

In [64]:
# Another random review
pipe.predict(["This was a great movie"])

array([1])

In [65]:
example = ["I do enjoy my job",
 "What a poor product!,I will have to get a new one",
 "I feel amazing!",
 "This class sucks"]

pipe.predict(example)

array([1, 0, 1, 0])

# Your turn

Compare results with another approach:

*   Try another vectorizer
*   Try another train/test split
*   Try another algorithm
*   Try changing the parameters of the algorithm (more on that in class)
*   If you feel hardcore: try [another dataset](https://lionbridge.ai/datasets/15-free-sentiment-analysis-datasets-for-machine-learning/)

