# Text Classification Using spaCy

The goal of this exercise is to use *review data* to predict whether an Amazon Alexa product review is **positive** or **negative**.

This dataset consists of 3,150 Amazon customer reviews (the input text), together with the star rating, review date, product variant, and feedback label for various Amazon Alexa products such as the Alexa Echo, Echo Dot, and Fire Stick. It is intended for learning how to train a model for sentiment analysis.

In [1]:
import spacy
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline

In [2]:
df_amazon = pd.read_csv("datas/amazon_alexa.tsv",sep='\t')

print(f"Shape of data: {df_amazon.shape}")
df_amazon.head()

Shape of data: (3150, 5)


|   | rating | date | variation | verified_reviews | feedback |
|---|--------|------|-----------|------------------|----------|
| 0 | 5 | 31-Jul-18 | Charcoal Fabric | Love my Echo! | 1 |
| 1 | 5 | 31-Jul-18 | Charcoal Fabric | Loved it! | 1 |
| 2 | 4 | 31-Jul-18 | Walnut Finish | Sometimes while playing a game, you can answer... | 1 |
| 3 | 5 | 31-Jul-18 | Charcoal Fabric | I have had a lot of fun with this thing. My 4 ... | 1 |
| 4 | 5 | 31-Jul-18 | Charcoal Fabric | Music | 1 |


## Understanding the Data

 - The data has five columns: rating, date, variation, verified_reviews, feedback.
 - *`rating`* denotes the star rating each user gave the product (out of 5).
 - *`date`* indicates the date of the review.
 - *`variation`* describes which model the user reviewed.
 - *`verified_reviews`* contains the text of each review.
 - *`feedback`* contains a sentiment label, with 1 denoting positive sentiment (the user liked it) and 0 denoting negative sentiment (the user didn’t).

In [3]:
df_amazon.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3150 entries, 0 to 3149
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   rating            3150 non-null   int64 
 1   date              3150 non-null   object
 2   variation         3150 non-null   object
 3   verified_reviews  3150 non-null   object
 4   feedback          3150 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 123.2+ KB


In [4]:
# Count reviews per label. Negative reviews are rare, and this class imbalance may make them harder to classify.
df_amazon.feedback.value_counts()

1    2893
0     257
Name: feedback, dtype: int64
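
To put the imbalance in perspective, here is a minimal sketch of the majority-class baseline, using only the two counts printed above. The variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {1: 2893, 0: 257}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a trivial "classifier" that always predicts the majority label.
baseline_accuracy = counts[majority_label] / total

print(f"Total reviews: {total}")
print(f"Majority label: {majority_label}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Always predicting "positive" would already be right for roughly 91.8% of the full dataset, so accuracy alone says little about how well negative reviews are handled.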

## Tokenizing the Text

In [5]:
import string
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS

In [6]:
punctuations = string.punctuation # string of ASCII punctuation characters
print(punctuations)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [7]:
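nlp = spacy.load("en_core_web_sm") # small pretrained English pipeline

stop_words = STOP_WORDS # spaCy's English stop-word set, imported above

parser = English() # blank English pipeline (tokenizer only)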

In [8]:
def spacy_tokenizer(sentence):
    """
    Accept a sentence and return a list of tokens after lemmatization,
    lowercasing, and removal of stop words and punctuation.
    """
    tokens = nlp(sentence) # run the spaCy pipeline; returns a Doc of tokens
    
    # Lemmatize and lowercase each token. Older spaCy versions (v2) lemmatize
    # pronouns to the placeholder "-PRON-", so pronouns keep their lowercased text.
    tokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in tokens ]
    
    # Remove stop words and punctuation
    tokens = [word for word in tokens if word not in stop_words and word not in punctuations]
    
    # Return the preprocessed list of tokens
    return tokens
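
One subtlety in the filter above: `punctuations` is a string, so `word not in punctuations` is a substring test, not a per-character membership test. The following sketch compares the two behaviours; `punctuation_set` is a name introduced here for illustration only.

```python
import string

punctuations = string.punctuation      # a str, as in the cell above
punctuation_set = set(punctuations)    # a set of single characters

for token in ["!", "...", "./", ""]:
    in_string = token in punctuations  # substring test
    in_set = token in punctuation_set  # exact membership test
    print(f"{token!r:6} in string: {in_string!s:5}  in set: {in_set}")
```

With the string, an empty token (what a whitespace token becomes after `.strip()`) counts as "punctuation" and is dropped, while a multi-character token such as `...` passes through either way.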

### Data Cleaning

In [9]:
# Custom transformer using spaCy
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        """
        Override the transform method to clean text
        """
        return [clean_text(text) for text in X]
    
    def fit(self, X, y= None, **fit_params):
        return self
    
    def get_params(self, deep= True):
        return {}

# Basic function to clean the text
def clean_text(text):
    """
    Strip leading and trailing whitespace and convert the text to lowercase
    """
    return text.strip().lower() 

## Feature Engineering

### Vectorization

The labels (`feedback`, 0 or 1) are already numeric, but the review text is not. A `Bag of Words (BoW)` representation is used to convert the text into numeric form.
**BoW** converts text into a matrix of word occurrences: each row is a document, each column is a vocabulary term, and each entry counts how many times that term occurs in that document. Word order is ignored. The result is called the BoW matrix or document-term matrix.

In [10]:
# Use sklearn's CountVectorizer to generate the BoW matrix.
# ngram_range=(1, 1) sets both the lower and upper n-gram bounds to 1, i.e. unigrams only.
bow_vector = CountVectorizer(tokenizer=spacy_tokenizer, ngram_range=(1, 1))
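
As a rough illustration of what a document-term matrix looks like, here is a plain-Python sketch on three toy documents. The documents are made up, and whitespace splitting stands in for `spacy_tokenizer`; this is not how `CountVectorizer` is implemented.

```python
from collections import Counter

# Toy documents (illustrative, not taken from the dataset).
docs = ["love my echo", "love love it", "echo stopped working"]

# Whitespace tokenization stands in for spacy_tokenizer here.
tokenized = [doc.split() for doc in docs]

# Vocabulary: sorted unique terms, one matrix column per term.
vocabulary = sorted({term for tokens in tokenized for term in tokens})

# Document-term matrix: one row per document, one count per vocabulary term.
bow_matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    bow_matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in bow_matrix:
    print(row)
```

The second document contains "love" twice, so its entry in the "love" column is 2 rather than 1.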

### TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF normalizes the `Bag of Words (BoW)` counts by weighing each word's frequency within a document against the number of documents that contain it. It represents how important a term is to a given document: the weight rises with how often the term appears in that document and falls with how many documents it appears in.

In [11]:
tfidf_vector = TfidfVectorizer(tokenizer=spacy_tokenizer)
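
The weighting can be sketched in plain Python. The toy documents and the helper `tfidf` are hypothetical. The smoothed IDF term, `ln((1 + n) / (1 + df)) + 1`, follows the formula scikit-learn documents for its default settings; `TfidfVectorizer` additionally L2-normalizes each row, which this sketch leaves out.

```python
import math

# Toy tokenized documents (illustrative, not taken from the dataset).
docs = [["love", "echo"], ["love", "love", "it"], ["echo", "broke"]]
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
doc_freq = {}
for tokens in docs:
    for term in set(tokens):
        doc_freq[term] = doc_freq.get(term, 0) + 1

def tfidf(term, tokens):
    """Raw term count times a smoothed inverse document frequency."""
    tf = tokens.count(term)
    idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
    return tf * idf

print(f"love in doc 1: {tfidf('love', docs[1]):.3f}")
print(f"it   in doc 1: {tfidf('it', docs[1]):.3f}")
```

Per occurrence, "it" (found in one document) receives a larger weight than "love" (found in two), which is the effect the IDF factor is meant to produce.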

### Create Train and Test Datasets

The training set is used to *train the model* and the test set to *test the model's* performance.

In [12]:
from sklearn.model_selection import train_test_split

X = df_amazon['verified_reviews'] # the feature to analyze
ylabels = df_amazon['feedback'] # the labels

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3,random_state=1)

print(f'X_train dimension: {X_train.shape}')
print(f'y_train dimension: {y_train.shape}')
print(f'X_test dimension: {X_test.shape}')
print(f'y_test dimension: {y_test.shape}')

X_train dimension: (2205,)
y_train dimension: (2205,)
X_test dimension: (945,)
y_test dimension: (945,)


### Creating a Pipeline and Generating the Model

A `LogisticRegression` classifier is used to classify the reviews.

The pipeline contains three components: a **cleaner**, a **vectorizer**, and a **classifier**.
 - `Cleaner:` uses our `predictors` class to clean and preprocess the text.
 - `Vectorizer:` uses the `CountVectorizer` object to create the bag-of-words matrix for the text.
 - `Classifier:` performs logistic regression to classify the sentiment.

In [13]:
# Logistic regression classifier
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()

# Creating pipeline using BoW
model = Pipeline([("cleaner",predictors()),
                 ("vectorizer",bow_vector),
                 ("classifier",classifier)])

# Model generation
model.fit(X_train,y_train)

Pipeline(steps=[('cleaner', <__main__.predictors object at 0x00000125A8C78340>),
                ('vectorizer',
                 CountVectorizer(tokenizer=<function spacy_tokenizer at 0x00000125A65B2A60>)),
                ('classifier', LogisticRegression())])

### Model Score

The model has been trained on the training data; the test data is now used for `model evaluation`.

The metrics that will be used to evaluate the model:
 - `Accuracy` is the fraction of all predictions that are correct.
 - `Precision` is the ratio of true positives to true positives plus false positives, i.e. how many predicted positives are actually positive.
 - `Recall` is the ratio of true positives to true positives plus false negatives, i.e. how many actual positives the model finds.
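
The three definitions above can be sketched in plain Python on a handful of toy labels. The labels are made up for illustration; the notebook itself relies on `sklearn.metrics`.

```python
# Toy labels (illustrative): 1 = positive, 0 = negative.
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"accuracy={accuracy}, precision={precision}, recall={recall}")
```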

In [14]:
from sklearn import metrics

# predicting with the test dataset
predicted = model.predict(X_test)

# model accuracy, precision, and recall
print(f'Logistic Regression Accuracy: {metrics.accuracy_score(y_test, predicted)}')
print(f'Logistic Regression Precision: {metrics.precision_score(y_test, predicted)}')
print(f'Logistic Regression Recall: {metrics.recall_score(y_test, predicted)}')

Logistic Regression Accuracy: 0.9375661375661376
Logistic Regression Precision: 0.9501661129568106
Logistic Regression Recall: 0.9839449541284404


Overall, the model correctly identified a review’s sentiment **93.76%** of the time. When it predicted a review was positive, that review was actually positive **95.02%** of the time. When handed a positive review, the model identified it as positive **98.39%** of the time.
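
The printed scores are consistent with one particular set of confusion counts over the 945 test reviews. The counts below are back-computed from those scores and are not output by the notebook, so treat them as an inference. The sketch also derives the recall on the negative class, which the cell above does not report.

```python
# Confusion counts inferred from the printed scores (an assumption, not notebook output).
tp, fn = 858, 14  # positive reviews: predicted positive / predicted negative
fp, tn = 45, 28   # negative reviews: predicted positive / predicted negative

n_test = tp + fn + fp + tn
accuracy = (tp + tn) / n_test
precision = tp / (tp + fp)
recall = tp / (tp + fn)
negative_recall = tn / (tn + fp)  # share of negative reviews identified as negative

print(f"Test reviews: {n_test}")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"Recall on negative reviews: {negative_recall:.4f}")
```

Under this inference, only 28 of the 73 negative test reviews are caught, which fits the earlier concern that the small number of negative reviews could cause problems during classification.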