<a href="https://colab.research.google.com/github/michalis0/DataMining_and_MachineLearning/blob/master/week6/Text_Analytics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Mining and Machine Learning - Week 6
# Text Analytics

[Text Analytics](https://people.ischool.berkeley.edu/~hearst/text-mining.html) (or text mining) is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources." Written resources may include websites, books, emails, reviews, and articles.

### Table of Contents
#### 1. Summary
* 1.1 Applications
* 1.2 Tokenization and Stopwords
* 1.3 Lemmatization and Stemming
* 1.4 Text Representation

#### 2. Text Preparation
* 2.1 Install spaCy
* 2.2 Tokenization
* 2.3 Dependency Parsing
* 2.4 Remove Stopwords
* 2.5 Lemmatization
* 2.6 Entity Detection 

#### 3. Text Representation
* 3.1 Bag of Words (BOW)
* 3.2 TF-IDF Representation

#### 4. Text Classification: Alexa Reviews
* 4.1 Load and prepare data
* 4.2 Classification of the reviews using logistic regression

## 1. Summary

### 1.1 Applications
There are many applications of text analytics, for example:
* Search for relevant websites or articles using a search engine
* Sentiment Analysis (e.g. classify tweets or film reviews as positive, neutral or negative)
* Chatbots (e.g. Siri, Alexa)
* Project idea: The Impact of Donald Trump’s Tweets on Financial.
* Etc.

### 1.2 Tokenization and Stopwords
Tokens are the elementary building blocks (words, numbers, characters) in a document. Tokenization is the process of splitting an input sequence into tokens. Example: "I love data science" --> "I", "love", "data", "science". Stopwords are common words that appear very frequently (e.g. "is", "and", "you", etc.). It is convenient to remove them as they do not add much to the content of a document and are therefore generally not useful for text analysis.
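As a minimal sketch of these two ideas in plain Python (the regular expression and the tiny stopword set are illustrative assumptions, not what spaCy uses internally):

```python
import re

# Hypothetical mini stopword list, for illustration only
STOPWORDS = {"i", "is", "and", "you", "the", "a"}

def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def remove_stopwords(tokens):
    """Drop tokens whose lowercase form is in the stopword list."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = tokenize("I love data science")
print(tokens)                    # ['I', 'love', 'data', 'science']
print(remove_stopwords(tokens))  # ['love', 'data', 'science']
```

Real tokenizers handle many more cases (contractions, abbreviations, URLs), which is why section 2 relies on spaCy.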

### 1.3 Lemmatization and Stemming
* Goal: have the same token for different forms of a word (e.g. fishing, fished, fisher, fishers, etc.)
* Lemmatization: find the lemma (dictionary form) of a word (e.g. feet -> foot)
* Stemming: a simpler, rule-based way to reach the same goal, where rules that remove the ending of a word are applied (e.g. fishing -> fish)


### 1.4 Text Representation
* Goal: transform text such that it can be used for text analysis
* Bag of Words (BOW): works in many cases but word order is not preserved (solution: n-grams).
* TF-IDF: emphasizes important words.
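The point about word order can be illustrated with a short sketch. The two toy sentences and the `ngrams` helper are assumptions made for this example:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = "dog bites man".split()
b = "man bites dog".split()

# Unigram counts are identical: word order is lost
print(Counter(a) == Counter(b))                        # True
# Bigram counts differ: some order information is kept
print(Counter(ngrams(a, 2)) == Counter(ngrams(b, 2)))  # False
print(ngrams(a, 2))                                    # ['dog bites', 'bites man']
```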

## 2. Text Preparation
In this section, we explain how to prepare a text for analysis. This includes tokenizing the text, removing stopwords, etc.

### 2.1 Install spaCy
[spaCy](https://spacy.io/) is an open-source natural language processing library for Python. It is designed particularly for production use, and it can help us to build applications that process massive volumes of text efficiently.

We install the library and its English-language model.

In [None]:
# Install and update spaCy
!pip install -U spacy

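# Download the English language model
# (the shortcut `en` works with spaCy 2.x, used here; spaCy 3+ requires the full name `en_core_web_sm`)
!python -m spacy download en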

Requirement already up-to-date: spacy in /usr/local/lib/python3.6/dist-packages (2.3.2)
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


In [None]:
# Import required packages
import spacy
from spacy import displacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd

### 2.2 Tokenization

Tokenization is the process of breaking a text into pieces called tokens. A token simply refers to an individual part of a sentence having some semantic value. spaCy’s tokenizer takes Unicode text as input and outputs a sequence of token objects. In addition, spaCy automatically breaks your document into tokens when a document is created using the language model.

Let’s take a look at a simple example. Imagine we have the following text, and we’d like to tokenize it:

> When learning data science, you shouldn't get discouraged!

> Challenges and setbacks aren't failures, they're just part of the journey. You've got this!

There are a couple of different ways we can approach this. The first is called __word tokenization__, which means breaking up the text into individual words. This is a critical step for many language processing applications, as they often require inputs in the form of individual words rather than longer strings of text.

In [None]:
# Load English language model
sp = spacy.load('en_core_web_sm')

# Declare the text
text = """When learning data science, you shouldn't get discouraged!
Challenges and setbacks aren't failures, they're just part of the journey. You've got this!"""

# spaCy object is used to create documents with linguistic annotations.
my_doc = sp(text)

my_doc

When learning data science, you shouldn't get discouraged!
Challenges and setbacks aren't failures, they're just part of the journey. You've got this!

In [None]:
type(my_doc)

spacy.tokens.doc.Doc

In [None]:
# Create list of word tokens
token_list = []

for token in my_doc:
    token_list.append(token.text)

token_list

['When',
 'learning',
 'data',
 'science',
 ',',
 'you',
 'should',
 "n't",
 'get',
 'discouraged',
 '!',
 '\n',
 'Challenges',
 'and',
 'setbacks',
 'are',
 "n't",
 'failures',
 ',',
 'they',
 "'re",
 'just',
 'part',
 'of',
 'the',
 'journey',
 '.',
 'You',
 "'ve",
 'got',
 'this',
 '!']


As we can see, spaCy produces a list that contains each token as a separate item. Notice that it has recognized that contractions such as _shouldn’t_ actually represent two distinct words, and it has thus broken them down into two distinct tokens.

In the example above, we first load the English language model with `spacy.load('en_core_web_sm')` and store it in the object `sp`, which is used to create documents with linguistic annotations and various language properties. After creating the document, we iterate over it to build a list of tokens.

We can also see the parts-of-speech (POS) of each of these tokens using the `.pos_` attribute shown below. POS tagging can be really useful, particularly if you have words or tokens that can have multiple POS tags. For instance, the word "fish" can be used as both a noun and verb, depending upon the context.

In [None]:
# POS
for word in my_doc:
    print(word.text, word.pos_)

When ADV
learning VERB
data NOUN
science NOUN
, PUNCT
you PRON
should VERB
n't PART
get AUX
discouraged ADJ
! PUNCT

 SPACE
Challenges NOUN
and CCONJ
setbacks NOUN
are AUX
n't PART
failures NOUN
, PUNCT
they PRON
're AUX
just ADV
part NOUN
of ADP
the DET
journey NOUN
. PUNCT
You PRON
've AUX
got VERB
this DET
! PUNCT


In [None]:
# Another example
doc1 = sp("I like to fish") # verb
doc2 = sp("I eat a fish") # noun

for word in doc1:
  print(word.text, word.pos_)

print("-----------------")

for word in doc2:
  print(word.text, word.pos_)

I PRON
like VERB
to PART
fish VERB
-----------------
I PRON
eat VERB
a DET
fish NOUN



If we want, we can also break the text into sentences rather than words. This is called __sentence tokenization__. A simple sentence tokenizer looks for specific characters that fall between sentences, like periods, exclamation points, and newline characters. spaCy goes further: besides the tokenizer, the loaded pipeline includes a tagger, a parser and an entity recognizer, and it relies on the parser's analysis to correctly identify what’s a sentence and what isn’t.

In the code below, we reuse the Doc object that spaCy created when it ran the text through this pipeline. We can extract any annotation from it, but here we’re going to access the sentences through the Doc's `sents` attribute.

In [None]:
# create list of sentence tokens
sents_list = []

for sent in my_doc.sents:
    sents_list.append(sent.text)

sents_list

["When learning data science, you shouldn't get discouraged!\n",
 "Challenges and setbacks aren't failures, they're just part of the journey.",
 "You've got this!"]

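For contrast, here is a naive rule-based splitter written with the standard library only. The `naive_sentences` helper is a hypothetical illustration and is not how spaCy determines sentence boundaries:

```python
import re

def naive_sentences(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = """When learning data science, you shouldn't get discouraged!
Challenges and setbacks aren't failures, they're just part of the journey. You've got this!"""

# Works on our example text: three sentences
print(naive_sentences(text))

# Breaks on abbreviations: the period in "U.S." is taken for a sentence end
print(naive_sentences("U.S. stocks peaked in late January."))
# ['U.S.', 'stocks peaked in late January.']
```

Cases like the second one are the reason a statistical pipeline is preferable to fixed punctuation rules.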
### 2.3 Dependency Parsing
__Dependency parsing__ is a language processing technique that allows us to better determine the meaning of a sentence by analyzing how it’s constructed and how the individual words relate to each other.

Consider, for example, the sentence “Joe throws the ball.” We have two nouns (Joe and ball) and one verb (throws). But we can’t just look at these words individually, or we may end up thinking that the ball is throwing Joe! To understand the sentence correctly, we need to look at the word order and sentence structure, not just the words and their parts of speech.

Below, we have a short sentence. We’ll use a spaCy `Doc` attribute called `noun_chunks`, which breaks the input down into nouns and the words describing them. We iterate through each chunk in our source text and print the chunk's text, its root word, the root's dependency label, and the root's head (the word the root is attached to).

In [None]:
doc = sp(" Joe threw a ball and President Donald, in pursuit of the ball, hit a wall.") # notice the space at the beginning

for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text)

 Joe Joe nsubj threw
a ball ball dobj threw
President Donald Donald conj ball
pursuit pursuit pobj in
the ball ball pobj of
a wall wall dobj hit
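To show how such dependency labels can be used, the sketch below groups subjects (`nsubj`) and direct objects (`dobj`) by the verb they attach to. The triples are copied by hand from the output above, and the `who_did_what` helper is a hypothetical name introduced for this example:

```python
# (chunk root, dependency label, head) triples copied from the output above
triples = [
    ("Joe", "nsubj", "threw"),
    ("ball", "dobj", "threw"),
    ("Donald", "conj", "ball"),
    ("pursuit", "pobj", "in"),
    ("ball", "pobj", "of"),
    ("wall", "dobj", "hit"),
]

def who_did_what(triples):
    """Group subjects (nsubj) and direct objects (dobj) by their head verb."""
    actions = {}
    for word, dep, head in triples:
        if dep in ("nsubj", "dobj"):
            actions.setdefault(head, {})[dep] = word
    return actions

print(who_did_what(triples))
# {'threw': {'nsubj': 'Joe', 'dobj': 'ball'}, 'hit': {'dobj': 'wall'}}
```

Note that no subject is found for "hit": in this parse, "President Donald" was attached to "ball" as a conjunct (`conj`) rather than to "hit" as its subject, so the parser's output should not be trusted blindly.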


In [None]:
# Let's visualize this
displacy.render(doc, style="dep", jupyter= True, options={'distance': 120})

### 2.4 Remove Stopwords
Most text data that we work with is going to contain a lot of words that aren’t actually useful to us. These words, called stopwords, are useful in human speech, but they don’t have much to contribute to data analysis. Removing stopwords helps us eliminate noise and distraction from our text data, and also speeds up the analysis (since there are fewer words to process).

Let’s take a look at the stopwords spaCy includes by default.

In [None]:
# Import stopwords from English language
from spacy.lang.en.stop_words import STOP_WORDS
spacy_stopwords = STOP_WORDS

# Print total number of stopwords
print('Number of stop words: %d' % len(spacy_stopwords))

# Print 20 stopwords (STOP_WORDS is a set, so their order is arbitrary)
print('First 20 stop words: %s' % list(spacy_stopwords)[:20])

Number of stop words: 326
First 20 stop words: ['so', 'cannot', 'everyone', 'through', 'eight', 'two', 'whereafter', 'an', 'regarding', 'anyone', 'onto', 'someone', 'three', 'during', 'five', 'always', 'they', 'does', 'neither', 'only']


Now that we’ve got our list of stopwords, let’s use it to remove the stopwords from the text string we were working on in the previous section.

In [None]:
my_doc

When learning data science, you shouldn't get discouraged!
Challenges and setbacks aren't failures, they're just part of the journey. You've got this!

In [None]:
# Declare list for filtered sentence
filtered_sent = []

# Filter stopwords
for word in my_doc:
    if not word.is_stop:
        filtered_sent.append(word)

filtered_sent

[learning,
 data,
 science,
 ,,
 discouraged,
 !,
 ,
 Challenges,
 setbacks,
 failures,
 ,,
 journey,
 .,
 got,
 !]

In [None]:
# We can also remove the punctuation
filtered_sent2 = []
removed_tokens = []

# Filter stopwords and punctuation
for word in my_doc:
  if word.is_stop or word.is_punct:
    removed_tokens.append(word)
  else:
    filtered_sent2.append(word)

removed_tokens

[When,
 ,,
 you,
 should,
 n't,
 get,
 !,
 and,
 are,
 n't,
 ,,
 they,
 're,
 just,
 part,
 of,
 the,
 .,
 You,
 've,
 this,
 !]

In [None]:
filtered_sent2

[learning,
 data,
 science,
 discouraged,
 ,
 Challenges,
 setbacks,
 failures,
 journey,
 got]

### 2.5 Lemmatization
Lemmatization is a way of dealing with the fact that while words like connect, connection, connecting, connected, etc. aren’t exactly the same, they all have the same essential meaning: connect. The differences in spelling have grammatical functions in spoken language, but for machine processing, those differences can be confusing, so we need a way to change all the words that are forms of the word connect into the word connect itself.

One method for doing this is called __stemming__. Stemming involves simply lopping off easily-identified prefixes and suffixes to produce what’s often the simplest version of a word, the root. Connection, for example, would have the -ion suffix removed and be correctly reduced to connect. This kind of simple stemming is often all that’s needed, but lemmatization, which actually looks at words and their base forms (called lemmas) as described in the dictionary, is more precise (as long as the words exist in the dictionary).

Let's look at this simple example.

In [None]:
# Lemmatization
lem = sp("run runs ran running runner runners")

# Find lemma for each word
for word in lem:
    print(word.text, word.lemma_)

run run
runs run
ran run
running run
runner runner
runners runner
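To contrast these lemmas with stemming, here is a deliberately naive suffix-stripping stemmer applied to the same words. The suffix rules are a hypothetical toy, not Porter's algorithm or any library implementation:

```python
def naive_stem(word):
    """Toy stemmer: strip the first matching suffix, keeping at least 3 characters."""
    for suffix in ("ing", "ers", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = "run runs ran running runner runners".split()
for w in words:
    print(w, naive_stem(w))
# run run
# runs run
# ran ran
# running runn
# runner runn
# runners runn
```

The irregular form "ran" is left untouched and "running" becomes the non-word "runn", whereas the lemmatizer above maps both to "run".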


### 2.6 Entity Detection

__Entity detection__, also called entity recognition, is a more advanced form of language processing that identifies important elements like places, people, organizations, and languages within a text. This is really helpful for quickly extracting information from the text, since you can quickly pick out important topics or identify key sections of it.

Let’s try out some entity detection using a few paragraphs from this [article](https://www.bloomberg.com/features/trump-tweets-market/).

In [None]:
article = sp("""
President Donald Trump gets a lot of attention for using Twitter to attack American trading partners, political foes, and media companies. But he often takes to the platform to celebrate the strength of the world’s largest economy and its publicly-traded companies.

Before U.S. stocks peaked in late January, he drew a direct connection between the increase in market value of American companies and his administration’s pro-growth policies on more than 10 occasions in that month alone.
""")

entities = [(i, i.label_, i.label) for i in article.ents]
entities

[(Donald Trump, 'PERSON', 380),
 (Twitter, 'ORG', 383),
 (American, 'NORP', 381),
 (U.S., 'GPE', 384),
 (late January, 'DATE', 391),
 (American, 'NORP', 381),
 (more than 10, 'CARDINAL', 397),
 (that month alone, 'DATE', 391)]

The example above shows that spaCy is able to identify a variety of different entity types, including specific locations (GPE), date-related words (DATE), important numbers (CARDINAL), specific individuals (PERSON), etc.
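Once entities are extracted, they can be summarized with ordinary Python. The sketch below works on (text, label) pairs copied by hand from the output above, so it runs without spaCy:

```python
from collections import Counter, defaultdict

# (entity text, label) pairs copied from the output above
entities = [
    ("Donald Trump", "PERSON"), ("Twitter", "ORG"), ("American", "NORP"),
    ("U.S.", "GPE"), ("late January", "DATE"), ("American", "NORP"),
    ("more than 10", "CARDINAL"), ("that month alone", "DATE"),
]

# How often does each entity type occur?
label_counts = Counter(label for _, label in entities)

# Which distinct strings were found for each type?
by_label = defaultdict(set)
for text, label in entities:
    by_label[label].add(text)

print(label_counts.most_common())
print(sorted(by_label["DATE"]))  # ['late January', 'that month alone']
```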

Using `displaCy` we can also visualize the text, with each identified entity highlighted by color and labeled. We’ll use `style = "ent"` to tell displaCy that we want to visualize entities here.

In [None]:
displacy.render(article, style = "ent", jupyter = True)

## 3. Text Representation
We now show how to transform a text into a usable input for text classification. We use the article from the last section and two other sentences.

In [None]:
# Article as a string, not a spacy object
article = """
President Donald Trump gets a lot of attention for using Twitter to attack American trading partners, political foes, and media companies. But he often takes to the platform to celebrate the strength of the world’s largest economy and its publicly-traded companies.

Before U.S. stocks peaked in late January, he drew a direct connection between the increase in market value of American companies and his administration’s pro-growth policies on more than 10 occasions in that month alone.
"""

# Sentences
s1 = """Donald Trump is a great friend, and he has four or five Picassos on his plane. And that's where I would look at them.""" # from Shaquille O'Neal
s2 = """Donald Trump is a phony, a fraud. His promises are as worthless as a degree from Trump University.""" # from Mitt Romney

texts = [article, s1, s2]

### 3.1 Bag of Words (BOW)

In [None]:
# Using default tokenizer 
count = CountVectorizer(ngram_range=(1,2), stop_words="english")
bow = count.fit_transform(texts)

# Show feature matrix
bow.toarray()

array([[1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 0, 0, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
        0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
        1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
        0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

In [None]:
# Get feature names
# (on scikit-learn >= 1.0 use get_feature_names_out(); get_feature_names() was removed in 1.2)
feature_names = count.get_feature_names()

# View feature names
feature_names[:10]

['10',
 '10 occasions',
 'administration',
 'administration pro',
 'american',
 'american companies',
 'american trading',
 'attack',
 'attack american',
 'attention']

In [None]:
# Show as a dataframe
pd.DataFrame(
    bow.todense(), 
    columns=feature_names
    )

Unnamed: 0,10,10 occasions,administration,administration pro,american,american companies,american trading,attack,attack american,attention,attention using,celebrate,celebrate strength,companies,companies administration,companies stocks,companies takes,connection,connection increase,degree,degree trump,direct,direct connection,donald,donald trump,drew,drew direct,economy,economy publicly,foes,foes media,fraud,fraud promises,friend,friend picassos,gets,gets lot,great,great friend,growth,...,platform,platform celebrate,policies,policies 10,political,political foes,president,president donald,pro,pro growth,promises,promises worthless,publicly,publicly traded,stocks,stocks peaked,strength,strength world,takes,takes platform,traded,traded companies,trading,trading partners,trump,trump gets,trump great,trump phony,trump university,twitter,twitter attack,university,using,using twitter,value,value american,world,world largest,worthless,worthless degree
0,1,1,1,1,2,1,1,1,1,1,1,1,1,3,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1,1,1,0,0,0,0,1,1,0,0,1,...,1,1,1,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,1,1,0,1,1,1,1,1,1,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,1,1,0,0,1,0,0,0,0,0,0,1,1
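The structure of such a count matrix can be reproduced by hand. This is a minimal sketch on a hypothetical two-document toy corpus (unigrams only, already tokenized), not a reimplementation of `CountVectorizer`:

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized and lowercased
docs = [
    ["i", "love", "data", "science"],
    ["i", "love", "love", "fish"],
]

# Vocabulary: every distinct token, sorted alphabetically (one column per token)
vocab = sorted({token for doc in docs for token in doc})

# One row per document, one count per vocabulary entry
matrix = [[Counter(doc)[token] for token in vocab] for doc in docs]

print(vocab)   # ['data', 'fish', 'i', 'love', 'science']
print(matrix)  # [[1, 0, 1, 1, 1], [0, 1, 1, 2, 0]]
```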


### 3.2 TF-IDF Representation


Recall that:

- term frequency tf = count(word, document) / len(document) 
- inverse document frequency idf = log( len(collection) / count(document_containing_term, collection) )
- tf-idf = tf * idf 

It is important to mention that the IDF value for a word remains the same throughout all the documents as it depends upon the total number of documents. On the other hand, TF values of a word differ from document to document.

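As an example, suppose the word "car" accounts for 1 of the 7 words in a document. The TF for the word "car" in that document is 1/7 ≈ 0.14.

Let's find the IDF of the word "car". Since we have 2 documents and the word "car" occurs in 1 of them, the IDF value of the word "car" is log(2/1) ≈ 0.69 (using the natural logarithm).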



Finally, the TF-IDF values are calculated by multiplying TF values with their corresponding IDF values.
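The formulas above can be evaluated directly for the "car" example. This sketch assumes the natural logarithm; with a base-10 logarithm the IDF would be about 0.30 instead:

```python
import math

def tf(term_count, doc_length):
    """Term frequency: share of the document's words that are the term."""
    return term_count / doc_length

def idf(n_docs, n_docs_with_term):
    """Inverse document frequency, using the natural logarithm."""
    return math.log(n_docs / n_docs_with_term)

tf_car = tf(1, 7)    # "car" is 1 of 7 words
idf_car = idf(2, 1)  # 2 documents, 1 of them contains "car"

print(round(tf_car, 4), round(idf_car, 4), round(tf_car * idf_car, 4))
# 0.1429 0.6931 0.099
```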

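**Note**: In the example below, you may not get the exact values by multiplying those two numbers, because scikit-learn's `TfidfVectorizer` by default uses raw term counts with a smoothed IDF, log((1 + n) / (1 + df)) + 1, and then normalizes each row to have a Euclidean norm of 1. Row normalization rescales all terms of a document by the same factor, so their relative importance within the document doesn't change.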




In [None]:
# Using default tokenizer in TfidfVectorizer
tfidf = TfidfVectorizer(ngram_range=(1, 1), stop_words="english")
features = tfidf.fit_transform(texts)
pd.DataFrame(
    features.todense(),
    columns=tfidf.get_feature_names()  # on scikit-learn >= 1.0 use get_feature_names_out()
)

Unnamed: 0,10,administration,american,attack,attention,celebrate,companies,connection,degree,direct,donald,drew,economy,foes,fraud,friend,gets,great,growth,increase,january,largest,late,look,lot,market,media,month,occasions,partners,peaked,phony,picassos,plane,platform,policies,political,president,pro,promises,publicly,stocks,strength,takes,traded,trading,trump,twitter,university,using,value,world,worthless
0,0.13908,0.13908,0.27816,0.13908,0.13908,0.13908,0.41724,0.13908,0.0,0.13908,0.082143,0.13908,0.13908,0.13908,0.0,0.0,0.13908,0.0,0.13908,0.13908,0.13908,0.13908,0.13908,0.0,0.13908,0.13908,0.13908,0.13908,0.13908,0.13908,0.13908,0.0,0.0,0.0,0.13908,0.13908,0.13908,0.13908,0.13908,0.0,0.13908,0.13908,0.13908,0.13908,0.13908,0.13908,0.082143,0.13908,0.0,0.13908,0.13908,0.13908,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.247433,0.0,0.0,0.0,0.0,0.41894,0.0,0.41894,0.0,0.0,0.0,0.0,0.0,0.41894,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.41894,0.41894,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.247433,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.359347,0.0,0.212236,0.0,0.0,0.0,0.359347,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.359347,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.359347,0.0,0.0,0.0,0.0,0.0,0.0,0.424472,0.0,0.359347,0.0,0.0,0.0,0.359347
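The third row of the table above (index 2, the Mitt Romney quote) can be reproduced by hand with the standard library. The sketch assumes scikit-learn's documented defaults (raw counts, smoothed IDF, L2 row normalization); the term counts are those of the row's nonzero columns and were entered manually:

```python
import math

n_docs = 3

# Term counts in the third text after English stop-word removal
counts = {"degree": 1, "donald": 1, "fraud": 1, "phony": 1, "promises": 1,
          "trump": 2, "university": 1, "worthless": 1}

# Number of texts containing each term: "donald" and "trump" occur in all three
doc_freq = {term: 1 for term in counts}
doc_freq.update({"donald": 3, "trump": 3})

def smooth_idf(df, n):
    """Smoothed IDF, as documented for TfidfVectorizer(smooth_idf=True)."""
    return math.log((1 + n) / (1 + df)) + 1

raw = {t: c * smooth_idf(doc_freq[t], n_docs) for t, c in counts.items()}
norm = math.sqrt(sum(v * v for v in raw.values()))  # L2 norm of the row
row = {t: v / norm for t, v in raw.items()}

for term in ("donald", "trump", "fraud"):
    print(term, round(row[term], 6))
# donald 0.212236
# trump 0.424472
# fraud 0.359347
```

Terms that appear in every text ("donald", "trump") get the minimum IDF of 1, which is why "fraud" outweighs "donald" even though both occur once.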


## 4. Text Classification: Alexa reviews

We’re going to use a real-world data set: [Amazon Alexa product reviews](https://www.kaggle.com/sid321axn/amazon-alexa-reviews/download).

This data set comes as a tab-separated file (.tsv). It has five columns: `rating`, `date`, `variation`, `verified_reviews`, `feedback`.

`rating` denotes the rating each user gave the Alexa (out of 5). `date` indicates the date of the review, and `variation` describes which model the user reviewed. `verified_reviews` contains the text of each review, and `feedback` contains a sentiment label, with 1 denoting positive sentiment (the user liked it) and 0 denoting negative sentiment (the user didn’t).

This dataset has consumer reviews of Amazon Alexa products like Echos, Echo Dots, Alexa Firesticks, etc. What we’re going to do is develop a classification model that looks at the review text and predicts whether a review is positive or negative. Since this data set already includes whether a review is positive or negative in the `feedback` column, we can use those answers to train and test our model. Our goal here is to produce an accurate model that we could then use to process new user reviews and quickly determine whether they were positive or negative.

### 4.1 Load and prepare data

In [None]:
# Import additional packages
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
import string
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

In [None]:
# Load data
df = pd.read_csv("https://raw.githubusercontent.com/ahmadajal/2019_DM_ML_course/master/6.%20Handling%20Text/data/amazon_alexa.tsv?token=ALM4BCKF2A25WIOVGTL2W327TLN5A", delimiter="\t")
df.sample(10)

Unnamed: 0,rating,date,variation,verified_reviews,feedback
1054,4,30-Jul-18,Black Spot,Worthy successor to the echo dot and right at ...,1
1402,3,31-Jul-18,Black Show,It seems to work well. Unfortunately a lot of...,1
1235,5,26-Jul-18,Black Spot,It was easy set up. I use it more than I thou...,1
2087,5,2-Jul-18,Black Plus,Love this device,1
3133,4,30-Jul-18,White Dot,I like having more Alexa devices in my house a...,1
868,5,30-Jul-18,Charcoal Fabric,BEST father's day gift. Dad joked to my mom th...,1
2886,3,30-Jul-18,White Dot,"Hard to hear, had to blue tooth it to a better...",1
426,5,10-Jul-18,Black,"I love the fact, that I can unplug it, and tak...",1
846,5,30-Jul-18,Sandstone Fabric,Love it,1
331,5,29-Jul-18,Heather Gray Fabric,This is a great product! Set up was easy. So...,1


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3150 entries, 0 to 3149
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   rating            3150 non-null   int64 
 1   date              3150 non-null   object
 2   variation         3150 non-null   object
 3   verified_reviews  3150 non-null   object
 4   feedback          3150 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 123.2+ KB


In [None]:
# Change date to datetime
df["date"] = pd.to_datetime(df["date"])

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3150 entries, 0 to 3149
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   rating            3150 non-null   int64         
 1   date              3150 non-null   datetime64[ns]
 2   variation         3150 non-null   object        
 3   verified_reviews  3150 non-null   object        
 4   feedback          3150 non-null   int64         
dtypes: datetime64[ns](1), int64(2), object(2)
memory usage: 123.2+ KB


In [None]:
# Base rate: the data-set is unbalanced!
df.feedback.value_counts()

1    2893
0     257
Name: feedback, dtype: int64

In [None]:
round(df.feedback.value_counts()[1] / len(df), 4)

0.9184

In [None]:
# Convert text into lowercase
def clean_text(text):
    return text.strip().lower()

df["verified_reviews"] = df["verified_reviews"].map(clean_text)

df.head()

Unnamed: 0,rating,date,variation,verified_reviews,feedback
0,5,2018-07-31,Charcoal Fabric,love my echo!,1
1,5,2018-07-31,Charcoal Fabric,loved it!,1
2,4,2018-07-31,Walnut Finish,"sometimes while playing a game, you can answer...",1
3,5,2018-07-31,Charcoal Fabric,i have had a lot of fun with this thing. my 4 ...,1
4,5,2018-07-31,Charcoal Fabric,music,1


#### Tokenizing the Data With spaCy

We’ll create a `spacy_tokenizer()` function that accepts a sentence as input and processes the sentence into tokens, performing lemmatization, lowercasing, and removing stop words and punctuation.

__A note from the spaCy documentation__ (this applies to spaCy 2.x, the version used here; spaCy 3 no longer uses `-PRON-`): spaCy adds a special case for pronouns: all pronouns are lemmatized to the special token `-PRON-`. Unlike verbs and common nouns, there’s no clear base form of a personal pronoun. Should the lemma of “me” be “I”, or should we normalize person as well, giving “it”, or maybe “he”? spaCy’s solution is to introduce a novel symbol, `-PRON-`, which is used as the lemma for all personal pronouns.

In [None]:
# Create our list of punctuation marks
punctuations = string.punctuation

punctuations

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [None]:
# Create our list of stopwords (STOP_WORDS was imported above)
stop_words = STOP_WORDS

list(stop_words)[:10]

['so',
 'cannot',
 'everyone',
 'through',
 'eight',
 'two',
 'whereafter',
 'an',
 'regarding',
 'anyone']

In [None]:
# Load English language model
sp = spacy.load('en_core_web_sm')

# Creating our tokenizer function
def spacy_tokenizer(sentence):
    # Creating our token object, which is used to create documents with linguistic annotations.
    mytokens = sp(sentence)

    # Lemmatizing each token and converting each token into lowercase
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]

    # Removing stop words and punctuation
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]

    # return preprocessed list of tokens
    return mytokens

# Example
review = df["verified_reviews"].sample()
review.values[0]

'love being able to listen to music easily. still learning all the features available'

In [None]:
spacy_tokenizer(review.values[0])

['love', 'able', 'listen', 'music', 'easily', 'learn', 'feature', 'available']
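One subtlety in the tokenizer above: `punctuations` is a string, so `word not in punctuations` is a substring test rather than a per-character check. The sketch below shows the consequences and one possible stricter alternative; `is_punct_only` is a hypothetical helper, not part of the original pipeline:

```python
import string

punctuations = string.punctuation

# `in` on a string tests for substrings, not for single characters
print("!" in punctuations)    # True: a single punctuation mark
print(":;" in punctuations)   # True: happens to be a substring
print("..." in punctuations)  # False: a token like "..." would be kept
print("" in punctuations)     # True: empty strings are dropped as a side effect

# Hypothetical stricter check: non-empty and made only of punctuation characters
punct_chars = set(string.punctuation)

def is_punct_only(token):
    return bool(token) and all(ch in punct_chars for ch in token)

sample = ["love", "...", "!", ":;", "echo"]
print([t for t in sample if not is_punct_only(t)])
# ['love', 'echo']
```

With this variant, empty strings are no longer removed implicitly, so an explicit `if t` test would be needed to keep that behavior.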

#### Vectorization Feature Engineering (TF-IDF)

We’ll use the TF-IDF (Term Frequency-Inverse Document Frequency) to vectorize the documents. This is a way of representing how important a particular term is in the context of a given document, based on how many times the term appears and how many other documents that same term appears in. The higher the TF-IDF, the more important that term is to that document.

In [None]:
tfidf_vector = TfidfVectorizer(tokenizer = spacy_tokenizer) # we use the above defined tokenizer

### 4.2 Classification of the reviews using logistic regression

In [None]:
# Train test split
X = df['verified_reviews'] # the features we want to analyze
ylabels = df['feedback'] # the labels, or answers, we want to test against

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3, random_state=72)

X_train

962                                          works great!
2041                                           perfect!!!
499     now i'm weary about these picking up conversat...
3028    small convenient and dependable. this dot is a...
2351    very satisfied, easy setup. great product. hig...
                              ...                        
2885               like having the music where ever i am.
3146    listening to music, searching locations, check...
1070                 it is very slow compared to the echo
1811                                  home entertainment.
472     love it even though i’m still trying to figure...
Name: verified_reviews, Length: 2205, dtype: object

In [None]:
y_train

962     1
2041    1
499     0
3028    1
2351    1
       ..
2885    1
3146    1
1070    0
1811    1
472     1
Name: feedback, Length: 2205, dtype: int64

In [None]:
# Define classifier
classifier = LogisticRegression(solver="lbfgs")

# Create pipeline
pipe = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', classifier)])

# Generate Model on training set
pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('vectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_patt...,
                                 tokenizer=<function spacy_tokenizer at 0x7fb4e853d158>,
                                 use_idf=True, vocabulary=None)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_i

In [None]:
# Evaluate the model
def evaluate(true, pred):
    precision = precision_score(true, pred)
    recall = recall_score(true, pred)
    f1 = f1_score(true, pred)
    print(f"CONFUSION MATRIX:\n{confusion_matrix(true, pred)}")
    print(f"ACCURACY SCORE:\n{accuracy_score(true, pred):.4f}")
    print(f"CLASSIFICATION REPORT:\n\tPrecision: {precision:.4f}\n\tRecall: {recall:.4f}\n\tF1_Score: {f1:.4f}")

In [None]:
# Predictions
y_pred = pipe.predict(X_test)

# Evaluation
evaluate(y_test, y_pred)

CONFUSION MATRIX:
[[  1  63]
 [  0 881]]
ACCURACY SCORE:
0.9333
CLASSIFICATION REPORT:
	Precision: 0.9333
	Recall: 1.0000
	F1_Score: 0.9655


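Our model correctly identified a review’s sentiment 93.3% of the time. This is only marginally above the base rate: 91.8% of all reviews are positive, and always predicting "positive" would already be correct for 881 of the 945 test reviews (93.2%). The confusion matrix shows why: the model labels nearly every review as positive and catches only 1 of the 64 negative reviews. When it predicted a review was positive, that review was actually positive 93.3% of the time. When handed a positive review, our model identified it as positive 100% of the time.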
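The reported scores can be recomputed from the confusion matrix alone, which also makes the effect of the class imbalance visible. The four counts are copied from the output above (rows are true classes 0 and 1, columns are predicted classes):

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
tn, fp = 1, 63     # true negatives, false positives
fn, tp = 0, 881    # false negatives, true positives
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

negative_recall = tn / (tn + fp)    # share of negative reviews that were detected
test_base_rate = (tp + fn) / total  # accuracy of always predicting "positive"

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"negative recall={negative_recall:.4f} test base rate={test_base_rate:.4f}")
# accuracy=0.9333 precision=0.9333 recall=1.0000 f1=0.9655
# negative recall=0.0156 test base rate=0.9323
```

On unbalanced data, accuracy alone is therefore a weak indicator: here it hides the fact that almost no negative review is recognized.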

In [None]:
# BONUS: predict the rating
df.head()

Unnamed: 0,rating,date,variation,verified_reviews,feedback
0,5,2018-07-31,Charcoal Fabric,love my echo!,1
1,5,2018-07-31,Charcoal Fabric,loved it!,1
2,4,2018-07-31,Walnut Finish,"sometimes while playing a game, you can answer...",1
3,5,2018-07-31,Charcoal Fabric,i have had a lot of fun with this thing. my 4 ...,1
4,5,2018-07-31,Charcoal Fabric,music,1


In [None]:
df.rating.value_counts()

5    2286
4     455
1     161
3     152
2      96
Name: rating, dtype: int64

In [None]:
# Base rate
round(df.rating.value_counts()[5] / len(df), 4)

0.7257

In [None]:
# Train test split
X = df['verified_reviews'] # the features we want to analyze
y = df['rating'] # the labels, or answers, we want to test against

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=72)

# Define classifier
classifier = LogisticRegression(solver="lbfgs")

# Create pipeline
pipe = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', classifier)])

# Generate Model on training set
pipe.fit(X_train, y_train)

# Predictions
y_pred = pipe.predict(X_test)

# Evaluation
print(f"CONFUSION MATRIX:\n{confusion_matrix(y_test, y_pred)}")
print(f"ACCURACY SCORE:\n{accuracy_score(y_test, y_pred):.4f}")

CONFUSION MATRIX:
[[  4   0   0   0  39]
 [  0   0   0   1  20]
 [  0   0   2   6  45]
 [  1   0   0   8 128]
 [  0   0   0   5 686]]
ACCURACY SCORE:
0.7407


In [None]:
# BONUS 2: use random forest
from sklearn.ensemble import RandomForestClassifier

# Define classifier
classifier = RandomForestClassifier()

# Create pipeline
pipe = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', classifier)])

# Generate Model on training set
pipe.fit(X_train, y_train)

# Predictions
y_pred = pipe.predict(X_test)

# Evaluation
print(f"CONFUSION MATRIX:\n{confusion_matrix(y_test, y_pred)}")
print(f"ACCURACY SCORE:\n{accuracy_score(y_test, y_pred):.4f}")

CONFUSION MATRIX:
[[ 14   0   0   0  29]
 [  2   8   0   0  11]
 [  0   1  19   2  31]
 [  1   0   0  42  94]
 [  0   0   0   6 685]]
ACCURACY SCORE:
0.8127
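Accuracy hides how the two models behave on the rarer ratings. The sketch below computes per-class recall from the two confusion matrices printed above (copied by hand; rows are the true ratings 1 to 5). The random forest was fitted without a fixed `random_state`, so its numbers reflect this particular run:

```python
def per_class_recall(matrix, labels):
    """Recall per class: diagonal entry divided by the row total (rows = true classes)."""
    return {label: row[i] / sum(row) for i, (label, row) in enumerate(zip(labels, matrix))}

ratings = [1, 2, 3, 4, 5]
logreg = [[4, 0, 0, 0, 39], [0, 0, 0, 1, 20], [0, 0, 2, 6, 45],
          [1, 0, 0, 8, 128], [0, 0, 0, 5, 686]]
forest = [[14, 0, 0, 0, 29], [2, 8, 0, 0, 11], [0, 1, 19, 2, 31],
          [1, 0, 0, 42, 94], [0, 0, 0, 6, 685]]

for name, matrix in (("logistic regression", logreg), ("random forest", forest)):
    recalls = per_class_recall(matrix, ratings)
    print(name, {rating: round(value, 2) for rating, value in recalls.items()})
# logistic regression {1: 0.09, 2: 0.0, 3: 0.04, 4: 0.06, 5: 0.99}
# random forest {1: 0.33, 2: 0.38, 3: 0.36, 4: 0.31, 5: 0.99}
```

Both models recognize 5-star reviews almost perfectly; the random forest's higher accuracy comes from the lower ratings, although it still misses most of them.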
