<img src="https://drive.google.com/uc?id=1D9x2uKV9smn5slG9JftecrbiSz00Rylz" />

Text classification & Sentiment analysis
--

Text classification – The aim of text classification is to automatically assign text documents to predefined categories.

Applications:
--
1. Sentiment Analysis
2. Document classification
3. Spam – ham mail classification
4. Resume shortlisting
5. Document summarization

Problem
--
To do: spam vs. ham classification using machine learning.

Solution
--
If you look at your Gmail account, you will see a folder called “Spam.” Gmail
classifies your emails as spam or ham so that you don’t have to read
unnecessary emails.

In [None]:
# Let’s follow the step-by-step method to build the classifier.

# Step 1 :  Data collection and understanding
# Please download data from the below link and save it in your working directory:
# https://www.kaggle.com/uciml/sms-spam-collection-dataset#spam.csv

import pandas as pd
#Read the data
Email_Data = pd.read_csv("spam.csv",encoding ='latin1')

#Data understanding
Email_Data.columns

In [None]:
Email_Data = Email_Data[['v1', 'v2']]
Email_Data = Email_Data.rename(columns={"v1":"Target","v2":"Email"})
Email_Data.head()

In [None]:
# step 2 : Text processing and feature engineering

# all imports
import os
import string

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, SnowballStemmer
from textblob import TextBlob, Word
import sklearn.feature_extraction.text as text
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm

In [None]:
# pre-processing steps: lower-casing, stop-word removal and lemmatization
import nltk

# lower case every word
Email_Data['Email'] = Email_Data['Email'].apply(lambda x: " ".join(word.lower() for word in x.split()))

# remove English stop words
nltk.download('stopwords')
stop = set(stopwords.words('english'))
Email_Data['Email'] = Email_Data['Email'].apply(lambda x: " ".join(word for word in x.split() if word not in stop))

# optional: stemming (left disabled; lemmatization is used instead)
#st = PorterStemmer()
#Email_Data['Email'] = Email_Data['Email'].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))

# lemmatize each word (TextBlob's Word relies on the WordNet corpus)
nltk.download('wordnet')
Email_Data['Email'] = Email_Data['Email'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
Email_Data.head()

In [None]:
# step 3:
# Splitting data into train and validation
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(Email_Data['Email'], Email_Data['Target'])

# Encode the target labels (ham -> 0, spam -> 1); fit on the training labels only
# and reuse the same mapping for the validation labels
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.transform(valid_y)

# TFIDF feature generation for a maximum of 5000 features
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)

# learn the vocabulary and IDF weights from the training split only,
# so that nothing from the validation split leaks into the features
tfidf_vect.fit(train_x)
xtrain_tfidf = tfidf_vect.transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)
print(xtrain_tfidf.data.shape)
xtrain_tfidf.data
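
A label encoder assigns integers to the sorted class names it saw while fitting, so the mapping is normally learned once on the training labels and then reused. The sketch below re-implements that idea in plain Python to show what can go wrong when the mapping is learned again on another split. `fit_labels` and `encode` are hypothetical helper names for illustration, not scikit-learn functions.

```python
def fit_labels(labels):
    """Learn a label -> integer mapping from the sorted unique labels."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}

def encode(mapping, labels):
    """Apply an already-learned mapping to a list of labels."""
    return [mapping[label] for label in labels]

train_labels = ["ham", "spam", "ham", "ham"]
valid_labels = ["spam", "spam"]   # a split that happens to contain one class only

mapping = fit_labels(train_labels)                        # ham -> 0, spam -> 1
reused = encode(mapping, valid_labels)                    # same meaning as in training
refit = encode(fit_labels(valid_labels), valid_labels)    # spam silently becomes 0

print(mapping, reused, refit)
```

With both classes present in every split the two approaches usually agree, which is why the problem is easy to miss.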

In [None]:
#tfidf_vect.vocabulary_

#tfidf_vect.get_feature_names_out()
print(xtrain_tfidf)

# What is the element at index 4783 ?
#Keys = [key  for (key, value) in tfidf_vect.vocabulary_.items() if value == 4783]
#print(Keys)
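
To make the printed weights less opaque, here is a minimal sketch that computes TF-IDF by hand for a toy corpus. It assumes scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2-normalised rows); the three documents and the helper name `tfidf_row` are made up for illustration.

```python
import math
from collections import Counter

docs = ["free prize call now", "call me later", "free free offer"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# document frequency: number of documents that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_row(tokens):
    """Return the L2-normalised TF-IDF weights of one document."""
    counts = Counter(tokens)
    # term count multiplied by the smoothed inverse document frequency
    weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

rows = [tfidf_row(tokens) for tokens in tokenized]
for doc, row in zip(docs, rows):
    print(doc, "->", {t: round(w, 3) for t, w in sorted(row.items())})
```

In the first document, "prize" ends up with a larger weight than "free" because it occurs in fewer documents, which is the effect TF-IDF is designed to produce.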

In [None]:
# step 4: 
# Model training
# we have defined a generalized function for training any given model:

def train_model(classifier, feature_vector_train, label, feature_vector_valid, is_neural_net=False):
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)
    # predict the labels on validation dataset
    predictions = classifier.predict(feature_vector_valid)
    # score against valid_y from the previous cell (true labels first, predictions second)
    return metrics.accuracy_score(valid_y, predictions)

# Note : You can create any classifier object and pass it to the above function, along
# with the training & validation data
# NaiveBayes Classifier
accuracy = train_model(naive_bayes.MultinomialNB(alpha=0.2),xtrain_tfidf, train_y, xvalid_tfidf)
print("Accuracy: ", accuracy)
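
Accuracy can look flattering on this kind of data, because the SMS spam collection contains far more ham than spam. The sketch below computes precision, recall and F1 for the spam class in plain Python so the arithmetic is visible. `binary_report` is a hypothetical helper and the labels are toy values (1 = spam, 0 = ham).

```python
def binary_report(y_true, y_pred, positive=1):
    """Return accuracy plus precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# toy validation labels: 8 ham messages and 2 spam messages
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_ham = [0] * 10          # a "classifier" that never flags spam

report = binary_report(y_true, always_ham)
print(report)
```

The always-ham predictor reaches 80% accuracy while catching no spam at all. scikit-learn's `metrics.classification_report` reports the same quantities per class.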

In [None]:
# trying one more classifier, so that we can compare its performance with Naive Bayes
# Linear Classifier on Word Level TF IDF Vectors
accuracy = train_model(linear_model.LogisticRegression(),xtrain_tfidf, train_y, xvalid_tfidf)
print ("Accuracy: ", accuracy)

# LogisticRegression() suits a 2-class dataset like this one,
# and it did perform well, but not as well as the Naive Bayes classifier.
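
When several classifiers are compared, it can help to bundle the vectorizer and the classifier into one scikit-learn pipeline, so that the TF-IDF vocabulary is always learned from the training texts only. This is a minimal sketch on made-up messages, not the notebook's dataset; swapping `MultinomialNB` for `LogisticRegression` is the only change needed to try the other model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# toy training data, invented for illustration
train_texts = [
    "win a free prize now",
    "free entry call now to claim",
    "urgent your account has won cash",
    "are we still meeting for lunch",
    "see you at home tonight",
    "can you call me after class",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# fit() learns the TF-IDF vocabulary and the classifier from the training texts only
model = make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=0.2))
model.fit(train_texts, train_labels)

preds = model.predict(["free prize waiting call now", "see you at lunch"])
print(list(preds))
```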

Recalling Sentiment Analysis using TextBlob
--
In this section, we discuss how to determine the sentiment of
a particular sentence or statement. Sentiment analysis is widely used
across industries to understand how customers/users feel about
products and services. It assigns a sentence or statement a sentiment
score that tends toward positive or negative.

Problem
--
You want to do a sentiment analysis of **Amazon’s Alexa range of products**.

Solution
--
The simplest way to do this is to use the TextBlob or VADER library.


How It Works
--
Let’s follow the steps in this section to do sentiment analysis using TextBlob.
It gives two metrics:

• <font color='green'>Polarity</font> = Polarity lies in the range [-1, 1], where 1 means a positive statement and -1 means a negative statement.

• <font color='green'>Subjectivity</font> = Subjectivity lies in the range [0, 1], where 0 means no subjectivity (i.e., 100% objective) and 1 means totally subjective (i.e., a personal opinion).
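
To build some intuition for these two numbers, here is a toy lexicon-based scorer. It is only an illustration of the general idea of averaging per-word scores and is not TextBlob's implementation; the lexicon entries and their values are made up, and `toy_sentiment` is a hypothetical name.

```python
# hypothetical mini-lexicon: word -> (polarity, subjectivity); the values are invented
LEXICON = {
    "good": (0.7, 0.6),
    "great": (0.8, 0.75),
    "bad": (-0.7, 0.67),
    "worst": (-1.0, 1.0),
}

def toy_sentiment(text):
    """Average the scores of the words found in the lexicon; (0.0, 0.0) if none match."""
    hits = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    if not hits:
        return 0.0, 0.0
    polarity = sum(p for p, _ in hits) / len(hits)
    subjectivity = sum(s for _, s in hits) / len(hits)
    return polarity, subjectivity

print(toy_sentiment("great phone with a good camera"))   # polarity above 0
print(toy_sentiment("worst experience bad quality"))     # polarity below 0
print(toy_sentiment("the phone arrived on monday"))      # no opinion words found
```

A sentence without any known opinion words scores exactly 0 on both metrics, which is one reason a lexicon-based analyser can report many neutral results.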

In [1]:
# Create the sample data
review = "I like this phone. screen quality and camera clarity is really good."
review2 = "This tv is not good. Bad quality, no clarity, worst experience"

# Recall the function you learned in NB 10 -> ML PRE-PROCESSING PIPELINE
# Cleaning and preprocessing
def processRow(row):
    import re
    from nltk.stem import PorterStemmer  # only needed for the optional stemmer below
    from textblob import Word


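    #Lower case -> this is optional; str.lower() returns a new string,
    #so the result has to be assigned back for it to take effect
    #row = row.lower()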

    #Removes unicode strings like "\u002c"  -> ,(comma)
    row= re.sub(r'(\\u[0-9A-Fa-f]+)',r'', row)

    # Removes non-ascii characters. note : \x00 to \x7f is 0 to 127
    # non-ascii characters like copyright symbol, trademark symbol
    row = re.sub(r'[^\x00-\x7f]',r'',row)

    #convert any url to URL
    row = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))','URL',row)

    #Convert any @Username to "AT_USER"
    row = re.sub(r'@[^\s]+','AT_USER',row)

    #Remove additional white spaces
    row = re.sub(r'[\s]+', ' ', row)
    row = re.sub(r'[\n]+', ' ', row)

    #Replace every non-alphanumeric symbol with a white space
    #(this also strips '#', '!', '?' and '.', so the later steps that look for
    #these characters no longer find a match)
    row = re.sub(r'[^\w]', ' ', row)

    #Removes hashtag in front of a word: #word -> word
    row = re.sub(r'#([^\s]+)', r'\1', row)

    #Removes all possible emoticons (raw string, so the regex escapes reach re unchanged)
    row = re.sub(r':\)|:\(|:\)|;\)|:-\)|\(-:|:-D|=D|:P|xD|X-p|\^\^|:-|\^\.\^|\^\-\^|\^\_\^|\,-\)|\)-:|:\'\(|:\(|:-\(|:\S|T\.T|\.\_\.|:<|:-\S|:-<|\-\*|:O|=O|=\-O|O\.o|XO|O\_O|:-\@|=/|:/|X\-\(|>\.<|>=\(|D:', '', row)

    #remove numbers -> this is optional
    row = ''.join([i for i in row if not i.isdigit()])

    #remove multiple exclamation -> this is optional
    row = re.sub(r"(\!)\1+", ' ', row)

    #remove multiple question marks -> this is optional
    row = re.sub(r"(\?)\1+", ' ', row)

    #remove multistop -> this is optional
    row = re.sub(r"(\.)\1+", ' ', row)

    #trim
    row = row.strip('\'"')

    #lemma
    from textblob import Word
    row =" ".join([Word(word).lemmatize() for word in row.split()])

    #stemmer
    #st = PorterStemmer()
    #row=" ".join([st.stem(word) for word in row.split()])


    return row
               
#call the function to process your data
#review = processRow(review)
#review2 = processRow(review2)
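
Order matters in a regex cleaning pipeline: once every non-word character has been replaced, rules that look for `#`, `!` or `.` have nothing left to match. The sketch below shows one possible ordering in which the punctuation-dependent rules run first. `clean_ordered` is a hypothetical, much shorter variant for illustration and not a drop-in replacement for `processRow`.

```python
import re

def clean_ordered(text):
    """Apply punctuation-dependent rules first, then strip whatever symbols remain."""
    text = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', text)  # URLs -> URL
    text = re.sub(r'@[^\s]+', 'AT_USER', text)                        # @name -> AT_USER
    text = re.sub(r'#([^\s]+)', r'\1', text)                          # #word -> word
    text = re.sub(r'([!?.])\1+', ' ', text)                           # !!, ??, .. -> space
    text = re.sub(r'[^\w]', ' ', text)                                # remaining symbols
    return re.sub(r'\s+', ' ', text).strip()                          # collapse spaces

sample = "Love it!!! #alexa rocks, see https://example.com @amazon"
print(clean_ordered(sample))   # Love it alexa rocks see URL AT_USER
```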

In [None]:
# Get the sentiment scores
# import libraries
from textblob import TextBlob

#TextBlob's default analyzer is lexicon-based (PatternAnalyzer), so no training is needed
blob = TextBlob(review)
blob.sentiment


In [None]:
# Again using TextBlob, over review2
# get the sentiment

blob = TextBlob(review2)
blob.sentiment


<hr>
Sentiment analysis of <font color='green'><b>Amazon’s Alexa range of products</b></font>.
<hr>

Data Source link : 
https://drive.google.com/open?id=1xpQVZHR84LJ3MX-UihL7noK0w1bOIQgV

You can use this data to analyze Amazon’s Alexa product range and discover insights from consumer reviews:
> How many positive reviews are there?

> How many negative reviews are there?

> Accordingly, help digital marketers/e-commerce portals concentrate on positively reviewed products.

<small>The dataset consists of about 3,150 Amazon customer reviews (input text) with star ratings, review dates, variants and feedback for various Amazon Alexa products like Alexa Echo, Echo Dots, Alexa Firesticks etc. It is meant for learning how to train a machine for sentiment analysis.</small>


In [2]:
import pandas
df_review = pandas.read_csv('amazon_alexa.csv', sep='\t')
df_review.head()

# Note: feedback is already given to us (maybe it is hand-coded):
# Feedback 1 -> indicates positive feedback
# Feedback 0 -> indicates negative feedback

# But in reality we would only get the customer reviews, wouldn't we?
# So, let's ignore the feedback column and use it only for assessing
# the output of the TextBlob sentiment analyser.

# Moreover, the feedback column doesn't have any neutral feedback,
# although not every review has to be strictly positive or negative.

Unnamed: 0,rating,date,variation,verified_reviews,feedback
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1
2,4,31-Jul-18,Walnut Finish,"Sometimes while playing a game, you can answer...",1
3,5,31-Jul-18,Charcoal Fabric,I have had a lot of fun with this thing. My 4 ...,1
4,5,31-Jul-18,Charcoal Fabric,Music,1


In [3]:
# let's find the number of positive and negative reviews in the dataset.
df_review['feedback'].value_counts()

# Now, imagine we run the sentiment analyser code over this
# entire dataset: we should get a similar answer, shouldn't we?

1    2893
0     257
Name: feedback, dtype: int64
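
These counts are heavily skewed toward positive feedback, so it is worth knowing how well a trivial "always positive" guess would do before judging any analyser. A small worked calculation from the counts shown above:

```python
counts = {1: 2893, 0: 257}                      # feedback value -> number of reviews
total = sum(counts.values())
majority_label = max(counts, key=counts.get)    # the most frequent feedback value
baseline_accuracy = counts[majority_label] / total

print(total, majority_label, round(baseline_accuracy, 4))   # 3150 1 0.9184
```

Any sentiment analyser evaluated against this column should therefore be compared with roughly 92%, not with 50%.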

In [4]:
# clean your verified_reviews
from textblob import TextBlob
cleaned_verified_reviews = []

for line in df_review['verified_reviews'] :
    cleanLine = processRow(line) 
    cleaned_verified_reviews.append(cleanLine)
    
import numpy as np    
df_review['cleaned_verified_reviews'] = np.asarray(cleaned_verified_reviews)

df_review.head(5)


Unnamed: 0,rating,date,variation,verified_reviews,feedback,cleaned_verified_reviews
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1,Love my Echo
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1,Loved it
2,4,31-Jul-18,Walnut Finish,"Sometimes while playing a game, you can answer...",1,Sometimes while playing a game you can answer ...
3,5,31-Jul-18,Charcoal Fabric,I have had a lot of fun with this thing. My 4 ...,1,I have had a lot of fun with this thing My yr ...
4,5,31-Jul-18,Charcoal Fabric,Music,1,Music


In [5]:
# Let's define our sentiment analyzer function:
def analyze_sentiment(cleaned_verified_reviews):
    analysis = TextBlob(cleaned_verified_reviews)
    if analysis.sentiment.polarity > 0:
        return 'Positive'
    elif analysis.sentiment.polarity == 0:
        return 'Neutral'
    else:
        return 'Negative'
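
Treating only a polarity of exactly 0 as neutral means that a score such as 0.01 already counts as positive. One possible variation is to use a small neutral band around zero. The sketch below assumes a band of 0.05, which is an arbitrary value chosen for illustration, and `label_from_polarity` is a hypothetical helper name.

```python
def label_from_polarity(polarity, neutral_band=0.05):
    """Map a polarity in [-1, 1] to a label; scores within the band count as neutral."""
    if polarity > neutral_band:
        return 'Positive'
    if polarity < -neutral_band:
        return 'Negative'
    return 'Neutral'

for p in (0.8, 0.03, 0.0, -0.02, -0.6):
    print(p, label_from_polarity(p))
```

Setting `neutral_band=0` gives the same labels as comparing the polarity with zero directly.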

In [None]:
# Let's find the sentiment by calling the function defined above.
# create a new column called 'Sentiment'
df_review['Sentiment'] = df_review['cleaned_verified_reviews'].apply(analyze_sentiment)

df_review[['cleaned_verified_reviews', 'Sentiment']].head(3)


In [None]:
# no. of positive sentiment reviews
df_review[df_review.Sentiment=='Positive'].shape[0]


In [None]:
# no. of negative sentiment reviews
df_review[df_review.Sentiment=='Negative'].shape[0]


In [None]:
# no. of neutral sentiment reviews
df_review[df_review.Sentiment=='Neutral'].shape[0]
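
Since the feedback column is kept aside for assessing the analyser, the predicted sentiment can be cross-tabulated against it. The sketch below does this in plain Python on toy values. `compare_with_feedback` is a hypothetical helper, and counting only non-neutral predictions in the agreement rate is one possible convention, not the only one.

```python
from collections import Counter

def compare_with_feedback(feedback, sentiment):
    """Count (feedback, sentiment) pairs and the share of non-neutral predictions that agree."""
    table = Counter(zip(feedback, sentiment))
    agree = table[(1, 'Positive')] + table[(0, 'Negative')]
    decided = sum(n for (_, s), n in table.items() if s != 'Neutral')
    return table, (agree / decided if decided else 0.0)

# toy stand-ins for df_review['feedback'] and df_review['Sentiment']
feedback = [1, 1, 1, 0, 0, 1]
sentiment = ['Positive', 'Positive', 'Neutral', 'Negative', 'Positive', 'Negative']

table, agreement = compare_with_feedback(feedback, sentiment)
print(dict(table))
print(agreement)
```

The same function should work when the two DataFrame columns are passed in directly, because it only iterates over their values.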

In [None]:
# lets add polarity also to the df
df_review['polarity'] = df_review['cleaned_verified_reviews'].map(lambda text: TextBlob(text).sentiment.polarity)

df_review[['cleaned_verified_reviews', 'Sentiment', 'polarity']].head(3)

In [None]:
print('5 random reviews with the highest positive sentiment polarity: \n')
cl = df_review.loc[df_review.polarity == 1, ['cleaned_verified_reviews']].sample(5).values
for c in cl:
    print(c[0])

In [None]:
print('2 random reviews with highly negative sentiment polarity (below -0.5): \n')
# your code here
neg = df_review.loc[df_review.polarity < -0.5, 'cleaned_verified_reviews']
# sample at most 2 rows so the call does not fail when fewer reviews qualify
for c in neg.sample(min(2, len(neg))):
    print(c)


**Observation on the above reviews:**
