### Reviews Sentiment

Collect book reviews from `Amazon` and manually label each review as positive or negative.  
Extract text `features` from the reviews, such as word frequencies or TF-IDF values.  
Build a `dataset` from these features and labels.  
Train a `KNN` classifier to predict the sentiment of new reviews.  

In [26]:
import pandas as pd

df = pd.read_csv("reviews.csv")
display(df)

print("Shape:", df.shape)

Unnamed: 0,Review,Sentiment
0,"When you go to the 'Look inside' option, you c...",negative
1,I wouldn't recommend if completely new to codi...,negative
2,I bought this book for a master's data science...,positive
3,Please review the table of contents before pur...,positive
4,If you are following a parallel course of ML t...,positive
5,I have been studying data science during this ...,positive
6,Chris Albon has a broad and deep knowledge of ...,positive
7,This is a great resource for quick and insight...,positive
8,The book starts by explaining each function wi...,positive
9,Perfect when you need to find out how to do so...,positive


Shape: (35, 2)


### Preprocessing

Strip leading and trailing `whitespace`, remove digits, and convert to lower case.  
Remove punctuation `characters`.  

In [27]:
import re

def clear_text(A):
    A = [s.strip() for s in A]                # trim leading/trailing whitespace
    A = [re.sub(r"[0-9]", "", s) for s in A]  # delete digits
    A = [s.lower() for s in A]                # convert to lower case
    return A

df['Review'] = clear_text(df['Review'])
display(df.head())

Unnamed: 0,Review,Sentiment
0,"when you go to the 'look inside' option, you c...",negative
1,i wouldn't recommend if completely new to codi...,negative
2,i bought this book for a master's data science...,positive
3,please review the table of contents before pur...,positive
4,if you are following a parallel course of ml t...,positive
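
Note that `strip()` only trims the ends of each string, so deleting digits can leave double spaces inside a review. A minimal sketch of a stricter variant, assuming that collapsing inner whitespace is wanted; the name `normalize_text` and the sample sentence are illustrative and not part of the notebook:

```python
import re

def normalize_text(text):
    """Delete digits, collapse whitespace runs, trim, and lower-case."""
    text = re.sub(r"[0-9]", "", text)   # delete digits
    text = re.sub(r"\s+", " ", text)    # collapse runs of whitespace to one space
    return text.strip().lower()         # trim the ends, then lower-case

sample = "  Read 3 chapters in 2 days  "
cleaned = normalize_text(sample)
print(repr(cleaned))
```

It can be applied to the whole column with `df['Review'].map(normalize_text)`.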


In [28]:
import unicodedata, sys

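def remove_punctuation(A):
    # Translation table: every Unicode punctuation code point maps to None (deleted)
    P = dict()
    for i in range(sys.maxunicode + 1): # + 1 so the last code point is included
        if unicodedata.category(chr(i)).startswith('P'):
            P[i] = None
    A = [s.translate(P) for s in A]
    return A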

df['Review'] = remove_punctuation(df['Review'])
display(df.head())

Unnamed: 0,Review,Sentiment
0,when you go to the look inside option you can ...,negative
1,i wouldnt recommend if completely new to codin...,negative
2,i bought this book for a masters data science ...,positive
3,please review the table of contents before pur...,positive
4,if you are following a parallel course of ml t...,positive
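
Deleting every punctuation character also deletes apostrophes, which is why row 1 now reads `wouldnt` instead of `wouldn't`. The sketch below shows the effect on a single made-up sentence. It builds the translation table once at module level rather than on each call; `PUNCT_TABLE` and `strip_punctuation` are illustrative names, not part of the notebook.

```python
import sys
import unicodedata

# Map every Unicode punctuation code point (category P*) to None, i.e. delete it.
PUNCT_TABLE = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith("P")
)

def strip_punctuation(text):
    """Return text with all Unicode punctuation characters removed."""
    return text.translate(PUNCT_TABLE)

sample = "i wouldn't recommend it, really!"
cleaned = strip_punctuation(sample)
print(cleaned)
```

Contractions that lose their apostrophe may no longer match entries in a stop-word list, so the order of the punctuation and stop-word steps matters.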


### Tokenize

Tokenize text by `splitting` it into words.  
Remove `stop` words, i.e. words that carry little or no information.  
Stem words to reduce them to their `root/base` form.  

In [29]:

from nltk.tokenize import word_tokenize

def tokenize_words(A):
    A = [word_tokenize(s) for s in A]
    return A

df['Words'] = tokenize_words(df['Review'])
display(df.head())

Unnamed: 0,Review,Sentiment,Words
0,when you go to the look inside option you can ...,negative,"[when, you, go, to, the, look, inside, option,..."
1,i wouldnt recommend if completely new to codin...,negative,"[i, wouldnt, recommend, if, completely, new, t..."
2,i bought this book for a masters data science ...,positive,"[i, bought, this, book, for, a, masters, data,..."
3,please review the table of contents before pur...,positive,"[please, review, the, table, of, contents, bef..."
4,if you are following a parallel course of ml t...,positive,"[if, you, are, following, a, parallel, course,..."


In [30]:
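from nltk.corpus import stopwords # needs a one-time nltk.download('stopwords')

def remove_stopwords(A):
    stop = set(stopwords.words('english')) # load the list once; set lookups are fast
    A = [[w for w in words if w not in stop] for words in A]
    return A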

df['Words'] = remove_stopwords(df['Words'])
display(df.head())

Unnamed: 0,Review,Sentiment,Words
0,when you go to the look inside option you can ...,negative,"[go, look, inside, option, see, code, colors, ..."
1,i wouldnt recommend if completely new to codin...,negative,"[wouldnt, recommend, completely, new, coding, ..."
2,i bought this book for a masters data science ...,positive,"[bought, book, masters, data, science, class, ..."
3,please review the table of contents before pur...,positive,"[please, review, table, contents, purchasing, ..."
4,if you are following a parallel course of ml t...,positive,"[following, parallel, course, ml, book, great,..."


In [31]:
from nltk.stem.porter import PorterStemmer

def stemming_words(A):
    porter = PorterStemmer()
    A = [[porter.stem(w) for w in words] for words in A]
    return A

df['Words'] = stemming_words(df['Words'])
display(df.head())

Unnamed: 0,Review,Sentiment,Words
0,when you go to the look inside option you can ...,negative,"[go, look, insid, option, see, code, color, pa..."
1,i wouldnt recommend if completely new to codin...,negative,"[wouldnt, recommend, complet, new, code, gener..."
2,i bought this book for a masters data science ...,positive,"[bought, book, master, data, scienc, class, on..."
3,please review the table of contents before pur...,positive,"[pleas, review, tabl, content, purchas, book, ..."
4,if you are following a parallel course of ml t...,positive,"[follow, parallel, cours, ml, book, great, sup..."
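
Stemming maps different inflections of a word to one token, which shrinks the vocabulary that the vectorizer sees later. A small sketch using the same `PorterStemmer`; the word list is made up for illustration, and the variable names are not from the notebook:

```python
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

# Three inflections of the same verb (illustrative input)
variants = ["coding", "codes", "coded"]
stems = [porter.stem(w) for w in variants]
print(stems)

# All variants collapse to a single vocabulary entry
unique_stems = set(stems)
print(len(unique_stems))
```

Stems need not be dictionary words: in the output above, `inside` became `insid` and `science` became `scienc`.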


### Extract features

Extract text features from reviews using word `frequencies` (bag of words).

In [32]:
from sklearn.feature_extraction.text import CountVectorizer

# Vectorizer
vectorizer = CountVectorizer() # bag of words

# Bag of words (sparse matrix)
TXT = [" ".join(words) for words in df['Words']] # joined words texts
B = vectorizer.fit_transform(TXT) # bag of words
display(B.toarray())

# Feature matrix
Features_names = vectorizer.get_feature_names_out()
Feature_matrix = pd.DataFrame(B.toarray(), columns=Features_names)
display(Feature_matrix.head())

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

Unnamed: 0,abl,abund,access,accompani,account,actual,ad,addit,advanc,albon,...,would,wouldnt,wrangl,write,written,wrote,year,yet,youll,your
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,1,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Final Dataset

Switch to a `TF-IDF` vectorizer, which down-weights words that occur in many reviews, and use its values to build the final dataset.

In [33]:
from sklearn.feature_extraction.text import TfidfVectorizer

def build_dataset(df, vectorizer):

    # Sparse matrix (tf-idf)
    TXT = [" ".join(words) for words in df['Words']] # joined words texts
    B = vectorizer.fit_transform(TXT)

    # Feature matrix
    Features_names = vectorizer.get_feature_names_out()
    Feature_matrix = pd.DataFrame(B.toarray(), columns=Features_names)

    # New DataFrame (sentiment label plus one feature column per word)
    df_new = pd.concat([df['Sentiment'], Feature_matrix], axis=1)
    return df_new

final_dataset = build_dataset(df, TfidfVectorizer())
display(final_dataset.head())

Unnamed: 0,Sentiment,abl,abund,access,accompani,account,actual,ad,addit,advanc,...,would,wouldnt,wrangl,write,written,wrote,year,yet,youll,your
0,negative,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,negative,0.0,0.0,0.0,0.0,0.220705,0.0,0.0,0.0,0.0,...,0.0,0.197703,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,positive,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,positive,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,positive,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
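
To see where these values come from: with its default settings, `TfidfVectorizer` multiplies each raw count by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and then scales every row to unit L2 length. The sketch below reproduces that by hand on two made-up documents and compares it with the library result; all variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["great book great code", "bad book"]  # made-up documents

# Raw term counts (same vocabulary order as TfidfVectorizer)
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)               # documents containing each word
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # smoothed IDF (scikit-learn default)

manual = counts * idf                                    # tf * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)  # L2-normalize each row

auto = TfidfVectorizer().fit_transform(docs).toarray()
matches = np.allclose(manual, auto)
print(matches)
```

A word such as `book`, which appears in both documents, gets the lowest IDF and therefore contributes least to the row.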


### KNN Classifier


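The classifier learns from the labelled reviews in order to `classify new` reviews: KNN assigns a new review the majority label among its k nearest training reviews (k = 5 by default in scikit-learn).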

In [35]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Features and label
X = final_dataset.iloc[:, 1:] # all columns except first
y = final_dataset['Sentiment']

# Training and testing sets (random 80/20 split; no random_state, so results vary between runs)
X1, X2, y1, y2 = train_test_split(X, y, test_size=0.2)

# Learn model
knn = KNeighborsClassifier()
knn.fit(X1, y1)

# Predictions
y_pred = knn.predict(X2)

# Output
print("Test data:\t", " ".join(y2))
print("Prediction:\t", " ".join(y_pred))
print("Score on Train:", knn.score(X1, y1).round(2))
print("Score on Test:", knn.score(X2, y2).round(2))
print("Report:", classification_report(y2, y_pred, zero_division=0))

Test data:	 negative positive positive positive negative positive positive
Prediction:	 positive negative positive negative positive positive positive
Score on Train: 0.82
Score on Test: 0.43
Report:               precision    recall  f1-score   support

    negative       0.00      0.00      0.00         2
    positive       0.60      0.60      0.60         5

    accuracy                           0.43         7
   macro avg       0.30      0.30      0.30         7
weighted avg       0.43      0.43      0.43         7
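
With only 35 reviews, the test score rests on seven of them, and neither of the two negative test reviews was classified correctly, so the numbers above are very noisy. Two things in the setup may also distort them: the vectorizer was fitted on all reviews before the split, so the test reviews influenced the vocabulary and IDF weights, and the split is neither stratified nor seeded. The sketch below shows one possible way to address both with a scikit-learn pipeline. The toy reviews, `n_neighbors=3` and `random_state=0` are assumptions chosen for illustration, not the notebook's data or settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews, for illustration only (not the notebook's data)
texts = [
    "great book clear examples", "excellent recipes very useful",
    "helpful reference great code", "love this book useful examples",
    "clear and helpful explanations", "excellent resource great value",
    "poor print quality bad colors", "bad examples not useful",
    "waste of money poor content", "confusing code bad explanations",
    "not helpful poor quality", "disappointing book confusing examples",
]
labels = ["positive"] * 6 + ["negative"] * 6

# Stratified, seeded split: class balance is preserved and the run is repeatable
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# The pipeline fits TF-IDF on the training texts only, then trains KNN on them
model = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=3))
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(list(zip(y_test, predictions)))
```

Because the vectorizer lives inside the pipeline, the same object can be passed to `cross_val_score`, which gives a steadier estimate than a single small test set.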

