# <a id='toc1_'></a>[Data Cleaning and Processing](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [Data Cleaning and Processing](#toc1_)    
- [Data Cleaning Processes](#toc2_)    
  - [Removing Unnecessary Columns](#toc2_1_)    
  - [NaN/Missing Values in the Data](#toc2_2_)    
  - [Duplicates in the Data](#toc2_3_)    
  - [Outliers in the Data](#toc2_4_)    
  - [Standardizing and Checking Data Types](#toc2_5_)    
  - [Normalizing Data](#toc2_6_)    
  - [Check Data Imbalance](#toc2_7_)    
- [Dealing with Categorical Data](#toc3_)    
- [Dealing with Text Data](#toc4_)    
  - [Text Normalization](#toc4_1_)    
    - [Description](#toc4_1_1_)    
    - [Title](#toc4_1_2_)    
    - [Brand](#toc4_1_3_)    
    - [Category](#toc4_1_4_)    
    - [ReviewText](#toc4_1_5_)    
  - [Text Cleaning](#toc4_2_)    
    - [Tokenization](#toc4_2_1_)    
    - [Stopword Removal](#toc4_2_2_)    
    - [Stemming and Lemmatization](#toc4_2_3_)    
- [Sentiment Analysis](#toc5_)    
  - [Sentiment Analysis using Lexicon-based Methods](#toc5_1_)    
    - [VADER](#toc5_1_1_)    
    - [TextBlob](#toc5_1_2_)    
    - [Bing, AFINN, and NRC](#toc5_1_3_)    
    - [Comparison of Lexicon-based Methods](#toc5_1_4_)    
- [Creating New Dataset and Saving](#toc6_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [191]:
# reset (removing all variables, functions, and other objects from memory)
%reset -f

# import all the necessary packages
import pandas as pd
import numpy as np
import warnings
import re



In this workbook we clean and process the data for the project and create a new dataset for the analysis. The dataset is saved as a CSV file and used in the analysis workbook. Specifically, we:

1. Load the data from the CSV file created in the data loading workbook
2. Clean the data by:
    - removing columns that are not needed
    - handling missing values
    - removing duplicates
    - looking at outliers
    - standardizing and checking data types
    - dealing with text data
    - dealing with categorical data (categories)
    - checking data consistency across columns
    - normalizing data
    - checking data balance
    - feature engineering (sentiment analysis, etc.)

3. Create a new dataset
4. Save the new dataset

# <a id='toc2_'></a>[Data Cleaning Processes](#toc0_)

In [192]:
# load data - MAC OS
# amz_rev = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/all_revs_with_records_1.csv', low_memory=False)

# # load data - Set 1
amz_rev = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/set1_data.csv')

# # load data - Set 2
# amz_rev = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/set2_data.csv')

# # load data - Set 3
# amz_rev = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/set3_data.csv')

# load data - Set 4
# amz_rev = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/set4_data.csv')



In [193]:
# drop leftover index columns ('Unnamed: 0', 'Unnamed: 0.1') if present
amz_rev = amz_rev.drop(columns=['Unnamed: 0.1', 'Unnamed: 0'], errors='ignore')

# initial data view
display(amz_rev.head(3))
print("Shape of the data: ", amz_rev.shape)

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,category,vote,image
0,5.0,False,"09 11, 2017",A33PVCHCQ2BTN0,B0010ZBORW,{'Color:': ' Nail Brush'},VW,I really like this nail brush from Urban Spa. ...,"Handy nail brush, gets garden dirt out from un...",1505088000,beauty,,
1,4.0,False,"09 2, 2017",A2503LT8PZIHAD,B0010ZBORW,{'Color:': ' Foot File'},Trouble,This is about the same quality foot file as th...,Basic foot file,1504310400,beauty,,
2,4.0,False,"02 21, 2018",A1MAI0TUIM3R2X,B001LNODUS,{'Color:': ' Body Lotion'},Princess Bookworm,Nice lavender lotion that absorbs easily in my...,Fragrant Lavender Lotion,1519171200,beauty,,


Shape of the data:  (1103032, 13)


## <a id='toc2_1_'></a>[Removing Unnecessary Columns](#toc0_)

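We remove columns that are not needed for the analysis: `verified`, `style`, `reviewTime`, `vote`, `summary`, and `image`. The review date is retained through `unixReviewTime`.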


In [194]:
# see columns
print("Columns in the data: \n", amz_rev.columns)

Columns in the data: 
 Index(['overall', 'verified', 'reviewTime', 'reviewerID', 'asin', 'style',
       'reviewerName', 'reviewText', 'summary', 'unixReviewTime', 'category',
       'vote', 'image'],
      dtype='object')


In [195]:
# remove certain columns
amz_rev.drop(['verified', 'style', 'reviewTime', 'vote', 'summary', 'image'], axis=1, inplace=True)

# see updated dataframe
display(amz_rev.head(3))

Unnamed: 0,overall,reviewerID,asin,reviewerName,reviewText,unixReviewTime,category
0,5.0,A33PVCHCQ2BTN0,B0010ZBORW,VW,I really like this nail brush from Urban Spa. ...,1505088000,beauty
1,4.0,A2503LT8PZIHAD,B0010ZBORW,Trouble,This is about the same quality foot file as th...,1504310400,beauty
2,4.0,A1MAI0TUIM3R2X,B001LNODUS,Princess Bookworm,Nice lavender lotion that absorbs easily in my...,1519171200,beauty


In [196]:
# sort and order the data
amz_rev.sort_values(by=['asin', 'overall'], ascending=[True, False], inplace=True)

# reorder columns
amz_rev = amz_rev[['reviewerID', 'reviewerName', 'unixReviewTime', 'asin', 'reviewText', 'category', 'overall']]

# see updated dataframe
display(amz_rev.head(3))

Unnamed: 0,reviewerID,reviewerName,unixReviewTime,asin,reviewText,category,overall
229835,A1OKMIT8B373YD,djhexane,1070236800,5164885,The classic of Trans-Siberian Orchestra. This...,cds_and_vinyl,5.0
231082,A1SCJWCMQ3W3KK,Irishgal,1166745600,5164885,"In the 1980s, Mannheim Steamroller took Christ...",cds_and_vinyl,5.0
232494,AZOILH84GFKHO,L. Beth Stock,1438732800,5164885,LOVE IT!,cds_and_vinyl,5.0


## <a id='toc2_2_'></a>[NaN/Missing Values in the Data](#toc0_)

We look at NaN values in the data. If a review (row) has a missing value (NaN) in any column, we remove that review from the dataset. As the counts below show, only `reviewerName` and `reviewText` contain missing values.



In [197]:
# how many nulls in each column
amz_rev.isnull().sum()

reviewerID          0
reviewerName      325
unixReviewTime      0
asin                0
reviewText        436
category            0
overall             0
dtype: int64

In [198]:
# remove rows with null values
amz_rev = amz_rev.dropna()

In [199]:
# shape of the data
print("Shape of the data: ", amz_rev.shape)

# see if any missing values
print("\nNumber of missing values: ", amz_rev.isnull().sum().sum())

# percentage of missing values in each column
print("\nPercentage of missing values in each column: \n", round(amz_rev.isnull().sum()/len(amz_rev)*100,2))

# percentage of missing values in each category (relative to all rows)
print("\nPercentage of missing values in each category: \n", round(amz_rev.groupby(['category']).apply(lambda x: x.isnull().sum()).sum(axis=1)/len(amz_rev)*100,2))

Shape of the data:  (1102271, 7)

Number of missing values:  0

Percentage of missing values in each column: 
 reviewerID        0.0
reviewerName      0.0
unixReviewTime    0.0
asin              0.0
reviewText        0.0
category          0.0
overall           0.0
dtype: float64

Percentage of missing values in each category: 
 category
appliances                    0.0
arts_crafts                   0.0
automotive                    0.0
beauty                        0.0
cds_and_vinyl                 0.0
cell_phones                   0.0
clothing_shoes_and_jewelry    0.0
digital_music                 0.0
electronics                   0.0
fashion                       0.0
gift_cards                    0.0
grocery_and_gourmet_food      0.0
home_and_kitchen              0.0
industrial                    0.0
kindle_store                  0.0
luxury_beauty                 0.0
magazine_subscriptions        0.0
movies_and_tv                 0.0
musical_instruments           0.0
office_products

## <a id='toc2_3_'></a>[Duplicates in the Data](#toc0_)

We look at duplicates in the data.

A duplicate is defined as a review (row) that has the same values across all columns. We remove duplicates from the dataset.
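
To make the definition concrete, the following plain-Python sketch mimics how full-row duplicates are counted. The rows are toy values, not taken from the dataset. By default, pandas' `duplicated()` leaves the first occurrence of each row unflagged and counts only the extra copies, whereas `keep=False` flags every copy. This is why the number of duplicates reported below can be smaller than the number of rows displayed with `keep=False`.

```python
from collections import Counter

# Toy rows (hypothetical): (reviewerID, asin, reviewText, overall)
rows = [
    ("U1", "P1", "great", 5.0),
    ("U1", "P1", "great", 5.0),   # exact copy of the first row
    ("U1", "P1", "great", 5.0),   # another exact copy
    ("U2", "P1", "great", 5.0),   # different reviewer, so not a duplicate
]

counts = Counter(rows)

# Extra copies beyond the first occurrence (default `duplicated()` convention)
n_extra_copies = sum(c - 1 for c in counts.values())

# Every row that has at least one identical twin (`keep=False` convention)
n_all_copies = sum(c for c in counts.values() if c > 1)

# Dropping duplicates keeps the first occurrence and preserves row order
deduplicated = list(dict.fromkeys(rows))

print(n_extra_copies, n_all_copies, len(deduplicated))
```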

In [200]:
# see if any duplicates
print("Number of duplicates: ", amz_rev.duplicated().sum())

# see duplicates
amz_rev[amz_rev.duplicated(keep=False)].sort_values(by=['asin']).head(4)

Number of duplicates:  72731


Unnamed: 0,reviewerID,reviewerName,unixReviewTime,asin,reviewText,category,overall
946377,AWG2O9C42XW5G,Oreo Cookie Cherry Cola 2018,1482710400,782010792,this is a awesome movie. there is a reason why...,movies_and_tv,5.0
960630,AWG2O9C42XW5G,Oreo Cookie Cherry Cola 2018,1482710400,782010792,this is a awesome movie. there is a reason why...,movies_and_tv,5.0
979022,AWG2O9C42XW5G,Oreo Cookie Cherry Cola 2018,1482710400,782010792,this is a awesome movie. there is a reason why...,movies_and_tv,5.0
951096,A16QODENBJVUI1,Robert Moore,1152144000,1424819210,I have an extremely conflicted relationship wi...,movies_and_tv,5.0


In [201]:
# remove duplicates
amz_rev.drop_duplicates(inplace=True)

# see updated dataframe
display(amz_rev.head(3))

# shape of the data
print("Shape of the data: ", amz_rev.shape)

Unnamed: 0,reviewerID,reviewerName,unixReviewTime,asin,reviewText,category,overall
229835,A1OKMIT8B373YD,djhexane,1070236800,5164885,The classic of Trans-Siberian Orchestra. This...,cds_and_vinyl,5.0
231082,A1SCJWCMQ3W3KK,Irishgal,1166745600,5164885,"In the 1980s, Mannheim Steamroller took Christ...",cds_and_vinyl,5.0
232494,AZOILH84GFKHO,L. Beth Stock,1438732800,5164885,LOVE IT!,cds_and_vinyl,5.0


Shape of the data:  (1029540, 7)


In [202]:
# data set 1: keep products (asin) with at least 10 reviews, then reviewers with at least 10 reviews
amz_rev = amz_rev[amz_rev['asin'].isin(amz_rev.groupby('asin').size().reset_index(name='counts').query('counts >= 10')['asin'])]
amz_rev = amz_rev[amz_rev['reviewerID'].isin(amz_rev.groupby('reviewerID').size().reset_index(name='counts').query('counts >= 10')['reviewerID'])]
print(amz_rev.shape)

# # data set 2
# amz_rev = amz_rev[amz_rev['asin'].isin(amz_rev.groupby('asin').size().reset_index(name='counts').query('counts >= 12')['asin'])]
# amz_rev = amz_rev[amz_rev['reviewerID'].isin(amz_rev.groupby('reviewerID').size().reset_index(name='counts').query('counts >= 12')['reviewerID'])]
# print(amz_rev.shape)

# # data set 3
# amz_rev = amz_rev[amz_rev['asin'].isin(amz_rev.groupby('asin').size().reset_index(name='counts').query('counts >= 14')['asin'])]
# amz_rev = amz_rev[amz_rev['reviewerID'].isin(amz_rev.groupby('reviewerID').size().reset_index(name='counts').query('counts >= 14')['reviewerID'])]
# print(amz_rev.shape)

# data set 4
# amz_rev = amz_rev[amz_rev['asin'].isin(amz_rev.groupby('asin').size().reset_index(name='counts').query('counts >= 20')['asin'])]
# amz_rev = amz_rev[amz_rev['reviewerID'].isin(amz_rev.groupby('reviewerID').size().reset_index(name='counts').query('counts >= 20')['reviewerID'])]
# print(amz_rev.shape)

(959425, 7)
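
Note that the filter above is applied once per dimension: products first, then reviewers. Removing reviewers can push some products back below the threshold. If a strict guarantee is needed (every remaining product and every remaining reviewer has at least `k` reviews), the two filters can be repeated until nothing changes. The following is a minimal sketch in plain Python with toy data. The helper name `k_core` is hypothetical and is not part of the notebook.

```python
from collections import Counter

def k_core(pairs, k):
    """Repeatedly drop (reviewer, product) pairs until every remaining
    reviewer and every remaining product appears at least k times."""
    pairs = list(pairs)
    while True:
        # Step 1: keep only products with at least k reviews
        product_counts = Counter(product for _, product in pairs)
        kept = [p for p in pairs if product_counts[p[1]] >= k]
        # Step 2: keep only reviewers with at least k remaining reviews
        reviewer_counts = Counter(reviewer for reviewer, _ in kept)
        kept = [p for p in kept if reviewer_counts[p[0]] >= k]
        # Stop once a full pass removes nothing
        if len(kept) == len(pairs):
            return kept
        pairs = kept

# Toy data: a single pass would leave product "p3" with only one review
pairs = [("u1", "p1"), ("u2", "p1"), ("u1", "p2"),
         ("u2", "p2"), ("u1", "p3"), ("u3", "p3")]
result = k_core(pairs, k=2)
print(result)
```

Whether the extra passes are worth it depends on how strictly the downstream model relies on the minimum-count assumption.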


## <a id='toc2_5_'></a>[Standardizing and Checking Data Types](#toc0_)

We turn to standardizing and checking data types. 

In [203]:
# change unixReviewTime to datetime
amz_rev['unixReviewTime'] = pd.to_datetime(amz_rev['unixReviewTime'], unit='s')

# rename column: unixReviewTime to reviewTime
amz_rev.rename(columns={'unixReviewTime': 'reviewTime'}, inplace=True)

# see updated dataframe
display(amz_rev.head(3))

Unnamed: 0,reviewerID,reviewerName,reviewTime,asin,reviewText,category,overall
229835,A1OKMIT8B373YD,djhexane,2003-12-01,5164885,The classic of Trans-Siberian Orchestra. This...,cds_and_vinyl,5.0
231082,A1SCJWCMQ3W3KK,Irishgal,2006-12-22,5164885,"In the 1980s, Mannheim Steamroller took Christ...",cds_and_vinyl,5.0
232494,AZOILH84GFKHO,L. Beth Stock,2015-08-05,5164885,LOVE IT!,cds_and_vinyl,5.0


In [204]:
# check data types
amz_rev.dtypes

reviewerID              object
reviewerName            object
reviewTime      datetime64[ns]
asin                    object
reviewText              object
category                object
overall                float64
dtype: object
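
As a quick sanity check on the conversion, the first `unixReviewTime` value shown earlier (`1070236800`) can be converted with the standard library alone. Unix timestamps count seconds since 1970-01-01 UTC, which is why `unit='s'` is passed to `pd.to_datetime`.

```python
from datetime import datetime, timezone

unix_review_time = 1070236800  # first row of the sorted dataframe above
review_time = datetime.fromtimestamp(unix_review_time, tz=timezone.utc)

print(review_time.date().isoformat())  # 2003-12-01
```

The result matches the `reviewTime` value displayed for that row.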

## <a id='toc2_6_'></a>[Normalizing Data](#toc0_)

We normalize the `overall` rating to the range 0 to 1 using min-max scaling.

In [205]:
# get min and max for ratings
min_rating = amz_rev['overall'].min()
max_rating = amz_rev['overall'].max()

# Normalize the ratings to a range from 0 to 1
amz_rev['normalized_rating'] = (amz_rev['overall'] - min_rating) / (max_rating - min_rating)

# see updated dataframe
amz_rev.head(3)

# see summary of normalized ratings
amz_rev['normalized_rating'].describe()


count    959425.000000
mean          0.860155
std           0.247412
min           0.000000
25%           0.750000
50%           1.000000
75%           1.000000
max           1.000000
Name: normalized_rating, dtype: float64

The `normalized_rating` column contains the ratings rescaled to the range 0 to 1: the minimum rating is subtracted from each rating and the result is divided by the range (maximum minus minimum).

Ratings which were 1 are now 0, ratings which were 3 are now 0.5, and ratings which were 5 are now 1.
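
The mapping can be checked on the five possible star ratings with a small plain-Python sketch. The helper `min_max` is illustrative only. Its guard for a constant column is an added assumption that the notebook code does not include.

```python
def min_max(values):
    """Rescale values to [0, 1] using (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: map a constant column to 0.0 to avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
normalized = min_max(ratings)

print(normalized)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```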

In [206]:
# see updated dataframe
display(amz_rev.head(3))


Unnamed: 0,reviewerID,reviewerName,reviewTime,asin,reviewText,category,overall,normalized_rating
229835,A1OKMIT8B373YD,djhexane,2003-12-01,5164885,The classic of Trans-Siberian Orchestra. This...,cds_and_vinyl,5.0,1.0
231082,A1SCJWCMQ3W3KK,Irishgal,2006-12-22,5164885,"In the 1980s, Mannheim Steamroller took Christ...",cds_and_vinyl,5.0,1.0
232494,AZOILH84GFKHO,L. Beth Stock,2015-08-05,5164885,LOVE IT!,cds_and_vinyl,5.0,1.0


# <a id='toc4_'></a>[Dealing with Text Data](#toc0_)

We deal with text data. After the column selection above, the dataset contains one free-text column, `reviewText`, alongside the short label columns `reviewerName` and `category`. We clean `reviewText` and look out for the following:

**Identify special characters or symbols**: Look for any special characters, symbols, or non-alphanumeric characters that may need to be cleaned or removed. These characters can interfere with downstream analysis or modeling.

**Handle HTML tags or formatting**: If the text contains HTML tags or formatting, remove them or convert them to plain text.

## <a id='toc4_1_'></a>[Text Normalization](#toc0_)

**Text Normalization**: Text normalization is the process of transforming text into a single canonical form that it might not have had before. This is done by removing unnecessary characters, such as punctuation or special characters; converting all letters to lowercase or uppercase; and/or expanding abbreviations.


### <a id='toc4_1_5_'></a>[ReviewText](#toc0_)

In [207]:
# see review text
amz_rev['reviewText']

# create function to clean the reviewText column
def clean_review_text(text):
    # Remove square brackets, quotes, HTML tags, commas, URLs, and CSS styling
    text = re.sub(r"\[|\]|\'|<[^>]*>|\"|,|https?://\S+|\{[^}]+\}", "", text)

    # Remove common update notes (case-insensitive, since lowercasing happens last)
    text = re.sub(r'update \d+/\d+/\d+:', '', text, flags=re.IGNORECASE)

    # Remove dates of the form dd/dd/dd
    text = re.sub(r'\d{2}/\d{2}/\d{2}', '', text)

    # Collapse extra whitespace left behind by the removals
    text = " ".join(text.split())

    # Remove a trailing period
    text = re.sub(r"\.$", "", text)

    # Convert to lowercase
    text = text.lower()

    return text


# apply function to reviewText column
amz_rev['reviewText'] = amz_rev['reviewText'].apply(clean_review_text)

# see updated dataframe
amz_rev['reviewText']

229835    the classic of trans-siberian orchestra. this ...
231082    in the 1980s mannheim steamroller took christm...
232494                                             love it!
233609    this is one of my favorite cds. ive seen trans...
234450    the album tells a story - it is a sweet story ...
                                ...                        
743429    fun collection. still as creepy as before but ...
800044    im relatively new to the current game consoles...
671916    bioshock has long been one of my favorite seri...
719111    alright first im going to get this part out of...
725017    sorry im giving every re-release a 1/5 stars b...
Name: reviewText, Length: 959425, dtype: object

In [208]:
# check for any special characters in reviewText
amz_rev[amz_rev['reviewText'].str.contains(r"[^a-zA-Z0-9\s]")]

# remove special characters from reviewText (regex=True is required in recent pandas versions)
amz_rev['reviewText'] = amz_rev['reviewText'].str.replace(r"[^a-zA-Z0-9\s]", "", regex=True)

# count of products that still have a special character in reviewText
print("Number records with special character in reviewText:", amz_rev[amz_rev['reviewText'].str.contains(r"[^a-zA-Z0-9\s]")]['asin'].unique().size)



Number records with special character in reviewText: 0
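
One side effect of deleting special characters outright is that hyphenated words are glued together: "trans-siberian" becomes "transsiberian" in the cleaned output. The sketch below reproduces this with the standard `re` module. It also shows an alternative, not used in this notebook, that replaces each special character with a space and then collapses whitespace.

```python
import re

pattern = re.compile(r"[^a-zA-Z0-9\s]")

examples = ["the classic of trans-siberian orchestra.", "love it!"]

# Behaviour of the cell above: delete every special character
removed = [pattern.sub("", text) for text in examples]

# Alternative (not what the notebook does): replace with a space, then collapse whitespace
spaced = [" ".join(pattern.sub(" ", text).split()) for text in examples]

print(removed)
print(spaced)
```

Which variant is preferable depends on whether hyphenated terms should count as one token or two in the later tokenization step.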


***
## <a id='toc4_2_'></a>[Text Cleaning](#toc0_)

We specifically explore: 
1. Tokenization
2. Stopword Removal
3. Stemming and Lemmatization
4. Handling Noise and Irrelevant Information

We do this only for the `reviewText` column.


In [209]:
display(amz_rev.head(6))

Unnamed: 0,reviewerID,reviewerName,reviewTime,asin,reviewText,category,overall,normalized_rating
229835,A1OKMIT8B373YD,djhexane,2003-12-01,5164885,the classic of transsiberian orchestra this al...,cds_and_vinyl,5.0,1.0
231082,A1SCJWCMQ3W3KK,Irishgal,2006-12-22,5164885,in the 1980s mannheim steamroller took christm...,cds_and_vinyl,5.0,1.0
232494,AZOILH84GFKHO,L. Beth Stock,2015-08-05,5164885,love it,cds_and_vinyl,5.0,1.0
233609,A1W4O4F225MSKD,Bearlady59,2017-02-08,5164885,this is one of my favorite cds ive seen transs...,cds_and_vinyl,5.0,1.0
234450,A36VOVWL720LJ7,Babs,2015-03-01,5164885,the album tells a story it is a sweet story a...,cds_and_vinyl,5.0,1.0
236011,A22LU2IH0YX6EY,L. Gold,2015-11-21,5164885,wanted it finally got it love the music every ...,cds_and_vinyl,5.0,1.0


### <a id='toc4_2_1_'></a>[Tokenization](#toc0_)

**Tokenization**: Tokenization is the process of splitting text into smaller chunks, called tokens. Tokens are usually words, sentences, or individual characters. Tokenization is a crucial first step in text analysis because the later steps (stopword removal, stemming, and lemmatization) operate on individual tokens.
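
Since punctuation has already been stripped from `reviewText`, a regular-expression tokenizer gives a rough idea of what the token lists look like. This is a simplified stand-in, not NLTK's `word_tokenize`, which applies additional rules and may split some words differently. The helper name `simple_tokenize` is hypothetical.

```python
import re

def simple_tokenize(text):
    """Split text into runs of letters and digits.

    Simplified stand-in for nltk.tokenize.word_tokenize.
    """
    return re.findall(r"[a-z0-9]+", text.lower())

tokens = simple_tokenize("love it")
more_tokens = simple_tokenize("in the 1980s mannheim steamroller took christmas music")

print(tokens)
print(more_tokens)
```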


In [210]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

# amz_rev.reviewText is a pandas Series
# Tokenize each review in a loop; non-string values get an empty token list

tokens_reviewText = []
for text in amz_rev.reviewText:
    if isinstance(text, str):
        tokens_reviewText.append(word_tokenize(text))
    else:
        tokens_reviewText.append([])

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/pavansingh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [211]:
# shape of tokens
print("Shape of tokens_reviewText:", len(tokens_reviewText))

# shape of reviewText
print("Shape of reviewText:", len(amz_rev.reviewText))

Shape of tokens_reviewText: 959425
Shape of reviewText: 959425


In [212]:
# Print the tokens
print("Some output of review text tokens:", tokens_reviewText[0:2])

Some output of review text tokens: [['the', 'classic', 'of', 'transsiberian', 'orchestra', 'this', 'album', 'is', 'a', 'wonderful', 'mix', 'of', 'christmas', 'favourites', 'and', 'newly', 'made', 'tracks', 'kicked', 'up', 'a', 'notch', 'this', 'is', 'what', 'theatrical', 'orchestral', 'rock', 'is', 'the', 'album', 'starts', 'with', 'an', 'angel', 'sent', 'down', 'from', 'god', 'to', 'bring', 'back', 'all', 'the', 'good', 'his', 'children', 'have', 'done', 'since', 'the', 'birth', 'of', 'jesus', 'the', 'angel', 'visits', 'many', 'places', 'and', 'we', 'are', 'given', 'many', 'stories', 'he', 'visits', 'sarajevo', 'during', 'the', 'war', 'he', 'hears', 'the', 'bombs', 'and', 'sees', 'the', 'fighting', 'but', 'he', 'can', 'still', 'hear', 'people', 'singing', 'the', 'joy', 'of', 'the', 'christmas', 'season', 'my', 'favourite', 'is', 'the', 'story', 'of', 'the', 'little', 'homelessgirl', 'who', 'ran', 'away', 'from', 'home', 'a', 'long', 'time', 'ago', 'she', 'is', 'destitute', 'and', 'has

### <a id='toc4_2_2_'></a>[Stopword Removal](#toc0_)

**Stopword Removal**: Stopwords are words that are very common in the English language, such as "the," "a," "an," "is," and "are." These words are often removed during preprocessing because they carry little topical information and add noise and size to the vocabulary used in downstream natural language processing and machine learning.
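
The idea can be sketched with a small hand-written stopword set. The set below is hypothetical and far shorter than NLTK's English list, which is what the notebook actually uses.

```python
# Hypothetical mini stopword list; NLTK's English list is much longer
STOPWORDS = {"the", "a", "an", "is", "are", "of", "this", "it"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Return the tokens that are not in the stopword set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = ["the", "classic", "of", "transsiberian", "orchestra",
          "this", "album", "is", "a", "wonderful", "mix"]
filtered = remove_stopwords(tokens)

print(filtered)
```

NLTK's English list also contains negations such as "not" and "no". Removing them may change the apparent polarity of a review, which is worth keeping in mind if the filtered tokens are later used for sentiment analysis.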


In [213]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')

# Get the set of English stopwords
stopword_set = set(stopwords.words('english'))

# Remove stopwords from the tokens
filtered_tokens_reviewText = [[word for word in tokens if word.lower() not in stopword_set] for tokens in tokens_reviewText]

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/pavansingh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [214]:

# Print the filtered tokens
print("Review Text tokens after stop word removal:", filtered_tokens_reviewText[0:2])

Review Text tokens after stop word removal: [['classic', 'transsiberian', 'orchestra', 'album', 'wonderful', 'mix', 'christmas', 'favourites', 'newly', 'made', 'tracks', 'kicked', 'notch', 'theatrical', 'orchestral', 'rock', 'album', 'starts', 'angel', 'sent', 'god', 'bring', 'back', 'good', 'children', 'done', 'since', 'birth', 'jesus', 'angel', 'visits', 'many', 'places', 'given', 'many', 'stories', 'visits', 'sarajevo', 'war', 'hears', 'bombs', 'sees', 'fighting', 'still', 'hear', 'people', 'singing', 'joy', 'christmas', 'season', 'favourite', 'story', 'little', 'homelessgirl', 'ran', 'away', 'home', 'long', 'time', 'ago', 'destitute', 'nothing', 'visits', 'bar', 'bartender', 'bartender', 'moved', 'child', 'calls', 'cab', 'emptys', 'register', 'drawer', 'sends', 'child', 'home', 'greeted', 'loving', 'family', 'brings', 'tears', 'eyes', 'everyone', 'duty', 'christmas', 'classic'], ['1980s', 'mannheim', 'steamroller', 'took', 'christmas', 'music', 'updated', 'new', 'century', 'decade'

### <a id='toc4_2_3_'></a>[Stemming and Lemmatization](#toc0_)


**Stemming and Lemmatization**: Stemming reduces a word to its stem by stripping affixes with simple rules. It is fast, but the stem is not always a real word: in the output below, the Porter stemmer turns "favorite" into "favorit", "story" into "stori", and "christmas" into "christma". This can cost some meaning or interpretability.

Lemmatization instead maps a word to its lemma, or dictionary form, so the result is a valid word. Given the right part of speech, "running" is lemmatized to "run" and "better" to "good". Lemmatization is usually more accurate but computationally more expensive than stemming. Note that NLTK's `WordNetLemmatizer.lemmatize` treats every word as a noun unless a part-of-speech tag is passed, so the call used below mostly normalizes plurals ("cds" to "cd", "tells" to "tell") and leaves verb forms such as "wanted" unchanged.

If preserving the exact meaning and interpretability of words is crucial for the recommender system, lemmatization is the better choice. If speed and simplicity matter more, stemming can be sufficient.

In [215]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('wordnet')  # corpus required by WordNetLemmatizer

# Stemming
stemmer = PorterStemmer()
stemmed_words_revText = [[stemmer.stem(word) for word in tokens] for tokens in filtered_tokens_reviewText]

# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_words_revText = [[lemmatizer.lemmatize(word) for word in tokens] for tokens in filtered_tokens_reviewText]


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/pavansingh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [216]:
print("Original words:", filtered_tokens_reviewText[0:2])
print("Stemmed words:", stemmed_words_revText[0:2])
print("Lemmatized words:", lemmatized_words_revText[0:2])

Original words: [['classic', 'transsiberian', 'orchestra', 'album', 'wonderful', 'mix', 'christmas', 'favourites', 'newly', 'made', 'tracks', 'kicked', 'notch', 'theatrical', 'orchestral', 'rock', 'album', 'starts', 'angel', 'sent', 'god', 'bring', 'back', 'good', 'children', 'done', 'since', 'birth', 'jesus', 'angel', 'visits', 'many', 'places', 'given', 'many', 'stories', 'visits', 'sarajevo', 'war', 'hears', 'bombs', 'sees', 'fighting', 'still', 'hear', 'people', 'singing', 'joy', 'christmas', 'season', 'favourite', 'story', 'little', 'homelessgirl', 'ran', 'away', 'home', 'long', 'time', 'ago', 'destitute', 'nothing', 'visits', 'bar', 'bartender', 'bartender', 'moved', 'child', 'calls', 'cab', 'emptys', 'register', 'drawer', 'sends', 'child', 'home', 'greeted', 'loving', 'family', 'brings', 'tears', 'eyes', 'everyone', 'duty', 'christmas', 'classic'], ['1980s', 'mannheim', 'steamroller', 'took', 'christmas', 'music', 'updated', 'new', 'century', 'decade', 'half', 'later', 'transsib

***
# <a id='toc6_'></a>[Creating New Dataset and Saving](#toc0_)

We create a new dataset using all the cleaning and processing done above and save it as a CSV file (`Data/set1_data_cleaned.csv` for Set 1).

In [217]:
# check that every generated token list has one entry per review
print("Shape of stemmed_words_revText:", len(stemmed_words_revText))
print("Shape of lemmatized_words_revText:", len(lemmatized_words_revText))
print("Shape of filtered_tokens_reviewText:", len(filtered_tokens_reviewText))

Shape of stemmed_words_revText: 959425
Shape of lemmatized_words_revText: 959425
Shape of filtered_tokens_reviewText: 959425


In [218]:
# attach stemmed words to dataframe
amz_rev['stemmed_words_revText'] = stemmed_words_revText

# attach lemmatized words to dataframe
amz_rev['lemmatized_words_revText'] = lemmatized_words_revText

# attach filtered tokens to dataframe
amz_rev['filtered_tokens_revText'] = filtered_tokens_reviewText
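
One caveat when saving these columns: CSV has no list type, so each token list is written as its string representation (for example `"['album', 'tell', 'story']"`) and comes back as a plain string when the file is reloaded. The sketch below illustrates the round trip with the standard library, using an in-memory buffer instead of a real file and a single toy row.

```python
import ast
import csv
import io

tokens = ["album", "tell", "story"]

# Write one row whose second field is a list (stored as its string form)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["asin", "lemmatized_words_revText"])
writer.writerow(["5164885", str(tokens)])

# Read it back: the list comes back as a plain string
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw = rows[0]["lemmatized_words_revText"]

# Parse the string back into a list of tokens
restored = ast.literal_eval(raw)

print(type(raw).__name__, restored)
```

In the analysis workbook, one way to do the same with pandas would be to pass `converters={'lemmatized_words_revText': ast.literal_eval}` to `pd.read_csv`. This is a suggestion, not something this notebook does.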


In [219]:
# see resulting dataframe
amz_rev.head(6)

Unnamed: 0,reviewerID,reviewerName,reviewTime,asin,reviewText,category,overall,normalized_rating,stemmed_words_revText,lemmatized_words_revText,filtered_tokens_revText
229835,A1OKMIT8B373YD,djhexane,2003-12-01,5164885,the classic of transsiberian orchestra this al...,cds_and_vinyl,5.0,1.0,"[classic, transsiberian, orchestra, album, won...","[classic, transsiberian, orchestra, album, won...","[classic, transsiberian, orchestra, album, won..."
231082,A1SCJWCMQ3W3KK,Irishgal,2006-12-22,5164885,in the 1980s mannheim steamroller took christm...,cds_and_vinyl,5.0,1.0,"[1980, mannheim, steamrol, took, christma, mus...","[1980s, mannheim, steamroller, took, christmas...","[1980s, mannheim, steamroller, took, christmas..."
232494,AZOILH84GFKHO,L. Beth Stock,2015-08-05,5164885,love it,cds_and_vinyl,5.0,1.0,[love],[love],[love]
233609,A1W4O4F225MSKD,Bearlady59,2017-02-08,5164885,this is one of my favorite cds ive seen transs...,cds_and_vinyl,5.0,1.0,"[one, favorit, cd, ive, seen, transsiberian, o...","[one, favorite, cd, ive, seen, transsiberian, ...","[one, favorite, cds, ive, seen, transsiberian,..."
234450,A36VOVWL720LJ7,Babs,2015-03-01,5164885,the album tells a story it is a sweet story a...,cds_and_vinyl,5.0,1.0,"[album, tell, stori, sweet, stori, music, great]","[album, tell, story, sweet, story, music, great]","[album, tells, story, sweet, story, music, great]"
236011,A22LU2IH0YX6EY,L. Gold,2015-11-21,5164885,wanted it finally got it love the music every ...,cds_and_vinyl,5.0,1.0,"[want, final, got, love, music, everi, time, h...","[wanted, finally, got, love, music, every, tim...","[wanted, finally, got, love, music, every, tim..."


In [220]:
# # save cleaned data
amz_rev.to_csv("Data/set1_data_cleaned.csv", index=False)

# # save cleaned data
# amz_rev.to_csv("Data/set2_data_cleaned.csv", index=False)

# # save cleaned data
# amz_rev.to_csv("Data/set3_data_cleaned.csv", index=False)

# save cleaned data
# amz_rev.to_csv("Data/set4_data_cleaned.csv", index=False)


# stats of cleaned data
print("Shape of cleaned data:", amz_rev.shape)
print("\nNumber of unique products:", amz_rev['asin'].unique().size)
print("\nNumber of unique users:", amz_rev['reviewerID'].unique().size)
print("\nNumber of unique categories:", amz_rev['category'].unique().size)
print("\nNumber of reviews per category:\n", amz_rev['category'].value_counts())
print("\nColumns available:", amz_rev.columns)

Shape of cleaned data: (959425, 11)

Number of unique products: 41182

Number of unique users: 51524

Number of unique categories: 28

Number of reviews per category:
 video_games                   173032
arts_crafts                   124943
office_products               111452
grocery_and_gourmet_food       77380
prime_pantry                   74773
patio_lawn_and_garden          64513
musical_instruments            51529
cds_and_vinyl                  41211
movies_and_tv                  40959
digital_music                  33103
pet_supplies                   32551
tools_and_home_improvement     26639
industrial                     19966
cell_phones                    16098
luxury_beauty                  15537
automotive                     12350
toys_and_games                 10679
electronics                     9669
sports_and_outdoors             6952
home_and_kitchen                6451
software                        6096
clothing_shoes_and_jewelry       944
gift_cards        