# <a id='toc1_'></a>[Data Cleaning and Processing](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [Data Cleaning and Processing](#toc1_)    
- [Data Cleaning Processes](#toc2_)    
  - [Removing Unnecessary Columns](#toc2_1_)    
  - [NaN/Missing Values in the Data](#toc2_2_)    
  - [Duplicates in the Data](#toc2_3_)    
  - [Outliers in the Data](#toc2_4_)    
  - [Standardizing and Checking Data Types](#toc2_5_)    
  - [Normalizing Data](#toc2_6_)    
  - [Check Data Imbalance](#toc2_7_)    
- [Dealing with Categorical Data](#toc3_)    
- [Dealing with Text Data](#toc4_)    
  - [Text Normalization](#toc4_1_)    
    - [Description](#toc4_1_1_)    
    - [Title](#toc4_1_2_)    
    - [Brand](#toc4_1_3_)    
    - [Category](#toc4_1_4_)    
    - [ReviewText](#toc4_1_5_)    
  - [Text Cleaning](#toc4_2_)    
    - [Tokenization](#toc4_2_1_)    
    - [Stopword Removal](#toc4_2_2_)    
    - [Stemming and Lemmatization](#toc4_2_3_)    
- [Sentiment Analysis](#toc5_)    
  - [Sentiment Analysis using Lexicon-based Methods](#toc5_1_)    
    - [VADER](#toc5_1_1_)    
    - [TextBlob](#toc5_1_2_)    
    - [Bing, AFINN, and NRC](#toc5_1_3_)    
    - [Comparison of Lexicon-based Methods](#toc5_1_4_)    
- [Creating New Dataset and Saving](#toc6_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [1]:
# reset (removing all variables, functions, and other objects from memory)
%reset -f

# import all the necessary packages
import pandas as pd
import numpy as np
import warnings
import re



In this workbook we clean and process the data for the project and create a new dataset for the analysis. The dataset is saved as a CSV file and used in the analysis workbook. Specifically, we do the following:

1. Load the data from the CSV file created in the data loading workbook
2. Clean the data by:
    - removing columns that are not needed
    - handling missing values
    - removing duplicates
    - looking at outliers
    - standardizing and checking data types
    - dealing with text data
    - dealing with categorical data (categories)
    - checking for data consistency across columns
    - normalizing data (price, votes, etc.)
    - looking at rows with missing values
    - checking data balance
    - feature engineering (sentiment analysis, etc.)

3. Create a new dataset
4. Save the new dataset

# <a id='toc2_'></a>[Data Cleaning Processes](#toc0_)

In [3]:
# load data - MAC OS
amz_rev = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/all_revs_with_records_1.csv', low_memory=False)

In [5]:
#  drop column 'Unnamed: 0'
amz_rev = amz_rev.drop(columns=['Unnamed: 0'])

# initial data view
display(amz_rev.head(3))
print("Shape of the data: ", amz_rev.shape)

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,category,vote,image
0,5.0,False,"09 11, 2017",A33PVCHCQ2BTN0,B0010ZBORW,{'Color:': ' Nail Brush'},VW,I really like this nail brush from Urban Spa. ...,"Handy nail brush, gets garden dirt out from un...",1505088000,beauty,,
1,4.0,False,"09 2, 2017",A2503LT8PZIHAD,B0010ZBORW,{'Color:': ' Foot File'},Trouble,This is about the same quality foot file as th...,Basic foot file,1504310400,beauty,,
2,4.0,False,"02 21, 2018",A1MAI0TUIM3R2X,B001LNODUS,{'Color:': ' Body Lotion'},Princess Bookworm,Nice lavender lotion that absorbs easily in my...,Fragrant Lavender Lotion,1519171200,beauty,,


Shape of the data:  (4164059, 13)


## <a id='toc2_1_'></a>[Removing Unnecessary Columns](#toc0_)

We look at removing columns that are not needed for the analysis.


In [6]:
# see columns
print("Columns in the data: \n", amz_rev.columns)

Columns in the data: 
 Index(['overall', 'verified', 'reviewTime', 'reviewerID', 'asin', 'style',
       'reviewerName', 'reviewText', 'summary', 'unixReviewTime', 'category',
       'vote', 'image'],
      dtype='object')


In [7]:
# remove certain columns
amz_rev.drop(['verified', 'style', 'reviewTime', 'vote', 'summary', 'image'], axis=1, inplace=True)

# see updated dataframe
display(amz_rev.head(3))

Unnamed: 0,overall,reviewerID,asin,reviewerName,reviewText,unixReviewTime,category
0,5.0,A33PVCHCQ2BTN0,B0010ZBORW,VW,I really like this nail brush from Urban Spa. ...,1505088000,beauty
1,4.0,A2503LT8PZIHAD,B0010ZBORW,Trouble,This is about the same quality foot file as th...,1504310400,beauty
2,4.0,A1MAI0TUIM3R2X,B001LNODUS,Princess Bookworm,Nice lavender lotion that absorbs easily in my...,1519171200,beauty


In [8]:
# sort and order the data
amz_rev.sort_values(by=['asin', 'overall'], ascending=[True, False], inplace=True)

# reorder columns
amz_rev = amz_rev[['reviewerID', 'reviewerName', 'unixReviewTime', 'asin', 'reviewText', 'category', 'overall']]

# see updated dataframe
display(amz_rev.head(3))

Unnamed: 0,reviewerID,reviewerName,unixReviewTime,asin,reviewText,category,overall
696553,A12R54MKO17TW0,J. Bynum,1326067200,1393774,Keith Green / Songs for the Shepherd: His pre...,cds_and_vinyl,5.0
840939,A3SNL7UJY7GWBI,Lady Leatherneck,1455148800,1393774,"THank you Jesus Lord God, that brother Green's...",cds_and_vinyl,5.0
887328,AEKGGV851HY3K,Avid Reader,1130803200,1393774,Keith Green had a passionate love for Jesus. ...,cds_and_vinyl,5.0


## <a id='toc2_2_'></a>[NaN/Missing Values in the Data](#toc0_)

We look at NaN values in the data. Only `reviewerName` and `reviewText` contain missing values (see the counts below). If a review (row) has a missing value (NaN) in any column, we remove that review (row) from the dataset.
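
As a minimal sketch of this row-wise removal, the toy frame below (made-up values, not taken from the dataset) counts missing values per column and then drops every row that has a NaN in any column, which is what `DataFrame.dropna()` does with its default arguments. The names `toy` and `toy_clean` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy reviews; the values are made up for illustration only.
toy = pd.DataFrame({
    "reviewerID": ["A1", "A2", "A3", "A4"],
    "reviewerName": ["Ann", np.nan, "Cy", "Di"],
    "reviewText": ["great", "ok", None, "bad"],
})

# Count missing values per column before cleaning.
missing_per_column = toy.isnull().sum()

# dropna() with default arguments removes a row if ANY column is missing.
toy_clean = toy.dropna()

print(missing_per_column)
print(toy_clean)
```

Passing `subset=[...]` to `dropna()` would restrict the check to selected columns if only some of them matter.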



In [9]:
# how many nulls in each column
amz_rev.isnull().sum()

reviewerID           0
reviewerName       791
unixReviewTime       0
asin                 0
reviewText        1391
category             0
overall              0
dtype: int64

In [10]:
# remove rows with null values
amz_rev = amz_rev.dropna()

In [12]:
# shape of the data
print("Shape of the data: ", amz_rev.shape)

# see if any missing values
print("\nNumber of missing values: ", amz_rev.isnull().sum().sum())

# percentage of missing values in each column
print("\nPercentage of missing values in each column: \n", round(amz_rev.isnull().sum()/len(amz_rev)*100,2))

# missing values per category, as a percentage of all rows
print("\nPercentage of missing values in each category: \n", round(amz_rev.groupby(['category']).apply(lambda x: x.isnull().sum()).sum(axis=1)/len(amz_rev)*100,2))

Shape of the data:  (4161877, 7)

Number of missing values:  0

Percentage of missing values in each column: 
 reviewerID        0.0
reviewerName      0.0
unixReviewTime    0.0
asin              0.0
reviewText        0.0
category          0.0
overall           0.0
dtype: float64

Percentage of missing values in each category: 
 category
appliances                    0.0
arts_crafts                   0.0
automotive                    0.0
beauty                        0.0
cds_and_vinyl                 0.0
cell_phones                   0.0
clothing_shoes_and_jewelry    0.0
digital_music                 0.0
electronics                   0.0
fashion                       0.0
gift_cards                    0.0
grocery_and_gourmet_food      0.0
home_and_kitchen              0.0
industrial                    0.0
kindle_store                  0.0
luxury_beauty                 0.0
magazine_subscriptions        0.0
movies_and_tv                 0.0
musical_instruments           0.0
office_products

## <a id='toc2_3_'></a>[Duplicates in the Data](#toc0_)

We look at duplicates in the data.

A duplicate is defined as a review (row) that has the same values across all columns. We remove duplicates from the dataset.
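
The definition above can be sketched on a toy frame (made-up values; `toy`, `all_copies`, and `deduped` are illustrative names). Rows count as duplicates only when every column matches, so two different reviewers writing the same text are not treated as duplicates.

```python
import pandas as pd

# Toy reviews; rows 0 and 1 are identical in every column.
toy = pd.DataFrame({
    "reviewerID": ["A1", "A1", "A2"],
    "asin": ["B01", "B01", "B01"],
    "reviewText": ["loved it", "loved it", "loved it"],
})

# duplicated() flags every repeat after the first occurrence.
n_duplicates = int(toy.duplicated().sum())

# keep=False flags all members of a duplicate group, which helps inspection.
all_copies = toy[toy.duplicated(keep=False)]

# drop_duplicates() keeps the first occurrence of each full-row duplicate.
deduped = toy.drop_duplicates()

print(n_duplicates)
print(deduped)
```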

In [13]:
# see if any duplicates
print("Number of duplicates: ", amz_rev.duplicated().sum())

# see duplicates
amz_rev[amz_rev.duplicated(keep=False)].sort_values(by=['asin']).head(4)

Number of duplicates:  126374


Unnamed: 0,reviewerID,reviewerName,unixReviewTime,asin,reviewText,category,overall
1712045,AIYVMK30Q2G1L,Monica Lambert,1465430400,439499887,a gift and she loved it,office_products,5.0
1756241,AIYVMK30Q2G1L,Monica Lambert,1465430400,439499887,a gift and she loved it,office_products,5.0
1770904,A38X1A0N3BJ6EY,Raymond K.,1451174400,439499887,Granddaughter loves the book.,office_products,5.0
1870218,A38X1A0N3BJ6EY,Raymond K.,1451174400,439499887,Granddaughter loves the book.,office_products,5.0


In [14]:
# remove duplicates
amz_rev.drop_duplicates(inplace=True)

# see updated dataframe
display(amz_rev.head(3))

# shape of the data
print("Shape of the data: ", amz_rev.shape)

Unnamed: 0,reviewerID,reviewerName,unixReviewTime,asin,reviewText,category,overall
696553,A12R54MKO17TW0,J. Bynum,1326067200,1393774,Keith Green / Songs for the Shepherd: His pre...,cds_and_vinyl,5.0
840939,A3SNL7UJY7GWBI,Lady Leatherneck,1455148800,1393774,"THank you Jesus Lord God, that brother Green's...",cds_and_vinyl,5.0
887328,AEKGGV851HY3K,Avid Reader,1130803200,1393774,Keith Green had a passionate love for Jesus. ...,cds_and_vinyl,5.0


Shape of the data:  (4035503, 7)


In [25]:
# show reviews written by reviewers with at least 10 reviews
amz_rev.groupby('reviewerID').filter(lambda x: len(x) >= 10)


Unnamed: 0,reviewerID,reviewerName,unixReviewTime,asin,reviewText,category,overall
696553,A12R54MKO17TW0,J. Bynum,1326067200,0001393774,Keith Green / Songs for the Shepherd: His pre...,cds_and_vinyl,5.0
840939,A3SNL7UJY7GWBI,Lady Leatherneck,1455148800,0001393774,"THank you Jesus Lord God, that brother Green's...",cds_and_vinyl,5.0
887328,AEKGGV851HY3K,Avid Reader,1130803200,0001393774,Keith Green had a passionate love for Jesus. ...,cds_and_vinyl,5.0
3587154,AFR9EUQIILJLC,libertyinmo,1421625600,0001526863,Great way for children and adults to memorize ...,movies_and_tv,5.0
3713974,A1J0AEZCHZIWOL,GBC93,1425859200,0005000009,Great documentary! Very recommended.,movies_and_tv,5.0
...,...,...,...,...,...,...,...
974835,A5CNVTMYJXH53,komelina,1471478400,B01HJH9IN6,This case is nice but definitely for looks rat...,cell_phones,3.0
1040004,A4DTRG6NC8CED,Cyn,1472688000,B01HJH9IN6,Just received it and it's VERY thin. No need t...,cell_phones,1.0
1137418,A1E0QSD4PGWH96,BlueFug8,1477440000,B01HJHC4WS,There's no better way to say this without bein...,clothing_shoes_and_jewelry,5.0
2842280,A4E9SIJIN79V2,tom poulton,1474675200,B01HJHS73S,Everything it is advertised to be. Really sol...,tools_and_home_improvement,5.0


In [26]:
# Count how many reviews per reviewer
review_counts = amz_rev.groupby('reviewerID').count()

# Filter the DataFrame to keep only users with over 10 reviews
amz_rev_filtered = amz_rev[amz_rev['reviewerID'].isin(review_counts[review_counts['asin'] > 10].index)]

# check shape  
print("Shape of filtered data:", amz_rev_filtered.shape)

Shape of filtered data: (3695125, 7)


## <a id='toc2_5_'></a>[Standardizing and Checking Data Types](#toc0_)

We turn to standardizing and checking data types. 
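
The `unixReviewTime` column stores integers that count seconds since 1970-01-01 UTC. As a small standard-library sketch of what a seconds-based conversion does (the notebook itself uses `pd.to_datetime` with `unit='s'`), the snippet below converts the first timestamp shown in the data preview.

```python
from datetime import datetime, timezone

# First unixReviewTime value shown in the data preview.
unix_seconds = 1326067200

# Interpret the integer as seconds since 1970-01-01 00:00:00 UTC.
review_time = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)

print(review_time.date().isoformat())  # 2012-01-09
```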

In [28]:
# change unixReviewTime to datetime
amz_rev['unixReviewTime'] = pd.to_datetime(amz_rev['unixReviewTime'], unit='s')

# rename column: unixReviewTime to reviewTime
amz_rev.rename(columns={'unixReviewTime': 'reviewTime'}, inplace=True)

# see updated dataframe
display(amz_rev.head(3))

Unnamed: 0,reviewerID,reviewerName,reviewTime,asin,reviewText,category,overall
696553,A12R54MKO17TW0,J. Bynum,2012-01-09,1393774,Keith Green / Songs for the Shepherd: His pre...,cds_and_vinyl,5.0
840939,A3SNL7UJY7GWBI,Lady Leatherneck,2016-02-11,1393774,"THank you Jesus Lord God, that brother Green's...",cds_and_vinyl,5.0
887328,AEKGGV851HY3K,Avid Reader,2005-11-01,1393774,Keith Green had a passionate love for Jesus. ...,cds_and_vinyl,5.0


In [29]:
# check data types
amz_rev.dtypes

reviewerID              object
reviewerName            object
reviewTime      datetime64[ns]
asin                    object
reviewText              object
category                object
overall                float64
dtype: object

## <a id='toc2_6_'></a>[Normalizing Data](#toc0_)

We look at normalizing numeric data (price, votes, etc.). Here we apply min-max scaling to the `overall` rating.

In [30]:
# get min and max for ratings
min_rating = amz_rev['overall'].min()
max_rating = amz_rev['overall'].max()

# Normalize the ratings to a range from 0 to 1
amz_rev['normalized_rating'] = (amz_rev['overall'] - min_rating) / (max_rating - min_rating)

# see updated dataframe
amz_rev.head(3)

# see summary of normalized ratings
amz_rev['normalized_rating'].describe()


count    4.035503e+06
mean     8.537596e-01
std      2.572161e-01
min      0.000000e+00
25%      7.500000e-01
50%      1.000000e+00
75%      1.000000e+00
max      1.000000e+00
Name: normalized_rating, dtype: float64

The `normalized_rating` column contains the ratings rescaled to the range 0 to 1, where 0 corresponds to the minimum rating and 1 to the maximum rating. It is computed by subtracting the minimum rating from each rating and dividing by the range (maximum minus minimum).

With ratings from 1 to 5, a rating of 1 becomes 0, 2 becomes 0.25, 3 becomes 0.5, 4 becomes 0.75, and 5 becomes 1.
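
The same min-max calculation can be sketched in plain Python. The helper `min_max_normalize` is a hypothetical name used only for illustration, and returning zeros when all values are equal is a choice made for this sketch rather than something the notebook does.

```python
def min_max_normalize(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so the range is zero; return zeros
        # instead of dividing by zero (a choice made for this sketch).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
normalized = min_max_normalize(ratings)
print(normalized)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```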

In [31]:
# see updated dataframe
display(amz_rev.head(3))


Unnamed: 0,reviewerID,reviewerName,reviewTime,asin,reviewText,category,overall,normalized_rating
696553,A12R54MKO17TW0,J. Bynum,2012-01-09,1393774,Keith Green / Songs for the Shepherd: His pre...,cds_and_vinyl,5.0,1.0
840939,A3SNL7UJY7GWBI,Lady Leatherneck,2016-02-11,1393774,"THank you Jesus Lord God, that brother Green's...",cds_and_vinyl,5.0,1.0
887328,AEKGGV851HY3K,Avid Reader,2005-11-01,1393774,Keith Green had a passionate love for Jesus. ...,cds_and_vinyl,5.0,1.0


# <a id='toc4_'></a>[Dealing with Text Data](#toc0_)

We deal with text data. We have the following columns that contain text data:
- description
- title
- brand
- category
- reviewText


We handle each of these columns separately. We look out for: 

**Identify special characters or symbols**: Look for special characters, symbols, or other non-alphanumeric characters that may need to be cleaned or removed, since they can interfere with downstream analysis or modeling.

**Handle HTML tags or formatting**: If a text column (for example `description`) contains HTML tags or formatting, remove them or convert them into plain text.

## <a id='toc4_1_'></a>[Text Normalization](#toc0_)

**Text Normalization**: Text normalization is the process of transforming text into a single canonical form that it might not have had before. This is done by removing unnecessary characters, such as punctuation or special characters; converting all letters to lowercase or uppercase; and/or expanding abbreviations.
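
A minimal sketch of these normalization steps using only the `re` module is shown below. The function `normalize_text` and the sample string are hypothetical; the exact patterns used in this notebook differ in detail.

```python
import re


def normalize_text(text):
    """Strip HTML tags and URLs, drop non-alphanumerics, lowercase, tidy spaces."""
    text = re.sub(r"<[^>]*>", " ", text)        # HTML tags
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)  # punctuation and symbols
    text = text.lower()
    return " ".join(text.split())               # collapse whitespace


sample = "Great <b>documentary</b>! Very recommended: https://example.com/x"
print(normalize_text(sample))  # great documentary very recommended
```

Removing URLs before stripping punctuation matters here: in the opposite order the URL would lose its `://` and slashes and survive as a run of letters.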


### <a id='toc4_1_5_'></a>[ReviewText](#toc0_)

In [33]:
# see reviewText
amz_rev['reviewText']

# create function to clean the reviewText column
def clean_review_text(text):
    # Remove square brackets, single quotes, HTML tags, double quotes, commas, URLs, and CSS styling
    text = re.sub(r"\[|\]|\'|<[^>]*>|\"|,|https?://[^\s]+|{[^}]+}", "", text)
    
    # Remove extra whitespace
    text = " ".join(text.split())
    
    # Remove common update notes (case-insensitive, since lowercasing happens later)
    update_regex = r'update \d+/\d+/\d+:'
    text = re.sub(update_regex, '', text, flags=re.IGNORECASE)

    # Remove a trailing period
    text = re.sub(r"\.$", "", text)

    # Remove dates of the form dd/dd/dd
    text = re.sub(r'\d{2}/\d{2}/\d{2}', '', text)

    
    # Convert to lowercase
    text = text.lower()
    
    return text


# apply function to the reviewText column
amz_rev['reviewText'] = amz_rev['reviewText'].apply(clean_review_text)

# see updated dataframe
amz_rev['reviewText']

696553     keith green / songs for the shepherd: his prev...
840939     thank you jesus lord god that brother greens m...
887328     keith green had a passionate love for jesus. t...
3587154    great way for children and adults to memorize ...
3713974                  great documentary! very recommended
                                 ...                        
974835     this case is nice but definitely for looks rat...
1040004    just received it and its very thin. no need to...
1137418    theres no better way to say this without being...
2842280    everything it is advertised to be. really soli...
506315                                             excellent
Name: reviewText, Length: 4035503, dtype: object

In [34]:
# check for special characters in reviewText
amz_rev[amz_rev['reviewText'].str.contains(r"[^a-zA-Z0-9\s]")]

# remove special characters from reviewText (pandas 2.0 and later default to regex=False)
amz_rev['reviewText'] = amz_rev['reviewText'].str.replace(r"[^a-zA-Z0-9\s]", "", regex=True)

# count of products that still have a special character in reviewText
print("Number records with special character in reviewText:", amz_rev[amz_rev['reviewText'].str.contains(r"[^a-zA-Z0-9\s]")]['asin'].unique().size)


Number records with special character in reviewText: 0


In [35]:
# see record
display(amz_rev[(amz_rev['reviewTime']=='2012-09-11') & (amz_rev['asin']=='B001YTK3XK')])

# view data 
amz_rev[(amz_rev['reviewTime']=='2012-09-11') & (amz_rev['asin']=='B001YTK3XK')]['reviewText'].values[0]

Unnamed: 0,reviewerID,reviewerName,reviewTime,asin,reviewText,category,overall,normalized_rating
1772008,A2LEIANN1UZTHP,brainout,2012-09-11,B001YTK3XK,update this machine is way overpriced now i g...,office_products,2.0,0.25




***
## <a id='toc4_2_'></a>[Text Cleaning](#toc0_)

We specifically explore: 
1. Tokenization
2. Stopword Removal
3. Stemming and Lemmatization
4. Handling Noise and Irrelevant Information

We apply these steps only to the `reviewText` column.


In [36]:
display(amz_rev.head(6))

Unnamed: 0,reviewerID,reviewerName,reviewTime,asin,reviewText,category,overall,normalized_rating
696553,A12R54MKO17TW0,J. Bynum,2012-01-09,1393774,keith green songs for the shepherd his previo...,cds_and_vinyl,5.0,1.0
840939,A3SNL7UJY7GWBI,Lady Leatherneck,2016-02-11,1393774,thank you jesus lord god that brother greens m...,cds_and_vinyl,5.0,1.0
887328,AEKGGV851HY3K,Avid Reader,2005-11-01,1393774,keith green had a passionate love for jesus th...,cds_and_vinyl,5.0,1.0
3587154,AFR9EUQIILJLC,libertyinmo,2015-01-19,1526863,great way for children and adults to memorize ...,movies_and_tv,5.0,1.0
3713974,A1J0AEZCHZIWOL,GBC93,2015-03-09,5000009,great documentary very recommended,movies_and_tv,5.0,1.0
3629903,A19UQOTRTZC7ZX,frantac,2015-04-04,5000009,old ugly bad choice,movies_and_tv,1.0,0.0


### <a id='toc4_2_1_'></a>[Tokenization](#toc0_)

**Tokenization**: Tokenization is the process of splitting text into smaller chunks, called tokens. Tokens are usually words, sentences, or individual characters. Tokenization is a crucial step in text analysis, as the meaning of a word can fundamentally change based on the tokens surrounding it.
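
To make the idea concrete, here is a deliberately simple regex-based word tokenizer. It is a stand-in for illustration, not a replacement for NLTK's `word_tokenize`, which treats punctuation and contractions differently; `simple_word_tokenize` is a hypothetical helper.

```python
import re


def simple_word_tokenize(text):
    """Split text into runs of letters, digits, and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)


tokens = simple_word_tokenize("great documentary very recommended")
print(tokens)  # ['great', 'documentary', 'very', 'recommended']
```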


In [37]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

# amz_rev.reviewText is a pandas Series of strings
# Tokenize each review in a loop; non-string entries get an empty token list

tokens_reviewText = []
for text in amz_rev.reviewText:
    if isinstance(text, str):
        tokens_reviewText.append(word_tokenize(text))
    else:
        tokens_reviewText.append([])

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/pavansingh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [38]:
# shape of tokens
print("Shape of tokens_reviewText:", len(tokens_reviewText))

# shape of reviewText
print("Shape of reviewText:", len(amz_rev.reviewText))

Shape of tokens_reviewText: 4035503
Shape of reviewText: 4035503


In [39]:
# Print the tokens
print("Some output of review text tokens:", tokens_reviewText[0:2])

Some output of review text tokens: [['keith', 'green', 'songs', 'for', 'the', 'shepherd', 'his', 'previous', 'albums', 'were', 'more', 'focused', 'on', 'encouragement', 'and', 'correction', 'towards', 'the', 'church', 'this', 'his', 'last', 'is', 'focused', 'on', 'praise', 'this', 'one', 'is', 'still', 'not', 'as', 'great', 'as', 'his', 'first', 'album', 'but', 'it', 'is', 'a', 'strong', 'enough', 'praise', 'album', 'to', 'earn', 'five', 'stars'], ['thank', 'you', 'jesus', 'lord', 'god', 'that', 'brother', 'greens', 'music', 'is', 'still', 'sounding', 'though', 'he', 'is', 'home', 'with', 'you', 'now']]


### <a id='toc4_2_2_'></a>[Stopword Removal](#toc0_)

**Stopword Removal**: Stopwords are words that are very common in English, such as "the," "a," "an," "is," and "are." They are often removed during preprocessing because they carry little meaning on their own and add noise to downstream natural language processing and machine learning.
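
The filtering step itself is a simple membership test. The sketch below uses a tiny hand-picked stopword set (an assumption for illustration; NLTK's English list is much longer) and a hypothetical helper `remove_stopwords`.

```python
# A tiny hand-picked stopword set for illustration only.
STOPWORDS = {"the", "a", "an", "is", "are", "for", "very", "it", "and"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only tokens whose lowercase form is not a stopword."""
    return [tok for tok in tokens if tok.lower() not in stopwords]


tokens = ["The", "case", "is", "nice", "and", "very", "thin"]
print(remove_stopwords(tokens))  # ['case', 'nice', 'thin']
```

Storing the stopwords in a `set` keeps each lookup constant-time, which matters when filtering millions of reviews.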


In [40]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')

# Get the set of English stopwords
stopword_set = set(stopwords.words('english'))

# Remove stopwords from the tokens
filtered_tokens_reviewText = [[word for word in tokens if word.lower() not in stopword_set] for tokens in tokens_reviewText]

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/pavansingh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [41]:

# Print the filtered tokens
print("Review Text tokens after stop word removal:", filtered_tokens_reviewText[0:2])

Review Text tokens after stop word removal: [['keith', 'green', 'songs', 'shepherd', 'previous', 'albums', 'focused', 'encouragement', 'correction', 'towards', 'church', 'last', 'focused', 'praise', 'one', 'still', 'great', 'first', 'album', 'strong', 'enough', 'praise', 'album', 'earn', 'five', 'stars'], ['thank', 'jesus', 'lord', 'god', 'brother', 'greens', 'music', 'still', 'sounding', 'though', 'home']]


### <a id='toc4_2_3_'></a>[Stemming and Lemmatization](#toc0_)


**Stemming and Lemmatization**: Stemming is the process of reducing a word to its stem, or root form. For example, "fishing" and "fished" both reduce to the stem "fish." Lemmatization is similar to stemming, but it uses a vocabulary and the word's part of speech to map each word to its dictionary form (lemma). For example, "better" is lemmatized to "good" when it is tagged as an adjective.

Stemming strips affixes (mostly suffixes) by rule to obtain the root form. For example, the word "running" would be stemmed to "run." Stemming is simpler and faster, but the resulting stem may not be an actual word, which can lead to loss of meaning or incorrect interpretations.

Lemmatization, on the other hand, aims to determine the lemma or dictionary form of a word. It takes into account the word's part of speech, ensuring that the resulting lemma is a valid word. For example, the word "running" would be lemmatized to "run" when it is tagged as a verb. Lemmatization provides more accurate results but can be computationally more expensive than stemming.

If preserving the exact meaning and interpretability of words is crucial for your recommender system, lemmatization would be a better choice. However, if speed and simplicity are more important, stemming could be sufficient.
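
The contrast can be illustrated with two toy functions: a rule-based suffix stripper and a dictionary lookup. Both are simplified assumptions for this sketch. `toy_stem` is not the Porter algorithm and `TOY_LEMMAS` is a made-up mini-dictionary, not WordNet.

```python
def toy_stem(word):
    """Strip a few common suffixes by rule; the result may not be a real word."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical mini-dictionary mapping inflected forms to their lemma.
TOY_LEMMAS = {"children": "child", "albums": "album", "better": "good"}


def toy_lemmatize(word):
    """Look the word up in a dictionary; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)


words = ["albums", "focused", "children", "previous"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)   # ['album', 'focus', 'children', 'previou']
print(lemmas)  # ['album', 'focused', 'child', 'previous']
```

The rule-based version produces the non-word "previou" and misses the irregular plural "children", while the lookup returns only valid words but is limited to what the dictionary contains.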

In [42]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
# WordNetLemmatizer needs the 'wordnet' corpus; run nltk.download('wordnet') if it is not installed
nltk.download('punkt')

# Stemming
stemmer = PorterStemmer()
stemmed_words_revText = [[stemmer.stem(word) for word in tokens] for tokens in filtered_tokens_reviewText]

# Lemmatization (lemmatize() treats every word as a noun unless a pos argument is given)
lemmatizer = WordNetLemmatizer()
lemmatized_words_revText = [[lemmatizer.lemmatize(word) for word in tokens] for tokens in filtered_tokens_reviewText]


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/pavansingh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [43]:
print("Original words:", filtered_tokens_reviewText[0:2])
print("Stemmed words:", stemmed_words_revText[0:2])
print("Lemmatized words:", lemmatized_words_revText[0:2])

Original words: [['keith', 'green', 'songs', 'shepherd', 'previous', 'albums', 'focused', 'encouragement', 'correction', 'towards', 'church', 'last', 'focused', 'praise', 'one', 'still', 'great', 'first', 'album', 'strong', 'enough', 'praise', 'album', 'earn', 'five', 'stars'], ['thank', 'jesus', 'lord', 'god', 'brother', 'greens', 'music', 'still', 'sounding', 'though', 'home']]
Stemmed words: [['keith', 'green', 'song', 'shepherd', 'previou', 'album', 'focus', 'encourag', 'correct', 'toward', 'church', 'last', 'focus', 'prais', 'one', 'still', 'great', 'first', 'album', 'strong', 'enough', 'prais', 'album', 'earn', 'five', 'star'], ['thank', 'jesu', 'lord', 'god', 'brother', 'green', 'music', 'still', 'sound', 'though', 'home']]
Lemmatized words: [['keith', 'green', 'song', 'shepherd', 'previous', 'album', 'focused', 'encouragement', 'correction', 'towards', 'church', 'last', 'focused', 'praise', 'one', 'still', 'great', 'first', 'album', 'strong', 'enough', 'praise', 'album', 'earn'

***
# <a id='toc6_'></a>[Creating New Dataset and Saving](#toc0_)

We create a new dataset from all the cleaning and processing done above and save it as a CSV file, `Data/amz_rev_cleaned_1.csv`. A 5% random sample is also saved as `Data/amz_rev_sample_1.csv`.
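
One thing to keep in mind when saving token lists to CSV is that each list is written as its string representation, so it comes back as a string when the file is read again. The sketch below is a toy round trip through an in-memory buffer rather than the real files. It shows one possible way to restore the lists with `ast.literal_eval`; the frame `toy` and its values are illustrative.

```python
import ast
import io

import pandas as pd

toy = pd.DataFrame({
    "reviewText": ["old ugly bad choice"],
    "filtered_tokens_revText": [["old", "ugly", "bad", "choice"]],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# After reading, the token column holds strings such as "['old', 'ugly', ...]".
loaded = pd.read_csv(buffer)
raw_value = loaded.loc[0, "filtered_tokens_revText"]

# literal_eval parses the string back into a Python list without running code.
loaded["filtered_tokens_revText"] = loaded["filtered_tokens_revText"].apply(ast.literal_eval)
restored = loaded.loc[0, "filtered_tokens_revText"]
print(type(raw_value).__name__, restored)
```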

In [44]:
# get lengths of all generated token lists
print("Shape of stemmed_words_revText:", len(stemmed_words_revText))
print("Shape of lemmatized_words_revText:", len(lemmatized_words_revText))
print("Shape of filtered_tokens_reviewText:", len(filtered_tokens_reviewText))

Shape of stemmed_words_revText: 4035503
Shape of lemmatized_words_revText: 4035503
Shape of filtered_tokens_reviewText: 4035503


In [45]:
# attach stemmed words to dataframe
amz_rev['stemmed_words_revText'] = stemmed_words_revText

# attach lemmatized words to dataframe
amz_rev['lemmatized_words_revText'] = lemmatized_words_revText

# attach filtered tokens to dataframe
amz_rev['filtered_tokens_revText'] = filtered_tokens_reviewText


In [57]:
# see resulting dataframe
amz_rev.head(6)

Unnamed: 0,reviewerID,reviewerName,reviewTime,asin,reviewText,category,overall,normalized_rating,stemmed_words_revText,lemmatized_words_revText,filtered_tokens_revText
696553,A12R54MKO17TW0,J. Bynum,2012-01-09,1393774,keith green songs for the shepherd his previo...,cds_and_vinyl,5.0,1.0,"[keith, green, song, shepherd, previou, album,...","[keith, green, song, shepherd, previous, album...","[keith, green, songs, shepherd, previous, albu..."
840939,A3SNL7UJY7GWBI,Lady Leatherneck,2016-02-11,1393774,thank you jesus lord god that brother greens m...,cds_and_vinyl,5.0,1.0,"[thank, jesu, lord, god, brother, green, music...","[thank, jesus, lord, god, brother, green, musi...","[thank, jesus, lord, god, brother, greens, mus..."
887328,AEKGGV851HY3K,Avid Reader,2005-11-01,1393774,keith green had a passionate love for jesus th...,cds_and_vinyl,5.0,1.0,"[keith, green, passion, love, jesu, evid, life...","[keith, green, passionate, love, jesus, eviden...","[keith, green, passionate, love, jesus, eviden..."
3587154,AFR9EUQIILJLC,libertyinmo,2015-01-19,1526863,great way for children and adults to memorize ...,movies_and_tv,5.0,1.0,"[great, way, children, adult, memor, bibl, ver...","[great, way, child, adult, memorize, bible, ve...","[great, way, children, adults, memorize, bible..."
3713974,A1J0AEZCHZIWOL,GBC93,2015-03-09,5000009,great documentary very recommended,movies_and_tv,5.0,1.0,"[great, documentari, recommend]","[great, documentary, recommended]","[great, documentary, recommended]"
3629903,A19UQOTRTZC7ZX,frantac,2015-04-04,5000009,old ugly bad choice,movies_and_tv,1.0,0.0,"[old, ugli, bad, choic]","[old, ugly, bad, choice]","[old, ugly, bad, choice]"


In [59]:
# save cleaned data
amz_rev.to_csv("Data/amz_rev_cleaned_1.csv", index=False)

# stats of cleaned data
print("Shape of cleaned data:", amz_rev.shape)
print("\nNumber of unique products:", amz_rev['asin'].unique().size)
print("\nNumber of unique users:", amz_rev['reviewerID'].unique().size)
print("\nNumber of unique categories:", amz_rev['category'].unique().size)
print("\nNumber of reviews per category:\n", amz_rev['category'].value_counts())
print("\nColumns available:", amz_rev.columns)

Shape of cleaned data: (4035503, 11)

Number of unique products: 787192

Number of unique users: 203238

Number of unique categories: 28

Number of reviews per category:
 video_games                   312409
arts_crafts                   307979
office_products               298753
patio_lawn_and_garden         284202
grocery_and_gourmet_food      260506
cds_and_vinyl                 256067
tools_and_home_improvement    232943
kindle_store                  223939
movies_and_tv                 184548
automotive                    183731
cell_phones                   176943
toys_and_games                170947
pet_supplies                  164386
sports_and_outdoors           158620
home_and_kitchen              141080
electronics                   140111
musical_instruments           139877
digital_music                 112194
prime_pantry                  109072
clothing_shoes_and_jewelry     79625
industrial                     60520
luxury_beauty                  21871
software       

In [60]:
# create a smaller dataframe by randomly sampling 5% of the data
amz_rev_sample = amz_rev.sample(frac=0.05, random_state=42)

# save sampled data
amz_rev_sample.to_csv("Data/amz_rev_sample_1.csv", index=False)
