# 2 Data wrangling<a id='2_Data_wrangling'></a>

## 2.1 Contents<a id='2.1_Contents'></a>
* [2 Data wrangling](#2_Data_wrangling)
  * [2.1 Contents](#2.1_Contents)
  * [2.2 Introduction](#2.2_Introduction)
    * [2.2.1 Data Science Problem](#2.2.1_Data_Science_Problem)
  * [2.3 Imports](#2.3_Imports)
  * [2.4 Load Data](#2.4_Load_Data)
  * [2.5 Missing Values](#2.5_Missing_Values)
  * [2.6 Duplicate Values](#2.6_Duplicate_Values)
    * [2.6.1 Duplicate Records](#2.6.1_Duplicate_Records)
    * [2.6.2 Duplicate Reviews](#2.6.2_Duplicate_Reviews)
  * [2.7 Creating a Helpfulness Ratio](#2.7.2_Helpfulness_Ratio)
  * [2.8 Converting Time to DateTime](#2.8_Converting_Time)
  * [2.9 Text Preprocessing](#2.9_Text_Preprocessing)
  * [2.10 Tokenization and Stopword Removal](#2.10_Tokenization)
  * [2.11 Lemmatization](#2.11_Lemmatization)
  * [2.12 Stemming](#2.12_Stemming)    
  * [2.13 Saving the Cleaned Data](#2.13_Saving)
  * [2.14 Summary](#2.14_Summary)

## 2.2 Introduction<a id='2.2_Introduction'></a>

Data wrangling is the process of cleaning, transforming, and organizing raw data into a usable format. This notebook walks through the wrangling steps for the "Amazon Fine Food Reviews" dataset, which has the columns Id, ProductId, UserId, ProfileName, HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary, and Text.

### 2.2.1 Data Science Problem<a id='2.2.1_Data_Science_Problem'></a>

The "Amazon Fine Food Reviews" dataset offers valuable insights into customer sentiment and preferences through product reviews. By applying natural language processing (NLP) and machine learning techniques, the challenge is to analyze review texts, ratings, and other attributes to identify patterns and trends in customer feedback. This will help uncover key themes related to product quality, delivery experience, and overall satisfaction. The goal is to provide actionable insights for improving product offerings, enhancing customer service, and informing marketing strategies, ultimately leading to better customer understanding and increased loyalty.

## 2.3 Imports<a id='2.3_Imports'></a>

In [1]:
#importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import sqlite3
import nltk

## 2.4 Load Data<a id='2.4_Load_Data'></a>

In [2]:
# Read the CSV file
amazon_data = pd.read_csv("Reviews.csv")

# Shuffle the DataFrame
amazon_data = amazon_data.sample(frac=1, random_state=42)

# Limit to the first 100,000 records
amazon_data = amazon_data.head(100000)

# Display the first few records to verify
amazon_data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
165256,165257,B000EVG8J2,A1L01D2BD3RKVO,"B. Miller ""pet person""",0,0,5,1268179200,Crunchy & Good Gluten-Free Sandwich Cookies!,Having tried a couple of other brands of glute...
231465,231466,B0000BXJIS,A3U62RE5XZDP0G,Marty,0,0,5,1298937600,great kitty treats,My cat loves these treats. If ever I can't fin...
427827,427828,B008FHUFAU,AOXC0JQQZGGB6,Kenneth Shevlin,0,2,3,1224028800,COFFEE TASTE,A little less than I expected. It tends to ha...
433954,433955,B006BXV14E,A3PWPNZVMNX3PA,rareoopdvds,0,1,2,1335312000,So the Mini-Wheats were too big?,"First there was Frosted Mini-Wheats, in origin..."
70260,70261,B007I7Z3Z0,A1XNZ7PCE45KK7,Og8ys1,0,2,5,1334707200,Great Taste . . .,and I want to congratulate the graphic artist ...


In [3]:
# con = sqlite3.connect('database.sqlite')

In [4]:
# amazon_data = pd.read_sql_query(""" SELECT * FROM Reviews LIMIT 10000""", con)
# amazon_data.head()

## 2.5 Missing Values<a id='2.5_Missing_Values'></a>

In [5]:
# Check for missing values
missing_values = amazon_data.isnull().sum()
print(missing_values)

Id                        0
ProductId                 0
UserId                    0
ProfileName               2
HelpfulnessNumerator      0
HelpfulnessDenominator    0
Score                     0
Time                      0
Summary                   9
Text                      0
dtype: int64


In [6]:
# Fill missing values in 'ProfileName' with 'Unknown'
amazon_data['ProfileName'] = amazon_data['ProfileName'].fillna('Unknown')

# Drop rows where 'Text' is missing, as text data is crucial
amazon_data.dropna(subset=['Text'], inplace=True)

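The sample has 2 missing values in ProfileName and 9 in Summary; Text has none. The cell above fills the missing ProfileName values with 'Unknown', the dropna on Text removes no rows, and the 9 missing Summary values are left as they are.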
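The same step can be sketched with plain assignment instead of `inplace=True` on a column selection, a pattern that recent pandas versions warn about. The toy frame and the choice to fill `Summary` with an empty string are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for amazon_data (values are made up)
toy = pd.DataFrame({
    "ProfileName": ["Marty", np.nan, "Og8ys1"],
    "Summary": ["great kitty treats", "COFFEE TASTE", np.nan],
    "Text": ["My cat loves these treats.", "A little less than I expected.", "and I want to congratulate"],
})

# Assign the result back rather than calling fillna(..., inplace=True) on a column
toy["ProfileName"] = toy["ProfileName"].fillna("Unknown")
# Assumption: an empty string is an acceptable placeholder for a missing summary
toy["Summary"] = toy["Summary"].fillna("")
# Rows without review text are dropped, as in the cell above
toy = toy.dropna(subset=["Text"])

remaining = toy.isnull().sum()
print(remaining)
```

Filling `Summary` explicitly keeps a later string cast from turning missing values into placeholder text, which may or may not matter for the downstream analysis.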

## 2.6 Duplicate Values<a id='2.6_Duplicate_Values'></a>

### 2.6.1 Duplicate Records<a id='2.6.1_Duplicate_Records'></a>

In [7]:
# List of columns to check for duplicates, excluding 'Id'
columns_to_check = [col for col in amazon_data.columns if col != 'Id']

# Check for duplicate records based on all columns except 'Id'
duplicate_records = amazon_data[amazon_data.duplicated(subset=columns_to_check, keep=False)]

In [8]:
duplicate_records.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
170486,170487,B000MXEN9O,A3O0VDZUOJPZWX,Orlando Mom,0,0,5,1313020800,Best price anywhere!,My little one loves these and you cannot beat ...
81058,81059,B000MXHQW0,AO29VDV2AUM6W,Desiree Calora,0,0,5,1319068800,Great baby food,"This is great for your little one, no artifici..."
325098,325099,B0002DGRRA,AJD41FBJD9010,"N. Ferguson ""Two, Daisy, Hannah, and Kitten""",0,0,5,1233360000,best dog treat-- great for training--- all do...,Freeze dried liver has a hypnotic effect on do...
170484,170485,B000MXEN9O,A3O0VDZUOJPZWX,Orlando Mom,0,0,5,1313020800,Best price anywhere!,My little one loves these and you cannot beat ...
255267,255268,B0029NVJX8,A3LCQXQ4SFYBAU,Johna Jane,0,0,5,1345420800,Cat's favorite,These treats are my picky cat's favorite. I'v...


In [9]:
duplicate_records.shape

(26, 10)

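The check flags 26 records that have at least one exact duplicate on every column except Id. Because keep=False is used, this count includes every member of each duplicate group, not just the extra copies.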
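The difference between counting every member of a duplicate group and counting only the extra copies can be sketched on a toy frame. The data below is made up; only the `duplicated` and `drop_duplicates` calls mirror the notebook.

```python
import pandas as pd

# Toy frame: rows 1, 2 and 4 are identical apart from Id
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "ProductId": ["P1", "P1", "P2", "P1"],
    "UserId": ["U1", "U1", "U2", "U1"],
    "Text": ["good", "good", "fine", "good"],
})
cols = [c for c in toy.columns if c != "Id"]

# keep=False marks every row that belongs to a duplicate group
all_members = toy.duplicated(subset=cols, keep=False)
# keep="first" marks only the second and later copies in each group
extra_copies = toy.duplicated(subset=cols, keep="first")

n_members = int(all_members.sum())
n_extra = int(extra_copies.sum())
deduplicated = toy.drop_duplicates(subset=cols, keep="first")
print(n_members, n_extra, len(deduplicated))
```

The count reported with `keep=False` is therefore an upper bound on how many rows a `keep="first"` de-duplication would actually remove.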

### 2.6.2 Duplicate Reviews<a id='2.6.2_Duplicate_Reviews'></a>

In [10]:
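# Store non-duplicate records.
# Note: Id differs on every row, so comparing all columns removes nothing here;
# pass subset=columns_to_check to exclude the exact duplicates found in 2.6.1.
non_duplicate_records = amazon_data.drop_duplicates(keep=False)

# Check for duplicate reviews based on ProductId, UserId, ProfileName, and Time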
duplicate_reviews = non_duplicate_records[non_duplicate_records.duplicated(subset=['ProductId', 'UserId', 'ProfileName', 'Time'], keep=False)]

In [11]:
duplicate_reviews.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
41443,41444,B0088YBUOU,A37REIKYSHU4ZF,Miles Hiniker,2,2,3,1199750400,Yum,These potatoes are good. But i don't know if ...
64007,64008,B000NHYNNU,A36RGOVNCCA5KI,M. Barrow,0,0,5,1194566400,Good stuff,I decided to try this tea because I am a big f...
282916,282917,B002Z9EQPO,A2ISKAWUPGGOLZ,M. S. Handley,0,1,1,1310774400,Kitty Junk Food,We have five cats - one an elderly cat of 15 y...
188194,188195,B001NXM3I0,A2BT4NQWOUS5C0,sandybeaches,0,0,5,1283299200,Tasty and Healthy,I love that my son is learning to feed himself...
8534,8535,B003VXFK44,A12BJ9GOL0T54E,"Andy L. ""NC_Andy""",0,0,3,1308182400,Didn't say anywhere it was coconut favored!,"It's coconut flavored, either you'll like it o..."


In [12]:
duplicate_reviews.shape

(305, 10)

There are 305 reviews in the sample that share ProductId, UserId, ProfileName, and Time with at least one other review, meaning the same user submitted more than one review for the same product at the same timestamp.
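To separate exact duplicates from reviews that only share the same product, user, and timestamp, the exact duplicates have to be removed with `Id` excluded from the comparison. The following is a minimal sketch on made-up data; the names `content_cols` and `key_cols` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "ProductId": ["P1", "P1", "P2", "P2", "P3"],
    "UserId": ["U1", "U1", "U2", "U2", "U3"],
    "ProfileName": ["a", "a", "b", "b", "c"],
    "Time": [100, 100, 200, 200, 300],
    "Text": ["same", "same", "first version", "edited version", "unique"],
})
content_cols = [c for c in toy.columns if c != "Id"]
key_cols = ["ProductId", "UserId", "ProfileName", "Time"]

# With Id included, every row is unique and nothing is dropped
unchanged = toy.drop_duplicates(keep=False)

# Excluding Id removes both copies of the exact duplicate (rows 1 and 2)
without_exact = toy.drop_duplicates(subset=content_cols, keep=False)

# What remains with a shared key differs in content (here: the Text column)
same_key = without_exact[without_exact.duplicated(subset=key_cols, keep=False)]
print(same_key["Id"].tolist())
```

Whether such same-key reviews should be kept, merged, or dropped is a modeling decision that the notebook leaves open.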

## 2.7 Creating a Helpfulness Ratio<a id='2.7.2_Helpfulness_Ratio'></a>

In [13]:
# Create a new column for the helpfulness ratio
amazon_data['HelpfulnessRatio'] = amazon_data['HelpfulnessNumerator'] / amazon_data['HelpfulnessDenominator']

# 0/0 produces NaN for reviews with no votes; fill those with 0
amazon_data['HelpfulnessRatio'] = amazon_data['HelpfulnessRatio'].fillna(0)

In [14]:
amazon_data['HelpfulnessRatio'].value_counts(ascending=False)

0.000000    53360
1.000000    32304
0.500000     3894
0.666667     1798
0.750000     1127
            ...  
0.523810        1
0.931596        1
0.656250        1
0.045455        1
0.190476        1
Name: HelpfulnessRatio, Length: 478, dtype: int64

The output shows the distribution of HelpfulnessRatio in the amazon_data DataFrame. Most reviews sit at one of the two extremes: 53,360 have a ratio of 0.0 and 32,304 have a ratio of 1.0. The 0.0 group also contains reviews that received no votes at all (HelpfulnessDenominator of 0), because their NaN ratios were filled with 0, so it should not be read as "voted unhelpful" only. Intermediate values such as 0.5 (3,894), 0.67 (1,798), and 0.75 (1,127) are far less frequent and indicate mixed feedback.
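Since a ratio of 0 can mean either "no votes" or "votes, none helpful", one option is to keep a separate flag for reviews that received votes. This is a sketch on made-up numbers; the `HasVotes` column is a hypothetical addition, not something the notebook creates.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "HelpfulnessNumerator": [0, 0, 2, 1],
    "HelpfulnessDenominator": [0, 2, 2, 2],
})

# Flag reviews that received at least one vote (hypothetical column name)
toy["HasVotes"] = toy["HelpfulnessDenominator"] > 0

# Swap zero denominators for NaN so the division never divides by zero
ratio = toy["HelpfulnessNumerator"] / toy["HelpfulnessDenominator"].replace(0, np.nan)
toy["HelpfulnessRatio"] = ratio.fillna(0)
print(toy)
```

With the flag in place, the first two rows share a ratio of 0 but can still be told apart.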

## 2.8 Converting Time to DateTime<a id='2.8_Converting_Time'></a>

In [15]:
# Convert Time column to datetime
amazon_data['ReviewTime'] = pd.to_datetime(amazon_data['Time'], unit='s')

In [16]:
amazon_data['ReviewTime'].value_counts(ascending=False)

2012-10-16    206
2012-09-06    187
2011-11-25    185
2012-08-16    181
2012-08-06    177
             ... 
2006-07-28      1
2003-07-02      1
2006-01-19      1
2004-05-10      1
2005-08-09      1
Name: ReviewTime, Length: 2699, dtype: int64

October 16, 2012 has the most reviews in the sample, with 206.

In [17]:
print('The earliest review time is', amazon_data['ReviewTime'].min(), 'and the latest review time is', amazon_data['ReviewTime'].max())

The earliest review time is 1999-12-06 00:00:00 and the latest review time is 2012-10-26 00:00:00
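The conversion can be checked on a few `Time` values taken from the rows displayed earlier. The derived year and month features are hypothetical additions that a later analysis might want; they are not part of the notebook.

```python
import pandas as pd

# Unix timestamps (seconds) from the first three rows shown in section 2.4
times = pd.Series([1268179200, 1298937600, 1224028800], name="Time")
review_time = pd.to_datetime(times, unit="s")

# Derived calendar features (hypothetical names)
review_year = review_time.dt.year
review_month = review_time.dt.to_period("M")

dates = review_time.dt.strftime("%Y-%m-%d").tolist()
print(dates)
```

All timestamps in the displayed rows fall at midnight, which is why `value_counts` on `ReviewTime` effectively counts reviews per day.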


## 2.9 Text Preprocessing<a id='2.9_Text_Preprocessing'></a>

In [18]:
import re

# Define the clean_text function
def clean_text(text):
    if isinstance(text, float):
        text = str(text)
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'\d+', '', text)  # Remove digits
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    return text


# Ensure the Text and Summary columns are strings
amazon_data['Text'] = amazon_data['Text'].astype(str)
amazon_data['Summary'] = amazon_data['Summary'].astype(str)

# Apply the clean_text function to the Text and Summary columns
amazon_data['CleanedText'] = amazon_data['Text'].apply(clean_text)
amazon_data['CleanedSummary'] = amazon_data['Summary'].apply(clean_text)

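The preprocessing step casts the 'Text' and 'Summary' columns in the amazon_data dataframe to strings, converts them to lowercase, and removes digits and punctuation, storing the results in 'CleanedText' and 'CleanedSummary'. Whitespace is not normalized: runs of spaces and leading or trailing spaces are left in place, as the double space in the third sample row's CleanedText shows.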
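The `clean_text` function above lowercases the text and strips digits and punctuation. If whitespace should be normalized as well, a variant along these lines could be used. The name `clean_text_normalized` is hypothetical and the function is a sketch, not the notebook's implementation.

```python
import re

def clean_text_normalized(text):
    """Hypothetical variant of clean_text that also tidies whitespace."""
    text = str(text).lower()
    text = re.sub(r"\d+", "", text)      # remove digits
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation (underscores survive, as \w matches them)
    text = re.sub(r"\s+", " ", text)     # collapse runs of whitespace into one space
    return text.strip()                  # drop leading and trailing whitespace

example = clean_text_normalized("  A little less than I expected.  It costs $5!  ")
print(example)
```

Because the tokenizer in the next section splits on whitespace anyway, this mainly matters when the cleaned strings themselves are inspected or exported.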

## 2.10 Tokenization and Stopword Removal<a id='2.10_Tokenization'></a>

In [19]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Ensure the stopwords and punkt data are downloaded
# (newer NLTK releases may also need nltk.download('punkt_tab') for word_tokenize)
import nltk
nltk.download('stopwords')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))

# Function to tokenize and remove stopwords
def tokenize_text(text):
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]
    return tokens

# Apply the tokenize_text function to the CleanedText and CleanedSummary columns
amazon_data['TokenizedText'] = amazon_data['CleanedText'].apply(tokenize_text)
amazon_data['TokenizedSummary'] = amazon_data['CleanedSummary'].apply(tokenize_text)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\armeh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\armeh\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


The code tokenizes the cleaned text data from the 'CleanedText' and 'CleanedSummary' columns in the amazon_data dataframe and removes stopwords, creating tokenized versions of these columns.
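One side effect of removing punctuation before stopword filtering is that contractions lose their apostrophe and no longer match stopword entries that contain one, which is why `cant` survives in the sample output. The sketch below uses a tiny made-up stopword set and a whitespace split in place of NLTK's list and `word_tokenize`, so it runs without any downloads.

```python
# Tiny illustrative stopword set (an assumption; the notebook uses NLTK's English list)
stop_words = {"my", "these", "if", "i", "can't"}

def remove_stopwords(text, stop_words):
    # A whitespace split stands in for word_tokenize on already-cleaned text
    return [word for word in text.split() if word not in stop_words]

# "can't" has already become "cant" during cleaning, so it is not filtered out
cleaned = "my cat loves these treats if ever i cant find them"
tokens = remove_stopwords(cleaned, stop_words)
print(tokens)
```

If such tokens matter, stopwords could be removed before punctuation is stripped, or the stopword set could be extended with apostrophe-free forms.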

## 2.11 Lemmatization<a id='2.11_Lemmatization'></a>

In [20]:
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

# Function to lemmatize tokens
def lemmatize_tokens(tokens):
    return [lemmatizer.lemmatize(token) for token in tokens]

# Apply the lemmatize_tokens function to the TokenizedText and TokenizedSummary columns
amazon_data['LemmatizedText'] = amazon_data['TokenizedText'].apply(lemmatize_tokens)
amazon_data['LemmatizedSummary'] = amazon_data['TokenizedSummary'].apply(lemmatize_tokens)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\armeh\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


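The code lemmatizes the tokens in the 'TokenizedText' and 'TokenizedSummary' columns of the amazon_data dataframe and stores the results in 'LemmatizedText' and 'LemmatizedSummary'. Because lemmatize is called without a part-of-speech tag, WordNetLemmatizer treats every token as a noun, so plural nouns are reduced to their base form (brands becomes brand) while verb forms such as tends are left unchanged.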

## 2.12 Stemming<a id='2.12_Stemming'></a>

In [21]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Function to stem text
def stem_text(text):
    tokens = word_tokenize(text)
    stemmed_tokens = [stemmer.stem(word) for word in tokens]
    return ' '.join(stemmed_tokens)

# Apply stemming to the 'Summary' and 'Text' columns
amazon_data['StemmedSummary'] = amazon_data['Summary'].apply(stem_text)
amazon_data['StemmedText'] = amazon_data['Text'].apply(stem_text)


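This code tokenizes the raw 'Summary' and 'Text' columns of the amazon_data DataFrame, stems each token with the Porter stemmer from the NLTK library, and joins the stems back into a single string, stored in the new columns 'StemmedSummary' and 'StemmedText'. Because the raw columns are used rather than the cleaned or tokenized ones, punctuation and stopwords are still present in the stemmed output.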
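If stems of the cleaned, stopword-free tokens are wanted instead, the stemmer could be applied to token lists such as those produced in section 2.10. This is a minimal sketch; `stem_tokens` is a hypothetical helper and the token list is taken from the first sample summary.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_tokens(tokens):
    """Hypothetical helper: stem a list of tokens and keep the list structure."""
    return [stemmer.stem(token) for token in tokens]

tokens = ["crunchy", "good", "sandwich", "cookies"]
stemmed = stem_tokens(tokens)
print(stemmed)
```

On the full DataFrame this would be applied with something like `amazon_data['TokenizedSummary'].apply(stem_tokens)`, keeping the result parallel to the lemmatized columns.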

## 2.13 Saving the Cleaned Data<a id='2.13_Saving'></a>

In [22]:
amazon_data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,HelpfulnessRatio,ReviewTime,CleanedText,CleanedSummary,TokenizedText,TokenizedSummary,LemmatizedText,LemmatizedSummary,StemmedSummary,StemmedText
165256,165257,B000EVG8J2,A1L01D2BD3RKVO,"B. Miller ""pet person""",0,0,5,1268179200,Crunchy & Good Gluten-Free Sandwich Cookies!,Having tried a couple of other brands of glute...,0.0,2010-03-10,having tried a couple of other brands of glute...,crunchy good glutenfree sandwich cookies,"[tried, couple, brands, glutenfree, sandwich, ...","[crunchy, good, glutenfree, sandwich, cookies]","[tried, couple, brand, glutenfree, sandwich, c...","[crunchy, good, glutenfree, sandwich, cooky]",crunchi & good gluten-fre sandwich cooki !,have tri a coupl of other brand of gluten-fre ...
231465,231466,B0000BXJIS,A3U62RE5XZDP0G,Marty,0,0,5,1298937600,great kitty treats,My cat loves these treats. If ever I can't fin...,0.0,2011-03-01,my cat loves these treats if ever i cant find ...,great kitty treats,"[cat, loves, treats, ever, cant, find, house, ...","[great, kitty, treats]","[cat, love, treat, ever, cant, find, house, po...","[great, kitty, treat]",great kitti treat,my cat love these treat . if ever i ca n't fin...
427827,427828,B008FHUFAU,AOXC0JQQZGGB6,Kenneth Shevlin,0,2,3,1224028800,COFFEE TASTE,A little less than I expected. It tends to ha...,0.0,2008-10-15,a little less than i expected it tends to hav...,coffee taste,"[little, less, expected, tends, muddy, taste, ...","[coffee, taste]","[little, le, expected, tends, muddy, taste, ex...","[coffee, taste]",coffe tast,a littl less than i expect . it tend to have a...
433954,433955,B006BXV14E,A3PWPNZVMNX3PA,rareoopdvds,0,1,2,1335312000,So the Mini-Wheats were too big?,"First there was Frosted Mini-Wheats, in origin...",0.0,2012-04-25,first there was frosted miniwheats in original...,so the miniwheats were too big,"[first, frosted, miniwheats, original, size, f...","[miniwheats, big]","[first, frosted, miniwheats, original, size, f...","[miniwheats, big]",so the mini-wheat were too big ?,"first there wa frost mini-wheat , in origin si..."
70260,70261,B007I7Z3Z0,A1XNZ7PCE45KK7,Og8ys1,0,2,5,1334707200,Great Taste . . .,and I want to congratulate the graphic artist ...,0.0,2012-04-18,and i want to congratulate the graphic artist ...,great taste,"[want, congratulate, graphic, artist, putting,...","[great, taste]","[want, congratulate, graphic, artist, putting,...","[great, taste]",great tast . . .,and i want to congratul the graphic artist for...


In [23]:
# Save the cleaned data to a new CSV file
amazon_data.to_csv('amazon_data_wrangling.csv', index=False)
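
One thing to keep in mind when reloading the saved file: CSV stores everything as text, so list columns come back as strings and datetime columns as plain objects unless they are parsed on read. The sketch below shows one way to restore both, using an in-memory buffer in place of the real file and a made-up one-row frame.

```python
import ast
import io

import pandas as pd

toy = pd.DataFrame({
    "ReviewTime": pd.to_datetime([1268179200], unit="s"),
    "TokenizedSummary": [["great", "kitty", "treats"]],
})

buffer = io.StringIO()  # in-memory stand-in for the CSV file
toy.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(
    buffer,
    parse_dates=["ReviewTime"],                          # rebuild the datetime column
    converters={"TokenizedSummary": ast.literal_eval},   # turn "['a', 'b']" back into a list
)
print(type(restored.loc[0, "TokenizedSummary"]), restored["ReviewTime"].dtype)
```

A binary format such as pickle or Parquet would avoid the round trip through text, at the cost of a less portable file.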

## 2.14 Summary<a id='2.14_Summary'></a>

The steps above wrangle the "Amazon Fine Food Reviews" dataset: loading and sampling 100,000 reviews, handling missing values, checking for duplicate records and duplicate reviews, creating the HelpfulnessRatio feature, converting Time to a datetime, cleaning the text, tokenizing and removing stopwords, lemmatizing, stemming, and saving the result to amazon_data_wrangling.csv. The duplicates are identified but not removed. The dataset is now in a usable format for further analysis and modeling tasks.