# Data Wrangling

In [1]:
import pandas as pd
import numpy as np
import warnings
import re
from nltk.tokenize import WhitespaceTokenizer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
stop = stopwords.words('english')
from library.sb_utils import save_file
warnings.filterwarnings("ignore")

The first step is to gather all the data. Right now it is split into reviews and products, and I need to consolidate them into a single dataframe.

In [2]:
reviews = pd.read_csv('../data/combined/reviews.csv')
products = pd.read_csv('../data/combined/products.csv')
data = reviews.copy()  # working copy of the reviews; product columns get added below

In [3]:
reviews.head(3)

Unnamed: 0,brand,key,author,date,stars,title,helpful_yes,helpful_no,text,taste,ingredients,texture,likes
0,bj,0_bj,Ilovebennjerry,2017-04-15,3,Not enough brownies!,10.0,3.0,"Super good, don't get me wrong. But I came for...",,,,
1,bj,0_bj,Sweettooth909,2020-01-05,5,I’m OBSESSED with this pint!,3.0,0.0,I decided to try it out although I’m not a hug...,,,,
2,bj,0_bj,LaTanga71,2018-04-26,3,My favorite...More Caramel Please,5.0,2.0,My caramel core begins to disappear about half...,,,,


In [4]:
products.head(3)

Unnamed: 0,brand,key,name,subhead,description,rating,rating_count,ingredients
0,bj,0_bj,Salted Caramel Core,Sweet Cream Ice Cream with Blonde Brownies & a...,Find your way to the ultimate ice cream experi...,3.7,208,"CREAM, SKIM MILK, LIQUID SUGAR (SUGAR, WATER),..."
1,bj,1_bj,Netflix & Chilll'd™,Peanut Butter Ice Cream with Sweet & Salty Pre...,There’s something for everyone to watch on Net...,4.0,127,"CREAM, SKIM MILK, LIQUID SUGAR (SUGAR, WATER),..."
2,bj,2_bj,Chip Happens,A Cold Mess of Chocolate Ice Cream with Fudge ...,Sometimes “chip” happens and everything’s a me...,4.7,130,"CREAM, LIQUID SUGAR (SUGAR, WATER), SKIM MILK,..."


Before I create any new dataframes, I need to make sure that the keys align.

In [5]:
set(products['key']) == set(reviews['key'])

True

Perfect! Now I can perform the merge.

In [6]:
data['name'] = None 
data['description'] = None
data['rating'] = None 
data['rating_count'] = None 

In [7]:
for i in range(len(reviews)):
    # Look up the single product row that matches this review's key
    row = products.loc[products['key'] == reviews['key'].iloc[i]].iloc[0]
    # Assign on the frame itself; chained data[col].iloc[i] = ... may write to a copy
    data.at[i, 'name'] = row['name']
    data.at[i, 'description'] = row['description']
    data.at[i, 'rating'] = row['rating']
    data.at[i, 'rating_count'] = row['rating_count']

Now I'll take a peek at the data and its shape.

In [8]:
data.head(3)

Unnamed: 0,brand,key,author,date,stars,title,helpful_yes,helpful_no,text,taste,ingredients,texture,likes,name,description,rating,rating_count
0,bj,0_bj,Ilovebennjerry,2017-04-15,3,Not enough brownies!,10.0,3.0,"Super good, don't get me wrong. But I came for...",,,,,Salted Caramel Core,Find your way to the ultimate ice cream experi...,3.7,208
1,bj,0_bj,Sweettooth909,2020-01-05,5,I’m OBSESSED with this pint!,3.0,0.0,I decided to try it out although I’m not a hug...,,,,,Salted Caramel Core,Find your way to the ultimate ice cream experi...,3.7,208
2,bj,0_bj,LaTanga71,2018-04-26,3,My favorite...More Caramel Please,5.0,2.0,My caramel core begins to disappear about half...,,,,,Salted Caramel Core,Find your way to the ultimate ice cream experi...,3.7,208


In [9]:
data.shape

(21674, 17)

Now that there is one complete dataset, I can start checking out some of the data.

In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21674 entries, 0 to 21673
Data columns (total 17 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   brand         21674 non-null  object 
 1   key           21674 non-null  object 
 2   author        20874 non-null  object 
 3   date          21674 non-null  object 
 4   stars         21674 non-null  int64  
 5   title         16275 non-null  object 
 6   helpful_yes   21674 non-null  float64
 7   helpful_no    21674 non-null  float64
 8   text          21674 non-null  object 
 9   taste         4265 non-null   float64
 10  ingredients   4265 non-null   float64
 11  texture       4265 non-null   float64
 12  likes         2295 non-null   object 
 13  name          21674 non-null  object 
 14  description   21139 non-null  object 
 15  rating        21674 non-null  object 
 16  rating_count  21674 non-null  object 
dtypes: float64(5), int64(1), object(11)
memory usage: 2.8+ MB


Check for the number of nulls in each column.

In [11]:
for column in data:
    print(column + ": ", sum(data[column].isnull()))

brand:  0
key:  0
author:  800
date:  0
stars:  0
title:  5399
helpful_yes:  0
helpful_no:  0
text:  0
taste:  17409
ingredients:  17409
texture:  17409
likes:  19379
name:  0
description:  535
rating:  0
rating_count:  0
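
The same counts can be obtained without an explicit loop, and it is often useful to see them as a share of all rows. A small sketch on a toy frame (the column values here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing values (made-up data)
toy = pd.DataFrame({
    "author": ["a", None, "c", "d"],
    "taste": [np.nan, np.nan, np.nan, 4.0],
    "stars": [5, 3, 4, 1],
})

# isnull() gives a boolean frame; sum() counts True per column, mean() gives the fraction
null_summary = pd.DataFrame({
    "n_null": toy.isnull().sum(),
    "pct_null": (toy.isnull().mean() * 100).round(1),
})
print(null_summary.sort_values("n_null", ascending=False))
```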


The `taste`, `ingredients`, `texture` and `likes` columns are roughly 80 to 90 percent null, so I'll drop them. Time is irrelevant to this project, so I will also drop `date`, along with `helpful_yes` and `helpful_no`.

In [12]:
data.drop(['taste', 'ingredients', 'texture', 'likes', 'date', 'helpful_yes', 'helpful_no'], axis=1, inplace=True)

In [13]:
data.head(3)

Unnamed: 0,brand,key,author,stars,title,text,name,description,rating,rating_count
0,bj,0_bj,Ilovebennjerry,3,Not enough brownies!,"Super good, don't get me wrong. But I came for...",Salted Caramel Core,Find your way to the ultimate ice cream experi...,3.7,208
1,bj,0_bj,Sweettooth909,5,I’m OBSESSED with this pint!,I decided to try it out although I’m not a hug...,Salted Caramel Core,Find your way to the ultimate ice cream experi...,3.7,208
2,bj,0_bj,LaTanga71,3,My favorite...More Caramel Please,My caramel core begins to disappear about half...,Salted Caramel Core,Find your way to the ultimate ice cream experi...,3.7,208


This looks better. Now, to prep that data for sentiment analysis and EDA, I am going to combine the title and text data into a single column. But first, I need to make sure that the data is all in string format.

In [14]:
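# Fill missing titles first; casting NaN with astype(str) can turn it into the literal string 'nan'
data['title'] = data['title'].fillna('')

In [15]:
data['text'] = data['text'].astype(str)
data['title'] = data['title'].astype(str)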

In [16]:
data['text'] = data[['title', 'text']].apply('-'.join, axis=1)

In [17]:
data.drop(['title'], axis=1, inplace=True)
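
One detail worth checking when joining the two columns is what happens to reviews without a title. The following is a sketch on toy data of one possible way to add the `-` separator only when a title exists; it is an illustration, not the approach used in the cells above.

```python
import pandas as pd

# Toy reviews, one of them without a title (made-up data)
toy = pd.DataFrame({
    "title": ["Great", None],
    "text": ["Loved it", "Too sweet"],
})

# Replace missing values before casting so they become empty strings
title = toy["title"].fillna("").astype(str)
text = toy["text"].fillna("").astype(str)

# Keep "title-" where a title exists, otherwise use an empty prefix
prefix = (title + "-").where(title != "", "")
combined = prefix + text
print(combined.tolist())
```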

In [18]:
data.head(3)

Unnamed: 0,brand,key,author,stars,text,name,description,rating,rating_count
0,bj,0_bj,Ilovebennjerry,3,"Not enough brownies!-Super good, don't get me ...",Salted Caramel Core,Find your way to the ultimate ice cream experi...,3.7,208
1,bj,0_bj,Sweettooth909,5,I’m OBSESSED with this pint!-I decided to try ...,Salted Caramel Core,Find your way to the ultimate ice cream experi...,3.7,208
2,bj,0_bj,LaTanga71,3,My favorite...More Caramel Please-My caramel c...,Salted Caramel Core,Find your way to the ultimate ice cream experi...,3.7,208


In [19]:
data.shape

(21674, 9)

I'll also filter out reviews of any product that has 10 or fewer total ratings.

In [20]:
data = data[data.rating_count > 10]

In [21]:
data.shape

(21610, 9)
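
Since `rating_count` showed up as an `object` column in `data.info()`, it may be safer to convert it to a numeric dtype before comparing. A minimal sketch on toy data (names and counts are made up):

```python
import pandas as pd

# Toy frame where rating_count is stored as generic Python objects
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "rating_count": pd.Series([208, 10, 7], dtype="object"),
})

# Convert to a numeric dtype so comparisons are explicit and vectorized
toy["rating_count"] = pd.to_numeric(toy["rating_count"])

# A strict > 10 also drops products with exactly 10 ratings
kept = toy[toy["rating_count"] > 10]
print(kept)
```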

I will also clean the text up a bit and remove the stop words, so I can have a more usable text column for EDA.

In [22]:
def cleanText(text):
    text = re.sub(r'https?:\/\/\S+', '', text) # remove links
    text = re.sub(r'@[A-Za-z0-9]+', '', text) # remove @mentions (letters and digits)
    text = re.sub(r'#', '', text) # remove the '#' symbol but keep the hashtag word
    return text

In [23]:
data['text'] = data['text'].apply(cleanText)
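
To sanity-check the patterns, here is a small standalone sketch run on a made-up string. The function name `clean_text_demo` and the final whitespace collapse are my own additions for illustration; the mention pattern uses the character class `0-9`, since a class written as `0-0` would match only the digit 0.

```python
import re

def clean_text_demo(text):
    """Strip links, @mentions and '#' symbols, then collapse leftover whitespace."""
    text = re.sub(r'https?://\S+', '', text)   # remove links
    text = re.sub(r'@[A-Za-z0-9]+', '', text)  # remove @mentions
    text = re.sub(r'#', '', text)              # drop '#' but keep the tag word
    return ' '.join(text.split())              # extra step: collapse repeated spaces

sample = "Loved it! https://example.com @user123 #yum"
cleaned = clean_text_demo(sample)
print(cleaned)

# With 0-0 the match stops at the first digit that is not 0
partial = re.sub(r'@[A-Za-z0-0]+', '', '@user123')
print(partial)
```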

In [24]:
data['stop_text'] = data['text'].apply(lambda words: ' '.join(word.lower() for word in words.split() if word not in stop))
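
One thing to keep in mind with this step: the membership test runs on the original token and the lowercasing happens afterwards, so capitalized stop words such as "My" are kept. The sketch below shows the difference on a tiny hypothetical stop list (a stand-in for `stopwords.words('english')`); whether to lowercase first is a judgment call for the analysis.

```python
# Hypothetical mini stop list standing in for the NLTK English list
stop_words = {"my", "i", "the", "is"}

def remove_stop_words(text, stop_words):
    """Lowercase the text first, then drop tokens that are stop words."""
    return " ".join(w for w in text.lower().split() if w not in stop_words)

sample = "My favorite is the caramel"

# Membership checked before lowercasing: "My" is not in the list, so it survives
case_sensitive = " ".join(w.lower() for w in sample.split() if w not in stop_words)

# Lowercasing first removes it
case_insensitive = remove_stop_words(sample, stop_words)

print(case_sensitive)
print(case_insensitive)
```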

I'll also use a tokenizer and get the lemma from each word. This should make the modelling and sentiment analysis a bit easier.

In [25]:
tokenizer = WhitespaceTokenizer()
lemmatizer = WordNetLemmatizer()

In [26]:
def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in tokenizer.tokenize(text)]

In [27]:
data['stop_text'] = data['stop_text'].apply(lemmatize_text)

In [28]:
data['stop_text'] = [' '.join(map(str, l)) for l in data['stop_text']]

In [29]:
data.head(3)

Unnamed: 0,brand,key,author,stars,text,name,description,rating,rating_count,stop_text
0,bj,0_bj,Ilovebennjerry,3,"Not enough brownies!-Super good, don't get me ...",Salted Caramel Core,Find your way to the ultimate ice cream experi...,3.7,208,"not enough brownies!-super good, get wrong. bu..."
1,bj,0_bj,Sweettooth909,5,I’m OBSESSED with this pint!-I decided to try ...,Salted Caramel Core,Find your way to the ultimate ice cream experi...,3.7,208,i’m obsessed pint!-i decided try although i’m ...
2,bj,0_bj,LaTanga71,3,My favorite...More Caramel Please-My caramel c...,Salted Caramel Core,Find your way to the ultimate ice cream experi...,3.7,208,my favorite...more caramel please-my caramel c...


One last step. I am going to do a bit of feature engineering here. 

Reviews are rated on a scale of 1 to 5 stars, which gives five potential classes per record and makes this a multiclass classification problem. I am not sure yet whether I want to treat it as a multiclass or a binary problem, so I am going to create a new column, `good_review`, that takes the value `Good` if `stars` is 4 or 5, and `Bad` otherwise.

In [30]:
data['good_review'] = np.where((data.stars == 4) | (data.stars==5), "Good", "Bad")
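
Because `stars` is an integer from 1 to 5, the same label can be written as a single threshold, and `value_counts` gives a quick look at the class balance. A sketch on toy data:

```python
import numpy as np
import pandas as pd

# One toy review per star rating
toy = pd.DataFrame({"stars": [1, 2, 3, 4, 5]})

# 4 or 5 stars -> "Good", everything else -> "Bad"
toy["good_review"] = np.where(toy["stars"] >= 4, "Good", "Bad")

# Count how many rows fall into each class
counts = toy["good_review"].value_counts()
print(counts)
```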

In [31]:
data.head(4)

Unnamed: 0,brand,key,author,stars,text,name,description,rating,rating_count,stop_text,good_review
0,bj,0_bj,Ilovebennjerry,3,"Not enough brownies!-Super good, don't get me ...",Salted Caramel Core,Find your way to the ultimate ice cream experi...,3.7,208,"not enough brownies!-super good, get wrong. bu...",Bad
1,bj,0_bj,Sweettooth909,5,I’m OBSESSED with this pint!-I decided to try ...,Salted Caramel Core,Find your way to the ultimate ice cream experi...,3.7,208,i’m obsessed pint!-i decided try although i’m ...,Good
2,bj,0_bj,LaTanga71,3,My favorite...More Caramel Please-My caramel c...,Salted Caramel Core,Find your way to the ultimate ice cream experi...,3.7,208,my favorite...more caramel please-my caramel c...,Bad
3,bj,0_bj,chicago220,5,Obsessed!!!-Why are people complaining about t...,Salted Caramel Core,Find your way to the ultimate ice cream experi...,3.7,208,obsessed!!!-why people complaining blonde brow...,Good


Great! This looks like a pretty solid data set to kick things off with. I'll save this as a csv and move on to the EDA.

# Save Data

In [32]:
save_file(data, 'ice_cream_data.csv', '../data')

A file already exists with this name.

Do you want to overwrite? (Y/N)Y
Writing file.  "../data/ice_cream_data.csv"
