# 1.0 Data Wrangling <a name="1.1_datawrangling"></a>


## Contents <a name="contents"></a>

* [1 Data Wrangling](#1.1_datawrangling)

    * [Contents](#contents)


* [1.2 Introduction](#1.2_introduction)

    * [1.2.1 Problem Recap](#1.2.1_recap)

    * [1.2.2 Data Wrangling Introduction](#1.2.2_wrangling)
    

* [1.3 Imports](#1.3_imports)


* [1.4 Load Data](#1.4_loaddata)


* [1.5 Fashion Dataset](#1.5_exploredata)

    * [1.5.1 Fashion data features](#1.5.1_feat)
    
    * [1.5.2 Fashion missing data](#1.5.2_f_missing)


* [1.6 Electronics dataset](#1.6_electronics)
    
    * [1.6.1 Electronics data features](#1.6.1_feat)

    * [1.6.2 Electronics missing data](#1.6.2_e_missing)
        

* [1.7 Home/kitchen dataset](#1.7_h/k)
    
    * [1.7.1 Home/kitchen data features](#1.7.1_h/k_exp)
        
    * [1.7.2 Home/kitchen missing data](#1.7.2_h/k_missing)
            
        
* [1.8 Save Data](#1.8_savedata)
    
    * [1.8.1 Create CSVs](#1.8.1)
    
    * [1.8.2 Creating a parsing function](#1.8.2)
    
    * [1.8.3 Fashion CSV](#1.8.3)
        
    * [1.8.4 Electronics CSV](#1.8.4)
        
    * [1.8.5 Home/Kitchen CSV](#1.8.5)


* [1.9 Summary](#1.9_summary)

## 1.1 Data Wrangling 

## 1.2 Introduction<a name="1.2_introduction"></a>
In this notebook, we focus primarily on loading the data, examining missing and duplicated values, and dropping unnecessary features. We may do a small amount of data cleaning, but we do not want to prematurely drop features that we may later need.

### 1.2.1 Problem Recap <a name="1.2.1_recap"></a>

Using customer text data about Amazon products, we will build, evaluate, and compare models that estimate the probability that a given text review is “positive”, “neutral”, or “negative”.

Our goal is to build a text classifier from Amazon product review data that can be used to analyze customer sentiment in text that has no accompanying numeric data.

### 1.2.2 Data Wrangling Introduction <a name="1.2.2_wrangling"></a>
We will read in the data, remove unnecessary columns/features, and then save it in an easier-to-work-with tabular format (CSV). 

## 1.3 Imports <a name="1.3_imports"></a>

In [1]:
import json
import csv
import pandas as pd

## 1.4 Load Data<a name="1.4_loaddata"></a>
Datasets are sourced from:

[Jianmo Ni, Jiacheng Li, Julian McAuley
Empirical Methods in Natural Language Processing (EMNLP), 2019](https://nijianmo.github.io/amazon/index.html)

The data is split up by product category. 

Our categories will be:

#### Fashion (883,636 reviews)

#### Electronics (20,994,353 reviews)

#### Home and Kitchen (21,928,568 reviews)

The data is available as compressed JSON files. We will use pandas to load the fashion data and determine which features to keep or drop and which features need preliminary cleaning.

Because of how large the Electronics and Home and Kitchen datasets are (approx. 10 GB each), we'll use the fashion dataset to explore the features and build a set of functions to parse each of the three category files. We will stream the JSON data, drop the values/features we do not want to keep, and save to separate CSV files.
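
As a rough illustration of the streaming idea, the sketch below uses only the `json` and `csv` modules imported above. It reads one review per line and writes only the wanted fields, so memory use stays flat regardless of file size. The function name `stream_reviews` and the two sample records are hypothetical, and the sketch assumes the raw files hold one JSON object per line.

```python
import csv
import io
import json


def stream_reviews(json_lines, csv_file, columns):
    """Copy selected fields from JSON-lines input to CSV, one record at a time."""
    # extrasaction="ignore" drops keys that are not in `columns`;
    # keys missing from a record are written as empty strings (the restval default).
    writer = csv.DictWriter(csv_file, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    n_rows = 0
    for line in json_lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        writer.writerow(json.loads(line))
        n_rows += 1
    return n_rows


# Hypothetical two-record sample standing in for a raw file.
raw = io.StringIO(
    '{"overall": 5, "reviewText": "Great", "vote": "2"}\n'
    '{"overall": 1}\n'
)
out = io.StringIO()
n_written = stream_reviews(raw, out, ["overall", "reviewText"])
print(out.getvalue())
```

The rest of this notebook takes the pandas route instead (`pd.read_json` with `chunksize`), which trades a little overhead for convenient per-chunk DataFrame operations.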

In [2]:
!ls -hl data/raw

total 21G
-rw-rw-r-- 1 paul paul 341M Mar 27 15:56 AMAZON_FASHION.json
-rw-rw-r-- 1 paul paul  11G Mar 27 16:20 Electronics.json
-rw-rw-r-- 1 paul paul 9.6G Mar 28 11:44 Home_and_Kitchen.json


## 1.5 Explore the Fashion data <a name="1.5_exploredata"></a>

Let's take a look at the top of the **Fashion** dataset.

In [3]:
fashion_head = pd.read_json("data/raw/AMAZON_FASHION.json", lines=True, nrows=200)

In [4]:
fashion_head.head(3)

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote,style,image
0,5,True,"10 20, 2014",A1D4G1SNUZWQOT,7106116521,Tracy,Exactly what I needed.,perfect replacements!!,1413763200,,,
1,2,True,"09 28, 2014",A3DDWDH9PX2YX2,7106116521,Sonja Lau,"I agree with the other review, the opening is ...","I agree with the other review, the opening is ...",1411862400,3.0,,
2,4,False,"08 25, 2014",A2MWC41EW7XL15,7106116521,Kathleen,Love these... I am going to order another pack...,My New 'Friends' !!,1408924800,,,


In [5]:
fashion_head.dtypes

overall             int64
verified             bool
reviewTime         object
reviewerID         object
asin               object
reviewerName       object
reviewText         object
summary            object
unixReviewTime      int64
vote              float64
style              object
image              object
dtype: object

### 1.5.1 Fashion data features <a name="1.5.1_feat"></a>

The **overall** feature is the number of stars, which we'll be using for our categorical target feature (positive/neutral/negative). It's an integer from 1-5. 

The **verified** feature is a boolean that indicates whether Amazon can confirm the user purchased the item they're reviewing (not a free/discounted reviewed item).

The **reviewTime** is in a date string format, and we'll drop it and keep the **unixReviewTime** instead.

The **reviewerID** and **reviewerName** have similar information and we can keep just one (the ID should be fine).

The **asin** is the specific product ID. We could keep it if we want to get item metadata later and use that for additional exploratory analysis (e.g. price, product tags, product subcategory).

The **summary** is a user-provided (or Amazon-generated) short item summary.

The **vote** column seems to indicate how many users found a review useful.
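
Dropping **reviewTime** in favour of **unixReviewTime** only makes sense if the two carry the same date. A minimal check, using the three preview rows shown above and assuming the string format is month, day, year, could look like this:

```python
import pandas as pd

# Values copied from the first three preview rows above.
sample = pd.DataFrame({
    "reviewTime": ["10 20, 2014", "09 28, 2014", "08 25, 2014"],
    "unixReviewTime": [1413763200, 1411862400, 1408924800],
})

# Parse the date string (assumed format: "MM D, YYYY").
from_string = pd.to_datetime(sample["reviewTime"], format="%m %d, %Y")
# Interpret the integer column as seconds since the Unix epoch.
from_unix = pd.to_datetime(sample["unixReviewTime"], unit="s")

dates_match = bool((from_string == from_unix).all())
print(dates_match)
```

For these rows the two columns agree, so the integer timestamp can be converted back to a date whenever one is needed.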

### Important features:

The **reviewText** will be the main input feature/features we will be using for our sentiment prediction and modelling.

Our target feature will be based on the **overall** column. We will be using the star value for each review to convert the data to Negative/Neutral/Positive categories. This will be the feature that our model trains and predicts with.
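
The conversion itself happens in a later part of the project, so the exact cut-offs are not fixed here. As a sketch only, assuming 1-2 stars are negative, 3 stars neutral, and 4-5 stars positive (the mapping and the name `stars_to_sentiment` are hypothetical), it might look like:

```python
import pandas as pd

# Assumed cut-offs, not taken from this notebook:
# 1-2 stars -> negative, 3 stars -> neutral, 4-5 stars -> positive.
STAR_TO_SENTIMENT = {
    1: "negative",
    2: "negative",
    3: "neutral",
    4: "positive",
    5: "positive",
}


def stars_to_sentiment(stars: pd.Series) -> pd.Series:
    """Map integer star ratings to sentiment labels."""
    return stars.map(STAR_TO_SENTIMENT)


# Star values from the three preview rows shown earlier.
overall = pd.Series([5, 2, 4])
labels = stars_to_sentiment(overall).tolist()
print(labels)
```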

Let's check the **style** column.

In [6]:
fashion_head[~(fashion_head["style"].isna())]

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote,style,image
7,3,True,"09 22, 2013",A1BB77SEBQT8VX,B00007GDFV,Darrow H Ankrum II,mother - in - law wanted it as a present for h...,bought as a present,1379808000,,{'Color:': ' Black'},
8,3,True,"07 17, 2013",AHWOW7D1ABO9C,B00007GDFV,rosieO,"Item is of good quality. Looks great, too. But...",Buxton heiress collection,1374019200,,{'Color:': ' Black'},
9,3,True,"04 13, 2013",AKS3GULZE0HFC,B00007GDFV,M. Waltman,I had used my last el-cheapo fake leather ciga...,Top Clasp Broke Within 3 days!,1365811200,,{'Color:': ' Black'},
10,4,True,"03 9, 2013",A38NS6NF6WPXS,B00007GDFV,BTDoxies,This brand has been around a long time and you...,BUXTON QUALITY!,1362787200,,{'Color:': ' Black'},
11,2,True,"01 27, 2013",A1KOKO3HTSAI1H,B00007GDFV,Robin Howard,I smoke 100's and these are NOT made for them....,Buxton Heiress Collection Black Leather Cigare...,1359244800,,{'Color:': ' Black'},
...,...,...,...,...,...,...,...,...,...,...,...,...
175,5,True,"08 28, 2016",A3U8IMV1FKI5Y3,B00008JPRZ,SPB,Love shirt and even makes me look good!,Love the shirt!!!,1472342400,,"{'Size:': ' 16.5 - 34', 'Color:': ' White'}",
176,5,True,"05 25, 2016",A21CTBULJC61N1,B00008JPRZ,Harrycarnival,great shirt,Five Stars,1464134400,,"{'Size:': ' 17.5 - 37', 'Color:': ' White'}",
193,5,False,"03 2, 2006",A3JGLFGCB7KZQX,B0000AWXMM,DG,"It was exactly as described. Beautiful ring, ...",BEAUTIFUL RING!!!! NO REGRETS WHATSOEVER,1141257600,,"{'Size:': ' 12', 'Metal Type:': ' yellow-gold'}",
194,4,False,"08 22, 2005",A179HMPWBPTJ2I,B0000AWXMM,Leana J. Kemp,This is a nice ring. It is simple and elegant...,White Gold Ring,1124668800,6.0,"{'Size:': ' 10', 'Metal Type:': ' white-gold'}",


The **style** column seems to have sizing and color info but is missing for most reviews.

### 1.5.2 Fashion missing data <a name="1.5.2_f_missing"></a>

In [7]:
fashion = pd.read_json("data/raw/AMAZON_FASHION.json", lines=True, chunksize=100000)

Because we pass `chunksize`, pandas returns a `JsonReader` iterator, so we can iterate through the data in chunks rather than loading the entire dataset into memory.


In [8]:
missing_stats=[]
for chunk in fashion:
    missing_stats.append(chunk.isna().sum())

In [9]:
missing_vals = pd.DataFrame(missing_stats)

In [10]:
missing_vals

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote,style,image
0,0,0,0,0,0,6,109,39,0,90595,28037,98423
1,0,0,0,0,0,18,108,61,0,89490,49765,97805
2,0,0,0,0,0,13,173,68,0,89152,44147,95963
3,0,0,0,0,0,14,172,77,0,86690,41255,94941
4,0,0,0,0,0,6,93,26,0,92429,78710,98412
5,0,0,0,0,0,5,123,49,0,93244,88410,97702
6,0,0,0,0,0,7,174,78,0,92668,87838,96291
7,0,0,0,0,0,11,152,76,0,92517,87214,95966
8,0,0,0,0,0,12,129,59,0,76951,73691,79326


In [11]:
missing_vals.sum()

overall                0
verified               0
reviewTime             0
reviewerID             0
asin                   0
reviewerName          92
reviewText          1233
summary              533
unixReviewTime         0
vote              803736
style             579067
image             854829
dtype: int64

We have no missing values in our **overall** column, which is excellent. The **vote**, **style** and **image** columns have a lot of missing values, which is fine as we'll be dropping those columns. There are not many missing **reviewText** values.
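
Raw counts are hard to compare across datasets of very different sizes, so it can help to report missing fractions instead. The sketch below is one possible variation on the loop above, using a hypothetical helper `missing_fractions` and a made-up four-record sample. It counts non-null values rather than nulls, on the assumption that a column might be absent from an entire chunk, in which case `isna()` would not see it at all.

```python
import io

import pandas as pd


def missing_fractions(reader):
    """Return the fraction of missing values per column across all chunks."""
    total_rows = 0
    present_counts = []
    for chunk in reader:
        total_rows += len(chunk)
        # Count non-null values; a column absent from a chunk contributes nothing.
        present_counts.append(chunk.notna().sum())
    present = pd.DataFrame(present_counts).sum()
    return 1 - present / total_rows


# Hypothetical sample: the second chunk has no "reviewText" key at all.
sample = io.StringIO(
    '{"overall": 5, "reviewText": "Great"}\n'
    '{"overall": 1, "reviewText": null}\n'
    '{"overall": 3}\n'
    '{"overall": 4}\n'
)
reader = pd.read_json(sample, lines=True, chunksize=2)
fractions = missing_fractions(reader)
print(fractions)
```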

Our final column/feature list to keep will be:

* **overall**
* **reviewerID**
* **verified**
* **asin**
* **reviewText**
* **summary**
* **unixReviewTime**

In [12]:
cols_to_keep = ["overall",  "verified", "reviewerID", "asin", "reviewText", "summary", "unixReviewTime"]

Let's do a quick preview to make sure our column list is correct.

In [13]:
fashion_head[cols_to_keep].head()

Unnamed: 0,overall,verified,reviewerID,asin,reviewText,summary,unixReviewTime
0,5,True,A1D4G1SNUZWQOT,7106116521,Exactly what I needed.,perfect replacements!!,1413763200
1,2,True,A3DDWDH9PX2YX2,7106116521,"I agree with the other review, the opening is ...","I agree with the other review, the opening is ...",1411862400
2,4,False,A2MWC41EW7XL15,7106116521,Love these... I am going to order another pack...,My New 'Friends' !!,1408924800
3,2,True,A2UH2QQ275NV45,7106116521,too tiny an opening,Two Stars,1408838400
4,3,False,A89F3LQADZBS5,7106116521,Okay,Three Stars,1406419200


## 1.6 Electronics dataset <a name="1.6_electronics"></a>


In [14]:
elec_head = pd.read_json("data/raw/Electronics.json", lines=True, nrows=1000)

In [15]:
elec_head.head()

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,vote,image
0,5,True,"07 17, 2002",A1N070NS9CJQ2I,60009810,{'Format:': ' Hardcover'},Teri Adams,This was the first time I read Garcia-Aguilera...,Hit The Spot!,1026864000,,
1,5,False,"07 6, 2002",A3P0KRKOBQK1KN,60009810,{'Format:': ' Hardcover'},Willa C.,"As with all of Ms. Garcia-Aguilera's books, I ...",one hot summer is HOT HOT HOT!,1025913600,,
2,5,False,"07 3, 2002",A192HO2ICJ75VU,60009810,{'Format:': ' Hardcover'},Kit,I've not read any of Ms Aguilera's works befor...,One Hot Summer,1025654400,2.0,
3,4,False,"06 30, 2002",A2T278FKFL3BLT,60009810,{'Format:': ' Hardcover'},Andres,This romance novel is right up there with the ...,I love this book!,1025395200,3.0,
4,5,False,"06 28, 2002",A2ZUXVTW8RXBXW,60009810,{'Format:': ' Hardcover'},John,Carolina Garcia Aguilera has done it again. S...,One Hot Book,1025222400,,


### 1.6.1 Electronics data features <a name="1.6.1_feat"></a>



In [16]:
elec_head.dtypes

overall             int64
verified             bool
reviewTime         object
reviewerID         object
asin                int64
style              object
reviewerName       object
reviewText         object
summary            object
unixReviewTime      int64
vote              float64
image              object
dtype: object

It looks like our column names and data are pretty similar. Let's check and make sure the columns we wish to keep have the same names.

In [17]:
elec_head[cols_to_keep]

Unnamed: 0,overall,verified,reviewerID,asin,reviewText,summary,unixReviewTime
0,5,True,A1N070NS9CJQ2I,60009810,This was the first time I read Garcia-Aguilera...,Hit The Spot!,1026864000
1,5,False,A3P0KRKOBQK1KN,60009810,"As with all of Ms. Garcia-Aguilera's books, I ...",one hot summer is HOT HOT HOT!,1025913600
2,5,False,A192HO2ICJ75VU,60009810,I've not read any of Ms Aguilera's works befor...,One Hot Summer,1025654400
3,4,False,A2T278FKFL3BLT,60009810,This romance novel is right up there with the ...,I love this book!,1025395200
4,5,False,A2ZUXVTW8RXBXW,60009810,Carolina Garcia Aguilera has done it again. S...,One Hot Book,1025222400
...,...,...,...,...,...,...,...
995,5,True,A107S4MT25VXQ5,594481902,Perfect...Nook brand so exactly what was neede...,Juat what I needed...,1399507200
996,4,True,A1T8TP4OHT5T13,594481902,This cord helps me have two locations to plug ...,Extra Cord,1399420800
997,5,True,A3HICVLF4PFFMN,594481902,"bought for a spare for my 9"" Nook HD and it fi...",great fit,1399248000
998,5,True,AY0BL2JD0GAFT,594481902,Lost my charger. Found this one for less than ...,Lost the.charger.for.my Nook!,1399248000
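
One difference worth noting in the dtypes above is that **asin** was inferred as `int64` for this sample, while it is an `object` (string) column in the Fashion sample. If the IDs are meant to be fixed-width strings, integer inference would drop any leading zeros. A possible workaround, sketched here on made-up records rather than the real file, is to force the column to `str` through the `dtype` argument of `pd.read_json`:

```python
import io

import pandas as pd

# Hypothetical records; the IDs are invented to show the leading-zero case.
raw = (
    '{"asin": "0060009810", "overall": 5}\n'
    '{"asin": "0594481902", "overall": 4}\n'
)

# Default behaviour: pandas infers dtypes for each column.
inferred = pd.read_json(io.StringIO(raw), lines=True)

# Forcing "asin" to str keeps the identifiers exactly as written.
as_text = pd.read_json(io.StringIO(raw), lines=True, dtype={"asin": str})

print(inferred["asin"].dtype)
print(as_text["asin"].tolist())
```

The same `dtype` argument can be combined with `chunksize`, so the export loops later in this notebook could use it as well if consistent string IDs across the three CSVs matter.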


### 1.6.2 Electronics missing data <a name="1.6.2_e_missing"></a>

In [18]:
elect = pd.read_json("data/raw/Electronics.json", lines=True, chunksize=100000)

Using a `JsonReader` iterator from pandas, we'll iterate through the dataset to get missing value summaries, like we did with the fashion dataset.

In [19]:
missing_stats=[]
for chunk in elect:
    missing_stats.append(chunk.isna().sum())

In [20]:
missing_vals = pd.DataFrame(missing_stats)

In [21]:
missing_vals.sum()

overall                  0
verified                 0
reviewTime               0
reviewerID               0
asin                     0
style             10497616
reviewerName          1713
reviewText            9684
summary               4754
unixReviewTime           0
vote              18300976
image             20645630
dtype: int64

As with the fashion dataset, the **style**, **vote**, and **image** columns have many missing values, but we'll be dropping those columns. The two important columns, **overall** and **reviewText**, have 0 and 9,684 missing values respectively out of about 21 million reviews.

## 1.7 Home/kitchen dataset <a name="1.7_h/k"></a>            

In [22]:
hk_head = pd.read_json("data/raw/Home_and_Kitchen.json", lines=True, nrows=200)

In [23]:
hk_head.head()

Unnamed: 0,overall,vote,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,style,image
0,5,2.0,True,"08 31, 2010",A3NSN9WOX8470M,6564224,mmm,"I don't use these for their original use, and ...",Many uses...,1283212800,,
1,5,2.0,True,"04 2, 2010",A2AMX0AJ2BUDNV,6564224,John R. Welch,"Seems a bit expensive for a plastic bottle, bu...",Dispenser bottle,1270166400,,
2,5,,True,"11 5, 2015",A8LUWTIPU9CZB,560467893,Linda Fahner,"Great product, love it!!",Five Stars,1446681600,,
3,4,4.0,True,"10 29, 2015",AABKIIHAL0L66,560467893,TheBlueChain,This is a sturdy floating corner shelf! We mo...,"Sturdy Shelf, Poor Installation Instructions",1446076800,,
4,3,,True,"09 9, 2015",A3DA0KIQ5OBK5C,560467893,angelaarden,I purchased 4 of these shelves. they look grea...,Look great - one bad one...,1441756800,,


In [24]:
hk_head.dtypes

overall             int64
vote              float64
verified             bool
reviewTime         object
reviewerID         object
asin               object
reviewerName       object
reviewText         object
summary            object
unixReviewTime      int64
style              object
image              object
dtype: object

### 1.7.1 Home/kitchen data features <a name="1.7.1_h/k_exp"></a>

In [25]:
hk_head

Unnamed: 0,overall,vote,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,style,image
0,5,2.0,True,"08 31, 2010",A3NSN9WOX8470M,0006564224,mmm,"I don't use these for their original use, and ...",Many uses...,1283212800,,
1,5,2.0,True,"04 2, 2010",A2AMX0AJ2BUDNV,0006564224,John R. Welch,"Seems a bit expensive for a plastic bottle, bu...",Dispenser bottle,1270166400,,
2,5,,True,"11 5, 2015",A8LUWTIPU9CZB,0560467893,Linda Fahner,"Great product, love it!!",Five Stars,1446681600,,
3,4,4.0,True,"10 29, 2015",AABKIIHAL0L66,0560467893,TheBlueChain,This is a sturdy floating corner shelf! We mo...,"Sturdy Shelf, Poor Installation Instructions",1446076800,,
4,3,,True,"09 9, 2015",A3DA0KIQ5OBK5C,0560467893,angelaarden,I purchased 4 of these shelves. they look grea...,Look great - one bad one...,1441756800,,
...,...,...,...,...,...,...,...,...,...,...,...,...
195,5,,True,"09 1, 2015",A1PNQ4F0UC9V4,1605161810,party,I like this and thank you.,Five Stars,1441065600,,
196,5,,True,"08 14, 2015",AC1W8G7HU9VH8,1605161810,Jeanette Dodds,My mom loves it! Delivery was exactly on time....,Five Stars! Beautiful!,1439510400,,[https://images-na.ssl-images-amazon.com/image...
197,5,,True,"07 18, 2015",AFY118HH4OG7F,1605161810,GothicPoohBear,good quality!,good quality,1437177600,,
198,5,,True,"06 19, 2015",AG6BH7V520TJ5,1605161810,Deb,Beautiful!,Love it!,1434672000,,


These look like the same column names. Let's make sure.

In [26]:
hk_head[cols_to_keep]

Unnamed: 0,overall,verified,reviewerID,asin,reviewText,summary,unixReviewTime
0,5,True,A3NSN9WOX8470M,0006564224,"I don't use these for their original use, and ...",Many uses...,1283212800
1,5,True,A2AMX0AJ2BUDNV,0006564224,"Seems a bit expensive for a plastic bottle, bu...",Dispenser bottle,1270166400
2,5,True,A8LUWTIPU9CZB,0560467893,"Great product, love it!!",Five Stars,1446681600
3,4,True,AABKIIHAL0L66,0560467893,This is a sturdy floating corner shelf! We mo...,"Sturdy Shelf, Poor Installation Instructions",1446076800
4,3,True,A3DA0KIQ5OBK5C,0560467893,I purchased 4 of these shelves. they look grea...,Look great - one bad one...,1441756800
...,...,...,...,...,...,...,...
195,5,True,A1PNQ4F0UC9V4,1605161810,I like this and thank you.,Five Stars,1441065600
196,5,True,AC1W8G7HU9VH8,1605161810,My mom loves it! Delivery was exactly on time....,Five Stars! Beautiful!,1439510400
197,5,True,AFY118HH4OG7F,1605161810,good quality!,good quality,1437177600
198,5,True,AG6BH7V520TJ5,1605161810,Beautiful!,Love it!,1434672000


### 1.7.2 Home/kitchen missing data <a name="1.7.2_h/k_missing"></a>

In [27]:
hk = pd.read_json("data/raw/Home_and_Kitchen.json", lines=True, chunksize=100000)

Using a `JsonReader` iterator from pandas, we'll iterate through the dataset to get missing value summaries, like we did with the fashion dataset.

In [28]:
missing_stats=[]
for chunk in hk:
    missing_stats.append(chunk.isna().sum())

In [29]:
missing_vals = pd.DataFrame(missing_stats)

In [30]:
missing_vals.sum()

overall                  0
vote              19117548
verified                 0
reviewTime               0
reviewerID               0
asin                     0
reviewerName          1160
reviewText           14866
summary               5781
unixReviewTime           0
style              9785019
image             21296935
dtype: int64

Out of about 22 million reviews, we have no missing **overall** ratings and only 14,866 missing **reviewText**.

## 1.8 Save Data <a name="1.8_savedata"></a>

### 1.8.1 Create CSVs <a name="1.8.1"></a>    

In [31]:
for category in ["fashion", "electronics", "kitchen"]:
    filename=f'data/edited/{category}.csv'
    with open(filename, 'w') as f:
        writer = csv.writer(f)
        writer.writerow(cols_to_keep)
    print(pd.read_csv(filename))

Empty DataFrame
Columns: [overall, verified, reviewerID, asin, reviewText, summary, unixReviewTime]
Index: []
Empty DataFrame
Columns: [overall, verified, reviewerID, asin, reviewText, summary, unixReviewTime]
Index: []
Empty DataFrame
Columns: [overall, verified, reviewerID, asin, reviewText, summary, unixReviewTime]
Index: []


In [32]:
ls data/edited

electronics.csv  fashion.csv  kitchen.csv


### 1.8.2 Creating a parsing function <a name="1.8.2"></a>

In [33]:
def df_cleaner(temp_df_chunk: pd.DataFrame,
               columns_to_keep=cols_to_keep) -> pd.DataFrame:
    
    """This function will take in a chunk of our dataset as a dataframe.
       It will return only the columns we wish to keep.
       
       Takes:
           temp_df_chunk: a dataframe to be edited
           columns_to_keep: the list of columns we wish to keep from that df          
       Returns:
           a smaller df with only a subset of the original columns (no rows are dropped)
       """
  
    return temp_df_chunk[columns_to_keep]

Let's do a quick test.

In [34]:
df_cleaner(hk_head)

Unnamed: 0,overall,verified,reviewerID,asin,reviewText,summary,unixReviewTime
0,5,True,A3NSN9WOX8470M,0006564224,"I don't use these for their original use, and ...",Many uses...,1283212800
1,5,True,A2AMX0AJ2BUDNV,0006564224,"Seems a bit expensive for a plastic bottle, bu...",Dispenser bottle,1270166400
2,5,True,A8LUWTIPU9CZB,0560467893,"Great product, love it!!",Five Stars,1446681600
3,4,True,AABKIIHAL0L66,0560467893,This is a sturdy floating corner shelf! We mo...,"Sturdy Shelf, Poor Installation Instructions",1446076800
4,3,True,A3DA0KIQ5OBK5C,0560467893,I purchased 4 of these shelves. they look grea...,Look great - one bad one...,1441756800
...,...,...,...,...,...,...,...
195,5,True,A1PNQ4F0UC9V4,1605161810,I like this and thank you.,Five Stars,1441065600
196,5,True,AC1W8G7HU9VH8,1605161810,My mom loves it! Delivery was exactly on time....,Five Stars! Beautiful!,1439510400
197,5,True,AFY118HH4OG7F,1605161810,good quality!,good quality,1437177600
198,5,True,AG6BH7V520TJ5,1605161810,Beautiful!,Love it!,1434672000


### 1.8.3 Fashion CSV <a name="1.8.3"></a>

In [35]:
fashion = pd.read_json("data/raw/AMAZON_FASHION.json", lines=True, chunksize=100000)

In [36]:
for chunk in fashion:
    df_cleaner(chunk).to_csv(path_or_buf="data/edited/fashion.csv", header=False, mode="a")

In [37]:
pd.read_csv("data/edited/fashion.csv")

Unnamed: 0,overall,verified,reviewerID,asin,reviewText,summary,unixReviewTime
0,5,True,A1D4G1SNUZWQOT,7106116521,Exactly what I needed.,perfect replacements!!,1413763200
1,2,True,A3DDWDH9PX2YX2,7106116521,"I agree with the other review, the opening is ...","I agree with the other review, the opening is ...",1411862400
2,4,False,A2MWC41EW7XL15,7106116521,Love these... I am going to order another pack...,My New 'Friends' !!,1408924800
3,2,True,A2UH2QQ275NV45,7106116521,too tiny an opening,Two Stars,1408838400
4,3,False,A89F3LQADZBS5,7106116521,Okay,Three Stars,1406419200
...,...,...,...,...,...,...,...
883631,5,True,A1ZSB2Q144UTEY,B01HJHTH5U,I absolutely love this dress!! It's sexy and ...,I absolutely love this dress,1487635200
883632,5,True,A2CCDV0J5VB6F2,B01HJHTH5U,I'm 5'6 175lbs. I'm on the tall side. I wear a...,I wear a large and ordered a large and it stil...,1480032000
883633,3,True,A3O90PACS7B61K,B01HJHTH5U,Too big in the chest area!,Three Stars,1478736000
883634,3,True,A2HO94I89U3LNH,B01HJHF97K,"Too clear in the back, needs lining",Three Stars,1478736000


It looks like our function and loop worked. The two larger files will take a lot longer.

### 1.8.4 Electronics CSV <a name="1.8.4"></a> 
    

In [38]:
elect = pd.read_json("data/raw/Electronics.json", lines=True, chunksize=100000)

In [39]:
for chunk in elect:
    df_cleaner(chunk).to_csv(path_or_buf="data/edited/electronics.csv", header=False, mode="a", escapechar="\\")

### 1.8.5 Home/Kitchen CSV <a name="1.8.5"></a>

In [40]:
hk = pd.read_json("data/raw/Home_and_Kitchen.json", lines=True, chunksize=100000)

In [41]:
for chunk in hk:
    df_cleaner(chunk).to_csv(path_or_buf="data/edited/kitchen.csv", header=False, mode="a")

In [42]:
!ls -ahl data/edited

total 14G
drwxrwxr-x 2 paul paul 4.0K Apr  5 10:19 .
drwxrwxr-x 4 paul paul 4.0K Apr  5 07:55 ..
-rw-rw-r-- 1 paul paul 7.2G Apr  5 10:23 electronics.csv
-rw-rw-r-- 1 paul paul 188M Apr  5 10:19 fashion.csv
-rw-rw-r-- 1 paul paul 5.8G Apr  5 10:27 kitchen.csv
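
The three export loops above share one pattern, so they could be folded into a single helper. The sketch below is one possible shape for it, not the code that produced the files listed above: `json_to_csv` is a hypothetical name, and it writes the header on the first chunk and passes `index=False`. By default `to_csv` also writes the DataFrame index as an extra leading column, which `pd.read_csv` then picks up as the index because each row has one more field than the header.

```python
import io
import os
import tempfile

import pandas as pd


def json_to_csv(source, csv_path, columns, chunksize=100_000):
    """Stream a JSON-lines source into one CSV, keeping only `columns`."""
    reader = pd.read_json(source, lines=True, chunksize=chunksize)
    n_rows = 0
    for i, chunk in enumerate(reader):
        first = i == 0
        # reindex fixes the column order and fills absent columns with NaN.
        chunk.reindex(columns=columns).to_csv(
            csv_path,
            mode="w" if first else "a",  # overwrite on the first chunk, then append
            header=first,                # write the header exactly once
            index=False,                 # do not store the DataFrame index
        )
        n_rows += len(chunk)
    return n_rows


# Hypothetical three-record sample written to a temporary file.
sample = io.StringIO(
    '{"overall": 5, "reviewText": "Great", "vote": "2"}\n'
    '{"overall": 1, "reviewText": "Broke, fast"}\n'
    '{"overall": 3, "reviewText": "Okay"}\n'
)
csv_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
n_written = json_to_csv(sample, csv_path, ["overall", "reviewText"], chunksize=2)
result = pd.read_csv(csv_path)
print(result)
```

With a helper like this, the separate header-creation step in 1.8.1 would no longer be needed, and re-running a cell would overwrite the file rather than append duplicate rows.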


## 1.9 Summary <a name="1.9_summary"></a>

We have cleaned the three datasets. We have not dropped any rows from our data yet, as only a small number of **reviewText** values are missing; we will explore those in the future.

We dropped the **reviewTime** and **reviewerName** columns as they're largely duplicated by our **unixReviewTime** and **reviewerID** variables. We dropped the **vote**, **style**, and **image** columns, as they're largely filled with missing values and unimportant to our analysis.

Feature variable: **reviewText** will be our main data source for analysis, though the review **summary** also contains text, possibly from the user, about their feelings about the item.

Target variable: **overall** stars (1-5) will be converted in the next part of the project into categorical variables, to predict positive, neutral, and negative feelings in users' opinions of items.