# <a id='toc1_'></a>[Data Loading](#toc0_)

Here we take several samples of the Amazon review dataset and load them into a dataframe.

- the key thing to remember is that we need to sample the reviewers and make sure we take all their reviews across all the datasets
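A minimal sketch of what sampling at the reviewer level could look like, assuming the reviews from all categories have already been pooled into one dataframe. The helper name `sample_reviewers`, its parameters, and the toy data are hypothetical and not part of this notebook:

```python
import random

import pandas as pd


def sample_reviewers(reviews, n_reviewers, seed=0):
    """Keep every review written by a random subset of reviewers."""
    # sort the unique IDs so the draw is reproducible for a given seed
    reviewer_ids = sorted(reviews["reviewerID"].unique())
    rng = random.Random(seed)
    chosen = set(rng.sample(reviewer_ids, min(n_reviewers, len(reviewer_ids))))
    # all rows of a chosen reviewer are kept, whatever their category
    return reviews[reviews["reviewerID"].isin(chosen)]


# toy data standing in for reviews pooled across categories
toy = pd.DataFrame({
    "reviewerID": ["A", "A", "B", "C", "C", "C"],
    "asin": ["p1", "p2", "p1", "p3", "p4", "p5"],
    "category": ["books", "software", "books", "books", "software", "software"],
})
sampled = sample_reviewers(toy, n_reviewers=2)
```

Sampling reviewer IDs rather than individual lines means a chosen reviewer's history is never truncated.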

In [None]:
# reset (removing all variables, functions, and other objects from memory)
%reset -f

# get modules in 
import pandas as pd
import gzip
import json
import random
import linecache

# <a id='toc2_'></a>[Datasets](#toc0_)

We have individual datasets for each category. These data have been reduced to extract the $k$-core, such that each of the remaining users and items has at least $k$ reviews.

- Amazon Fashion	
- All Beauty	
- Appliances	
- Arts, Crafts and Sewing	
- Automotive	
- Books	
- CDs and Vinyl	
- Cell Phones and Accessories	
- Clothing, Shoes and Jewelry	
- Digital Music	
- Electronics	
- Gift Cards	
- Grocery and Gourmet Food	
- Home and Kitchen	
- Industrial and Scientific	
- Kindle Store	
- Luxury Beauty	
- Magazine Subscriptions	
- Movies and TV	
- Musical Instruments	
- Office Products	
- Patio, Lawn and Garden	
- Pet Supplies	
- Prime Pantry	
- Software	
- Sports and Outdoors	
- Tools and Home Improvement	
- Toys and Games	
- Video Games	

***

### <a id='toc2_1_1_'></a>[Review Dataset](#toc0_)
Format is one-review-per-line in json. 

- **overall**: ratings of the product
- **reviewerID**: ID of the reviewer, e.g. A2SUAM1J3GNN3B
- **asin**: ID of the product, e.g. 0000013714
- **reviewerName**: name of the reviewer
- **vote**: helpful votes of the review
- **style**: a dictionary of the product metadata, e.g., "Format" is "Hardcover"
- **reviewText**: text of the review
- **summary**: summary of the review
- **unixReviewTime**: time of the review (unix time)
- **reviewTime**: time of the review (raw)
- **image**: images that users post after they have received the product

***
### <a id='toc2_1_2_'></a>[Product Metadata Dataset](#toc0_)
We also have metadata. 

- **asin**: ID of the product, e.g. 0000031852
- **title**: name of the product
- **feature**: bullet-point format features of the product
- **description**: description of the product
- **price**: price in US dollars (at time of crawl)
- **imageURL**: url of the product image
- **imageURLHighRes**: url of the high resolution product image
- **related**: related products (also bought, also viewed, bought together, buy after viewing)
- **salesRank**: sales rank information
- **brand**: brand name
- **categories**: list of categories the product belongs to
- **tech1**: the first technical detail table of the product
- **tech2**: the second technical detail table of the product
- **similar**: similar product table


***
# <a id='toc4_'></a>[The Review Dataset and Metadata Dataset](#toc0_)

We have individual datasets for each category. We combine them to generate one larger dataset encompassing all the categories (5-core dataset).

The following function is created to read in large JSON files:

``` py
def read_file(filename, category):
    num_lines = sum(1 for line in open(filename))
    selected_lines = set()
    while len(selected_lines) < min(500000, num_lines):
        line_num = random.randint(1, num_lines)
        if line_num not in selected_lines:
            selected_lines.add(line_num)
            line = linecache.getline(filename, line_num)
            selected_data = json.loads(line)
            selected_data['category'] = category
            yield selected_data
```

1. It calculates the total number of lines in the file using the `sum(1 for line in open(filename))` expression.
2. It initializes an empty set called `selected_lines`, which will **store the line numbers that have been selected**.
3. It enters a loop that continues until the number of selected lines reaches the minimum value between 500,000 and the total number of lines in the file (`min(500000, num_lines)`).
4. Within each iteration of the loop, it generates a random line number using `random.randint(1, num_lines)`.
5. If the randomly generated line number is not already in the `selected_lines` set, it adds the line number to the set and proceeds to read that specific line from the file using `linecache.getline(filename, line_num)`.
6. The selected line is then parsed as JSON using `json.loads(line)`.
7. Additional data, such as the **category**, is added to the selected data object.
8. The selected data object is yielded, which means it will be returned as an element of an iterator.
9. The loop continues until the desired number of lines is selected.
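The same sampling can be sketched without the retry loop. This is an alternative under stated assumptions, not the function used in this notebook: it swaps in `random.sample` to draw all line indices at once and streams the file instead of using `linecache`. The name `read_file_sampled` and the `max_lines` and `seed` parameters are hypothetical:

```python
import json
import random


def read_file_sampled(filename, category, max_lines=500_000, seed=None):
    """Yield up to max_lines randomly chosen reviews, tagged with a category."""
    # first pass: count the lines, closing the file afterwards
    with open(filename, "r") as f:
        num_lines = sum(1 for _ in f)

    # draw distinct 0-based line indices in a single call
    rng = random.Random(seed)
    wanted = set(rng.sample(range(num_lines), min(max_lines, num_lines)))

    # second pass: stream the file and parse only the chosen lines
    with open(filename, "r") as f:
        for i, line in enumerate(f):
            if i in wanted:
                record = json.loads(line)
                record["category"] = category
                yield record
```

One difference to keep in mind: this version yields the sampled reviews in file order, whereas the loop described above yields them in the order they were drawn.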

A second function, `read_matching_metadata`, handles the metadata files:

```py
def read_matching_metadata(filename, category, product_ids):
    with open(filename, 'r') as file:
        for line in file:
            data = json.loads(line)
            if data['asin'] in product_ids:
                data['category'] = category
                yield data
```

It reads a metadata file and yields the entries that match a given set of product IDs:
- `read_matching_metadata` is a function that takes three parameters: `filename`, `category`, and `product_ids`.
- It opens the specified filename (assumed to hold one JSON object per line) in read mode using a `with` statement, which ensures the file is properly closed after reading.
- It iterates over each line in the file using a for loop.
- For each line, it loads the line as a JSON object using `json.loads(line)`.
- It checks if the value of the '`asin`' key in the loaded JSON data is present in the `product_ids` set.
- If there is a match, it adds the '`category`' key to the data dictionary and assigns it the value of the `category` parameter.
- Finally, it yields the modified data using the `yield` statement, allowing the caller to iterate over the matching metadata entries one by one.
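A possible way to call the generator described above, assuming the reviews for a category are already in a dataframe with `asin` and `category` columns. The wrapper `metadata_for_reviews` and its parameter names are hypothetical; the generator is repeated only so that the block runs on its own:

```python
import json

import pandas as pd


def read_matching_metadata(filename, category, product_ids):
    # same generator as above, repeated so this block is self-contained
    with open(filename, "r") as file:
        for line in file:
            data = json.loads(line)
            if data["asin"] in product_ids:
                data["category"] = category
                yield data


def metadata_for_reviews(reviews, meta_filename, category):
    """Hypothetical helper: fetch metadata only for products seen in `reviews`."""
    # a set keeps each membership check cheap, even for many product IDs
    product_ids = set(reviews.loc[reviews["category"] == category, "asin"])
    matches = list(read_matching_metadata(meta_filename, category, product_ids))
    return pd.DataFrame(matches)
```

Passing `product_ids` as a set rather than a list matters here because the membership check runs once per line of the metadata file.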


In [None]:
# review data
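def read_file(filename, category):
    # count the lines, closing the file handle once done
    with open(filename) as f:
        num_lines = sum(1 for line in f)
    selected_lines = set()
    # sample up to 500,000 distinct line numbers without replacement
    while len(selected_lines) < min(500000, num_lines):
        line_num = random.randint(1, num_lines)  # linecache line numbers start at 1
        if line_num not in selected_lines:
            selected_lines.add(line_num)
            line = linecache.getline(filename, line_num)
            selected_data = json.loads(line)
            # tag each review with its category
            selected_data['category'] = category
            yield selected_data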

## <a id='toc4_1_'></a>[Data with Fewer Reviews](#toc0_)



In [None]:
# initialise data list
data = []

# category files - smaller reviews
beauty = "/Users/pavansingh/Desktop/Amazon Review Data/All_Beauty_5.json"
fashion = "/Users/pavansingh/Desktop/Amazon Review Data/AMAZON_FASHION_5.json"
appliances = "/Users/pavansingh/Desktop/Amazon Review Data/Appliances_5.json"
gift_cards = "/Users/pavansingh/Desktop/Amazon Review Data/Gift_Cards_5.json"
industrial = "/Users/pavansingh/Desktop/Amazon Review Data/Industrial_and_Scientific_5.json"
luxury_beauty = "/Users/pavansingh/Desktop/Amazon Review Data/Luxury_Beauty_5.json"
magazine_subscriptions = "/Users/pavansingh/Desktop/Amazon Review Data/Magazine_Subscriptions_5.json"
software = "/Users/pavansingh/Desktop/Amazon Review Data/Software_5.json"

# load each file and join into dataframe
for category, filename in [('beauty', beauty), ('fashion', fashion), ('appliances', appliances), ('gift_cards', gift_cards), ('industrial', industrial), ('luxury_beauty', luxury_beauty), ('magazine_subscriptions', magazine_subscriptions), ('software', software)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data_with_less_reviews = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", data_with_less_reviews.shape)
display(data_with_less_reviews.head(5))

# save data_with_less_reviews to csv called few_revs_1.csv in folder Data
data_with_less_reviews.to_csv('Data/few_revs_1.csv')

# category value counts
print("Value counts of product reviews per category:\n",data_with_less_reviews['category'].value_counts())

## <a id='toc4_2_'></a>[Data with A Lot of Reviews](#toc0_)

We split this into 10 batches and load them separately, as the files are quite large and take up a lot of memory.
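Since every batch follows the same load, build, save pattern, the repeated cells could be folded into one helper. The sketch below is an assumption about how that might look, not code from this notebook; `load_batch` and its parameters are hypothetical, and `reader` stands for a generator function such as `read_file`:

```python
import pandas as pd


def load_batch(files, reader, out_csv=None):
    """Hypothetical helper: read several category files into one dataframe.

    files  -- dict mapping a category label to a file path
    reader -- generator function called as reader(filename, category)
    """
    rows = []
    for category, filename in files.items():
        rows.extend(reader(filename, category))
    batch = pd.DataFrame(rows)
    if out_csv is not None:
        # index=False avoids an extra 'Unnamed: 0' column when reading back
        batch.to_csv(out_csv, index=False)
    return batch


# usage, with the paths and reader defined elsewhere in the notebook:
# batch1 = load_batch({'arts_crafts': arts_crafts, 'automotive': automotive},
#                     read_file, 'Data/revs_batch1_1.csv')
```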

### <a id='toc4_2_1_1_'></a>[Batch 1](#toc0_)

- arts_crafts_and_sewing
- automotive

In [None]:
# loading the review data!

data = []

arts_crafts = "/Users/pavansingh/Desktop/Amazon Review Data/Arts_Crafts_and_Sewing_5.json"
automotive = "/Users/pavansingh/Desktop/Amazon Review Data/Automotive_5.json"

# load each file and join into dataframe
for category, filename in [('arts_crafts', arts_crafts), ('automotive', automotive)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", data.shape)
display(data.head(5))

# save data in folder Data
data.to_csv('Data/revs_batch1_1.csv')

# category value counts
print("Value counts of product reviews per category:\n",data['category'].value_counts())


### <a id='toc4_2_1_2_'></a>[Batch 2](#toc0_)

- cds_and_vinyl
- cell_phones_and_accessories

In [None]:
# loading the review data!

data = []

cds_and_vinyl = "/Users/pavansingh/Desktop/Amazon Review Data/CDs_and_Vinyl_5.json"
cell_phones = "/Users/pavansingh/Desktop/Amazon Review Data/Cell_Phones_and_Accessories_5.json"

# load each file and join into dataframe
for category, filename in [('cds_and_vinyl', cds_and_vinyl), ('cell_phones', cell_phones)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", data.shape)
display(data.head(5))

# save data in folder Data
data.to_csv('Data/revs_batch2_1.csv')

# category value counts
print("Value counts of product reviews per category:\n",data['category'].value_counts())



### <a id='toc4_2_1_3_'></a>[Batch 3](#toc0_)

- clothing_shoes_and_jewelry
- digital_music

In [None]:
# loading the review data!

data = []

clothing_shoes_and_jewelry = "/Users/pavansingh/Desktop/Amazon Review Data/Clothing_Shoes_and_Jewelry_5.json"
digital_music = "/Users/pavansingh/Desktop/Amazon Review Data/Digital_Music_5.json"

# load each file and join into dataframe
for category, filename in [('clothing_shoes_and_jewelry', clothing_shoes_and_jewelry), ('digital_music', digital_music)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", data.shape)
display(data.head(5))

# save data in folder Data
data.to_csv('Data/revs_batch3_1.csv')

# category value counts
print("Value counts of product reviews per category:\n",data['category'].value_counts())


### <a id='toc4_2_2_'></a>[Batch 4](#toc0_)

- electronics
- musical_instruments

In [None]:
# loading the review data!

data = []

electronics = "/Users/pavansingh/Desktop/Amazon Review Data/Electronics_5.json"
musical_instruments = "/Users/pavansingh/Desktop/Amazon Review Data/Musical_Instruments_5.json"

# load each file and join into dataframe
for category, filename in [('electronics', electronics), ('musical_instruments', musical_instruments)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", data.shape)
display(data.head(5))

# save data in folder Data
data.to_csv('Data/revs_batch4_1.csv')

# category value counts
print("Value counts of product reviews per category:\n",data['category'].value_counts())


### <a id='toc4_2_3_'></a>[Batch 5](#toc0_)

- office_products
- patio_lawn_and_garden

In [None]:
# loading the review data!

data = []

office_products = "/Users/pavansingh/Desktop/Amazon Review Data/Office_Products_5.json"
patio_lawn_and_garden = "/Users/pavansingh/Desktop/Amazon Review Data/Patio_Lawn_and_Garden_5.json"

# load each file and join into dataframe
for category, filename in [('office_products', office_products), ('patio_lawn_and_garden', patio_lawn_and_garden)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", data.shape)
display(data.head(5))

# save data in folder Data
data.to_csv('Data/revs_batch5_1.csv')

# category value counts
print("Value counts of product reviews per category:\n",data['category'].value_counts())


### <a id='toc4_2_4_'></a>[Batch 6](#toc0_)

- sports_and_outdoors
- video_games

In [None]:
# loading the review data!

data = []

sports_and_outdoors = "/Users/pavansingh/Desktop/Amazon Review Data/Sports_and_Outdoors_5.json"
video_games = "/Users/pavansingh/Desktop/Amazon Review Data/Video_Games_5.json"

# load each file and join into dataframe
for category, filename in [('sports_and_outdoors', sports_and_outdoors), ('video_games', video_games)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", data.shape)
display(data.head(5))

# save data in folder Data
data.to_csv('Data/revs_batch6_1.csv')

# category value counts
print("Value counts of product reviews per category:\n",data['category'].value_counts())


### <a id='toc4_2_5_'></a>[Batch 7](#toc0_)

- tools_and_home_improvement
- kindle_store

In [None]:
# loading the review data!

data = []

tools_and_home_improvement = "/Users/pavansingh/Desktop/Amazon Review Data/Tools_and_Home_Improvement_5.json"
kindle_store = "/Users/pavansingh/Desktop/Amazon Review Data/Kindle_Store_5.json"

# load each file and join into dataframe
for category, filename in [('tools_and_home_improvement', tools_and_home_improvement), ('kindle_store', kindle_store)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", data.shape)
display(data.head(5))

# save data in folder Data
data.to_csv('Data/revs_batch7_1.csv')

# category value counts
print("Value counts of product reviews per category:\n",data['category'].value_counts())

### <a id='toc4_2_6_'></a>[Batch 8](#toc0_)

- toys_and_games
- prime_pantry

In [None]:
# loading the review data!

data = []

toys_and_games = "/Users/pavansingh/Desktop/Amazon Review Data/Toys_and_Games_5.json"
prime_pantry = "/Users/pavansingh/Desktop/Amazon Review Data/Prime_Pantry_5.json"

# load each file and join into dataframe
for category, filename in [('toys_and_games', toys_and_games), ('prime_pantry', prime_pantry)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", data.shape)
display(data.head(5))

# save data in folder Data
data.to_csv('Data/revs_batch8_1.csv')

# category value counts
print("Value counts of product reviews per category:\n",data['category'].value_counts())

### <a id='toc4_2_7_'></a>[Batch 9](#toc0_)

- home_and_kitchen
- movies_and_tv

In [None]:
# loading the review data!

data = []

home_and_kitchen = "/Users/pavansingh/Desktop/Amazon Review Data/Home_and_Kitchen_5.json"
movies_and_tv = "/Users/pavansingh/Desktop/Amazon Review Data/Movies_and_TV_5.json"

# load each file and join into dataframe
for category, filename in [('home_and_kitchen', home_and_kitchen), ('movies_and_tv', movies_and_tv)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", data.shape)
display(data.head(5))

# save data in folder Data
data.to_csv('Data/revs_batch9_1.csv')

# category value counts
print("Value counts of product reviews per category:\n",data['category'].value_counts())


### <a id='toc4_2_8_'></a>[Batch 10](#toc0_)

- pet_supplies
- grocery_and_gourmet_food




In [None]:
# loading the review data!

data = []

pet_supplies = "/Users/pavansingh/Desktop/Amazon Review Data/Pet_Supplies_5.json"
grocery_and_gourmet_food = "/Users/pavansingh/Desktop/Amazon Review Data/Grocery_and_Gourmet_Food_5.json"

# load each file and join into dataframe
for category, filename in [('pet_supplies', pet_supplies), ('grocery_and_gourmet_food', grocery_and_gourmet_food)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", data.shape)
display(data.head(5))

# save data in folder Data
data.to_csv('Data/revs_batch10_1.csv')

# category value counts
print("Value counts of product reviews per category:\n",data['category'].value_counts())

### <a id='toc4_2_9_'></a>[Merge Batches (for large reviews data)](#toc0_)

In this section, we merge the batches together to create one large dataset.

We use the `pd.concat()` function to merge the batches together. The resulting dataset is saved as a CSV file for use in the next section - **data cleaning**. 
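As a sketch of the merge step, the batch files could be discovered by pattern instead of being listed one by one. This assumes all batch CSVs sit in a single folder and follow the `revs_batch*_1.csv` naming used in this notebook; the helper `merge_batches` is hypothetical:

```python
from pathlib import Path

import pandas as pd


def merge_batches(data_dir, pattern="revs_batch*_1.csv"):
    """Hypothetical helper: concatenate every batch CSV found in data_dir."""
    # sorted() only makes the order deterministic; it is alphabetical,
    # so batch 10 comes before batch 2
    paths = sorted(Path(data_dir).glob(pattern))
    frames = [pd.read_csv(p, low_memory=False) for p in paths]
    # ignore_index=True gives the merged frame a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)
```

Note that `pd.concat` raises a `ValueError` when given an empty list, so a wrong folder or pattern fails loudly rather than returning an empty dataframe.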




In [None]:
# load and merge csv files
df1 = pd.read_csv("/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/revs_batch1_1.csv", low_memory=False)
df2 = pd.read_csv("/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/revs_batch2_1.csv", low_memory=False)
df3 = pd.read_csv("/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/revs_batch3_1.csv", low_memory=False)
df4 = pd.read_csv("/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/revs_batch4_1.csv", low_memory=False)
df5 = pd.read_csv("/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/revs_batch5_1.csv", low_memory=False)
df6 = pd.read_csv("/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/revs_batch6_1.csv", low_memory=False)
df7 = pd.read_csv("/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/revs_batch7_1.csv", low_memory=False)
df8 = pd.read_csv("/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/revs_batch8_1.csv", low_memory=False)
df9 = pd.read_csv("/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/revs_batch9_1.csv", low_memory=False)
df10 = pd.read_csv("/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/revs_batch10_1.csv", low_memory=False)

# merge all dataframes
frames = [df1, df2, df3, df4, df5, df6, df7, df8, df9, df10]
lots_revs_meta = pd.concat(frames)

# save to csv
lots_revs_meta.to_csv('Data/lots_revs_1.csv')


In [None]:
# quick look at the data
print("Shape of all data:", lots_revs_meta.shape)
display(lots_revs_meta.head(3))

# value counts
print("\nValue counts of product reviews per category:\n",lots_revs_meta['category'].value_counts())

***
## <a id='toc4_3_'></a>[Merge Large Reviews with Few Reviews](#toc0_)

We now have two CSV files:

1. `few_revs_1.csv`: contains the reviews for the categories with fewer reviews
2. `lots_revs_1.csv`: contains the reviews for the categories with a lot of reviews

We merge these two datasets together to create one large dataset that we will use for data cleaning and the subsequent analysis.
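One detail worth illustrating for this merge: a dataframe written with the default `to_csv()` and read back gains an `Unnamed: 0` column holding the old index. The toy example below, using made-up rows and an in-memory buffer in place of the real files, shows the effect and one way to drop such columns without naming each of them:

```python
import io

import pandas as pd

# toy stand-ins for the two review tables
few = pd.DataFrame({"asin": ["p1"], "category": ["software"]})
lots = pd.DataFrame({"asin": ["p2", "p3"], "category": ["books", "books"]})

# writing with the default index=True and reading back adds 'Unnamed: 0'
buf = io.StringIO()
few.to_csv(buf)
buf.seek(0)
few_roundtrip = pd.read_csv(buf)

merged = pd.concat([few_roundtrip, lots], ignore_index=True)
# drop whichever leftover index columns are present
merged = merged.loc[:, ~merged.columns.str.startswith("Unnamed")]
```

Saving with `to_csv(..., index=False)` in the first place would avoid the extra columns altogether.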

In [None]:
# load and merge csv files
few = pd.read_csv("/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/few_revs_1.csv", low_memory=False)
#lots = pd.read_csv("/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/lots_revs_1.csv", low_memory=False)
lots = lots_revs_meta

# merge all dataframes
frames = [few, lots]
all_revs_meta = pd.concat(frames)
# drop leftover index columns; errors='ignore' skips any that are not present
all_revs_meta = all_revs_meta.drop(columns=['Unnamed: 0', 'Unnamed: 0.1'], errors='ignore')

# save to csv
all_revs_meta.to_csv('Data/all_revs_1.csv')

## Only Keep Reviewers and Products with at Least X Reviews

In [None]:
# load all_revs
all_revs = pd.read_csv("/Users/pavansingh/Desktop/all_revs_1.csv")

In [None]:
# quick look at the data
display(all_revs.head(3))
print(all_revs.shape)
print("number of unique products:", all_revs['asin'].nunique())
print("number of unique reviewers:", all_revs['reviewerID'].nunique())

test = all_revs.copy()

In [None]:
# keep products and reviewers with at least 10 reviews
# (single pass: products, then reviewers, then products again)

test = test[test['asin'].isin(test.groupby('asin').size().reset_index(name='counts').query('counts >= 10')['asin'])]
test = test[test['reviewerID'].isin(test.groupby('reviewerID').size().reset_index(name='counts').query('counts >= 10')['reviewerID'])]
test = test[test['asin'].isin(test.groupby('asin').size().reset_index(name='counts').query('counts >= 10')['asin'])]

# shape
print(test.shape)

# show number of ratings per reviewer in table
display(test.groupby('reviewerID').size().reset_index(name='counts').sort_values('counts', ascending=True).head(5))

# show number of ratings per product in table
display(test.groupby('asin').size().reset_index(name='counts').sort_values('counts', ascending=True).head(5))

# save to csv
test.to_csv('Data/set1_data.csv')

In [None]:
# keep products and reviewers with at least 12 reviews
# (single pass: products, then reviewers, then products again)

test = test[test['asin'].isin(test.groupby('asin').size().reset_index(name='counts').query('counts >= 12')['asin'])]
test = test[test['reviewerID'].isin(test.groupby('reviewerID').size().reset_index(name='counts').query('counts >= 12')['reviewerID'])]
test = test[test['asin'].isin(test.groupby('asin').size().reset_index(name='counts').query('counts >= 12')['asin'])]

# shape
print(test.shape)

# show number of ratings per reviewer in table
display(test.groupby('reviewerID').size().reset_index(name='counts').sort_values('counts', ascending=True).head(5))

# show number of ratings per product in table
display(test.groupby('asin').size().reset_index(name='counts').sort_values('counts', ascending=True).head(5))

# save to csv
test.to_csv('Data/set2_data.csv')

In [None]:
# keep products and reviewers with at least 14 reviews
# (single pass: products, then reviewers, then products again)

test = test[test['asin'].isin(test.groupby('asin').size().reset_index(name='counts').query('counts >= 14')['asin'])]
test = test[test['reviewerID'].isin(test.groupby('reviewerID').size().reset_index(name='counts').query('counts >= 14')['reviewerID'])]
test = test[test['asin'].isin(test.groupby('asin').size().reset_index(name='counts').query('counts >= 14')['asin'])]

# shape
print(test.shape)

# show number of ratings per reviewer in table
display(test.groupby('reviewerID').size().reset_index(name='counts').sort_values('counts', ascending=True).head(5))

# show number of ratings per product in table
display(test.groupby('asin').size().reset_index(name='counts').sort_values('counts', ascending=True).head(5))

# save to csv
test.to_csv('Data/set3_data.csv')

In [None]:
# keep products and reviewers with at least 20 reviews
# (single pass: products, then reviewers, then products again)

test = test[test['asin'].isin(test.groupby('asin').size().reset_index(name='counts').query('counts >= 20')['asin'])]
test = test[test['reviewerID'].isin(test.groupby('reviewerID').size().reset_index(name='counts').query('counts >= 20')['reviewerID'])]
test = test[test['asin'].isin(test.groupby('asin').size().reset_index(name='counts').query('counts >= 20')['asin'])]

# shape
print(test.shape)

# show number of ratings per reviewer in table
display(test.groupby('reviewerID').size().reset_index(name='counts').sort_values('counts', ascending=True).head(5))

# show number of ratings per product in table
display(test.groupby('asin').size().reset_index(name='counts').sort_values('counts', ascending=True).head(5))

# save to csv
test.to_csv('Data/set4_data.csv')
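The filtering cells above each run the product and reviewer filters a fixed number of times, so the final product filter can push some reviewers back below the threshold. A minimal sketch of a variant that repeats the two filters until nothing more is removed is given here; the function `k_core` and its parameters are hypothetical and not part of this notebook:

```python
import pandas as pd


def k_core(df, k, user_col="reviewerID", item_col="asin"):
    """Repeat the product and reviewer filters until no more rows are removed."""
    while True:
        before = len(df)
        # keep products with at least k remaining reviews
        df = df[df[item_col].map(df[item_col].value_counts()) >= k]
        # keep reviewers with at least k remaining reviews
        df = df[df[user_col].map(df[user_col].value_counts()) >= k]
        if len(df) == before:
            return df


# toy example: dropping product i3 leaves reviewer u3 with a single review
toy = pd.DataFrame({
    "reviewerID": ["u1", "u2", "u1", "u2", "u3", "u3"],
    "asin": ["i1", "i1", "i2", "i2", "i1", "i3"],
})
core = k_core(toy, k=2)
```

On the result, every remaining reviewer and every remaining product has at least `k` rows, which is the property the $k$-core definition asks for.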

In [None]:
# quick look at the data
print("Shape of all data:", test.shape) #4 164 059
display(test.head(3))

# value counts
print("\nValue counts of product reviews per category:\n",test['category'].value_counts())