# 0. Download and Read Data

## 0.1 Set-up
---
The following code mounts your Google Drive, copies the Kaggle API key (`kaggle.json`) from the project folder to `~/.kaggle/` and restricts its permissions. The project's data directory, `/content/drive/MyDrive/bt4222_group_6/bt4222_group_6_amazon/data` (referred to as `data` from here on), is created in section 0.2.

In [1]:
from google.colab import drive
import os
import shutil

# Mount Google Drive
drive.mount('/content/drive', force_remount=True)

# Define the project directory and the location of the Kaggle API key
project_dir = '/content/drive/MyDrive/bt4222_group_6/bt4222_group_6_amazon'
kaggle_src = os.path.join(project_dir, 'kaggle.json')

Mounted at /content/drive


In [2]:
! pip install kaggle



In [3]:
! mkdir -p ~/.kaggle

In [4]:
! cp /content/drive/MyDrive/bt4222_group_6/bt4222_group_6_amazon/kaggle.json ~/.kaggle/

In [5]:
! chmod 600 ~/.kaggle/kaggle.json
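
The three shell cells above can also be written in Python, which puts the otherwise unused `shutil` import and `kaggle_src` variable from cell 1 to work. A minimal sketch, assuming a helper named `install_kaggle_key` (hypothetical, not part of the notebook):

```python
import os
import shutil
import stat


def install_kaggle_key(src, dest_dir=None):
    """Copy a Kaggle API key into dest_dir and restrict it to the owner.

    `install_kaggle_key` is a hypothetical helper; dest_dir defaults to
    ~/.kaggle, which is where the cells above place the key.
    """
    if dest_dir is None:
        dest_dir = os.path.expanduser('~/.kaggle')
    os.makedirs(dest_dir, exist_ok=True)          # like `mkdir -p`
    dest = os.path.join(dest_dir, 'kaggle.json')
    shutil.copy(src, dest)                        # like `cp`
    os.chmod(dest, stat.S_IRUSR | stat.S_IWUSR)   # like `chmod 600`
    return dest
```

In the notebook this would be called as `install_kaggle_key(kaggle_src)`. Unlike a plain `mkdir`, it does not fail when the cell is re-run and `~/.kaggle` already exists.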

## 0.2 Downloading the Dataset

---
The following code lists the files in the Kaggle dataset [Amazon US Customer Reviews Dataset](https://www.kaggle.com/datasets/cynthiarempel/amazon-us-customer-reviews-dataset) and downloads four categories: Electronics, Furniture, Major Appliances and Personal Care Appliances. These categories are saved as `electronics.csv`, `furniture.csv`, `major_appliance.csv`, and `personal_care_appliances.csv` respectively under `data`.


In [6]:
! kaggle datasets files -d cynthiarempel/amazon-us-customer-reviews-dataset

Next Page Token = CfDJ8KT8tnOr7fFFm_byYmusL7gFj-ddgMDBjclo0rAH0vbHaPcMflpCD_-NkgYYYwnC-dMGmFDO1JbK0Z0S0xcfxz8
name                                                      size  creationDate                
--------------------------------------------------  ----------  --------------------------  
amazon_reviews_multilingual_US_v1_00.tsv            3629753164  2021-06-16 20:15:06.432000  
amazon_reviews_us_Apparel_v1_00.tsv                 1971061630  2021-06-16 20:12:30.596000  
amazon_reviews_us_Automotive_v1_00.tsv              1350294084  2021-06-16 20:10:19.493000  
amazon_reviews_us_Baby_v1_00.tsv                     872274720  2021-06-16 20:09:43.830000  
amazon_reviews_us_Beauty_v1_00.tsv                  2152186111  2021-06-16 20:11:52.020000  
amazon_reviews_us_Books_v1_02.tsv                   3238702530  2021-06-16 20:14:41.804000  
amazon_reviews_us_Camera_v1_00.tsv                  1100169988  2021-06-16 20:09:58.549000  
amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tsv  3

In [16]:
! kaggle datasets download -d cynthiarempel/amazon-us-customer-reviews-dataset -f amazon_reviews_us_Electronics_v1_00.tsv --force

! unzip -o amazon_reviews_us_Electronics_v1_00.tsv

Dataset URL: https://www.kaggle.com/datasets/cynthiarempel/amazon-us-customer-reviews-dataset
License(s): other
Downloading amazon_reviews_us_Electronics_v1_00.tsv to /content
 99% 666M/675M [00:02<00:00, 247MB/s]
100% 675M/675M [00:02<00:00, 295MB/s]
Archive:  amazon_reviews_us_Electronics_v1_00.tsv
  inflating: amazon_reviews_us_Electronics_v1_00.tsv  


In [17]:
! kaggle datasets download -d cynthiarempel/amazon-us-customer-reviews-dataset -f amazon_reviews_us_Furniture_v1_00.tsv --force

! unzip -o amazon_reviews_us_Furniture_v1_00.tsv

Dataset URL: https://www.kaggle.com/datasets/cynthiarempel/amazon-us-customer-reviews-dataset
License(s): other
Downloading amazon_reviews_us_Furniture_v1_00.tsv to /content
 95% 137M/144M [00:00<00:00, 762MB/s] 
100% 144M/144M [00:00<00:00, 437MB/s]
Archive:  amazon_reviews_us_Furniture_v1_00.tsv
  inflating: amazon_reviews_us_Furniture_v1_00.tsv  


In [18]:
! kaggle datasets download -d cynthiarempel/amazon-us-customer-reviews-dataset -f amazon_reviews_us_Major_Appliances_v1_00.tsv --force

! unzip -o amazon_reviews_us_Major_Appliances_v1_00.tsv

Dataset URL: https://www.kaggle.com/datasets/cynthiarempel/amazon-us-customer-reviews-dataset
License(s): other
Downloading amazon_reviews_us_Major_Appliances_v1_00.tsv to /content
  0% 0.00/23.5M [00:00<?, ?B/s]
100% 23.5M/23.5M [00:00<00:00, 315MB/s]
Archive:  amazon_reviews_us_Major_Appliances_v1_00.tsv
  inflating: amazon_reviews_us_Major_Appliances_v1_00.tsv  


In [19]:
! kaggle datasets download -d cynthiarempel/amazon-us-customer-reviews-dataset -f amazon_reviews_us_Personal_Care_Appliances_v1_00.tsv --force

! unzip -o amazon_reviews_us_Personal_Care_Appliances_v1_00.tsv

Dataset URL: https://www.kaggle.com/datasets/cynthiarempel/amazon-us-customer-reviews-dataset
License(s): other
Downloading amazon_reviews_us_Personal_Care_Appliances_v1_00.tsv to /content
  0% 0.00/17.0M [00:00<?, ?B/s]
100% 17.0M/17.0M [00:00<00:00, 516MB/s]
Archive:  amazon_reviews_us_Personal_Care_Appliances_v1_00.tsv
  inflating: amazon_reviews_us_Personal_Care_Appliances_v1_00.tsv  
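
The four download cells above differ only in the file name, so they could be collapsed into a loop. A sketch, assuming the `kaggle` CLI is installed and authenticated as in section 0.1; the helpers `build_download_command` and `download_all` are hypothetical, and the flags are the ones already used in the cells above:

```python
import subprocess

DATASET = 'cynthiarempel/amazon-us-customer-reviews-dataset'
TSV_FILES = [
    'amazon_reviews_us_Electronics_v1_00.tsv',
    'amazon_reviews_us_Furniture_v1_00.tsv',
    'amazon_reviews_us_Major_Appliances_v1_00.tsv',
    'amazon_reviews_us_Personal_Care_Appliances_v1_00.tsv',
]


def build_download_command(filename, dataset=DATASET):
    """Return the argument list for downloading one file from the dataset."""
    return ['kaggle', 'datasets', 'download', '-d', dataset,
            '-f', filename, '--force']


def download_all(filenames=TSV_FILES):
    """Download and unzip each file; stops at the first failing command."""
    for name in filenames:
        subprocess.run(build_download_command(name), check=True)
        subprocess.run(['unzip', '-o', name], check=True)
```

Calling `download_all()` would replace cells 16 to 19. With `check=True`, a failed download raises `subprocess.CalledProcessError` instead of passing unnoticed.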


In [21]:
import pandas as pd

# Paths
data_dir = os.path.join(project_dir, 'data')
os.makedirs(data_dir, exist_ok=True)

# File mapping: {original_tsv: renamed_csv}
file_mapping = {
    'amazon_reviews_us_Electronics_v1_00.tsv': 'electronics.csv',
    'amazon_reviews_us_Furniture_v1_00.tsv': 'furniture.csv',
    'amazon_reviews_us_Major_Appliances_v1_00.tsv': 'major_appliance.csv',
    'amazon_reviews_us_Personal_Care_Appliances_v1_00.tsv': 'personal_care_appliances.csv'
}

# Process each file
for tsv_filename, csv_filename in file_mapping.items():
    tsv_path = f"/content/{tsv_filename}"
    csv_path = os.path.join(data_dir, csv_filename)

    try:
        # Load TSV
        df = pd.read_csv(tsv_path, sep='\t', engine='python', on_bad_lines='skip')
        # Save as CSV to Google Drive
        df.to_csv(csv_path, index=False)
        print(f"Converted and saved: {csv_filename}")
    except FileNotFoundError:
        print(f"File not found, skipped: {tsv_filename}")
    except Exception as e:
        print(f"Error processing {tsv_filename}: {e}")

Converted and saved: electronics.csv
Converted and saved: furniture.csv
Converted and saved: major_appliance.csv
Converted and saved: personal_care_appliances.csv
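
Because `on_bad_lines='skip'` drops malformed rows silently, it may be worth comparing the number of lines in each TSV with the number of rows pandas actually parsed. A minimal sketch using only the standard library; `count_data_lines` is a hypothetical helper, and the line count only approximates the row count, since a quoted field can span several lines:

```python
def count_data_lines(path, encoding='utf-8'):
    """Return the number of non-empty lines after the header line."""
    with open(path, encoding=encoding, errors='replace') as f:
        next(f, None)  # skip the header line (no error if the file is empty)
        return sum(1 for line in f if line.strip())
```

Inside the conversion loop, `count_data_lines(tsv_path) - len(df)` would then give a rough estimate of how many rows were skipped or merged for each category.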


## 0.3 Merging Files
---
The following code merges the files `electronics.csv`, `furniture.csv`, `major_appliance.csv` and `personal_care_appliances.csv` into `merged_reviews.csv` under `data`.

In [22]:
# Merge into one CSV
merged_df = pd.concat([
    pd.read_csv(os.path.join(data_dir, csv))
    for csv in file_mapping.values()
], ignore_index=True)

merged_df.to_csv(os.path.join(data_dir, 'merged_reviews.csv'), index=False)
print("Merged all reviews into merged_reviews.csv")

Merged all reviews into merged_reviews.csv
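
`pd.concat` holds all four DataFrames in memory at once. If that becomes a problem on a smaller Colab runtime, the files could be appended row by row instead. A sketch with the standard `csv` module, assuming all inputs share the same header (which should hold here, since they come from the same dataset); `merge_csv_files` is a hypothetical helper, not the method used in the cell above:

```python
import csv


def merge_csv_files(input_paths, output_path):
    """Append CSV files with identical headers; return the number of data rows."""
    header = None
    rows_written = 0
    with open(output_path, 'w', newline='', encoding='utf-8') as out:
        writer = csv.writer(out)
        for path in input_paths:
            with open(path, newline='', encoding='utf-8') as src:
                reader = csv.reader(src)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # empty file, nothing to append
                if header is None:
                    header = file_header
                    writer.writerow(header)  # write the header only once
                elif file_header != header:
                    raise ValueError(f'Header mismatch in {path}')
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

With the notebook's variables, the inputs would be `[os.path.join(data_dir, name) for name in file_mapping.values()]`. The trade-off is that no `merged_df` is left in memory afterwards, so `merged_reviews.csv` would have to be read back before inspecting it.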


In [23]:
merged_df.head()

| | marketplace | customer_id | review_id | product_id | product_parent | product_title | product_category | star_rating | helpful_votes | total_votes | vine | verified_purchase | review_headline | review_body | review_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | US | 41409413 | R2MTG1GCZLR2DK | B00428R89M | 112201306 | yoomall 5M Antenna WIFI RP-SMA Female to Male ... | Electronics | 5 | 0 | 0 | N | Y | Five Stars | As described. | 2015-08-31 |
| 1 | US | 49668221 | R2HBOEM8LE9928 | B000068O48 | 734576678 | Hosa GPM-103 3.5mm TRS to 1/4" TRS Adaptor | Electronics | 5 | 0 | 0 | N | Y | It works as advertising. | It works as advertising. | 2015-08-31 |
| 2 | US | 12338275 | R1P4RW1R9FDPEE | B000GGKOG8 | 614448099 | Channel Master Titan 2 Antenna Preamplifier | Electronics | 5 | 1 | 1 | N | Y | Five Stars | Works pissa | 2015-08-31 |
| 3 | US | 38487968 | R1EBPM82ENI67M | B000NU4OTA | 72265257 | LIMTECH Wall charger + USB Hotsync & Charging ... | Electronics | 1 | 0 | 0 | N | Y | One Star | Did not work at all. | 2015-08-31 |
| 4 | US | 23732619 | R372S58V6D11AT | B00JOQIO6S | 308169188 | Skullcandy Air Raid Portable Bluetooth Speaker | Electronics | 5 | 1 | 1 | N | Y | Overall pleased with the item | Works well. Bass is somewhat lacking but is pr... | 2015-08-31 |
