# Process the raw data

The data comes in three text files:

- titles.txt
- descriptions.txt
- categories.txt <em>(labels)</em>

They need to be processed into a single CSV with the following columns <em>(all strings)</em>:

- item_id
- title
- description
- labels

This will allow us to process the data more easily further down the line.

In [1]:
# Import the libraries
import re
import pandas as pd
from IPython.display import clear_output

In [2]:
# Create string of data location
data_location = '../Datasets/AmazonCat-13K/raw/'

## Process the labels

In [3]:
# Load in the labels
f_labels = open(data_location + 'categories.txt', encoding='latin1')
labels = f_labels.read()

In [4]:
# Show the first 300 Characters
print(labels[:300])

B0027DQHA0
  Movies & TV, TV
  Music, Classical
0756400120
  Books, Literature & Fiction, Anthologies & Literary Collections, General
  Books, Literature & Fiction, United States
  Books, Science Fiction & Fantasy, Science Fiction, Anthologies
  Books, Science Fiction & Fantasy, Science Fiction, Sho


In [5]:
# Close the file
f_labels.close()

#### Create a dictionary of the item_id to its labels

In [6]:
# Start with empty id_labels_dict and item_id placeholder
item_id = ''
id_labels_dict = {}
labels_ids_count = 0

In [7]:
# Open the file, loop through each line, and process them as either an item_id or label
with open(data_location + 'categories.txt', encoding='latin1') as labels_file:
    for line in labels_file:
        if not re.match(' ', line): # If it doesn't start with ' ', it's an item_id
            item_id = line.strip()
            labels_ids_count += 1
        else:
            # If this isn't the first label for the item_id, add the label to existing labels
            if item_id in id_labels_dict: 
                id_labels_dict[item_id] += [label.strip() for label in line.split(',')] 
            # If this is the first label for the item_id, add the item_id to the dict with the label
            else: 
                id_labels_dict[item_id] = [label.strip() for label in line.split(',')]
    # No explicit close() is needed: the with statement closes the file on exit

In [8]:
# Have a look at the ids count
print(f'Labels count: {labels_ids_count}')

Labels count: 2441053


In [9]:
# Have a look at the first item
id_labels_dict['B0027DQHA0']

['Movies & TV', 'TV', 'Music', 'Classical']
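
The same grouping can be sketched with `collections.defaultdict`, which removes the need to check whether an item_id is already in the dictionary. This is a minimal sketch, not the notebook's actual cell: the function name `parse_labels` and the in-memory sample are hypothetical, and the sample only mimics the layout shown in the preview above.

```python
from collections import defaultdict
import io

# Hypothetical in-memory sample laid out like the categories.txt preview
sample = io.StringIO(
    "B0027DQHA0\n"
    "  Movies & TV, TV\n"
    "  Music, Classical\n"
    "0756400120\n"
    "  Books, Literature & Fiction, United States\n"
)

def parse_labels(lines):
    """Map each item_id to the flat list of labels found on its indented lines."""
    id_to_labels = defaultdict(list)
    current_id = None
    for line in lines:
        if not line.strip():
            continue  # Skip blank lines
        if line.startswith(' '):
            # Indented line: comma-separated labels for the current item_id
            id_to_labels[current_id].extend(part.strip() for part in line.split(','))
        else:
            # Unindented line: a new item_id
            current_id = line.strip()
    return dict(id_to_labels)

parsed = parse_labels(sample)
print(parsed['B0027DQHA0'])  # ['Movies & TV', 'TV', 'Music', 'Classical']
```

As in the cell above, the category paths are flattened, so the hierarchy between 'Movies & TV' and 'TV' is not preserved.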

## Process the titles

In [10]:
# Load in the titles
f_titles = open(data_location + 'titles.txt', encoding='latin1')
titles = f_titles.read()

In [11]:
# Show the first 300 Characters
print(titles[:300])

B00308CJ12 Bulletproof Salesman (2008)
189138922X Classical Mechanics
B0000CEP9J Fiesta Black 464 7-1/4-inch Salad Plate
B000HRH6IA Baby Blue Aurora Blue Gem Butterfly Belly Ring
B000002ERY Predicciones Leo
B000JM2796 Shakespeare Synergy Supreme Rear Drag Spin Reel, 5435R
0570070538 The Very First E


In [12]:
# Close the file
f_titles.close()

#### Create a dictionary of item_id to its title

In [13]:
# Create an empty dictionary
id_title_dict = {}
title_ids_count = 0
titles_count = 0

In [14]:
# Open the file, loop through each line, and process the content as an item_id or title
with open(data_location + 'titles.txt', encoding='latin1') as titles_file:
    for line in titles_file:
        line = line.strip()
        splitted_line = line.split(' ')
        item_id = splitted_line[0]
        title_ids_count += 1
        title = (' '.join(splitted_line[1:]))
        titles_count += 1
        id_title_dict[item_id] = title
    # No explicit close() is needed: the with statement closes the file on exit

In [15]:
# Have a look at the ids and titles count
print(f'IDs count: {title_ids_count}')
print(f'Titles count: {titles_count}')

IDs count: 1720307
Titles count: 1720307


In [16]:
# Have a look at the first item
id_title_dict['B00308CJ12']

'Bulletproof Salesman (2008)'
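
Each line of titles.txt is an item_id followed by a space and the title, so the split only needs to happen at the first space. A minimal sketch of that idea using `str.partition` is shown here; the helper name `split_title_line` is hypothetical and not part of the notebook.

```python
def split_title_line(line):
    """Split 'ITEM_ID Title text' at the first space only."""
    item_id, _, title = line.strip().partition(' ')
    return item_id, title

print(split_title_line('B00308CJ12 Bulletproof Salesman (2008)\n'))
# ('B00308CJ12', 'Bulletproof Salesman (2008)')
```

A line that holds only an item_id yields an empty title rather than raising an error, which matches the behaviour of the split-and-join approach used above.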

## Process the descriptions

In [17]:
# Load in the descriptions
f_descs = open(data_location + 'descriptions.txt', encoding='latin1')
descs = f_descs.read()

In [18]:
# Show the first 1000 Characters
print(descs[:1000])

product/productId: B0027DQHA0
product/description: Conducted by John Neschling since 1997, the orchestra is defined by its emblematic interpretations of Latin American music. Here the orchestra yet again grips the listener with an electrifying selection of Brazilian and Latin American classics including w

product/productId: 0756400120
product/description: This fast, lightweight anthology of 12 time-travel tales contains a handful of standout stories, but many others rely on familiar tricks: Will the hero change his destiny by changing his past? Will the hero realize that that sound he heard all those years ago was his meddling future self? The most successful stories toy with genre conventions or use time travel as a device in support of bigger concerns. James P. Hogan's slyly amusing "Convolution" focuses on time-machine inventor Professor Abercrombie. The professor loses his notes before completing his machine, but a future version of himself sends a time machine back, embroiling Ab

In [19]:
# Close the file
f_descs.close()

#### Create a dictionary of item_id to its description

In [20]:
id_desc_dict = {}
desc_id_count = 0
descs_count = 0

In [21]:
item_id = ''
description = ''
# Open the file, loop through each line, and process the content as an item_id or description
with open(data_location + 'descriptions.txt', encoding='latin1') as descs_file:
    for line in descs_file:
        line = line.strip()
        if item_id != '' and description != '':
            if item_id not in id_desc_dict:
                desc_id_count += 1
                descs_count += 1
                id_desc_dict[item_id] = description
            item_id = ''
            description = ''
        elif 'product/productId' in line:
            item_id = line[19:]
        elif 'product/description' in line:
            description = line[21:]
    # No explicit close() is needed: the with statement closes the file on exit

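NOTE: For one item, with id 0070134561, the second half of the description is missing because it sits on its own line in the raw file and the loop only keeps the line that starts with product/description. Only this one item has this 'error'.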
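
If the truncated description matters, one possible variant is to treat any non-empty line that follows a description as a continuation of it. The sketch below is hedged: `parse_descriptions`, the prefix constants, the sample text, and the choice to join continuation lines with a single space are all assumptions, not the notebook's method. It also stores the final record after the loop, so the last item is kept even when the file does not end with a blank line.

```python
import io

ID_PREFIX = 'product/productId: '
DESC_PREFIX = 'product/description: '

def parse_descriptions(lines):
    """Map item_id to description, joining continuation lines (assumed behaviour)."""
    id_to_desc = {}
    item_id, parts = None, []
    for raw in lines:
        line = raw.strip()
        if line.startswith(ID_PREFIX):
            # A new record starts: store the previous one first (first occurrence wins)
            if item_id and parts:
                id_to_desc.setdefault(item_id, ' '.join(parts))
            item_id, parts = line[len(ID_PREFIX):], []
        elif line.startswith(DESC_PREFIX):
            parts.append(line[len(DESC_PREFIX):])
        elif line and parts:
            # Non-empty line after a description: treat it as a continuation
            parts.append(line)
    # Store the final record, which has no following line to trigger a save
    if item_id and parts:
        id_to_desc.setdefault(item_id, ' '.join(parts))
    return id_to_desc

# Hypothetical sample: the first description spills onto a second line
sample = io.StringIO(
    "product/productId: A1\n"
    "product/description: first half\n"
    "second half\n"
    "\n"
    "product/productId: B2\n"
    "product/description: only line\n"
)
descriptions = parse_descriptions(sample)
print(descriptions)  # {'A1': 'first half second half', 'B2': 'only line'}
```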

In [22]:
# Have a look at the ids and descriptions count
print(f'ID count: {desc_id_count}')
print(f'Description count: {descs_count}')

ID count: 1494801
Description count: 1494801


In [23]:
# Have a look at the first item
id_desc_dict['B0027DQHA0']

'Conducted by John Neschling since 1997, the orchestra is defined by its emblematic interpretations of Latin American music. Here the orchestra yet again grips the listener with an electrifying selection of Brazilian and Latin American classics including w'

## Figure out which items have missing information

In [24]:
# Have a look at the counts of each type of information
print(f'Titles count: {titles_count}')
print(f'Descriptions count: {descs_count}')
print(f'Labels count: {labels_ids_count}')

Titles count: 1720307
Descriptions count: 1494801
Labels count: 2441053


There are a lot of labels with no titles or descriptions. But after looking at papers that use AmazonCat-13K, it looks like they use only 1,493,021 instances of the data. It's likely they discarded items that are missing the title, description, or both. Most of the information is contained in the descriptions, so let's see how the descriptions match up with the labels.

In [25]:
# Create var with starting value of 0
has_desc_and_labels = 0

In [26]:
# Count the items that have both a description and labels
for key in id_desc_dict:
    if key in id_labels_dict:
        has_desc_and_labels += 1

In [27]:
print(f'Has description and labels: {has_desc_and_labels}')

Has description and labels: 1494801


All items with a description have labels. But that's still higher than the reported 1,493,021 instances used by the papers. Let's see how many of these items have titles.

In [28]:
# Create var with starting value of 0
has_desc_and_labels_and_title = 0

In [29]:
# Count the items that have a description, labels, and a title
for key in id_desc_dict:
    if key in id_labels_dict and key in id_title_dict:
        has_desc_and_labels_and_title += 1

In [30]:
print(f'Has description and labels and title: {has_desc_and_labels_and_title}')

Has description and labels and title: 1053117
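
The overlap counts above can also be expressed as set operations on the dictionary keys, which makes it easy to list the items that fall outside an overlap as well. This is a small sketch on hypothetical miniature dictionaries (`id_desc`, `id_labels`, `id_title`), not the real data.

```python
# Hypothetical miniature versions of the three dictionaries
id_desc = {'a': 'desc a', 'b': 'desc b', 'c': 'desc c'}
id_labels = {'a': ['x'], 'b': ['y'], 'c': ['z'], 'd': ['w']}
id_title = {'a': 'Title A', 'd': 'Title D'}

# Items with both a description and labels
desc_and_labels = id_desc.keys() & id_labels.keys()
# Of those, the items that also have a title
all_three = desc_and_labels & set(id_title)
# Of those, the items that are missing a title
desc_without_title = desc_and_labels - set(id_title)

print(len(desc_and_labels), len(all_three), len(desc_without_title))  # 3 1 2
```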


That's not enough to reach 1,493,021, so the 'title' probably didn't get used as a filter in the other papers. Now let's see if there are any items with an empty description or an empty list of labels.

In [31]:
# Create var with starting value of 0
has_non_empty_desc_and_labels = 0

In [32]:
# Count the items whose description and list of labels are both non-empty
for key, desc in id_desc_dict.items():
    if len(desc) > 0 and len(id_labels_dict.get(key, [])) > 0:
        has_non_empty_desc_and_labels += 1

In [33]:
print(f'Has non-empty description and labels: {has_non_empty_desc_and_labels}')

Has non-empty description and labels: 1494801


No change. They all have some value for descriptions and labels. That leaves 1,780 more items than the 1,493,021 reported in the papers (about 0.1%), which is close enough to carry on.

## Create dataframe

In [34]:
# Create a list of lists, using 'NO_TITLE' for items that have no title
items_list = []
for key, value in id_desc_dict.items():
    title = id_title_dict.get(key, 'NO_TITLE')
    items_list.append(['ID:' + key, title, value, id_labels_dict[key]])

In [35]:
# Convert the list of lists to a DataFrame
processed_data = pd.DataFrame(items_list, columns = ['item_id', 'title', 'description', 'labels'])

In [36]:
# Have a look at the shape of the dataframe
nrows = processed_data.shape[0]
nrows

1494801

In [37]:
# Have a look at the first 5 rows
processed_data.head(n=5)

| | item_id | title | description | labels |
|---|---|---|---|---|
| 0 | ID:B0027DQHA0 | Sao Paulo Samba (2008) | Conducted by John Neschling since 1997, the or... | [Movies & TV, TV, Music, Classical] |
| 1 | ID:0756400120 | Past Imperfect (Daw Book Collectors) | This fast, lightweight anthology of 12 time-tr... | [Books, Literature & Fiction, Anthologies & Li... |
| 2 | ID:B00024YAOQ | Winning Every Time: How to Use the Skills of a... | Whether you're hoping to obtain a raise from y... | [Books, Business & Investing, Business Life, M... |
| 3 | ID:B000BUGXAU | Nano Cube 24 Gallon Deluxe | Just add water!\tThe Nano Cube is a 24-gallon ... | [Pet Supplies, Fish & Aquatic Pets, Aquariums] |
| 4 | ID:B0007YMWC8 | Asalto En Tijuana (2005) | An honest citizen is forced to steal the world... | [Movies & TV, Movies] |


#### Save the dataframe as csv files

In [38]:
# Create save_as_csv function
def save_as_csv(df, path):
    df.to_csv(path, 
              header=True, 
              index=False, 
              encoding='latin1')
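
One thing to keep in mind when saving: the labels column holds Python lists, and a CSV can only store text, so each list ends up as its string form and comes back as a plain string when the file is read again. The sketch below illustrates that round trip with the standard-library `csv` module as a stand-in for the DataFrame export, and uses `ast.literal_eval` to turn the string back into a list. The sample row is taken from the preview above; the variable names are hypothetical.

```python
import ast
import csv
import io

# One sample row; the last field is a Python list, like the labels column
rows = [['ID:B0027DQHA0', 'Sao Paulo Samba (2008)',
         ['Movies & TV', 'TV', 'Music', 'Classical']]]

# Write to an in-memory CSV, storing the list as its string form
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['item_id', 'title', 'labels'])
for item_id, title, labels in rows:
    writer.writerow([item_id, title, str(labels)])

# Read it back: the labels field is now a string, not a list
buffer.seek(0)
record = next(csv.DictReader(buffer))
restored = ast.literal_eval(record['labels'])

print(type(record['labels']).__name__)  # str
print(restored)  # ['Movies & TV', 'TV', 'Music', 'Classical']
```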

In [39]:
# Save as csv (broken up into 10 files)
num_files = 10
size = nrows // num_files
save_path = '../Datasets/AmazonCat-13K/processed/first_pass'
for file_num in range(num_files):
    start = size * file_num
    # The last file takes whatever rows remain after the integer division
    end = nrows if file_num == num_files - 1 else start + size
    save_as_csv(processed_data[start:end], save_path + f'_no{file_num + 1}.csv')
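
Because the row count is not divisible by the number of files, the last file ends up one row longer than the others. The slice boundaries can be checked without touching the data; this is a small sketch in which `chunk_bounds` is a hypothetical helper that mirrors the slicing logic of the loop above.

```python
def chunk_bounds(nrows, num_files):
    """Return (start, end) row positions per file; the last file takes the remainder."""
    size = nrows // num_files
    bounds = []
    for file_num in range(num_files):
        start = size * file_num
        end = nrows if file_num == num_files - 1 else start + size
        bounds.append((start, end))
    return bounds

bounds = chunk_bounds(1494801, 10)
print(bounds[0], bounds[-1])  # (0, 149480) (1345320, 1494801)
```

Every row is covered exactly once: the first nine files hold 149,480 rows each and the tenth holds 149,481.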