# Process the raw data

The data comes in three text files:

- titles.txt
- descriptions.txt
- categories.txt <em>(labels)</em>

They need to be processed into a single CSV file with the following columns <em>(all strings)</em>:
- item_id
- title
- description
- labels

This will make the data easier to process further down the line.
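
As a rough illustration of the target layout, the sketch below writes the four columns from three toy dictionaries. It uses the standard-library `csv` module rather than pandas to stay dependency-free. The sample values, the `'; '` separator used to flatten several labels into one string, and the helper name `write_items_csv` are assumptions made for illustration and are not part of the original pipeline.

```python
import csv
import io

def write_items_csv(out, titles, descriptions, labels, label_sep='; '):
    """Write one row per item_id with four string columns (hypothetical helper)."""
    writer = csv.writer(out, lineterminator='\n')
    writer.writerow(['item_id', 'title', 'description', 'labels'])
    for item_id, title in titles.items():
        writer.writerow([
            item_id,
            title,
            descriptions.get(item_id, ''),            # empty string if missing
            label_sep.join(labels.get(item_id, [])),  # flatten the label list
        ])

# Toy inputs (made-up values, not taken from the dataset)
toy_titles = {'A1': 'First item', 'A2': 'Second item'}
toy_descriptions = {'A1': 'A short description'}
toy_labels = {'A1': ['Books', 'Music, Pop'], 'A2': ['Books']}

buffer = io.StringIO()  # pass an open file object instead to write to disk
write_items_csv(buffer, toy_titles, toy_descriptions, toy_labels)
csv_text = buffer.getvalue()
print(csv_text)
```

Because individual labels contain commas, the `csv` writer quotes the `labels` field where needed, so the separator chosen for joining labels should be one that does not occur inside a label.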

In [1]:
# Import the libraries
import re
import pandas as pd

In [2]:
# Create string of data location
data_location = '../Datasets/Amazon-Cat13K/raw/'

### Process the labels

In [3]:
# Load in the labels
f_labels = open(data_location + 'categories.txt', encoding='latin1')
labels = f_labels.read()

In [4]:
# Show the first 300 Characters
print(labels[:300])

B0027DQHA0
  Movies & TV, TV
  Music, Classical
0756400120
  Books, Literature & Fiction, Anthologies & Literary Collections, General
  Books, Literature & Fiction, United States
  Books, Science Fiction & Fantasy, Science Fiction, Anthologies
  Books, Science Fiction & Fantasy, Science Fiction, Sho


In [5]:
# Close the file
f_labels.close()

#### Create a dictionary of the item_id to its labels

In [6]:
# Start with empty id_labels_dict and item_id placeholder
item_id = ''
id_labels_dict = {}

In [7]:
# Open the file, loop through each line, and process it as either an item_id or a label
with open(data_location + 'categories.txt', encoding='latin1') as labels_file:
    for line in labels_file:
        if not re.match('  ', line): # If it doesn't start with two spaces, it's an item_id
            item_id = line.strip()
        else:
            # If this isn't the first label for the item_id, add the label to existing labels
            if item_id in id_labels_dict:
                id_labels_dict[item_id].append(line.strip())
            # If this is the first label for the item_id, add the item_id to the dict with the label
            else:
                id_labels_dict[item_id] = [line.strip()]
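
The same parsing idea can be tried on a few in-memory lines before running it on the full file. The sketch below is one possible variant, assuming (as in the output shown earlier) that label lines are indented by two spaces and item_id lines are not. The function name `parse_labels` and the use of `collections.defaultdict` are illustrative choices, not the notebook's own code.

```python
from collections import defaultdict

def parse_labels(lines):
    """Map each item_id to the list of label paths that follow it."""
    id_to_labels = defaultdict(list)
    current_id = None
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        if line.startswith('  '):
            # Indented line: a label belonging to the most recent item_id
            if current_id is not None:
                id_to_labels[current_id].append(line.strip())
        else:
            # Unindented line: a new item_id
            current_id = line.strip()
    return dict(id_to_labels)

# A few lines in the format printed above
sample = [
    'B0027DQHA0\n',
    '  Movies & TV, TV\n',
    '  Music, Classical\n',
    'B0000012D5\n',
    '  Music, Blues\n',
]
parsed = parse_labels(sample)
print(parsed)
```

With `defaultdict(list)` there is no need to branch on whether the item_id has been seen before.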

In [8]:
# Have a look at the dict
id_labels_dict

{'B0027DQHA0': ['Movies & TV, TV', 'Music, Classical'],
 '0756400120': ['Books, Literature & Fiction, Anthologies & Literary Collections, General',
  'Books, Literature & Fiction, United States',
  'Books, Science Fiction & Fantasy, Science Fiction, Anthologies',
  'Books, Science Fiction & Fantasy, Science Fiction, Short Stories'],
 'B0000012D5': ['Music, Blues', 'Music, Pop', 'Music, R&B'],
 'B00024YAOQ': ['Books, Business & Investing, Business Life, Motivation & Self-Improvement'],
 '068413263X': ['Books'],
 'B000BUGXAU': ['Pet Supplies, Fish & Aquatic Pets, Aquariums'],
 'B0000A0QE6': ['Clothing & Accessories, Men, Tops & Tees, Dress Shirts'],
 'B0007YMWC8': ['Movies & TV, Movies'],
 'B0000A0QE2': ['Clothing & Accessories, Men, Tops & Tees, Dress Shirts'],
 'B000GG6ENK': ['Health & Personal Care, Nutrition & Wellness, Vitamins & Supplements, Vitamins & Multivitamins'],
 'B0001WPQ6U': ['Sports & Outdoors, Fan Shop, Sports Souvenirs, Helmets, Mini Helmets'],
 'B0000012D9': ['Music, P

### Process the titles

In [9]:
# Load in the titles
f_titles = open(data_location + 'titles.txt', encoding='latin1')
titles = f_titles.read()

In [10]:
# Show the first 300 Characters
print(titles[:300])

B00308CJ12 Bulletproof Salesman (2008)
189138922X Classical Mechanics
B0000CEP9J Fiesta Black 464 7-1/4-inch Salad Plate
B000HRH6IA Baby Blue Aurora Blue Gem Butterfly Belly Ring
B000002ERY Predicciones Leo
B000JM2796 Shakespeare Synergy Supreme Rear Drag Spin Reel, 5435R
0570070538 The Very First E


In [11]:
# Close the file
f_titles.close()

#### Create a dictionary of item_id to its title

In [12]:
# Create an empty dictionary
id_title_dict = {}

In [13]:
# Open the file, loop through each line, and split it into an item_id and a title
with open(data_location + 'titles.txt', encoding='latin1') as titles_file:
    for line in titles_file:
        line = line.strip()
        splitted_line = line.split(' ')
        item_id = splitted_line[0]          # The first token is the item_id
        title = ' '.join(splitted_line[1:]) # Everything after it is the title
        id_title_dict[item_id] = title
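
Since each line is an item_id followed by a single space and the title, the split can also be done at the first space only. The sketch below shows that variant with `str.partition` on two lines taken from the output above. The helper name `parse_title_line` is hypothetical.

```python
def parse_title_line(line):
    """Split 'ITEM_ID rest of title' at the first space only."""
    item_id, _, title = line.strip().partition(' ')
    return item_id, title

# Two lines in the format printed above
sample = [
    'B00308CJ12 Bulletproof Salesman (2008)\n',
    '189138922X Classical Mechanics\n',
]
id_to_title = dict(parse_title_line(line) for line in sample)
print(id_to_title)
```

A line that contains only an item_id yields an empty title rather than raising an error.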

In [14]:
# Have a look at the dict
id_title_dict

{'B00308CJ12': 'Bulletproof Salesman (2008)',
 '189138922X': 'Classical Mechanics',
 'B0000CEP9J': 'Fiesta Black 464 7-1/4-inch Salad Plate',
 'B000HRH6IA': 'Baby Blue Aurora Blue Gem Butterfly Belly Ring',
 'B000002ERY': 'Predicciones Leo',
 'B000JM2796': 'Shakespeare Synergy Supreme Rear Drag Spin Reel, 5435R',
 '0570070538': 'The Very First Easter',
 '0971987203': 'Road Repair Handbook (Project Logic Series)',
 'B0007UENZG': 'NOA by Cacharel Perfume 3.4 oz Spray New in Box tester',
 'B000F3WSFC': 'Taylors of Harrogate, Yorkshire Tea, Loose Leaf, 8.8-Ounce Packages (Pack of 6)',
 'B00000I5CT': '18 Tracks From...Film Chicago Blues',
 'B000OV5XWU': 'The Spooky Art: Some Thoughts On Writing (SIGNED)',
 'B000JM76EC': 'The 15 Minute Vegetarian Gourmet',
 '1851821813': "Auraicept Na N-Eces: The Scholars' Primer (Celtic studies)",
 'B000NPAFTI': 'Historical and Geographical Dictionary of Japan.',
 'B000JGG9H8': "Miss Spider Books: Miss Spider's Tea Party/Miss Spider's New Car/Miss Spider's 

### Process the descriptions

In [15]:
# Load in the descriptions
f_descs = open(data_location + 'descriptions.txt', encoding='latin1')
descs = f_descs.read()

In [16]:
# Show a sample of characters from further into the file
print(descs[20000:30000])

he cornerstone of your retirement isn't really planning at all.Corporate Pension Plans Are Going the Way of the Hula HoopRemember hula hoops? They were all the rage back in the early 1950s among hip-swiveling young baby boomers. But within a few years, sales of these plastic novelty items fell from the millions to perhaps a few thousand a year, and today the few remaining hula hoops are little more than nostalgic relics of a more innocent era.Well, the trajectory has been similar, though not nearly as short-lived, for defined-benefit pensions. These are the types of pensions most of us think of (or used to think of) when we hear the term pensionthat is, one in which the company puts money into an investment fund and, regardless of the performance of the investments, promises to pay you a monthly check for life based on how many years you worked at the company and the size of your salary. Often, after putting in twenty-five or more years at a company, retirees could walk away with pensi

In [17]:
# Close the file
f_descs.close()

In [18]:
id_desc_dict = {}

In [19]:
item_id = ''
description = ''
# Open the file, loop through each line, and process the content as an item_id or description
with open(data_location + 'descriptions.txt', encoding='latin1') as descs_file:
    for line in descs_file:
        line = line.strip()
        if 'product/productId' in line:
            item_id = line[19:]      # Drop the 'product/productId: ' prefix
        elif 'product/description' in line:
            description = line[21:]  # Drop the 'product/description: ' prefix
            # Store the pair as soon as the description is read,
            # so no line is skipped and the last record is not lost
            if item_id != '' and description != '':
                id_desc_dict[item_id] = description
            item_id = ''
            description = ''

NOTE: For one item (id 0070134561), half of the description is missing because it continues on the next line of the file. This is the only item affected.
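
One way such a continuation line could be handled is sketched below. It assumes the field prefixes are `product/productId: ` and `product/description: ` (inferred from the slice offsets 19 and 21 used above) and that any non-empty line after a description that does not start with `product/` belongs to that description. Both are assumptions about the file format, the sample records are made up, and `parse_descriptions` is a hypothetical helper.

```python
ID_PREFIX = 'product/productId: '      # 19 characters, matching line[19:]
DESC_PREFIX = 'product/description: '  # 21 characters, matching line[21:]

def parse_descriptions(lines):
    """Map item_id to description, appending continuation lines (assumed format)."""
    id_to_desc = {}
    current_id = None
    in_description = False
    for raw in lines:
        line = raw.strip()
        if line.startswith(ID_PREFIX):
            current_id = line[len(ID_PREFIX):]
            in_description = False
        elif line.startswith(DESC_PREFIX) and current_id:
            id_to_desc[current_id] = line[len(DESC_PREFIX):]
            in_description = True
        elif line and in_description and not line.startswith('product/'):
            # Assumed continuation of the previous description
            id_to_desc[current_id] += ' ' + line
        else:
            in_description = False
    return id_to_desc

# Made-up records: the first description spills onto a second line
sample = [
    'product/productId: X1\n',
    'product/description: First half of the text\n',
    'and the second half.\n',
    '\n',
    'product/productId: X2\n',
    'product/description: A one-line description\n',
]
parsed_descs = parse_descriptions(sample)
print(parsed_descs)
```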

In [20]:
# Have a look at the dict
id_desc_dict

{'B0027DQHA0': 'Conducted by John Neschling since 1997, the orchestra is defined by its emblematic interpretations of Latin American music. Here the orchestra yet again grips the listener with an electrifying selection of Brazilian and Latin American classics including w',
 '0756400120': 'This fast, lightweight anthology of 12 time-travel tales contains a handful of standout stories, but many others rely on familiar tricks: Will the hero change his destiny by changing his past? Will the hero realize that that sound he heard all those years ago was his meddling future self? The most successful stories toy with genre conventions or use time travel as a device in support of bigger concerns. James P. Hogan\'s slyly amusing "Convolution" focuses on time-machine inventor Professor Abercrombie. The professor loses his notes before completing his machine, but a future version of himself sends a time machine back, embroiling Abercrombie in a neatly dovetailed succession of weird cross-time comm

### Check that nothing is missing

In [21]:
# Have a look at some items
item_ids = ['B0027DQHA0', 'B000FNYKGW', '0070134561']
for item_id in item_ids:
    print(f'Title: {id_title_dict[item_id]}')
    print(f'Description: {id_desc_dict[item_id]}')
    print(f'Labels: {id_labels_dict[item_id]}')
    print(f'Label Count: {len(id_labels_dict[item_id])}')  
    print(' ')

Title: Sao Paulo Samba (2008)
Description: Conducted by John Neschling since 1997, the orchestra is defined by its emblematic interpretations of Latin American music. Here the orchestra yet again grips the listener with an electrifying selection of Brazilian and Latin American classics including w
Labels: ['Movies & TV, TV', 'Music, Classical']
Label Count: 2
 
Title: Amazon.com: NCAA Iowa Hawkeyes Canvas Chair: Clothing
Description: Features two adjustable arm rests with a cup holder on the right arm rest. Logo featured on the front and back of headrest and the carrying bag.	Features two adjustable arm rests with a cup holder on the right arm rest. Logo featured on the front and back of headrest and the carrying bag.
Labels: ['Sports & Outdoors, Fan Shop, Home & Garden, Furniture, Folding Chairs']
Label Count: 1
 
Title: Light Airplane Navigation Essentials
Description: Small plane navigation made easy-Learn the basics, in plain english If you need a practical-and basic-introduction t

In [22]:
missing_descriptions = 0
missing_labels = 0

In [23]:
# Count the titled items that have no description and/or no labels
for key in id_title_dict:
    if key not in id_desc_dict:
        missing_descriptions += 1
    if key not in id_labels_dict:
        missing_labels += 1

In [24]:
print(f'Number of missing descriptions: {missing_descriptions}')
print(f'Number of missing labels: {missing_labels}')

Number of missing descriptions: 667189
Number of missing labels: 294
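
To see which item_ids are affected rather than only how many, the same check can be written with set operations on the dictionary keys. The sketch below uses toy dictionaries with made-up values, and `find_missing` is a hypothetical helper name.

```python
def find_missing(titles, descriptions, labels):
    """Return the item_ids from titles that lack a description or labels."""
    title_ids = titles.keys()
    # Dictionary key views support set difference directly
    missing_desc = title_ids - descriptions.keys()
    missing_lab = title_ids - labels.keys()
    return missing_desc, missing_lab

# Toy inputs (made-up values, not taken from the dataset)
toy_titles = {'A1': 't1', 'A2': 't2', 'A3': 't3'}
toy_descriptions = {'A1': 'd1'}
toy_labels = {'A1': ['Books'], 'A2': ['Music']}

missing_desc, missing_lab = find_missing(toy_titles, toy_descriptions, toy_labels)
print(sorted(missing_desc))
print(sorted(missing_lab))
```

The lengths of the two returned sets correspond to the two counters computed above.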
