# Amazon MP3 Reviews
Read the file line by line to build a list of values for each field, then collect the lists in a dictionary that is converted to a dataframe.
- a line starting with "#####" marks a new record and is used to update the list index
- do not assume that new record indexes are sequential
- do not assume all records have the same fields or list them in the same order
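
The parsing rules above can be sketched on a tiny in-memory sample. The sample text and the helper name `parse_records` are made up for illustration, and the `#####<id>` / `[field]:value` layout is inferred from the parsing code in this notebook rather than from a format specification. Collecting one dictionary per record avoids the list-index bookkeeping, because every record starts with all fields set to an empty string.

```python
import re

# Hypothetical sample; layout inferred from the notebook's parsing code.
SAMPLE = """#####7
[id]:1
[fullText]:Great player: small and loud
[rating]:5

#####3
[rating]:2
[id]:2
"""

def parse_records(text, fields):
    """Return one dict per record, with '' for any field a record lacks."""
    records = []
    for line in text.splitlines():
        if line.startswith('#####'):
            # a new record starts; ids are not assumed to be sequential
            record = dict.fromkeys(fields, '')
            record['record_id'] = int(re.search(r'\d+', line).group())
            records.append(record)
            continue
        # field lines look like "[name]:value"; blank lines do not match
        match = re.match(r'\[(\w+)\]:(.*)', line)
        if match and records and match.group(1) in fields:
            records[-1][match.group(1)] = match.group(2).strip()
    return records

records = parse_records(SAMPLE, ['id', 'fullText', 'rating'])
for record in records:
    print(record)
```

A list of dictionaries like this can be passed directly to `pd.DataFrame(records)`.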

## Review criteria
Tried two different criteria for keeping a review:
- number of **filtered** review words (no stop words) >= 50
- number of **raw** review words (with stop words) >= 50

Decided to filter on the **filtered** word count rather than the **raw** one. The filtered criterion yielded 16,780 records, while the raw criterion yielded 24,409. When terms that occurred in fewer than 10 documents were then removed, applying the length >= 50 filter a second time yielded ~16,200 records with either method. The target size of the data is 16,680 records.

To get as close as possible to the original processed dataset, the length filter is applied only once, using the filtered word count.
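
To make the difference between the two criteria concrete, here is a minimal sketch on a toy review. The stop word set, the sample text, the helper name `word_counts`, and the threshold of 5 are illustrative assumptions (the notebook reads `stopwords.txt` and uses a threshold of 50).

```python
import re

# Illustrative stop words; the notebook reads its list from stopwords.txt.
STOP_WORDS = {'the', 'is', 'a', 'and', 'it', 'this', 'to'}

def word_counts(review):
    """Return (raw word count, word count after dropping stop words)."""
    words = re.sub(r'[^\w\s]', '', review).lower().split()
    kept = [w for w in words if w not in STOP_WORDS]
    return len(words), len(kept)

review = "This is a great player, and it is easy to use."
raw_len, filtered_len = word_counts(review)

MIN_WORDS = 5  # toy threshold; the notebook uses 50
print(raw_len >= MIN_WORDS, filtered_len >= MIN_WORDS)
```

Because stop words are excluded before counting, the filtered criterion is the stricter of the two, which is consistent with it keeping fewer records (16,780 vs. 24,409).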

In [1]:
import re
import pandas as pd

In [2]:
# dictionary to hold data to be transformed to pandas dataframe
amazon_data = {
    'record_id':[],
    'id':[],
    'productId':[],
    'standardName':[],
    'productName':[],
    'title':[],
    'author':[],
    'createDate':[],
    'summary':[],
    'fullText':[],
    'rating':[],
    'recommend':[],
    'paid':[],
    'helpfulNum':[],
    'totalNum':[],
    'commentNum':[],
    'webHome':[],
    'webUrl':[],
    'htmlPath':[],
    'textPath':[]
}

In [3]:
rec_num = 0
# fill amazon_data: one list of values per field name, read from the input file
with open('amazon_mp3', 'r') as file:
    # loop through file
    while True:
        line = file.readline()
        # end of file
        if not line:
            print('Last record is %d' % record_id)
            # pad any field that was missing from the last record
            for k, v in amazon_data.items():
                if k != 'record_id' and len(v) != rec_num:
                    amazon_data[k].append('')
            break
        # indicator for new record
        if re.search(r'^#####', line):
            record_id = int(re.search(r'\d+', line).group())
            amazon_data['record_id'].append(record_id)
            # pad any field that was missing from the previous record
            for k, v in amazon_data.items():
                if k != 'record_id' and len(v) != rec_num:
                    amazon_data[k].append('')
            rec_num = rec_num + 1
        # newline in between records
        elif line == '\n':
            pass
        # for each field in record
        else:
            try:
                # field name is the text inside the square brackets
                key = re.search(r'\[(\w+)\]', line).group(1)
                # field value is everything after the first colon
                value = line.split(':', 1)[1].strip()
                # append value to list
                amazon_data[key].append(value)
            except (AttributeError, IndexError, KeyError):
                # no [field] tag, no colon, or a field name not in amazon_data
                print('Key not found for record %d' % record_id)


Last record is 55740


In [4]:
# check lengths
lengths=[]
for k, v in amazon_data.items():
    l = len(v)
    lengths.append(l)
    # print('Length of %s is %d' % (k, l))
exp_len = max(lengths)
print('Expected length is %d' % exp_len)
# print any fields not matching the expected length
for k, v in amazon_data.items():
    if len(v) != exp_len:
        print('Check %s as it is missing %d entries' % (k, exp_len-len(v)))

Expected length is 31000


In [5]:
# create dataframe from dictionary
amazon_reviews_df = pd.DataFrame.from_dict(amazon_data)

In [6]:
# get stop words
with open('stopwords.txt') as file:
    stop_words_list = file.read().splitlines()

def get_review_data(review):
    '''
    strips punctuation, lowercases the review and removes stop words
    input: string
    output: tuple of (list of words not in stop words,
            number of filtered words, number of raw words)
    '''
    review_wo_punc = re.sub(r'[^\w\s]', '', review)
    review_words = review_wo_punc.lower().split()
    raw_review_length = len(review_words)
    # words are already lowercase at this point
    review_data = [word for word in review_words if word not in stop_words_list]
    filtered_review_length = len(review_data)
    return review_data, filtered_review_length, raw_review_length

In [7]:
# get (filtered words, filtered length, raw length) tuple for each review
amazon_reviews_df['review_data'] = \
    amazon_reviews_df['fullText'].apply(get_review_data)

In [8]:
# alternative: filtering on raw review word count keeps more records
raw_review_count = len(amazon_reviews_df[amazon_reviews_df['review_data'].str[2] >= 50])
print('Reviews with at least 50 words in raw review: %d' % raw_review_count)

Reviews with at least 50 words in raw review: 24409


In [9]:
# filter on the stop-word-filtered word count instead;
# copy so that later column assignments do not act on a slice of the original
amazon_reviews_df = amazon_reviews_df[amazon_reviews_df['review_data'].str[1] >= 50].copy()
print('Reviews with at least 50 words in filtered review: %d' % len(amazon_reviews_df))

Reviews with at least 50 words in filtered review: 16780


In [10]:
# build dictionary with doc counts for each term
term_doc_counts = {}

In [11]:
# note: a term repeated within one review is counted each time it occurs,
# so these are total occurrences rather than numbers of distinct reviews
for _, row in amazon_reviews_df.iterrows():
    for w in row['review_data'][0]:
        term_doc_counts[w] = term_doc_counts.get(w, 0) + 1
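
The loop above increments a term's count for every occurrence, so a word repeated within one review is counted more than once. If the intent is the number of distinct reviews containing each term, one possible variant deduplicates each review's words before counting. This is only a sketch and is not what produced the counts reported in this notebook; the toy `reviews` list and the threshold of 2 are made up for illustration.

```python
from collections import Counter

# Toy tokenized reviews, made up for illustration.
reviews = [
    ['great', 'sound', 'great', 'battery'],
    ['battery', 'died'],
    ['great', 'price'],
]

occurrences = Counter()  # every occurrence, as in the loop above
doc_freq = Counter()     # at most one count per review
for words in reviews:
    occurrences.update(words)
    doc_freq.update(set(words))

MIN_DOCS = 2  # toy threshold; the notebook uses 10
vocabulary = {term for term, n in doc_freq.items() if n >= MIN_DOCS}
print(occurrences['great'], doc_freq['great'], sorted(vocabulary))
```

Here `great` occurs three times but in only two reviews, so the two counting schemes can select different vocabularies near the threshold.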

In [12]:
# build set of terms to keep: those with a count of at least 10
terms_to_keep = {term for term, count in term_doc_counts.items() if count >= 10}

In [13]:
# total terms
print('Terms in vocabulary: %d' % len(terms_to_keep))

Terms in vocabulary: 8637


In [14]:
# keep only terms that are in at least 10 reviews
amazon_reviews_df['review_words'] = amazon_reviews_df['review_data'].apply(lambda x: [w for w in x[0] if w in terms_to_keep])

In [15]:
amazon_reviews_df['review_words_count'] = amazon_reviews_df['review_words'].apply(len)
two_filter_count = len(amazon_reviews_df[amazon_reviews_df['review_words_count'] >= 50])
print('If second filter on length of review were applied,',
      'then record count would be: %d' % two_filter_count)

If second filter on length of review were applied, then record count would be: 16257


In [16]:
# check whether the original dataset might also have been filtered on rating:
# count the reviews that have a non-empty rating
len(amazon_reviews_df[amazon_reviews_df['rating'] != ''])

16780

In [17]:
amazon_reviews_df.columns

Index(['author', 'commentNum', 'createDate', 'fullText', 'helpfulNum',
       'htmlPath', 'id', 'paid', 'productId', 'productName', 'rating',
       'recommend', 'record_id', 'standardName', 'summary', 'textPath',
       'title', 'totalNum', 'webHome', 'webUrl', 'review_data', 'review_words',
       'review_words_count'],
      dtype='object')

In [18]:
amazon_reviews_df[['review_words','rating']].to_pickle('processed_amazon_reviews.pkl')