## Data Pre-processing
### In this notebook we load and clean the Amazon review and software metadata.

The preprocessing includes the following steps:
1. Removing stop words and HTML tags from text fields.
2. Generating the license fee, maintenance fee, and implementation fee based on software category information.

To avoid duplicating code, the preprocessing is implemented in [data_preprocessing.py](../data_preprocessing.py).
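The actual cleaning logic lives in the script linked above and is not reproduced in this notebook. As a rough illustration of what step 1 involves, here is a minimal sketch using only the standard library. The function name `clean_text`, the regular expressions, and the tiny stop-word set are assumptions made for this example, not the script's real implementation (which may rely on a library stop-word list).

```python
import html
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "for", "it", "this"}

# Matches anything that looks like an HTML tag, e.g. <div> or </b>.
TAG_RE = re.compile(r"<[^>]+>")


def clean_text(text) -> str:
    """Strip HTML tags, decode HTML entities, and drop stop words."""
    if not isinstance(text, str):
        # Missing values (None/NaN) become an empty string.
        return ""
    no_tags = TAG_RE.sub(" ", text)
    decoded = html.unescape(no_tags)  # e.g. "&amp;" -> "&"
    tokens = re.findall(r"[A-Za-z0-9']+", decoded.lower())
    return " ".join(tok for tok in tokens if tok not in STOP_WORDS)


print(clean_text("<div>This is the <b>best</b> program for Education &amp; Reference</div>"))
# prints: best program education reference
```

On a DataFrame, a function like this could be applied column by column, for example `review_data["reviewText"].map(clean_text)`.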

In [1]:
import os
import pandas as pd
import numpy as np
import seaborn as sns
os.chdir('../')


In [2]:
meta_data_path = "../../external_data/filtered_metadata.csv"
review_data_path = "../../external_data/reviews_full.csv"

In [3]:
meta_data = pd.read_csv(meta_data_path)
review_data = pd.read_csv(review_data_path)
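In the review `head()` output further down, `asin` values such as `615179088` are one character shorter than the ten-character IDs in the metadata (for example `007742817X`), which suggests that some numeric-looking IDs may have lost a leading zero during parsing. A possible safeguard, sketched here on a tiny in-memory CSV with illustrative rows, is to force the column to `str` when reading:

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the review CSV (illustrative rows only).
csv_text = "asin,overall\n0615179088,5.0\n0130438480,4.0\n"

# Default parsing: an all-digit column is inferred as integers,
# so the leading zeros are dropped.
inferred = pd.read_csv(io.StringIO(csv_text))

# Forcing asin to str keeps the IDs exactly as written in the file.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"asin": str})

print(inferred["asin"].tolist())
print(as_text["asin"].tolist())
```

Reading both files with the same `dtype` for `asin` would also keep the join key consistent between the review and metadata tables.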



In [6]:
meta_data.head()

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,details
0,"['Software', 'Education &amp; Reference']",,"['Slides with Video, Teaching Public Speaking ...",,Instructor's Resource CD-ROM for The Art of Sp...,[],,McGraw Hill,[],"18,178 in Software (",[],Software,,</div>,$8.00,007742817X,
1,"['Software', 'Education &amp; Reference']",,"[""Contains a guided tour of the program, Plann...",,Magruder's American Government Resource Pro CD...,[],,Magruder's,[],"19,702 in Software (",['0130679550'],Software,,</div>,,0130438480,
2,"['Software', 'Education & Reference', 'Test Pr...",,[],,Prentice Hall Test Manager a Comprehensive Sui...,[],,prentice hall,[],"54,036 in Software (",[],Software,,</div>,,0130852414,
3,"['Software', 'Education &amp; Reference', 'Tes...",,"['Windos 95, 98, NT4, 200, XP\nMac OS 9.1-9.2 ...",,Magruder's American Government Itext Interacti...,[],,Magruder's,['Interactive Learning Tools-Bring Content to ...,"52,031 in Software (",[],Software,,</div>,,0131817949,
4,"['Software', 'Design &amp; Illustration', 'CAD']",,"['2.5 Floppy', '', '']",,AUTOCAD The Student Edition Release 10 (1982-89),[],,Autodesk,[],"30,901 in Software (",[],Software,,</div>,,0201656302,


In [8]:
meta_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17424 entries, 0 to 17423
Data columns (total 17 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   category      17424 non-null  object 
 1   tech1         2 non-null      object 
 2   description   17424 non-null  object 
 3   fit           0 non-null      float64
 4   title         17422 non-null  object 
 5   also_buy      17424 non-null  object 
 6   tech2         0 non-null      float64
 7   brand         17273 non-null  object 
 8   feature       17424 non-null  object 
 9   rank          17424 non-null  object 
 10  also_view     17424 non-null  object 
 11  main_cat      17424 non-null  object 
 12  similar_item  0 non-null      float64
 13  date          16932 non-null  object 
 14  price         4044 non-null   object 
 15  asin          17424 non-null  object 
 16  details       16935 non-null  object 
dtypes: float64(3), object(14)
memory usage: 2.3+ MB
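Note that `price` is stored as `object` (strings such as `$8.00`) and is present for only 4044 of the 17424 rows. The preprocessing log later mentions "formatting price field to be number only"; the sketch below shows one way such a conversion could look. The helper name `clean_price` and the choice to turn unparseable values into `NaN` are assumptions for this example, not the script's confirmed behavior.

```python
import pandas as pd


def clean_price(price: pd.Series) -> pd.Series:
    """Convert strings like '$8.00' or '$1,299.99' to floats.

    Values that cannot be parsed (including missing ones) become NaN.
    """
    # Remove dollar signs, thousands separators, and whitespace.
    stripped = price.astype(str).str.replace(r"[$,\s]", "", regex=True)
    return pd.to_numeric(stripped, errors="coerce")


sample = pd.Series(["$8.00", None, "$1,299.99", "not a price"])
print(clean_price(sample))
```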


In [9]:
meta_data.also_view.value_counts()

also_view
[]    12733
['B07GJKBZX4', 'B00A6ZSU8S', 'B07HPWFG56', 'B005WX2ULM', 'B00ELTWQR6', 'B01CGWHK5W', 'B00AORIQIW', 'B00QHFL72M']

### Review data

In [13]:
review_data.head()

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,vote
0,5.0,False,"07 23, 2008",A8IOST6U6WH9B,615179088,C. Radey,Human Japanese is a truly superb introduction ...,Human Japanese,12
1,5.0,False,"06 4, 2008",A1MUV9F35OROS5,615179088,D. Abel,I got Human Japanese as a demo from its websit...,Best Japanese Program Available,11
2,4.0,False,"04 8, 2008",A27PAMABWVQ892,615179088,piepiepie75,My first experience with Human Japanese was th...,Better than the Human Japanese 1...but not muc...,99
3,5.0,False,"03 26, 2008",A3HWWVK0L3JEKF,615179088,K. Grier,This is the first language software that I hav...,Great Product,4
4,5.0,False,"02 20, 2008",A3NO2V2JU4Y8UY,615179088,H. Granat,Human japanese is the best pc program for lear...,Love it!,2


In [14]:
review_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400723 entries, 0 to 400722
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   overall       400723 non-null  float64
 1   verified      400723 non-null  bool   
 2   reviewTime    400723 non-null  object 
 3   reviewerID    400723 non-null  object 
 4   asin          400723 non-null  object 
 5   reviewerName  400656 non-null  object 
 6   reviewText    400652 non-null  object 
 7   summary       400676 non-null  object 
 8   vote          111590 non-null  object 
dtypes: bool(1), float64(1), object(7)
memory usage: 24.8+ MB


### Pre-processing

In [7]:
from data_preprocessing import PreProcessor

In [8]:
obj = PreProcessor(meta_data_path, review_data_path)
review_data, software_data = obj.main()
review_data.to_csv("../data/reviews.csv")
software_data.to_csv("../data/softwares.csv")

data_preprocessing.py | 23 | __init__ | 20:09:53 | INFO: meta data is read with size 17424
data_preprocessing.py | 25 | __init__ | 20:09:54 | INFO: reviews data is read with size 400723
data_preprocessing.py | 30 | cleaning_price | 20:09:54 | INFO: formatting price field to be number only
data_preprocessing.py | 46 | assign_values | 20:09:54 | INFO: Generating other fee columns
data_preprocessing.py | 87 | main | 20:09:54 | INFO: Text fields cleaning
data_preprocessing.py | 92 | main | 20:28:01 | INFO: Text fields cleaning complete


OSError: Cannot save file into a non-existent directory: 'data'
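The `OSError` above occurs because `DataFrame.to_csv` does not create missing parent directories. A minimal sketch of a workaround is to create the output directory before writing. The helper name `save_csv` is an assumption for this example, and the demo writes to a temporary directory rather than the notebook's real output path.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_csv(df: pd.DataFrame, path) -> Path:
    """Write df to path, creating any missing parent directories first."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # index=False skips writing the row index as an extra column.
    df.to_csv(target, index=False)
    return target


# Demo in a temporary directory whose "data" subfolder does not exist yet.
with tempfile.TemporaryDirectory() as tmp:
    out = save_csv(pd.DataFrame({"asin": ["0615179088"]}), Path(tmp) / "data" / "reviews.csv")
    print(out.exists())
```

Passing `index=False` is optional, but it avoids the `Unnamed: 0` columns that show up when a CSV written with its index is read back.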

### Reading the output of data preprocessing

In [5]:
reviews = pd.read_csv("../data/reviews.csv")
softwares = pd.read_csv("../data/softwares.csv")
review_meta = reviews.merge(softwares, on='asin', how='inner')
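The inner join keeps only reviews whose `asin` also appears in the software table, so the merged frame (145863 rows, per the `info()` output below) is much smaller than the 400723 reviews read earlier. To see how many reviews are dropped and to guard against duplicated metadata rows, a sketch along these lines can help. The toy frames are illustrative stand-ins, not the real data.

```python
import pandas as pd

# Illustrative stand-ins for the reviews and softwares tables.
reviews = pd.DataFrame({"asin": ["A1", "A1", "B2", "C3"], "overall": [5.0, 4.0, 3.0, 1.0]})
softwares = pd.DataFrame({"asin": ["A1", "B2", "D4"], "price": [39.94, 14.99, 79.99]})

# validate="many_to_one" raises MergeError if an asin is duplicated in softwares.
# indicator=True adds a _merge column recording where each row came from.
audit = reviews.merge(
    softwares, on="asin", how="left", validate="many_to_one", indicator=True
)

# Reviews without a matching metadata row are flagged as "left_only".
unmatched = int((audit["_merge"] == "left_only").sum())

# Keeping only the matched rows is equivalent to the inner join.
review_meta = audit.loc[audit["_merge"] == "both"].drop(columns="_merge")

print(f"{unmatched} of {len(reviews)} reviews have no metadata row")
```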



In [5]:
review_meta.head()

Unnamed: 0.1,Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,vote,...,also_view,main_cat,similar_item,date,price,details,software_category,Licensing_Fee,Implemention_cost,Maintenance_cost
0,0,5.0,False,"07 23, 2008",A8IOST6U6WH9B,615179088,C. Radey,Human Japanese is a truly superb introduction ...,Human Japanese,12,...,"['B00N5EXLMC', '0976998122', '4789014401', '06...",Software,,</div>,39.94,,Education & Reference,0.008,19.97,3.994
1,1,5.0,False,"06 4, 2008",A1MUV9F35OROS5,615179088,D. Abel,I got Human Japanese as a demo from its websit...,Best Japanese Program Available,11,...,"['B00N5EXLMC', '0976998122', '4789014401', '06...",Software,,</div>,39.94,,Education & Reference,0.008,19.97,3.994
2,2,4.0,False,"04 8, 2008",A27PAMABWVQ892,615179088,piepiepie75,My first experience with Human Japanese was th...,Better than the Human Japanese 1...but not muc...,99,...,"['B00N5EXLMC', '0976998122', '4789014401', '06...",Software,,</div>,39.94,,Education & Reference,0.008,19.97,3.994
3,3,5.0,False,"03 26, 2008",A3HWWVK0L3JEKF,615179088,K. Grier,This is the first language software that I hav...,Great Product,4,...,"['B00N5EXLMC', '0976998122', '4789014401', '06...",Software,,</div>,39.94,,Education & Reference,0.008,19.97,3.994
4,4,5.0,False,"02 20, 2008",A3NO2V2JU4Y8UY,615179088,H. Granat,Human japanese is the best pc program for lear...,Love it!,2,...,"['B00N5EXLMC', '0976998122', '4789014401', '06...",Software,,</div>,39.94,,Education & Reference,0.008,19.97,3.994


In [6]:
review_meta.columns

Index(['Unnamed: 0', 'overall', 'verified', 'reviewTime', 'reviewerID', 'asin',
       'reviewerName', 'reviewText', 'summary', 'vote', 'category', 'tech1',
       'description', 'fit', 'title', 'also_buy', 'tech2', 'brand', 'feature',
       'rank', 'also_view', 'main_cat', 'similar_item', 'date', 'price',
       'details', 'software_category', 'Licensing_Fee', 'Implemention_cost',
       'Maintenance_cost'],
      dtype='object')

In [7]:
review_meta.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145863 entries, 0 to 145862
Data columns (total 30 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   Unnamed: 0         145863 non-null  int64  
 1   overall            145863 non-null  float64
 2   verified           145863 non-null  bool   
 3   reviewTime         145863 non-null  object 
 4   reviewerID         145863 non-null  object 
 5   asin               145863 non-null  object 
 6   reviewerName       145836 non-null  object 
 7   reviewText         145834 non-null  object 
 8   summary            145844 non-null  object 
 9   vote               30493 non-null   object 
 10  category           145863 non-null  object 
 11  tech1              26 non-null      object 
 12  description        145863 non-null  object 
 13  fit                0 non-null       float64
 14  title              145863 non-null  object 
 15  also_buy           145863 non-null  object 
 16  te

In [8]:
review_meta.describe()

Unnamed: 0.1,Unnamed: 0,overall,fit,tech2,similar_item,price,Licensing_Fee,Implemention_cost,Maintenance_cost
count,145863.0,145863.0,0.0,0.0,0.0,145863.0,145863.0,145863.0,145863.0
mean,72931.0,3.749224,,,,61.87379,1.023924,30.936895,6.187379
std,42107.165495,1.584138,,,,82.919461,1.569105,41.459731,8.291946
min,0.0,1.0,,,,0.0,0.0,0.0,0.0
25%,36465.5,2.0,,,,14.99,0.0,7.495,1.499
50%,72931.0,5.0,,,,36.88,0.008,18.44,3.688
75%,109396.5,5.0,,,,79.99,1.272,39.995,7.999
max,145862.0,5.0,,,,3175.0,11.96,1587.5,317.5
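The values shown above are consistent with `Implemention_cost` being 50% of `price` and `Maintenance_cost` being 10% of `price` (for example 39.94 gives 19.97 and 3.994, and the means 61.87379, 30.936895, and 6.187379 follow the same ratios). The sketch below derives such columns on a one-row frame. The per-category licensing rate, the default rate, and the function name `add_fee_columns` are hypothetical; the real category-to-fee mapping is defined in `data_preprocessing.py` and is not shown in this notebook.

```python
import pandas as pd

# Hypothetical per-category licensing rates, chosen only so the demo row
# reproduces the 0.008 value seen above; the real mapping may differ.
LICENSING_RATE = {"Education & Reference": 0.0002}
DEFAULT_RATE = 0.0  # assumption for categories not listed above


def add_fee_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the three derived fee columns added."""
    out = df.copy()
    rate = out["software_category"].map(LICENSING_RATE).fillna(DEFAULT_RATE)
    out["Licensing_Fee"] = (out["price"] * rate).round(3)
    out["Implemention_cost"] = out["price"] * 0.5  # ratio observed in the output above
    out["Maintenance_cost"] = out["price"] * 0.1   # ratio observed in the output above
    return out


demo = add_fee_columns(
    pd.DataFrame({"price": [39.94], "software_category": ["Education & Reference"]})
)
print(demo.round(3).to_dict("records"))
```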
