# Preprocessing

In [1]:
import pandas as pd
import numpy as np
import warnings
import sqlite3
import nltk
warnings.filterwarnings('ignore')

## Importing Data
- Dataset: [Amazon Fine Food Reviews (240 MB)](https://www.kaggle.com/snap/amazon-fine-food-reviews)
    - Consists of 568,454 food reviews that Amazon users left up to October 2012
    - The SQLite database is used to populate a pandas DataFrame (this felt faster than loading a spreadsheet)

### SQLite operations

In [2]:
file = r'C:\Users\nishi\Downloads\amazon-fine-food-reviews\database.sqlite'
conn = sqlite3.connect(file)
cursor = conn.cursor()
# List all table names in the database
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
print(cursor.fetchall())

[('Reviews',)]


### SQLite to pandas DataFrame

In [3]:
df = pd.read_sql_query("select * from Reviews;", conn)
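
Since only the text columns are needed later, the query could also select them directly instead of loading every column. The following is a minimal sketch of that idea, assuming an in-memory stand-in for `database.sqlite`; the two rows and the reduced schema are made-up sample data, not the real dataset.

```python
import sqlite3

import pandas as pd

# In-memory stand-in for database.sqlite (hypothetical sample rows)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Reviews (Id INTEGER, Score INTEGER, Summary TEXT, Text TEXT)"
)
conn.executemany(
    "INSERT INTO Reviews VALUES (?, ?, ?, ?)",
    [
        (1, 5, "Great taffy", "Great taffy at a great price."),
        (2, 1, "Not as Advertised", "Product arrived labeled incorrectly."),
    ],
)
conn.commit()

# Select only the text columns so the unused ones are never loaded
text_df = pd.read_sql_query("SELECT Summary, Text FROM Reviews;", conn)
conn.close()

print(text_df.shape)
```

Selecting columns in SQL keeps memory use lower on the full table, at the cost of not being able to explore the other columns afterwards.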

### Data exploration
- Getting to know the data
- Helps with feature selection for the model

In [4]:
df.head()

| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price.  There was a wid... |


In [5]:
df.shape

(568454, 10)

### Data Preprocessing
- We will use this data to train a Markov model / hidden Markov model for text generation and prediction
- Only the text columns are relevant to that task, so the other columns are dropped
- Hence we keep only the "Summary" and "Text" columns
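
To make the goal concrete, here is a rough sketch of how lists of word tokens could feed a first-order Markov chain for text generation. It is not the model this project trains. The helper names `build_transitions` and `generate` are hypothetical, and the token lists are small hand-written samples.

```python
import random
from collections import Counter, defaultdict


def build_transitions(token_lists):
    """Count how often each word is followed by each other word."""
    transitions = defaultdict(Counter)
    for tokens in token_lists:
        for current, nxt in zip(tokens, tokens[1:]):
            transitions[current][nxt] += 1
    return transitions


def generate(transitions, start, max_len=10, seed=0):
    """Walk the chain from `start`, sampling followers by their counts."""
    rng = random.Random(seed)
    words = [start]
    # Stop at max_len or when the last word has no recorded follower
    while len(words) < max_len and words[-1] in transitions:
        followers = transitions[words[-1]]
        nxt = rng.choices(list(followers), weights=list(followers.values()))[0]
        words.append(nxt)
    return words


# Hand-written sample tokens (hypothetical, for illustration only)
sample_tokens = [
    ["Great", "taffy"],
    ["Great", "taffy", "at", "a", "great", "price"],
    ["Nice", "Taffy"],
]
transitions = build_transitions(sample_tokens)
print(transitions["Great"])
print(generate(transitions, "Great"))
```

The chain only needs ordered word sequences, which is why columns such as `Score` or `Time` add nothing for this purpose.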

In [6]:
df = df[['Summary', 'Text']]

In [7]:
df.head()

| | Summary | Text |
|---|---|---|
| 0 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | Great taffy | Great taffy at a great price.  There was a wid... |


In [8]:
# dropping null values
df = df.dropna()
df.shape

(568454, 2)
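
The shape is unchanged after `dropna()`, which suggests there were no null values in these two columns. A quick way to confirm that kind of result is to count nulls per column before dropping anything. The sketch below assumes a small made-up frame with deliberately missing values, so the counts are visible.

```python
import pandas as pd

# Hypothetical sample with one missing value in each column
sample = pd.DataFrame(
    {
        "Summary": ["Great taffy", None, "Nice Taffy"],
        "Text": ["Great taffy at a great price.", "Some review text.", None],
    }
)

# Number of null entries per column
null_counts = sample.isna().sum()

# Drop rows where either text column is missing
cleaned = sample.dropna(subset=["Summary", "Text"])

print(null_counts.to_dict())
print(cleaned.shape)
```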

### Tokenizing and pickling to save computation time

In [12]:
# word_tokenize requires NLTK's Punkt tokenizer data (fetched once via nltk.download)
df['tokenized_Summary'] = df['Summary'].apply(nltk.word_tokenize)
df['tokenized_Reviews'] = df['Text'].apply(nltk.word_tokenize)

In [13]:
df.to_pickle(r'C:\Users\nishi\source\repos\HiddenMarkovModel\archives\my_df.pkl',compression='gzip')
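
Reloading the pickle in a later session skips the tokenization step. Because the file name ends in `.pkl` rather than `.gz`, pandas cannot infer the compression from the extension, so `compression='gzip'` has to be passed again when reading. A minimal round-trip sketch, assuming a temporary directory and a one-row sample frame in place of the real path and data:

```python
import os
import tempfile

import pandas as pd

# Hypothetical one-row frame standing in for the tokenized DataFrame
sample = pd.DataFrame(
    {"Summary": ["Great taffy"], "tokenized_Summary": [["Great", "taffy"]]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_df.pkl")
    # Write with gzip compression, as in the cell above
    sample.to_pickle(path, compression="gzip")
    # The same compression must be named explicitly when reading back
    restored = pd.read_pickle(path, compression="gzip")

print(restored.loc[0, "tokenized_Summary"])
```

The token lists come back as Python lists, which a CSV export would not preserve without re-parsing.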

In [14]:
df

| | Summary | Text | tokenized_Summary | tokenized_Reviews |
|---|---|---|---|---|
| 0 | Good Quality Dog Food | I have bought several of the Vitality canned d... | [Good, Quality, Dog, Food] | [I, have, bought, several, of, the, Vitality, ... |
| 1 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... | [Not, as, Advertised] | [Product, arrived, labeled, as, Jumbo, Salted,... |
| 2 | "Delight" says it all | This is a confection that has been around a fe... | [``, Delight, '', says, it, all] | [This, is, a, confection, that, has, been, aro... |
| 3 | Cough Medicine | If you are looking for the secret ingredient i... | [Cough, Medicine] | [If, you, are, looking, for, the, secret, ingr... |
| 4 | Great taffy | Great taffy at a great price.  There was a wid... | [Great, taffy] | [Great, taffy, at, a, great, price, ., There, ... |
| 5 | Nice Taffy | I got a wild hair for taffy and ordered this f... | [Nice, Taffy] | [I, got, a, wild, hair, for, taffy, and, order... |
| 6 | Great!  Just as good as the expensive brands! | This saltwater taffy had great flavors and was... | [Great, !, Just, as, good, as, the, expensive,... | [This, saltwater, taffy, had, great, flavors, ... |
| 7 | Wonderful, tasty taffy | This taffy is so good.  It is very soft and ch... | [Wonderful, ,, tasty, taffy] | [This, taffy, is, so, good, ., It, is, very, s... |
| 8 | Yay Barley | Right now I'm mostly just sprouting this so my... | [Yay, Barley] | [Right, now, I, 'm, mostly, just, sprouting, t... |
| 9 | Healthy Dog Food | This is a very healthy dog food. Good for thei... | [Healthy, Dog, Food] | [This, is, a, very, healthy, dog, food, ., Goo... |
