In [5]:
import pandas as pd
import numpy as np

## Import data
In this notebook I import the data from the website and do some preparation before pickling two versions: one big (roughly 600,000 cases) and one small (10,000). Both contain only the combined text and the positive or negative label of the review.

In [6]:
# url = 'https://www.kaggle.com/snap/amazon-fine-food-reviews/downloads/amazon-fine-food-reviews.zip/2'
df = pd.read_csv('data/amazon-fine-food-reviews/Reviews.csv')

In [7]:
df.head(3)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...


Here I combine all the text into one variable.

In [8]:
df["text_all"] = df.Summary.str.cat(df.Text, sep = ' . ')

Confirm that they are, in fact, strings.

In [9]:
type(df.text_all[0])

str
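Checking element 0 only confirms a single row. The sketch below uses a toy frame (made-up values, not the review data) to show how `str.cat` treats a missing value and how the check can cover every row. The choice of `na_rep=""` is an assumption for illustration, not something this notebook does.

```python
import pandas as pd

# Toy stand-in for the review data (values are made up for illustration).
toy = pd.DataFrame({
    "Summary": ["Great taffy", None],
    "Text": ["Great taffy at a great price.", "Arrived stale."],
})

# Without na_rep, a missing value on either side makes the result missing.
combined = toy.Summary.str.cat(toy.Text, sep=" . ")
all_strings = combined.map(lambda value: isinstance(value, str)).all()

# With na_rep, the missing summary is replaced by an empty string instead.
combined_filled = toy.Summary.str.cat(toy.Text, sep=" . ", na_rep="")
all_strings_filled = combined_filled.map(lambda value: isinstance(value, str)).all()

print(all_strings, all_strings_filled)
```

If the first flag comes back false on the real data, some reviews are missing a summary or a body and would need handling before vectorizing the text.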

Get rid of all the variables that we will not be using in the analysis under any circumstances. I might still use the Summary variable and the helpfulness numerator and denominator, so I keep those along with Score and the combined text. Everything else I am going to drop.

In [10]:
df = df.drop(['Text', 'Id', 'ProductId', 'UserId', 'ProfileName', 'Time'], axis=1)

Create a small function that sets the threshold for when a review is considered positive or negative. 

In [11]:
def positive_threshold(df, threshold):
    '''
    Input:
        df: a Pandas DataFrame
        threshold: a number between 2 and 5
    Sets the threshold at which a review will be coded as
    positive or negative. For example, if set at 4 the reviews
    that gave 4 or 5 stars will be positive and 3 and below negative.
    Modifies df in place and also returns it.
    '''
    # Assigning through .loc creates the new column as float (1.0 / 0.0)
    df.loc[df.Score >= threshold, 'positive'] = 1
    df.loc[df.Score < threshold, 'positive'] = 0
    return df

In [12]:
df = positive_threshold(df, 4)

In [13]:
sum(df.positive)/len(df)

0.7806735461444549

In [14]:
df.head()

Unnamed: 0,HelpfulnessNumerator,HelpfulnessDenominator,Score,Summary,text_all,positive
0,1,1,5,Good Quality Dog Food,Good Quality Dog Food . I have bought several ...,1.0
1,0,0,1,Not as Advertised,Not as Advertised . Product arrived labeled as...,0.0
2,1,1,4,"""Delight"" says it all","""Delight"" says it all . This is a confection t...",1.0
3,3,3,2,Cough Medicine,Cough Medicine . If you are looking for the se...,0.0
4,0,0,5,Great taffy,Great taffy . Great taffy at a great price. T...,1.0
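The `positive` column displays as `1.0` and `0.0` because assigning through `.loc` to a column that does not exist yet first fills it with missing values, which forces a float dtype. A minimal sketch of a one-step alternative that yields integers directly, shown on made-up scores rather than the review data:

```python
import pandas as pd

# Toy scores standing in for the Score column (made-up values).
toy = pd.DataFrame({"Score": [5, 1, 4, 2, 3]})

# The comparison gives a boolean Series; astype(int) turns it into 1/0.
toy["positive"] = (toy.Score >= 4).astype(int)

print(toy.positive.tolist())
```

This is only an alternative formulation; the threshold function above produces the same labels once the column is converted to integers.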


Coding 4- and 5-star reviews as positive gives a fairly imbalanced data set (about 78% positive), but not so imbalanced that we need to look at models like random forest.
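With roughly 78% of reviews positive, a model that always predicts "positive" would already be right about 78% of the time, so any accuracy figure should be read against that baseline. A small sketch of the calculation on made-up labels (the ten values below are an assumption chosen to give a similar imbalance):

```python
import pandas as pd

# Toy labels with an imbalance similar to the one above (made-up data).
labels = pd.Series([1, 1, 1, 1, 0, 1, 1, 0, 1, 1], name="positive")

# Share of each class; for a 0/1 label, the mean equals the positive share.
class_share = labels.value_counts(normalize=True)
positive_share = labels.mean()

# Accuracy of always predicting the most common class.
majority_baseline = class_share.max()

print(positive_share, majority_baseline)
```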

First we are going to try to fit a model using only the text data, so from here on I keep only the text and the outcome we are trying to predict, 'positive'.
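Keeping only those two columns could look like the sketch below. It runs on a toy frame that borrows this notebook's column names; the values and the `text_only` name are made up for illustration.

```python
import pandas as pd

# Toy frame with the same column names as the notebook's df (values made up).
toy = pd.DataFrame({
    "HelpfulnessNumerator": [1, 0],
    "HelpfulnessDenominator": [1, 0],
    "Score": [5, 1],
    "Summary": ["Good food", "Not as advertised"],
    "text_all": ["Good food . Dog likes it.", "Not as advertised . Wrong item."],
    "positive": [1, 0],
})

# Select only the modelling columns; copy() gives an independent frame.
text_only = toy[["text_all", "positive"]].copy()

print(text_only.columns.tolist())
```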

In [12]:
# change the data type of positive to integer
df.positive = df.positive.astype(int)

In [13]:
df.head(3)

Unnamed: 0,HelpfulnessNumerator,HelpfulnessDenominator,Score,Summary,text_all,positive
0,1,1,5,Good Quality Dog Food,Good Quality Dog Food . I have bought several ...,1
1,0,0,1,Not as Advertised,Not as Advertised . Product arrived labeled as...,0
2,1,1,4,"""Delight"" says it all","""Delight"" says it all . This is a confection t...",1


In [14]:
type(df)

pandas.core.frame.DataFrame

In [15]:
def importdf_sample_magnitude(order_of_magnitude=None, random_state=None):
    '''Unpickles the DataFrame and returns either all of it or a random sample
    of 10**order_of_magnitude rows. order_of_magnitude defaults to None, in which
    case the function returns the entire DataFrame. Otherwise, the user enters an
    integer that sets the order of magnitude of the sample size. random_state is
    optional and makes the sample reproducible.
    
    IN: integer (order_of_magnitude), optional integer (random_state)
    OUT: DataFrame'''
    df = pd.read_pickle('df_text.pk')
    
    # Compare against None so that order_of_magnitude=0 (one row) still samples
    if order_of_magnitude is not None:
        sample_size = 10**order_of_magnitude
        df = df.sample(sample_size, random_state=random_state)
    return df

In [16]:
df.to_pickle('df_text.pk')
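Once the pickle exists, the round trip and the sampling step can be checked on a small scale. The sketch below uses a toy frame and a temporary directory. `sample_by_magnitude` is a hypothetical stand-alone variant of the helper above that takes the frame as an argument and caps the sample at the frame's length, which the original does not do.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the pickled review data (made-up values).
toy = pd.DataFrame({
    "text_all": [f"review {i}" for i in range(250)],
    "positive": [i % 2 for i in range(250)],
})


def sample_by_magnitude(frame, order_of_magnitude=None, random_state=None):
    """Hypothetical helper: return a random sample of 10**order_of_magnitude
    rows, capped at the frame's length, or the frame itself if no order is given."""
    if order_of_magnitude is None:
        return frame
    sample_size = min(10 ** order_of_magnitude, len(frame))
    return frame.sample(sample_size, random_state=random_state)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "df_text.pk")
    toy.to_pickle(path)              # write the frame to disk
    restored = pd.read_pickle(path)  # read it back

small = sample_by_magnitude(restored, order_of_magnitude=2, random_state=0)
capped = sample_by_magnitude(restored, order_of_magnitude=3, random_state=0)

print(restored.equals(toy), len(small), len(capped))
```

Without the cap, asking `DataFrame.sample` for more rows than the frame holds raises an error unless `replace=True` is passed, which is worth keeping in mind when picking an order of magnitude for the real data.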