# Intro to Data Science week 1, exercise 2

N.B. Exercises 2.1 and 2.2 don't produce any inputs or outputs, so let's just initialize the helpers.

In [18]:
import pandas as pd

def updateCol(df, colName, fn):
    """
    Takes DataFrame, column name and function. Applies function to column and returns a new copy of updated DataFrame.
    """
    
    newval = df[colName].apply(fn)
    kwargs = {colName: newval}
    return df.assign(**kwargs)
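
As a quick sanity check, here is a minimal sketch of how the helper behaves on a tiny hand-made DataFrame. The sample rows are made up for illustration and are not part of the exercise data; `updateCol` is repeated in compact form only so that the snippet runs on its own.

```python
import pandas as pd

def updateCol(df, colName, fn):
    # Same helper as above, repeated so this snippet is self-contained.
    return df.assign(**{colName: df[colName].apply(fn)})

# Hypothetical sample data, not taken from the review file.
df = pd.DataFrame({'reviewText': ['Great Cables', 'Too SHORT'], 'overall': [5, 2]})

lowered = updateCol(df, 'reviewText', str.lower)

# The returned frame carries the transformed column; df itself is left untouched.
print(lowered['reviewText'].tolist())
print(df['reviewText'].tolist())
```

Because `df.assign` returns a new DataFrame, each step of the exercise can keep its own intermediate result (`data1`, `data2`, ...) without overwriting the previous one.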


### Exercise 2.3: 
a)
The source file holds one JSON object per line, so first we need to turn it into valid JSON (a single array) and then read it.

In [19]:
def arrayify_set_of_json(sourcefile, destfile):
    """
    Wraps a file of line-delimited JSON objects into one JSON array.
    """
    # The context manager closes both files even if an error occurs.
    with open(sourcefile, 'r') as source, open(destfile, 'w') as dest:
        dest.write('[')
        for i, line in enumerate(source):
            if i > 0:
                # Separate consecutive objects with a comma.
                dest.write(',')
            dest.write(line)
        dest.write(']')

source = 'reviews_automotive_5.json'
datafile = 'reviews_automotive_5_proper.json'

arrayify_set_of_json(source, datafile)

...and then we can read it!

In [20]:
data = pd.read_json(datafile)
print(data)

             asin   helpful  overall  \
0      B00002243X    [4, 4]        5   
1      B00002243X    [1, 1]        4   
2      B00002243X    [0, 0]        5   
3      B00002243X  [19, 19]        5   
4      B00002243X    [0, 0]        5   
5      B00002243X    [1, 1]        5   
6      B00002243X    [1, 1]        5   
7      B00002243X    [0, 0]        5   
8      B00002243X    [0, 0]        4   
9      B00002243X    [0, 0]        5   
10     B00002243Z    [0, 0]        4   
11     B00002243Z  [19, 21]        5   
12     B00002243Z    [0, 0]        5   
13     B00002243Z  [20, 21]        4   
14     B00002243Z    [1, 1]        4   
15     B00002243Z    [1, 3]        4   
16     B00008BKX5    [0, 0]        3   
17     B00008BKX5    [3, 4]        4   
18     B00008BKX5  [51, 51]        5   
19     B00008BKX5    [0, 0]        5   
20     B00008BKX5    [0, 0]        5   
21     B00008BKX5    [0, 0]        4   
22     B00008BKX5    [1, 1]        5   
23     B00008RW9U    [1, 2]        5   
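
The conversion step in a) is one way to handle a file with one JSON object per line. As a hedged alternative, the lines can also be parsed directly, without writing a second file. The sketch below uses only the standard library; `read_json_lines` is a hypothetical helper name, not something from the exercise. pandas can also read this layout directly with `pd.read_json(path, lines=True)`.

```python
import json

def read_json_lines(path):
    """Parse a file with one JSON object per line into a list of dicts."""
    records = []
    with open(path, 'r') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

The resulting list of dicts can be passed to `pd.DataFrame(records)` to continue with the rest of the exercise.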


b) Lowercase the review text

In [21]:
data1 = updateCol(data, 'reviewText', lambda text: text.lower())
print(data1['reviewText'])

0        i needed a set of jumper cables for my new car...
1        these long cables work fine for my truck, but ...
2        can't comment much on these since they have no...
3        i absolutley love amazon!!!  for the price of ...
4        i purchased the 12' feet long cable set and th...
5        these jumper cables are heavy duty, yet easy t...
6        bought these for my k2500 suburban plenty of l...
7        these are good enough to get most motorized ve...
8        the coleman cable 08665 12-feet heavy-duty tru...
9        i have an old car, its bound to need these som...
10       i seem to use jumper cables at least several t...
11       all other jumper cables are not real jumper ca...
12       i'm one of those guys who stops and helps peop...
13       so these aren't the best cables you can buy.  ...
14       it is hard to find pure copper cabled jumper c...
15       these are an insurance policy for my land rove...
16       this product serves its purpose. i use it for .

c) Remove punctuation (note: this cell strips only periods; commas, apostrophes and dashes remain)

In [22]:
data2 = updateCol(data1, 'reviewText', lambda text: text.replace('.', ''))
print(data2['reviewText'][0])

i needed a set of jumper cables for my new car and these had good reviews and were at a good price  they have been used a few times already and do what they are supposed to - no complaints therewhat i will say is that 12 feet really isn't an ideal length  sure, if you pull up front bumper to front bumper they are plenty long, but a lot of times you will be beside another car or can't get really close  because of this, i would recommend something a little longer than 12'great brand - get 16' version though
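
The output above still contains commas, apostrophes and dashes, and words on either side of a period with no following space get glued together ("therewhat"). A minimal sketch of a broader cleanup, assuming ASCII punctuation is all we care about, replaces every punctuation character with a space and then collapses repeated whitespace. `strip_punctuation` is a hypothetical helper, not part of the original solution.

```python
import string

# Map every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def strip_punctuation(text):
    # Replace punctuation with spaces, then collapse runs of whitespace.
    return " ".join(text.translate(_PUNCT_TO_SPACE).split())
```

It could be plugged in as `updateCol(data1, 'reviewText', strip_punctuation)`. One trade-off of this approach is that contractions are split, so "can't" becomes "can t".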


d) Reduce words to stems (note: this computation takes some time...)

Note: nltk.word_tokenize requires extra tokenizer data that is not installed with nltk itself. If the cell below fails with an error asking you to download a resource, follow these steps:

1. import nltk
2. nltk.download()
3. choose "d" for "download"
4. type "punkt" and hit enter
5. great success!
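
The interactive steps above can also be done from code. The sketch below is one possible way to do it: it checks whether the tokenizer data is already present and downloads it only if not. `ensure_nltk_resource` is a hypothetical helper name; `nltk.data.find` and `nltk.download` are the NLTK calls it relies on.

```python
def ensure_nltk_resource(resource_path, package_name):
    """Download an NLTK data package only if it is not already installed."""
    # Imported lazily so that defining this helper does not itself require nltk.
    import nltk
    try:
        # Raises LookupError when the resource is missing.
        nltk.data.find(resource_path)
    except LookupError:
        nltk.download(package_name)

# Usage (needs network access the first time it runs):
# ensure_nltk_resource('tokenizers/punkt', 'punkt')
```

If the error message names a different package than "punkt", pass whatever name the message reports.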

In [24]:
import nltk
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')

def stemSentence(sentence):
    # Tokenize the sentence, stem each token, and rejoin with single spaces.
    return " ".join(stemmer.stem(token) for token in nltk.word_tokenize(sentence))

data3 = updateCol(data2, 'reviewText', stemSentence)
print(data3)

             asin   helpful  overall  \
0      B00002243X    [4, 4]        5   
1      B00002243X    [1, 1]        4   
2      B00002243X    [0, 0]        5   
3      B00002243X  [19, 19]        5   
4      B00002243X    [0, 0]        5   
5      B00002243X    [1, 1]        5   
6      B00002243X    [1, 1]        5   
7      B00002243X    [0, 0]        5   
8      B00002243X    [0, 0]        4   
9      B00002243X    [0, 0]        5   
10     B00002243Z    [0, 0]        4   
11     B00002243Z  [19, 21]        5   
12     B00002243Z    [0, 0]        5   
13     B00002243Z  [20, 21]        4   
14     B00002243Z    [1, 1]        4   
15     B00002243Z    [1, 3]        4   
16     B00008BKX5    [0, 0]        3   
17     B00008BKX5    [3, 4]        4   
18     B00008BKX5  [51, 51]        5   
19     B00008BKX5    [0, 0]        5   
20     B00008BKX5    [0, 0]        5   
21     B00008BKX5    [0, 0]        4   
22     B00008BKX5    [1, 1]        5   
23     B00008RW9U    [1, 2]        5   
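
Since the stemming step takes a while, one thing that may help is caching: reviews reuse the same words many times, so each distinct word only needs to be stemmed once. The sketch below is an assumption about where time could be saved, not a measured result, and `make_cached_stem_sentence` is a hypothetical helper. It accepts any word-level stem function and any tokenizer, so it runs without nltk.

```python
from functools import lru_cache

def make_cached_stem_sentence(stem_word, tokenize=str.split):
    """Build a sentence stemmer that memoizes the per-word stem function."""
    # Each distinct word is passed to stem_word only once; repeats hit the cache.
    cached_stem = lru_cache(maxsize=None)(stem_word)

    def stem_sentence(sentence):
        return " ".join(cached_stem(token) for token in tokenize(sentence))

    return stem_sentence

# With the objects from the cell above, this might be used as:
# stemSentence = make_cached_stem_sentence(stemmer.stem, nltk.word_tokenize)
```

Tokenization is not cached here, so if most of the time is spent in `nltk.word_tokenize` the gain will be small.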


e) Filter and save the data (reviews rated exactly 3 fall into neither group)

In [38]:
positiveReviews = data3[data3['overall'] > 3]
negativeReviews = data3[data3['overall'] < 3]

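def write_reviews_to_file(filename, reviewSeries):
    # Write one review per line; the context manager closes the file for us.
    with open(filename, 'w') as file:
        # Series.iteritems() was removed in pandas 2.0; items() replaces it.
        for index, value in reviewSeries.items():
            file.write(value)
            file.write("\n")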

write_reviews_to_file('ex2_positive_reviews.txt', positiveReviews['reviewText'])
write_reviews_to_file('ex2_negative_reviews.txt', negativeReviews['reviewText'])