<a href="https://colab.research.google.com/github/mattfredericksen/CSCE-4205-ML-Project/blob/dataShrinkage/feature_extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

In [None]:
import os
import json
import re
from time import time
from contextlib import suppress

import gzip
from urllib.request import urlopen
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
# Progress Bar
# https://colab.research.google.com/drive/1I2o3Ie34vJ3G4M6eE54-OyrmzJNBwhOp#scrollTo=EbF9oPhzOqZj

from IPython.display import HTML, display

def progress(value, max=100):
    return HTML(f"""
        <progress
            value='{value}'
            max='{max}'
            style='width: 50%'
        >
            {value}
        </progress>
    """)

**Dataset links**
- [Books (\~30 million, too large!)](http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Books_5.json.gz)  
- [Clothing, Shoes, and Jewelry (\~11 million)](http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Clothing_Shoes_and_Jewelry_5.json.gz)    
- [Electronics (\~7 million)](http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Electronics_5.json.gz)  
- [Home and Kitchen (\~7 million)](http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Home_and_Kitchen_5.json.gz)  
- [Movies and TV (\~3.5 million)](http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Movies_and_TV_5.json.gz)  

In [None]:
!wget http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Movies_and_TV_5.json.gz

--2020-11-15 01:47:05--  http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Movies_and_TV_5.json.gz
Resolving deepyeti.ucsd.edu (deepyeti.ucsd.edu)... 169.228.63.50
Connecting to deepyeti.ucsd.edu (deepyeti.ucsd.edu)|169.228.63.50|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 791322468 (755M) [application/octet-stream]
Saving to: ‘Movies_and_TV_5.json.gz’


2020-11-15 01:47:48 (18.0 MB/s) - ‘Movies_and_TV_5.json.gz’ saved [791322468/791322468]



# Feature Engineering
**List of features**  
`reviewerID` - ID of the reviewer, e.g. A2SUAM1J3GNN3B  
`asin` - ID of the product, e.g. 0000013714  
`reviewerName` - name of the reviewer  
`vote` - helpful votes of the review  
`style` - a dictionary of the product metadata, e.g., "Format" is "Hardcover"  
`reviewText` - text of the review  
`overall` - rating of the product  
`summary` - summary of the review  
`unixReviewTime` - time of the review (unix time)  
`reviewTime` - time of the review (raw)  
`image` - images that users post after they have received the product  
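
To make the field list concrete, here is a minimal sketch of parsing one record. Only the `reviewerID` and `asin` values come from the list above; every other value is invented for illustration, and the assumption that optional fields such as `vote` can be missing is ours.

```python
import json
from datetime import datetime, timezone

# Hypothetical record: only reviewerID and asin are taken from the field
# list; the remaining values are made up for this example.
raw_line = json.dumps({
    "reviewerID": "A2SUAM1J3GNN3B",
    "asin": "0000013714",
    "overall": 5.0,
    "reviewText": "Great film, would watch again.",
    "summary": "Great film",
    "unixReviewTime": 1252800000,
})

record = json.loads(raw_line)

# unixReviewTime counts seconds since 1970-01-01 UTC.
review_date = datetime.fromtimestamp(
    record["unixReviewTime"], tz=timezone.utc
).date()

# Assumption: optional fields may be absent, so .get() avoids a KeyError.
votes = record.get("vote")

print(review_date.isoformat(), record["overall"], votes)
```

Only `overall` and `reviewText` are used in the rest of this notebook, so the other fields are shown purely for orientation.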

In [None]:
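def drop_features(d):
  """Keep only the rating and the review text; raises KeyError if either is missing."""
  kept_features = ("overall", "reviewText")
  return {f: d[f] for f in kept_features}

data = []

# Each line of the gzipped file is one JSON-encoded review.
with gzip.open("Movies_and_TV_5.json.gz") as file:
  for line in file:
    # Skip reviews that lack a rating or review text.
    with suppress(KeyError):
      data.append(drop_features(json.loads(line.strip())))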

print(f'{len(data)} reviews loaded.')

3408438 reviews loaded.
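
The loading pattern above can be tried on a tiny file first. The following sketch writes three made-up records to a temporary gzipped JSON-lines file and reads them back the same way. The file name and record contents are assumptions for illustration only.

```python
import gzip
import json
import os
import tempfile
from contextlib import suppress

# Made-up records; the second one has no "reviewText" key.
records = [
    {"overall": 5.0, "reviewText": "great movie", "asin": "A"},
    {"overall": 3.0, "asin": "B"},
    {"overall": 1.0, "reviewText": "not for me", "asin": "C"},
]

# Write one JSON object per line, gzip-compressed.
path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")

kept = []
with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        row = json.loads(line)
        # The KeyError raised by the incomplete record is swallowed,
        # so that record is skipped instead of stopping the loop.
        with suppress(KeyError):
            kept.append({k: row[k] for k in ("overall", "reviewText")})

print(f"{len(kept)} of {len(records)} records kept")
```

Silently skipping records keeps the loop simple, but it also hides how many were dropped. Comparing the two counts, as in the last line, is a cheap sanity check.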


In [None]:
review_data = pd.DataFrame.from_dict(data)
review_len = len(data)
# del data
review_data

Unnamed: 0,overall,reviewText
0,5.0,So sorry I didn't purchase this years ago when...
1,5.0,Believe me when I tell you that you will recei...
2,5.0,"I have seen X live many times, both in the ear..."
3,5.0,"I was so excited for this! Finally, a live co..."
4,5.0,X is one of the best punk bands ever. I don't ...
...,...,...
3408433,4.0,The singing parts are very good as expected fr...
3408434,5.0,This recording of the 2015 production by the M...
3408435,4.0,I do not wish to write a review about this rel...
3408436,5.0,It was a gift.


https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

https://scikit-learn.org/stable/modules/feature_extraction.html#customizing-the-vectorizer-classes
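
As a small illustration of the customization described at the second link, any callable that maps a document string to a sequence of tokens can be passed as `tokenizer`. The sketch below uses a hypothetical `WhitespaceTokenizer` and three made-up documents. It is not the tokenizer used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


class WhitespaceTokenizer:
    """Hypothetical tokenizer: split each document on whitespace."""

    def __call__(self, doc):
        return doc.split()


# Made-up documents for illustration.
docs = ["Great movie", "great plot", "bad movie"]

# The vectorizer lowercases each document before calling the tokenizer.
vectorizer = TfidfVectorizer(tokenizer=WhitespaceTokenizer())
matrix = vectorizer.fit_transform(docs)

# vocabulary_ maps each token to its column index in the matrix.
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(matrix.shape)  # (number of documents, number of unique tokens)
```

Recent scikit-learn versions may warn that `token_pattern` is ignored when a tokenizer is supplied. The warning does not affect the result.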

In [None]:
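class LemmaTokenizer:
    """Callable tokenizer for TfidfVectorizer that lemmatizes and reports progress."""

    def __init__(self, pb):
        self.wnl = WordNetLemmatizer()
        # split on any run of non-letter characters
        self.word_re = re.compile(r'[^A-Za-z]+')
        # cut back sequences of the same character to at most 2
        self.shorten_re = re.compile(r'(.)\1{2,}')
        self.call_count = 0
        # display handle returned by display(..., display_id=True)
        self.pb = pb
        # experimental
        self.stopwords = set(stopwords.words('english'))

    def __call__(self, doc):
        self.call_count += 1
        # refresh the progress bar every 1024 documents;
        # cut_len is a global set in the next cell
        if self.call_count % 1024 == 0:
            self.pb.update(progress(self.call_count, cut_len))
        # the split can yield empty strings (e.g. when doc ends in punctuation);
        # they are not stop words, so they pass through as tokens
        return tuple(
            self.wnl.lemmatize(self.shorten_re.sub(r'\1\1', t))
            for t in self.word_re.split(doc)
            if t not in self.stopwords
        )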
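
The two regular expressions in `LemmaTokenizer` can be checked on their own, without NLTK. This sketch applies them to a made-up review string (already lowercased, as the vectorizer would pass it) and shows why an empty-string token can appear among the features.

```python
import re

# Same patterns as in LemmaTokenizer.
word_re = re.compile(r'[^A-Za-z]+')    # any run of non-letter characters
shorten_re = re.compile(r'(.)\1{2,}')  # a character repeated 3 or more times

# Made-up review text for illustration.
text = "loooove it, sooooo goood!!!"

raw_tokens = word_re.split(text)
# The trailing "!!!" is a separator at the end of the string,
# so split() leaves an empty string as the last token.
print(raw_tokens)  # ['loooove', 'it', 'sooooo', 'goood', '']

# Collapse each run of 3+ identical characters down to 2.
shortened = [shorten_re.sub(r'\1\1', t) for t in raw_tokens]
print(shortened)  # ['loove', 'it', 'soo', 'good', '']

# One possible cleanup (an assumption, not part of the class above):
# drop empty tokens before lemmatizing.
non_empty = [t for t in shortened if t]
print(non_empty)  # ['loove', 'it', 'soo', 'good']
```

Capping repeats at two rather than one keeps legitimate double letters such as the `oo` in "good" intact, at the cost of leaving forms like "loove" uncorrected.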

In [None]:
cut_data = review_data[:review_len//10]
cut_len = len(cut_data)
cut_data

Unnamed: 0,overall,reviewText
0,5.0,So sorry I didn't purchase this years ago when...
1,5.0,Believe me when I tell you that you will recei...
2,5.0,"I have seen X live many times, both in the ear..."
3,5.0,"I was so excited for this! Finally, a live co..."
4,5.0,X is one of the best punk bands ever. I don't ...
...,...,...
340838,5.0,Need a kid friendly movie with lots of laughs ...
340839,4.0,Pretty good though improbable plot.\nGood acti...
340840,5.0,great movie
340841,5.0,10 stars......You will love it


In [None]:
vectorizer = TfidfVectorizer(stop_words='english')
start = time()
vectorizer.fit_transform(cut_data['reviewText'])
print(f'execution time: {((time() - start) / 60):.1f} minutes')
print(f'{len(vectorizer.get_feature_names())} features (unique words)')

execution time: 0.4 minutes
179115 features (unique words)


In [None]:
pb = display(progress(0, cut_len), display_id=True)

vectorizer = TfidfVectorizer(stop_words='english', tokenizer=LemmaTokenizer(pb))
start = time()
# temporarily cutting back review_len
vectorizer.fit_transform(cut_data['reviewText'])
print(f'execution time: {((time() - start) / 60):.1f} minutes')
print(f'{len(vectorizer.get_feature_names())} features (unique words)')

  'stop_words.' % sorted(inconsistent))


execution time: 2.2 minutes
157822 features (unique words)


In [None]:
print(vectorizer.get_feature_names()[:50])
print(vectorizer.get_feature_names()[10000:10050])
print(vectorizer.get_feature_names()[157800:])

['', 'aa', 'aabcu', 'aaberg', 'aac', 'aacckk', 'aaceptional', 'aack', 'aackk', 'aacs', 'aacute', 'aadams', 'aadditional', 'aado', 'aadoores', 'aae', 'aaf', 'aafes', 'aagean', 'aagghh', 'aagh', 'aah', 'aahed', 'aahh', 'aahhaahaa', 'aahhaahh', 'aahhing', 'aahing', 'aahoo', 'aahs', 'aaiirr', 'aainst', 'aak', 'aakafranz', 'aaker', 'aakroyd', 'aalbaek', 'aalberg', 'aaliyah', 'aall', 'aalong', 'aalready', 'aalto', 'aamateur', 'aamazing', 'aamazon', 'aamerican', 'aames', 'aamfbca', 'aamuuzzon']
['backprojected', 'backprojection', 'backprojections', 'backprop', 'backrgound', 'backroad', 'backroads', 'backroom', 'backround', 'backrounds', 'backrub', 'backrupt', 'backscapes', 'backscratcher', 'backscreen', 'backseat', 'backseated', 'backshooting', 'backside', 'backsight', 'backslapping', 'backslaps', 'backslid', 'backslide', 'backsliding', 'backson', 'backstab', 'backstabbed', 'backstabber', 'backstabbers', 'backstabbing', 'backstabbling', 'backstabs', 'backstage', 'backstein', 'backsteps', 'bac

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Training/Testing Split
In the next cell, we randomly split the data into training and testing sets. Later, we may want to switch to [stratified K-fold cross validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold), which performs cross validation while preserving the proportion of each class (star rating) in every fold.

In [None]:
targets = review_data[['overall']]
features = review_data[['reviewText']]
train_features, test_features, train_targets, test_targets = train_test_split(features, targets, test_size=0.2)
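
For comparison, here is a minimal sketch of what stratified K-fold would look like. The ratings are a made-up, skewed toy set rather than the review data, and the fold count and seed are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy ratings, skewed toward 5 stars (invented for illustration).
ratings = np.array([5] * 12 + [4] * 4 + [1] * 4)
texts = np.array([f"review {i}" for i in range(len(ratings))])

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

fold_counts = []
for train_idx, test_idx in skf.split(texts, ratings):
    # Count how many reviews of each rating landed in this test fold.
    values, counts = np.unique(ratings[test_idx], return_counts=True)
    fold_counts.append(dict(zip(values.tolist(), counts.tolist())))

for i, counts in enumerate(fold_counts):
    print(f"fold {i}: {counts}")
```

Because each class size here is divisible by the number of folds, every test fold receives the same 3:1:1 mix of ratings as the full toy set. A plain random split gives no such guarantee for rare ratings.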

https://scikit-learn.org/stable/modules/naive_bayes.html#multinomial-naive-bayes