<a href="https://colab.research.google.com/github/mattfredericksen/CSCE-4205-ML-Project/blob/main/feature_extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

In [29]:
import os
import json
from time import time
from contextlib import suppress

import gzip
from urllib.request import urlopen
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [30]:
# Progress Bar
# https://colab.research.google.com/drive/1I2o3Ie34vJ3G4M6eE54-OyrmzJNBwhOp#scrollTo=EbF9oPhzOqZj

from IPython.display import HTML, display

def progress(value, max=100):
    return HTML(f"""
        <progress
            value='{value}'
            max='{max}'
            style='width: 50%'
        >
            {value}
        </progress>
    """)

**Dataset links**
- [Books (\~30 million, too large!)](http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Books_5.json.gz)  
- [Clothing, Shoes, and Jewelry (\~11 million)](http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Clothing_Shoes_and_Jewelry_5.json.gz)    
- [Electronics (\~7 million)](http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Electronics_5.json.gz)  
- [Home and Kitchen (\~7 million)](http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Home_and_Kitchen_5.json.gz)  
- [Movies and TV (\~3.5 million)](http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Movies_and_TV_5.json.gz)  

In [2]:
!wget http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Movies_and_TV_5.json.gz

--2020-11-09 21:31:07--  http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Movies_and_TV_5.json.gz
Resolving deepyeti.ucsd.edu (deepyeti.ucsd.edu)... 169.228.63.50
Connecting to deepyeti.ucsd.edu (deepyeti.ucsd.edu)|169.228.63.50|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 791322468 (755M) [application/octet-stream]
Saving to: ‘Movies_and_TV_5.json.gz.1’


2020-11-09 21:31:58 (14.8 MB/s) - ‘Movies_and_TV_5.json.gz.1’ saved [791322468/791322468]



# Feature Engineering
**List of features**  
`reviewerID` - ID of the reviewer, e.g. A2SUAM1J3GNN3B  
`asin` - ID of the product, e.g. 0000013714  
`reviewerName` - name of the reviewer  
`vote` - helpful votes of the review  
`style` - a dictionary of the product metadata, e.g., "Format" is "Hardcover"  
`reviewText` - text of the review  
`overall` - rating of the product  
`summary` - summary of the review  
`unixReviewTime` - time of the review (unix time)  
`reviewTime` - time of the review (raw)  
`image` - images that users post after they have received the product  

In [3]:
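def drop_features(d):
  # Keep only the star rating and the review text; raises KeyError if either is missing.
  kept_features = ("overall", "reviewText")
  return {f: d[f] for f in kept_features}

data = []

with gzip.open("Movies_and_TV_5.json.gz") as file:
  for line in file:
    # Skip reviews that lack a kept feature (e.g. no reviewText).
    with suppress(KeyError):
      data.append(drop_features(json.loads(line.strip())))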

review_len = len(data)
print(f'{review_len} reviews loaded.')

3408438 reviews loaded.
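
The loading cell above relies on `suppress(KeyError)` to silently skip any review that lacks one of the kept features. Here is a minimal sketch of that behavior on two made-up JSON lines (the records and the `KEPT_FEATURES` name are illustrative assumptions, not taken from the dataset):

```python
import json
from contextlib import suppress

KEPT_FEATURES = ("overall", "reviewText")

def drop_features(record):
    # Raises KeyError if any kept feature is missing from the record.
    return {name: record[name] for name in KEPT_FEATURES}

# Made-up records for illustration only; not real dataset rows.
lines = [
    '{"overall": 5.0, "reviewText": "Great film", "asin": "X1"}',
    '{"overall": 3.0, "asin": "X2"}',  # no reviewText, so it is skipped
]

kept = []
for line in lines:
    # A KeyError raised inside this block is swallowed and the loop continues.
    with suppress(KeyError):
        kept.append(drop_features(json.loads(line)))

print(kept)
```

Only the first record survives, with `asin` dropped. One consequence worth keeping in mind: the count printed by the loading cell is the number of reviews that have both fields, which may be smaller than the number of lines in the file.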


In [11]:
review_data = pd.DataFrame.from_dict(data)
# del data
review_data

https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

https://scikit-learn.org/stable/modules/feature_extraction.html#customizing-the-vectorizer-classes

In [39]:
class LemmaTokenizer:
    def __init__(self, pb):
        self.wnl = WordNetLemmatizer()
        self.call_count = 0
        self.pb = pb

    def __call__(self, doc):
        self.call_count += 1
        if self.call_count % 1024 == 0:
            # temporarily cutting back review_len
            self.pb.update(progress(self.call_count, review_len//10))
        return tuple(self.wnl.lemmatize(t) for t in word_tokenize(doc))

In [34]:
vectorizer = TfidfVectorizer(stop_words='english')
start = time()
vectorizer.fit_transform(review_data['reviewText'][:review_len//10])
print(f'execution time: {((time() - start) / 60):.1f} minutes')
print(f'{len(vectorizer.vocabulary_)} features (unique words)')

execution time: 0.4 minutes
179115 features (unique words)
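
To make the weights behind those features more concrete, here is a hand-rolled sketch of TF-IDF on three tiny made-up documents. It assumes the scheme described in the scikit-learn docs for the default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization); it is an illustration, not the library's implementation, and it skips stop-word removal and tokenization.

```python
import math
from collections import Counter

# Made-up, pre-tokenized documents for illustration only.
docs = [
    ["great", "movie", "great", "cast"],
    ["boring", "movie"],
    ["great", "soundtrack"],
]

n_docs = len(docs)
# Document frequency: how many documents contain each term.
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency (assumed default scheme).
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    counts = Counter(doc)
    # Weight = raw term count * IDF.
    weights = {term: count * idf(term) for term, count in counts.items()}
    # Scale the document vector to unit L2 length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vec = tfidf(docs[0])
for term, weight in sorted(vec.items()):
    print(f"{term}: {weight:.3f}")
```

Terms that appear in fewer documents ("cast") get a higher IDF than common ones ("movie"), while repeated terms ("great") gain weight through their count.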


In [40]:
pb = display(progress(0, review_len//10), display_id=True)

vectorizer = TfidfVectorizer(stop_words='english', tokenizer=LemmaTokenizer(pb))
start = time()
# temporarily cutting back review_len
vectorizer.fit_transform(review_data['reviewText'][:review_len//10])
print(f'execution time: {((time() - start) / 60):.1f} minutes')
print(f'{len(vectorizer.vocabulary_)} features (unique words)')



execution time: 8.9 minutes
270275 features (unique words)


# Training/Testing Split
In the next cell, we randomly split the data into training and testing sets. At a later time, we may want to switch to [stratified K-fold cross validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold), which performs cross validation while keeping the proportion of each class (star rating) roughly the same in every fold.

In [None]:
targets = review_data[['overall']]
features = review_data[['reviewText']]
train_features, test_features, train_targets, test_targets = train_test_split(features, targets, test_size=0.2)
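
The idea behind stratification can be sketched without any library: group the sample indices by class, then deal each class out round-robin across the folds. This is a hand-rolled illustration on made-up ratings, not scikit-learn's `StratifiedKFold`; the `stratified_folds` helper is a hypothetical name.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, n_folds, seed=0):
    """Assign each sample index to a fold, spreading every class evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal this class's samples out round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds

# Made-up star ratings for illustration: heavily skewed toward 5 stars.
ratings = [5.0] * 60 + [4.0] * 25 + [1.0] * 15
folds = stratified_folds(ratings, n_folds=5)

for fold in folds:
    print(sorted(Counter(ratings[i] for i in fold).items()))
```

Every fold ends up with the same 12/5/3 mix of 5-, 4-, and 1-star reviews, matching the 60/25/15 proportions of the full set. A purely random split gives no such guarantee, which matters most for the rarest ratings.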

https://scikit-learn.org/stable/modules/naive_bayes.html#multinomial-naive-bayes