In [11]:
import pandas as pd
import numpy as np
import re

import spacy
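nlp = spacy.load('en')  # on spaCy 3+ the 'en' shortcut no longer exists; load a named model such as 'en_core_web_sm' instead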

import string

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

for the MVP, the scraped reviews are stored in '/mvp_reviews'

48 strains x 104 reviews (Leafly is into multiples of 8)

In [23]:
stored_reviews = pd.read_pickle('/home/nate/ds/metis/class_work/projects/project_fletcher/mvp_reviews.pkl')

# unpack and check raw data

the reviews are stored as one large list of lists, with one inner list of review strings per strain

we'll unpack this, but first a quick peek

In [3]:
len(stored_reviews)

48
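a quick way to check the shape beyond the outer length, as a rough sketch: `toy_reviews` and `review_shape` are made-up names for illustration, and the toy data stands in for `stored_reviews` (on the real data we'd expect 48 strains, and 104 reviews per strain if the scrape was complete)

```python
# toy stand-in for stored_reviews (hypothetical data): one inner list of review strings per strain
toy_reviews = [
    ["great strain", "very relaxing", "tastes like berries"],
    ["too strong for me", "helped me sleep", "nice and mellow"],
]

def review_shape(reviews):
    """Return (number of strains, sorted list of distinct per-strain review counts)."""
    counts = sorted({len(inner) for inner in reviews})
    return len(reviews), counts

n_strains, counts = review_shape(toy_reviews)
print(n_strains, counts)
```

if the second value holds more than one number, some strains have fewer reviews than others and are worth a look before cleaning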

let's just look at the first review of the first strain...

we're mostly concerned with formatting; thankfully everything is already text because of the way we scraped it

however, we still have to deal with numbers, punctuation, capitalization, and some interesting vocabulary

In [4]:
stored_reviews[0][0]

"Friends, stoners, red-eyed countrymen, lend me your ears; for I bring unto thee a tale of the Blue Dream... T’was a calm April night, 2014 it was, and I had eagerly purchased an eighth of some pungent Blue Dream. It’s abundance of sugary trichomes, paired with the thick density of the bud was enough to bring a tear to your eye. I enthusiastically ground up the cheeba, packed a generous bowl and went to town. Eight minutes and a bowl later, I was beginning to assume that my herb wasn’t all that strong…but then it hit me like a 150-ton locomotive of euphoria. “Whoooa” was the only thing that I could say, as I looked at everything around the living room. Everything looked as if it were lagging behind by a few frames, and this cerebral adventure lasted for the first few minutes…but just when I thought that Blue Dream had shown me everything there was to experience about her, her sativa effects began to kick in. All of a sudden, I felt as if I was briskly cruising on a warm cloud, which wa

In [6]:
# let's explore a couple different ways to handle cleaning

# translation table that deletes ASCII punctuation (string.punctuation has no curly quotes like ’)
remove = str.maketrans('', '', string.punctuation)

clean_list = []
for inner in stored_reviews:
    cleaned = []

    for review in inner:
        review = review.translate(remove)  # translate is implemented in C, so it's fast, really fast
        # join with a list comprehension is slower than translate, but easy to read: drop every digit
        review = ''.join([i for i in review if not i.isdigit()])
        cleaned.append(review.lower())

    # join with a space so the last word of one review doesn't fuse with the first word of the next
    clean_list.append(' '.join(cleaned))

now we have text that's ready to see some stronger stuff

In [7]:
clean_list[0][:100]

'friends stoners redeyed countrymen lend me your ears for i bring unto thee a tale of the blue dream '
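one thing worth knowing about this approach: `string.punctuation` only covers ASCII, so typographic characters like ’ “ ” … pass straight through. here's a minimal sketch of a wider table; the choice of extra characters and the names `extra`, `wide_table`, and `sample` are assumptions for illustration, not part of the original pipeline

```python
import string

# string.punctuation is ASCII only, so typographic quotes and ellipses are not in it
ascii_table = str.maketrans('', '', string.punctuation)

# assumption: these five characters are the ones worth adding; extend the string as needed
extra = '\u2018\u2019\u201c\u201d\u2026'  # left/right single quotes, left/right double quotes, ellipsis
wide_table = str.maketrans('', '', string.punctuation + extra)

# made-up sample in the style of the review above
sample = 'T\u2019was a calm night\u2026 \u201cWhoooa\u201d, red-eyed!'

ascii_clean = sample.translate(ascii_table).lower()
wide_clean = sample.translate(wide_table).lower()

print(ascii_clean)  # the curly characters are still there
print(wide_clean)   # the curly characters are gone
```

deleting characters outright can fuse neighbours (an ellipsis with no surrounding spaces, or the hyphen in 'red-eyed'), so mapping some of them to a space instead may be the better call depending on the text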

# exploring spacy

spacy is excellent! after spending some time scouring the docs, I came across quite a few practical examples

first let's get all of the documents into a format we can clean and lemmatize

lemmatization is the process of reducing each word to its dictionary form (its lemma), so ['describe', 'described', 'describes'] all become 'describe'
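to make the idea concrete, here's a toy sketch of what a lemmatizer gives back. this is a hand-written lookup table, not how spacy works internally; `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names used only for this illustration

```python
# toy lemma table (hypothetical, hand-written); a real lemmatizer covers the whole vocabulary
TOY_LEMMAS = {
    'described': 'describe',
    'describes': 'describe',
    'describing': 'describe',
    'friends': 'friend',
    'stoners': 'stoner',
}

def toy_lemmatize(words):
    """Map each word to its lemma, leaving words that are not in the table unchanged."""
    return [TOY_LEMMAS.get(word, word) for word in words]

print(toy_lemmatize(['describe', 'described', 'describes']))
```

the payoff for modeling is a smaller vocabulary: every inflected form of a word is counted as the same term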

the cell where we call nlp does the bulk of the work

cleaning was fast, but we need something we can throw into a model

In [9]:
def noise(token):
    '''
    for each token, determine if it's noise

    i.e. if it's a stopword, punctuation, a digit, whitespace, or shorter than 3 characters
    '''
    return (
        token.is_stop
        or token.is_punct
        or token.is_digit
        or token.is_space
        or len(token) < 3
    )
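to see the filter in action without loading a spacy model, here's a small sketch. `FakeToken` is a hypothetical stand-in that carries only the attributes the filter reads, and `is_noise` restates the same checks so the example runs on its own

```python
from dataclasses import dataclass

@dataclass
class FakeToken:
    """Hypothetical stand-in for a spacy Token, with only the attributes the filter reads."""
    text: str
    is_stop: bool = False
    is_punct: bool = False
    is_digit: bool = False
    is_space: bool = False

    def __len__(self):
        return len(self.text)

def is_noise(token):
    """Same checks as the noise filter: stopword, punctuation, digit, whitespace, or under 3 characters."""
    return bool(
        token.is_stop or token.is_punct or token.is_digit
        or token.is_space or len(token) < 3
    )

tokens = [
    FakeToken('the', is_stop=True),
    FakeToken('blue'),
    FakeToken('!', is_punct=True),
    FakeToken('42', is_digit=True),
    FakeToken('ok'),  # no flags set, but shorter than 3 characters
    FakeToken('dream'),
]

kept = [t.text for t in tokens if not is_noise(t)]
print(kept)
```

the length check is the bluntest of the five: it also throws away short but meaningful words, so the cutoff of 3 is a judgment call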

In [10]:
tokenized_reviews = []

for strain in clean_list:
    tokenized_strain = nlp(strain)  # run the spacy pipeline on the whole strain; this is the slow step
    # keep the lemma of every token that isn't noise, separated by single spaces
    lemmas = [str(token.lemma_) for token in tokenized_strain if not noise(token)]
    tokenized_reviews.append(' '.join(lemmas))

In [11]:
tokenized_reviews[0][:100]

'friend stoner redey countryman lend ear bring unto thee tale blue dream t’wa calm april night eagerl'

our data is almost ready for some machine learning!

now we need to split every word into its own list entry

the result is one massive list of lists (again)

In [12]:
preprocessed = []

for review in tokenized_reviews:
    
    preprocessed.append(review.split())

In [13]:
preprocessed[0][:10]

['friend',
 'stoner',
 'redey',
 'countryman',
 'lend',
 'ear',
 'bring',
 'unto',
 'thee',
 'tale']

In [90]:
import pickle

# the with block closes the file even if pickle.dump raises
with open('mvp_preprocessed.pkl', 'wb') as pickle_out:
    pickle.dump(preprocessed, pickle_out)
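before moving on, it can be worth confirming the pickle reads back intact. a minimal round-trip sketch, assuming toy data in place of `preprocessed` and a temporary file so the real 'mvp_preprocessed.pkl' is left alone (`toy_preprocessed` and the file name are made up for illustration)

```python
import os
import pickle
import tempfile

# toy stand-in for preprocessed (hypothetical data): one token list per strain
toy_preprocessed = [['friend', 'stoner', 'tale'], ['calm', 'april', 'night']]

# write to a temporary directory so the sketch doesn't touch any real file
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_preprocessed.pkl')

    with open(path, 'wb') as pickle_out:
        pickle.dump(toy_preprocessed, pickle_out)

    with open(path, 'rb') as pickle_in:
        restored = pickle.load(pickle_in)

# True when the nested lists survived the round trip unchanged
print(restored == toy_preprocessed)
```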