# Sprint Challenge
## *Data Science Unit 4 Sprint 1*

After a week of Natural Language Processing, you've learned some cool new stuff: how to process text, how turn text into vectors, and how to model topics from documents. Apply your newly acquired skills to one of the most famous NLP datasets out there: [Yelp](https://www.yelp.com/dataset). As part of the job selection process, some of my friends have been asked to create analysis of this dataset, so I want to empower you to have a head start.  

The real dataset is massive (almost 8 gigs uncompressed). I've sampled the data for you to something more manageable for the Sprint Challenge. You can analyze the full dataset as a stretch goal or after the sprint challenge. As you work on the challenge, I suggest adding notes about your findings and things you want to analyze in the future.

## Challenge Objectives
Successfully complete all these objectives to earn full credit. 

**Successful completion is defined as passing all the unit tests in each objective.**  

Each unit test that you pass is 1 point. 

There are 5 total possible points in this sprint challenge. 


There are more details on each objective further down in the notebook.*
* <a href="#p1">Part 1</a>: Write a function to tokenize the yelp reviews
* <a href="#p2">Part 2</a>: Create a vector representation of those tokens
* <a href="#p3">Part 3</a>: Use your tokens in a classification model on yelp rating
* <a href="#p4">Part 4</a>: Estimate & Interpret a topic model of the Yelp reviews

____

# Before you submit your notebook you must first

1) Restart your notebook's Kernel

2) Run all cells sequentially, from top to bottom, so that cell numbers are sequential numbers (i.e. 1,2,3,4,5...)
- Easiest way to do this is to click on the **Cell** tab at the top of your notebook and select **Run All** from the drop down menu. 

3) Comment out the cell that generates a pyLDAvis visual in objective 4 (see instructions in that section). 
____



### Import Data

In [7]:
import pandas as pd

# Load reviews from URL
data_url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_4/unit1_nlp/review_sample.json'

# Import data into a DataFrame named df
# YOUR CODE HERE
df = pd.read_csv(data_url, sep='delimiter', header=None)
df.head()


  return func(*args, **kwargs)


Unnamed: 0,0
0,"{""business_id"":""nDuEqIyRc8YKS1q1fX0CZg"",""cool""..."
1,"{""business_id"":""eMYeEapscbKNqUDCx705hg"",""cool""..."
2,"{""business_id"":""6Q7-wkCPc1KF75jZLOTcMw"",""cool""..."
3,"{""business_id"":""k3zrItO4l9hwfLRwHBDc9w"",""cool""..."
4,"{""business_id"":""6hpfRwGlOzbNv7k5eP9rsQ"",""cool""..."


In [26]:
pd.options.display.max_rows
pd.set_option('display.max_rows', None)


AttributeError: module 'pandas' has no attribute 'display'

In [8]:
# Visible Testing
assert isinstance(df, pd.DataFrame), 'df is not a DataFrame. Did you import the data into df?'
assert df.shape[0] == 10000, 'DataFrame df has the wrong number of rows.'

## Part 1: Tokenize Function
<a id="#p1"></a>

Complete the function `tokenize`. Your function should
- accept one document at a time
- return a list of tokens

You are free to use any method you have learned this week.

In [47]:
# Optional: Consider using spaCy in your function. The spaCy library can be imported by running this cell.
# A pre-trained model (en_core_web_sm) has been made available to you in the CodeGrade container.
# If you DON'T need use the en_core_web_sm model, you can comment it out below.
import spacy
nlp = spacy.load('en_core_web_md')

doc = nlp(str(df[0]))

In [16]:
print(df.columns)

Int64Index([0], dtype='int64')


In [48]:
tokens = []

#this function uses nlp.pipe (a spacy method that breaks docs into tokens) 
#to build a pipeline that takes a column in and iterate through it
#to create each review into a separate document
for doc in nlp.pipe(df[0]):
    
    doc_tokens = []
    
    for token in doc: #each doc is already composed of tokens (with spacy)
        
        doc_tokens.append(token.text) #for each doc and token, append the text
    
    tokens.append(doc_tokens) #build up a token structure for each review doc
    
# save tokens to df
df['tokens'] = tokens

In [59]:
#df['tokens'][0]

In [51]:
'''Testing'''
assert isinstance(tokenize(df.sample(n=1)["text"].iloc[0]), list), "Make sure your tokenizer function accepts a single document and returns a list of tokens!"

'Testing'

## Part 2: Vector Representation
<a id="#p2"></a>
1. Create a vector representation of the reviews (i.e. create a doc-term matrix).
2. Write a fake review and query for the 10 most similar reviews, print the text of the reviews. Do you notice any patterns?
    - Given the size of the dataset, use `NearestNeighbors` model for this. 

In [76]:
from sklearn.feature_extraction.text import CountVectorizer


# list of text documents
text = ["We created a new dataset which emphasizes diversity of content, by scraping content from the Internet."," In order to preserve document quality, we used only pages which have been curated/filtered by humans—specifically, we used outbound links from Reddit which received at least 3 karma."," This can be thought of as a heuristic indicator for whether other users found the link interesting (whether educational or funny), leading to higher data quality than other similar datasets, such as CommonCrawl."]

# create the transformer
vect = CountVectorizer() #instantiate the CountVectorizer class with the vect object

# build vocab
vect.fit(df[0]) #fit the object to text

# transform text
dtm = vect.transform(df[0])


from scipy import sparse as sparse

dtm_sparse_matrix = dtm
dtm_dense_matrix = dtm_sparse_matrix.todense()
dtm_dense_matrix

# Get Word Counts for each document
dtm_new = pd.DataFrame(dtm_sparse_matrix.todense(), columns=vect.get_feature_names()) #column names are from get_feature_names()
#dtm_new

In [74]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english', max_features=5000)

dtm = tfidf.fit_transform(df[0])

dtm = pd.DataFrame(dtm.todense(), columns=tfidf.get_feature_names())

# View Feature Matrix as DataFrame
dtm.head() #prints out the tfidf score for each word; the tfidf scores gives the data a much richer information compare to the mere count of it

Unnamed: 0,00,000,00pm,07,10,100,1000,101,10pm,11,...,younger,yuck,yuk,yum,yummy,yup,zero,zone,zoo,zucchini
0,0.0,0.0,0.0,0.0,0.086046,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [117]:
# Create and fit a NearestNeighbors model named "nn"
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=10, algorithm='kd_tree') 
nn.fit(dtm)

# sample a doc from dtm to use as our query point
doc_index = 0
doc = [dtm.iloc[doc_index].values]

# Query Using kneighbors 
neigh_dist, neigh_index = nn.kneighbors(doc)

print(neigh_index)

[[   0 9889 6311 6204 8470 6899 9347 3427 2131 6711]]


In [118]:
'''Testing.'''
assert nn.__module__ == 'sklearn.neighbors._unsupervised', ' nn is not a NearestNeighbors instance.'
assert nn.n_neighbors == 10, 'nn has the wrong value for n_neighbors'

In [119]:
# Create a fake review and find the 10 most similar reviews

fake_reviews = []

for ind in neigh_index:
    fake_review.append(df['tokens'][ind])

print(*fake_review, sep = ',')



0       [{, ", business_id":"nDuEqIyRc8YKS1q1fX0CZg","...
9889    [{, ", business_id":"af7Vo1LUL7wgWol1axTYlA","...
6311    [{, ", business_id":"gTw6PENNGl68ZPUpYWP50A","...
6204    [{, ", business_id":"W2D2eDrP9hGXWYli-4QJdg","...
8470    [{, ", business_id":"ScBVPrkJFR4seiRQJqwGbA","...
6899    [{, ", business_id":"jGe5sKmpIStORtVbsd9yyQ","...
9347    [{, ", business_id":"YdRSx, -, i6uidsAO04GRMQV...
3427    [{, ", business_id":"CVZNI8Ei4feCAhZpk1KTwQ","...
2131    [{, ", business_id":"YfPHmToBq2IPJPhfDwqeeg","...
6711    [{, ", business_id":"oBEFhUe7yEH1PK25bImCWA","...
Name: tokens, dtype: object,0       [{, ", business_id":"nDuEqIyRc8YKS1q1fX0CZg","...
9889    [{, ", business_id":"af7Vo1LUL7wgWol1axTYlA","...
6311    [{, ", business_id":"gTw6PENNGl68ZPUpYWP50A","...
6204    [{, ", business_id":"W2D2eDrP9hGXWYli-4QJdg","...
8470    [{, ", business_id":"ScBVPrkJFR4seiRQJqwGbA","...
6899    [{, ", business_id":"jGe5sKmpIStORtVbsd9yyQ","...
9347    [{, ", business_id":"YdRSx, -, i6uid

In [123]:
fake_review = ['TERRIBLE TERRIBLE EXPERIENCE!']

new = vect.transform(fake_review)

nn.kneighbors(new.todense())

matches = df.text[[0, 9889, 6311, 6204, 8470, 6899, 9347, 3427, 2131, 6711]]

print(matches)

ValueError: query data dimension must match training data dimension

## Part 3: Classification
<a id="#p3"></a>
Your goal in this section will be to predict `stars` from the review dataset. 

1. Create a pipeline object with a sklearn `CountVectorizer` or `TfidfVector` and any sklearn classifier.
    - Use that pipeline to train a model to predict the `stars` feature (i.e. the labels). 
    - Use that Pipeline to predict a star rating for your fake review from Part 2. 



2. Create a parameter dict including `one parameter for the vectorizer` and `one parameter for the model`. 
    - Include 2 possible values for each parameter
    - **Use `n_jobs` = 1** 
    - Due to limited computational resources on CodeGrader `DO NOT INCLUDE ADDITIONAL PARAMETERS OR VALUES PLEASE.`
    
    
3. Train the entire pipeline with a GridSearch
    - Name your GridSearch object as `gs`

In [108]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Name the gridsearch instance "gs"


In [114]:
# save our model input data to X
X = data.data

# save our targets/labels to Y 
Y = data.target

# clean our docs 
X_clean = [clean_data(text) for text in data.data] #for each text string, we'll clean it

# Create Pipeline Components

# create vectorizer
tfidf = TfidfVectorizer(stop_words="english", tokenizer=None) # instantiate the TfidfVectorizer; data transformer 

# create classifier
rfc = RandomForestClassifier(random_state=42) # instantiate the Classifer; estimator 

AttributeError: 'Series' object has no attribute 'data'

In [None]:
# create a hyper-parameter dictionary for BOTH our vectorizer and our ML model 
# here we will determine which tfidf parameter values lead to the best performing model
parameters = {
    'vect__max_df': ( 0.75, 1.0),
    'vect__max_features': (500, 1000),
    'clf__n_estimators':(10, 100),
    'clf__max_depth':(15, 20)
}

# Instantiate a GridSearchCV object
gs = GridSearchCV(pipe, param_grid=parameters, n_jobs=1, cv=3, verbose=1)
# Note: For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. For example with n_jobs=-2, all CPUs but one are used.

gs.fit(X_clean, Y)
###END SOLUTION

In [None]:
# Visible Testing
prediction = gs.predict(["I wish dogs knew how to speak English."])[0]
assert prediction in df.stars.values, 'You gs object should be able to accept raw text within a list. Did you include a vectorizer in your pipeline?'

## Part 4: Topic Modeling

Let's find out what those yelp reviews are saying! :D

1. Estimate a LDA topic model of the review text
    - Set num_topics to `5`
    - Name your LDA model `lda`
2. Create 1-2 visualizations of the results
    - You can use the most important 3 words of a topic in relevant visualizations. Refer to yesterday's notebook to extract. 
3. In markdown, write 1-2 paragraphs of analysis on the results of your topic model

When you instantiate your LDA model, it should look like this: 

```python
lda = LdaModel(corpus=corpus,
               id2word=id2word,
               random_state=723812,
               num_topics = num_topics,
               passes=1
              )

```

__*Note*__: You can pass the DataFrame column of text reviews to gensim. You do not have to use a generator.

## Note about  pyLDAvis

**pyLDAvis** is the Topic modeling package that we used in class to visualize the topics that LDA generates for us.

You are welcomed to use pyLDAvis if you'd like for your visualization. However, **you MUST comment out the code that imports the package and the cell that generates the visualization before you submit your notebook to CodeGrade.** 

Although you should leave the print out of the visualization for graders to see (i.e. comment out the cell after you run it to create the viz). 

In [130]:
import re
import numpy as np
import pandas as pd
from pprint import pprint

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

from sklearn.datasets import fetch_20newsgroups
from pandarallel import pandarallel

import spacy
spacy.util.fix_random_seed(0)

import pyLDAvis
import pyLDAvis.gensim_models 
import pyLDAvis.gensim_models as gensimvis
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

### 1. Estimate a LDA topic model of the review tex

In [136]:
# Remember to read the LDA docs for more information on the various class attirbutes and methods available to you
# in the LDA model: https://radimrehurek.com/gensim/models/ldamodel.html

# don't change this value 
num_topics = 5

df['lemmas'] = df[0].apply(lambda x: [token.lemma_ for token in nlp(x) if (token.is_stop != True) and (token.is_punct != True)])

def filter_lemmas(lemmas):
   
    #simple list comprehension to filter out any lemmas that are 2 characters or smaller
    return [lemma for lemma in lemmas if len(lemma) > 2]

df["filtered_lemmas"] = df["lemmas"].apply(filter_lemmas)

df["filtered_lemmas"]

0       [business_id":"ndueqiyrc8yks1q1fx0czg","cool":...
1       [business_id":"eMYeEapscbKNqUDCx705hg","cool":...
2       [business_id":"6Q7, wkCPc1KF75jZLOTcMw","cool"...
3       [business_id":"k3zrItO4l9hwfLRwHBDc9w","cool":...
4       [business_id":"6hpfrwglozbnv7k5ep9rsq","cool":...
5       [business_id":"Db3CfZWrtG33UZSs8Tdlsg","cool":...
6       [business_id":"gjhmeq2nvh27tz8lqbd3eq","cool":...
7       [business_id":"Yt5gK4E9NqVa14WNiQdBlQ","cool":...
8       [business_id":"c7wsc8sbuclyzkrezx9dga","cool":...
9       [business_id":"NSifXpsCRvnsBRqrHF9CJA","cool":...
10      [business_id":"WbMZHFOOzjmQybWyHdfDxA","cool":...
11      [business_id":"el4fc8jcawuvgw_0eicbaq","cool":...
12      [business_id":"SVGApDPNdpFlEjwRQThCxA","cool":...
13      [business_id":"n1awip49a5roronnbtm0ow","cool":...
14      [business_id":"XKQP, tkvlhazoek0ckez5w","cool"...
15      [business_id":"id6pB8fqOULNOH_nKO9bhQ","cool":...
16      [business_id":"9m_wg9xwjdiqap2, rpf4lw","cool"...
17      [busin

In [140]:
tokens = []

# use tokenize function you created earlier to create tokens 
for doc in nlp.pipe(df[0]):
    
    doc_tokens = []
    
    for token in doc: #each doc is already composed of tokens (with spacy)
        
        doc_tokens.append(token.text) #for each doc and token, append the text
    
    tokens.append(doc_tokens) #build up a token structure for each review doc
    
# save tokens to df
df['filtered_lemmas'] = tokens

In [141]:
# create a id2word object (hint: use corpora.Dictionary)

id2word = corpora.Dictionary(df['filtered_lemmas'] ) #give our lemmas column to gensim to make a dictionary


In [142]:
# create a corpus object (hint: id2word.doc2bow)

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in df['filtered_lemmas']] #list comprehension to apply to all docs

# corpus stores (token id, token count) for each doc in the corpus
doc_id = 5
corpus[doc_id]

[(0, 4),
 (2, 2),
 (6, 3),
 (7, 1),
 (8, 5),
 (20, 2),
 (22, 2),
 (29, 2),
 (30, 1),
 (32, 2),
 (41, 3),
 (70, 1),
 (72, 1),
 (76, 1),
 (82, 1),
 (83, 1),
 (93, 2),
 (96, 2),
 (98, 3),
 (99, 1),
 (121, 1),
 (126, 1),
 (147, 1),
 (183, 1),
 (185, 1),
 (201, 1),
 (214, 1),
 (236, 1),
 (239, 1),
 (254, 1),
 (268, 1),
 (280, 1),
 (299, 1),
 (300, 1),
 (301, 1),
 (302, 1),
 (303, 1),
 (304, 1),
 (305, 1),
 (306, 1),
 (307, 1),
 (308, 1),
 (309, 1),
 (310, 1),
 (311, 1),
 (312, 1),
 (313, 1),
 (314, 1),
 (315, 1),
 (316, 1),
 (317, 1),
 (318, 1),
 (319, 1),
 (320, 1),
 (321, 1),
 (322, 1),
 (323, 1)]

In [145]:
# instantiate an lda model
lda = gensim.models.ldamulticore.LdaMulticore(corpus=corpus,
                                                        id2word=id2word,
                                                        num_topics=num_topics, 
                                                        chunksize=100,
                                                        passes=10,# runtime related parameter
                                                        per_word_topics=True,
                                                        workers=10, # runtime related parameter
                                                        random_state=1234, 
                                                        iterations=20) # runtime related parameter

#### Testing

In [146]:
# Visible Testing
assert lda.get_topics().shape[0] == 5, 'Did your model complete its training? Did you set num_topics to 5?'

#### 2. Create 1-2 visualizations of the results

In [1]:
import seaborn as sns
import matplotlib.pyplot as plt
import pyLDAvis.gensim_models


# Use pyLDAvis (or a ploting tool of your choice) to visualize your results 
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda, corpus, id2word, mds='pcoa', sort_topics=True)
vis


ModuleNotFoundError: No module named 'pyLDAvis.gensim_models'

#### 3. In markdown, write 1-2 paragraphs of analysis on the results of your topic model

Unfortunately I am unable to resolve the problem "modulenotfounderror: no module named 'pyLDAvis.gensim_models" to visualize my topic model :(