
## Autograded Notebook (Canvas & CodeGrade)

This notebook will be automatically graded. It is designed to test your answers and award points for correct ones. Follow the instructions for each task carefully.

### Instructions

- **Download** this notebook as you would any other ipynb file
- **Upload** to Google Colab or work locally (if you have that set up)
- **Delete** `raise NotImplementedError()`
- **Write** your code in the `# YOUR CODE HERE` space
- **Execute** the Test cells that contain assert statements; these help you check your work (other cells contain hidden tests that will be checked when you submit through Canvas)
- **Save** your notebook when you are finished
- **Download** as an ipynb file (if working in Colab)
- **Upload** your complete notebook to Canvas (there will be additional instructions in Slack and/or Canvas)



# Sprint Challenge
## *Data Science Unit 4 Sprint 1*

After a week of Natural Language Processing, you've learned some cool new stuff: how to process text, how to turn text into vectors, and how to model topics from documents. Apply your newly acquired skills to one of the most famous NLP datasets out there: [Yelp](https://www.yelp.com/dataset). As part of the job selection process, some of my friends have been asked to create an analysis of this dataset, so I want to give you a head start.

The real dataset is massive (almost 8 gigs uncompressed). I've sampled the data for you to something more manageable for the Sprint Challenge. You can analyze the full dataset as a stretch goal or after the sprint challenge. As you work on the challenge, I suggest adding notes about your findings and things you want to analyze in the future.

## Challenge Objectives
Successfully complete all these objectives to earn full credit. 

**Successful completion is defined as passing all the unit tests in each objective.**  

Each unit test that you pass is 1 point. 

There are 5 total possible points in this sprint challenge. 


There are more details on each objective further down in the notebook.
* <a href="#p1">Part 1</a>: Write a function to tokenize the yelp reviews
* <a href="#p2">Part 2</a>: Create a vector representation of those tokens
* <a href="#p3">Part 3</a>: Use your tokens in a classification model on yelp rating
* <a href="#p4">Part 4</a>: Estimate & Interpret a topic model of the Yelp reviews

____

# Before you submit your notebook you must first

1) Restart your notebook's Kernel

2) Run all cells sequentially, from top to bottom, so that cell numbers are sequential numbers (i.e. 1,2,3,4,5...)
- Easiest way to do this is to click on the **Cell** tab at the top of your notebook and select **Run All** from the drop down menu. 

3) Comment out the cell that generates a pyLDAvis visual in objective 4 (see instructions in that section). 
____



### Import Data

In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [None]:
import pandas as pd
pd.set_option("display.max_columns", None)

# Load reviews from URL
data_url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_4/unit1_nlp/review_sample.json'

# Import data into a DataFrame named df
df = pd.read_json(data_url, lines=True)

In [None]:
# Visible Testing
assert isinstance(df, pd.DataFrame), 'df is not a DataFrame. Did you import the data into df?'
assert df.shape[0] == 10000, 'DataFrame df has the wrong number of rows.'

## Part 1: Tokenize Function
<a id="#p1"></a>

Complete the function `tokenize`. Your function should
- accept one document at a time
- return a list of tokens

You are free to use any method you have learned this week.
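
As a rough illustration of the contract described above (one document in, a list of tokens out), here is a minimal regex-based sketch that does not rely on spaCy. The function name `simple_tokenize` and the tiny stop-word set are assumptions made up for this example; they are not part of the challenge, and a spaCy-based tokenizer with lemmatization will usually give better tokens.

```python
import re

# Hypothetical, deliberately tiny stop-word set used only for this sketch
EXAMPLE_STOP_WORDS = {"the", "a", "an", "and", "is", "was", "to", "of"}


def simple_tokenize(doc):
    """Lowercase one document, keep alphabetic runs, and drop stop words."""
    tokens = re.findall(r"[a-z]+", doc.lower())
    return [tok for tok in tokens if tok not in EXAMPLE_STOP_WORDS]


print(simple_tokenize("The food was GREAT, and the service is fast!"))
```

Because the function takes a single string and returns a list, it can be passed directly as the `tokenizer` argument of a scikit-learn vectorizer.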

In [None]:
# Optional: Consider using spaCy in your function. The spaCy library can be imported by running this cell.
# A pre-trained model (en_core_web_sm) has been made available to you in the CodeGrade container.
# If you DON'T need to use the en_core_web_sm model, you can comment it out below.
import re
import spacy
nlp = spacy.load('en_core_web_sm')

In [None]:
# Create a clean_data function to reduce runtime in CodeGrade, so tokenize does not have to process messy text
def clean_data(doc):
    """
    Takes in text and returns cleaned text:
    - characters other than ASCII letters, underscores, and Chinese characters are replaced with spaces
    - runs of multiple spaces are collapsed into a single space
    - the result is lowercased and stripped
    """
    non_alpha = '[^\u4e00-\u9fa5_a-zA-Z]'  # the \u4e00-\u9fa5 range matches Chinese characters
    multi_white_spaces = "[ ]{2,}"
    
    doc = re.sub(non_alpha, ' ', doc)
    doc = re.sub(multi_white_spaces, " ", doc)
    
    return doc.lower().strip()

df['clean_text'] = df['text'].apply(clean_data)

In [None]:
STOP_WORDS = nlp.Defaults.stop_words.union(['s', 'year', 't', 'dr'])

def tokenize(doc):
    """
    Takes a doc and returns a list of tokens in the form of lemmas.
    Stop words, punctuation, and pronouns are filtered out.
    """
    doc = nlp(doc)

    return [token.lemma_.strip() for token in doc
            if token.text.lower() not in STOP_WORDS
            and not token.is_punct
            and token.pos_ != 'PRON']

In [None]:
'''Testing'''
assert isinstance(tokenize(df.sample(n=1)["text"].iloc[0]), list), "Make sure your tokenizer function accepts a single document and returns a list of tokens!"

## Part 2: Vector Representation
<a id="#p2"></a>
1. Create a vector representation of the reviews (i.e. create a doc-term matrix).
2. Write a fake review and query for the 10 most similar reviews, print the text of the reviews. Do you notice any patterns?
    - Given the size of the dataset, use a `NearestNeighbors` model for this.
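
One way to picture step 2 is sketched below on a toy corpus: fit the vectorizer once, fit `NearestNeighbors` on the resulting sparse matrix, then convert the new review with `transform` (not `fit_transform`) so that it is expressed in the same vocabulary. The three example documents and the variable names are made up for illustration; this is a sketch under those assumptions, not the required solution.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy stand-in for the review column (assumption: made-up text)
docs = [
    "worst service ever, nobody answers the phone",
    "great sushi and friendly staff",
    "the manager never answers the phone, terrible company",
]

vectorizer = TfidfVectorizer()
dtm = vectorizer.fit_transform(docs)  # sparse doc-term matrix

nn = NearestNeighbors(n_neighbors=2).fit(dtm)

# Reuse the fitted vocabulary for the query instead of refitting
query = vectorizer.transform(["they never answer the phone"])
distances, indices = nn.kneighbors(query)

for idx in indices[0]:
    print(docs[idx])
```

Transforming only the query avoids refitting the vectorizer on the whole corpus, which is likely the slowest step when a spaCy-based tokenizer is used.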

In [None]:
%%time
from sklearn.feature_extraction.text import TfidfVectorizer
# Create a vector representation of the reviews 
tfidf_vect = TfidfVectorizer(tokenizer=tokenize)

# Name that doc-term matrix "dtm"
dtm = tfidf_vect.fit_transform(df['clean_text'])

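# View Feature Matrix as DataFrame
# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
dtm = pd.DataFrame(data=dtm.toarray(), columns=tfidf_vect.get_feature_names())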

CPU times: user 2min, sys: 1.05 s, total: 2min 1s
Wall time: 2min


In [None]:
# Create and fit a NearestNeighbors model named "nn"
from sklearn.neighbors import NearestNeighbors

# YOUR CODE HERE
nn = NearestNeighbors(n_neighbors=10).fit(dtm)

#### Alex: "instructor confirmed that unit test needs to be updated" 
I uploaded this notebook to Colab to edit the testing cell, changing `'sklearn.neighbors.unsupervised'` to `'sklearn.neighbors._unsupervised'`.

In [None]:
'''Testing.'''
assert nn.__module__ == 'sklearn.neighbors._unsupervised', ' nn is not a NearestNeighbors instance.'
assert nn.n_neighbors == 10, 'nn has the wrong value for n_neighbors'

In [None]:
# Create a fake review and find the 10 most similar reviews
fake_review = """Fake. This is the worst company I have ever seen. I could not believe what they did to me. 
I already called their office number to talk to their manager. They never answer the phone"""

In [None]:
def find_similarity(doc, df):
    """
    Adds doc to the corpus, refits the TF-IDF vectorizer, and queries NearestNeighbors.
    Args:
    doc -- the input document
    df -- the DataFrame that holds all documents (needs a 'clean_text' column)
    Returns the indices of the 10 nearest rows. The first one is expected to be doc itself,
    which leaves 9 similar reviews from the DataFrame.
    """
    # Create a new df as a copy of df['clean_text']
    df_new = df['clean_text'].copy()

    # Attach fake review to the end of the new df
    df_new.loc[len(df_new.index)] = doc

    # Transform the new df to tfidf
    dtm_new = tfidf_vect.fit_transform(df_new)

    # View Feature Matrix as DataFrame
    dtm_new = pd.DataFrame(data=dtm_new.toarray(), columns=tfidf_vect.get_feature_names())
    
    # Fit new dtm into NearestNeighbors
    nn = NearestNeighbors(n_neighbors=10, algorithm='auto').fit(dtm_new)
    doc = [dtm_new.iloc[-1].values]
    
    # Query Using kneighbors 
    neigh_dist, neigh_index = nn.kneighbors(doc)
    
    return neigh_index[0]

In [None]:
# Display the nine docs that are similar to the fake review
docs_index = find_similarity(fake_review, df)
pd.DataFrame(data=df['clean_text'].loc[df.index.isin(docs_index[1:])])

| | clean_text |
|---|---|
| 0 | beware fake fake fake we also own a small busi... |
| 2943 | well from the outside it looks like a pretty c... |
| 3180 | this walmart has the rudest of employees i hav... |
| 4406 | probably the worst hvac service i have used al... |
| 4491 | this is a update to my earlier review the mech... |
| 5956 | yesterday my two friends and i were at madison... |
| 6019 | i overall liked the atmosphere of this locatio... |
| 8470 | if could leave a star i would i was on hold fo... |
| 9587 | other than the pricing this company is awful t... |


#### All of the returned texts are negative reviews of companies. The pattern is that they share words such as worst, rudest, and fake.

## Part 3: Classification
<a id="#p3"></a>
Your goal in this section will be to predict `stars` from the review dataset. 

1. Create a pipeline object with a sklearn `CountVectorizer` or `TfidfVectorizer` and any sklearn classifier.
    - Use that pipeline to train a model to predict the `stars` feature (i.e. the labels). 
    - Use that Pipeline to predict a star rating for your fake review from Part 2. 



2. Create a parameter dict including `one parameter for the vectorizer` and `one parameter for the model`. 
    - Include 2 possible values for each parameter
    - **Use `n_jobs` = 1** 
    - Due to limited computational resources on CodeGrade, `DO NOT INCLUDE ADDITIONAL PARAMETERS OR VALUES PLEASE.`
    
    
3. Train the entire pipeline with a GridSearch
    - Name your GridSearch object as `gs`
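
The three steps above can be sketched on toy data as follows. The eight reviews, their labels, the step names `vect` and `clf`, and the choice of `LogisticRegression` are all assumptions made for this example; any sklearn classifier fits the same pattern.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy reviews and star labels (assumption: made-up data)
texts = [
    "terrible food and rude staff", "awful service, never again",
    "worst experience, very rude", "terrible wait and awful food",
    "great food and friendly staff", "amazing service, love it",
    "best experience, very friendly", "great place and amazing food",
]
stars = [1, 1, 1, 1, 5, 5, 5, 5]

pipe = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# One vectorizer parameter and one model parameter, two values each.
# Float max_df values are proportions of documents; an int would be a count.
params = {
    "vect__max_df": [0.9, 1.0],
    "clf__C": [0.1, 1.0],
}

gs = GridSearchCV(pipe, params, cv=2, n_jobs=1)
gs.fit(texts, stars)

print(gs.best_params_)
print(gs.predict(["rude staff and awful food"])[0])
```

Because the vectorizer lives inside the pipeline, `gs.predict` accepts raw text in a list, and the grid stays at four candidates (2 x 2).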

In [None]:
# Assign X, Y for training
X = df['clean_text']
Y = df['stars']

In [None]:
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV


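# Instantiate vectorizers and classifier
tfid = TfidfVectorizer(tokenizer=tokenize)
vect = CountVectorizer(tokenizer=tokenize)
# Pass the algorithm by keyword: the first positional argument is n_neighbors
clf = KNeighborsClassifier(algorithm='kd_tree')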

# Add feature union to add another vector representation method
union = FeatureUnion(
    transformer_list = [
        ("tfidf", tfid),
        ("vect", vect)
    ]
)

# Instantiate pipeline
pipe = Pipeline([('union', union),
                 ('clf', clf)])

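# create a hyper-parameter dict
# Note: an integer max_df is an absolute document count (1 keeps only terms found in a single document);
# the float 1.0 would mean "no upper limit"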
params = {
    "union__tfidf__max_df": [1, .9],
    "union__vect__max_df": [1, .9],
    "clf__n_neighbors": [5, 10]  
}

# Name the gridsearch instance "gs"
gs = GridSearchCV(pipe, 
                  params, 
                  n_jobs=1, 
                  cv=3, 
                  verbose=1)

# run the gridsearch
gs.fit(X, Y)

Fitting 3 folds for each of 8 candidates, totalling 24 fits


GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('union',
                                        FeatureUnion(transformer_list=[('tfidf',
                                                                        TfidfVectorizer(tokenizer=<function tokenize at 0x7f527d70b940>)),
                                                                       ('vect',
                                                                        CountVectorizer(tokenizer=<function tokenize at 0x7f527d70b940>))])),
                                       ('clf',
                                        KNeighborsClassifier(n_neighbors='kd_tree'))]),
             n_jobs=6,
             param_grid={'clf__n_neighbors': [5, 10],
                         'union__tfidf__max_df': [1, 0.9],
                         'union__vect__max_df': [1, 0.9]},
             verbose=1)

In [None]:
gs.best_score_

0.5051012899730285

In [None]:
# Visible Testing
prediction = gs.predict(["I wish dogs knew how to speak English."])[0]
assert prediction in df.stars.values, 'Your gs object should be able to accept raw text within a list. Did you include a vectorizer in your pipeline?'

## Part 4: Topic Modeling

Let's find out what those yelp reviews are saying! :D

1. Estimate an LDA topic model of the review text
    - Set num_topics to `5`
    - Name your LDA model `lda`
2. Create 1-2 visualizations of the results
    - You can use the most important 3 words of a topic in relevant visualizations. Refer to yesterday's notebook to extract. 
3. In markdown, write 1-2 paragraphs of analysis on the results of your topic model

When you instantiate your LDA model, it should look like this: 

```python
lda = LdaModel(corpus=corpus,
               id2word=id2word,
               random_state=723812,
               num_topics = num_topics,
               passes=1
              )

```

__*Note*__: You can pass the DataFrame column of text reviews to gensim. You do not have to use a generator.

## Note about  pyLDAvis

**pyLDAvis** is the Topic modeling package that we used in class to visualize the topics that LDA generates for us.

You are welcome to use pyLDAvis for your visualization. However, **you MUST comment out the code that imports the package and the cell that generates the visualization before you submit your notebook to CodeGrade.**

You should still leave the rendered visualization in the notebook for graders to see (i.e. comment out the cell after you run it to create the viz).

In [None]:
from gensim import corpora
# Due to limited computational resources on CodeGrade, use the non-multicore version of LDA
from gensim.models.ldamodel import LdaModel
import gensim
import re

#### 1. Estimate an LDA topic model of the review text

In [None]:
# Remember to read the LDA docs for more information on the various class attributes and methods available to you
# in the LDA model: https://radimrehurek.com/gensim/models/ldamodel.html

# don't change this value 
num_topics = 5

# use tokenize function you created earlier to create tokens 
df['tokens'] = df['clean_text'].apply(tokenize)
# create a id2word object (hint: use corpora.Dictionary)
id2word = corpora.Dictionary(df['tokens'] )
# create a corpus object (hint: id2word.doc2bow)
corpus = [id2word.doc2bow(text) for text in df['tokens']]
# instantiate an lda model
lda = LdaModel(corpus=corpus,
               id2word=id2word,
               random_state=723812,
               num_topics = num_topics,
               passes=1
              )

#### Testing

In [None]:
# Visible Testing
assert lda.get_topics().shape[0] == 5, 'Did your model complete its training? Did you set num_topics to 5?'

#### 2. Create 1-2 visualizations of the results

In [None]:
# import pyLDAvis
# import pyLDAvis.gensim_models
# # Use pyLDAvis (or a plotting tool of your choice) to visualize your results
# pyLDAvis.enable_notebook()
# vis = pyLDAvis.gensim_models.prepare(lda, corpus, id2word)
# vis

#### 3. In markdown, write 1-2 paragraphs of analysis on the results of your topic model

The first topic is about food and restaurants: it contains words such as rice, noodle, sushi, and pork. Topics 2 and 5 are also restaurant topics. They sit very close to each other and to topic 1 on the axes, and they contain words like bread, pancake, and cappuccino. Topic 5 leans more toward bars and drink shops, with brew, beer, and mojito; words like mojito appear only in topic 5. Topic 3 covers personal services such as gym, hair, and airline, while topic 4 covers offices and companies, with words such as car, insurance, phone, doctor, and tech.