# Homework 4

### Due: Sat Dec 11th @ 11:59pm ET

# NLP: Recommendations and Sentiment Analysis

In this homework we will perform two common NLP tasks: 
 1. Generate recommendations for products based on product descriptions using an LDA topic model.
 2. Perform sentiment analysis based on product reviews using sklearn Pipelines.

Instructions:

- Follow the comments below and fill in the blanks (____) to complete.
- Please **'Restart and Run All'** prior to submission.
- **Save pdf in Landscape** and **check that all of your code is shown** in the submission.
- When submitting in Gradescope, be sure to **select which page corresponds to which question.**

Out of 45 points total.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# Part 1: Generate Recommendations from LDA Transformation

In this part we will transform a set of product descriptions using TfIdf and LDA topic modeling to generate product recommendations based on similarity in LDA space. 

## Load data and transform text using TfIdf

In [None]:
# 1. (1pts) Load the Data

# The dataset we'll be working with is a set of product descriptions 
#   from the JCPenney department store.

# Load product information from ../data/jcpenney-products_subset.csv.zip
# Use pandas read_csv function with the default parameters.
# Note that this is a compressed version of a csv file (has a .zip suffix).
# .read_csv() has a parameter 'compression' with default 
#     value 'infer' that will handle unzipping the data for us.
# Store the resulting dataframe as df_jcp.
____

# print a summary of df_jcp using .info()
# there should be 5000 rows with 2 columns
____
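
As a rough illustration of how `compression='infer'` behaves, the sketch below writes a small made-up table to a temporary `.csv.zip` file and reads it back. The file name `toy_products.csv.zip` and the toy rows are assumptions for the example only; they are not the homework data.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data, not the JCPenney dataset.
df_toy = pd.DataFrame({
    "name_title": ["Red Shirt", "Blue Socks"],
    "description": ["a red cotton shirt", "blue wool socks"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_products.csv.zip")
    # The .zip suffix is enough for pandas to write a zip archive.
    df_toy.to_csv(path, index=False)
    # No compression argument needed: the default 'infer' looks at the suffix.
    df_loaded = pd.read_csv(path)

# Summary of the loaded frame: row count, column names, dtypes.
df_loaded.info()
```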

In [None]:
# 2. (2pts) Print an Example

# The two columns of the dataframe we're interested in are:
#   'name_title' which is the name of the product stored as a string
#   'description' which is a description of the product stored as a string
#
# We'll print out the product in the first row as an example
# If we try to print both at the same time, pandas will truncate the strings
#   so we'll print them separately

# print the name_title column in row 0 of df_jcp
____

# print the description column in row 0 of df_jcp
____

In [None]:
# 3. (4pts) Transform Descriptions using TfIdf

# In order to pass our product descriptions to the LDA model, we first
#   need to vectorize from strings to fixed vectors of floats.
# To do this we will transform our documents using TfIdf

# Import TfidfVectorizer from sklearn
____

#  Instantiate a TfidfVectorizer that will
#    use both unigrams + bigrams
#    exclude terms which appear in fewer than 5 documents
#    exclude terms which appear in more than 20% of the documents
#    all other parameters leave as default
# Store as tfidf
____

# fit_transform() tfidf on the description column of df_jcp, 
#    creating the transformed dataset X_tfidf
# Store as X_tfidf
____

# Print the shape of X_tfidf (should be 5000 x 12347)
____
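
To see how `ngram_range`, `min_df`, and `max_df` interact, here is a minimal sketch on four made-up documents. The thresholds (`min_df=2`, `max_df=0.75`) are scaled down for the tiny corpus and are assumptions for illustration, not the values the question asks for.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not the homework data.
docs_toy = [
    "silver necklace with pendant",
    "silver earring set",
    "gold necklace with pendant",
    "cotton crew socks",
]

# Unigrams + bigrams; keep terms seen in at least 2 documents
# and in at most 75% of the documents.
tfidf_toy = TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_df=0.75)
X_toy = tfidf_toy.fit_transform(docs_toy)

# One row per document, one column per surviving term.
print(X_toy.shape)
print(sorted(tfidf_toy.vocabulary_))
```

An integer `min_df` is read as a document count, while a float `max_df` is read as a proportion of documents. Terms that appear in only one document (such as "socks") are dropped here, so the last document ends up as an all-zero row.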

In [None]:
# 4. (1pts) Show The Terms Extracted From Row 0

# X_tfidf is a matrix of floats, one row per document, one column per vocab term
# We can see what terms were extracted, and kept, for the document at df_jcp row 0
#   using the .inverse_transform() function
# Print the result of calling:
#   the .inverse_transform() function of tfidf on the first row of X_tfidf
# You should see an array starting with 'show detail'
____
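
A small sketch of what `.inverse_transform()` returns, using a made-up three-document corpus. The row is taken with a slice (`[:1]`) so that it stays two-dimensional; that choice is an assumption of this example, not a requirement of the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not the homework data.
docs_toy = ["silver necklace", "silver earring", "gold necklace"]

tfidf_toy = TfidfVectorizer()
X_toy = tfidf_toy.fit_transform(docs_toy)

# inverse_transform returns a list with one array of terms per input row,
# containing the terms that have a nonzero weight in that row.
terms_row0 = tfidf_toy.inverse_transform(X_toy[:1])
print(terms_row0)
```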

In [None]:
# 5. (3pts) Format Bigrams and Print Sample of Extracted Vocabulary 

# The learned vocabulary can be retrieved from tfidf as a list using .get_feature_names()
#   (in scikit-learn 1.0+ use .get_feature_names_out(), which returns a numpy array instead)
# Store the extracted vocabulary as vocab
____

# Sklearn joins bigrams with a space character.
# To make our output easier to read, replace the spaces in each term in 
#    vocab (a list of strings) with an underscore.
# To do this we can use the string .replace() method.
# For example x.replace(' ','_') will replace all ' ' in x with '_'.
# Store the result back into vocab
____

# Print the last 5 terms in the vocab
# The first term printed should be 'zirconia_back'
____

## Transform product descriptions into topics and print sample terms from topics


In [None]:
# 6. (3pts) Perform Topic Modeling with LDA

# Now that we have our vectorized data, we can use Latent Dirichlet Allocation to learn 
#   per-document topic distributions and per-topic term distributions.
# Though the documents are likely composed of more, we'll model our dataset using 
#     20 topics for ease of printing.

# Import LatentDirichletAllocation from sklearn
____

# Instantiate a LatentDirichletAllocation model that will
#    produce 20 topics
#    use all available cores
#    random_state=123
# Store as lda
____

# Run fit_transform on lda using X_tfidf.
# Store the output (the per-document topic distributions) as X_lda
____

# Print the shape of the X_lda (should be 5000 x 20)
____
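
As a hedged sketch of what the per-document topic distributions look like, the example below fits a 2-topic model on six made-up documents. The corpus and the choice of 2 topics are assumptions for illustration. `n_jobs` is left at its default here; passing `n_jobs=-1` would use all available cores.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not the homework data.
docs_toy = [
    "silver necklace pendant",
    "gold necklace pendant",
    "silver earring set",
    "cotton crew socks",
    "wool crew socks",
    "cotton ankle socks",
]
X_toy = TfidfVectorizer().fit_transform(docs_toy)

lda_toy = LatentDirichletAllocation(n_components=2, random_state=123)
X_lda_toy = lda_toy.fit_transform(X_toy)

# One row per document, one column per topic.
print(X_lda_toy.shape)
# Each row is a probability distribution over topics, so it sums to 1.
print(X_lda_toy.sum(axis=1))
```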

In [None]:
# 7. (4pts) Get Assigned Topics for Product at df_jcp row 0

# Get the assigned topic proportions for the document at row 0 of X_lda
# This will be a list of 20 floats between 0 and 1
# Round all values to a precision of 2
# Store as theta_0
____
print(f'{theta_0 = :}\n')

# LDA will assign a small weight (or probability) to each topic for a document
# How many of the topics in theta_0 have a (relatively) large weight (> .01)?
# Store in n_assigned_0
____
print(f'{n_assigned_0 = :}\n')

# What are the indices of the assigned topics, sorted descending by the values in theta_0?
# Use np.argsort() to return the indices sorted by value (ascending)
# Use [::-1] to reverse the sorting order (from ascending to descending)
# Return only the first n_assigned_0 indices, those with large probability
# Store as assigned_topics_0
# HINT: You should see around 5 indices
____
print(f'{assigned_topics_0 = :}')

# Now that we have the topic indices, we need to see what each topic looks like
#   using the per-topic word distributions stored in lda.components_ (next question)
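
The threshold-then-argsort pattern described above can be sketched on a made-up weight vector. The values in `theta_toy` are assumptions for the example and are not output from the model.

```python
import numpy as np

# Hypothetical topic weights for one document (already rounded).
theta_toy = np.array([0.01, 0.55, 0.01, 0.30, 0.13])

# Count the topics whose weight is strictly above the threshold.
n_assigned_toy = int((theta_toy > 0.01).sum())

# argsort sorts ascending; [::-1] flips that to descending.
order_desc = np.argsort(theta_toy)[::-1]

# Keep only the indices of the topics that passed the threshold.
assigned_toy = order_desc[:n_assigned_toy]

print(n_assigned_toy)  # 3
print(assigned_toy)    # [1 3 4]
```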

In [None]:
# 8. (5pts) Print Top Topic Terms

# To get a sense of what each topic is composed of, we can print the most likely terms for each topic.
# We'd like a print statement that looks something like this:
#     Topic # 0 : socks spandex fits shoe fits_shoe

# To make indexing easier, first convert vocab from a list to np.array()
# Store back into vocab
____

# For each topic print f'Topic #{topic_idx:2d} : ' followed by the top 5 most likely terms in that topic.
# Hints: 
#   The per topic term distributions are stored in lda.components_ 
#      which should be a numpy array with shape (20, 12347)
#   Iterate through the rows of lda.components_, one row per topic
#   Use np.argsort() to get the indices of the current row of lda.components_
#      sorted by the values in that row in ascending order
#   Use [::-1] to reverse the order of the sorted indices
#   Use numpy array indexing to get the first 5 index values
#   Use these indices to get the corresponding terms from vocab
#   Join the list of terms with spaces using ' '.join()
#   Each print statement should start with f'Topic #{topic_idx:2d} : ' 
#      where topic_idx is an integer 0 to 19
# The first line should look something like:
# Topic # 0 : socks spandex fits shoe fits_shoe

# Use as many lines of code as you need
____
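
A minimal sketch of the loop described in the hints, run on a made-up 2-topic, 6-term matrix instead of `lda.components_`. The vocabulary, the weights, and the choice of top 3 terms are assumptions for illustration.

```python
import numpy as np

# Hypothetical vocabulary and per-topic term weights (2 topics x 6 terms).
vocab_toy = np.array(["ring", "socks", "gold", "cotton", "shoe", "silver"])
components_toy = np.array([
    [0.1, 2.0, 0.2, 1.5, 1.0, 0.1],
    [3.0, 0.1, 2.5, 0.2, 0.1, 1.0],
])

lines_toy = []
for topic_idx, term_weights in enumerate(components_toy):
    # Indices of the 3 largest weights, highest first.
    top_idx = np.argsort(term_weights)[::-1][:3]
    # Index the vocab array with those indices and join the terms with spaces.
    lines_toy.append(f"Topic #{topic_idx:2d} : " + " ".join(vocab_toy[top_idx]))

print("\n".join(lines_toy))
```

Converting the vocabulary to a numpy array is what allows indexing it with an array of indices (`vocab_toy[top_idx]`); a plain Python list would need a comprehension instead.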

In [None]:
# Looking at the description column of row 0, the assigned_topics_0 and 
# the top terms per topic above, our LDA model seems to have generated
# distributions that match what we might expect fairly well, though 
# since this is unsupervised, there may be some surprises.

## Generate recommendations using topics

In [None]:
# 9. (3pts) Generate Similarity Matrix

# We'll use Content-Based Filtering to make recommendations based on a query product.
# Each product will be represented by its LDA topic weights learned above (X_lda).
# We'd like to recommend similar products in LDA space.
# We'll use cosine distance (cosine_distances in sklearn) as our measure: a smaller distance means more similar.

# Import cosine_distances from sklearn.metrics.pairwise
____

# Use cosine_distances to compute pairwise distances on our X_lda data
# Store as distances
# NOTE: we only need to pass X_lda in once as an argument,
#   the function will calculate pairwise distance between all rows in that matrix
____

# print the shape of the distances matrix (should be 5000 x 5000)
____
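
To make the distance matrix concrete, here is a small sketch on three made-up topic-weight rows. Cosine distance is 1 minus cosine similarity, so rows that point in nearly the same direction get a distance near 0. The values in `X_toy` are assumptions for the example.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

# Hypothetical topic weights for 3 products over 3 topics.
X_toy = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
])

# Passing a single matrix computes the distance between every pair of rows.
distances_toy = cosine_distances(X_toy)

print(distances_toy.shape)
print(distances_toy.round(2))
```

Row 0 and row 1 load on the same topic, so their distance is small; row 2 loads on a different topic, so it is far from both. The diagonal is 0 because every row is at distance 0 from itself.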

In [None]:
# 10. (4pts) Find Recommended Products

# Let's test our proposed recommendation engine using the product at row 0 in df_jcp.
#   The name of this product is "Mixit™ Silver-Tone and Purple Enamel Earring and Pendant Necklace Set"
#   Our system will recommend products similar to this product.

# Print the names for the top 10 most similar products to this query.
# Suggested way to do this is:
#   get the cosine distances from row 0 of the distances matrix
#   get the indices of this first row of distances sorted by value ascending using np.argsort()
#   get the first 10 indexes from this sorted array of indices
#   use those indices to index into df_jcp.name_title 
#   to get the full string, use .values
#   print the resulting array

# HINT: The first two products will likely be:
#   'Mixit™ Silver-Tone and Purple Enamel Earring and Pendant Necklace Set',
#   '1/7 CT. T.W. Diamond 10K White Gold Pendant'
____
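
The lookup steps suggested above can be sketched with a made-up 4-product distance matrix. The product names and distances are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical product names and pairwise distances.
names_toy = pd.Series(
    ["Silver Necklace", "Gold Pendant", "Crew Socks", "Silver Earrings"]
)
distances_toy = np.array([
    [0.00, 0.10, 0.90, 0.05],
    [0.10, 0.00, 0.80, 0.12],
    [0.90, 0.80, 0.00, 0.95],
    [0.05, 0.12, 0.95, 0.00],
])

query_idx = 0
# Sort the query's row ascending so the closest products come first.
nearest_idx = np.argsort(distances_toy[query_idx])[:3]
# Positional lookup of the names, returned as a plain array.
recommended_toy = names_toy.iloc[nearest_idx].values

print(recommended_toy)
```

The query product itself comes back first because its distance to itself is 0, which matches the hint that the first recommended product is the query.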

# Part 2: Sentiment Analysis Using Pipelines

Here we will train a model to classify positive vs negative sentiment on a set of pet supply product reviews using sklearn Pipelines.

In [None]:
# 11. (2pts) Load the Data

# The dataset we'll be working with is a set of product reviews
#   of pet supply items on Amazon.
# This data is taken from https://nijianmo.github.io/amazon/index.html
#   Justifying recommendations using distantly-labeled reviews and fine-grained aspects
#   Jianmo Ni, Jiacheng Li, Julian McAuley
#   Empirical Methods in Natural Language Processing (EMNLP), 2019

# Load product reviews from ../data/amazon-petsupply-reviews_subset.csv.zip
# Use pandas read_csv function with the default parameters as in part 1.
# Store the resulting dataframe as df_amzn.
____

# print a summary of df_amzn using .info()
# there should be 10000 rows with 2 columns
____

# print the first row of the dataframe as an example
# you should see the beginning of the review and the associated rating
____

In [None]:
# 12. (2pts) Transform Target

# The ratings are originally in a 5 point scale
# We'll turn this into a binary classification task to approximate positive vs negative sentiment

# Print the proportions of values seen in the rating column
#   using value_counts() with normalize=True
____

# Create a new binary target by setting
#  rows where rating is 5 to True
#  rows where rating is not 5 to False
# Store in y
____

# adding an empty print statement to insert an empty line in the output
print()

# Print the proportions of values seen in y
#   using value_counts() with normalize=True
# True here means a rating of 5 (i.e. positive)
# False means a rating less than 5 (i.e. negative)
____
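
A short sketch of the binarization on a made-up ratings column. The eight ratings are assumptions for the example.

```python
import pandas as pd

# Hypothetical ratings on a 5-point scale.
ratings_toy = pd.Series([5, 4, 5, 1, 5, 3, 5, 5], name="rating")

# Proportion of each rating value.
print(ratings_toy.value_counts(normalize=True))

# Comparing against 5 yields a boolean Series: True for 5, False otherwise.
y_toy = ratings_toy == 5

print()
# Proportion of True vs False in the new target.
print(y_toy.value_counts(normalize=True))
```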

In [None]:
# 13. (1pts) Train-test split

# Import train_test_split from sklearn
____

# Split df_amzn.review and y into a train and test set
#   using train_test_split
#   with test_size = .2 and stratifying by y
# Store as reviews_train,reviews_test,y_train,y_test
____
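
To illustrate what stratifying does, the sketch below splits 25 made-up reviews with a 60/40 class balance. The data and `random_state=0` are assumptions for the example; the question does not specify a random state.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical reviews and labels: 15 positive, 10 negative.
reviews_toy = pd.Series([f"review {i}" for i in range(25)])
y_toy = pd.Series([True] * 15 + [False] * 10)

r_train, r_test, y_tr, y_te = train_test_split(
    reviews_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# 80/20 split of the rows.
print(len(r_train), len(r_test))
# Stratifying keeps the share of positives about the same in both sets.
print(y_tr.mean(), y_te.mean())
```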

In [None]:
# 14. (3pts) Create a Pipeline of TfIdf transformation and Classification

# import Pipeline, GradientBoostingClassifier from sklearn
____

# Create a pipeline with two steps: 
#  TfidfVectorizer with min_df=5 and max_df=.5 named 'tfidf'
#  GradientBoostingClassifier with 200 trees named 'gbc'
# Store as pipe_gbc
____

# Print out the pipeline
# You should see both steps: tfidf and gbc
print(pipe_gbc)
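
As a rough sketch of how named pipeline steps fit together, the example below chains a vectorizer and a small classifier on made-up reviews. The data, the default vectorizer settings, and `n_estimators=10` are assumptions chosen to keep the toy example fast; they are not the settings the question asks for.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical toy reviews and labels, not the homework data.
reviews_toy = [
    "great toy my dog loves it",
    "love it great quality",
    "broke after one day",
    "terrible my cat hates it",
] * 5
y_toy = [True, True, False, False] * 5

# Each step is a (name, estimator) pair; the names are used later
# to address parameters, e.g. 'tfidf__ngram_range'.
pipe_toy = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("gbc", GradientBoostingClassifier(n_estimators=10, random_state=0)),
])

# fit() vectorizes the raw strings, then trains the classifier on the result.
pipe_toy.fit(reviews_toy, y_toy)

print(list(pipe_toy.named_steps))
print(pipe_toy.predict(["great quality"]))
```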

In [None]:
# 15. (5pts) Perform Grid Search on pipe_gbc

# import GridSearchCV from sklearn
____

# Create a parameter grid to test using:
#   unigrams or unigrams + bigrams in the tfidf step
#   max_depth of 2 or 10 in the gbc step
# Store as param_grid
____

# Instantiate GridSearchCV to evaluate pipe_gbc on the values in param_grid
#   use cv=2 and n_jobs=-1 to reduce run time
# Fit on the training set
# Store as gs_pipe_gbc
____

# Print the best parameter settings in gs_pipe_gbc found by grid search
____

# Print the best cv score found by grid search, with a precision of 2
____
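
A hedged sketch of a grid search over pipeline parameters, again on made-up reviews. Grid keys use the `<step name>__<parameter>` pattern. The data, the grid values, and `n_estimators=10` are assumptions for illustration; `n_jobs` is left at its default, whereas `n_jobs=-1` would run the fits in parallel.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy reviews and labels, not the homework data.
reviews_toy = [
    "great toy my dog loves it",
    "love it great quality",
    "broke after one day",
    "terrible my cat hates it",
] * 5
y_toy = [True, True, False, False] * 5

pipe_toy = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("gbc", GradientBoostingClassifier(n_estimators=10, random_state=0)),
])

# 2 x 2 = 4 parameter combinations, each evaluated with 2-fold CV.
param_grid_toy = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "gbc__max_depth": [2, 3],
}

gs_toy = GridSearchCV(pipe_toy, param_grid_toy, cv=2)
gs_toy.fit(reviews_toy, y_toy)

print(gs_toy.best_params_)
print(f"{gs_toy.best_score_:.2f}")
```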

In [None]:
# 16. (1pts) Evaluate on the test set

# Calculate the test set score using the fitted gs_pipe_gbc 
#   to give confidence that we have not overfit
#   while still improving over a random baseline classifier
# Print the accuracy score on the test set with a precision of 2
____

In [None]:
# 17. (1pts) Evaluate on example reviews

# Generate predictions for these two sentences using the fitted gs_pipe_gbc:
#   'This is a great product.'
#   'This product is not great.'
# You should see True for the first (rating of 5) 
#   and False for the second (rating of less than 5)
____
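
A fitted `GridSearchCV` can be scored and used for prediction directly. With the default `refit=True`, it retrains the best pipeline on the full training set and delegates `.score()` and `.predict()` to it. The sketch below shows this on made-up reviews; all data and settings are assumptions for the example.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy train and test reviews, not the homework data.
reviews_train_toy = [
    "great toy my dog loves it",
    "love it great quality",
    "broke after one day",
    "terrible my cat hates it",
] * 5
y_train_toy = [True, True, False, False] * 5
reviews_test_toy = ["great quality my dog loves it", "terrible it broke"]
y_test_toy = [True, False]

pipe_toy = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("gbc", GradientBoostingClassifier(n_estimators=10, random_state=0)),
])
gs_toy = GridSearchCV(pipe_toy, {"gbc__max_depth": [2, 3]}, cv=2)
gs_toy.fit(reviews_train_toy, y_train_toy)

# For a classifier, score() reports accuracy on the given data.
test_score_toy = gs_toy.score(reviews_test_toy, y_test_toy)
print(f"{test_score_toy:.2f}")

# Raw strings can be passed in: the pipeline vectorizes them before predicting.
preds_toy = gs_toy.predict(["great toy", "it broke"])
print(preds_toy)
```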