# Homework 4

### Due: Fri Dec 18th @ 11:59pm ET

In this homework we will be covering NLP, Topic Modeling, and Recommendation Engines.

We will generate recommendations on products from a department store based on product descriptions.
We'll first transform the data into topics using Latent Dirichlet Allocation, and then generate recommendations based on this new representation using Content-Based Filtering.


Instructions:

- **Follow the comments below and fill in the blanks (____)**
- **Please use default arguments whenever arguments aren't specified.**
- **Please 'Restart and Run All' prior to submission.**
- **When submitting to Gradescope, please mark on which page each question is answered.**


Out of 28 points total.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# LDA and Recommendation Engines

We are going to create a recommendation engine for products from a department store.  
The recommendations will be based on the similarity of product descriptions.  
We'll use Content-Based Filtering to query a product and get back a list of products that are similar.  
Instead of using the descriptions directly, we will first do some topic modeling using LDA to transform the descriptions into a topic space.

## Transform product descriptions into topics and print sample terms from topics


In [3]:
# 1. (2pts) Load the Data

# The dataset we'll be working with is a set of product descriptions from the department store JCPenney.

# Load product information from ../data/jcpenney-products_subset.csv.zip
# This is a compressed version of a csv file.
# Use pandas read_csv function with the default parameters.
# read_csv has an argument 'compression' with default value 'infer' that will handle unzipping the data.
# There is no need to unzip the data prior to using read_csv.
# Store the resulting dataframe as df_products.
df_products = pd.read_csv('../data/jcpenney-products_subset.csv.zip')

# print a summary of df_products using .info. There should be 5000 records and 6 columns
df_products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   uniq_id        5000 non-null   object
 1   sku            5000 non-null   object
 2   name_title     5000 non-null   object
 3   description    5000 non-null   object
 4   category       4698 non-null   object
 5   category_tree  4698 non-null   object
dtypes: object(6)
memory usage: 234.5+ KB
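
As a side note on the `compression='infer'` behavior described above, here is a minimal sketch of a zipped-CSV round trip. The toy dataframe, the temporary directory, and the file name `toy_products.csv.zip` are all hypothetical and are not part of the homework data.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data standing in for the product table
toy_df = pd.DataFrame({'name_title': ['a', 'b'],
                       'description': ['x y', 'z']})

with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, 'toy_products.csv.zip')
    # to_csv infers zip compression from the .zip extension
    toy_df.to_csv(zip_path, index=False)
    # read_csv also infers the compression, so no manual unzip is needed
    toy_loaded = pd.read_csv(zip_path)

print(toy_loaded.shape)
```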


In [4]:
# 2. (2pts) Print an Example

# The two columns of the dataframe we're interested in are:
#   'name_title' which is the name of the product stored as a string
#   'description' which is a description of the product stored as a string
#
# We'll print out the product in the first row as an example
# If we try to print both columns at the same time, pandas will truncate the strings
#   so we'll print them separately

# print the column 'name_title' in row 0 of df_products
print(df_products.name_title[0])

print('---') # to visually separate the two strings

# print the column 'description' in row 0 of df_products
print(df_products.description[0])

Alfred Dunner® Essential Pull On Capri Pant
---
You'll return to our Alfred Dunner pull-on capris again and again when you want an updated, casual look and all the comfort you love.   elastic waistband approx. 19-21" inseam slash pockets polyester washable imported      


In [5]:
# 3. (4pts) Transform Descriptions using TfIdf

# In order to pass our product descriptions to the LDA model, we first need to vectorize 
#   from strings to fixed-length feature vectors.
# To do this we will transform our documents using Tf-Idf.

# Import TfidfVectorizer from sklearn.feature_extraction.text
from sklearn.feature_extraction.text import TfidfVectorizer

#  Instantiate a TfidfVectorizer with
#   ngram_range=(1,2), use both unigrams and bigrams
#   min_df=20,         excluding terms which appear in fewer than 20 documents
#   max_df=.7          excluding terms which appear in more than 70% of documents   
#   all other arguments as their default

# Store as tfidf
tfidf = TfidfVectorizer(ngram_range = (1,2),
                       min_df = 20,
                       max_df = .7)

# fit_transform tfidf on the 'description' column of the df_products dataframe 
# Store the transformed dataset as X_tfidf
X_tfidf = tfidf.fit_transform(df_products.description)

# Assert that the shape of X_tfidf is 5000 rows by 2732 columns
assert X_tfidf.shape == (5000, 2732)
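
To illustrate how `ngram_range`, `min_df`, and `max_df` shape the vocabulary, here is a small sketch on a hypothetical four-document corpus. The documents and the thresholds (`min_df=2`, `max_df=0.7`) are chosen only for illustration and differ from the homework settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from the JCPenney data
toy_docs = [
    "cotton shirt with button front",
    "cotton shirt with spread collar",
    "rubber sole shoe",
    "cotton tee",
]

# min_df=2 keeps only terms found in at least 2 documents
# max_df=0.7 drops terms found in more than 70% of documents
toy_tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_df=0.7)
X_toy = toy_tfidf.fit_transform(toy_docs)

# vocabulary_ maps each kept term to its column index
print(sorted(toy_tfidf.vocabulary_))
print(X_toy.shape)
```

Here 'cotton' appears in 3 of 4 documents (75%), so `max_df` removes it, while terms that appear in only one document are removed by `min_df`. The bigram 'cotton shirt' survives because it appears in exactly two documents.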

In [6]:
# 4. (4pts) Format Bigrams and Print Sample of Extracted Vocabulary 

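# The extracted vocabulary can be retrieved from tfidf as an array of strings using .get_feature_names_out()
# (scikit-learn versions before 1.0 provide .get_feature_names() instead, which returns a list)
# Store the extracted vocabulary in the variable vocab
vocab = tfidf.get_feature_names_out()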

# Sklearn joins bigrams with a space character.
# To make output easier to read, replace all spaces in the vocab list with underscores.
# To do this we can use the string .replace() method.
# For example, if x is a string, x.replace(' ','_') will replace all ' ' with '_' in x.
# A list comprehension would be useful here, but use any method you like
#    to iterate through each item in vocab, replacing spaces with underscores.
# Store the result back into vocab
vocab = [i.replace(' ', '_') for i in vocab]

# Print the last 5 terms in the vocabulary (should end with 'zippered')
vocab[-5:]

['zip_front', 'zip_pocket', 'zipper', 'zipper_closure', 'zippered']

In [13]:
# 5. (4pts) Perform Topic Modeling with LDA

# Now that we have our vectorized data, we can use Latent Dirichlet Allocation to learn 
#   the per-document topic distributions and per-topic term distributions.
# Though there are likely more, we'll model our dataset using 20 topics to keep things small.

# Import LatentDirichletAllocation from sklearn.decomposition
from sklearn.decomposition import LatentDirichletAllocation

# Instantiate a LatentDirichletAllocation model with
#    n_components=20    fit 20 topics
#    n_jobs=-1,         use all cores
#    random_state=123   for reproducibility
# Store the model as lda
lda = LatentDirichletAllocation(n_components= 20,
                               n_jobs= -1,
                               random_state= 123)

# Run fit_transform() on lda using X_tfidf.
# Store the output (the per-document topic distributions) as X_lda
X_lda = lda.fit_transform(X_tfidf)

# Assert that the shape of X_lda is 5000 rows by 20 columns
assert X_lda.shape == (5000, 20)
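
The two distributions mentioned above can be inspected directly. The sketch below uses a hypothetical 6x4 document-term matrix. It assumes that `fit_transform` returns normalized per-document topic distributions (which is how scikit-learn's `transform` behaves) and that `components_` holds unnormalized term weights that need dividing by their row sums.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy document-term matrix: 6 documents, 4 terms
X_counts = np.array([[3, 2, 0, 0],
                     [2, 3, 0, 0],
                     [0, 0, 4, 1],
                     [0, 0, 1, 4],
                     [1, 0, 0, 3],
                     [0, 2, 2, 0]])

toy_lda = LatentDirichletAllocation(n_components=3, random_state=123)
doc_topics = toy_lda.fit_transform(X_counts)

# Each row is a distribution over the 3 topics, so rows sum to 1
print(doc_topics.sum(axis=1))

# Normalize each row of components_ to get per-topic term distributions
topic_terms = toy_lda.components_ / toy_lda.components_.sum(axis=1, keepdims=True)
print(topic_terms.sum(axis=1))
```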

In [14]:
# 6. (5pts) Print Top Topic Terms

# To get a sense of what each topic is composed of, we can print the most likely terms for each topic.
# For each topic print 'Topic {topic_idx:2d} : ' followed by 
#   the top 5 most likely terms in that topic given the per-topic term distribution
# Example:
#    Topic  0 : wicking moisture moisture_wicking dri fabric

# We'll use the vocab created above, but first convert from a list to np.array to make indexing easier
vocab = np.array(vocab)

# This function returns a list of the top n terms in vocab given the weights in term_weights
def get_top_terms(term_weights, top_n=5):
    # np.argsort() returns the indices of an np.array sorted by their value, in ascending order
    # [::-1] reverses the order of an np.array (descending order)
    # list() converts from an np.array() back to a list
    return list(vocab[np.argsort(term_weights)[::-1]][:top_n])

# The per-topic term distributions (term_weights) are stored in lda.components_
# For each array of term_weights in lda.components_ 
#    use get_top_terms() to print the top 5 terms per topic.
# Example:
#    Topic  0 : wicking moisture moisture_wicking dri fabric
# Hints:
#   use enumerate to get a topic_idx and the term_weights from lda.components_
#   prefix each line with the string produced by f'Topic {topic_idx:2d} : '
#   use ' '.join() to join the list of terms returned by get_top_terms()
for topic_idx, term_weights in enumerate(lda.components_):
    print(f'Topic {topic_idx:2d} :', ' '.join(get_top_terms(term_weights)))

Topic  0 : dress sleeveless length_from shoulder from_shoulder
Topic  1 : elastic shorts cargo pockets waistband
Topic  2 : pocket interior pockets exterior zip
Topic  3 : spandex our front with inseam
Topic  4 : fit sits leg straight waist
Topic  5 : spot spot_clean clean_imported clean pillow
Topic  6 : upper sole rubber synthetic rubber_sole
Topic  7 : polyester_washable washable_imported washable regular drawstring
Topic  8 : set safe king dishwasher comforter
Topic  9 : the to for with of
Topic 10 : button collar shirt button_front chest
Topic 11 : spread spread_collar lycra dress_shirt stays
Topic 12 : sleepwear safety flame flame_resistant meet
Topic 13 : rug resistant yes indoor backing
Topic 14 : rod sold sold_individually individually rod_pocket
Topic 15 : clean_only dry_clean only_imported only tie
Topic 16 : short sleeves cotton tee washable_imported
Topic 17 : case dial color width show
Topic 18 : metal jewelry photos show_detail photos_are
Topic 19 : moisture wicking mois
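
The `np.argsort(...)[::-1]` pattern inside `get_top_terms()` can be checked on a small example. The vocabulary and weights below are hypothetical values chosen so the ordering is easy to verify by hand.

```python
import numpy as np

# Hypothetical vocabulary and term weights for a single topic
toy_vocab = np.array(['button', 'collar', 'rubber', 'sole', 'zipper'])
toy_weights = np.array([0.1, 2.5, 0.7, 3.0, 1.2])

# argsort gives indices in ascending order of weight; [::-1] flips to descending
order_desc = np.argsort(toy_weights)[::-1]

# Index the vocabulary with the sorted indices and keep the first 3 terms
top_3 = list(toy_vocab[order_desc][:3])
print(' '.join(top_3))
```

The largest weight (3.0) belongs to 'sole', followed by 'collar' (2.5) and 'zipper' (1.2), so those are the three terms printed.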

## Generate recommendations using topics

In [9]:
# 7. (3pts) Generate Similarity Matrix

# We'll use Content-Based Filtering to make recommendations based on a query product.
# Each product will be represented by its LDA topic weights learned above.
# We'd like to recommend products similar in LDA space.
# We'll use Euclidean distance as a measure of similarity.

# Import euclidean_distances from sklearn.metrics.pairwise 
from sklearn.metrics.pairwise import euclidean_distances

# Use euclidean_distances to generate similarity scores on our X_lda data
# Recall that when using distance as similarity, smaller is better.
# Store in a variable named distances.
# NOTE: we only need to pass X_lda in once,
#   the function will calculate pairwise distances for all rows in that matrix
distances = euclidean_distances(X_lda)

# Assert that the shape of the distances matrix is 5000 rows by 5000 columns
assert distances.shape == (5000,5000)
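
To see what the pairwise matrix looks like, here is a sketch with three hypothetical two-topic vectors. The values are made up for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# Hypothetical topic weights for 3 products over 2 topics
toy_topics = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [0.5, 0.5]])

# Passing one matrix computes the distance between every pair of its rows
toy_distances = euclidean_distances(toy_topics)
print(toy_distances.round(3))
```

The result is a symmetric 3x3 matrix with zeros on the diagonal, since every product is at distance 0 from itself. Products 0 and 1 share no topic weight and are the farthest apart (about 1.414), while product 2 sits halfway between them (about 0.707 from each).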

In [10]:
# 8. (4pts) Generate Recommendations

# Let's test our proposed recommendation engine using the product at row 0 in df_products as the query.
#   The name of this product is "Alfred Dunner® Essential Pull On Capri Pant"

# Print the names for the top 10 most similar products to this query.
# Suggested way to do this is:
#   get the euclidean distances from row 0 of the distances matrix
#   get the indices of this array sorted by value using np.argsort()
#   get the first 10 indexes from this array
#   use those indices to index into df_products.name_title and print the result

# NOTE: The first product should be:
#   'Alfred Dunner® Essential Pull On Capri Pant', (the original query product)
query_idx = 0
best_idxs_asc = np.argsort(distances[query_idx])
first_10idx = best_idxs_asc[:10]
df_products.name_title[first_10idx]

0             Alfred Dunner® Essential Pull On Capri Pant
2481    City Streets® Colorblocked Performance Capris ...
1351              Made for Life™ Pintucked Bermuda Shorts
2671    Total Girl® French Terry Knit-Waist Capris - G...
43         Dickies® Womens Youth Cargo Scrub Pants–Petite
2985    Liz Claiborne® City-Fit Straight-Leg Jeans - Tall
2559               Liz Claiborne® Sculpted Denim Jeggings
4427                     Dockers® Cargo Shorts–Big & Tall
3620                         Alfred Dunner® Pull-On Pants
4619        Dickies® Women’s Youth Cargo Scrub Pants–Tall
Name: name_title, dtype: object
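
The lookup steps above could be wrapped in a small helper. The sketch below is one possible way to do it: the function name `recommend`, its parameters, and the toy product names and topic weights are all hypothetical. It uses a stable sort so that ties are broken in favor of the lower row index.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances

def recommend(query_idx, distances, names, top_n=3):
    """Return the names of the top_n products closest to the query (hypothetical helper)."""
    # Smaller distance means more similar, so ascending order is what we want
    nearest = np.argsort(distances[query_idx], kind='stable')[:top_n]
    return names.iloc[nearest]

# Hypothetical product names and topic weights over 2 topics
toy_names = pd.Series(['capri pant', 'cargo shorts', 'table lamp', 'pull-on pant'])
toy_topics = np.array([[0.9, 0.1],
                       [0.7, 0.3],
                       [0.1, 0.9],
                       [0.8, 0.2]])
toy_distances = euclidean_distances(toy_topics)

print(recommend(0, toy_distances, toy_names))
```

As in the homework output, the query product itself comes back first because its distance to itself is 0. Slicing the result with `[1:]` would drop it if only other products are wanted.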