# Homework 4

### Due: Wed May 13 @ 11:59pm

In this homework we will cover NLP, topic modeling, and recommendation engines.

We will generate recommendations on products from a department store based on product descriptions.
We'll first transform the data into topics using Latent Dirichlet Allocation (LDA), and then generate recommendations based on this new representation.


### Instructions
Follow the comments below and fill in the blanks (____) to complete.

**Please 'Restart and Run All' prior to submission.**

**When submitting to Gradescope, please mark on which page each question is answered.**

Out of 26 points total.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)

%matplotlib inline
np.random.seed(123)

# LDA and Recommendation Engines

We are going to create a recommendation engine for products from a department store.  
The recommendations will be based on the similarity of product descriptions.  
We'll query a product and get back a list of products that are similar.  
Instead of using the descriptions directly, we will first do some topic modeling using LDA to transform the descriptions into a topic space.

## Transform product descriptions into topics and print sample terms from topics


In [2]:
# 1. (2pts) Load the Data

# The dataset we'll be working with is a set of product descriptions from JCPenney.

# Load product information from ../data/jcpenney-products_subset.csv.zip
# This is a compressed version of a csv file.
# Use pandas read_csv function with the default parameters.
# read_csv has a parameter compression with default value 'infer' that will handle unzipping the data.
# Store the resulting dataframe as df_products.
df_products=pd.read_csv('../data/jcpenney-products_subset.csv.zip')

# print a summary of df_products using .info(), noting the number of records (should be 5000)
# NOTE: info is a method, so it must be called with parentheses;
#   without them the notebook only displays the bound method object.
df_products.info()

In [3]:
# 2. (2pts) Print an Example

# The two columns of the dataframe we're interested in are:
#   name_title which is the name of the product stored as a string
#   description which is a description of the product stored as a string
#
# We'll print out the product in the first row as an example
# If we try to print both at the same time, pandas will truncate the strings
#   so we'll print them separately

# print the product name_title in row 0 of df_products
print(df_products.name_title[0])

# print the product description in row 0 of df_products
print(df_products.description[0])

Alfred Dunner® Essential Pull On Capri Pant
You'll return to our Alfred Dunner pull-on capris again and again when you want an updated, casual look and all the comfort you love.   elastic waistband approx. 19-21" inseam slash pockets polyester washable imported      


In [4]:
# 3. (4pts) Transform Descriptions using TfIdf

# In order to pass our product descriptions to the LDA model, we first need to vectorize them,
#   converting each string into a fixed-length vector of floats.
# To do this we will transform our documents using Tf-Idf,
#    using both unigrams and bigrams,
#    excluding terms which appear in fewer than 10 documents, and
#    excluding common English stop words.

# Import TfidfVectorizer from sklearn.feature_extraction.text
from sklearn.feature_extraction.text import TfidfVectorizer

#  Instantiate a TfidfVectorizer with
#   ngram_range=(1,2),
#   min_df=10,
#   stop_words='english'
# Store as tfidf
tfidf=TfidfVectorizer(ngram_range=(1,2),min_df=10,stop_words='english')

# fit_transform tfidf on the description column of our dataframe, creating the transformed dataset X_tfidf
# Store as X_tfidf
X_tfidf=tfidf.fit_transform(df_products.description)

# Print the shape of X_tfidf (should be 5000 x 3979)
X_tfidf.shape

(5000, 3979)
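
To see how `ngram_range`, `min_df`, and `stop_words` interact, here is a minimal sketch on a made-up three-document corpus. The documents and the `toy_*` names are illustrative assumptions and are not part of the JCPenney data; `min_df=2` stands in for the `min_df=10` used above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, invented for illustration only.
toy_docs = [
    "elastic waistband with slash pockets",
    "elastic waistband and zipper pockets",
    "cotton tee with short sleeves",
]

# Keep unigrams and bigrams, drop English stop words,
# and drop any term that appears in fewer than 2 documents.
toy_tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=2, stop_words="english")
toy_X = toy_tfidf.fit_transform(toy_docs)

# vocabulary_ maps each surviving term to its column index in toy_X.
toy_terms = sorted(toy_tfidf.vocabulary_)
print(toy_terms)
print(toy_X.shape)
```

Only terms shared by at least two documents survive, so the third document ends up as an all-zero row. Raising `min_df` shrinks the vocabulary in the same way on the real data.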

In [5]:
# 4. (3pts) Format Bigram Labels and Print Sample of Extracted Vocabulary 

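# The extracted vocabulary can be retrieved from tfidf as a list using get_feature_names()
# NOTE: scikit-learn 1.0 deprecated get_feature_names() and 1.2 removed it;
#   on those versions use get_feature_names_out() instead.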
# Store the extracted vocabulary as vocabulary
vocabulary=tfidf.get_feature_names()

# Sklearn joins bigrams with a space character.
# To make output easier to read, replace all spaces in our vocabulary list with underscores.
# To do this we can use the string replace() method.
# For example x.replace(' ','_') will replace all ' ' in x with '_'.
# Store the result back in vocabulary.
vocabulary=[x.replace(' ','_') for x in vocabulary]

# Print the last 5 terms in the vocabulary
vocabulary[-5:]

['zipper_pockets', 'zippered', 'zippers', 'zirconia', 'zone']

In [6]:
# 5. (4pts) Perform Topic Modeling with LDA

# Now that we have our vectorized data, we can use Latent Dirichlet Allocation to learn 
#   per-document topic distributions and per-topic term distributions.
# Though there are likely more, we'll model our dataset using 20 topics to keep things small.
# We'd like the model to run on all of the cores available in the machine we're using.
#    `n_jobs` tells the model how many cores to use, while `n_jobs=-1` indicates use all available.
# We'd also like the results to always be the same, so set random_state=123

# Import LatentDirichletAllocation from sklearn.decomposition
from sklearn.decomposition import LatentDirichletAllocation

# Instantiate a LatentDirichletAllocation model with
#    n_components=20, n_jobs=-1, random_state=123
# Store as lda
lda=LatentDirichletAllocation(n_components=20, n_jobs=-1, random_state=123)

# Run fit_transform on lda using X_tfidf.
# Store the output (the per-document topic distributions) as X_lda
# NOTE: this step may take a minute or more depending on your setup.
X_lda=lda.fit_transform(X_tfidf)

# Print the shape of X_lda (should be 5000 x 20)
X_lda.shape

(5000, 20)
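
As a rough illustration of what `fit_transform` and `components_` hold, the sketch below fits a small LDA model on random counts. The data, sizes, and `toy_*` names are assumptions made up for this example. In scikit-learn, the rows returned by `fit_transform` are per-document topic distributions that sum to 1, while `components_` is unnormalized and has to be divided by its row sums to be read as per-topic term distributions.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical document-term counts: 6 documents x 8 terms.
rng = np.random.RandomState(0)
toy_counts = rng.randint(0, 5, size=(6, 8))

toy_lda = LatentDirichletAllocation(n_components=3, random_state=0)
toy_doc_topics = toy_lda.fit_transform(toy_counts)

# One row per document, one column per topic; each row sums to 1.
print(toy_doc_topics.shape)
print(toy_doc_topics.sum(axis=1))

# Normalize components_ row-wise to obtain per-topic term distributions.
toy_topic_terms = toy_lda.components_ / toy_lda.components_.sum(axis=1, keepdims=True)
print(toy_topic_terms.sum(axis=1))
```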

In [7]:
# 6. (4pts) Print Top Topic Terms

# To get a sense of what each topic is composed of, we can print the most likely terms for each topic.
# We'd like a print statement that looks like this:
#    Topic #0 upper sole rubber synthetic rubber_sole
#
# For each topic print 'Topic #{idx} ' followed by the top 5 most likely terms in that topic.
# Hints: 
#   Use vocabulary created above, but first convert from a list to np.array to make indexing easier
#   The per topic term distributions are stored in model.components_
#   np.argsort returns the indices of an np.array sorted by their value, in ascending order
#   [::-1] reverses the order of an np.array
topic_words = {}
terms=np.array(vocabulary)
for topic, comp in enumerate(lda.components_):
    word_idx = np.argsort(comp)[::-1][:5]
    topic_words[topic] = [terms[i] for i in word_idx]
for topic, words in topic_words.items():
    print(f'Topic #{topic} '+', '.join(words))

Topic #0 upper, sole, rubber, synthetic, rubber_sole
Topic #1 fringe, cups, swim, band, partially_lined
Topic #2 wash, dry, line, line_dry, wash_line
Topic #3 length, approx, washable_imported, washable, length_polyester
Topic #4 metal, jewelry, photos, enlarged, photos_enlarged
Topic #5 flats, upper, sole, thermoplastic, upper_lace
Topic #6 elastane, seat, thigh, elastane_washable, belt
Topic #7 clean, imported, measures, easy, wipe
Topic #8 wicking, moisture, moisture_wicking, safe, dri
Topic #9 wedge, button_placket, upper, resistant, textile
Topic #10 support, bra, imported_manufacturer, manufacturer_style, manufacturer
Topic #11 clean_imported, clean, spot, spot_clean, polyester
Topic #12 cotton, sleeves, washable_imported, washable, short
Topic #13 cool_casual, tee_features, adhesive, self_adhesive, faux_drawstring
Topic #14 pockets, button, fly, cotton, closure
Topic #15 rug, resistant, yes, indoor, backing
Topic #16 wine, sunglasses, dry_america, frames, drawers
Topic #17 stain
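
The `np.argsort(...)[::-1][:k]` pattern used above can be checked on a tiny example. The vocabulary and weights here are hypothetical values chosen by hand, not taken from the fitted model.

```python
import numpy as np

# Hypothetical vocabulary and one topic's term weights.
toy_vocab = np.array(["rubber", "sole", "cotton", "sleeves"])
toy_weights = np.array([0.10, 0.70, 0.05, 0.15])

# argsort gives indices in ascending order of weight;
# reversing puts the largest weights first, and [:2] keeps the top two.
top_idx = np.argsort(toy_weights)[::-1][:2]
top_terms = toy_vocab[top_idx]
print("Topic #0 " + " ".join(top_terms))
```

Indexing a NumPy array with an array of indices is what makes converting `vocabulary` to `np.array` convenient; a plain list would need a loop or list comprehension.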

## Generate recommendations using topics

In [8]:
# 7. (3pts) Generate Similarity Matrix

# We'll use Content Filtering to make recommendations based on a query product.
# Each product will be represented by its LDA topic weights learned above.
# We'd like to recommend similar products in LDA space.
# We'll use cosine_similarity as our measure of similarity.

# From sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity

# Use cosine_similarity to generate similarity scores on our X_lda data
# Store as similarities.
# NOTE: we only need to pass X_lda in once,
#   the function will calculate pairwise similarity for all elements in that matrix
similarities=cosine_similarity(X_lda)

# print the shape of the similarities matrix (should be 5000x5000)
similarities.shape

(5000, 5000)
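
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below compares `cosine_similarity` with that formula computed by hand on three made-up topic vectors; the values and `toy_*` names are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical per-document topic weights: 3 documents x 3 topics.
toy_topics = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
])

# Passing a single matrix computes all pairwise row similarities.
toy_sims = cosine_similarity(toy_topics)

# Manual version: scale each row to unit length, then take dot products.
unit_rows = toy_topics / np.linalg.norm(toy_topics, axis=1, keepdims=True)
manual_sims = unit_rows @ unit_rows.T

print(np.round(toy_sims, 3))
print(np.allclose(toy_sims, manual_sims))
```

Documents 0 and 1 put most of their weight on the same topic, so their similarity is much higher than that between documents 0 and 2. The diagonal is always 1, which is why a query product ranks first among its own recommendations.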

In [9]:
# 8. (4pts) Generate Recommendations

# Let's test our proposed recommendation engine using the product at row 0 in df_products.
#   The name of this product is "Alfred Dunner® Essential Pull On Capri Pant"

# Print the names for the top 10 most similar products to this query.
# Suggested way to do this is:
#   get the cosine similarities from row 0 of the similarities matrix
#   get the indices of this array sorted by value using np.argsort
#   reverse the order of these indices (remember, we want high values and np.argsort evaluates ascending)
#   get the first 10 indexes from this reversed array
#   use those indices to index into df_products.name_title and print the result

# HINT: The first two products should be:
#   'Alfred Dunner® Essential Pull On Capri Pant', (the original query product)
#   'Alfred Dunner® Pull-On Pants - Plus',
df_products.name_title[np.argsort(similarities[0,:])[::-1][:10]]

0             Alfred Dunner® Essential Pull On Capri Pant
2091                  Alfred Dunner® Pull-On Pants - Plus
3620                         Alfred Dunner® Pull-On Pants
662                           Alfred Dunner® Pull On Pant
2246      Alfred Dunner® Sao Paolo Pull-On Pants - Petite
3390                        Stylus™ Stretch Bootcut Jeans
1146           St. John's Bay® Linen Cropped Pants - Plus
1578    City Streets® Colorblock Performance Cropped L...
1838                                 Stylus™ Linen Shorts
3169                   Liz Claiborne® Jogger Pants - Tall
Name: name_title, dtype: object
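
The lookup above can be wrapped in a small helper. This is only a sketch: the function name `recommend`, its parameters, and the toy data are hypothetical and not part of the assignment. It follows the same argsort-and-reverse steps, with an option to drop the query product from its own results.

```python
import numpy as np
import pandas as pd

def recommend(names, similarities, query_idx, k=10, include_query=False):
    """Return the names of the k items most similar to the item at query_idx."""
    # Sort indices by similarity to the query, highest first.
    order = np.argsort(similarities[query_idx])[::-1]
    if not include_query:
        # Remove the query itself, which otherwise ranks first.
        order = order[order != query_idx]
    return names.iloc[order[:k]]

# Hypothetical product names and a hand-written similarity matrix.
toy_names = pd.Series(["Capri Pant", "Pull-On Pants", "Area Rug", "Bootcut Jeans"])
toy_similarities = np.array([
    [1.0, 0.9, 0.1, 0.6],
    [0.9, 1.0, 0.2, 0.5],
    [0.1, 0.2, 1.0, 0.1],
    [0.6, 0.5, 0.1, 1.0],
])

toy_recs = recommend(toy_names, toy_similarities, query_idx=0, k=2)
print(toy_recs)
```

With the notebook's variables, a call such as `recommend(df_products.name_title, similarities, 0, include_query=True)` should reproduce the ordering shown above, assuming the dataframe keeps its default integer index.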