## Goals 🎯

The project can be cut into three steps :

1. Identify groups of products that have similar descriptions.

2. Use the groups of similar products to build a simple recommender system algorithm.

3. Use topic modeling algorithms to automatically assess the latent topics present in the item descriptions.

In [1]:
import pandas as pd
import numpy as np
import spacy

from spacy.lang.en.stop_words import STOP_WORDS
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN
from sklearn.decomposition import TruncatedSVD

import matplotlib.pyplot as plt
import wordcloud

nlp = spacy.load("en_core_web_sm")

In [2]:
data = pd.read_csv("src/North_face_data.csv")
data.head()

Unnamed: 0,id,description
0,1,Active classic boxers - There's a reason why o...
1,2,Active sport boxer briefs - Skinning up Glory ...
2,3,Active sport briefs - These superbreathable no...
3,4,"Alpine guide pants - Skin in, climb ice, switc..."
4,5,"Alpine wind jkt - On high ridges, steep ice an..."


In [3]:
data.shape

(500, 2)

In [4]:
print('Description du 1er produit : ',data.loc[0,'description'])

Description du 1er produit :  Active classic boxers - There's a reason why our boxers are a cult favorite - they keep their cool, especially in sticky situations. The quick-drying, lightweight underwear takes up minimal space in a travel pack. An exposed, brushed waistband offers next-to-skin softness, five-panel construction with a traditional boxer back for a classic fit, and a functional fly. Made of 3.7-oz 100% recycled polyester with moisture-wicking performance. Inseam (size M) is 4 1/2". Recyclable through the Common Threads Recycling Program.<br><br><b>Details:</b><ul> <li>"Silky Capilene 1 fabric is ultralight, breathable and quick-to-dry"</li> <li>"Exposed, brushed elastic waistband for comfort"</li> <li>5-panel construction with traditional boxer back</li> <li>"Inseam (size M) is 4 1/2"""</li></ul><br><br><b>Fabric: </b>3.7-oz 100% all-recycled polyester with Gladiodor natural odor control for the garment. Recyclable through the Common Threads Recycling Program<br><br><b>Wei

### Preprocessing of textual data

In [5]:
data['clean_documents'] = data['description'].str.replace(r"[^A-Za-z0-9 ]+", " ")
data['clean_documents'] = data['description'].str.replace(r"<[a-z/]+>", " ") 
data['clean_documents'] = data['description'].str.replace(r"<[a-z/]+>", " ") 
data['clean_documents'] = data['clean_documents'].str.replace(r"[^A-Za-z]+", " ") 
data['clean_documents'] = data['clean_documents'].fillna('').apply(lambda x: x.lower())
data['clean_documents'] = data['clean_documents'].str.replace("br", " ")
data['clean_documents'] = data['clean_documents'].str.replace("ul", " ")
data['clean_documents'] = data['clean_documents'].str.replace("li", " ")
data.head()

Unnamed: 0,id,description,clean_documents
0,1,Active classic boxers - There's a reason why o...,active classic boxers - there's a reason why o...
1,2,Active sport boxer briefs - Skinning up Glory ...,active sport boxer iefs - skinning up glory r...
2,3,Active sport briefs - These superbreathable no...,active sport iefs - these super eathable no-f...
3,4,"Alpine guide pants - Skin in, climb ice, switc...","alpine guide pants - skin in, c mb ice, switch..."
4,5,"Alpine wind jkt - On high ridges, steep ice an...","alpine wind jkt - on high ridges, steep ice an..."


In [8]:
print('Description du 1er produit : ',data.loc[0,'clean_documents'])

Description du 1er produit :  Active classic boxers There s a reason why our boxers are a cult favorite they keep their cool especially in sticky situations The quick drying lightweight underwear takes up minimal space in a travel pack An exposed brushed waistband offers next to skin softness five panel construction with a traditional boxer back for a classic fit and a functional fly Made of oz recycled polyester with moisture wicking performance Inseam size M is Recyclable through the Common Threads Recycling Program Details Silky Capilene fabric is ultralight breathable and quick to dry Exposed brushed elastic waistband for comfort panel construction with traditional boxer back Inseam size M is Fabric oz all recycled polyester with Gladiodor natural odor control for the garment Recyclable through the Common Threads Recycling Program Weight g oz Made in Mexico 


#### tokenization - lemmalization & stop word

In [9]:
data["clean_documents"] = data["clean_documents"].apply(lambda x: [token.lemma_ for token in nlp(x) if token.text not in STOP_WORDS])
data.head()



Unnamed: 0,id,description,clean_documents
0,1,Active classic boxers - There's a reason why o...,"[active, classic, boxer, there, s, reason, box..."
1,2,Active sport boxer briefs - Skinning up Glory ...,"[active, sport, boxer, brief, skin, Glory, req..."
2,3,Active sport briefs - These superbreathable no...,"[active, sport, brief, these, superbreathable,..."
3,4,"Alpine guide pants - Skin in, climb ice, switc...","[alpine, guide, pant, skin, climb, ice, switch..."
4,5,"Alpine wind jkt - On high ridges, steep ice an...","[alpine, wind, jkt, on, high, ridge, steep, ic..."


In [10]:
# on insére les mots dans une colonne nlp_description
data["nlp_description"] = [" ".join(x) for x in data['clean_documents']]
data.head()

Unnamed: 0,id,description,clean_documents,nlp_description
0,1,Active classic boxers - There's a reason why o...,"[active, classic, boxer, there, s, reason, box...",active classic boxer there s reason boxer cult...
1,2,Active sport boxer briefs - Skinning up Glory ...,"[active, sport, boxer, brief, skin, Glory, req...",active sport boxer brief skin Glory require mo...
2,3,Active sport briefs - These superbreathable no...,"[active, sport, brief, these, superbreathable,...",active sport brief these superbreathable fly b...
3,4,"Alpine guide pants - Skin in, climb ice, switc...","[alpine, guide, pant, skin, climb, ice, switch...",alpine guide pant skin climb ice switch rock t...
4,5,"Alpine wind jkt - On high ridges, steep ice an...","[alpine, wind, jkt, on, high, ridge, steep, ic...",alpine wind jkt on high ridge steep ice alpine...


#### TF-IDF vector : term frequency-inverse document frequency 

In [None]:
vectorizer = TfidfVectorizer(stop_words='english', smooth_idf=True)
X = vectorizer.fit_transform(data['nlp_description'])


In [None]:
# densité du tableau

dense = X.toarray()
dense


In [None]:
# On obtient une matrice sparse avec de 500 lignes et 3761 colonnes  => 3761 mots

print(X.shape)

In [None]:
# dictionnaire d'occurences

#vectorizer.vocabulary_

### Part 1 : Groups of products with similar descriptions

In [None]:
# Instanciate DBSCAN 
db_cluster = DBSCAN(eps=0.7, min_samples=4, metric="cosine", algorithm="brute")

# Fit on data 
## No need to normalize data, it already is! 
db_cluster.fit(dense)

In [None]:
dense[:5,:5]

In [None]:
# nb de cluster

#db_cluster.labels_

In [None]:
# pr chaque doc mettre nous avons le cluster

data['cluster'] = db_cluster.labels_

data.head()

In [None]:
# -1 est du bruit = outlier

data.cluster.value_counts()

In [None]:
# visualisation des clusters

#fig = px.scatter_mapbox(
 #       data[data.cluster != -1], 
  #      lat="lat", 
  #      lon="lng",
  #      color="cluster",
  #      mapbox_style="carto-positron"
#)

 #fig.show()

In [None]:
# Word cloud pour voir les clusters

wd = wordcloud.WordCloud()
for c in data['cluster'].value_counts().index[:20] :
    print("CLUSTER ", c)
    texts = " ".join(data.loc[data['cluster']==c,'nlp_description'])
    cloud = wd.generate(texts)
    plt.imshow(cloud)
    plt.show()
    print('-----------')

#### Part 2 - Recommender system

🎯

Then, you can use the cluster ids from part 1 to build a recommender system. The aim is to be able to suggest to a user some products that are similar to the ones he is interested in. To do this, we will consider that products belonging to the same cluster are similar.

Ensuite, vous pouvez utiliser les identifiants de cluster de la partie 1 pour créer un système de recommandation. Le but est de pouvoir proposer à un utilisateur des produits similaires à ceux qui l'intéressent. Pour ce faire, nous considérerons que les produits appartenant à un même cluster sont similaires.

Create a function named find_similar_items that takes an argument item_id representing the id of a product, and that returns a list of 5 item ids that belong to the same cluster as the product passed in argument
Use python's input() function to allow the user to choose a product and perform some suggestions of similar items

In [None]:
def find_similar_product(product_id):
    """
    liste de 5 id de produit qui appartiennent au même cluster
    """
    choose_product = data.loc[data['id']==product_id, 'cluster'].values[0] # choix du produit par l'id
    similar_product = data.loc[data['cluster']==choose_product,:].sample(5) # 5 lignes du dataframes au hasard
    id_similar_product = similar_product['id']
    return id_similar_product
  

In [None]:
pd.set_option('display.max_rows', data.shape[0]+1)
print(data[['id','description']])


In [None]:
a = data.loc[data['id']==3, 'cluster'].values[0]
b = data.loc[data['cluster']== a,:].sample(5)
c = b['id']

for i in c:
    print(i)

In [None]:
#data[data.id == 50]

In [None]:
product_id = int(input("indiquer l'id du produit que vous désirez ? "))
print("")

try:
    choose = find_similar_product(product_id)
except:
    print("Produit non trouvé")
else:
    print('Produit selectionné : ',data.loc[product_id,'nlp_description'])

    print("-------")

    print("")
    print("")
    print("D'aprés l'analyse de vos choix, nous vous proposons 5 produits similaires :")
    print("")

    for i in choose:
        print(data.loc[i,'nlp_description'])
        print("-------")

#### Topic modeling

🎯
This part is independant from the two others.
The aim is to use an LSA model to automatically extract latent topics in the products descriptions.

Cette partie est indépendante des deux autres.
L'objectif est d'utiliser un modèle LSA pour extraire automatiquement les sujets latents dans les descriptions des produits.

In [None]:
# SVD represent documents and terms in vectors 
svd_model = TruncatedSVD(n_components=10, algorithm='randomized', n_iter=100, random_state=122)
lsa = svd_model.fit_transform(X)

topic_encoded_df = pd.DataFrame(lsa, columns = ["topic_" + str(i) for i in range(lsa.shape[1])])
topic_encoded_df["documents"] = data['nlp_description']
topic_encoded_df.head(2)

In [None]:
def extract_main_topics(x):
    topics = np.abs(x)
    main_topic = topics.sort_values(ascending=False).index[0]
    return main_topic

# Initialize column `main_topics` with NANs
topic_encoded_df.loc[:, 'main_topic'] = np.nan

for i, row in topic_encoded_df.iloc[:,:-2].iterrows():
    topic_encoded_df.loc[i, 'main_topic'] = extract_main_topics(row)

topic_encoded_df.head(2)

In [None]:
# resultat LSA
topic_encoded_df['main_topic'].value_counts()

In [None]:
# mots plus fréquents

# Create DataFrame containing the description of each topic in terms of the words in the vocabulary
topics_description = pd.DataFrame(svd_model.components_, columns = vectorizer.get_feature_names(), 
                                  index = ['topic_' + str(i) for i in range(svd_model.components_.shape[0])])

# Compute absolute values of coefficients
topics_description = topics_description.apply(np.abs, axis = 1)

topics_description.head()

In [None]:
# Loop over each topic and print 5 most important words
for i,row in topics_description.iterrows():
    print('TOPIC ', i)
    
    print(row.sort_values(ascending=False)[0:10].index.tolist())

In [None]:
svd_model.explained_variance_ratio_

In [None]:
#https://www.kaggle.com/pierrelouisdanieau/nlp-clustering-recommender-system-lsa

In [39]:
npr = pd.read_csv('src/North_face_data.csv')
npr.head()

Unnamed: 0,id,description
0,1,Active classic boxers - There's a reason why o...
1,2,Active sport boxer briefs - Skinning up Glory ...
2,3,Active sport briefs - These superbreathable no...
3,4,"Alpine guide pants - Skin in, climb ice, switc..."
4,5,"Alpine wind jkt - On high ridges, steep ice an..."


In [40]:
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

dtm = tfidf.fit_transform(npr['description'])

dtm

<500x2627 sparse matrix of type '<class 'numpy.float64'>'
	with 32658 stored elements in Compressed Sparse Row format>

In [41]:
from sklearn.decomposition import NMF

In [42]:
# on entraine le modele : peut prendre du temps si gros dataset
nmf_model.fit(dtm)

NMF(n_components=10, random_state=42)

In [43]:
for index,topic in enumerate(nmf_model.components_):
    print(f'THE TOP 10 WORDS FOR TOPIC #{index}')
    print([tfidf.get_feature_names()[i] for i in topic.argsort()[-10:]])
    print('\n')

THE TOP 10 WORDS FOR TOPIC #0
['shoulder', 'zippered', 'polyester', 'compartment', 'water', 'polyurethane', 'strap', 'denier', 'pocket', 'mesh']


THE TOP 10 WORDS FOR TOPIC #1
['cotton', 'organic', 'free', 'shoulder', 'inks', 'pvc', 'phthalate', 'ringspun', 'taped', 'shirt']


THE TOP 10 WORDS FOR TOPIC #2
['beneath', 'baselayer', 'layers', 'brushed', 'garment', 'odor', 'natural', 'capilene', 'gladiodor', 'control']


THE TOP 10 WORDS FOR TOPIC #3
['waistband', 'closure', 'size', 'zip', 'fly', 'pants', 'nylon', 'shorts', 'pockets', 'inseam']


THE TOP 10 WORDS FOR TOPIC #4
['skin', 'naturally', 'dry', 'slow', 'machine', 'lay', 'odor', 'wash', 'wool', 'merino']


THE TOP 10 WORDS FOR TOPIC #5
['rise', 'blend', 'hips', 'improved', 'lined', 'nylon', 'coverage', 'spandex', '18', '82']


THE TOP 10 WORDS FOR TOPIC #6
['thailand', 'tencel', 'recyclable', 'common', 'threads', 'program', 'recycling', 'button', 'cotton', 'organic']


THE TOP 10 WORDS FOR TOPIC #7
['catalog', 'site', 'dimension

In [None]:
# new column to the original quora dataframe that labels each question into one of the 10 topic categories.

In [44]:
topic_results = nmf_model.transform(dtm)
topic_results.shape

(500, 10)

In [45]:
npr['Topic'] = topic_results.argmax(axis=1)

In [46]:
npr.head()

Unnamed: 0,id,description,Topic
0,1,Active classic boxers - There's a reason why o...,2
1,2,Active sport boxer briefs - Skinning up Glory ...,3
2,3,Active sport briefs - These superbreathable no...,2
3,4,"Alpine guide pants - Skin in, climb ice, switc...",3
4,5,"Alpine wind jkt - On high ridges, steep ice an...",8


In [47]:
npr.Topic.value_counts()

6    99
8    79
3    79
1    61
9    40
0    40
2    37
5    34
4    23
7     8
Name: Topic, dtype: int64

In [48]:
npr.description[0]

'Active classic boxers - There\'s a reason why our boxers are a cult favorite - they keep their cool, especially in sticky situations. The quick-drying, lightweight underwear takes up minimal space in a travel pack. An exposed, brushed waistband offers next-to-skin softness, five-panel construction with a traditional boxer back for a classic fit, and a functional fly. Made of 3.7-oz 100% recycled polyester with moisture-wicking performance. Inseam (size M) is 4 1/2". Recyclable through the Common Threads Recycling Program.<br><br><b>Details:</b><ul> <li>"Silky Capilene 1 fabric is ultralight, breathable and quick-to-dry"</li> <li>"Exposed, brushed elastic waistband for comfort"</li> <li>5-panel construction with traditional boxer back</li> <li>"Inseam (size M) is 4 1/2"""</li></ul><br><br><b>Fabric: </b>3.7-oz 100% all-recycled polyester with Gladiodor natural odor control for the garment. Recyclable through the Common Threads Recycling Program<br><br><b>Weight: </b>99 g (3.5 oz)<br><b

In [49]:
for index,topic in enumerate(nmf_model.components_):
    print(f'THE TOP 10 WORDS FOR TOPIC #{index}')
    print([tfidf.get_feature_names()[i] for i in topic.argsort()[-10:]])
    print('\n')

THE TOP 10 WORDS FOR TOPIC #0
['shoulder', 'zippered', 'polyester', 'compartment', 'water', 'polyurethane', 'strap', 'denier', 'pocket', 'mesh']


THE TOP 10 WORDS FOR TOPIC #1
['cotton', 'organic', 'free', 'shoulder', 'inks', 'pvc', 'phthalate', 'ringspun', 'taped', 'shirt']


THE TOP 10 WORDS FOR TOPIC #2
['beneath', 'baselayer', 'layers', 'brushed', 'garment', 'odor', 'natural', 'capilene', 'gladiodor', 'control']


THE TOP 10 WORDS FOR TOPIC #3
['waistband', 'closure', 'size', 'zip', 'fly', 'pants', 'nylon', 'shorts', 'pockets', 'inseam']


THE TOP 10 WORDS FOR TOPIC #4
['skin', 'naturally', 'dry', 'slow', 'machine', 'lay', 'odor', 'wash', 'wool', 'merino']


THE TOP 10 WORDS FOR TOPIC #5
['rise', 'blend', 'hips', 'improved', 'lined', 'nylon', 'coverage', 'spandex', '18', '82']


THE TOP 10 WORDS FOR TOPIC #6
['thailand', 'tencel', 'recyclable', 'common', 'threads', 'program', 'recycling', 'button', 'cotton', 'organic']


THE TOP 10 WORDS FOR TOPIC #7
['catalog', 'site', 'dimension