# Recommendation System for Amazon Clothing Products
---
## 4. Content Based Recommendation System

*Author*: Mariam Elsayed

*Contact*: mariamkelsayed@gmail.com

*Notebook*: 4 of 5

*Previous Notebook*: `popularity_rec.ipynb`

*Next Notebook*: `colab_rec.ipynb`

In [133]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity 

## Table of Contents

* [Introduction](#intro)

* [Loading the Data](#loading)

* [Content Based Recommendation System](#content_rec)

    * [Using Cosine Similarity](#cosine)

    * [Creating a General Function](#function)

* [Conclusion](#conc)

## Introduction <a class="anchor" id="intro"></a>

Next, let's create a content-based recommendation system. It ranks products by how similar their descriptions are to the description of a given product.

## Loading the Data <a class="anchor" id="loading"></a>

This recommendation system will use the products data created in the preprocessing notebooks.

In [134]:
# Reading the data
products_df = pd.read_csv('Data/products_cleaned.csv')

In [135]:
products_df

Unnamed: 0,category,description,title,brand,rank,asin,imageURL,price,maincat_Luggage & Travel Gear,maincat_Backpacks,...,subcat_Shoes,subcat_Handbags & Wallets,subcat_Girls,subcat_Boys,"subcat_Shoe, Jewelry & Watch Accessories",subcat_Jewelry Accessories,subcat_Shoe Care & Accessories,subcat_Contemporary & Designer,subcat_Travel Accessories,"subcat_Surf, Skate & Street"
0,"['Clothing, Shoes & Jewelry', 'Luggage & Trave...",The Hottest Bag in Town! Brand: Anello Conditi...,Japan Anello Backpack Unisex PINK BEIGE LARGE ...,Anello,3994472,0204444454,['https://images-na.ssl-images-amazon.com/imag...,70.00,True,True,...,False,False,False,False,False,False,False,False,False,False
1,"['Clothing, Shoes & Jewelry', 'Luggage & Trave...",The Hottest Bag in Town! Brand: Anello Conditi...,Japan Anello Backpack Unisex BLACK LARGE PU LE...,Anello,635761,0204444403,['https://images-na.ssl-images-amazon.com/imag...,65.99,True,True,...,False,False,False,False,False,False,False,False,False,False
2,"['Clothing, Shoes & Jewelry', 'Novelty & More'...",Brand New. Hat Centre Length: adult about 8cm...,bettyhome Unisex Adult Winter Spring Thicken C...,bettyhome,5061041,0206313535,['https://images-na.ssl-images-amazon.com/imag...,18.99,False,False,...,False,False,False,False,False,False,False,False,False,False
3,"['Clothing, Shoes & Jewelry', 'Women', 'Clothi...",Please allow 1-2cm dimension deviation. 100% b...,bettyhome Womens Lace Short Sleeves Top Printi...,bettyhome,10635107,0206335962,['https://images-na.ssl-images-amazon.com/imag...,23.99,False,False,...,False,False,False,False,False,False,False,False,False,False
4,"['Clothing, Shoes & Jewelry', 'Luggage & Trave...",2 Way Shoulder Handle Polyester Canvas Boston ...,Japan Anello LARGE CAMO 2 Way Unisex Shoulder ...,Anello,1615335,024444448X,['https://images-na.ssl-images-amazon.com/imag...,65.33,True,True,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
678727,"['Clothing, Shoes & Jewelry', 'Men', 'Accessor...",Get maximum protection from the elements. Clif...,Kaenon Cliff Sunglasses - Select Color,Kaenon,1908928,B01HJH6SA2,['https://images-na.ssl-images-amazon.com/imag...,199.00,False,False,...,False,False,False,False,False,False,False,False,False,False
678728,"['Clothing, Shoes & Jewelry', 'Men', 'Shoes', ...",A classic wingtip with a subtle twist that cov...,Deer Stags Men's Hampden Oxford,Unknown,956501,B01HJH7W0W,['https://images-na.ssl-images-amazon.com/imag...,52.38,False,False,...,True,False,False,False,False,False,False,False,False,False
678729,"['Clothing, Shoes & Jewelry', 'Women', 'Clothi...",Our classy but sexy strappy sheath cocktail wi...,Laundry by Shelli Segal Women's Fitted Strappy...,Unknown,3633844,B01HJI0G5Y,['https://images-na.ssl-images-amazon.com/imag...,18.99,False,False,...,False,False,False,False,False,False,False,False,False,False
678730,"['Clothing, Shoes & Jewelry', 'Baby', 'Baby Gi...",Size Length Hip*2 Age Advice 70 39.5 CM 30 CM ...,Newborn Baby Girl Bodysuit Lace Floral Romper ...,Hotone,1671980,B01HJHR8A6,['https://images-na.ssl-images-amazon.com/imag...,4.99,False,False,...,False,False,False,False,False,False,False,False,False,False


In [136]:
products_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 678732 entries, 0 to 678731
Data columns (total 62 columns):
 #   Column                                     Non-Null Count   Dtype  
---  ------                                     --------------   -----  
 0   category                                   678732 non-null  object 
 1   description                                678530 non-null  object 
 2   title                                      678732 non-null  object 
 3   brand                                      678732 non-null  object 
 4   rank                                       678732 non-null  int64  
 5   asin                                       678732 non-null  object 
 6   imageURL                                   550904 non-null  object 
 7   price                                      678732 non-null  float64
 8   maincat_Luggage & Travel Gear              678732 non-null  bool   
 9   maincat_Backpacks                          678732 non-null  bool   
 10  maincat_

In [137]:
# Dropping products with no description
products_df = products_df.dropna(axis=0, subset=['description'])

In [138]:
products_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 678530 entries, 0 to 678731
Data columns (total 62 columns):
 #   Column                                     Non-Null Count   Dtype  
---  ------                                     --------------   -----  
 0   category                                   678530 non-null  object 
 1   description                                678530 non-null  object 
 2   title                                      678530 non-null  object 
 3   brand                                      678530 non-null  object 
 4   rank                                       678530 non-null  int64  
 5   asin                                       678530 non-null  object 
 6   imageURL                                   550712 non-null  object 
 7   price                                      678530 non-null  float64
 8   maincat_Luggage & Travel Gear              678530 non-null  bool   
 9   maincat_Backpacks                          678530 non-null  bool   
 10  maincat_

## Content Based Recommendation System <a class="anchor" id="content_rec"></a>

Content-based recommendation systems use the features of the products to recommend products that are similar. In this case, most of the information describing a product is in its description. To quantify the descriptions, a TF-IDF matrix will be used.

TF-IDF stands for term frequency-inverse document frequency. Term frequency is the number of times a word appears in a document, and inverse document frequency down-weights words that appear in many documents of the corpus. A word therefore scores highly when it is frequent in one description but rare across the others.
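To make the weighting concrete, here is a minimal sketch on a toy corpus. The three descriptions are hypothetical and are not taken from the products data; the sketch only shows that terms appearing in fewer descriptions receive a larger IDF weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy descriptions, not taken from the products data
toy_descriptions = [
    "leather backpack with zipper",
    "canvas backpack with side pocket",
    "leather wallet",
]

toy_vectorizer = TfidfVectorizer(stop_words='english')
toy_matrix = toy_vectorizer.fit_transform(toy_descriptions)

# One row per description, one column per term kept in the vocabulary
print(toy_matrix.shape)

# Learned IDF weight per term: terms found in fewer descriptions get larger weights
idf_by_term = {
    term: toy_vectorizer.idf_[col]
    for term, col in toy_vectorizer.vocabulary_.items()
}
print(idf_by_term)
```

Here "wallet" appears in one description and "leather" in two, so "wallet" ends up with the larger IDF weight.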

### Example: The Anello Backpack <a class="anchor" id="cosine"></a>

Cosine similarity measures the similarity between two vectors as the cosine of the angle between them. In general it ranges from -1 to 1, but TF-IDF vectors have no negative entries, so here the score falls between 0 and 1, with 1 being the most similar. Let's use the backpack below as an example of how cosine similarity works.
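Before turning to the backpack, the following sketch computes cosine similarity by hand and with scikit-learn. The vectors `a`, `b` and `c` and the helper `cosine` are hypothetical and exist only for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical term-weight vectors for three products
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction as a, different length
c = np.array([0.0, 0.0, 3.0])   # shares no terms with a

def cosine(u, v):
    # Dot product divided by the product of the vector norms
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(a, b))   # vectors pointing the same way score (approximately) 1
print(cosine(a, c))   # vectors with no shared terms score 0

# scikit-learn computes the same quantity for every pair of rows at once
pairwise = cosine_similarity(np.vstack([a, b, c]))
print(pairwise)
```

Because only the direction of the vectors matters, a long description and a short one that use the same words in the same proportions still count as highly similar.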

In [139]:
products_df.loc[products_df['asin'] == '0204444454']

Unnamed: 0,category,description,title,brand,rank,asin,imageURL,price,maincat_Luggage & Travel Gear,maincat_Backpacks,...,subcat_Shoes,subcat_Handbags & Wallets,subcat_Girls,subcat_Boys,"subcat_Shoe, Jewelry & Watch Accessories",subcat_Jewelry Accessories,subcat_Shoe Care & Accessories,subcat_Contemporary & Designer,subcat_Travel Accessories,"subcat_Surf, Skate & Street"
0,"['Clothing, Shoes & Jewelry', 'Luggage & Trave...",The Hottest Bag in Town! Brand: Anello Conditi...,Japan Anello Backpack Unisex PINK BEIGE LARGE ...,Anello,3994472,204444454,['https://images-na.ssl-images-amazon.com/imag...,70.0,True,True,...,False,False,False,False,False,False,False,False,False,False


A TF-IDF matrix built from all products would be too large to work with, so let's restrict the dataframe to the category the item is in.
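A rough estimate illustrates the scale. The sketch below assumes the pairwise similarity matrix is stored densely as 64-bit floats, which gives an upper bound only: a sparse result can be smaller, depending on how many product pairs share terms. The helper `dense_similarity_bytes` is hypothetical.

```python
n_products = 678530        # rows left after dropping missing descriptions
n_backpacks = 2087         # rows in the Backpacks category

def dense_similarity_bytes(n_rows, bytes_per_value=8):
    # A dense pairwise similarity matrix has n_rows * n_rows float64 entries
    return n_rows * n_rows * bytes_per_value

print(f"All products: {dense_similarity_bytes(n_products) / 1e12:.2f} TB")
print(f"Backpacks:    {dense_similarity_bytes(n_backpacks) / 1e6:.1f} MB")
```

Under this assumption the full catalogue needs terabytes, while a single category of about two thousand products fits in tens of megabytes.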

In [141]:
products_backpacks = products_df[products_df['maincat_Backpacks'] == True]
products_backpacks

Unnamed: 0,category,description,title,brand,rank,asin,imageURL,price,maincat_Luggage & Travel Gear,maincat_Backpacks,...,subcat_Shoes,subcat_Handbags & Wallets,subcat_Girls,subcat_Boys,"subcat_Shoe, Jewelry & Watch Accessories",subcat_Jewelry Accessories,subcat_Shoe Care & Accessories,subcat_Contemporary & Designer,subcat_Travel Accessories,"subcat_Surf, Skate & Street"
0,"['Clothing, Shoes & Jewelry', 'Luggage & Trave...",The Hottest Bag in Town! Brand: Anello Conditi...,Japan Anello Backpack Unisex PINK BEIGE LARGE ...,Anello,3994472,0204444454,['https://images-na.ssl-images-amazon.com/imag...,70.00,True,True,...,False,False,False,False,False,False,False,False,False,False
1,"['Clothing, Shoes & Jewelry', 'Luggage & Trave...",The Hottest Bag in Town! Brand: Anello Conditi...,Japan Anello Backpack Unisex BLACK LARGE PU LE...,Anello,635761,0204444403,['https://images-na.ssl-images-amazon.com/imag...,65.99,True,True,...,False,False,False,False,False,False,False,False,False,False
4,"['Clothing, Shoes & Jewelry', 'Luggage & Trave...",2 Way Shoulder Handle Polyester Canvas Boston ...,Japan Anello LARGE CAMO 2 Way Unisex Shoulder ...,Anello,1615335,024444448X,['https://images-na.ssl-images-amazon.com/imag...,65.33,True,True,...,False,False,False,False,False,False,False,False,False,False
195,"['Clothing, Shoes & Jewelry', 'Luggage & Trave...",Carry your essential items around town in this...,AmeriLeather Miles Backpack,Amerileather,2990358,B00065EIT8,['https://images-na.ssl-images-amazon.com/imag...,75.99,True,True,...,False,False,False,False,False,False,False,False,False,False
891,"['Clothing, Shoes & Jewelry', 'Luggage & Trave...","Yak Pak's most popular backpack, the Student B...",Yak Pak 635 Basic Student Backpack - Black,Yak Pak,4404877,B00080LNUS,['https://images-na.ssl-images-amazon.com/imag...,19.52,True,True,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
677184,"['Clothing, Shoes & Jewelry', 'Luggage & Trave...","Specifically designed for girls,lovely pattern...",Dog Pawprint Cat Fingerprint Backpack for Elem...,MIFULGOO,155356,B01HGSLJKI,['https://images-na.ssl-images-amazon.com/imag...,4.58,True,True,...,False,False,False,False,False,False,False,False,False,False
677744,"['Clothing, Shoes & Jewelry', 'Luggage & Trave...",Kid's Teenage Mutant Ninja Turtles Out of the ...,Teenage Mutant Ninja Turtles Movie Out of The ...,Unknown,2683371,B01HHDG7L8,['https://images-na.ssl-images-amazon.com/imag...,13.99,True,True,...,False,False,False,False,False,False,False,False,False,False
677911,"['Clothing, Shoes & Jewelry', 'Luggage & Trave...",Withmystyle provide the latest trends and styl...,Casual Canvas Fashion Shool Backpack,Generic,1659770,B01HHNCBH2,['https://images-na.ssl-images-amazon.com/imag...,11.95,True,True,...,False,False,False,False,False,False,False,False,False,False
678252,"['Clothing, Shoes & Jewelry', 'Luggage & Trave...",Black & white canvas backpack from Fall Out Bo...,Fall Out Boy Mint Anchor Print Backpack,Unknown,1584668,B01HIGYDYW,['https://images-na.ssl-images-amazon.com/imag...,39.90,True,True,...,False,False,False,False,False,False,False,False,False,False


Let's look at the other backpacks with the same brand 'Anello'.

In [191]:
products_backpacks.loc[products_backpacks['brand'] == 'Anello']

Unnamed: 0,category,description,title,brand,rank,asin,imageURL,price,maincat_Luggage & Travel Gear,maincat_Backpacks,...,subcat_Shoes,subcat_Handbags & Wallets,subcat_Girls,subcat_Boys,"subcat_Shoe, Jewelry & Watch Accessories",subcat_Jewelry Accessories,subcat_Shoe Care & Accessories,subcat_Contemporary & Designer,subcat_Travel Accessories,"subcat_Surf, Skate & Street"
0,"['Clothing, Shoes & Jewelry', 'Luggage & Trave...",The Hottest Bag in Town! Brand: Anello Conditi...,Japan Anello Backpack Unisex PINK BEIGE LARGE ...,Anello,3994472,0204444454,['https://images-na.ssl-images-amazon.com/imag...,70.0,True,True,...,False,False,False,False,False,False,False,False,False,False
1,"['Clothing, Shoes & Jewelry', 'Luggage & Trave...",The Hottest Bag in Town! Brand: Anello Conditi...,Japan Anello Backpack Unisex BLACK LARGE PU LE...,Anello,635761,0204444403,['https://images-na.ssl-images-amazon.com/imag...,65.99,True,True,...,False,False,False,False,False,False,False,False,False,False
4,"['Clothing, Shoes & Jewelry', 'Luggage & Trave...",2 Way Shoulder Handle Polyester Canvas Boston ...,Japan Anello LARGE CAMO 2 Way Unisex Shoulder ...,Anello,1615335,024444448X,['https://images-na.ssl-images-amazon.com/imag...,65.33,True,True,...,False,False,False,False,False,False,False,False,False,False
525330,"['Clothing, Shoes & Jewelry', 'Luggage & Trave...","Light and using a durable polyester fabric, ru...",anello Porikyan capped Luc back zipper AT-B019...,Anello,844054,B016U0OPFY,['https://images-na.ssl-images-amazon.com/imag...,47.89,True,True,...,False,False,False,False,False,False,False,False,False,False
616110,"['Clothing, Shoes & Jewelry', 'Luggage & Trave...",Please note there are two deifferent types. 01...,anello #AT-B0197A small casual backpack napsac...,Anello,5382719,B01D2MK25U,['https://images-na.ssl-images-amazon.com/imag...,73.4,True,True,...,False,False,False,False,False,False,False,False,False,False
616112,"['Clothing, Shoes & Jewelry', 'Luggage & Trave...",The brand Anello means Growth ring in Italian....,anello #AT-B1212 backpack cream beige,Anello,5576780,B01D2MKBUQ,['https://images-na.ssl-images-amazon.com/imag...,72.99,True,True,...,False,False,False,False,False,False,False,False,False,False
616114,"['Clothing, Shoes & Jewelry', 'Luggage & Trave...",The brand Anello means Growth ring in Italian....,anello #AT-B1212 backpack brown,Anello,6286974,B01D2MKB38,['https://images-na.ssl-images-amazon.com/imag...,70.49,True,True,...,False,False,False,False,False,False,False,False,False,False
620765,"['Clothing, Shoes & Jewelry', 'Luggage & Trave...",This bag is small size. Please be careful as t...,anello #AT-B0197B small backpack with side poc...,Anello,377595,B01DLVYOPG,['https://images-na.ssl-images-amazon.com/imag...,38.0,True,True,...,False,False,False,False,False,False,False,False,False,False
632262,"['Clothing, Shoes & Jewelry', 'Luggage & Trave...",Popular No. 1 product with anello. When openin...,Anello Official Ruby Red Japan Fashion Shoulde...,Anello,858178,B01EBI8UH6,['https://images-na.ssl-images-amazon.com/imag...,9.22,True,True,...,False,False,False,False,False,False,False,False,False,False
632267,"['Clothing, Shoes & Jewelry', 'Luggage & Trave...",Anello is one of the top fashion backpack styl...,Anello Official Blue Japan Fashion Shoulder Ru...,Anello,1392217,B01EBI8XBO,['https://images-na.ssl-images-amazon.com/imag...,9.22,True,True,...,False,False,False,False,False,False,False,False,False,False


Now let's start vectorizing using the products_backpacks description column and finding the cosine similarities.

In [142]:
vectorizer = TfidfVectorizer(stop_words='english', min_df=5, lowercase=True)

TF_IDF_matrix = vectorizer.fit_transform(products_backpacks['description'])

In [143]:
TF_IDF_matrix.shape

(2087, 2000)

In [144]:
# Finding the similarities
similarities = cosine_similarity(TF_IDF_matrix, dense_output=False)

Let's create a dataframe that shows the products and their similarity to the first backpack.

In [145]:
# Row position of the product within products_backpacks. The dataframe's index
# labels come from products_df, so they do not match the rows of `similarities`.
product_index = np.flatnonzero((products_backpacks['asin'] == '0204444454').to_numpy())[0]

# defining dataframe containing product names and similarities
sim_df = pd.DataFrame(
    {
        'asin': products_backpacks['asin'],
        'product_name': products_backpacks['title'],
        'similarity': similarities[product_index, :].toarray().ravel()
    }
)

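Sorting by the highest similarities, we get our original backpack first. The second and third results are other Anello backpacks; the second scores 1.0, which means its description produces the same TF-IDF vector as the original. The remaining results score far lower, at roughly 0.31 and below.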

In [146]:
sim_df.sort_values(by='similarity', ascending=False).head(10)

Unnamed: 0,asin,product_name,similarity
0,0204444454,Japan Anello Backpack Unisex PINK BEIGE LARGE ...,1.0
1,0204444403,Japan Anello Backpack Unisex BLACK LARGE PU LE...,1.0
4,024444448X,Japan Anello LARGE CAMO 2 Way Unisex Shoulder ...,0.690546
627874,B01E27V99M,Women Girls Ladies Backpack Fashion Shoulder B...,0.314537
205967,B00EME7H5Q,SHENGXILU Women's/Lady's PU Leather Backpack G...,0.254859
472248,B012CHMA2Y,Soft PU Backpack School Bag Travel Bag Frozen ...,0.253976
403536,B00UAD3L6C,Soft PU Backpack Children's School Bag Travel ...,0.253976
405473,B00UHDNKDY,New Fashion Gold /Silver School Travel Gym Sho...,0.252526
252569,B00I49IZSI,Brixton Men's Carson Backpack,0.249477
388732,B00SWJI3WS,"Peppa Pig Large 16"" School Backpack(purple)",0.247832


### Creating a General Function <a class="anchor" id="function"></a>

Let's create a function that takes in a product and returns the most similar items in the products dataframe. This function follows a similar flow as the example above.

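Some product categories still yield a TF-IDF matrix that is too large, so when a category has more than 10,000 rows the function draws a random sample of 5,000 rows and adds the queried product back in if the sample missed it. Because the sample is random, results for large categories can differ between runs.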

In [187]:
def content_recommender_cosine(asin):

    '''
    Function that recommends similar items using cosine similarity

    INPUT: asin (str) - Unique identifier of product

    OUTPUT: dataframe containing top 10 most similar items
    '''
    # Getting product in dataframe
    product = products_df[products_df['asin'] == asin]

    if product.empty:
        raise ValueError(f"No product with asin '{asin}' in products_df")

    # Getting index label of product in dataframe
    product_index = product.index[0]

    ### Making dataframe contain only that category ###

    # Making dataframe containing OHEd main category
    maincat_df = products_df.filter(regex='^maincat_', axis=1)

    # Taking the first category as the maincat
    maincat = maincat_df.columns[maincat_df.loc[product_index] == True][0]

    # Making dataframe contain only that category
    maincat_products_df = products_df[products_df[maincat] == True]

    # Resampling if the dataframe is too large
    if maincat_products_df.shape[0] > 10000:

        maincat_products_df = maincat_products_df.sample(n=5000)

        # Adding the product back in if it was not sampled
        if asin not in maincat_products_df['asin'].values:

            maincat_products_df = pd.concat([maincat_products_df, product], axis=0)

    # Resetting index so the labels match the row positions of the TF-IDF matrix
    maincat_products_df = maincat_products_df.reset_index(drop=True)

    # Getting updated products index
    product_index = maincat_products_df[maincat_products_df['asin'] == asin].index[0]

    vectorizer = TfidfVectorizer(stop_words='english', min_df=30, lowercase=True)

    TF_IDF_matrix = vectorizer.fit_transform(maincat_products_df['description'])

    similarities = cosine_similarity(TF_IDF_matrix, dense_output=False)

    sim_df = pd.DataFrame(
        {
            'product_name': maincat_products_df['title'],
            'asin': maincat_products_df['asin'],
            'similarity': similarities[product_index, :].toarray().ravel()
        }
    )

    top_products_df = sim_df.sort_values(by='similarity', ascending=False).head(10)

    # Slicing off the prefix; str.lstrip would strip a set of characters, not the prefix
    print(f"Content-based recommendations for '{product.iloc[0]['title']}' in the {maincat[len('maincat_'):]} category")

    return top_products_df
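The function above computes the similarity between every pair of products even though only one row is used. As a possible alternative, the sketch below compares the query row against all rows, which keeps the result at 1 x n. The helper `top_k_similar` and the toy descriptions are hypothetical, and unlike the function above it leaves the query product out of its own recommendations.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_k_similar(descriptions, query_pos, k=10):
    # `descriptions` is a list of strings; `query_pos` is the row position of the query
    matrix = TfidfVectorizer(stop_words='english').fit_transform(descriptions)

    # Comparing one row against all rows yields a 1 x n result instead of n x n
    scores = cosine_similarity(matrix[[query_pos]], matrix).ravel()

    # Row positions sorted from most to least similar, excluding the query itself
    order = np.argsort(-scores, kind='stable')
    order = order[order != query_pos][:k]
    return [(int(pos), float(scores[pos])) for pos in order]

# Hypothetical toy descriptions, not taken from the products data
toy = [
    "waterproof canvas backpack",
    "canvas backpack for school",
    "leather dress shoes",
]
print(top_k_similar(toy, query_pos=0, k=2))
```

With this approach the memory needed grows linearly with the number of products in the category, so the sampling step may be unnecessary for some categories; that would need to be checked against the actual data.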

#### Testing Function
Let's test out the function on various products.

In [189]:
content_recommender_cosine('024444448X')

Content-based recommendations for 'Japan Anello LARGE CAMO 2 Way Unisex Shoulder Bag Poly Canvas Waterproof Campus' in the Luggage & Travel Gear category


Unnamed: 0,product_name,asin,similarity
2,Japan Anello LARGE CAMO 2 Way Unisex Shoulder ...,024444448X,1.0
0,Japan Anello Backpack Unisex PINK BEIGE LARGE ...,0204444454,0.683637
1,Japan Anello Backpack Unisex BLACK LARGE PU LE...,0204444403,0.683637
5224,Anchor Print Drawstring Backpack Sports Bag,B014QDHEAA,0.372056
155,Rothco Nato Canvas Medic Messenger Bag,B000T8ESPG,0.366854
3109,Chevron Print Zig Zag School Travel Backpack,B00KE9BG2I,0.347337
907,Marvel Thor The Mighty Avenger Movie Large Bac...,B004XE33D0,0.346837
1478,"ANGRY BIRDS 16"" LARGE SCHOOL BACKPACK",B0085GAF2E,0.331933
4053,"Peppa Pig Large 16"" School Backpack(purple)",B00SWJI3WS,0.322385
5763,Mossy Oak Cotton Canvas Buckle Backpack,B019FXV73G,0.318321


In [188]:
content_recommender_cosine('B004C0TY7E')

Content-based recommendations for 'eBags Shoe Sleeves with Drawstring - For Travel - Set of 2' in the Luggage & Travel Gear category


Unnamed: 0,product_name,asin,similarity
791,eBags Shoe Sleeves with Drawstring - For Trave...,B004C0TY7E,1.0
3055,eBags Shoe Sleeves with Drawstring - For Trave...,B00K20EV62,1.0
218,eBags Shoe Sleeves with Drawstring - For Trave...,B0013KFIXA,1.0
7012,"Travel Shoe Organizer Bags for Boots, High Hee...",B01HB0AZXS,0.500866
845,TUMI - Travel Accessories Shoe Bags - Luggage ...,B004LG4K46,0.490664
7056,"Shoe Bag, Bukm Portable Travel Friends Nylon S...",B01HHOIUEY,0.477676
7043,"Shoe Bag, Bukm Portable Travel Friends Nylon S...",B01HF2KPF0,0.477676
1334,2 Travel Shoe Bags Luggage Black Bag Golf Suit...,B00756V72M,0.442499
6653,HiDay 4pcs Portable Waterproof Travel Shoe Bag...,B01FA2KROQ,0.4346
6153,Lewis N. Clark Travel Shoe Covers,B01C4R7PG8,0.432948


## Conclusion <a class="anchor" id="conc"></a>

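The content-based recommender finds similar products by computing cosine similarity between rows of a TF-IDF matrix built from the product descriptions within a product's main category. A general function, `content_recommender_cosine`, wraps these steps and returns the 10 most similar items for a given `asin`.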

*Next Notebook*: `colab_rec.ipynb`