# Content-Based Filtering

## Import Libraries

In [1]:
!pip install opendatasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting opendatasets
  Downloading opendatasets-0.1.22-py3-none-any.whl (15 kB)
Installing collected packages: opendatasets
Successfully installed opendatasets-0.1.22


In [3]:
import pandas as pd
import numpy as np
import time
import opendatasets as od
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import euclidean_distances

## Download Dataset

In [7]:
od.download("https://www.kaggle.com/datasets/thedevastator/adidas-fashion-retail-products-dataset-9300-prod")

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: muhamaddani
Your Kaggle Key: ··········
Downloading adidas-fashion-retail-products-dataset-9300-prod.zip to ./adidas-fashion-retail-products-dataset-9300-prod


100%|██████████| 289k/289k [00:00<00:00, 645kB/s]







## Univariate Exploratory Data Analysis

### Adidas Product Variables

In [37]:
df_adidas = pd.read_csv("/content/adidas-fashion-retail-products-dataset-9300-prod/adidas_usa.csv")
df_adidas.head()

Unnamed: 0,index,url,name,sku,selling_price,original_price,currency,availability,color,category,...,source_website,breadcrumbs,description,brand,images,country,language,average_rating,reviews_count,crawled_at
0,0,https://www.adidas.com/us/beach-shorts/FJ5089....,Beach Shorts,FJ5089,40,,USD,InStock,Black,Clothing,...,https://www.adidas.com,Women/Clothing,Splashing in the surf. Making memories with yo...,adidas,"https://assets.adidas.com/images/w_600,f_auto,...",USA,en,4.5,35,2021-10-23 17:50:17.331255
1,1,https://www.adidas.com/us/five-ten-kestrel-lac...,Five Ten Kestrel Lace Mountain Bike Shoes,BC0770,150,,USD,InStock,Grey,Shoes,...,https://www.adidas.com,Women/Shoes,Lace up and get after it. The Five Ten Kestrel...,adidas,"https://assets.adidas.com/images/w_600,f_auto,...",USA,en,4.8,4,2021-10-23 17:50:17.423830
2,2,https://www.adidas.com/us/mexico-away-jersey/G...,Mexico Away Jersey,GC7946,70,,USD,InStock,White,Clothing,...,https://www.adidas.com,Kids/Clothing,"Clean and crisp, this adidas Mexico Away Jerse...",adidas,"https://assets.adidas.com/images/w_600,f_auto,...",USA,en,4.9,42,2021-10-23 17:50:17.530834
3,3,https://www.adidas.com/us/five-ten-hiangle-pro...,Five Ten Hiangle Pro Competition Climbing Shoes,FV4744,160,,USD,InStock,Black,Shoes,...,https://www.adidas.com,Five Ten/Shoes,The Hiangle Pro takes on the classic shape of ...,adidas,"https://assets.adidas.com/images/w_600,f_auto,...",USA,en,3.7,7,2021-10-23 17:50:17.615054
4,4,https://www.adidas.com/us/mesh-broken-stripe-p...,Mesh Broken-Stripe Polo Shirt,GM0239,65,,USD,InStock,Blue,Clothing,...,https://www.adidas.com,Men/Clothing,Step up to the tee relaxed. This adidas golf p...,adidas,"https://assets.adidas.com/images/w_600,f_auto,...",USA,en,4.7,11,2021-10-23 17:50:17.702680


In [10]:
df_adidas.describe()

Unnamed: 0,index,selling_price,average_rating,reviews_count
count,845.0,845.0,845.0,845.0
mean,422.0,53.192899,4.608402,426.178698
std,244.074784,31.411645,0.293795,1229.158277
min,0.0,9.0,1.0,1.0
25%,211.0,28.0,4.5,19.0
50%,422.0,48.0,4.7,68.0
75%,633.0,70.0,4.8,314.0
max,844.0,240.0,5.0,11750.0


In [40]:
print('Number of unique Adidas product names: ', len(df_adidas.name.unique()))
print('Number of categories: ', len(df_adidas.category.unique()))
print('Number of subcategories (breadcrumbs): ', len(df_adidas.breadcrumbs.unique()))

Number of unique Adidas product names:  431
Number of categories:  3
Number of subcategories (breadcrumbs):  22


## Data Preprocessing

### Selecting the Features to Use

The features used are `name`, `sku`, and `breadcrumbs`. In this case, `breadcrumbs` is preferred over `category` because its values are more varied (22 distinct values versus 3).
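The comparison behind this choice can be sketched by counting distinct values per column. The snippet below is a minimal illustration on a toy frame built from the five rows shown in the `head()` output earlier; the variable names `toy` and `distinct_counts` are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy rows copied from the head() output above; not the full dataset.
toy = pd.DataFrame({
    "category": ["Clothing", "Shoes", "Clothing", "Shoes", "Clothing"],
    "breadcrumbs": ["Women/Clothing", "Women/Shoes", "Kids/Clothing",
                    "Five Ten/Shoes", "Men/Clothing"],
})

# Count distinct values per candidate feature. A column with more distinct
# values can separate products at a finer granularity.
distinct_counts = toy[["category", "breadcrumbs"]].nunique()
print(distinct_counts)
```

On the full dataset the same call would be `df_adidas[["category", "breadcrumbs"]].nunique()`.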

In [13]:
adidass = df_adidas[["sku", "name", "breadcrumbs"]]
adidass.head()

Unnamed: 0,sku,name,breadcrumbs
0,FJ5089,Beach Shorts,Women/Clothing
1,BC0770,Five Ten Kestrel Lace Mountain Bike Shoes,Women/Shoes
2,GC7946,Mexico Away Jersey,Kids/Clothing
3,FV4744,Five Ten Hiangle Pro Competition Climbing Shoes,Five Ten/Shoes
4,GM0239,Mesh Broken-Stripe Polo Shirt,Men/Clothing


## Data Preparation

### Handling Missing Values

In [14]:
adidass.isnull().sum()

sku            0
name           0
breadcrumbs    0
dtype: int64

In [15]:
adidass.isna().sum()

sku            0
name           0
breadcrumbs    0
dtype: int64

The output above shows that the selected columns contain no *missing values*.

## Model Development with Content-Based Filtering

### TF-IDF Vectorizer

In [16]:
# Initialize TfidfVectorizer
tf = TfidfVectorizer()

# Compute the IDF weights on the breadcrumbs column
tf.fit(adidass['breadcrumbs'])

# Map integer feature indices to feature names
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
tf.get_feature_names_out().tolist()



['accessories',
 'clothing',
 'essentials',
 'five',
 'kids',
 'men',
 'originals',
 'running',
 'shoes',
 'soccer',
 'sportswear',
 'swim',
 'ten',
 'training',
 'women']

In [17]:
# Fit, then transform the data into a matrix
tfidf_matrix = tf.fit_transform(adidass['breadcrumbs'])

# Check the shape of the TF-IDF matrix
tfidf_matrix.shape

(845, 15)

In [18]:
# Convert the sparse TF-IDF matrix to a dense matrix with todense()
tfidf_matrix.todense()

matrix([[0.        , 0.7125031 , 0.        , ..., 0.        , 0.        ,
         0.70166897],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.74638166],
        [0.        , 0.52413986, 0.        , ..., 0.        , 0.        ,
         0.        ],
        ...,
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.74638166]])

In [38]:
# Build a DataFrame to inspect the TF-IDF matrix

pd.DataFrame(
    tfidf_matrix.todense(),
    columns=tf.get_feature_names_out(),
    index=adidass.name
).sample(10, axis=1).sample(5, axis=0)



name,accessories,clothing,kids,soccer,running,training,women,originals,swim,sportswear
Supernova Shoes,0.0,0.0,0.0,0.0,0.0,0.0,0.746382,0.0,0.0,0.0
ZX 1K Boost Shoes,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.908607,0.0,0.0
Adicolor Branded Webbing Waist Bag,0.671753,0.0,0.0,0.0,0.0,0.0,0.0,0.740775,0.0,0.0
EQ21 Run COLD.RDY Shoes,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Bryony Shoes,0.0,0.0,0.0,0.0,0.0,0.0,0.746382,0.0,0.0,0.0


### Cosine Similarity


In [20]:
def cosine_sim_handler(df_tfidf, series_title):
  # Compute cosine similarity on the TF-IDF matrix
  cosine_sim = cosine_similarity(df_tfidf)

  # Build a DataFrame from cosine_sim with product names as rows and columns
  df_cosine_sim = pd.DataFrame(cosine_sim, index=series_title, columns=series_title)

  # Return the similarity matrix for all products
  return df_cosine_sim

In [21]:
# Compute cosine similarity on the TF-IDF matrix
start = time.time()
cosine_sim_df = cosine_sim_handler(tfidf_matrix, adidass['name'])
cosine_exec_time = time.time() - start
print("Exec Time Cosine Similarity (Seconds) :", cosine_exec_time)

Exec Time Cosine Similarity (Seconds) : 0.019988536834716797


In [41]:
# Inspect the similarity matrix for a sample of products
print('Shape:', cosine_sim_df.shape)
cosine_sim_df.sample(5, axis=1).sample(10, axis=0)

Shape: (845, 845)


name,ZX 1K Boost Shoes,Court Tourino Shoes,Adicolor Classics Collegiate Tight Tee,Stan Smith Shoes,Superstar Shoes
Adicolor Classics 3-Stripes Crew Sweatshirt,0.0,0.0,0.482365,0.57254,0.57254
Adicolor Classics Collegiate Cropped Hoodie,0.523713,0.523713,1.0,0.0,0.0
Runfalcon 2.0 Shoes,0.316415,0.316415,0.0,0.298746,0.298746
NMD_R1 Spectoo Shoes,1.0,1.0,0.523713,0.418182,0.418182
Court Rallye Slip Shoes,0.277955,0.277955,0.0,0.262434,0.262434
Camo Everyday Shorts,0.0,0.0,0.482365,0.57254,0.57254
Marvel Predator Freak.1 Firm Ground Cleats,0.205888,0.205888,0.0,0.194391,0.194391
Bra Top,0.523713,0.523713,1.0,0.0,0.0
Graphics Camo Allover Print Tee,0.0,0.0,0.482365,0.57254,0.57254
Manga Short Sleeve Tee,0.0,0.0,0.482365,0.57254,0.57254


### Euclidean Distance

In [23]:
def euclidean_sim_handler(df_tfidf, series_title):
  # Compute Euclidean distance on the TF-IDF matrix
  euclidean_dist = euclidean_distances(df_tfidf)

  # Convert the distance to a similarity score in (0, 1]
  # Ref: https://stackoverflow.com/a/35216364
  f = lambda x: 1 / (1 + x)
  euclidean_sim = f(euclidean_dist)

  # Build a DataFrame from euclidean_sim with product names as rows and columns
  df_euclidean_sim = pd.DataFrame(euclidean_sim, index=series_title, columns=series_title)

  # Return the similarity matrix for all products
  return df_euclidean_sim

In [24]:
start = time.time()
euclidean_sim_df = euclidean_sim_handler(tfidf_matrix, adidass["name"])
euclidean_exec_time = time.time() - start
print("Exec Time Euclidean Similarity (Seconds) :", euclidean_exec_time)

Exec Time Euclidean Similarity (Seconds) : 0.029959440231323242


In [42]:
print('Shape:', euclidean_sim_df.shape)
euclidean_sim_df.sample(5, axis=1).sample(10, axis=0)

Shape: (845, 845)


name,Runner Tee,Superstar Shoes,Adizero 1/2 Zip Long Sleeve Tee,ZX 1K Boost Shoes,ZX 1K Boost Shoes
Sherpa Jacket,0.495667,0.414214,1.0,0.506073,0.506073
adidas Designed 2 Move AEROREADY Cropped Tee,0.495667,0.414214,1.0,0.506073,0.506073
Ozelia Shoes,0.519584,0.451559,0.414214,0.481065,0.481065
FARM Rio Print Relaxed Lightweight Windbreaker,0.495667,0.414214,1.0,0.506073,0.506073
Special 21 Shoes,0.414214,0.454194,0.506073,1.0,1.0
Solid Swim Shorts,1.0,0.414214,0.495667,0.414214,0.414214
Court Tourino Shoes,0.414214,0.454194,0.506073,1.0,1.0
Soft Floral Box Graphic Tank Top,0.495667,0.414214,1.0,0.506073,0.506073
Adilette Lite Slides,0.414214,0.454194,0.506073,1.0,1.0
Hamburg Shoes,0.414214,0.441298,0.414214,0.460987,0.460987


### Getting Recommendations

In [30]:
def product_recommendations(nama_produk, similarity_data, items=adidass, k=10):
 
    # Use argpartition to indirectly partition the similarity scores along the given axis
    # The DataFrame column is first converted to a NumPy array
    # range(start, stop, step)
    index = similarity_data.loc[:,nama_produk].to_numpy().argpartition(
        range(-1, -k, -1)) # Puts the largest scores at the end of the array, in ascending order
    
    # Take the items with the highest similarity from the resulting index
    closest = similarity_data.columns[index[-1:-(k+2):-1]]
    
    # Drop nama_produk so the queried product does not appear in the recommendations
    closest = closest.drop(nama_produk, errors='ignore')
 
    return pd.DataFrame(closest).merge(items).head(k)
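The dataset has 845 rows but only 431 unique product names, so looking items up and merging by `name` can return the same SKU more than once (note the repeated `Techfit Tights` rows in the outputs below). One possible alternative, sketched here under assumptions and not taken from the original notebook, is to work with row positions instead of names. The function `recommend_by_position` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def recommend_by_position(item_pos, similarity, items, k=10):
    """Return the k rows of `items` most similar to the row at `item_pos`.

    Hypothetical helper: `similarity` is a square array aligned with `items`.
    """
    scores = np.asarray(similarity)[item_pos].astype(float).copy()
    scores[item_pos] = -np.inf  # never recommend the query item itself
    # Stable sort on negated scores: highest first, ties keep original row order.
    top = np.argsort(-scores, kind="stable")[:k]
    return items.iloc[top].assign(score=scores[top])

# Toy catalogue with a duplicated product name ("Tee") but distinct SKUs.
toy_items = pd.DataFrame({
    "sku": ["A1", "A2", "A3", "A4"],
    "name": ["Tee", "Tee", "Shorts", "Shoes"],
    "breadcrumbs": ["Kids/Clothing", "Kids/Clothing", "Kids/Clothing", "Kids/Shoes"],
})
toy_sim = np.array([
    [1.0, 1.0, 1.0, 0.3],
    [1.0, 1.0, 1.0, 0.3],
    [1.0, 1.0, 1.0, 0.3],
    [0.3, 0.3, 0.3, 1.0],
])

recs = recommend_by_position(0, toy_sim, toy_items, k=2)
print(recs)
```

Each SKU appears at most once because rows are selected by position, and ties between equally similar items are broken deterministically by row order.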

In [43]:
adidass[adidass["name"].eq('Real Madrid Tee')]

Unnamed: 0,sku,name,breadcrumbs
761,GR4259,Real Madrid Tee,Kids/Clothing


#### Recommendations with Cosine Similarity

In [44]:
product_recommendations(
    nama_produk="Real Madrid Tee",
    similarity_data=cosine_sim_df
)

Unnamed: 0,name,sku,breadcrumbs
0,Graphic Tee and Shorts Set,EX3625,Kids/Clothing
1,Camo-Print SST Top,H20311,Kids/Clothing
2,Camo-Print Hoodie,H20312,Kids/Clothing
3,Marimekko Techfit Primegreen AEROREADY Trainin...,GV2052,Kids/Clothing
4,Techfit Tights,EY1067,Kids/Clothing
5,Techfit Tights,EY1068,Kids/Clothing
6,Techfit Tights,EY0319,Kids/Clothing
7,Techfit Tights,EY1066,Kids/Clothing
8,Techfit Tights,EY1067,Kids/Clothing
9,Techfit Tights,EY1068,Kids/Clothing


#### Recommendations with Euclidean Distance

In [45]:
product_recommendations(
    nama_produk="Real Madrid Tee",
    similarity_data=euclidean_sim_df
)

Unnamed: 0,name,sku,breadcrumbs
0,Graphic Tee and Shorts Set,EX3625,Kids/Clothing
1,Camo-Print SST Top,H20311,Kids/Clothing
2,Camo-Print Hoodie,H20312,Kids/Clothing
3,Marimekko Techfit Primegreen AEROREADY Trainin...,GV2052,Kids/Clothing
4,Techfit Tights,EY1067,Kids/Clothing
5,Techfit Tights,EY1068,Kids/Clothing
6,Techfit Tights,EY0319,Kids/Clothing
7,Techfit Tights,EY1066,Kids/Clothing
8,Techfit Tights,EY1067,Kids/Clothing
9,Techfit Tights,EY1068,Kids/Clothing


## Evaluation

$$\text{Recommender system precision (P)} = \frac{\text{\# of recommendations that are relevant}}{\text{\# of items recommended}} \times 100\% $$

The recommendations above show that `Real Madrid Tee` belongs to the `Kids/Clothing` category (breadcrumbs). A recommendation counts as relevant when it shares that category. For the 10 recommended products, the _precision_ of the _cosine similarity_ and _Euclidean distance_ models is as follows.
 
|Model | Relevant | Not Relevant | Total | Precision |
|---|---|---|---|---|
|_Cosine Similarity_|10|0|10|100%|
|_Euclidean Similarity_|10|0|10|100%|
 
The table above shows that the *Cosine Similarity* and *Euclidean Distance* models achieve the same _precision_ for the top-10 recommendations.
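The precision formula can be sketched as a small function. This is an illustrative helper, not code from the original notebook: `precision_at_k` is a hypothetical name, and it assumes a recommendation is relevant when its breadcrumbs value equals that of the queried product.

```python
def precision_at_k(recommended_labels, target_label, k=10):
    """Percentage of the top-k recommendations whose label matches the target."""
    top_k = list(recommended_labels)[:k]
    if not top_k:
        return 0.0
    relevant = sum(1 for label in top_k if label == target_label)
    return 100.0 * relevant / len(top_k)

# All 10 recommendations for "Real Madrid Tee" were in "Kids/Clothing".
recommended = ["Kids/Clothing"] * 10
print(precision_at_k(recommended, "Kids/Clothing"))
```

With a recommendation DataFrame, the labels would come from its `breadcrumbs` column.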

Besides _precision_, the computation time of each method should also be considered. The comparison is as follows:

In [46]:
df_exec_time_models = pd.DataFrame(index=['Time (Seconds)'],
    columns=['Cosine Similarity', 'Euclidean Similarity'])

df_exec_time_models['Cosine Similarity'] = [cosine_exec_time]
df_exec_time_models['Euclidean Similarity'] = [euclidean_exec_time]

df_exec_time_models

Unnamed: 0,Cosine Similarity,Euclidean Similarity
Time (Seconds),0.019989,0.029959


Based on the output above, Cosine Similarity (0.019989 seconds) computed faster than Euclidean Similarity (0.029959 seconds) in this run.
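A single `time.time()` measurement of a few hundredths of a second is sensitive to system noise, so a gap of about 0.01 seconds may not be reproducible. A more robust comparison could repeat each call several times and keep the best run. The sketch below does this with the standard-library `timeit` module on a synthetic matrix. The helper `best_time`, the matrix size, and the repeat counts are assumptions for illustration.

```python
import timeit

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# Synthetic stand-in for the TF-IDF matrix (the real one has shape (845, 15)).
rng = np.random.default_rng(0)
X = rng.random((200, 15))

def best_time(func, repeats=5, number=3):
    """Smallest average wall-clock time per call over several repeats."""
    runs = timeit.repeat(lambda: func(X), repeat=repeats, number=number)
    return min(runs) / number

timings = {
    "cosine": best_time(cosine_similarity),
    "euclidean": best_time(euclidean_distances),
}
print(timings)
```

Taking the minimum over repeats reduces the influence of background processes on the measurement.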

## Author Details

- Muhamad Dani
- M346X0902
- M06