# WEB ANALSIS SHOPEE - CREATE A RECOMMENDER SYSTEM USING KNN AND TF-TDF MODEL

In this project, we crawled and analyzed a dataset of over 5.000 products listed on Shopee and building a recommendation system for these products.
The dataset includes details of product such as product name, , price, rating, and user reviews.
The goal of this project is to understand customer preferences, identify purchasing patterns, and develop a recommendation system to suggest products to users based on their interests.

*This project will consist of the following steps :*

1. Data Collection : Collect the Shopee products dataset from Shopee's website through API
2. Data Preparation : Clean and preprocess the dataset for analysis.
3. Exploratory Data Analysis : Analyze the data to understand the distribution of products by categories, customer ratings, and reviews.
4. Data Visuallization : Visualize the data to identify trends and patterns
5. Build the Recommender system using TF-TDF model
6. Back-test model and conclusion

## Team members
1. Hoàng Ngọc Lan Hoa - K194131662
2. Vương Quốc Thịnh - K204111789
3. Nguyễn Ngọc Như Ý - K214110828
4. Lê Gia Huy - K214110798
5. Nguyễn Ngọc Thanh Trúc - K214131334
6. Bùi Phương Nguyên - K214130880
7. Trần Thị Nhàn - K214131988

## 1. Introduction

Recommender Systems - Item based collaborative filtering
1. In user based recommendation systems, habits of users can be changed. This situation makes hard to recommendation. However, in item based recommendation systems, movies or stuffs does not change. Therefore recommendation is easier.
2. On the other hand, there are almost 7 billion people all over the world. Comparing people increases the computational power. However, if items are compared, computational power is less.

## 2. Crawling Data

In [None]:
#import libraries
import requests
import json
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import pandas as pd
import time 

We use the chromedrive package to crawling data in this project

In [None]:
chrome_driver_path = r'C:\Users\Admin\Desktop\crawlData\chromedriver-win64\chromedriver-win64\chromedriver.exe'

fileNameBackupCsv = 'data_backup'
fileNameBackupJson = 'data_backup'
service = Service(chrome_driver_path)
driver = webdriver.Chrome(service=service)
chrome_options = Options()
chrome_options.add_argument("--headless")  
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--disable-extensions")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")

driver = webdriver.Chrome(options=chrome_options)
list_cate = {
    "Thời Trang Nữ": "11035639",
    "Thời Trang Nam": "11035567",
    "Thời Trang Trẻ Em": "11036382",
    "Giày Dép Nam":"11035801",
    "Giày Dép Nữ":"11035825",
    "Phụ Kiện Trang Sức Nữ":"11035853",
    "Ba Lô Túi Ví Nam":"11035741",
    "Đồng Hồ":"11035788",
    "Đồ Chơi":"11036932", 
    "Túi Ví Nữ":"11035761",
    "Sắc Đẹp":"11036279"
}

#save to csv/ excel file
fileNameBackupCsv = 'data_backup'
fileNameBackupJson = 'data_backup'
excel_file_path = 'output.xlsx'

all_dataframes = []

for key, cat_id in list_cate.items():
    print("Processing category:", key)

    # Gọi API với cat_id của từng danh mục
    api = f'https://shopee.vn/api/v4/recommend/recommend?bundle=category_landing_page&cat_level=1&catid={cat_id}&limit=5000000&offset=0'
    print('call api: ')
    print(api)
    response = requests.get(api)

    if response.status_code == 200:
        data = response.json()
        list_item = data['data']['sections'][0]['data']['item']

        # Convert danh sách thành DataFrame
        df = pd.DataFrame(list_item)

        # Xóa các cột không cần thiết
        columns_to_drop = [
              'shopid', 
     'itemid',
    'label_ids',
    'catid',
    'hidden_price_display',
    'has_lowest_price_guarantee',
    'is_category_failed',
    'size_chart',
    'video_info_list', 
    'reference_item_id',
    'transparent_background_image',
    'is_adult',
    'badge_icon_type',
    'shopee_verified',
    'is_official_shop',
    'show_official_shop_label',
    'show_shopee_verified_label',
    'show_official_shop_label_in_title',
    'is_cc_installment_payment_eligible',
    'is_non_cc_installment_payment_eligible',
    'coin_earn_label',
    'show_free_shipping',
    'preview_info',
    'coin_info',
    'exclusive_price_info',
    'bundle_deal_id',
     'is_group_buy_item',
    'has_group_buy_stock',
    'group_buy_info',
    'welcome_package_type',
    'welcome_package_info',
    'can_use_wholesale',
    'is_preferred_plus_seller',
    'has_model_with_available_shopee_stock',
    'is_on_flash_sale',
    'spl_installment_tenure',
    'is_live_streaming_price',
    'is_mart',
    'pack_size',
    'overlay_image',
    'autogen_title',
    'autogen_title_id',
    'overlay_id',
    'is_service_by_shopee',
    'free_shipping_info',
    'global_sold_count',
    'repurchase_rate',
    'best_selling_tag',
    'is_seller_configured',
    'tp_label',
    'flash_sale_design_style',
    'flash_sale_label_content',
    'flash_sale_sold_percentage',
    'info',
    'data_type',
    'key',
    'count',
    'adsid',
    'campaignid',
    'deduction_info',
    'video_display_control',
    'deep_discount_skin',
    'experiment_info',
    'relationship_label',
    'live_stream_session',
    'live_streaming_info',
    'new_user_label',
    'wp_eligibility',
    'platform_voucher',
    'rcmd_reason',
    'highlight_video',
    'can_use_cod',
    'pub_id',
    'pub_context_id',
    'friend_relationship_label',
    'showing_rs_label',
    'showing_friend_rs_label',
    'show_flash_sale_label',
    'search_id',
    'ext_info',
    'session_id',
    'algo_info',
    'hostid',
    'from',
    'view_cnt',
    'cover',
    'title',
    'avatar',
    'user_name',
    'play_url',
    'has_voucher',
    'has_draw',
    'draw_type',
    'has_streaming_price',
    'coins_per_claim',
    'play_url_expiration',
    'coins_can_claim_cnt',
    'item',
    'ui_type',
    'room_id',
    'user_id',
    'nick_name',
    'product_banners',
    'top_product_label',
    'fashion_item',
    'image_search',
    'generic_search_card'
        ]
        df = df.drop(columns=columns_to_drop, axis=1)

        # Thêm DataFrame vào danh sách
        df['Category'] = key
        all_dataframes.append(df)
        print("process success, will waiting 1 min")
        time.sleep(10)

    else:
        print('Get Api Fail:', response.status_code)
        print("process fail, will waiting 1 min")
        time.sleep(10)

# Ghép tất cả các DataFrame lại theo chiều dọc
final_dataframe = pd.concat(all_dataframes, ignore_index=True)

# Lưu vào file Excel
final_dataframe.to_excel(excel_file_path, index=False)

driver.quit()

## 2. Cleaning data

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as ptl
import sklearn as sk
import re
import streamlit as st 
from gensim import corpora, models, similarities
from PIL import Image
from scipy import stats
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.metrics import ndcg_score
import math

In [3]:
shopee = pd.read_excel('output.xlsx')
shopee.head(2)

Unnamed: 0,name,image,images,currency,stock,status,ctime,sold,historical_sold,liked,...,item_rating,item_type,can_use_bundle_deal,bundle_deal_info,add_on_deal_info,shop_location,voucher_info,shop_name,shop_rating,flash_sale_stock
0,Áo Sơ Mi Chất Nhung Tăm Nam Nữ Form Rộng Nâu B...,f50d3f92dd1ada0d005f6c5fdcab8c28,"['f50d3f92dd1ada0d005f6c5fdcab8c28', '9404417e...",VND,6553,1,1654401683,591,11742,False,...,"{'rating_star': 4.612869198312236, 'rating_cou...",0,False,,,Hà Nội,,NASU MAY -XƯỞNG MAY THỜI TRANG,4.564263,0
1,Quần short jean nữ trắng vải denim rách bền đẹ...,201dbe9de76b3d51ee162762a92d0fce,"['201dbe9de76b3d51ee162762a92d0fce', '263a7dec...",VND,6494,1,1652782242,1269,28852,False,...,"{'rating_star': 4.860752965446106, 'rating_cou...",0,False,,"{'add_on_deal_id': 213749830, 'add_on_deal_lab...",TP. Hồ Chí Minh,,AnNgo - Mê Jean Nữ,4.858771,0


In [4]:
#check the dataset's columns
shopee.columns

Index(['name', 'image', 'images', 'currency', 'stock', 'status', 'ctime',
       'sold', 'historical_sold', 'liked', 'liked_count', 'view_count',
       'brand', 'cmt_count', 'flag', 'cb_option', 'item_status', 'price',
       'price_min', 'price_max', 'price_min_before_discount',
       'price_max_before_discount', 'price_before_discount', 'show_discount',
       'raw_discount', 'discount', 'tier_variations', 'item_rating',
       'item_type', 'can_use_bundle_deal', 'bundle_deal_info',
       'add_on_deal_info', 'shop_location', 'voucher_info', 'shop_name',
       'shop_rating', 'flash_sale_stock'],
      dtype='object')

In [5]:
#drop unnescessary columns
shopee=shopee.drop(columns=['image', 'images', 'currency', 'stock', 'status', 'ctime', 'view_count', 'brand', 'flag', 'cb_option', 'item_status', 'price_min', 'price_max', 'price_min_before_discount','price_max_before_discount','price_before_discount','show_discount','raw_discount','item_type','can_use_bundle_deal','bundle_deal_info','add_on_deal_info','voucher_info','flash_sale_stock'], axis=1)
shopee.head()

Unnamed: 0,name,sold,historical_sold,liked,liked_count,cmt_count,price,discount,tier_variations,item_rating,shop_location,shop_name,shop_rating
0,Áo Sơ Mi Chất Nhung Tăm Nam Nữ Form Rộng Nâu B...,591,11742,False,6841,3792,7900000000,46%,"[{'name': 'Màu sắc', 'options': ['NÂU', 'BE NU...","{'rating_star': 4.612869198312236, 'rating_cou...",Hà Nội,NASU MAY -XƯỞNG MAY THỜI TRANG,4.564263
1,Quần short jean nữ trắng vải denim rách bền đẹ...,1269,28852,False,22433,7756,8800000000,32%,"[{'name': 'Màu sắc', 'options': ['199 trắng - ...","{'rating_star': 4.860752965446106, 'rating_cou...",TP. Hồ Chí Minh,AnNgo - Mê Jean Nữ,4.858771
2,Quần jean nữ ống rộng đen xám phong cách Ulzza...,2236,57608,False,49836,21183,13900000000,44%,"[{'name': 'Mẫu', 'options': ['Đen Xám ĐaiCúc(J...","{'rating_star': 4.870550467377963, 'rating_cou...",TP. Hồ Chí Minh,Kyubi Shop Official,4.878541
3,CHÂN VÁY BÒ 2 TÚI TRƯỚC XẺ VẠT SIÊU HOT 3 MÀU ...,870,22012,False,12315,7220,2500000000,38%,"[{'name': 'Màu sắc', 'options': ['XANH NHẠT', ...","{'rating_star': 4.7997227997228, 'rating_count...",Hà Nội,XƯỞNG MAY TRANG LINH - JEANS,4.800839
4,"ÁO THUN UNISEX RAGE OF THE SEA(ROTS STUDIO) ""H...",696,20118,False,4919,5973,9120000000,51%,"[{'name': 'Màu Sắc', 'options': ['Màu Đen', 'M...","{'rating_star': 4.850661309224845, 'rating_cou...",Đồng Nai,Rage Of The Sea (ROTS),4.866905


In [6]:
# List total number of rows and columns
print("This dataset contains ", shopee.shape[0], " rows and ", shopee.shape[1], " columns")

#clean the '%' symbol in 'discount' column
shopee['discount'] = shopee['discount'].str.replace('%','').astype('float64')
shopee['discount'] = shopee['discount']/10

#check for missing value
shopee.isna().sum()

This dataset contains  500  rows and  13  columns


name                0
sold                0
historical_sold     0
liked               0
liked_count         0
cmt_count           0
price               0
discount           41
tier_variations     0
item_rating         0
shop_location       0
shop_name           0
shop_rating         0
dtype: int64

In [21]:
#fill in missing data
shopee['discount'].fillna(shopee['discount'].mean(), inplace=True)

In [None]:
#drop dupliacetes records
df.drop_duplicates(inplace=True)

## 3. EDA

## 4. Building Recommender System

### Content-Based Filtering
* TF-IDF Vectorization with Cosine Similarity: Convert item attributes into text representations (as shown in the previous code) and use cosine similarity to find similar items based on their attributes.

In [22]:
#create a new dataframe from dataset
df = pd.DataFrame(shopee)

# Feature engineering: Combine relevant attributes into a single text representation
df['attributes'] = df['name']+ ' '+ df['sold'].astype(str) + ' '+df['price'].astype(str)+' '+df['discount'].astype(str) + ' '+df['shop_rating'].astype(str)

# Vectorize the text attributes
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['attributes'])


# Calculate cosine similarity between items based on their attributes
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)


# Function to get recommendations based on item similarity
def get_recommendations(item_name, cosine_sim=cosine_sim, df=df, top_n=5):
     # Check if the item exists in the DataFrame
    if item_name not in df['name'].values:
        print(f"Item '{item_name}' not found in the dataset.")
        return None
    
   # Get the index of the item in the DataFrame
    idx = df[df['name'] == item_name].index[0]
    
    # Get pairwise similarity scores with other items
    sim_scores = list(enumerate(cosine_sim[idx]))
    
    # Sort the items based on similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Get top N similar items (excluding the queried item itself)
    sim_scores = sim_scores[1:top_n + 1]
    
    # Get indices of the top N similar items
    item_indices = [i[0] for i in sim_scores]
    
    # Return the top N similar items
    return df.iloc[item_indices]['name']


In [23]:
recommendations = get_recommendations('Áo Sơ Mi Chất Nhung Tăm Nam Nữ Form Rộng Nâu Be Siêu Hot')
print("Recommendations for 'Áo Sơ Mi Chất Nhung Tăm Nam Nữ Form Rộng Nâu Be Siêu Hot':")
print(recommendations)

Recommendations for 'Áo Sơ Mi Chất Nhung Tăm Nam Nữ Form Rộng Nâu Be Siêu Hot':
281    Áo sơ mi nữ dài tay form rộng Hoạ Tiết Sọc Dán...
456    Áo sơ mi nam nhung tay ngắn VICENZO thời trang...
193    Áo sơ mi đũi xước dáng rộng , áo sơ mi chất vả...
168    Áo sơ mi nữ dài tay form rộng ulzzang kiểu hàn...
352    Quần Nhung Tăm Ống Rộng Dây Rút Xuất Hàn Mặc T...
Name: name, dtype: object


* Feature Engineering and Similarity Measures: Create numerical or categorical features from the item attributes and apply distance-based similarity measures (e.g., Euclidean distance, Jaccard similarity) or other techniques like k-nearest neighbors (KNN) for recommendations.

In [24]:
from sklearn.metrics.pairwise import euclidean_distances


# Normalizing numerical columns
df_normalized = df.copy()
df_normalized['sold'] = (df['sold'] - df['sold'].min()) / (df['sold'].max() - df['sold'].min())
df_normalized['price'] = (df['price'] - df['price'].min()) / (df['price'].max() - df['price'].min())
df_normalized['discount'] = (df['discount'] - df['discount'].min()) / (df['discount'].max() - df['discount'].min())
df_normalized['shop_rating'] = (df['shop_rating'] - df['shop_rating'].min()) / (df['shop_rating'].max() - df['shop_rating'].min())

# Select features for similarity calculation
features = df_normalized[['sold', 'price', 'discount', 'shop_rating']].values

# Calculate pairwise similarity using Euclidean distance
similarities = euclidean_distances(features, features)

# Function to get recommendations based on item similarity
def get_recommendations(item_id, similarities=similarities, df=df, top_n=5):
    # Get the index of the item in the DataFrame
    idx = df[df['name'] == item_id].index[0]
    
    # Get pairwise similarity scores with other items
    sim_scores = list(enumerate(similarities[idx]))
    
    # Sort the items based on similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1])
    
    # Get top N similar items (excluding the queried item itself)
    sim_scores = sim_scores[1:top_n + 1]
    
    # Get indices of the top N similar items
    item_indices = [i[0] for i in sim_scores]
    
    # Return the top N similar items
    return df.iloc[item_indices]['name']


In [25]:
# Example: Get recommendations for 'Product1'
recommendations = get_recommendations('Áo Sơ Mi Chất Nhung Tăm Nam Nữ Form Rộng Nâu Be Siêu Hot')
print("Recommendations:")
print(recommendations)

Recommendations:
453    Quần Dài Ống Rộng Hoạ Tiết Gấp Gấu Quần Caro P...
66     Áo khoác da croptop basic, áo khoác da nữ phon...
103    ÁO CHỐNG NẮNG GÂN ĐŨA TO CÓ NÓN THỜI TRANG CAO...
137    [Video, Ảnh Thật]Áo Khoác Nỉ Bông Thêu Hoodie ...
80     Quần Jean Ống Loe Cạp Cao nữ màu Đen vải bò gi...
Name: name, dtype: object


### Using KNN
KNN with Euclidean distance is used to find the k nearest neighbors for each item based on their normalized attributes ('sold_quantity', 'price', 'discount', 'rating'). The get_recommendations_knn function retrieves the top similar items using KNN for a given item.

In [28]:
from sklearn.neighbors import NearestNeighbors

# Initialize KNN model
k = 3  # Number of neighbors
knn_model = NearestNeighbors(n_neighbors=k, metric='euclidean')  # Using Euclidean distance


# Fit the model
knn_model.fit(features)

# Function to get recommendations based on KNN
def get_recommendations_knn(item_id, knn_model=knn_model, df=df, top_n=10):
    # Get the index of the item in the DataFrame
    idx = df[df['name'] == item_id].index[0]
    
    # Get distances and indices of k-nearest neighbors
    distances, indices = knn_model.kneighbors([features[idx]])
    
    # Exclude the queried item itself from recommendations
    neighbor_indices = indices[0][1:]  # Exclude the first index (self)
    
    # Get names of similar items
    similar_items = df.iloc[neighbor_indices]['name']
    
    return similar_items.head(top_n)


In [29]:
# Example: Get recommendations using KNN for 'Product1'
recommendations_knn = get_recommendations_knn('Áo Sơ Mi Chất Nhung Tăm Nam Nữ Form Rộng Nâu Be Siêu Hot')
print("Recommendations using KNN for 'Áo Sơ Mi Chất Nhung Tăm Nam Nữ Form Rộng Nâu Be Siêu Hot':")
print(recommendations_knn)

Recommendations using KNN for 'Áo Sơ Mi Chất Nhung Tăm Nam Nữ Form Rộng Nâu Be Siêu Hot':
453    Quần Dài Ống Rộng Hoạ Tiết Gấp Gấu Quần Caro P...
66     Áo khoác da croptop basic, áo khoác da nữ phon...
Name: name, dtype: object


## Application for a bigger dataset

In [37]:

from sklearn.model_selection import train_test_split

# Split the dataset into train and test sets
train_data, test_data = train_test_split(features, test_size=0.2, random_state=42)

# Initialize KNN model
k = 3  # Number of neighbors
knn_model = NearestNeighbors(n_neighbors=k, metric='euclidean')  # Using Euclidean distance

# Fit the model with training data
knn_model.fit(train_data)

# Get distances and indices of k-nearest neighbors for each test item
distances, indices = knn_model.kneighbors(test_data)

# The following line generates placeholder predictions based on test_data
predicted_distances, predicted_indices = knn_model.kneighbors(test_data)

# Function to get recommendations based on KNN
def get_recommendations_knn(item_index, knn_model=knn_model, df=df, top_n=5):
    # Get distances and indices of k-nearest neighbors
    distances, indices = knn_model.kneighbors([test_data[item_index]])
    
    # Exclude the queried item itself from recommendations
    neighbor_indices = indices[0][1:]  # Exclude the first index (self)
    
    # Get names of similar items
    similar_items = df.iloc[neighbor_indices]['name']
    
    return similar_items.head(top_n)

In [36]:
test_item_index = 0  # Example index for test item
recommendations_knn = get_recommendations_knn(test_item_index)
print(f"Recommendations using KNN for test item {test_item_index}:")
print(recommendations_knn)

Recommendations using KNN for test item 0:
367    Quần Shorts Nữ Ống Loe, Quần Đùi Ống Rộng Thời...
263    Quần Nỉ Nhung Tăm To Đôc Đáo Gemi Ống Rộng Dây...
Name: name, dtype: object


* Chossing the appropriate number of neighbors (k) in K-Nearest Neighbors (KNN)  is an important aspect that can impact the performance of the model in recommendation systems. 
* A smaller k tends to capture more local information but might be sensitive to noise, while a larger k might oversmooth and generalize too much.


### Using the rule of Thumb
We use the square root of the total number of items or a fraction of it.
In details, the dataset have 5.000 items so we might choose k = sqrt(5000) = 70 or a smaller fraction like k = 50

### Using Cross-validation
Perform cross-validation with different values of k and evaluate the model's performance (e.g., using accuracy, RMSE, or other appropriate metrics). Choose the value of k that gives the best performance on the validation set.

## Evaluate model's performance

In [39]:
import numpy as np
from sklearn.metrics import mean_squared_error

# Calculate RMSE between predicted and actual values
rmse = np.sqrt(mean_squared_error(distances, predicted_distances))
print(f"RMSE: {rmse}")

RMSE: 0.0
