# Project: Amazon Fashion Recommender Demo


## Table of contents

1. [Introduction](#1)
2. [Data splitting](#2)
3. [Exploratory Data Analysis (EDA)](#3)
4. [Feature engineering](#4)
5. [Preprocessing and transformations](#5) 
6. [Baseline model](#6)
7. [Linear models](#7)
8. [Different models](#8)
9. [Feature selection](#9)
10. [Hyperparameter optimization](#10)
11. [Interpretation and feature importances](#11) 
12. [Results on the test set](#12)
13. [Summary of the results](#13)


<!-- BEGIN QUESTION -->

## Imports

In [1]:
from hashlib import sha1

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import shap
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import (
    KBinsDiscretizer,
    OneHotEncoder,
    OrdinalEncoder,
    StandardScaler,
)
from sklearn.tree import DecisionTreeRegressor

plt.rcParams["font.size"] = 16

In [2]:
#pip install numpy pandas matplotlib scikit-learn shap



In [3]:
#!pip install ipywidgets




## 1. Introduction <a name="1"></a>




#### A final note

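## 1) Clustering + Regression (rating prediction)

Load `amazon_co-ecommerce_sample.csv`, then:

**Feature preparation**  
- Numeric: `price`, `number_of_reviews`, `number_available_in_stock`  
- Categorical: `manufacturer`, `amazon_category_and_sub_category` → one-hot encoding  

**Clustering**  
- Group the products into k clusters (e.g., k=10) with KMeans and add a `cluster` column  

**Regression model training**  
- For each cluster, train a model such as `RandomForestRegressor` to predict `average_review_rating` (1 to 5)  

**Recommendation**  
1. Identify the `cluster` c of the product the user viewed (or selected)  
2. Run `regressor[c].predict()` on the features of the other products in cluster c to obtain predicted ratings  
3. Recommend the top N products in descending order of predicted rating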
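
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: it runs on a small synthetic table (the values are made up, and the real CSV columns would first need to be parsed to numeric types), uses k=4 instead of 10 to keep the toy clusters populated, and the helper `recommend` is a hypothetical name.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the CSV (assumption: columns are already numeric/clean).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "price": rng.uniform(1, 100, n),
    "number_of_reviews": rng.integers(1, 500, n),
    "number_available_in_stock": rng.integers(0, 50, n),
    "manufacturer": rng.choice(["A", "B", "C"], n),
    "amazon_category_and_sub_category": rng.choice(["Toys", "Games"], n),
    "average_review_rating": rng.uniform(1, 5, n),
})

numeric = ["price", "number_of_reviews", "number_available_in_stock"]
categorical = ["manufacturer", "amazon_category_and_sub_category"]

# Scale numeric features and one-hot encode categorical ones (dense output).
preprocessor = make_column_transformer(
    (StandardScaler(), numeric),
    (OneHotEncoder(handle_unknown="ignore"), categorical),
    sparse_threshold=0,
)
X = preprocessor.fit_transform(toy)

# Cluster the products and store the label in a `cluster` column.
k = 4
toy["cluster"] = KMeans(n_clusters=k, n_init=10, random_state=123).fit_predict(X)

# Train one regressor per cluster to predict the average rating.
regressors = {}
for c in range(k):
    mask = (toy["cluster"] == c).to_numpy()
    model = RandomForestRegressor(n_estimators=50, random_state=123)
    regressors[c] = model.fit(X[mask], toy.loc[mask, "average_review_rating"])


def recommend(product_idx, top_n=5):
    """Rank the other products in the same cluster by predicted rating."""
    c = int(toy.loc[product_idx, "cluster"])
    same_cluster = (toy["cluster"] == c).to_numpy() & (np.arange(len(toy)) != product_idx)
    others = np.flatnonzero(same_cluster)
    predicted = regressors[c].predict(X[others])
    return (
        toy.iloc[others]
        .assign(predicted_rating=predicted)
        .sort_values("predicted_rating", ascending=False)
        .head(top_n)
    )


recs = recommend(0, top_n=5)
print(recs[["price", "cluster", "predicted_rating"]])
```

Note that the sketch predicts on the same rows the regressors were trained on; an honest evaluation would fit the preprocessor, clusters, and regressors on the training split only.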

---

## 2) Popularity-Based Recommendation (popularity ranking)

From `amazon_co-ecommerce_sample.csv`:

- Use `number_of_reviews` (review count) and `average_review_rating` (rating) as the basis  
- **Popularity metric**: e.g., `score = average_review_rating * log(1 + number_of_reviews)`  
- Sort by `score` in descending order and return the top N `uniq_id`, `product_name`  
- (Optional) Also use `number_of_answered_questions` or `number_available_in_stock` as weights
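
A minimal sketch of this popularity score, assuming the rating and review-count columns are already numeric. The helper name `top_popular` and the four toy rows are hypothetical.

```python
import numpy as np
import pandas as pd


def top_popular(df, n=3):
    """Return the top-n products by rating * log(1 + number_of_reviews)."""
    scored = df.assign(
        score=df["average_review_rating"] * np.log1p(df["number_of_reviews"])
    )
    ranked = scored.sort_values("score", ascending=False)
    return ranked.head(n)[["uniq_id", "product_name", "score"]]


# Hypothetical toy rows; the real data comes from the CSV.
toy = pd.DataFrame({
    "uniq_id": ["a", "b", "c", "d"],
    "product_name": ["Puzzle", "Train set", "Doll", "Board game"],
    "average_review_rating": [5.0, 4.5, 4.8, 3.0],
    "number_of_reviews": [1, 200, 50, 1000],
})

top2 = top_popular(toy, n=2)
print(top2)
```

The logarithm damps the review count, so a single 5-star review does not outrank a product with hundreds of slightly lower ratings.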

---

## 3) Price & Rating Filter

Take a budget range (`min_price`, `max_price`) and a rating threshold (`min_rating`) from the user, then:

1. Filter products whose `price` satisfies `min_price ≤ price ≤ max_price` and  
2. whose `average_review_rating ≥ min_rating`  
3. Sort the filtered results by `average_review_rating` in descending order  
4. Return the top N `uniq_id`, `product_name`, `price`, `average_review_rating`
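
The filter can be sketched in a few lines of pandas. This assumes `price` and `average_review_rating` are numeric; `filter_products` and the toy rows are hypothetical.

```python
import pandas as pd


def filter_products(df, min_price, max_price, min_rating, n=5):
    """Keep products inside the budget and above the rating threshold."""
    mask = df["price"].between(min_price, max_price) & (
        df["average_review_rating"] >= min_rating
    )
    cols = ["uniq_id", "product_name", "price", "average_review_rating"]
    return (
        df.loc[mask, cols]
        .sort_values("average_review_rating", ascending=False)
        .head(n)
    )


# Hypothetical toy rows; the real data comes from the CSV.
toy = pd.DataFrame({
    "uniq_id": ["a", "b", "c", "d"],
    "product_name": ["Puzzle", "Train set", "Doll", "Board game"],
    "price": [5.0, 15.0, 25.0, 35.0],
    "average_review_rating": [4.9, 4.2, 4.7, 3.5],
})

picked = filter_products(toy, min_price=10, max_price=30, min_rating=4.0)
print(picked)
```

`Series.between` is inclusive on both ends by default, which matches `min_price ≤ price ≤ max_price`.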



<br><br>

In [4]:
...

Ellipsis

In [6]:
df = pd.read_csv('amazon_co-ecommerce_sample.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'amazon_co-ecommerce_sample.csv'

<!-- END QUESTION -->

<br><br>


## 2. Data splitting <a name="2"></a>


**Your tasks:**

1. Split the data into train (70%) and test (30%) portions with `random_state=123`.
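
A sketch of the requested split. Because the CSV failed to load above, this uses a small synthetic frame in place of `df`; the names `toy_df`, `train_df`, and `test_df` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loaded DataFrame (assumption).
toy_df = pd.DataFrame({
    "price": np.linspace(1.0, 100.0, 100),
    "average_review_rating": np.linspace(1.0, 5.0, 100),
})

# 70% train / 30% test, reproducible via random_state=123.
train_df, test_df = train_test_split(toy_df, test_size=0.3, random_state=123)
print(train_df.shape, test_df.shape)
```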



<!-- END QUESTION -->

<br><br>


## 3. EDA <a name="3"></a>


**Your tasks:**

1. Perform exploratory data analysis on the train set.
2. Include at least two summary statistics and two visualizations that you find useful, and accompany each one with a sentence explaining it.
3. Summarize your initial observations about the data. 
4. Pick appropriate metric/metrics for assessment. 

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br>


## 4. Feature engineering <a name="4"></a>

**Your tasks:**

1. Carry out feature engineering. In other words, extract new features relevant for the problem and work with your new feature set in the following exercises. You may have to go back and forth between feature engineering and preprocessing. 

<!-- END QUESTION -->

<br><br>



## 5. Preprocessing and transformations <a name="5"></a>
<hr>


In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br>


## 6. Baseline model <a name="6"></a>


**Your tasks:**
1. Try `scikit-learn`'s baseline model and report results.
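
One way to do this is with `DummyRegressor`, sketched here on synthetic data. The features and ratings are random placeholders, and mean absolute error is just one possible choice of metric.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_validate

# Random placeholder data standing in for the preprocessed training set.
rng = np.random.default_rng(123)
X = rng.normal(size=(100, 3))
y = rng.uniform(1, 5, size=100)

# The dummy model always predicts the mean of the training targets.
dummy = DummyRegressor(strategy="mean")
scores = cross_validate(
    dummy, X, y, cv=5, scoring="neg_mean_absolute_error", return_train_score=True
)

baseline_mae = -scores["test_score"].mean()
baseline_std = scores["test_score"].std()
print(f"Baseline CV MAE: {baseline_mae:.3f} (+/- {baseline_std:.3f})")
```

Any real model should beat this number; if it does not, the features carry little signal for the target.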

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br>

## 7. Linear models <a name="7"></a>

**Your tasks:**

1. Try a linear model as a first real attempt. 
2. Carry out hyperparameter tuning to explore different values for the complexity hyperparameter. 
3. Report cross-validation scores along with standard deviation. 
4. Summarize your results.
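
A possible shape for this exercise, using `Ridge` and tuning its `alpha` (the regularization strength, which controls complexity). The data comes from `make_regression` as a placeholder for the real training set.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder regression data (assumption: stands in for the training set).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=123)

pipe = make_pipeline(StandardScaler(), Ridge())
alphas = 10.0 ** np.arange(-3, 4)  # 0.001 ... 1000
grid = GridSearchCV(pipe, {"ridge__alpha": alphas}, cv=5, return_train_score=True)
grid.fit(X, y)

# Mean and standard deviation of the cross-validation score for each alpha.
results = pd.DataFrame(grid.cv_results_)[
    ["param_ridge__alpha", "mean_test_score", "std_test_score", "mean_train_score"]
]
print(results)
print("Best alpha:", grid.best_params_["ridge__alpha"])
```

The default score for regressors is R²; pass `scoring=` to `GridSearchCV` to report a different metric.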

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br>



## 8. Different models <a name="8"></a>


**Your tasks:**
1. Try at least 3 other models aside from a linear model. One of these models should be a tree-based ensemble model. 
2. Summarize your results in terms of overfitting/underfitting and fit and score times. Can you beat a linear model? 
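
The comparison could be organized as below. The model choices are examples only, the data is a `make_regression` placeholder, and `Ridge` is included purely as the linear reference point.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

# Placeholder regression data (assumption: stands in for the training set).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=123)

models = {
    "ridge": Ridge(),
    "decision tree": DecisionTreeRegressor(random_state=123),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=123),
    "gradient boosting": GradientBoostingRegressor(random_state=123),
}

# Collect mean fit/score times and train/validation R^2 for each model.
rows = {}
for name, model in models.items():
    cv = cross_validate(model, X, y, cv=5, return_train_score=True)
    rows[name] = {key: values.mean() for key, values in cv.items()}

results = pd.DataFrame(rows).T[["fit_time", "score_time", "train_score", "test_score"]]
print(results)
```

A large gap between `train_score` and `test_score` (typical for an unconstrained decision tree) points to overfitting, while two low scores point to underfitting.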

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br>



## 9. Feature selection <a name="9"></a>


**Your tasks:**

Make some attempts to select relevant features. You may try `RFECV` or forward selection for this. Do the results improve with feature selection? Summarize your results. If you see improvements in the results, keep feature selection in your pipeline. If not, you may abandon it in the next exercises. 
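
A sketch of the `RFECV` option, comparing cross-validation scores with and without selection. The data is synthetic, with 3 informative features out of 10, so the numbers say nothing about the real dataset.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only 3 of the 10 features actually drive the target.
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=123
)

# Pipeline with recursive feature elimination (cross-validated) before the model.
pipe_rfe = make_pipeline(StandardScaler(), RFECV(Ridge(), cv=5), Ridge())
pipe_plain = make_pipeline(StandardScaler(), Ridge())

score_rfe = cross_val_score(pipe_rfe, X, y, cv=5).mean()
score_plain = cross_val_score(pipe_plain, X, y, cv=5).mean()

pipe_rfe.fit(X, y)
selector = pipe_rfe.named_steps["rfecv"]
print(f"Selected {selector.n_features_} of {X.shape[1]} features")
print(f"CV R^2 with RFECV: {score_rfe:.3f}, without: {score_plain:.3f}")
```

Keeping the selector inside the pipeline means it is refit on each training fold, which avoids leaking validation data into the selection step.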

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br>



## 10. Hyperparameter optimization <a name="10"></a>


**Your tasks:**

Make some attempts to optimize hyperparameters for the models you've tried and summarize your results. In at least one case you should be optimizing multiple hyperparameters for a single model. You may use `sklearn`'s methods for hyperparameter optimization or fancier Bayesian optimization methods. 
  - [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)   
  - [RandomizedSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)
  - [scikit-optimize](https://github.com/scikit-optimize/scikit-optimize) 
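
For the multi-hyperparameter case, a `RandomizedSearchCV` run might look like this. The search ranges are arbitrary illustrations and the data is a `make_regression` placeholder.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Placeholder regression data (assumption: stands in for the training set).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=123)

# Three hyperparameters are sampled jointly; the ranges are illustrative only.
param_dist = {
    "n_estimators": randint(20, 100),
    "max_depth": randint(2, 10),
    "max_features": [0.5, 0.8, 1.0],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=123),
    param_dist,
    n_iter=8,
    cv=3,
    random_state=123,
)
search.fit(X, y)

print("Best params:", search.best_params_)
print(f"Best CV R^2: {search.best_score_:.3f}")
```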

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br>



## 11. Interpretation and feature importances <a name="11"></a>


**Your tasks:**

1. Use the methods we saw in class (e.g., `shap`), or any other methods of your choice, to examine the most important features of one of the non-linear models. 
2. Summarize your observations. 
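
As an example of "any other method", the sketch below uses scikit-learn's `permutation_importance` rather than `shap`. The data is synthetic and the column names are borrowed from the dataset only for readability; by construction only `price` influences the target here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data where only the first column drives the target (assumption).
rng = np.random.default_rng(123)
X = pd.DataFrame(
    rng.normal(size=(300, 3)), columns=["price", "number_of_reviews", "noise"]
)
y = 5 * X["price"] + rng.normal(scale=0.5, size=300)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=123)
model = RandomForestRegressor(n_estimators=100, random_state=123).fit(X_train, y_train)

# Shuffle each feature on held-out data and measure the drop in R^2.
result = permutation_importance(
    model, X_valid, y_valid, n_repeats=10, random_state=123
)
importances = pd.Series(result.importances_mean, index=X.columns).sort_values(
    ascending=False
)
print(importances)
```

Permutation importance gives a global ranking; SHAP additionally explains individual predictions.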

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br>



## 12. Results on the test set <a name="12"></a>

**Your tasks:**

1. Try your best performing model on the test data and report test scores. 
2. Do the test scores agree with the validation scores from before? To what extent do you trust your results? Do you think you've had issues with optimization bias? 
3. Take one or two test predictions and explain these individual predictions (e.g., with SHAP force plots).  

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br>


## 13. Summary of results <a name="13"></a>

Imagine that you want to present the summary of these results to your boss and co-workers. 

**Your tasks:**

1. Create a table summarizing important results. 
2. Write concluding remarks.
3. Discuss other ideas that you did not try but that could potentially improve performance or interpretability. 


In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br>

<br><br>



## 14. Your takeaway <a name="14"></a>


**Your tasks:**

What is your biggest takeaway from the supervised machine learning material?

<!-- END QUESTION -->

<br><br>

![](img/eva-well-done.png)