<a href="https://colab.research.google.com/github/hyuna0926/cp2_phase2/blob/main/2%EC%9B%94%201%EC%9D%BC/CB_fashion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Content-Based Recommendation (CB)
Content-based recommendation extracts features from each item (content) and recommends items whose features are similar.

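### * Advantages of CB
* Recommending for a given user requires no data from other users
* Mitigates the item cold-start problem: a new item can be recommended as soon as its features are known
* Unpopular items can still be recommended if similar items exist
* Recommendations can be explained through item features, which makes it possible to verify that they are sensible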

<br>

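### * Disadvantages of CB

* Requires careful feature engineering for the items
  * Raw text data (for example, Korean text) is hard to use as-is, so it must be vectorized
* Recommendations may be confined to similar items from a single domain
  * Because ranking is based on similarity, only items of the same category/genre are recommended
* Cannot make use of other users' data
  * This is both a strength and a weakness: recommendations stay within what the user has already viewed (or shown interest in)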
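The trade-offs listed above can be seen in a minimal sketch. The catalog below is hypothetical (the item keys and attribute strings are made up for illustration); it only shows that item attributes alone are enough to rank neighbours, even for an item with no interaction history.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy catalog: each item is described only by its own attributes.
toy_items = {
    "shirt_a": "Men Shirts Navy Blue Fall Casual",
    "shirt_b": "Men Shirts Blue Summer Casual",
    "watch_a": "Women Watches Silver Winter Casual",
    # A brand-new item with no interaction history can still be compared.
    "shirt_new": "Men Shirts Navy Blue Fall Formal",
}

names = list(toy_items)
vectors = TfidfVectorizer().fit_transform(list(toy_items.values()))
sims = cosine_similarity(vectors)

def most_similar(name):
    """Return the other item whose feature vector is closest to `name`."""
    row = sims[names.index(name)]
    candidates = [(other, score) for other, score in zip(names, row) if other != name]
    return max(candidates, key=lambda pair: pair[1])[0]

print(most_similar("shirt_new"))
```

No user data enters the computation, which is the advantage; the nearest neighbour of a shirt is another shirt, which is the limitation.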

# 1. Import required libraries

In [None]:
import pandas as pd
import numpy as np
import warnings
# Suppress warnings
warnings.filterwarnings("ignore")

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# 2. Inspect the data
- Uses `product.csv` from the `fashion_campus` dataset

In [None]:
df = pd.read_csv('/content/drive/MyDrive/CP2_Phase2/product.csv', on_bad_lines='skip')

In [None]:
df.head()

Unnamed: 0,id,gender,masterCategory,subCategory,articleType,baseColour,season,year,usage,productDisplayName
0,15970,Men,Apparel,Topwear,Shirts,Navy Blue,Fall,2011.0,Casual,Turtle Check Men Navy Blue Shirt
1,39386,Men,Apparel,Bottomwear,Jeans,Blue,Summer,2012.0,Casual,Peter England Men Party Blue Jeans
2,59263,Women,Accessories,Watches,Watches,Silver,Winter,2016.0,Casual,Titan Women Silver Watch
3,21379,Men,Apparel,Bottomwear,Track Pants,Black,Fall,2011.0,Casual,Manchester United Men Solid Black Track Pants
4,53759,Men,Apparel,Topwear,Tshirts,Grey,Summer,2012.0,Casual,Puma Men Grey T-shirt


In [None]:
df.rename(columns={'id':'product_id'},inplace=True)

- Remove duplicates

In [None]:
df['productDisplayName'].duplicated().sum()

13302

- On inspection, the duplicated rows differ only in `product_id`; every other column is identical

In [None]:
df[df['productDisplayName']=='Murcia Women Casual Brown Handbag']

Unnamed: 0,product_id,gender,masterCategory,subCategory,articleType,baseColour,season,year,usage,productDisplayName
29,21977,Women,Accessories,Bags,Handbags,Brown,Winter,2015.0,Casual,Murcia Women Casual Brown Handbag
186,21948,Women,Accessories,Bags,Handbags,Brown,Winter,2015.0,Casual,Murcia Women Casual Brown Handbag
2878,21940,Women,Accessories,Bags,Handbags,Brown,Winter,2015.0,Casual,Murcia Women Casual Brown Handbag
3578,21985,Women,Accessories,Bags,Handbags,Brown,Winter,2015.0,Casual,Murcia Women Casual Brown Handbag
19036,21991,Women,Accessories,Bags,Handbags,Brown,Winter,2015.0,Casual,Murcia Women Casual Brown Handbag
21514,21964,Women,Accessories,Bags,Handbags,Brown,Winter,2015.0,Casual,Murcia Women Casual Brown Handbag
21830,21952,Women,Accessories,Bags,Handbags,Brown,Winter,2015.0,Casual,Murcia Women Casual Brown Handbag
23627,21987,Women,Accessories,Bags,Handbags,Brown,Winter,2015.0,Casual,Murcia Women Casual Brown Handbag
24314,21942,Women,Accessories,Bags,Handbags,Brown,Winter,2015.0,Casual,Murcia Women Casual Brown Handbag
27177,21975,Women,Accessories,Bags,Handbags,Brown,Winter,2015.0,Casual,Murcia Women Casual Brown Handbag
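The observation that the duplicates differ only in `product_id` was made by eye on one product. A programmatic check could look like the sketch below; the `toy` frame is a hypothetical stand-in for `df` with fewer columns, and the approach (counting distinct values per name) is one option among several.

```python
import pandas as pd

# Hypothetical toy frame mimicking the duplicated rows shown above.
toy = pd.DataFrame({
    "product_id": [21977, 21948, 15970],
    "baseColour": ["Brown", "Brown", "Navy Blue"],
    "productDisplayName": [
        "Murcia Women Casual Brown Handbag",
        "Murcia Women Casual Brown Handbag",
        "Turtle Check Men Navy Blue Shirt",
    ],
})

# Rows that share a display name with at least one other row.
same_name = toy[toy["productDisplayName"].duplicated(keep=False)]

# For each name, count distinct values in every column except product_id.
other_cols = toy.columns.drop(["product_id", "productDisplayName"])
distinct_counts = (
    same_name.groupby("productDisplayName")[list(other_cols)].nunique(dropna=False)
)

# True when every duplicated name has exactly one value in each other column.
only_id_differs = bool((distinct_counts == 1).all().all())
print(only_id_differs)
```

If this prints `False` on the real data, dropping duplicates by name alone would discard rows that carry different attributes.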


In [None]:
df.drop_duplicates(subset=['productDisplayName'], keep='first', inplace=True,ignore_index=True)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 31122 entries, 0 to 44423
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   product_id          31122 non-null  int64  
 1   gender              31122 non-null  object 
 2   masterCategory      31122 non-null  object 
 3   subCategory         31122 non-null  object 
 4   articleType         31122 non-null  object 
 5   baseColour          31116 non-null  object 
 6   season              31101 non-null  object 
 7   year                31121 non-null  float64
 8   usage               30825 non-null  object 
 9   productDisplayName  31121 non-null  object 
dtypes: float64(1), int64(1), object(8)
memory usage: 2.6+ MB


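## 1) Handle missing values
- Drop the row that has no product name
- Drop the row with a missing `year`, since there is only one
- Replace the remaining missing values with `unknown`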

In [None]:
df_c = df.copy()

In [None]:
df_c.dropna(subset=['productDisplayName','year'],inplace=True)

In [None]:
# df[num_cols]=df[num_cols].fillna(0)
df_c = df_c.fillna('unknown')
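The two steps above (drop, then fill) can be tried on a small frame first. The `toy` data below is hypothetical and only mirrors three of the real columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap in each column.
toy = pd.DataFrame({
    "year": [2011.0, np.nan, 2016.0, 2012.0],
    "usage": ["Casual", "Casual", np.nan, "Sports"],
    "productDisplayName": ["Shirt", "Jeans", "Watch", np.nan],
})

# Drop rows lacking a name or a year.
cleaned = toy.dropna(subset=["productDisplayName", "year"])
# Fill the gaps that remain in the text columns with a placeholder label.
cleaned = cleaned.fillna("unknown")
print(cleaned)
```

Dropping on `year` before filling matters: filling a numeric column with the string `unknown` would otherwise mix types in that column.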

In [None]:
df_c.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 31120 entries, 0 to 31121
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   product_id          31120 non-null  int64  
 1   gender              31120 non-null  object 
 2   masterCategory      31120 non-null  object 
 3   subCategory         31120 non-null  object 
 4   articleType         31120 non-null  object 
 5   baseColour          31120 non-null  object 
 6   season              31120 non-null  object 
 7   year                31120 non-null  float64
 8   usage               31120 non-null  object 
 9   productDisplayName  31120 non-null  object 
dtypes: float64(1), int64(1), object(8)
memory usage: 2.6+ MB


## 2) Reduce memory usage
- Memory usage reported by `df_c.info()` dropped from 2.6+ MB to 2.0 MB

In [None]:
num_cols = [col for col in df_c.columns if df_c[col].dtype!='object']
cat_cols = [col for col in df_c.columns if col not in num_cols]

In [None]:
df_c[cat_cols] = df_c[cat_cols].astype('category')
df_c[num_cols] = df_c[num_cols].astype('int32')

In [None]:
df_c.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 31120 entries, 0 to 31121
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   product_id          31120 non-null  int32   
 1   gender              31120 non-null  category
 2   masterCategory      31120 non-null  category
 3   subCategory         31120 non-null  category
 4   articleType         31120 non-null  category
 5   baseColour          31120 non-null  category
 6   season              31120 non-null  category
 7   year                31120 non-null  int32   
 8   usage               31120 non-null  category
 9   productDisplayName  31120 non-null  category
dtypes: category(8), int32(2)
memory usage: 2.0 MB
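The `+` in the earlier `info()` output indicates that the figure is a lower bound, because the contents of string columns are not counted. A sketch of a like-for-like comparison with `memory_usage(deep=True)` follows; the `toy` column is hypothetical.

```python
import pandas as pd

# Hypothetical toy column with few distinct values repeated many times.
toy = pd.DataFrame({"season": ["Fall", "Summer", "Winter", "Spring"] * 5000})

# deep=True also counts the string values themselves.
before = int(toy.memory_usage(deep=True).sum())
after = int(toy.astype("category").memory_usage(deep=True).sum())

print(f"before: {before:,} bytes -> category: {after:,} bytes")
```

The saving comes from storing each distinct label once plus a small integer code per row, so columns where nearly every value is unique (such as `productDisplayName`) are likely to gain little.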


## 3) Build a new feature
- Concatenate `gender`, `articleType`, `baseColour`, `season`, and `usage`

In [None]:
df_c['features'] = df_c[['gender','articleType','baseColour','season','usage']].apply(' '.join, axis=1)

In [None]:
df_c.head()

Unnamed: 0,product_id,gender,masterCategory,subCategory,articleType,baseColour,season,year,usage,productDisplayName,features
0,15970,Men,Apparel,Topwear,Shirts,Navy Blue,Fall,2011,Casual,Turtle Check Men Navy Blue Shirt,Men Shirts Navy Blue Fall Casual
1,39386,Men,Apparel,Bottomwear,Jeans,Blue,Summer,2012,Casual,Peter England Men Party Blue Jeans,Men Jeans Blue Summer Casual
2,59263,Women,Accessories,Watches,Watches,Silver,Winter,2016,Casual,Titan Women Silver Watch,Women Watches Silver Winter Casual
3,21379,Men,Apparel,Bottomwear,Track Pants,Black,Fall,2011,Casual,Manchester United Men Solid Black Track Pants,Men Track Pants Black Fall Casual
4,53759,Men,Apparel,Topwear,Tshirts,Grey,Summer,2012,Casual,Puma Men Grey T-shirt,Men Tshirts Grey Summer Casual


# 3. Content-based recommender using TF-IDF

- A memory error occurred, so the number of documents is reduced to 15,000

In [None]:
# Take the first 15,000 rows by position (.loc slices by label and includes the end label)
data = df_c.iloc[:15000].reset_index(drop=True)

In [None]:
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(data['features'])
print(tfidf_matrix.shape)

(15000, 224)
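The second number in the shape is the vocabulary size: one column per distinct word token. A small sketch with two feature strings taken from the `features` column shows how the default `TfidfVectorizer` settings tokenize them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Men Shirts Navy Blue Fall Casual", "Men Jeans Blue Summer Casual"]
vec = TfidfVectorizer()
matrix = vec.fit_transform(docs)

# Default settings lowercase the text and split it into word tokens,
# so "Navy Blue" contributes the two separate terms "navy" and "blue".
vocab = sorted(vec.vocabulary_)
print(vocab)
print(matrix.shape)  # (number of documents, number of distinct terms)
```

One consequence is that a `Navy Blue` shirt and a `Blue` pair of jeans share the term `blue`, which raises their similarity slightly.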


In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf.get_feature_names_out())
df_tfidf

Unnamed: 0,accessories,accessory,and,baby,backpacks,bag,bangle,basketballs,bath,beauty,...,waistcoat,wallets,wash,watches,water,white,winter,women,wristbands,yellow
0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0
1,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0
2,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.541117,0.0,0.0,0.379352,0.254751,0.0,0.0
3,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0
4,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14995,0.0,0.0,0.0,0.0,0.678504,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0
14996,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.242245,0.0,0.0
14997,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0
14998,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0


In [None]:
cosine_df= pd.DataFrame(cosine_similarity(tfidf_matrix,tfidf_matrix),index = data.product_id, columns=data.product_id)
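The dense similarity matrix is what drives memory use: at 8 bytes per float64 value, 15,000 × 15,000 entries take about 1.8 GB, and the full 31,120 rows would need roughly 7.7 GB. One way around this, sketched below under the assumption that only a few queries are needed at a time, is to compare a single row against the sparse TF-IDF matrix on demand. The function name `top_k_for_row` and the toy `features` list are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy feature strings in the same format as data['features'].
features = [
    "Men Shirts Navy Blue Fall Casual",
    "Men Shirts Navy Blue Fall Casual",
    "Men Jeans Blue Summer Casual",
    "Women Watches Silver Winter Casual",
]
matrix = TfidfVectorizer().fit_transform(features)  # stays sparse

def top_k_for_row(row_idx, k=2):
    """Return (row index, score) pairs for the k rows most similar to row_idx."""
    # One row against all rows yields a 1 x n array instead of n x n.
    scores = cosine_similarity(matrix[row_idx:row_idx + 1], matrix).ravel()
    scores[row_idx] = -np.inf  # exclude the query row itself
    best = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in best]

print(top_k_for_row(0))
```

The trade-off is that each query recomputes one row of similarities, whereas the precomputed `cosine_df` answers lookups directly.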

In [None]:
cosine_df.head()

product_id,15970,39386,59263,21379,53759,1855,30805,26960,29114,30039,...,10874,59615,10048,5883,24873,15708,38410,26718,17863,54338
15970,1.0,0.26908,0.031107,0.155529,0.121103,0.121103,0.338141,0.344077,0.544192,0.109349,...,0.769403,0.090953,0.287681,0.058421,0.033745,0.28911,0.02958,0.11619,0.083465,0.241485
39386,0.26908,1.0,0.031961,0.073167,0.207224,0.207224,0.124416,0.107095,0.296284,0.112353,...,0.287347,0.0,0.135337,0.116246,0.09962,0.192469,0.087323,0.198817,0.14282,0.323881
59263,0.031107,0.031961,1.0,0.023885,0.04062,0.04062,0.0,0.109941,0.02772,0.581993,...,0.033219,0.056927,0.044181,0.055165,0.031864,0.028834,0.089643,0.038972,0.027995,0.119298
21379,0.155529,0.073167,0.023885,1.0,0.092989,0.092989,0.045758,0.027855,0.063458,0.207792,...,0.166087,0.069838,0.370056,0.044858,0.025911,0.101604,0.022713,0.089217,0.064089,0.030226
53759,0.121103,0.207224,0.04062,0.092989,1.0,1.0,0.158121,0.136109,0.179725,0.14279,...,0.34743,0.0,0.462082,0.147738,0.335823,0.039874,0.11098,0.508561,0.181512,0.147693


- Map between product names and `product_id`

In [None]:
# Build a dictionary mapping product_id -> product name
# (despite its name, product2id is keyed by product_id)
product2id = dict(zip(data['product_id'], data['productDisplayName']))

In [None]:
# Build the reverse dictionary: product name -> product_id
id2product = {name: pid for pid, name in product2id.items()}

## Extract the top-k most similar products by product_id

In [None]:
sim_scores=cosine_df.loc[15970].sort_values(ascending=False).to_frame().reset_index()
sim_scores

Unnamed: 0,product_id,15970
0,15970,1.0
1,16050,1.0
2,20142,1.0
3,16303,1.0
4,9748,1.0
...,...,...
14995,59366,0.0
14996,31977,0.0
14997,30900,0.0
14998,3294,0.0


In [None]:
k_sim_scores = [(product2id[i], score) for i, score in zip(sim_scores['product_id'][0:10],sim_scores[15970])]
k_sim_scores

[('Turtle Check Men Navy Blue Shirt', 1.0),
 ('Highlander Men Check Navy Blue Shirt', 1.0),
 ('Wrangler Men Authentic Rope Navy Blue Shirt', 1.0),
 ('Peter England Men Check Navy Blue Shirt', 1.0),
 ('Indian Terrain Men Navy Blue Shirts', 1.0),
 ('Indigo Nation Men Checks Navy Blue Shirt', 1.0),
 ('Levis Men Check Navy Blue Shirts', 1.0),
 ('Highlander Men Stripes Navy Blue Shirt', 1.0),
 ('Spykar Men Solid Navy Blue Shirt', 1.0),
 ('Wrangler Men Wander Wheels Navy Blue Shirts', 1.0)]
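Every score in this list is 1.0 because these products share the same `features` string, so their TF-IDF vectors are identical; the product name itself is not part of the features. The sketch below spells out the cosine formula on hypothetical term weights (the numbers are made up, not taken from `tfidf_matrix`).

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical term weights over the vocabulary [men, shirts, navy, blue, jeans].
shirt_a = np.array([0.3, 0.5, 0.6, 0.4, 0.0])
shirt_b = shirt_a.copy()  # same attribute string -> same vector
jeans = np.array([0.3, 0.0, 0.0, 0.4, 0.8])

print(cosine(shirt_a, shirt_b))  # equal to 1 up to floating-point rounding
print(cosine(shirt_a, jeans))    # strictly between 0 and 1
```

Since all tied items score the same, their relative order in the top-k list carries no information.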

- Wrap it in a function

In [None]:
def sim_cosine(product_id, cosine_df=cosine_df, k=25):
  """Return the k products most similar to product_id as (name, score) pairs."""
  try:
    sim_scores = cosine_df.loc[product_id].sort_values(ascending=False).to_frame().reset_index()
    k_sim_scores = [(product2id[i], score) for i, score in zip(sim_scores['product_id'][0:k], sim_scores[product_id])]
    print(f'----top{k}----')
    return k_sim_scores
  except KeyError:
    # .loc raises KeyError when product_id is not in the similarity matrix
    print("Product_id doesn't exist!!")

In [None]:
sim_cosine(1)

Product_id doesn't exist!!
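Two behaviours of `sim_cosine` are worth noting: the query product itself appears in the result (it is the first entry in the top-10 list earlier), and an unknown id yields a printed message with an implicit `None` return. A possible variant is sketched below; `similar_products`, `toy_cosine_df`, and `toy_names` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Hypothetical miniature stand-ins for cosine_df and the id -> name dictionary.
toy_cosine_df = pd.DataFrame(
    [[1.0, 0.8, 0.1], [0.8, 1.0, 0.3], [0.1, 0.3, 1.0]],
    index=[101, 102, 103],
    columns=[101, 102, 103],
)
toy_names = {101: "Shirt A", 102: "Shirt B", 103: "Watch C"}

def similar_products(product_id, sim_df, names, k=2):
    """Return up to k (name, score) pairs, excluding the query product itself."""
    if product_id not in sim_df.index:
        return []  # unknown id: empty result instead of a printed message
    scores = sim_df.loc[product_id].drop(product_id).nlargest(k)
    return [(names[pid], float(score)) for pid, score in scores.items()]

print(similar_products(101, toy_cosine_df, toy_names))
print(similar_products(999, toy_cosine_df, toy_names))
```

Returning an empty list keeps the return type consistent, so callers can iterate over the result without checking for `None`.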


In [None]:
sim_cosine(22345)

----top25----


[('United Colors of Benetton Men Blue T-shirt', 1.0000000000000002),
 ('Puma Men Ferrari Vintage Black Polo T-shirt', 1.0000000000000002),
 ('Flying Machine Men Stripes Blue Polo Tshirts', 1.0000000000000002),
 ('Proline Blue Polo T-shirt', 1.0000000000000002),
 ('Locomotive Men Printed Blue T-shirt', 1.0000000000000002),
 ('Locomotive Men Printed Navy Blue TShirt', 1.0000000000000002),
 ('Lee Men Printed Blue Tshirts', 1.0000000000000002),
 ('Mark Taylor Men Blue Printed T-shirt', 1.0000000000000002),
 ("Mr.Men Men's Royal Blue T-shirt", 1.0000000000000002),
 ('Proline Men Blue & White Striped Polo T-shirt', 1.0000000000000002),
 ('Inkfruit Men Printed Blue T-shirt', 1.0000000000000002),
 ('U.S. Polo Assn. Men Stripes Limoges Polo Tshirt', 1.0000000000000002),
 ('Mr. Men Men Sleep Placid Blue Tshirts', 1.0000000000000002),
 ('U.S. Polo Assn. Men Stripes Blue Polo Tshirt', 1.0000000000000002),
 ('Locomotive Men Solid Blue TShirt', 1.0000000000000002),
 ('Arrow Sport Men Solid Blue Polo