#### 사용자의 검색어의 연관도를 활용한 사용자-제품 연관관계 추론

    1. 사용자의 모든 활동 기간 동안 검색한 단어들의 집합을 구하여 TF-IDF를 이용해서 사용자간 유사도를 계산한다. 
    
    2. 사용자 개인마다 사용자의 모든 구매 물품의 집합을 구하여 각 제품마다 이전에 구한 유사도 값을 더하도록 한다. 
    
    3. 최종적으로 특정한 사용자는 검색어를 통해 유사한 사람들의 구매이력을 바탕으로 각 제품들의 유사도를 확인할 수 있다.

 #### [Step 1] Data Processing
     1. 고객의 온라인 행동정보 테이블을 불러와 검색어들을 추출하는 단계이다.

In [1]:
import pandas as pd 
data = pd.read_csv("../data/online_action.csv", encoding='utf-8')
on_action_df = pd.DataFrame(data) 
keyword_df = on_action_df.dropna(subset=["sech_kwd"])
keyword_df = keyword_df[["clnt_id", "sech_kwd"]]
keyword_df = keyword_df.sort_values(by=["clnt_id"])
keyword_df.head()

  interactivity=interactivity, compiler=compiler, result=result)


Unnamed: 0,clnt_id,sech_kwd
3136925,1,과일선물세트 백화점
3091872,1,초등가을잠바
3084082,1,초등남아옷
3123796,1,노스페이스키즈
3142609,1,노스페이스키즈


    2. konlpy의 Okt 형태소 분석기를 활용하여 전처리하는 단계이다. 이 때 숫자를 제거하고 한글 중에서 명사("Noun")만 추출하여 집합으로 만든다.

In [2]:
from konlpy.tag import Okt
import re 
okt = Okt() 
p = re.compile("[^0-9]")

In [3]:
kwd_dict = dict() 
for i in range(len(keyword_df)): 
    row = keyword_df.iloc[i] 
    clnt_id, kwd = row["clnt_id"], row["sech_kwd"]
    kwd = ''.join(p.findall(kwd)) 
    nlp_kwd = okt.pos(kwd)
    if clnt_id not in kwd_dict: kwd_dict[clnt_id] = list()
    for elem, tag in nlp_kwd:
        if tag != 'Noun': continue
        if elem not in kwd_dict[clnt_id]: 
            kwd_dict[clnt_id].append(elem)

    3. TfidfVectorizer에 대입하기 위해 최종적으로 리스트에 저장하도록 한다.

In [4]:
key_dict = dict() 
text_list = list() 
idx = 0 
for clnt_id in kwd_dict: 
    text_list.append( " ".join(kwd_dict[clnt_id]) ) 
    key_dict[idx] = clnt_id 
    idx += 1

#### [step 2] TF-IDF Vectorizing
    각 사용자들이 검색했던 단어들의 빈도수를 비교하여 유사도를 Cosine similarity로 계산하는 과정이다. 이 때 공통적으로 자주 등장하는 단어는 제외하고 계산하도록 하는 효과를 지닌다.

    1. TfidfVectorizer의 객체에 min_df=1로 주어 길이가 1인 명사도 포함하도록 하였다. 
    2. 모든 단어 집합들을 Vectorizing하여 각 사용자들간의 유사도를 계산하도록 하였고 결과는 다음과 같다.

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(min_df=1)
tfidf_matrix = tfidf_vectorizer.fit_transform(text_list)
document_distances = (tfidf_matrix * tfidf_matrix.T)
document_distances
array = document_distances.toarray()
array

array([[1.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.03500367],
       [0.        , 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.03500367, 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ]])

#### [Step 3] 각 제품마다 각 사용자의 유사도 값 대입 및 합계
    1. 각 사용자의 구매이력 테이블을 불러와서 제품 정보가 있는 것만 불러온다.

In [5]:
data = pd.read_csv("../data/transaction.csv", encoding='utf-8')
trans_df = pd.DataFrame(data) 
trans_df = trans_df[trans_df["pd_c"] != "unknown"]
trans_df = trans_df.astype({"pd_c": 'int64'})
trans_df = trans_df[["clnt_id", "pd_c", "biz_unit"]]
trans_df.head()

Unnamed: 0,clnt_id,pd_c,biz_unit
4,39423,565,A03
5,21279,565,A03
6,48969,572,A03
7,30533,670,A03
8,64346,543,A03


    2. 각 제품마다 플랫폼("biz_unit")의 정보까지 구분하여 구매한 제품들의 집합을 구한다. 이후 새로운 테이블을 작성하는데 필요한 딕셔너리등을 선언하도록 하였다.

In [6]:
trans_dict = dict() 
product_set = set() 
for i in range(len(trans_df)): 
    row = trans_df.iloc[i]
    clnt_id, *pd_c = row
    pd_c = tuple(pd_c)
    if clnt_id not in trans_dict: trans_dict[clnt_id] = list() 
    if pd_c not in trans_dict[clnt_id]: trans_dict[clnt_id].append(pd_c)
    product_set.add(pd_c)
product_list = list(product_set)
product_list.sort(key=lambda x:x[0])

product_col_dict = dict() 
for idx, elem in enumerate(product_list): 
    pd_c, biz_unit = elem
    product_col_dict[str(pd_c) + "_" + biz_unit] = idx+1
product_col_dict

{'1_B01': 1,
 '2_B01': 2,
 '3_B01': 3,
 '4_A01': 4,
 '5_B02': 5,
 '5_B01': 6,
 '6_A03': 7,
 '6_B01': 8,
 '6_B02': 9,
 '6_A01': 10,
 '6_B03': 11,
 '7_B01': 12,
 '7_A01': 13,
 '8_B01': 14,
 '9_B01': 15,
 '10_B01': 16,
 '11_B01': 17,
 '12_B01': 18,
 '13_B01': 19,
 '13_A01': 20,
 '14_B01': 21,
 '15_B01': 22,
 '16_A02': 23,
 '17_A02': 24,
 '17_A01': 25,
 '18_B01': 26,
 '19_B01': 27,
 '20_B01': 28,
 '21_B01': 29,
 '22_B01': 30,
 '23_B02': 31,
 '23_B01': 32,
 '23_A03': 33,
 '24_A01': 34,
 '24_B01': 35,
 '25_B01': 36,
 '26_B01': 37,
 '27_B01': 38,
 '27_B02': 39,
 '27_A03': 40,
 '28_B01': 41,
 '29_B01': 42,
 '30_A03': 43,
 '30_B01': 44,
 '30_B02': 45,
 '31_B01': 46,
 '31_B02': 47,
 '31_A03': 48,
 '31_A02': 49,
 '31_A01': 50,
 '32_B01': 51,
 '32_A03': 52,
 '32_B02': 53,
 '32_A02': 54,
 '33_A03': 55,
 '33_B02': 56,
 '33_B01': 57,
 '34_B01': 58,
 '35_A01': 59,
 '35_B01': 60,
 '36_B01': 61,
 '36_A01': 62,
 '37_B01': 63,
 '38_B01': 64,
 '39_B01': 65,
 '40_B01': 66,
 '41_B01': 67,
 '42_B01': 68,
 '43

    3. 새로운 테이블을 작성하도록 한다. 각 사용자의 id와 4040개의 플랫폼 특성까지 구분한 제품들의 목록을 컬럼으로 지정한다.

In [5]:
col_list = list() 
col_list.append("clnt_id")
for pd_c, biz_unit in product_list:
    col = str(pd_c)+"_"+biz_unit
    col_list.append(col)
new_df = pd.DataFrame(columns=col_list)
new_df

Unnamed: 0,clnt_id,1_B01,2_B01,3_B01,4_A01,5_B02,5_B01,6_A01,6_B03,6_B01,...,1664_B01,1664_A01,1665_B01,1665_A01,1666_A01,1666_B01,1666_A02,1667_A02,1667_B01,1667_A01


    4. 사용자와 제품간의 상관도를 저장하는 테이블을 만드는 과정이다. 우선 TF-IDF를 이용한 테이블의 값들중 사용자 자기자신의 값은 제외하고 다른 사용자의 연관도를 불러와 그 사용자가 구매한 제품들에게 유사도의 값을 부여한다. 만약 다른 사용자가 구매한 제품과 겹칠경우 해당 유사도와 합하여 저장하도록 한다.

In [9]:
arr_len = len(array)
cnt = 0
for i in range(arr_len): 
    tmp_arr = array[i]
    tmp_list = list() 
    for idx, elem in enumerate(tmp_arr): 
        if elem >= 0.99 or elem <= 0: continue 
        idx = key_dict[idx]
        tmp_list.append((idx, elem))

    user_buy_list = trans_dict.get(key_dict[i], []) 
    user_id = key_dict[i]
    user_row = [int(user_id)] + [0]*len(product_list)
    for clnt_id, elem in tmp_list: 
        pd_list = trans_dict.get(clnt_id, [])
        for pd_c, biz_unit in pd_list: 
            col_name = str(pd_c) + "_" + biz_unit
            user_row[product_col_dict[col_name]] += elem
    new_df.loc[i] = user_row
    cnt += 1
    if cnt % 1000 == 0: print(cnt, "/", arr_len, "complete")

1000 / 38564 complete
2000 / 38564 complete
3000 / 38564 complete
4000 / 38564 complete
5000 / 38564 complete
6000 / 38564 complete
7000 / 38564 complete
8000 / 38564 complete
9000 / 38564 complete
10000 / 38564 complete
11000 / 38564 complete
12000 / 38564 complete
13000 / 38564 complete
14000 / 38564 complete
15000 / 38564 complete
16000 / 38564 complete
17000 / 38564 complete
18000 / 38564 complete
19000 / 38564 complete
20000 / 38564 complete
21000 / 38564 complete
22000 / 38564 complete
23000 / 38564 complete
24000 / 38564 complete
25000 / 38564 complete
26000 / 38564 complete
27000 / 38564 complete
28000 / 38564 complete
29000 / 38564 complete
30000 / 38564 complete
31000 / 38564 complete
32000 / 38564 complete
33000 / 38564 complete
34000 / 38564 complete
35000 / 38564 complete
36000 / 38564 complete
37000 / 38564 complete
38000 / 38564 complete


    5. 최종적인 테이블은 다음과 같다. 0에 가까울수록 사용자는 특정 제품에 대한 연관도가 없음을 뜻한다.

In [None]:
new_df = new_df.astype({"clnt_id": 'int64'})
new_df

Unnamed: 0,clnt_id,1_B01,2_B01,3_B01,4_A01,5_B01,5_B02,6_B03,6_A01,6_B02,...,1664_A01,1664_A02,1665_B01,1665_A01,1666_A02,1666_A01,1666_B01,1667_A01,1667_B01,1667_A02
0,1,0.057889,0.000000,0.000000,0.000000,0.307914,0.000000,0.000000,0.000000,0.000000,...,0.059934,0.563261,0.0,0.015307,1.170073,0.492548,0.542039,0.0,0.357980,0.355827
1,2,0.167871,0.040374,0.192273,0.000000,0.614879,0.000000,0.056325,0.036712,0.031314,...,0.061681,1.677228,0.0,0.000000,3.076033,0.404429,1.165988,0.0,1.581158,0.471470
2,3,0.103366,0.000000,0.000000,0.058457,0.183617,0.000000,0.000000,0.000000,0.000000,...,0.089014,1.074976,0.0,0.000000,1.139756,0.223298,0.265743,0.0,0.035172,0.336807
3,4,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000
4,6,0.052923,0.000000,0.000000,0.000000,0.538846,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.423454,0.0,0.000000,1.392167,0.204447,0.342597,0.0,0.347927,0.161338
5,7,0.000000,0.000000,0.000000,0.000000,0.168890,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.650266,0.0,0.000000,0.829322,0.044416,0.792453,0.0,0.381552,0.092941
6,8,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.094300,0.0,0.000000,0.176852,0.073672,0.149052,0.0,0.000000,0.000000
7,9,0.254589,0.052289,0.205298,0.000000,0.610180,0.000000,0.019336,0.000000,0.000000,...,0.043501,1.917178,0.0,0.000000,3.210538,0.488736,1.958645,0.0,1.766411,0.418309
8,10,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.0,0.000000,0.085535,0.000000,0.000000,0.0,0.000000,0.000000
9,11,0.061425,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.0,0.000000,0.101115,0.061425,0.871684,0.0,0.000000,0.101115


    6. 테이블의 크기가 큰 관계로 csv파일로 저장하였다.

In [None]:
new_df.to_csv("./user_item_corr_via_keyword.csv", 
                 mode='w', index=False, encoding='utf-8')

#### [Step 4] 실제 사용자 구매이력과의 연관관계
    지금까지 검색어를 통한 유사도를 이용하여 사용자와 제품간의 유사도를 측정해 보았다. 그렇다면 실제 사용자가 구매한 이력과 비교하여 과연 연관관계가 있는지 비교하는 과정이다.


    1. 우선 이전 검색어를 이용한 사용자와 제품의 유사도를 저장한 테이블을 불러온다.

In [7]:
data = pd.read_csv("./user_item_corr_via_keyword.csv", encoding='utf-8')
load_df = pd.DataFrame(data) 
load_df.head()

Unnamed: 0,clnt_id,1_B01,2_B01,3_B01,4_A01,5_B01,5_B02,6_B03,6_A01,6_B02,...,1664_A01,1664_A02,1665_B01,1665_A01,1666_A02,1666_A01,1666_B01,1667_A01,1667_B01,1667_A02
0,1,0.057889,0.0,0.0,0.0,0.307914,0.0,0.0,0.0,0.0,...,0.059934,0.563261,0.0,0.015307,1.170073,0.492548,0.542039,0.0,0.35798,0.355827
1,2,0.167871,0.040374,0.192273,0.0,0.614879,0.0,0.056325,0.036712,0.031314,...,0.061681,1.677228,0.0,0.0,3.076033,0.404429,1.165988,0.0,1.581158,0.47147
2,3,0.103366,0.0,0.0,0.058457,0.183617,0.0,0.0,0.0,0.0,...,0.089014,1.074976,0.0,0.0,1.139756,0.223298,0.265743,0.0,0.035172,0.336807
3,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,6,0.052923,0.0,0.0,0.0,0.538846,0.0,0.0,0.0,0.0,...,0.0,0.423454,0.0,0.0,1.392167,0.204447,0.342597,0.0,0.347927,0.161338


    2. 구매이력이 있는 사용자에 한해 유사도와 실제 구매 이력에 대한 연관관계를 분석할 예정이다. 우선 구매이력이 있는 사용자들만 해당 제품을 구매하였으면 1, 그렇지 않으면 0을 저장하는 새로운 테이블을 작성한다.

In [8]:
clnt_id_list = [x for x in trans_dict]
clnt_id_list.sort()

new_df_clnt_set = set(clnt_id_list)
load_df_clnt_set = set(load_df["clnt_id"])
total_clnt_set = set() 
for elem in new_df_clnt_set: 
    if elem in load_df_clnt_set: total_clnt_set.add(elem)
    
new_df = pd.DataFrame(columns=col_list)
length = len(product_col_dict)
for idx, clnt_id in enumerate(total_clnt_set): 
    tmp_list = user_row = [int(clnt_id)] + [0]*length
    for pd_c, biz_unit in trans_dict[clnt_id]: 
        col = str(pd_c) + '_' + biz_unit
        tmp_list[product_col_dict[col]] = 1
    new_df.loc[idx] = tmp_list
new_df = new_df.sort_values(by=["clnt_id"])
new_df

Unnamed: 0,clnt_id,1_B01,2_B01,3_B01,4_A01,5_B02,5_B01,6_A01,6_B03,6_B01,...,1664_B01,1664_A01,1665_B01,1665_A01,1666_A01,1666_B01,1666_A02,1667_A02,1667_B01,1667_A01
0,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,9,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,12,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,23,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10,24,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12,29,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
13,38,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
15,40,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
16,41,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
18,43,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


    3. 이전에 불렀던 검색어를 이용한 사용자와 제품간의 유사도를 저장한 테이블 중 구매이력이 있는 사용자들만의 행만 따로 저장한다.

In [9]:
compare_df = pd.DataFrame(columns=col_list)
cnt = 0
for i in range(len(load_df)):
    row = load_df.iloc[i] 
    if row["clnt_id"] in total_clnt_set:
        compare_df.loc[cnt] = row
        cnt += 1
compare_df = compare_df.astype({"clnt_id": "int"})
compare_df = compare_df.sort_values(by=["clnt_id"])
compare_df

Unnamed: 0,clnt_id,1_B01,2_B01,3_B01,4_A01,5_B02,5_B01,6_A01,6_B03,6_B01,...,1664_B01,1664_A01,1665_B01,1665_A01,1666_A01,1666_B01,1666_A02,1667_A02,1667_B01,1667_A01
0,2,0.167871,0.040374,0.192273,0.000000,0.000000,0.614879,0.036712,0.056325,0.935972,...,0.153431,0.061681,0.0,0.000000,0.404429,1.165988,3.076033,0.471470,1.581158,0.0
1,9,0.254589,0.052289,0.205298,0.000000,0.000000,0.610180,0.000000,0.019336,1.154622,...,0.159407,0.043501,0.0,0.000000,0.488736,1.958645,3.210538,0.418309,1.766411,0.0
2,12,0.212738,0.243121,0.109942,0.000000,0.000000,0.139861,0.091552,0.000000,0.584150,...,0.000000,0.082229,0.0,0.000000,0.294306,0.874308,2.822672,0.716513,1.333674,0.0
3,23,0.598132,0.000000,0.000000,0.014569,0.000000,0.673088,0.000000,0.000000,1.084705,...,0.000000,0.283898,0.0,0.085808,1.037722,0.643669,3.624017,0.921948,0.626181,0.0
4,24,0.336407,0.000000,0.107841,0.061515,0.000000,0.697206,0.050276,0.014498,1.516332,...,0.086355,0.104188,0.0,0.000000,0.422499,1.892356,2.786695,0.830854,2.142426,0.0
5,29,0.321060,0.023973,0.022021,0.013936,0.000000,0.448825,0.000000,0.000000,0.744457,...,0.000000,0.207171,0.0,0.023834,0.867126,0.544164,3.648488,1.096701,0.610933,0.0
6,38,0.079269,0.000000,0.000000,0.000000,0.000000,0.292769,0.051819,0.000000,0.494833,...,0.053400,0.000000,0.0,0.000000,0.093777,0.488400,1.551095,0.000000,0.914812,0.0
7,40,0.477613,0.000000,0.009805,0.009452,0.000000,0.605459,0.000000,0.000000,1.248133,...,0.000000,0.269883,0.0,0.071260,1.299320,1.020612,4.032257,1.333717,0.894281,0.0
8,41,0.262025,0.000000,0.000000,0.000000,0.000000,0.511361,0.000000,0.000000,0.494506,...,0.000000,0.098621,0.0,0.000000,0.317758,0.368384,1.310202,0.317895,0.670858,0.0
9,43,0.220994,0.000000,0.000000,0.000000,0.000000,0.302449,0.000000,0.000000,0.474272,...,0.000000,0.187705,0.0,0.009285,0.489416,0.364351,1.887315,0.541545,0.552825,0.0


    4. 각 제품별로 나누어 실제 유사도의 값과 구매여부를 비교하기 위해서는 유사도의 값은 연속된 값이고 구매여부는 0, 1로 나누어 지는 값으로 이루어져 있으므로 Point-biserial 방식으로 분석하기로 한다. 그 전에 연속된 값을 지닌 열들을 0~1사이의 값으로 Normalization하도록 하고 만약 모든 사용자가 특정 물품에 대한 유사도가 0이라면 분석하지 않기로 한다.

In [10]:
from scipy.stats import pointbiserialr
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()

result_dict = dict() 
for product in new_df: 
    binary = compare_df[product] 
    coefficient = new_df[product]
    coefficient = coefficient.astype("float64")
    if len(set(binary)) == 1 and len(set(coefficient)) == 1: continue
    x = coefficient.values.reshape(-1, 1)
    x_scaled = min_max_scaler.fit_transform(x)
    x_res = [elem[0] for elem in x_scaled]
    pbc = pointbiserialr(binary, x_res)
    result_dict[product] = pbc

  r = r_num / r_den
  r = r_num / r_den
  r = r_num / r_den
  r = r_num / r_den
  r = r_num / r_den
  r = r_num / r_den
  r = r_num / r_den
  r = r_num / r_den
  r = r_num / r_den
  r = r_num / r_den
  r = r_num / r_den
  r = r_num / r_den
  r = r_num / r_den
  r = r_num / r_den


    5. 연관관계를 구한 값 중 95%의 신뢰도를 따르는 제품들만 골라 결과 리스트에 저장하도록 하였다.

In [11]:
import math 
result = list() 
for key, val in result_dict.items():
    if not math.isnan(val[0]):
        if val[1] <= 0.05: result.append((key, *val))
result.sort(key=lambda x: x[1], reverse=True)
result = result[1:]
result

[('964_A03', 0.4857963381760907, 0.0),
 ('1617_A03', 0.466450955534489, 0.0),
 ('347_A03', 0.42656525073794405, 0.0),
 ('1395_A03', 0.41631248998365505, 0.0),
 ('1616_A03', 0.38148757077573375, 4.707215410952352e-280),
 ('1584_A03', 0.35403483385721074, 8.782789371212427e-239),
 ('114_A03', 0.3506817573406628, 5.144149775676515e-234),
 ('188_A03', 0.3487098553882208, 3.0764467787324527e-231),
 ('1213_A03', 0.34403903047800505, 9.67957705552561e-225),
 ('516_A03', 0.34385646853487517, 1.728105768649174e-224),
 ('354_A03', 0.33734967944364286, 1.2563076525214993e-215),
 ('1581_A03', 0.33319817084676023, 4.386163716505122e-210),
 ('1394_A03', 0.3308078059140125, 6.234562167238656e-207),
 ('194_A03', 0.3284774426456001, 6.938073948755634e-204),
 ('348_A03', 0.32538584157100925, 6.945047688938311e-200),
 ('1566_A03', 0.3246026030187087, 7.042774809943007e-199),
 ('1604_A03', 0.3223734935326796, 4.9509328797958655e-196),
 ('515_A03', 0.31491902955378864, 1.0985282523095844e-186),
 ('1573_A03

    6. 최종적으로 0.2 이상의 연관도를 가진 제품들은 다음과 같으며 연관도가 높진 않지만 어느정도 특정 사용자가 검색어를 참고하여 구매여부와 관련있는 제품들이 있다는 것을 확인할 수 있다.

In [12]:
final_result = [elem for elem in result if elem[1] >= 0.2]
len(final_result), final_result

(97,
 [('964_A03', 0.4857963381760907, 0.0),
  ('1617_A03', 0.466450955534489, 0.0),
  ('347_A03', 0.42656525073794405, 0.0),
  ('1395_A03', 0.41631248998365505, 0.0),
  ('1616_A03', 0.38148757077573375, 4.707215410952352e-280),
  ('1584_A03', 0.35403483385721074, 8.782789371212427e-239),
  ('114_A03', 0.3506817573406628, 5.144149775676515e-234),
  ('188_A03', 0.3487098553882208, 3.0764467787324527e-231),
  ('1213_A03', 0.34403903047800505, 9.67957705552561e-225),
  ('516_A03', 0.34385646853487517, 1.728105768649174e-224),
  ('354_A03', 0.33734967944364286, 1.2563076525214993e-215),
  ('1581_A03', 0.33319817084676023, 4.386163716505122e-210),
  ('1394_A03', 0.3308078059140125, 6.234562167238656e-207),
  ('194_A03', 0.3284774426456001, 6.938073948755634e-204),
  ('348_A03', 0.32538584157100925, 6.945047688938311e-200),
  ('1566_A03', 0.3246026030187087, 7.042774809943007e-199),
  ('1604_A03', 0.3223734935326796, 4.9509328797958655e-196),
  ('515_A03', 0.31491902955378864, 1.098528252309

#### [Step 5] Prototype
    그렇다면 실제 사용자의 검색어를 기준으로 가장 해당 사용자와 연관이 깊은 제품들을 추천하려는 프로토타입을 설계하고자 한다.

    1. 다음과 같이 사용자가 검색을 했다고 가정했을 때 이다. 이전 방식과 동일하게 NLP처리 후 명사만을 추출한다.

In [7]:
new_keyword = ["사과", "유기농", "맥주", "과자", "가을잠바, tkaf123"]
new_text = ""
p = re.compile("[^0-9]")
for keyword in new_keyword: 
    kwd = ''.join(p.findall(keyword)) 
    nlp_kwd = okt.pos(kwd)
    for elem, tag in nlp_kwd:
        if tag != 'Noun': continue
        new_text = new_text + " " +  elem
print(new_text)
text_list.append(new_text)

 사과 유기농 맥주 과자 가을 잠바


    2. TF-IDF 과정을 통해 비슷한 검색을 한 사용자와의 유사도를 파악하기 위해 동일한 과정을 거치고 예시의 사용자의 배열을 가져와 전체 사용자와의 유사도를 확인한다.

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(min_df=1)
tfidf_matrix = tfidf_vectorizer.fit_transform(text_list)
document_distances = (tfidf_matrix * tfidf_matrix.T)
document_distances
array = document_distances.toarray()
array[-1]

array([0.32508864, 0.        , 0.        , ..., 0.        , 0.        ,
       1.        ])

    3. 2 과정에서 가져온 배열 중 본인을 제외하고 유사도가 0 인 사용자는 제외한 사용자의 제품 구매 목록들을 통해 새로운 사용자와 모든 제품과의 유사도를 확인하도록 한다.

In [9]:
tmp_arr = array[-1]
tmp_list = list() 
for idx, elem in enumerate(tmp_arr): 
    if elem >= 0.99 or elem <= 0: continue 
    idx = key_dict[idx]
    tmp_list.append((idx, elem))
user_id = 'new_user'
user_row = [user_id] + [0]*len(product_list)
for clnt_id, elem in tmp_list: 
    pd_list = trans_dict.get(clnt_id, [])
    for pd_c, biz_unit in pd_list: 
        col_name = str(pd_c) + "_" + biz_unit
        user_row[product_col_dict[col_name]] += elem
user_row

['new_user',
 0.18208265937499107,
 0,
 0.09616061640713744,
 0,
 0,
 0.19653990818118117,
 0,
 0.710005550112761,
 0,
 0.0835664833494375,
 0,
 0.04783145155431058,
 0,
 0,
 0.04022500185182397,
 0.039963459971840025,
 0,
 0,
 0.09875055969694102,
 0.044490092379588356,
 0.0390942376097899,
 0,
 0.035159275142688316,
 0,
 0,
 0.02673063449142933,
 0,
 0.14199448569688977,
 0.13233012262869984,
 0,
 0,
 0.03840211776177635,
 0,
 0,
 0.24137610937632842,
 0.039693381406126126,
 0.013232825480410694,
 0.6857659807208034,
 0,
 0.053305127238421394,
 0,
 0,
 0.2870076180810755,
 0.8997308989438239,
 0.22433830597682375,
 0,
 0,
 0.37035806881002425,
 0,
 0.06205098569004992,
 0.8053597647078422,
 0.19601594622194438,
 0.18624557155928326,
 0.03308900273012612,
 0.7365210992131701,
 0.42684909100253565,
 1.2319209994489633,
 0,
 0,
 0,
 0.032754944618418325,
 0,
 0.025369343250463295,
 0.05334418482857218,
 0.10691342732576167,
 0.09793721372800643,
 0,
 0,
 0,
 0,
 0,
 0.035159275142688316

    4. 3의 과정을 거치면 다음과 같다.

In [13]:
col_list = list() 
col_list.append("clnt_id")
for pd_c, biz_unit in product_list:
    col = str(pd_c)+"_"+biz_unit
    col_list.append(col)
    
result_df = pd.DataFrame(columns=col_list)
result_df.loc[0] = user_row
result_df

Unnamed: 0,clnt_id,1_B01,2_B01,3_B01,4_A01,5_B02,5_B01,6_A03,6_B01,6_B02,...,1664_A01,1664_A02,1665_A01,1665_B01,1666_A02,1666_B01,1666_A01,1667_A01,1667_B01,1667_A02
0,new_user,0.182083,0,0.096161,0,0,0.19654,0,0.710006,0,...,0.027733,0.900857,0,0,1.147385,0.572381,0.633213,0,0.640133,0.240877


    5. 이 중 가장 유사도가 높게 측정된 제품들을 기준으로 정렬하였다.

In [25]:
new_result = list() 
for i in range(1, len(user_row)): 
    new_result.append([i, user_row[i]])
new_result.sort(key=lambda x:x[1], reverse=True)

[[2377, 34.34630663276697],
 [3434, 29.783861303243874],
 [3033, 26.00325424316589],
 [902, 25.058466447039386],
 [3935, 24.940602660546006],
 [3432, 22.199636764317976],
 [253, 20.93104179112478],
 [1343, 20.354586102085463],
 [2932, 19.95673752678104],
 [3027, 19.561620085259182],
 [906, 19.264968609340517],
 [561, 18.219455314843824],
 [925, 18.089638881667295],
 [1341, 17.754854444171386],
 [1492, 17.29120833959481],
 [417, 17.01243374264062],
 [3037, 16.714031512593245],
 [481, 16.4777122653467],
 [3933, 15.792703627586373],
 [3428, 14.747781401324271],
 [900, 14.744663630655282],
 [359, 14.423498481811874],
 [1121, 14.241076434090154],
 [2341, 13.689451271789723],
 [3840, 13.62165493266727],
 [1421, 13.418264197320596],
 [3934, 12.894117055571973],
 [911, 12.833378098891986],
 [463, 12.662289032241114],
 [3043, 12.431543905206395],
 [1397, 12.223310617055422],
 [3456, 12.095188385521931],
 [1411, 11.759566092677263],
 [428, 11.693025012840822],
 [2375, 11.612237992603403],
 [1351

    6. 실제로 제품의 소분류를 확인하여 과연 검색어와 연관이 있는 제품인지 확인하기 위해 제품 정보가 있는 테이블을 불러오도록 한다.

In [34]:
data = pd.read_csv("../data/product.csv", encoding='utf-8') 
product_df = pd.DataFrame(data) 
product_df = product_df[["pd_c", "clac_nm3"]]
product_df.head()

Unnamed: 0,pd_c,clac_nm3
0,1,Automobile Oil / Additives
1,2,Car Lights
2,3,Car Paint
3,4,Filters
4,5,Wiper Blades


    7. 상위 15개의 결과를 본 결과 주로 식료품이나 과자, 과일 등의 종류들이 나왔음을 확인할 수 있다.

In [37]:
final_result = list() 
for elem, _ in new_result[:15]: 
    tmp_result = product_df[ product_df["pd_c"] == product_list[elem][0] ]
    final_result.append(tmp_result["clac_nm3"].values[0])
final_result

['Processed Chicken Eggs',
 'Ramens',
 'General Snacks',
 'Fresh Milk',
 'Tofu',
 'Bibim Ramens',
 'Water',
 'Frozen Fried Foods',
 'Cookies',
 'Energy Bars',
 'Butter and Margarine',
 'Trash Bags',
 'Spoon Type Yogurts',
 'Frozen Fried Foods',
 'Bananas']