<h1> Precision Test </h1>
<ul>
    <li>Random Recommender</li>
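    <li>Popularity Based Recommender</li>
    <li>Content Based Recommender Using TF-IDF</li>
    <li>Content Based Recommender Using Count Vectorizer</li>
    <li>Test Set Creation</li>
    <li>Popularity and Random Recommender Precision</li>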
</ul>

In [12]:
%load_ext autoreload
%autoreload 2
%matplotlib inline




In [13]:
import pandas as pd
import numpy as np
import random
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel
from tqdm.auto import tqdm

In [14]:
articles_df = pd.read_csv("../datasets/articles_transactions_5.csv")
articles_df.shape

(26722, 26)

In [15]:
# articles_df is already loaded in the previous cell; only the transactions are read here.
T = pd.read_csv("../datasets/transactions_5.csv")
pd.set_option("display.max_rows", None)
# Work on a copy and replace missing descriptions so the text vectorizers never see NaN.
i = articles_df.copy()
i['detail_desc'] = i['detail_desc'].fillna("")

<h2>Random Recommender</h2>

In [16]:
def random_recommender(customer):
    """Return 5 distinct article ids drawn uniformly at random (the customer is ignored)."""
    # random.sample needs a sequence, so convert the column to a plain list first.
    return random.sample(articles_df['article_id'].tolist(), 5)
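
A seeded variant makes the random baseline reproducible between runs. This is only a sketch: `toy_article_ids` and `sample_articles` are hypothetical names standing in for the notebook's article catalogue and recommender.

```python
import numpy as np

# Hypothetical catalogue of article ids standing in for articles_df['article_id'].
toy_article_ids = np.arange(1000, 1020)

def sample_articles(article_ids, n=5, seed=None):
    """Draw n distinct article ids uniformly at random."""
    rng = np.random.default_rng(seed)
    # replace=False guarantees that no article is recommended twice.
    return rng.choice(article_ids, size=n, replace=False)

picks = sample_articles(toy_article_ids, n=5, seed=0)
print(picks)
```

Passing the same `seed` yields the same five ids, which helps when comparing the baseline against other recommenders.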

<h2>Popularity Based Recommender</h2>

In [17]:
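def popularity_recommender(customer):
    """Return the 5 most frequently bought article ids (the customer is ignored)."""
    # value_counts() is indexed by article_id, so the ids of the top 5 are the index values.
    popular_products = T['article_id'].value_counts().nlargest(5).index.values
    return popular_products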
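
To see what the popularity recommender computes, here is a small sketch on made-up data. `toy_T` and `top_n_articles` are hypothetical names; the real notebook reads its transactions from `transactions_5.csv`.

```python
import pandas as pd

# Hypothetical toy transactions with the same two columns the notebook relies on.
toy_T = pd.DataFrame({
    "customer_id": ["a", "a", "b", "b", "c", "c", "c"],
    "article_id":  [101, 102, 101, 103, 101, 102, 104],
})

def top_n_articles(transactions, n=5):
    """Return the n most frequently purchased article ids, most popular first."""
    counts = transactions["article_id"].value_counts()  # index = article_id, values = purchase count
    return counts.nlargest(n).index.to_numpy()

top2 = top_n_articles(toy_T, n=2)
print(top2)  # article 101 was bought 3 times, article 102 twice
```

Reading the ids from the index avoids depending on the column name that `to_frame()` produces, which differs between pandas versions.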

<h2>Content Based Recommender Using TF-IDF</h2>

In [18]:
vectorizer = TfidfVectorizer(stop_words="english")
candidate_profile_X = vectorizer.fit_transform(i["detail_desc"])
print(candidate_profile_X.shape)
# TfidfVectorizer L2-normalises each row by default, so the dot product is the cosine similarity.
cosine_similarity = linear_kernel(candidate_profile_X, candidate_profile_X)
print(cosine_similarity)
# Map article_id -> row position in the similarity matrix.
indices = pd.Series(i.index, index=i["article_id"]).drop_duplicates()

(26722, 2148)
[[1.         0.15415515 0.15415515 ... 0.         0.17445125 0.06544307]
 [0.15415515 1.         1.         ... 0.03417172 0.02818911 0.18738553]
 [0.15415515 1.         1.         ... 0.03417172 0.02818911 0.18738553]
 ...
 [0.         0.03417172 0.03417172 ... 1.         0.         0.01974166]
 [0.17445125 0.02818911 0.02818911 ... 0.         1.         0.05267121]
 [0.06544307 0.18738553 0.18738553 ... 0.01974166 0.05267121 1.        ]]
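
The off-diagonal 1.0 entries in the matrix above suggest that some articles share an identical description. A small sketch on made-up descriptions (`toy_desc` is hypothetical) shows this effect and checks that `linear_kernel` on TF-IDF rows matches scikit-learn's cosine similarity:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity as sk_cosine, linear_kernel

# Hypothetical descriptions: the first two are identical, the third shares no terms.
toy_desc = [
    "soft cotton jersey",
    "soft cotton jersey",
    "denim jacket pockets",
]

X = TfidfVectorizer(stop_words="english").fit_transform(toy_desc)
sim_linear = linear_kernel(X, X)   # plain dot products between rows
sim_cosine = sk_cosine(X, X)       # explicit cosine similarity

print(np.round(sim_linear, 3))
print(np.allclose(sim_linear, sim_cosine))
```

Because duplicates score 1.0 against each other, the most similar row is not guaranteed to be the article itself, so the query article is best excluded by index rather than by position.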


In [19]:
def rec_Tfid(article_id, cosine_similarity):
    """Return the 5 (row_index, score) pairs most similar to article_id, excluding the article itself."""
    idx = indices[article_id]
    # Drop the query article by index: articles with identical descriptions also score 1.0,
    # so the article itself is not guaranteed to be the last element after sorting.
    scores = [(j, s) for j, s in enumerate(cosine_similarity[idx]) if j != idx]
    scores = sorted(scores, key=lambda val: val[1])
    return scores[-5:]  # list of tuples, sorted by ascending similarity
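
Sorting a Python list of every score is slow for a catalogue of this size. A possible vectorised alternative, sketched with NumPy under the assumption that one row of the similarity matrix is passed in (`top_k_similar` is a hypothetical helper, not part of the notebook):

```python
import numpy as np

def top_k_similar(sim_row, self_idx, k=5):
    """Return (index, score) pairs for the k highest scores in sim_row, skipping self_idx."""
    order = np.argsort(sim_row)[::-1]       # indices from highest to lowest score
    order = order[order != self_idx][:k]    # drop the query item explicitly, then keep k
    return [(int(j), float(sim_row[j])) for j in order]

# Item 0 is the query; item 1 is an exact duplicate and therefore also scores 1.0.
row = np.array([1.0, 1.0, 0.2, 0.7, 0.0])
print(top_k_similar(row, self_idx=0, k=2))
```

The duplicate at index 1 is kept as a recommendation while the query item is removed, regardless of how the tie between the two is ordered.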

In [20]:
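def content_based_recommender_Tfid(customer):
    """Recommend the 5 articles most similar (TF-IDF) to anything the customer has bought."""
    best = {}  # row index -> highest similarity seen across the customer's purchases
    customer_purchases = T['article_id'][T['customer_id'] == customer].drop_duplicates().values
    for product in customer_purchases:
        for row_idx, score in rec_Tfid(product, cosine_similarity):
            best[row_idx] = max(score, best.get(row_idx, 0.0))
    scores = sorted(best.items(), key=lambda val: val[1])
    scores = scores[-5:]  # the list is sorted ascending, so the last 5 have the highest scores
    # Look ids up in articles_df (same row order as the matrix); `i` is reassigned later in the notebook.
    return [articles_df['article_id'].iloc[tar[0]] for tar in scores]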

<h2>Content Based Recommender Using Count Vectorizer
</h2>

In [None]:
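count_vectorizer = CountVectorizer(binary=True, stop_words="english")
profiles = count_vectorizer.fit_transform(i["detail_desc"])
# With binary features the dot product counts the terms two descriptions share (not normalised).
similarity = linear_kernel(profiles, profiles)
# Map article_id -> row position in the similarity matrix.
indices = pd.Series(i.index, index=i["article_id"]).drop_duplicates()
# indices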
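
Unlike the TF-IDF matrix, this similarity is an unnormalised overlap count. A small sketch on made-up descriptions (`toy_desc` is hypothetical) illustrates what the numbers mean:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical descriptions; the second contains every term of the first plus one more.
toy_desc = [
    "cotton jersey dress",
    "cotton jersey dress pockets",
    "denim jacket",
]

B = CountVectorizer(binary=True, stop_words="english").fit_transform(toy_desc)
overlap = linear_kernel(B, B)  # entry (a, b) = number of distinct terms shared by a and b

print(overlap)
```

Here the first description scores 3 against itself and also 3 against the second one, so an article's self-similarity can be tied by a longer description. That is one reason to exclude the query article by index instead of assuming it sorts last.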

In [None]:
def rec_count(article_id, similarity):
    """Return the 5 (row_index, score) pairs most similar to article_id, excluding the article itself."""
    idx = indices[article_id]
    # Drop the query article by index: a longer description containing all of its terms
    # ties with its self-similarity, so the article itself may not sort last.
    scores = [(j, s) for j, s in enumerate(similarity[idx]) if j != idx]
    scores = sorted(scores, key=lambda val: val[1])
    return scores[-5:]  # list of tuples, sorted by ascending similarity

In [None]:
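def content_based_recommender_count(customer):
    """Recommend the 5 articles sharing the most terms with anything the customer has bought."""
    best = {}  # row index -> highest similarity seen across the customer's purchases
    customer_purchases = T['article_id'][T['customer_id'] == customer].drop_duplicates().values
    for product in customer_purchases:
        for row_idx, score in rec_count(product, similarity):
            best[row_idx] = max(score, best.get(row_idx, 0.0))
    scores = sorted(best.items(), key=lambda val: val[1])
    scores = scores[-5:]  # the list is sorted ascending, so the last 5 have the highest scores
    # Look ids up in articles_df (same row order as the matrix); `i` is reassigned later in the notebook.
    return [articles_df['article_id'].iloc[tar[0]] for tar in scores]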

In [None]:
# print(content_based_recommender_count('65cb62c794232651e2ac711faa11c2b4e3d41d5f3b59b50bee3ffde1d5776644'))

<h2>Test Set Creation</h2>

<p>
From the Kaggle transactions, extract a set T for just one month.
Let U be the users who have at least one transaction in T.
Let I be the items that are in at least one transaction in T.

For any user u who has fewer than 5 transactions in T, 
- delete u from U
- delete u's transactions from T.

From U, choose 1000 users at random. Call these test-U.

For each user u in test-U, 
- move the last 20% of their transactions from T to test-T (i.e. delete them from T, insert them into test-T).

Train a recommender system on T. (This does not apply to the random recommender, but it does apply to the popularity recommender because you need to know which items in T are the most popular.)

Test the recommender as follows: for each u in test-U,
- obtain n recommendations (e.g. 5 recommendations)
- compute precision, i.e. the fraction of u's n recommendations that appear in test-T for user u.

Afterwards, you have the precision for each user in test-U, so compute the mean of these.

</p>
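
The hold-out step described above can be sketched on toy data as follows. This is an illustration under assumptions, not the notebook's own code: `split_last_fraction` and `toy` are hypothetical names, and each user's rows are assumed to already be in chronological order.

```python
import pandas as pd

def split_last_fraction(transactions, test_users, frac=0.20):
    """Move the last `frac` of each test user's rows into a held-out frame."""
    held_out = []
    for user in test_users:
        rows = transactions[transactions["customer_id"] == user]
        n_test = round(frac * len(rows))
        if n_test > 0:  # rows.iloc[-0:] would select every row, so skip that case
            held_out.append(rows.iloc[-n_test:])
    # Fall back to an empty frame with the same columns if nothing was held out.
    test = pd.concat(held_out) if held_out else transactions.iloc[0:0]
    train = transactions.drop(index=test.index)
    return train, test

# Hypothetical transactions: two users with five purchases each.
toy = pd.DataFrame({
    "customer_id": ["a"] * 5 + ["b"] * 5,
    "article_id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})

train, test = split_last_fraction(toy, test_users=["a"])
print(len(train), len(test))
```

Only user `a` is a test user here, so the last of their five purchases moves to `test` and all of user `b`'s rows stay in `train`.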

In [None]:
T.shape

In [None]:
u = T['customer_id'].drop_duplicates().to_frame()

In [None]:
u.shape

In [None]:
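# Use a new name here: `i` already holds the article table used by the content-based recommenders.
items = T['article_id'].drop_duplicates().to_frame()
items.shape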

In [None]:
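test_u = u.sample(n=1000, random_state=1)
held_out = []
for cust in tqdm(test_u['customer_id']):
    cust_transac = T[T['customer_id'] == cust]
    n_test = round(0.20 * len(cust_transac))
    if n_test == 0:
        continue  # guard: cust_transac[-0:] would select every row
    # Last 20% of the customer's rows, in the order they appear in T.
    held_out.append(cust_transac.iloc[-n_test:])
# DataFrame.append was removed in pandas 2.0, so concatenate once at the end.
test_t = pd.concat(held_out)
# Remove the held-out rows from the training transactions.
T.drop(labels=test_t.index, axis=0, inplace=True)
test_t.shape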

<h2>Popularity and Random Recommender Precision</h2>

In [None]:
n_recs = 5  # every recommender above returns 5 articles
precision_list_popularity = []
precision_list_random = []
precision_list_content = []
precision_list_content_count = []
for user in tqdm(test_u['customer_id']):
    popular_recommendations = popularity_recommender(user)
    random_recommendations = random_recommender(user)
    content_recommendations = content_based_recommender_Tfid(user)
    content_recommendations_count = content_based_recommender_count(user)

    # Held-out purchases of this user (the rows moved into test_t).
    u_purchases = test_t[test_t['customer_id'] == user]['article_id'].values

    # Precision = recommended articles found in the held-out purchases / number recommended.
    precision_list_popularity.append(len(np.intersect1d(popular_recommendations, u_purchases)) / n_recs)
    precision_list_random.append(len(np.intersect1d(random_recommendations, u_purchases)) / n_recs)
    precision_list_content.append(len(np.intersect1d(content_recommendations, u_purchases)) / n_recs)
    precision_list_content_count.append(len(np.intersect1d(content_recommendations_count, u_purchases)) / n_recs)

print("Mean precision of Popularity Recommender", np.mean(precision_list_popularity))
print("Mean precision of Random Recommender", np.mean(precision_list_random))
print("Mean precision of Content Based Recommender TF-IDF", np.mean(precision_list_content))
print("Mean precision of Content Based Recommender COUNT", np.mean(precision_list_content_count))
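
For reference, the per-user metric can be written as a small standalone helper. This is a minimal sketch: `precision_at_k` and the toy inputs are hypothetical, and it assumes precision is the number of hits divided by the number of recommendations.

```python
def precision_at_k(recommended, relevant, k=5):
    """Fraction of the first k recommendations that appear in the relevant set."""
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / k

# Hypothetical example: 2 of the 5 recommended articles were actually bought later.
recommended = [1, 2, 3, 4, 5]
held_out_purchases = [2, 5, 9]
p = precision_at_k(recommended, held_out_purchases, k=5)
print(p)  # 2 hits out of 5 recommendations

# Mean precision over several users is the plain average of the per-user values.
per_user = [p, precision_at_k([7, 8, 9, 10, 11], [1], k=5)]
mean_precision = sum(per_user) / len(per_user)
print(mean_precision)
```

Dividing by `k` keeps the result between 0 and 1, so the means for the different recommenders can be compared directly.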