# Knowledge Data Discovery and Neural Networks: Final Project

In this notebook we will encounter the domain of recommender systems.

The purpose of this section is to let you tackle a new problem with the skills you have acquired so far.

The grade will be based on your results on the test set and will be relative to those of your classmates. We attach a file, "example_submission.csv", which you need to submit so that we can check your results on the test set, "recommender_test.csv" (only we know the labels). Use "recommender_train.csv" for training and validating the algorithm you choose. We will evaluate you on the root mean squared error (RMSE) metric.
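
Because the test labels are hidden, the RMSE can only be estimated locally by holding out part of the training data. The following is a minimal sketch of that idea, assuming the training file has `user`, `item` and `rating` columns (as the later cells suggest). The helpers `rmse` and `holdout_split` and the toy values are illustrative and not part of the assignment.

```python
import numpy as np
import pandas as pd


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def holdout_split(ratings, frac=0.2, seed=0):
    """Randomly hold out `frac` of the rows for validation."""
    valid = ratings.sample(frac=frac, random_state=seed)
    fit = ratings.drop(valid.index)
    return fit, valid


# Toy data standing in for recommender_train.csv (hypothetical values).
toy = pd.DataFrame({
    "user": [0, 0, 1, 1, 2, 2, 3, 3, 4, 4],
    "item": [0, 1, 0, 2, 1, 2, 0, 1, 2, 3],
    "rating": [4, 5, 3, 2, 5, 4, 1, 3, 4, 2],
})
fit, valid = holdout_split(toy, frac=0.2, seed=0)

# Predict the mean rating of the fitting part for every held-out row.
baseline = np.full(len(valid), fit["rating"].mean())
print("validation RMSE:", rmse(valid["rating"], baseline))
```

Any model can be compared against this constant predictor: if its validation RMSE is not lower, something is likely wrong.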

We add a couple of questions here to guide you through the process of understanding the problem domain, but they will not be graded.
We recommend trying a couple of algorithms from the [surprise package](http://surpriselib.com/) and finding the one that works best for you.

We **recommend** reading a couple of online posts on "collaborative filtering" in recommender systems to get to know the topic.

#### Guided questions

1. What features do we have? Are they numerical, categorical, or both?
2. What are we trying to predict? Is it a classification or a regression problem?
3. Propose a very simple prediction algorithm that you can implement yourself (it doesn't have to be complicated, but make sure at least that each user gets a different rating for an item in the test set). You may find it useful, especially if you have problems with the [surprise package](http://surpriselib.com/) or another package that you want to use.
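
As one possible answer to question 3, here is a sketch of a bias baseline: the predicted rating is the global mean plus a per-user offset and a per-item offset. This is a common simple technique, not the method required by the assignment. The function names and the toy data are hypothetical, and the column names `user`, `item` and `rating` are assumed to match the training file.

```python
import pandas as pd


def fit_bias_baseline(ratings):
    """Estimate a global mean plus per-item and per-user offsets."""
    mu = ratings["rating"].mean()
    # Item offset: how far an item's ratings sit from the global mean.
    item_bias = (ratings["rating"] - mu).groupby(ratings["item"]).mean()
    # User offset: what remains after removing the global mean and item offset.
    residual = ratings["rating"] - mu - ratings["item"].map(item_bias)
    user_bias = residual.groupby(ratings["user"]).mean()
    return mu, user_bias, item_bias


def predict_bias_baseline(model, user, item):
    """Predict mu + b_user + b_item; unseen users or items contribute 0."""
    mu, user_bias, item_bias = model
    return mu + user_bias.get(user, 0.0) + item_bias.get(item, 0.0)


# Toy ratings (hypothetical values).
toy = pd.DataFrame({
    "user": [0, 0, 1, 1, 2],
    "item": [0, 1, 0, 1, 0],
    "rating": [5, 3, 4, 2, 3],
})
model = fit_bias_baseline(toy)
print(predict_bias_baseline(model, user=2, item=1))
print(predict_bias_baseline(model, user=99, item=1))  # unseen user
```

Because the user offset differs between users, two users generally receive different predictions for the same item, which is the minimum the question asks for.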


We also recommend reading the [original paper on SVD](https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf).
Other resources on collaborative filtering:

* [collaborative filtering with kNN](https://medium.com/sfu-cspmp/recommendation-systems-user-based-collaborative-filtering-using-n-nearest-neighbors-bf7361dc24e0)
* [more collaborative filtering](https://towardsdatascience.com/prototyping-a-recommender-system-step-by-step-part-1-knn-item-based-collaborative-filtering-637969614ea)

In [149]:
# add more packages in this section
import numpy as np
import pandas as pd
# import surprise # install it first
%matplotlib inline
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import pairwise_distances

In [150]:
train = pd.read_csv("data/recommender_train.csv")
test = pd.read_csv("data/recommender_test.csv")

In [232]:
test = pd.read_csv("data/recommender_test.csv")

In [153]:
# train_data=train.pivot(index='user', columns='item', values='rating').reset_index(drop=True)
# train_data

In [154]:
test

Unnamed: 0,user,item
0,0,692
1,1,1340
2,2,1499
3,3,2384
4,4,458
...,...,...
6035,6035,271
6036,6036,1013
6037,6037,578
6038,6038,271


In [155]:
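# Mean rating of each user
Mean = train.groupby(by="user", as_index=False)['rating'].mean()
# After the merge, rating_x is the raw rating and rating_y is the user's mean
train_avg = pd.merge(train, Mean, on='user')
# Adjusted rating: raw rating minus the user's mean rating
train_avg['adg_rating'] = train_avg['rating_x'] - train_avg['rating_y']
train_avg.head()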

Unnamed: 0,user,item,rating_x,rating_y,adg_rating
0,0,0,4,3.417722,0.582278
1,0,1,5,3.417722,1.582278
2,0,2,5,3.417722,1.582278
3,0,3,3,3.417722,-0.417722
4,0,4,2,3.417722,-1.417722


In [156]:
check = pd.pivot_table(train_avg,values='rating_x',index='user',columns='item')
check.head()

user \ item,0,1,2,3,4,5,6,7,8,9,...,3214,3215,3216,3217,3218,3219,3220,3221,3222,3223
0,4.0,5.0,5.0,3.0,2.0,2.0,4.0,5.0,5.0,1.0,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,5.0,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,


In [157]:
final = pd.pivot_table(train_avg,values='adg_rating',index='user',columns='item')
final.head()

user \ item,0,1,2,3,4,5,6,7,8,9,...,3214,3215,3216,3217,3218,3219,3220,3221,3222,3223
0,0.582278,1.582278,1.582278,-0.417722,-1.417722,-1.417722,0.582278,1.582278,1.582278,-2.417722,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,0.768571,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,


In [158]:
# Replace NaN with the movie (item) average of the adjusted ratings
final_movie = final.fillna(final.mean(axis=0))

# Replace NaN with the user average of the adjusted ratings
final_user = final.apply(lambda row: row.fillna(row.mean()), axis=1)
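
One consequence of the user-average fill is worth noting. Each user's adjusted ratings already average to zero, so filling a user's gaps with that user's average amounts to filling them with (approximately) zero. The toy sketch below illustrates this, assuming each (user, item) pair appears only once in the data. The frame `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy ratings (hypothetical values), one row per (user, item) pair.
toy = pd.DataFrame({
    "user": [0, 0, 0, 1, 1, 2, 2],
    "item": [0, 1, 2, 0, 2, 1, 2],
    "rating": [5, 3, 4, 2, 4, 1, 5],
})
# Subtract each user's own mean rating (same idea as adg_rating above).
toy["centered"] = toy["rating"] - toy.groupby("user")["rating"].transform("mean")
matrix = toy.pivot_table(values="centered", index="user", columns="item")

filled_by_user = matrix.apply(lambda row: row.fillna(row.mean()), axis=1)
filled_by_zero = matrix.fillna(0.0)

# With one rating per (user, item) pair the two fills agree up to rounding.
print(np.allclose(filled_by_user, filled_by_zero))
```

The item-average fill behaves differently, since an item's adjusted ratings need not average to zero.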

In [159]:
# User-user cosine similarity, with NaN replaced by the user average
b = cosine_similarity(final_user)
np.fill_diagonal(b, 0)  # a user should not count as their own neighbour
similarity_with_user = pd.DataFrame(b, index=final_user.index,
                                    columns=final_user.index)
similarity_with_user.head()

user,0,1,2,3,4,5,6,7,8,9,...,6030,6031,6032,6033,6034,6035,6036,6037,6038,6039
0,0.0,0.088983,0.027842,0.074631,0.022721,0.075254,0.091324,0.004353,0.106054,0.072068,...,0.046912,0.06615,0.015779,0.034663,0.023709,0.033642,0.018911,-0.015001,0.021805,-0.021538
1,0.088983,0.0,0.044145,0.091791,0.054692,0.172569,0.055311,0.117706,0.104808,0.209102,...,-0.005067,0.06393,0.006837,0.147216,0.018798,0.027862,-0.029027,0.042364,0.090816,-0.029186
2,0.027842,0.044145,0.0,0.023108,0.023958,0.067539,0.118446,0.05764,0.035077,0.099356,...,0.012045,0.064848,-0.009254,0.009589,0.01314,-0.04955,0.021236,0.016827,0.014049,0.053099
3,0.074631,0.091791,0.023108,0.0,0.095678,0.110725,0.080587,0.163385,0.077252,0.096846,...,-0.069119,0.172024,0.024692,0.145923,0.016177,-0.014352,-0.00156,-0.046801,0.056304,0.032911
4,0.022721,0.054692,0.023958,0.095678,0.0,0.059839,0.043282,0.029942,0.035088,0.008932,...,0.004066,0.076292,-0.018626,0.010696,-0.018958,0.047844,0.084663,-0.0112,0.0161,-0.010158
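
For intuition, the cosine similarity between two users is the dot product of their rating vectors divided by the product of the vector lengths. The sketch below reimplements that definition with NumPy on a toy matrix. `cosine_similarity_matrix` is an illustrative helper, not the scikit-learn function used above, and its handling of all-zero rows is an assumption of this sketch.

```python
import numpy as np


def cosine_similarity_matrix(x):
    """Pairwise cosine similarity between the rows of x."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    # Assumption of this sketch: all-zero rows get similarity 0 to everything.
    norms[norms == 0] = 1.0
    unit = x / norms
    return unit @ unit.T


# Toy user vectors (hypothetical values).
toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
sim = cosine_similarity_matrix(toy)
np.fill_diagonal(sim, 0)  # ignore self-similarity, as in the cell above
print(np.round(sim, 3))
```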


In [160]:
# User-user cosine similarity, with NaN replaced by the item (movie) average
cosine = cosine_similarity(final_movie)
np.fill_diagonal(cosine, 0)  # a user should not count as their own neighbour
similarity_with_movie = pd.DataFrame(cosine, index=final_movie.index,
                                     columns=final_movie.index)
similarity_with_movie.head()

user,0,1,2,3,4,5,6,7,8,9,...,6030,6031,6032,6033,6034,6035,6036,6037,6038,6039
0,0.0,0.699037,0.735184,0.768514,0.748429,0.745484,0.726303,0.725042,0.716481,0.693702,...,0.797737,0.790414,0.800669,0.783562,0.781314,0.784181,0.801594,0.791593,0.69835,0.80011
1,0.699037,0.0,0.774597,0.806088,0.796408,0.790302,0.744124,0.778458,0.742633,0.740724,...,0.840287,0.83871,0.845996,0.840269,0.825651,0.829215,0.847535,0.839256,0.753414,0.84515
2,0.735184,0.774597,0.0,0.86313,0.851906,0.836481,0.818845,0.834104,0.782887,0.780703,...,0.901527,0.897582,0.907114,0.890242,0.88171,0.880571,0.911176,0.902198,0.794231,0.911036
3,0.768514,0.806088,0.86313,0.0,0.882608,0.86744,0.83591,0.876561,0.8126,0.811815,...,0.927552,0.932787,0.938111,0.922335,0.916345,0.911255,0.939215,0.926775,0.819045,0.940906
4,0.748429,0.796408,0.851906,0.882608,0.0,0.851207,0.815554,0.848873,0.797445,0.782351,...,0.919312,0.916228,0.924354,0.9067,0.900851,0.902338,0.929975,0.915121,0.81001,0.925833


In [161]:
def find_n_neighbours(df, n):
    """For each row (user), return the labels of the n most similar users."""
    top_labels = ['top{}'.format(i) for i in range(1, n + 1)]
    return df.apply(
        lambda x: pd.Series(x.sort_values(ascending=False).iloc[:n].index,
                            index=top_labels),
        axis=1)
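
The row-wise `apply` above can be slow for several thousand users. A possible alternative, sketched here under the assumption that the similarity frame is square with users on both axes, ranks every row at once with `np.argsort`. The name `top_n_neighbours` and the toy matrix are hypothetical, and ties may be ordered differently than with `sort_values`.

```python
import numpy as np
import pandas as pd


def top_n_neighbours(similarity, n):
    """Return, for each row, the column labels of the n largest similarities."""
    # Sorting the negated values in ascending order ranks similarities descending.
    order = np.argsort(-similarity.to_numpy(), axis=1, kind="stable")[:, :n]
    labels = similarity.columns.to_numpy()[order]
    names = ["top{}".format(i) for i in range(1, n + 1)]
    return pd.DataFrame(labels, index=similarity.index, columns=names)


# Toy similarity matrix for users 10, 11 and 12 (hypothetical values).
sim = pd.DataFrame(
    [[0.0, 0.9, 0.1], [0.9, 0.0, 0.4], [0.1, 0.4, 0.0]],
    index=[10, 11, 12], columns=[10, 11, 12],
)
print(top_n_neighbours(sim, 2))
```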

In [162]:
# top 30 neighbours for each user (similarity from the user-average fill)
sim_user_30_u = find_n_neighbours(similarity_with_user,30)
sim_user_30_u.head()

user,top1,top2,top3,top4,top5,top6,top7,top8,top9,top10,...,top21,top22,top23,top24,top25,top26,top27,top28,top29,top30
0,857,4636,5810,97,4701,5237,359,3896,4655,1085,...,5090,5008,2940,1723,33,4842,5523,2512,5145,5486
1,2485,208,695,2237,941,3722,312,517,9,3788,...,52,646,2187,192,2009,405,3628,4102,5251,1665
2,5860,1972,4999,2283,89,3000,3124,549,729,5429,...,408,426,213,244,1987,6001,575,4731,845,1850
3,5810,3896,299,5247,3615,4783,1692,586,1094,6003,...,5548,1748,5012,3549,990,3462,3242,3197,558,1511
4,3069,5553,5567,3400,1449,3253,2480,2280,980,3506,...,3922,5624,4590,3450,3718,4583,233,1124,4173,2105


In [163]:
# top 30 neighbours for each user (similarity from the movie-average fill)
sim_user_30_m = find_n_neighbours(similarity_with_movie, 30)
sim_user_30_m.head()

user,top1,top2,top3,top4,top5,top6,top7,top8,top9,top10,...,top21,top22,top23,top24,top25,top26,top27,top28,top29,top30
0,2028,4655,3794,4636,1458,950,2293,1775,5145,5732,...,4668,5569,4864,1260,3710,1438,1906,2578,2140,2662
1,3156,1741,2930,4005,3994,3221,3203,5190,4645,3554,...,2676,1651,3972,2802,4235,1438,5950,2759,2681,5430
2,2035,5190,4731,4235,4604,1576,5011,4296,4723,496,...,3738,3375,1741,1906,1512,1911,2383,5499,1403,3158
3,1557,6003,5357,2949,2851,3209,1123,2712,1490,4051,...,3821,496,1741,2990,2048,1467,1588,4928,2173,1347
4,1403,2662,1531,4514,5567,4250,5569,2681,1741,4620,...,476,5282,4005,1645,6036,4543,4976,5190,1588,3954


In [164]:
def get_user_similar_movies(user1, user2):
    """Return the items rated by both users, with their ratings side by side."""
    common_movies = train_avg[train_avg.user == user1].merge(
        train_avg[train_avg.user == user2],
        on="item",
        how="inner")
    return common_movies

In [165]:
a = get_user_similar_movies(20,225)
a = a.loc[ : , ['rating_x_x','rating_x_y','item']]
a.head()

Unnamed: 0,rating_x_x,rating_x_y,item
0,3,4,1859
1,4,4,1134
2,2,1,491
3,3,1,1136
4,3,4,37


In [166]:
def User_item_score(user, item):
    # Ids of the user's 30 nearest neighbours
    a = sim_user_30_m[sim_user_30_m.index == user].values
    b = a.squeeze().tolist()
    # Adjusted ratings of the item, restricted to those neighbours
    c = final_movie.loc[:, item]
    d = c[c.index.isin(b)]
    f = d[d.notnull()]
    avg_user = Mean.loc[Mean['user'] == user, 'rating'].values[0]
    index = f.index.values.squeeze().tolist()
    # Similarity between the user and each contributing neighbour
    corr = similarity_with_movie.loc[user, index]
    fin = pd.concat([f, corr], axis=1)
    fin.columns = ['adg_score', 'correlation']
    fin['score'] = fin['adg_score'] * fin['correlation']
    # Similarity-weighted average of the neighbours' adjusted ratings,
    # added back onto the user's own mean rating
    nume = fin['score'].sum()
    deno = fin['correlation'].sum()
    final_score = avg_user + (nume / deno)
    return final_score
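
Written as a formula, the score computed by `User_item_score` appears to be the following. The notation is introduced here for illustration: $\bar{r}_u$ is the user's mean rating (`Mean`), $N_{30}(u)$ the 30 neighbours from `sim_user_30_m`, $s_{u,v}$ the entry of `similarity_with_movie`, and $\tilde{r}_{v,i}$ the adjusted rating in `final_movie`.

$$
\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v \in N_{30}(u)} s_{u,v}\,\tilde{r}_{v,i}}{\sum_{v \in N_{30}(u)} s_{u,v}}
$$

The denominator is a plain sum of similarities, as in the code. A common variant uses $\sum_v |s_{u,v}|$ instead, which avoids a near-zero denominator when positive and negative similarities cancel.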

In [222]:
score = User_item_score(320,224)
print("score (u,i) is",score)

score (u,i) is 4.312812044367899


In [235]:
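# Rows whose item id equals the smallest item id in the test set
filt = test[test.item == min(test.item)]
filt
# Manually overwrite row 5516 with user 5516 and item 300
test.loc[5516] = [5516, 300]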

In [236]:
test.loc[5516]

user    5516
item     300
Name: 5516, dtype: int64

In [237]:
# Predict a rating for every (user, item) pair in the test set
test['rating'] = test.apply(lambda x: User_item_score(x['user'], x['item']), axis=1)
test

Unnamed: 0,user,item,rating
0,0,692,3.812560
1,1,1340,4.328711
2,2,1499,4.533431
3,3,2384,4.285115
4,4,458,3.340225
...,...,...,...
6035,6035,271,4.346576
6036,6036,1013,4.162236
6037,6037,578,4.109313
6038,6038,271,3.118845


### Predictions

For your convenience, we include code that creates "example_submission.csv".
You need to replace "algo" with your best algorithm.
If you choose a different method to predict or to build the algorithm, you may write different code; using ours is not obligatory.

In [238]:
# Write the predictions to file, one value per line, without index or header
test['rating'].to_csv('example_submission.csv', index=False, header=False)
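
Before submitting, it may be worth reading the file back and checking that it has one prediction per test row and no missing values. The sketch below does this on stand-in data written to a temporary directory. The values are hypothetical, and the exact checks the graders apply are not known.

```python
import os
import tempfile

import pandas as pd

# Hypothetical predictions standing in for test['rating'].
predictions = pd.Series([3.81, 4.33, 4.53], name="rating")

path = os.path.join(tempfile.mkdtemp(), "example_submission.csv")
predictions.to_csv(path, index=False, header=False)

# Read the file back: one value per line, no header row.
written = pd.read_csv(path, header=None)[0]
assert len(written) == len(predictions), "row count changed on write"
assert written.notna().all(), "submission contains missing predictions"
print(written.describe())
```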