# Build Recommendation System by Sklearn

## Fire up Packages

In [1]:
import sklearn
import matplotlib.pyplot as plt
%matplotlib inline
import pandas
from sklearn.cross_validation import train_test_split
import numpy



## Load Data

In [2]:
song_data=pandas.read_csv('song_data.csv')

In [3]:
song_data.head()

Unnamed: 0,user_id,song_id,listen_count,title,artist,song
0,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOAKIMP12A8C130995,1,The Cove,Jack Johnson,The Cove - Jack Johnson
1,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBBMDR12A8C13253B,2,Entre Dos Aguas,Paco De Lucia,Entre Dos Aguas - Paco De Lucia
2,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBXHDL12A81C204C0,1,Stronger,Kanye West,Stronger - Kanye West
3,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBYHAJ12A6701BF1D,1,Constellations,Jack Johnson,Constellations - Jack Johnson
4,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SODACBL12A8C13C273,1,Learn To Fly,Foo Fighters,Learn To Fly - Foo Fighters


In [4]:
song_data.shape

(1116609, 6)

## Statement

**Unlike other machine learning issues, a recommendation system can be built in several ways. In Graphlab, the recommendation system is built in the package already. However, in order to better understand the mechanism behind the recommendation system, I create this notebook in order to break down the steps of building a basic recommendation system.**

## User-Item Collaborative Filtering

**In this case, the recommendation system should be built with the method of user-item collaborative filtering. Namely, the recommendation should be backed up by: “Customers who are similar to you also liked …”.**

### A: Recommend Songs

#### Build user-item matrix

In [18]:
song=song_data.drop_duplicates(['user_id','song','title','artist'])
song=song.reset_index(drop=True)

###### Due to the memory limitation of Jupyter notebook, I just use first 100000 rows of data.

In [19]:
song=song.ix[0:100000,]
song.shape

(100001, 6)

In [20]:
n_user=len(song['user_id'].unique())
n_song=len(song['song_id'].unique())
n_artist=len(song['artist'].unique())
print 'We have '+str(n_user)+' unique users,'+str(n_song)+' unique songs and '+str(n_artist)+' unique artist in the data table.'

We have 5905 unique users,9890 unique songs and 3359 unique artist in the data table.


###### User_item matrix can be created by pivot table.

In [21]:
song_pivot=song.pivot(index='user_id',columns='song_id',values='listen_count')

In [22]:
song_pivot.shape

(5905, 9890)

In [23]:
song_pivot=song_pivot.fillna(0)

###### Another way of creating matrix is to loop over the dataframe and fill in the zero matrix.

#### Implement KNN method to recommend songs

In [24]:
from sklearn.neighbors import NearestNeighbors
knn=NearestNeighbors(n_neighbors=20,algorithm='brute',metric='cosine')
Model=knn.fit(song_pivot)

**Let us try a user**

In [25]:
song_pivot=song_pivot.reset_index(drop=True)

In [30]:
User_Index=Model.kneighbors(song_pivot.ix[1,])[1][0]



** Now we have a list of indexes of the users that are similar to the user we want to recommend the songs. We will find out who are they and what songs do they like to listen. After that, we can create a list with the songs we want to recommend.**

In [122]:
All_user=song['user_id'].unique()
Self=User_Index[0]
Others=User_Index[1:]
Relevant_user=All_user[Others]
User_data=song[song['user_id'].isin(Relevant_user)]
All_song=pandas.DataFrame(User_data.groupby(['song'])['listen_count'].sum())
All_song=pandas.DataFrame({'Count':All_song['listen_count'],'Song':All_song.index.tolist()})
All_song=All_song.sort('Count',ascending=False)
All_song=All_song.reset_index(drop=True)



**The top 30 Recommended songs for the specific user**

In [131]:
Recommended_Song=All_song['Song'][0:30]
print Recommended_Song

0                               Rio - Another Sunny Day
1                          Strani Amori - Laura Pausini
2     Sinisten tähtien alla - J. Karjalainen & Musta...
3                                 Nothing - Ryan Leslie
4                         Ain\'t Misbehavin - Sam Cooke
5                 Frisch und g\'sund - Die Mooskirchner
6               Just Dance - Lady GaGa / Colby O\'Donis
7                          Représente - Alliance Ethnik
8     All I Do Is Win (feat. T-Pain_ Ludacris_ Snoop...
9                        Fireflies - Charttraxx Karaoke
10                             Missing You - John Waite
11               Pass Out (Instrumental) - Tinie Tempah
12    What Goes Around...Comes Around - Justin Timbe...
13                      Here Without You - 3 Doors Down
14                                Bulletproof - La Roux
15    Horn Concerto No. 4 in E flat K495: II. Romanc...
16                            Kryptonite - 3 Doors Down
17                                 Not Big - Lil

### B: Recommend Artists

##### Count how many times the use listen the music of artist

In [146]:
artist_count=song.groupby(['user_id','artist'],as_index=False).size().reset_index(name='count')
artist_count=pandas.DataFrame(artist_count)

##### Create user item matrix

In [148]:
pivot_artist=artist_count.pivot(index='user_id',columns='artist',values='count')

In [149]:
pivot_artist=pivot_artist.fillna(0)

In [151]:
Model_artist=knn.fit(pivot_artist)

##### Also, try a user

In [153]:
Neighbours=Model_artist.kneighbors(pivot_artist.ix[0,:])



In [154]:
Neighbour_index=Neighbours[1][0]

In [159]:
Art_others=Neighbours[1:]
All_user_artist=artist_count.ix[Neighbour_index,:]

In [161]:
All_user_artist=All_user_artist.sort('count',ascending='False')

  if __name__ == '__main__':


In [162]:
Recommended_artists = All_user_artist['artist']
print Recommended_artists

1468              Sara Bareilles
430     The All-American Rejects
549                   Trey Songz
3970                    The Gits
4949              The Black Keys
1295          Corinne Bailey Rae
4679                  Crazy Town
5644                  Katy Perry
4612              Camera Obscura
2316             Alliance Ethnik
986                     Harmonia
766          Kanye West / T-Pain
1192               Elliott Smith
4354     Prince & The Revolution
503           Christina Aguilera
820                 Travie McCoy
836                  Damien Rice
388                      Flyleaf
5721                   Daft Punk
0              Alanis Morissette
Name: artist, dtype: object


## Conclusion

**So far, I have built two recommender systems that can recommend songs and artists to the specific user. The basic theory behind it is to use cosine distance to filter the similar users, and then, based on their known preference, recommend songs to the user.**

**Key elements include:**

**1: Use of pivot table, a fast and convenient way to build user-item matrix**

**2: KNN Model: Simple but effective way to find similar users **

**3: Group by: Serves as the function of SQL in python. Transform the data frame into the form we want to have**