<img src="http://drive.google.com/uc?export=view&id=1tpOCamr9aWz817atPnyXus8w5gJ3mIts" width=500px>

# Content and Popularity based Recommender System

## Problem Statement


Many companies worldwide recommend songs to listeners based on their interests. Popular examples include Spotify, iTunes, Amazon Music and Saavn. Song recommendations help users discover new artists who make music similar to what they already listen to. This helps increase revenue on these platforms and helps artists earn a living by streaming their music online.

In this exercise, we will build a recommendation system that recommends a list of songs based on a user's song preference.

## Attribute Information

There are 2 files that we will be using in this case study, 'songs.csv' and 'song_extra_info.csv'. 

The 'songs.csv' file has the following attributes:

- song_id: Unique id of the song
- song_length: Duration of the song
- genre_ids: Unique id of the genre of the song
- artist_name: Name of the artist who performs the song
- composer: Name of the composer of the song
- lyricist: Name of the lyricist of the song
- language: The language of the song



The 'song_extra_info.csv' file has the following attributes:

- song_id: Unique id of the song
- name: name of the song
- isrc: International standard recording code

## Table of Contents

1. Import Libraries

2. Setting options

3. Read Data 

4. Exploratory Data Analysis and Data Preprocessing

  4.1 - Check shape 

  4.2 - Check for missing values

  4.3 - Sample only 10000 data points from the huge dataset 



5. Content Based Recommendation System

6. Popularity Based Recommendation

7. Conclusion and Interpretation

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 1. Import Required Libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 
import os
import glob
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from zipfile import ZipFile

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

# filterwarnings to ignore all unnecessary warnings and logs
import warnings
warnings.filterwarnings('ignore')

## 2. Setting Options

In [3]:
# display all dataframe columns
pd.options.display.max_columns = None

# display floats with 7 decimal places
pd.options.display.float_format = '{:.7f}'.format

# display all dataframe rows
pd.options.display.max_rows = None

In [4]:
os.chdir('/content/drive/MyDrive')
os.getcwd()

'/content/drive/MyDrive'

## 3. Read Data

In [3]:
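# read the data files
# NOTE: with encoding='latin', the non-Latin artist and song names show up as garbled
# characters in the outputs below; if the files are UTF-8 encoded, encoding='utf-8'
# would preserve them
songs = pd.read_csv('C:/Users/Mrinal Kalita/Python Projects/Recommendation System/songs.csv',encoding='latin')
info = pd.read_csv('C:/Users/Mrinal Kalita/Python Projects/Recommendation System/song_extra_info.csv',encoding='latin')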

In [6]:
# check few rows of the imported dataset
songs.head()

Unnamed: 0,song_id,song_length,genre_ids,artist_name,composer,lyricist,language
0,CXoTN1eb7AI+DntdU1vbcwGRV4SCIDxZu+YD8JP8r4E=,247640,465,å¼µä¿¡å² (Jeff Chang),è£è²,ä½åå¼,3.0
1,o0kFgae9QtnYgRkVPqLJwa05zIhRlUjfF7O1tDw0ZDU=,197328,444,BLACKPINK,TEDDY| FUTURE BOUNCE| Bekuh BOOM,TEDDY,31.0
2,DwVvVurfpuz+XPuFvucclVQEyPqcpUkHR0ne1RQzPs0=,231781,465,SUPER JUNIOR,,,31.0
3,dKMBWoZyScdxSkihKG+Vf47nc18N9q4m58+b4e7dSSE=,273554,465,S.H.E,æ¹¯å°åº·,å¾ä¸ç,3.0
4,W3bqWd3T+VeHFzHAUfARgW9AvVRaF4N5Yzm4Mr6Eo/o=,140329,726,è²´æç²¾é¸,Traditional,Traditional,52.0


In [7]:
info.head()

Unnamed: 0,song_id,name,isrc
0,LP7pLJoJFBvyuUwvu+oLzjT+bI+UeBPURCecJsX1jjs=,æå,TWUM71200043
1,ClazTFnk6r0Bnuie44bocdNMM3rdlrq0bCGAsGUWcHE=,Let Me Love You,QMZSY1600015
2,u2ja/bZE3zhCGxvbbOB3zOoUjx27u40cf5g09UXMoKQ=,åè«æ,TWA530887303
3,92Fqsy0+p6+RHe2EoLKjHahORHR1Kq1TBJoClW9v+Ts=,Classic,USSM11301446
4,0QFmz/+rJy1Q56C1DuYqT9hKKqi5TUqx0sN0IwvoHrw=,ææç¾ç¶²,TWA471306001


## 4. Exploratory Data Analysis and Data Preprocessing

### 4.1 Check shape

In [8]:
songs.shape

(2296320, 7)

In [9]:
info.shape

(2295971, 3)

In [10]:
# check the columns in each dataframe
print(songs.columns)
print('===============================================')
print(info.columns)

Index(['song_id', 'song_length', 'genre_ids', 'artist_name', 'composer',
       'lyricist', 'language'],
      dtype='object')
Index(['song_id', 'name', 'isrc'], dtype='object')


In [11]:
#merge the two dataframes
df = info.merge(songs,on='song_id')

In [12]:
print(df.columns)

Index(['song_id', 'name', 'isrc', 'song_length', 'genre_ids', 'artist_name',
       'composer', 'lyricist', 'language'],
      dtype='object')
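
The merged frame has fewer rows (2295422) than `info` (2295971), which is consistent with `merge` defaulting to an inner join that keeps only ids present in both files. The sketch below illustrates this on two toy frames; the ids and values are made up, and `info_toy` / `songs_toy` are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Toy stand-ins for the two files; ids and values are invented for illustration.
info_toy = pd.DataFrame({"song_id": ["a", "b", "c"], "name": ["S1", "S2", "S3"]})
songs_toy = pd.DataFrame({"song_id": ["a", "b", "d"], "composer": ["X", None, "Z"]})

# merge() defaults to how="inner": only ids found in both frames survive.
inner = info_toy.merge(songs_toy, on="song_id")

# An outer join with indicator=True records which side each row came from.
outer = info_toy.merge(songs_toy, on="song_id", how="outer", indicator=True)
unmatched = outer[outer["_merge"] != "both"]

print(len(inner), "matched rows;", len(unmatched), "unmatched rows")
```

Running the same outer join with `indicator=True` on the real frames would show how many ids appear in only one of the two files.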


In [13]:
# make a copy of the merged dataframe to work on the composer column
df_composer = df.copy()

In [14]:
# keep only the columns needed for the composer-based recommender
df_composer = df_composer[['song_id', 'name', 'composer']]

In [15]:
df_composer.shape

(2295422, 3)

In [16]:
#Check Data types
df_composer.dtypes

song_id     object
name        object
composer    object
dtype: object

In [17]:
df_composer.head(4)

Unnamed: 0,song_id,name,composer
0,LP7pLJoJFBvyuUwvu+oLzjT+bI+UeBPURCecJsX1jjs=,æå,An-An Tso
1,ClazTFnk6r0Bnuie44bocdNMM3rdlrq0bCGAsGUWcHE=,Let Me Love You,Justin Bieber| William Grigahcine| Andrew Watt...
2,u2ja/bZE3zhCGxvbbOB3zOoUjx27u40cf5g09UXMoKQ=,åè«æ,A Qin
3,92Fqsy0+p6+RHe2EoLKjHahORHR1Kq1TBJoClW9v+Ts=,Classic,Evan Bogart|Andrew Goldstein|Lindy Robbins|Ema...


### 4.2 Check missing values

In [18]:
# Check for missing values present
print('Number of missing values across columns-\n', df_composer.isnull().sum())


Number of missing values across columns-
 song_id           0
name              2
composer    1070938
dtype: int64


The name column has 2 missing values and the composer column has 1070938, out of 2295422 records in total.

Let's drop the missing values.

In [19]:
df_composer.dropna(inplace=True)

In [20]:
df_composer.isnull().sum()

song_id     0
name        0
composer    0
dtype: int64

### 4.3 Sample only 10000 data points from the huge dataset

In [21]:
df_sampled = df_composer.sample(n=10000,random_state=98)

In [22]:
df_sampled.head()

Unnamed: 0,song_id,name,composer
528785,mIPh1riiWsr6144pZrVCkif1Yi5+185/mq/lRwRDdco=,ä¸è½åè¨´ä½ Â,Zheng Zhi-Hua
887327,fF2MVQ+R9jZ3A6EEwqHpmtlePMUqRODS/hOTC5lpFDc=,California Dreamin',A. Phillips| M. G. Phillips
1381286,nCItc/z5KUqJI9/hj3XUj1pvZ03Sg0tsOx9eT/FgXPM=,GREENSLEVES,DOMÃNIO PÃBLICO CLAUDE DEBUSSY (1862 1918)
171239,NIIl5lIxh6ZCbfGCJA+Yq6IRgkiOI61Q0PAKoqKciEM=,Another Day in Paradise,Bertie Higgins
1941379,h0H0wjQL9TGyMJOJEF5nL7pjYOO3pg/Tu+/HMjRDTa4=,V.V.V.,Sone


In [23]:
df_sampled.shape

(10000, 3)
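
The sketch below illustrates two properties of `DataFrame.sample` that matter for the steps that follow: a fixed `random_state` makes the sample reproducible, and the sampled rows keep their original index labels (which is why the index is reset later). The frame `toy` is a made-up stand-in for `df_composer`.

```python
import pandas as pd

# Made-up stand-in for df_composer.
toy = pd.DataFrame({"name": [f"song_{i}" for i in range(100)]})

# The same random_state yields the same rows on every run.
first = toy.sample(n=10, random_state=98)
second = toy.sample(n=10, random_state=98)
same_rows = first.index.equals(second.index)

# Sampling is without replacement by default, so labels are unique,
# and they are the original row labels rather than 0..9.
print(same_rows, first.index.is_unique)
print(first.index.tolist())
```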

## 5. Content Based Recommendation System

We will create a document-term count matrix using CountVectorizer on the composer column.

To learn more about Count Vectorizer click [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

In [24]:
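# min_df=1 (the default) keeps every term; newer scikit-learn releases may reject an integer min_df of 0
cv = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')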
cv_matrix = cv.fit_transform(df_sampled['composer'])

In [25]:
cv_matrix.shape

(10000, 26958)

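We compute the pairwise similarity between songs from the count matrix generated by the count vectorizer. Note that `linear_kernel` returns raw dot products; these equal cosine similarity only when the rows are L2-normalized (as TF-IDF output is by default), so for raw counts `cosine_similarity` would be needed to obtain scores between 0 and 1.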

In [26]:
cosine_sim = linear_kernel(cv_matrix, cv_matrix)

In [27]:
cosine_sim.shape

(10000, 10000)
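
The difference between a dot product and a cosine similarity on raw counts can be seen on a tiny example. This is an illustrative sketch; `counts_toy` is a made-up count matrix, not data from the notebook.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Three "songs" over a two-term vocabulary. Rows 0 and 1 point in the same
# direction but have different lengths; row 2 shares no terms with them.
counts_toy = np.array([[2.0, 0.0],
                       [1.0, 0.0],
                       [0.0, 3.0]])

dot = linear_kernel(counts_toy, counts_toy)      # raw dot products
cos = cosine_similarity(counts_toy, counts_toy)  # dot products of L2-normalized rows

print(dot)
print(cos)
```

With the dot product, row 0 scores higher against itself (4) than row 1 does against itself (1), so scores depend on how many tokens a composer string contains. Cosine similarity removes that length effect and gives 1 for both.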

In [28]:
df_sampled.head()

Unnamed: 0,song_id,name,composer
528785,mIPh1riiWsr6144pZrVCkif1Yi5+185/mq/lRwRDdco=,ä¸è½åè¨´ä½ Â,Zheng Zhi-Hua
887327,fF2MVQ+R9jZ3A6EEwqHpmtlePMUqRODS/hOTC5lpFDc=,California Dreamin',A. Phillips| M. G. Phillips
1381286,nCItc/z5KUqJI9/hj3XUj1pvZ03Sg0tsOx9eT/FgXPM=,GREENSLEVES,DOMÃNIO PÃBLICO CLAUDE DEBUSSY (1862 1918)
171239,NIIl5lIxh6ZCbfGCJA+Yq6IRgkiOI61Q0PAKoqKciEM=,Another Day in Paradise,Bertie Higgins
1941379,h0H0wjQL9TGyMJOJEF5nL7pjYOO3pg/Tu+/HMjRDTa4=,V.V.V.,Sone


In [29]:
df_sampled = df_sampled.reset_index()

In [30]:
df_sampled.head()

Unnamed: 0,index,song_id,name,composer
0,528785,mIPh1riiWsr6144pZrVCkif1Yi5+185/mq/lRwRDdco=,ä¸è½åè¨´ä½ Â,Zheng Zhi-Hua
1,887327,fF2MVQ+R9jZ3A6EEwqHpmtlePMUqRODS/hOTC5lpFDc=,California Dreamin',A. Phillips| M. G. Phillips
2,1381286,nCItc/z5KUqJI9/hj3XUj1pvZ03Sg0tsOx9eT/FgXPM=,GREENSLEVES,DOMÃNIO PÃBLICO CLAUDE DEBUSSY (1862 1918)
3,171239,NIIl5lIxh6ZCbfGCJA+Yq6IRgkiOI61Q0PAKoqKciEM=,Another Day in Paradise,Bertie Higgins
4,1941379,h0H0wjQL9TGyMJOJEF5nL7pjYOO3pg/Tu+/HMjRDTa4=,V.V.V.,Sone


In [31]:
titles = df_sampled['name']
indices = pd.Series(df_sampled.index, index=df_sampled['name'])

We create an `indices` Series that maps each song name to its row position in `df_sampled`.

In [32]:
indices.head()

name
ä¸è½åè¨´ä½ Â           0
California Dreamin'        1
GREENSLEVES                2
Another Day in Paradise    3
V.V.V.                     4
dtype: int64
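
One caveat with a name-to-position lookup is that song names are not unique in the sample. The sketch below, using made-up titles, shows that a lookup on a duplicated name returns a Series rather than a single position, and one possible way to keep only the first occurrence. The names `titles_toy` and `indices_toy` are hypothetical.

```python
import pandas as pd

# Made-up titles with one duplicate.
titles_toy = pd.Series(["Intro", "Tonight", "Intro"])
indices_toy = pd.Series(titles_toy.index, index=titles_toy)

unique_hit = indices_toy["Tonight"]  # a single position
dup_hit = indices_toy["Intro"]       # a Series holding both positions

# One option: keep only the first occurrence of each name.
first_only = indices_toy[~indices_toy.index.duplicated(keep="first")]
first_pos = first_only["Intro"]

print(unique_hit, len(dup_hit), first_pos)
```

Looking songs up by `song_id` instead of by name would avoid the ambiguity altogether.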

This function takes a song name as an argument and finds its index. It then gets the similarity scores of that song against every song in the sample and sorts them from highest to lowest. Finally, it skips the first entry (assumed to be the song itself), keeps the next 30 scores, and returns the song names at those indices.

In [33]:
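def get_recommendations(Name):
    # look up the row position of the song; a duplicated name returns several positions
    idx = indices[Name]
    if isinstance(idx, pd.Series):
        idx = idx.iloc[0]  # use the first match when the name is not unique
    # pair each song position with its similarity score to the query song
    sim_scores = list(enumerate(cosine_sim[idx]))
    # sort from most to least similar
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # skip the first entry (assumed to be the query song itself) and keep the next 30
    sim_scores = sim_scores[1:31]
    music_indices = [i[0] for i in sim_scores]
    return titles.iloc[music_indices]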


Let us try it on a few songs

In [34]:
get_recommendations('Another Day in Paradise').head(5)

354               Sound Of The Underground
5770    Little White Lies (Wideboys Remix)
9074                               Floatin
0                        ä¸è½åè¨´ä½ Â 
1                      California Dreamin'
Name: name, dtype: object

In [35]:
get_recommendations('V.V.V.').head(10)

0                   ä¸è½åè¨´ä½ Â 
1                 California Dreamin'
2                         GREENSLEVES
3             Another Day in Paradise
5                             Tai Chi
6            æ¨ä¸å¾æç¼ççé
7     Ãtude No. 3 in E Major| Op. 10
8                             Hey You
9                       Fools Rush In
10                          é¨æ«»è±
Name: name, dtype: object
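
In both outputs above, the later recommendations simply follow the row order (0, 1, 2, ...), which suggests that those songs have a similarity of zero and only fill the remaining slots. The sketch below is one possible variant that returns only songs with a positive score and excludes the query song explicitly instead of assuming it sorts first. The function `recommend_nonzero` and the toy inputs are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def recommend_nonzero(sim, titles, idx, top_n=30):
    """Return up to top_n titles with a positive score, excluding row idx."""
    # Scores of every song against the query, with the query itself removed.
    scores = pd.Series(sim[idx], index=titles.index).drop(idx)
    # Keep positive scores only, highest first (stable sort keeps row order on ties).
    scores = scores[scores > 0].sort_values(ascending=False, kind="stable")
    return titles.loc[scores.index[:top_n]]

# Toy dot-product matrix: song 0 overlaps with song 2 only, and its score
# against song 2 (2.0) is higher than its score against itself (1.0).
sim_toy = np.array([[1.0, 0.0, 2.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0],
                    [2.0, 0.0, 4.0, 0.0],
                    [0.0, 0.0, 0.0, 1.0]])
titles_toy = pd.Series(["A", "B", "C", "D"])

print(recommend_nonzero(sim_toy, titles_toy, 0).tolist())
print(recommend_nonzero(sim_toy, titles_toy, 1).tolist())
```

In this toy case, slicing the sorted list with `[1:]` would drop song "C" and keep the query song "A", because raw dot products do not guarantee that a song scores highest against itself. Returning an empty result for a song with no overlapping composer tokens also makes it clear that the content-based approach has nothing to suggest.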

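## 6. Popularity Based Recommendation

We can also recommend songs based on popularity. The two files contain no play counts, so as a rough proxy we display the 5 song names that occur most often in the sample.

In [39]:
# the 5 most frequent song names in our sample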
df_sampled['name'].value_counts().head(5)

1. Allegro    10
I. Allegro     7
I Love You     6
Tonight        6
Intro          5
Name: name, dtype: int64
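
Counting names measures how often a title occurs across different recordings rather than how often a song is played. If a listening log were available, popularity could be computed per `song_id` instead. The sketch below assumes a hypothetical `plays` table with one row per listen; neither that table nor its contents exist in the files used in this case study.

```python
import pandas as pd

# Hypothetical play log: one row per listen (not part of the case-study data).
plays = pd.DataFrame({"song_id": ["s1", "s2", "s1", "s3", "s1", "s2"]})

# Made-up song metadata; two different songs share the name "Intro".
names = pd.DataFrame({"song_id": ["s1", "s2", "s3"],
                      "name": ["Intro", "Tonight", "Intro"]})

# Count listens per song, attach the names, and rank by play count.
play_counts = plays.groupby("song_id").size().rename("plays").reset_index()
top = play_counts.merge(names, on="song_id").sort_values("plays", ascending=False)

print(top.head(5))
```

Here the two songs named "Intro" are ranked separately by their own play counts, whereas `value_counts()` on the name column would merge them into one entry.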

## 7. Conclusion and Interpretation

Thus, we have built a content-based song recommendation engine using a sample of 10000 songs from the full dataset of songs that was available to us.