# **Capstone Project: Music Recommendation System**

## **Context**
----------------
Spotify is an audio content provider with a huge market base across the world. With the ever-increasing volume of songs available on the Internet, searching for songs of interest has become a tedious task in itself. Spotify has grown significantly in part because of its ability to recommend the ‘best’ next song to each customer, based on a huge preference database gathered over time from millions of customers and billions of songs. It does this with recommendation systems that suggest songs based on users’ likes and dislikes.

----------------
## **Objective**
----------------
Build a recommendation system to propose the top 10 songs for a user based on the likelihood of listening to those songs.

----------------
## **Dataset**
----------------
**song_data**
- song_id: A unique id given to every song
- title: Title of the song
- release: Name of the released album
- artist_name: Name of the artist
- year: Year of release

**count_data**
- user_id: A unique id given to the user
- song_id: A unique id given to the song
- play_count: Number of times the song was played
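
A quick schema check can confirm that the loaded files match the description above before any cleaning starts. This is a minimal sketch on toy data, not part of the original analysis: the helper `missing_columns` and the `EXPECTED_COLUMNS` mapping are hypothetical names, and the lowercase column names are an assumption about how the CSV headers are spelled.

```python
import pandas as pd

# Assumed column names, taken from the dataset description (lowercase spelling is an assumption)
EXPECTED_COLUMNS = {
    "song_data": {"song_id", "title", "release", "artist_name", "year"},
    "count_data": {"user_id", "song_id", "play_count"},
}

def missing_columns(frame, expected):
    """Return the expected column names that are absent from the DataFrame, sorted."""
    return sorted(set(expected) - set(frame.columns))

# Toy frames standing in for the real files
toy_song_df = pd.DataFrame(columns=["song_id", "title", "release", "artist_name", "year"])
toy_count_df = pd.DataFrame(columns=["user_id", "song_id"])  # play_count deliberately left out

print(missing_columns(toy_song_df, EXPECTED_COLUMNS["song_data"]))
print(missing_columns(toy_count_df, EXPECTED_COLUMNS["count_data"]))
```

An empty list means every described column is present; any names returned point to a header that is missing or spelled differently.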

## **Importing the necessary libraries and overview of the dataset**

In [None]:
import warnings                                 # Used to suppress warnings in the code output
warnings.filterwarnings('ignore')

import numpy as np                              # Basic Python libraries for numeric and dataframe computations
import pandas as pd

import matplotlib.pyplot as plt                 # Basic library for data visualization
import seaborn as sns                           # Slightly more advanced library for data visualization

from collections import defaultdict             # A dictionary that returns a default value instead of raising a KeyError

from sklearn.metrics import mean_squared_error  # A performance metric from sklearn

In [None]:
import os
import sys
module_path = os.path.abspath(os.path.join('../'))
if module_path not in sys.path:
    sys.path.append(module_path)

import utility.PlotlyObject as plot

from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
init_notebook_mode()
import plotly.graph_objs as go

### **Loading the song data**

In [None]:
file_path = "/Users/mac/PycharmProjects/Capstone-Project---Music-Recommendation-System/csv/"

In [None]:
# Import the dataset

song_df = pd.read_csv(f'{file_path}song_data.csv') # There are no headers in the data file
song_df.head()  # Some values in the year column are 0. We should drop them.

In [None]:
song_df.tail()  # Some values in the year column are 0. We should drop them.

In [None]:
song_df.isna().sum()  # The title column has 15 null values and the release column has 5. We should drop them as well.

In [None]:
song_df.info() #Should the year column be 'int64'?

**Observations**
- The title column has 15 null values and the release column has 5. We should drop these rows.
- Some values in the year column are 0. We should drop these rows as well.

### **Remove the null values in the song data**

In [None]:
song_df2 = song_df.dropna().copy()  # Drop rows with null values; copy so later column assignments act on an independent frame
song_df2.isna().sum()  # Check that all null values have been removed

In [None]:
song_df2["normalized_year"] = song_df2['year'].apply(lambda x: "NA" if x == 0 else x)  # Label year == 0 as "NA"
# rename_axis keeps the year column named 'index' across pandas versions
year_count = song_df2['normalized_year'].value_counts().rename_axis('index').reset_index(name='count')
year_count_2 = year_count[year_count['index'] != "NA"].sort_values(['index'])  # Exclude "NA" and order by year

#create bar plot for song count distribution in all years
trace = plot.create_bar(year_count_2, "index", 'count', '', True)

data = [trace]

layout = go.Layout(
    title=dict(
            text="<b><br> Song count distribution across different years <b>",
            x=0.5,
            xanchor='center'),
    xaxis=dict(
        title='year',
        showline=True
    ),
    yaxis=dict(
        title='Song count',
        showline=True
    ))
fig = go.Figure(data=data, layout=layout)
fig.show()

In [None]:
#create line plot for song count distribution in all years

trace = plot.create_scatter_trace(year_count_2, "index", 'count', '', True)

data = [trace]

layout = go.Layout(
    title=dict(
            text="<b><br> Song count distribution across different years <b>",
            x=0.5,
            xanchor='center'),
    xaxis=dict(
        title='year',
        showline=True
    ),
    yaxis=dict(
        title='Song count',
        showline=True
    ))
fig = go.Figure(data=data, layout=layout)
fig.show()

In [None]:
song_df2.year.describe()

In [None]:
# song_df2['year'].apply(lambda x: "NA" if x == 0 else x) #Remove year == 0

song_df2 = song_df2.drop(song_df2[song_df2.year == 0].index)  # Remove the rows where year == 0
song_df2.shape  # Only 515576 rows remain, none of them null or with a year of 0.

In [None]:
song_df2.tail()  # Checked the head and tail. No year of 0 remains.

In [None]:
# Other columns may contain 0 too. Check whether any title or release is literally called '0'.
(song_df2.release == '0').value_counts()  # '0' is a string here, not an integer

In [None]:
# Inspect the rows where release == '0'. Checked the internet and the artist's Spotify profile page:
# the artist does have an album called '0', and the song titles are correct even though they are called "***" and "**".
song_df2[song_df2.release == '0'].reset_index()

# Therefore, we do not remove these rows.

**Observations**
- The song data now has 515576 rows, none of which contain null values or a year of 0.

### **Load the count data**

In [None]:
count_df = pd.read_csv(f'{file_path}count_data.csv', index_col = 0)
count_df.head()

In [None]:
count_df.tail()

In [None]:
count_df.isna().sum()  # Check for null values in the data set. There are none.

In [None]:
count_df.shape #There are 2000000 rows and 4 columns in this data set. 

In [None]:
count_df.info() 

In [None]:
df = pd.merge(count_df, song_df2, on = 'song_id', how = 'inner')
df

In [None]:
df.info()
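
An inner join silently drops every play-count row whose `song_id` is missing from the cleaned song data, so it can help to count those rows before relying on the merged frame. The following is a small sketch on toy data rather than the notebook's actual result; `toy_counts` and `toy_songs` are hypothetical stand-ins for `count_df` and `song_df2`.

```python
import pandas as pd

# Toy stand-ins for count_df and song_df2 (hypothetical data)
toy_counts = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "song_id": ["s1", "s2", "s9"],   # s9 has no entry in the song table
    "play_count": [1, 2, 3],
})
toy_songs = pd.DataFrame({
    "song_id": ["s1", "s2", "s3"],
    "title": ["A", "B", "C"],
})

# A left join with indicator=True adds a "_merge" column marking where each row was found
checked = pd.merge(toy_counts, toy_songs, on="song_id", how="left", indicator=True)

# Rows that an inner join would have dropped
unmatched = int((checked["_merge"] == "left_only").sum())

# Equivalent of the inner join: keep only the rows present in both tables
kept = checked[checked["_merge"] == "both"].drop(columns="_merge")

print(f"{unmatched} play-count row(s) have no matching song; {len(kept)} row(s) kept")
```

If `song_id` is expected to be unique in the song table, passing `validate="many_to_one"` to `pd.merge` makes pandas raise an error when it is not, which guards against duplicated song rows inflating the merged frame.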

**Approaches to consider**
- rank-based learning
- content-based learning?
- user-user collaborative learning
- item-item collaborative learning
- model-based collaborative learning
- clustering-based?
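
As an illustration of the first item in this list, a rank-based recommender can be sketched by ranking songs on their average play count and returning the top N, which is the same for every user. This is one possible reading of "rank-based", not necessarily the approach this project ends up using; the function `top_n_by_play_count`, its `min_listeners` threshold, and the toy data are all hypothetical.

```python
import pandas as pd

def top_n_by_play_count(interactions, n=10, min_listeners=1):
    """Return the n song_ids with the highest average play count.

    Songs with fewer than min_listeners play-count rows are excluded,
    so a single heavy listener cannot push a song to the top.
    """
    stats = (
        interactions.groupby("song_id")["play_count"]
        .agg(avg_count="mean", listeners="count")
        .reset_index()
    )
    eligible = stats[stats["listeners"] >= min_listeners]
    # Highest average first; ties broken by the number of listeners
    ranked = eligible.sort_values(["avg_count", "listeners"], ascending=False)
    return ranked.head(n)["song_id"].tolist()

# Toy interactions with the same columns as the merged frame (hypothetical data)
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u1", "u2", "u3"],
    "song_id": ["s1", "s1", "s1", "s2", "s2", "s3"],
    "play_count": [5, 3, 4, 1, 2, 9],
})

top_songs = top_n_by_play_count(toy, n=2, min_listeners=2)
print(top_songs)
```

On the real data, the call would look like `top_n_by_play_count(df, n=10, min_listeners=...)`, where the threshold is a tuning choice that trades coverage against reliability of the average.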

In [None]:
df.shape  # df2 is not defined in this notebook; the merged frame is df

In [None]:
song_df2['year'].unique()