# Preparing the development of a music recommender system

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Data Cleaning

##### users.csv

**Task**: Import the *users.csv* file 

In [166]:
users = pd.read_csv("users.csv", delimiter=";")
print(users.shape)
users.head()

(512, 5)


Unnamed: 0,uid,p,m1,m2,m3
0,33616,1,1136,,1250.632049
1,32048,Yes,2011,,2294.355415
2,29095,Yes,1486,,1346.632769
3,32106,No,131,,140.639993
4,31885,Yes,929,,820.273301


**Task**: Rename the columns according to the description in the exercise sheet so that they are more readable.

In [197]:
users.rename(columns={"uid": "UserId", 
                      "p": "premium", 
                      "m1": "minutes1", 
                      "m2": "minutes2", 
                      "m3": "minutes3"}, 
             inplace=True)
users.head()

Unnamed: 0,UserId,premium,minutes1,minutes2,minutes3
0,33616,True,1136,1193.316025,1250.632049
1,32048,True,2011,2152.677708,2294.355415
2,29095,True,1486,1416.316384,1346.632769
3,32106,False,131,135.819997,140.639993
4,31885,True,929,874.63665,820.273301


**Task**: Unify the labels for the *Premium* attribute.

In [198]:
# Assign the result back instead of calling replace(..., inplace=True) on an attribute
users["premium"] = users["premium"].replace({"1": True, "Yes": True, "0": False, "No": False})
users.premium.value_counts()

False    273
True     239
Name: premium, dtype: int64

**Task**: Impute the missing values of the attribute *Minutes2* using the values of *Minutes1* and *Minutes3*.

In [200]:
users.minutes2 = users.minutes2.fillna((users.minutes1 + users.minutes3) / 2)
users.head()

Unnamed: 0,UserId,premium,minutes1,minutes2,minutes3
0,33616,True,1136,1193.316025,1250.632049
1,32048,True,2011,2152.677708,2294.355415
2,29095,True,1486,1416.316384,1346.632769
3,32106,False,131,135.819997,140.639993
4,31885,True,929,874.63665,820.273301
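As a quick sanity check of the imputation rule, the sketch below applies the same midpoint formula to a small hand-made frame. The frame `toy` is hypothetical; its first row reuses the *minutes1* and *minutes3* values of the first user shown above, and the second row has an invented, already present *minutes2* value to show that `fillna` leaves existing entries alone.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): minutes2 is missing in row 0 and present in row 1.
toy = pd.DataFrame({
    "minutes1": [1136.0, 2011.0],
    "minutes2": [np.nan, 2000.0],
    "minutes3": [1250.632049, 2294.355415],
})

# Midpoint of the two neighbouring values, used only where minutes2 is missing.
midpoint = (toy["minutes1"] + toy["minutes3"]) / 2
toy["minutes2"] = toy["minutes2"].fillna(midpoint)

print(toy)
```

The first row should come out close to the 1193.316025 shown in the notebook output, since (1136 + 1250.632049) / 2 = 1193.3160245.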


##### user_behavior.csv

**Task**: Read the *user_behavior.csv* file.

In [170]:
user_behavior = pd.read_csv("user_behavior.csv", delimiter=";")
user_behavior.head()

Unnamed: 0,user_id,song_id,num_clicks,ml,g,f,mod,artists
0,29158,55060,64,251.98246,Rock,1,2023-09-22,662
1,33692,5080,63,260.001056,Pop,0,2023-08-14,4937
2,31198,25839,24,105.35502,Hip-Hop,1,2023-06-24,6289
3,33302,87341,36,142.085267,Electronic,0,2023-07-22,1356
4,34592,47110,74,301.314994,Pop,0,2023-04-28,8373


**Task**: Rename the columns according to the description in the exercise sheet.

In [171]:
user_behavior.rename(columns={"user_id": "UserId", 
                              "song_id": "SongId", 
                              "num_clicks": "NumClicks", 
                              "ml": "MinutesListened",
                              "g": "Genre",
                              "f": "Favorite",
                              "mod": "Modification",
                              "artists": "Artists",
                             }, 
                     inplace=True)
user_behavior.head()

Unnamed: 0,UserId,SongId,NumClicks,MinutesListened,Genre,Favorite,Modification,Artists
0,29158,55060,64,251.98246,Rock,1,2023-09-22,662
1,33692,5080,63,260.001056,Pop,0,2023-08-14,4937
2,31198,25839,24,105.35502,Hip-Hop,1,2023-06-24,6289
3,33302,87341,36,142.085267,Electronic,0,2023-07-22,1356
4,34592,47110,74,301.314994,Pop,0,2023-04-28,8373


**Task:** Fix the data types of the attributes *Genre* (categorical) and *Favorite* (binary, categorical).

In [172]:
user_behavior.Genre = user_behavior.Genre.astype('category')
user_behavior.Favorite = user_behavior.Favorite.astype('category')
user_behavior.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   UserId           10000 non-null  int64   
 1   SongId           10000 non-null  int64   
 2   NumClicks        10000 non-null  int64   
 3   MinutesListened  10000 non-null  float64 
 4   Genre            10000 non-null  category
 5   Favorite         10000 non-null  category
 6   Modification     10000 non-null  object  
 7   Artists          10000 non-null  int64   
dtypes: category(2), float64(1), int64(4), object(1)
memory usage: 489.2+ KB


**Task:** Some genres have more songs than others. Adjust the data set such that it includes only the four largest genres and the genre "Other" that summarizes all remaining genres.

In [173]:
# add category Other
user_behavior.Genre = user_behavior.Genre.cat.add_categories(["Other"])

In [174]:
count = user_behavior.Genre.value_counts()
# value_counts() sorts in descending order, so the first four labels are the largest genres
allowed_vals = count.index[:4].tolist()

user_behavior.loc[~user_behavior.Genre.isin(allowed_vals), "Genre"] = "Other"
user_behavior.head()

Unnamed: 0,UserId,SongId,NumClicks,MinutesListened,Genre,Favorite,Modification,Artists
0,29158,55060,64,251.98246,Rock,1,2023-09-22,662
1,33692,5080,63,260.001056,Pop,0,2023-08-14,4937
2,31198,25839,24,105.35502,Hip-Hop,1,2023-06-24,6289
3,33302,87341,36,142.085267,Electronic,0,2023-07-22,1356
4,34592,47110,74,301.314994,Pop,0,2023-04-28,8373
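The same lumping step can be tried on a toy series. This is a minimal sketch with invented genre counts, not the notebook's data. It also adds one optional step that the cells above do not perform: dropping the now-empty categories from the categorical dtype with `cat.remove_unused_categories()`.

```python
import pandas as pd

# Toy data (hypothetical): six genres with unequal frequencies.
genres = pd.Series(
    ["Rock"] * 5 + ["Pop"] * 4 + ["Hip-Hop"] * 3 + ["Electronic"] * 3
    + ["Jazz"] * 2 + ["Folk"],
    dtype="category",
)

# Keep the four most frequent labels; value_counts sorts in descending order.
top4 = genres.value_counts().index[:4].tolist()

# Register "Other" first; assigning an unknown label to a categorical raises an error.
genres = genres.cat.add_categories(["Other"])
genres[~genres.isin(top4)] = "Other"

# Optional: drop the now-empty categories (Jazz, Folk) from the dtype.
genres = genres.cat.remove_unused_categories()

print(genres.value_counts())
```

Without the last step the old labels remain as zero-count categories, which can show up as empty groups in later `value_counts` or `groupby` results.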


**Task:** Create new columns for the weekday, year, month, and day of each date in *ModifiedAt* (the `Modification` column in this notebook).

In [175]:
# Convert to datetime once and reuse the result for all derived columns
modified = pd.to_datetime(user_behavior.Modification)

user_behavior["weekday"] = modified.dt.day_name()
user_behavior["year"] = modified.dt.year
user_behavior["month"] = modified.dt.month
user_behavior["day"] = modified.dt.day
user_behavior.head()

Unnamed: 0,UserId,SongId,NumClicks,MinutesListened,Genre,Favorite,Modification,Artists,weekday,year,month,day
0,29158,55060,64,251.98246,Rock,1,2023-09-22,662,Friday,2023,9,22
1,33692,5080,63,260.001056,Pop,0,2023-08-14,4937,Monday,2023,8,14
2,31198,25839,24,105.35502,Hip-Hop,1,2023-06-24,6289,Saturday,2023,6,24
3,33302,87341,36,142.085267,Electronic,0,2023-07-22,1356,Saturday,2023,7,22
4,34592,47110,74,301.314994,Pop,0,2023-04-28,8373,Friday,2023,4,28


##### artists.csv

**Task**: Read the *artists.csv* file and rename the columns according to the exercise sheet.

In [183]:
artists = pd.read_csv("artists.csv", delimiter=";")
artists.head()

Unnamed: 0,artist_id,genre,featured,monthly_listeners
0,8497,Error,0,529178
1,3331,Error,0,1319507
2,7097,Rock,0,739339
3,1352,Error,0,127536
4,457,Error,0,387548


**Task:** Convert the attributes *Genre* and *Featured* to categorical variables.

In [186]:
artists.genre = artists.genre.astype('category')
artists.featured = artists.featured.astype('bool')
artists.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 899 entries, 0 to 898
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   artist_id          899 non-null    int64   
 1   genre              899 non-null    category
 2   featured           899 non-null    bool    
 3   monthly_listeners  899 non-null    int64   
dtypes: bool(1), category(1), int64(2)
memory usage: 16.1 KB


### Data aggregation

**Task:** Merge the *users* and *user_behavior* tables together. Create a view in which you determine how many minutes a user listens to songs on average. Additionally, what is the highest number of clicks a user had on a song?

In [236]:
print(users.columns)
print(user_behavior.columns)

Index(['UserId', 'premium', 'minutes1', 'minutes2', 'minutes3'], dtype='object')
Index(['UserId', 'SongId', 'NumClicks', 'MinutesListened', 'Genre', 'Favorite',
       'Modification', 'Artists', 'weekday', 'year', 'month', 'day'],
      dtype='object')


In [276]:
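# Option 1: merge only the columns needed for the aggregation
# users_and_behavior = pd.merge(users[["UserId", "minutes1", "minutes2", "minutes3"]], user_behavior[["UserId", "NumClicks", "MinutesListened"]], on=["UserId"])
# Option 2: merge the full tables on the shared key
users_and_behavior = users.merge(user_behavior, on="UserId")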
users_and_behavior.head()

Unnamed: 0,UserId,premium,minutes1,minutes2,minutes3,avg_minutes_listened,max_num_clicks,SongId,NumClicks,MinutesListened,Genre,Favorite,Modification,Artists,weekday,year,month,day
0,33616,True,1136,1193.316025,1250.632049,,,49432,75,287.101132,Other,0,2023-08-09,3742,Wednesday,2023,8,9
1,33616,True,1136,1193.316025,1250.632049,,,27893,91,331.458582,Hip-Hop,0,2023-04-11,5193,Tuesday,2023,4,11
2,33616,True,1136,1193.316025,1250.632049,,,82964,65,243.904879,Electronic,1,2023-10-21,6897,Saturday,2023,10,21
3,33616,True,1136,1193.316025,1250.632049,,,3891,94,391.607028,Other,1,2023-02-01,923,Wednesday,2023,2,1
4,33616,True,1136,1193.316025,1250.632049,,,9766,39,155.845557,Electronic,0,2023-05-06,3699,Saturday,2023,5,6
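Because `merge` performs an inner join by default, users without any behavior rows (and behavior rows without a matching user) silently disappear from the result. The sketch below uses two hypothetical toy tables to show one way to check this, with the `how`, `validate`, and `indicator` parameters of `DataFrame.merge`. The names `users_toy` and `behavior_toy` are illustrative only.

```python
import pandas as pd

# Toy tables (hypothetical): user 3 has no behavior, user 4 is unknown to users_toy.
users_toy = pd.DataFrame({"UserId": [1, 2, 3], "premium": [True, False, True]})
behavior_toy = pd.DataFrame({
    "UserId": [1, 1, 2, 4],
    "NumClicks": [10, 20, 5, 7],
})

merged = users_toy.merge(
    behavior_toy,
    on="UserId",
    how="outer",              # keep unmatched rows from both sides
    validate="one_to_many",   # raise if UserId is not unique in users_toy
    indicator=True,           # add a _merge column: both / left_only / right_only
)

print(merged["_merge"].value_counts())
```

Rows marked `left_only` or `right_only` are the ones an inner join would drop.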


In [275]:
# Per user: average minutes listened and highest click count on a single song
(
    users_and_behavior
    .groupby('UserId')
    .agg(MeanListen=('MinutesListened', 'mean'), MaxClick=('NumClicks', 'max'))
    .reset_index()
)

Unnamed: 0,UserId,MeanListen,MaxClick
0,27367,202.428192,95
1,27378,175.538977,97
2,27393,206.299595,98
3,27395,216.096020,91
4,27397,181.559834,99
...,...,...,...
507,37260,227.883949,99
508,37273,163.418818,98
509,37284,213.932807,96
510,37294,203.551624,98


**Task:** Merge the *user_behavior* and *artists* tables to determine the most clicked artist per genre (defined by the song).

In [280]:
user_behavior_and_artists = user_behavior.merge(artists, left_on="Artists", right_on="artist_id")
user_behavior_and_artists.head()

Unnamed: 0,UserId,SongId,NumClicks,MinutesListened,Genre,Favorite,Modification,Artists,weekday,year,month,day,artist_id,genre,featured,monthly_listeners
0,33653,36293,93,387.776814,Rock,0,2023-01-24,9319,Tuesday,2023,1,24,9319,Error,False,470286
1,31413,35986,93,358.914078,Other,0,2023-05-15,9319,Monday,2023,5,15,9319,Error,False,470286
2,29822,19424,51,220.712439,Hip-Hop,1,2023-10-19,9319,Thursday,2023,10,19,9319,Error,False,470286
3,36155,25334,34,151.326636,Rock,0,2023-05-09,9319,Tuesday,2023,5,9,9319,Error,False,470286
4,28163,51819,47,185.714636,Hip-Hop,0,2023-11-28,2159,Tuesday,2023,11,28,2159,Error,False,641610


In [None]:
# Distinguish the genre of the song from the genre of the artist
user_behavior_and_artists = user_behavior_and_artists.rename(columns={"Genre": "GenreSong", "genre": "GenreArtist"})

Question: Why can't we just use artists.merge(user_behavior)?

Answer: 
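One possible reading of the question, offered as an assumption rather than as the exercise sheet's official answer: the two tables share no column name. The key is `Artists` in one table and `artist_id` in the other, and `Genre` and `genre` differ in case, so a bare `merge` has nothing to join on. The toy tables below are hypothetical and only mirror those column names.

```python
import pandas as pd

# Toy tables (hypothetical) that mirror the column names of the two real tables.
artists_toy = pd.DataFrame({"artist_id": [1, 2], "genre": ["Rock", "Pop"]})
behavior_toy = pd.DataFrame({"Artists": [1, 1, 2], "Genre": ["Rock", "Pop", "Pop"]})

# Without explicit keys pandas looks for identically named columns and finds none.
try:
    artists_toy.merge(behavior_toy)
    bare_merge_failed = False
except pd.errors.MergeError as err:
    bare_merge_failed = True
    print("bare merge failed:", err)

# Naming the key on each side makes the merge work in either direction.
merged = artists_toy.merge(behavior_toy, left_on="artist_id", right_on="Artists")
print(merged)
```

With `left_on` and `right_on` given, starting from `artists` works as well; only the column order of the result changes.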

Which is the most clicked artist per genre of the song?
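A minimal sketch of one way to answer this, assuming that "most clicked" means the highest total of *NumClicks* per artist within each song genre. The toy frame is invented and uses the column name `Genre`; in the merged table the song genre may be called `GenreSong` after renaming.

```python
import pandas as pd

# Toy data (hypothetical): clicks per song, with the song's genre and artist.
toy = pd.DataFrame({
    "Genre": ["Rock", "Rock", "Rock", "Pop", "Pop"],
    "Artists": [662, 662, 9319, 4937, 8373],
    "NumClicks": [64, 10, 70, 63, 74],
})

# Step 1: total clicks per (genre, artist) pair.
clicks = toy.groupby(["Genre", "Artists"], as_index=False)["NumClicks"].sum()

# Step 2: within each genre, pick the row with the largest total.
top_idx = clicks.groupby("Genre")["NumClicks"].idxmax()
most_clicked = clicks.loc[top_idx].reset_index(drop=True)

print(most_clicked)
```

If the genre column is categorical, as in this notebook, passing `observed=True` to both `groupby` calls avoids empty groups for genre and artist combinations that never occur.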

**Task**: Determine, for each artist, the fan who spends the most minutes listening to their music.
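One possible approach, sketched on invented data: sum *MinutesListened* per artist and user, then keep the user with the largest total for each artist. The toy frame and the name `top_fans` are hypothetical; ties are resolved arbitrarily in this sketch.

```python
import pandas as pd

# Toy data (hypothetical): listening minutes per user and artist.
toy = pd.DataFrame({
    "UserId": [29158, 29158, 33692, 33692, 31198],
    "Artists": [662, 662, 662, 4937, 4937],
    "MinutesListened": [100.0, 50.0, 120.0, 30.0, 80.0],
})

# Total minutes each user spent on each artist.
minutes = toy.groupby(["Artists", "UserId"], as_index=False)["MinutesListened"].sum()

# Sort by minutes and keep the first (largest) row per artist.
top_fans = (
    minutes.sort_values("MinutesListened", ascending=False)
    .drop_duplicates("Artists")
    .sort_values("Artists")
    .reset_index(drop=True)
)

print(top_fans)
```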