# Preparing the development of a music recommender system

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Data Cleaning

##### users.csv

**Task**: Import the *users.csv* file 

In [None]:
users = pd.read_csv('users.csv')
users

That did not work: the file is not comma-separated, so we need to use a different delimiter.

In [None]:
users = pd.read_csv('users.csv', delimiter=';')
users
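To see why the delimiter matters, the following minimal sketch parses a small, made-up semicolon-separated sample (not the real *users.csv*) with and without the correct separator. `delimiter` is an alias for `sep` in `pd.read_csv`.

```python
import io
import pandas as pd

# Hypothetical sample that only mimics the layout of a semicolon-separated file
raw = "uid;p;m1;m2;m3\n1;Yes;10.0;;12.0\n2;0;5.0;6.0;7.0\n"

# Default separator is ',', so every line ends up in a single column
wrong = pd.read_csv(io.StringIO(raw))

# With the matching separator the five columns are split correctly
right = pd.read_csv(io.StringIO(raw), sep=';')

print(wrong.shape)
print(right.shape)
```

Checking `shape` or `columns` right after the import is a quick way to spot a wrong delimiter.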

**Task**: Rename the columns according to the description in the exercise sheet into a more readable format.

In [None]:
users.rename(columns={
    'uid': 'UserId',
    'p': 'Premium',
    'm1': 'Minutes1',
    'm2': 'Minutes2',
    'm3': 'Minutes3' 
}, inplace=True)

users

**Task**: Unify the labels for the *Premium* attribute.

In [None]:
users['Premium'] = users['Premium'].map({'0': False, 
                                         '1': True,
                                         'Yes': True,
                                         'No': False},)

users
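Note that `Series.map` turns every value that is not a key of the dictionary into `NaN`. The sketch below, on made-up labels, normalizes the values to lowercase strings first and then lists whatever could not be mapped. The sample values and the names `premium_raw` and `label_map` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical raw labels; the real column may contain other spellings
premium_raw = pd.Series([0, '1', 'Yes', 'No', 'maybe'])

label_map = {'0': False, '1': True, 'yes': True, 'no': False}

# Convert to string, trim whitespace and lowercase before mapping
premium = premium_raw.astype(str).str.strip().str.lower().map(label_map)

# Values missing from the mapping become NaN, so inspect them explicitly
unmapped = premium_raw[premium.isna()]
print(unmapped.tolist())
```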

**Task**: Impute the missing values of the attribute *Minutes2* using the values of *Minutes1* and *Minutes3*.

In [None]:
plt.plot(users['Minutes3'] - users['Minutes1'], 'o')
plt.title('Difference between the minutes listened three months ago and last month')

We can see that the differences in listening time scatter fairly evenly around zero. Hence, we assume that listening patterns are consistent over short time periods, so we can impute the missing values with the average of the other two columns.

In [None]:
# This does NOT work as we would overwrite non-missing values:
# users['Minutes2'] = (users['Minutes1'] + users['Minutes3'])/2
# users

users['Minutes2'] = users['Minutes2'].fillna((users['Minutes1'] + users['Minutes3'])/2)
users
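The behavior of `fillna` with a Series argument can be checked on a small toy frame (values made up for illustration): only the missing entry is replaced, and existing values stay untouched.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value in Minutes2
toy = pd.DataFrame({
    'Minutes1': [10.0, 20.0, 30.0],
    'Minutes2': [12.0, np.nan, 28.0],
    'Minutes3': [14.0, 24.0, 26.0],
})

# Row-wise average of the two neighbouring columns
estimate = (toy['Minutes1'] + toy['Minutes3']) / 2

# fillna aligns on the index and only fills the NaN positions
toy['Minutes2'] = toy['Minutes2'].fillna(estimate)
print(toy)
```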

##### user_behavior.csv

**Task**: Read the *user_behavior.csv* file.

In [None]:
user_behavior = pd.read_csv("user_behavior.csv", delimiter=';')
user_behavior

**Task**: Rename the columns according to the description in the exercise sheet.

In [None]:
user_behavior.rename(columns={
    'user_id': 'UserId',
    'song_id': 'SongId',
    'num_clicks': 'NumClicks',
    'ml': 'MinutesListened',
    'g': 'Genre',
    'f': 'Favorite',
    'mod': 'ModifiedAt', 
    'artists': 'Artists'
}, inplace=True)

user_behavior

**Task:** Fix the data types of the attributes *Genre* (categorical) and *Favorite* (binary, categorical).

In [None]:
user_behavior['Genre'] = user_behavior['Genre'].astype('category')
user_behavior['Favorite'] = user_behavior['Favorite'].astype('bool')

In [None]:
user_behavior.info()

**Task:** Some genres have more songs than others. Adjust the data set such that it includes only the four largest genres and the genre "Other" that summarizes all remaining genres.

First, we are going to plot the number of songs per genre.

In [None]:
genre_counts = user_behavior['Genre'].value_counts()

plt.bar(genre_counts.index, genre_counts.values)
plt.xticks(rotation=45)
plt.title('Number of songs per genre')

We can see that four genres are streamed more often than the other genres. We might want to consider aggregating the remaining genres into a single group.

In [None]:
user_behavior['Genre'] = user_behavior['Genre'].map({
    'Electronic': 'Electronic',
    'Rock': 'Rock',
    'Hip-Hop': 'Hip-Hop',
    'Pop': 'Pop'
}).fillna('Other').astype('category')

Let's plot the updated genres again.

In [None]:
genre_counts = user_behavior['Genre'].value_counts()

plt.bar(genre_counts.index, genre_counts.values)
plt.xticks(rotation=45)
plt.title('Number of songs per genre')
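Instead of typing the four genre names by hand, the largest genres can also be derived from the counts. This is only a sketch on made-up data; it assumes a plain string column (a categorical column would first need `'Other'` added as a category or a conversion via `astype(str)`).

```python
import pandas as pd

# Hypothetical genre column, not the real data
genres = pd.Series(['Rock', 'Pop', 'Rock', 'Jazz', 'Pop', 'Folk', 'Rock',
                    'Hip-Hop', 'Hip-Hop', 'Electronic', 'Electronic'])

# Names of the four most frequent genres
top = genres.value_counts().nlargest(4).index

# Keep the top genres, replace everything else with 'Other'
grouped = genres.where(genres.isin(top), 'Other').astype('category')
print(grouped.value_counts())
```

With ties at the cut-off, `nlargest` keeps the first entries it encounters, so the selection should be checked against the plot.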

**Task:** Create new columns for the weekday, year, month, and day of each date in *ModifiedAt*.

In [None]:
user_behavior['ModifiedAt'] = user_behavior['ModifiedAt'].astype('datetime64[ns]')
user_behavior['Weekday'] = user_behavior['ModifiedAt'].dt.day_name()
user_behavior['Year'] = user_behavior['ModifiedAt'].dt.year
user_behavior['Month'] = user_behavior['ModifiedAt'].dt.month   
user_behavior['Day'] = user_behavior['ModifiedAt'].dt.day

user_behavior
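If the date column might contain malformed entries, `pd.to_datetime` with `errors='coerce'` is a possible alternative to `astype`: unparseable values become `NaT` instead of raising an error. The sample strings below are made up.

```python
import pandas as pd

# Hypothetical timestamps, including one invalid entry
stamps = pd.Series(['2023-01-02', '2023-12-31', 'not a date'])

# Invalid strings are converted to NaT rather than raising
parsed = pd.to_datetime(stamps, errors='coerce')

parts = pd.DataFrame({
    'Weekday': parsed.dt.day_name(),
    'Year': parsed.dt.year,
    'Month': parsed.dt.month,
    'Day': parsed.dt.day,
})
print(parts)
```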

##### artists.csv

**Task**: Read the *artists.csv* file and rename the columns according to the exercise sheet.

In [None]:
artists = pd.read_csv("artists.csv", sep=';')

artists.rename(columns={
    'artist_id': "ArtistId",
    'genre': "Genre",
    'featured': "Featured",
    'monthly_listeners': "MonthListeners"
}, inplace=True)

artists.info()

**Task:** Convert the attributes *Genre* and *Featured* to categorical variables.

In [None]:
artists["Genre"] = artists["Genre"].astype("category")
artists["Featured"] = artists["Featured"].astype("category")

In [None]:
artists

In [None]:
artists.info()

### Data aggregation

**Task:** Merge the *users* and *user_behavior* tables together. Create a view in which you determine how many minutes a user listens to songs on average. Additionally, what is the highest number of clicks a user had on a song?

In [None]:
users_with_behavior = users.merge(user_behavior)
users_with_behavior

In [None]:
users_with_behavior.groupby('UserId').agg(
    MeanListen=('MinutesListened', 'mean'),
    MaxClick=('NumClicks', 'max')
).reset_index()
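The merge above relies on pandas picking the shared column name as the join key. A more explicit variant, sketched here on toy frames with made-up values, names the key and uses the `validate` parameter of `merge` to check that each user appears only once in the left table.

```python
import pandas as pd

# Toy stand-ins for the users and user_behavior tables
users_toy = pd.DataFrame({'UserId': [1, 2], 'Premium': [True, False]})
behavior_toy = pd.DataFrame({
    'UserId': [1, 1, 2],
    'MinutesListened': [3.0, 5.0, 4.0],
    'NumClicks': [2, 9, 1],
})

# Explicit key; validate raises a MergeError if UserId is duplicated on the left
merged = users_toy.merge(behavior_toy, on='UserId', how='inner',
                         validate='one_to_many')

summary = merged.groupby('UserId').agg(
    MeanListen=('MinutesListened', 'mean'),
    MaxClick=('NumClicks', 'max'),
).reset_index()
print(summary)
```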

**Task:** Merge the *user_behavior* and *artists* tables to determine the most clicked artist per genre (defined by the song).

In [None]:
user_behavior

In [None]:
artists_with_behavior = artists.merge(user_behavior, left_on="ArtistId", right_on="Artists")
artists_with_behavior

Question: Why can't we just use artists.merge(user_behavior)?

Answer: Without an explicit join condition, the merge automatically joins both tables on their only shared column name, "Genre", which leads to a wrong result.
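The effect can be reproduced on two toy tables (made-up values). With the implicit merge, rows are paired by genre rather than by artist; with explicit keys, the shared "Genre" column is kept twice, with the suffix `_x` for the left table and `_y` for the right table.

```python
import pandas as pd

# Toy stand-ins for the artists and user_behavior tables
artists_toy = pd.DataFrame({'ArtistId': [1, 2], 'Genre': ['Rock', 'Pop']})
behavior_toy = pd.DataFrame({
    'Artists': [1, 2, 2],
    'Genre': ['Rock', 'Rock', 'Pop'],
    'NumClicks': [5, 3, 7],
})

# Implicit merge: joins on the shared column name 'Genre'
implicit = artists_toy.merge(behavior_toy)

# Explicit merge: joins on the artist identifiers
explicit = artists_toy.merge(behavior_toy, left_on='ArtistId', right_on='Artists')

# Rows where the implicit merge paired a song with the wrong artist
print(implicit[implicit['ArtistId'] != implicit['Artists']])
print(explicit.columns.tolist())
```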

Which artist is the most clicked per song genre?

In [None]:
# artists is the left table (suffix _x), user_behavior the right table (suffix _y)
artists_with_behavior = artists_with_behavior.rename(columns={
    "Genre_x": "GenreArtist",
    "Genre_y": "GenreSong"
})

artists_with_behavior.info()

In [None]:
group = artists_with_behavior.groupby(["GenreSong", "ArtistId"]).agg(NumClicks=('NumClicks', 'sum')).reset_index()
group

In [None]:
group[group['NumClicks'] == group.groupby('GenreSong')['NumClicks'].transform('max')]
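The `transform('max')` filter keeps every row that ties for the maximum. If exactly one row per group is wanted, `idxmax` is a possible alternative. The sketch below compares both on made-up click counts.

```python
import pandas as pd

# Toy aggregate with a tie in the 'Pop' group
clicks = pd.DataFrame({
    'GenreSong': ['Rock', 'Rock', 'Pop', 'Pop'],
    'ArtistId': [1, 2, 2, 3],
    'NumClicks': [8, 3, 7, 7],
})

# Variant 1: boolean mask against the per-group maximum (keeps ties)
is_max = clicks['NumClicks'] == clicks.groupby('GenreSong')['NumClicks'].transform('max')
with_ties = clicks[is_max]

# Variant 2: index label of the first maximum per group (one row per genre)
first_only = clicks.loc[clicks.groupby('GenreSong')['NumClicks'].idxmax()]

print(with_ties)
print(first_only)
```

Which variant is appropriate depends on whether tied artists should all be reported.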

**Task**: Determine, for each artist, the fan who spends the most minutes listening to their music.

In [None]:
users_with_behavior

In [None]:
data = users_with_behavior.merge(artists, left_on="Artists", right_on="ArtistId")
data

In [None]:
group = data.groupby(["ArtistId", "UserId"]).agg(MinutesListened=("MinutesListened", "sum")).reset_index()
group

In [None]:
group[group['MinutesListened'] == group.groupby('ArtistId')['MinutesListened'].transform('max')]