### In this notebook we go through the process of training our model

We'll start by loading our data and encoding our categorical variables.  
During EDA we saw that the data contains many unique values, so binary encoding  
is a better fit than one-hot encoding, which would add one column per unique value.
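
To make the column-count argument concrete, here is a minimal hand-rolled sketch of the idea behind binary encoding. The `binary_encode` helper is hypothetical and written only for illustration; it is not the `category_encoders` implementation, and the exact column layout that library produces may differ.

```python
def binary_encode(categories):
    """Map each distinct category to a fixed-width list of bits (illustrative only)."""
    distinct = sorted(set(categories))
    # Number the categories from 1 so the all-zero pattern stays unused
    codes = {cat: i for i, cat in enumerate(distinct, start=1)}
    # Bits needed to write the largest code in base 2
    width = len(distinct).bit_length()
    return {cat: [int(b) for b in format(code, f"0{width}b")]
            for cat, code in codes.items()}


genres = ["acoustic pop", "dance pop", "neo mellow", "pop", "acoustic pop"]
encoded = binary_encode(genres)
for genre, bits in encoded.items():
    print(genre, bits)

# One-hot needs one column per distinct value; binary grows only logarithmically
wide = binary_encode([f"genre_{i}" for i in range(1000)])
print("one-hot columns:", 1000, "binary columns:", len(wide["genre_0"]))
```

With a thousand distinct values the sketch needs 10 columns where one-hot would need 1000, which is the trade-off motivating the choice above.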

In addition, after some experimentation I've concluded that the data in its  
current form (one row per artist and subgenre pair) is not beneficial. I'll have  
to transform it back to the form of Artist: text, Genres: text[], with one row per  
artist, a layout known as the transactional format.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import category_encoders as ce

ag_df = pd.read_csv('data/ag_data.csv')
ag_df.head()

Unnamed: 0,Artist,Subgenre
0,Jason Mraz,acoustic pop
1,Jason Mraz,dance pop
2,Jason Mraz,neo mellow
3,Jason Mraz,pop
4,The Paper Kites,acoustic pop


In [2]:
ag_df2 = ag_df[~(ag_df['Subgenre'].str.contains('sleep'))].copy()

transactional_data = ag_df2.groupby('Artist')['Subgenre'].agg(list).reset_index()
transactional_data.head()

Unnamed: 0,Artist,Subgenre
0,!!!,"[alternative dance, dance rock, dance-punk, el..."
1,!T.O.O.H.!,[technical grindcore]
2,"""14""",[swedish rock-and-roll]
3,"""Cats"" 1981 Original London Cast",[show tunes]
4,"""DEMONS""","[action rock, punk 'n' roll, swedish garage rock]"


In [3]:
transactional_data['Subgenre'] = transactional_data['Subgenre'].astype('string') 

In [4]:
transactional_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34962 entries, 0 to 34961
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Artist    34962 non-null  object
 1   Subgenre  34962 non-null  string
dtypes: object(1), string(1)
memory usage: 546.4+ KB


In [5]:
transactional_data.to_csv('data/transactional_data.csv',index=False)

Let's now use Binary Encoding on the subgenre column and create our training set

In [105]:
encoder = ce.BinaryEncoder(cols=['Subgenre'])

data_enc = encoder.fit_transform(transactional_data['Subgenre'])

data_enc.shape

(34962, 14)

Now that our data is preprocessed we can begin training our model.  
We're going to opt for K-Modes since we're dealing with categorical data.  
To choose the number of clusters we'll use the elbow method: fit the model  
for a range of cluster counts and look for the point where the cost stops dropping sharply.
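
As a rough illustration of why K-Modes suits categorical data, the sketch below shows its two building blocks in plain Python: a matching dissimilarity (the number of positions where two rows differ) and a cluster centre made of per-column modes. The helper names and the toy rows are assumptions for illustration only; the `kmodes` library has its own implementation.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Count the positions at which two categorical rows differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Return the most frequent value of each column (the cluster 'mode')."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*rows)]


# Toy cluster of three binary-encoded rows (made-up values)
rows = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
]

centre = column_modes(rows)
# The clustering cost is the summed dissimilarity of each row to its centre
cost = sum(matching_dissimilarity(row, centre) for row in rows)
print("centre:", centre, "cost:", cost)
```

Where K-Means averages coordinates and uses Euclidean distance, this variant counts mismatches and takes modes, so it never produces fractional "categories".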

In [6]:
from kmodes.kmodes import KModes
import matplotlib.pyplot as plt


cost = []
K = range(1, 200, 10)
for num_clusters in K:
    # init and n_init are KModes constructor arguments, so set them here
    kmode = KModes(n_clusters=num_clusters, init='Huang', n_init=5)
    kmode.fit(data_enc)
    cost.append(kmode.cost_)
    
plt.plot(K, cost, 'bx-')
plt.xlabel('No. of clusters')
plt.ylabel('Cost')
plt.title('Elbow Method For Optimal k')
plt.show()

KeyboardInterrupt: 
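
Once a cost curve is available, the elbow can also be located numerically instead of by eye. The sketch below uses one common heuristic: rescale both axes to [0, 1] and pick the point farthest below the straight line joining the first and last points. The `find_elbow` helper and the cost values are hypothetical and for illustration only; they are not results from this notebook.

```python
def find_elbow(ks, costs):
    """Return the k whose point lies farthest below the chord of the cost curve.

    Assumes ks is increasing and costs decrease from first to last.
    """
    # Rescale both axes to [0, 1] so neither dominates the comparison
    xs = [(k - ks[0]) / (ks[-1] - ks[0]) for k in ks]
    ys = [(c - costs[-1]) / (costs[0] - costs[-1]) for c in costs]
    # After rescaling the chord is the line x + y = 1, so the gap
    # 1 - x - y is proportional to the distance below the chord
    gaps = [1 - x - y for x, y in zip(xs, ys)]
    return ks[gaps.index(max(gaps))]


# Made-up costs, only to exercise the function
example_ks = [1, 11, 21, 31, 41, 51]
example_costs = [1000, 600, 400, 350, 320, 300]
print(find_elbow(example_ks, example_costs))
```

The heuristic is only a starting point: a flat or noisy curve may have no clear elbow, in which case inspecting the clusters themselves is more informative.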

We'll opt for 75 as our number of clusters

In [134]:
from kmodes.kmodes import KModes

km = KModes(n_clusters=75, init='Huang', n_init=5)
clusters = km.fit_predict(data_enc)

for i in range(75):
    artists = transactional_data[clusters == i]['Artist'].tolist()
    print(f"Cluster {i+1}: {artists}")

Cluster 1: ['AC Slater', 'AFI', 'AKB48', 'AKKI (DE)', 'ANGRA', 'ANIMAL HACK', 'Abandon All Ships', 'Abstract', 'Accelera Deck', 'Adam Johnson', 'Adema', "Adolescent's Orquesta", 'Adriana Proenza', 'Adson & Alana', 'AeLL.', 'Aeileen', 'Aer', 'Afsky', 'Age Of Love', 'Ago', 'Agonizer', 'Ahola', 'Airut', 'Alan & Alex', 'Alazka', 'Albert Cummings', 'Alberto Barros', 'Alberto Ruiz', 'Albin Myers', 'Alcatrazz', 'Alcohol Funnycar', 'Alcymar Monteiro', 'Aldair Playboy', 'Ale Mendoza', 'Alecia Nugent', 'Alegre All Stars', 'Alejandro González', 'Alejandro Sanz', 'Alejo', 'Aleks Schmidt', 'Alessandra Roncone', 'Alessandro Safina', 'Alessandro Scarlatti', 'Alessio Bax', 'Alesso', 'Alex & Konrado', 'Alex & Yvan', 'Alex Asli', 'Alex Bau', 'Alex Campos', 'Alex M.O.R.P.H.', 'Alex Newell', 'Alex Preston', 'Alex Rose', 'Alex Schulz', 'Alex Wright', 'Alexander Honky-Tonk Band', 'Alexander Kowalski', 'Alexandra Kay', 'Alexandra Prince', 'Alexandre Carlo', 'Alexio', 'Alexio Kawara', 'Alexis Roberts', 'Alexi

Let's now create a helper function that fetches artists corresponding to the genres of interest.  
It predicts the cluster for the given genre list and samples artists from that cluster.  
We'll save this as a utility function.

In [135]:
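def recommend_artists(genre_lst, model, encoder, data, rec_count):
    """Return up to rec_count random artists from the cluster predicted for genre_lst."""
    # Encode the genre list the same way the training data was encoded
    genre_df = pd.DataFrame({'Subgenre': [genre_lst]})
    genre_df['Subgenre'] = genre_df['Subgenre'].astype('string')
    genre_enc = encoder.transform(genre_df['Subgenre'])

    # predict returns one label per row; we passed a single row
    user_cluster = model.predict(genre_enc)[0]
    candidates = data.loc[model.labels_ == user_cluster, 'Artist'].to_numpy()

    # Sample without replacement so the same artist is not recommended twice
    n = min(rec_count, len(candidates))
    return np.random.choice(candidates, size=n, replace=False)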

In [149]:
print(recommend_artists(['EDM','Trance'],km,encoder,transactional_data,10))

['Marburg' 'Carvis Turney' 'Taarka' "'Til Tuesday" 'Emrah Turken'
 '10-FEET' 'Rose Royce' 'Angel Olsen' 'The Diamonds' 'Igneous Flame']


Let's save our model and encoder

In [150]:
import joblib

joblib.dump(value = [km,encoder], filename='models/model.pkl')

['models/model.pkl']
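
Because the model and encoder were dumped together as a list, loading the file returns them in the same order. The sketch below round-trips two stand-in objects through a temporary file to show the unpacking pattern; the stubs are hypothetical placeholders, and in practice the path would be `models/model.pkl`.

```python
import os
import tempfile

import joblib

# Stand-ins for the fitted model and encoder (hypothetical placeholders)
model_stub = {"n_clusters": 75}
encoder_stub = {"cols": ["Subgenre"]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    joblib.dump([model_stub, encoder_stub], path)
    # joblib.load returns the saved list, so unpack in the order it was saved
    loaded_model, loaded_encoder = joblib.load(path)

print(loaded_model, loaded_encoder)
```

Keeping both objects in one file helps ensure the encoder used at prediction time is the one the model was trained with.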