In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

In this kernel, I explore the **genres** with respect to **listener behaviour** and co-occurrence. Are there pairs or groups of genres that are commonly listened to by the same users? We could use this type of user-driven genre similarity as a feature in recommender systems, in particular for cold-start cases. Say people who like *alternative rock* commonly also like *indie*: we can then recommend *indie* songs to new users who have only listened to a few *alternative rock* tracks so far.

Since we only have the genre IDs and don't know their "physical" meaning, we cannot use any higher-level background info (e.g. from music genre taxonomies). My idea for this kernel is to model genres in a **graph**. Each genre is a node, with a node score that counts how many users in the train set liked at least one song of this genre. Edges between nodes represent co-occurrences: the edge weight counts how many users in the train set liked (target = 1) both genres. For example, if the edge between alternative rock and indie has weight 500, there are 500 users who have at least one song with target = 1 tagged alternative rock and also at least one song with target = 1 tagged indie.

I first construct a similarity matrix from the training data, then convert it to a **networkx** graph and use a 2-D layout to visualise genre similarities and popularities.
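
To make the node scores and edge weights concrete, here is a minimal sketch on made-up data. The user names and genre labels are hypothetical and not taken from the dataset; only the counting logic follows the description above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: the set of genres each user liked (target = 1).
liked = {
    "user_a": {"alt_rock", "indie"},
    "user_b": {"alt_rock", "indie", "pop"},
    "user_c": {"pop"},
}

node_score = Counter()   # number of users who liked each genre
edge_weight = Counter()  # number of users who liked both genres of a pair

for user_genres in liked.values():
    node_score.update(user_genres)
    # sort so that each unordered pair is always stored under the same key
    for pair in combinations(sorted(user_genres), 2):
        edge_weight[pair] += 1

print(node_score["alt_rock"])              # 2
print(edge_weight[("alt_rock", "indie")])  # 2
print(edge_weight[("indie", "pop")])       # 1
```

Each user contributes at most 1 to a node score or an edge weight, no matter how many songs of a genre they liked.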

In [None]:
import networkx as nx
import itertools
import matplotlib.pyplot as plt # nx won't draw without this
%matplotlib notebook 

In [None]:
dataPath = '../input/' # set the data path

In [None]:
# read songs into pandas data frame
songs = pd.read_csv(dataPath + 'songs.csv')
songs = songs[['song_id','genre_ids']] # we don't need the rest

A song may have several genre tags, separated by "|", so we need to convert them to lists...

In [None]:
# convert genre ids to list
songs['genre_ids'] = songs['genre_ids'].map(lambda x: [int(y) for y in str(x).split('|')] if not pd.isnull(x) else [])

In [None]:
# get unique list of genre ids
genres = songs['genre_ids'].values.tolist()
genres = [j for i in genres for j in i]
genres = list(set(genres))

In [None]:
# number of unique genres
numGenres = len(genres)
print("There are %s unique genres." %numGenres)

In [None]:
# init genre similarity matrix S and genre score list
S = np.zeros((numGenres,numGenres))
scores = np.zeros((numGenres,))

In [None]:
# read user listening data
listen = pd.read_csv(dataPath + 'train.csv')

In [None]:
# we only consider songs with target 1
listen = listen[listen['target'] == 1]
listen = listen[['msno','song_id']] # we don't need the rest

In [None]:
# join the two datasets
songs.set_index('song_id', inplace=True)
df = listen.join(songs, how="left", on="song_id")
df.dropna(axis=0,inplace=True) # drop anything with missing data

Now we can construct the genre similarity matrix. For each user, we update genre scores (+1 if this user liked this genre) and the co-occurrence in S for each possible pair of the genres this user liked. This is a bit slow...

In [None]:
# map each genre id to its row/column in S once, instead of calling genres.index() repeatedly
genreIndex = {g: i for i, g in enumerate(genres)}

# group by user and process groups
for user, frame in df.groupby('msno'):
    # unique genres liked by this user (flatten the per-song lists)
    userGenres = {j for i in frame['genre_ids'] for j in i}

    for aGenre in userGenres: # increase genre score
        scores[genreIndex[aGenre]] += 1

    for a, b in itertools.combinations(userGenres, 2): # increase co-occurrence scores in matrix S
        m, n = genreIndex[a], genreIndex[b]
        S[m, n] += 1
        S[n, m] += 1
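
The same counts can also be written as a matrix product, which may be an option if the per-user loop is too slow. The sketch below is an assumption about how one might do this, not what the kernel runs: it uses a tiny hypothetical input and a dense user-by-genre indicator matrix `B`. For the full train set a dense `B` would probably be too large, and a sparse matrix would be the more realistic choice.

```python
import numpy as np

# Hypothetical toy input: one list of liked genre ids per user.
user_genres = [[465, 458], [465, 458, 451], [451]]
genre_ids = sorted({g for gs in user_genres for g in gs})
col = {g: i for i, g in enumerate(genre_ids)}

# B[u, g] = 1 if user u liked genre g at least once
B = np.zeros((len(user_genres), len(genre_ids)), dtype=int)
for u, gs in enumerate(user_genres):
    for g in gs:
        B[u, col[g]] = 1

C = B.T @ B                      # C[i, j] = users who liked both genre i and genre j
toy_scores = np.diag(C).copy()   # diagonal = users who liked each genre
toy_S = C - np.diag(toy_scores)  # zero the diagonal, like S above

print(toy_scores)
print(toy_S)
```

Because `B` only holds 0 and 1, every user is counted once per genre and once per genre pair, which is the same rule the loop applies.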

Now we create the graph using networkx.

In [None]:
G = nx.Graph()

In [None]:
for g in genres: # add nodes
    G.add_node(g)

In [None]:
for i,gI in enumerate(genres): # add edges
    for j,gJ in enumerate(genres):
        if gJ >= gI:
            continue
        if S[i][j] > 0:
            G.add_edge(gI,gJ,weight=S[i][j])

Many nodes have a very low or zero score (when the full graph is drawn, only the label appears, no red circle) and are not connected to any other node. These are genres that never or only rarely appear in the training data with target = 1. So let's keep only the nodes with a score above 1000 and see whether the graph looks better if we only consider very "popular" genres...

In [None]:
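# keep only nodes with score > 1000
nodeList = [x for i,x in enumerate(genres) if scores[i] > 1000]
G2 = G.subgraph(nodeList)
# look up the sizes in G2's own node order, because drawing pairs node_size with list(G2)
scoreByGenre = dict(zip(genres, scores))
nodeSizes = [0.1 * scoreByGenre[n] for n in G2.nodes()]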

We can now draw the graph using a spectral layout (I found this one works best). Nodes connected by strong edges should appear close to each other, and weakly connected nodes should be isolated.

In [None]:
# spectral_layout takes no seed, and newer networkx versions reject unknown keyword arguments such as random_state
nx.draw_spectral(G2, with_labels=True, node_size=nodeSizes, alpha=0.2, width=0.1) # draw the new graph
plt.show() # there are some warnings that seem to come from nx interacting with matplotlib

We can see a bunch of things (you will need to zoom around a bit):
* genre IDs 2107, 423 and 798 are very disconnected, meaning users who listen to them don't tend to listen to other genres
* 465 and 458 are two very popular genres which also co-occur a lot (I bet these are "rock" and "pop") and are strongly connected to other genres
* 451 is related to them but a bit less "popular" (is it a coincidence that they all start with a 4?)
* some more rather "disconnected" genres are 1180, 1572, 275, 1287 and 726
* inside the strongly connected area, there seem to be some subgroups, e.g. 451+465+458, 2022+1259, 444+437+1609+139 etc.

It would be interesting to see if this info helps in a recommendation system for the cold-start cases.
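
As a rough idea of how that could look, here is a minimal sketch that ranks the genres co-occurring most often with a seed genre. The function `related_genres` and the edge weights are hypothetical, made up for illustration; they are not results from this kernel.

```python
def related_genres(edge_weights, seed_genre, top_n=3):
    """Rank the genres that co-occur most often with seed_genre."""
    neighbours = {}
    for (a, b), w in edge_weights.items():
        if a == seed_genre:
            neighbours[b] = w
        elif b == seed_genre:
            neighbours[a] = w
    ranked = sorted(neighbours.items(), key=lambda kv: kv[1], reverse=True)
    return [genre for genre, _ in ranked[:top_n]]

# Hypothetical edge weights (number of users who liked both genres).
toy_edges = {(465, 458): 900, (465, 451): 400, (458, 451): 350, (465, 2022): 20}
print(related_genres(toy_edges, 465, top_n=2))  # [458, 451]
```

With the networkx graph, the same weights are available through `G2[genre]`, which maps each neighbour to its edge attributes. Raw counts favour popular genres, so some normalisation by node score would probably be needed before using this as a feature.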

This is my first ever Kaggle kernel and I am grateful for any feedback.