# Exploratory data analysis

<font color='red'>NOTE</font>
 
* Some visualization functions do not work on the full dataset because of its size; summarize the data first, then visualize the summary.
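
A minimal sketch of the summarize-then-plot approach from the note above. It runs on a randomly generated toy frame (an assumption, standing in for `train_df`) and uses the `context_type` and `is_listened` columns described in the data fields; the names `toy_df` and `summary` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df (assumption: same column names as the real data).
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "context_type": rng.integers(0, 5, size=10_000),
    "is_listened": rng.integers(0, 2, size=10_000),
})

# Collapse the row-level data to one row per category before plotting.
summary = (
    toy_df.groupby("context_type")["is_listened"]
    .agg(n_plays="size", listen_rate="mean")
    .reset_index()
)
print(summary)
```

The summary has one row per `context_type`, so it stays small no matter how many rows the input has and can be passed to a plotting call such as `px.bar(summary, x="context_type", y="listen_rate")`.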

### Data & variable descriptions (from [Kaggle](https://www.kaggle.com/competitions/dsg17-online-phase/data))

The goal of this challenge is to predict whether the users in the test dataset listened to the first track that Flow proposed to them. Deezer considers a track "listened" if the user has listened to more than 30 seconds of it (is_listened = 1). If the user presses the skip button to change the song before 30 seconds, the track is not considered listened (is_listened = 0).


- File descriptions

    - train.csv - the training set
    - test.csv - the test set
    - sample_submission_kaggle.csv - a sample submission file in the correct format
    - extra_infos.json - supplementary information about the songs
<br>
<br>
- Data fields

    - media_id - identifier of the song listened to by the user
    - album_id - identifier of the album of the song
    - media_duration - duration of the song
    - user_gender - gender of the user
    - user_id - anonymized id of the user
    - context_type - type of content where the song was listened to: playlist, album ...
    - release_date - release date of the song with the format YYYYMMDD
    - ts_listen - timestamp of the listening in UNIX time
    - platform_name - type of OS
    - platform_family - type of device
    - user_age - age of the user
    - listen_type - whether the song was listened to in a flow or not
    - artist_id - identifier of the artist of the song
    - genre_id - identifier of the genre of the song
    - is_listened - 1 if the track was listened, 0 otherwise
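
The two date-like fields above use different encodings (`release_date` as a YYYYMMDD integer, `ts_listen` as UNIX seconds), so a conversion step may help before comparing them. The sketch below uses two rows copied from the `train_df.head()` output; the derived column names (`release_dt`, `listen_dt`, `track_age_days`) are hypothetical.

```python
import pandas as pd

# Two rows taken from the head of the training data.
toy_df = pd.DataFrame({
    "release_date": [20040704, 20060301],
    "ts_listen": [1480597215, 1480544735],
})

# YYYYMMDD integers -> datetime; values that cannot be parsed become NaT.
toy_df["release_dt"] = pd.to_datetime(
    toy_df["release_date"].astype(str), format="%Y%m%d", errors="coerce"
)
# UNIX seconds -> datetime (timezone-naive, counted from the UTC epoch).
toy_df["listen_dt"] = pd.to_datetime(toy_df["ts_listen"], unit="s")

# Hypothetical derived feature: age of the track in days when it was played.
toy_df["track_age_days"] = (toy_df["listen_dt"] - toy_df["release_dt"]).dt.days
print(toy_df)
```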

In [71]:
import pandas as pd
import numpy as np
import time
import tensorflow as tf
import tensorflow_recommenders as tfrs
from numpy import count_nonzero

In [2]:
# have a look at the data

train_df = pd.read_csv('../data/train.csv')
test_df = pd.read_csv('../data/test.csv')
print(train_df.shape)
print(test_df.shape)
train_df.head()

(7558834, 15)
(19918, 15)


Unnamed: 0,genre_id,ts_listen,media_id,album_id,context_type,release_date,platform_name,platform_family,media_duration,listen_type,user_gender,user_id,artist_id,user_age,is_listened
0,25471,1480597215,222606,41774,12,20040704,1,0,223,0,0,9241,55164,29,0
1,25571,1480544735,250467,43941,0,20060301,2,1,171,0,0,16547,55830,30,1
2,16,1479563953,305197,48078,1,20140714,2,1,149,1,1,7665,2704,29,1
3,7,1480152098,900502,71521,0,20001030,0,0,240,0,1,1580,938,30,0
4,7,1478368974,542335,71718,0,20080215,0,0,150,0,1,1812,2939,24,1


In [3]:
# NA analysis
train_df.isnull().sum(axis=0)
# no NAs

genre_id           0
ts_listen          0
media_id           0
album_id           0
context_type       0
release_date       0
platform_name      0
platform_family    0
media_duration     0
listen_type        0
user_gender        0
user_id            0
artist_id          0
user_age           0
is_listened        0
dtype: int64

In [4]:
# number of unique values per variable

# train set
sum_train = train_df.nunique().to_frame(name='value').reset_index()
sum_train['dataset'] = 'train'

# test set
sum_test = test_df.nunique().to_frame(name='value').reset_index()
sum_test['dataset'] = 'test'

# concatenate
sum_all = pd.concat([sum_train, sum_test], axis=0)

In [5]:
sum_all

Unnamed: 0,index,value,dataset
0,genre_id,2922,train
1,ts_listen,2256230,train
2,media_id,452975,train
3,album_id,151471,train
4,context_type,74,train
5,release_date,8902,train
6,platform_name,3,train
7,platform_family,3,train
8,media_duration,1652,train
9,listen_type,2,train


In [6]:
import plotly.express as px

sum_all['log_value'] = np.log(sum_all['value'])

fig = px.bar(sum_all, x='index', y='log_value', color ='dataset',
    barmode='group', title="log-# unique values", template='plotly_dark')
fig.show()

In [7]:
# compare distributions of variables

# get only dichotomous variables:
dich_train_df = train_df.loc[:, train_df.nunique() < 3]
dich_train_df = dich_train_df.apply(pd.Series.value_counts).T
dich_train_df.reset_index(inplace=True)
dich_train_df['dataset'] = 'train'

dich_test_df = test_df.loc[:, test_df.nunique() < 3]
dich_test_df = dich_test_df.apply(pd.Series.value_counts).T
dich_test_df.reset_index(inplace=True)
dich_test_df['dataset'] = 'test'

dich_all = pd.concat([dich_train_df, dich_test_df], axis=0)
dich_all = pd.melt(dich_all, id_vars=['index', 'dataset'], value_vars=[0, 1], var_name='yes/no')
dich_all['log_value'] = np.log(dich_all['value'])
dich_all

Unnamed: 0,index,dataset,yes/no,value,log_value
0,listen_type,train,0,5239223,15.471684
1,user_gender,train,0,4583009,15.337866
2,is_listened,train,0,2388342,14.68611
3,listen_type,test,0,1,0.0
4,user_gender,test,0,11289,9.331584
5,listen_type,train,1,2319611,14.65691
6,user_gender,train,1,2975825,14.906032
7,is_listened,train,1,5170492,15.458478
8,listen_type,test,1,19917,9.899329
9,user_gender,test,1,8629,9.062884


In [8]:
fig = px.bar(dich_all, x='index', y='log_value', color ='yes/no', 
    facet_row = 'dataset', title="Balance of dichotomous variables between train and test data", 
    template='plotly_dark')
fig.show()

In [9]:
# check balance of dataset with respect to flow and listened to
pd.crosstab(train_df.is_listened, train_df.listen_type).apply(lambda r: r/len(train_df), axis=1)

listen_type,0,1
is_listened,Unnamed: 1_level_1,Unnamed: 2_level_1
0,0.193304,0.122663
1,0.499822,0.184211


The largest group (about 50% of plays) consists of tracks played outside a flow and listened to. Overall, about 68% of plays were listened to and 32% were not, and about 69% of plays were not presented in a flow.
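
The marginal shares quoted above can also be read directly from `pd.crosstab` with `normalize` and `margins`. The sketch below runs on a small toy frame (an assumption, not the real data). Dividing the cells of the table above by their column totals gives a listen rate of roughly 0.72 for `listen_type = 0` and 0.60 for `listen_type = 1`; `normalize="columns"` computes the same kind of conditional rate.

```python
import pandas as pd

# Toy stand-in for train_df with the two columns of interest.
toy_df = pd.DataFrame({
    "is_listened": [1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
    "listen_type": [0, 0, 0, 0, 1, 1, 0, 1, 1, 0],
})

# Joint shares over all rows, with row and column totals in the "All" margin.
shares = pd.crosstab(
    toy_df["is_listened"], toy_df["listen_type"],
    normalize="all", margins=True,
)
print(shares)

# Listen rate within each listen_type (each column sums to 1).
rate_by_type = pd.crosstab(
    toy_df["is_listened"], toy_df["listen_type"], normalize="columns"
)
print(rate_by_type)
```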

In [74]:
# check sparsity of item-user matrix

def get_user_item_matrix(df, show_head=False, get_info=True):
    """
    Get the dense media x user frequency matrix (rows: media_id,
    columns: user_id) and print info about its sparsity.

    The dense matrix has n_media * n_users cells, so this can
    exhaust memory on the full training set; try a subset first.
    """

    st = time.time()
    # count listens per (media_id, user_id) pair
    tab = df.groupby(['media_id', 'user_id']).size()
    # pivot users into columns; pairs that never occur become 0
    iu_mat = tab.unstack(fill_value=0)

    if show_head:
        print(iu_mat.head(5))

    # convert to ndarray & compute sparsity (share of zero cells)
    iu_mat = iu_mat.to_numpy()
    n_nonzero = count_nonzero(iu_mat)
    sparsity = 1.0 - (n_nonzero / float(iu_mat.size))
    et = time.time()

    if get_info:
        print(f'\n\nNon-zero values: {n_nonzero}'
              f'\nSize of matrix: {iu_mat.size}'
              f'\nMatrix sparsity: {sparsity}'
              f'\n\nExecution time: {round((et - st) / 60, 4)} min')
    return iu_mat

#get_user_item_matrix(train_df)
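
One possible way to get the same sparsity figure without building the dense matrix is a SciPy sparse matrix, sketched below on toy data. This is an alternative technique, not the function above; `sparse_item_user_matrix` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def sparse_item_user_matrix(df):
    """Hypothetical helper: media x user count matrix in CSR format."""
    # Map raw ids to consecutive row/column positions.
    media_codes, media_index = pd.factorize(df["media_id"])
    user_codes, user_index = pd.factorize(df["user_id"])
    # One entry per listen; converting COO to CSR sums repeated (row, col) pairs.
    counts = sparse.coo_matrix(
        (np.ones(len(df), dtype=np.int64), (media_codes, user_codes)),
        shape=(len(media_index), len(user_index)),
    ).tocsr()
    return counts, media_index, user_index


# Toy stand-in for train_df: 6 listens, 3 tracks, 3 users.
toy_df = pd.DataFrame({
    "media_id": [1, 1, 2, 3, 3, 3],
    "user_id": [10, 10, 11, 10, 12, 12],
})
counts, media_index, user_index = sparse_item_user_matrix(toy_df)

# Sparsity = share of (media, user) cells with no listen at all.
n_cells = counts.shape[0] * counts.shape[1]
sparsity = 1.0 - counts.nnz / n_cells
print(f"Non-zero values: {counts.nnz}, cells: {n_cells}, sparsity: {sparsity:.4f}")
```

Only the observed (media, user) pairs are stored, so memory grows with the number of distinct pairs rather than with the number of tracks times the number of users.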



In [52]:
# get distribution of variables of train and test set and compare!
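
A possible starting point for this comparison, sketched for a single categorical column: compute the relative frequency of each value in both sets and sort by the largest gap. The helper name `compare_shares` and the toy frames are hypothetical; for continuous columns a side-by-side `describe()` may be a better fit.

```python
import pandas as pd


def compare_shares(train, test, column):
    """Hypothetical helper: share of each value of `column` in train vs. test."""
    shares = pd.concat(
        {
            "train": train[column].value_counts(normalize=True),
            "test": test[column].value_counts(normalize=True),
        },
        axis=1,
    ).fillna(0.0)  # a value missing from one set has share 0 there
    shares["abs_diff"] = (shares["train"] - shares["test"]).abs()
    return shares.sort_values("abs_diff", ascending=False)


# Toy stand-ins for train_df and test_df.
toy_train = pd.DataFrame({"platform_name": [0, 0, 1, 2]})
toy_test = pd.DataFrame({"platform_name": [0, 1, 1, 1]})

result = compare_shares(toy_train, toy_test, "platform_name")
print(result)
```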

In [53]:
# check how much data we have per user in the training and test datasets!
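
One way this check could look, sketched on toy data: count the training rows per user, then attach that count to every test row so that users with little or no history stand out. The column name `n_train` and the toy frames are hypothetical.

```python
import pandas as pd

# Toy stand-ins for train_df and test_df.
toy_train = pd.DataFrame({"user_id": [1, 1, 1, 2, 2, 3]})
toy_test = pd.DataFrame({"user_id": [1, 2, 4]})

# Number of training rows per user.
train_counts = toy_train["user_id"].value_counts()

# Training rows available for each test user; users unseen in training get 0.
toy_test["n_train"] = toy_test["user_id"].map(train_counts).fillna(0).astype(int)

# Share of test rows whose user has no training history at all.
cold_start_share = (toy_test["n_train"] == 0).mean()
print(toy_test)
print(f"Share of test rows without training history: {cold_start_share:.2f}")
```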

In [54]:
# group data by customer and by item and get distributions etc.

In [86]:
# are the songs recommended by flow in the test dataset the next songs the people listened to? it is implied in the description, but to be sure:
# check this by verifying that, for each person, the unix timestamp of the song in the test dataset is bigger than all of that person's timestamps in the training dataset.
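
A minimal sketch of the timestamp check described above, on toy data. It compares each test row with the latest training timestamp of the same user and lists the rows that are not strictly later; the names `last_train_ts`, `check` and `violations` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for train_df and test_df.
toy_train = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "ts_listen": [100, 200, 150, 400],
})
toy_test = pd.DataFrame({
    "user_id": [1, 2, 3],
    "ts_listen": [250, 300, 500],
})

# Latest training timestamp per user.
last_train_ts = toy_train.groupby("user_id")["ts_listen"].max()

# Align it to the test rows; users without training history get NaN.
check = toy_test.assign(last_train_ts=toy_test["user_id"].map(last_train_ts))
has_history = check["last_train_ts"].notna()
check["is_after_train"] = check["ts_listen"] > check["last_train_ts"]

# Test rows that are not later than all of the user's training rows.
violations = check[has_history & ~check["is_after_train"]]
print(violations)
```

An empty `violations` frame on the real data would support the reading that each test row is the user's next listen; users without any training rows are left out of the check rather than counted as violations.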