# **What do I intend to do?**

In this kernel, my sole intention is to merge the existing files and build a feature-engineered dataframe that can be fed directly into a model.

By this I mean train and test dataframes that contain the same columns, with all features encoded.

At a glance I will be looking into the following issues:
* Merging Dataframes          -> (*completed*)
* Handling Missing Values   -> (*completed*)
    * Interesting find: the number of unique values in column '**source_screen_name**' differs between the train and test sets.
    * Columns like '**genre_ids**' can hold a combination of several genres, which must be handled appropriately.
* Feature Engineering          -> (*in progress*)
    * number of genres per song
    * number of lyricists per song
    * number of composers per song
    * whether song has featured artists
    * number of artists per song
    * whether artist and composer are the same
    * whether artist, composer and lyricist are all the same
    
## (thinking of more features)

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from matplotlib import pyplot as plt
import seaborn as sns
import missingno as msno
import re
import math
from collections import Counter

from subprocess import check_output

df_train = pd.read_csv('../input/train.csv')
df_test = pd.read_csv('../input/test.csv')
df_songs = pd.read_csv('../input/songs.csv')
df_members = pd.read_csv('../input/members.csv')

# Merging Dataframes

First, we will merge the train and test data with the members and songs data. We can keep the merged data and delete the independent dataframes to save memory in this kernel.

In [None]:
#--- Merging dataframes ---
df_train_members = pd.merge(df_train, df_members, on='msno', how='inner')
df_train_merged = pd.merge(df_train_members, df_songs, on='song_id', how='outer')

df_test_members = pd.merge(df_test, df_members, on='msno', how='inner')
df_test_merged = pd.merge(df_test_members, df_songs, on='song_id', how='outer')

#--- delete unwanted dataframe ---
del df_train_members
del df_test_members

del df_songs
del df_members
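
As a side note, here is a minimal sketch on toy data of how the join type affects the row count. The frames `toy_train` and `toy_songs` and their values are made up for illustration; only the key name `song_id` and the `msno` column mirror the real files. A left join keeps exactly one row per train row, whereas an outer join also keeps songs that nobody listened to.

```python
import pandas as pd

# Toy stand-ins for the real files (values are invented for illustration).
toy_train = pd.DataFrame({'msno': ['u1', 'u2', 'u3'],
                          'song_id': ['s1', 's2', 's1'],
                          'target': [1, 0, 1]})
toy_songs = pd.DataFrame({'song_id': ['s1', 's2', 's9'],
                          'song_length': [200000, 180000, 240000]})

# how='outer' also keeps song s9, which no user played, so its msno is NaN.
outer = pd.merge(toy_train, toy_songs, on='song_id', how='outer')

# how='left' keeps only the rows of toy_train; validate='m:1' raises an
# error if song_id is duplicated in the songs table.
left = pd.merge(toy_train, toy_songs, on='song_id', how='left', validate='m:1')

print(len(toy_train), len(outer), len(left))
```

With a left join the merged frame has the same length as the original, so no rows would need to be dropped afterwards.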

## Dropping rows with missing **msno** values
Comparing the number of rows in the original train data and the merged train data shows that they are **not** the same.

In [None]:
print(len(df_train))
print(len(df_train_merged))

The same case goes for the test dataframes as well.

In [None]:
print(len(df_test))
print(len(df_test_merged))

This is because the outer merge with the songs data also adds a row for every song that never appears in the train or test data. Those rows have a missing **msno** and must be dropped:

In [None]:
df_train_merged = df_train_merged[pd.notnull(df_train_merged['msno'])]
df_test_merged = df_test_merged[pd.notnull(df_test_merged['msno'])]

Now let's check the length again:

In [None]:
print(len(df_train))
print(len(df_train_merged))

print(len(df_test))
print(len(df_test_merged))

## Saving **target** and **id** columns separately
Why? Only then can both dataframes be concatenated to perform encoding!

This is important because, when train and test data are encoded separately, the same category may well be encoded differently in each.

Consider a column with three distinct categories (A, B, C) that must be encoded in both the train and test data. Encoding converts them to numerical form (1, 2, 3).

In the train data, if 'A' is encountered first it is assigned 1. If 'B' is encountered next it is assigned 2, and likewise 'C' is assigned 3. But in the test data, if 'B' is encountered first, it is assigned 1 there.

So, to avoid **misinterpretation** of the original information, it is a good habit to concatenate the train and test data, using a separate column to distinguish the two.
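
The A/B/C case can be reproduced in a few lines. This is only a sketch: `encode_by_first_appearance` is a hypothetical helper written for this illustration, not a library function, and real encoders may assign codes in a different order (for example alphabetically).

```python
def encode_by_first_appearance(values):
    """Assign 1, 2, 3, ... to categories in the order they are first seen."""
    mapping = {}
    codes = []
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping) + 1
        codes.append(mapping[v])
    return codes, mapping

train_col = ['A', 'B', 'C', 'A']
test_col = ['B', 'A', 'C']

# Encoded separately: 'B' is 2 in train but 1 in test.
_, train_map = encode_by_first_appearance(train_col)
_, test_map = encode_by_first_appearance(test_col)
print(train_map)
print(test_map)

# Encoded jointly: one mapping is shared, then the codes are split again.
joint_codes, joint_map = encode_by_first_appearance(train_col + test_col)
train_codes = joint_codes[:len(train_col)]
test_codes = joint_codes[len(train_col):]
print(train_codes, test_codes)
```

With the joint mapping, 'B' gets the same code in both parts.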

In [None]:
df_test_merged.columns

In [None]:
df_train_merged.columns

Saving the **id** column from test data and **target** column from train data separately; and deleting those respective columns from the dataframes.

In [None]:
#--- before that save unique columns in train and test set separately ---
df_train_target = df_train_merged['target'].astype(np.int8)
df_test_id = df_test_merged['id']

#--- now dropping those columns from respective dfs ---
df_train_merged.drop('target', axis=1, inplace=True)
df_test_merged.drop('id', axis=1, inplace=True)

Appending another column **is_train** to distinguish between train and test data:

In [None]:
df_train_merged['is_train'] = 1
df_test_merged['is_train'] = 0
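
Here is a rough sketch of how the **is_train** flag could be used later: concatenate, encode once, then split again. The toy frames and the choice of `astype('category').cat.codes` as the encoder are assumptions made for this illustration; the kernel has not settled on an encoding method yet.

```python
import pandas as pd

# Toy frames; the scalar is_train value is broadcast to every row.
toy_train = pd.DataFrame({'source_type': ['radio', 'album'], 'is_train': 1})
toy_test = pd.DataFrame({'source_type': ['album', 'radio'], 'is_train': 0})

combined = pd.concat([toy_train, toy_test], ignore_index=True)

# One shared category -> code mapping for both sets.
combined['source_type_code'] = combined['source_type'].astype('category').cat.codes

# Split back using the flag, then drop it.
train_part = combined[combined['is_train'] == 1].drop(columns='is_train')
test_part = combined[combined['is_train'] == 0].drop(columns='is_train')

print(train_part)
print(test_part)
```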

# Handling Missing Values
Handling missing values carefully usually helps the predictions of a model.

First we will see which columns have missing values. Later on we will see how to impute them.

In [None]:
cols_missing_val_train = df_train_merged.columns[df_train_merged.isnull().any()].tolist()
print(cols_missing_val_train)

cols_missing_val_test = df_test_merged.columns[df_test_merged.isnull().any()].tolist()
print(cols_missing_val_test)

We see that the same columns have missing values in both the train and test datasets.

## Visualizations

In [None]:
msno.bar(df_train_merged[cols_missing_val_train],figsize=(20,8),color="#32885e",fontsize=18,labels=True,)

msno.bar(df_test_merged[cols_missing_val_test],figsize=(20,8),color="#32885e",fontsize=18,labels=True,)

In [None]:
msno.matrix(df_train_merged[cols_missing_val_train],width_ratios=(10,1),\
            figsize=(20,8),color=(0.2,0.2,0.2),fontsize=18,sparkline=True,labels=True)

msno.matrix(df_test_merged[cols_missing_val_test],width_ratios=(10,1),\
            figsize=(20,8),color=(0.2,0.2,0.2),fontsize=18,sparkline=True,labels=True)

Columns **gender**, **composer** and **lyricist** have a high number of missing values.

## Imputing Missing Values

To impute missing values sensibly, we first need to know which other values occur in each column.

We will go through the columns one at a time, imputing the train and test data together for each.

### **source_system_tab**

In [None]:
print(df_train_merged.source_system_tab.nunique())
print(df_test_merged.source_system_tab.nunique())

print(df_train_merged.source_system_tab.unique())
print(df_test_merged.source_system_tab.unique())

The unique values and their count are the same across train and test data. Hence we will impute the missing value with a common string:

In [None]:
df_train_merged.source_system_tab = df_train_merged.source_system_tab.fillna('others')
df_test_merged.source_system_tab = df_test_merged.source_system_tab.fillna('others')

### **source_screen_name**

In [None]:
print(df_train_merged.source_screen_name.nunique())
print(df_test_merged.source_screen_name.nunique())

print(df_train_merged.source_screen_name.unique())
print(df_test_merged.source_screen_name.unique())

Interesting! We can see a change in the number of unique values in this column.

In [None]:
source_screen_name_uniq_train = list(df_train_merged.source_screen_name.unique())
source_screen_name_uniq_test = list(df_test_merged.source_screen_name.unique())

In [None]:
#--- common values ---
print(set(source_screen_name_uniq_train) & set(source_screen_name_uniq_test))

#--- different values ---
print(set(source_screen_name_uniq_train) ^ set(source_screen_name_uniq_test))

The values **People local** and **People global** are new in test set.

The missing values in train and test set can be replaced with **other_sources**. The values of **People local** and **People global** in test data can also be replaced with this because there is not a single occurrence in the train data.

In [None]:
df_train_merged.source_screen_name = df_train_merged.source_screen_name.fillna('other_sources')
df_test_merged.source_screen_name = df_test_merged.source_screen_name.fillna('other_sources')

df_test_merged['source_screen_name'] = df_test_merged['source_screen_name'].replace(['People local', 'People global'], 'other_sources')
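
The same idea can be written without hard-coding the two names. This is a sketch on toy values (the series below are invented; only `other_sources`, `People local` and `People global` come from the cells above): any test value not seen in train is mapped to the fallback label.

```python
import pandas as pd

train_vals = pd.Series(['Explore', 'Radio', 'other_sources'])
test_vals = pd.Series(['Radio', 'People local', 'People global', 'Explore'])

seen_in_train = set(train_vals.unique())

# Keep values known from train; replace everything else with the fallback.
test_clean = test_vals.where(test_vals.isin(seen_in_train), 'other_sources')
print(test_clean.tolist())
```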

### **source_type**

In [None]:
print(df_train_merged.source_type.nunique())
print(df_test_merged.source_type.nunique())

print(df_train_merged.source_type.unique())
print(df_test_merged.source_type.unique())

In [None]:
#--- Check whether odd elements are present ---
print(set(list(df_train_merged.source_type.unique())) ^ set(list(df_test_merged.source_type.unique())))

We don't have any. So we can simply impute missing values with a common string.

In [None]:
df_train_merged.source_type = df_train_merged.source_type.fillna('other_types')
df_test_merged.source_type = df_test_merged.source_type.fillna('other_types')

### **gender**

In [None]:
print(df_train_merged.gender.unique())
print(df_test_merged.gender.unique())

Imputing missing value with common string:

In [None]:
df_test_merged.gender = df_test_merged.gender.fillna('unknown')
df_train_merged.gender = df_train_merged.gender.fillna('unknown')

Alternatively, the missing values in this column could be filled by choosing randomly between **male** and **female**.

### **song_length**
Imputing missing values with the **mean** of existing values

In [None]:
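#--- assign the result back instead of calling fillna(inplace=True) on a selected column ---
df_train_merged['song_length'] = df_train_merged['song_length'].fillna(df_train_merged['song_length'].mean())
df_test_merged['song_length'] = df_test_merged['song_length'].fillna(df_test_merged['song_length'].mean())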
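
The cell above fills each set with its own mean. One variant worth considering, sketched here on toy numbers, is to compute the mean on the train set only and reuse it for the test set, so that both sets are filled with the same value. Whether this helps on this data is an open question; the frames below are invented for illustration.

```python
import numpy as np
import pandas as pd

toy_train = pd.DataFrame({'song_length': [200000.0, np.nan, 240000.0]})
toy_test = pd.DataFrame({'song_length': [np.nan, 180000.0]})

# mean() skips NaN by default.
train_mean = toy_train['song_length'].mean()

toy_train['song_length'] = toy_train['song_length'].fillna(train_mean)
toy_test['song_length'] = toy_test['song_length'].fillna(train_mean)

print(train_mean)
print(toy_test['song_length'].tolist())
```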

### **language**

In [None]:
print(df_train_merged.language.nunique())
print(df_test_merged.language.nunique())

print(df_train_merged.language.unique())
print(df_test_merged.language.unique())

Imputing NaN with the value 0:

In [None]:
df_train_merged.language = df_train_merged.language.fillna(0)
df_test_merged.language = df_test_merged.language.fillna(0)

### **genre_ids**

In [None]:
print(df_train_merged.genre_ids.nunique())
print(df_test_merged.genre_ids.nunique())

In [None]:
print(df_train_merged.genre_ids.unique())

### Inferences:
* Looking at the values in the **genre_ids** column, we see a mix of single and combined genres. We need to know how many **unique individual** genres are present.
* In rows with more than one genre, the genres are separated by `|`. The following code snippet obtains the unique individual genre_ids.
* We can also create a new column holding the number of genre_ids assigned to each song.

In [None]:
df_train_merged['genre_ids']

#--- quick check: str.count returns the number of occurrences of a substring ---
j = 'Mary has a lamb'
j.count('a')

In [None]:
print(len(df_train_merged.genre_ids.unique()))
print(len(df_train_merged.genre_ids))

#--- List containing unique genre_ids from column inclusive of combinations ---
genre = df_train_merged.genre_ids.unique().tolist()

#--- List containing unique individual genre_ids ---
genre_new = []

for i in range(len(genre)):
    if (type(genre[i]) == str):      #--- to avoid the nan type---
        lw = genre[i].split('|')
        #lw = re.findall(r"[^|]+", genre[i])
        for j in range(len(lw)):
            genre_new.append(lw[j])
            
print(len(genre_new))
print(len(set(genre_new)))
 
letter_counts = Counter(genre_new)
dfoo = pd.DataFrame.from_dict(letter_counts, orient='index')
dfoo.plot(kind='bar', figsize=(30,15), title = 'Distribution of frequency of genre_ids')

Genre_ids that occur in 10 or more of the unique **genre_ids** values:

In [None]:
[k for k, v in letter_counts.items() if v >= 10]

We see something strange here:
* The number of unique values differs considerably between the train and test data.
* The genre_ids appear to be mostly a blend of two or more genres.

We have our task cut out here. A value like '465|2213|2215' should not be treated as a separate genre but as a combination of '465', '2213' and '2215'.

Before proceeding to impute missing values, we need to know how many such **individual** genre_ids are present.

In [None]:
#--- dtype is object: the values in this column are strings (or NaN) ---
df_train_merged.genre_ids.dtype

In [None]:
genre = df_train_merged.genre_ids.unique().tolist()

genre_new = []
for i in range(len(genre)):
    if (type(genre[i]) == str):
        lw = genre[i].split('|')
        #lw = re.findall(r"[^|]+", genre[i])
        for j in range(len(lw)):
            genre_new.append(lw[j])

#--- genre still holds the raw column values, i.e. combinations such as '465|2213' (and NaN) ---
print('Number of unique genre_ids values (combinations) in train set: ', len(genre))

genre_new = set(genre_new)
print('Number of unique genre ids after splitting them individually: ', len(genre_new))

#--- ids found in both sets occur at least once as a single-genre value ---
print('Genre ids that also occur on their own: ', len(set(genre) & genre_new))

#--- the remaining ids occur only inside combinations ---
print('Genre ids that occur only in combination with other genres: ', len(genre_new - set(genre)))
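
Since the same split-and-collect loop is needed for several columns, it could be wrapped in a small helper. This is a sketch: `split_unique` is a hypothetical name, and the sample values are made up apart from the ids quoted earlier in this section.

```python
def split_unique(values, sep='|'):
    """Return the set of individual tokens from strings like '465|2213'."""
    return {token
            for value in values
            if isinstance(value, str)      # skips NaN, which is a float
            for token in value.split(sep)}

raw_values = ['465', '465|2213|2215', '2213|2215', float('nan')]
individual = split_unique(raw_values)

standalone = individual & set(raw_values)   # ids that also occur on their own
combo_only = individual - standalone        # ids seen only inside combinations

print(sorted(individual))
print(sorted(standalone))
print(sorted(combo_only))
```

Note that the intersection with the raw values gives the ids that occur as a single-genre value, while the difference gives the ids that only ever occur inside a combination.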

Performing the same for the test data as well:

In [None]:
genre_test = df_test_merged.genre_ids.unique().tolist()

genre_test_new = []
for i in range(len(genre_test)):
    if (type(genre_test[i]) == str):
        lw = genre_test[i].split('|')
        #lw = re.findall(r"[^|]+", genre[i])
        for j in range(len(lw)):
            genre_test_new.append(lw[j])

#--- genre_test still holds the raw column values, i.e. combinations (and NaN) ---
print('Number of unique genre_ids values (combinations) in test set: ', len(genre_test))

genre_test_new = set(genre_test_new)
print('Number of unique genre ids after splitting them individually: ', len(genre_test_new))

#--- ids found in both sets occur at least once as a single-genre value ---
print('Genre ids that also occur on their own: ', len(set(genre_test) & genre_test_new))

#--- the remaining ids occur only inside combinations ---
print('Genre ids that occur only in combination with other genres: ', len(genre_test_new - set(genre_test)))

In [None]:
#--- combination of genre_ids in train and test ---
print('Genre_ids combinations present in both train and test set: ', len(set(genre_test) & set(genre)))

print('Genre_ids combinations present in test but not in train set: ', len(set(genre_test) - (set(genre_test) & set(genre))))

In [None]:
#--- Intersection between unique genre_ids in train and test set ---
print('Genre_ids present in both train and test set: ', len(set(genre_test_new) & set(genre_new)))

print('Genre_ids present in only one of train or test set: ', len(set(genre_test_new) ^ set(genre_new)))

print('Total number of unique genre_ids across train and test set: ', len(set(genre_test_new) | set(genre_new)))

Now that we have the count of all possible genre_ids, we can go ahead with imputing missing values:
* We can impute a missing value from another row with the same **song_id**, if such a row has the value.
* If no such rows exist, we can fill all missing values with a new common value.

The following code snippet checks whether any song_id with a missing genre_ids also appears elsewhere with a known genre_ids.

In [None]:
print('rows without Nan values:', df_train_merged.genre_ids.count())      
print('rows with Nan values: ', len(df_train_merged) - df_train_merged.genre_ids.count() )    

#--- .ix has been removed from pandas; plain boolean masks on the de-duplicated frame select the same rows ---
genre = df_train_merged[['song_id', 'genre_ids']].drop_duplicates()   #--- unique (song_id, genre_ids) pairs ---
genre_wo_nan = genre[genre['genre_ids'].notnull()]                    #--- pairs where genre_ids is present ---
genre_w_nan = genre[genre['genre_ids'].isnull()]                      #--- pairs where genre_ids is missing ---

#--- an empty array means no song_id has both a missing and a known genre_ids ---
print('song_ids with both missing and known genre_ids: ', np.intersect1d(genre_wo_nan['song_id'], genre_w_nan['song_id']))
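
The same check can also be phrased per song with a groupby. This is a sketch on invented toy rows. On the real data an empty result is what I would expect if each song_id appears only once in the songs file, which is an assumption I have not verified here.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'song_id': ['s1', 's1', 's2', 's3'],
                    'genre_ids': ['465', np.nan, np.nan, '2022']})

# For every song: is genre_ids missing in some rows, and in all rows?
is_missing = toy['genre_ids'].isnull()
grouped = is_missing.groupby(toy['song_id'])
some_missing = grouped.any()
all_missing = grouped.all()

# A value can be borrowed only when some, but not all, rows are missing.
recoverable = some_missing[some_missing & ~all_missing].index.tolist()
print(recoverable)
```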

Working out the same for the test set as well: (commented because it has no intersections and to save kernel run time)

In [None]:
''' 
print('rows without Nan values:', df_test_merged.genre_ids.count())      
print('rows with Nan values: ', len(df_test_merged) - df_test_merged.genre_ids.count() )    

genre = df_test_merged[['song_id', 'genre_ids']].drop_duplicates()   #--- unique (song_id, genre_ids) pairs ---
genre_wo_nan = genre[genre['genre_ids'].notnull()]                   #--- pairs where genre_ids is present ---
genre_w_nan = genre[genre['genre_ids'].isnull()]                     #--- pairs where genre_ids is missing ---

#--- an empty array means no song_id has both a missing and a known genre_ids ---
print('song_ids with both missing and known genre_ids: ', np.intersect1d(genre_wo_nan['song_id'], genre_w_nan['song_id']))

'''

The missing values in **genre_ids** column of both train and test set can be imputed using a common string:

In [None]:
df_train_merged.genre_ids = df_train_merged.genre_ids.fillna('no_genre_id')
df_test_merged.genre_ids = df_test_merged.genre_ids.fillna('no_genre_id')

### **composer**

In [None]:
print(df_train_merged.composer.nunique())
print(df_test_merged.composer.nunique())

In [None]:
print(df_train_merged.composer.unique())

We can see something similar to what we just saw with column **genre_ids**.

Before proceeding to impute missing values, we need to know how many such **individual** composers are present.

In [None]:
composer = df_train_merged.composer.unique().tolist()

composer_new = []
for i in range(len(composer)):
    if (type(composer[i]) == str):
        lw = composer[i].split('|')
        #lw = re.findall(r"[^|]+", genre[i])
        for j in range(len(lw)):
            composer_new.append(lw[j])

#--- composer still holds the raw column values, i.e. combinations (and NaN) ---
print('Number of unique composer values (combinations) in train set: ', len(composer))

composer_new = set(composer_new)
print('Number of unique composers after splitting them individually: ', len(composer_new))

#--- names found in both sets occur at least once as a single-composer value ---
print('composers that also occur on their own: ', len(set(composer) & composer_new))

#--- the remaining names occur only inside combinations ---
print('composers that occur only in combination with other composers: ', len(composer_new - set(composer)))

For test data:

In [None]:
composer_test = df_test_merged.composer.unique().tolist()

composer_test_new = []
for i in range(len(composer_test)):
    if (type(composer_test[i]) == str):
        lw = composer_test[i].split('|')
        #lw = re.findall(r"[^|]+", genre[i])
        for j in range(len(lw)):
            composer_test_new.append(lw[j])

#--- composer_test still holds the raw column values, i.e. combinations (and NaN) ---
print('Number of unique composer values (combinations) in test set: ', len(composer_test))

composer_test_new = set(composer_test_new)
print('Number of unique composers after splitting them individually: ', len(composer_test_new))

#--- names found in both sets occur at least once as a single-composer value ---
print('composers that also occur on their own: ', len(set(composer_test) & composer_test_new))

#--- the remaining names occur only inside combinations ---
print('composers that occur only in combination with other composers: ', len(composer_test_new - set(composer_test)))

In [None]:
#--- Intersection between unique composers in train and test set ---
print('composers present in both train and test set: ', len(set(composer_test_new) & set(composer_new)))

print('composers present in only one of train or test set: ', len(set(composer_test_new) ^ set(composer_new)))

print('Total number of unique composers across train and test set: ', len(set(composer_test_new) | set(composer_new)))

Since we have the count of composers let us fill the missing values:

In [None]:
print('rows without Nan values:', df_train_merged.composer.count())      
print('rows with Nan values: ', len(df_train_merged) - df_train_merged.composer.count() )    

#--- .ix has been removed from pandas; plain boolean masks on the de-duplicated frame select the same rows ---
composer = df_train_merged[['song_id', 'composer']].drop_duplicates()   #--- unique (song_id, composer) pairs ---
composer_wo_nan = composer[composer['composer'].notnull()]              #--- pairs where composer is present ---
composer_w_nan = composer[composer['composer'].isnull()]                #--- pairs where composer is missing ---

#--- an empty array means no song_id has both a missing and a known composer ---
print('song_ids with both missing and known composer: ', np.intersect1d(composer_wo_nan['song_id'], composer_w_nan['song_id']))

Here also we do not have any intersections, hence we can impute with a common string:

In [None]:
df_train_merged.composer = df_train_merged.composer.fillna('no_composer')
df_test_merged.composer = df_test_merged.composer.fillna('no_composer')

### **artist_name**

In [None]:
print(df_train_merged.artist_name.nunique())
print(df_test_merged.artist_name.nunique())

There are no intersections present here either, hence we will impute missing values with a common string.

In [None]:
df_train_merged.artist_name = df_train_merged.artist_name.fillna('no_artist')
df_test_merged.artist_name = df_test_merged.artist_name.fillna('no_artist')

### **lyricist**

In [None]:
print(df_train_merged.lyricist.nunique())
print(df_test_merged.lyricist.nunique())

In [None]:
print(df_train_merged.lyricist.unique())

This column also contains entries with multiple lyricists.

In [None]:
lyricist = df_train_merged.lyricist.unique().tolist()

lyricist_new = []
for i in range(len(lyricist)):
    if (type(lyricist[i]) == str):
        lw = lyricist[i].split('|')
        #lw = re.findall(r"[^|]+", genre[i])
        for j in range(len(lw)):
            lyricist_new.append(lw[j])

#--- lyricist still holds the raw column values, i.e. combinations (and NaN) ---
print('Number of unique lyricist values (combinations) in train set: ', len(lyricist))

lyricist_new = set(lyricist_new)
print('Number of unique lyricists after splitting them individually: ', len(lyricist_new))

#--- names found in both sets occur at least once as a single-lyricist value ---
print('lyricists that also occur on their own: ', len(set(lyricist) & lyricist_new))

#--- the remaining names occur only inside combinations ---
print('lyricists that occur only in combination with other lyricists: ', len(lyricist_new - set(lyricist)))

For the test data:

In [None]:
lyricist_test = df_test_merged.lyricist.unique().tolist()

lyricist_test_new = []
for i in range(len(lyricist_test)):
    if (type(lyricist_test[i]) == str):
        lw = lyricist_test[i].split('|')
        #lw = re.findall(r"[^|]+", genre[i])
        for j in range(len(lw)):
            lyricist_test_new.append(lw[j])

#--- lyricist_test still holds the raw column values, i.e. combinations (and NaN) ---
print('Number of unique lyricist values (combinations) in test set: ', len(lyricist_test))

lyricist_test_new = set(lyricist_test_new)
print('Number of unique lyricists after splitting them individually: ', len(lyricist_test_new))

#--- names found in both sets occur at least once as a single-lyricist value ---
print('lyricists that also occur on their own: ', len(set(lyricist_test) & lyricist_test_new))

#--- the remaining names occur only inside combinations ---
print('lyricists that occur only in combination with other lyricists: ', len(lyricist_test_new - set(lyricist_test)))

In [None]:
#--- Intersection between unique lyricists in train and test set ---
print('lyricists present in both train and test set: ', len(set(lyricist_test_new) & set(lyricist_new)))

print('lyricists present in only one of train or test set: ', len(set(lyricist_test_new) ^ set(lyricist_new)))

print('Total number of unique lyricists across train and test set: ', len(set(lyricist_test_new) | set(lyricist_new)))

Now that the count is obtained let us see how to impute missing values:

In [None]:
print('rows without Nan values:', df_train_merged.lyricist.count())      
print('rows with Nan values: ', len(df_train_merged) - df_train_merged.lyricist.count() )    

#--- .ix has been removed from pandas; plain boolean masks on the de-duplicated frame select the same rows ---
lyricist = df_train_merged[['song_id', 'lyricist']].drop_duplicates()   #--- unique (song_id, lyricist) pairs ---
lyricist_wo_nan = lyricist[lyricist['lyricist'].notnull()]              #--- pairs where lyricist is present ---
lyricist_w_nan = lyricist[lyricist['lyricist'].isnull()]                #--- pairs where lyricist is missing ---

#--- an empty array means no song_id has both a missing and a known lyricist ---
print('song_ids with both missing and known lyricist: ', np.intersect1d(lyricist_wo_nan['song_id'], lyricist_w_nan['song_id']))

Since there are no intersections as usual, we will impute missing values with a common string:

In [None]:
df_train_merged.lyricist = df_train_merged.lyricist.fillna('no_lyricist')
df_test_merged.lyricist = df_test_merged.lyricist.fillna('no_lyricist')

# Feature Engineering

## Feature 1 : **genre_ids_count**

This column contains the number of different genres assigned to the song. It was created while analyzing the **genre_ids** column in both the train and test sets.

The following snippet:
* returns 0 when the string `no_genre_id` is encountered, which was used to impute missing values.
* otherwise returns the number of `|` separators plus one, i.e. the number of genres.

In [None]:
def genre_id_count(x):
    if x == 'no_genre_id':
        return 0
    else:
        return x.count('|') + 1

df_train_merged['genre_ids_count'] = df_train_merged['genre_ids'].apply(genre_id_count).astype(np.int8)
df_test_merged['genre_ids_count'] = df_test_merged['genre_ids'].apply(genre_id_count).astype(np.int8)

## Feature 2 : **lyricists_count**
This column reveals the number of lyricists for a particular song.

Unlike the previous case, where the only separator was `|`, here we have to deal with all of the following: `|`, `/`, `\\` and `;`.

Take a look below to see what I mean!

Occurrences with `|`:
* `'Andy Cato| Tom Findlay| Julie McAlpine'`
* `'Max Martin| Shellback| Tiffany Amber'`

Occurrences with `/` and/or `|`:
* `'Korean Lyrics by Lee| Seu Ran (12.5%) Greg Paul Stephen Bonnick / Hayden Chapman / Jeremy Tyrone Jasper / Adrian McKinnon'`
* `'Misfit / Karen Poole / Stuart Crichton'`

Occurrences with `\\`:
* `'張震嶽 Ayal Komod\\陳昱榕 E-SO\\周文傑 KENZY\\林睦淵 MUTA'`
* `'黃煜俊\\黃揚哲'`

Occurrences with `;`:
* `'Hiroyuki Himeno;Zheng ShuFei;Ikoman'`

To make matters worse there are combinations of these as well! See below:
* `'克麗絲叮(Christine Welch)/"李惠群 Li| Hui-Qun"/"易家揚 Yi| Jia-Yang"'`
* `'CA: DAVID| MACK/ LOUIGUY、中文詞：小玉 林忠諭 '`
* `'Yasunori| Kawauchi \\黃東焜'`

Sometimes you also encounter weird strings like:
* `'Korean Lyrics by Lee| Ha Jin /Amber J. Liu / Gen Neo'`
* `'Korean Lyrics by Lee| Chae Yoon / Teddy Riley / DOM / Richard Garcia / Dantae Johnson / Labyron “Miko” Walton'`
* `'BoA (35%) / Harvey Mason Jr. (12.5%) / Mike Daley (17.5%) / Andrew Hey (17.5%) / Tiffany Fred (17.5%)'`
* `'Korean Lyrics by Cho| Yun Kyoung / January 8th / Kim| Dong Hyun / Teddy Riley| DOM| Lee| Hyun Seung for (TRX) / J.SOL (Jason J Lopez) / Dantae Johnson'`
* `'m-flo + Matt Cab for STAR BASE MUSIC'`
* `'Koji Tamaki and Tetsuya Komuro'`

In [None]:
list(df_train_merged.lyricist.unique())

In [None]:
def lyricist_count(x):
    if x == 'no_lyricist':
        return 0
    else:
        #--- number of separators + 1 = approximate number of names ---
        return sum(map(x.count, ['|', '/', '\\', ';'])) + 1

df_train_merged['lyricists_count'] = df_train_merged['lyricist'].apply(lyricist_count).astype(np.int8)
df_test_merged['lyricists_count'] = df_test_merged['lyricist'].apply(lyricist_count).astype(np.int8)
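
Counting separators over-counts when two separators sit next to each other or a string ends with one. A possible refinement, sketched below, splits on the same four separators and counts only the non-empty pieces. `count_names` is a hypothetical helper, and it is still an approximation: a name written as `'Lee| Seu Ran'` is counted as two.

```python
import re

# The same four separators as above: | / \ ;
_SEPARATORS = re.compile(r'[|/\\;]')

def count_names(x, missing_label='no_lyricist'):
    """Approximate number of names in a separator-delimited string."""
    if x == missing_label:
        return 0
    parts = [p.strip() for p in _SEPARATORS.split(x)]
    return sum(1 for p in parts if p)   # ignore empty pieces

print(count_names('Misfit / Karen Poole / Stuart Crichton'))
print(count_names('Hiroyuki Himeno;Zheng ShuFei;Ikoman'))
print(count_names('黃煜俊\\黃揚哲'))
print(count_names('no_lyricist'))
```

The `missing_label` argument lets the same function serve the composer column with `'no_composer'`.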

## Feature 3 : **composer_count**

This column contains the number of composers.

In [None]:
list(df_train_merged['composer'].unique())

We have similar occurrences of separation as the previous case.

In [None]:
def composer_count(x):
    if x == 'no_composer':
        return 0
    else:
        return sum(map(x.count, ['|', '/', '\\', ';'])) + 1

df_train_merged['composer_count'] = df_train_merged['composer'].apply(composer_count).astype(np.int8)
df_test_merged['composer_count'] = df_test_merged['composer'].apply(composer_count).astype(np.int8)

## Feature 4 : **is_featured**
This is a binary column indicating whether the song has featured artists.

In [None]:
list(df_train_merged['artist_name'].unique())

To detect featured artists, strings like `feat.` and `featuring` must be taken into account.

In [None]:
def is_featured(x):
    #--- str.find returns an index (-1 if absent), so comparing it with True is wrong; use `in` ---
    if ('feat.' in x) or ('featuring' in x):
        return 1
    else:
        return 0

df_train_merged['is_featured'] = df_train_merged['artist_name'].apply(is_featured).astype(np.int8)
df_test_merged['is_featured'] = df_test_merged['artist_name'].apply(is_featured).astype(np.int8)
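
Artist names are not always written in lower case, so a case-insensitive check may catch more rows. The sketch below uses a regular expression with a word boundary. `has_featured_artist` is a hypothetical helper and the sample names are made up; I have not checked how much difference it makes on the real data.

```python
import re

# Match 'feat.' or 'featuring' at the start of a word, in any letter case.
_FEATURED = re.compile(r'\b(?:feat\.|featuring)', re.IGNORECASE)

def has_featured_artist(name):
    """Return 1 if the artist string mentions a featured artist, else 0."""
    return int(bool(_FEATURED.search(name)))

print(has_featured_artist('Artist A feat. Artist B'))
print(has_featured_artist('Artist A Featuring Artist B'))
print(has_featured_artist('Defeat.'))
print(has_featured_artist('no_artist'))
```

The word boundary keeps a name such as `'Defeat.'` from being flagged.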

## Feature 5 : **artist_count**
This column stores the number of artists who performed the song, including featured artists.

In [None]:
def artist_count(x):
    if x == 'no_artist':
        return 0
    else:
        return sum(map(x.count, ['&', 'and', 'feat.', 'featuring'])) + 1

df_train_merged['artist_count'] = df_train_merged['artist_name'].apply(artist_count).astype(np.int8)
df_test_merged['artist_count'] = df_test_merged['artist_name'].apply(artist_count).astype(np.int8)
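
One caveat with counting substrings: `'Brandy'.count('and')` is 1, so a single artist whose name merely contains "and" is counted as two. A possible refinement, sketched below with a hypothetical helper `count_artists` and made-up names, splits on whole-word separators only.

```python
import re

# Separators: '&', the whole word 'and', 'feat.' and 'featuring'.
_ARTIST_SEP = re.compile(r'\s*(?:&|\band\b|\bfeat\.|\bfeaturing\b)\s*',
                         re.IGNORECASE)

def count_artists(name, missing_label='no_artist'):
    """Approximate number of artists in an artist_name string."""
    if name == missing_label:
        return 0
    parts = [p for p in _ARTIST_SEP.split(name) if p.strip()]
    return len(parts)

print(count_artists('Brandy'))
print(count_artists('Simon & Garfunkel'))
print(count_artists('Artist A feat. Artist B and Artist C'))
print(count_artists('no_artist'))
```

This is still a heuristic: a band whose name legitimately contains the word "and" is counted as two artists.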

## Feature 6 : **artist_composer**

New column to check whether the artist and composer are the same person.

In [None]:
df_train_merged['artist_composer'] = (df_train_merged['artist_name'] == df_train_merged['composer']).astype(np.int8)
df_test_merged['artist_composer'] = (df_test_merged['artist_name'] == df_test_merged['composer']).astype(np.int8)

## Feature 7 : **artist_composer_lyricist**

New column to check whether the artist, composer and lyricist are all the same person.

In [None]:
#--- artist == composer and artist == lyricist already imply composer == lyricist ---
df_train_merged['artist_composer_lyricist'] = ((df_train_merged['artist_name'] == df_train_merged['composer']) & (df_train_merged['artist_name'] == df_train_merged['lyricist'])).astype(np.int8)
df_test_merged['artist_composer_lyricist'] = ((df_test_merged['artist_name'] == df_test_merged['composer']) & (df_test_merged['artist_name'] == df_test_merged['lyricist'])).astype(np.int8)

In [None]:
print(df_train_merged.shape)
print(df_test_merged.shape)
df_train_merged['artist_composer_lyricist'].unique()

# *TO BE CONTINUED ....*