# About this Notebook:

This notebook presents a brief exploratory data analysis (EDA) of the MetroLyrics dataset found on Kaggle.

Details of the dataset can be found here: __[380,000+ lyrics from MetroLyrics - Lyrics, Artist , Genre, Year](https://www.kaggle.com/gyani95/380000-lyrics-from-metrolyrics/kernels)__

Notebook Environment

- Kernel: Python 3

## Questions to be answered:

1. How many songs are in the Hip-Hop genre?
2. How many different MCs are included in this dataset?
3. Can we extract the Hip-Hop genre subset as the data for our lyrics database?

## TODO

Export a smaller subset of the data as a CSV file that includes only songs in the Hip-Hop genre

# Load Modules

In [5]:
from __future__ import print_function
import pandas as pd
import numpy as np
import json
import csv
import os
from os.path import dirname as up
import time

# Commonly Shared Statics

In [39]:
projectPath = up(os.getcwd())
# The file lyrics.csv is listed in .gitignore and will not be added to the repo
data_path = projectPath + '/data/lyrics.csv'
exporting_path = projectPath + '/data/lyrics_hiphop.csv'

# Parse the Data

Load the CSV file into a pandas DataFrame

In [16]:
data_df = pd.read_csv(data_path)

Check the first 5 and last 5 rows of the DataFrame with `head` and `tail`

In [17]:
data_df.head()

Unnamed: 0,index,song,year,artist,genre,lyrics
0,0,ego-remix,2009,beyonce-knowles,Pop,"Oh baby, how you doing?\nYou know I'm gonna cu..."
1,1,then-tell-me,2009,beyonce-knowles,Pop,"playin' everything so easy,\nit's like you see..."
2,2,honesty,2009,beyonce-knowles,Pop,If you search\nFor tenderness\nIt isn't hard t...
3,3,you-are-my-rock,2009,beyonce-knowles,Pop,"Oh oh oh I, oh oh oh I\n[Verse 1:]\nIf I wrote..."
4,4,black-culture,2009,beyonce-knowles,Pop,"Party the people, the people the party it's po..."


In [18]:
data_df.tail()

Unnamed: 0,index,song,year,artist,genre,lyrics
362232,362232,who-am-i-drinking-tonight,2012,edens-edge,Country,"I gotta say\nBoy, after only just a couple of ..."
362233,362233,liar,2012,edens-edge,Country,I helped you find her diamond ring\nYou made m...
362234,362234,last-supper,2012,edens-edge,Country,Look at the couple in the corner booth\nLooks ...
362235,362235,christ-alone-live-in-studio,2012,edens-edge,Country,When I fly off this mortal earth\nAnd I'm meas...
362236,362236,amen,2012,edens-edge,Country,I heard from a friend of a friend of a friend ...


Check the summary info of the DataFrame (column types and non-null counts)

In [19]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 362237 entries, 0 to 362236
Data columns (total 6 columns):
index     362237 non-null int64
song      362235 non-null object
year      362237 non-null int64
artist    362237 non-null object
genre     362237 non-null object
lyrics    266557 non-null object
dtypes: int64(2), object(4)
memory usage: 16.6+ MB
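The non-null counts above imply that `lyrics` is missing for 95,680 rows (362,237 minus 266,557) and `song` for 2 rows. A minimal sketch of reading those gaps off directly with `isna().sum()`, using a small made-up frame (`toy_df` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for data_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "song": ["a", None, "c"],
    "lyrics": ["some words", None, None],
})

# isna() marks missing cells; sum() counts them per column.
missing_counts = toy_df.isna().sum()
print(missing_counts)
```

Running the same call on `data_df` should give the per-column missing counts without subtracting by hand.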


Check the distribution of songs across genres

In [22]:
data_df.genre.value_counts(dropna=False)

Rock             131377
Pop               49444
Hip-Hop           33965
Not Available     29814
Metal             28408
Other             23683
Country           17286
Jazz              17147
Electronic        16205
R&B                5935
Indie              5732
Folk               3241
Name: genre, dtype: int64

Extract hip_hop_data_df from data_df

In [34]:
hip_hop_data_df = data_df[data_df.genre == 'Hip-Hop']
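Modifying a filtered subset in place, as the cells further down do, is what makes pandas emit its "A value is trying to be set on a copy of a slice" warning. One common way to avoid it, sketched here on a small made-up frame (`toy_df` and `toy_hip_hop_df` are illustrative names, not part of this notebook), is to take an explicit `.copy()` of the subset and reassign results instead of using `inplace=True`:

```python
import pandas as pd

# Toy stand-in for data_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "song": ["a", "b", "c", "d"],
    "genre": ["Hip-Hop", "Pop", "Hip-Hop", "Hip-Hop"],
    "lyrics": ["some words", "other words", None, "more words"],
})

# .copy() makes the subset an independent DataFrame, so later
# changes are unambiguous and do not involve the original frame.
toy_hip_hop_df = toy_df[toy_df.genre == "Hip-Hop"].copy()

# Reassigning the result avoids inplace=True altogether.
toy_hip_hop_df = toy_hip_hop_df.dropna(subset=["lyrics"])

print(len(toy_hip_hop_df))
```

The original frame is left untouched, and the subset keeps the row labels it had in the source frame.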

Drop rows that have no lyrics

In [35]:
hip_hop_data_df.dropna(subset=['lyrics'],inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [36]:
hip_hop_data_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24850 entries, 249 to 362156
Data columns (total 6 columns):
index     24850 non-null int64
song      24850 non-null object
year      24850 non-null int64
artist    24850 non-null object
genre     24850 non-null object
lyrics    24850 non-null object
dtypes: int64(2), object(4)
memory usage: 1.3+ MB
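The summary above shows 24,850 Hip-Hop songs with lyrics, down from the 33,965 Hip-Hop rows in the genre counts. For the second question, one possible approach is to count distinct values in the `artist` column with `nunique()`. This is only a sketch on made-up rows, and it assumes each distinct artist slug corresponds to one MC, which may not hold for groups or non-rapping artists tagged as Hip-Hop:

```python
import pandas as pd

# Toy stand-in for hip_hop_data_df; rows are made up for illustration.
toy_hip_hop_df = pd.DataFrame({
    "song": ["s1", "s2", "s3", "s4"],
    "artist": ["eazy-e", "eazy-e", "dub-fx", "dub-fx"],
})

n_songs = len(toy_hip_hop_df)                         # number of songs
n_artists = toy_hip_hop_df.artist.nunique()           # distinct artist slugs
songs_per_artist = toy_hip_hop_df.artist.value_counts()  # songs per artist

print(n_songs, n_artists)
print(songs_per_artist)
```

Applied to `hip_hop_data_df`, `n_artists` would give the number of distinct artists in the subset.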


# Save the hip_hop_data_df to lyrics_hiphop.csv

In [42]:
hip_hop_data_df.head()

Unnamed: 0,index,song,year,artist,genre,lyrics
249,249,i-got-that,2007,eazy-e,Hip-Hop,(horns)...\n(chorus)\nTimbo- When you hit me o...
250,250,8-ball-remix,2007,eazy-e,Hip-Hop,"Verse 1:\nI don't drink brass monkey, like to ..."
251,251,extra-special-thankz,2007,eazy-e,Hip-Hop,"19 muthaphukkin 93,\nand I'm back in this bitc..."
252,252,boyz-in-da-hood,2007,eazy-e,Hip-Hop,"Hey yo man, remember that shit Eazy did a whil..."
253,253,automoblie,2007,eazy-e,Hip-Hop,"Yo, Dre, man, I take this bitch out to the mov..."


In [43]:
hip_hop_data_df.tail()

Unnamed: 0,index,song,year,artist,genre,lyrics
362136,362136,the-sky,2013,dub-fx,Hip-Hop,The sky is forever\nAs I'm tappin' away upon m...
362145,362145,run,2013,dub-fx,Hip-Hop,I see you run when I say WHATAGWAN\nI see you ...
362149,362149,so-are-you,2016,dub-fx,Hip-Hop,Oooh\nIf i only ever knew I'm trapped in my mi...
362150,362150,road-to-babylon,2016,dub-fx,Hip-Hop,[Verse 1]\nI walk alone on this road to Babylo...
362156,362156,love-me-or-not,2014,dub-fx,Hip-Hop,You could love me or not\nBut either way I've ...


In [46]:
hip_hop_data_df.drop(['index'], axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [47]:
hip_hop_data_df.head()

Unnamed: 0,song,year,artist,genre,lyrics
249,i-got-that,2007,eazy-e,Hip-Hop,(horns)...\n(chorus)\nTimbo- When you hit me o...
250,8-ball-remix,2007,eazy-e,Hip-Hop,"Verse 1:\nI don't drink brass monkey, like to ..."
251,extra-special-thankz,2007,eazy-e,Hip-Hop,"19 muthaphukkin 93,\nand I'm back in this bitc..."
252,boyz-in-da-hood,2007,eazy-e,Hip-Hop,"Hey yo man, remember that shit Eazy did a whil..."
253,automoblie,2007,eazy-e,Hip-Hop,"Yo, Dre, man, I take this bitch out to the mov..."


In [48]:
# index=False keeps the row labels out of the exported file
hip_hop_data_df.to_csv(exporting_path, index=False)
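Because the lyrics contain commas and embedded newlines, it can be worth confirming that an exported file reads back unchanged. A minimal round-trip sketch, assuming a temporary file and a made-up frame (`toy_df`, `toy_path` and `reloaded_df` are illustrative names, not part of this notebook):

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for hip_hop_data_df; the lyrics include a comma and a newline.
toy_df = pd.DataFrame({
    "song": ["s1", "s2"],
    "year": [2007, 2013],
    "lyrics": ["Line one,\nline two", "Plain line"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_export.csv")
    # index=False omits the row labels, so only the data columns are written.
    toy_df.to_csv(toy_path, index=False)
    reloaded_df = pd.read_csv(toy_path)

# Compare columns and values to confirm nothing was lost or split.
same_columns = list(reloaded_df.columns) == list(toy_df.columns)
same_lyrics = reloaded_df.lyrics.tolist() == toy_df.lyrics.tolist()
print(same_columns, same_lyrics)
```

The same check could be run against `exporting_path` by reloading it with `pd.read_csv` and comparing it with `hip_hop_data_df`.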