# Contents

- Initial Exploration
- Fix problems
- Improve the data

# Initial Exploration

In [106]:
# Import needed libraries
import pandas as pd

In [107]:
# Read raw file with data and show two random rows
df = pd.read_csv('../Data/Raw/music_project_en.csv')
df.sample(2)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
33098,206D3BB2,Summertime,Shelly Manne & His Men,jazz,Springfield,20:52:29,Monday
53630,8BE7EE1B,In Time,Lake Jons,alternative,Springfield,13:34:52,Monday


In [108]:
# Show the general information of the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63736 non-null  object
 2   artist    57512 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


# Fix problems

- Make sure the column names are spelled consistently and describe their contents.
- Make sure the values in every column are well formed.

In [109]:
# Rename columns manually: remove stray spaces, fix capitalization, use descriptive names
df.columns = ['User_ID', 'Track', 'Artist', 'Genre',
              'City', 'Hour_of_Play', 'Day_of_the_Week']
df.columns

Index(['User_ID', 'Track', 'Artist', 'Genre', 'City', 'Hour_of_Play',
       'Day_of_the_Week'],
      dtype='object')
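
Assigning a full list to `df.columns` depends on the column order in the file. As an alternative, here is a minimal sketch that strips stray whitespace and renames by name. The toy frame `raw` and the mapping `new_names` are illustrative assumptions built from the header shown by `df.info()`; they are not part of the original notebook.

```python
import pandas as pd

# Toy frame that mimics the raw header (leading spaces, mixed capitalization)
raw = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'genre',
                            '  City', 'time', 'Day'])

# Strip stray whitespace first so the mapping keys are predictable
raw.columns = raw.columns.str.strip()

# Rename by name rather than by position, so a reordered file
# cannot silently mislabel a column
new_names = {
    'userID': 'User_ID',
    'artist': 'Artist',
    'genre': 'Genre',
    'time': 'Hour_of_Play',
    'Day': 'Day_of_the_Week',
}
renamed = raw.rename(columns=new_names)
print(list(renamed.columns))
```

Columns missing from the mapping (`Track`, `City`) keep their stripped names.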

In [110]:
# Show 10 random rows of the first three columns
df[['User_ID', 'Track', 'Artist']].sample(10)

Unnamed: 0,User_ID,Track,Artist
20537,E8CCA6AE,Nobody Like You,The Polecats
31059,9D68CF00,Refreshing Rain,Zen Meditation and Natural White Noise and New...
11136,9FB58A7C,My Heart Will Go On,
34276,F0F58500,Barber: Adagio For Strings Op.11/2,Baltimore Symphony Orchestra
61321,9D4C9F64,Time Is On Your Side,Stadiumx
4389,A010EB7E,"Act II - ""Music for a While""",
21240,8E93BDE9,Headspace,Thomston
43388,689AD412,Go,Amazing
20590,1382220A,Everybody,Irma
62253,686E00B7,Make Love,Taeyang


In [111]:
# Count unique records and null values of the columns
for col in ['User_ID', 'Track', 'Artist']:
    print(
        f'The column {col} has {df[col].nunique()} unique and {df[col].isna().sum()} null values.')

The column User_ID has 41748 unique and 0 null values.
The column Track has 39666 unique and 1343 null values.
The column Artist has 37806 unique and 7567 null values.


The column `User_ID` has 41,748 unique values. Most are 8 characters long; a few have 7, and we assume those lost a leading 0, so we treat the format as valid.
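
The length claim can be checked directly. The sketch below uses a small hypothetical series (`ids`) rather than the real data: it counts how many IDs have each length and, only under the leading-zero assumption stated above, pads short IDs back to 8 characters.

```python
import pandas as pd

# Hypothetical IDs: the last one has lost its leading zero
ids = pd.Series(['206D3BB2', '8BE7EE1B', '9D68CF0'], name='User_ID')

# How many IDs have each length?
length_counts = ids.str.len().value_counts()
print(length_counts)

# Left-pad with zeros to 8 characters
# (only valid if the missing character really is a leading 0)
padded = ids.str.zfill(8)
print(padded.tolist())
```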

The track and artist names appear as the distributor entered them, so we leave their formatting alone. The larger concern is the number of null values in these columns (1,343 in `Track` and 7,567 in `Artist`), which we fill later.

In [112]:
# Show 10 random values of the column 'Genre'
df['Genre'].sample(10)

9806      country
7613         folk
26683       world
35300    eurofolk
26931       latin
20104        drum
20199         pop
49558       dance
30231         pop
50150       dance
Name: Genre, dtype: object

In [113]:
# Search for implicit duplicates
# (drop NaN first: sorted() cannot compare the float NaN with strings)
sorted(df['Genre'].dropna().unique())

In [None]:
# Merge the variants 'hip', 'hop' and 'hiphop' into a single genre, 'Hip-Hop'
df['Genre'] = df['Genre'].replace({'hip': 'Hip-Hop',
                                   'hop': 'Hip-Hop',
                                   'hiphop': 'Hip-Hop'})

# Capitalize the first letter of every genre.
# Fill missing genres first; str(x).title() would turn NaN into the text 'Nan'
df['Genre'] = df['Genre'].fillna('Unknown').str.title()

We merged the variants `hip`, `hop`, and `hiphop` into a single `Hip-Hop` genre, because they are implicit duplicates of the same genre, and then capitalized the first letter of every genre.
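
One detail worth checking when title-casing a column that contains missing values is how NaN is treated. The sketch below, on a made-up series `genres`, compares `apply` with `str(x)` against the vectorized `.str` accessor.

```python
import numpy as np
import pandas as pd

genres = pd.Series(['hip', 'hop', 'hiphop', 'pop', np.nan])

# Merge the spelling variants into one label
merged = genres.replace({'hip': 'Hip-Hop', 'hop': 'Hip-Hop', 'hiphop': 'Hip-Hop'})

# str(x) turns NaN into the text 'nan', which .title() then capitalizes
via_apply = merged.apply(lambda x: str(x).title())

# The vectorized accessor skips missing values and leaves them missing
via_accessor = merged.str.title()

print(via_apply.tolist())
print(via_accessor.isna().sum())
```

With the first approach, missing genres stop counting as null and would appear as a genre called `Nan` in later aggregations. The second approach keeps them visible to `isna()`.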

In [None]:
# Show 10 random rows of the last three columns
df[['City', 'Hour_of_Play', 'Day_of_the_Week']].sample(10)

Unnamed: 0,City,Hour_of_Play,Day_of_the_Week
12019,Springfield,20:41:21,Monday
49490,Shelbyville,21:24:49,Monday
57460,Springfield,20:40:30,Monday
51267,Shelbyville,14:15:57,Friday
47546,Springfield,08:26:31,Monday
63725,Shelbyville,09:59:50,Friday
51937,Springfield,09:30:38,Friday
60513,Springfield,20:25:06,Monday
59513,Springfield,08:34:20,Monday
14161,Springfield,13:05:28,Monday


The last three columns show no problems. We may need to change their data types later, but that does not require our attention now.
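
If a type change does become necessary, one option for columns with few distinct values, such as `City` and `Day_of_the_Week`, is the `category` dtype. This is a sketch on a toy frame (`sample`), not a step the notebook performs.

```python
import pandas as pd

sample = pd.DataFrame({
    'City': ['Springfield', 'Shelbyville', 'Springfield'],
    'Day_of_the_Week': ['Monday', 'Friday', 'Monday'],
})

# Low-cardinality text columns can be stored as categoricals
converted = sample.astype({'City': 'category', 'Day_of_the_Week': 'category'})

print(converted.dtypes)
print(converted['City'].cat.categories.tolist())
```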

# Improve the data

We will handle the null values, remove the duplicates, and finally improve the data type of each column.

In [None]:
# Count null values per column
df.isna().sum()

User_ID               0
Track              1343
Artist             7567
Genre                 0
City                  0
Hour_of_Play          0
Day_of_the_Week       0
dtype: int64

In [None]:
# Fill null values of 'Track' and 'Artist'
df.fillna('Unknown', inplace=True)

In [None]:
# Count plays that share the same user and time, then keep only the first of each
print(df.duplicated(subset=['User_ID', 'Hour_of_Play']).sum())
df.drop_duplicates(subset=['User_ID', 'Hour_of_Play'], inplace=True)

In [None]:
# Search for other duplicates
df.duplicated().sum()

0

In [114]:
# Convert 'Hour_of_Play' from strings to datetime.time values
df['Hour_of_Play'] = (pd.to_datetime(df['Hour_of_Play'], format='%H:%M:%S')
                      .dt.time)

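The null counts above show missing values only in `Track` and `Artist`. `Genre` also had 1,198 missing values in the raw data (65,079 rows, 63,881 non-null), but the earlier formatting step converted them to strings, so they no longer count as null. Missing names do not affect our analysis, so we kept those rows and replaced NaN with the string `Unknown`; the DataFrame now has 0 null values.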
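
Calling `fillna` on the whole DataFrame fills every column. When only some columns should receive the placeholder, a per-column mapping is more explicit. A minimal sketch on a made-up frame (`songs`):

```python
import numpy as np
import pandas as pd

songs = pd.DataFrame({
    'Track': ['Go', np.nan],
    'Artist': [np.nan, 'Irma'],
    'City': ['Springfield', 'Shelbyville'],
})

# Fill only the named columns; any other column is left untouched
filled = songs.fillna({'Track': 'Unknown', 'Artist': 'Unknown'})

print(filled.isna().sum())
```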

Next, we found records in which the same user (`User_ID`) played two songs at the same time (`Hour_of_Play`), which is not possible, so we kept only the first record of each pair and deleted the rest. We then searched for fully duplicated rows and found none.
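
Before deleting such rows, it can help to look at every record involved in a collision. The sketch below uses a hypothetical frame (`plays`): `keep=False` marks all colliding rows, and `keep='first'` then keeps one per user and time.

```python
import pandas as pd

plays = pd.DataFrame({
    'User_ID': ['A1', 'A1', 'B2'],
    'Track': ['Song X', 'Song Y', 'Song Z'],
    'Hour_of_Play': ['20:41:21', '20:41:21', '09:30:38'],
})

key = ['User_ID', 'Hour_of_Play']

# Inspect every row involved in a collision before deleting anything
collisions = plays[plays.duplicated(subset=key, keep=False)]
print(collisions)

# Keep the first record of each (user, time) pair
deduped = plays.drop_duplicates(subset=key, keep='first')
print(len(plays) - len(deduped), 'rows removed')
```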

Finally, we converted `Hour_of_Play` from strings to `datetime.time` values for the analysis.
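
A short sketch of what this conversion produces, using two time strings taken from the sample rows above. The column still reports `object` as its dtype, because pandas has no dedicated time-of-day dtype; the individual values are `datetime.time` objects.

```python
import datetime
import pandas as pd

hours = pd.Series(['20:52:29', '13:34:52'], name='Hour_of_Play')

# Parse the strings, then keep only the time-of-day component
as_time = pd.to_datetime(hours, format='%H:%M:%S').dt.time

print(as_time.dtype)
print(type(as_time.iloc[0]))
print(as_time.iloc[0].hour)
```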