# <span style="color:#0F19C9">Contents</span>

- [Initial exploration](#initial-exploration)
- [Fixing problems](#fixing-problems)
- [Improving the data](#improving-the-data)

# <span style="color:#0F19C9">Initial Exploration</span>

In [39]:
# Import needed libraries
import pandas as pd

In [40]:
# Read raw file with data and show two random rows
df = pd.read_csv('../Data/Raw/music_project_en.csv')
df.sample(2)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
6194,E595E655,,,rnb,Shelbyville,09:28:14,Monday
27909,BE0524D0,Keep It Simple,Emilio RaStok,hip,Shelbyville,20:47:42,Wednesday


In [41]:
# Show the general information of the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63736 non-null  object
 2   artist    57512 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


# <span style="color:#0F19C9">Fixing problems</span>

- Make sure the column names are spelled correctly and convey the right meaning.
- Make sure the values in every column are well formed.

In [42]:
# Fix stray spaces, inconsistent capitalization and unclear names manually
df.columns = ['User_ID', 'Track', 'Artist', 'Genre',
              'City', 'Hour_of_Play', 'Day_of_the_Week']
df.columns

Index(['User_ID', 'Track', 'Artist', 'Genre', 'City', 'Hour_of_Play',
       'Day_of_the_Week'],
      dtype='object')
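
The names above were assigned by hand. As an alternative sketch, the whitespace and capitalization problems visible in `df.info()` could also be cleaned programmatically with the `.str` accessor on the column index. The `toy` frame below is a hypothetical stand-in, not the project data, and the lower-case naming is only one possible convention.

```python
import pandas as pd

# Hypothetical frame reproducing the header problems seen in df.info():
# leading spaces and inconsistent capitalization.
toy = pd.DataFrame(columns=['  userID', 'Track', 'artist', '  City'])

# Strip surrounding whitespace, then lower-case every name.
toy.columns = toy.columns.str.strip().str.lower()

print(list(toy.columns))
```

This only fixes spacing and case. Renaming `time` to something more descriptive, such as `Hour_of_Play`, still has to be done by hand.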

In [43]:
# Write 10 samples of the first four columns
df[['User_ID', 'Track', 'Artist', 'Genre']].sample(10)

Unnamed: 0,User_ID,Track,Artist,Genre
56134,4BCC0F2B,Boss Like That,Sjaak,rap
38367,B86ED4DB,Build One,Nightly,indie
24895,D6ADDFAA,I Am Weightless,Septembre,gothic
64753,A129C34D,Flames of Love,m@rcell,pop
43519,81ECAD48,Deadpool,Bizzy Mind,hip
56313,2CD11703,Metal Dog,Patrick Dawes,pop
14785,D31F8DF3,Feel the Panic,Felix Leiter,electronic
62688,53C5B66B,Horst & Monika,Die Orsons,hip
9230,5E970D24,,,
57772,E9BF1141,Change,Daybehavior,electronic


In [44]:
# Count unique records and null values of the columns
for col in ['User_ID', 'Track', 'Artist', 'Genre']:
    print(
        f'The column {col} has {df[col].nunique()} unique and {df[col].isna().sum()} null values.')

The column User_ID has 41748 unique and 0 null values.
The column Track has 39666 unique and 1343 null values.
The column Artist has 37806 unique and 7567 null values.
The column Genre has 268 unique and 1198 null values.


The column `User_ID` has 41,748 unique values of 8 characters. Some of them have only 7 characters, but we assume the missing first character is a 0 and that the format has no problems.
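
The notebook does not modify the IDs. If one wanted to make the leading-zero assumption explicit, a minimal sketch could count the ID lengths and left-pad the short ones. The IDs in `ids` are made up for illustration.

```python
import pandas as pd

# Hypothetical IDs: the second one is only 7 characters long.
ids = pd.Series(['E595E655', '5E970D2', 'BE0524D0'], name='User_ID')

# How many IDs exist for each length.
length_counts = ids.str.len().value_counts()

# Left-pad with zeros up to 8 characters; longer strings are left untouched.
padded = ids.str.zfill(8)

print(length_counts)
print(padded.tolist())
```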

The track, artist name and genre reflect how the distributor filled in the form. We are more concerned about the many null values we found, which we will fill later.

In [45]:
# Write 10 samples of the column 'Genre'
df['Genre'].sample(10)

62832            rap
46830          dance
6426             pop
37457           rock
47285       folkrock
28948          samba
20337            pop
16870          dance
22193            pop
45062    alternative
Name: Genre, dtype: object

In [46]:
# Search for implicit duplicates
df['Genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'ambient', 'americana', 'animated', 'anime',
       'arabesk', 'arabic', 'arena', 'argentinetango', 'art', 'audiobook',
       'avantgarde', 'axé', 'baile', 'balkan', 'beats', 'bigroom',
       'black', 'bluegrass', 'blues', 'bollywood', 'bossa', 'brazilian',
       'breakbeat', 'breaks', 'broadway', 'cantautori', 'cantopop',
       'canzone', 'caribbean', 'caucasian', 'celtic', 'chamber',
       'children', 'chill', 'chinese', 'choral', 'christian', 'christmas',
       'classical', 'classicmetal', 'club', 'colombian', 'comedy',
       'conjazz', 'contemporary', 'country', 'cuban', 'dance',
       'dancehall', 'dancepop', 'dark', 'death', 'deep', 'deutschrock',
       'deutschspr', 'dirty', 'disco', 'dnb', 'documentary', 'downbeat',
       'downtempo', 'drum', 'dub', 'dubstep', 'eastern', 'easy',
       'electronic', 'electropop', 'emo', 'entehno', 'epicmetal',
       'estrada', 'ethnic', 'eurofo

In [47]:
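# Merge the genres 'hip', 'hop' and 'hiphop' into the single label 'Hip-Hop'
# (assign the result back instead of using inplace=True on a selected column)
df['Genre'] = df['Genre'].replace({'hip': 'Hip-Hop',
                                   'hop': 'Hip-Hop',
                                   'hiphop': 'Hip-Hop'})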

We searched for implicit duplicates after finding the word `hip` as a genre. The variants `hip`, `hop` and `hiphop` all refer to the same genre, so we merged them into the single capitalized label `Hip-Hop`.
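
As a rough sketch of how such variants can be spotted without reading the whole array, the unique genres can be filtered with a pattern before merging them. The `genres` series and the pattern `'hip|hop'` are illustrative assumptions; any match still needs a manual check before it is merged.

```python
import pandas as pd

# Hypothetical sample of genre labels, including a missing value.
genres = pd.Series(['hip', 'rock', 'hop', 'hiphop', None, 'pop'])

# Candidate labels containing 'hip' or 'hop'; na=False skips missing values.
candidates = genres[genres.str.contains('hip|hop', na=False)].unique()

# Merge the confirmed variants into one label; replace returns a new Series.
mapping = {'hip': 'Hip-Hop', 'hop': 'Hip-Hop', 'hiphop': 'Hip-Hop'}
merged = genres.replace(mapping)

print(sorted(candidates))
print(merged.tolist())
```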

In [48]:
# Write 10 samples of the three last columns
df[['City', 'Hour_of_Play', 'Day_of_the_Week']].sample(10)

Unnamed: 0,City,Hour_of_Play,Day_of_the_Week
18924,Shelbyville,13:28:33,Wednesday
27574,Springfield,21:25:29,Wednesday
10292,Springfield,20:15:51,Friday
2437,Springfield,14:22:38,Friday
35628,Springfield,13:22:56,Monday
44178,Springfield,21:25:19,Monday
13336,Springfield,20:26:58,Monday
47552,Springfield,20:14:17,Monday
12048,Springfield,20:47:37,Monday
47339,Springfield,13:54:41,Friday


The last three columns show no problems. We may have to change their data types later, but that does not require our attention now.

# <span style="color:#0F19C9">Improving the data</span>

We will fill the null values, remove the duplicates and finally adjust the data types where needed.

In [49]:
# Sum null values from the DataFrame
df.isna().sum()

User_ID               0
Track              1343
Artist             7567
Genre              1198
City                  0
Hour_of_Play          0
Day_of_the_Week       0
dtype: int64

In [50]:
# Fill null values of 'Track', 'Artist' and 'Genre'
df.fillna('Unknown', inplace=True)

In [51]:
# Search for problem duplicates
print(' We will drop {} duplicates'.format(
    df.duplicated(subset=['User_ID', 'Hour_of_Play']).sum()))
df.drop_duplicates(subset=['User_ID', 'Hour_of_Play'], inplace=True)

 We will drop 3869 duplicates


In [52]:
# Search for other duplicates
df.duplicated().sum()

0

In [53]:
# Convert 'Hour_of_Play' from strings to datetime.time objects
df['Hour_of_Play'] = (pd.to_datetime(df['Hour_of_Play'], format='%H:%M:%S')
                      .dt.time)

We only found null values in the columns `Track`, `Artist` and `Genre`. They do not affect our analysis, so we kept those rows and replaced the NaN values with the string `Unknown`. Now we have 0 null values.
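
The fill above was applied to the whole DataFrame, which works here because no other column has nulls. A minimal sketch of limiting the fill to named columns, using a made-up `toy` frame, could look like this:

```python
import pandas as pd

# Hypothetical rows with gaps in the three text columns.
toy = pd.DataFrame({
    'Track': ['Change', None],
    'Artist': [None, 'Sjaak'],
    'Genre': ['pop', None],
    'City': ['Springfield', 'Shelbyville'],
})

# Fill only the selected columns and assign the result back.
text_cols = ['Track', 'Artist', 'Genre']
toy[text_cols] = toy[text_cols].fillna('Unknown')

print(toy.isna().sum())
```

Restricting the columns keeps a placeholder string out of any column where `Unknown` would not be a meaningful value.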

Then we found some records where the same user (`User_ID`) played two songs at the same time (`Hour_of_Play`), which is not possible, so we deleted them and kept only the first record. Afterwards, we searched for other duplicates but there were none.
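
To illustrate how the subset-based check behaves, here is a small sketch on invented rows: two plays by the same user at the same time count as duplicates even though the tracks differ, and only the first one is kept.

```python
import pandas as pd

# Hypothetical plays: user A1 appears twice at the same time.
plays = pd.DataFrame({
    'User_ID': ['A1', 'A1', 'B2'],
    'Track': ['Change', 'Deadpool', 'Change'],
    'Hour_of_Play': ['20:15:51', '20:15:51', '20:15:51'],
})

key = ['User_ID', 'Hour_of_Play']

# True for every repeat of a (user, time) pair after its first occurrence.
is_repeat = plays.duplicated(subset=key)

# keep='first' is the default; it is spelled out here for clarity.
deduped = plays.drop_duplicates(subset=key, keep='first')

print(is_repeat.sum())
print(deduped)
```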

Finally, we converted the column `Hour_of_Play` from strings to `datetime.time` objects for the analysis. Note that pandas still reports the dtype of this column as `object`.
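
A short sketch of what this conversion produces, on two sample time strings: `pd.to_datetime` yields a datetime column, while `.dt.time` turns it into plain Python `datetime.time` objects stored with dtype `object`. The variable names are illustrative.

```python
import datetime

import pandas as pd

# Two sample time strings in the same format as the dataset.
hours = pd.Series(['09:28:14', '20:47:42'])

# Parsing gives a datetime column (with a placeholder date).
parsed = pd.to_datetime(hours, format='%H:%M:%S')

# .dt.time keeps only the time of day as datetime.time objects.
as_time = parsed.dt.time

print(parsed.dtype)
print(as_time.dtype)
print(as_time.iloc[0] == datetime.time(9, 28, 14))
```

If later analysis needs numeric components, `parsed.dt.hour` is available on the datetime column, but not on the `object` column of `datetime.time` values.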

In [54]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 61210 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   User_ID          61210 non-null  object
 1   Track            61210 non-null  object
 2   Artist           61210 non-null  object
 3   Genre            61210 non-null  object
 4   City             61210 non-null  object
 5   Hour_of_Play     61210 non-null  object
 6   Day_of_the_Week  61210 non-null  object
dtypes: object(7)
memory usage: 3.7+ MB


In [55]:
# Save DataFrame in csv file
df.to_csv('../Data/Processed/processed_data.csv', index=False)

We end with a clean dataset of 61,210 entries and 7 columns.