# Musical Preferences Between Cities


# Contents <a id='back'></a>

* [1. Taking a look at the data](#data_review)
* [2. Data preprocessing](#data_preprocessing)
    * [2.1 Header Style](#header_style)
    * [2.2 Missing Values](#missing_values)
    * [2.3 Duplicates](#duplicates)
* [3. Hypothesis Testing](#hypotheses)
    * [3.1 1st Hypothesis: Comparing User Behavior in the Two Cities](#activity)
    * [3.2 2nd Hypothesis: Music at the Beginning and End of the Week](#week)
    * [3.3 3rd Hypothesis: Genre Preferences in Springfield and Shelbyville](#genre)
* [Conclusions](#end)

## Taking a look at the data <a id='data_review'></a>

In [None]:
# importing the necessary libraries

import pandas as pd

In [51]:
# reading the dataset into a DataFrame
df = pd.read_csv('/datasets/music_project_en.csv')

In [52]:
# taking a look at the first rows of our dataset
df.head(10)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
5,842029A1,Chains,Obladaet,rusrap,Shelbyville,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Springfield,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Springfield,20:47:49,Wednesday
8,8FA1D3BE,L’estate,Julia Dalia,ruspop,Springfield,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Shelbyville,21:20:49,Wednesday


In [53]:
# getting some additional info

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63736 non-null  object
 2   artist    57512 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


[Back to content](#back)

## Data preprocessing <a id='data_preprocessing'></a>

### Correcting header style <a id='header_style'></a>


In [None]:
# getting the column names of our DataFrame
print(df.columns)

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')


In [55]:
# making column names more readable: lowercase, snake_case, no stray spaces
df.columns = ['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day']

In [56]:
# checking the changes
print(df.columns)

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
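The hard-coded list above works for this dataset. As a minimal sketch, assuming the only problems are stray spaces, capital letters, and camelCase, the same cleanup could be derived from the raw names instead. `to_snake_case` and `demo` are hypothetical names used only for illustration.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Strip spaces, split camelCase with an underscore, and lowercase."""
    name = name.strip()
    # insert '_' between a lowercase letter or digit and the capital that follows
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    return name.lower()


# the raw header as printed by df.columns earlier in this section
raw_columns = ['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day']

# an empty illustrative frame that only carries the column names
demo = pd.DataFrame(columns=raw_columns)
demo = demo.rename(columns=to_snake_case)

print(list(demo.columns))
```

Deriving the names this way avoids silently mislabeling columns if their order ever changes in the source file.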


[Back to content](#back)

### Missing values <a id='missing_values'></a>

In [57]:
# counting missing values per column
print(df.isna().sum())

user_id       0
track      1343
artist     7567
genre      1198
city          0
time          0
day           0
dtype: int64


In [58]:
# replacing NaN values with 'unknown'

columns_to_replace = ['track', 'artist', 'genre']  # columns that contain missing values

for column in columns_to_replace:  # iterating over the list
    df[column] = df[column].fillna('unknown')

In [59]:
# checking whether any NaN values remain

print(df.isna().sum())

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64
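The loop above fills one column at a time. As a hedged alternative sketch, `fillna` can be applied to the whole column subset in one assignment. The `demo` frame below is made-up illustrative data, not rows from the dataset.

```python
import pandas as pd

# small illustrative frame with gaps in the three text columns
demo = pd.DataFrame({
    'track': ['Chains', None, 'True'],
    'artist': [None, 'Mario Lanza', 'Roman Messer'],
    'genre': ['rusrap', 'pop', None],
    'city': ['Shelbyville', 'Shelbyville', 'Springfield'],
})

columns_to_replace = ['track', 'artist', 'genre']

# fill all selected columns at once; other columns are left untouched
demo[columns_to_replace] = demo[columns_to_replace].fillna('unknown')

print(demo.isna().sum())
```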


[Back to content](#back)

### Duplicates <a id='duplicates'></a>

In [60]:
# counting duplicate values

print(df.duplicated().sum())

3826


In [61]:
# dropping duplicates 

df=df.drop_duplicates().reset_index(drop=True)

In [62]:
# checking whether any duplicates remain

print(df.duplicated().sum())

0


In [None]:
# inspecting unique genres in the 'genre' column

print(df['genre'].sort_values().unique())

['acid' 'acoustic' 'action' 'adult' 'africa' 'afrikaans' 'alternative'
 'ambient' 'americana' 'animated' 'anime' 'arabesk' 'arabic' 'arena'
 'argentinetango' 'art' 'audiobook' 'avantgarde' 'axé' 'baile' 'balkan'
 'beats' 'bigroom' 'black' 'bluegrass' 'blues' 'bollywood' 'bossa'
 'brazilian' 'breakbeat' 'breaks' 'broadway' 'cantautori' 'cantopop'
 'canzone' 'caribbean' 'caucasian' 'celtic' 'chamber' 'children' 'chill'
 'chinese' 'choral' 'christian' 'christmas' 'classical' 'classicmetal'
 'club' 'colombian' 'comedy' 'conjazz' 'contemporary' 'country' 'cuban'
 'dance' 'dancehall' 'dancepop' 'dark' 'death' 'deep' 'deutschrock'
 'deutschspr' 'dirty' 'disco' 'dnb' 'documentary' 'downbeat' 'downtempo'
 'drum' 'dub' 'dubstep' 'eastern' 'easy' 'electronic' 'electropop' 'emo'
 'entehno' 'epicmetal' 'estrada' 'ethnic' 'eurofolk' 'european'
 'experimental' 'extrememetal' 'fado' 'film' 'fitness' 'flamenco' 'folk'
 'folklore' 'folkmetal' 'folkrock' 'folktronica' 'forró' 'frankreich'
 'französisch' 

In [64]:
# creating a function to replace implicit duplicates (variant spellings of the same genre)

def replace_wrong_genres(wrong_genres, correct_genre):
    for wrong_genre in wrong_genres:
        df['genre'] = df['genre'].replace(wrong_genre, correct_genre)

duplicates = ['hip', 'hop', 'hip-hop']
name = 'hiphop'

In [65]:
# replacing the variant spellings using our function

replace_wrong_genres(duplicates, name)

In [66]:
# checking if corrections were applied 

print(df['genre'].sort_values().unique())

['acid' 'acoustic' 'action' 'adult' 'africa' 'afrikaans' 'alternative'
 'ambient' 'americana' 'animated' 'anime' 'arabesk' 'arabic' 'arena'
 'argentinetango' 'art' 'audiobook' 'avantgarde' 'axé' 'baile' 'balkan'
 'beats' 'bigroom' 'black' 'bluegrass' 'blues' 'bollywood' 'bossa'
 'brazilian' 'breakbeat' 'breaks' 'broadway' 'cantautori' 'cantopop'
 'canzone' 'caribbean' 'caucasian' 'celtic' 'chamber' 'children' 'chill'
 'chinese' 'choral' 'christian' 'christmas' 'classical' 'classicmetal'
 'club' 'colombian' 'comedy' 'conjazz' 'contemporary' 'country' 'cuban'
 'dance' 'dancehall' 'dancepop' 'dark' 'death' 'deep' 'deutschrock'
 'deutschspr' 'dirty' 'disco' 'dnb' 'documentary' 'downbeat' 'downtempo'
 'drum' 'dub' 'dubstep' 'eastern' 'easy' 'electronic' 'electropop' 'emo'
 'entehno' 'epicmetal' 'estrada' 'ethnic' 'eurofolk' 'european'
 'experimental' 'extrememetal' 'fado' 'film' 'fitness' 'flamenco' 'folk'
 'folklore' 'folkmetal' 'folkrock' 'folktronica' 'forró' 'frankreich'
 'französisch' 
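`replace_wrong_genres` loops over the variants and modifies the global `df`. As a minimal sketch of an alternative, `Series.replace` also accepts a list of values to replace, so the loop is not strictly needed. The `demo` frame is illustrative only.

```python
import pandas as pd

# illustrative genre column containing the three variant spellings
demo = pd.DataFrame({'genre': ['hip', 'hop', 'hip-hop', 'hiphop', 'pop']})

wrong_genres = ['hip', 'hop', 'hip-hop']
correct_genre = 'hiphop'

# replace() matches whole values, so 'hiphop' and 'pop' are not affected
demo['genre'] = demo['genre'].replace(wrong_genres, correct_genre)

print(demo['genre'].sort_values().unique())
```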

[Back to content](#back)

## Stage 3. Testing hypotheses <a id='hypotheses'></a>

### 1st hypothesis: Comparing user behavior in the two cities <a id='activity'></a>

In [75]:
# counting played tracks per city
df.groupby('city')['track'].count()

city
Shelbyville    18512
Springfield    42741
Name: track, dtype: int64

In [79]:
# calculating tracks played per day 

df.groupby('day')['track'].count()

day
Friday       21840
Monday       21354
Wednesday    18059
Name: track, dtype: int64

In [80]:
# creating a function to count the tracks played on a given day in a given city
def number_tracks(day, city):
    track_list = df[(df['day'] == day) & (df['city'] == city)]
    track_list_count = track_list['user_id'].count()
    return track_list_count

In [81]:
# getting tracks played on Monday in Springfield
number_tracks('Monday', 'Springfield')

15740

In [82]:
# getting tracks played on Monday in Shelbyville
number_tracks('Monday', 'Shelbyville')

5614

In [83]:
# getting tracks played on Wednesday in Springfield
number_tracks('Wednesday', 'Springfield')

11056

In [84]:
# getting tracks played on Wednesday in Shelbyville
number_tracks('Wednesday', 'Shelbyville')

7003

In [85]:
# getting tracks played on Friday in Springfield
number_tracks('Friday', 'Springfield')

15945

In [86]:
# getting tracks played on Friday in Shelbyville
number_tracks('Friday', 'Shelbyville')

5895

In [87]:
# creating a new df from the counts obtained above
# (Shelbyville on Monday is 5614, as returned by number_tracks('Monday', 'Shelbyville'))

info = [['Springfield', 15740, 11056, 15945], ['Shelbyville', 5614, 7003, 5895]]

columnas = ['city', 'monday', 'wednesday', 'friday']

niw = pd.DataFrame(data=info, columns=columnas)

In [88]:
# looking at our new dataset

print(niw)

          city  monday  wednesday  friday
0  Springfield   15740      11056   15945
1  Shelbyville    5614       7003    5895
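Copying the six counts by hand into a new table is easy to get wrong. As a hedged sketch, `pd.crosstab` can build the same city-by-day table directly from the data. The `demo` frame and `day_order` below are illustrative assumptions, not the real dataset.

```python
import pandas as pd

# illustrative play log: one row per played track
demo = pd.DataFrame({
    'city': ['Springfield', 'Springfield', 'Shelbyville',
             'Springfield', 'Shelbyville', 'Shelbyville'],
    'day': ['Monday', 'Friday', 'Monday',
            'Monday', 'Wednesday', 'Wednesday'],
})

# crosstab sorts the days alphabetically, so restore weekday order explicitly
day_order = ['Monday', 'Wednesday', 'Friday']

# count rows for every (city, day) pair; missing pairs become 0
counts = pd.crosstab(demo['city'], demo['day']).reindex(columns=day_order, fill_value=0)

print(counts)
```

On the real data, the equivalent call would be `pd.crosstab(df['city'], df['day'])`, which removes the need for the six separate `number_tracks` calls.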


[Back to content](#back)

### 2nd hypothesis: Music at the beginning and end of the week <a id='week'></a>

In [89]:
# getting a new table using values where city == springfield
spr_general = df[df['city'] == 'Springfield']
print(spr_general)


        user_id                          track                   artist  \
1      55204538    Delayed Because of Accident         Andreas Rönnberg   
4      E2DC1FAE                    Soul People               Space Echo   
6      4CB90AA5                           True             Roman Messer   
7      F03E1C1F               Feeling This Way          Polina Griffith   
8      8FA1D3BE                       L’estate              Julia Dalia   
...         ...                            ...                      ...   
61247  83A474E7  I Worship Only What You Bleed  The Black Dahlia Murder   
61248  729CBB09                        My Name                   McLean   
61250  C5E3A0D5                      Jalopiina                  unknown   
61251  321D0506                  Freight Train            Chas McDevitt   
61252  3A64EF84      Tell Me Sweet Little Lies             Monica Lopez   

              genre         city      time        day  
1              rock  Springfield  14:07:09 

In [90]:
# getting a new table using values where city == shelbyville

shel_general = df[df['city'] == 'Shelbyville']
print(shel_general)

        user_id                              track              artist  \
0      FFB692EC                  Kamigata To Boots    The Mass Missile   
2        20EC38                  Funiculì funiculà         Mario Lanza   
3      A3DD03C9              Dragons in the Sunset          Fire + Ice   
5      842029A1                             Chains            Obladaet   
9      E772D5C0                          Pessimist             unknown   
...         ...                                ...                 ...   
61239  D94F810B        Theme from the Walking Dead  Proyecto Halloween   
61240  BC8EC5CF       Red Lips: Gta (Rover Rework)               Rover   
61241  29E04611                       Bre Petrunko       Perunika Trio   
61242  1B91C621             (Hello) Cloud Mountain     sleepmakeswaves   
61249  D08D4A55  Maybe One Day (feat. Black Spade)         Blu & Exile   

            genre         city      time        day  
0            rock  Shelbyville  20:28:33  Wednesday  
2  

In [91]:
# creating a function to obtain the 15 most popular genres on a given day and time window

def genre_weekday(df, day, time1, time2):
    genre_df = df[(df['day'] == day) & (df['time'] > time1) & (df['time'] <= time2)]
    genre_df_sorted = genre_df.groupby('genre')['genre'].count().sort_values(ascending=False)
    return genre_df_sorted[:15]

In [92]:
# calling the function for Springfield on Monday morning
genre_weekday(spr_general, 'Monday', '07:00:00', '11:00:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
classical      157
metal          120
jazz           100
folk            97
soundtrack      95
Name: genre, dtype: int64

In [93]:
# calling the function for Shelbyville on Monday morning
genre_weekday(shel_general, 'Monday', '07:00:00', '11:00:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
world           36
rap             32
soundtrack      31
rnb             27
metal           27
Name: genre, dtype: int64

In [94]:
# calling the function for Springfield on Monday evening
genre_weekday(spr_general, 'Monday', '17:00:00', '23:00:00')

genre
pop            717
dance          524
rock           518
electronic     485
hiphop         238
alternative    182
world          172
classical      172
ruspop         149
rusrap         133
jazz           124
unknown        109
soundtrack      92
folk            89
metal           88
Name: genre, dtype: int64

In [95]:
# calling the function for Shelbyville on Monday evening
genre_weekday(shel_general, 'Monday', '17:00:00', '23:00:00')

genre
pop            263
rock           208
electronic     192
dance          191
hiphop         104
alternative     72
classical       71
jazz            57
rusrap          54
ruspop          53
world           52
unknown         51
metal           43
folk            39
soundtrack      37
Name: genre, dtype: int64
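`genre_weekday` compares the `time` column as plain strings. This works because the values are zero-padded `HH:MM:SS`, so alphabetical order matches clock order. The sketch below shows the same filter written with `value_counts`. `top_genres` and the `demo` rows are hypothetical and used only for illustration.

```python
import pandas as pd


def top_genres(data, day, start, end, n=15):
    """Return the n most played genres for one day and a (start, end] time window."""
    # zero-padded 'HH:MM:SS' strings sort the same way as the times they represent
    mask = (data['day'] == day) & (data['time'] > start) & (data['time'] <= end)
    return data.loc[mask, 'genre'].value_counts().head(n)


# illustrative rows; the 07:00:00 row falls outside the window (start is exclusive)
demo = pd.DataFrame({
    'day': ['Monday', 'Monday', 'Monday', 'Friday', 'Monday'],
    'time': ['08:34:34', '08:37:09', '20:28:33', '09:17:40', '07:00:00'],
    'genre': ['dance', 'folk', 'rock', 'ruspop', 'dance'],
})

morning = top_genres(demo, 'Monday', '07:00:00', '11:00:00')
print(morning)
```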

[Back to content](#back)

### 3rd hypothesis: Genre preferences in Springfield and Shelbyville <a id='genre'></a>

In [96]:
#grouping by genre on spr_general
spr_genres=spr_general.groupby('genre')['genre'].count().sort_values(ascending = False)

In [97]:
# checking top 10 genres 

print(spr_genres.head(10))

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64


In [98]:
#grouping by genre on shel_general
shel_genres=shel_general.groupby('genre')['genre'].count().sort_values(ascending = False)

In [99]:
# checking top 10 genres 
print(shel_genres.head(10))

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64
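Springfield has more than twice as many plays as Shelbyville, so raw counts are hard to compare across the two cities. As a hedged sketch, genre shares within each city can be placed side by side with `pd.crosstab(..., normalize='columns')`. The `demo` frame is made-up illustrative data.

```python
import pandas as pd

# illustrative play log with four Springfield plays and two Shelbyville plays
demo = pd.DataFrame({
    'city': ['Springfield', 'Springfield', 'Springfield', 'Springfield',
             'Shelbyville', 'Shelbyville'],
    'genre': ['pop', 'pop', 'rock', 'dance', 'pop', 'rock'],
})

# each column (city) is normalized to sum to 1, giving genre shares per city
shares = pd.crosstab(demo['genre'], demo['city'], normalize='columns')

print(shares.sort_values('Springfield', ascending=False))
```

On the real data, the equivalent call would be `pd.crosstab(df['genre'], df['city'], normalize='columns')`.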


[Back to content](#back)

# Conclusions <a id='end'></a>

Based on the data analysis, we have the following conclusions:

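1. User activity in Springfield and Shelbyville depends on the day of the week, and the two cities show different patterns. Springfield peaks on Monday (15,740 tracks) and Friday (15,945) and dips on Wednesday (11,056), while Shelbyville peaks on Wednesday (7,003) and is quieter on Monday (5,614) and Friday (5,895). The first hypothesis is accepted.
2. Music preferences do not vary significantly in Springfield and Shelbyville across the periods compared. On Mondays we observe only minor differences between morning and evening: the same five genres (pop, dance, rock, electronic, hiphop) lead every list with small changes in order, and pop is the most popular genre in both cities. Therefore, we cannot accept this hypothesis. The results might have differed without the missing genre values, which were replaced with 'unknown' and appear in several of the top-15 lists.
3. The music preferences of users in Springfield and Shelbyville are quite similar: the five most popular genres (pop, dance, rock, electronic, hiphop) are the same, in the same order, in both cities. The third hypothesis is rejected, as the available data show no notable difference in preferences.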

These conclusions are based on the analysis conducted, but it's important to consider the limitations of the data and the potential impact of missing values on the results.

[Back to content](#back)