# Contents <a id='back'></a>

* [Introduction](#intro)
* [1. Data Overview](#data_review)
    * [Conclusions](#data_review_conclusions)
* [2. Data Pre-Processing](#data_preprocessing)
    * [2.1 Header Style](#header_style)
    * [2.2 Missing Values](#missing_values)
    * [2.3 Duplicates](#duplicates)
    * [2.4 Conclusions](#data_preprocessing_conclusions)
* [3. Hypotheses](#hypotheses)
    * [3.1 Hypothesis 1: User activity in the two cities](#activity)
    * [3.2 Hypothesis 2: Music preferences on Monday and Friday](#week)
    * [3.3 Hypothesis 3: Genre preferences in the two cities](#genre)
* [Findings](#end)

## Introduction <a id='intro'></a>
Every time we conduct research, we need to formulate hypotheses that can be tested. Sometimes we accept these hypotheses, and sometimes we reject them. To make informed decisions, a business must be able to tell whether its assumptions are correct.

In this project, I will compare the music preferences of the cities of Springfield and Shelbyville. I will study the data from Y.Music to test the hypotheses below and compare user behavior in both cities.

### Objective: 
To test three hypotheses:
1. User activity varies depending on the day and city.
2. On Monday mornings, residents of Springfield and Shelbyville listen to different genres. This also applies to Friday nights.
3. Listeners in Springfield and Shelbyville have different preferences. In Springfield, they prefer pop music, while in Shelbyville, rap music has more fans.

### Steps
This project will consist of three stages:

1. Data Overview
2. Data Preprocessing
3. Hypothesis Testing

 
[Back to Contents](#back)

## 1. Data Overview <a id='data_review'></a>

In [1]:
import pandas as pd

In [2]:
# load the data
df = pd.read_csv('data/music_project_en.csv')
df.describe()

Unnamed: 0,userID,Track,artist,genre,City,time,Day
count,65079,63736,57512,63881,65079,65079,65079
unique,41748,39666,37806,268,2,20392,3
top,A8AE9169,Brand,Kartvelli,pop,Springfield,08:14:07,Friday
freq,76,136,136,8850,45360,14,23149


In [6]:
df.head(10)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
5,842029A1,Chains,Obladaet,rusrap,Shelbyville,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Springfield,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Springfield,20:47:49,Wednesday
8,8FA1D3BE,L’estate,Julia Dalia,ruspop,Springfield,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Shelbyville,21:20:49,Wednesday


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63736 non-null  object
 2   artist    57512 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


### Conclusions <a id='data_review_conclusions'></a> 

Each row in the table contains data on a played song. Some columns describe the song itself: title, artist, and genre. The rest convey information about the users: their city of origin and the time they played the song.

It is evident that the data is sufficient to test the hypotheses. However, there are missing values.

Before testing the hypotheses, we need to preprocess the data.

[Back to Contents](#back)

## 2. Data Preprocessing <a id='data_preprocessing'></a>
Fix the column title formatting and address missing values. Then, check for any duplicates in the data.

### Header Style <a id='header_style'></a>


In [8]:
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

In [9]:
# rename the columns: strip whitespace and use lowercase snake_case
df = df.rename(
    columns={
        '  userID' : 'user_id',
        'Track': 'track',
        '  City  ' : 'city',
        'Day' : 'day'
    }
)

Check the result. Display the column names once more:

In [10]:
# check the result: list of column names
df.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
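
The manual rename above works well for seven columns. For wider tables, the same cleanup could be automated. The sketch below is only an illustration: the helper `to_snake_case` is a hypothetical name, not part of the original notebook, and it assumes the only problems are surrounding whitespace and camelCase.

```python
import re

def to_snake_case(name):
    """Strip surrounding whitespace and convert camelCase to snake_case."""
    stripped = name.strip()
    # Insert an underscore wherever a lowercase letter is followed by an uppercase one.
    with_underscores = re.sub(r'(?<=[a-z])(?=[A-Z])', '_', stripped)
    return with_underscores.lower()

# Column names exactly as reported by df.columns before renaming.
raw_columns = ['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day']
clean_columns = [to_snake_case(name) for name in raw_columns]
print(clean_columns)
```

The resulting list could then be assigned with `df.columns = clean_columns`.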

[Back to Contents](#back)

### Missing Values <a id='missing_values'></a>

In [11]:
print(df.isna().sum())

user_id       0
track      1343
artist     7567
genre      1198
city          0
time          0
day           0
dtype: int64


In [12]:
# replace missing values with 'unknown'
columns_to_replace = ['track', 'artist', 'genre']

for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')

In [13]:
# count the missing values
df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64
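
Absolute counts are easier to judge as shares of the table. From the figures above, 7567 missing artists out of 65079 rows is roughly 11.6%. A minimal sketch of that calculation on a toy frame (the data here is invented purely for illustration):

```python
import pandas as pd

# Toy data standing in for the real table.
toy = pd.DataFrame({
    'track': ['a', None, 'c', 'd'],
    'genre': ['pop', 'rock', None, None],
})

# isna() yields booleans; their mean is the fraction of missing values per column.
missing_share = toy.isna().mean() * 100
print(missing_share)
```

On the real table, the same expression would have to run before the `fillna('unknown')` step, since afterwards no value is missing.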

[Back to Contents](#back)

### Duplicates <a id='duplicates'></a>

In [14]:
df.duplicated().sum()

3826

In [15]:
df = df.drop_duplicates().reset_index(drop=True)

In [16]:
df.duplicated().sum()

0

In [17]:
# check unique values of genre
kolom_genre = df['genre']
genre_urut = kolom_genre.sort_values()
genre_urut.unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'ambient', 'americana', 'animated', 'anime',
       'arabesk', 'arabic', 'arena', 'argentinetango', 'art', 'audiobook',
       'avantgarde', 'axé', 'baile', 'balkan', 'beats', 'bigroom',
       'black', 'bluegrass', 'blues', 'bollywood', 'bossa', 'brazilian',
       'breakbeat', 'breaks', 'broadway', 'cantautori', 'cantopop',
       'canzone', 'caribbean', 'caucasian', 'celtic', 'chamber',
       'children', 'chill', 'chinese', 'choral', 'christian', 'christmas',
       'classical', 'classicmetal', 'club', 'colombian', 'comedy',
       'conjazz', 'contemporary', 'country', 'cuban', 'dance',
       'dancehall', 'dancepop', 'dark', 'death', 'deep', 'deutschrock',
       'deutschspr', 'dirty', 'disco', 'dnb', 'documentary', 'downbeat',
       'downtempo', 'drum', 'dub', 'dubstep', 'eastern', 'easy',
       'electronic', 'electropop', 'emo', 'entehno', 'epicmetal',
       'estrada', 'ethnic', 'eurofo

Look through the list to find implicit duplicates of the `hiphop` genre. These can be misspelled names or alternative names for the same genre.

The following are implicit duplicates:
* `hip`
* `hop`
* `hip-hop`

To remove them, use the function replace_wrong_genres() with two parameters:
* `wrong_genres=` — a list of duplicates
* `correct_genre=` — a string with the correct value

The function should correct the names in the 'genre' column of the 'df' table, replacing each value from the wrong_genres list with the value in correct_genre.

In [18]:
# function to replace implicit duplicates
def replace_wrong_genres(wrong_genres, correct_genre):
    for wrong_genre in wrong_genres:
        df['genre'] = df['genre'].replace(wrong_genre, correct_genre)

duplicates = ['hip', 'hop', 'hip-hop']
genre_name = 'hiphop'

In [19]:
# apply the function
replace_wrong_genres(duplicates, genre_name)

In [20]:
# check the genre column to see whether the implicit duplicates persist
kolom_genre = df['genre']
genre_urut = kolom_genre.sort_values()
genre_urut.unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'ambient', 'americana', 'animated', 'anime',
       'arabesk', 'arabic', 'arena', 'argentinetango', 'art', 'audiobook',
       'avantgarde', 'axé', 'baile', 'balkan', 'beats', 'bigroom',
       'black', 'bluegrass', 'blues', 'bollywood', 'bossa', 'brazilian',
       'breakbeat', 'breaks', 'broadway', 'cantautori', 'cantopop',
       'canzone', 'caribbean', 'caucasian', 'celtic', 'chamber',
       'children', 'chill', 'chinese', 'choral', 'christian', 'christmas',
       'classical', 'classicmetal', 'club', 'colombian', 'comedy',
       'conjazz', 'contemporary', 'country', 'cuban', 'dance',
       'dancehall', 'dancepop', 'dark', 'death', 'deep', 'deutschrock',
       'deutschspr', 'dirty', 'disco', 'dnb', 'documentary', 'downbeat',
       'downtempo', 'drum', 'dub', 'dubstep', 'eastern', 'easy',
       'electronic', 'electropop', 'emo', 'entehno', 'epicmetal',
       'estrada', 'ethnic', 'eurofo
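
The function above modifies the global `df`. A possible variant, shown here only as a sketch, takes the table as an argument and returns a corrected copy, which makes it easier to try out on a small example. The name `replace_wrong_genres_in` and the toy data are assumptions made for this illustration.

```python
import pandas as pd

def replace_wrong_genres_in(data, wrong_genres, correct_genre):
    """Return a copy of `data` in which every wrong genre name is replaced."""
    result = data.copy()
    # Series.replace accepts a list of values; matching is exact, not by substring.
    result['genre'] = result['genre'].replace(wrong_genres, correct_genre)
    return result

toy = pd.DataFrame({'genre': ['hip', 'hop', 'hip-hop', 'hiphop', 'pop']})
fixed = replace_wrong_genres_in(toy, ['hip', 'hop', 'hip-hop'], 'hiphop')
print(fixed['genre'].unique())
```

Because matching is exact, `'pop'` is left alone even though it shares letters with `'hop'`.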

[Back to Contents](#back)

### Conclusions <a id='data_preprocessing_conclusions'></a>
We have detected three issues with the data:

- Incorrect title formatting
- Missing values
- Obvious (exact) and implicit duplicates

The titles have been cleaned for easier table processing.

All missing values have been replaced with 'unknown'. However, we still need to assess if the missing values in the 'genre' column will impact our calculations.

The absence of duplicates will make the results more accurate and easier to interpret.

Now, we can proceed to hypothesis testing.

[Back to Contents](#back)

## 3. Hypothesis Testing <a id='hypotheses'></a>

### Hypothesis 1: Comparing User Behavior in Two Cities <a id='activity'></a>

According to the first hypothesis, users from Springfield and Shelbyville have differences in their music listening behavior. This test utilizes data from Mondays, Wednesdays, and Fridays.

- Separate the users into groups based on the city.
- Compare the number of songs played by each group on Mondays, Wednesdays, and Fridays.

In [21]:
# Calculating the number of songs played in each city.
track_per_city = df.groupby('city')['track'].count()
track_per_city

city
Shelbyville    18512
Springfield    42741
Name: track, dtype: int64

Springfield has more songs played than Shelbyville. However, it doesn't necessarily mean that the residents of Springfield listen to music more often. Springfield is a larger city and has more users.

Now, let's group the data by day and find the number of songs played on Monday, Wednesday, and Friday.

In [22]:
# Calculating the number of tracks played per day
track_per_day = df.groupby('day')['track'].count()
track_per_day

day
Friday       21840
Monday       21354
Wednesday    18059
Name: track, dtype: int64

In [23]:
def number_tracks(dataset, day, city):
    track_list = dataset.loc[(dataset['day'] == day) & (dataset['city'] == city)]
    track_list_count = track_list.groupby('city')['user_id'].count()
    
    return track_list_count

In [24]:
# the total number of tracks on Monday in Springfield
spr_senin = number_tracks(dataset=df, day='Monday', city='Springfield')
spr_senin2 = spr_senin.iloc[0]  # positional access; plain [0] on a label-indexed Series is deprecated
spr_senin2

15740

In [25]:
# the total number of tracks on Monday in Shelbyville
shel_senin = number_tracks(dataset=df, day='Monday', city='Shelbyville')
shel_senin2 = shel_senin.iloc[0]
shel_senin2

5614

In [26]:
# the total number of tracks on Wednesday in Springfield
spr_rabu = number_tracks(dataset=df, day='Wednesday', city='Springfield')
spr_rabu2 = spr_rabu.iloc[0]
spr_rabu2

11056

In [27]:
# the total number of tracks on Wednesday in Shelbyville
shel_rabu = number_tracks(dataset=df, day='Wednesday', city='Shelbyville')
shel_rabu2 = shel_rabu.iloc[0]
shel_rabu2

7003

In [28]:
# the total number of tracks on Friday in Springfield
spr_jumat = number_tracks(dataset=df, day='Friday', city='Springfield')
spr_jumat2 = spr_jumat.iloc[0]
spr_jumat2

15945

In [29]:
# the total number of tracks on Friday in Shelbyville
shel_jumat = number_tracks(dataset=df, day='Friday', city='Shelbyville')
shel_jumat2 = shel_jumat.iloc[0]
shel_jumat2

5895

In [30]:
# table with result
kolom = ['city', 'monday', 'wednesday', 'friday']
kota = [
    ['Springfield', spr_senin2, spr_rabu2, spr_jumat2],
    ['Shelbyville', shel_senin2, shel_rabu2, shel_jumat2]
]

result = pd.DataFrame(data=kota, columns=kolom)
result

Unnamed: 0,city,monday,wednesday,friday
0,Springfield,15740,11056,15945
1,Shelbyville,5614,7003,5895
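
The six calls to `number_tracks()` build this table one cell at a time. As an alternative sketch, `pd.crosstab` can produce a city-by-day count table in one step. The toy frame below is invented for illustration; on the real data the inputs would be `df['city']` and `df['day']`.

```python
import pandas as pd

# Toy play log: one row per played track.
toy = pd.DataFrame({
    'city': ['Springfield', 'Springfield', 'Shelbyville', 'Springfield', 'Shelbyville'],
    'day': ['Monday', 'Friday', 'Monday', 'Monday', 'Wednesday'],
})

# Count rows for every (city, day) combination.
plays = pd.crosstab(toy['city'], toy['day'])
# Put the days in calendar order; combinations that never occur stay at 0.
plays = plays.reindex(columns=['Monday', 'Wednesday', 'Friday'], fill_value=0)
print(plays)
```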


**Conclusion**

The data reveals differences in user behavior:

- In Springfield, the number of songs played peaks on Mondays and Fridays and dips on Wednesdays.
- In Shelbyville, by contrast, users listen to more music on Wednesdays, and activity is lower on Mondays and Fridays.
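
Since Springfield has more users overall, the weekly pattern is easier to compare when each city's counts are expressed as a share of that city's total. The sketch below rebuilds the result table from the numbers reported above and normalizes each row; it illustrates the idea and is not part of the original analysis.

```python
import pandas as pd

# Play counts copied from the result table above.
result = pd.DataFrame(
    {'monday': [15740, 5614], 'wednesday': [11056, 7003], 'friday': [15945, 5895]},
    index=['Springfield', 'Shelbyville'],
)

# Divide each row by its own total to get the percentage of plays per day.
shares = result.div(result.sum(axis=1), axis=0).mul(100).round(1)
print(shares)
```

Wednesday accounts for about 26% of Springfield's plays but about 38% of Shelbyville's, which matches the opposite patterns described in the conclusion.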

[Back to Contents](#back)

### Hypothesis 2: Music Preferences at the Beginning and End of the Week <a id='week'></a>

According to the second hypothesis, on Monday mornings and Friday evenings, residents of Springfield listen to different genres compared to the ones enjoyed by Shelbyville residents.

In [31]:
spr_general = df.loc[(df['city'] == 'Springfield')]
spr_general

Unnamed: 0,user_id,track,artist,genre,city,time,day
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
6,4CB90AA5,True,Roman Messer,dance,Springfield,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Springfield,20:47:49,Wednesday
8,8FA1D3BE,L’estate,Julia Dalia,ruspop,Springfield,09:17:40,Friday
...,...,...,...,...,...,...,...
61247,83A474E7,I Worship Only What You Bleed,The Black Dahlia Murder,extrememetal,Springfield,21:07:12,Monday
61248,729CBB09,My Name,McLean,rnb,Springfield,13:32:28,Wednesday
61250,C5E3A0D5,Jalopiina,unknown,industrial,Springfield,20:09:26,Friday
61251,321D0506,Freight Train,Chas McDevitt,rock,Springfield,21:43:59,Friday


In [32]:
shel_general = df.loc[(df['city'] == 'Shelbyville')]
shel_general

Unnamed: 0,user_id,track,artist,genre,city,time,day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
5,842029A1,Chains,Obladaet,rusrap,Shelbyville,13:09:41,Friday
9,E772D5C0,Pessimist,unknown,dance,Shelbyville,21:20:49,Wednesday
...,...,...,...,...,...,...,...
61239,D94F810B,Theme from the Walking Dead,Proyecto Halloween,film,Shelbyville,21:14:40,Monday
61240,BC8EC5CF,Red Lips: Gta (Rover Rework),Rover,electronic,Shelbyville,21:06:50,Monday
61241,29E04611,Bre Petrunko,Perunika Trio,world,Shelbyville,13:56:00,Monday
61242,1B91C621,(Hello) Cloud Mountain,sleepmakeswaves,postrock,Shelbyville,09:22:13,Monday


In [33]:
def genre_weekday(dataset, day, time1, time2):
    genre_df = dataset.loc[(dataset['day'] == day) & (dataset['time'] > time1) & (dataset['time'] < time2)]
    genre_df_count = genre_df.groupby('genre')['time'].count()
    genre_df_sorted = genre_df_count.sort_values(ascending=False)
    
    return genre_df_sorted[:15]
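
`genre_weekday()` compares the `time` column as plain strings. This works because zero-padded `HH:MM:SS` strings sort in the same order as the times they represent. The short sketch below illustrates this with invented sample values; `in_window` is a hypothetical helper used only here.

```python
def in_window(time_str, start, end):
    """Strict string comparison, mirroring the > and < used in genre_weekday()."""
    return start < time_str < end

times = ['08:14:07', '11:00:00', '20:28:33', '07:00:00']
morning = [t for t in times if in_window(t, '07:00:00', '11:00:00')]
print(morning)

# Without zero-padding, the string order no longer matches clock order.
print('9:05:00' < '11:00:00')
```

Two details follow from this. The strict comparisons exclude plays at exactly `07:00:00` and `11:00:00`, and the approach relies on every value being zero-padded, which appears to hold for the sample rows shown earlier.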

In [34]:
# The most popular genre on Monday morning in Springfield
genre_spr_senin_pg = genre_weekday(dataset = spr_general, day = 'Monday', time1 = '07:00:00', time2 = '11:00:00')
genre_spr_senin_pg

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
classical      157
metal          120
jazz           100
folk            97
soundtrack      95
Name: time, dtype: int64

In [35]:
# The most popular genre on Monday morning in Shelbyville
genre_shel_senin_pg = genre_weekday(dataset = shel_general, day = 'Monday', time1 = '07:00:00', time2 = '11:00:00')
genre_shel_senin_pg

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
world           36
rap             32
soundtrack      31
rnb             27
metal           27
Name: time, dtype: int64

In [36]:
# The most popular genres on Monday evening in Springfield
# (this call passes day='Monday'; use day='Friday' to cover Friday evenings)
genre_spr_jumat_malam = genre_weekday(dataset = spr_general, day = 'Monday', time1 = '17:00:00', time2 = '23:00:00')
genre_spr_jumat_malam

genre
pop            717
dance          524
rock           518
electronic     485
hiphop         238
alternative    182
world          172
classical      172
ruspop         149
rusrap         133
jazz           124
unknown        109
soundtrack      92
folk            89
metal           88
Name: time, dtype: int64

In [37]:
# The most popular genres on Monday evening in Shelbyville
# (this call passes day='Monday'; use day='Friday' to cover Friday evenings)
genre_shel_jumat_malam = genre_weekday(dataset = shel_general, day = 'Monday', time1 = '17:00:00', time2 = '23:00:00')
genre_shel_jumat_malam

genre
pop            263
rock           208
electronic     192
dance          191
hiphop         104
alternative     72
classical       71
jazz            57
rusrap          54
ruspop          53
world           52
unknown         51
metal           43
folk            39
soundtrack      37
Name: time, dtype: int64

**Conclusion**

After comparing the top 15 genres on Monday mornings, we can draw the following conclusions:

1. Users from Springfield and Shelbyville listen to largely the same genres. The top five genres are the same, with only rock and electronic swapping positions.

2. In Springfield, the number of missing values is significant: `'unknown'` appears in 10th position. Missing values therefore account for a substantial share of the data, which raises concerns about the accuracy of our conclusions.

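For the evening window, the situation is similar. The rankings vary slightly, but the top 15 genres for both cities are the same. Note that the two evening calls above pass `day='Monday'`, so these tables describe Monday evenings; confirming the Friday-evening part of the hypothesis requires rerunning them with `day='Friday'`.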

Thus, the second hypothesis is partially confirmed:
* Users listen to the same music at the beginning and end of the week.
* There are no significant differences between Springfield and Shelbyville. In both cities, pop is the most popular genre.

However, the number of missing values raises doubts about these results. In Springfield, there are enough of them to affect our top 15. Had these values not been missing, the results might differ.
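
The comparison of the two Monday-morning rankings can also be made explicit with set operations. The sketch below uses the genre names copied from the two top-15 outputs above, in the order shown; the variable names are illustrative.

```python
# Top-15 genres on Monday morning, copied from the outputs above.
springfield_top15 = [
    'pop', 'dance', 'electronic', 'rock', 'hiphop', 'ruspop', 'world', 'rusrap',
    'alternative', 'unknown', 'classical', 'metal', 'jazz', 'folk', 'soundtrack',
]
shelbyville_top15 = [
    'pop', 'dance', 'rock', 'electronic', 'hiphop', 'ruspop', 'alternative', 'rusrap',
    'jazz', 'classical', 'world', 'rap', 'soundtrack', 'rnb', 'metal',
]

# Genres present in both rankings, and genres unique to each city.
shared = set(springfield_top15) & set(shelbyville_top15)
only_springfield = sorted(set(springfield_top15) - shared)
only_shelbyville = sorted(set(shelbyville_top15) - shared)
print(len(shared), only_springfield, only_shelbyville)
```

Thirteen of the fifteen genres appear in both lists. Springfield's list additionally contains `folk` and `unknown`, while Shelbyville's contains `rap` and `rnb`.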

[Back to Contents](#back)

### Hypothesis 3: Genre Preferences in Springfield and Shelbyville <a id='genre'></a>

Shelbyville prefers rap music, while residents of Springfield have a greater preference for pop music.

In [38]:
spr_groupby = spr_general.groupby('genre')['genre'].count()
spr_genres = spr_groupby.sort_values(ascending=False)
spr_genres

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
               ... 
metalcore         1
marschmusik       1
malaysian         1
lovers            1
ïîï               1
Name: genre, Length: 250, dtype: int64

In [39]:
spr_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

In [40]:
shel_groupby = shel_general.groupby('genre')['genre'].count()
shel_genres = shel_groupby.sort_values(ascending=False)
shel_genres

genre
pop           2431
dance         1932
rock          1879
electronic    1736
hiphop         960
              ... 
mandopop         1
leftfield        1
laiko            1
jungle           1
worldbeat        1
Name: genre, Length: 202, dtype: int64

In [41]:
shel_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

**Conclusion**

Hypothesis partially confirmed:
* Pop music is the most popular genre in Springfield, as expected.
* However, pop is also the most popular genre in Shelbyville, and rap music does not make it into the top 5 genres in either city.
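
Because Springfield has far more plays, raw counts exaggerate the gap between the cities. A rough sketch of a fairer comparison converts the top-5 counts above into shares of each city's total plays. It assumes the city totals from In [21] (42741 and 18512) are the right denominators, since every row has a genre after the `'unknown'` fill.

```python
# Total plays per city (from In [21]) and top-5 genre counts (from In [39] and In [41]).
city_totals = {'Springfield': 42741, 'Shelbyville': 18512}
top5_counts = {
    'Springfield': {'pop': 5892, 'dance': 4435, 'rock': 3965, 'electronic': 3786, 'hiphop': 2096},
    'Shelbyville': {'pop': 2431, 'dance': 1932, 'rock': 1879, 'electronic': 1736, 'hiphop': 960},
}

# Percentage of each city's plays that belongs to each genre.
shares = {
    city: {genre: round(count / city_totals[city] * 100, 1) for genre, count in counts.items()}
    for city, counts in top5_counts.items()
}
for city, genre_shares in shares.items():
    print(city, genre_shares)
```

Pop accounts for roughly 13 to 14% of plays in both cities, which supports the observation that preferences are very similar.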

[Back to Contents](#back)

# Findings <a id='end'></a>

After analyzing the data, we can conclude:

1. User activity in Springfield and Shelbyville depends on the day of the week, although the pattern differs between the two cities.
   The first hypothesis is fully accepted.

2. Music preferences do not differ significantly throughout the week in Springfield and Shelbyville. We can observe minor differences in rankings on Mondays, but in both cities people predominantly listen to pop music.
   Therefore, this hypothesis cannot be accepted. We should also note that the results could differ if not for the missing values.

3. The music preferences of users from Springfield and Shelbyville turn out to be very similar.
   The third hypothesis is rejected. If there are any differences in preferences, they cannot be seen in this data.

### Note
In a real project, research would involve statistical hypothesis testing, which is more precise and quantitative. Also, we cannot always draw conclusions about an entire city based on data from a single source.
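
As a rough illustration of what such a test could look like, the sketch below computes a chi-square statistic for independence between city and day, using the play counts from the result table in Hypothesis 1. This is not part of the original analysis. It treats every play as an independent observation, which is a strong assumption because one user contributes many plays, so the result should be read as indicative only.

```python
# Observed play counts per city for Monday, Wednesday, Friday (from the result table).
observed = {
    'Springfield': [15740, 11056, 15945],
    'Shelbyville': [5614, 7003, 5895],
}

row_totals = {city: sum(counts) for city, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(row_totals.values())

# Sum of (observed - expected)^2 / expected over all six cells.
chi2 = 0.0
for city, counts in observed.items():
    for obs, col_total in zip(counts, col_totals):
        expected = row_totals[city] * col_total / grand_total
        chi2 += (obs - expected) ** 2 / expected

# Critical value of the chi-square distribution with (2-1)*(3-1) = 2 degrees of freedom at alpha = 0.05.
critical_value = 5.991
print(chi2 > critical_value)
```

The statistic lies far above the critical value, which is consistent with the first hypothesis, subject to the caveat about independence.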

[Back to Contents](#back)