# Project 1 - Music in Big City

# Content <a id='back'></a>

* [Intro](#intro)
* [Stage 1. Data Overview](#data_review)
     * [Conclusion](#data_review_conclusions)
* [Stage 2. Data preprocessing](#data_preprocessing)
     * [2.1 Heading Style](#header_style)
     * [2.2 Missing Values](#missing_values)
     * [2.3 Duplicates](#duplicates)
     * [2.4 Conclusion](#data_preprocessing_conclusions)
* [Stage 3. Testing the Hypothesis](#hypotheses)
     * [3.1 Hypothesis 1: user activity in two cities](#activity)
     * [3.2 Hypothesis 2: music preferences on Mondays and Fridays](#week)
     * [3.3 Hypothesis 3: genre preferences in the cities of Springfield and Shelbyville](#genre)
* [Findings](#end)

## Introduction <a id='intro'></a>
Whenever we do research, we need to formulate hypotheses that we can test. Sometimes we accept a hypothesis; sometimes we reject it. To make the right decisions, a business must be able to tell whether its assumptions are correct.

In this project, you will compare the musical preferences of the cities of Springfield and Shelbyville. You will study actual Y.Music data to test the hypotheses below and compare user behavior in these two cities.

### Aim of Task
We test three hypotheses:
1. User activity differs depending on the day of the week and the city.
2. On Monday morning, listeners in Springfield and Shelbyville prefer different genres. The same holds on Friday evening.
3. Listeners in Springfield and Shelbyville have different genre preferences: Springfield prefers pop, while Shelbyville prefers rap.

### Steps
The dataset is stored in `/datasets/music_project_en.csv`. Nothing is known about its quality, so it has to be examined before testing the hypotheses.

First, we assess the data quality and look for problems. Then, during data preprocessing, we fix the most serious ones.
 
The project has three stages:
 1. Data overview
 2. Data preprocessing
 3. Testing the hypotheses

 
[Back to Table of Contents](#back)

## Data Overview <a id='data_review'></a>

In [2]:
# import library

import pandas as pd

In [68]:
# open dataset

df = pd.read_csv('/datasets/music_project_en.csv')
df.describe()

Unnamed: 0,userID,Track,artist,genre,City,time,Day
count,65079,63736,57512,63881,65079,65079,65079
unique,41748,39666,37806,268,2,20392,3
top,A8AE9169,Brand,Kartvelli,pop,Springfield,21:51:22,Friday
freq,76,136,136,8850,45360,14,23149


In [69]:
# check the first 10 rows

df.head(10)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
5,842029A1,Chains,Obladaet,rusrap,Shelbyville,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Springfield,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Springfield,20:47:49,Wednesday
8,8FA1D3BE,L’estate,Julia Dalia,ruspop,Springfield,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Shelbyville,21:20:49,Wednesday


In [70]:
# data information

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63736 non-null  object
 2   artist    57512 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


The table contains 7 columns, all with the same data type: `object`.

According to the documentation:
- `'userID'` - user identifier
- `'Track'` - track title
- `'artist'` - artist name
- `'genre'` - genre of the track
- `'City'` - user's city
- `'time'` - time at which the track was played
- `'Day'` - day of the week

The column names show three style problems:
1. Some names are capitalized and others are lowercase.
2. Some names contain leading or trailing spaces.
3. `userID` joins two words without a separator (in snake_case it would be `user_id`).

Several columns also contain missing values: their non-null counts are lower than the number of rows.


### Findings <a id='data_review_conclusions'></a> 

- The dataset is sufficient to test the hypotheses.
- Missing values need to be handled during data preprocessing.

[Back to Table of Contents](#back)

## Data Pre-Processing <a id='data_preprocessing'></a>

### Heading Style <a id='header_style'></a>

In [71]:
# check the name of columns

df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

In [72]:
# rename column to simplify the data

df = df.rename(columns={
    '  userID': 'user_id',
    'Track': 'track',
    '  City  ': 'city',
    'Day': 'day'
})

In [73]:
# check for update

df.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
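
Renaming by hand works well for seven columns. For wider tables, the same cleanup could be done programmatically. The sketch below is one possible approach and not part of the original analysis: the helper `to_snake_case()` and the empty `demo` table are hypothetical and only reuse the column names printed above.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Strip surrounding spaces, split camelCase words, and lowercase the name."""
    name = name.strip()
    # insert an underscore wherever a lowercase letter is followed by an uppercase one
    name = re.sub(r'(?<=[a-z])(?=[A-Z])', '_', name)
    return name.lower()


# hypothetical empty table with the raw column names from this notebook
demo = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'])
demo.columns = [to_snake_case(column) for column in demo.columns]

print(list(demo.columns))
```

The regular expression only handles the simple `lowerUpper` boundary seen in `userID`; names with other conventions would need additional rules.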

[Back to Table of Contents](#back)

### Missing Values <a id='missing_values'></a>

In [74]:
# check missing value

df.isna().sum()

user_id       0
track      1343
artist     7567
genre      1198
city          0
time          0
day           0
dtype: int64

- Missing values in `track` and `artist` do not affect the analysis, so they can simply be replaced with `unknown`.
- Missing values in `genre` matter, because genre is what we use to compare music preferences. The true genres cannot be recovered from this table, so these values are also filled with `unknown`.
- It is worth checking how large the share of missing values is and how much it can influence the results.
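
To get a feel for the size of the problem, the missing-value counts printed above can be turned into percentages. This is a minimal sketch: the variable names `missing_counts`, `total_rows`, and `missing_share` are introduced here for illustration, and the numbers are copied from the `df.info()` and `df.isna().sum()` outputs in this notebook.

```python
import pandas as pd

# counts copied from the df.isna().sum() output above
missing_counts = pd.Series({'track': 1343, 'artist': 7567, 'genre': 1198})
total_rows = 65079  # number of entries reported by df.info()

# share of missing values per column, in percent
missing_share = (missing_counts / total_rows * 100).round(2)
print(missing_share)
```

On the full table, before the gaps are filled, `df.isna().mean()` should give the same shares as fractions.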

In [75]:
# replace missing values in `track`, `artist` and `genre` with `unknown`

columns_to_replace = ['track', 'artist', 'genre']

for column in columns_to_replace:
    df[column]= df[column].fillna('unknown')

In [76]:
# check for the update

df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64

[Back to Table of Contents](#back)

### Duplicates <a id='duplicates'></a>

In [77]:
# check for duplicate

df.duplicated().sum()

3826

In [78]:
# drop data duplicate

df = df.drop_duplicates()
df.head()

Unnamed: 0,user_id,track,artist,genre,city,time,day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday


In [79]:
# check for update

df.duplicated().sum()

0
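
The behaviour of `duplicated()` and `drop_duplicates()` is easier to see on a tiny table. The example below uses made-up rows (the `demo` table is an assumption, not data from the project) and additionally resets the index, which the cell above does not do.

```python
import pandas as pd

# made-up rows: the second and third rows are exact copies of the first
demo = pd.DataFrame({
    'user_id': ['A1', 'A1', 'A1', 'B2'],
    'track': ['True', 'True', 'True', 'Chains'],
    'time': ['13:00:07', '13:00:07', '13:00:07', '13:09:41'],
})

# duplicated() marks every repeat after the first occurrence
n_duplicates = demo.duplicated().sum()

# drop the repeats and renumber the remaining rows from 0
deduplicated = demo.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduplicated)
```

In this notebook, removing 3826 duplicates from 65079 rows leaves 61253 rows, which matches the sum of the per-city counts shown in the hypothesis-testing section.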

Next, remove implicit duplicates in the `genre` column, that is, the same genre written in different ways. Such inconsistencies can distort the genre counts.

In [80]:
# list the unique genre names in alphabetical order

df_genre = df['genre'].sort_values()
df_genre.unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'ambient', 'americana', 'animated', 'anime',
       'arabesk', 'arabic', 'arena', 'argentinetango', 'art', 'audiobook',
       'avantgarde', 'axé', 'baile', 'balkan', 'beats', 'bigroom',
       'black', 'bluegrass', 'blues', 'bollywood', 'bossa', 'brazilian',
       'breakbeat', 'breaks', 'broadway', 'cantautori', 'cantopop',
       'canzone', 'caribbean', 'caucasian', 'celtic', 'chamber',
       'children', 'chill', 'chinese', 'choral', 'christian', 'christmas',
       'classical', 'classicmetal', 'club', 'colombian', 'comedy',
       'conjazz', 'contemporary', 'country', 'cuban', 'dance',
       'dancehall', 'dancepop', 'dark', 'death', 'deep', 'deutschrock',
       'deutschspr', 'dirty', 'disco', 'dnb', 'documentary', 'downbeat',
       'downtempo', 'drum', 'dub', 'dubstep', 'eastern', 'easy',
       'electronic', 'electropop', 'emo', 'entehno', 'epicmetal',
       'estrada', 'ethnic', 'eurofo

Examples of implicit duplicates of `hiphop`:
* `hip`
* `hop`
* `hip-hop`

Write a function `replace_wrong_genres()` with two parameters:
* `wrong_genres=` - list of incorrect names
* `correct_genre=` - the correct name that replaces them

In [84]:
# function for replacing implicit duplicates

def replace_wrong_genres(wrong_genres, correct_genre):
    # replace every incorrect spelling in the global `df` with the correct name
    for wrong in wrong_genres:
        df['genre'] = df['genre'].replace(wrong, correct_genre)

wrong_genres = ['hip', 'hop', 'hip-hop']
correct_genre = 'hiphop'
replace_wrong_genres(wrong_genres, correct_genre)
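
As a possible variation, `Series.replace()` also accepts a list of values, so the loop is not strictly needed, and the table can be passed in as an argument instead of relying on the global `df`. The function name `replace_wrong_genres_in()` and the `demo` table below are hypothetical and used only to illustrate the idea.

```python
import pandas as pd


def replace_wrong_genres_in(data, wrong_genres, correct_genre):
    """Return a copy of `data` where every name in `wrong_genres` becomes `correct_genre`."""
    result = data.copy()
    # replace() matches whole cell values, so 'hip' does not touch 'hiphop'
    result['genre'] = result['genre'].replace(wrong_genres, correct_genre)
    return result


# made-up sample with the three implicit duplicates mentioned above
demo = pd.DataFrame({'genre': ['hip', 'hop', 'hip-hop', 'hiphop', 'pop']})
cleaned = replace_wrong_genres_in(demo, ['hip', 'hop', 'hip-hop'], 'hiphop')

print(cleaned['genre'].value_counts())
```

Returning a new table keeps the input unchanged, which makes the function easier to test on small samples.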

In [85]:
# check for update

df_new_genre = df['genre'].sort_values()
df_new_genre.unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'ambient', 'americana', 'animated', 'anime',
       'arabesk', 'arabic', 'arena', 'argentinetango', 'art', 'audiobook',
       'avantgarde', 'axé', 'baile', 'balkan', 'beats', 'bigroom',
       'black', 'bluegrass', 'blues', 'bollywood', 'bossa', 'brazilian',
       'breakbeat', 'breaks', 'broadway', 'cantautori', 'cantopop',
       'canzone', 'caribbean', 'caucasian', 'celtic', 'chamber',
       'children', 'chill', 'chinese', 'choral', 'christian', 'christmas',
       'classical', 'classicmetal', 'club', 'colombian', 'comedy',
       'conjazz', 'contemporary', 'country', 'cuban', 'dance',
       'dancehall', 'dancepop', 'dark', 'death', 'deep', 'deutschrock',
       'deutschspr', 'dirty', 'disco', 'dnb', 'documentary', 'downbeat',
       'downtempo', 'drum', 'dub', 'dubstep', 'eastern', 'easy',
       'electronic', 'electropop', 'emo', 'entehno', 'epicmetal',
       'estrada', 'ethnic', 'eurofo

[Back to Table of Contents](#back)

### Conclusion <a id='data_preprocessing_conclusions'></a>
Three problems were detected:

- Inconsistent column name style
- Missing values
- Explicit and implicit duplicates

All three have been resolved, so we can move on to testing the hypotheses.

[Back to Table of Contents](#back)

## Testing Hypothesis <a id='hypotheses'></a>

### Hypothesis 1: Comparing User Behaviour in 2 Cities <a id='activity'></a>

According to the first hypothesis, users in Springfield and Shelbyville listen to music differently. We test it with data for three days of the week: Monday, Wednesday, and Friday.

* Split the users by city.
* Compare the number of tracks played on each of the three days.

In [86]:
# counting the music for each city

df.groupby('city')['track'].count() 

city
Shelbyville    18512
Springfield    42741
Name: track, dtype: int64

More tracks are played in Springfield than in Shelbyville. That does not mean Springfield users listen to music more often: the city is simply bigger and has more users.
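
One way to separate city size from listening intensity is to divide the number of plays by the number of distinct users in each city. The sketch below shows the idea on made-up rows; the `demo` table and the `plays_per_user` column are assumptions for illustration, not results from the project data.

```python
import pandas as pd

# made-up plays: three Springfield users and one very active Shelbyville user
demo = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2', 'u3', 'u4', 'u4', 'u4'],
    'city': ['Springfield', 'Springfield', 'Springfield', 'Springfield',
             'Shelbyville', 'Shelbyville', 'Shelbyville'],
    'track': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
})

# total plays and number of distinct users per city
per_city = demo.groupby('city').agg(
    plays=('track', 'count'),
    users=('user_id', 'nunique'),
)
per_city['plays_per_user'] = per_city['plays'] / per_city['users']

print(per_city)
```

In this toy example Springfield has more plays in total, but Shelbyville has more plays per user, which is exactly the situation the raw counts cannot reveal.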

Now group the data by day and count the tracks played.

In [87]:
# counting the music based on days

df.groupby('day')['track'].count()

day
Friday       21840
Monday       21354
Wednesday    18059
Name: track, dtype: int64

Overall, the fewest tracks are played on Wednesday.

Write a function `number_tracks()` that counts the tracks played on a given day in a given city. It takes two parameters:
* the name of the day
* the name of the city

Inside the function, use a variable to store the rows of the original table where:
  * the value in `'day'` equals the `day` parameter
  * the value in `'city'` equals the `city` parameter

Apply the two filters one after the other using logical indexing.

Then count the values in the `'user_id'` column of the resulting table, save the count to a new variable, and return it from the function.

In [88]:
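# function number_tracks

def number_tracks(day, city):
    # keep only the rows for the given day, then for the given city
    track_list = df[df['day'] == day]
    track_list = track_list[track_list['city'] == city]
    # count the plays (user_id has no missing values)
    track_list_count = track_list['user_id'].count()
    return track_list_count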

Call `number_tracks()` six times with different arguments to get the counts for each city on each of the three days.

In [89]:
# music played in Springfield on Monday

number_tracks('Monday', 'Springfield')

15740

In [90]:
# music played in Shelbyville on Monday

number_tracks('Monday', 'Shelbyville')

5614

In [91]:
# music played in Springfield on Wednesday

number_tracks('Wednesday', 'Springfield')

11056

In [92]:
# music played in Shelbyville on Wednesday

number_tracks('Wednesday', 'Shelbyville')

7003

In [93]:
# music played in Springfield on Friday

number_tracks('Friday', 'Springfield')

15945

In [94]:
# music played in Shelbyville on Friday

number_tracks('Friday', 'Shelbyville')

5895

Use `pd.DataFrame` to build a table where:
* the column names are `['city', 'monday', 'wednesday', 'friday']`
* the values are the results of `number_tracks()`

In [95]:
# make new table

column = ['city', 'monday', 'wednesday', 'friday']
track = [['Springfield',number_tracks('Monday', 'Springfield'),number_tracks('Wednesday', 'Springfield'),number_tracks('Friday', 'Springfield')],
         ['Shelbyville',number_tracks('Monday', 'Shelbyville'),number_tracks('Wednesday', 'Shelbyville'),number_tracks('Friday', 'Shelbyville')]]

pd.DataFrame(data=track, columns=column)

Unnamed: 0,city,monday,wednesday,friday
0,Springfield,15740,11056,15945
1,Shelbyville,5614,7003,5895
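
A table of this shape can also be produced in a single call with `pd.crosstab()`, which counts the rows for every combination of two columns. The following is a sketch on made-up rows; the `demo` table, `day_order`, and `plays` are names introduced here for illustration only.

```python
import pandas as pd

# made-up plays, one row per play
demo = pd.DataFrame({
    'city': ['Springfield', 'Springfield', 'Springfield',
             'Shelbyville', 'Shelbyville', 'Springfield'],
    'day': ['Monday', 'Monday', 'Friday',
            'Wednesday', 'Wednesday', 'Wednesday'],
})

day_order = ['Monday', 'Wednesday', 'Friday']

# count plays per city and day, then put the days in calendar order
plays = pd.crosstab(demo['city'], demo['day']).reindex(columns=day_order, fill_value=0)

print(plays)
```

Compared with six separate calls to a counting function, this scales to any number of cities and days, at the cost of being slightly less explicit.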


**Findings**

The results show a difference in user behaviour:

- In Springfield, the number of tracks played peaks on Monday and Friday and drops on Wednesday.
- In Shelbyville, it is the other way around: users listen to more music on Wednesday than on Monday or Friday.
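
Because the two cities differ a lot in size, the pattern is easier to compare when each day is expressed as a share of the city's total plays. The sketch below reuses the counts from the table above; the names `counts` and `shares` are introduced here for illustration.

```python
import pandas as pd

# counts copied from the summary table above
counts = pd.DataFrame(
    {'monday': [15740, 5614], 'wednesday': [11056, 7003], 'friday': [15945, 5895]},
    index=['Springfield', 'Shelbyville'],
)

# divide every row by its own total to get the share of each day, in percent
shares = counts.div(counts.sum(axis=1), axis=0).mul(100).round(1)

print(shares)
```

With these numbers, Wednesday accounts for roughly a quarter of the plays in Springfield but well over a third in Shelbyville.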

[Back to Table of Contents](#back)

### Hypothesis 2: Music in the beginning and the end of the Week <a id='week'></a>

According to the second hypothesis, on Monday morning and Friday evening Springfield users listen to different genres than Shelbyville users.

In [96]:
# new table contain Springfield people data

spr_general = df[df['city'] == 'Springfield'].reset_index(drop = True)
spr_general

Unnamed: 0,user_id,track,artist,genre,city,time,day
0,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
1,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
2,4CB90AA5,True,Roman Messer,dance,Springfield,13:00:07,Wednesday
3,F03E1C1F,Feeling This Way,Polina Griffith,dance,Springfield,20:47:49,Wednesday
4,8FA1D3BE,L’estate,Julia Dalia,ruspop,Springfield,09:17:40,Friday
...,...,...,...,...,...,...,...
42736,83A474E7,I Worship Only What You Bleed,The Black Dahlia Murder,extrememetal,Springfield,21:07:12,Monday
42737,729CBB09,My Name,McLean,rnb,Springfield,13:32:28,Wednesday
42738,C5E3A0D5,Jalopiina,unknown,industrial,Springfield,20:09:26,Friday
42739,321D0506,Freight Train,Chas McDevitt,rock,Springfield,21:43:59,Friday


In [97]:
# new table contain Shelbyville people data

shel_general = df[df['city'] == 'Shelbyville'].reset_index(drop = True)
shel_general

Unnamed: 0,user_id,track,artist,genre,city,time,day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
1,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
2,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
3,842029A1,Chains,Obladaet,rusrap,Shelbyville,13:09:41,Friday
4,E772D5C0,Pessimist,unknown,dance,Shelbyville,21:20:49,Wednesday
...,...,...,...,...,...,...,...
18507,D94F810B,Theme from the Walking Dead,Proyecto Halloween,film,Shelbyville,21:14:40,Monday
18508,BC8EC5CF,Red Lips: Gta (Rover Rework),Rover,electronic,Shelbyville,21:06:50,Monday
18509,29E04611,Bre Petrunko,Perunika Trio,world,Shelbyville,13:56:00,Monday
18510,1B91C621,(Hello) Cloud Mountain,sleepmakeswaves,postrock,Shelbyville,09:22:13,Monday


Write a function `genre_weekday()` with four parameters:
* a table with data
* the name of the day
* the start time, in 'hh:mm' format
* the end time, in 'hh:mm' format

The function should return the 15 most popular genres for the given day and time interval.

In [98]:
# function `genre_weekday`

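def genre_weekday(data, day, time1, time2):

    # keep the rows for the given day that fall between time1 and time2
    # (times are zero-padded strings, so they can be compared as text)
    genre_df = data[data['day'] == day]
    genre_df = genre_df[genre_df['time'] < time2]
    genre_df = genre_df[genre_df['time'] > time1]

    # count the plays per genre and sort in descending order
    genre_df_grouped = genre_df.groupby('genre')['genre'].count()
    genre_df_sorted = genre_df_grouped.sort_values(ascending = False)
    
    return genre_df_sorted[:15]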
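
The same logic can be written more compactly by combining the three conditions into one boolean mask and letting `value_counts()` do the counting and sorting. This is only a sketch of an alternative: the function name `genre_weekday_vc()`, the `top` parameter, and the `demo` rows are assumptions made for illustration.

```python
import pandas as pd


def genre_weekday_vc(data, day, time1, time2, top=15):
    """Return the `top` most played genres on `day` between `time1` and `time2`."""
    # 'time' holds zero-padded 'HH:MM:SS' strings, so text comparison follows clock order
    mask = (data['day'] == day) & (data['time'] > time1) & (data['time'] < time2)
    # value_counts() counts each genre and sorts the result in descending order
    return data.loc[mask, 'genre'].value_counts().head(top)


# made-up plays to try the function on
demo = pd.DataFrame({
    'day': ['Monday', 'Monday', 'Monday', 'Monday', 'Friday', 'Monday'],
    'time': ['08:34:34', '08:37:09', '09:00:00', '12:00:00', '09:17:40', '11:00:00'],
    'genre': ['dance', 'pop', 'pop', 'rock', 'ruspop', 'jazz'],
})

result = genre_weekday_vc(demo, 'Monday', '07:00', '11:00')
print(result)
```

Note a side effect of comparing strings of different lengths: `'11:00:00' < '11:00'` is false, so a play at exactly 11:00:00 is excluded, while `'07:00:00' > '07:00'` is true, so a play at exactly 07:00:00 is included.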

Use `genre_weekday()` to compare the genres in Springfield and Shelbyville on Monday morning (07:00 to 11:00) and Friday evening (17:00 to 23:00).

In [99]:
# genre played in Springfield on monday (07.00 - 11.00)

genre_weekday(spr_general, 'Monday', '07:00', '11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
classical      157
metal          120
jazz           100
folk            97
soundtrack      95
Name: genre, dtype: int64

In [100]:
# genre played in Shelbyville on monday (07.00 - 11.00)

genre_weekday(shel_general, 'Monday', '07:00', '11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
world           36
rap             32
soundtrack      31
rnb             27
metal           27
Name: genre, dtype: int64

In [101]:
# genre played in Springfield on friday (17.00 - 23.00)

genre_weekday(spr_general, 'Friday', '17:00', '23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
classical      163
alternative    163
rusrap         142
jazz           111
unknown        110
soundtrack     105
rnb             90
metal           88
Name: genre, dtype: int64

In [102]:
# genre played in Shelbyville on friday (17.00 - 23.00)

genre_weekday(shel_general, 'Friday', '17:00', '23:00')

genre
pop            256
rock           216
electronic     216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
unknown         47
ruspop          47
soundtrack      40
metal           39
rap             36
Name: genre, dtype: int64

**Findings**

1. The top five genres are the same in Springfield and Shelbyville, with only small differences in order.

2. In Springfield, there are enough missing values for `unknown` to reach the top 10 on Monday morning, so the missing values have a noticeable effect on the results.

On Friday evening the picture is similar: the order varies slightly, but overall both cities listen to the same genres.

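So, based on these results, the second hypothesis is not confirmed: the two cities do not listen to noticeably different genres on Monday morning or Friday evening.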
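
The comparison of the top genres can be checked mechanically instead of by eye. The sketch below copies the five leading genres from the four outputs above, in the order they were printed; the helper `compare_top()` is hypothetical and introduced only for this check.

```python
# top five genres copied from the outputs above, in printed order
spr_monday = ['pop', 'dance', 'electronic', 'rock', 'hiphop']
shel_monday = ['pop', 'dance', 'rock', 'electronic', 'hiphop']
spr_friday = ['pop', 'rock', 'dance', 'electronic', 'hiphop']
shel_friday = ['pop', 'rock', 'electronic', 'dance', 'hiphop']


def compare_top(first, second):
    """Report whether two rankings contain the same genres and whether the order matches."""
    return {
        'same_genres': set(first) == set(second),
        'same_order': first == second,
    }


monday = compare_top(spr_monday, shel_monday)
friday = compare_top(spr_friday, shel_friday)

print('Monday morning:', monday)
print('Friday evening:', friday)
```

In both time slots the two cities share the same five genres and differ only in the order of the middle positions.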

[Back to Table of Contents](#back)

### Hypothesis 3: Genre Preferences in Springfield and Shelbyville <a id='genre'></a>

Group `spr_general` by genre and count the plays of each genre with `count()`. Then sort the result in descending order and save it to `spr_genres`.

In [103]:
# group for Springfield

spr_genres = spr_general.groupby('genre')
spr_genres = spr_genres['genre'].count()

spr_genres = spr_genres.sort_values(ascending=False)
spr_genres

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
               ... 
metalcore         1
marschmusik       1
malaysian         1
lovers            1
ïîï               1
Name: genre, Length: 250, dtype: int64

In [104]:
spr_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

In [106]:
# group for Shelbyville

shel_genres = shel_general.groupby('genre')
shel_genres = shel_genres['genre'].count()

shel_genres = shel_genres.sort_values(ascending=False)
shel_genres

genre
pop           2431
dance         1932
rock          1879
electronic    1736
hiphop         960
              ... 
mandopop         1
leftfield        1
laiko            1
jungle           1
worldbeat        1
Name: genre, Length: 202, dtype: int64

In [107]:
shel_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

**Findings**

The hypothesis is only partially confirmed:
* Pop is the most popular genre in Springfield, as expected, but it is also the most popular genre in Shelbyville.
* Rap is not in the top 5 in either city.
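
Raw counts are hard to compare between cities of different size, so it can help to express each genre as a share of all plays in its city. The sketch below uses the top-five counts printed above and the per-city totals from the first hypothesis (42741 and 18512); the variable names are introduced here for illustration.

```python
import pandas as pd

# top-five counts copied from the outputs above
spr_top = pd.Series({'pop': 5892, 'dance': 4435, 'rock': 3965, 'electronic': 3786, 'hiphop': 2096})
shel_top = pd.Series({'pop': 2431, 'dance': 1932, 'rock': 1879, 'electronic': 1736, 'hiphop': 960})

# total plays per city, taken from the city counts in Hypothesis 1
spr_total = 42741
shel_total = 18512

# share of each genre in its city's plays, in percent
shares = pd.DataFrame({
    'Springfield': spr_top / spr_total * 100,
    'Shelbyville': shel_top / shel_total * 100,
}).round(1)

print(shares)
```

With these numbers, the shares of the leading genres differ by about one percentage point or less between the two cities.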


[Back to Table of Contents](#back)

# Main Conclusion <a id='end'></a>

After data analysis and testing the hypothesis, we can conclude:

1. User activity in Springfield and Shelbyville depends on the day.

The first hypothesis can be fully accepted.

2. Music preferences do not differ much between Springfield and Shelbyville over the week. 
We can see a small difference on Monday, but in both Springfield and Shelbyville pop is the most popular genre.
So we cannot accept this hypothesis. We also have to remember that the results might have been different if not for the missing values.

3. It turns out that the music preferences of users from Springfield and Shelbyville are very similar.
The third hypothesis is rejected. If there is a difference in preference, it cannot be seen from this data.

### Notes
In real projects, research involves statistical hypothesis testing, which is more precise and more quantitative. Also note that you can't always draw conclusions about an entire city based on data from just one source.
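
As a rough illustration of what such a test could look like for the first hypothesis, the sketch below computes a chi-square statistic of independence by hand for the city-by-day counts from this notebook. It is an assumption-laden example, not part of the original analysis: individual plays are treated as independent observations, which is not really true because the same users contribute many plays, so the resulting p-value overstates the evidence.

```python
import math

import pandas as pd

# observed plays per city and day, copied from the Hypothesis 1 table
observed = pd.DataFrame(
    {'monday': [15740, 5614], 'wednesday': [11056, 7003], 'friday': [15945, 5895]},
    index=['Springfield', 'Shelbyville'],
)

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = observed.to_numpy().sum()

# expected counts if both cities split their plays across days in the same way
expected = pd.DataFrame(
    [[r * c / grand_total for c in col_totals] for r in row_totals],
    index=observed.index,
    columns=observed.columns,
)

# chi-square statistic: sum of (observed - expected)^2 / expected over all cells
chi2 = ((observed - expected) ** 2 / expected).to_numpy().sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# for exactly 2 degrees of freedom the chi-square tail probability is exp(-x / 2)
p_value = math.exp(-chi2 / 2)

print(f'chi2 = {chi2:.1f}, dof = {dof}, p-value = {p_value:.3g}')
```

The closed-form tail probability only holds for two degrees of freedom; for other table sizes a statistics library would be needed.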