# Y.Music

# Table of Contents <a id='back'></a>

* [Introduction](#intro)
* [Stage 1. Data Overview](#data_review)
    * [Conclusions](#data_review_conclusions)
* [Stage 2. Data Preprocessing](#data_preprocessing)
    * [2.1 Header Style](#header_style)
    * [2.2 Missing Values](#missing_values)
    * [2.3 Duplicates](#duplicates)
    * [2.4 Conclusions](#data_preprocessing_conclusions)
* [Stage 3. Hypothesis Testing](#hypotheses)
    * [3.1 Hypothesis 1: User activity in both cities](#activity)
    * [3.2 Hypothesis 2: Music preferences on Monday and Friday](#week)
    * [3.3 Hypothesis 3: Genre preferences in Springfield and Shelbyville](#genre)
* [Findings](#end)

## Introduction <a id='intro'></a>
Whenever we conduct an analysis, we need to formulate several hypotheses that we will test further. Sometimes, the tests lead us to accept these hypotheses, while other times, we need to reject them. To make informed business decisions, we must understand whether the assumptions we make are correct or not.

In this project, you will compare the music preferences of listeners in the cities of Springfield and Shelbyville. You will review real data from Y.Music to test the hypotheses below and compare user behavior in both cities.

### Goal:
Test three hypotheses:
1. User activity varies depending on the day and the city.
2. On Monday mornings, residents of Springfield and Shelbyville listen to different genres. This is also true for Friday evenings.
3. Listeners in Springfield and Shelbyville have different preferences. In Springfield, users prefer pop music, while in Shelbyville, rap music is more popular.

### Steps
Data related to user behavior is stored in the file [**music_project_en.csv**](https://raw.githubusercontent.com/milawidyalestari/data-analyst-ml-project/project1-y.music-behavior/music_project_en.csv). There is no information about the quality of this data, so you need to examine it first before testing the hypotheses.

First, you will assess the data quality and see if the issues are significant. Then, during data preprocessing, you will try to address the most serious problems.

This project consists of three stages:
1. Data Overview
2. Data Preprocessing
3. Hypothesis Testing

[Back to Table of Contents](#back)

## Stage 1. Data Overview <a id='data_review'></a>

In [1]:
# Import Pandas
import pandas as pd

Read the file `music_project_en.csv` from the folder `/datasets/` and save it to the variable `df`:

In [2]:
# Reading the file and saving it to the variable df
path = 'https://raw.githubusercontent.com/milawidyalestari/data-analyst-ml-project/project1-y.music-behavior/music_project_en.csv'
df = pd.read_csv(path)
df.describe()

Unnamed: 0,userID,Track,artist,genre,City,time,Day
count,65079,63736,57512,63881,65079,65079,65079
unique,41748,39666,37806,268,2,20392,3
top,A8AE9169,Brand,Kartvelli,pop,Springfield,08:14:07,Friday
freq,76,136,136,8850,45360,14,23149


Display the first 10 rows of the table:

In [3]:
# Obtaining the first 10 rows of the df table
df.head(10)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
5,842029A1,Chains,Obladaet,rusrap,Shelbyville,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Springfield,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Springfield,20:47:49,Wednesday
8,8FA1D3BE,L’estate,Julia Dalia,ruspop,Springfield,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Shelbyville,21:20:49,Wednesday


Get general information about the table with a single command:

In [4]:
# Obtain general information about the data available in df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63736 non-null  object
 2   artist    57512 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


This table contains seven columns. All columns have the same data type, which is `object`.

Based on the documentation:
- `'userID'` — user ID
- `'Track'` — song title
- `'artist'` — artist name
- `'genre'`
- `'City'` — user's city of origin
- `'time'` — time when the song was played
- `'Day'` — day of the week

We can see three issues with the column naming style:
1. Some names are written in uppercase, some in lowercase.
2. Some names use spaces.
3. For the third issue, it can be found in the first column name `'  userID'` which should use `snake_case` format for naming columns with 2 or more words.

We can also see that there are different numbers of values between columns. This indicates that our data contains missing values.

### Conclusion<a id='data_review_conclusions'></a> 

Each row in the table stores data related to a played song track. Some columns store data that describe the track itself: song title, artist, and genre. The rest store data related to user information: their city of origin, the time they played the song track.

It is clear that the data we have is sufficient to test hypotheses. Unfortunately, there are some missing values.

To continue the analysis, we need to preprocess the data first.

[Back to Table of Contents](#back)

## Stage 2. Data Preprocessing <a id='data_preprocessing'></a>
Fix the format in the column titles and handle missing values. Then, check if your data contains duplicates.

### Header Style <a id='header_style'></a>
Display the column headers:

In [5]:
# A list containing the column names in the df table
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

Change the column names according to the rules of good writing style:
* If the column name consists of multiple words, use snake_case
* All characters should be lowercase
* Remove spaces

In [6]:
# Change column names
# Here are 4 column names that we need to edit
df = df.rename(
    columns={
    '  userID': 'user_id',
    'Track': 'track',
    '  City  ': 'city',
    'Day': 'day'
})
# Displaying column names after renaming
df.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')

Check the result. Display the column names again.

In [7]:
# checking your result: display the list of column names once again
df.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')

[Back to Table of Contents](#back)

### Missing Values <a id='missing_values'></a>
First, find the number of missing values in the table. To do this, use two `Pandas` methods:

In [8]:
# counting missing values
df.isnull().sum().sort_values(ascending=False)

artist     7567
track      1343
genre      1198
user_id       0
city          0
time          0
day           0
dtype: int64

It turns out that the `artist` column has the most missing values. However, since the data in the `artist` and `track` columns are not very influential to the hypothesis results in this project, the values in the `artist` column will be replaced with `'unknown'` as a replacement.

Not all missing values are relevant to your research. For example, missing values in the `track` and `artist` columns are not very important. You can simply replace them with clear markers.
However, missing values in the `'genre'` column can affect the comparison of music preferences in the cities of Springfield and Shelbyville. In real life, it is very useful to study the reasons for missing data and try to fix them. Unfortunately, we don't have the opportunity to do that in this project. Therefore, you should:
* Fill in missing values with markers
* Evaluate the extent to which missing values affect your calculations

Replace missing values in the `'track'`, `'artist'`, and `'genre'` columns with the string `'unknown'`. To do this, create a list named `columns_to_replace`, apply a `for` loop to that list, and replace the missing values in each column:

In [9]:
# Applying a loop to column names and replacing missing values with 'unknown'
columns_to_replace = ['track', 'artist', 'genre']

for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')

In the 'genre' column, missing values can have a significant impact on the results of the hypothesis test. This may have occurred due to data entry errors. Therefore, for now, we will replace these missing values with `'unknown'`.

Make sure there are no more tables containing missing values. Recalculate the missing values.

In [10]:
# Counting missing values
df.isnull().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64

The missing values have been filled with `'unknown'` in the `'track'`, `'artist'`, and `'genre'` columns, so there are no more empty values to analyze. 

[Back to Table of Contents](#back)

### Duplicates <a id='duplicates'></a>
Find the number of explicit duplicates in the table using a single command:

In [11]:
# counting explicit duplicates
df.duplicated().sum()

3826

Call one of the `Pandas` methods to remove explicit duplicates:

In [12]:
# removing explicit duplicates
df = df.drop_duplicates().reset_index(drop=True)

Count explicit duplicates again to ensure that you have successfully removed all of them:

In [13]:
# checking duplicates
df.duplicated().sum()

0

In addition to empty values or `'missing values'`, duplicate data can also affect the results of the research. It is advisable to remove these duplicate data to make the research results more accurate.

Now, remove implicit duplicates in the `genre` column. For example, writing a genre name in different ways is an example of implicit duplicates. Errors like this will also affect your analysis results.

Display a list containing unique genre names, then sort the list alphabetically. To do this:
* Take the desired DataFrame column
* Apply a sorting method to that column
* For the sorted column, call a method that will produce all unique values in the column

In [14]:
# displaying unique genre names
df['genre'].sort_values(ascending=True).unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'ambient', 'americana', 'animated', 'anime',
       'arabesk', 'arabic', 'arena', 'argentinetango', 'art', 'audiobook',
       'avantgarde', 'axé', 'baile', 'balkan', 'beats', 'bigroom',
       'black', 'bluegrass', 'blues', 'bollywood', 'bossa', 'brazilian',
       'breakbeat', 'breaks', 'broadway', 'cantautori', 'cantopop',
       'canzone', 'caribbean', 'caucasian', 'celtic', 'chamber',
       'children', 'chill', 'chinese', 'choral', 'christian', 'christmas',
       'classical', 'classicmetal', 'club', 'colombian', 'comedy',
       'conjazz', 'contemporary', 'country', 'cuban', 'dance',
       'dancehall', 'dancepop', 'dark', 'death', 'deep', 'deutschrock',
       'deutschspr', 'dirty', 'disco', 'dnb', 'documentary', 'downbeat',
       'downtempo', 'drum', 'dub', 'dubstep', 'eastern', 'easy',
       'electronic', 'electropop', 'emo', 'entehno', 'epicmetal',
       'estrada', 'ethnic', 'eurofo

Take a close look at the list displayed to find implicit duplicates of the `hiphop` genre. These duplicates could be names written incorrectly or alternative names for the same genre.

You will find the following implicit duplicates:
- `hip`
- `hop`
- `hip-hop`

To remove them, use the `replace_wrong_genres()` function with two parameters:
- `wrong_genres=` - a list with duplicates to be replaced
- `correct_genre=` - a string with the correct value

The function should correct the names in the `'genre'` column of the `df` table, replacing each value from the `wrong_genres` list with the value from `correct_genre`.

In [15]:
# Enter the function that replaces implicit duplicates
# df['genre'] = df['genre'].replace(['hip', 'hop', 'hip-hop'], 'hiphop')

def replace_wrong_genres(data, column_name, wrong_genres, correct_genre):
    data[column_name] = data[column_name].replace(wrong_genres, correct_genre)
    return(data)

Call `replace_wrong_genres()` and pass arguments to the function so that it can remove implicit duplicates (`hip`, `hop`, and `hip-hop`) and replace them with `hiphop`:

In [16]:
# removing implicit duplicates
wrong_genres = ['hip', 'hop', 'hip-hop']
correct_genre = 'hiphop'

replace_wrong_genres(df, 'genre', wrong_genres, correct_genre)

Unnamed: 0,user_id,track,artist,genre,city,time,day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
...,...,...,...,...,...,...,...
61248,729CBB09,My Name,McLean,rnb,Springfield,13:32:28,Wednesday
61249,D08D4A55,Maybe One Day (feat. Black Spade),Blu & Exile,hiphop,Shelbyville,10:00:00,Monday
61250,C5E3A0D5,Jalopiina,unknown,industrial,Springfield,20:09:26,Friday
61251,321D0506,Freight Train,Chas McDevitt,rock,Springfield,21:43:59,Friday


Make sure that the duplicated values have been removed. Display a list of unique values from the `'genre'` column:

In [17]:
# checking implicit duplicates
df['genre'].sort_values(ascending=True).unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'ambient', 'americana', 'animated', 'anime',
       'arabesk', 'arabic', 'arena', 'argentinetango', 'art', 'audiobook',
       'avantgarde', 'axé', 'baile', 'balkan', 'beats', 'bigroom',
       'black', 'bluegrass', 'blues', 'bollywood', 'bossa', 'brazilian',
       'breakbeat', 'breaks', 'broadway', 'cantautori', 'cantopop',
       'canzone', 'caribbean', 'caucasian', 'celtic', 'chamber',
       'children', 'chill', 'chinese', 'choral', 'christian', 'christmas',
       'classical', 'classicmetal', 'club', 'colombian', 'comedy',
       'conjazz', 'contemporary', 'country', 'cuban', 'dance',
       'dancehall', 'dancepop', 'dark', 'death', 'deep', 'deutschrock',
       'deutschspr', 'dirty', 'disco', 'dnb', 'documentary', 'downbeat',
       'downtempo', 'drum', 'dub', 'dubstep', 'eastern', 'easy',
       'electronic', 'electropop', 'emo', 'entehno', 'epicmetal',
       'estrada', 'ethnic', 'eurofo

Implicit duplicate data can lead us to make mistakes when grouping data. Therefore, we must be detailed in cleaning the data. For example, the data above has implicit duplicates of `'hip'`, `'hop'`, and `'hip-hop'`. These genres have the same meaning. However, due to errors in data entry, these genres should be removed to make the data grouping more relevant.

[Back to Table of Contents](#back)

### Conclusion <a id='data_preprocessing_conclusions'></a>
We have detected three issues in our data:

- Incorrect title writing styles
- Missing values
- Explicit and implicit duplicates

Now, the column names have been cleaned to facilitate table processing.
All missing values have been replaced with `'unknown'`. However, we still need to see if missing values in the `'genre'` column will affect our calculations.

The absence of duplicates will make our results more accurate and easier to understand.

Let's continue to the hypothesis testing phase.

[Back to Table of Contents](#back)

## Stage 3. Hypothesis Testing <a id='hypotheses'></a>

### Hypothesis 1: Comparing user behavior in two cities <a id='activity'></a>

According to the first hypothesis, users from Springfield and Shelbyville have different behaviors in listening to music. This test uses data taken from three days of the week: Monday, Wednesday, and Friday.

* Divide users into several groups based on city.
* Compare how many tracks are played by each group on Monday, Wednesday, and Friday.

In [18]:
# Counting tracks played on each city
df.groupby('city')['track'].count().reset_index()

Unnamed: 0,city,track
0,Shelbyville,18512
1,Springfield,42741


Users from Springfield play more tracks than users from Shelbyville. However, this does not necessarily mean that residents of Springfield listen to music more often. The city is indeed larger, with more users. So, this is a reasonable finding.

Now, group the data by day and find the number of tracks played on Monday, Wednesday, and Friday.

In [19]:
# Counting tracks played on each day
df.groupby('day')['track'].count().reset_index()

Unnamed: 0,day,track
0,Friday,21840
1,Monday,21354
2,Wednesday,18059


Wednesday is the overall 'calmest' day. However, if we consider the two cities separately, we might come to different conclusions.

You've seen how grouping works based on cities or days. Now, write a function that will group data based on both the day and the city.

Create a `number_tracks()` function to count the number of song tracks played for a specific day and city. The function will require two parameters:
* The name of the day of the week.
* The name of the city.

In the function we create, use variables to store rows from the original table, where:
  * The value in the `'day'` column matches the `day` parameter,
  * The value in the `'city'` column matches the `city` parameter.

Apply sequential filtering with logical indexing.

Then, calculate the value in the `'user_id'` column for the resulting table. Save the result in a new variable. Return this variable from the function.

In [20]:
# <creating the number_tracks() function>
# We will declare a function with two parameters: day=, city=.
# Make the variable track_list store the rows of df where
# the value in the 'day' column is equal to the parameter day=, and at the same time,
# the value in the 'city' column is equal to the parameter city= (apply sequential filtering
# with logical indexing).
# Make the variable track_list_count store the count of values in the 'user_id' column in track_list
# (find it with the count() method).
# Make the function you create return the number: the value of track_list_count.

# This function counts the number of tracks played for a specific city and day.
# First, it will take the rows with the desired day from the table,
# then filter those rows by the desired city from the result,
# then find the count of 'user_id' values in the filtered table,
# and then return that count.
# To see the output, wrap the function call in print().

def number_tracks(city, day):
    track_list = df[(df['city'] == city) & (df['day'] == day)]
    track_list_count = track_list['user_id'].count()
    return(track_list_count)

Call `number_tracks()` six times and change the parameter values for each call, so you can get data from both cities for each day (Monday, Wednesday, and Friday).

In [21]:
# number of tracks played in Springfield on Monday
spring_mon = number_tracks(city='Springfield', day='Monday')

In [22]:
# number of tracks played in Shelbyville on Monday
shelby_mon = number_tracks(city='Shelbyville', day='Monday')

In [23]:
# number of tracks played in Springfield on Wednesday
spring_wed = number_tracks(city='Springfield', day='Wednesday')

In [24]:
# number of tracks played in Shelbyville on Wednesday
shelby_wed = number_tracks(city='Shelbyville', day='Wednesday')

In [25]:
# number of tracks played in Springfield on Friday
spring_fri = number_tracks(city='Springfield', day='Friday')

In [26]:
# number of tracks played in Shelbyville on Friday
shelby_fri = number_tracks(city='Shelbyville', day='Friday')

Use `pd.DataFrame` to create a table, where
* The column names are: `['city', 'monday', 'wednesday', 'friday']`
* The data is the results you obtained from `number_tracks()`

In [27]:
number_track = [
    ['Springfield', spring_mon, spring_wed, spring_fri],
    ['Shelbyville', shelby_mon, shelby_wed, shelby_fri]
]
pd.DataFrame(number_track, columns=['city', 'monday', 'wednesday', 'friday'])

Unnamed: 0,city,monday,wednesday,friday
0,Springfield,15740,11056,15945
1,Shelbyville,5614,7003,5895


As we can see above, the playback of tracks in each city has different frequencies on certain days.

**Conclusion**

The data you obtained successfully revealed some differences in user behavior:

- In the city of Springfield, the number of tracks played peaks on Monday and Friday, while there is a decrease in activity on Wednesday.
- In Shelbyville, on the other hand, users listen to more music on Wednesday. User activity on Monday and Friday is lower.

Thus, it can be concluded that the first hypothesis appears to be correct.

[Back to Table of Contents](#back)

### Hypothesis 2: Music at the Beginning and End of the Week <a id='week'></a>

According to the second hypothesis, on Monday mornings and Friday nights, residents of Springfield listen to different genres of music than those enjoyed by the residents of Shelbyville.

Get the table (make sure the combined table names match the DataFrame provided in the two code blocks below):
* For Springfield — `spr_general`
* For Shelbyville — `shel_general`

In [28]:
# get the spr_general table from the df rows,
# where the values of the 'city' column are 'Springfield'
spr_general = df[df['city'] == 'Springfield']
spr_general

Unnamed: 0,user_id,track,artist,genre,city,time,day
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
6,4CB90AA5,True,Roman Messer,dance,Springfield,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Springfield,20:47:49,Wednesday
8,8FA1D3BE,L’estate,Julia Dalia,ruspop,Springfield,09:17:40,Friday
...,...,...,...,...,...,...,...
61247,83A474E7,I Worship Only What You Bleed,The Black Dahlia Murder,extrememetal,Springfield,21:07:12,Monday
61248,729CBB09,My Name,McLean,rnb,Springfield,13:32:28,Wednesday
61250,C5E3A0D5,Jalopiina,unknown,industrial,Springfield,20:09:26,Friday
61251,321D0506,Freight Train,Chas McDevitt,rock,Springfield,21:43:59,Friday


Displaying the table filtered based on the city `'Springfield'`.

In [29]:
# get the shel_general from the df rows,
# where the values of the 'city' column are 'Shelbyville'
shel_general = df[df['city'] == 'Shelbyville']
shel_general

Unnamed: 0,user_id,track,artist,genre,city,time,day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
5,842029A1,Chains,Obladaet,rusrap,Shelbyville,13:09:41,Friday
9,E772D5C0,Pessimist,unknown,dance,Shelbyville,21:20:49,Wednesday
...,...,...,...,...,...,...,...
61239,D94F810B,Theme from the Walking Dead,Proyecto Halloween,film,Shelbyville,21:14:40,Monday
61240,BC8EC5CF,Red Lips: Gta (Rover Rework),Rover,electronic,Shelbyville,21:06:50,Monday
61241,29E04611,Bre Petrunko,Perunika Trio,world,Shelbyville,13:56:00,Monday
61242,1B91C621,(Hello) Cloud Mountain,sleepmakeswaves,postrock,Shelbyville,09:22:13,Monday


Displaying the table filtered based on the city `'Shelbyville'`.

Create the `genre_weekday()` function with four parameters:
* A table for data
* Day name
* Start time stamp, in 'hh:mm' format
* End time stamp, in 'hh:mm' format

The function should provide information about the top 15 most popular genres on a specific day within the period between two time stamps.

In [30]:
# Declaring the genre_weekday() function with parameters day=, time1=, and time2=.
# The function should provide information about the most popular genre on a specific day and time:

# 1) Make the genre_df variable store rows that meet the following conditions:
#   - the value in the 'day' column equals the value of the day= argument
#   - the value in the 'time' column is greater than the value of the time1= argument
#   - the value in the 'time' column is less than the value of the time2= argument
#   Use sequential filtering with logical indexing.

# 2) Group genre_df by the 'genre' column, take one of its columns,
#    then use the count() method to find the number of entries for each
#    represented genre; store the resulting Series into
#    the genre_df_count variable

# 3) Sort genre_df_count in descending order based on frequency and store the result
#    into the genre_df_sorted variable

# 4) Generate a Series object with the first 15 values of genre_df_sorted - the 15 most
#    popular genres (on a specific day, within a specific time range)

# Write your function here
def genre_weekday(data, day, time1, time2):
    # Sequential filtering
    # genre_df will only store df rows with the same day value as day
    genre_df = data[data['day'] == day]
    
    # genre_df will only store df rows with time greater than time1
    genre_df = genre_df[genre_df['time'] > time1]
    # genre_df will only store df rows with time less than time2
    genre_df2 = genre_df[genre_df['time'] < time2]

    # Group the filtered DataFrame by the column named genre, take the genre column, and find the count of rows for each genre with the count() method
    genre_df_grouped = genre_df.groupby('genre')['genre'].count()

    # Sort the result in descending order (so that the most popular genres are displayed earlier in the Series object)
    genre_df_sorted = genre_df_grouped.sort_values(ascending=False)

    # Generate a Series object storing the 15 most popular genres on a specific day within a specific time range
    return genre_df_sorted[:15]


Compare the results of the `genre_weekday()` function for Springfield and Shelbyville on Monday morning (from 07:00 to 11:00) and Friday evening (from 17:00 to 23:00):

In [31]:
# calling the function for Monday morning in Springfield (use spr_general instead of the df table)
week = genre_weekday(spr_general, 'Monday', '07:00', '11:00')
week

genre
pop            2154
dance          1669
rock           1452
electronic     1432
hiphop          789
classical       558
world           548
alternative     499
ruspop          478
rusrap          440
unknown         393
jazz            339
metal           313
folk            268
soundtrack      268
Name: genre, dtype: int64

In [32]:
# calling the function for Monday morning in Shelbyville (use shel_general instead of the df table)
week = genre_weekday(shel_general, 'Monday', '07:00', '11:00')
week

genre
pop            732
dance          589
rock           577
electronic     523
hiphop         277
alternative    199
classical      187
jazz           174
rusrap         164
world          163
ruspop         162
metal          110
soundtrack     101
unknown         95
rap             93
Name: genre, dtype: int64

In [33]:
# calling the function for Friday evening in Springfield
genre_weekday(spr_general, 'Friday', '17:00', '23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
classical      163
alternative    163
rusrap         142
jazz           111
unknown        110
soundtrack     105
rnb             90
metal           88
Name: genre, dtype: int64

In [34]:
# calling the function for Friday evening in Shelbyville
genre_weekday(shel_general, 'Friday', '17:00', '23:00')

genre
pop            256
rock           216
electronic     216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
unknown         47
ruspop          47
soundtrack      40
metal           39
rap             36
Name: genre, dtype: int64

**Conclusion**

After comparing the top 15 genres on Monday morning, we can draw the following conclusions:

1. Users from Springfield and Shelbyville listen to music with the same genres. The top five genres from both cities are the same, with only rock and electronic genres swapping places.

2. In Springfield, the amount of missing values is significantly high, causing the `'unknown'` value to rank 12th. This indicates that the missing values cover a significant proportion of the data, raising questions about the reliability of our conclusions.

For Friday evening, the situation is similar. Individual genres vary, but overall, the top 15 genres for both cities are the same.

Thus, hypothesis two is partially confirmed:
* Users listen to the same music at the beginning and end of the week.
* There is no significant difference between Springfield and Shelbyville. In both cities, pop is the most popular genre.

However, the significance of the amount of missing values casts doubt on these results. In Springfield, there are so many missing values that affect our top 15 genre results. If we did not have these missing values, the results might be different.

[Back to Table of Contents](#back)

### Hypothesis 3: Genre Preferences in Springfield and Shelbyville <a id='genre'></a>

Hypothesis: Listeners in Shelbyville prefer rap music, while listeners in Springfield prefer pop.

Group the `spr_general` table by genre and find the count of tracks played for each genre using the `count()` method. Then, sort the results in descending order and store it in `spr_genres`.

In [35]:
# In one line: group the spr_general table by the 'genre' column,
# count the values of the 'genre' column with count() within the grouping,
# sort the resulting Series in descending order, then save the result to spr_genres.
spr_general = spr_general.groupby('genre')['genre'].count().sort_values(ascending=False)

Display the first 10 rows of `spr_genres`:

In [36]:
# Displaying the first 10 rows of spr_genres
spr_general[:10]

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

Now, do the same for the data from Shelbyville.

Group the `shel_general` table by genre and find the count of tracks played for each genre. Then, sort the results in descending order and save it to the `shel_genres` table.

In [37]:
# In one line: group the shel_general table by the 'genre' column,
# count the values of the 'genre' column within the grouping using count(),
# ort the resulting Series in descending order, and save it to shel_genres.
shel_genres = shel_general.groupby('genre')['genre'].count().sort_values(ascending=False)

Display the first 10 rows of `shel_genres`:

In [38]:
# Displaying the first 10 rows of shel_genres
shel_genres[:10]

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

**Conclusion**

This hypothesis is partially confirmed:
* Pop music is the most popular genre in Springfield, as we speculated.
* However, pop music is equally popular in both Springfield and Shelbyville, and rap music does not make it to the top 5 genres for both cities.

[Back to Table of Contents](#back)

# Findings <a id='end'></a>

We have tested the following three hypotheses:

1. User activity varies depending on the day and the city.
2. On Monday mornings, residents of Springfield and Shelbyville listen to different genres. This also applies to Friday evenings.
3. Listeners in Springfield and Shelbyville have different preferences. Both in Springfield and Shelbyville, users prefer pop music.

After analyzing the available data, we can conclude that:

1. User activity in Springfield and Shelbyville depends on the day of the week, although the two cities vary in several ways.

The first hypothesis can be fully accepted.

2. Music preferences do not vary significantly throughout the week in Springfield and Shelbyville. We can observe slight differences in rankings on Monday, but:
* Both in Springfield and Shelbyville, users mostly listen to pop music.

Therefore, this hypothesis cannot be accepted. It is also important to remember that the obtained results might differ if we did not have missing values.

3. It turns out that the music preferences of users from Springfield and Shelbyville are very similar.

The third hypothesis is rejected. If there were indeed differences in preferences, unfortunately, this cannot be determined from this data.

[Back to Table of Contents](#back)