# Yandex.Music

# Contents <a id='back'></a>

* [Introduction](#intro)
* [Stage 1. Data overview](#data_review)
    * [Conclusions](#data_review_conclusions)
* [Stage 2. Data preprocessing](#data_preprocessing)
    * [2.1 Header style](#header_style)
    * [2.2 Missing values](#missing_values)
    * [2.3 Duplicates](#duplicates)
    * [2.4 Conclusions](#data_preprocessing_conclusions)
* [Stage 3. Testing the hypotheses](#hypotheses)
    * [3.1 Hypothesis 1: user activity in the two cities](#activity)
    * [3.2 Hypothesis 2: music preferences on Monday and Friday](#week)
    * [3.3 Hypothesis 3: genre preferences in Springfield and Shelbyville](#genre)
* [Findings](#end)

## Introduction <a id='intro'></a>

In this project, I compared the music preferences of the cities of Springfield and Shelbyville. I studied real Yandex.Music data to test the hypotheses below and compare user behavior for these two cities.

### Goal: 
Test three hypotheses:
1. User activity differs depending on the day of the week and from city to city. 
2. On Monday mornings, Springfield and Shelbyville residents listen to different genres. This is also true for Friday evenings. 
3. Springfield and Shelbyville listeners have different preferences. In Springfield, they prefer pop, while Shelbyville has more rap fans.

### Stages 

My project consisted of three stages:
 1. Data overview
 2. Data preprocessing
 3. Testing the hypotheses
 
[Back to Contents](#back)

## Stage 1. Data overview <a id='data_review'></a>


Import `pandas`:

In [3]:
import pandas as pd


Read the file `music_project_en.csv` from the `/datasets/` folder and save it in the `df` variable:

In [4]:
# reading the file and storing it to df
df = pd.read_csv('/datasets/music_project_en.csv')
df.describe()

          userID  Track     artist  genre         City      time     Day
count      65079  63736      57512  63881        65079     65079   65079
unique     41748  39666      37806    268            2     20392       3
top     A8AE9169  Brand  Kartvelli    pop  Springfield  21:51:22  Friday
freq          76    136        136   8850        45360        14   23149


Print the first 10 table rows:

In [5]:
# obtaining the first 10 rows from the df table
print(df.head(10))

     userID                        Track            artist   genre  \
0  FFB692EC            Kamigata To Boots  The Mass Missile    rock   
1  55204538  Delayed Because of Accident  Andreas Rönnberg    rock   
2    20EC38            Funiculì funiculà       Mario Lanza     pop   
3  A3DD03C9        Dragons in the Sunset        Fire + Ice    folk   
4  E2DC1FAE                  Soul People        Space Echo   dance   
5  842029A1                       Chains          Obladaet  rusrap   
6  4CB90AA5                         True      Roman Messer   dance   
7  F03E1C1F             Feeling This Way   Polina Griffith   dance   
8  8FA1D3BE                     L’estate       Julia Dalia  ruspop   
9  E772D5C0                    Pessimist               NaN   dance   

        City        time        Day  
0  Shelbyville  20:28:33  Wednesday  
1  Springfield  14:07:09     Friday  
2  Shelbyville  20:58:07  Wednesday  
3  Shelbyville  08:37:09     Monday  
4  Springfield  08:34:34     Monday  
5

Obtaining general information about the table:

In [6]:
# obtaining general information about the data in df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
  userID    65079 non-null object
Track       63736 non-null object
artist      57512 non-null object
genre       63881 non-null object
  City      65079 non-null object
time        65079 non-null object
Day         65079 non-null object
dtypes: object(7)
memory usage: 3.5+ MB


The table contains seven columns. They all store the same data type: `object`.

According to the documentation:
- `'userID'` — user identifier
- `'Track'` — track title
- `'artist'` — artist's name
- `'genre'`
- `'City'` — user's city
- `'time'` — the exact time the track was played
- `'Day'` — day of the week

We can see two issues with style in the column names:
1. Some names are uppercase, some are lowercase.
2. There are spaces in some names.

In addition, the non-null counts differ from column to column, which means the data contains missing values.
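
The number of missing values per column follows directly from the `df.info()` output: it is the number of entries minus the non-null count. A minimal sketch of that arithmetic, with the figures copied from the output above (the names `total_rows`, `non_null`, and `missing` are only illustrative):

```python
# Figures copied from the df.info() output above.
total_rows = 65079
non_null = {
    'userID': 65079,
    'Track': 63736,
    'artist': 57512,
    'genre': 63881,
    'City': 65079,
    'time': 65079,
    'Day': 65079,
}

# Missing values per column = total rows - non-null values.
missing = {column: total_rows - count for column, count in non_null.items()}

for column, count in missing.items():
    print(f'{column:<8}{count:>6}')
```

Only `Track`, `artist`, and `genre` come out with a non-zero count.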


### Conclusions <a id='data_review_conclusions'></a> 

Each row in the table stores data on a track that was played. Some columns describe the track itself: its title, artist and genre. The rest convey information about the user: the city they come from, the time they played the track. 

It's clear that the data is sufficient to test the hypotheses. However, there are missing values.

To move forward, we need to preprocess the data.

[Back to Contents](#back)

## Stage 2. Data preprocessing <a id='data_preprocessing'></a>
During this stage, I corrected the formatting in the column headers and examined missing values. 
Then, I checked whether there were duplicates in the data.

### Header style <a id='header_style'></a>
Print the column header:

In [55]:
# the list of column names in the df table
print(df.columns)

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')


Change column names according to the rules of good style:
* If the name has several words, use snake_case
* All characters must be lowercase
* Delete spaces

In [56]:
# renaming columns
df = df.rename(
    columns={
        'Track': 'track',
        'Day': 'day',
        '  userID': 'user_id',
        '  City  ': 'city',
    }
)

Check the transformed result:

In [57]:
# checking result: the list of column names
print(df.columns)

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
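
The three style rules can also be applied programmatically instead of listing every column by hand. The sketch below is one possible way to do it, not the approach used in this project: `to_snake_case()` is a hypothetical helper, and `raw_names` assumes the original headers were the ones used as keys in the `rename()` call above.

```python
import re


def to_snake_case(name):
    """Strip spaces, split camelCase words with '_', and lowercase."""
    stripped = name.strip()
    # Insert '_' wherever a lowercase letter or digit is followed by an uppercase letter.
    with_underscores = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', stripped)
    return with_underscores.lower()


# Assumed original headers, taken from the keys of the rename() call.
raw_names = ['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day']
clean_names = [to_snake_case(name) for name in raw_names]
print(clean_names)
```

Since `rename()` also accepts a function for `columns`, `df.rename(columns=to_snake_case)` would apply the same rule to every header of a DataFrame.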


[Back to Contents](#back)

### Missing values <a id='missing_values'></a>
Use two `pandas` methods to find the number of missing values in the table.

In [58]:
# calculating missing values
print(df.isna().sum()) 

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64


Not all missing values affect the research. For instance, the missing values in `'track'` and `'artist'` are not critical. I will simply replace them with clear markers.

However, missing values in `'genre'` can affect the comparison of music preferences in Springfield and Shelbyville. In real life, it would be useful to learn why the data is missing and try to restore it, but that is not possible within the constraints of this project. Instead, I will:
* Fill in these missing values with markers
* Evaluate how much the missing values may affect my computations

Here, I created the `columns_to_replace` list, looped over it with `for`, and replaced the missing values in each of the columns. The purpose is to replace the missing values in `'track'`, `'artist'`, and `'genre'` with the string `'unknown'`.

In [63]:
# looping over column names and replacing missing values with 'unknown'
columns_to_replace = ['genre', 'artist', 'track']
for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')



Double-check that the table contains no more missing values by counting the missing values again:

In [64]:
# counting missing values
print(df.isna().sum()) 

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64
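
As an alternative to the loop, `fillna()` also accepts a dictionary that maps column names to fill values, so all three columns can be handled in one call. A small sketch on toy data (the `toy` table and the `markers` name are invented for illustration and are not part of the project data):

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'track': ['Chains', None, 'True'],
    'artist': ['Obladaet', 'Space Echo', None],
    'genre': ['rusrap', None, 'dance'],
})

# fillna() accepts a dict mapping column names to fill values.
markers = {column: 'unknown' for column in ['track', 'artist', 'genre']}
toy = toy.fillna(markers)

# Total number of missing values left in the table.
print(toy.isna().sum().sum())
```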


[Back to Contents](#back)

### Duplicates <a id='duplicates'></a>
Find the number of obvious duplicates in the table:

In [11]:
# counting clear duplicates
print(df.duplicated().sum()) 

3826


Call the `pandas` method for getting rid of obvious duplicates:

In [12]:
# removing obvious duplicates
df = df.drop_duplicates() 

Recount explicit duplicates once more to confirm removal:

In [13]:
# checking for duplicates
print(df.duplicated().sum()) 

0
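
Dropping rows leaves gaps in the index. If a gap-free index matters later, `reset_index(drop=True)` can be chained after `drop_duplicates()`. This is an optional extra step, sketched here on toy data invented for illustration:

```python
import pandas as pd

# Toy data, invented for illustration only: rows 0 and 2 are identical.
toy = pd.DataFrame({
    'user_id': ['A1', 'B2', 'A1'],
    'track': ['True', 'Chains', 'True'],
    'time': ['08:34:34', '20:28:33', '08:34:34'],
})

duplicates_before = toy.duplicated().sum()

# Drop repeated rows and rebuild a gap-free 0..n-1 index.
toy = toy.drop_duplicates().reset_index(drop=True)

print(duplicates_before, toy.duplicated().sum(), list(toy.index))
```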


Now the goal is to get rid of implicit duplicates in the `genre` column. For example, the name of a genre can be written in different ways. Such errors will also affect the result.

To do so, I will print a list of unique genre names, sorted in alphabetical order. 
* Retrieve the intended DataFrame column 
* Apply a sorting method to it
* For the sorted column, call the method that will return all unique column values

In [14]:
# viewing unique genre names

print(df['genre'].sort_values().unique())


['acid' 'acoustic' 'action' 'adult' 'africa' 'afrikaans' 'alternative'
 'ambient' 'americana' 'animated' 'anime' 'arabesk' 'arabic' 'arena'
 'argentinetango' 'art' 'audiobook' 'avantgarde' 'axé' 'baile' 'balkan'
 'beats' 'bigroom' 'black' 'bluegrass' 'blues' 'bollywood' 'bossa'
 'brazilian' 'breakbeat' 'breaks' 'broadway' 'cantautori' 'cantopop'
 'canzone' 'caribbean' 'caucasian' 'celtic' 'chamber' 'children' 'chill'
 'chinese' 'choral' 'christian' 'christmas' 'classical' 'classicmetal'
 'club' 'colombian' 'comedy' 'conjazz' 'contemporary' 'country' 'cuban'
 'dance' 'dancehall' 'dancepop' 'dark' 'death' 'deep' 'deutschrock'
 'deutschspr' 'dirty' 'disco' 'dnb' 'documentary' 'downbeat' 'downtempo'
 'drum' 'dub' 'dubstep' 'eastern' 'easy' 'electronic' 'electropop' 'emo'
 'entehno' 'epicmetal' 'estrada' 'ethnic' 'eurofolk' 'european'
 'experimental' 'extrememetal' 'fado' 'film' 'fitness' 'flamenco' 'folk'
 'folklore' 'folkmetal' 'folkrock' 'folktronica' 'forró' 'frankreich'
 'französisch' 

Looking through the list for implicit duplicates, I noticed that the genre `hiphop` also appears under other names, which are either misspellings or alternative names for the same genre:
* `hip`
* `hop`
* `hip-hop`

To get rid of them, I declared the function `replace_wrong_genres()` with two parameters: 
* `wrong_genres=` — the list of duplicates
* `correct_genre=` — the string with the correct value

The function should correct the names in the `'genre'` column from the `df` table, i.e. replace each value from the `wrong_genres` list with the value in `correct_genre`.

In [15]:
# function for replacing implicit duplicates
wrong_genres = ['hip', 'hop', 'hip-hop']
correct_genre = 'hiphop'

def replace_wrong_genres(wrong_genres, correct_genre):
    # replace each wrong name in the 'genre' column of df with the correct one
    for wrong_genre in wrong_genres:
        df['genre'] = df['genre'].replace(wrong_genre, correct_genre)

Call `replace_wrong_genres()` and pass it arguments so that it clears the implicit duplicates (`hip`, `hop`, and `hip-hop`) and replaces them with `hiphop`:

In [16]:
# removing implicit duplicates
replace_wrong_genres(wrong_genres,correct_genre)

<div class="alert alert-success" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2> Alternative Clean-up </h2>
    
We wrote and called a function above that worked correctly.

In general, these manipulations can be performed without a function. To do this, I will need a list and the same **replace** method. It will look like this:
    
    
    wrong_genres = ['hip', 'hop', 'hip-hop']
    correct_genre = 'hiphop'  

    df['genre'] = df['genre'].replace(wrong_genres, correct_genre)
    
    
Or, which is the same:
    
    
    df['genre'] = df['genre'].replace(['hip', 'hop', 'hip-hop'], 'hiphop')
</div>

Double check that the duplicate names were removed by printing the list of unique values from the `'genre'` column:

In [65]:
# checking for implicit duplicates

print(df['genre'].sort_values().unique())

['acid' 'acoustic' 'action' 'adult' 'africa' 'afrikaans' 'alternative'
 'ambient' 'americana' 'animated' 'anime' 'arabesk' 'arabic' 'arena'
 'argentinetango' 'art' 'audiobook' 'avantgarde' 'axé' 'baile' 'balkan'
 'beats' 'bigroom' 'black' 'bluegrass' 'blues' 'bollywood' 'bossa'
 'brazilian' 'breakbeat' 'breaks' 'broadway' 'cantautori' 'cantopop'
 'canzone' 'caribbean' 'caucasian' 'celtic' 'chamber' 'children' 'chill'
 'chinese' 'choral' 'christian' 'christmas' 'classical' 'classicmetal'
 'club' 'colombian' 'comedy' 'conjazz' 'contemporary' 'country' 'cuban'
 'dance' 'dancehall' 'dancepop' 'dark' 'death' 'deep' 'deutschrock'
 'deutschspr' 'dirty' 'disco' 'dnb' 'documentary' 'downbeat' 'downtempo'
 'drum' 'dub' 'dubstep' 'eastern' 'easy' 'electronic' 'electropop' 'emo'
 'entehno' 'epicmetal' 'estrada' 'ethnic' 'eurofolk' 'european'
 'experimental' 'extrememetal' 'fado' 'film' 'fitness' 'flamenco' 'folk'
 'folklore' 'folkmetal' 'folkrock' 'folktronica' 'forró' 'frankreich'
 'französisch' 

[Back to Contents](#back)

### Conclusions <a id='data_preprocessing_conclusions'></a> 

I detected three issues with the data:

- Incorrect header styles
- Missing values
- Obvious and implicit duplicates

The headers have been cleaned up to make processing the table simpler.

All missing values have been replaced with `'unknown'`. But we still have to see whether the missing values in `'genre'` will affect our calculations.

The absence of duplicates will make the results more precise and easier to understand.

The next stage is to test our three hypotheses. 

[Back to Contents](#back)

## Stage 3. Testing hypotheses <a id='hypotheses'></a>

### Hypothesis 1: comparing user behavior in two cities <a id='activity'></a>

According to the first hypothesis, users from Springfield and Shelbyville listen to music differently. This hypothesis will be tested using the data on three days of the week: Monday, Wednesday, and Friday:

* Divide the users into groups by city.
* Compare how many tracks each group played on Monday, Wednesday, and Friday.


In [66]:
# Counting up the tracks played in each city
#df.groupby('city').count()
df.groupby('city')['user_id'].count()


city
Shelbyville    19719
Springfield    45360
Name: user_id, dtype: int64

Springfield has more tracks played than Shelbyville, but I cannot assume that residents of Springfield listen to music more often. Springfield could simply be bigger and have more users.

I will proceed to group data by day of the week and find the number of tracks played on Monday, Wednesday, and Friday.


In [69]:
# Calculating tracks played on each of the three days
#df.groupby('day').count()
df.groupby('day')[['user_id']].count()

           user_id
day
Friday       23149
Monday       22697
Wednesday    19233


Wednesday is the quietest day overall. But if we consider the two cities separately, the conclusion might be different.

Previously, I grouped by city and by day separately. Now I will write a function that counts the tracks for a given combination of the two.

I will create the `number_tracks()` function to calculate the number of songs played for a given day and city. It will require two parameters:
* day of the week
* name of the city

The function will use a variable to store the rows from the original table, where:
  * `'day'` column value is equal to the `day` parameter
  * `'city'` column value is equal to the `city` parameter

Consecutive filtering with logical indexing will be applied, and then the `'user_id'` column values will be counted in the resulting table. 

In [70]:
# creating the function number_tracks()

def number_tracks(day, city):
    """Return the number of tracks played in `city` on `day`."""
    # keep only the columns needed for the count
    track_list = df.loc[:, ['day', 'city', 'user_id']]
    # consecutive filtering with logical indexing: first by day, then by city
    track_list = track_list[track_list['day'] == day]
    track_list = track_list[track_list['city'] == city]
    # count the 'user_id' values in the filtered table
    track_list_count = track_list['user_id'].count()
    return track_list_count

# Only the value of the last expression in a cell is displayed;
# wrap a call in print() to see the result of any other one.
number_tracks('Monday', 'Springfield')
number_tracks('Monday', 'Shelbyville')
number_tracks('Wednesday', 'Springfield')
number_tracks('Wednesday', 'Shelbyville')
number_tracks('Friday', 'Springfield')
number_tracks('Friday', 'Shelbyville')

6259

Call `number_tracks()` six times, changing the parameter values, so that you retrieve the data on both cities for each of the three days.

In [32]:
# the number of songs played in Springfield on Monday
number_tracks('Monday','Springfield')

16715

In [33]:
# the number of songs played in Shelbyville on Monday
number_tracks('Monday','Shelbyville')

5982

In [34]:
# the number of songs played in Springfield on Wednesday
number_tracks('Wednesday','Springfield')

11755

In [35]:
# the number of songs played in Shelbyville on Wednesday
number_tracks('Wednesday','Shelbyville')

7478

In [25]:
# the number of songs played in Springfield on Friday
number_tracks('Friday','Springfield')

15945

In [36]:
# the number of songs played in Shelbyville on Friday
number_tracks('Friday','Shelbyville')

6259

Use `pd.DataFrame` to create a table, where
* Column names are: `['city', 'monday', 'wednesday', 'friday']`
* The data is the results you got from `number_tracks()`

In [71]:
# table with results


data={'city':['Shelbyville','Springfield'],
      'monday':[5614,15740],
      'wednesday':[7003,11056],
      'friday':[5895,15945]
}

dfnew=pd.DataFrame(data)

print(dfnew)

          city  monday  wednesday  friday
0  Shelbyville    5614       7003    5895
1  Springfield   15740      11056   15945


### Conclusions <a id='data_review_conclusions'></a> 

The data reveals differences in user behavior:

- In Springfield, the number of songs played peaks on Mondays and Fridays, while on Wednesday there is a decrease in activity.
- In Shelbyville, on the contrary, users listen to music more on Wednesday. User activity on Monday and Friday is lower.

So the first hypothesis seems to be correct.
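
Because the two cities differ in size, the pattern is easier to compare as shares of each city's own total. A rough sketch using the counts from the results table (the `plays` and `shares` names are only illustrative):

```python
# Play counts copied from the results table above.
plays = {
    'Shelbyville': {'monday': 5614, 'wednesday': 7003, 'friday': 5895},
    'Springfield': {'monday': 15740, 'wednesday': 11056, 'friday': 15945},
}

# Share of each day within its own city, so that city size cancels out.
shares = {
    city: {day: count / sum(days.values()) for day, count in days.items()}
    for city, days in plays.items()
}

for city, days in shares.items():
    print(city, {day: round(share, 3) for day, share in days.items()})
```

Wednesday accounts for about 38% of the Shelbyville plays in the table but only about 26% of the Springfield plays, which matches the conclusion above.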

[Back to Contents](#back)

### Hypothesis 2: music at the beginning and end of the week <a id='week'></a>

According to the second hypothesis, on Monday morning and Friday night, citizens of Springfield listen to genres that differ from ones users from Shelbyville enjoy.

The first step is to get the tables:
* For Springfield — `spr_general`
* For Shelbyville — `shel_general`

In [39]:
# obtaining the spr_general table from the df rows, 
# where the value in the 'city' column is 'Springfield'

spr_general = df[df['city'] == 'Springfield']

In [40]:
# obtaining the shel_general from the df rows,
# where the value in the 'city' column is 'Shelbyville'

shel_general = df[df['city'] == 'Shelbyville']

I will write the `genre_weekday()` function with four parameters:
* A table for data
* The day of the week
* The first timestamp, in 'hh:mm' format
* The last timestamp, in 'hh:mm' format

The function should return info on the 15 most popular genres on a given day within the period between the two timestamps.

In [41]:
# Declaring the genre_weekday() function with the parameters df=, day=, time1=, and time2=.
# It returns the 15 most popular genres on a given day within a given timeframe.

def genre_weekday(df, day, time1, time2):
    # consecutive filtering with logical indexing

    # keep only the rows where the day is equal to day=
    genre_df = df[df['day'] == day]
    # keep only the rows where the time is smaller than time2=
    genre_df = genre_df[genre_df['time'] < time2]
    # keep only the rows where the time is greater than time1=
    genre_df = genre_df[genre_df['time'] > time1]
    # group by genre and count the rows for each genre
    genre_df_count = genre_df.groupby('genre')['genre'].count()
    # sort in descending order so that the most popular genres come first
    genre_df_sorted = genre_df_count.sort_values(ascending=False)
    # return the Series with the 15 most popular genres
    return genre_df_sorted.head(15)
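
The `'time'` column holds strings, so the filters in `genre_weekday()` compare text, not clock times. This works because zero-padded 24-hour strings sort in the same order as the times they represent. A short sketch of that behavior (`in_window()` is a hypothetical helper written only for this illustration):

```python
def in_window(time, time1, time2):
    """Same string comparison as the two time filters in genre_weekday()."""
    return time1 < time < time2


# A Monday-morning play from the sample rows falls inside the window.
print(in_window('08:34:34', '07:00', '11:00'))

# An afternoon play does not.
print(in_window('14:07:09', '07:00', '11:00'))

# Boundaries: '07:00:00' sorts after '07:00', while '11:00:00' does not sort before '11:00'.
print(in_window('07:00:00', '07:00', '11:00'), in_window('11:00:00', '07:00', '11:00'))

# Without zero-padding the ordering breaks: '9:00' sorts after '10:00'.
print('9:00' < '10:00')
```

The approach therefore relies on every value in `'time'` being zero-padded, as they are in the sample rows shown earlier.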


Compare the results of the `genre_weekday()` function for Springfield and Shelbyville on Monday morning (from 07:00 to 11:00) and on Friday evening (from 17:00 to 23:00):

In [42]:
# calling the function for Monday morning in Springfield (use spr_general instead of the df table)
genre_weekday(spr_general, 'Monday', '07:00', '11:00')

genre
pop            830
dance          589
rock           511
electronic     501
hiphop         311
ruspop         203
world          190
rusrap         188
alternative    175
unknown        172
classical      167
metal          126
jazz           109
folk           107
soundtrack      97
Name: genre, dtype: int64

In [43]:
# calling the function for Monday morning in Shelbyville (use shel_general instead of the df table)
shel_general = df[df['city'] == 'Shelbyville']
genre_weekday(shel_general, day='Monday', time1='07:00', time2='11:00')

genre
pop            238
dance          192
rock           173
electronic     154
hiphop          88
ruspop          68
alternative     65
rusrap          56
jazz            47
classical       42
world           39
soundtrack      34
rap             33
rnb             31
metal           28
Name: genre, dtype: int64

In [44]:
# calling the function for Friday evening in Springfield

genre_weekday(spr_general, day='Friday', time1='17:00', time2='23:00')

genre
pop            761
rock           546
dance          521
electronic     510
hiphop         282
world          220
ruspop         184
alternative    176
classical      171
rusrap         151
jazz           121
unknown        117
soundtrack     112
metal           92
rnb             92
Name: genre, dtype: int64

In [45]:
# calling the function for Friday evening in Shelbyville

genre_weekday(shel_general, day='Friday', time1='17:00', time2='23:00')

genre
pop            279
rock           230
electronic     227
dance          221
hiphop         103
alternative     67
jazz            66
rusrap          66
classical       64
world           60
unknown         49
ruspop          49
soundtrack      40
metal           39
rap             39
Name: genre, dtype: int64

### Conclusions <a id='data_review_conclusions'></a> 

Having compared the top 15 genres on Monday morning, we can draw the following conclusions:

1. Users from Springfield and Shelbyville listen to similar music. The top five genres are the same and appear in the same order in both cities.

2. In Springfield, the number of missing values turned out to be so big that the value `'unknown'` came in 10th. This means that missing values make up a considerable portion of the data, which may be a basis for questioning the reliability of our conclusions.

For Friday evening, the situation is similar. Individual genres vary somewhat (dance and electronic switch places within the top five), but on the whole, the top 15 is similar for the two cities.

Thus, the data does not support the second hypothesis:
* Users listen to similar music at the beginning and end of the week.
* There is no major difference between Springfield and Shelbyville. In both cities, pop is the most popular genre.

However, the number of missing values makes this result questionable. In Springfield, there are so many that they affect our top 15. Were we not missing these values, things might look different.
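
For a rough sense of scale, the share of rows without a genre can be estimated from the `df.info()` output in Stage 1. This is only a back-of-the-envelope check: the figures predate duplicate removal, and the variable names are illustrative.

```python
# Figures copied from the df.info() output in Stage 1 (before duplicates were removed).
total_rows = 65079
genre_non_null = 63881

missing_genre = total_rows - genre_non_null
missing_share = missing_genre / total_rows

print(f'{missing_genre} rows without a genre ({missing_share:.1%} of all rows)')
```

The overall share is small, but because the rows are spread across a few hundred genres, it is still enough for `'unknown'` to reach the top 15 in some of the tables above.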

[Back to Contents](#back)

### Hypothesis 3: genre preferences in Springfield and Shelbyville <a id='genre'></a>

Hypothesis: Shelbyville loves rap music. Springfield's citizens are more into pop.

The `spr_general` table is grouped by genre, and the number of songs played for each genre is tallied with the `count()` method. The result is sorted in descending order and stored in `spr_genres`.

In [72]:
# group the spr_general table by the 'genre' column,
# count the 'track' values in each group with count(),
# sort the resulting Series in descending order, and store it to spr_genres
spr_genres = spr_general.groupby('genre')['track'].count().sort_values(ascending=False)


To see the first 10 rows from `spr_genres`:

In [73]:
# printing the first 10 rows of spr_genres
print(spr_genres.head(10))

genre
pop            6253
dance          4707
rock           4188
electronic     4010
hiphop         2215
classical      1712
world          1516
alternative    1466
ruspop         1453
rusrap         1239
Name: track, dtype: int64



The `shel_general` table is grouped by genre, and the number of songs played for each genre is tallied with the `count()` method. The result is sorted in descending order and stored in `shel_genres`.

In [53]:
# group the shel_general table by the 'genre' column,
# count the 'track' values in each group with count(),
# sort the resulting Series in descending order, and store it to shel_genres
shel_genres = shel_general.groupby('genre')['track'].count().sort_values(ascending=False)


Print the first 10 rows of `shel_genres`:

In [54]:
# printing the first 10 rows from shel_genres
print(shel_genres.head(10))

genre
pop            2597
dance          2054
rock           2004
electronic     1842
hiphop         1020
alternative     700
classical       684
rusrap          604
ruspop          565
world           553
Name: track, dtype: int64


### Conclusions <a id='data_review_conclusions'></a>  

The hypothesis has been partially proven true:
* Pop music is the most popular genre in Springfield, as expected.
* However, pop music turned out to be the most popular genre in Shelbyville as well, and rap wasn't in the top 5 for either city.
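
The two rankings can be compared directly. The sketch below copies the genre names from the two top-10 outputs above (the list and variable names are only illustrative):

```python
# Top-10 genres copied from the two outputs above, in ranked order.
spr_top = ['pop', 'dance', 'rock', 'electronic', 'hiphop',
           'classical', 'world', 'alternative', 'ruspop', 'rusrap']
shel_top = ['pop', 'dance', 'rock', 'electronic', 'hiphop',
            'alternative', 'classical', 'rusrap', 'ruspop', 'world']

# Do both cities have the same ten genres, regardless of order?
same_genres = set(spr_top) == set(shel_top)
# Are the first five places identical, including their order?
same_top_five = spr_top[:5] == shel_top[:5]
# Does the genre labelled 'rap' appear in either top 10?
rap_in_either = 'rap' in spr_top or 'rap' in shel_top

print(same_genres, same_top_five, rap_in_either)
```

Both top 10s contain the same genres, the first five places match exactly, and the genre labelled `rap` appears in neither list; the cities differ only in the order of places six to ten.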


[Back to Contents](#back)

# Findings <a id='end'></a>

We have tested the following three hypotheses:

1. User activity differs depending on the day of the week and from city to city. 
2. On Monday mornings, Springfield and Shelbyville residents listen to different genres. This is also true for Friday evenings. 
3. Springfield and Shelbyville listeners have different preferences. In Springfield, they prefer pop, while Shelbyville has more rap fans.

After analyzing the data, I concluded:

1. User activity in Springfield and Shelbyville depends on the day of the week, though the cities vary in different ways. The first hypothesis is fully accepted.

2. Musical preferences do not vary significantly over the course of the week in either Springfield or Shelbyville. There are small differences in the rankings, but in both cities people listen to pop music the most. We cannot accept the second hypothesis. We must also keep in mind that the result could have been different if not for the missing values.

3. It turns out that the musical preferences of users from Springfield and Shelbyville are quite similar. The third hypothesis is rejected. If there is any difference in preferences, it cannot be seen from this data.


[Back to Contents](#back)