# Musical research

In this project, we compare listening behavior in two cities, Moscow and St. Petersburg, using data from the Yandex.Music service.

The comparison of Moscow and St. Petersburg is surrounded by myths. For example:
 * Moscow is a megalopolis, subject to the rigid rhythm of the work week;
 * St. Petersburg is a cultural capital, with its own tastes.

**The aim of the study** is to test three hypotheses:
1. User activity depends on the day of the week, and this dependence differs between Moscow and St. Petersburg.
2. On Monday mornings, different genres prevail in Moscow and in St. Petersburg. The same holds for Friday evenings.
3. Moscow and St. Petersburg prefer different genres of music: pop music is more popular in Moscow, Russian rap in St. Petersburg.

**Study plan**.

The user behavior data is stored in the file `yandex_music_project.csv`. Nothing is known about the quality of the data, so we need to review it before testing the hypotheses.

We will check the data for errors and assess their impact on the study. Then, in the preprocessing phase, we will look for an opportunity to fix the most critical data errors.
 
Thus, the study will take place in three stages:
 1. Data review.
 2. Data preprocessing.
 3. Hypothesis testing.


## Data review

In [1]:
# import the required libraries
import pandas as pd

In [3]:
# read the data file and save it to df
df = pd.read_csv('yandex_music_project.csv', index_col=[0])
df.head(10)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8FA1D3BE,И вновь продолжается бой,,ruspop,Moscow,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Saint-Petersburg,21:20:49,Wednesday


In [4]:
# get general information about the data in table df
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 4.0+ MB


There are seven columns in the table. The data type in all columns is `object`.

According to the data documentation:
* `userID` - user ID;
* `Track` - track name;  
* `artist` - artist name;
* `genre` - genre name;
* `City` - user's city;
* `time` - start time of listening;
* `Day` - day of the week

Three columns (`Track`, `artist`, and `genre`) contain missing values: their non-null counts are below the 65079 rows of the table.

### Findings

Each row in the table contains data about one track that a user listened to. Some columns describe the track itself: title, artist, and genre. The rest describe the user: which city the user is from and when the user listened to the track.

Tentatively, the data is sufficient to test the hypotheses. However, there are missing values in the data, and the column names deviate from good style.

To move forward, the problems in the data need to be corrected.

## Data preprocessing

### Header style

In [6]:
# list of column names of the table df
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

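The names mix upper and lower case, and two of them contain leading or trailing spaces. Let's rename the columns to a consistent lowercase style.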

In [7]:
# renaming columns
df = df.rename({'  userID': 'user_id', 'Track': 'track', '  City  ': 'city','Day': 'day'}, axis='columns')
# check results - list of column names
df.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
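
Renaming by hand works well for seven columns. For wider tables, the same cleanup could be expressed as a rule. The sketch below assumes a hypothetical helper `to_snake_case()` (it is not part of the original notebook) that strips surrounding whitespace, inserts an underscore at each lowercase-to-uppercase boundary, and lowercases the result.

```python
import re

import pandas as pd


def to_snake_case(name):
    """Hypothetical helper: normalize a raw header to snake_case."""
    stripped = name.strip()
    # insert '_' wherever a lowercase letter is directly followed by an uppercase one
    with_underscores = re.sub(r'(?<=[a-z])(?=[A-Z])', '_', stripped)
    return with_underscores.lower()


# empty toy frame with the same headers as the raw file
raw = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'])
clean = raw.rename(columns=to_snake_case)
print(list(clean.columns))
```

Passing a function to `rename(columns=...)` applies it to every header, so no explicit mapping is needed. An explicit dictionary, as used above, remains easier to audit for a small table.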

### Missing values
First, let's count how many missing values there are in the table.

In [8]:
# count missing values
df.isna().sum()

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64

Not all missing values affect the study. In `track` and `artist` they are not important for our work; it is enough to replace them with an explicit marker.

Missing values in `genre`, however, may interfere with the comparison of musical tastes in Moscow and St. Petersburg. We will have to:
* fill in these gaps with an explicit marker as well,
* assess how much they distort the calculations.
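
One way to start that assessment is to look at the share of rows without a genre, overall and per city, before the gaps are filled. In the real table, 1198 of 65079 rows (about 1.8%) lack a genre. The sketch below uses a made-up toy frame named `sample` to show the calculation; the names and values are illustrative assumptions only.

```python
import numpy as np
import pandas as pd

# toy data standing in for the real table (values are made up)
sample = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Moscow', 'Saint-Petersburg', 'Saint-Petersburg'],
    'genre': ['pop', np.nan, 'rock', 'dance', np.nan, 'jazz'],
})

# fraction of rows with a missing genre, overall
overall_share = sample['genre'].isna().mean()

# the same fraction computed separately for each city
share_by_city = sample['genre'].isna().groupby(sample['city']).mean()

print(overall_share)
print(share_by_city)
```

The mean of a boolean mask is the fraction of `True` values, which is why `isna().mean()` yields a share rather than a count.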

Replace the missing values in the columns `track`, `artist`, and `genre` with the string `'unknown'`. To do this, create a list `columns_to_replace`, loop through its elements with a `for` loop, and replace the missing values in each column:

In [9]:
# loop through the column names and replace missing values with 'unknown'
columns_to_replace = ['track', 'artist', 'genre']
for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')
# count missing values
df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64
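
The loop is explicit and easy to follow. As an alternative sketch, `fillna()` can also be applied to a subset of columns in a single assignment. The toy frame `sample` below is a made-up stand-in for the real table.

```python
import numpy as np
import pandas as pd

# toy data with gaps in the three text columns (values are illustrative)
sample = pd.DataFrame({
    'track': ['True', np.nan],
    'artist': [np.nan, 'Mario Lanza'],
    'genre': ['dance', np.nan],
    'city': ['Moscow', 'Saint-Petersburg'],
})

columns_to_replace = ['track', 'artist', 'genre']
# fillna on a column subset fills all selected columns in one step
sample[columns_to_replace] = sample[columns_to_replace].fillna('unknown')

print(sample.isna().sum())
```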

### Duplicates
Count explicit duplicates in the table with one command:

In [10]:
# count explicit duplicates
df.duplicated().sum()

3826

In [11]:
# remove explicit duplicates (dropping the old index and building a new one)
df = df.drop_duplicates().reset_index(drop=True) 
# check for duplicates
df.duplicated().sum()

0

Now get rid of the implicit duplicates in the `genre` column.

Display a list of unique genre names, sorted alphabetically. To do this:
* retrieve the desired dataframe column,
* call the method that returns the unique values of the column,
* sort the result.

In [12]:
# Viewing unique genre titles
sorted(df['genre'].unique())

['acid',
 'acoustic',
 'action',
 'adult',
 'africa',
 'afrikaans',
 'alternative',
 'alternativepunk',
 'ambient',
 'americana',
 'animated',
 'anime',
 'arabesk',
 'arabic',
 'arena',
 'argentinetango',
 'art',
 'audiobook',
 'author',
 'avantgarde',
 'axé',
 'baile',
 'balkan',
 'beats',
 'bigroom',
 'black',
 'bluegrass',
 'blues',
 'bollywood',
 'bossa',
 'brazilian',
 'breakbeat',
 'breaks',
 'broadway',
 'cantautori',
 'cantopop',
 'canzone',
 'caribbean',
 'caucasian',
 'celtic',
 'chamber',
 'chanson',
 'children',
 'chill',
 'chinese',
 'choral',
 'christian',
 'christmas',
 'classical',
 'classicmetal',
 'club',
 'colombian',
 'comedy',
 'conjazz',
 'contemporary',
 'country',
 'cuban',
 'dance',
 'dancehall',
 'dancepop',
 'dark',
 'death',
 'deep',
 'deutschrock',
 'deutschspr',
 'dirty',
 'disco',
 'dnb',
 'documentary',
 'downbeat',
 'downtempo',
 'drum',
 'dub',
 'dubstep',
 'eastern',
 'easy',
 'electronic',
 'electropop',
 'emo',
 'entehno',
 'epicmetal',
 'estrada',


We found the following implicit duplicates of the genre `hiphop`:
* *hip*,
* *hop*,
* *hip-hop*.

To remove them from the table, write a function `replace_wrong_genres()` with two parameters:
* `wrong_genres` - list of duplicates,
* `correct_genre` - the string with the correct value.

The function should fix the `genre` column in table `df`: replace each value from the `wrong_genres` list with the value of `correct_genre`.

In [14]:
# Function for replacing implicit duplicates
def replace_wrong_genres(wrong_genres, correct_genre):
    for genre in wrong_genres:
        df['genre'] = df['genre'].replace(genre, correct_genre)
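
The function above modifies the global `df`. A possible variant, sketched here under the hypothetical name `replace_wrong_genres_in()`, takes the table as a parameter and passes the whole list to `Series.replace()` at once. This is an illustration, not the notebook's own code.

```python
import pandas as pd


def replace_wrong_genres_in(table, wrong_genres, correct_genre):
    """Hypothetical variant: return a copy of `table` with genres normalized."""
    result = table.copy()
    # Series.replace accepts a list of values to be replaced by a single value;
    # it matches whole values, so 'pop' is not affected by replacing 'hop'
    result['genre'] = result['genre'].replace(wrong_genres, correct_genre)
    return result


sample = pd.DataFrame({'genre': ['hip', 'hop', 'hip-hop', 'hiphop', 'pop']})
fixed = replace_wrong_genres_in(sample, ['hip', 'hop', 'hip-hop'], 'hiphop')
print(sorted(fixed['genre'].unique()))
```

Returning a new table keeps the function usable on any dataframe and leaves the input untouched.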

Call `replace_wrong_genres()` with arguments that remove the implicit duplicates: instead of `hip`, `hop`, and `hip-hop`, the table should contain the value `hiphop`:

In [15]:
# Eliminate implicit duplicates
replace_wrong_genres(['hip', 'hop', 'hip-hop'],'hiphop')
# Checking for implicit duplicates
sorted(df['genre'].unique())

['acid',
 'acoustic',
 'action',
 'adult',
 'africa',
 'afrikaans',
 'alternative',
 'alternativepunk',
 'ambient',
 'americana',
 'animated',
 'anime',
 'arabesk',
 'arabic',
 'arena',
 'argentinetango',
 'art',
 'audiobook',
 'author',
 'avantgarde',
 'axé',
 'baile',
 'balkan',
 'beats',
 'bigroom',
 'black',
 'bluegrass',
 'blues',
 'bollywood',
 'bossa',
 'brazilian',
 'breakbeat',
 'breaks',
 'broadway',
 'cantautori',
 'cantopop',
 'canzone',
 'caribbean',
 'caucasian',
 'celtic',
 'chamber',
 'chanson',
 'children',
 'chill',
 'chinese',
 'choral',
 'christian',
 'christmas',
 'classical',
 'classicmetal',
 'club',
 'colombian',
 'comedy',
 'conjazz',
 'contemporary',
 'country',
 'cuban',
 'dance',
 'dancehall',
 'dancepop',
 'dark',
 'death',
 'deep',
 'deutschrock',
 'deutschspr',
 'dirty',
 'disco',
 'dnb',
 'documentary',
 'downbeat',
 'downtempo',
 'drum',
 'dub',
 'dubstep',
 'eastern',
 'easy',
 'electronic',
 'electropop',
 'emo',
 'entehno',
 'epicmetal',
 'estrada',


**Findings**.

Preprocessing found three problems in the data:

- header style irregularities,
- missing values,
- duplicates - explicit and implicit.

We corrected the headers to make the table easier to work with. Without duplicates, the study will be more accurate.

We replaced the missing values with `'unknown'`. It remains to be seen whether the omissions in the `genre` column will harm the study.

Now we can move on to testing hypotheses. 

## Hypothesis testing

### Comparison of the behavior of users of the two capitals

The first hypothesis states that users listen to music differently in Moscow and St. Petersburg. Let's test this hypothesis using data from three days of the week - Monday, Wednesday, and Friday. To do this:

* Split the users into two groups: Moscow and St. Petersburg.
* Compare how many tracks each group listened to on Monday, Wednesday, and Friday.

In [16]:
# count listens in each city
df.groupby('city')['user_id'].count() 

city
Moscow              42741
Saint-Petersburg    18512
Name: user_id, dtype: int64

There are more listens in Moscow than in St. Petersburg. This does not mean that Moscow users listen to music more often. It's just that there are more users in Moscow.
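
The notebook does not verify the claim about the number of users. One way to check it, assuming each `user_id` value corresponds to one user, is to count unique users per city and divide listens by users. The toy frame `sample` below is made up for illustration.

```python
import pandas as pd

# toy data: three Moscow users with six listens, one St. Petersburg user with one
sample = pd.DataFrame({
    'user_id': ['A', 'A', 'B', 'C', 'C', 'C', 'D'],
    'city': ['Moscow'] * 6 + ['Saint-Petersburg'],
})

# named aggregation: total listens and number of distinct users per city
per_city = sample.groupby('city')['user_id'].agg(listens='count', users='nunique')
per_city['listens_per_user'] = per_city['listens'] / per_city['users']

print(per_city)
```

Comparing `listens_per_user` between cities separates the size of the audience from how actively each user listens.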

In [19]:
# count listens on each of the three days
df.groupby('day')['user_id'].count()

day
Friday       21840
Monday       21354
Wednesday    18059
Name: user_id, dtype: int64

Across both cities combined, users are least active on Wednesdays. But the picture can change if you look at each city separately.

Let's create a function `number_tracks()` that counts listens for a given day and city. It needs two parameters:
* day of the week,
* city name.

Inside the function, store in a variable the rows of the original table for which:
  * the value in the `day` column equals the `day` parameter,
  * the value in the `city` column equals the `city` parameter.

To do this, we combine both conditions with logical indexing.

Then count the values in the `user_id` column of the resulting table, save the result in a new variable, and return this variable from the function.

In [20]:
# The function is declared with two parameters: day, city.
def number_tracks(day, city):
    # track_list holds the rows of df for which the value in the 'day' column
    # equals the day parameter and, at the same time, the value in the 'city'
    # column equals the city parameter (logical indexing).
    track_list = df[(df['day'] == day) & (df['city'] == city)]
    # track_list_count holds the number of values in the 'user_id' column,
    # as calculated by count() for the track_list table.
    track_list_count = track_list['user_id'].count()
    # The function returns a number: the value of track_list_count.
    return track_list_count

Let's call `number_tracks()` six times, changing the values of the parameters, so as to obtain data for each city on each of the three days.

In [21]:
number_tracks('Monday', 'Moscow')  # number of listens in Moscow on Mondays

15740

In [22]:
number_tracks('Monday', 'Saint-Petersburg')  # number of listens in Saint-Petersburg on Mondays

5614

In [23]:
number_tracks('Wednesday', 'Moscow')  # number of listens in Moscow on Wednesdays

11056

In [24]:
number_tracks('Wednesday', 'Saint-Petersburg')  # number of listens in Saint-Petersburg on Wednesdays

7003

In [25]:
number_tracks('Friday', 'Moscow')  # number of listens in Moscow on Fridays

15945

In [26]:
number_tracks('Friday', 'Saint-Petersburg')  # number of listens in Saint-Petersburg on Fridays

5895

In [28]:
# create a dataframe with the listening counts
data = [['Moscow',15740,11056,15945],['Saint-Petersburg',5614,7003,5895]]
columns = ['city', 'monday', 'wednesday', 'friday']
pd.DataFrame(data=data,columns=columns)

Unnamed: 0,city,monday,wednesday,friday
0,Moscow,15740,11056,15945
1,Saint-Petersburg,5614,7003,5895
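
The same city-by-day table could also be built in one step instead of six separate calls. The sketch below uses `pd.crosstab()` on a made-up toy frame named `sample`. The column names match the cleaned table, but the rows are illustrative assumptions.

```python
import pandas as pd

# toy data standing in for the cleaned table
sample = pd.DataFrame({
    'user_id': ['A', 'B', 'C', 'D', 'E'],
    'city': ['Moscow', 'Moscow', 'Saint-Petersburg', 'Saint-Petersburg', 'Moscow'],
    'day': ['Monday', 'Friday', 'Wednesday', 'Wednesday', 'Monday'],
})

# crosstab counts rows for every (city, day) pair; reindex puts the days
# in calendar order and fills pairs that never occur with 0
day_order = ['Monday', 'Wednesday', 'Friday']
counts = pd.crosstab(sample['city'], sample['day']).reindex(columns=day_order, fill_value=0)

print(counts)
```

This avoids copying numbers by hand from cell outputs, which is where transcription errors tend to creep in.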


**Findings**.

The data show the difference in user behavior:

- In Moscow, listening peaks on Mondays and Fridays, with a noticeable drop on Wednesdays.
- In St. Petersburg, on the contrary, users listen to more music on Wednesdays. Activity on Mondays and Fridays is lower than on Wednesdays by roughly the same margin.

So, the data speak in favor of the first hypothesis.
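
Because Moscow has far more listens in total, the pattern is easier to compare when each city's counts are converted into shares of that city's own total. A minimal sketch, using the counts from the table above:

```python
import pandas as pd

# counts copied from the table above
counts = pd.DataFrame(
    {'monday': [15740, 5614], 'wednesday': [11056, 7003], 'friday': [15945, 5895]},
    index=['Moscow', 'Saint-Petersburg'],
)

# divide each row by its own total so that every city sums to 1
shares = counts.div(counts.sum(axis=1), axis=0)

print(shares.round(3))
```

In these shares, Wednesday is the smallest of the three days for Moscow and the largest for St. Petersburg, which is the contrast described in the findings.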

### Music at the beginning and end of the week

According to the second hypothesis, on Monday morning in Moscow some genres prevail and in St. Petersburg other genres prevail. Likewise, Friday nights are dominated by different genres, depending on the city.

Let's save the tables with the data into two variables:
* for Moscow - in `moscow_general`;
* for St. Petersburg - in `spb_general`.

In [29]:
# get table moscow_general from those rows of table df, 
# for which the value in the 'city' column is 'Moscow'.
moscow_general = df[df['city']=='Moscow'] 

In [30]:
# get table spb_general from those rows of table df,
# for which the value in the 'city' column is 'Saint-Petersburg'
spb_general = df[df['city']=='Saint-Petersburg'] 

Create a function `genre_weekday()` with four parameters:
* table (dataframe) with data,
* day of the week,
* start timestamp in 'hh:mm' format,
* end timestamp in 'hh:mm' format.

The function should return information about the top 10 genres of the tracks that were listened to on the specified day, between the two time stamps.

In [31]:
# Definition of genre_weekday() with parameters table, day, time1, time2,
# which returns information about the most popular genres on the specified day
# at the specified time:
def genre_weekday(table, day, time1, time2):
    # 1) genre_df holds those rows of the passed dataframe table for which,
    #    at the same time:
    #    - the value in the day column equals the day argument
    #    - the value in the time column is greater than the time1 argument
    #    - the value in the time column is less than the time2 argument
    #    (logical indexing with combined conditions)
    genre_df = table[(table['day'] == day) & (table['time'] > time1) & (table['time'] < time2)]
    # 2) group genre_df by the genre column, take one of its columns and
    #    count() the number of records for each genre present;
    #    store the resulting Series in genre_df_count
    genre_df_count = genre_df.groupby('genre')['user_id'].count()
    # 3) sort genre_df_count in descending order and store it in genre_df_sorted
    genre_df_sorted = genre_df_count.sort_values(ascending=False)
    # 4) return the first 10 values of genre_df_sorted: the top 10
    #    genres on the specified day within the specified time window
    return genre_df_sorted.head(10)
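
The time filter in `genre_weekday()` compares strings, not time objects. This works because the timestamps are zero-padded (`'hh:mm:ss'`), so lexicographic order matches chronological order. The sketch below isolates that comparison in a hypothetical helper `in_window()`, introduced here only for illustration.

```python
def in_window(time_str, start, end):
    """Hypothetical helper mirroring the strict string comparison in genre_weekday()."""
    return start < time_str < end


# a Monday-morning timestamp from the sample rows
print(in_window('08:34:34', '07:00', '11:00'))
# an afternoon timestamp
print(in_window('13:00:07', '07:00', '11:00'))
# the bound '07:00' is a prefix of '07:00:00', so it sorts before it
print(in_window('07:00:00', '07:00', '11:00'))
# a timestamp without zero padding: '9' sorts after '1'
print(in_window('9:05:00', '07:00', '11:00'))
```

The last call shows the limitation: a value such as `'9:05:00'` would be rejected even though it falls inside the window. If the data could contain such values, converting the column to a time type before filtering would be safer.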

Let's compare the results of function `genre_weekday()` for Moscow and St. Petersburg on Monday morning (from 7:00 to 11:00) and Friday night (from 17:00 to 23:00):

In [33]:
# function call for Monday morning in Moscow (table moscow_general instead of df);
# the time values are stored as strings and are compared as strings
genre_weekday(moscow_general, 'Monday', '07:00', '11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: user_id, dtype: int64

In [45]:
# function call for Monday morning in St. Petersburg (instead of df - table spb_general)
genre_weekday(spb_general, 'Monday', '07:00', '11:00') 

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: user_id, dtype: int64

In [46]:
# function call for Friday night in Moscow
genre_weekday(moscow_general, 'Friday', '17:00', '23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: user_id, dtype: int64

In [34]:
# function call for Friday night in St. Petersburg
genre_weekday(spb_general, 'Friday', '17:00', '23:00')

genre
pop            256
electronic     216
rock           216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: user_id, dtype: int64

**Conclusions**.

Comparing the top 10 genres on Monday morning leads to these conclusions:

1. In Moscow and St. Petersburg, users listen to similar music. The only difference is that the Moscow top 10 includes the "world" genre, while the St. Petersburg top 10 includes jazz and classical music.

2. In Moscow there were so many missing values that `'unknown'` took tenth place among the most popular genres. This means that the missing values occupy a significant share of the data and threaten the credibility of the study.

Friday night doesn't change that picture. Some genres go a little higher, others go down, but overall the top 10 remains the same.

Thus, the second hypothesis is only partially confirmed:
* Users listen to similar music at the beginning of the week and at the end.
* The difference between Moscow and St. Petersburg is not very pronounced. In Moscow they listen to Russian popular music more often, in St. Petersburg - jazz.

However, the omissions in the data cast doubt on this result. There are so many of them in Moscow that the top 10 ranking could look different if it were not for the missing data on genres.
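
The comparison of the two Monday-morning rankings can be checked mechanically with set operations. The lists below are copied from the outputs above.

```python
# top 10 genres on Monday morning, copied from the outputs above
moscow_monday = ['pop', 'dance', 'electronic', 'rock', 'hiphop',
                 'ruspop', 'world', 'rusrap', 'alternative', 'unknown']
spb_monday = ['pop', 'dance', 'rock', 'electronic', 'hiphop',
              'ruspop', 'alternative', 'rusrap', 'jazz', 'classical']

# genres that appear in only one city's top 10, and those shared by both
only_moscow = set(moscow_monday) - set(spb_monday)
only_spb = set(spb_monday) - set(moscow_monday)
shared = set(moscow_monday) & set(spb_monday)

print(sorted(only_moscow), sorted(only_spb), len(shared))
```

Eight of the ten genres are shared. Moscow's list additionally contains `world` and `unknown`, while St. Petersburg's contains `jazz` and `classical`.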

### Genre preferences in Moscow and St. Petersburg

Hypothesis: St. Petersburg is the rap capital, and music of this genre is listened to there more often than in Moscow. Moscow is a city of contrasts in which pop music nevertheless prevails.

Let's group the table `moscow_general` by genre and count the listens of each genre using the `count()` method. Then sort the result in descending order and save it to `moscow_genres`.

In [35]:
# one line: grouping moscow_general table by 'genre' column, 
# count the number of 'genre' values in that grouping using count(), 
# sort the resulting Series in descending order and store it in moscow_genres
moscow_genres = moscow_general.groupby('genre')['genre'].count().sort_values(ascending = False)
# view first 10 lines of moscow_genres
moscow_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

Now repeat the same for St. Petersburg.

In [36]:
# one line: grouping the spb_general table by the 'genre' column, 
# count the number of 'genre' values in this grouping using count(), 
# sorting the resulting Series in descending order and saving it to spb_genres
spb_genres = spb_general.groupby('genre')['genre'].count().sort_values(ascending = False)
# view the first 10 lines of spb_genres
spb_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

**Findings**

The hypothesis was partly confirmed:
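* Pop music is the most popular genre in Moscow, as the hypothesis suggested. Moreover, a closely related genre, Russian pop (`ruspop`), is also in the top 10.
* Contrary to expectations, rap is about equally popular in Moscow and St. Petersburg: `rusrap` accounts for roughly 2.7% of listens in Moscow (1161 of 42741) and 3.0% in St. Petersburg (564 of 18512).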
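
Raw counts are hard to compare between cities of different size, so it can help to express each genre as a share of the city's total listens. The sketch below uses the totals and genre counts reported in the outputs above. The variable names are illustrative.

```python
import pandas as pd

# total listens per city and selected genre counts, copied from the outputs above
totals = pd.Series({'Moscow': 42741, 'Saint-Petersburg': 18512})
genre_counts = pd.DataFrame(
    {'pop': [5892, 2431], 'rusrap': [1161, 564]},
    index=['Moscow', 'Saint-Petersburg'],
)

# divide every row by the total for that city (aligned on the index)
shares = genre_counts.div(totals, axis=0)

print((shares * 100).round(1))
```

With the full tables, the same shares could be obtained by dividing `moscow_genres` and `spb_genres` by their own sums.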

## Results of the study

We tested three hypotheses and found:

1. The day of the week has a different effect on user activity in Moscow and St. Petersburg. 

The first hypothesis was fully confirmed.

2. Musical preferences do not change much during the week, be it in Moscow or St. Petersburg. Slight differences are noticeable at the beginning of the week, on Mondays:
* In Moscow they listen to "world" music,
* In St. Petersburg they listen to jazz and classical music.

Thus, the second hypothesis was only partly confirmed. The result might have been different if there were no missing values in the data.

3. The tastes of Moscow and St. Petersburg users have more in common than they differ. Contrary to expectations, genre preferences in St. Petersburg resemble those in Moscow.

The third hypothesis was not confirmed. If there are differences in preferences, they are not noticeable for the bulk of the users.