# Yandex.Music

The comparison of Moscow and St. Petersburg is surrounded by myths. For example:
 * Moscow is a metropolis subject to the rigid rhythm of the working week;
 * St. Petersburg is a cultural capital, with its own tastes.

Using Yandex.Music data, you will compare the behavior of users in the two capitals.

**Research Purpose** - Test three hypotheses:
1. User activity depends on the day of the week. Moreover, in Moscow and St. Petersburg this manifests itself in different ways.
2. On Monday morning, some genres prevail in Moscow, while others prevail in St. Petersburg. Similarly, Friday evenings are dominated by different genres, depending on the city.
3. Moscow and St. Petersburg prefer different genres of music. In Moscow, they listen to pop music more often, in St. Petersburg - Russian rap.

**Research plan**

You will receive data on user behavior from the `yandex_music_project.csv` file. Nothing is known about the quality of the data. Therefore, before testing hypotheses, a review of the data is needed.

You will check the data for errors and assess their impact on the study. Then, in the pre-processing phase, you will look for opportunities to correct the most critical data errors.
 
Thus, the study will take place in three stages:
 1. Data review.
 2. Data preprocessing.
 3. Hypothesis testing.

## Data review

Let's get a first impression of the Yandex.Music data.

The main analytics tool is `pandas`. Let's import this library.

In [1]:
import pandas as pd

Read the `yandex_music_project.csv` file from the `/datasets` folder and save it in the `df` variable:

In [2]:
df = pd.read_csv('/datasets/yandex_music_project.csv')

Let's display the first ten rows of the table:

In [3]:
df.head(10)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8FA1D3BE,И вновь продолжается бой,,ruspop,Moscow,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Saint-Petersburg,21:20:49,Wednesday


With one command, we get general information about the table:

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
  userID    65079 non-null object
Track       63848 non-null object
artist      57876 non-null object
genre       63881 non-null object
  City      65079 non-null object
time        65079 non-null object
Day         65079 non-null object
dtypes: object(7)
memory usage: 3.5+ MB


So the table has seven columns. The data type in all columns is `object`.

According to the data documentation:
* `userID` - user ID;
* `Track` - track name;
* `artist` - artist name;
* `genre` - genre name;
* `City` - user's city;
* `time` - start time of listening;
* `Day` - day of the week.

There are three style violations in the column headings:
1. Lowercase letters are combined with uppercase.
2. Some names contain leading or trailing spaces.
3. Multi-word names are not written in snake_case (for example, `userID`).


The number of values in the columns varies. This means there are missing values in the data.

**Conclusions**

Each row of the table contains data about one track play. Some of the columns describe the track itself: title, artist and genre. The rest describe the user: which city they are from and when they listened to the music.

At first glance, there is enough data to test the hypotheses. But there are missing values in the data, and the column names deviate from good naming style.

To move forward, we need to fix problems in the data.

## Data preprocessing
Let's fix the style in the column headings and handle the missing values. Then we check the data for duplicates.

### Heading style
Let's display the column names:

In [5]:
print(df.columns)

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')


Let's bring the names in line with good style:
* write multi-word names in snake_case,
* make all characters lowercase,
* remove spaces.

To do this, rename the columns like this:
* `'  userID'` → `'user_id'`;
* `'Track'` → `'track'`;
* `'  City  '` → `'city'`;
* `'Day'` → `'day'`.

In [6]:
df = df.rename(columns={'  userID': 'user_id', 'Track': 'track', '  City  ' : 'city', 'Day':'day'})

Let's check the result. To do this, display the column names again:

In [7]:
print(df.columns)

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
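
The manual renaming above can also be expressed as a rule. The following is only a sketch: the `to_snake_case()` helper and the empty `demo` frame are hypothetical and not part of the original notebook, and the regular expression covers simple camelCase names such as `userID`, not every possible naming scheme.

```python
import re

import pandas as pd


def to_snake_case(name):
    """Strip surrounding spaces and convert a simple camelCase name to snake_case."""
    name = name.strip()
    # Insert an underscore where a lowercase letter or digit is followed
    # by an uppercase letter: 'userID' -> 'user_ID'.
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    return name.lower()


# Hypothetical empty frame that only reproduces the original headings.
demo = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'])
demo = demo.rename(columns=to_snake_case)
print(list(demo.columns))
```

Passing a function to `rename(columns=...)` applies it to every heading, so no name can be forgotten, at the cost of less explicit control than the dictionary used above.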


### Missing values
First, let's calculate how many missing values are in the table. Two `pandas` methods are enough for this:

In [8]:
df.isna().sum()

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64

Not all missing values affect the study. In `track` and `artist` the missing values are not important for this work; it is enough to replace them with an explicit placeholder.

But missing values in `genre` can interfere with the comparison of musical tastes in Moscow and St. Petersburg. In practice, it would be correct to determine the cause of the gaps and restore the data. That option is not available within this course, so we have to:
* fill in these gaps with an explicit placeholder,
* estimate how much they will distort the calculations.

Let's replace the missing values in the `track`, `artist` and `genre` columns with the string `'unknown'`. To do this, create a `columns_to_replace` list, iterate through its elements with a `for` loop, and for each column, replace the missing values:

In [9]:
columns_to_replace = ['track','artist','genre']
for i in columns_to_replace:
    df[i] = df[i].fillna('unknown')

In [10]:
df.head(5)

Unnamed: 0,user_id,track,artist,genre,city,time,day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday


Make sure there are no gaps in the table. To do this, let's count the missing values again.

In [11]:
df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64
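
As an alternative to the `for` loop, `fillna()` can be applied to several columns in one assignment. This is a minimal sketch on a made-up `demo` table (the rows are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the real table, with gaps in three columns.
demo = pd.DataFrame({
    'track': ['True', np.nan, 'Pessimist'],
    'artist': ['Roman Messer', np.nan, np.nan],
    'genre': ['dance', 'ruspop', np.nan],
    'city': ['Moscow', 'Moscow', 'Saint-Petersburg'],
})

columns_to_replace = ['track', 'artist', 'genre']
# Selecting a list of columns returns a sub-frame; fillna() fills all of them at once.
demo[columns_to_replace] = demo[columns_to_replace].fillna('unknown')
print(demo.isna().sum())
```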

### Duplicates
Let's count obvious duplicates in the table with one command:

In [12]:
df.duplicated().sum()

3826

Let's call a special `pandas` method to remove obvious duplicates:

In [13]:
df = df.drop_duplicates().reset_index(drop = True)

Once again, let's count the explicit duplicates in the table - make sure that we completely get rid of them:

In [14]:
df.duplicated().sum()

0
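
Before dropping duplicates it can help to look at them. The sketch below uses a made-up `demo` table; `keep=False` marks every member of a duplicated group, so the repeated rows can be inspected side by side.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 are identical, rows 3 and 4 differ by one second.
demo = pd.DataFrame({
    'user_id': ['A1', 'A1', 'B2', 'C3', 'C3'],
    'track': ['Song', 'Song', 'Song', 'Tune', 'Tune'],
    'time': ['08:00:00', '08:00:00', '08:00:00', '21:15:30', '21:15:31'],
})

# Show all rows that take part in an exact duplicate, not only the repeats.
print(demo[demo.duplicated(keep=False)])

# Drop the repeats and rebuild a gap-free index, as in the cell above.
deduplicated = demo.drop_duplicates().reset_index(drop=True)
print(len(demo), len(deduplicated))
```

Only rows that match in every column count as explicit duplicates, which is why the two plays one second apart both survive.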

Now let's get rid of the implicit duplicates in the `genre` column. For example, the name of the same genre can be spelled slightly differently. Such errors will also affect the result of the study.

Let's display a list of unique genre names, sorted alphabetically. To do this:
* extract the desired dataframe column,
* apply the sorting method to it,
* for the sorted column, call the method that returns its unique values.

In [15]:
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

Look through the list for implicit duplicates of the name `hiphop`. These may be misspelled or alternative names of the same genre.

You will see the following duplicates:
* *hip*,
* *hop*,
* *hip-hop*.

To clear the table of them, write the `replace_wrong_genres()` function with two parameters:
* `wrong_genres` - list of duplicates,
* `correct_genre` - string with the correct value.

The function should correct the `genre` column in the `df` table: replace each value from the `wrong_genres` list with a value from `correct_genre`.

In [16]:
def replace_wrong_genres(wrong_genres, correct_genre):
    for i in wrong_genres:
        df['genre'] = df['genre'].replace(i,correct_genre)

Let's call `replace_wrong_genres()` and pass it arguments such that it eliminates duplicates: instead of `hip`, `hop` and `hip-hop` the table should have the value `hiphop`:

In [17]:
correct_genre = 'hiphop'
wrong_genres = ['hip', 'hop', 'hip-hop']

replace_wrong_genres(wrong_genres,correct_genre)
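
The loop inside `replace_wrong_genres()` is not strictly required: `Series.replace()` also accepts a list of values to replace. A minimal sketch on a made-up `genres` series:

```python
import pandas as pd

# Hypothetical genre values, including the three variants to be unified.
genres = pd.Series(['hip', 'hop', 'hip-hop', 'hiphop', 'pop'])

# A list as the first argument replaces every listed value with the one scalar.
cleaned = genres.replace(['hip', 'hop', 'hip-hop'], 'hiphop')
print(cleaned.tolist())
```

The replacement matches whole values only, so `pop` is left untouched even though it ends in `op`.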

Let's check that we have replaced the wrong names:

* hip
* hop
* hip-hop

First display a few rows that now carry the `hiphop` value, then output a sorted list of unique values in the `genre` column:

In [18]:
df[df['genre'] =='hiphop'].head(3)

Unnamed: 0,user_id,track,artist,genre,city,time,day
20,201CF2A8,Ya'll In Trouble,Lil Tee Chill Tank Young Buck Brother Mohammed...,hiphop,Moscow,08:46:03,Monday
46,825997A5,Glorious Feeling,Joelistics,hiphop,Moscow,21:46:34,Friday
79,1DA07AA4,Cardi B,Money Man,hiphop,Saint-Petersburg,14:02:14,Monday


In [19]:
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

**Conclusions**

Preprocessing found three problems in the data:

- heading style issues,
- missing values,
- duplicates, both explicit and implicit.

We've fixed the headers to make the table easier to work with. Without duplicates, the study will become more accurate.

You have replaced missing values with `'unknown'`. It remains to be seen whether the gaps in the `genre` column will harm the study.

Now we can move on to hypothesis testing.

## Hypothesis testing

### Comparison of user behavior in two capitals

The first hypothesis states that users listen to music differently in Moscow and St. Petersburg. Let's check this assumption using data for three days of the week: Monday, Wednesday and Friday. To do this:

* separate the users of Moscow and St. Petersburg;
* compare how many tracks each group of users listened to on Monday, Wednesday and Friday.

For practice, we first perform each of the calculations separately.

Let's evaluate the activity of users in each city. Let's group the data by city and count the plays in each group.

In [20]:
df.groupby('city').count()

city,user_id,track,artist,genre,time,day
Moscow,42741,42741,42741,42741,42741,42741
Saint-Petersburg,18512,18512,18512,18512,18512,18512


There are more plays in Moscow than in St. Petersburg. It does not follow that Moscow users listen to music more often; there are simply more users in Moscow.

Now let's group the data by day of the week and count the plays on Monday, Wednesday, and Friday. Keep in mind that the data covers only these three days.

In [21]:
df.groupby('day').count()

day,user_id,track,artist,genre,city,time
Friday,21840,21840,21840,21840,21840,21840
Monday,21354,21354,21354,21354,21354,21354
Wednesday,18059,18059,18059,18059,18059,18059


On average, users from the two cities are less active on Wednesdays. But the picture may change if we consider each city separately.

We have seen how grouping by city and by day of the week works. Now let's write a function that combines these two calculations.

Let's create a `number_tracks()` function that counts the plays for a given day and city. It takes two parameters:
* day of the week,
* city name.

In the function, we save into a variable the rows of the source table for which:
  * the value in the `day` column equals the `day` parameter,
  * the value in the `city` column equals the `city` parameter.

To do this, we filter with boolean indexing, combining both conditions.

Then we count the values in the `user_id` column of the resulting table, store the result in a new variable, and return that variable from the function.

In [22]:
def number_tracks(day, city):
    track_list = df[(df['day'] == day) & (df['city'] == city)]
    track_list_count = track_list['user_id'].count()
    return track_list_count

Let's call `number_tracks()` six times, changing the value of the parameters so that we get data for each city on each of the three days.

In [23]:
# number of plays in Moscow on Mondays
m_m = number_tracks('Monday','Moscow')

In [24]:
# number of plays in St. Petersburg on Mondays
spb_m = number_tracks('Monday','Saint-Petersburg')

In [25]:
# number of plays in Moscow on Wednesdays
m_w = number_tracks('Wednesday','Moscow')

In [26]:
# number of plays in St. Petersburg on Wednesdays
spb_w = number_tracks('Wednesday','Saint-Petersburg')

In [27]:
# number of plays in Moscow on Fridays
m_f = number_tracks('Friday','Moscow')

In [28]:
# number of plays in St. Petersburg on Fridays
spb_f = number_tracks('Friday','Saint-Petersburg')

Let's create a table using the `pd.DataFrame` constructor, where
* column names - `['city', 'monday', 'wednesday', 'friday']`;
* data is the results you got with `number_tracks`.

In [29]:
# table with the results
col_df = ['city', 'monday', 'wednesday', 'friday']
dt_df = [['Moscow',m_m,m_w,m_f],['Saint-Petersburg',spb_m, spb_w, spb_f]]

Q = pd.DataFrame(dt_df, columns = col_df)
Q.head()

Unnamed: 0,city,monday,wednesday,friday
0,Moscow,15740,11056,15945
1,Saint-Petersburg,5614,7003,5895
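
The six separate calls can be condensed into a single cross-tabulation. The sketch below runs on a made-up `demo` table (five illustrative plays, not the real data) and uses `pd.crosstab()` to count plays for every city and day combination at once.

```python
import pandas as pd

# Hypothetical plays: one row per play, with only the two columns that matter here.
demo = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Saint-Petersburg', 'Saint-Petersburg'],
    'day': ['Monday', 'Friday', 'Monday', 'Wednesday', 'Monday'],
})

# Rows are cities, columns are days, cells are play counts.
plays = pd.crosstab(demo['city'], demo['day'])

# Put the days in calendar order instead of alphabetical order.
plays = plays.reindex(columns=['Monday', 'Wednesday', 'Friday'], fill_value=0)
print(plays)
```

Applied to the real `df`, the same two lines would be expected to reproduce the table above, with the days as column names in their original capitalization.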


**Conclusions**

The data shows the difference in user behavior:

- In Moscow, the peak of listening falls on Monday and Friday, and on Wednesday there is a noticeable decline.
- In St. Petersburg, on the contrary, they listen to music more on Wednesdays. Activity on Monday and Friday is lower than on Wednesday by almost the same amount.

So the data support the first hypothesis.

### Music at the beginning and end of the week

According to the second hypothesis, on Monday morning certain genres predominate in Moscow, while others dominate in St. Petersburg. Similarly, Friday evenings are dominated by different genres, depending on the city.

Let's save tables with data in two variables:
* in Moscow - in `moscow_general`;
* in St. Petersburg - in `spb_general`.

In [30]:
moscow_general = df[df['city'] == 'Moscow']

In [31]:
spb_general = df[df['city'] == 'Saint-Petersburg']

In [32]:
print(len(moscow_general), len(spb_general))

42741 18512


Let's create a `genre_weekday()` function with four parameters:
* table (dataframe) with data,
* day of the week,
* initial timestamp in 'hh:mm' format,
* last timestamp in 'hh:mm' format.

The function should return information about the top 10 genres of those tracks that were listened to on the specified day, in the interval between two timestamps.

In [33]:
def genre_weekday(table, day, time1, time2):
    genre_df = table[(table['day'] == day)&(table['time'] > time1)&(table['time'] < time2)]
    genre_df_count = genre_df.groupby('genre')['time'].count()
    genre_df_sorted = genre_df_count.sort_values(ascending = False)
    return(genre_df_sorted.head(10))
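
The filter in `genre_weekday()` compares `'hh:mm:ss'` strings with `'hh:mm'` bounds. This works because zero-padded time strings sort lexicographically in chronological order. A small sketch on made-up timestamps (the `times` series is hypothetical) shows how the boundaries behave:

```python
import pandas as pd

# Hypothetical timestamps around the 07:00 and 11:00 boundaries.
times = pd.Series(['06:59:59', '07:00:00', '07:00:01', '10:59:59', '11:00:00'])

# Plain string comparison, exactly as in genre_weekday().
mask = (times > '07:00') & (times < '11:00')
selected = times[mask].tolist()
print(selected)
```

A longer string that starts with the bound compares as greater than the bound itself. So `'07:00:00'` passes the lower check, while `'11:00:00'` fails the upper one, and the interval effectively includes its start and excludes its end.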

Let's compare the results of the `genre_weekday()` function for Moscow and St. Petersburg on Monday morning (from 7:00 to 11:00) and Friday evening (from 19:00 to 23:00):

In [34]:
genre_weekday(moscow_general, 'Monday', '07:00', '11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: time, dtype: int64

In [35]:
genre_weekday(spb_general, 'Monday', '07:00', '11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: time, dtype: int64

In [36]:
genre_weekday(moscow_general, 'Friday', '19:00', '23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: time, dtype: int64

In [37]:
genre_weekday(spb_general, 'Friday', '19:00', '23:00')

genre
pop            256
rock           216
electronic     216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: time, dtype: int64

**Conclusions**

If we compare the top 10 genres on Monday morning, we can draw the following conclusions:

1. In Moscow and St. Petersburg they listen to similar music. The only difference is that the Moscow top 10 includes the “world” genre, while the St. Petersburg top 10 includes jazz and classical.

2. There were so many missing values in Moscow that the value `'unknown'` took tenth place among the most popular genres. This means that missing values make up a significant share of the data and threaten the reliability of the study.

Friday night does not change this picture. Some genres rise a little higher, others go down, but overall the top 10 stays the same.

Thus, the second hypothesis was only partially confirmed:
* Users listen to similar music at the beginning of the week and at the end.
* The difference between Moscow and St. Petersburg is not very pronounced. In Moscow, they listen to Russian popular music more often, in St. Petersburg - jazz.

However, gaps in the data cast doubt on this result. There are so many of them in Moscow that the top 10 ranking could look different if it were not for the lost genre data.
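
One way to put a number on this concern is to compute the share of `'unknown'` genres per city. This is a sketch on a made-up `demo` table, so the printed shares say nothing about the real data:

```python
import pandas as pd

# Hypothetical plays: 2 of 4 Moscow rows and 1 of 4 St. Petersburg rows lack a genre.
demo = pd.DataFrame({
    'city': ['Moscow'] * 4 + ['Saint-Petersburg'] * 4,
    'genre': ['pop', 'unknown', 'rock', 'unknown', 'pop', 'unknown', 'jazz', 'rock'],
})

# The mean of a boolean series is the fraction of True values in each group.
unknown_share = (demo['genre'] == 'unknown').groupby(demo['city']).mean()
print(unknown_share)
```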

### Genre preferences in Moscow and St. Petersburg

Hypothesis: St. Petersburg is the capital of rap, and music of this genre is listened to there more often than in Moscow. Moscow is a city of contrasts that is nevertheless dominated by pop music.

Let's group the `moscow_general` table by genre and count the listens of tracks of each genre using the `count()` method. Then sort the result in descending order and store it in the `moscow_genres` table.

In [38]:
moscow_genres = moscow_general.groupby('genre')['genre'].count().sort_values(ascending = False)

Let's display the first ten lines of `moscow_genres`:

In [39]:
moscow_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

Now let's repeat the same for St. Petersburg.

Let's group the `spb_general` table by genre and count the plays of tracks of each genre. Then sort the result in descending order and store it in the `spb_genres` table:

In [41]:
spb_genres = spb_general.groupby('genre')['genre'].count().sort_values(ascending = False)

Let's display the first ten lines of `spb_genres`:

In [42]:
spb_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

**Conclusions**

The hypothesis was partially confirmed:
* Pop music is the most popular genre in Moscow, as the hypothesis suggested. Moreover, in the top 10 genres there is a similar genre - Russian popular music.
* Contrary to expectations, rap is equally popular in Moscow and St. Petersburg.
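
Because Moscow has more than twice as many plays, raw counts are hard to compare between the cities. The sketch below converts the counts shown above into shares of each city's total (42741 and 18512 plays). The variable names `moscow_top`, `spb_top` and `shares` are introduced here only for illustration.

```python
import pandas as pd

# Counts copied from the outputs above; only two genres are kept for brevity.
moscow_top = pd.Series({'pop': 5892, 'rusrap': 1161})
spb_top = pd.Series({'pop': 2431, 'rusrap': 564})

# Divide by the total number of plays in each city to get comparable shares.
shares = pd.DataFrame({
    'Moscow': moscow_top / 42741,
    'Saint-Petersburg': spb_top / 18512,
})
print(shares.round(3))
```

In relative terms `rusrap` accounts for roughly 2.7% of plays in Moscow and 3.0% in St. Petersburg, which is consistent with the conclusion that rap is about equally popular in both cities.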

## Results of the research

You tested three hypotheses and found:

1. The day of the week has a different effect on the activity of users in Moscow and St. Petersburg.

The first hypothesis was fully confirmed.

2. Musical preferences do not change much during the week - be it Moscow or St. Petersburg. Small differences are noticeable at the beginning of the week, on Mondays:
* in Moscow they listen to music of the “world” genre,
* in St. Petersburg - jazz and classical music.

Thus, the second hypothesis was only partly confirmed. This result could have been different were it not for gaps in the data.

3. The tastes of users of Moscow and St. Petersburg have more in common than differences. Contrary to expectations, genre preferences in St. Petersburg resemble those in Moscow.

The third hypothesis was not confirmed. If there are differences in preferences, they do not show up among the bulk of users.

**In practice, studies include statistical hypothesis tests.**
From the data of one service, it is not always possible to draw a conclusion about all the inhabitants of a city.
Statistical hypothesis tests show how reliable such conclusions are, given the available data.
We will get acquainted with hypothesis testing methods in the following topics.