# Yandex.Music 

Using Yandex.Music data, we compare the behavior of users in two capitals: Moscow and St. Petersburg.

**The purpose of the study** is to test three hypotheses:
1. User activity depends on the day of the week, and this dependence shows up differently in Moscow and St. Petersburg.
2. On Monday morning, some genres prevail in Moscow and others in St. Petersburg. Similarly, on Friday evening, different genres prevail depending on the city.
3. Moscow and St. Petersburg prefer different genres of music: pop is listened to more often in Moscow, Russian rap in St. Petersburg.

**Research plan**

User behavior data comes from the file `yandex_music_project.csv`. Nothing is known about its quality, so the data must be reviewed before the hypotheses are tested.

We will check the data for errors and evaluate their impact on the study. Then, at the preprocessing stage, we will correct the most critical errors.
 
Thus, the study will take place in three stages:
 1. Data overview.
 2. Data preprocessing. 
 3. Hypothesis testing. 

## Data overview

Let's get a first impression of the Yandex.Music data.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('/datasets/yandex_music_project.csv')

In [3]:
#display(df.head(10))
df.head(10)

| | userID | Track | artist | genre | City | time | Day |
|---|---|---|---|---|---|---|---|
| 0 | FFB692EC | Kamigata To Boots | The Mass Missile | rock | Saint-Petersburg | 20:28:33 | Wednesday |
| 1 | 55204538 | Delayed Because of Accident | Andreas Rönnberg | rock | Moscow | 14:07:09 | Friday |
| 2 | 20EC38 | Funiculì funiculà | Mario Lanza | pop | Saint-Petersburg | 20:58:07 | Wednesday |
| 3 | A3DD03C9 | Dragons in the Sunset | Fire + Ice | folk | Saint-Petersburg | 08:37:09 | Monday |
| 4 | E2DC1FAE | Soul People | Space Echo | dance | Moscow | 08:34:34 | Monday |
| 5 | 842029A1 | Преданная | IMPERVTOR | rusrap | Saint-Petersburg | 13:09:41 | Friday |
| 6 | 4CB90AA5 | True | Roman Messer | dance | Moscow | 13:00:07 | Wednesday |
| 7 | F03E1C1F | Feeling This Way | Polina Griffith | dance | Moscow | 20:47:49 | Wednesday |
| 8 | 8FA1D3BE | И вновь продолжается бой | NaN | ruspop | Moscow | 09:17:40 | Friday |
| 9 | E772D5C0 | Pessimist | NaN | dance | Saint-Petersburg | 21:20:49 | Wednesday |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
  userID    65079 non-null object
Track       63848 non-null object
artist      57876 non-null object
genre       63881 non-null object
  City      65079 non-null object
time        65079 non-null object
Day         65079 non-null object
dtypes: object(7)
memory usage: 3.5+ MB


The dataframe has seven columns, all of type `object`.

According to the data documentation:
* `userID` — user ID;
* `Track` — track name;  
* `artist` — artist name;
* `genre` — genre name;
* `City` — user's city;
* `time` — time of listening start;
* `Day` — day of the week.

Three style violations are visible in the column names:
1. Lowercase letters are mixed with uppercase.
2. Some names contain leading or trailing spaces.
3. In the user ID column name, the underscore between "user" and "id" is missing.

The number of non-null values differs between columns, so the data contains missing values.

**Preliminary conclusions about data quality**

Each row of the table describes one track play. Some columns describe the track itself: its name, artist and genre. The rest describe the user: which city they are from and when they listened to the music.

Preliminarily, there appears to be enough data to test the hypotheses. However, the data contains missing values, and the column names deviate from good style.

To move forward, we need to fix these problems.

## Data preprocessing
Let's fix the style of the column headers and handle the missing values. Then we will check the data for duplicates.

### Header style
Let's display the column names on the screen:

In [5]:
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

Let's bring the names in line with good style:
* write multi-word names in snake_case,
* make all characters lowercase,
* remove the spaces.

To do this, rename the columns:
* `'  userID'` → `'user_id'`;
* `'Track'` → `'track'`;
* `'  City  '` → `'city'`;
* `'Day'` → `'day'`.

In [6]:
df = df.rename(columns={'  userID':'user_id', 'Track':'track', '  City  ':'city', 'Day':'day'})

In [7]:
print(df.columns)

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
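
When a table has many columns, renaming each one by hand becomes tedious. The following is a minimal sketch of a generic alternative, not the approach used in this notebook. The helper `to_snake_case()` is hypothetical: it strips surrounding spaces, inserts an underscore at each lowercase-to-uppercase boundary, and lowercases the result. It is shown on an empty dataframe that only carries the original column names.

```python
import re

import pandas as pd


def to_snake_case(name):
    """Hypothetical helper: turns '  userID' into 'user_id'."""
    name = name.strip()  # drop leading and trailing spaces
    # insert "_" where a lowercase letter or digit is followed by an uppercase letter
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    return name.lower()


# empty dataframe with the original column names, for illustration only
raw = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'])
renamed = raw.rename(columns=to_snake_case)
print(list(renamed.columns))
```

A rule like this would need checking against names with acronyms or digits before being applied to a different dataset.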


### Missing values
Let's count the number of missing values in the table. Two `pandas` methods are enough for this:

In [8]:
df.isna().sum()

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64

Not all missing values affect the study. Missing values in `track` and `artist` are not important for this analysis; it is enough to replace them with an explicit placeholder.

Missing values in `genre`, however, may interfere with the comparison of musical tastes in Moscow and St. Petersburg. In practice, the right approach would be to establish why the values are missing and restore the data. That is not possible in this training project, so we have to:
* fill these gaps with an explicit placeholder,
* assess how much they distort the calculations.

We replace the missing values in the columns `track`, `artist` and `genre` with the string `'unknown'`. To do this, create the list `columns_to_replace`, iterate over its elements with a `for` loop, and replace the missing values in each column:

In [9]:
columns_to_replace = ['track', 'artist', 'genre']
for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')
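
As an alternative to the loop, `fillna()` also accepts a dictionary that maps column names to fill values, so all three columns can be handled in one call. A minimal sketch on a hypothetical two-row dataframe (`toy` is not part of the project data):

```python
import pandas as pd

# hypothetical toy data with gaps in all three columns
toy = pd.DataFrame({
    'track': ['Soul People', None],
    'artist': [None, 'Space Echo'],
    'genre': ['dance', None],
    'city': ['Moscow', 'Moscow'],
})

columns_to_replace = ['track', 'artist', 'genre']
# one fill value per listed column; other columns are left untouched
toy = toy.fillna({column: 'unknown' for column in columns_to_replace})
print(toy)
```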

Let's check that no missing values are left in the dataframe by counting them again.

In [10]:
df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64
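
One way to assess how much the `unknown` genres could distort the comparison is to look at their share in each city. The sketch below uses hypothetical toy data rather than the project file, so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd

# hypothetical toy data: genre gaps already replaced with 'unknown'
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg'],
    'genre': ['pop', 'unknown', 'rock', 'dance', 'rusrap', 'unknown'],
})

# the mean of a boolean column is the share of True values
is_unknown = toy['genre'] == 'unknown'
unknown_share = is_unknown.groupby(toy['city']).mean()
print(unknown_share)
```

If the shares differ a lot between the cities, genre rankings should be compared with more caution.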

### Duplicates
Let's count the obvious duplicates in the table with one command:

In [11]:
df.duplicated().sum()

3826

In [12]:
df = df.drop_duplicates().reset_index(drop=True)

Once again, let's count the obvious duplicates in the table to check that we have completely got rid of them:

In [13]:
df.duplicated().sum()

0

It is also necessary to get rid of implicit duplicates in the `genre` column. For example, the name of the same genre may be written a little differently. Such errors will also affect the result of the study.

Let's display a list of unique genre names, sorted alphabetically. To do this:
* extract the desired dataframe column,
* apply the sorting method to it,
* apply to the sorted column a method that returns its unique values.

In [14]:
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

Look through the list for implicit duplicates of the genre `hiphop`. These may be misspellings or alternative names for the same genre.

We found the following implicit duplicates:
* *hip*,
* *hop*,
* *hip-hop*.

To clear the table of them, write the function `replace_wrong_genres()` with two parameters:
* `wrong_genres`: a list of duplicates,
* `correct_genre`: a string with the correct value.

The function should correct the `genre` column of the `df` table by replacing each value from the `wrong_genres` list with the value of `correct_genre`.

In [15]:
def replace_wrong_genres(wrong_genres, correct_genre):
    for wrong_genre in wrong_genres:
        df['genre'] = df['genre'].replace(wrong_genre, correct_genre)

Call `replace_wrong_genres()` with arguments that eliminate the implicit duplicates: instead of `hip`, `hop` and `hip-hop`, the table should contain the value `hiphop`:

In [16]:
duplicates = ['hip', 'hop', 'hip-hop']
correct_genre = 'hiphop'
replace_wrong_genres(duplicates, correct_genre)
display(df)

| | user_id | track | artist | genre | city | time | day |
|---|---|---|---|---|---|---|---|
| 0 | FFB692EC | Kamigata To Boots | The Mass Missile | rock | Saint-Petersburg | 20:28:33 | Wednesday |
| 1 | 55204538 | Delayed Because of Accident | Andreas Rönnberg | rock | Moscow | 14:07:09 | Friday |
| 2 | 20EC38 | Funiculì funiculà | Mario Lanza | pop | Saint-Petersburg | 20:58:07 | Wednesday |
| 3 | A3DD03C9 | Dragons in the Sunset | Fire + Ice | folk | Saint-Petersburg | 08:37:09 | Monday |
| 4 | E2DC1FAE | Soul People | Space Echo | dance | Moscow | 08:34:34 | Monday |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 61248 | 729CBB09 | My Name | McLean | rnb | Moscow | 13:32:28 | Wednesday |
| 61249 | D08D4A55 | Maybe One Day (feat. Black Spade) | Blu & Exile | hiphop | Saint-Petersburg | 10:00:00 | Monday |
| 61250 | C5E3A0D5 | Jalopiina | unknown | industrial | Moscow | 20:09:26 | Friday |
| 61251 | 321D0506 | Freight Train | Chas McDevitt | rock | Moscow | 21:43:59 | Friday |


Check that the wrong names have been replaced:

*   hip
*   hop
*   hip-hop

Output a sorted list of unique values of the `genre` column:

In [17]:
print(df['genre'].sort_values().unique())

['acid' 'acoustic' 'action' 'adult' 'africa' 'afrikaans' 'alternative'
 'alternativepunk' 'ambient' 'americana' 'animated' 'anime' 'arabesk'
 'arabic' 'arena' 'argentinetango' 'art' 'audiobook' 'author' 'avantgarde'
 'axé' 'baile' 'balkan' 'beats' 'bigroom' 'black' 'bluegrass' 'blues'
 'bollywood' 'bossa' 'brazilian' 'breakbeat' 'breaks' 'broadway'
 'cantautori' 'cantopop' 'canzone' 'caribbean' 'caucasian' 'celtic'
 'chamber' 'chanson' 'children' 'chill' 'chinese' 'choral' 'christian'
 'christmas' 'classical' 'classicmetal' 'club' 'colombian' 'comedy'
 'conjazz' 'contemporary' 'country' 'cuban' 'dance' 'dancehall' 'dancepop'
 'dark' 'death' 'deep' 'deutschrock' 'deutschspr' 'dirty' 'disco' 'dnb'
 'documentary' 'downbeat' 'downtempo' 'drum' 'dub' 'dubstep' 'eastern'
 'easy' 'electronic' 'electropop' 'emo' 'entehno' 'epicmetal' 'estrada'
 'ethnic' 'eurofolk' 'european' 'experimental' 'extrememetal' 'fado'
 'fairytail' 'film' 'fitness' 'flamenco' 'folk' 'folklore' 'folkmetal'
 'folkrock' 
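
The replacement can also be written without a loop and without relying on a global `df`, because `Series.replace()` accepts a list of values to replace. This is a sketch of a variant, not the notebook's function: the name `replace_genres()` and the toy dataframe are hypothetical.

```python
import pandas as pd


def replace_genres(table, wrong_genres, correct_genre):
    """Hypothetical variant: return a copy of `table` with the genres corrected."""
    result = table.copy()
    # a list as the first argument replaces every listed value at once
    result['genre'] = result['genre'].replace(wrong_genres, correct_genre)
    return result


toy = pd.DataFrame({'genre': ['hip', 'hop', 'hip-hop', 'hiphop', 'rock']})
cleaned = replace_genres(toy, ['hip', 'hop', 'hip-hop'], 'hiphop')
print(cleaned['genre'].unique())
```

Passing the table as a parameter and returning a copy keeps the original data unchanged, which makes the step easier to rerun and test.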

**Preprocessing Conclusions**

Preprocessing found three problems in the data:

- style violations in the column headers,
- missing values,
- duplicates — explicit and implicit.

I have corrected the headers to make it easier to work with the table. Without duplicates, the study will become more accurate.

I have replaced the missing values with `unknown`. It remains to be seen whether the missing values in the `genre` column will harm the study.

Now I can proceed to hypothesis testing.

## Hypothesis testing

### User behavior comparison in two cities

The first hypothesis states that users listen to music differently in Moscow and St. Petersburg. Let's check this assumption using data for three days of the week: Monday, Wednesday and Friday. To do this:

* split the users into Moscow and St. Petersburg groups,
* compare how many tracks each group listened to on Monday, Wednesday and Friday.


In [18]:
df.groupby('city')['time'].count()

city
Moscow              42741
Saint-Petersburg    18512
Name: time, dtype: int64

More tracks are played in Moscow than in St. Petersburg. This does not mean that Moscow users listen to music more often; there are simply more users in Moscow.

Now let's group the data by day of the week and count the plays on Monday, Wednesday and Friday. Note that the data covers only these three days.


In [19]:
df.groupby('day')['time'].count()

day
Friday       21840
Monday       21354
Wednesday    18059
Name: time, dtype: int64

Taken together, users in the two cities are less active on Wednesdays. But the picture may change if we consider each city separately.

Let's write a function that combines these two calculations.

The function `number_tracks()` counts the plays for a given day and city. It needs two parameters:
* day of the week,
* name of the city.

Inside the function, we save to a variable the rows of the source table in which:
* the value in the `day` column equals the `day` parameter,
* the value in the `city` column equals the `city` parameter.

To do this, we combine both conditions in a single filter using logical indexing.

Then we count the values in the `user_id` column of the resulting table, save the result to a new variable, and return this variable from the function.

In [20]:
def number_tracks(day, city):
    track_list = df[(df['day'] == day) & (df['city'] == city)]
    track_list_count = track_list['user_id'].count()
    return track_list_count

Call `number_tracks()` six times with different arguments to get the data for each city on each of the three days.

In [21]:
number_tracks('Monday', 'Moscow')

15740

In [22]:
number_tracks('Monday', 'Saint-Petersburg')

5614

In [23]:
number_tracks('Wednesday', 'Moscow')

11056

In [24]:
number_tracks('Wednesday', 'Saint-Petersburg')

7003

In [25]:
number_tracks('Friday', 'Moscow')

15945

In [26]:
number_tracks('Friday', 'Saint-Petersburg')

5895

Create a table using the `pd.DataFrame` constructor, where:
* the column names are `['city', 'monday', 'wednesday', 'friday']`;
* the data are the results obtained with `number_tracks()`.

In [27]:
data = [['Moscow', 15740, 11056, 15945],
       ['Saint-Petersburg', 5614, 7003, 5895]]
columns = ['city','monday','wednesday','friday']
table = pd.DataFrame(data = data, columns = columns)
table

| | city | monday | wednesday | friday |
|---|---|---|---|---|
| 0 | Moscow | 15740 | 11056 | 15945 |
| 1 | Saint-Petersburg | 5614 | 7003 | 5895 |
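
The same kind of city-by-day table can be produced in one step with `pd.crosstab()`, which avoids copying numbers by hand. The sketch below runs on hypothetical toy data, so its counts are illustrative only.

```python
import pandas as pd

# hypothetical toy data: one row per play
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Saint-Petersburg', 'Saint-Petersburg'],
    'day': ['Monday', 'Monday', 'Friday', 'Wednesday', 'Friday'],
})

days = ['Monday', 'Wednesday', 'Friday']
# crosstab counts rows for every (city, day) pair;
# reindex puts the days in calendar order instead of alphabetical order
plays = pd.crosstab(toy['city'], toy['day']).reindex(columns=days, fill_value=0)
print(plays)
```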


**Conclusion**

The data shows a difference in user behavior:

- In Moscow, listening peaks on Monday and Friday, with a noticeable decline on Wednesday.
- In St. Petersburg, on the contrary, users listen to music most on Wednesday. Activity on Monday and Friday is lower than on Wednesday by a similar margin.

So, the data speaks in favor of the first hypothesis.
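
Because Moscow has far more plays overall, the pattern is easier to compare when each city's counts are turned into shares of that city's total. A minimal sketch using the six counts obtained above (the variable `shares` is introduced here for illustration):

```python
import pandas as pd

# counts returned by number_tracks() above
data = [['Moscow', 15740, 11056, 15945],
        ['Saint-Petersburg', 5614, 7003, 5895]]
columns = ['city', 'monday', 'wednesday', 'friday']
table = pd.DataFrame(data=data, columns=columns).set_index('city')

# divide each row by its total so that every city sums to 1
shares = table.div(table.sum(axis=1), axis=0)
print(shares.round(3))
```

On these numbers, Wednesday accounts for roughly a quarter of Moscow plays but more than a third of St. Petersburg plays, which is consistent with the conclusion above.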

### Music at the beginning and end of the week

According to the second hypothesis, some genres prevail in Moscow on Monday morning, and others in St. Petersburg. Similarly, on Friday evening, different genres prevail — depending on the city.

Save the data for each city in its own variable:
* for Moscow, in `moscow_general`;
* for St. Petersburg, in `spb_general`.

In [28]:
moscow_general = df[df['city'] == 'Moscow']

In [29]:
spb_general = df[df['city'] == 'Saint-Petersburg']

Create a function `genre_weekday()` with four parameters:
* table (dataframe) with data,
* day of the week,
* initial timestamp in the format 'hh:mm', 
* the last timestamp in the format 'hh:mm'.

The function should return information about the top 10 genres of those tracks that were listened to on the specified day, in the interval between two timestamps.

In [30]:
def genre_weekday(table, day, time1, time2):
    genre_df = table[(table['day'] == day) & (table['time'] > time1) & (table['time'] < time2)]
    genre_df_count = genre_df.groupby('genre')['genre'].count() 
    genre_df_sorted = genre_df_count.sort_values(ascending = False)
    return genre_df_sorted.head(10)
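
Note that the filter in `genre_weekday()` compares strings, not time objects. This works because the `time` column holds zero-padded `hh:mm:ss` values, so alphabetical order coincides with chronological order. A small sketch with hypothetical values:

```python
import pandas as pd

# hypothetical listening times in the same zero-padded format as the `time` column
times = pd.Series(['06:59:59', '07:00:00', '10:59:59', '11:00:00'])

# element-wise string comparison, as in genre_weekday()
in_window = (times > '07:00') & (times < '11:00')
print(in_window.tolist())  # [False, True, True, False]
```

`'07:00:00'` passes the check because a longer string that starts with `'07:00'` sorts after it, while `'11:00:00'` is excluded for the same reason. Times without a leading zero, such as `'9:15:00'`, would break this comparison.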

Let's compare the results of the `genre_weekday()` function for Moscow and St. Petersburg on Monday morning (from 7:00 to 11:00) and Friday evening (from 17:00 to 23:00):

In [31]:
genre_weekday(moscow_general, 'Monday', '07:00', '11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: genre, dtype: int64

In [32]:
genre_weekday(spb_general, 'Monday', '07:00', '11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: genre, dtype: int64

In [33]:
genre_weekday(moscow_general, 'Friday', '17:00', '23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: genre, dtype: int64

In [34]:
genre_weekday(spb_general, 'Friday', '17:00', '23:00')

genre
pop            256
rock           216
electronic     216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: genre, dtype: int64

**Conclusion**

Comparing the top 10 genres on Monday morning leads to the following conclusions:

1. In Moscow and St. Petersburg, users listen to similar music. The only difference is that the “world” genre made the Moscow top 10, while jazz and classical music made the St. Petersburg top 10.

2. In Moscow, there were so many missing values that `unknown` took tenth place among the most popular genres. This means that missing values make up a significant share of the data and threaten the reliability of the study.

Friday evening does not change this picture. Some genres rise a little, others fall, but overall the top 10 remains the same.

Thus, the second hypothesis was only partially confirmed:
* Users listen to similar music at the beginning and at the end of the week.
* The difference between Moscow and St. Petersburg is not pronounced. In Moscow, users listen to Russian popular music more often; in St. Petersburg, jazz.

However, the missing values cast doubt on this result. There are so many of them in Moscow that the top 10 could look different if the genre data had not been lost.

### Genre preferences in Moscow and St. Petersburg

Hypothesis: St. Petersburg is the capital of rap, and music of this genre is listened to there more often than in Moscow. Moscow is a city of contrasts in which pop music nevertheless prevails.

Group the `moscow_general` table by genre and count the number of tracks listened to for each genre using the `count()` method. Then sort the result in descending order and save it in the `moscow_genres` table.

In [35]:
moscow_genres = moscow_general.groupby('genre')['genre'].count().sort_values(ascending=False)

Display the first ten rows of `moscow_genres`:

In [36]:
display(moscow_genres.head(10))

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

Now repeat the same for St. Petersburg.

Group the `spb_general` table by genre. Count the listenings of tracks of each genre. Sort the result in descending order and save it in the `spb_genres` table:

In [37]:
spb_genres = spb_general.groupby('genre')['genre'].count().sort_values(ascending=False)

Display the first ten rows of `spb_genres`:

In [38]:
display(spb_genres.head(10))

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64
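
Raw counts are hard to compare between the cities because Moscow has more than twice as many plays. One possible refinement, sketched here and not part of the original analysis, is to divide genre counts by each city's total number of plays. The counts and totals below are copied from the outputs in this notebook; the names `counts`, `totals` and `shares` are introduced for illustration.

```python
import pandas as pd

# play counts for two genres, copied from the top-10 outputs above
counts = pd.DataFrame({
    'Moscow': {'pop': 5892, 'rusrap': 1161},
    'Saint-Petersburg': {'pop': 2431, 'rusrap': 564},
})
# total plays per city, copied from the groupby('city') output
totals = pd.Series({'Moscow': 42741, 'Saint-Petersburg': 18512})

# divide every column by the matching city total
shares = counts.div(totals, axis='columns')
print((shares * 100).round(1))  # shares in percent
```

Shares put both cities on the same scale, so a genre's position can be judged independently of the number of users.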

**Conclusion**

The hypothesis was partially confirmed:
* Pop music is the most popular genre in Moscow, as the hypothesis suggested. Moreover, the top 10 also includes a closely related genre, Russian popular music.
* Contrary to expectations, rap is equally popular in Moscow and St. Petersburg.

## Results of the study

I have tested three hypotheses and established:

1. The day of the week has different effects on user activity in Moscow and St. Petersburg. 

The first hypothesis is fully confirmed.

2. Musical preferences do not change much during the week, whether in Moscow or St. Petersburg. Small differences are noticeable at the beginning of the week, on Mondays:
* in Moscow, users listen to music of the “world” genre,
* in St. Petersburg, users listen to jazz and classical music.

Thus, the second hypothesis is only partially confirmed. This result could have been different if not for the missing values in the data.

3. There are more similarities than differences in the tastes of users in Moscow and St. Petersburg. Contrary to expectations, genre preferences in St. Petersburg resemble those in Moscow.

The third hypothesis is not confirmed. If there are differences in preferences, they are not noticeable for most users.