# Project Description

It is necessary to test three hypotheses:

1. User activity depends on the day of the week. Moreover, it manifests differently in Moscow and St. Petersburg.
2. In Moscow, on Monday mornings, one genre predominates, while in St. Petersburg, another dominates. Similarly, different genres predominate on Friday evenings, depending on the city.
3. Moscow and St. Petersburg prefer different genres of music. In Moscow, pop music is more often listened to, while in St. Petersburg, it is Russian rap.

# Data Upload and Overview

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('/datasets/yandex_music_project.csv') # load the data into a DataFrame

In [None]:
display(df.head(10)) # show the first 10 rows of the DataFrame

| | userID | Track | artist | genre | City | time | Day |
|---|---|---|---|---|---|---|---|
| 0 | FFB692EC | Kamigata To Boots | The Mass Missile | rock | Saint-Petersburg | 20:28:33 | Wednesday |
| 1 | 55204538 | Delayed Because of Accident | Andreas Rönnberg | rock | Moscow | 14:07:09 | Friday |
| 2 | 20EC38 | Funiculì funiculà | Mario Lanza | pop | Saint-Petersburg | 20:58:07 | Wednesday |
| 3 | A3DD03C9 | Dragons in the Sunset | Fire + Ice | folk | Saint-Petersburg | 08:37:09 | Monday |
| 4 | E2DC1FAE | Soul People | Space Echo | dance | Moscow | 08:34:34 | Monday |
| 5 | 842029A1 | Преданная | IMPERVTOR | rusrap | Saint-Petersburg | 13:09:41 | Friday |
| 6 | 4CB90AA5 | True | Roman Messer | dance | Moscow | 13:00:07 | Wednesday |
| 7 | F03E1C1F | Feeling This Way | Polina Griffith | dance | Moscow | 20:47:49 | Wednesday |
| 8 | 8FA1D3BE | И вновь продолжается бой | | ruspop | Moscow | 09:17:40 | Friday |
| 9 | E772D5C0 | Pessimist | | dance | Saint-Petersburg | 21:20:49 | Wednesday |


In [None]:
df.info() # general information about the data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
  userID    65079 non-null object
Track       63848 non-null object
artist      57876 non-null object
genre       63881 non-null object
  City      65079 non-null object
time        65079 non-null object
Day         65079 non-null object
dtypes: object(7)
memory usage: 3.5+ MB


So, there are seven columns in the table. The data type in all columns is "object".

According to the data documentation:

* userID - user identifier;
* Track - track name;
* artist - artist name;
* genre - genre name;
* City - user's city;
* time - start time of listening;
* Day - day of the week.

There are three style violations in the column names:

- Lowercase letters are mixed with uppercase ones.
- Spaces are present.
- Two words are written together instead of using snake_case.

In addition, the number of non-null values differs between columns, which indicates that there are missing values in the data.

**Conclusions**

In each row of the table, there is information about the listened track. Some columns describe the composition itself: the title, artist, and genre. Other data tells about the user: where they are from and when they listened to the music.

At first glance, there is enough data to test the hypotheses. However, some values are missing, and the column names do not follow good naming style.

To move forward, it is necessary to resolve the issues with the data.

# Data Preprocessing

## Headers Style

In [None]:
df.columns # list the column names

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

In [None]:
df = df.rename(columns = {'  userID' : 'user_id', 'Track' : 'track', '  City  ' : 'city', 'Day' : 'day'}) # columns rename

In [None]:
display(df.columns) # renamed columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
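
The manual rename above is fine for a known, fixed set of columns. As a hedged alternative, the sketch below normalizes the names programmatically. The helper `to_snake_case` is a hypothetical name that does not appear in the original notebook, and its regex only handles the simple lowercase-to-uppercase boundary seen in `userID`.

```python
import re

import pandas as pd


def to_snake_case(name):
    """Strip surrounding spaces, split a lower-to-upper boundary, lowercase."""
    stripped = name.strip()
    # Insert "_" where a lowercase letter or digit is followed by an uppercase letter.
    with_underscores = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', stripped)
    return with_underscores.lower()


# Empty toy frame with the same raw headers as the dataset.
raw = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'])
cleaned = raw.rename(columns=to_snake_case)
print(list(cleaned.columns))
```

Passing a function to `rename` avoids having to retype the raw names, including their invisible leading and trailing spaces.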

## Missing Values

In [None]:
display(df.isna().sum()) # missing values structure

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64

In [None]:
columns_to_replace = ['track', 'artist', 'genre']

for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')
    
# Looping through column names and replacing missing values with 'unknown'

In [None]:
display(df.isna().sum()) # missing values count

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64
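
The loop above can also be written as a single assignment, since `fillna` works on several columns at once. The following is a minimal sketch on a made-up two-row frame (the `sample` data is illustrative, not taken from the dataset).

```python
import pandas as pd

# Toy frame with gaps in the same three columns (illustrative values only).
sample = pd.DataFrame({
    'track': ['True', None],
    'artist': [None, 'Roman Messer'],
    'genre': ['dance', None],
    'city': ['Moscow', 'Moscow'],
})

columns_to_replace = ['track', 'artist', 'genre']
# Fill missing values in all listed columns in one step.
sample[columns_to_replace] = sample[columns_to_replace].fillna('unknown')
print(sample.isna().sum())
```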

## Duplicates

In [None]:
display(df.duplicated().sum()) # counting explicit duplicates

3826

In [None]:
df = df.drop_duplicates().reset_index(drop = True) # removing explicit duplicates (including deletion of old indexes and generation of new ones)

In [None]:
display(df.duplicated().sum()) # check for absence of duplicates

0

In [None]:
df = df.sort_values('genre')
display(df['genre'].unique())

# unique genre names

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

In [None]:
def replace_wrong_genres(wrong_genres, correct_genre):
    """Replace implicit duplicates: map each name in wrong_genres to correct_genre."""
    for genre in wrong_genres:
        df['genre'] = df['genre'].replace(genre, correct_genre)

In [None]:
replace_wrong_genres(['hip', 'hop', 'hip-hop'], 'hiphop') # elimination of implicit duplicates

In [None]:
df = df.sort_values('genre')
display(df['genre'].unique())
# checking for implicit duplicates

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '
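
`Series.replace` also accepts a list of values to replace, so the loop is optional. The sketch below is one possible variant, assuming a hypothetical helper `replace_wrong_genres_in` that returns a new Series instead of modifying the global `df`; the genre values are toy data.

```python
import pandas as pd


def replace_wrong_genres_in(series, wrong_genres, correct_genre):
    """Return a copy of `series` with every value in `wrong_genres` replaced."""
    return series.replace(wrong_genres, correct_genre)


# Toy genre column containing the three implicit duplicates of 'hiphop'.
genres = pd.Series(['hip', 'hop', 'hip-hop', 'hiphop', 'rock'])
fixed = replace_wrong_genres_in(genres, ['hip', 'hop', 'hip-hop'], 'hiphop')
print(fixed.unique())
```

Returning a new Series makes the function easier to test, because it no longer depends on a variable defined elsewhere in the notebook.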

**Conclusions**

Preprocessing found three problems in the data:

- style violations in the column headers,
- missing values,
- duplicates — explicit and implicit.

# Hypothesis Testing

## Comparison of user behavior of two cities

Testing the first hypothesis: users listen to music differently in Moscow and St. Petersburg.

Let's check this assumption using data for three days of the week: Monday, Wednesday, and Friday.

In [None]:
df.groupby('city')['time'].count() # counting plays in each city

city
Moscow              42741
Saint-Petersburg    18512
Name: time, dtype: int64

In [None]:
df.groupby('day')['city'].count() # counting plays on each of the three days

day
Friday       21840
Monday       21354
Wednesday    18059
Name: city, dtype: int64

In [None]:
def number_tracks(day, city):
    track_list = df[(df['day'] == day)&(df['city'] == city)]
    track_list_count = track_list['user_id'].count()
    return track_list_count
    
# A function for counting plays for a specific city and day

In [None]:
number_tracks('Monday', 'Moscow') # number of plays in Moscow on Mondays

15740

In [None]:
number_tracks('Monday', 'Saint-Petersburg') # number of plays in St. Petersburg on Mondays

5614

In [None]:
number_tracks('Wednesday', 'Moscow') # number of plays in Moscow on Wednesdays

11056

In [None]:
number_tracks('Wednesday', 'Saint-Petersburg') # number of plays in St. Petersburg on Wednesdays

7003

In [None]:
number_tracks('Friday', 'Moscow') # number of plays in Moscow on Fridays

15945

In [None]:
number_tracks('Friday', 'Saint-Petersburg') # number of plays in St. Petersburg on Fridays

5895

In [None]:
pd.DataFrame(data = [
    [
        'Moscow', 
        number_tracks('Monday', 'Moscow'), 
        number_tracks('Wednesday', 'Moscow'), 
        number_tracks('Friday', 'Moscow')], 
    [
        'Saint-Petersburg', 
        number_tracks('Monday', 'Saint-Petersburg'), 
        number_tracks('Wednesday', 'Saint-Petersburg'), 
        number_tracks('Friday', 'Saint-Petersburg')]
], columns = [
    'city',
    'monday', 
    'wednesday', 
    'friday']) # Results table

| | city | monday | wednesday | friday |
|---|---|---|---|---|
| 0 | Moscow | 15740 | 11056 | 15945 |
| 1 | Saint-Petersburg | 5614 | 7003 | 5895 |
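
The same city-by-day table can be produced without calling `number_tracks` six times. The sketch below uses `pd.crosstab` on a small made-up frame (the `toy` rows are illustrative, not taken from the dataset); `day_order` is an assumed helper list that puts the columns in calendar order.

```python
import pandas as pd

# Toy play log: one row per play (illustrative values only).
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Saint-Petersburg', 'Saint-Petersburg'],
    'day': ['Monday', 'Friday', 'Monday', 'Wednesday', 'Friday'],
})

day_order = ['Monday', 'Wednesday', 'Friday']
# Count rows for every (city, day) pair, then order the columns by weekday.
plays = pd.crosstab(toy['city'], toy['day']).reindex(columns=day_order, fill_value=0)
print(plays)
```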


**Conclusions**

The data shows the difference in user behavior:

- In Moscow, listening peaks on Monday and Friday, with a noticeable dip on Wednesday.
- In St. Petersburg, on the contrary, people listen to music most on Wednesday. Activity on Monday and Friday is lower than on Wednesday by roughly the same amount.

So, the data speak in favor of the first hypothesis.
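
Because Moscow has more than twice as many plays as St. Petersburg, the comparison is easier to read as shares of each city's total. The sketch below reuses the counts from the results table; the `shares` frame is an addition for illustration and is not part of the original notebook.

```python
import pandas as pd

# Play counts copied from the results table above.
plays = pd.DataFrame(
    {'monday': [15740, 5614], 'wednesday': [11056, 7003], 'friday': [15945, 5895]},
    index=['Moscow', 'Saint-Petersburg'],
)

# Divide each row by its own total so that every city sums to 1.
shares = plays.div(plays.sum(axis=1), axis=0)
print(shares.round(3))
```

Row-wise normalization removes the effect of audience size, so the remaining differences reflect how activity is distributed across the three days.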

## Music at the beginning and end of the week

Let's check the second hypothesis: on Monday morning, some genres prevail in Moscow, and others in St. Petersburg. Similarly, on Friday evenings, different genres prevail — depending on the city.

In [None]:
moscow_general = df[df['city'] == 'Moscow'] # getting the moscow_general table from those rows of the df table,
# for which the value in the 'city' column is 'Moscow'

In [None]:
spb_general = df[df['city'] == 'Saint-Petersburg'] # getting the spb_general table from those rows of the df table,
# for which the value in the 'city' column is 'Saint-Petersburg'

In [None]:
def genre_weekday(table, day, time1, time2):
    # keep rows for the given day whose time lies strictly between time1 and time2
    genre_df = table[(table['day'] == day) & (table['time'] > time1) & (table['time'] < time2)]
    # count plays per genre and sort in descending order
    genre_df_count = genre_df.groupby('genre')['genre'].count()
    genre_df_sorted = genre_df_count.sort_values(ascending = False)
    return genre_df_sorted.head(10)

# the function returns the 10 most popular genres on the specified day within the specified time window

In [None]:
genre_weekday(moscow_general, 'Monday', '07:00', '11:00') # function call for Monday morning in Moscow (instead of df — moscow_general table)

genre
pop            781
dance          549
electronic     480
rock           474
hip            281
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: genre, dtype: int64

In [None]:
genre_weekday(spb_general, 'Monday', '07:00', '11:00') # function call for Monday morning in St. Petersburg (spb_general table instead of df)

genre
pop            218
dance          182
rock           162
electronic     147
hip             79
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: genre, dtype: int64

In [None]:
genre_weekday(moscow_general, 'Friday', '17:00', '23:00') # function call for Friday evening in Moscow

genre
pop            713
rock           517
dance          495
electronic     482
hip            267
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: genre, dtype: int64

In [None]:
genre_weekday(spb_general, 'Friday', '17:00', '23:00') # function call for Friday evening in St. Petersburg

genre
pop            256
electronic     216
rock           216
dance          210
hip             94
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: genre, dtype: int64
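
The time filter in `genre_weekday` compares strings, which works here because zero-padded `HH:MM:SS` values sort lexicographically in chronological order. The sketch below shows this on made-up rows with a hypothetical variant, `top_genres`, that uses `value_counts` in place of the group-and-sort steps.

```python
import pandas as pd


def top_genres(table, day, time1, time2, n=10):
    """Return the n most frequent genres for `day` strictly between time1 and time2."""
    in_window = (table['day'] == day) & (table['time'] > time1) & (table['time'] < time2)
    # value_counts() counts each genre and sorts the counts in descending order.
    return table.loc[in_window, 'genre'].value_counts().head(n)


# Toy rows (illustrative only); the third one falls exactly on the upper bound.
toy = pd.DataFrame({
    'genre': ['pop', 'pop', 'rock', 'dance'],
    'day': ['Monday', 'Monday', 'Monday', 'Friday'],
    'time': ['08:34:34', '10:59:59', '11:00:00', '08:00:00'],
})
result = top_genres(toy, 'Monday', '07:00', '11:00')
print(result)
```

Note that a play at exactly `11:00:00` is excluded: the string `'11:00:00'` is longer than `'11:00'` and therefore compares as greater, so the strict upper bound filters it out.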

**Conclusions**

If we compare the top 10 genres on Monday morning, we can draw the following conclusions:

1. Listeners in Moscow and St. Petersburg choose similar music. The only difference is that the “world” genre made the Moscow top 10, while jazz and classical music made the St. Petersburg top 10.

2. In Moscow, there were so many missing values that `'unknown'` took tenth place among the most popular genres. This means that missing values make up a significant share of the data and threaten the reliability of the study.

Friday night doesn't change that picture. Some genres rise a little higher, others go down, but overall the top 10 remains the same.

Thus, the second hypothesis was only partially confirmed:
* Users listen to similar music at the beginning of the week and at the end.
* The difference between Moscow and St. Petersburg is not very pronounced. In Moscow, Russian popular music is listened to more often; in St. Petersburg, jazz.

However, the missing values cast doubt on this result. There are so many of them in Moscow that the top 10 could look different if the genre data had not been lost.
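
One way to quantify this concern is to measure what share of plays in each city has the placeholder genre `'unknown'`. The sketch below runs on made-up rows (the `toy` frame is illustrative), so its numbers say nothing about the real dataset; on the notebook's `df`, the same expression would use `df['genre']` and `df['city']`.

```python
import pandas as pd

# Toy play log (illustrative values only).
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Moscow', 'Saint-Petersburg', 'Saint-Petersburg'],
    'genre': ['pop', 'unknown', 'rock', 'dance', 'unknown', 'jazz'],
})

# The mean of a boolean Series is the fraction of True values in each group.
unknown_share = (toy['genre'] == 'unknown').groupby(toy['city']).mean()
print(unknown_share)
```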

## Genre preferences in Moscow and St. Petersburg

Let's check the third hypothesis: St. Petersburg is the capital of rap, music of this genre is listened to there more often than in Moscow, and pop music prevails in Moscow.

In [None]:
moscow_genres = moscow_general.groupby('genre')['genre'].count().sort_values(ascending = False)

In [None]:
display(moscow_genres.head(10)) # viewing the first 10 rows of moscow_genres

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hip            2041
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

In [None]:
spb_genres = spb_general.groupby('genre')['genre'].count().sort_values(ascending = False)

In [None]:
display(spb_genres.head(10)) # viewing the first 10 rows of spb_genres

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hip             934
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

**Conclusions**

The hypothesis was partially confirmed:
* Pop music is the most popular genre in Moscow, as the hypothesis suggested. Moreover, the top 10 also includes a related genre, Russian popular music.
* Contrary to expectations, rap is equally popular in Moscow and St. Petersburg.
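
Raw counts are hard to compare across cities of different size, so the check below converts the pop and Russian rap counts into shares of each city's total plays. It is a minimal sketch that reuses the numbers printed earlier in this notebook; the `genre_share` frame is an addition for illustration.

```python
import pandas as pd

# Total plays per city, from the per-city counts in the first hypothesis section.
total_plays = pd.Series({'Moscow': 42741, 'Saint-Petersburg': 18512})

# Plays per genre, copied from the moscow_genres and spb_genres outputs above.
genre_plays = pd.DataFrame(
    {'Moscow': [5892, 1161], 'Saint-Petersburg': [2431, 564]},
    index=['pop', 'rusrap'],
)

# Dividing a DataFrame by a Series aligns the Series index with the columns.
genre_share = genre_plays / total_plays
print((genre_share * 100).round(2))
```

In both genres the two cities differ by less than one percentage point, which is consistent with the conclusion that their tastes are similar.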


# Final Conclusion

Hypotheses have been tested and the following has been established:

1. The day of the week has different effects on user activity in Moscow and St. Petersburg. 

The first hypothesis was fully confirmed.

2. Musical preferences do not change much during the week, whether in Moscow or St. Petersburg. Small differences are noticeable at the beginning of the week, on Mondays:
* in Moscow, they listen to music of the “world” genre,
* in St. Petersburg, jazz and classical music.

Thus, the second hypothesis was only partially confirmed. This result could have been different if not for the missing values in the data.

3. There are more similarities than differences in the tastes of users in Moscow and St. Petersburg. Contrary to expectations, genre preferences in St. Petersburg resemble those in Moscow.

The third hypothesis was not confirmed. If there are differences in preferences, they are not noticeable across the bulk of users.