<br>

# Research: Music of the big Russian cities

***

«Yandex Music» is a music streaming service developed by Yandex. Users choose tracks, albums, and collections to stream to their devices on demand and receive personalized recommendations. The service is also available as a mobile app for iOS and Android. It operates in Russia, Belarus, Kyrgyzstan, Kazakhstan, Azerbaijan, Armenia, Georgia, Tajikistan, Moldova, and Israel.

As of October 2017, over 40 million music tracks are available on Yandex Music. About 20 million people use the service at least once a month.

The most popular feature of Yandex Music is its smart playlists, which are updated daily for each user and feature recently played tracks, music similar to the user's favorites, and diverse tracks based on the user's tastes.

To retain customers, attract new ones, and make the brand more recognizable, the teams behind such services often study their audience and publish the interesting results.

### Research questions
- Do the musical preferences of the residents of the two Russian megacities of Moscow and St. Petersburg differ?
- Does the music played on the way to work on Monday morning differ from the music played on Wednesday or at the end of the workweek?

### Research plan

1. Data collection. Read the data and study it.
2. Data cleaning. Get rid of duplicates, problems with column names, and missing values.
3. Data analysis. Answer the main questions of the study, prepare a reporting table or describe the result.
4. Summary. Review the work done and draw conclusions.

### Data Description
- userID -> user_id
- Track -> track_name
- artist -> artist_name
- genre -> genre_name
- City -> city
- time -> time
- Day -> weekday
<br>
<br>

***


## Step 1. Data Requirement Gathering

We will study the data provided by the service for the project. <br>
Import libraries

In [1]:
import pandas as pd

Read the file *music_project.csv* and save it in the variable *df*.

In [2]:
df = pd.read_csv('music_project.csv')

See the first 10 rows of the table.

In [3]:
df.head(10)

Unnamed: 0.1,Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
5,5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
6,6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8,8FA1D3BE,И вновь продолжается бой,,ruspop,Moscow,09:17:40,Friday
9,9,E772D5C0,Pessimist,,dance,Saint-Petersburg,21:20:49,Wednesday


General information about the data in the table *df*.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 8 columns):
Unnamed: 0    65079 non-null int64
  userID      65079 non-null object
Track         63848 non-null object
artist        57876 non-null object
genre         63881 non-null object
  City        65079 non-null object
time          65079 non-null object
Day           65079 non-null object
dtypes: int64(1), object(7)
memory usage: 4.0+ MB
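The `Unnamed: 0` column in the output above looks like a row index that was written into the file when it was saved; the source does not confirm this, so treat it as an assumption. The sketch below uses a small made-up CSV (not the real *music_project.csv*) to show how such a column arises and how `index_col=0` would absorb it on reading.

```python
import io
import pandas as pd

# Toy CSV standing in for music_project.csv (contents are made up for illustration).
# The leading empty header cell is how an unnamed index column is written to a file.
csv_text = (
    ",userID,Track,genre\n"
    "0,FFB692EC,Kamigata To Boots,rock\n"
    "1,55204538,Delayed Because of Accident,rock\n"
)

# Default read: the saved index shows up as an extra 'Unnamed: 0' column.
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use that first column as the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(indexed.columns))
```

The notebook keeps the column and later renames it, which is equally valid; this is only an alternative way to read the file.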


### **Conclusion: Step 1**

Each row of the table contains information about a track of a particular genre by a particular artist that a user listened to in one of the cities at a specific time and day of the week.<br>
Two issues need solving: missing values and substandard column names. The columns *time*, *Day* and *City* are especially valuable for testing the working hypotheses. The data in the *genre* column will show the most popular genres.

Consider the information received in more detail.

There are 8 columns in the table: the numbering column *Unnamed: 0* has the type int64, and the other seven have the type object.

Let us look in detail at the remaining seven columns of *df* and the information they contain:

* userID - user identifier;
* Track - the name of the track;
* artist - name of the artist;
* genre - the name of the genre;
* City - the city in which the listening took place;
* time - the time at which the user listened to the track;
* Day - the day of the week.

The number of non-null values in the columns varies. This indicates that the data has missing values.

<br>
<br>


***

## Stage 2. Data Processing

We need to handle the missing values, rename the columns, and check the data for duplicates.

Get a list of column names.

In [5]:
# list of column names of the df table
df.columns

Index(['Unnamed: 0', '  userID', 'Track', 'artist', 'genre', '  City  ',
       'time', 'Day'],
      dtype='object')

Some column names contain leading or trailing spaces, which can make data access difficult.

Rename the columns for the convenience of further work. Check the result.

In [6]:
# assign the new names directly; set_axis(..., inplace=True) is not available in pandas 2.0 and later
df.columns = ['number', 'user_id', 'track_name', 'artist_name', 'genre_name', 'city', 'time', 'weekday']

In [7]:
# checking results - list of column names
df.columns

Index(['number', 'user_id', 'track_name', 'artist_name', 'genre_name', 'city',
       'time', 'weekday'],
      dtype='object')
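Assigning a full list of names relies on the columns being in a known order. A minimal alternative sketch, using a hypothetical empty frame `sample` with some of the padded headers seen earlier, strips the whitespace first and then renames by explicit mapping:

```python
import pandas as pd

# Toy frame reproducing some of the padded headers from the original table.
sample = pd.DataFrame(columns=['Unnamed: 0', '  userID', 'Track', '  City  ', 'Day'])

# Strip surrounding whitespace from every header.
sample.columns = sample.columns.str.strip()

# Rename by explicit mapping, so a changed column order cannot mislabel data.
sample = sample.rename(columns={
    'Unnamed: 0': 'number',
    'userID': 'user_id',
    'Track': 'track_name',
    'City': 'city',
    'Day': 'weekday',
})

print(list(sample.columns))
```

The mapping is longer to write than a list, but each old name is paired with its new name in plain sight.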

Check the data for missing values by chaining methods that count them per column.

In [8]:
# number of missing values in each column of df, found with isnull() and summed

df.isnull().sum()

number            0
user_id           0
track_name     1231
artist_name    7203
genre_name     1198
city              0
time              0
weekday           0
dtype: int64

Missing values show that not all information is available for some tracks. The reasons may vary: for example, the artist of a song may not be named. Ideally, each case would be examined and its cause identified.
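Raw counts are easier to judge as shares. From the figures above, roughly 11% of rows lack an artist (7203 of 65079) and about 1.9% and 1.8% lack a track name or a genre. The sketch below shows one way to get such shares directly; the frame `sample` and its values are illustrative only.

```python
import pandas as pd

# Toy frame with gaps (values are illustrative only).
sample = pd.DataFrame({
    'track_name': ['True', None, 'Pessimist', 'Soul People'],
    'artist_name': ['Roman Messer', None, None, 'Space Echo'],
    'genre_name': ['dance', 'ruspop', 'dance', 'dance'],
})

# isna() marks each missing cell; the column mean is the share of missing cells.
missing_share = sample.isna().mean()

print(missing_share)
```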

We replace the missing values in the track name and artist columns with the string 'unknown'. After this operation we check that these two columns no longer contain missing values.

In [46]:
# replacing the missing values in the column 'track_name'
# with the string 'unknown' using fillna()

df['track_name'] = df['track_name'].fillna('unknown') 

In [47]:
# replacing the missing values in the column 'artist_name'
# with the string 'unknown' using fillna()

df['artist_name'] = df['artist_name'].fillna('unknown')

In [11]:
# check: calculating the total number of gaps identified in table df

df.isnull().sum()

number            0
user_id           0
track_name        0
artist_name       0
genre_name     1198
city              0
time              0
weekday           0
dtype: int64

Delete the rows with missing values in the genre column and make sure that none remain.

In [12]:
df.dropna(subset = ['genre_name'], inplace = True)

In [13]:
df.isnull().sum()

number         0
user_id        0
track_name     0
artist_name    0
genre_name     0
city           0
time           0
weekday        0
dtype: int64

Next we check for duplicates. If any are found, we delete them and verify that none remain.

In [48]:
# getting the total number of duplicates in the table df

df.duplicated().sum()

0

In [15]:
# delete all duplicates from df with drop_duplicates() and rebuild the index

df = df.drop_duplicates().reset_index(drop=True)

In [49]:
# duplicate check
df.duplicated().sum()

0

Duplicates may appear because of a failure in data recording. It is worth investigating why such “information garbage” appears.

We save the unique values of the genre column in the variable *genres_list*.

We declare the function *find_genre()* to search for implicit duplicates in the genre column, for example when the same genre is spelled in different ways.
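One caveat: if the *number* column is simply a saved row index (an assumption, the source does not say), every row differs in that column and a whole-row comparison can never report a duplicate. A hedged sketch on a toy frame, comparing all columns except *number*:

```python
import pandas as pd

# Toy frame: rows 0 and 1 describe the same play but carry different row numbers.
sample = pd.DataFrame({
    'number': [0, 1, 2],
    'user_id': ['A1', 'A1', 'B2'],
    'track_name': ['True', 'True', 'Pessimist'],
    'time': ['13:00:07', '13:00:07', '21:20:49'],
})

# Comparing whole rows finds nothing, because 'number' differs in every row.
full_row_dupes = sample.duplicated().sum()

# Comparing every column except 'number' reveals the repeated play.
content_cols = [c for c in sample.columns if c != 'number']
content_dupes = sample.duplicated(subset=content_cols).sum()

# Drop the repeats on the same subset of columns and rebuild the index.
deduplicated = sample.drop_duplicates(subset=content_cols).reset_index(drop=True)

print(full_row_dupes, content_dupes, len(deduplicated))
```

Whether identical plays with different row numbers should count as duplicates depends on how the log was produced, so this is a check to consider rather than a required step.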

In [17]:
# saving the list of unique values in the variable genres_list,
# identified by a special method in the column 'genre_name'

genres_list = df['genre_name'].unique()

In [18]:
# find_genre() takes the name of a genre and counts how many times
# it occurs in genres_list, the list of unique genre names
# (so the result is 1 if the name is present and 0 if it is not)

def find_genre(genre):
    count = 0
    for name in genres_list:
        if name == genre:
            count += 1
    return count

We call *find_genre()* to search for variant spellings of the hip-hop genre in the table.

The correct name is *hiphop*. Let's look for other variants:

* hip
* hop
* hip-hop

In [19]:
# check whether the 'hip' variant is present
find_genre('hip')

1

In [20]:
# checking for 'hop' option
find_genre('hop')

0

In [50]:
# checking for the 'hip-hop' option
find_genre('hip-hop')

0
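Because *find_genre()* searches the list of unique names, it reports only whether a spelling exists, not how many rows use it. A minimal sketch of counting the affected rows for several variants at once, on a toy Series `genres` with made-up values:

```python
import pandas as pd

# Toy genre column (values are illustrative only).
genres = pd.Series(['hiphop', 'hip', 'rock', 'hiphop', 'hip', 'pop'], name='genre_name')

# Candidate spellings to look for.
variants = ['hip', 'hop', 'hip-hop']

# value_counts() gives rows per genre; reindex keeps every variant, with 0 for absent ones.
variant_counts = genres.value_counts().reindex(variants, fill_value=0)

print(variant_counts)
```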

We declare the function *find_hip_hop()*, which replaces an incorrect name of this genre in the *genre_name* column with *hiphop* and returns the number of rows that still contain the incorrect name (0 means the replacement succeeded).

This way we fix every spelling variant that the check revealed.

In [22]:
def find_hip_hop(df, wrong):
    df['genre_name'] = df['genre_name'].replace(wrong,'hiphop') 
    final = df[df['genre_name'] == wrong]['genre_name'].count() 
    return final

In [23]:
find_hip_hop(df, 'hip')

0
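If several wrong spellings had been found, they could be replaced in one call. The helper `replace_wrong_genres` below is a hypothetical generalization of *find_hip_hop()*, shown on a toy frame; it is a sketch, not part of the original notebook.

```python
import pandas as pd

# Toy genre column (values are illustrative only).
sample = pd.DataFrame({'genre_name': ['hip', 'hiphop', 'hip-hop', 'rock', 'hop']})

def replace_wrong_genres(df, wrong_genres, correct_genre):
    """Replace every spelling in wrong_genres with correct_genre; return rows still wrong."""
    df['genre_name'] = df['genre_name'].replace(wrong_genres, correct_genre)
    return int(df['genre_name'].isin(wrong_genres).sum())

# Fix three variants at once; 0 means no wrong spellings are left.
remaining = replace_wrong_genres(sample, ['hip', 'hop', 'hip-hop'], 'hiphop')

print(remaining, sample['genre_name'].tolist())
```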

We get general information about the data and make sure that the cleaning succeeded.

In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63881 entries, 0 to 63880
Data columns (total 8 columns):
number         63881 non-null int64
user_id        63881 non-null object
track_name     63881 non-null object
artist_name    63881 non-null object
genre_name     63881 non-null object
city           63881 non-null object
time           63881 non-null object
weekday        63881 non-null object
dtypes: int64(1), object(7)
memory usage: 3.9+ MB


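### **Conclusion: Step 2**

At the preprocessing stage we found missing values and problems with column names, and checked the data for duplicates. No fully duplicated rows were found (the table kept 63881 rows, which is 65079 minus the 1198 rows without a genre), and one implicit duplicate (*hip* instead of *hiphop*) was corrected. Cleaning the data allows a more accurate analysis. Because genre information is essential for the analysis, we did not simply delete every row with a missing value: rows without a genre were removed, while missing track and artist names were filled with 'unknown'. The column names are now correct and convenient for further work.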

<br>
<br>

***

## Stage 3. Analysis of music preference data
### Do people in different cities really listen to music differently?

It has been hypothesized that users in Moscow and St. Petersburg listen to music differently. We test this assumption using data for three days of the week: Monday, Wednesday and Friday.

For each city we find the number of tracks played on these days for which the genre is known, and compare the results.

We group the data by city and use the *count()* method to count the tracks for which the genre is known.

In [25]:
# group by city and count the non-missing values in every column
df.groupby('city').count()

city,number,user_id,track_name,artist_name,genre_name,time,weekday
Moscow,44456,44456,44456,44456,44456,44456,44456
Saint-Petersburg,19425,19425,19425,19425,19425,19425,19425


There are more plays in Moscow than in St. Petersburg, but this does not mean that Moscow is more active. Yandex.Music simply has more users in Moscow overall, so the totals should be read with the difference in audience size in mind.

We group the data by day of the week and count the tracks played on Monday, Wednesday and Friday for which the genre is known.

In [26]:
# group by day of the week and count the non-missing values in every column
df.groupby('weekday').count()

weekday,number,user_id,track_name,artist_name,genre_name,city,time
Friday,22774,22774,22774,22774,22774,22774,22774
Monday,22181,22181,22181,22181,22181,22181,22181
Wednesday,18926,18926,18926,18926,18926,18926,18926


Monday and Friday are the time for music; on Wednesdays users listen noticeably less (18926 plays against 22181 on Monday and 22774 on Friday).

We create the function *number_tracks()*, which takes the table, a day of the week and a city name as parameters and returns the number of tracks played for which the genre is known. We then get the number of tracks played in each city on Monday, Wednesday and Friday.

In [51]:
# number_tracks() takes three parameters: df, day, city
# track_list holds the rows of df where 'weekday' equals day
# and, at the same time, 'city' equals city
# track_list_count is the number of non-missing 'genre_name' values
# in track_list, obtained with count()

def number_tracks(df, day, city):
    track_list = df[(df['weekday'] == day) & (df['city'] == city)]
    track_list_count = track_list['genre_name'].count()
    return track_list_count

In [28]:
# number of tracks for Moscow on Monday

number_tracks(df, 'Monday', 'Moscow')

16299

In [29]:
# number of tracks for St. Petersburg on Monday

number_tracks(df, 'Monday', 'Saint-Petersburg')

5882

In [30]:
# number of tracks for Moscow on Wednesday

number_tracks(df, 'Wednesday', 'Moscow')

11547

In [31]:
# number of tracks for St. Petersburg on Wednesday

number_tracks(df, 'Wednesday', 'Saint-Petersburg')

7379

In [32]:
# number of tracks for Moscow on Friday

number_tracks(df, 'Friday', 'Moscow')

16610

In [33]:
# number of tracks for St. Petersburg on Friday

number_tracks(df, 'Friday', 'Saint-Petersburg')

6164

Let us summarize the results in one table with the columns ['city', 'monday', 'wednesday', 'friday'].

In [34]:
# values returned by number_tracks() in the six cells above
data = [['Moscow', 16299, 11547, 16610],
       ['Saint-Petersburg', 5882, 7379, 6164]]
columns = ['city','monday','wednesday','friday']

table = pd.DataFrame(data = data, columns = columns)
print(table)

               city  monday  wednesday  friday
0            Moscow   16299      11547   16610
1  Saint-Petersburg    5882       7379    6164
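Typing the counts into the table by hand invites transcription errors. As a sketch of an alternative, `pd.crosstab` can build the same kind of city-by-day table straight from the data; the frame `plays` below is a made-up stand-in for *df*.

```python
import pandas as pd

# Toy listening log (values are illustrative only).
plays = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Saint-Petersburg', 'Moscow', 'Saint-Petersburg'],
    'weekday': ['Monday', 'Friday', 'Wednesday', 'Monday', 'Monday'],
    'genre_name': ['pop', 'rock', 'dance', 'pop', 'jazz'],
})

# crosstab counts rows for every (city, weekday) pair in one call.
summary = pd.crosstab(plays['city'], plays['weekday'])

# Put the days in calendar order rather than alphabetical order.
summary = summary.reindex(columns=['Monday', 'Wednesday', 'Friday'], fill_value=0)

print(summary)
```

Note that `crosstab` counts rows, so it matches *number_tracks()* only once rows with a missing genre have been dropped, as they were in Stage 2.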


### **Conclusion: Step 3**

The results show that, relative to Wednesday, listening in Moscow and St. Petersburg is mirrored: in Moscow the peaks fall on Monday and Friday, and listening drops on Wednesday, whereas in St. Petersburg Wednesday is the day of greatest interest in music, and Monday and Friday are lower by almost the same amount.
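Since the two cities differ in audience size, the pattern is easier to compare as shares of each city's own total. The sketch below uses the counts returned by *number_tracks()* in the cells above; the names `counts`, `shares` and `peak_day` are introduced here for illustration.

```python
import pandas as pd

# Counts returned by number_tracks() in the cells above.
counts = pd.DataFrame(
    {'monday': [16299, 5882], 'wednesday': [11547, 7379], 'friday': [16610, 6164]},
    index=['Moscow', 'Saint-Petersburg'],
)

# Divide each row by its own total, so each city's three shares sum to 1.
shares = counts.div(counts.sum(axis=1), axis=0)

# Day with the largest share in each city.
peak_day = shares.idxmax(axis=1)

print(shares.round(3))
print(peak_day)
```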

<br>
<br>

***

## Stage 4. Analysis of music preference data by day of the week
### Monday morning and Friday evening - different music or the same?

We are looking for the answer to the question of which genres prevail in different cities on Monday morning and Friday evening.

There is an assumption that on Monday morning, users listen to more invigorating music (for example, the pop genre), and on Friday evenings - more dance music (for example, electronica).

We build separate tables for Moscow (*moscow_general*) and for St. Petersburg (*spb_general*).

In [35]:
moscow_general = df[df['city'] == 'Moscow']

In [36]:
spb_general = df[df['city'] == 'Saint-Petersburg']

We create the function *genre_weekday()*, which returns the ten most popular genres for a given day of the week within a given time window.

In [52]:
# genre_weekday() takes the parameters df, day, time1, time2
# genre_list holds the rows of df for which, at the same time:
# 1) the value in 'weekday' equals day,
# 2) the value in 'time' is greater than time1, and
# 3) the value in 'time' is less than time2
# (times are zero-padded 'HH:MM:SS' strings, so string comparison follows clock order)
# genre_list_sorted holds the 10 largest play counts per genre, in descending order

def genre_weekday(df, day, time1, time2):
    genre_list = df[(df['weekday'] == day) & (df['time'] > time1) & (df['time'] < time2)]
    genre_list_sorted = genre_list.groupby('genre_name')['genre_name'].count().sort_values(ascending=False).head(10)
    return genre_list_sorted
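Comparing times as strings works only while every value is zero-padded to 'HH:MM:SS'. A more defensive sketch parses the column with `pd.to_timedelta` first; the function name `genre_weekday_td` and the frame `plays` are hypothetical, and the data is made up.

```python
import pandas as pd

# Toy listening log (values are illustrative only).
plays = pd.DataFrame({
    'weekday': ['Monday', 'Monday', 'Monday', 'Friday'],
    'time': ['08:37:09', '08:34:34', '13:00:07', '20:47:49'],
    'genre_name': ['folk', 'dance', 'dance', 'dance'],
})

def genre_weekday_td(df, day, time1, time2, top=10):
    """Top genres on `day` strictly between time1 and time2 ('HH:MM:SS' strings)."""
    # Parse clock times as durations since midnight so they compare numerically.
    elapsed = pd.to_timedelta(df['time'])
    mask = (
        (df['weekday'] == day)
        & (elapsed > pd.Timedelta(time1))
        & (elapsed < pd.Timedelta(time2))
    )
    # value_counts() sorts by frequency in descending order by default.
    return df.loc[mask, 'genre_name'].value_counts().head(top)

monday_morning = genre_weekday_td(plays, 'Monday', '07:00:00', '11:00:00')

print(monday_morning)
```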

We compare the results for Moscow and St. Petersburg on Monday morning (from 7 to 11) and on Friday evening (from 17 to 23).

In [38]:
genre_weekday(moscow_general, 'Monday', '07:00:00', '11:00:00')

genre_name
pop            830
dance          589
rock           511
electronic     501
hiphop         311
ruspop         203
world          190
rusrap         188
alternative    175
classical      167
Name: genre_name, dtype: int64

In [39]:
genre_weekday(spb_general, 'Monday', '07:00:00', '11:00:00')

genre_name
pop            238
dance          192
rock           173
electronic     154
hiphop          88
ruspop          68
alternative     65
rusrap          56
jazz            47
classical       42
Name: genre_name, dtype: int64

In [40]:
genre_weekday(moscow_general, 'Friday', '17:00:00', '23:00:00')

genre_name
pop            761
rock           546
dance          521
electronic     510
hiphop         282
world          220
ruspop         184
alternative    176
classical      171
rusrap         151
Name: genre_name, dtype: int64

In [41]:
genre_weekday(spb_general, 'Friday', '17:00:00', '23:00:00')

genre_name
pop            279
rock           230
electronic     227
dance          221
hiphop         103
alternative     67
jazz            66
rusrap          66
classical       64
world           60
Name: genre_name, dtype: int64

The popular genres on Monday morning in St. Petersburg and Moscow turned out to be similar: as expected, pop leads in both, and the top five genres are the same. The tails of the two top-10 lists differ, however: St. Petersburg's list includes *jazz*, while Moscow's includes *world*.

At the end of the week the situation barely changes. Pop still comes first. Again, the difference is noticeable only in the tail of the top 10: on Friday evening *world* also enters St. Petersburg's list and *jazz* stays there, while *ruspop* remains only in Moscow's list.
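The differences between the lists can be checked mechanically with set operations. The sets below are copied from the four outputs above; the variable names are introduced here for illustration.

```python
# Top-10 genres copied from the four outputs above.
moscow_monday = {'pop', 'dance', 'rock', 'electronic', 'hiphop',
                 'ruspop', 'world', 'rusrap', 'alternative', 'classical'}
spb_monday = {'pop', 'dance', 'rock', 'electronic', 'hiphop',
              'ruspop', 'alternative', 'rusrap', 'jazz', 'classical'}
moscow_friday = {'pop', 'rock', 'dance', 'electronic', 'hiphop',
                 'world', 'ruspop', 'alternative', 'classical', 'rusrap'}
spb_friday = {'pop', 'rock', 'electronic', 'dance', 'hiphop',
              'alternative', 'jazz', 'rusrap', 'classical', 'world'}

# Genres present in one city's top 10 but not in the other's.
monday_only_moscow = moscow_monday - spb_monday
monday_only_spb = spb_monday - moscow_monday
friday_only_moscow = moscow_friday - spb_friday
friday_only_spb = spb_friday - moscow_friday

print(monday_only_moscow, monday_only_spb)
print(friday_only_moscow, friday_only_spb)
```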

### **Conclusion: Step 4**

Pop is the undisputed leader, and the top 5 contains the same genres in both cities. The tail of the list is more “lively”: each city has characteristic genres whose positions really do change with the day of the week and the time of day.

<br>
<br>

***

## Stage 5. Comparative analysis of musical preferences of residents of Moscow and St. Petersburg
### Moscow and St. Petersburg are two different capitals, two different directions in music. True?

**Hypothesis**<br>
St. Petersburg has a rich rap culture, so rap is listened to more often there.<br>
Moscow is a city of contrasts, but most users listen to pop music.

We group the table *moscow_general* by genre, count the tracks of each genre with the *count()* method, sort the result in descending order and save it in *moscow_genres*.

Let's look at the first 10 rows of this new table.

In [42]:
moscow_genres = moscow_general.groupby('genre_name')['genre_name'].count().sort_values(ascending = False)

# in one row: grouping the moscow_general table by the column 'genre_name',
# counting the number of 'genre_name' values in this grouping using the count () method,
# sorting Series in descending order and saving in moscow_genres

In [43]:
moscow_genres.head(10)

genre_name
pop            6253
dance          4707
rock           4188
electronic     4010
hiphop         2215
classical      1712
world          1516
alternative    1466
ruspop         1453
rusrap         1239
Name: genre_name, dtype: int64

We group the table *spb_general* by genre, count the tracks of each genre with the *count()* method, sort the result in descending order and save it in *spb_genres*.

We look through the first 10 rows of this table. Now the two cities can be compared.

In [44]:
spb_genres = spb_general.groupby('genre_name')['genre_name'].count().sort_values(ascending = False)

In [45]:
spb_genres.head(10)

genre_name
pop            2597
dance          2054
rock           2004
electronic     1842
hiphop         1020
alternative     700
classical       684
rusrap          604
ruspop          565
world           553
Name: genre_name, dtype: int64

### **Conclusion: Step 5**

In Moscow, in addition to the overwhelmingly popular pop genre, Russian pop (*ruspop*) is also in the top 10, so interest in pop music is broad. Rap, contrary to the hypothesis, occupies similar positions in both cities: *hiphop* is fifth in each, while *rusrap* is tenth in Moscow and eighth in St. Petersburg.
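Ranks hide how large the gaps are, so it can help to express each genre as a share of its city's plays. The sketch below takes selected counts from the two top-10 outputs above and the city totals from Stage 3; `counts`, `totals` and `share_pct` are names introduced here.

```python
import pandas as pd

# Play counts from moscow_genres.head(10) and spb_genres.head(10) above (selected genres).
counts = pd.DataFrame(
    {'Moscow': [6253, 2215, 1453, 1239], 'Saint-Petersburg': [2597, 1020, 565, 604]},
    index=['pop', 'hiphop', 'ruspop', 'rusrap'],
)

# Total plays per city, from the groupby('city') output in Stage 3.
totals = pd.Series({'Moscow': 44456, 'Saint-Petersburg': 19425})

# Share of each city's plays that falls in each genre, in percent.
share_pct = counts.div(totals, axis=1) * 100

print(share_pct.round(2))
```

Differences of a fraction of a percentage point are small next to the top-10 totals, so they are better read as a rough comparison than as evidence of distinct tastes.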

<br>
<br>

***

## Step 6. Research Results

**Hypotheses:**<br>
* People in the two cities, Moscow and St. Petersburg, listen to music on different schedules;
* Lists of the ten most popular genres on Monday morning and Friday evening have characteristic differences;
* The population of the two cities prefers different musical genres.

**General results**

Moscow and St. Petersburg agree in their tastes: pop music prevails everywhere. At the same time, preferences within each city do not depend on the day of the week: people consistently listen to what they like. Between the cities, however, the weekly pattern is mirrored around Wednesday: Moscow listens more on Monday and Friday, while St. Petersburg listens more on Wednesday and less on Monday and Friday.

As a result:
- the first hypothesis is confirmed,
- the second hypothesis is confirmed,
- the third hypothesis is confirmed.


<br>
<br>

***