# 1. Data collection

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('/datasets/music_project.csv')

In [None]:
df.head(10)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8FA1D3BE,И вновь продолжается бой,,ruspop,Moscow,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Saint-Petersburg,21:20:49,Wednesday


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
  userID    65079 non-null object
Track       63848 non-null object
artist      57876 non-null object
genre       63881 non-null object
  City      65079 non-null object
time        65079 non-null object
Day         65079 non-null object
dtypes: object(7)
memory usage: 3.5+ MB


* userID: user ID;
* Track: track name;
* artist: artist name;
* genre: genre name;
* City: the city where the user listened to the track (Moscow or Saint Petersburg);
* time: the time at which the user listened to the track;
* Day: day of the week.

# 2. Data preparation

Rename columns:

In [None]:
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

In [None]:
# Assign the new names directly; set_axis(..., inplace=True) was removed in pandas 2.0.
df.columns = ['user_id','track_name','artist_name','genre_name','city','time','weekday']

In [None]:
df.columns

Index(['user_id', 'track_name', 'artist_name', 'genre_name', 'city', 'time',
       'weekday'],
      dtype='object')
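
The raw headers carry stray spaces and mixed case, and the rename above relies on column position. As a minimal sketch on a toy frame (the sample row and the `renames` mapping are illustrative assumptions, not taken from the dataset), the whitespace can be stripped first and the rename done by name:

```python
import pandas as pd

# Toy frame with the same header problems as the raw file (values are made up).
toy = pd.DataFrame(
    [["FFB692EC", "rock", "Moscow", "Monday"]],
    columns=["  userID", "genre", "  City  ", "Day"],
)

# Strip leading/trailing whitespace from every header.
toy.columns = toy.columns.str.strip()

# Rename by name, so a change in column order cannot mislabel a column.
renames = {"userID": "user_id", "genre": "genre_name", "City": "city", "Day": "weekday"}
toy = toy.rename(columns=renames)

print(list(toy.columns))
```

Renaming by name costs a few more keystrokes, but it does not silently mislabel columns if the file layout changes.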

Deal with missing values:

In [None]:
df.isnull().sum()

user_id           0
track_name     1231
artist_name    7203
genre_name     1198
city              0
time              0
weekday           0
dtype: int64

Fill missing values in the *track_name* and *artist_name* columns with 'unknown':

In [None]:
df['track_name'] = df['track_name'].fillna('unknown') 

In [None]:
df['artist_name'] = df['artist_name'].fillna('unknown')

In [None]:
df.isnull().sum()

user_id           0
track_name        0
artist_name       0
genre_name     1198
city              0
time              0
weekday           0
dtype: int64

Drop the rows with a missing value in the *genre_name* column:

In [None]:
df.dropna(subset = ['genre_name'], inplace = True)

In [None]:
df.isnull().sum()

user_id        0
track_name     0
artist_name    0
genre_name     0
city           0
time           0
weekday        0
dtype: int64

Detect and delete duplicate data:

In [None]:
df.duplicated().sum()

3755

In [None]:
df = df.drop_duplicates().reset_index(drop=True)

In [None]:
df.duplicated().sum()

0

Find "hidden" duplicates in the *genre_name* column with the *find_genre()* function. A "hidden" duplicate is the same genre written under a different spelling:

In [None]:
genres_list = df['genre_name'].unique()

In [None]:
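def find_genre(genre):
    # Count how many entries of the global genres_list equal `genre`.
    # genres_list holds unique values, so the result is 0 (absent) or 1 (present).
    count = 0
    for g in genres_list:
        if g == genre:
            count += 1
    return count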

Call the *find_genre()* function to look for variant spellings of the 'hip-hop' genre.

The correct spelling is *hiphop*. Check for these other variants:
* hip
* hop
* hip-hop


In [None]:
find_genre('hip')

1

In [None]:
find_genre('hop')

0

In [None]:
find_genre('hip-hop')

0

Define the *find_hip_hop()* function, which replaces a wrong spelling of this genre in the *genre_name* column with *'hiphop'* and returns the number of rows that still hold the wrong spelling (0 means the replacement worked).

Then correct every variant that *find_genre()* found.

In [None]:
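def find_hip_hop(df, wrong):
    # Replace the wrong spelling with 'hiphop' (this modifies df itself).
    df['genre_name'] = df['genre_name'].replace(wrong, 'hiphop')
    # Count the rows that still contain the wrong spelling; 0 means success.
    final_count = df[df['genre_name'] == wrong]['genre_name'].count()
    return final_count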

In [None]:
find_hip_hop(df,'hip')

0
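
The check and the replacement can also be done for all variants in one pass. The following is a minimal sketch on a toy column (the sample values and the names `toy`, `variants`, `before` and `after` are illustrative assumptions), using `Series.isin` to count affected rows and `Series.replace` with a list:

```python
import pandas as pd

# Toy genre column; the values are made up for illustration.
toy = pd.DataFrame({"genre_name": ["hip", "hiphop", "hop", "hip-hop", "rock"]})

variants = ["hip", "hop", "hip-hop"]  # spellings to fold into 'hiphop'

# Count rows (not unique values) that use any of the variants.
before = toy["genre_name"].isin(variants).sum()

# Replace every listed variant with the canonical spelling.
toy["genre_name"] = toy["genre_name"].replace(variants, "hiphop")

# Re-count to confirm that no variant is left.
after = toy["genre_name"].isin(variants).sum()
print(before, after)
```

Unlike *find_genre()*, which looks at unique values only, `isin(...).sum()` reports how many rows are affected.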

Now check if everything is okay in the data:

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60126 entries, 0 to 60125
Data columns (total 7 columns):
user_id        60126 non-null object
track_name     60126 non-null object
artist_name    60126 non-null object
genre_name     60126 non-null object
city           60126 non-null object
time           60126 non-null object
weekday        60126 non-null object
dtypes: object(7)
memory usage: 3.2+ MB


# Listening pattern by day of the week

The hypothesis is that listening patterns differ between Moscow and Saint Petersburg. Let's check this using data for three days of the week: Monday, Wednesday and Friday.

Count all tracks grouped by *city*:

In [None]:
df.groupby('city')['genre_name'].count()

city
Moscow              41892
Saint-Petersburg    18234
Name: genre_name, dtype: int64

The result doesn't mean that Moscow is more "active"; there are simply more users in Moscow than in Saint Petersburg.
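
One way to separate "more users" from "more active users" is to divide the number of plays by the number of distinct users in each city. This is a minimal sketch on toy data (the rows and the names `toy` and `per_city` are illustrative assumptions; the notebook itself does not run this check):

```python
import pandas as pd

# Toy play log: one row per play (values are made up).
toy = pd.DataFrame({
    "user_id": ["A", "A", "B", "C", "D", "D"],
    "city": ["Moscow", "Moscow", "Moscow", "Moscow",
             "Saint-Petersburg", "Saint-Petersburg"],
})

# Plays and distinct users per city.
per_city = toy.groupby("city").agg(
    plays=("user_id", "count"),
    users=("user_id", "nunique"),
)

# Average number of plays per user, a rough measure of activity.
per_city["plays_per_user"] = per_city["plays"] / per_city["users"]
print(per_city)
```

On the real data the same three columns would show whether the gap in total plays comes from the size of the user base or from how much each user listens.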

Count all tracks listened to on Monday, Wednesday and Friday respectively:

In [None]:
df.groupby('weekday')['genre_name'].count()

weekday
Friday       21482
Monday       20866
Wednesday    17778
Name: genre_name, dtype: int64

Monday and Friday are music time: each has over 20,000 plays, against 17,778 on Wednesday.

Let's check the number of tracks for each city on Monday, Wednesday and Friday respectively using *number_tracks()* function: 

In [None]:
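def number_tracks(df, day, city):
    # Keep only the rows for the given weekday and city.
    track_list = df[(df['weekday']==day) & (df['city']==city)]
    # Count the plays; genre_name has no missing values after preparation.
    track_list_count = track_list['genre_name'].count()
    return track_list_count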

In [None]:
number_tracks(df, 'Monday', 'Moscow')

15347

In [None]:
number_tracks(df, 'Monday', 'Saint-Petersburg')

5519

In [None]:
number_tracks(df, 'Wednesday', 'Moscow')

10865

In [None]:
number_tracks(df, 'Wednesday', 'Saint-Petersburg')

6913

In [None]:
number_tracks(df, 'Friday', 'Moscow')

15680

In [None]:
number_tracks(df, 'Friday', 'Saint-Petersburg')

5802

Summary table:


In [None]:
data = [['Moscow', 15347, 10865, 15680],
       ['Saint-Petersburg', 5519, 6913, 5802]] 
columns = ['city','monday','wednesday','friday']
table = pd.DataFrame(data = data, columns = columns)
table

Unnamed: 0,city,monday,wednesday,friday
0,Moscow,15347,10865,15680
1,Saint-Petersburg,5519,6913,5802
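
The figures in the summary table were typed in by hand from the six calls above. As a sketch of an alternative, `pd.crosstab` can build the same kind of city-by-weekday table directly from the data. The toy rows and the name `summary` are illustrative assumptions:

```python
import pandas as pd

# Toy play log with only the two columns needed (values are made up).
toy = pd.DataFrame({
    "city": ["Moscow", "Moscow", "Saint-Petersburg", "Moscow", "Saint-Petersburg"],
    "weekday": ["Monday", "Friday", "Wednesday", "Monday", "Wednesday"],
})

# Count plays for every (city, weekday) pair; absent pairs become 0.
summary = pd.crosstab(toy["city"], toy["weekday"])

# crosstab sorts columns alphabetically, so restore calendar order.
summary = summary.reindex(columns=["Monday", "Wednesday", "Friday"], fill_value=0)
print(summary)
```

Building the table from the data avoids copy errors and keeps it in sync if the preparation steps change.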


The results show that the listening pattern is "mirrored" between Moscow and Saint Petersburg. In Moscow the number of plays peaks on Monday and Friday and drops on Wednesday. In Saint Petersburg Wednesday is the busiest day, and Monday and Friday are quieter.

# Music genre on Monday morning and Friday evening

The hypothesis is that users tend to listen to more energetic music (e.g. pop) on Monday morning and to dance music (e.g. electronic) on Friday evening.

Get separate tables for Moscow (*moscow_general*) and for Saint Petersburg (*spb_general*):

In [None]:
moscow_general = df[df['city']=='Moscow']

In [None]:
spb_general = df[df['city']=='Saint-Petersburg']

The *genre_weekday()* function returns the ten most played genres for a given day of the week and time range:

In [None]:
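def genre_weekday(df, day, time1, time2):
    # time is a zero-padded 'HH:MM:SS' string, so string comparison follows clock order.
    genre_list = df[(df['weekday']==day) & (df['time']>time1) & (df['time']<time2)]
    # Count plays per genre and keep the ten most frequent.
    genre_list_sorted = genre_list.groupby('genre_name')['genre_name'].count().sort_values(ascending = False).head(10)
    return genre_list_sorted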
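
The time filter compares strings rather than time objects. That works here because zero-padded `HH:MM:SS` strings sort in the same order as the clock. A small sketch on toy timestamps (the values and the names `toy`, `mask` and `morning` are illustrative assumptions) shows which rows pass the filter:

```python
import pandas as pd

# Toy timestamps around the 07:00-11:00 window (values are made up).
toy = pd.DataFrame({
    "time": ["06:59:59", "07:00:00", "08:34:34", "10:59:59", "11:00:00", "20:28:33"],
})

# Strict comparisons, as in genre_weekday(): both boundaries are excluded.
mask = (toy["time"] > "07:00:00") & (toy["time"] < "11:00:00")
morning = toy.loc[mask, "time"].tolist()
print(morning)
```

Plays logged at exactly 07:00:00 or 11:00:00 fall outside the window; using `>=` on the lower bound would include the first of them.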

Let's compare the results for Moscow and Saint Petersburg on Monday morning (from 7 am to 11 am) and on Friday evening (from 5 pm to 11 pm):

In [None]:
genre_weekday(moscow_general, 'Monday', '07:00:00', '11:00:00')

genre_name
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
classical      157
Name: genre_name, dtype: int64

In [None]:
genre_weekday(spb_general, 'Monday', '07:00:00', '11:00:00')

genre_name
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: genre_name, dtype: int64

In [None]:
genre_weekday(moscow_general, 'Friday', '17:00:00', '23:00:00')

genre_name
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: genre_name, dtype: int64

In [None]:
genre_weekday(spb_general, 'Friday', '17:00:00', '23:00:00')

genre_name
pop            256
rock           216
electronic     216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: genre_name, dtype: int64

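Pop leads in every slot, and the top five are the same in both cities: pop, dance, rock, electronic and hiphop. Pop is always first and hiphop always fifth; only the order of the middle three changes.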

# Different cities, different genres... or not?

Hypothesis: Saint Petersburg is known for its rap culture, so rap is listened to more often there; Moscow is a city of contrasts, but the majority of its users listen to pop music.

Let's get the genre counts for Moscow (*moscow_genres*) and for Saint Petersburg (*spb_genres*):

In [None]:
moscow_genres = moscow_general.groupby('genre_name')['genre_name'].count().sort_values(ascending = False)

In [None]:
moscow_genres.head(10)

genre_name
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre_name, dtype: int64

In [None]:
spb_genres = spb_general.groupby('genre_name')['genre_name'].count().sort_values(ascending = False)

In [None]:
spb_genres.head(10)

genre_name
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre_name, dtype: int64

Contrary to the assumption, rap holds similar positions in both cities: hiphop ranks fifth in each, and rusrap ranks tenth in Moscow and eighth in Saint Petersburg.
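
Raw counts are hard to compare because Moscow has more than twice as many plays. As a sketch, the counts can be turned into per-city shares with `pd.crosstab(..., normalize="index")`. The toy rows and the name `shares` are illustrative assumptions:

```python
import pandas as pd

# Toy play log (values are made up).
toy = pd.DataFrame({
    "city": ["Moscow"] * 4 + ["Saint-Petersburg"] * 2,
    "genre_name": ["pop", "pop", "rock", "rusrap", "pop", "rusrap"],
})

# Share of each genre within its city; every row sums to 1.
shares = pd.crosstab(toy["city"], toy["genre_name"], normalize="index")
print(shares)
```

With shares, a genre's weight in one city can be read against the other regardless of how many users each city has.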

# 4. Results of the research


Hypotheses:

* listening patterns differ between Moscow and Saint Petersburg (*confirmed*);

* the top-10 genres on Monday morning and on Friday evening are quite distinct (*disproved*);

* people in the two cities prefer different music genres (*disproved*).

**Overall results**

Moscow and Saint Petersburg share the same tastes: pop music is the clear leader in both. Genre preferences do not depend on the day of the week: users listen to what they like. The listening pattern, however, is "mirrored": Moscow listens to more music on Monday and Friday, and Saint Petersburg on Wednesday.