# Yandex.Music

# Contents <a id='back'></a>

* [Introduction](#intro)
* [Stage 1. Data overview](#data_review)
    * [Conclusions](#data_review_conclusions)
* [Stage 2. Data preprocessing](#data_preprocessing)
    * [2.1 Header style](#header_style)
    * [2.2 Missing values](#missing_values)
    * [2.3 Duplicates](#duplicates)
    * [2.4 Conclusions](#data_preprocessing_conclusions)
* [Stage 3. Testing the hypotheses](#hypotheses)
    * [3.1 Hypothesis 1: user activity in the two cities](#activity)
    * [3.2 Hypothesis 2: music preferences on Monday and Friday](#week)
    * [3.3 Hypothesis 3: genre preferences in Springfield and Shelbyville](#genre)
* [Findings](#end)

## Introduction <a id='intro'></a>
In this project, we'll be comparing the music preferences of the cities of Springfield and Shelbyville. We will study real Yandex.Music data to test the hypotheses below and compare user behavior for these two cities.

### Goal: 
Testing three hypotheses:
1. User activity differs depending on the day of the week and from city to city. 
2. On Monday mornings, Springfield and Shelbyville residents listen to different genres. This is also true for Friday evenings. 
3. Springfield and Shelbyville listeners have different preferences. In Springfield, they prefer pop, while Shelbyville has more rap fans.
 
[Back to Contents](#back)

## Stage 1. Data overview <a id='data_review'></a>



In [26]:
# importing pandas under the alias pd

import pandas as pd

In [27]:
# reading the file and storing it to df

try:
    df = pd.read_csv("/datasets/music_project_en.csv")
except FileNotFoundError:
    # fall back to a copy of the file in the working directory
    df = pd.read_csv("music_project_en.csv")

In [28]:
# obtaining the first 10 rows from the df table

print(df.head(10))

     userID                        Track            artist   genre  \
0  FFB692EC            Kamigata To Boots  The Mass Missile    rock   
1  55204538  Delayed Because of Accident  Andreas Rönnberg    rock   
2    20EC38            Funiculì funiculà       Mario Lanza     pop   
3  A3DD03C9        Dragons in the Sunset        Fire + Ice    folk   
4  E2DC1FAE                  Soul People        Space Echo   dance   
5  842029A1                       Chains          Obladaet  rusrap   
6  4CB90AA5                         True      Roman Messer   dance   
7  F03E1C1F             Feeling This Way   Polina Griffith   dance   
8  8FA1D3BE                     L’estate       Julia Dalia  ruspop   
9  E772D5C0                    Pessimist               NaN   dance   

        City        time        Day  
0  Shelbyville  20:28:33  Wednesday  
1  Springfield  14:07:09     Friday  
2  Shelbyville  20:58:07  Wednesday  
3  Shelbyville  08:37:09     Monday  
4  Springfield  08:34:34     Monday  
5

In [29]:
# obtaining general information about the table with one command

df.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63736 non-null  object
 2   artist    57512 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


The table contains seven columns. They all store the same data type: `object`.

According to the documentation:
- `'userID'` — user identifier
- `'Track'` — track title
- `'artist'` — artist's name
- `'genre'` — track genre
- `'City'` — user's city
- `'time'` — the exact time the track was played
- `'Day'` — day of the week

We can see three issues with style in the column names:
1. Some names are uppercase, some are lowercase.
2. There are spaces in some names. 
3. <b>Some names are not descriptive enough to make sense on their own, and `userID` should be rewritten in snake_case.</b>
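
The first two issues can be handled in one pass over the headers. The sketch below is only an illustration on a made-up one-row table (`raw` is a hypothetical name, not the project's `df`). It also shows why the third issue still needs an explicit rename: stripping and lowercasing turn `'  userID'` into `'userid'`, not `'user_id'`.

```python
import pandas as pd

# Toy frame reproducing the header problems listed above (the row is made up).
raw = pd.DataFrame(
    [["FFB692EC", "Kamigata To Boots", "Shelbyville"]],
    columns=["  userID", "Track", "  City  "],
)

# Strip surrounding spaces and lowercase every header in one pass.
cleaned = raw.copy()
cleaned.columns = cleaned.columns.str.strip().str.lower()

print(list(cleaned.columns))
```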

The non-null counts differ between columns, which means the data contains missing values.


### Conclusions <a id='data_review_conclusions'></a> 

Each row in the table stores data on a track that was played. Some columns describe the track itself: its title, artist, and genre. The rest convey information about the user: the city they come from and the time they played the track.

It's clear that the data is sufficient to test the hypotheses. However, there are missing values.

To move forward, we need to preprocess the data.

[Back to Contents](#back)

## Stage 2. Data preprocessing <a id='data_preprocessing'></a>

### Header style <a id='header_style'></a>

In [30]:
# the list of column names in the df table

df.columns 

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

In [31]:
# renaming columns

df = df.rename(
    columns={
        '  userID': 'user_id',
        'Track': 'track_name',
        '  City  ': 'city',
        'time': 'time_of_day',
        'Day': 'day'
    }
)

In [32]:
# checking result:

df.columns

Index(['user_id', 'track_name', 'artist', 'genre', 'city', 'time_of_day',
       'day'],
      dtype='object')

[Back to Contents](#back)

### Missing values <a id='missing_values'></a>

In [33]:
# calculating missing values using (1) .isna() or (2) .isnull()

display(df.isna().sum()) 
df.isnull().sum()

user_id           0
track_name     1343
artist         7567
genre          1198
city              0
time_of_day       0
day               0
dtype: int64

user_id           0
track_name     1343
artist         7567
genre          1198
city              0
time_of_day       0
day               0
dtype: int64

Not all missing values affect the research. For instance, the missing values in `'track_name'` and `'artist'` are not critical. We will simply replace them with clear markers.

But missing values in `'genre'` can affect the comparison of music preferences in Springfield and Shelbyville. In real life, it would be useful to learn the reasons why the data is missing and try to make up for them. But we do not have that opportunity in this project. So we will:
* Fill in these missing values with markers
* Evaluate how much the missing values may affect our computations
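
One way to approach the second step is to look at the share of missing values per column rather than the raw counts. This is a minimal sketch on a small made-up table (`toy` is a hypothetical name, not the project data):

```python
import pandas as pd

# Toy data with gaps; None marks a missing value.
toy = pd.DataFrame({
    "track_name": ["a", None, "c", "d"],
    "genre": ["pop", "rock", None, None],
})

# isna() yields booleans; their mean is the fraction of missing values per column.
missing_share = toy.isna().mean()
print(missing_share)
```

For the real table, the counts above give 1198 missing `'genre'` values out of 65079 rows, roughly 1.8%.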

In [34]:
# looping over column names and replacing missing values with 'unknown'
# Replacing missing values for each row under specified 'columns' 

columns_to_replace = ['track_name', 'artist', 'genre']

for col in columns_to_replace:
    df[col] = df[col].fillna('unknown')

In [35]:
# counting missing values

df.isna().sum()

user_id        0
track_name     0
artist         0
genre          0
city           0
time_of_day    0
day            0
dtype: int64

[Back to Contents](#back)

### Duplicates <a id='duplicates'></a>

In [36]:
# assigning a variable to the table of duplicates to get an overview

duplicated_df = df[df.duplicated()]

In [37]:
# counting clear duplicates
# the total rows with obvious duplicates

df.duplicated().sum()


3826

In [38]:
# removing obvious duplicates

df = df.drop_duplicates()

In [39]:
# resetting the index after the drop, then checking for duplicates

df = df.reset_index(drop=True)
df.duplicated().sum()
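
Dropping duplicates and resetting the index can also be chained into one step. The sketch below assumes a made-up four-row table (`toy`), in which one row repeats another exactly:

```python
import pandas as pd

# Row 2 is an exact copy of row 0.
toy = pd.DataFrame({
    "user_id": ["A", "B", "A", "C"],
    "genre": ["pop", "rock", "pop", "jazz"],
})

# Drop exact duplicates and renumber the remaining rows from 0.
deduped = toy.drop_duplicates().reset_index(drop=True)

# Re-check: no exact duplicates should remain.
remaining = deduped.duplicated().sum()
print(len(deduped), remaining)
```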

In [40]:
# viewing unique genre names

df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'ambient', 'americana', 'animated', 'anime',
       'arabesk', 'arabic', 'arena', 'argentinetango', 'art', 'audiobook',
       'avantgarde', 'axé', 'baile', 'balkan', 'beats', 'bigroom',
       'black', 'bluegrass', 'blues', 'bollywood', 'bossa', 'brazilian',
       'breakbeat', 'breaks', 'broadway', 'cantautori', 'cantopop',
       'canzone', 'caribbean', 'caucasian', 'celtic', 'chamber',
       'children', 'chill', 'chinese', 'choral', 'christian', 'christmas',
       'classical', 'classicmetal', 'club', 'colombian', 'comedy',
       'conjazz', 'contemporary', 'country', 'cuban', 'dance',
       'dancehall', 'dancepop', 'dark', 'death', 'deep', 'deutschrock',
       'deutschspr', 'dirty', 'disco', 'dnb', 'documentary', 'downbeat',
       'downtempo', 'drum', 'dub', 'dubstep', 'eastern', 'easy',
       'electronic', 'electropop', 'emo', 'entehno', 'epicmetal',
       'estrada', 'ethnic', 'eurofo

In [41]:
# function for replacing implicit duplicates
# this function iterates over each wrong_genre in the specified wrong_genres list 

def replace_wrong_genres(wrong_genres, correct_genre):
    for wrong_genre in wrong_genres:
        df['genre'] = df['genre'].replace(wrong_genre, correct_genre)

In [42]:
# removing implicit duplicates

wrong_genres = ['hip', 'hop', 'hip-hop']
correct_genre = 'hiphop'

replace_wrong_genres(wrong_genres, correct_genre)
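
The loop in `replace_wrong_genres()` is not strictly required, because `Series.replace()` also accepts a list of values to replace. This is a possible alternative, sketched on a made-up Series (`genres`) rather than on the project's `df`:

```python
import pandas as pd

# Made-up genre values containing the three misspellings handled above.
genres = pd.Series(["hip", "hop", "hip-hop", "hiphop", "pop"])

# Whole-value replacement: 'pop' and 'hiphop' stay untouched.
normalized = genres.replace(["hip", "hop", "hip-hop"], "hiphop")
print(normalized.unique())
```

Because the match is on whole values, not substrings, `'pop'` is left unchanged even though it contains `'op'`.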

In [43]:
# checking for implicit duplicates

df['genre'].sort_values().unique()


array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'ambient', 'americana', 'animated', 'anime',
       'arabesk', 'arabic', 'arena', 'argentinetango', 'art', 'audiobook',
       'avantgarde', 'axé', 'baile', 'balkan', 'beats', 'bigroom',
       'black', 'bluegrass', 'blues', 'bollywood', 'bossa', 'brazilian',
       'breakbeat', 'breaks', 'broadway', 'cantautori', 'cantopop',
       'canzone', 'caribbean', 'caucasian', 'celtic', 'chamber',
       'children', 'chill', 'chinese', 'choral', 'christian', 'christmas',
       'classical', 'classicmetal', 'club', 'colombian', 'comedy',
       'conjazz', 'contemporary', 'country', 'cuban', 'dance',
       'dancehall', 'dancepop', 'dark', 'death', 'deep', 'deutschrock',
       'deutschspr', 'dirty', 'disco', 'dnb', 'documentary', 'downbeat',
       'downtempo', 'drum', 'dub', 'dubstep', 'eastern', 'easy',
       'electronic', 'electropop', 'emo', 'entehno', 'epicmetal',
       'estrada', 'ethnic', 'eurofo

[Back to Contents](#back)

### Conclusions <a id='data_preprocessing_conclusions'></a>
We detected three issues with the data:

- Incorrect header styles
- Missing values
- Obvious and implicit duplicates

The headers have been cleaned up to make processing the table simpler.

All missing values have been replaced with `'unknown'`. But we still have to see whether the missing values in `'genre'` will affect our calculations.

The absence of duplicates will make the results more precise and easier to understand.

Now we can move on to testing hypotheses. 

[Back to Contents](#back)

## Stage 3. Testing hypotheses <a id='hypotheses'></a>

### Hypothesis 1: comparing user behavior in two cities <a id='activity'></a>

According to the first hypothesis, users from Springfield and Shelbyville listen to music differently. We will test this using the data on three days of the week: Monday, Wednesday, and Friday.

* Divide the users into groups by city.
* Compare how many tracks each group played on Monday, Wednesday, and Friday.


In [44]:
# Counting up the tracks played in each city

df.groupby("city")["track_name"].count()

city
Shelbyville    18512
Springfield    42741
Name: track_name, dtype: int64

Springfield has more tracks played than Shelbyville. But that does not imply that citizens of Springfield listen to music more often. This city is simply bigger, and there are more users.
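
To separate city size from listening intensity, the play counts could be related to the number of distinct users. The sketch below uses made-up rows (`toy`) and only illustrates the idea. The project data was not analyzed this way.

```python
import pandas as pd

# Made-up plays: three Springfield users, one Shelbyville user.
toy = pd.DataFrame({
    "city": ["Springfield"] * 4 + ["Shelbyville"] * 2,
    "user_id": ["u1", "u1", "u2", "u3", "u4", "u4"],
})

# Total plays and distinct users per city, then plays per user.
per_city = toy.groupby("city")["user_id"].agg(plays="count", users="nunique")
per_city["plays_per_user"] = per_city["plays"] / per_city["users"]
print(per_city)
```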


In [45]:
# Calculating tracks played on each of the three days

df.groupby("day")["track_name"].count()

day
Friday       21840
Monday       21354
Wednesday    18059
Name: track_name, dtype: int64

Wednesday is the quietest day overall. But if we consider the two cities separately, we might come to a different conclusion.

Grouping by city or by day works one column at a time. To group by both, we define the `number_tracks()` function, which calculates the number of songs played for a given day and city. It takes two parameters:
* day of the week
* name of the city

Inside the function, a variable stores the rows of the original table where:
  * the `'day'` column value is equal to the `day` parameter
  * the `'city'` column value is equal to the `city` parameter

Both conditions are combined with logical indexing.

The function then counts the `'user_id'` values in the filtered table, stores the result in a new variable, and returns it.

In [46]:
# Counting songs played by grouped city and days:

def number_tracks(day, city):
    track_list = df[(df["day"] == day) & (df["city"] == city)]
    track_list_count = track_list['user_id'].count()
    return track_list_count

In [47]:
# the number of songs played in Springfield on Monday

number_tracks("Monday", "Springfield")

15740

In [48]:
# the number of songs played in Shelbyville on Monday

number_tracks("Monday", "Shelbyville")

5614

In [49]:
# the number of songs played in Springfield on Wednesday

number_tracks("Wednesday", "Springfield")

11056

In [50]:
# the number of songs played in Shelbyville on Wednesday

number_tracks("Wednesday", "Shelbyville")

7003

In [51]:
# the number of songs played in Springfield on Friday

number_tracks("Friday", "Springfield")

15945

In [52]:
# the number of songs played in Shelbyville on Friday

number_tracks("Friday", "Shelbyville")

5895

In [53]:
# Creating a new table with results

group_data = [
    ["Springfield", 15740, 11056, 15945],
    ["Shelbyville", 5614, 7003, 5895],
]

group_columns = ['city', 'monday', 'wednesday', 'friday']

df_grouped = pd.DataFrame(data=group_data, columns=group_columns)
df_grouped

          city  monday  wednesday  friday
0  Springfield   15740      11056   15945
1  Shelbyville    5614       7003    5895
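
A city-by-day table like this one can also be computed directly from the rows instead of being typed in by hand. One option is `pd.crosstab()`. The sketch below runs on a made-up five-row table (`toy`), so its numbers are not the project's results:

```python
import pandas as pd

# Made-up plays, one row per track played.
toy = pd.DataFrame({
    "city": ["Springfield", "Springfield", "Shelbyville", "Springfield", "Shelbyville"],
    "day": ["Monday", "Friday", "Monday", "Monday", "Wednesday"],
})

# Count rows for every (city, day) pair; missing pairs become 0.
counts = pd.crosstab(toy["city"], toy["day"])

# Put the days in calendar order instead of alphabetical order.
counts = counts.reindex(columns=["Monday", "Wednesday", "Friday"], fill_value=0)
print(counts)
```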


**Conclusions**

The data reveals differences in user behavior:

- In Springfield, the number of songs played peaks on Monday and Friday and dips on Wednesday.
- In Shelbyville, on the contrary, users listen to more music on Wednesday, while activity on Monday and Friday is lower.

Our first hypothesis seems to be correct.

[Back to Contents](#back)

### Hypothesis 2: music at the beginning and end of the week <a id='week'></a>

According to the second hypothesis, on Monday morning and Friday night, residents of Springfield listen to genres that differ from the ones users from Shelbyville enjoy.

In [54]:
# creating the spr_general table from the df rows, 
# where the value in the 'city' column is 'Springfield'

spr_general = df[df["city"] == "Springfield"]


In [55]:
# creating the shel_general from the df rows,
# where the value in the 'city' column is 'Shelbyville'

shel_general = df[df["city"] == "Shelbyville"]


In [56]:
# Function for tabulating the genres listened to on a given day within a time window:

def genre_weekday(df, day, time1, time2):
    # times are compared as strings, which relies on zero-padded 'HH:MM:SS' values
    genre_df = df[(df['day'] == day) & (df['time_of_day'] > time1) & (df['time_of_day'] < time2)]
    genre_df_count = genre_df.groupby('genre')['user_id'].count()
    genre_df_sorted = genre_df_count.sort_values(ascending=False)
    return genre_df_sorted[:15]

In [57]:
# Calling the function for Monday morning in Springfield 

genre_weekday(spr_general, "Monday", "07:00", "11:00")

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
classical      157
metal          120
jazz           100
folk            97
soundtrack      95
Name: user_id, dtype: int64

In [58]:
# Calling the function for Monday morning in Shelbyville 

genre_weekday(shel_general, "Monday", "07:00", "11:00")

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
world           36
rap             32
soundtrack      31
rnb             27
metal           27
Name: user_id, dtype: int64

In [59]:
# Calling the function for Friday evening in Springfield

genre_weekday(spr_general, "Friday", "17:00", "23:00")

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
classical      163
alternative    163
rusrap         142
jazz           111
unknown        110
soundtrack     105
rnb             90
metal           88
Name: user_id, dtype: int64

In [60]:
# Calling the function for Friday evening in Shelbyville

genre_weekday(shel_general, "Friday", "17:00", "23:00")

genre
pop            256
rock           216
electronic     216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
unknown         47
ruspop          47
soundtrack      40
metal           39
rap             36
Name: user_id, dtype: int64

**Conclusion**

Having compared the top 15 genres on Monday morning, we can draw the following conclusions:

1. Users from Springfield and Shelbyville listen to similar music. The top five genres are the same, only rock and electronic have switched places.

2. In Springfield, the number of missing values turned out to be so big that the value `'unknown'` came in 10th. This means that missing values make up a considerable portion of the data, which may be a basis for questioning the reliability of our conclusions.

For Friday evening, the situation is similar. Individual genres vary somewhat, but on the whole, the top 15 is similar for the two cities.

Thus, the second hypothesis is at best partially confirmed:
* Users listen to similar music at the beginning and end of the week.
* There is no major difference between Springfield and Shelbyville. In both cities, pop is the most popular genre.

However, the number of missing values makes this result questionable. In Springfield, there are so many that they affect our top 15. Were we not missing these values, things might look different.
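
To judge how much the `'unknown'` marker could distort the rankings, its share of plays could be computed per city. This is a minimal sketch on made-up rows (`toy`). The real shares were not calculated in this project.

```python
import pandas as pd

# Made-up plays; 'unknown' stands in for a missing genre.
toy = pd.DataFrame({
    "city": ["Springfield"] * 4 + ["Shelbyville"] * 2,
    "genre": ["pop", "unknown", "rock", "unknown", "pop", "dance"],
})

# Boolean flag per row, averaged within each city, gives the share of unknowns.
unknown_share = (toy["genre"] == "unknown").groupby(toy["city"]).mean()
print(unknown_share)
```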

[Back to Contents](#back)

### Hypothesis 3: genre preferences in Springfield and Shelbyville <a id='genre'></a>

Hypothesis: Shelbyville loves rap music. Springfield's citizens are more into pop.

In [61]:
# Grouping the spr_general table by the 'genre' column,
# counting the rows in each group with count() (via the 'user_id' column),
# sorting the resulting Series in descending order, and storing it to spr_genres

spr_genres = spr_general.groupby('genre')['user_id'].count().sort_values(ascending=False)

In [62]:
# printing the first 10 rows of spr_genres

spr_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: user_id, dtype: int64

In [63]:
# Grouping the shel_general table by the 'genre' column,
# counting the rows in each group with count() (via the 'user_id' column),
# sorting the resulting Series in descending order, and storing it to shel_genres

shel_genres = shel_general.groupby('genre')['user_id'].count().sort_values(ascending=False)

In [64]:
# Printing the first 10 rows from shel_genres

shel_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: user_id, dtype: int64

**Conclusion**

The hypothesis has been partially proven true:
* Pop music is the most popular genre in Springfield, as expected.
* However, pop is also the most popular genre in Shelbyville, and rap is not in the top 5 for either city.
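
Because Springfield has far more plays overall, raw counts are hard to compare between the cities. Shares are easier to compare. The sketch below reuses the pop counts and the per-city totals printed earlier in this notebook. The dictionary names are made up for the illustration.

```python
# Counts copied from the outputs above (pop plays and total plays per city).
pop_plays = {"Springfield": 5892, "Shelbyville": 2431}
total_plays = {"Springfield": 42741, "Shelbyville": 18512}

# Share of all plays in each city that belong to the pop genre.
pop_share = {city: pop_plays[city] / total_plays[city] for city in pop_plays}

for city, share in pop_share.items():
    print(f"{city}: {share:.1%}")
```

Pop accounts for roughly 13.8% of plays in Springfield and 13.1% in Shelbyville, which is consistent with the cities having similar tastes.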


[Back to Contents](#back)

# Findings <a id='end'></a>

We have tested the following three hypotheses:

1. User activity differs depending on the day of the week and from city to city. 
2. On Monday mornings, Springfield and Shelbyville residents listen to different genres. This is also true for Friday evenings. 
3. Springfield and Shelbyville listeners have different preferences. In Springfield, they prefer pop, while Shelbyville has more rap fans.

After analyzing the data, we concluded:

1. User activity in Springfield and Shelbyville depends on the day of the week, though the cities vary in different ways. 

The first hypothesis is fully accepted.

2. Musical preferences do not vary significantly over the course of the week in either Springfield or Shelbyville. We can see small differences in order on Mondays, but:
* In Springfield and Shelbyville, people listen to pop music most.

So we can't accept this hypothesis. We must also keep in mind that the result could have been different if not for the missing values.

3. It turns out that the musical preferences of users from Springfield and Shelbyville are quite similar.

The third hypothesis is rejected. If there is any difference in preferences, it cannot be seen from this data.


[Back to Contents](#back)