**Yandex Music**

Objective:
Test three hypotheses:
1. User activity differs by day of the week and by city.
2. On Monday mornings, residents of Springfield and Shelbyville listen to different genres. The same is true on Friday nights.
3. Listeners in Springfield and Shelbyville have different preferences. People in Springfield prefer pop music, while people in Shelbyville are more likely to listen to rap music.

**1. Data Description**

In [1]:
import pandas as pd

In [2]:
# load the data and get general information about it
df = pd.read_csv('music_project_en.csv')
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63736 non-null  object
 2   artist    57512 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday


We can see three problems with the style of the column names:
1. Some names are uppercase, others are lowercase.
2. Some names contain leading or trailing spaces.
3. The style is inconsistent (for example, `userID` is not in snake_case), so one style needs to be applied to all column names.

Two further observations about the data:
* The number of non-null values differs between columns. This means that the data contains missing values.
* The `time` column is stored as `object` (text) rather than as a time type.

**2. Data Processing**

In [3]:
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

In [4]:
# rename columns to a consistent snake_case style
df.columns = ['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day']
df.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
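
The hard-coded list works for this file. As a rough sketch of a more general approach, the names could also be normalized programmatically. The helper `to_snake_case` is a hypothetical name introduced here (it is not part of pandas), and it assumes the only problems are stray spaces and camelCase.

```python
import re
import pandas as pd

def to_snake_case(name: str) -> str:
    """Hypothetical helper: strip spaces, split camelCase, lowercase."""
    name = name.strip()
    # insert '_' where a lowercase letter or digit is followed by an uppercase letter
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    return name.lower()

# empty toy frame with the same raw column names as the dataset
raw = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'])
clean = raw.rename(columns=to_snake_case)
print(list(clean.columns))
```

Passing a function to `rename(columns=...)` applies it to every column name, so the same helper would keep working if more columns were added later.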

In [5]:
#calculating missing values
df.isna().sum()

user_id       0
track      1343
artist     7567
genre      1198
city          0
time          0
day           0
dtype: int64

In [6]:
# loop through the column names and replace missing values with 'unknown'
columns_to_replace = ['track', 'artist', 'genre']
for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')

df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64
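
The same fill can be written without a loop. The sketch below uses a small toy frame (not the real dataset) and passes a dictionary to `fillna`, which maps each column name to its replacement value.

```python
import pandas as pd

# toy data with gaps in the same three columns as the dataset
toy = pd.DataFrame({
    'user_id': ['A1', 'B2', 'C3'],
    'track': ['Song', None, 'Tune'],
    'artist': [None, None, 'Band'],
    'genre': ['rock', 'pop', None],
})

# one call: each key is a column, each value is its fill value
filled = toy.fillna({'track': 'unknown', 'artist': 'unknown', 'genre': 'unknown'})
missing_after = filled.isna().sum()
print(missing_after)
```

Columns that are not listed in the dictionary, such as `user_id` here, are left untouched.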

In [7]:
df.duplicated().sum()

np.int64(3826)

In [8]:
# remove obvious (exact) duplicates
df = df.drop_duplicates().reset_index(drop=True)
df.duplicated().sum()

np.int64(0)

In [9]:
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'ambient', 'americana', 'animated', 'anime',
       'arabesk', 'arabic', 'arena', 'argentinetango', 'art', 'audiobook',
       'avantgarde', 'axé', 'baile', 'balkan', 'beats', 'bigroom',
       'black', 'bluegrass', 'blues', 'bollywood', 'bossa', 'brazilian',
       'breakbeat', 'breaks', 'broadway', 'cantautori', 'cantopop',
       'canzone', 'caribbean', 'caucasian', 'celtic', 'chamber',
       'children', 'chill', 'chinese', 'choral', 'christian', 'christmas',
       'classical', 'classicmetal', 'club', 'colombian', 'comedy',
       'conjazz', 'contemporary', 'country', 'cuban', 'dance',
       'dancehall', 'dancepop', 'dark', 'death', 'deep', 'deutschrock',
       'deutschspr', 'dirty', 'disco', 'dnb', 'documentary', 'downbeat',
       'downtempo', 'drum', 'dub', 'dubstep', 'eastern', 'easy',
       'electronic', 'electropop', 'emo', 'entehno', 'epicmetal',
       'estrada', 'ethnic', 'eurofo

In [10]:
# function to replace implicit duplicates
def replace_wrong_genres(df, wrong_genres, correct_genre):
    # assign the result back instead of calling replace(..., inplace=True) on a
    # column selection, which raises a FutureWarning and stops working in pandas 3.0
    df['genre'] = df['genre'].replace(wrong_genres, correct_genre)

wrong_genres = ['hip', 'hop', 'hip-hop']
correct_genre = 'hiphop'
replace_wrong_genres(df, wrong_genres, correct_genre)

df['genre'].sort_values().unique()


array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'ambient', 'americana', 'animated', 'anime',
       'arabesk', 'arabic', 'arena', 'argentinetango', 'art', 'audiobook',
       'avantgarde', 'axé', 'baile', 'balkan', 'beats', 'bigroom',
       'black', 'bluegrass', 'blues', 'bollywood', 'bossa', 'brazilian',
       'breakbeat', 'breaks', 'broadway', 'cantautori', 'cantopop',
       'canzone', 'caribbean', 'caucasian', 'celtic', 'chamber',
       'children', 'chill', 'chinese', 'choral', 'christian', 'christmas',
       'classical', 'classicmetal', 'club', 'colombian', 'comedy',
       'conjazz', 'contemporary', 'country', 'cuban', 'dance',
       'dancehall', 'dancepop', 'dark', 'death', 'deep', 'deutschrock',
       'deutschspr', 'dirty', 'disco', 'dnb', 'documentary', 'downbeat',
       'downtempo', 'drum', 'dub', 'dubstep', 'eastern', 'easy',
       'electronic', 'electropop', 'emo', 'entehno', 'epicmetal',
       'estrada', 'ethnic', 'eurofo
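
Because the printed array is cut off, it does not show whether the replacement took effect. One way to check is sketched below on toy data rather than the real dataset: count how many rows still hold one of the wrong spellings after the replacement.

```python
import pandas as pd

# toy column containing all three wrong spellings plus the correct one
toy = pd.DataFrame({'genre': ['hip', 'hop', 'hip-hop', 'hiphop', 'rock']})
wrong_genres = ['hip', 'hop', 'hip-hop']

toy['genre'] = toy['genre'].replace(wrong_genres, 'hiphop')

# number of rows that still contain a wrong spelling (0 means success)
leftover = toy['genre'].isin(wrong_genres).sum()
counts = toy['genre'].value_counts()
print(leftover)
print(counts)
```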

**3. Hypothesis Testing**

Hypothesis 1: Compare user behavior in the two cities

In [11]:
# count the tracks played in each city
tracks_by_city_total = df.groupby('city')['track'].count().reset_index(name='total')
tracks_by_city_total.head()

Unnamed: 0,city,total
0,Shelbyville,18512
1,Springfield,42741


In [12]:
# calculating the tracks played on each of the three days
df.groupby('day')['track'].count()

day
Friday       21840
Monday       21354
Wednesday    18059
Name: track, dtype: int64

In [13]:
# The function measures activity on a given day in a given city.
# It first selects the rows for the desired day and city with one combined filter,
# then counts the unique 'user_id' values in the filtered table
# (distinct listeners, not individual track plays),
# and returns that number.
def number_tracks(day, city):
    track_list = df[(df['day'] == day) & (df['city'] == city)]
    num_tracks = track_list['user_id'].nunique()
    return num_tracks
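
As a sketch of an alternative to calling a function once per day and city, the same kind of count (distinct `user_id` values) can be computed for every combination in a single `groupby`. The toy frame and the name `users_by_city_day` are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# toy listening log: one row per track play
toy = pd.DataFrame({
    'user_id': ['A', 'A', 'B', 'C', 'C', 'D'],
    'city': ['Springfield', 'Springfield', 'Springfield',
             'Shelbyville', 'Shelbyville', 'Shelbyville'],
    'day': ['Monday', 'Monday', 'Friday', 'Monday', 'Wednesday', 'Wednesday'],
})

# distinct users per (city, day), reshaped so that days become columns
users_by_city_day = (
    toy.groupby(['city', 'day'])['user_id']
       .nunique()
       .unstack(fill_value=0)
)
print(users_by_city_day)
```

Combinations that never occur in the data, such as Springfield on Wednesday in this toy example, are filled with 0.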

In [15]:
data = {'city': ['Springfield', 'Shelbyville'],
        'monday': [11904, 4248],
        'wednesday': [8388, 5228],
        'friday': [12145, 4495]}

new_table = pd.DataFrame(data)
new_table

Unnamed: 0,city,monday,wednesday,friday
0,Springfield,11904,8388,12145
1,Shelbyville,4248,5228,4495
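
Springfield has far more activity overall, so raw counts make the two cities hard to compare directly. A minimal sketch, reusing the numbers from the table above, converts each row into shares of that city's own total and reports the busiest day per city. The names `shares` and `busiest_day` are introduced here for illustration.

```python
import pandas as pd

# same numbers as in new_table, with the city as the index
table = pd.DataFrame({
    'city': ['Springfield', 'Shelbyville'],
    'monday': [11904, 4248],
    'wednesday': [8388, 5228],
    'friday': [12145, 4495],
}).set_index('city')

# divide every row by its own total so each row sums to 1
shares = table.div(table.sum(axis=1), axis=0)

# day with the largest share in each city
busiest_day = shares.idxmax(axis=1)
print(shares.round(3))
print(busiest_day)
```

With these numbers the busiest day is Friday in Springfield and Wednesday in Shelbyville, which is the opposite weekly pattern.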


Hypothesis 2: Music at the beginning and end of the week

In [16]:
spr_general= df.query("city=='Springfield'")
shel_general= df.query("city=='Shelbyville'")

In [18]:
# genre_weekday() takes a DataFrame and the parameters day, time1 and time2, and
# returns information about the most popular genres on a given day at a given time
def genre_weekday(df, day, time1, time2):
    # Filter rows by day and time
    genre_df = df[(df['day'] == day) & (df['time'] >= time1) & (df['time'] < time2)]

    # Group by genre and count the number of occurrences
    genre_df_count = genre_df.groupby('genre').size().reset_index(name='count')

    # Sort genres by number of occurrences and select the 15 most popular
    genre_df_sorted = genre_df_count.sort_values(by='count', ascending=False).head(15)

    # Return table with the 15 most popular genres
    return genre_df_sorted
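
The time filter in `genre_weekday()` compares strings, not time values. That works here because every value is zero-padded `HH:MM:SS`, so alphabetical order matches chronological order. The sketch below, on toy data, shows the string filter next to an equivalent filter on parsed times; the variable names are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'genre': ['pop', 'rock', 'jazz', 'dance'],
    'time': ['08:34:34', '11:59:59', '12:00:00', '20:28:33'],
})

# string comparison: valid only because all values are zero-padded HH:MM:SS
by_string = toy[(toy['time'] >= '00:00:00') & (toy['time'] < '12:00:00')]

# the same morning window expressed on parsed times
parsed = pd.to_datetime(toy['time'], format='%H:%M:%S')
by_hour = toy[parsed.dt.hour < 12]

print(by_string)
print(by_string.equals(by_hour))
```

If the column ever contained values such as `8:34:34` without the leading zero, the string comparison would sort them incorrectly, while the parsed version would still behave as intended.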

In [22]:
genre_weekday(spr_general, 'Monday', '00:00:00', '12:00:00')


Unnamed: 0,genre,count
96,pop,781
30,dance,549
42,electronic,480
112,rock,474
63,hiphop,286
114,ruspop,186
146,world,181
115,rusrap,175
2,alternative,164
139,unknown,161


In [23]:
genre_weekday(shel_general, 'Monday', '00:00:00', '12:00:00')

Unnamed: 0,genre,count
68,pop,218
18,dance,182
81,rock,162
28,electronic,147
45,hiphop,80
83,ruspop,64
1,alternative,58
84,rusrap,55
51,jazz,44
13,classical,40


In [24]:
genre_weekday(shel_general, 'Friday', '12:00:00', '18:00:00')

Unnamed: 0,genre,count
68,pop,281
17,dance,237
81,rock,221
25,electronic,173
41,hiphop,123
13,classical,106
0,alternative,83
82,ruspop,76
83,rusrap,59
104,world,57


**Conclusion**

Having compared the most popular genres on Monday morning, we can conclude the following:

1. Users in Springfield and Shelbyville listen to similar music. The five most popular genres are the same; only rock and electronic swap positions.

2. In Springfield, the number of missing values turned out to be so high that the placeholder `'unknown'` ranked tenth.
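
The comparison can also be made mechanically. The sketch below copies the ten Monday-morning genres shown in the two outputs above into plain lists (in ranked order) and uses sets to find what the cities share and where they differ. The list and variable names are introduced here for illustration.

```python
# genres in ranked order, copied from the Monday-morning outputs above
springfield_top = ['pop', 'dance', 'electronic', 'rock', 'hiphop',
                   'ruspop', 'world', 'rusrap', 'alternative', 'unknown']
shelbyville_top = ['pop', 'dance', 'rock', 'electronic', 'hiphop',
                   'ruspop', 'alternative', 'rusrap', 'jazz', 'classical']

# do the two top-5 lists contain the same genres, ignoring order?
same_top5 = set(springfield_top[:5]) == set(shelbyville_top[:5])

# genres that appear in only one city's list
only_springfield = sorted(set(springfield_top) - set(shelbyville_top))
only_shelbyville = sorted(set(shelbyville_top) - set(springfield_top))

print(same_top5)
print(only_springfield)
print(only_shelbyville)
```

The differences sit at the bottom of the lists: `world` and `unknown` appear only for Springfield, `jazz` and `classical` only for Shelbyville.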

Hypothesis 3: Genre preferences in Springfield and Shelbyville

In [26]:
# Group DataFrame by genre and count the number of rows for each genre
spr_genres = spr_general.groupby('genre').size().reset_index(name='count')

# Sort table by number of occurrences in descending order
spr_genres = spr_genres.sort_values(by='count', ascending=False)

# Display result
spr_genres.head()

Unnamed: 0,genre,count
170,pop,5892
53,dance,4435
194,rock,3965
72,electronic,3786
109,hiphop,2096


In [27]:
# Group DataFrame by genre and count the number of rows for each genre
shel_genres = shel_general.groupby('genre').size().reset_index(name='count')

# Sort table by number of occurrences in descending order and store it in shel_genres
shel_genres = shel_genres.sort_values(by='count', ascending=False)

# Display result
shel_genres.head()

Unnamed: 0,genre,count
138,pop,2431
41,dance,1932
159,rock,1879
56,electronic,1736
90,hiphop,960


The hypothesis is only partially confirmed:
* Pop music is the most popular genre in Springfield, as expected.
* However, pop music turned out to be just as popular in Shelbyville as in Springfield, and no genre labelled `rap` made the top 5 in either city (the closest, `hiphop`, ranks fifth in both).
**4. Conclusions** <a id='end'></a>

We tested the following three hypotheses:

1. User activity differs depending on the day of the week and in different cities.
2. On Monday mornings, residents of Springfield and Shelbyville listen to different genres. The same is true on Friday nights.
3. Listeners in Springfield and Shelbyville have different preferences. People in Springfield prefer pop music, while people in Shelbyville are more likely to listen to rap music.

After analyzing the data, we conclude:

1. User activity in Springfield and Shelbyville depends on the day of the week, but the pattern differs between the cities: Springfield is most active on Monday and Friday and dips on Wednesday, while Shelbyville peaks on Wednesday.

The first hypothesis is accepted.

2. Music preferences do not vary significantly over the course of the week in Springfield and Shelbyville. The order of the top genres differs slightly on Monday mornings, but pop is the most listened-to genre in both cities.

So we cannot accept this hypothesis. We must also note that the result could have been different if it were not for the missing values.

3. It turns out that the musical preferences of the users from Springfield and Shelbyville are quite similar.

The third hypothesis is rejected. If there is any difference in preferences, it cannot be observed in the data.