## Introduction:
In this project, we'll compare the music preferences of users in two cities, Springfield and Shelbyville. We'll use Yandex Music data to test three hypotheses about user behavior in these cities.

### Hypotheses: 
1. User activity differs depending on the day of the week and from city to city. 
2. On Monday mornings, Springfield and Shelbyville residents listen to different genres. This is also true for Friday evenings. 
3. Springfield and Shelbyville listeners have different preferences. In Springfield, they prefer pop, while Shelbyville has more rap fans.

### Stages:
This project will consist of three stages:
 1. Data overview - evaluate the quality of the data
 2. Data preprocessing - address the issues and account for them
 3. Testing the hypotheses - analyze the data and come to a conclusion about the hypotheses

## Data overview
**Objective:** Evaluate the quality of the data

In [1]:
import pandas as pd

# importing pandas

In [2]:
try:
    music_df = pd.read_csv('music_project_en.csv')
except FileNotFoundError:
    music_df = pd.read_csv('/datasets/music_project_en.csv')
    
# reading the file and storing it to music_df

In [3]:
music_df.head(10)

# obtaining the first 10 rows from the music_df table, inspecting it

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
5,842029A1,Chains,Obladaet,rusrap,Shelbyville,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Springfield,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Springfield,20:47:49,Wednesday
8,8FA1D3BE,L’estate,Julia Dalia,ruspop,Springfield,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Shelbyville,21:20:49,Wednesday


In [4]:
music_df.info()

# obtaining general information about the data in music_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63736 non-null  object
 2   artist    57512 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


The table contains seven columns.

According to the documentation:
- `'userID'` — user identifier
- `'Track'` — track title
- `'artist'` — artist's name
- `'genre'` — genre of the track
- `'City'` — user's city
- `'time'` — the exact time the track was played
- `'Day'` — day of the week

We can see there are some issues:
1. Some column names are capitalized and some are lowercase.
2. Some column names contain leading or trailing spaces.
3. The `'Track'`, `'artist'`, and `'genre'` columns have missing values.

Each row in the table stores data on a track that was played. Some columns describe the track itself: its title, artist and genre. The rest convey information about the user: the city they come from, the time they played the track. 

The data appears sufficient to test the hypotheses. However, some values are missing and the column names are inconsistent.

To move forward, we need to preprocess the data.

## Data preprocessing
**Objectives:**
1. Correct style errors in column titles
2. Deal with missing values
3. Check whether there are duplicates

### Column Titles

In [5]:
music_df.columns

# a list of column names in the music_df table

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

In [6]:
music_df.rename(columns={'  userID':'userid',
                   'Track':'track',
                   '  City  ':'city',
                   'Day':'day'},inplace=True)

# renaming columns

In [7]:
music_df.columns

# checking result: the list of column names

Index(['userid', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
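
The explicit `rename` above works well for a handful of columns. As a minimal sketch, assuming every header only needs its surrounding whitespace stripped and its letters lowercased, the same cleanup can be written without listing each column. The `raw` and `clean` names are illustrative and not part of the project.

```python
import pandas as pd

# Empty toy frame with the same header problems as the raw data.
raw = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'])

# Strip surrounding whitespace and lowercase every header in one pass.
clean = raw.rename(columns=lambda name: name.strip().lower())

print(list(clean.columns))
# ['userid', 'track', 'artist', 'genre', 'city', 'time', 'day']
```

This would not help with names that need real renaming (for example, turning `userid` into `user_id`), where an explicit mapping is still clearer.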

### Missing values

In [8]:
music_df.isna().sum()
# calculating number of missing values

userid       0
track     1343
artist    7567
genre     1198
city         0
time         0
day          0
dtype: int64

Not all missing values are created equal. For example, the missing values in `track` and `artist` are not critical, and can be filled with values that denote why this data is missing, or the fact that this data is missing.

Missing values in `'genre'`, however, may affect the comparison of music preferences in Springfield and Shelbyville. We will still fill them with `'unknown'` to mark that the data was missing.

We will:
* Fill in the missing values in `'track'`, `'artist'`, and `'genre'` with `'unknown'`
* Evaluate how much the missing values may affect our computations
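
One way to gauge the impact of missing values is to look at their share per column rather than the raw count. The sketch below uses a small made-up table (`toy` is hypothetical); on the real data the same call would be `music_df.isna().mean()`. From the counts above, `genre` is missing in 1198 of 65079 rows (under 2%), while `artist` is missing in 7567 rows (roughly 12%).

```python
import pandas as pd

# Toy data: one missing track out of four, two missing genres out of four.
toy = pd.DataFrame({
    'track': ['a', None, 'c', 'd'],
    'genre': ['pop', 'rock', None, None],
})

# isna() gives a boolean frame; the mean of booleans is the fraction of True values.
missing_share = toy.isna().mean()
print(missing_share)
```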

In [9]:
columns_to_replace = ['track', 'artist', 'genre']
for col in columns_to_replace:
    music_df[col] = music_df[col].fillna('unknown')
    
# looping over column names and replacing missing values with 'unknown'

In [10]:
music_df.isna().sum()

# counting missing values, ensuring none remain

userid    0
track     0
artist    0
genre     0
city      0
time      0
day       0
dtype: int64

### Duplicates

In [11]:
music_df.duplicated().sum()

# counting clear duplicates

3826

In [12]:
music_df.drop_duplicates(inplace=True)
music_df.reset_index(drop=True,inplace=True)

# removing duplicates, then resetting the index, without saving the old indexes

In [13]:
music_df.duplicated().sum()

# checking for duplicates, ensuring none remain

0
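
As a side note, the drop and the index reset can be combined in one call. This is a minimal sketch on made-up rows (`toy` and `deduped` are illustrative names), assuming a pandas version that supports the `ignore_index` parameter of `drop_duplicates` (1.0 or later).

```python
import pandas as pd

# Toy data: the first two rows are exact duplicates.
toy = pd.DataFrame({
    'userid': ['A1', 'A1', 'B2'],
    'track': ['x', 'x', 'y'],
})

n_duplicates = toy.duplicated().sum()             # counts rows that repeat an earlier row
deduped = toy.drop_duplicates(ignore_index=True)  # drops them and renumbers the index from 0

print(n_duplicates, list(deduped.index))
```

Unlike the `inplace=True` version, this returns a new dataframe and leaves the original untouched.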

### Other Troublesome Values

In [15]:
music_df['genre'].sort_values().unique()

# viewing unique genre names

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'ambient', 'americana', 'animated', 'anime',
       'arabesk', 'arabic', 'arena', 'argentinetango', 'art', 'audiobook',
       'avantgarde', 'axé', 'baile', 'balkan', 'beats', 'bigroom',
       'black', 'bluegrass', 'blues', 'bollywood', 'bossa', 'brazilian',
       'breakbeat', 'breaks', 'broadway', 'cantautori', 'cantopop',
       'canzone', 'caribbean', 'caucasian', 'celtic', 'chamber',
       'children', 'chill', 'chinese', 'choral', 'christian', 'christmas',
       'classical', 'classicmetal', 'club', 'colombian', 'comedy',
       'conjazz', 'contemporary', 'country', 'cuban', 'dance',
       'dancehall', 'dancepop', 'dark', 'death', 'deep', 'deutschrock',
       'deutschspr', 'dirty', 'disco', 'dnb', 'documentary', 'downbeat',
       'downtempo', 'drum', 'dub', 'dubstep', 'eastern', 'easy',
       'electronic', 'electropop', 'emo', 'entehno', 'epicmetal',
       'estrada', 'ethnic', 'eurofo

Looking through the list, we notice that some genres appear more than once under slightly different spellings:
* `hip`
* `hop`
* `hip-hop`

All of these should be the same as the genre `'hiphop'`. We will write a function to correct these values.

In [16]:
def replace_wrong_genres(wrong_genres, correct_genre):
    '''
    Replaces specified genres with another designated genre in the "music_df" dataframe.
    
    Parameters:
    wrong_genres (str or list of str): The genre or genres that you would like to replace.
    correct_genre (str): The genre that you would like to replace the wrong genre with.
    
    Output:
    Replaces the incorrect genre spellings specified in the wrong_genres parameter with the 
    correct_genre parameter, in the 'genre' column of the music_df dataframe.
    '''
    # restricting the replacement to the 'genre' column so that a track or artist
    # whose name happens to match (e.g. a track titled 'hop') is left untouched
    music_df['genre'] = music_df['genre'].replace(wrong_genres, correct_genre)

# function for replacing implicit duplicates

In [17]:
replace_wrong_genres(['hip','hop','hip-hop'],'hiphop')

# removing implicit duplicates

In [18]:
music_df['genre'].sort_values().unique()

# checking that changes persisted

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'ambient', 'americana', 'animated', 'anime',
       'arabesk', 'arabic', 'arena', 'argentinetango', 'art', 'audiobook',
       'avantgarde', 'axé', 'baile', 'balkan', 'beats', 'bigroom',
       'black', 'bluegrass', 'blues', 'bollywood', 'bossa', 'brazilian',
       'breakbeat', 'breaks', 'broadway', 'cantautori', 'cantopop',
       'canzone', 'caribbean', 'caucasian', 'celtic', 'chamber',
       'children', 'chill', 'chinese', 'choral', 'christian', 'christmas',
       'classical', 'classicmetal', 'club', 'colombian', 'comedy',
       'conjazz', 'contemporary', 'country', 'cuban', 'dance',
       'dancehall', 'dancepop', 'dark', 'death', 'deep', 'deutschrock',
       'deutschspr', 'dirty', 'disco', 'dnb', 'documentary', 'downbeat',
       'downtempo', 'drum', 'dub', 'dubstep', 'eastern', 'easy',
       'electronic', 'electropop', 'emo', 'entehno', 'epicmetal',
       'estrada', 'ethnic', 'eurofo
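
Because the printed array is truncated, it does not show the `hiphop` entries directly. The sketch below illustrates the replacement on made-up rows (`toy` is hypothetical) and shows why it is worth scoping it to the `genre` column: a track that happens to be titled `hop` should stay as it is.

```python
import pandas as pd

# Toy data: the first track title collides with one of the misspelled genres.
toy = pd.DataFrame({
    'track': ['hop', 'Song B', 'Song C'],
    'genre': ['hip', 'hip-hop', 'rock'],
})

# Replace only within the 'genre' column.
toy['genre'] = toy['genre'].replace(['hip', 'hop', 'hip-hop'], 'hiphop')

print(toy)
```

On the real data, a quick check such as `music_df['genre'].isin(['hip', 'hop', 'hip-hop']).sum()` should return 0 after the replacement.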

### So far...
We detected three issues with the data:

- Incorrect header styles
- Missing values
- Obvious and implicit duplicates

The headers have been cleaned up to make processing the table simpler.

All missing values have been replaced with `'unknown'`. But we still have to see whether the missing values in `'genre'` will affect our calculations.

The absence of duplicates will make the results more precise and easier to understand.

Now we can move on to testing hypotheses. 

## Testing hypotheses

### Hypothesis 1: comparing user behavior in two cities <a id='activity'></a>

According to the first hypothesis, users from Springfield and Shelbyville listen to music differently. We will test this using the data on three days of the week: Monday, Wednesday, and Friday.

* Divide the users into groups by city.
* Compare how many tracks each group played on Monday, Wednesday, and Friday.


In [19]:
springfield_df = music_df[music_df['city'] == 'Springfield'] #dataframe split for springfield 
shelbyville_df = music_df[music_df['city'] == 'Shelbyville'] #dataframe split for shelbyville

springfield_songs = springfield_df.shape[0] #counting tracks played in springfield, saved to springfield_songs variable
shelbyville_songs = shelbyville_df.shape[0] #counting tracks played in shelbyville, saved to shelbyville_songs variable

if springfield_songs > shelbyville_songs:
    print('Springfield has more songs played than Shelbyville')
elif springfield_songs < shelbyville_songs:
    print('Shelbyville has more songs played than Springfield')
else:
    print('Springfield and Shelbyville users have played the same number of songs, wow!')

#if-else statement displaying which city has the most songs played according to the music_df dataset

Springfield has more songs played than Shelbyville


Springfield has more tracks played than Shelbyville. But that does not imply that residents of Springfield listen to music more often. The city may simply be bigger, so the higher count may just reflect a larger number of users.
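
One way to account for city size is to divide the number of plays by the number of distinct users in each city. The sketch below is run on made-up rows, not on the project data; `toy` and `per_city` are illustrative names, and the approach assumes each `userid` belongs to a single listener.

```python
import pandas as pd

# Toy data: Springfield has more plays overall, but also more users.
toy = pd.DataFrame({
    'userid': ['u1', 'u1', 'u2', 'u3', 'u4', 'u4'],
    'city': ['Springfield', 'Springfield', 'Springfield', 'Springfield',
             'Shelbyville', 'Shelbyville'],
    'track': ['a', 'b', 'c', 'd', 'e', 'f'],
})

# Count plays and distinct users per city, then compute plays per user.
per_city = toy.groupby('city').agg(
    plays=('track', 'size'),
    users=('userid', 'nunique'),
)
per_city['plays_per_user'] = per_city['plays'] / per_city['users']

print(per_city)
```

In this toy example Springfield has more plays in total, yet Shelbyville has more plays per user, which is exactly the situation the raw counts cannot distinguish.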

In [20]:
mon_df = music_df[music_df['day'] == 'Monday'] #separate dataframe for each day
wed_df = music_df[music_df['day'] == 'Wednesday'] #separate dataframe for each day
fri_df = music_df[music_df['day'] == 'Friday'] #separate dataframe for each day

mon_songs = mon_df.shape[0] #finding the number of songs played for each day
wed_songs = wed_df.shape[0] #finding the number of songs played for each day
fri_songs = fri_df.shape[0] #finding the number of songs played for each day

print('Songs played on Monday:', mon_songs) #printing the number of songs played for each day.
print('Songs played on Wednesday:', wed_songs) #printing the number of songs played for each day.
print('Songs played on Friday:', fri_songs) #printing the number of songs played for each day.

Songs played on Monday: 21354
Songs played on Wednesday: 18059
Songs played on Friday: 21840


Wednesday is the quietest day overall. But if we consider the two cities separately, we might come to a different conclusion.

Now we will create a function that counts the tracks played on a given day in a given city.

In [21]:
def number_tracks(day, city):
    '''
    Returns number of tracks played in the day and city specified, from the "music_df" dataframe.
    
    Parameters:
    day (str): The day you want to look into specifically. Accepts "Monday", "Wednesday", or "Friday"
    city (str): The city you want to look into specifically. Accepts "Springfield" or "Shelbyville"
    
    Returns:
    (int): Number of tracks played in specified city on the specified day.
    '''
    track_list = music_df[music_df['day'] == day]
    track_list = track_list[track_list['city'] == city]
    track_list_count = track_list['userid'].count()
    return track_list_count
test = number_tracks('Friday', 'Springfield')
test

# checking to see that the function works as intended

15945

In [22]:
spring_mon = number_tracks('Monday', 'Springfield')
spring_mon 

# the number of songs played in Springfield on Monday

15740

In [23]:
shelby_mon = number_tracks('Monday', 'Shelbyville')
shelby_mon 

# the number of songs played in Shelbyville on Monday

5614

In [24]:
spring_wed = number_tracks('Wednesday', 'Springfield')
spring_wed 

# the number of songs played in Springfield on Wednesday

11056

In [25]:
shelby_wed = number_tracks('Wednesday', 'Shelbyville')
shelby_wed 

# the number of songs played in Shelbyville on Wednesday

7003

In [26]:
spring_fri = number_tracks('Friday', 'Springfield')
spring_fri 

# the number of songs played in Springfield on Friday

15945

In [27]:
shelby_fri = number_tracks('Friday', 'Shelbyville')
shelby_fri 

# the number of songs played in Shelbyville on Friday

5895

We will now make a dataframe with the information we have above.

In [28]:
column_names = ['city', 'monday', 'wednesday', 'friday']
data_dict = {'city':['Springfield','Shelbyville'], 
             'monday':[spring_mon, shelby_mon], 
             'wednesday':[spring_wed, shelby_wed],
             'friday':[spring_fri, shelby_fri]}

final_df = pd.DataFrame(data=data_dict, columns=column_names)
final_df

# table with results

Unnamed: 0,city,monday,wednesday,friday
0,Springfield,15740,11056,15945
1,Shelbyville,5614,7003,5895
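
The same city-by-day table can also be built in a single step. This is a minimal sketch using `pd.crosstab` on made-up rows (`toy`, `day_order`, and `counts` are illustrative names); on the project data the inputs would be `music_df['city']` and `music_df['day']`.

```python
import pandas as pd

# Toy data: a few plays spread over two cities and three days.
toy = pd.DataFrame({
    'city': ['Springfield', 'Springfield', 'Shelbyville', 'Springfield', 'Shelbyville'],
    'day': ['Monday', 'Friday', 'Monday', 'Monday', 'Wednesday'],
})

# crosstab counts rows for every city/day pair; reindex puts the days in
# calendar order instead of alphabetical order.
day_order = ['Monday', 'Wednesday', 'Friday']
counts = pd.crosstab(toy['city'], toy['day']).reindex(columns=day_order, fill_value=0)

print(counts)
```

The six separate calls to `number_tracks` are easier to follow step by step; the cross-tabulation is mainly useful once there are more cities or days to compare.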


**Conclusions**

The data reveals differences in user behavior:

- In Springfield, the number of songs played peaks on Monday and Friday, with a drop in activity on Wednesday.
- In Shelbyville, by contrast, users listen to more music on Wednesday, and activity is lower on Monday and Friday.

So the first hypothesis, "users from Springfield and Shelbyville listen to music differently" seems to be correct.
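
Since the two cities differ a lot in total volume, the pattern is easier to compare as each day's share of a city's plays. The sketch below re-enters the counts from the results table by hand; `counts` and `shares` are illustrative names.

```python
import pandas as pd

# Counts copied from the results table above.
counts = pd.DataFrame(
    {'monday': [15740, 5614], 'wednesday': [11056, 7003], 'friday': [15945, 5895]},
    index=['Springfield', 'Shelbyville'],
)

# Divide every row by its own total so that each row sums to 1.
shares = counts.div(counts.sum(axis=1), axis=0)

print(shares.round(3))
```

Wednesday accounts for roughly 26% of the plays in Springfield and roughly 38% in Shelbyville, which matches the conclusion drawn from the raw counts.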

### Hypothesis 2: music at the beginning and end of the week <a id='week'></a>

According to the second hypothesis, on Monday morning and Friday evening, Springfield residents listen to different genres than Shelbyville residents do.

In [29]:
spr_general = springfield_df # create the spr_general table from the music_df rows, 
spr_general.head() # where the value in the 'city' column is 'Springfield'

Unnamed: 0,userid,track,artist,genre,city,time,day
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
6,4CB90AA5,True,Roman Messer,dance,Springfield,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Springfield,20:47:49,Wednesday
8,8FA1D3BE,L’estate,Julia Dalia,ruspop,Springfield,09:17:40,Friday


In [30]:
shel_general = shelbyville_df # create the shel_general table from the music_df rows,
shel_general.head() # where the value in the 'city' column is 'Shelbyville'

Unnamed: 0,userid,track,artist,genre,city,time,day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
5,842029A1,Chains,Obladaet,rusrap,Shelbyville,13:09:41,Friday
9,E772D5C0,Pessimist,unknown,dance,Shelbyville,21:20:49,Wednesday


In [31]:
def genre_weekday(my_df, day, time1, time2):
    '''
    Returns series object storing the 15 most popular genres on a specified day within a specified timeframe.
    
    Parameters:
    my_df (Pandas Dataframe Object): The dataframe you want to analyze
    day (str): The specified day. Accepts "Monday", "Wednesday" or "Friday"
    time1 (str): The lower bound of the timeframe. In "HH:MM:SS" form, uses military time (no am/pm designation)
    time2 (str): The upper bound of the timeframe. In "HH:MM:SS" form, uses military time (no am/pm designation)
    
    Returns:
    Series: A list of 15 most played genres on a specific day within a specific timeframe
    '''
    
    genre_df = my_df[my_df['day'] == day]
    genre_df = genre_df[genre_df['time'] < time2]
    genre_df = genre_df[genre_df['time'] > time1]
    genre_df_count = genre_df.groupby('genre')['genre'].count()
    genre_df_sorted = genre_df_count.sort_values(ascending=False)
    return genre_df_sorted[:15]

# function that gives me the 15 most popular genres within a given day and timeframe.
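
The time filter in `genre_weekday` compares strings, not time objects. That works because the `time` values are zero-padded `HH:MM:SS` strings, so alphabetical order matches chronological order. The sketch below, on made-up times, shows how the boundaries behave when the bounds are passed as `'HH:MM'`; `toy` and `in_window` are illustrative names.

```python
import pandas as pd

# Toy data: times just outside, on, and inside the 07:00-11:00 window.
toy = pd.DataFrame({
    'time': ['06:59:59', '07:00:00', '08:34:34', '10:59:59', '11:00:00'],
})

# Plain string comparison, as in genre_weekday.
in_window = toy[(toy['time'] > '07:00') & (toy['time'] < '11:00')]

print(in_window['time'].tolist())
# ['07:00:00', '08:34:34', '10:59:59']
```

Note that `'07:00:00'` is kept because a longer string that starts with `'07:00'` sorts after it, while `'11:00:00'` is dropped for the same reason. This approach would break if the times were not zero-padded (for example `'7:00:00'`).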

In [32]:
genre_weekday(spr_general, 'Monday', '07:00', '11:00')

# calling the function for Monday morning in Springfield

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
classical      157
metal          120
jazz           100
folk            97
soundtrack      95
Name: genre, dtype: int64

In [33]:
genre_weekday(shel_general, 'Monday', '07:00', '11:00')

# calling the function for Monday morning in Shelbyville

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
world           36
rap             32
soundtrack      31
rnb             27
metal           27
Name: genre, dtype: int64

In [34]:
genre_weekday(spr_general, 'Friday', '17:00', '23:00')

# calling the function for Friday evening in Springfield

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
classical      163
alternative    163
rusrap         142
jazz           111
unknown        110
soundtrack     105
rnb             90
metal           88
Name: genre, dtype: int64

In [35]:
genre_weekday(shel_general, 'Friday', '17:00', '23:00')

# calling the function for Friday evening in Shelbyville

genre
pop            256
rock           216
electronic     216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
unknown         47
ruspop          47
soundtrack      40
metal           39
rap             36
Name: genre, dtype: int64

**Conclusion**

Having compared the top 15 genres on Monday morning, we can draw the following conclusions:

1. Users from Springfield and Shelbyville listen to similar music. The top five genres are the same; only rock and electronic have switched places.

2. In Springfield, the number of missing values turned out to be so large that the value `'unknown'` came in 10th. This means that missing values make up a considerable portion of the data, which may be a basis for questioning the reliability of our conclusions.

For Friday evening, the situation is similar. Individual genres vary somewhat, but on the whole, the top 15 is similar for the two cities.

Thus, the data does not support the second hypothesis:
* Users listen to similar music at the beginning and end of the week.
* There is no major difference between Springfield and Shelbyville. In both cities, pop is the most popular genre.

However, the number of missing values makes this result questionable. In Springfield, there are so many that they affect our top 15. Were we not missing these values, things might look different.
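
To make "similar" a little more concrete, we can count how many genres the two Monday-morning top 15 lists share. The sketch below copies the genre names from the outputs above by hand; the variable names are illustrative.

```python
# Top 15 genres on Monday morning, copied from the outputs above.
spr_monday = ['pop', 'dance', 'electronic', 'rock', 'hiphop', 'ruspop', 'world',
              'rusrap', 'alternative', 'unknown', 'classical', 'metal', 'jazz',
              'folk', 'soundtrack']
shel_monday = ['pop', 'dance', 'rock', 'electronic', 'hiphop', 'ruspop',
               'alternative', 'rusrap', 'jazz', 'classical', 'world', 'rap',
               'soundtrack', 'rnb', 'metal']

# Set operations give the shared genres and the ones unique to each city.
common = set(spr_monday) & set(shel_monday)
only_springfield = set(spr_monday) - set(shel_monday)
only_shelbyville = set(shel_monday) - set(spr_monday)

print(len(common), sorted(only_springfield), sorted(only_shelbyville))
# 13 ['folk', 'unknown'] ['rap', 'rnb']
```

Thirteen of the fifteen genres appear in both lists. This only compares membership, not rank or play counts, so it is a rough indicator rather than a formal test.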

### Hypothesis 3: genre preferences in Springfield and Shelbyville

Hypothesis: Shelbyville loves rap music. Springfield's citizens are more into pop.

In [36]:
spr_genres = spr_general.groupby('genre')['genre'].count().sort_values(ascending=False)

# groups the spr_general table by the 'genre' column, 
# counts the rows for each genre, 
# sorts the resulting Series in descending order, and stores it to spr_genres

In [37]:
spr_genres.head(10)

# printing the first 10 rows of spr_genres

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

In [38]:
shel_genres = shel_general.groupby('genre')['genre'].count().sort_values(ascending=False)

# groups the shel_general table by the 'genre' column, 
# counts the rows for each genre, 
# sorts the resulting Series in descending order, and stores it to shel_genres

In [39]:
shel_genres.head(10)# printing the first 10 rows from shel_genres

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

**Conclusion**

The hypothesis has been partially proven true:
* Pop music is the most popular genre in Springfield, as expected.
* However, pop is also the most popular genre in Shelbyville, and rap did not make the top 10 in either city.
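
Because Springfield has far more plays overall, raw counts exaggerate the gap between the cities. A rough check is to express pop as a share of each city's plays. The sketch below assumes that each city's total equals the sum of its Monday, Wednesday, and Friday counts from the results table, which holds only if the dataset contains no other days; the variable names are illustrative.

```python
# City totals, assumed to be the sum of the three daily counts reported earlier.
spr_total = 15740 + 11056 + 15945
shel_total = 5614 + 7003 + 5895

# Pop counts copied from the top 10 outputs above.
spr_pop_share = 5892 / spr_total
shel_pop_share = 2431 / shel_total

print(round(spr_pop_share, 3), round(shel_pop_share, 3))
```

Under that assumption, pop makes up a little under 14% of plays in Springfield and a little over 13% in Shelbyville, so its relative popularity is close in the two cities.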


## Recap & Conclusions

We have tested the following three hypotheses:

1. User activity differs depending on the day of the week and from city to city. 
2. On Monday mornings, Springfield and Shelbyville residents listen to different genres. This is also true for Friday evenings. 
3. Springfield and Shelbyville listeners have different preferences. In Springfield, they prefer pop, while Shelbyville has more rap fans.

Before we started analyzing the data, we made the following changes to prepare it for analysis:
- Replaced missing values in the 'artist', 'track', and 'genre' columns with 'unknown' (which may skew results when analyzing the genre column specifically).
- Changed the column headers to be uniform in format.
- Removed any rows of duplicate data.

In order to test the first hypothesis, we:
- Separated the data by city, and further by day of the week. This produced the table below, which shows the number of tracks played in each city on each day.

In [40]:
final_df

Unnamed: 0,city,monday,wednesday,friday
0,Springfield,15740,11056,15945
1,Shelbyville,5614,7003,5895


After analyzing the data, we concluded:

1. User activity in Springfield and Shelbyville does depend on the day of the week, though the pattern differs between the cities. In Springfield the peak listening days are Monday and Friday, whereas in Shelbyville the peak listening day is Wednesday.

In order to test the second hypothesis, we:
- Separated the data by city
- Further separated the data to compare tracks played on Monday morning (7:00-11:00) and Friday evening (17:00-23:00)
- Compared the genres played in these two time periods
- Combined the 'hip', 'hop', and 'hip-hop' genres into a single genre, 'hiphop'

After analyzing the data, we concluded:

2. Genre preferences do not vary significantly between the beginning and the end of the week in either Springfield or Shelbyville. There are small differences in order (for example, dance sometimes ranks above electronic), but in both cities the top five genres remain the same: pop, dance, rock, electronic, and hiphop. This means that Springfield and Shelbyville residents **do not** listen to different genres; they listen to very similar ones, which is contrary to our original hypothesis. It should also be noted that the 'unknown' genre made it into the top 15 in three of the four city and time-slot combinations, so missing values make up a sizable portion of the data and may skew the results of this analysis.

In order to test the third hypothesis, we:
- Separated the data by city
- Counted the number of tracks played per genre in each city and ranked the genres
- Compared the two ranked lists with each other

After analyzing the data, we concluded:

3. The musical preferences of users from Springfield and Shelbyville turn out to be quite similar. Listeners in both cities prefer pop, which is the most played genre in each. Rap does not appear in the top 10 genres of either city, which is, once again, contrary to our hypothesis.