# Y.Music

This project compares the musical preferences of the inhabitants of Springfield and Shelbyville.

Three hypotheses were studied:
1. User activity differs depending on the day of the week and the city.
2. On Monday mornings, residents of Springfield and Shelbyville listen to different genres. This is also true for Friday nights.
3. Springfield and Shelbyville listeners have different preferences. In Springfield, people prefer pop, while Shelbyville has more rap fans.


## Data overview <a id='data_review'></a>

In [2]:
import pandas as pd

In [3]:
# reading the file and storing it in df
df = pd.read_csv('music_project_en.csv')

In [4]:
# printing the first 10 lines
df.head(10)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
5,842029A1,Chains,Obladaet,rusrap,Shelbyville,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Springfield,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Springfield,20:47:49,Wednesday
8,8FA1D3BE,L’estate,Julia Dalia,ruspop,Springfield,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Shelbyville,21:20:49,Wednesday


In [5]:
# getting general information about the data in df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63736 non-null  object
 2   artist    57512 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


The table contains seven columns, all with the same data type: `object`.

According to the documentation:
- `'userID'` — user ID
- `'Track'` — song title
- `'artist'` — name of the artist
- `'genre'` — the genre
- `'City'` — user's city
- `'time'` — exact time the song was played
- `'Day'` — day of the week

We can see three problems with the style of the column names:
1. Some names are capitalized, some are lowercase.
2. Some names contain spaces.
3. `'  userID'` has leading spaces, mixes uppercase and lowercase letters, and lacks an underscore to separate the words (in snake_case it would be `user_id`).

The non-null counts differ between columns, which means the data contains missing values.

### Conclusions <a id='data_review_conclusions'></a> 

Each row in the table stores data about a track that was played. Some columns describe the track itself: its title, artist, and genre. The rest hold information about the user: the city they come from and the time and day the track was played.

It is clear that the data is sufficient to test the hypotheses. However, there are missing values.

To move forward, we need to preprocess the data.

## Data pre-processing <a id='data_preprocessing'></a>

### Header style <a id='header_style'></a>

In [6]:
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

In [7]:
# renaming columns
df = df.rename(
    columns={
        '  userID': 'user_id',
        'Track': 'track',
        '  City  ': 'city',
        'Day': 'day'
    }
)

In [8]:
# checking the result: the list of column names
df.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
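
The same cleanup can also be done programmatically instead of listing each column by hand. The sketch below is one possible approach and is not part of the original notebook: `to_snake_case` is a hypothetical helper that strips surrounding spaces, inserts an underscore where a lowercase letter is followed by an uppercase one, and lowercases the result.

```python
import re

# Column names exactly as printed by df.columns before renaming
raw_columns = ['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day']


def to_snake_case(name):
    """Hypothetical helper: normalize one column name to snake_case."""
    name = name.strip()
    # insert '_' at each lowercase/digit -> uppercase boundary (userID -> user_ID)
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    return name.lower()


clean_columns = [to_snake_case(name) for name in raw_columns]
print(clean_columns)
```

On the loaded table this could be applied with `df.columns = [to_snake_case(c) for c in df.columns]`. For a table this small, the explicit `rename()` mapping is arguably easier to audit.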

### Missing values <a id='missing_values'></a>

In [9]:
# calculating missing values
df.isna().sum()

user_id       0
track      1343
artist     7567
genre      1198
city          0
time          0
day           0
dtype: int64

Not all missing values affect the analysis. For example, the missing values in `'track'` and `'artist'` are not decisive, so I can replace them with clear markers.

But missing values in `'genre'` may affect the comparison of Springfield and Shelbyville musical preferences. Therefore, I handled the missing values as follows:
* Filled in the missing values with markers
* Evaluated how much the missing values can affect the calculations
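
To gauge how much the gaps might matter, the share of missing values per column can be computed from the counts printed above. This is a minimal sketch that reuses the figures from `df.info()` and `df.isna().sum()`; the variable names are illustrative.

```python
import pandas as pd

total_rows = 65079  # number of entries reported by df.info()
# missing counts reported by df.isna().sum()
missing = pd.Series({'track': 1343, 'artist': 7567, 'genre': 1198})

# percentage of rows with a missing value, per column
missing_share = (missing / total_rows * 100).round(2)
print(missing_share)
```

On the loaded table, `df.isna().mean()` returns the same information as fractions without typing the counts by hand.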

The missing values in `'track'`, `'artist'`, and `'genre'` are replaced with the string `'unknown'`.

In [10]:
# looping through column names and replacing missing values with 'unknown'
columns_to_replace = ['track', 'artist', 'genre']

for col in columns_to_replace:
    df[col] = df[col].fillna('unknown')

In [11]:
# counting the missing values
df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64

### Duplicates <a id='duplicates'></a>

In [12]:
# counting obvious duplicates
df.duplicated().sum()

3826

In [13]:
# removing obvious duplicates
df = df.drop_duplicates()

In [14]:
# checking duplicates
df.duplicated().sum()

0

Next, dealing with the implicit (non-obvious) duplicates in the `'genre'` column:

In [15]:
# viewing unique genre names
df['genre'].unique()

array(['rock', 'pop', 'folk', 'dance', 'rusrap', 'ruspop', 'world',
       'electronic', 'unknown', 'alternative', 'children', 'rnb', 'hip',
       'jazz', 'postrock', 'latin', 'classical', 'metal', 'reggae',
       'triphop', 'blues', 'instrumental', 'rusrock', 'dnb', 'türk',
       'post', 'country', 'psychedelic', 'conjazz', 'indie',
       'posthardcore', 'local', 'avantgarde', 'punk', 'videogame',
       'techno', 'house', 'christmas', 'melodic', 'caucasian',
       'reggaeton', 'soundtrack', 'singer', 'ska', 'salsa', 'ambient',
       'film', 'western', 'rap', 'beats', "hard'n'heavy", 'progmetal',
       'minimal', 'tropical', 'contemporary', 'new', 'soul', 'holiday',
       'german', 'jpop', 'spiritual', 'urban', 'gospel', 'nujazz',
       'folkmetal', 'trance', 'miscellaneous', 'anime', 'hardcore',
       'progressive', 'korean', 'numetal', 'vocal', 'estrada', 'tango',
       'loungeelectronic', 'classicmetal', 'dubstep', 'club', 'deep',
       'southern', 'black', 'folkrock', 

Implicit duplicates of `hiphop`:
* hip
* hop
* hip-hop

To fix them, I declared the `replace_wrong_genres()` function with two parameters:
* `wrong_genres`: the list of duplicates
* `correct_genre`: the string with the correct value

The function corrects the names in the `'genre'` column of the `df` table by replacing each value from the `wrong_genres` list with the value of `correct_genre`.

In [16]:
# function to replace implicit duplicates in the global df
def replace_wrong_genres(wrong_genres, correct_genre):
    # replace each wrong spelling in 'genre' with the correct one
    for wrong_genre in wrong_genres:
        df.loc[:, 'genre'] = df.loc[:, 'genre'].replace(wrong_genre, correct_genre)


In [17]:
# removing implicit duplicates
duplicates = ['hip', 'hop', 'hip-hop']
correct_genre = 'hiphop'

replace_wrong_genres(duplicates, correct_genre)
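
The function above modifies the global `df`. A variant that takes the table as an argument and returns a corrected copy is sketched below. The name `replace_genres` and the sample rows are hypothetical and only serve to illustrate the idea; `Series.replace` accepts a list of values to replace, so no loop is needed.

```python
import pandas as pd


def replace_genres(data, wrong_genres, correct_genre):
    """Hypothetical variant: return a copy with the wrong genre names unified."""
    result = data.copy()
    # a list of values to replace and a single replacement value
    result['genre'] = result['genre'].replace(wrong_genres, correct_genre)
    return result


# made-up sample rows, not taken from the dataset
sample = pd.DataFrame({'genre': ['hip', 'hop', 'hip-hop', 'rock', 'hiphop']})
cleaned = replace_genres(sample, ['hip', 'hop', 'hip-hop'], 'hiphop')
print(cleaned['genre'].unique())
```

Because the input is copied, the original table stays untouched, which makes the step easier to rerun in a notebook.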


In [18]:
# checking for duplicate values
df['genre'].unique()

array(['rock', 'pop', 'folk', 'dance', 'rusrap', 'ruspop', 'world',
       'electronic', 'unknown', 'alternative', 'children', 'rnb',
       'hiphop', 'jazz', 'postrock', 'latin', 'classical', 'metal',
       'reggae', 'triphop', 'blues', 'instrumental', 'rusrock', 'dnb',
       'türk', 'post', 'country', 'psychedelic', 'conjazz', 'indie',
       'posthardcore', 'local', 'avantgarde', 'punk', 'videogame',
       'techno', 'house', 'christmas', 'melodic', 'caucasian',
       'reggaeton', 'soundtrack', 'singer', 'ska', 'salsa', 'ambient',
       'film', 'western', 'rap', 'beats', "hard'n'heavy", 'progmetal',
       'minimal', 'tropical', 'contemporary', 'new', 'soul', 'holiday',
       'german', 'jpop', 'spiritual', 'urban', 'gospel', 'nujazz',
       'folkmetal', 'trance', 'miscellaneous', 'anime', 'hardcore',
       'progressive', 'korean', 'numetal', 'vocal', 'estrada', 'tango',
       'loungeelectronic', 'classicmetal', 'dubstep', 'club', 'deep',
       'southern', 'black', 'folkrock

### Data preprocessing conclusions <a id='data_preprocessing_conclusions'></a>
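Three problems with the data were detected:

- Incorrect header style
- Missing values
- Obvious and implicit duplicates

The headers have been cleaned up to make table processing simpler.

All missing values have been replaced with `'unknown'`. Obvious duplicates have been dropped, and the implicit duplicates of `'hiphop'` have been merged into a single value.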

## Hypotheses <a id='hypotheses'></a>

### Hypothesis 1: comparing user behavior in two cities <a id='activity'></a>

According to the first hypothesis, users from Springfield and Shelbyville listen to music differently. To test this, I used data from three days of the week: Monday, Wednesday, and Friday.

In [19]:
# Counting the songs played in each city
df.groupby('city')['track'].count()

city
Shelbyville    18512
Springfield    42741
Name: track, dtype: int64

Springfield has more plays than Shelbyville, but that does not mean its residents listen to music more often: the city is simply bigger and has more users.

In [20]:
# Calculating the songs listened to on each of these three days
df.groupby('day')['track'].count()

day
Friday       21840
Monday       21354
Wednesday    18059
Name: track, dtype: int64

Overall, Wednesday is the quietest day. But if we consider the two cities separately, the picture may be different.

In [21]:
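# number_tracks() counts the plays for a given day and city
def number_tracks(day, city):
    # keep only the rows that match both the city and the day
    track_list = df[(df['city'] == city) & (df['day'] == day)]
    # 'user_id' has no missing values, so count() equals the number of rows
    track_list_count = track_list['user_id'].count()
    return track_list_count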

In [22]:
# the number of songs played in Springfield on Monday
number_tracks('Monday', 'Springfield')

15740

In [23]:
# the number of songs played in Shelbyville on Monday
number_tracks('Monday', 'Shelbyville')

5614

In [24]:
# the number of songs played in Springfield on Wednesday
number_tracks('Wednesday', 'Springfield')

11056

In [25]:
# the number of songs played in Shelbyville on Wednesday
number_tracks('Wednesday', 'Shelbyville')

7003

In [26]:
# the number of songs played in Springfield on Friday
number_tracks('Friday', 'Springfield')

15945

In [27]:
# the number of songs played in Shelbyville on Friday 
number_tracks('Friday', 'Shelbyville')

5895

In [28]:
# table with results
dados = [['Springfield', number_tracks('Monday', 'Springfield'), number_tracks('Wednesday', 'Springfield'),
          number_tracks('Friday', 'Springfield')], 
         ['Shelbyville', number_tracks('Monday', 'Shelbyville'), number_tracks('Wednesday', 'Shelbyville'),
          number_tracks('Friday', 'Shelbyville')]
        ]
colunas = ['city', 'monday', 'wednesday', 'friday']

tabela_resultado = pd.DataFrame(data=dados, columns=colunas)

tabela_resultado

Unnamed: 0,city,monday,wednesday,friday
0,Springfield,15740,11056,15945
1,Shelbyville,5614,7003,5895
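
The same city-by-day table could also be built in a single call rather than six separate `number_tracks()` calls. The sketch below uses `pd.crosstab` on a few made-up rows (the `sample` frame is hypothetical, not the project data) to show the shape of the result.

```python
import pandas as pd

# made-up sample rows that mimic the 'city' and 'day' columns of df
sample = pd.DataFrame({
    'city': ['Springfield', 'Springfield', 'Shelbyville', 'Springfield', 'Shelbyville'],
    'day': ['Monday', 'Friday', 'Wednesday', 'Monday', 'Monday'],
})

day_order = ['Monday', 'Wednesday', 'Friday']
# count rows for every (city, day) pair, then put the days in calendar order
plays = pd.crosstab(sample['city'], sample['day']).reindex(columns=day_order, fill_value=0)
print(plays)
```

Applied to the cleaned table, `pd.crosstab(df['city'], df['day'])` should give the same counts as the manual table above.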


**Conclusions**

The data reveal differences in user behavior:

- In Springfield, the amount of music played peaks on Mondays and Fridays, while on Wednesdays there is a decrease in activity.
- In Shelbyville, by contrast, users listen to more music on Wednesdays; activity on Mondays and Fridays is lower.

So the first hypothesis seems to be correct.
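
Because Springfield has far more plays overall, the two cities are easier to compare when each day is expressed as a share of that city's total. The sketch below reuses the counts from the results table; the variable names are illustrative.

```python
import pandas as pd

# play counts taken from the results table above
plays = pd.DataFrame(
    {'monday': [15740, 5614], 'wednesday': [11056, 7003], 'friday': [15945, 5895]},
    index=['Springfield', 'Shelbyville'],
)

# divide each row by its own total so every city sums to 1
shares = plays.div(plays.sum(axis=1), axis=0)
print(shares.round(3))
```

With shares instead of raw counts, Wednesday is the smallest share in Springfield and the largest in Shelbyville, which matches the conclusion drawn from the raw numbers.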

### Hypothesis 2: music at the beginning and end of the week<a id='week'></a>

According to the second hypothesis, on Monday morning and Friday night, Springfield residents listen to genres that differ from those Shelbyville users listen to.

In [29]:
# getting the spr_general table from the rows of df
# where the value in column 'city' is 'Springfield'
spr_general = df[df['city'] == 'Springfield']
spr_general


Unnamed: 0,user_id,track,artist,genre,city,time,day
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
6,4CB90AA5,True,Roman Messer,dance,Springfield,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Springfield,20:47:49,Wednesday
8,8FA1D3BE,L’estate,Julia Dalia,ruspop,Springfield,09:17:40,Friday
...,...,...,...,...,...,...,...
65073,83A474E7,I Worship Only What You Bleed,The Black Dahlia Murder,extrememetal,Springfield,21:07:12,Monday
65074,729CBB09,My Name,McLean,rnb,Springfield,13:32:28,Wednesday
65076,C5E3A0D5,Jalopiina,unknown,industrial,Springfield,20:09:26,Friday
65077,321D0506,Freight Train,Chas McDevitt,rock,Springfield,21:43:59,Friday


In [30]:
# getting the shel_general table from the rows of df
# where the value in column 'city' is 'Shelbyville'
shel_general = df[df['city'] == 'Shelbyville']
shel_general

Unnamed: 0,user_id,track,artist,genre,city,time,day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
5,842029A1,Chains,Obladaet,rusrap,Shelbyville,13:09:41,Friday
9,E772D5C0,Pessimist,unknown,dance,Shelbyville,21:20:49,Wednesday
...,...,...,...,...,...,...,...
65063,D94F810B,Theme from the Walking Dead,Proyecto Halloween,film,Shelbyville,21:14:40,Monday
65064,BC8EC5CF,Red Lips: Gta (Rover Rework),Rover,electronic,Shelbyville,21:06:50,Monday
65065,29E04611,Bre Petrunko,Perunika Trio,world,Shelbyville,13:56:00,Monday
65066,1B91C621,(Hello) Cloud Mountain,sleepmakeswaves,postrock,Shelbyville,09:22:13,Monday


In [31]:
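# returns the 15 most played genres for a given day and time window
def genre_weekday(dados, day, time1, time2):
    # keep rows for the given day with time1 < time < time2
    # (zero-padded 'HH:MM:SS' strings compare in chronological order)
    genre_df = dados[(dados['day'] == day) & (dados['time'] < time2) & (dados['time'] > time1)]
    # count plays per genre and sort in descending order
    genre_df_count = genre_df.groupby('genre')['track'].count()
    genre_df_sorted = genre_df_count.sort_values(ascending=False)
    return genre_df_sorted[:15]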
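
The time filter above compares strings, not time objects. That works here because the `'time'` values are zero-padded `HH:MM:SS` strings, so text order and chronological order agree. The sketch below illustrates this and shows how it would break for unpadded values; `in_window` is a hypothetical helper that parses the values first.

```python
from datetime import time


def in_window(stamp, start, end):
    """Hypothetical helper: compare parsed times instead of raw strings."""
    return time.fromisoformat(start) < time.fromisoformat(stamp) < time.fromisoformat(end)


# zero-padded strings: text order matches chronological order
string_check = '06:00' < '08:34:34' < '12:00'
parsed_check = in_window('08:34:34', '06:00', '12:00')
print(string_check, parsed_check)

# without zero padding, text comparison breaks because '9' sorts after '1'
unpadded_check = '06:00' < '9:17:40' < '12:00'
print(unpadded_check)
```

If the source data ever contained unpadded times, converting the column to a proper time type before filtering would be the safer option.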


In [32]:
# calling the function for Monday morning in Springfield (using spr_general instead of df)
genre_weekday(spr_general, 'Monday', '06:00', '12:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
classical      157
metal          120
jazz           100
folk            97
soundtrack      95
Name: track, dtype: int64

In [33]:
# calling the function for Monday morning in Shelbyville (using shel_general instead of df)

genre_weekday(shel_general, 'Monday', '06:00', '12:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
world           36
rap             32
soundtrack      31
rnb             27
metal           27
Name: track, dtype: int64

In [34]:
# calling the function for Friday afternoon in Springfield
genre_weekday(spr_general, 'Friday', '13:00', '18:00')

genre
pop            664
dance          522
rock           471
electronic     442
hiphop         264
classical      239
alternative    189
ruspop         161
world          155
rusrap         151
jazz           150
metal           96
soundtrack      92
folk            84
rnb             82
Name: track, dtype: int64

In [35]:
# calling the function for Friday afternoon in Shelbyville
genre_weekday(shel_general, 'Friday', '13:00', '18:00')

genre
pop            281
dance          237
rock           221
electronic     173
hiphop         123
classical      106
alternative     83
ruspop          76
rusrap          59
world           57
jazz            45
folk            42
soundtrack      38
rnb             34
metal           34
Name: track, dtype: int64

**Conclusion**

Having compared the 15 most listened-to genres on Monday morning, we can draw the following conclusions:

1. Springfield and Shelbyville users listen to similar music. The five most listened-to genres are the same; only rock and electronic have switched places.

2. In Springfield, there are so many missing values that `'unknown'` came in 10th. Missing values therefore account for a considerable portion of the data, which gives grounds to question the reliability of the conclusions.

For Friday afternoon, the situation is similar. Individual genres vary slightly, but overall, the top 15 genres are similar for the two cities.

Thus, the second hypothesis was only partially confirmed:
* Users listen to similar music genres at the beginning and end of the week.
* There is not much difference between Springfield and Shelbyville. In both cities, pop is the most popular genre.

However, the number of missing values makes this result questionable. In Springfield, there are so many that they affected the top 15. Without these gaps, the picture could be different.
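
The similarity of the two rankings can be checked directly by comparing the Monday-morning top-15 lists as sets. This is a small sketch that copies the genre names from the outputs above; the variable names are illustrative.

```python
# top-15 genres on Monday morning, copied from the outputs above
springfield_top = ['pop', 'dance', 'electronic', 'rock', 'hiphop', 'ruspop', 'world',
                   'rusrap', 'alternative', 'unknown', 'classical', 'metal', 'jazz',
                   'folk', 'soundtrack']
shelbyville_top = ['pop', 'dance', 'rock', 'electronic', 'hiphop', 'ruspop',
                   'alternative', 'rusrap', 'jazz', 'classical', 'world', 'rap',
                   'soundtrack', 'rnb', 'metal']

common = set(springfield_top) & set(shelbyville_top)
only_springfield = set(springfield_top) - set(shelbyville_top)
only_shelbyville = set(shelbyville_top) - set(springfield_top)

print(len(common), sorted(only_springfield), sorted(only_shelbyville))
```

Most of the fifteen genres appear in both lists, and one of the Springfield-only entries is the `'unknown'` marker rather than a real genre.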

### Hypothesis 3: Preferences in Springfield and Shelbyville<a id='genre'></a>

Hypothesis: Shelbyville loves rap. Springfield citizens prefer pop.

In [36]:
spr_genres = spr_general.groupby('genre')['track'].count().sort_values(ascending=False)

In [38]:
# displaying the first 10 lines of spr_genres
spr_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: track, dtype: int64

In [39]:
shel_genres = shel_general.groupby('genre')['track'].count().sort_values(ascending=False)

In [40]:
# displaying the first 10 lines of shel_genres
shel_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: track, dtype: int64

**Conclusions**

The hypothesis was partially confirmed:
* Pop music is the most popular genre in Springfield, as expected.
* However, pop also turned out to be the most popular genre in Shelbyville, and rap was not in the top 5 in either city.
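
Raw counts are hard to compare between cities of different sizes, so it can help to express a genre as a share of each city's plays. The sketch below uses the per-city totals from the Hypothesis 1 section and the pop and hiphop counts from the outputs above; the variable names are illustrative.

```python
# total plays per city (from the groupby('city') output) and genre counts (from the top-10 lists)
city_totals = {'Springfield': 42741, 'Shelbyville': 18512}
pop_plays = {'Springfield': 5892, 'Shelbyville': 2431}
hiphop_plays = {'Springfield': 2096, 'Shelbyville': 960}

pop_share = {city: pop_plays[city] / city_totals[city] for city in city_totals}
hiphop_share = {city: hiphop_plays[city] / city_totals[city] for city in city_totals}

for city in city_totals:
    print(f'{city}: pop {pop_share[city]:.1%}, hiphop {hiphop_share[city]:.1%}')
```

The pop shares of the two cities differ by less than one percentage point, which supports reading the preferences as similar.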

# General conclusions <a id='end'></a>

The following three hypotheses were tested:

1. User activity varies depending on the day of the week and the city.
2. On Monday mornings, residents of Springfield and Shelbyville listen to different genres. This is also true for Friday nights.
3. Springfield and Shelbyville listeners have different preferences. In Springfield, people prefer pop, while Shelbyville has more rap fans.

After analyzing the data, I concluded that:

1. User activity in Springfield and Shelbyville depends on the day of the week, although the two cities follow different patterns.

The first hypothesis is fully accepted.

2. Music preferences do not vary significantly over the course of the week in either Springfield or Shelbyville. There are small differences in the genre order on Mondays, but in both cities pop is the most listened-to genre.

So this hypothesis, which expected different genres in the two cities, is not supported by the data. We should also keep in mind that the result might have been different if it weren't for the missing values.

3. It turns out that Springfield and Shelbyville users' music preferences are quite similar.

The third hypothesis was rejected. If there is any difference in preferences, it cannot be seen in this data.

### Observation
In real projects, research involves statistical hypothesis testing, which is more precise and more quantitative. Also keep in mind that you can't always draw conclusions about an entire city based on data from a single source.
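
As one illustration of such a test, a chi-square test of independence could be applied to the city-by-day play counts from Hypothesis 1. The sketch below computes the statistic by hand from the results table and compares it with the standard 5% critical value for two degrees of freedom. It treats every play as an independent observation, which is an assumption that does not strictly hold when the same user plays several tracks, so the outcome is indicative only.

```python
# play counts per city for Monday, Wednesday, Friday (from the results table)
observed = {
    'Springfield': [15740, 11056, 15945],
    'Shelbyville': [5614, 7003, 5895],
}

row_totals = {city: sum(counts) for city, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(row_totals.values())

# sum of (observed - expected)^2 / expected over all cells
chi2 = 0.0
for city, counts in observed.items():
    for col_total, count in zip(col_totals, counts):
        expected = row_totals[city] * col_total / grand_total
        chi2 += (count - expected) ** 2 / expected

dof = (len(observed) - 1) * (len(col_totals) - 1)
critical_value = 5.991  # chi-square critical value for 2 degrees of freedom at alpha = 0.05
print(f'chi2 = {chi2:.1f}, dof = {dof}, reject independence: {chi2 > critical_value}')
```

A statistic above the critical value would indicate that the distribution of plays over the three days differs between the two cities, in line with the first hypothesis.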