# Data Science Project - Yandex Music

## 1. Project Description

**Research Purpose:** Using Yandex Music data, compare the behavior of users in the two capitals.

Moscow and St. Petersburg are usually perceived in different ways. For example:
* Moscow is a metropolis subjected to the harsh rhythm of the workweek;
* St. Petersburg is a cultural capital with its own tastes.

Test three hypotheses:
1. User activity depends on the day of the week, and this manifests differently in Moscow and St. Petersburg.
2. On Monday mornings, different genres prevail in Moscow and St. Petersburg. The same goes for Friday evenings, with different dominant genres depending on the city.
3. Moscow and St. Petersburg prefer different music genres. Moscow tends to listen to pop music more often, while in St. Petersburg, Russian rap is more popular.

Research Steps:

- Obtain user behavior data from the `yandex_music_project.csv` file. The quality of the data is unknown, so a data overview is necessary before testing the hypotheses
- Check the data for errors and assess their impact
- Correct the most critical data errors
- Draw a conclusion

## 2. Data Description

The data is stored in the file `yandex_music_project.csv`

**Column Description:**

* `userID` - user identifier;
* `Track` - track name;
* `artist` - artist name;
* `genre` - genre name;
* `City` - user's city;
* `time` - start time of listening;
* `Day` - day of the week.

## 3. Data Overview

In [1]:
# Import the pandas library
import pandas as pd

In [2]:
# Reading the data file and saving it to a DataFrame (df)
df = pd.read_csv('yandex_music_project.csv')

In [3]:
# Display the first ten rows of the table on the screen:
df.head(10)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8FA1D3BE,И вновь продолжается бой,,ruspop,Moscow,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Saint-Petersburg,21:20:49,Wednesday


In [4]:
# Obtain general information about the table with a single command using the info() method:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


### 3.1 Summary

1.  The table consists of seven columns, and all columns have the `object` data type. However, the number of values in each column varies, indicating the presence of missing data.
2.  Each row in the table provides details about a listened track. Specific columns offer information about the composition, including its title, artist, and genre. 
3.  Additional columns offer insights into the user, such as their city and the time of music playback.
4.  While the existing data appears adequate for hypothesis testing, there are challenges that require attention. Missing values are evident in the dataset, and there are style discrepancies in column names that deviate from recommended data presentation practices.
5.  The column names exhibit style violations, including:
  - Mixed case: `Track`, `City`, and `Day` are capitalized and `userID` uses camelCase, while `artist`, `genre`, and `time` are lowercase.
  - Spaces: the `info()` output shows extra spaces before the `userID` and `City` column names.
6. These issues need to be resolved before the analysis can proceed.

## 4. Data Preprocessing

- Correct the style in column headers
- Eliminate missing values
- Check for duplicates

### 4.1 Correct the style in column headers

In [5]:
# List of column names in the df table
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

Bring the names in line with good style:
- Write multi-word names in snake_case;
- Make all characters lowercase;
- Remove spaces.

Rename the columns as follows:
- `'  userID'` → `'user_id'`;
- `'Track'` → `'track'`;
- `'  City  '` → `'city'`;
- `'Day'` → `'day'`.

In [6]:
# Renaming columns
df = df.rename(columns={'  userID': 'user_id', 'Track': 'track', '  City  ': 'city', 'Day': 'day'}) 

In [7]:
# Checking the results - list of column names
df.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
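For tables with many columns, renaming by hand becomes tedious. The three style rules above could also be applied automatically. The following is only a minimal sketch: the helper name `to_snake_case` is hypothetical and not part of the original project, and the regular expression handles only simple camelCase names such as `userID`.

```python
import re

def to_snake_case(name):
    """Strip spaces, split simple camelCase, and lowercase a column name."""
    name = name.strip()
    # Insert an underscore wherever a lowercase letter or digit is followed by an uppercase letter
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    return name.lower()

# Column names as they appear in the raw file
raw_columns = ['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day']
clean_columns = [to_snake_case(c) for c in raw_columns]
print(clean_columns)
```

With a DataFrame, the same helper could be applied as `df.columns = [to_snake_case(c) for c in df.columns]`.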

### 4.2 Eliminate missing values

Calculate how many missing values are in the table

In [8]:
# Counting missing values
df.isnull().sum()

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64

Not every missing value significantly impacts the research. For example, missing data in the 'track' and 'artist' columns does not affect the analysis, so it is enough to replace these gaps with explicit labels.

However, the missing values in the 'genre' column could hinder the comparison of musical tastes between Moscow and St. Petersburg. In a real-world scenario, it would be ideal to identify the cause of these gaps and restore the data. Unfortunately, such an option is not available in this project. As a workaround:

1. Fill in these missing values with explicit labels.
2. Assess the extent to which they might impact the calculations.

Replace missing values in the `track`, `artist`, and `genre` columns with the string `unknown`:
- create a list named 'columns_to_replace'
- iterate through its elements using a for loop
- for each column, perform the replacement of missing values

In [9]:
# Iterating through column names in a loop and replacing missing values with 'unknown'
columns_to_replace = ['track', 'artist', 'genre']
for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')

In [10]:
# Recount the missing values to make sure none are left in the table
df.isnull().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64

### 4.3 Check for Duplicates

In [11]:
# Counting explicit duplicates
df.duplicated().sum()

3826

In [12]:
# Removing explicit duplicates
df = df.drop_duplicates()

In [13]:
# Checking for the absence of duplicates
df.duplicated().sum()

0

Eliminate implicit duplicates in the 'genre' column. For example, the name of the same genre might be recorded slightly differently. Such errors can also affect the research results.

Display on the screen a list of unique genre names sorted in alphabetical order:
- Extract the required column from the dataframe
- Apply a sorting method to it
- For the sorted column, call a method that returns unique values from the column

In [14]:
# Viewing unique genre names
sorted_values_list = df['genre']

sorted_values_list.sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

Review the list and identify implicit duplicates of the 'hiphop' genre name. These could be names with errors or alternative names for the same genre.

There are the following implicit duplicates:
- hip,
- hop,
- hip-hop.

To clean the table, use the replace() method with two arguments: 
- a list of duplicate strings (including hip, hop, and hip-hop)
- a string with the correct value. 

To correct the 'genre' column in the df table: replace each value from the list of duplicates with the correct one. Instead of hip, hop, and hip-hop, the table should have the value 'hiphop'.

In [15]:
# Removing implicit duplicates: replace every variant with the correct value 'hiphop'
duplicates = ['hip', 'hop', 'hip-hop']
df['genre'] = df['genre'].replace(duplicates, 'hiphop')

Ensure that incorrect genre names have been successfully replaced:
- hip,
- hop,
- hip-hop.  

Display the sorted list of unique values in the 'genre' column:

In [16]:
# Checking for implicit duplicates
# (read the column from df again: the variable created earlier still holds the old values)
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '
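Because the printed array is long, a targeted check can be easier to read than scanning the output. The sketch below uses a small toy Series with illustrative values rather than the project data; the names `genres`, `cleaned`, and `remaining` are assumptions made for this example.

```python
import pandas as pd

# Toy genre column containing the three variants (illustrative values, not the dataset)
genres = pd.Series(['hip', 'rock', 'hop', 'hip-hop', 'hiphop', 'pop'])
duplicates = ['hip', 'hop', 'hip-hop']

# Replace every variant with the correct value
cleaned = genres.replace(duplicates, 'hiphop')

# Number of rows that still hold one of the old spellings; 0 means the cleanup worked
remaining = cleaned.isin(duplicates).sum()
print(remaining)
```

On the real table, the same check would read `df['genre'].isin(duplicates).sum()`.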

### 4.4 Summary
1. The preprocessing stage identified three data issues:
   - Header style inconsistencies,
   - Missing values,
   - Duplicates - both explicit and implicit.
2. Header corrections were implemented for enhanced table manipulation. The removal of duplicates enhances the precision of the study.
3. Missing values were uniformly replaced with 'unknown'. The impact of gaps in the 'genre' column on the research outcome remains to be examined.
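
One way to start examining the impact of genre gaps is to measure what fraction of listens in each city carries the `unknown` placeholder. This is a sketch on toy data; the `sample` table and its values are illustrative assumptions, not figures from the dataset.

```python
import pandas as pd

# Toy data standing in for the cleaned table (illustrative values only)
sample = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg'],
    'genre': ['pop', 'unknown', 'rock', 'dance', 'unknown', 'jazz'],
})

# Fraction of listens per city whose genre is the 'unknown' placeholder
unknown_share = (sample['genre'] == 'unknown').groupby(sample['city']).mean()
print(unknown_share)
```

If the share differs noticeably between cities, genre comparisons between them deserve extra caution.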


## 5. Hypothesis Testing

### 5.1 First Hypothesis: Distinct Music Listening Patterns in Moscow and St. Petersburg

The initial hypothesis suggests that users exhibit diverse music listening behaviors in Moscow and St. Petersburg. To validate this assumption, an analysis will be conducted using data from three specific weekdays—Monday, Wednesday, and Friday. The approach involves:
- Segregating users based on their location, distinguishing between Moscow and St. Petersburg.
- Comparing the quantity of tracks listened to by each user group on Monday, Wednesday, and Friday.

**5.1.1 City-Wise Music Listening Analysis:**

In [17]:
# Count listens in each city
df.groupby('city')['genre'].count()

city
Moscow              42741
Saint-Petersburg    18512
Name: genre, dtype: int64

Moscow has more listens than St. Petersburg. This does not necessarily mean that Moscow users listen to music more often; the difference may simply reflect a larger user base in Moscow.
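
Whether the gap comes from the number of users or from how much each user listens can be checked by counting unique users alongside listens. The following sketch uses toy data; the `sample` values and the `listens_per_user` column are assumptions made for illustration.

```python
import pandas as pd

# Toy data: three Moscow users and one St. Petersburg user (illustrative values only)
sample = pd.DataFrame({
    'user_id': ['A1', 'A1', 'B2', 'C3', 'D4', 'D4'],
    'city': ['Moscow', 'Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg'],
})

# Listens and distinct users per city, then the average number of listens per user
per_city = sample.groupby('city')['user_id'].agg(listens='count', users='nunique')
per_city['listens_per_user'] = per_city['listens'] / per_city['users']
print(per_city)
```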

**5.1.2 Day-wise Listening Analysis:**

In [18]:
# Count listens on each of the three days
df.groupby('day')['genre'].count()

day
Friday       21840
Monday       21354
Wednesday    18059
Name: genre, dtype: int64

Across both cities combined, Wednesday has fewer listens than Monday or Friday. Examining each city separately gives a clearer picture.

**5.1.3 Function Creation:**

Create the `number_tracks()` function. This function calculates listens for a specified day and city, taking two parameters:
- The day of the week
- The city name

The function employs sequential filtering and logical indexing to extract relevant data. It then counts the values in the `user_id` column and returns the result.

In [19]:
# Function to calculate listens for a given day and city
def number_tracks(day, city):
    track_list = df[df['day'] == day]
    track_list = track_list[track_list['city'] == city]
    track_list_count = track_list['user_id'].count()
    return track_list_count

This function provides a flexible and efficient way to analyze music listens based on specific days and cities.

Call the number_tracks() function six times, varying the parameter values to obtain data for each city on each of the three days:

In [20]:
# Calling number_tracks() for Moscow on Monday
number_tracks('Monday', 'Moscow')

15740

In [21]:
# Calling number_tracks() for Moscow on Wednesday
number_tracks('Wednesday', 'Moscow')

11056

In [22]:
# Calling number_tracks() for Moscow on Friday
number_tracks('Friday', 'Moscow')

15945

In [23]:
# Calling number_tracks() for St. Petersburg on Monday
number_tracks('Monday', 'Saint-Petersburg')

5614

In [24]:
# Calling number_tracks() for St. Petersburg on Wednesday
number_tracks('Wednesday', 'Saint-Petersburg')

7003

In [25]:
# Calling number_tracks() for St. Petersburg on Friday
number_tracks('Friday', 'Saint-Petersburg')

5895

These calls will give the number of listens for each city on each of the three days of the week. These values can be used for further analysis or visualization as needed.

Create a table using the pd.DataFrame constructor, where:
- Column names are `['city', 'monday', 'wednesday', 'friday']`.
- Data consists of the results obtained using number_tracks.

In [26]:
# Table with results
info = pd.DataFrame(data=[['Moscow', 15740, 11056, 15945], ['St. Petersburg', 5614, 7003, 5895]], columns=['city', 'monday', 'wednesday', 'friday'])
info 

Unnamed: 0,city,monday,wednesday,friday
0,Moscow,15740,11056,15945
1,St. Petersburg,5614,7003,5895
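
As an alternative to six separate calls and a hand-built table, pandas can produce the same city-by-day counts in one step with `pd.crosstab`. This is a sketch on toy data, not the project's method; the `sample` values are illustrative.

```python
import pandas as pd

# Toy data standing in for the cleaned table (illustrative values only)
sample = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Saint-Petersburg', 'Saint-Petersburg'],
    'day': ['Monday', 'Friday', 'Monday', 'Wednesday', 'Monday'],
})

# Count rows for every (city, day) pair
counts = pd.crosstab(sample['city'], sample['day'])

# Put the days in calendar order; pairs with no rows get 0
counts = counts.reindex(columns=['Monday', 'Wednesday', 'Friday'], fill_value=0)
print(counts)
```

This avoids copying numbers by hand, which is where transcription errors tend to creep in.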


**5.1.4 Summary:**

1. The data reveals differences in user behavior:
   - In Moscow, listens peak on Monday and Friday, with a noticeable decline on Wednesday.
   - In St. Petersburg, conversely, listens peak on Wednesday, while Monday and Friday are lower and close to each other.
2. Thus, the data supports the first hypothesis.
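
Because the two cities differ in total volume, the weekly pattern is easier to compare as a share of each city's own listens. The sketch below reuses the counts returned by `number_tracks()`; the variable names `listens` and `shares` are assumptions for this example.

```python
# Listen counts per city and day, taken from the number_tracks() results
listens = {
    'Moscow': {'monday': 15740, 'wednesday': 11056, 'friday': 15945},
    'St. Petersburg': {'monday': 5614, 'wednesday': 7003, 'friday': 5895},
}

# Convert counts to each day's share of the city's total over the three days
shares = {}
for city, by_day in listens.items():
    total = sum(by_day.values())
    shares[city] = {day: round(count / total, 3) for day, count in by_day.items()}

for city, by_day in shares.items():
    print(city, by_day)
```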

### 5.2 Second Hypothesis: Music Trends at the Beginning and End of the Week

According to the second hypothesis, different genres dominate music listening in Moscow and St. Petersburg on Monday mornings and Friday evenings. Let's store the data in two variables:
- For Moscow — moscow_general
- For St. Petersburg — spb_general

In [27]:
# Creating moscow_general from rows in df where the 'city' column equals 'Moscow'
moscow_general = df[df['city'] == 'Moscow']

# Creating spb_general from rows in df where the 'city' column equals 'Saint-Petersburg'
spb_general = df[df['city'] == 'Saint-Petersburg']

Create the `genre_weekday()` function with four parameters:

- The dataframe with data
- The day of the week
- The start time in the 'hh:mm' format
- The end time in the 'hh:mm' format
   
The function should return information about the top 10 genres of tracks listened to on the specified day, between the two given time labels.

In [28]:
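def genre_weekday(df, day, time1, time2):
    # Sequential filtering: keep rows for the given day, then rows inside the time window.
    # The 'time' values are compared as strings, which relies on zero-padded 'hh:mm:ss' values.
    genre_df = df[df['day'] == day]
    genre_df = genre_df[genre_df['time'] < time2]
    genre_df = genre_df[genre_df['time'] > time1]
    # Count listens per genre and return the ten most frequent genres
    genre_df_grouped = genre_df.groupby('genre')['genre'].count()
    genre_df_sorted = genre_df_grouped.sort_values(ascending=False)
    return genre_df_sorted[:10]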
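
The time filter in `genre_weekday()` compares strings, not time objects. The sketch below illustrates why this works for zero-padded values and where it would fail; the variable names are hypothetical and the time strings are examples only.

```python
# Zero-padded 'hh:mm:ss' strings sort in the same order as the times they represent
inside_window = '07:00' < '08:34:34' < '11:00'

# With 'hh:mm' limits, a longer string with the same prefix compares as greater,
# so 07:00:00 passes the "> time1" filter while 11:00:00 fails the "< time2" filter
start_kept = '07:00:00' > '07:00'
end_kept = '11:00:00' < '11:00'

# Without zero padding the comparison no longer matches clock order
unpadded_ok = '07:00' < '7:30:00' < '11:00'

print(inside_window, start_kept, end_kept, unpadded_ok)
```

If the column could contain unpadded or malformed times, converting it with `pd.to_datetime` before filtering would be a safer choice.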

Compare the results of the genre_weekday() function for Moscow and St. Petersburg on Monday mornings (from 7:00 to 11:00) and Friday evenings (from 17:00 to 23:00).

In [29]:
# Calling the function for Monday morning in Moscow (using moscow_general instead of df)
genre_weekday(moscow_general, 'Monday', '07:00', '11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: genre, dtype: int64

In [30]:
# Calling the function for Monday morning in St. Petersburg (using spb_general instead of df)
genre_weekday(spb_general, 'Monday', '07:00', '11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: genre, dtype: int64

In [31]:
# Calling the function for Friday evening in Moscow 
genre_weekday(moscow_general, 'Friday', '17:00', '23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: genre, dtype: int64

In [32]:
# Calling the function for Friday evening in St. Petersburg
genre_weekday(spb_general, 'Friday', '17:00', '23:00')

genre
pop            256
electronic     216
rock           216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: genre, dtype: int64

**5.2.1 Summary:**
   
When comparing the top 10 genres on Monday mornings, the following observations can be made:
1. Moscow and St. Petersburg have similar music preferences. The only difference is that the genre "world" made it into Moscow's ranking, while jazz and classical entered St. Petersburg's.
2. In Moscow, the 'unknown' genre holds the tenth position among the most popular genres due to a significant number of missing values. This indicates that missing values occupy a substantial portion of the data, posing a threat to the study's credibility.
3. Friday evening does not alter this pattern significantly. Some genres rise slightly and others decline, but the top 10 stays largely the same: in Moscow, classical replaces 'unknown', and in St. Petersburg, world replaces ruspop.

Thus, the second hypothesis is only partially confirmed:
- Users listen to similar music at the beginning and end of the week.
- The difference between Moscow and St. Petersburg is not very pronounced. Moscow leans more towards Russian pop music, while St. Petersburg tends to favor jazz.
- However, the presence of gaps in the data casts doubt on these results. In Moscow, there are so many missing values that the top 10 rankings could look different if genre data were not lost.
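
The overlap between two rankings can be checked with set operations, without comparing the lists by eye. The sketch below uses the Monday morning top 10 lists printed above; the variable names are assumptions for this example.

```python
# Monday morning top 10 genres, copied from the genre_weekday() outputs
moscow_monday = ['pop', 'dance', 'electronic', 'rock', 'hiphop',
                 'ruspop', 'world', 'rusrap', 'alternative', 'unknown']
spb_monday = ['pop', 'dance', 'rock', 'electronic', 'hiphop',
              'ruspop', 'alternative', 'rusrap', 'jazz', 'classical']

# Genres that appear in only one city's ranking, and genres shared by both
only_moscow = sorted(set(moscow_monday) - set(spb_monday))
only_spb = sorted(set(spb_monday) - set(moscow_monday))
shared = sorted(set(moscow_monday) & set(spb_monday))

print(only_moscow, only_spb, len(shared))
```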

### 5.3 Third Hypothesis: Musical Preferences in Moscow and St. Petersburg

Hypothesis: St. Petersburg, known as the rap capital, listens to this genre more frequently than Moscow. Meanwhile, Moscow, a city of contrasts, predominantly embraces pop music.

- Group the moscow_general table by genre and count the listens for each genre using the `count()` method
- Sort the results in descending order and store them in the `moscow_genres` table

In [33]:
# One-liner: Grouping the moscow_general table by the 'genre' column, 
# counting the number of 'genre' values in this grouping using the count() method, 
# sorting the resulting Series in descending order, and saving it to moscow_genres
moscow_genres = moscow_general.groupby('genre')['genre'].count().sort_values(ascending=False)

Display the first ten rows of `moscow_genres`:

In [34]:
# Displaying the first 10 rows of moscow_genres
moscow_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

Repeat the process for St. Petersburg:
- Group the spb_general table by genre
- Count the listens for each genre
- Sort the results in descending order
- Store them in the `spb_genres` table

In [35]:
# One-liner: Grouping the spb_general table by the 'genre' column, 
# counting the number of 'genre' values in this grouping using the count() method, 
# sorting the resulting Series in descending order, and saving it to spb_genres
spb_genres = spb_general.groupby('genre')['genre'].count().sort_values(ascending=False)

Display the first ten rows of `spb_genres`:

In [36]:
# Displaying the first 10 rows of spb_genres
spb_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

**5.3.1 Summary:**
1. The hypothesis is partially confirmed: Pop music is the most popular genre in Moscow, aligning with the hypothesis. Moreover, the top 10 genres include a related genre — Russian pop music.
2. Contrary to expectations, rap holds a similar position in both cities: hiphop ranks fifth in each, and rusrap sits in the lower part of both top 10 lists.
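
Raw counts are hard to compare because Moscow has more than twice as many listens. Expressing each genre as a percentage of the city's total gives a fairer comparison. The sketch below reuses counts printed earlier in this report; the names `totals`, `top_counts`, and `share_pct` are assumptions for this example.

```python
# Total listens per city (from the city-wise count) and selected genre counts
totals = {'Moscow': 42741, 'Saint-Petersburg': 18512}
top_counts = {
    'Moscow': {'pop': 5892, 'hiphop': 2096, 'rusrap': 1161},
    'Saint-Petersburg': {'pop': 2431, 'hiphop': 960, 'rusrap': 564},
}

# Each genre's share of its city's listens, in percent
share_pct = {
    city: {genre: round(100 * count / totals[city], 1) for genre, count in genres.items()}
    for city, genres in top_counts.items()
}

for city, genres in share_pct.items():
    print(city, genres)
```

The differences between the cities amount to fractions of a percentage point, which is consistent with the summary above.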

## 6. Conclusion

**Hypotheses Analysis Summary**
1. Impact of the Day of the Week
   User activity follows different weekly patterns in the two cities: Moscow peaks on Monday and Friday, while St. Petersburg peaks on Wednesday. This supports the first hypothesis.
2. Stability of Musical Preferences Throughout the Week
   Overall musical preferences remain relatively stable throughout the week, with subtle differences on Monday mornings: the "world" genre appears in Moscow's top 10, while jazz and classical music appear in St. Petersburg's. The second hypothesis is only partially confirmed, and even this result may be affected by gaps in the genre data.
3. Comparison of Musical Tastes in Moscow and St. Petersburg
   Contrary to expectations, the musical tastes of users in St. Petersburg closely resemble those in Moscow. The third hypothesis is at best partially confirmed: pop leads in both cities, and rap is not noticeably more popular in St. Petersburg.
   
**Practical Insights:**
* Research findings should be interpreted cautiously, as they may not be representative of the entire population.
* Statistical hypothesis testing is crucial for gauging the reliability of conclusions drawn from a specific dataset.
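
As an illustration of such a test, the city-by-day counts from the first hypothesis can be put through a chi-square test of independence. This is a sketch only, not part of the original analysis. It assumes every listen is an independent observation, which is doubtful here because the same user contributes many listens, so the result likely overstates the evidence. The critical value 5.991 is the standard 0.05 threshold for 2 degrees of freedom.

```python
# Observed listens per city for Monday, Wednesday, Friday (from the results table)
observed = {
    'Moscow': [15740, 11056, 15945],
    'St. Petersburg': [5614, 7003, 5895],
}

row_totals = {city: sum(counts) for city, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(row_totals.values())

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells,
# where expected counts assume that city and day are independent
chi2 = 0.0
for city, counts in observed.items():
    for j, obs in enumerate(counts):
        expected = row_totals[city] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

dof = (len(observed) - 1) * (len(col_totals) - 1)
critical_value = 5.991  # chi-square critical value for dof = 2 at the 0.05 level

print(round(chi2, 1), dof, chi2 > critical_value)
```

A statistic above the critical value suggests the day-of-week distribution differs between the cities, subject to the independence caveat above.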