# Music of big cities

**Comparison of Moscow and St. Petersburg: myths and analysis**

Comparisons of Moscow and St. Petersburg are surrounded by myths. For example:
1. Moscow is a metropolis driven by the intense rhythm of the workweek.  
2. St. Petersburg is the cultural capital, with its distinct tastes.  

Using data from Yandex.Music, you will analyze user behavior in these two cities.

**Research goal**
Test the following hypotheses:  
1. User activity depends on the day of the week, and this pattern differs between Moscow and St. Petersburg.  
2. On Monday mornings, different music genres dominate in Moscow compared to St. Petersburg. Similarly, on Friday evenings, genre preferences vary by city.  
3. Moscow and St. Petersburg have distinct genre preferences: pop music is more popular in Moscow, while Russian rap is more common in St. Petersburg.  

**Research process**  

User behavior data will be sourced from the file `yandex_music_project.csv`. The quality of the data is unknown, so an initial data overview will be necessary.  

You will:  
1. Review the data for errors and assess their impact on the research.  
2. During preprocessing, correct the most critical data issues if possible.  

**Research steps**  
1. Data overview.  
2. Data preprocessing.  
3. Hypothesis testing.

Load libraries:

In [9]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from caseconverter import snakecase
from collections import defaultdict
from IPython.display import display

In [2]:
FIG_WIDTH = 10 * 100
FIG_HEIGHT = 5 * 100

In [3]:
def get_statistics(df: pd.DataFrame) -> pd.DataFrame:
    """
    Generate summary statistics for each column in a DataFrame.

    Args:
        df (pd.DataFrame): Input DataFrame.

    Returns:
        pd.DataFrame: DataFrame with column statistics, including column name, data type,
                      count of unique values, a sample of unique values, null count, and
                      percentage of null values.
    """
    rows_total = len(df)
    
    stats_list = [
        {
            'column_name': f"'{column}'",
            'data_type': str(df[column].dtype),
            'count_unique': df[column].nunique(),
            'sample_values': (
                pd.Series(df[column].dropna().unique())
                .sample(min(5, len(df[column].dropna().unique())))
                .apply(lambda x: round(x, 2) if pd.api.types.is_numeric_dtype(df[column]) else x)
                .tolist()
            ),
            'count_null': df[column].isnull().sum(),
            'pct_null': round((df[column].isnull().sum() / rows_total) * 100, 0),
        }
        for column in df.columns
    ]

    # Print table-level facts, then return the per-column statistics as a DataFrame
    print(f"Dataframe size: {df.shape[0]} rows x {df.shape[1]} columns")
    print(f"Full duplicate rows: {df.duplicated().sum()}")
    return pd.DataFrame(stats_list)

In [4]:
def plot_unique_counts(df: pd.DataFrame, columns: list, n_cols: int = 2, fig_size: tuple = (1200, 800), top_n: int = 10):
    """
    Generate a Plotly figure with multiple subplots, each showing unique value counts for a selected column.

    Args:
        df (pd.DataFrame): Input DataFrame.
        columns (list): List of column names to analyze for unique counts.
        n_cols (int): Number of columns in the subplot grid. Defaults to 2.
        fig_size (tuple): Size of the figure (width, height). Defaults to (1200, 800).
        top_n (int): Number of top unique values to display per column. Defaults to 10.

    Returns:
        None: Displays the Plotly figure.
    """
    n_rows = -(-len(columns) // n_cols)  # Calculate number of rows dynamically
    fig = make_subplots(rows=n_rows, cols=n_cols, subplot_titles=columns)

    row, col = 1, 1  # Track subplot position
    for column in columns:
        # Compute value counts and take the top N
        df_temp = (
            df[column]
            .value_counts()
            .reset_index()
            .set_axis(['values', 'ucount'], axis=1)
            .head(top_n)
        )

        # Create bar plot for the current column
        bar_fig = px.bar(
            df_temp.sort_values('ucount', ascending=True),
            x='ucount',
            y='values',
            title=f"Top {top_n} unique values in {column}",
            orientation='h'
        )

        # Extract traces and add to the main figure
        for trace in bar_fig['data']:
            fig.add_trace(trace, row=row, col=col)

        # Move to the next subplot
        col += 1
        if col > n_cols:
            col = 1
            row += 1

    # Update layout
    fig.update_layout(
        title_text="Unique value counts per column",
        width=fig_size[0], height=fig_size[1],
        showlegend=False,
        template='plotly_white'
    )

    fig.show()

In [5]:
try:
    raw_music = pd.read_csv('yandex_music_project.csv')
except FileNotFoundError:
    raw_music = pd.read_csv('/datasets/yandex_music_project.csv')

## Data overview

Let's get an initial understanding of the Yandex.Music data.

In [6]:
display(raw_music.head(5))

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday


In [7]:
get_statistics(raw_music)

Dataframe size: 65079 rows x 7 columns
Full duplicate rows: 3826


Unnamed: 0,column_name,data_type,count_unique,sample_values,count_null,pct_null
0,' userID',object,41748,"[CB18EEFC, 910E8D70, D02B55C1, ED4C2605, 2B8B9...",0,0.0
1,'Track',object,47245,"[Advanced, It Is Pitch Dark, Karin, Op Blouber...",1231,2.0
2,'artist',object,43605,"[Gunter Bombe, Гарри Бардин, ZZZUBЫ, FLESH LIZ...",7203,11.0
3,'genre',object,289,"[singer, quebecois, swing, cantautori, fairytail]",1198,2.0
4,' City ',object,2,"[Saint-Petersburg, Moscow]",0,0.0
5,'time',object,20392,"[14:24:20, 20:42:59, 21:07:18, 20:32:11, 08:28...",0,0.0
6,'Day',object,3,"[Wednesday, Friday, Monday]",0,0.0


In [12]:
plot_unique_counts(
    raw_music.rename(columns=snakecase), 
    ['genre', 'day', 'city'],
    n_cols=1,
    top_n=20,
    fig_size=(FIG_WIDTH, 2*FIG_HEIGHT)
)

The table contains seven columns, all with the `object` data type.

According to the data documentation:
- `userID` — user identifier;
- `Track` — track title;
- `artist` — artist name;
- `genre` — genre name;
- `City` — user's city;
- `time` — listening start time;
- `Day` — day of the week.

Three style issues are noticeable in the column names:
1. Lowercase and uppercase letters are mixed.
2. Spaces are present.
3. `userID` joins two words without a separator; in snake_case it would be `user_id`.

The number of values in the columns varies, indicating there are missing values in the data. There are also full duplicates and inconsistent content in several columns.
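The observations above can be checked directly. Below is a minimal sketch on a hypothetical three-row table (`toy` and its values are invented for illustration; only the column names mimic the raw data). It prints the column names through `repr()` so that stray spaces become visible, then counts missing values and fully duplicated rows.

```python
import pandas as pd

# Hypothetical miniature of the raw table; values are made up.
toy = pd.DataFrame({
    ' userID': ['A1', 'B2', 'A1'],
    'Track': ['Song', None, 'Song'],
    ' City ': ['Moscow', 'Moscow', 'Moscow'],
})

# repr() makes leading and trailing spaces in column names visible.
print([repr(name) for name in toy.columns])

# Missing values per column and the number of fully duplicated rows.
null_counts = toy.isna().sum()
duplicate_rows = int(toy.duplicated().sum())
print(null_counts)
print(duplicate_rows)
```

The same three calls (`columns`, `isna().sum()`, `duplicated().sum()`) should work unchanged on `raw_music`.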

**Conclusions**

Each row in the table contains data about a listened track. Some columns describe the track itself, including the title, artist, and genre. The other columns provide information about the user, such as their city and the time they listened to the music.

Preliminary analysis suggests that the data is sufficient to test the hypotheses. However, there are missing values, and the column names do not adhere to good style conventions.

To proceed, these data issues need to be addressed.

In [10]:
df_music = (
    raw_music.copy()
    .rename(columns=snakecase)
    .fillna('unknown')
    .drop_duplicates()
    .assign(genre=lambda df: df.genre.replace({'hip': 'hip-hop', 'hop': 'hip-hop'}))
)

In [11]:
get_statistics(df_music)

Dataframe size: 61253 rows x 7 columns
Full duplicate rows: 0


Unnamed: 0,column_name,data_type,count_unique,sample_values,count_null,pct_null
0,'user_id',object,41748,"[34710465, CE33D210, 3CD8253B, D1AE503D, CF707...",0,0.0
1,'track',object,47246,"[When I'm 'ere, Глазами Будды, Эндорфины, Brea...",0,0.0
2,'artist',object,43606,"[Black Gryph0n & Baasik, Yellowstraps, OLGA ...",0,0.0
3,'genre',object,288,"[adult, quebecois, folktronica, latin, dirty]",0,0.0
4,'city',object,2,"[Saint-Petersburg, Moscow]",0,0.0
5,'time',object,20392,"[20:17:43, 09:13:20, 13:31:27, 14:38:18, 21:27...",0,0.0
6,'day',object,3,"[Monday, Wednesday, Friday]",0,0.0
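The genre count moves from 289 to 288 after preprocessing. One plausible reading, assuming `hip-hop` was already present in the raw data, is that `fillna('unknown')` adds one value and the replacement merges two spellings into an existing one. The sketch below reproduces that logic on a hypothetical list of genres.

```python
import pandas as pd

# Hypothetical genre values, including one missing entry.
genres = pd.Series(['hip', 'hop', 'hip-hop', 'rock', None])

cleaned = (
    genres
    .fillna('unknown')                               # missing genre becomes an explicit label
    .replace({'hip': 'hip-hop', 'hop': 'hip-hop'})   # merge alternative spellings
)

remaining = sorted(cleaned.unique())
print(remaining)
```

Here four distinct non-null values plus one missing entry collapse to three labels, which mirrors the net change of minus one seen in the table above.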


## Hypothesis testing

### Comparing user behavior in two capitals

The first hypothesis suggests that users in Moscow and St. Petersburg listen to music differently. Test this assumption using data for three weekdays: Monday, Wednesday, and Friday. To do this:

- Separate users from Moscow and St. Petersburg.
- Compare how many tracks each group of users listened to on Monday, Wednesday, and Friday.

For practice, perform each calculation separately at first.

Evaluate user activity in each city by grouping the data by city and counting the number of tracks listened to in each group.

In [12]:
(df_music
 .pivot_table(index='city', values='user_id', aggfunc='count')
 .reset_index()
 .set_axis(['city', 'plays'], axis='columns')
)



Unnamed: 0,city,plays
0,Moscow,42741
1,Saint-Petersburg,18512


In Moscow, there are more listens than in St. Petersburg. However, this does not mean that Moscow users listen to music more frequently—it simply reflects that there are more users in Moscow.
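To separate "more users" from "more listening per user", the play count can be divided by the number of distinct users in each city. The sketch below shows the idea on a hypothetical six-row log (`plays` and its values are invented); it is an illustration, not a result from the project data.

```python
import pandas as pd

# Hypothetical toy log: one row per play.
plays = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2', 'u3', 'u4', 'u4'],
    'city': ['Moscow', 'Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg'],
})

per_city = (
    plays
    .groupby('city')
    .agg(plays=('user_id', 'count'), users=('user_id', 'nunique'))
    .assign(plays_per_user=lambda df: df['plays'] / df['users'])
)
print(per_city)
```

In this toy example Moscow has more plays in total, yet fewer plays per user, which is exactly the situation the raw totals cannot rule out.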

Now, group the data by the day of the week and count the listens for Monday, Wednesday, and Friday. Keep in mind that the data only includes information about listens on these specific days.

In [13]:
(df_music
 .pivot_table(index='day', values='user_id', aggfunc='count')
 .reset_index()
 .set_axis(['day', 'plays'], axis='columns')
)



Unnamed: 0,day,plays
0,Friday,21840
1,Monday,21354
2,Wednesday,18059


Across both cities combined, users are least active on Wednesdays. However, the pattern may change when each city is examined individually.

In [124]:
(df_music
 .pivot_table(index='city', columns='day', values='user_id', aggfunc='count')
 .reset_index()
)

day,city,Friday,Monday,Wednesday
0,Moscow,15945,15740,11056
1,Saint-Petersburg,5895,5614,7003


**Conclusions**

The data reveals differences in user behavior:

1. In Moscow, the peak listening activity occurs on Monday and Friday, with a noticeable decline on Wednesday.
2. In St. Petersburg, by contrast, listening peaks on Wednesday, while Monday and Friday are lower and close to each other.

Thus, the data supports the first hypothesis.
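Because Moscow has far more plays overall, the two cities are easier to compare as shares of each city's own total. The sketch below rebuilds the city-by-day table from the counts shown above and normalizes each row; the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the city-by-day pivot table above.
counts = pd.DataFrame(
    {'Friday': [15945, 5895], 'Monday': [15740, 5614], 'Wednesday': [11056, 7003]},
    index=['Moscow', 'Saint-Petersburg'],
)

# Divide each row by its total so that every city sums to 1.
shares = counts.div(counts.sum(axis=1), axis=0).round(3)
print(shares)
```

Wednesday accounts for roughly 26% of Moscow plays but roughly 38% of St. Petersburg plays, which is the contrast described in the conclusions.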

In [14]:
df_music.head()

Unnamed: 0,user_id,track,artist,genre,city,time,day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday


### Music at the beginning and end of the week

According to the second hypothesis, on Monday mornings, different genres dominate in Moscow compared to St. Petersburg. Similarly, on Friday evenings, the dominant genres differ depending on the city.

In [29]:
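def get_day_part(time):
    # Zero-padded 'HH:MM:SS' strings compare in chronological order
    if '07:00:00' <= time <= '11:00:00':
        return 'morning'
    elif '17:00:00' <= time <= '23:00:00':
        return 'evening'
    else:
        # Catch-all label: also covers hours outside both windows, including night
        return 'afternoon'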

df_music = df_music.assign(day_part=df_music.time.apply(get_day_part))

display(df_music.head())

Unnamed: 0,user_id,track,artist,genre,city,time,day,day_part
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday,evening
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday,afternoon
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday,evening
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday,morning
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday,morning


In [70]:
df_temp = (
    df_music
    .loc[df_music.day.isin(['Monday', 'Wednesday'])]
    .pivot_table(index='genre', columns=['city', 'day'], values='user_id', aggfunc='count', margins=True)
    .reset_index()
    .sort_values('All', ascending=False)
    .head(15)
)

display(df_temp)

city,genre,Moscow,Moscow,Saint-Petersburg,Saint-Petersburg,All
day,Unnamed: 1_level_1,Monday,Wednesday,Monday,Wednesday,Unnamed: 6_level_1
259,All,15740.0,11056.0,5614.0,7003.0,39413
168,pop,2154.0,1572.0,732.0,951.0,5409
53,dance,1669.0,1164.0,589.0,704.0,4126
193,rock,1452.0,1016.0,577.0,709.0,3754
72,electronic,1432.0,924.0,523.0,657.0,3536
110,hip-hop,771.0,473.0,270.0,342.0,1856
44,classical,558.0,455.0,187.0,237.0,1437
5,alternative,499.0,356.0,199.0,249.0,1303
256,world,548.0,347.0,163.0,195.0,1253
200,ruspop,478.0,356.0,162.0,208.0,1204


In [68]:
from itertools import product

for city, day in product(['Moscow', 'Saint-Petersburg'], ['Monday', 'Friday']):
    df_temp = (
        df_music
        .loc[(df_music.day == day) & (df_music.city == city)]
        .pivot_table(index='genre', columns='city', values='user_id', aggfunc='count')
        .reset_index()
        .sort_values(city, ascending=False)
        .head(10)
    )

    fig = px.bar(
        df_temp.sort_values(city, ascending=True),
        x=city,
        y='genre',
        title=f"Top 10 genres in {city} on {day}",
        orientation='h',
        template='plotly_white',
        width=FIG_WIDTH, height=FIG_HEIGHT*0.8
    )

    fig.show()
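The second hypothesis refers to Monday mornings and Friday evenings, whereas the cells above aggregate whole days. One possible way to narrow the selection is sketched below. The helper `top_genres` is hypothetical, the toy rows are invented, and the time windows simply reuse the 07:00 to 11:00 and 17:00 to 23:00 boundaries from `get_day_part`.

```python
import pandas as pd

def top_genres(df, city, day, start, end, n=10):
    """Return the n most played genres for one city, day and time window."""
    mask = (
        (df['city'] == city)
        & (df['day'] == day)
        & (df['time'] >= start)   # string comparison is safe for zero-padded HH:MM:SS
        & (df['time'] <= end)
    )
    return df.loc[mask, 'genre'].value_counts().head(n)

# Hypothetical toy rows in the same layout as df_music.
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Saint-Petersburg'],
    'day': ['Monday', 'Monday', 'Friday', 'Monday'],
    'time': ['08:34:34', '09:10:00', '20:28:33', '08:37:09'],
    'genre': ['dance', 'dance', 'rock', 'folk'],
})

monday_morning_moscow = top_genres(toy, 'Moscow', 'Monday', '07:00:00', '11:00:00')
print(monday_morning_moscow)
```

Calling the same helper with `'Friday', '17:00:00', '23:00:00'` for each city would give the four rankings the hypothesis compares.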

**Conclusions**

1. **Overall genre popularity:** Across Monday and Wednesday in both cities combined, pop is the most popular genre with 5409 listens, followed by dance (4126) and rock (3754).  

2. **City trends:** Moscow shows higher listening activity overall, with a strong preference for pop, dance, and electronic genres.  
Saint-Petersburg has notable engagement with classical and jazz, reflecting a more diverse musical taste.  

3. **Day patterns:** In Moscow, Monday has higher listening activity than Wednesday (15740 vs. 11056 plays), while in Saint-Petersburg Wednesday is busier (7003 vs. 5614). Moscow leads in pop and dance, while Saint-Petersburg maintains consistent interest in classical and jazz.  

4. **Notable observations:** The genre "world" is more popular in Moscow than in Saint-Petersburg.  
The "unknown" genre ranks relatively high, indicating a significant portion of missing or unclassified data.  

5. **Summary:** Moscow favors mainstream genres with higher activity levels, while Saint-Petersburg leans towards niche genres like jazz and classical. Activity is stronger at the start of the week in Moscow but not in Saint-Petersburg, and data gaps (indicated by "unknown") may limit the reliability of the analysis.

## Research summary

You tested three hypotheses and found the following:

1. The day of the week affects user activity differently in Moscow and St. Petersburg. The first hypothesis was fully confirmed.

2. Musical preferences do not change significantly during the week, whether in Moscow or St. Petersburg. Small differences are noticeable at the start of the week, on Mondays: In Moscow, the "world" genre is popular. In St. Petersburg, jazz and classical music are preferred. Therefore, the second hypothesis was only partially confirmed. This result could have been different if not for the missing data.

3. The music tastes of users in Moscow and St. Petersburg are more similar than different. Contrary to expectations, St. Petersburg's genre preferences resemble those of Moscow. The third hypothesis was not confirmed. If differences in preferences exist, they are not noticeable for the majority of users.

**In practice, research involves statistical hypothesis testing.** Data from a single service does not always provide conclusions about all residents of a city. Statistical hypothesis testing can show the reliability of results based on available data. You will learn about hypothesis testing methods in upcoming topics.
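As a rough preview of what such a test could look like, the sketch below computes a chi-square statistic for independence of city and day from the counts in the city-by-day table. This is an illustration chosen here, not a method used in the project, and it treats every play as an independent observation, which is a strong assumption because one user can contribute many plays.

```python
import numpy as np

# Observed plays from the city-by-day table: rows are Moscow and
# Saint-Petersburg, columns are Friday, Monday and Wednesday.
observed = np.array([
    [15945, 15740, 11056],
    [5895, 5614, 7003],
])

# Expected counts if the day distribution were the same in both cities.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

chi2_stat = float(((observed - expected) ** 2 / expected).sum())
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# 5.991 is the chi-square critical value for 2 degrees of freedom at the 0.05 level.
CRITICAL_VALUE = 5.991
print(chi2_stat > CRITICAL_VALUE)
```

A statistic above the critical value would indicate that the day-of-week pattern differs between the cities beyond what chance alone would explain, subject to the independence caveat above.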