<a href="https://colab.research.google.com/github/kirillkobychev/HSE-ML-TEAM-4/blob/kirill-dev/Project_Music_genre_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Music genre prediction

**Описание задачи**

Вы сотрудник Отдела Data Science популярного музыкального стримингового сервиса. Сервис расширяет работу с новыми артистами и музыкантами, в связи с чем возникла задача -- правильно классифицировать новые музыкальные треки, чтобы улучшить работу рекомендательной системы. Ваши коллеги из отдела работы со звуком подготовили датасет, в котором собраны некоторые характеристики музыкальных произведений и их жанры. Ваша задача - разработать модель, позволяющую классифицировать музыкальные произведения по жанрам.

В ходе работы пройдите все основные этапы полноценного исследования:

*  загрузка и ознакомление с данными
*  предварительная обработка
*  полноценный разведочный анализ
*  разработка новых синтетических признаков
*  проверка на мультиколлинеарность
*  отбор финального набора обучающих признаков
*  выбор и обучение моделей
*  итоговая оценка качества предсказания лучшей модели
*  анализ важности ее признаков

**ВАЖНО**  
Необходимо реализовать решение с использованием технологии `pipeline` (из библиотеки `sklearn`)

**ОЖИДАЕМЫЙ РЕЗУЛЬТАТ**

* Оформленный репозиторий на GitHub (ноутбук с исследованием + код приложения)
* Развернутое web-приложение (с использованием библиотеки Streamlit)

## Участники проекта, репозиторий, приложение

Кобычев Кирилл, @hikoby

Иванов Егор, @Jaibesiondaide

Игорь Земенков, @iZemM

https://github.com/kirillkobychev/HSE-ML-TEAM-4

## Импорт библиотек, установка констант

In [2]:
%%capture
!pip install catboost -q
!pip install ydata-profiling
!pip install kaggle

In [3]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from catboost import CatBoostClassifier

from ydata_profiling import ProfileReport

In [4]:
TRAIN = "https://www.dropbox.com/scl/fi/5zy935lqpaqr9lat76ung/music_genre_train.csv?rlkey=ccovu9ml8pfi9whk1ba26zdda&dl=1"
TEST = "https://www.dropbox.com/scl/fi/o6mvsowpp9r3k2lejuegt/music_genre_test.csv?rlkey=ac14ydue0rzlh880jwj3ebum4&dl=1"

In [5]:
RANDOM_STATE = 42
TEST_SIZE = 0.25

## Загрузка и обзор данных

In [6]:
train = pd.read_csv(TRAIN)
test = pd.read_csv(TEST)

In [7]:
train.sample(5)

Unnamed: 0,instance_id,track_name,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,obtained_date,valence,music_genre
5342,52992.0,Elevator Music,0.194,0.642,370987.0,0.648,3.9e-05,B,0.365,-7.159,Minor,0.279,,4-Apr,0.715,Jazz
16366,25619.0,God's Not Dead (Like A Lion),0.000479,0.516,258453.0,0.905,0.0,B,0.198,-5.613,Major,0.0399,129.999,4-Apr,0.612,Rock
12427,72258.0,Get Me Home,0.121,0.78,229400.0,0.467,0.000134,A,0.314,-6.645,Minor,0.253,96.056,4-Apr,0.312,Rap
16614,51263.0,Rà-Àkõ-St,0.0225,0.686,391320.0,0.86,0.886,D,0.0716,-6.55,Major,0.057,117.983,4-Apr,0.695,Electronic
8415,79219.0,I Still Like Bologna,0.292,0.73,279827.0,0.465,0.000818,B,0.116,-10.954,Major,0.0272,133.003,4-Apr,0.712,Country


**Описание полей данных**

`instance_id` - уникальный идентификатор трека  
`track_name` - название трека  
`acousticness` - акустичность  
`danceability` - танцевальность  
`duration_ms` -продолжительность в милисекундах  
`energy` - энергичность  
`instrumentalness` - инструментальность  
**`key` - тональность**  
`liveness` - привлекательность  
`loudness` - громкость  
**`mode` - наклонение**  
`speechiness` - выразительность  
**`tempo` - темп**  
`obtained_date` - дата загрузки в сервис  
`valence` - привлекательность произведения для пользователей сервиса  
`music_genre` - музыкальный жанр

## Предварительная обработка данных

In [8]:
train.head()

Unnamed: 0,instance_id,track_name,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,obtained_date,valence,music_genre
0,25143.0,Highwayman,0.48,0.67,182653.0,0.351,0.0176,D,0.115,-16.842,Major,0.0463,101.384,4-Apr,0.45,Country
1,26091.0,Toes Across The Floor,0.243,0.452,187133.0,0.67,5.1e-05,A,0.108,-8.392,Minor,0.0352,113.071,4-Apr,0.539,Rock
2,87888.0,First Person on Earth,0.228,0.454,173448.0,0.804,0.0,E,0.181,-5.225,Minor,0.371,80.98,4-Apr,0.344,Alternative
3,77021.0,No Te Veo - Digital Single,0.0558,0.847,255987.0,0.873,3e-06,G#,0.325,-4.805,Minor,0.0804,116.007,4-Apr,0.966,Hip-Hop
4,20852.0,Chasing Shadows,0.227,0.742,195333.0,0.575,2e-06,C,0.176,-5.55,Major,0.0487,76.494,4-Apr,0.583,Alternative


In [9]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20394 entries, 0 to 20393
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   instance_id       20394 non-null  float64
 1   track_name        20394 non-null  object 
 2   acousticness      20394 non-null  float64
 3   danceability      20394 non-null  float64
 4   duration_ms       20394 non-null  float64
 5   energy            20394 non-null  float64
 6   instrumentalness  20394 non-null  float64
 7   key               19659 non-null  object 
 8   liveness          20394 non-null  float64
 9   loudness          20394 non-null  float64
 10  mode              19888 non-null  object 
 11  speechiness       20394 non-null  float64
 12  tempo             19952 non-null  float64
 13  obtained_date     20394 non-null  object 
 14  valence           20394 non-null  float64
 15  music_genre       20394 non-null  object 
dtypes: float64(11), object(5)
memory usage: 

In [10]:
print("Train unique")
[print(f"{i}: {train[i].unique()}") for i in train.columns if train[i].dtype == 'object']
print("\nTest unique")
[print(f"{i}: {test[i].unique()}") for i in test.columns if test[i].dtype == 'object']

Train unique
track_name: ['Highwayman' 'Toes Across The Floor' 'First Person on Earth' ...
 'Original Prankster' '4Peat' 'Trouble (feat. MC Spyder)']
key: ['D' 'A' 'E' 'G#' 'C' 'D#' 'A#' 'F' 'F#' nan 'G' 'C#' 'B']
mode: ['Major' 'Minor' nan]
obtained_date: ['4-Apr' '3-Apr' '5-Apr' '1-Apr']
music_genre: ['Country' 'Rock' 'Alternative' 'Hip-Hop' 'Blues' 'Jazz' 'Electronic'
 'Anime' 'Rap' 'Classical']

Test unique
track_name: ['Low Class Conspiracy' 'The Hunter' 'Hate Me Now' ... 'Bipolar'
 'Dead - NGHTMRE Remix'
 'A Night In Tunisia - Remastered 1998 / Rudy Van Gelder Edition']
key: ['A#' 'G#' 'A' 'B' 'D' 'F#' 'F' 'G' 'C' nan 'D#' 'C#' 'E']
mode: ['Minor' 'Major' nan]
obtained_date: ['4-Apr' '3-Apr' '5-Apr' '1-Apr']


[None, None, None, None]

In [11]:
print("Train null")
[print(f"{i} ({train[i].dtype}): {train[i].isnull().sum()}") for i in train.columns if train[i].isnull().sum() > 0]
print("Train data shape:", train.shape)
print("\nTest null")
[print(f"{i} ({train[i].dtype}): {test[i].isnull().sum()}") for i in test.columns if test[i].isnull().sum() > 0]
print("Test data shape: ", test.shape)

Train null
key (object): 735
mode (object): 506
tempo (float64): 442
Train data shape: (20394, 16)

Test null
key (object): 158
mode (object): 149
tempo (float64): 121
Test data shape:  (5099, 15)


In [12]:
statistics_train = train.select_dtypes(include=['float64', 'int64']).describe()
statistics_test = test.select_dtypes(include=['float64', 'int64']).describe()
statistics_obj_train = train.select_dtypes(include=['object']).describe()
statistics_obj_test = test.select_dtypes(include=['object']).describe()

print(f"Train:\n{statistics_train}")
print(f"\nTest:\n{statistics_test}")
print(f"\nTrain object:\n{statistics_obj_train}")
print(f"\nTest object:\n{statistics_obj_test}")

Train:
        instance_id  acousticness  danceability   duration_ms        energy  \
count  20394.000000  20394.000000  20394.000000  2.039400e+04  20394.000000   
mean   55973.846916      0.274783      0.561983  2.203754e+05      0.625276   
std    20695.792545      0.321643      0.171898  1.267283e+05      0.251238   
min    20011.000000      0.000000      0.060000 -1.000000e+00      0.001010   
25%    38157.250000      0.015200      0.451000  1.775170e+05      0.470000   
50%    56030.000000      0.120000      0.570000  2.195330e+05      0.666000   
75%    73912.750000      0.470000      0.683000  2.660000e+05      0.830000   
max    91758.000000      0.996000      0.978000  4.497994e+06      0.999000   

       instrumentalness      liveness      loudness   speechiness  \
count      20394.000000  20394.000000  20394.000000  20394.000000   
mean           0.159989      0.198540     -8.552998      0.091352   
std            0.306503      0.166742      5.499917      0.097735   
min  

In [13]:
key_list = ['D', 'A', 'E', 'G#', 'C', 'D#', 'A#', 'F', 'F#', 'G', 'C#', 'B']
filtered_train = train[~train['key'].isin(key_list)]

filtered_train.head()

Unnamed: 0,instance_id,track_name,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,obtained_date,valence,music_genre
26,87453.0,"Serenade in B flat, K.361 ""Gran partita"": 3. A...",0.991,0.154,-1.0,0.0384,0.902,,0.109,-26.909,Major,0.0405,68.199,4-Apr,0.0393,Classical
49,87796.0,Star67,0.747,0.478,294973.0,0.395,5e-06,,0.264,-7.917,Major,0.213,74.515,3-Apr,0.17,Hip-Hop
60,69398.0,Sleep On The Floor,0.249,0.389,211851.0,0.431,0.0,,0.13,-8.061,Major,0.0344,142.14,4-Apr,0.275,Rock
75,20134.0,Rogue,0.0145,0.58,201694.0,0.72,0.598,,0.281,-5.541,Major,0.0638,143.816,4-Apr,0.16,Electronic
107,69505.0,Party Song,0.156,0.563,191760.0,0.897,0.0,,0.352,-4.996,Major,0.213,157.803,4-Apr,0.779,Country


In [18]:
!pip install --quiet tls-client tqdm

In [19]:
import tls_client, difflib, time, random
import pandas as pd
from tqdm.auto import tqdm

df_train = train.copy(deep=True)
df_test  = test.copy(deep=True)

BASE_URL  = "https://api.tunebat.com/api/tracks/search"
KEY_LIST  = ['C','C#','D','D#','E','F','F#','G','G#','A','A#','B']
PAUSE_SEC = (0.4, 0.8)
session   = tls_client.Session(client_identifier="chrome_124",
                               random_tls_extension_order=True)

_FLAT2SHARP = {
    'AB': 'G#', 'BB': 'A#', 'CB': 'B',
    'DB': 'C#', 'EB': 'D#', 'FB': 'E',
    'GB': 'F#'
}

def normalize_key(raw):
    if not isinstance(raw, str):
        return None
    note = raw.split()[0].replace('♯', '#').replace('♭', 'b').upper().strip()
    if note.endswith('B'):                       # Eb → D#, Bb → A# …
        note = _FLAT2SHARP.get(note, None)
    return note if note in KEY_LIST else None

def camelot_to_mode(c):
    return "Minor" if c and c[-1] == 'A' else ("Major" if c else None)

def tunebat_search(query, score_threshold=0.80):
    time.sleep(random.uniform(*PAUSE_SEC))
    r = session.get(BASE_URL, params={"term": query})
    if r.status_code == 429:
        time.sleep(int(r.headers.get("Retry-After", "5")) + 1)
        r = session.get(BASE_URL, params={"term": query})
    if r.status_code != 200:
        return None
    items = r.json().get("data", {}).get("items", [])
    if not items:
        return None
    best = max(items, key=lambda d:
               difflib.SequenceMatcher(None, d["n"].lower(), query.lower()).ratio())
    score = difflib.SequenceMatcher(None, best["n"].lower(), query.lower()).ratio()
    return best if score >= score_threshold else None

def fill_missing(df: pd.DataFrame, name: str):
    before = {c: df[c].isna().sum() for c in ["key", "mode", "tempo"]}
    target_idx = df[df["key"].isna()].index        # ТОЛЬКО пустые key

    for idx in tqdm(target_idx, desc=f"Обработка {name}"):
        track = df.at[idx, "track_name"]
        info  = tunebat_search(track)
        if not info:
            continue

        key_raw = info.get("k")
        camelot = info.get("c")
        bpm     = info.get("b")

        key_new  = normalize_key(key_raw)
        mode_new = camelot_to_mode(camelot)

        changed_key = changed_mode = changed_tempo = False

        if pd.isna(df.at[idx, "key"]) and key_new:
            df.at[idx, "key"] = key_new
            changed_key = True

        if pd.isna(df.at[idx, "mode"]) and mode_new:
            df.at[idx, "mode"] = mode_new
            changed_mode = True

        if pd.isna(df.at[idx, "tempo"]) and bpm:
            try:
                df.at[idx, "tempo"] = float(bpm)
                changed_tempo = True
            except (ValueError, TypeError):
                pass

        tqdm.write(
            f"{track} | raw→ key='{key_raw}', camelot={camelot}, bpm={bpm} "
            f"| saved→ key={'✔'+key_new if changed_key else '—'}; "
            f"mode={'✔'+mode_new if changed_mode else '—'}; "
            f"tempo={'✔'+str(bpm) if changed_tempo else '—'}"
        )

    after = {c: df[c].isna().sum() for c in ["key", "mode", "tempo"]}
    print(f"\n{name}: nulls BEFORE → AFTER")
    for c in ["key", "mode", "tempo"]:
        print(f"{c:<5}: {before[c]} → {after[c]}")
    print(f"{name} shape: {df.shape}\n")

fill_missing(df_train, "Train")
fill_missing(df_test,  "Test")

print("df_train уникальные key:", df_train['key'].unique())
print("df_train уникальные mode:", df_train['mode'].unique())


Обработка Train:   0%|          | 0/735 [00:00<?, ?it/s]

Serenade in B flat, K.361 "Gran partita": 3. Adagio | raw→ key='E♭ Major', camelot=5B, bpm=126.0 | saved→ key=✔D#; mode=—; tempo=—
Star67 | raw→ key='E♭ Major', camelot=5B, bpm=92.0 | saved→ key=✔D#; mode=—; tempo=—
Sleep On The Floor | raw→ key='G Major', camelot=9B, bpm=142.0 | saved→ key=✔G; mode=—; tempo=—
Rogue | raw→ key='B Minor', camelot=10A, bpm=140.0 | saved→ key=—; mode=—; tempo=—
Party Song | raw→ key='G Major', camelot=9B, bpm=158.0 | saved→ key=✔G; mode=—; tempo=—
Kiss Me | raw→ key='E♭ Major', camelot=5B, bpm=100.0 | saved→ key=✔D#; mode=—; tempo=—
The Trouble With Us | raw→ key='D Major', camelot=10B, bpm=121.0 | saved→ key=✔D; mode=—; tempo=—
Pull Up Hop Out | raw→ key='C# Major', camelot=3B, bpm=125.0 | saved→ key=✔C#; mode=—; tempo=—
My, My, My | raw→ key='C Major', camelot=8B, bpm=74.0 | saved→ key=✔C; mode=—; tempo=—
Wild Love - Acoustic | raw→ key='B Minor', camelot=10A, bpm=140.0 | saved→ key=—; mode=—; tempo=—
Florida Boy | raw→ key='B♭ Major', camelot=6B, bpm=1

Обработка Test:   0%|          | 0/158 [00:00<?, ?it/s]

Test Me | raw→ key='C Minor', camelot=5A, bpm=174.0 | saved→ key=✔C; mode=—; tempo=—
There's A Small Hotel | raw→ key='A Minor', camelot=8A, bpm=105.0 | saved→ key=✔A; mode=—; tempo=—
Join Me | raw→ key='A♭ Minor', camelot=1A, bpm=114.0 | saved→ key=✔G#; mode=—; tempo=—
10 Feet | raw→ key='F# Minor', camelot=11A, bpm=104.0 | saved→ key=✔F#; mode=—; tempo=—
Ghost | raw→ key='D Major', camelot=10B, bpm=154.0 | saved→ key=✔D; mode=—; tempo=—
YNO | raw→ key='G Major', camelot=9B, bpm=105.0 | saved→ key=✔G; mode=—; tempo=—
Christmas Chill | raw→ key='C Major', camelot=8B, bpm=77.0 | saved→ key=✔C; mode=—; tempo=—
bring back the colors | raw→ key='A♭ Major', camelot=4B, bpm=166.0 | saved→ key=✔G#; mode=✔Major; tempo=—
21 | raw→ key='G Major', camelot=9B, bpm=176.0 | saved→ key=✔G; mode=—; tempo=—
Symphony No. 9 in D Minor, Op. 125 - "Choral": 4b. Allegro assai - Live | raw→ key='D Major', camelot=10B, bpm=137.0 | saved→ key=✔D; mode=✔Major; tempo=—
Down with the Sickness | raw→ key='E♭ Minor

In [20]:
df_train.to_csv('df_train.csv')
df_test.to_csv('df_test.csv')

## Разведочный анализ

In [24]:
%%capture
profile_train = ProfileReport(train, title="Profiling Report")
profile_train.to_file("train.html")

profile_test = ProfileReport(test, title="Profiling Report")
profile_test.to_file("test.html")

In [26]:
pivot_table = pd.crosstab(train['music_genre'], train['key'], margins=True, margins_name='Total')

# Выводим сводную таблицу
print("Сводная таблица: Распределение тональностей (key) по жанрам (music_genre):")
print(pivot_table)

Сводная таблица: Распределение тональностей (key) по жанрам (music_genre):
key             A    A#     B     C    C#     D   D#     E     F    F#     G  \
music_genre                                                                    
Alternative   241   139   225   280   236   267   68   199   200   193   288   
Anime         184    92   122   209   182   202   67   146   166   121   225   
Blues         375   125   187   360   169   366   65   216   245    98   419   
Classical     123    74    74   158   110   177   79   117   117    65   165   
Country       244    97   144   241   154   241   80   179   136   120   281   
Electronic    219   190   228   224   361   203   49   179   215   185   265   
Hip-Hop        83   102   116    80   224    86   28    55    90    72    78   
Jazz          115   116    64   136    93    96   47   102   138    63   136   
Rap           172   187   209   200   421   202   58   132   160   166   216   
Rock          257   102   153   261   175   2

## Работа с признаками

## Выбор и обучение моделей

## Оценка качества

## Анализ важности признаков модели