<a href="https://colab.research.google.com/github/kirillkobychev/HSE-ML-TEAM-4/blob/kirill-dev/Project_Music_genre_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Music genre prediction

**Task description**

You work in the Data Science department of a popular music streaming service. The service is expanding its work with new artists and musicians, which creates the need to classify new tracks correctly in order to improve the recommender system. Your colleagues from the audio department have prepared a dataset with a number of track characteristics and the corresponding genres. Your task is to develop a model that classifies tracks by genre.

Work through all the main stages of a full study:

*  loading and getting familiar with the data
*  preprocessing
*  thorough exploratory analysis
*  engineering new synthetic features
*  checking for multicollinearity
*  selecting the final set of training features
*  choosing and training models
*  final evaluation of the best model's prediction quality
*  analysis of its feature importance

**IMPORTANT**  
The solution must be implemented with the `pipeline` technology (from the `sklearn` library).

**EXPECTED RESULT**

* A well-organized GitHub repository (research notebook + application code)
* A deployed web application (built with the Streamlit library)

## Project participants, repository, application

Kirill Kobychev, @hikoby

Egor Ivanov, @Jaibesiondaide

Igor Zemenkov, @iZemM

https://github.com/kirillkobychev/HSE-ML-TEAM-4

## Importing libraries and setting constants

In [13]:
%%capture
!pip install catboost -q
!pip install ydata-profiling
!pip install kaggle
!pip install --quiet tls-client tqdm

In [14]:
# Standard library
import difflib
import random
import re
import time

# Third-party
import numpy as np
import pandas as pd
import tls_client
from catboost import CatBoostClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from tqdm.auto import tqdm
from ydata_profiling import ProfileReport

In [15]:
TRAIN = "https://www.dropbox.com/scl/fi/5zy935lqpaqr9lat76ung/music_genre_train.csv?rlkey=ccovu9ml8pfi9whk1ba26zdda&dl=1"
TEST = "https://www.dropbox.com/scl/fi/o6mvsowpp9r3k2lejuegt/music_genre_test.csv?rlkey=ac14ydue0rzlh880jwj3ebum4&dl=1"

In [16]:
RANDOM_STATE = 42
TEST_SIZE = 0.25

## Loading and exploring the data

In [None]:
train = pd.read_csv(TRAIN)
test = pd.read_csv(TEST)

In [None]:
train.sample(5)

Unnamed: 0,instance_id,track_name,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,obtained_date,valence,music_genre
13471,33354.0,I Saw Her Standing There - Remastered 2009,0.27,0.491,173947.0,0.801,0.0,E,0.0665,-9.835,Major,0.0361,160.109,4-Apr,0.971,Rock
15405,40512.0,Beg For It,0.164,0.479,224280.0,0.433,0.0,C,0.0732,-6.878,Minor,0.0455,58.396,4-Apr,0.324,Rap
739,43657.0,Paris,0.511,0.789,293067.0,0.668,0.00217,G#,0.0659,-6.568,Major,0.0357,117.007,4-Apr,0.832,Rock
7262,52073.0,Love Train,0.0613,0.455,181787.0,0.872,0.0238,G,0.134,-3.915,Major,0.0462,91.272,4-Apr,0.699,Blues
11663,25295.0,Nerf This,0.00326,0.685,207043.0,0.761,0.383,C,0.099,-2.584,Major,0.0616,75.03,4-Apr,0.192,Electronic


**Data field descriptions**

`instance_id` - unique track identifier  
`track_name` - track title  
`acousticness` - acousticness  
`danceability` - danceability  
`duration_ms` - duration in milliseconds  
`energy` - energy  
`instrumentalness` - instrumentalness  
**`key` - musical key**  
`liveness` - liveness (likelihood that the track was recorded live)  
`loudness` - loudness  
**`mode` - mode (major or minor)**  
`speechiness` - speechiness (presence of spoken words)  
**`tempo` - tempo**  
`obtained_date` - date the track was added to the service  
`valence` - valence (how musically positive the track sounds)  
`music_genre` - music genre

## Data preprocessing

In [None]:
train.head()

Unnamed: 0,instance_id,track_name,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,obtained_date,valence,music_genre
0,25143.0,Highwayman,0.48,0.67,182653.0,0.351,0.0176,D,0.115,-16.842,Major,0.0463,101.384,4-Apr,0.45,Country
1,26091.0,Toes Across The Floor,0.243,0.452,187133.0,0.67,5.1e-05,A,0.108,-8.392,Minor,0.0352,113.071,4-Apr,0.539,Rock
2,87888.0,First Person on Earth,0.228,0.454,173448.0,0.804,0.0,E,0.181,-5.225,Minor,0.371,80.98,4-Apr,0.344,Alternative
3,77021.0,No Te Veo - Digital Single,0.0558,0.847,255987.0,0.873,3e-06,G#,0.325,-4.805,Minor,0.0804,116.007,4-Apr,0.966,Hip-Hop
4,20852.0,Chasing Shadows,0.227,0.742,195333.0,0.575,2e-06,C,0.176,-5.55,Major,0.0487,76.494,4-Apr,0.583,Alternative


In [None]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20394 entries, 0 to 20393
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   instance_id       20394 non-null  float64
 1   track_name        20394 non-null  object 
 2   acousticness      20394 non-null  float64
 3   danceability      20394 non-null  float64
 4   duration_ms       20394 non-null  float64
 5   energy            20394 non-null  float64
 6   instrumentalness  20394 non-null  float64
 7   key               19659 non-null  object 
 8   liveness          20394 non-null  float64
 9   loudness          20394 non-null  float64
 10  mode              19888 non-null  object 
 11  speechiness       20394 non-null  float64
 12  tempo             19952 non-null  float64
 13  obtained_date     20394 non-null  object 
 14  valence           20394 non-null  float64
 15  music_genre       20394 non-null  object 
dtypes: float64(11), object(5)
memory usage: 

In [None]:
# Print the unique values of every text (object) column in both sets
print("Train unique")
for col in train.columns:
    if train[col].dtype == 'object':
        print(f"{col}: {train[col].unique()}")
print("\nTest unique")
for col in test.columns:
    if test[col].dtype == 'object':
        print(f"{col}: {test[col].unique()}")

Train unique
track_name: ['Highwayman' 'Toes Across The Floor' 'First Person on Earth' ...
 'Original Prankster' '4Peat' 'Trouble (feat. MC Spyder)']
key: ['D' 'A' 'E' 'G#' 'C' 'D#' 'A#' 'F' 'F#' nan 'G' 'C#' 'B']
mode: ['Major' 'Minor' nan]
obtained_date: ['4-Apr' '3-Apr' '5-Apr' '1-Apr']
music_genre: ['Country' 'Rock' 'Alternative' 'Hip-Hop' 'Blues' 'Jazz' 'Electronic'
 'Anime' 'Rap' 'Classical']

Test unique
track_name: ['Low Class Conspiracy' 'The Hunter' 'Hate Me Now' ... 'Bipolar'
 'Dead - NGHTMRE Remix'
 'A Night In Tunisia - Remastered 1998 / Rudy Van Gelder Edition']
key: ['A#' 'G#' 'A' 'B' 'D' 'F#' 'F' 'G' 'C' nan 'D#' 'C#' 'E']
mode: ['Minor' 'Major' nan]
obtained_date: ['4-Apr' '3-Apr' '5-Apr' '1-Apr']

In [None]:
# Count missing values per column; each set reports its own dtypes
print("Train null")
for col in train.columns:
    if train[col].isnull().sum() > 0:
        print(f"{col} ({train[col].dtype}): {train[col].isnull().sum()}")
print("Train data shape:", train.shape)
print("\nTest null")
for col in test.columns:
    if test[col].isnull().sum() > 0:
        print(f"{col} ({test[col].dtype}): {test[col].isnull().sum()}")
print("Test data shape: ", test.shape)

Train null
key (object): 735
mode (object): 506
tempo (float64): 442
Train data shape: (20394, 16)

Test null
key (object): 158
mode (object): 149
tempo (float64): 121
Test data shape:  (5099, 15)
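
To put these counts in proportion, the short sketch below converts them into percentages of rows. The numbers are copied from the output above; the helper name `missing_share` is only illustrative.

```python
# Missing-value counts and row totals copied from the output above
missing = {
    "train": {"rows": 20394, "key": 735, "mode": 506, "tempo": 442},
    "test": {"rows": 5099, "key": 158, "mode": 149, "tempo": 121},
}

def missing_share(counts):
    """Return the percentage of rows with a missing value, per column."""
    rows = counts["rows"]
    return {col: round(100 * n / rows, 2) for col, n in counts.items() if col != "rows"}

shares = {name: missing_share(counts) for name, counts in missing.items()}
for name, share in shares.items():
    print(name, share)
```

In both sets each of the three columns is missing in fewer than 4% of rows, so imputing the gaps and dropping the affected rows are both plausible options.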


In [None]:
statistics_train = train.select_dtypes(include=['float64', 'int64']).describe()
statistics_test = test.select_dtypes(include=['float64', 'int64']).describe()
statistics_obj_train = train.select_dtypes(include=['object']).describe()
statistics_obj_test = test.select_dtypes(include=['object']).describe()

print(f"Train:\n{statistics_train}")
print(f"\nTest:\n{statistics_test}")
print(f"\nTrain object:\n{statistics_obj_train}")
print(f"\nTest object:\n{statistics_obj_test}")

Train:
        instance_id  acousticness  danceability   duration_ms        energy  \
count  20394.000000  20394.000000  20394.000000  2.039400e+04  20394.000000   
mean   55973.846916      0.274783      0.561983  2.203754e+05      0.625276   
std    20695.792545      0.321643      0.171898  1.267283e+05      0.251238   
min    20011.000000      0.000000      0.060000 -1.000000e+00      0.001010   
25%    38157.250000      0.015200      0.451000  1.775170e+05      0.470000   
50%    56030.000000      0.120000      0.570000  2.195330e+05      0.666000   
75%    73912.750000      0.470000      0.683000  2.660000e+05      0.830000   
max    91758.000000      0.996000      0.978000  4.497994e+06      0.999000   

       instrumentalness      liveness      loudness   speechiness  \
count      20394.000000  20394.000000  20394.000000  20394.000000   
mean           0.159989      0.198540     -8.552998      0.091352   
std            0.306503      0.166742      5.499917      0.097735   
min  
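
The train summary above reports a minimum `duration_ms` of -1, which cannot be a real track length and most likely marks a missing value. One possible treatment, sketched here on made-up rows, is to turn such values into `NaN` so that later imputation handles them like the other gaps. Treating every non-positive duration as missing is an assumption, not something the dataset description states.

```python
import numpy as np
import pandas as pd

# Made-up rows; -1.0 plays the role of the suspected "missing" sentinel
toy = pd.DataFrame({"duration_ms": [182653.0, -1.0, 173448.0, -1.0, 255987.0]})

# How many rows carry a non-positive duration
n_sentinel = int((toy["duration_ms"] <= 0).sum())

# Keep positive durations, replace everything else with NaN
cleaned = toy.assign(
    duration_ms=toy["duration_ms"].where(toy["duration_ms"] > 0, np.nan)
)

print("sentinel rows:", n_sentinel)
print("median of valid durations:", cleaned["duration_ms"].median())
```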

In [None]:
df_train = train.copy(deep=True)
df_test  = test.copy(deep=True)

BASE_URL   = "https://api.tunebat.com/api/tracks/search"
KEY_LIST   = ['C','C#','D','D#','E','F','F#','G','G#','A','A#','B']
PAUSE_SEC  = (0.4, 0.8)
session    = tls_client.Session(client_identifier="chrome_124",
                                random_tls_extension_order=True)

FLAT2SHARP = {'AB':'G#','BB':'A#','CB':'B',
              'DB':'C#','EB':'D#','FB':'E','GB':'F#'}

def normalize_key(raw: str | None):
    if not isinstance(raw, str):
        return None

    m = re.match(r'^\s*([A-Ga-g])([#♯b♭]?)(?:\s|$)', raw)
    if not m:
        return None

    letter, accidental = m.groups()
    note = (letter.upper() +
            {'#': '#', '♯': '#', 'b': 'B', '♭': 'B'}.get(accidental, ''))

    # flat → sharp
    if len(note) == 2 and note[1] == 'B':
        note = FLAT2SHARP.get(note, None)

    return note if note in KEY_LIST else None

def camelot_to_mode(cam):
    return ("Minor" if cam and cam[-1]=='A' else "Major") if cam else None

def tunebat_search(query, score_threshold=0.80):
    time.sleep(random.uniform(*PAUSE_SEC))
    r = session.get(BASE_URL, params={"term": query})

    if r.status_code == 429:
        time.sleep(int(r.headers.get("Retry-After", "5")) + 1)
        r = session.get(BASE_URL, params={"term": query})

    if r.status_code != 200:
        return None

    items = r.json().get("data", {}).get("items", [])
    if not items:
        return None

    best = max(items, key=lambda d:
               difflib.SequenceMatcher(None, d["n"].lower(), query.lower()).ratio())
    if difflib.SequenceMatcher(None, best["n"].lower(), query.lower()).ratio() < score_threshold:
        return None
    return best

cache = {}
def get_info(track):
    if track not in cache:
        cache[track] = tunebat_search(track)
    return cache[track]

def fill_missing(df: pd.DataFrame, name: str):
    """Fill missing key / mode / tempo in place using Tunebat search results."""
    print(f"\n{name}: missing BEFORE", df[['key','mode','tempo']].isna().sum().to_dict())

    for col in ['key', 'mode', 'tempo']:
        # Rows already filled during an earlier pass are skipped here
        for idx in tqdm(df[df[col].isna()].index, desc=f"{name}: filling {col}"):
            track = str(df.at[idx, 'track_name']).strip()
            info  = get_info(track)
            if not info:
                continue

            key_new  = normalize_key(info.get('k'))
            mode_new = camelot_to_mode(info.get('c'))
            bpm      = info.get('b')

            # Fill only the cells that are still empty
            if pd.isna(df.at[idx,'key'])  and key_new:
                df.at[idx,'key'] = key_new
            if pd.isna(df.at[idx,'mode']) and mode_new:
                df.at[idx,'mode'] = mode_new
            if pd.isna(df.at[idx,'tempo']) and bpm:
                try:
                    df.at[idx,'tempo'] = float(bpm)
                except (ValueError, TypeError):
                    pass

            tqdm.write(f"{track} → key:{key_new} mode:{mode_new} tempo:{bpm}")

    print(f"{name}: missing AFTER", df[['key','mode','tempo']].isna().sum().to_dict())
    print(f"{name} shape: {df.shape}\n")

fill_missing(df_train, "Train")
fill_missing(df_test,  "Test")

df_train_filled = df_train.copy()
df_test_filled  = df_test.copy()
df_train_filled.to_csv('df_train_filled.csv', index=False)
df_test_filled.to_csv('df_test_filled.csv',  index=False)

print("\nnew filled df: df_train_filled.csv, df_test_filled.csv")
print("unique key :", sorted(df_train_filled['key'].dropna().unique()))
print("unique mode:", sorted(df_train_filled['mode'].dropna().unique()))


Train: missing BEFORE {'key': 735, 'mode': 506, 'tempo': 442}


Train: filling key:   0%|          | 0/735 [00:00<?, ?it/s]

Serenade in B flat, K.361 "Gran partita": 3. Adagio → key:D# mode:Major tempo:126.0
Star67 → key:D# mode:Major tempo:92.0
Sleep On The Floor → key:G mode:Major tempo:142.0
Rogue → key:D mode:Major tempo:155.0
Party Song → key:G mode:Major tempo:158.0
Kiss Me → key:D# mode:Major tempo:100.0
The Trouble With Us → key:D mode:Major tempo:121.0
Pull Up Hop Out → key:C# mode:Major tempo:125.0
My, My, My → key:C mode:Major tempo:74.0
Wild Love - Acoustic → key:B mode:Minor tempo:140.0
Florida Boy → key:A# mode:Major tempo:155.0
You Can Do Magic → key:E mode:Minor tempo:130.0
Word Around Town (feat. Rich Homie Quan) → key:C# mode:Major tempo:141.0
Crossing Over → key:B mode:Minor tempo:196.0
West Texas Rain → key:C mode:Major tempo:147.0
I'm in a Dancing Mood → key:G mode:Minor tempo:140.0
Sound the Alarm → key:E mode:Major tempo:102.0
M.... She Wrote → key:A mode:Minor tempo:160.0
Always A Friend → key:C mode:Major tempo:127.0
Build Your Kingdom Here → key:D mode:Major tempo:138.0
Big Bad Joh

KeyboardInterrupt: 
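
The candidate selection inside `tunebat_search` can be tried offline, without calling the API. The sketch below mirrors that logic with `difflib.SequenceMatcher`: score every candidate title against the query, keep the best one, and reject it if the similarity falls below the threshold. The function name `best_match` and the candidate titles are made up for illustration.

```python
import difflib

def best_match(query, names, score_threshold=0.80):
    """Return the title most similar to `query`, or None if it is too dissimilar."""
    def score(name):
        # Case-insensitive similarity in [0, 1]
        return difflib.SequenceMatcher(None, name.lower(), query.lower()).ratio()

    if not names:
        return None
    best = max(names, key=score)
    return best if score(best) >= score_threshold else None

# Hypothetical search results
candidates = ["Stay Down", "Stay Down (Live)", "Lay Down"]

exact = best_match("Stay Down", candidates)   # an identical title is available
miss = best_match("Highwayman", candidates)   # nothing similar enough

print(exact, miss)
```

Because only titles are compared, two different tracks with the same name cannot be told apart this way, so values filled in from such a match may belong to another artist's recording.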

In [None]:
df_train_filled[df_train_filled['key'].isna()]

Unnamed: 0,instance_id,track_name,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,obtained_date,valence,music_genre
75,20134.0,Rogue,0.014500,0.580,201694.0,0.7200,0.598000,,0.2810,-5.541,Major,0.0638,143.816,4-Apr,0.160,Electronic
267,86930.0,Wild Love - Acoustic,0.856000,0.507,189147.0,0.4080,0.000000,,0.1020,-7.652,Minor,0.0396,140.038,4-Apr,0.530,Rock
392,51638.0,Crossing Over,0.000066,0.378,174160.0,0.8980,0.000082,,0.2800,-3.855,Minor,0.0639,195.747,4-Apr,0.128,Alternative
1123,44101.0,"Violin Concerto No. 3 in G Major, K. 216: II. ...",0.958000,0.180,500573.0,0.0529,0.203000,,0.0808,-25.866,Major,0.0432,92.457,4-Apr,0.109,Classical
1344,79814.0,Wheelz of Steel,0.053100,0.886,243267.0,0.6410,0.000542,,0.1130,-7.263,Minor,0.2630,111.649,4-Apr,0.489,Rap
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19216,30304.0,Long Train Runnin' - with Toby Keith and Huey ...,0.029100,0.621,210973.0,0.9770,0.037200,,0.2780,-3.820,Minor,0.0789,115.981,4-Apr,0.459,Blues
19904,65802.0,World Of The Dead (Princess Mononoke),0.348000,0.298,267000.0,0.6630,0.914000,,0.7090,-12.276,Major,0.1330,136.895,3-Apr,0.178,Anime
20045,64251.0,Ghost Train Haze,0.549000,0.566,113468.0,0.1250,0.543000,,0.1250,-9.770,Minor,0.1360,128.428,5-Apr,0.524,Jazz
20166,37706.0,Stay Down,0.617000,0.551,207326.0,0.3010,0.000011,,0.1360,-13.065,Major,0.1110,84.695,4-Apr,0.488,Alternative


In [None]:
import json
import textwrap

track = "Stay Down"          # ← put the track title to look up here

info = tunebat_search(track)

if not info:
    print("Nothing found 🤷‍♂️")
else:
    print("RAW JSON (truncated):")
    print(textwrap.shorten(json.dumps(info, ensure_ascii=False), width=400, placeholder=" …"))

    # Short field names used by the API: k = key, c = Camelot code, b = BPM
    k_raw   = info.get("k")
    cam     = info.get("c")
    bpm     = info.get("b")

    k_norm = normalize_key(k_raw)
    mode   = camelot_to_mode(cam)

    print("\nFormatted:")
    print(f"Track : {info['as'][0]} — «{info['n']}»")
    print(f"Key   : {k_raw}  →  {k_norm}")
    print(f"Mode  : {mode}")
    print(f"Camelot: {cam}")
    print(f"BPM   : {bpm}")

RAW JSON (truncated):
{"id": "4mwiRPRAUSSFD6lJ86m98B", "n": "Stay Down", "as": ["Brent Faiyaz"], "l": null, "an": "Sonder Son", "rd": "2017-10-13", "is": false, "ie": false, "d": 207325, "p": 68, "k": "B Major", "kv": 11, "c": "1B", "b": 85.0, "ac": 0.617, "da": 0.551, "e": 0.301, "h": 0.488, "i": 1.1e-05, "li": 0.136, "lo": -13.065, "s": 0.111, "ci": [{"iu": …

Formatted:
Track : Brent Faiyaz — «Stay Down»
Key   : B Major  →  None
Mode  : Major
Camelot: 1B
BPM   : 85.0
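
The output above shows `B Major` normalized to `None`, although the `normalize_key` definition shown earlier in the notebook appears to cover this input. A standalone copy of that definition can be run on a few strings to see what it returns. The sample strings are made up, and the `Optional` type hints are substituted here only for compatibility with older Python versions.

```python
import re
from typing import Optional

KEY_LIST = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
FLAT2SHARP = {'AB': 'G#', 'BB': 'A#', 'CB': 'B',
              'DB': 'C#', 'EB': 'D#', 'FB': 'E', 'GB': 'F#'}

def normalize_key(raw: Optional[str]) -> Optional[str]:
    """Map strings such as 'Bb Minor' or 'F# Major' to one of the 12 sharp-based keys."""
    if not isinstance(raw, str):
        return None

    # Note letter, optional accidental, then whitespace or end of string
    m = re.match(r'^\s*([A-Ga-g])([#♯b♭]?)(?:\s|$)', raw)
    if not m:
        return None

    letter, accidental = m.groups()
    note = letter.upper() + {'#': '#', '♯': '#', 'b': 'B', '♭': 'B'}.get(accidental, '')

    # flat → sharp
    if len(note) == 2 and note[1] == 'B':
        note = FLAT2SHARP.get(note)

    return note if note in KEY_LIST else None

samples = ["B Major", "Bb Minor", "F# Major", "E♭ Minor", "H Major", None]
normalized = {s: normalize_key(s) for s in samples}
for raw, key in normalized.items():
    print(f"{raw!r} -> {key!r}")
```

If the definition behaves as in this sketch, the `None` in the output above may have come from an older version of the function that was still loaded in the session, which is worth re-checking after a kernel restart.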


## Exploratory data analysis

In [None]:
%%capture
profile_train = ProfileReport(train, title="Profiling Report")
profile_train.to_file("train.html")

profile_test = ProfileReport(test, title="Profiling Report")
profile_test.to_file("test.html")

In [None]:
pivot_table = pd.crosstab(train['music_genre'], train['key'], margins=True, margins_name='Total')

# Print the pivot table
print("Pivot table: distribution of keys (key) by genre (music_genre):")
print(pivot_table)

Pivot table: distribution of keys (key) by genre (music_genre):
key             A    A#     B     C    C#     D   D#     E     F    F#     G  \
music_genre                                                                    
Alternative   241   139   225   280   236   267   68   199   200   193   288   
Anime         184    92   122   209   182   202   67   146   166   121   225   
Blues         375   125   187   360   169   366   65   216   245    98   419   
Classical     123    74    74   158   110   177   79   117   117    65   165   
Country       244    97   144   241   154   241   80   179   136   120   281   
Electronic    219   190   228   224   361   203   49   179   215   185   265   
Hip-Hop        83   102   116    80   224    86   28    55    90    72    78   
Jazz          115   116    64   136    93    96   47   102   138    63   136   
Rap           172   187   209   200   421   202   58   132   160   166   216   
Rock          257   102   153   261   175   2
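
Raw counts like these depend on how many tracks each genre has, so row-normalized shares are easier to compare across genres. Among the columns shown, for example, C# has the highest count for Electronic, Hip-Hop and Rap. The sketch below shows the `normalize="index"` option of `pd.crosstab` on a small made-up frame; on the real data the same call would take `train['music_genre']` and `train['key']`.

```python
import pandas as pd

# Made-up rows, only to illustrate the call
toy = pd.DataFrame({
    "music_genre": ["Rock", "Rock", "Rock", "Rock", "Rap", "Rap"],
    "key":         ["C",    "C",    "G",    "D",    "C#",  "C#"],
})

# Each row sums to 1: the share of every key within a genre
key_shares = pd.crosstab(toy["music_genre"], toy["key"], normalize="index")
print(key_shares.round(2))
```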

## Feature engineering

## Model selection and training
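
The task statement requires the solution to be built with `sklearn` pipelines. As a rough illustration of that requirement, the sketch below combines imputation, scaling, one-hot encoding and a classifier in a single `Pipeline`. The column subsets, the imputation strategies and the `RandomForestClassifier` are assumptions chosen for brevity (the notebook imports `CatBoostClassifier`, which could take the classifier's place), and the toy rows are made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical feature subsets; the final feature set may differ
NUMERIC = ["danceability", "energy", "tempo"]
CATEGORICAL = ["key", "mode"]

def build_pipeline(random_state: int = 42) -> Pipeline:
    """Preprocessing and classifier in one estimator."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, NUMERIC),
        ("cat", categorical, CATEGORICAL),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("model", RandomForestClassifier(n_estimators=50, random_state=random_state)),
    ])

# Made-up rows with gaps in tempo, key and mode
toy = pd.DataFrame({
    "danceability": [0.67, 0.45, 0.85, 0.18, 0.74, 0.30, 0.49, 0.20],
    "energy":       [0.35, 0.67, 0.87, 0.05, 0.58, 0.10, 0.80, 0.07],
    "tempo":        [101.4, np.nan, 116.0, 92.5, 76.5, np.nan, 160.1, 88.0],
    "key":          ["D", "A", np.nan, "G", "C", "D", "E", "G"],
    "mode":         ["Major", "Minor", "Minor", np.nan, "Major", "Major", "Major", "Minor"],
    "music_genre":  ["Country", "Rock", "Hip-Hop", "Classical",
                     "Rock", "Classical", "Rock", "Classical"],
})

features = toy[NUMERIC + CATEGORICAL]
model = build_pipeline()
model.fit(features, toy["music_genre"])
predictions = model.predict(features)
print(predictions)
```

Keeping the imputers inside the pipeline means they are fitted on the training split only, which avoids leaking statistics from validation data.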

## Quality evaluation

## Model feature importance analysis