<a href="https://colab.research.google.com/github/kirillkobychev/HSE-ML-TEAM-4/blob/kirill-dev/Project_Music_genre_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Music genre prediction

**Task description**

You work in the Data Science department of a popular music streaming service. The service is expanding its work with new artists and musicians, which creates the need to classify new tracks correctly in order to improve the recommender system. Your colleagues from the audio department have prepared a dataset containing several characteristics of musical works together with their genres. Your task is to develop a model that classifies musical works by genre.

Over the course of the work, go through all the main stages of a full study:

*  loading and getting to know the data
*  preprocessing
*  thorough exploratory analysis
*  engineering new synthetic features
*  checking for multicollinearity
*  selecting the final set of training features
*  choosing and training models
*  final evaluation of the best model's prediction quality
*  analysis of its feature importance

**IMPORTANT**  
The solution must be implemented using `pipeline` (from the `sklearn` library)
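
A minimal sketch of what a `pipeline`-based solution could look like. The column split, the `OneHotEncoder`/`StandardScaler` preprocessing and the `LogisticRegression` placeholder model are assumptions made for illustration, not the approach used in this project, and the tiny DataFrame is synthetic.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical split of columns into categorical and numeric features.
CAT_FEATURES = ["key", "mode"]
NUM_FEATURES = ["tempo", "energy"]

# Synthetic toy data that only mimics the dataset's column names.
toy = pd.DataFrame({
    "key": ["C", "D", "C", "A", "D", "A"],
    "mode": ["Major", "Minor", "Major", "Minor", "Major", "Minor"],
    "tempo": [120.0, 90.0, 128.0, 70.0, 125.0, 80.0],
    "energy": [0.9, 0.3, 0.8, 0.2, 0.85, 0.25],
    "music_genre": ["Rock", "Jazz", "Rock", "Jazz", "Rock", "Jazz"],
})

# Encode categorical columns and scale numeric ones in a single step.
preprocessor = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), CAT_FEATURES),
    ("num", StandardScaler(), NUM_FEATURES),
])

# Preprocessing and the classifier are fitted together as one estimator.
model = Pipeline([
    ("preprocess", preprocessor),
    ("clf", LogisticRegression(max_iter=1000)),
])

X, y = toy.drop(columns="music_genre"), toy["music_genre"]
model.fit(X, y)
predictions = model.predict(X)
```

Keeping preprocessing inside the pipeline means the same transformations are applied at training and prediction time, which also makes the fitted object easier to reuse in a Streamlit app.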

**EXPECTED RESULT**

* A well-organized GitHub repository (research notebook + application code)
* A deployed web application (built with the Streamlit library)

## Project participants, repository, application

Кобычев Кирилл, @hikoby

Иванов Егор, @Jaibesiondaide

Игорь Земенков, @iZemM

https://github.com/kirillkobychev/HSE-ML-TEAM-4

## Importing libraries and setting constants

In [25]:
%%capture
!pip install catboost -q
!pip install ydata-profiling
!pip install kaggle
!pip install --quiet tls-client tqdm

In [26]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from catboost import CatBoostClassifier

from ydata_profiling import ProfileReport
import difflib
import random
import re
import time

import tls_client
from tqdm.auto import tqdm

In [27]:
TRAIN = "https://www.dropbox.com/scl/fi/5zy935lqpaqr9lat76ung/music_genre_train.csv?rlkey=ccovu9ml8pfi9whk1ba26zdda&dl=1"
TEST = "https://www.dropbox.com/scl/fi/o6mvsowpp9r3k2lejuegt/music_genre_test.csv?rlkey=ac14ydue0rzlh880jwj3ebum4&dl=1"

In [28]:
RANDOM_STATE = 42
TEST_SIZE = 0.25

## Loading and overview of the data

In [29]:
train = pd.read_csv(TRAIN)
test = pd.read_csv(TEST)

In [30]:
train.sample(5)

Unnamed: 0,instance_id,track_name,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,obtained_date,valence,music_genre
14544,64352.0,Shepherd of Fire,0.000332,0.576,323800.0,0.906,0.0614,D,0.0908,-7.677,Major,0.0571,127.935,4-Apr,0.211,Alternative
1231,27489.0,Roses - Jengi Remix,0.16,0.792,238560.0,0.679,0.000238,F,0.111,-7.271,Minor,0.0587,125.004,4-Apr,0.557,Electronic
14948,35784.0,Weeping Willow,0.0323,0.775,233333.0,0.386,0.516,B,0.497,-13.175,Minor,0.065,143.976,4-Apr,0.436,Jazz
15461,63875.0,"Fuyu No Hi - Ep. 23 ""Ho-Kago!"" Mix",0.0354,0.486,207667.0,0.785,0.0,A,0.0545,-5.634,Major,0.0375,149.807,4-Apr,0.667,Anime
14994,47706.0,Weakened by Winter,0.136,0.431,478000.0,0.845,0.684,B,0.238,-5.699,Major,0.0492,,1-Apr,0.322,Classical


**Data field descriptions**

`instance_id` - unique track identifier  
`track_name` - track name  
`acousticness` - acousticness  
`danceability` - danceability  
`duration_ms` - duration in milliseconds  
`energy` - energy  
`instrumentalness` - instrumentalness  
**`key` - musical key**  
`liveness` - liveness (likelihood that the track was recorded live)  
`loudness` - loudness  
**`mode` - mode (major or minor)**  
`speechiness` - speechiness (presence of spoken words)  
**`tempo` - tempo**  
`obtained_date` - date the track was added to the service  
`valence` - valence (musical positiveness of the track)  
`music_genre` - music genre

## Data preprocessing

In [31]:
train.head()

Unnamed: 0,instance_id,track_name,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,obtained_date,valence,music_genre
0,25143.0,Highwayman,0.48,0.67,182653.0,0.351,0.0176,D,0.115,-16.842,Major,0.0463,101.384,4-Apr,0.45,Country
1,26091.0,Toes Across The Floor,0.243,0.452,187133.0,0.67,5.1e-05,A,0.108,-8.392,Minor,0.0352,113.071,4-Apr,0.539,Rock
2,87888.0,First Person on Earth,0.228,0.454,173448.0,0.804,0.0,E,0.181,-5.225,Minor,0.371,80.98,4-Apr,0.344,Alternative
3,77021.0,No Te Veo - Digital Single,0.0558,0.847,255987.0,0.873,3e-06,G#,0.325,-4.805,Minor,0.0804,116.007,4-Apr,0.966,Hip-Hop
4,20852.0,Chasing Shadows,0.227,0.742,195333.0,0.575,2e-06,C,0.176,-5.55,Major,0.0487,76.494,4-Apr,0.583,Alternative


In [32]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20394 entries, 0 to 20393
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   instance_id       20394 non-null  float64
 1   track_name        20394 non-null  object 
 2   acousticness      20394 non-null  float64
 3   danceability      20394 non-null  float64
 4   duration_ms       20394 non-null  float64
 5   energy            20394 non-null  float64
 6   instrumentalness  20394 non-null  float64
 7   key               19659 non-null  object 
 8   liveness          20394 non-null  float64
 9   loudness          20394 non-null  float64
 10  mode              19888 non-null  object 
 11  speechiness       20394 non-null  float64
 12  tempo             19952 non-null  float64
 13  obtained_date     20394 non-null  object 
 14  valence           20394 non-null  float64
 15  music_genre       20394 non-null  object 
dtypes: float64(11), object(5)
memory usage: 

In [33]:
print("Train unique")
for col in train.columns:
    if train[col].dtype == "object":
        print(f"{col}: {train[col].unique()}")
print("\nTest unique")
for col in test.columns:
    if test[col].dtype == "object":
        print(f"{col}: {test[col].unique()}")

Train unique
track_name: ['Highwayman' 'Toes Across The Floor' 'First Person on Earth' ...
 'Original Prankster' '4Peat' 'Trouble (feat. MC Spyder)']
key: ['D' 'A' 'E' 'G#' 'C' 'D#' 'A#' 'F' 'F#' nan 'G' 'C#' 'B']
mode: ['Major' 'Minor' nan]
obtained_date: ['4-Apr' '3-Apr' '5-Apr' '1-Apr']
music_genre: ['Country' 'Rock' 'Alternative' 'Hip-Hop' 'Blues' 'Jazz' 'Electronic'
 'Anime' 'Rap' 'Classical']

Test unique
track_name: ['Low Class Conspiracy' 'The Hunter' 'Hate Me Now' ... 'Bipolar'
 'Dead - NGHTMRE Remix'
 'A Night In Tunisia - Remastered 1998 / Rudy Van Gelder Edition']
key: ['A#' 'G#' 'A' 'B' 'D' 'F#' 'F' 'G' 'C' nan 'D#' 'C#' 'E']
mode: ['Minor' 'Major' nan]
obtained_date: ['4-Apr' '3-Apr' '5-Apr' '1-Apr']

In [73]:
print("Train null")
for col in train.columns:
    n_missing = train[col].isnull().sum()
    if n_missing > 0:
        print(f"{col} ({train[col].dtype}): {n_missing}")
print("Train data shape:", train.shape)
print("\nTest null")
for col in test.columns:
    n_missing = test[col].isnull().sum()
    if n_missing > 0:
        print(f"{col} ({test[col].dtype}): {n_missing}")
print("Test data shape: ", test.shape)

Train null
key (object): 735
mode (object): 506
tempo (float64): 442
Train data shape: (20394, 16)

Test null
key (object): 158
mode (object): 149
tempo (float64): 121
Test data shape:  (5099, 15)


In [23]:
statistics_train = train.select_dtypes(include=['float64', 'int64']).describe()
statistics_test = test.select_dtypes(include=['float64', 'int64']).describe()
statistics_obj_train = train.select_dtypes(include=['object']).describe()
statistics_obj_test = test.select_dtypes(include=['object']).describe()

print(f"Train:\n{statistics_train}")
print(f"\nTest:\n{statistics_test}")
print(f"\nTrain object:\n{statistics_obj_train}")
print(f"\nTest object:\n{statistics_obj_test}")

Train:
        instance_id  acousticness  danceability   duration_ms        energy  \
count  20394.000000  20394.000000  20394.000000  2.039400e+04  20394.000000   
mean   55973.846916      0.274783      0.561983  2.203754e+05      0.625276   
std    20695.792545      0.321643      0.171898  1.267283e+05      0.251238   
min    20011.000000      0.000000      0.060000 -1.000000e+00      0.001010   
25%    38157.250000      0.015200      0.451000  1.775170e+05      0.470000   
50%    56030.000000      0.120000      0.570000  2.195330e+05      0.666000   
75%    73912.750000      0.470000      0.683000  2.660000e+05      0.830000   
max    91758.000000      0.996000      0.978000  4.497994e+06      0.999000   

       instrumentalness      liveness      loudness   speechiness  \
count      20394.000000  20394.000000  20394.000000  20394.000000   
mean           0.159989      0.198540     -8.552998      0.091352   
std            0.306503      0.166742      5.499917      0.097735   
min  

We fill the missing values using the Tunebat service, which has access to the Spotify API. For each track we parse the `key`, `mode` and `tempo` features. The cell below takes about 3 hours to run, but it lets us fill the gaps instead of discarding the affected rows. The code below is commented out, and the resulting CSV files were saved to Dropbox (`TRAIN_FILLED` and `TEST_FILLED`).

*   Train: missing values AFTER {'key': 28, 'mode': 16, 'tempo': 15}
*   Test: missing values AFTER {'key': 2, 'mode': 8, 'tempo': 4}

* The remaining missing values are dropped at the end
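
The filling step depends on matching each `track_name` to a search result by title similarity. The sketch below isolates that idea with the standard-library `difflib` and the same 0.80 threshold as the commented cell. The helper name `best_title_match` and the sample titles are hypothetical, and no request is sent to Tunebat.

```python
import difflib


def best_title_match(query, candidates, score_threshold=0.80):
    """Return the candidate most similar to `query`, or None if too dissimilar."""
    if not candidates:
        return None

    def similarity(title):
        # Case-insensitive similarity ratio in the range [0, 1].
        return difflib.SequenceMatcher(None, title.lower(), query.lower()).ratio()

    best = max(candidates, key=similarity)
    return best if similarity(best) >= score_threshold else None


# Hypothetical search results, for illustration only.
match = best_title_match("kiss me", ["Kiss Me", "Kiss Me Deadly", "Kiss"])
no_match = best_title_match("Weakened by Winter", ["Summer Days"])
```

Matching on the title alone cannot distinguish different recordings that share a name, so a filled value may come from another track. Adding the artist to the query would help, but the dataset shown here has no artist column.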

In [36]:
# df_train = train.copy(deep=True)
# df_test  = test.copy(deep=True)

# BASE_URL   = "https://api.tunebat.com/api/tracks/search"
# KEY_LIST   = ['C','C#','D','D#','E','F','F#','G','G#','A','A#','B']
# PAUSE_SEC  = (0.4, 0.8)
# session    = tls_client.Session(client_identifier="chrome_124",
#                                 random_tls_extension_order=True)

# FLAT2SHARP = {'AB':'G#','BB':'A#','CB':'B',
#               'DB':'C#','EB':'D#','FB':'E','GB':'F#'}

# def normalize_key(raw: str | None):
#     if not isinstance(raw, str):
#         return None

#     m = re.match(r'^\s*([A-Ga-g])([#♯b♭]?)(?:\s|$)', raw)
#     if not m:
#         return None

#     letter, accidental = m.groups()
#     note = (letter.upper() +
#             {'#': '#', '♯': '#', 'b': 'B', '♭': 'B'}.get(accidental, ''))

#     if len(note) == 2 and note[1] == 'B':
#         note = FLAT2SHARP.get(note, None)

#     return note if note in KEY_LIST else None

# def camelot_to_mode(cam):
#     return ("Minor" if cam and cam[-1]=='A' else "Major") if cam else None

# def tunebat_search(query, score_threshold=0.80):
#     time.sleep(random.uniform(*PAUSE_SEC))
#     r = session.get(BASE_URL, params={"term": query})

#     if r.status_code == 429:
#         time.sleep(int(r.headers.get("Retry-After", "5")) + 1)
#         r = session.get(BASE_URL, params={"term": query})

#     if r.status_code != 200:
#         return None

#     items = r.json().get("data", {}).get("items", [])
#     if not items:
#         return None

#     best = max(items, key=lambda d:
#                difflib.SequenceMatcher(None, d["n"].lower(), query.lower()).ratio())
#     if difflib.SequenceMatcher(None, best["n"].lower(), query.lower()).ratio() < score_threshold:
#         return None
#     return best

# cache = {}
# def get_info(track):
#     if track not in cache:
#         cache[track] = tunebat_search(track)
#     return cache[track]

# def fill_missing(df: pd.DataFrame, name: str):
#     print(f"\n{name}: пропуски ДО", df[['key','mode','tempo']].isna().sum().to_dict())

#     for col in ['key', 'mode', 'tempo']:
#         for idx in tqdm(df[df[col].isna()].index, desc=f"{name}: заполняем {col}"):
#             track = str(df.at[idx, 'track_name']).strip()
#             info  = get_info(track)
#             if not info:
#                 continue

#             key_new  = normalize_key(info.get('k'))
#             mode_new = camelot_to_mode(info.get('c'))
#             bpm      = info.get('b')

#             if pd.isna(df.at[idx,'key'])  and key_new:
#                 df.at[idx,'key'] = key_new
#             if pd.isna(df.at[idx,'mode']) and mode_new:
#                 df.at[idx,'mode'] = mode_new
#             if pd.isna(df.at[idx,'tempo']) and bpm:
#                 try: df.at[idx,'tempo'] = float(bpm)
#                 except (ValueError, TypeError): pass

#             tqdm.write(f"{track} → key:{key_new} mode:{mode_new} tempo:{bpm}")

#     print(f"{name}: пропуски ПОСЛЕ", df[['key','mode','tempo']].isna().sum().to_dict())
#     print(f"{name} shape: {df.shape}\n")

# fill_missing(df_train, "Train")
# fill_missing(df_test,  "Test")

# df_train_filled = df_train.copy()
# df_test_filled  = df_test.copy()
# df_train_filled.to_csv('df_train_filled.csv', index=False)
# df_test_filled.to_csv('df_test_filled.csv',  index=False)

# print("\nnew filled df: df_train_filled.csv, df_test_filled.csv")
# print("unique key :", sorted(df_train_filled['key'].dropna().unique()))
# print("unique mode:", sorted(df_train_filled['mode'].dropna().unique()))


Train: пропуски ДО {'key': 735, 'mode': 506, 'tempo': 442}


Train: заполняем key:   0%|          | 0/735 [00:00<?, ?it/s]

Serenade in B flat, K.361 "Gran partita": 3. Adagio → key:D# mode:Major tempo:126.0
Star67 → key:D# mode:Major tempo:92.0
Sleep On The Floor → key:G mode:Major tempo:142.0
Rogue → key:B mode:Minor tempo:140.0
Party Song → key:G mode:Major tempo:158.0
Kiss Me → key:D# mode:Major tempo:100.0
The Trouble With Us → key:D mode:Major tempo:121.0
Pull Up Hop Out → key:C# mode:Major tempo:125.0
My, My, My → key:C mode:Major tempo:74.0
Wild Love - Acoustic → key:B mode:Minor tempo:140.0
Florida Boy → key:A# mode:Major tempo:155.0
You Can Do Magic → key:E mode:Minor tempo:130.0
Word Around Town (feat. Rich Homie Quan) → key:C# mode:Major tempo:141.0
Crossing Over → key:B mode:Minor tempo:196.0
West Texas Rain → key:C mode:Major tempo:147.0
I'm in a Dancing Mood → key:G mode:Minor tempo:140.0
Sound the Alarm → key:E mode:Major tempo:102.0
M.... She Wrote → key:A mode:Minor tempo:160.0
Always A Friend → key:C mode:Major tempo:127.0
Build Your Kingdom Here → key:D mode:Major tempo:138.0
Big Bad Joh

Train: заполняем mode:   0%|          | 0/484 [00:00<?, ?it/s]

Clarinet Concerto No. 1 in F Minor, Op. 73, J. 114: III. Rondo. Allegretto → key:F mode:Major tempo:144.0
Concerto for Viola d'amore, Lute, Strings and Basso Continuo in D Minor, RV 540: I. Allegro → key:C# mode:Minor tempo:79.0
She Keeps The Home Fires Burning → key:F mode:Major tempo:127.0
Ven - Continuous Mix → key:E mode:Minor tempo:115.0
Wicked As It Seems → key:G mode:Major tempo:108.0
Sunrise, Sunburn, Sunset → key:A mode:Major tempo:160.0
Chains - Remix → key:C mode:Minor tempo:76.0
P's & Q's → key:B mode:Major tempo:88.0
Used to → key:D mode:Major tempo:85.0
Immigrant Song - 1990 Remaster → key:B mode:Major tempo:113.0
If I Fall You're Going Down with Me → key:G# mode:Major tempo:123.0
A Course of Strengthening Medicines → key:F mode:Major tempo:78.0
Can’t Hide Red → key:E mode:Minor tempo:95.0
MARY JANE → key:G# mode:Minor tempo:134.0
When the Levee Breaks - Remastered → key:F mode:Major tempo:144.0
Deep Water → key:A# mode:Major tempo:125.0
I Bought It → key:D mode:Major tem

Train: заполняем tempo:   0%|          | 0/421 [00:00<?, ?it/s]

Back Door Man → key:A mode:Major tempo:177.0
familia → key:D mode:Major tempo:76.0
Poor Johnny → key:B mode:Minor tempo:102.0
Running Away → key:G mode:Major tempo:93.0
Victoria → key:D# mode:Major tempo:163.0
Mama Don't Get Dressed up for Nothing → key:G mode:Major tempo:126.0
Prelude to the Afternoon of a Faun → key:E mode:Major tempo:172.0
Eclipse → key:A# mode:Major tempo:68.0
Guillaume Tell (William Tell): Overture (arr. F. Wrede for 2 pianos 8 hands): Andante - → key:G mode:Major tempo:74.0
Blue and Evil - Live → key:B mode:Minor tempo:170.0
High All The Time → key:F# mode:Major tempo:87.0
Pétrouchka: Tableau 4, The Shrovetide Fair - Wet Nurses' Dance → key:A mode:Minor tempo:126.0
Real Life → key:C mode:Minor tempo:101.0
Long Gone Lonesome Blues - Single Version → key:E mode:Major tempo:123.0
Gettin' Jiggy Wit It → key:F# mode:Major tempo:108.0
History (Love Mix) → key:F mode:Major tempo:114.0
Hookers → key:E mode:Minor tempo:167.0
The Drugs Don't Work → key:C mode:Major tempo:7

Test: заполняем key:   0%|          | 0/158 [00:00<?, ?it/s]

Test Me → key:C mode:Minor tempo:174.0
There's A Small Hotel → key:A mode:Minor tempo:105.0
Join Me → key:G# mode:Minor tempo:114.0
10 Feet → key:F# mode:Minor tempo:104.0
Ghost → key:D mode:Major tempo:154.0
YNO → key:G mode:Major tempo:105.0
Christmas Chill → key:C mode:Major tempo:77.0
bring back the colors → key:G# mode:Major tempo:166.0
21 → key:D mode:Major tempo:125.0
Symphony No. 9 in D Minor, Op. 125 - "Choral": 4b. Allegro assai - Live → key:D mode:Major tempo:137.0
Down with the Sickness → key:D# mode:Minor tempo:90.0
Shoot You Down - Remastered → key:C mode:Major tempo:109.0
Fuck the Pain Away → key:E mode:Major tempo:132.0
Manifest Destiny - Remastered → key:E mode:Minor tempo:120.0
Innocence → key:C# mode:Minor tempo:138.0
Flying Whales → key:G mode:Minor tempo:95.0
Southbound → key:B mode:Major tempo:133.0
Who's Making Love → key:A# mode:Major tempo:115.0
Films → key:A mode:Major tempo:98.0
Fuck You → key:C mode:Major tempo:127.0
Sanpo (My Neighbor Totoro) → key:C mode:M

Test: заполняем mode:   0%|          | 0/144 [00:00<?, ?it/s]

6 Foot 7 Foot → key:D mode:Major tempo:79.0
Taciturn → key:A# mode:Major tempo:121.0
Within You, Without You - Live → key:G mode:Major tempo:136.0
Ordermade → key:C# mode:Major tempo:104.0
Lines On My Face - Live → key:C mode:Major tempo:102.0
Sleigh Ride - Instrumental → key:A# mode:Major tempo:115.0
Ontheway! → key:F mode:Minor tempo:150.0
The Catalyst → key:C mode:Major tempo:135.0
Get Higher → key:G mode:Major tempo:102.0
Partita for 8 Singers: No. 1. Allemande → key:F# mode:Minor tempo:138.0
BZZRK - AOWL Remix → key:B mode:Major tempo:75.0
Come To Daddy (Pappy Mix) → key:G# mode:Major tempo:162.0
Good Lovin' - Single Version → key:D mode:Major tempo:196.0
Whatever → key:D# mode:Major tempo:108.0
Brandenburg Concerto No. 4 in G BWV1049: I. Allegro → key:G mode:Major tempo:90.0
Creature Comfort → key:B mode:Major tempo:190.0
DEVASTATED → key:A# mode:Minor tempo:123.0
Nothin' But The Cavi Hit → key:C# mode:Minor tempo:98.0
Dear My Firend -まだ見ぬ未来へ- → key:G# mode:Major tempo:129.0
Bing

Test: заполняем tempo:   0%|          | 0/116 [00:00<?, ?it/s]

She Walks In Beauty → key:D mode:Major tempo:124.0
Tell Me Why → key:B mode:Major tempo:129.0
Like You → key:F# mode:Minor tempo:125.0
Matador (feat. Thompson Square) → key:B mode:Minor tempo:78.0
New Fang → key:G mode:Major tempo:190.0
Bad Moon Rising → key:D mode:Major tempo:179.0
Like It's The Last Time → key:E mode:Major tempo:77.0
A Different Feeling → key:G mode:Minor tempo:111.0
Courtesy Of The Red, White And Blue (The Angry American) → key:D mode:Major tempo:112.0
Cake Dough Cheddar → key:F mode:Major tempo:160.0
Outlier → key:G# mode:Major tempo:135.0
I Need More → key:E mode:Minor tempo:80.0
Morning Side → key:C# mode:Major tempo:91.0
The Older I Get → key:D mode:Major tempo:115.0
Unforgettable → key:F# mode:Major tempo:98.0
No More (feat. NIKI) → key:A# mode:Minor tempo:142.0
The Most Immaculate Haircut → key:C mode:Major tempo:106.0
Apres Une Reve → key:F mode:Minor tempo:173.0
Modern Flows II Intro → key:D# mode:Major tempo:86.0
Es Por Ti → key:E mode:Major tempo:130.0
Los

In [55]:
TRAIN_FILLED = (
    "https://www.dropbox.com/scl/fi/b15tvwr1uv7b3yx3dqvhq/df_train_filled.csv"
    "?rlkey=m2bowu7cjpqijyp4venzv1ncs&st=jix0z39a&dl=1"
)
TEST_FILLED = (
    "https://www.dropbox.com/scl/fi/kwpslwi5sq6bw1ci6a1z8/df_test_filled.csv"
    "?rlkey=9n3fjlxgktdzi8ecdghegi97f&st=uqp8ph6x&dl=1"
)

In [56]:
train_filled = pd.read_csv(TRAIN_FILLED)
test_filled = pd.read_csv(TEST_FILLED)

In [60]:
columns_to_check = ["key", "mode", "tempo"]

train_filled = train_filled.dropna(subset=columns_to_check)
train_filled = train_filled[~((train_filled[columns_to_check] == 0).any(axis=1))]

test_filled = test_filled.dropna(subset=columns_to_check)
test_filled = test_filled[~((test_filled[columns_to_check] == 0).any(axis=1))]

In [80]:
print("Train")
print(train_filled.isnull().sum())
print("\nTest")
print(test_filled.isnull().sum())

Train
instance_id         0
track_name          0
acousticness        0
danceability        0
duration_ms         0
energy              0
instrumentalness    0
key                 0
liveness            0
loudness            0
mode                0
speechiness         0
tempo               0
obtained_date       0
valence             0
music_genre         0
dtype: int64

Test
instance_id         0
track_name          0
acousticness        0
danceability        0
duration_ms         0
energy              0
instrumentalness    0
key                 0
liveness            0
loudness            0
mode                0
speechiness         0
tempo               0
obtained_date       0
valence             0
dtype: int64


## Exploratory data analysis

In [71]:
%%capture
profile_train = ProfileReport(train_filled, title="Profiling Report")
profile_train.to_file("train_filled.html")

profile_test = ProfileReport(test_filled, title="Profiling Report")
profile_test.to_file("test_filled.html")

In [None]:
pivot_table = pd.crosstab(train['music_genre'], train['key'], margins=True, margins_name='Total')

# Print the pivot table
print("Pivot table: distribution of keys (key) by genre (music_genre):")
print(pivot_table)

Pivot table: distribution of keys (key) by genre (music_genre):
key             A    A#     B     C    C#     D   D#     E     F    F#     G  \
music_genre                                                                    
Alternative   241   139   225   280   236   267   68   199   200   193   288   
Anime         184    92   122   209   182   202   67   146   166   121   225   
Blues         375   125   187   360   169   366   65   216   245    98   419   
Classical     123    74    74   158   110   177   79   117   117    65   165   
Country       244    97   144   241   154   241   80   179   136   120   281   
Electronic    219   190   228   224   361   203   49   179   215   185   265   
Hip-Hop        83   102   116    80   224    86   28    55    90    72    78   
Jazz          115   116    64   136    93    96   47   102   138    63   136   
Rap           172   187   209   200   421   202   58   132   160   166   216   
Rock          257   102   153   261   175   2

## Feature engineering

## Model selection and training
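
A hedged sketch of how candidate models could be compared on a hold-out split, using the `RANDOM_STATE` and `TEST_SIZE` constants and the `train_test_split` and `f1_score` imports from the top of the notebook. The synthetic data, the two candidate estimators and the `average="micro"` setting are assumptions for illustration; in the notebook, `CatBoostClassifier` could be added to the `candidates` dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

RANDOM_STATE = 42
TEST_SIZE = 0.25

# Synthetic stand-in for the prepared feature matrix and genre labels.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5,
    n_classes=3, random_state=RANDOM_STATE,
)

# Stratify so every class keeps the same share in both parts.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=TEST_SIZE, stratify=y, random_state=RANDOM_STATE
)

# Hypothetical candidates: a trivial baseline and one real model.
candidates = {
    "baseline": DummyClassifier(strategy="most_frequent"),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=RANDOM_STATE),
}

scores = {}
for name, estimator in candidates.items():
    estimator.fit(X_train, y_train)
    scores[name] = f1_score(y_valid, estimator.predict(X_valid), average="micro")
```

A trivial baseline in the comparison shows how much of a model's score actually comes from the features.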

## Quality evaluation

## Model feature importance analysis