In this session we will try out a regression task. We will use this data: https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

Before training a model, the data has to be prepared:

- find or collect the data
- clean and preprocess it
- convert it to matrices


In [362]:
# the usual imports cell
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
# %matplotlib inline

# import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error


In [333]:
data_imdb_1 = pd.read_csv('/Users/macbook/Documents/datasets/imdb/IMDB-Movie-Data.csv')
print(data_imdb_1.shape)

data_imdb_1.head(3)

(1000, 12)


| | Rank | Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Guardians of the Galaxy | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 |
| 1 | 2 | Prometheus | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
| 2 | 3 | Split | Horror,Thriller | Three girls are kidnapped by a man with a diag... | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar... | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0 |


## What to do with NaN?
There are three options:

In [334]:
# 1. Drop the rows that contain NaN
print(data_imdb_1.isna().any())
data_imdb_1.shape

Rank                  False
Title                 False
Genre                 False
Description           False
Director              False
Actors                False
Year                  False
Runtime (Minutes)     False
Rating                False
Votes                 False
Revenue (Millions)     True
Metascore              True
dtype: bool


(1000, 12)

In [335]:
print(data_imdb_1.shape)
tmp = data_imdb_1.dropna()
tmp.shape

(1000, 12)


(838, 12)

In [336]:
# 2. Replace NaN with 0
print(data_imdb_1.shape)
tmp = data_imdb_1.fillna(0)
print(tmp.shape)

(1000, 12)
(1000, 12)


In [321]:
# 3. Replace NaN with the column mean
data_imdb_1.isna().any()

Rank                  False
Title                 False
Genre                 False
Description           False
Director              False
Actors                False
Year                  False
Runtime (Minutes)     False
Rating                False
Votes                 False
Revenue (Millions)     True
Metascore              True
dtype: bool

In [338]:
# compute the means of the columns that have missing values
meta_mean = data_imdb_1.Metascore.mean()
rev_mean = data_imdb_1['Revenue (Millions)'].mean()

# replace the gaps with the column means
# (assigning back avoids chained inplace fillna, which newer pandas versions warn about)
data_imdb_1['Metascore'] = data_imdb_1['Metascore'].fillna(meta_mean)
data_imdb_1['Revenue (Millions)'] = data_imdb_1['Revenue (Millions)'].fillna(rev_mean)

# check whether any NaN remain
data_imdb_1.isna().any()

Rank                  False
Title                 False
Genre                 False
Description           False
Director              False
Actors                False
Year                  False
Runtime (Minutes)     False
Rating                False
Votes                 False
Revenue (Millions)    False
Metascore             False
dtype: bool
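
The three options can be compared side by side on a toy table. The sketch below uses a small made-up DataFrame; the column names `score` and `revenue` and all values are illustrative assumptions, not taken from the IMDB file.

```python
import numpy as np
import pandas as pd

# toy data with gaps; the values are made up for illustration
toy = pd.DataFrame({
    "score": [70.0, np.nan, 50.0, 60.0],
    "revenue": [100.0, 20.0, np.nan, 30.0],
})

# option 1: drop every row that contains at least one NaN
dropped = toy.dropna()

# option 2: replace NaN with a constant (here 0)
zero_filled = toy.fillna(0)

# option 3: replace NaN with the mean of the corresponding column
mean_filled = toy.fillna(toy.mean())

print(dropped.shape, zero_filled.shape, mean_filled.shape)
print(mean_filled)
```

Dropping rows loses data (in the notebook the table shrinks from 1000 to 838 rows), filling with zeros keeps every row but inserts values that may be implausible for the column, and filling with the mean keeps every row without shifting the column average.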

## Подготовка данных

Let's try to predict a film's rating from its description, year, runtime in minutes, number of votes, box-office revenue and Metascore.

In [339]:
data_imdb_1.Description

0      A group of intergalactic criminals are forced ...
1      Following clues to the origin of mankind, a te...
2      Three girls are kidnapped by a man with a diag...
3      In a city of humanoid animals, a hustling thea...
4      A secret government agency recruits some of th...
                             ...                        
995    A tight-knit team of rising investigators, alo...
996    Three American college students studying abroa...
997    Romantic sparks occur between two dance studen...
998    A pair of friends embark on a mission to reuni...
999    A stuffy businessman finds himself trapped ins...
Name: Description, Length: 1000, dtype: object

In [340]:
# prepare the film descriptions: lowercase and split on whitespace
data_imdb_1["text"] = data_imdb_1.Description.apply(lambda x: x.lower().split()) 

data_imdb_1["text"]

0      [a, group, of, intergalactic, criminals, are, ...
1      [following, clues, to, the, origin, of, mankin...
2      [three, girls, are, kidnapped, by, a, man, wit...
3      [in, a, city, of, humanoid, animals,, a, hustl...
4      [a, secret, government, agency, recruits, some...
                             ...                        
995    [a, tight-knit, team, of, rising, investigator...
996    [three, american, college, students, studying,...
997    [romantic, sparks, occur, between, two, dance,...
998    [a, pair, of, friends, embark, on, a, mission,...
999    [a, stuffy, businessman, finds, himself, trapp...
Name: text, Length: 1000, dtype: object
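
Splitting on whitespace leaves punctuation attached to the tokens (for example `animals,`). One possible alternative, which the notebook does not use, is a regular-expression tokenizer. The helper name `tokenize` and the sample sentence are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and keep runs of letters and digits,
    allowing apostrophes and hyphens inside a word."""
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text.lower())

sample = "In a city of humanoid animals, a tight-knit team finds New York's secret."

# whitespace split keeps 'animals,' and 'secret.' as tokens
print(sample.lower().split())

# the regex drops the punctuation but keeps 'tight-knit' and "york's" whole
print(tokenize(sample))
```

Whether this helps depends on the downstream model; with only 1000 short descriptions, merging `animals,` and `animals` into one token at least reduces the vocabulary.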

In [341]:
data_imdb_1.text.values

array([list(['a', 'group', 'of', 'intergalactic', 'criminals', 'are', 'forced', 'to', 'work', 'together', 'to', 'stop', 'a', 'fanatical', 'warrior', 'from', 'taking', 'control', 'of', 'the', 'universe.']),
       list(['following', 'clues', 'to', 'the', 'origin', 'of', 'mankind,', 'a', 'team', 'finds', 'a', 'structure', 'on', 'a', 'distant', 'moon,', 'but', 'they', 'soon', 'realize', 'they', 'are', 'not', 'alone.']),
       list(['three', 'girls', 'are', 'kidnapped', 'by', 'a', 'man', 'with', 'a', 'diagnosed', '23', 'distinct', 'personalities.', 'they', 'must', 'try', 'to', 'escape', 'before', 'the', 'apparent', 'emergence', 'of', 'a', 'frightful', 'new', '24th.']),
       list(['in', 'a', 'city', 'of', 'humanoid', 'animals,', 'a', 'hustling', 'theater', "impresario's", 'attempt', 'to', 'save', 'his', 'theater', 'with', 'a', 'singing', 'competition', 'becomes', 'grander', 'than', 'he', 'anticipates', 'even', 'as', 'its', "finalists'", 'find', 'that', 'their', 'lives', 'will', 'never', 

In [342]:
input_text = list(data_imdb_1.text.values)

In [344]:
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(input_text)]
documents[10:12]

[TaggedDocument(words=['the', 'adventures', 'of', 'writer', 'newt', 'scamander', 'in', 'new', "york's", 'secret', 'community', 'of', 'witches', 'and', 'wizards', 'seventy', 'years', 'before', 'harry', 'potter', 'reads', 'his', 'book', 'in', 'school.'], tags=[10]),
 TaggedDocument(words=['the', 'story', 'of', 'a', 'team', 'of', 'female', 'african-american', 'mathematicians', 'who', 'served', 'a', 'vital', 'role', 'in', 'nasa', 'during', 'the', 'early', 'years', 'of', 'the', 'u.s.', 'space', 'program.'], tags=[11])]

Train the model on the film description texts (feel free to vary the parameters).

In [345]:
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)

In [346]:
model.save("D2V.model") # save the model

In [347]:
# this is how to look at the vectors of the texts the model was trained on
# the index in [] after documents is the index of the text in the dataset

model.dv[documents[0].tags[0]]


array([-2.6236091e-02, -2.3682602e-05, -2.9620193e-02, -4.7590252e-02,
        7.2363213e-02], dtype=float32)

Now the vectors need to be added to the dataset alongside the other features.

In [348]:
# build a list with the vector for each text
vectors = []
for x in documents:
    vec = list(model.dv[x.tags][0])
    vectors.append(vec)

In [349]:
# this gives a dataframe with each vector component in its own column
split_df = pd.DataFrame(vectors,
                        columns=['v1', 'v2', 'v3','v4',"v5"])

split_df


| | v1 | v2 | v3 | v4 | v5 |
|---|---|---|---|---|---|
| 0 | -0.026236 | -0.000024 | -0.029620 | -0.047590 | 0.072363 |
| 1 | 0.122605 | -0.006474 | 0.170060 | -0.525021 | 0.063658 |
| 2 | 0.166355 | 0.255435 | 0.152336 | -0.348420 | -0.051464 |
| 3 | 0.032873 | 0.379468 | 0.636457 | -0.763817 | -0.058566 |
| 4 | 0.083932 | 0.319532 | 0.244758 | -0.398537 | 0.122084 |
| ... | ... | ... | ... | ... | ... |
| 995 | 0.262699 | 0.336988 | 0.230137 | -0.336289 | -0.118091 |
| 996 | 0.012138 | 0.348280 | 0.363518 | -0.531848 | 0.122636 |
| 997 | -0.051095 | 0.356146 | 0.214692 | -0.154530 | -0.146727 |
| 998 | -0.017918 | 0.169617 | 0.234239 | -0.163998 | -0.030693 |


In [350]:
# now join it to the main dataframe
result = data_imdb_1.join(split_df, how='left')
result.shape

(1000, 18)
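
`join` matches rows by index label. Here both frames have the default index 0..999, so every vector lands next to its own film. If rows had been dropped first (option 1 above), the labels would have gaps and a freshly built frame of vectors would no longer line up. A minimal sketch of that pitfall on made-up data (the frame `films` and the two-dimensional vectors are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# toy frame whose index has a gap, as it would after dropna()
films = pd.DataFrame({"Title": ["A", "B", "C"]}, index=[0, 2, 3])

# stand-in for document vectors: one 2-dimensional vector per remaining film
vectors = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])

# a default RangeIndex 0..2 does not match the labels 0, 2, 3
misaligned = films.join(pd.DataFrame(vectors, columns=["v1", "v2"]))

# reusing the films' index keeps every vector next to its own film
aligned = films.join(pd.DataFrame(vectors, columns=["v1", "v2"], index=films.index))

print(misaligned)
print(aligned)
```

In the misaligned result, film B receives the vector that belongs to film C and film C gets NaN, without any error being raised.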

In [351]:
result

| | Rank | Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore | text | v1 | v2 | v3 | v4 | v5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Guardians of the Galaxy | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333.130000 | 76.0 | [a, group, of, intergalactic, criminals, are, ... | -0.026236 | -0.000024 | -0.029620 | -0.047590 | 0.072363 |
| 1 | 2 | Prometheus | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.460000 | 65.0 | [following, clues, to, the, origin, of, mankin... | 0.122605 | -0.006474 | 0.170060 | -0.525021 | 0.063658 |
| 2 | 3 | Split | Horror,Thriller | Three girls are kidnapped by a man with a diag... | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar... | 2016 | 117 | 7.3 | 157606 | 138.120000 | 62.0 | [three, girls, are, kidnapped, by, a, man, wit... | 0.166355 | 0.255435 | 0.152336 | -0.348420 | -0.051464 |
| 3 | 4 | Sing | Animation,Comedy,Family | In a city of humanoid animals, a hustling thea... | Christophe Lourdelet | Matthew McConaughey,Reese Witherspoon, Seth Ma... | 2016 | 108 | 7.2 | 60545 | 270.320000 | 59.0 | [in, a, city, of, humanoid, animals,, a, hustl... | 0.032873 | 0.379468 | 0.636457 | -0.763817 | -0.058566 |
| 4 | 5 | Suicide Squad | Action,Adventure,Fantasy | A secret government agency recruits some of th... | David Ayer | Will Smith, Jared Leto, Margot Robbie, Viola D... | 2016 | 123 | 6.2 | 393727 | 325.020000 | 40.0 | [a, secret, government, agency, recruits, some... | 0.083932 | 0.319532 | 0.244758 | -0.398537 | 0.122084 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 996 | Secret in Their Eyes | Crime,Drama,Mystery | A tight-knit team of rising investigators, alo... | Billy Ray | Chiwetel Ejiofor, Nicole Kidman, Julia Roberts... | 2015 | 111 | 6.2 | 27585 | 82.956376 | 45.0 | [a, tight-knit, team, of, rising, investigator... | 0.262699 | 0.336988 | 0.230137 | -0.336289 | -0.118091 |
| 996 | 997 | Hostel: Part II | Horror | Three American college students studying abroa... | Eli Roth | Lauren German, Heather Matarazzo, Bijou Philli... | 2007 | 94 | 5.5 | 73152 | 17.540000 | 46.0 | [three, american, college, students, studying,... | 0.012138 | 0.348280 | 0.363518 | -0.531848 | 0.122636 |
| 997 | 998 | Step Up 2: The Streets | Drama,Music,Romance | Romantic sparks occur between two dance studen... | Jon M. Chu | Robert Hoffman, Briana Evigan, Cassie Ventura,... | 2008 | 98 | 6.2 | 70699 | 58.010000 | 50.0 | [romantic, sparks, occur, between, two, dance,... | -0.051095 | 0.356146 | 0.214692 | -0.154530 | -0.146727 |
| 998 | 999 | Search Party | Adventure,Comedy | A pair of friends embark on a mission to reuni... | Scot Armstrong | Adam Pally, T.J. Miller, Thomas Middleditch,Sh... | 2014 | 93 | 5.6 | 4881 | 82.956376 | 22.0 | [a, pair, of, friends, embark, on, a, mission,... | -0.017918 | 0.169617 | 0.234239 | -0.163998 | -0.030693 |


In [354]:
# we can redefine the dataset, keeping only the columns we need

data_sm = result[['Runtime (Minutes)',"Year",
                'Rating', 'Votes',
                'Revenue (Millions)','Metascore',"v1","v2","v3","v4","v5"]
              ]


data_sm.head(3)

| | Runtime (Minutes) | Year | Rating | Votes | Revenue (Millions) | Metascore | v1 | v2 | v3 | v4 | v5 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 121 | 2014 | 8.1 | 757074 | 333.13 | 76.0 | -0.026236 | -2.4e-05 | -0.02962 | -0.04759 | 0.072363 |
| 1 | 124 | 2012 | 7.0 | 485820 | 126.46 | 65.0 | 0.122605 | -0.006474 | 0.17006 | -0.525021 | 0.063658 |
| 2 | 117 | 2016 | 7.3 | 157606 | 138.12 | 62.0 | 0.166355 | 0.255435 | 0.152336 | -0.34842 | -0.051464 |


## Preparing the matrices

In [356]:
# define X and y

X = data_sm.drop(["Rating"],axis=1).values 

display(X, X.shape)

array([[ 1.21000000e+02,  2.01400000e+03,  7.57074000e+05, ...,
        -2.96201929e-02, -4.75902520e-02,  7.23632127e-02],
       [ 1.24000000e+02,  2.01200000e+03,  4.85820000e+05, ...,
         1.70059636e-01, -5.25020957e-01,  6.36581182e-02],
       [ 1.17000000e+02,  2.01600000e+03,  1.57606000e+05, ...,
         1.52335986e-01, -3.48420173e-01, -5.14641739e-02],
       ...,
       [ 9.80000000e+01,  2.00800000e+03,  7.06990000e+04, ...,
         2.14692265e-01, -1.54529616e-01, -1.46726534e-01],
       [ 9.30000000e+01,  2.01400000e+03,  4.88100000e+03, ...,
         2.34239370e-01, -1.63998455e-01, -3.06930672e-02],
       [ 8.70000000e+01,  2.01600000e+03,  1.24350000e+04, ...,
         2.36430794e-01, -3.57237220e-01,  1.06360599e-01]])

(1000, 10)

In [359]:
data_sm.isna().any()

Runtime (Minutes)     False
Year                  False
Rating                False
Votes                 False
Revenue (Millions)    False
Metascore             False
v1                    False
v2                    False
v3                    False
v4                    False
v5                    False
dtype: bool

In [357]:
y = data_sm['Rating'].values # the target: a separate array with the rating values
y.shape

(1000,)

Sometimes it is useful to [normalize](https://en.wikipedia.org/wiki/Normalization_(statistics)) the data: this helps when the features are measured in different units. 
StandardScaler is used for this: it rescales each feature to zero mean and unit variance.
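
What StandardScaler computes can be reproduced by hand: subtract the column mean and divide by the column standard deviation. The sketch below checks this on a tiny made-up matrix (`X_toy` and its values are illustrative assumptions).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy feature matrix: runtime in minutes and number of votes (made-up values)
X_toy = np.array([[90.0, 1000.0],
                  [120.0, 500000.0],
                  [150.0, 20000.0]])

# standardization by hand: subtract the column mean, divide by the column std
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# the same transformation with StandardScaler
scaled = StandardScaler().fit_transform(X_toy)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, both columns are on a comparable scale even though the raw votes are thousands of times larger than the raw runtimes.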

Before normalization:

In [None]:
list(X[0])

[121.0,
 2014.0,
 757074.0,
 333.13,
 76.0,
 -0.026236090809106827,
 -2.3682601749897003e-05,
 -0.02962019294500351,
 -0.04759025201201439,
 0.07236321270465851]

In [366]:
# use the standard scaler
sc = StandardScaler()

X_train, X_test, y_train, y_test = train_test_split(sc.fit_transform(X), y, random_state=42)

After:

In [367]:
list(sc.fit_transform(X)[0])

[0.4163497512303056,
 0.37979525138136244,
 3.1126899627963738,
 2.5961363010556906,
 1.0233613578368184,
 -1.459491887356481,
 -1.716976640351687,
 -2.254262911777046,
 2.0540598630737263,
 0.5594542593840583]

Now the data is easier to work with and to train on.
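
One caveat: the cell that calls `train_test_split` fits the scaler on all of `X` before splitting, so the mean and standard deviation of the test rows influence the transformation. A common alternative is to split first and fit the scaler on the training part only. The sketch below shows that order on synthetic data; the names `X_demo`, `y_demo` and the random numbers are assumptions and stand in for the notebook's `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for X and y (random numbers, not the IMDB data)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=100.0, scale=20.0, size=(200, 3))
y_demo = rng.normal(size=200)

# split first ...
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# ... then learn the mean and std from the training part only
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # reuses the training statistics

print(X_tr_scaled.mean(axis=0).round(6))  # zero by construction
print(X_te_scaled.mean(axis=0).round(3))  # near zero, but not exactly
```

On a dataset of this size the difference in the final error is likely to be small, but the split-first order gives a more honest estimate of how the model behaves on unseen films.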

In [392]:
# define the regression model
# the regularization strength can be varied with the alpha parameter
regressor = Ridge(alpha=50) 


# train
regressor.fit(X_train, y_train)

Ridge(alpha=50)

In [393]:
# predict the result for the test set

y_preds = regressor.predict(X_test)

### Evaluating the results

As the metric we will use the [mean absolute error](https://www.youtube.com/watch?v=ZejnwbcU8nw). It shows the average deviation from the correct answer in the same units as the target.

*(in general there are [other options](https://towardsdatascience.com/what-are-the-best-metrics-to-evaluate-your-regression-model-418ca481755b) as well)*

In [394]:
mean_absolute_error(y_test, y_preds) 

0.49842778666786136
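
To make the metric concrete, here is the same computation by hand on four made-up ratings (the arrays `y_true` and `y_pred` are illustrative, not the notebook's predictions): take the absolute difference for each film and average.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# made-up true ratings and predictions, for illustration only
y_true = np.array([8.0, 7.0, 6.5, 5.0])
y_pred = np.array([7.5, 7.5, 6.5, 6.0])

# MAE = mean of |y_true - y_pred|  ->  (0.5 + 0.5 + 0.0 + 1.0) / 4
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)  # both 0.5
```

Read this way, the value of about 0.50 obtained above means the predicted rating is off by roughly half a point on average.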

Try different values of the regularization parameter alpha when training the model. How do they affect the size of the error?
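
One way to organize this experiment is a loop over several alpha values. The sketch below runs on a synthetic regression problem from `make_regression` rather than on the film data, and the list of alpha values is an arbitrary assumption; in the notebook, `X_train`, `X_test`, `y_train` and `y_test` would take the place of the synthetic arrays.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# synthetic regression problem standing in for the film features
X_syn, y_syn = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=42)

results = {}
for alpha in [0.01, 1, 50, 1000]:
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    results[alpha] = {
        # error on the held-out part
        "mae": mean_absolute_error(y_te, model.predict(X_te)),
        # overall size of the learned coefficients
        "coef_norm": float(np.linalg.norm(model.coef_)),
    }
    print(alpha, results[alpha])
```

A larger alpha shrinks the coefficients toward zero; whether the test error goes down or up as a result depends on the data, which is exactly what the exercise asks you to observe.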