In this session we will try a regression task. The data is in the same folder; we will train on a dataset of movies from IMDB.

Before training a model, the data has to be prepared:

- find or collect the data
- clean and preprocess it
- convert it to matrices


In [1]:
# import the required libraries
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
# %matplotlib inline

# import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error


In [2]:
data = pd.read_csv('IMDB-Movie-Data.csv')
print(data.shape)

data.head(3)

(1000, 12)


Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0


## What to do with NaN?
I chose to drop the rows with NaN, because the NaN values are in the revenue and Metascore columns. I think setting them to zero would be incorrect: a zero would be read as a real value rather than a missing one.

In [3]:
# 1. Drop rows with NaN: first check which columns contain NaN
print(data.isna().any())
data.shape

Rank                  False
Title                 False
Genre                 False
Description           False
Director              False
Actors                False
Year                  False
Runtime (Minutes)     False
Rating                False
Votes                 False
Revenue (Millions)     True
Metascore              True
dtype: bool


(1000, 12)

In [4]:
print(data.shape)
data = data.dropna()
data.shape

(1000, 12)


(838, 12)
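
To illustrate why zero-filling was rejected, here is a minimal sketch in plain Python. The revenue values are hypothetical and `math.nan` stands in for a missing entry; it only shows how zeros pull a column's mean down compared with dropping the missing rows.

```python
import math

# Hypothetical revenue column (millions); math.nan marks a missing value.
revenue = [333.13, 126.46, math.nan, 270.32]

# Option used in this notebook: drop the missing entries.
observed = [v for v in revenue if not math.isnan(v)]

# Rejected option: replace missing entries with zero.
zero_filled = [0.0 if math.isnan(v) else v for v in revenue]

mean_observed = sum(observed) / len(observed)
mean_zero_filled = sum(zero_filled) / len(zero_filled)

print(f"mean after dropping NaN: {mean_observed:.2f}")
print(f"mean after zero-filling: {mean_zero_filled:.2f}")
```

The zero-filled mean is lower because every missing film is treated as if it had earned nothing.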

## Preparing the data

We will try to predict a movie's rating from its description, year, runtime in minutes, and box-office revenue; the feature set built below also includes the vote count and Metascore.

The "Rating" column becomes the **target variable** (y)<br>
The remaining columns become the **features** used for training (X)

In [5]:
data.Description

0      A group of intergalactic criminals are forced ...
1      Following clues to the origin of mankind, a te...
2      Three girls are kidnapped by a man with a diag...
3      In a city of humanoid animals, a hustling thea...
4      A secret government agency recruits some of th...
                             ...                        
993    While still out to destroy the evil Umbrella C...
994    3 high school seniors throw a birthday party t...
996    Three American college students studying abroa...
997    Romantic sparks occur between two dance studen...
999    A stuffy businessman finds himself trapped ins...
Name: Description, Length: 838, dtype: object

In [6]:
# prepare the movie descriptions: lowercase and split on whitespace
data["text"] = data.Description.apply(lambda x: x.lower().split()) 

data["text"]

0      [a, group, of, intergalactic, criminals, are, ...
1      [following, clues, to, the, origin, of, mankin...
2      [three, girls, are, kidnapped, by, a, man, wit...
3      [in, a, city, of, humanoid, animals,, a, hustl...
4      [a, secret, government, agency, recruits, some...
                             ...                        
993    [while, still, out, to, destroy, the, evil, um...
994    [3, high, school, seniors, throw, a, birthday,...
996    [three, american, college, students, studying,...
997    [romantic, sparks, occur, between, two, dance,...
999    [a, stuffy, businessman, finds, himself, trapp...
Name: text, Length: 838, dtype: object
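
Whitespace splitting leaves punctuation attached to words, as with `animals,` in the output above. The sketch below reproduces that effect on a single string. It also shows one possible standard-library alternative; the regular expression is an assumption chosen for illustration, not the tokenizer used in this notebook.

```python
import re

description = "In a city of humanoid animals, a hustling theater impresario's attempt"

# Same approach as the cell above: lowercase, then split on whitespace.
whitespace_tokens = description.lower().split()

# Illustrative alternative: keep only runs of letters, digits and apostrophes.
regex_tokens = re.findall(r"[a-z0-9']+", description.lower())

print(whitespace_tokens[5])  # the comma stays attached to the word
print(regex_tokens[5])       # the comma is gone
```

With whitespace splitting, `animals,` and `animals` count as different vocabulary entries, which splits their statistics in a corpus this small.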

In [7]:
data.text.values[:10]

array([list(['a', 'group', 'of', 'intergalactic', 'criminals', 'are', 'forced', 'to', 'work', 'together', 'to', 'stop', 'a', 'fanatical', 'warrior', 'from', 'taking', 'control', 'of', 'the', 'universe.']),
       list(['following', 'clues', 'to', 'the', 'origin', 'of', 'mankind,', 'a', 'team', 'finds', 'a', 'structure', 'on', 'a', 'distant', 'moon,', 'but', 'they', 'soon', 'realize', 'they', 'are', 'not', 'alone.']),
       list(['three', 'girls', 'are', 'kidnapped', 'by', 'a', 'man', 'with', 'a', 'diagnosed', '23', 'distinct', 'personalities.', 'they', 'must', 'try', 'to', 'escape', 'before', 'the', 'apparent', 'emergence', 'of', 'a', 'frightful', 'new', '24th.']),
       list(['in', 'a', 'city', 'of', 'humanoid', 'animals,', 'a', 'hustling', 'theater', "impresario's", 'attempt', 'to', 'save', 'his', 'theater', 'with', 'a', 'singing', 'competition', 'becomes', 'grander', 'than', 'he', 'anticipates', 'even', 'as', 'its', "finalists'", 'find', 'that', 'their', 'lives', 'will', 'never', 

In [8]:
input_text = list(data.text.values)

In [9]:
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(input_text)]
documents[10:12]

[TaggedDocument(words=['the', 'story', 'of', 'a', 'team', 'of', 'female', 'african-american', 'mathematicians', 'who', 'served', 'a', 'vital', 'role', 'in', 'nasa', 'during', 'the', 'early', 'years', 'of', 'the', 'u.s.', 'space', 'program.'], tags=[10]),
 TaggedDocument(words=['the', 'rebel', 'alliance', 'makes', 'a', 'risky', 'move', 'to', 'steal', 'the', 'plans', 'for', 'the', 'death', 'star,', 'setting', 'up', 'the', 'epic', 'saga', 'to', 'follow.'], tags=[11])]

Train the model on the movie description texts (feel free to vary the parameters).

In [10]:
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)



In [11]:
model.save("D2V.model") # save the model

In [12]:
# this is how to look at the vectors of the texts the model was trained on
# the index in [] after documents is the position of the text in the documents list

model[documents[0].tags[0]]


array([ 0.08111026, -0.09000318,  0.02996959, -0.0562246 ,  0.08877965],
      dtype=float32)

Now the vectors need to be added to the dataset with the other features.

In [13]:
# build a list with the vector for each text
vectors = []
for x in documents:
    vec = list(model[x.tags][0])
    vectors.append(vec)

In [14]:
# this gives a dataframe with each vector component in its own column
split_df = pd.DataFrame(vectors,
                        columns=['v1', 'v2', 'v3','v4',"v5"])

split_df


Unnamed: 0,v1,v2,v3,v4,v5
0,0.081110,-0.090003,0.029970,-0.056225,0.088780
1,-0.118023,-0.011983,-0.093576,0.069062,0.060961
2,0.030247,0.024422,-0.000969,-0.092455,0.025182
3,0.042027,-0.041914,-0.055413,0.051266,0.042590
4,0.034391,0.070055,-0.116559,-0.087874,0.006900
...,...,...,...,...,...
833,0.010773,-0.063559,-0.159734,-0.084096,0.009784
834,-0.106769,0.050991,-0.083262,-0.029523,0.086028
835,-0.093891,0.021157,0.051523,-0.003181,-0.059544
836,-0.064760,-0.041579,0.005948,0.038824,0.026079


In [15]:
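# now add it to the main dataframe
# NOTE: join aligns on index labels; data keeps its original labels after dropna(),
# while split_df is numbered 0..837, so some rows get NaN or a shifted vector
result = data.join(split_df, how='left')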
result.shape

(838, 18)
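
`DataFrame.join` matches rows by index label, not by position. After `dropna()` the main dataframe keeps its original labels with gaps, while a dataframe built from a plain list is labeled from 0. The sketch below uses hypothetical toy frames to show the effect, along with one possible way to keep the rows aligned by reusing the left frame's index. This is consistent with the row count falling from 838 to 712 at the later `dropna()`.

```python
import pandas as pd

# Toy frame whose index has a gap, as after dropna() (label 1 was dropped).
left = pd.DataFrame({"title": ["a", "c", "d"]}, index=[0, 2, 3])

# One vector per remaining row, in row order.
vectors = [[0.1], [0.2], [0.3]]

# Built from a list: gets a fresh 0..2 index.
fresh_index = pd.DataFrame(vectors, columns=["v1"])
# Built with the left frame's own index: labels match row for row.
same_index = pd.DataFrame(vectors, columns=["v1"], index=left.index)

misaligned = left.join(fresh_index, how="left")
aligned = left.join(same_index, how="left")

print(misaligned)  # label 2 receives the third vector, label 3 receives NaN
print(aligned)     # each row receives its own vector
```

In the notebook, the equivalent change would be passing `index=data.index` when building `split_df`. The outputs shown in this notebook were produced without it.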

In [16]:
result.head()

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore,text,v1,v2,v3,v4,v5
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0,"[a, group, of, intergalactic, criminals, are, ...",0.08111,-0.090003,0.02997,-0.056225,0.08878
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0,"[following, clues, to, the, origin, of, mankin...",-0.118023,-0.011983,-0.093576,0.069062,0.060961
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0,"[three, girls, are, kidnapped, by, a, man, wit...",0.030247,0.024422,-0.000969,-0.092455,0.025182
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0,"[in, a, city, of, humanoid, animals,, a, hustl...",0.042027,-0.041914,-0.055413,0.051266,0.04259
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0,"[a, secret, government, agency, recruits, some...",0.034391,0.070055,-0.116559,-0.087874,0.0069


In [17]:
# redefine the dataset, keeping only what matters

data_sm = result[['Runtime (Minutes)',"Year",
                'Rating', 'Votes',
                'Revenue (Millions)','Metascore',"v1","v2","v3","v4","v5"]
              ]
data_sm = data_sm.dropna()

data_sm.head(3)

Unnamed: 0,Runtime (Minutes),Year,Rating,Votes,Revenue (Millions),Metascore,v1,v2,v3,v4,v5
0,121,2014,8.1,757074,333.13,76.0,0.08111,-0.090003,0.02997,-0.056225,0.08878
1,124,2012,7.0,485820,126.46,65.0,-0.118023,-0.011983,-0.093576,0.069062,0.060961
2,117,2016,7.3,157606,138.12,62.0,0.030247,0.024422,-0.000969,-0.092455,0.025182


## Preparing the matrices

In [18]:
# define X and y

X = data_sm.drop(["Rating"],axis=1).values 

display(X, X.shape)

array([[ 1.21000000e+02,  2.01400000e+03,  7.57074000e+05, ...,
         2.99695916e-02, -5.62245958e-02,  8.87796506e-02],
       [ 1.24000000e+02,  2.01200000e+03,  4.85820000e+05, ...,
        -9.35755372e-02,  6.90623075e-02,  6.09612428e-02],
       [ 1.17000000e+02,  2.01600000e+03,  1.57606000e+05, ...,
        -9.68868728e-04, -9.24545005e-02,  2.51820832e-02],
       ...,
       [ 1.08000000e+02,  2.01400000e+03,  3.88040000e+04, ...,
         5.15228920e-02, -3.18115461e-03, -5.95437735e-02],
       [ 1.28000000e+02,  2.01600000e+03,  5.53100000e+03, ...,
         5.94843877e-03,  3.88236344e-02,  2.60794349e-02],
       [ 1.13000000e+02,  2.00800000e+03,  1.63144000e+05, ...,
         4.44201678e-02, -6.74552619e-02, -5.86266778e-02]])

(712, 10)

In [19]:
data_sm.isna().any()

Runtime (Minutes)     False
Year                  False
Rating                False
Votes                 False
Revenue (Millions)    False
Metascore             False
v1                    False
v2                    False
v3                    False
v4                    False
v5                    False
dtype: bool

In [20]:
y = data_sm['Rating'].values # the target array with the movie ratings
y.shape

(712,)

It is sometimes useful to [normalize](https://en.wikipedia.org/wiki/Normalization_(statistics)) the data: this helps when features are measured in different units and therefore on very different scales.
StandardScaler is used for this: it rescales each feature to zero mean and unit variance.

Before normalization:

In [21]:
list(X[0])

[121.0,
 2014.0,
 757074.0,
 333.13,
 76.0,
 0.08111026138067245,
 -0.09000318497419357,
 0.02996959164738655,
 -0.05622459575533867,
 0.08877965062856674]

In [22]:
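# use the standard scaler
sc = StandardScaler()

# NOTE: the scaler is fitted on all of X before the split, so test-set statistics leak into the scaling
X_train, X_test, y_train, y_test = train_test_split(sc.fit_transform(X), y, random_state=42)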

After:

In [23]:
list(sc.fit_transform(X)[0])

[0.291007552571261,
 0.4090342348627316,
 2.7018134034374643,
 2.2071532994957423,
 0.9140238266037984,
 1.9147674359228783,
 -1.4559039575669104,
 1.59287808281779,
 -0.3503958191137472,
 0.998203034860476]

Now the data is easier to work with and to train on.
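
The standardization itself is simple arithmetic: subtract the mean and divide by the standard deviation. The sketch below does this by hand in plain Python on hypothetical runtimes. It estimates the statistics on the training part only and reuses them for the test part, which is one common way to keep test data out of the preprocessing.

```python
import statistics

# Hypothetical runtimes in minutes, already split into train and test parts.
train = [121.0, 124.0, 117.0, 108.0]
test = [123.0, 103.0]

# "Fit": estimate the mean and population standard deviation on the training part only.
mean = statistics.fmean(train)
std = statistics.pstdev(train)

# "Transform": apply the same training statistics to both parts.
train_scaled = [(x - mean) / std for x in train]
test_scaled = [(x - mean) / std for x in test]

print(train_scaled)
print(test_scaled)
```

With scikit-learn this corresponds to calling `sc.fit(X_train)` after the split and then `sc.transform(...)` on each part, instead of `sc.fit_transform(X)` on the whole matrix.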

In [24]:
# define the regression model
# the regularization strength can be varied with the alpha parameter
regressor = Ridge() 


# train
regressor.fit(X_train, y_train)

Ridge()

In [25]:
# predict the results for the test set

y_preds = regressor.predict(X_test)

### Evaluating the results

As the metric we will use the [mean absolute error](https://www.youtube.com/watch?v=ZejnwbcU8nw). It shows the deviation from the correct answer in the same units as the target, here rating points.

*(there are also [other ways](https://towardsdatascience.com/what-are-the-best-metrics-to-evaluate-your-regression-model-418ca481755b) to evaluate a regression model)*

In [26]:
mean_absolute_error(y_test, y_preds) 

0.4241282274750051

In [27]:
# mean_squared_error was already imported in the first cell
mean_squared_error(y_test, y_preds)

0.3376602381647097

In [28]:
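import math
# RMSE is the square root of MSE; the squared=False argument was deprecated and later removed in newer scikit-learn releases
math.sqrt(mean_squared_error(y_test, y_preds))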

0.5810853966197307
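
The three metrics above can be reproduced by hand. The sketch below uses hypothetical ratings and predictions rather than the notebook's data, and shows how MAE, MSE and RMSE are related.

```python
import math

y_true = [8.1, 7.0, 7.3, 6.2]  # hypothetical true ratings
y_pred = [7.6, 7.2, 7.0, 6.9]  # hypothetical predictions

errors = [p - t for p, t in zip(y_pred, y_true)]

# MAE: average size of the error, in rating points.
mae = sum(abs(e) for e in errors) / len(errors)
# MSE: average squared error, in squared rating points; large errors weigh more.
mse = sum(e * e for e in errors) / len(errors)
# RMSE: square root of MSE, back in rating points.
rmse = math.sqrt(mse)

print(f"MAE={mae:.4f} MSE={mse:.4f} RMSE={rmse:.4f}")
```

RMSE is never smaller than MAE on the same errors, and the gap grows when a few predictions are far off.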

Now take linear regression

In [33]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_preds = regressor.predict(X_test)
print("MAE: " + str(mean_absolute_error(y_test, y_preds)))
print("MSE: " + str(mean_squared_error(y_test, y_preds)))
print("RMSE: " + str(math.sqrt(mean_squared_error(y_test, y_preds))))

MAE: 0.42413821121834794
MSE: 0.33770380845810277
RMSE: 0.5811228858495445


Lasso

In [34]:
regressor = Lasso()
regressor.fit(X_train, y_train)
y_preds = regressor.predict(X_test)
print("MAE: " + str(mean_absolute_error(y_test, y_preds)))
print("MSE: " + str(mean_squared_error(y_test, y_preds)))
print("RMSE: " + str(math.sqrt(mean_squared_error(y_test, y_preds))))

MAE: 0.7520536127593316
MSE: 0.8563704428453197
RMSE: 0.9254028543533458


Now we can try to improve the pipeline. First, change the data processing: remove stop words and punctuation.

In [44]:
data = pd.read_csv('IMDB-Movie-Data.csv')
print(data.shape)

data.head(3)

(1000, 12)


Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0


In [45]:
# 1. Drop rows with NaN
print(data.isna().any())
data.shape

print(data.shape)
data = data.dropna()
data.shape

Rank                  False
Title                 False
Genre                 False
Description           False
Director              False
Actors                False
Year                  False
Runtime (Minutes)     False
Rating                False
Votes                 False
Revenue (Millions)     True
Metascore              True
dtype: bool
(1000, 12)


(838, 12)

Improve the tokenization and remove punctuation and stop words.

In [53]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
data["text"] = data.Description.apply(lambda x: word_tokenize(x)) 

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [54]:
import string
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = stopwords.words('english')
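def remove_punctuations_and_stop_words(list_of_str):
    """Return the tokens that are neither punctuation nor English stop words."""
    new_list = []
    for word in list_of_str:
        # NOTE: the check is case-sensitive, so capitalized stop words ("A", "The") are kept
        if word not in string.punctuation and word not in stop_words:
            new_list.append(word)
    return new_list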

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [57]:
data["text"] = data.text.apply(lambda x: remove_punctuations_and_stop_words(x)) 

In [58]:
data["text"][:10]

0     [A, group, intergalactic, criminals, forced, w...
1     [Following, clues, origin, mankind, team, find...
2     [Three, girls, kidnapped, man, diagnosed, 23, ...
3     [In, city, humanoid, animals, hustling, theate...
4     [A, secret, government, agency, recruits, dang...
5     [European, mercenaries, searching, black, powd...
6     [A, jazz, pianist, falls, aspiring, actress, L...
8     [A, true-life, drama, centering, British, expl...
9     [A, spacecraft, traveling, distant, colony, pl...
10    [The, adventures, writer, Newt, Scamander, New...
Name: text, dtype: object
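
The output above still contains `A`, `In` and `The`. This pipeline no longer lowercases the text, and NLTK's English stop-word list is lowercase. The sketch below reproduces the effect with a tiny hypothetical stop-word set and shows a case-insensitive variant as one possible adjustment.

```python
import string

# Tiny hypothetical stop-word set; it is lowercase, like NLTK's English list.
stop_words = {"a", "of", "the", "in"}
tokens = ["A", "group", "of", "criminals", ",", "In", "space", "."]

# Same check as in the function above: capitalized stop words slip through.
case_sensitive = [t for t in tokens
                  if t not in string.punctuation and t not in stop_words]

# Variant: compare the lowercased token and keep the lowercased form.
case_insensitive = [t.lower() for t in tokens
                    if t not in string.punctuation and t.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```

Lowercasing before the check also merges `The` and `the` into one vocabulary entry, as the first pipeline did.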

In [59]:
input_text = list(data.text.values)
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(input_text)]
documents[10:12]
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)
vectors = []
for x in documents:
    vec = list(model[x.tags][0])
    vectors.append(vec)
split_df = pd.DataFrame(vectors,
                        columns=['v1', 'v2', 'v3','v4',"v5"])

result = data.join(split_df, how='left')
data_sm = result[['Runtime (Minutes)',"Year",
                'Rating', 'Votes',
                'Revenue (Millions)','Metascore',"v1","v2","v3","v4","v5"]
              ]
data_sm = data_sm.dropna()
X = data_sm.drop(["Rating"],axis=1).values 
y = data_sm['Rating'].values # the target array with the movie ratings



In [70]:
sc = StandardScaler()

X_train, X_test, y_train, y_test = train_test_split(sc.fit_transform(X), y, random_state=42)

Train the models on the data with the preprocessed texts.

In [71]:
regressor = Ridge() 
regressor.fit(X_train, y_train)
y_preds = regressor.predict(X_test)
print("MAE: " + str(mean_absolute_error(y_test, y_preds)))
print("MSE: " + str(mean_squared_error(y_test, y_preds)))
print("RMSE: " + str(math.sqrt(mean_squared_error(y_test, y_preds))))

MAE: 0.4274951160218614
MSE: 0.3408081095282528
RMSE: 0.5837877264282393


In [65]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_preds = regressor.predict(X_test)
print("MAE: " + str(mean_absolute_error(y_test, y_preds)))
print("MSE: " + str(mean_squared_error(y_test, y_preds)))
print("RMSE: " + str(math.sqrt(mean_squared_error(y_test, y_preds))))

MAE: 0.4274995306074815
MSE: 0.34088230040220774
RMSE: 0.5838512656509428


In [66]:
regressor = Lasso()
regressor.fit(X_train, y_train)
y_preds = regressor.predict(X_test)
print("MAE: " + str(mean_absolute_error(y_test, y_preds)))
print("MSE: " + str(mean_squared_error(y_test, y_preds)))
print("RMSE: " + str(math.sqrt(mean_squared_error(y_test, y_preds))))

MAE: 0.7520536127593316
MSE: 0.8563704428453197
RMSE: 0.9254028543533458


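As can be seen, the results did not change much for any of the models. Lasso's metrics are identical to the previous run down to the last digit, which suggests that with the default alpha=1.0 it gives the text-vector features zero weight.

Let's try new hyperparameters: change alpha and increase the number of iterations.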

In [84]:
regressor = Ridge(alpha=0.5, max_iter=20000) 
regressor.fit(X_train, y_train)
y_preds = regressor.predict(X_test)
print("MAE: " + str(mean_absolute_error(y_test, y_preds)))
print("MSE: " + str(mean_squared_error(y_test, y_preds)))
print("RMSE: " + str(math.sqrt(mean_squared_error(y_test, y_preds))))

MAE: 0.4274972116110509
MSE: 0.3408444598179998
RMSE: 0.5838188587378793


In [86]:
regressor = Ridge(alpha=0.001, max_iter=20000) 
regressor.fit(X_train, y_train)
y_preds = regressor.predict(X_test)
print("MAE: " + str(mean_absolute_error(y_test, y_preds)))
print("MSE: " + str(mean_squared_error(y_test, y_preds)))
print("RMSE: " + str(math.sqrt(mean_squared_error(y_test, y_preds))))

MAE: 0.42749952574422356
MSE: 0.3408822232195538
RMSE: 0.5838511995530657


Changing the parameters has almost no effect on Ridge. Now let's try Lasso.

In [91]:
regressor = Lasso(alpha=0.5, max_iter=20000) 
regressor.fit(X_train, y_train)
y_preds = regressor.predict(X_test)
print("MAE: " + str(mean_absolute_error(y_test, y_preds)))
print("MSE: " + str(mean_squared_error(y_test, y_preds)))
print("RMSE: " + str(math.sqrt(mean_squared_error(y_test, y_preds))))

MAE: 0.6950468432083461
MSE: 0.7471891419735718
RMSE: 0.8644010307568888


In [93]:
# NOTE: this cell instantiates Ridge, not Lasso, so it reproduces the Ridge result from cell [86]
regressor = Ridge(alpha=0.001, max_iter=20000) 
regressor.fit(X_train, y_train)
y_preds = regressor.predict(X_test)
print("MAE: " + str(mean_absolute_error(y_test, y_preds)))
print("MSE: " + str(mean_squared_error(y_test, y_preds)))
print("RMSE: " + str(math.sqrt(mean_squared_error(y_test, y_preds))))

MAE: 0.42749952574422356
MSE: 0.3408822232195538
RMSE: 0.5838511995530657


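Lowering alpha from the default 1.0 to 0.5 improved Lasso (MAE 0.752 -> 0.695). Note that the cell above instantiates Ridge rather than Lasso, so its MAE of 0.4275 repeats the Ridge result and does not show the effect of alpha=0.001 on Lasso.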

In [94]:
regressor = Ridge(alpha=0.0000001, max_iter=20000) 
regressor.fit(X_train, y_train)
y_preds = regressor.predict(X_test)
print("MAE: " + str(mean_absolute_error(y_test, y_preds)))
print("MSE: " + str(mean_squared_error(y_test, y_preds)))
print("RMSE: " + str(math.sqrt(mean_squared_error(y_test, y_preds))))

MAE: 0.4274995306069951
MSE: 0.34088230039448925
RMSE: 0.5838512656443328


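Making alpha even smaller changes practically nothing. This cell also uses Ridge: as alpha approaches 0, Ridge converges to ordinary least squares, and the metrics match the LinearRegression results above (MAE 0.42750). It seems the limit for this feature set has been reached.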
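
One way to see why alpha affects the two models so differently is to look at the closed-form solutions for a single standardized feature under scikit-learn's documented objectives. Lasso subtracts alpha from the coefficient directly (soft thresholding), while Ridge weighs alpha against the number of training samples. The sketch below is a simplification under that one-feature assumption; `b_ols` and `n_samples` are hypothetical values, with 534 chosen to mirror a 75% split of 712 rows.

```python
def soft_threshold(b, alpha):
    """Lasso coefficient for one standardized feature: shrink b toward zero by alpha."""
    if b > alpha:
        return b - alpha
    if b < -alpha:
        return b + alpha
    return 0.0


def ridge_shrink(b, alpha, n_samples):
    """Ridge coefficient for one standardized feature: alpha is weighed against n_samples."""
    return b / (1.0 + alpha / n_samples)


b_ols = 0.4      # hypothetical unregularized coefficient
n_samples = 534  # hypothetical training-set size

for alpha in (1.0, 0.5, 0.001):
    lasso_coef = soft_threshold(b_ols, alpha)
    ridge_coef = ridge_shrink(b_ols, alpha, n_samples)
    print(f"alpha={alpha}: lasso={lasso_coef:.4f} ridge={ridge_coef:.4f}")
```

Under these assumptions, Lasso with alpha=1.0 or 0.5 removes the feature entirely, while Ridge barely moves the coefficient for any alpha in this range. That matches the pattern in the metrics above.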