In this session we will try a regression task. The data is in the same folder; we will train on a dataset of movies from IMDB.

Before training a model, the data has to be prepared:

- find or collect the data
- clean and preprocess it
- convert it into matrices


In [1]:
# import the required libraries
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
# %matplotlib inline

# import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [22]:
data = pd.read_csv('/content/drive/MyDrive/ml_2023/IMDB-Movie-Data.csv')
print(data.shape)

data.head(3)

(1000, 12)


Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0


In [23]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Rank                1000 non-null   int64  
 1   Title               1000 non-null   object 
 2   Genre               1000 non-null   object 
 3   Description         1000 non-null   object 
 4   Director            1000 non-null   object 
 5   Actors              1000 non-null   object 
 6   Year                1000 non-null   int64  
 7   Runtime (Minutes)   1000 non-null   int64  
 8   Rating              1000 non-null   float64
 9   Votes               1000 non-null   int64  
 10  Revenue (Millions)  872 non-null    float64
 11  Metascore           936 non-null    float64
dtypes: float64(3), int64(4), object(5)
memory usage: 93.9+ KB


## What to do with NaN?
There are 3 options

In [7]:
# 1. Drop the rows that contain NaN
print(data.isna().any())
data.shape

Rank                  False
Title                 False
Genre                 False
Description           False
Director              False
Actors                False
Year                  False
Runtime (Minutes)     False
Rating                False
Votes                 False
Revenue (Millions)     True
Metascore              True
dtype: bool


(1000, 12)

In [None]:
print(data.shape)
tmp = data.dropna()
tmp.shape

In [None]:
# 2. Replace NaN with 0
print(data.shape)
tmp = data.fillna(0)
print(tmp.shape)

In [21]:
# 3. Replace NaN with the column mean

# compute the means of the columns that contain missing values
meta_mean = data['Metascore'].mean()
rev_mean = data['Revenue (Millions)'].mean()

# replace the missing values with the means
# (assigning the result back avoids chained in-place fillna, which newer pandas versions warn about)
data['Metascore'] = data['Metascore'].fillna(meta_mean)
data['Revenue (Millions)'] = data['Revenue (Millions)'].fillna(rev_mean)

# check whether any NaN are left
data.isna().any()

Rank                  False
Title                 False
Genre                 False
Description           False
Director              False
Actors                False
Year                  False
Runtime (Minutes)     False
Rating                False
Votes                 False
Revenue (Millions)    False
Metascore             False
dtype: bool
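
The three options can be compared side by side on a tiny frame. The sketch below uses a hypothetical four-row table with made-up values (only the column names are borrowed from the dataset), so it illustrates the behaviour of `dropna` and `fillna` and does not reproduce the notebook's results.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column names mirror the dataset, the values are made up.
toy = pd.DataFrame({
    "Rating": [8.1, 7.0, 7.3, 6.2],
    "Metascore": [76.0, np.nan, 62.0, 40.0],
    "Revenue (Millions)": [333.13, 126.46, np.nan, 325.02],
})

# 1. Drop every row that has at least one NaN.
dropped = toy.dropna()

# 2. Replace NaN with 0.
zero_filled = toy.fillna(0)

# 3. Replace NaN with the mean of the corresponding column.
mean_filled = toy.fillna(toy.mean(numeric_only=True))

print(dropped.shape)                      # only the complete rows survive
print(zero_filled["Metascore"].tolist())  # the gap becomes 0.0
print(mean_filled["Metascore"].tolist())  # the gap becomes the column mean
```

Dropping rows loses data, zeros can distort a column whose real values are far from zero, and the mean keeps the column average unchanged, which is presumably why the notebook goes with the third option.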

## Подготовка данных

Let's try to predict a movie's rating from its description, year, runtime in minutes, and box-office revenue (the feature matrix built below also includes Votes and Metascore)

The "Rating" column becomes the **target variable, or target** (y)<br>
The remaining columns form the **training features** (X)

In [10]:
# prepare the movie descriptions
data["text"] = data.Description.apply(lambda x: x.lower().split()) 

data["text"]

0      [a, group, of, intergalactic, criminals, are, ...
1      [following, clues, to, the, origin, of, mankin...
2      [three, girls, are, kidnapped, by, a, man, wit...
3      [in, a, city, of, humanoid, animals,, a, hustl...
4      [a, secret, government, agency, recruits, some...
                             ...                        
995    [a, tight-knit, team, of, rising, investigator...
996    [three, american, college, students, studying,...
997    [romantic, sparks, occur, between, two, dance,...
998    [a, pair, of, friends, embark, on, a, mission,...
999    [a, stuffy, businessman, finds, himself, trapp...
Name: text, Length: 1000, dtype: object

In [None]:
data.text.values

In [12]:
input_text = list(data.text.values)

In [13]:
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(input_text)]
documents[10:12]

[TaggedDocument(words=['the', 'adventures', 'of', 'writer', 'newt', 'scamander', 'in', 'new', "york's", 'secret', 'community', 'of', 'witches', 'and', 'wizards', 'seventy', 'years', 'before', 'harry', 'potter', 'reads', 'his', 'book', 'in', 'school.'], tags=[10]),
 TaggedDocument(words=['the', 'story', 'of', 'a', 'team', 'of', 'female', 'african-american', 'mathematicians', 'who', 'served', 'a', 'vital', 'role', 'in', 'nasa', 'during', 'the', 'early', 'years', 'of', 'the', 'u.s.', 'space', 'program.'], tags=[11])]

Train the model on the movie descriptions (feel free to vary the parameters)

In [14]:
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)



In [None]:
model.save("D2V.model") # save the model

In [15]:
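# gensim 4.x already provides model.dv; older versions only have model.docvecs
model.dv = model.dv if hasattr(model, "dv") else model.docvecs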

In [16]:
# this is how to look at the vectors of the texts the model was trained on
# the index in [] after documents is the index of the text in the dataset

model.dv[documents[0].tags[0]]


array([ 0.09170343, -0.05824028, -0.0694524 , -0.04951696,  0.06817611],
      dtype=float32)

Now the vectors need to be added to the dataset with the other features

In [17]:
# build a list with the vector of each text
vectors = []
for x in documents:
    vec = list(model.dv[x.tags[0]])
    vectors.append(vec)

In [18]:
# this gives a dataframe with each vector component in its own column
split_df = pd.DataFrame(vectors,
                        columns=['v1', 'v2', 'v3','v4',"v5"])

split_df


Unnamed: 0,v1,v2,v3,v4,v5
0,0.091703,-0.058240,-0.069452,-0.049517,0.068176
1,0.058295,-0.089381,-0.082290,0.091599,0.088947
2,-0.006920,-0.141415,-0.032521,-0.055858,0.036913
3,0.069653,0.001282,-0.143914,0.140722,0.117188
4,0.131635,-0.059470,-0.031763,0.130147,0.038256
...,...,...,...,...,...
995,0.089319,-0.180330,-0.120867,0.083530,0.172534
996,0.016443,-0.107087,-0.005368,0.062964,0.055797
997,-0.008586,-0.062023,-0.069987,0.000394,0.157436
998,-0.009593,-0.118604,-0.046510,-0.025768,0.077008


In [None]:
# now join it to the main dataframe
result = data.join(split_df, how='left')
result.shape

(1000, 18)
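
`DataFrame.join` matches rows by index label, not by position. That works here because both frames have the default 0..999 index. The sketch below uses hypothetical toy frames to show what could happen if the main frame had been filtered first (for example with `dropna`) without resetting its index.

```python
import pandas as pd

# Hypothetical toy data: three movies and two-dimensional vectors for them.
movies = pd.DataFrame({"Title": ["A", "B", "C"], "Rating": [8.1, 7.0, 7.3]})
vecs = pd.DataFrame([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], columns=["v1", "v2"])

# Both frames share the index 0, 1, 2, so every row finds its vector.
joined = movies.join(vecs, how="left")
print(joined.shape)

# After filtering, the surviving rows keep their old labels (0 and 2).
filtered = movies[movies["Rating"] > 7.0]

# Vectors built for the two remaining rows would be labelled 0 and 1,
# so the row labelled 2 finds no match and gets NaN.
rejoined = filtered.join(vecs.iloc[:2], how="left")
print(rejoined)
```

Calling `reset_index(drop=True)` on the filtered frame before the join would be one way to keep the labels aligned.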

In [None]:
result.head(3)

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore,text,v1,v2,v3,v4,v5
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0,"[a, group, of, intergalactic, criminals, are, ...",-0.111846,0.089171,0.103931,-0.037813,0.022978
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0,"[following, clues, to, the, origin, of, mankin...",-0.096553,-0.069903,0.083971,0.095592,0.027149
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0,"[three, girls, are, kidnapped, by, a, man, wit...",-0.110882,0.101169,0.151963,-0.000888,-0.004631


In [None]:
# redefine the dataset, keeping only what matters

data_sm = result[['Runtime (Minutes)',"Year",
                'Rating', 'Votes',
                'Revenue (Millions)','Metascore',"v1","v2","v3","v4","v5"]
              ]


data_sm.head(3)

Unnamed: 0,Runtime (Minutes),Year,Rating,Votes,Revenue (Millions),Metascore,v1,v2,v3,v4,v5
0,121,2014,8.1,757074,333.13,76.0,-0.111846,0.089171,0.103931,-0.037813,0.022978
1,124,2012,7.0,485820,126.46,65.0,-0.096553,-0.069903,0.083971,0.095592,0.027149
2,117,2016,7.3,157606,138.12,62.0,-0.110882,0.101169,0.151963,-0.000888,-0.004631


## Preparing the matrices

In [None]:
result = data.join(split_df, how='left')
result.shape
data_sm = result[['Runtime (Minutes)',"Year",
                'Rating', 'Votes',
                'Revenue (Millions)','Metascore',"v1","v2","v3","v4","v5"]
              ]
X = data_sm.drop(["Rating"],axis=1).values 
y = data_sm['Rating'].values
sc = StandardScaler()
X_sc = sc.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_sc, y, random_state=42)

In [None]:
# define X and y

X = data_sm.drop(["Rating"],axis=1).values 

display(X, X.shape)

array([[ 1.21000000e+02,  2.01400000e+03,  7.57074000e+05, ...,
         1.03931241e-01, -3.78133208e-02,  2.29780748e-02],
       [ 1.24000000e+02,  2.01200000e+03,  4.85820000e+05, ...,
         8.39705765e-02,  9.55922604e-02,  2.71488763e-02],
       [ 1.17000000e+02,  2.01600000e+03,  1.57606000e+05, ...,
         1.51963025e-01, -8.87991395e-04, -4.63103969e-03],
       ...,
       [ 9.80000000e+01,  2.00800000e+03,  7.06990000e+04, ...,
         8.36219117e-02,  6.49447888e-02, -2.95825093e-03],
       [ 9.30000000e+01,  2.01400000e+03,  4.88100000e+03, ...,
        -1.53655689e-02,  7.95397311e-02, -2.88322978e-02],
       [ 8.70000000e+01,  2.01600000e+03,  1.24350000e+04, ...,
         1.49934873e-01,  9.36935246e-02, -2.82350294e-02]])

(1000, 10)

In [None]:
data_sm.isna().any()

Runtime (Minutes)     False
Year                  False
Rating                False
Votes                 False
Revenue (Millions)    False
Metascore             False
v1                    False
v2                    False
v3                    False
v4                    False
v5                    False
dtype: bool

In [None]:
y = data_sm['Rating'].values # the target array (movie ratings), kept separately
y.shape

(1000,)

Sometimes it is useful to [normalize](https://en.wikipedia.org/wiki/Normalization_(statistics)) the data: this helps when the features are measured in different units and therefore live on very different scales. 
StandardScaler is used for this: it rescales every feature to zero mean and unit variance. 

Before normalization:

In [None]:
list(X[0])

[121.0,
 2014.0,
 757074.0,
 333.13,
 76.0,
 -0.11184559762477875,
 0.0891706570982933,
 0.1039312407374382,
 -0.03781332075595856,
 0.022978074848651886]

## Training

In [None]:
# apply the standard scaler
sc = StandardScaler()
X_sc = sc.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_sc, y, random_state=42)

After:

In [None]:
list(sc.fit_transform(X)[0])

[0.4163497512303056,
 0.37979525138136244,
 3.1126899627963738,
 2.5961363010556906,
 1.0233613578368184,
 -1.2049451827509656,
 1.4336877839173505,
 -0.5096592609491106,
 -0.5468519085754212,
 -0.35497945399934877]

Now the data is easier to work with and to train on
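
What the scaler does can be written out by hand: subtract the column mean and divide by the column standard deviation. The sketch below does this with NumPy on a hypothetical four-row matrix (made-up runtime and vote counts). Unlike the cell above, which fits the scaler on all of `X` before splitting, it computes the statistics on the training rows only and reuses them for the test rows. That is a common way to keep test information out of preprocessing, shown here as an alternative and not as what the notebook does.

```python
import numpy as np

# Hypothetical toy matrix: runtime in minutes and number of votes (made-up values).
X_toy = np.array([
    [121.0, 757074.0],
    [124.0, 485820.0],
    [117.0, 157606.0],
    [108.0,  60545.0],
])

# Pretend the first three rows are the training part and the last one is the test part.
X_tr, X_te = X_toy[:3], X_toy[3:]

# Column statistics are computed on the training rows only.
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

X_tr_sc = (X_tr - mu) / sigma
X_te_sc = (X_te - mu) / sigma  # the test rows reuse the training statistics

print(X_tr_sc.mean(axis=0))  # approximately 0 in every column
print(X_tr_sc.std(axis=0))   # approximately 1 in every column
```

With scikit-learn the same idea would be `sc.fit_transform(X_train)` followed by `sc.transform(X_test)`.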

In [None]:
# define the regression model
# the regularization strength can be varied with the alpha parameter
regressor = Ridge() 


# train
regressor.fit(X_train, y_train)

Ridge()
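
To see what "regularization strength" means, here is a minimal NumPy sketch of ridge regression in closed form, w = (XᵀX + αI)⁻¹Xᵀy. It runs on synthetic data and, as a simplifying assumption, has no intercept term, whereas scikit-learn's `Ridge` fits an unpenalized intercept by default. The helper name `ridge_weights` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, made-up data: 50 samples, 3 features, known weights plus noise.
X_toy = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
y_toy = X_toy @ true_w + rng.normal(scale=0.1, size=50)


def ridge_weights(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y for w (no intercept term)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


# Larger alpha pulls the weight vector towards zero.
for alpha in [0.0, 1.0, 10.0, 100.0]:
    w = ridge_weights(X_toy, y_toy, alpha)
    print(alpha, np.round(w, 3), round(float(np.linalg.norm(w)), 3))
```

With `alpha=0` the solution coincides with ordinary least squares. As alpha grows, the norm of the weights shrinks, trading some fit on the training data for smaller, more stable coefficients.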

In [None]:
# let's predict the result for the test set

y_preds = regressor.predict(X_test)

### Evaluating the results

We will use the [mean absolute error](https://www.youtube.com/watch?v=ZejnwbcU8nw) as the metric. It shows the average deviation from the correct answer, in the same units as the target

*(in general, there are [other options](https://towardsdatascience.com/what-are-the-best-metrics-to-evaluate-your-regression-model-418ca481755b))*
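
For reference, the mean absolute error and the two related metrics used later in the notebook can be computed by hand. The numbers below are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Made-up true ratings and predictions, for illustration only.
y_true = np.array([8.0, 7.0, 6.5, 5.0])
y_hat = np.array([7.5, 7.0, 7.0, 6.0])

errors = y_true - y_hat

mae = np.mean(np.abs(errors))  # mean absolute error
mse = np.mean(errors ** 2)     # mean squared error, in squared units of the target
rmse = np.sqrt(mse)            # root mean squared error, back in the target's units

print(mae, mse, rmse)
```

Here the MAE is 0.5, meaning the predictions are off by half a rating point on average. The MSE and RMSE weigh the single one-point miss more heavily than the smaller errors.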

In [None]:
mean_absolute_error(y_test, y_preds) 

0.48670878276592205

## Hyperparameters

Try different values of the regularization parameter alpha when training the model. How do they affect the size of the error?

- Let's try to pick the model parameters alpha and solver with GridSearchCV



In [None]:
from sklearn.model_selection import GridSearchCV
parameters = {'alpha':[0.01, 0.1, 1, 10, 100], 'solver':('auto', 'svd', 'cholesky', 'sparse_cg')}
ridge = Ridge()
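clf = GridSearchCV(ridge, parameters)  # defaults: 5-fold CV, scored with the estimator's own score (R^2)
clf.fit(X, y)  # note: this search runs on the full, unscaled X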
print("Ridge best estimator: ", clf.best_estimator_)
print("Ridge best score: ", clf.best_score_)

Ridge best estimator:  Ridge(alpha=10, solver='svd')
Ridge best score:  0.4609104960391699
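
The "best score" printed by `GridSearchCV` is, by default, the estimator's own score (R² for `Ridge`), so it is not directly comparable with the MAE values reported elsewhere. A possible variation, sketched below on synthetic data, puts the scaler and the model into one `Pipeline`, searches on the training split only, and scores with MAE. The data, the step names `scale` and `ridge`, and the grid are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for the feature matrix and the ratings (made-up data).
X_toy = rng.normal(size=(200, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y_toy = X_toy @ np.array([0.5, 0.05, 0.005, 0.0005]) + rng.normal(scale=0.3, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)

# Scaling is part of the pipeline, so it is re-fitted inside every CV fold.
pipe = Pipeline([("scale", StandardScaler()), ("ridge", Ridge())])
grid = {"ridge__alpha": [0.01, 0.1, 1, 10, 100]}

# The score is negated MAE, because GridSearchCV always maximizes the score.
search = GridSearchCV(pipe, grid, scoring="neg_mean_absolute_error", cv=5)
search.fit(X_tr, y_tr)  # only the training part is used for the search

print(search.best_params_)
print(-search.best_score_)  # cross-validated MAE of the best setting
```

After the search, `search.predict(X_te)` uses the best pipeline refitted on the whole training split, so the test set stays untouched until the final evaluation.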


In [None]:
ridge = Ridge(alpha=10, solver='svd').fit(X_train, y_train)
y_pred = ridge.predict(X_test)
print("mae:", mean_absolute_error(y_test, y_pred))
print("mse", mean_squared_error(y_test, y_pred))
print("rmse", mean_squared_error(y_test, y_pred) ** 0.5)  # square root of MSE; squared=False is deprecated in newer scikit-learn

mae: 0.48684934492386445
mse 0.4658190929993873
rmse 0.682509408725907


In [None]:
lr = LinearRegression().fit(X_train, y_train)
y_pred = lr.predict(X_test)
print("mae:", mean_absolute_error(y_test, y_pred))
print("mse", mean_squared_error(y_test, y_pred))
print("rmse", mean_squared_error(y_test, y_pred) ** 0.5)  # square root of MSE; squared=False is deprecated in newer scikit-learn

mae: 0.4866929811624341
mse 0.4651968871418212
rmse 0.6820534342277159


In [None]:
parameters = {'alpha':[0.01, 0.1, 1, 10, 50, 100]}
lasso = Lasso()
clf = GridSearchCV(lasso, parameters)
clf.fit(X_train, y_train)
print("Lasso best estimator: ", clf.best_estimator_)
print("Lasso best score: ", clf.best_score_)

Lasso best estimator:  Lasso(alpha=0.01)
Lasso best score:  0.507803237825531


In [None]:
lasso = Lasso(alpha=0.01).fit(X_train, y_train)
y_pred = lasso.predict(X_test)
print("mae:", mean_absolute_error(y_test, y_pred))
print("mse", mean_squared_error(y_test, y_pred))
print("rmse", mean_squared_error(y_test, y_pred) ** 0.5)  # square root of MSE; squared=False is deprecated in newer scikit-learn

mae: 0.4866565920671522
mse 0.46710857696589003
rmse 0.6834534197484786


- All the best models show roughly the same result

## Text processing

In [63]:
data = pd.read_csv('/content/drive/MyDrive/ml_2023/IMDB-Movie-Data.csv')
meta_mean = data.Metascore.mean()
rev_mean = data['Revenue (Millions)'].mean()

# replace the missing values with the column means
data['Metascore'] = data['Metascore'].fillna(meta_mean)
data['Revenue (Millions)'] = data['Revenue (Millions)'].fillna(rev_mean)


In [64]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Rank                1000 non-null   int64  
 1   Title               1000 non-null   object 
 2   Genre               1000 non-null   object 
 3   Description         1000 non-null   object 
 4   Director            1000 non-null   object 
 5   Actors              1000 non-null   object 
 6   Year                1000 non-null   int64  
 7   Runtime (Minutes)   1000 non-null   int64  
 8   Rating              1000 non-null   float64
 9   Votes               1000 non-null   int64  
 10  Revenue (Millions)  1000 non-null   float64
 11  Metascore           1000 non-null   float64
dtypes: float64(3), int64(4), object(5)
memory usage: 93.9+ KB


- Now let's preprocess the data more thoroughly. 

- We remove all punctuation, lemmatize (reduce every word to its base form), remove stop words, and expand the most common contractions
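
Before the full implementation, here is a stripped-down sketch of the same idea using only the standard library. It lowercases the text, expands a few contractions, drops punctuation and filters stop words. Lemmatization is left out, and both the stop-word set and the contraction list are small hand-picked assumptions, not the NLTK resources the notebook uses. The function name `simple_clean` is hypothetical.

```python
import re

# A tiny hand-picked stop-word list, for illustration only;
# the notebook itself relies on NLTK's English list.
STOP_WORDS = {"a", "an", "the", "of", "to", "is", "are", "not", "in", "and"}

# A few contractions; order matters: specific forms come before generic ones.
CONTRACTIONS = [
    (r"won't", "will not"),
    (r"can't", "can not"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'s", " is"),
]


def simple_clean(text):
    """Lowercase, expand contractions, drop punctuation, remove stop words."""
    text = text.lower()
    for pattern, replacement in CONTRACTIONS:
        text = re.sub(pattern, replacement, text)
    text = re.sub(r"[^a-z\s]", " ", text)  # keep only letters and whitespace
    return [tok for tok in text.split() if tok not in STOP_WORDS]


print(simple_clean("The heroes can't stop: they're forced to work together!"))
# → ['heroes', 'can', 'stop', 'they', 'forced', 'work', 'together']
```

Expanding contractions before stripping punctuation matters: if the apostrophes were removed first, "can't" would turn into "can t" and could no longer be matched.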

In [65]:
import re

In [66]:
def removePunctuationDown(strs):
    """Replace common punctuation characters with a space."""
    remove = r'!#$%&\()+,-./:;<=>?@[\]_{|}~'
    pattern = r"[{}]".format(remove)
    h = re.sub(pattern, " ", strs)
    return h

def removePunctuationUp(strs):
    """Delete commas, double quotes, carets, backticks and asterisks."""
    remove = r',"\^`*'
    pattern = r"[{}]".format(remove)
    h = re.sub(pattern, "", strs)
    return h

def replace(strs):
    """Replace literal backslash sequences and newlines with spaces; drop tabs and non-ASCII characters."""
    strs = strs.replace('\\t',' ').replace('\\n',' ').replace('\\u',' ').replace('\\',' ')
    strs = strs.replace('\n',' ')
    strs = strs.replace('\t','')
    strs = strs.encode('utf-8').decode('ascii', 'ignore')
    return strs


In [67]:
data.Description

0      A group of intergalactic criminals are forced ...
1      Following clues to the origin of mankind, a te...
2      Three girls are kidnapped by a man with a diag...
3      In a city of humanoid animals, a hustling thea...
4      A secret government agency recruits some of th...
                             ...                        
995    A tight-knit team of rising investigators, alo...
996    Three American college students studying abroa...
997    Romantic sparks occur between two dance studen...
998    A pair of friends embark on a mission to reuni...
999    A stuffy businessman finds himself trapped ins...
Name: Description, Length: 1000, dtype: object

In [68]:
data['text'] = [i.lower() for i in data.Description]
data['remove_email'] = [re.sub(r'\S*@\S*\s?','',i) for i in data.text]
data['remove_special_character'] = data['remove_email'].replace(r'http\s+|www.\s+','',regex=True).replace(r'http\S+', '', regex=True).replace(r'www\S+', '', regex=True)
data['remove_special_character'] = [re.sub(r'<[^>]*>', '', i) for i in data.remove_special_character]  # strip HTML-like tags
data['text_clean'] = [removePunctuationDown(i) for i in data.remove_special_character]
data['text_clean'] = [removePunctuationUp(i) for i in data.text_clean]
data['text_clean'] = [replace(j) for j in data.text_clean]

In [69]:
def text_clean(text): 

    text = re.sub(r"won\'t", "will not", text)
    text = re.sub(r"can\'t", "can not", text)
    text = re.sub(r"won\’t", "will not", text)
    text = re.sub(r"can\’t", "can not", text)
    text = re.sub(r"\'t've", " not have", text)
    text = re.sub(r"\'d've", " would have", text)
    text = re.sub(r"\'clock", "f the clock", text)
    text = re.sub(r"\'cause", " because", text)
    
    text = re.sub(r"n\'t", " not", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'s", " is", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'t", " not", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'m", " am", text)
    text = re.sub(r"n\’t", " not", text)
    text = re.sub(r"\’re", " are", text)
    text = re.sub(r"\’s", " is", text)
    text = re.sub(r"\’ll", " will", text)
    text = re.sub(r"\’t", " not", text)
    text = re.sub(r"\’ve", " have", text)
    text = re.sub(r"\’m", " am", text)
    text = re.sub(r"\’", "\'", text)
    
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    text = re.sub(r'\W', ' ', text)
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub(r'^b\s+', '', text)
    text = re.sub(r'\s+', ' ', text, flags=re.I)
    
    return text

data['text_clean'] = data['text_clean'].apply(lambda x: text_clean(x))

In [70]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [71]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

def lemmatize_sentence(sentence):
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            lemmatized_sentence.append(word)
        else:
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)

def remove_stopwords(sentence):
    stopWords = set(stopwords.words('english'))
    tokens = nltk.word_tokenize(sentence)
    # keep only the tokens that are not English stop words
    wordsFiltered = [w for w in tokens if w not in stopWords]

    # return " ".join(wordsFiltered)
    return wordsFiltered


data['text_lemma'] = data['text_clean'].apply(lambda x: lemmatize_sentence(x))
data['text_final'] = data['text_lemma'].apply(lambda x: remove_stopwords(x))

data.head()

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore,text,remove_email,remove_special_character,text_clean,text_lemma,text_final
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0,a group of intergalactic criminals are forced ...,a group of intergalactic criminals are forced ...,a group of intergalactic criminals are forced ...,a group of intergalactic criminals are forced ...,a group of intergalactic criminal be force to ...,"[group, intergalactic, criminal, force, work, ..."
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0,"following clues to the origin of mankind, a te...","following clues to the origin of mankind, a te...","following clues to the origin of mankind, a te...",following clues to the origin of mankind a tea...,follow clue to the origin of mankind a team fi...,"[follow, clue, origin, mankind, team, find, st..."
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0,three girls are kidnapped by a man with a diag...,three girls are kidnapped by a man with a diag...,three girls are kidnapped by a man with a diag...,three girls are kidnapped by a man with a diag...,three girl be kidnap by a man with a diagnosed...,"[three, girl, kidnap, man, diagnosed, distinct..."
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0,"in a city of humanoid animals, a hustling thea...","in a city of humanoid animals, a hustling thea...","in a city of humanoid animals, a hustling thea...",in a city of humanoid animals a hustling theat...,in a city of humanoid animal a hustling theate...,"[city, humanoid, animal, hustling, theater, im..."
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0,a secret government agency recruits some of th...,a secret government agency recruits some of th...,a secret government agency recruits some of th...,a secret government agency recruits some of th...,a secret government agency recruit some of the...,"[secret, government, agency, recruit, dangerou..."


In [72]:
data["text_final"]

0      [group, intergalactic, criminal, force, work, ...
1      [follow, clue, origin, mankind, team, find, st...
2      [three, girl, kidnap, man, diagnosed, distinct...
3      [city, humanoid, animal, hustling, theater, im...
4      [secret, government, agency, recruit, dangerou...
                             ...                        
995    [tight, knit, team, rise, investigator, along,...
996    [three, american, college, student, study, abr...
997    [romantic, spark, occur, two, dance, student, ...
998    [pair, friend, embark, mission, reunite, pal, ...
999    [stuffy, businessman, find, trap, inside, body...
Name: text_final, Length: 1000, dtype: object

## Training

In [73]:
input_text = list(data.text_final.values)
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(input_text)]
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)
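# gensim 4.x already provides model.dv; older versions only have model.docvecs
model.dv = model.dv if hasattr(model, "dv") else model.docvecs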
vectors = []
for x in documents:
    vec = list(model.dv[x.tags[0]])
    vectors.append(vec)
split_df = pd.DataFrame(vectors,
                        columns=['v1', 'v2', 'v3','v4','v5'])
result = data.join(split_df, how='left')



In [74]:
data_sm = result[['Runtime (Minutes)',"Year",
                'Rating', 'Votes',
                'Revenue (Millions)','Metascore','v1', 'v2', 'v3','v4','v5']
              ]
X = data_sm.drop(["Rating"],axis=1).values 
y = data_sm['Rating'].values
sc = StandardScaler()
X_sc = sc.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_sc, y, random_state=42)

- And again, let's try to tune the model parameters 

In [78]:
from sklearn.model_selection import GridSearchCV
parameters = {'alpha':[0.01, 0.1, 1, 10, 100, 200, 300, 500, 1000, 2000], 'solver':('auto', 'svd', 'cholesky', 'sparse_cg')}
ridge = Ridge()
clf = GridSearchCV(ridge, parameters)
clf.fit(X_sc, y)
print("Ridge best estimator: ", clf.best_estimator_)
print("Ridge best score: ", clf.best_score_)

Ridge best estimator:  Ridge(alpha=100, solver='sparse_cg')
Ridge best score:  0.471639587341356


In [79]:
ridge = Ridge(alpha=100, solver='sparse_cg').fit(X_train, y_train)
y_pred = ridge.predict(X_test)
print("mae:", mean_absolute_error(y_test, y_pred))
print("mse", mean_squared_error(y_test, y_pred))
print("rmse", mean_squared_error(y_test, y_pred) ** 0.5)  # square root of MSE; squared=False is deprecated in newer scikit-learn

mae: 0.49022692479671176
mse 0.47546778303373693
rmse 0.6895417195744844


In [80]:
lr = LinearRegression().fit(X_train, y_train)
y_pred = lr.predict(X_test)
print("mae:", mean_absolute_error(y_test, y_pred))
print("mse", mean_squared_error(y_test, y_pred))
print("rmse", mean_squared_error(y_test, y_pred) ** 0.5)  # square root of MSE; squared=False is deprecated in newer scikit-learn

mae: 0.48677355956653184
mse 0.46466408845665264
rmse 0.6816627380579436


- In the end, the additional text preprocessing did not improve the model quality: the test MAE stays at about 0.49