В качестве домашнего задания вам предлагается поработать над предсказанием погоды. Файл с данными вы найдете в соответствующей директории. Вам будет доступен датасет weather.csv, ПЕРВЫЕ 75% (shuffle = False) которого нужно взять для обучения, последние 25% - для тестирования.

Требуется построить 4 модели которые будут предсказывать целевую переменную <b>RainTomorrow</b> с помощью:

   1. логистической регрессии [sklearn.linear_model.LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression)
   
   2. метода ближайших соседей [sklearn.neighbors](https://scikit-learn.org/stable/modules/neighbors.html)
 
   3. Байесовского классификатора [sklearn.naive_bayes](https://scikit-learn.org/stable/modules/naive_bayes.html)
   
   4. логистической регрессии реализованной самостоятельно

Затем следует сравнить результаты моделей (по качеству и времени выполнения) и сделать вывод о том, какая модель и с какими параметрами даёт лучшие результаты.

Не забывайте о том, что работа с признаками играет очень большую роль в построении хорошей модели.

Краткое описание данных:

    Date - Дата наблюдений
    Location - Название локации, в которой расположена метеорологическая станция
    MinTemp - Минимальная температура в градусах цельсия
    MaxTemp - Максимальная температура в градусах цельсия
    Rainfall - Количество осадков, зафиксированных за день в мм
    Evaporation - Так называемое "pan evaporation" класса А (мм) за 24 часа до 9 утра
    Sunshine - Число солнечных часов за день
    WindGustDir - направление самого сильного порыва ветра за последние 24 часа
    WindGustSpeed - скорость (км / ч) самого сильного порыва ветра за последние 24 часа
    WindDir9am - направление ветра в 9 утра

In [38]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import seaborn as sns
import time
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
%matplotlib notebook

In [39]:
X = pd.read_csv('weather.csv')

In [40]:
y = X.RainTomorrow.replace({'No':0, 'Yes': 1})

In [41]:
del X['RainTomorrow']

In [42]:
X.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,142183,142184,142185,142186,142187,142188,142189,142190,142191,142192
Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,145449,145450,145451,145452,145453,145454,145455,145456,145457,145458
Date,2008-12-01,2008-12-02,2008-12-03,2008-12-04,2008-12-05,2008-12-06,2008-12-07,2008-12-08,2008-12-09,2008-12-10,...,2017-06-15,2017-06-16,2017-06-17,2017-06-18,2017-06-19,2017-06-20,2017-06-21,2017-06-22,2017-06-23,2017-06-24
Location,Albury,Albury,Albury,Albury,Albury,Albury,Albury,Albury,Albury,Albury,...,Uluru,Uluru,Uluru,Uluru,Uluru,Uluru,Uluru,Uluru,Uluru,Uluru
MinTemp,13.4,7.4,12.9,9.2,17.5,14.6,14.3,7.7,9.7,13.1,...,2.6,5.2,6.4,8.0,7.4,3.5,2.8,3.6,5.4,7.8
MaxTemp,22.9,25.1,25.7,28.0,32.3,29.7,25.0,26.7,31.9,30.1,...,22.5,24.3,23.4,20.7,20.6,21.8,23.4,25.3,26.9,27.0
Rainfall,0.6,0.0,0.0,0.0,1.0,0.2,0.0,0.0,0.0,1.4,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Evaporation,,,,,,,,,,,...,,,,,,,,,,
Sunshine,,,,,,,,,,,...,,,,,,,,,,
WindGustDir,W,WNW,WSW,NE,W,WNW,W,W,NNW,W,...,S,E,ESE,ESE,E,E,E,NNW,N,SE
WindGustSpeed,44.0,44.0,46.0,24.0,41.0,56.0,50.0,35.0,80.0,28.0,...,19.0,24.0,31.0,41.0,35.0,31.0,31.0,22.0,37.0,28.0


### Предобработка признаков

Удалим колонку "Unnamed: 0" так как она не несет в себе полезной информации.

In [43]:
del X["Unnamed: 0"]

Расположение безусловно важно при предсказании погоды, но с учетом того что данные отсортированы по названиям городов, и для обучения выбираются первые 75% можно предположить, что данный признак будет бесполезен при обучении модели. Поэтому удалим данный признак из датасета.

In [44]:
del X["Location"]

Заменим 'Yes' и 'No' в колонке RainToday на 1 и 0 соответственно.

In [45]:
X["RainToday"] = X.RainToday.replace({'No':0, 'Yes': 1})
X["RainToday"].value_counts()

0.0    109332
1.0     31455
Name: RainToday, dtype: int64

Заполним пропуски в численных признаках. В колонке RainToday пропуски заполним нулями, а в остольных - средним значением.

In [46]:
for i in X.describe().columns:
    X[i].fillna(X[i].mean(), inplace=True)

X["RainToday"].fillna(0, inplace=True)
X.describe()

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday
count,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0
mean,12.1864,23.226784,2.349974,5.469824,7.624853,39.984292,14.001988,18.637576,68.84381,51.482606,1017.653758,1015.258204,4.437189,4.503167,16.987509,21.687235,0.223423
std,6.388924,7.109554,8.423217,3.168114,2.734927,13.138385,8.851082,8.721551,18.932077,20.532065,6.746248,6.681788,2.27808,2.104709,6.472166,6.870771,0.414476
min,-8.5,-4.8,0.0,0.0,0.0,6.0,0.0,0.0,0.0,0.0,980.5,977.1,0.0,0.0,-7.2,-5.4,0.0
25%,7.6,17.9,0.0,4.0,7.624853,31.0,7.0,13.0,57.0,37.0,1013.5,1011.0,3.0,4.0,12.3,16.7,0.0
50%,12.0,22.7,0.0,5.469824,7.624853,39.0,13.0,18.637576,70.0,51.482606,1017.653758,1015.258204,4.437189,4.503167,16.8,21.3,0.0
75%,16.8,28.2,0.8,5.469824,8.7,46.0,19.0,24.0,83.0,65.0,1021.8,1019.4,6.0,6.0,21.5,26.3,0.0
max,33.9,48.1,371.0,145.0,14.5,135.0,130.0,87.0,100.0,100.0,1041.0,1039.6,9.0,9.0,40.2,46.7,1.0


В колонке "Date" оставим информацию только о месяце.

In [47]:
X["Date"] = X["Date"].apply(lambda x: int(x[5:7]))

Применим метод OneHotEncoder к котегориальным признакам

In [48]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

X["WindGustDir"] = LabelEncoder().fit_transform(X["WindGustDir"])
X["WindDir9am"] = LabelEncoder().fit_transform(X["WindDir9am"])
X["WindDir3pm"] = LabelEncoder().fit_transform(X["WindDir3pm"])
X["Date"] = LabelEncoder().fit_transform(X["Date"])

ohe = OneHotEncoder(sparse=False)
new_ohe_wind_gus_dir = ohe.fit_transform(X[["WindGustDir"]])
new_ohe_wind_dir_9am = ohe.fit_transform(X[["WindDir9am"]])
new_ohe_wind_dir_3am = ohe.fit_transform(X[["WindDir3pm"]])
new_ohe_date = ohe.fit_transform(X[["Date"]])

X_ohe = X.copy()

del X_ohe["WindGustDir"], X_ohe["WindDir9am"], X_ohe["WindDir3pm"], X_ohe["Date"]
X_ohe = pd.DataFrame(np.hstack((X_ohe, new_ohe_wind_gus_dir, new_ohe_wind_dir_9am, new_ohe_wind_dir_3am, new_ohe_date)))

Применим MinMaxScaler

In [49]:
from sklearn import preprocessing

mm_scaler = preprocessing.MinMaxScaler()
X_ohe = mm_scaler.fit_transform(X_ohe)

Создадим датасеты без использования OneHotEncoder для метода ближайших соседей и байесовского классификатора. Также применим к ним MinMaxScaler и StandartScaler.

In [50]:
X.describe()

Unnamed: 0,Date,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,...,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday
count,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,...,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0
mean,5.402544,12.1864,23.226784,2.349974,5.469824,7.624853,8.255885,39.984292,7.915755,7.974471,...,18.637576,68.84381,51.482606,1017.653758,1015.258204,4.437189,4.503167,16.987509,21.687235,0.223423
std,3.426506,6.388924,7.109554,8.423217,3.168114,2.734927,4.953096,13.138385,4.911307,4.731493,...,8.721551,18.932077,20.532065,6.746248,6.681788,2.27808,2.104709,6.472166,6.870771,0.414476
min,0.0,-8.5,-4.8,0.0,0.0,0.0,0.0,6.0,0.0,0.0,...,0.0,0.0,0.0,980.5,977.1,0.0,0.0,-7.2,-5.4,0.0
25%,2.0,7.6,17.9,0.0,4.0,7.624853,4.0,31.0,3.0,4.0,...,13.0,57.0,37.0,1013.5,1011.0,3.0,4.0,12.3,16.7,0.0
50%,5.0,12.0,22.7,0.0,5.469824,7.624853,9.0,39.0,8.0,8.0,...,18.637576,70.0,51.482606,1017.653758,1015.258204,4.437189,4.503167,16.8,21.3,0.0
75%,8.0,16.8,28.2,0.8,5.469824,8.7,13.0,46.0,12.0,12.0,...,24.0,83.0,65.0,1021.8,1019.4,6.0,6.0,21.5,26.3,0.0
max,11.0,33.9,48.1,371.0,145.0,14.5,16.0,135.0,16.0,16.0,...,87.0,100.0,100.0,1041.0,1039.6,9.0,9.0,40.2,46.7,1.0


In [51]:
X_mm = mm_scaler.fit_transform(X)

s_scaler = preprocessing.StandardScaler()
X_ss = s_scaler.fit_transform(X)

In [52]:
X.describe()

Unnamed: 0,Date,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,...,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday
count,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,...,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0
mean,5.402544,12.1864,23.226784,2.349974,5.469824,7.624853,8.255885,39.984292,7.915755,7.974471,...,18.637576,68.84381,51.482606,1017.653758,1015.258204,4.437189,4.503167,16.987509,21.687235,0.223423
std,3.426506,6.388924,7.109554,8.423217,3.168114,2.734927,4.953096,13.138385,4.911307,4.731493,...,8.721551,18.932077,20.532065,6.746248,6.681788,2.27808,2.104709,6.472166,6.870771,0.414476
min,0.0,-8.5,-4.8,0.0,0.0,0.0,0.0,6.0,0.0,0.0,...,0.0,0.0,0.0,980.5,977.1,0.0,0.0,-7.2,-5.4,0.0
25%,2.0,7.6,17.9,0.0,4.0,7.624853,4.0,31.0,3.0,4.0,...,13.0,57.0,37.0,1013.5,1011.0,3.0,4.0,12.3,16.7,0.0
50%,5.0,12.0,22.7,0.0,5.469824,7.624853,9.0,39.0,8.0,8.0,...,18.637576,70.0,51.482606,1017.653758,1015.258204,4.437189,4.503167,16.8,21.3,0.0
75%,8.0,16.8,28.2,0.8,5.469824,8.7,13.0,46.0,12.0,12.0,...,24.0,83.0,65.0,1021.8,1019.4,6.0,6.0,21.5,26.3,0.0
max,11.0,33.9,48.1,371.0,145.0,14.5,16.0,135.0,16.0,16.0,...,87.0,100.0,100.0,1041.0,1039.6,9.0,9.0,40.2,46.7,1.0


In [53]:
col_names = list(X.columns)

In [54]:
pd.DataFrame(X_mm, columns=col_names).describe()

Unnamed: 0,Date,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,...,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday
count,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,...,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0
mean,0.49114,0.487887,0.529807,0.006334,0.037723,0.525852,0.515993,0.263444,0.494735,0.498404,...,0.214225,0.688438,0.514826,0.614112,0.610531,0.493021,0.500352,0.510285,0.519909,0.223423
std,0.311501,0.150682,0.134396,0.022704,0.021849,0.188616,0.309568,0.101848,0.306957,0.295718,...,0.100248,0.189321,0.205321,0.111508,0.106909,0.25312,0.233857,0.136544,0.131877,0.414476
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.181818,0.379717,0.429112,0.0,0.027586,0.525852,0.25,0.193798,0.1875,0.25,...,0.149425,0.57,0.37,0.545455,0.5424,0.333333,0.444444,0.411392,0.424184,0.0
50%,0.454545,0.483491,0.519849,0.0,0.037723,0.525852,0.5625,0.255814,0.5,0.5,...,0.214225,0.7,0.514826,0.614112,0.610531,0.493021,0.500352,0.506329,0.512476,0.0
75%,0.727273,0.596698,0.623819,0.002156,0.037723,0.6,0.8125,0.310078,0.75,0.75,...,0.275862,0.83,0.65,0.682645,0.6768,0.666667,0.666667,0.605485,0.608445,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [55]:
pd.DataFrame(X_ss, columns=col_names).describe()

Unnamed: 0,Date,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,...,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday
count,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,...,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0
mean,8.145159000000001e-17,1.663012e-16,-2.23867e-16,7.255688000000001e-17,-1.103344e-16,-4.707073e-16,1.023392e-16,-1.874886e-16,4.4773390000000004e-17,6.076389e-17,...,1.679002e-17,1.471126e-16,-7.643458e-16,-1.490794e-14,-1.865771e-14,-1.471126e-16,-4.349415e-16,-3.421966e-16,-1.535088e-16,6.396199000000001e-17
std,1.000004,1.000004,1.000004,1.000004,1.000004,1.000004,1.000004,1.000004,1.000004,1.000004,...,1.000004,1.000004,1.000004,1.000004,1.000004,1.000004,1.000004,1.000004,1.000004,1.000004
min,-1.576698,-3.237865,-3.942143,-0.2789887,-1.72653,-2.787964,-1.666819,-2.586651,-1.611747,-1.685409,...,-2.136964,-3.636371,-2.507433,-5.507341,-5.710798,-1.947782,-2.139575,-3.737171,-3.9424,-0.53905
25%,-0.9930106,-0.7178699,-0.7492457,-0.2789887,-0.4639446,-6.495101e-16,-0.8592403,-0.6838225,-1.000909,-0.8400067,...,-0.6463983,-0.6255971,-0.7053678,-0.615716,-0.6372873,-0.6308796,-0.2390681,-0.7242591,-0.7258651,-0.53905
50%,-0.11748,-0.02917555,-0.07409551,-0.2789887,0.0,-6.495101e-16,0.1502329,-0.07491751,0.01715327,0.005395501,...,0.0,0.06107062,-6.921323e-16,0.0,-1.701449e-14,0.0,-4.219974e-16,-0.02897164,-0.05635995,-0.53905
75%,0.7580506,0.7221273,0.6995141,-0.1840128,0.0,0.3931185,0.9578115,0.4578744,0.8316033,0.8507977,...,0.6148497,0.7477383,0.6583576,0.6146018,0.6198657,0.6860233,0.7111855,0.6972175,0.671363,-0.53905
max,1.633581,3.398644,3.498574,43.7661,44.04218,2.51384,1.563495,7.231942,1.646053,1.6962,...,7.838361,1.645688,2.363014,3.460638,3.643019,2.002926,2.136566,3.586523,3.640473,1.873642


### Логистическая регрессия

In [56]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_ohe, y, test_size=0.25, shuffle=False)

model = LogisticRegression()

model.fit(X_train, y_train)

LogisticRegression()

In [15]:
print(classification_report(y_test, model.predict(X_test)))

              precision    recall  f1-score   support

           0       0.87      0.96      0.91     27882
           1       0.75      0.47      0.58      7667

    accuracy                           0.85     35549
   macro avg       0.81      0.71      0.74     35549
weighted avg       0.84      0.85      0.84     35549



### Метод ближайших соседей

Рассмотрим как разные методы масштабирования данных влияют на качество метода ближайших соседей.

In [16]:
from sklearn.neighbors import KNeighborsClassifier

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=False)

print(" " * 20 + "Without scaler:")
for i in [2, 5, 10, 15, 30]:
    model = KNeighborsClassifier(n_neighbors=i)
    model.fit(X_train, y_train)
    print("n_neighbors:", i)
    print(classification_report(y_test, model.predict(X_test)))

          Without scaler:
n_neighbors: 1
              precision    recall  f1-score   support

           0       0.86      0.87      0.86     27882
           1       0.50      0.48      0.49      7667

    accuracy                           0.79     35549
   macro avg       0.68      0.67      0.68     35549
weighted avg       0.78      0.79      0.78     35549

n_neighbors: 2
              precision    recall  f1-score   support

           0       0.83      0.96      0.89     27882
           1       0.68      0.31      0.43      7667

    accuracy                           0.82     35549
   macro avg       0.76      0.63      0.66     35549
weighted avg       0.80      0.82      0.79     35549

n_neighbors: 5
              precision    recall  f1-score   support

           0       0.86      0.94      0.90     27882
           1       0.66      0.45      0.54      7667

    accuracy                           0.83     35549
   macro avg       0.76      0.69      0.72     35549
wei

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X_mm, y, test_size=0.25, shuffle=False)

print(" " * 20 + "MinMaxScaler:")
for i in [1, 2, 5, 10, 15, 30]:
    model = KNeighborsClassifier(n_neighbors=i)
    model.fit(X_train, y_train)
    print("n_neighbors:", i)
    print(classification_report(y_test, model.predict(X_test)))

          MinMaxScaler:
n_neighbors: 1
              precision    recall  f1-score   support

           0       0.86      0.88      0.87     27882
           1       0.51      0.46      0.48      7667

    accuracy                           0.79     35549
   macro avg       0.68      0.67      0.67     35549
weighted avg       0.78      0.79      0.78     35549

n_neighbors: 2
              precision    recall  f1-score   support

           0       0.83      0.96      0.89     27882
           1       0.68      0.28      0.39      7667

    accuracy                           0.82     35549
   macro avg       0.75      0.62      0.64     35549
weighted avg       0.80      0.82      0.78     35549

n_neighbors: 5
              precision    recall  f1-score   support

           0       0.86      0.94      0.90     27882
           1       0.66      0.43      0.52      7667

    accuracy                           0.83     35549
   macro avg       0.76      0.68      0.71     35549
weigh

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X_ss, y, test_size=0.25, shuffle=False)

print(" " * 30 + "StandartScaler:")
for i in [2, 5, 10, 15, 30]:
    model = KNeighborsClassifier(n_neighbors=i)
    model.fit(X_train, y_train)
    print("n_neighbors:", i)
    print(classification_report(y_test, model.predict(X_test)))

          StandartScaler:
n_neighbors: 1
              precision    recall  f1-score   support

           0       0.85      0.87      0.86     27882
           1       0.50      0.46      0.48      7667

    accuracy                           0.78     35549
   macro avg       0.68      0.67      0.67     35549
weighted avg       0.78      0.78      0.78     35549

n_neighbors: 2
              precision    recall  f1-score   support

           0       0.83      0.96      0.89     27882
           1       0.66      0.28      0.39      7667

    accuracy                           0.81     35549
   macro avg       0.74      0.62      0.64     35549
weighted avg       0.79      0.81      0.78     35549

n_neighbors: 5
              precision    recall  f1-score   support

           0       0.86      0.93      0.89     27882
           1       0.63      0.43      0.51      7667

    accuracy                           0.82     35549
   macro avg       0.75      0.68      0.70     35549
wei

###  Байесовский классификатор

In [70]:
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()

In [71]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=False)
model.fit(X_train, y_train)

print(" " * 30 + "Without Scaler:")
print(classification_report(y_test, model.predict(X_test)))

                              Without Scaler:
              precision    recall  f1-score   support

           0       0.88      0.91      0.89     27882
           1       0.62      0.56      0.59      7667

    accuracy                           0.83     35549
   macro avg       0.75      0.73      0.74     35549
weighted avg       0.83      0.83      0.83     35549



In [73]:
X_train, X_test, y_train, y_test = train_test_split(X_mm, y, test_size=0.25, shuffle=False)
model.fit(X_train, y_train)

print(" " * 30 + "MinMaxScaler:")
print(classification_report(y_test, model.predict(X_test)))

                              MinMaxScaler:
              precision    recall  f1-score   support

           0       0.88      0.91      0.89     27882
           1       0.62      0.56      0.59      7667

    accuracy                           0.83     35549
   macro avg       0.75      0.73      0.74     35549
weighted avg       0.83      0.83      0.83     35549



In [69]:
X_train, X_test, y_train, y_test = train_test_split(X_ss, y, test_size=0.25, shuffle=False)
model.fit(X_train, y_train)

print(" " * 30 + "StandartScaler:")
print(classification_report(y_test, model.predict(X_test)))

                              StandartScaler:
              precision    recall  f1-score   support

           0       0.88      0.91      0.89     27882
           1       0.62      0.56      0.59      7667

    accuracy                           0.83     35549
   macro avg       0.75      0.73      0.74     35549
weighted avg       0.83      0.83      0.83     35549



### Реализация логистической регрессии
__Логистическая регрессия__

$$p(y|x) = a(x, \theta) = \sigma(\langle x, \theta \rangle) = \frac{1}{1 + \exp(-\langle \theta, x_i \rangle)}$$

In [15]:
theta = np.array([1, 2, 3])

X =  np.array([[ 1,  1, 1],
               [-1, -2, 1],
               [-1, -2, 2],
               [-2, -2, -3]
              ])

y = np.array([1, 1, 0, 0])

In [16]:
def probability(theta, X):
    # YOUR CODE HERE
    result = np.zeros()
    return result
prob = probability(theta, X)


assert type(prob) == np.ndarray, 'Возвращается неверный тип'
assert prob.shape == (X.shape[0],), 'Неверный размер массива'
assert (prob.round(3) == [0.998, 0.119, 0.731, 0.]).all(), 'Функция считается неверно'

TypeError: zeros() missing required argument 'shape' (pos 0)

Функция предсказания метки класса, получает на вход вероятности принадлежности к классу 1 и выдает метки классов $y \in \{0, 1\}$

In [None]:
def binary_class_prediction(theta, X, threshold =.5):
    prob =  probability(theta, X)
    # YOUR CODE HERE
    return result

y_pred = binary_class_prediction(theta, X)


assert type(y_pred) == np.ndarray, 'Возвращается неверный тип'
assert y_pred.shape == (X.shape[0],), 'Неверный размер массива'
assert min(y_pred) == 0, 'Функция считается неверно'
assert max(y_pred) == 1, 'Функция считается неверно'

__Функционал качества логистической регрессии__

Запишем правдободовие выборки для меток класса $y \in \{+1, -1\}$ 

$$Likelihood(a, X^\ell) = \prod_{i = 1}^{\ell} a(x_i,\theta)^{[y_i = +1]} (1 - a(x_i, \theta))^{[y_i = -1]} → \operatorname*{max}_{\theta}$$ 

Прологарифмируем правдоподобие выборки и перейдем к задаче минимизации:

$$Q(a, X^\ell) =     -\sum_{i = 1}^{\ell} 
        [y_i = +1] \log a(x_i, \theta)
        +
        [y_i = -1] \log (1 - a(x_i, \theta)) \to \operatorname*{min}_{\theta}$$ 
        
Подставим $a(x, \theta)$ в функцинал качества:

$$ Q(a, X^\ell) = -\sum_{i = 1}^{\ell} \left(
    [y_i = +1]
    \log \frac{1}{1 + \exp(-\langle \theta, x_i \rangle)}
    +
    [y_i = -1]
    \log \frac{\exp(-\langle \theta, x_i \rangle)}{1 + \exp(-\langle \theta, x_i \rangle)}
\right)
=\\
=
-\sum_{i = 1}^{\ell} \left(
    [y_i = +1]
    \log \frac{1}{1 + \exp(-\langle \theta, x_i \rangle)}
    +
    [y_i = -1]
    \log \frac{1}{1 + \exp(\langle \theta, x_i \rangle)}
\right)
=\\
=
\sum_{i = 1}^{\ell}
    \log \left(
        1 + \exp(-y_i \langle \theta, x_i \rangle)
    \right) $$
    

Итоговый оптимизируемый функционал качества (logloss), записанный для меток классов $y \in \{+1, -1\}$ и усредненный по выборке

$$Q(a, X^\ell) = \frac{1}{\ell}\sum_{i = 1}^{\ell}
    \log \left(
        1 + \exp(-y_i \langle \theta, x_i \rangle)
    \right) \to \operatorname*{min}_{\theta}$$

Реализуем его в функции logloss:

In [None]:
def logloss(theta, X, y): 
    # YOUR CODE HERE
    return result

In [None]:
assert logloss(theta, X, y).round(3) == 0.861, 'Функция считается неверно'

__Алгоритм оптимизации функционала качества. Стохастический градиентный спуск__

<b>Вход: </b> Выборка $X^\ell$, темп обучения $h$

<b>Выход: </b> оптимальный вектор весов $\theta$

1.  Инициализировать веса $\theta$
2.  Инициализировать оценку функционала качества: $Q(a, X^\ell)$
3.  <b>Повторять</b>: 

    Выбрать случайным образом подвыборку объектов $X^{batch} =\{x_1, \dots,x_n \}$ из $X^{\ell}$
    
    Рассчитать градиент функционала качества: $\nabla Q(X^{batch}, \theta)$
    
    Обновить веса: $\theta := \theta - h\cdot \nabla Q(X^{batch}, \theta)$
       
    <b>Пока</b> значение $Q$ и/или веса $\theta$ не сойдутся   

Реализуем функцию рассчета градиента функционала качества

$$\frac{\partial Q(a, X^{batch}) }{\partial \theta_j}   = \frac{\partial \frac{1}{n}\sum_{i = 1}^{n}
    \log \left(
        1 + \exp(- y_i \langle \theta, x_i \rangle)
    \right)} {\partial \theta_j}  = \frac{1}{n}\sum_{i = 1}^{n}
     \frac {1}{
        1 + \exp(- y_i \langle \theta, x_i \rangle)} \cdot  \exp(- y_i \langle \theta, x_i \rangle) \cdot -y_i x_{ij}$$

Реализуйте рассчет градиента в матричном виде:

In [None]:
def gradient(theta, X, y):
    # YOUR CODE HERE
    
    return result 

assert gradient(theta, X, y).shape == theta.shape, 'Неверный размер массива'

Функция обучения уже реализована

In [None]:
def fit(X, y, batch_size=10, h=0.05,  iters=100, plot=True):

    # получаем размерности матрицы
    size, dim = X.shape

    # случайная начальная инициализация
    theta = np.random.uniform(size=dim)
    
    errors = []
    
    theta_history = theta
    colors = [plt.get_cmap('gist_rainbow')(i) for i in np.linspace(0,1,dim)]
    
    # plt 
    if plot:
        fig = plt.figure(figsize=(15, 10))
        ax1 = fig.add_subplot(221)
        ax2 = fig.add_subplot(222)
        ax3 = fig.add_subplot(212)
        fig.suptitle('Gradient descent')
        
        
    for _ in range(iters):  
        
        # берём случайный набор элементов
        batch = np.random.choice(size, batch_size, replace=False)
        X_batch = X[batch]
        y_batch = y[batch]

        # считаем производные
        grad = gradient(theta, X_batch, y_batch)
        
        assert type(grad) == np.ndarray, 'неверный тип'
        assert len(grad.shape) == 1, 'Необходимо вернуть одномерный вектор'
        assert grad.shape[0] == len(theta), 'длина вектора должна быть равной количеству весов'
        
        
        # Обновляем веса
        
        theta -= grad * h
        
        theta_history = np.vstack((theta_history, theta))
        
        # error
        loss = logloss(theta, X, y)
        errors.append(loss)
        
        if plot:
            ax1.clear()            
            ax1.scatter(range(dim), theta, label='Gradient solution')
            ax1.legend(loc="upper left")
            ax1.set_title('theta')
            ax1.set_ylabel(r'$\bar \beta$')
            ax1.set_xlabel('weight ID')
            
            
            ax2.plot(range(_+1), errors, 'g-')
            ax2.set_title('logloss')
            ax2.set_xlabel('itarations')
            
            ax3.plot(theta_history)
            ax3.set_title('update theta')
            ax3.set_ylabel('value')
            ax3.set_xlabel('itarations')
            time.sleep(0.05)
            fig.canvas.draw()   
            
    return theta

In [None]:
X, y = make_classification(n_samples=2000)

In [None]:
optimal_theta = fit(X, y)

In [None]:
y_pred = binary_class_prediction(optimal_theta, X)