# Base LogisticRegression
This notebook walks through a from-scratch implementation of logistic regression and compares it with the scikit-learn implementation. <br>

The model takes a feature vector <font size="3"> $\vec{x} = (1, x_1, x_2, \ldots, x_n)$ </font>, where the leading 1 is the constant feature <font size="3"> $x_0$ </font> for the bias term, and each class has its own weight vector <font size="3"> $\vec{w} = (w_0, w_1, \ldots, w_n)$</font>.<br>
As in linear regression, we take the scalar product of the weight vector and the feature vector: <font size="3"> $$\sum_{i=0}^n w_ix_i$$ </font>

The result of the scalar product is substituted into the sigmoid function: <br>
<p style="text-align: center;"><font size="3"> $$\hat{f}(x) = \sigma(wx) = {1\over 1 + e^{-wx}}$$ </font></p>
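As a small illustration of the two steps above (scalar product, then sigmoid), here is a minimal NumPy sketch. The toy weights and features are hypothetical, and the numerically stable form of the sigmoid is a common choice, not necessarily what `models.linear` does.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow in exp."""
    z = np.asarray(z, dtype=float)
    e = np.exp(-np.abs(z))  # always in (0, 1], so exp never overflows
    # For z >= 0 use 1 / (1 + e^{-z}); for z < 0 use the equivalent e^{z} / (1 + e^{z}).
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

# Hypothetical example: x_0 = 1 for the bias, followed by two features.
x = np.array([1.0, 2.0, -1.0])
w = np.array([0.5, -0.25, 1.0])

score = np.dot(w, x)   # 0.5 - 0.5 - 1.0 = -1.0
prob = sigmoid(score)  # sigma(-1), roughly 0.269
```

The output of `sigmoid` always lies between 0 and 1, which is what allows it to be read as a probability in the derivation that follows.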

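We can construct a loss function from the likelihood, where <font size="3"> $x$ </font> is a feature vector and <font size="3"> $\hat{f}(x)$ </font> is our model. We interpret the model output as the probability of the positive class:
<p style="text-align: center;"><font size="3"> $$P(Y = 1|x) = \hat{f}(x)$$ </font></p>
The likelihood of the training sample is the product of the probabilities the model assigns to the observed labels (here <font size="3"> $n$ </font> denotes the number of training observations):
<p style="text-align: center;"><font size="3"> $$\prod_{i=1}^n P(Y = y_i|x_i)$$ </font></p>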
Maximum likelihood estimation chooses the parameters under which the observed labels are most probable; under standard regularity conditions such estimates are consistent: <br>
<p style="text-align: center;"><font size="3"> $$\prod_{i=1}^n P(Y = y_i|x_i) \to \max$$</font></p>
The logarithm is monotonically increasing, so taking it does not change the maximizer:
<p style="text-align: center;"><font size="3"> $$\ln\left(\prod_{i=1}^n P(Y = y_i|x_i)\right) \to \max$$</font></p>
Negating the log-likelihood turns the maximization problem into a minimization problem:
<p style="text-align: center;"><font size="3"> $$L(w) = -\ln\left(\prod_{i=1}^n P(Y = y_i|x_i)\right) \to \min$$ </font></p>
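Besides being easier to differentiate, the logarithm also matters numerically: a product of many probabilities quickly underflows in floating point, while the sum of their logarithms stays finite. The sketch below uses a made-up sample in which every observation has probability 0.5.

```python
import numpy as np

# Hypothetical per-observation probabilities P(Y = y_i | x_i), all equal to 0.5.
p = np.full(2000, 0.5)

likelihood = np.prod(p)                   # 0.5**2000 underflows to 0.0 in float64
neg_log_likelihood = -np.sum(np.log(p))   # 2000 * ln(2), about 1386.29
```

The raw likelihood carries no usable information once it has collapsed to zero, whereas the negative log-likelihood can still be compared across parameter values.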
Since each label <font size="3"> $y_i$ </font> is either 0 or 1 and the model predicts the probability of class 1, the log-probability of each observed label can be written as a single expression:
<p style="text-align: center;"><font size="3"> $$\ln(P(Y = y_i|x_i)) = y_i\ln(\hat{f}(x_i)) + (1 - y_i)\ln(1 - \hat{f}(x_i))$$ </font></p>
Because the logarithm of a product is the sum of the logarithms, substituting this into the loss function obtained above gives:
<p style="text-align: center;"><font size="3"> $$L(w) = -\sum_{i=1}^n\left[y_i\ln(\hat{f}(x_i)) + (1 - y_i)\ln(1 - \hat{f}(x_i))\right]$$ </font></p>
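The loss above translates almost line by line into NumPy. The sketch below also includes its gradient, which for this loss takes the standard form $X^T(\sigma(Xw) - y)$. The function names, the clipping constant `eps`, and the toy data are assumptions made for illustration; they are not taken from `models.linear`.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, written to avoid overflow in exp."""
    z = np.asarray(z, dtype=float)
    e = np.exp(-np.abs(z))
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

def log_loss(w, X, y, eps=1e-12):
    """Negative log-likelihood L(w) for labels y in {0, 1}."""
    p = sigmoid(X @ w)
    p = np.clip(p, eps, 1.0 - eps)  # keep both logarithms finite
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def log_loss_grad(w, X, y):
    """Gradient of L(w) with respect to w: X^T (sigma(Xw) - y)."""
    return X.T @ (sigmoid(X @ w) - y)

# Hypothetical toy data: first column is the bias feature x_0 = 1.
X = np.array([[1.0, 0.5],
              [1.0, -1.5],
              [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
w = np.zeros(2)

loss_at_zero = log_loss(w, X, y)       # every prediction is 0.5, so L = 3 * ln(2)
grad_at_zero = log_loss_grad(w, X, y)  # X^T (0.5 - y) = [-0.5, -2.0]
```

A gradient descent step would then be `w = w - lr * log_loss_grad(w, X, y)` for some learning rate `lr`.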

### Load libraries and packages

In [1]:
import warnings
import gc
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd

from models.linear import LogisticRegression as MyLogisticRegression
from sklearn.linear_model import LogisticRegression

from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns

### Load Data
First let's look at the data:

In [2]:
data = pd.read_csv('b_depressed.csv')
data.head()

| | Survey_id | Ville_id | sex | Age | Married | Number_children | education_level | total_members | gained_asset | durable_asset | ... | incoming_salary | incoming_own_farm | incoming_business | incoming_no_business | incoming_agricultural | farm_expenses | labor_primary | lasting_investment | no_lasting_investmen | depressed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 926 | 91 | 1 | 28 | 1 | 4 | 10 | 5 | 28912201 | 22861940 | ... | 0 | 0 | 0 | 0 | 30028818 | 31363432 | 0 | 28411718 | 28292707.0 | 0 |
| 1 | 747 | 57 | 1 | 23 | 1 | 3 | 8 | 5 | 28912201 | 22861940 | ... | 0 | 0 | 0 | 0 | 30028818 | 31363432 | 0 | 28411718 | 28292707.0 | 1 |
| 2 | 1190 | 115 | 1 | 22 | 1 | 3 | 9 | 5 | 28912201 | 22861940 | ... | 0 | 0 | 0 | 0 | 30028818 | 31363432 | 0 | 28411718 | 28292707.0 | 0 |
| 3 | 1065 | 97 | 1 | 27 | 1 | 2 | 10 | 4 | 52667108 | 19698904 | ... | 0 | 1 | 0 | 1 | 22288055 | 18751329 | 0 | 7781123 | 69219765.0 | 0 |
| 4 | 806 | 42 | 0 | 59 | 0 | 4 | 10 | 6 | 82606287 | 17352654 | ... | 1 | 0 | 0 | 0 | 53384566 | 20731006 | 1 | 20100562 | 43419447.0 | 0 |


In [3]:
data.dtypes

Survey_id                  int64
Ville_id                   int64
sex                        int64
Age                        int64
Married                    int64
Number_children            int64
education_level            int64
total_members              int64
gained_asset               int64
durable_asset              int64
save_asset                 int64
living_expenses            int64
other_expenses             int64
incoming_salary            int64
incoming_own_farm          int64
incoming_business          int64
incoming_no_business       int64
incoming_agricultural      int64
farm_expenses              int64
labor_primary              int64
lasting_investment         int64
no_lasting_investmen     float64
depressed                  int64
dtype: object

In [4]:
data.shape

(1429, 23)

In [5]:
data['depressed'].value_counts()

0    1191
1     238
Name: depressed, dtype: int64
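The counts above show a clear class imbalance. A quick way to quantify it is to turn the counts into shares; the sketch below reuses the two numbers from the output above instead of reloading the file.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 1191, 1: 238}, name="depressed")

shares = counts / counts.sum()    # fraction of each class
majority_baseline = shares.max()  # accuracy of always predicting the majority class, about 0.833
```

Any classifier trained on this data should be judged against that baseline, since predicting class 0 for everyone already reaches roughly 83% accuracy.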

A little data preprocessing: fill in the missing values with the column medians:

In [6]:
# Series.median() skips NaN by default; np.median would return NaN for a column with gaps
for col in data.columns:
    data[col] = data[col].fillna(data[col].median())
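One detail worth checking when filling gaps with medians: NumPy's `np.median` propagates NaN, while pandas' `Series.median` skips missing values by default. A minimal sketch on a made-up column:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
col = pd.Series([1.0, 2.0, np.nan, 4.0])

np_median = np.median(col.to_numpy())  # nan: the missing value propagates
pd_median = col.median()               # 2.0: NaN is skipped (skipna=True by default)

filled = col.fillna(pd_median)         # 1.0, 2.0, 2.0, 4.0
```

`np.nanmedian` is the NumPy counterpart that ignores NaN, if staying within NumPy is preferred.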

### Let's create training and test samples

In [7]:
from sklearn.model_selection import train_test_split

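# Note: 'depressed' is not in the drop list, so the target stays among the features (target leakage)
X_train, X_holdout, y_train, y_test = train_test_split(data.drop(['Survey_id', 'no_lasting_investmen'], axis=1), 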
                                                       data['depressed'], 
                                                       test_size=0.3, 
                                                       random_state=17)
print(X_train.shape[0], X_holdout.shape[0])

1000 429
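For comparison, here is a sketch of a split that keeps the target out of the feature matrix and preserves the class ratio in both parts via `stratify`. It runs on a made-up frame with hypothetical column names, so it illustrates the pattern and is not the split used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 100 rows, 20 of them positive.
toy = pd.DataFrame({
    "feature_a": np.arange(100),
    "feature_b": np.arange(100) % 7,
    "target": [1] * 20 + [0] * 80,
})

X = toy.drop(columns=["target"])  # predictors only, target removed
y = toy["target"]

X_tr, X_ho, y_tr, y_ho = train_test_split(
    X, y,
    test_size=0.3,
    random_state=17,
    stratify=y,  # keep the 20% positive share in both parts
)
```

With an imbalanced target, stratifying avoids a holdout set that happens to contain noticeably more or fewer positives than the training set.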


### Create models

In [8]:
my_model = MyLogisticRegression.LogisticRegression(random_state=17)
sk_model = LogisticRegression(max_iter=100)

In [9]:
X_train.values

array([[      64,        1,       27, ...,        1,  5332699,        0],
       [     292,        1,       21, ...,        0, 12795455,        0],
       [      82,        1,       86, ...,        0, 25397076,        0],
       ...,
       [       3,        0,       75, ...,        0, 28827665,        0],
       [     102,        1,       23, ...,        1, 16527861,        0],
       [      74,        1,       28, ...,        1, 23753336,        0]],
      dtype=int64)

In [10]:
%time loses_my_model = my_model.fit(X_train.values, y_train.values)

ValueError: shapes (1000,22) and (1000,) not aligned: 22 (dim 1) != 1000 (dim 0)
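The message above is what NumPy raises when `np.dot` receives a (1000, 22) matrix and a length-1000 vector: the inner dimensions, 22 and 1000, do not match. The code of `models.linear` is not shown here, so the exact cause is an assumption; one common source of this error in gradient code is multiplying the feature matrix by a per-sample vector without transposing it first. The sketch below reproduces the mismatch on synthetic data of the same shape.

```python
import numpy as np

rng = np.random.default_rng(17)
X = rng.normal(size=(1000, 22))   # same shape as in the error message; values are synthetic
residual = rng.normal(size=1000)  # stands in for (predictions - y), one value per sample

try:
    np.dot(X, residual)           # (1000, 22) . (1000,): inner dimensions 22 and 1000 differ
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

# Transposing X aligns the sample axis: (22, 1000) @ (1000,) gives one entry per weight.
grad = X.T @ residual
```

Printing `X.shape`, the weight vector's shape, and the shape of each intermediate result inside `fit` is usually the quickest way to find which product is misaligned.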