## Logistic Regression

Import the libraries under their conventional short aliases.

In [34]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Load the dataset produced by the data-preparation step. We need to choose a target variable for classification: here it is the success column, where 1 marks a successful retail outlet and 0 an unsuccessful one. The classifier can then assign a newly opened outlet to one of these two categories.

In [35]:
df = pd.read_csv('Nikiforova_prep.csv', sep=';')
df

,square,type,reviews,cars_per_day,average_income_of_customers,road,place_for_walk,coating_quality,spontaneous_trade,place_for_picnic,success
0,9.00,1,42.820513,12679.0,14910.0,0.0,0.0,0.0,0.0,0.0,0
1,30.00,1,8.000000,12800.0,9030.0,0.0,0.0,0.0,0.0,0.0,0
2,15.00,1,42.820513,77106.0,9030.0,0.0,0.0,0.0,0.0,0.0,0
3,20.00,1,42.820513,0.0,13930.0,0.0,0.0,0.0,0.0,0.0,0
4,30.00,0,42.820513,0.0,14560.0,0.0,0.0,0.0,0.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...
59,30.00,1,25.000000,8771.0,11130.0,1.0,1.0,0.0,0.0,0.0,0
60,51.77,1,4.000000,146.0,9800.0,1.0,0.0,0.0,0.0,0.0,1
61,27.26,1,3.000000,6661.0,12600.0,0.0,0.0,0.0,0.0,0.0,1
62,50.93,0,17.000000,0.0,12460.0,0.0,0.0,0.0,0.0,0.0,0


Split the dataset into training and test sets in an 80%/20% ratio.

In [36]:
from sklearn.model_selection import train_test_split

# Features: every column except the last; target: the last column (success).
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)
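With a sample this small and with far fewer successful outlets than unsuccessful ones, a purely random split can leave the two parts with noticeably different class shares. A minimal sketch of a stratified split follows; the data are synthetic and the names X_demo, y_demo, X_tr, X_te, y_tr, y_te are hypothetical, chosen so that the notebook's own variables are not overwritten.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): 64 rows, 10 features,
# 16 positives and 48 negatives.
rng = np.random.default_rng(13)
X_demo = rng.normal(size=(64, 10))
y_demo = np.array([1] * 16 + [0] * 48)

# stratify=y_demo keeps the share of positives roughly equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=13, stratify=y_demo
)

print("share of positives, train:", y_tr.mean())
print("share of positives, test: ", y_te.mean())
```

Whether stratification is appropriate for the real dataset is a judgment call; the sketch only shows the mechanics of the stratify argument.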

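Logistic regression in scikit-learn is regularized by default, so features on very different scales (for example cars_per_day versus road) should be standardized first. We fit the scaler on the training set only and then apply it to both sets.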

In [37]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler().fit(X_train)
X_train = sc_X.transform(X_train)
X_test = sc_X.transform(X_test)
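The same fit-on-train, transform-both logic can also be packaged as a scikit-learn pipeline, which makes it harder to leak test data into the scaler by accident. The sketch below is an alternative formulation on synthetic data, not the approach used in this notebook; X_demo, y_demo, and pipe are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 60 rows, 4 features on a large scale.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=100.0, scale=25.0, size=(60, 4))
y_demo = (X_demo[:, 0] > 100.0).astype(int)

# The pipeline fits the scaler on the rows passed to fit() only
# and reuses those statistics when scoring or predicting.
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=13))
pipe.fit(X_demo[:48], y_demo[:48])
test_accuracy = pipe.score(X_demo[48:], y_demo[48:])

# The fitted scaler holds the training means, not the full-sample means.
scaler = pipe.named_steps["standardscaler"]
print(scaler.mean_)
print(test_accuracy)
```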

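Fit a baseline logistic regression with statsmodels and look at the significance of each variable. Note that sm.Logit does not add an intercept on its own, so this model is fitted without a constant term.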

In [38]:
import statsmodels.api as sm
lr = sm.Logit(y_train, X_train).fit()
print(lr.summary2())

Optimization terminated successfully.
         Current function value: 0.492909
         Iterations 6
                        Results: Logit
Model:              Logit            Pseudo R-squared: 0.254   
Dependent Variable: y                AIC:              70.2767 
Date:               2021-12-02 13:21 BIC:              89.5950 
No. Observations:   51               Log-Likelihood:   -25.138 
Df Model:           9                LL-Null:          -33.675 
Df Residuals:       41               LLR p-value:      0.047574
Converged:          1.0000           Scale:            1.0000  
No. Iterations:     6.0000                                     
-----------------------------------------------------------------
          Coef.    Std.Err.      z      P>|z|     [0.025   0.975]
-----------------------------------------------------------------
x1        0.7955     0.4214    1.8879   0.0590   -0.0304   1.6213
x2        0.8035     0.4659    1.7247   0.0846   -0.1096   1.7166
x3        0.6902 

Next we need to select variables. This can be automated with a function or done by hand. Here we use recursive feature elimination with cross-validation (RFECV), which keeps 7 variables: square, type, reviews, cars_per_day, average_income_of_customers, road, place_for_picnic.

In [39]:
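from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Recursively drop one feature per step; cross-validation picks the subset size.
estimator = LogisticRegression(random_state=123)
selector = RFECV(estimator, step=1)
selector = selector.fit(X_train, y_train)
print(selector.n_features_)  # number of features kept
print(selector.support_)     # boolean mask of kept features
print(selector.ranking_)     # 1 = kept; larger values were eliminated earlier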

7
[ True  True  True  True  True  True False False False  True]
[1 1 1 1 1 1 4 3 2 1]
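In the output above, rank 1 marks the features that were kept, and with step=1 the feature ranked 4 was dropped first, then 3, then 2. The relationship between support_ and ranking_ can be checked on a small synthetic problem; the sketch below sets cv and scoring explicitly as an illustration, and X_demo, y_demo, selector_demo, and selected_idx are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic classification problem (assumption): 8 features, 3 informative.
X_demo, y_demo = make_classification(
    n_samples=120, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# cv and scoring are spelled out here; the notebook cell relies on the defaults.
selector_demo = RFECV(
    LogisticRegression(random_state=123), step=1, cv=5, scoring="accuracy"
)
selector_demo.fit(X_demo, y_demo)

# Kept features are exactly those with rank 1.
selected_idx = np.flatnonzero(selector_demo.support_)
print(selected_idx)
print(selector_demo.ranking_)
```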


In [40]:
# Indices of the features kept by RFECV.
selected_columns = [i for i, keep in enumerate(selector.support_) if keep]
print(selected_columns)
print('№\tIndex\tFeature')
for i, column in enumerate(selected_columns):
    print(str(i) + '\t' + str(column) + '\t' + str(df.columns[column]))

[0, 1, 2, 3, 4, 5, 9]
№	Index	Feature
0	0	square
1	1	type
2	2	reviews
3	3	cars_per_day
4	4	average_income_of_customers
5	5	road
6	9	place_for_picnic


Keep only the selected variables.

In [41]:
X_train_select=X_train[:, selected_columns]
X_test_select=X_test[:, selected_columns]

In [42]:
# Same selection as in the previous cell, written with the RFECV boolean mask;
# this overwrites X_train and X_test, which are used from here on.
X_train = X_train[:, selector.support_]
X_test = X_test[:, selector.support_]

Fit a logistic regression on the selected features only.

In [43]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(random_state = 13).fit(X_train, y_train)

In [44]:
y_pred = lr.predict(X_test)

In [45]:
lr.score(X_test,y_test)

0.5384615384615384

Look at the confusion matrix. The model classified 7 objects correctly and 6 incorrectly. Model accuracy: 53.8%. Error rate: 46.2%. Only 1 of the 3 successful outlets in the test set was recognized as successful.

In [46]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[6 4]
 [2 1]]
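The figures quoted for this matrix, plus a few that help put the accuracy in context, can be derived directly from its four cells. The sketch below hard-codes the matrix printed above; the variable names and the choice of a majority-class baseline for comparison are illustrative assumptions.

```python
import numpy as np

# Confusion matrix from the cell above: rows are true classes, columns are predictions.
cm = np.array([[6, 4],
               [2, 1]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()             # 7 of 13 correct
error_rate = 1 - accuracy                   # 6 of 13 wrong
baseline = cm.sum(axis=1).max() / cm.sum()  # always predicting the majority class
precision = tp / (tp + fp)                  # predicted successes that are real
recall = tp / (tp + fn)                     # real successes that were found

print(f"accuracy  {accuracy:.3f}")   # 0.538
print(f"error     {error_rate:.3f}") # 0.462
print(f"baseline  {baseline:.3f}")   # 0.769
print(f"precision {precision:.3f}")  # 0.200
print(f"recall    {recall:.3f}")     # 0.333
```

On this test set, always predicting class 0 would score 10 of 13, so the model's 53.8% is below that simple baseline; with only 13 test objects, though, these estimates are very noisy.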


In [47]:
import pickle

# Save the scaled, feature-selected train/test splits.
obj = {'X_train': X_train, 'X_test': X_test, 'y_train': y_train, 'y_test': y_test}
with open('data.pkl', 'wb') as output:
    pickle.dump(obj, output, 2)
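Reading the saved splits back is symmetric to writing them. The sketch below demonstrates the round trip on a small stand-in dictionary written to a temporary file, so it does not touch data.pkl; the names demo, path, and restored are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np

# Small stand-in for the saved dictionary (assumption: same kind of content).
demo = {"X_train": np.arange(6.0).reshape(3, 2), "y_train": np.array([0, 1, 0])}

# Write with protocol 2, as in the cell above, then read the file back.
path = os.path.join(tempfile.mkdtemp(), "data_demo.pkl")
with open(path, "wb") as f:
    pickle.dump(demo, f, 2)
with open(path, "rb") as f:
    restored = pickle.load(f)

print(sorted(restored))
print(restored["X_train"].shape)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.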