# Logistic Regression

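### Contrary to its name, Logistic Regression performs classification, not regression (predicting a continuous value). It passes a linear combination of the features through the S-shaped sigmoid function, so the predicted probability follows a curve rather than a straight line, even though the decision boundary itself is linear.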

Logistic regression involves finding the "best fit" curve

$$p(y_{i}) = \frac{1}{1 + e^{-(A + Bx_{i})}}$$

* A is the intercept
* B is the regression coefficient
* $x_{i}$ is the feature value of the i-th observation
* e is Euler's number (approximately 2.71828)

* The output of the model is a probability score. We set a threshold and assign the class on that basis.
* Classification is for predicting categorical data.
* It fits a curve on our data.
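To make the probability-then-threshold idea concrete, here is a minimal sketch using NumPy. The intercept `A`, coefficient `B`, feature values `x` and the threshold of 0.5 are all made-up values chosen for illustration, not fitted from the housing data.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical intercept A and coefficient B for a single feature x
A, B = -3.0, 2.0
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

probabilities = sigmoid(A + B * x)   # probability of the positive class
threshold = 0.5                      # assumed cut-off
labels = probabilities >= threshold  # True where the probability reaches the threshold

print(np.round(probabilities, 3))
print(labels)
```

Raising the threshold makes the model more conservative about predicting the positive class; lowering it does the opposite.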

In [3]:
import pandas as pd
import matplotlib.pyplot as plt

In [4]:
housing_data = pd.read_csv('datasets/housing.csv')

housing_data.sample(5)

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 7263 | -118.23 | 33.99 | 37.0 | 378.0 | 176.0 | 714.0 | 156.0 | 2.1912 | 112500.0 | <1H OCEAN |
| 8211 | -118.18 | 33.79 | 27.0 | 1580.0 | 510.0 | 1896.0 | 448.0 | 2.0186 | 130000.0 | NEAR OCEAN |
| 8741 | -118.32 | 33.82 | 25.0 | 2587.0 | 512.0 | 1219.0 | 509.0 | 4.4271 | 382100.0 | <1H OCEAN |
| 13085 | -121.35 | 38.56 | 16.0 | 2629.0 | 491.0 | 1265.0 | 485.0 | 4.5066 | 140200.0 | INLAND |
| 6468 | -118.05 | 34.1 | 42.0 | 2065.0 | 404.0 | 1313.0 | 402.0 | 4.0179 | 274300.0 | INLAND |


In [5]:
housing_data = housing_data.dropna()

#### Many records sit at the upper cap of `median_house_value` (500001). This skews the data and would hinder our model, so we drop those rows.

#### Exploratory analysis before modelling is important for catching issues like this.
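One way to spot such a cap during exploration is to count how many rows share the maximum value. The sketch below uses a small made-up DataFrame named `toy` rather than the real housing data, so the numbers are illustrative only.

```python
import pandas as pd

# Toy stand-in for the housing data; the values are made up for illustration
toy = pd.DataFrame({'median_house_value': [112500.0, 500001.0, 382100.0,
                                           500001.0, 140200.0, 500001.0]})

cap = toy['median_house_value'].max()
at_cap = (toy['median_house_value'] == cap).sum()

print(f'{at_cap} of {len(toy)} rows sit at the cap value {cap}')

# A histogram would show the same thing as a spike at the right edge:
# toy['median_house_value'].plot.hist(bins=50)
```

A large share of rows at exactly the maximum is a hint that the values were clipped when the data was collected.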

In [6]:
housing_data = housing_data.drop(housing_data.loc[housing_data['median_house_value'] == 500001].index)
housing_data.shape

(19475, 10)

In [7]:
#### All the columns except `ocean_proximity` are already numerical. Since ML models accept only numerical data, we need to convert it.

#### We will do so using <a href="https://en.wikipedia.org/wiki/One-hot">one-hot encoding</a>

In [8]:
# pd.get_dummies() replaces the categorical column with one indicator (0/1) column per category

housing_data = pd.get_dummies(housing_data, columns=['ocean_proximity'])
housing_data.shape

(19475, 14)
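The column count grows from 10 to 14 because the single `ocean_proximity` column is replaced by five indicator columns. A minimal sketch on a made-up three-row frame (the name `toy` and its values are assumptions for illustration) shows the same mechanics:

```python
import pandas as pd

toy = pd.DataFrame({
    'median_income': [2.19, 4.43, 4.51],
    'ocean_proximity': ['INLAND', 'NEAR OCEAN', 'INLAND'],
})

# dtype=int forces 0/1 columns; recent pandas versions default to booleans
encoded = pd.get_dummies(toy, columns=['ocean_proximity'], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one indicator set to 1, which is what "one-hot" refers to.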

We will compare each house value against the median to turn the regression problem into a classification problem.

In [9]:
median = housing_data['median_house_value'].median()

median

173800.0

#### Create a boolean column whose value records whether the house value is above the median.

#### This makes it a binary classification problem {True or False}.

In [11]:
housing_data['above_median'] = housing_data['median_house_value'] > median
housing_data.sample(5)

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity_<1H OCEAN | ocean_proximity_INLAND | ocean_proximity_ISLAND | ocean_proximity_NEAR BAY | ocean_proximity_NEAR OCEAN | above_median |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3488 | -118.57 | 34.29 | 4.0 | 6995.0 | 1151.0 | 2907.0 | 1089.0 | 7.0808 | 341200.0 | 1 | 0 | 0 | 0 | 0 | True |
| 20530 | -121.76 | 38.57 | 11.0 | 15018.0 | 3008.0 | 7984.0 | 2962.0 | 3.1371 | 201800.0 | 0 | 1 | 0 | 0 | 0 | True |
| 7056 | -118.03 | 33.92 | 35.0 | 2108.0 | 405.0 | 1243.0 | 394.0 | 3.6731 | 167000.0 | 1 | 0 | 0 | 0 | 0 | False |
| 8240 | -118.19 | 33.77 | 21.0 | 2103.0 | 727.0 | 1064.0 | 603.0 | 1.6178 | 137500.0 | 0 | 0 | 0 | 0 | 1 | False |
| 3526 | -118.51 | 34.27 | 36.0 | 2276.0 | 429.0 | 1001.0 | 419.0 | 4.1042 | 252100.0 | 1 | 0 | 0 | 0 | 0 | True |
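Splitting at the median leaves the two classes roughly equal in size, which is convenient for a first classifier. The sketch below repeats the step on five made-up values (the frame `toy` is an assumption for illustration) and also shows that a value exactly equal to the median is labelled `False`, because the comparison is strict.

```python
import pandas as pd

toy = pd.DataFrame({'median_house_value': [100000.0, 150000.0, 200000.0,
                                           250000.0, 300000.0]})

toy_median = toy['median_house_value'].median()
toy['above_median'] = toy['median_house_value'] > toy_median  # strict comparison

print(toy_median)
print(toy['above_median'].value_counts())
```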


In [12]:
X = housing_data.drop(['median_house_value', 'above_median'], axis=1)
Y = housing_data['above_median']

X.columns

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'ocean_proximity_<1H OCEAN', 'ocean_proximity_INLAND',
       'ocean_proximity_ISLAND', 'ocean_proximity_NEAR BAY',
       'ocean_proximity_NEAR OCEAN'],
      dtype='object')

In [13]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

x_train.shape, x_test.shape

((15580, 13), (3895, 13))

In [14]:
y_train.shape, y_test.shape

((15580,), (3895,))
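The split above is random, so the rows that land in each part (and therefore the scores) change from run to run. If reproducibility matters, `train_test_split` accepts `random_state` and `stratify`. The sketch below uses made-up arrays `X_toy` and `y_toy`, and the seed 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features and balanced boolean labels standing in for X and Y
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([True, False] * 10)

x_tr, x_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,
    random_state=42,   # fixes the shuffle so the split is reproducible
    stratify=y_toy,    # keeps the class ratio the same in both parts
)

print(x_tr.shape, x_te.shape)
print(y_te.mean())  # share of True labels in the test part
```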

In [18]:
from sklearn.linear_model import LogisticRegression

# solver selects the optimization algorithm; here we use liblinear
# It is a good choice for small datasets and binary classification

logistic_model = LogisticRegression(solver='liblinear').fit(x_train, y_train)

#### When building a classification model, the default way to measure performance is **accuracy**.
#### That is, the fraction of predicted values that are correct.
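Accuracy is simply the number of correct predictions divided by the total number of predictions. A minimal sketch with made-up label arrays (not taken from the housing model) computes it by hand:

```python
import numpy as np

# Made-up labels for illustration
actual    = np.array([True, True, False, True, False, False, True, True])
predicted = np.array([True, True, True, False, False, False, True, True])

# Accuracy = number of correct predictions / total number of predictions
accuracy = (predicted == actual).mean()
print(accuracy)
```

Six of the eight predictions match, so the accuracy is 0.75. The `score` method of a scikit-learn classifier and `accuracy_score` report this same quantity.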

In [19]:
print('Training_score : ', logistic_model.score(x_train, y_train))

Training_score :  0.8182926829268292


In [20]:
y_pred = logistic_model.predict(x_test)
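Behind `predict`, the model first computes a probability for each class. The sketch below, on synthetic one-feature data (the arrays `X_toy` and `y_toy` and the seed are assumptions for illustration), uses `predict_proba` to show that for a binary problem the predicted label corresponds to thresholding the positive-class probability at 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic data: the label is True when the feature plus some noise is positive
X_toy = rng.normal(size=(200, 1))
y_toy = (X_toy[:, 0] + 0.3 * rng.normal(size=200)) > 0

model = LogisticRegression(solver='liblinear').fit(X_toy, y_toy)

proba_true = model.predict_proba(X_toy)[:, 1]   # column 1 = probability of class True
manual_labels = proba_true > 0.5                # apply the threshold ourselves

print(model.classes_)
print((manual_labels == model.predict(X_toy)).all())
```

Working with `predict_proba` directly is useful when a threshold other than 0.5 is wanted.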

In [21]:
df_pred_actual = pd.DataFrame({'predicted': y_pred, 'actual': y_test})

df_pred_actual.head(10)

| | predicted | actual |
|---|---|---|
| 894 | True | True |
| 487 | True | True |
| 1365 | True | False |
| 1266 | False | True |
| 4640 | False | True |
| 12743 | False | False |
| 15774 | True | True |
| 916 | True | True |
| 9577 | False | False |
| 20313 | True | True |


In [22]:
from sklearn.metrics import accuracy_score

print('Testing_score : ', accuracy_score(y_test, y_pred))

Testing_score :  0.8274711168164314


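#### The accuracy on the test data (about 0.83) is close to the accuracy on the training data (about 0.82), which suggests the model behaves the same way on both and is not overfitting.

#### An accuracy above 80% also indicates the model is decent: because we split at the median, the two classes are roughly balanced, so always guessing one class would score only about 50%.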