<IMG SRC="https://github.com/jacquesroy/byte-size-data-science/raw/master/images/Banner.png" ALT="BSDS Banner" WIDTH=1195 HEIGHT=200>

# Logistic Regression


In [None]:
from IPython.display import IFrame

IFrame(src="https://www.youtube.com/embed/E4nhrtrGUWE?rel=0&amp;controls=0&amp;showinfo=0", width=560, height=315)

## Import the appropriate libraries and set up needed connections

In [None]:
import pandas as pd
import numpy as np
import requests, zipfile, io

## Read the data
<table style="width:50%;" align="left">
    <tr><td style="text-align: left;vertical-align: top;">
Here we read the data we'll use to create and test<br/>
our model. This data is discussed in:<br/>
        
"Data Science for Business"<br/>
        
starting on page 104. I thought it would be<br/>
interesting to see if we get similar results.
        
The data relates to breast cancer diagnosis:<br/>
        distinguishing malignant from benign tumors.
        </td>
<td><IMG SRC="https://miro.medium.com/max/500/1*dfkEYd_lCvR8XbpGufL-yw.jpeg" ALT="Data Science for Business" WIDTH=187 HEIGHT=283>
    </td>
    </tr>
    </table>


In [None]:
header = ['ID','DIAGNOSIS',
          'RADIUS_mean','TEXTURE_mean','PERIMETER_mean','AREA_mean','SMOOTHNESS_mean',
          'COMPACTNESS_mean','CONCAVITY_mean','CONCAVEPOINTS_mean','SYMMETRY_mean', 'FRACTALDIMENSION_mean',
          'RADIUS_se','TEXTURE_se','PERIMETER_se','AREA_se','SMOOTHNESS_se',
          'COMPACTNESS_se','CONCAVITY_se','CONCAVEPOINTS_se','SYMMETRY_se', 'FRACTALDIMENSION_se',
          'RADIUS_worst','TEXTURE_worst','PERIMETER_worst','AREA_worst','SMOOTHNESS_worst',
          'COMPACTNESS_worst','CONCAVITY_worst','CONCAVEPOINTS_worst','SYMMETRY_worst', 'FRACTALDIMENSION_worst'
         ]
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data"

stream = requests.get(url).content

# The file has no header row: header=None keeps the first record as data.
data_pd = pd.read_csv(io.StringIO(stream.decode('utf-8')),header=None,names=header)
print("Number of records: {0}".format(data_pd.shape[0]))
data_pd.head()


In [None]:
# Summary statistics for the numeric columns (DIAGNOSIS is text, so it is skipped).
# The notebook display may truncate the table to 20 columns.
# This gives us an idea of the data distribution.
data_pd.describe()

## Create a model
We use the scikit-learn (sklearn) library to create our logistic regression model.<br/>
In practice, a pipeline would apply the data transformations automatically.
We do the steps by hand here because we want to show the effect of the transformations.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
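
For comparison, here is a minimal sketch of what the pipeline version could look like. It is not the code used in the rest of this notebook. It assumes that scikit-learn's bundled copy of the Wisconsin diagnostic data (`load_breast_cancer`) can stand in for the downloaded file; the names `pipe` and `pipeline_accuracy` are made up for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for the file read in this notebook.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

# The scaler is fitted on the training data only. The same mean and
# standard deviation are reused every time the pipeline scores or predicts.
pipe = make_pipeline(StandardScaler(), LogisticRegression(solver='lbfgs'))
pipe.fit(X_train, y_train)

pipeline_accuracy = pipe.score(X_test, y_test)
print("Mean accuracy pipeline: {0}".format(pipeline_accuracy))
```

With a pipeline, there is no separate scaled copy of the data to keep track of: raw rows go in and the scaling happens inside `fit`, `score` and `predict`.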

### Prepare the data

In [None]:
X = data_pd.drop(['ID','DIAGNOSIS'], axis=1)
y = data_pd['DIAGNOSIS']

X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
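# Fit the scaler on the training data only, then apply the same
# transformation to the test data so both share one mean and variance.
scaler = preprocessing.StandardScaler().fit(X_train)
X_scaled_train = scaler.transform(X_train)
X_scaled_test = scaler.transform(X_test)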

### Look at scaled data
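The scale preprocessing step centers each attribute on its mean and scales it to unit variance.

The statistics provided by describe do not include the variance, but they include the 
standard deviation, which is the square root of the variance.

Without scaling, attributes with large values (such as the AREA columns) dominate the others:
the solver may not converge within its default number of iterations, and the weights
would not be comparable between attributes.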

In [None]:
scaled_train_pd = pd.DataFrame(data=X_scaled_train,columns=header[2:])
scaled_train_pd.describe()
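
To make the transformation concrete, here is a small sketch that reproduces the scaling by hand and compares it with `preprocessing.scale`. The three rows in `X_demo` are made-up values for illustration, not records from the dataset.

```python
import numpy as np
from sklearn import preprocessing

# Made-up values for illustration: two attributes on very different ranges.
X_demo = np.array([[1.0, 200.0],
                   [2.0, 400.0],
                   [3.0, 900.0]])

# Center on the column mean, then divide by the column standard deviation
# (population form, ddof=0, which is NumPy's default).
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
X_sklearn = preprocessing.scale(X_demo)

print(np.allclose(X_manual, X_sklearn))
print(X_sklearn.mean(axis=0).round(6), X_sklearn.std(axis=0).round(6))
```

Note that the `std` row of describe uses the sample standard deviation (ddof=1), so for scaled data it shows values slightly above 1 rather than exactly 1.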

### Create and train the model

In [None]:
clf = LogisticRegression(solver='lbfgs').fit(X_scaled_train, y_train)

## Look at some of the model attributes

In [None]:
print("Result classes      : {0}".format(clf.classes_))
print("Number of iterations: {0}".format(clf.n_iter_[0]))
print("Intercept           : {0:8.6f}".format(clf.intercept_[0]))

## Look at a few predictions
Note that we are using scaled data, just like when creating the model.<br/>
This is where creating a pipeline would have been useful.

In [None]:
predict = clf.predict(X_scaled_test[5:8])
predict_proba = clf.predict_proba(X_scaled_test[5:8])

for i in range(3) :
    print("Prediction: {0}, probability: {1:6.3f}, {2:6.3f}".format(
        predict[i], 100 * predict_proba[i][0],100 * predict_proba[i][1])
         )
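
The two numbers printed for each prediction follow the order of `clf.classes_`. The sketch below shows that relationship on synthetic data, which is an assumption made to keep the example self-contained; `demo_clf`, `X_demo` and `y_demo` are names invented for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Assumption: synthetic data stands in for the scaled tumor measurements.
X_demo, y_num = make_classification(n_samples=200, n_features=5, random_state=0)
y_demo = np.where(y_num == 1, 'M', 'B')

demo_clf = LogisticRegression(solver='lbfgs').fit(X_demo, y_demo)

proba = demo_clf.predict_proba(X_demo[:5])
# Column i of predict_proba belongs to demo_clf.classes_[i]. The predicted
# label is the class with the highest probability.
labels_from_proba = demo_clf.classes_[proba.argmax(axis=1)]

print(demo_clf.classes_)
print(labels_from_proba, demo_clf.predict(X_demo[:5]))
```

Since the classes are sorted, labels B and M give the benign probability first and the malignant probability second.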

## How accurate is our model?
The accuracy of the model is similar to the one reported in "Data Science for Business".

In [None]:
# score() returns the mean accuracy: the fraction of test records classified correctly.
score = clf.score(X_scaled_test,y_test)
print("Mean accuracy clf : {0}".format(score))
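
The mean accuracy is simply the fraction of predictions that match the true labels. Here is a small sketch of that calculation, together with a confusion matrix that shows which kind of error was made. The labels below are made up for illustration and are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration only, not results from this notebook.
y_true = np.array(['B', 'B', 'M', 'M', 'B', 'M', 'B', 'B'])
y_pred = np.array(['B', 'B', 'M', 'B', 'B', 'M', 'M', 'B'])

# Accuracy by hand: the share of positions where prediction and truth agree.
manual_accuracy = np.mean(y_pred == y_true)
print(manual_accuracy, accuracy_score(y_true, y_pred))

# Rows are true classes, columns are predicted classes, in the order given by labels.
cm = confusion_matrix(y_true, y_pred, labels=['B', 'M'])
print(cm)
```

For a diagnosis problem, the two kinds of error do not have the same cost, so the confusion matrix can tell us more than the accuracy alone.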

## Parameter importance
Remember that the core of logistic regression is a linear formula: a weighted sum of the attributes plus an intercept.<br/>
We can retrieve the weights (or coefficients) used for each attribute.

The larger the absolute value of a weight, the more impact the attribute has.
This comparison is meaningful because the attributes were scaled.

In [None]:
# Zip column names and weights
x = zip(X_test.columns.tolist(),clf.coef_.tolist()[0])

importance = pd.DataFrame(data=list(x),columns = ['NAME','WEIGHT'])
print("Number of records: {0}".format(importance.shape[0]))

importance.reindex(importance.WEIGHT.abs().sort_values(ascending=False).index)
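
To see how the weights are used, this sketch computes the linear formula by hand and passes it through the sigmoid function, then checks the result against `predict_proba`. It uses synthetic data as a stand-in for the scaled training set; `demo_clf`, `X_demo` and `y_demo` are names invented for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Assumption: synthetic data stands in for the scaled training set.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
demo_clf = LogisticRegression(solver='lbfgs').fit(X_demo, y_demo)

# Linear part: weighted sum of the attributes plus the intercept.
z = X_demo @ demo_clf.coef_[0] + demo_clf.intercept_[0]

# The sigmoid turns that value into the probability of classes_[1].
manual_proba = 1.0 / (1.0 + np.exp(-z))

print(np.allclose(z, demo_clf.decision_function(X_demo)))
print(np.allclose(manual_proba, demo_clf.predict_proba(X_demo)[:, 1]))
```

A positive weight pushes the prediction toward `classes_[1]` as the attribute grows, and a negative weight pushes it toward `classes_[0]`.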

### Try multiple solvers
If you are curious, you can execute this next cell to compare the algorithms (solvers)
available to fit the logistic regression model. The score and the number of iterations can differ between solvers.

In [None]:
solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
for solver in solvers :
    clf = LogisticRegression(random_state=0,solver=solver).fit(X_scaled_train, y_train)
    score = clf.score(X_scaled_test,y_test)
    print("Score for solver {0}: {1}, iterations: {2}".format(solver,score,clf.n_iter_))
