## Elena Mylläri

# Assignment: Logistic regression

The purpose of this project is to build a diagnostic tool (not for real medical use) that asks a medical expert for six
numerical quantities obtained from radiographic measurements of a patient and uses a logistic regression model to predict whether the patient has a vertebral abnormality.

### Needed imports

In [131]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

### The dataset
The biomedical [data set](http://archive.ics.uci.edu/ml/datasets/Vertebral+Column#) was built by Dr. Henrique da Mota. Each patient is represented in the data set by six biomechanical attributes derived from the shape and orientation of the pelvis and lumbar spine (in this order): pelvic incidence, pelvic tilt, lumbar lordosis angle, sacral slope, pelvic radius and grade of spondylolisthesis. The following convention is used for the class labels: Normal (NO) and Abnormal (AB).

In [196]:
df = pd.read_csv('C:\\Users\\Lena\\Downloads\\vertebral_column_data\\column_2C.dat', 
                 sep=r"\s+",  # raw string, so \s is not treated as an invalid escape
                 names=['pelvic incidence', 'pelvic tilt', 
                        'lumbar lordosis angle','sacral slope', 
                        'pelvic radius', 'grade of spondylolisthesis', 
                        'vertebral abnormality'])
df.head(5)

Unnamed: 0,pelvic incidence,pelvic tilt,lumbar lordosis angle,sacral slope,pelvic radius,grade of spondylolisthesis,vertebral abnormality
0,63.03,22.55,39.61,40.48,98.67,-0.25,AB
1,39.06,10.06,25.02,29.0,114.41,4.56,AB
2,68.83,22.22,50.09,46.61,105.99,-3.53,AB
3,69.3,24.65,44.31,44.64,101.87,11.21,AB
4,49.71,9.65,28.32,40.06,108.17,7.92,AB


Check if there are values missing

In [197]:
df.isna().sum()

pelvic incidence              0
pelvic tilt                   0
lumbar lordosis angle         0
sacral slope                  0
pelvic radius                 0
grade of spondylolisthesis    0
vertebral abnormality         0
dtype: int64

Get the statistics

In [198]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
pelvic incidence,310.0,60.496484,17.236109,26.15,46.4325,58.69,72.88,129.83
pelvic tilt,310.0,17.542903,10.00814,-6.55,10.6675,16.36,22.12,49.43
lumbar lordosis angle,310.0,51.93071,18.553766,14.0,37.0,49.565,63.0,125.74
sacral slope,310.0,42.953871,13.422748,13.37,33.3475,42.405,52.6925,121.43
pelvic radius,310.0,117.920548,13.317629,70.08,110.71,118.265,125.4675,163.07
grade of spondylolisthesis,310.0,26.296742,37.558883,-11.06,1.6,11.765,41.285,418.54


In [199]:
df.dtypes

pelvic incidence              float64
pelvic tilt                   float64
lumbar lordosis angle         float64
sacral slope                  float64
pelvic radius                 float64
grade of spondylolisthesis    float64
vertebral abnormality          object
dtype: object

### Data preprocessing

Check the unique values of the vertebral abnormality attribute (should be AB and NO only) and the number of rows in each class

In [200]:
values, counts = np.unique(df["vertebral abnormality"], return_counts=True)
print(values)
print(counts)

['AB' 'NO']
[210 100]


Re-encode vertebral abnormality column

In [201]:
df['vertebral abnormality'] = np.where(df['vertebral abnormality']=='AB', 1,0)
df.head(5)

Unnamed: 0,pelvic incidence,pelvic tilt,lumbar lordosis angle,sacral slope,pelvic radius,grade of spondylolisthesis,vertebral abnormality
0,63.03,22.55,39.61,40.48,98.67,-0.25,1
1,39.06,10.06,25.02,29.0,114.41,4.56,1
2,68.83,22.22,50.09,46.61,105.99,-3.53,1
3,69.3,24.65,44.31,44.64,101.87,11.21,1
4,49.71,9.65,28.32,40.06,108.17,7.92,1
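Note that `np.where` sends every label other than 'AB' to 0, including a misspelled one. As a hedged alternative, an explicit mapping makes unexpected labels visible; the sketch below uses a small made-up series rather than the real column:

```python
import pandas as pd

# Toy stand-in for the 'vertebral abnormality' column (not the real data).
labels = pd.Series(["AB", "NO", "AB"], name="vertebral abnormality")

# Explicit mapping: a label missing from the dict becomes NaN instead of 0.
encoded = labels.map({"AB": 1, "NO": 0})

# Fail early if an unexpected label slipped in.
assert not encoded.isna().any(), "unexpected class label"
print(encoded.tolist())  # [1, 0, 1]
```

Since the unique values were checked beforehand, both approaches give the same result on this dataset.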


In [202]:
df.tail(5)

Unnamed: 0,pelvic incidence,pelvic tilt,lumbar lordosis angle,sacral slope,pelvic radius,grade of spondylolisthesis,vertebral abnormality
305,47.9,13.62,36.0,34.29,117.45,-4.25,0
306,53.94,20.72,29.22,33.22,114.37,-0.42,0
307,61.45,22.69,46.17,38.75,125.67,-2.71,0
308,45.25,8.69,41.58,36.56,118.55,0.21,0
309,33.84,5.07,36.64,28.77,123.95,-0.2,0


The data appears to be ordered by class: the rows with class 1 come first, followed by the rows with class 0.

So we shuffle the data:

In [203]:
df = df.sample(frac=1).reset_index(drop=True)
df.head(10)

Unnamed: 0,pelvic incidence,pelvic tilt,lumbar lordosis angle,sacral slope,pelvic radius,grade of spondylolisthesis,vertebral abnormality
0,77.69,21.38,64.43,56.31,114.82,26.93,1
1,57.04,0.35,49.2,56.69,103.05,52.17,1
2,43.44,10.1,36.03,33.34,137.44,-3.11,0
3,52.86,9.41,46.99,43.45,123.09,1.86,0
4,48.8,18.02,52.0,30.78,139.15,10.44,0
5,48.32,17.45,48.0,30.87,128.98,-0.91,0
6,72.08,18.95,51.0,53.13,114.21,1.01,1
7,115.92,37.52,76.8,78.41,104.7,81.2,1
8,82.41,29.28,77.05,53.13,117.04,62.77,1
9,63.9,13.71,62.12,50.19,114.13,41.42,1
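Without a seed, `sample` returns a different order on every run, so the rows shown above will differ when the notebook is re-executed. A small sketch on made-up data shows how a fixed `random_state` makes the shuffle repeatable (the value 42 is an arbitrary choice):

```python
import pandas as pd

# Toy frame ordered by label, like the original file (not the real data).
toy = pd.DataFrame({"x": range(6), "label": [1, 1, 1, 0, 0, 0]})

# The same random_state gives the same permutation on every call.
a = toy.sample(frac=1, random_state=42).reset_index(drop=True)
b = toy.sample(frac=1, random_state=42).reset_index(drop=True)

print(a.equals(b))  # True
```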


Split into explanatory and response variables 

In [204]:
X = df.iloc[:,:6]
Y = df.iloc[:,6]

### The model

Build and fit model

In [205]:
reg = LogisticRegression()
reg.fit(X,Y)

print("Coefficients: ",reg.coef_)
print("Intercept: ", reg.intercept_)

Coefficients:  [[-0.03205038  0.10757546 -0.01869301 -0.06459006 -0.10677264  0.16808262]]
Intercept:  [15.15571755]


Compute predicted values from training set

In [206]:
Y_pred = reg.predict(X)
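These predictions are made on the same rows the model was fitted on. A common complement is to hold out part of the data for evaluation. The sketch below illustrates the idea on synthetic data with the same size and class ratio as this dataset; `X_demo`, `y_demo` and `clf` are hypothetical names, and the printed numbers say nothing about the real data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 310 rows, 6 features, 210 positives and 100 negatives.
rng = np.random.default_rng(0)
y_demo = np.array([1] * 210 + [0] * 100)
X_demo = rng.normal(size=(310, 6)) + y_demo[:, None] * 1.5  # shift class 1

# stratify=y_demo keeps roughly the same class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("train accuracy: %.3f" % clf.score(X_train, y_train))
print("test accuracy:  %.3f" % clf.score(X_test, y_test))
```

With the real data, `X` and `Y` would take the place of the synthetic arrays, and the test accuracy would be the figure to report.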

Display the confusion matrix

In [207]:
cm = confusion_matrix(Y, Y_pred)
print("Confusion matrix:\n",cm)

Confusion matrix:
 [[ 78  22]
 [ 22 188]]


Display the accuracy and the classification report

In [208]:
accuracy = (cm[0][0]+cm[1][1])/(cm[0][0]+cm[1][1]+cm[0][1]+cm[1][0])
print("Accuracy calculated from the training set = %.3f" % (accuracy))

print(classification_report(Y, Y_pred, target_names=['no', 'yes']))

Accuracy calculated from the training set = 0.858
              precision    recall  f1-score   support

          no       0.78      0.78      0.78       100
         yes       0.90      0.90      0.90       210

    accuracy                           0.86       310
   macro avg       0.84      0.84      0.84       310
weighted avg       0.86      0.86      0.86       310
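The figures in the report can be traced back to the confusion matrix by hand. The sketch below assumes scikit-learn's convention that rows are true classes and columns are predicted classes; the helper `per_class_metrics` is a made-up name for illustration:

```python
# Confusion matrix from the output above: rows = true class, columns = predicted.
cm = [[78, 22],
      [22, 188]]

def per_class_metrics(cm, k):
    """Precision, recall and F1 for class k of a square confusion matrix."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: all rows predicted as k
    actual_k = sum(cm[k])                    # row sum: all rows that truly are k
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = (cm[0][0] + cm[1][1]) / sum(sum(row) for row in cm)
print("accuracy = %.3f" % accuracy)
for k, name in enumerate(["no", "yes"]):
    print(name, "precision=%.2f recall=%.2f f1=%.2f" % per_class_metrics(cm, k))
```

Because the off-diagonal entries happen to be equal (22 and 22), precision and recall coincide within each class.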



### Diagnostic tool

Function for data input

In [209]:
def get_data():
    pelvic_incidence = float(input("Give the patient's pelvic incidence: "))
    pelvic_tilt = float(input("Give the patient's pelvic tilt: "))
    lumbar_lordosis_angle = float(input("Give the patient's lumbar lordosis angle: "))
    sacral_slope = float(input("Give the patient's sacral slope: "))
    pelvic_radius = float(input("Give the patient's pelvic radius: "))
    # Adjacent string literals are joined without the stray indentation
    # that a backslash continuation inside the string would add.
    grade_of_spondylolisthesis = float(input("Give the patient's "
                                             "grade of spondylolisthesis: "))

    frame = {'pelvic incidence': pelvic_incidence, 
             'pelvic tilt':pelvic_tilt, 
             'lumbar lordosis angle': lumbar_lordosis_angle,
             'sacral slope': sacral_slope, 
             'pelvic radius': pelvic_radius, 
             'grade of spondylolisthesis': grade_of_spondylolisthesis} 
    return pd.DataFrame(frame, index=[0]) 
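`float(input(...))` raises a `ValueError` as soon as the expert types something that is not a number, which ends the session. One possible way to handle this is a small retry helper. The sketch below is an assumption rather than part of the original tool: `ask_float` and its `read` parameter are hypothetical names, and accepting a decimal comma is an illustrative choice.

```python
def ask_float(prompt, read=input):
    """Keep asking until the answer parses as a float."""
    while True:
        answer = read(prompt)
        try:
            # Accept a decimal comma as well as a decimal point.
            return float(answer.strip().replace(",", "."))
        except ValueError:
            print("Please enter a number, for example 57.04")

# Demonstration with scripted answers instead of real keyboard input.
scripted = iter(["abc", "57,04"])
value = ask_float("Give the patient's pelvic incidence: ",
                  read=lambda prompt: next(scripted))
print(value)  # 57.04
```

Each `float(input(...))` call in `get_data` could then be replaced by `ask_float(...)` with the same prompt.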

Function for getting the prediction

In [210]:
def predict(data):
    vertebral_abnormality = reg.predict_proba(data)
    return vertebral_abnormality

Function for getting the result of the prediction in text form

In [211]:
def get_result(prediction):
    # prediction[0] holds [P(class 0), P(class 1)] for the single input row
    if prediction[0][1] > prediction[0][0]:
        return ("The patient has a vertebral abnormality "
                "with a probability of %.0f percent" % (prediction[0][1]*100.0))
    else:
        return ("The patient doesn't have a vertebral abnormality "
                "with a probability of %.0f percent" % (prediction[0][0]*100.0))

The interface for the diagnostic tool

Two rows from the original data are used for the demonstration:

In [212]:
df.iloc[1:3,:]

Unnamed: 0,pelvic incidence,pelvic tilt,lumbar lordosis angle,sacral slope,pelvic radius,grade of spondylolisthesis,vertebral abnormality
1,57.04,0.35,49.2,56.69,103.05,52.17,1
2,43.44,10.1,36.03,33.34,137.44,-3.11,0


In [None]:
print("Accuracy of this diagnostic tool "
      "calculated from the training set = %.3f" % (accuracy))

predictMore = True
while predictMore:
    data = get_data()
    prediction = predict(data)
    print(get_result(prediction))
    choice = input("Do you want to get one prediction more? y/n")
    if choice == "y":
        predictMore = True
    else:
        predictMore = False

Accuracy of this diagnostic tool calculated from the training set = 0.858


Give the patient's pelvic incidence:  57.04
Give the patient's pelvic tilt:  0.35
Give the patient's lumbar lordosis angle:  49.20
Give the patient's sacral slope:  56.69
Give the patient's pelvic radius:  103.05
Give the patient's grade of spondylolisthesis:  52.17


The patient has a vertebral abnormality with a probability of 100 percent


Do you want to get one prediction more? y/n y
Give the patient's pelvic incidence:  43.44
Give the patient's pelvic tilt:  10.10
Give the patient's lumbar lordosis angle:  36.03
Give the patient's sacral slope:  33.34
Give the patient's pelvic radius:  137.44
Give the patient's grade of spondylolisthesis:  -3.11


The patient doesn't have a vertebral abnormality with a probability of 96 percent


Do you want to get one prediction more? y/n y


### Conclusions

The fitted logistic regression model gives the following equation for the probability that a patient has a vertebral abnormality:

> ab = 1/(1 + np.exp(0.03205038 * pelvic incidence - 0.10757546 * pelvic tilt + 0.01869301 * lumbar lordosis angle + 0.06459006 * sacral slope + 0.10677264 * pelvic radius - 0.16808262 * spondylolisthesis - 15.15571755))
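As a sanity check, the equation can be evaluated by hand for the two demonstration rows. The sketch below copies the printed (rounded) coefficients and intercept, so its results should agree with the tool's output only approximately; `abnormality_probability` is a hypothetical helper name.

```python
import math

# Coefficients and intercept copied from the fitted model's printed output.
COEF = [-0.03205038, 0.10757546, -0.01869301,
        -0.06459006, -0.10677264, 0.16808262]
INTERCEPT = 15.15571755

def abnormality_probability(features):
    """P(abnormal) for one patient; features follow the dataset's column order."""
    z = INTERCEPT + sum(w * x for w, x in zip(COEF, features))
    return 1.0 / (1.0 + math.exp(-z))

# The two demonstration rows entered into the interactive tool.
p1 = abnormality_probability([57.04, 0.35, 49.20, 56.69, 103.05, 52.17])
p2 = abnormality_probability([43.44, 10.10, 36.03, 33.34, 137.44, -3.11])
print("P(abnormal), first patient:  %.3f" % p1)
print("P(abnormal), second patient: %.3f" % p2)
```

The first probability is close to 1 and the second is close to 0.04, which corresponds to the roughly 100 percent "abnormal" and 96 percent "not abnormal" answers given by the tool.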

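The accuracy of this diagnostic tool calculated from the training set is 0.858. Because it was measured on the same data the model was fitted on, it is likely an optimistic estimate of how the tool would perform on new patients. The original dataset has only 310 rows and is imbalanced (210 abnormal vs. 100 normal). A larger, balanced dataset would be needed for a more accurate model.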
