This notebook contains Python code for a classification task. I use the 'Caesarian Section Classification Dataset' from the UCI Machine Learning Repository for this project.

Attribute Information:

Inputs are:
- Age
- Delivery number
- Blood pressure
- Heart problem

Given the inputs above, we want to classify Delivery time. There are three categories for Delivery time:
- 0 = timely
- 1 = premature
- 2 = latecomer

https://archive.ics.uci.edu/ml/datasets/Caesarian+Section+Classification+Dataset

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os
import seaborn as sns
from pandas.api.types import CategoricalDtype
import statsmodels.formula.api as smf
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import math

In [5]:
df = pd.read_excel('Classification_Data_Ces.xlsx')

Let's look at some basic info about the data set.

In [6]:
df.shape

(80, 6)

The command above shows that the data has 80 observations and 6 columns (attributes).

In [6]:
df.columns

Index(['Age', 'Delivery number', 'Delivery time', 'Blood of Pressure',
       'Heart Problem', 'Caesarian'],
      dtype='object')

In [7]:
df.describe()

            Age  Delivery number  Delivery time  Blood of Pressure  Heart Problem  Caesarian
count      80.0             80.0           80.0               80.0           80.0       80.0
mean    27.6875           1.6625         0.6375                1.0          0.375      0.575
std    5.017927         0.794662       0.815107           0.711568       0.487177   0.497462
min        17.0              1.0            0.0                0.0            0.0        0.0
25%        25.0              1.0            0.0               0.75            0.0        0.0
50%        27.0              1.0            0.0                1.0            0.0        1.0
75%        32.0              2.0            1.0               1.25            1.0        1.0
max        40.0              4.0            2.0                2.0            1.0        1.0


In [8]:
df.head()

   Age  Delivery number  Delivery time  Blood of Pressure  Heart Problem  Caesarian
0   22                1              0                  2              0          0
1   26                2              0                  1              0          1
2   26                2              1                  1              0          0
3   28                1              0                  2              0          0
4   22                2              0                  1              0          1


Let's look at how many observations we have for each category.

In [8]:
df.groupby('Delivery time').count()

               Age  Delivery number  Blood of Pressure  Heart Problem  Caesarian
Delivery time
0               46               46                 46             46         46
1               17               17                 17             17         17
2               17               17                 17             17         17


The table above shows that 46 of the 80 observations (57.5%) belong to class 0 of Delivery time, while classes 1 and 2 have 17 observations each.
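The class shares can be made explicit with a few lines. This is a minimal sketch that hard-codes the counts from the table above and uses the label names from the category list at the top of the notebook; on the notebook's own `df`, `df['Delivery time'].value_counts(normalize=True)` should give the same proportions.

```python
import pandas as pd

# Class counts copied from the groupby table above.
counts = pd.Series({0: 46, 1: 17, 2: 17}, name="Delivery time")

# Label names taken from the category list at the top of the notebook.
labels = {0: "timely", 1: "premature", 2: "latecomer"}

# Share of each class in the full data set.
proportions = (counts / counts.sum()).rename(index=labels)
print(proportions.round(4))
```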

Split data into train and test

In [9]:
X = df.drop(['Caesarian','Delivery time'], axis=1)
y = df['Delivery time']  # 1-D Series, the shape scikit-learn expects for y

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
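With only 17 observations in each minority class, a purely random split can leave very few of them in the test set. One option, not used in this notebook, is to pass `stratify=` to `train_test_split` so that each class keeps roughly its overall share in both parts. The sketch below uses a synthetic stand-in with the same class counts (46/17/17); the feature values and the `_demo` names are placeholders, not the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: same class counts as the notebook's data, random features.
rng = np.random.default_rng(0)
y_demo = pd.Series([0] * 46 + [1] * 17 + [2] * 17, name="Delivery time")
X_demo = pd.DataFrame({
    "Age": rng.integers(17, 41, size=80),
    "Heart Problem": rng.integers(0, 2, size=80),
})

# stratify=y_demo keeps the class proportions similar in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1, stratify=y_demo)

test_counts = y_te.value_counts().sort_index()
print(test_counts)
```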

In [11]:
from sklearn import neighbors

In [12]:
n_neighbors = 10
knn = neighbors.KNeighborsClassifier(n_neighbors=n_neighbors)

In [13]:
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=10, p=2,
           weights='uniform')

In [14]:
y_pred = knn.predict(X_test)

## Now let's measure accuracy of the classifier

In [15]:
from sklearn import metrics

In [16]:
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.65


In [17]:
from sklearn.metrics import classification_report

In [18]:
print(classification_report(y_test,y_pred))

             precision    recall  f1-score   support

          0       0.71      0.92      0.80        13
          1       0.33      0.33      0.33         3
          2       0.00      0.00      0.00         4

avg / total       0.51      0.65      0.57        20





For one category (class 2), precision is zero: the classifier never predicted class 2 on the test set, so scikit-learn reports its precision as 0. After some research, I found that this can be caused by an imbalanced data set. The following table shows that 46 of the 80 observations belong to class 0, while classes 1 and 2 have only 17 observations each.

In [19]:
df['Delivery time'].value_counts()

0    46
2    17
1    17
Name: Delivery time, dtype: int64
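A confusion matrix makes the zero-precision case easier to see. The sketch below does not use the notebook's actual `y_test` and `y_pred`; the two arrays are illustrative, chosen only so that their per-class support, precision and recall match the classification report above. On the real data, `confusion_matrix(y_test, y_pred)` would be the equivalent call.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Illustrative labels only: built to be consistent with the report above,
# not copied from the notebook's predictions.
y_true_demo = np.array([0] * 13 + [1] * 3 + [2] * 4)
y_pred_demo = np.array([0] * 12 + [1] * 1      # rows whose true class is 0
                       + [0] * 2 + [1] * 1     # rows whose true class is 1
                       + [0] * 3 + [1] * 1)    # rows whose true class is 2

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1, 2])
acc_demo = accuracy_score(y_true_demo, y_pred_demo)

print(cm)
print("Predictions per class:", cm.sum(axis=0))
print("Accuracy:", acc_demo)
```

The last column of the matrix is all zeros: no row is ever assigned to class 2, which is why its precision cannot be computed and is reported as zero.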

Now I try to balance the data. There are a couple of methods for doing that. I use upsampling, which generates more samples for the minority classes by resampling them with replacement.

In [20]:
from sklearn.utils import resample

In [21]:
df_majority = df[df['Delivery time']==0]
df_minority1 = df[df['Delivery time']==1]
df_minority2 = df[df['Delivery time']==2]

In [22]:
df_minority1_upsampled = resample(df_minority1, 
                                 replace=True,     # sample with replacement
                                 n_samples=46,    # to match majority class
                                 random_state=123) # reproducible results

In [23]:
df_minority2_upsampled = resample(df_minority2, 
                                 replace=True,     # sample with replacement
                                 n_samples=46,    # to match majority class
                                 random_state=123)

In [24]:
df_upsampled = pd.concat([df_majority, df_minority1_upsampled, df_minority2_upsampled])

In [25]:
df_upsampled['Delivery time'].value_counts()

2    46
1    46
0    46
Name: Delivery time, dtype: int64

In [26]:
X_upsampled = df_upsampled.drop(['Caesarian','Delivery time'], axis=1)
y_upsampled = df_upsampled['Delivery time']  # 1-D Series, the shape scikit-learn expects for y
X_train_upsampled, X_test_upsampled, y_train_upsampled, y_test_upsampled = train_test_split(
    X_upsampled, y_upsampled, test_size=0.25, random_state=1)
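One caveat with splitting after upsampling: each minority class is resampled from 17 rows to 46, so copies of the same row can end up in both the training and the test part, and test scores may then look better than they would on unseen data. A common alternative is to split first and upsample only the training part. The sketch below illustrates that order on a synthetic stand-in with the same class counts; `upsample_to_majority` and the `_demo` names are hypothetical helpers written for this example, not part of the notebook or of scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample


def upsample_to_majority(frame, label_col, random_state=123):
    """Resample every smaller class (with replacement) up to the largest class size."""
    target_size = frame[label_col].value_counts().max()
    parts = []
    for _, group in frame.groupby(label_col):
        if len(group) < target_size:
            group = resample(group, replace=True, n_samples=target_size,
                             random_state=random_state)
        parts.append(group)
    return pd.concat(parts)


# Synthetic stand-in for df: same class counts (46/17/17), random features.
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "Age": rng.integers(17, 41, size=80),
    "Heart Problem": rng.integers(0, 2, size=80),
    "Delivery time": [0] * 46 + [1] * 17 + [2] * 17,
})

# Split first, so test rows are never touched by upsampling or training.
train_demo, test_demo = train_test_split(
    df_demo, test_size=0.25, random_state=1, stratify=df_demo["Delivery time"])

# Balance only the training part.
train_balanced = upsample_to_majority(train_demo, "Delivery time")

print(train_balanced["Delivery time"].value_counts())
print(test_demo["Delivery time"].value_counts())
```

The test part keeps its original, imbalanced class distribution, so the reported metrics describe how the classifier behaves on data shaped like the real set.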


In [32]:
n_neighbors = 5
knn_upsampled = neighbors.KNeighborsClassifier(n_neighbors=n_neighbors)

In [33]:
knn_upsampled.fit(X_train_upsampled, y_train_upsampled)


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [34]:
y_pred_upsampled = knn_upsampled.predict(X_test_upsampled)

In [35]:
print("Accuracy:",metrics.accuracy_score(y_test_upsampled, y_pred_upsampled))

Accuracy: 0.6


In [36]:
print(classification_report(y_test_upsampled,y_pred_upsampled))

             precision    recall  f1-score   support

          0       0.69      0.60      0.64        15
          1       0.62      0.73      0.67        11
          2       0.44      0.44      0.44         9

avg / total       0.60      0.60      0.60        35



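The results above show that the classifier now has non-zero precision for every category (class). Overall accuracy, however, dropped from 0.65 to 0.60. Note that the two runs also differ in the number of neighbors (10 versus 5), so they are not a like-for-like comparison. Changing classifier parameters such as the distance measure or the number of neighbors may improve the total accuracy.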
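One way to explore those parameters is a cross-validated grid search. The sketch below is an assumption about how this could be set up, not something the notebook ran: it uses synthetic three-class data from `make_classification` in place of the real features, and it adds a `StandardScaler` step, since Age ranges from 17 to 40 while the other inputs stay between 0 and 4 and would otherwise contribute little to the distance. The grid values are arbitrary starting points.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data with 4 features, standing in for the real inputs.
X_demo, y_demo = make_classification(
    n_samples=80, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),      # put all features on a common scale
    ("knn", KNeighborsClassifier()),
])

param_grid = {
    "knn__n_neighbors": [3, 5, 7, 10, 15],
    "knn__p": [1, 2],                 # Minkowski power: 1 = Manhattan, 2 = Euclidean
}

# 5-fold cross-validation over every combination in the grid.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Cross-validated accuracy: %.2f" % search.best_score_)
```

With the notebook's data, `X_demo` and `y_demo` would be replaced by the training features and labels, keeping the test set aside for a final check.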