# Overview

This project builds a model that predicts and diagnoses chronic inflammatory respiratory diseases. The model classifies each patient into one of four classes: Chronic Obstructive Pulmonary Disease (COPD), Asthma, Infected, and Healthy Controls (HC). The model is built on the Exasens dataset, downloaded from [UCI ML](https://archive.ics.uci.edu/ml/datasets/Exasens#).
The Exasens dataset includes demographic information on four groups of saliva samples (COPD, Asthma, Infected, HC) collected as part of a joint research project at the Research Center Borstel, BioMaterialBank Nord (Borstel, Germany). A permittivity biosensor, developed at IHP Microelectronics (Frankfurt (Oder), Germany), was used for the dielectric characterization of the saliva samples for classification purposes [[1]](https://www.mdpi.com/2227-9032/7/1/11).

Definitions of the four sample groups in the Exasens dataset:
1. Outpatients and hospitalized patients with COPD without acute respiratory infection (COPD).
2. Outpatients and hospitalized patients with asthma without acute respiratory infections (Asthma).
3. Patients with respiratory infections, but without COPD or asthma (Infected).
4. Healthy controls without COPD, asthma, or any respiratory infection (HC).

The dataset contains 7 attributes:

1. Diagnosis (COPD-HC-Asthma-Infected)
2. ID
3. Age
4. Gender (1=male, 0=female)
5. Smoking Status (1=Non-smoker, 2=Ex-smoker, 3=Active-smoker)
6. Saliva Permittivity - Imaginary part (Min(Δ) = absolute minimum value, Avg.(Δ) = average)
7. Saliva Permittivity - Real part (Min(Δ) = absolute minimum value, Avg.(Δ) = average)
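
The integer codes for Gender and Smoking Status listed above can be mapped back to readable labels when inspecting the data. The following is a minimal sketch: the helper `decode_categories`, the new column names `Gender Label` and `Smoking Label`, and the toy rows are hypothetical, and the `Gender` and `Smoking` column names are assumed to match the loaded CSV.

```python
import pandas as pd

# Code-to-label mappings taken from the attribute list above.
GENDER_LABELS = {1: "male", 0: "female"}
SMOKING_LABELS = {1: "Non-smoker", 2: "Ex-smoker", 3: "Active-smoker"}


def decode_categories(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with readable Gender and Smoking label columns added.

    Hypothetical helper for inspection only; the numeric columns are kept.
    """
    out = df.copy()
    out["Gender Label"] = out["Gender"].map(GENDER_LABELS)
    out["Smoking Label"] = out["Smoking"].map(SMOKING_LABELS)
    return out


# Toy rows for illustration only, not taken from the real file.
toy = pd.DataFrame({"Gender": [1, 0, 1], "Smoking": [2, 3, 1]})
decoded = decode_categories(toy)
print(decoded)
```

The numeric columns stay in place, so the decoded frame can still be passed to a model after dropping the two label columns.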


## Clean Data

In [1]:
# import libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, classification_report

In [2]:
# Read dataset file 
df = pd.read_csv('Exasens_dataset.csv')
df.head()

|   | Diagnosis | ID | Min Imaginary Part | Avg Imaginary Part | Min Real Part | Avg Real Part | Gender | Age | Smoking |
|---|---|---|---|---|---|---|---|---|---|
| 0 | COPD | 301-4 | -320.61 | -300.563531 | -495.26 | -464.171991 | 1 | 77 | 2 |
| 1 | COPD | 302-3 | -325.39 | -314.75036 | -473.73 | -469.26314 | 0 | 72 | 2 |
| 2 | COPD | 303-3 | -323.0 | -317.436056 | -476.12 | -471.897667 | 1 | 73 | 3 |
| 3 | COPD | 304-4 | -327.78 | -317.39967 | -473.73 | -468.856388 | 1 | 76 | 2 |
| 4 | COPD | 305-4 | -325.39 | -316.155785 | -478.52 | -472.869783 | 0 | 65 | 2 |


In [3]:
# Check null values
df.isnull().sum()

Diagnosis               0
ID                      0
Min Imaginary Part    299
Avg Imaginary Part    299
Min Real Part         299
Avg Real Part         299
Gender                  0
Age                     0
Smoking                 0
dtype: int64

In [4]:
# Remove null values
df = df.dropna()
df.isnull().sum()

Diagnosis             0
ID                    0
Min Imaginary Part    0
Avg Imaginary Part    0
Min Real Part         0
Avg Real Part         0
Gender                0
Age                   0
Smoking               0
dtype: int64
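
Dropping every row with a missing permittivity value removes a large share of the data, so it can help to check how many rows each diagnosis class loses. The sketch below is one way to do that; the helper `class_counts_before_after` and the toy frame are hypothetical, and only the `Diagnosis` column name comes from the dataset.

```python
import numpy as np
import pandas as pd


def class_counts_before_after(df: pd.DataFrame, label_col: str = "Diagnosis") -> pd.DataFrame:
    """Compare per-class row counts before and after dropping rows with nulls."""
    before = df[label_col].value_counts()
    after = df.dropna()[label_col].value_counts()
    # Classes that vanish entirely after dropna show up as NaN, so fill with 0.
    summary = pd.DataFrame({"before": before, "after": after}).fillna(0).astype(int)
    summary["dropped"] = summary["before"] - summary["after"]
    return summary


# Toy frame for illustration only; the values are made up.
toy = pd.DataFrame({
    "Diagnosis": ["COPD", "COPD", "HC", "HC", "Asthma", "Infected"],
    "Avg Real Part": [-464.2, np.nan, -470.1, -468.9, np.nan, np.nan],
})
summary = class_counts_before_after(toy)
print(summary)
```

On the real data, calling `class_counts_before_after(df)` before the `dropna()` step would show whether any class is left with too few samples to train on.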

## Create Machine Learning Model

In [5]:
# Determine label and features
label = df['Diagnosis']
features = df.drop(['ID', 'Diagnosis'], axis=1)

In [6]:
# Split data into training and testing 
X_train, X_test, y_train, y_test = train_test_split(features, label, test_size = 0.2)

# Create Gradient Boosting classification model
clf = GradientBoostingClassifier()

# Train the model
clf.fit(X_train, y_train)

# Test the model
prediction = clf.predict(X_test)
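
The split above is random and unseeded, so the class mix in the test set changes from run to run, and small classes can be missing from it entirely. A possible variant, shown here as a sketch rather than the method used in this notebook, passes `stratify` and `random_state` to `train_test_split` so that each class keeps its proportion and the split is reproducible. The data below is a hypothetical stand-in with made-up class sizes and feature names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the cleaned data: 100 rows, imbalanced labels.
rng = np.random.default_rng(0)
labels = pd.Series(
    ["COPD"] * 40 + ["HC"] * 40 + ["Asthma"] * 10 + ["Infected"] * 10,
    name="Diagnosis",
)
features = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])

# stratify keeps the class proportions in both splits;
# random_state makes the split reproducible across runs.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=42
)
print(y_test.value_counts())
```

With a 20% test size, every class in this toy example contributes one fifth of its rows to the test set.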


In [7]:
# Print classification report 
class_report = classification_report(y_test, prediction)
print('Classification Report')
print(class_report) 

print()

# Print confusion matrix
conf_matrix = confusion_matrix(y_test, prediction)
print('Confusion Matrix') 
print(conf_matrix) 

Classification Report
              precision    recall  f1-score   support

      Asthma       0.00      0.00      0.00         1
        COPD       0.75      0.86      0.80         7
          HC       0.83      0.83      0.83        12

    accuracy                           0.80        20
   macro avg       0.53      0.56      0.54        20
weighted avg       0.76      0.80      0.78        20


Confusion Matrix
[[ 0  0  1]
 [ 0  6  1]
 [ 0  2 10]]
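
The report above lists only three classes because the test split contains no Infected samples, and Asthma is never predicted, which leaves its precision undefined. As a sketch of one way to make the output cover all four classes, `classification_report` and `confusion_matrix` both accept a `labels` argument, and `zero_division=0` sets undefined metrics to 0 explicitly. The `ALL_CLASSES` list and the toy labels and predictions below are hypothetical.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Fixed class order so that every report and matrix has the same layout.
ALL_CLASSES = ["Asthma", "COPD", "HC", "Infected"]

# Toy labels and predictions for illustration only.
y_true = ["Asthma", "COPD", "COPD", "HC", "HC", "HC"]
y_pred = ["HC", "COPD", "HC", "HC", "HC", "COPD"]

# zero_division=0 reports 0.0 for metrics whose denominator is zero.
report = classification_report(y_true, y_pred, labels=ALL_CLASSES, zero_division=0)

# Rows are true classes and columns are predicted classes, in ALL_CLASSES order.
matrix = confusion_matrix(y_true, y_pred, labels=ALL_CLASSES)

print(report)
print(matrix)
```

Fixing the label order also makes confusion matrices from different runs directly comparable, since a class that is absent from one test split still keeps its row and column.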


