# BBM467 - Data Intensive Applications
## Data Science Project - SDSP

#### Student no : 21627873
#### Student name : Hüseyin Berk Yılmaz

#### Student no : 21591132
#### Student name : Nezir Turhallı

## Table of Contents

[Purpose](#purpose)   
[Data Understanding](#data_understanding)   
[Data Preparation](#data_preparation)   
[Feature Selection](#feat)   
[Modeling for Classification](#classificationmodel)  
[Evaluation](#evaluation)  
[References](#references)   


## Purpose <a class="anchor" id="purpose"></a>

Some diseases are difficult to diagnose, especially when the same disease presents with different symptoms. Doctors can apply various tests to determine a patient's disease, but these tests cost time and money.

The goal of this project is to build a machine learning model that predicts a patient's disease and to design a web application that makes this model easy to use.

## Data Understanding<a class="anchor" id="data_understanding"></a>

### Import Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score 
from sklearn.metrics import confusion_matrix as cm


In [2]:
df = pd.read_excel("sdsp_patients.xlsx")
df.head()

Unnamed: 0,Disease,Feature_1,Feature_2,Feature_3,Feature_4,Feature_5,Feature_6,Feature_7,Feature_8,Feature_9,...,Feature_41,Feature_42,Feature_43,Feature_44,Feature_45,Feature_46,Feature_47,Feature_48,Feature_49,Feature_50
0,Disease_1,Male,28.0,130.0,96.0,2.0,No,Yes,Yes,No,...,No,No,No,0,No,No,No,No,No,No
1,Disease_1,Male,18.0,95.0,46.0,3.0,Yes,No,No,No,...,No,Yes,No,0,No,No,No,No,No,No
2,Disease_1,Male,44.0,152.0,150.0,1.0,No,Yes,No,Yes,...,Yes,No,No,0,No,No,No,No,No,No
3,Disease_1,Male,19.0,112.0,66.0,18.0,No,No,No,Yes,...,No,Yes,No,0,No,No,No,No,No,No
4,Disease_1,Male,17.5,105.5,54.0,3.0,No,No,Yes,Yes,...,No,No,No,0,No,No,No,No,Yes,No


In the data:
* There are missing values, stored as blank `' '` strings or NaN.
* There are numerical and categorical columns.
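
Blank-string placeholders are counted as non-null by `df.info()`, so they are easy to miss. The following is a minimal sketch of one way to count them per column; the `toy` frame and its column names are hypothetical stand-ins for the real spreadsheet.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real data comes from sdsp_patients.xlsx
toy = pd.DataFrame({
    "Feature_a": ["Yes", " ", "No", "Yes"],
    "Feature_b": [1.0, 2.0, np.nan, 4.0],
})

# Turn single-space placeholders into NaN, then count NaN cells per column
missing_per_column = toy.replace(" ", np.nan).isna().sum()
print(missing_per_column)
```

Columns with a non-zero count are the ones that need imputation.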

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 51 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Disease     400 non-null    object 
 1   Feature_1   400 non-null    object 
 2   Feature_2   400 non-null    float64
 3   Feature_3   400 non-null    object 
 4   Feature_4   400 non-null    float64
 5   Feature_5   400 non-null    float64
 6   Feature_6   400 non-null    object 
 7   Feature_7   400 non-null    object 
 8   Feature_8   400 non-null    object 
 9   Feature_9   400 non-null    object 
 10  Feature_10  400 non-null    object 
 11  Feature_11  400 non-null    object 
 12  Feature_12  400 non-null    object 
 13  Feature_13  400 non-null    object 
 14  Feature_14  400 non-null    object 
 15  Feature_15  400 non-null    object 
 16  Feature_16  400 non-null    object 
 17  Feature_17  400 non-null    object 
 18  Feature_18  400 non-null    object 
 19  Feature_19  400 non-null    o

## Data Preparation<a class="anchor" id="data_preparation"></a>

First, we went through all the columns to handle the missing values in the data. We replaced the `' '` (blank) and NaN values in each column with the mode (most frequent value) of that column.

In [4]:
for col in df.columns:
    # Treat blank strings as missing so they cannot become the mode themselves
    df[col] = df[col].replace(' ', np.nan)
    # mode() returns a Series (ties are possible); take its first entry
    df[col] = df[col].fillna(df[col].mode()[0])
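
The mode-based imputation described above can be illustrated on a single column. This is only a sketch on a hypothetical `col` Series: `Series.mode()` returns a Series rather than a scalar, so the example takes its first entry before filling.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one blank placeholder and one NaN
col = pd.Series(["Yes", "No", "Yes", " ", np.nan])

cleaned = col.replace(" ", np.nan)       # blanks become NaN
most_frequent = cleaned.mode()[0]        # mode() ignores NaN by default
filled = cleaned.fillna(most_frequent)   # fill every gap with the mode
print(filled.tolist())
```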

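We used a label encoder to convert the non-numerical values in the data to numbers. Because the encoder is applied to every column, the numerical columns are also replaced by integer codes.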

In [5]:
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()

# Refit the encoder on each column in turn and keep the integer codes
fit = df.apply(labelencoder.fit_transform)
fit

Unnamed: 0,Disease,Feature_1,Feature_2,Feature_3,Feature_4,Feature_5,Feature_6,Feature_7,Feature_8,Feature_9,...,Feature_41,Feature_42,Feature_43,Feature_44,Feature_45,Feature_46,Feature_47,Feature_48,Feature_49,Feature_50
0,0,1,29,49,33,11,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,0,1,12,10,9,12,1,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,0,1,50,73,55,9,0,1,0,1,...,1,0,0,0,0,0,0,0,0,0
3,0,1,14,26,20,20,0,0,0,1,...,0,1,0,0,0,0,0,0,0,0
4,0,1,11,19,14,12,0,0,1,1,...,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
395,0,0,41,62,53,22,0,0,0,1,...,1,0,0,6,1,0,0,0,0,0
396,0,1,65,91,60,38,0,0,1,0,...,0,0,0,7,1,0,0,0,0,0
397,0,0,47,68,57,30,0,1,0,0,...,0,0,0,7,1,1,1,0,0,0
398,0,1,42,58,53,38,0,1,0,0,...,0,0,0,7,1,0,0,0,0,0
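
Reusing a single `LabelEncoder` for every column means that only the mapping of the last column survives, so earlier columns cannot be decoded later. One possible alternative, sketched here on a hypothetical `toy` frame, keeps one encoder per column in a dictionary (the `encoders` name is an assumption, not part of the notebook) so that predictions can be mapped back to disease names.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame standing in for the patient data
toy = pd.DataFrame({
    "Disease": ["Disease_1", "Disease_2", "Disease_1"],
    "Feature_x": ["No", "Yes", "No"],
})

encoders = {}          # one fitted encoder per column
encoded = toy.copy()
for name in toy.columns:
    encoders[name] = LabelEncoder()
    encoded[name] = encoders[name].fit_transform(toy[name])

# Map the integer codes of the target back to the original labels
decoded = encoders["Disease"].inverse_transform(encoded["Disease"])
print(decoded.tolist())
```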


## Feature Selection<a class="anchor" id="feat"></a>

### Train Test Split
Up to this point, none of the preprocessing steps depended on anything specific to this data set. **Every step taken so far can be applied to any data set. From here on, however, we use the `Disease` column as the target when creating the train and test sets below.**

In [6]:
X = fit.drop(['Disease'], axis=1)
y = fit['Disease']

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=42)
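
As a rough illustration of what this split produces, the sketch below applies the same `test_size` and `random_state` to synthetic data with 400 rows. The `stratify` argument in the second call is not used in the notebook; it is shown only as an optional way to keep class proportions equal in both sets, and the balanced `y_demo` labels are an assumption made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 400 rows, 5 binary features, 4 balanced classes
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(400, 5))
y_demo = np.repeat([0, 1, 2, 3], 100)

# Same parameters as the notebook cell
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
print(X_tr.shape, X_te.shape)

# Optional variant: preserve the class proportions in both sets
_, _, _, y_te_strat = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
print(np.bincount(y_te_strat))
```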

For feature selection, we first trained a model on all features to learn how important each feature is. We then kept only the important features using **SelectFromModel** and created new train and test sets from them.

In [7]:
from sklearn.feature_selection import SelectFromModel

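# Shallow forest on all features; its feature importances drive the selection
rf = RandomForestClassifier(n_estimators=100, max_depth = 3, random_state=42)
rf.fit(X_train, y_train)

# Keep only features whose importance is at least 0.04
# (SelectFromModel refits a clone of rf on the training data)
sfm  = SelectFromModel(rf, threshold=0.04)
sfm.fit(X_train, y_train)

# Names of the selected columns, then the reduced train and test matrices
selected_feat= X_train.columns[(sfm.get_support())]
X_important_train = sfm.transform(X_train)
X_important_test = sfm.transform(X_test)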

### Selected features

In [8]:
selected_feat

Index(['Feature_5', 'Feature_28', 'Feature_29', 'Feature_30', 'Feature_33',
       'Feature_37', 'Feature_39', 'Feature_41', 'Feature_43'],
      dtype='object')
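
To make the role of the threshold more concrete, the sketch below reproduces the selection rule by hand on synthetic data from `make_classification` (an assumption; the patient data is not used here). With a numeric threshold, `SelectFromModel` keeps the features whose importance is greater than or equal to that value.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the patient data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=4, random_state=0
)

sfm_demo = SelectFromModel(
    RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42),
    threshold=0.04,
)
sfm_demo.fit(X_demo, y_demo)

# Importances of the fitted forest; they sum to 1 across all features
importances = sfm_demo.estimator_.feature_importances_

# Manual version of the rule: keep features with importance >= threshold
manual_mask = importances >= 0.04
print(np.flatnonzero(manual_mask))
print(np.flatnonzero(sfm_demo.get_support()))
```

Raising the threshold keeps fewer features, and lowering it keeps more.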

## Modeling for Classification <a class="anchor" id="classificationmodel"></a>

We used the random forest classification algorithm, which trains quickly on a data set of this size and gave highly accurate results here. We created a new model and trained it on the train set that contains only the important features, then predicted the labels of the corresponding test set.

In [9]:
clf_important = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
clf_important.fit(X_important_train, y_train)
y_important_pred = clf_important.predict(X_important_test)

## Evaluation<a class="anchor" id="evaluation"></a>

On the held-out test set, the accuracy of our model is **97.5%**.

In [10]:
accuracy_score(y_test, y_important_pred)

0.975

In [11]:
clf_important.score(X_important_test, y_test)

0.975
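
Accuracy alone does not show which diseases get confused with each other. The notebook imports `confusion_matrix` but does not use it; the following is a minimal sketch of how it relates to accuracy, using hypothetical `y_true` and `y_pred` arrays rather than the real predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for four classes; not the notebook's actual predictions
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_pred = np.array([0, 0, 1, 2, 2, 2, 3, 3])

# Rows are true classes, columns are predicted classes
matrix = confusion_matrix(y_true, y_pred)
print(matrix)

# Correct predictions lie on the diagonal, so accuracy = trace / total
accuracy = np.trace(matrix) / matrix.sum()
print(accuracy, accuracy_score(y_true, y_pred))
```

In the notebook, the same call would take `y_test` and `y_important_pred` as arguments.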

## Save the Model <a class="anchor" id="evaluation"></a>
We used joblib to save the trained model, the selected feature columns, and the disease labels.

In [12]:
import joblib

joblib.dump(clf_important, 'model.pkl')

joblib.dump(df[selected_feat], 'selected_df.pkl')

joblib.dump(df["Disease"], 'disease.pkl')

['disease.pkl']
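
A saved model is read back with `joblib.load`. The sketch below shows a round trip under simplified assumptions: a small hypothetical model and a temporary directory instead of the notebook's files.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Small hypothetical model, only to demonstrate the save/load round trip
X_demo = np.array([[0, 1], [1, 0], [0, 0], [1, 1]])
y_demo = np.array([0, 1, 0, 1])
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    joblib.dump(model, path)        # write the fitted model to disk
    restored = joblib.load(path)    # read it back, e.g. in the web application

# The restored model should give the same predictions as the original
same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same)
```

Loading should happen with a scikit-learn version compatible with the one used for saving.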

### Test our model

In [13]:
x = np.array([35,1,0,0,0,1,0,1,0]).reshape(1,-1)
print(clf_important.predict_proba(x))

[[0.63 0.09 0.28 0.  ]]
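
Each column of the `predict_proba` output corresponds to one entry of the classifier's `classes_` attribute, in the same order. The sketch below pairs the two on a hypothetical model trained with string labels (the data and the `Disease_*` names are made up for the example).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training data with four string-labelled classes
X_demo = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 5)
y_demo = np.array(["Disease_1", "Disease_2", "Disease_3", "Disease_4"] * 5)
clf_demo = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

sample = np.array([[1, 0]])
proba = clf_demo.predict_proba(sample)[0]

# Pair every class label with its probability, most likely first
ranked = sorted(zip(clf_demo.classes_, proba), key=lambda pair: pair[1], reverse=True)
for label, p in ranked:
    print(label, round(float(p), 2))
```

In the notebook the target was label-encoded, so `classes_` holds integer codes that would still need to be mapped back to disease names.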


**We also saved the files to a location where the web application can access them.**

In [None]:
joblib.dump(clf_important, '../back/api/static/model.pkl')

joblib.dump(df[selected_feat], '../back/api/static/selected_df.pkl')

joblib.dump(df["Disease"], '../back/api/static/disease.pkl')

## References<a class="anchor" id="references"></a>

* https://machinelearningmastery.com/save-load-machine-learning-models-python-scikit-learn/
* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
* https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
* https://machinelearningmastery.com/make-predictions-scikit-learn/


**Disclaimer!** <font color='grey'>This notebook was prepared by Hüseyin Berk Yılmaz and Nezir Turhallı as an assignment for the *BBM467 - Data Intensive Applications* class. The notebook is available for educational purposes only. There is no guarantee on the correctness of the content provided as it is a student work.

If you think there is any copyright violation, please let us [know](https://forms.gle/BNNRB2kR8ZHVEREq8). 
</font>