# Kidney Disease Prediction - Problem statement 12 - Group 186

| Group No | Name | Student Email id | % Contribution |
|---|---|---|---|
| 186 | Sivarajan N | 2021FC04989@wilp.bits-pilani.ac.in | Equal (100%) |
| 186 | Sindhu C | 2021FC04993@wilp.bits-pilani.ac.in | Equal (100%) |
| 186 | Manibalan S | 2021fc04442@wilp.bits-pilani.ac.in | None (0%) |

The purpose of this notebook is to get you started to solve this problem.

The main requirements, which follow a standard data science project, are listed below:
* Presteps
1. Download Dataset from Google Drive - https://drive.google.com/file/d/1NykVFA1f5oGXZ5JlrBGXPBJrnfRncREh/view?usp=sharing
2. Import the required libraries

* I - Data Visualization & Exploration
	1. Print 2 rows for sanity check to identify all the features present in the dataset and if the target matches with them. 
	2. Comment on class imbalance with appropriate visualization method. 
	3. Provide appropriate visualizations to get an insight about the dataset. 
	4. Do the correlational analysis on the dataset. Provide a visualization for the same. Justify the answer by answering this - Will this correlational analysis have effect on feature selection that we will perform in the next step? 

* II - Data Preprocessing, Feature Engineering & Cleaning
	1. Do the appropriate pre-processing of the data like identifying NULL or Missing Values if any, handling of outliers if present in the dataset, skewed data etc. Mention the pre-processing steps performed in the markdown cell. 
	2. Apply appropriate feature engineering techniques for them. Apply the feature transformation techniques like Standardization, Normalization, etc. Apply the appropriate transformations depending upon the structure and the complexity of the dataset. Provide proper justification.

* III - Model Building
	1. Split the dataset into training and test sets. Justify the choice of split. Experiment with different split to get the final split. Justify the method chosen. 
	2. Build the model using Logistic Regression with penalty = l1 and l2, C = [1,0.5,0.1,0.01,0.003]. Identify the best parameter and justify your answer.

* IV - Validation, Performance Evaluation, Testing
	1. Do the prediction for the test data and display the results for the inference. Calculate all the evaluation metrics and choose best for the model chosen. 
	2. Comment on under fitting/overfitting/just right model. Justify the chosen model
 


## Presteps
    - Data is downloaded and saved in the same folder as this python notebook.
    - Let's start by importing common data science packages/libraries

In [13]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # data visualization advanced

## I - Data Visualization & Exploration
	1. Let's print the first 2 rows for a sanity check
	2. Comment on class imbalance with appropriate visualization method. 
	3. Provide appropriate visualizations to get an insight about the dataset. 
	4. Do the correlational analysis on the dataset. Provide a visualization for the same. Justify the answer by answering this - Will this correlational analysis have effect on feature selection that we will perform in the next step? 


# Loading the Data

In [14]:
df_dataset = pd.read_csv('kidney_disease.csv')

In [15]:
df_dataset.head(2)

Unnamed: 0,id,age,bp,sg,al,su,rbc,pc,pcc,ba,...,pcv,wc,rc,htn,dm,cad,appet,pe,ane,classification
0,0,48.0,80.0,1.02,1.0,0.0,,normal,notpresent,notpresent,...,44,7800,5.2,yes,yes,no,good,no,no,ckd
1,1,7.0,50.0,1.02,4.0,0.0,,normal,notpresent,notpresent,...,38,6000,,no,no,no,good,no,no,ckd


- The data has 26 columns, including id and classification.
- All the features are present in the dataset and the target column (classification) matches them.

#### 1. How many features are there? What are their datatypes?

In [16]:
df_dataset.shape

(400, 26)

There are 24 features, one id column and the target column (classification).

In [17]:
df_dataset.dtypes.value_counts()

object     14
float64    11
int64       1
dtype: int64

In [18]:
df_dataset.select_dtypes('int64').head(3)

Unnamed: 0,id
0,0
1,1
2,2


In [19]:
df_dataset.select_dtypes('object').head(3)

Unnamed: 0,rbc,pc,pcc,ba,pcv,wc,rc,htn,dm,cad,appet,pe,ane,classification
0,,normal,notpresent,notpresent,44,7800,5.2,yes,yes,no,good,no,no,ckd
1,,normal,notpresent,notpresent,38,6000,,no,no,no,good,no,no,ckd
2,normal,normal,notpresent,notpresent,31,7500,,no,yes,no,poor,no,yes,ckd


So there are 14 object columns, 11 float64 columns and 1 int64 column. \
The integer column (id) isn't relevant for predictions. \
One of the object columns is classification (the target) that is to be predicted. \
Several object columns (pcv, wc, rc) hold numeric values stored as text, so they need conversion before modelling; the remaining object columns are categorical.
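
Numeric values that were read as text can be converted with `pd.to_numeric`. The sketch below uses a small hypothetical frame that imitates the pcv, wc and rc columns shown in the output above; the values and the choice of `errors="coerce"` are assumptions for illustration, not the notebook's actual preprocessing.

```python
import pandas as pd

# Hypothetical rows that imitate the object-typed numeric columns shown above
toy = pd.DataFrame({
    "pcv": ["44", "38", "31"],
    "wc": ["7800", "6000", "7500"],
    "rc": ["5.2", None, None],
    "htn": ["yes", "no", "no"],
})

numeric_like = ["pcv", "wc", "rc"]
# errors="coerce" turns anything that cannot be parsed into NaN instead of raising
toy[numeric_like] = toy[numeric_like].apply(pd.to_numeric, errors="coerce")

print(toy.dtypes)
```

Coercion makes unparseable entries visible as missing values, which can then be handled together with the other gaps.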

#### 2. Summarize the features

In [20]:
import re

Looking at the names of the features, it seems that these are medical data from India over a 2-month period. \
The column names are short clinical abbreviations such as age, bp, sg, al and su. \
None of the columns shown above starts with a small 't' or 'f', so the time-domain/frequency-domain grouping below places them in the 'other' group.

In [21]:
columns = df_dataset.columns.to_numpy()

In [22]:
time_feats = []
time_func = set()
freq_feats = []
freq_func = set()
other_feats = []

n_time = 0
n_freq = 0
n_other = 0

regex_func = re.compile('-([a-z]+)')

# Loop over the actual column names; a hard-coded range(563) overruns the 26 columns
for col in columns:
    func = regex_func.findall(col)  # empty list when the name has no '-func' part
    if col.startswith('t'):
        time_feats.append(col)
        time_func.update(func[:1])
        n_time += 1
    elif col.startswith('f'):
        freq_feats.append(col)
        freq_func.update(func[:1])
        n_freq += 1
    else:
        other_feats.append(col)
        n_other += 1

In [None]:
print('Time features:',sorted(time_func))
print('Frequency features:',sorted(freq_func))

In [None]:
print('Other features:',sorted(other_feats))

In [None]:
n_time, n_freq, n_other, n_time + n_freq + n_other

#### 3. Is the target balanced? Let's check with appropriate visualization method.

In [None]:
df_dataset['classification'].value_counts()  # the target column of this dataset is 'classification'

In [None]:
chart = sns.countplot(x='classification', data=df_dataset)
chart.tick_params(axis='x', rotation=25)  # rotate the class labels for readability

#### Class Imbalance: 
    - The data seems more or less balanced; the class counts and the count plot above are the evidence to check this against.
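
Class balance can also be expressed as numbers rather than read off a plot. The sketch below uses hypothetical labels (not the real class counts of this dataset) and an illustrative `imbalance_ratio` variable to show one way of quantifying it.

```python
import pandas as pd

# Hypothetical labels, not the real class counts of the dataset
labels = pd.Series(["ckd"] * 6 + ["notckd"] * 4, name="classification")

counts = labels.value_counts()
shares = labels.value_counts(normalize=True)
# Ratio of the largest to the smallest class; 1.0 means perfectly balanced
imbalance_ratio = counts.max() / counts.min()

print(shares.round(2).to_dict(), round(imbalance_ratio, 2))
```

A ratio close to 1 supports calling the target balanced; a clearly larger ratio suggests stratified splitting and metrics beyond plain accuracy.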

#### 4. Visualization for correlational analysis: 
- The simplest way to visualize correlation is to create a scatter plot of the two variables. 

In [None]:
plt.scatter(df_dataset['age'],df_dataset['bp'])  # two numeric columns that exist in this dataset
plt.xlabel('age')
plt.ylabel('bp')
plt.show()

### Justification:
    - The scatter plot above shows how the two plotted variables relate to each other; class balance is read from the count plot instead. This correlational analysis will not have an effect on the feature selection in the next step.
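
Whether correlation could matter for feature selection can also be checked numerically. The sketch below builds a hypothetical frame with one nearly duplicated feature and lists the pairs whose absolute correlation exceeds an assumed cut-off of 0.9; the column names, data and threshold are all illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "feat_a": a,
    "feat_b": a * 2.0 + rng.normal(scale=0.1, size=200),  # nearly a copy of feat_a
    "feat_c": rng.normal(size=200),                        # unrelated feature
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair is listed once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
threshold = 0.9  # assumed cut-off
high_pairs = [(r, c) for r in upper.index for c in upper.columns
              if upper.loc[r, c] > threshold]

print(high_pairs)
```

An empty list would support the view that correlation does not drive feature selection; any listed pair is a candidate for dropping one of its two features.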

## II - Data Preprocessing, Feature Engineering & Cleaning

#### 1. Data Preprocessing steps - null check, missing value check, handling outliers, checking skewed data.

In [None]:
df_dataset.isnull().sum()

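The first rows printed earlier already show empty entries (for example in rbc and rc), so this check is expected to report non-zero counts for several columns. These missing values need to be imputed or dropped before modelling.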
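
A minimal sketch of counting and filling missing values follows. The frame is hypothetical, and the median (numeric) and mode (categorical) fill strategy is an assumption rather than the treatment chosen in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in one numeric and one categorical column
toy = pd.DataFrame({
    "age": [48.0, 7.0, np.nan, 62.0],
    "rbc": [np.nan, np.nan, "normal", "normal"],
})

missing_per_column = toy.isnull().sum()

filled = toy.copy()
# Numeric column: fill with the median, which is robust to extreme values
filled["age"] = filled["age"].fillna(filled["age"].median())
# Categorical column: fill with the most frequent category
filled["rbc"] = filled["rbc"].fillna(filled["rbc"].mode()[0])

print(missing_per_column.to_dict())
print(filled)
```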

### Detecting outliers using the Z-scores and Boxplot methods

In [None]:
def detect_outliers_zscore(data, thres=3):
    # Return the values whose absolute Z-score exceeds the threshold
    data = data.dropna()  # NaN would otherwise turn the mean and std into NaN
    mean = np.mean(data)
    std = np.std(data)
    return [i for i in data if np.abs((i - mean) / std) > thres]

# Driver code ('age' is used here as an example numeric column)
data_outliers = detect_outliers_zscore(df_dataset['age'])
print("Outliers from Z-scores method: ", data_outliers)

In [None]:
plt.boxplot(df_dataset['age'].dropna(), vert=False)  # 'age' as an example numeric column
plt.title("Detecting outliers using Boxplot")
plt.xlabel('age')

Any values listed by the Z-score method, or drawn beyond the whiskers of the boxplot, are outlier candidates for the column that was checked.
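
The boxplot rule can also be applied numerically. The sketch below implements the common 1.5 × IQR rule in a hypothetical helper `iqr_outliers`; the sample values are made up for illustration.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]        # ignore missing entries
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

sample = [50, 52, 48, 51, 49, 53, 47, 120]    # made-up readings with one extreme value
print(iqr_outliers(sample))
```

Unlike the Z-score, this rule does not depend on the mean and standard deviation, which are themselves pulled by extreme values.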

The following Data Pre-processing steps were done for this exercise:
 - identifying NULL or Missing Values if any
 - handling of outliers if present in the dataset

#### 2. Feature Engineering - Which features are important?

In [None]:
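# Encode the target: 'ckd' appears in the rows above, 'notckd' is the assumed label of the other class
act_map = {'ckd':1, 'notckd':0}
# 'activity_code' holds the encoded target; strip() removes stray whitespace, if any
df_dataset['activity_code'] = df_dataset['classification'].str.strip().map(act_map)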

In [None]:
df_dataset['activity_code'].value_counts()

In [None]:
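fig, ax = plt.subplots(1,1, figsize=(15,15))
# Correlate the numeric features with each other and with the encoded target; 'id' is only a row label
sns.heatmap(df_dataset.select_dtypes('number').drop(columns=['id']).corr(), 
            cmap=sns.diverging_palette(240, 10, n=25), 
            cbar=True,ax=ax)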

In [None]:
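fig, ax = plt.subplots(1,1, figsize=(4,10))
# Correlation of each numeric feature with the encoded target only
target_corr = df_dataset.select_dtypes('number').drop(columns=['id']).corr()[['activity_code']]
sns.heatmap(target_corr.drop(index='activity_code'), 
            cmap=sns.diverging_palette(240, 10, n=25), 
            cbar=True,ax=ax)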

# III - Model Building

### Split the dataset into training and test sets:
    - Justify the choice of split. Experiment with different split to get the final split. Justify the method chosen.

In [None]:
from sklearn.model_selection import train_test_split, KFold, GridSearchCV

Let's come up with a quick base model before further exploring the data. \
Firstly, let's remove id and classification from the data. \
And then split the data into train and test data sets.

In [None]:
df_dataset.drop(columns=['id','classification'], inplace=True)  # id is a row label, classification is the raw target

In [None]:
y = df_dataset.pop('activity_code')
X = df_dataset

In [None]:
X.shape, y.shape

### Justification - Let's use an 80:20 split
- The dataset has only 400 rows, so an 80:20 split keeps 320 rows for training and leaves 80 rows for testing; a 70:30 split would shrink the training set to 280 rows.
- With a small dataset like ours, 80:20 is a good starting point. Overall, we need to make sure that the test set represents most of the variance in the dataset. We can ensure this by stratifying on the target and by trying different amounts of test data.
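
Different split ratios can be compared directly. The sketch below uses a hypothetical stand-in with 400 rows (the row count reported by `df_dataset.shape`) and an assumed class mix; only the resulting sizes and class shares are of interest here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in with 400 rows; the class mix is assumed for illustration
X_toy = np.arange(400 * 3).reshape(400, 3)
y_toy = np.array([1] * 240 + [0] * 160)

split_sizes = {}
for test_size in (0.2, 0.3):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_toy, y_toy, test_size=test_size, stratify=y_toy, random_state=0)
    # Record train size, test size and the share of class 1 in the test set
    split_sizes[test_size] = (len(X_tr), len(X_te), y_te.mean())

print(split_sizes)
```

Stratification keeps the class share in the test set equal to that of the full data for both ratios, so the choice between them comes down to how many rows are left for training.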

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

In [None]:
X_train.shape, y_train.shape

In [None]:
X_test.shape, y_test.shape

### Logistic Regression - Identify the best parameter: 
    - Build the model using Logistic Regression with penalty = l1 and l2, C = [1,0.5,0.1,0.01,0.003]. Identify the best parameter and justify your answer.
    
    - Penalized logistic regression imposes a penalty to the logistic model for having too many variables. This results in shrinking the coefficients of the less contributive variables toward zero. This is also known as regularization.
    
    - In Machine Learning, L1 tends to shrink some coefficients exactly to zero, whereas L2 tends to shrink all coefficients evenly without zeroing them. L1 is therefore useful for feature selection, as we can drop any variables whose coefficients go to zero. L2, on the other hand, is useful when you have collinear/codependent features. For our case, we use L1 since we need to drop variables.
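
The difference between the two penalties can be seen by counting zero coefficients. The sketch below fits both on synthetic data from `make_classification`; the data, the value `C=0.1` and the `zero_counts` dictionary are assumptions for illustration and say nothing about the kidney dataset itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, of which only 4 carry signal
X_syn, y_syn = make_classification(n_samples=300, n_features=20, n_informative=4,
                                   n_redundant=0, random_state=0)
X_syn = StandardScaler().fit_transform(X_syn)

zero_counts = {}
for penalty in ("l1", "l2"):
    clf = LogisticRegression(penalty=penalty, C=0.1, solver="liblinear")
    clf.fit(X_syn, y_syn)
    # Count coefficients that were shrunk exactly to zero
    zero_counts[penalty] = int(np.sum(clf.coef_ == 0))

print(zero_counts)
```

Features whose L1 coefficient is zero are the ones the model effectively dropped.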

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn import preprocessing

from sklearn.metrics import classification_report

Let's do feature standardization before training the models

In [None]:
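std_scaler = StandardScaler()
# Fit the encoder on the training data only, so train and test share one category-to-code mapping
enc = preprocessing.OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
cat_cols = X_train.select_dtypes('object').columns
X_train, X_test = X_train.copy(), X_test.copy()
X_train[cat_cols] = enc.fit_transform(X_train[cat_cols].astype(str)) # this is needed to avoid string transformation
X_test[cat_cols] = enc.transform(X_test[cat_cols].astype(str))
# Fill the remaining numeric gaps with training medians; LogisticRegression does not accept NaN
X_train, X_test = X_train.astype(float), X_test.astype(float)
medians = X_train.median()
X_train, X_test = X_train.fillna(medians), X_test.fillna(medians)

X_prep_train = std_scaler.fit_transform(X_train)
X_prep_test = std_scaler.transform(X_test)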

### Justification - Feature transformation techniques using Standardization
    - Standardization is a scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.
    - Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1.
    - The appropriate transformation depending upon the structure and the complexity of our dataset is Standardization. Unlike normalization, standardization does not have a bounding range.
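
The two scaling techniques can be compared on a tiny example. The values below are made up and include one extreme entry to show how each scaler reacts to it.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up single feature with one extreme value
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standardized = StandardScaler().fit_transform(values)  # zero mean, unit standard deviation
normalized = MinMaxScaler().fit_transform(values)      # rescaled into [0, 1]

print(standardized.ravel().round(3))
print(normalized.ravel().round(3))
```

Min-max scaling squeezes the four ordinary values close to 0 because of the extreme one, while standardization has no fixed bounds.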
    


## Training the models

In [None]:
# Perform GridSearchCV to tune best-fit LR model
param = {'C': [1,0.5,0.1,0.01,0.003]}

LR_clf = LogisticRegression(penalty='l1', solver='liblinear') # liblinear supports logistic regression (LR) L1-regularized classifiers 
gs_model = GridSearchCV(estimator=LR_clf, param_grid=param)
gs_model.fit(X_prep_train, y_train)

y_pred = gs_model.predict(X_prep_train)
print(classification_report(y_train, y_pred))

**WOW** - that was a good Classification.
### Justification:
- Precision: the fraction of positive predictions that are correct, is perfect 1.00
- Recall: the fraction of actual positives that were correctly identified, is also 1.00
- F1 score: the harmonic mean of precision and recall, is perfect 1.00
- Support - is the number of actual occurrences of the class in the specified dataset. 
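
The definitions above can be worked through by hand. The confusion counts below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical confusion counts: true/false positives and negatives
tp, fp, fn, tn = 45, 5, 3, 27

precision = tp / (tp + fp)                          # correct share of positive predictions
recall = tp / (tp + fn)                             # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
accuracy = (tp + tn) / (tp + fp + fn + tn)
support_positive = tp + fn                          # actual occurrences of the positive class

print(precision, recall, round(f1, 4), accuracy, support_positive)
```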

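Model accuracy on the training data is 1.00, the highest value of each respective metric seen so far. Because these figures are computed on the same data the model was fitted on, they do not yet show how well it generalizes.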

Let's check for validation data

In [None]:
from sklearn.model_selection import cross_val_score, KFold

In [None]:
kfold = KFold(n_splits=5)

In [None]:
scores = cross_val_score(LR_clf, X_prep_train, y_train, scoring='accuracy', cv=kfold)
print('Scores:',scores)
print('Mean:',np.mean(scores))
print('Std:',np.std(scores))
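
One variation on the cross-validation above is to put the scaler and the model into a single pipeline, so that scaling is re-fitted inside each fold, and to use stratified folds. The sketch below runs on synthetic data; the data, `C=0.5` and the fold settings are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data
X_syn, y_syn = make_classification(n_samples=200, n_features=10, random_state=0)

# The scaler is fitted on the training part of each fold only
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", C=0.5, solver="liblinear"),
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(pipe, X_syn, y_syn, scoring="accuracy", cv=cv)

print("Scores:", cv_scores)
print("Mean:", cv_scores.mean(), "Std:", cv_scores.std())
```

A small standard deviation across folds indicates that the score does not depend much on which rows end up in the validation fold.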

**Amazing** \
How about on the test set?

In [None]:
# Predict with the model fitted on the training data; refitting on the test set would leak its labels
y_pred = gs_model.predict(X_prep_test)

print(classification_report(y_test, y_pred))

**Superb**, results on test data are consistent with train data.
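
One way to cover both penalties named in the requirement, and to score held-out data with a model that was fitted on training data only, is sketched below. It runs on synthetic data; names such as `search` and `test_pred` are illustrative, and this is not the notebook's executed result.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared features and target
X_syn, y_syn = make_classification(n_samples=400, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0)

scaler = StandardScaler().fit(X_tr)  # scaling statistics come from the training data

# liblinear supports both the l1 and the l2 penalty
param_grid = {"penalty": ["l1", "l2"], "C": [1, 0.5, 0.1, 0.01, 0.003]}
search = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid,
                      cv=5, scoring="accuracy")
search.fit(scaler.transform(X_tr), y_tr)            # fitted on training data only

test_pred = search.predict(scaler.transform(X_te))  # no refit on the test data
print(search.best_params_)
print(classification_report(y_te, test_pred))
```

`search.best_params_` reports the penalty and `C` with the best mean cross-validated accuracy, which gives a concrete basis for justifying the chosen parameters.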

# Testing

Given the excellent results here, let's try to submit a solution and see what score we get on the test set.

In [None]:
# No separate test file is provided, so the held-out test split serves as the final test set
X_final_test = X_test

In [None]:
X_final_prep_test = X_prep_test  # already encoded and scaled with the training-set statistics

In [None]:
y_final_pred = gs_model.predict(X_final_prep_test)

In [None]:
y_final_pred

In [None]:
# Map the numeric codes back to their labels by inverting act_map
rev_act_map = {code: label for label, code in act_map.items()}
y_final = [rev_act_map[code] for code in y_final_pred]

In [None]:
submission = pd.DataFrame({
        "Id": range(1,len(y_final)+1),
        "classification": y_final
    })

submission.to_csv('lr_sub.csv',index=False)

In [None]:
submission.head()

### Justification - under fitting/overfitting/just right model
Ideally, you want to select a model at the sweet spot between underfitting and overfitting.

There are three main methods to avoid overfitting: \
1- Keep the model simpler: reduce variance by taking into account fewer variables and parameters, thereby removing some of the noise in the training data. \
2- Use cross-validation techniques such as k-folds cross-validation. \
3- Use regularization techniques such as L1 (Lasso) or L2 (Ridge) penalties that shrink model parameters if they're likely to cause overfitting.

In our case, we have followed all the three methods above by using fewer variables, validating using k-folds and regularization of parameters.
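
A rough way to place a model between underfitting and overfitting is to compare its training and test scores. The helper below is a hypothetical rule of thumb; the function name and both thresholds are assumptions, not established cut-offs.

```python
def diagnose_fit(train_score, test_score, low=0.7, gap=0.1):
    """Classify a model by its train and test scores (assumed thresholds)."""
    if train_score < low and test_score < low:
        return "underfitting"    # poor on both sets: the model is too simple
    if train_score - test_score > gap:
        return "overfitting"     # much better on training data than on unseen data
    return "reasonable fit"      # both scores are acceptable and close together

print(diagnose_fit(0.99, 0.80))
print(diagnose_fit(0.60, 0.58))
print(diagnose_fit(0.95, 0.93))
```

The check is only meaningful when the test score comes from data the model never saw during fitting.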