<b><font size="6">Predictive Project Example</font></b><br><br>

<img src="image/semma.png">

# <font color='#BFD72F'>Contents</font> <a class="anchor" id="toc"></a>

* [1. Import the data](#import)
* [2. Explore the data](#explore)
* [3. Prepare the data](#prepare)
    * [3. Feature Engineering](#feateng)
    * [4. Feature Selection](#feateng)
* [5. Model Assessment](#assess)
* [6. Predictions](#pred)


# 1. Import the data <a class="anchor" id="import"></a>
[Back to Contents](#toc)

<a class="anchor" id="2nd-bullet">

### 1.1. Import the needed libraries
    
</a>

__`Step 1`__ Import all the needed packages.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Feature Selection
from sklearn.feature_selection import RFE
import scipy.stats as stats
from scipy.stats import chi2_contingency
# Scaling
from sklearn.preprocessing import MinMaxScaler
# Models
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
# Model Assessment
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score

import warnings
warnings.filterwarnings('ignore')

__`Step 2`__ Import the dataset.

In [2]:
data = pd.read_excel('data/heart.xlsx')
data

Unnamed: 0.1,Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,402,70.0,1.0,1,156,245,0,0,143,0,0.0,2,0,2,1
1,636,59.0,0.0,0,174,249,0,1,143,1,0.0,1,0,2,0
2,416,,1.0,2,125,273,0,0,152,0,0.5,0,1,2,1
3,142,61.0,1.0,3,134,234,0,1,145,0,2.6,1,2,2,0
4,288,58.0,0.0,2,120,340,0,1,172,0,0.0,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
712,427,57.0,1.0,2,150,168,0,1,174,0,1.6,2,0,2,1
713,343,52.0,1.0,2,172,199,1,1,162,0,0.5,2,0,3,1
714,609,,0.0,0,180,327,0,2,117,1,3.4,1,0,2,0
715,28,56.0,1.0,2,130,256,1,0,142,1,0.6,1,1,1,0


__GOAL__: predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.

`age` patient's age in years<br>
`sex` patient's gender (1 = male; 0 = female)<br>
`cp` chest pain type (4 values)<br>
`trestbps` resting blood pressure (in mm Hg on admission to the hospital)<br>
`chol` serum cholestoral in mg/dl<br>
`fbs` fasting blood sugar > 120 mg/dl (1 = true; 0 = false)<br>
`restecg` resting electrocardiographic results (values 0,1,2)<br>
`thalach` maximum heart rate achieved<br>
`exang` exercise induced angina (1 = yes; 0 = no)<br>
`oldpeak` ST depression induced by exercise relative to rest<br>
`slope` the slope of the peak exercise ST segment<br>
`ca` number of major vessels (0-3) colored by flourosopy<br>
`thal` 0 = normal; 1 = fixed defect; 2 = reversable defect<br>
`target` refers to the presence of heart disease in the patient<br>

__`Step 3`__ - Define the independent variables as __X__ and the dependent variable as __y__. 

In [3]:
X = data.drop(columns=['target'])
y = data['target']

# 2. Explore the data <a class="anchor" id="explore"></a>
[Back to Contents](#toc)

__`Step 4`__ Explore and get insights from the data.

In [4]:
# small exploration on the data
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 717 entries, 0 to 716
Data columns (total 15 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  717 non-null    int64  
 1   age         705 non-null    float64
 2   sex         713 non-null    float64
 3   cp          717 non-null    int64  
 4   trestbps    717 non-null    int64  
 5   chol        717 non-null    int64  
 6   fbs         717 non-null    int64  
 7   restecg     717 non-null    int64  
 8   thalach     717 non-null    int64  
 9   exang       717 non-null    int64  
 10  oldpeak     717 non-null    float64
 11  slope       717 non-null    int64  
 12  ca          717 non-null    int64  
 13  thal        717 non-null    int64  
 14  target      717 non-null    int64  
dtypes: float64(3), int64(12)
memory usage: 84.1 KB


__`Step 5`__ Separate the numerical from the categorical variables. 

In [5]:
num_vars = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'slope', 'cp']
cat_vars = ['restecg', 'ca', 'thal', 'sex', 'fbs', 'exang']

Data Types for Independent Variables:
- __Categorical__ - restecg, ca, thal
- __Continuous__ - age, trestbps, chol, thalach, oldpeak, slope
- __Binary__ - sex, fbs, exang
- __Ordinal__ - cp

Missing values:
- age (continuous)

# 3. Prepare the data <a class="anchor" id="prepare"></a>
[Back to Contents](#toc)

## 3.1. Feature Engineering <a class="anchor" id="feateng"></a>

__`Step 6`__ Create new variables from the original ones.

In [6]:
X['log_chol'] = np.log(X['chol'])

num_vars.append('log_chol')

## 3.2. Feature Selection<a class="anchor" id="featsel"></a>

__`Step 7`__ Create a function that selects the best features for each split of a StratifiedKFold.

In [7]:
def select_best_features(X, y):
    skf = StratifiedKFold(n_splits = 3)
    counter = 0
    for train_index, val_index in skf.split(X, y):
        counter +=1
        print('')
        print('--------------------------------------------------------')
        print('SPLIT ', counter)
        print('--------------------------------------------------------')
        print('')
        X_train, X_val = X.iloc[train_index], X.iloc[val_index]
        y_train, y_val = y.iloc[train_index], y.iloc[val_index]
        
        
        # fill missing values (median in numerical data, mode in categorical data)
        median_age_train = X_train['age'].median()
        X_train['age'].fillna(median_age_train, inplace = True)
        X_val['age'].fillna(median_age_train, inplace = True)
        mode_sex_train = X_train['sex'].mode()[0]
        X_train['sex'].fillna(mode_sex_train, inplace = True)
        X_val['sex'].fillna(mode_sex_train, inplace = True)
        
        # get all numerical variables
        X_train_num = X_train[num_vars]
        X_val_num = X_val[num_vars]
        
        # get all categorical variables
        X_train_cat = X_train[cat_vars]
        X_val_cat = X_val[cat_vars]
        
        # Apply scaling to numerical data
        scaler = MinMaxScaler().fit(X_train_num)
        X_train_scaled = pd.DataFrame(scaler.transform(X_train_num), columns = X_train_num.columns, index = X_train_num.index,) # MinMaxScaler in the training data
    
        
        # Check which features to use using RFE
        print('')
        print('----------------- RFE ----------------------')
        model = LogisticRegression()
        rfe = RFE(estimator = model, n_features_to_select = 4)
        X_rfe = rfe.fit_transform(X = X_train_scaled, y = y_train)
        selected_features = pd.Series(rfe.support_, index = X_train_scaled.columns)
        print(selected_features)
        
        # Check which features to use using Chi-Square
        print('')
        print('----------------- CHI-SQUARE ----------------------')
        def TestIndependence(X,y,var,alpha=0.05):        
            dfObserved = pd.crosstab(y,X) 
            chi2, p, dof, expected = stats.chi2_contingency(dfObserved.values)
            dfExpected = pd.DataFrame(expected, columns=dfObserved.columns, index = dfObserved.index)
            if p<alpha:
                result="{0} is IMPORTANT for Prediction".format(var)
            else:
                result="{0} is NOT important for Prediction. (Discard {0} from model)".format(var)
            print(result)
        
        for var in X_train_cat:
            TestIndependence(X_train_cat[var],y_train, var)
            

In [8]:
select_best_features(X, y)


--------------------------------------------------------
SPLIT  1
--------------------------------------------------------


----------------- RFE ----------------------
age         False
trestbps     True
chol        False
thalach      True
oldpeak      True
slope       False
cp           True
log_chol    False
dtype: bool

----------------- CHI-SQUARE ----------------------
restecg is IMPORTANT for Prediction
ca is IMPORTANT for Prediction
thal is IMPORTANT for Prediction
sex is IMPORTANT for Prediction
fbs is NOT important for Prediction. (Discard fbs from model)
exang is IMPORTANT for Prediction

--------------------------------------------------------
SPLIT  2
--------------------------------------------------------


----------------- RFE ----------------------
age         False
trestbps    False
chol        False
thalach      True
oldpeak      True
slope        True
cp           True
log_chol    False
dtype: bool

----------------- CHI-SQUARE ----------------------
restecg is I

According to the previous results:
- Using RFE (selecting the 4 more important features):
    - thalach is always important (keep the variable)
    - oldpeak is always important (keep the variable)
    - cp is always important (keep the variable)
    - trestbps is important twice (keep the variable)
    - slope is important only once (remove the variable)


- Using Chi-Square:
    - fbs is never important (remove the variable)

    
Concluding, we should keep the variables (this is a possible interpretation):
    - thalach, oldpeak, cp, trestbps, restecg, ca, thal, sex and exang.

__`Step 8`__ Choose the best variables to keep.

In [9]:
best_vars = ['thalach','oldpeak','cp','trestbps','restecg','ca','thal','sex','exang']

# select the final features 
X_sel = X[best_vars].copy()
X_sel

Unnamed: 0,thalach,oldpeak,cp,trestbps,restecg,ca,thal,sex,exang
0,143,0.0,1,156,0,0,2,1.0,0
1,143,0.0,0,174,1,0,2,0.0,1
2,152,0.5,2,125,0,1,2,1.0,0
3,145,2.6,3,134,1,2,2,1.0,0
4,172,0.0,2,120,1,0,2,0.0,0
...,...,...,...,...,...,...,...,...,...
712,174,1.6,2,150,1,0,2,1.0,0
713,162,0.5,2,172,1,0,3,1.0,0
714,117,3.4,0,180,2,0,2,0.0,1
715,142,0.6,2,130,0,1,1,1.0,1


# 5. Model Assessment<a class="anchor" id="assess"></a>
[Back to Contents](#toc)

__`Step 9`__ Create a function that compares several models using F1 Score.

In [10]:
def compare_models(X, y, model):
    # apply StratifiedK-Fold
    skf = StratifiedKFold(n_splits = 5)
    score_train = []
    score_val = []
    for train_index, val_index in skf.split(X, y):
        X_train, X_val = X.iloc[train_index], X.iloc[val_index]
        y_train, y_val = y.iloc[train_index], y.iloc[val_index]
        
        # This time we are going to use validation to check overfitting 
        # so we need also to make all the needed changes in the validation
        
        # fill missing values (mean in numerical data, mode in categorical data)
        #median_age_train = X_train['age'].median() # age is no longer used
        #X_train['age'].fillna(median_age_train, inplace = True)
        #X_val['age'].fillna(median_age_train, inplace = True)
        mode_sex_train = X_train['sex'].mode()[0]
        X_train['sex'].fillna(mode_sex_train, inplace = True)
        X_val['sex'].fillna(mode_sex_train, inplace = True)
        
        # Create dummies and remove one of the variables (to avoid multicollinearity)
        X_train_dummies = pd.get_dummies(X_train, columns=['restecg', 'ca', 'thal'], drop_first=True)
        X_val_dummies = pd.get_dummies(X_val, columns=['restecg', 'ca', 'thal'], drop_first=True)
        
        # If we don't have all the values in the validation dataset that we have in the train, that column will not be created
        # We should assure that all columns in train are also present in validation
        # Get missing columns from the training dataset
        missing_cols = set(X_train_dummies.columns ) - set(X_val_dummies.columns )
        # Add a missing column in test set with default value equal to 0
        for c in missing_cols:
            X_val_dummies[c] = 0
        # Ensure the order of column in the test set is in the same order than in train set
        X_val_dummies = X_val_dummies[X_train_dummies.columns]
        
        # Data Scaling
        # Apply MinMaxScaler
        scaler = MinMaxScaler().fit(X_train_dummies)
        X_train_scaled = scaler.transform(X_train_dummies) 
        X_val_scaled = scaler.transform(X_val_dummies) # Scaling with 'scaler' from train data

        # Apply model
        model.fit(X_train_scaled, y_train)
        predictions_train = model.predict(X_train_scaled)
        predictions_val = model.predict(X_val_scaled)
        score_train.append(f1_score(y_train, predictions_train))
        score_val.append(f1_score(y_val, predictions_val))

    avg_train = round(np.mean(score_train),3)
    avg_val = round(np.mean(score_val),3)
    std_train = round(np.std(score_train),2)
    std_val = round(np.std(score_val),2)

    return str(avg_train) + '+/-' + str(std_train),str(avg_val) + '+/-' + str(std_val)

In [11]:
def show_results(df, X, y, *args):
    """
    Receive an empty dataframe and the different models and call the function avg_score
    """
    count = 0
    # for each model passed as argument
    for arg in args:
        # obtain the results provided by avg_score
        avg_train, avg_test = compare_models(X, y, arg)
        # store the results in the right row
        df.iloc[count] = avg_train, avg_test
        count+=1
    
    return df

__`Step 10`__ Compare the results of a Logistic Regression with a K-Nearest-Neighbors.

In [12]:
model_LR = LogisticRegression()
model_KNN = KNeighborsClassifier()

df = pd.DataFrame(columns = ['Train','Validation'], index = ['Logistic Regression','KNN'])
show_results(df, X_sel, y, model_LR, model_KNN)

Unnamed: 0,Train,Validation
Logistic Regression,0.876+/-0.01,0.869+/-0.01
KNN,0.922+/-0.0,0.865+/-0.01


According to the results, the best model is Logistic Regression with the default parameters since it performs better on validation and has a lower value of **overfitting**.  <br><br>

# 6. Predictions<a class="anchor" id="pred"></a>
[Back to Contents](#toc)

First, we are going to create our final model using all the training data (the more data the better and this model should have exactly the same structure than the selected model on the previous phase, i.e., a KNN with the default parameters.) <br><br>
Then we need to import the test dataset, made all the needed transformations and finally export the csv with the predictions (final answers).

#### Create final model

__`Step 11`__ Create the final model and train it with the whole dataset.

In [13]:
X_final = X[best_vars].copy()
y_final = y.copy()

#median_age = X_final['age'].median()
#X_final['age'].fillna(median_age, inplace = True)
mode_sex = X_final['sex'].mode()[0]
X_final['sex'].fillna(mode_sex, inplace = True)
        
# Create dummies and remove one of the variables (to avoid multicollinearity)
X_final_dummies = pd.get_dummies(X_final, columns=['restecg', 'ca', 'thal'], drop_first=True)


# Data Scaling
# Apply MinMaxScaler
scaler = MinMaxScaler().fit(X_final_dummies)
X_final_scaled = scaler.transform(X_final_dummies) 

# Create your final model with exactly the same parameters than your best model during model comparison
final_model = KNeighborsClassifier().fit(X_final_scaled, y_final)

__`Step 12`__ Apply the final created model to get the predictions of the test dataset.

In [14]:
test = pd.read_excel('data/heart_score.xlsx')

test_final = test[best_vars].copy()
#test_final['age'].fillna(median_age, inplace = True)
test_final['sex'].fillna(mode_sex, inplace = True)
        
# Create dummies and remove one of the variables (to avoid multicollinearity)
test_final_dummies = pd.get_dummies(test_final, columns=['restecg', 'ca', 'thal'], drop_first=True)

# If we don't have all the values in the test dataset that we have in the train, that column will not be created
# We should assure that all columns in train are also present in test
# Get missing columns from the training dataset
missing_cols = set(X_final_dummies.columns ) - set(test_final_dummies.columns )
# Add a missing column in test set with default value equal to 0
for c in missing_cols:
    test_final_dummies[c] = 0
# Ensure the order of column in the test set is in the same order than in train set
test_final_dummies = test_final_dummies[X_final_dummies.columns]

# Data Scaling
# Apply exactly the same MinMaxScaler used before
test_final_scaled = scaler.transform(test_final_dummies)

__`Step 13`__ Get the final predictions made by the final model. Create a dataframe with the needed columns for your final delivery: 'ID' that will contain the ID of the pations and 'Answer', the predicted value for each patient.

In [15]:
# Get predictions
predictions = final_model.predict(test_final_scaled)

# Save the final predictions
patient_index = test.index.T
answer = pd.DataFrame([patient_index, predictions]).T
answer.columns = ['ID','Answer']
answer

Unnamed: 0,ID,Answer
0,0,1
1,1,1
2,2,1
3,3,1
4,4,1
...,...,...
303,303,1
304,304,0
305,305,0
306,306,0


__`Step 14`__ Save that dataframe into a csv file named as 'answer.csv'.

In [16]:
#answer.to_csv('answer.csv', index = None)