## Summary

### Visualised:
- __Box plots for the continuous variables.__
    - Unsurprisingly there were a large number of outliers for the non-stroke class due to the imbalance.
    - `age_corr` was clearly a standout in terms of separating the 2 classes, despite the imbalance.
- __Histogram for continuous variables__
    - Whilst `average_blood_sugar` looks skewed, it should not be too much of a problem. 
    - `BMI` looks much nicer, closer to a normal distribution.
    - However, the distribution of the continuous variables with respect to the label does not appear to help differentiate them. 
- __Count plots were used for the binary variables, and due to the class imbalance it was hard to see anything significant.__

### Feature Engineering:
- __one-hot encoding__
    - We OneHot encode all the binary variables
- __binning__
    - Age: 0-18 child, 18-45 adult, 45-70 senior, 70+ elderly
    - Sugar: 0-100 low, 100-150 medium, 150-250 high, 250+ very high
    - BMI: < 18.5 low (unhealthy), 18-25 medium (normal), 25-30 high (at risk), 30+ very high (at high risk)
- __combining classes__
    - we create a new feature that indicates whether a person has ever smoked before (essentially an `or` on `quit` and `active_smoker`
- __normalisation of data__
    - all continuous data is normalised to between 0 and 1
    - Note: This is done in the next notebook
    
### SMOTE Upsampling

For the submission file we do not use SMOTE.

### Updated Dataset 
The processed dataset was saved to `test_processed_feature_engineered.csv`.

In [1]:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np

%matplotlib inline

In [23]:
stk = pd.read_csv("../data/test_processed.csv")

In [24]:
stk["high_BP"] = stk["high_BP"].apply(lambda x: str(int(x)))
stk["heart_condition_detected_2017"] = stk["heart_condition_detected_2017"].apply(lambda x: str(int(x)))
stk["married"] = stk["married"].apply(lambda x: str(int(x)))

In [26]:
stk.shape

(8718, 11)

In [27]:
stk.dtypes

id                                 int64
high_BP                           object
heart_condition_detected_2017     object
married                           object
smoker_status                     object
average_blood_sugar              float64
BMI                              float64
job_status_corr                   object
living_area_corr                  object
sex_corr                          object
age_corr                           int64
dtype: object

In [28]:
# We CANNOT drop the ID as it is required for submission
# stk.drop(labels="id", axis=1, inplace=True)

In [29]:
stk.smoker_status.isnull().value_counts()

False    8718
Name: smoker_status, dtype: int64

In [30]:
cols_continuous = ["BMI", "average_blood_sugar"]
cols_categorical = ["high_BP", "smoker_status", "married", "heart_condition_detected_2017",
                   "job_status_corr", "living_area_corr"]

<div class="alert alert-block alert-info">
OneHot Encoding
</div>

Using `pd.get_dummies()`, we can easily handle categorical variables.

In [31]:
stk_onehot = stk.join(pd.get_dummies(stk[cols_categorical]))
stk_onehot.head()

Unnamed: 0,id,high_BP,heart_condition_detected_2017,married,smoker_status,average_blood_sugar,BMI,job_status_corr,living_area_corr,sex_corr,...,married_1,heart_condition_detected_2017_0,heart_condition_detected_2017_1,job_status_corr_business_owner,job_status_corr_government,job_status_corr_parental_leave,job_status_corr_private_sector,job_status_corr_unemployed,living_area_corr_city,living_area_corr_remote
0,33327,0,0,1,active_smoker,76.05,33.4,private_sector,remote,f,...,1,1,0,0,0,0,1,0,0,1
1,839,0,0,1,non-smoker,73.77,30.1,government,city,f,...,1,1,0,0,1,0,0,0,1,0
2,11127,0,0,1,active_smoker,62.95,30.8,business_owner,remote,m,...,1,1,0,1,0,0,0,0,0,1
3,20768,0,0,1,quit,68.81,36.5,private_sector,city,f,...,1,1,0,0,0,0,1,0,1,0
4,37774,0,0,0,active_smoker,122.89,30.8,private_sector,city,f,...,0,1,0,0,0,0,1,0,1,0


In [32]:
stk_onehot.head(n=5)[["high_BP", "high_BP_0", "high_BP_1"]]

Unnamed: 0,high_BP,high_BP_0,high_BP_1
0,0,1,0
1,0,1,0
2,0,1,0
3,0,1,0
4,0,1,0


In [33]:
stk_onehot.head(n=5)[["high_BP", "high_BP_0", "high_BP_1"]]
stk_onehot.head(n=5)[["smoker_status", "smoker_status_non-smoker", 
                      "smoker_status_quit", "smoker_status_active_smoker"]]

Unnamed: 0,smoker_status,smoker_status_non-smoker,smoker_status_quit,smoker_status_active_smoker
0,active_smoker,0,0,1
1,non-smoker,1,0,0
2,active_smoker,0,0,1
3,quit,0,1,0
4,active_smoker,0,0,1


In [34]:
stk_onehot.rename(columns={"smoker_status_non-smoker":"smoker_status_non_smoker"}, inplace=True)
print("smoker_status_non_smoker" in stk_onehot.columns)

stk_onehot.loc[stk_onehot["smoker_status"] == "non-smoker", "smoker_status"] = "non_smoker"
print("non_smoker" in stk_onehot.smoker_status.unique())


True
True


<div class="alert alert-block alert-info">
Binning
</div>

In [35]:
stk_onehot["low_BMI"] = 0
stk_onehot.loc[stk_onehot["BMI"] < 18.5, "low_BMI"] = 1

stk_onehot["medium_BMI"] = 0
stk_onehot.loc[(stk_onehot["BMI"] >= 18.5) & (stk_onehot["BMI"] < 24.9), "medium_BMI"] = 1

stk_onehot["high_BMI"] = 0
stk_onehot.loc[(stk_onehot["BMI"] >= 24.9) & (stk_onehot["BMI"] < 29.9), "high_BMI"] = 1

stk_onehot["very_high_BMI"] = 0
stk_onehot.loc[stk_onehot["BMI"] >= 29.9, "very_high_BMI"] = 1

In [36]:
stk_onehot["child"] = 0
stk_onehot.loc[stk_onehot["age_corr"] < 18, "child"] = 1

stk_onehot["adult"] = 0
stk_onehot.loc[(stk_onehot["age_corr"] >= 18) & (stk_onehot["age_corr"] < 45), "adult"] = 1

stk_onehot["senior"] = 0
stk_onehot.loc[(stk_onehot["age_corr"] >= 45) & (stk_onehot["age_corr"] < 70), "senior"] = 1

stk_onehot["elderly"] = 0
stk_onehot.loc[stk_onehot["age_corr"] >= 70, "elderly"] = 1

In [37]:
# stk_onehot["low_sugar"] = 0
# stk_onehot.loc[stk_onehot["average_blood_sugar"] < 50, "low_sugar"] = 1

stk_onehot["low_sugar"] = 0
stk_onehot.loc[(stk_onehot["average_blood_sugar"] >= 50) &\
               (stk_onehot["average_blood_sugar"] < 100), "low_sugar"] = 1

stk_onehot["medium_sugar"] = 0
stk_onehot.loc[(stk_onehot["average_blood_sugar"] >= 100) &\
               (stk_onehot["average_blood_sugar"] < 150), "medium_sugar"] = 1

stk_onehot["high_sugar"] = 0
stk_onehot.loc[(stk_onehot["average_blood_sugar"] >= 150) &\
               (stk_onehot["average_blood_sugar"] < 250), "high_sugar"] = 1

stk_onehot["very_high_sugar"] = 0
stk_onehot.loc[stk_onehot["average_blood_sugar"] >= 250, "very_high_sugar"] = 1

In [38]:
print(stk_onehot.shape)
stk_onehot.head()

(8718, 39)


Unnamed: 0,id,high_BP,heart_condition_detected_2017,married,smoker_status,average_blood_sugar,BMI,job_status_corr,living_area_corr,sex_corr,...,high_BMI,very_high_BMI,child,adult,senior,elderly,low_sugar,medium_sugar,high_sugar,very_high_sugar
0,33327,0,0,1,active_smoker,76.05,33.4,private_sector,remote,f,...,0,1,0,1,0,0,1,0,0,0
1,839,0,0,1,non_smoker,73.77,30.1,government,city,f,...,0,1,0,1,0,0,1,0,0,0
2,11127,0,0,1,active_smoker,62.95,30.8,business_owner,remote,m,...,0,1,0,0,1,0,1,0,0,0
3,20768,0,0,1,quit,68.81,36.5,private_sector,city,f,...,0,1,0,1,0,0,1,0,0,0
4,37774,0,0,0,active_smoker,122.89,30.8,private_sector,city,f,...,0,1,0,1,0,0,0,1,0,0


<div class="alert alert-block alert-info">
Combining Classes
</div>

In [39]:
# We try to mark the patients have smoked before, even if they are not smoking now
stk_onehot["has_smoked"] = 0
stk_onehot.loc[(stk_onehot["smoker_status_quit"] == 1) &\
               (stk_onehot["smoker_status_active_smoker"] == 1), "has_smoked"] = 1

In [40]:
stk_onehot.head()

Unnamed: 0,id,high_BP,heart_condition_detected_2017,married,smoker_status,average_blood_sugar,BMI,job_status_corr,living_area_corr,sex_corr,...,very_high_BMI,child,adult,senior,elderly,low_sugar,medium_sugar,high_sugar,very_high_sugar,has_smoked
0,33327,0,0,1,active_smoker,76.05,33.4,private_sector,remote,f,...,1,0,1,0,0,1,0,0,0,0
1,839,0,0,1,non_smoker,73.77,30.1,government,city,f,...,1,0,1,0,0,1,0,0,0,0
2,11127,0,0,1,active_smoker,62.95,30.8,business_owner,remote,m,...,1,0,0,1,0,1,0,0,0,0
3,20768,0,0,1,quit,68.81,36.5,private_sector,city,f,...,1,0,1,0,0,1,0,0,0,0
4,37774,0,0,0,active_smoker,122.89,30.8,private_sector,city,f,...,1,0,1,0,0,0,1,0,0,0


In [41]:
stk_onehot.iloc[:5, 20:]

Unnamed: 0,job_status_corr_business_owner,job_status_corr_government,job_status_corr_parental_leave,job_status_corr_private_sector,job_status_corr_unemployed,living_area_corr_city,living_area_corr_remote,low_BMI,medium_BMI,high_BMI,very_high_BMI,child,adult,senior,elderly,low_sugar,medium_sugar,high_sugar,very_high_sugar,has_smoked
0,0,0,0,1,0,0,1,0,0,0,1,0,1,0,0,1,0,0,0,0
1,0,1,0,0,0,1,0,0,0,0,1,0,1,0,0,1,0,0,0,0
2,1,0,0,0,0,0,1,0,0,0,1,0,0,1,0,1,0,0,0,0
3,0,0,0,1,0,1,0,0,0,0,1,0,1,0,0,1,0,0,0,0
4,0,0,0,1,0,1,0,0,0,0,1,0,1,0,0,0,1,0,0,0


__Now we drop all the columns that have been OneHot encoded__

In [42]:
columns_one_hot_encoded = ["high_BP", "heart_condition_detected_2017", "married",
                           "smoker_status", "job_status_corr", "living_area_corr",
                           "sex_corr"
                        ]
stk_oh_only = stk_onehot.drop(labels=columns_one_hot_encoded, axis=1).reset_index(drop=True)
stk_oh_only

Unnamed: 0,id,average_blood_sugar,BMI,age_corr,high_BP_0,high_BP_1,smoker_status_active_smoker,smoker_status_non_smoker,smoker_status_quit,married_0,...,very_high_BMI,child,adult,senior,elderly,low_sugar,medium_sugar,high_sugar,very_high_sugar,has_smoked
0,33327,76.05,33.4,36,1,0,1,0,0,0,...,1,0,1,0,0,1,0,0,0,0
1,839,73.77,30.1,40,1,0,0,1,0,0,...,1,0,1,0,0,1,0,0,0,0
2,11127,62.95,30.8,59,1,0,1,0,0,0,...,1,0,0,1,0,1,0,0,0,0
3,20768,68.81,36.5,33,1,0,0,0,1,0,...,1,0,1,0,0,1,0,0,0,0
4,37774,122.89,30.8,22,1,0,1,0,0,1,...,1,0,1,0,0,0,1,0,0,0
5,4283,116.97,30.7,60,1,0,0,1,0,0,...,1,0,0,1,0,0,1,0,0,0
6,13832,112.96,44.7,83,1,0,0,1,0,0,...,1,0,0,0,1,0,1,0,0,0
7,4579,99.22,36.7,47,1,0,1,0,0,1,...,1,0,0,1,0,1,0,0,0,0
8,26781,226.94,28.9,82,1,0,0,1,0,0,...,0,0,0,0,1,0,0,1,0,0
9,17196,78.88,28.0,49,1,0,0,1,0,0,...,0,0,0,1,0,1,0,0,0,0


In [43]:
stk_oh_only.to_csv("../data/test_processed_feature_engineered.csv", index=False)

In [44]:
stk_oh_only.shape

(8718, 33)

### Sandbox Code

code below was for experimentation and may break if you run it

In [None]:
break

In [None]:
# We drop the continuous variables average_sugar_level, BMI, age_corr
# As well as the string columns and label 

pt1 = np.arange(start=0, stop=3)
# pt2 = np.arange(start=6, stop=9)
pt3 = np.arange(start=10, stop=39)
print(pt1, pt3)

categorical_indices = np.concatenate((pt1, pt3), axis=0)
print(categorical_indices)

stk_onehot.iloc[:, categorical_indices].columns

In [None]:
from sklearn.datasets import make_classification

X, y = make_classification(n_classes=2, class_sep=2,
weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)

X

### OneHotEncoder Function

I had used `get_dummies` previously and wanted to try OneHotEncoder out, but it appears that 

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
enc = OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]) 

In [None]:
enc.n_values_

In [None]:
stk.columns

In [None]:
stk[["smoker_status", "sex_corr"]]

In [None]:
enc = OneHotEncoder(handle_unknown='ignore')

enc.fit(stk[["smoker_status", "sex_corr"]])

In [None]:
enc.transform(stk[["smoker_status", "sex_corr"]]).toarray()

In [None]:
enc.get_feature_names()