# Fasting Blood Sugar Prediction

[ML Cookbook](https://www.ml-book.com) | [SLACK Channel](https://join.slack.com/t/mlckbk/shared_invite/zt-9qsjm911-6nSHAcCSjKfuHi972iEfEg)

## About

In this project, you have to build a model that **predicts whether the fasting blood sugar** of a patient **is > 120 mg/dl**.

The project contains 7 sections in total, each with step-by-step instructions. Note that as the lessons progress, we will move away from guided projects like this one towards less guided ones, with fewer instructions. So my advice is to try to understand why we do each step, and why in this order.

## Structure
The project is split into **7 sections**, each containing **step-by-step instructions** of what to do. These sections are the following:
1.   Import the Libraries
2.   Import the Datasets
3.   Data Preprocessing
4.   Data Overview
5.   Model Building
6.   Model Evaluation & Hyperparameter Tuning
7.   Conclusion

## Data
There are 2 datasets provided that you should use for this project:
- fbs1.xlsx
- fbs2.xlsx

### > Columns:
- age: age in years
- sex: (1 = male; 0 = female)
- cp: chest pain type
- trestbps: resting blood pressure (in mm Hg on admission to the hospital)
- chol: serum cholesterol in mg/dl
- restecg: resting electrocardiographic results
- thalach: maximum heart rate achieved
- exang: exercise induced angina (1 = yes; 0 = no)
- oldpeak: ST depression induced by exercise relative to rest
- slope: the slope of the peak exercise ST segment
- ca: number of major vessels (0-3) colored by fluoroscopy
- thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
- fbs: (target - fasting blood sugar > 120 mg/dl): 1 = true; 0 = false

### > Description:
Attribute Information: 
> 1. age 
> 2. sex 
> 3. chest pain type (4 values) 
> 4. resting blood pressure 
> 5. serum cholesterol in mg/dl 
> 6. resting electrocardiographic results (values 0,1,2)
> 7. maximum heart rate achieved 
> 8. exercise induced angina 
> 9. oldpeak = ST depression induced by exercise relative to rest 
> 10. the slope of the peak exercise ST segment 
> 11. number of major vessels (0-3) colored by fluoroscopy 
> 12. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
> 13. fasting blood sugar > 120 mg/dl

The names and social security numbers of the patients were recently removed from the database and replaced with dummy values. One file has been "processed": the one containing the Cleveland database. All four unprocessed files also exist in this directory.

# 1. Import the Libraries

Import the libraries you need (keep adding the required libraries to this cell as you go further with the project).

In [None]:
import pandas as pd



# 2. Import the datasets

Do the following:

*   **Step 1**: Import two datasets as df1 and df2 **(we did that for you)**
*   **Step 2**: See what the dataframes look like
*   **Step 3**: Check the shape of each dataset by returning two lines with one print function: 


        Data shape of df1 is (X, Y),
        Data shape of df2 is (X, Y)

Use the .format function for that.

---

## Step 1
Import two datasets as df1 and df2 **(we did that for you)**

In [None]:
df1 = pd.read_excel('https://raw.githubusercontent.com/the-learning-machine/data/master/tlm_project3/fbs1.xlsx')
df2 = pd.read_excel('https://raw.githubusercontent.com/the-learning-machine/data/master/tlm_project3/fbs2.xlsx')


## Step 2
See what the dataframes look like

## Step 3
Check the shape of each dataset by returning two lines with one print function: 


        Data shape of df1 is (X, Y),
        Data shape of df2 is (X, Y)

Use the .format function for that.
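
A minimal sketch of Steps 2 and 3. The two small frames below are toy stand-ins for `df1` and `df2` (an assumption, so that the snippet runs without downloading the Excel files); with the real data only the `head()` and `print` lines are needed.

```python
import pandas as pd

# Toy stand-ins for df1 and df2; the real frames come from the read_excel calls in Step 1
df1 = pd.DataFrame({"age": [63, 37, 41], "chol": [233, 250, 204]})
df2 = pd.DataFrame({"age": [56, 57], "chol": [236, 354]})

# Step 2: look at the first rows of a frame
print(df1.head())

# Step 3: one print call, two lines. "\n" separates the lines and
# .format fills in the two shape tuples.
message = "Data shape of df1 is {},\nData shape of df2 is {}".format(df1.shape, df2.shape)
print(message)
```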

# 3. Data Preprocessing

**Step 1:** Combine two datasets into one

**Step 2**: Create a class called "Prep". Inside that class write functions that: 
- Prints unique values of each column
- Prints data types
- Replaces null values with 0.
- Replaces inappropriate values - "?", "\" and "!" of a certain column with 0.
- Converts words "Null", "One" and "one" in target column into numeric form.

**Step 3**: Change data types if needed.


**Step 4**: Now, you have to preprocess the id column. Something corrupted it: each id should have the form [ letter "t", "l" or "m" ] + [number], e.g. "t18", "l891" or "m142". So how do you get, for instance, "m142" out of "9867r13m12e1_142"? Take the letter "l", "m" or "t" from the id (note that ids contain several other letters as well) and append the number, i.e. the digits that come after the underscore "_". Here are some examples of the transformation: 
- "9867t13e12r1_92" -> "t92". Even though this id contains two other letters ("r" and "e"), only the letter "t" is kept.
- "9867r13e12m1_1203" -> "m1203"
- "343e2832j093k38042t8920402n_778" -> "t778"

Create a for loop that does this for all the cells inside the id column.

**Step 5**: Validate the data: check the dtypes, the presence of wrong values ("?", "!"), null values, etc., and the unique values of each column.

---

## Step 1
Combine two datasets into one
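
One possible way to do this is `pd.concat`, sketched below on toy frames. The sketch assumes both files share the same columns, which you should confirm on the real data first.

```python
import pandas as pd

# Toy stand-ins for df1 and df2 with identical columns (an assumption about the real files)
df1 = pd.DataFrame({"age": [63, 37], "fbs": [1, 0]})
df2 = pd.DataFrame({"age": [56, 57, 44], "fbs": [0, 0, 1]})

# Stack the rows of both frames; ignore_index=True rebuilds a clean 0..n-1 index
df = pd.concat([df1, df2], ignore_index=True)
print(df.shape)
```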

## Step 2
Create a class called "Prep". Inside that class write functions that: 
- Prints unique values of each column
- Prints data types
- Replaces null values with 0.
- Replaces inappropriate values - "?", "\" and "!" of a certain column with 0.
- Converts words "Null", "One" and "one" in target column into numeric form.
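
The list above could be laid out roughly as follows. This is only a sketch: the method names, the toy data, and the mapping "Null" -> 0, "One"/"one" -> 1 are assumptions, not a prescribed solution.

```python
import pandas as pd


class Prep:
    """Small wrapper around a DataFrame; all method names are illustrative."""

    def __init__(self, df):
        self.df = df.copy()  # work on a copy so the caller's frame is untouched

    def print_unique(self):
        for col in self.df.columns:
            print(col, self.df[col].unique())

    def print_dtypes(self):
        print(self.df.dtypes)

    def fill_nulls(self):
        self.df = self.df.fillna(0)
        return self.df

    def replace_bad_values(self, column, bad_values=("?", "\\", "!")):
        # Literal (non-regex) replacement of each bad symbol with 0
        self.df[column] = self.df[column].replace(list(bad_values), 0)
        return self.df

    def words_to_numbers(self, column, mapping=None):
        # Assumed mapping: "Null" means 0 and "One"/"one" mean 1
        mapping = mapping or {"Null": 0, "One": 1, "one": 1}
        self.df[column] = self.df[column].replace(mapping)
        return self.df


# Toy data with the kinds of problems described above
raw = pd.DataFrame({"chol": [233, "?", None, "!"], "fbs": ["Null", "One", "one", 0]})
prep = Prep(raw)
prep.fill_nulls()
prep.replace_bad_values("chol")
prep.words_to_numbers("fbs")
prep.print_unique()
```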

## Step 3
Change data types if needed.
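
A short sketch of one way to convert columns, assuming the values have already been cleaned in Step 2. The column choices and target dtypes below are illustrative only.

```python
import pandas as pd

# Toy frame where numbers were read in as text or floats
df = pd.DataFrame({
    "age": ["63", "37", "41"],
    "oldpeak": ["2.3", "3.5", "1.4"],
    "fbs": [1.0, 0.0, 1.0],
})

# to_numeric parses strings into numbers; errors="coerce" would turn
# anything unparsable into NaN (there is none in this toy data)
df["age"] = pd.to_numeric(df["age"], errors="coerce").astype("int64")
df["oldpeak"] = pd.to_numeric(df["oldpeak"], errors="coerce")
df["fbs"] = df["fbs"].astype("int64")
print(df.dtypes)
```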

## Step 4
Now, you have to preprocess the id column. Something corrupted it: each id should have the form [ letter "t", "l" or "m" ] + [number], e.g. "t18", "l891" or "m142". So how do you get, for instance, "m142" out of "9867r13m12e1_142"? Take the letter "l", "m" or "t" from the id (note that ids contain several other letters as well) and append the number, i.e. the digits that come after the underscore "_". Here are some examples of the transformation: 
- "9867t13e12r1_92" -> "t92". Even though this id contains two other letters ("r" and "e"), only the letter "t" is kept.
- "9867r13e12m1_1203" -> "m1203"
- "343e2832j093k38042t8920402n_778" -> "t778"

Create a for loop that does this for all the cells inside the id column.
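
The transformation can be sketched with a regular expression and a plain for loop. The helper name `clean_id` is hypothetical, and the sketch assumes each id contains at least one of "t", "l", "m" before the underscore; ids that do not match are left unchanged.

```python
import re

import pandas as pd


def clean_id(raw_id):
    """Return the first 't', 'l' or 'm' found before the underscore plus the digits after it."""
    prefix, _, digits = str(raw_id).rpartition("_")
    match = re.search(r"[tlm]", prefix)
    if match is None or not digits.isdigit():
        return raw_id  # leave unexpected ids as they are
    return match.group() + digits


# The example ids from the task description
df = pd.DataFrame({"id": [
    "9867t13e12r1_92",
    "9867r13e12m1_1203",
    "343e2832j093k38042t8920402n_778",
    "9867r13m12e1_142",
]})

# Plain for loop over every cell of the id column, as the step asks
cleaned = []
for raw in df["id"]:
    cleaned.append(clean_id(raw))
df["id"] = cleaned
print(df["id"].tolist())
```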

## Step 5
Validate the data: check the dtypes, the presence of wrong values ("?", "!"), null values, etc., and the unique values of each column.
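
One way to run these checks in a single pass is to collect a small report per column, as sketched below. The toy frame deliberately contains one dirty column so that the checks have something to find; the report layout is an assumption.

```python
import pandas as pd

# Toy frame: "thal" still holds a bad symbol and a missing value
df = pd.DataFrame({
    "chol": [233, 250, 204],
    "thal": [3, "?", None],
})

bad_symbols = ["?", "!", "\\"]
report = {}
for col in df.columns:
    report[col] = {
        "dtype": str(df[col].dtype),
        "nulls": int(df[col].isna().sum()),            # missing values
        "bad": int(df[col].isin(bad_symbols).sum()),   # leftover "?", "!" or "\"
        "unique": int(df[col].nunique()),              # distinct non-null values
    }
    print(col, report[col])
```

A clean column shows zero nulls and zero bad values, like `chol` here.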

# 4. Data Overview

Observe the data:

*   **Step 1**: Find the mean of trestbps and chol for people with and without heart disease.
*   **Step 2**: Find the mean of thalach for people with and without heart disease.

---

## Step 1
Find the mean of trestbps and chol for people with and without heart disease.

## Step 2
Find the mean of thalach for people with and without heart disease.
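
Group means like these are usually computed with `groupby`. In the sketch below the grouping column is an assumption: the column list in the Data section has no explicit heart-disease flag, so the toy example groups by `fbs`; substitute whichever binary column you use to separate the two groups.

```python
import pandas as pd

# Toy data; the numbers are made up for illustration
df = pd.DataFrame({
    "trestbps": [120, 140, 130, 150],
    "chol": [200, 240, 220, 260],
    "thalach": [150, 170, 160, 140],
    "fbs": [0, 0, 1, 1],
})

group_col = "fbs"  # assumption: replace with the column that defines your two groups

# One row per group, one column per feature
means = df.groupby(group_col)[["trestbps", "chol", "thalach"]].mean()
print(means)
```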

# 5. Model Building

Do the following:
  
*   **Step 1**: Identify the X variables that are the most significant indicators for predicting fasting blood sugar (fbs). Set fbs as the y variable.
*   **Step 2**: Split the data into train and test
*   **Step 3**: Choose any classifier. The function that builds it should take as parameters the hyperparameters that you would like to control.

---

## Step 1
Identify the X variables that are the most significant indicators for predicting fasting blood sugar (fbs). Set fbs as the y variable.
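
There are many ways to rank features. The sketch below uses the absolute Pearson correlation with the target, which is only one simple heuristic; the toy data and the choice of keeping the top two features are assumptions.

```python
import pandas as pd

# Toy data in which chol separates the two fbs groups most clearly
df = pd.DataFrame({
    "age": [50, 60, 55, 52, 61, 54],
    "chol": [200, 210, 205, 260, 270, 265],
    "thalach": [150, 160, 170, 155, 165, 175],
    "fbs": [0, 0, 0, 1, 1, 1],
})
target = "fbs"

# Absolute correlation of every feature with the target, strongest first
ranking = df.corr()[target].drop(target).abs().sort_values(ascending=False)
print(ranking)

top_features = list(ranking.index[:2])  # keep the two strongest features
X = df[top_features]
y = df[target]
```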


## Step 2
Split the data into train and test
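
A minimal sketch using scikit-learn's `train_test_split`. The 25% test size, the fixed `random_state` and the stratification on `y` are illustrative choices, and the data is synthetic.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic features and a balanced binary target, 20 rows in total
X = pd.DataFrame({"chol": range(200, 220), "age": range(40, 60)})
y = pd.Series([0, 1] * 10, name="fbs")

# stratify=y keeps the class proportions similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```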

## Step 3
Choose any classifier. The function that builds it should take as parameters the hyperparameters that you would like to control.
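
As an illustration, the function below wraps a decision tree and exposes three of its hyperparameters. The function name `build_tree`, the choice of classifier and the synthetic data from `make_classification` are all assumptions made to keep the sketch runnable.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def build_tree(X_train, y_train, X_test, y_test,
               max_depth=3, min_samples_split=2, min_samples_leaf=1):
    """Fit a decision tree with the given hyperparameters and return (model, test accuracy)."""
    model = DecisionTreeClassifier(
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        min_samples_leaf=min_samples_leaf,
        random_state=0,
    )
    model.fit(X_train, y_train)
    return model, model.score(X_test, y_test)


# Synthetic stand-in for the project data
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model, score = build_tree(X_train, y_train, X_test, y_test, max_depth=3)
print("Test accuracy: {:.3f}".format(score))
```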

<a id='section_6'></a>

# 6. Model Evaluation & Hyperparameter Tuning

**Step 1**: Create a class "Classifier". Inside that class write functions that:

- Builds a Naive Bayes classifier (think about which one you need to build: Gaussian, Multinomial or Bernoulli) and prints the score (with all hyperparameters as parameters of the function).
- Builds a Decision Tree and prints the score (with all hyperparameters as parameters of the function). 
    - Write a loop to show how the depth of the tree affects accuracy. Print only the results for which (1) the difference between the training and test accuracy is lower than 0.5, and (2) the test accuracy is higher than 0.55
    - Write a loop to show how the min observations on a node and the min observations on each leaf affect accuracy. Print only the results for which (1) the difference between the training and test accuracy is lower than 0.5, and (2) the test accuracy is higher than 0.55
    - Plot an accuracy curve against the depth to see how it changes the score.
    - Plot an accuracy curve against the min observations on a splittable node to see how it changes the score.

**Step 2**: Add the following functionality to the "Classifier" class that you've created above so that it also:
- Builds Gaussian Decision Tree and prints the score (with all hyperparameters as parameters of the function). 
- Builds a Random Forest and prints the score (with all hyperparameters as parameters of the function). 
- Builds a Gradient Boosted Decision Tree and prints the score (with all hyperparameters as parameters of the function).
    - Write a loop to show how the learning rate affects accuracy. Print only the results for which (1) the difference between the training and test accuracy is lower than 0.5, and (2) the test accuracy is higher than 0.55
    - Plot an accuracy curve against the depth to see how it changes the score.

**Step 3**: Change several hyperparameters with all tree-based classifiers using a loop and find a way to only print the score result that is higher than X.

**Step 4**: Try to predict your target variable by passing some values of the independent variables into your function. 

---

## Step 1
Create a class "Classifier". Inside that class write functions that:

- Builds a Naive Bayes classifier (think about which one you need to build: Gaussian, Multinomial or Bernoulli) and prints the score (with all hyperparameters as parameters of the function).
- Builds a Decision Tree and prints the score (with all hyperparameters as parameters of the function). 
    - Write a loop to show how the depth of the tree affects accuracy. Print only the results for which (1) the difference between the training and test accuracy is lower than 0.5, and (2) the test accuracy is higher than 0.55
    - Write a loop to show how the min observations on a node and the min observations on each leaf affect accuracy. Print only the results for which (1) the difference between the training and test accuracy is lower than 0.5, and (2) the test accuracy is higher than 0.55
    - Plot an accuracy curve against the depth to see how it changes the score.
    - Plot an accuracy curve against the min observations on a splittable node to see how it changes the score.
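
A possible skeleton for this step is sketched below. The method names, the synthetic data and the use of `GaussianNB` (picked only because the synthetic features are continuous) are assumptions; plotting is left out to keep the sketch short.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier


class Classifier:
    """Illustrative layout; method and argument names are assumptions."""

    def __init__(self, X_train, X_test, y_train, y_test):
        self.X_train, self.X_test = X_train, X_test
        self.y_train, self.y_test = y_train, y_test

    def _fit_and_score(self, model):
        # Returns (training accuracy, test accuracy) of the fitted model
        model.fit(self.X_train, self.y_train)
        return (model.score(self.X_train, self.y_train),
                model.score(self.X_test, self.y_test))

    def naive_bayes(self, **params):
        # Any GaussianNB hyperparameter can be passed as a keyword argument
        train_acc, test_acc = self._fit_and_score(GaussianNB(**params))
        print("Naive Bayes: train={:.3f}, test={:.3f}".format(train_acc, test_acc))
        return test_acc

    def decision_tree(self, **params):
        params.setdefault("random_state", 0)
        train_acc, test_acc = self._fit_and_score(DecisionTreeClassifier(**params))
        print("Decision Tree {}: train={:.3f}, test={:.3f}".format(params, train_acc, test_acc))
        return test_acc

    def tree_sweep(self, param, values, max_gap=0.5, min_test=0.55):
        # Vary one tree hyperparameter; keep only runs that pass both filters
        kept = []
        for value in values:
            model = DecisionTreeClassifier(random_state=0, **{param: value})
            train_acc, test_acc = self._fit_and_score(model)
            if abs(train_acc - test_acc) < max_gap and test_acc > min_test:
                print("{}={}: train={:.3f}, test={:.3f}".format(param, value, train_acc, test_acc))
                kept.append((value, train_acc, test_acc))
        return kept


# Synthetic stand-in for the project data
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = Classifier(X_train, X_test, y_train, y_test)
nb_score = clf.naive_bayes()
tree_score = clf.decision_tree(max_depth=3)
depth_rows = clf.tree_sweep("max_depth", range(1, 8))
split_rows = clf.tree_sweep("min_samples_split", [2, 5, 10, 20])
```

The same `tree_sweep` call works for `min_samples_leaf`, and the values and test accuracies in the returned rows can be passed to a plotting library to draw the accuracy curves.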

## Step 2
Add the following functionality to the "Classifier" class that you've created above so that it also:
- Builds Gaussian Decision Tree and prints the score (with all hyperparameters as parameters of the function). 
- Builds a Random Forest and prints the score (with all hyperparameters as parameters of the function). 
- Builds a Gradient Boosted Decision Tree and prints the score (with all hyperparameters as parameters of the function).
    - Write a loop to show how the learning rate affects accuracy. Print only the results for which (1) the difference between the training and test accuracy is lower than 0.5, and (2) the test accuracy is higher than 0.55
    - Plot an accuracy curve against the depth to see how it changes the score.
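
The ensemble methods could be added along these lines. To stay self-contained, the sketch defines a small class of its own containing only the new methods; in your notebook they would go into the existing "Classifier" class. Names, data and the small `n_estimators` values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split


class Classifier:
    """Only the ensemble-related methods are shown; names are illustrative."""

    def __init__(self, X_train, X_test, y_train, y_test):
        self.X_train, self.X_test = X_train, X_test
        self.y_train, self.y_test = y_train, y_test

    def _fit_and_score(self, model):
        # Returns (training accuracy, test accuracy) of the fitted model
        model.fit(self.X_train, self.y_train)
        return (model.score(self.X_train, self.y_train),
                model.score(self.X_test, self.y_test))

    def random_forest(self, **params):
        params.setdefault("random_state", 0)
        train_acc, test_acc = self._fit_and_score(RandomForestClassifier(**params))
        print("Random Forest: train={:.3f}, test={:.3f}".format(train_acc, test_acc))
        return test_acc

    def gradient_boosting(self, **params):
        params.setdefault("random_state", 0)
        train_acc, test_acc = self._fit_and_score(GradientBoostingClassifier(**params))
        print("Gradient Boosting: train={:.3f}, test={:.3f}".format(train_acc, test_acc))
        return test_acc

    def learning_rate_sweep(self, rates, max_gap=0.5, min_test=0.55, **params):
        # Vary the learning rate; keep only runs that pass both filters
        kept = []
        for rate in rates:
            model = GradientBoostingClassifier(learning_rate=rate, random_state=0, **params)
            train_acc, test_acc = self._fit_and_score(model)
            if abs(train_acc - test_acc) < max_gap and test_acc > min_test:
                print("learning_rate={}: train={:.3f}, test={:.3f}".format(rate, train_acc, test_acc))
                kept.append((rate, train_acc, test_acc))
        return kept


# Synthetic stand-in for the project data
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = Classifier(X_train, X_test, y_train, y_test)
rf_score = clf.random_forest(n_estimators=30, max_depth=4)
gb_score = clf.gradient_boosting(n_estimators=30, max_depth=2)
lr_rows = clf.learning_rate_sweep([0.01, 0.05, 0.1, 0.5, 1.0], n_estimators=30)
```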

## Step 3
Change several hyperparameters with all tree-based classifiers using a loop and find a way to only print the score result that is higher than X.
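
One way to vary several hyperparameters at once is to loop over every combination with `itertools.product`, as sketched below. The grids, the synthetic data and the threshold value standing in for X are assumptions.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the project data
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Small illustrative grids, one per tree-based classifier
models = {
    "tree": (DecisionTreeClassifier, {"max_depth": [2, 4], "min_samples_leaf": [1, 5]}),
    "forest": (RandomForestClassifier, {"n_estimators": [10, 30], "max_depth": [2, 4]}),
    "boosting": (GradientBoostingClassifier, {"n_estimators": [10, 30], "learning_rate": [0.05, 0.2]}),
}
threshold = 0.7  # stands in for X from the task; pick your own value

hits = []
for name, (model_cls, grid) in models.items():
    keys = list(grid)
    # product(...) yields every combination of the listed values
    for combo in product(*(grid[key] for key in keys)):
        params = dict(zip(keys, combo))
        model = model_cls(random_state=0, **params).fit(X_train, y_train)
        score = model.score(X_test, y_test)
        if score > threshold:  # print only results above the threshold
            print(name, params, round(score, 3))
            hits.append((name, params, score))
```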

## Step 4
Try to predict your target variable by passing some values of the independent variables into your function.
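
A small sketch of predicting from hand-typed values. The helper `predict_fbs`, the two features and the toy training data are hypothetical; the toy target depends only on `chol`, so the predictions are easy to check by eye.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy training data: fbs is 1 exactly when chol is high
train = pd.DataFrame({
    "age": [45, 50, 55, 48, 52, 58],
    "chol": [200, 210, 220, 260, 270, 280],
    "fbs": [0, 0, 0, 1, 1, 1],
})
features = ["age", "chol"]
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(train[features], train["fbs"])


def predict_fbs(model, feature_names, **values):
    """Build a one-row frame from keyword arguments and return the predicted class."""
    row = pd.DataFrame([values], columns=feature_names)
    return int(model.predict(row)[0])


high = predict_fbs(model, features, age=60, chol=300)
low = predict_fbs(model, features, age=40, chol=190)
print(high, low)
```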

# 7. Conclusion

Summarize your **findings**. Did you manage to build a reliable model? What **data preprocessing** strategies and **feature selection** techniques have you used in order to get the best model? Which model has performed the best?

Feel free to share/discuss your findings in our [Slack Channel](https://join.slack.com/t/mlcookbook/shared_invite/zt-eyz4czw4-l95j_2iuETCbVRPpgA3kWA)!

In [None]:
# Answer:

'''

I used X model and achieved Y accuracy...
I believe the model is reliable as I performed X feature selection technique...

'''

# 8.* Advanced Zone (OPTIONAL)

*This section is intended for advanced students or those who are willing to do some additional googling in order to familiarize themselves with potentially new concepts. The steps outlined below are typically used in production data science applications, which is why the ML-Book team thought it was important to include them.*

<a id='section_8_1'></a>

# 8.1* Feature Engineering

*Oftentimes the relationship between our features and target variable is very complex. Thus, it can be fruitful to include some additional features based on already existing ones. In this section we will explore feature engineering for numerical columns only, but there are techniques that can be applied to categorical features as well. You can experiment with transformations that are not listed below as well!*

*   **Step 1**: Generate additional univariate numerical features. Feel free to select any number of features from your dataset to apply any of these transformations.
    *   Power of 2
    *   Square root (watch out for negative values!)
    *   Log transformation (can be applied only to positive values)


*   **Step 2**: Generate additional multivariate numerical features. Feel free to select any number of features from your dataset to apply any of these transformations:
    *   Multiplication of features' values
    *   Ratio of features' values (watch out for zero denominator)
    
    
*   **Step 3**: Generate additional features from categorical features. For every categorical feature from the dataset add the column with [frequency encoded values](https://python-data-science.readthedocs.io/en/latest/preprocess.html#tree-based-models).
    
    
*   **Step 4**: Train any model that was described in this notebook on this extended dataset.


*   **Step 5**: Compare the performance of the model trained on the extended dataset against the models trained on original dataset.

---

## Step 1
Generate additional univariate numerical features. Feel free to select any number of features from your dataset to apply any of these transformations.
- Power of 2
- Square root (watch out for negative values!)
- Log transformation (can be applied only to positive values)
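
The three transformations can be sketched with NumPy as below. The column choices are illustrative, and `log1p` is shown as one common workaround (an assumption, not part of the task) for columns that can be exactly zero.

```python
import numpy as np
import pandas as pd

# Toy data; oldpeak can be 0, chol is strictly positive
df = pd.DataFrame({"chol": [100.0, 400.0], "oldpeak": [0.0, 4.0]})

df["chol_sq"] = df["chol"] ** 2                              # power of 2
df["oldpeak_sqrt"] = np.sqrt(df["oldpeak"].clip(lower=0))    # clip guards against negatives
df["chol_log"] = np.log(df["chol"])                          # valid because chol > 0
df["oldpeak_log1p"] = np.log1p(df["oldpeak"])                # log(1 + x), defined at x = 0
print(df)
```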

## Step 2
Generate additional multivariate numerical features. Feel free to select any number of features from your dataset to apply any of these transformations:
- Multiplication of features' values
- Ratio of features' values (watch out for zero denominator)
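
A sketch of both transformations on toy data. The feature pairs are arbitrary, and turning a zero denominator into NaN is just one possible way to handle it.

```python
import pandas as pd

# Toy data; the second row has a zero in the denominator column
df = pd.DataFrame({
    "age": [50, 60],
    "chol": [200, 250],
    "trestbps": [120, 150],
    "thalach": [150, 0],
})

# Product of two features
df["chol_x_age"] = df["chol"] * df["age"]

# Ratio of two features: where() keeps non-zero denominators and
# turns zeros into NaN, so the division never produces inf
denominator = df["thalach"].where(df["thalach"] != 0)
df["trestbps_per_thalach"] = df["trestbps"] / denominator
print(df)
```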

## Step 3
Generate additional features from categorical features. For every categorical feature from the dataset add the column with [frequency encoded values](https://python-data-science.readthedocs.io/en/latest/preprocess.html#tree-based-models).
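
Frequency encoding replaces each category with how often it occurs. The sketch below uses relative frequencies (raw counts are an equally common variant) on a toy `cp` column; in a real pipeline the frequencies would typically be computed on the training split only.

```python
import pandas as pd

# Toy categorical column
df = pd.DataFrame({"cp": [0, 1, 1, 2, 1]})

# Share of rows per category, then map each row's category to its share
freq = df["cp"].value_counts(normalize=True)
df["cp_freq"] = df["cp"].map(freq)
print(df)
```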

## Step 4
Train any model that was described in this notebook on this extended dataset.

## Step 5
Compare the performance of the model trained on the extended dataset against the models trained on original dataset.

# 8.2* CatBoost

As you have probably noticed, there are quite a few categorical features in our dataset. For this kind of dataset, the CatBoost machine learning model often performs well thanks to its advanced techniques for encoding categorical features. Furthermore, it is very convenient to use, as all those transformations happen under the hood and you don't have to specify them explicitly. 

**Step 1**: Add the following functionality to the "Classifier" class from [section 6](#section_6) that you've created above so that it also builds [CatBoost Classifier](https://catboost.ai/docs/concepts/python-reference_catboostclassifier.html) 

**Step 2**: Pick some hyperparameters of a CatBoost model and find their optimal values using cross-validation. 

**Step 3**: Compare the test score of CatBoost model with other models' scores from [section 6](#section_6)

**Step 4**: Compare the performance of the CatBoost model trained on the dataset that we had before completing [section 8.1*](#section_8_1) and after it.

---

## Step 1
Add the following functionality to the "Classifier" class from [section 6](#section_6) that you've created above so that it also builds [CatBoost Classifier](https://catboost.ai/docs/concepts/python-reference_catboostclassifier.html) 


## Step 2
Pick some hyperparameters of a CatBoost model and find their optimal values using cross-validation. 

## Step 3
Compare the test score of CatBoost model with other models' scores from [section 6](#section_6)

## Step 4
Compare the performance of the CatBoost model trained on the dataset that we had before completing [section 8.1*](#section_8_1) and after it.