# TLM PROJECT | Fasting Blood Sugar Prediction | Level 8/10

**www.thelearningmachine.ai**

In this project, you have to build a model that **predicts whether a patient's fasting blood sugar is > 120 mg/dl**.

The project contains 7 sections in total, each with step-by-step instructions of what to do. Note that as we go further with our lessons, we will step away from guided projects like this one toward "less-guided" ones, with fewer instructions involved. Thus, my advice is to try to understand why we do what we do, and in what order.

## I. Structure
The project is split into **7 sections**, each containing **step-by-step instructions** of what to do. These 7 sections are the following:
1.   Import the Libraries
2.   Import the Datasets
3.   Data Preprocessing
4.   Data Overview
5.   Model Building
6.   Model Evaluation & Hyperparameter Tuning
7.   Conclusion

## II. Data
There are two datasets provided that you should use for this project:
- fbs1.xlsx
- fbs2.xlsx

### > Columns:
- age: age in years
- sex: (1 = male; 0 = female)
- cp: chest pain type
- trestbps: resting blood pressure (in mm Hg on admission to the hospital)
- chol: serum cholesterol in mg/dl
- restecg: resting electrocardiographic results
- thalach: maximum heart rate achieved
- exang: exercise induced angina (1 = yes; 0 = no)
- oldpeak: ST depression induced by exercise relative to rest
- slope: the slope of the peak exercise ST segment
- ca: number of major vessels (0-3) colored by fluoroscopy
- thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
- fbs: (target - fasting blood sugar > 120 mg/dl): 1 = true; 0 = false

### > Description:
Attribute Information: 
> 1. age 
> 2. sex 
> 3. chest pain type (4 values) 
> 4. resting blood pressure 
> 5. serum cholesterol in mg/dl 
> 6. resting electrocardiographic results (values 0,1,2)
> 7. maximum heart rate achieved 
> 8. exercise induced angina 
> 9. oldpeak = ST depression induced by exercise relative to rest 
> 10. the slope of the peak exercise ST segment 
> 11. number of major vessels (0-3) colored by fluoroscopy 
> 12. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
> 13. fasting blood sugar > 120 mg/dl
The names and social security numbers of the patients were recently removed from the database, replaced with dummy values. One file has been "processed", that one containing the Cleveland database. All four unprocessed files also exist in this directory.

# Place For Your Functions

# 1. Import the Libraries

First things first, import the libraries needed (you will also add here any other libraries that you need for this assignment).

# 2. Import the datasets

Do the following:

*   **Step 1**: Understand where the current working directory is
*   **Step 2**: Import two datasets as df1 and df2
*   **Step 3**: Check the shape of each dataset by returning two lines with one print function: 

Data shape of df1 is (X, Y),

Data shape of df2 is (X, Y), 


Use the .format function for that.
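The three steps above could be sketched roughly as follows. This is only one possible approach: the helper name `shape_report` is hypothetical, and the two tiny frames are made-up stand-ins so the snippet runs without the Excel files (the real `pd.read_excel` calls are shown in the comments).

```python
import os

import pandas as pd


def shape_report(df1, df2):
    """Return both shapes as two lines, built with a single str.format call."""
    return "Data shape of df1 is {},\nData shape of df2 is {}".format(
        df1.shape, df2.shape
    )


# Step 1: the current working directory, where relative paths are resolved.
print(os.getcwd())

# Step 2 would normally read the files (an Excel engine such as openpyxl
# must be installed for .xlsx files):
#   df1 = pd.read_excel("fbs1.xlsx")
#   df2 = pd.read_excel("fbs2.xlsx")
# Made-up stand-in frames so that this sketch runs on its own:
df1 = pd.DataFrame({"age": [63, 37, 41], "fbs": [1, 0, 0]})
df2 = pd.DataFrame({"age": [56, 57], "fbs": [0, 0]})

# Step 3: one print call, two lines.
print(shape_report(df1, df2))
```

Returning the string from a function, instead of printing inside it, keeps the formatting easy to check.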

# 3. Data Preprocessing

**Step 1:** Combine two datasets into one

**Step 2**: Create a class called "Prep". Inside that class write functions that: 
- Prints unique values of each column
- Prints data types
- Replaces null values with 0.
- Replaces inappropriate values - "?", "\" and "!" of a certain column with 0.
- Converts words "Null", "One" and "one" in target column into numeric form.
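A minimal sketch of what such a class might look like is given below. The method names, the mapping of "Null" to 0 and "One"/"one" to 1, and the small demo frame are all assumptions made for illustration, not part of the assignment.

```python
import pandas as pd


class Prep:
    """Illustrative preprocessing helper; all method names are hypothetical."""

    def __init__(self, df):
        self.df = df.copy()  # work on a copy, leave the caller's frame intact

    def print_unique(self):
        for col in self.df.columns:
            print(col, self.df[col].unique())

    def print_dtypes(self):
        print(self.df.dtypes)

    def fill_nulls(self):
        self.df = self.df.fillna(0)
        return self.df

    def replace_bad_values(self, column, bad_values=("?", "\\", "!")):
        # Overwrite every cell of `column` that holds a bad value with 0.
        is_bad = self.df[column].isin(list(bad_values))
        self.df[column] = self.df[column].mask(is_bad, 0)
        return self.df

    def encode_target(self, column="fbs", mapping=None):
        # Assumed mapping: "Null" -> 0, "One" / "one" -> 1.
        mapping = mapping or {"Null": 0, "One": 1, "one": 1}
        self.df[column] = self.df[column].map(lambda v: mapping.get(v, v))
        return self.df


# Made-up demo data containing each kind of problem.
raw = pd.DataFrame(
    {"chol": [233, "?", None, "!"], "fbs": ["One", "Null", 0, "one"]}
)
prep = Prep(raw)
prep.fill_nulls()
prep.replace_bad_values("chol")
prep.encode_target("fbs")
prep.print_unique()
prep.print_dtypes()
```

After cleaning, a column may still have the `object` dtype; `pd.to_numeric` is one way to convert it in Step 3.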

**Step 3**: Change data types if needed.


**Step 4**: Now you have to preprocess the id column. Something messed it up: it should be in the form [ letter "t", "l" or "m" ] + [number], e.g. "t18", "l891" or "m142". So how do you make, for instance, "m142" out of "9867r13m12e1_142"? Take the "l", "m" or "t" from that id (note that the ids contain several other letters as well) plus the number, which is the digits coming after the underscore "_". Here are some examples of the transformation: 
- "9867t13e12r1_92" -> "t92". Even though we see two other letters in this id ("r" and "e"), the letter "t" is the one to take.
- "9867r13e12m1_1203" -> "m1203"
- "343e2832j093k38042t8920402n_778" -> "t778"

Create a for loop that does this for all the cells inside the id column.
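One way to express the rule described in Step 4 is sketched below. The function name `clean_id` is hypothetical, and two behaviours are assumptions not stated in the task: the first "t", "l" or "m" found before the underscore is used, and ids that do not fit the pattern are left unchanged.

```python
import re


def clean_id(raw_id):
    """Turn an id such as '9867r13m12e1_142' into 'm142'."""
    # Everything after the last underscore is expected to be the number.
    prefix, _, number = raw_id.rpartition("_")
    # Assumption: use the first 't', 'l' or 'm' that appears before it.
    letter = re.search(r"[tlm]", prefix)
    if letter is None or not number.isdigit():
        return raw_id  # assumption: leave unexpected ids untouched
    return letter.group() + number


raw_ids = [
    "9867r13m12e1_142",
    "9867t13e12r1_92",
    "9867r13e12m1_1203",
    "343e2832j093k38042t8920402n_778",
]

# The for loop requested in the task, here over a plain list.
cleaned = []
for raw_id in raw_ids:
    cleaned.append(clean_id(raw_id))

print(cleaned)
```

For a DataFrame column, the same loop could run over `df.index` and assign `df.loc[i, "id"] = clean_id(df.loc[i, "id"])`.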

**Step 5**: Validate the data: check the dtypes, the presence of wrong values ("?", "!"), null values etc., and the unique values of each column.

## Step 1

## Step 3

## Step 4

# 4. Data Overview

Observe the data:

*   **Step 1**: Find out what the mean of trestbps and chol is for people with and without heart disease. 
*   **Step 2**: Find out what the mean of thalach is for people with and without heart disease.
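Group-wise means like these are usually a one-line `groupby` in pandas, as in the sketch below. Note the assumption: the column list in section II contains no heart-disease label, so the grouping column is left as a parameter and the demo simply uses `fbs` as a stand-in. The helper name `group_means` and the demo values are made up.

```python
import pandas as pd


def group_means(df, group_col, value_cols):
    """Mean of each column in value_cols, per distinct value of group_col."""
    return df.groupby(group_col)[value_cols].mean()


# Made-up demo data; replace with the combined, cleaned dataset.
demo = pd.DataFrame(
    {
        "fbs": [0, 0, 1, 1],
        "trestbps": [120, 130, 140, 150],
        "chol": [200, 220, 240, 260],
        "thalach": [170, 150, 140, 160],
    }
)

# Step 1: trestbps and chol per group.
print(group_means(demo, "fbs", ["trestbps", "chol"]))

# Step 2: thalach per group.
print(group_means(demo, "fbs", ["thalach"]))
```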

## Step 1

## Step 2

# 5. Model Building

Do the following:
  
*   **Step 1**: Identify the X variables that are the most significant indicators for predicting fbs. Set fbs as the y variable.
*   **Step 2**: Split the data into train and test sets.
*   **Step 3**: Choose any classifier. The function has to take the n hyperparameters that you would like to have.
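Steps 2 and 3 might look roughly like the sketch below, assuming scikit-learn is used. The function name `fit_tree`, the choice of a decision tree, the 75/25 split and the synthetic data from `make_classification` are all illustrative assumptions; with the real data, X would be the selected feature columns and y the fbs column.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def fit_tree(X_train, y_train, X_test, y_test,
             max_depth=None, min_samples_split=2, min_samples_leaf=1,
             random_state=0):
    """Fit a decision tree; return the model with train and test accuracy."""
    model = DecisionTreeClassifier(
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        min_samples_leaf=min_samples_leaf,
        random_state=random_state,
    )
    model.fit(X_train, y_train)
    return model, model.score(X_train, y_train), model.score(X_test, y_test)


# Synthetic stand-in data so that the sketch runs on its own.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Step 2: hold out 25% of the rows, keeping the class balance (stratify).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Step 3: a classifier whose hyperparameters are function parameters.
model, train_acc, test_acc = fit_tree(
    X_train, y_train, X_test, y_test, max_depth=4
)
print("train accuracy: {:.3f}, test accuracy: {:.3f}".format(train_acc, test_acc))
```

Fixing `random_state` in both the split and the model makes the scores reproducible between runs.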

## Step 1


## Step 2

## Step 3

# 6. Model Evaluation & Hyperparameter Tuning

**Step 1**: Create a class "Classifier". Inside that class write functions that:

- Builds Decision Tree and prints the score (with all hyperparameters as parameters of the function). 
    - Write a loop to show how the depth of the tree affects accuracy. Print only the results where (1) the difference between the accuracy on the training and test sets is lower than 0.5, and (2) the accuracy on the test set is higher than 0.55.
    - Write a loop to show how the min observations on a node and the min observations on each leaf affect accuracy. Print only the results where (1) the difference between the accuracy on the training and test sets is lower than 0.5, and (2) the accuracy on the test set is higher than 0.55.
    - Plot any accuracy curve in relation to the depth to see how it changes the score.
    - Plot any accuracy curve in relation to the min observations on a splittable node to see how it changes the score.
- Builds Gradient Boosted Decision Tree and prints the score (with all hyperparameters as parameters of the function).
    - Write a loop to show how the learning rate affects accuracy. Print only the results where (1) the difference between the accuracy on the training and test sets is lower than 0.5, and (2) the accuracy on the test set is higher than 0.55.
    - Plot any accuracy curve in relation to the depth to see how it changes the score.
- Builds Random Forest and prints the score (with all hyperparameters as parameters of the function). 
- Builds Gaussian Decision Tree and prints the score (with all hyperparameters as parameters of the function). 
- Builds Naive Bayes Classifier (think which one you need to build: Gaussian, Multinomial or Bernoulli) and prints the score (with all hyperparameters as parameters of the function).
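The decision-tree part of such a class, including the depth loop with its two print conditions, could be sketched as follows. This assumes scikit-learn; the method names, the default thresholds as parameters, and the synthetic data are illustrative only, and the other classifiers and the plots are left out.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


class Classifier:
    """Illustrative skeleton; only the decision-tree methods are shown."""

    def __init__(self, X_train, X_test, y_train, y_test):
        self.X_train, self.X_test = X_train, X_test
        self.y_train, self.y_test = y_train, y_test

    def _scores(self, model):
        # Fit the model, then return (train accuracy, test accuracy).
        model.fit(self.X_train, self.y_train)
        return (model.score(self.X_train, self.y_train),
                model.score(self.X_test, self.y_test))

    def decision_tree(self, max_depth=None, min_samples_split=2,
                      min_samples_leaf=1, random_state=0):
        train_acc, test_acc = self._scores(DecisionTreeClassifier(
            max_depth=max_depth, min_samples_split=min_samples_split,
            min_samples_leaf=min_samples_leaf, random_state=random_state))
        print("train {:.3f} | test {:.3f}".format(train_acc, test_acc))
        return train_acc, test_acc

    def depth_sweep(self, depths, max_gap=0.5, min_test=0.55):
        kept = []
        for depth in depths:
            train_acc, test_acc = self._scores(
                DecisionTreeClassifier(max_depth=depth, random_state=0))
            # Keep a result only if the train/test gap is small enough
            # and the test accuracy is high enough.
            if abs(train_acc - test_acc) < max_gap and test_acc > min_test:
                kept.append((depth, train_acc, test_acc))
                print("depth {}: train {:.3f} | test {:.3f}".format(
                    depth, train_acc, test_acc))
        return kept


# Synthetic stand-in data so that the sketch runs on its own.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
clf = Classifier(*train_test_split(X, y, test_size=0.25, random_state=0))
clf.decision_tree(max_depth=3)
kept = clf.depth_sweep(range(1, 11))
```

Because `depth_sweep` returns the kept rows, the same list can later feed a plot of accuracy against depth.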

**Step 2**: Change several hyperparameters with all tree-based classifiers using a loop and find a way to only print the score result that is higher than X.
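A possible shape for this loop is sketched below, assuming scikit-learn. The function name `scores_above`, the two hyperparameters being varied, their value grids and the synthetic data are assumptions chosen for illustration; X is simply passed in as `threshold`.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def scores_above(X_train, X_test, y_train, y_test, threshold,
                 depths=(2, 4, 6), leaf_sizes=(1, 5)):
    """Try each tree-based model with each setting; keep test scores > threshold."""
    models = {
        "decision tree": DecisionTreeClassifier,
        "random forest": RandomForestClassifier,
        "gradient boosting": GradientBoostingClassifier,
    }
    hits = []
    for (name, cls), depth, leaf in product(models.items(), depths, leaf_sizes):
        model = cls(max_depth=depth, min_samples_leaf=leaf, random_state=0)
        score = model.fit(X_train, y_train).score(X_test, y_test)
        if score > threshold:  # print only results above the threshold
            hits.append((name, depth, leaf, score))
            print("{} (max_depth={}, min_samples_leaf={}): {:.3f}".format(
                name, depth, leaf, score))
    return hits


# Synthetic stand-in data so that the sketch runs on its own.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
hits = scores_above(X_train, X_test, y_train, y_test, threshold=0.6)
```

`itertools.product` replaces three nested for loops; writing the nested loops explicitly would work just as well.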

**Step 3**: Try to predict your target variable by passing values of some independent variables into your function. 
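As a rough illustration of this step, the sketch below wraps a fitted model in a function that takes feature values by name. The function name `predict_fbs`, the three chosen features and the six training rows are all made up for the example; a real solution would use the model and features selected in section 5.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier


def predict_fbs(model, feature_names, **values):
    """Predict the target for one patient described by keyword arguments."""
    # Build a one-row frame whose columns match the training columns.
    row = pd.DataFrame(
        [[values[name] for name in feature_names]], columns=feature_names
    )
    return int(model.predict(row)[0])


# Made-up toy training data, for illustration only.
features = ["age", "trestbps", "chol"]
train = pd.DataFrame(
    {
        "age": [45, 50, 62, 67, 38, 71],
        "trestbps": [120, 125, 150, 160, 110, 155],
        "chol": [200, 210, 280, 300, 190, 290],
        "fbs": [0, 0, 1, 1, 0, 1],
    }
)
model = DecisionTreeClassifier(random_state=0).fit(train[features], train["fbs"])

print(predict_fbs(model, features, age=66, trestbps=158, chol=295))
```

Passing a DataFrame with the same column names as the training data avoids mixing up the order of the features.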

## Step 1

## Step 2

## Step 3

# 7. Conclusion

Conclude whether your model works correctly. Explain how you would do a sanity check.

Answer: