## Data Cleaning & Pre-processing
![data-cleaning-in-python](https://daxg39y63pxwu.cloudfront.net/images/blog/data-cleaning-in-python/data-cleaning-in-python.png)

The first step of an analytics project is to clean the datasets and pre-process them so that they are suitable for analytical models and visualization.


In [None]:
# import basic libraries
import pandas as pd
import numpy as np
import seaborn as sb

### 1. Import dataset into the notebook

In [None]:
train_df = pd.read_csv('datasets/aug_train.csv')
train_df.info()

In [None]:
test_df = pd.read_csv('datasets/aug_test.csv')
test_df.info()

- Combine the train and test datasets and reshuffle the rows
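
A minimal sketch of combining and reshuffling, assuming both frames share the same columns. The toy frames and their column names below are hypothetical stand-ins for `train_df` and `test_df`; if the real test file lacks a column that the train file has, `pd.concat` fills that column with `NaN` for the test rows.

```python
import pandas as pd

# Toy stand-ins for train_df and test_df (hypothetical columns and values)
train_df = pd.DataFrame({"id": [1, 2, 3], "target": [0, 1, 0]})
test_df = pd.DataFrame({"id": [4, 5], "target": [1, 0]})

# Stack the two frames on top of each other with a fresh 0..n-1 index
combined_df = pd.concat([train_df, test_df], ignore_index=True)

# Shuffle all rows; random_state makes the shuffle repeatable
shuffled_df = combined_df.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled_df.shape)
```

Fixing `random_state` is a choice made here so that later train/test splits can be reproduced; drop it if a different order on every run is wanted.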

(To write description)

**About this dataset**
- `Age` : Age of the patient
- `Sex` : Sex of the patient
- `exng`: exercise induced angina (1 = yes; 0 = no)
- `caa`: number of major vessels (0-3)
- `cp` : chest pain type
  - Value 1: typical angina
  - Value 2: atypical angina
  - Value 3: non-anginal pain
  - Value 4: asymptomatic
- `trtbps` : resting blood pressure (in mm Hg)
- `chol` : cholesterol in mg/dl fetched via BMI sensor
- `fbs` : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
- `restecg` : resting electrocardiographic results
  - Value 0: normal
  - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
  - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
- `thalach` : maximum heart rate achieved
- `thall` : Thal rate
- `output` : 0 = less chance of heart attack; 1 = more chance of heart attack

### 2. Data Cleaning & Pre-processing 

#### **2.1 Removing rows with NA values (if they exist)**
 - Alternatively, use an algorithm to fill in missing values (e.g. KNN imputation)
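
Both options can be sketched on a toy frame. This is only an illustration: the column names, values and `n_neighbors=2` are assumptions, and `KNNImputer` from scikit-learn works on numeric columns only, so categorical columns would need encoding first.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy numeric frame with two gaps (hypothetical values)
df = pd.DataFrame({
    "age": [63.0, 37.0, np.nan, 56.0, 57.0],
    "chol": [233.0, 250.0, 204.0, np.nan, 354.0],
})

# Option A: drop every row that has at least one missing value
dropped_df = df.dropna().reset_index(drop=True)

# Option B: fill each gap using the values of the k most similar rows
imputer = KNNImputer(n_neighbors=2)
imputed_df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print("rows after dropna:", len(dropped_df))
print("missing after imputation:", int(imputed_df.isna().sum().sum()))
```

Dropping rows is simpler but discards data; imputation keeps every row at the cost of introducing estimated values.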

#### **2.2 Cleaning up continuous variables**

- Filter out outliers using a confidence interval (95%?)
- Analyse each column separately
- Give reasons for dropping or keeping the outliers
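
One possible reading of the 95% idea is to keep only the central 95% of each column, i.e. values between the 2.5th and 97.5th percentiles. The sketch below takes that reading as an assumption; strictly speaking this is a percentile range rather than a confidence interval, and `filter_central_range` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

# Toy column with one obvious outlier (hypothetical values)
df = pd.DataFrame({"chol": [200, 210, 190, 205, 195, 215, 185, 220, 180, 900]})


def filter_central_range(frame, column, lower_q=0.025, upper_q=0.975):
    """Keep rows whose value in `column` lies between the two quantiles."""
    lower = frame[column].quantile(lower_q)
    upper = frame[column].quantile(upper_q)
    mask = frame[column].between(lower, upper)
    return frame[mask]


filtered_df = filter_central_range(df, "chol")
print(len(df), "->", len(filtered_df))
```

A quantile cut-off always trims both tails, even when a column has no real outliers, which is one reason to inspect each column before deciding whether to drop anything.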



#### **2.3 Cleaning up categorical variables**

- Are the categorical variables consistent in their values?
- Are there missing values?
- Are there unreasonable values (text?)
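
These checks can be sketched for a single column as follows. The `Sex` values and the `expected_values` set are hypothetical; in practice the valid labels would come from the dataset description.

```python
import numpy as np
import pandas as pd

# Toy column with inconsistent spelling, a missing value and a bad entry
df = pd.DataFrame({"Sex": ["Male", "male ", "Female", np.nan, "F3male"]})
expected_values = {"Male", "Female"}  # hypothetical list of valid labels

# Normalise whitespace and capitalisation before comparing labels
cleaned = df["Sex"].str.strip().str.capitalize()

# Count each label, including missing values, to spot inconsistencies
print(cleaned.value_counts(dropna=False))

# Number of missing entries and labels outside the expected set
n_missing = int(cleaned.isna().sum())
unexpected = sorted(set(cleaned.dropna()) - expected_values)
print("missing:", n_missing, "unexpected:", unexpected)
```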

#### **2.4 Encoding nominal (unordered) categorical variables using `OneHotEncoding` for predictors & `Integer Encoding` for response**
The `?` dataset contains ? categorical predictor variables:
- xx
- xx
- xx

And 1 categorical response variable:
- target


**2.4 a) OneHotEncoding**

In [None]:
# Import the OneHotEncoder from sklearn
# (the commented-out lines below are an example provided as a template)
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()

# OneHotEncoding of categorical predictors

#cat_variables = [
#    'Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking',
#    'Sex', 'AgeCategory', 'Race', 'Diabetic', 'PhysicalActivity',
#    'GenHealth', 'Asthma', 'KidneyDisease', 'SkinCancer'
#]

#heart_pki_cat = heart_pki_original_df[cat_variables]

#ohe.fit(heart_pki_cat)
#heart_pki_cat_ohe = pd.DataFrame(
#    ohe.transform(heart_pki_cat).toarray(),
#    columns=ohe.get_feature_names_out(heart_pki_cat.columns))

# Check the encoded variables
#heart_pki_cat_ohe.info()

**2.4 b) Combine encoded dataframe with continuous variables**
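
A minimal sketch of this step, assuming the encoded frame and the continuous frame have the same number of rows in the same order. The frame and column names are hypothetical.

```python
import pandas as pd

# Toy continuous and one-hot encoded frames (hypothetical columns)
continuous_df = pd.DataFrame({"BMI": [22.5, 30.1, 27.8]})
encoded_df = pd.DataFrame({
    "Sex_Female": [1.0, 0.0, 1.0],
    "Sex_Male": [0.0, 1.0, 0.0],
})

# pd.concat aligns on the index, so reset both indexes first;
# otherwise mismatched indexes would produce NaN-padded rows
combined_df = pd.concat(
    [continuous_df.reset_index(drop=True), encoded_df.reset_index(drop=True)],
    axis=1,
)

print(combined_df.columns.tolist())
```

Resetting the index matters mainly when rows were dropped during cleaning, since the encoded frame built from `ohe.transform(...)` starts again at index 0.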

**2.4 c) Export encoded `xxx` dataframe as csv**

In [None]:
#??.to_csv('datasets/result.csv', index=False)

#### **2.5 For correlation, do `Integer Encoding` for categorical variable with strings and categorical response**
The `xx` dataset contains x categorical predictor variables that are strings:
- 

**2.5 a) IntegerEncoding**

Example (to be edited): 
For `HeartDisease`, `Smoking`, `AlcoholDrinking`, `Stroke`, `DiffWalking`, `Diabetic`, `PhysicalActivity`, `Asthma`, `KidneyDisease`, `SkinCancer`:
- Yes: 1
- No: 0

In [None]:
heart_pki_cor_df = heart_pki_original_df.copy()

mapping = {
    'HeartDisease': {
        "Yes": 1,
        "No": 0
    },
    'Smoking': {
        "Yes": 1,
        "No": 0
    },
    'AlcoholDrinking': {
        "Yes": 1,
        "No": 0
    },
    'Stroke': {
        "Yes": 1,
        "No": 0
    },
    'DiffWalking': {
        "Yes": 1,
        "No": 0
    },
    'Diabetic': {
        "Yes": 2,
        "Yes (during pregnancy)": 2,
        "No, borderline diabetes": 1,
        "No": 0
    },
    'PhysicalActivity': {
        "Yes": 1,
        "No": 0
    },
    'Asthma': {
        "Yes": 1,
        "No": 0
    },
    'KidneyDisease': {
        "Yes": 1,
        "No": 0
    },
    'SkinCancer': {
        "Yes": 1,
        "No": 0
    },
    'Sex': {
        "Female": 1,
        "Male": 0
    },
    'AgeCategory': {
        "18-24": 0,
        "25-29": 1,
        "30-34": 2,
        "35-39": 3,
        "40-44": 4,
        "45-49": 5,
        "50-54": 6,
        "55-59": 7,
        "60-64": 8,
        "65-69": 9,
        "70-74": 10,
        "75-79": 11,
        "80 or older": 12
    },
    'Race': {
        "White": 0,
        "Black": 1,
        "Asian": 2,
        "American Indian/Alaskan Native": 3,
        "Hispanic": 4,
        "Other": 5
    },
    'GenHealth': {
        "Poor": 0,
        "Fair": 1,
        "Good": 2,
        "Very good": 3,
        "Excellent": 4
        
    }
}

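# Replace each string label with its integer code, one column at a time
for column, codes in mapping.items():
    for label, code in codes.items():
        # Select the rows that still hold this label and overwrite them
        condition = heart_pki_cor_df[column] == label
        heart_pki_cor_df.loc[condition, column] = code

    # Show the remaining unique values to confirm the column is fully encoded
    print(column, ':', heart_pki_cor_df[column].unique())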



**2.5 b) Export encoded `xx` dataframe as csv**

In [1]:
#??.to_csv('datasets/??_correlation.csv', index=False)

---

#### Datasets created from this notebook:

    .
    └── heart_pki_2020_original.csv           # original dataset
        ├── heart_pki_2020_cleaned.csv        # for EDA and visualization
        └── heart_pki_2020_encoded.csv        # for analytical models (OneHotEncoding done)

 