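# Overview
Feature preprocessing. Data analysis distinguishes four basic data types: nominal, ordinal, interval, and ratio. This notebook converts the nominal features to one-hot encoding.
## Import packages and the dataset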

In [None]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd

In [None]:
df = pd.read_csv('./data/heart.csv')

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df.dtypes

## Rename the abbreviated columns to full feature names

In [None]:
df.columns = [
    "age",
    "sex",
    "chest_pain_type",
    "resting_blood_pressure",
    "cholesterol",
    "fasting_blood_sugar",
    "rest_ecg",
    "max_heart_rate_achieved",
    "exercise_induced_angina",
    "st_depression",
    "st_slope",
    "num_major_vessels",
    "thalassemia",
    "target"
]

df.head()
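
Assigning a list to `df.columns` only works if the columns arrive in exactly this order. A more defensive sketch renames by key with `DataFrame.rename`. The short names used below (`cp`, `trestbps`, `thalach`) are an assumption about the headers in `heart.csv`, which this notebook does not show, and the toy frame is purely illustrative.

```python
import pandas as pd

# Hypothetical short headers; check them against df.head() on the real file.
short_to_full = {
    "cp": "chest_pain_type",
    "trestbps": "resting_blood_pressure",
    "thalach": "max_heart_rate_achieved",
}

# Toy frame standing in for the raw data, with columns in a different order.
toy = pd.DataFrame({"thalach": [150], "cp": [3], "trestbps": [145]})

# errors="raise" makes a key that is absent from the frame fail with a KeyError
# instead of being silently skipped.
renamed = toy.rename(columns=short_to_full, errors="raise")
print(list(renamed.columns))
```

Because the mapping is keyed by name, a reordered input file still gets the right labels, and an unexpected header surfaces as an error.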

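## Distinguish the four feature types: nominal, ordinal, interval, ratio
Feature type | Description | Example | Operations
-: | -: | -: | -:
Nominal data | Discrete values with no order | Colour (red, blue, yellow, green) | Equality tests only
Ordinal data | Discrete values with an order | Education level (university, secondary school, primary school) | Nominal operations plus ranking
Interval data | Continuous values that can be compared and subtracted, but whose ratios are not meaningful because 0 is not a true zero point | Degrees Celsius, earthquake magnitude | Ordinal operations plus addition and subtraction
Ratio data | Continuous values whose ratios are meaningful because 0 is a true zero point | Age, weight | Interval operations plus multiplication and division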
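
The operations column can be illustrated with a small pandas sketch. The values are invented for illustration and are not taken from the heart dataset; representing nominal and ordinal data with `pd.CategoricalDtype` is one possible choice, not something this notebook does.

```python
import pandas as pd

# Nominal: unordered categories. Equality works, ordering comparisons do not.
colour = pd.Series(["red", "blue", "red"], dtype="category")
same_colour = colour[0] == colour[2]
try:
    colour < "blue"
    nominal_can_order = True
except TypeError:
    nominal_can_order = False

# Ordinal: ordered categories. Ranking and comparison are meaningful.
levels = pd.CategoricalDtype(["primary", "secondary", "university"], ordered=True)
education = pd.Series(["primary", "university", "secondary"]).astype(levels)
highest = education.max()
ranked = education.sort_values().tolist()

# Interval: differences survive a change of zero point, ratios do not.
celsius = pd.Series([10.0, 20.0])
kelvin = celsius + 273.15
diff_c, diff_k = celsius[1] - celsius[0], kelvin[1] - kelvin[0]
ratio_c, ratio_k = celsius[1] / celsius[0], kelvin[1] / kelvin[0]

# Ratio: zero is a true zero, so "twice as old" is a meaningful statement.
age = pd.Series([30, 60])
age_ratio = age[1] / age[0]

print(same_colour, nominal_can_order, highest, ranked)
print(diff_c, round(diff_k, 6), ratio_c, round(ratio_k, 3), age_ratio)
```

The temperature difference is 10 degrees on both scales, while the ratio changes from 2.0 in Celsius to roughly 1.035 in Kelvin, which is why ratios of interval data carry no meaning.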

## Convert nominal features from integer codes to their string labels

In [None]:
# Integer code -> label for each nominal feature.
category_labels = {
    "sex": {0: "female", 1: "male"},
    "chest_pain_type": {
        0: "typical angina",
        1: "atypical angina",
        2: "non-anginal pain",
        3: "asymptomatic",
    },
    "fasting_blood_sugar": {0: "lower than 120 mg/dl", 1: "greater than 120 mg/dl"},
    "rest_ecg": {
        0: "normal",
        1: "ST-T wave abnormality",
        2: "left ventricular hypertrophy",
    },
    "exercise_induced_angina": {0: "no", 1: "yes"},
    "st_slope": {0: "upsloping", 1: "flat", 2: "downsloping"},
    "thalassemia": {0: "unknown", 1: "normal", 2: "fixed defect", 3: "reversible defect"},
    # "target": {0: "No Heart Disease", 1: "Heart Disease"},
}

# Assign whole columns instead of using chained indexing (df[col][mask] = value),
# which pandas warns about and which may not modify df at all.
# Series.map turns any code that is missing from the mapping into NaN.
for column, labels in category_labels.items():
    df[column] = df[column].map(labels)

df.head()

In [None]:
df.dtypes
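
When integer codes are translated through a dictionary lookup such as `Series.map`, a code with no entry in the dictionary becomes `NaN` rather than raising an error. The following sketch shows one way to detect such codes before encoding. The column values, including the invalid code `7`, are made up for illustration.

```python
import pandas as pd

labels = {0: "upsloping", 1: "flat", 2: "downsloping"}

# Toy column; 7 is a deliberately invalid code.
codes = pd.Series([0, 2, 1, 7], name="st_slope")

mapped = codes.map(labels)

# Codes whose lookup produced NaN have no label in the dictionary.
unmapped = sorted(codes[mapped.isna()].unique().tolist())
print(unmapped)  # [7]
```

Running a check like this on each converted column makes gaps in the label dictionaries visible instead of letting them turn into missing values.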

## Convert the discrete nominal and ordinal feature columns to one-hot encoding

In [None]:
df = pd.get_dummies(df)
df.columns

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.iloc[0]
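
To see what `pd.get_dummies` does without loading the dataset, here is a minimal sketch on a two-row toy frame. The values are invented, and `dtype=int` is an optional choice made here so the indicators print as 0/1 instead of booleans.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [63, 41],                      # numeric column: passed through unchanged
    "sex": ["male", "female"],            # string column: expanded into indicators
    "st_slope": ["flat", "upsloping"],    # string column: expanded into indicators
})

encoded = pd.get_dummies(toy, dtype=int)

print(list(encoded.columns))
# ['age', 'sex_female', 'sex_male', 'st_slope_flat', 'st_slope_upsloping']
print(encoded.iloc[0].to_dict())
# {'age': 63, 'sex_female': 0, 'sex_male': 1, 'st_slope_flat': 1, 'st_slope_upsloping': 0}
```

Each string column is replaced by one indicator column per distinct value, named `<column>_<value>`, which is why the column count grows after encoding.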

## Export the processed dataset to a CSV file

In [None]:
df.to_csv("./data/process_heart.csv", index=False)
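
The `index=False` argument matters when the file is read back. This sketch uses an in-memory buffer and a made-up two-row frame instead of the real file to show the difference.

```python
import io

import pandas as pd

toy = pd.DataFrame({"age": [63, 41], "sex_male": [1, 0]})

# With no path argument, to_csv returns the CSV text as a string.
with_index = toy.to_csv()
without_index = toy.to_csv(index=False)

cols_with = pd.read_csv(io.StringIO(with_index)).columns.tolist()
cols_without = pd.read_csv(io.StringIO(without_index)).columns.tolist()

print(cols_with)     # ['Unnamed: 0', 'age', 'sex_male']
print(cols_without)  # ['age', 'sex_male']
```

Writing the row index would add a spurious `Unnamed: 0` feature column to the file when it is loaded again.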