# DAY5

## Handling Discrete Features

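Today's task breaks down into the following steps:
1. Read the data
2. Find all discrete features
3. Pick one discrete feature and one-hot encode it
4. Use a loop to one-hot encode all discrete features
5. Combine this with yesterday's material and fill all missing values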

In [1]:
# Read the data
import pandas as pd
data = pd.read_csv('data.csv')  # data is now a DataFrame object

In [2]:
# As covered on Day 4, the column names of a DataFrame are available through the data.columns attribute.
data.columns

Index(['Id', 'Home Ownership', 'Annual Income', 'Years in current job',
       'Tax Liens', 'Number of Open Accounts', 'Years of Credit History',
       'Maximum Open Credit', 'Number of Credit Problems',
       'Months since last delinquent', 'Bankruptcies', 'Purpose', 'Term',
       'Current Loan Amount', 'Current Credit Balance', 'Monthly Debt',
       'Credit Score', 'Credit Default'],
      dtype='object')

In [3]:
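# Print the names of all discrete variables
# In Python, variable names are usually built from English words joined by underscores rather than pinyin; this convention makes the code easier for others to read and understand.
# "Continuous" and "discrete" are the English terms for the two kinds of variables.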
for discrete_features in data.columns:
    if data[discrete_features].dtype == 'object':
        print(discrete_features)

Home Ownership
Years in current job
Purpose
Term
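
The loop above can also be written with `select_dtypes`, which filters columns by dtype in one call. The sketch below uses a small made-up frame in place of `data.csv`; the name `toy` and its values are assumptions for illustration only, and the text columns are created with `dtype=object` to mirror the dtypes shown in the notebook output.

```python
import pandas as pd

# Toy frame standing in for data.csv (values are made up for illustration).
toy = pd.DataFrame({
    "Id": [0, 1, 2],
    "Home Ownership": pd.Series(["Own Home", "Rent", "Home Mortgage"], dtype=object),
    "Annual Income": [482087.0, 1025487.0, None],
    "Term": pd.Series(["Short Term", "Long Term", "Short Term"], dtype=object),
})

# Keep only the columns whose dtype is object, then read off their names.
discrete_columns = toy.select_dtypes(include="object").columns.tolist()
print(discrete_columns)  # ['Home Ownership', 'Term']
```

The result is a plain list of column names, so it can be passed straight to the `columns` argument of `pd.get_dummies`.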


In [4]:
# Take Home Ownership as an example and print it for inspection
data['Home Ownership']

0            Own Home
1            Own Home
2       Home Mortgage
3            Own Home
4                Rent
            ...      
7495             Rent
7496    Home Mortgage
7497             Rent
7498    Home Mortgage
7499             Rent
Name: Home Ownership, Length: 7500, dtype: object

In [5]:
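# This column needs to be encoded, so first look at the values it takes
# The value_counts() method counts the occurrences of each category and returns a Series, giving a quick view of how the categories are distributed.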
data['Home Ownership'].value_counts()

Home Ownership
Home Mortgage    3637
Rent             3204
Own Home          647
Have Mortgage      12
Name: count, dtype: int64

1. Home Mortgage: home with a mortgage loan
2. Rent: renting
3. Own Home: owns their home
4. Have Mortgage: has a mortgage

These categories have no natural order, so one-hot encoding is appropriate.
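
To make the idea concrete, here is a minimal sketch of what one-hot encoding does to an unordered column. The series `ownership` is made-up toy data, not the real dataset. Each category becomes its own indicator column, so no artificial ordering (such as 0 < 1 < 2) is introduced.

```python
import pandas as pd

# Toy column with three unordered categories (made-up values).
ownership = pd.Series(
    ["Rent", "Own Home", "Rent", "Home Mortgage"], name="Home Ownership"
)

# One indicator column per category; columns come out in sorted order.
one_hot = pd.get_dummies(ownership)
print(one_hot)

# Every row belongs to exactly one category, so each row sums to 1.
rows_have_single_flag = bool((one_hot.sum(axis=1) == 1).all())
print(rows_have_single_flag)
```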

In [6]:
# One-hot encode the Home Ownership column
data = pd.get_dummies(data, columns=['Home Ownership'])
data.columns

Index(['Id', 'Annual Income', 'Years in current job', 'Tax Liens',
       'Number of Open Accounts', 'Years of Credit History',
       'Maximum Open Credit', 'Number of Credit Problems',
       'Months since last delinquent', 'Bankruptcies', 'Purpose', 'Term',
       'Current Loan Amount', 'Current Credit Balance', 'Monthly Debt',
       'Credit Score', 'Credit Default', 'Home Ownership_Have Mortgage',
       'Home Ownership_Home Mortgage', 'Home Ownership_Own Home',
       'Home Ownership_Rent'],
      dtype='object')

As shown above, the original Home Ownership column has been replaced by 'Home Ownership_Have Mortgage', 'Home Ownership_Home Mortgage', 'Home Ownership_Own Home', and 'Home Ownership_Rent'.

In [7]:
data.head()

| | Id | Annual Income | Years in current job | Tax Liens | Number of Open Accounts | Years of Credit History | Maximum Open Credit | Number of Credit Problems | Months since last delinquent | Bankruptcies | ... | Term | Current Loan Amount | Current Credit Balance | Monthly Debt | Credit Score | Credit Default | Home Ownership_Have Mortgage | Home Ownership_Home Mortgage | Home Ownership_Own Home | Home Ownership_Rent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 482087.0 | NaN | 0.0 | 11.0 | 26.3 | 685960.0 | 1.0 | NaN | 1.0 | ... | Short Term | 99999999.0 | 47386.0 | 7914.0 | 749.0 | 0 | False | False | True | False |
| 1 | 1 | 1025487.0 | 10+ years | 0.0 | 15.0 | 15.3 | 1181730.0 | 0.0 | NaN | 0.0 | ... | Long Term | 264968.0 | 394972.0 | 18373.0 | 737.0 | 1 | False | False | True | False |
| 2 | 2 | 751412.0 | 8 years | 0.0 | 11.0 | 35.0 | 1182434.0 | 0.0 | NaN | 0.0 | ... | Short Term | 99999999.0 | 308389.0 | 13651.0 | 742.0 | 0 | False | True | False | False |
| 3 | 3 | 805068.0 | 6 years | 0.0 | 8.0 | 22.5 | 147400.0 | 1.0 | NaN | 1.0 | ... | Short Term | 121396.0 | 95855.0 | 11338.0 | 694.0 | 0 | False | False | True | False |
| 4 | 4 | 776264.0 | 8 years | 0.0 | 13.0 | 13.6 | 385836.0 | 1.0 | NaN | 0.0 | ... | Short Term | 125840.0 | 93309.0 | 7180.0 | 719.0 | 0 | False | False | False | True |


In [8]:
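# The one-hot encoded columns above are bool; convert them to int, because some functions used later may not support bool values
# This is also a chance to learn how type conversion works
data['Home Ownership_Have Mortgage'] = data['Home Ownership_Have Mortgage'].astype(int)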
data['Home Ownership_Have Mortgage']


0       0
1       0
2       0
3       0
4       0
       ..
7495    0
7496    0
7497    0
7498    0
7499    0
Name: Home Ownership_Have Mortgage, Length: 7500, dtype: int32
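
Converting one column at a time works, but the cast can also be done for every dummy column at once. The sketch below shows two options on a made-up frame (`toy` is an assumption, not the real data): passing `dtype=int` to `pd.get_dummies`, or selecting all bool columns after encoding and casting them together. The second option assumes a pandas version in which `get_dummies` returns bool columns, as in the output above.

```python
import pandas as pd

# Toy frame standing in for data.csv (values are made up for illustration).
toy = pd.DataFrame({
    "Id": [0, 1, 2],
    "Home Ownership": ["Own Home", "Rent", "Home Mortgage"],
})

# Option 1: ask get_dummies for integer columns directly.
encoded_int = pd.get_dummies(toy, columns=["Home Ownership"], dtype=int)

# Option 2: encode first, then cast every bool column in one step.
encoded = pd.get_dummies(toy, columns=["Home Ownership"])
bool_columns = encoded.select_dtypes(include="bool").columns
encoded[bool_columns] = encoded[bool_columns].astype(int)

print(encoded_int["Home Ownership_Rent"].tolist())  # [0, 1, 0]
print(encoded["Home Ownership_Rent"].tolist())      # [0, 1, 0]
```

Option 1 is shorter when the encoding step is under your control; option 2 is useful when the frame has already been encoded elsewhere.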

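At this point we have covered every step needed to one-hot encode discrete variables and prepare the data:
1. Find the discrete variables
2. Map them with one-hot encoding
3. Convert the one-hot columns to int
4. Fill the missing values in each column (covered yesterday)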

In [9]:
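# Now combine the earlier code to one-hot encode all discrete features in one go
# Re-read the data
data = pd.read_csv("data.csv")
# Find the discrete variables
discrete_lists = []  # empty list that will hold the discrete variable names
for discrete_features in data.columns:
    if data[discrete_features].dtype == 'object':
        discrete_lists.append(discrete_features)

# One-hot encode the discrete variables
# drop_first=True drops the first category of each feature, so k categories yield k-1 columns
data = pd.get_dummies(data, columns=discrete_lists, drop_first=True)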

data.columns


Index(['Id', 'Annual Income', 'Tax Liens', 'Number of Open Accounts',
       'Years of Credit History', 'Maximum Open Credit',
       'Number of Credit Problems', 'Months since last delinquent',
       'Bankruptcies', 'Current Loan Amount', 'Current Credit Balance',
       'Monthly Debt', 'Credit Score', 'Credit Default',
       'Home Ownership_Home Mortgage', 'Home Ownership_Own Home',
       'Home Ownership_Rent', 'Years in current job_10+ years',
       'Years in current job_2 years', 'Years in current job_3 years',
       'Years in current job_4 years', 'Years in current job_5 years',
       'Years in current job_6 years', 'Years in current job_7 years',
       'Years in current job_8 years', 'Years in current job_9 years',
       'Years in current job_< 1 year', 'Purpose_buy a car',
       'Purpose_buy house', 'Purpose_debt consolidation',
       'Purpose_educational expenses', 'Purpose_home improvements',
       'Purpose_major purchase', 'Purpose_medical bills', 'Purpose_moving',
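
The `get_dummies` call above passes `drop_first=True`, which is why 'Home Ownership_Have Mortgage' no longer appears among the columns: the first category of each feature (in sorted order) is dropped, since it is implied when all remaining indicators are 0. A minimal sketch on a made-up `Term` column (the frame `term` is an assumption for illustration):

```python
import pandas as pd

# Toy column with two categories (made-up values).
term = pd.DataFrame({"Term": ["Short Term", "Long Term", "Short Term"]})

# Default: one column per category.
full = pd.get_dummies(term, columns=["Term"])
# drop_first=True: the first category in sorted order is dropped.
reduced = pd.get_dummies(term, columns=["Term"], drop_first=True)

print(list(full.columns))     # ['Term_Long Term', 'Term_Short Term']
print(list(reduced.columns))  # ['Term_Short Term']
```

With two categories the single remaining column carries all the information: a 0 in 'Term_Short Term' means the row is 'Long Term'.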

One difficulty remains: how do we find the names of all the new features created by one-hot encoding?



In [10]:
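# Simply compare the column names before and after one-hot encoding
data2 = pd.read_csv("data.csv")
list_final = []  # empty list that will hold the feature names added by one-hot encoding
for i in data.columns:
    if i not in data2.columns:
        list_final.append(i)  # i is a column that exists only after encoding
list_final

# The same result can be obtained with the data.columns.difference() method; try it on your own
# As you can see, there are many different ways to reach the same result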

['Home Ownership_Home Mortgage',
 'Home Ownership_Own Home',
 'Home Ownership_Rent',
 'Years in current job_10+ years',
 'Years in current job_2 years',
 'Years in current job_3 years',
 'Years in current job_4 years',
 'Years in current job_5 years',
 'Years in current job_6 years',
 'Years in current job_7 years',
 'Years in current job_8 years',
 'Years in current job_9 years',
 'Years in current job_< 1 year',
 'Purpose_buy a car',
 'Purpose_buy house',
 'Purpose_debt consolidation',
 'Purpose_educational expenses',
 'Purpose_home improvements',
 'Purpose_major purchase',
 'Purpose_medical bills',
 'Purpose_moving',
 'Purpose_other',
 'Purpose_renewable energy',
 'Purpose_small business',
 'Purpose_take a trip',
 'Purpose_vacation',
 'Purpose_wedding',
 'Term_Short Term']
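
For reference, a minimal sketch of the `columns.difference()` approach mentioned in the cell above, run on a made-up frame (`before` and `after` are hypothetical names). By default `difference` tries to sort its result; passing `sort=False` asks pandas not to sort.

```python
import pandas as pd

# Toy frame standing in for data.csv (values are made up for illustration).
before = pd.DataFrame({
    "Id": [0, 1],
    "Term": ["Short Term", "Long Term"],
    "Purpose": ["other", "buy a car"],
})
after = pd.get_dummies(before, columns=["Term", "Purpose"])

# Columns present after encoding but absent before it.
new_columns = after.columns.difference(before.columns, sort=False).tolist()
print(new_columns)
```

The set of names is the same as the one produced by the explicit loop; only the way it is computed differs.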

In [11]:
# Continuing from before, convert the bool features to int
for i in list_final:
    data[i] = data[i].astype(int)  # i is the name of a one-hot encoded feature
data.head()


| | Id | Annual Income | Tax Liens | Number of Open Accounts | Years of Credit History | Maximum Open Credit | Number of Credit Problems | Months since last delinquent | Bankruptcies | Current Loan Amount | ... | Purpose_major purchase | Purpose_medical bills | Purpose_moving | Purpose_other | Purpose_renewable energy | Purpose_small business | Purpose_take a trip | Purpose_vacation | Purpose_wedding | Term_Short Term |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 482087.0 | 0.0 | 11.0 | 26.3 | 685960.0 | 1.0 | NaN | 1.0 | 99999999.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 1025487.0 | 0.0 | 15.0 | 15.3 | 1181730.0 | 0.0 | NaN | 0.0 | 264968.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 751412.0 | 0.0 | 11.0 | 35.0 | 1182434.0 | 0.0 | NaN | 0.0 | 99999999.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 3 | 805068.0 | 0.0 | 8.0 | 22.5 | 147400.0 | 1.0 | NaN | 1.0 | 121396.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 4 | 776264.0 | 0.0 | 13.0 | 13.6 | 385836.0 | 1.0 | NaN | 0.0 | 125840.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |


In [12]:
# Fill the missing values in each column; start by checking the column dtypes
data.dtypes

Id                                  int64
Annual Income                     float64
Tax Liens                         float64
Number of Open Accounts           float64
Years of Credit History           float64
Maximum Open Credit               float64
Number of Credit Problems         float64
Months since last delinquent      float64
Bankruptcies                      float64
Current Loan Amount               float64
Current Credit Balance            float64
Monthly Debt                      float64
Credit Score                      float64
Credit Default                      int64
Home Ownership_Home Mortgage        int32
Home Ownership_Own Home             int32
Home Ownership_Rent                 int32
Years in current job_10+ years      int32
Years in current job_2 years        int32
Years in current job_3 years        int32
Years in current job_4 years        int32
Years in current job_5 years        int32
Years in current job_6 years        int32
Years in current job_7 years      

In [13]:
data.isnull().sum()  # count the missing values in each column

Id                                   0
Annual Income                     1557
Tax Liens                            0
Number of Open Accounts              0
Years of Credit History              0
Maximum Open Credit                  0
Number of Credit Problems            0
Months since last delinquent      4081
Bankruptcies                        14
Current Loan Amount                  0
Current Credit Balance               0
Monthly Debt                         0
Credit Score                      1557
Credit Default                       0
Home Ownership_Home Mortgage         0
Home Ownership_Own Home              0
Home Ownership_Rent                  0
Years in current job_10+ years       0
Years in current job_2 years         0
Years in current job_3 years         0
Years in current job_4 years         0
Years in current job_5 years         0
Years in current job_6 years         0
Years in current job_7 years         0
Years in current job_8 years         0
Years in current job_9 ye

In [14]:
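# Fill with the mean
# Loop over every column
for i in data.columns:
    if data[i].isnull().sum() > 0:  # find columns that contain missing values
        # compute the mean of the column
        mean_value = data[i].mean()
        # fill the missing values with the mean; assigning the result back avoids the chained inplace call, which newer pandas versions warn about
        data[i] = data[i].fillna(mean_value)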

data.isnull().sum()

Id                                0
Annual Income                     0
Tax Liens                         0
Number of Open Accounts           0
Years of Credit History           0
Maximum Open Credit               0
Number of Credit Problems         0
Months since last delinquent      0
Bankruptcies                      0
Current Loan Amount               0
Current Credit Balance            0
Monthly Debt                      0
Credit Score                      0
Credit Default                    0
Home Ownership_Home Mortgage      0
Home Ownership_Own Home           0
Home Ownership_Rent               0
Years in current job_10+ years    0
Years in current job_2 years      0
Years in current job_3 years      0
Years in current job_4 years      0
Years in current job_5 years      0
Years in current job_6 years      0
Years in current job_7 years      0
Years in current job_8 years      0
Years in current job_9 years      0
Years in current job_< 1 year     0
Purpose_buy a car           
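
The per-column loop can also be collapsed into a single call, because `fillna` accepts a Series of fill values and matches them to columns by name. A minimal sketch on made-up numbers (`toy` is an assumption, not the real data):

```python
import numpy as np
import pandas as pd

# Toy numeric frame with a few gaps (made-up values).
toy = pd.DataFrame({
    "Annual Income": [100.0, np.nan, 300.0],
    "Credit Score": [700.0, 720.0, np.nan],
    "Credit Default": [0, 1, 0],
})

# mean() skips NaN by default; fillna aligns the column means by column name.
filled = toy.fillna(toy.mean(numeric_only=True))

print(filled["Annual Income"].tolist())  # [100.0, 200.0, 300.0]
print(filled["Credit Score"].tolist())   # [700.0, 720.0, 710.0]
```

Columns without missing values are left unchanged, so no explicit check for `isnull().sum() > 0` is needed.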

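Now, in a .py file, process all continuous and discrete variables in data in one pass:
1. Read the data
2. One-hot encode the discrete variables
3. Convert the one-hot encoded variables to int
4. Fill all missing values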
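
One possible shape for such a script is sketched below. The function name `preprocess` and the toy frame are hypothetical; the sketch assumes text columns have dtype `object` (as in the notebook output) and folds steps 2 and 3 together by passing `dtype=int` to `get_dummies`. On the real data the input would come from `pd.read_csv('data.csv')` instead of the toy frame.

```python
import numpy as np
import pandas as pd


def preprocess(df):
    """One-hot encode text columns as int and mean-fill missing values."""
    df = df.copy()  # leave the caller's frame untouched
    # Find the discrete (object-dtype) columns.
    discrete_columns = df.select_dtypes(include="object").columns.tolist()
    # Encode them; dtype=int yields 0/1 integers instead of bool.
    df = pd.get_dummies(df, columns=discrete_columns, drop_first=True, dtype=int)
    # Fill remaining gaps in numeric columns with each column's mean.
    return df.fillna(df.mean(numeric_only=True))


# Toy frame standing in for data.csv (values are made up for illustration).
toy = pd.DataFrame({
    "Home Ownership": pd.Series(["Rent", "Own Home", "Rent"], dtype=object),
    "Annual Income": [100.0, np.nan, 300.0],
    "Credit Default": [0, 1, 0],
})

result = preprocess(toy)
print(result)
```

Note that `get_dummies` ignores missing values in a text column by default, so a row whose category is missing ends up with 0 in every indicator column for that feature.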