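Chapter 7: Data Cleaning and Preparation
During data analysis and modeling, a significant amount of time is spent on data preparation: loading, cleaning, transforming, and rearranging.
In this chapter, I discuss tools for missing data, duplicate data, string manipulation, and other data transformations. In the next chapter, I focus on combining and rearranging datasets in various ways.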

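7.1 Handling Missing Data
For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing data. We call NaN a sentinel value because it can be easily detected.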

In [1]:
import pandas as pd
import numpy as np 

In [3]:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
print(string_data)
print(string_data.isnull())

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object
0    False
1    False
2     True
3    False
dtype: bool


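In pandas, we have adopted a convention from the R programming language and refer to missing data as NA, which stands for not available. In statistics, NA data may either be data that does not exist or data that exists but was not observed (for example, because of problems during data collection). When cleaning data for analysis, it is often important to analyze the missing data itself to identify data collection problems or potential biases caused by the missing data.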

In [8]:
string_data[0] = None
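print(string_data.isnull())  # boolean values indicating which values are missing
print(string_data.dropna())  # filter axis labels based on whether their values are missing, with an optional threshold for how much missing data to tolerate
print(string_data.fillna('hello'))  # fill in missing data with a value or with a fill method such as 'ffill' or 'bfill'
print(string_data.notnull())  # negation of isnull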

0     True
1    False
2     True
3    False
dtype: bool
1    artichoke
3      avocado
dtype: object
0        hello
1    artichoke
2        hello
3      avocado
dtype: object
0    False
1     True
2    False
3     True
dtype: bool
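
Since analyzing the missing data itself is often worthwhile, a quick per-column summary can be a useful first step. The following is a minimal sketch; the `survey` table and its column names are made up for illustration, and `isna` is used as the alias of `isnull`.

```python
import numpy as np
import pandas as pd

# Hypothetical survey table; the column names are invented for this example.
survey = pd.DataFrame({
    "age": [34, np.nan, 51, 28],
    "income": [np.nan, np.nan, 72000.0, 48000.0],
    "city": ["Austin", None, "Boston", "Denver"],
})

# Number of missing entries in each column (None and NaN both count as NA).
missing_counts = survey.isna().sum()
# Fraction of missing entries in each column.
missing_share = survey.isna().mean()

print(missing_counts)
print(missing_share)
```

A column with a high missing share may point to a collection problem rather than to random gaps.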


7.1.1 Filtering Out Missing Data


In [11]:
data = pd.Series([1, np.nan, 3.5, np.nan, 7])
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [12]:
# The code above is equivalent to
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

In [14]:
# By default, dropna drops any row containing a missing value
data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan], [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
clean = data.dropna()
print(data)
print(clean)

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
     0    1    2
0  1.0  6.5  3.0


In [16]:
# Passing how='all' drops only the rows in which every value is NA
data.dropna(how='all')

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0


In [18]:
# To drop columns in the same way, pass axis=1
data[4] = np.nan
print(data)
data.dropna(axis=1, how='all')

     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN


     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0


In [22]:
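# A related way to filter out DataFrame rows tends to concern time series data. Suppose you want to keep only rows containing at least a certain number of observations; the thresh argument expresses this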
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = np.nan
df.iloc[:2, 2] = np.nan
print(df)

print(df.dropna())
print(df.dropna(thresh=2))

          0         1         2
0 -0.671558       NaN       NaN
1 -2.072681       NaN       NaN
2 -0.465074       NaN  0.236121
3  0.429361       NaN  0.518240
4 -0.463601  0.808191  2.123445
5 -1.806658  0.408206  0.795206
6 -1.484296 -0.601941 -1.127979
          0         1         2
4 -0.463601  0.808191  2.123445
5 -1.806658  0.408206  0.795206
6 -1.484296 -0.601941 -1.127979
          0         1         2
2 -0.465074       NaN  0.236121
3  0.429361       NaN  0.518240
4 -0.463601  0.808191  2.123445
5 -1.806658  0.408206  0.795206
6 -1.484296 -0.601941 -1.127979
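
To make the meaning of `thresh` concrete, here is a small sketch on a fixed table (the `obs` frame is made up so the result does not depend on random numbers). It assumes `thresh=N` means "keep rows with at least N non-NA values" and checks that against a hand-written boolean mask.

```python
import numpy as np
import pandas as pd

# Small fixed table so the result is the same on every run.
obs = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, np.nan],
    "b": [np.nan, 5.0, 6.0, np.nan],
    "c": [np.nan, np.nan, 9.0, np.nan],
})

# thresh=2 keeps rows that have at least two non-NA values.
kept = obs.dropna(thresh=2)
# The same rule written out by hand with a boolean mask.
manual = obs[obs.notna().sum(axis=1) >= 2]

print(kept)
```

Rows 1 and 2 survive because they hold two and three observed values; rows 0 and 3 hold one and zero.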


7.1.2 Filling In Missing Data

In [23]:
# Use the fillna method to fill in missing values
df.fillna(0)

          0         1         2
0 -0.671558  0.000000  0.000000
1 -2.072681  0.000000  0.000000
2 -0.465074  0.000000  0.236121
3  0.429361  0.000000  0.518240
4 -0.463601  0.808191  2.123445
5 -1.806658  0.408206  0.795206
6 -1.484296 -0.601941 -1.127979


In [24]:
# Calling fillna with a dict lets you use a different fill value for each column
df.fillna({1:0.5, 2:0})

          0         1         2
0 -0.671558  0.500000  0.000000
1 -2.072681  0.500000  0.000000
2 -0.465074  0.500000  0.236121
3  0.429361  0.500000  0.518240
4 -0.463601  0.808191  2.123445
5 -1.806658  0.408206  0.795206
6 -1.484296 -0.601941 -1.127979


In [25]:
# fillna returns a new object, but you can also modify the existing object in place
_ = df.fillna(0, inplace=True)
df

          0         1         2
0 -0.671558  0.000000  0.000000
1 -2.072681  0.000000  0.000000
2 -0.465074  0.000000  0.236121
3  0.429361  0.000000  0.518240
4 -0.463601  0.808191  2.123445
5 -1.806658  0.408206  0.795206
6 -1.484296 -0.601941 -1.127979


In [29]:
# The same interpolation methods available for reindexing can be used with fillna
df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = np.nan
df.iloc[4:, 2] = np.nan
print(df)

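'''
    fillna function arguments
        value: scalar value or dict-like object used to fill missing values
        method: fill method, 'ffill' or 'bfill'; defaults to None (deprecated since pandas 2.1 in favor of the ffill and bfill methods)
        axis: axis to fill on; default axis=0
        inplace: modify the calling object instead of producing a copy
        limit: for forward and backward filling, the maximum number of consecutive periods to fill
'''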
print(df.fillna(method='ffill'))
print(df.fillna(method='ffill', limit=2))

          0         1         2
0  0.115105  0.011041  0.210894
1 -1.802301  1.001189 -0.829573
2 -1.068148       NaN -1.939378
3 -1.698171       NaN  0.405176
4 -1.357074       NaN       NaN
5  0.928057       NaN       NaN
          0         1         2
0  0.115105  0.011041  0.210894
1 -1.802301  1.001189 -0.829573
2 -1.068148  1.001189 -1.939378
3 -1.698171  1.001189  0.405176
4 -1.357074  1.001189  0.405176
5  0.928057  1.001189  0.405176
          0         1         2
0  0.115105  0.011041  0.210894
1 -1.802301  1.001189 -0.829573
2 -1.068148  1.001189 -1.939378
3 -1.698171  1.001189  0.405176
4 -1.357074       NaN  0.405176
5  0.928057       NaN  0.405176
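
As a hedged complement to the cell above, the sketch below uses the dedicated `ffill` method (which takes the same `limit` argument) and also fills each column with its own mean. The `readings` table is made up for illustration.

```python
import numpy as np
import pandas as pd

readings = pd.DataFrame({
    "x": [1.0, np.nan, np.nan, np.nan, 5.0],
    "y": [2.0, 4.0, np.nan, 6.0, np.nan],
})

# Forward-fill, but carry each value across at most two consecutive gaps.
filled_forward = readings.ffill(limit=2)
# Fill each column with that column's own mean (mean() skips NaN).
filled_mean = readings.fillna(readings.mean())

print(filled_forward)
print(filled_mean)
```

With `limit=2`, the third consecutive gap in column `x` stays NaN, which mirrors the `limit=2` output shown above.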


7.2 Data Transformation

7.2.1 Removing Duplicates

In [31]:
# Duplicate rows may appear in a DataFrame for any number of reasons.
data = pd.DataFrame({'k1':['one', 'two']*3+['two'], 'k2':[1, 1, 2, 3, 3, 4, 4]})
data

    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4


In [32]:
# The DataFrame method duplicated returns a boolean Series indicating whether each row is a duplicate (that is, identical to a row seen earlier)
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [33]:
# drop_duplicates returns a DataFrame containing the rows for which the duplicated array is False:
data.drop_duplicates()

    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4


In [35]:
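# Both methods consider all of the columns by default. Alternatively, you can specify any subset of columns to detect duplicates. Suppose we have an additional column of values and want to filter duplicates based only on the 'k1' column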
data['v1'] = range(7)
print(data)
data.drop_duplicates(['k1'])

    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
5  two   4   5
6  two   4   6


    k1  k2  v1
0  one   1   0
1  two   1   1


In [36]:
# duplicated and drop_duplicates keep the first observed value combination by default. Passing keep='last' keeps the last one instead
data.drop_duplicates(['k1', 'k2'], keep='last')

    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
6  two   4   6
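
Besides `keep='first'` and `keep='last'`, it can help to look at every member of a duplicated group before deciding which one to drop. The sketch below assumes the same `k1`/`k2` table as above and uses `keep=False` together with a `groupby` count.

```python
import pandas as pd

data = pd.DataFrame({"k1": ["one", "two"] * 3 + ["two"],
                     "k2": [1, 1, 2, 3, 3, 4, 4]})

# keep=False marks every member of a duplicated group, not just the repeats.
all_dupes = data[data.duplicated(keep=False)]
# Count how many times each (k1, k2) combination occurs.
counts = data.groupby(["k1", "k2"]).size()

print(all_dupes)
print(counts)
```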


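7.2.2 Transforming Data Using a Function or Mapping
For many datasets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame.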

In [4]:
data = pd.DataFrame({'food':['bacon', 'pulled pork', 'bacon', 'Pastrami', 'corned beef', 'Bacon', 'pastrami', 'honey ham', 'nova lox'],
                     'ounces':[4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0


In [11]:
# Add a column indicating the type of animal each food came from
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}
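lowercased = data['food'].str.lower()  # convert uppercase strings to lowercase
print(lowercased)
# The Series map method accepts a function or a dict-like object containing a mapping
data['animal'] = lowercased.map(meat_to_animal)  # pass a dict-like object
print(data)

# Pass a function instead
data['food'].map(lambda x: meat_to_animal[x.lower()])

# Using map is a convenient way to perform element-wise transformations and other data cleaning operations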

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon


0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
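
One difference between the two styles is how they treat labels that have no entry in the mapping. The sketch below illustrates this with a shortened mapping and a made-up `'tofu'` entry; `dict.get` with a default is one possible way to avoid the `KeyError` that plain indexing would raise.

```python
import pandas as pd

meat_to_animal = {"bacon": "pig", "pastrami": "cow"}
# 'tofu' is a made-up entry with no key in the mapping.
food = pd.Series(["Bacon", "pastrami", "tofu"])

# Dict-based map: labels without a key become NaN instead of raising.
by_dict = food.str.lower().map(meat_to_animal)
# Function-based map: dict.get with a default avoids the KeyError that
# meat_to_animal[x.lower()] would raise for 'tofu'.
by_func = food.map(lambda x: meat_to_animal.get(x.lower(), "unknown"))

print(by_dict)
print(by_func)
```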

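7.2.3 Replacing Values
Filling in missing data with fillna is a special case of more general value replacement. As you have already seen, map can be used to modify a subset of values in an object, but replace provides a simpler and more flexible way to do so.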

In [17]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])
print(data)

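# -999 may be a sentinel value for missing data. To replace these with NA, use replace, which produces a new Series (unless you pass inplace=True)
print(data.replace(-999., np.nan))
# Replace multiple values at once
print(data.replace([-999., -1000], np.nan))
# To use a different replacement for each value, pass a list of substitutes
print(data.replace([-999, -1000], [np.nan, 0]))
# The argument can also be passed as a dict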
print(data.replace({-999: np.nan, -1000: 0}))

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64
0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64


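7.2.4 Renaming Axis Indexes
As with values in a Series, axis labels can be transformed by a function or a mapping of some form to produce new, differently labeled objects. You can also modify the axes in place without creating a new data structure.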

In [26]:
data = pd.DataFrame(np.arange(12).reshape(3,4), index=['Ohio', 'Colorado', 'New York'], columns=['one', 'two', 'three', 'four'])
print(data)

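# Like a Series, the axis indexes have a map method
transform = lambda x: x[:4].upper()
data.index.map(transform)

# Assign to index to modify the DataFrame in place
data.index = data.index.map(transform)
print(data)

# Create a transformed version of the dataset without modifying the original
print(data.rename(index=str.title, columns=str.upper))
# rename can be used with a dict-like object
print(data.rename(index={'OHIO': 'INDIANA'}, columns={'three': 'peekaboo'}))

# rename saves you from the chore of copying the DataFrame manually and assigning to its index and columns attributes. To modify the dataset in place, pass inplace=True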
data.rename(index={'OHIO': 'INDIANA'}, columns={'three': 'peekaboo'}, inplace=True)
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11
      one  two  three  four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11
      ONE  TWO  THREE  FOUR
Ohio    0    1      2     3
Colo    4    5      6     7
New     8    9     10    11
         one  two  peekaboo  four
INDIANA    0    1         2     3
COLO       4    5         6     7
NEW        8    9        10    11


         one  two  peekaboo  four
INDIANA    0    1         2     3
COLO       4    5         6     7
NEW        8    9        10    11


7.2.5 Discretization and Binning
Continuous data is often discretized or otherwise separated into "bins" for analysis.

In [36]:
# Data about a group of people, to be grouped into discrete age buckets
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

# Divide these ages into bins of 18 to 25, 26 to 35, 36 to 60, and 61 and older
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
print(cats)
print(cats.codes)
print(cats.categories)
print(pd.value_counts(cats))

# Pass right=False to change which side of each interval is closed
print(pd.cut(ages, [18, 26, 36, 61, 100], right=False))

# Pass your own bin names by passing a list or array to the labels option
group_names = ['Youth', 'YouthAdult', 'MiddleAged', 'Serior']
print(pd.cut(ages, bins, labels=group_names))

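# If you pass an integer number of bins to cut instead of explicit bin edges, pandas computes equal-length bins based on the minimum and maximum values in the data
# Consider the case of some uniformly distributed data chopped into fourths
data = np.random.rand(20)
print(data)
print(pd.cut(data, 4, precision=2))  # precision=2 limits the decimal precision to two digits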

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
[0 0 0 1 0 0 2 1 3 2 2 1]
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]]
              closed='right',
              dtype='interval[int64]')
(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
[Youth, Youth, Youth, YouthAdult, Youth, ..., YouthAdult, Serior, MiddleAged, MiddleAged, YouthAdult]
Length: 12
Categories (4, object): [Youth < YouthAdult < MiddleAged < Serior]
[0.68266646 0.95944817 0.9343395  0.68128065 0.26232989 0.06019497
 0.05726148 0.36990634 0.76061556 0.38339109 0.35679917 0.61285906
 0.98966669 0.60421461 0.19567249 0.06325163 0.90720376 0.48284788
 0.

In [40]:
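# qcut is a closely related function that bins the data based on sample quantiles. Depending on the distribution of the data, cut will usually not give each bin the same number of data points.
# Since qcut uses sample quantiles instead, it produces bins holding roughly equal numbers of points
data = np.random.randn(1000)  # normally distributed
cats = pd.qcut(data, 4) # cut into quartiles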
print(cats)
print(pd.value_counts(cats))

# Pass your own quantiles (numbers between 0 and 1, inclusive)
cats = pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])
print(cats)
print(pd.value_counts(cats))

[(0.683, 3.133], (-3.327, -0.645], (-3.327, -0.645], (0.025, 0.683], (0.025, 0.683], ..., (-3.327, -0.645], (-0.645, 0.025], (-0.645, 0.025], (0.025, 0.683], (-0.645, 0.025]]
Length: 1000
Categories (4, interval[float64]): [(-3.327, -0.645] < (-0.645, 0.025] < (0.025, 0.683] < (0.683, 3.133]]
(0.683, 3.133]      250
(0.025, 0.683]      250
(-0.645, 0.025]     250
(-3.327, -0.645]    250
dtype: int64
[(0.025, 1.302], (-1.244, 0.025], (-1.244, 0.025], (0.025, 1.302], (0.025, 1.302], ..., (-3.327, -1.244], (-1.244, 0.025], (-1.244, 0.025], (0.025, 1.302], (-1.244, 0.025]]
Length: 1000
Categories (4, interval[float64]): [(-3.327, -1.244] < (-1.244, 0.025] < (0.025, 1.302] < (1.302, 3.133]]
(0.025, 1.302]      400
(-1.244, 0.025]     400
(1.302, 3.133]      100
(-3.327, -1.244]    100
dtype: int64
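
To see the difference between equal-width and quantile-based bins side by side, the sketch below bins the same seeded sample with both functions. The seed and the use of `np.random.default_rng` are assumptions made only so the run is repeatable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the run is repeatable
values = rng.standard_normal(1000)

# cut: four bins of equal width between the minimum and maximum.
width_counts = pd.Series(pd.cut(values, 4)).value_counts(sort=False)
# qcut: four bins placed at the sample quartiles.
quantile_counts = pd.Series(pd.qcut(values, 4)).value_counts(sort=False)

print(width_counts)
print(quantile_counts)
```

For bell-shaped data the equal-width bins crowd most points into the middle, while the quartile bins each hold a quarter of the sample.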


7.2.6 Detecting and Filtering Outliers
Filtering or transforming outliers is largely a matter of applying array operations.

In [41]:
data = pd.DataFrame(np.random.randn(1000, 4)) # normally distributed data
data.describe()

                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.020094    -0.032192    -0.019808    -0.002570
std       0.967391     0.983185     1.001929     1.022037
min      -3.268987    -3.127929    -3.579394    -2.798291
25%      -0.684392    -0.706479    -0.654229    -0.758018
50%      -0.010722    -0.035095     0.028171     0.039024
75%       0.671431     0.626310     0.610993     0.724056
max       2.584111     2.751096     3.406822     4.094864


In [51]:
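# Find values in one column exceeding 3 in absolute value
col = data[2]
print(col[np.abs(col) > 3])

# Select all rows having a value greater than 3 or less than -3
print(data[(np.abs(data) > 3).any(axis=1)])

# Cap values outside the interval -3 to 3
data[np.abs(data) > 3] = np.sign(data) * 3
print(data.describe())

# np.sign produces 1 and -1 values based on whether the values in data are positive or negative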
print(np.sign(data).head())

Series([], Name: 2, dtype: float64)
Empty DataFrame
Columns: [0, 1, 2, 3]
Index: []
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.019587    -0.032064    -0.019407    -0.003665
std       0.965760     0.982790     0.996117     1.018223
min      -3.000000    -3.000000    -3.000000    -2.798291
25%      -0.684392    -0.706479    -0.654229    -0.758018
50%      -0.010722    -0.035095     0.028171     0.039024
75%       0.671431     0.626310     0.610993     0.724056
max       2.584111     2.751096     3.000000     3.000000
     0    1    2    3
0  1.0 -1.0  1.0 -1.0
1  1.0  1.0  1.0  1.0
2 -1.0 -1.0 -1.0  1.0
3  1.0 -1.0  1.0 -1.0
4 -1.0 -1.0 -1.0 -1.0
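
The capping step can also be written with `clip`. The sketch below is an alternative formulation, not the approach used in the cell above; it builds a seeded frame (the seed is an assumption) and checks that both approaches give the same result.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # seeded so the run is repeatable
data = pd.DataFrame(rng.standard_normal((1000, 4)))

# Approach from the cell above: overwrite out-of-range cells with +/-3.
capped = data.copy()
capped[capped.abs() > 3] = np.sign(capped) * 3
# Alternative: clip bounds every value to the interval [-3, 3].
clipped = data.clip(lower=-3, upper=3)

print(clipped.describe())
```

`clip` returns a new object and leaves `data` untouched, which is convenient when the raw values are still needed later.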


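7.2.7 Permutation and Random Sampling
Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do with numpy.random.permutation. Calling permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering.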

In [55]:
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
sampler = np.random.permutation(5)
print(df)
print(sampler)

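# The integer array can be used in iloc-based indexing or the equivalent take function
print(df.take(sampler))

# To select a random subset without replacement, use the sample method
print(df.sample(n = 3))

# To generate a sample with replacement (allowing repeated choices), pass replace=True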
choices = pd.Series([5, 7, -1, 6, 4])
draws = choices.sample(n=10, replace=True)
print(draws)

    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19
[2 0 4 1 3]
    0   1   2   3
2   8   9  10  11
0   0   1   2   3
4  16  17  18  19
1   4   5   6   7
3  12  13  14  15
    0   1   2   3
4  16  17  18  19
0   0   1   2   3
2   8   9  10  11
3    6
1    7
3    6
3    6
4    4
4    4
3    6
2   -1
3    6
1    7
dtype: int64
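
The outputs above change on every run. When a shuffle or a sample needs to be repeatable, one option is to seed the random source, as in this sketch; the seed values are arbitrary choices.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

# A seeded Generator makes the permutation repeatable.
rng = np.random.default_rng(42)
order = rng.permutation(len(df))
shuffled = df.take(order)

# random_state plays the same role for sample.
first = df.sample(n=3, random_state=0)
second = df.sample(n=3, random_state=0)

print(shuffled)
print(first)
```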


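7.2.8 Computing Indicator/Dummy Variables
Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a "dummy" or "indicator" matrix.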

In [61]:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                  'data1': range(6)})
print(df)
print(pd.get_dummies(df['key']))

# Add a prefix to the indicator columns, then merge them with the other data
dumies = pd.get_dummies(df['key'], prefix='key')
df_with_dummy = df[['data1']].join(dumies)  # df['data1'] is a Series, whereas df[['data1']] is a DataFrame
print(df_with_dummy)

  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5
   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0
   data1  key_a  key_b  key_c
0      0      0      1      0
1      1      0      1      0
2      2      1      0      0
3      3      0      0      1
4      4      1      0      0
5      5      0      1      0
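
The 0/1 output above comes from an older pandas release; since pandas 2.0, get_dummies returns boolean columns by default. A minimal sketch, assuming you want integer indicators as shown here, is to pass `dtype=int`:

```python
import pandas as pd

df = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],
                   "data1": range(6)})

# dtype=int requests 0/1 integers instead of the boolean default.
dummies = pd.get_dummies(df["key"], prefix="key", dtype=int)
df_with_dummy = df[["data1"]].join(dummies)

print(df_with_dummy)
```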


In [76]:
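# Rows that belong to multiple categories
mnames = ['movie_id', 'title', 'genres']
# header defaults to using the first row as column names; if the file has no header row, pass header=None and supply names (or let pandas assign them)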
movies = pd.read_table('datasets/movielens/movies.dat', sep='::',header=None, names=mnames, engine='python') 
print(movies.head())

# Add indicator variables for each movie genre
# First, extract the list of distinct genres from the dataset
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))
genres = pd.unique(all_genres)   # remove duplicates
print(genres)

zeros_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zeros_matrix, columns=genres)
# Iterate through each movie and set the matching entries in each row of dummies to 1
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1

movies_windic = movies.join(dummies.add_prefix('Genre_'))
print(movies_windic.iloc[0])

   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
['Animation' "Children's" 'Comedy' 'Adventure' 'Fantasy' 'Romance' 'Drama'
 'Action' 'Crime' 'Thriller' 'Horror' 'Sci-Fi' 'Documentary' 'War'
 'Musical' 'Mystery' 'Film-Noir' 'Western']
movie_id                                       1
title                           Toy Story (1995)
genres               Animation|Children's|Comedy
Genre_Animation                                1
Genre_Children's                               1
Genre_Comedy                                   1
Genre_Adventure                                0
Gen

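For much larger data, this way of constructing indicator variables with multiple membership is not especially fast. A better approach is to write a lower-level function that writes directly to a NumPy array and then wrap the result in a DataFrame.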

In [82]:
dummies = np.zeros((len(movies), len(genres)))
genres = list(genres)  # pd.unique returns a NumPy array, which has no .index method, so convert it to a list first
for i, gen in enumerate(movies.genres):
    indices = [genres.index(value) for value in gen.split('|') ]
    dummies[i, indices] = 1

dummies = pd.DataFrame(dummies, columns=genres)
movies_windic = movies.join(dummies.add_prefix('Genre_'))
print(movies_windic.iloc[0])
movies_windic.head()

movie_id                                       1
title                           Toy Story (1995)
genres               Animation|Children's|Comedy
Genre_Animation                                1
Genre_Children's                               1
Genre_Comedy                                   1
Genre_Adventure                                0
Genre_Fantasy                                  0
Genre_Romance                                  0
Genre_Drama                                    0
Genre_Action                                   0
Genre_Crime                                    0
Genre_Thriller                                 0
Genre_Horror                                   0
Genre_Sci-Fi                                   0
Genre_Documentary                              0
Genre_War                                      0
Genre_Musical                                  0
Genre_Mystery                                  0
Genre_Film-Noir                                0
Genre_Western       

Unnamed: 0,movie_id,title,genres,Genre_Animation,Genre_Children's,Genre_Comedy,Genre_Adventure,Genre_Fantasy,Genre_Romance,Genre_Drama,...,Genre_Crime,Genre_Thriller,Genre_Horror,Genre_Sci-Fi,Genre_Documentary,Genre_War,Genre_Musical,Genre_Mystery,Genre_Film-Noir,Genre_Western
0,1,Toy Story (1995),Animation|Children's|Comedy,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji (1995),Adventure|Children's|Fantasy,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men (1995),Comedy|Romance,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale (1995),Comedy|Drama,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II (1995),Comedy,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
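
For delimiter-separated membership strings like the genres column, pandas also offers `Series.str.get_dummies`, which may replace the explicit loop. The sketch below is an alternative to the approach above, run on the first three rows printed earlier rather than on the full movies.dat file.

```python
import pandas as pd

# First three rows of the movies table shown above.
movies_small = pd.DataFrame({
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": ["Animation|Children's|Comedy",
               "Adventure|Children's|Fantasy",
               "Comedy|Romance"],
})

# Split each string on '|' and build one indicator column per distinct genre.
genre_dummies = movies_small["genres"].str.get_dummies(sep="|")
with_indicators = movies_small.join(genre_dummies.add_prefix("Genre_"))

print(with_indicators.iloc[0])
```

Note that the resulting columns are ordered alphabetically, not by first appearance as in the loop-based version.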


In [83]:
# A useful recipe for statistical applications is to combine get_dummies with a discretization function like cut
np.random.seed(12345)
values = np.random.rand(10)
print(values)
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]  # bin edges
pd.get_dummies(pd.cut(values, bins))

[0.92961609 0.31637555 0.18391881 0.20456028 0.56772503 0.5955447
 0.96451452 0.6531771  0.74890664 0.65356987]


   (0.0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1.0]
0           0           0           0           0           1
1           0           1           0           0           0
2           1           0           0           0           0
3           0           1           0           0           0
4           0           0           1           0           0
5           0           0           1           0           0
6           0           0           0           0           1
7           0           0           0           1           0
8           0           0           0           1           0
9           0           0           0           1           0


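7.3 String Manipulation
Python has long been a popular language for raw data manipulation, in part because of how easy it makes string and text processing.


7.3.1 String Object Methods
In many string munging and scripting applications, the built-in string methods are sufficient.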

In [84]:
# Split a string
val = 'a,b,  guido'
val.split(',')

['a', 'b', '  guido']

In [87]:
# split is often combined with strip to trim whitespace (including line breaks)
pieces = [x.strip() for x in val.split(',')]
pieces

['a', 'b', 'guido']

In [88]:
# These substrings could be concatenated with a two-colon delimiter using addition
first, second, third = pieces

first + '::' + second + '::' + third

'a::b::guido'

In [89]:
# A faster and more Pythonic way is to pass a list or tuple to the join method on the string '::'
'::'.join(pieces)

'a::b::guido'

In [91]:
# Locating substrings
print('guido' in val)
# Note that the difference between find and index is that index raises an exception if the string is not found, whereas find returns -1
print(val.index(','))
print(val.find('-1'))
print(val.find('gu'))

True
1
-1
6


In [92]:
# count returns the number of occurrences of a particular substring
val.count(',')

2

In [96]:
# replace substitutes occurrences of one pattern with another; it is also commonly used to delete patterns by passing an empty string
print(val.replace(',', '::'))
print(val.replace(',', ''))

a::b::  guido
ab  guido


In [102]:
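# Python built-in string methods
print(val.endswith('do'))
print(val.startswith('a'))
print(val.rfind(' '))  # position of the first character of the last occurrence of the substring; -1 if not found

str1 = "     this is string example....wow!!!     "
print(str1.strip())  # strip leading and trailing characters (whitespace, including newlines, by default)
print(str1.rstrip()) # trailing side only
print(str1.lstrip()) # leading side only

print(val.lower()) # convert uppercase letters to lowercase
print(val.upper())
S1 = "Runoob EXAMPLE....WOW!!!" # English
S2 = "ß"  # German
print( S1.lower() )
print( S1.casefold() )
print( S2.lower() )
print( S2.casefold() ) # casefold is more aggressive than lower: it maps the German "ß" to "ss" for caseless comparison

# Left- or right-justify the string, padding with the fill character (a space by default) up to the given width. If the width is smaller than the string's length, the original string is returned.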
str2 = "this is string example....wow!!!" 
print(str2.ljust(50, '0'))
print(str2.rjust(50, '0'))

True
True
5
this is string example....wow!!!
     this is string example....wow!!!
this is string example....wow!!!     
a,b,  guido
A,B,  GUIDO
runoob example....wow!!!
runoob example....wow!!!
ß
ss
this is string example....wow!!!000000000000000000
000000000000000000this is string example....wow!!!


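7.3.2 Regular Expressions
Regular expressions provide a flexible way to search for or match (often more complex) string patterns in text. A single expression, commonly called a regex, is a string formed according to the regular expression language. Python's built-in re module is responsible for applying regular expressions to strings.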

In [5]:
import re

# Split a string containing a variable mix of whitespace characters (tabs, spaces, newlines)
text = 'foo    bar\t baz   \tqux'
re.split(r'\s+', text)

['foo', 'bar', 'baz', 'qux']

In [5]:
# Compile the regex yourself with re.compile to form a reusable regex object
regex = re.compile(r'\s+')
regex.split(text)

['foo', 'bar', 'baz', 'qux']

In [6]:
# Get a list of all substrings matching the regex
regex.findall(text)

['    ', '\t ', '   \t']

In [7]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)

In [8]:
# Using findall on the text returns all matches in the string
regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

In [10]:
# search returns only the first email address in the text
m = regex.search(text)
print(m)
print(text[m.start():m.end()])

<re.Match object; span=(5, 20), match='dave@google.com'>
dave@google.com


In [12]:
# match only matches if the pattern occurs at the start of the string; otherwise it returns None
print(regex.match(text))

None


In [13]:
# sub returns a new string with occurrences of the pattern replaced by a new string
print(regex.sub('Hu P-C', text))

Dave Hu P-C
Steve Hu P-C
Rob Hu P-C
Ryan Hu P-C



In [14]:
# Find email addresses and split each address into three components: username, domain name, and domain suffix
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)

In [17]:
# The groups method of the match object produced by this regex returns a tuple of the pattern components
m = regex.match('wesm@bright.net')
m.groups()

('wesm', 'bright', 'net')

In [18]:
# findall returns a list of tuples when the pattern has groups
regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

In [19]:
# sub can also refer to the groups in each match with special symbols such as \1 and \2
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))

Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com
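
Numbered groups get hard to read as patterns grow. As a sketch of one alternative, the same pattern can carry names via `(?P<name>...)`; the group names `username`, `domain`, and `suffix` are chosen here for illustration and are not part of the original pattern.

```python
import re

text = """Dave dave@google.com
Steve steve@gmail.com
"""

# Same structure as the pattern above, with a name attached to each group.
named = re.compile(
    r"(?P<username>[A-Z0-9._%+-]+)@(?P<domain>[A-Z0-9.-]+)\.(?P<suffix>[A-Z]{2,4})",
    flags=re.IGNORECASE,
)

first = named.search(text)
parts = first.groupdict()
# \g<name> refers to a named group inside a replacement string.
masked = named.sub(r"\g<username>@hidden", text)

print(parts)
print(masked)
```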



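7.3.3 Vectorized String Functions in pandas
Cleaning up a messy dataset for analysis often requires a lot of string munging and regularization. To complicate matters, a column containing strings will sometimes have missing data.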

In [3]:
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
        'Rob': 'rob@gmail.com', 'Wes': np.nan}
data = pd.Series(data)
print(data)
data.isnull()

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object


Dave     False
Steve    False
Rob      False
Wes       True
dtype: bool

In [4]:
# Series has array-oriented methods for string operations that skip NA values
data.str.contains('gmail')

Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object

In [7]:
# Regular expressions can be used too, along with any re module options
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
data.str.findall(pattern, flags=re.IGNORECASE)

Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object

In [9]:
# str.match checks, element by element, whether each string matches the pattern
matches = data.str.match(pattern, flags=re.IGNORECASE)
matches

Dave     True
Steve    True
Rob      True
Wes       NaN
dtype: object

In [17]:
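# To access elements in embedded lists, pass an index to either str.get or str[]
print(matches.str.get(2))  # matches holds booleans rather than strings or lists, so every result is NaN
print(matches.str[0])
# Strings can be sliced with a similar vectorized syntax
print(data.str[:])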

Dave    NaN
Steve   NaN
Rob     NaN
Wes     NaN
dtype: float64
Dave    NaN
Steve   NaN
Rob     NaN
Wes     NaN
dtype: float64
Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object
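
The all-NaN results above arise because `matches` holds booleans. A sketch of indexing into embedded lists that does return values is to start from `findall`, which yields a list of tuples per element; the variable names `first_match` and `domains` are introduced here for illustration.

```python
import re
import numpy as np
import pandas as pd

data = pd.Series({"Dave": "dave@google.com", "Steve": "steve@gmail.com",
                  "Rob": "rob@gmail.com", "Wes": np.nan})
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"

# findall yields a list of tuples per element; .str[0] takes the first tuple.
first_match = data.str.findall(pattern, flags=re.IGNORECASE).str[0]
# .str.get(1) then indexes into each tuple, here the domain part.
domains = first_match.str.get(1)

print(domains)
```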


In [27]:
# Partial listing of vectorized string methods
# cat: concatenate strings element-wise with an optional delimiter
data.str.cat(sep='//')

'dave@google.com//steve@gmail.com//rob@gmail.com'

In [28]:
# count: count occurrences of the pattern
data.str.count(pattern, flags=re.IGNORECASE)

Dave     1.0
Steve    1.0
Rob      1.0
Wes      NaN
dtype: float64

In [29]:
# contains: return a boolean array indicating whether each string contains the pattern/regex
data.str.contains(pattern, flags=re.IGNORECASE)

  


Dave     True
Steve    True
Rob      True
Wes       NaN
dtype: object

In [None]:
# extract: use a regular expression with groups to extract one or more strings from a Series of strings; the result is a DataFrame with one column per group
data.str.extract(pattern, flags=re.IGNORECASE)
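
The extract cell has no recorded output, so here is a self-contained sketch of what it produces. The column labels assigned at the end are hypothetical; by default the columns are numbered 0, 1, 2.

```python
import re
import numpy as np
import pandas as pd

data = pd.Series({"Dave": "dave@google.com", "Steve": "steve@gmail.com",
                  "Rob": "rob@gmail.com", "Wes": np.nan})
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"

# One row per element and one column per capture group; NA rows stay NA.
parts = data.str.extract(pattern, flags=re.IGNORECASE)
parts.columns = ["username", "domain", "suffix"]  # hypothetical labels

print(parts)
```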

In [31]:
# endswith: equivalent to x.endswith(pattern) for each element
data.str.endswith('com')

Dave     True
Steve    True
Rob      True
Wes       NaN
dtype: object

In [32]:
# startswith: equivalent to x.startswith(pattern) for each element
data.str.startswith('d')

Dave      True
Steve    False
Rob      False
Wes        NaN
dtype: object

In [34]:
# findall: compute a list of all occurrences of the pattern/regex for each string
data.str.findall(pattern, flags=re.IGNORECASE)

Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object

In [35]:
# get: index into each element (retrieve the i-th element)
data.str.get(-1)

Dave       m
Steve      m
Rob        m
Wes      NaN
dtype: object

In [36]:
# isalnum: equivalent to the built-in str.isalnum; checks whether all characters in the string are alphanumeric
data.str.isalnum()

Dave     False
Steve    False
Rob      False
Wes        NaN
dtype: object

In [37]:
# isalpha: equivalent to the built-in str.isalpha; checks whether all characters in the string are alphabetic
data.str.isalpha()

Dave     False
Steve    False
Rob      False
Wes        NaN
dtype: object

In [38]:
# isdecimal: checks whether all characters in the string are decimal characters
data.str.isdecimal()

Dave     False
Steve    False
Rob      False
Wes        NaN
dtype: object

In [39]:
# isdigit: checks whether all characters in the string are digits
data.str.isdigit()

Dave     False
Steve    False
Rob      False
Wes        NaN
dtype: object

In [40]:
# islower: checks whether all cased characters in the string are lowercase
data.str.islower()

Dave     True
Steve    True
Rob      True
Wes       NaN
dtype: object

In [41]:
# isnumeric: checks whether all characters in the string are numeric characters
data.str.isnumeric()

Dave     False
Steve    False
Rob      False
Wes        NaN
dtype: object

In [44]:
# join: join the strings in each element of the Series with the passed separator
s = pd.Series([['lion', 'elephant', 'zebra'],
               [1.1, 2.2, 3.3],
               ['cat', np.nan, 'dog'],
               ['cow', 4.5, 'goat'],
               ['duck', ['swan', 'fish'], 'guppy']])
s.str.join('-')

0    lion-elephant-zebra
1                    NaN
2                    NaN
3                    NaN
4                    NaN
dtype: object

In [48]:
# len: compute the length of each string
data.str.len()

Dave     15.0
Steve    15.0
Rob      13.0
Wes       NaN
dtype: float64

In [49]:
# lower, upper: convert case; equivalent to x.lower() or x.upper() for each element

In [50]:
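# match: apply re.match with the regular expression to each element; as the data.str.match cell above shows, the result is a boolean per element (NaN for NA) rather than a list of groups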

In [62]:
# pad: add padding (whitespace by default) to the left side, right side, or both sides of strings
print(data)
print(data.str.pad(width=20, side='both', fillchar='+'))

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object
Dave     ++dave@google.com+++
Steve    ++steve@gmail.com+++
Rob      +++rob@gmail.com++++
Wes                       NaN
dtype: object


In [None]:
# center: equivalent to pad(side='both')

In [63]:
# repeat: duplicate values (data.str.repeat(3) is equivalent to x * 3 for each string)
data.str.repeat(3)

Dave     dave@google.comdave@google.comdave@google.com
Steve    steve@gmail.comsteve@gmail.comsteve@gmail.com
Rob            rob@gmail.comrob@gmail.comrob@gmail.com
Wes                                                NaN
dtype: object

In [64]:
# replace: replace occurrences of the pattern/regex with another string
data.str.replace('google', '163')

Dave        dave@163.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

In [74]:
# slice: slice each string in the Series
data.str.slice(0, -1, 1)

Dave     dave@google.co
Steve    steve@gmail.co
Rob        rob@gmail.co
Wes                 NaN
dtype: object

In [75]:
# split: split strings on a delimiter or regular expression
data.str.split('@')

Dave     [dave, google.com]
Steve    [steve, gmail.com]
Rob        [rob, gmail.com]
Wes                     NaN
dtype: object

In [None]:
# strip: trim whitespace, including newlines, from both sides of each string
# rstrip: trim whitespace on the right side
# lstrip: trim whitespace on the left side
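
The last cell lists the three strip variants without running them. A brief sketch on made-up strings (the `raw` Series below is invented for illustration) shows how they differ and how they chain with other vectorized methods:

```python
import numpy as np
import pandas as pd

# Made-up strings with stray whitespace on both sides.
raw = pd.Series(["  Alice\n", "\tBob  ", np.nan, " carol"])

stripped = raw.str.strip()      # both sides
right_only = raw.str.rstrip()   # trailing whitespace only
left_only = raw.str.lstrip()    # leading whitespace only
# Methods on the .str accessor can be chained; NA values pass through.
normalized = raw.str.strip().str.lower()

print(stripped)
print(normalized)
```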