Data Analysis with Python——03 #83

Open
hsipeng opened this issue Aug 23, 2019 · 0 comments

pandas

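pandas data structures

  • Series
    A Series consists of a one-dimensional array of data (any NumPy dtype) and an associated array of data labels, its index. The values and index attributes return the array representation and the Index object, respectively.
from pandas import Series, DataFrame

obj = Series([4, 7, -5, 3])
obj
# 0    4
# 1    7
# 2   -5
# 3    3
# dtype: int64

obj.values
# array([ 4,  7, -5,  3])

obj.index
# RangeIndex(start=0, stop=4, step=1)   (older pandas versions print Int64Index([0, 1, 2, 3]))

obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])  # specify the index explicitly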
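  • DataFrame

A DataFrame is a tabular data structure. It can be viewed as a dict of Series that all share the same index.

# dict of equal-length lists
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

frame = DataFrame(data)


# dict of dicts
# outer dict keys become the columns, inner keys become the row index
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

frame3 = DataFrame(pop)
frame3
#       Nevada  Ohio
# 2000     NaN   1.5
# 2001     2.4   1.7
# 2002     2.9   3.6
# (the row order can differ between pandas versions)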
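The "dict of Series sharing one index" view can be checked directly. The following is a minimal sketch that reuses the sample data from the notes; the variable names `state`, `shares_index`, and `sdata` are illustrative only.

```python
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9],
}
frame = pd.DataFrame(data)

# Selecting one column by key returns a Series ...
state = frame["state"]

# ... and that Series carries the same index as the DataFrame itself.
shares_index = state.index.equals(frame.index)

# A Series can also be built from a dict: the keys become the index labels.
sdata = pd.Series({"Ohio": 35000, "Texas": 71000})
```

Column access with `frame["state"]` mirrors dict lookup, which is why the dict analogy is a useful mental model.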

Index objects

When a Series or DataFrame is constructed, any array or other sequence of labels is converted to an Index. Index objects are immutable.

obj = Series(range(3), index=['a', 'b', 'c'])

index = obj.index

index
# Index(['a', 'b', 'c'], dtype='object')
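To illustrate the immutability, here is a small sketch: assigning to a single position of an Index raises `TypeError`, while replacing the whole index of the Series is allowed. The `mutated` flag is a name introduced only for this example.

```python
import pandas as pd

obj = pd.Series(range(3), index=["a", "b", "c"])
index = obj.index

try:
    index[1] = "d"      # in-place modification of an Index is rejected
    mutated = True
except TypeError:
    mutated = False

# To change labels, assign a complete new index to the Series instead.
obj.index = ["x", "y", "z"]
```

Because an Index cannot change in place, it can be shared safely between several Series and DataFrame objects.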

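Essential functionality

  • reindex
  • drop
  • Indexing and slicing
  • Arithmetic and data alignment
df1.add(df2, fill_value=0)

# a value missing from one frame is treated as 0 when the other frame has a value there
  • Broadcasting
  • Function application and mapping
  • Sorting and ranking
    • sort_index
    • rank
    • sort_values (called order in old pandas versions; order has since been removed)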
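Summarizing and computing descriptive statistics

  • Reductions: sum, mean, ...
df.sum()        # column sums

df.sum(axis=1)  # row sums

  • Accumulations and indirect statistics
    cumsum (running total); idxmin, idxmax (index label of the minimum / maximum)

  • describe
    Produces several summary statistics in one call (count, mean, std, min, ...)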
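As a rough illustration of these methods, the sketch below runs them on a small frame that contains missing values. The data is invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

col_sums = df.sum()          # one value per column; NaN is skipped
row_sums = df.sum(axis=1)    # one value per row
largest = df.idxmax()        # index label of each column's maximum
running = df.cumsum()        # accumulation: running total down each column
summary = df.describe()      # count, mean, std, min, quartiles, max
```

Note that reductions skip NaN by default, so the all-NaN row "c" sums to 0 rather than NaN.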

Correlation and covariance

  • corr: correlation coefficient
  • cov: covariance
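A minimal sketch of both methods, using invented columns chosen so the results are easy to verify by hand: `y` is exactly twice `x`, and `z` decreases as `x` increases.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],   # y = 2x, perfectly correlated with x
    "z": [4.0, 3.0, 2.0, 1.0],   # z falls as x rises
})

r_xy = df["x"].corr(df["y"])     # correlation between two Series
r_xz = df["x"].corr(df["z"])
c_xy = df["x"].cov(df["y"])      # sample covariance (divides by n - 1)

corr_matrix = df.corr()          # pairwise correlations of all columns
cov_matrix = df.cov()            # pairwise covariances of all columns
```

Called on a Series, `corr` and `cov` return a single number; called on a DataFrame, they return a full matrix.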

Unique values, value counts, and membership

  • unique
  • value_counts
  • isin
mask = obj.isin(['b', 'c'])

obj[mask]
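The snippet above assumes an existing `obj`. A self-contained sketch of all three methods, with sample data invented for the example, might look like this:

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

uniques = obj.unique()          # distinct values, in order of first appearance
counts = obj.value_counts()     # how often each value occurs

mask = obj.isin(["b", "c"])     # boolean Series: True where value is "b" or "c"
subset = obj[mask]              # keep only the matching entries
```

The boolean mask keeps the original index labels, so `subset` can still be traced back to positions in `obj`.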

Handling missing data

Filtering out missing data

data.dropna()
data[data.notnull()]  # equivalent when data is a Series
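The equivalence holds for a Series but not for a DataFrame, where boolean masking keeps the shape and `dropna` removes whole rows. A small sketch with invented data:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.5, np.nan, 7.0])
same = s.dropna().equals(s[s.notnull()])    # both drop the NaN entries

df = pd.DataFrame([
    [1.0, 6.5, 3.0],
    [1.0, np.nan, np.nan],
    [np.nan, np.nan, np.nan],
    [np.nan, 6.5, 3.0],
])

any_missing = df.dropna()             # drops every row that contains a NaN
all_missing = df.dropna(how="all")    # drops only rows that are entirely NaN
masked = df[df.notnull()]             # same shape as df; nothing is dropped
```

Passing `axis=1` to `dropna` applies the same logic to columns instead of rows.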

Filling in missing data

df.fillna(0)  # replace every missing value with 0
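Beyond a single constant, `fillna` also accepts a per-column dict or a Series of values, and `ffill` propagates the last valid value forward. The frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, np.nan]})

zeros = df.fillna(0)                       # same constant everywhere
per_col = df.fillna({"a": 0.5, "b": -1})   # a different fill value per column
forward = df.ffill()                       # carry the last valid value forward
means = df.fillna(df.mean())               # fill each column with its own mean
```

Each call returns a new DataFrame, so `df` itself still contains its NaN values afterwards.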
