# Basic Statistical Analysis

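pandas is first and foremost a tool for tabular data. Algorithms are not its main goal, so its built-in statistics are only just sufficient. Because pandas is built on numpy, the common numpy statistical methods are also available, such as the mean, variance, and standard deviation. This article again uses the iris dataset as its source data.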

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
iris_data = pd.read_csv("source/iris.csv")

## Basic statistical functions

### Statistical functions built into pandas

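Function|Description
---|---
count|Number of non-NA values
describe|Summary statistics
mean|Mean
min/max|Minimum and maximum
argmin/argmax|Integer position of the minimum and maximum (Series)
idxmin/idxmax|Index label of the minimum and maximum
quantile|Quantile
sum|Sum
median|Median
mad|Mean absolute deviation from the mean (removed in pandas 2.0)
var|Variance
std|Standard deviation
skew|Skewness (third moment)
kurt|Kurtosis (fourth moment)
cumsum|Cumulative sum
cummin/cummax|Cumulative minimum and maximum
cumprod|Cumulative product
diff|First-order difference (useful for time series)
pct_change|Percentage change
corr|Correlation coefficient
cov|Covariance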
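
To make a few of the entries in the table concrete, here is a minimal sketch on a small Series. The values and the index labels `a` to `e` are made up for illustration and are not part of the iris data.

```python
import pandas as pd

# Hypothetical toy data, not the iris dataset used in this article.
s = pd.Series([3.0, 1.0, 4.0, 1.5, 5.0], index=list("abcde"))

pos = s.argmin()       # integer position of the minimum
label = s.idxmin()     # index label of the minimum
q = s.quantile(0.5)    # 50% quantile, identical to the median
running = s.cumsum()   # cumulative sum
step = s.diff()        # first-order difference; the first element is NaN

print(pos, label, q)
print(running.tolist())
print(step.tolist())
```

The difference between `argmin` and `idxmin` only shows up when the index is not the default `0..n-1`, which is why the sketch uses letter labels.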

In [4]:
iris_data.describe()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.054,3.758667,1.198667
std,0.828066,0.433594,1.76442,0.763161
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


In [5]:
iris_data[["sepal_length","sepal_width","petal_length","petal_width"]].pct_change()[1:].tail()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
145,0.0,-0.090909,-0.087719,-0.08
146,-0.059701,-0.166667,-0.038462,-0.173913
147,0.031746,0.2,0.04,0.052632
148,-0.046154,0.133333,0.038462,0.15
149,-0.048387,-0.117647,-0.055556,-0.217391
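
As a rough illustration of what `pct_change()` computes, the sketch below compares it with the change relative to the previous row written out by hand. The three values are hypothetical.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, 11.0, 9.9])  # hypothetical values

auto = s.pct_change()
# Relative change with respect to the previous element, written out by hand.
manual = (s - s.shift(1)) / s.shift(1)

print(auto.tolist())
print(manual.tolist())
```

The first element is NaN because it has no predecessor, which is why the cell above drops it with `[1:]`.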


In [6]:
iris_data[["sepal_length","sepal_width","petal_length","petal_width"]].pct_change()[1:].sepal_length.corr(iris_data.petal_length)

0.15569820981689295

In [7]:
iris_data[["sepal_length","sepal_width","petal_length","petal_width"]].corr()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
sepal_length,1.0,-0.109369,0.871754,0.817954
sepal_width,-0.109369,1.0,-0.420516,-0.356544
petal_length,0.871754,-0.420516,1.0,0.962757
petal_width,0.817954,-0.356544,0.962757,1.0


In [8]:
iris_data[["sepal_length","sepal_width","petal_length","petal_width"]].cov()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
sepal_length,0.685694,-0.039268,1.273682,0.516904
sepal_width,-0.039268,0.188004,-0.321713,-0.117981
petal_length,1.273682,-0.321713,3.113179,1.296387
petal_width,0.516904,-0.117981,1.296387,0.582414
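
The correlation and covariance matrices above are linked: each Pearson coefficient is the covariance divided by the product of the two standard deviations. For example, 1.273682 / (0.828066 × 1.76442) ≈ 0.8718, which matches the `sepal_length`/`petal_length` entry of the correlation matrix. A minimal sketch of the same check on hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with three numeric columns.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.1, 3.9, 6.2, 8.1, 9.8],
    "z": [5.0, 3.0, 4.0, 1.0, 2.0],
})

cov = df.cov()   # sample covariance (ddof=1)
std = df.std()   # sample standard deviation (ddof=1)

# corr_ij = cov_ij / (std_i * std_j)
corr_from_cov = cov / np.outer(std, std)

print(corr_from_cov.round(6))
print(df.corr().round(6))
```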


## 抽样

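For sampling, pandas provides the `sample()` method, which performs simple random sampling. The `replace` parameter chooses between sampling with and without replacement (the default is without).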
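
A small sketch of the two modes, using a hypothetical ten-row frame rather than iris. The `random_state` value is an arbitrary choice that only makes the draw reproducible.

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})  # hypothetical toy frame

# Without replacement (the default): every row appears at most once.
without = df.sample(n=5, random_state=0)

# With replacement: the same row may be drawn several times,
# so n may even exceed the number of rows.
with_repl = df.sample(n=15, replace=True, random_state=0)

print(without.index.is_unique)
print(len(with_repl), with_repl.index.is_unique)
```

Asking for more rows than the frame contains without `replace=True` raises an error, so the parameter matters as soon as bootstrap-style resampling is needed.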

In [9]:
iris_data_test=iris_data.sample(frac=0.4)
iris_data_test = iris_data_test.sort_index()
iris_data_test[:5]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa


The remaining rows can be obtained like this:

In [10]:
iris_data_train=iris_data.drop(iris_data_test.index)
iris_data_train[:5]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
9,4.9,3.1,1.5,0.1,Iris-setosa
10,5.4,3.7,1.5,0.2,Iris-setosa


You can also define your own sampling scheme. For example, to roll a die for each row to decide whether it enters the sample, you can do this:

In [11]:
import random

In [12]:
temp = iris_data.copy()
temp["cc"]=[random.random() for i in range(len(iris_data))]
len(iris_data[temp["cc"]>0.3])

107

In [13]:
len(iris_data[temp["cc"]<=0.3])

43
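
The same per-row draw can also be written as a vectorized boolean mask, which avoids the helper column. This is a sketch under assumptions: the frame `df` is a hypothetical stand-in for `iris_data`, and the seed `42` is arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(150)})  # hypothetical stand-in for iris_data

rng = np.random.default_rng(42)       # the seed is an arbitrary choice
mask = rng.random(len(df)) > 0.3      # True with probability about 0.7

train = df[mask]    # rows whose draw exceeded 0.3
test = df[~mask]    # the complementary rows

print(len(train), len(test))
```

Unlike `sample(frac=...)`, the split sizes here are random, so the two parts only approximate a 70/30 split.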

## Correlation

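pandas computes a covariance matrix with `cov()` and a correlation matrix with `corr()`.

Both accept the `min_periods` keyword, which sets the minimum number of observations required for each pair of columns to produce a valid result.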
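
A minimal sketch of the effect of `min_periods`, using hypothetical data in which column `b` has only two non-missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, np.nan, np.nan, np.nan, 10.0],
})

loose = df.corr()                # uses whatever pairwise observations exist
strict = df.corr(min_periods=3)  # requires at least 3 shared observations

print(loose.loc["a", "b"])   # based on only two points
print(strict.loc["a", "b"])  # NaN, because too few observations are shared
```

A coefficient computed from two points is always ±1 and carries no information, which is the kind of result `min_periods` is meant to suppress.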

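+ Covariance matrix

    Note that the frame is transposed before `cov()` is called, so the result is the covariance between rows (samples), a 150×150 matrix, rather than between the four features.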

In [14]:
iris_copy = iris_data.copy()

In [15]:
iris_cov = iris_copy[iris_copy.columns[:-1]].T.cov()

In [16]:
iris_cov[:5]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,140,141,142,143,144,145,146,147,148,149
0,4.75,4.421667,4.353333,4.16,4.696667,4.86,4.215,4.595,3.965,4.493333,...,2.65,3.09,2.341667,2.73,2.596667,2.85,2.741667,2.915,2.475,2.6
1,4.421667,4.149167,4.055,3.885,4.358333,4.515,3.9075,4.284167,3.7075,4.21,...,2.725,3.128333,2.409167,2.805,2.661667,2.906667,2.820833,2.955833,2.504167,2.628333
2,4.353333,4.055,3.99,3.813333,4.303333,4.453333,3.861667,4.211667,3.635,4.12,...,2.446667,2.85,2.161667,2.52,2.396667,2.63,2.531667,2.688333,2.281667,2.396667
3,4.16,3.885,3.813333,3.656667,4.11,4.256667,3.688333,4.031667,3.485,3.953333,...,2.493333,2.856667,2.218333,2.58,2.443333,2.653333,2.571667,2.718333,2.321667,2.44
4,4.696667,4.358333,4.303333,4.11,4.65,4.81,4.175,4.541667,3.915,4.433333,...,2.53,2.963333,2.238333,2.61,2.483333,2.726667,2.615,2.798333,2.381667,2.503333


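+ Pearson correlation

    This can be computed with numpy. `np.corrcoef` treats each row as a variable by default, so the result is again the correlation between samples.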

In [17]:
import numpy as np

In [18]:
iris_copy = iris_data.copy()
iris_ = iris_copy[iris_copy.columns[:-1]]

In [22]:
pd.DataFrame(np.corrcoef(iris_.values))[:5]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,140,141,142,143,144,145,146,147,148,149
0,1.0,0.995999,0.999974,0.998168,0.999347,0.999586,0.998811,0.999538,0.998077,0.996552,...,0.597825,0.685581,0.574649,0.584668,0.603048,0.646865,0.605998,0.653473,0.633917,0.633158
1,0.995999,1.0,0.996607,0.997397,0.992233,0.993592,0.990721,0.997118,0.998546,0.999033,...,0.65775,0.742643,0.632574,0.642756,0.661387,0.705879,0.667114,0.708983,0.686257,0.684835
2,0.999974,0.996607,1.0,0.998333,0.999061,0.999377,0.998438,0.999605,0.998356,0.996986,...,0.602231,0.689931,0.578798,0.588854,0.6073,0.651305,0.610553,0.657556,0.637631,0.636806
3,0.998168,0.997397,0.998333,1.0,0.996719,0.997833,0.996139,0.999546,0.999833,0.999307,...,0.64108,0.722377,0.620453,0.629754,0.646729,0.68638,0.647851,0.694538,0.677737,0.677225
4,0.999347,0.992233,0.999061,0.996719,1.0,0.999883,0.999914,0.998503,0.996031,0.993761,...,0.576858,0.66451,0.555166,0.564947,0.582896,0.625491,0.584183,0.634029,0.616536,0.616138
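
To correlate the columns (features) rather than the rows with numpy, `np.corrcoef` takes `rowvar=False`. The sketch below uses a hypothetical 5×3 frame and checks that this matches `DataFrame.corr()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 5 samples, 3 features.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.1, 3.9, 6.2, 8.1, 9.8],
    "z": [5.0, 3.0, 4.0, 1.0, 2.0],
})

by_rows = np.corrcoef(df.values)                # rows as variables
by_cols = np.corrcoef(df.values, rowvar=False)  # columns as variables

print(by_rows.shape)  # one entry per pair of samples
print(by_cols.shape)  # one entry per pair of features
print(np.allclose(by_cols, df.corr().values))
```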


You can also use the `corr` method in pandas.

`corr` supports the following methods:

+ pearson

    (default) Pearson correlation coefficient

+ kendall

    Kendall Tau correlation coefficient

+ spearman

    Spearman rank correlation coefficient

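Select the method with the `method` keyword. Note that older pandas versions silently exclude non-numeric columns from the calculation, whereas pandas 2.0 and later raise an error unless `numeric_only=True` is passed. To keep the intent explicit, either document it in a comment or exclude or convert those columns yourself.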

In [24]:
iris_.corr(method='spearman')[:5]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
sepal_length,1.0,-0.159457,0.881386,0.834421
sepal_width,-0.159457,1.0,-0.303421,-0.277511
petal_length,0.881386,-0.303421,1.0,0.936003
petal_width,0.834421,-0.277511,0.936003,1.0
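
As a hedged illustration of why the choice of method matters, the sketch below uses hypothetical data in which `y` grows monotonically but not linearly with `x`. The non-numeric column `label` is dropped explicitly with `select_dtypes`, which is one way to make the exclusion visible in the code.

```python
import pandas as pd

# Hypothetical toy data, not the iris dataset.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [1.0, 8.0, 27.0, 64.0, 125.0],  # y = x**3: monotonic, not linear
    "label": ["a", "b", "a", "b", "a"],  # non-numeric column
})

# Exclude non-numeric columns explicitly instead of relying on defaults.
numeric = df.select_dtypes(include="number")

pearson = numeric.corr(method="pearson").loc["x", "y"]
spearman = numeric.corr(method="spearman").loc["x", "y"]

print(pearson)   # below 1, because the relationship is not linear
print(spearman)  # rank correlation of a strictly increasing relationship
```

Spearman works on ranks, so any strictly increasing relationship gives a perfect score, while Pearson only rewards linear ones.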
