# Pandas

pandas is a powerful Python toolkit for data analysis, built on top of NumPy.

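Main features of pandas:

    Data structures with automatic alignment: DataFrame and Series
    
    Built-in time-series functionality
    
    A rich set of mathematical operations
    
    Flexible handling of missing data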


## Series

### Series: a one-dimensional array object

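A Series is an object similar to a one-dimensional array, made up of a sequence of values and an associated set of data labels (the index).

The values attribute returns the underlying array of values, and the index attribute returns the index.

A Series behaves like a cross between a list and a dictionary.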
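As a small illustration of the two attributes just mentioned, the sketch below builds a Series with made-up values and reads back its value array and its index. The variable names are chosen here for the example only.

```python
import pandas as pd

sr = pd.Series([1, 3, 5], index=["a", "b", "c"])

values = sr.values  # the data as a NumPy array
labels = sr.index   # the labels as a pandas Index

print(values)
print(labels)
```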

A Series can be created in several ways:

In [1]:
import pandas as pd

In [2]:
pd.Series([1, 3, 5, 7, 9])

0    1
1    3
2    5
3    7
4    9
dtype: int64

In [3]:
pd.Series([1, 3, 5, 7, 9], index = ['a', 'b', 'c', 'd', 'e'])

a    1
b    3
c    5
d    7
e    9
dtype: int64

In [4]:
pd.Series({'a':1, 'b':3, 'c':5, 'd':7, 'e':9})

a    1
b    3
c    5
d    7
e    9
dtype: int64

In [5]:
pd.Series(0, index = ['a', 'b', 'c', 'd', 'e'])

a    0
b    0
c    0
d    0
e    0
dtype: int64

### Series: features

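Series supports array-like features (positional access):

    Creating a Series from an ndarray
    
    Arithmetic with scalars
    
    Arithmetic between two Series
    
    Indexing by position
    
    Slicing
    
    NumPy universal functions (ufuncs)
    
    Boolean filtering
    
Series supports dictionary-like features (label access):

    Creating a Series from a dictionary
    
    The in operator
    
    Indexing by key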
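The features listed above can be sketched in a few lines. The data and variable names are illustrative assumptions, not taken from the original notes.

```python
import numpy as np
import pandas as pd

# Array-like behaviour
sr = pd.Series(np.array([1, 3, 5, 7]), index=["a", "b", "c", "d"])
doubled = sr * 2      # arithmetic with a scalar
summed = sr + sr      # arithmetic between two Series
first = sr.iloc[0]    # indexing by position
head = sr.iloc[0:2]   # slicing by position (end excluded)
roots = np.sqrt(sr)   # a NumPy ufunc applied element-wise
large = sr[sr > 3]    # boolean filtering

# Dictionary-like behaviour
from_dict = pd.Series({"a": 1, "b": 3})
has_a = "a" in from_dict  # membership is tested against the index
b_value = from_dict["b"]  # lookup by key
```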

### Series: integer indexes

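If the index is of integer type, selecting a value with an integer in square brackets is always label-oriented, not position-oriented.

loc attribute: interprets the indexer as a label

iloc attribute: interprets the indexer as a position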
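A minimal sketch of the difference, using a Series whose integer labels are deliberately out of order (the values are made up for the example):

```python
import pandas as pd

sr = pd.Series([10, 20, 30], index=[2, 0, 1])

by_brackets = sr[0]      # label 0, because the index is integer-typed
by_label = sr.loc[0]     # label 0, stated explicitly
by_position = sr.iloc[0] # first element, regardless of its label

print(by_brackets, by_label, by_position)
```

Using loc or iloc explicitly avoids having to remember which interpretation plain square brackets will choose.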

### Series: data alignment

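When operating on two Series objects, pandas first aligns them by index and then computes. Labels that appear in only one operand produce NaN.

Flexible arithmetic methods: add, sub, div, mul (each accepts a fill_value for labels missing on one side)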

In [6]:
sr1 = pd.Series([12, 23, 34], index = ['a', 'b', 'c'])
sr2 = pd.Series([45, 56, 67], index = ['b', 'c', 'd'])

In [7]:
sr = sr1 + sr2
print(sr)

a     NaN
b    68.0
c    90.0
d     NaN
dtype: float64


In [8]:
sr1.add(sr2, fill_value = 0)

a    12.0
b    68.0
c    90.0
d    67.0
dtype: float64

### Handling missing values

In [9]:
sr

a     NaN
b    68.0
c    90.0
d     NaN
dtype: float64

In [10]:
sr.isnull()

a     True
b    False
c    False
d     True
dtype: bool

In [11]:
sr.notnull()

a    False
b     True
c     True
d    False
dtype: bool

In [12]:
sr[sr.notnull()]

b    68.0
c    90.0
dtype: float64

In [13]:
sr.dropna()

b    68.0
c    90.0
dtype: float64

In [14]:
sr

a     NaN
b    68.0
c    90.0
d     NaN
dtype: float64

In [15]:
sr.fillna(0)

a     0.0
b    68.0
c    90.0
d     0.0
dtype: float64

In [16]:
sr

a     NaN
b    68.0
c    90.0
d     NaN
dtype: float64

In [17]:
sr = sr.fillna(0)

In [18]:
sr

a     0.0
b    68.0
c    90.0
d     0.0
dtype: float64

In [19]:
sr = sr1 + sr2

In [20]:
sr.fillna(sr.mean())

a    79.0
b    68.0
c    90.0
d    79.0
dtype: float64

In [21]:
sr.mean()

79.0

## DataFrame

A DataFrame is a tabular data structure containing an ordered collection of columns.

A DataFrame can be viewed as a dictionary of Series that all share one index.
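The "dictionary of Series" view can be checked directly: selecting a column by key returns a Series, and that Series carries the same index as the DataFrame. The sketch below uses made-up data.

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6]}, index=["a", "b", "c"])

column = df["one"]                          # dictionary-style lookup by column name
is_series = isinstance(column, pd.Series)   # each column is a Series
same_index = column.index.equals(df.index)  # all columns share the row index

print(is_series, same_index)
```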

### Creating a DataFrame

In [22]:
pd.DataFrame({'one':[1, 2, 3], 'two':[4, 5, 6]})

|  | one | two |
| --- | --- | --- |
| 0 | 1 | 4 |
| 1 | 2 | 5 |
| 2 | 3 | 6 |


In [23]:
pd.DataFrame({'one':[1, 2, 3], 'two':[4, 5, 6]}, index = ['a', 'b', 'c'])

|  | one | two |
| --- | --- | --- |
| a | 1 | 4 |
| b | 2 | 5 |
| c | 3 | 6 |


In [24]:
df = _  # in IPython, _ holds the output of the previous cell

In [25]:
df

|  | one | two |
| --- | --- | --- |
| a | 1 | 4 |
| b | 2 | 5 |
| c | 3 | 6 |


In [26]:
pd.DataFrame({'one':pd.Series([1, 2, 3], index = ['a', 'b', 'c']), 'two':pd.Series([1, 2, 3, 4], index = ['b', 'a', 'c', 'd'])})

|  | one | two |
| --- | --- | --- |
| a | 1.0 | 2 |
| b | 2.0 | 1 |
| c | 3.0 | 3 |
| d | NaN | 4 |


### CSV files


In [27]:
df.to_csv("test.csv")

In [28]:
!cat test.csv

,one,two
a,1,4
b,2,5
c,3,6


In [29]:
pd.read_csv("test.csv")

|  | Unnamed: 0 | one | two |
| --- | --- | --- | --- |
| 0 | a | 1 | 4 |
| 1 | b | 2 | 5 |
| 2 | c | 3 | 6 |
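The extra "Unnamed: 0" column appears because to_csv writes the row index as the first, unnamed column and read_csv then treats it as ordinary data. A minimal sketch of one way to restore the index is to pass index_col=0; an in-memory buffer is used here instead of test.csv purely to keep the example self-contained.

```python
import io

import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6]}, index=["a", "b", "c"])

csv_text = df.to_csv()  # with no path, to_csv returns the CSV as a string

# index_col=0 tells read_csv to use the first column as the row index
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(restored)
```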


### DataFrame: common attributes

| Attribute / method | Description |
| --- | --- |
| index | Row index |
| T | Transpose |
| columns | Column index |
| values | Array of values |
| describe() | Quick summary statistics |

In [30]:
df

|  | one | two |
| --- | --- | --- |
| a | 1 | 4 |
| b | 2 | 5 |
| c | 3 | 6 |


In [31]:
df.index

Index(['a', 'b', 'c'], dtype='object')

In [32]:
df.columns

Index(['one', 'two'], dtype='object')

In [33]:
df.T

|  | a | b | c |
| --- | --- | --- | --- |
| one | 1 | 2 | 3 |
| two | 4 | 5 | 6 |


In [34]:
df.describe()

|  | one | two |
| --- | --- | --- |
| count | 3.0 | 3.0 |
| mean | 2.0 | 5.0 |
| std | 1.0 | 1.0 |
| min | 1.0 | 4.0 |
| 25% | 1.5 | 4.5 |
| 50% | 2.0 | 5.0 |
| 75% | 2.5 | 5.5 |
| max | 3.0 | 6.0 |


In [35]:
df

|  | one | two |
| --- | --- | --- |
| a | 1 | 4 |
| b | 2 | 5 |
| c | 3 | 6 |


In [36]:
df.values

array([[1, 4],
       [2, 5],
       [3, 6]])

### DataFrame: indexing and slicing

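A DataFrame is a two-dimensional data type, so it has both a row index and a column index.

A DataFrame can likewise be indexed and sliced by label or by position.

loc and iloc attributes
    
    Usage: separate the two parts with a comma; the row indexer comes first, the column indexer second
    
    The row and column parts can each be a plain index, a slice, a boolean index, or a fancy index (a list of labels or positions), in any combination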

In [37]:
df["one"]["a"]

1

In [38]:
df.loc['a', :]

one    1
two    4
Name: a, dtype: int64

In [39]:
df.loc[['a', 'c'], :]

|  | one | two |
| --- | --- | --- |
| a | 1 | 4 |
| c | 3 | 6 |
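To show how the different indexer kinds can be mixed in loc and iloc, here is a short sketch on the same small table. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6]}, index=["a", "b", "c"])

cell = df.loc["a", "one"]                 # plain label for row and column
rows = df.loc["a":"b", "two"]             # label slice: the end label is included
by_pos = df.iloc[0:2, 1]                  # positional slice: the end position is excluded
masked = df.loc[df["one"] > 1, ["two"]]   # boolean rows combined with a column list
picked = df.iloc[[0, 2], [1, 0]]          # fancy indexing on both axes
```

Note that label slices include their end point while positional slices do not, which is why the loc and iloc slices above return the same two rows.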


### DataFrame: data alignment and missing data

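When DataFrame objects are combined in an operation, their row indexes and column indexes are aligned separately.

Methods for handling missing data in a DataFrame: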

    dropna()
    
    fillna()
    
    isnull()
    
    notnull()

In [40]:
df1 = pd.DataFrame({'one':[4, 5, 6, 7], 'two':[1, 2, 3, 4]}, index = ['c', 'd', 'b', 'a'])
df2 = pd.DataFrame({'one':pd.Series([1, 2, 3], index = ['a', 'b', 'c']), 'two':pd.Series([1, 2, 3, 4], index = ['a', 'b', 'c', 'd'])})

In [41]:
df1

|  | one | two |
| --- | --- | --- |
| c | 4 | 1 |
| d | 5 | 2 |
| b | 6 | 3 |
| a | 7 | 4 |


In [42]:
df2

|  | one | two |
| --- | --- | --- |
| a | 1.0 | 1 |
| b | 2.0 | 2 |
| c | 3.0 | 3 |
| d | NaN | 4 |


In [43]:
df1 + df2

|  | one | two |
| --- | --- | --- |
| a | 8.0 | 5 |
| b | 8.0 | 5 |
| c | 7.0 | 4 |
| d | NaN | 6 |
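The NaN in row d comes from the missing value in one operand. As with Series, the flexible add method accepts fill_value, which substitutes a default for a value missing on one side before computing. A minimal sketch with the same two tables:

```python
import pandas as pd

df1 = pd.DataFrame({"one": [4, 5, 6, 7], "two": [1, 2, 3, 4]}, index=["c", "d", "b", "a"])
df2 = pd.DataFrame({
    "one": pd.Series([1, 2, 3], index=["a", "b", "c"]),
    "two": pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"]),
})

plain = df1 + df2                   # row d of column one is NaN
filled = df1.add(df2, fill_value=0) # the missing entry is treated as 0

print(filled)
```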


In [44]:
df2.fillna(0)

|  | one | two |
| --- | --- | --- |
| a | 1.0 | 1 |
| b | 2.0 | 2 |
| c | 3.0 | 3 |
| d | 0.0 | 4 |


In [45]:
df2.dropna()

|  | one | two |
| --- | --- | --- |
| a | 1.0 | 1 |
| b | 2.0 | 2 |
| c | 3.0 | 3 |


In [46]:
df3 = df2

In [47]:
id(df3)

140101203378384

In [48]:
id(df2)

140101203378384

In [49]:
import copy
import numpy as np

In [50]:
df3 = copy.deepcopy(df2)

In [51]:
id(df2)

140101203378384

In [52]:
id(df3)

140101203396048

In [53]:
df3.loc['d', 'two'] = np.nan

In [54]:
df3.loc['c', 'two'] = np.nan

In [55]:
df3

|  | one | two |
| --- | --- | --- |
| a | 1.0 | 1.0 |
| b | 2.0 | 2.0 |
| c | 3.0 | NaN |
| d | NaN | NaN |


In [56]:
df3.dropna()

|  | one | two |
| --- | --- | --- |
| a | 1.0 | 1.0 |
| b | 2.0 | 2.0 |


In [57]:
df3.dropna(how = 'all')

|  | one | two |
| --- | --- | --- |
| a | 1.0 | 1.0 |
| b | 2.0 | 2.0 |
| c | 3.0 | NaN |


In [58]:
df3

|  | one | two |
| --- | --- | --- |
| a | 1.0 | 1.0 |
| b | 2.0 | 2.0 |
| c | 3.0 | NaN |
| d | NaN | NaN |


In [59]:
df3.iloc[3, 0] = 0
df3.dropna(axis = 1)

|  | one |
| --- | --- |
| a | 1.0 |
| b | 2.0 |
| c | 3.0 |
| d | 0.0 |


In [60]:
df3

|  | one | two |
| --- | --- | --- |
| a | 1.0 | 1.0 |
| b | 2.0 | 2.0 |
| c | 3.0 | NaN |
| d | 0.0 | NaN |


### pandas: common functions

In [61]:
df

|  | one | two |
| --- | --- | --- |
| a | 1 | 4 |
| b | 2 | 5 |
| c | 3 | 6 |


In [62]:
df.mean()

one    2.0
two    5.0
dtype: float64

In [63]:
df.sum()

one     6
two    15
dtype: int64

In [64]:
df.sort_values(by = 'one')

|  | one | two |
| --- | --- | --- |
| a | 1 | 4 |
| b | 2 | 5 |
| c | 3 | 6 |


In [65]:
df.sort_values(by = 'a', ascending = False, axis = 1)

|  | two | one |
| --- | --- | --- |
| a | 4 | 1 |
| b | 5 | 2 |
| c | 6 | 3 |


In [66]:
df.sort_index()

|  | one | two |
| --- | --- | --- |
| a | 1 | 4 |
| b | 2 | 5 |
| c | 3 | 6 |


In [67]:
df.sort_index(ascending=False)

|  | one | two |
| --- | --- | --- |
| c | 3 | 6 |
| b | 2 | 5 |
| a | 1 | 4 |


## Working with time objects

In [68]:
pd.to_datetime(['2020-1-17', '2020/1/19', '2020-Jan-20'])

DatetimeIndex(['2020-01-17', '2020-01-19', '2020-01-20'], dtype='datetime64[ns]', freq=None)
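The call above relies on pandas parsing each string with its own format. Newer releases (pandas 2.0 and later) infer a single format from the first element, so a list that mixes formats like this one may raise unless format='mixed' is passed. As a hedged, version-independent sketch, each string can be parsed separately with pd.Timestamp and the results collected into a DatetimeIndex:

```python
import pandas as pd

raw = ["2020-1-17", "2020/1/19", "2020-Jan-20"]

# Parse each string on its own so that no single format is assumed for the whole list
idx = pd.DatetimeIndex([pd.Timestamp(s) for s in raw])
print(idx)
```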

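Generate an array of time objects: date_range

    start    start time
    
    end    end time
    
    periods    number of periods to generate
    
    freq    frequency, default 'D' (day); other options include H (hour), W (week), B (business day), SM (semi-month), M (month end), min or T (minute), S (second), A (year end). pandas 2.2 and later deprecate several of these in favor of h, SME, ME, min, s and YE.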

In [69]:
pd.date_range('2020-1-18', '2020-2-14')

DatetimeIndex(['2020-01-18', '2020-01-19', '2020-01-20', '2020-01-21',
               '2020-01-22', '2020-01-23', '2020-01-24', '2020-01-25',
               '2020-01-26', '2020-01-27', '2020-01-28', '2020-01-29',
               '2020-01-30', '2020-01-31', '2020-02-01', '2020-02-02',
               '2020-02-03', '2020-02-04', '2020-02-05', '2020-02-06',
               '2020-02-07', '2020-02-08', '2020-02-09', '2020-02-10',
               '2020-02-11', '2020-02-12', '2020-02-13', '2020-02-14'],
              dtype='datetime64[ns]', freq='D')

In [70]:
pd.date_range('2020-1-18', periods = 70)

DatetimeIndex(['2020-01-18', '2020-01-19', '2020-01-20', '2020-01-21',
               '2020-01-22', '2020-01-23', '2020-01-24', '2020-01-25',
               '2020-01-26', '2020-01-27', '2020-01-28', '2020-01-29',
               '2020-01-30', '2020-01-31', '2020-02-01', '2020-02-02',
               '2020-02-03', '2020-02-04', '2020-02-05', '2020-02-06',
               '2020-02-07', '2020-02-08', '2020-02-09', '2020-02-10',
               '2020-02-11', '2020-02-12', '2020-02-13', '2020-02-14',
               '2020-02-15', '2020-02-16', '2020-02-17', '2020-02-18',
               '2020-02-19', '2020-02-20', '2020-02-21', '2020-02-22',
               '2020-02-23', '2020-02-24', '2020-02-25', '2020-02-26',
               '2020-02-27', '2020-02-28', '2020-02-29', '2020-03-01',
               '2020-03-02', '2020-03-03', '2020-03-04', '2020-03-05',
               '2020-03-06', '2020-03-07', '2020-03-08', '2020-03-09',
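               '2020-03-10', '2020-03-11', '2020-03-12', '2020-03-13',
               '2020-03-14', '2020-03-15', '2020-03-16', '2020-03-17',
               '2020-03-18', '2020-03-19', '2020-03-20', '2020-03-21',
               '2020-03-22', '2020-03-23', '2020-03-24', '2020-03-25',
               '2020-03-26', '2020-03-27'],
              dtype='datetime64[ns]', freq='D')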

In [71]:
pd.date_range('2020-1-17', periods = 10, freq = 'H')

DatetimeIndex(['2020-01-17 00:00:00', '2020-01-17 01:00:00',
               '2020-01-17 02:00:00', '2020-01-17 03:00:00',
               '2020-01-17 04:00:00', '2020-01-17 05:00:00',
               '2020-01-17 06:00:00', '2020-01-17 07:00:00',
               '2020-01-17 08:00:00', '2020-01-17 09:00:00'],
              dtype='datetime64[ns]', freq='H')

In [72]:
pd.date_range('2020-1-17', periods = 10, freq = 'M')

DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30',
               '2020-05-31', '2020-06-30', '2020-07-31', '2020-08-31',
               '2020-09-30', '2020-10-31'],
              dtype='datetime64[ns]', freq='M')

In [73]:
pd.date_range('2020-1-17', periods = 10, freq = 'B')

DatetimeIndex(['2020-01-17', '2020-01-20', '2020-01-21', '2020-01-22',
               '2020-01-23', '2020-01-24', '2020-01-27', '2020-01-28',
               '2020-01-29', '2020-01-30'],
              dtype='datetime64[ns]', freq='B')

## Pandas: time series

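A time series is a Series or DataFrame whose index consists of time objects.

When datetime objects are used as an index, they are stored in a DatetimeIndex object.

Special features of time series:

    Slicing by passing a year or a year-month string
    
    Slicing by passing a date range
    
    A rich set of supporting functions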
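The three features above can be sketched as follows. Daily data is used here as an assumption to keep the output small; the same selection rules apply to hourly data. resample is shown as one example of the time-aware methods available.

```python
import numpy as np
import pandas as pd

# 100 consecutive days starting on 2020-01-17 (illustrative data)
sr = pd.Series(np.arange(100), index=pd.date_range("2020-01-17", periods=100, freq="D"))

whole_year = sr.loc["2020"]                      # every entry dated in 2020
february = sr.loc["2020-02"]                     # every entry dated in February 2020
early_march = sr.loc["2020-03-01":"2020-03-05"]  # date range, both endpoints included
weekly = sr.resample("W").sum()                  # aggregate into weekly totals

print(february.head())
print(early_march)
```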

In [74]:
sr = pd.Series(np.arange(1000), index = pd.date_range('2020-1-17', periods = 1000, freq = "H"))

In [75]:
sr

2020-01-17 00:00:00      0
2020-01-17 01:00:00      1
2020-01-17 02:00:00      2
2020-01-17 03:00:00      3
2020-01-17 04:00:00      4
                      ... 
2020-02-27 11:00:00    995
2020-02-27 12:00:00    996
2020-02-27 13:00:00    997
2020-02-27 14:00:00    998
2020-02-27 15:00:00    999
Freq: H, Length: 1000, dtype: int64

In [76]:
type(sr)

pandas.core.series.Series

In [77]:
sr['2020-1-17':'2020-1-18']

2020-01-17 00:00:00     0
2020-01-17 01:00:00     1
2020-01-17 02:00:00     2
2020-01-17 03:00:00     3
2020-01-17 04:00:00     4
2020-01-17 05:00:00     5
2020-01-17 06:00:00     6
2020-01-17 07:00:00     7
2020-01-17 08:00:00     8
2020-01-17 09:00:00     9
2020-01-17 10:00:00    10
2020-01-17 11:00:00    11
2020-01-17 12:00:00    12
2020-01-17 13:00:00    13
2020-01-17 14:00:00    14
2020-01-17 15:00:00    15
2020-01-17 16:00:00    16
2020-01-17 17:00:00    17
2020-01-17 18:00:00    18
2020-01-17 19:00:00    19
2020-01-17 20:00:00    20
2020-01-17 21:00:00    21
2020-01-17 22:00:00    22
2020-01-17 23:00:00    23
2020-01-18 00:00:00    24
2020-01-18 01:00:00    25
2020-01-18 02:00:00    26
2020-01-18 03:00:00    27
2020-01-18 04:00:00    28
2020-01-18 05:00:00    29
2020-01-18 06:00:00    30
2020-01-18 07:00:00    31
2020-01-18 08:00:00    32
2020-01-18 09:00:00    33
2020-01-18 10:00:00    34
2020-01-18 11:00:00    35
2020-01-18 12:00:00    36
2020-01-18 13:00:00    37
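2020-01-18 14:00:00    38
2020-01-18 15:00:00    39
2020-01-18 16:00:00    40
2020-01-18 17:00:00    41
2020-01-18 18:00:00    42
2020-01-18 19:00:00    43
2020-01-18 20:00:00    44
2020-01-18 21:00:00    45
2020-01-18 22:00:00    46
2020-01-18 23:00:00    47
Freq: H, dtype: int64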

## Pandas: file handling

#### Reading files

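Main parameters of the read_csv and read_table functions:

| Parameter | Purpose |
| --- | --- |
| sep | Delimiter; may be a regular expression |
| header=None | The file has no header row |
| names | Column names to use |
| index_col | Column to use as the index |
| skiprows | Rows to skip |
| na_values | Strings to treat as missing values |
| parse_dates | Which columns to parse as dates; a boolean or a list |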
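The sketch below exercises the parameters in the table on a small in-memory file. The file content, the column names and the "missing" marker are all made up for the example.

```python
import io

import pandas as pd

# A hypothetical semicolon-separated export: one comment line, no header row
raw = io.StringIO(
    "exported by some tool\n"
    "2020-01-17;1.5;ok\n"
    "2020-01-18;missing;ok\n"
    "2020-01-19;3.0;bad\n"
)

data = pd.read_csv(
    raw,
    sep=";",                            # field delimiter
    header=None,                        # the file carries no column names
    names=["date", "value", "status"],  # so supply them here
    index_col="date",                   # use the date column as the index
    skiprows=1,                         # skip the leading comment line
    na_values=["missing"],              # treat this string as NaN
    parse_dates=True,                   # parse the index as dates
)
print(data)
```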

#### Writing files

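Main parameters of the to_csv function:

| Parameter | Purpose |
| --- | --- |
| sep | Field delimiter |
| na_rep | String written for missing values; defaults to the empty string |
| header=False | Do not write the header row |
| index=False | Do not write the row-index column |
| columns | Columns to write, given as a list |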
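As a minimal sketch of these parameters working together, the example below writes a small made-up table to a string (to_csv returns the text when no path is given) and inspects the resulting lines.

```python
import pandas as pd

table = pd.DataFrame({"one": [1.0, None], "two": [3, 4], "three": ["x", "y"]})

text = table.to_csv(
    sep="\t",                # tab-separated output
    na_rep="NULL",           # how missing values are written
    header=False,            # omit the header row
    index=False,             # omit the row-index column
    columns=["one", "two"],  # write only these columns
)
lines = text.splitlines()
print(lines)
```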
