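# Data Structures Overview
The previous chapters introduced the three data structures we use most often. This chapter summarizes them and then, building on that summary, introduces the basic data types and a few very useful intermediate (accessor) types.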

In [4]:
import numpy as np
import pandas as pd

# 1. Three Common Data Structures

## 1.1 Series
A Series is a one-dimensional data structure made up of elements of the same data type. It behaves like both a list and a dictionary.

#### Four important attributes
- Series.index
- Series.name
- Series.values
- Series.dtype
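
As a minimal sketch of the four attributes above, the following builds a small Series and reads each one. The data, labels, and the name `score` are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the attributes.
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'], name='score')

print(s.index)    # the row labels, an Index object
print(s.name)     # the name of the Series
print(s.values)   # the underlying data as a numpy array
print(s.dtype)    # the single dtype shared by all elements

# List-like access by position and dict-like access by label.
first_by_position = s.iloc[0]
first_by_label = s['a']
```

Both lookups return the same element, which is what "list and dictionary behavior" means in practice.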

## 1.2 DataFrame
A DataFrame is a two-dimensional data structure made up of Series that share the same index.

#### Four important attributes
- DataFrame.index
- DataFrame.columns
- DataFrame.values
- DataFrame.dtypes
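
The attributes above can be inspected on a small example. This is only a sketch; the column names `x` and `y` and the values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical two-column frame with an explicit row index.
df = pd.DataFrame({'x': [1, 2, 3], 'y': [0.5, 1.5, 2.5]},
                  index=['a', 'b', 'c'])

print(df.index)     # row labels
print(df.columns)   # column labels, also an Index
print(df.values)    # 2-D numpy array of the data
print(df.dtypes)    # one dtype per column

# Each column is a Series that shares the DataFrame's index.
col = df['x']
same_index = col.index.equals(df.index)
```

Note that `dtypes` is plural here because every column keeps its own dtype.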

## 1.3 Index
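Index is the key to building and operating on Series and DataFrame. It behaves like a tuple: it is ordered and cannot be modified in place.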

#### Three important attributes
- Index.name
- Index.values
- Index.dtype
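
A short sketch of these attributes, together with the tuple-like behavior of Index. The labels and the name `key` are hypothetical.

```python
import pandas as pd

idx = pd.Index(['a', 'b', 'c'], name='key')

print(idx.name)
print(idx.values)
print(idx.dtype)

# Like a tuple, an Index supports positional access and slicing...
first = idx[0]
tail = idx[1:]

# ...but item assignment is rejected.
try:
    idx[0] = 'z'
    mutable = True
except TypeError:
    mutable = False
```

To change labels, one would build a new Index rather than edit the existing one.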

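# 2. Basic Data Types
- These data types all actually come from numpy;
- the basic data types do not include a string type, so strings are stored as `object_`;
- so when using these types, add the prefix `np.`.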
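
The points above can be sketched as follows. The Series contents are hypothetical, and the string Series requests `object` explicitly so that the example does not depend on how a particular pandas version infers string dtypes.

```python
import numpy as np
import pandas as pd

# numpy types are referenced with the np. prefix when passed as a dtype.
s_int = pd.Series([1, 2, 3], dtype=np.int32)
s_float = s_int.astype(np.float64)

# Strings held in an object-dtype Series.
s_str = pd.Series(['a', 'b'], dtype=object)

print(s_int.dtype, s_float.dtype, s_str.dtype)
```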

## 2.1 Boolean

In [6]:
# Column labels: 类别 = type name, 说明 = description, 简称 = character code
columns = ['类别', '说明', '简称']

In [13]:
data = [ ['bool_','compatible: Python bool','?'],
         ['bool8','8 bits',''] ] 
pd.DataFrame(data, columns = columns)

|   | 类别 | 说明 | 简称 |
|---|---|---|---|
| 0 | bool_ | compatible: Python bool | ? |
| 1 | bool8 | 8 bits |  |
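
As a brief illustration of the boolean type in use, the sketch below checks the dtype of a boolean Series and uses a comparison result as a mask. The values are hypothetical.

```python
import numpy as np
import pandas as pd

flags = pd.Series([True, False, True])
print(flags.dtype)                    # boolean dtype
print(np.dtype(np.bool_).itemsize)    # storage size in bytes

# Comparisons produce a boolean Series that can be used as a mask.
values = pd.Series([5, 12, 7])
mask = values > 6
selected = values[mask]
```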


## 2.2 Integers

### 2.2.1 Signed integers

In [16]:
data = [['byte','compatible: C char','b'],
['short','compatible: C short','h'],
['intc','compatible: C int','i'],
['int_','compatible: Python int','l'],
['longlong','compatible: C long long','q'],
['intp','large enough to fit a pointer','p'],
['int8','8 bits','' ],
['int16','16 bits','' ],
['int32','32 bits',''],
['int64','64 bits','']]
pd.DataFrame(data = data, columns = columns)

|   | 类别 | 说明 | 简称 |
|---|---|---|---|
| 0 | byte | compatible: C char | b |
| 1 | short | compatible: C short | h |
| 2 | intc | compatible: C int | i |
| 3 | int_ | compatible: Python int | l |
| 4 | longlong | compatible: C long long | q |
| 5 | intp | large enough to fit a pointer | p |
| 6 | int8 | 8 bits |  |
| 7 | int16 | 16 bits |  |
| 8 | int32 | 32 bits |  |
| 9 | int64 | 64 bits |  |


### 2.2.2 Unsigned integers

In [17]:
data = [['ubyte','compatible: C unsigned char','B'],
['ushort','compatible: C unsigned short','H'],
['uintc','compatible: C unsigned int','I'],
['uint','compatible: Python int','L'],
['ulonglong','compatible: C unsigned long long','Q'],
['uintp','large enough to fit a pointer','P'],
['uint8','8 bits',''],
['uint16','16 bits',''],
['uint32','32 bits',''],
['uint64','64 bits','']]
pd.DataFrame(data = data, columns = columns)

|   | 类别 | 说明 | 简称 |
|---|---|---|---|
| 0 | ubyte | compatible: C unsigned char | B |
| 1 | ushort | compatible: C unsigned short | H |
| 2 | uintc | compatible: C unsigned int | I |
| 3 | uint | compatible: Python int | L |
| 4 | ulonglong | compatible: C unsigned long long | Q |
| 5 | uintp | large enough to fit a pointer | P |
| 6 | uint8 | 8 bits |  |
| 7 | uint16 | 16 bits |  |
| 8 | uint32 | 32 bits |  |
| 9 | uint64 | 64 bits |  |
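
To make the bit widths in the signed and unsigned tables concrete, the sketch below uses `np.iinfo` to list the size and value range of the fixed-width integer types. Only the fixed-width names are used here, since the C-compatible names vary in size across platforms.

```python
import numpy as np

# Collect (name, size in bytes, min, max) for the fixed-width integer types.
rows = []
for t in (np.int8, np.int16, np.int32, np.int64,
          np.uint8, np.uint16, np.uint32, np.uint64):
    info = np.iinfo(t)
    rows.append((t.__name__, np.dtype(t).itemsize, info.min, info.max))

for name, size, lo, hi in rows:
    print(f"{name:>6}: {size} byte(s), range [{lo}, {hi}]")

# The one-character codes from the tables can be passed to np.dtype.
byte_dtype = np.dtype('b')    # signed char
ubyte_dtype = np.dtype('B')   # unsigned char
```

Choosing the smallest type that covers the data range is a common way to reduce memory use, at the cost of overflow if values later exceed that range.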


## 2.3 Floating point

In [18]:
data = [['half',' ','e'],
['single','compatible: C float','f'],
['double','compatible: C double',''],
['float_','compatible: Python float','d'],
['longfloat','compatible: C long double','g'],
['float16','16 bits',''],
['float32','32 bits',''],
['float64','64 bits',''],
['float96','96 bits, platform?',''],
['float128','128 bits, platform?','']]
pd.DataFrame(data = data, columns = columns)

|   | 类别 | 说明 | 简称 |
|---|---|---|---|
| 0 | half |  | e |
| 1 | single | compatible: C float | f |
| 2 | double | compatible: C double |  |
| 3 | float_ | compatible: Python float | d |
| 4 | longfloat | compatible: C long double | g |
| 5 | float16 | 16 bits |  |
| 6 | float32 | 32 bits |  |
| 7 | float64 | 64 bits |  |
| 8 | float96 | 96 bits, platform? |  |
| 9 | float128 | 128 bits, platform? |  |
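
A rough way to compare the floating-point widths listed above is `np.finfo`, which reports the bit width, approximate decimal precision, and machine epsilon of each type. The sketch sticks to the three fixed-width types that are available on every platform; the extended types marked "platform?" are only checked through `np.longdouble`, whose size depends on the system.

```python
import numpy as np

# Bit width, decimal precision, and machine epsilon for each type.
for t in (np.float16, np.float32, np.float64):
    info = np.finfo(t)
    print(t.__name__, info.bits, info.precision, info.eps)

# Size in bytes of the extended-precision type on this platform.
print(np.dtype(np.longdouble).itemsize)

# The one-character codes map to the fixed-width types.
half = np.dtype('e')
single = np.dtype('f')
double = np.dtype('d')
```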


## 2.4 Complex

In [20]:
data = [['csingle',' ','F'],
['complex_','compatible: Python complex','D'],
['clongfloat',' ','G'],
['complex64','two 32-bit floats',''], 
['complex128','two 64-bit floats',''], 
['complex192','two 96-bit floats, platform?',''], 
['complex256','two 128-bit floats, platform?','']]
pd.DataFrame(data = data, columns = columns)

|   | 类别 | 说明 | 简称 |
|---|---|---|---|
| 0 | csingle |  | F |
| 1 | complex_ | compatible: Python complex | D |
| 2 | clongfloat |  | G |
| 3 | complex64 | two 32-bit floats |  |
| 4 | complex128 | two 64-bit floats |  |
| 5 | complex192 | two 96-bit floats, platform? |  |
| 6 | complex256 | two 128-bit floats, platform? |  |
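
The "two floats" description in the table can be checked directly. In this sketch (with hypothetical values), the real and imaginary parts of a `complex128` array are `float64`, and those of a `complex64` array are `float32`.

```python
import numpy as np

z = np.array([1 + 2j, 3 - 4j], dtype=np.complex128)
print(z.real.dtype, z.imag.dtype)   # dtype of each component
print(z.itemsize)                   # bytes per element

# Converting to complex64 halves the storage of each component.
z64 = z.astype(np.complex64)
print(z64.real.dtype, z64.itemsize)
```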


## 2.5 Any type
The `object_` type is essentially a reference to Python's `object` class.

In [21]:
data = [['object_','any Python object','O']]
pd.DataFrame(data = data, columns = columns)

|   | 类别 | 说明 | 简称 |
|---|---|---|---|
| 0 | object_ | any Python object | O |
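
Because an `object_` element can be any Python object, a single Series of this dtype can mix element types. The sketch below uses hypothetical values to show that each element keeps its own Python type.

```python
import pandas as pd

# Request object dtype explicitly so the elements are stored as-is.
mixed = pd.Series([1, 'two', 3.0], dtype=object)
print(mixed.dtype)

# Each element is still an ordinary Python object of its original type.
element_types = [type(v).__name__ for v in mixed]
print(element_types)
```

This flexibility comes with a cost: operations on object data generally cannot use numpy's fast typed routines.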


# 3. Useful Intermediate Types

## 3.1 `.str`
This intermediate type (accessor) lets an object_ Series be handled as strings, and it offers many string-processing functions. A later chapter is devoted to it.

In [30]:
s = pd.Series(['a_b','b_c','c_d'],dtype = 'object')
s

0    a_b
1    b_c
2    c_d
dtype: object

In [32]:
s.str.split('_',expand = True)

|   | 0 | 1 |
|---|---|---|
| 0 | a | b |
| 1 | b | c |
| 2 | c | d |
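
Beyond `split`, a few other commonly used `.str` methods are sketched below on the same data. The selection of methods is only illustrative, not a complete list.

```python
import pandas as pd

s = pd.Series(['a_b', 'b_c', 'c_d'], dtype='object')

upper = s.str.upper()                             # upper-case every element
has_b = s.str.contains('b')                       # boolean mask per element
lengths = s.str.len()                             # length of each string
replaced = s.str.replace('_', '-', regex=False)   # literal replacement

print(upper.tolist(), has_b.tolist(), lengths.tolist(), replaced.tolist())
```

Each method is applied element by element and returns a new Series aligned with the original index.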


## 3.2 `.cat`
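This intermediate type is dedicated to categorical data. Categorical features come up frequently in machine learning, and a later chapter covers them.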

In [33]:
s = pd.Series( [1,2,3], dtype = 'category')
s

0    1
1    2
2    3
dtype: category
Categories (3, int64): [1, 2, 3]

In [35]:
s.cat.categories

Int64Index([1, 2, 3], dtype='int64')
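
As a further sketch of `.cat`, the example below uses hypothetical string labels to show the integer codes behind a categorical Series and how an explicit order can be imposed with `set_categories`.

```python
import pandas as pd

s = pd.Series(['low', 'high', 'mid', 'low'], dtype='category')

# Categories are inferred in sorted order; codes index into them.
categories = list(s.cat.categories)
codes = s.cat.codes.tolist()

# Impose a meaningful order so that comparisons and min/max make sense.
ordered = s.cat.set_categories(['low', 'mid', 'high'], ordered=True)
ordered_codes = ordered.cat.codes.tolist()
largest = ordered.max()

print(categories, codes, ordered_codes, largest)
```

Storing small integer codes instead of repeated labels is what makes categorical data compact.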

## 3.3 `.dt`
This intermediate type is dedicated to Series that hold datetime values, and it is used in time series analysis.

In [5]:
s = pd.Series(['2017-08-01','2017-08-03','2017-08-03'], dtype = 'datetime64[ns]')
s

0   2017-08-01
1   2017-08-03
2   2017-08-03
dtype: datetime64[ns]

In [6]:
s.dt.year

0    2017
1    2017
2    2017
dtype: int64

In [7]:
s.dt.month

0    8
1    8
2    8
dtype: int64

In [8]:
s.dt.day

0    1
1    3
2    3
dtype: int64
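
A few more `.dt` operations are sketched below on the same dates. Here the Series is built with `pd.to_datetime`, which is one possible alternative to passing `dtype = 'datetime64[ns]'`.

```python
import pandas as pd

s = pd.to_datetime(pd.Series(['2017-08-01', '2017-08-03', '2017-08-03']))

weekday = s.dt.dayofweek             # Monday is 0
labels = s.dt.strftime('%Y/%m/%d')   # format each date as a string

# Subtracting datetimes gives timedeltas, which also expose .dt.
gap = (s - s.iloc[0]).dt.days

print(weekday.tolist(), labels.tolist(), gap.tolist())
```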