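### Summary of what the index is for
+ More convenient data queries
+ Better query performance
+ Automatic data alignment
+ Support for richer, more powerful data structures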

In [1]:
import pandas as pd
df = pd.read_csv("./datas/ml-latest-small/ratings.csv")
df.head()

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [2]:
df.count()

userId       100836
movieId      100836
rating       100836
timestamp    100836
dtype: int64

In [3]:
df.set_index("userId", inplace=True, drop=False)  # drop=False keeps userId as a regular column as well

In [4]:
df

userId (index),userId,movieId,rating,timestamp
1,1,1,4.0,964982703
1,1,3,4.0,964981247
1,1,6,4.0,964982224
1,1,47,5.0,964983815
1,1,50,5.0,964982931
...,...,...,...,...
610,610,166534,4.0,1493848402
610,610,168248,5.0,1493850091
610,610,168250,5.0,1494273047
610,610,168252,5.0,1493846352


In [5]:
# Query by index label
df.loc[500].head(10)

userId (index),userId,movieId,rating,timestamp
500,500,1,4.0,1005527755
500,500,11,1.0,1005528017
500,500,39,1.0,1005527926
500,500,101,1.0,1005527980
500,500,104,4.0,1005528065
500,500,176,5.0,1005527755
500,500,180,4.0,1005527980
500,500,216,4.0,1005527406
500,500,231,1.0,1005528039
500,500,260,4.0,1005527309


In [8]:
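# The same query written as a boolean condition on the userId column
condition = df.loc[:, "userId"] == 500
df[condition].head()  # keep only the rows that satisfy the condition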

userId (index),userId,movieId,rating,timestamp
500,500,1,4.0,1005527755
500,500,11,1.0,1005528017
500,500,39,1.0,1005527926
500,500,101,1.0,1005527980
500,500,104,4.0,1005528065


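### Using the index improves query performance
+ If the index is unique, pandas uses a hash table for lookups, roughly O(1)
+ If the index is not unique but is sorted, pandas uses binary search, roughly O(log N)
+ If the index is in completely random order, every query scans the whole table, O(N)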
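Which of the three cases applies can be checked on the index itself. The following is a minimal sketch on a small made-up frame: the `toy` data and the `describe_index` helper are hypothetical and not part of this notebook, while `Index.is_unique` and `Index.is_monotonic_increasing` are the pandas attributes that report the two properties.

```python
import pandas as pd

# Hypothetical toy data, not the MovieLens ratings used in this notebook.
toy = pd.DataFrame(
    {"userId": [3, 1, 2, 1], "rating": [4.0, 5.0, 3.5, 2.0]}
).set_index("userId", drop=False)


def describe_index(frame):
    """Return which of the three lookup cases the frame's index falls into."""
    if frame.index.is_unique:
        return "unique"
    if frame.index.is_monotonic_increasing:
        return "sorted, not unique"
    return "unsorted, not unique"


# Keep only the first row for each index label to get a unique index.
toy_unique = toy[~toy.index.duplicated()]

print(describe_index(toy))               # unsorted, not unique
print(describe_index(toy.sort_index()))  # sorted, not unique
print(describe_index(toy_unique))        # unique
```

The ratings table in this notebook has many rows per `userId`, so it can only ever fall into the second or third case.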

### Experiment 1: querying an index in completely random order


In [9]:
# Shuffle the rows into random order
from sklearn.utils import shuffle
df_shuffle = shuffle(df)

In [11]:
df_shuffle.head()

userId (index),userId,movieId,rating,timestamp
68,68,4366,3.0,1526948498
32,32,733,4.0,856736172
202,202,3221,3.0,975015565
255,255,1739,4.0,1005717179
21,21,102125,3.5,1441393005


In [12]:
df_shuffle.index.is_unique

False

In [16]:
%timeit df_shuffle.loc[500]

335 µs ± 3.52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [17]:
%timeit df.loc[500]

183 µs ± 1.05 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


### Querying after sorting the index

In [18]:
df_sorted = df_shuffle.sort_index()

In [20]:
%timeit df_sorted.loc[500]

182 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
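To repeat this comparison without the MovieLens file, here is a rough sketch on synthetic data. The `fake` frame, its size and its values are assumptions made up for illustration, and `DataFrame.sample(frac=1)` is swapped in as a pandas-only shuffle in place of `sklearn.utils.shuffle`. Timings depend on the machine and the pandas version, so they will not match the numbers above.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the ratings table (assumed sizes, not the real data).
rng = np.random.default_rng(0)
n_rows = 100_000
fake = pd.DataFrame(
    {
        "userId": rng.integers(1, 611, size=n_rows),
        "rating": rng.integers(1, 11, size=n_rows) / 2,
    }
).set_index("userId", drop=False)

# pandas-only shuffle; the notebook itself uses sklearn.utils.shuffle.
fake_shuffled = fake.sample(frac=1, random_state=0)
fake_sorted = fake_shuffled.sort_index()

# Both lookups select the same rows; only the search strategy differs.
rows_shuffled = fake_shuffled.loc[500]
rows_sorted = fake_sorted.loc[500]
same_row_count = len(rows_shuffled) == len(rows_sorted)

# Time 200 lookups on each frame and report the mean per lookup.
n_runs = 200
t_shuffled = timeit.timeit(lambda: fake_shuffled.loc[500], number=n_runs)
t_sorted = timeit.timeit(lambda: fake_sorted.loc[500], number=n_runs)
print(f"shuffled: {t_shuffled / n_runs * 1e6:.0f} us per lookup")
print(f"sorted:   {t_sorted / n_runs * 1e6:.0f} us per lookup")
```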


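### Automatic data alignment with the index
+ When two Series are added, values with the same index label are aligned automatically; a label present in only one Series produces NaN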
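The alignment rule can be seen on two small Series whose labels only partly overlap. The values here are made up for illustration; `Series.add` with `fill_value` is shown as one possible way to avoid the NaN results.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=list("abc"))
s2 = pd.Series([2, 3, 4], index=list("bcd"))

# Values are matched by label, not by position: b -> 2 + 2, c -> 3 + 3.
# "a" and "d" exist in only one Series, so the result is NaN there.
total = s1 + s2
print(total)

# Series.add with fill_value treats a missing label as 0 instead.
filled = s1.add(s2, fill_value=0)
print(filled)
```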

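### The index supports more powerful data structures
+ CategoricalIndex: an index based on categorical data, which can improve performance
+ MultiIndex: a multi-level index, used for example for the result of a groupby over several keys
+ DatetimeIndex: an index of timestamps, with strong support for dates and times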
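A short sketch of how each of these index types can arise. The `ratings` frame below is hypothetical (the `genre` and `when` columns are not in ratings.csv) and only serves to show the three index classes.

```python
import pandas as pd

# Hypothetical ratings, made up for illustration.
ratings = pd.DataFrame(
    {
        "userId": [1, 1, 2, 2],
        "genre": ["Comedy", "Drama", "Comedy", "Comedy"],
        "rating": [4.0, 5.0, 3.0, 2.0],
        "when": pd.to_datetime(
            ["2020-01-01", "2020-01-15", "2020-02-01", "2020-02-20"]
        ),
    }
)

# MultiIndex: grouping by two keys yields a two-level index.
by_user_genre = ratings.groupby(["userId", "genre"])["rating"].mean()
print(type(by_user_genre.index).__name__)  # MultiIndex
print(by_user_genre.loc[(2, "Comedy")])    # 2.5

# DatetimeIndex: a partial date string selects a whole month.
by_time = ratings.set_index("when")
february = by_time.loc["2020-02", "rating"].tolist()
print(february)  # [3.0, 2.0]

# CategoricalIndex: built from the genre column.
by_genre = ratings.set_index(pd.CategoricalIndex(ratings["genre"]))
print(type(by_genre.index).__name__)  # CategoricalIndex
```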