# Dataset Introduction

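The goal of this competition is to predict which place a person would like to check in to. Facebook created an artificial world: a square region measuring 10 km by 10 km. For a given set of coordinates, your task is to return a ranked list of the most likely places.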

## 1 Dataset fields

```
train.csv, test.csv
row_id: id of the check-in event (also the row index)
x y: coordinates
accuracy: location accuracy
time: timestamp
place_id: id of the business; this is the target you are predicting
```
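## 2 Workflow
- 2.1 Load the data
- 2.2 Process the data
    - Narrow the data range to cut running time
        - 2 < x < 2.5, 1.0 < y < 1.5
    - Convert time into meaningful fields (year, month, day, hour, minute, second)
    - Filter out places with few check-ins
    - Select the feature values x and the target values y
- 2.3 Split the dataset
- 2.4 Feature engineering: standardization
- 2.5 KNN estimator
- 2.6 Model selection and tuning
- 2.7 Model evaluation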

## 2.1 Loading the data

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv("./FBlocation/train.csv")

In [4]:
data.head()

Unnamed: 0,row_id,x,y,accuracy,time,place_id
0,0,0.7941,9.0809,54,470702,8523065625
1,1,5.9567,4.7968,13,186555,1757726713
2,2,8.3078,7.0407,74,322648,1137537235
3,3,7.3665,2.5165,65,704587,6567393236
4,4,4.0961,1.1307,31,472130,7440663949


## 2.2 Data processing
### 2.2.1 Narrowing the data range to cut running time
- Keep only the check-ins with 2 < x < 2.5 and 1.0 < y < 1.5

In [5]:
data = data.query("x < 2.5 & x > 2 & y < 1.5 & y > 1.0")
data.head()

Unnamed: 0,row_id,x,y,accuracy,time,place_id
112,112,2.236,1.3655,66,623174,7663031065
180,180,2.2003,1.2541,65,610195,2358558474
367,367,2.4108,1.3213,74,579667,6644108708
874,874,2.0822,1.1973,320,143566,3229876087
1022,1022,2.016,1.1659,65,207993,3244363975
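
The `query` string above can also be written as an ordinary boolean mask, which some readers find easier to debug. The following is a minimal sketch on a made-up miniature frame; the name `demo` and its values are illustrative assumptions, not part of the competition data.

```python
import pandas as pd

# Hypothetical miniature stand-in for train.csv (values chosen for illustration)
demo = pd.DataFrame({
    "x": [2.236, 0.7941, 2.4108, 2.5, 2.016],
    "y": [1.3655, 9.0809, 1.3213, 1.2, 1.1659],
})

# Same strict bounds as the query string used in the notebook
by_query = demo.query("x < 2.5 & x > 2 & y < 1.5 & y > 1.0")

# Equivalent boolean mask built from explicit comparisons
mask = (demo["x"] > 2) & (demo["x"] < 2.5) & (demo["y"] > 1.0) & (demo["y"] < 1.5)
by_mask = demo[mask]

print(by_mask)
```

Both forms use strict inequalities, so a point lying exactly on the boundary (such as `x = 2.5` in the sketch) is excluded.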


### 2.2.2 Converting time into meaningful fields
- Convert the timestamp
- Add the derived time fields to the data

In [22]:
data['time'].head()

112     623174
180     610195
367     579667
874     143566
1022    207993
Name: time, dtype: int64

In [23]:
# Convert the timestamp into something more meaningful, treating the unit as seconds
time_value = pd.to_datetime(data['time'],unit='s')

In [26]:
# The result is a Series: a one-dimensional array with an index
time_value.head()

112    1970-01-08 05:06:14
180    1970-01-08 01:29:55
367    1970-01-07 17:01:07
874    1970-01-02 15:52:46
1022   1970-01-03 09:46:33
Name: time, dtype: datetime64[ns]

In [28]:
time_value.index

Int64Index([     112,      180,      367,      874,     1022,     1045,
                1070,     1332,     1632,     2205,
            ...
            29113479, 29113817, 29114203, 29114281, 29114496, 29115112,
            29115204, 29115338, 29115464, 29117493],
           dtype='int64', length=83197)

In [7]:
time_value.values

array(['1970-01-08T05:06:14.000000000', '1970-01-08T01:29:55.000000000',
       '1970-01-07T17:01:07.000000000', ...,
       '1970-01-09T20:46:26.000000000', '1970-01-02T18:11:58.000000000',
       '1970-01-01T22:06:09.000000000'], dtype='datetime64[ns]')

In [8]:
# Convert to a DatetimeIndex so that the weekday and other fields are easy to extract
date = pd.DatetimeIndex(time_value)
date

DatetimeIndex(['1970-01-08 05:06:14', '1970-01-08 01:29:55',
               '1970-01-07 17:01:07', '1970-01-02 15:52:46',
               '1970-01-03 09:46:33', '1970-01-06 19:49:38',
               '1970-01-06 13:33:24', '1970-01-02 22:49:55',
               '1970-01-04 14:30:10', '1970-01-07 16:57:44',
               ...
               '1970-01-02 09:24:50', '1970-01-01 10:29:34',
               '1970-01-09 11:38:46', '1970-01-02 03:42:14',
               '1970-01-04 22:02:44', '1970-01-09 08:31:25',
               '1970-01-07 12:29:49', '1970-01-09 20:46:26',
               '1970-01-02 18:11:58', '1970-01-01 22:06:09'],
              dtype='datetime64[ns]', name='time', length=83197, freq=None)

In [9]:
date.year

Int64Index([1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970,
            ...
            1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970],
           dtype='int64', name='time', length=83197)

In [10]:
date.month

Int64Index([1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
            ...
            1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
           dtype='int64', name='time', length=83197)

In [11]:
# Every timestamp falls in January 1970, so year and month carry no useful information
# Extract the weekday, the hour, and the day of the month instead
date.weekday

Int64Index([3, 3, 2, 4, 5, 1, 1, 4, 6, 2,
            ...
            4, 3, 4, 4, 6, 4, 2, 4, 4, 3],
           dtype='int64', name='time', length=83197)

In [12]:
date.day

Int64Index([8, 8, 7, 2, 3, 6, 6, 2, 4, 7,
            ...
            2, 1, 9, 2, 4, 9, 7, 9, 2, 1],
           dtype='int64', name='time', length=83197)

In [13]:
date.hour

Int64Index([ 5,  1, 17, 15,  9, 19, 13, 22, 14, 16,
            ...
             9, 10, 11,  3, 22,  8, 12, 20, 18, 22],
           dtype='int64', name='time', length=83197)
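
As an alternative to building a `DatetimeIndex`, the same fields can be read from a datetime Series through its `.dt` accessor. The sketch below reuses the first three timestamps shown above; the names `time_demo`, `as_datetime`, and `parts` are illustrative assumptions.

```python
import pandas as pd

# Timestamps and row labels taken from the first rows of the filtered data above
time_demo = pd.Series([623174, 610195, 579667], index=[112, 180, 367], name="time")

# Seconds since the epoch converted to datetime64 values
as_datetime = pd.to_datetime(time_demo, unit="s")

# The .dt accessor exposes the same fields as DatetimeIndex and keeps the row index
parts = pd.DataFrame({
    "day": as_datetime.dt.day,
    "weekday": as_datetime.dt.weekday,
    "hour": as_datetime.dt.hour,
})
print(parts)
```

Because the result keeps the original row labels, assigning these columns back to a DataFrame aligns on the index rather than on position.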

In [15]:
# Add the new columns to the original data
data["day"] = date.day

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [16]:
data["weekday"] = date.weekday

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [17]:
data["hour"] = date.hour

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [19]:
data.head()

Unnamed: 0,row_id,x,y,accuracy,time,place_id,day,weekday,hour
112,112,2.236,1.3655,66,623174,7663031065,8,3,5
180,180,2.2003,1.2541,65,610195,2358558474,8,3,1
367,367,2.4108,1.3213,74,579667,6644108708,7,2,17
874,874,2.0822,1.1973,320,143566,3229876087,2,4,15
1022,1022,2.016,1.1659,65,207993,3244363975,3,5,9
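
The `SettingWithCopyWarning` messages shown above appear because `data` was obtained by filtering another DataFrame, so pandas cannot tell whether the assignment is meant to reach the original frame. One common way to avoid the warning is to take an explicit copy right after filtering. This is a minimal sketch on a made-up miniature frame; `raw`, `subset`, and `stamp` are illustrative names.

```python
import pandas as pd

# Hypothetical miniature frame using values from the tables above
raw = pd.DataFrame({
    "x": [2.236, 5.9567, 2.4108],
    "y": [1.3655, 4.7968, 1.3213],
    "time": [623174, 186555, 579667],
})

# .copy() makes the filtered result an independent DataFrame
subset = raw.query("x < 2.5 & x > 2 & y < 1.5 & y > 1.0").copy()

# Derive the time fields and attach them as new columns
stamp = pd.to_datetime(subset["time"], unit="s")
subset["day"] = stamp.dt.day
subset["weekday"] = stamp.dt.weekday
subset["hour"] = stamp.dt.hour

print(subset)
```

The new columns live only in `subset`; `raw` is left unchanged.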


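### 2.2.3 Filtering out places with few check-ins
- First count the check-ins per place
    - How many times each place_id was checked in to
        - Use group-by aggregation
            - groupby returns a GroupBy object that is then aggregated
    - Drop places with too few check-ins (the code below keeps those with more than 3)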

In [32]:
# Number of check-ins per place
data.groupby("place_id").count().head()

Unnamed: 0_level_0,row_id,x,y,accuracy,time,day,weekday,hour
place_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1012165853,1,1,1,1,1,1,1,1
1013991737,3,3,3,3,3,3,3,3
1014605271,28,28,28,28,28,28,28,28
1015645743,4,4,4,4,4,4,4,4
1017236154,31,31,31,31,31,31,31,31


In [33]:
place_count = data.groupby("place_id").count()
place_count.head()

Unnamed: 0_level_0,row_id,x,y,accuracy,time,day,weekday,hour
place_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1012165853,1,1,1,1,1,1,1,1
1013991737,3,3,3,3,3,3,3,3
1014605271,28,28,28,28,28,28,28,28
1015645743,4,4,4,4,4,4,4,4
1017236154,31,31,31,31,31,31,31,31


In [65]:
# Number of check-ins at each place, as a single Series
place_count_id = place_count["row_id"]

In [35]:
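# Filter out rarely visited places using boolean indexing
# Note that > 3 keeps only place_ids with more than 3 check-ins,
# so places with exactly 3 check-ins are dropped as well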
place_count_id[place_count_id>3].head()

place_id
1014605271    28
1015645743     4
1017236154    31
1024951487     5
1028119817     4
Name: row_id, dtype: int64

In [37]:
place_count_id[place_count_id>3].index

Int64Index([1014605271, 1015645743, 1017236154, 1024951487, 1028119817,
            1031277804, 1033901110, 1034200626, 1047804897, 1064433118,
            ...
            9897639927, 9904847340, 9913438709, 9929729717, 9935755425,
            9936666116, 9954155328, 9980625005, 9994257798, 9996671132],
           dtype='int64', name='place_id', length=950)

In [40]:
place_count_id[place_count_id>3].index.values

array([1014605271, 1015645743, 1017236154, 1024951487, 1028119817,
       1031277804, 1033901110, 1034200626, 1047804897, 1064433118,
       1077244172, 1082319594, 1113063429, 1136965912, 1140094746,
       1152474121, 1168869217, 1188605085, 1200418561, 1209729910,
       1214974606, 1219514528, 1224473281, 1237461560, 1254294119,
       1272823671, 1291402142, 1301102957, 1307723985, 1309512563,
       1314542379, 1321295503, 1326517230, 1327075245, 1339666015,
       1341745891, 1344594346, 1376204664, 1376745845, 1380291167,
       1385100836, 1395411566, 1397435709, 1425344847, 1430541006,
       1430872593, 1450678395, 1450922788, 1456201867, 1465048096,
       1465956877, 1470112415, 1479866897, 1481294899, 1496485145,
       1509463616, 1514759428, 1515664017, 1525784324, 1533408099,
       1536494374, 1540382716, 1553974810, 1564838354, 1577178217,
       1580704783, 1591245298, 1598721971, 1608186657, 1612001894,
       1644280884, 1653222405, 1656255553, 1658077953, 1663763

In [47]:
# Build a boolean mask over data
data["place_id"].isin(place_count_id[place_count_id>3].index).head()

112      True
180     False
367      True
874      True
1022     True
Name: place_id, dtype: bool

In [66]:
# Index data with the boolean mask
data_final = data[data["place_id"].isin(place_count_id[place_count_id>3].index.values)]
data_final.head()

Unnamed: 0,row_id,x,y,accuracy,time,place_id,day,weekday,hour
112,112,2.236,1.3655,66,623174,7663031065,8,3,5
367,367,2.4108,1.3213,74,579667,6644108708,7,2,17
874,874,2.0822,1.1973,320,143566,3229876087,2,4,15
1022,1022,2.016,1.1659,65,207993,3244363975,3,5,9
1045,1045,2.3859,1.166,498,503378,6438240873,6,1,19
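
The count, threshold, and `isin` steps above can be condensed with `groupby(...).transform`, which returns one count per original row. The following is a sketch under stated assumptions: the frame `checkins` and its `place_id` values are made up for illustration.

```python
import pandas as pd

# Hypothetical check-in records; the place_id values are made up
checkins = pd.DataFrame({
    "row_id": range(8),
    "place_id": [10, 10, 10, 10, 20, 20, 20, 30],
})

# Number of check-ins at each row's place, aligned with the original rows
visits = checkins.groupby("place_id")["row_id"].transform("count")

# Same threshold as in the notebook: keep places with more than 3 check-ins
kept = checkins[visits > 3]
print(kept)
```

In this sketch place 20 has exactly 3 check-ins and is dropped, which mirrors the behaviour of the `> 3` comparison used in the notebook.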


### 2.2.4 Selecting the feature values and target values

In [50]:
# Select the feature values and the target values
x = data_final[["x","y","accuracy","day","weekday","hour"]]
y = data_final["place_id"]

In [51]:
x.head()

Unnamed: 0,x,y,accuracy,day,weekday,hour
112,2.236,1.3655,66,8,3,5
367,2.4108,1.3213,74,7,2,17
874,2.0822,1.1973,320,2,4,15
1022,2.016,1.1659,65,3,5,9
1045,2.3859,1.166,498,6,1,19


In [52]:
y.head()

112     7663031065
367     6644108708
874     3229876087
1022    3244363975
1045    6438240873
Name: place_id, dtype: int64

## 2.3 Splitting the dataset

In [53]:
# Split the dataset
from sklearn.model_selection import train_test_split 

In [54]:
x_train,x_test,y_train,y_test = train_test_split(x,y)

In [55]:
x_train.head()

Unnamed: 0,x,y,accuracy,day,weekday,hour
1791108,2.4012,1.2926,37,8,3,17
1161980,2.346,1.3527,75,7,2,11
23525554,2.1879,1.3039,59,8,3,10
34936,2.0997,1.4483,156,5,0,6
3850342,2.2383,1.4558,1,5,0,13


In [56]:
x_test.head()

Unnamed: 0,x,y,accuracy,day,weekday,hour
13301191,2.2185,1.1027,65,6,1,12
24391480,2.2658,1.2848,176,5,0,15
25971566,2.2185,1.1307,64,8,3,22
15717230,2.0873,1.1134,71,2,4,8
7332302,2.0149,1.2324,59,1,3,1


In [57]:
y_train.head()

1791108     6854303457
1161980     2998156180
23525554    6854303457
34936       5628219504
3850342     9841635233
Name: place_id, dtype: int64

In [58]:
y_test.head()

13301191    6246622940
24391480    2343525997
25971566    2852927300
15717230    3047002443
7332302     2614390775
Name: place_id, dtype: int64
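
The call to `train_test_split` above shuffles the rows differently on every run, so the rows shown here will not match a rerun. Passing `random_state` makes the split repeatable. A minimal sketch on made-up arrays follows; `features`, `labels`, and the seed value 22 are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical features and labels, only to show the shapes involved
features = np.arange(40).reshape(20, 2)
labels = np.arange(20) % 4

# With no sizes given, 25% of the rows go to the test set; random_state fixes the shuffle
a_train, a_test, b_train, b_test = train_test_split(features, labels, random_state=22)
c_train, c_test, d_train, d_test = train_test_split(features, labels, random_state=22)

print(a_train.shape, a_test.shape)
```

Two calls with the same seed return identical splits, which helps when comparing models across runs.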

In [59]:
# Feature engineering: standardization
from sklearn.preprocessing import StandardScaler
# KNN classifier
from sklearn.neighbors import KNeighborsClassifier
# Grid search with cross-validation
from sklearn.model_selection import GridSearchCV

## 2.4 Feature engineering: standardization

In [60]:
# Standardize the features: fit on the training set, then apply the same transform to the test set
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)
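
Standardization matters for KNN because distances are otherwise dominated by the feature with the largest range (here `accuracy` spans hundreds while `x` and `y` span less than one unit). The sketch below, on made-up numbers, checks that `StandardScaler` applies (value - training mean) / training standard deviation, and that the test set reuses the training statistics. All array names are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test features on very different scales
train = np.array([[2.0, 10.0], [2.2, 60.0], [2.4, 320.0]])
test = np.array([[2.1, 65.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)   # learns mean and std from the training data only
test_scaled = scaler.transform(test)         # reuses the training mean and std

# Manual check of the same formula: (value - train mean) / train std
manual = (test - train.mean(axis=0)) / train.std(axis=0)
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
```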

## 2.5 KNN estimator

In [61]:
# KNN estimator; k is left unset here because grid search with cross-validation will choose it
estimator = KNeighborsClassifier()
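
The task statement asks for a ranked list of places, whereas `predict` returns only the single most likely one. One possible way to obtain a ranking is to sort the class probabilities from `predict_proba`. This is a sketch on made-up points, not the notebook's method; the list length of 3 and the place ids 100, 200, and 300 are assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 2-D points around three made-up places (ids 100, 200, 300)
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [1.0, 1.0], [1.1, 1.0], [1.0, 1.1],
                   [0.5, 0.0], [0.6, 0.0], [0.5, 0.1]])
places = np.array([100, 100, 100, 200, 200, 200, 300, 300, 300])

knn = KNeighborsClassifier(n_neighbors=5).fit(points, places)

# Share of the 5 nearest neighbours belonging to each class, one row per query point
proba = knn.predict_proba(np.array([[0.05, 0.05]]))

# Sort classes by descending vote share and keep the top 3 (3 is an assumed list length)
order = np.argsort(-proba, axis=1)[:, :3]
ranked = knn.classes_[order]
print(ranked)
```

With uniform weights the probabilities are simple vote fractions, so ties between classes are possible when k is small.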

## 2.6 Model selection and tuning

In [62]:
# Wrap the estimator in grid search with cross-validation
# The candidate k values are passed as a dictionary
param_dict = {"n_neighbors": [3, 5, 7, 9]}
estimator = GridSearchCV(estimator, param_grid=param_dict, cv=3)
estimator.fit(x_train, y_train)



GridSearchCV(cv=3, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [3, 5, 7, 9]}, pre_dispatch='2*n_jobs',
       refit=True, return_train_score='warn', scoring=None, verbose=0)
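
In the cell above the scaler was fitted on the whole training set before cross-validation, so each validation fold has already influenced the scaling. A common variation is to place the scaler and the classifier in a `Pipeline` and search over that, so every fold is scaled using only its own training part. The following sketch uses synthetic clusters; the data, step names `scale` and `knn`, and the candidate values are assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: three well-separated clusters, 12 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
features = np.vstack([c + rng.normal(scale=0.3, size=(12, 2)) for c in centers])
labels = np.repeat([1, 2, 3], 12)

# Scaling lives inside the pipeline, so it is refitted on each fold's training part
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Step name, double underscore, parameter name addresses a parameter inside the pipeline
search = GridSearchCV(pipe, param_grid={"knn__n_neighbors": [3, 5, 7]}, cv=3)
search.fit(features, labels)

print(search.best_params_, search.best_score_)
```

The fitted `search` object can then be used like the `estimator` in the notebook: `search.predict` and `search.score` apply the scaling automatically.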

## 2.7 Model evaluation

In [67]:
# Model evaluation
# Method 1: compare the true and predicted values directly
y_predict = estimator.predict(x_test)
print("y_predict:\n", y_predict)
print("Direct comparison of true and predicted values:\n", y_test == y_predict)

# Method 2: compute the accuracy
score = estimator.score(x_test, y_test)
print("Accuracy:\n", score)

# Best parameters: best_params_
print("Best parameters:\n", estimator.best_params_)
# Best cross-validation score: best_score_
print("Best score:\n", estimator.best_score_)
# Best estimator: best_estimator_
print("Best estimator:\n", estimator.best_estimator_)
# Cross-validation results: cv_results_
print("Cross-validation results:\n", estimator.cv_results_)

y_predict:
 [1732563460 2343525997 1732563460 ..., 7240847236 8032468532 2082127512]
Direct comparison of true and predicted values:
 13301191    False
24391480     True
25971566    False
15717230     True
7332302     False
4644954     False
28651221    False
3504524     False
21584914    False
18499738    False
7935099      True
23158509     True
19261211    False
8995507     False
12944807     True
9274157     False
17312603     True
7192194      True
26769063    False
15710138    False
2193602      True
24233630    False
18735907     True
26496950    False
8524566     False
26260767    False
14329301    False
10900165    False
5161754      True
2148218     False
            ...  
23620675    False
16396920     True
1075158      True
23831983    False
27971229    False
14548808     True
22529229    False
2713818     False
1668055     False
14226470     True
2762424     False
26317191    False
27426896    False
672209      False
12610947     True
19021551    False
4735342      True
13307855    False
8992229      True
2