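# Dataset Overview

The goal of this competition is to predict which place a person would like to check in to. Facebook created an artificial world: a square region of 10 km by 10 km. Given a set of coordinates, your task is to return a ranked list of the most likely places.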
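
A ranked list of candidate places can be derived from a classifier's class probabilities. The following is a minimal sketch, not the method used later in this notebook: the helper `top_k_places`, the choice of `k=3`, and the synthetic place ids are all assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def top_k_places(model, features, k=3):
    """Return the k most probable class labels per row, most likely first."""
    proba = model.predict_proba(features)
    # argsort is ascending: keep the last k columns, then reverse them
    order = np.argsort(proba, axis=1)[:, -k:][:, ::-1]
    return model.classes_[order]

# Synthetic data: three made-up places clustered around different points
rng = np.random.default_rng(0)
centers = np.array([[2.1, 1.1], [2.3, 1.3], [2.4, 1.05]])
place_ids = np.array([101, 202, 303])  # hypothetical ids
X = np.vstack([c + rng.normal(scale=0.02, size=(20, 2)) for c in centers])
y = np.repeat(place_ids, 20)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
ranked = top_k_places(knn, np.array([[2.1, 1.1], [2.4, 1.05]]))
print(ranked)  # one row per query point, best guess in the first column
```

Places that receive zero probability are tied, so only the leading entries of each row carry information when the neighbours agree.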

## Dataset Fields

```
train.csv, test.csv
row_id: id of the check-in event
x, y: coordinates
accuracy: location accuracy
time: timestamp
place_id: id of the business; this is the target you are predicting
```
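
As a rough illustration of the column layout above, the sketch below parses a few rows copied from the outputs in this notebook. The explicit `dtype` mapping is an assumption added here to keep memory use down; note that `place_id` values exceed the 32-bit integer range, so `int64` is needed for that column.

```python
import io
import pandas as pd

# Three rows in the documented column layout (values taken from this notebook)
sample_csv = io.StringIO(
    "row_id,x,y,accuracy,time,place_id\n"
    "112,2.2360,1.3655,66,623174,7663031065\n"
    "180,2.2003,1.2541,65,610195,2358558474\n"
    "367,2.4108,1.3213,74,579667,6644108708\n"
)

# Assumed dtypes: smaller types for coordinates and accuracy, int64 for ids
dtypes = {
    "row_id": "int64",
    "x": "float32",
    "y": "float32",
    "accuracy": "int32",
    "time": "int64",
    "place_id": "int64",
}
sample = pd.read_csv(sample_csv, dtype=dtypes)
print(sample.dtypes)
```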
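## Workflow
- Load the data
- Process the data
    - Narrow the data range to save time
        - 2 < x < 2.5, 1.0 < y < 1.5
    - Convert time into meaningful fields (year, month, day, hour, minute, second)
    - Filter out places with few check-ins
    - Assemble the feature values x and the target values y
- Split the dataset
- Feature engineering: standardization
- KNN estimator workflow
- Model selection and tuning
- Model evaluation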

In [8]:
import pandas as pd

In [9]:
data = pd.read_csv("./FBlocation/train.csv")

In [33]:
data.head()

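|      | row_id | x      | y      | accuracy | time   | place_id   | day | weekday | hour |
|------|--------|--------|--------|----------|--------|------------|-----|---------|------|
| 112  | 112    | 2.236  | 1.3655 | 66       | 623174 | 7663031065 | 8   | 3       | 5    |
| 180  | 180    | 2.2003 | 1.2541 | 65       | 610195 | 2358558474 | 8   | 3       | 1    |
| 367  | 367    | 2.4108 | 1.3213 | 74       | 579667 | 6644108708 | 7   | 2       | 17   |
| 874  | 874    | 2.0822 | 1.1973 | 320      | 143566 | 3229876087 | 2   | 4       | 15   |
| 1022 | 1022   | 2.016  | 1.1659 | 65       | 207993 | 3244363975 | 3   | 5       | 9    |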


### Narrowing the data range to save time
- Data processing
- Narrow the data range
- 2 < x < 2.5, 1.0 < y < 1.5

In [34]:
data = data.query("x < 2.5 & x > 2 & y < 1.5 & y > 1.0")
data.head()

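|      | row_id | x      | y      | accuracy | time   | place_id   | day | weekday | hour |
|------|--------|--------|--------|----------|--------|------------|-----|---------|------|
| 112  | 112    | 2.236  | 1.3655 | 66       | 623174 | 7663031065 | 8   | 3       | 5    |
| 180  | 180    | 2.2003 | 1.2541 | 65       | 610195 | 2358558474 | 8   | 3       | 1    |
| 367  | 367    | 2.4108 | 1.3213 | 74       | 579667 | 6644108708 | 7   | 2       | 17   |
| 874  | 874    | 2.0822 | 1.1973 | 320      | 143566 | 3229876087 | 2   | 4       | 15   |
| 1022 | 1022   | 2.016  | 1.1659 | 65       | 207993 | 3244363975 | 3   | 5       | 9    |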
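
The same range filter can also be written with an explicit boolean mask, which some readers find easier to debug than a `query` string. This is a sketch on a small made-up frame; the trailing `.copy()` is an added precaution (an assumption, not part of the original cell) so that columns assigned later go into an independent frame rather than a view of the original.

```python
import pandas as pd

# Made-up rows: two inside the window, two outside (x = 2.5 sits on the open boundary)
df = pd.DataFrame({
    "x": [2.236, 0.5, 2.4108, 2.5],
    "y": [1.3655, 1.2, 1.3213, 1.2],
})

# Each comparison needs its own parentheses because & binds tighter than < and >
mask = (df["x"] > 2) & (df["x"] < 2.5) & (df["y"] > 1.0) & (df["y"] < 1.5)
subset = df[mask].copy()
print(subset)
```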


### Converting time into meaningful date and time fields
- Convert the timestamp
- Add the time fields that carry useful information

In [15]:
# Convert the timestamp into something more meaningful
time_value = pd.to_datetime(data['time'],unit='s')

In [16]:
time_value.values

array(['1970-01-08T05:06:14.000000000', '1970-01-08T01:29:55.000000000',
       '1970-01-07T17:01:07.000000000', ...,
       '1970-01-09T20:46:26.000000000', '1970-01-02T18:11:58.000000000',
       '1970-01-01T22:06:09.000000000'], dtype='datetime64[ns]')

In [19]:
# Convert to a DatetimeIndex, which makes it easy to get the day of the week
date = pd.DatetimeIndex(time_value)
date

DatetimeIndex(['1970-01-08 05:06:14', '1970-01-08 01:29:55',
               '1970-01-07 17:01:07', '1970-01-02 15:52:46',
               '1970-01-03 09:46:33', '1970-01-06 19:49:38',
               '1970-01-06 13:33:24', '1970-01-02 22:49:55',
               '1970-01-04 14:30:10', '1970-01-07 16:57:44',
               ...
               '1970-01-02 09:24:50', '1970-01-01 10:29:34',
               '1970-01-09 11:38:46', '1970-01-02 03:42:14',
               '1970-01-04 22:02:44', '1970-01-09 08:31:25',
               '1970-01-07 12:29:49', '1970-01-09 20:46:26',
               '1970-01-02 18:11:58', '1970-01-01 22:06:09'],
              dtype='datetime64[ns]', name='time', length=83197, freq=None)

In [21]:
date.year

Int64Index([1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970,
            ...
            1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970],
           dtype='int64', name='time', length=83197)

In [22]:
date.month

Int64Index([1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
            ...
            1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
           dtype='int64', name='time', length=83197)

In [24]:
# Every year is 1970 and every month is January, so neither field is informative
# Extract the weekday, hour, and day instead
date.weekday

Int64Index([3, 3, 2, 4, 5, 1, 1, 4, 6, 2,
            ...
            4, 3, 4, 4, 6, 4, 2, 4, 4, 3],
           dtype='int64', name='time', length=83197)

In [25]:
date.day

Int64Index([8, 8, 7, 2, 3, 6, 6, 2, 4, 7,
            ...
            2, 1, 9, 2, 4, 9, 7, 9, 2, 1],
           dtype='int64', name='time', length=83197)

In [26]:
date.hour

Int64Index([ 5,  1, 17, 15,  9, 19, 13, 22, 14, 16,
            ...
             9, 10, 11,  3, 22,  8, 12, 20, 18, 22],
           dtype='int64', name='time', length=83197)

In [27]:
# Add the new columns to the original data
data["day"] = date.day

In [28]:
data["weekday"] = date.weekday

In [30]:
data["hour"] = date.hour

In [35]:
data.head()

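|      | row_id | x      | y      | accuracy | time   | place_id   | day | weekday | hour |
|------|--------|--------|--------|----------|--------|------------|-----|---------|------|
| 112  | 112    | 2.236  | 1.3655 | 66       | 623174 | 7663031065 | 8   | 3       | 5    |
| 180  | 180    | 2.2003 | 1.2541 | 65       | 610195 | 2358558474 | 8   | 3       | 1    |
| 367  | 367    | 2.4108 | 1.3213 | 74       | 579667 | 6644108708 | 7   | 2       | 17   |
| 874  | 874    | 2.0822 | 1.1973 | 320      | 143566 | 3229876087 | 2   | 4       | 15   |
| 1022 | 1022   | 2.016  | 1.1659 | 65       | 207993 | 3244363975 | 3   | 5       | 9    |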
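
The three time features can also be derived in one step with the `.dt` accessor on a datetime Series, without building a separate `DatetimeIndex`. A minimal sketch, using three `time` values taken from the table above:

```python
import pandas as pd

df = pd.DataFrame({"time": [623174, 610195, 579667]})

# Interpret the integers as seconds since 1970-01-01
ts = pd.to_datetime(df["time"], unit="s")

# weekday follows the pandas convention: Monday is 0, Sunday is 6
df = df.assign(day=ts.dt.day, weekday=ts.dt.weekday, hour=ts.dt.hour)
print(df)
```

The resulting `day`, `weekday`, and `hour` values agree with the first three rows shown above.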


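### Filtering out places with few check-ins
- First count the check-ins per place
    - How many times each place_id was checked in to
        - Use group-and-aggregate
            - Returns a groupby object
    - Keep only places with more than 3 check-ins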

In [41]:
# Number of check-ins per place
data.groupby("place_id").count().head()

| place_id   | row_id | x  | y  | accuracy | time | day | weekday | hour |
|------------|--------|----|----|----------|------|-----|---------|------|
| 1012165853 | 1      | 1  | 1  | 1        | 1    | 1   | 1       | 1    |
| 1013991737 | 3      | 3  | 3  | 3        | 3    | 3   | 3       | 3    |
| 1014605271 | 28     | 28 | 28 | 28       | 28   | 28  | 28      | 28   |
| 1015645743 | 4      | 4  | 4  | 4        | 4    | 4   | 4       | 4    |
| 1017236154 | 31     | 31 | 31 | 31       | 31   | 31  | 31      | 31   |


In [96]:
place_count = data.groupby("place_id").count()
place_count.head()

| place_id   | row_id | x  | y  | accuracy | time | day | weekday | hour |
|------------|--------|----|----|----------|------|-----|---------|------|
| 1012165853 | 1      | 1  | 1  | 1        | 1    | 1   | 1       | 1    |
| 1013991737 | 3      | 3  | 3  | 3        | 3    | 3   | 3       | 3    |
| 1014605271 | 28     | 28 | 28 | 28       | 28   | 28  | 28      | 28   |
| 1015645743 | 4      | 4  | 4  | 4        | 4    | 4   | 4       | 4    |
| 1017236154 | 31     | 31 | 31 | 31       | 31   | 31  | 31      | 31   |


In [93]:
place_count_id = place_count["row_id"]

In [56]:
# Keep only places with more than 3 check-ins
# using boolean indexing
place_count_id[place_count_id>3].head()

place_id
1014605271    28
1015645743     4
1017236154    31
1024951487     5
1028119817     4
Name: row_id, dtype: int64

In [60]:
# Build the filter mask for data
data["place_id"].isin(place_count_id[place_count_id>3].index.values).head()

112      True
180     False
367      True
874      True
1022     True
Name: place_id, dtype: bool

In [63]:
# Index data with that boolean mask
data_final = data[data["place_id"].isin(place_count_id[place_count_id>3].index.values)]
data_final.head()

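|      | row_id | x      | y      | accuracy | time   | place_id   | day | weekday | hour |
|------|--------|--------|--------|----------|--------|------------|-----|---------|------|
| 112  | 112    | 2.236  | 1.3655 | 66       | 623174 | 7663031065 | 8   | 3       | 5    |
| 367  | 367    | 2.4108 | 1.3213 | 74       | 579667 | 6644108708 | 7   | 2       | 17   |
| 874  | 874    | 2.0822 | 1.1973 | 320      | 143566 | 3229876087 | 2   | 4       | 15   |
| 1022 | 1022   | 2.016  | 1.1659 | 65       | 207993 | 3244363975 | 3   | 5       | 9    |
| 1045 | 1045   | 2.3859 | 1.166  | 498      | 503378 | 6438240873 | 6   | 1       | 19   |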
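
The count-then-filter steps above can be condensed with `groupby(...).size()`, which yields a single count per place instead of one count per column. A sketch on a made-up frame with three hypothetical place ids, using the same "more than 3" threshold as the cells above:

```python
import pandas as pd

# Made-up check-ins: place 1 appears 4 times, place 2 three times, place 3 once
df = pd.DataFrame({
    "place_id": [1, 1, 1, 1, 2, 2, 2, 3],
    "x": [2.1, 2.1, 2.2, 2.1, 2.3, 2.3, 2.4, 2.0],
})

# One check-in count per place_id
counts = df.groupby("place_id").size()

# Ids of places with more than 3 check-ins
popular = counts[counts > 3].index

# Keep only the rows that belong to those places
filtered = df[df["place_id"].isin(popular)]
print(filtered)
```

Place 2 has exactly 3 check-ins and is dropped, which mirrors how `1013991737` disappears from the filtered counts above.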


In [68]:
# Select the feature values and the target values
x = data_final[["x","y","accuracy","day","weekday","hour"]]
y = data_final["place_id"]

In [71]:
x.head()

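|      | x      | y      | accuracy | day | weekday | hour |
|------|--------|--------|----------|-----|---------|------|
| 112  | 2.236  | 1.3655 | 66       | 8   | 3       | 5    |
| 367  | 2.4108 | 1.3213 | 74       | 7   | 2       | 17   |
| 874  | 2.0822 | 1.1973 | 320      | 2   | 4       | 15   |
| 1022 | 2.016  | 1.1659 | 65       | 3   | 5       | 9    |
| 1045 | 2.3859 | 1.166  | 498      | 6   | 1       | 19   |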


In [98]:
y.head()

112     7663031065
367     6644108708
874     3229876087
1022    3244363975
1045    6438240873
Name: place_id, dtype: int64

### Splitting the dataset

In [73]:
# Split the dataset
from sklearn.model_selection import train_test_split 

In [74]:
x_train,x_test,y_train,y_test = train_test_split(x,y)

In [76]:
x_train.head()

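|          | x      | y      | accuracy | day | weekday | hour |
|----------|--------|--------|----------|-----|---------|------|
| 28431067 | 2.0582 | 1.1462 | 5        | 4   | 6       | 12   |
| 27228652 | 2.1614 | 1.4778 | 70       | 8   | 3       | 4    |
| 2374344  | 2.3979 | 1.4139 | 93       | 4   | 6       | 20   |
| 9680394  | 2.0429 | 1.2291 | 72       | 7   | 2       | 20   |
| 2148587  | 2.1484 | 1.2679 | 29       | 6   | 1       | 5    |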


In [77]:
x_test.head()

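|          | x      | y      | accuracy | day | weekday | hour |
|----------|--------|--------|----------|-----|---------|------|
| 19606411 | 2.3239 | 1.2449 | 76       | 9   | 4       | 20   |
| 3733238  | 2.3899 | 1.2112 | 42       | 5   | 0       | 12   |
| 23849836 | 2.3457 | 1.3791 | 76       | 4   | 6       | 18   |
| 27639088 | 2.1084 | 1.0483 | 155      | 9   | 4       | 0    |
| 5866275  | 2.3348 | 1.2559 | 167      | 8   | 3       | 23   |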


In [78]:
y_train.head()

28431067    4231692509
27228652    9936666116
2374344     2128251934
9680394     9081742495
2148587     5111412226
Name: place_id, dtype: int64

In [80]:
y_test.head()

19606411    2585551753
3733238     3177384238
23849836    4951395211
27639088    4255947450
5866275     9773056775
Name: place_id, dtype: int64
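
Called without extra arguments, `train_test_split` shuffles randomly on every run, so the rows shown above will differ between executions. The sketch below, on made-up arrays, passes `random_state` to make the split repeatable; the seed value 42 is an arbitrary assumption, and `test_size=0.25` simply spells out the library default.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 samples with 2 features each
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# A fixed seed gives the same partition on every run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)
```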

In [97]:
# Feature engineering: standardization
from sklearn.preprocessing import StandardScaler
# KNN
from sklearn.neighbors import KNeighborsClassifier
# Grid search with cross-validation
from sklearn.model_selection import GridSearchCV

### Feature engineering: standardization

In [99]:
# Feature engineering: standardize the features (fit on the training set only)
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)
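
The cell above calls `fit_transform` on the training set but only `transform` on the test set, so the test data is scaled with the training mean and standard deviation. A small sketch with made-up numbers that checks this behaviour:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features on very different scales (a coordinate and an accuracy value)
train = np.array([[2.0, 10.0], [2.2, 300.0], [2.4, 70.0]])
test = np.array([[2.1, 50.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training statistics

# Each training column now has mean 0 and standard deviation 1
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))

# The test row is shifted and scaled by the training statistics
manual = (test - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(test_scaled, manual))
```

Standardization matters for KNN because the distance would otherwise be dominated by the column with the largest numeric range, such as `accuracy`.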

### KNN estimator workflow

In [100]:
# KNN estimator; k is left unset because grid search with cross-validation will choose it
estimator = KNeighborsClassifier()

### Model selection and tuning

In [101]:
# Add grid search with cross-validation
# Prepare the parameters: candidate k values go in a dict
param_dict = {"n_neighbors": [3, 5, 7, 9]}
estimator = GridSearchCV(estimator, param_grid=param_dict, cv=3)
estimator.fit(x_train, y_train)



GridSearchCV(cv=3, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [3, 5, 7, 9]}, pre_dispatch='2*n_jobs',
       refit=True, return_train_score='warn', scoring=None, verbose=0)
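
One possible variation, not what this notebook does, is to wrap the scaler and the classifier in a `Pipeline` and pass that to `GridSearchCV`. The scaler is then refit inside each cross-validation fold, so the validation part of a fold never influences the scaling statistics. The sketch below uses synthetic data with hypothetical place ids; the step names `scale` and `knn` are arbitrary labels chosen here.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: three made-up places clustered around different points
rng = np.random.default_rng(0)
centers = np.array([[2.1, 1.1], [2.3, 1.3], [2.4, 1.05]])
X = np.vstack([c + rng.normal(scale=0.02, size=(30, 2)) for c in centers])
y = np.repeat([101, 202, 303], 30)  # hypothetical place ids

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
search = GridSearchCV(pipe, param_grid={"knn__n_neighbors": [3, 5, 7, 9]}, cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Because `refit=True` by default, `search` can be used directly for `predict` and `score` on unscaled test features once fitting is done.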

### Model evaluation

In [91]:
# Model evaluation
# Method 1: compare the true values with the predictions directly
y_predict = estimator.predict(x_test)
print("y_predict:\n", y_predict)
print("Direct comparison of true and predicted values:\n", y_test == y_predict)

# Method 2: compute the accuracy
score = estimator.score(x_test, y_test)
print("Accuracy:\n", score)

# Best parameters: best_params_
print("Best parameters:\n", estimator.best_params_)
# Best cross-validation score: best_score_
print("Best score:\n", estimator.best_score_)
# Best estimator: best_estimator_
print("Best estimator:\n", estimator.best_estimator_)
# Cross-validation results: cv_results_
print("Cross-validation results:\n", estimator.cv_results_)

y_predict:
 [2585551753 5304570159 6706708436 ..., 5600661085 5801995519 9764078387]
Direct comparison of true and predicted values:
 19606411     True
3733238     False
23849836    False
27639088     True
5866275      True
26883481     True
12628841    False
26354845    False
8797661      True
2104761      True
20529480    False
11780123    False
27441634    False
19372956     True
17458468     True
8657405     False
4694924     False
17493647    False
14189075     True
8580097      True
7894380      True
15114892    False
16739797     True
10910858     True
26235923    False
5382207     False
23573702    False
4174870     False
18541202     True
116800      False
            ...  
7766770     False
26097873    False
22027458     True
12463524     True
3001035     False
18147152     True
14645989    False
18207724     True
11460414     True
6168768     False
15829678     True
11710752     True
28407512    False
17534533     True
25109375    False
21971723     True
8278330      True
23381708    False
17953927    False
1