# Location Check-in Dataset

* The Foursquare check-in dataset published on Dingqi Yang's blog: a global check-in dataset covering roughly 18 months between 2012 and 2013. It contains 33,278,683 check-ins by 266,909 users at 3,680,126 venues (415 cities in 77 countries).

## Data processing

### Reading data from files
* Cities_data : city name, country name, coordinates, and related information
* Checkins_data : user-venue visit records
* POI_data : ID, coordinates, and category of each venue

In [2]:
import pandas as pd
import numpy as np

In [3]:
# Forward slashes avoid invalid escape sequences such as '\D' and '\d' and work on every OS.
Cities_data = pd.read_csv('./Data/dataset_TIST2015_Cities.txt', sep='\t', lineterminator='\n', names = ['city name', 'Latitude', 'Longitude', 'Country Code', 'Country name', 'City type'])
Checkins_data = pd.read_csv('./Data/dataset_TIST2015_Checkins.txt', sep='\t', lineterminator='\n', names = ['User ID', 'Venue ID', 'UTC Time', 'Timezone Offset'])
POI_data = pd.read_csv('./Data/dataset_TIST2015_POIs.txt', sep='\t', lineterminator='\n', names = ['Venue ID', 'Latitude', 'Longitude', 'Venue category', 'Country Code'])

In [3]:
Cities_data.head()

| | city name | Latitude | Longitude | Country Code | Country name | City type |
|---|---|---|---|---|---|---|
| 0 | Cuiaba | -15.615 | -56.093004 | BR | Brazil | Provincial capital |
| 1 | Brasilia | -15.792111 | -47.897748 | BR | Brazil | National and provincial capital |
| 2 | Goiania | -16.727004 | -49.255001 | BR | Brazil | Provincial capital |
| 3 | Campo Grande | -20.450997 | -54.615996 | BR | Brazil | Provincial capital |
| 4 | Puerto Presidente Stroessner | -25.526997 | -54.622997 | PY | Paraguay | Provincial capital |


In [4]:
Checkins_data.head()

| | User ID | Venue ID | UTC Time | Timezone Offset |
|---|---|---|---|---|
| 0 | 50756 | 4f5e3a72e4b053fd6a4313f6 | Tue Apr 03 18:00:06 +0000 2012 | 240 |
| 1 | 190571 | 4b4b87b5f964a5204a9f26e3 | Tue Apr 03 18:00:07 +0000 2012 | 180 |
| 2 | 221021 | 4a85b1b3f964a520eefe1fe3 | Tue Apr 03 18:00:08 +0000 2012 | -240 |
| 3 | 66981 | 4b4606f2f964a520751426e3 | Tue Apr 03 18:00:08 +0000 2012 | -300 |
| 4 | 21010 | 4c2b4e8a9a559c74832f0de2 | Tue Apr 03 18:00:09 +0000 2012 | 240 |
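
The `UTC Time` column is a plain string and `Timezone Offset` is an integer, so neither is directly usable as a local timestamp. The sketch below shows one way to derive a local time. It assumes that `Timezone Offset` is the offset from UTC in minutes (the value 540 for Korean venues is consistent with UTC+9); the `sample` frame and the `Local Time` column are illustrative names, not part of the dataset.

```python
import pandas as pd

# Toy rows copied from the first three check-ins shown above.
sample = pd.DataFrame({
    'User ID': [50756, 190571, 221021],
    'UTC Time': ['Tue Apr 03 18:00:06 +0000 2012',
                 'Tue Apr 03 18:00:07 +0000 2012',
                 'Tue Apr 03 18:00:08 +0000 2012'],
    'Timezone Offset': [240, 180, -240],
})

# Parse the timestamp string; %z consumes the "+0000" part.
utc = pd.to_datetime(sample['UTC Time'],
                     format='%a %b %d %H:%M:%S %z %Y', utc=True)

# Assumption: 'Timezone Offset' is minutes relative to UTC.
offset = pd.to_timedelta(sample['Timezone Offset'], unit='min')

# Drop the timezone marker and shift by the offset to get a naive local time.
sample['Local Time'] = utc.dt.tz_localize(None) + offset
print(sample[['UTC Time', 'Timezone Offset', 'Local Time']])
```

Passing an explicit `format` avoids per-row format inference, which may matter with tens of millions of rows.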


In [5]:
POI_data.head()

| | Venue ID | Latitude | Longitude | Venue category | Country Code |
|---|---|---|---|---|---|
| 0 | 3fd66200f964a52000e71ee3 | 40.733596 | -74.003139 | Jazz Club | US |
| 1 | 3fd66200f964a52000e81ee3 | 40.758102 | -73.975734 | Gym | US |
| 2 | 3fd66200f964a52000ea1ee3 | 40.732456 | -74.003755 | Indian Restaurant | US |
| 3 | 3fd66200f964a52000ec1ee3 | 42.345907 | -71.087001 | Indian Restaurant | US |
| 4 | 3fd66200f964a52000ee1ee3 | 39.933178 | -75.159262 | Sandwich Place | US |


* From POI_data, extract only the rows whose country code is KR

In [6]:
POI_data_KR = POI_data.loc[POI_data['Country Code'] == 'KR']

In [7]:
POI_data_KR.head()

| | Venue ID | Latitude | Longitude | Venue category | Country Code |
|---|---|---|---|---|---|
| 80613 | 4b058781f964a520659622e3 | 37.555686 | 127.005097 | Hotel | KR |
| 80614 | 4b058781f964a520689622e3 | 37.56528 | 126.980946 | Hotel | KR |
| 80615 | 4b058781f964a520699622e3 | 37.509309 | 127.060715 | Hotel | KR |
| 80616 | 4b058781f964a5206a9622e3 | 37.504845 | 127.027166 | Hotel | KR |
| 80617 | 4b058781f964a5206b9622e3 | 37.513982 | 127.035497 | Hotel | KR |


* Print the distinct venue categories

In [None]:
POI_data_KR['Venue category'].unique()
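
`unique()` lists the categories but not how common each one is. As a possible complement, the sketch below counts venues per category with `value_counts()`. The `pois_kr` frame is made-up toy data standing in for `POI_data_KR`, so the counts are illustrative only.

```python
import pandas as pd

# Toy stand-in for POI_data_KR (hypothetical values).
pois_kr = pd.DataFrame({
    'Venue category': ['Hotel', 'Hotel', 'Subway', 'Cafe', 'Hotel', 'Cafe'],
})

# Number of venues per category, most frequent first.
counts = pois_kr['Venue category'].value_counts()
print(counts)

# Number of distinct categories, equivalent to len(unique()).
print(pois_kr['Venue category'].nunique())
```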

* From Checkins_data, extract only the check-ins whose Venue ID belongs to a venue located in Korea

In [9]:
Checkins_data_KR = Checkins_data[Checkins_data['Venue ID'].isin(POI_data_KR['Venue ID'])]
Checkins_data_KR.head(5)

| | User ID | Venue ID | UTC Time | Timezone Offset |
|---|---|---|---|---|
| 2719 | 66388 | 4edf00ad6da10302870475f7 | Tue Apr 03 18:34:05 +0000 2012 | 540 |
| 5358 | 48332 | 4e6cb240e4cd4bedebb992f4 | Tue Apr 03 19:11:43 +0000 2012 | 540 |
| 11293 | 141697 | 4cf7a6231cfea0939be9e539 | Tue Apr 03 20:45:29 +0000 2012 | 540 |
| 11995 | 89193 | 4e5414fb1f6e850d277ec012 | Tue Apr 03 20:54:55 +0000 2012 | 540 |
| 12436 | 317 | 4b9b0d50f964a520f1ee35e3 | Tue Apr 03 21:01:02 +0000 2012 | 540 |
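
The `isin` filter keeps the check-in columns only. If the venue category or coordinates are needed next to each check-in, an inner merge is one alternative. The sketch below compares both on toy data; `pois`, `checkins`, and the short venue IDs are hypothetical, and the merge returns the same rows as the filter only if each `Venue ID` appears once in the POI table.

```python
import pandas as pd

# Toy stand-ins for POI_data and Checkins_data (hypothetical values).
pois = pd.DataFrame({
    'Venue ID': ['v1', 'v2', 'v3'],
    'Venue category': ['Hotel', 'Subway', 'Gym'],
    'Country Code': ['KR', 'KR', 'US'],
})
checkins = pd.DataFrame({
    'User ID': [1, 1, 2, 3],
    'Venue ID': ['v1', 'v3', 'v2', 'v9'],
})

pois_kr = pois[pois['Country Code'] == 'KR']

# Filter: keep check-ins whose venue is in the KR subset.
filtered = checkins[checkins['Venue ID'].isin(pois_kr['Venue ID'])]

# Inner merge: the same check-ins, with the venue category attached.
merged = checkins.merge(pois_kr[['Venue ID', 'Venue category']],
                        on='Venue ID', how='inner')

print(filtered)
print(merged)
```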


In [10]:
Checkins_data_KR.groupby('User ID').count()

| User ID | Venue ID | UTC Time | Timezone Offset |
|---|---|---|---|
| 47 | 19 | 19 | 19 |
| 54 | 1 | 1 | 1 |
| 55 | 8 | 8 | 8 |
| 56 | 3 | 3 | 3 |
| 66 | 1 | 1 | 1 |
| ... | ... | ... | ... |
| 266092 | 22 | 22 | 22 |
| 266298 | 72 | 72 | 72 |
| 266514 | 75 | 75 | 75 |
| 266624 | 31 | 31 | 31 |
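
`count()` repeats the same number in every column because it counts non-null values per column. A more compact per-user summary is sketched below with `size()` and `nunique()`, using made-up toy data; the column labels `check-ins` and `distinct venues` are illustrative names.

```python
import pandas as pd

# Toy stand-in for Checkins_data_KR (hypothetical values).
checkins = pd.DataFrame({
    'User ID': [47, 47, 54, 55, 55, 55],
    'Venue ID': ['a', 'b', 'a', 'c', 'c', 'd'],
})

# One number per user: total check-ins.
per_user = checkins.groupby('User ID').size().rename('check-ins')

# Number of different venues each user visited.
distinct = (checkins.groupby('User ID')['Venue ID']
            .nunique().rename('distinct venues'))

# Combine both and list the most active users first.
summary = (pd.concat([per_user, distinct], axis=1)
           .sort_values('check-ins', ascending=False))
print(summary)
```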


In [56]:
Checkins_data_KR.loc[Checkins_data_KR['User ID'] == 9889]

| | User ID | Venue ID | UTC Time | Timezone Offset |
|---|---|---|---|---|
| 2523922 | 9889 | 4b39d25ef964a520235f25e3 | Fri Apr 27 10:03:56 +0000 2012 | 540 |
| 2633022 | 9889 | 4b058781f964a520659622e3 | Sat Apr 28 00:57:13 +0000 2012 | 540 |
| 2633143 | 9889 | 4d86e0a502eb5481565f60f5 | Sat Apr 28 00:58:14 +0000 2012 | 540 |
| 2633321 | 9889 | 4bb96ac73db7b713cc4f229a | Sat Apr 28 01:00:05 +0000 2012 | 540 |
| 2666535 | 9889 | 4bebbee6a9900f47c2241840 | Sat Apr 28 06:11:19 +0000 2012 | 540 |
| 2671999 | 9889 | 4bc0069d4cdfc9b64c6b9221 | Sat Apr 28 07:01:08 +0000 2012 | 540 |
| 2695115 | 9889 | 4b058782f964a520bc9622e3 | Sat Apr 28 10:30:58 +0000 2012 | 540 |
| 2706715 | 9889 | 4b47f7d2f964a520584526e3 | Sat Apr 28 11:57:27 +0000 2012 | 540 |
| 2706857 | 9889 | 4c5e977b6ebe2d7fb208d62e | Sat Apr 28 11:58:19 +0000 2012 | 540 |
| 2787237 | 9889 | 4b058781f964a520689622e3 | Sat Apr 28 23:58:13 +0000 2012 | 540 |


In [12]:
POI_data_KR.loc[POI_data_KR['Venue ID'] == '4b7b6357f964a5205d612fe3']

| | Venue ID | Latitude | Longitude | Venue category | Country Code |
|---|---|---|---|---|---|
| 233378 | 4b7b6357f964a5205d612fe3 | 37.481589 | 126.882568 | Subway | KR |


* From Checkins_data_KR, create a DataFrame Checkins_table with User ID as rows, Venue ID as columns, and the number of visits as values

In [26]:
# Keep only the two key columns, renamed without spaces.
Checkins_table = pd.DataFrame({'UserID': Checkins_data_KR['User ID'], 'VenueID': Checkins_data_KR['Venue ID']})
# Count check-ins per (user, venue) pair, then pivot venues into columns; pairs never visited become 0.
Checkins_table = Checkins_table.groupby(['UserID', 'VenueID']).size().unstack(fill_value=0)
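
The user-by-venue count table can also be built with `pd.crosstab`. The sketch below checks on made-up toy data that the group-and-unstack approach and `pd.crosstab` produce the same table; `checkins`, `by_groupby`, and `by_crosstab` are illustrative names.

```python
import pandas as pd

# Toy stand-in for Checkins_data_KR (hypothetical values).
checkins = pd.DataFrame({
    'User ID': [1, 1, 1, 2, 3],
    'Venue ID': ['a', 'a', 'b', 'b', 'c'],
})

# Group by (user, venue), count rows, pivot venues into columns.
by_groupby = (checkins.groupby(['User ID', 'Venue ID'])
              .size()
              .unstack(fill_value=0))

# crosstab computes the same frequency table in one call.
by_crosstab = pd.crosstab(checkins['User ID'], checkins['Venue ID'])

print(by_crosstab)
print(by_groupby.equals(by_crosstab))
```

Either way the result is a dense table with one column per venue. With many users and venues most cells are zero, so memory use may become a concern, and a sparse representation could be worth considering.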