External data

In [29]:
import pandas as pd
import numpy as np
from glob import glob, iglob

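# iglob yields paths lazily, so it uses less memory than glob
# DataFrame.append was removed in pandas 2.0; collect the frames and concatenate once
frames = [pd.read_csv(csv) for csv in iglob('../0821/*_top500_clean.csv')]
df = pd.concat(frames, ignore_index=True, sort=False)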

In [2]:
len(df)

6195

Internal data (hotel latitude and longitude)

In [30]:
df_gps = pd.read_csv("F:\\NCTU\\lab\\奧丁丁\\奧丁丁資料前處理\\OwlTing_整合資料\\csv\\data_gps.csv")

Just over 1,500 hotels in total

In [40]:
df_gps.shape

(1565, 5)

Internal data (hotel names)

In [31]:
df_name = pd.read_csv("F:\\NCTU\\lab\\奧丁丁\\奧丁丁資料前處理\\OwlTing_整合資料\\csv\\data.csv")
df_name = df_name[['hotel_id', 'name']].copy()

In [35]:
len(df_name.name.value_counts())  # only about 900 hotels have a known name

889

Keep only the non-duplicated rows

In [32]:
df_name = df_name.drop_duplicates()

### Merge the internal data

In [33]:
df_gps = pd.merge(df_name, df_gps, left_on=['hotel_id'], right_on=['hotel_id'], how='outer')

Longitude or latitude is null

In [7]:
df_gps.longitude.isna().sum()

232

In [3]:
df_gps.latitude.isna().sum()

73

Latitude or longitude is 0

In [9]:
len(df_gps.loc[df_gps.latitude == 0])

364

In [11]:
len(df_gps.loc[df_gps.longitude == 0])

200

### Remove hotels whose latitude or longitude is 0 or null

In [34]:
df_gps_clean = df_gps.dropna(subset=['latitude', 'longitude'])

In [35]:
# Drop only the rows whose latitude or longitude is 0
df_gps_clean = df_gps_clean[df_gps_clean.latitude!=0]
df_gps_clean = df_gps_clean[df_gps_clean.longitude!=0]
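The null and zero filters above can also be written as one boolean mask. This is a minimal sketch on a toy frame: the sample values and the names `toy`, `valid`, and `toy_clean` are illustrative assumptions, not part of the original data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_gps; the values are made up for illustration.
toy = pd.DataFrame({
    "hotel_id": [1, 2, 3, 4],
    "latitude": [23.988, 0.0, np.nan, 24.676],
    "longitude": [121.599, 121.6, 120.3, 0.0],
})

# A row is usable only if both coordinates are present and non-zero.
coords = toy[["latitude", "longitude"]]
valid = coords.notna().all(axis=1) & coords.ne(0).all(axis=1)
toy_clean = toy[valid]
```

Only the first toy row survives, because every other row has a missing or zero coordinate.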

In [51]:
df_gps_clean.isna().sum()

hotel_id       0
name         372
country      196
city         339
latitude       0
longitude      0
dtype: int64

In [43]:
df_gps_clean.country.value_counts()

台灣          541
tw          182
TW          123
TAIWAN       42
MALAYSIA     14
Taiwan        7
Malaysia      7
日本            6
jp            3
台灣省           2
台湾            1
GREECE        1
my            1
中華民國          1
台灣TAIWAN      1
Name: country, dtype: int64

In [42]:
df_gps_clean.city.value_counts()

台南市        135
宜蘭縣         74
屏東縣         58
南投縣         45
花蓮          43
          ... 
瑞穗           1
Okinawa      1
Kythira      1
彰化           1
白沙鄉          1
Name: city, Length: 98, dtype: int64

**The raw data already contains duplicated coordinates**

In [14]:
df.duplicated(subset=['lat', 'lng'], keep=False).sum()

789

1. Use all the decimal places available in the internal data

In [22]:
def get_first(x, length):
    return x[:length]

# Truncate lat/lng to the same number of digits as the internal data
df['lat'] = df['lat'].astype('str').apply(get_first, args=(8,)).astype('float')
df['lng'] = df['lng'].astype('str').apply(get_first, args=(10,)).astype('float')

2. Keep only 3 decimal places in both the internal and external data and see whether they merge

In [24]:
def get_first(x, length):
    return x[:length]

# Convert lat/lng to strings and keep the desired number of characters
df['lat'] = df['lat'].astype('str').apply(get_first, args=(6,)).astype('float')
df['lng'] = df['lng'].astype('str').apply(get_first, args=(7,)).astype('float')
df_gps_clean['latitude'] = df_gps_clean['latitude'].astype('str').apply(get_first, args=(6,)).astype('float')
df_gps_clean['longitude'] = df_gps_clean['longitude'].astype('str').apply(get_first, args=(7,)).astype('float')

3. Round after the fourth decimal place **(nothing merges at all)**

In [55]:
def get_round(x, length):
    return round(x, length)

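# Note: round() takes a number of decimal places, not a string length,
# so these calls keep 5 decimals for latitude and 6 for longitude
df['lat'] = df['lat'].apply(get_round, args=(5,))
df['lng'] = df['lng'].apply(get_round, args=(6,))
df_gps_clean['latitude'] = df_gps_clean['latitude'].apply(get_round, args=(5,))
df_gps_clean['longitude'] = df_gps_clean['longitude'].apply(get_round, args=(6,))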

4. Keep only 4 decimal places in both the internal and external data and see whether they merge

In [36]:
def get_first(x, length):
    return x[:length]

# Convert lat/lng to strings and keep the desired number of characters
df['lat'] = df['lat'].astype('str').apply(get_first, args=(7,)).astype('float')
df['lng'] = df['lng'].astype('str').apply(get_first, args=(8,)).astype('float')
df_gps_clean['latitude'] = df_gps_clean['latitude'].astype('str').apply(get_first, args=(7,)).astype('float')
df_gps_clean['longitude'] = df_gps_clean['longitude'].astype('str').apply(get_first, args=(8,)).astype('float')
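The four attempts above mix two different operations: slicing the string form truncates, while `round` rounds to a number of decimal places. This is a small sketch of the difference; the sample latitudes are made up, and `get_first` is copied from the cells above.

```python
import pandas as pd

def get_first(x, length):
    # Keep the first `length` characters of the string form.
    return x[:length]

# Made-up latitudes with two integer digits, as in Taiwan.
lat = pd.Series([23.9876, 24.6712])

# "23.9876"[:6] -> "23.987": 2 integer digits + "." + 3 decimals = 6 characters.
truncated = lat.astype("str").apply(get_first, args=(6,)).astype("float")

# round() takes decimal places, so 3 here means three decimals.
rounded = lat.round(3)
```

The two results differ for the first value (23.987 versus 23.988), so a key built by truncation on one side and rounding on the other will not match. The slice length also depends on the number of integer digits, which is why the cells above use a length one larger for longitude than for latitude.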

**Rows whose coordinates are exact duplicates after processing**

In [10]:
df.duplicated(subset=['lat', 'lng'], keep=False).sum()

2755

In [57]:
df.duplicated(subset=['lat', 'lng']).sum()

1060
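The two counts above differ because `keep=False` flags every member of a duplicated group, while the default `keep='first'` flags every member except the first. This is a minimal sketch on made-up coordinates; the names `all_members`, `extra_copies`, and `duplicated_groups` are illustrative.

```python
import pandas as pd

# Three rows share one coordinate pair; the fourth is unique.
toy = pd.DataFrame({
    "lat": [23.988, 23.988, 23.988, 24.676],
    "lng": [121.599, 121.599, 121.599, 121.745],
})

# keep=False: every row that belongs to a duplicated group.
all_members = int(toy.duplicated(subset=["lat", "lng"], keep=False).sum())

# Default keep='first': only the extra copies after the first occurrence.
extra_copies = int(toy.duplicated(subset=["lat", "lng"]).sum())

# Each group of size k adds k to the first count and k - 1 to the second,
# so the difference is the number of distinct duplicated coordinate pairs.
duplicated_groups = all_members - extra_copies
```

That subtraction is only meaningful when both counts come from the same state of the DataFrame.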

In [47]:
df_gps_clean.duplicated(subset=['latitude', 'longitude'], keep=False).sum()

39

Rows where both latitude and longitude overlap

In [58]:
df[df.duplicated(subset=['lat', 'lng'], keep=False)].sort_values(by=['lng', 'lat'])

Unnamed: 0,title,uri,lat,lng,hotel_address,hotel_star,hotel_city,hotel_section,price_range,avg_rating,comment_count
764,麗馨精品商旅七賢館,https://www.tripadvisor.com.tw/Hotel_Review-g1...,22.618,120.268,前金區市中一路229號,3.0,高雄,前金,"NT$817 - NT$2,169",4.0,18.0
574,奇異果快捷旅店-九如店,https://www.tripadvisor.com.tw/Hotel_Review-g1...,22.618,120.268,三民區九如一路790號,3.0,高雄,三民,"NT$849 - NT$2,012",4.0,155.0
535,巴黎商旅,https://www.tripadvisor.com.tw/Hotel_Review-g1...,22.625,120.280,新興區自立二路8號,3.0,高雄,新興,"NT$1,257 - NT$2,075",3.5,31.0
643,芳橙汽車旅館,https://www.tripadvisor.com.tw/Hotel_Review-g1...,22.625,120.280,小港區中安路616號,3.0,高雄,小港,"NT$1,069 - NT$1,509",3.0,1.0
745,六合轉角737館,https://www.tripadvisor.com.tw/Hotel_Review-g1...,22.622,120.283,南台路六合夜市 - 美麗島捷運站旁,2.0,高雄,新興,"NT$849 - NT$1,886",4.0,9.0
...,...,...,...,...,...,...,...,...,...,...,...
1398,金色年代旅店,https://www.tripadvisor.com.tw/Hotel_Review-g1...,25.121,121.861,板橋區重慶路66號12樓,3.0,新北,板橋,"NT$1,257 - NT$2,798",4.0,10.0
3661,大里驛青年旅館,https://www.tripadvisor.com.tw/Hotel_Review-g1...,24.966,121.922,頭城鎮濱海路6段317-2號,2.0,宜蘭,頭城,"NT$1,980 - NT$3,018",5.0,16.0
3943,我家商務旅店,https://www.tripadvisor.com.tw/Hotel_Review-g1...,24.966,121.922,羅東鎮中正路186之2號 3樓,2.5,宜蘭,羅東,"NT$1,100 - NT$3,018",0.0,0.0
1461,九份聽山,https://www.tripadvisor.com.tw/Hotel_Review-g1...,25.015,121.944,瑞芳區基山街219之4號,3.0,新北,瑞芳,"NT$943 - NT$2,703",3.0,11.0


In [19]:
df[df.duplicated(subset=['lat', 'lng'], keep=False)].groupby(['lat', 'lng']).agg('count')

Unnamed: 0_level_0,Unnamed: 1_level_0,title,uri,hotel_address,hotel_star,hotel_city,hotel_section,price_range,avg_rating,comment_count
lat,lng,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
21.912,120.847,2,2,2,2,2,2,2,2,2
21.912,120.848,2,2,2,2,2,2,2,2,2
21.918,120.844,2,2,2,2,2,2,2,2,2
21.920,120.844,2,2,2,2,2,2,2,2,2
21.925,120.832,2,2,2,2,2,2,2,2,2
...,...,...,...,...,...,...,...,...,...,...
25.168,121.445,3,3,3,3,3,3,3,3,3
25.169,121.445,2,2,2,2,2,2,2,2,2
25.171,121.447,4,4,4,4,4,4,4,4,4
25.180,121.689,2,2,2,2,2,2,2,2,2


Rows where the latitude overlaps

In [32]:
df[df['lat'].isin(df['lat'][df['lat'].duplicated()])]

Unnamed: 0,title,uri,lat,lng,hotel_address,hotel_star,hotel_city,hotel_section,price_range,avg_rating,comment_count
3,嘉義優遊商旅,https://www.tripadvisor.com.tw/Hotel_Review-g1...,23.48274,120.463,西區中山路617號11樓,3.0,嘉義,東區,"NT$1,163 - NT$2,734",3.5,41.0
39,楓雅SPA 汽車旅館 (大雅館),https://www.tripadvisor.com.tw/Hotel_Review-g1...,23.48274,120.431,大雅路二段256號,3.5,嘉義,東區,"NT$1,194 - NT$1,697",4.0,1.0
45,香緹汽車旅館,https://www.tripadvisor.com.tw/Hotel_Review-g1...,23.48041,120.463,西區八德路2號,3.0,嘉義,西區,"NT$1,980 - NT$2,922",3.5,4.0
70,波士頓大飯店,https://www.tripadvisor.com.tw/Hotel_Review-g1...,23.48041,120.463,東區中正路673號,3.0,嘉義,西區,"NT$691 - NT$1,508",3.0,11.0
72,雙鳳大旅社,https://www.tripadvisor.com.tw/Hotel_Review-g1...,23.99303,121.593,復興街5號,2.0,花蓮,花蓮市,"NT$597 - NT$2,766",3.5,11.0
...,...,...,...,...,...,...,...,...,...,...,...
4028,田野小徑,https://www.tripadvisor.com.tw/Hotel_Review-g1...,24.80044,121.756,礁溪鄉份尾一路3號,3.5,宜蘭,礁溪,"NT$2,075 - NT$3,395",4.0,1.0
4029,玫瑰莊園民宿,https://www.tripadvisor.com.tw/Hotel_Review-g1...,24.67588,121.751,冬山鄉梅花路742號,3.5,宜蘭,冬山,"NT$1,980 - NT$4,527",0.0,0.0
4030,夜市橙堡渡假旅館,https://www.tripadvisor.com.tw/Hotel_Review-g1...,24.82802,121.774,新民路19號,3.5,宜蘭,宜蘭市,"NT$1,320 - NT$2,515",4.0,12.0
4033,賓成精品商旅,https://www.tripadvisor.com.tw/Hotel_Review-g1...,24.67275,121.809,西後街26-1號,3.0,宜蘭,宜蘭市,"NT$1,163 - NT$2,389",3.0,9.0


### Try whether the two datasets can be merged

A single coordinate pair in the internal data matches several hotels in the external data

In [9]:
pd.merge(df, df_gps_clean, left_on=['lat', 'lng'], right_on=['latitude', 'longitude']).head()

Unnamed: 0,title,uri,lat,lng,hotel_address,hotel_star,hotel_city,hotel_section,price_range,avg_rating,comment_count,hotel_id,country,city,latitude,longitude
0,花蓮北吉光輕旅青年旅館,https://www.tripadvisor.com.tw/Hotel_Review-g1...,23.988,121.599,花蓮市中山路601巷1弄19號,2.0,花蓮,花蓮市,"NT$817 - NT$1,949",5.0,432.0,1482,台灣,花蓮市,23.988,121.599
1,希臘仙境美宿,https://www.tripadvisor.com.tw/Hotel_Review-g1...,23.988,121.599,花蓮市北濱街102-2號,3.0,花蓮,花蓮市,"NT$1,603 - NT$3,018",5.0,1.0,1482,台灣,花蓮市,23.988,121.599
2,美麗晨曦民宿,https://www.tripadvisor.com.tw/Hotel_Review-g1...,23.988,121.599,吉安鄉太昌村明義二街16號,3.0,花蓮,吉安,"NT$2,483 - NT$4,244",5.0,235.0,1482,台灣,花蓮市,23.988,121.599
3,海洋飯店,https://www.tripadvisor.com.tw/Hotel_Review-g1...,23.99,121.604,花蓮市國聯三路58號,3.0,花蓮,花蓮市,"NT$1,509 - NT$2,043",4.0,1.0,1297,台灣,,23.99,121.604
4,太魯閣蘇西小空間,https://www.tripadvisor.com.tw/Hotel_Review-g1...,23.99,121.604,新城鄉新城村中正路20號,2.5,花蓮,新城,"NT$1,069 - NT$2,672",3.5,14.0,1297,台灣,,23.99,121.604
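The one-to-many matches in the output above can be counted directly by grouping the merged result on `hotel_id`. This is a sketch on toy frames: the titles are placeholders, the coordinates and ids only mimic the shape of the output above, and `matches_per_hotel` and `ambiguous` are illustrative names.

```python
import pandas as pd

# Toy external data: three listings collapse onto one truncated coordinate.
external = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "lat": [23.988, 23.988, 23.988, 23.990],
    "lng": [121.599, 121.599, 121.599, 121.604],
})

# Toy internal data: one hotel per coordinate pair.
internal = pd.DataFrame({
    "hotel_id": [1482, 1297],
    "latitude": [23.988, 23.990],
    "longitude": [121.599, 121.604],
})

merged = pd.merge(external, internal,
                  left_on=["lat", "lng"],
                  right_on=["latitude", "longitude"])

# Number of distinct external titles attached to each internal hotel.
matches_per_hotel = merged.groupby("hotel_id")["title"].nunique()

# Internal hotels whose coordinate matched more than one external listing.
ambiguous = matches_per_hotel[matches_per_hotel > 1]
```

Any `hotel_id` that appears in `ambiguous` cannot be matched on coordinates alone and needs another field, such as the name or address, to break the tie.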


Number of rows that merge

In [37]:
len(pd.merge(df, df_gps_clean, left_on=['lat', 'lng'], right_on=['latitude', 'longitude']))

105
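The length of an inner merge counts matched row pairs, not matched hotels, so it can overstate the result when either side has duplicated coordinates. This is a sketch of one way to separate the two numbers, using the `indicator` option of `pd.merge` on toy frames; all values and names are illustrative assumptions.

```python
import pandas as pd

# Toy external data: A and B share a coordinate, C matches nothing.
external = pd.DataFrame({
    "title": ["A", "B", "C"],
    "lat": [23.988, 23.988, 25.000],
    "lng": [121.599, 121.599, 121.500],
})

# Toy internal data: two hotel_ids share the same coordinate.
internal = pd.DataFrame({
    "hotel_id": [1, 2],
    "latitude": [23.988, 23.988],
    "longitude": [121.599, 121.599],
})

# A left merge keeps unmatched external rows; _merge records the match status.
left = pd.merge(external, internal, how="left",
                left_on=["lat", "lng"],
                right_on=["latitude", "longitude"],
                indicator=True)

matched_rows = int((left["_merge"] == "both").sum())
matched_titles = left.loc[left["_merge"] == "both", "title"].nunique()
```

Here four merged rows correspond to only two matched external hotels, because each of them pairs with both internal ids.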

In [28]:
pd.merge(df, df_gps_clean, left_on=['lat', 'lng'], right_on=['latitude', 'longitude']).head(10)

Unnamed: 0,title,uri,hotel_address,hotel_star,hotel_city,hotel_section,price_range,avg_rating,comment_count,lat,lng,hotel_id,name,country,city,latitude,longitude
0,翰品酒店花蓮,https://www.tripadvisor.com.tw/Hotel_Review-g1...,花蓮市永興路2號,4.5,花蓮,花蓮市,"NT$2,072 - NT$4,609",4.5,2265.0,23.987,121.622,1352,冠倫旅店,台灣,花蓮市,23.987,121.622
1,太魯閣晶英酒店,https://www.tripadvisor.com.tw/Hotel_Review-g1...,秀林鄉天祥路18號太魯閣國家公園內,4.5,花蓮,秀林,"NT$8,011 - NT$17,568",4.5,1733.0,23.975,121.602,1067,成旅晶贊飯店 花蓮假期,台灣,花蓮市,23.975,121.602
2,大寶的民宿,https://www.tripadvisor.com.tw/Hotel_Review-g1...,花蓮市建華路79號,3.0,花蓮,花蓮市,"NT$1,175 - NT$2,969",5.0,2.0,23.975,121.602,1067,成旅晶贊飯店 花蓮假期,台灣,花蓮市,23.975,121.602
3,W+225客棧,https://www.tripadvisor.com.tw/Hotel_Review-g1...,花蓮市中華路225號,3.0,花蓮,花蓮市,"NT$1,206 - NT$1,918",5.0,1.0,23.975,121.602,1067,成旅晶贊飯店 花蓮假期,台灣,花蓮市,23.975,121.602
4,山水相連民宿,https://www.tripadvisor.com.tw/Hotel_Review-g1...,豐坪路2段38號,3.0,花蓮,壽豐,"NT$2,165 - NT$3,743",0.0,0.0,23.975,121.602,1067,成旅晶贊飯店 花蓮假期,台灣,花蓮市,23.975,121.602
5,小旅行迷你公寓,https://www.tripadvisor.com.tw/Hotel_Review-g1...,花蓮市國聯一路103號,2.5,花蓮,花蓮市,"NT$557 - NT$2,567",5.0,463.0,23.973,121.615,1698,蒂芬尼海岸旅宿,tw,hualien city,23.973,121.615
6,小旅行迷你公寓,https://www.tripadvisor.com.tw/Hotel_Review-g1...,花蓮市國聯一路103號,2.5,花蓮,花蓮市,"NT$557 - NT$2,567",5.0,463.0,23.973,121.615,1699,花蓮 慢拾光,tw,,23.973,121.615
7,迴音谷森林民宿,https://www.tripadvisor.com.tw/Hotel_Review-g1...,國福街289巷168弄21號,3.0,花蓮,花蓮市,"NT$5,815 - NT$10,238",5.0,623.0,23.974,121.584,201,,台灣,花蓮,23.974,121.584
8,樸耕居民宿,https://www.tripadvisor.com.tw/Hotel_Review-g1...,花蓮市球崙一路215巷16號,3.5,花蓮,花蓮市,"NT$9,341 - NT$11,413",5.0,87.0,23.993,121.607,517,,台灣,花蓮,23.993,121.607
9,松邑莊園,https://www.tripadvisor.com.tw/Hotel_Review-g1...,瑞穗鄉溫泉路三段2巷37號,3.5,花蓮,瑞穗,"NT$3,804 - NT$6,000",4.5,97.0,23.993,121.607,517,,台灣,花蓮,23.993,121.607


Check whether the same hotel appears more than once

In [26]:
pd.merge(df, df_gps_clean, left_on=['lat', 'lng'], right_on=['latitude', 'longitude']).title.value_counts()

陽光旅宿         5
幸福來敲門        4
缪思狂想民宿       4
娜娜雅筑         4
微笑拉菲草        4
            ..
F商旅花蓮站前館     1
花蓮海邊邊民宿      1
禾頁屋 - 中興館    1
花蓮皇裔民宿       1
夏爾迦民宿        1
Name: title, Length: 943, dtype: int64

The internal data also has duplicated GPS coordinates to begin with

In [38]:
df_tmp = pd.merge(df, df_gps_clean, left_on=['lat', 'lng'], right_on=['latitude', 'longitude'])
df_tmp[df_tmp['title'].isin(df_tmp['title'][df_tmp['title'].duplicated()])]

Unnamed: 0,title,uri,hotel_address,hotel_star,hotel_city,hotel_section,price_range,avg_rating,comment_count,lat,lng,hotel_id,name,country,city,latitude,longitude
6,幸福加油站民宿,https://www.tripadvisor.com.tw/Hotel_Review-g1...,信義街93號,3.0,花蓮,花蓮市,"NT$1,206 - NT$3,340",5.0,4.0,23.9762,121.6026,256,幸福加油站民宿,TW,花蓮市,23.9762,121.6026
7,幸福加油站民宿,https://www.tripadvisor.com.tw/Hotel_Review-g1...,信義街93號,3.0,花蓮,花蓮市,"NT$1,206 - NT$3,340",5.0,4.0,23.9762,121.6026,1304,外館,台灣,花蓮縣,23.9762,121.6026
15,高雄盧昂,https://www.tripadvisor.com.tw/Hotel_Review-g1...,鼓山區慶豐街98號,3.0,高雄,鼓山,"NT$912 - NT$3,018",0.0,0.0,22.6675,120.2946,1818,高雄盧昂,,,22.6675,120.2946
16,高雄盧昂,https://www.tripadvisor.com.tw/Hotel_Review-g1...,鼓山區慶豐街98號,3.0,高雄,鼓山,"NT$912 - NT$3,018",0.0,0.0,22.6675,120.2946,1872,,,,22.6675,120.2946
17,高雄盧昂,https://www.tripadvisor.com.tw/Hotel_Review-g1...,鼓山區慶豐街98號,3.0,高雄,鼓山,"NT$912 - NT$3,018",0.0,0.0,22.6675,120.2946,1873,,,,22.6675,120.2946
24,新世代精品商務旅店,https://www.tripadvisor.com.tw/Hotel_Review-g1...,苓雅區三多四路63號14樓,2.5,高雄,苓雅,"NT$1,069 - NT$2,358",2.0,2.0,22.6117,120.3002,1480,85春天,tw,高雄市,22.6117,120.3002
25,新世代精品商務旅店,https://www.tripadvisor.com.tw/Hotel_Review-g1...,苓雅區三多四路63號14樓,2.5,高雄,苓雅,"NT$1,069 - NT$2,358",2.0,2.0,22.6117,120.3002,2118,,,,22.6117,120.3002
61,海洋阿帕朵民宿,https://www.tripadvisor.com.tw/Hotel_Review-g1...,車城鄉後灣村後灣路183號,3.0,屏東,車城,"NT$2,382 - NT$2,969",5.0,1.0,22.0387,120.694,717,住在 ZHUZAI,台灣,Pingtung County,22.0387,120.694
62,海洋阿帕朵民宿,https://www.tripadvisor.com.tw/Hotel_Review-g1...,車城鄉後灣村後灣路183號,3.0,屏東,車城,"NT$2,382 - NT$2,969",5.0,1.0,22.0387,120.694,879,墾丁南法風情海景民宿,tw,車城鄉,22.0387,120.694
71,台中逢甲幻多奇青年旅棧,https://www.tripadvisor.com.tw/Hotel_Review-g1...,文華路永新巷4號,2.0,台中,西屯,"NT$692 - NT$2,641",5.0,29.0,24.1767,120.6456,1629,台中逢甲幻多奇青年旅棧,TW,台中市,24.1767,120.6456


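Number of hotels that actually merged (for now, keep only one hotel per coordinate pair and ignore the fact that both the internal and external data contain duplicated GPS coordinates)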

In [12]:
df_tmp.drop_duplicates(subset=['lat', 'lng'])

Unnamed: 0,title,uri,hotel_address,hotel_star,hotel_city,hotel_section,price_range,avg_rating,comment_count,lat,lng,hotel_id,name,country,city,latitude,longitude
0,翰品酒店花蓮,https://www.tripadvisor.com.tw/Hotel_Review-g1...,花蓮市永興路2號,4.5,花蓮,花蓮市,"NT$2,072 - NT$4,609",4.5,2265.0,23.987,121.622,1352,冠倫旅店,台灣,花蓮市,23.987,121.622
1,太魯閣晶英酒店,https://www.tripadvisor.com.tw/Hotel_Review-g1...,秀林鄉天祥路18號太魯閣國家公園內,4.5,花蓮,秀林,"NT$8,011 - NT$17,568",4.5,1733.0,23.975,121.602,1067,成旅晶贊飯店 花蓮假期,台灣,花蓮市,23.975,121.602
5,小旅行迷你公寓,https://www.tripadvisor.com.tw/Hotel_Review-g1...,花蓮市國聯一路103號,2.5,花蓮,花蓮市,"NT$557 - NT$2,567",5.0,463.0,23.973,121.615,1698,蒂芬尼海岸旅宿,tw,hualien city,23.973,121.615
7,迴音谷森林民宿,https://www.tripadvisor.com.tw/Hotel_Review-g1...,國福街289巷168弄21號,3.0,花蓮,花蓮市,"NT$5,815 - NT$10,238",5.0,623.0,23.974,121.584,201,,台灣,花蓮,23.974,121.584
8,樸耕居民宿,https://www.tripadvisor.com.tw/Hotel_Review-g1...,花蓮市球崙一路215巷16號,3.5,花蓮,花蓮市,"NT$9,341 - NT$11,413",5.0,87.0,23.993,121.607,517,,台灣,花蓮,23.993,121.607
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1285,灣曲時尚渡假會館,https://www.tripadvisor.com.tw/Hotel_Review-g1...,富堵三路101號,3.5,宜蘭,冬山,"NT$4,495 - NT$6,570",5.0,1.0,24.688,121.754,1356,宜蘭Bobo旅店,tw,,24.688,121.754
1286,星宿渡假,https://www.tripadvisor.com.tw/Hotel_Review-g1...,三星鄉天福村東興路13之5號,3.0,宜蘭,三星,"NT$2,012 - NT$20,150",0.0,0.0,24.659,121.780,264,,台灣,宜蘭縣,24.659,121.780
1287,明水露渡假民宿,https://www.tripadvisor.com.tw/Hotel_Review-g1...,三星鄉行健三路二段325號,3.5,宜蘭,三星,"NT$2,452 - NT$4,589",5.0,105.0,24.676,121.745,550,,台灣,宜蘭縣,24.676,121.745
1288,忙裡偷閒渡假民宿,https://www.tripadvisor.com.tw/Hotel_Review-g1...,員山鄉深溝村深洲路200巷9號,3.5,宜蘭,員山,"NT$1,760 - NT$5,375",0.0,0.0,24.671,121.803,537,,台灣,宜蘭縣,24.671,121.803
