# Train Timetable Scraping Project

We want to get timetable data from https://tip.railway.gov.tw//tra-tip-web/tip/tip001/tip112/querybytime.
Of course, we can't simply POST a form and GET the results; we first need to transform our inputs into a format the website understands.
This project is also a test of my own web crawling skills.
So let's go step by step.




## Import Packages

In [17]:
from bs4 import BeautifulSoup as BS
import requests
import datetime
import re
import openpyxl as pyxl # We need to get station code by checking Excel file

## Station's Code Table
You can look up the table at https://tip.railway.gov.tw/tra-tip-web/tip/tip001/tip111/view. I saved the data to an Excel file and use the Python package *openpyxl* to read it.
If you would rather save the table in another format, such as **CSV** or **JSON**, that is fine too. Just make sure you can retrieve each station's name together with its code, because the POST request needs both.
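
If the table is kept as CSV instead, the standard `csv` module is enough to build a name-to-code dictionary. The following is a minimal sketch, assuming a two-column file with a header row `name,code`. The column names, the function name `load_station_codes`, and the in-memory `sample` are illustrative and not the author's actual file.

```python
import csv
import io

def load_station_codes(csv_file):
    """Build {station name: station code} from a CSV with 'name' and 'code' columns."""
    reader = csv.DictReader(csv_file)
    return {row['name']: int(row['code']) for row in reader}

# Stand-in for open('stations.csv', encoding='utf-8', newline='');
# the file name is hypothetical.
sample = io.StringIO('name,code\n基隆,900\n臺北,1000\n臺中,3300\n')
codes = load_station_codes(sample)
print(codes['臺北'])
```

Whatever the storage format, the only requirement is that each name can be paired with its code.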

## Make a Dictionary of Station Names and Station Codes
Use *openpyxl* to read the data and create a dictionary.

In [18]:
'''load data'''

wb = pyxl.load_workbook('stationtonumber.xlsx')
sheet = wb.active  # the active sheet is 'worksheet1'
sheet

<Worksheet "worksheet1">

In [19]:
'''use loop to create dictionary'''

station_code = {}

# Column 1 alternates name and code: row i holds the station name, row i+1 its code
for i in range(1, sheet.max_row, 2):
    station_code[sheet.cell(row=i, column=1).value] = sheet.cell(row=i+1, column=1).value

station_code

{'基隆': 900,
 '三坑': 910,
 '八堵': 920,
 '七堵': 930,
 '百福': 940,
 '五堵': 950,
 '汐止': 960,
 '汐科': 970,
 '南港': 980,
 '松山': 990,
 '臺北': 1000,
 '萬華': 1010,
 '板橋': 1020,
 '浮洲': 1030,
 '樹林': 1040,
 '南樹林': 1050,
 '山佳': 1060,
 '鶯歌': 1070,
 '桃園': 1080,
 '內壢': 1090,
 '中壢': 1100,
 '埔心': 1110,
 '楊梅': 1120,
 '富岡': 1130,
 '新富': 1140,
 '北湖': 1150,
 '湖口': 1160,
 '新豐': 1170,
 '竹北': 1180,
 '北新竹': 1190,
 '新竹': 1210,
 '三姓橋': 1220,
 '香山': 1230,
 '崎頂': 1240,
 '竹南': 1250,
 '造橋': 3140,
 '豐富': 3150,
 '苗栗': 3160,
 '南勢': 3170,
 '銅鑼': 3180,
 '三義': 3190,
 '泰安': 3210,
 '后里': 3220,
 '豐原': 3230,
 '栗林': 3240,
 '潭子': 3250,
 '頭家厝': 3260,
 '松竹': 3270,
 '太原': 3280,
 '精武': 3290,
 '臺中': 3300,
 '五權': 3310,
 '大慶': 3320,
 '烏日': 3330,
 '新烏日': 3340,
 '成功': 3350,
 '彰化': 3360,
 '花壇': 3370,
 '大村': 3380,
 '員林': 3390,
 '永靖': 3400,
 '社頭': 3410,
 '田中': 3420,
 '二水': 3430,
 '林內': 3450,
 '石榴': 3460,
 '斗六': 3470,
 '斗南': 3480,
 '石龜': 3490,
 '大林': 4050,
 '民雄': 4060,
 '嘉北': 4070,
 '嘉義': 4080,
 '水上': 4090,
 '南靖': 4100,
 '後壁': 4110,
 '新營': 4120,
 '柳營

## Input Departure-Station, Arrival-Station and Codes
We search the timetable by station name and look up the codes in the dictionary created above. However, people often type **台** instead of **臺**, while the website needs **臺** to find **臺北**, **臺中**, and some other big cities. So when **台** is entered, we convert it into **臺**.

In [27]:
'''input station names to search'''

st = input('Enter the departure station:')
ed = input('Enter the arrival station:')

def getstations_input(station_name):
    '''Return the "code-name" string the form expects, e.g. "1000-臺北".'''
    # The lookup only works with 臺, so a leading 台 is converted first
    if station_name.startswith('台'):
        station_name = '臺' + station_name[1:]

    # Raises KeyError if the name is not in station_code
    return str(station_code[station_name]) + '-' + station_name

Enter the departure station:台北
Enter the arrival station:台中
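
A slightly more defensive lookup can report unknown station names instead of failing with a bare `KeyError`. The following is only a sketch: `normalize_station` and `sample_codes` are hypothetical names, and trying the name with every **台** replaced by **臺** is an assumption about how station names are spelled in the table.

```python
def normalize_station(name, table):
    """Return 'code-name' for a station, accepting 台 as a spelling of 臺."""
    name = name.strip()
    # Try the name as typed first, then with 台 replaced by 臺
    for candidate in (name, name.replace('台', '臺')):
        if candidate in table:
            return f'{table[candidate]}-{candidate}'
    raise ValueError(f'Unknown station: {name!r}')

# Small excerpt of the station table, for illustration only
sample_codes = {'臺北': 1000, '臺中': 3300, '新竹': 1210}
print(normalize_station('台北', sample_codes))  # 1000-臺北
print(normalize_station('新竹', sample_codes))  # 1210-新竹
```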


## Time Range to Search
People usually search the train schedule when they need to leave the city the same day, so I use *datetime* to get the current local time and make it the **StartTime**. On the other hand, **EndTime** is set to 23:59.
Besides the hour and minute, the year, month, and day also need to be obtained and converted into the appropriate format.

In [21]:
dt = datetime.datetime.now()
dt

datetime.datetime(2022, 4, 26, 20, 25, 54, 791814)

In [22]:
day = dt.strftime('%Y/%m/%d') #YMD format 
day

'2022/04/26'

### Time Range Modification 
If the current time is past the half hour, I add one to the hour of **StartTime**. For example, if it is 13:42 now, **StartTime** becomes 14:00. The reason for this modification is that I need almost an hour to get to the train station.
**EndTime** is set two hours after **StartTime**, which is enough to make sure I can reach the station in time, so trains leaving more than two hours later don't need to be shown.

In [23]:
'''Modification'''
Min = dt.minute
hour = dt.hour
if Min >= 30:  # past the half hour: start from the next full hour
    hour += 1
start = str(hour) + ':00'
end = str(hour + 2) + ':00'  # may pass midnight; handled in the next cell

print('StartTime: ' + start)
print('EndTime: ' + end)

StartTime: 20:00
EndTime: 22:00


### Careful!!!
The website doesn't allow a query whose time range spans two days.
Therefore, if **EndTime** would reach or pass 24:00, it is set to 23:59.

In [24]:
if hour + 2 >= 24:  # covers every start hour from 22 onward, not only 22
    end = '23:59'
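
The rounding and clamping rules above can be collected into one function, which also makes the edge cases easy to check. This is a sketch under stated assumptions: `time_window` and `span_hours` are hypothetical names, hours are zero-padded to two digits (whether the site also accepts unpadded hours such as `9:00` is not verified here), and `None` is returned when rounding would push the start into the next day.

```python
import datetime

def time_window(now, span_hours=2):
    """Return (start, end) as 'HH:MM' strings, or None if the start falls on the next day."""
    # Past the half hour: begin at the next full hour
    hour = now.hour + (1 if now.minute >= 30 else 0)
    if hour >= 24:
        return None
    start = f'{hour:02d}:00'
    end_hour = hour + span_hours
    # The query may not span two days, so clamp the end to 23:59
    end = '23:59' if end_hour >= 24 else f'{end_hour:02d}:00'
    return start, end

print(time_window(datetime.datetime(2022, 4, 26, 13, 42)))  # ('14:00', '16:00')
print(time_window(datetime.datetime(2022, 4, 26, 22, 10)))  # ('22:00', '23:59')
```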

## POST Information and GET Result
When you search the timetable on the website, you fill in a form and submit it. Now we have to do the same in Python, requesting information from the website just like a real person would.
I inspected the website to find out which fields need to be posted, such as StartTime and EndTime, which we created above. The others are introduced below.

### Brief Introduction to the Data Dictionary
- _csrf: A CSRF token.
- startStation: The station you depart from.
- endStation: The station you want to arrive at.
- transfer: The number of transfers between start and end.
- rideDate: The day you depart.
- startOrEndTime: 
- startTime: Start of the time range to search.
- endTime: End of the time range to search.
- trainTypeList: The kinds of trains you want.
- _isQryEarlyBirdTrn: Whether to search for early-bird discount tickets.
- query: The search action (the value **查詢** means "search").

In [25]:
'''Dictionary we need to post'''
data = {
        '_csrf': '1275647f-8532-4d14-9fb0-bcb84bf61f4c',
        'startStation': "", 
        'endStation': "",
        'transfer': "ONE",
        'rideDate': "2021/12/09",
        'startOrEndTime': "true",
        'startTime': "",
        'endTime': "",
        'trainTypeList': "ALL",
        '_isQryEarlyBirdTrn': "on",
        'query': "查詢"
        }


In [29]:
'''Some data we need to insert into dictionary'''
data['rideDate'] = day
data['startTime'] = start
data['endTime'] = end
data['startStation'] = getstations_input(st)
data['endStation'] = getstations_input(ed)
data

{'_csrf': '1275647f-8532-4d14-9fb0-bcb84bf61f4c',
 'startStation': '1000-臺北',
 'endStation': '3300-臺中',
 'transfer': 'ONE',
 'rideDate': '2022/04/26',
 'startOrEndTime': 'true',
 'startTime': '20:00',
 'endTime': '22:00',
 'trainTypeList': 'ALL',
 '_isQryEarlyBirdTrn': 'on',
 'query': '查詢'}
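
The same fields can also be assembled by a small helper, so that no stale token or date is left behind in the dictionary. This is a minimal sketch: `build_payload` and its parameter names are hypothetical, and the fixed values (`ONE`, `true`, `ALL`, `on`, `查詢`) are simply copied from the dictionary above.

```python
def build_payload(token, start_station, end_station, ride_date, start_time, end_time):
    """Assemble the form fields for the timetable query."""
    return {
        '_csrf': token,                 # fresh token taken from the query page
        'startStation': start_station,  # e.g. '1000-臺北'
        'endStation': end_station,      # e.g. '3300-臺中'
        'transfer': 'ONE',
        'rideDate': ride_date,          # 'YYYY/MM/DD'
        'startOrEndTime': 'true',
        'startTime': start_time,        # 'HH:MM'
        'endTime': end_time,            # 'HH:MM'
        'trainTypeList': 'ALL',
        '_isQryEarlyBirdTrn': 'on',
        'query': '查詢',
    }

payload = build_payload('token-goes-here', '1000-臺北', '3300-臺中',
                        '2022/04/26', '20:00', '22:00')
print(payload['startStation'], payload['endStation'])
```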

## Web Crawler Start!
Use the *requests* package to post the form and get the data. After getting the data, we can use the *bs4* package to parse it.

In [31]:
session = requests.Session()
url = 'https://tip.railway.gov.tw//tra-tip-web/tip/tip001/tip112/querybytime'

s = session.get(url)
bs = BS(s.text, 'html.parser')

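# The first <input> inside the query form carries the CSRF token
token = bs.find('form', {'id':"queryForm"}).input
#print(token['value'])

data['_csrf']=token['value']  # replace the placeholder with the fresh token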

r = session.post(url, data=data)
r

<Response [200]>
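
Taking the first `<input>` of the form works as long as the token stays first. Looking the field up by its name is a little more robust if the page layout changes. The sketch below runs on a made-up HTML fragment rather than the real page; `extract_csrf` and `sample_html` are hypothetical, and the field name `_csrf` is taken from the key used in the posted dictionary.

```python
from bs4 import BeautifulSoup

def extract_csrf(html):
    """Return the value of the _csrf input inside the query form, or None if absent."""
    soup = BeautifulSoup(html, 'html.parser')
    form = soup.find('form', {'id': 'queryForm'})
    if form is None:
        return None
    field = form.find('input', {'name': '_csrf'})
    if field is None or not field.has_attr('value'):
        return None
    return field['value']

# Made-up fragment for illustration; the real page contains many more fields
sample_html = '''
<form id="queryForm">
  <input type="text" name="startStation" value="">
  <input type="hidden" name="_csrf" value="example-token">
</form>
'''
print(extract_csrf(sample_html))  # example-token
```

Returning `None` instead of raising lets the caller decide whether to retry the GET request or stop.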

In [34]:
try:
    bs = BS(r.text, 'html.parser')

    # find the data I want: each trip is a <tr class="trip-column"> in the result block
    trains = bs.find('div', {'class':"search-trip"}).find_all('tr', {'class':"trip-column"})

    for train in trains:
        # cells whose text contains two consecutive digits: departure, arrival, travel time
        result = train.find_all('td', string=re.compile(r'\d{2}.*'))
        departure_time = result[0].get_text()
        arrival_time = result[1].get_text()
        travel_time = result[2].get_text()
        print('Departure Time: ' + departure_time,
              'Arrival Time: ' + arrival_time,
              'Travel Time: ' + travel_time,
              sep=' | ', end='\n')
except (AttributeError, IndexError):
    # no result block, or a row with fewer than three matching cells
    print('We can only search the train schedule within one day.')
    print('Maybe you should search again after 00:00.')

Departure Time: 20:09 | Arrival Time: 23:34 | Travel Time: 3 小時 25 分
Departure Time: 20:15 | Arrival Time: 21:59 | Travel Time: 1 小時 44 分
Departure Time: 21:00 | Arrival Time: 23:18 | Travel Time: 2 小時 18 分
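
To compare trains, the travel time strings can be converted into minutes. The following is a small sketch using only `re`; `travel_minutes` and `rows` are hypothetical names, and the sample rows are copied from the output above. It assumes the duration always uses 小時 (hours) and 分 (minutes) as shown there.

```python
import re

def travel_minutes(text):
    """Convert a duration such as '3 小時 25 分' into total minutes."""
    hours = re.search(r'(\d+)\s*小時', text)
    minutes = re.search(r'(\d+)\s*分', text)
    if hours is None and minutes is None:
        raise ValueError(f'Unrecognized duration: {text!r}')
    total = int(hours.group(1)) * 60 if hours else 0
    total += int(minutes.group(1)) if minutes else 0
    return total

# (departure, arrival, travel time) taken from the printed result above
rows = [
    ('20:09', '23:34', '3 小時 25 分'),
    ('20:15', '21:59', '1 小時 44 分'),
    ('21:00', '23:18', '2 小時 18 分'),
]
fastest = min(rows, key=lambda row: travel_minutes(row[2]))
print(fastest[0], travel_minutes(fastest[2]))  # 20:15 104
```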


## Now You Have the Data
You can now decide which train you want to take. Some things can still be improved or revised, so I hope that one day I can make this project easier to use and learn more about web crawling.