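## Reading fare data for the Incheon-Narita route
- Target site: T'way Air website

### Crawling the T'way site
    - Crawl the T'way flight-booking page
    - Create a session and use it to fetch the HTML page
    - Extract the required data into a Pandas DataFrame
    - Save the resulting data to an Excel file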
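
The session step relies on `session_crawling` from `common.crawling_util`, whose source is not shown in this notebook. The following is a minimal sketch of what such a helper might look like, not the actual implementation: it assumes the helper first visits the session URL to pick up cookies and then reuses the same `requests` session for the data request. The `session` parameter and the `timeout` value are hypothetical additions so the sketch can be exercised without network access.

```python
def session_crawling(session_url, url, param, session_head=None, head=None,
                     method='get', json=False, session=None):
    """Hypothetical stand-in for common.crawling_util.session_crawling."""
    if session is None:
        # Imported lazily: only needed when a real network session is created.
        import requests
        session = requests.Session()

    # 1. Visit the landing page first so the server can set its session cookies.
    session.get(session_url, headers=session_head or {}, timeout=30)

    # 2. Reuse the same session (and its cookies) for the actual data request.
    if method.lower() == 'get':
        response = session.get(url, params=param, headers=head or {}, timeout=30)
    else:
        response = session.post(url, data=param, headers=head or {}, timeout=30)
    response.raise_for_status()

    # 3. Return parsed JSON or the raw HTML text, depending on the flag.
    return response.json() if json else response.text
```

Judging by the log output further down, the real helper also prints the URLs and parameters it uses; that logging is left out of the sketch.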

In [45]:
from IPython.display import display
import requests
import pandas as pd
import numpy as np
from pandas import DataFrame
from bs4 import BeautifulSoup
import time
from datetime import datetime
from datetime import timedelta
from common.crawling_util import session_crawling

## Query international-route data
def crawling_TW_data(dpt,arr,dpt_date):
    ## Read one-way fares for the given origin, destination, and departure date
    print('Crawling twayair homepage schedule site')
    session_url = "https://www.twayair.com/booking/availabilityList.do"
    session_head = {
        'Referer':'https://www.twayair.com/main.do',
    }
    
    url = 'https://www.twayair.com/booking/ajax/searchAvailability.do'
    head = {
        'Referer':'https://www.twayair.com/booking/availabilityList.do',
        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'
    }
    param ={
        'origin':dpt,                  'destination':arr,
        'origin1':dpt,                 'destination1':arr,
        'origin2':arr,                 'destination2':dpt,
        'onwardDateStr':dpt_date,      'returnDateStr':dpt_date,
        'today':datetime.today().strftime('%Y%m%d'),
        #'searchAvailId':searchAvailId,
        'travelType':'OW',#'RT',
        'currencyCode':'KRW',            'domesticYn':'N', ## international route
        'pointOfPurchase':'KR',          'paxTypeCountStr':'1,0,0',
        'searchType':'byDate',
        'orderByOW':'',               'orderByRT':'',
        'fareBasisCodeOW':'',         'fareBasisCodeRT':'',
        'arrivCntryCode':'',          'promotionCode':'',
    }

    return session_crawling(session_url,url,param,session_head=session_head,head=head,method='get',json=False)

data_heads = ['Flt','start','end','fare1','fare2','fare3','tax1','tax2','seat']
## Build a DataFrame from the raw HTML
def raw_to_df(raw_data):
    soup = BeautifulSoup(raw_data,'lxml')
    rows = soup.select('#tbodyOnward tr')
    ## No schedule: the table has no rows, or its first row is a single-cell message
    if len(rows) == 0 or len(rows[0].select('td')) == 1:
        return None
    
    fare_data = []
    for tr in rows:
        td = tr.select('td')
        td_list = [f.text.split()[0].strip() for f in td[:3]]
        ## 1. Event fare: take the air fare only.
        fares = [f.attrs['value'] for f in td[3].select('input') if f.attrs['type'] == 'hidden' and f.attrs['name'] == 'fare']
        ## Check seat availability; if sold out, record the price as 0
        soldout = td[3].select('.txt3')[0].select('.soldout')
        if len(soldout) == 0: # no seat notice, so seats are available
            td_list.extend(fares)
        else:
            soldout_txt = soldout[0].text.strip()
            if '매진' in soldout_txt: ## '매진' means sold out
                td_list.append('0')
            else:
                td_list.extend(fares)
        ## 2. Smart fare: take the air fare only.
        fares = [f.attrs['value'] for f in td[4].select('input') if f.attrs['type'] == 'hidden' and f.attrs['name'] == 'fare']
        ## Check seat availability; if sold out, record the price as 0
        soldout = td[4].select('.txt3')[0].select('.soldout')
        if len(soldout) == 0: # no seat notice, so seats are available
            td_list.extend(fares)
        else:
            soldout_txt = soldout[0].text.strip()
            if '매진' in soldout_txt: ## sold out
                td_list.append('0')
            else:
                td_list.extend(fares)
        ## 3. Regular fare: take the air fare, fuel surcharge, and airport tax.
        fares = [f.attrs['value'] for f in td[5].select('input') 
                 if f.attrs['type'] == 'hidden' and f.attrs['name'] in ['fare','surcharge','tax']]
        ## Check seat availability; if sold out, record fare, taxes, and seats as 0
        soldout = td[5].select('.txt3')[0].select('.soldout')
        if len(soldout) == 0: # no seat notice, so seats are available
            td_list.extend(fares)
            td_list.append('9') # no seat count is shown in this case; 9 is stored as a placeholder
        else:
            soldout_txt = soldout[0].text.strip()
            if '매진' in soldout_txt: # sold out
                td_list.extend(['0','0','0','0'])
            else:
                td_list.extend(fares)
                ## strip '(', '석' (seats), and ')' to keep only the remaining-seat count
                td_list.append(soldout_txt.replace('(','').replace('석','').replace(')',''))
        fare_data.append(td_list)
    return DataFrame(fare_data,columns=data_heads)

## Build a DataFrame with one day's fares and taxes plus a summary row (min, max, mean)
def read_TW_1day_fare(dpt,arr,dpt_date):
    ## Fetch the data
    raw_data = crawling_TW_data(dpt,arr,dpt_date)
    df=raw_to_df(raw_data)
    if df is None or len(df) == 0: ## no schedule for this date, or nothing could be parsed
        print('********** No Data Type 1 **********')
        return None

    ## Compute min, max, and mean
    fare_arr = df[['fare1','fare2','fare3']].values ## fare columns only
    fare_arr = fare_arr.flatten() # flatten to 1-D
    fare_arr = np.unique(fare_arr) # drop duplicates
    fare_arr = np.array([f for f in fare_arr if f not in ('', '0')]) # drop empty and sold-out (0) entries
    fare_arr = fare_arr.astype('float') # convert to numeric
    ## Append the summary row
    df.loc[len(df)] = [dpt_date,'min','max','mean',str(fare_arr.min()),str(fare_arr.max()),str(fare_arr.mean()),'','']
    return df

## Read data for a given date range; by default 31 days (today through today + 30)
def read_TW_date_range_fare(dpt,arr,start=0,end=31):
    # Read each day in the range and save the combined result to a file
    date_range = [ (datetime.today()+timedelta(days=i)).strftime('%Y%m%d') for i in range(start,end)]
    df_list = []
    for d in date_range:
        try:
            fare_df = read_TW_1day_fare(dpt,arr,d)
            if fare_df is not None:
                df_list.append(fare_df)
        except Exception as e:
            print('****** Error occurred : ',e)
    result = pd.concat(df_list,ignore_index=True)
    print('++++++++++Total : ', len(result))
    ## Save to file; the name ends with a YYYYMMDDHHMM timestamp
    result.to_excel('{}/{}_{}_{}_{}_{}_{}.xls'.format('excel','TW',dpt,arr,start,end,datetime.today().strftime('%Y%m%d%H%M')))
    return result
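
The date handling in `read_TW_date_range_fare` can be checked in isolation. The sketch below uses a hypothetical helper, `date_strings`, with a fixed base date in place of `datetime.today()` so the result is reproducible. It also shows why a timestamp suffix needs `%M` (minutes) and not `%m` (month).

```python
from datetime import datetime, timedelta

def date_strings(start, end, today=None):
    """Return YYYYMMDD strings for today+start through today+end-1."""
    today = today or datetime.today()
    return [(today + timedelta(days=i)).strftime('%Y%m%d') for i in range(start, end)]

# Fixed base date (assumed for illustration): 2017-04-24 09:05
base = datetime(2017, 4, 24, 9, 5)
dates = date_strings(0, 31, today=base)
print(len(dates), dates[0], dates[-1])   # 31 20170424 20170524

# %M is minutes, %m is month: only the first is a usable timestamp suffix.
print(base.strftime('%Y%m%d%H%M'))       # 201704240905
print(base.strftime('%Y%m%d%H%m'))       # 201704240904
```

With `start=0, end=31` the range covers 31 calendar days, so a follow-up call starting at offset 31 continues with the next day, with no gap and no overlap.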

In [46]:
## Read one day's data
dpt, arr, dpt_date = 'ICN','NRT','20170425'

df = read_TW_1day_fare(dpt,arr,dpt_date)
df

Crawling twayair homepage schedule site
Start Session crawling
make session :  https://www.twayair.com/booking/availabilityList.do
crawling :  https://www.twayair.com/booking/ajax/searchAvailability.do
>> Parameters
destination:NRT , travelType:OW , domesticYn:N , arrivCntryCode: , currencyCode:KRW , origin1:ICN , searchType:byDate , returnDateStr:20170425 , fareBasisCodeRT: , today:20170424 , orderByOW: , fareBasisCodeOW: , origin2:NRT , destination1:NRT , promotionCode: , pointOfPurchase:KR , destination2:ICN , paxTypeCountStr:1,0,0 , onwardDateStr:20170425 , origin:ICN , orderByRT: , 
End Session crawling


|  | Flt | start | end | fare1 | fare2 | fare3 | tax1 | tax2 | seat |
|---|---|---|---|---|---|---|---|---|---|
| 0 | TW201 | 07:45 | 10:15 | 40800.0 | 80000.0 | 280000.0 | 1100.0 | 28000.0 | 9.0 |
| 1 | TW8201 | 15:30 | 18:00 | 50000.0 | 80000.0 | 280000.0 | 1100.0 | 28000.0 | 9.0 |
| 2 | 20170425 | min | max | mean | 40800.0 | 280000.0 | 112700.0 |  |  |
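
The last row of the output is the daily summary: its `fare2`, `fare3`, and `tax1` cells hold the minimum, maximum, and mean of the distinct fares across the three fare columns, with sold-out (`0`) and empty entries left out. A minimal plain-Python sketch of that calculation follows. `summarize_fares` is a hypothetical helper, and the fare strings are assumed sample values matching the rows for 20170425.

```python
from statistics import mean

def summarize_fares(rows):
    """rows: list of (fare1, fare2, fare3) strings; '' or '0' means no fare."""
    # Collect distinct fares, skipping empty and sold-out entries.
    fares = {float(f) for row in rows for f in row if f not in ('', '0')}
    return min(fares), max(fares), mean(fares)

rows = [('40800', '80000', '280000'),   # TW201
        ('50000', '80000', '280000')]   # TW8201
print(summarize_fares(rows))  # (40800.0, 280000.0, 112700.0)
```

Because duplicates are removed first, the mean is taken over the four distinct prices (40800, 50000, 80000, 280000), not over all six cells.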


In [47]:
## Read data for a given date range
dpt, arr = 'ICN','NRT'
start,end = 0, 31 ## day offsets from today to read
read_TW_date_range_fare(dpt,arr,start,end)

Crawling twayair homepage schedule site
Start Session crawling
make session :  https://www.twayair.com/booking/availabilityList.do
crawling :  https://www.twayair.com/booking/ajax/searchAvailability.do
>> Parameters
destination:NRT , travelType:OW , domesticYn:N , arrivCntryCode: , currencyCode:KRW , origin1:ICN , searchType:byDate , returnDateStr:20170424 , fareBasisCodeRT: , today:20170424 , orderByOW: , fareBasisCodeOW: , origin2:NRT , destination1:NRT , promotionCode: , pointOfPurchase:KR , destination2:ICN , paxTypeCountStr:1,0,0 , onwardDateStr:20170424 , origin:ICN , orderByRT: , 
End Session crawling
********** No Data Type 1 **********
Crawling twayair homepage schedule site
Start Session crawling
make session :  https://www.twayair.com/booking/availabilityList.do
crawling :  https://www.twayair.com/booking/ajax/searchAvailability.do
>> Parameters
destination:NRT , travelType:OW , domesticYn:N , arrivCntryCode: , currencyCode:KRW , origin1:ICN , searchType:byDate , returnDateS

|  | Flt | start | end | fare1 | fare2 | fare3 | tax1 | tax2 | seat |
|---|---|---|---|---|---|---|---|---|---|
| 0 | TW201 | 07:45 | 10:15 | 40800.0 | 80000.0 | 280000.0 | 1100.0 | 28000.0 | 9 |
| 1 | TW8201 | 15:30 | 18:00 | 50000.0 | 80000.0 | 280000.0 | 1100.0 | 28000.0 | 9 |
| 2 | 20170425 | min | max | mean | 40800.0 | 280000.0 | 112700.0 |  |  |
| 3 | TW201 | 07:45 | 10:15 | 40800.0 | 80000.0 | 280000.0 | 1100.0 | 28000.0 | 9 |
| 4 | TW8201 | 15:30 | 18:00 | 60000.0 | 80000.0 | 280000.0 | 1100.0 | 28000.0 | 9 |
| 5 | 20170426 | min | max | mean | 40800.0 | 280000.0 | 115200.0 |  |  |
| 6 | TW201 | 07:45 | 10:15 | 70000.0 | 80000.0 | 280000.0 | 1100.0 | 28000.0 | 9 |
| 7 | TW8201 | 15:30 | 18:00 | 0 | 100000.0 | 280000.0 | 1100.0 | 28000.0 | 9 |
| 8 | 20170427 | min | max | mean | 70000.0 | 280000.0 | 132500.0 |  |  |
| 9 | TW201 | 07:45 | 10:15 | 0 | 100000.0 | 280000.0 | 1100.0 | 28000.0 | 9 |


In [48]:
## Read data for a given date range
dpt, arr = 'ICN','NRT'
start,end = 31, 46 ## day offsets from today to read
read_TW_date_range_fare(dpt,arr,start,end)

Crawling twayair homepage schedule site
Start Session crawling
make session :  https://www.twayair.com/booking/availabilityList.do
crawling :  https://www.twayair.com/booking/ajax/searchAvailability.do
>> Parameters
destination:NRT , travelType:OW , domesticYn:N , arrivCntryCode: , currencyCode:KRW , origin1:ICN , searchType:byDate , returnDateStr:20170525 , fareBasisCodeRT: , today:20170424 , orderByOW: , fareBasisCodeOW: , origin2:NRT , destination1:NRT , promotionCode: , pointOfPurchase:KR , destination2:ICN , paxTypeCountStr:1,0,0 , onwardDateStr:20170525 , origin:ICN , orderByRT: , 
End Session crawling
Crawling twayair homepage schedule site
Start Session crawling
make session :  https://www.twayair.com/booking/availabilityList.do
crawling :  https://www.twayair.com/booking/ajax/searchAvailability.do
>> Parameters
destination:NRT , travelType:OW , domesticYn:N , arrivCntryCode: , currencyCode:KRW , origin1:ICN , searchType:byDate , returnDateStr:20170526 , fareBasisCodeRT: , toda

|  | Flt | start | end | fare1 | fare2 | fare3 | tax1 | tax2 | seat |
|---|---|---|---|---|---|---|---|---|---|
| 0 | TW201 | 07:45 | 10:15 | 70000.0 | 80000.0 | 280000.0 | 1100.0 | 28000.0 | 9.0 |
| 1 | TW203 | 11:15 | 13:25 | 60000.0 | 90000.0 | 280000.0 | 1100.0 | 28000.0 | 9.0 |
| 2 | TW8201 | 15:30 | 18:00 | 0 | 80000.0 | 280000.0 | 1100.0 | 28000.0 | 9.0 |
| 3 | 20170525 | min | max | mean | 60000.0 | 280000.0 | 116000.0 |  |  |
| 4 | TW201 | 07:45 | 10:15 | 0 | 90000.0 | 280000.0 | 1100.0 | 28000.0 | 9.0 |
| 5 | TW8201 | 15:30 | 18:00 | 0 | 110000.0 | 280000.0 | 1100.0 | 28000.0 | 9.0 |
| 6 | 20170526 | min | max | mean | 90000.0 | 280000.0 | 160000.0 |  |  |
| 7 | TW201 | 07:45 | 10:15 | 0 | 90000.0 | 280000.0 | 1100.0 | 28000.0 | 9.0 |
| 8 | TW203 | 11:15 | 13:25 | 70000.0 | 80000.0 | 280000.0 | 1100.0 | 28000.0 | 9.0 |
| 9 | TW8201 | 15:30 | 18:00 | 0 | 100000.0 | 280000.0 | 1100.0 | 28000.0 | 9.0 |


In [49]:
## Read data for a given date range
dpt, arr = 'ICN','NRT'
start,end = 46, 90 ## day offsets from today to read
read_TW_date_range_fare(dpt,arr,start,end)

Crawling twayair homepage schedule site
Start Session crawling
make session :  https://www.twayair.com/booking/availabilityList.do
crawling :  https://www.twayair.com/booking/ajax/searchAvailability.do
>> Parameters
destination:NRT , travelType:OW , domesticYn:N , arrivCntryCode: , currencyCode:KRW , origin1:ICN , searchType:byDate , returnDateStr:20170609 , fareBasisCodeRT: , today:20170424 , orderByOW: , fareBasisCodeOW: , origin2:NRT , destination1:NRT , promotionCode: , pointOfPurchase:KR , destination2:ICN , paxTypeCountStr:1,0,0 , onwardDateStr:20170609 , origin:ICN , orderByRT: , 
End Session crawling
Crawling twayair homepage schedule site
Start Session crawling
make session :  https://www.twayair.com/booking/availabilityList.do
crawling :  https://www.twayair.com/booking/ajax/searchAvailability.do
>> Parameters
destination:NRT , travelType:OW , domesticYn:N , arrivCntryCode: , currencyCode:KRW , origin1:ICN , searchType:byDate , returnDateStr:20170610 , fareBasisCodeRT: , toda

|  | Flt | start | end | fare1 | fare2 | fare3 | tax1 | tax2 | seat |
|---|---|---|---|---|---|---|---|---|---|
| 0 | TW201 | 07:45 | 10:15 | 0 | 120000.0 | 280000.0 | 1100.0 | 28000.0 | 9 |
| 1 | TW8201 | 15:30 | 18:00 | 0 | 90000.0 | 280000.0 | 1100.0 | 28000.0 | 9 |
| 2 | 20170609 | min | max | mean | 90000.0 | 280000.0 | 163333.333333 |  |  |
| 3 | TW201 | 07:45 | 10:15 | 0 | 80000.0 | 280000.0 | 1100.0 | 28000.0 | 9 |
| 4 | TW203 | 11:15 | 13:25 | 0 | 90000.0 | 280000.0 | 1100.0 | 28000.0 | 9 |
| 5 | TW8201 | 15:30 | 18:00 | 0 | 90000.0 | 280000.0 | 1100.0 | 28000.0 | 9 |
| 6 | 20170610 | min | max | mean | 80000.0 | 280000.0 | 150000.0 |  |  |
| 7 | TW201 | 07:45 | 10:15 | 70000.0 | 80000.0 | 280000.0 | 1100.0 | 28000.0 | 9 |
| 8 | TW8201 | 15:30 | 18:00 | 0 | 110000.0 | 280000.0 | 1100.0 | 28000.0 | 9 |
| 9 | 20170611 | min | max | mean | 70000.0 | 280000.0 | 135000.0 |  |  |
