# Assignment: Extracting Data from a Static Web Page

Extract information about “วันพระ” (Wan Phra, Buddhist holy days) for 3 years from:
- https://www.myhora.com/ปฏิทิน/วันพระ-พ.ศ.2565.aspx
- https://www.myhora.com/ปฏิทิน/วันพระ-พ.ศ.2566.aspx
- https://www.myhora.com/ปฏิทิน/วันพระ-พ.ศ.2567.aspx


Note that you can use the dateparser package to parse Thai dates. The package has to be installed first; the cell below does this for Google Colab users. Otherwise, installing from the command line (pip or conda) is recommended.

In [20]:
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    %pip install dateparser

In [21]:
import dateparser

To convert a Thai date string, we use the parse function. Note that parse treats the year as a Common Era (CE) year, not a Buddhist Era (BE) year, so we have to subtract 543 from the year afterwards. In addition, weekday() returns the day of the week with 0=Monday, ..., 6=Sunday.

In [22]:
dt = dateparser.parse('วันศุกร์ที่ 17 มกราคม 2563')

# before the correction the year is read as 2563 CE, so this prints weekday == 0 (Monday)
print(dt)
print(dt.weekday())

# after subtracting 543 from the year, this prints weekday == 4 (Friday)
dt = dt.replace(year=dt.year-543)
print(dt)
print(dt.weekday())

2563-01-17 00:00:00
0
2020-01-17 00:00:00
4
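
One caveat with subtracting 543 after parsing: the intermediate value has to be a valid Gregorian date in the BE-numbered year. 29 February 2567 BE (2024 CE) is a real date, but 2567 is not a Gregorian leap year, so `datetime` cannot represent it before the subtraction. A minimal sketch of one possible workaround, assuming the year is the only four-digit number in the text, is to rewrite the year inside the string first. `be_text_to_ce` is a hypothetical helper name introduced here, not part of dateparser.

```python
import re
from datetime import datetime

BE_OFFSET = 543  # Buddhist Era year = Common Era year + 543


def be_text_to_ce(text: str) -> str:
    """Replace a four-digit BE year in `text` with the matching CE year.

    Assumption: the year is the only standalone four-digit number in the string.
    """
    return re.sub(
        r'(?<!\d)(\d{4})(?!\d)',
        lambda m: str(int(m.group(1)) - BE_OFFSET),
        text,
    )


print(be_text_to_ce('29 กุมภาพันธ์ 2567'))  # 29 กุมภาพันธ์ 2024

# The CE date exists, while the same day in Gregorian year 2567 does not.
print(datetime(2024, 2, 29).weekday())  # 3 (Thursday)
try:
    datetime(2567, 2, 29)
except ValueError as exc:
    print('invalid date:', exc)
```

The converted string could then be passed to `dateparser.parse` as in the cells above, with no year subtraction afterwards.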


In [23]:
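# note: the text says Saturday (วันเสาร์), but the parsed date printed below is a Tuesday (weekday 1)
dt = dateparser.parse('วันเสาร์ที่ 21 กันยายน 2564')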
dt = dt.replace(year=dt.year-543)
print(dt)
print(dt.weekday())

2021-09-21 00:00:00
1


Count how the “วันพระ” days are distributed across the days of the week over all three years, and answer the following questions:

In [24]:
import pandas as pd

df = pd.read_csv('./data/wan-phra.csv')

In [25]:
df.head()

Unnamed: 0,date,weekday,day,month,year,kheun,ram,lunar_month,lunar_year,event
0,2022-01-02,6,2,1,2022,-1,14,1,ฉลู,
1,2022-01-10,0,10,1,2022,8,-1,2,ฉลู,
2,2022-01-17,0,17,1,2022,15,-1,2,ฉลู,
3,2022-01-25,1,25,1,2022,-1,8,2,ฉลู,
4,2022-02-01,1,1,2,2022,-1,15,2,ฉลู,


## Imports and constants

In [26]:
import os
import sys
import dateparser
from bs4 import BeautifulSoup as bs
import requests
import re
import json
from tqdm import tqdm

import pandas as pd
import numpy as np

In [27]:
BASE_URL = r'https://www.myhora.com/ปฏิทิน/วันพระ-พ.ศ.{}.aspx'
YEAR_LIST = [2565, 2566, 2567]

TARGET_DIR = './data'
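
The years in `YEAR_LIST` are Buddhist Era (BE) years, which is why they are 543 ahead of the CE years that appear in the extracted data. As a rough sketch, the same list can be derived from CE years, and the Thai path can be percent-encoded explicitly if some tool needs an ASCII URL (the `requests.get` call in this notebook is given the unencoded URL and works as-is). `CE_YEARS` and `build_url` are names introduced here for illustration only.

```python
from urllib.parse import quote

BASE_URL = 'https://www.myhora.com/ปฏิทิน/วันพระ-พ.ศ.{}.aspx'
CE_YEARS = [2022, 2023, 2024]             # hypothetical input: years in CE
YEAR_LIST = [y + 543 for y in CE_YEARS]   # BE years: [2565, 2566, 2567]


def build_url(be_year: int, encode: bool = False) -> str:
    """Fill the BE year into the URL template, optionally percent-encoding it."""
    url = BASE_URL.format(be_year)
    # Keep ':' and '/' unescaped so only the non-ASCII characters are encoded.
    return quote(url, safe=':/') if encode else url


for year in YEAR_LIST:
    print(build_url(year))
print(build_url(2565, encode=True))
```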

In [28]:
thai2eng_mapping = {}

for n in range(10):
    thai2eng_mapping[chr(ord('๐')+n)] = str(n)
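
The loop above works because the Thai digits ๐ to ๙ occupy ten consecutive code points (U+0E50 to U+0E59). A small sketch of an equivalent approach uses `str.maketrans` and `str.translate` from the standard library; `THAI_DIGITS`, `THAI_TO_ARABIC`, and `thai_to_arabic` are names made up for this example.

```python
# Build the ten Thai digits from their code points, U+0E50 through U+0E59.
THAI_DIGITS = ''.join(chr(0x0E50 + n) for n in range(10))
THAI_TO_ARABIC = str.maketrans(THAI_DIGITS, '0123456789')


def thai_to_arabic(text: str) -> str:
    """Return `text` with Thai digits replaced by Arabic digits."""
    return text.translate(THAI_TO_ARABIC)


print(thai_to_arabic('ขึ้น ๑๕ ค่ำ'))  # ขึ้น 15 ค่ำ
```

Characters that are not Thai digits pass through unchanged, which is the same behavior as the character-by-character lookup used in the extraction loop.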

## Extract information

In [29]:
rows = []

for year in YEAR_LIST:
    current_url = BASE_URL.format(year)
    response = requests.get(current_url)
    response.raise_for_status()  # stop early on an HTTP error
    soup = bs(response.text, 'html.parser')

    # each วันพระ entry is a <div class="bud-day"> that contains three <div>s
    bud_list = soup.find_all('div', class_='bud-day')

    looper = tqdm(bud_list,
                  desc=f"extracting from {year}",
                  unit="day")
    for day in looper:
        normal_date, lunar_date, event = day.find_all('div')

        # dateparser reads the BE year as CE, so shift it back by 543
        normal_date = dateparser.parse(normal_date.text)
        normal_date = normal_date.replace(year=normal_date.year-543)
        # convert Thai digits to Arabic digits, leaving other characters as-is
        lunar_date = ''.join(thai2eng_mapping.get(c, c) for c in lunar_date.text)
        event = event.text

        rows.append({
            'date' : normal_date,
            'weekday' : normal_date.weekday(),
            'day' : normal_date.day,
            'month' : normal_date.month,
            'year' : normal_date.year,
            'kheun' : -1 if 'ขึ้น' not in lunar_date else lunar_date.split(' ')[1],
            'ram' : -1 if 'แรม' not in lunar_date else lunar_date.split(' ')[1],
            'lunar_month' : -1 if 'เดือน' not in lunar_date else re.findall(r'.*\((\d+)\).*', lunar_date)[0],
            'lunar_year' : -1 if 'ปี' not in lunar_date else lunar_date.split('ปี')[1],
            'event' : event.strip('()') if event else None
        })

# build the frame once instead of calling pd.concat inside the loop
df = pd.DataFrame(rows, columns=['date', 'weekday', 'day', 'month', 'year', 'kheun', 'ram', 'lunar_month', 'lunar_year', 'event'])
df = df.set_index('date')

extracting from 2565: 100%|██████████| 51/51 [00:00<00:00, 513.85day/s]
extracting from 2566: 100%|██████████| 50/50 [00:00<00:00, 410.16day/s]
extracting from 2567: 100%|██████████| 51/51 [00:00<00:00, 496.50day/s]
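
The row-building step packs several string operations into one dictionary literal. As a sketch of the same idea in isolation, the function below pulls the four lunar fields out of a single string with regular expressions. The sample text is an assumption about what the lunar-date cell looks like after digit conversion, inferred from the keywords the loop checks for (`ขึ้น`, `แรม`, `เดือน`, `ปี`); it is not copied from the site. `parse_lunar` is a hypothetical helper, and unlike the loop it returns the day and month as integers.

```python
import re


def parse_lunar(text: str) -> dict:
    """Split a lunar-date string into waxing day, waning day, month, and year."""
    phase = re.search(r'(ขึ้น|แรม)\s*(\d+)', text)   # waxing or waning, then the day
    month = re.search(r'\((\d+)\)', text)            # month number in parentheses
    year = re.search(r'ปี\s*(\S+)', text)            # zodiac year name after 'ปี'
    day = int(phase.group(2)) if phase else -1
    return {
        'kheun': day if phase and phase.group(1) == 'ขึ้น' else -1,
        'ram': day if phase and phase.group(1) == 'แรม' else -1,
        'lunar_month': int(month.group(1)) if month else -1,
        'lunar_year': year.group(1) if year else -1,
    }


sample = 'แรม 14 ค่ำ เดือนอ้าย(1) ปีฉลู'  # assumed format for illustration
print(parse_lunar(sample))
# {'kheun': -1, 'ram': 14, 'lunar_month': 1, 'lunar_year': 'ฉลู'}
```

Fields that cannot be found fall back to -1, matching the convention used in the extraction loop.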


In [30]:
df.head(10)

date,weekday,day,month,year,kheun,ram,lunar_month,lunar_year,event
2022-01-02,6,2,1,2022,-1,14,1,ฉลู,
2022-01-10,0,10,1,2022,8,-1,2,ฉลู,
2022-01-17,0,17,1,2022,15,-1,2,ฉลู,
2022-01-25,1,25,1,2022,-1,8,2,ฉลู,
2022-02-01,1,1,2,2022,-1,15,2,ฉลู,
2022-02-09,2,9,2,2022,8,-1,3,ฉลู,
2022-02-16,2,16,2,2022,15,-1,3,ฉลู,วันมาฆบูชา
2022-02-24,3,24,2,2022,-1,8,3,ฉลู,
2022-03-02,2,2,3,2022,-1,14,3,ฉลู,
2022-03-10,3,10,3,2022,8,-1,4,ฉลู,


In [31]:
os.makedirs(TARGET_DIR, exist_ok=True)  # create the folder if it does not exist yet
df.to_csv(os.path.join(TARGET_DIR, 'wan-phra.csv'), index=True)

## How many วันพระ are there in total (over the 3 years)?

In [32]:
print(len(df))

152


## On how many days in total (over the 3 years) does วันพระ fall on a Monday?

In [33]:
print(len(df[df['weekday'] == 0]))

21


## Which day of the week has the minimum number of วันพระ?

In [34]:
df.groupby('weekday').size()

weekday
0    21
1    20
2    22
3    23
4    21
5    21
6    24
dtype: int64

## Which day of the week has the maximum number of วันพระ?

In [35]:
df.groupby('weekday').size()

weekday
0    21
1    20
2    22
3    23
4    21
5    21
6    24
dtype: int64
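
The two tables above are identical, so the answers have to be read off by eye. A short sketch that picks out both extremes is shown below; it uses only the counts printed above, copied into a plain dictionary so that it runs without the scraped data. `counts` and `DAY_NAMES` are names introduced for this example. On the pandas Series itself, `idxmin()` and `idxmax()` would be the corresponding calls.

```python
# Weekday counts copied from the groupby output above (0=Monday, ..., 6=Sunday).
counts = {0: 21, 1: 20, 2: 22, 3: 23, 4: 21, 5: 21, 6: 24}
DAY_NAMES = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']

fewest = min(counts, key=counts.get)  # weekday with the smallest count
most = max(counts, key=counts.get)    # weekday with the largest count

print(DAY_NAMES[fewest], counts[fewest])  # Tuesday 20
print(DAY_NAMES[most], counts[most])      # Sunday 24
print(sum(counts.values()))               # 152, the total number of rows
```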