<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 5px; height: 70px">

# Optimising Hospital Bed Occupancy through Machine Learning
**DSI-41 Group FWSG**: Muhammad Faaiz Khan, Sharifah Nurulhuda, Tan Wei Chiong, Gabriel Tan

### 01_01 Data Collection - Web scraping (Weather Data)

This notebook outlines the steps taken to scrape **weather data** from the [Meteorological Service Singapore](https://www.weather.gov.sg/home/) website. We intend to use temperature- and rainfall-related features in subsequent modelling to predict NUH's **Bed Occupancy Rates (BOR)**. We hypothesise that weather patterns such as rain could affect hospitalisation rates (e.g. illnesses, injuries, and accidents associated with wet weather).

The weather data is the only data source that requires web scraping. We scrape the data from 2018 to 2023, in line with the period covered by the available BOR data.

---

### Imports

In [1]:
import numpy as np
import pandas as pd
import requests
import re

from bs4 import BeautifulSoup

### An overview of the scraping process

The `read_csv` function in pandas can read directly from a URL that points to a csv file. We take advantage of this to get the data that we need.

An example is shown below.

In [3]:
test_df = pd.read_csv('https://www.weather.gov.sg/files/dailydata/DAILYDATA_S24_202301.csv', encoding='utf-8')

In [4]:
test_df

| | Station | Year | Month | Day | Daily Rainfall Total (mm) | Highest 30 min Rainfall (mm) | Highest 60 min Rainfall (mm) | Highest 120 min Rainfall (mm) | Mean Temperature (°C) | Maximum Temperature (°C) | Minimum Temperature (°C) | Mean Wind Speed (km/h) | Max Wind Speed (km/h) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Changi | 2023 | 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 27.2 | 30.6 | 25.3 | 9.8 | 31.5 |
| 1 | Changi | 2023 | 1 | 2 | 0.0 | 0.0 | 0.0 | 0.0 | 27.4 | 31.6 | 25.1 | 11.6 | 38.9 |
| 2 | Changi | 2023 | 1 | 3 | 6.0 | 6.0 | 6.0 | 6.0 | 27.4 | 31.8 | 24.4 | 9.3 | 37.0 |
| 3 | Changi | 2023 | 1 | 4 | 2.6 | 1.4 | 1.8 | 1.8 | 26.2 | 28.6 | 23.9 | 9.0 | 37.0 |
| 4 | Changi | 2023 | 1 | 5 | 0.2 | 0.2 | 0.2 | 0.2 | 27.1 | 31.3 | 25.2 | 6.3 | 27.8 |
| 5 | Changi | 2023 | 1 | 6 | 0.0 | 0.0 | 0.0 | 0.0 | 28.0 | 33.6 | 25.5 | 8.2 | 29.6 |
| 6 | Changi | 2023 | 1 | 7 | 3.8 | 3.6 | 3.8 | 3.8 | 27.6 | 32.4 | 25.5 | 8.6 | 31.5 |
| 7 | Changi | 2023 | 1 | 8 | 0.0 | 0.0 | 0.0 | 0.0 | 28.2 | 34.2 | 25.3 | 7.8 | 29.6 |
| 8 | Changi | 2023 | 1 | 9 | 0.0 | 0.0 | 0.0 | 0.0 | 27.3 | 30.8 | 25.2 | 8.4 | 31.5 |
| 9 | Changi | 2023 | 1 | 10 | 3.0 | 3.0 | 3.0 | 3.0 | 27.2 | 31.0 | 25.3 | 7.5 | 27.8 |


In [5]:
test_array = test_df.to_numpy()

In [6]:
test_array[0]

array(['Changi', 2023, 1, 1, 0.0, 0.0, 0.0, 0.0, 27.2, 30.6, 25.3, 9.8,
       31.5], dtype=object)

The URL format is `https://www.weather.gov.sg/files/dailydata/DAILYDATA_[STATION_CODE]_[YEAR][MONTH].csv`, where `[MONTH]` is a two-digit month.<br>
So all we need are the station codes, years, and months.<br>
Example: https://www.weather.gov.sg/files/dailydata/DAILYDATA_S24_202312.csv
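
The URL pattern above can be captured in a small helper. This is only a minimal sketch: `BASE_URL` and `build_daily_url` are hypothetical names that do not appear elsewhere in this notebook, and the pattern is simply the one described above.

```python
BASE_URL = 'https://www.weather.gov.sg/files/dailydata'

def build_daily_url(station_code: str, year: int, month: int) -> str:
    """Return the daily-data csv URL for one station, year, and month."""
    # the month is zero-padded to two digits, e.g. 1 -> '01'
    return f'{BASE_URL}/DAILYDATA_{station_code}_{year}{month:02d}.csv'

print(build_daily_url('S24', 2023, 12))
```

Zero-padding the month inside the helper would remove the need to keep a separate list of month strings, though the rest of this notebook keeps the explicit list.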

First, we need to obtain the relevant station codes, years, and months. We do this by scraping the weather.gov.sg site directly.

In [9]:
res = requests.get(url='https://www.weather.gov.sg/climate-historical-daily/')

In [10]:
soup = BeautifulSoup(res.text, 'lxml')

In [11]:
print(soup.prettify())

<!DOCTYPE html>
<!--[if IE 7]>
<html class="ie ie7" lang="en-US">
<![endif]-->
<!--[if IE 8]>
<html class="ie ie8" lang="en-US">
<![endif]-->
<!--[if !(IE 7) & !(IE 8)]><!-->
<html lang="en-US">
 <!--<![endif]-->
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width" name="viewport"/>
  <!--<meta http-equiv="cache-control" content="max-age=0" />
<meta http-equiv="cache-control" content="no-cache, no-store, must-revalidate" />
<meta http-equiv="expires" content="0" />
<meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT" />
<meta http-equiv="pragma" content="no-cache" />
-->
  <!-- <meta http-equiv="cache-control" content="public, max-age=60, must-revalidate" /> -->
  <title>
   Historical Daily Records |
  </title>
  <link href="http://gmpg.org/xfn/11" rel="profile"/>
  <link href="http://www.weather.gov.sg/xmlrpc.php" rel="pingback"/>
  <link href="http://www.weather.gov.sg/wp-content/themes/wiptheme/lib/bootstrap/css/bootstrap.min.css" rel="stylesheet"/>
 

In [12]:
soup.find_all('a')

[<a href="http://www.weather.gov.sg/home/"><img alt="MSS" height="71" src="http://www.weather.gov.sg/wp-content/themes/wiptheme/assets/img/mss-logo.png" width="254"/></a>,
 <a href="http://www.gov.sg"><img alt="Singapore Government" height="30" src="http://www.weather.gov.sg/wp-content/themes/wiptheme/assets/img/sg-gov-logo.jpg" width="220"/></a>,
 <a class="selected donotresize" href="#" id="decrease" style="border-bottom:none;">A<sup>-</sup></a>,
 <a class="donotresize" href="#" id="increase" style="border-bottom:none;">A<sup>+</sup></a>,
 <a class="donotresize" href="http://www.weather.gov.sg/about-contact-us/">Contact Us</a>,
 <a class="dropdown-toggle donotresize" data-toggle="dropdown" href="#">Weather</a>,
 <a class="donotresize" href="http://www.weather.gov.sg/weather-forecast-2hrnowcast/">2-hr Nowcast</a>,
 <a class="donotresize" href="http://www.weather.gov.sg/weather-forecast-24hrforecast/">24-hr Forecast</a>,
 <a class="donotresize" href="http://www.weather.gov.sg/weather-f

In [13]:
# getting the onclick attribute of every <a> tag that has one (these hold the station codes and years)
codes = [code for code in [text.get('onclick') for text in soup.find_all('a')] if code is not None]

Next, we create a dictionary that maps each station code to its station name.

In [14]:
station_code_dict = dict([(text.get('onclick').split("'")[1], text.text) for text in soup.find_all(lambda tag: tag.has_attr('onclick'))
                          if len(text.get('onclick').split("'"))==3])

In [15]:
station_code_dict

{'S104': 'Admiralty',
 'S105': 'Admiralty West',
 'S109': 'Ang Mo Kio',
 'S86': 'Boon Lay (East)',
 'S63': 'Boon Lay (West)',
 'S120': 'Botanic Garden',
 'S55': 'Buangkok',
 'S64': 'Bukit Panjang',
 'S90': 'Bukit Timah',
 'S92': 'Buona Vista',
 'S61': 'Chai Chee',
 'S24': 'Changi',
 'S114': 'Choa Chu Kang (Central)',
 'S121': 'Choa Chu Kang (South)',
 'S11': 'Choa Chu Kang (West)',
 'S50': 'Clementi',
 'S118': 'Dhoby Ghaut',
 'S107': 'East Coast Parkway',
 'S39': 'Jurong (East)',
 'S101': 'Jurong (North)',
 'S44': 'Jurong (West)',
 'S117': 'Jurong Island',
 'S33': 'Jurong Pier',
 'S31': 'Kampong Bahru',
 'S71': 'Kent Ridge',
 'S122': 'Khatib',
 'S66': 'Kranji Reservoir',
 'S112': 'Lim Chu Kang',
 'S08': 'Lower Peirce Reservoir',
 'S07': 'Macritchie Reservoir',
 'S40': 'Mandai',
 'S108': 'Marina Barrage',
 'S113': 'Marine Parade',
 'S111': 'Newton',
 'S119': 'Nicoll Highway',
 'S116': 'Pasir Panjang',
 'S94': 'Pasir Ris (Central)',
 'S29': 'Pasir Ris (West)',
 'S06': 'Paya Lebar',
 'S10

In [16]:
codes

["setYear('S104')",
 "setYear('S105')",
 "setYear('S109')",
 "setYear('S86')",
 "setYear('S63')",
 "setYear('S120')",
 "setYear('S55')",
 "setYear('S64')",
 "setYear('S90')",
 "setYear('S92')",
 "setYear('S61')",
 "setYear('S24')",
 "setYear('S114')",
 "setYear('S121')",
 "setYear('S11')",
 "setYear('S50')",
 "setYear('S118')",
 "setYear('S107')",
 "setYear('S39')",
 "setYear('S101')",
 "setYear('S44')",
 "setYear('S117')",
 "setYear('S33')",
 "setYear('S31')",
 "setYear('S71')",
 "setYear('S122')",
 "setYear('S66')",
 "setYear('S112')",
 "setYear('S08')",
 "setYear('S07')",
 "setYear('S40')",
 "setYear('S108')",
 "setYear('S113')",
 "setYear('S111')",
 "setYear('S119')",
 "setYear('S116')",
 "setYear('S94')",
 "setYear('S29')",
 "setYear('S06')",
 "setYear('S106')",
 "setYear('S81')",
 "setYear('S77')",
 "setYear('S25')",
 "setYear('S102')",
 "setYear('S80')",
 "setYear('S60')",
 "setYear('S36')",
 "setYear('S110')",
 "setYear('S84')",
 "setYear('S79')",
 "setYear('S43')",
 "setYear('

In [17]:
# station codes
station_codes = [re.split(r"'", code)[1] for code in codes if 'Year' in code]

# years
recorded_years = [re.split(r"\(|\)", code)[1] for code in codes if 'Month' in code]
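
To make the two `re.split` calls easier to follow, here is a small sketch that runs them on a handful of sample strings. The `setYear('...')` form is taken from the `codes` output above; the `setMonth(2023)` form is an assumption inferred from the `'Month'` filter and the year strings it produces, since those entries are cut off in the output. The `sample_*` names are hypothetical.

```python
import re

# setYear strings appear in the output above;
# the setMonth strings are an assumed format (see the note above)
sample_codes = ["setYear('S104')", "setYear('S24')", "setMonth(2023)", "setMonth(2022)"]

# splitting on the single quote leaves the station code in position 1
sample_station_codes = [re.split(r"'", code)[1] for code in sample_codes if 'Year' in code]

# splitting on either parenthesis leaves the year in position 1
sample_years = [re.split(r"\(|\)", code)[1] for code in sample_codes if 'Month' in code]

print(sample_station_codes)
print(sample_years)
```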

In [18]:
station_codes

['S104',
 'S105',
 'S109',
 'S86',
 'S63',
 'S120',
 'S55',
 'S64',
 'S90',
 'S92',
 'S61',
 'S24',
 'S114',
 'S121',
 'S11',
 'S50',
 'S118',
 'S107',
 'S39',
 'S101',
 'S44',
 'S117',
 'S33',
 'S31',
 'S71',
 'S122',
 'S66',
 'S112',
 'S08',
 'S07',
 'S40',
 'S108',
 'S113',
 'S111',
 'S119',
 'S116',
 'S94',
 'S29',
 'S06',
 'S106',
 'S81',
 'S77',
 'S25',
 'S102',
 'S80',
 'S60',
 'S36',
 'S110',
 'S84',
 'S79',
 'S43',
 'S78',
 'S72',
 'S23',
 'S88',
 'S89',
 'S115',
 'S82',
 'S35',
 'S69',
 'S46',
 'S123',
 'S91']

In [19]:
recorded_years

['2023',
 '2022',
 '2021',
 '2020',
 '2019',
 '2018',
 '2017',
 '2016',
 '2015',
 '2014',
 '2013',
 '2012',
 '2011',
 '2010',
 '2009',
 '2008',
 '2007',
 '2006',
 '2005',
 '2004',
 '2003',
 '2002',
 '2001',
 '2000',
 '1999',
 '1998',
 '1997',
 '1996',
 '1995',
 '1994',
 '1993',
 '1992',
 '1991',
 '1990',
 '1989',
 '1988',
 '1987',
 '1986',
 '1985',
 '1984',
 '1983',
 '1982',
 '1981',
 '1980']

In [20]:
recorded_years[:6]

['2023', '2022', '2021', '2020', '2019', '2018']

Now we are ready to scrape.

In [21]:
months = ['01', '02', '03', '04', '05', '06', 
          '07', '08', '09', '10', '11', '12']

Instead of collecting a large number of dataframes and then concatenating them, which can be computationally expensive, we first convert each month's dataframe into a list of numpy arrays, with each numpy array being one row of the dataframe.

After all the collection is finished, the entire long list of numpy arrays (rows) is then converted into a single dataframe.
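
The idea can be sketched on two tiny stand-in dataframes. The station name, column names, and values below are made up purely for illustration, and `rows` and `combined` are hypothetical names; only the pattern of extending a list of rows and building one dataframe at the end mirrors this notebook.

```python
import pandas as pd

# two small stand-in frames (made-up values) in place of downloaded monthly files
jan = pd.DataFrame({'Station': ['X', 'X'], 'Day': [1, 2], 'Rain': [0.0, 1.5]})
feb = pd.DataFrame({'Station': ['X'], 'Day': [1], 'Rain': [2.0]})

rows = []
for monthly_df in (jan, feb):
    # to_numpy() gives a 2-D array; list() splits it into one 1-D array per row
    rows.extend(list(monthly_df.to_numpy()))

# build the combined dataframe once, reusing the column names of the first frame
combined = pd.DataFrame.from_records(rows, columns=jan.columns.tolist())
print(combined.shape)
```

One trade-off of this approach is that rows pass through object-dtype arrays, so column dtypes are re-inferred when the final dataframe is built.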

In [23]:
def scrape_weather_data(year_num: int = 6):
    '''
    Parameters
    ---
    - `year_num`: Number of years in reverse chronological order from 2023
        - `year_num = 4` would cover 2023 back to 2020

    Returns
    ---
    - list of rows (numpy arrays), and list of (code, year, month) tuples that could not be read
    '''

    # initializes list of records
    lst_of_data = []

    # initializes list of non-existent records
    non_records = []

    for code in station_codes:
        for year in recorded_years[:year_num]:
            for month in months:
                # scrapes the respective location-year-month weather data into a dataframe
                url = f'https://www.weather.gov.sg/files/dailydata/DAILYDATA_{code}_{year}{month}.csv'
                # utf-8 encoding does not work for some csv files
                try:
                    # reads utf-8 by default
                    temp_df = pd.read_csv(url)
                    lst_of_data.extend(list(temp_df.to_numpy()))
                except UnicodeDecodeError:
                    # if utf-8 does not work, read cp1252 instead
                    temp_df = pd.read_csv(url, encoding='cp1252')
                    lst_of_data.extend(list(temp_df.to_numpy()))
                except Exception:
                    # there are some urls for which the url returns an Error 404
                    # (Exception rather than a bare except, so the run can still be interrupted)
                    non_records.append((code, year, month))

    return lst_of_data, non_records
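
The encoding fallback in `scrape_weather_data` can be illustrated offline on raw bytes. This is a sketch under stated assumptions: `decode_csv_bytes` is a hypothetical helper that is not used elsewhere in this notebook, and the one-column sample file is made up. It relies on the fact that `°` is a single byte in cp1252 that is not valid utf-8 on its own.

```python
import csv
import io

def decode_csv_bytes(raw: bytes) -> str:
    """Decode raw csv bytes as utf-8, falling back to cp1252."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # same fallback encoding as in scrape_weather_data
        return raw.decode('cp1252')

# made-up one-column file, encoded as cp1252 so that utf-8 decoding fails
raw = 'Mean Temperature (°C)\n27.2\n'.encode('cp1252')
text = decode_csv_bytes(raw)

# parse the decoded text with the standard-library csv reader
rows = list(csv.reader(io.StringIO(text)))
print(rows)
```

The decoded text could equally be passed to `pd.read_csv(io.StringIO(text))`; the standard-library reader is used here only to keep the sketch free of network access and extra dependencies.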


Note: The actual scraping of weather data from 2018 to 2023 takes around 25 minutes to run.

In [25]:
# data from 2018 to 2023
final_arrays, non_records = scrape_weather_data()

In [26]:
len(final_arrays)

116205

In [27]:
final_arrays[-10:]

[array(['Yishun', 2018, 12, 22, '—', '—', '—', '—', '—', '—', '—', '—',
        '—'], dtype=object),
 array(['Yishun', 2018, 12, 23, '—', '—', '—', '—', '—', '—', '—', '—',
        '—'], dtype=object),
 array(['Yishun', 2018, 12, 24, '—', '—', '—', '—', '—', '—', '—', '—',
        '—'], dtype=object),
 array(['Yishun', 2018, 12, 25, '—', '—', '—', '—', '—', '—', '—', '—',
        '—'], dtype=object),
 array(['Yishun', 2018, 12, 26, '—', '—', '—', '—', '—', '—', '—', '—',
        '—'], dtype=object),
 array(['Yishun', 2018, 12, 27, '—', '—', '—', '—', '—', '—', '—', '—',
        '—'], dtype=object),
 array(['Yishun', 2018, 12, 28, '—', '—', '—', '—', '—', '—', '—', '—',
        '—'], dtype=object),
 array(['Yishun', 2018, 12, 29, '—', '—', '—', '—', '—', '—', '—', '—',
        '—'], dtype=object),
 array(['Yishun', 2018, 12, 30, '—', '—', '—', '—', '—', '—', '—', '—',
        '—'], dtype=object),
 array(['Yishun', 2018, 12, 31, '—', '—', '—', '—', '—', '—', '—', '—',
        '—'], dtype

These are the station-year-month combinations that could not be read (for example, because the URL returned an Error 404).

In [28]:
len(non_records)

717

In [29]:
non_records

[('S105', '2023', '01'),
 ('S105', '2023', '02'),
 ('S105', '2023', '03'),
 ('S105', '2023', '04'),
 ('S105', '2023', '05'),
 ('S105', '2023', '06'),
 ('S105', '2023', '07'),
 ('S105', '2023', '08'),
 ('S105', '2023', '09'),
 ('S105', '2023', '10'),
 ('S105', '2023', '11'),
 ('S105', '2023', '12'),
 ('S105', '2022', '01'),
 ('S105', '2022', '02'),
 ('S105', '2022', '03'),
 ('S105', '2022', '04'),
 ('S105', '2022', '05'),
 ('S105', '2022', '06'),
 ('S105', '2022', '07'),
 ('S105', '2022', '08'),
 ('S105', '2022', '09'),
 ('S105', '2022', '10'),
 ('S105', '2022', '11'),
 ('S105', '2022', '12'),
 ('S105', '2021', '01'),
 ('S105', '2021', '02'),
 ('S105', '2021', '03'),
 ('S105', '2021', '04'),
 ('S105', '2021', '05'),
 ('S105', '2021', '06'),
 ('S105', '2021', '07'),
 ('S105', '2021', '08'),
 ('S105', '2021', '09'),
 ('S105', '2021', '10'),
 ('S105', '2021', '11'),
 ('S105', '2021', '12'),
 ('S105', '2020', '05'),
 ('S105', '2020', '06'),
 ('S105', '2020', '07'),
 ('S105', '2020', '08'),
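
A quick way to see where the gaps are is to count the failed months per station and year. The sketch below uses only a few entries copied from the output above; `sample_non_records` and `failed_per_station_year` are hypothetical names, and the same counting could be applied to the full `non_records` list.

```python
from collections import Counter

# a few entries copied from the output above
sample_non_records = [
    ('S105', '2023', '01'),
    ('S105', '2023', '02'),
    ('S105', '2022', '01'),
    ('S105', '2020', '05'),
]

# number of months that could not be read, per (station code, year)
failed_per_station_year = Counter((code, year) for code, year, _ in sample_non_records)
print(failed_per_station_year)
```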


Now we collate everything into a pandas DataFrame, using the column names from `test_df` above.

In [30]:
final_weather_df = pd.DataFrame.from_records(final_arrays, 
                                             columns=test_df.columns.tolist())

In [31]:
final_weather_df

| | Station | Year | Month | Day | Daily Rainfall Total (mm) | Highest 30 min Rainfall (mm) | Highest 60 min Rainfall (mm) | Highest 120 min Rainfall (mm) | Mean Temperature (°C) | Maximum Temperature (°C) | Minimum Temperature (°C) | Mean Wind Speed (km/h) | Max Wind Speed (km/h) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Admiralty | 2023.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 26.8 | 30.1 | 24.8 | 10.7 | 27.8 |
| 1 | Admiralty | 2023.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 27.3 | 31.5 | 24.7 | 13.6 | 34.3 |
| 2 | Admiralty | 2023.0 | 1.0 | 3.0 | 0.2 | 0.2 | 0.2 | 0.2 | 27.3 | 31.7 | 25.0 | 11.2 | 35.6 |
| 3 | Admiralty | 2023.0 | 1.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 26.5 | 29.3 | 24.5 | 13.2 | 41.5 |
| 4 | Admiralty | 2023.0 | 1.0 | 5.0 | 3.4 | 3.2 | 3.4 | 3.4 | 27.1 | 32.2 | 25.3 | 8.7 | 32.4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 116200 | Yishun | 2018.0 | 12.0 | 27.0 | — | — | — | — | — | — | — | — | — |
| 116201 | Yishun | 2018.0 | 12.0 | 28.0 | — | — | — | — | — | — | — | — | — |
| 116202 | Yishun | 2018.0 | 12.0 | 29.0 | — | — | — | — | — | — | — | — | — |
| 116203 | Yishun | 2018.0 | 12.0 | 30.0 | — | — | — | — | — | — | — | — | — |


Lastly, we export `final_weather_df` to a `.csv` file.

In [32]:
final_weather_df.to_csv('../datasets/weather_records.csv', index=False)