# Scraping Weather Data With Python

---

## *Contents*

 1. [Introduction](#introduction)
 2. [Part 1 - Find The Data](#part1)
 3. [Part 2 - Extract The Data](#part2)
 4. [Part 3 - Refine The Data](#part3)
 5. [Part 4 - Export The Data](#part4)
 6. [Conclusion](#conclusion)

---
# Introduction <a name='introduction'></a>
Data is the oil of our information age. With it we can save lives, be more efficient and make more money. Data isn't just plucked from the internet in well organised, coherent files. The raw data needs to be *extracted*! That's what is happening here. **This project is about extracting information about the weather in Western Australia** using the Bureau of Meteorology's raw data. 

The end result of this data extraction will be a single CSV file with weather data about the rainfall, temperature and solar exposure of each weather station in WA. This will be done using a single Python executable (.py): this [Jupyter](http://jupyter.org/) notebook is only a proof of concept. Instead of collecting all of the data here in the notebook it will instead serve as an insight into my thought process. As such, I'll only be collecting rainfall data for a couple weather stations. 

All of the processes being used can be extended to gather data near-automatically in any (or all) states and territories in Australia. If you wish to collect data in your own country or region then this document can serve as guidlines as to how to solve similar problems you'll face.

If you haven't figured it out so far, I'll be using [Python](https://www.python.org/) to extract my data. Python is a loosely-typed, object-orientated programming language built in C. A hugely beneficial feature of Python is its use of packages. In this projcet I'll be mainly using [urllib2](https://docs.python.org/2/library/urllib2.html), [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) and [Pandas](https://pandas.pydata.org/) plus help from some smaller packages like [zipfile](https://docs.python.org/2/library/zipfile.html) and [glob](https://docs.python.org/2/library/glob.html).  

## The Dirty Details
The data originates from [The Bureau of Meteorology](http://www.bom.gov.au) in their [Climate Data database](http://www.bom.gov.au/climate/data/). As such I claim no rights to the data itself. Anything that is mine (like this notebook and my executable file) I am allowing free usage of in accordance to the licence in the repository. 

The CSV file being exported will be structured like so:

```
    Station Number, Date, Rainfall, Max Temp, Min Temp, Solar Exposure 
```

Although this wont be the structure exported in this notebook (which is only done as a proof of concept), that will be the structure of the CSV file exported by the executable script.

In this notebook I'll be using dataframes a lot. If you don't know much about Pandas, especially about their data structures, I recommend you read [this](https://pandas.pydata.org/pandas-docs/stable/dsintro.html) article in their documentation quickly.

## Preface - Importing Packages, Defining Global Variables
First-things-first: import in the packages that will be used and define any global variables that will be used throughout the notebook.

In [1]:
# Import all of the goodies that I'll be using.

# HTML Scraping Packages
import urllib2
from bs4 import BeautifulSoup

# File Manipulation Packages
import os
import zipfile
import glob

# Data Manipulation Packages
import pandas

In [2]:
# Next, define some global variables that I'll be using.
BOM_HOME = r'http://www.bom.gov.au'

---
# Part 1 - Find The Data <a name='part1'></a>
Before we automate anything we should have a look at the data we are going to collect. Like I mentioned in the introduction the data is all found in the [Climate Data Database](http://www.bom.gov.au/climate/data/). So the first thing I did was go there, have a look at how to get the data manually and what kind of information we can collect.

For each weather station you can request the:

 - Rainfall;
 - Max temperature;
 - Min temperature; or
 - Solar exposure.

It makes sense to aim to collect all of these data points in the executable file. For this proof of concept, however, I'll just be focusing on the rainfall data. That way I can get the process working for one of the datasets.

## Exploring A Dataset
Let's actually look at the data for one of the weather stations. The station I've chosen to experiment on is [Perth Airport's](http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_nccObsCode=136&p_display_type=dailyDataFile&p_startYear=&p_c=&p_stn_num=9021) weather station (station number 9021). On the webpage you can see towards the top right a link with the text "All years of data" which downloads a zipped directory with the following structure:

```
IDCJAC0009_9021_1800.zip
|
|---- IDCJAC0009_9021_1800_Data.csv
`---- IDCJAC0009_9021_1800_Note.txt
```

I wont automate the extraction of the data just yet as I'm only exploring the database. So instead I've extracted the csv file manually, renamed it to 'station9021.csv' and imported it into a Pandas dataframe to visualise the data. You can see the last 5 rows of data below.

In [3]:
pandas.read_csv('ExploratoryData/station9021.csv').tail(5)

Unnamed: 0,Product code,Bureau of Meteorology station number,Year,Month,Day,Rainfall amount (millimetres),Period over which rainfall was measured (days),Quality
27005,IDCJAC0009,9021,2017,12,8,0.0,1.0,N
27006,IDCJAC0009,9021,2017,12,9,0.0,1.0,N
27007,IDCJAC0009,9021,2017,12,10,0.0,1.0,N
27008,IDCJAC0009,9021,2017,12,11,0.0,1.0,N
27009,IDCJAC0009,9021,2017,12,12,0.0,1.0,N


Most of the data columns are self-explainatory except for the last two. 

The second last column is the period time for collection. For the most part is will be over 1 day, as each day rainfall measurements are taken at 9am for the previous day. However, occassionally, there are some times where the rainfall was not collected that day. So instead of throwing out the data they "carry it forward" to the next day's measurement. So now the rainfall amount is linked to two (or more) days worth of rainfall instead of one.

The last column is the quality of the data. This can be a little misleading as it doesn't actually tell you if the data collection was incorrect. Instead it flags whether or not the collection process has been audited and believed to have no significant errors associated with it. So if the row is flagged with "Y" then we can be confident that the data has been collected correctly and accurately. If it is flagged with "N" then it is likely that the data is correct, but that hasn't been confirmed.

## Finding The List of Stations
Not all station numbers are being used up to 9021, nor are all 4-digit numbers in use. For example, station number 9 doesn't correspond to any station in Australia. Obviously that is a huge issue as we can't just assume a range of station numbers to collect our data over.

Fortunately for me, BOM actually have a list of weather stations you can access! By going to the [Weather Station Directory](http://www.bom.gov.au/climate/data/stations/) you can request the list of weather station in Australia or any of it's States or Territories. I downloaded WA's list, which I've visualised the first 10 lines below.

In [4]:
with open("ExploratoryData/weather_stations.txt") as f:
    head = [next(f) for _ in xrange(10)]
print ''.join(head)

Bureau of Meteorology product IDCJMC0014.                                       Produced: 23 Nov 2017
West Australian stations measuring rainfall
Site    Name                                     Lat       Lon      Start    End       Years   %  AWS
-----------------------------------------------------------------------------------------------------
   7118 ABBOTTS                                  -26.4000  118.4000 Sep 1898 Nov 1913    6.8   45   N
  10258 ABERVON                                  -30.7833  117.9833 May 1968 Aug 1973    5.2   97   N
   4000 ABYDOS                                   -21.4167  118.9333 Jul 1917 Dec 1974   40.0   70   N
   4045 ABYDOS WOODSTOCK                         -21.6200  118.9550 Apr 1901 Sep 1997   65.2   67   N
   9971 ACTON PARK                               -33.7845  115.4072 Nov 2000 Nov 2017   17.0   98   N
   2046 ADA VALE                                 -16.9500  128.1000 Jan 1920 Mar 1925    4.8   92   N



Already you can see that this text file is formatted for humans to read, not computers. Fortunately each "column" is fixed width. Thus it is possible to convert it to a csv! Although this is possible in python to do this, the time I would have to invest getting it to work right is more effort then just passing it through excel.

So that's what I did: I imported the text file into excel, deleted the heading, footer and styling before finally exporting it as a plain csv file. I visualised the end result below.

In [5]:
pandas.read_csv('ExploratoryData/weather_stations.csv').head()

Unnamed: 0,Site,Name,Lat,Lon,Start,End,Years,%,AWS
0,7118,ABBOTTS,-26.4,118.4,Sep 1898,Nov 1913,6.8,45,N
1,10258,ABERVON,-30.7833,117.9833,May 1968,Aug 1973,5.2,97,N
2,4000,ABYDOS,-21.4167,118.9333,Jul 1917,Dec 1974,40.0,70,N
3,4045,ABYDOS WOODSTOCK,-21.62,118.955,Apr 1901,Sep 1997,65.2,67,N
4,9971,ACTON PARK,-33.7845,115.4072,Nov 2000,Nov 2017,17.0,98,N


Most of the columns here are self-explainatory. However, the last two columns aren't as clear. After some quick Googling and tinkering with the data I figured out what they are:
 - **% (Percentage)**: This is the completeness of the data collected. 100% signifies no missing data points for its entire history.
 - **AWS**: This states wheter or not the weather station is an *automatic weather station*. You could probably trust the data more in these weather stations, but that would be something worth exploring further.

---
# Part 2 -  Extract The Data <a name='part2'></a>
Now we have a rough idea as to what the data looks like and where we can access it. The next step is to automate the data extraction process.

The process is incredibly simple. There will simply be a for-loop that iterates over the station list and then, using that data, accesses each weather station's data file and downloads it. It's literally that simple.

## Accessing The Data For Each Weather Station
The first thing we want to explore is exactly *how* we can get to each weather station's data using a script. My first thoughts were to automate the "filling out" of forms using a package like [Selenum](http://selenium-python.readthedocs.io/). However this process is excessively convoluted. Instead I decided to investigate if there are any patterns in the URL for each station to mimic in a script.

Fortunatley for me I noticed something unique about the URL: it was formatted the exact same and there was only a query string that controled the view!

### URL Query String
Each weather station's data page can be reached by using the same directory and changing one parameter in the query string. I figured this out when I went to station 9021's data page and noticed the URL:

> http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_nccObsCode=136&p_display_type=dailyDataFile&p_startYear=&p_c=&p_stn_num=9021

which contains the variable "p_stn_num" as one of its parameters; that parameter controls what station's data is being shown. By simply changing that parameter we can reach different pages in the portal!

If you'd like to verify my conclusion, I've written up a function that fetches the station's name purely by changing the end of the URL. If you tinker with the function enough you'll notice that not all station numbers correspond to an actual weather station.

In [6]:
def printStationName(station_num):
    """Prints the name fo the weather station with the given station number."""
    page = urllib2.urlopen(
        r'http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_nccObsCode=136&p_display_type=dailyDataFile&p_startYear=&p_c=&p_stn_num=' 
        + str(station_num).zfill(6)
    )
    soup = BeautifulSoup(page, 'html.parser')
    print "Station %s:" % station_num, soup.h2.string
    return None 

printStationName(9030)

Station 9030: Mundaring 


### "Clicking" On The Download Link
The next step is to make the computer "click" on the download link on each page to collect the data. Of course I won't actually get the computer to click on the link, but rather find that link and simply open it using urllib2.

Below I've written up a short function that does exactly that. It creates the URL to the specific webpage, searches for the download link (which is structured the exact same on every page), opens that link and then saves it as a zip file in the directory.

In [7]:
def downloadDataForStation(station_num, datadir='./', verbose=False):
    """Accesses BOM's data link for the given station and downloads the data as a zipped file."""
    
    # First access the weather station's data page.
    page = urllib2.urlopen(
        r'http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_nccObsCode=136&p_display_type=dailyDataFile&p_startYear=&p_c=&p_stn_num=' 
        + str(station_num).zfill(6)
    )
    soup = BeautifulSoup(page, 'html.parser')
    
    # Then find the download link for that page.
    download_link_extension = soup.find('a', 
        {'title': "Data file for daily rainfall data for all years"}
    )['href']
    
    # Open the download link (which is read as binary) and save it in the correct format (zip file).
    data = urllib2.urlopen(str(BOM_HOME + download_link_extension))
    with open(datadir+'station_%s.zip' % station_num, 'wb') as zipper:
        zipper.write(data.read())
    
    # Finally, print a success message if verbose and return
    if verbose: print "Download for station %s was successful!" % station_num
    return None

Feel free to test the function on any station number you wish. I'm going to be consistent and download the data for station 9021.

In [8]:
downloadDataForStation(9021, datadir='ExploratoryData/', verbose=True)

Download for station 9021 was successful!


### Importing In The Station List
The hardest part about collecting the weather station data would be iterating over hundreds to thousands of station numbers to manually look for which exist. Fortunately I solved that problem in Part 1 of the notebook: we can simply import in the station list and iterate over the dataframe to collect our data.

In Part 1 we found a human-readable text file that contained all of the stations in WA that collect information on rainfall. With forsight to this point, I then converted the text file into a machine-readable CSV to be imported into a Pandas dataframe. Let's replicate that last step now to refresh our memories.

In [9]:
station_df = pandas.read_csv('ExploratoryData/weather_stations.csv')
station_df.head()

Unnamed: 0,Site,Name,Lat,Lon,Start,End,Years,%,AWS
0,7118,ABBOTTS,-26.4,118.4,Sep 1898,Nov 1913,6.8,45,N
1,10258,ABERVON,-30.7833,117.9833,May 1968,Aug 1973,5.2,97,N
2,4000,ABYDOS,-21.4167,118.9333,Jul 1917,Dec 1974,40.0,70,N
3,4045,ABYDOS WOODSTOCK,-21.62,118.955,Apr 1901,Sep 1997,65.2,67,N
4,9971,ACTON PARK,-33.7845,115.4072,Nov 2000,Nov 2017,17.0,98,N


So all I need to do is extract the first column to get all of the information I need.

In [10]:
station_numbers = station_df['Site']
for _, n in station_numbers.head().iteritems():
    print n

7118
10258
4000
4045
9971


Which, as you can see, works like a charm.

Since this notebook is only a proof of concept, I wont be downloading the data for all of the weather stations here. This section was only to show how it will be done in the executable and how it is possible to do so.

---
# Part 3 - Refine The Data <a name='part3'></a>
Now we have all of the data we can get down into the messy parts.

The first step is the extract the data out of the zipped files into their own directories. Once that's done, we can focus on cleaning up and reducing the data to reduce the size it takes up and to make it easier to work with. Finally we want to demonstrate how we can centralise multiple station's data into a single dataframe.

Something that I don't do in this notebook, but would worthwhile exploring, would be data validation. We should be checking how much data is missing at certain weather stations to ensure if we can truely trust the source. Other important considerations would be how many times measurements are "carried forward" and measured over two days instead of one. That will have huge problems with the accuracy of the data. In saying that however, right there is a great idea for a project: interpolate the missing data points using the surrounding weather stations.

## Unzipping The Compressed Directory
Before we can do anything with the data, we need to extract it out of the zipped file. Fortunately for me, python already has a package that unzips files: [zipfile](https://docs.python.org/2/library/zipfile.html). So unzipping the data is a really trivial problem.

In [11]:
def unzip(zipfilename):
    """Unzips the directories."""
    basename = os.path.splitext(zipfilename)[0]
    with zipfile.ZipFile(zipfilename, "r") as zip_ref:
        zip_ref.extractall(basename)

I've unzipped one of the data directories below so I can continue to work on it throughout the notebook.

In [12]:
unzip("ExploratoryData/station_9021.zip")

## Cleaning & Reducing The Data
Now that we can access the CSV files for each station, we want to go about reducing the amount of columns. I've done this in two ways: dropping columns that aren't any use and combining the date columns together into the dataframe's index. You can see how I've done this by referring to the function below.

In [13]:
def readStationCSV(stationcsv):
    """Pass in the station's CSV to return a cleaned dataframe."""
    df = pandas.read_csv(stationcsv)
    
    # Rename the columns
    df.rename(columns={
        'Bureau of Meteorology station number': "Station Number",
        'Rainfall amount (millimetres)': 'Rainfall',
    }, inplace=True)
    
    # Combine date into a single column.
    df['Year'] = df['Year'].map(str)
    df['Month'] = df['Month'].map(lambda x: str(x).zfill(2))
    df['Day'] = df['Day'].map(lambda x: str(x).zfill(2))
    df.insert(
        2, 'Date', 
        df['Year'] + '-' + 
        df['Month'] + '-' + 
        df['Day']
    )
    df.drop(['Year', 'Month', 'Day'], axis=1, inplace=True)

    # Next, drop the first and second columns.
    df.drop(["Product code", "Station Number"], axis=1, inplace=True)
    
    # Drop any rows that are before the start of data collection (i.e. drop Jan if started in Feb).
    data_start = df['Rainfall'].first_valid_index()
    data_finish = df['Rainfall'].last_valid_index()
    df = df[data_start:data_finish]
    
    # Set the date as the index column.
    df.set_index('Date', inplace=True)
    
    # Finally return the result.
    return df

One big decision I made was to drop the station number as one of the columns in the dataframe. I did this for storage reasons: we don't need the station number repeated 100's of times for no reason. We can store this data some other way.

Like per-usual, I'm going to run this on station 9021 to show that the function works and to prepare the data for the next steps.

In [14]:
station9021_df = readStationCSV(glob.glob("ExploratoryData/station_9021/*.csv")[0])
station9021_df.tail(5)

Unnamed: 0_level_0,Rainfall,Period over which rainfall was measured (days),Quality
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2017-12-10,0.0,1.0,N
2017-12-11,0.0,1.0,N
2017-12-12,0.0,1.0,N
2017-12-13,0.0,1.0,N
2017-12-14,0.0,1.0,N


## Centralising The Data Into One Dataframe
This is probably one of the more exciting parts of the notebook. Now that we have the data looking the way we want it to, we can begin to bring multiple station's worth of data together.

Since we only have data on station 9021 I'm going to download and clean stations 9022 and 9023.

In [15]:
station_dataframes = [station9021_df]
for n in [9022, 9023]:
    downloadDataForStation(n, datadir='ExploratoryData/', verbose=True)
    unzip('ExploratoryData/station_%s.zip' % n)
    station_df = readStationCSV(glob.glob("ExploratoryData/station_%s/*.csv" % n)[0])
    station_dataframes.append(station_df)

Download for station 9022 was successful!
Download for station 9023 was successful!


Obviously we can visualise each of those dataframes to confirm that they behaved correctly.

In [16]:
for n in station_dataframes:
    print n.tail(5), '\n\n\n'

            Rainfall  Period over which rainfall was measured (days) Quality
Date                                                                        
2017-12-10       0.0                                             1.0       N
2017-12-11       0.0                                             1.0       N
2017-12-12       0.0                                             1.0       N
2017-12-13       0.0                                             1.0       N
2017-12-14       0.0                                             1.0       N 



            Rainfall  Period over which rainfall was measured (days) Quality
Date                                                                        
1954-06-25       4.3                                             1.0       Y
1954-06-26      10.4                                             1.0       Y
1954-06-27      22.6                                             1.0       Y
1954-06-28      18.0                                             1.0    

The next step is really exciting. We are going to generate a [MultiIndexed](https://pandas.pydata.org/pandas-docs/stable/advanced.html) dataframe! I do that below using pandas's function [concat](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html).

In [17]:
df_main = pandas.concat(
    station_dataframes, 
    keys=['9021', '9022', '9023'], 
    names=['Station Number', 'Date']
)

Which does a great job of organising the data. If you'd like to understand how MultiIndexing works, and how to select data using it, just go to the documentation page on [MultiIndex/Advanced Indexing](https://pandas.pydata.org/pandas-docs/stable/advanced.html). I will however give you a taste of the power of MultiIndexing:

In [18]:
df_main.xs('1950-01-01', level='Date')

Unnamed: 0_level_0,Rainfall,Period over which rainfall was measured (days),Quality
Station Number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
9021,0.0,,Y
9022,0.0,,Y
9023,0.0,,Y


In [19]:
df_main.loc[['9022', '9023']]

Unnamed: 0_level_0,Unnamed: 1_level_0,Rainfall,Period over which rainfall was measured (days),Quality
Station Number,Date,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
9022,1877-01-01,0.0,,Y
9022,1877-01-02,0.0,,Y
9022,1877-01-03,0.0,,Y
9022,1877-01-04,0.0,,Y
9022,1877-01-05,0.0,,Y
9022,1877-01-06,0.0,,Y
9022,1877-01-07,0.0,,Y
9022,1877-01-08,0.0,,Y
9022,1877-01-09,0.0,,Y
9022,1877-01-10,0.0,,Y


---
# Part 4 - Export The Data <a name='part4'></a>
This section is completely trivial with Pandas. Using the method [to_csv](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html) you can export the entire dataframe into a csv file with almost zero effort. 

In [20]:
df_main.to_csv('ExploratoryData/example_export.csv')

Before we finish, we want to check that the exported data is formatted correctly. That's a trivial operation in python:

In [21]:
with open('ExploratoryData/example_export.csv', 'r') as f:
    for _ in range(5): print f.readline()

Station Number,Date,Rainfall,Period over which rainfall was measured (days),Quality

9021,1944-05-01,0.0,,Y

9021,1944-05-02,0.0,,Y

9021,1944-05-03,0.0,,Y

9021,1944-05-04,4.3,1.0,Y



We should also check if it is easily importable back into Pandas.

In [22]:
importdf2 = pandas.read_csv("ExploratoryData/example_export.csv", index_col=[0, 1])
importdf2.tail(5)

Unnamed: 0_level_0,Unnamed: 1_level_0,Rainfall,Period over which rainfall was measured (days),Quality
Station Number,Date,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
9023,2017-12-10,0.0,1.0,N
9023,2017-12-11,0.0,1.0,N
9023,2017-12-12,0.0,1.0,N
9023,2017-12-13,0.0,1.0,N
9023,2017-12-14,0.0,1.0,N


Which of course it is! Thus we have now successfully exported the data into CSV format.

---
# Conclusion <a name='conclusion'></a>

As you can see, it is completely viable to extract weather data from BOM using a few simple commands in Python. The process of centralising the data enables others to worry about the nuts-and-bolts of their algorithms, rather then worry about where to find it and how to clean it. Not only can others use the data, but new faces in the data science world can see how easy it is to scrape the internet for data that they can use themselves. Hopefully they will even find out about a couple python packages along the way!

The biggest uses for this data would relate to climate change. We can observe the trends in weather in WA (or Australia) and try to predict in what direction our climate is heading. There are other project ideas however: we could measure the viabiliity of getting solar panels in regional areas of WA, the potential savings to be had by installing a water tank onto your house, cleaning up the data using other station's information and many more.

The next step is to create the executable file so that I can collect all of the data, rather then just a small sample of it. This is a relatively simple process as I have already figured out most of the functions and packages in this notebook. I'll be also gathering the data for the maximum and minimum temperature plus the solar exposure so there is even more information to work with.

In a future project I'll be exploring the data itself. Hopefully I will be able to visualise the rainfall over the past 100 years, build a model that can roughly predict the rainfall at any point in space (i.e. the rainfall at your house which is between three weather stations) and even predict future rainfall patterns.

Finally thank you for reading this notebook. Hopefully there was something you learnt from it or it was perhaps just interesting. If you have any comments, recommendations or improvements to anything in this repository feel free to reach out to me or submit a pull request. I'm always happy for others to improve my work.