# Scraping Weather Data With Python
Hello everyone! Welcome to another one of my projects. This time I'll be looking at how to scrape data off the internet automatically using Python, specifically BeautifulSoup and urllib2. It isn't the most complicated project, since I only need to centralise the data rather than sift through it manually, but it should still be useful to the community.

This Jupyter notebook is purely a proof of concept for my processes, such as requesting, downloading and sorting the data. The full data collection will be done by an external Python file, which has to be run from the terminal.


## Overview Of The Project
The project's main aim is to collect weather data from all over Western Australia and centralise it in this repository. That way others can use it in their own projects without needing to repeat the process I'm about to undertake. In theory this process could be extended Australia-wide, but I don't want that much data; I'll illustrate how one could do that later in the project.

The data I'm collecting is simply what the Bureau of Meteorology offers through its [Climate Data database](http://www.bom.gov.au/climate/data/). As such, I'll be collecting information on rainfall, temperature (max and min) and solar exposure.

My general approach to storing the data is a 3D array (a [Panel](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Panel.html) in Pandas, which was deprecated in pandas 0.20 and removed in 0.25) with:
 - Each 'slice' (which forms a 2D matrix) representing a station.
 - Each row representing a date.
 - Each column representing a piece of data.

I've attempted to represent a slice of the data for station $n$ below.

$$
\begin{bmatrix}
Date & Rainfall(mm) & ...\\
01/01/2015 & 12.3 & ...\\ 
02/01/2015 & 6.1 & ...\\ 
... & ... & ...
\end{bmatrix}
$$
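
Independently of pandas, the station / date / column layout described above can be sketched with nested dictionaries. This is only an illustration written for Python 3: the key `'station_n'`, the field name `rainfall_mm` and the `lookup` helper are hypothetical, and the two values are copied from the illustrative matrix above rather than from real observations.

```python
# Illustrative layout only: station -> date -> column -> value.
# 'station_n' is a placeholder key and the numbers come from the example matrix.
weather = {
    'station_n': {
        '01/01/2015': {'rainfall_mm': 12.3},
        '02/01/2015': {'rainfall_mm': 6.1},
    },
}


def lookup(data, station, date, column):
    """Return a single value from the nested station/date/column layout."""
    return data[station][date][column]


# One 'slice' is everything recorded for a single station.
station_slice = weather['station_n']
print(sorted(station_slice))
print(lookup(weather, 'station_n', '01/01/2015', 'rainfall_mm'))
```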



## Usage of The Data
If you're reading this notebook and would like to use the data I've collected, you have my full permission, with no attribution needed. Go forth and solve the world's problems using data!

All data has been scraped from [The Bureau of Meteorology](http://www.bom.gov.au). I claim no rights to any of the data, so I recommend reading their data policies before using it commercially (though I expect personal use will be fine).

---
With all of that out of the way, let's begin!


In [1]:
# Import all of the goodies that I'll be using.
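try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3: urlopen lives in urllib.request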
from bs4 import BeautifulSoup

In [2]:
# Next, define some global variables that I'll be using.
BOM_HOME = r'http://www.bom.gov.au/climate/data/'

# Part 1 - Finding The Data
The first step in this project is finding the data to collect. As already mentioned, most of the data is available online at the [Climate Data database](http://www.bom.gov.au/climate/data/). However, collecting data this way is really tedious, as you would have to fill in forms manually. Although this can be automated with Python packages such as [Selenium](http://selenium-python.readthedocs.io/), there is definitely an easier way.

The first thing I did was gain some familiarity with how the portal works. I searched for a weather station, found its station number and went to its data page. If you'd like to follow along, I used station 9021.

Once I went to the new page, I noticed that the URL was structured like so:

> http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_nccObsCode=136&p_display_type=dailyDataFile&p_startYear=&p_c=&p_stn_num=009021

So already you can see the URL contains a query string. For the most part I can't really tell what each parameter does, but I did notice that my station number was in the string:

> p_stn_num=009021
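
To see every parameter in that query string at a glance, the URL can be split with the standard library. This is a minimal sketch written for Python 3 (where these helpers live in `urllib.parse`); the names `STATION_URL` and `params` are my own, and it only inspects the URL text without contacting the site.

```python
from urllib.parse import urlsplit, parse_qs

# The station page URL quoted above, split across lines for readability.
STATION_URL = (
    'http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av'
    '?p_nccObsCode=136&p_display_type=dailyDataFile'
    '&p_startYear=&p_c=&p_stn_num=009021'
)

# keep_blank_values=True keeps the empty parameters (p_startYear and p_c),
# which parse_qs would otherwise drop.
params = parse_qs(urlsplit(STATION_URL).query, keep_blank_values=True)

for name, values in params.items():
    print(name, '=', repr(values[0]))
```

Note that the station number appears zero-padded to six digits (`009021`), which matters when substituting other stations.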

So naturally I played around with this parameter and, sure enough, changing it took me to a different weather station! Let me illustrate that below.

In [9]:
# Return the name of the weather station for a given station number.
def getStationName(station_num):
    # The station number is zero-padded to six digits, e.g. 9021 -> 009021.
    page = urllib2.urlopen(
        r'http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_nccObsCode=136&p_display_type=dailyDataFile&p_startYear=&p_c=&p_stn_num='
        + str(station_num).zfill(6)
    )
    soup = BeautifulSoup(page, 'html.parser')
    # The station name is taken from the first <h2> element on the page.
    return soup.h2.string

# %-formatting inside print(...) behaves the same on Python 2 and Python 3.
print("Station 9021: %s" % getStationName(9021))
print("Station 9022: %s" % getStationName(9022))

Station 9021: Perth Airport 
Station 9022: Guildford Post Office 
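
The URL construction used above can also be separated from the download step, which makes it easy to check without hitting the site. The following is a sketch for Python 3 using `urllib.parse.urlencode`; `BASE_URL` and `build_station_url` are hypothetical names, and the parameter values are simply the ones observed in the URL for station 9021, since I don't know what most of them mean.

```python
from urllib.parse import urlencode

BASE_URL = 'http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av'


def build_station_url(station_num):
    """Build the data page URL for a station number (hypothetical helper)."""
    # A list of pairs keeps the parameters in the same order as the observed URL.
    params = [
        ('p_nccObsCode', '136'),
        ('p_display_type', 'dailyDataFile'),
        ('p_startYear', ''),
        ('p_c', ''),
        # Station numbers are zero-padded to six digits, e.g. 9021 -> 009021.
        ('p_stn_num', str(station_num).zfill(6)),
    ]
    return BASE_URL + '?' + urlencode(params)


print(build_station_url(9021))
print(build_station_url(9022))
```

Passing the result to `urlopen` would then fetch the page exactly as `getStationName` does.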
