# Scraping Weather Data With Python
Hello everyone! Welcome to another one of my projects. This time I'll be looking into how to scrape data off the internet automatically using Python, specifically BeautifulSoup and urllib2. Although not the most complicated project, as I don't have to manually sift the data but rather just centralise it all, it should still serve as useful for the community. 

This Jupyter notebook is purely to show proofs of concept for my processes, such as requesting, downloading and sorting the data. The full collection of the data will be done by an external Python file which has to be executed in the terminal.


## Overview Of The Project
The project's main aim is to collect weather data from all over Western Australia and centralise it into this repository, so that others can use it in their own projects without needing to repeat the process I'm about to undertake. In theory this process could be extended Australia-wide, but I don't want that much data. I'll illustrate how one could do that later in the project.

The data I've collected is simply the options specified by the Bureau of Meteorology on their [Climate Data database](http://www.bom.gov.au/climate/data/). As such, I'll be collecting information on: Rainfall; Temperature (Max & Min); and Solar Exposure.

My general thought process in storing the data will be through the use of a 3D array (known as a [Panel](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Panel.html) in Pandas) with:
 - Each 'slice' (which forms a 2D matrix) representing a station.
 - Each row representing a date.
 - Each column representing a piece of data.

I've attempted to represent a slice of the data for station $n$ below.

$$
\begin{bmatrix}
Date & Rainfall(mm) & ...\\
01/01/2015 & 12.3 & ...\\ 
02/01/2015 & 6.1 & ...\\ 
... & ... & ...
\end{bmatrix}
$$
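
A note for anyone following along on a recent pandas release: `Panel` was deprecated in pandas 0.20 and removed in 0.25, so the 3D layout above needs a different container there. One common substitute is a DataFrame with a two-level (station, date) index, where selecting a station gives back the same kind of 2D slice. Here is a minimal sketch of that idea; the rainfall values for station 9022 are invented and the column names are only my assumption of what the final layout might look like.

```python
import pandas

# Illustrative readings: two stations, two days each (9022 values are invented).
records = [
    (9021, '2015-01-01', 12.3),
    (9021, '2015-01-02', 6.1),
    (9022, '2015-01-01', 0.0),
    (9022, '2015-01-02', 3.4),
]
df = pandas.DataFrame(records, columns=['Station', 'Date', 'Rainfall'])
df['Date'] = pandas.to_datetime(df['Date'])

# Index by (station, date); each station is then one 2D "slice".
weather = df.set_index(['Station', 'Date']).sort_index()

# Select the slice for one station: rows are dates, columns are measurements.
station_slice = weather.loc[9021]
```

Extra measurements (temperature, solar exposure) would simply become more columns in the same frame.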



## Usage of The Data
For anyone reading this notebook who might be interested in using the data I've collected, I give full permission without any attribution needed. Go forth and solve the world's problems using data!

All data has been scraped from [The Bureau of Meteorology](http://www.bom.gov.au). I claim rights to none of the data. As such I recommend reading into their data policies before using it for commercial use (but I'm sure personal use will be fine).

---
With all of that out of the way, let's begin!


In [1]:
# Import all of the goodies that I'll be using.

# HTML Scraping Packages
import urllib2
from bs4 import BeautifulSoup

# File Manipulation Packages
import zipfile
import glob

# Data Manipulation Packages
import pandas

In [2]:
# Next, define some global variables that I'll be using.
BOM_HOME = r'http://www.bom.gov.au'

---
# Part 1 - Finding The Data
The first step in this project will be finding the data to collect. As already mentioned, you can find most of the data online at the [Climate Data database](http://www.bom.gov.au/climate/data/). However, collecting data this way is really tedious as you would have to manually fill in forms. Although this is possible with Python packages such as [Selenium](http://selenium-python.readthedocs.io/), there is definitely an easier way.

---

## 1.1. Tinkering With The Portal

The first thing I did was to gain some familiarity with how the portal worked. So I searched for a weather station, found its station number and went to that data page. If you'd like to follow along, I used station 9021.

Once I went to the new page, I noticed that the URL was structured like so:

> http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_nccObsCode=136&p_display_type=dailyDataFile&p_startYear=&p_c=&p_stn_num=009021

So already you can see the URL contains a query string. For the most part I can't really tell what each parameter does, but I did notice that my station number was in the string:

> p_stn_num=009021

So naturally I played around with this parameter. Of course this enabled me to move to a new weather station! Let me illustrate that below.

In [3]:
# The name of the weather station for a given station number
def getStationName(station_num):
    page = urllib2.urlopen(
        r'http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_nccObsCode=136&p_display_type=dailyDataFile&p_startYear=&p_c=&p_stn_num=' 
        + str(station_num).zfill(6)
    )
    soup = BeautifulSoup(page, 'html.parser')
    return soup.h2.string

In [4]:
print "Station 9021:", getStationName(9021)
print "Station 9022:", getStationName(9022)

Station 9021: Perth Airport 
Station 9022: Guildford Post Office 
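
The query string can also be taken apart and rebuilt with the standard library, which makes the role of `p_stn_num` a bit easier to see than plain string concatenation. Below is a small sketch in Python 3 syntax (under Python 2 the same functions live in `urlparse` and `urllib`). `station_url` is a hypothetical helper of mine, not something the notebook's other cells rely on.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

STATION_URL = ('http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av'
               '?p_nccObsCode=136&p_display_type=dailyDataFile'
               '&p_startYear=&p_c=&p_stn_num=009021')

def station_url(station_num, template=STATION_URL):
    """Return the template URL with p_stn_num swapped for station_num."""
    parts = urlsplit(template)
    # keep_blank_values preserves the empty p_startYear and p_c parameters.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # Station numbers are zero-padded to six digits, as in the original URL.
    padded = str(station_num).zfill(6)
    pairs = [(key, padded if key == 'p_stn_num' else value) for key, value in pairs]
    return urlunsplit(parts._replace(query=urlencode(pairs)))

url_9022 = station_url(9022)
```

Keeping the parameters as a list of pairs preserves their original order, so rebuilding the URL for station 9021 gives back exactly the address I started from.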


If you explore this further however you'll notice that not all stations exist. Uncomment the code below if you want to see this in action.

In [5]:
# print getStationName(9)

So we need to find the numbers of all of the stations in WA. Fortunately that's really easy!

---

## 1.2. Finding The Weather Stations That Exist
Although you could simply test all 6-digit numbers and capture any that exist, it is far easier to just do a simple Google search. The BOM actually allows you to query for a list of the stations [here](http://www.bom.gov.au/climate/data/stations/). At this point I also noticed a lot more data that I could collect if I desired.

I also noticed a big problem. You can simply request the data from BOM instead of collecting it manually! For the sake of practice I will however just collect the data manually. 

[This text file](http://www.bom.gov.au/climate/data/lists_by_element/alphaWA_136.txt) lists all of the weather stations that collect rainfall data in WA. So that means that I can just use this list to collect my data!

The text file is structured for humans to read, not for a computer. At this point I used Excel to convert the text file into a CSV. After a little bit of cleaning up of the headings and footers, I imported the CSV into a Pandas dataframe so we can access the information easily.
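
The Excel step could also be scripted. The sketch below assumes each station line is whitespace-separated, in the column order Site, Name, Lat, Lon, Start, End, Years, %, AWS. I have not reproduced the real file's exact layout, headings or footers here, so the sample text and the regular expression are assumptions you would need to adapt. Lines that don't match (headings, footers, blanks) are skipped. Python 3 syntax.

```python
import csv
import io
import re

# Sample in an ASSUMED whitespace-separated layout (the heading line is made up).
SAMPLE = """\
Some heading line that should be ignored
   7118 ABBOTTS              -26.4   118.4   Sep 1898 Nov 1913   6.8  45 N
   4045 ABYDOS WOODSTOCK     -21.62  118.955 Apr 1901 Sep 1997  65.2  67 N
"""

# Site, name (may contain spaces), lat, lon, start, end, years, percent, AWS flag.
LINE = re.compile(
    r'^\s*(\d+)\s+(.+?)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s+'
    r'([A-Z][a-z]{2} \d{4})\s+([A-Z][a-z]{2} \d{4})\s+'
    r'(\d+\.\d+)\s+(\d+)\s+([YN])\s*$'
)

def stations_to_csv(text):
    """Convert matching station lines to CSV text; skip everything else."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator='\n')
    writer.writerow(['Site', 'Name', 'Lat', 'Lon', 'Start', 'End', 'Years', '%', 'AWS'])
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            writer.writerow(match.groups())
    return out.getvalue()

csv_text = stations_to_csv(SAMPLE)
```

The resulting text can be written to a file and read with `pandas.read_csv` exactly as in the next cell.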

In [6]:
df = pandas.read_csv('rainfall_stations_wa.csv')
df.head()


Unnamed: 0,Site,Name,Lat,Lon,Start,End,Years,%,AWS
0,7118,ABBOTTS,-26.4,118.4,Sep 1898,Nov 1913,6.8,45,N
1,10258,ABERVON,-30.7833,117.9833,May 1968,Aug 1973,5.2,97,N
2,4000,ABYDOS,-21.4167,118.9333,Jul 1917,Dec 1974,40.0,70,N
3,4045,ABYDOS WOODSTOCK,-21.62,118.955,Apr 1901,Sep 1997,65.2,67,N
4,9971,ACTON PARK,-33.7845,115.4072,Nov 2000,Nov 2017,17.0,98,N


Most of the titles used in the data are self-explanatory. However, let me explain the last two columns:
 1. **Percentage (%)**: After tinkering around with the data, I think it is a measure of the completeness of the data collected. The higher the percentage, the lower the number of missing data points. Obviously 100% indicates that there aren't any missing data points at all during its time of operation.
 2. **AWS**: This column signifies that the weather station is an *Automatic Weather Station*. The BOM state that their readings are more consistent and accurate, which would be an interesting observation to back up with data.
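
As one way to put these two columns to work, the station list could be narrowed down to long, nearly complete records before any downloading starts. Here is a sketch using the five rows shown above. The thresholds (at least 10 years, at least 90% complete) are arbitrary choices of mine for illustration.

```python
import pandas

# The five rows from df.head() above (only the columns needed here).
stations = pandas.DataFrame({
    'Site': [7118, 10258, 4000, 4045, 9971],
    'Name': ['ABBOTTS', 'ABERVON', 'ABYDOS', 'ABYDOS WOODSTOCK', 'ACTON PARK'],
    'Years': [6.8, 5.2, 40.0, 65.2, 17.0],
    '%': [45, 97, 70, 67, 98],
    'AWS': ['N', 'N', 'N', 'N', 'N'],
})

# Keep stations with a long, nearly complete record.
reliable = stations[(stations['Years'] >= 10) & (stations['%'] >= 90)]
reliable_sites = reliable['Site'].tolist()
```

Of these five stations only Acton Park (9971) passes both tests. Filtering on `stations['AWS'] == 'Y'` would work the same way.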

---

## 1.3. Downloading Historical Data
Now that we know which stations exist, we want to explore if it is possible to automate the downloading process.

On each weather station page, there is a button to download all historical data on the page. Thus we want to target this URL using Beautiful Soup. By inspecting the source code of the page, I found that the link is contained within the "downloads" class as the second item in the list. Let's try to target that.

In [7]:
def downloadDataLink(station_num):
    page = urllib2.urlopen(
        r'http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_nccObsCode=136&p_display_type=dailyDataFile&p_startYear=&p_c=&p_stn_num=' 
        + str(station_num).zfill(6)
    )
    soup = BeautifulSoup(page, 'html.parser')
    download_link_extension = soup.find(
        'a', 
        {'title': "Data file for daily rainfall data for all years"}
    )['href']
    return str(BOM_HOME + download_link_extension)

I know that's a little complicated looking. Let me break it down for you:
 1. Read the station data page. This is simply where you end up if you manually enter the station number into the portal. The zfill ensures that the number is zero-padded to six digits.
 2. Parse the HTML with Beautiful Soup.
 3. Extract the download link. You can find it by targeting the correct title.
 4. The link is relative to the base BOM URL, so return the full link by concatenating the two strings together.
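
Steps 3 and 4 can be tried offline on a small HTML fragment. The sketch below swaps Beautiful Soup for the standard library's `HTMLParser` so that it has no dependencies, and uses `urljoin` instead of string concatenation. The HTML snippet is a made-up stand-in for the station page (only the `title` attribute comes from the code above), and `DownloadLinkFinder` is a hypothetical helper. Python 3 imports are used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BOM_HOME = 'http://www.bom.gov.au'
LINK_TITLE = 'Data file for daily rainfall data for all years'

# Invented fragment; the real page contains far more markup.
PAGE = (
    '<ul class="downloads">'
    '<li><a href="/other" title="Something else">Other</a></li>'
    '<li><a href="/jsp/ncc/cdio/weatherData/av?p_display_type=dailyZippedDataFile'
    '&amp;p_stn_num=009021" title="' + LINK_TITLE + '">All years</a></li>'
    '</ul>'
)

class DownloadLinkFinder(HTMLParser):
    """Remember the href of the first <a> whose title matches LINK_TITLE."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == 'a' and attributes.get('title') == LINK_TITLE and self.href is None:
            self.href = attributes.get('href')

finder = DownloadLinkFinder()
finder.feed(PAGE)

# The href is site-relative, so resolve it against the BOM home page.
full_link = urljoin(BOM_HOME, finder.href)
```

Checking `finder.href` for `None` before joining would be a sensible guard for pages that lack the link.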
 
You can see how this output looks for different stations below.

In [8]:
print downloadDataLink(9021)
print downloadDataLink(9022)

http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_display_type=dailyZippedDataFile&p_stn_num=009021&p_c=-16486220&p_nccObsCode=136&p_startYear=2017
http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_display_type=dailyZippedDataFile&p_stn_num=009022&p_c=-16489829&p_nccObsCode=136&p_startYear=1954


Downloading the data from here is actually really simple. If you open the URL using urllib2, the response is the raw contents of the zip archive, which you can save by writing those bytes to a file opened in binary mode.

I've done this below to show how this could be done.

In [9]:
data = urllib2.urlopen(downloadDataLink(9021))
with open('station_9021.zip', 'wb') as zipper:
    zipper.write(data.read())

From this download we get a zipped file with two text files inside: the first is the actual data while the second is information about the weather station. In theory we could throw out the second file, but for now I'm keeping it.
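
Before extracting anything, it can be worth checking that the downloaded bytes really are a zip archive and peeking at what is inside. Here is a minimal sketch with the standard library. The archive is built in memory so the example runs without a download, and the two member names are placeholders rather than the real BOM file names.

```python
import io
import zipfile

# Build a stand-in archive in memory (placeholder names and contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('station_data.csv', 'Year,Month,Day,Rainfall\n1944,5,1,0.0\n')
    archive.writestr('station_note.txt', 'Notes about the station.\n')
downloaded_bytes = buffer.getvalue()

# Check the payload and list its members without extracting to disk.
payload = io.BytesIO(downloaded_bytes)
is_zip = zipfile.is_zipfile(payload)
with zipfile.ZipFile(payload) as archive:
    member_names = archive.namelist()
    csv_names = [name for name in member_names if name.endswith('.csv')]
```

With a real download, `downloaded_bytes` would be the result of `data.read()` from the cell above.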

---
## 1.4. Extracting and Manipulating Data
Now that we have the data, we want to be able to import it! That way we can begin to format it together.

I'll start with extracting the zipped file, and viewing the first few lines of the CSV.

In [10]:
with zipfile.ZipFile("station_9021.zip", "r") as zip_ref:
    zip_ref.extractall("station_9021")
    
df_station9021 = pandas.read_csv(glob.glob("station_9021/*.csv")[0])
print "(Rows, Columns) =", df_station9021.shape
df_station9021.head()

(Rows, Columns) = (27009, 8)


Unnamed: 0,Product code,Bureau of Meteorology station number,Year,Month,Day,Rainfall amount (millimetres),Period over which rainfall was measured (days),Quality
0,IDCJAC0009,9021,1944,1,1,,,
1,IDCJAC0009,9021,1944,1,2,,,
2,IDCJAC0009,9021,1944,1,3,,,
3,IDCJAC0009,9021,1944,1,4,,,
4,IDCJAC0009,9021,1944,1,5,,,


Upon first inspection, I can already think of ways to simplify the data. Firstly I will be dropping the first column as it is effectively useless to me. Next I will be renaming the second and sixth columns to shorter names. I'll aim to combine columns 3-5 into a single date column too.

In [11]:
# Import in the dataframe.
df_station9021 = pandas.read_csv(glob.glob("station_9021/*.csv")[0])

# Rename the columns.
df_station9021.rename(columns={
    'Bureau of Meteorology station number': "Station Number",
    'Rainfall amount (millimetres)': 'Rainfall',
}, inplace=True)

# Combine date into a single column.
# (I know this code sucks, I couldn't make it work any other way).
df_station9021['Year'] = df_station9021['Year'].map(str)
df_station9021['Month'] = df_station9021['Month'].map(lambda x: str(x).zfill(2))
df_station9021['Day'] = df_station9021['Day'].map(lambda x: str(x).zfill(2))

df_station9021.insert(
    2, 'Date', 
    
    df_station9021['Year'] + '-' + 
    df_station9021['Month'] + '-' + 
    df_station9021['Day']
)

df_station9021.drop(['Year', 'Month', 'Day'], axis=1, inplace=True)

# Next, drop the first column.
df_station9021.drop(["Product code"], axis=1, inplace=True)

# Visualise the newly formatted data.
print "(Rows, Columns) =", df_station9021.shape
df_station9021.head()

(Rows, Columns) = (27009, 5)


Unnamed: 0,Station Number,Date,Rainfall,Period over which rainfall was measured (days),Quality
0,9021,1944-01-01,,,
1,9021,1944-01-02,,,
2,9021,1944-01-03,,,
3,9021,1944-01-04,,,
4,9021,1944-01-05,,,


It all works! Now the data is in an easy to access format and is only 5 columns instead of the original 8. Of course the above process can be repeated for each station in the data, but this is only a proof of concept. The full data processing will be done via the executable python script.
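
For the date step, pandas can also assemble a date directly from year, month and day columns via `pandas.to_datetime`, which gives a real datetime column rather than a string. Here is a possible alternative to the string concatenation above, shown on three rows whose dates and rainfall values appear in the outputs of this notebook.

```python
import pandas

df = pandas.DataFrame({
    'Year': [1944, 1944, 1944],
    'Month': [1, 1, 5],
    'Day': [1, 2, 1],
    'Rainfall': [None, None, 0.0],
})

# to_datetime assembles dates from columns named year/month/day,
# so lower-case the column names first.
date_parts = df[['Year', 'Month', 'Day']].rename(columns=str.lower)
df.insert(0, 'Date', pandas.to_datetime(date_parts))
df = df.drop(['Year', 'Month', 'Day'], axis=1)

# Same YYYY-MM-DD text as before, if strings are still wanted.
date_strings = df['Date'].dt.strftime('%Y-%m-%d').tolist()
```

A datetime column also makes later steps such as resampling by month or selecting date ranges more direct.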

---
## Aside: How Clean Is The Data?
I was thinking to myself that the data I'm showing doesn't look particularly complete. So I'll compute how many missing entries are in this particular station and see how that compares to the number given in the Station database.

In [12]:
missing_values = df_station9021.isnull().sum()
print missing_values
print (missing_values / df_station9021.shape[0]) * 100

Station Number                                        0
Date                                                  0
Rainfall                                            121
Period over which rainfall was measured (days)    17935
Quality                                             121
dtype: int64
Station Number                                     0.000000
Date                                               0.000000
Rainfall                                           0.447999
Period over which rainfall was measured (days)    66.403791
Quality                                            0.447999
dtype: float64


So, actually, the data isn't missing much information at all. Why are there so many missing entries at the start then?

I managed to figure it out. Although the weather station didn't open until May, the CSV still has "empty" rows for the first 4 months of the year. So to get a better picture of how complete the data is, we should throw away all rows before the first valid entry and all rows after the last valid entry.

In [13]:
data_start = df_station9021['Rainfall'].first_valid_index()
data_finish = df_station9021['Rainfall'].last_valid_index()
print data_start, data_finish

df_station9021 = df_station9021[data_start:data_finish]
print df_station9021.head(3), '\n\n\n\n', df_station9021.tail(3)

121 27008
     Station Number        Date  Rainfall  \
121            9021  1944-05-01       0.0   
122            9021  1944-05-02       0.0   
123            9021  1944-05-03       0.0   

     Period over which rainfall was measured (days) Quality  
121                                             NaN       Y  
122                                             NaN       Y  
123                                             NaN       Y   



       Station Number        Date  Rainfall  \
27005            9021  2017-12-08       0.0   
27006            9021  2017-12-09       0.0   
27007            9021  2017-12-10       0.0   

       Period over which rainfall was measured (days) Quality  
27005                                             1.0       N  
27006                                             1.0       N  
27007                                             1.0       N  
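
One detail worth double-checking in the cell above: with the default integer index, `df[start:stop]` slices by position and leaves out the row at `stop`, whereas `df.loc[start:stop]` slices by label and includes it. That would explain why the tail printed above stops at row 27007 even though `last_valid_index()` returned 27008. Here is a small sketch of the difference on a toy frame (values invented).

```python
import pandas

toy = pandas.DataFrame({'Rainfall': [None, None, 1.0, None, 2.0, None]})

start = toy['Rainfall'].first_valid_index()   # label of first non-missing value
finish = toy['Rainfall'].last_valid_index()   # label of last non-missing value

positional = toy[start:finish]      # end-exclusive: drops the last valid row
by_label = toy.loc[start:finish]    # end-inclusive: keeps it
```

If the final valid reading matters, `.loc` is probably the safer choice for this trimming step.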


Now, if I were to repeat the same analysis I should get near perfect data!

In [14]:
missing_values = df_station9021.isnull().sum()
print missing_values
print (missing_values / df_station9021.shape[0]) * 100

Station Number                                        0
Date                                                  0
Rainfall                                              0
Period over which rainfall was measured (days)    17814
Quality                                               0
dtype: int64
Station Number                                     0.000000
Date                                               0.000000
Rainfall                                           0.000000
Period over which rainfall was measured (days)    66.255068
Quality                                            0.000000
dtype: float64


Yes! There is no missing rainfall data for Perth Airport, as I inferred (since it is an important weather station).

---