# Scraping Weather Data With Python

---

## *Contents*

 1. [Introduction](#introduction)
 2. Part 1 - Find The Data
     1. Look at BOM's data (rainfall, temp, solar exposure).
     2. Explore the structure of the data (manually).
     3. Find ways of fetching the stations in WA.
     4. Propose a process going forward (focus on rainfall only, but propose how it would work for other data).
 3. Part 2 - Extract The Data
     1. Extract the station numbers in the text file (format CSV in excel, but discuss other ways).
     2. Automate fetching the raw data from BOM using urllib2 and Beautiful Soup.
 4. Part 3 - Refine The Data
     1. Extract csv from zip file (delete other files and directory structure).
     2. Compact data structure down.
     3. Create data validation method (number of missing data points outside start/end points, "carried forward" measurements, etc.).
     4. Propose how you could interpolate the data for missing data sets.
     5. Display how to combine multiple stations' data.
 5. Part 4 - Export The Data
     1. Literally just to_csv call plus validation that it worked.
 6. Conclusion
     1. Resummarise the purpose of the experiment
     2. Discuss the potential uses of the data
     3. Discuss the validity of the data (where it came from, how you can check it, etc.)
     4. Mention downfalls of the data and any nuances about it.
     5. Extrapolate the process into the pseudo-code version of the executable.
     6. Final thoughts.
     
In the next notebook I'll be visualising the data.

---
# Introduction <a name='introduction'></a>
Data is the oil of our information age. With it we can save lives, be more efficient and make more money. Data isn't just plucked from the internet in well organised, coherent files. The raw data needs to be *extracted*! That's what is happening here. **This project is about extracting information about the weather in Western Australia** using the Bureau of Meteorology's raw data. 

The end result of this data extraction will be a single CSV file with weather data about the rainfall, temperature and solar exposure of each weather station in WA. This will be done using a single Python executable (.py): this [Jupyter](http://jupyter.org/) notebook is only a proof of concept. Instead of collecting all of the data here in the notebook, it will serve as an insight into my thought process. As such, I'll only be collecting rainfall data for a couple of weather stations. 

All of the processes being used can be extended to gather data near-automatically in any (or all) states and territories in Australia. If you wish to collect data in your own country or region then this document can serve as a set of guidelines for solving the similar problems you'll face.

If you haven't figured it out so far, I'll be using [Python](https://www.python.org/) to extract my data. Python is a dynamically typed, object-oriented programming language whose reference implementation (CPython) is written in C. A hugely beneficial feature of Python is its use of packages. In this project I'll mainly be using [urllib2](https://docs.python.org/2/library/urllib2.html), [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) and [Pandas](https://pandas.pydata.org/) plus help from some smaller packages like [zipfile](https://docs.python.org/2/library/zipfile.html) and [glob](https://docs.python.org/2/library/glob.html).  

## The Dirty Details
The data originates from [The Bureau of Meteorology](http://www.bom.gov.au) in their [Climate Data database](http://www.bom.gov.au/climate/data/). As such I claim no rights to the data itself. Anything that is mine (like this notebook and my executable file) I am allowing free usage of in accordance with the licence in the repository. 

The CSV file being exported will be structured like so:

```
    Station Number, Date, Rainfall, Max Temp, Min Temp, Solar Exposure 
```

Although this won't be the structure exported in this notebook (which is only done as a proof of concept), that will be the structure of the CSV file exported by the executable script.

In this notebook I'll be using dataframes a lot. If you don't know much about Pandas, especially about their data structures, I recommend you read [this](https://pandas.pydata.org/pandas-docs/stable/dsintro.html) article in their documentation quickly.

## Preface - Importing Packages, Defining Global Variables
First things first: import the packages and define any global variables that will be used throughout the notebook.

In [1]:
# Import all of the goodies that I'll be using.

# HTML Scraping Packages
import urllib2
from bs4 import BeautifulSoup

# File Manipulation Packages
import zipfile
import glob

# Data Manipulation Packages
import pandas

In [2]:
# Next, define some global variables that I'll be using.
BOM_HOME = r'http://www.bom.gov.au'

---
# Part 1 - Find The Data
Before we automate anything we should have a look at the data we are going to collect. Like I mentioned in the introduction the data is all found in the [Climate Data Database](http://www.bom.gov.au/climate/data/). So the first thing I did was go there, have a look at how to get the data manually and what kind of information we can collect.

For each weather station you can request the:

 - Rainfall;
 - Max temperature;
 - Min temperature; or
 - Solar exposure.

It makes sense to aim to collect all of these data points in the executable file. For this proof of concept, however, I'll just be focusing on the rainfall data. That way I can get the process working for one of the datasets.

## Exploring A Dataset
Let's actually look at the data for one of the weather stations. The station I've chosen to experiment on is [Perth Airport's](http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_nccObsCode=136&p_display_type=dailyDataFile&p_startYear=&p_c=&p_stn_num=9021) weather station (station number 9021). On the webpage you can see towards the top right a link with the text "All years of data" which downloads a zipped directory with the following structure:

```
IDCJAC0009_9021_1800.zip
|
|---- IDCJAC0009_9021_1800_Data.csv
`---- IDCJAC0009_9021_1800_Note.txt
```

I won't automate the extraction of the data just yet as I'm only exploring the database. So instead I've extracted the CSV file manually, renamed it to 'station9021.csv' and imported it into a Pandas dataframe to visualise the data. You can see the last 5 rows of data below.

In [40]:
pandas.read_csv('ExploratoryData/station9021.csv').tail(5)

Unnamed: 0,Product code,Bureau of Meteorology station number,Year,Month,Day,Rainfall amount (millimetres),Period over which rainfall was measured (days),Quality
27005,IDCJAC0009,9021,2017,12,8,0.0,1.0,N
27006,IDCJAC0009,9021,2017,12,9,0.0,1.0,N
27007,IDCJAC0009,9021,2017,12,10,0.0,1.0,N
27008,IDCJAC0009,9021,2017,12,11,0.0,1.0,N
27009,IDCJAC0009,9021,2017,12,12,0.0,1.0,N


Most of the data columns are self-explanatory except for the last two. 

The second-last column is the period over which the rainfall was collected. For the most part it will be 1 day, as rainfall measurements are taken at 9am each day for the previous day. Occasionally, however, the rainfall was not collected on a given day. Instead of throwing out the data, they "carry it forward" to the next day's measurement, so the rainfall amount then covers two (or more) days' worth of rainfall instead of one.

The last column is the quality of the data. This can be a little misleading as it doesn't actually tell you if the data collection was incorrect. Instead it flags whether or not the collection process has been audited and believed to have no significant errors associated with it. So if the row is flagged with "Y" then we can be confident that the data has been collected correctly and accurately. If it is flagged with "N" then it is likely that the data is correct, but that hasn't been confirmed.
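
To make those two columns concrete, here is a minimal sketch of how they could be summarised with Pandas. The helper name `summarise_flags` and the sample values are my own assumptions for illustration (they are not real BOM readings), and the block is written for Python 3.

```python
import pandas

# Column names as they appear in the BOM rainfall CSV shown above.
PERIOD_COL = 'Period over which rainfall was measured (days)'
QUALITY_COL = 'Quality'


def summarise_flags(df):
    """Count multi-day (carried forward) readings and unaudited rows."""
    carried_forward = int((df[PERIOD_COL] > 1).sum())
    unaudited = int((df[QUALITY_COL] == 'N').sum())
    return {'carried_forward': carried_forward, 'unaudited': unaudited}


# Tiny made-up sample, for illustration only (not real BOM readings).
sample = pandas.DataFrame({
    'Rainfall amount (millimetres)': [0.0, 5.2, 12.4, 0.0],
    PERIOD_COL: [1.0, 1.0, 3.0, 1.0],
    QUALITY_COL: ['Y', 'Y', 'N', 'N'],
})
flags = summarise_flags(sample)
print(flags)
```

Rows with a missing period compare as false against `> 1`, so they are not counted as carried forward.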

## Finding The List of Stations
Not all station numbers are being used up to 9021, nor are all 4-digit numbers in use. For example, station number 9 doesn't correspond to any station in Australia. Obviously that is a huge issue as we can't just assume a range of station numbers to collect our data over.

Fortunately for me, BOM actually have a list of weather stations you can access! By going to the [Weather Station Directory](http://www.bom.gov.au/climate/data/stations/) you can request the list of weather stations in Australia or any of its states or territories. I downloaded WA's list; the first 10 lines are shown below.

In [41]:
with open("ExploratoryData/weather_stations.txt") as f:
    head = [next(f) for _ in xrange(10)]
print ''.join(head)

Bureau of Meteorology product IDCJMC0014.                                       Produced: 23 Nov 2017
West Australian stations measuring rainfall
Site    Name                                     Lat       Lon      Start    End       Years   %  AWS
-----------------------------------------------------------------------------------------------------
   7118 ABBOTTS                                  -26.4000  118.4000 Sep 1898 Nov 1913    6.8   45   N
  10258 ABERVON                                  -30.7833  117.9833 May 1968 Aug 1973    5.2   97   N
   4000 ABYDOS                                   -21.4167  118.9333 Jul 1917 Dec 1974   40.0   70   N
   4045 ABYDOS WOODSTOCK                         -21.6200  118.9550 Apr 1901 Sep 1997   65.2   67   N
   9971 ACTON PARK                               -33.7845  115.4072 Nov 2000 Nov 2017   17.0   98   N
   2046 ADA VALE                                 -16.9500  128.1000 Jan 1920 Mar 1925    4.8   92   N



Already you can see that this text file is formatted for humans to read, not computers. Fortunately each "column" is fixed width, so it is possible to convert it to a CSV! Although this can be done in Python, the time I would have to invest in getting it to work right is more effort than just passing it through Excel.

So that's what I did: I imported the text file into excel, deleted the heading, footer and styling before finally exporting it as a plain csv file. I visualised the end result below.

In [42]:
pandas.read_csv('ExploratoryData/weather_stations.csv').head()

Unnamed: 0,Site,Name,Lat,Lon,Start,End,Years,%,AWS
0,7118,ABBOTTS,-26.4,118.4,Sep 1898,Nov 1913,6.8,45,N
1,10258,ABERVON,-30.7833,117.9833,May 1968,Aug 1973,5.2,97,N
2,4000,ABYDOS,-21.4167,118.9333,Jul 1917,Dec 1974,40.0,70,N
3,4045,ABYDOS WOODSTOCK,-21.62,118.955,Apr 1901,Sep 1997,65.2,67,N
4,9971,ACTON PARK,-33.7845,115.4072,Nov 2000,Nov 2017,17.0,98,N
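
For reference, the listing could also be parsed directly in Python instead of going through Excel. The sketch below is only one possible approach: the regular expression and the names `ROW_PATTERN` and `parse_station_lines` are my own assumptions, based purely on the sample rows printed above, so rows that deviate from that layout would be skipped. Pandas also ships a `read_fwf` function for fixed-width files, which would need the exact column boundaries.

```python
import re

# One station row: site number, name, lat, lon, start, end, years, %, AWS flag.
# The pattern is an assumption based on the sample rows printed above.
ROW_PATTERN = re.compile(
    r'^\s*(\d+)\s+(.+?)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s+'
    r'([A-Z][a-z]{2} \d{4})\s+([A-Z][a-z]{2} \d{4})\s+'
    r'(\d+\.\d+)\s+(\d+)\s+([YN])\s*$'
)
FIELDS = ['Site', 'Name', 'Lat', 'Lon', 'Start', 'End', 'Years', '%', 'AWS']


def parse_station_lines(lines):
    """Return one dict per line that looks like a station row."""
    stations = []
    for line in lines:
        match = ROW_PATTERN.match(line)
        if match is None:
            continue  # Heading, ruler or footer line.
        stations.append(dict(zip(FIELDS, match.groups())))
    return stations


sample_lines = [
    'Site    Name                                     Lat       Lon      Start    End       Years   %  AWS',
    '-----------------------------------------------------------------------------------------------------',
    '   7118 ABBOTTS                                  -26.4000  118.4000 Sep 1898 Nov 1913    6.8   45   N',
    '   4045 ABYDOS WOODSTOCK                         -21.6200  118.9550 Apr 1901 Sep 1997   65.2   67   N',
]
stations = parse_station_lines(sample_lines)
print([s['Site'] for s in stations])
```

All values come back as strings, so they would still need converting (for example with `int` or `float`) before any analysis.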


---
# !!!OLD NOTEBOOK BELOW.!!!

---
# Part 1 - Finding The Data [XKCD]
The first step in this project will be finding the data to collect. As already mentioned, you can find most of the data online at the [Climate Data database](http://www.bom.gov.au/climate/data/). However, collecting data this way is really tedious as you would have to manually fill in forms. Although automating this is possible with Python packages such as [Selenium](http://selenium-python.readthedocs.io/), there is definitely an easier way.

## 1.1. Tinkering With The Portal

The first thing I did was gain some familiarity with how the portal works. So I searched for a weather station, found its station number and went to its data page. If you'd like to follow along, I used station 9021.

Once I went to the new page, I noticed that the URL was structured like so:

> http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_nccObsCode=136&p_display_type=dailyDataFile&p_startYear=&p_c=&p_stn_num=009021

So already you can see the URL contains a query string. For the most part I can't really tell what each parameter does, but I did notice that my station number was in the string:

> p_stn_num=009021

So naturally I played around with this parameter. Of course this enabled me to move to a new weather station! Let me illustrate that below.

In [3]:
# The name of the weather station for a given station number
def getStationName(station_num):
    page = urllib2.urlopen(
        r'http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_nccObsCode=136&p_display_type=dailyDataFile&p_startYear=&p_c=&p_stn_num=' 
        + str(station_num).zfill(6)
    )
    soup = BeautifulSoup(page, 'html.parser')
    return soup.h2.string

In [4]:
print "Station 9021:", getStationName(9021)
print "Station 9022:", getStationName(9022)

Station 9021: Perth Airport 
Station 9022: Guildford Post Office 
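
As an aside, the query string can be assembled from its parameters rather than pasted as one long literal. This is only a sketch: the helper name `station_page_url` is hypothetical, and the parameter values are simply copied from the URL shown earlier. The import fallback lets it run on both Python 3 and the Python 2 used in this notebook.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2, as used in this notebook

DATA_PAGE = 'http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av'


def station_page_url(station_num, obs_code=136):
    """Build the daily data page URL for a station number."""
    params = [
        ('p_nccObsCode', obs_code),
        ('p_display_type', 'dailyDataFile'),
        ('p_startYear', ''),
        ('p_c', ''),
        # The portal pads station numbers to 6 digits with leading zeros.
        ('p_stn_num', str(station_num).zfill(6)),
    ]
    return DATA_PAGE + '?' + urlencode(params)


print(station_page_url(9021))
```

A list of pairs is used instead of a dictionary so the parameters keep the same order as in the original URL.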


If you explore this further, however, you'll notice that not all stations exist. Run the code below if you want to see this in action.

In [5]:
try:
    getStationName(9)
except Exception as e: 
    print e

'NoneType' object has no attribute 'string'
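
The error appears because `soup.h2` is `None` when the page has no `<h2>` heading. A minimal sketch of a more forgiving lookup is shown here; the function name `station_name_from_html` is hypothetical and the HTML snippets are hand-written stand-ins, not real BOM pages.

```python
from bs4 import BeautifulSoup


def station_name_from_html(html):
    """Return the first <h2> text, or None if the page has no <h2>."""
    soup = BeautifulSoup(html, 'html.parser')
    heading = soup.h2
    if heading is None:
        return None
    return heading.get_text().strip()


# Hand-written HTML snippets standing in for real BOM pages.
found = station_name_from_html('<html><body><h2>Perth Airport </h2></body></html>')
missing = station_name_from_html('<html><body><p>No station</p></body></html>')
print(found)
print(missing)
```

Returning `None` lets the caller decide whether a missing station should be skipped or reported, instead of relying on an exception.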


So we need to find the numbers of all of the stations in WA. Fortunately that's really easy!


## 1.2. Finding The Weather Stations That Exist
Although you could simply test all 6-digit numbers and capture any that exist, it is far easier to just do a simple Google search. The BOM actually allows you to query for a list of the stations [here](http://www.bom.gov.au/climate/data/stations/). At this point I also noticed a lot more data that I could collect if I desired.

I also noticed a big catch: you can simply request the data from BOM instead of scraping it yourself! For the sake of practice, however, I will still collect the data myself. 

[This text file](http://www.bom.gov.au/climate/data/lists_by_element/alphaWA_136.txt) lists all of the weather stations that collect rainfall data in WA. So that means that I can just use this list to collect my data!

The text file is structured for humans to read, not computers. At this point I used Excel to convert the text file into a CSV. After a little bit of cleaning up of the headings and footers, I imported the CSV into a Pandas dataframe so we can access the information easily.

In [6]:
df_allstations = pandas.read_csv('rainfall_stations_wa.csv')
df_allstations.head()


Unnamed: 0,Site,Name,Lat,Lon,Start,End,Years,%,AWS
0,7118,ABBOTTS,-26.4,118.4,Sep 1898,Nov 1913,6.8,45,N
1,10258,ABERVON,-30.7833,117.9833,May 1968,Aug 1973,5.2,97,N
2,4000,ABYDOS,-21.4167,118.9333,Jul 1917,Dec 1974,40.0,70,N
3,4045,ABYDOS WOODSTOCK,-21.62,118.955,Apr 1901,Sep 1997,65.2,67,N
4,9971,ACTON PARK,-33.7845,115.4072,Nov 2000,Nov 2017,17.0,98,N


Most of the titles used in the data are self-explanatory. However, let me explain the last two columns:
 1. **Percentage (%)**: After tinkering around with the data, I think it is a measure of the completeness of the data collected. The higher the percentage, the lower the number of missing data points. Obviously 100% indicates that there aren't any missing data points at all during its time of operation.
 2. **AWS**: This column signifies that the weather station is an *Automatic Weather Station*. The BOM state that their readings are more consistent and accurate, which would be an interesting observation to back up with data.

---

# Part 2 - Extracting The Data
Now that we know what data exists, we can go about attempting to collect it. This section will mainly focus on downloading, extracting and manipulating the data into a format where it can be easily restructured.

---

## 2.1. Downloading Historical Data
Now that we know which stations exist, we want to explore if it is possible to automate the downloading process.

On each weather station page there is a link to download all of the historical data for that station, so we want to extract that link's URL using Beautiful Soup. Inspecting the page source shows the link is the second item in the list with the "downloads" class. It also carries a distinctive title attribute, which is what I'll target.

In [7]:
def downloadDataLink(station_num):
    page = urllib2.urlopen(
        r'http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_nccObsCode=136&p_display_type=dailyDataFile&p_startYear=&p_c=&p_stn_num=' 
        + str(station_num).zfill(6)
    )
    soup = BeautifulSoup(page, 'html.parser')
    download_link_extension = soup.find(
        'a', 
        {'title': "Data file for daily rainfall data for all years"}
    )['href']
    return str(BOM_HOME + download_link_extension)

I know that looks a little complicated. Let me break it down for you:
 1. Read the station data page. This is simply where you end up if you manually enter the station number into the portal. The zfill ensures that the number is padded with leading zeros to 6 digits.
 2. Parse the page's HTML with Beautiful Soup.
 3. Extract the download link. You can find it by targeting the correct title.
 4. The link is relative to the base BOM address, so return the full link by concatenating the two strings together.
 
You can see how this output looks for different stations below.

In [8]:
print downloadDataLink(9021)
print downloadDataLink(9022)

http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_display_type=dailyZippedDataFile&p_stn_num=009021&p_c=-16488712&p_nccObsCode=136&p_startYear=2017
http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_display_type=dailyZippedDataFile&p_stn_num=009022&p_c=-16492320&p_nccObsCode=136&p_startYear=1954
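
String concatenation works here because the href starts with a slash. As a hedged alternative, the standard library's `urljoin` resolves a relative link against a base address. The href below is copied from the station 9021 output above, and the import fallback covers both Python 3 and Python 2.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

BOM_HOME = 'http://www.bom.gov.au'

# A relative href of the kind found on the station page.
href = ('/jsp/ncc/cdio/weatherData/av?p_display_type=dailyZippedDataFile'
        '&p_stn_num=009021&p_c=-16488712&p_nccObsCode=136&p_startYear=2017')

full_link = urljoin(BOM_HOME, href)
print(full_link)
```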


Downloading the data from here is actually really simple. If you open the URL using urllib2, the response body is the raw contents of the zip file, which you can save by writing those bytes to a file opened in binary mode.

I've done this below to show how this could be done.

In [9]:
data = urllib2.urlopen(downloadDataLink(9021))
with open('station_9021.zip', 'wb') as zipper:
    zipper.write(data.read())

From this download we get a zipped file with two files inside: the first is the actual data (a CSV) while the second is a text file with information about the weather station. In theory we could throw out the second file, but for now I'm keeping it.
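
If you did want only the CSV, the archive can be read without unpacking everything to disk. This is a sketch under my own assumptions: the function name `read_csv_members` is hypothetical, and the archive here is a small stand-in built in memory with placeholder contents rather than a real BOM download.

```python
import io
import os
import zipfile


def read_csv_members(zip_bytes):
    """Return {file name: file contents} for every .csv inside the archive."""
    csv_files = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for member in archive.namelist():
            if member.lower().endswith('.csv'):
                # Drop any directory part so only the bare file name is kept.
                csv_files[os.path.basename(member)] = archive.read(member)
    return csv_files


# Build a small stand-in archive in memory (placeholder contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('IDCJAC0009_9021_1800_Data.csv', 'Year,Month,Day\n')
    archive.writestr('IDCJAC0009_9021_1800_Note.txt', 'notes')

csv_files = read_csv_members(buffer.getvalue())
print(sorted(csv_files))
```

The same function would accept the bytes returned by `data.read()` in the download cell, since it only needs the raw contents of the zip file.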

---
## 2.2 Extracting and Manipulating Data
Now that we have the data, we want to be able to import it! That way we can begin to format it together.

I'll start with extracting the zipped file, and viewing the first few lines of the CSV.

In [10]:
with zipfile.ZipFile("station_9021.zip", "r") as zip_ref:
    zip_ref.extractall("station_9021")
    
df_station9021 = pandas.read_csv(glob.glob("station_9021/*.csv")[0])
print "(Rows, Columns) =", df_station9021.shape
df_station9021.head()

(Rows, Columns) = (27010, 8)


Unnamed: 0,Product code,Bureau of Meteorology station number,Year,Month,Day,Rainfall amount (millimetres),Period over which rainfall was measured (days),Quality
0,IDCJAC0009,9021,1944,1,1,,,
1,IDCJAC0009,9021,1944,1,2,,,
2,IDCJAC0009,9021,1944,1,3,,,
3,IDCJAC0009,9021,1944,1,4,,,
4,IDCJAC0009,9021,1944,1,5,,,


Upon first inspection, I can already think of ways to simplify the data. Firstly I will be dropping the first column (the product code) as it is effectively useless to me. The second column (the station number) isn't actually much use either, as we already know what the station number is. I will be renaming the sixth column (the rainfall amount) to a shorter name, and I aim to combine columns 3-5 (year, month and day) into a single date column. 

# Part 3 - Refine The Data

In [29]:
# Import in the dataframe.
df_station9021 = pandas.read_csv(glob.glob("station_9021/*.csv")[0])

# Rename the columns.
df_station9021.rename(columns={
    'Bureau of Meteorology station number': "Station Number",
    'Rainfall amount (millimetres)': 'Rainfall',
}, inplace=True)

# Combine date into a single column.
# (I know this code sucks, I couldn't make it work any other way).
df_station9021['Year'] = df_station9021['Year'].map(str)
df_station9021['Month'] = df_station9021['Month'].map(lambda x: str(x).zfill(2))
df_station9021['Day'] = df_station9021['Day'].map(lambda x: str(x).zfill(2))

df_station9021.insert(
    2, 'Date', 
    
    df_station9021['Year'] + '-' + 
    df_station9021['Month'] + '-' + 
    df_station9021['Day']
)

df_station9021.drop(['Year', 'Month', 'Day'], axis=1, inplace=True)

# Next, drop the product code and station number columns.
df_station9021.drop(["Product code", "Station Number"], axis=1, inplace=True)

# Visualise the newly formatted data.
print "(Rows, Columns) =", df_station9021.shape
df_station9021.head()

(Rows, Columns) = (27010, 4)


Unnamed: 0,Date,Rainfall,Period over which rainfall was measured (days),Quality
0,1944-01-01,,,
1,1944-01-02,,,
2,1944-01-03,,,
3,1944-01-04,,,
4,1944-01-05,,,


It all works! Now the data is in an easy-to-access format with only 4 columns instead of the original 8. Of course the above process can be repeated for each station in the data, but this is only a proof of concept. The full data processing will be done via the executable Python script.
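
The date-combining step could possibly be shortened with `pandas.to_datetime`, which can assemble dates from columns named year, month and day. The sketch below uses made-up rows in the shape of the BOM file (not real readings), is written for Python 3, and assumes a Pandas version that supports this form of `to_datetime`.

```python
import pandas

# Made-up rows in the shape of the BOM file, for illustration only.
df = pandas.DataFrame({
    'Year': [1944, 1944, 2017],
    'Month': [1, 5, 12],
    'Day': [1, 2, 12],
    'Rainfall': [None, 0.0, 0.0],
})

# to_datetime can assemble dates from columns named year, month and day.
parts = df[['Year', 'Month', 'Day']].rename(columns=str.lower)
df.insert(0, 'Date', pandas.to_datetime(parts).dt.strftime('%Y-%m-%d'))
df = df.drop(['Year', 'Month', 'Day'], axis=1)

print(df['Date'].tolist())
```

Formatting back to a string keeps the same `YYYY-MM-DD` text the notebook uses; dropping the `strftime` call would leave real datetime values instead.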

---
## Aside: How Clean Is The Data?
I was thinking to myself that the data I'm showing doesn't look particularly complete. So I'll compute how many missing entries are in this particular station and see how that compares to the number given in the Station database.

In [12]:
missing_values = df_station9021.isnull().sum()
print missing_values
print (missing_values / df_station9021.shape[0]) * 100

Date                                                  0
Rainfall                                            121
Period over which rainfall was measured (days)    17935
Quality                                             121
dtype: int64
Date                                               0.000000
Rainfall                                           0.447982
Period over which rainfall was measured (days)    66.401333
Quality                                            0.447982
dtype: float64


So, actually, the data isn't missing much information at all. Why are there so many missing entries at the start then?

I managed to figure it out. Although the weather station didn't open until May, the CSV still has "empty" data for the first 4 months of the year. So to get a better picture of how complete the data is, we should throw out all data before the first entry and all data after the last entry.

In [13]:
data_start = df_station9021['Rainfall'].first_valid_index()
data_finish = df_station9021['Rainfall'].last_valid_index()
print data_start, data_finish

df_station9021 = df_station9021[data_start:data_finish]
print df_station9021.head(3), '\n\n\n\n', df_station9021.tail(3)

121 27009
           Date  Rainfall  Period over which rainfall was measured (days)  \
121  1944-05-01       0.0                                             NaN   
122  1944-05-02       0.0                                             NaN   
123  1944-05-03       0.0                                             NaN   

    Quality  
121       Y  
122       Y  
123       Y   



             Date  Rainfall  Period over which rainfall was measured (days)  \
27006  2017-12-09       0.0                                             1.0   
27007  2017-12-10       0.0                                             1.0   
27008  2017-12-11       0.0                                             1.0   

      Quality  
27006       N  
27007       N  
27008       N  
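
One thing worth noting about the trimming cell above: `df[start:stop]` on a default integer index slices by position, so the row at `data_finish` itself is left out (the printed tail ends at 27008 although the last valid index is 27009). Here is a minimal sketch, on made-up values, comparing that with label-based `.loc` slicing, which keeps both ends. It is written for Python 3.

```python
import pandas

# Made-up series with empty rows at both ends, for illustration only.
rain = pandas.DataFrame(
    {'Rainfall': [None, None, 0.0, 1.2, None, 3.4, None]}
)

first = rain['Rainfall'].first_valid_index()
last = rain['Rainfall'].last_valid_index()

positional = rain[first:last]    # end position excluded
by_label = rain.loc[first:last]  # end label included

print(len(positional), len(by_label))
```

Using `.loc` would keep the final valid reading, at the cost of shifting the row counts reported in the rest of this section by one.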


Now, if I were to repeat the same analysis I should get near perfect data!

In [14]:
missing_values = df_station9021.isnull().sum()
print missing_values
print (missing_values / df_station9021.shape[0]) * 100

Date                                                  0
Rainfall                                              0
Period over which rainfall was measured (days)    17814
Quality                                               0
dtype: int64
Date                                               0.000000
Rainfall                                           0.000000
Period over which rainfall was measured (days)    66.252603
Quality                                            0.000000
dtype: float64


Yes! There is no missing rainfall data for Perth Airport, as I expected (since it is an important weather station).

---

# Part 3 - Refine The Data
Now that we know it is possible to collect, simplify and clean the data we can focus on trying to centralise it into a single data structure. My thoughts immediately went to using Pandas' [Panels](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Panel.html) but I've rethought my approach. Instead I'll be using the [MultiIndex](http://pandas.pydata.org/pandas-docs/stable/advanced.html) object.

---

## 3.1. Designing The Data Structure
Now that we know how we want each dataframe to look, we can focus on the structure of the data as a bigger picture. Since we will be using hierarchical indexing the dataframe will look like so:

```
+---                    ---+
| 9021  1900-01-01  ...    |
|       1900-01-02  ...    |
|       ...                |
| 9022  1900-01-01  ...    |
|       1900-01-02  ...    |
|       ...                |
| ...   ...                |
+---                    ---+
```

So the outermost index is the station number and the innermost index is the date. I've decided upon this structure, instead of the other way, since it is more likely that we will be focusing on a single station rather than a single date.

Let's explore this concept with a few stations' worth of data to see how easy it is to do.

In [15]:
def createDataframeForStation(station_num):
    """Given the station number, I'll fetch the data for it."""
    download_name = 'station_%s' % station_num
    
    # Fetch & download data
    link = downloadDataLink(station_num)
    data = urllib2.urlopen(link)
    with open(download_name+'.zip', 'wb') as zipper:
        zipper.write(data.read())
    
    # Extract it 
    with zipfile.ZipFile(download_name+".zip", "r") as zip_ref:
        zip_ref.extractall(download_name)
    df = pandas.read_csv(glob.glob(download_name+"/*.csv")[0])
    
    # Rename the columns
    df.rename(columns={
        'Bureau of Meteorology station number': "Station Number",
        'Rainfall amount (millimetres)': 'Rainfall',
    }, inplace=True)
    
    # Combine date into a single column.
    df['Year'] = df['Year'].map(str)
    df['Month'] = df['Month'].map(lambda x: str(x).zfill(2))
    df['Day'] = df['Day'].map(lambda x: str(x).zfill(2))
    df.insert(
        2, 'Date', 
        df['Year'] + '-' + 
        df['Month'] + '-' + 
        df['Day']
    )
    df.drop(['Year', 'Month', 'Day'], axis=1, inplace=True)

    # Next, drop the first and second columns.
    df.drop(["Product code", "Station Number"], axis=1, inplace=True)
    
    # Set the date as the index column.
    df.set_index('Date', inplace=True)
    
    # Finally return the result.
    return df

I should see if that worked first.

In [16]:
createDataframeForStation(9021).tail(5)

Unnamed: 0_level_0,Rainfall,Period over which rainfall was measured (days),Quality
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2017-12-08,0.0,1.0,N
2017-12-09,0.0,1.0,N
2017-12-10,0.0,1.0,N
2017-12-11,0.0,1.0,N
2017-12-12,0.0,1.0,N


Which it did! So let's collect data for stations 9021, 9022 and 9023 so we can start playing around with it.

In [17]:
# Get the station data.
df9021 = createDataframeForStation(9021)
df9022 = createDataframeForStation(9022)
df9023 = createDataframeForStation(9023)

# Combine it into a single dataframe.
df_main = pandas.concat(
    [df9021, df9022, df9023], 
    keys=['9021', '9022', '9023'], 
    names=['Station Number', 'Date']
)

Did it work? Well we can see if we can access a single station, a collection of stations or all data by a specific date.

In [18]:
df_main.loc['9022'].head()

Unnamed: 0_level_0,Rainfall,Period over which rainfall was measured (days),Quality
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1877-01-01,0.0,,Y
1877-01-02,0.0,,Y
1877-01-03,0.0,,Y
1877-01-04,0.0,,Y
1877-01-05,0.0,,Y


In [19]:
df_main.loc[['9021', '9022']]

Unnamed: 0_level_0,Unnamed: 1_level_0,Rainfall,Period over which rainfall was measured (days),Quality
Station Number,Date,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
9021,1944-01-01,,,
9021,1944-01-02,,,
9021,1944-01-03,,,
9021,1944-01-04,,,
9021,1944-01-05,,,
9021,1944-01-06,,,
9021,1944-01-07,,,
9021,1944-01-08,,,
9021,1944-01-09,,,
9021,1944-01-10,,,


In [20]:
df_main.xs('1950-01-01', level='Date')

Unnamed: 0_level_0,Rainfall,Period over which rainfall was measured (days),Quality
Station Number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
9021,0.0,,Y
9022,0.0,,Y
9023,0.0,,Y


In [21]:
df_main.xs('1900-01-01', level='Date')

Unnamed: 0_level_0,Rainfall,Period over which rainfall was measured (days),Quality
Station Number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
9022,,,
9023,,,


It all seems to work exactly as expected! I can even swap the levels around if I prefer to use the date as the outer layer.

In [22]:
df_main_swapped = df_main.swaplevel(0, 1)
df_main_swapped.sort_index().tail(5)

Unnamed: 0_level_0,Unnamed: 1_level_0,Rainfall,Period over which rainfall was measured (days),Quality
Date,Station Number,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2017-12-10,9023,0.0,1.0,N
2017-12-11,9021,0.0,1.0,N
2017-12-11,9023,0.0,1.0,N
2017-12-12,9021,0.0,1.0,N
2017-12-12,9023,0.0,1.0,N
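
The three-station `concat` above should generalise to any number of stations. This is a rough sketch with hypothetical helper names (`combine_stations`, `make_frame`) and made-up values; in practice the frames would come from `createDataframeForStation`. Integer station keys are used here, whereas the notebook cell uses strings. Written for Python 3.

```python
import pandas


def combine_stations(frames_by_station):
    """Stack per-station frames under a (Station Number, Date) index."""
    station_nums = sorted(frames_by_station)
    return pandas.concat(
        [frames_by_station[num] for num in station_nums],
        keys=station_nums,
        names=['Station Number', 'Date'],
    )


def make_frame(dates, rainfall):
    """Helper for the demo: a tiny frame indexed by date."""
    index = pandas.Index(dates, name='Date')
    return pandas.DataFrame({'Rainfall': rainfall}, index=index)


# Made-up values, for illustration only.
frames = {
    9021: make_frame(['1950-01-01', '1950-01-02'], [0.0, 1.0]),
    9022: make_frame(['1950-01-01'], [2.5]),
}
combined = combine_stations(frames)
print(combined.loc[9022])
```

Sorting the station numbers keeps the outer index ordered, which is what `sort_index` would otherwise have to do afterwards.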


# Part 4 - Exporting The Data
Now that we have a good prototype of the data structure, we want to confirm that we can actually export it to a CSV file for others to use. Fortunately for me, Pandas has a [to_csv](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html) function that I'll be able to use.

Let's test this function on the dataframe we just generated of stations 9021, 9022 and 9023.

In [23]:
df_main.to_csv('example.csv')

Well, that was stupidly easy. In fact that was too easy. Did the data actually export as I wanted?

In [24]:
with open('example.csv', 'r') as f:
    for _ in range(5): print f.readline()

Station Number,Date,Rainfall,Period over which rainfall was measured (days),Quality

9021,1944-01-01,,,

9021,1944-01-02,,,

9021,1944-01-03,,,

9021,1944-01-04,,,



Well it sure looks like it. What happens if I import it back into Pandas?

In [25]:
importdf = pandas.read_csv("example.csv")
importdf.tail(5)

Unnamed: 0,Station Number,Date,Rainfall,Period over which rainfall was measured (days),Quality
104963,9023,2017-12-08,0.0,1.0,N
104964,9023,2017-12-09,0.0,1.0,N
104965,9023,2017-12-10,0.0,1.0,N
104966,9023,2017-12-11,0.0,1.0,N
104967,9023,2017-12-12,0.0,1.0,N


And it all still works as I expected! The import is even better if I use the *index_col* parameter in the read_csv function.

In [26]:
importdf2 = pandas.read_csv("example.csv", index_col=[0, 1])
importdf2.tail(5)

Unnamed: 0_level_0,Unnamed: 1_level_0,Rainfall,Period over which rainfall was measured (days),Quality
Station Number,Date,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
9023,2017-12-08,0.0,1.0,N
9023,2017-12-09,0.0,1.0,N
9023,2017-12-10,0.0,1.0,N
9023,2017-12-11,0.0,1.0,N
9023,2017-12-12,0.0,1.0,N
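
Eyeballing the tail is a reasonable first check. A slightly more systematic option is to compare a few basic properties before and after the round trip. The sketch below is one hedged way to do that: the function name `roundtrip_matches` and the demo values are my own, it writes to an in-memory buffer instead of a file, and it is written for Python 3.

```python
import io
import pandas


def roundtrip_matches(df):
    """Write df to CSV text, read it back, and compare basic properties."""
    buffer = io.StringIO()
    df.to_csv(buffer)
    buffer.seek(0)
    restored = pandas.read_csv(
        buffer, index_col=list(range(df.index.nlevels))
    )
    return (
        restored.shape == df.shape
        and list(restored.columns) == list(df.columns)
        and list(restored.index.names) == list(df.index.names)
        and restored.isnull().sum().tolist() == df.isnull().sum().tolist()
    )


# Made-up values, for illustration only.
index = pandas.MultiIndex.from_tuples(
    [(9021, '1944-01-01'), (9021, '1944-05-01'), (9022, '1877-01-01')],
    names=['Station Number', 'Date'],
)
demo = pandas.DataFrame(
    {'Rainfall': [None, 0.0, 0.0], 'Quality': [None, 'Y', 'Y']},
    index=index,
)
print(roundtrip_matches(demo))
```

This only checks the shape, labels and missing-value counts; it does not compare every cell, so it is a sanity check rather than proof that the export is lossless.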


# Conclusion
If you haven't noticed already, that is the proof of concept! The entire process works from start to finish. All that is left to do is to create an executable file so that you can collect the data yourself. 

Hopefully this walkthrough has been helpful in teaching you some of the basics of web scraping using Python. Notice that Python's package ecosystem means we rarely have to "reinvent the wheel". If someone has already made a package for what we want to do, why bother rewriting it? We only want to achieve our task. We aren't here to show off our programming skills.

Thank you all for reading this Notebook. If you liked it, please go ahead and star the repository or fork my project. That way you can play around with the code and see if you can do it better, faster or clearer!