# This Week’s Overview


You will further developing you skills of data handling and processing in this practical by finishing 10 small tasks about:

- Working with streaming data. You will load weather data through API. It is harder conceptually because API data is harder to understand -- we've simplified it quite a bit but it's still got some parts that are going to be hard going.
- Making an interactive map using skills learned last week.
- Converting queried API data into a well-formatted DataFrame. Some DataFrame operation such as 'join','append',’merge’ you learned in last term will be used.
- Create a ShapeFile using Shapely and GeoPandas - get a sense of projected and geographic coordinate system.
- Calculating distance between points.

## Learning Outcomes

By the end of this practical you should have:
- Used API to request streaming data
- Enhanced your skills of manipulating data frame 
- Known calculate geographical distance in various ways

(always remember) The first thing we need to do is setup our working environment. Run the scripts below to import the relevant libraries.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import folium  
import os

print (folium.__version__)

%matplotlib inline

pd.set_option('display.max_rows', 300) # specifies number of rows to show
pd.options.display.float_format = '{:5,.4f}'.format # specifies default number format to 4 decimal places

import warnings
warnings.simplefilter('ignore')

0.7.0


### Weather Data  - background knowledge you should know

The UK's Met Office is a world-leading weather and climate research centre, and even if it doesn't always seem like their forecasts are very accurate that's because Britain's weather is inherently _unpredictable_. They've also done a lot of work to make their weather data widely available to people like us.

I probably don't need to say a _lot_ about weather data because you've probably been making use of forecasts for much of your life! But it's _still_ worth understanding something about how weather data is gathered and reported: many organisations operate weather stations where data on wind speed, temperature, rain, and amount of sun are collected and then transmitted to a server to be integrated into a larger data set of weather _observations_ at a national or global scale. Of course, any _one_ station might be in the 'wrong' place (somewhere shady or protected from the rain) or it might even break down, but the idea is that if you have enough of them you can collect a pretty good range of data for the country and begin to look for patterns and, potentially, make predictions.

We will be accessing data from the MetOffice from weather stations where observations, such as the ones below, are collected:
* <Param name="F" units="C">Feels Like Temperature (units: degrees Celsius)
* <Param name="G" units="mph">Wind Gust (units: mph)</Param>
* <Param name="H" units="%">Screen Relative Humidity (units: percent)</Param> 
* <Param name="T" units="C">Temperature (units: degrees Celsius)</Param> 
* <Param name="V" units="">Visibility (units: km?)</Param> 
* <Param name="D" units="compass">Wind Direction (units: compass degrees)</Param>  
* <Param name="S" units="mph">Wind Speed (units: mph)</Param> 
* <Param name="U" units="">Max UV Index (units: index value)</Param> 
* <Param name="W" units="">Weather Type (units: categorical)</Param> 
* <Param name="Pp" units="%">Precipitation Probability (units: percent)</Param>

These observations are associated with a particular station (where did we see these values/where _will_ we see these values?), they will also be associated with _either_ a particular time in the past (when were they collected?) or, if they're forecasts, with a particular time in the future (when do we expect to see them?). 

So although weather data might seem more 'objective' than data on social class (though for obvious reasons it turns out that both are just attempts to capture data about reality, not reality itself), it may also turn out to be very complex to store and manage beccause of the temporal element _and_ the fact that it's not just a count of one thing, each of these observations uses a very different set of units.

To really get to grips with the MetOffice API you will need to RTM (Read The Manual): http://www.metoffice.gov.uk/media/pdf/3/0/DataPoint_API_reference.pdf. The ruder version of that, which you will sometimes see on StackOverflow and elsewhere, is: RTFM.

### Getting Weather Data via an API

Because the weather is changing all the time, so is the data! And, 'worse', it's becoming obsolete: the forecast from 2 years ago isn't particularly useful to us now. *And* asking for "yesterday's weather" depends on the day that we're asking! When you have data that is always changing from minute to minute or day to day then you use an API (Application Programming Interface) to access it: the API knows that "yesterday's weather" means "work out what day it is right now and then get the weather from the day before", and it also knows that "give me the current weather from station X" means "look up station X and find the latest weather report that I've received". In other words, an API is  designed with programmatic, dynamic interaction in mind right from the start.

#### So What _is_ an API?

There's a nice, friendly introduction to APIs over at [Free Code Camp](https://medium.freecodecamp.com/what-is-an-api-in-english-please-b880a3214a82#.rmjnli2nn). 

Helpfully, the MetOffice provides a lot of documentation about their API (I'd suggest bookmarking it): http://www.metoffice.gov.uk/datapoint/support/api-reference

This type of data requires a lot more research up front to work with, but it's very flexible once you know how to 'speak API' because you can _customise_ the API request so that the server responds with _only_ the data we're interested in instead of being 'stuck' with what the provider wants to give us.



### Task 1: Obtaining an API Key

The first step to working with the API from the MetOffice is to obtain an API key: [Click to register and obtain the API Key here](http://www.metoffice.gov.uk/datapoint/API). 

Make a note of this key in your notebook. Right here:

In [2]:
api_key   = "75038866-617b-4f32-b7a1-2af56d36fe63" # your API key

That way your API key is saved somewhere easy to access.

We _always_ have to use the key as part of an API request: the process by which we _ask_ for data. Think of the key as being _your_ unique identifier: no two people share the same key and that way the MetOffice can cut off people who abuse the system or look at which APIs are popular with lots of users... **Twitter and Facebook do the same thing.**

### Task 2: Obtaining a List of Sites from the API

How to start? Well, the first thing that we need to know is: for what locations can I get weather data? For this to work, we need to know how to ask the API nicely for a list of available sites... 

First we import two new modules: one that makes requests to a web server, and one that will parse JSON* responses from the server in order to turn them into something that we can work with more easily.

In [3]:
import json, requests # Libraries we need

Then we set up some default variables (api_url, site_url) that will help us to build our request to the MetOffice's server. The comments help us to remember what each of these variables holds.

In [4]:
api_url  = "http://datapoint.metoffice.gov.uk/public/data/" # base URL
site_url = "val/wxobs/all/json/sitelist" # sites API URL
payload = {'key': api_key} # Dictionary to hold request parameters

You'll notice that the payload is just a dictionary and that this dictionary is then passed to the requests library (the get function). All it does it convert this dictionary to a key-value pair in the format expected by the API. Think of it as a kind of translation between languages: from the language of Python to the language of the web (HTTP, to be precise).
We issue our request and it returns a response that we store in s (short for sites data).

In [5]:
s = requests.get(api_url + site_url, params=payload) # Do the request

Lastly, we ask the response object to convert the reply into a JSON data structure... more on JSON in a second, but first let's look at what we got from our request!

In [6]:
sites = s.json() # Capture the output
print("Done!")

Done!


In [7]:
# Show the requested URL
print(s.url) # Click on the link below to see it nicely formatted automatically!

http://datapoint.metoffice.gov.uk/public/data/val/wxobs/all/json/sitelist?key=75038866-617b-4f32-b7a1-2af56d36fe63


In [8]:
# Capture the returned data
print(sites)

{'Locations': {'Location': [{'elevation': '7.0', 'id': '3066', 'latitude': '57.6494', 'longitude': '-3.5606', 'name': 'Kinloss', 'region': 'gr', 'unitaryAuthArea': 'Moray'}, {'elevation': '6.0', 'id': '3068', 'latitude': '57.712', 'longitude': '-3.322', 'obsSource': 'LNDSYN', 'name': 'Lossiemouth', 'region': 'gr', 'unitaryAuthArea': 'Moray'}, {'elevation': '36.0', 'id': '3075', 'latitude': '58.454', 'longitude': '-3.089', 'obsSource': 'LNDSYN', 'name': 'Wick John O Groats Airport', 'region': 'he', 'unitaryAuthArea': 'Highland'}, {'elevation': '15.0', 'id': '3002', 'latitude': '60.749', 'longitude': '-0.854', 'name': 'Baltasound', 'region': 'os', 'unitaryAuthArea': 'Shetland Islands'}, {'elevation': '82.0', 'id': '3005', 'latitude': '60.139', 'longitude': '-1.183', 'obsSource': 'LNDSYN', 'name': 'Lerwick (S. Screen)', 'region': 'os', 'unitaryAuthArea': 'Shetland Islands'}, {'elevation': '57.0', 'id': '3008', 'latitude': '59.527', 'longitude': '-1.628', 'name': 'Fair Isle', 'region': 'os

### Task 3: Parse JSON files



We have played around GeoJSON file last week. JSON is just like GeoJSON, but without embedded geographic data structures. Now that you have a better idea of what data the reply contains, let's see if we can convert the JSON reply into something useful for Python; if you scroll back up to where we printed out the reply you'll notice that it all starts with a '{', meaning that it's a dictionary. 

Yes, look just like the GeoJSON file. (So you should known how to parse it)

Let's start by printing out the keys in the dictionary and the _type_ of data associated as a value to that key:


In [9]:
for k in sites.keys():
    print("Key: " + str(k))
    print("Value: " + str(type(sites[k])))

Key: Locations
Value: <class 'dict'>


Not the most exciting answer, but at least we know that the value is a dictionary. Let's try moving down a level:

In [10]:
for k in sites.keys():
    print("Key: " + str(k))
    print("Value: " + str(type(sites[k])))
    for k2 in sites[k].keys():
        print("\tKey: " + str(k2))
        print("\tValue: " + str(type(sites[k][k2])))

Key: Locations
Value: <class 'dict'>
	Key: Location
	Value: <class 'list'>


**This isn't great design in the reply: we have a dictionary that contains only one key/value pair, and _that_ dictionary in turn contains only one key/value pair. But after that we get to a long, long list containing the data...**

The MetOffice is not making life easy for us here: there's a _lot_ of extra 'baggage' in this API response. But we at least know that the next level down is a list and _that_ suggests that things are about to get a bit more interesting... Let's simplify our code at the same time:

In [11]:
apiList = sites['Locations']['Location']
print("List in API response is " + str(len(apiList)) + " long")

List in API response is 148 long


Now _that_ is a rather more interesting response. What it means is that our JSON reply has this structure:
```
{
    'Locations': {
        'Location': [
            ... lots of data here ...
        ]
    }
}
```

If you scroll back up to the JSON reply you should now be able to read a little bit more of the response... and this should give you a clue as to how to print out the `name`, `id`, and `longitude` of the first five sites. I'll get you started:

In [12]:
for i in range(5):
    location = apiList[i]
    print("Location: " + location['name'] + " (id: " + location['id'] + ") is at longitude: " + location['longitude'])

Location: Kinloss (id: 3066) is at longitude: -3.5606
Location: Lossiemouth (id: 3068) is at longitude: -3.322
Location: Wick John O Groats Airport (id: 3075) is at longitude: -3.089
Location: Baltasound (id: 3002) is at longitude: -0.854
Location: Lerwick (S. Screen) (id: 3005) is at longitude: -1.183


Your answer should look like this:

`
Location: Kinloss (id: 3066) is at longitude: -3.5606
Location: Lossiemouth (id: 3068) is at longitude: -3.322
Location: Wick John O Groats Airport (id: 3075) is at longitude: -3.089
Location: Baltasound (id: 3002) is at longitude: -0.854
Location: Lerwick (S. Screen) (id: 3005) is at longitude: -1.183
`

And _now_ that we know where all the data was 'hidden', we can convert this to a proper data structure in which it is possible to _interact_ with it. To do that, we'll put the site data into a pandas data frame...

### Task 4: Turning API data into a Pandas DataFrame

Pandas is remarkably intelligent and will _often_ -- though not always -- work out the sensible thing to do from many kinds of data structures (list-of-lists, dictionary-of-lists, list-of-dictionaries...). So let's see what happens when we simply pass `apiList` (a LoD) directly to the `DataFrame` 'constructor' instead of passing the data through, for instance, the `read_csv` function as we did above with a CSV file.

In [13]:
site_df = pd.DataFrame(apiList)
print(site_df.head())

  elevation    id latitude longitude                        name nationalPark  \
0       7.0  3066  57.6494   -3.5606                     Kinloss          NaN   
1       6.0  3068   57.712    -3.322                 Lossiemouth          NaN   
2      36.0  3075   58.454    -3.089  Wick John O Groats Airport          NaN   
3      15.0  3002   60.749    -0.854                  Baltasound          NaN   
4      82.0  3005   60.139    -1.183         Lerwick (S. Screen)          NaN   

  obsSource region   unitaryAuthArea  
0       NaN     gr             Moray  
1    LNDSYN     gr             Moray  
2    LNDSYN     he          Highland  
3       NaN     os  Shetland Islands  
4    LNDSYN     os  Shetland Islands  


Wow, that's... almost scarily easy. You can see that pandas worked out the structure of our LoD and then automatically converted that to columns in a data frame. So it got the hardest part of the process exactly right and has saved us a lot of work. _That_ is the point of functions and of code: to be constructively lazy.

Of course, we could have predicted that pandas would cope since there is a whole section in the documentaiton [devoted to creating data frames from different structures](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe).



The problem is that pandas didn't know what we expected the columns to be, so it's treated them all as 'objects' (basically: strings) and not as numeric data types. To fix that you need to know that there's a function called `'astype'` that allows pandas to convert between different data types where it's fairly easy for pandas to figure out what we want to do:

In [14]:
for c in ['region','unitaryAuthArea']:
    site_df[c] = site_df[c].astype('str')
for c in ['elevation','latitude','longitude']:
    site_df[c] = site_df[c].astype('float')
for c in ['id']:
    site_df[c] = site_df[c].astype('int')

In [15]:
site_df.describe() # only the numeric paratmers shows here

Unnamed: 0,elevation,id,latitude,longitude
count,148.0,148.0,148.0,148.0
mean,119.2432,6033.1216,53.7383,-2.8579
std,189.2141,15564.2395,2.5321,2.2598
min,2.0,3002.0,49.2079,-10.25
25%,20.75,3169.75,51.6062,-4.1917
50%,63.0,3407.5,53.2555,-2.6735
75%,132.25,3768.25,55.3247,-1.2332
max,1245.0,99142.0,60.749,1.348


### Task 5: Making an interactive map using Folium

Before we go further, lets make use of your mapping skills learned last week to visually identify where are the locations collects weather data.

With the map to be created, we are able to answer the first question: for what locations can I get weather data? - Let's add all sites (points) to the map.

In [16]:
MAP_COORDINATES = (51.5113, -0.1160) 

m = folium.Map(location=MAP_COORDINATES, zoom_start=8, tiles="Stamen Toner")

for index, loc in site_df.iterrows():
    # And now add a marker
    folium.Marker((loc['latitude'], loc['longitude']), icon=folium.Icon(color='green',icon='cloud'), 
              popup = 'weather station'
             ).add_to(m)
m # Print the map

**Zoom out, and what did you see there?**

The weather station are located across the whole UK, not just in London.

We can use spatial join to find out all stations in GLA, but not there yet. Lets use the key word 'unitaryAuthArea' to query out stations in 'Greater London'.

In [17]:
#Lets find out the weather station in the Greater London area by querying a key word
site_df_london = site_df.loc[(site_df.unitaryAuthArea =='Greater London')]

In [18]:
MAP_COORDINATES = (51.5113, -0.1160) 

m = folium.Map(location=MAP_COORDINATES, zoom_start=10, tiles="Stamen Toner")

for index, loc in site_df_london.iterrows():
    # And now add a marker
    folium.Marker((loc['latitude'], loc['longitude']), icon=folium.Icon(color='green',icon='cloud'), 
              popup= loc['unitaryAuthArea']
             ).add_to(m)
m # Print the map

### Task 6: Obtaining Weather Data in London

The next step in this process is a bit more complicated because weather data is a bit more complicated than a list of locations...



Here, the MetOffice has _not_ made our lives very easy because the data is packaged in a way that doesn't allow us to easily load it into pandas. If you search online, you'll find plenty of people complaining about how the MetOffice API works. Or doesn't work, if you prefer.

So we're not going to ask you to sort this out for yourselves. Instead, we're going to provide you with a function (!) to take the observation data and convert it into a data frame.

Well, you should at least know how to use a function.

In [19]:
from datetime import datetime, timedelta
def processMetOfficeObservations(loc): 
    """
    Process a series of 'reports' for a single
    location using the datetime object as the 
    reference time against which to build the 
    timedelta (i.e. we start from midnight and 
    the timedelta is the number of minutes past 
    midnight)
    """
    observations = {} # Stores results
    
    for d in loc['Period']: # d for day
        if 'value' not in d or 'Rep' not in d:
            # print 'no "value" or "Rep" key, skip'
            continue
            
        dt = datetime.strptime(d['value'],'%Y-%m-%dZ')# Convert date to datetime object 
        
        # Now deal with the actual observations (i.e. 'Reports')
        the_type = type(d['Rep'])
        if the_type is dict:
            reports = [d['Rep']]
        elif the_type is list:
            reports = d['Rep']
        else:
            print ("***warning: d['Rep'] type: ", the_type, " not support")
            continue
        
        for report in d['Rep']:
                
            # Find the timestampe and add it to the date
            minutes_after_midnight = int(report['$'])
            ts = dt + timedelta(minutes=minutes_after_midnight)
            
            # For each of the possible values, set a default value
            # if the weather station doesn't actually collect that
            # parameter... can you see a problem with our defaults?
            if 'ts' not in observations:
                observations['ts'] = []
            observations['ts'].append(str(ts))
            for key in ['D','Pt']:
                if key not in report:
                    report[key] = u""
                if key not in observations:
                    observations[key] = []
                observations[key].append(report[key])
            for key in ['W','V','S','G']:
                if key not in report or report[key] == "":
                    report[key] = 0
                if key not in observations:
                    observations[key] = []
                observations[key].append(report[key])
            for key in ['T','Dp','H']:
                if key not in report or report[key] == "":
                    report[key] = 0.0
                if key not in observations:
                    observations[key] = []
                observations[key].append(report[key])
    
    return observations


First, just in case you want to only run this section again (and not revisit the content above), I'd suggest saving a copy of your API key here as well:

In [20]:
api_key = "75038866-617b-4f32-b7a1-2af56d36fe63" # Here you need to replace this with your unique API key

Let's start from a simple task. - querying weather data at one weather station. And we choose one located in London.

In [21]:
london_ids = (site_df_london.id).unique()
london_ids  

# how many stations with hourly updated data in London? You have got the answer from your map.

array([3672, 3772])

Just like what we did before for querying site locations. Here, we just added one small parameter - location id.

**if you are unsure where to add it, go to check data point [API](https://www.metoffice.gov.uk/datapoint/support/api-reference)**

In [22]:
import json, requests # Libraries we need

api_url  = "http://datapoint.metoffice.gov.uk/public/data/" # Base URL
obs_json= "val/wxobs/all/json/" # Observations URL

station = str(3772)  #3772 # This is heathrow airport weather station

payload = {'res': 'hourly', 'key': api_key} # Dictionary to hold request parameters

# pay attention here, station id added
r = requests.get(api_url + obs_json + station, params=payload)

#print(r.url)

weather = r.json() # Capture the reply

print("Done!")

Done!


In [23]:
print(weather)

{'SiteRep': {'Wx': {'Param': [{'name': 'G', 'units': 'mph', '$': 'Wind Gust'}, {'name': 'T', 'units': 'C', '$': 'Temperature'}, {'name': 'V', 'units': 'm', '$': 'Visibility'}, {'name': 'D', 'units': 'compass', '$': 'Wind Direction'}, {'name': 'S', 'units': 'mph', '$': 'Wind Speed'}, {'name': 'W', 'units': '', '$': 'Weather Type'}, {'name': 'P', 'units': 'hpa', '$': 'Pressure'}, {'name': 'Pt', 'units': 'Pa/s', '$': 'Pressure Tendency'}, {'name': 'Dp', 'units': 'C', '$': 'Dew Point'}, {'name': 'H', 'units': '%', '$': 'Screen Relative Humidity'}]}, 'DV': {'dataDate': '2019-01-21T10:00:00Z', 'type': 'Obs', 'Location': {'i': '3772', 'lat': '51.479', 'lon': '-0.449', 'name': 'HEATHROW', 'country': 'ENGLAND', 'continent': 'EUROPE', 'elevation': '25.0', 'Period': [{'type': 'Day', 'value': '2019-01-20Z', 'Rep': [{'D': 'N', 'H': '83.9', 'P': '1017', 'S': '3', 'T': '1.8', 'V': '11000', 'W': '7', 'Pt': 'R', 'Dp': '-0.6', '$': '600'}, {'D': 'NNE', 'H': '78.2', 'P': '1018', 'S': '5', 'T': '3.3', 'V'

In [24]:
#using the function
data = processMetOfficeObservations(weather['SiteRep']['DV']['Location'])

In [25]:
df3 = pd.DataFrame.from_dict( data )
# add one more parameter to indicate the station id - say, where the data from
df3['id'] = station
df3.head(4)

Unnamed: 0,ts,D,Pt,W,V,S,G,T,Dp,H,id
0,2019-01-20 10:00:00,N,R,7,11000,3,0,1.8,-0.6,83.9,3772
1,2019-01-20 11:00:00,NNE,R,1,11000,5,0,3.3,-0.1,78.2,3772
2,2019-01-20 12:00:00,NE,R,1,16000,6,0,4.3,-0.6,70.0,3772
3,2019-01-20 13:00:00,NNE,R,1,14000,5,0,5.1,-0.9,64.7,3772


### Task 7: Obtaining Weather Data in the UK ( its your turn)

Here you are going to repeat what has been done in Task 6. But instead of querying only 1 station, you need to write a loop to query all weather station in the UK and put the data into a DataFrame called weather_df.

Hints: first you get all unique ids of weather stations; then you write a loop to iterate all station ids; Each time you request data of one station, you convert the JSON into a new DataFrame; then you join the new dataframe to the weather_df (not sure what operation to use, then check [FROM HERE](https://pandas.pydata.org/pandas-docs/stable/merging.html)); After you gather all data from all station, rename the columns to make it easy understood and reset index to make it ready for query. That's it!

In [26]:
weather_df = df3.iloc[0:0] # so the big container is prepared for you, you just have to fill it with data.
weather_df

Unnamed: 0,ts,D,Pt,W,V,S,G,T,Dp,H,id


In [27]:
# get station ids here
uk_ids = site_df.id.unique()
len(uk_ids)

148

the answer should be something like 148 (data updates all the time as we are using streaming data)

In [28]:
# loop here
for stn in uk_ids:
    station = str(stn)
    r = requests.get(api_url + obs_json + station, params=payload)
    weather = r.json() # Capture the reply    
    if 'Location' not in weather['SiteRep']['DV']: # if there is no keyword - location, means no data coming in
        continue
    else:
        print('weather station id is %s' %(station) )
        test = processMetOfficeObservations(weather['SiteRep']['DV']['Location'])
        df3_new = pd.DataFrame.from_dict(test)
        df3_new['id'] = station 
        weather_df = weather_df.append(df3_new)  

weather station id is 3066
weather station id is 3068
weather station id is 3075
weather station id is 3002
weather station id is 3005
weather station id is 3008
weather station id is 3034
weather station id is 3044
weather station id is 3047
weather station id is 3796
weather station id is 3803
weather station id is 3857
weather station id is 3876
weather station id is 3895
weather station id is 3911
weather station id is 3916
weather station id is 3953
weather station id is 99081
weather station id is 3520
weather station id is 3522
weather station id is 3535
weather station id is 3560
weather station id is 3605
weather station id is 3647
weather station id is 3660
weather station id is 3684
weather station id is 3716
weather station id is 3740
weather station id is 3768
weather station id is 3153
weather station id is 3155
weather station id is 3162
weather station id is 3166
weather station id is 3226
weather station id is 3230
weather station id is 3238
weather station id is 3257


In [30]:
#change column names here. 
cnames = {'D' : 'WindDirection', 
          'G' : 'WindGust', 
          'Dp': 'DewPoint', 
          'H' : 'Humidity', 
          'P' : 'PressureTendency', 
          'S' : 'WindSpeed', 
          'T' : 'Temperature', 
          'W' : 'WeatherType', 
          'V' : 'Visibility', 
          'Pt': 'PressureTendency',
          'id': 'id',
          'ts': 'DateTime'
         }

weather_df.rename(columns=cnames, inplace=True)

In [31]:
#check you data, see if it makes sense
weather_df

Unnamed: 0,DateTime,WindDirection,PressureTendency,WeatherType,Visibility,WindSpeed,WindGust,Temperature,DewPoint,Humidity,id
0,2019-01-20 10:00:00,WSW,R,12,6000,10,0,2.1,1.2,93.7,3066
1,2019-01-20 11:00:00,SW,R,8,45000,10,0,2.8,2.0,94.5,3066
2,2019-01-20 12:00:00,WSW,R,10,40000,11,0,3.6,2.2,90.4,3066
3,2019-01-20 13:00:00,W,R,8,35000,13,0,4.2,2.2,86.7,3066
4,2019-01-20 14:00:00,W,R,14,9000,17,0,3.9,2.2,88.5,3066
5,2019-01-20 15:00:00,W,R,1,35000,15,0,3.6,1.4,85.2,3066
6,2019-01-20 16:00:00,WSW,R,2,45000,8,0,2.4,0.5,87.1,3066
7,2019-01-20 17:00:00,W,R,7,28000,13,0,2.3,0.7,89.0,3066
8,2019-01-20 18:00:00,WSW,R,0,27000,14,0,2.2,0.0,85.2,3066
9,2019-01-20 19:00:00,SW,R,0,27000,6,0,0.9,-1.4,84.4,3066


Did you notice that the row index is not with unique numbers? Why? This is a common issue when combining DataFrame together and may cause problems when we index data. We just need to reset the index.

In [32]:
weather_df = weather_df.reset_index(drop=True)
weather_df

Unnamed: 0,DateTime,WindDirection,PressureTendency,WeatherType,Visibility,WindSpeed,WindGust,Temperature,DewPoint,Humidity,id
0,2019-01-20 10:00:00,WSW,R,12,6000,10,0,2.1,1.2,93.7,3066
1,2019-01-20 11:00:00,SW,R,8,45000,10,0,2.8,2.0,94.5,3066
2,2019-01-20 12:00:00,WSW,R,10,40000,11,0,3.6,2.2,90.4,3066
3,2019-01-20 13:00:00,W,R,8,35000,13,0,4.2,2.2,86.7,3066
4,2019-01-20 14:00:00,W,R,14,9000,17,0,3.9,2.2,88.5,3066
5,2019-01-20 15:00:00,W,R,1,35000,15,0,3.6,1.4,85.2,3066
6,2019-01-20 16:00:00,WSW,R,2,45000,8,0,2.4,0.5,87.1,3066
7,2019-01-20 17:00:00,W,R,7,28000,13,0,2.3,0.7,89.0,3066
8,2019-01-20 18:00:00,WSW,R,0,27000,14,0,2.2,0.0,85.2,3066
9,2019-01-20 19:00:00,SW,R,0,27000,6,0,0.9,-1.4,84.4,3066


In [34]:
#how many records you got? 
len(weather_df)

3499

you should get an answer around 3499. Again number changes slightly as we are using streaming data.

### Task 8: Join Weather Data (weather_df) with Site Data (df2)

The weather data we get does not have location tag. You can just use the weather station id and always go back the site location table to find the coordinates. It is a good strategy for data storage, especially for large data sets. And it is probably a bad idea to combine them into one, as in such case both station id and timetag cannot be used as index any more.  

But (just) for practice, let's combine them into one to create a spatial file for later use.

In [35]:
weather_df.head(3)

Unnamed: 0,DateTime,WindDirection,PressureTendency,WeatherType,Visibility,WindSpeed,WindGust,Temperature,DewPoint,Humidity,id
0,2019-01-20 10:00:00,WSW,R,12,6000,10,0,2.1,1.2,93.7,3066
1,2019-01-20 11:00:00,SW,R,8,45000,10,0,2.8,2.0,94.5,3066
2,2019-01-20 12:00:00,WSW,R,10,40000,11,0,3.6,2.2,90.4,3066


In [36]:
site_df.head(3)

Unnamed: 0,elevation,id,latitude,longitude,name,nationalPark,obsSource,region,unitaryAuthArea
0,7.0,3066,57.6494,-3.5606,Kinloss,,,gr,Moray
1,6.0,3068,57.712,-3.322,Lossiemouth,,LNDSYN,gr,Moray
2,36.0,3075,58.454,-3.089,Wick John O Groats Airport,,LNDSYN,he,Highland


**What is the common key word for joining two tables?**

**What sort of join should we perform? inner? outter? left? or right?**

**You can also do it use merge, if you forgot, go to geocomputation week 6 - practical**

For your convinience, read more about merge [here](http://pandas.pydata.org/pandas-docs/stable/merging.html). 

![Illustration of the Pandas merge function](http://pandas.pydata.org/pandas-docs/stable/_images/merging_merge_on_key.png)


In [37]:
df_in_one = pd.merge(weather_df, site_df, on = 'id', how = 'left')

ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat

You may get a ValueError:You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat.

Why??? Think really hard about that before you raise your hand to get an answer from us.

In [38]:
site_df.dtypes

elevation          float64
id                   int64
latitude           float64
longitude          float64
name                object
nationalPark        object
obsSource           object
region              object
unitaryAuthArea     object
dtype: object

In [39]:
weather_df.dtypes # check the data types for weather_df

DateTime            object
WindDirection       object
PressureTendency    object
WeatherType         object
Visibility          object
WindSpeed           object
WindGust            object
Temperature         object
DewPoint            object
Humidity            object
id                  object
dtype: object

In [40]:
for c in ['WindDirection','PressureTendency']:
    weather_df[c] = weather_df[c].astype('str')
for c in ['DewPoint','WindGust','Humidity','WindSpeed','Temperature','Visibility','WeatherType']:
    weather_df[c] = pd.to_numeric(weather_df[c], errors='coerce') # change numeric data into integar and float types.

In [41]:
weather_df.dtypes # check the changes of data types for weather_df

DateTime             object
WindDirection        object
PressureTendency     object
WeatherType           int64
Visibility            int64
WindSpeed             int64
WindGust              int64
Temperature         float64
DewPoint            float64
Humidity            float64
id                   object
dtype: object

In [42]:
weather_df['id'] = weather_df['id'].astype('int')
df_in_one = pd.merge(weather_df, site_df, how = 'left', on = 'id')
df_in_one.head(5)

Unnamed: 0,DateTime,WindDirection,PressureTendency,WeatherType,Visibility,WindSpeed,WindGust,Temperature,DewPoint,Humidity,id,elevation,latitude,longitude,name,nationalPark,obsSource,region,unitaryAuthArea
0,2019-01-20 10:00:00,WSW,R,12,6000,10,0,2.1,1.2,93.7,3066,7.0,57.6494,-3.5606,Kinloss,,,gr,Moray
1,2019-01-20 11:00:00,SW,R,8,45000,10,0,2.8,2.0,94.5,3066,7.0,57.6494,-3.5606,Kinloss,,,gr,Moray
2,2019-01-20 12:00:00,WSW,R,10,40000,11,0,3.6,2.2,90.4,3066,7.0,57.6494,-3.5606,Kinloss,,,gr,Moray
3,2019-01-20 13:00:00,W,R,8,35000,13,0,4.2,2.2,86.7,3066,7.0,57.6494,-3.5606,Kinloss,,,gr,Moray
4,2019-01-20 14:00:00,W,R,14,9000,17,0,3.9,2.2,88.5,3066,7.0,57.6494,-3.5606,Kinloss,,,gr,Moray


### Task 9: Generate Shapefile/Geojson File from DataFrame

The examples we gave here probably not a great one. But it is true that quite often we are in the situation that no spatial file is available and we have to generate one by yourself. 

You have learned how to use QGIS to import CSV and generate a vector map from there. Python can do that as well, with simple steps. One more spatial data package you will learn here is Shapely. We will first show you an example and you will create a shapefile of site station following the example.

It should be installed already, if not, do it by yourself.

step 1: you use the package shapely to generate points from latitute and longitude. 

In [43]:
from shapely.geometry import Point


# combine lat and lon column to a shapely Point() object
df_in_one['geometry'] = df_in_one.apply(lambda x: Point((float(x.longitude), float(x.latitude))), axis=1)

In [44]:
# there is a column added as geometry
df_in_one.head(5)

Unnamed: 0,DateTime,WindDirection,PressureTendency,WeatherType,Visibility,WindSpeed,WindGust,Temperature,DewPoint,Humidity,id,elevation,latitude,longitude,name,nationalPark,obsSource,region,unitaryAuthArea,geometry
0,2019-01-20 10:00:00,WSW,R,12,6000,10,0,2.1,1.2,93.7,3066,7.0,57.6494,-3.5606,Kinloss,,,gr,Moray,POINT (-3.5606 57.6494)
1,2019-01-20 11:00:00,SW,R,8,45000,10,0,2.8,2.0,94.5,3066,7.0,57.6494,-3.5606,Kinloss,,,gr,Moray,POINT (-3.5606 57.6494)
2,2019-01-20 12:00:00,WSW,R,10,40000,11,0,3.6,2.2,90.4,3066,7.0,57.6494,-3.5606,Kinloss,,,gr,Moray,POINT (-3.5606 57.6494)
3,2019-01-20 13:00:00,W,R,8,35000,13,0,4.2,2.2,86.7,3066,7.0,57.6494,-3.5606,Kinloss,,,gr,Moray,POINT (-3.5606 57.6494)
4,2019-01-20 14:00:00,W,R,14,9000,17,0,3.9,2.2,88.5,3066,7.0,57.6494,-3.5606,Kinloss,,,gr,Moray,POINT (-3.5606 57.6494)


step 2: Then you use Geopandas to generate 'Geopandas dataframe' from 'Pandas dataframe'

In [45]:
import geopandas
df_in_one = geopandas.GeoDataFrame(df_in_one, geometry='geometry')

step 3: Then you set a projected and geographic coordination system here to convert the file into a shapefile

In [46]:
df_in_one.crs= "+proj=longlat +ellps= EPSG:27700 +datum=WGS84 +no_defs"
df_in_one.to_file('weather.shp', driver='ESRI Shapefile')

step 4: The generated shapefile should be in the same folder as your ipython notebook. and the shapefile can be openned by other GIS tools, such as QGIS. Try it. 

In [47]:
len(df_in_one)

3499

You have learned how to use QGIS to import CSV and generate a vector map from there. Python can do that as well, with simple steps. One more spatial data package you will learn here is Shapely. 

It should be installed already, if not, do it by yourself.


Though there are c.3531 records, there are only a few points shown in the map. why? because may records have the exact same coordinates. They are weather records from one weather station collected at different time. So there should be only 148 points (== number of weather stations) shown in the map.

There are two directions for your to further explore **after the practical**. 

1 - go for time series analysis, take one weather station as example - time series analysis and modeling is a challenge task to do.
2 - generate a series of maps by time series - which sounds more spatial data related, let's try this using what you have learned in Geocomputation week 8 - making maps.

**Could you prepare a shapefile of weather station locations using just 'site_df' for task 10, following what we did to generete weather shapefile?**

In [48]:
site_df['geometry'] = site_df.apply(lambda x: Point((float(x.longitude), float(x.latitude))), axis=1)
site_df = geopandas.GeoDataFrame(site_df, geometry='geometry')
site_df.crs= "+proj=longlat +ellps=EPSG:27700 +datum=WGS84 +no_defs"
site_df.to_file('weather_station.shp', driver='ESRI Shapefile')
len(site_df)

148

### Task 10: Generating a Distance Matrix 

A distance matrix is a square matrix (two-dimensional array) containing the distances, taken pairwise, between the elements of a set. 

We will calculate a distance matrix for all weather stations using methods we mentioned in lecture this week.

In [49]:
site_df.head(5)

Unnamed: 0,elevation,id,latitude,longitude,name,nationalPark,obsSource,region,unitaryAuthArea,geometry
0,7.0,3066,57.6494,-3.5606,Kinloss,,,gr,Moray,POINT (-3.5606 57.6494)
1,6.0,3068,57.712,-3.322,Lossiemouth,,LNDSYN,gr,Moray,POINT (-3.322 57.712)
2,36.0,3075,58.454,-3.089,Wick John O Groats Airport,,LNDSYN,he,Highland,POINT (-3.089 58.454)
3,15.0,3002,60.749,-0.854,Baltasound,,,os,Shetland Islands,POINT (-0.854 60.749)
4,82.0,3005,60.139,-1.183,Lerwick (S. Screen),,LNDSYN,os,Shetland Islands,POINT (-1.183 60.139)


**Could you calculate the distance between station id - 3066 to station id 3068?**

In [50]:
#x - longitude y - latitude

ori = [{'y': site_df.iloc[0]['latitude'], 'x':site_df.iloc[0]['longitude']}]
des = [{'y': site_df.iloc[1]['latitude'], 'x':site_df.iloc[1]['longitude']}]

ori = pd.DataFrame(ori)
des = pd.DataFrame(des)

# to make your life easier - ori is the station 3066, des is the station 3068

To verify what we taught in the class, could you complete the function below? Calculate the distance based on a Euclidean method. the squre_root method.

**implement Euclidean method**

In [51]:
import math
def distance_euclidean(ori,des):
    """
    calculate the euclidean distant between two points. -  
    """
    distance = np.sqrt((ori['latitude']-des['latitude'])**2+(ori['longitude']-des['longitude'])**2)
    return distance

haversine method, using package pyproj, and cosines method are provided, for you the compare the results. 

In [52]:
import math
from math import radians, cos, sin, asin, sqrt

def distance_haversine(ori,des):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [ori.x, ori.y, des.x, des.y])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r

In [53]:
import pyproj

def distance_pyproj(ori, des):
    geod = pyproj.Geod(ellps='WGS84')
    angle1,angle2,distance = geod.inv(ori.x[0], ori.y[0], des.x[0], des.y[0])
    print ("{0:8.4f}".format(distance/1000))


In [54]:
def distance_cosines(ori, des):
    # law of cosines
    distance = math.acos(math.sin(math.radians(ori.y))*math.sin(math.radians(des.y))+math.cos(math.radians(ori.y))*math.cos(math.radians(des.y))*math.cos(math.radians(des.x)-math.radians(ori.x)))*6371
    print ("{0:8.4f}".format(distance))



In [55]:
distance_haversine(ori,des)

15.800405971038828

In [56]:
distance_cosines(ori, des)

 15.8004


In [57]:
distance_pyproj(ori, des)

 15.8501


**they are slightly different! but very close**

Now, we calculate distance matrix using Scipy package. 'euclidean' method is used among [many other methods](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html) provided. You already tested the Euclidean method using lat and long, the answer seems quite wrong. (because it is in degree, not in meters). 

To get the distance in meters, we will first project the data into two-dimensional space (from long, lat to x, y). Then calculate the distance matrix.

In [58]:
import pyproj
from scipy.spatial import distance

In [59]:
print(site_df.crs) # what is the projected and geographic coordinate system?

+proj=longlat +ellps=EPSG:27700 +datum=WGS84 +no_defs


In [60]:
# Define some common projections using EPSG codes
wgs84=pyproj.Proj("+init=EPSG:4326") # LatLon with WGS84 datum used by GPS units and Google Earth
osgb36=pyproj.Proj("+init=EPSG:27700") # UK Ordnance Survey, 1936 datum - this is the one used in the UK

In [61]:
site_df['x'], site_df['y'] = pyproj.transform(wgs84, osgb36, list(site_df['longitude']), list(site_df['latitude']))

In [62]:
coords = site_df[['x','y']]

In [63]:
coords.head(4)  #check it

Unnamed: 0,x,y
0,306955.6288,863236.8602
1,321331.0898,869902.6478
2,336546.1921,952261.759
3,462571.013,1207872.358


In [64]:
distance.cdist(coords, coords, 'euclidean')

array([[     0.        ,  15845.71246803,  93813.82651373, ...,
        550461.29765573, 423749.90271714, 769320.39253568],
       [ 15845.71246803,      0.        ,  83752.7464451 , ...,
        551191.09949613, 423308.10265446, 776229.80656304],
       [ 93813.82651373,  83752.7464451 ,      0.        , ...,
        623879.72440312, 493622.82100721, 859058.65697336],
       ...,
       [550461.29765573, 551191.09949613, 623879.72440312, ...,
             0.        , 132228.59417675, 343772.25627342],
       [423749.90271714, 423308.10265446, 493622.82100721, ...,
        132228.59417675,      0.        , 442365.79990348],
       [769320.39253568, 776229.80656304, 859058.65697336, ...,
        343772.25627342, 442365.79990348,      0.        ]])

**Verifying it using measure tools in QGIS!!!** The shapefile you created can be opened there.

### Credits!