# UK Postcode-level Flood Risk Data (Rivers and Sea)
### Identifying Risk Areas for Flood Protection and Planning


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Add in the correct column names from the data card
columns = ['Postcode', 'FID', 'PROB 4BAND', 'SUITABILITY', 'PUB_DATE', 'RISK FOR INSURANCE SOP', 'Easting', 'Northing', 'Latitude', 'Longitude']

In [None]:
flood_risk_path = '/kaggle/input/uk-postcode-level-flood-risk-data-rivers-and-sea/open_flood_risk_by_postcode.csv'
df = pd.read_csv(flood_risk_path, names=columns, header=0)

In [None]:
df.head()

In [None]:
# Replace non-numeric placeholders with NaN values
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
df.replace('\\N', np.nan, inplace=True)
df.replace('None', np.nan, inplace=True)

In [None]:
# Change FID code to a float
df['FID'] = df['FID'].astype('float')

In [None]:
# Assign an integer to the Prob 4Band column, in a new column
FOURBAND = {'Very Low':1, 'Low':2, 'Medium':3, 'High':4}
df['4 BAND INT'] = df['PROB 4BAND'].map(FOURBAND)

In [None]:
df.info()

In [None]:
df

In [None]:
# Plot a heatmap of NaNs - do they appear in complete rows, or randomly?
plt.figure(figsize=(16,10))
sns.heatmap(df.isnull(), cbar=False, cmap="YlGnBu")
plt.show()

Complete rows of data are missing, but the coordinates for each postcode are present, so the risk for an unassessed postcode might be predicted from the risk of nearby assessed postcodes using a KNN method.
First, have a look at the distribution of assessed areas and their risks.

In [None]:
# Scattergraph of risk level against co-ordinates
df.plot.scatter(x='Longitude', y='Latitude', c='4 BAND INT', figsize=(10,10), s=5, title = 'Flood Risk Map')


Areas which have not been assessed are shown in white on the chart.

Split data into assessed and unassessed dataframes and remove extraneous columns.  Treat "4 BAND INT" as the target, and work with Latitude and Longitude position co-ordinates as above.

In [None]:
df.head()

In [None]:
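# Keep only the fully assessed rows, then retain the coordinates and the target.
# Chaining drop() returns a new dataframe rather than modifying a possible view of df.
assessed = df.dropna(axis=0).drop(['Postcode', 'FID', 'SUITABILITY', 'PUB_DATE', 'RISK FOR INSURANCE SOP', 'PROB 4BAND', 'Easting', 'Northing'], axis=1)
assessed.head()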

In [None]:
assessed.describe()  # Looking to see whether we need to scale the Latitude and Longitude data.

In [None]:
# Rows with no risk band; drop() returns a new dataframe, so later column
# assignments on it do not act on a slice of df
unassessed = df[df['PROB 4BAND'].isna()].drop(['Postcode', 'FID', 'SUITABILITY', 'PUB_DATE', 'RISK FOR INSURANCE SOP', 'PROB 4BAND', 'Easting', 'Northing'], axis=1)
unassessed

### Discussion

Note that only 122,007 postcodes have been assessed, out of the total of 1,443,994 postcodes in England.  That means that we are predicting the flood risk of 1,321,987 postcodes on the basis of 8.4% of the total number of postcodes.
The Flood Risk Map, shown above, reveals that assessment has been done more thoroughly in areas thought to be at greater risk, notably the Humber, London and the south Devon coast.  
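
As a quick check of the arithmetic above, the following sketch recomputes the number of unassessed postcodes and the assessed share from the two counts quoted in this discussion. The variable names are illustrative only and do not appear elsewhere in the notebook.

```python
# Counts quoted in the discussion above
assessed_count = 122_007
total_count = 1_443_994

# Postcodes that still need a predicted risk band
unassessed_count = total_count - assessed_count

# Share of all postcodes that the model can learn from, as a percentage
assessed_share = 100 * assessed_count / total_count

print(f"Unassessed postcodes: {unassessed_count:,}")
print(f"Assessed share: {assessed_share:.1f}%")
```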

Before we apply a KNN algorithm, it is necessary to scale the data, given the different ranges involved.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler_a = StandardScaler()  # for the assessed data
scaler_a.fit(assessed.drop('4 BAND INT', axis=1))
scaled_features_a = scaler_a.transform(assessed.drop('4 BAND INT', axis=1))

In [None]:
assessed_feat = pd.DataFrame(scaled_features_a, columns = assessed.columns[:-1])

In [None]:
# Check that this all looks ok
assessed_feat

In [None]:
X = assessed_feat
y = assessed['4 BAND INT']

We will now do a train/test split and use standard metrics to optimise the value of k.  This k value can then be used later in predicting unassessed areas.

In [None]:
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 1)
knn.fit(train_X, train_y)

In [None]:
pred_assessed = knn.predict(val_X)

In [None]:
# Check how good these predictions are, and optimise n_neighbours
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(val_y, pred_assessed))
print(confusion_matrix(val_y, pred_assessed))

The report shows that the model predicts areas of lower flood risk more reliably than those with 'High' flood risk.
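
To make that reading of the confusion matrix concrete, here is a minimal, dependency-free sketch of how per-band recall can be derived from it. The matrix values are invented purely for illustration and are not the output of this notebook. It assumes the scikit-learn layout, in which rows are the true bands and columns are the predicted bands.

```python
# Hypothetical confusion matrix (NOT the notebook's real output).
# Rows = true band, columns = predicted band, ordered 1 (Very Low) to 4 (High).
example_cm = [
    [900, 40, 10, 0],
    [60, 300, 30, 10],
    [20, 40, 150, 40],
    [5, 10, 35, 50],
]

def per_band_recall(cm):
    """Return {band: recall}, where recall = correct predictions / true cases."""
    recall = {}
    for i, row in enumerate(cm):
        true_cases = sum(row)
        recall[i + 1] = row[i] / true_cases if true_cases else float('nan')
    return recall

recall = per_band_recall(example_cm)
for band, value in recall.items():
    print(f"Band {band}: recall = {value:.2f}")
```

In this made-up example the recall falls from band 1 to band 4, which is the kind of pattern the classification report makes visible.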

In [None]:
# Optimising k
error_rate = []
for i in range(1,20):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(train_X, train_y)
    pred_i = knn.predict(val_X)
    error_rate.append(np.mean(pred_i != val_y))
# error_rate

In [None]:
# The best fit appears to be given by 5 or fewer neighbours; plot the error rate to aid the decision.
plt.plot(range(1,20), error_rate)
plt.xlabel('k')
plt.ylabel('Error rate')

In [None]:
# Choose a value of k=3, so that high flood risk areas close to a postcode are not ignored.
# Repeat the fitting above, using this value to get the full metrics reports.
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(train_X, train_y)

In [None]:
pred_assessed = knn.predict(val_X)
print(classification_report(val_y, pred_assessed))
print(confusion_matrix(val_y, pred_assessed))

### Discussion
Overall, k=3 gives a slightly poorer prediction than k=1.  
For this data set, the driving aim is to determine the highest likely level of risk, based upon the risk of nearby postcodes.  That is to say, both households and insurance companies will want to know whether any area bordering the zone in question is at high risk of flooding.  For this reason, I have chosen to continue with k=3, but it is a subjective choice.  
A more rigorous approach would include the following two additions:
Firstly, in areas where little assessment has been done, any data from the nearest neighbours would be rejected if these are more than 'a' miles away from the postcode in question.  This may result in a prediction not being possible for some postcodes.
Secondly, it would be wise to bring in altitude data for the postcodes in question.  Ideally this would compare the lowest altitude of the postcode in question with that of its neighbours (if this were not available then the mean altitude would still bring something helpful to the model).  I suggest that a correction factor of 1 band up or down, as appropriate, might be applied to the prediction if the postcode is more than 'b' metres different in altitude from its neighbours.
Suitable values of 'a' and 'b' would need to be determined by the Environment Agency.
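
The two additions described above could look roughly like the sketch below. It is written in plain Python so that it stands alone, and it is not part of the model fitted in this notebook. Everything in it is an assumption made for illustration: the function names, the default values standing in for 'a' (`max_miles`) and 'b' (`max_alt_diff_m`), the use of a single altitude per postcode, the tie-break towards the higher band, and the choice to raise the band for a postcode lying lower than its neighbours and to reduce it for one lying higher.

```python
import math
from collections import Counter

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))

def predict_band(target, neighbours, k=3, max_miles=5.0, max_alt_diff_m=10.0):
    """Predict a risk band (1 to 4) for `target`, or None if no prediction is possible.

    target:     (latitude, longitude, altitude_m)
    neighbours: list of (latitude, longitude, altitude_m, band) for assessed postcodes
    max_miles and max_alt_diff_m are placeholders for 'a' and 'b' (assumed values).
    """
    # Rank the assessed postcodes by distance from the target
    ranked = sorted(
        (haversine_miles(target[0], target[1], lat, lon), alt, band)
        for lat, lon, alt, band in neighbours
    )
    # Keep the k nearest, then reject any that lie beyond the distance cutoff
    nearest = [n for n in ranked[:k] if n[0] <= max_miles]
    if not nearest:
        return None  # too little nearby assessment to make a prediction

    # Majority vote among the remaining neighbours; ties go to the higher band
    votes = Counter(band for _, _, band in nearest)
    band = max(votes, key=lambda b: (votes[b], b))

    # Altitude correction of one band (direction is an assumption, see lead-in)
    mean_alt = sum(alt for _, alt, _ in nearest) / len(nearest)
    if target[2] < mean_alt - max_alt_diff_m:
        band += 1
    elif target[2] > mean_alt + max_alt_diff_m:
        band -= 1
    return min(4, max(1, band))

# Made-up assessed postcodes (lat, lon, altitude in metres, band)
example_neighbours = [
    (51.50, -0.12, 5.0, 3),
    (51.51, -0.12, 6.0, 3),
    (51.50, -0.10, 8.0, 2),
]
print(predict_band((51.505, -0.115, 6.0), example_neighbours))   # similar altitude
print(predict_band((51.505, -0.115, 50.0), example_neighbours))  # much higher ground
print(predict_band((55.0, -3.0, 6.0), example_neighbours))       # no neighbours in range
```

Measuring the cutoff in real miles means working from latitude and longitude directly, rather than from the standardised coordinates used for the KNN fit.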

In [None]:
# Remind ourselves what we have:
assessed_feat # These have been scaled

In [None]:
unassessed # as yet unscaled

In [None]:
# Reuse the scaler fitted on the assessed data, so that the unassessed coordinates
# are placed in the same scaled space as the features the KNN model is trained on
scaled_features_una = scaler_a.transform(unassessed.drop('4 BAND INT', axis=1))

In [None]:
unassessed_feat = pd.DataFrame(scaled_features_una, columns = unassessed.columns[:-1])

In [None]:
unassessed_feat 

In [None]:
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(assessed_feat, assessed['4 BAND INT'])

In [None]:
pred_unassessed = knn.predict(unassessed_feat)
pred_unassessed

In [None]:
# Add the results of the predictions on the unassessed data into its dataframe
unassessed['4 BAND PRED'] = pred_unassessed
unassessed

In [None]:
df # A reminder

In [None]:
# Now join this data into the original dataframe (update() aligns rows on the index).
df['4 BAND PRED'] = np.nan # Create new column
df.update(unassessed, overwrite=True)
df

This dataframe could also be used with the Personal Flood Risk Checker code contained in my other notebook formed from this data: UK (English) Postcode Level Flood Risk Analysis.
Here, we will simply look at the predicted data on a map, and compare it with the mapped assessment data.  Note that in the first map below, white areas have already been assessed and therefore have no prediction data.

In [None]:
# Scattergraph of risk level against co-ordinates
df.plot.scatter(x='Longitude', y='Latitude', c='4 BAND PRED', figsize=(10,10), s=5, title = 'Flood Risk ML Prediction Map')

In [None]:
# Repeat the Original map, for comparison:
# Scattergraph of risk level against co-ordinates
df.plot.scatter(x='Longitude', y='Latitude', c='4 BAND INT', figsize=(10,10), s=5, title = 'Flood Risk Map')