# How many people in the UK are exposed to unacceptable levels of noise?

We looked at a [simple dataset on noise pollution](https://github.com/pwalsh/notebooks/blob/master/opendataprojects/analyse.ipynb) earlier, and here we'll look at it in a bit more detail.

The goal is to build a narrative with and around the data.

## Context

The Environment Protection Act of 1997 sets out [seven zones for the purpose of defining acceptable noise levels](http://www.hiil.org/bestpractices/How%20to%20determine%20acceptable%20levels%20of%20noise%20nuisance%20(UK)). 

Prolonged exposure to noise pollution can have a [negative effect on health](https://en.wikipedia.org/wiki/Health_effects_from_noise).

How are noise levels measured? The "[Environmental Noise Directive](http://ec.europa.eu/environment/noise/directive_en.htm)" of the EU requires member states to publish information on noise levels for:

- agglomerations with more than 100,000 inhabitants
- major roads (more than 3 million vehicles a year)
- major railways (more than 30,000 trains a year)
- major airports (more than 50,000 movements a year, including small aircraft and helicopters)

There are [two ways](http://www.noisemap.ltd.uk/home/eu%20noise%20directive.html) of measuring noise levels:

1. Lden is the equivalent continuous noise level over a whole 24-hour period, but with noise in the evening (19:00 to 23:00) increased by 5 dB(A) and noise at night (23:00 to 07:00) increased by 10 dB(A) to reflect the greater noise-sensitivity of people at those times.
2. Lnight is the equivalent continuous noise level over the night-time period (23:00 to 07:00). Lnight does not contain any night-time noise weighting.
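
As a rough sketch of how the Lden weighting works, the function below combines day, evening and night levels using the formula given in Annex I of the directive. It assumes a 12-hour day (07:00 to 19:00), a 4-hour evening and an 8-hour night. The function name `lden` and the example levels are illustrative and do not come from the dataset.

```python
import math


def lden(l_day, l_evening, l_night):
    """Combine day, evening and night levels (in dB(A)) into a single Lden value."""
    # Energy-average the three periods, weighted by their length in hours,
    # after adding the 5 dB(A) evening and 10 dB(A) night penalties.
    energy = (
        12 * 10 ** (l_day / 10)
        + 4 * 10 ** ((l_evening + 5) / 10)
        + 8 * 10 ** ((l_night + 10) / 10)
    ) / 24
    return 10 * math.log10(energy)


# A constant 60 dB(A) around the clock gives an Lden above 60 because of the penalties.
steady = lden(60, 60, 60)
```

Because of the penalties, the same physical noise level counts for more at night than during the day, which is why Lden and Lnight are reported separately.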

The file of data we sourced has been published in compliance with this directive.

In [67]:
# just for presentation in notebooks
from pprint import pprint as print

# Request

As previously, we want to get our data into a Python data structure we can work with.
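
To make the target structure concrete before fetching the real file, here is a minimal offline sketch of what `csv.DictReader` produces: one dictionary per row, keyed by the header names. The two rows and their place names are made up for illustration; only the column names follow the real file.

```python
import csv

# Two made-up rows that mimic the shape of the real file (values are illustrative only).
sample_csv = """\
Location/Agglomeration,AgglomerationPopulation,Road_Pop_Lden>=75dB
Exampleton,100000,500
Sampleville,n/a,n/a
"""

reader = csv.DictReader(sample_csv.splitlines(), delimiter=',')
rows = list(reader)

# Every value starts out as a string, including the numbers and the 'n/a' markers.
first = rows[0]
```

Since every value arrives as a string, the cleaning step that follows coerces numeric values to integers and replaces `'n/a'` with zero.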

In [68]:
import operator
import requests
import csv
from bokeh.charts import Bar, show, output_notebook


def clean(row):
    """Clean rows of data."""
    
    # remove a noisy row from our data - it is not about a specific agglomeration
    if row['Location/Agglomeration'] == 'Major sources (outside agglomerations)':
        row = None
    else:
        for key, value in row.items():
            # some of the population counts show 'n/a'; treat them as zero.
            if value == 'n/a':
                row[key] = 0
                continue
            # when a value can be coerced to an integer, then do it;
            # anything else (such as a place name) is left as it is.
            try:
                row[key] = int(value)
            except (TypeError, ValueError):
                pass
    return row


csv_source = 'http://data.defra.gov.uk/env/strategic_noise_mapping/r2_strategic_noise_mapping.csv'

csv_delimiter = ','

response = requests.get(csv_source)

raw = response.text.splitlines()

reader = csv.DictReader(raw, delimiter=csv_delimiter)

data = []

for row in reader:
    row = clean(row)
    if row:
        data.append(row)



## Sample

In [69]:
print(data[0])

{'AgglomerationPopulation': 895000,
 'Industry_Pop_Lden>=55dB': 1400,
 'Industry_Pop_Lden>=60dB': 600,
 'Industry_Pop_Lden>=65dB': 100,
 'Industry_Pop_Lden>=70dB': 0,
 'Industry_Pop_Lden>=75dB': 0,
 'Industry_Pop_Lnight>=50dB': 1100,
 'Industry_Pop_Lnight>=55dB': 400,
 'Industry_Pop_Lnight>=60dB': 100,
 'Industry_Pop_Lnight>=65dB': 0,
 'Industry_Pop_Lnight>=70dB': 0,
 'Location/Agglomeration': 'Tyneside',
 'Railways_Pop_Lden>=55dB': 14200,
 'Railways_Pop_Lden>=60dB': 8100,
 'Railways_Pop_Lden>=65dB': 3900,
 'Railways_Pop_Lden>=70dB': 1700,
 'Railways_Pop_Lden>=75dB': 200,
 'Railways_Pop_Lnight>=50dB': 10400,
 'Railways_Pop_Lnight>=55dB': 6000,
 'Railways_Pop_Lnight>=60dB': 2500,
 'Railways_Pop_Lnight>=65dB': 1100,
 'Railways_Pop_Lnight>=70dB': 0,
 'Road_Pop_Lden>=55dB': 166400,
 'Road_Pop_Lden>=60dB': 79200,
 'Road_Pop_Lden>=65dB': 46100,
 'Road_Pop_Lden>=70dB': 18200,
 'Road_Pop_Lden>=75dB': 1300,
 'Road_Pop_Lnight>=50dB': 94800,
 'Road_Pop_Lnight>=55dB': 51600,
 'Road_Pop_Lnight>=60d

## What is our data about?

Looking at the sample, we can see that the data is fairly comprehensive: it covers both Lden and Lnight, and gives counts across a range of noise levels for each of railways, roads and industry (no airports!).

In addition to providing us with a count for people exposed to noise pollution in this matrix of conditions, the data provides a total population count for each agglomeration, which allows us to make some useful calculations without sourcing additional data.
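
As a small worked example of such a calculation, the sketch below takes the top Lden band from the Tyneside sample row and expresses it as a share of the agglomeration's population. The column names suggest the bands are cumulative (the `>=55dB` count appears to include everyone in `>=60dB`), so only one band is used per source. Adding the three sources together assumes that nobody is counted under more than one source, which the data does not confirm.

```python
# Values copied from the Tyneside sample row shown above.
tyneside = {
    'AgglomerationPopulation': 895000,
    'Industry_Pop_Lden>=75dB': 0,
    'Railways_Pop_Lden>=75dB': 200,
    'Road_Pop_Lden>=75dB': 1300,
}

# Sum the top Lden band across the three sources, then express it as a
# percentage of the agglomeration's population.
exposed = sum(
    value for key, value in tyneside.items() if key.endswith('Lden>=75dB')
)
share = exposed / tyneside['AgglomerationPopulation'] * 100
```

For Tyneside this comes to 1,500 people, or roughly 0.17% of the population.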

# Analyse

As previously, we'll extract some information about the most populated and least populated areas.

In [70]:
# high-level data points
columns = len(data[0].keys())

rows = len(data)

most_populated = max(data, key=operator.itemgetter('AgglomerationPopulation'))

high_exposure_count = sum([
    most_populated['Industry_Pop_Lden>=75dB'], 
    most_populated['Railways_Pop_Lden>=75dB'], 
    most_populated['Road_Pop_Lden>=75dB']
])

high_exposure_percent = '{0:.2f}%'.format(
    (high_exposure_count / most_populated['AgglomerationPopulation']) * 100
)

# A factual statement, according to this data source.
statement = """\
The data holds {columns} columns of data for {rows} different "Agglomerations". \
The most populated "Agglomeration" is "{place_name}" with a population of {pop_count}. \
Out of this population, {high_exposure_count} ({high_exposure_percent}) people are exposed to very high levels of noise pollution \
from industry, railway and road sources.\
""".format(
    columns=columns, 
    rows=rows, 
    place_name=most_populated['Location/Agglomeration'], 
    pop_count=most_populated['AgglomerationPopulation'], 
    high_exposure_count=high_exposure_count,
    high_exposure_percent=high_exposure_percent
)

print(statement)

('The data holds 32 columns of data for 65 different "Agglomerations". The '
 'most populated "Agglomeration" is "Greater London Urban Area" with a '
 'population of 9300000. Out of this population, 117400 (1.26%) people are '
 'exposed to very high levels of noise pollution from industry, railway and '
 'road sources.')


### Preparing a view on the data

Looking at the data, it would be interesting to rank the agglomerations by the percentage of their population exposed to potentially unhealthy levels of noise pollution, from the highest percentage to the lowest.

In [71]:
for row in data:
    exposed_count = sum([
        row['Industry_Pop_Lden>=75dB'], 
        row['Railways_Pop_Lden>=75dB'], 
        row['Road_Pop_Lden>=75dB']
    ])
    row['Exposed'] = (exposed_count / row['AgglomerationPopulation']) * 100

data = sorted(data, key=operator.itemgetter('Exposed'), reverse=True)

view = []

for row in data:
    simplified = {}
    for key, value in row.items():
        if key in ('Exposed', 'Location/Agglomeration'):
            simplified[key] = value
    view.append(simplified)

            
print(view[:9])

[{'Exposed': 1.2623655913978495,
  'Location/Agglomeration': 'Greater London Urban Area'},
 {'Exposed': 1.25, 'Location/Agglomeration': 'Slough Urban Area'},
 {'Exposed': 0.6923076923076923,
  'Location/Agglomeration': 'Doncaster Urban Area'},
 {'Exposed': 0.6181818181818182,
  'Location/Agglomeration': 'Preston Urban Area'},
 {'Exposed': 0.4705882352941176, 'Location/Agglomeration': 'Wigan Urban Area'},
 {'Exposed': 0.42827442827442824,
  'Location/Agglomeration': 'Greater Manchester Urban Area'},
 {'Exposed': 0.39215686274509803, 'Location/Agglomeration': 'Luton/Dunstable'},
 {'Exposed': 0.3770491803278689,
  'Location/Agglomeration': 'Bristol Urban Area'},
 {'Exposed': 0.34615384615384615, 'Location/Agglomeration': 'Plymouth'}]
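
The calculation above divides by `AgglomerationPopulation`, and `clean()` turns `'n/a'` into `0`, so a row with a missing population would raise `ZeroDivisionError`. A more defensive variant might look like the sketch below. The helper name `exposure_percent`, the constant `TOP_BAND_KEYS` and the two rows are hypothetical and serve only to exercise both branches.

```python
TOP_BAND_KEYS = (
    'Industry_Pop_Lden>=75dB',
    'Railways_Pop_Lden>=75dB',
    'Road_Pop_Lden>=75dB',
)


def exposure_percent(row):
    """Return the share of a row's population in the top Lden band, as a percentage."""
    population = row['AgglomerationPopulation']
    # clean() turns 'n/a' into 0, so guard against dividing by zero.
    if not population:
        return 0.0
    return sum(row[key] for key in TOP_BAND_KEYS) / population * 100


# Hypothetical rows, used only to exercise both branches.
rows = [
    {'Location/Agglomeration': 'A', 'AgglomerationPopulation': 200000,
     'Industry_Pop_Lden>=75dB': 0, 'Railways_Pop_Lden>=75dB': 100,
     'Road_Pop_Lden>=75dB': 900},
    {'Location/Agglomeration': 'B', 'AgglomerationPopulation': 0,
     'Industry_Pop_Lden>=75dB': 0, 'Railways_Pop_Lden>=75dB': 0,
     'Road_Pop_Lden>=75dB': 0},
]

# Rank by exposure and keep only the two fields needed for the view.
ranked = sorted(rows, key=exposure_percent, reverse=True)
view = [
    {'Location/Agglomeration': row['Location/Agglomeration'],
     'Exposed': exposure_percent(row)}
    for row in ranked
]
```

Returning `0.0` for an unknown population is a design choice. Dropping such rows from the ranking would be just as reasonable.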


# Visualise

Let's show the percentage of each population exposed to unsafe noise levels on a bar chart.

In [72]:
chart = Bar(view, 'Location/Agglomeration', values='Exposed', legend=False,
            title='Percentage of population exposed to high noise pollution',
            plot_width=1000)

show(chart)
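
The `bokeh.charts` module used above was split out of later Bokeh releases into the separate `bkcharts` package, so the cell may not run in a current environment. As a dependency-free fallback, here is a minimal sketch that renders the same ranking as plain-text bars. The helper `text_bars` is hypothetical; the three rows are copied from the printed view above.

```python
# Top three values copied from the printed view above.
view = [
    {'Exposed': 1.2623655913978495, 'Location/Agglomeration': 'Greater London Urban Area'},
    {'Exposed': 1.25, 'Location/Agglomeration': 'Slough Urban Area'},
    {'Exposed': 0.6923076923076923, 'Location/Agglomeration': 'Doncaster Urban Area'},
]


def text_bars(rows, width=40):
    """Render one line per row, with a bar scaled to the largest 'Exposed' value."""
    largest = max(row['Exposed'] for row in rows)
    label_width = max(len(row['Location/Agglomeration']) for row in rows)
    lines = []
    for row in rows:
        # Scale each bar relative to the largest value in the view.
        bar = '#' * round(row['Exposed'] / largest * width)
        lines.append('{0:<{1}} {2} {3:.2f}%'.format(
            row['Location/Agglomeration'], label_width, bar, row['Exposed']))
    return lines


chart_lines = text_bars(view)
```

Joining `chart_lines` with newlines gives a quick visual check of the ranking. The sketch assumes at least one row has a non-zero `Exposed` value.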

## Conclusions

The above is a very simple exploration of a small dataset.

A more thorough inquiry would source additional data on noise pollution, along with contextual information that illustrates and situates the data. Some examples of such contextual datasets could be:

- socioeconomic indicators related to each agglomeration
- the relationship between jobs and industry in a particular agglomeration
- the relationship between noise pollution and housing pricing within each agglomeration, or, across agglomerations

We could also use different chart types, customise our charts further, and so on.

Here, we just want to consider what we've learned, and identify some directions that a more thorough inquiry could follow.

- Based on the basic facts we established at the start, it seems that a number of people live exposed to noise pollution well above acceptable levels.
- Is there a correlation between socioeconomic factors and exposure to noise pollution?
- Can we track possible health implications (depersonalised information from hospitals, for example)?
- Are these figures correct, or is there a more complete dataset available?
- With additional data, is there a story here worth telling as a small app, a blog post, etc.?