
# Analyzing River Thames Water Levels
Time series data is everywhere, from watching your stock portfolio to monitoring climate change, and even live-tracking as local cases of a virus become a global pandemic. In this project, you’ll work with a time series that tracks the tide levels of the River Thames. You’ll first load the data and inspect it visually, and then perform calculations on the dataset to generate some summary statistics. You’ll end by decomposing the time series into its component attributes and analyzing them.

The original dataset is available from the British Oceanographic Data Centre [here](https://www.bodc.ac.uk/data/published_data_library/catalogue/10.5285/b66afb2c-cd53-7de9-e053-6c86abc0d251), and you can read all about this fascinating archival story in [this article](https://www.nature.com/articles/s41597-022-01223-7) published in *Scientific Data*, a Nature Portfolio journal.

Here's a map of the locations of the tidal gauges along the River Thames in London.

![](locations.png)

The dataset consists of 13 `.txt` files containing comma-separated data, along with a description file called `Data_description.pdf`. We'll begin with one of them, the London Bridge gauge, and prepare it for analysis. The same code can be used later to analyze data from the other files (i.e. other gauges along the river). Each file contains the following variables:



| Variable Name | Description | Format |
| ------------- | ----------- | ------ |
| Date and time | Date and time of measurement to GMT. Note the tide gauge is accurate to one minute. | dd/mm/yyyy hh:mm:ss |
| Water level | High or low water level measured by tide gauge. Tide gauges are accurate to 1 centimetre. | metres (Admiralty Chart Datum (CD), Ordnance Datum Newlyn (ODN), or Trinity High Water (THW)) |
| Flag | High water flag = 1, low water flag = 0 | Categorical (0 or 1) |

In [None]:
# Import the libraries used in this notebook
import pandas as pd    # for data manipulation

import numpy as np
import seaborn as sns

import matplotlib.pyplot as plt
import plotly.express as px


In [None]:
# Load the data from London Bridge
lb = pd.read_csv('data/10-11_London_Bridge.txt') # Comma-separated .txt file
lb.head()

Unnamed: 0,Date and time,water level (m ODN),flag,HW=1 or LW=0
0,01/05/1911 15:40:00,3.713,1,
1,02/05/1911 11:25:00,-2.9415,0,
2,02/05/1911 16:05:00,3.3828,1,
3,03/05/1911 11:50:00,-2.6367,0,
4,03/05/1911 16:55:00,2.9256,1,


In [None]:
# Take only the first three columns
df = lb.iloc[:, :3].copy()  # .copy() avoids SettingWithCopyWarning when columns are added later

In [None]:
# Rename columns
df.columns = ['datetime', 'water_level', 'is_high_tide']

In [None]:
# Convert to datetime; the source format is dd/mm/yyyy hh:mm:ss, so parse day-first
df['datetime'] = pd.to_datetime(df['datetime'], dayfirst=True)

# Convert to float
df['water_level'] = df['water_level'].astype(float)

# Create extra month and year columns for easy access
df['month'] = df['datetime'].dt.month
df['year'] = df['datetime'].dt.year
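
Since the other gauge files share the same layout, the preparation steps above could be wrapped in a single function. The following is only a sketch: `load_gauge` is a hypothetical helper name, and it assumes that every file has the timestamp, water level, and flag as its first three columns, as the London Bridge file does. The small in-memory sample stands in for a real file.

```python
import io

import pandas as pd


def load_gauge(source):
    """Load one tide-gauge file into a tidy DataFrame (hypothetical helper)."""
    raw = pd.read_csv(source)
    # Assumption: the first three columns are timestamp, water level, and flag
    df = raw.iloc[:, :3].copy()
    df.columns = ['datetime', 'water_level', 'is_high_tide']
    # Timestamps are documented as dd/mm/yyyy hh:mm:ss, so parse day-first
    df['datetime'] = pd.to_datetime(df['datetime'], dayfirst=True)
    df['water_level'] = df['water_level'].astype(float)
    # Extra month and year columns for easy grouping
    df['month'] = df['datetime'].dt.month
    df['year'] = df['datetime'].dt.year
    return df


# Two made-up rows in the same layout as the London Bridge file
sample = io.StringIO(
    "Date and time,water level (m ODN),flag,HW=1 or LW=0\n"
    "01/05/1911 15:40:00,3.713,1,\n"
    "02/05/1911 11:25:00,-2.9415,0,\n"
)
gauge = load_gauge(sample)
print(gauge)
```

With a real file, the call would look like `load_gauge('data/10-11_London_Bridge.txt')`. Parsing day-first matters here: without it, `01/05/1911` could be read as 5 January rather than 1 May.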

In [None]:
# Filter df for high and low tide
tide_high = df.query('is_high_tide==1')['water_level']
tide_low = df.query('is_high_tide==0')['water_level']

In [None]:

# Create summary statistics (rounded to 2 decimals, except where noted)
summary_statistics = {
    'tide_high': {
        'mean': round(tide_high.mean(), 2),
        'median': round(tide_high.median(), 2),
        'interquartile_range': round(tide_high.quantile(.75) - tide_high.quantile(.25), 2),
    },
    'tide_low': {
        # round() is called without a digits argument here, so the
        # low-tide mean is rounded to the nearest integer
        'mean': round(tide_low.mean()),
        'median': round(tide_low.median(), 2),
        'interquartile_range': round(tide_low.quantile(.75) - tide_low.quantile(.25), 2),
    },
}

summary_statistics

{'tide_high': {'mean': 3.32, 'median': 3.35, 'interquartile_range': 0.74},
 'tide_low': {'mean': -2, 'median': -2.41, 'interquartile_range': 0.54}}
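
The interquartile range reported above is the difference between the 75th and 25th percentiles of the water levels. As a minimal sketch, the same three statistics can be computed by one helper so that every value is rounded the same way. The name `tide_summary` and the sample values are made up for illustration and are not part of the dataset.

```python
import pandas as pd


def tide_summary(levels, ndigits=2):
    """Return mean, median, and IQR of a Series, all rounded alike (hypothetical helper)."""
    q1 = levels.quantile(0.25)  # 25th percentile
    q3 = levels.quantile(0.75)  # 75th percentile
    return {
        'mean': round(float(levels.mean()), ndigits),
        'median': round(float(levels.median()), ndigits),
        'interquartile_range': round(float(q3 - q1), ndigits),
    }


# Made-up water levels in metres
example_levels = pd.Series([3.0, 3.2, 3.4, 3.6, 3.8])
example_summary = tide_summary(example_levels)
print(example_summary)
```

For these five values the quartiles fall exactly on 3.2 and 3.6, so the interquartile range is 0.4.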

In [None]:
# Per year, the share of high-tide readings above the 75th percentile of all high-tide levels
all_high_days = df.query('is_high_tide==1').groupby('year').count()['water_level']
high_days = df.query(f'(water_level>{tide_high.quantile(.75)}) & (is_high_tide==1)').groupby('year').count()['water_level']
high_ratio = (high_days/all_high_days).reset_index()


In [None]:
# Per year, the share of low-tide readings below the 25th percentile of all low-tide levels
all_low_days = df.query('is_high_tide==0').groupby('year').count()['water_level']
low_days = df.query(f'(water_level<{tide_low.quantile(.25)}) & (is_high_tide==0)').groupby('year').count()['water_level']
low_ratio = (low_days/all_low_days).reset_index()
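
The two ratio cells above divide a per-year count of extreme readings by a per-year count of all readings. An alternative sketch, shown here on made-up data, takes the per-year mean of a boolean mask. One difference worth noting: with the count-based division, a year with no readings beyond the threshold is missing from the numerator and would come out as `NaN`, whereas the mask-based version gives `0.0`. The column name `high_ratio` and the toy values are assumptions for illustration only.

```python
import pandas as pd

# Made-up high-tide readings for two years
toy = pd.DataFrame({
    'year': [1911, 1911, 1911, 1911, 1912, 1912, 1912, 1912],
    'water_level': [3.0, 3.1, 3.2, 3.3, 3.4, 3.8, 3.9, 4.0],
    'is_high_tide': [1, 1, 1, 1, 1, 1, 1, 1],
})

high = toy[toy['is_high_tide'] == 1]
# 75th percentile over all high-tide readings, across every year
threshold = high['water_level'].quantile(0.75)

# The mean of a True/False mask is the fraction of True values in each year
toy_high_ratio = (
    (high['water_level'] > threshold)
    .groupby(high['year'])
    .mean()
    .rename('high_ratio')
    .reset_index()
)
print(toy_high_ratio)
```

In this toy data only the two highest readings of 1912 exceed the threshold, so 1911 has a ratio of 0.0 and 1912 a ratio of 0.5.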

In [None]:
solution = {'summary_statistics':summary_statistics, 'high_ratio': high_ratio, 'low_ratio':low_ratio}
print(solution)

{'summary_statistics': {'tide_high': {'mean': 3.32, 'median': 3.35, 'interquartile_range': 0.74}, 'tide_low': {'mean': -2, 'median': -2.41, 'interquartile_range': 0.54}}, 'high_ratio':     year  water_level
0   1911     0.032787
1   1912     0.127469
2   1913     0.186846
3   1914     0.161572
4   1915     0.219219
..   ...          ...
80  1991     0.252125
81  1992     0.265912
82  1993     0.317597
83  1994     0.357447
84  1995     0.324823

[85 rows x 2 columns], 'low_ratio':     year  water_level
0   1911     0.203463
1   1912     0.192793
2   1913     0.102985
3   1914     0.141618
4   1915     0.139818
..   ...          ...
80  1991     0.312057
81  1992     0.265912
82  1993     0.252496
83  1994     0.252482
84  1995     0.246809

[85 rows x 2 columns]}
