# Analyzing River Thames Water Levels
Time series data is everywhere, from watching your stock portfolio to monitoring climate change, and even live-tracking as local cases of a virus become a global pandemic. In this project, you’ll work with a time series that tracks the tide levels of the River Thames. You’ll first load the data and inspect it, and then perform calculations on the dataset to generate summary statistics. You’ll end by reducing the time series to its component attributes and analyzing them.

The original dataset is available from the British Oceanographic Data Centre.

Here's a map of the locations of the tidal meters along the River Thames in London.

![](locations.png)

The provided datasets are in the `data` folder in this workspace. For this project, you will work with one of these files, `10-11_London_Bridge.txt`, which contains comma-separated values for water levels in the River Thames at London Bridge. After you've finished the project, you can reuse the same code to analyze data from the other files (at other spots in the UK where tidal data is collected) if you'd like.

The TXT file contains data for three variables, described in the table below. 

| Variable Name | Description | Format |
| ------------- | ----------- | ------ |
| Date and time | Date and time of measurement in GMT. Note the tide gauge is accurate to one minute. | dd/mm/yyyy hh:mm:ss |
| Water level | High or low water level measured by tide meter. Tide gauges are accurate to 1 centimetre. | metres (Admiralty Chart Datum (CD), Ordnance Datum Newlyn (ODN), or Trinity High Water (THW)) |
| Flag | High water flag = 1, low water flag = 0 | Categorical (0 or 1) |



In [142]:
# Import required packages
import numpy as np
import pandas as pd
import datetime as dt

# Read the "data/10-11_London_Bridge.txt" text file as a pandas DataFrame and store the result in the df variable
df = pd.read_csv("data/10-11_London_Bridge.txt")

# Print the head of the df DataFrame
print(df.head())

         Date and time  water level (m ODN)   flag   HW=1 or LW=0
0  01/05/1911 15:40:00               3.7130      1            NaN
1  02/05/1911 11:25:00              -2.9415      0            NaN
2  02/05/1911 16:05:00               3.3828      1            NaN
3  03/05/1911 11:50:00              -2.6367      0            NaN
4  03/05/1911 16:55:00               2.9256      1            NaN


In [143]:
# Drop the last column of the df DataFrame
df = df.drop(df.columns[-1], axis = 1)

# Rename the columns of the df DataFrame
df.columns = ["date_and_time", "water_level", "flag"]

# Print the head of the df DataFrame
print(df.head())

         date_and_time water_level  flag
0  01/05/1911 15:40:00      3.7130     1
1  02/05/1911 11:25:00     -2.9415     0
2  02/05/1911 16:05:00      3.3828     1
3  03/05/1911 11:50:00     -2.6367     0
4  03/05/1911 16:55:00      2.9256     1
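The extra all-`NaN` column most likely appears because the header line itself contains a comma inside the flag label, so the header has four fields while each data row has three. As a minimal sketch, assuming a raw header like the one reconstructed below (the exact header text in the file is an assumption based on the printed column names), you could also supply clean names at read time instead of renaming afterwards:

```python
import io
import pandas as pd

# Hypothetical in-memory sample that mimics the first rows of the file.
# The data values come from the printed head; the raw header is assumed.
sample = io.StringIO(
    "Date and time, water level (m ODN), flag, HW=1 or LW=0\n"
    "01/05/1911 15:40:00,3.7130,1\n"
    "02/05/1911 11:25:00,-2.9415,0\n"
    "02/05/1911 16:05:00,3.3828,1\n"
)

# header=0 discards the original header row and names= supplies new labels.
# "unused" is a placeholder name for the spurious fourth column.
df_sample = pd.read_csv(
    sample,
    header=0,
    names=["date_and_time", "water_level", "flag", "unused"],
).drop(columns="unused")

print(df_sample)
```

Dropping the placeholder column by name rather than by position makes the intent a little easier to read, at the cost of having to name a column you do not need.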


In [144]:
# Print the data types of the df DataFrame
print(df.dtypes)

date_and_time    object
water_level      object
flag              int64
dtype: object
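The `object` dtype for `water_level` means pandas read the column as strings rather than numbers. The file's contents are not shown here, so the cause is unknown; one possible reason is stray whitespace or a non-numeric entry. If a plain `astype("float64")` ever fails on another file, a more forgiving option is `pd.to_numeric` with `errors="coerce"`. The values below are made up for illustration:

```python
import pandas as pd

# Hypothetical values: two valid readings and one non-numeric placeholder
raw = pd.Series(["3.7130", "-2.9415", "N/A"], name="water_level")

# errors="coerce" converts anything that cannot be parsed into NaN
levels = pd.to_numeric(raw, errors="coerce")

print(levels.dtype)         # float64
print(levels.isna().sum())  # 1
```

Counting the resulting `NaN` values is a quick way to see how many entries could not be converted before deciding whether to drop or inspect them.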


In [145]:
# Convert the "date_and_time" column of the df DataFrame to datetime type, reading the day first (dd/mm/yyyy hh:mm:ss)
df["date_and_time"] = pd.to_datetime(df["date_and_time"], dayfirst=True)

# Convert the "water_level" column of the df DataFrame to float type
df["water_level"] = df["water_level"].astype("float64")

# Create a "year" column in the df DataFrame, containing the year corresponding to the date
df["year"] = df["date_and_time"].dt.year.astype("int64")

In [146]:
# Create two DataFrames, one containing only high-tide data and one containing only low-tide data, and store them in the df_high_tide and df_low_tide variables, respectively
df_high_tide = df[df["flag"] == 1]
df_low_tide = df[df["flag"] == 0]

# Print the heads of the df_high_tide and df_low_tide DataFrames
print(df_high_tide.head(), "\n\n", df_low_tide.head())

        date_and_time  water_level  flag  year
0 1911-05-01 15:40:00       3.7130     1  1911
2 1911-05-02 16:05:00       3.3828     1  1911
4 1911-05-03 16:55:00       2.9256     1  1911
6 1911-05-04 17:45:00       3.1542     1  1911
7 1911-05-05 06:30:00       3.0780     1  1911 

          date_and_time  water_level  flag  year
1  1911-05-02 11:25:00      -2.9415     0  1911
3  1911-05-03 11:50:00      -2.6367     0  1911
5  1911-05-04 12:10:00      -2.4843     0  1911
8  1911-05-05 13:00:00      -2.4843     0  1911
10 1911-05-06 14:25:00      -1.9509     0  1911
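Because the timestamps are written day-first (dd/mm/yyyy), it is worth confirming how they are parsed: by default `pd.to_datetime` reads an ambiguous string such as `01/05/1911` month-first. The sketch below uses two timestamps from the head of the file and an explicit format string, which is one way (not the only one) to remove the ambiguity:

```python
import pandas as pd

raw = pd.Series(["01/05/1911 15:40:00", "02/05/1911 11:25:00"])

# Explicit format: day, then month, matching dd/mm/yyyy hh:mm:ss
parsed = pd.to_datetime(raw, format="%d/%m/%Y %H:%M:%S")
print(parsed.dt.month.tolist())  # [5, 5]
print(parsed.dt.day.tolist())    # [1, 2]

# For comparison, the default reading of the same string is month-first
default = pd.to_datetime("01/05/1911 15:40:00")
print(default.month, default.day)  # 1 5
```

The year is the same under either reading, so the per-year statistics later in the project do not depend on this choice, but any analysis by month or day would.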


In [147]:
# Create a function named IQR which calculates the interquartile range of a pandas Series
def IQR(a):
    """Calculates the interquartile range of a pandas Series
    
    Parameters:
    a (pandas.Series): A Series of numbers
    
    Returns:
    float64: Returns the interquartile range of the Series"""
    q25, q75 = a.quantile([0.25, 0.75])
    return q75 - q25

# Find the mean, median and interquartile range of the water levels in the df_high_tide and df_low_tide DataFrames and store the results as pandas Series in the high_tide_stats and low_tide_stats variables
high_tide_stats = df_high_tide["water_level"].agg(["mean", "median", IQR])
low_tide_stats = df_low_tide["water_level"].agg(["mean", "median", IQR])

# Print the high_tide_stats and low_tide_stats Series
print(high_tide_stats, "\n\n", low_tide_stats)

mean      3.318373
median    3.352600
IQR       0.743600
Name: water_level, dtype: float64 

 mean     -2.383737
median   -2.412900
IQR       0.538200
Name: water_level, dtype: float64
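The interquartile range is the 75th percentile minus the 25th percentile. As a quick sanity check on made-up numbers, the same quantile-based calculation used in `IQR` can be compared against `np.percentile`; both use linear interpolation by default, so they should agree:

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up numbers

# pandas: unpack the two requested quantiles, then subtract
q25, q75 = values.quantile([0.25, 0.75])
iqr_pandas = q75 - q25

# NumPy: same calculation, with percentiles given on a 0-100 scale
q25_np, q75_np = np.percentile(values, [25, 75])
iqr_numpy = q75_np - q25_np

print(iqr_pandas, iqr_numpy)  # 2.0 2.0
```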


In [148]:
# Calculate the 90th percentile of the water levels in the days with high tide, and store the result in the qh90 variable
qh90 = df_high_tide["water_level"].quantile(0.90)

# Calculate the number of days with high tide levels per year, and store the result as a pandas Series in the high_tide_days_per_year variable
high_tide_days_per_year = df_high_tide.groupby("year")["water_level"].count()

# Calculate the number of days with very high tide levels (i.e., with water levels above the 90th percentile of high tide days) per year, and store the result as a pandas Series in the very_high_tide_days_per_year variable
very_high_tide_days_per_year = df_high_tide[df_high_tide["water_level"] > qh90].groupby("year")["water_level"].count()

# Calculate the annual ratio of days with very high tide levels (i.e., with water levels above the 90th percentile of high tide days) for each year and store the results as a pandas DataFrame in the df_very_high_tide_yearly_ratio variable with the index reset
df_very_high_tide_yearly_ratio = (very_high_tide_days_per_year / high_tide_days_per_year).reset_index()

# Print the head of the df_very_high_tide_yearly_ratio DataFrame
print(df_very_high_tide_yearly_ratio.head())

   year  water_level
0  1911     0.004098
1  1912     0.032316
2  1913     0.082212
3  1914     0.055313
4  1915     0.045045
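One detail to keep in mind when dividing two grouped counts: a year with no readings above the threshold is absent from the numerator, so its ratio comes out as `NaN` rather than 0. Whether that happens in this dataset is not shown here. The sketch below uses made-up readings and a hypothetical threshold to contrast the count-based ratio with an alternative, the per-year mean of a boolean mask:

```python
import pandas as pd

# Made-up readings: 1911 has no reading above the threshold
toy = pd.DataFrame({
    "year": [1911, 1911, 1912, 1912, 1912, 1912],
    "water_level": [3.0, 3.1, 3.0, 3.6, 3.7, 3.2],
})
threshold = 3.5  # hypothetical cutoff standing in for qh90

# Ratio of two groupby counts: 1911 is missing from the numerator, so NaN
counts_all = toy.groupby("year")["water_level"].count()
counts_high = (
    toy[toy["water_level"] > threshold].groupby("year")["water_level"].count()
)
ratio_from_counts = counts_high / counts_all

# Mean of a boolean mask per year: True counts as 1, so 1911 gets 0.0
ratio_from_mean = (toy["water_level"] > threshold).groupby(toy["year"]).mean()

print(ratio_from_counts)
print(ratio_from_mean)
```

Both approaches give 0.5 for 1912; they differ only in how a year with no exceedances is reported.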


In [149]:
# Calculate the 10th percentile of the water levels in the days with low tide, and store the result in the ql10 variable
ql10 = df_low_tide["water_level"].quantile(0.1)

# Calculate the number of days with low tide levels per year, and store the result as a pandas Series in the low_tide_days_per_year variable
low_tide_days_per_year = df_low_tide.groupby("year")["water_level"].count()

# Calculate the number of days with very low tide levels (i.e., with water levels below the 10th percentile of low tide days) per year, and store the result as a pandas Series in the very_low_tide_days_per_year variable
very_low_tide_days_per_year = df_low_tide[df_low_tide["water_level"] < ql10].groupby("year")["water_level"].count()

# Calculate the annual ratio of days with very low tide levels (i.e., with water levels below the 10th percentile of low tide days) for each year and store the results as a pandas DataFrame in the df_very_low_tide_yearly_ratio variable with the index reset
df_very_low_tide_yearly_ratio = (very_low_tide_days_per_year / low_tide_days_per_year).reset_index()

# Print the head of the df_very_low_tide_yearly_ratio DataFrame
print(df_very_low_tide_yearly_ratio.head())

   year  water_level
0  1911     0.060606
1  1912     0.066667
2  1913     0.022388
3  1914     0.039017
4  1915     0.033435


In [150]:
# Create a dictionary summarizing all of the above data analysis and store it in the solution variable
solution = {
    "high_statistics" : high_tide_stats,
    "low_statistics" : low_tide_stats,
    "very_high_ratio" : df_very_high_tide_yearly_ratio,
    "very_low_ratio" : df_very_low_tide_yearly_ratio
}

# Print the solution dictionary
print(solution)

{'high_statistics': mean      3.318373
median    3.352600
IQR       0.743600
Name: water_level, dtype: float64, 'low_statistics': mean     -2.383737
median   -2.412900
IQR       0.538200
Name: water_level, dtype: float64, 'very_high_ratio':     year  water_level
0   1911     0.004098
1   1912     0.032316
2   1913     0.082212
3   1914     0.055313
4   1915     0.045045
..   ...          ...
80  1991     0.096317
81  1992     0.103253
82  1993     0.145923
83  1994     0.150355
84  1995     0.170213

[85 rows x 2 columns], 'very_low_ratio':     year  water_level
0   1911     0.060606
1   1912     0.066667
2   1913     0.022388
3   1914     0.039017
4   1915     0.033435
..   ...          ...
80  1991     0.150355
81  1992     0.107496
82  1993     0.112696
83  1994     0.106383
84  1995     0.107801

[85 rows x 2 columns]}
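Both ratio DataFrames share a `year` column and name their value column `water_level`, which makes them awkward to compare side by side. A minimal sketch of one way to combine them, using the first three rows of each table from the output above (the suffix names are arbitrary choices, not part of the project):

```python
import pandas as pd

# First three rows of each ratio table, copied from the printed output
very_high = pd.DataFrame({
    "year": [1911, 1912, 1913],
    "water_level": [0.004098, 0.032316, 0.082212],
})
very_low = pd.DataFrame({
    "year": [1911, 1912, 1913],
    "water_level": [0.060606, 0.066667, 0.022388],
})

# Join on year; the suffixes disambiguate the two water_level columns
combined = very_high.merge(
    very_low, on="year", suffixes=("_very_high", "_very_low")
)

print(combined)
```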
