# Weather Data Cornwall

- This weather data corresponds to the river gauge data I've gathered for Cornwall over the last 20 years. I want to match the two temporally before feature engineering and training an LSTM, either to (1) predict water levels or (2) build a flood classifier (not sure which yet).

- Will also need flood labels 
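One possible way to do the temporal matching is `pandas.merge_asof`, which attaches to each gauge reading the most recent weather observation. The sketch below is only an illustration: the gauge frame, its 15-minute spacing, and the `level_m` column are assumptions, not the actual gauge data.

```python
import pandas as pd

# Hypothetical gauge readings (column names and values are assumptions)
gauge = pd.DataFrame({
    "time": pd.to_datetime(["2020-01-01 00:15", "2020-01-01 01:00", "2020-01-01 01:45"]),
    "level_m": [0.52, 0.55, 0.61],
})

# Hourly weather observations, shaped like the Meteostat data used below
weather = pd.DataFrame({
    "time": pd.to_datetime(["2020-01-01 00:00", "2020-01-01 01:00"]),
    "prcp": [0.0, 1.2],
})

# For each gauge reading, attach the latest weather row that is at most
# one hour old; both frames must be sorted by the key column.
matched = pd.merge_asof(
    gauge.sort_values("time"),
    weather.sort_values("time"),
    on="time",
    direction="backward",
    tolerance=pd.Timedelta("1h"),
)
print(matched)
```

With several stations, the same call could be run per station (or with the `by` argument) once each gauge has been assigned a weather station.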


# Weather Dataset Column Descriptions

The dataset contains various meteorological measurements recorded hourly or daily. Each column in the dataset represents a specific aspect of the weather data collected at different stations. Here’s a breakdown of each column:

- **temp**: The air temperature in degrees Celsius (°C) at the time of observation.
- **dwpt**: The dew point temperature in degrees Celsius (°C). The dew point is the temperature to which air must be cooled to become saturated with water vapor, assuming constant pressure and water vapor content.
- **rhum**: Relative humidity in percentage (%). This measures the amount of water vapor present in the air relative to the amount needed for saturation at the same temperature.
- **prcp**: Precipitation amount in millimeters (mm). This indicates how much rain, sleet, snow, etc., has fallen during a given period.
- **snow**: Snow depth in millimeters (mm). This records the depth of snow on the ground at the time of observation.
- **wdir**: Wind direction in degrees (°). This is the direction from which the wind is coming, with 0 degrees representing north, 90 degrees east, 180 degrees south, and 270 degrees west.
- **wspd**: Wind speed in kilometers per hour (km/h). This is the speed of the wind observed at the time.
- **wpgt**: Wind gust peak in kilometers per hour (km/h). This records the highest speed of a wind gust during the observation period.
- **pres**: Atmospheric pressure in hectopascals (hPa). Also known as barometric pressure, it refers to the pressure exerted by the atmosphere at the point of observation.
- **tsun**: Sunshine duration in minutes (min). This measures the total amount of direct sunlight received during a given period.
- **coco**: Weather condition codes, which are numerical representations of the current weather conditions. These codes correspond to specific weather phenomena, such as clear skies, rain, thunderstorms, etc.
- **station_id**: A unique identifier for each weather station from which data is collected. This helps differentiate data entries based on their origin.

This structured data is essential for various applications, including weather forecasting, climate research, agricultural planning, and environmental monitoring.
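Because `wdir` is an angle, 350° and 10° are close in direction but far apart numerically. A common option for feature engineering (an assumption here, not something this notebook does yet) is to encode the direction as a sine/cosine pair. The helper name `encode_wind_direction` is made up for this sketch.

```python
import math

def encode_wind_direction(wdir_degrees):
    """Map a direction in degrees to (sin, cos) so that 0 and 360 coincide."""
    radians = math.radians(wdir_degrees % 360)
    return math.sin(radians), math.cos(radians)

north = encode_wind_direction(0)
also_north = encode_wind_direction(360)
east = encode_wind_direction(90)
print(north, also_north, east)
```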


In [7]:


import requests
import pandas as pd
import meteostat
from meteostat import Stations, Daily, Hourly
from datetime import datetime

In [6]:
# Find stations near a location (e.g., Cornwall, UK)
stations = Stations()
stations = stations.nearby(50.2660, -5.0527)  # Latitude and longitude of Cornwall
station = stations.fetch(10)  # Fetch the 10 closest stations

station


id,name,country,region,wmo,icao,latitude,longitude,elevation,timezone,hourly_start,hourly_end,daily_start,daily_end,monthly_start,monthly_end,distance
03817,Saint Mawgan,GB,ENG,3817.0,EGDG,50.4333,-5.0,119.0,Europe/London,1973-01-01,2008-12-01,1973-01-03,2008-11-30,1978-01-01,2008-01-01,18974.990622
03808,Camborne,GB,ENG,3808.0,,50.2167,-5.3167,87.0,Europe/London,2018-01-27,2024-07-14,1973-01-01,2024-07-02,1973-01-01,2022-01-01,19558.379501
EGHQ0,Newquay Cornwall / Saint Mawgan,GB,ENG,,EGHQ,50.4406,-4.9954,119.0,Europe/London,2020-01-14,2024-07-14,2020-01-14,2022-04-27,2020-01-01,2022-01-01,19835.69607
03809,Culdrose,GB,ENG,3809.0,EGDR,50.0833,-5.25,78.0,Europe/London,1973-01-01,2024-07-14,1973-01-02,2022-04-25,1979-01-01,2022-01-01,24700.854794
EGHC0,Land's End / Lands End / Crows-an-Wra,GB,ENG,,EGHC,50.1028,-5.6706,122.0,Europe/London,1973-03-08,2024-07-13,2021-09-09,2021-09-09,NaT,NaT,47590.142041
03827,Plymouth,GB,ENG,3827.0,EGDB,50.35,-4.1167,50.0,Europe/London,1973-01-01,2024-07-14,1973-01-05,2022-04-24,1950-01-01,2022-01-01,67123.358322
EGHD0,Plymouth / Crown Hill,GB,ENG,,EGHD,50.4167,-4.1167,25.0,Europe/London,1988-03-18,2011-12-23,NaT,NaT,NaT,NaT,68504.717338
03803,"Scilly, Saint Mary'S",GB,ENG,3803.0,EGHE,49.9167,-6.3,31.0,Europe/London,1986-01-01,2024-07-13,1986-01-01,2022-03-26,2005-01-01,2021-01-01,97086.893998
03707,Chivenor,GB,ENG,3707.0,EGDC,51.0833,-4.15,8.0,Europe/London,1973-01-01,2024-06-22,1973-01-02,2022-03-25,2005-01-01,2021-01-01,110927.722699
03839,Exeter Airport,GB,ENG,3839.0,EGTE,50.7333,-3.4167,30.0,Europe/London,1973-01-01,2024-07-13,1973-01-06,2022-04-24,1982-01-01,2022-01-01,126840.578336


In [4]:
ids = station.index.to_list()
ids

['03817',
 '03808',
 'EGHQ0',
 '03809',
 'EGHC0',
 '03827',
 'EGHD0',
 '03803',
 '03707',
 '03839']

In [17]:

# DAILY DATA

# Use the five closest stations (the same IDs as in the hourly section)
station_ids = station.index.to_list()[:5]

# Assumed to be the same 20-year window as the hourly section
start_date = datetime(2003, 1, 1)
end_date = datetime(2023, 1, 1)

frames = []
for station_id in station_ids:
    data = Daily(station_id, start_date, end_date).fetch()
    data['station_id'] = station_id
    if data.prcp.sum() == 0:
        print(f'no prcp data for station {station_id}')
    frames.append(data)

# Combine all stations, keeping the time index
all_data = pd.concat(frames)
all_data_daily = all_data

no prcp data for station 03817


In [21]:
all_data.groupby('station_id')['prcp'].sum()
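# it appears that precipitation coverage varies widely by station: 03808 has by far the largest total and 03817 has none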

station_id
03808    21182.7
03809      717.4
03817        0.0
EGHC0     1984.8
EGHQ0     2027.1
Name: prcp, dtype: float64


# Hourly Data 

In [6]:


# List of station IDs
station_ids = ['03817', '03808', 'EGHQ0', '03809', 'EGHC0']

# Start and end dates for the 20-year period
start_date = datetime(2003, 1, 1)
end_date = datetime(2023, 1, 1)

all_data=pd.DataFrame()

for station_id in station_ids:
    

    data = Hourly(station_id,start_date,end_date)
    data = data.fetch()
    data['station_id'] = station_id
    if data.prcp.sum() == 0:
        print(f'no prcp data for station {station_id}')
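    # NOTE: all_data is never updated, so each iteration overwrites all_data_hourly
    # and only the last station (EGHC0) survives, as the output below shows
    all_data_hourly = pd.concat([all_data,data], ignore_index=False)    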


all_data_hourly

no prcp data for station 03817


time,temp,dwpt,rhum,prcp,snow,wdir,wspd,wpgt,pres,tsun,coco,station_id
2016-12-08 15:00:00,12.0,12.0,100.0,,,180.0,16.6,,,,,EGHC0
2016-12-08 16:00:00,12.0,12.0,100.0,,,190.0,11.2,,,,,EGHC0
2016-12-09 08:00:00,12.0,9.0,82.0,,,190.0,35.3,,,,,EGHC0
2016-12-09 09:00:00,12.0,11.1,94.0,,,190.0,35.3,,,,,EGHC0
2016-12-09 10:00:00,13.0,9.1,77.0,,,190.0,38.9,,,,,EGHC0
...,...,...,...,...,...,...,...,...,...,...,...,...
2022-12-31 20:00:00,10.8,7.5,80.0,0.3,,230.0,43.6,,1000.5,,8.0,EGHC0
2022-12-31 21:00:00,10.5,7.4,81.0,0.0,,223.0,42.8,,1001.7,,3.0,EGHC0
2022-12-31 22:00:00,10.7,7.2,79.0,0.0,,228.0,43.6,,1002.5,,3.0,EGHC0
2022-12-31 23:00:00,10.9,6.4,74.0,0.0,,231.0,46.8,,1003.5,,3.0,EGHC0
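The table above contains only EGHC0 rows because the loop concatenates each station onto the still-empty `all_data`. A minimal sketch of a loop that keeps every station is shown here. `fake_fetch` is a hypothetical stand-in for `Hourly(station_id, start_date, end_date).fetch()` so that the example runs without network access, and its values are made up.

```python
import pandas as pd

def fake_fetch(station_id):
    """Stand-in for the Meteostat fetch; returns two dummy hourly rows."""
    index = pd.to_datetime(["2022-12-31 22:00", "2022-12-31 23:00"])
    return pd.DataFrame({"temp": [10.7, 10.9]}, index=index.rename("time"))

station_ids = ["03817", "03808", "EGHQ0"]

frames = []
for station_id in station_ids:
    data = fake_fetch(station_id)
    data["station_id"] = station_id
    frames.append(data)  # keep every station, not just the last one

# Concatenate once at the end instead of inside the loop
all_data_hourly = pd.concat(frames)
print(all_data_hourly["station_id"].unique())
```

Collecting the frames in a list and concatenating once also avoids repeatedly copying a growing DataFrame.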


In [8]:
all_data_hourly.describe()

,temp,dwpt,rhum,prcp,snow,wdir,wspd,wpgt,pres,tsun,coco
count,29957.0,29948.0,29948.0,17498.0,0.0,29805.0,29932.0,0.0,20281.0,0.0,5853.0
mean,12.136753,9.297546,83.58094,0.113384,,200.946955,24.403264,,1016.210912,,4.697591
std,4.015714,4.46079,11.72979,0.383126,,97.992913,11.663596,,10.40264,,4.029639
min,-4.0,-9.1,29.0,0.0,,0.0,0.0,,973.0,,1.0
25%,9.0,6.1,76.0,0.0,,120.0,15.5,,1009.8,,3.0
50%,12.0,9.8,86.0,0.0,,220.0,22.3,,1017.3,,3.0
75%,15.0,12.9,94.0,0.0,,280.0,31.7,,1023.4,,5.0
max,30.0,21.0,100.0,9.4,,360.0,187.0,,1049.0,,21.0


In [10]:
all_data_hourly.isna().sum()
# drop empty columns

all_data_hourly = all_data_hourly.drop(columns=['snow','wpgt','tsun'])

In [12]:
all_data_hourly.isnull().sum()



temp              9
dwpt             18
rhum             18
prcp          12468
wdir            161
wspd             34
pres           9685
coco          24113
station_id        0
dtype: int64
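Totals like the ones above hide which stations are responsible for the gaps. Once the frame holds more than one station, the share of missing values per column and station can be computed as sketched below. The small `df` is a made-up stand-in for `all_data_hourly`.

```python
import pandas as pd

# Tiny stand-in for all_data_hourly (values are made up for illustration)
df = pd.DataFrame({
    "prcp": [0.0, None, None, 0.3],
    "pres": [1000.5, 1001.7, None, 1003.5],
    "station_id": ["EGHC0", "EGHC0", "03808", "03808"],
})

# Fraction of missing values per column, computed separately for each station
missing_share = (
    df.drop(columns="station_id")
    .isna()
    .groupby(df["station_id"])
    .mean()
)
print(missing_share)
```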

In [13]:
# Write to disk; calling to_csv() without a path only returns the CSV as a string
all_data_hourly.to_csv('all_data_hourly.csv')

In [18]:
all_data_daily.isna().sum()
all_data_daily = all_data_daily.drop(columns=['tsun'])


In [19]:
all_data_daily.to_csv('all_data_daily.csv')

In [1]:
from data_collection import get_weather_station_info

res = get_weather_station_info()

In [3]:
from data_collection import fetch_weather_data

fetch_weather_data(granularity=Daily, station_id='03817')


NameError: name 'Daily' is not defined

In [8]:
from datetime import datetime
from meteostat import Hourly, Daily, Monthly

def fetch_weather_data(station_id: str, dates: tuple = None, granularity_class=Hourly):
    """
    Fetches weather data for a specified station ID within a given date range with specified granularity.

    Parameters:
        station_id (str): Unique identifier of the weather station.
        dates (tuple, optional): Tuple containing start and end datetime objects.
                                 Defaults to starting from March 4, 2014, at 6:15 AM to today.
        granularity_class: Class used to fetch weather data (e.g., Hourly, Daily, Monthly).
                           Defaults to Hourly.

    Returns:
        DataFrame: Weather data for the specified station ID and dates, or None if an error occurs.
    """
    if dates is None:
        dates = (datetime(2014, 3, 4, 6, 15, 0), datetime.today())  # Default date range

    try:
        # Initialize the granularity class with the specified parameters
        weather_data = granularity_class(station_id, start=dates[0], end=dates[1])
        fetched_data = weather_data.fetch()
        return fetched_data
    except Exception as e:
        print(f"Failed to fetch data: {e}")
        return None



In [12]:
dates = (datetime(2000,1,1), datetime(2001,1,1))
# dates is defined but not passed below, so the function's default range is used
res = fetch_weather_data(granularity_class=Daily,station_id='03817')

In [13]:
res

time,tavg,tmin,tmax,prcp,snow,wdir,wspd,wpgt,pres,tsun
2018-07-21,18.1,16.0,20.4,,,32.0,7.3,,1014.7,
2018-07-22,18.5,15.4,21.5,,,291.0,8.7,,1017.8,
2018-07-23,19.1,16.9,21.9,,,235.0,9.3,,1017.5,
2018-07-24,18.0,15.5,20.7,,,251.0,9.0,,1015.9,
2018-07-25,17.6,14.1,21.0,,,278.0,6.5,,1018.2,
...,...,...,...,...,...,...,...,...,...,...
2024-07-11,15.1,12.5,17.3,,,313.0,15.1,35.2,1019.4,
2024-07-12,15.1,13.4,16.9,,,340.0,16.5,31.5,1018.6,
2024-07-13,15.1,12.2,18.1,,,323.0,11.0,29.6,1015.8,
2024-07-14,14.8,10.4,18.2,,,121.0,8.3,24.1,1012.7,


In [28]:
fetch_weather_data(granularity_class=Hourly, dates = (datetime(2000,1,1), datetime(2024,1,1)), station_id='03817')

time,temp,dwpt,rhum,prcp,snow,wdir,wspd,wpgt,pres,tsun,coco
2000-01-01 01:00:00,10.1,9.9,99.0,,,270.0,14.8,,1022.0,,
2000-01-01 02:00:00,10.0,9.8,99.0,,,290.0,11.2,,1022.5,,
2000-01-01 03:00:00,9.8,9.7,99.0,,,340.0,7.6,,1022.9,,
2000-01-01 04:00:00,9.4,9.4,100.0,,,,0.0,,1023.3,,
2000-01-01 05:00:00,9.1,9.1,100.0,,,360.0,7.6,,1023.8,,
...,...,...,...,...,...,...,...,...,...,...,...
2023-12-31 20:00:00,10.0,4.8,70.0,,,282.0,51.8,79.6,998.6,,7.0
2023-12-31 21:00:00,9.8,4.6,70.0,,,284.0,51.8,77.8,999.7,,3.0
2023-12-31 22:00:00,9.3,4.7,73.0,,,284.0,48.2,74.1,1000.8,,3.0
2023-12-31 23:00:00,9.4,4.4,71.0,,,282.0,46.3,70.4,1001.6,,3.0


In [4]:
from data_collection import fetch_weather_data 
df = fetch_weather_data(station_id='03817')
from datetime import datetime

In [9]:
df.loc['2018-04-01']


time,temp,dwpt,rhum,prcp,snow,wdir,wspd,wpgt,pres,tsun,coco
2018-04-01 00:00:00,5.3,2.8,84.0,,,312.0,9.3,,1011.3,,2.0
2018-04-01 03:00:00,4.4,2.4,87.0,,,331.0,5.6,,1011.1,,3.0
2018-04-01 06:00:00,4.1,2.1,87.0,,,107.0,5.6,,1011.8,,3.0
2018-04-01 09:00:00,6.1,3.1,81.0,,,119.0,13.0,,1011.8,,4.0
2018-04-01 12:00:00,7.2,3.1,75.0,,,128.0,20.4,,1010.9,,7.0
2018-04-01 15:00:00,6.9,3.9,81.0,,,118.0,29.6,,1008.2,,8.0
2018-04-01 18:00:00,6.5,4.7,88.0,,,115.0,33.3,,1004.7,,9.0
2018-04-01 21:00:00,7.8,7.4,97.0,,,149.0,27.8,,1001.8,,9.0


In [12]:
from data_collection import fetch_weather_data, get_weather_station_info

idees = get_weather_station_info()

In [18]:
idees = idees.reset_index()['id'].to_list()

In [19]:
idees

['03817',
 '03808',
 'EGHQ0',
 '03809',
 'EGHC0',
 '03827',
 'EGHD0',
 '03803',
 '03707',
 '03839']

In [32]:
import pandas as pd

# Collect one DataFrame per station, keyed by station ID
all_data = {}

# Loop through each station ID
for station_id in idees:
    fetched = fetch_weather_data(station_id=station_id)
    if fetched is None:  # the version defined above returns None on failure
        continue
    fetched['station_id'] = station_id
    all_data[station_id] = fetched

# Concatenate all fetched data into one DataFrame; the dict keys become the outer index level
res = pd.concat(all_data)

print(res)


                           temp  dwpt  rhum  ...  tsun  coco  station_id
      time                                   ...                        
03817 2000-03-04 07:00:00   2.6  -1.9  72.0  ...   NaN   NaN       03817
      2000-03-04 08:00:00   2.7  -1.1  76.0  ...   NaN   NaN       03817
      2000-03-04 09:00:00   5.2  -0.2  68.0  ...   NaN   NaN       03817
      2000-03-04 10:00:00   6.1  -1.8  57.0  ...   NaN   NaN       03817
      2000-03-04 11:00:00   6.5  -1.9  55.0  ...   NaN   NaN       03817
...                         ...   ...   ...  ...   ...   ...         ...
03839 2024-07-15 11:00:00  16.7  13.2  80.0  ...   NaN   7.0       03839
      2024-07-15 12:00:00  17.1  13.4  79.0  ...   NaN   7.0       03839
      2024-07-15 13:00:00  17.0  13.5  80.0  ...   NaN   8.0       03839
      2024-07-15 14:00:00  17.1  13.6  80.0  ...   NaN   8.0       03839
      2024-07-15 15:00:00  17.4  13.7  79.0  ...   NaN   8.0       03839

[1346979 rows x 12 columns]


In [33]:
res

,time,temp,dwpt,rhum,prcp,snow,wdir,wspd,wpgt,pres,tsun,coco,station_id
03817,2000-03-04 07:00:00,2.6,-1.9,72.0,,,360.0,1.8,,1029.2,,,03817
03817,2000-03-04 08:00:00,2.7,-1.1,76.0,,,350.0,5.4,,1030.1,,,03817
03817,2000-03-04 09:00:00,5.2,-0.2,68.0,,,360.0,7.6,,1031.0,,,03817
03817,2000-03-04 10:00:00,6.1,-1.8,57.0,,,360.0,13.0,,1031.5,,,03817
03817,2000-03-04 11:00:00,6.5,-1.9,55.0,,,340.0,18.4,,1032.1,,,03817
...,...,...,...,...,...,...,...,...,...,...,...,...,...
03839,2024-07-15 11:00:00,16.7,13.2,80.0,2.8,,118.0,18.5,29.6,1006.6,,7.0,03839
03839,2024-07-15 12:00:00,17.1,13.4,79.0,1.1,,117.0,16.7,25.9,1006.3,,7.0,03839
03839,2024-07-15 13:00:00,17.0,13.5,80.0,0.5,,125.0,14.8,24.1,1006.1,,8.0,03839
03839,2024-07-15 14:00:00,17.1,13.6,80.0,0.1,,141.0,13.0,22.2,1006.0,,8.0,03839


In [35]:
res.to_csv('all_weather_data_cornwall.csv')
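One thing to watch when reading this file back: station IDs such as `03817` can be parsed as integers, which drops the leading zero. The sketch below shows one way to round-trip the two-level index. It uses an in-memory buffer and a two-row stand-in for `res`, and the level names `station` and `time` are assumptions.

```python
import io
import pandas as pd

index = pd.MultiIndex.from_tuples([
    ("03817", pd.Timestamp("2000-03-04 07:00")),
    ("03817", pd.Timestamp("2000-03-04 08:00")),
])
res_small = pd.DataFrame(
    {"temp": [2.6, 2.7], "station_id": ["03817", "03817"]}, index=index
)

# Name both index levels on write so they can be referred to on read
buffer = io.StringIO()
res_small.to_csv(buffer, index_label=["station", "time"])
csv_text = buffer.getvalue()

# Default parsing turns "03817" into the integer 3817
naive = pd.read_csv(io.StringIO(csv_text))

# Forcing string dtype keeps the leading zero; parse_dates restores timestamps
restored = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"station": str, "station_id": str},
    parse_dates=["time"],
).set_index(["station", "time"])
print(restored)
```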

In [31]:
# Access all entries for January 1, 2018 across all stations
specific_date_data = res.xs('2018-01-01', level=1, drop_level=False)
print(specific_date_data)


                           temp  dwpt  rhum  ...  tsun  coco  station_id
      time                                   ...                        
03809 2018-01-01 00:00:00   8.1   3.0  70.0  ...   NaN   NaN       03809
      2018-01-01 01:00:00   8.5   3.7  72.0  ...   NaN   NaN       03809
      2018-01-01 02:00:00   8.7   4.5  75.0  ...   NaN   NaN       03809
      2018-01-01 03:00:00   9.0   4.6  74.0  ...   NaN   NaN       03809
      2018-01-01 04:00:00   9.2   5.4  77.0  ...   NaN   NaN       03809
...                         ...   ...   ...  ...   ...   ...         ...
03839 2018-01-01 19:00:00   8.0   4.0  76.0  ...   NaN   NaN       03839
      2018-01-01 20:00:00   8.0   4.0  76.0  ...   NaN   NaN       03839
      2018-01-01 21:00:00   8.0   4.0  76.0  ...   NaN   NaN       03839
      2018-01-01 22:00:00   8.0   4.0  76.0  ...   NaN   NaN       03839
      2018-01-01 23:00:00   7.0   3.1  76.0  ...   NaN   NaN       03839

[113 rows x 12 columns]
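An explicit boolean mask on the time level is an alternative to `xs` with a date string, and it does not depend on partial-string matching. The sketch below uses a four-row stand-in for `res`, and the level names `station` and `time` are assumptions.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [
        ("03809", pd.Timestamp("2017-12-31 23:00")),
        ("03809", pd.Timestamp("2018-01-01 00:00")),
        ("03839", pd.Timestamp("2018-01-01 23:00")),
        ("03839", pd.Timestamp("2018-01-02 00:00")),
    ],
    names=["station", "time"],
)
res_small = pd.DataFrame({"temp": [7.9, 8.1, 7.0, 6.8]}, index=index)

# Compare the calendar date of the time level against the target day
times = res_small.index.get_level_values("time")
on_day = res_small[times.normalize() == pd.Timestamp("2018-01-01")]
print(on_day)
```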

