# Meteorological Factors

This notebook considers additional environmental factors: humidity, precipitation, and wind speed. Monitoring these factors helps assess how they influence the heat events observed.

## Learnings:
- find trends in meteorological data (humidity, precipitation, and wind speed) and how these affect temperature in certain regions
- quantify the correlation between meteorological factors and temperature using statistical methods such as the Pearson correlation coefficient
- visualize correlations through scatter plots; points that cluster along a straight line indicate a linear relationship (upward-sloping for a positive correlation, downward-sloping for a negative one)
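
The Pearson step listed above can be sketched on synthetic data. This is a minimal illustration, not the notebook's analysis: the arrays are randomly generated, and the negative link between temperature and humidity is built in by construction so that the coefficient has something to detect.

```python
import numpy as np
from scipy import stats

# Synthetic readings for illustration only (not GHCN data)
rng = np.random.default_rng(0)
temperature = rng.uniform(5, 40, size=500)  # degrees Celsius

# Humidity is constructed to fall as temperature rises, plus random noise
relative_humidity = 100 - 1.8 * temperature + rng.normal(0, 8, size=500)

# pearsonr returns the correlation coefficient and a two-sided p-value
r, p_value = stats.pearsonr(temperature, relative_humidity)
print(f"Pearson r = {r:.2f}, p-value = {p_value:.3g}")
```

A coefficient near -1 or +1 indicates a strong linear relationship, and a value near 0 indicates little linear association. Pearson's r only measures linear dependence, so a scatter plot is still worth inspecting alongside it.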

## Set Up

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy.stats import kendalltau
import geopandas as gpd
import folium
import seaborn as sns
from scipy import stats
import sys
import os

# Add the project's src directory to sys.path so notebook_utils can be imported
# (assumes the notebook is run from the project's notebooks directory)
curr_dir = os.path.dirname(os.path.abspath('notebooks'))
proj_dir = os.path.dirname(curr_dir)
src_path = os.path.join(proj_dir, 'src')
sys.path.append(src_path)

from notebook_utils.preprocessing import *
from notebook_utils.meterological_factors import *

# Create combined dataframe from ghcn_cleaned files
CA_stations_dfs = combine_files_to_dfs('../data/processed/ghcn_cleaned')

Processed file: CA_2003_clean.csv
Processed file: CA_2004_clean.csv
Processed file: CA_2005_clean.csv
Processed file: CA_2006_clean.csv
Processed file: CA_2007_clean.csv
Processed file: CA_2008_clean.csv
Processed file: CA_2009_clean.csv
Processed file: CA_2010_clean.csv
Processed file: CA_2011_clean.csv
Processed file: CA_2012_clean.csv
Processed file: CA_2013_clean.csv
Processed file: CA_2014_clean.csv
Processed file: CA_2015_clean.csv
Processed file: CA_2016_clean.csv
Processed file: CA_2017_clean.csv
Processed file: CA_2018_clean.csv
Processed file: CA_2019_clean.csv
Processed file: CA_2020_clean.csv
Processed file: CA_2021_clean.csv
Processed file: CA_2022_clean.csv
Processed file: CA_2023_clean.csv


# Collect Meteorological Data

Collect data on humidity, wind speed, and precipitation.

Data Source: NOAA National Centers for Environmental Information. (2024). Global Historical Climatology Network (GHCN) - Hourly Data. NOAA. https://www.ncei.noaa.gov/products/land-based-station/ghcn-hourly

In [2]:
#raw_folder = '../data/raw/ghcn_raw'
# List the .psv files in the raw folder
#psv_files = [f for f in os.listdir(raw_folder) if f.endswith('.psv')]
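
For reference, a `.psv` file is pipe-separated text, so pandas can read it with `sep="|"`. The sketch below parses a made-up two-row sample from memory; the column names and the station ID are assumptions chosen for illustration and may not match the actual GHCN hourly files.

```python
import io
import pandas as pd

# Made-up pipe-separated sample; column names are hypothetical
sample_psv = io.StringIO(
    "Station_ID|DATE|temperature|wind_speed|relative_humidity|precipitation\n"
    "STATION_A|2003-01-01T00:00:00|5.2|1.5|80|\n"
    "STATION_A|2003-01-01T01:00:00|5.0||82|0.3\n"
)

# Empty fields between pipes are read as NaN
raw = pd.read_csv(sample_psv, sep="|", parse_dates=["DATE"])
print(raw)
```

With real files, the same call would take a path such as `os.path.join(raw_folder, f)` for each `f` in `psv_files`.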

## Merge data

In [3]:
# Merge dataframe with meteorological data
# CA_stations_dfs = merge_meteo_data(CA_stations_dfs, raw_folder, psv_files)

### Write CSV files to the data/processed/ghcn_meteo folder

In [4]:
# Save combined dataframe to csv
# create_csv_meteo(CA_stations_dfs, 'ghcn_meteo')

## Load combined dataframe with meteorological columns

In [5]:
CA_stations_df = combine_meteo_to_df('../data/processed/ghcn_meteo')

Processed file: CA_2003_meteo.csv
Processed file: CA_2004_meteo.csv
Processed file: CA_2005_meteo.csv
Processed file: CA_2006_meteo.csv
Processed file: CA_2007_meteo.csv
Processed file: CA_2008_meteo.csv
Processed file: CA_2009_meteo.csv
Processed file: CA_2010_meteo.csv
Processed file: CA_2011_meteo.csv
Processed file: CA_2012_meteo.csv
Processed file: CA_2013_meteo.csv
Processed file: CA_2014_meteo.csv
Processed file: CA_2015_meteo.csv
Processed file: CA_2016_meteo.csv
Processed file: CA_2017_meteo.csv
Processed file: CA_2018_meteo.csv
Processed file: CA_2019_meteo.csv
Processed file: CA_2020_meteo.csv
Processed file: CA_2021_meteo.csv
Processed file: CA_2022_meteo.csv
Processed file: CA_2023_meteo.csv


In [6]:
CA_stations_df.head()

Unnamed: 0,Station_ID,Station_name,Latitude,Longitude,datetime,Year,Month,Day,Hour,Temperature,Season,County,City,wind_speed,precipitation,relative_humidity
0,USW00023224,AUBURN MUNI AP,38.9547,-121.0819,2003-01-01 00:00:00,2003,1,1,0,5.2,Winter,Placer County,Auburn,,,
1,USW00023224,AUBURN MUNI AP,38.9547,-121.0819,2003-01-01 01:00:00,2003,1,1,1,5.0,Winter,Placer County,Auburn,,,
2,USW00023224,AUBURN MUNI AP,38.9547,-121.0819,2003-01-01 02:00:00,2003,1,1,2,4.8,Winter,Placer County,Auburn,,,
3,USW00023224,AUBURN MUNI AP,38.9547,-121.0819,2003-01-01 03:00:00,2003,1,1,3,4.6,Winter,Placer County,Auburn,,,
4,USW00023224,AUBURN MUNI AP,38.9547,-121.0819,2003-01-01 04:00:00,2003,1,1,4,4.3,Winter,Placer County,Auburn,,,


## Handle Missing Values

### Average duplicate rows

In [7]:
key = ['Station_ID', 'Station_name', 'Latitude', 'Longitude', 'datetime', 'Year', 'Month', 'Day', 'Hour', 'County', 'City']
meteorological_columns = ['Temperature','relative_humidity', 'precipitation', 'wind_speed']
CA_stations_df = CA_stations_df.groupby(key, as_index=False)[meteorological_columns].mean().round(2)
CA_stations_df.tail()

Unnamed: 0,Station_ID,Station_name,Latitude,Longitude,datetime,Year,Month,Day,Hour,County,City,Temperature,relative_humidity,precipitation,wind_speed
8729669,USW00094299,ALTURAS MUNI AP,41.48,-120.56,2023-05-31 12:00:00,2023,5,31,12,Modoc County,Alturas,18.9,95.0,,0.75
8729670,USW00094299,ALTURAS MUNI AP,41.48,-120.56,2023-05-31 13:00:00,2023,5,31,13,Modoc County,Alturas,20.0,93.25,,0.0
8729671,USW00094299,ALTURAS MUNI AP,41.48,-120.56,2023-05-31 14:00:00,2023,5,31,14,Modoc County,Alturas,21.1,91.5,,0.0
8729672,USW00094299,ALTURAS MUNI AP,41.48,-120.56,2023-05-31 15:00:00,2023,5,31,15,Modoc County,Alturas,21.1,85.5,,0.0
8729673,USW00094299,ALTURAS MUNI AP,41.48,-120.56,2023-05-31 16:00:00,2023,5,31,16,Modoc County,Alturas,21.1,85.0,,0.7
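
The grouping step above collapses rows that share the same identifying columns into one row holding the mean of each measurement. A toy sketch of the same pattern, with made-up values and a shortened key:

```python
import pandas as pd

# Toy frame with a duplicated (station, datetime) pair; values are made up
df = pd.DataFrame({
    "Station_ID": ["A", "A", "A"],
    "datetime": ["2003-01-01 00:00:00", "2003-01-01 00:00:00", "2003-01-01 01:00:00"],
    "Temperature": [5.0, 5.4, 5.0],
    "wind_speed": [1.0, None, 2.0],
})

# One row per key; mean() skips NaN, so a single valid reading is kept as-is
deduped = (
    df.groupby(["Station_ID", "datetime"], as_index=False)[["Temperature", "wind_speed"]]
      .mean()
      .round(2)
)
print(deduped)
```

Two points are worth keeping in mind. Columns that appear in neither list are not carried through, which is why `Season` is absent from the output above. Also, `groupby` drops rows whose key contains NaN by default (`dropna=True`), so a missing `County` or `City` would remove the row.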


In [13]:
# Apply interpolation to fill missing values
CA_stations_df = cubic_meteo_interpolate(CA_stations_df, 'wind_speed')
CA_stations_df = cubic_meteo_interpolate(CA_stations_df, 'precipitation')
CA_stations_df = cubic_meteo_interpolate(CA_stations_df, 'relative_humidity')
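
The body of `cubic_meteo_interpolate` lives in `notebook_utils` and is not shown here. As a rough idea of what per-station cubic interpolation can look like, the sketch below uses pandas' `interpolate(method="cubic")`, which relies on SciPy. The function name `interpolate_by_station` is hypothetical, and the actual helper may behave differently.

```python
import numpy as np
import pandas as pd

def interpolate_by_station(df, column, station_col="Station_ID"):
    """Hypothetical stand-in for cubic_meteo_interpolate, not the project's code."""
    def _fill(series):
        # A cubic fit needs at least four known points; otherwise leave the gaps
        if series.notna().sum() < 4:
            return series
        # limit_area="inside" fills only gaps surrounded by valid readings
        return series.interpolate(method="cubic", limit_area="inside")

    out = df.copy()
    out[column] = out.groupby(station_col)[column].transform(_fill)
    return out

# Toy data: station A has two gaps, station B has no readings at all
toy = pd.DataFrame({
    "Station_ID": ["A"] * 7 + ["B"] * 3,
    "wind_speed": [1.0, np.nan, 3.0, 4.0, np.nan, 6.0, 7.0, np.nan, np.nan, np.nan],
})
filled = interpolate_by_station(toy, "wind_speed")
print(filled)
```

Cubic curves can overshoot between points, so it may be worth checking afterwards that interpolated precipitation is non-negative and that relative humidity stays between 0 and 100. Stations with no readings cannot be interpolated and keep their NaN values.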

In [15]:
# Check for stations with 100% missing values
columns_to_check = ['relative_humidity', 'precipitation', 'wind_speed']
stations_with_all_na = check_missing_by_station(CA_stations_df, columns_to_check)

for column, stations in stations_with_all_na.items():
    print(f"Stations with 100% missing values for {column}: {stations}")

Stations with 100% missing values for relative_humidity: ['USW00023271', 'USW00053151', 'USW00053152']
Stations with 100% missing values for precipitation: ['USW00023289', 'USW00053130', 'USW00093232']
Stations with 100% missing values for wind_speed: ['USW00023271']
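
The check above depends on `check_missing_by_station` from `notebook_utils`. A minimal sketch of an equivalent check, assuming it returns a dictionary mapping each column to the stations with no valid readings (the name `stations_fully_missing` is hypothetical):

```python
import numpy as np
import pandas as pd

def stations_fully_missing(df, columns, station_col="Station_ID"):
    """Hypothetical equivalent of check_missing_by_station."""
    # True where every reading of a column is NaN for a given station
    all_missing = df[columns].isna().groupby(df[station_col]).all()
    return {col: all_missing.index[all_missing[col]].tolist() for col in columns}

# Toy data: station A never reports humidity, station B reports everything
toy = pd.DataFrame({
    "Station_ID": ["A", "A", "B", "B"],
    "relative_humidity": [np.nan, np.nan, 60.0, 65.0],
    "wind_speed": [1.0, np.nan, 2.0, 3.0],
})
result = stations_fully_missing(toy, ["relative_humidity", "wind_speed"])
print(result)
```

A station with only some missing readings, such as station A for `wind_speed` here, is not flagged.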


### Drop stations with 100% missing data for meteorological columns

In [16]:
# Drop rows that still have NaN in any of the meteorological columns
CA_stations_df.dropna(subset=['wind_speed', 'precipitation', 'relative_humidity'], inplace=True)
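
Note that `dropna(subset=...)` removes every row with a NaN in any listed column, which covers fully missing stations but also any isolated gaps that interpolation left behind. A small sketch, on made-up data, of how the effect of the drop could be measured:

```python
import pandas as pd

# Toy data: station A has one gap, station B never reports wind speed
df = pd.DataFrame({
    "Station_ID": ["A", "A", "B", "B"],
    "wind_speed": [1.0, 2.0, None, None],
    "precipitation": [0.0, None, 0.0, 0.0],
    "relative_humidity": [50.0, 55.0, 60.0, 65.0],
})
required = ["wind_speed", "precipitation", "relative_humidity"]

cleaned = df.dropna(subset=required)

# Compare before and after to see how many rows and which stations were lost
rows_removed = len(df) - len(cleaned)
stations_removed = set(df["Station_ID"]) - set(cleaned["Station_ID"])
print(f"Rows removed: {rows_removed}, stations removed: {sorted(stations_removed)}")
```

Because a station missing just one of the three variables loses all of its rows, it may be worth recording these counts before running the correlation analysis.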

In [19]:
# Check for NaN values in meteorological columns
CA_stations_df[['Temperature', 'wind_speed', 'relative_humidity', 'precipitation']].isna().sum()

Temperature          0
wind_speed           0
relative_humidity    0
precipitation        0
dtype: int64

In [17]:
#create_csv_meteo(CA_stations_df, 'ghcn_meteo_cleaned')

New datafrane yearly files saved to ghcn_meteo_cleaned


# Correlation Analysis
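
As a starting point for this section, the sketch below builds a correlation matrix across the four measurement columns with `DataFrame.corr`. The data are synthetic: humidity is constructed to depend on temperature, while wind speed and precipitation are generated independently, so the numbers say nothing about the real stations. Kendall's tau is included as a rank-based alternative because `kendalltau` is imported in the setup cell.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cleaned dataframe (illustration only)
rng = np.random.default_rng(42)
n = 1000
temperature = rng.normal(20, 8, n)
toy = pd.DataFrame({
    "Temperature": temperature,
    "relative_humidity": 90 - 1.5 * temperature + rng.normal(0, 10, n),
    "wind_speed": rng.gamma(2.0, 1.5, n),      # independent of temperature
    "precipitation": rng.exponential(0.2, n),  # independent of temperature
})

# Pairwise correlations between all columns
pearson = toy.corr(method="pearson")
kendall = toy.corr(method="kendall")

# Correlation of each factor with temperature
print(pearson["Temperature"].round(2))
print(kendall["Temperature"].round(2))
```

On the real data, the same two calls could be applied to `CA_stations_df[meteorological_columns]`, optionally grouped by station or county first.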

# Interpretation