# Data Exploration

## Background 

### Data Source
The data used in this notebook is sourced from the National Centers for Environmental Information (NCEI): [Global Historical Climatology Network (GHCN) - Hourly](https://www.ncei.noaa.gov/products/global-historical-climatology-network-hourly). Refer to their documentation and terms of use.


#### Data Set

Station_ID: the station identification code. The first two characters signify the FIPS country code, the third character is a network code identifying the station numbering system used, and the remaining eight characters contain the actual station ID.

Station_Name: the name of the station.

Year: the year the observation was taken in Coordinated Universal Time (UTC).

Month: the month the observation was taken in Coordinated Universal Time (UTC).

Day: the day the observation was taken in Coordinated Universal Time (UTC).

Hour: the hour the observation was taken in Coordinated Universal Time (UTC).

Latitude: the latitude of the station (in decimal degrees). North (+); South (-).

Longitude: the longitude of the station (in decimal degrees). East (+); West (-).

Temperature: the air (dry bulb) temperature at approximately 2 meters above ground level (°C, to tenths).


Notes:
- Raw data files are removed by `download_ghcn.py` to save storage space.
- The GHCN hourly dataset provides PSV (pipe-separated values) files for individual stations in specific years. During processing, these were converted to CSV files covering all California stations for the years 2003-2023.
- Most columns were dropped because they were not needed. The columns that were kept are described above.
- Duplicate rows were dropped.
- Missing temperature values were filled with the placeholder value -999.


## Data Cleaning

### Data Examination

In [None]:
import pandas as pd
import sys
from preprocessing import combine_files_to_dfs

dfs = combine_files_to_dfs("../data/ghcn_reduced")
CA_stations = pd.concat(dfs, ignore_index=True)

In [None]:
CA_stations.head()

In [None]:
CA_stations.info()

Although `info()` reports no null values, the head of the data shows that the rows for some hours are missing.

In [None]:
CA_stations.describe()

The summary above shows a maximum temperature of 902 °C, which is physically impossible, so it must be an invalid value.
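A range check is one way to surface values like this. The sketch below is illustrative only: `sample` is a made-up stand-in for `CA_stations`, and the bounds `TEMP_MIN` and `TEMP_MAX` are assumed values, not thresholds taken from the GHCN documentation.

```python
import pandas as pd

# Toy stand-in for CA_stations; the values are invented for illustration.
sample = pd.DataFrame({
    "Station_ID": ["A", "A", "A", "B"],
    "Temperature": [15.2, 902.0, -999.0, 21.7],
})

MISSING_SENTINEL = -999            # fill value used for missing temperatures
TEMP_MIN, TEMP_MAX = -60.0, 60.0   # assumed plausible bounds in °C

# Treat the sentinel separately so it is not counted as an out-of-range reading.
is_missing = sample["Temperature"] == MISSING_SENTINEL
out_of_range = ~is_missing & ~sample["Temperature"].between(TEMP_MIN, TEMP_MAX)

invalid_count = int(out_of_range.sum())
print(sample[out_of_range])
```

Keeping the sentinel out of the check separates "not recorded" from "recorded but implausible", which may call for different handling later.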

In [None]:
# Checking which columns have missing data
cols_missing_data = CA_stations.isnull().any()
print(cols_missing_data)

No column contains null values. This includes temperature, because missing temperature values are encoded as -999 rather than as nulls.
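Because -999 is an ordinary number to pandas, `isnull()` does not see it and summary statistics are skewed by it. A minimal sketch of converting the placeholder to `NaN`, using a made-up `sample` frame rather than the real data:

```python
import pandas as pd

# Toy frame; the values are invented for illustration.
sample = pd.DataFrame({"Temperature": [12.5, -999.0, 14.0, -999.0]})

# The placeholder hides the missing values from isnull().
nulls_before = int(sample["Temperature"].isnull().sum())

# mask() replaces entries where the condition is True with NaN.
cleaned = sample.copy()
cleaned["Temperature"] = sample["Temperature"].mask(sample["Temperature"] == -999)

nulls_after = int(cleaned["Temperature"].isnull().sum())
print(nulls_before, nulls_after)
print(cleaned["Temperature"].mean())  # NaN entries are skipped by default
```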

Some hourly rows are missing entirely. These gaps can be handled more accurately after the missing temperature values have been filled in.
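One possible way to locate the missing hourly rows is to build a timestamp from the `Year`, `Month`, `Day`, and `Hour` columns and compare it with a complete hourly range. The sketch below uses an invented single-station `sample` and only detects gaps between the first and last observation. The `Timestamp` column is an assumption introduced here, not part of the dataset.

```python
import pandas as pd

# Toy single-station frame with hours 2 and 4 absent; values are invented.
sample = pd.DataFrame({
    "Station_ID": ["A"] * 4,
    "Year": [2003] * 4,
    "Month": [1] * 4,
    "Day": [1] * 4,
    "Hour": [0, 1, 3, 5],
    "Temperature": [10.0, 9.5, 9.0, 8.2],
})

# pd.to_datetime can assemble timestamps from year/month/day/hour columns.
parts = sample[["Year", "Month", "Day", "Hour"]].rename(columns=str.lower)
sample["Timestamp"] = pd.to_datetime(parts)

# Every hour between the first and last observation.
expected = pd.date_range(
    sample["Timestamp"].min(),
    sample["Timestamp"].max(),
    freq=pd.Timedelta(hours=1),
)

# Hours in the expected range that have no row in the data.
missing_hours = expected.difference(pd.DatetimeIndex(sample["Timestamp"]))
print(missing_hours)
```

For the full dataset, the same comparison would need to run per station, for example inside a `groupby("Station_ID")`.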

### Cleaning Invalid Data

- Handle missing values (e.g., mean/median imputation, interpolation, forward or backward fill, k-nearest neighbors imputation, deletion)
- Handle outliers (e.g., visual inspection with boxplots, the Z-score and IQR methods, or data transformations such as log transformation and winsorization)
- Handle inconsistencies (e.g., range checks to ensure temperature values are reasonable, unit consistency, string matching and standardization) and duplicates (identify and remove them) in the dataset
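As an illustration of two of the options listed above, the sketch below flags outliers with the IQR method and then fills the gaps by linear interpolation. The series is invented, the 1.5 multiplier is the conventional choice rather than something tuned for this dataset, and this is not necessarily the approach used later in the notebook.

```python
import numpy as np
import pandas as pd

# Toy hourly temperatures with one missing value and one impossible reading.
temps = pd.Series([10.0, 11.0, np.nan, 13.0, 902.0, 12.0, 11.5, 10.5])

# IQR method: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = temps.quantile(0.25), temps.quantile(0.75)
iqr = q3 - q1
is_outlier = (temps < q1 - 1.5 * iqr) | (temps > q3 + 1.5 * iqr)

# Replace flagged values with NaN, then interpolate linearly between neighbors.
cleaned = temps.mask(is_outlier).interpolate(method="linear")
print(cleaned)
```

Interpolation suits short gaps in hourly temperature because neighboring hours are strongly correlated. Long gaps may be better handled with a different method or left missing.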

Notes:
- For non-leap years, there should be 8760 rows (one per hour) for each station.
- For leap years, there should be 8784 rows (one per hour) for each station.
- The reduced files contain 99 CA stations.
- Some stations do not have observations for every year from 2003 to 2023.
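The expected row counts follow from 365 × 24 = 8760 and 366 × 24 = 8784. A small sketch of checking completeness against those counts is shown below. The `observed` dictionary and its station name are hypothetical placeholders, not real counts from the data.

```python
import calendar

def expected_hours(year: int) -> int:
    """Number of hourly rows a fully observed station should have in `year`."""
    return (366 if calendar.isleap(year) else 365) * 24

# Hypothetical observed row counts keyed by (station, year); values are invented.
observed = {("STATION_A", 2003): 8760, ("STATION_A", 2004): 8700}

# Rows missing per station-year (0 means the year is complete).
missing_rows = {
    key: expected_hours(key[1]) - count for key, count in observed.items()
}
print(missing_rows)
```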

In [None]:
# Create Individual Dataframes
dataframes = combine_files_to_dfs("../data/ghcn_reduced")