# Data Exploration

## Background 

### Data Source
The data used in this notebook is sourced from the National Centers for Environmental Information (NCEI): [Global Historical Climatology Network (GHCN) - Hourly](https://www.ncei.noaa.gov/products/global-historical-climatology-network-hourly). Refer to NCEI's documentation and terms of use.


#### Data Set

Station_ID: the station identification code. The first two characters signify the FIPS country code, the third character is a network code identifying the station numbering system used, and the remaining eight characters contain the actual station ID.

Station_name: the name of the station.

Year: the year the observation was taken in Coordinated Universal Time (UTC).

Month: the month the observation was taken in Coordinated Universal Time (UTC).

Day: the day the observation was taken in Coordinated Universal Time (UTC).

Hour: the hour the observation was taken in Coordinated Universal Time (UTC).

Latitude: the latitude of the station (in decimal degrees). North (+); South (-).

Longitude: the longitude of the station (in decimal degrees). East (+); West (-).

temperature: the air (dry bulb) temperature at approximately 2 meters above ground level (°C, to tenths).


Notes:
- The raw data was removed in download_ghcn.py to save storage space.
- The GHCN hourly dataset provides psv (pipe-separated values) files for individual stations in specific years. During processing, these were converted into one csv file per year covering all California stations for 2003-2023.
- Most columns were dropped because they were not needed. The columns kept are described above.
- Duplicate rows were dropped.

## Data Cleaning

### Set up

In [1]:
import pandas as pd # type: ignore
import sys
import os

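# Add the project's src directory to sys.path so notebook_utils can be imported.
# This assumes the notebook is run from a directory directly under the project root.
curr_dir = os.getcwd()
proj_dir = os.path.dirname(curr_dir)
src_path = os.path.join(proj_dir, 'src')
if src_path not in sys.path:
    sys.path.append(src_path)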


In [2]:
from notebook_utils.preprocessing import combine_files_to_dfs

# create a combined dataframe for all reduced csv files
dfs = combine_files_to_dfs("../data/ghcn_reduced")
CA_stations = pd.concat(dfs, ignore_index=True) # type: ignore

Processed file: reduced_CA_stations_2003.csv
Processed file: reduced_CA_stations_2004.csv
Processed file: reduced_CA_stations_2005.csv
Processed file: reduced_CA_stations_2006.csv
Processed file: reduced_CA_stations_2007.csv
Processed file: reduced_CA_stations_2008.csv
Processed file: reduced_CA_stations_2009.csv
Processed file: reduced_CA_stations_2010.csv
Processed file: reduced_CA_stations_2011.csv
Processed file: reduced_CA_stations_2012.csv
Processed file: reduced_CA_stations_2013.csv
Processed file: reduced_CA_stations_2014.csv
Processed file: reduced_CA_stations_2015.csv
Processed file: reduced_CA_stations_2016.csv
Processed file: reduced_CA_stations_2017.csv
Processed file: reduced_CA_stations_2018.csv
Processed file: reduced_CA_stations_2019.csv
Processed file: reduced_CA_stations_2020.csv
Processed file: reduced_CA_stations_2021.csv
Processed file: reduced_CA_stations_2022.csv
Processed file: reduced_CA_stations_2023.csv


### Data Examination

In [3]:
CA_stations.head()

| | Station_ID | Station_name | Year | Month | Day | Hour | Latitude | Longitude | temperature |
|---|---|---|---|---|---|---|---|---|---|
| 0 | USW00023225 | BLUE CANYON NYACK AP | 2003 | 1 | 1 | 0 | 39.2761 | -120.7092 | -1.1 |
| 1 | USW00023225 | BLUE CANYON NYACK AP | 2003 | 1 | 1 | 1 | 39.2761 | -120.7092 | -1.1 |
| 2 | USW00023225 | BLUE CANYON NYACK AP | 2003 | 1 | 1 | 2 | 39.2761 | -120.7092 | -1.1 |
| 3 | USW00023225 | BLUE CANYON NYACK AP | 2003 | 1 | 1 | 3 | 39.2761 | -120.7092 | -1.7 |
| 4 | USW00023225 | BLUE CANYON NYACK AP | 2003 | 1 | 1 | 4 | 39.2761 | -120.7092 | -2.2 |


In [4]:
CA_stations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19851938 entries, 0 to 19851937
Data columns (total 9 columns):
 #   Column        Dtype  
---  ------        -----  
 0   Station_ID    object 
 1   Station_name  object 
 2   Year          int64  
 3   Month         int64  
 4   Day           int64  
 5   Hour          int64  
 6   Latitude      float64
 7   Longitude     float64
 8   temperature   float64
dtypes: float64(3), int64(4), object(2)
memory usage: 1.3+ GB


In [5]:
CA_stations.describe()

| | Year | Month | Day | Hour | Latitude | Longitude | temperature |
|---|---|---|---|---|---|---|---|
| count | 19851940.0 | 19851940.0 | 19851940.0 | 19851940.0 | 19851940.0 | 19851940.0 | 19612370.0 |
| mean | 2013.089 | 6.504389 | 15.73293 | 11.59572 | 36.6845 | -120.143 | 15.61475 |
| std | 5.375401 | 3.467041 | 8.794091 | 6.93905 | 2.374357 | 2.021136 | 8.620436 |
| min | 2003.0 | 1.0 | 1.0 | 0.0 | 32.5681 | -124.2381 | -99.0 |
| 25% | 2009.0 | 3.0 | 8.0 | 6.0 | 34.4142 | -121.815 | 10.5 |
| 50% | 2013.0 | 7.0 | 16.0 | 12.0 | 36.985 | -120.4667 | 14.7 |
| 75% | 2018.0 | 10.0 | 23.0 | 18.0 | 38.3208 | -118.2911 | 20.0 |
| max | 2023.0 | 12.0 | 31.0 | 23.0 | 41.7836 | -116.1472 | 902.0 |


The temperature column has a maximum of 902 °C, which is unreasonably high. A quick search shows that the highest recorded temperature is 56.7 °C, measured in Death Valley, California, in 1913.

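There is also an unreasonably low temperature observation of -99 °C. That is below even the lowest temperatures measured on Earth: about -98 °C for satellite-derived surface readings over Antarctica, and -89.2 °C for the lowest air temperature recorded at a weather station (Vostok Station, Antarctica, 1983).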

In [6]:
# Checking which columns have missing data
cols_missing_data = CA_stations.isnull().any()
print(cols_missing_data)

Station_ID      False
Station_name    False
Year            False
Month           False
Day             False
Hour            False
Latitude        False
Longitude       False
temperature      True
dtype: bool


While temperature is the only column with NA cells, some rows are missing entirely, meaning that not every hour of every year was observed.

# Cleaning Invalid Data

- Handle missing values (e.g., mean/median imputation, interpolation, forward or backward fill, k-nearest neighbors imputation, deletion)
- Handle outliers (e.g., visual inspection with boxplots, the Z-score and IQR methods, or data transformation such as log transformation and winsorization)
- Handle inconsistencies (e.g., range checks to ensure temperature values are reasonable, unit consistency, string matching and standardization) and duplicates (identify and remove them)

Notes:
- For non-leap years, there should be 8760 rows (one per hour) for each station.
- For leap years, there should be 8784 rows (one per hour) for each station.
- The leap years in 2003-2023 are 2004, 2008, 2012, 2016, and 2020.
- The reduced files contain 99 CA stations.
- Some stations are not observed in every year from 2003-2023.
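
The row counts in these notes can be checked directly against the data. The following is a minimal sketch, assuming a dataframe with the `Station_ID` and `Year` columns shown earlier. The helper names `expected_hours` and `coverage_by_station_year` are hypothetical and are not part of notebook_utils, and the demo rows are made up for illustration.

```python
import calendar

import pandas as pd


def expected_hours(year: int) -> int:
    """Number of hourly rows a fully observed station should have in `year`."""
    return (366 if calendar.isleap(year) else 365) * 24


def coverage_by_station_year(df: pd.DataFrame) -> pd.DataFrame:
    """Compare observed and expected row counts per station and year."""
    counts = (
        df.groupby(["Station_ID", "Year"])
        .size()
        .rename("observed_rows")
        .reset_index()
    )
    counts["expected_rows"] = counts["Year"].map(expected_hours)
    counts["missing_rows"] = counts["expected_rows"] - counts["observed_rows"]
    return counts


# Tiny illustrative frame (made-up rows, not real observations)
demo = pd.DataFrame({"Station_ID": ["A", "A", "B"], "Year": [2003, 2003, 2004]})
coverage = coverage_by_station_year(demo)
```

Station-year pairs that do not appear in the data at all produce no row in the result, so stations that skip whole years would need a separate check.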


### Set up

In [7]:
# Create Individual Dataframes
dataframes = combine_files_to_dfs("../data/ghcn_reduced")

Processed file: reduced_CA_stations_2003.csv
Processed file: reduced_CA_stations_2004.csv
Processed file: reduced_CA_stations_2005.csv
Processed file: reduced_CA_stations_2006.csv
Processed file: reduced_CA_stations_2007.csv
Processed file: reduced_CA_stations_2008.csv
Processed file: reduced_CA_stations_2009.csv
Processed file: reduced_CA_stations_2010.csv
Processed file: reduced_CA_stations_2011.csv
Processed file: reduced_CA_stations_2012.csv
Processed file: reduced_CA_stations_2013.csv
Processed file: reduced_CA_stations_2014.csv
Processed file: reduced_CA_stations_2015.csv
Processed file: reduced_CA_stations_2016.csv
Processed file: reduced_CA_stations_2017.csv
Processed file: reduced_CA_stations_2018.csv
Processed file: reduced_CA_stations_2019.csv
Processed file: reduced_CA_stations_2020.csv
Processed file: reduced_CA_stations_2021.csv
Processed file: reduced_CA_stations_2022.csv
Processed file: reduced_CA_stations_2023.csv


## Handling Missing Values

### Filling missing rows

In [8]:
# Something
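
One possible way to fill missing rows is to reindex each station-year onto a complete hourly UTC index, so that unobserved hours become rows with a missing temperature. This is only a sketch under stated assumptions: the function name `add_missing_hours` is hypothetical, the input holds a single station and a single year, and each hour appears at most once (reindexing fails on duplicate timestamps).

```python
import pandas as pd


def add_missing_hours(station_df: pd.DataFrame, year: int) -> pd.DataFrame:
    """Return one station-year with a row for every hour of `year` (UTC)."""
    # Build a timestamp from the calendar columns (to_datetime expects
    # lowercase unit names such as year/month/day/hour).
    parts = station_df[["Year", "Month", "Day", "Hour"]].rename(columns=str.lower)
    indexed = station_df.assign(timestamp=pd.to_datetime(parts)).set_index("timestamp")

    # Every hour from Jan 1 00:00 through Dec 31 23:00 of the given year.
    full_index = pd.date_range(
        start=pd.Timestamp(year=year, month=1, day=1),
        end=pd.Timestamp(year=year, month=12, day=31, hour=23),
        freq=pd.Timedelta(hours=1),
        name="timestamp",
    )
    out = indexed.reindex(full_index)

    # Restore constant station metadata and recompute calendar columns;
    # temperature stays NaN for the hours that were not observed.
    for col in ["Station_ID", "Station_name", "Latitude", "Longitude"]:
        out[col] = station_df[col].iloc[0]
    out["Year"] = out.index.year
    out["Month"] = out.index.month
    out["Day"] = out.index.day
    out["Hour"] = out.index.hour
    return out.reset_index()


# Made-up two-row example for a leap year
demo = pd.DataFrame({
    "Station_ID": ["X", "X"], "Station_name": ["DEMO", "DEMO"],
    "Year": [2004, 2004], "Month": [1, 1], "Day": [1, 1], "Hour": [0, 2],
    "Latitude": [0.0, 0.0], "Longitude": [0.0, 0.0],
    "temperature": [1.0, 2.0],
})
filled = add_missing_hours(demo, 2004)
```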

### Filling missing temperature values

#### Missing Temperature Values: Handling short gaps in temperature observations

We define a "short gap" as a run of missing values in the temperature column that lasts less than a day.

Short gaps in temperature observations will be handled with forward/backward fill.
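
A minimal sketch of this step is shown here, assuming the series holds one station's temperatures in time order with one row per hour, so that "less than a day" means at most 23 consecutive missing values. The function name `fill_short_gaps` and the `max_gap` parameter are hypothetical.

```python
import numpy as np
import pandas as pd


def fill_short_gaps(temps: pd.Series, max_gap: int = 23) -> pd.Series:
    """Forward/backward fill only NaN runs of at most `max_gap` values."""
    is_na = temps.isna()
    # Label each run of consecutive equal flags, then measure NaN run lengths.
    run_id = is_na.ne(is_na.shift()).cumsum()
    run_len = is_na.groupby(run_id).transform("sum")
    short = is_na & (run_len <= max_gap)

    # Forward fill first; backward fill covers a gap at the very start.
    candidate = temps.ffill().bfill()
    # Keep original values everywhere except inside short gaps.
    return temps.where(~short, candidate)


# Made-up series: a 2-hour gap followed by a 30-hour gap
demo = pd.Series([1.0, np.nan, np.nan, 4.0] + [np.nan] * 30 + [5.0])
result = fill_short_gaps(demo)
```

In this example the 2-hour gap is filled with the preceding value, while the 30-hour gap is left missing for a different method to handle.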

#### Missing Temperature Values: Handling large gaps in temperature observations

We define a "large gap" as a run of missing values in the temperature column that lasts more than a day.

Large gaps in temperature observations will be handled via interpolation.
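
The interpolation method is not specified here, so the following sketch assumes linear interpolation in time using pandas' `Series.interpolate`. It expects one station's temperatures indexed by timestamp; the function name `interpolate_gaps` is hypothetical.

```python
import numpy as np
import pandas as pd


def interpolate_gaps(temps: pd.Series) -> pd.Series:
    """Linearly interpolate missing values in time (requires a DatetimeIndex)."""
    # limit_area="inside" fills only NaNs that lie between valid observations,
    # so leading and trailing gaps are not extrapolated.
    return temps.interpolate(method="time", limit_area="inside")


# Made-up hourly series with a gap in the middle and a trailing gap
index = pd.date_range("2003-01-01", periods=6, freq=pd.Timedelta(hours=1))
demo = pd.Series([10.0, np.nan, np.nan, np.nan, 18.0, np.nan], index=index)
result = interpolate_gaps(demo)
```

Linear interpolation over a multi-day gap ignores the daily temperature cycle, so the filled values are best treated as rough estimates.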

#### Missing Values with Deletion

A station is removed from the data when more than _% of its temperature values are missing.

A station is also removed when more than _% of its rows are missing.
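
The first rule could be applied as sketched here. The cutoff is left as a required argument because no threshold has been chosen; the 0.5 in the demo is an arbitrary value for illustration only. The function name `drop_sparse_stations` is hypothetical, and the sketch covers missing temperature values only, not missing rows.

```python
import numpy as np
import pandas as pd


def drop_sparse_stations(
    df: pd.DataFrame, max_missing_frac: float, col: str = "temperature"
) -> pd.DataFrame:
    """Drop stations whose share of missing `col` values exceeds the cutoff."""
    # Fraction of missing values per station (mean of a boolean mask).
    missing_frac = df[col].isna().groupby(df["Station_ID"]).mean()
    keep = missing_frac[missing_frac <= max_missing_frac].index
    return df[df["Station_ID"].isin(keep)]


# Made-up data: station A is 25% missing, station B is 75% missing
demo = pd.DataFrame({
    "Station_ID": ["A"] * 4 + ["B"] * 4,
    "temperature": [1.0, 2.0, np.nan, 4.0, np.nan, np.nan, np.nan, 8.0],
})
kept = drop_sparse_stations(demo, max_missing_frac=0.5)  # arbitrary demo cutoff
```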

Note: Still


## Handling Inconsistencies

1. Ensure temperature observations fall between -50°C and 60°C.
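
A range check like this can be sketched as follows. Replacing out-of-range readings with NaN (rather than deleting the rows) is an assumption made here so that the hourly row structure stays intact; the function name `mask_out_of_range` is hypothetical.

```python
import numpy as np
import pandas as pd


def mask_out_of_range(
    df: pd.DataFrame, low: float = -50.0, high: float = 60.0, col: str = "temperature"
) -> pd.DataFrame:
    """Return a copy of `df` with `col` values outside [low, high] set to NaN."""
    out = df.copy()
    # between() is inclusive on both ends and False for NaN,
    # so existing missing values remain missing.
    out[col] = out[col].where(out[col].between(low, high))
    return out


# Made-up values, including the extremes seen in the summary statistics
demo = pd.DataFrame({"temperature": [-99.0, -1.1, 15.6, 902.0, np.nan]})
checked = mask_out_of_range(demo)
```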


## Handling Outliers

Outliers can skew our statistical analysis of the data. 

### 1. Visual Inspection

#### Boxplot

### 2. Statistical Methods

#### Z-Score
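
A minimal sketch of Z-score flagging, assuming the common convention of treating values more than 3 standard deviations from the mean as outliers. The threshold and the function name `zscore_outliers` are assumptions, not choices made in this notebook.

```python
import pandas as pd


def zscore_outliers(temps: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Boolean mask of values whose absolute Z-score exceeds `threshold`."""
    # mean() and std() skip NaN; std() is the sample standard deviation.
    z = (temps - temps.mean()) / temps.std()
    return z.abs() > threshold


# Made-up series: twenty ordinary readings and one implausible value
demo = pd.Series([15.0] * 20 + [902.0])
mask = zscore_outliers(demo)
```

Extreme values inflate the mean and standard deviation that the Z-score relies on, so it may work better after impossible readings have been removed.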

#### Interquartile Range (IQR)
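
A minimal sketch of the IQR method, assuming the conventional multiplier of 1.5; the function name `iqr_outliers` is hypothetical. With the quartiles reported by `describe()` earlier (10.5 and 20.0), the IQR is 9.5 and the fences would be 10.5 - 14.25 = -3.75 °C and 20.0 + 14.25 = 34.25 °C. These would also flag plausible California readings, so the multiplier may need tuning, or the method may be better applied per station or season.

```python
import pandas as pd


def iqr_outliers(temps: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = temps.quantile(0.25)
    q3 = temps.quantile(0.75)
    iqr = q3 - q1
    return (temps < q1 - k * iqr) | (temps > q3 + k * iqr)


# Made-up series containing the two extremes from the summary statistics
demo = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, 902.0, -99.0])
mask = iqr_outliers(demo)
```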

### 3. Data Transformation

#### Log Transformation or Winsorization
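
Winsorization can be sketched with pandas alone by clipping values to chosen quantiles. The 1st and 99th percentile defaults and the function name `winsorize` are hypothetical. A log transformation is not sketched because the temperature column contains negative values, for which the logarithm is undefined without a shift.

```python
import pandas as pd


def winsorize(temps: pd.Series, lower_q: float = 0.01, upper_q: float = 0.99) -> pd.Series:
    """Clip values below/above the given quantiles to those quantile values."""
    lower, upper = temps.quantile([lower_q, upper_q])
    # clip() leaves NaN values untouched.
    return temps.clip(lower=lower, upper=upper)


# Made-up series: 0..99 plus one extreme value
demo = pd.Series([float(v) for v in range(100)] + [902.0])
clipped = winsorize(demo, lower_q=0.05, upper_q=0.95)
```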