# Data Exploration

## Background 

### Data Source
The data used in this notebook comes from the National Centers for Environmental Information (NCEI): [Global Historical Climatology Network (GHCN) - Hourly](https://www.ncei.noaa.gov/products/global-historical-climatology-network-hourly). Refer to the NCEI documentation and terms of use for details.


#### Data Set

Station_ID: the station identification code. The first two characters signify the FIPS country code, the third character is a network code identifying the station numbering system used, and the remaining eight characters contain the actual station ID.
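
As an illustration of the layout described above, the following sketch splits an 11-character station ID into its three parts. The helper name `parse_station_id` and the `StationId` type are hypothetical and are not part of the project code.

```python
from typing import NamedTuple


class StationId(NamedTuple):
    country: str  # two-character FIPS country code
    network: str  # one-character network code
    station: str  # eight-character station identifier


def parse_station_id(station_id: str) -> StationId:
    """Split an 11-character station ID into country, network, and station parts."""
    if len(station_id) != 11:
        raise ValueError(
            f"expected 11 characters, got {len(station_id)}: {station_id!r}"
        )
    return StationId(station_id[:2], station_id[2], station_id[3:])


parsed = parse_station_id("USL000SANF1")
print(parsed.country, parsed.network, parsed.station)
```

A named tuple keeps the three fields readable while still comparing equal to a plain tuple.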

Station_name: the name of the station.

Year: the year the observation was taken in Coordinated Universal Time (UTC).

Month: the month the observation was taken in Coordinated Universal Time (UTC).

Day: the day the observation was taken in Coordinated Universal Time (UTC).

Hour: the hour the observation was taken in Coordinated Universal Time (UTC).

Latitude: the latitude of the station (in decimal degrees). North (+); South (-).

Longitude: the longitude of the station (in decimal degrees). East (+); West (-).

temperature: the air (dry bulb) temperature at approximately 2 meters above ground level (°C, to tenths).


Notes:
- The raw data was removed in download_ghcn.py to save storage space.
- The GHCN-Hourly dataset provides PSV (pipe-separated values) files for individual stations in specific years. During processing, these were converted to CSV files, one per year, covering all California stations for 2003 - 2023.
- Most columns were dropped because they were not needed. The columns kept are the ones described above.
- Duplicate rows were dropped.
- Missing temperature values were filled with -999.
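
The processing steps in these notes could look roughly like the sketch below. It is not the actual code from download_ghcn.py or the preprocessing module: the sample rows are made up, the extra `wind_speed` column is a hypothetical stand-in for the dropped columns, and the real GHCN-Hourly column names may differ.

```python
import io

import pandas as pd

# Made-up pipe-separated sample; real GHCN-Hourly files have many more columns.
raw_psv = io.StringIO(
    "Station_ID|Station_name|Year|Month|Day|Hour|Latitude|Longitude|temperature|wind_speed\n"
    "USL000SANF1|SAND KEY FL|2003|1|1|0|24.46|-81.88|24.6|3.1\n"
    "USL000SANF1|SAND KEY FL|2003|1|1|0|24.46|-81.88|24.6|3.1\n"  # exact duplicate
    "USL000SANF1|SAND KEY FL|2003|1|1|1|24.46|-81.88||2.8\n"      # missing temperature
)

KEEP = [
    "Station_ID", "Station_name", "Year", "Month", "Day", "Hour",
    "Latitude", "Longitude", "temperature",
]
MISSING = -999  # placeholder used for missing temperatures

raw = pd.read_csv(raw_psv, sep="|")
reduced = (
    raw[KEEP]                 # keep only the columns described above
    .drop_duplicates()        # drop duplicate rows
    .assign(temperature=lambda d: d["temperature"].fillna(MISSING))
    .reset_index(drop=True)
)

# Serialize as comma-separated text; in practice this would be written to a file.
csv_text = reduced.to_csv(index=False)
print(csv_text)
```

Filling gaps with a numeric placeholder keeps the column free of nulls, but the placeholder then has to be excluded before computing any statistics.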


## Data Cleaning

### Data Examination

In [9]:
import pandas as pd
import sys
from preprocessing import combine_files_to_dfs

dfs = combine_files_to_dfs("../data/ghcn_reduced")
CA_stations = pd.concat(dfs, ignore_index=True)



Processed file: reduced_CA_stations_2003.csv
Processed file: reduced_CA_stations_2004.csv
Processed file: reduced_CA_stations_2005.csv
Processed file: reduced_CA_stations_2006.csv
Processed file: reduced_CA_stations_2007.csv
Processed file: reduced_CA_stations_2008.csv
Processed file: reduced_CA_stations_2009.csv
Processed file: reduced_CA_stations_2010.csv
Processed file: reduced_CA_stations_2011.csv
Processed file: reduced_CA_stations_2012.csv
Processed file: reduced_CA_stations_2013.csv
Processed file: reduced_CA_stations_2014.csv
Processed file: reduced_CA_stations_2015.csv
Processed file: reduced_CA_stations_2016.csv
Processed file: reduced_CA_stations_2017.csv
Processed file: reduced_CA_stations_2018.csv
Processed file: reduced_CA_stations_2019.csv
Processed file: reduced_CA_stations_2020.csv
Processed file: reduced_CA_stations_2021.csv
Processed file: reduced_CA_stations_2022.csv
Processed file: reduced_CA_stations_2023.csv


In [10]:
CA_stations.head()

| | Station_ID | Station_name | Year | Month | Day | Hour | Latitude | Longitude | temperature |
|---|---|---|---|---|---|---|---|---|---|
| 0 | USL000SANF1 | SAND KEY FL | 2003 | 1 | 1 | 0 | 24.46 | -81.88 | 24.6 |
| 1 | USL000SANF1 | SAND KEY FL | 2003 | 1 | 1 | 1 | 24.46 | -81.88 | 24.7 |
| 2 | USL000SANF1 | SAND KEY FL | 2003 | 1 | 1 | 2 | 24.46 | -81.88 | 24.7 |
| 3 | USL000SANF1 | SAND KEY FL | 2003 | 1 | 1 | 3 | 24.46 | -81.88 | 24.6 |
| 4 | USL000SANF1 | SAND KEY FL | 2003 | 1 | 1 | 4 | 24.46 | -81.88 | 24.5 |


In [11]:
CA_stations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35152313 entries, 0 to 35152312
Data columns (total 9 columns):
 #   Column        Dtype  
---  ------        -----  
 0   Station_ID    object 
 1   Station_name  object 
 2   Year          int64  
 3   Month         int64  
 4   Day           int64  
 5   Hour          int64  
 6   Latitude      float64
 7   Longitude     float64
 8   temperature   float64
dtypes: float64(3), int64(4), object(2)
memory usage: 2.4+ GB


Although the data contains no null values (missing temperatures were filled with -999 during processing), some hourly observations are missing entirely: not every station has a row for every hour.
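
One way to locate missing hourly rows is to build a timestamp from the date columns and look for jumps of more than one hour within each station. The sketch below uses a made-up sample with a hypothetical station ID; `find_hourly_gaps` is an illustrative helper, not a function from the project.

```python
import pandas as pd

# Made-up sample: one hypothetical station with the 02:00 and 03:00 rows missing.
sample = pd.DataFrame(
    {
        "Station_ID": ["EXAMPLE0001"] * 4,
        "Year": [2003] * 4,
        "Month": [1] * 4,
        "Day": [1] * 4,
        "Hour": [0, 1, 4, 5],
    }
)


def find_hourly_gaps(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows that follow a gap, with the number of hours skipped."""
    # pd.to_datetime can assemble timestamps from year/month/day/hour columns.
    parts = df[["Year", "Month", "Day", "Hour"]].rename(columns=str.lower)
    out = df.assign(timestamp=pd.to_datetime(parts)).sort_values(
        ["Station_ID", "timestamp"]
    )
    # Time elapsed since the previous observation at the same station.
    step = out.groupby("Station_ID")["timestamp"].diff()
    gaps = out.loc[step > pd.Timedelta(hours=1), ["Station_ID", "timestamp"]].copy()
    gaps["missing_hours"] = (
        step.loc[gaps.index] / pd.Timedelta(hours=1)
    ).astype(int) - 1
    return gaps


gaps = find_hourly_gaps(sample)
print(gaps)
```

This only finds gaps between two observed rows; hours missing before a station's first observation or after its last one are not detected.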

In [12]:
CA_stations.describe()

| | Year | Month | Day | Hour | Latitude | Longitude | temperature |
|---|---|---|---|---|---|---|---|
| count | 35152310.0 | 35152310.0 | 35152310.0 | 35152310.0 | 35152310.0 | 35152310.0 | 35152310.0 |
| mean | 2013.445 | 6.490783 | 15.73185 | 11.58046 | 35.96936 | -118.8302 | -6.598248 |
| std | 5.39786 | 3.456516 | 8.794606 | 6.941218 | 3.891105 | 7.30364 | 150.5064 |
| min | 2003.0 | 1.0 | 1.0 | 0.0 | 12.0 | -124.2381 | -999.0 |
| 25% | 2009.0 | 3.0 | 8.0 | 6.0 | 34.0833 | -121.79 | 10.0 |
| 50% | 2014.0 | 6.0 | 16.0 | 12.0 | 36.6019 | -119.8797 | 15.0 |
| 75% | 2018.0 | 10.0 | 23.0 | 18.0 | 38.15 | -117.6025 | 21.0 |
| max | 2023.0 | 12.0 | 31.0 | 23.0 | 41.7836 | -61.2 | 999.0 |


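The summary above shows a maximum temperature of 999 degrees Celsius (about 1830 degrees Fahrenheit), which is physically impossible for an air temperature, so it is an invalid value. The minimum of -999 is the placeholder used for missing temperatures.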
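
A quick way to see how many readings are affected is to flag the -999 placeholder and any value outside a plausible range. In the sketch below the sample values are made up, and the bounds of -90 °C and 60 °C are an assumption chosen to sit just outside recorded surface air temperature extremes; they are not thresholds taken from this notebook.

```python
import pandas as pd

MISSING = -999           # placeholder used for missing temperatures
LOW, HIGH = -90.0, 60.0  # assumed plausible range in degrees Celsius

# Made-up sample standing in for CA_stations["temperature"].
sample = pd.DataFrame({"temperature": [24.6, -999.0, 999.0, 15.0]})


def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


is_missing = sample["temperature"] == MISSING
is_implausible = ~is_missing & ~sample["temperature"].between(LOW, HIGH)

# Replace flagged values with NaN so they no longer distort summary statistics.
valid = sample["temperature"].mask(is_missing | is_implausible)

print("missing placeholders:", int(is_missing.sum()))
print("implausible readings:", int(is_implausible.sum()))
print("max valid temperature (°C):", valid.max())
print("999 °C in °F:", round(celsius_to_fahrenheit(999.0), 1))
```

Masking to NaN rather than dropping rows keeps the row count intact, which matters if missing hours are analyzed later.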

### Cleaning Invalid Data