# Data Exploration

## Background 

### Data Source
The data used in this notebook is sourced from the National Centers for Environmental Information (NCEI): [Global Historical Climatology Network (GHCN) - Hourly](https://www.ncei.noaa.gov/products/global-historical-climatology-network-hourly). Refer to their documentation and terms of use.


#### Data Set

Station_ID: the station identification code. The first two characters signify the FIPS country code, the third character is a network code identifying the station numbering system used, and the remaining eight characters contain the actual station ID.

Station_name: the name of the station.

Year: the year the observation was taken in Coordinated Universal Time (UTC).

Month: the month the observation was taken in Coordinated Universal Time (UTC).

Day: the day the observation was taken in Coordinated Universal Time (UTC).

Hour: the hour the observation was taken in Coordinated Universal Time (UTC).

Latitude: the latitude of the station (in decimal degrees). North (+); South (-).

Longitude: the longitude of the station (in decimal degrees). East (+); West (-).

temperature: the air (dry bulb) temperature at approximately 2 meters above ground level (°C, to tenths).
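
The field layout described above can be illustrated with a minimal sketch that splits a station ID into its three parts and combines the four UTC date fields into one timestamp. The names `StationID`, `parse_station_id`, and `observation_time` are hypothetical helpers introduced here for illustration; they are not part of the notebook or of any GHCN tooling.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class StationID(NamedTuple):
    """Hypothetical container for the three parts of an 11-character station ID."""
    country: str  # FIPS country code (characters 1-2)
    network: str  # network code (character 3)
    number: str   # station number (characters 4-11)


def parse_station_id(station_id: str) -> StationID:
    """Split a station ID according to the 2 + 1 + 8 character layout."""
    if len(station_id) != 11:
        raise ValueError(f"expected 11 characters, got {len(station_id)}: {station_id!r}")
    return StationID(station_id[:2], station_id[2], station_id[3:])


def observation_time(year: int, month: int, day: int, hour: int) -> datetime:
    """Combine the Year, Month, Day and Hour fields into a timezone-aware UTC datetime."""
    return datetime(year, month, day, hour, tzinfo=timezone.utc)


parts = parse_station_id("GPW00000401")
when = observation_time(2003, 1, 1, 3)
print(parts)
print(when.isoformat())
```

Keeping the timestamp timezone-aware makes the UTC convention explicit if the observations are later compared with local-time data.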


### Data Examination

In [2]:
import pandas as pd

CA_2003_df = pd.read_csv('../data/ghcn_hourly/ghcn_processed/CA_stations_2003.csv')


In [3]:
CA_2003_df.head()

,Station_ID,Station_name,Year,Month,Day,Hour,Latitude,Longitude,temperature
0,GPW00000401,POINTE A PITRE INTL AP,2003,1,1,0,16.2669,-61.6,24.7
1,GPW00000401,POINTE A PITRE INTL AP,2003,1,1,1,16.2669,-61.6,25.0
2,GPW00000401,POINTE A PITRE INTL AP,2003,1,1,3,16.2669,-61.6,24.2
3,GPW00000401,POINTE A PITRE INTL AP,2003,1,1,4,16.2669,-61.6,24.0
4,GPW00000401,POINTE A PITRE INTL AP,2003,1,1,5,16.2669,-61.6,22.0


In [4]:
CA_2003_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 941680 entries, 0 to 941679
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Station_ID    941680 non-null  object 
 1   Station_name  941680 non-null  object 
 2   Year          941680 non-null  int64  
 3   Month         941680 non-null  int64  
 4   Day           941680 non-null  int64  
 5   Hour          941680 non-null  int64  
 6   Latitude      941680 non-null  float64
 7   Longitude     941680 non-null  float64
 8   temperature   941680 non-null  float64
dtypes: float64(3), int64(4), object(2)
memory usage: 64.7+ MB


Although no column contains null values, the head of the data shows that some hourly rows are missing: the first station, for example, has no row for hour 2 on 1 January 2003.
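
One way to list such gaps is to compare the observed timestamps with a complete hourly range. The sketch below rebuilds the five rows shown by `head()` as a small frame so that it runs on its own; `missing_hours` is a hypothetical helper name, and the approach assumes that every hour between a station's first and last observation should be present.

```python
import pandas as pd

# The five rows shown by head(): hour 2 is absent for this station.
sample = pd.DataFrame({
    "Station_ID": ["GPW00000401"] * 5,
    "Year": [2003] * 5,
    "Month": [1] * 5,
    "Day": [1] * 5,
    "Hour": [0, 1, 3, 4, 5],
    "temperature": [24.7, 25.0, 24.2, 24.0, 22.0],
})


def missing_hours(df: pd.DataFrame) -> pd.DatetimeIndex:
    """Return the hourly timestamps absent between the first and last observation."""
    # pd.to_datetime can assemble timestamps from year/month/day/hour columns.
    parts = df[["Year", "Month", "Day", "Hour"]].rename(columns=str.lower)
    observed = pd.DatetimeIndex(pd.to_datetime(parts))
    expected = pd.date_range(observed.min(), observed.max(), freq=pd.Timedelta(hours=1))
    return expected.difference(observed)


# One entry per station, so gaps are never measured across station boundaries.
gaps = {sid: missing_hours(group) for sid, group in sample.groupby("Station_ID")}
print(gaps["GPW00000401"])
```

On the full data set the same dictionary comprehension could be run over `CA_2003_df.groupby("Station_ID")`, though with several hundred thousand rows it may be slow.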

In [5]:
CA_2003_df.describe()

,Year,Month,Day,Hour,Latitude,Longitude,temperature
count,941680.0,941680.0,941680.0,941680.0,941680.0,941680.0,941680.0
mean,2003.0,6.591146,15.697363,11.511542,37.261181,-118.485979,0.295263
std,0.0,3.42926,8.753395,6.922126,4.975111,11.535646,5.843196
min,2003.0,1.0,1.0,0.0,11.15,-124.1322,-1.0
25%,2003.0,4.0,8.0,6.0,36.0975,-122.4161,-1.0
50%,2003.0,7.0,16.0,12.0,38.493,-121.0313,-1.0
75%,2003.0,10.0,23.0,18.0,39.8744,-119.6827,-1.0
max,2003.0,12.0,31.0,23.0,41.988,-60.8331,60.0


It's odd that temperature's min, 25%, 50%, and 75% values are all -1.
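
When the minimum and the 75th percentile coincide, at least three quarters of the readings share that single value. The sketch below shows, on hypothetical toy values, how the share could be measured and how the summary changes once that value is masked. It assumes that -1.0 is a placeholder rather than a real reading, which the source data does not confirm; `PLACEHOLDER` is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical toy values: mostly -1.0 plus two plausible readings.
temps = pd.Series([-1.0] * 8 + [24.7, 25.0], name="temperature")

PLACEHOLDER = -1.0  # assumption: -1.0 marks a missing reading

# With this many repeated values the quartiles collapse onto the placeholder.
print(temps.quantile([0.25, 0.5, 0.75]))

is_placeholder = temps.eq(PLACEHOLDER)
share = is_placeholder.mean()  # fraction of readings equal to the placeholder
print(f"{share:.0%} of readings equal {PLACEHOLDER}")

# Replace the placeholder with NaN so describe() summarises the remaining values only.
masked = temps.mask(is_placeholder)
print(masked.describe())
```

Masking instead of dropping rows keeps the frame's shape intact, which can be convenient if the rows are later aligned with an hourly index.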

In [6]:
CA_2003_df['temperature'].unique()

array([24.7, 25. , 24.2, 24. , 22. , 22.5, 21. , 23.5, 27. , 28. , 29.2,
       30. , 28.3, 27.5, 26. , 24.6, 23.4, 21.3, 20.3, 20. , 29. , 30.1,
       29.8, 25.1, 23. , 23.8, 23.7, 28.5, 29.9, 25.4, 23.6, 22.9, 23.2,
       23.1, 28.7, 24.5, 22.1, 20.9, 20.5, 22.4, 28.4, 29.4, 27.6, 21.8,
       28.1, 25.5, 24.9, 24.8, 30.3, 27.3, 24.3, 29.3, 25.6, 21.1, 27.7,
       20.1, 21.5, 22.8, 26.5, 28.9, 24.4, 22.7, 22.2, 29.1, 21.7, 27.9,
       23.3, 22.3, 27.8, 25.3, 22.6, 19.5, 19. , 30.5, 26.7, 21.9, 21.6,
       29.7, 23.9, 25.2, 28.8, 26.4,  2. , 20.8, 25.7, 21.4, 19.3, 29.6,
       26.6, 21.2, 25.9, 20.4, 27.4, 28.2, 29.5, 28.6, 24.1, 25.8, 26.2,
       27.2, 26.1, 30.2, 20.2, 26.3, 27.1, 20.6, 19.7, 26.9, 19.2, 26.8,
       18.8, 19.9, 19.1, 19.8, 18.9, 19.4, 31. , 20.7, 30.4, 30.8, -1. ,
       30.6, 30.9, 30.7, 32. , 31.2,  6. , 31.1, 32.1, 31.4, 31.3, 31.5,
       31.6, 31.9, 31.8, 31.7, 32.3, 17.9, 17. , 18. ,  9. , 32.4, 33. ,
       32.7, 32.6, 32.2, 34. ,  4. , 32.8, 32.9, 33

In [7]:
# Count, per station, the rows where temperature equals -1

temp_zero = CA_2003_df[CA_2003_df['temperature'] == -1]
zero_count = temp_zero.groupby(['Station_ID', 'Station_name']).size().reset_index(name='count')

# Rank stations by that count; top_stations holds the 100 highest
sorted_zero = zero_count.sort_values(by='count', ascending=False)
top_stations = sorted_zero.head(100)

# Print the per-station counts (ordered by Station_ID, not by count)
print(zero_count)

      Station_ID                    Station_name  count
0    GPW00000401          POINTE A PITRE INTL AP      3
1    MBW00000404  MARTINIQUE AIME CESAIRE INTL A      9
2    NLW00000413                FLAMINGO INTL AP      1
3    STW00000405           GEORGE F L CHARLES AP     66
4    TDW00000410  ARTHUR NAPOLEON RAYMOND ROBINS     15
..           ...                             ...    ...
128  USC00049855      YOSEMITE PARK HEADQUARTERS   6476
129  USI0000KHGT                        TUSI AHP    151
130  USI0000KSYL                     ROBERTS AHP      3
131  USL000FWYF1                 FOWEY ROCKS  FL    609
132  USL000LKWF1                  LAKE WORTH  FL      3

[133 rows x 3 columns]
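
Raw counts depend on how many rows each station reports, so a station with many observations can rank high even when -1 is a small part of its record. A possible complement, sketched here on a hypothetical toy frame that reuses the notebook's column names, is to compute the share of -1 readings per station alongside the count. The station IDs, names, and readings below are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data: station B has more -1 rows but a lower share of them.
df = pd.DataFrame({
    "Station_ID": ["STATION_A"] * 4 + ["STATION_B"] * 8,
    "Station_name": ["ALPHA"] * 4 + ["BRAVO"] * 8,
    "temperature": [-1.0, -1.0, -1.0, 20.0,
                    -1.0, 18.5, 19.0, 21.0, 22.0, -1.0, -1.0, -1.0],
})

summary = (
    df.assign(is_minus_one=df["temperature"].eq(-1))
      .groupby(["Station_ID", "Station_name"])["is_minus_one"]
      # count: rows equal to -1, total: all rows, share: count / total
      .agg(count="sum", total="size", share="mean")
      .reset_index()
      .sort_values("share", ascending=False)
)
print(summary)
```

Sorting by `share` rather than `count` puts the stations whose records are dominated by -1 at the top.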
