## Step 1 - Data Engineering

The climate data for Hawaii is provided through two CSV files. Start by using Python and Pandas to inspect the content of these files and clean the data.

* Create a Jupyter Notebook file called `data_engineering.ipynb` and use this to complete all of your Data Engineering tasks.

* Use Pandas to read in the measurement and station CSV files as DataFrames.

* Inspect the data for NaNs and missing values. You must decide what to do with this data.

* Save your cleaned CSV files with the prefix `clean_`.

In [1]:
import pandas as pd

In [6]:
# Forward slashes avoid invalid escape sequences such as '\h' and work on every OS
df_hawaii_meas = pd.read_csv('raw_data/hawaii_measurements.csv')
df_hawaii_stat = pd.read_csv('raw_data/hawaii_stations.csv')
print(df_hawaii_meas.shape)
print(df_hawaii_stat.shape)

(19550, 4)
(9, 5)


In [13]:
df_hawaii_meas.head()

Unnamed: 0,station,date,prcp,tobs
0,USC00519397,2010-01-01,0.08,65
1,USC00519397,2010-01-02,0.0,63
2,USC00519397,2010-01-03,0.0,74
3,USC00519397,2010-01-04,0.0,76
4,USC00519397,2010-01-06,,73


In [46]:
df_hawaii_stat  # 9 stations; no missing data in this DataFrame

Unnamed: 0,station,name,latitude,longitude,elevation
0,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0
1,USC00513117,"KANEOHE 838.1, HI US",21.4234,-157.8015,14.6
2,USC00514830,"KUALOA RANCH HEADQUARTERS 886.9, HI US",21.5213,-157.8374,7.0
3,USC00517948,"PEARL CITY, HI US",21.3934,-157.9751,11.9
4,USC00518838,"UPPER WAHIAWA 874.3, HI US",21.4992,-158.0111,306.6
5,USC00519523,"WAIMANALO EXPERIMENTAL FARM, HI US",21.33556,-157.71139,19.5
6,USC00519281,"WAIHEE 837.5, HI US",21.45167,-157.84889,32.9
7,USC00511918,"HONOLULU OBSERVATORY 702.2, HI US",21.3152,-157.9992,0.9
8,USC00516128,"MANOA LYON ARBO 785.2, HI US",21.3331,-157.8025,152.4


In [41]:
df_hawaii_meas.isnull().sum()  # prcp is the only column with NaN values

station       0
date          0
prcp       1447
tobs          0
dtype: int64
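
One way to look further into the missing `prcp` values is to count them per station. The following is a minimal sketch on a small made-up DataFrame: `toy_meas` and its values are illustrative assumptions, not rows from the real measurement file, and `missing_by_station` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_hawaii_meas (illustrative values, not the real data)
toy_meas = pd.DataFrame({
    "station": ["A", "A", "A", "B", "B", "C"],
    "date": ["2010-01-01", "2010-01-02", "2010-01-03",
             "2010-01-01", "2010-01-02", "2010-01-01"],
    "prcp": [0.08, np.nan, 0.0, np.nan, np.nan, 0.1],
    "tobs": [65, 63, 74, 70, 72, 68],
})

# Count and share of missing prcp readings for each station
missing_by_station = (
    toy_meas.assign(prcp_missing=toy_meas["prcp"].isnull())
    .groupby("station")["prcp_missing"]
    .agg(missing="sum", share="mean")
    .sort_values("missing", ascending=False)
)
print(missing_by_station)
```

Applied to the real DataFrame, a table like this would show whether the NaN values cluster at a few stations or are spread across all of them.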

In [37]:
# Filter first, then sort, so the boolean mask lines up with the DataFrame's index
df_hawaii_meas[df_hawaii_meas.prcp.isnull()].sort_values('date')


Unnamed: 0,station,date,prcp,tobs
4,USC00519397,2010-01-06,,73
9012,USC00518838,2010-01-07,,72
9020,USC00518838,2010-01-18,,70
9023,USC00518838,2010-01-25,,72
26,USC00519397,2010-01-30,,70
9027,USC00518838,2010-02-02,,66
29,USC00519397,2010-02-03,,67
9035,USC00518838,2010-02-12,,64
2765,USC00513117,2010-02-13,,71
9041,USC00518838,2010-02-19,,63


In [39]:
# Check whether a station with NaN values simply leaves prcp blank when there was no precipitation
df_hawaii_meas[df_hawaii_meas.station == 'USC00519397'].sort_values('date')


Unnamed: 0,station,date,prcp,tobs
0,USC00519397,2010-01-01,0.08,65
1,USC00519397,2010-01-02,0.00,63
2,USC00519397,2010-01-03,0.00,74
3,USC00519397,2010-01-04,0.00,76
4,USC00519397,2010-01-06,,73
5,USC00519397,2010-01-07,0.06,70
6,USC00519397,2010-01-08,0.00,64
7,USC00519397,2010-01-09,0.00,68
8,USC00519397,2010-01-10,0.00,73
9,USC00519397,2010-01-11,0.01,64


In [44]:
# The NaN values seem random: station USC00519397 records both 0.00 and NaN for prcp, so NaN does not
# simply mean zero precipitation. With so much other data available, drop the rows with NaN values
# from the measurements DataFrame (the stations DataFrame has none).
df_hawaii_meas.dropna(axis=0, inplace=True)
print(df_hawaii_meas.shape)

(18103, 4)
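
The new row count matches the earlier figures: 19550 rows minus 1447 rows with a missing `prcp` leaves 18103. The same check can be written in code. This is a minimal sketch on a made-up DataFrame (`toy_meas` and its values are assumptions for illustration). It also uses `dropna(subset=...)`, which returns a new DataFrame rather than modifying in place. That is an alternative to the `inplace=True` call in this notebook, not a required change.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_hawaii_meas (illustrative values, not the real data)
toy_meas = pd.DataFrame({
    "station": ["A", "A", "B", "B", "C"],
    "date": ["2010-01-01", "2010-01-02", "2010-01-01", "2010-01-02", "2010-01-01"],
    "prcp": [0.08, np.nan, 0.0, np.nan, 0.1],
    "tobs": [65, 63, 74, 70, 68],
})

n_before = len(toy_meas)
n_missing = int(toy_meas["prcp"].isnull().sum())

# Drop only the rows where prcp is missing; the original DataFrame is left untouched
toy_clean = toy_meas.dropna(subset=["prcp"])

# Rows remaining should equal rows before minus rows with missing prcp
assert len(toy_clean) == n_before - n_missing
print(n_before, n_missing, len(toy_clean))
```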


In [45]:
df_hawaii_meas.isnull().sum()  # double-check that all rows with NaN values were dropped

station    0
date       0
prcp       0
tobs       0
dtype: int64

In [50]:
# Forward slash avoids the invalid '\c' escape sequence
df_hawaii_meas.to_csv('raw_data/clean_hawaii_measurements.csv', index=False)
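
The task asks for the cleaned files to be saved with the `clean_` prefix, and the stations table had no missing values, so it could be written out unchanged. The following is a minimal sketch under stated assumptions: the filename `clean_hawaii_stations.csv` is a guess at the intended name, `toy_stat` holds only the first two station rows shown above, and a temporary directory stands in for `raw_data` so the example runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

# First two station rows from the table above, as a small stand-in for df_hawaii_stat
toy_stat = pd.DataFrame({
    "station": ["USC00519397", "USC00513117"],
    "name": ["WAIKIKI 717.2, HI US", "KANEOHE 838.1, HI US"],
    "latitude": [21.2716, 21.4234],
    "longitude": [-157.8168, -157.8015],
    "elevation": [3.0, 14.6],
})

with tempfile.TemporaryDirectory() as tmp:
    # pathlib joins path parts with the right separator for the current OS
    out_path = Path(tmp) / "clean_hawaii_stations.csv"  # assumed filename
    toy_stat.to_csv(out_path, index=False)  # index=False avoids an extra 'Unnamed: 0' column
    reloaded = pd.read_csv(out_path)

# The reloaded file should have the same columns and station IDs as the original
print(list(reloaded.columns) == list(toy_stat.columns))
print(reloaded["station"].tolist() == toy_stat["station"].tolist())
```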