1. Let's start by importing the modules we'll need.

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import glob

2. Now let's load the **newark_airport_2016.csv** data into a new dataframe, **newark**.

In [2]:
newark = pd.read_csv('newark_airport_2016.csv')

3. Let's check out the length of the data and then the first five rows of our **newark** dataset to get an idea of what we have.

In [3]:
print(len(newark))
print(newark.head())

366
       STATION                                         NAME        DATE  \
0  USW00014734  NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US  2016-01-01   
1  USW00014734  NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US  2016-01-02   
2  USW00014734  NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US  2016-01-03   
3  USW00014734  NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US  2016-01-04   
4  USW00014734  NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US  2016-01-05   

      AWND  PGTM  PRCP  SNOW  SNWD  TAVG 

It looks like there is one row for each of the 366 days of 2016, which was a leap year.
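That reading can be checked rather than eyeballed by comparing the dates against a full calendar for the year. Here is a minimal sketch, using a small stand-in frame (`sample`, a hypothetical name) instead of the real file, and assuming the DATE column holds ISO-formatted strings as in the output above:

```python
import pandas as pd

# Stand-in for the real data (hypothetical): one row per day of 2016
sample = pd.DataFrame(
    {'DATE': pd.date_range('2016-01-01', '2016-12-31').strftime('%Y-%m-%d')}
)

# Every calendar day of 2016
expected_days = pd.date_range('2016-01-01', '2016-12-31')

# Parse the date strings, then look for gaps and duplicates
dates = pd.DatetimeIndex(pd.to_datetime(sample['DATE']))
missing_days = expected_days.difference(dates)
has_duplicates = bool(dates.duplicated().any())

print(len(expected_days), len(missing_days), has_duplicates)
```

With the real data, the same check would use `newark['DATE']` in place of `sample['DATE']`.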

4. Now let's start our data cleaning. First, let's count how many of our 366 rows have null or missing values in each column.

In [4]:
print(newark.isna().sum())

STATION      0
NAME         0
DATE         0
AWND         0
PGTM       366
PRCP         0
SNOW         0
SNWD         0
TAVG         0
TMAX         0
TMIN         0
TSUN       366
WDF2         0
WDF5         2
WSF2         0
WSF5         2
dtype: int64
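Raw counts are easier to judge as a share of all rows. The sketch below shows one way to get that share and to list the columns that are empty in every row. The `sample` frame and its values are made up for illustration; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: one complete column, one empty, one with a gap
sample = pd.DataFrame({
    'PRCP': [0.0, 0.01, 0.0, 0.2],
    'PGTM': [np.nan, np.nan, np.nan, np.nan],
    'WSF5': [21.9, np.nan, 18.1, 25.1],
})

# isna() gives booleans, so the column mean is the fraction of missing rows
missing_share = sample.isna().mean()

# Columns where every single row is missing
empty_columns = sample.columns[sample.isna().all()].tolist()

print(missing_share)
print(empty_columns)
```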


5. Next, let's deal with the missing data. We don't want to eliminate any rows, since every row contains valuable weather observations. Instead, we'll *drop* the PGTM and TSUN columns, which are empty in all 366 rows, and reassign the result to **newark**.

In [5]:
newark = newark.drop(['PGTM', 'TSUN'], axis=1)
print(newark.isna().sum())

STATION    0
NAME       0
DATE       0
AWND       0
PRCP       0
SNOW       0
SNWD       0
TAVG       0
TMAX       0
TMIN       0
WDF2       0
WDF5       2
WSF2       0
WSF5       2
dtype: int64
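As an alternative to naming the empty columns by hand, pandas can drop every column that is entirely null. This is a sketch on a hypothetical `sample` frame with made-up values, not the approach used in this notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: PGTM is empty, WDF5 is only partly missing
sample = pd.DataFrame({
    'TAVG': [41, 36, 37],
    'PGTM': [np.nan, np.nan, np.nan],
    'WDF5': [270.0, np.nan, 280.0],
})

# how='all' drops a column only when every value in it is missing
trimmed = sample.dropna(axis=1, how='all')

print(trimmed.columns.tolist())
```

Columns with only a few gaps, such as WDF5 here, are kept, which matches the result of the explicit `drop` above.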


After that, the only missing values left are in the WDF5 and WSF5 columns, two in each.

6. Let's print the rows where either column is missing data.

In [6]:
print(newark[newark.WDF5.isna() | newark.WSF5.isna()])

         STATION                                         NAME        DATE  \
32   USW00014734  NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US  2016-02-02   
329  USW00014734  NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US  2016-11-25   

     AWND  PRCP  SNOW  SNWD  TAVG  TMAX  TMIN  WDF2  WDF5  WSF2  WSF5  
32    3.8  0.00   0.0   1.2    43    51    32   350   NaN  11.0   NaN  
329   4.7  0.01   0.0   0.0    48    55    45   220   NaN   8.1   NaN  
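The filter above names the two columns explicitly. A more general version, sketched here on a hypothetical `sample` frame with made-up values, selects rows that have a missing value in any column:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: the middle row has gaps
sample = pd.DataFrame({
    'DATE': ['2016-02-01', '2016-02-02', '2016-02-03'],
    'WDF5': [340.0, np.nan, 300.0],
    'WSF5': [17.0, np.nan, 21.0],
})

# any(axis=1) is True for a row if at least one of its columns is missing
rows_with_gaps = sample[sample.isna().any(axis=1)]

print(rows_with_gaps)
```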


We'll address those missing values in a few steps, but first let's look at our column names. They aren't very descriptive or helpful to us, so we should consider changing them.

7. Let's see what our column names are.

In [7]:
print(newark.columns)

Index(['STATION', 'NAME', 'DATE', 'AWND', 'PRCP', 'SNOW', 'SNWD', 'TAVG',
       'TMAX', 'TMIN', 'WDF2', 'WDF5', 'WSF2', 'WSF5'],
      dtype='object')


8. Next, let's rename the **newark** columns so that they are (A) easier to understand and (B) easier to work with in our upcoming database.

In [8]:
newark = newark.rename(columns={'STATION':'station', 'NAME':'name', 'DATE':'date', 'AWND':'avg_daily_wind_mph',
                        'PRCP':'precip_in', 'SNOW':'snow_accum_in', 'SNWD':'snow_depth_in','TAVG':'avg_temp_f',
                        'TMAX':'max_temp_f', 'TMIN':'min_temp_f', 'WDF2':'avg_2sec_wind_dir', 'WDF5':'avg_5sec_wind_dir',
                        'WSF2':'max_2sec_wind_mph', 'WSF5':'max_5sec_wind_mph'})
print(newark.head())

       station                                         name        date  \
0  USW00014734  NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US  2016-01-01   
1  USW00014734  NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US  2016-01-02   
2  USW00014734  NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US  2016-01-03   
3  USW00014734  NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US  2016-01-04   
4  USW00014734  NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US  2016-01-05   

   avg_daily_wind_mph  precip_in  snow_accum_in  snow_depth_in  avg_temp_f  \
0               12.75        0.0            0.0            0.0          41   
1                9.40        0.0            0.0            0.0          36   
2               10.29        0.0            0.0            0.0          37   
3               17.22        0.0            0.0            0.0          32   
4                9.84        0.0            0.0            0.0          19   

   max_temp_f  min_temp_f  avg_2sec_wind_dir  avg_5sec_wind_dir  \
0          43
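One caveat with `rename` is that, by default, it silently ignores keys that don't match any column, so a typo in the mapping goes unnoticed. Passing `errors='raise'` makes pandas raise a `KeyError` instead. A small sketch on a hypothetical `sample` frame, where `'TAVGG'` is a deliberate typo:

```python
import pandas as pd

# Hypothetical stand-in with two of the original column names
sample = pd.DataFrame({'DATE': ['2016-01-01'], 'TAVG': [41]})

# A correct mapping renames as usual
renamed = sample.rename(
    columns={'DATE': 'date', 'TAVG': 'avg_temp_f'}, errors='raise'
)

# A misspelled key is reported instead of being skipped
try:
    sample.rename(columns={'TAVGG': 'avg_temp_f'}, errors='raise')
    typo_caught = False
except KeyError:
    typo_caught = True

print(renamed.columns.tolist(), typo_caught)
```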

Come to think of it, there are still some columns left that are not really of any use to us here:
- station
- avg_2sec_wind_dir
- avg_5sec_wind_dir
- max_2sec_wind_mph
- max_5sec_wind_mph

9. Let's *drop* these columns. Note that the remaining missing values all sit in two of them (avg_5sec_wind_dir and max_5sec_wind_mph), so dropping these columns also clears the last of the missing data.

In [9]:
newark = newark.drop(['station', 'avg_2sec_wind_dir', 'avg_5sec_wind_dir', 'max_2sec_wind_mph', 'max_5sec_wind_mph'], axis=1)
print(newark.head())

                                          name        date  \
0  NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US  2016-01-01   
1  NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US  2016-01-02   
2  NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US  2016-01-03   
3  NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US  2016-01-04   
4  NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US  2016-01-05   

   avg_daily_wind_mph  precip_in  snow_accum_in  snow_depth_in  avg_temp_f  \
0               12.75        0.0            0.0            0.0          41   
1                9.40        0.0            0.0            0.0          36   
2               10.29        0.0            0.0            0.0          37   
3               17.22        0.0            0.0            0.0          32   
4                9.84        0.0            0.0            0.0          19   

   max_temp_f  min_temp_f  
0          43          34  
1          42          30  
2          47          28  
3          35          14  
4          31     

10. Now, let's *describe* **newark** to look for any oddities or potential outliers in our data.

In [10]:
print(newark.describe())

       avg_daily_wind_mph   precip_in  snow_accum_in  snow_depth_in  \
count          366.000000  366.000000     366.000000     366.000000   
mean             9.429973    0.104945       0.098087       0.342623   
std              3.748174    0.307496       1.276498       2.078510   
min              2.460000    0.000000       0.000000       0.000000   
25%              6.765000    0.000000       0.000000       0.000000   
50%              8.720000    0.000000       0.000000       0.000000   
75%             11.410000    0.030000       0.000000       0.000000   
max             22.820000    2.790000      24.000000      20.100000   

       avg_temp_f  max_temp_f  min_temp_f  
count  366.000000  366.000000  366.000000  
mean    57.196721   65.991803   48.459016  
std     17.466981   18.606301   17.135790  
min      8.000000   18.000000    0.000000  
25%     43.000000   51.250000   35.000000  
50%     56.000000   66.000000   47.000000  
75%     74.000000   83.000000   64.000000  
max     
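Summary statistics can be backed up with a simple consistency check. One possible check, offered as a sketch and not as part of the original workflow, is that each day's average temperature falls between its minimum and maximum. The `sample` frame below is a hypothetical stand-in that reuses the first three rows shown earlier:

```python
import pandas as pd

# Stand-in frame using the first three days from the output above
sample = pd.DataFrame({
    'min_temp_f': [34, 30, 28],
    'avg_temp_f': [41, 36, 37],
    'max_temp_f': [43, 42, 47],
})

# True where min <= avg <= max holds for the row
in_order = (
    (sample['min_temp_f'] <= sample['avg_temp_f'])
    & (sample['avg_temp_f'] <= sample['max_temp_f'])
)

# Rows that break the ordering deserve a closer look
suspect_rows = sample[~in_order]

print(len(suspect_rows))
```

Any rows that show up in `suspect_rows` on the real data would be candidates for manual review, not automatic removal.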

11. Let's print the data type of each column to check that it matches what we expect.

In [11]:
print(newark.dtypes)

name                   object
date                   object
avg_daily_wind_mph    float64
precip_in             float64
snow_accum_in         float64
snow_depth_in         float64
avg_temp_f              int64
max_temp_f              int64
min_temp_f              int64
dtype: object
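The date column is stored as `object`, meaning plain strings. That is fine for writing to CSV, but if a later step needs date arithmetic or sorting by date, it could be converted first. A minimal sketch on a hypothetical `sample` frame, assuming the `YYYY-MM-DD` format seen in the output above:

```python
import pandas as pd

# Hypothetical stand-in with string dates
sample = pd.DataFrame({
    'date': ['2016-01-01', '2016-01-02'],
    'avg_temp_f': [41, 36],
})

# Parse the strings into real datetime values
sample['date'] = pd.to_datetime(sample['date'], format='%Y-%m-%d')

print(sample.dtypes)
print(sample['date'].dt.month.tolist())
```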


12. Now let's write the cleaned data to a CSV file.

In [12]:
newark.to_csv('newark_wx.csv', index=False)
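A quick way to confirm the export is to read it back and compare shapes and column names. The sketch below writes to an in-memory buffer instead of a file so it can run anywhere. The `sample` frame is a hypothetical stand-in; with the real data, the same idea applies to `newark_wx.csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for the cleaned data; note the comma inside the name
sample = pd.DataFrame({
    'name': ['NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US'],
    'date': ['2016-01-01'],
    'avg_temp_f': [41],
})

# Write without the index, as in the step above, then read it back
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

print(round_trip.shape == sample.shape)
print(round_trip.columns.tolist())
```

Because `index=False` was used, no extra unnamed index column appears when the file is read back.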