<a id = "top"></a>

# Weather preprocessing notebook (Version 1.0)

**Ying Zhou**

**Table of contents**

[1. Data cleaning](#1)

[2. Feature engineering](#2)

Let's use R to clean up the weather data.

In [1]:
library(tidyverse)
library(lubridate)

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.0.0     ✔ purrr   0.2.5
✔ tibble  1.4.2     ✔ dplyr   0.7.6
✔ tidyr   0.8.1     ✔ stringr 1.3.1
✔ readr   1.1.1     ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Attaching package: ‘lubridate’

The following object is masked from ‘package:base’:

    date



# 1. Clean up NAs

In [2]:
df <- read_csv("weather.csv")

Parsed with column specification:
cols(
  .default = col_character(),
  LATITUDE = col_double(),
  LONGITUDE = col_double(),
  ELEVATION = col_double(),
  DATE = col_date(format = ""),
  AWND = col_double(),
  FMTM = col_integer(),
  PGTM = col_integer(),
  PRCP = col_double(),
  SNOW = col_double(),
  TAVG = col_integer(),
  TMAX = col_integer(),
  TMIN = col_integer(),
  WDF2 = col_integer(),
  WDF5 = col_integer(),
  WSF2 = col_double(),
  WSF5 = col_double(),
  WT01 = col_integer(),
  WT02 = col_integer(),
  WT03 = col_integer(),
  WT04 = col_integer()
  # ... with 11 more columns
)
See spec(...) for full column specifications.


In [3]:
df <- select(df, -c(STATION, NAME, LATITUDE, LONGITUDE, ELEVATION))

In [4]:
spec(df)

cols(
  STATION = col_character(),
  NAME = col_character(),
  LATITUDE = col_double(),
  LONGITUDE = col_double(),
  ELEVATION = col_double(),
  DATE = col_date(format = ""),
  AWND = col_double(),
  AWND_ATTRIBUTES = col_character(),
  FMTM = col_integer(),
  FMTM_ATTRIBUTES = col_character(),
  PGTM = col_integer(),
  PGTM_ATTRIBUTES = col_character(),
  PRCP = col_double(),
  PRCP_ATTRIBUTES = col_character(),
  SNOW = col_double(),
  SNOW_ATTRIBUTES = col_character(),
  SNWD = col_character(),
  SNWD_ATTRIBUTES = col_character(),
  TAVG = col_integer(),
  TAVG_ATTRIBUTES = col_character(),
  TMAX = col_integer(),
  TMAX_ATTRIBUTES = col_character(),
  TMIN = col_integer(),
  TMIN_ATTRIBUTES = col_character(),
  WDF2 = col_integer(),
  WDF2_ATTRIBUTES = col_character(),
  WDF5 = col_integer(),
  WDF5_ATTRIBUTES = col_character(),
  WSF2 = col_double(),
  WSF2_ATTRIBUTES = col_character(),
  WSF5 = col_double(),
  WSF5_ATTRIBUTES = col_character(),
  WT01 = col_integer(),
  WT01_ATT

In [5]:
str(df)

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	2559 obs. of  57 variables:
 $ DATE           : Date, format: "2012-07-08" "2012-07-09" ...
 $ AWND           : num  8.95 8.05 7.38 8.28 10.74 ...
 $ AWND_ATTRIBUTES: chr  ",,W" ",,W" ",,W" ",,W" ...
 $ FMTM           : int  9999 9999 9999 9999 9999 9999 9999 9999 9999 9999 ...
 $ FMTM_ATTRIBUTES: chr  ",X,X" ",X,X" ",X,X" ",X,X" ...
 $ PGTM           : int  NA NA NA NA NA NA NA NA NA NA ...
 $ PGTM_ATTRIBUTES: chr  NA NA NA NA ...
 $ PRCP           : num  0 0 0 0 0 0 0 0 0 0 ...
 $ PRCP_ATTRIBUTES: chr  ",,X,2400" ",,X,2400" ",,X,2400" ",,X,2400" ...
 $ SNOW           : num  0 0 0 0 0 0 0 0 0 0 ...
 $ SNOW_ATTRIBUTES: chr  ",,X" ",,X" ",,X" ",,X" ...
 $ SNWD           : chr  NA NA NA NA ...
 $ SNWD_ATTRIBUTES: chr  NA NA NA NA ...
 $ TAVG           : int  NA NA NA NA NA NA NA NA NA NA ...
 $ TAVG_ATTRIBUTES: chr  NA NA NA NA ...
 $ TMAX           : int  89 84 83 80 86 90 91 91 88 97 ...
 $ TMAX_ATTRIBUTES: chr  ",,X" ",,X" ",,X" ",,X" ...
 $ T

Now we need to drop the columns we do not need. The `_ATTRIBUTES` columns only describe the measurements and are of little use here, and the same goes for `FMTM`, `PGTM`, etc.

In [6]:
kept_cols <- c('DATE','AWND','PRCP','SNOW','TMAX','TMIN','WSF5','WT01','WT02','WT03','WT04','WT05','WT06','WT08','WT09','WT13','WT14','WT15','WT16','WT17','WT18','WT22')
df <- df %>% select(kept_cols)
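
As a side note, the same kind of column pruning can be written with selection helpers. This is only a sketch on a made-up `toy` table (its values and the names `toy`, `no_attrs`, `keep` and `slim` are illustrative, not part of the notebook), and `all_of()` assumes dplyr 1.0 or later, which is newer than the 0.7.6 loaded above.

```r
library(tibble)
library(dplyr)

# Toy stand-in for the raw table; the values are made up for illustration.
toy <- tibble(
  DATE = as.Date(c("2012-07-08", "2012-07-09")),
  AWND = c(8.95, 8.05),
  AWND_ATTRIBUTES = c(",,W", ",,W"),
  FMTM = c(9999L, 9999L),
  FMTM_ATTRIBUTES = c(",X,X", ",X,X"),
  WT16 = c(NA, 1L)
)

# Option 1: drop every *_ATTRIBUTES column by name pattern.
no_attrs <- toy %>% select(-ends_with("_ATTRIBUTES"))

# Option 2: keep an explicit list of names stored in a character vector.
# all_of() marks the vector as external column names (assumes dplyr >= 1.0).
keep <- c("DATE", "AWND", "WT16")
slim <- toy %>% select(all_of(keep))

names(no_attrs)
names(slim)
```

The pattern-based form is handy when the unwanted columns share a suffix, while the explicit list makes it obvious which columns survive.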

In [7]:
str(df)

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	2559 obs. of  22 variables:
 $ DATE: Date, format: "2012-07-08" "2012-07-09" ...
 $ AWND: num  8.95 8.05 7.38 8.28 10.74 ...
 $ PRCP: num  0 0 0 0 0 0 0 0 0 0 ...
 $ SNOW: num  0 0 0 0 0 0 0 0 0 0 ...
 $ TMAX: int  89 84 83 80 86 90 91 91 88 97 ...
 $ TMIN: int  71 67 65 66 66 69 72 72 72 76 ...
 $ WSF5: num  25.1 23.9 17 17.9 23.9 25.9 23.9 25.9 21.9 36 ...
 $ WT01: int  NA NA NA NA NA NA NA NA NA NA ...
 $ WT02: int  NA NA NA NA NA NA NA NA NA NA ...
 $ WT03: int  NA NA NA NA NA NA NA NA NA NA ...
 $ WT04: int  NA NA NA NA NA NA NA NA NA NA ...
 $ WT05: int  NA NA NA NA NA NA NA NA NA NA ...
 $ WT06: int  NA NA NA NA NA NA NA NA NA NA ...
 $ WT08: int  NA NA NA NA NA NA NA NA NA NA ...
 $ WT09: int  NA NA NA NA NA NA NA NA NA NA ...
 $ WT13: int  NA NA NA NA NA NA NA NA NA NA ...
 $ WT14: int  NA NA NA NA NA NA NA NA NA NA ...
 $ WT15: int  NA NA NA NA NA NA NA NA NA NA ...
 $ WT16: int  NA NA NA NA NA NA NA 1 1 NA ...
 $ WT17: int  NA NA NA 

Time to fill the NAs in the `WT` columns with 0.

In [8]:
# Replace every NA in the weather-type flag columns with 0
wt_cols <- c('WT01','WT02','WT03','WT04','WT05','WT06','WT08','WT09','WT13','WT14','WT15','WT16','WT17','WT18','WT22')
df[wt_cols][is.na(df[wt_cols])] <- 0
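
For comparison, here is a minimal sketch of the same fill using `tidyr::replace_na()` with one replacement value per column. The `toy` table and the names `flag_cols`, `fill_values` and `filled` are made up for illustration. Using `0L` rather than `0` is a deliberate choice in this sketch: it keeps the flag columns as integers, whereas assigning a plain `0` turns them into `num` columns.

```r
library(tibble)
library(tidyr)

# Toy table with two flag columns; values are made up for illustration.
toy <- tibble(
  TMAX = c(89L, 84L, 83L),
  WT01 = c(NA, 1L, NA),
  WT16 = c(1L, NA, NA)
)
flag_cols <- c("WT01", "WT16")

# Build a named list: one integer zero per flag column.
fill_values <- setNames(as.list(rep(0L, length(flag_cols))), flag_cols)

# replace_na() only touches the columns named in the list.
filled <- replace_na(toy, fill_values)
filled
```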

In [9]:
str(df)

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	2559 obs. of  22 variables:
 $ DATE: Date, format: "2012-07-08" "2012-07-09" ...
 $ AWND: num  8.95 8.05 7.38 8.28 10.74 ...
 $ PRCP: num  0 0 0 0 0 0 0 0 0 0 ...
 $ SNOW: num  0 0 0 0 0 0 0 0 0 0 ...
 $ TMAX: int  89 84 83 80 86 90 91 91 88 97 ...
 $ TMIN: int  71 67 65 66 66 69 72 72 72 76 ...
 $ WSF5: num  25.1 23.9 17 17.9 23.9 25.9 23.9 25.9 21.9 36 ...
 $ WT01: num  0 0 0 0 0 0 0 0 0 0 ...
 $ WT02: num  0 0 0 0 0 0 0 0 0 0 ...
 $ WT03: num  0 0 0 0 0 0 0 0 0 0 ...
 $ WT04: num  0 0 0 0 0 0 0 0 0 0 ...
 $ WT05: num  0 0 0 0 0 0 0 0 0 0 ...
 $ WT06: num  0 0 0 0 0 0 0 0 0 0 ...
 $ WT08: num  0 0 0 0 0 0 0 0 0 0 ...
 $ WT09: num  0 0 0 0 0 0 0 0 0 0 ...
 $ WT13: num  0 0 0 0 0 0 0 0 0 0 ...
 $ WT14: num  0 0 0 0 0 0 0 0 0 0 ...
 $ WT15: num  0 0 0 0 0 0 0 0 0 0 ...
 $ WT16: num  0 0 0 0 0 0 0 1 1 0 ...
 $ WT17: num  0 0 0 0 0 0 0 0 0 0 ...
 $ WT18: num  0 0 0 0 0 0 0 0 0 0 ...
 $ WT22: num  0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, "spec")=List of 

In [79]:
df %>% group_by(AWND) %>% tally()

AWND,n
2.24,1
2.91,1
3.13,3
3.36,1
3.58,3
3.80,5
4.03,5
4.25,5
4.47,7
4.70,11


Now let's check how many NAs there are in each column.

In [10]:
df %>% select_if(function(x) any(is.na(x))) %>% summarise_all(function(x) sum(is.na(x)))


AWND,PRCP,SNOW,TMAX,TMIN,WSF5
2,2,3,2,2,11
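
The per-column NA count can also be computed in base R. The sketch below uses a made-up `toy` data frame (not the notebook's data) to show the idea: `is.na()` on a data frame gives a logical matrix, and `colSums()` counts the `TRUE` values per column.

```r
# Toy data frame; values are made up for illustration.
toy <- data.frame(
  AWND = c(8.95, NA, 7.38),
  SNOW = c(0, NA, NA),
  TMAX = c(89L, 84L, 83L)
)

# Count missing values per column.
na_counts <- colSums(is.na(toy))

# Keep only the columns that actually have missing values.
na_counts[na_counts > 0]
```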


Not having `AWND`, `TMAX` or `TMIN` is really bad. Rows missing these values have to be dropped.

In [12]:
df <- df %>% drop_na(AWND)
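
The cell above filters on `AWND` only. If the rows missing `TMAX` or `TMIN` were different rows, all three columns could be passed to `drop_na()` at once. A minimal sketch on a made-up `toy` table (the name `complete_core` is illustrative):

```r
library(tibble)
library(tidyr)

# Toy table; values are made up for illustration.
toy <- tibble(
  AWND = c(8.95, NA, 7.38, 8.28),
  TMAX = c(89L, 84L, NA, 80L),
  TMIN = c(71L, 67L, 65L, 66L),
  WSF5 = c(25.1, 23.9, 17.0, NA)
)

# Drop a row if any of the three core columns is missing;
# NAs in other columns (here WSF5) do not cause a drop.
complete_core <- drop_na(toy, AWND, TMAX, TMIN)
complete_core
```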

In [13]:
df %>% select_if(function(x) any(is.na(x))) %>% summarise_all(function(x) sum(is.na(x)))


SNOW,WSF5
1,9


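Dropping the two rows without `AWND` also removed the missing `PRCP`, `TMAX` and `TMIN` values. Now let's look at the one remaining row with no `SNOW` value.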

In [14]:
dplyr::filter(df, is.na(SNOW))

DATE,AWND,PRCP,SNOW,TMAX,TMIN,WSF5,WT01,WT02,WT03,⋯,WT06,WT08,WT09,WT13,WT14,WT15,WT16,WT17,WT18,WT22
2019-07-08,7.83,0,,76,66,15,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0


That date is in July, so there was no snow. We can fill in 0.

In [15]:
# Fill the single missing SNOW value with 0
snow_col <- c('SNOW')
df[snow_col][is.na(df[snow_col])] <- 0
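
The fill above is safe because the only missing value falls in July. A more defensive variant would fill with 0 only on warm-season dates and leave any other gap for manual review. The sketch below assumes June through September as the snow-free months, which is an illustrative choice and not something the notebook states; the `toy` data frame is also made up.

```r
library(lubridate)

# Toy data frame; values are made up for illustration.
toy <- data.frame(
  DATE = as.Date(c("2019-07-08", "2019-01-15", "2019-08-01")),
  SNOW = c(NA, NA, 0)
)

# Hypothetical snow-free season: June to September.
warm_season <- month(toy$DATE) %in% 6:9

# Fill only the missing values that fall in the warm season.
toy$SNOW[is.na(toy$SNOW) & warm_season] <- 0
toy
```

The January row keeps its NA, so a winter gap would still show up in the next NA count.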

In [16]:
df %>% select_if(function(x) any(is.na(x))) %>% summarise_all(function(x) sum(is.na(x)))


WSF5
9


In [17]:
dplyr::filter(df, is.na(WSF5))

DATE,AWND,PRCP,SNOW,TMAX,TMIN,WSF5,WT01,WT02,WT03,⋯,WT06,WT08,WT09,WT13,WT14,WT15,WT16,WT17,WT18,WT22
2012-12-13,4.03,0.0,0,42,29,,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
2013-09-19,5.37,0.0,0,71,54,,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
2014-03-18,8.28,0.0,0,33,20,,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
2014-06-13,9.62,0.54,0,64,59,,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
2014-08-24,6.04,0.0,0,75,63,,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
2015-05-18,10.29,0.0,0,58,49,,1,0,0,⋯,0,0,0,0,0,0,0,0,0,0
2015-06-27,7.16,0.2,0,68,57,,1,0,0,⋯,0,0,0,0,0,0,0,0,0,0
2015-08-08,6.26,0.0,0,76,61,,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
2015-09-18,7.61,0.0,0,85,65,,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0


According to [Boston standards](https://sciencing.com/average-daily-wind-speed-24011.html), even the average wind speed is higher than 10.29 mph, which is the largest `AWND` among these nine rows. Hence none of these days seems particularly windy. We are going to bin this data anyway, so the missing `WSF5` values are not an issue.
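
To make the binning idea concrete, here is a minimal sketch using base R's `cut()`. The break points and labels are hypothetical and chosen only for illustration; they are not the bins used later in this notebook.

```r
# A few average wind speeds in mph, taken from the rows listed above plus one
# made-up higher value.
awnd <- c(4.03, 5.37, 8.28, 10.29, 14.5)

# Hypothetical bins: [0,5), [5,10), [10,15), [15,Inf).
wind_bin <- cut(
  awnd,
  breaks = c(0, 5, 10, 15, Inf),
  labels = c("calm", "light", "moderate", "strong"),
  right = FALSE
)
wind_bin
```

Note that `cut()` returns NA for a missing input, so a column with gaps still needs a decision about which bin, if any, the missing days belong to.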