## Set Up

Check working directory:

In [1]:
getwd()

Load the tidyverse set of packages.

The tidyverse is a *set* of packages whose development is led by Hadley Wickham. The packages share a common design and are built to work together, in some cases replacing R's built-in functionality. The tidyverse has been rapidly adopted by the R community.

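Tidyverse packages include: broom, **dplyr**, forcats, **ggplot2**, haven, httr, hms, jsonlite, **lubridate**, magrittr, modelr, purrr, readr, **readxl**, stringr, tibble, rvest, tidyr, xml2

As the startup message below shows, `library(tidyverse)` attaches only the core packages. The others are installed alongside them but need their own `library()` call, and which packages count as core depends on the tidyverse version.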

In [2]:
library(tidyverse)

Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats
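
The conflicts message means that dplyr's `filter()` and `lag()` now mask the functions of the same name in the stats package. The sketch below, which uses a made-up data frame `toy` rather than the dengue data, shows one way to call either version explicitly with the `package::function` syntax:

```r
library(dplyr)

# Hypothetical toy data, not part of the dengue data set
toy <- data.frame(state = c("A", "B", "C"), cases = c(5, 12, 30))

# dplyr's filter(): keep the rows that satisfy a condition
high <- dplyr::filter(toy, cases > 10)

# stats' filter(): apply a linear filter to a numeric series
# (here a trailing 2-point moving average, so the first value is NA)
smoothed <- stats::filter(toy$cases, rep(1 / 2, 2), sides = 1)
```

Writing the prefix is only necessary when you need the masked stats version; a bare `filter()` refers to dplyr once the tidyverse is loaded.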


Import the dengue data set:

In [3]:
dengue_data  <- read.csv("R Shiny App/Dengue_Cases_Malaysia_2011.csv")

In [4]:
head(dengue_data)

NEGERI,Year,Minggu.1,Minggu.2,Minggu.3,Minggu.4,Minggu.5,Minggu.6,Minggu.7,Minggu.8,...,Minggu.43,Minggu.44,Minggu.45,Minggu.46,Minggu.47,Minggu.48,Minggu.49,Minggu.50,Minggu.51,Minggu.52
PERLIS,2011,2,7,3,4,0,2,1,0,...,3,2,1,4,2,1,1,1,3,2
KEDAH,2011,15,11,9,20,13,13,9,9,...,19,14,8,20,11,15,17,19,11,7
PULAU PINANG,2011,42,46,40,48,24,33,32,35,...,10,13,16,20,18,21,26,23,22,24
PERAK,2011,36,25,40,41,18,31,16,31,...,40,30,35,42,50,27,44,62,33,24
SELANGOR,2011,170,213,173,167,144,137,136,162,...,140,145,150,160,181,163,201,234,194,151
WPKL/PUTRAJAYA,2011,36,36,28,32,22,28,39,37,...,62,46,48,55,56,32,47,49,39,33


## What is Tidy?

Tidyverse code has a certain **style** that is easily readable and is composed of sequential steps, where each step applies a function to the data.

Function names are descriptive and often "SQL-like."

In [23]:
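dengue_data %>%
  select(-Year) %>%
  # stack the 52 weekly columns (positions 2 to 53) into week / dengue_cases pairs;
  # the key is named week so it does not clash with the existing NEGERI column
  gather(key = week, value = dengue_cases, 2:53) %>%
  head(5)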

NEGERI,week,dengue_cases
PERLIS,Minggu.1,2
KEDAH,Minggu.1,15
PULAU PINANG,Minggu.1,42
PERAK,Minggu.1,36
SELANGOR,Minggu.1,170


Each step sits on its own line, and the steps are joined by the **pipe (%>%)**. The pipe takes the output of its left-hand side and passes it to the function on its right-hand side as the *first argument*.
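
As a minimal sketch of what the pipe does, the example below writes the same computation twice, once as nested calls and once as a pipeline. The vector `values` is made up for illustration:

```r
library(magrittr)

values <- c(4, 9, 16)

# Nested calls: read from the inside out
nested <- round(mean(sqrt(values)), 1)

# Piped calls: read from top to bottom, one step per line
piped <- values %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(nested, piped)
```

Both forms give the same result; the piped form simply lists the steps in the order they are carried out.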

In [6]:
1  %>% I()

[1] 1

In [8]:
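x <- 10
# The dot (.) stands for the left-hand side of the pipe.
# If the dot appears only inside a nested call such as 1:., the pipe still
# inserts x as the first argument, so wrap the call in braces to prevent that:
x %>% { rep(1:., 5) }   # same as rep(1:10, 5)
x %>% rep(1:5, .)       # same as rep(1:5, 10)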
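
The dot placeholder is most useful when the piped value belongs somewhere other than the first argument. A small sketch with R's built-in `mtcars` data (chosen only for illustration) and `lm()`, which expects the formula first and the data second:

```r
library(magrittr)

# The dot tells the pipe to pass mtcars as the data argument
fit <- mtcars %>% lm(mpg ~ wt, data = .)

# Equivalent call without the pipe
fit_plain <- lm(mpg ~ wt, data = mtcars)

all.equal(coef(fit), coef(fit_plain))
```

Because the dot is used directly as an argument here, the pipe does not also insert `mtcars` in the first position.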

Tidy data is also a **principle** for how to organize data. 

A dataset is a collection of **values**, usually either numbers (if quantitative) or strings (if qualitative). Values are organised in two ways. Every value belongs to a **variable** and an **observation**. A **variable** contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An **observation** contains all values measured on the same unit (like a person, or a day, or a race) across attributes.

Tidy data is defined as:
* Each variable forms a column.
* Each observation forms a row.
* Each type of observational unit forms a table.

Tidying data often involves pivoting it from wide to long format.
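
To make the wide and long formats concrete, here is a minimal sketch on a made-up two-state table. The data frame `wide` and its column names are assumptions for illustration only:

```r
library(tidyr)

# Hypothetical wide table: one column per week
wide <- data.frame(
  state  = c("A", "B"),
  week_1 = c(2, 15),
  week_2 = c(7, 11)
)

# Long (tidy) form: one row per state-week observation
long <- gather(wide, key = "week", value = "cases", week_1, week_2)
long
```

In tidyr 1.0.0 and later, `pivot_longer(wide, cols = -state, names_to = "week", values_to = "cases")` should give the same rows, although in a different order.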

Let's look at our dengue data set again:

In [24]:
head(dengue_data)

NEGERI,Year,Minggu.1,Minggu.2,Minggu.3,Minggu.4,Minggu.5,Minggu.6,Minggu.7,Minggu.8,...,Minggu.43,Minggu.44,Minggu.45,Minggu.46,Minggu.47,Minggu.48,Minggu.49,Minggu.50,Minggu.51,Minggu.52
PERLIS,2011,2,7,3,4,0,2,1,0,...,3,2,1,4,2,1,1,1,3,2
KEDAH,2011,15,11,9,20,13,13,9,9,...,19,14,8,20,11,15,17,19,11,7
PULAU PINANG,2011,42,46,40,48,24,33,32,35,...,10,13,16,20,18,21,26,23,22,24
PERAK,2011,36,25,40,41,18,31,16,31,...,40,30,35,42,50,27,44,62,33,24
SELANGOR,2011,170,213,173,167,144,137,136,162,...,140,145,150,160,181,163,201,234,194,151
WPKL/PUTRAJAYA,2011,36,36,28,32,22,28,39,37,...,62,46,48,55,56,32,47,49,39,33


What are our variables? Which values do they take? How many observations do we have?

Why is this data not tidy?

Let's tidy this data:

In [25]:
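dengue_data %>%
  select(-Year) %>%
  # stack the 52 weekly columns (positions 2 to 53) into week / dengue_cases pairs;
  # the key is named week so it does not clash with the existing NEGERI column
  gather(key = week, value = dengue_cases, 2:53) %>%
  head(5)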

NEGERI,week,dengue_cases
PERLIS,Minggu.1,2
KEDAH,Minggu.1,15
PULAU PINANG,Minggu.1,42
PERAK,Minggu.1,36
SELANGOR,Minggu.1,170
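
After tidying, the `week` column still holds text labels such as `Minggu.1`. One possible follow-up step, sketched here on a small data frame `tidy_toy` that copies a few values from the output above, is to strip the prefix and store the week as an integer:

```r
library(dplyr)

# Toy rows shaped like the tidied output (values for PERLIS and KEDAH, weeks 1 and 2)
tidy_toy <- data.frame(
  NEGERI = c("PERLIS", "KEDAH", "PERLIS", "KEDAH"),
  week = c("Minggu.1", "Minggu.1", "Minggu.2", "Minggu.2"),
  dengue_cases = c(2, 15, 7, 11)
)

# Remove the literal "Minggu." prefix, then convert the remainder to an integer
tidy_weeks <- tidy_toy %>%
  mutate(week = as.integer(sub("Minggu.", "", week, fixed = TRUE)))
```

With an integer week, sorting and plotting follow calendar order rather than the alphabetical order of the labels.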


## Common types of messy data

* Column headers are values, not variable names.
* Multiple variables are stored in one column.
* Variables are stored in both rows and columns.
* Multiple types of observational units are stored in the same table.
* A single observational unit is stored in multiple tables.
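
As an illustration of the second item in the list, the sketch below uses a made-up table `messy` in which one column holds both a state and a year, and splits it with `separate()`:

```r
library(tidyr)

# Hypothetical messy table: state and year share a single column
messy <- data.frame(
  state_year = c("A_2010", "A_2011", "B_2010"),
  cases = c(4, 6, 9)
)

# Split on the underscore; convert = TRUE turns the year into an integer
tidy <- separate(messy, state_year, into = c("state", "year"),
                 sep = "_", convert = TRUE)
```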

Which of these do we observe in the Dengue data set?

## Common Functions

The package for tidying data is **tidyr**. With it you can:
* **gather()** columns up into rows (stack, melt, unpivot); superseded by pivot_longer() since tidyr 1.0.0
* **spread()** rows out over columns (unstack, cast, pivot); superseded by pivot_wider() since tidyr 1.0.0
* **separate()** one column into many
* **unite()** many columns into one
* **fill()** in missing values and **drop_na()** to drop rows that contain them
* And more!
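
A short sketch of a few of these functions working together, on a made-up table `toy` with gaps in it (the data and object names are assumptions for illustration):

```r
library(tidyr)

# Hypothetical table with missing state names and one missing count
toy <- data.frame(
  state = c("A", NA, "B", NA),
  week  = c(1, 2, 1, 2),
  cases = c(2, 7, 15, NA)
)

filled  <- fill(toy, state)                            # carry the last state name downward
no_gaps <- drop_na(filled, cases)                      # drop rows with a missing case count
wide    <- spread(no_gaps, key = week, value = cases)  # one column per week
```

Here `spread()` reverses the kind of reshaping that `gather()` performs, so state B ends up with a missing value for week 2.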

The package for manipulating data is **dplyr**. dplyr functions include:
* **filter()** to keep rows that satisfy a condition
* **arrange()** to reorder rows
* **select()** to pick and **rename()** to rename columns by name
* **mutate()** to add new variables that are functions of existing variables
* **summarise()** to condense multiple values into a single value
* **group_by()** to group rows, so that other functions are applied "by group"
* **left_join()** and the other join functions for SQL-like joins
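
These verbs are usually chained with the pipe. The sketch below runs them on a toy long-format table that reuses a few values from the dengue output shown earlier; the lookup table `groups` and its labels are made up for illustration:

```r
library(dplyr)

# Toy long-format data (PERLIS and KEDAH, weeks 1 and 2)
cases <- data.frame(
  NEGERI = c("PERLIS", "PERLIS", "KEDAH", "KEDAH"),
  week = c(1, 2, 1, 2),
  dengue_cases = c(2, 7, 15, 11),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table with made-up labels
groups <- data.frame(
  NEGERI = c("PERLIS", "KEDAH"),
  group = c("g1", "g2"),
  stringsAsFactors = FALSE
)

summary_tbl <- cases %>%
  filter(week <= 2) %>%                       # keep rows matching a condition
  mutate(high = dengue_cases > 10) %>%        # add a derived column
  group_by(NEGERI) %>%                        # group rows by state
  summarise(total_cases = sum(dengue_cases),  # one row per state
            high_weeks = sum(high)) %>%
  arrange(desc(total_cases)) %>%              # sort, largest total first
  left_join(groups, by = "NEGERI")            # attach the lookup columns
```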

## Resources

[Tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html)

[Dplyr](https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html)