# Data Wrangling in R (part 1)

**Data wrangling** covers loading, manipulating, reshaping, and exploring data structures.

## Table of Contents

- [Understanding Data](#data)
- [Data Frames](#dataframe)
- [Working with CSV Data](#csv)
- [Factors](#fac)
- [Manipulating Data with dplyr](#dplyr)
- [The Pipe Operator](#pipe)
- [Tibble](#tib)
- [Joining Data Frames Together](#join)

---
<a id='data'></a>

## Understanding Data

1. **Where do data come from?** - `When working with any data set, it is vital to consider where the data came from and who recorded it, how, and why, in order to analyze it effectively and meaningfully`.
    - **Sensors** - assuming these devices have been properly calibrated, they offer a reliable and consistent mechanism for data collection.
    - **Surveys** - depend on individuals self-reporting, so their quality may vary. The biases inherent in survey responses should be recognized and (if possible) adjusted for.
    - **Record keeping** - based on manual or automatic processes that keep track of different (business) activities. Reliability depends on the quality of the systems producing the records and on the way the records are gathered. Record keeping may focus only on particular tasks.
    - **Secondary data analysis** - data compiled from existing knowledge artifacts or measurements (such as historical texts). These artifacts may already exclude some perspectives.
    

2. **Dataset sources**:

    - Government publications
    - News and journalism (The New York Times, FiveThirtyEight)
    - Scientific research (Nature Recommended Data Repositories)
    - Social networks and media organizations (Facebook, Twitter, Google)
    - Online communities (Kaggle, Socrata, UCI ML Repository)
    

3. Once you acquire a data set, you will have to **understand its structure and content** before (programmatically) investigating it. You need to know what kinds of statistical analysis will be valid for different types of data, as well as how to interpret what that data are measuring.

4. **Data interpretation**. Working with data requires domain knowledge, at least a basic level of understanding of the problem domain (the meaning of data), significance and purpose of any feature (to detect outliers and errors), and some of the subtleties that may not be explicit in the data set (such as biases or aggregations that may hide important causalities). `Gathering domain knowledge almost always requires outside research`.

5. **Organize your data into data structures**. Usually these structures allow building one or more (connected) tables where columns represent features and rows represent observations. You need to understand the data schema and the specific context for all values. Use meta-data (data about data) as a starting point.

6. **Use data to answer questions**. This requires translating domain questions into specific observations and features in your data set. You need to be able to decide **what precisely is meant by a question** - a task that requires understanding the nuances of the question's problem domain.

---
<a id='dataframe'></a>

## Data Frames

**Data Frames** act like tables, where data is organized into rows and columns. Technically, a data frame is a list in which each element is a vector, and all of these vectors have the same length. Each vector represents a column, not a row. The elements at corresponding indices in the vectors are considered part of the same row (record). With this design, a row can mix different types of data (one per column), while all elements of a single column must be of the same type.
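
The list-of-vectors structure can be checked directly. The following is a minimal sketch on a small made-up data frame (the name `pets` and its values are illustrative only and do not come from any data set used in these notes):

```r
# A toy data frame; names and values are made up for illustration
pets <- data.frame(
    name = c("Rex", "Tom", "Kiwi"),
    legs = c(4, 4, 2),
    stringsAsFactors = FALSE
)

# A data frame is a list whose elements are the columns
is.list(pets)   # TRUE
length(pets)    # 2 (the number of columns, not rows)

# Each column is a vector with a single type
column_types <- sapply(pets, class)
print(column_types)

# A row spans several columns, so it can mix types
first_row <- pets[1, ]
print(first_row)
```

Because `length()` counts columns, use `nrow()` when you need the number of records.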

### Create

In [21]:
# Create a data frame
name <- c("Alice", "John", "Melinda")
height <- c(150, 170, 165)
weight <- c(60, 70, 65)

# Factors are categorical variables (like: small, medium, large)
people <- data.frame(name, height, weight, stringsAsFactors = FALSE)

# or

people <- data.frame(
    name = c("Alice", "John", "Melinda"), 
    height = c(150, 170, 165), 
    weight = c(60, 70, 65), 
    stringsAsFactors = FALSE)

In [22]:
people

name,height,weight
<chr>,<dbl>,<dbl>
Alice,150,60
John,170,70
Melinda,165,65


In [23]:
# Dollar notation
people_weights <- people$weight
print(people_weights)

[1] 60 70 65


In [24]:
# Double-bracket notation
people_weights <- people[["weight"]]
print(people_weights)

[1] 60 70 65


In [25]:
# Single-bracket notation
# see below

### Inspect

In [26]:
nrow(people)

In [27]:
ncol(people)

In [28]:
print(dim(people)) # rows, cols

[1] 3 3


In [29]:
print(colnames(people))

[1] "name"   "height" "weight"


In [30]:
print(rownames(people))

[1] "1" "2" "3"


In [31]:
print(head(people))

     name height weight
1   Alice    150     60
2    John    170     70
3 Melinda    165     65


In [32]:
print(tail(people))

     name height weight
1   Alice    150     60
2    John    170     70
3 Melinda    165     65


In [33]:
# Open the data frame in a spreadsheet-like viewer (only in RStudio)
View(people)

ERROR: Error in View(people): ‘View()’ not yet supported in the Jupyter R kernel


### Access

In [37]:
# Single-bracket notation
# Elements by row and column names or mixed
people[1, "name"]

In [38]:
# Elements by row and column indices
people[1,1]

In [39]:
# Row by name/index
people[1,]

name,height,weight
<chr>,<dbl>,<dbl>
Alice,150,60


In [41]:
# Column by name/index
print(people[,1])

[1] "Alice"   "John"    "Melinda"


In [44]:
# Assign a set of row names for the data frame
print(rownames(people))
rownames(people) <- people$name
print(rownames(people))

[1] "1" "2" "3"
[1] "Alice"   "John"    "Melinda"


In [45]:
people["Alice", "name"]

In [48]:
# Get multiple rows and columns
people[,c("height", "weight")]

,height,weight
,<dbl>,<dbl>
Alice,150,60
John,170,70
Melinda,165,65


In [49]:
people[2:3,]

,name,height,weight
,<chr>,<dbl>,<dbl>
John,John,170,70
Melinda,Melinda,165,65


In [51]:
people[people$height > 160,]

,name,height,weight
,<chr>,<dbl>,<dbl>
John,John,170,70
Melinda,Melinda,165,65


### Type & Conversion

In [52]:
is.data.frame(people)

In [58]:
v1 <- c(1, 2, 3)
is.data.frame(v1)
print(as.data.frame(v1))

  v1
1  1
2  2
3  3


---
<a id='csv'></a>

## Working with CSV Data

- For a script to work on any computer, use **relative (not absolute) paths**.
- Set up a working directory via `Session > Set Working Directory` (in RStudio).
- In R versions before 4.0.0, `read.csv()` and `data.frame()` convert strings to factors by default, so include the `stringsAsFactors = FALSE` argument when loading or creating data frames. Since R 4.0.0 the default is already `FALSE`, and passing it explicitly is harmless. Factors are categorical variables (like: small, medium, large).
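
As a small illustration of relative paths, the sketch below writes and then reads a CSV file inside a temporary directory, so it does not touch your project files. The file name `example.csv` and the `scores` data are made up, and switching to `tempdir()` is an assumption made only to keep the example self-contained.

```r
# Work inside a temporary directory (assumption: keeps the example self-contained)
old_wd <- setwd(tempdir())

scores <- data.frame(
    student = c("Ann", "Ben"),
    score = c(90, 85),
    stringsAsFactors = FALSE
)

# Relative path: resolved against the current working directory
write.csv(scores, "example.csv", row.names = FALSE)
loaded <- read.csv("example.csv", stringsAsFactors = FALSE)

# The same file expressed as an absolute path
absolute_path <- file.path(getwd(), "example.csv")
print(file.exists(absolute_path))

# Restore the previous working directory
setwd(old_wd)
print(loaded)
```

The relative form keeps working when the project folder is moved or shared, whereas the absolute path is specific to one machine.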

In [62]:
# Read from a CSV file
people <- read.csv("data/simpleR.csv", stringsAsFactors = FALSE)
head(people)

first_name,weight,height
<chr>,<int>,<dbl>
Ada,64,135
Bob,74,156
Chris,69,139
Diya,69,144
Emma,71,152


In [63]:
# Write to a CSV file
write.csv(people, "data/simpleRwrite.csv", row.names = FALSE)

In [64]:
# List the datasets available in R
data()

Package,Item,Title
<chr>,<chr>,<chr>
datasets,AirPassengers,Monthly Airline Passenger Numbers 1949-1960
datasets,BJsales,Sales Data with Leading Indicator
datasets,BJsales.lead (BJsales),Sales Data with Leading Indicator
datasets,BOD,Biochemical Oxygen Demand
datasets,CO2,Carbon Dioxide Uptake in Grass Plants
datasets,ChickWeight,Weight versus age of chicks on different diets
datasets,DNase,Elisa assay of DNase
datasets,EuStockMarkets,"Daily Closing Prices of Major European Stock Indices, 1991-1998"
datasets,Formaldehyde,Determination of Formaldehyde
datasets,HairEyeColor,Hair and Eye Color of Statistics Students


In [65]:
View(mtcars)

ERROR: Error in View(mtcars): ‘View()’ not yet supported in the Jupyter R kernel


In [66]:
# Get an absolute path to the current working directory
getwd()

---
<a id='fac'></a>

## Factors

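Factors are categorical variables (like: small, medium, large). A factor is not a plain vector: it stores integer codes together with a set of `levels`, so functions that expect a character or numeric vector (for example, arithmetic or string operations) may fail or give unexpected results on it.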

In [70]:
shirt_sizes <- c("small", "medium", "large", "small", "large")
print(shirt_sizes)

[1] "small"  "medium" "large"  "small"  "large" 


In [71]:
shirt_sizes_factor <- as.factor(shirt_sizes)
print(shirt_sizes_factor)

[1] small  medium large  small  large 
Levels: large medium small


In [72]:
length(shirt_sizes_factor)

In [73]:
is.factor(shirt_sizes_factor)
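
By default `as.factor()` orders the levels alphabetically, which is rarely the natural order for sizes. The sketch below (with a made-up `sizes` vector) sets the level order explicitly and shows the integer codes stored behind the labels:

```r
sizes <- c("small", "medium", "large", "small", "large")

# Set the level order explicitly instead of relying on alphabetical order
sizes_factor <- factor(sizes, levels = c("small", "medium", "large"))

levels(sizes_factor)       # "small" "medium" "large"
as.integer(sizes_factor)   # 1 2 3 1 3 (the underlying integer codes)
table(sizes_factor)        # counts per level, in level order

# ordered = TRUE additionally allows comparisons between levels
sizes_ordered <- factor(
    sizes,
    levels = c("small", "medium", "large"),
    ordered = TRUE
)
sizes_ordered[1] < sizes_ordered[3]   # TRUE: small < large
```

An explicit level order also controls how categories are sorted in tables and plots.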

---
<a id='dplyr'></a>

## Manipulating Data with dplyr

The [**dplyr**](https://dplyr.tidyverse.org/) ("Dee-ply-er") package is the preeminent tool for data wrangling in R. It provides programmers with an intuitive vocabulary for executing data management and analysis tasks.

**dplyr** is a `grammar of data manipulation`, providing a consistent set of verbs that help you solve the most common data manipulation challenges:

- `mutate()` adds new variables that are functions of existing variables
- `select()` picks variables based on their names.
- `filter()` picks cases based on their values.
- `summarise()` reduces multiple values down to a single summary.
- `arrange()` changes the ordering of the rows.

In [77]:
# Once per machine
install.packages("dplyr")
# or
install.packages("tidyverse") # dplyr is one of the packages in the tidyverse collection


The downloaded binary packages are in
	/var/folders/qk/0l3zx9w11959pqp8tr5s_1780000gn/T//Rtmp61f83k/downloaded_packages

The downloaded binary packages are in
	/var/folders/qk/0l3zx9w11959pqp8tr5s_1780000gn/T//Rtmp61f83k/downloaded_packages


In [76]:
# In each relevant script
library("dplyr")


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union
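
Before moving to a real data set, here is a minimal sketch of the five verbs on a tiny data frame. The `toy` table reuses the names and numbers from the Data Frames section; the derived column `ratio` is a hypothetical example with no particular meaning.

```r
library("dplyr")

toy <- data.frame(
    name = c("Alice", "John", "Melinda"),
    height = c(150, 170, 165),
    weight = c(60, 70, 65),
    stringsAsFactors = FALSE
)

# mutate(): add a column computed from existing ones
with_ratio <- mutate(toy, ratio = weight / height)

# select(): keep only some columns
names_and_ratio <- select(with_ratio, name, ratio)

# filter(): keep rows that satisfy a condition
tall <- filter(toy, height > 160)

# arrange(): reorder rows (tallest first)
by_height <- arrange(toy, desc(height))

# summarise(): collapse all rows into one summary row
avg <- summarise(toy, mean_height = mean(height))

print(names_and_ratio)
print(tall)
print(by_height)
print(avg)
```

Each verb takes a data frame as its first argument and returns a new one, leaving `toy` unchanged.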




In [78]:
# Load a dataset from the pscl package
install.packages("pscl")
library("pscl")


The downloaded binary packages are in
	/var/folders/qk/0l3zx9w11959pqp8tr5s_1780000gn/T//Rtmp61f83k/downloaded_packages


Classes and Methods for R developed in the
Political Science Computational Laboratory
Department of Political Science
Stanford University
Simon Jackman
hurdle and zeroinfl functions by Achim Zeileis



In [79]:
View(presidentialElections)

ERROR: Error in View(presidentialElections): ‘View()’ not yet supported in the Jupyter R kernel


In [80]:
?presidentialElections

0,1
presidentialElections {pscl},R Documentation


In [84]:
head(presidentialElections)

state,demVote,year,south
<chr>,<dbl>,<int>,<lgl>
Alabama,84.76,1932,True
Arizona,67.03,1932,False
Arkansas,86.27,1932,True
California,58.41,1932,False
Colorado,54.81,1932,False
Connecticut,47.4,1932,False


In [88]:
votes <- select(presidentialElections, year, demVote)
head(votes)
# or in a classical way:
#votes <- presidentialElections[, c("year", "demVote")]

year,demVote
<int>,<dbl>
1932,84.76
1932,67.03
1932,86.27
1932,58.41
1932,54.81
1932,47.4


In [89]:
# Select range
votes <- select(presidentialElections, year:demVote)
head(votes)

year,demVote
<int>,<dbl>
1932,84.76
1932,67.03
1932,86.27
1932,58.41
1932,54.81
1932,47.4


In [None]:
# Select except 'year'

In [90]:
votes <- select(presidentialElections, -year)
head(votes)

state,demVote,south
<chr>,<dbl>,<lgl>
Alabama,84.76,True
Arizona,67.03,False
Arkansas,86.27,True
California,58.41,False
Colorado,54.81,False
Connecticut,47.4,False


In [95]:
# filter()
votes_2008 <- filter(presidentialElections, year == 2008)
head(votes_2008)
# or in a classical way:
# votes_2008 <- presidentialElections[presidentialElections$year == 2008, ] #selected rows, all cols

state,demVote,year,south
<chr>,<dbl>,<int>,<lgl>
Alabama,38.74,2008,True
Alaska,37.89,2008,False
Arizona,44.91,2008,False
Arkansas,38.86,2008,True
California,60.94,2008,False
Colorado,53.66,2008,False


In [97]:
# Extract rows for the state of Colorado in 2008
votes_colorado_2008 <- filter(
    presidentialElections, 
    year == 2008, 
    state == "Colorado"
)
head(votes_colorado_2008)

state,demVote,year,south
<chr>,<dbl>,<int>,<lgl>
Colorado,53.66,2008,False


In [None]:
# Add row names of a data frame as a new column called row_names
# (here `df` stands for your own data frame)
df <- mutate(df, row_names = rownames(df))

In [99]:
# mutate() to create new columns, returns a new dataframe
presidentialElections <- mutate(
    presidentialElections,
    other_parties_vote = 100 - demVote,
    abs_vote_difference = abs(demVote - other_parties_vote)
)
head(presidentialElections)

state,demVote,year,south,other_parties_vote,abs_vote_difference
<chr>,<dbl>,<int>,<lgl>,<dbl>,<dbl>
Alabama,84.76,1932,True,15.24,69.52
Arizona,67.03,1932,False,32.97,34.06
Arkansas,86.27,1932,True,13.73,72.54
California,58.41,1932,False,41.59,16.82
Colorado,54.81,1932,False,45.19,9.62
Connecticut,47.4,1932,False,52.6,5.2


In [100]:
# arrange() to sort rows by some feature (column value), returns a new dataframe
# Arrange rows in decreasing order by year, then by demVote
presidentialElections <- arrange(presidentialElections, -year, demVote)
head(presidentialElections)

state,demVote,year,south,other_parties_vote,abs_vote_difference
<chr>,<dbl>,<int>,<lgl>,<dbl>,<dbl>
West Virginia,26.18,2016,False,73.82,47.64
Utah,27.17,2016,False,72.83,45.66
North Dakota,27.23,2016,False,72.77,45.54
Idaho,27.48,2016,False,72.52,45.04
Oklahoma,28.93,2016,False,71.07,42.14
South Dakota,31.74,2016,False,68.26,36.52


In [101]:
presidentialElections <- arrange(presidentialElections, desc(year), demVote)
head(presidentialElections)

state,demVote,year,south,other_parties_vote,abs_vote_difference
<chr>,<dbl>,<int>,<lgl>,<dbl>,<dbl>
West Virginia,26.18,2016,False,73.82,47.64
Utah,27.17,2016,False,72.83,45.66
North Dakota,27.23,2016,False,72.77,45.54
Idaho,27.48,2016,False,72.52,45.04
Oklahoma,28.93,2016,False,71.07,42.14
South Dakota,31.74,2016,False,68.26,36.52


In [102]:
# summarize() to aggregate each column to a single value
average_votes <- summarize(
    presidentialElections,
    mean_dem_vote = mean(demVote),
    mean_other_parties = mean(other_parties_vote)
)
head(average_votes)

mean_dem_vote,mean_other_parties
<dbl>,<dbl>
48.3594,51.6406


---
<a id='pipe'></a>

## The Pipe Operator

The **pipe operator** `%>%` takes the result from one function and passes it in as the first argument to the next function. This operator is loaded with dplyr but works with any R function.
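
To make the "first argument" rule concrete, the sketch below computes the same value twice, once with nested calls and once with the pipe. The `values` vector is made up, and only base R functions are piped, which illustrates that the operator is not limited to dplyr verbs.

```r
library("dplyr")  # makes %>% available

values <- c(4, 9, 16)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(values)), 1)

# The same computation with the pipe is read from left to right
piped <- values %>%
    sqrt() %>%
    mean() %>%
    round(1)

identical(nested, piped)   # TRUE
```

In `round(1)` the piped value fills the first argument, so the `1` becomes the second argument (`digits`).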

In [104]:
# Question: Which state had the highest percentage of votes for the Democratic Party 
#  (B. Obama) candidate in 2008?
most_dem_state <- presidentialElections %>% # data frame to start with
    filter(year == 2008) %>% # 1. Filter down to only 2008 votes
    filter(demVote == max(demVote)) %>% # 2. Filter down to the highest demVote
    select(state) # 3. Select name of the state
print(most_dem_state)

# A tibble: 1 x 1
  state
  <chr>
1 DC   


In [107]:
# group_by() to create associations among groups of rows in a data frame
# Group observations by state
grouped <- group_by(presidentialElections, state)
head(grouped)

state,demVote,year,south,other_parties_vote,abs_vote_difference
<chr>,<dbl>,<int>,<lgl>,<dbl>,<dbl>
West Virginia,26.18,2016,False,73.82,47.64
Utah,27.17,2016,False,72.83,45.66
North Dakota,27.23,2016,False,72.77,45.54
Idaho,27.48,2016,False,72.52,45.04
Oklahoma,28.93,2016,False,71.07,42.14
South Dakota,31.74,2016,False,68.26,36.52


In [110]:
# Once rows are grouped other verbs can be applied, 
#  and they will be automatically applied to each group
state_voting_summary <- presidentialElections %>%
    group_by(state) %>%
    summarize(
        mean_dem_vote = mean(demVote),
        mean_other_parties = mean(other_parties_vote)
    )
head(state_voting_summary)

state,mean_dem_vote,mean_other_parties
<chr>,<dbl>,<dbl>
Alabama,50.7525,49.2475
Alaska,37.496,62.504
Arizona,45.43545,54.56455
Arkansas,52.43364,47.56636
California,51.29955,48.70045
Colorado,45.60955,54.39045


---
<a id='tib'></a>

## Tibble
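A **tibble** is the tidyverse's variant of a data frame. Some dplyr operations, such as `group_by()` followed by `summarize()`, return a tibble (as with `state_voting_summary` above).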

In [112]:
# convert tibble into a normal data frame
normal_df <- as.data.frame(state_voting_summary)
head(normal_df)

state,mean_dem_vote,mean_other_parties
<chr>,<dbl>,<dbl>
Alabama,50.7525,49.2475
Alaska,37.496,62.504
Arizona,45.43545,54.56455
Arkansas,52.43364,47.56636
California,51.29955,48.70045
Colorado,45.60955,54.39045
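
One practical difference between the two is how single-bracket subsetting behaves. The sketch below uses a made-up `plain` data frame and relies on `as_tibble()`, which dplyr re-exports from the tibble package:

```r
library("dplyr")  # re-exports as_tibble() from the tibble package

plain <- data.frame(
    x = 1:3,
    y = c("a", "b", "c"),
    stringsAsFactors = FALSE
)
tbl <- as_tibble(plain)

class(plain)   # "data.frame"
class(tbl)     # "tbl_df" "tbl" "data.frame"

# Selecting one column with single brackets:
plain_col <- plain[, "x"]   # a data frame simplifies to a plain vector
tbl_col <- tbl[, "x"]       # a tibble stays a one-column tibble

# Converting back drops the tibble classes
back <- as.data.frame(tbl)
class(back)    # "data.frame"
```

Because a tibble still inherits from `data.frame`, most functions written for data frames accept it unchanged.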


---
<a id='join'></a>

## Joining Data Frames Together

In [None]:
# Combine (join) donations and donors data frames by their shared column donor_name
combined_data <- left_join(donations, donors, by = "donor_name")

# When the tables have different identifiers
combined_data <- left_join(donations, donors, by = c("donor_name" = "another_column"))

In [None]:
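# Other join types take the same arguments as left_join()
right_join(donations, donors, by = "donor_name") # keep every row of donors
inner_join(donations, donors, by = "donor_name") # keep only rows with a match in both
full_join(donations, donors, by = "donor_name") # keep every row of both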
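
The cells above refer to `donations` and `donors` tables that are not defined in these notes. The sketch below fills them with hypothetical data (all names and values are made up) to show how the four join types differ in which rows they keep:

```r
library("dplyr")

# Hypothetical data: both tables and all values are made up
donations <- data.frame(
    donor_name = c("Ann", "Ben", "Ann", "Cy"),
    amount = c(10, 25, 5, 40),
    stringsAsFactors = FALSE
)
donors <- data.frame(
    donor_name = c("Ann", "Ben", "Dee"),
    city = c("Oslo", "Lima", "Rome"),
    stringsAsFactors = FALSE
)

# left_join(): every donation is kept; Cy has no donor record, so city is NA
left <- left_join(donations, donors, by = "donor_name")

# inner_join(): only donations with a matching donor (Cy is dropped)
inner <- inner_join(donations, donors, by = "donor_name")

# right_join(): every donor is kept; Dee has no donation, so amount is NA
right <- right_join(donations, donors, by = "donor_name")

# full_join(): all rows from both tables (Cy and Dee both appear)
full <- full_join(donations, donors, by = "donor_name")

print(c(left = nrow(left), inner = nrow(inner), right = nrow(right), full = nrow(full)))
```

Comparing row counts like this is a quick sanity check that a join kept or dropped the rows you expected.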

---
<a id='dataframe'></a>