# Data Wrangling in R

**Data wrangling** covers loading, manipulating, reshaping, and exploring data structures.

## Table of Contents

- [Understanding Data](#data)
- [Data Frames](#dataframe)
- [Working with CSV Data](#csv)
- [Factors](#fac)

---
<a id='data'></a>

## Understanding Data

1. **Where do data come from?** - `When working with any data set, it is vital to consider where the data came from and who recorded it, how, and why, in order to analyze it effectively and meaningfully`.
    - **Sensors** - assuming these devices have been properly calibrated, they offer a reliable and consistent mechanism for data collection.
    - **Surveys** - depend on individuals self-reporting, so their quality may vary. The biases inherent in survey responses should be recognized and (if possible) adjusted for.
    - **Record keeping** - based on a manual or automatic process that keeps track of different (business) activities. Its reliability depends on the quality of the systems producing the records and on the way the records are gathered. Record keeping may focus only on particular tasks.
    - **Secondary data analysis** - data compiled from existing knowledge artifacts or measurements (such as historical texts). These artifacts may already exclude some perspectives.
    

2. **Dataset sources**:

    - Government publications
    - News and journalism (The New York Times, FiveThirtyEight)
    - Scientific research (Nature Recommended Data Repositories)
    - Social networks and media organizations (Facebook, Twitter, Google)
    - Online communities (Kaggle, Socrata, UCI ML Repository)
    

3. Once you acquire a data set, you will have to **understand its structure and content** before (programmatically) investigating it. You need to know which kinds of statistical analysis are valid for different types of data, as well as how to interpret what the data are measuring.

4. **Data interpretation**. Working with data requires domain knowledge, at least a basic level of understanding of the problem domain (the meaning of data), significance and purpose of any feature (to detect outliers and errors), and some of the subtleties that may not be explicit in the data set (such as biases or aggregations that may hide important causalities). `Gathering domain knowledge almost always requires outside research`.

5. **Organize your data into data structures**. Usually these structures allow building one or more (connected) tables where columns represent features and rows represent observations. You need to understand the data schema and the specific context for all values. Use meta-data (data about data) as a starting point.

6. **Use data to answer questions**. This requires translating domain questions into specific observations and features in your data set. You need to be able to decide **what precisely is meant by a question** - a task that requires understanding the nuances of the question's problem domain.
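A first look at structure and content (step 3 above) could be sketched as follows. This is only an illustration: it uses the built-in `mtcars` data set so that no file is needed, and the choice of functions is one reasonable starting set, not a fixed recipe.

```r
# mtcars ships with base R (package 'datasets'), so nothing has to be downloaded
data(mtcars)

# Structure: one line per column with its type and the first few values
str(mtcars)

# Size of the table (rows, columns) and the feature names
dim(mtcars)
names(mtcars)

# Summary statistics for one feature (min, quartiles, mean, max)
summary(mtcars$mpg)

# Count missing values per column before running any analysis
missing_per_column <- colSums(is.na(mtcars))
print(missing_per_column)

# The help page acts as meta-data for built-in data sets:
# ?mtcars
```

For data you loaded yourself, the equivalent of the help page is whatever documentation or codebook came with the source.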

---
<a id='dataframe'></a>

## Data Frames

**Data frames** act like tables, where data is organized into rows and columns. Technically, a data frame is a list in which each element is a vector, and all of these vectors have the same length. Each vector represents a column, not a row. The elements at corresponding indices in the vectors are considered part of the same row (record). With this design, a row may hold values of different types (one per column), while all elements of a single column must be of the same type.
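The "list of equal-length vectors" description can be checked directly. The sketch below builds the same small `people` data frame used in the cells that follow and inspects it with base R functions.

```r
people <- data.frame(
    name = c("Alice", "John", "Melinda"),
    height = c(150, 170, 165),
    weight = c(60, 70, 65),
    stringsAsFactors = FALSE)

# A data frame is a list under the hood
is.list(people)

# The number of list elements equals the number of columns
length(people)

# Each column has a single type of its own
column_types <- sapply(people, class)
print(column_types)

# Every column has the same length, which is the number of rows
column_lengths <- sapply(people, length)
print(column_lengths)
```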

### Create

In [21]:
# Create a data frame
name <- c("Alice", "John", "Melinda")
height <- c(150, 170, 165)
weight <- c(60, 70, 65)

# stringsAsFactors = FALSE keeps strings as character vectors instead of
# converting them to factors (categorical variables like: small, medium, large)
people <- data.frame(name, height, weight, stringsAsFactors = FALSE)

# or

people <- data.frame(
    name = c("Alice", "John", "Melinda"), 
    height = c(150, 170, 165), 
    weight = c(60, 70, 65), 
    stringsAsFactors = FALSE)

In [22]:
people

name,height,weight
<chr>,<dbl>,<dbl>
Alice,150,60
John,170,70
Melinda,165,65


In [23]:
# Dollar notation
people_weights <- people$weight
print(people_weights)

[1] 60 70 65


In [24]:
# Double-bracket notation
people_weights <- people[["weight"]]
print(people_weights)

[1] 60 70 65


In [25]:
# Single-bracket notation
# see below
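The notations differ in what they return, which matters when the result is passed on to other functions. A minimal sketch, assuming the same `people` data frame as above (recreated here so the block runs on its own):

```r
people <- data.frame(
    name = c("Alice", "John", "Melinda"),
    height = c(150, 170, 165),
    weight = c(60, 70, 65),
    stringsAsFactors = FALSE)

# Single bracket with one index: still a data frame (with one column)
col_df <- people["weight"]
is.data.frame(col_df)

# Double bracket (or $): the underlying vector
col_vec <- people[["weight"]]
is.numeric(col_vec)

# Row/column form with a single column drops to a vector by default...
col_dropped <- people[, "weight"]

# ...unless drop = FALSE is given, which keeps the data frame
col_kept <- people[, "weight", drop = FALSE]
```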

### Inspect

In [26]:
nrow(people)

In [27]:
ncol(people)

In [28]:
print(dim(people)) # rows, cols

[1] 3 3


In [29]:
print(colnames(people))

[1] "name"   "height" "weight"


In [30]:
print(rownames(people))

[1] "1" "2" "3"


In [31]:
print(head(people))

     name height weight
1   Alice    150     60
2    John    170     70
3 Melinda    165     65


In [32]:
print(tail(people))

     name height weight
1   Alice    150     60
2    John    170     70
3 Melinda    165     65


In [33]:
# Open the data frame in a spreadsheet-like viewer (only in RStudio)
View(people)

ERROR: Error in View(people): ‘View()’ not yet supported in the Jupyter R kernel


### Access

In [37]:
# Single-bracket notation
# Elements by row and column names or mixed
people[1, "name"]

In [38]:
# Elements by row and column indices
people[1,1]

In [39]:
# Row by name/index
people[1,]

name,height,weight
<chr>,<dbl>,<dbl>
Alice,150,60


In [41]:
# Column by name/index
print(people[,1])

[1] "Alice"   "John"    "Melinda"


In [44]:
# Assign a set of row names to the data frame
print(rownames(people))
rownames(people) <- people$name
print(rownames(people))

[1] "1" "2" "3"
[1] "Alice"   "John"    "Melinda"


In [45]:
people["Alice", "name"]

In [48]:
# Get multiple rows and columns
people[,c("height", "weight")]

,height,weight
,<dbl>,<dbl>
Alice,150,60
John,170,70
Melinda,165,65


In [49]:
people[2:3,]

,name,height,weight
,<chr>,<dbl>,<dbl>
John,John,170,70
Melinda,Melinda,165,65


In [51]:
people[people$height > 160,]

,name,height,weight
,<chr>,<dbl>,<dbl>
John,John,170,70
Melinda,Melinda,165,65
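Logical filtering extends naturally to several conditions and to selecting columns at the same time. The sketch below recreates `people` (with default row names) and shows the bracket form next to base R's `subset()`, which expresses the same filter; which one to prefer is a matter of taste.

```r
people <- data.frame(
    name = c("Alice", "John", "Melinda"),
    height = c(150, 170, 165),
    weight = c(60, 70, 65),
    stringsAsFactors = FALSE)

# Combine conditions with & (and) or | (or); pick columns in the second index
tall_light <- people[people$height > 160 & people$weight < 70,
                     c("name", "weight")]
print(tall_light)

# The same filter written with subset(); column names are used unquoted
tall_light_subset <- subset(people,
                            height > 160 & weight < 70,
                            select = c(name, weight))
print(tall_light_subset)
```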


### Type & Conversion

In [52]:
is.data.frame(people)

In [58]:
v1 <- c(1, 2, 3)
is.data.frame(v1)
print(as.data.frame(v1))

  v1
1  1
2  2
3  3
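Conversion also works from and to other structures. As a small sketch with made-up values (the `scores` list is hypothetical), a named list of equal-length vectors becomes a data frame, and an all-numeric data frame becomes a matrix:

```r
# Hypothetical data: a named list of equal-length vectors
scores <- list(id = 1:3, score = c(9.5, 7.0, 8.25))

# List -> data frame: each list element becomes a column
scores_df <- as.data.frame(scores)
is.data.frame(scores_df)

# Data frame -> matrix: all columns are numeric, so the matrix is numeric
scores_matrix <- as.matrix(scores_df)
is.matrix(scores_matrix)

# Data frame -> list: back to a plain list of columns
scores_list <- as.list(scores_df)
names(scores_list)
```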


---
<a id='csv'></a>

## Working with CSV Data

- For a script to work on any computer, a **relative (not absolute) path** is the best choice.
- Set up a working directory via `Session > Set Working Directory` (in RStudio).
- In R versions before 4.0.0, always include the `stringsAsFactors = FALSE` argument when loading or creating data frames; otherwise strings are converted to factors. Since R 4.0.0, `FALSE` is the default, so the argument is optional but harmless. Factors are categorical variables (like: small, medium, large).

In [62]:
# Read from a CSV file
people <- read.csv("data/simpleR.csv", stringsAsFactors = FALSE)
head(people)

first_name,weight,height
<chr>,<int>,<dbl>
Ada,64,135
Bob,74,156
Chris,69,139
Diya,69,144
Emma,71,152


In [63]:
# Write to a CSV file
write.csv(people, "data/simpleRwrite.csv", row.names = FALSE)
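To try the write/read round trip without depending on a `data/` folder, a temporary file can stand in for the real path. This is a sketch with made-up rows; `tempfile()` is used only so the example runs anywhere.

```r
# Made-up sample rows (not the contents of data/simpleR.csv)
people <- data.frame(
    first_name = c("Ada", "Bob"),
    weight = c(64L, 74L),
    height = c(135.5, 156.2),
    stringsAsFactors = FALSE)

# Temporary file path, so no particular folder has to exist
csv_path <- tempfile(fileext = ".csv")

# row.names = FALSE avoids writing the row names as an extra first column
write.csv(people, csv_path, row.names = FALSE)

# Read the file back and compare with the original
people_back <- read.csv(csv_path, stringsAsFactors = FALSE)
round_trip_ok <- isTRUE(all.equal(people, people_back))
print(round_trip_ok)

# Remove the temporary file
unlink(csv_path)
```

Without `row.names = FALSE`, the file would gain an unnamed first column that `read.csv` then loads as a column called `X`.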

In [64]:
# List the data sets available in R
data()

Package,Item,Title
<chr>,<chr>,<chr>
datasets,AirPassengers,Monthly Airline Passenger Numbers 1949-1960
datasets,BJsales,Sales Data with Leading Indicator
datasets,BJsales.lead (BJsales),Sales Data with Leading Indicator
datasets,BOD,Biochemical Oxygen Demand
datasets,CO2,Carbon Dioxide Uptake in Grass Plants
datasets,ChickWeight,Weight versus age of chicks on different diets
datasets,DNase,Elisa assay of DNase
datasets,EuStockMarkets,"Daily Closing Prices of Major European Stock Indices, 1991-1998"
datasets,Formaldehyde,Determination of Formaldehyde
datasets,HairEyeColor,Hair and Eye Color of Statistics Students


In [65]:
View(mtcars)

ERROR: Error in View(mtcars): ‘View()’ not yet supported in the Jupyter R kernel


In [66]:
# Get an absolute path to the current working directory
getwd()
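Relative paths are resolved against the working directory returned by `getwd()`. A minimal sketch of building such a path in a portable way; the `data/simpleR.csv` location is taken from the example above and may not exist on your machine:

```r
# Current working directory (absolute path)
wd <- getwd()

# Build a relative path from its parts; file.path() joins them with "/"
rel_path <- file.path("data", "simpleR.csv")
print(rel_path)

# The absolute location that the relative path refers to right now
abs_path <- file.path(wd, rel_path)
print(abs_path)

# Check that the file is there before trying to read it
if (file.exists(rel_path)) {
    people <- read.csv(rel_path, stringsAsFactors = FALSE)
} else {
    message("File not found relative to: ", wd)
}
```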

---
<a id='fac'></a>

## Factors

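Factors are categorical variables (like: small, medium, large). Internally, a factor is stored as an integer vector with a `levels` attribute, so it does not behave like a plain character vector: functions such as `length()` still work, but other vector operations (such as arithmetic) are not meaningful on factors.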

In [70]:
shirt_sizes <- c("small", "medium", "large", "small", "large")
print(shirt_sizes)

[1] "small"  "medium" "large"  "small"  "large" 


In [71]:
shirt_sizes_factor <- as.factor(shirt_sizes)
print(shirt_sizes_factor)

[1] small  medium large  small  large 
Levels: large medium small


In [72]:
length(shirt_sizes_factor)

In [73]:
is.factor(shirt_sizes_factor)
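By default the levels are sorted alphabetically (large, medium, small), which is rarely the intended order for sizes. A sketch of setting the level order explicitly with `factor()` and of a few common operations on the result:

```r
shirt_sizes <- c("small", "medium", "large", "small", "large")

# Give the levels a meaningful order; ordered = TRUE enables comparisons
sizes <- factor(shirt_sizes,
                levels = c("small", "medium", "large"),
                ordered = TRUE)
levels(sizes)

# Count the observations per category
size_counts <- table(sizes)
print(size_counts)

# The integer codes that are stored internally (positions in levels)
size_codes <- as.integer(sizes)
print(size_codes)

# Comparisons follow the level order, not the alphabet
smaller_than_large <- sizes < "large"
print(smaller_than_large)

# Convert back to a plain character vector
as.character(sizes)
```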

---
<a id='dataframe'></a>