In [1]:
# Creating a vector of numbers

x <- c(1, 3, 2, 5)

# Print out the vector x
x

In [2]:
# Length of x

length(x)

In [3]:
# Create a 2 * 2 matrix 

x <- matrix(data = c(1, 5, 7, 9), nrow = 2, ncol = 2)


# Call the matrix
x

0,1
1,7
5,9


In [4]:
# Find the Square root of x

sqrt(x)

0,1
1.0,2.645751
2.236068,3.0


In [5]:
# Create a series of correlated numbers

# A series of normally distributed numbers

x <- rnorm(50)

y <- x + rnorm(50, mean = 10, sd = 0.05)

# Let's find the correlation

cor(x, y)

## Dataframes

Conceptually, we can think of a data frame as a table with rows representing observations and the different variables reported for each observation defining the columns. Data frames are particularly useful for datasets because we can combine different data types into one object.

A large proportion of data analysis challenges start with data stored in a data frame.
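
As a minimal sketch of this idea, the toy table below combines character, numeric, and logical columns in a single data frame. The object name `pets` and all of its values are made up for illustration and are not part of dslabs.

```r
# Toy data frame; the column names and values are hypothetical
pets <- data.frame(
  name = c("Rex", "Tom", "Polly"),
  weight_kg = c(30.5, 4.2, 0.4),
  can_fly = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Each column keeps its own data type
sapply(pets, class)

# Rows are observations, columns are variables
dim(pets)
```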

In [6]:
# Make sure the 'dslabs' package has been installed
library(dslabs)

# Load the murders dataset into your workspace
data(murders)

# Check the class of the dataset
class(murders)


"package 'dslabs' was built under R version 3.6.3"

### Examining the Object

In [7]:
# For more information about the data use 'str' - structure

str(murders)


'data.frame':	51 obs. of  5 variables:
 $ state     : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
 $ abb       : chr  "AL" "AK" "AZ" "AR" ...
 $ region    : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
 $ population: num  4779736 710231 6392017 2915918 37253956 ...
 $ total     : num  135 19 232 93 1257 ...


In [8]:
# See the first 6 rows

head(murders)

state,abb,region,population,total
Alabama,AL,South,4779736,135
Alaska,AK,West,710231,19
Arizona,AZ,West,6392017,232
Arkansas,AR,South,2915918,93
California,CA,West,37253956,1257
Colorado,CO,West,5029196,65


### The Accessor

We can access the different variables, represented by columns, in this data frame. To do this, we use the accessor operator **$**:

In [9]:
# The Accessor

tail(murders$population)

But how did we know to use population? Previously, by applying the function str to the object murders, we revealed the names for each of the five variables stored in this table. We can quickly access the variable names using:

In [10]:
# Check for the column names using

names(murders)

### Vectors: numerics, characters and logical

The object murders$population is not one number but several. We call these types of objects vectors. A single number is technically a vector of length 1, but in general we use the term vectors to refer to objects with several entries. The function length tells you how many entries are in the vector:

In [11]:
pop <- murders$population

# Check the length of the vector
length(pop)

# Check the Data type
class(pop)

In a numeric vector, every entry must be a number.

To store character strings, vectors can also be of class character. For example, the state names are characters:

In [12]:
class(murders$state)

In [13]:
y <- 3

z <- 10

# Relational Operators / Booleans
y == z

In [14]:
# Working with Factors

class(murders$region)

Factors are useful for storing categorical data. We can see that there are only 4 regions by using the levels function:

In [15]:
levels(murders$region)

In the background, R stores these levels as integers and keeps a map to keep track of the labels. This is more memory efficient than storing all the characters.
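
A small sketch of this integer mapping, using a hypothetical factor `sizes` that is made up for illustration:

```r
# Hypothetical factor, not taken from the murders dataset
sizes <- factor(c("small", "large", "small", "medium"))

# Levels default to alphabetical order
levels(sizes)

# The underlying integer codes index into the levels
as.integer(sizes)
```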

Note that the levels have an order that is different from the order of appearance in the factor object. The default in R is for the levels to follow alphabetical order. However, often we want the levels to follow a different order. You can specify an order through the levels argument when creating the factor with the factor function. For example, in the murders dataset regions are ordered from east to west. The function reorder lets us change the order of the levels of a factor variable based on a summary computed on a numeric vector. We will demonstrate this with a simple example, and will see more advanced ones in the Data Visualization part of the book.

Suppose we want the levels of the region ordered by the total number of murders rather than in alphabetical order. If there are values associated with each level, we can use reorder and specify a data summary to determine the order. For example, we can take the sum of the total murders in each region and reorder the factor following these sums.
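
The idea can be sketched with toy data. The vectors `region` and `total` below are made up for illustration and are not the murders dataset; with dslabs loaded, the analogous call would be `reorder(murders$region, murders$total, FUN = sum)`.

```r
# Toy data, hypothetical values
region <- factor(c("West", "South", "West", "Northeast", "South"))
total <- c(10, 50, 20, 5, 70)

# Default levels are alphabetical
levels(region)

# Reorder the levels by the sum of total within each region
region <- reorder(region, total, FUN = sum)
levels(region)
```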

## Conditional Expressions

Conditional expressions are one of the basic features of programming. They are used for what is called flow control. The most common conditional expression is the if-else statement. In R, we can actually perform quite a bit of data analysis without conditionals. However, they do come up occasionally, and you will need them once you start writing your own functions and packages.


In [16]:
age <- 29

if (age <= 30) {
    # String concatenation in R
    paste("My age is", age)
} else {
    print("This is the else block")
}

In [17]:
# Or


ifelse(age <= 30, "Yes", "No")

In [18]:
data(na_example)

sum(is.na(na_example))

no_nas <- ifelse(is.na(na_example), 0, na_example)

sum(is.na(no_nas))

### Defining Functions

As you become more experienced, you will find yourself needing to perform the same operations over and over. A simple example is computing averages. We can compute the average of a vector x using the sum and length functions: sum(x)/length(x). Because we do this repeatedly, it is much more efficient to write a function that performs this operation. This particular operation is so common that someone already wrote the mean function and it is included in base R. However, you will encounter situations in which the function does not already exist, so R permits you to write your own.

my_function <- function(VARIABLE_NAME){
  # perform operations on VARIABLE_NAME and calculate VALUE
  VALUE
}

In [19]:
avg <- function(x){
    
    total <- sum(x)
    
    n <- length(x) 

    total / n
}


trial_sample <- 1 : 1000

avg(trial_sample)

### For Loops


In [20]:
# Use a name other than `seq`, which is also a base R function
numbers <- 1:20

for (i in numbers) {
    print(i + (i + 1))
}

[1] 3
[1] 5
[1] 7
[1] 9
[1] 11
[1] 13
[1] 15
[1] 17
[1] 19
[1] 21
[1] 23
[1] 25
[1] 27
[1] 29
[1] 31
[1] 33
[1] 35
[1] 37
[1] 39
[1] 41


## The Tidyverse

Up to now we have been manipulating vectors by reordering and subsetting them through indexing. However, once we start more advanced analyses, the preferred unit for data storage is not the vector but the data frame. 



In [22]:
library(tidyverse)




In [23]:
# Example of a tidy dataset

head(murders)

state,abb,region,population,total
Alabama,AL,South,4779736,135
Alaska,AK,West,710231,19
Arizona,AZ,West,6392017,232
Arkansas,AR,South,2915918,93
California,CA,West,37253956,1257
Colorado,CO,West,5029196,65


We will learn how to implement the tidyverse approach throughout the book, but before delving into the details, in this chapter we introduce some of the most widely used tidyverse functionality, starting with the dplyr package for manipulating data frames and the purrr package for working with functions. 

Note that the tidyverse also includes a graphing package, ggplot2, which we introduce later in Chapter 8 in the Data Visualization part of the book; the readr package discussed in Chapter 5; and many others. In this chapter, we first introduce the concept of tidy data and then demonstrate how we use the tidyverse to work with data frames in this format.

### Manipulating Dataframes


The dplyr package from the tidyverse introduces functions that perform some of the most common operations when working with data frames and uses names for these functions that are relatively easy to remember. For instance, to change the data table by adding a new column, we use **mutate**. To filter the data table to a subset of rows, we use **filter**. Finally, to subset the data by selecting specific columns, we use **select**.

In [24]:
# Adding a column to the dataset

murders <- mutate(murders, rate = ((total / population) * 100000))

head(murders)

state,abb,region,population,total,rate
Alabama,AL,South,4779736,135,2.824424
Alaska,AK,West,710231,19,2.675186
Arizona,AZ,West,6392017,232,3.629527
Arkansas,AR,South,2915918,93,3.18939
California,CA,West,37253956,1257,3.374138
Colorado,CO,West,5029196,65,1.292453


In [25]:
# Subsetting with Filter

# Filter the data table to only show the entries for which 

# the murder rate is 0.71 or lower

filter(murders, rate <= 0.71)

state,abb,region,population,total,rate
Hawaii,HI,West,1360301,7,0.514592
Iowa,IA,North Central,3046355,21,0.6893484
New Hampshire,NH,Northeast,1316470,5,0.3798036
North Dakota,ND,North Central,672591,4,0.5947151
Vermont,VT,Northeast,625741,2,0.3196211


In [27]:
# Select Columns with Select

new_table <- select(murders, state, region, rate)

head(new_table)

state,region,rate
Alabama,South,2.824424
Alaska,West,2.675186
Arizona,West,3.629527
Arkansas,South,3.18939
California,West,3.374138
Colorado,West,1.292453


### The Pipe %>%

In R we can perform a series of operations, for example select and then filter, by sending the results of one function to another using what is called the pipe operator: %>%. Since R version 4.1.0, you can also use |>. Some details are included below.

We wrote code above to show three variables (state, region, rate) for states that have murder rates below 0.71. To do this, we defined the intermediate object new_table. In dplyr we can write code that looks more like a description of what we want to do without intermediate objects:

original data  → select → filter 

In [28]:
murders %>% 
        select(state, region, rate) %>% 
        filter(rate <= 0.71)

state,region,rate
Hawaii,West,0.514592
Iowa,North Central,0.6893484
New Hampshire,Northeast,0.3798036
North Dakota,North Central,0.5947151
Vermont,Northeast,0.3196211
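
Both pipes pass the result on the left as the first argument of the function on the right. As a minimal sketch of the native pipe using only base R (the vector `x` is made up for illustration, and the code assumes R 4.1.0 or later):

```r
# The native pipe |> requires R 4.1.0 or later
x <- c(1, 4, 9, 16)

# Nested call: read from the inside out
nested <- sum(sqrt(x))

# Piped call: read from left to right
piped <- x |> sqrt() |> sum()

identical(nested, piped)
```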


### Summarizing Data

An important part of exploratory data analysis is summarizing data. The average and standard deviation are two examples of widely used summary statistics. 

More informative summaries can often be achieved by first splitting data into groups. In this section, we cover two new dplyr verbs that make these computations easier: summarize and group_by. We learn to access resulting values using the pull function.

In [29]:
# Load the heights data frame into your workspace
data(heights)

# Preview the Data
head(heights)

sex,height
Male,75
Male,70
Male,68
Male,74
Male,61
Female,65


In [33]:
heights %>% 
        filter(sex == "Female") %>% 
        summarize(average = mean(height), std_dev = sd(height))

average,std_dev
64.93942,3.760656
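
The result of summarize is itself a data frame. To access a resulting value as a plain vector, the pull function can be used. The sketch below assumes dplyr is installed and uses a toy table, `toy_heights`, with made-up values rather than the dslabs heights data.

```r
library(dplyr)

# Toy data, hypothetical values
toy_heights <- data.frame(
  sex = c("Female", "Female", "Male", "Male"),
  height = c(64, 66, 70, 72),
  stringsAsFactors = FALSE
)

# summarize() returns a data frame, even when it holds a single value
s <- toy_heights %>%
  filter(sex == "Female") %>%
  summarize(average = mean(height))
class(s)

# pull() extracts the column as a plain numeric vector
avg_female <- s %>% pull(average)
class(avg_female)
```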


### Group then Summarize with group_by

A common operation in data exploration is to first split data into groups and then compute summaries for each group. For example, we may want to compute the average and standard deviation for men’s and women’s heights separately. The group_by function helps us do this.

In [34]:
heights  %>% 
        group_by(sex)  %>% 
        summarize(average = mean(height), std_dev = sd(height))

sex,average,std_dev
Female,64.93942,3.760656
Male,69.31475,3.611024


### Sort Data Frames

When examining a dataset, it is often convenient to sort the table by the different columns. We know about the order and sort functions, but for ordering entire tables, the dplyr function arrange is useful. For example, here we order the states by murder rate:

In [35]:
murders  %>% 
        # Arrange in Ascending Order   
        arrange(rate)  %>% 
        head()

state,abb,region,population,total,rate
Vermont,VT,Northeast,625741,2,0.3196211
New Hampshire,NH,Northeast,1316470,5,0.3798036
Hawaii,HI,West,1360301,7,0.514592
North Dakota,ND,North Central,672591,4,0.5947151
Iowa,IA,North Central,3046355,21,0.6893484
Idaho,ID,West,1567582,12,0.7655102


In [36]:
murders  %>% 
        # Arrange in Descending Order   
        arrange(desc(rate))  %>% 
        head()

state,abb,region,population,total,rate
District of Columbia,DC,South,601723,99,16.452753
Louisiana,LA,South,4533372,351,7.742581
Missouri,MO,North Central,5988927,321,5.359892
Maryland,MD,South,5773552,293,5.074866
South Carolina,SC,South,4625364,207,4.475323
Delaware,DE,South,897934,38,4.231937


If we are ordering by a column with ties, we can use a second column to break the tie. Similarly, a third column can be used to break ties that remain after the first and second, and so on. Here we order by murder rate in descending order, and any ties are broken by population:

In [37]:
murders  %>% 
        arrange(desc(rate), population)  %>% 
        head()

state,abb,region,population,total,rate
District of Columbia,DC,South,601723,99,16.452753
Louisiana,LA,South,4533372,351,7.742581
Missouri,MO,North Central,5988927,321,5.359892
Maryland,MD,South,5773552,293,5.074866
South Carolina,SC,South,4625364,207,4.475323
Delaware,DE,South,897934,38,4.231937
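
Because no two states share the same rate, the second column has no visible effect in the output above. A sketch with a toy table that does contain a tie (the data frame `toy` and its values are made up for illustration, and dplyr is assumed to be installed):

```r
library(dplyr)

# Toy data with a tie in `score`, hypothetical values
toy <- data.frame(
  id = c("a", "b", "c", "d"),
  score = c(2, 5, 5, 1),
  size = c(10, 30, 20, 40),
  stringsAsFactors = FALSE
)

# Sort by score (descending); rows b and c tie, so size breaks the tie
sorted <- toy %>% arrange(desc(score), size)
sorted$id
```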


## Importing Data

| Function | Format | Typical suffix |
|---|---|---|
| read_table | white space separated values | txt |
| read_csv | comma separated values | csv |
| read_csv2 | semicolon separated values | csv |
| read_tsv | tab separated values | tsv |
| read_delim | general text file format, must define delimiter | txt |


| Function | Format | Typical suffix |
|---|---|---|
| read_excel | auto detect the format | xls, xlsx |
| read_xls | original format | xls |
| read_xlsx | new format | xlsx |

In [39]:
murder_df <- read_csv('murders.csv')

Parsed with column specification:
cols(
  state = col_character(),
  abb = col_character(),
  region = col_character(),
  population = col_double(),
  total = col_double()
)
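
For files that use another delimiter, read_delim from the table above can be used. The sketch below assumes readr is installed; it writes a small, made-up table to a temporary file so that the example does not depend on any file on disk.

```r
library(readr)

# Write a small, hypothetical table to a temporary semicolon-delimited file
path <- tempfile(fileext = ".txt")
writeLines(c("state;total", "A;10", "B;25"), path)

# read_delim() needs the delimiter spelled out;
# col_types = cols() lets readr guess the column types quietly
dat <- read_delim(path, delim = ";", col_types = cols())

dat
```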


### Text Files vs Binary Files

You have already worked with text files. All your R scripts are text files and so are the R markdown files used to create this book. The csv tables you have read are also text files. One big advantage of these files is that we can easily “look” at them without having to purchase any kind of special software or follow complicated instructions. Any text editor can be used to examine a text file, including freely available editors such as RStudio, Notepad, TextEdit, vi, emacs, nano, and pico. To see this, try opening a csv file using the “Open file” RStudio tool. You should be able to see the content right in your editor. However, if you try to open, say, an Excel xls file, jpg or png file, you will not be able to see anything immediately useful. These are binary files. Excel xlsx files are actually compressed folders with several text files inside. But the main distinction here is that text files can be easily examined.

Although R includes tools for reading widely used binary files, such as xls files, in general you will want to find data sets stored in text files. Similarly, when sharing data you want to make it available as text files as long as storage is not an issue (binary files are much more efficient at saving space on your disk). In general, plain-text formats make it easier to share data since commercial software is not required for working with the data.

Extracting data from a spreadsheet stored as a text file is perhaps the easiest way to bring data from a file to an R session. Unfortunately, spreadsheets are not always available and the fact that you can look at text files does not necessarily imply that extracting data from them will be straightforward.

### Unicode vs ASCII

To understand the difference between these, remember that everything on a computer needs to eventually be converted to 0s and 1s. ASCII is an encoding that maps characters to numbers. ASCII uses 7 bits (0s and 1s), which results in 2^7 = 128 unique items, enough to encode all the characters on an English language keyboard. However, other languages use characters not included in this encoding. For example, the é in México is not encoded by ASCII. For this reason, a new encoding, using more than 7 bits, was defined: Unicode. When using Unicode, one can choose between 8, 16, and 32 bits, abbreviated UTF-8, UTF-16, and UTF-32 respectively. RStudio actually defaults to UTF-8 encoding.

Although we do not go into the details of how to deal with the different encodings here, it is important that you know these different encodings exist so that you can better diagnose a problem if you encounter it. One way problems manifest themselves is when you see “weird looking” characters you were not expecting.
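
As a small sketch of the difference, base R can show the number behind a character and how many bytes it takes in UTF-8. The variable name `e_acute` is made up for illustration.

```r
# "\u00e9" is the escape for é, used here so the script itself stays ASCII
e_acute <- "\u00e9"

# Unicode code point of é, which lies beyond the ASCII range 0-127
utf8ToInt(e_acute)

# In UTF-8, é takes two bytes, while an ASCII letter takes one
length(charToRaw(enc2utf8(e_acute)))
length(charToRaw("e"))
```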

### Good Hygiene with Excel Files

- Be Consistent - Before you commence entering data, have a plan. Once you have a plan, be consistent and stick to it.
- Choose Good Names for Things - You want the names you pick for objects, files, and directories to be memorable, easy to spell, and descriptive. This is actually a hard balance to achieve and it does require time and thought. One important rule to follow is do not use spaces; use underscores _ or dashes - instead. Also, avoid symbols; stick to letters and numbers.
- Write Dates as YYYY-MM-DD - To avoid confusion, we strongly recommend using this global ISO 8601 standard.
- No Empty Cells - Fill in all cells and use some common code for missing data.
- Put Just One Thing in a Cell - It is better to add columns to store the extra information rather than having more than one piece of information in one cell.
- Make It a Rectangle - The spreadsheet should be a rectangle.
- Create a Data Dictionary - If you need to explain things, such as what the columns are or what the labels used for categorical variables are, do this in a separate file.
- No Calculations in the Raw Data Files - Excel permits you to perform calculations. Do not make this part of your spreadsheet. Code for calculations should be in a script.
- Do Not Use Font Color or Highlighting as Data - Most import functions are not able to import this information. Encode this information as a variable instead.
- Make Backups - Make regular backups of your data.
- Use Data Validation to Avoid Errors - Leverage the tools in your spreadsheet software so that the process is as error-free and repetitive-stress-injury-free as possible.
- Save the Data as Text Files - Save files for sharing in comma or tab delimited format.
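
To illustrate the recommendation on dates, the sketch below uses made-up date strings to show why YYYY-MM-DD is easier to work with in R than an ambiguous layout.

```r
# ISO 8601 dates parse without a format string
d <- as.Date("2021-03-04")

# The same digits in another layout need an explicit format,
# and a reader cannot tell the intent from the text alone
us_style <- as.Date("03/04/2021", format = "%m/%d/%Y")  # March 4
eu_style <- as.Date("03/04/2021", format = "%d/%m/%Y")  # April 3

# ISO 8601 strings also sort chronologically as plain text
sorted_dates <- sort(c("2021-10-01", "2020-12-31", "2021-02-15"))
sorted_dates
```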