# Introduction to R & Jupyter Notebooks
# Data Manipulation

Matthew D. Turner, PhD  
Georgia State University

Some rights reserved: [cc by-nc-sa](https://creativecommons.org/licenses/by-nc-sa/4.0/). See bottom of document for details.
***
In this section of the workshop we will look at the related problems of

+ Data Structures
+ Loading Data
+ Referencing Data
+ (More) Indexing
+ Subsetting Data

This is a little boring but it is usually 90% of the work you have to do in data-oriented research, so we need to get it right. Bear with me.

## 3.1 Data Structures (Data Frames)
R provides a variety of data structures for storing data. We will look at two of these here.

Previously you loaded the `datasets` package that contains a lot of small data sets to let you experiment with statistical techniques. Let's use that again.

In [None]:
# Load the datasets package and list the data sets (you've seen this before!)

library(datasets)
# library(help = "datasets")  # Uncomment this line to see the list of data again

### Time Series
One specialized data type is a **time series** data set, which is commonly used for data collected over time. (This is hopefully obvious from the name.)

In [None]:
data(discoveries)    # Load the data
discoveries          # Typing its name shows us the data on the screen

This data set is the count of "important" discoveries per year, for the years from 1860 to 1959. To learn more about the data set, you could type `?discoveries` but don't bother, the people who added this to R did not know what "important" meant, either.
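A time series keeps its time information along with the values. As a small sketch, the base R functions `start`, `end`, `frequency`, and `length` let you check the span described above. The variable names on the left of the arrows are made up for this example.

```R
library(datasets)

# A time series stores its time axis along with the values
first_year <- start(discoveries)[1]    # first year covered
last_year  <- end(discoveries)[1]      # last year covered
per_year   <- frequency(discoveries)   # observations per year (1 means yearly data)
n_years    <- length(discoveries)      # total number of observations

first_year
last_year
per_year
n_years
```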

Time series are usually best viewed as a plot. For many data sets, R will make reasonable default plots. For a time series like this, you can literally just say `plot(discoveries)` and the result is not bad.

In [None]:
plot(discoveries)

Now that is way too big for some laptop screens. So let's tune the options. 

> **Note!** If the plot above is already reduced in size (so that it looks the same as the next one) then your current session of R is still using options set from a previous run of something. It happened to me, too. Nothing bad has happened. I just did not want you staring at two identical plots and wondering what you were missing.

In [None]:
options(repr.plot.width=6, repr.plot.height=4) # Set plot height and width (in inches)
plot(discoveries)

In [None]:
# We can also clean up the titles and make the figure "pretty"
# See if you can figure this out before you press shift-enter to run it

plot(discoveries, xlab = 'Time (in Years)', ylab = 'Number of Discoveries', col = "cyan")

In [None]:
# Exercise: Get the mean and standard deviation of the discoveries timeseries
# You used these functions previously on lists, they work here, too



One function you will use a lot is `summary`. When applied to a time series, it gives a quick statistical summary of the variable. For people who come from disciplines that use the mean and standard deviation, this output may seem weird. 

For a **numerical** variable, R gives the minimum, maximum, median, first quartile, and third quartile. In the original language this is all that was given, but over time the mean was squeezed into the list as well, due to the demands of people from mean-centered disciplines. 
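To see the shape of this output without giving away the exercise, here is a minimal sketch on a small made-up vector (`x` and `x_summary` are names invented for this example, not workshop data). It also shows that the quartiles come from the same calculation as the `quantile` function.

```R
# A small made-up vector (not from any of the workshop data sets)
x <- c(2, 4, 4, 5, 7, 9, 30)

x_summary <- summary(x)   # the six-number summary described above
x_summary
names(x_summary)          # the labels R attaches to the six values

# The same quartiles, requested directly
quantile(x, c(0.25, 0.5, 0.75))
```

Notice how far the mean sits from the median here. The single large value (30) pulls the mean up but leaves the median alone.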

In [None]:
# Exercise: Use the summary function on the discoveries time series
# Hint: It *is* as simple as it sounds



### Data Frames
The primary way to hold data sets in R is with a **data frame**. A data frame is basically a little spreadsheet that you can manipulate and refer to in other calculations.

Time series are special. Most data sets need multiple columns, one for each variable, and multiple rows, one for each observation (i.e., subject/object/thing/trial/etc.). Remember that; it is the usual format for statistics: **columns are variables** and **rows are observational units**. This is how data frames are organized. Let's see one: we will start with the famous _Motor Trend_ Cars dataset.

In [None]:
data(mtcars)   # Load the data (for some data sets this is optional)
head(mtcars)   # Head shows the first few rows of a data frame
dim(mtcars)    # Short for dimension
               # Gives the number of ROWS (cars in this case) and COLS (measurements on cars)

Compare this to the `discoveries` time series. Here we have 32 different models of car, and for each model of car we have measurements on 11 variables, including: miles per gallon (mpg), number of cylinders in the engine (cyl), horsepower (hp), and type of transmission (am; 0 = automatic, 1 = manual), among others.

The data frame is an easy way to keep the measurements on each car together with the name of the car. 
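If you want to poke at the structure a bit more, the following sketch uses a few base R functions (`nrow`, `ncol`, `names`, `rownames`, `str`) that report on a data frame without printing the whole thing.

```R
library(datasets)

nrow(mtcars)             # number of rows (observational units: the cars)
ncol(mtcars)             # number of columns (variables measured on each car)
names(mtcars)            # the variable names
head(rownames(mtcars))   # the car names are stored as row names
str(mtcars)              # compact overview: one line per variable
```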

### Indexing Data Frames
We can refer to the rows or columns in a variety of ways, and we can pick out the bits of the data frame that we want and ignore the rest. We can also make new data frames out of old ones.

> **Important Note** Look carefully at the next example and notice that it uses **square** brackets `[` and `]`, while functions (above) use the **round** ones, `(` and `)`. It is important to use round for functions and square for indexing. You **will** eventually mess this up, it happens to all of us.

In [None]:
# Pick out 1st row:

mtcars[1,]   # Note the comma and the nothing after it! and the SQUARE brackets

In [None]:
# The first 5 rows:

mtcars[1:5,]          # Remember the colon operator from before? 

mtcars[seq(1,5,1),]   # You can use seq here, too!

In [None]:
# The first column of measurements:

mtcars[,1]   # Note the comma and the empty space in front of it for row
             #    Also, this is formatted differently!!

Ok, this is one of the things that I _hate_ about R: the use of an empty space to mean "all" of something. So, inside of square brackets, something like `[,3]` means "give the column 3 values for all of the rows of data," or just "column 3," while `[6,]` means "give all the columns (measurements) from the 6th row of data," or just "row 6." This is a terrible notation that leads to errors. But we will just have to get used to it.
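One way to convince yourself of what the empty space means is to spell out "all" by hand and compare. This is only a sketch on a tiny made-up data frame (`toy` and its columns are invented for illustration).

```R
# A tiny made-up data frame (hypothetical values, for illustration only)
toy <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30), z = c(100, 200, 300))

toy[, 2]               # blank before the comma: ALL rows of column 2
toy[1:nrow(toy), 2]    # the same thing, with the rows spelled out

toy[3, ]               # blank after the comma: ALL columns of row 3
toy[3, 1:ncol(toy)]    # the same thing, with the columns spelled out
```

Nobody writes the long versions in practice, but they make it clear that the blank is just shorthand for "every one of them."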

In [None]:
# Exercise: obtain the list of displacement (disp) values from mtcars



In [None]:
# Exercise
#
# Pick out the ODD rows from the data frame
# Hint (1): Use the seq function from the previous notebook for the rows (? can help)
# Hint (2): You need to know the number of rows to make this work (look above)
# Hint (3): Remember to use the correct brackets (square or round)



Remember when I said that `seq` stopping early would help you out? Someday it will dawn on you why that was important here. If it hasn't already. Meditate upon this.

You can refer to columns by name. As we often want to analyze variables, we often need to pick out entire columns, so this allows us to use human-friendly names rather than having to figure out the column numbers.

In [None]:
mtcars[,"mpg"]  # This list is the same as the one above

In [None]:
# Use c() -- catenate -- to make a list of columns you want:

mtcars[, c("mpg", "hp", "am")]  # This shows three columns

newcar <- mtcars[, c("mpg", "hp", "am")]  # What does this do?
                                          # Does it print anything below?

The `summary` function also works with data frames. When applied to these, it gives a quick statistical summary of **all** of the variables. For each **numerical** variable, R gives the minimum, maximum, median, mean, first quartile, and third quartile. If it finds any non-numerical variables, it gives a partial tabulation (more on this below).

In [None]:
summary(newcar)

To get the means (`mean` function) and standard deviations (the `sd` function) you have to pick out the whole variable, that is, you can **not** apply these functions to the whole data frame.

In [None]:
# Obtain the mean and standard deviation of the mpg variable in the newcar
# data frame; remember you have to pick out the single variable like you 
# did above



As we will do this operation (picking out a column) a **lot**, there is a shorthand notation for it:

```R
dataFrameName$columnName
```

So, for the cell you just did, you can use `newcar$mpg` where you used `newcar[,"mpg"]`. The same works for any data frame, for example `mtcars$mpg` in place of `mtcars[,"mpg"]`.
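Here is a minimal sketch of the two notations side by side, on a tiny made-up data frame (`pets` and its values are invented for this example) so that the exercises stay unspoiled.

```R
# A tiny made-up data frame (hypothetical values, for illustration only)
pets <- data.frame(name = c("Rex", "Tom", "Joy"), weight = c(30, 4, 12))

pets[, "weight"]    # square-bracket form
pets$weight         # dollar-sign shorthand for the same column

identical(pets[, "weight"], pets$weight)   # TRUE: both give the same values

mean(pets$weight)   # functions like mean and sd take the single column
sd(pets$weight)
```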

In [None]:
# Print the values inside of the am variable from the newcar frame
# Use mean and sd to summarize the am variable 
# Use $ here!



### Variable Transformation
Although we can be sloppy sometimes, it is good to make sure that R knows what the numbers we use represent. When we started looking at `mtcars` we said that `am` was the **type** of transmission, either automatic (0) or manual (1). 

Well, what do the mean and standard deviation of _type of transmission_ mean? Nothing really. The mean, approximately 0.406, tells us that most cars are automatic (why?) but not much else.

When R loaded this data it did not know that the 1's and 0's in this variable were **names**, not really numbers. (In psychology, numbers as names are called a _nominal scale_; in several fields we might also call a variable like `am` a _categorical variable_.) We can fix this.

Categorical variables are sometimes called **factors** and R will allow us to tell it when we have a factor. We do this with the `factor` function. In a moment we are going to use this function, but when we do we are going to introduce a new idiom, commonly used in computer science. Here is the code we will use in a moment:

```
newcar$am <- factor(newcar$am)
```

Notice that `newcar$am` appears on both sides of the assignment arrow (`<-`). Remember that in an assignment operation, we put the result into whatever is on the left hand side **after** doing the operation on the right hand side. So in this command we are telling R:

1. Get whatever is stored in `newcar$am`
1. Give this to the `factor` function
1. Take the output of `factor` and place this result into the variable name on the left hand side of the arrow (or equals sign); this just happens to be the same variable name `newcar$am`, so the original data will be **replaced** with the new data

In computing we always do what is on the _right hand side_ of the assignment operator first, then put it into the name on the left! (See notebook 2 for more on assignments.) 
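To watch the idiom work before you try it on `newcar`, here is a sketch on a tiny made-up data frame. The names `toy`, `score`, and `group` are invented for this example; `group` uses 0 and 1 as names in the same way `am` does.

```R
# A tiny made-up data frame; group uses 0/1 as names, like am in mtcars
toy <- data.frame(score = c(5, 7, 6, 9), group = c(0, 1, 1, 0))

class(toy$group)                 # "numeric": R treats the codes as numbers

toy$group <- factor(toy$group)   # right side runs first, result replaces the column

class(toy$group)                 # "factor": R now treats the codes as names
levels(toy$group)                # the distinct categories
table(toy$group)                 # how many rows fall in each category
```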

In [None]:
# Convert the am variable in newcar to a factor (use command shown)
# Then use summary on newcar, again



Compare the output from the cell just above to what we got earlier. 

>**Important Note** If you **re-run** the cell further above where I originally put the command to do a summary, it will give you the **same** output as you just got. Always remember that changes we make to our data are **persistent**.

Notice that the `summary` command gives different output for categorical/nominal variables than it gives for numbers.

## 3.2 Loading Data
In this workshop we will focus on the most basic method of loading data, reading it in from a CSV file. This is generally the preferred method in R for data sets that are on the order of 100 MB or less. If you work with massive data, there are specialized add-ons for dealing with the bespoke data types for those sorts of problems.

The basic command is `read.csv` which is a derivative of the more general command `read.table`. Remember this other command if you find yourself trying to read data in from plain text files. 
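The relationship between the two commands can be sketched as follows. Since this example should not depend on any workshop file, it first writes a tiny made-up data set to a temporary file; the names `tmp`, `from_csv`, and `from_table` are invented here.

```R
# Write a tiny made-up data set to a temporary CSV file so there is something to read
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, score = c(2.5, 3.0, 4.5)), tmp, row.names = FALSE)

from_csv   <- read.csv(tmp)                               # the convenient command
from_table <- read.table(tmp, header = TRUE, sep = ",")   # the general command, options spelled out

class(from_csv)                       # read.csv hands back a data frame
all.equal(from_csv, from_table)       # the two agree for this simple file

unlink(tmp)                           # remove the temporary file
```

For a plain text file that uses tabs or spaces instead of commas, you would change the `sep` option of `read.table` accordingly.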

In [None]:
# Use ? to glance at the help for read.csv
# Notice the vast array of options available for weird data files



In [None]:
# 9 times out of 10, just using read.csv will work, with no need to use options
# Load the data from the file height_weight_200.csv which you have on DICE
# Hint: the name of a file needs to be inside of double quotes ("")



If you just used: `read.csv("height_weight_200.csv")` then you successfully printed your data to the screen. If you look, you will notice that it looks like a data frame. R generally loads data into data frames, although sometimes this will not be true.

However, to use the data we need to put it into a variable. If you thought about this, good! If not, no problem. We're moving fast today.

Let's repeat the command above, but this time store the data in a variable, `hw2`.

In [None]:
# Put the height_weight_200.csv data into the new variable hw2
# Just for practice, use the <- if you have been using = or vice-versa



In [None]:
# Print out the first few lines of hw2 
# Hint: There are examples of this above, there are several ways to do it



## 3.3 Referencing Data
Once you have gotten all of your data into a data frame, it is time to analyze it. To do that we need to be able to refer to it in other places. Data inside of R data frames has a **name** (the variable) and an **address** (the data frame name). You usually need to use both to be exact. This is what the dollar sign (`$`) notation does. Think of it like:

```
address$name
dataFrameName$variableName
dataFrameName$columnName
```

Get used to this aspect of R. R assumes that you will often have more than one data set loaded at once, because real statistical problems rarely all fit into just one data set. For most of today we will not be making much use of this.
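A quick sketch of why the address matters: below are two tiny made-up data frames (`class_a` and `class_b`, invented for this example) that each contain a variable with the same name.

```R
# Two tiny made-up data frames that both have a column called height
class_a <- data.frame(height = c(60, 65, 70))
class_b <- data.frame(height = c(62, 68, 74))

mean(class_a$height)   # the height variable at the address class_a
mean(class_b$height)   # same variable name, different address
```

Without the address, R would have no way of knowing which `height` you meant.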

Now that the data has been loaded from a file on the disk into a data frame, we can do all of the stuff we did above to it. Although this is a very small data set.

In [None]:
summary(hw2)

sd(hw2$height)
sd(hw2$weight)

In [None]:
# Look up the help for the plot function



In [None]:
# Make a plot of height (x-axis) versus weight (y-axis)
# Hint: remember to give the name and address of the variables
# Hint: if you like you can copy the options function from above and change it



## 3.4 More Indexing
We have already been dealing with the problem of "indexing" a data frame. Within each data frame there is an implied set of coordinates, like latitude and longitude. But in frames, it is **rows** and **columns**. R has methods for working with these data coordinates directly.

We have seen some of this above. We will not look into getting single numbers out of a data frame, it is not used heavily in statistical analyses. We generally do not need to pick out specific points in the data frame, just entire rows or columns. Or _sets_ of rows within columns.

We can combine the dollar sign notation with the square brackets. Since we will be using the dollar sign to pick out columns, the numbers in the square brackets will not need the commas from above.

In [None]:
head(hw2)          # Print first six rows of hw2

hw2$height[1]      # First height (as a list)
hw2$height[1:5]    # First 5 heights (as a list)

In [None]:
plot(hw2$height[1:10], hw2$weight[1:10])  # Plot the first 10 people ONLY

In [None]:
# Make a plot that only uses the first 100 data rows in the hw2 frame



### Logical Indices
R allows you to use an ordered list of truth values to index a data frame. This seems weird, but like a lot of weird things, it turns out to be useful.

We can use this to plot the height versus weight for **only** the people who are above the third quartile of height (69.2 inches). What we will do is the following:

1. For each height, if the height is greater than or equal to 69.2 inches mark `TRUE`, if not mark `FALSE`
1. Put these `TRUE` and `FALSE` results in a variable, here we use `ind`
1. Use this `ind` variable as an **index** inside the square brackets

Viz.

In [None]:
# The first line compares each height with 69.2, and stores the TRUE or FALSE
# values in the variable ind

ind <- hw2$height >= 69.2   # This shows nothing in the output below!

print(ind)                  # The TRUE's and FALSE's below

hw2$height[ind]             # The numbers below; just the heights >= 69.2

To be honest this is a little more advanced than much of what we are covering in these notebooks, but it is really important. If you don't get it right away, move on to other topics and come back to it later.

It **will** eventually make sense to you. Trust me.
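If it helps, here is the same idea shrunk down to five made-up people, small enough to check by eye. The vectors `height` and `weight` below are invented for illustration and are not the `hw2` data.

```R
# Made-up heights and weights for five people (illustration only)
height <- c(64.0, 70.5, 69.2, 72.1, 66.3)
weight <- c(120, 165, 150, 180, 135)

ind <- height >= 69.2   # one TRUE or FALSE per person
ind                     # FALSE TRUE TRUE TRUE FALSE

height[ind]             # keeps only the positions marked TRUE
weight[ind]             # the same index picks out the matching weights
sum(ind)                # counts the TRUEs: how many people were kept
```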

In [None]:
# Exercise (a little more advanced than some others!)
# Make the same plot as before, but with the heights and weights where
# height is at least 69.2 inches (the same >= comparison as above)
#
# Hint: use the last expression in the cell above
# Hint: Apply the same ind to both variables



As usual in R there is a shorthand way to do this: you do not need to create a new variable to hold the logical indices as an intermediate step (like we did with `ind` above), you can just stick the comparison directly into the square brackets!

In [None]:
plot(hw2$height[hw2$height >= 69.2], hw2$weight[hw2$height >= 69.2])

Compare this to the plot you just made above. It should be the same.

For beginners this is harder to read, and it really messes up the labels for the x and y axes. But once you get used to it, it is an efficient way to pick out data that meets certain criteria. 

## 3.5 Subsetting Data

We can use logical indices to make subsets we might need. 

R provides many built-in data sets for practice and demonstration of statistical techniques. You can run `library(help = "datasets")` to get a list and then get help for each specific data set name to see the details. Most packages that get added to R will bring their own data in with them.

Here we use the **state** data. R provides **lists** of the various facts about states including their abbreviation (`state.abb`), their area (`state.area`; in square miles), and their region of the country (`state.region`; Northeast, South, North Central, West). 

Note that these lists are not data frames, they are just lists. (Try typing their names in a cell to see what they contain.) We will put these lists together into a single data frame which has all three pieces of information for each state. For this we use the `data.frame` function.

In [None]:
# Look at the state.abb and state.area variables and describe them



In [None]:
# Make a data frame, s, from state data

s <- data.frame(name = state.abb, region = state.region, area = state.area)
head(s)
summary(s)

In [None]:
# List all the states in the "Northeast" region

s$name[s$region == "Northeast"]

In [None]:
# List all of the states in the "South" region



In [None]:
# List all of the areas of the states in the "South" region



In [None]:
# We can make this into a "prettier" list

data.frame(name=s$name[s$region == "South"], area=s$area[s$region == "South"])
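Another way to get the same table, sketched here as an alternative rather than a replacement, is to combine the two kinds of indexing from earlier: a logical index in the row position (before the comma) and a list of column names in the column position (after it). The name `south` is made up for this example, and the cell rebuilds `s` so that it runs on its own.

```R
library(datasets)

s <- data.frame(name = state.abb, region = state.region, area = state.area)

# Rows where region is "South" (before the comma), two named columns (after it)
south <- s[s$region == "South", c("name", "area")]
head(south)
nrow(south)    # number of states in the South region
```

The row labels on the left of this output are the original row numbers from `s`, which is one visible difference from the version built with `data.frame`.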

## 3.6 Bonus: Boxplots!
Before we leave the state data let's look at the `boxplot` function. This function makes a boxplot, which is a graphical version of the five number summary invented by [John Tukey](https://en.wikipedia.org/wiki/John_Tukey#Statistical_practice). A boxplot is a box that runs from the 1st quartile to the 3rd quartile with a line at the median. It has lines extending from the ends of the box toward the lower and upper fences, which mark the boundary between points close to the rest of the data and points further out, called "outlying" points. (These lines are called "whiskers" and give the plots their other name, "box and whisker plots.") By default R places each fence 1.5 times the height of the box beyond the end of the box, and stops each whisker at the most extreme data point inside its fence. The outlying points are usually worthy of closer inspection.
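The numbers behind a box can be inspected directly. As a sketch on a small made-up vector (`x` and `b` are names invented here), the base R functions `fivenum` and `boxplot.stats` return the five number summary and the values that `boxplot` uses for drawing.

```R
# A small made-up vector with one value far from the rest
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

fivenum(x)              # minimum, lower hinge, median, upper hinge, maximum

b <- boxplot.stats(x)   # the numbers boxplot() uses to draw one box
b$stats                 # lower whisker end, box bottom, median, box top, upper whisker end
b$out                   # points beyond the whiskers (drawn individually)
```

The "hinges" are Tukey's version of the quartiles; they can differ slightly from the quartiles that `summary` reports. Here the value 50 is the maximum in `fivenum`, but the upper whisker stops at 9 and 50 is reported separately as an outlying point.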

Boxplots are most useful to compare **groups** of numerical measures.

This is much clearer with an example.

In [None]:
# Compare the regions by the sizes (areas) of their states

boxplot(s$area ~ s$region)

That display is too squished and also uses "scientific notation" for the y-axis. Let's unsquish it and turn off the scientific notation:

In [None]:
options(scipen=999)  # Eliminates the scientific notation on y-axis
boxplot(s$area ~ s$region, ylim = c(0, 175000)) # Change y-axis to unsquish

Note that in making the plot look better, we had to change the y-axis and two of the outlying points are no longer on the plot.

The `boxplot` function takes two variables separated by the tilde, `~`. On the left of the `~` you put the numerical variable, and on the right of it you put the grouping variable.
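As a sketch of a slightly tidier way to write the same thing, `boxplot` also accepts a `data` argument, so the data frame only has to be named once. The `plot = FALSE` option used below asks for the computed groups without drawing anything, which makes it easy to check how the tilde split the data. The name `b` is made up, and the cell rebuilds `s` so that it runs on its own.

```R
library(datasets)

s <- data.frame(name = state.abb, region = state.region, area = state.area)

# Same numerical ~ grouping idea, naming the data frame once with the data argument
b <- boxplot(area ~ region, data = s, plot = FALSE)   # plot = FALSE: compute only, draw nothing

b$names   # one group per region, in the order the boxes are drawn
b$n       # number of states in each group
```

Drop `plot = FALSE` to get the picture back.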

We can see from this example that states in the northeast are the smallest, on average, while states out west are the largest, again on average.

You might want to explore these data further to see if you can:
+ Determine which states are included in which regions
+ Identify the outlying points
+ Compute other statistics for regions. Think about things like: `mean(s$area[s$region == 'West'])` and other function and comparison combinations

Now that we have the basics of data frames, indexing, and subsetting working, it is time to actually do some statistics. 

Version 1.0  
2018.06.06  

To contact the author, please email [mturner46@gsu.edu](mailto:mturner46@gsu.edu). Please contact me with recommendations for improvement or if you find any errors. This work may be adapted for any non-commercial purpose within the bounds of the license.

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.