# DataFrame Basics

In our previous lessons, we've talked about how vectors are often used to store lots of different observations of a given measurement (e.g. the answers of different survey respondents to a given question), and how matrices can be used to collect lots of different measurements in columns (e.g. each column can be answers to different questions). 

But matrices have one major limitation when it comes to social science workflows, which is that all the entries in a matrix have to be of the same type. In reality, however, we often have datasets with lots of *different* data types. For example, we might have numeric data on age and income, but character data for people's names, preferred political candidate, etc. Or we might have data on power plants across the US that includes numeric data on capacity, age, and pollution alongside character data on the power plant's fuel and the company that owns the plant. 

To deal with this kind of *heterogeneous tabular data*, we turn to the `data.frame`. 

Dataframes are basically just a collection of vectors, where each vector corresponds to a different column, and each column has a single type. Since they're two-dimensional data structures like matrices, we can actually subset them in the same way as matrices, but they are more flexible in terms of the types of data they can store. 
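
To make that difference concrete, here is a minimal sketch. The vectors are made up for illustration, and `cbind()` is used only as one convenient way to build a matrix from vectors:

```r
# Two vectors of different types: character names and numeric ages
name <- c("al", "bea", "carol")
age <- c(6, 7, 4)

# cbind() builds a matrix, and a matrix forces one type on every entry
as_matrix <- cbind(name, age)
class(as_matrix[, "age"])   # "character": the ages were converted to text

# data.frame() lets each column keep its own type
as_df <- data.frame(name, age)
class(as_df[, "age"])       # "numeric"
```

Once the ages have been turned into text, something like `mean()` no longer works on them, which is exactly the problem dataframes avoid.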

In this reading, we'll discuss how to create dataframes, how to get information about your dataframe as a whole, how to subset the rows of your dataframe, how to subset the columns of your dataframe, and finally how to edit your dataframe. 

There will be a **lot** of syntax in this reading, but the goal of the reading is **not** to have you memorize all the syntax, but rather to understand the *logic* of how dataframes work. To help with that, I've provided a recap section at the end of the reading with examples of all the commands we cover in one place. So as you read, try and focus on the logic of how dataframes work, not the exact syntax. 

## Creating Dataframes

Let's start by learning how to create a dataset in R. This turns
out to be very simple --- just combine vectors using the `data.frame()`
command. 


In [1]:
# Create three vectors 
name <- c("al", "bea", "carol")
age <- c(6, 7, 4)
hair <- c("brown", "green", "blond")

# Create data frame 
children <- data.frame(name, age, hair)
children


name,age,hair
<chr>,<dbl>,<chr>
al,6,brown
bea,7,green
carol,4,blond


Or we can create our data frame by inserting our vectors as keyword arguments:

In [2]:
# Create data frame 
children <- data.frame(
    name = c("al", "bea", "carol"),
    age = c(6, 7, 4),
    hair = c("brown", "green", "blond")
)
children

name,age,hair
<chr>,<dbl>,<chr>
al,6,brown
bea,7,green
carol,4,blond


Note that unlike matrices and vectors -- which *can* have names -- dataframe columns **always** have names, and you'll usually see columns accessed by name for reasons we'll discuss below:

In [3]:
children[, "hair"]

And as we discussed before, each column of a dataframe is just our old friend, the vector!

In [4]:
class(children[, "hair"])
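
If you want to check this for yourself, here is a small sketch. It assumes R 4.0 or later, where `data.frame()` leaves text columns as character vectors (older versions converted them to factors by default):

```r
children <- data.frame(
    name = c("al", "bea", "carol"),
    age = c(6, 7, 4),
    hair = c("brown", "green", "blond")
)

# Pull one column out by name and inspect it
hair_col <- children[, "hair"]
is.vector(hair_col)         # TRUE
class(hair_col)             # "character"
length(hair_col)            # 3, one entry per row

# A numeric column comes back as a numeric vector
class(children[, "age"])    # "numeric"
```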

## Getting to Know Your Dataframe

To better understand the proper structure of datasets, let's create a second data frame that has a more realistic data structure:

In [5]:
country <- rep(c("USA", "China", "Sudan"), 3)
year <- c(1994, 1994, 1994, 1995, 1995, 1995, 1996, 1996, 1996)
gdp_pc <- round(runif(9, 1000, 20000))

countries <- data.frame(country, year, gdp_pc)
countries

country,year,gdp_pc
<chr>,<dbl>,<dbl>
USA,1994,2929
China,1994,10576
Sudan,1994,7123
USA,1995,7665
China,1995,15991
Sudan,1995,4092
USA,1996,2127
China,1996,5831
Sudan,1996,13325


Where we can pretend that `gdp_pc` is a measure of a country's GDP per capita in a given year. 
 
(A quick aside: `rep()`, as you may recall, creates a vector that repeats the first input the number of times specified by the second input. `runif()` creates, in this case, 9 random values uniformly distributed between 1000 and 20000.)
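
Here is a short sketch of those two helpers on their own. The `each` argument and the call to `set.seed()` are additions for illustration and are not used in the cell above; the seed value is arbitrary:

```r
# rep() with a count repeats the whole vector that many times
rep(c("USA", "China", "Sudan"), 3)
# "USA" "China" "Sudan" "USA" "China" "Sudan" "USA" "China" "Sudan"

# `each` repeats every element in turn, a compact way to build `year`
year_alt <- rep(c(1994, 1995, 1996), each = 3)
year_alt
# 1994 1994 1994 1995 1995 1995 1996 1996 1996

# runif() draws fresh random values on every call, so fix the seed
# if you want the same numbers each time you run the code
set.seed(42)
gdp_pc_demo <- round(runif(9, 1000, 20000))
length(gdp_pc_demo)                                   # 9
all(gdp_pc_demo >= 1000 & gdp_pc_demo <= 20000)       # TRUE
```

Because the values are random, the numbers you get from `runif()` will generally not match the ones printed in this reading.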


Now let's explore some common functions for getting to know your dataframe!

The first is `dim()`, which gives the dimensions of a data frame. The number of rows is listed first, the number of columns second.

In [6]:
dim(countries)

Use `nrow()` and `ncol()` to get the number of rows or columns separately.

In [7]:
nrow(countries)
ncol(countries)
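
For reference, here is roughly what those calls return. This sketch rebuilds `countries` with the `gdp_pc` values printed earlier hard-coded, so it does not depend on the random draw:

```r
# Same data as above, with the printed gdp_pc values typed in by hand
countries <- data.frame(
    country = rep(c("USA", "China", "Sudan"), 3),
    year = rep(c(1994, 1995, 1996), each = 3),
    gdp_pc = c(2929, 10576, 7123, 7665, 15991, 4092, 2127, 5831, 13325)
)

dim(countries)       # 9 3 (rows first, then columns)
nrow(countries)      # 9
ncol(countries)      # 3
dim(countries)[1]    # 9, the same value as nrow(countries)
```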


Use `head()` and `tail()` to look at the first and last few rows of a dataset, respectively. Obviously this is more useful when we have datasets with hundreds or thousands of observations you can't just look at. :) 

In [8]:
head(countries)

,country,year,gdp_pc
,<chr>,<dbl>,<dbl>
1,USA,1994,2929
2,China,1994,10576
3,Sudan,1994,7123
4,USA,1995,7665
5,China,1995,15991
6,Sudan,1995,4092


In [9]:
tail(countries)

,country,year,gdp_pc
,<chr>,<dbl>,<dbl>
4,USA,1995,7665
5,China,1995,15991
6,Sudan,1995,4092
7,USA,1996,2127
8,China,1996,5831
9,Sudan,1996,13325
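
Both functions show six rows by default, and both accept an `n` argument if you want a different number. A minimal sketch, again with the printed `gdp_pc` values hard-coded:

```r
countries <- data.frame(
    country = rep(c("USA", "China", "Sudan"), 3),
    year = rep(c(1994, 1995, 1996), each = 3),
    gdp_pc = c(2929, 10576, 7123, 7665, 15991, 4092, 2127, 5831, 13325)
)

# First three rows only
first_three <- head(countries, n = 3)
nrow(first_three)       # 3

# Last two rows only; the original row numbers are kept
last_two <- tail(countries, n = 2)
rownames(last_two)      # "8" "9"
```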


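Other useful commands to get to know variables better include `summary()`,
`table()`, and `prop.table()`. We'll start with `summary()` here and come back to the other two when we look at analyzing columns. 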

In [10]:
# Get some summary information about each variable
summary(countries)

   country               year          gdp_pc     
 Length:9           Min.   :1994   Min.   : 2127  
 Class :character   1st Qu.:1994   1st Qu.: 4092  
 Mode  :character   Median :1995   Median : 7123  
                    Mean   :1995   Mean   : 7740  
                    3rd Qu.:1996   3rd Qu.:10576  
                    Max.   :1996   Max.   :15991  

## Subsetting Dataframes

Subsetting dataframes follows the same logic we saw with matrices: we use square brackets, and the first entry in the square brackets subsets rows, and the second entry subsets columns. 

For example, we could easily get the second row and the second and third columns of our dataframe like this:

In [11]:
# Subset by index
countries[2, c(2, 3)]

,year,gdp_pc
,<dbl>,<dbl>
2,1994,10576


In general, though, we won't usually subset by index when working with dataframes. That's because datasets have a tendency to change over time -- your coauthors may give you an updated version of the data you're working with, or you change the way you sort the data early in your code. Instead, we'll mostly subset rows using logicals, and subset columns by name, like this:

In [13]:
countries[countries[, "gdp_pc"] < 10000, "year"]
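
It can help to split that one-liner into its two steps. The sketch below uses the `gdp_pc` values printed earlier, hard-coded so the results are repeatable; the name `keep` is just a label chosen for this example:

```r
countries <- data.frame(
    country = rep(c("USA", "China", "Sudan"), 3),
    year = rep(c(1994, 1995, 1996), each = 3),
    gdp_pc = c(2929, 10576, 7123, 7665, 15991, 4092, 2127, 5831, 13325)
)

# Step 1: the comparison produces one TRUE/FALSE per row
keep <- countries[, "gdp_pc"] < 10000
keep
# TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE

# Step 2: rows where keep is TRUE are kept, and we ask for the "year" column
low_gdp_years <- countries[keep, "year"]
low_gdp_years
# 1994 1994 1995 1995 1996 1996
```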

## Column Operations

Columns in most datasets correspond to variables, so you will often want to do things like take the average of a single column (e.g. the average of a single variable), or edit the values in a column. 

In fact, accessing a single column is so common with dataframes that there are two exactly-identical ways to get a single column -- the one we're used to, and a new shorthand:

In [14]:
# What we're used to from matrices
countries[, "gdp_pc"]

In [15]:
# The shortcut for a single dataframe column
countries$gdp_pc

These are exactly equivalent for single columns! But note that you can't always use this trick -- for example, it doesn't work for trying to get several columns from a dataframe. Most of the time, though, it's a very convenient shorthand for single-column manipulations. 
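
A quick way to convince yourself of both points is the sketch below (with the printed `gdp_pc` values hard-coded). `identical()` is used here simply to compare the two results:

```r
countries <- data.frame(
    country = rep(c("USA", "China", "Sudan"), 3),
    year = rep(c(1994, 1995, 1996), each = 3),
    gdp_pc = c(2929, 10576, 7123, 7665, 15991, 4092, 2127, 5831, 13325)
)

# Both ways of asking for one column give back the same vector
identical(countries$gdp_pc, countries[, "gdp_pc"])    # TRUE

# `$` takes a single column name; for several columns,
# pass a character vector inside the square brackets
two_cols <- countries[, c("country", "gdp_pc")]
class(two_cols)     # "data.frame"
names(two_cols)     # "country" "gdp_pc"
```

Note that asking for several columns gives back a (smaller) dataframe rather than a vector.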

### Subsetting Columns

Subsetting columns can be done in a couple ways. The easiest is to just list the columns one wishes to keep:

In [18]:
countries[, c("gdp_pc", "year")]

gdp_pc,year
<dbl>,<dbl>
2929,1994
10576,1994
7123,1994
7665,1995
15991,1995
4092,1995
2127,1996
5831,1996
13325,1996


But in big dataframes, we sometimes have lots of columns, and don't want to list all the columns *except* the one we want to drop. For that there are two solutions. The first is like this:

In [20]:
# Drop columns gdp_pc and year
countries[, !(names(countries) %in% c("gdp_pc", "year"))]


This is a little weird looking, so it's worth breaking down. 

First, `names(countries)` returns all the column names of `countries`. 

In [21]:
names(countries)

Then `names(countries) %in% c("gdp_pc", "year")` returns a logical vector the length of the column names of `countries` that's `TRUE` if the name is in the list, and `FALSE` otherwise:

In [22]:
names(countries) %in% c("gdp_pc", "year")

Then finally the `!` before that expression is the logical `NOT`, meaning that it makes all `TRUE` values into `FALSE` and vice-versa. So in the end `!(names(countries) %in% c("gdp_pc", "year"))` returns a logical vector that is `TRUE` for all values *not* in the list, and `FALSE` for those in the list. That is then interpreted as a logical subsetting vector, so all columns not in the list are kept, and those in the list are dropped. 

I know, it's kinda a lot... but it is a good example of how you can compose simple building blocks to do complicated things in R!
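
Here are those building blocks run one at a time, as a sketch with the printed `gdp_pc` values hard-coded. The last two lines go slightly beyond the text above: when only one column survives, R simplifies the result to a plain vector unless you add `drop = FALSE`:

```r
countries <- data.frame(
    country = rep(c("USA", "China", "Sudan"), 3),
    year = rep(c(1994, 1995, 1996), each = 3),
    gdp_pc = c(2929, 10576, 7123, 7665, 15991, 4092, 2127, 5831, 13325)
)

names(countries)
# "country" "year" "gdp_pc"

in_list <- names(countries) %in% c("gdp_pc", "year")
in_list
# FALSE TRUE TRUE

keep <- !in_list
keep
# TRUE FALSE FALSE

# Only one column is left, so the result is simplified to a vector
class(countries[, keep])                   # "character"

# drop = FALSE keeps the result as a one-column dataframe
class(countries[, keep, drop = FALSE])     # "data.frame"
```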

Finally, if you're dropping a single column, you can also assign the value of `NULL` to the column:

In [24]:
countries$gdp_pc <- NULL
countries

country,year
<chr>,<dbl>
USA,1994
China,1994
Sudan,1994
USA,1995
China,1995
Sudan,1995
USA,1996
China,1996
Sudan,1996


Which... well, just works! :)

### Modifying Columns

As with matrices, we can use subsetting to make modifications to columns. For example, suppose, as with our matrix version, we wanted to multiply GDP per capita by 1.02 to adjust for inflation. We could either do:

In [25]:
# re-create with original gdp_pc 
countries <- data.frame(country, year, gdp_pc)
countries

country,year,gdp_pc
<chr>,<dbl>,<dbl>
USA,1994,2929
China,1994,10576
Sudan,1994,7123
USA,1995,7665
China,1995,15991
Sudan,1995,4092
USA,1996,2127
China,1996,5831
Sudan,1996,13325


In [26]:
countries[, "gdp_pc"] <- countries[,"gdp_pc"] * 1.02

Or

In [17]:
countries$gdp_pc <- countries$gdp_pc * 1.02
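
One thing to watch out for: these are two ways of doing the *same* edit, so you only want to run one of them. The sketch below (with the printed `gdp_pc` values hard-coded, and the copies `a` and `b` invented for the comparison) shows that each gives the same result, and what happens if the adjustment is applied twice:

```r
countries <- data.frame(
    country = rep(c("USA", "China", "Sudan"), 3),
    year = rep(c(1994, 1995, 1996), each = 3),
    gdp_pc = c(2929, 10576, 7123, 7665, 15991, 4092, 2127, 5831, 13325)
)

# Apply each version to its own copy of the data
a <- countries
a[, "gdp_pc"] <- a[, "gdp_pc"] * 1.02

b <- countries
b$gdp_pc <- b$gdp_pc * 1.02

identical(a$gdp_pc, b$gdp_pc)    # TRUE
a$gdp_pc[1]                      # 2987.58

# Running both versions on the same dataframe adjusts the values twice
twice <- countries$gdp_pc * 1.02 * 1.02
round(twice[1], 2)               # 3047.33
```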

### Creating New Columns 

If we wanted to keep both the original `gdp_pc` column and add a *new* column with the inflation adjusted values, we can do so just by using a *new* column name when we assign our values back into the dataframe:

In [27]:
# re-create with original gdp_pc 
countries <- data.frame(country, year, gdp_pc)

In [28]:
# Add new column
countries$adjusted_gdp_pc <- countries$gdp_pc * 1.02
countries

country,year,gdp_pc,adjusted_gdp_pc
<chr>,<dbl>,<dbl>,<dbl>
USA,1994,2929,2987.58
China,1994,10576,10787.52
Sudan,1994,7123,7265.46
USA,1995,7665,7818.3
China,1995,15991,16310.82
Sudan,1995,4092,4173.84
USA,1996,2127,2169.54
China,1996,5831,5947.62
Sudan,1996,13325,13591.5


### Analyzing Columns 

Finally, as long as we're talking about columns, it's worth emphasizing that once you pull a column out of your dataframe, you can analyze it like any other vector (since it is just a vector!). For example: 

In [29]:
mean(countries$gdp_pc)

But two summary functions are worth noting here: `table()`, to get the number of observations that have a given value in a vector, and the combination `prop.table(table())`, to get the share of observations with a given value in a vector:

In [32]:
# Number of observations by country 
table(countries$country)


China Sudan   USA 
    3     3     3 

In [33]:
# Proportion of observations by country 
prop.table(table(countries$country))


    China     Sudan       USA 
0.3333333 0.3333333 0.3333333 
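
Because columns are just vectors, you can also combine these tools with the logical subsetting we've been using. A minimal sketch, with the printed `gdp_pc` values hard-coded (the names `usa_mean` and `below` are just labels for this example):

```r
countries <- data.frame(
    country = rep(c("USA", "China", "Sudan"), 3),
    year = rep(c(1994, 1995, 1996), each = 3),
    gdp_pc = c(2929, 10576, 7123, 7665, 15991, 4092, 2127, 5831, 13325)
)

# Average GDP per capita for the USA rows only
usa_mean <- mean(countries$gdp_pc[countries$country == "USA"])
usa_mean
# about 4240.33

# table() also works on a logical vector:
# how many observations are below 10000, and how many are not?
below <- table(countries$gdp_pc < 10000)
below
# FALSE  TRUE
#     3     6

# And the same thing as shares of all observations
prop.table(below)
```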

## Row Operations

In most datasets you work with, each row will correspond to a single observation in your data. Given that, we often manipulate rows as a way of manipulating the sample in our analyses. 

### Subsetting

As we discussed before, you usually don't want to explicitly subset by index with rows in case at some point in the future your data gets sorted differently (so observations end up with different row numbers), or the data you're working with gets updated or corrected.

Subsetting with logicals, thankfully, works just the way it did for matrices, except that we can access column names with the `$` notation:

In [34]:
countries[countries$year == 1995 & countries$country == "USA", ]

,country,year,gdp_pc,adjusted_gdp_pc
,<chr>,<dbl>,<dbl>,<dbl>
4,USA,1995,7665,7818.3
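
The same pattern works with other logical tools. As a sketch (with the printed `gdp_pc` values hard-coded), here is `|` for "or" and `%in%` for matching against several values:

```r
countries <- data.frame(
    country = rep(c("USA", "China", "Sudan"), 3),
    year = rep(c(1994, 1995, 1996), each = 3),
    gdp_pc = c(2929, 10576, 7123, 7665, 15991, 4092, 2127, 5831, 13325)
)

# `|` means OR: rows from 1994, or rows for Sudan (or both)
either <- countries[countries$year == 1994 | countries$country == "Sudan", ]
rownames(either)
# "1" "2" "3" "6" "9"

# %in% keeps rows whose country is any of the listed values
usa_china <- countries[countries$country %in% c("USA", "China"), ]
nrow(usa_china)
# 6
```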


### Sorting Dataframes

Often, we'll want to sort the rows of our dataframe by the values in one of our columns. To do so, we use the `order` command:

In [39]:
# Sort by GDP
countries[order(countries$adjusted_gdp_pc),]

,country,year,gdp_pc,adjusted_gdp_pc
,<chr>,<dbl>,<dbl>,<dbl>
7,USA,1996,2127,2169.54
1,USA,1994,2929,2987.58
6,Sudan,1995,4092,4173.84
8,China,1996,5831,5947.62
3,Sudan,1994,7123,7265.46
4,USA,1995,7665,7818.3
2,China,1994,10576,10787.52
9,Sudan,1996,13325,13591.5
5,China,1995,15991,16310.82


What's happening? `order()` returns a vector with the indices of the rows of the dataset in sorted order:

In [40]:
order(countries$adjusted_gdp_pc)

And then, because it's a vector of indices being passed in the first position of our square brackets, we get all the rows of `countries` "subset" by index (though obviously it's not really a subset, since all row indices appear in the vector -- just a re-ordering)!
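
If `order()` still feels mysterious, it helps to try it on a tiny vector first. In this sketch `x` is a made-up vector, and the dataframe uses the printed `gdp_pc` values hard-coded (sorting on `gdp_pc` gives the same row order as sorting on `adjusted_gdp_pc`, since one is just the other times 1.02):

```r
x <- c(30, 10, 20)

# order() reports *positions*: the smallest value sits in position 2,
# the next smallest in position 3, and the largest in position 1
order(x)        # 2 3 1

# Using those positions as an index puts the values in sorted order
x[order(x)]     # 10 20 30

countries <- data.frame(
    country = rep(c("USA", "China", "Sudan"), 3),
    year = rep(c(1994, 1995, 1996), each = 3),
    gdp_pc = c(2929, 10576, 7123, 7665, 15991, 4092, 2127, 5831, 13325)
)

# The same idea on a dataframe column: row 7 has the lowest GDP, then row 1, ...
row_order <- order(countries$gdp_pc)
row_order       # 7 1 6 8 3 4 2 9 5
```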

We can also sort by multiple columns:

In [None]:
countries[order(countries$year, countries$country),]

,country,year,gdp_pc,adjusted_gdp_pc
,<chr>,<dbl>,<dbl>,<dbl>
2,China,1994,10576,10787.52
3,Sudan,1994,7123,7265.46
1,USA,1994,2929,2987.58
5,China,1995,15991,16310.82
6,Sudan,1995,4092,4173.84
4,USA,1995,7665,7818.3
8,China,1996,5831,5947.62
9,Sudan,1996,13325,13591.5
7,USA,1996,2127,2169.54


And we can put `-` in front of a numeric variable to sort it in descending order rather than ascending order:

In [41]:
countries[order(-countries$adjusted_gdp_pc), ]

,country,year,gdp_pc,adjusted_gdp_pc
,<chr>,<dbl>,<dbl>,<dbl>
5,China,1995,15991,16310.82
9,Sudan,1996,13325,13591.5
2,China,1994,10576,10787.52
4,USA,1995,7665,7818.3
3,Sudan,1994,7123,7265.46
8,China,1996,5831,5947.62
6,Sudan,1995,4092,4173.84
1,USA,1994,2929,2987.58
7,USA,1996,2127,2169.54
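
These pieces can be mixed. The sketch below (with the printed `gdp_pc` values hard-coded) sorts by year in ascending order and, within each year, by GDP in descending order. It also shows one option for text columns, where the `-` trick does not apply: the `decreasing = TRUE` argument of `order()`:

```r
countries <- data.frame(
    country = rep(c("USA", "China", "Sudan"), 3),
    year = rep(c(1994, 1995, 1996), each = 3),
    gdp_pc = c(2929, 10576, 7123, 7665, 15991, 4092, 2127, 5831, 13325)
)

# Year ascending, then highest GDP first within each year
by_year <- countries[order(countries$year, -countries$gdp_pc), ]
rownames(by_year)
# "2" "3" "1" "5" "4" "6" "9" "8" "7"

# Character column in reverse alphabetical order
by_country <- countries[order(countries$country, decreasing = TRUE), ]
by_country$country[1:3]
# "USA" "USA" "USA"
```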


## Recap

Phew. OK, I know this reading covered *a lot*, so here's a quick recap and a summary table for reference. 

- Dataframes really are just like matrices. The main difference is that each column can be a different type, and dataframes always have column names.
- We subset single dataframe columns using `$`, but that's just a shorthand for the syntax we learned before (`df[, "colname"]`).
- The columns of a dataframe are just vectors.
- We usually subset dataframes with logicals (for rows) or by name (columns) for safety. 

And now a reference table, written with a toy dataset called `df` with columns `col1`, `col2`, and `col3` in mind: 

**Looking at your dataframe:**

- Number of rows: `nrow(df)`
- Number of columns: `ncol(df)`
- First six rows: `head(df)`
- Last six rows: `tail(df)`
- Quick summary of all data: `summary(df)`

**Row Operations**

- Subset rows by logical: `df[df$col1 < 42, ]` or `df[df[, "col1"] < 42, ]`
- Random sample of N rows: `df[sample(nrow(df), N), ]`
- Sort rows (ascending, one column): `df[order(df$col1), ]`
- Sort rows (descending, one column): `df[order(-df$col1), ]`
- Sort rows (multiple columns): `df[order(df$col1, df$col2), ]`

**Column Operations**

- Subset one column by name: `df$col1` or `df[, "col1"]`
- Subset multiple columns by name: `df[ , c("col1", "col2")]`
- Drop one column: `df$col1 <- NULL`
- Drop set of columns: `df[ , !(names(df) %in% c("col1", "col2"))]`
- Editing a single column: `df$col1 <- df$col1 * 42` or `df[, "col1"] <- df[, "col1"] * 42`
- Create new column: `df$newcol <- df$col1 * 42` or `df[, "newcol"] <- df[, "col1"] * 42`

**Learn About a Column:**

- Tabulate number of observations of each value: `table(df$col1)`
- Share of observations of each value: `prop.table(table(df$col1))`
- Quick summary of one column: `summary(df$col1)`
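
If you'd like something to experiment with, here is one possible toy `df` to try the commands above on. The values are made up for illustration, and the seed is arbitrary:

```r
# A small made-up dataset with the column names used in the table above
df <- data.frame(
    col1 = c(50, 10, 40, 20),
    col2 = c("a", "b", "a", "b"),
    col3 = c(TRUE, FALSE, TRUE, FALSE)
)

# Row operations
small <- df[df$col1 < 42, ]           # keeps rows 2, 3 and 4
sorted <- df[order(-df$col1), ]       # rows in the order 1, 3, 4, 2
set.seed(1)                           # so the random sample is repeatable
sampled <- df[sample(nrow(df), 2), ]  # 2 randomly chosen rows

# Column operations
df$newcol <- df$col1 * 42             # create a new column
df$col3 <- NULL                       # drop a column
names(df)                             # "col1" "col2" "newcol"

# Learn about a column
counts <- table(df$col2)              # a appears 2 times, b appears 2 times
prop.table(counts)                    # 0.5 for each
```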
