# Manipulating Dataframes

In our previous lessons, we learned how to create dataframes from vectors and how to load them from files. In this lesson, we will learn how to work with a dataframe once we have it loaded up!

Being able to quickly modify datasets -- often referred to as "data wrangling" -- is critical to being a social scientist. Indeed, most social scientists and data scientists spend a huge proportion of their time cleaning and organizing their data ([about 80 percent in surveys](https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/?sh=44a70ffb6f63)). So this is probably one of the most important readings of the course!

Be aware that there will be a **lot** of syntax in this reading. However, the goal of the reading is **not** to have you memorize all the syntax, but rather to understand the *logic* of how dataframes work. To help with that, I've provided a recap section at the end of the reading with examples of all the commands we cover in one place. 

So as you read, try to focus on the logic of how dataframes work, not the exact syntax. Syntax is something you can always look up later, so long as you understand enough about the logic of what's going on to realize what you need to look up!

To begin, let's start by recreating the dataframe we had in our last exercise as an example to work with:

In [1]:
country <- rep(c("USA", "China", "Sudan"), 3)
year <- c(1994, 1994, 1994, 1995, 1995, 1995, 1996, 1996, 1996)
gdp_pc <- round(runif(9, 1000, 20000))

countries <- data.frame(country, year, gdp_pc)
countries

country,year,gdp_pc
<chr>,<dbl>,<dbl>
USA,1994,3521
China,1994,12134
Sudan,1994,19386
USA,1995,9927
China,1995,4042
Sudan,1995,12528
USA,1996,13416
China,1996,2660
Sudan,1996,6112


## Dataframes Are Like Matrices-Plus

Dataframes, like matrices, are just two-dimensional grids of data. And so everything we learned about matrices also applies to dataframes (hooray!).

For example, if we want to get the GDP of China in 1994 from our dataframe, we can subset our dataframe using square brackets and logical / name vectors, just like named matrices:

In [2]:
countries[countries[, "country"] == "China" & countries[, "year"] == 1994, "gdp_pc"]

So remember that: if you could do it with a matrix, you can do it with a dataframe!
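
To make the matrix-style logic concrete, here is a minimal sketch that rebuilds the example with fixed values (copied from the table printed above, since `runif()` gives different numbers on each run) and pulls out a single cell using a logical row vector and a column name. The names `is_china_1994` and `china_gdp_1994` are illustrative helpers, not part of the lesson's code.

```r
# Rebuild the example dataframe with fixed values copied from the output above
country <- rep(c("USA", "China", "Sudan"), 3)
year <- c(1994, 1994, 1994, 1995, 1995, 1995, 1996, 1996, 1996)
gdp_pc <- c(3521, 12134, 19386, 9927, 4042, 12528, 13416, 2660, 6112)
countries <- data.frame(country, year, gdp_pc)

# Logical vector: TRUE only for the row where both conditions hold
is_china_1994 <- countries[, "country"] == "China" & countries[, "year"] == 1994

# Rows are chosen by the logical vector, the column is chosen by name
china_gdp_1994 <- countries[is_china_1994, "gdp_pc"]
china_gdp_1994  # 12134
```

Storing the logical vector in its own variable is optional, but it makes it easy to inspect which rows a condition actually selects before using it.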

As we'll see, though, dataframes *do* have a few augmentations designed to make the lives of R users a little easier, and so can kinda be thought of as matrices-plus.

## Column Operations

Unlike matrices, which *could* have column names, dataframe columns **always** have names. As a result, we will basically always address columns using their names, for reasons we'll discuss in more detail below.

It is also convention that columns in most datasets correspond to variables, so you'll often want to do things like take the average of a single column (i.e., the average of a single variable), or edit the values in a single column.

In fact, accessing a single column is so common with dataframes that there are **three** ways to do it -- the matrix syntax we're used to, a new `$` shorthand, and square brackets with no comma:

In [3]:
# What we're used to from matrices
countries[, "gdp_pc"]


In [4]:
# The shortcut for a single dataframe column
countries$gdp_pc

In [5]:
# And if you don't have a comma, R assumes you're accessing columns
countries["gdp_pc"]

gdp_pc
<dbl>
3521
12134
19386
9927
4042
12528
13416
2660
6112


The first two are exactly equivalent: both return the column as a vector. The third form returns a one-column dataframe rather than a vector. Note that you can't always use the `$` shortcut -- for example, it doesn't work for getting several columns from a dataframe. Most of the time, though, it's a very convenient shorthand for single-column manipulations.
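
One way to check how these three forms relate is to compare what each one returns. The sketch below uses a three-row toy version of `countries` (values copied from the example above); the variable names `by_bracket`, `by_dollar`, and `no_comma` are just illustrative.

```r
# Toy dataframe with values copied from the example above
countries <- data.frame(
  country = c("USA", "China", "Sudan"),
  year = c(1994, 1994, 1994),
  gdp_pc = c(3521, 12134, 19386)
)

by_bracket <- countries[, "gdp_pc"]  # matrix-style: returns a vector
by_dollar  <- countries$gdp_pc       # `$` shorthand: returns the same vector
no_comma   <- countries["gdp_pc"]    # no comma: returns a one-column dataframe

identical(by_bracket, by_dollar)  # TRUE
is.data.frame(by_dollar)          # FALSE
is.data.frame(no_comma)           # TRUE
```

In practice this matters when you pass the result to a function like `mean()`, which expects a vector rather than a dataframe.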

### Modifying Columns

As with matrices, we can use subsetting to make modifications to columns. For example, suppose, as with our matrix version, we wanted to multiply GDP per capita by 1.02 to adjust for inflation. We could either do:

In [6]:
# re-create with original gdp_pc 
countries <- data.frame(country, year, gdp_pc)
countries

country,year,gdp_pc
<chr>,<dbl>,<dbl>
USA,1994,3521
China,1994,12134
Sudan,1994,19386
USA,1995,9927
China,1995,4042
Sudan,1995,12528
USA,1996,13416
China,1996,2660
Sudan,1996,6112


In [7]:
countries[, "gdp_pc"] <- countries[, "gdp_pc"] * 1.02

Or

In [8]:
countries$gdp_pc <- countries$gdp_pc * 1.02

### Creating New Columns 

If we wanted to keep both the original `gdp_pc` column and add a *new* column with the inflation-adjusted values, we can do so just by using a *new* column name when we assign our values back into the dataframe:

In [9]:
# re-create with original gdp_pc 
countries <- data.frame(country, year, gdp_pc)

In [10]:
# Add new column
countries$adjusted_gdp_pc <- countries$gdp_pc * 1.02
countries

country,year,gdp_pc,adjusted_gdp_pc
<chr>,<dbl>,<dbl>,<dbl>
USA,1994,3521,3591.42
China,1994,12134,12376.68
Sudan,1994,19386,19773.72
USA,1995,9927,10125.54
China,1995,4042,4122.84
Sudan,1995,12528,12778.56
USA,1996,13416,13684.32
China,1996,2660,2713.2
Sudan,1996,6112,6234.24


### Analyzing Columns 

Finally, as long as we're talking about columns, it's worth emphasizing that once you pull a column out of your dataframe, you can analyze it like any other vector (since it is just a vector!). For example: 

In [11]:
mean(countries$gdp_pc)

But two summary functions are worth noting here: `table()`, to get the number of observations that have a given value in a vector, and the combination `prop.table(table())`, to get the share of observations with a given value in a vector:

In [12]:
# Number of observations by country 
table(countries$country)


China Sudan   USA 
    3     3     3 

In [13]:
# Proportion of observations by country 
prop.table(table(countries$country))


    China     Sudan       USA 
0.3333333 0.3333333 0.3333333 

### Dropping Columns

Dropping columns can be done in a couple ways. The easiest is to just list the columns one wishes to keep:

In [14]:
countries[, c("gdp_pc", "year")]

gdp_pc,year
<dbl>,<dbl>
3521,1994
12134,1994
19386,1994
9927,1995
4042,1995
12528,1995
13416,1996
2660,1996
6112,1996


But in big dataframes, we sometimes have lots of columns, and don't want to list all the columns *except* the one we want to drop. For that there are two solutions. The first is like this:

In [15]:
# Drop columns gdp_pc and year
countries[, !(names(countries) %in% c("gdp_pc", "year"))]


country,adjusted_gdp_pc
<chr>,<dbl>
USA,3591.42
China,12376.68
Sudan,19773.72
USA,10125.54
China,4122.84
Sudan,12778.56
USA,13684.32
China,2713.2
Sudan,6234.24


This is a little weird looking, so it's worth breaking down. 

First, `names(countries)` returns all the column names of `countries`. 

In [16]:
names(countries)

Then `names(countries) %in% c("gdp_pc", "year")` returns a logical vector the length of the column names of `countries` that's `TRUE` if the name is in the list, and `FALSE` otherwise:

In [17]:
names(countries) %in% c("gdp_pc", "year")

Then finally the `!` before that expression is the logical `NOT`, meaning that it makes all `TRUE` values into `FALSE` and vice-versa. So in the end `!(names(countries) %in% c("gdp_pc", "year"))` returns a logical vector that is `TRUE` for all values *not* in the list, and `FALSE` for those in the list. That is then interpreted as a logical subsetting vector, so all columns not in the list are kept, and those in the list are dropped.

I know, it's kinda a lot... but it is a good example of how you can compose simple building blocks to do complicated things in R!
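
The building blocks described above can be traced one at a time. This is a minimal sketch on a three-row toy version of `countries` (values copied from the example); the intermediate names `col_names`, `in_drop_list`, `keep`, and `kept` are hypothetical and only there to make each step visible.

```r
# Toy dataframe with the same four columns as the example above
countries <- data.frame(
  country = c("USA", "China", "Sudan"),
  year = c(1994, 1994, 1994),
  gdp_pc = c(3521, 12134, 19386),
  adjusted_gdp_pc = c(3591.42, 12376.68, 19773.72)
)

# Step 1: all column names
col_names <- names(countries)
col_names     # "country" "year" "gdp_pc" "adjusted_gdp_pc"

# Step 2: TRUE where the name is in the drop list
in_drop_list <- col_names %in% c("gdp_pc", "year")
in_drop_list  # FALSE TRUE TRUE FALSE

# Step 3: flip it, so TRUE marks the columns to keep
keep <- !in_drop_list
keep          # TRUE FALSE FALSE TRUE

# Step 4: use the logical vector to subset columns
kept <- countries[, keep]
names(kept)   # "country" "adjusted_gdp_pc"
```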

Finally, if you're dropping a single column, you can also assign the value of `NULL` to the column:

In [18]:
countries$gdp_pc <- NULL
countries

country,year,adjusted_gdp_pc
<chr>,<dbl>,<dbl>
USA,1994,3591.42
China,1994,12376.68
Sudan,1994,19773.72
USA,1995,10125.54
China,1995,4122.84
Sudan,1995,12778.56
USA,1996,13684.32
China,1996,2713.2
Sudan,1996,6234.24


Which... well, just works! :)

## Row Operations

In most datasets you work with, each row will correspond to a single observation in your data. Given that, we often manipulate rows as a way of manipulating the sample in our analyses. 

### Subsetting

Subsetting with logicals is exactly the same with dataframes as it was with matrices, except that we can access column names with the `$` notation:

In [19]:
countries[countries$year == 1995 & countries$country == "USA", ]

,country,year,adjusted_gdp_pc
,<chr>,<dbl>,<dbl>
4,USA,1995,10125.54


### Sorting Dataframes

Often, we'll want to sort the rows of our dataframe by the values in one of our columns. To do so, we use the `order` command:

In [20]:
# Sort by GDP
countries[order(countries$adjusted_gdp_pc),]

,country,year,adjusted_gdp_pc
,<chr>,<dbl>,<dbl>
8,China,1996,2713.2
1,USA,1994,3591.42
5,China,1995,4122.84
9,Sudan,1996,6234.24
4,USA,1995,10125.54
2,China,1994,12376.68
6,Sudan,1995,12778.56
7,USA,1996,13684.32
3,Sudan,1994,19773.72


What's happening? `order()` returns a vector with the indices of the rows of the dataset in sorted order:

In [21]:
order(countries$adjusted_gdp_pc)

And then, because it's a vector of indices being passed in the first position of our square brackets, we get all the rows of `countries` "subset" by index (though obviously it's not really a subset, since all row indices appear in the vector -- just a re-ordering)!
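
To see what `order()` hands back, it can help to try it on a short vector first. This is a small sketch using three of the adjusted GDP values from the example; the names `gdp` and `idx` are illustrative.

```r
# Three values, deliberately not in sorted order
gdp <- c(12376.68, 3591.42, 19773.72)

# order() returns positions, not values:
# the smallest value is in position 2, then position 1, then position 3
idx <- order(gdp)
idx       # 2 1 3

# Using those positions as an index re-orders the vector
gdp[idx]  # 3591.42 12376.68 19773.72
```

Passing the same kind of index vector in the row position of `countries[ , ]` re-orders the rows of the dataframe in exactly the same way.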

We can also sort by multiple columns:

In [22]:
countries[order(countries$year, countries$country),]

,country,year,adjusted_gdp_pc
,<chr>,<dbl>,<dbl>
2,China,1994,12376.68
3,Sudan,1994,19773.72
1,USA,1994,3591.42
5,China,1995,4122.84
6,Sudan,1995,12778.56
4,USA,1995,10125.54
8,China,1996,2713.2
9,Sudan,1996,6234.24
7,USA,1996,13684.32


And we can use `-` to sort any numeric variable in descending order rather than ascending order:

In [23]:
countries[order(-countries$adjusted_gdp_pc), ]

,country,year,adjusted_gdp_pc
,<chr>,<dbl>,<dbl>
3,Sudan,1994,19773.72
7,USA,1996,13684.32
6,Sudan,1995,12778.56
2,China,1994,12376.68
4,USA,1995,10125.54
9,Sudan,1996,6234.24
5,China,1995,4122.84
1,USA,1994,3591.42
8,China,1996,2713.2
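
One caveat: the `-` trick relies on the column being numeric, so it can't be applied directly to a character column like `country`. As a hedged sketch of one alternative (not something covered elsewhere in this reading), `order()` also accepts a `decreasing` argument. The toy data and the names `by_gdp_desc` and `by_country_desc` are illustrative.

```r
# Toy dataframe with values copied from the example above
countries <- data.frame(
  country = c("USA", "China", "Sudan"),
  year = c(1994, 1994, 1994),
  adjusted_gdp_pc = c(3591.42, 12376.68, 19773.72)
)

# Numeric column: negate the values to sort from largest to smallest
by_gdp_desc <- countries[order(-countries$adjusted_gdp_pc), ]
by_gdp_desc$country      # "Sudan" "China" "USA"

# Character column: negation isn't defined for text, so ask order() directly
by_country_desc <- countries[order(countries$country, decreasing = TRUE), ]
by_country_desc$country  # "USA" "Sudan" "China"
```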


## Avoiding Subsetting by Index

As you've seen, we almost always access columns by name rather than by using their index numbers. That's because when working with real data, there's always a possibility that the order of columns gets jumbled up -- maybe you get an updated version of the data set you're working with that has the columns in different orders, or maybe in a large research project one of your collaborators has modified the order of columns in some of the code that runs before your code.

In these situations, trying to extract a column using its index may give you the wrong answer, while pulling out a column by name will ensure you're always getting the variable that you intended!

The same logic also applies to subsetting by rows. While subsetting by row index *works*, we generally avoid using indices for the same reason we avoid subsetting columns by index -- if the order of our data changes (say, it gets sorted unexpectedly), we can't predict how our index subsets will change! That's why in nearly all the examples above we subset with a logical vector.

Obviously there are exceptions to this rule -- `order()` and `sample()` are both implicitly subsetting by index. But those functions generate the indices they use from the rows immediately before they use them, so there is no opportunity for the order of rows to change between when those indices are generated and when they are used.
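
The fragility of index-based subsetting is easy to demonstrate. In this sketch, `reordered` is a hypothetical stand-in for a dataset whose rows arrived in a different order than expected; the toy values are copied from the example above.

```r
# Toy dataframe with values copied from the example above
countries <- data.frame(
  country = c("USA", "China", "Sudan"),
  year = c(1994, 1994, 1994),
  gdp_pc = c(3521, 12134, 19386)
)

# The same data with rows in a different order (sorted by country name)
reordered <- countries[order(countries$country), ]

# Subsetting by index: the answer depends on row order
countries[1, "country"]  # "USA"
reordered[1, "country"]  # "China"

# Subsetting by logical: the answer is the same either way
usa_original  <- countries[countries$country == "USA", "gdp_pc"]
usa_reordered <- reordered[reordered$country == "USA", "gdp_pc"]
identical(usa_original, usa_reordered)  # TRUE

# sample() builds its indices from the current number of rows
# and they are used immediately, so they can't go stale
two_rows <- countries[sample(nrow(countries), 2), ]
nrow(two_rows)  # 2
```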

## Recap

Phew. OK, I know this reading covered *a lot*, so here's a quick recap and a summary table for reference. 

- Dataframes really are just like matrices. The main difference is that each column can be a different type, and dataframes always have column names.
- We subset single dataframe columns using `$`, but that's just a shorthand for the syntax we learned before (`df[, "colname"]`).
- The columns of a dataframe are just vectors.
- We usually subset dataframes with logicals (for rows) or by name (columns) for safety. 

And now a reference table, written with a toy dataset called `df` with columns `col1`, `col2`, and `col3` in mind: 

**Looking at your dataframe:**

- Number of rows: `nrow(df)`
- Number of columns: `ncol(df)`
- First six rows: `head(df)`
- Last six rows: `tail(df)`
- Quick summary of all data: `summary(df)`

**Row Operations**

- Subset rows by logical: `df[df$col1 < 42, ]` or `df[df[, "col1"] < 42, ]`
- Random sample of N rows: `df[sample(nrow(df), N), ]`
- Sort rows (ascending, one column): `df[order(df$col1), ]`
- Sort rows (descending, one column): `df[order(-df$col1), ]`
- Sort rows (multiple columns): `df[order(df$col1, df$col2), ]`

**Column Operations**

- Subset one column by name: `df$col1` or `df[, "col1"]`
- Subset multiple columns by name: `df[ , c("col1", "col2")]`
- Drop one column: `df$col1 <- NULL`
- Drop set of columns: `df[ , !(names(df) %in% c("col1", "col2"))]`
- Editing a single column: `df$col1 <- df$col1 * 42` or `df[, "col1"] <- df[, "col1"] * 42`
- Create new column: `df$newcol <- df$col1 * 42` or `df[, "newcol"] <- df[, "col1"] * 42`

**Learn About a Column:**

- Tabulate number of observations of each value: `table(df$col1)`
- Share of observations of each value: `prop.table(table(df$col1))`
- Quick summary of one column: `summary(df$col1)`
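
To try several of these commands together, here is a minimal sketch on a made-up `df`. The values in `col1`, `col2`, and `col3` are invented purely for illustration, as are the names `small` and `sorted_desc`.

```r
# A made-up toy dataframe matching the column names used in the lists above
df <- data.frame(
  col1 = c(10, 50, 30, 70),
  col2 = c("a", "b", "a", "b"),
  col3 = c(1.5, 2.5, 3.5, 4.5)
)

nrow(df)  # 4
ncol(df)  # 3

# Row operations
small <- df[df$col1 < 42, ]           # keeps the rows where col1 is 10 and 30
sorted_desc <- df[order(-df$col1), ]  # col1 becomes 70, 50, 30, 10

# Column operations
df$newcol <- df$col1 * 42             # create a new column
df$col3 <- NULL                       # drop a column
names(df)                             # "col1" "col2" "newcol"

# Learn about a column
table(df$col2)                        # "a" appears 2 times, "b" appears 2 times
```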


## Exercises

And now it's time to put these new [skills into action with some exercises!](exercises/exercise_dataframe.ipynb)