# DataFrame Basics

In our previous lessons, we've talked about how vectors are often used to store lots of different observations of a given measurement (e.g. the answers of different survey respondents to a given question), and how matrices can be used to collect lots of different measurements in columns (e.g. each column can be answers to different questions). 

But matrices have one major limitation when it comes to social science workflows, which is that all the entries in a matrix have to be of the same type. In reality, however, we often have datasets with lots of *different* data types. For example, we might have numeric data on age and income, but character data for people's names, preferred political candidate, etc. Or we might have data on power plants across the US that includes numeric data on capacity, age, and pollution alongside character data on the power plant's fuel and the company that owns the plant. 

To deal with this kind of *heterogeneous tabular data*, we turn to the `data.frame`. 

Dataframes are basically just a collection of vectors, where each vector corresponds to a different column, and each column has a single type. Since they're two-dimensional data structures like matrices, we can actually subset them in the same way as matrices, but they are more flexible in terms of the types of data they can store. 
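
To make the contrast with matrices concrete, here is a minimal sketch. The vectors `name` and `age` are made up for illustration, and `stringsAsFactors = FALSE` is passed explicitly on the assumption that you may be running an R version older than 4.0, where strings would otherwise be converted to factors.

```r
# Hypothetical example vectors of two different types
name <- c("al", "bea", "carol")
age <- c(6, 7, 4)

# cbind() builds a matrix, which forces everything into a single type,
# so the numeric ages are converted to character strings
as_matrix <- cbind(name, age)
class(as_matrix[, "age"])

# data.frame() lets each column keep its own type
as_df <- data.frame(name, age, stringsAsFactors = FALSE)
class(as_df$age)
class(as_df$name)
```

In the matrix, `age` can no longer be used for arithmetic without converting it back, while in the dataframe it stays numeric.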

## Creating Dataframes

Let's start by learning how to create a dataset in R. This turns
out to be very simple --- just combine vectors using the `data.frame()`
command. 


In [3]:
# Create three vectors 
name <- c("al", "bea", "carol")
age <- c(6, 7, 4)
hair <- c("brown", "green", "blond")

# Create data frame 
children <- data.frame(name, age, hair)
children


name,age,hair
<chr>,<dbl>,<chr>
al,6,brown
bea,7,green
carol,4,blond


Or we can create our data frame by passing our vectors as named arguments:

In [4]:
# Create data frame 
children <- data.frame(
    name = c("al", "bea", "carol"),
    age = c(6, 7, 4),
    hair = c("brown", "green", "blond")
)
children

name,age,hair
<chr>,<dbl>,<chr>
al,6,brown
bea,7,green
carol,4,blond


Note that unlike matrices and vectors -- which *can* have names -- dataframe columns **always** have names, and you'll usually see them accessed by name:

In [5]:
class(children[, "hair"])
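
Since columns always have names, it can help to know how to inspect them. The sketch below re-creates the `children` dataframe from above and uses `names()` and `colnames()`; the renaming step and the new name `hair_color` are only an illustration.

```r
children <- data.frame(
    name = c("al", "bea", "carol"),
    age = c(6, 7, 4),
    hair = c("brown", "green", "blond")
)

# List the column names (the two calls give the same result for a dataframe)
names(children)
colnames(children)

# Column names can also be changed by assigning into names()
names(children)[names(children) == "hair"] <- "hair_color"
names(children)
```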

## Accessing Columns

While you can access the features of a dataframe using the exact same syntax as you would with matrices, dataframes also allow you to access a single column using `[name of dataframe]$[name of column]`. For example:

In [6]:
children$hair
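
The different ways of pulling out a column do not all return the same kind of object. A minimal sketch, re-creating `children` so it runs on its own (the variable names on the left are just for illustration):

```r
children <- data.frame(
    name = c("al", "bea", "carol"),
    age = c(6, 7, 4),
    hair = c("brown", "green", "blond")
)

by_dollar <- children$hair        # a plain vector
by_double <- children[["hair"]]   # the same vector
by_matrix <- children[, "hair"]   # matrix-style access, also a vector
by_single <- children["hair"]     # a dataframe with one column

identical(by_dollar, by_double)
identical(by_dollar, by_matrix)
class(by_single)
```

The distinction matters when a function expects a vector rather than a one-column dataframe.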

## Basic Dataset Commands

To better understand the proper structure of datasets, let's create a second data frame that has a more realistic data structure:

In [7]:
country <- rep(c("USA", "China", "Sudan"), 3)
year <- c(1994, 1994, 1994, 1995, 1995, 1995, 1996, 1996, 1996)
gdp_pc <- round(runif(9, 1000, 20000))

countries <- data.frame(country, year, gdp_pc)
countries

country,year,gdp_pc
<chr>,<dbl>,<dbl>
USA,1994,9196
China,1994,19360
Sudan,1994,9985
USA,1995,1722
China,1995,2201
Sudan,1995,2256
USA,1996,6119
China,1996,13931
Sudan,1996,15336


Here we can pretend that `gdp_pc` is a measure of a country's GDP per capita in a given year. 
 
(A quick aside: `rep()`, as you may recall, creates a vector that repeats the first input the number of times specified by the second input. `runif()` creates, in this case, 9 random values uniformly distributed between 1000 and 20000.)
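
As a small sketch of the two helper functions: `rep()` also accepts an `each` argument, which offers a shorter way to build a vector like `year`, and because `runif()` is random, your `gdp_pc` values will differ from the ones shown unless you fix the random seed with `set.seed()`. The seed value 42 below is an arbitrary choice, not something used in this lesson.

```r
# times repeats the whole vector; each repeats every element in turn
rep(c("USA", "China", "Sudan"), times = 3)
year_alt <- rep(1994:1996, each = 3)
year_alt

# Setting the same seed before each draw gives the same "random" values
set.seed(42)
first_draw <- round(runif(9, 1000, 20000))
set.seed(42)
second_draw <- round(runif(9, 1000, 20000))
identical(first_draw, second_draw)
```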


Now let's explore some common functions for getting to know your dataframe!

The first is `dim()`, which gives the dimensions of a data frame. The number of rows is listed first, the number of columns second.

In [8]:
dim(countries)

Use `nrow()` and `ncol()` to get the number of rows or columns separately.

In [9]:
nrow(countries)
ncol(countries)

## Snapshots

Use `head()` and `tail()` to look at the first and last few rows of a dataset, respectively. Obviously this is more useful when we have datasets with hundreds or thousands of observations you can't just look at. :) 

In [10]:
head(countries)

,country,year,gdp_pc
,<chr>,<dbl>,<dbl>
1,USA,1994,9196
2,China,1994,19360
3,Sudan,1994,9985
4,USA,1995,1722
5,China,1995,2201
6,Sudan,1995,2256


In [11]:
tail(countries)

,country,year,gdp_pc
,<chr>,<dbl>,<dbl>
4,USA,1995,1722
5,China,1995,2201
6,Sudan,1995,2256
7,USA,1996,6119
8,China,1996,13931
9,Sudan,1996,15336
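
Both functions show six rows by default, and both take an `n` argument to change that. A minimal sketch, using a cut-down version of `countries` without the random `gdp_pc` column so it runs on its own:

```r
countries <- data.frame(
    country = rep(c("USA", "China", "Sudan"), 3),
    year = rep(c(1994, 1995, 1996), each = 3)
)

# First two rows
first_two <- head(countries, n = 2)
first_two

# Last two rows
last_two <- tail(countries, n = 2)
last_two
```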


Other useful commands to get to know variables better include `summary()`,
`table()`, and `prop.table()`. 

In [12]:
# Get some summary information about each variable
summary(countries)

   country               year          gdp_pc     
 Length:9           Min.   :1994   Min.   : 1722  
 Class :character   1st Qu.:1994   1st Qu.: 2256  
 Mode  :character   Median :1995   Median : 9196  
                    Mean   :1995   Mean   : 8901  
                    3rd Qu.:1996   3rd Qu.:13931  
                    Max.   :1996   Max.   :19360  

In [13]:
# Number of observations by country 
table(countries$country)


China Sudan   USA 
    3     3     3 

In [14]:
# Proportion of observations by country 
prop.table(table(countries$country))


    China     Sudan       USA 
0.3333333 0.3333333 0.3333333 
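
`table()` can also cross-tabulate two variables at once, and `prop.table()` takes a `margin` argument to compute proportions within rows or columns. A minimal sketch, again with a cut-down `countries` so it is self-contained (the name `counts` is just for illustration):

```r
countries <- data.frame(
    country = rep(c("USA", "China", "Sudan"), 3),
    year = rep(c(1994, 1995, 1996), each = 3)
)

# Number of observations for each country-year combination
counts <- table(countries$country, countries$year)
counts

# Proportions within each row (margin = 1), i.e. within each country
row_props <- prop.table(counts, margin = 1)
row_props
```

With one observation per country per year, every cell of `counts` is 1 and every row of `row_props` sums to 1.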

## Subsetting

Subsetting dataframes works almost exactly like subsetting matrices:

In [15]:
# Subset by index
countries[2, c(2, 3)]

,year,gdp_pc
,<dbl>,<dbl>
2,1994,19360


In [16]:
# Access entire row 5
countries[5, ]


,country,year,gdp_pc
,<chr>,<dbl>,<dbl>
5,China,1995,2201


In general, though, accessing columns by index tends to be dangerous. If code earlier in your script changes the dataframe (say, by adding, dropping, or reordering columns), a column can end up in a different position, and your index will silently point at the wrong data. 

For this reason, it's usually better to access columns using column names. 
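
Here is a minimal sketch of how that can go wrong. The small dataframe and the reordering step are made up for illustration; the point is only that the same index returns different data after the columns move, while the name does not.

```r
countries <- data.frame(
    country = c("USA", "China"),
    year = c(1994, 1994),
    gdp_pc = c(9196, 19360)
)

# Column 2 is year
countries[, 2]

# Suppose an earlier step in the script now reorders the columns
reordered <- countries[, c("gdp_pc", "country", "year")]

# The same index now returns country instead of year, with no warning
reordered[, 2]

# Accessing by name still returns year
reordered[, "year"]
```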

In [17]:
# Access a column using column/variable name (two equivalent approaches)
countries$year
countries[, "year"]

**NOTE:** Normally you'd save these results!

The only reason I'm not saving these subsets is so you can see the output without extra lines of code.

Normally, you'd want to save your subsets -- by assigning them to a new variable -- so you can continue to analyze or manipulate them.

Similarly, to access rows, it's best to use a logical statement rather than row numbers (in case at some point you change the way your data is sorted):

In [18]:
countries[countries$year == 1995 & countries$country == "USA", ]

,country,year,gdp_pc
,<chr>,<dbl>,<dbl>
4,USA,1995,1722
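
It can help to look at the logical statement on its own before using it to subset. In the sketch below, the `gdp_pc` values are typed in from the output shown earlier so the result is reproducible, and the names `is_match`, `usa_1995`, and `usa_or_1995` are just for illustration.

```r
countries <- data.frame(
    country = rep(c("USA", "China", "Sudan"), 3),
    year = rep(c(1994, 1995, 1996), each = 3),
    gdp_pc = c(9196, 19360, 9985, 1722, 2201, 2256, 6119, 13931, 15336)
)

# The condition is a vector with one TRUE/FALSE per row
is_match <- countries$year == 1995 & countries$country == "USA"
is_match

# Rows where the condition is TRUE are kept; here we save the result
usa_1995 <- countries[is_match, ]
usa_1995

# Use | instead of & to keep rows that satisfy either condition
usa_or_1995 <- countries[countries$year == 1995 | countries$country == "USA", ]
nrow(usa_or_1995)
```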


## Modifying DataFrames

As with matrices, we can use subsetting to make modifications to our dataframes. For example, suppose, as with our matrix version, we wanted to multiply GDP per capita by 1.02 to adjust for inflation. We could either do:

In [19]:
countries[, "gdp_pc"] <- countries[,"gdp_pc"] * 1.02

Or

In [20]:
countries$gdp_pc <- countries$gdp_pc * 1.02

Or, if we wanted to keep both the original `gdp_pc` column and add a *new* column with the inflation-adjusted values, we can do so just by using a new column name when we assign our values back into the dataframe:

In [21]:
# re-create with original gdp_pc 
countries <- data.frame(country, year, gdp_pc)
countries

country,year,gdp_pc
<chr>,<dbl>,<dbl>
USA,1994,9196
China,1994,19360
Sudan,1994,9985
USA,1995,1722
China,1995,2201
Sudan,1995,2256
USA,1996,6119
China,1996,13931
Sudan,1996,15336


In [22]:
# Add new column
countries$adjusted_gdp_pc <- countries$gdp_pc * 1.02
countries

country,year,gdp_pc,adjusted_gdp_pc
<chr>,<dbl>,<dbl>,<dbl>
USA,1994,9196,9379.92
China,1994,19360,19747.2
Sudan,1994,9985,10184.7
USA,1995,1722,1756.44
China,1995,2201,2245.02
Sudan,1995,2256,2301.12
USA,1996,6119,6241.38
China,1996,13931,14209.62
Sudan,1996,15336,15642.72


And if we then wanted to *drop* the old column later, we could do so in one of two ways: we can subset for the other columns by name:

In [23]:
countries[, c("country", "year", "adjusted_gdp_pc")]

country,year,adjusted_gdp_pc
<chr>,<dbl>,<dbl>
USA,1994,9379.92
China,1994,19747.2
Sudan,1994,10184.7
USA,1995,1756.44
China,1995,2245.02
Sudan,1995,2301.12
USA,1996,6241.38
China,1996,14209.62
Sudan,1996,15642.72


Or we can remove the column directly by assigning `NULL` to it:

In [24]:
countries$gdp_pc <- NULL 
countries

country,year,adjusted_gdp_pc
<chr>,<dbl>,<dbl>
USA,1994,9379.92
China,1994,19747.2
Sudan,1994,10184.7
USA,1995,1756.44
China,1995,2245.02
Sudan,1995,2301.12
USA,1996,6241.38
China,1996,14209.62
Sudan,1996,15642.72
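
One difference between the two approaches is worth spelling out: subsetting returns a new dataframe and leaves the original untouched unless you assign the result, while assigning `NULL` changes the dataframe itself. A minimal sketch with a two-row dataframe made up for illustration (the name `trimmed` is also just for illustration):

```r
countries <- data.frame(
    country = c("USA", "China"),
    gdp_pc = c(9196, 19360)
)
countries$adjusted_gdp_pc <- countries$gdp_pc * 1.02

# Subsetting creates a new dataframe; countries still has gdp_pc
trimmed <- countries[, c("country", "adjusted_gdp_pc")]
names(trimmed)
names_before <- names(countries)
names_before

# Assigning NULL removes the column from countries itself
countries$gdp_pc <- NULL
names(countries)
```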


## Sorting Data

We can also easily sort dataframes with the `order()` function:

In [27]:
# Sort by GDP

countries[order(countries$adjusted_gdp_pc),]

,country,year,adjusted_gdp_pc
,<chr>,<dbl>,<dbl>
4,USA,1995,1756.44
5,China,1995,2245.02
6,Sudan,1995,2301.12
7,USA,1996,6241.38
1,USA,1994,9379.92
3,Sudan,1994,10184.7
8,China,1996,14209.62
9,Sudan,1996,15642.72
2,China,1994,19747.2


Or sort by year, then country name: 

In [28]:
countries[order(countries$year, countries$country),]

,country,year,adjusted_gdp_pc
,<chr>,<dbl>,<dbl>
2,China,1994,19747.2
3,Sudan,1994,10184.7
1,USA,1994,9379.92
5,China,1995,2245.02
6,Sudan,1995,2301.12
4,USA,1995,1756.44
8,China,1996,14209.62
9,Sudan,1996,15642.72
7,USA,1996,6241.38


In [29]:
# And you can use - to sort any numeric variable in descending order
# rather than ascending:

countries[order(-countries$adjusted_gdp_pc),]

,country,year,adjusted_gdp_pc
,<chr>,<dbl>,<dbl>
2,China,1994,19747.2
9,Sudan,1996,15642.72
8,China,1996,14209.62
3,Sudan,1994,10184.7
1,USA,1994,9379.92
7,USA,1996,6241.38
6,Sudan,1995,2301.12
5,China,1995,2245.02
4,USA,1995,1756.44
