# Dataset Basics

In our previous lessons, we've talked about how vectors are often used to store lots of different observations of a given measurement (e.g. the answers of different survey respondents to a given question), and how matrices can be used to collect lots of different measurements in columns (e.g. each column can be answers to different questions). 

But matrices have one major limitation when it comes to social science workflows, which is that all the entries in a matrix have to be of the same type. In reality, however, we often have datasets with lots of *different* data types. For example, we might have numeric data on age and income, but character data for people's names, preferred political candidate, etc. Or we might have data on power plants across the US that includes numeric data on capacity, age, and pollution alongside character data on the power plant's fuel and the company that owns the plant. 

To deal with this kind of *heterogeneous tabular data*, we turn to the `data.frame`. 

Dataframes are basically just a collection of vectors, where each vector corresponds to a different column, and each column has a single type. Since they're two-dimensional data structures like matrices, we can actually subset them in the same way as matrices, but they are more flexible in terms of the types of data they can store. 

## Creating Dataframes

Let's start by learning how to create a dataset in R. This turns
out to be very simple --- just combine vectors using the `data.frame()`
command. 


In [2]:
# Create three vectors 
name <- c("al", "bea", "carol")
age <- c(6, 7, 4)
hair <- c("brown", "green", "blond")

# Create data frame 
children <- data.frame(name, age, hair)
children


| name | age | hair |
|---|---|---|
| `<chr>` | `<dbl>` | `<chr>` |
| al | 6 | brown |
| bea | 7 | green |
| carol | 4 | blond |
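To see what "each column has a single type" buys us, here is a minimal sketch that rebuilds the same three vectors and stores them both as a matrix and as a dataframe. The object names `children_matrix` and `children_df` are just illustrative, and `stringsAsFactors = FALSE` is set explicitly on the assumption that you might be running a version of R older than 4.0, where character columns would otherwise become factors.

```r
name <- c("al", "bea", "carol")
age <- c(6, 7, 4)
hair <- c("brown", "green", "blond")

# A matrix can hold only one type, so the numeric ages are coerced to text
children_matrix <- cbind(name, age, hair)
class(children_matrix[, "age"])   # "character"

# A dataframe keeps one type per column
children_df <- data.frame(name, age, hair, stringsAsFactors = FALSE)
sapply(children_df, class)        # the class of every column
class(children_df$age)            # "numeric"
```

Because the ages stay numeric in the dataframe, we can still do arithmetic on them, which would not work on the matrix version.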


Note that unlike matrices and vectors -- which *can* have names -- dataframe columns **always** have names, and you'll usually see them accessed by name:

In [5]:
class(children[, "hair"])

## Accessing Columns

While you can access the columns of a dataframe using the exact same syntax as you would with matrices, dataframes also allow you to access a single column using `[name of dataframe]$[name of column]`. For example:

In [6]:
children$hair

## Basic Dataset Commands

To better understand the proper structure of datasets, let's create a second data frame that has a more realistic data structure:

In [8]:
country <- rep(c("USA", "China", "Sudan"), 3)
year <- c(1994, 1994, 1994, 1995, 1995, 1995, 1996, 1996, 1996)
gdp_pc <- round(runif(9, 1000, 20000), 0)

countries <- data.frame(country, year, gdp_pc)
countries

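| country | year | gdp_pc |
|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` |
| USA | 1994 | 15297 |
| China | 1994 | 10213 |
| Sudan | 1994 | 5411 |
| USA | 1995 | 7282 |
| China | 1995 | 8996 |
| Sudan | 1995 | 6622 |
| USA | 1996 | 16665 |
| China | 1996 | 7851 |
| Sudan | 1996 | 3802 |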


Here we can pretend that `gdp_pc` is a measure of a country's GDP per capita in a given year.
 
(A quick aside: `rep()`, as you may recall, creates a vector that repeats the first input the number of times specified by the second input. `runif()` creates, in this case, 9 random values uniformly distributed between 1000 and 20000.)
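A small sketch of those two helpers may be useful here. It uses the `each` argument of `rep()` to build the repeated years without typing them out, and `set.seed()` so that the random draws from `runif()` come out the same on every run. The seed value 42 and the variable names are arbitrary choices for illustration, not something the dataset above relied on.

```r
# Repeat the whole vector three times: USA, China, Sudan, USA, China, ...
country_demo <- rep(c("USA", "China", "Sudan"), 3)

# Repeat each element three times in a row: 1994, 1994, 1994, 1995, ...
year_demo <- rep(c(1994, 1995, 1996), each = 3)

# Fix the random number generator so the draws are reproducible
set.seed(42)
gdp_demo <- round(runif(9, 1000, 20000), 0)

year_demo
gdp_demo
```

Without a call to `set.seed()`, `runif()` gives different numbers each time, so your own `gdp_pc` values will not match the ones printed in this lesson.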


Now let's explore some common functions for getting to know your dataframe!

The first is `dim()`, which gives the dimensions of a data frame. The number of rows is listed first, the number of columns second.

In [10]:
dim(countries)

Use `nrow()` and `ncol()` to get the number of rows or columns separately.

In [4]:
nrow(countries)
ncol(countries)
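To make the link between these three functions concrete, here is a minimal sketch. It rebuilds `countries` with the `gdp_pc` values copied from the printed output above (rather than drawing new random numbers), and the name `size` is just an illustrative choice.

```r
country <- rep(c("USA", "China", "Sudan"), 3)
year <- rep(c(1994, 1995, 1996), each = 3)
gdp_pc <- c(15297, 10213, 5411, 7282, 8996, 6622, 16665, 7851, 3802)
countries <- data.frame(country, year, gdp_pc)

# dim() returns a vector of length two: rows first, then columns
size <- dim(countries)

# So its elements match nrow() and ncol()
size[1] == nrow(countries)
size[2] == ncol(countries)
```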

### Snapshots

Use `head()` and `tail()` to look at the first and last few rows of a dataset, respectively. Obviously this is more useful when we have datasets with hundreds or thousands of observations you can't just look at. :) 

In [32]:
head(countries)

| | country | year | gdp_pc |
|---|---|---|---|
| | `<chr>` | `<dbl>` | `<dbl>` |
| 1 | USA | 1994 | 15297 |
| 2 | China | 1994 | 10213 |
| 3 | Sudan | 1994 | 5411 |
| 4 | USA | 1995 | 7282 |
| 5 | China | 1995 | 8996 |
| 6 | Sudan | 1995 | 6622 |


In [33]:
tail(countries)

| | country | year | gdp_pc |
|---|---|---|---|
| | `<chr>` | `<dbl>` | `<dbl>` |
| 4 | USA | 1995 | 7282 |
| 5 | China | 1995 | 8996 |
| 6 | Sudan | 1995 | 6622 |
| 7 | USA | 1996 | 16665 |
| 8 | China | 1996 | 7851 |
| 9 | Sudan | 1996 | 3802 |
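Both functions show six rows by default, but they also accept an `n` argument if you want more or fewer. A minimal sketch, again rebuilding `countries` from the values printed above (the names `first_three` and `last_two` are only for illustration):

```r
countries <- data.frame(
  country = rep(c("USA", "China", "Sudan"), 3),
  year = rep(c(1994, 1995, 1996), each = 3),
  gdp_pc = c(15297, 10213, 5411, 7282, 8996, 6622, 16665, 7851, 3802)
)

# Only the first three rows
first_three <- head(countries, n = 3)
first_three

# Only the last two rows
last_two <- tail(countries, n = 2)
last_two
```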


Other useful commands to get to know variables better include `summary()`,
`table()`, and `prop.table()`. 

In [11]:
# Get some summary information about each variable
summary(countries)

   country               year          gdp_pc     
 Length:9           Min.   :1994   Min.   : 3802  
 Class :character   1st Qu.:1994   1st Qu.: 6622  
 Mode  :character   Median :1995   Median : 7851  
                    Mean   :1995   Mean   : 9127  
                    3rd Qu.:1996   3rd Qu.:10213  
                    Max.   :1996   Max.   :16665  

In [12]:
# Number of observations by country 
table(countries$country)


China Sudan   USA 
    3     3     3 

In [13]:
# Proportion of observations by country 
prop.table(table(countries$country))


    China     Sudan       USA 
0.3333333 0.3333333 0.3333333 
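These functions also work on a single column, and their results can be passed on to other functions. The sketch below is one way to do that, assuming the same `countries` values as printed above; `country_shares` is just an illustrative name.

```r
countries <- data.frame(
  country = rep(c("USA", "China", "Sudan"), 3),
  year = rep(c(1994, 1995, 1996), each = 3),
  gdp_pc = c(15297, 10213, 5411, 7282, 8996, 6622, 16665, 7851, 3802)
)

# Summary statistics for one numeric column only
summary(countries$gdp_pc)

# Store the proportions, then round them for easier reading
country_shares <- prop.table(table(countries$country))
round(country_shares, 2)

# Proportions across all categories add up to 1
sum(country_shares)
```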

## Subsetting

Subsetting dataframes works almost exactly like subsetting matrices:

In [15]:
# Subset by index
countries[2, c(2, 3)]

| | year | gdp_pc |
|---|---|---|
| | `<dbl>` | `<dbl>` |
| 2 | 1994 | 10213 |


In [16]:
# Access entire row 5
countries[5, ]


| | country | year | gdp_pc |
|---|---|---|---|
| | `<chr>` | `<dbl>` | `<dbl>` |
| 5 | China | 1995 | 8996 |


In general, though, accessing columns by index tends to be dangerous. If you ever modify the data earlier in your code (above the line where you use the index), a column might move to a different position, and your index would then silently pick up the wrong column.

For this reason, it's usually better to access columns using column names. 
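To see the risk concretely, here is a minimal sketch with a hypothetical two-row dataframe called `df`. We pretend that some earlier step reorders the columns, and then compare what index 3 returns with what the name `"gdp_pc"` returns.

```r
df <- data.frame(
  country = c("USA", "China"),
  year = c(1994, 1994),
  gdp_pc = c(15297, 10213)
)

# Column 3 is gdp_pc at this point
by_index_before <- df[, 3]

# Suppose an earlier step in the code now reorders the columns
df <- df[, c("gdp_pc", "country", "year")]

# The same index now returns year, with no warning
by_index_after <- df[, 3]

# The column name still returns gdp_pc
by_name_after <- df[, "gdp_pc"]

by_index_after
by_name_after
```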

In [17]:
# Access a column using column/variable name (two equivalent approaches)
countries$year
countries[, "year"]

Note that when we're accessing a column this way, it's just a vector
and all the things we've learned about [vectors](../vectors) apply.
For example:

In [18]:
# Get mean gdp per cap
mean(countries$gdp_pc)

Similarly, to access rows, it's best to use a logical statement:

In [19]:
countries[countries$year == 1995 & countries$country == "USA", ]

| | country | year | gdp_pc |
|---|---|---|---|
| | `<chr>` | `<dbl>` | `<dbl>` |
| 4 | USA | 1995 | 7282 |
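It can help to look at the logical statement on its own. The sketch below, which assumes the same `countries` values as printed earlier, stores the test in a variable (the name `is_1995` is just illustrative) and then combines a logical row selection with a column name.

```r
countries <- data.frame(
  country = rep(c("USA", "China", "Sudan"), 3),
  year = rep(c(1994, 1995, 1996), each = 3),
  gdp_pc = c(15297, 10213, 5411, 7282, 8996, 6622, 16665, 7851, 3802)
)

# The test produces one TRUE or FALSE per row
is_1995 <- countries$year == 1995
is_1995

# Rows where the test is TRUE are kept
countries[is_1995, ]

# Logical test for the rows, column name for the column:
# the result is a plain vector of Sudan's gdp_pc values
sudan_gdp <- countries[countries$country == "Sudan", "gdp_pc"]
sudan_gdp
```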


## Reading data 


Note: In this section we'll move from toy datasets to REAL DATA!

In particular, we'll be working with `world-small.csv`, which you can download [here](data/world-small.csv).

So far we've created datasets ourselves. Oftentimes, however, we'll want to read a dataset into R from a file. Datasets come in many formats (e.g., .csv, .txt, .dta, and .RData). R can read most data formats as is, either directly or using a library like `foreign`. For now, we'll assume that the file is in a readable format.

To read a file you need to:

1. Specify where the file is located on your computer. This is referred to as setting your working directory. 
2. Execute a command that will read the file from your working directory. 

### File Paths and Working Directories

A key concept in R is the idea of a "working directory". The working directory is the location in your file system that R thinks of as being "open". That means that if you save a dataframe called `my_data` with `write.csv(my_data, "my_data.csv")`, the file `my_data.csv` will be saved to your working directory. Similarly, if you open a file with `read.csv("world-small.csv")`, R will look in your working directory for a file called "world-small.csv" to load.

To see the current working directory of your R session, run `getwd()`. On my system (a Mac), this looks like:

In [20]:
getwd()

To change your working directory, you can use the command `setwd("[new working directory]")`. For example, if I wanted to move my working directory to my downloads folder, I'd type:

In [21]:
setwd("/users/nick/downloads")

And if you want to see what's in your working directory (as a sanity check to ensure you're in the right place), run `dir()`:

In [26]:
setwd("/Users/Nick/github/computational_methods_boot_camp/source/data")
dir()
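If you'd like to try the whole cycle without touching your own folders, here is a minimal sketch that uses R's temporary session folder (`tempdir()`) as a stand-in working directory. The dataframe `toy` and the file name `toy_data.csv` are hypothetical and exist only for this illustration.

```r
# Remember where we started so we can return afterwards
old_wd <- getwd()

# Use R's per-session temporary folder as a stand-in working directory
setwd(tempdir())

# Save a hypothetical toy dataframe using a bare file name
toy <- data.frame(id = 1:3, score = c(2.5, 4.0, 3.5))
write.csv(toy, "toy_data.csv", row.names = FALSE)

# The bare file name is resolved relative to the working directory
file_is_there <- "toy_data.csv" %in% dir()
toy_again <- read.csv("toy_data.csv")

# Clean up and go back to the original working directory
file.remove("toy_data.csv")
setwd(old_wd)

file_is_there
toy_again
```

Saving the old working directory first and restoring it at the end keeps the rest of your session unaffected.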

Note that file paths (the way we specify a working directory) will look very different on Windows! On a Mac, a full path like this always starts with a `/`. On Windows, it will start with something like `C:/` (e.g. my downloads folder is at `"C:/Users/Nick/Downloads"`).

If you can't figure out the path to the folder you need to access, in RStudio you can also set the working directory by going to the `Session` menu, then `Set Working Directory`, and then `Choose Directory...`. That will insert the correct path into the `setwd()` function in your console.

(On Macs, that path will often start with `~/`, which is a shorthand for your user directory, and is the same as `/Users/[your user name]/`.)

### Reading the File

Now that we've told R where to look for our file, it's time to read
it. Different commands are used to read different types of files. This
is the syntax used for reading a .csv file:

In [27]:
world <- read.csv("world-small.csv")

I'm reading the file from the working directory and assigning it
to the object `world`, which is a `data.frame`.

In [28]:
class(world)

Let's check if the file was read correctly, using `dim()`
(returns the dimensions), `head()` (returns the top six rows),
and `summary()` (returns summary information about each variable):

In [29]:
dim(world) #the number of rows and columns 

In [30]:
head(world) #the first few rows of the dataset

| | country | region | gdppcap08 | polityIV |
|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<int>` | `<dbl>` |
| 1 | Albania | C&E Europe | 7715 | 17.8 |
| 2 | Algeria | Africa | 8033 | 10.0 |
| 3 | Angola | Africa | 5899 | 8.0 |
| 4 | Argentina | S. America | 14333 | 18.0 |
| 5 | Armenia | C&E Europe | 6070 | 15.0 |
| 6 | Australia | Asia-Pacific | 35677 | 20.0 |


In [31]:
summary(world) #a summary of the variables in the dataset

   country             region            gdppcap08        polityIV     
 Length:145         Length:145         Min.   :  188   Min.   : 0.000  
 Class :character   Class :character   1st Qu.: 2153   1st Qu.: 7.667  
 Mode  :character   Mode  :character   Median : 7271   Median :16.000  
                                       Mean   :13252   Mean   :13.408  
                                       3rd Qu.:19330   3rd Qu.:19.000  
                                       Max.   :85868   Max.   :20.000  

Everything looks as we would have hoped.

## Exercises


1. Read the `world-small.csv` data into R and store it in an object
called `world`. (Set your working directory using code first.) 

2. (Conceptual) What is the unit of analysis in the dataset? What's the name
of the dataset's id variable?

3. How many observations does `world` have? How many variables? Use an R
command to find out.

4. Use brackets and a logical statement to inspect all the values for
   Nigeria and United States. That is, your code should return two
   entire rows of the dataset. 

5. Use R to return China's Polity IV score. As in question 4, use a logical
statement and brackets, but don't return the entire row. Rather, return a single
value with the Polity IV score.

6. What is the lowest GDP per capita in the dataset? (Use R to return only the value.)

7. What country has the lowest GDP per capita? (Your code should
return the country name and be general enough so that if the observations
in the dataset --- or their order --- change, your code should still return the
country with the lowest GDP per capita.)





