# Reading Data

Yup, it's finally time for REAL DATA! HOORAY!

The first step in working with real data (usually) will be to load it from a file. 

## File Paths in Working Directories

To read a file you need to:

1. Specify where the file is located on your computer. This is referred to as setting your working directory. 
2. Execute a command that will read the file from your working directory. 

The working directory is the location in your file system that R thinks of as being "open". That means that if you save a data frame with `write.csv(my_data, "my_data.csv")`, the file `my_data.csv` will be saved to your working directory. Similarly, if you open a file with `read.csv("world-small.csv")`, R will look in your working directory for a file called `world-small.csv`.
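To make that concrete, here is a minimal sketch of a save-then-load round trip. It works in a throwaway temporary folder (the folder name `wd_demo` and the data frame `my_data` are made up for illustration) and restores your original working directory at the end:

```r
# Remember where we started so we can go back afterwards
old_wd <- getwd()

# Create a scratch folder and make it the working directory
tmp <- file.path(tempdir(), "wd_demo")   # hypothetical folder name
dir.create(tmp, showWarnings = FALSE)
setwd(tmp)

# A tiny made-up data frame, saved using only a bare file name
my_data <- data.frame(id = 1:3, score = c(2.5, 4.0, 3.5))
write.csv(my_data, "my_data.csv", row.names = FALSE)

# The file landed in the working directory...
file.exists(file.path(getwd(), "my_data.csv"))

# ...so a bare file name is enough to read it back
reloaded <- read.csv("my_data.csv")

# Return to the original working directory
setwd(old_wd)
```

Because both calls use a bare file name, R resolves them against whatever the working directory is at that moment.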

To see the current working directory of your R session, run `getwd()`. On my system (a Mac), this looks like:

In [20]:
getwd()

To change your working directory, you can use the command `setwd("[new working directory]")`. For example, if I wanted to move my working directory to my Downloads folder, I'd type:

In [21]:
setwd("/Users/Nick/Downloads")

And if you want to see what's in your working directory (as a sanity check to ensure you're in the right place), run `dir()`:

In [26]:
setwd("/Users/Nick/github/computational_methods_boot_camp/source/data")
dir()

Note that file paths (the way we specify a working directory) will look very different on Windows! On a Mac, this kind of path always starts with a `/`. On Windows, it will start with something like `C:/` (e.g. my Downloads folder would be at `"C:/Users/Nick/Downloads"`).

If you can't figure out the path to the folder you need to access, however, in RStudio you can also set the working directory by going to the `Session` menu, then `Set Working Directory`, then `Choose Directory...`. That will run `setwd()` with the correct path in your console. 

(On Macs, that path will often start with `~/`, which is shorthand for your user directory, and is the same as `/Users/[your user name]/`.)
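If you'd like to poke at paths from inside R, the sketch below uses a few base R helpers. It assumes there is a `Downloads` folder in your user directory, which may not be true on your machine, so it checks before claiming anything:

```r
# Build a path piece by piece; file.path() joins the parts with "/"
downloads <- file.path("~", "Downloads")   # assumed folder, adjust to your system
downloads

# Expand the "~" shorthand into the full path to your user directory
path.expand(downloads)

# Check that the folder exists before pointing setwd() at it
if (dir.exists(downloads)) {
  message("Found: ", path.expand(downloads))
} else {
  message("No such folder: ", path.expand(downloads))
}
```

Checking with `dir.exists()` first is a cheap way to avoid the "cannot change working directory" error that `setwd()` gives for a mistyped path.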

If you want to learn more about file paths, you can read about them here.

## Reading the File

Now that we've told R where to look for our file, it's time to read it. 

Datasets come in many formats, usually identified by their file suffix, such as `file.csv` for comma-separated value text files, `file.dta` for Stata datafiles, etc. Thankfully, R can read almost any standard data format you may get. For example, here are a handful of commands for reading different types of files:

(Don't try to memorize these! You can always google this type of command later. I just want to make sure you know that these commands *exist*, so it will occur to you to google them in the future.)

```r
# Available by default
df <- read.csv("file.csv")           # Comma separated values
df <- read.csv("file.txt", sep="\t") # tab separated values

# Using the `foreign` library, which you can install with `install.packages("foreign")`
library(foreign) # load foreign
df <- read.dta("file.dta")                         # Stata data (Stata versions 5-12 only)
df <- read.spss("file.sav", to.data.frame = TRUE)  # SPSS data, returned as a data frame

# Using the `readxl` library, which you can install with `install.packages("readxl")`
library(readxl)
df <- read_excel("file.xls")  # Excel xls spreadsheet
df <- read_excel("file.xlsx") # Excel xlsx spreadsheet
```
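Since `library()` fails if a package isn't installed yet, it can help to check first. The following is one possible sketch, not the only way to do it; it only reports what is missing and does not install anything for you:

```r
# Packages used in the examples above
pkgs <- c("foreign", "readxl")

# requireNamespace() returns TRUE if a package can be loaded, without attaching it
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
installed

# Print the install command for anything that is missing
missing_pkgs <- pkgs[!installed]
if (length(missing_pkgs) > 0) {
  message("Run: install.packages(c(",
          paste0('"', missing_pkgs, '"', collapse = ", "), "))")
}
```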

For the exercises we'll be doing next, we'll work with the `world-small.csv` dataset, which you can download [here](data/world-small.csv).

As noted, different commands are used to read different types of files. This is the syntax used for reading a .csv file:

In [27]:
world <- read.csv("world-small.csv")

I'm reading the file from the working directory and assigning it
to the object `world`, which is an object of class `data.frame`. 

In [28]:
class(world)
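As an aside, `read.csv()` also accepts a full path instead of a bare file name, in which case the working directory doesn't matter. Here is a minimal sketch using a small stand-in file written to a temporary location (the file `example.csv` and its two rows are just for illustration, not the real `world-small.csv`):

```r
# Write a tiny stand-in CSV to a temporary location
csv_path <- file.path(tempdir(), "example.csv")   # hypothetical file
writeLines(c("country,region",
             "Albania,C&E Europe",
             "Algeria,Africa"),
           csv_path)

# read.csv() is given the full path, so the working directory is not used
example <- read.csv(csv_path)
class(example)   # "data.frame"
```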

Let's check if the file was read correctly, using `dim()`
(returns the dimensions), `head()` (returns the top six rows),
and `summary()` (returns summary information about each variable):

In [29]:
dim(world) #the number of rows and columns 

In [30]:
head(world) #the first few rows of the dataset

|   | country | region | gdppcap08 | polityIV |
|---|---------|--------|-----------|----------|
|   | `<chr>` | `<chr>` | `<int>` | `<dbl>` |
| 1 | Albania | C&E Europe | 7715 | 17.8 |
| 2 | Algeria | Africa | 8033 | 10.0 |
| 3 | Angola | Africa | 5899 | 8.0 |
| 4 | Argentina | S. America | 14333 | 18.0 |
| 5 | Armenia | C&E Europe | 6070 | 15.0 |
| 6 | Australia | Asia-Pacific | 35677 | 20.0 |


In [31]:
summary(world) #a summary of the variables in the dataset

   country             region            gdppcap08        polityIV     
 Length:145         Length:145         Min.   :  188   Min.   : 0.000  
 Class :character   Class :character   1st Qu.: 2153   1st Qu.: 7.667  
 Mode  :character   Mode  :character   Median : 7271   Median :16.000  
                                       Mean   :13252   Mean   :13.408  
                                       3rd Qu.:19330   3rd Qu.:19.000  
                                       Max.   :85868   Max.   :20.000  

Everything looks as we would have hoped!
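If you don't have the file handy and just want to try these inspection commands, here is a small sketch that rebuilds the six rows shown by `head(world)` above by hand (the name `world_head` is made up for this example) and runs a few checks on them:

```r
# The six rows printed by head(world), typed in by hand
world_head <- data.frame(
  country   = c("Albania", "Algeria", "Angola", "Argentina", "Armenia", "Australia"),
  region    = c("C&E Europe", "Africa", "Africa", "S. America", "C&E Europe", "Asia-Pacific"),
  gdppcap08 = c(7715L, 8033L, 5899L, 14333L, 6070L, 35677L),
  polityIV  = c(17.8, 10, 8, 18, 15, 20)
)

dim(world_head)     # rows, then columns
str(world_head)     # the type of each column, plus the first few values
anyNA(world_head)   # TRUE if any value is missing
```

`str()` and `anyNA()` aren't used elsewhere in this section; they're simply two more quick ways to confirm that each column came in with the type you expected and that nothing is unexpectedly missing.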

## Exercises


1. Read the `world-small.csv` data into R and store it in an object
called `world`. (Set your working directory using code first.) 

2. (Conceptual) What is the unit of analysis in the dataset? What's the name
of the dataset's id variable?

3. How many observations does `world` have? How many variables? Use an R
command to find out.

4. Use brackets and a logical statement to inspect all the values for
   Nigeria and United States. That is, your code should return two
   entire rows of the dataset. 

5. Use R to return China's Polity IV score. As in question 4, use a logical
statement and brackets, but don't return the entire row. Rather, return a single
value with the Polity IV score.

6. What is the lowest GDP per capita in the dataset? (Use R to return only the value.)

7. What country has the lowest GDP per capita? (Your code should
return the country name and be general enough so that if the observations
in the dataset --- or their order --- change, your code should still return the
country with the lowest GDP per capita.)







## Exercises 


1. Read the `world-small.csv` dataset into R and store it in an object called `world`.

2. Subset `world` to European countries. Save this subset as a new data frame called `europe`.

3. Add two variables to `europe`: 
    a. A variable that recodes `polityIV` from its 0 to 20 scale to a -10 to 10 scale. 
    b. A variable that categorizes a country as "rich" or "poor" based on some 
       cutoff of `gdppcap08` you think is reasonable. 

4. Drop the `region` variable in `europe` (keep the rest). 

5. Sort `europe` based on Polity IV. 

6. Repeat Exercises 2-5 using chaining. 

7. What was the world's mean GDP per capita in 2008? Polity IV score?

8. What was Africa's mean GDP per capita and Polity IV score?

9. What was the poorest country in the world in 2008? Richest? 

10. How many countries in Europe are "rich" according to your coding? 
How many are poor? What percentage have Polity IV scores of at least 18?

