## Basic R Commands & Data Exploration

Does college pay off? We'll use some of the latest data from the US Department of Education's <a href="https://collegescorecard.ed.gov/data/">College Scorecard Database</a> to answer that question. 

In this notebook, you'll get a gentle introduction to R, a programming language that data scientists use to analyze large datasets. Then you'll begin exploring the College Scorecard data yourself. By the end of this notebook, you'll have a general sense of which colleges set up their graduates for success and which colleges ... don't.

In [1]:
## Run this code but do not edit it. Hit Ctrl+Enter to run the code.
# These commands load R packages (dplyr, ggplot2, ggformula) that provide the functions used in this notebook
suppressPackageStartupMessages({
    library(dplyr)
    library(ggplot2)
    library(ggformula)
})

### 1.0 - Exploring the dataset

To begin, let's load our data. Our full dataset is stored in a file named `colleges.csv`, which we're retrieving from a GitHub repository. The command below reads the data from that file and stores it in an R dataframe object called `dat`.

In [2]:
dat <- read.csv('https://raw.githubusercontent.com/mahmoudharding/slrp/main/data/colleges.csv')
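
Before going further, it can help to check what was actually loaded. The cell below is a minimal sketch using base R's `dim`, `names`, and `head`. The name `dat_size` is just an illustrative choice, and the small stand-in dataframe contains made-up values that are not part of the College Scorecard data; it is only there so the cell also runs if the download above was skipped.

```r
# Stand-in with made-up values, used only if `dat` has not been loaded yet.
if (!exists("dat")) {
  dat <- data.frame(admit_rate = c(0.05, 0.30, 0.55, 0.70, 0.95))
}

dat_size <- dim(dat)  # number of rows, then number of columns
dat_size
names(dat)            # the variable (column) names
head(dat)             # the first six rows (or fewer, if the dataframe is shorter)
```

If `dat` was loaded from `colleges.csv`, `names(dat)` should list `admit_rate` among the variables, since that is the variable used in the sections that follow.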

### 2.0 - Finding summary statistics 

When analyzing variables of interest, it's often helpful to calculate summary statistics. For quantitative variables, we can use the `summary` command to find the five-number summary (minimum, Q1, median, Q3, maximum) and the average (mean) of the values. For example, `summary(dat$admit_rate)` returns these summary statistics for the `admit_rate` variable. Note: the `$` operator in R is used to isolate a single variable (`admit_rate`) from a dataframe (`dat`).
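
As a sketch of the `summary` call described above, the cell below applies it to `admit_rate` and stores the result under the illustrative name `admit_summary`. It assumes `admit_rate` is a numeric column; the stand-in dataframe holds made-up values and is used only if `dat` has not been loaded.

```r
# Stand-in with made-up values, used only if `dat` has not been loaded yet.
if (!exists("dat")) {
  dat <- data.frame(admit_rate = c(0.05, 0.30, 0.55, 0.70, 0.95))
}

# dat$admit_rate pulls the admit_rate column out of dat as a vector.
admit_summary <- summary(dat$admit_rate)

# Prints Min., 1st Qu., Median, Mean, 3rd Qu., and Max.
# (plus a count of NA's if the column has missing values).
admit_summary
```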

### 3.0 - Visualizing data (histograms, barplots, and boxplots)

In addition to summary statistics, a great way to get an overall impression of our data is to visualize it. In this section, we'll walk through different types of visualizations we can create in R. Note: We're going to save scatterplots for the next notebook in this series.

One of the most useful visualizations for displaying a quantitative variable is a histogram. Here, we use the `gf_histogram` command to display the histogram for `admit_rate`.
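
A minimal sketch of that histogram, assuming the `ggformula` package loaded at the top of the notebook and a numeric `admit_rate` column. The name `admit_hist` is illustrative, and the stand-in dataframe (made-up values) is used only if `dat` has not been loaded.

```r
library(ggformula)

# Stand-in with made-up values, used only if `dat` has not been loaded yet.
if (!exists("dat")) {
  dat <- data.frame(admit_rate = c(0.05, 0.30, 0.55, 0.70, 0.95))
}

# One-sided formula: the variable after ~ is placed on the x-axis,
# and the bar heights are the counts in each bin.
admit_hist <- gf_histogram(~ admit_rate, data = dat)
admit_hist
```

The formula interface (`~ variable, data = dataframe`) is the same pattern the other `gf_` plotting functions use, so it carries over to the plots later in this section.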