# R Coding Workshop 
4/15/2021

## Overview

Graduate Student Instructor: Kayleigh Barnes

Email: kayleighnb@berkeley.edu

### Goals for today
This session is intended to guide you through the practical implementation of basic analytic techniques in R in Jupyter notebooks. R is open-source software for statistical computing and data analysis. A Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. This workshop will focus on interactive demonstration in R, but will also include time for additional questions and guidance in working through the sample code. We will cover some fundamental coding techniques that will help you in Econ 140, basic data science classes, or research assistant positions. This workshop is for *beginners* who have little or no coding experience.



### Important notes 
- One attendee from today's workshop will be randomly selected to win a $20 Amazon gift card
- Attendance at this workshop comes with free access to DataCamp through July. DataCamp offers online courses in both R and Python so that you can continue learning after today's workshop
- Link to join Berkeley Econ's DataCamp group with your @berkeley.edu ID: [here](https://www.datacamp.com/groups/shared_links/9cecd27b5daab26dc69f7d4a48b3c2ae5e20ff9ed77e3e239fa2e4510a4848d3) (make sure you're signed out of DataCamp before clicking this; otherwise the sign-up breaks and you'll be asked to pay after the first chapter of any course)

## Jupyter and R Basics
- To create a new notebook, click the "New" button and select R
- Write R script by selecting the option "Code" from the dropdown list, or write text by selecting "Markdown"
- Select "Insert" to add a block of text or code
- Run code by highlighting and selecting "Run"
- Use the # symbol to add comments in code cells, or to create headings in Markdown (text) cells
- To clear your coding output, select Cell=>All Output=>Clear 

In [None]:
  # Clear the workspace, this removes all data and numbers you have stored or saved in R
  rm(list = ls())
  
  # The help function: typing ? before a command, or wrapping it in help(), brings up information on what the command does
  ?setwd
  help(setwd)

In [None]:
# The working directory is the location where R will look for data
    # this is the same as telling your computer to look in a documents folder when uploading something
getwd()
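
If a later command cannot find the data, it can help to check what R actually sees in the working directory. The short sketch below uses only base R functions. The file name MyFirstData.csv is the one used later in this workshop; the variable names wd, csv_files and data_found are just illustrative.

```r
# Store the working directory path as text and display it
wd <- getwd()
print(wd)

# List the CSV files that R can see in the working directory
csv_files <- list.files(pattern = "\\.csv$")
print(csv_files)

# TRUE if the workshop data file is in the working directory, FALSE otherwise
data_found <- file.exists("MyFirstData.csv")
print(data_found)
```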

User-written open-source packages add specific functionality to R (e.g. nice graphics). A package needs to be installed once, and then loaded at the beginning of every script that uses it. Packages have been pre-installed in Jupyter notebooks. If you are wondering why a command you've used before is no longer working, it may be because you haven't loaded the package.

In [None]:
  #Install packages
 # install.packages('ggplot2') # remove the first # to install ggplot2, it is already installed on datahub
  
  # Load required packages
  library(ggplot2)
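
If you are not sure whether a package is installed or loaded, a quick check along these lines may help. This is only a sketch using base R functions; the variable names ggplot2_installed and attached are illustrative.

```r
# requireNamespace() returns TRUE if the package is installed and FALSE if not,
# without stopping the script with an error
ggplot2_installed <- requireNamespace("ggplot2", quietly = TRUE)
print(ggplot2_installed)

# search() lists everything currently attached; a package loaded with
# library() shows up here as "package:<name>"
attached <- search()
print(attached)
```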

## Loading in data and summary statistics

Now let's load in the data set. Make sure you have uploaded the data to Jupyter before running the next line of code. We are going to use data on a set of households in Mexico in the 1990s. The data includes a village ID, a household ID, and demographic variables like income, household size, age and gender of the head of household, and a poverty indicator.

In [None]:
# Reading data into R from a CSV file
#  ?read.table # delete the # at the beginning of this line to view the help entry for read.table() and read.csv()
  MyFirstData <- read.csv('MyFirstData.csv', header = TRUE, stringsAsFactors = TRUE)
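
To see what the stringsAsFactors option does, here is a small self-contained sketch. It writes a tiny made-up CSV to a temporary file (the three rows are hypothetical and are not the workshop data) and reads it back both ways.

```r
# Write a tiny made-up CSV to a temporary file (hypothetical values)
toy_path <- tempfile(fileext = ".csv")
writeLines(c("hhid,sexhead,famsize",
             "1,Male,4",
             "2,Female,3",
             "3,Male,6"), toy_path)

# With stringsAsFactors = TRUE, text columns are stored as factors (categories)
toy_factor <- read.csv(toy_path, header = TRUE, stringsAsFactors = TRUE)
print(class(toy_factor$sexhead))

# With stringsAsFactors = FALSE, text columns stay as plain character strings
toy_char <- read.csv(toy_path, header = TRUE, stringsAsFactors = FALSE)
print(class(toy_char$sexhead))
```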

Notice that there is no output from the code that reads in the data. Unlike Excel, R stores the data in the background and we need to use specific commands to interact with it. Once it's read in, we can use several commands to describe the data.

In [None]:
# Structure of the Data
  str(MyFirstData)

In [None]:
  # Summary of the Data
  summary(MyFirstData)

In [None]:
  # Variable Names
  colnames(MyFirstData)

In [None]:
  #Number of Observations
  nrow(MyFirstData)

In [None]:
  #Display first 6 rows of the data 
  head(MyFirstData)

In [None]:
  #Tabulate a specific variable (to refer to a variable, use Dataset$VariableName)
  table(MyFirstData$sexhead)
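
Counts are often easier to read as shares, and table() can also cross-tabulate two variables. The sketch below uses a small made-up data frame; the values, and the label 'no pobre' for non-poor households, are assumptions for illustration only.

```r
# Small made-up data set (hypothetical values)
toy <- data.frame(
  sexhead = c("Male", "Female", "Male", "Male"),
  pov_HH  = c("pobre", "pobre", "no pobre", "pobre")
)

# Counts of each category
counts <- table(toy$sexhead)
print(counts)

# Convert the counts to shares of the total
shares <- prop.table(counts)
print(shares)

# Cross-tabulate two variables: rows are sexhead, columns are pov_HH
cross <- table(toy$sexhead, toy$pov_HH)
print(cross)
```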

## Basic Data Cleaning and Formatting

### Category Variable

Right now, we have two categorical variables: sexhead, which indicates the sex of the head of household, and pov_HH, which indicates whether a household is below the poverty line. The data entries for these variables are text rather than numbers (we call these string variables in the data science world; because we read the data with stringsAsFactors = TRUE, R stores them as factors). Often when doing data analysis, it is easier to map categorical text variables to numbers, particularly 0 and 1. Variables that contain only 0's and 1's are called dummy variables.

Now, suppose we want to create a poor_male variable, which will be defined as 1 if the household is categorized as poor (pov_HH = pobre) and the head of the household is male (sexhead is Male), and 0 otherwise.

In [None]:
#Create one dummy variable based on T/F condition
MyFirstData$poor_male <- ifelse(MyFirstData$pov_HH == 'pobre' & MyFirstData$sexhead == 'Male', 1, 0)
  #tabulate the observations
  table(MyFirstData$poor_male)
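
To see exactly which rows get a 1, it can help to try the same logic on a tiny made-up data frame where every combination appears once. The values below are hypothetical, and the label 'no pobre' is an assumption for illustration.

```r
# One made-up household for each combination of poverty status and sex
toy <- data.frame(
  pov_HH  = c("pobre", "pobre", "no pobre", "no pobre"),
  sexhead = c("Male", "Female", "Male", "Female")
)

# Both conditions must be TRUE (the & operator) for the dummy to equal 1
toy$poor_male <- ifelse(toy$pov_HH == "pobre" & toy$sexhead == "Male", 1, 0)

# A comparison returns TRUE/FALSE; as.numeric() converts that to 1/0,
# which is another way to build a dummy from a single condition
toy$male <- as.numeric(toy$sexhead == "Male")

print(toy)
```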

### Numerical Variable
We can use regular mathematical operations to create numerical variables from other variables.

In [None]:
#Squaring an existing variable
MyFirstData$agehead2 <-  MyFirstData$agehead^2
summary(MyFirstData$agehead2)

#Creating a constant
MyFirstData$constant <- 1
summary(MyFirstData$constant)
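
New variables can also combine two existing columns. As a sketch, the code below builds income per household member and the log of income on made-up numbers. The column names IncomeLab and famsize come from the workshop data, but the values and the new names income_pc and log_income are hypothetical.

```r
# Small made-up data set (hypothetical values)
toy <- data.frame(
  IncomeLab = c(1200, 800, 1500),
  famsize   = c(4, 2, 5)
)

# Income per household member: operations are applied row by row
toy$income_pc <- toy$IncomeLab / toy$famsize

# Natural log of income
toy$log_income <- log(toy$IncomeLab)

print(toy)
```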

### New Datasets
We may also want to create a new dataset that summarizes the original one, or that is a subset of it.

In [None]:
#Subset of only observations with male head of hh
data_males<-MyFirstData[ which(MyFirstData$sexhead=='Male'),]
summary(data_males)

#First select variables to aggregate
myvars <- c("villid", "IncomeLab", "famsize", "agehead")
meandata <- MyFirstData[myvars]

#Collapse data to get average values by village.  Could also use "sum" as the function to get totals
meandata<-aggregate(meandata, by = list(meandata$villid), FUN = mean)
nrow(meandata)
summary(meandata)
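
Base R offers more than one way to write these steps. The sketch below shows subset() for filtering rows and the formula version of aggregate() for group averages, run on a small made-up data frame (the values are hypothetical). This is an alternative style, not a correction of the code above.

```r
# Small made-up data set: two villages with two households each
toy <- data.frame(
  villid    = c(1, 1, 2, 2),
  IncomeLab = c(100, 300, 200, 400),
  famsize   = c(4, 6, 3, 5)
)

# Keep only the households with at least 5 members
large_hh <- subset(toy, famsize >= 5)
print(large_hh)

# Average IncomeLab and famsize within each village
# Read the formula as: "these variables, grouped by villid"
village_means <- aggregate(cbind(IncomeLab, famsize) ~ villid, data = toy, FUN = mean)
print(village_means)
```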



## Making comparisons - T-Tests

A main goal of working with data is to make inferences about the population we are interested in. Much of Econ 140 will be focused on methods to make these inferences: What is the relationship between two variables? Did an experiment have a significant treatment effect?

If you have taken Stats 20, you are likely already familiar with a t-test. T-tests compare the difference in the means of a variable between two groups. The test statistic tells us whether the difference is statistically *significant*, that is, whether it is large enough relative to the variation in the data that it is unlikely to be due to chance alone.

In [None]:
#let's run a t-test comparing the average family size for households above and below the poverty line
t.test(MyFirstData$famsize ~ MyFirstData$pov_HH, var.equal = TRUE)
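
To see where the test statistic comes from, the sketch below computes the equal-variance (pooled) t statistic by hand on made-up numbers and compares it with what t.test() reports. The data values and the label 'no pobre' are hypothetical; only the steps matter.

```r
# Made-up family sizes for two groups of five households (hypothetical values)
famsize <- c(4, 6, 5, 7, 6, 3, 4, 2, 5, 3)
pov_HH  <- factor(rep(c("pobre", "no pobre"), each = 5),
                  levels = c("no pobre", "pobre"))

# Split the variable by group
x1 <- famsize[pov_HH == "no pobre"]
x2 <- famsize[pov_HH == "pobre"]
n1 <- length(x1)
n2 <- length(x2)

# Pooled variance: a weighted average of the two group variances
pooled_var <- ((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2)

# t statistic: difference in means divided by its standard error
t_manual <- (mean(x1) - mean(x2)) / sqrt(pooled_var * (1 / n1 + 1 / n2))

# Two-sided p-value from the t distribution with n1 + n2 - 2 degrees of freedom
p_manual <- 2 * pt(-abs(t_manual), df = n1 + n2 - 2)

# The same test using the built-in function; results are stored in a list
result <- t.test(famsize ~ pov_HH, var.equal = TRUE)
print(c(manual = t_manual, builtin = unname(result$statistic)))
print(c(manual = p_manual, builtin = result$p.value))
```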

## Visualizing Data
Make sure that the ggplot2 package is loaded at the top of the script. Below, we show an example of a scatterplot using ggplot. The "geom" functions set the type of graph: for example, geom_point draws a scatterplot and geom_line draws a line graph.

In [None]:
  ggplot(MyFirstData, aes(x = agehead, y=famsize)) + geom_point()
  ?geom_line  
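
As a sketch of how the pieces fit together, the code below builds the same kind of plot on a small made-up data frame (hypothetical values), swaps the geom, and adds axis labels with labs(). It assumes ggplot2 is installed, as it is on datahub.

```r
library(ggplot2)

# Small made-up data set (hypothetical values)
toy <- data.frame(
  agehead = c(25, 32, 41, 47, 55, 63),
  famsize = c(3, 4, 6, 5, 4, 2)
)

# Scatterplot: one point per household, with readable axis labels
p_points <- ggplot(toy, aes(x = agehead, y = famsize)) +
  geom_point() +
  labs(x = "Age of household head", y = "Family size")
print(p_points)

# Same data and axes, but drawn as a line graph by swapping the geom
p_line <- ggplot(toy, aes(x = agehead, y = famsize)) +
  geom_line() +
  labs(x = "Age of household head", y = "Family size")
print(p_line)
```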

We can use a base R function (hist) or ggplot to create a histogram. Notice that changing the options in the function allows you to customize the graph. Use the help function to learn more about the options for each command.

In [None]:
# Base Graphics
  hist(MyFirstData$agehead)
  hist(MyFirstData$agehead, col = "blue", main = "Histogram of age")
  # ggplot2
  ggplot(MyFirstData, aes(x = agehead)) + geom_histogram(fill = "blue") + ggtitle("Histogram of age")
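
One option worth trying is the bin width. In base graphics, the breaks option sets where each bar starts and ends. The sketch below uses made-up ages (hypothetical values) and also stores the result of hist(), so that the number of observations in each bar can be inspected.

```r
# Made-up ages of household heads (hypothetical values)
ages <- c(23, 31, 35, 38, 42, 44, 47, 51, 58, 66)

# Bars that are 10 years wide, running from 20 to 70
h <- hist(ages,
          breaks = seq(20, 70, by = 10),
          col    = "blue",
          main   = "Histogram of age",
          xlab   = "Age of household head")

# Number of observations that fall in each bar
print(h$counts)
```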