### Intro to R
R is an open-source statistical programming language.

R makes simple tasks, such as plots and data descriptions, very easy, and complex tasks manageable.

It can be used for calculations, visualizations, animations, and even games.

#### Arithmetic in R

In [4]:
# Operations in R
2 + 2 # Addition
3 * 2 # Multiplication
3 / 2 # Decimal division
2 - 1 # Subtraction
3 %% 2 # Modulo
3 %/% 2 # Integer (flooring) division
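
The "flooring" in integer division matters most for negative numbers. A small sketch (the variable names are only for illustration) showing how `%/%` and `%%` fit together:

```r
# Integer division rounds toward negative infinity (it floors),
# and the two operators satisfy x == (x %/% y) * y + (x %% y).
x <- -7
y <- 2

quotient  <- x %/% y   # -4, not -3
remainder <- x %% y    # 1

quotient * y + remainder == x   # TRUE
```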


#### Variables in R

In [12]:
x <- 3 # assign 3 to x, an R object
x # output the result of x
y = 4 # can also assign using `=`, but this has pitfalls
y # output result of y
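
One pitfall of `=` is that inside a function call it names an argument instead of assigning a variable. A minimal sketch, using a hypothetical helper `total()` and a hypothetical variable name `scores`:

```r
# Hypothetical helper, defined only for this illustration
total <- function(scores) sum(scores)

# `=` inside a call matches the argument named `scores`; no variable is created
total(scores = c(1, 2, 3))
created_by_equals <- exists("scores", inherits = FALSE)   # FALSE

# `<-` inside a call assigns first, then passes the value along
total(scores <- c(1, 2, 3))
created_by_arrow <- exists("scores", inherits = FALSE)    # TRUE
```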

#### Rules for identifiers
- Case-sensitive; can contain letters, numbers, underscores, and `.`
- Needs to start with a letter or `.` (and a leading `.` cannot be followed by a digit)
    - But beware! Identifiers starting with `.` are hidden: `ls()` does not list them by default

In [4]:
someVariable <- 2
somevariable <- 3
.hiddenVar <- 5
ls() # lists the objects in the current environment (hidden ones are left out)
.hiddenVar # not listed by ls(), but still accessible
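
Hidden objects can still be listed on request. A short sketch (the variable names are made up for the example) using the `all.names` argument of `ls()`:

```r
visible_var <- 1
.hidden_var <- 2

default_listing <- ls()                  # leaves out names starting with "."
full_listing    <- ls(all.names = TRUE)  # includes them

".hidden_var" %in% default_listing   # FALSE
".hidden_var" %in% full_listing      # TRUE
```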

#### Functions
- Take zero or more inputs and return a single value (which may be a vector, a list, or `NULL`)
    - May have side effects
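
As a rough illustration of both points, here is a hypothetical function `value_range()` (not part of R) that prints a message as a side effect and returns one value:

```r
# Hypothetical function: `digits` has a default, so it can be omitted
value_range <- function(values, digits = 1) {
  message("Summarising ", length(values), " values")  # side effect: prints a message
  round(max(values) - min(values), digits)            # last expression is the return value
}

spread <- value_range(c(2, 9.5, 4))
spread   # 7.5
```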

In [13]:
# Vector function - c
# Takes a variable-length argument list and returns a vector containing those arguments as its elements
c(1,2,3)
# Vectors have operations
c(1,2,3) + c(4,5,6) # element-wise addition
c(1,2,3) + c(4,5,6,7,8,9) # element-wise addition, but the shorter vector "wraps around" so that the addition still works
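
This "wrap around" behaviour is usually called recycling. A small sketch of how it plays out, including the warning R gives when the longer length is not a multiple of the shorter one:

```r
# The shorter vector is repeated to match the longer one
recycled <- c(1, 2, 3) + c(10, 20, 30, 40, 50, 60)
recycled   # 11 22 33 41 52 63

# A single number is recycled across the whole vector
c(1, 2, 3) * 10   # 10 20 30

# Lengths 3 and 2 do not divide evenly, so R still computes a result but warns;
# here the warning is caught and its text kept instead
warning_text <- tryCatch(
  c(1, 2, 3) + c(10, 20),
  warning = function(w) conditionMessage(w)
)
```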

#### Getting help
- `?` operator opens the help page for a function (e.g. `?t.test`)
- `??` searches the help system (help pages, vignettes, and demos) for matching topics

In [19]:
??t.test

R Information

Demos with name or title matching ‘t.test’ using regular expression
matching:


tcltk::tkttest          t-test example of GUI interface to a function
                        call.


Type 'demo(PKG::FOO)' to run demonstration 'PKG::FOO'.



Help files with alias or concept or title matching ‘t.test’ using
regular expression matching:


psych::Tal_Or           Data set testing causal direction in presumed
                        media influence
Rcpp::RcppUnitTests     Rcpp : unit tests results
stats::bartlett.test    Bartlett Test of Homogeneity of Variances
  Aliases: bartlett.test, bartlett.test.default, bartlett.test.formula
stats::fisher.test      Fisher's Exact Test for Count Data
stats::pairwise.t.test
                        Pairwise t tests
  Aliases: pairwise.t.test
stats::power.t.test     Power calculations for one and two sample t
                        tests
  Aliases: power.t.test
stats::t.test           Student's t-Test
  Aliases: t.test, t.test.default, t.t
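
When only part of a function name is known, `apropos()` is another option: it searches the names of objects on the search path rather than the help pages. A minimal sketch, assuming the default packages (including `stats`) are attached:

```r
# Regular-expression search over object names; the dot is escaped to match literally
matches <- apropos("t\\.test")
matches

# The long-form equivalents of the operators are:
#   help("t.test")         same as ?t.test
#   help.search("t.test")  same as ??t.test
```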

#### R Packages
R packages are distributed through CRAN, the Comprehensive R Archive Network.
Packages can be installed using `install.packages`.
This only needs to be done once: packages typically remain installed (unless you update `R` to a new major version).
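
One common pattern is to install a package only when it is missing. The helper below, `ensure_installed()`, is a hypothetical name for this sketch; it is demonstrated with `stats`, which ships with R, so nothing is downloaded here:

```r
# Hypothetical helper: install `pkg` from CRAN only if it is not available yet
ensure_installed <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # may prompt for a CRAN mirror in an interactive session
  }
  requireNamespace(pkg, quietly = TRUE)   # TRUE if the package can now be loaded
}

ok <- ensure_installed("stats")   # already installed, so this just returns TRUE
```

After that, `library("stats")` (or any other package name) attaches the package for the current session.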

#### Common R functionality

In [28]:
# load an installed package
# library("car") # attach the "car" package

# get and set the working directory
current_wd <- getwd() # path of the current working directory
setwd(current_wd) # set the working directory (here, to the same path)
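
A slightly fuller sketch of why the working directory matters: relative file paths are resolved against it. The file name `note.txt` is made up for the example, and the session's temporary directory is used so nothing permanent is written:

```r
old_wd      <- getwd()     # remember where we started
scratch_dir <- tempdir()   # per-session temporary directory

setwd(scratch_dir)                 # relative paths now resolve inside scratch_dir
writeLines("hello", "note.txt")    # written to scratch_dir/note.txt
found <- file.exists(file.path(scratch_dir, "note.txt"))   # TRUE

unlink("note.txt")   # clean up
setwd(old_wd)        # restore the original working directory
```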

#### Reading and processing data
- Read `.csv` files into data frames with `read.table("myfile.csv", header = TRUE, sep = ",")`
- Or: read `.csv` files with `read.csv("myfile.csv")`, which just invokes `read.table` with some sensible defaults

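
To see that the two calls agree on a simple file, here is a sketch that writes a tiny CSV to a temporary file (the contents are invented for the example) and reads it both ways:

```r
# Write a small CSV to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,score", "a,1.5", "b,2.5"), csv_path)

via_table <- read.table(csv_path, header = TRUE, sep = ",")
via_csv   <- read.csv(csv_path)

same_result <- isTRUE(all.equal(via_table, via_csv))   # TRUE for this file
unlink(csv_path)                                       # remove the temporary file
```

The two can differ on messier input, since `read.csv` also changes defaults such as `quote`, `fill`, and `comment.char`.
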
#### Displaying data
- Use `head(myTable, n = 10)` to show the first 10 rows
- Use `tail(myTable, n = 10)` to show the last 10 rows

(If the output is cut off, you might need to raise the print limit with `options(max.print = ...)`. Note that `max.print` counts individual entries, i.e. rows times columns, not rows, so wide tables need a larger value.)
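
A quick sketch using `mtcars`, a data set that ships with R, in place of `myTable`:

```r
first_rows <- head(mtcars, n = 3)   # first 3 rows
last_rows  <- tail(mtcars, n = 3)   # last 3 rows

nrow(first_rows)   # 3
nrow(last_rows)    # 3
```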

In [32]:
KAPdata <- read.csv("data/KAPJTerm.csv", stringsAsFactors = TRUE)
head(KAPdata, n=3)

| | id | start | end | record | prog | duration | X18. | US. | sex | age | ⋯ | vote2Num | vote3Num | vote4Num | knowSum | knowInfSum | impactSum | anxSum | threatSum | behavSum | behav7Sum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<int>` | `<int>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | ⋯ | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 1 | ID0001 | 3/20/20 17:04 | 3/20/20 17:13 | 3/20/20 17:13 | 100 | 535 | Yes | Yes | Male | 18-29 years old | ⋯ | 0 | 0 | 0 | 12 | 9 | 15 | 7 | 9 | 10 | 3 |
| 2 | ID0002 | 3/20/20 17:07 | 3/20/20 17:20 | 3/20/20 17:20 | 100 | 771 | Yes | Yes | Female | 40-49 years old | ⋯ | 1 | 1 | 1 | 13 | 7 | 18 | 12 | 9 | 7 | 4 |
| 3 | ID0003 | 3/20/20 17:09 | 3/20/20 17:20 | 3/20/20 17:20 | 100 | 668 | Yes | Yes | Female | 30-39 years old | ⋯ | 1 | 1 | 1 | 12 | 9 | 9 | 7 | 4 | 5 | 4 |


In [34]:
str(KAPdata) # shows STRucture of data
# Data types:
# Factors -> categorical data -> a fixed set of levels (labels), stored as integer codes
# Logical -> boolean -> TRUE, FALSE, or NA (missing)

'data.frame':	5485 obs. of  167 variables:
 $ id            : Factor w/ 5485 levels "ID0001","ID0002",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ start         : Factor w/ 3961 levels "3/20/20 17:04",..: 1 2 3 1 5 4 6 7 8 9 ...
 $ end           : Factor w/ 3922 levels "3/20/20 17:13",..: 1 3 3 4 5 6 7 8 9 10 ...
 $ record        : Factor w/ 3974 levels "3/20/20 17:13",..: 1 2 2 3 4 5 6 7 8 9 ...
 $ prog          : int  100 100 100 100 100 100 100 100 100 100 ...
 $ duration      : int  535 771 668 1130 567 853 430 685 800 871 ...
 $ X18.          : Factor w/ 1 level "Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ US.           : Factor w/ 1 level "Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ sex           : Factor w/ 4 levels "Female","Male",..: 2 1 1 1 1 1 1 1 1 1 ...
 $ age           : Factor w/ 7 levels "18-29 years old",..: 1 3 2 4 5 5 2 1 3 2 ...
 $ trip_int      : Factor w/ 5 levels "0 times","1-2 times",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ trip_dom      : Factor w/ 5 levels "0 times","1-2 times",..: 1 1 1 1 1 1 1 2 1 1 ..
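
The `1 2 3 ...` values that `str` shows for each factor are integer codes pointing into its levels. A small sketch with made-up data showing that relationship, plus how `NA` behaves in a logical vector:

```r
# A factor stores each value as an integer code into its (sorted) levels
sizes <- factor(c("low", "high", "low", NA))
levels(sizes)       # "high" "low"
as.integer(sizes)   # 2 1 2 NA

# Logical vectors hold TRUE, FALSE, or NA
answered <- c(TRUE, FALSE, NA)
is.na(answered)               # FALSE FALSE TRUE
sum(answered, na.rm = TRUE)   # 1 (TRUE counts as 1, the NA is dropped)
```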

In [36]:
# View column names
colnames(KAPdata) # prints all the column names

In [48]:
# How do we restrict our view to certain rows and columns?
# Subsetting:
# KAPdata[row_start:row_end, col_start:col_end] -> subset by row and column ranges

# R is 1-indexed!
KAPdata[1:5, 1:3] # both the start index and the end index are inclusive
# data frame with just rows 1-5, columns 1-3

| | id | start | end |
|---|---|---|---|
| | `<fct>` | `<fct>` | `<fct>` |
| 1 | ID0001 | 3/20/20 17:04 | 3/20/20 17:13 |
| 2 | ID0002 | 3/20/20 17:07 | 3/20/20 17:20 |
| 3 | ID0003 | 3/20/20 17:09 | 3/20/20 17:20 |
| 4 | ID0004 | 3/20/20 17:04 | 3/20/20 17:23 |
| 5 | ID0005 | 3/20/20 17:36 | 3/20/20 17:46 |
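
Ranges are only one way to index. A sketch on a small made-up data frame (`scores` is hypothetical, not the survey data) showing positions, column names, and negative indices:

```r
scores <- data.frame(
  id   = c("a", "b", "c", "d"),
  math = c(90, 72, 85, 60),
  art  = c(70, 88, 95, 81)
)

by_position <- scores[1:2, 1:2]           # rows 1-2, columns 1-2
by_name     <- scores[, c("id", "art")]   # all rows, columns chosen by name
without_1st <- scores[-1, ]               # a negative index drops that row

dim(by_position)   # 2 2
dim(by_name)       # 4 2
dim(without_1st)   # 3 3
```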


In [58]:
KAPdata$id # access the data for a single column alone
# alternatively,
KAPdata[,"id"]

# filters / logical operators
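
A minimal sketch of filtering with logical operators, using a small made-up data frame rather than the survey data. A logical vector in the row position keeps only the rows where it is `TRUE`:

```r
scores <- data.frame(
  id   = c("a", "b", "c", "d"),
  math = c(90, 72, 85, 60),
  art  = c(70, 88, 95, 81)
)

# & is element-wise AND, | is element-wise OR, ! negates
both_high   <- scores[scores$math >= 80 & scores$art >= 80, ]
either_high <- scores[scores$math >= 80 | scores$art >= 80, ]
not_a       <- scores[!(scores$id == "a"), ]

both_high$id     # "c"
either_high$id   # "a" "b" "c" "d"
```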


Attaching package: ‘psych’


The following object is masked from ‘package:car’:

    logit


“no non-missing arguments to min; returning Inf”

| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| id* | 1 | 5485 | 2743.000000 | 1583.5274442 | 2743 | 2743.000000 | 2032.6446 | 1 | 5485 | 5484 | 0.00000000 | -1.2006564 | 21.381456140 |
| start* | 2 | 5485 | 2081.305196 | 1120.3725655 | 2086 | 2099.459330 | 1421.8134 | 1 | 3961 | 3960 | -0.08859614 | -1.1639116 | 15.127743416 |
| end* | 3 | 5485 | 2057.330902 | 1109.5473406 | 2057 | 2074.206425 | 1417.3656 | 1 | 3922 | 3921 | -0.07977853 | -1.1711755 | 14.981576659 |
| record* | 4 | 5485 | 2069.710848 | 1117.3021361 | 2066 | 2085.395306 | 1414.4004 | 1 | 3974 | 3973 | -0.07507275 | -1.1698738 | 15.086285183 |
| prog | 5 | 5485 | 96.725980 | 10.8800330 | 100 | 100.000000 | 0.0000 | 51 | 100 | 49 | -3.23621140 | 9.1244696 | 0.146906799 |
| duration | 6 | 5485 | 806.855059 | 2077.1240861 | 644 | 662.335612 | 198.6684 | 175 | 86979 | 86804 | 29.84943658 | 1060.8499523 | 28.046206402 |
| X18.* | 7 | 5485 | 1.000000 | 0.0000000 | 1 | 1.000000 | 0.0000 | 1 | 1 | 0 | | | 0.000000000 |
| US.* | 8 | 5485 | 1.000000 | 0.0000000 | 1 | 1.000000 | 0.0000 | 1 | 1 | 0 | | | 0.000000000 |
| sex* | 9 | 5485 | 1.429170 | 0.5441374 | 1 | 1.389383 | 0.0000 | 1 | 4 | 3 | 1.04996519 | 1.7596311 | 0.007347172 |
| age* | 10 | 5485 | 3.582680 | 1.3957680 | 4 | 3.630895 | 1.4826 | 1 | 7 | 6 | -0.20387351 | -0.7785576 | 0.018846248 |


In [60]:
# descriptive statistics, courtesy of the `psych` package
library("psych")
describe(KAPdata$drinkNum)

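| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| X1 | 1 | 5485 | 1.535825 | 1.645131 | 1 | 1.385054 | 1.4826 | 0 | 5 | 5 | 0.4087615 | -1.399164 | 0.02221326 |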
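
Several of these columns can be reproduced with base R alone. The sketch below uses a short made-up vector, not the survey data, and assumes that the `se` column is the standard error of the mean, i.e. `sd / sqrt(n)`:

```r
drinks <- c(0, 1, 1, 2, 5, NA)   # made-up values with one missing entry

n_valid  <- sum(!is.na(drinks))          # 5: number of non-missing values
avg      <- mean(drinks, na.rm = TRUE)   # 1.8
middle   <- median(drinks, na.rm = TRUE) # 1
spread   <- sd(drinks, na.rm = TRUE)     # sample standard deviation
std_err  <- spread / sqrt(n_valid)       # standard error of the mean (assumed definition)
```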


In [63]:
# Erroneous code: 
# It doesn't make much sense to get descriptive statistics for a categorical variable.
describe(KAPdata$educ)

“argument is not numeric or logical: returning NA”


ERROR: Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm): Calling var(x) on a factor x is defunct.
  Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.
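
For a categorical variable, counts and proportions per level are usually more informative than a mean. A minimal sketch with a made-up factor (the labels are invented, not the survey's actual `educ` levels):

```r
educ_example <- factor(c("HS", "BA", "BA", "MA", "BA"))   # hypothetical data

counts <- table(educ_example)   # number of observations per level
shares <- prop.table(counts)    # the same counts as proportions

counts[["BA"]]   # 3
shares[["BA"]]   # 0.6
```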
