
Statistical Programming with R
========================================================

WS2020

Prof. Dr. Christoph Flath

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#R-Syntax" data-toc-modified-id="R-Syntax-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>R Syntax</a></span></li><li><span><a href="#Vectors" data-toc-modified-id="Vectors-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Vectors</a></span></li><li><span><a href="#Special-atomic-data-types" data-toc-modified-id="Special-atomic-data-types-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Special atomic data types</a></span></li><li><span><a href="#Data-frames" data-toc-modified-id="Data-frames-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Data frames</a></span></li><li><span><a href="#Lists" data-toc-modified-id="Lists-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Lists</a></span></li><li><span><a href="#R-Functions" data-toc-modified-id="R-Functions-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>R Functions</a></span></li></ul></div>

In [0]:
set.seed(1)

R Syntax
--------------------------------------------
* Lightweight, script-oriented syntax
* No statement terminator is required at line ends (e.g., no ";" as in Java)
* Common mathematical operators are used: `+`, `-`, `*`, `/`, `^`
* Whitespace typically carries no meaning
* Round brackets: Function arguments, e.g., `mean(x)`
* Square brackets: Array locations, e.g., `playerList[1]`
* Curly brackets: Code blocks (e.g., around loops or logical structures)
* Values are assigned using `<-` or `=`
* Comments are introduced by `#`

### Package Management

* Packages are collections of R functions, data, and compiled code
* Numerous packages are available for download and installation
* Installing and loading is done via the GUI or the console (script)
* Example for installing and loading the `tidyverse` package
    * With root privileges
        * `install.packages("tidyverse")`
        * `library(tidyverse)`
    * Without root privileges
        * `install.packages("tidyverse", lib="/data/Rpackages/")`
        * `library(tidyverse, lib.loc="/data/Rpackages/")`

### R Objects and Data Types
* Everything in R is an object
* Typing is dynamic rather than strict
* Basic data types available in R:
    * Numeric: decimal numbers `2.3`
    * Integer: integer numbers `2L` (a bare `2` is stored as numeric)
    * Logical: true / false values `TRUE`
    * Character: single or multiple characters `"HOUSE"`
    * (Ordered) Factor: discrete values from a pre-defined scale `"good", "medium", "bad"`
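
As a quick check of the basic types listed above, the following sketch inspects a few values with `class()`. The variable names are illustrative only and do not appear elsewhere in these notes.

```r
# class() reports the basic type of each value
num <- 2.3          # numeric (double)
int <- 2L           # integer: the L suffix is required
lgl <- TRUE         # logical
chr <- "HOUSE"      # character
# ordered factor with an explicit scale from worst to best
fct <- factor(c("good", "bad", "medium"),
              levels = c("bad", "medium", "good"), ordered = TRUE)

class(num)          # "numeric"
class(int)          # "integer"
class(2)            # "numeric", not "integer"
class(lgl)          # "logical"
class(chr)          # "character"
class(fct)          # "ordered" "factor"
fct[1] > fct[2]     # TRUE: ordered factors can be compared (good > bad)
```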

Vectors
--------------------------------------------
* A vector is a sequence of data elements of the same basic type
    * Members of a vector are called components
    * Vectors are initialized using the `c()` command

In [0]:
numericVector <- c(1.2, 1.3, 1.4)

* The number of members in a vector is given by the length function:

In [0]:
length(numericVector)

### Some vector creation helpers: Sequences

The `seq` function has three main arguments:
* `from`: starting point
* `to`: end point
* `by`: increment

In [0]:
seq(10, 100, by=10)

The colon operator, as in `1:10`, is shorthand for `seq(1, 10, by=1)`


### Some vector creation helpers: Repetitions / Replications

The `rep` function has two main arguments:
* `x`: object to be replicated
* `times`: how often

In [0]:
rep(10, 10) 

Note: The first argument can itself be a vector (e.g., a sequence)
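
A minimal sketch of replicating a whole vector. The `each` argument is not covered above but is part of base R's `rep`; the variable names are illustrative.

```r
whole   <- rep(c(1, 2, 3), times = 2)   # whole vector repeated: 1 2 3 1 2 3
each_el <- rep(c(1, 2, 3), each = 2)    # each element repeated: 1 1 2 2 3 3
whole
each_el
rep(seq(10, 30, by = 10), times = 2)    # a sequence as first argument
```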


### Vector Recycling

* If a vector is too short for an operation, R recycles its values from the beginning
* If the longer length is not a multiple of the shorter length, R raises a warning


In [0]:
x = c(10,11,12,13)
x + c(0,1)
x + c(0,1,2)

### Subsetting Vectors

* Vector subsetting is done via the `[]` operator
* Positive integers return elements at the specified positions (whitelisting), identical indexes will yield duplicates

In [0]:
x=LETTERS
x[1:3]
x[c(1,1)]

* Negative integers suppress elements at the specified positions (blacklisting)
* Positive and negative integers cannot be mixed in a single subsetting call!

In [0]:
x[-seq(1,26,2)]
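
The restriction on mixing signs can be seen in a small sketch. `tryCatch` is used here only to capture the error so that the cell keeps running; the exact wording of the message depends on the R version.

```r
x <- LETTERS
x[-(1:3)][1:2]   # drop the first three, then keep the first two: "D" "E"
# mixing signs in one index vector is an error; capture its message
msg <- tryCatch(x[c(-1, 2)], error = function(e) conditionMessage(e))
msg
```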

* Logical vectors return all positions where the vector is `TRUE`. This is the most useful approach once we master logical expressions

In [0]:
x[x<"D"]

### Sorting Vectors

In [0]:
x = c(4, 3, 6, 2, 1, 10, 5, 8, 9, 7)

In [0]:
sort(x)

In [0]:
sort(x, decreasing = T)

In [0]:
order(x)
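
The difference between `sort` and `order` is easy to miss: `sort` returns the sorted values, while `order` returns the positions that would sort the vector. A short sketch, with `tags` as an illustrative second vector:

```r
x <- c(4, 3, 6, 2, 1, 10, 5, 8, 9, 7)
idx <- order(x)              # positions of the elements in ascending order
idx[1]                       # 5: the smallest value (1) sits at position 5
identical(x[idx], sort(x))   # TRUE: indexing by order() reproduces sort()
# order() is handy for arranging one vector by the values of another
tags <- letters[1:10]
tags[order(x)]
```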

### Sampling from Vectors

In [0]:
sample(1:6, 5) #sample from a vector (no replacement)
sample(1:6, 5, replace = T) #sample from a vector (replacement)

## Special atomic data types

### NA Values
* Oftentimes we will have missing or corrupt data
* These will often pop up as `NA` entries
* Running analyses on these values can cause problems

In [0]:
set.seed(1)
v = sample(c(NA,1,2,3),8,replace = T)
mean(v)

### Handling NA values
* Identify and remove NA values

In [0]:
is.na(v)
mean(v[!is.na(v)])

* Replace NA values

In [0]:
v.na.replaced <- v
v.na.replaced[is.na(v)] <- 0
mean(v.na.replaced)
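
Many base R summary functions also accept `na.rm = TRUE`, which drops missing values before computing. A minimal sketch on an illustrative vector `w` (not the sampled vector above):

```r
w <- c(NA, 1, 2, 3, NA, 2)   # illustrative data
mean(w)                      # NA: missing values propagate
mean(w, na.rm = TRUE)        # 2: computed over the four observed values
sum(is.na(w))                # 2: number of missing entries
```

Note that dropping and zero-replacing give different means, so the choice should follow from what a missing value represents in the data.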

### Logical Expressions
* A AND B: `A & B`
* A OR B: `A | B`
* IF-THEN-ELSE: `ifelse(v>=0, "+", "-")`


* Conditions may include the following:
    * `==` equal
    * `!=` not equal to
    * `> (>=)` greater than (or equal to)
    * `< (<=)` less than (or equal to)
    * `%in%` left element is contained in the vector on the right
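
The operators above all work element by element on vectors. A small sketch on an illustrative vector `vals`:

```r
vals <- c(-2, 0, 3, 5)
vals > 0 & vals < 5                  # FALSE FALSE  TRUE FALSE
vals < 0 | vals == 5                 #  TRUE FALSE FALSE  TRUE
vals %in% c(0, 5)                    # FALSE  TRUE FALSE  TRUE
signs <- ifelse(vals >= 0, "+", "-") # "-" "+" "+" "+"
signs
vals[vals != 0 & vals %in% c(-2, 3)] # -2 3: conditions used for subsetting
```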

### Standard statistical expressions

In [0]:
v = sample(1:6, 5, T)
mean(v)
max(v)
min(v)

In [0]:
length(v)
length(v[v>3])
quantile(v)

### Extracting Information from Objects

In [0]:
str(v) #structure
summary(v) #summary
class(v) #class
head(v, 5) #first 5 elements

In [0]:
length(v) #number of elements
dim(v) #dimensionality (NULL for a plain vector)
unique(v) #unique values

### Character vector manipulation
* The command `paste` concatenates the character representations of its argument vectors element by element; `sep` specifies the separator

In [0]:
paste(LETTERS[1:8], 1:4, sep="!") 

* `paste0` offers a shortcut for `sep=""`

In [0]:
paste(rep("A", 5), 1:4, sep="")
paste0(rep("A", 5), 1:4)
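
By default `paste` returns one string per element. To obtain a single string, base R's `collapse` argument (not used above) can be added, as in this sketch:

```r
parts  <- paste("A", 1:3, sep = "-")                    # "A-1" "A-2" "A-3"
joined <- paste("A", 1:3, sep = "-", collapse = "+")    # "A-1+A-2+A-3"
parts
joined
```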

### Text and string operations

* Extracting substrings

In [0]:
substr("ABCDEFG",
       start=2, stop=3)

* Constructing compound strings with `sprintf()`

In [0]:
sprintf("%s is %f feet tall", "Sven", 7.1)

* Splitting a string on a separator

In [0]:
x <- "Split words."
strsplit(x," ")

* String length

In [0]:
nchar("ABC")

* Check if string `x` contains expression `pattern`

In [0]:
grepl(pattern = "aus",
      x = "Nikolaus")
grepl(pattern = "haus",
      x = "Nikolaus")

* Replacement

In [0]:
gsub(pattern = "aus",
     replacement = "aushaus",
     x = "Nikolaus")

### `lubridate`: temporal data made easy

In [0]:
library(lubridate)

* Parsing time strings

In [0]:
a = ymd("20110720")
c = dmy("31/08/2011")
arrive <- ymd_hms("2011-06-04 12:00:00")
leave <- ymd_hms("2011-08-10 14:00:00")

a
c
arrive
leave

* Extracting time details

In [0]:
wday(arrive) #also second / hour ...

In [0]:
wday(arrive, label = TRUE)

* Time intervals

In [0]:
A <- interval(arrive, leave)
B <- interval(a, c)
int_overlaps(A,B)

### Programming task
* Create two vectors reflecting 100 throws of two standard dice by sampling from 1:6.
    * How often did die 1 show the higher number?
    * How often did die 1 show a number which was at least 3 larger than die 2?
    * Compare the ten highest throws of the two dice

* Create a third vector reflecting the sum of the two other throws
    * Determine the five highest and the five lowest combined scores


In [0]:
set.seed(1)
dice1 <- sample(1:6, 100, replace=TRUE)
dice2 <- sample(1:6, 100, replace=TRUE)
sum(dice1>dice2)
sum(dice1>dice2+2)
head(sort(dice1, decreasing=TRUE), 10)
head(sort(dice2, decreasing=TRUE), 10)

dice3 <- dice1+dice2
head(sort(dice3, decreasing= TRUE), 5)
head(sort(dice3, decreasing=FALSE), 5)

### Matrices
* A matrix is a collection of data elements arranged in a two-dimensional rectangular layout
* Matrices are created in R with the matrix function

In [0]:
A = matrix(c(2,4,3,1,5,7), nrow=2, ncol=3, byrow = TRUE)
A
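
Matrix elements are addressed by row and column inside single square brackets. A short sketch using the same matrix as above:

```r
A <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 2, ncol = 3, byrow = TRUE)
dim(A)        # 2 3: rows, columns
A[2, 3]       # 7: row 2, column 3
A[1, ]        # 2 4 3: the whole first row
A[, 2]        # 4 5: the whole second column
t(A)          # transpose, a 3 x 2 matrix
```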

Data frames
--------------------------------------------
* A data frame is used for storing data tables
* It is a collection of vectors of equal length
* For example, here is a built-in data frame in R, called `mtcars`.

In [0]:
head(mtcars, 5) #first 5 elements

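| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.44 | 17.02 | 0 | 0 | 3 | 2 |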


* The top line of the table, called the header, contains the column names
* Each horizontal line afterward denotes a data row, which begins with the name of the row, followed by the actual data
* Each data member of a row is called a cell

### Addressing Data frames
* To retrieve data in a cell, we would enter its row and column coordinates in the single square bracket `[]` operator

In [0]:
mtcars[1, 2]
mtcars["Mazda RX4", "cyl"]

### Subsetting Data frames
* Subsetting of data frames in base R relies on addressing using vectors instead of single values
* Individual columns are accessed and created using the `$` operator

In [0]:
head(mtcars$mpg, 5)
# mpg is measured in miles per US gallon, so l/100km = 235.215 / mpg
mtcars$l100km = 235.215 / mtcars$mpg
head(mtcars$l100km, 5)

* Going forward, data frames will be our most important object
* Base R is fairly clunky in operating on data frames. Soon we will explore the `tidyverse` which features very elegant and efficient ways for data frame manipulation
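
The vector-based addressing described in this section can be sketched as follows. The name `four_cyl` is illustrative; the columns are those of the built-in `mtcars` data.

```r
# rows selected by a logical condition, columns by a character vector of names
four_cyl <- mtcars[mtcars$cyl == 4, c("mpg", "hp")]
nrow(four_cyl)                      # 11 four-cylinder cars
# rows and columns selected by position
mtcars[1:3, 1:2]
# rows reordered by a column, highest mpg first
head(mtcars[order(mtcars$mpg, decreasing = TRUE), c("mpg", "cyl")], 3)
```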

### Programming task

* What is the mean / max / min mpg of the vehicles in `mtcars`?

* Create a new column which states the ratio of power vs. weight
    * Which is the best car according to this dimension?




In [0]:
mean(mtcars$mpg)
max(mtcars$mpg)
min(mtcars$mpg)

In [0]:
mtcars$powerratio = mtcars$hp / mtcars$wt
max(mtcars$powerratio)
rownames(mtcars)[which.max(mtcars$powerratio)] # car with the best power-to-weight ratio
mtcars["powerratio"]

| | powerratio |
|---|---|
| | `<dbl>` |
| Mazda RX4 | 41.98473 |
| Mazda RX4 Wag | 38.26087 |
| Datsun 710 | 40.08621 |
| Hornet 4 Drive | 34.21462 |
| Hornet Sportabout | 50.87209 |
| Valiant | 30.34682 |
| Duster 360 | 68.62745 |
| Merc 240D | 19.43574 |
| Merc 230 | 30.15873 |
| Merc 280 | 35.75581 |


### Creating Data Frames
* Manual specification of data frame df

In [0]:
n = c(2, 3, 5)
s = c("aa", "bb", "cc")
b = c(TRUE, FALSE, TRUE)
df = data.frame(n, s, b)
df

* Reading external data from text files `df = read.csv2("pathToData", header=TRUE, sep =";")`
* `read.csv2` also works on URLs of remote files, e.g.,

In [0]:
hsb2 <- read.csv2('https://raw.githubusercontent.com/rpruim/OpenIntro/master/data/hsb2.csv', header=T, sep=",")
head(hsb2,2)

### Extending Data Frames
* Extra rows: `rbind`

In [0]:

df1 = data.frame(numbers=c(1, 2, 3), letters=c("a", "b", "c"))
df2 = data.frame(numbers=c(4), letters=c("d"))
rbind(df1, df2)


* Extra columns: `cbind`

In [0]:
df1 = data.frame(numbers=c(1, 2, 3, 4))
df2 = data.frame(letters=c("a", "b", "c", "d"))
cbind(df1, df2)

Lists
--------------------------------------------
* A list is a generic vector that can contain objects of different types.
* We will typically avoid using this unstructured format but may often face it when obtaining data through web services

![lists](http://venus.ifca.unican.es/Rintro/_images/dataStructuresNew.png)

In [0]:
n = c(2, 3, 5)
s = c("aa", "bb", "cc", "dd", "ee")
b = c(TRUE, FALSE, TRUE, FALSE, FALSE)
x = list(n, s, b, 3)   # x contains copies of n, s, b
x

In [0]:
typeof(x[1]) # single brackets return a sub-list, so the type is "list"

* List elements: We retrieve list elements with the double square bracket `[[]]` operator, result is the original type

In [0]:
typeof(x[[1]])

* Named elements: Similar to data frame columns the list elements can be named and referenced using these names
* `unlist(x)` converts (flattens) a list of vectors into a single vector
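
A minimal sketch of named list elements and of `unlist`. The list `person` and its contents are illustrative only.

```r
person <- list(name = "Sven", height = 7.1, tags = c("a", "b"))
person$name            # "Sven": access by name with $
person[["height"]]     # 7.1: access by name with [[ ]]
names(person)          # "name" "height" "tags"
# unlist flattens a list of vectors and derives element names
flat <- unlist(list(a = 1:2, b = 3))
flat                   # values 1 2 3 with names a1 a2 b
```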

### Webservice example
* JSON call to the Google Places API: retrieve bars within 500 meters of Sanderring 2

In [2]:
key = "YOUR_API_KEY" # placeholder: insert your own Google Places API key
URL <- paste0("https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=49.7881799,9.93524&radius=500&types=bar&key=",key)
install.packages("RCurl")
install.packages("RJSONIO")
library(RCurl)
library(RJSONIO)
response_parsed <- fromJSON(getURL(URL,ssl.verifyhost = 0L, ssl.verifypeer = 0L))

* The Google web service returns a JSON document which is parsed by R into a list
* The parsed response is a list of three elements; the results are stored in the element `results`

In [3]:
names(response_parsed)

In [13]:
names(response_parsed$results[[1]])
response_parsed$results[[1]]$name
response_parsed$results[[1]]$geometry$location
response_parsed$results[[1]]$rating

### Another API

In [5]:
URL = "https://www.anapioficeandfire.com/api/characters?page=1&pageSize=200"
response_parsed_got <- fromJSON(getURL(URL,ssl.verifyhost = 0L, ssl.verifypeer = 0L))
names(response_parsed_got[[27]])
response_parsed_got[[27]]$name
response_parsed_got[[27]]$title
response_parsed_got[[27]]$died

## R Functions

* R allows us to easily and elegantly write virtually any function that we want to implement
* Basic setup for functions:
`fun <- function(arguments) {code}`
* Multiple calculation steps are separated by line breaks
* Without a `return()` statement, the value of the last evaluated expression is returned; use `return()` to return a value explicitly
* To return multiple values, combine them in a vector or list

In [8]:
fun <- function(x, y) {x+y}
fun(1,5)
fun2 <- function(x, y) {
  z1 <- 2*x + y
  z2 <- x + 2*y
  return(c(z1, z2))}
fun2(1,5)
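
When the return values differ in type or meaning, a named list is often easier to work with than a vector. The sketch below uses a hypothetical helper `describe`, which also shows a default argument value:

```r
# hypothetical helper: returns several named results in one list
describe <- function(x, digits = 1) {
  list(mean = round(mean(x), digits),
       range = range(x))
}
res <- describe(c(1, 2, 4))
res$mean                                # 2.3
res$range                               # 1 4
describe(c(1, 2, 4), digits = 0)$mean   # 2: default argument overridden
```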

###  Programming task

* Create a function `getNameAndRating(listing)` that returns the name and rating of a Google Places listing (from our earlier API call) as a data frame

* If no rating is available, return -1 using `ifelse`


In [23]:
# Return name and rating of a single listing as a one-row data frame;
# ifelse substitutes -1 when the listing has no rating
getNameAndRating <- function(listing) {
  data.frame(name = listing$name,
             rating = ifelse(is.null(listing$rating), -1, listing$rating))
}
print(getNameAndRating(response_parsed$results[[1]]))

               name rating
1 Cafe & Bar Neubau    4.5


In [0]:
typical <- function(x) { data.frame(mean=mean(x),median=median(x)) }


In [53]:
response_parsed$results[[2]]$name
response_parsed$results[[1]]$rating


### Applying functions to vectors / data frames / lists
* Very often we may be interested in performing the same operation on multiple entries
* While looping is possible in R, it should typically be avoided
* In base R the `apply` family covers this approach; in this course we use the more versatile and performant `map` function from the `purrr` package

`library(purrr)`

`map(.x, .f)`: for every element of `.x`, apply `.f`

* The base `map` function returns a list. There are special versions returning typed vectors or data frames, e.g., `map_chr`, `map_dbl`, `map_df`, ...
* For functions with two arguments use `map2(.x, .y, .f)`
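
A minimal sketch of the `map` variants named above, assuming the `purrr` package is installed. The input list `nums` is illustrative, and `map2_dbl` is the typed counterpart of `map2`.

```r
library(purrr)
nums <- list(a = 1:3, b = c(10, 20))   # illustrative input
map(nums, mean)                        # list with a = 2 and b = 15
means <- map_dbl(nums, mean)           # named numeric vector instead of a list
means
map_chr(nums, ~ paste(.x, collapse = "-"))   # "1-2-3" "10-20"
# two inputs processed in parallel, element by element
sums <- map2_dbl(c(1, 2, 3), c(10, 20, 30), ~ .x + .y)   # 11 22 33
sums
```

The `~ .x + .y` notation is purrr's shorthand for an anonymous function; an ordinary `function(x, y) x + y` works equally well.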


### @JennyBryan Lego illustrations
![optional caption text](https://github.com/chrisflath/ADS20/blob/master/figures/minis1.png?raw=1)

### map
![optional caption text](https://github.com/chrisflath/ADS20/blob/master/figures/minis2.png?raw=1)

### map2
![optional caption text](https://github.com/chrisflath/ADS20/blob/master/figures/minis3.png?raw=1)

### map2
![optional caption text](https://github.com/chrisflath/ADS20/blob/master/figures/minis4.png?raw=1)

### map2
![](https://github.com/chrisflath/ADS20/blob/master/figures/minis5.png?raw=1)

### map2
![](https://github.com/chrisflath/ADS20/blob/master/figures/minis6.png?raw=1)

### Programming task

* Leveraging our function, get names and ratings of all bars from our API call in a nice data frame
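
One possible approach, sketched on mock data so that it runs without an API key. The list `listings` is a hypothetical stand-in for `response_parsed$results`, and `getNameAndRating` is defined here as one way to solve the earlier task. It assumes `purrr` is installed.

```r
library(purrr)
# hypothetical stand-in for response_parsed$results; the second bar has no rating
listings <- list(list(name = "Bar A", rating = 4.5),
                 list(name = "Bar B"),
                 list(name = "Bar C", rating = 3.9))
# one-row data frame per listing, with -1 when the rating is missing
getNameAndRating <- function(listing) {
  data.frame(name = listing$name,
             rating = ifelse(is.null(listing$rating), -1, listing$rating))
}
# apply the function to every listing, then stack the rows into one data frame
bars <- do.call(rbind, map(listings, getNameAndRating))
bars
```

With the real response, `listings` would be replaced by `response_parsed$results`. The `map_df` variant mentioned above should give the same result in a single call, provided its dependencies are available.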