In [None]:
options(jupyter.rich_display = FALSE)

# GET LEVEL COUNTS 1

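Suppose we draw two sets of unique letters (sampled without replacement) and then build longer vectors by sampling from each set with replacement. One letter in each set is given zero probability, so it never appears in the longer vector: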

```R
RNGversion("3.3.1")
set.seed(30)
len <- 5
probs1 <- sample(c(0, rep(1, len - 1)))
levels1 <- sample(letters, len)
vec1 <- sample(levels1, 20, replace = T, prob = probs1)

probs2 <- sample(probs1)
levels2 <- sample(letters, len)
vec2 <- sample(levels2, 20, replace = T, prob = probs2)
```

```R
> levels1
[1] "d" "w" "f" "y" "z"

> vec1
 [1] "w" "f" "y" "z" "w" "z" "f" "z" "y" "f" "w" "y" "w" "f" "f" "f" "f" "f" "z"
[20] "f"

> levels2
[1] "r" "b" "j" "a" "g"

> vec2
 [1] "j" "a" "g" "j" "r" "a" "g" "j" "g" "r" "r" "r" "j" "j" "r" "g" "r" "r" "a"
[20] "g"
```

- So **vec1** is sampled out of **levels1**, but value **d** is missing in **vec1**
- And **vec2** is sampled out of **levels2**, but value **b** is missing in **vec2**

Create a function **getc** that takes two arguments:

- **vec**: A vector of values, possibly repeated
- **levs**: A vector of unique values

The function
- Should first create a factor named **fact** whose values come from **vec** and whose levels come from **levs**
- Then return a contingency table of **fact** (the count of values for each level) using the **table** function as such:

```R
> getc(vec = vec1, levs = levels1)
fact
d w f y z 
0 4 9 3 4 

> getc(vec = vec2, levs = levels2)
fact
r b j a g 
7 0 5 3 5 
```

Note that:
- since **d** is a level with no occurrences in **vec1**, its count is **0**
- since **b** is a level with no occurrences in **vec2**, its count is **0**

**Hint:** You should use the **factor** and **table** functions
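
The effect of supplying levels explicitly can be seen in a small sketch on toy data. The names `answers`, `choices` and `counts` are hypothetical and not part of the exercise; the point is only that a level with no occurrences still gets a count of 0 once it is declared as a level.

```R
# Toy data (hypothetical names, not part of the exercise)
answers <- c("yes", "no", "yes", "yes")
choices <- c("yes", "no", "maybe")

# Without explicit levels, only the observed values are tabulated
table(answers)

# With explicit levels, the unobserved "maybe" is kept with a count of 0,
# and the table follows the order given in `choices`
counts <- table(factor(answers, levels = choices))
counts
```

The order of the columns in the result follows the order of the supplied levels, not alphabetical order.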

In [None]:
RNGversion("3.3.1")
set.seed(30)
len <- 5
probs1 <- sample(c(0, rep(1, len - 1)))
levels1 <- sample(letters, len)
vec1 <- sample(levels1, 20, replace = T, prob = probs1)

probs2 <- sample(probs1)
levels2 <- sample(letters, len)
vec2 <- sample(levels2, 20, replace = T, prob = probs2)
levels1
vec1
levels2
vec2

getc <- function(vec, levs)
{
    fact <- factor(vec, levels = levs)
    table(fact)
}

getc(vec = vec1, levs = levels1)
getc(vec = vec2, levs = levels2)

# GET LEVEL COUNTS 2

Suppose we have two unordered factors as such:

```R
RNGversion("3.3.1")
set.seed(1)
len <- 5
levels1 <- sample(letters, len)
fact1 <- factor(sample(levels1, 20, replace = T), levels = levels1)
levels2 <- sample(letters, len)
fact2 <- factor(sample(levels2, 20, replace = T), levels = levels2)
```

```R
> fact1
 [1] e e u u g j g u j u n u e j u e j u g j
Levels: g j n u e

> fact2
 [1] j j j k h u u k u j h u u j j u k j u u
Levels: k a j u h
```

As you can see, the levels are not ordered (e.g. **j** is not larger than **g** in **fact1**, and **a** is not larger than **k** in **fact2**).
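
A short sketch on toy data shows what this means in practice. The names `sizes` and `res` are hypothetical; for an unordered factor, R does not evaluate relational comparisons such as `>` and returns `NA` with a warning instead.

```R
# Toy unordered factor (hypothetical names, not part of the exercise)
sizes <- factor(c("small", "large", "medium"),
                levels = c("small", "medium", "large"))

# The factor has levels but no ordering among them
is.ordered(sizes)

# A relational comparison is not meaningful here: R warns and returns NA
# for every element (the warning is suppressed only to keep the output short)
res <- suppressWarnings(sizes > "small")
res
```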

Please write a function named **getn** that takes two arguments:

- **fact**: A factor vector
- **lev**: A single character value which is a level of **fact**

The function should first make **fact** an ordered factor, keeping the existing order of its levels, and then return the count of values whose level is **larger than or equal to** **lev** as such:

```R
> getn(fact = fact1, lev = "j")
[1] 17

> getn(fact = fact2, lev = "u")
[1] 10
```

Hence:
- The levels larger than or equal to **j** in **fact1** are j, n, u and e, and their total count is 17
- The levels larger than or equal to **u** in **fact2** are u and h, and their total count is 10

**Hint:** You should use the **factor** function to make the factor ordered
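
As a minimal sketch of the idea on toy data (the names `sizes` and `ord` are hypothetical), an ordered factor can be compared against a single level, and summing the logical result gives a count. Passing `levels = levels(sizes)` is one way to make sure the existing level order, including any unused levels, is carried over.

```R
# Toy unordered factor (hypothetical names, not part of the exercise)
sizes <- factor(c("small", "large", "medium", "large"),
                levels = c("small", "medium", "large"))

# Make it ordered while keeping the existing level order: small < medium < large
ord <- factor(sizes, levels = levels(sizes), ordered = TRUE)

# Element-wise comparison against a level, then count the TRUE values
ord >= "medium"
sum(ord >= "medium")
```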

In [None]:
RNGversion("3.3.1")
set.seed(1)
len <- 5
levels1 <- sample(letters, len)
fact1 <- factor(sample(levels1, 20, replace = T), levels = levels1)
levels2 <- sample(letters, len)
fact2 <- factor(sample(levels2, 20, replace = T), levels = levels2)
fact1
fact2

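getn <- function(fact, lev)
{
    # Keep the existing level order (and any unused levels) when ordering
    ord <- factor(fact, levels = levels(fact), ordered = TRUE)
    # Count the values whose level is at or above lev
    sum(ord >= lev)
}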

getn(fact = fact1, lev = "j")
getn(fact = fact2, lev = "u")

# SPLIT INTO INTERVALS

Suppose we have two sets, each consisting of:
- A vector of numeric values
- A vector of breakpoint values to create intervals
- A vector of labels for intervals

as such:

```R
RNGversion("3.3.1")
set.seed(1)

rates1 <- rnorm(20, 2, 4)
brks1 <- quantile(rates1, seq(0, 1, len = 5))
brks1[length(brks1)] <- Inf
brks1[1] <- -Inf
labs1 <- letters[1:(length(brks1)-1)]

rates2 <- rnorm(10, 2, 4)
brks2 <- quantile(rates2, seq(0, 1, len = 4))
brks2[length(brks2)] <- Inf
brks2[1] <- -Inf
labs2 <- letters[1:(length(brks2)-1)]
```

```R
> rates1
 [1] -0.5058152  2.7345733 -1.3425144  8.3811232  3.3180311 -1.2818735
 [7]  3.9497162  4.9532988  4.3031254  0.7784465  8.0471247  3.5593729
[13] -0.4849623 -6.8587995  6.4997237  1.8202656  1.9352389  5.7753448
[19]  5.2848848  4.3756053

> brks1
       0%       25%       50%       75%      100% 
     -Inf 0.4625943 3.4387020 5.0361953       Inf 

> labs1
[1] "a" "b" "c" "d"

> rates2
 [1]  5.67590949  5.12854520  2.29825993 -5.95740678  4.47930299  1.77548504
 [7]  1.37681797 -3.88300954  0.08739978  3.67176624

> brks2
       0% 33.33333% 66.66667%      100% 
     -Inf  1.376818  3.671766       Inf 

> labs2
[1] "a" "b" "c"
```

Please create a function named **splitf** that takes three arguments:

- **values**: A vector of numeric values
- **breaks**: A vector of breakpoint values for intervals
- **labels**: A character vector of labels for intervals

The function should:

- First create a factor of intervals by passing **values**, **breaks** and **labels** to the **cut** function, so that each numeric value is assigned the level of the interval it falls in
- Then split the **values** vector into a list using the **split** function and the intervals factor as such:

```R
> splitf(values = rates1, breaks = brks1, labels = labs1)
$a
[1] -0.5058152 -1.3425144 -1.2818735 -0.4849623 -6.8587995

$b
[1] 2.7345733 3.3180311 0.7784465 1.8202656 1.9352389

$c
[1] 3.949716 4.953299 4.303125 3.559373 4.375605

$d
[1] 8.381123 8.047125 6.499724 5.775345 5.284885

> splitf(values = rates2, breaks = brks2, labels = labs2)
$a
[1] -5.95740678  1.37681797 -3.88300954  0.08739978

$b
[1] 2.298260 1.775485 3.671766

$c
[1] 5.675909 5.128545 4.479303
```

So the intervals for **rates1** are created from breakpoints $-\infty$, 0.4625943, 3.4387020, 5.0361953 and $\infty$ with labels a, b, c and d.

And **rates1** is split into a list so that each item holds the values that fall in one interval.

For **rates2**, intervals are created from breakpoints $-\infty$, 1.376818, 3.671766 and $\infty$ with labels a, b and c. By default **cut** builds right-closed intervals, so a value equal to a breakpoint falls in the lower interval: 1.37681797 is in **a** and 3.671766 is in **b**.

**Hint:** You should use the **cut** and **split** functions
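
The two steps can be sketched on toy data as follows. The names `x`, `brks`, `grp` and `parts` are hypothetical, and the sketch relies on the default `right = TRUE` behaviour of `cut`, under which each interval includes its upper breakpoint.

```R
# Toy data (hypothetical names, not part of the exercise)
x    <- c(1, 5, 10, 12)
brks <- c(-Inf, 5, 10, Inf)

# Step 1: assign each value to an interval; intervals are (-Inf, 5], (5, 10], (10, Inf]
grp <- cut(x, breaks = brks, labels = c("low", "mid", "high"))
grp

# Step 2: split the values into a list with one element per interval label
parts <- split(x, grp)
parts
```

Here 5 lands in "low" and 10 lands in "mid" because both sit on the closed upper end of their interval.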

In [None]:
RNGversion("3.3.1")
set.seed(1)

rates1 <- rnorm(20, 2, 4)
brks1 <- quantile(rates1, seq(0, 1, len = 5))
brks1[length(brks1)] <- Inf
brks1[1] <- -Inf
labs1 <- letters[1:(length(brks1)-1)]

rates2 <- rnorm(10, 2, 4)
brks2 <- quantile(rates2, seq(0, 1, len = 4))
brks2[length(brks2)] <- Inf
brks2[1] <- -Inf
labs2 <- letters[1:(length(brks2)-1)]

rates1
brks1
labs1

rates2
brks2
labs2


splitf <- function(values, breaks, labels)
{
    split(values, cut(values, breaks = breaks, labels = labels))
}

splitf(values = rates1, breaks = brks1, labels = labs1)
splitf(values = rates2, breaks = brks2, labels = labs2)