# Variables, Types, Operators

---

## Basics of programming for Data Science and Machine Learning


Applied Mathematical Modeling in Banking

---

## Table of contents

1. Vectors
2. Matrices
3. Factors
4. Data frames
5. Lists
6. Apply functions family

---

# 1. Vectors

### Declaring vectors

A vector is a basic data type in `R` that stores a collection of elements of the same type. A vector is created with the `c()` function, or without it when the values form a sequence (for example, `1:5`).

_Note. In essence, the function `c()` allows you to combine several vectors._

Consider for example the usual variable `x`:

In [None]:
x <- 10

In essence, `x` in this case is a vector consisting of a single value, `10`. We can also write several elements to the variable `x`:

In [None]:
x <- c(1, 2, 2.5, 3)
x

Vector elements can be of any basic type (`numeric`, `character`, `logical`, etc.), but all elements of one vector share the same type:

In [None]:
v1 <- c(1, 3, 4, 6, 7)
v2 <- c(T, F, F, T, F)
v3 <- c("Hello", "my", "friend", "!")

Vectors can also be built from sequences created with the functions `rep()` and `seq()` and the operator `:`:

In [None]:
vtr <-  2:7
vtr
vtr <- 7:2
vtr
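The cell above only shows the `:` operator, so here is a minimal sketch of the other two functions mentioned, `seq()` and `rep()`. The variable names `s1`, `s2`, `r1`, `r2` are chosen only for this illustration.

```r
# seq(): values from 1 to 10 with a step of 3
s1 <- seq(from = 1, to = 10, by = 3)
s1

# seq(): a fixed number of evenly spaced values between 0 and 1
s2 <- seq(from = 0, to = 1, length.out = 5)
s2

# rep(): repeat the whole vector twice
r1 <- rep(c(1, 2), times = 2)
r1

# rep(): repeat each element twice
r2 <- rep(c(1, 2), each = 2)
r2
```

Note the difference between `times` (the whole vector is repeated) and `each` (every element is repeated before moving on to the next one).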

If you need to combine several vectors, use the `c()` function:

In [None]:
x <- 2:3
y <- c(4,6,9)
z <- c(x, y, 10:12, 100)
z

You can view brief descriptive statistics for a vector using the **`summary()`** function:

In [None]:
summary(z)

---

### Operations on vectors

The advantage of vectors over storing each value in a separate variable is that a single operation can be applied to all elements of a vector, or to several vectors at once, for example addition or multiplication.

In [None]:
v1 <- c(1, 3, 5)
v2 <- c(2, 4, 6)
v1
v1 * 10
v1 + v2

From the example above, it can be seen that addition is essentially an element-wise sum of vectors: the 1st element of the vector `v1` is added to the 1st element of the vector `v2` (`1 + 2`), and so on. Thus, the resulting vector has the same length as the vectors `v1` and `v2`.

However, there may be a situation when one of the vectors has a shorter length or even consists of 1 element:

In [None]:
v1 <- c(1, 3, 5, 7)
v2 <- c(2, 4)
v1 + v2

In this case, the shorter vector `v2` is repeated until it matches the length of `v1`, so it behaves like `c(2, 4, 2, 4)`, and only then are the elements added. If one of the operands has a single element (for example, `v1 + 2`), that value is added to each element of `v1`. Thus, the resulting vector has the length of the longest of the vectors.

Consider a more complex case where there are vectors with different numbers of elements other than 1:

In [None]:
v1 <- c(2, 3)
v2 <- c(4, 5, 6, 7)
v3 <- c(1, 8, 9)
v1 + v2 + v3

To begin with, note that the interpreter warns that the length of the longer vector is not a multiple of the length of the shorter one (for vectors of length 2, 4, 8 there would be no warning).

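If you extend each vector to the length of the longest one, repeating the elements cyclically, you get the following set (the added elements are noted in the comments), whose element-wise sum is `7 16 17 11`:

```r
v1 <- c(2, 3, 2, 3) # the last two elements are added
v2 <- c(4, 5, 6, 7) # nothing is added
v3 <- c(1, 8, 9, 1) # the last element is added
```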
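The cyclic extension described above can be reproduced by hand with `rep_len()`. This is only a sketch to check the rule; the names `n`, `manual` and `automatic` are illustrative.

```r
v1 <- c(2, 3)
v2 <- c(4, 5, 6, 7)
v3 <- c(1, 8, 9)

# target length: the longest of the three vectors
n <- max(length(v1), length(v2), length(v3))

# rep_len() repeats the elements cyclically up to length n
manual <- rep_len(v1, n) + rep_len(v2, n) + rep_len(v3, n)
manual

# the automatic version gives the same values (the warning is silenced here)
automatic <- suppressWarnings(v1 + v2 + v3)
identical(manual, automatic)
```

Extending the vectors explicitly avoids the warning and makes the intent visible to the reader of the code.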

Subtraction (`-`), division (`/`) and multiplication (`*`) are performed similarly.

Relational and logical operators also act element by element on vectors, but the result is a vector of type `logical` with the values `TRUE`/`FALSE`.

Consider an example of finding all elements of the vector `v1` that are greater than the elements of the vector `v2` at the same positions:

In [None]:
v1 <- c(2, 4, 7, 9, 12)
v2 <- c(6, 4, 6, 7, 1)
v1 > v2

In essence, each element of one vector is compared with the element of the other vector at the same position: `2 > 6`, `4 > 4`, `7 > 6`, `9 > 7`, `12 > 1`.

Therefore, the previously studied operators (arithmetic, logical, relational) can be used to work with vectors as well.
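As a small illustration of logical operators on vectors, the sketch below combines two element-wise conditions and summarizes the result. The names `both`, `either` and `n_greater` are made up for this example.

```r
v1 <- c(2, 4, 7, 9, 12)
v2 <- c(6, 4, 6, 7, 1)

# & (and) and | (or) also work element by element
both <- v1 > v2 & v1 < 10
either <- v1 > 10 | v2 > 5
both
either

# sum() counts the TRUE values, because TRUE is treated as 1
n_greater <- sum(v1 > v2)
n_greater

# any() and all() collapse a logical vector into a single value
any(v1 > v2)
all(v1 > v2)
```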

### Naming vector elements

To make it clear what the values in a vector mean and what data they describe, analysts often label the data.

We will write down information about daily visits to the site by users during the week in the following way:

In [None]:
# Count of unique bank branch visits from Monday to Sunday
data <- c(1245, 2112, 1321, 1231, 2342, 1718, 1980)

Next, assign the days of the week as names of the values using the `names()` function:

In [None]:
names(data) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
print(data)

Alternatively, this code could be written as follows:

In [None]:
data <- c(1245, 2112, 1321, 1231, 2342, 1718, 1980)
days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
names(data) <- days
data

If we need to get information, for example, about the name of the 4th element of the vector, we can use the code:

In [None]:
names(data)[4]

The `names()` function allows you not only to set the names of vector elements, but also to obtain them.

---

## Access to vector elements

Elements of a vector are indexed from `1` to `n`, where `n` is the number of elements of the vector.

<div class = "alert alert-info alert-sm"> &nbsp; Note. In `R`, the indexing of array, vector, and all other collection types begins with <b>1</b>, not with <b class ="text-danger" style ="text-decoration: line-through">0</b>.</div>

Consider the previous example:

In [None]:
data <- c(1245, 2112, 1321, 1231, 2342, 1718, 1980)
days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
names(data) <- days

To get information only about site visitors on `Wednesday`, use the operator `[]` and specify the index of the element in the vector (or a condition on the names):

In [None]:
data[3]
data[names(data) == 'Wednesday']

If there is a need to get several elements of the vector that are out of order, you can do it like this:

In [None]:
some_days <- data[c(1, 2, 5)]
some_days

From the example above it is clear that the indices of the vector `data` are another vector `c(1, 2, 5)`, so it can be declared as a separate variable:

In [None]:
indexes <- c(1, 2, 5)
some_days <- data[indexes]
some_days

If there is a need to obtain information about several elements that are placed in a row, then for convenience (and in the case when such an array consists, for example, of 1000+ elements) use the operator `:`, for example:

In [None]:
working_days <- data[1:5]
working_days

Thus, all working days of the week are selected for the `working_days` vector.
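Besides positive indices, the `[]` operator also accepts names, negative indices and logical conditions. The sketch below reuses the visits vector; the names `weekend`, `without_monday` and `busy_days`, as well as the threshold `1900`, are assumptions made for illustration.

```r
data <- c(1245, 2112, 1321, 1231, 2342, 1718, 1980)
names(data) <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday")

# select by names
weekend <- data[c("Saturday", "Sunday")]
weekend

# a negative index drops the element at that position
without_monday <- data[-1]
without_monday

# a logical condition keeps only the elements where it is TRUE
busy_days <- data[data > 1900]
names(busy_days)
```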

---

### Useful functions

Let's take a look at some useful functions that simplify working with vectors. For further calculations we will use two vectors `A` and `B`:

In [None]:
A <- c(3, 5, 8, 2, 5, 4, 2)
B <- c(3, NA, 1, NA, 6, 4, 5)
A
B

<i class = "fa fa-sticky-note-o"> </i> **Function `sum()`**. This function is used to find the sum of the elements of the collection:

In [None]:
sum(A)
sum(B)

An interesting point is that when there are gaps in the data (the value `NA`), the sum cannot be calculated and the result is `NA`. In this case, the function can take the additional parameter `na.rm = T`, where `T` is an abbreviation of `TRUE`, which indicates that gaps must be removed from the data before performing the operation.

_Note. You should check the documentation for such a parameter in the function. If it is not present, then it is necessary to carry out cleaning in other ways before work with the data._

In [None]:
sum(B, na.rm = T)

<i class = "fa fa-sticky-note-o"> </i> **The `mean()`** function is used to find the arithmetic mean of numbers:

In [None]:
mean(A)
mean(B, na.rm = T)

<i class = "fa fa-sticky-note-o"> </i> **`min()` and `max()`** functions allow you to find the minimum and maximum values, respectively:

In [None]:
min(A)
max(A)

`R` also has a large number of built-in functions for statistical, econometric and other research in economics and beyond. Try the `sd()`, `cov()`, `cor()` functions.

<i class = "fa fa-sticky-note-o"> </i> **The `length()`** function helps to determine the "length" of a vector, i.e. the number of elements:
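A quick sketch of the three functions suggested above, using the same vectors `A` and `B`. Note that `cov()` and `cor()` do not take `na.rm`; gaps are handled through the `use` argument. The names `sd_b`, `cov_ab` and `cor_ab` are illustrative.

```r
A <- c(3, 5, 8, 2, 5, 4, 2)
B <- c(3, NA, 1, NA, 6, 4, 5)

# standard deviation; na.rm works the same way as in sum() and mean()
sd(A)
sd_b <- sd(B, na.rm = TRUE)
sd_b

# with the default settings a gap in B makes the result NA
cor(A, B)

# use = "complete.obs" keeps only the pairs where both values are present
cov_ab <- cov(A, B, use = "complete.obs")
cor_ab <- cor(A, B, use = "complete.obs")
cov_ab
cor_ab
```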

In [None]:
length(A)
length(B)

<i class = "fa fa-sticky-note-o"> </i> **The `unique()`** function identifies unique elements in a vector:

In [None]:
A
unique(A)

print("---")

B
unique(B)

<i class = "fa fa-sticky-note-o"> </i> **The `intersect()`** function allows you to find common elements of two vectors, so for vectors `A` and `B` the common values are `3`, `4` and `5`:

In [None]:
A
B
intersect(A, B)

Conversely, <i class = "fa fa-sticky-note-o"> </i> **the `union()`** function allows you to combine the elements of both sets / vectors:

In [None]:
A
B
union(A, B)

Try to understand the operation of the functions `setdiff()`, `setequal()`, `is.element()`.
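As a starting point for the exercise above, here is a minimal sketch of the three functions on the same vectors `A` and `B`. The result names are chosen for this example only.

```r
A <- c(3, 5, 8, 2, 5, 4, 2)
B <- c(3, NA, 1, NA, 6, 4, 5)

# elements of A that do not occur in B (duplicates are dropped)
only_in_A <- setdiff(A, B)
only_in_A

# do both vectors contain exactly the same set of values?
same_set <- setequal(A, B)
same_set

# is the value 8 an element of A? (the same check as 8 %in% A)
has_eight <- is.element(8, A)
has_eight
```

Keep in mind that `setdiff()` is not symmetric: `setdiff(A, B)` and `setdiff(B, A)` generally give different results.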

_I recommend reading the short materials here: https://stat.ethz.ch/R-manual/R-devel/library/base/html/sets.html_.

---

### Correction of data (NA, NaN, Inf)

In the process of working with data, problems arise with reading, converting and operating on it correctly. For example, an incorrect entry in a numeric field, such as `"10 USD"` instead of `10`, is converted to `NA` by `as.numeric()`, while division by `0` produces `Inf` (or `NaN` for `0 / 0`).

Before numerical and other data are used, values are usually cleaned and replaced, depending on the programming / research task. In `R` the following special values are possible:

- [x] `NA` - Not Available.
- [x] `NaN` - Not a Number.
- [x] `Inf` - Infinity (can have the sign `+` or `-`).

Let's start with a vector:

In [None]:
vtr <- c(1, -2, NA, NaN, Inf, 1223, -Inf, NA, 21) 
vtr

You can check each value of a vector for gaps and special values with the functions `is.na()`, `is.nan()`, `is.infinite()`, `is.finite()`.

In [None]:
is.na(vtr)
is.nan(vtr)
is.infinite(vtr)
is.finite(vtr) # if infinite == TRUE => finite == FALSE :)

Replacement of values can then be done as follows (we will replace all `NA` and `NaN` with `1000`):

In [None]:
vtr[is.na(vtr)] <- 1000
vtr

## NaN is also replaced, because is.na() returns TRUE for NaN!

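To replace `NaN` with a different value than `NA`, handle `NaN` first with `is.nan()` (here `NaN` becomes `500` and `NA` becomes `1000`):

In [None]:
vtr <- c(1, -2, NA, NaN, Inf, 1223, -Inf, NA, 21)
vtr[is.nan(vtr)] <- 500
vtr

vtr[is.na(vtr)] <- 1000
vtr

## NaN was already replaced above, so is.na() now matches only NA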

And then replace `Inf` with the `maximum` value in the vector, and `-Inf` with the `minimum`:

In [None]:
vtr <- c(1, -2, NA, NaN, Inf, 1223, -Inf, NA, 21) 
vtr

is.infinite(vtr)
!is.infinite(vtr)
vtr[!is.infinite(vtr)]
max(vtr[!is.infinite(vtr)], na.rm = T)

max(vtr, na.rm = T)
min(vtr, na.rm = T)

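vtr[vtr == Inf] <- max(vtr[is.finite(vtr)]) # max() of the finite values only
vtr[vtr == -Inf] <- min(vtr[is.finite(vtr)]) # min() of the finite values only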
vtr

If you want to replace `Inf` regardless of the sign, you can use `is.infinite()`.
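A minimal sketch of the sign-independent replacement: both `Inf` and `-Inf` are replaced in one step. The replacement value `0` is an arbitrary choice for this illustration.

```r
vtr <- c(1, -2, NA, NaN, Inf, 1223, -Inf, NA, 21)

# is.infinite() is TRUE for both Inf and -Inf and FALSE for NA and NaN
vtr[is.infinite(vtr)] <- 0
vtr

# no infinite values are left; NA and NaN are untouched
sum(is.infinite(vtr))
```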

---

## Tasks

#### Task 1

1. Create a vector of 10 random numbers in the range $[10;100]$
2. Replace all odd numbers with NA
3. Replace all NA with average value

#### Solution

In [None]:
x <- sample(10:100, size = 10)
x

In [None]:
x[x %% 2 != 0] <- NA
x

In [None]:
x[is.na(x)] <- mean(x, na.rm = T)
x

---

# 2. Matrices

### Creating matrices

**Matrix** - a collection of elements of the same type (`numeric`, `character`, `logical`) arranged in a fixed number of rows and columns. Since a matrix has only rows and columns, it is a two-dimensional data array.

The matrix is created using the `matrix()` function:

In [None]:
matrix(1:10, byrow = TRUE, nrow = 2)

where `1:10` is the set of elements of the matrix (it can also be a previously created vector: entered, calculated, read from a file, etc.); `byrow = TRUE` means that the elements are written into the matrix row by row, so the first row contains the values `1:5` and the second `6:10` (to fill the matrix column by column, use `byrow = FALSE`); `nrow` is the number of rows of the matrix.

In [None]:
sales1 <- c(12, 14, 15)
sales2 <- c(22, 15, 21)
sales <- c(sales1, sales2)
m <- matrix(sales, byrow= T, nrow = 2)
m
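To see the effect of the `byrow` argument, the sketch below builds two matrices from the same sales values. The names `by_rows` and `by_cols` are illustrative.

```r
sales <- c(12, 14, 15, 22, 15, 21)

# filled row by row: the first three values form the first row
by_rows <- matrix(sales, byrow = TRUE, nrow = 2)
by_rows

# filled column by column (the default): the first two values form the first column
by_cols <- matrix(sales, byrow = FALSE, nrow = 2)
by_cols
```

Both matrices have the same dimensions and the same values; only the placement of the values differs.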

---

### Naming matrices

To specify the names of rows and columns of the matrix, use the functions `rownames()` and `colnames()`:

In [None]:
m <- matrix(1:9, nrow = 3)
rownames(m) <- c("row1", "row2", "row3")
colnames(m) <- c("c1", "c2", "c3")
m

---

### Add rows and columns

The special functions `cbind()` and `rbind()` are used to change the number of columns and rows of matrices, as well as to quickly combine them.

<i class = "fa fa-sticky-note-o"></i> **The `cbind()`** function allows you to attach one or more matrices and/or vectors as new columns. Rows are matched by position, not by a key field, so the objects must have the same number of rows. Consider an example:

In [None]:
m1 <- matrix(c(1:3, 101:103), nrow = 3)
colnames(m1) <- c("A", "B")

m2 <- matrix(c(201:203, 1001:1003), nrow = 3)
colnames(m2) <- c("C", "D")

m_bind <- cbind(m1, m2)

m1
m2
m_bind
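The `rbind()` function works the same way for rows. The sketch below is an illustration with made-up values; the names `new_rows` and `m_rows` are assumptions.

```r
m1 <- matrix(c(1:3, 101:103), nrow = 3)
colnames(m1) <- c("A", "B")

# a second matrix with the same number of columns
new_rows <- matrix(c(4, 5, 104, 105), nrow = 2)
colnames(new_rows) <- c("A", "B")

# attach the rows of new_rows below the rows of m1
m_rows <- rbind(m1, new_rows)

# a plain vector is attached as a single row
m_rows <- rbind(m_rows, c(6, 106))
m_rows
dim(m_rows)
```

For `rbind()` the objects must have the same number of columns, just as `cbind()` requires the same number of rows.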

---

### Access to matrix elements

The elements of the matrix are accessed by the index of rows and columns. You can select ranges in a similar way to vectors.

Let's look at an example:

In [None]:
m <- matrix(11:25, nrow = 3)
m

To display the 10th element of the matrix, you can use the following notation _(note that elements are counted down the columns, starting from the top-left corner)_:

In [None]:
m[10]    
m[[10]]

To display the same element using row and column indexes, write as follows:

In [None]:
# Row #1
# Column #4
m[1,4]

**Question**: What notation should you use to get **18**?

**Answer**: `m[2,3]`

In [None]:
m[2,3]

If you want to output / use an entire row or an entire column, the index of the other dimension can be left blank:

In [None]:
m[1, ] # first row only
m[c(1,3), ] # first and third row only

In [None]:
m[, 1] # first column only
m[, c(1,3)] # first and third column only

You can also specify a list of rows and columns to be output / received simultaneously:

In [None]:
m[c(1,3), 2:4]

You can exclude individual columns or rows by using indexes with minus signs (`-`):

In [None]:
m[-1, c(-2:-3)]

---

### Useful functions

#### Matrix dimensions

To obtain information about the dimensions of a matrix, there are special functions: `nrow()`, `ncol()`, `dim()`:

In [None]:
# Declare matrix
m <- matrix(1:15, ncol = 3)
m

print(paste("Rows:", nrow(m)))
print(paste("Cols:", ncol(m)))

print(paste("Dim:", paste0(dim(m), collapse = " x ")))

Using `nrow()` and `ncol()` allows you to access the last row and column of the matrix, respectively:

In [None]:
m[nrow(m), ] # last row
m[, ncol(m)] # last column

---

# 3. Factors

Factors in `R` allow you to represent a vector of values as categorical values, rather than just a set of text data or numbers. The advantage of the categorical data type is that an element can take only a limited number of values, not any value that the underlying data type allows.

For example, a numeric vector may contain an unlimited variety of values `c(1, 0.021, 192.1444, ..., etc.)`, and character values may be just as varied: `c("sdf & Tg6", "sdf * Y & 65")`. The number of possible combinations in such vectors is very large.

In the case of categories, we are talking about certain fixed values. A good example is forms that are filled out on sites with drop-down lists, where the user cannot enter a value, but only select from an existing list. So in the gender field there is usually a limited set of possible options: `Male`, `Female`, `Other`. The user can select only one of these values and does not have the ability to enter something else _(this is an example, each resource can make different forms for users)_.

Creation of factors in `R` occurs by means of function **`factor()`**:

In [None]:
gender <- c("Male", "Female", "Other", "Male", "Female", "Male", "Female", "Female")
gender 
gender_factor <- factor(gender)
gender_factor

When creating a factor, each unique value is mapped to an integer code inside the collection; the unique values themselves are called levels (`levels`). In the previous example, the variable `gender_factor` received the levels `Female`, `Male`, `Other` in alphabetical order. If we convert the factor to numbers, we get:

In [None]:
as.numeric(gender_factor)

Thus it is clear that `Female` = 1, `Male` = 2, `Other` = 3. Consider a situation where the data we receive requires a different order of values in the factor, for example, `Male` = 1, `Female` = 2, `Other` = 3:

In [None]:
gender <- c("Male", "Female", "Other", "Male", "Female", "Male", "Female", "Female")
gender_factor <- factor(gender, levels = c("Male", "Female", "Other"))
gender_factor
lvl <- levels(gender_factor) # read the levels of the factor
lvl
seq_along(lvl) # integer codes that correspond to the levels
as.numeric(gender_factor) # integer code of each element

Now the order of the levels corresponds to ours and this will allow us to successfully combine our collection with similar ones that have the same set of values.

Sometimes it is necessary to change not only the order of the levels in a factor, but also their names. Let's consider a situation where we need to rename the values `Male`, `Female`, `Other` to `M`, `F`, `O`:

In [None]:
gender <- c("Male", "Female", "Other", "Male", "Female", "Male", "Female", "Female")
gender_factor <- factor(gender, levels = c("Male", "Female", "Other"))
levels(gender_factor) <- c("M", "F", "O")
gender_factor

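But you should check your type with `is.factor()` before converting to numbers:

In [None]:
cities <- c("Rivne", "Ostroh", "Zdolbuniv", "Dubno", "Sarny")
cities_as_factors <- factor(cities)
is.factor(cities_as_factors) # TRUE
is.factor(cities) # FALSE: this is a character vector
as.numeric(cities_as_factors) # integer codes of the levels
as.numeric(cities) # NA with a warning: a character vector of names cannot be converted to numbers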
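A related pitfall appears when a factor stores numbers as text: `as.numeric()` returns the level codes, not the original numbers. The sketch below uses a hypothetical `ratings` factor to show the usual workaround through `as.character()`.

```r
ratings <- factor(c("10", "50", "20", "50"))

# level codes, not the numbers themselves
codes <- as.numeric(ratings)
codes

# convert to character first to recover the original numbers
values <- as.numeric(as.character(ratings))
values
```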

---

# 4. Data frames

Data frames are the most popular data structure in R, because they allow you to collect columns of different types in one object and manipulate them quickly.

A `data frame` is a matrix-like structure whose columns may be of differing types (numeric, logical, factor, character and so on).

The function `data.frame()` creates data frames, tightly coupled collections of variables which share many of the properties of matrices and of lists, used as the fundamental data structure by most of R's modeling software.

**Syntax**

```{r}
data.frame(..., row.names = NULL, check.rows = FALSE,
           check.names = TRUE, fix.empty.names = TRUE,
           stringsAsFactors = FALSE)
```

**Arguments** (top useful)

- [x] `...` - these arguments are of either the form value or tag = value. Component names are created based on the tag (if present) or the deparsed argument itself.
- [x] `row.names` - NULL or a single integer or character string specifying a column to be used as row names, or a character or integer vector giving the row names for the data frame.
- [x] `stringsAsFactors` - logical: should character vectors be converted to factors? The ‘factory-fresh’ default was previously `TRUE` but was changed to `FALSE` in R 4.0.0.

**Details**

A `data frame` is a list of variables of the same number of rows with unique row names, given class `data.frame`. If no variables are included, the row names determine the number of rows.

`data.frame` converts each of its arguments to a data frame by calling `as.data.frame(optional = TRUE)`. As that is a generic function, methods can be written to change the behaviour of arguments according to their classes: R comes with many such methods. Character variables passed to `data.frame` are converted to `factor` columns unless protected by `I()` or the argument `stringsAsFactors` is false. If a list or data frame or matrix is passed to `data.frame` it is as if each component or column had been passed as a separate argument.

---

### Creating Data Frames

Data frames are usually created by reading a dataset from a file or by scraping websites. However, data frames can also be created explicitly with the `data.frame()` function or coerced from other types of objects such as lists. Here we create a simple data frame `df` and assess its basic structure:

In [None]:
df <- data.frame(id = 1:5,
                char_col = c("a", "b", "c", "d", "e"),
                log_col = c(T,T,T,F,T),
                double_col = c(2.1, 1, 0.5, pi, 12.7))
df

In [None]:
# assess the structure of a data frame
str(df)

In [None]:
# number of rows
nrow(df)

In [None]:
# number of columns
ncol(df)

To convert character columns to factors "on the fly", use `stringsAsFactors = TRUE`:

In [None]:
df <- data.frame(i = 1:5,
                char_col = c("a", "b", "c", "d", "e"),
                log_col = c(T,T,T,F,T),
                double_col = c(2.1, 1, 0.5, pi, 12.7),
                stringsAsFactors = TRUE) # warning it depends on local settings of R
df

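Data frames can also be created from lists (lists are explained in the next chapter):

In [None]:
# vectors that will become the columns
v_int <- 1:5
v_char <- c("a", "b", "c", "d", "e")
v_log <- c(T, T, T, F, T)
v_double <- c(2.1, 1, 0.5, pi, 12.7)
demo_list <- list(int_col = v_int,
                  char_col = v_char,
                  log_col = v_log,
                  double_col = v_double)
as.data.frame(demo_list)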

A matrix can be the basis for a data frame too:

In [None]:
demo_matrix <- matrix(100:119, nrow = 5, ncol = 4)
demo_matrix

as.data.frame(demo_matrix)

---

## Extending data frames

You can add rows and columns to a data frame. Merging two data frames by the values of a selected column is available too.

`cbind()` adds a new column:

In [None]:
df <-  data.frame(A1 = c("A", "B", "C"),
                  A2 = c("D", "E", "F"))
df

A3 = c(1, 2, 3)
cbind(df, A3)


colnames(df)
colnames(df) <- c("B1", "B2")
colnames(df)

`rbind()` adds a new row:

In [None]:
letters_frame <-  data.frame(A1 = c("A", "B", "C"),
                            A2 = 1:3)
letters_frame

next_row <- data.frame(A1 = "D", A2 = 4L) # a one-row data frame keeps each column's type; c("D", 4) would turn both values into text
rbind(letters_frame, next_row)

### Merge DF

Data frames can be merged by key with `merge()`:

In [None]:
df1 <- data.frame(Id = c(1:4),
                  Name = c("Nick", "Jake", "Jane", "Mary"))
df1

df2 <- data.frame(Id = c(2, 1, 3, 5), # different order from Id in df1
                  Age = c(34, 21, 45, 20))
df2

df_final <- merge(df1, df2, by = "Id", all.x = F, all.y = F)
df_final
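The `all.x` and `all.y` arguments control which unmatched rows are kept. The sketch below reuses the two data frames from the cell above; the names `left` and `full` are chosen for illustration.

```r
df1 <- data.frame(Id = c(1:4),
                  Name = c("Nick", "Jake", "Jane", "Mary"))
df2 <- data.frame(Id = c(2, 1, 3, 5),
                  Age = c(34, 21, 45, 20))

# keep every row of df1; Age is NA where df2 has no matching Id
left <- merge(df1, df2, by = "Id", all.x = TRUE)
left

# keep every row of both data frames
full <- merge(df1, df2, by = "Id", all = TRUE)
full
```

With `all.x = F, all.y = F`, as in the cell above, only the `Id` values present in both data frames remain.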

---

## Subsetting Data Frames

Data frames possess the characteristics of both lists and matrices: if you subset with a single vector, they behave like lists and will return the selected columns with all rows; if you subset with two vectors, they behave like matrices and can be subset by row and column:

In [None]:
df <- data.frame(int_col = 1:5,
                char_col = c("a", "b", "c", "d", "e"),
                log_col = c(T,T,T,F,T),
                double_col = c(2.1, 1, 0.5, pi, 12.7),
                row.names = paste0("row_", 1:5), # setting row names 
                stringsAsFactors = TRUE) # warning it depends on local settings of R
df

In [None]:
# select columns using $ sign
df$log_col

In [None]:
# subsetting by row numbers
df[1, ] # first row
df[nrow(df), ] # last row
df[-1, ] # all except first row

In [None]:
# subsetting by row names
df[c("row_4", "row_5"), ]

In [None]:
# subsetting columns like a list
df[, c("log_col", "double_col")]

In [None]:
# subset for both rows and columns
df[2:5, c(1, 3:4)]

**You can also subset data frames based on conditional statements**

In [None]:
# select only rows with double_col > 1
df[df$double_col > 1, ]

# select only rows with log_col == FALSE
df[!df$log_col, ]

In [None]:
"s" %in% c("s", "t")

In [None]:
# select only with char_col == 'a', 'e'
chars <- df$char_col %in% c("a", "e")
chars
sum(chars)
df[chars, ] # the %in% operator checks multiple values at once

In [None]:
# select only with double_col > 1 and log_col == TRUE
df[df$log_col == TRUE & df$double_col > 1, ]

In [None]:
# select only specific columns with double_col > 1 and log_col == TRUE
df[df$log_col == TRUE & df$double_col > 1, c("log_col", "int_col", "double_col")]

---

### Order data.frame

Let's use our previous sample `data.frame` but with unordered values:

In [None]:
df <- data.frame(int_col = c(1, 5, 3, 4, 2),
                char_col = c("b", "a", "a", "d", "e"),
                log_col = c(T,T,T,F,T),
                double_col = c(2.1, 1, 0.5, pi, 12.7),
                row.names = paste0("row_", 1:5), # setting row names 
                stringsAsFactors = TRUE) # warning it depends on local settings of R
df

You can use `order()` function for sorting `data.frames`.

In [None]:
# sort by char_col
order(df$char_col)
order(df$int_col)
df[order(df$char_col),]

Use `rev()` to reverse the order and sort descending (for numeric columns a minus sign `-` also works):

In [None]:
# sort by int_col descending
df[rev(order(df$int_col)), ]

You can also sort by multiple columns with `order(column1, column2)` or, for a numeric `column2` in descending order, `order(column1, -column2)`.
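A short sketch of sorting by two columns, using a reduced copy of the sample data frame. The name `sorted` is illustrative.

```r
df <- data.frame(int_col = c(1, 5, 3, 4, 2),
                 char_col = c("b", "a", "a", "d", "e"))

# char_col ascending, then int_col descending within equal char_col values
sorted <- df[order(df$char_col, -df$int_col), ]
sorted

# the minus sign works only for numbers; for text use decreasing = TRUE
df[order(df$char_col, decreasing = TRUE), ]
```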

### Manipulating `data.frames`


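Typical manipulations include:

- [x] type conversion;
- [x] conditional replacement with `ifelse()`;
- [x] creating new columns (for example, calculating age with `lubridate`);
- [x] removing missing values;
- [x] replacing missing values;
- [x] editing with `DataEditR`.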
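The manipulations listed above can be sketched with base `R` only. The `clients` data frame and all its values are invented for this illustration, and the mean is just one possible choice for filling gaps.

```r
clients <- data.frame(id = 1:4,
                      age = c("25", "41", NA, "33"), # numbers stored as text
                      income = c(1200, NA, 800, 1500))

# type conversion: text to numbers
clients$age <- as.numeric(clients$age)

# new column created with ifelse()
clients$age_group <- ifelse(clients$age >= 30, "30+", "under 30")

# replace missing income with the mean of the known values
clients$income[is.na(clients$income)] <- mean(clients$income, na.rm = TRUE)

# remove the rows that still contain missing values
clean <- na.omit(clients)
clean
```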

---

## Tasks on data.frames

#### Task 1

Write code that evaluates $y = x^2 + e$, where $x$ is a random number in the range $[0; 1]$.

Print the calculation result as a data.frame with columns `X`, `E`, `Y`.

Use the `plot()` function to visualize `X` vs `Y` as a line chart (type = `l` or `b`).

#### Solution

In [None]:
# initiate data.frame
df <- data.frame(X = 1:10,
                 E = sample(5, 10, replace = T),
                 Y = NA)
head(df)


In [None]:
df$Y <- with(df, X^2 + E)
head(df)

In [None]:
plot(df$X, df$Y, type="l", col = "blue")

---

#### Task 2

1. Install and load the package `ISLR`
2. Save dataset `Credit` into variable `credit_data`.
3. Check dataset structure with `str()` function.
4. Convert Student status "yes/no" to 1/0
5. Order dataset by `Rating` descending
6. Filter only `Age > 50` with `Rating > 400`, how many records do you get?
7. Evaluate the average `Income` for `Married = YES` and `Married = NO` with Age in the range [20;30]

    7.1 Do the same for Age [30;40]
    Any conclusion?

#### Solution

In [None]:
# 1. Install and load the package ISLR
#install.packages("ISLR")
library(ISLR)

In [None]:
# 2. Save dataset `Credit` into variable `credit_data`.
credit_data <- ISLR::Credit
head(credit_data, 3)

In [None]:
# 3. Check dataset structure with `str()` function.
str(credit_data)

In [None]:
# 4. Convert Student status "Yes/No" to 1/0

as.numeric(credit_data$Student) - 1

credit_data$Student <- as.character(credit_data$Student) # convert to character first / factors
credit_data$Student <- ifelse(credit_data$Student == "Yes", 1, 0)

head(credit_data)

In [None]:
# 5. Order dataset by `Rating` descending
credit_data <- credit_data[order(-credit_data$Rating), ]
head(credit_data)

In [None]:
# 6. Filter only `Age > 50` with `Rating > 400`
credit_data_filtered <- credit_data[credit_data$Age > 50 & credit_data$Rating > 400, ]
head(credit_data_filtered)
nrow(credit_data_filtered)

In [None]:
# 7. Evaluate average `Income` for `Married = YES` and `Married = NO` with Age in range [20;30]
# 7.1 Make the same for Age [30;40]
married <- with(credit_data, credit_data[(Age>=20 & Age <=30) & Married == "Yes", ])
head(married)

In [None]:
not_married <- credit_data[credit_data$Age %in% c(20:30) & credit_data$Married == "No", ]
head(not_married)

In [None]:
mean(married$Income) # is it better to be married? :)
mean(not_married$Income)

---

# 5. Lists

Lists are R objects that can contain elements of different types: numbers, strings, vectors and even other lists. A list can also contain a matrix or a function as an element. A list is created using the `list()` function.

Before we start, let's look at one more package, `lubridate`, for working with dates. It has many functions for parsing and manipulating dates. Check it with:

In [None]:
install.packages("lubridate")
??lubridate

For our example we need the function `ymd()`, which parses a character date in a format like "2012-10-25".

In [None]:
library(lubridate)
date1 <- ymd("2021-05-25")
date2 <- ymd("2021-05-27")

date1
date2

You can also use `ymd_hms()` to parse date and time correctly.

In [None]:
datetime <- ymd_hms("2021-05-25 11:05:12", tz = "UTC") # we need this for the client transactions below
datetime

### Creating a List

The following example creates a list containing vectors, strings, numbers and logical values. Our list will describe a model of a bank's client:

In [None]:
# initial values

set.seed(1) # for fixing pseudo-random

client_name <- "John Doe"
services <- c("credit", "deposite", "online-app")
is_active <- TRUE
transactions <- data.frame(contract_id = sample(10000:99999, size = 2, replace = T),# random numbers
                          datetime = c(ymd_hms("2021-05-25 11:05:12"),
                                      ymd_hms("2021-05-25 11:07:14"),
                                      ymd_hms("2021-05-25 11:08:02"),
                                      ymd_hms("2021-05-25 11:12:45"),
                                      ymd_hms("2021-05-25 11:47:00"),
                                      ymd_hms("2021-05-25 11:48:08")),
                         oper_type = sample(0:1, size=6, replace = T), # 1 for debit, 0 for credit
                         amount = round(sample(1:1000, size = 6) + runif(6),2))   

# change amount to negative for debit (oper_type == 1)
transactions$amount <- ifelse(transactions$oper_type == 1, (-1)*transactions$amount, transactions$amount) 
transactions

In [None]:
# creating a list of single objects, a vector and a data frame
list_data <- list(client_name, is_active, services, transactions)
list_data

### Naming List Elements

It's better to name the elements of a list:

In [None]:
names(list_data) <- c("ClientName", "IsActive", "Services", "Transactions")
list_data

You can extend a list "on the fly" with `$`:

In [None]:
list_data$ClientName
list_data$ClientId <- 11125489656
list_data

---

### Accessing List Elements

Every element can now be accessed by index with `[]` (returns a list with that element) or `[[]]` (returns the element itself):

In [None]:
# access to list element
list_data[1]
typeof(list_data[1])

In [None]:
# access to object
list_data[[1]]
typeof(list_data[[1]])

Access by `$` is also enabled:

In [None]:
list_data$Transactions

### Manipulating List Elements

Let's continue using our `list_data` list.

In [None]:
list_data

We can change data with `[]` and access with `$` symbol.

In [None]:
# changing client name with index
list_data[1] <- "New Name"
list_data

In [None]:
# changing data with $
list_data$ClientName = "John Doe"
list_data

### Merging lists

You can merge lists with the `c()` function. Let's create a new list and attach it to `list_data`:

In [None]:
list_2 <- list(Consultant = list(Name = "David Cameron", PhoneNum = "+9562311855"))
list_2

In [None]:
list_data <- c(list_data, list_2)
list_data
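After merging, the consultant is a list nested inside another list. The sketch below, built on a reduced hypothetical `client` list, shows how nested elements could be read and removed.

```r
client <- list(ClientName = "John Doe",
               Consultant = list(Name = "David Cameron",
                                 PhoneNum = "+9562311855"))

# chain $ to reach an element of the nested list
consultant_name <- client$Consultant$Name
consultant_name

# the same with [[ ]] and element names
client[["Consultant"]][["PhoneNum"]]

# assigning NULL removes an element
client$Consultant$PhoneNum <- NULL
names(client$Consultant)
```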

### Converting a list to a vector

With `unlist()` you can convert a list to a vector.

In [None]:
list_demo <- list(1:10)
list_demo
class(list_demo)
typeof(list_demo)

In [None]:
list_demo * 5 # error, you cannot use * for list

In [None]:
lapply(list_demo, function(c) c*5)

In [None]:
vector_demo <- unlist(list_demo)
vector_demo
class(vector_demo)
typeof(vector_demo)

In [None]:
vector_demo * 5 # now it works

---

## TASKS 

#### Task 1

Write a function that calculates the `sum`, `average`, `median`, `min` and `max` of a given vector. Generate a sample vector of 10 elements in $[1;100]$.

#### Solution

In [None]:
x <- sample(1:100, size = 10)
print(x)

In [None]:
vector_info <- function(vector) {
  x <- list()
  x$Sum <- sum(vector)
  x$Mean <- mean(vector)
  x$Median <- median(vector)
  x$Min <- min(vector)
  x$Max <- max(vector)
  return(x)
}

vector_info(x)
names(vector_info(x))

---

# 6. Apply functions family

You can use a set of functions for manipulating and accessing different data structures such as `data.frame` and `list`.

## The apply() functions family

The `apply()` family pertains to the R base package and is populated with functions to manipulate slices of data from matrices, arrays, lists and dataframes in a repetitive way. These functions allow crossing the data in a number of ways and avoid explicit use of loop constructs. They act on an input list, matrix or array and apply a named function with one or several optional arguments.

The called function could be:

- [x] An aggregating function, like for example the mean, or the sum (that return a number or scalar);
- [x] Other transforming or subsetting functions; and
- [x] Other vectorized functions, which yield more complex structures like lists, vectors, matrices, and arrays.

The `apply()` functions form the basis of more complex combinations and help to perform operations with very few lines of code. More specifically, the family is made up of the `apply()`, `lapply()`, `sapply()`, `vapply()`, `mapply()`, `rapply()`, and `tapply()` functions.

Which function to use depends on the structure of the data that you want to operate on and the format of the output that you need.
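
To make the difference in output formats concrete, here is a minimal sketch that runs several members of the family on small hypothetical inputs (the `scores` list and the other values are made up for illustration).

```r
# Hypothetical input data
scores <- list(a = c(1, 2, 3), b = c(10, 20))

res_list <- lapply(scores, mean)              # always returns a list
res_vec  <- sapply(scores, mean)              # simplified to a named numeric vector
res_safe <- vapply(scores, mean, numeric(1))  # like sapply(), but the result type is checked

# mapply() walks over several vectors in parallel
res_m <- mapply(function(x, y) x + y, 1:3, 4:6)

# tapply() applies a function to groups defined by a grouping vector
res_t <- tapply(c(10, 20, 30, 40), c("u", "v", "u", "v"), sum)

class(res_list)
res_vec
res_m
res_t
```

`vapply()` is often preferred inside functions because it fails early if an element returns something other than the declared type.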

## `apply()`

`apply()` operates on arrays (2D arrays are matrices).

The syntax is: **`apply(X, MARGIN, FUN, ...)`**, where 

- [x] `X` is an array or a matrix if the dimension of the array is 2;
- [x] `MARGIN` is an argument defining how the function is applied: with `MARGIN=1` it is applied over rows, whereas with `MARGIN=2` it is applied over columns. With `MARGIN=c(1,2)` it is applied over both rows and columns, that is, to every individual element; and
- [x] `FUN`, which is the function that you want to apply to the data. It can be any R function, including a User Defined Function (UDF).

In [None]:
# create a matrix

matrix  <- matrix(10:29, ncol = 5, nrow = 4)
matrix

In [None]:
# find sums by col
apply(matrix, 2, sum)

It's your turn. TASK. Calculate the average value of each row:

In [None]:
apply(matrix, 1, mean)
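
A short sketch of the remaining options described above: a user-defined function, `MARGIN = c(1, 2)`, and an extra argument passed through `...`. The matrix `m` and the argument name `k` are hypothetical and chosen only for this illustration.

```r
# Hypothetical 2 x 3 matrix:
#   1 3 5
#   2 4 6
m <- matrix(1:6, nrow = 2)

# User-defined function applied to each row: range width
row_range <- apply(m, 1, function(r) max(r) - min(r))

# Built-in function applied to each column
col_sums <- apply(m, 2, sum)

# MARGIN = c(1, 2): the function is called for every single element;
# k is passed to the function through the ... argument of apply()
scaled <- apply(m, c(1, 2), function(v, k) v * k, k = 10)

row_range
col_sums
scaled
```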

## `lapply()`

`lapply()` differs from `apply()` in that:

- [x] It can be used for other objects such as data frames, lists or vectors; and
- [x] The output returned is a `list` (which explains the “l” in the function name), which has the same number of elements as the object passed to it.



Use `?lapply` to check the parameters of the function:

In [None]:
?lapply

Let's create a list of data frames:

In [None]:
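df_a <- data.frame(Value1 = 1:5, Value2 = 101:105)
df_b <- data.frame(Value1 = 11:15, Value2 = 201:205)
df_c <- data.frame(Value1 = 16:20, Value2 = 301:305)
df_a

# lapply() over a vector calls the function once per element,
# so the result is a list of five single-element sums
lapply(df_a$Value1, sum)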

In [None]:
list_demo <- list(df_a, df_b, df_c)
list_demo

In [None]:
# let's select the 2nd column of each data frame

lapply(list_demo, "[", , 2)
# list_demo - data
# "["       - selection operator
# (empty)   - row index, empty means all rows
# 2         - column index

TASK. It's your turn. **Select the 1st row of each data frame**

In [None]:
lapply(list_demo, "[", 1,)

TASK. It's your turn. Select the 1st element (1st row, 1st column) of each data frame

In [None]:
lapply(list_demo, "[", 1, 1)

You can apply a function to all elements. Let's convert some names to lowercase:

In [None]:
names_list <- list("John", "Jane", "Jake", "Jacob")
lower_names <- lapply(names_list, tolower) 
class(lower_names)
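
As noted at the start of this section, `lapply()` also accepts vectors and data frames, and the result has one element per input element. A minimal sketch with a hypothetical data frame `df` (a data frame is treated as a list of columns):

```r
# Hypothetical data frame, used only for illustration
df <- data.frame(x = 1:3, y = c(2.5, 3.5, 4.5))

# One list element per column
col_means <- lapply(df, mean)

# One list element per vector element
squares <- lapply(1:3, function(i) i^2)

length(col_means)  # same as ncol(df)
names(col_means)
unlist(squares)
```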

## `sapply()`

`sapply()` takes a `list`, `vector` or `dataframe` as input and returns the output in `vector` or `matrix` form.
Let's use the `sapply()` function in the previous example and check the result.

In [None]:
sapply(names_list, tolower) 

`sapply()` tries to simplify the output to the most elementary data structure that is possible. Indeed, `sapply()` is a wrapper function for `lapply()`.

Let's try to get the 1st element of the 2nd row from each data frame in our `list_demo`: 

In [None]:
list_demo

In [None]:
data <- sapply(list_demo, "[", 2,1)
data
class(data)

In [None]:
# let's set simplify = FALSE
data <- sapply(list_demo, "[", 2, 1, simplify = FALSE)
data
class(data)
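
The simplification rules can be seen on a small hypothetical list `nums` (made up for this sketch): results of length one become a vector, results of equal length greater than one become a matrix, and results of different lengths stay a list.

```r
# Hypothetical input list
nums <- list(a = 1:3, b = 4:6)

# Each call returns one value: the result is a named vector
sums <- sapply(nums, sum)

# Each call returns two values: the result is a matrix with one column per element
ranges <- sapply(nums, range)

# Results of different lengths cannot be simplified: the result stays a list
ragged <- sapply(list(1:2, 1:3), function(v) v * 2)

# For length-one results sapply() matches unlist(lapply())
same <- identical(sums, unlist(lapply(nums, sum)))

sums
ranges
class(ragged)
same
```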

## `aggregate()`

This function is from the `stats` package. It is often used for grouping data by some key. It is not part of the apply family, but it works in a similar way, so it is a good idea to discuss it now.

Syntax for data.frame:

```
aggregate(x,               # R object
          by,              # List of variables (grouping elements)
          FUN,             # Function to be applied for summary statistics
          ...,             # Additional arguments to be passed to FUN
          simplify = TRUE, # Whether to simplify results as much as possible or not
          drop = TRUE)     # Whether to drop unused combinations of grouping values or not
```

Syntax for formula:

```
aggregate(formula,             # Input formula
          data,                # List or data frame where the variables are stored
          FUN,                 # Function to be applied for summary statistics
          ...,                 # Additional arguments to be passed to FUN
          subset,              # Observations to be used (optional)
          na.action = na.omit) # How to deal with NA values
```
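
Both interfaces can be compared on a data set that ships with R. The sketch below uses the built-in `mtcars` data frame (an assumption made here only to keep the example self-contained; it is not the data used in the tasks) and computes the mean `mpg` per number of cylinders in two ways.

```r
# Formula interface: response ~ grouping variable
by_cyl_formula <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Data frame interface: the values to summarise plus a named list of grouping variables
by_cyl_df <- aggregate(mtcars["mpg"], by = list(cyl = mtcars$cyl), FUN = mean)

by_cyl_formula
by_cyl_df
```

The two calls should return the same groups and means; the formula interface is usually shorter to write, while the `by` interface lets you name the grouping columns explicitly.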

Let's use our `credit_data` from one of the previous tasks:

In [None]:
credit_data <- ISLR::Credit
head(credit_data)

TASK 1. Calculate the average Age by Gender:

In [None]:
# let's use the formula syntax
mean_age <- aggregate(Age ~ Gender, data = credit_data, mean)
mean_age 

# rename the aggregated column from "Age" to "Mean Age"
n <- names(mean_age)
n[n == "Age"] <- "Mean Age"
names(mean_age) <- n
mean_age

TASK 2. Calculate the average Balance by Gender and Student status at the same time:

In [None]:
group_bal <- aggregate(Balance ~ Gender + Student, data = credit_data, mean) # group by both variables, as the task asks
group_bal

Task 3. FOR STUDENTS. Try to get the aggregated average `Income` by `Age`. Order the final data.frame by age and make a `plot()`.

In [None]:
group_inc <- aggregate(Income ~ Age + Gender, data = credit_data, mean)
head(group_inc, 10)

In [None]:
levels(group_inc$Gender)
levels(group_inc$Gender) <- c("Male", "Female")

m_data <- group_inc[group_inc$Gender == "Male", ]
nrow(m_data)

f_data <- group_inc[group_inc$Gender == "Female", ]
nrow(f_data)
with(m_data, plot(Age, Income, type = "l", col="red"))
with(f_data, lines(Age, Income, type = "l", col ="blue"))
#plot(group_inc$Age, group_inc$Income, type = "b")
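
The ordering step mentioned in the task can be done with `order()`. A minimal sketch, assuming the built-in `mtcars` data set instead of `credit_data` so that it runs without extra packages; the variable names `avg_hp` and `avg_hp_desc` are hypothetical.

```r
# Average horsepower per number of gears
avg_hp <- aggregate(hp ~ gear, data = mtcars, FUN = mean)

# order() returns the row positions that sort the grouping column;
# here the rows are rearranged from the largest gear value to the smallest
avg_hp_desc <- avg_hp[order(avg_hp$gear, decreasing = TRUE), ]

avg_hp
avg_hp_desc
```

The same pattern with the default `decreasing = FALSE` sorts an aggregated data frame in ascending order of the key before plotting it as a line.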
