# Introduction to R

This course is a collection of the most important commands, functions, methods and comments from the Swirl tutorial on R.

To see the help page of a function, call `help()` with the function name between the parentheses. To see the arguments that a function takes, use `args()` in the same way.


## 1. Workspace and files

Some of the most useful functions for workspace and file operations:
- `ls()` for listing all the objects in the local workspace (i.e. active variables).
- `getwd()` for getting the current working directory path.
- `dir()` for listing all the files in a directory (by default the working directory).
- `dir.create()` for creating a directory in the current working directory.
- `setwd()` for setting the working directory to a specified directory.
- `file.create()` for creating a file in the current working directory.
- `file.exists()` to check if a file exists in the current working directory.
- `file.info()` returns basic information about a file, such as its size, last modification date, etc. To retrieve just one piece of information, append its name with `$`, for example `$size` for the size.
- `file.rename()` for renaming a file.
- `file.remove()` for deleting a file.
- `file.copy()` for making a copy of a file.
- `file.path()` for building a path to a file from its components, independent of the operating system.
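
The functions above can be combined into a short session. This is only a sketch: the directory and file names (`testdir`, `mytest.R`, and so on) are made up for illustration, and everything happens inside R's temporary directory so that no real files are touched.

```r
# Remember where we started so the working directory can be restored later.
old_wd <- getwd()

# Build a platform-independent path and create the (hypothetical) directory.
demo_dir <- file.path(tempdir(), "testdir")
dir.create(demo_dir)
setwd(demo_dir)

file.create("mytest.R")                # create an empty file
file.exists("mytest.R")                # TRUE
file.info("mytest.R")$size             # 0, because the file is empty
file.rename("mytest.R", "mytest2.R")   # rename the file
file.copy("mytest2.R", "mytest3.R")    # make a copy of it
files_after_copy <- dir()              # "mytest2.R" "mytest3.R"
file.remove("mytest3.R")               # delete the copy

# Go back to the original working directory.
setwd(old_wd)
```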

## 2. Vectors

Vectors come in two flavours in R: `atomic vectors`, which contain exactly one data type, and `lists`, which may contain multiple data types.

Some of the most useful methods and functions for sequencing numbers:
- `a:b` creates a sequence of numbers starting at a and finishing at b (or just before b), with a step of 1 between consecutive numbers. If a > b, the step is -1.
- `seq()` does the same but with finer control. For example, we can set the step with the `by` argument, or state how many numbers we want in the sequence with the `length.out` argument (often abbreviated to `length`).
- `length()` returns the length of a sequence.
- `rep(x, times = n)` replicates the number (or vector) x n times, forming a vector of those numbers. To repeat each element of the vector n times before moving to the next element, use `each = n` instead of `times = n`.
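
A minimal sketch of these sequencing tools, using arbitrary example values. Note that `length.out` is the full name of the argument that sets how many numbers `seq()` returns.

```r
1:5                          # 1 2 3 4 5
5:1                          # 5 4 3 2 1 (step of -1 because a > b)
seq(0, 10, by = 2.5)         # 0.0 2.5 5.0 7.5 10.0

# Ask for a given number of evenly spaced values instead of a step size.
evenly_spaced <- seq(5, 10, length.out = 30)
length(evenly_spaced)        # 30

rep(c(0, 1, 2), times = 2)   # 0 1 2 0 1 2
rep(c(0, 1, 2), each = 2)    # 0 0 1 1 2 2
```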

Some of the most useful methods and functions for vectors:
- `paste()` concatenates the elements of a vector into a single string when the `collapse` argument is given; `collapse` is the separator placed between the elements. To concatenate two vectors element-wise, use the `sep` argument instead.
- `sample(vec, n)` randomly selects n elements from the vector vec.
- `identical()` checks if two vectors are exactly the same.
- `range()` returns the minimum and maximum value for a vector.
- `unique()` returns the unique values of a vector.
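
The following sketch shows each of these functions on small, made-up vectors. The seed passed to `set.seed()` is arbitrary and is only there to make the random draw repeatable.

```r
words <- c("My", "name", "is")
sentence <- paste(words, collapse = " ")            # "My name is"
labels <- paste(1:3, c("X", "Y", "Z"), sep = "")    # "1X" "2Y" "3Z"

set.seed(42)                        # arbitrary seed, for reproducibility only
picked <- sample(1:20, 5)           # 5 elements drawn at random from 1..20

identical(c(1, 2, 3), c(1, 2, 3))   # TRUE
range(c(4, 9, 1, 7))                # 1 9
unique(c(3, 4, 5, 5, 5, 6, 6))      # 3 4 5 6
```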

Some of the most useful methods and functions for subsets of vectors:
- `v[c(3, 6)]` returns the 3rd and 6th elements of vector v.
- `v[c(-3, -6)]` returns vector v except for the 3rd and 6th elements.
- `which()` returns the indices of the elements of a vector that meet some condition.
- `any()` returns TRUE if at least one of the elements of the vector meets some condition.
- `all()` returns TRUE if all the elements of the vector meet some condition.
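
As a small illustration, assuming an example vector `v` with six made-up values:

```r
v <- c(10, 20, 30, 40, 50, 60)

v[c(3, 6)]      # 30 60
v[c(-3, -6)]    # 10 20 40 50
which(v > 35)   # 4 5 6 (positions, not values)
any(v < 0)      # FALSE
all(v > 0)      # TRUE
```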



## 3. Missing values

In R, missing values are represented by the `NA` keyword. `NA` can also appear in arithmetic and logical expressions, in which case the result is usually `NA` as well (for example, `NA > 1` is `NA`). Some of the most useful methods and functions for handling missing values are:
- `is.na()` checks whether a value (or each element of a vector) is NA.
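
A minimal sketch with a made-up vector that contains two missing values:

```r
x <- c(44, NA, 5, NA)

x * 3           # 132 NA 15 NA (NA propagates through arithmetic)
x > 10          # TRUE NA FALSE NA (and through comparisons)
is.na(x)        # FALSE TRUE FALSE TRUE
sum(is.na(x))   # 2, because TRUE counts as 1 and FALSE as 0
```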


## 4. Matrices and DataFrames

The main difference between `matrices` and `DataFrames` is that a matrix can only contain a single class of data, while a dataframe can hold many different classes of data.

Some of the most useful methods and functions for matrices and dataframes are:
- `dim()` returns the dimensions of an object. Vectors do not have dimensions, so `dim()` returns NULL for them (use `length()` instead). Assigning to `dim()` can also be used to turn a vector into a matrix.
- `matrix()` creates a matrix from a vector or some sequence.
- `data.frame()` creates a dataframe from specified data.
- `colnames(df)` gets or sets the column names of a dataframe.
- `$` for extracting a column by name (e.g. `sum(df$total)` for summing the total column).
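
The sketch below walks through these functions. The patient names and column names are invented for illustration and the numbers are just the sequence 1 to 20.

```r
my_vector <- 1:20
dim(my_vector)                     # NULL, a plain vector has no dimensions
length(my_vector)                  # 20

# Giving the vector a dim attribute turns it into a 4 x 5 matrix.
dim(my_vector) <- c(4, 5)

# The same matrix built directly with matrix().
my_matrix <- matrix(1:20, nrow = 4, ncol = 5)
identical(my_vector, my_matrix)    # TRUE

# A data frame can mix a character column with numeric columns.
patients <- c("Bill", "Gina", "Kelly", "Sean")   # made-up names
my_data <- data.frame(patients, my_matrix)
colnames(my_data) <- c("patient", "age", "weight", "bp", "rating", "test")

sum(my_data$age)                   # 10, the first matrix column is 1 2 3 4
```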


## 5. Functions

`Functions` are one of the fundamental building blocks of any programming language. They are small pieces of reusable code that can be treated like any other R object. For example, a commonly used function is `Sys.Date()`, which returns today's date. To see how a function is defined, type its name without parentheses or arguments.

An example of the implementation of the mean function:


In [1]:
my_mean = function(vector) {
    sum(vector) / length(vector)
}

Then, you can execute the function by calling its name and providing at least the necessary arguments:

In [2]:
my_mean(c(1, 2, 3, 4, 5, 6, 7, 8, 9))

We can also look at the arguments of a function by calling the `args()` function:

In [3]:
args(my_mean)

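R also has `anonymous functions` (equivalent to *lambda* functions in Python): functions that are defined and used without being assigned a name. For example, the following defines a function that adds one to its argument and calls it immediately:

In [4]:
(function(x) { x + 1 })(4)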

## 6. Apply functions (loop)

Both `lapply()` and `sapply()` are part of the apply family of functions which offer a concise and convenient means of implementing the Split-Apply-Combine strategy for data analysis. This means: splitting up some data into smaller pieces, applying a function to each piece and then combining the results.

The former takes a list as input, applies a function to each element of the list, then returns a list of the same length as the original one. The latter allows you to automate this process by calling lapply behind the scenes, but then attempting to simplify the result for you.

Whereas *sapply* tries to guess the correct format of the result, `vapply()` allows you to specify it explicitly. If the result doesn't match the specified format, `vapply()` throws an error. This is useful for catching unexpected return values before they cause significant problems in your code.
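
The difference between the three functions can be sketched on a small, made-up list of numeric vectors (the name `scores` and its contents are assumptions for illustration):

```r
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)   # made-up data

# lapply() always returns a list, one element per input element.
means_list <- lapply(scores, mean)

# sapply() simplifies that list to a named numeric vector when it can.
means_vec <- sapply(scores, mean)

# vapply() does the same, but we state that each result must be one number.
means_checked <- vapply(scores, mean, numeric(1))

# Declaring the wrong format makes vapply() fail instead of guessing.
wrong_format <- try(vapply(scores, mean, character(1)), silent = TRUE)
inherits(wrong_format, "try-error")   # TRUE
```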

The `tapply()` function splits the data into groups based on the value of some variable and then applies a function to the members of each group separately. For example, to get the total population per continent we could use:

In [None]:
tapply(df$population, df$continents, sum)
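
To make the call above runnable, here is a sketch with a tiny hypothetical `df`. The country labels and population figures are invented and only serve to show the grouping.

```r
# Hypothetical data: four countries on two continents.
df <- data.frame(
  country    = c("A", "B", "C", "D"),
  continents = c("Europe", "Asia", "Europe", "Asia"),
  population = c(10, 200, 30, 400)
)

# Split population by continent, then sum each group.
totals <- tapply(df$population, df$continents, sum)
totals[["Asia"]]     # 600
totals[["Europe"]]   # 40
```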

## 7. Looking at data

Some of the most used methods and functions for dataframes are:
- `dim()` for getting the dimensions of the dataframe (equivalent to `.shape` in pandas for Python).
- `nrow()` returns the number of rows in a dataframe.
- `ncol()` returns the number of columns in a dataframe.
- `object.size()` returns an estimate of the memory the dataframe occupies.
- `names()` returns a vector with the names of the columns of the dataframe.
- `summary()` provides descriptive statistics for the dataframe and the values within its columns (similar to the *.describe()* method in pandas for Python).
- `str()` provides concise information about the structure of a dataframe (it can also be used for datasets, functions, vectors, etc).
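
As an illustration, these functions can be tried on `mtcars`, a small dataset that ships with R. The choice of dataset is an assumption made here for the example; any dataframe works the same way.

```r
dim(mtcars)           # 32 11 (rows, columns)
nrow(mtcars)          # 32
ncol(mtcars)          # 11
object.size(mtcars)   # estimated memory use, in bytes
names(mtcars)         # column names, starting with "mpg"
summary(mtcars)       # per-column minimum, quartiles, mean and maximum
str(mtcars)           # compact overview: one line per column
```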


## 8. Simulation

Some of the most used methods and functions for simulation are:
- `sample(vec, n, replace = FALSE)` draws n elements at random from the vector vec without replacement, so no element is picked twice. Use `replace = TRUE` to sample with replacement.
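
A minimal sketch of `sample()` with and without replacement, using a six-sided die as a made-up example. The seed is arbitrary and only makes the draws repeatable.

```r
set.seed(1)    # arbitrary seed, for reproducibility only
die <- 1:6

# Four rolls of the die: values may repeat, so replacement is needed.
rolls <- sample(die, 4, replace = TRUE)

# Three distinct faces: no value can be drawn twice.
draw <- sample(die, 3, replace = FALSE)

# With no size given, sample() returns a random permutation of the vector.
shuffled <- sample(die)
```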
