# Introduction to the R language

Source: An Introduction to Statistical Learning with Applications in R

In this lab, we will introduce some simple R commands. The best way to
learn a new language is to try out the commands. R can be downloaded from
http://cran.r-project.org/

## Basic Commands

R uses functions to perform operations. To run a function called ``funcname``, we type ``funcname(input1, input2)``, where the inputs (or arguments) ``input1`` and ``input2`` tell R how to run the function. A function can have any number of inputs. For example, to create a vector of numbers, we use the function ``c()`` (for concatenate). Any numbers inside the parentheses are joined together. The following command instructs R to join together the numbers 1, 3, 2, and 5, and to save them as a vector named ``x``. When we type ``x``, it gives us back the vector.

In [45]:
x <- c(1, 3, 2, 5)
x

In an interactive R console, each command is preceded by the ``>`` prompt. It is
not part of the command; rather, R prints it to indicate that it is ready for
another command to be entered. We can also save things using ``=`` rather than ``<-``:

In [46]:
x = c(1, 6, 2)
x
y = c(1, 4, 3)

Hitting the up arrow multiple times will display the previous commands,
which can then be edited. This is useful since one often wishes to repeat
a similar command. In addition, typing ``?funcname`` will always cause R to
open a new help file window with additional information about the function
``funcname``.

We can tell R to add two sets of numbers together. It will then add the
first number from ``x`` to the first number from ``y``, and so on. However, ``x`` and
``y`` should be the same length. We can check their length using the ``length()``
function.

In [47]:
length(x)
length(y)
x + y
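
As a minimal sketch of why equal lengths matter, the example below adds the two vectors element by element and then shows what R does when the lengths differ: it recycles the shorter vector. The names ``short``, ``long`` and ``recycled`` are introduced here for illustration only.

```r
# Element-wise addition of two vectors of equal length
x <- c(1, 6, 2)
y <- c(1, 4, 3)
z <- x + y
print(z)          # 2 10 5

# When the lengths differ, R silently recycles the shorter vector
# (it warns only if the longer length is not a multiple of the shorter)
short <- c(10, 20)
long <- c(1, 2, 3, 4)
recycled <- long + short
print(recycled)   # 11 22 13 24
```

Recycling is convenient for adding a constant to a vector, but it can hide mistakes, which is why checking ``length()`` first is a good habit.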

The ``ls()`` function allows us to look at a list of all of the objects, such
as data and functions, that we have saved so far. The ``rm()`` function can be used to delete any that we don’t want.

Vectors can also contain strings:

In [84]:
s = c('a', 'b', 'c', 'a', 'a')

length(s)
unique(s)

In [48]:
ls()

In [49]:
rm(x , y)

It’s also possible to remove all objects at once:

In [50]:
rm(list = ls())
ls()

### Lists

Lists are data structures that can contain heterogeneous types of data. They provide a basic key/value indexing facility.

In [51]:
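l = list(a=1, b=c(1, 2), v="string")
l
# [[ ]] assigns a whole object (here a character vector) to one key
l[["I_like"]] = c("machine", "learning")
l["v"]  # [ ] returns a sub-list; use l[["v"]] for the element itself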
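
The key/value access described above comes in three flavours. The sketch below compares them on the same list; the variable names ``sub``, ``val`` and ``same`` are chosen here for illustration.

```r
l <- list(a = 1, b = c(1, 2), v = "string")

sub <- l["v"]      # single brackets return a list of length 1
val <- l[["v"]]    # double brackets return the element itself
same <- l$v        # $ is shorthand for [[ ]] with a literal name

class(sub)         # "list"
class(val)         # "character"
names(l)           # "a" "b" "v"
```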

## Functions

Define a user function with the keyword ``function``:

In [52]:
add <- function(a, b){
    c = a + b
    return(c)
}

add(2, 3)
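
Arguments can also be given default values and passed by name. The following is a small sketch; ``add_default`` is a hypothetical variant of ``add`` introduced only for this example.

```r
# Hypothetical variant of add() with a default value for b
add_default <- function(a, b = 1) {
    a + b   # the value of the last expression is returned
}

r1 <- add_default(2)             # b falls back to its default: 3
r2 <- add_default(2, 3)          # positional arguments: 5
r3 <- add_default(b = 3, a = 2)  # named arguments, in any order: 5
```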

### Matrices

The ``matrix()`` function can be used to create a matrix of numbers. Before
we use the ``matrix()`` function, we can learn more about it:

In [53]:
?matrix

matrix {base}, R Documentation

| Argument | Description |
|---|---|
| ``data`` | an optional data vector (including a list or expression vector). Non-atomic classed R objects are coerced by as.vector and all attributes discarded. |
| ``nrow`` | the desired number of rows. |
| ``ncol`` | the desired number of columns. |
| ``byrow`` | logical. If FALSE (the default) the matrix is filled by columns, otherwise the matrix is filled by rows. |
| ``dimnames`` | A dimnames attribute for the matrix: NULL or a list of length 2 giving the row and column names respectively. An empty list is treated as NULL, and a list of length one as row names. The list can be named, and the list names will be used as names for the dimensions. |
| ``x`` | an R object. |
| ``...`` | additional arguments to be passed to or from methods. |
| ``rownames.force`` | logical indicating if the resulting matrix should have character (rather than NULL) rownames. The default, NA, uses NULL rownames if the data frame has ‘automatic’ row.names or for a zero-row data frame. |


The help file reveals that the ``matrix()`` function takes a number of inputs,
but for now we focus on the first three: the data (the entries in the matrix),
the number of rows, and the number of columns. First, we create a simple
matrix.

In [54]:
X = matrix(data = c(1, 2, 3, 4), nrow = 2, ncol = 2)
X

     [,1] [,2]
[1,]    1    3
[2,]    2    4


Note that we could just as well omit typing ``data=``, ``nrow=``, and ``ncol=`` in the
``matrix()`` command above: that is, we could just type

In [55]:
X = matrix(c(1, 2, 3, 4), 2, 2)
X

     [,1] [,2]
[1,]    1    3
[2,]    2    4


and this would have the same effect. However, it can sometimes be useful to
specify the names of the arguments passed in, since otherwise R will assume
that the function arguments are passed into the function in the same order
that is given in the function’s help file. As this example illustrates, by
default R creates matrices by successively filling in columns. Alternatively,
the ``byrow=TRUE`` option can be used to populate the matrix in order of the
rows.

In [56]:
matrix(c(1, 2, 3, 4), 2, 2, byrow = TRUE)

     [,1] [,2]
[1,]    1    2
[2,]    3    4


Notice that in the above command we did not assign the matrix to a variable
such as ``X``. In this case the matrix is printed to the screen but is not saved
for future calculations. The ``sqrt()`` function returns the square root of each
element of a vector or matrix. The command ``X ^ 2`` or ``X ** 2`` raises each element of ``X``
to the power 2; any powers are possible, including fractional or negative
powers.

In [57]:
sqrt(X)

X ^ 2

# or

X ** 2

         [,1]     [,2]
[1,] 1.000000 1.732051
[2,] 1.414214 2.000000


     [,1] [,2]
[1,]    1    9
[2,]    4   16


     [,1] [,2]
[1,]    1    9
[2,]    4   16
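
To illustrate the remark about fractional and negative powers, here is a short sketch. The names ``half_power`` and ``reciprocal`` are introduced for this example only.

```r
X <- matrix(c(1, 2, 3, 4), 2, 2)

half_power <- X ^ 0.5   # fractional power: element-wise square root
reciprocal <- X ^ -1    # negative power: element-wise 1 / X

all.equal(half_power, sqrt(X))   # TRUE
all.equal(reciprocal, 1 / X)     # TRUE
```

Note that ``X ^ -1`` inverts each entry separately; it is not the matrix inverse.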


The ``rnorm()`` function generates a vector of random normal variables,
whose first argument ``n`` is the sample size. Each time we call this function, we
will get a different answer. Here we create two correlated sets of numbers,
``x`` and ``y``, and use the ``cor()`` function to compute the correlation between
them.

In [58]:
x = rnorm(50)
y = x + rnorm(50, mean = 50, sd = .1)
cor(x, y)

By default, ``rnorm()`` creates standard normal random variables with a mean
of 0 and a standard deviation of 1. However, the mean and standard deviation
can be altered using the ``mean`` and ``sd`` arguments, as illustrated above.
Sometimes we want our code to reproduce the exact same set of random
numbers; we can use the ``set.seed()`` function to do this. The ``set.seed()``
function takes an (arbitrary) integer argument.

In [59]:
set.seed(42)
rnorm(10)
set.seed(42)
rnorm(10)
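
The effect of the seed can be checked programmatically. The sketch below stores the draws in variables (``first``, ``second`` and ``third`` are names chosen here) and compares them.

```r
set.seed(42)
first <- rnorm(10)

set.seed(42)          # resetting the seed restarts the random sequence
second <- rnorm(10)

third <- rnorm(10)    # no reseed: the sequence simply continues

identical(first, second)   # TRUE
identical(first, third)    # FALSE
```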

The ``mean()`` and ``var()`` functions can be used to compute the mean and
variance of a vector of numbers. Applying sqrt() to the output of ``var()``
will give the standard deviation. Or we can simply use the ``sd()`` function.

In [60]:
set.seed(3)
y = rnorm(100)
mean(y)

var(y)

sqrt(var(y))

sd(y)
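
As a sanity check on what ``var()`` and ``sd()`` compute, the sketch below reproduces them from the definition of the sample variance, which divides by n - 1. The names ``n`` and ``manual_var`` are introduced for this example.

```r
set.seed(3)
y <- rnorm(100)
n <- length(y)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
manual_var <- sum((y - mean(y))^2) / (n - 1)

all.equal(manual_var, var(y))        # TRUE
all.equal(sqrt(manual_var), sd(y))   # TRUE
```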

### Indexing Data

We often wish to examine part of a set of data. Suppose that our data is
stored in the matrix ``A``.

In [61]:
A = matrix(1:16, 4, 4)
A
A[2, 3]

     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16


The last command, ``A[2, 3]``, selects the element corresponding to the second row and the third column. The first number after the open-bracket symbol ``[`` always refers to the row, and the second number always refers to the column. We can also select multiple rows and columns at a time, by providing vectors as the indices.

In [62]:
A[c(1, 3), c(2, 4)]

     [,1] [,2]
[1,]    5   13
[2,]    7   15


Or using the slicing notation:

In [63]:
print("Rows 1 to 3 and columns 2 to 4")
A[1:3, 2:4]

print("Rows 1 to 2 and all columns")
A[1:2, ]

print("All rows and columns 1 to 2")
A[, 1:2]

[1] "Rows 1 to 3 and columns 2 to 4"


     [,1] [,2] [,3]
[1,]    5    9   13
[2,]    6   10   14
[3,]    7   11   15


[1] "Rows 1 to 2 and all columns"


     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14


[1] "All rows and columns 1 to 2"


     [,1] [,2]
[1,]    1    5
[2,]    2    6
[3,]    3    7
[4,]    4    8


The last two examples include either no index for the columns or no index
for the rows. These indicate that R should include all columns or all rows,
respectively. R treats a single row or column of a matrix as a vector.

In [64]:
A[1, ]
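
Because a single row or column is returned as a plain vector, its matrix shape is lost. A minimal sketch of how to keep it, using the ``drop = FALSE`` argument of the bracket operator (``row_vec`` and ``row_mat`` are names chosen here):

```r
A <- matrix(1:16, 4, 4)

row_vec <- A[1, ]                 # plain vector, no dimensions
row_mat <- A[1, , drop = FALSE]   # stays a 1 x 4 matrix

is.null(dim(row_vec))   # TRUE
dim(row_mat)            # 1 4
```

Keeping the dimensions matters when the result is passed on to matrix operations that expect two-dimensional input.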

The use of a negative sign ``-`` in the index tells R to keep all rows or columns
except those indicated in the index.

In [65]:
A[-c(1, 3), ]

     [,1] [,2] [,3] [,4]
[1,]    2    6   10   14
[2,]    4    8   12   16


The ``dim()`` function outputs the number of rows followed by the number of
columns of a given matrix.

In [66]:
dim(A)
dim(A[1:3, ])

## Matrix operations

In [67]:
X = matrix(c(1, 2, 3, 4), 2, 2)
X

     [,1] [,2]
[1,]    1    3
[2,]    2    4


### Column-wise or row-wise operations

In [68]:
colSums(X)
colMeans(X)

rowSums(X)
rowMeans(X)

In [69]:
apply(X, 2, mean)
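
The second argument of ``apply()`` selects the dimension to iterate over: 1 for rows, 2 for columns. The sketch below checks this against the dedicated functions and applies an arbitrary function; the variable names are chosen for illustration.

```r
X <- matrix(c(1, 2, 3, 4), 2, 2)

col_means <- apply(X, 2, mean)   # MARGIN = 2: one value per column
row_means <- apply(X, 1, mean)   # MARGIN = 1: one value per row

all.equal(col_means, colMeans(X))   # TRUE
all.equal(row_means, rowMeans(X))   # TRUE

# Any function of a vector can be applied, e.g. the range of each column
col_ranges <- apply(X, 2, function(v) max(v) - min(v))
col_ranges                          # 1 1
```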

### Centering and scaling

In [70]:
X = matrix(c(1, 2, 3, 4), 2, 2)
Xcs = scale(X)
# Xcs contains the standardized matrix plus some hidden attributes
attributes(Xcs)

xbar <- attr(Xcs, "scaled:center")
xsd <- attr(Xcs, "scaled:scale")
Xcs2 = scale(X, center=xbar, scale=xsd)

all(Xcs==Xcs2)
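
To make explicit what ``scale()`` does by default, the sketch below subtracts the column means and divides by the column standard deviations by hand, using ``sweep()``. The names ``centers``, ``sds`` and ``X_manual`` are introduced for this example.

```r
X <- matrix(c(1, 2, 3, 4), 2, 2)

centers <- colMeans(X)       # one mean per column
sds <- apply(X, 2, sd)       # one standard deviation per column

# sweep() applies an operation between each column and the matching statistic
X_centered <- sweep(X, 2, centers, "-")
X_manual <- sweep(X_centered, 2, sds, "/")

# Same numbers as scale(); only the extra attributes differ
all.equal(X_manual, scale(X), check.attributes = FALSE)   # TRUE
```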

### Matrix / vector product

In [72]:
X = matrix(c(1, 2, 3, 4, 5, 6), 3, 2)
X
dim(X)

     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6


In [73]:
b = matrix(c(1, 2), 2, 1)
b
dim(b)

X %*% b

     [,1]
[1,]    1
[2,]    2


     [,1]
[1,]    9
[2,]   12
[3,]   15
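
The result above can be verified by hand: each entry of ``X %*% b`` is the dot product of one row of ``X`` with ``b``. The following sketch, with names chosen for illustration, also contrasts ``%*%`` with the element-wise ``*`` operator.

```r
X <- matrix(c(1, 2, 3, 4, 5, 6), 3, 2)
b <- matrix(c(1, 2), 2, 1)

product <- X %*% b   # 3 x 1 matrix

# Dot product of each row of X with the single column of b
manual <- sapply(1:nrow(X), function(i) sum(X[i, ] * b[, 1]))
manual                                  # 9 12 15

all.equal(as.vector(product), manual)   # TRUE

# In contrast, * multiplies entry by entry
elementwise <- X * X
elementwise[3, 2]                       # 36
```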


## Loading Data

For most analyses, the first step involves importing a data set into R. The ``read.table()`` function is one of the primary ways to do this. The help file contains details about how to use this function. We can use the function
``write.table()`` to export data.

Before attempting to load a data set, we must make sure that R knows
to search for the data in the proper directory. For example, on a Windows
system one could select the directory using the Change dir... option under
the File menu. However, the details of how to do this depend on the
operating system (e.g. Windows, Mac, Unix) that is being used, and so we
do not give further details here. Use ``getwd()`` and ``setwd()`` to get and set the current working directory.

In [74]:
getwd()
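
The export and import functions mentioned above can be tried without depending on the working directory by writing to a temporary file. This is only a sketch: the data frame ``scores`` and its contents are made up for the illustration.

```r
# Hypothetical small data set used only for this illustration
scores <- data.frame(id = c(1, 2, 3), score = c(9.5, 7.0, 8.25))

# tempfile() returns a path inside the session's temporary directory
path <- tempfile(fileext = ".csv")

write.table(scores, file = path, sep = ",", row.names = FALSE)
loaded <- read.table(path, header = TRUE, sep = ",")

names(loaded)                             # "id" "score"
all.equal(loaded$score, scores$score)     # TRUE

unlink(path)   # remove the temporary file
```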

Once the data has been loaded, the ``fix()`` function can be used to view it in a spreadsheet-like window.
However, the window must be closed before further R commands can be
entered. The ``head()`` function prints the first rows of the ``data.frame``.

In [75]:
df = read.table("../data/iris.csv", header=TRUE, sep=",")
#fix(df)
head(df)

  sepal_length sepal_width petal_length petal_width species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa


In [76]:
df = read.csv("../data/iris.csv")
head(df)

  sepal_length sepal_width petal_length petal_width species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa


In [1]:
link = 'https://raw.github.com/neurospin/pystatsml/master/data/salary_table.csv'
# X = read.csv(url(link))

``read.table()`` and ``read.csv()`` return a ``data.frame``, which is a list of vectors of equal length, that is, a table of heterogeneous data. A ``data.frame`` can be indexed like a matrix. A column can be obtained using its name after the symbol ``$``.

## data.frame

In [77]:
colnames(df)
df$sepal_length[1:10]
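
The different ways of indexing a ``data.frame`` can be compared on a small example. The data frame ``d`` below is hypothetical and independent of the iris file.

```r
# Hypothetical small data frame used only for this illustration
d <- data.frame(height = c(1.6, 1.8, 1.7), weight = c(55, 80, 68))

by_dollar <- d$height         # column by name with $
by_double <- d[["height"]]    # same column, list-style
by_matrix <- d[, "height"]    # same column, matrix-style

second_row <- d[2, ]          # a row is still a data.frame
one_value <- d[2, "weight"]   # a single cell: 80
```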

In [78]:
summary(df)

  sepal_length    sepal_width     petal_length    petal_width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                

Build a ``data.frame``:

In [79]:
user1 = data.frame(name=c('eric', 'sophie'),
                     age=c(22, 48), gender=c('M', 'F'),
                     job=c('engineer', 'scientist'))
  
user2 = data.frame(name=c('alice', 'john', 'peter', 'julie', 'christine'),
                   age=c(19, 26, 33, 44, 35), gender=c('F', 'M', 'M', 'F', 'F'),
                   job=c("student", "student", 'engineer', 'scientist', 'scientist'))

### Concatenate and merge

In [80]:
user3 = rbind(user1, user2)

print(user3)

salary = data.frame(name=c('alice', 'john', 'peter', 'julie'), salary=c(22000, 2400, 3500, 4300))

user = merge(user3, salary, by="name", all=TRUE)

print(user)

       name age gender       job
1      eric  22      M  engineer
2    sophie  48      F scientist
3     alice  19      F   student
4      john  26      M   student
5     peter  33      M  engineer
6     julie  44      F scientist
7 christine  35      F scientist
       name age gender       job salary
1      eric  22      M  engineer     NA
2    sophie  48      F scientist     NA
3     alice  19      F   student  22000
4 christine  35      F scientist     NA
5      john  26      M   student   2400
6     julie  44      F scientist   4300
7     peter  33      M  engineer   3500
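
The ``all=TRUE`` argument is what keeps the users without a salary in the result. A reduced sketch of the difference, using two small data frames (``left`` and ``right`` are names chosen here, with a subset of the values above):

```r
left <- data.frame(name = c("alice", "john", "eric"), age = c(19, 26, 22))
right <- data.frame(name = c("alice", "john"), salary = c(22000, 2400))

# Default: keep only the names present in both tables
inner <- merge(left, right, by = "name")

# all = TRUE: keep every row and fill the gaps with NA
outer <- merge(left, right, by = "name", all = TRUE)

nrow(inner)                # 2
nrow(outer)                # 3
sum(is.na(outer$salary))   # 1
```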


### Selection

In [81]:
user[(user$gender == 'F') & (user$job == 'scientist'), ]


       name age gender       job salary
2    sophie  48      F scientist     NA
4 christine  35      F scientist     NA
6     julie  44      F scientist   4300
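
Care is needed when the selection condition involves a column with missing values, such as ``salary``. The sketch below uses a reduced, hypothetical copy of the table to show the difference between plain logical indexing, ``which()`` and ``subset()``.

```r
# Reduced, hypothetical copy of the user table
users <- data.frame(name = c("sophie", "julie", "peter"),
                    gender = c("F", "F", "M"),
                    salary = c(NA, 4300, 3500))

# Comparing NA gives NA, and an NA row index produces a row filled with NA
by_bracket <- users[users$salary > 4000, ]
nrow(by_bracket)    # 2, one of which is an all-NA row

# which() keeps only the positions where the condition is TRUE
by_which <- users[which(users$salary > 4000), ]
nrow(by_which)      # 1

# subset() also drops the rows where the condition is NA
by_subset <- subset(users, salary > 4000)
nrow(by_subset)     # 1
```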


### Iterate over columns

In [82]:
types = NULL
for(n in colnames(user)){
  types = rbind(types, data.frame(var=n, 
                                  type=typeof(user[[n]]),
                                  isnumeric=is.numeric(user[[n]])))
}

print(types)

     var    type isnumeric
1   name integer     FALSE
2    age  double      TRUE
3 gender integer     FALSE
4    job integer     FALSE
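5 salary  double      TRUE

Note: the ``integer`` type reported for ``name``, ``gender`` and ``job`` arises because R versions before 4.0 converted the character columns of a ``data.frame`` to factors by default. From R 4.0 onward, these columns stay character vectors and ``typeof()`` returns ``character``.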
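
Since a ``data.frame`` is a list of columns, the same information can be gathered without an explicit loop by using ``sapply()``. This is an alternative sketch on a small hypothetical table, not a replacement for the loop above; ``stringsAsFactors = FALSE`` is set explicitly so that the result does not depend on the R version.

```r
# Hypothetical small table; strings are kept as character vectors
people <- data.frame(name = c("eric", "sophie"), age = c(22, 48),
                     stringsAsFactors = FALSE)

# sapply() calls the function on every column and returns a named vector
col_types <- sapply(people, typeof)
col_numeric <- sapply(people, is.numeric)

col_types     # name: "character", age: "double"
col_numeric   # name: FALSE, age: TRUE
```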


## Exercises

Write a function ``fillmissing_with_mean(df)`` that fills all missing values of each numerical column with the
mean of that column.



In [6]:
# link = 'https://raw.github.com/neurospin/pystatsml/master/data/salary_table.csv'
# df = read.csv(url(link))
# link
link = '../data/salary_table.csv'
df = read.csv(link)
head(df)

  salary experience education management
1  13876          1  Bachelor          Y
2  11608          1      Ph.D          N
3  18701          1      Ph.D          Y
4  11283          1    Master          N
5  11767          1      Ph.D          N
6  20872          2    Master          Y


In [5]:
getwd()