# Lecture 3: An overview of R: Part II
- Access the values of an object
- Enter or import data into R
- Export data
- Save and load data
- View data

###### Before we start.
- Until now I have used print( ) to display R objects, which is redundant here.
- Jupyter Notebook renders objects with its own display, so typing the object name is enough.
- Jupyter is ideal for teaching: bigger display, live editing.
- Hope you like it.
- From now on I will call print( ) only when there is something specific I want to show you.
- It does NOT make a difference for you.

In [1]:
df <- data.frame(names = c("Lucy", "John", "Mark", "Candy"),
                score = c(67, 56, 87, 91))
df

names,score
Lucy,67
John,56
Mark,87
Candy,91


In [2]:
print(df)

  names score
1  Lucy    67
2  John    56
3  Mark    87
4 Candy    91


## 3.1 Access the values of an object - the index system of R
<b style="color:red;">Key Operators are "[ ]" and "$"</b>


###### Recall object classes:
- Vector
- Matrix
- Array
    - Recall that these three are essentially the same thing: a matrix is an array with two dimensions, and an array is a vector with a dim attribute.
- Data frame
- List
- (Factor)
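
A minimal sketch of why vectors, matrices and arrays are essentially the same thing: attaching a `dim` attribute to a vector turns it into a matrix or an array. The object name `v` is only for illustration.

```r
# Start with a plain vector of 12 numbers
v <- 1:12
is.vector(v)          # TRUE

# Giving it two dimensions makes it a matrix (filled column by column)
dim(v) <- c(3, 4)
is.matrix(v)          # TRUE

# Giving it three dimensions makes it an array
dim(v) <- c(2, 3, 2)
is.array(v)           # TRUE
is.matrix(v)          # FALSE: a matrix has exactly two dimensions
```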

### 3.1.1 Index a vector

In [3]:
vector <- 2:6
vector

In [4]:
# Pick the 2nd
vector[2]

In [5]:
# Pick 2nd - 4th
vector[2:4]

In [6]:
# Pick no. 1, 3, 5
vector[c(1, 3, 5)]

In [7]:
# Code like a pro
# This good practice makes it clearer for revisits and/or edits
# Reproducibility!

# Pick no. 1, 3, 5
index <- c(1, 3, 5)

vector[index]

In [8]:
# Re-order
vector[c(5,4,3,2,1)]

###### Use the names

In [9]:
# Recall that we could give names to vector entries
names(vector) <- letters[2:6]; vector

In [10]:
vector["b"]

###### Use "," to separate dimensions.
- 1st dimension: row
- 2nd dimension: column
- 3rd ...

### 3.1.2 Index a matrix

In [56]:
matrix <- matrix(c(3:14), nrow = 4, byrow = TRUE)
print(matrix)
# Note that the indices are given.

     [,1] [,2] [,3]
[1,]    3    4    5
[2,]    6    7    8
[3,]    9   10   11
[4,]   12   13   14


In [12]:
matrix[2, 3]

In [13]:
matrix[2, ]

In [14]:
matrix[ , c(1, 3)]

3,5
6,8
9,11
12,14


In [15]:
# Change the order of columns. 
matrix[ , c(3, 1)]

5,3
8,6
11,9
14,12


###### Use the names

In [57]:
rownames(matrix)

NULL

In [58]:
# Recall that we could give names to columns and rows

row.names <- c("row1", "row2", "row3", "row4")
col.names <- c("col1", "col2", "col3")
rownames(matrix) <- row.names
colnames(matrix) <- col.names
print(matrix)
rownames(matrix)

     col1 col2 col3
row1    3    4    5
row2    6    7    8
row3    9   10   11
row4   12   13   14


In [17]:
matrix["row1", ]
# The output is a named vector as a result of dimension reduction

In [18]:
matrix["row2", "col3"]
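
The dimension reduction mentioned above can be switched off with the `drop = FALSE` argument of `[ ]`. A small sketch, using a rebuilt copy of the matrix under the made-up name `m` so that it runs on its own:

```r
# Same values and names as the matrix in this section
m <- matrix(3:14, nrow = 4, byrow = TRUE,
            dimnames = list(paste0("row", 1:4), paste0("col", 1:3)))

# Default: picking one row drops the result to a named vector
r1 <- m["row1", ]
is.matrix(r1)            # FALSE

# drop = FALSE keeps the result as a 1 x 3 matrix
r1_mat <- m["row1", , drop = FALSE]
dim(r1_mat)              # 1 3
```

Keeping the matrix shape is handy when later code expects two dimensions, for example when it calls `nrow()` or `ncol()` on the result.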

### 3.1.3 Index an array

In [19]:
array <- array(3:14, dim = c(2, 3, 2))
print(array)

, , 1

     [,1] [,2] [,3]
[1,]    3    5    7
[2,]    4    6    8

, , 2

     [,1] [,2] [,3]
[1,]    9   11   13
[2,]   10   12   14



In [20]:
array[ , , 1]

3,5,7
4,6,8


In [21]:
array[2, 3, 2]

In [22]:
array[1, , 2]

### 3.1.4 Index a data frame

In [23]:
print(df)

  names score
1  Lucy    67
2  John    56
3  Mark    87
4 Candy    91


In [24]:
df[2, ]

,names,score
2,John,56


In [25]:
df[ , 1]

###### Use the names

In [26]:
# There are (column) names that are ready to use in data frames.
names(df)

In [27]:
df$names
# data.frame$variable.name gives the variable.
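
Besides `$`, a column can also be picked with `[[ ]]` or `[ ]`, and the three do not return the same kind of object. A short sketch; the data frame is rebuilt under the made-up name `df2`, with `stringsAsFactors = FALSE` as an assumption so that it behaves the same in old and new versions of R:

```r
# Rebuild the small data frame so this sketch runs on its own
df2 <- data.frame(names = c("Lucy", "John", "Mark", "Candy"),
                  score = c(67, 56, 87, 91),
                  stringsAsFactors = FALSE)

a <- df2$score          # a numeric vector
b <- df2[["score"]]     # the same vector, with the column name given as a string
d <- df2["score"]       # still a data frame, with one column

identical(a, b)         # TRUE
class(d)                # "data.frame"
```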

###### Very useful stuff that is not in [---].

In [59]:
# What is John's score?
df[df$names == "John",]

,names,score
2,John,56
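
What happens inside the brackets is worth spelling out: the comparison produces one TRUE/FALSE per row, and `[ ]` keeps the rows marked TRUE. A sketch of the two steps, with the data frame rebuilt under the made-up name `df2` and a made-up helper variable `is_john`:

```r
df2 <- data.frame(names = c("Lucy", "John", "Mark", "Candy"),
                  score = c(67, 56, 87, 91),
                  stringsAsFactors = FALSE)

# Step 1: the comparison gives one TRUE/FALSE per row
is_john <- df2$names == "John"
is_john                  # FALSE TRUE FALSE FALSE

# Step 2: rows marked TRUE are kept
df2[is_john, ]

# which() turns the logical vector into row numbers
which(is_john)           # 2
```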


In [60]:
# Did anyone score 100?
print(df[df$score == 100,])

[1] names score
<0 rows> (or 0-length row.names)


In [30]:
# Highest score?
max(df$score)     # max() for maximum

In [31]:
# Who had the highest score?
df[df$score == max(df$score), ]

,names,score
4,Candy,91


In [32]:
# Note that this is still a data frame.
str(df[df$score == max(df$score), ])

'data.frame':	1 obs. of  2 variables:
 $ names: Factor w/ 4 levels "Candy","John",..: 1
 $ score: num 91


In [33]:
# I only need the name.
df[df$score == max(df$score), ]$names

In [34]:
# Change the order of columns
df[ , c("score", "names")]
# By now you should have realized that,
# we change the order of columns by picking the columns
# in the order that we want.

score,names
67,Lucy
56,John
87,Mark
91,Candy


### 3.1.5 Index a list

In [35]:
list <- list("Red", factor(c("a","b")), c(21,32,11), TRUE)
print(list)

[[1]]
[1] "Red"

[[2]]
[1] a b
Levels: a b

[[3]]
[1] 21 32 11

[[4]]
[1] TRUE



In [36]:
list[[1]]

In [37]:
list[[3]][2]
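
Both key operators work on lists too. A hedged sketch of the difference between `[ ]`, `[[ ]]` and `$`; the list name `lst` and the element names `colour`, `group` and `values` are made up for illustration:

```r
# A named list
lst <- list(colour = "Red", group = factor(c("a", "b")), values = c(21, 32, 11))

x1 <- lst[3]          # single brackets: a list of length 1
x2 <- lst[[3]]        # double brackets: the element itself, a numeric vector
x3 <- lst$values      # $ picks an element by name, as for data frames

class(x1)             # "list"
class(x2)             # "numeric"
identical(x2, x3)     # TRUE
lst[["values"]][2]    # 32
```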

## 3.2 Enter or import data into R
Here we talk about importing data frames.
### 3.2.1 Direct data entry
Recall the data.frame( ) function. See the first code chunk of this lecture.

### 3.2.2 Use datasets that come with R or R packages
Many R packages come with datasets that help explain how their functions work. This includes the packages installed with R itself, some of which are loaded every time you open R.

In [38]:
head(mtcars) # You can use this dataset directly whenever you want.

,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
Valiant,18.1,6,225,105,2.76,3.46,20.22,1,0,3,1


In [39]:
# data() # Lists the datasets in the packages that are currently loaded.

Some datasets require loading their package first, e.g. the "survival" package has a demo dataset called "cancer".

In [40]:
# head(cancer)
# Using the 'cancer' data before loading the 'survival' package results in an error.
library(survival)
head(cancer)

inst,time,status,age,sex,ph.ecog,ph.karno,pat.karno,meal.cal,wt.loss
3,306,2,74,1,1,90,100,1175.0,
3,455,2,68,1,0,90,90,1225.0,15.0
3,1010,1,56,1,0,90,90,,15.0
5,210,2,57,1,1,90,60,1150.0,11.0
1,883,2,60,1,0,100,90,,0.0
12,1022,1,74,1,1,50,80,513.0,0.0


### 3.2.3 Read data files

<b style="color:red;">RStudio allows you to do everything in this section by clicking!!!</b>

We must import data into R before we can analyze it. R and its packages can import data in almost any common format.

1. Base R reads <b>.txt</b> and <b>.csv</b> files with read.table( ), read.csv( ), read.csv2( ), read.delim( ) and read.delim2( ).
2. Additional packages are needed to read files from <b>Excel, SPSS, SAS, Stata</b>, and various relational databases.

#### 1. For .txt and .csv files

In [41]:
# ?read.table     # Uncomment to run the code

###### Example command
data <- read.table(file, header = TRUE, sep = "", quote = "\"", dec = ".", fill = TRUE, comment.char = "")
- file: A local <b>file</b> with complete path or a <b>URL</b>
- header: Whether to use the first row as the column names
- sep: The character that separates the entries. Defaults:
    - read.table( ): white space (one or more spaces, tabs or newlines)
    - read.csv( ): ,
    - read.csv2( ): ;
    - ...
- ...
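
To see what `header` and `sep` do, here is a small sketch that first writes a tiny semicolon-separated file to a temporary location and then reads it back. The file contents and the object names `tmp`, `with_header` and `no_header` are made up for illustration.

```r
# Write a small semicolon-separated file to a temporary location
tmp <- tempfile(fileext = ".txt")
writeLines(c("name;score", "Lucy;67", "John;56"), tmp)

# header = TRUE: the first row becomes the column names
# sep = ";": entries are separated by semicolons
with_header <- read.table(tmp, header = TRUE, sep = ";")
names(with_header)      # "name" "score"
dim(with_header)        # 2 2

# header = FALSE: the first row is read as data, columns are named V1, V2
no_header <- read.table(tmp, header = FALSE, sep = ";")
names(no_header)        # "V1" "V2"
dim(no_header)          # 3 2
```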

###### Data from Dr. Hanley's teaching website.
http://www.medicine.mcgill.ca/epidemiology/hanley/bios602/MultilevelData/otitisDataTall.txt

In [42]:
x <- read.csv(file = "http://www.medicine.mcgill.ca/epidemiology/hanley/bios602/MultilevelData/otitisDataTall.txt",
                header = FALSE)
dim(x)

In [43]:
head(x) # head() displays the first 6 rows by default.

V1,V2,V3
family,proportion,zygosity
1,0,2
1,0.04646,2
2,0.05162,2
2,0,2
3,0,2


In [44]:
xx <- read.csv(file = "http://www.medicine.mcgill.ca/epidemiology/hanley/bios602/MultilevelData/otitisDataTall.txt",
                 header = TRUE)
dim(xx)

In [45]:
head(xx) # Note that "header = TRUE" makes the first row column names.

family,proportion,zygosity
1,0.0,2
1,0.04646,2
2,0.05162,2
2,0.0,2
3,0.0,2
3,0.0913,2


###### For local files, we need to give the complete path to the file.
data <- read.csv(file = "~/Desktop/PhD3/Teaching/EPIB613/2018/classlist.csv", header = TRUE)
###### Or, set working directory to that folder
setwd("~/Desktop/PhD3/Teaching/EPIB613/2018")

data <- read.csv("classlist.csv", header = TRUE)
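
If you are not sure where R is looking for files, `getwd()` and `file.exists()` help before you call `read.csv()`. A minimal sketch that uses a temporary folder instead of a real course folder; the variable names `folder`, `path` and `classlist`, and the file contents, are made up:

```r
# Where is R currently looking for files?
getwd()

# Build a path from its parts
folder <- tempdir()                       # a temporary folder that always exists
path <- file.path(folder, "classlist.csv")

# Create a tiny file there so the example can be run as is
write.csv(data.frame(id = 1:2), path, row.names = FALSE)

# Check that the file is where we think it is, then read it
file.exists(path)                         # TRUE
classlist <- read.csv(path, header = TRUE)
```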

#### 2. For Excel, SAS, SPSS, Stata, etc. files, Google!
- There are a lot of packages.
- Read the help files of the package/function you use.
- Check the data before moving on.

###### There are also a lot of tutorials online.

https://www.datacamp.com/community/tutorials/r-data-import-tutorial

###### But you still need to google every time. Trust me.

###### (Optional) Something that I learned before I started using RStudio

In [46]:
# d <- read.csv(file.choose())

###### <b style="color:red;">Bottom line - You can always click in RStudio, and if necessary, copy the code to your script for reproducibility.</b>

## 3.3 Export data
Similar to reading data:
- For .txt and .csv files, base R provides write.table( ), write.csv( ), write.csv2( ).
- Packages are needed to write files to Excel, SPSS, SAS, Stata, and various relational databases.
    - The packages that read these file types usually also have functions that write them.

In [63]:
df
write.csv(df, file = "~/Desktop/df.csv")

names,score
Lucy,67
John,56
Mark,87
Candy,91
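
One detail to keep in mind: by default `write.csv()` also writes the row names as an extra first column. A sketch of the difference, writing to a temporary file rather than the Desktop; the names `df2`, `out`, `with_rownames` and `back` are made up for illustration.

```r
df2 <- data.frame(names = c("Lucy", "John", "Mark", "Candy"),
                  score = c(67, 56, 87, 91),
                  stringsAsFactors = FALSE)
out <- tempfile(fileext = ".csv")

# Default: the row names 1..4 are written as an unnamed first column,
# which read.csv() then names "X"
write.csv(df2, file = out)
with_rownames <- read.csv(out, stringsAsFactors = FALSE)
names(with_rownames)                # "X" "names" "score"

# row.names = FALSE writes only the data columns
write.csv(df2, file = out, row.names = FALSE)
back <- read.csv(out, stringsAsFactors = FALSE)
names(back)                         # "names" "score"
```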


## 3.4 Save and load data in R

<b style="color:red;">RStudio allows you to do everything in this section by clicking!!!</b>

- Two functions, <b>save( )</b> and <b>load( )</b>, save R objects to a file and load them back. save.image( ) saves the whole workspace image.

    - Saving the workspace image creates a .RData file in your working directory.
    - Your current work is saved.

- Yes, last class I said do NOT save workspace images.

    - Unless you are working with a 5GB dataset that takes 30 minutes to load into R.
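
A minimal sketch of a save and load round trip, using a temporary file. The object names `scores` and `info` and the file name are made up for illustration.

```r
scores <- 1:5
info <- data.frame(x = c(1.5, 2.5))

rdata <- tempfile(fileext = ".RData")   # temporary file, for illustration only

# save() writes the objects you name; save.image() would write the whole workspace
save(scores, info, file = rdata)

# Remove the objects, then bring them back under their original names
rm(scores, info)
restored <- load(rdata)     # load() returns the names of the objects it restored
restored                    # "scores" "info"
exists("scores")            # TRUE
```

Note that load( ) silently overwrites any object in your workspace that has the same name as one in the file.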

## 3.5 View data
It is very important to check the data immediately after we import it into R.

In [48]:
# Check the dimensions of the data frame.
dim(mtcars)

In [49]:
# Check the column names
names(mtcars)

In [50]:
# Or if you remember the function str()
str(mtcars)

'data.frame':	32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...


In [65]:
# Look at the first few rows (default is 6 rows; here we ask for 10)
head(mtcars, n=10)

,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2
Valiant,18.1,6,225.0,105,2.76,3.46,20.22,1,0,3,1
Duster 360,14.3,8,360.0,245,3.21,3.57,15.84,0,0,3,4
Merc 240D,24.4,4,146.7,62,3.69,3.19,20.0,1,0,4,2
Merc 230,22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
Merc 280,19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4


In [52]:
# Check the last few rows (default is 6 rows; here we ask for 3)
tail(mtcars, n = 3)

,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Ferrari Dino,19.7,6,145,175,3.62,2.77,15.5,0,1,5,6
Maserati Bora,15.0,8,301,335,3.54,3.57,14.6,0,1,5,8
Volvo 142E,21.4,4,121,109,4.11,2.78,18.6,1,1,4,2


In [53]:
# Quick summary of the data frame
summary(mtcars)

      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000  

In [54]:
# Check missing values
sum(is.na(mtcars))
# is.na() is TRUE if a cell is NA (a missing value).
# sum() over all cells counts how many TRUEs there are.
# Recall from Lecture 2.
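
The same idea also tells us where the missing values are. A small sketch on a made-up data frame `d` that does contain NAs:

```r
# A small data frame with missing values (made-up numbers)
d <- data.frame(age = c(34, NA, 51), weight = c(NA, NA, 70))

is.na(d)              # a TRUE/FALSE matrix, one entry per cell
sum(is.na(d))         # 3 missing cells in total
colSums(is.na(d))     # missing cells per column: age 1, weight 2
```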