Data Analysis in R
===

## Review: working directory
* R works best when each project has a dedicated directory (folder), known as the working directory. Put all data files in the working directory or in its subdirectories.
* An RStudio "project" is built around such a working directory. To learn more about RStudio, see the [Introduction to RStudio](http://hpc.loni.org/training/weekly-materials/2020-Spring/HPC_Intro_RStudio_Spring2020.pdf) tutorial from LONI HPC. 
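The per-project layout described above can be sketched as follows. This is a minimal illustration that works inside a temporary directory; the folder name `demo_project` and the file `example.csv` are made up for the example and are not part of the workshop material.

```r
# Remember where we started so we can return afterwards.
old_wd <- getwd()

# Hypothetical project folder with a "data" sub-directory (names are illustrative).
project_dir <- file.path(tempdir(), "demo_project")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# Make the project folder the working directory.
setwd(project_dir)

# Relative paths now resolve against the project folder.
data_path <- file.path("data", "example.csv")
write.csv(data.frame(id = 1:3), data_path, row.names = FALSE)
found <- file.exists(data_path)
files_in_data <- list.files("data")

# Go back to the original working directory.
setwd(old_wd)
```

Building paths with `file.path()` and keeping them relative to the project folder means the same code keeps working when the whole folder is moved to another machine.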

Show current working folder:

In [None]:
getwd()

Let us create a new folder called "data":

In [None]:
dir.create("data")  

Go to this new folder:

In [None]:
setwd("data")

Show current working folder:

In [None]:
getwd()

## Dataset: Forbes Global 2000 list
* The `forbes` dataset ([download](http://hpc.loni.org/training/weekly-materials/Downloads/Forbes2000.csv.zip)) consists of 2000 rows (observations) on 8 variables describing each company's rank, name, country, category, sales, profits, assets and market value:
> * **`rank`** the ranking of the company
> * **`name`** the name of the company
> * **`country`** the country the company is situated in
> * **`category`** the products the company produces
> * **`sales`** the amount of sales of the company in billion USD
> * **`profits`** the profit of the company in billion USD
> * **`assets`** the assets of the company in billion USD
> * **`marketvalue`** the market value of the company in billion USD

*  First 10 lines of the raw data

>rank | name| country| category | sales | profits | assets | marketvalue
>--- | --- | --- | --- | --- | --- | --- | ---
>1 | Citigroup | United States | Banking | 94.71 | 17.85 | 1264.03 | 255.3
>2 | General Electric | United States | Conglomerates | 134.19 | 15.59 | 626.93 | 328.54
>3 | American Intl Group | United States | Insurance | 76.66 | 6.46 | 647.66 |194.87
>4 | ExxonMobil | United States | Oil & gas operations | 222.88 | 20.96 | 166.99 | 277.02
>5 | BP | United Kingdom | Oil & gas operations | 232.57 | 10.27 | 177.57 | 173.54
>6 | Bank of America | United States | Banking | 49.01 | 10.81 | 736.45 | 117.55
>7 | HSBC Group | United Kingdom | Banking | 44.33 | 6.66 | 757.6 | 177.96
>8 | Toyota Motor | Japan | Consumer durables | 135.82 | 7.99 | 171.71 | 115.4
>9 | Fannie Mae | United States | Diversified financials | 53.13 | 6.48 | 1019.17 | 76.84
>10 | Wal-Mart Stores | United States | Retailing | 256.33 | 9.05 | 104.91 | 243.74


In [None]:
# We are in the /content/data directory
getwd()
# Go back to the /content directory
setwd("/content")
# Create a working directory for the Forbes project
dir.create("forbes")
# Go to the working directory
setwd("forbes")
# Make sure we are in the right directory
getwd()

# Step by step Data Analysis in R


1. Get data
2. Read data
3. Inspect data
4. Preprocess data (missing and dubious values, discard columns not needed etc.)
5. Analyze data
6. Generate report







## 1. Getting Data
* Downloading files from the internet
> * Manually download the file to the working directory 
> * Or use R function `download.file()`
* Unzip with the `unzip()` function

In [None]:
# Download the file
download.file("http://hpc.loni.org/training/weekly-materials/Downloads/Forbes2000.csv.zip","Forbes2000.csv.zip")
# Unzip the file
unzip("Forbes2000.csv.zip","Forbes2000.csv")
# Make sure we have the files
list.files()   

## 2. Reading data
* R understands many different data formats and has lots of ways of reading/writing them (csv, xml, excel, sql, json etc.)

>Input | Output | Purpose
>--- | --- | ---
>read.table (read.csv) | write.table (write.csv) | for reading/writing tabular data
>readLines | writeLines | for reading/writing lines of a text file
>source | dump | for reading/writing R code files
>dget | dput | for reading/writing a text representation of a single R object
>load | save | for reading in/saving workspaces

* `read.csv()` is a wrapper around `read.table()` with different defaults: the separator is a comma (`sep = ","`) and the first row is treated as a header (`header = TRUE`).
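To make the relationship between the two functions concrete, here is a small sketch that reads the same comma-separated text both ways. The inline text and the column names are invented for the illustration; the `text =` argument is used only so the example needs no file on disk.

```r
# A tiny, made-up CSV held in a string instead of a file.
csv_text <- "name,sales\nAlpha,1.5\nBeta,NA"

# read.csv(): comma separator and header row are the defaults.
via_csv <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# read.table(): the same result once sep and header are given explicitly.
via_table <- read.table(text = csv_text, header = TRUE, sep = ",",
                        stringsAsFactors = FALSE)

# Both calls produce the same data frame.
same_result <- identical(via_csv, via_table)
same_result
```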

In [None]:
# Read the csv file into the dataframe "forbes"
# header: whether the data file has a header row.
# stringsAsFactors: whether to treat strings as factor levels.
# na.strings: what strings denote a missing value.
# sep: what is the separator for columns.

forbes <- read.csv("Forbes2000.csv",header=T,stringsAsFactors = FALSE,na.strings ="NA",sep=",")

In [None]:
# Inspect the dataframe "forbes" to make sure the data was successfully read.
str(forbes)

* **Note: Changes since R 4.0.0** 
> * R now uses a `stringsAsFactors = FALSE` default, and hence by default no longer converts strings to factors in calls to `data.frame()` and `read.table()`.
* It is good practice to set options such as `stringsAsFactors` explicitly rather than relying on defaults, which can differ between R versions.

* For big datasets, consider using the `fread()` function

In [None]:
# Download a dummy dataset called "Forbes_big", which has 2,000,000 rows
download.file("http://hpc.loni.org/training/weekly-materials/Downloads/Forbes_big.csv.zip","Forbes_big.csv.zip")
# Unzip the data file.
unzip("Forbes_big.csv.zip","Forbes_big.csv")
# Check its size.
file.info("Forbes_big.csv")

In [None]:
# install and load package "data.table" so we have the fread() function.
install.packages("data.table")
library(data.table)

In [None]:
# Compare the reading speed between read.csv() and fread()
t1 <- system.time({
test1<-read.csv("Forbes_big.csv",header=T,stringsAsFactors = FALSE,na.strings ="NA",sep=",")
})
print("read.csv():")
print(t1)

t2 <- system.time({
test2<-fread("Forbes_big.csv",data.table=FALSE,header=T)
})
print("fread():")
print(t2)

## 3. Inspecting data
* `class()`: check object class
* `dim()`: dimension of the data
* `head()`: print on screen the first few lines of data, may use n as argument
* `tail()`: print the last few lines of data

In [None]:
# Check the object type.
class(forbes)
# Check the dimension.
dim(forbes)

In [None]:
# Show part of the data.
head(forbes,n=10)

* `str()` (structure) displays the structure of the “forbes” dataframe.


In [None]:
str(forbes)

* `summary()` gives a statistical summary of the `forbes` dataframe. **Note: there are missing values (NAs) in the `profits` column.**







In [None]:
summary(forbes)

## 4. Preprocess data 


### 4.1 Preprocessing - missing values
* Missing values are denoted in R by `NA`; the result of an undefined mathematical operation (e.g. `0/0`) is `NaN`.
> * `is.na(x)` tests each element of `x` and returns `TRUE` where the value is `NA` (or `NaN`).
> * `which(is.na(x))` returns the indices of the NA values.
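The difference between `NA` and `NaN` is easiest to see on a short vector. The sketch below uses a made-up vector `x`, not the Forbes data:

```r
# A small illustrative vector: one NA and one NaN (0/0 is undefined).
x <- c(1.2, NA, 0/0, 4)

# is.na() flags both NA and NaN ...
na_flags <- is.na(x)

# ... while is.nan() flags only NaN.
nan_flags <- is.nan(x)

# Positions and number of missing values.
na_positions <- which(is.na(x))
na_count <- sum(is.na(x))

na_flags
nan_flags
na_positions
na_count
```

Summing the logical vector returned by `is.na()` is a quick way to count missing values, because `TRUE` is treated as 1.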

In [None]:
# Don't run this command line as you will get a very long list
# is.na(forbes$profits)  

In [None]:
# If you cannot hold your curiosity back, run this instead:
head(is.na(forbes$profits))

In [None]:
which(is.na(forbes$profits))

In [None]:
# Alternatively, we can use the complete.cases() function.
which(! complete.cases(forbes))

* More about missing value inspection


In [None]:
# How many NAs?
table(is.na(forbes$profits))

In [None]:
# List the rows with missing values
forbes[is.na(forbes$profits),]

* Many R functions also have a logical “`na.rm`” option
> * `na.rm=TRUE` means the NA values should be discarded


In [None]:
# Find the mean of the profits (should return NA).
mean(forbes$profits)  
# This will not (after removing the NA values).
mean(forbes$profits,na.rm=T)

* **Note: Not all missing values are marked with “NA” in the raw data!**
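When a raw file marks missing values with something other than `NA`, the `na.strings` argument can list every marker. The sketch below is only an illustration: the inline data and the markers `"N/A"` and `"-"` are assumptions chosen for the example, not markers known to occur in the Forbes file.

```r
# Made-up raw data with several different "missing" markers.
raw <- "name,profits\nA,1.2\nB,-\nC,N/A\nD,\nE,3.4"

# Tell read.csv() which strings should be read as NA.
d <- read.csv(text = raw,
              na.strings = c("NA", "N/A", "-", ""),
              stringsAsFactors = FALSE)

# The column stays numeric and the markers become NA.
str(d)
missing_count <- sum(is.na(d$profits))
missing_count
```

Without the extra markers, a value such as `"N/A"` would force the whole column to be read as character, so checking `str()` after reading is a cheap way to spot unmarked missing values.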


* The simplest way to deal with missing values is to remove them. 
> * If a column (variable) has a high percentage of missing values, remove the whole column or simply do not use it in the analysis.
> * If a row (observation) has a missing value, remove the row with `na.omit()`, e.g. 


In [None]:
forbes2 <- na.omit(forbes)
dim(forbes2)

* Alternatively, the missing values can be replaced by basic statistics e.g. 
> * replace by mean 


In [None]:
# The vector "miss" contains the row indices of missing values
miss<-which(is.na(forbes$profits)) 
forbes[miss,"profits"] <- mean(forbes$profits,na.rm=T)
forbes[miss,]

* Or use advanced statistical techniques. List of popular R Packages:
> * MICE
> * Amelia (named after Amelia Earhart, the first female pilot to fly solo across the Atlantic Ocean, who mysteriously went **missing** while flying over the Pacific Ocean in 1937.)
> * missForest (non parametric imputation method)
> * Hmisc
> * mi

### 4.2 Preprocessing - subsetting data
* On most occasions we do not need all of the raw data
* Reducing the size of data saves resources needed to deal with it
* There are a number of methods of extracting a subset of R objects, either by row or column or both.


#### 4.2.1 Subsetting by row: use conditions


Find all companies with negative profit:

In [None]:
forbes[forbes$profits < 0,]

Find three companies with the highest sales:


In [None]:
# The order function returns indices
order_sales <- order(forbes$sales, decreasing=T)
# Top 3
forbes[order_sales[1:3],c("name","sales")]

#### 4.2.2 Subsetting by row: use `subset()` function
Pass the condition as an argument to the `subset()` function.
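One practical difference between bracket indexing and `subset()` shows up when the condition evaluates to `NA`. The sketch below uses a made-up data frame `toy` to illustrate it; it is not taken from the workshop data.

```r
# Illustrative data with one missing profit value.
toy <- data.frame(name = c("A", "B", "C", "D"),
                  profits = c(-1.5, NA, 2.0, -0.3))

# Bracket indexing keeps a row of NAs where the condition is NA.
by_bracket <- toy[toy$profits < 0, ]

# Wrapping the condition in which() drops the NA case.
by_which <- toy[which(toy$profits < 0), ]

# subset() also drops rows where the condition is NA.
by_subset <- subset(toy, profits < 0)

nrow(by_bracket)  # includes the NA row
nrow(by_which)
nrow(by_subset)
```

This is one reason to handle missing values before subsetting by condition, or to prefer `subset()` or `which()` when NAs may still be present.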


In [None]:
# Find the business categories to which the Bermuda companies belong.
# Note that it's "country" instead of "forbes$country" in the subset() function.
Bermudacomp <- subset(forbes, country == "Bermuda")
table(Bermudacomp[,"category"]) #frequency table of categories

In [None]:
# Subset the insurance companies located in Bermuda.
subset(forbes, country == "Bermuda" & category == "Insurance")

#### 4.2.3 Subsetting by column
Create another dataframe with only numeric variables

In [None]:
#use data.frame function
forbes3 <- data.frame(sales=forbes$sales,profits=forbes$profits,
           assets=forbes$assets, mvalue=forbes$marketvalue)
str(forbes3)

#use subset() function
forbes4 <- subset(forbes,select=c(sales,profits,assets,marketvalue))
str(forbes4)

#or simply use indexing
forbes5 <- forbes[,c(5:8)]
str(forbes5)

#### 4.2.4 Subsetting by row and column


In [None]:
# Find the insurance companies based in Bermuda and retain only the name and the numeric columns.
subset(forbes, country == "Bermuda" & category == "Insurance", select=c(name,sales,profits,assets,marketvalue))

### 4.3 Preprocessing – Factors
* Factors are variables in R which take on a limited number of different values; such variables are often referred to as categorical variables


We can use the `factor()` function to convert characters to (unordered) factors:

In [None]:
forbes$country<-factor(forbes$country)
str(forbes)

* We often want to merge classes/categories (as long as it makes sense):
 * For better model performance.
 * Some categories may have too few subjects.

In [None]:
table(forbes$country)

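* Merge small classes into larger classes

Merge the Latin American and Caribbean countries into the existing level "Venezuela". The level name is only a placeholder (a value assigned to a factor must be one of its existing levels); the merged levels are renamed later.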

In [None]:
 forbes$country[(forbes$country=="Bahamas")|(forbes$country=="Bermuda")|(forbes$country=="Brazil")|(forbes$country=="Cayman Islands")|(forbes$country=="Chile")|(forbes$country=="Panama/ United Kingdom")|(forbes$country=="Peru")]<-"Venezuela"

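Do the same for Europe (placeholder level "United Kingdom"), East/Southeast Asia ("Thailand") and all remaining countries ("United Kingdom/ South Africa"):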

In [None]:
forbes$country[(forbes$country=="Austria")|(forbes$country=="Belgium")|(forbes$country=="Czech Republic")|(forbes$country=="Denmark")|(forbes$country=="Finland")|(forbes$country=="France")|(forbes$country=="Germany")|(forbes$country=="Greece")|(forbes$country=="Hungary")|(forbes$country=="Ireland")|(forbes$country=="Italy")|(forbes$country=="Luxembourg")|(forbes$country=="Netherlands")|(forbes$country=="Norway")|(forbes$country=="Poland")|(forbes$country=="Portugal")|(forbes$country=="Russia")|(forbes$country=="Spain")|(forbes$country=="Sweden")|(forbes$country=="Switzerland")|(forbes$country=="Turkey")|(forbes$country=="France/ United Kingdom")|(forbes$country=="United Kingdom/ Netherlands")|(forbes$country=="Netherlands/ United Kingdom")]<-"United Kingdom"
forbes$country[(forbes$country=="China")|(forbes$country=="Hong Kong/China")|(forbes$country=="Indonesia")|(forbes$country=="Japan")|(forbes$country=="Kong/China")|(forbes$country=="Korea")|(forbes$country=="Malaysia")|(forbes$country=="Philippines")|(forbes$country=="Singapore")|(forbes$country=="South Korea")|(forbes$country=="Taiwan")]<-"Thailand"
forbes$country[(forbes$country=="Africa")|(forbes$country=="Australia")|(forbes$country=="India")|(forbes$country=="Australia/ United Kingdom")|(forbes$country=="Islands")|(forbes$country=="Israel")|(forbes$country=="Jordan")|(forbes$country=="Liberia")|(forbes$country=="Mexico")|(forbes$country=="New Zealand")|(forbes$country=="Pakistan")|(forbes$country=="South Africa")|(forbes$country=="United Kingdom/ Australia")]<-"United Kingdom/ South Africa"

* Drop those levels with zero counts

Use `droplevels()` function:


In [None]:
forbes$country<-droplevels(forbes$country)

Now we can check the new frequency tables:

In [None]:
table(forbes$country)

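* Rename each class (the new names must be given in the same order as the current levels, which are sorted alphabetically)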

In [None]:
levels(forbes$country)<-c("Canada","East/Southeast Asia","Europe","Other","United States","Latin America")
levels(forbes$country)
table(forbes$country)

**Can we use the `match()` function we learned about yesterday for this task (dictionary lookup)?**
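One possible way to do the merge as a dictionary lookup is sketched below. The lookup table `region_dict` and the short `countries` vector are made up for illustration and cover only a few countries; a real dictionary would need one row per country in the data.

```r
# Hypothetical dictionary: one row per country, with the region it maps to.
region_dict <- data.frame(
  country = c("Japan", "France", "Brazil", "Germany"),
  region  = c("East/Southeast Asia", "Europe", "Latin America", "Europe"),
  stringsAsFactors = FALSE
)

# Illustrative stand-in for the forbes$country column.
countries <- c("France", "Japan", "Germany", "Brazil", "France")

# match() returns, for each country, its row position in the dictionary.
idx <- match(countries, region_dict$country)

# Use those positions to look up the region, then convert to a factor.
regions <- factor(region_dict$region[idx])
table(regions)
```

A country that is missing from the dictionary yields `NA` from `match()`, so checking `anyNA(idx)` is a simple way to find entries the dictionary does not cover.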

### 4.4 Export the cleaned dataset
* Save forbes to Forbes2000_clean.csv

In [None]:
write.csv(forbes,"Forbes2000_clean.csv",row.names=FALSE)
list.files()

## Exercise 1

1. Find all German companies with negative profit


In [None]:
#@title Hint { display-mode: "form" }
forbes <- read.csv("Forbes2000.csv",header=T,stringsAsFactors = FALSE,na.strings ="NA",sep=",") #reload raw data 
forbes <- na.omit(forbes)  # omit NAs
# finish lines below:
Germanycomp <- subset(    ,    ) # get the subset of German companies
Germanycomp[       ,c("name","sales","profits","assets")]

In [None]:
#@title Solution { display-mode: "form" }

forbes <- read.csv("Forbes2000.csv",header=T,stringsAsFactors = FALSE,na.strings ="NA",sep=",") #reload raw data 
forbes <- na.omit(forbes)  # omit NAs
Germanycomp <- subset(forbes, country == "Germany")
Germanycomp[Germanycomp$profits < 0,c("name","sales","profits","assets")]

2. Arbitrarily merge the categories into three sectors: industry, services and finance.
* Can we use the `match()` function to perform a dictionary lookup?

In [None]:
#@title Hint { display-mode: "form" }

# factorize the values in the "category" 
forbes$category <- factor(forbes$category)
str(forbes)
table(forbes$category)

# arbitrarily define "Industry"
forbes$category[(forbes$category=="Aerospace & defense")|(forbes$category=="Chemicals")|(forbes$category=="Conglomerates")|(forbes$category=="Construction")|(forbes$category=="Consumer durables")|(forbes$category=="Drugs & biotechnology")|(forbes$category=="Food markets")|(forbes$category=="Food drink & tobacco")|(forbes$category=="Food markets")|(forbes$category=="Household & personal products")|(forbes$category=="Materials")|(forbes$category=="Semiconductors")|(forbes$category=="Technology hardware & equipment")|(forbes$category=="Utilities")]<-"Industry"
# arbitrarily define "Services" and "Finance"
# finish lines below
#forbes$category[    ]<-"Services"
#forbes$category[    ]<-"Finance"
# drop levels with 0 count
#forbes$category<-droplevels()
table(forbes$category)

In [None]:
#@title Solution { display-mode: "form" }

# Extract the values of categories
categories <- unique(forbes$category)
# Create a random (arbitrary) sample vector of the same length
sectors <- sample(c("Industry","Services","Finance"),length(categories),
  replace=T)
# Combine the two vectors into a dictionary
sectordict <- cbind.data.frame(categories,sectors)
# Use the match() function to perform a dictionary lookup.
sector_indices <- match(forbes$category,sectordict$categories)
forbes_w_sectors <- cbind.data.frame(forbes,sectors=sectordict$sectors[sector_indices])
# Convert the "sectors" column into factor.
forbes_w_sectors$sectors <- factor(forbes_w_sectors$sectors)
# Inspect the result
str(forbes_w_sectors)
table(forbes_w_sectors$sectors)


## 5. Data analysis


### 5.1 Two common questions:
* Which statistical model should I use?
* How to choose the right R packages?


#### Which statistical model should I use for my data analysis?
* This is not a statistics workshop…
* Courses provided by your institution or open courses such as Coursera
* You could collaborate with statisticians on your campus

#### How to choose the right R packages for my data analysis?
* CRAN task views 
https://cran.r-project.org/web/views/
* RDocumentation https://www.rdocumentation.org
> * a website, an R package and an API
> * supports task views 
> * searches all CRAN, Bioconductor and GitHub packages




### 5.2 Import the cleaned dataset (Optional)
* Read the cleaned data back into a new dataframe

In [None]:
# Read the data back in from the file we just wrote.
forbes_clean <- read.csv("Forbes2000_clean.csv",header=T,stringsAsFactors = T,na.strings ="NA",sep=",")
# Sanity check
str(forbes_clean)
summary(forbes_clean)
dim(forbes_clean)

### 5.3 Extract Variables 
* Create another data frame with only numeric variables + country

In [None]:
# We can save the result into the same object, 
# but do be careful when you do this - the old object will be overwritten.
forbes_clean <- forbes_clean[,c(3, 5:8)]
str(forbes_clean)

### 5.4 Training Set and Test Set
* A dataset can be randomly split into two parts: a training set and a test set. 
 * Use the training set to build the model
 * Use the test set to validate the model
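The split can also be written without hard-coding the number of rows. The helper below, `split_train_test()`, is a hypothetical function written for this illustration (it is not part of base R or of the workshop code), and the toy data frame stands in for the real dataset.

```r
# Hypothetical helper: random train/test split based on nrow().
split_train_test <- function(df, train_frac = 0.8, seed = 1) {
  set.seed(seed)                       # make the split reproducible
  n <- nrow(df)
  shuffled <- sample(seq_len(n))       # random permutation of the row indices
  n_train <- floor(train_frac * n)
  train_idx <- head(shuffled, n_train)
  test_idx <- tail(shuffled, n - n_train)
  list(train = df[train_idx, , drop = FALSE],
       test  = df[test_idx, , drop = FALSE])
}

# Toy data standing in for the real dataset.
toy <- data.frame(x = 1:10, y = (1:10)^2)
parts <- split_train_test(toy, train_frac = 0.8, seed = 1)

nrow(parts$train)
nrow(parts$test)
```

Deriving the sizes from `nrow()` keeps the split correct if rows are removed earlier in the pipeline, for example by `na.omit()`.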

In [None]:
set.seed(1) # set the random seed so that the split is reproducible
indx <- sample(1:2000,size=2000,replace=F)
forbes.train <- forbes_clean[indx[1:1600],]
forbes.test <- forbes_clean[indx[1601:2000],]

### 5.5 Roadmap of generalizations of linear models
* Roadmap of generalizations of linear models:
> https://drive.google.com/open?id=1HrnpinlmyZl9_GL9xX24a5Nv6PUumovI


* Explanation of Acronyms

>Models | Acronym | R function
>--- | --- | ---
>Linear Models | LM | lm, aov
>Generalized LMs | GLM | glm
>Linear Mixed Models | LMM | lme, aov
>Non-linear Models | NLM | nls
>Non-linear Mixed Models | NLMM | nlme
>Generalized LMMs | GLMM | glmmPQL
>Generalized Additive Models | GAM | gam


* Symbol Meanings in Model Formulae

>Symbol | Example | Meaning
>--- | --- | ---
>+ | +X | Include variable X in the model
>- | -X | Exclude variable X from the model
>: | X:Z | Include the interaction between X and Z
>\* | X\*Z | Include X and Z and the interactions
>\| | X \| Z | Conditioning: include X given Z
>^ | (A+B+C)^3 | Include A, B and C and all the interactions up to three way
>I() | I(X\*Z) | As is: include a new variable consisting of these variables multiplied




* Model Formulae
 * General form: response ~ term1 <+ term2 + term3...>

> Example | Meaning
>--- | --- 
>y ~ x | Simple regression
>y ~ -1 +  x | LM through the origin
>y ~ x + I(x^2) | Quadratic regression
>y ~ x1 + x2 + x3 | Multiple regression
>y ~ . | All variables included
>y ~ . - x1 | All variables except x1
>y ~ A + B + A : B | Add interaction
>y ~ A \* B | Same as above
>y ~ (A+B)^2 | Same as above
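Inside a model formula, `^` is a formula operator rather than arithmetic, so a squared term has to be wrapped in `I()`. The sketch below fits both versions to simulated data; the data and the coefficient values used to simulate it are made up for the illustration.

```r
# Simulated data with a genuine quadratic relationship.
set.seed(42)
x <- 1:20
y <- 2 + 0.5 * x + 0.3 * x^2 + rnorm(20, sd = 0.1)

# Without I(), x^2 is read as a formula operation and reduces to x.
fit_plain <- lm(y ~ x + x^2)

# With I(), x^2 is evaluated arithmetically and becomes its own term.
fit_quad <- lm(y ~ x + I(x^2))

names(coef(fit_plain))
names(coef(fit_quad))
```

Comparing the coefficient names is a quick check that the squared term actually entered the model.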






### 5.6 A multiple linear regression example
* Predict the market value with other variables, including the geographical location
> marketvalue ~ profits + sales + assets + country


In [None]:
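# "~ ." means: use all other columns of the data as predictors.
# The fit is stored as "lmfit" to avoid confusion with the lm() function itself.
lmfit <- lm(marketvalue ~ ., data = forbes.train)
summary(lmfit)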

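* For a factor with n levels (here `country`), R by default creates n-1 dummy variables, each taking the value 0 or 1, with the first level serving as the baseline. Together these n-1 variables carry the same information as the original factor. The table that defines this recoding is called the contrast matrix.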
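The recoding can be inspected on a small factor. The sketch below uses a made-up factor `region` with three levels and assumes R's default treatment contrasts for unordered factors.

```r
# Illustrative factor with three levels.
region <- factor(c("Asia", "Europe", "Other", "Europe", "Asia"))

# Contrast matrix: 3 levels are coded by 3 - 1 = 2 dummy variables.
cm <- contrasts(region)
cm

# The design matrix that a model would use: intercept plus the two dummies.
design <- model.matrix(~ region)
colnames(design)
```

The first level ("Asia" here) has no column of its own; it is the baseline against which the other levels are compared.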


In [None]:
contrasts(forbes.train$country)


### 5.7 A Stepwise regression example
* The function `regsubsets()` in the leaps package allows us to perform stepwise regression


In [None]:
# Install and load package
install.packages("leaps")
library(leaps)
# Stepwise regression.
bwd <- regsubsets(marketvalue ~ ., data = forbes.train,nvmax =3,method ="backward")
summary(bwd)

An asterisk indicates that a given variable is included in the corresponding model.


### 5.8 A Regression tree example
* The function `rpart()` in the rpart package allows us to grow a regression tree


In [None]:
# Install and load package
install.packages("rpart")
library(rpart)
# Build the regression tree.
rpartmodel <- rpart(marketvalue ~ ., data = forbes.train,control = rpart.control(xval = 10, minbucket = 50))
# The par() function can be used to set graphic parameters.
par(mfrow=c(1,1),xpd=NA,cex=1.5)
# Show the result regression tree.
plot(rpartmodel,uniform=T)
text(rpartmodel,use.n=T)
# We can get the predicted values by calling the predict() function.
# predict(rpartmodel,forbes.test)

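### 5.9 A Bagging tree example
* The function `randomForest()` in the randomForest package allows us to grow an ensemble of regression trees. Note that with its default `mtry` it fits a random forest; setting `mtry` to the number of predictors turns it into bagging.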


In [None]:
# Install and load package
install.packages("randomForest")
library(randomForest)
# Build the bagging tree.
bag <- randomForest(marketvalue ~ ., data = forbes.train, importance =TRUE)
# Examine the result.
importance(bag)
varImpPlot(bag)

### 5.10 The predictive results in terms of the MAD and RMSE values 
* MAD (mean absolute deviation):

$MAD = \frac{1}{N}\sum_{i=1}^N|y_i-\hat{y}_i|$


* RMSE (root mean squared error):

$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^N(y_i-\hat{y}_i)^2}$
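The two formulas translate directly into small helper functions. The names `rmse()` and `mean_abs_dev()` are chosen for this sketch (the second deliberately avoids `mad`, which in base R is the median absolute deviation), and the vectors are made-up numbers.

```r
# Root mean squared error between observed y and predicted yhat.
rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))

# Mean absolute deviation between observed y and predicted yhat.
mean_abs_dev <- function(y, yhat) mean(abs(y - yhat))

# Made-up observed and predicted values; the errors are 1, 0, -1, -3.
y    <- c(3, 5, 7, 9)
yhat <- c(2, 5, 8, 12)

rmse_value <- rmse(y, yhat)          # sqrt((1 + 0 + 1 + 9) / 4) = sqrt(2.75)
mad_value  <- mean_abs_dev(y, yhat)  # (1 + 0 + 1 + 3) / 4 = 1.25
rmse_value
mad_value
```

Because RMSE squares the errors before averaging, the single large error of 3 pulls it above the MAD.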



* Bagging tree example for calculating RMSE and MAD

In [None]:
forbes_clean2 <- forbes_clean[,c(2:5)]  # create a new dataframe with only numeric variables included
set.seed(2) 
indx <- sample(1:2000,size=2000,replace=F)
forbes.train <- forbes_clean2[indx[1:1600],]
forbes.test <- forbes_clean2[indx[1601:2000],]
bag <- randomForest(marketvalue ~ ., data = forbes.train, importance =TRUE)
# RMSE and MAD on the test set
bag.yhat <- predict(bag, newdata = forbes.test) 
# Extract the response as a numeric vector
bag.y <- forbes.test$marketvalue 
bag.rmse <- sqrt(mean((bag.y - bag.yhat)^2))
bag.rmse
bag.abs <- abs(bag.y - bag.yhat) 
# mean() avoids hard-coding the number of test rows
bag.mad <- mean(bag.abs) 
bag.mad 

## Exercise 2
1. Use the `lm()` function to perform a multiple linear regression with profits as the response and all other numeric variables as the predictors. Use the `summary()` function to print the results. 


In [None]:
forbes_clean2 <- forbes_clean[,c(2:5)]  # create a new dataframe with only numeric variables included
set.seed(3) 
indx <- sample(1:2000,size=2000,replace=F)
forbes.train <- forbes_clean2[indx[1:1600],]
forbes.test <- forbes_clean2[indx[1601:2000],]
str(forbes.train)

In [None]:
#@title Hint {display-mode: "form"}

#finish lines below
lm <- lm(   profits ~ , data=forbes.train     )
summary(lm)

In [None]:
#@title Solution { display-mode: "form" }
lm <- lm(profits ~ ., data = forbes.train)
summary(lm)

2. Comment on the output. For instance:  Is there a relationship between the predictors and the response? 

3. Which predictors appear to have a statistically significant relationship to the response? 

4. What does the coefficient for the sales variable suggest?

## Exercise 3

Compare the RMSE for the multiple linear regression (`lm`), regression tree (`rpart`) and bagging tree (`randomForest`). 

Report your findings in a table like this:

>Model | RMSE 
>--- | --- 
>MLR | 14.41041 
>Regression tree | 17.85625 
>Bagging tree | 11.69301 

In [None]:
# Make sure that you are using the same training and test data.

forbes_clean2 <- forbes_clean[,c(2:5)]  # create a new dataframe with only numeric variables included
set.seed(3) 
indx <- sample(1:2000,size=2000,replace=F)
forbes.train <- forbes_clean2[indx[1:1600],]
forbes.test <- forbes_clean2[indx[1601:2000],]

In [None]:
#@title Hint { display-mode: "form" }

# For each model, do the following:
# Fit the model
# Calculate the yhat with the predict() function
# Extract the y values
# Calculate the RMSE using the example we covered above

In [None]:
#@title Solution { display-mode: "form" }

mlr <- lm(marketvalue ~ ., data = forbes.train)
# RMSE
mlr.yhat <- predict(mlr, newdata = forbes.test) 
mlr.y <- forbes.test["marketvalue"] 
mlr.rmse <- sqrt(mean(data.matrix((mlr.y - mlr.yhat)^2)))
cat("RMSE for MLR: ",mlr.rmse,"\n")

rt <- rpart(marketvalue ~ ., data = forbes.train,control = rpart.control(xval = 10, minbucket = 50))
# RMSE
rt.yhat <- predict(rt, newdata = forbes.test) 
rt.y <- forbes.test["marketvalue"] 
rt.rmse <- sqrt(mean(data.matrix((rt.y - rt.yhat)^2)))
cat("RMSE for RT: ",rt.rmse,"\n")

bag <- randomForest(marketvalue ~ ., data = forbes.train, importance =TRUE)
# RMSE
bag.yhat <- predict(bag, newdata = forbes.test) 
bag.y <- forbes.test["marketvalue"] 
bag.rmse <- sqrt(mean(data.matrix((bag.y - bag.yhat)^2)))
cat("RMSE for BT: ", bag.rmse,"\n")

# There are packages that can help you, e.g. MLmetrics.
install.packages("MLmetrics")
library(MLmetrics)
# Use the MSE function to calculate mean square error.
sqrt(MSE(predict(rt, newdata = forbes.test), 
  forbes.test$marketvalue))


## 6. Generate report with R Markdown
### 6.1 Why R Markdown
* Weaves R code and human-readable text together into a plain text file with the extension `.Rmd`, which the `rmarkdown` package can then convert (render) into many output formats, both documents and presentations:
> * beamer_presentation
> * context_document
> * github_document
> * html_document
> * ioslides_presentation
> * latex_document
> * md_document
> * odt_document
> * pdf_document
> * powerpoint_presentation
> * rtf_document
> * slidy_presentation
> * word_document

* [Rmarkdown gallery](https://rmarkdown.rstudio.com/gallery.html)
* Also helps make your research reproducible



### 6.2 Working with R Markdown

![An example R Markdown file](https://d33wubrfki0l68.cloudfront.net/ece57b678854545e6602a23daede51ad72da2170/21cca/lesson-images/outputs-1-word.png)

Create the R markdown file, which usually contains three types of content:
* An (optional) YAML header surrounded by `---`s
* R code chunks surrounded by ` ``` `s
* Text mixed with simple text formatting
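To show how the three types of content fit together, the sketch below writes a very small `.Rmd` file from R. The file name `toy_report.Rmd`, the title and the body text are invented for this illustration; the file is written to a temporary directory and is unrelated to the `Example.Rmd` used in this section.

```r
# Build the chunk delimiter (three backticks) programmatically.
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",                        # YAML header starts
  'title: "Toy report"',
  "output: html_document",
  "---",                        # YAML header ends
  "",
  "## Summary",                 # text with simple formatting
  "",
  "Some *formatted* text describing the analysis.",
  "",
  paste0(fence, "{r}"),         # R code chunk starts
  "summary(cars)",
  fence                         # R code chunk ends
)

# Write the file to a temporary location and read it back.
rmd_path <- file.path(tempdir(), "toy_report.Rmd")
writeLines(rmd_lines, rmd_path)
written <- readLines(rmd_path)
written

# Rendering needs the rmarkdown package and pandoc, so it is left commented out:
# rmarkdown::render(rmd_path)
```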


In [None]:
download.file("https://raw.githubusercontent.com/lsuhpchelp/lbrnloniworkshop2022/master/day2/Example.Rmd","Example.Rmd")

Then install and load the `rmarkdown` package.

In [None]:
install.packages("rmarkdown")
library(rmarkdown)

Render the R Markdown file into the desired format.

In [None]:
render("Example.Rmd", output_format = "word_document")

You can find an excellent R Markdown lesson from RStudio [here](https://rmarkdown.rstudio.com/lesson-1.html).

### 6.3 Converting Jupyter/Colab notebook to `.Rmd`


**The conversion script below was developed by JJ Allaire, Yihui Xie, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, Hadley Wickham, Joe Cheng, Winston Chang and Richard Iannone.
https://github.com/rstudio/rmarkdown/blob/master/R/jupyter.R**

In [None]:
# Download conversion script
download.file("https://raw.githubusercontent.com/rstudio/rmarkdown/master/R/jupyter.R","jupyter.R")
# Download the Colab notebook (part of yesterday's material)
download.file("https://raw.githubusercontent.com/lsuhpchelp/lbrnloniworkshop2022/master/day2/Rmd4day1.ipynb","Rmd4day1.ipynb")
# Convert the Colab notebook to R Markdown.
source("jupyter.R")
convert_ipynb("Rmd4day1.ipynb",output = xfun::with_ext("Rmd4day1.ipynb", "Rmd"))
# Render the .Rmd file.
render("Rmd4day1.Rmd", output_format = "word_document")


### 6.4 Cheatsheet and reference guide
* [Cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/rmarkdown.pdf)
* [Reference guide](https://rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf)

# Afternoon session
* Practice with data_R_practice.ipynb
* Work on the exercises from today
* Work on the exercises from yesterday


# Getting Help
* Documentation: http://hpc.loni.org/docs
* Contact us
> * Email ticket system: sys-help@loni.org
> * Telephone Help Desk: 225-578-0900