![](logo.png)

# <font color='red'>Data Input & Output in R</font>

> ### CSV Input & Output
> ### Excel Input & Output
> ### SQL with R
> ### Webscraping with R

## Working Directories

To read files into dataframes in R using only a file name, the files and the Jupyter notebook need to be contained within the same folder. If the files are located in a different folder than the Jupyter notebook, the working directory needs to be changed to that location. To get the current working directory (location on the computer), use the built-in R function **getwd()**. This will display the current location or path.

To set a new path (location), use the built-in R function **setwd()** to change the directory to the location of the saved files you want to input to R.

In [None]:
# Determine the current working directory
getwd()

In [None]:
# Change working directory to new file path
setwd('C:\\Users\\jcdunne\\NC State PB&G Dropbox\\Jeffrey Dunne\\CS590 Course Content')

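NOTE: Use double backslashes '\\' in the file path, or use single forward slashes '/' instead. A single backslash starts an escape sequence inside an R string, so it cannot be used as a path separator on its own.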
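As a minimal sketch of the note above, `file.path()` joins folder names with forward slashes, so no backslash escaping is needed. The folder created under `tempdir()` is a hypothetical stand-in for a course folder, not a path used elsewhere in this notebook.

```r
# Hypothetical folder (note the space in "Week 5"), created under tempdir()
# so that the example runs on any machine
data_dir <- file.path(tempdir(), "Week 5", "R")
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)

# setwd() invisibly returns the previous working directory
old_wd <- setwd(data_dir)
print(getwd())

# Restore the original working directory
setwd(old_wd)
```

Saving the value returned by `setwd()` makes it easy to switch back once the files have been read.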

In [None]:
# Determine the current working directory
getwd()

In [None]:
# Change working directory to new file path
setwd('C:\\Users\\jcdunne\\NC State PB&G Dropbox\\Jeffrey Dunne\\CS590 Course Content\\Week 5\\R')

# <font color='red'>CSV Input & Output</font>

CSV stands for comma-separated values, and it is one of the most common ways we'll be working with data throughout the semester. The basic format of a CSV file is a first line containing the column names, followed by rows of data points separated by commas. One of the most basic ways to read CSV files in R is read.csv(), which is built into R. Later on we'll learn about fread(), which is a bit faster and more convenient, but it's important to understand all your options.

When using read.csv(), you'll need to either pass in the entire path of the file or have the file in the current working directory. Spaces in the path need no special treatment as long as the whole path is inside quotes; on Windows, remember to double any backslashes or use forward slashes instead. This is often a point of confusion for people new to programming, so make sure you understand the above before continuing.

In [None]:
# Pass in the entire file path if not in the same directory
example <- read.csv('example.csv')

In [None]:
# Check the structure of the dataframe
str(example)

In [None]:
# Check the column names
colnames(example)

In [None]:
# Check the head of the example dataframe
head(example)
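To make the point about full paths concrete, here is a small sketch that reads a CSV file through its complete path. The folder and file names are hypothetical and are created under `tempdir()` purely for illustration; they contain spaces on purpose.

```r
# Write a small hypothetical CSV into a folder whose name contains a space
csv_dir <- file.path(tempdir(), "my data")
dir.create(csv_dir, showWarnings = FALSE)
csv_path <- file.path(csv_dir, "demo file.csv")
writeLines(c("name,score", "A,1.5", "B,2.5"), csv_path)

# The quoted path is passed as-is; the spaces need no escaping
demo <- read.csv(csv_path)
str(demo)
```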

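So we can now see how easy it is to read a CSV file. If we have another flat file format, such as a tab-separated (tab-delimited) file or a file with some other delimiter, we can specify this with the sep argument when calling read.csv. The full set of options is shown by the signature of read.table, which read.csv is built on. From the documentation: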

read.table(file, header = FALSE, sep = "", quote = "\"'",
       dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
       row.names, col.names, as.is = !stringsAsFactors,
       na.strings = "NA", colClasses = NA, nrows = -1,
       skip = 0, check.names = TRUE, fill = !blank.lines.skip,
       strip.white = FALSE, blank.lines.skip = TRUE,
       comment.char = "#",
       allowEscapes = FALSE, flush = FALSE,
       stringsAsFactors = default.stringsAsFactors(),
       fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)
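As a hedged illustration of the sep argument, the sketch below reads a tab-delimited file in two ways. The file is a hypothetical one written to a temporary location, and `read.delim()` is the base R sibling of read.csv whose default separator is a tab.

```r
# Create a small hypothetical tab-delimited file
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("id\tvalue", "1\t10", "2\t20"), tsv_path)

# Option 1: read.csv with an explicit tab separator
from_csv <- read.csv(tsv_path, sep = "\t")

# Option 2: read.delim, which uses sep = "\t" by default
from_delim <- read.delim(tsv_path)

# Both calls should produce the same dataframe
print(all.equal(from_csv, from_delim))
```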

In [None]:
# View the read.csv help documentation
help(read.csv)

In [None]:
# Shorthand for help(); the ? prefix also works for Python in Jupyter notebooks
?read.csv

## read.table

The built-in R function **read.table()** is the general form of the built-in R function **read.csv()**. In fact, read.csv is just a thin wrapper around read.table with defaults suited to CSV files, which makes it easier to use. For example:

In [None]:
# Read CSV using the read.table function
read.table('example.csv')

NOTE: The file is not read correctly because no arguments were given to specify the delimiter or to indicate that the first row contains the column names.

In [None]:
# Adding the delimiter
read.table('example.csv', sep=',')

In [None]:
# Adding the row that contains the column names
read.table('example.csv', sep=',', header=TRUE)
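The three calls above can be compared side by side on a small file. This is only a sketch: the file contents are hypothetical and written to a temporary location so the code runs without example.csv.

```r
# Create a small hypothetical CSV file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "A,1.5", "B,2.5"), csv_path)

# No arguments: whitespace is the delimiter and no header is assumed,
# so every line ends up as a single value in one column
no_args <- read.table(csv_path)

# Delimiter and header supplied explicitly
with_args <- read.table(csv_path, sep = ",", header = TRUE)

# read.csv applies the same settings by default
via_csv <- read.csv(csv_path)

print(dim(no_args))
print(dim(with_args))
print(all.equal(with_args, via_csv))
```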

An alternative to using read.csv or read.table is the function **fread()** from the data.table package. In most situations you'll want to use fread() because it is faster and more convenient.

In [None]:
# Load the library containing fread()
library("data.table")

In [None]:
# Run fread() to load example.csv
fread('example.csv')
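One detail worth knowing is that fread() returns a data.table rather than a plain dataframe. The sketch below, which assumes the data.table package is installed and uses a hypothetical temporary file, shows both the default result and the `data.table = FALSE` option.

```r
# Only run if data.table is installed
if (requireNamespace("data.table", quietly = TRUE)) {
  # Create a small hypothetical CSV file
  csv_path <- tempfile(fileext = ".csv")
  writeLines(c("name,score", "A,1.5", "B,2.5"), csv_path)

  # Default: fread detects the delimiter and header and returns a data.table
  dt <- data.table::fread(csv_path)

  # Ask for a plain data.frame instead
  df_plain <- data.table::fread(csv_path, data.table = FALSE)

  print(class(dt))
  print(class(df_plain))
}
```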

## Output to CSV

You can output your dataframes to CSV files by using the built-in R function **write.csv()**.

In [None]:
# Change name of example to df for output
df <- example

In [None]:
write.csv(df, file = 'CSV output.csv')
fread('CSV output.csv')

NOTE: The index (row names) from df gets output along with the data. Use the row.names argument to change the output of the CSV file.

In [None]:
# Export without row names i.e. the index positions
write.csv(df, file = 'CSV output.csv', row.names = FALSE)
fread('CSV output.csv')
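The effect of row.names can also be checked with base R alone. In this sketch the dataframe and file names are hypothetical; the point is the extra leading column that appears when the row names are written out and the file is read back.

```r
# Small hypothetical dataframe
df_demo <- data.frame(name = c("A", "B"), score = c(1.5, 2.5))

with_rows <- tempfile(fileext = ".csv")
without_rows <- tempfile(fileext = ".csv")

# Default: row names are written as an unnamed first column
write.csv(df_demo, file = with_rows)

# row.names = FALSE writes only the data columns
write.csv(df_demo, file = without_rows, row.names = FALSE)

# Reading back: the unnamed first column is given the name "X"
print(names(read.csv(with_rows)))
print(names(read.csv(without_rows)))
```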

# <font color='red'>Excel Input & Output</font>

R has the ability to read from and write to Excel, which makes it very convenient to work on the same datasets with colleagues who only know Excel. They can work in Excel and hand you the files, and then you work with them in R.

To do this, we need the readxl package for R. You can download it by using:

install.packages('readxl')

And then load the library using

library('readxl')

You may need to specify repos="http://cran.rstudio.com/" as an argument in the install.packages call if you get a mirror error.

Let's see how we can use this:

In [None]:
# In case you don't have readxl (you may not need to specify repos)
#install.packages('readxl',repos="http://cran.rstudio.com/")

In [None]:
# Load the readxl package
library(readxl)

In [None]:
# list the sheets of the excel file
excel_sheets('Peanut Database.xlsx')

In [None]:
# Call info from the sheets using read_excel
df <- read_excel('Peanut Database.xlsx',sheet='NCSU Breeding Lines')

In [None]:
# Show the head of the dataframe
head(df)

In [None]:
# Check the structure of the dataframe
str(df)

In [None]:
# Print a summary of the columns within the dataframe
summary(df)

If you had multiple sheets that you wanted to import into a list, you could do this with the built-in R function **lapply()**. We will learn more about lapply() later in the semester. This is just one use of this function.

In [None]:
# Reading in multiple sheets from Peanut Database
entire_workbook <- lapply(excel_sheets('Peanut Database.xlsx'), read_excel, path='Peanut Database.xlsx')

In [None]:
# Show the entire list:
entire_workbook
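A list built this way has no names, so it can help to label each element with its sheet name. The sketch below assumes readxl is installed and uses the example workbook that ships with the package (located through `readxl_example()`), not the Peanut Database file.

```r
# Only run if readxl is installed
if (requireNamespace("readxl", quietly = TRUE)) {
  # Example workbook bundled with readxl (stand-in for your own file)
  path <- readxl::readxl_example("datasets.xlsx")

  # Read every sheet into a list, one dataframe per sheet
  sheets <- readxl::excel_sheets(path)
  workbook <- lapply(sheets, readxl::read_excel, path = path)

  # Name each list element after its sheet
  names(workbook) <- sheets

  # A single sheet can now be pulled out by name
  print(head(workbook[[sheets[1]]]))
}
```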

## Writing to Excel

Writing to Excel can be done with the xlsx package. Note that xlsx depends on the rJava package, so a working Java installation is required:

In [None]:
# Install the XLSX package
install.packages('xlsx',repos="http://cran.rstudio.com/")

In [None]:
# Load the XLSX library
library(xlsx)

In [None]:
# Set example back to the dataframe
df <- example

In [None]:
# Use write.xlsx() to output an excel formatted dataframe
write.xlsx(df, "output.xlsx")

In [None]:
read_excel("output.xlsx")
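As with write.csv(), write.xlsx() writes the row names by default. The following is a sketch under the assumption that the xlsx package (and Java) is available; the dataframe, sheet name, and file are hypothetical.

```r
# Only run if xlsx (and therefore Java) is available
if (requireNamespace("xlsx", quietly = TRUE)) {
  # Small hypothetical dataframe
  df_demo <- data.frame(name = c("A", "B"), score = c(1.5, 2.5))
  out_path <- tempfile(fileext = ".xlsx")

  # Write to a named sheet without the row names
  xlsx::write.xlsx(df_demo, out_path, sheetName = "demo", row.names = FALSE)

  # Read the sheet back to check the result
  back <- xlsx::read.xlsx(out_path, sheetName = "demo")
  print(back)
}
```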

# <font color='red'>SQL with R</font>

This will actually be a brief section, because connecting R to a SQL database is completely dependent on the type of database you are using (MySQL, Oracle, etc.).

So instead of trying to cover all of these (since each requires a different package), we'll use this section to point you in the right direction for various database types. Once you've downloaded the correct library, actually connecting is usually quite simple. Then it's just a matter of passing in SQL queries through R.

We'll show a general version using the RODBC package, then point to more specific resources.

## RODBC - General Use

The RODBC library is one way of connecting to databases. Regardless of what you decide to use, I highly recommend a Google search consisting of your database of choice + R. Here's an example use of RODBC:

In [None]:
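# Install RODBC (only needed once)
install.packages("RODBC")
# RODBC example of syntax (placeholders: replace the data source name, credentials, and table name with your own)
library(RODBC)

# Open a connection to an ODBC data source
myconn <- odbcConnect("Database_Name", uid="User_ID", pwd="password")
# Fetch an entire table as a dataframe
dat <- sqlFetch(myconn, "Table_Name")
# Run an arbitrary SQL query against the same table
querydat <- sqlQuery(myconn, "SELECT * FROM Table_Name")
# Close the connection when finished
close(myconn)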
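The DBI package offers a similar connect, query, disconnect workflow that is shared by many database back ends. The sketch below is not tied to any real server: it assumes the DBI and RSQLite packages are installed and uses a throwaway in-memory SQLite database with a hypothetical `scores` table.

```r
# Only run if both DBI and RSQLite are installed
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {
  # Connect to a temporary in-memory SQLite database
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

  # Copy a small hypothetical dataframe into a table
  scores <- data.frame(name = c("A", "B"), score = c(1.5, 2.5))
  DBI::dbWriteTable(con, "scores", scores)

  # Pass a SQL query through R and get a dataframe back
  high <- DBI::dbGetQuery(con, "SELECT name, score FROM scores WHERE score > 2")
  print(high)

  # Always close the connection when finished
  DBI::dbDisconnect(con)
}
```

With a different database, only the `dbConnect()` call would be expected to change; the driver and connection arguments depend on the back end in use.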

### MySQL
The RMySQL package provides an interface to MySQL.

### Oracle
The ROracle package provides an interface for Oracle.

### JDBC
The RJDBC package provides access to databases through a JDBC interface.

Google is the best way to go for your personal situation since databases and your permissions can differ extensively.

# <font color='red'>Webscraping with R</font>

**NOTE: TO FULLY UNDERSTAND THIS SECTION YOU WILL NEED TO KNOW HTML AND CSS, YOU WILL ALSO NEED TO KNOW THE PIPE OPERATOR IN R (%>%). COME BACK TO THIS SECTION AFTER WE COVER THAT MATERIAL**

Web scraping is almost always going to be unique to your personal use case, because every website is different, updates occur, and things can change. To fully understand web scraping in R, you'll need to understand HTML and CSS in order to know what you are trying to grab off the website.

If you don't know HTML or CSS, you may be able to use an auto-web-scrape tool, like import.io. Check it out, it will auto scrape and create a csv file for you.

## rvest library
Below is a simple example of using rvest, but the best way to see a good demo of rvest is through the built-in demos by using:

In [36]:
# Use the provided demo of rvest
demo(package='rvest')

Now if you are familiar with HTML and CSS a very useful library is rvest. Below we will go over a simple example:

In [37]:
# Install rvest package and dependencies
install.packages('rvest')



Imagine we’d like to scrape some information about The Lego Movie from IMDB. We start by downloading and parsing the file with read_html():

In [38]:
# Load rvest library
library(rvest)
# Pull the html information from the Lego movie
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")

Loading required package: xml2


To extract the rating, we start with SelectorGadget to figure out which css selector matches the data we want: strong span. (If you haven’t heard of selectorgadget, make sure to read "SelectorGadget" – it’s the easiest way to determine which selector extracts the data that you’re interested in.) We use html_node() to find the first node that matches that selector, extract its contents with html_text(), and convert it to numeric with as.numeric():

In [39]:
# Use the pipe operator in R %>% to chain together each of the steps for extracting the lego movie rating
lego_movie %>% 
  html_node("strong span") %>%
  html_text() %>%
  as.numeric()

We use a similar process to extract the cast. The cast is found within table 1 of the HTML output. We can use html_nodes() to find all nodes that match the table selector and [[ ]] to pick the first one, then coerce it to a data frame with html_table():

In [40]:
lego_movie %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table(header = TRUE)

| Cast overview, first billed only: | Cast overview, first billed only: | Cast overview, first billed only: | Cast overview, first billed only: |
|---|---|---|---|
| `<lgl>` | `<chr>` | `<chr>` | `<chr>` |
|  | Will Arnett | ... | Batman / Bruce Wayne (voice) |
|  | Elizabeth Banks | ... | Wyldstyle / Lucy (voice) |
|  | Craig Berry | ... | Blake / Additional Voices (voice) |
|  | Alison Brie | ... | Unikitty (voice) |
|  | David Burrows | ... | Octan Robot / Additional Voices (voice) |
|  | Anthony Daniels | ... | C-3PO (voice) |
|  | Charlie Day | ... | Benny (voice) |
|  | Amanda Farinos | ... | Mom (voice) |
|  | Keith Ferguson | ... | Han Solo (voice) |
|  | Will Ferrell | ... | Lord Business / President Business / The Man Upstairs (voice) |
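
Because live pages change over time, the selectors used above may stop matching. The same steps can be practiced offline on a small HTML snippet. This sketch assumes rvest is installed; the HTML string, the rating value, and the table contents are made up for illustration and are not taken from IMDB.

```r
# Only run if rvest is installed
if (requireNamespace("rvest", quietly = TRUE)) {
  library(rvest)

  # Hypothetical page: a rating inside <strong><span> and one small table
  page <- read_html('<html><body>
    <strong><span>7.7</span></strong>
    <table>
      <tr><th>Actor</th><th>Role</th></tr>
      <tr><td>Will Arnett</td><td>Batman</td></tr>
      <tr><td>Elizabeth Banks</td><td>Wyldstyle</td></tr>
    </table>
  </body></html>')

  # First node matching the selector, converted from text to a number
  rating <- page %>%
    html_node("strong span") %>%
    html_text() %>%
    as.numeric()

  # All tables on the page, keep the first, and convert it to a table of data
  cast <- page %>%
    html_nodes("table") %>%
    .[[1]] %>%
    html_table(header = TRUE)

  print(rating)
  print(cast)
}
```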
