## Importing Excel data
Excel is a widely used data analysis tool. If you prefer to do your analyses in R, though, you'll need to know how to import Excel data into R. This chapter shows you how to use the readxl and gdata packages to do so.

### List the sheets of an Excel file
Before you can start importing from Excel, you should find out which sheets are available in the workbook. You can use the excel_sheets() function for this.

You will find the Excel file urbanpop.xlsx in your working directory (type dir() to see it). This dataset contains urban population metrics for practically all countries in the world throughout time (Source: Gapminder). It contains three sheets for three different time periods. In each sheet, the first row contains the column names.

In [1]:
# Load the readxl package
library(readxl)

# Print the names of all worksheets
excel_sheets("urbanpop.xlsx")

### Import an Excel sheet
Now that you know the names of the sheets in the Excel file you want to import, it is time to import those sheets into R. You can do this with the read_excel() function. Have a look at this recipe:

data <- read_excel("data.xlsx", sheet = "my_sheet")

This call simply imports the sheet with the name "my_sheet" from the "data.xlsx" file. You can also pass a number to the sheet argument; this will cause read_excel() to import the sheet with the given sheet number. sheet = 1 will import the first sheet, sheet = 2 will import the second sheet, and so on.
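
As a minimal sketch of the two ways to address a sheet, the snippet below imports the same sheet once by name and once by position. It assumes the example workbook that ships with readxl (located with readxl_example("datasets.xlsx")) instead of urbanpop.xlsx, so that it runs without any extra files.

```r
library(readxl)

# Assumption: use the example workbook bundled with readxl, not urbanpop.xlsx
path <- readxl_example("datasets.xlsx")

# Look up the sheet names, then pick the second one
sheets <- excel_sheets(path)
second_name <- sheets[2]

# Import the same sheet once by name and once by position
by_name <- read_excel(path, sheet = second_name)
by_position <- read_excel(path, sheet = 2)

# Check whether both calls returned the same data
identical(by_name, by_position)
```

Referring to sheets by name tends to be more robust than by number, because a position changes as soon as someone reorders the workbook.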

In this exercise, you'll continue working with the urbanpop.xlsx file.

In [2]:
# Read the sheets, one by one
pop_1 <- read_excel("urbanpop.xlsx", sheet = 1)
pop_2 <- read_excel("urbanpop.xlsx", sheet = 2)
pop_3 <- read_excel("urbanpop.xlsx", sheet = 3)


# Put pop_1, pop_2 and pop_3 in a list: pop_list
pop_list <- list(pop_1, pop_2, pop_3)
# Display the structure of pop_list
str(pop_list)

List of 3
 $ : tibble [209 x 8] (S3: tbl_df/tbl/data.frame)
  ..$ country: chr [1:209] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
  ..$ 1960   : num [1:209] 769308 494443 3293999 NA NA ...
  ..$ 1961   : num [1:209] 814923 511803 3515148 13660 8724 ...
  ..$ 1962   : num [1:209] 858522 529439 3739963 14166 9700 ...
  ..$ 1963   : num [1:209] 903914 547377 3973289 14759 10748 ...
  ..$ 1964   : num [1:209] 951226 565572 4220987 15396 11866 ...
  ..$ 1965   : num [1:209] 1000582 583983 4488176 16045 13053 ...
  ..$ 1966   : num [1:209] 1058743 602512 4649105 16693 14217 ...
 $ : tibble [209 x 9] (S3: tbl_df/tbl/data.frame)
  ..$ country: chr [1:209] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
  ..$ 1967   : num [1:209] 1119067 621180 4826104 17349 15440 ...
  ..$ 1968   : num [1:209] 1182159 639964 5017299 17996 16727 ...
  ..$ 1969   : num [1:209] 1248901 658853 5219332 18619 18088 ...
  ..$ 1970   : num [1:209] 1319849 677839 5429743 19206 19529 ...
  ..$ 1971   

### Reading a workbook
In the previous exercise you generated a list of three Excel sheets that you imported. However, loading every sheet manually and then combining them into a list can be quite tedious. Luckily, you can automate this with lapply(). If you have no experience with lapply(), feel free to take Chapter 4 of the Intermediate R course.

Have a look at the example code below:

my_workbook <- lapply(excel_sheets("data.xlsx"), read_excel, path = "data.xlsx")

Here lapply() calls read_excel() once for every sheet name returned by excel_sheets(), always with path = "data.xlsx"; each sheet name is matched to the sheet argument. The result is a list of data frames, one for each sheet in data.xlsx.
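
The list produced by this recipe is unnamed. One possible variation, sketched here on the example workbook bundled with readxl (an assumption for illustration, not the file used in this chapter), labels each element with the name of its sheet:

```r
library(readxl)

# Assumption: example workbook shipped with readxl
path <- readxl_example("datasets.xlsx")
sheet_names <- excel_sheets(path)

# Each sheet name is passed to read_excel() as its `sheet` argument
workbook <- lapply(sheet_names, read_excel, path = path)

# Label the list elements so sheets can be accessed by name
names(workbook) <- sheet_names

names(workbook)
dim(workbook[[sheet_names[1]]])
```

With names in place, a single sheet can be pulled out with workbook[["some_sheet"]] rather than by its position in the list.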

You're still working with the urbanpop.xlsx file.

In [3]:
# Read all Excel sheets with lapply(): pop_list
pop_list <- lapply(excel_sheets("urbanpop.xlsx"), read_excel, path = "urbanpop.xlsx")

# Display the structure of pop_list
str(pop_list)

List of 3
 $ : tibble [209 x 8] (S3: tbl_df/tbl/data.frame)
  ..$ country: chr [1:209] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
  ..$ 1960   : num [1:209] 769308 494443 3293999 NA NA ...
  ..$ 1961   : num [1:209] 814923 511803 3515148 13660 8724 ...
  ..$ 1962   : num [1:209] 858522 529439 3739963 14166 9700 ...
  ..$ 1963   : num [1:209] 903914 547377 3973289 14759 10748 ...
  ..$ 1964   : num [1:209] 951226 565572 4220987 15396 11866 ...
  ..$ 1965   : num [1:209] 1000582 583983 4488176 16045 13053 ...
  ..$ 1966   : num [1:209] 1058743 602512 4649105 16693 14217 ...
 $ : tibble [209 x 9] (S3: tbl_df/tbl/data.frame)
  ..$ country: chr [1:209] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
  ..$ 1967   : num [1:209] 1119067 621180 4826104 17349 15440 ...
  ..$ 1968   : num [1:209] 1182159 639964 5017299 17996 16727 ...
  ..$ 1969   : num [1:209] 1248901 658853 5219332 18619 18088 ...
  ..$ 1970   : num [1:209] 1319849 677839 5429743 19206 19529 ...
  ..$ 1971   

### The col_names argument
Apart from path and sheet, there are several other arguments you can specify in read_excel(). One of these arguments is called col_names.

col_names is TRUE by default, which means that the first row of the Excel sheet is taken to contain the column names. If this is not the case, you can set col_names to FALSE, and R will choose column names for you. You can also set col_names to a character vector with a name for each column. It works the same way as in the readr package.
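
To make the effect of col_names concrete, here is a small sketch. It assumes the example workbook shipped with readxl, whose sheets do have a header row, so switching col_names off shows the header being treated as ordinary data. The n_max argument is only there to keep the output short.

```r
library(readxl)

# Assumption: example workbook shipped with readxl (its sheets have headers)
path <- readxl_example("datasets.xlsx")

# Default (col_names = TRUE): the first row supplies the column names
with_header <- read_excel(path, sheet = 1, n_max = 3)
names(with_header)

# col_names = FALSE: placeholder names are generated, and the header row
# is read as the first row of data
no_header <- read_excel(path, sheet = 1, col_names = FALSE, n_max = 3)
names(no_header)
no_header[1, ]
```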

You'll be working with the urbanpop_nonames.xlsx file. It contains the same data as urbanpop.xlsx but has no column names in the first row of the Excel sheets.

In [5]:
# Import the first Excel sheet of urbanpop_nonames.xlsx (R gives names): pop_a
pop_a <- read_excel("urbanpop_nonames.xlsx", sheet = 1, col_names = FALSE)

# Import the first Excel sheet of urbanpop_nonames.xlsx (specify col_names): pop_b
cols <- c("country", paste0("year_", 1960:1966))
pop_b <- read_excel("urbanpop_nonames.xlsx", sheet = 1, col_names = cols)


# Print the summary of pop_a
summary(pop_a)

# Print the summary of pop_b
summary(pop_b)

New names:
* `` -> ...1
* `` -> ...2
* `` -> ...3
* `` -> ...4
* `` -> ...5
* ...


     ...1                ...2                ...3                ...4          
 Length:209         Min.   :     3378   Min.   :     1028   Min.   :     1090  
 Class :character   1st Qu.:    88978   1st Qu.:    70644   1st Qu.:    74974  
 Mode  :character   Median :   580675   Median :   570159   Median :   593968  
                    Mean   :  4988124   Mean   :  4991613   Mean   :  5141592  
                    3rd Qu.:  3077228   3rd Qu.:  2807280   3rd Qu.:  2948396  
                    Max.   :126469700   Max.   :129268133   Max.   :131974143  
                    NA's   :11                                                 
      ...5                ...6                ...7          
 Min.   :     1154   Min.   :     1218   Min.   :     1281  
 1st Qu.:    81870   1st Qu.:    84953   1st Qu.:    88633  
 Median :   619331   Median :   645262   Median :   679109  
 Mean   :  5303711   Mean   :  5468966   Mean   :  5637394  
 3rd Qu.:  3148941   3rd Qu.:  3296444   3rd Qu.:  3317

   country            year_1960           year_1961           year_1962        
 Length:209         Min.   :     3378   Min.   :     1028   Min.   :     1090  
 Class :character   1st Qu.:    88978   1st Qu.:    70644   1st Qu.:    74974  
 Mode  :character   Median :   580675   Median :   570159   Median :   593968  
                    Mean   :  4988124   Mean   :  4991613   Mean   :  5141592  
                    3rd Qu.:  3077228   3rd Qu.:  2807280   3rd Qu.:  2948396  
                    Max.   :126469700   Max.   :129268133   Max.   :131974143  
                    NA's   :11                                                 
   year_1963           year_1964           year_1965        
 Min.   :     1154   Min.   :     1218   Min.   :     1281  
 1st Qu.:    81870   1st Qu.:    84953   1st Qu.:    88633  
 Median :   619331   Median :   645262   Median :   679109  
 Mean   :  5303711   Mean   :  5468966   Mean   :  5637394  
 3rd Qu.:  3148941   3rd Qu.:  3296444   3rd Qu.:  3317

### The skip argument
Another argument that can be very useful when reading in less tidy Excel files is skip. With skip, you can tell R to ignore a specified number of rows at the top of the Excel sheet you're pulling data from. Have a look at this example:

read_excel("data.xlsx", skip = 15)

In this case, the first 15 rows in the first sheet of "data.xlsx" are ignored.

If the first row of this sheet contained the column names, this information will also be ignored by readxl. Make sure to set col_names to FALSE or manually specify column names in this case!
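
One way to skip rows without losing the column names is to read the header separately and pass it back through col_names. This is only a sketch of that idea, run on the example workbook shipped with readxl (an assumption for illustration); it uses n_max to read a single data row when capturing the names.

```r
library(readxl)

# Assumption: example workbook shipped with readxl
path <- readxl_example("datasets.xlsx")

# Read just one data row to capture the column names from the header
header <- names(read_excel(path, sheet = 1, n_max = 1))

# Skip the header row plus the first 15 data rows, reusing the saved names
skipped <- read_excel(path, sheet = 1, skip = 16, col_names = header)

# For comparison, import the full sheet
full <- read_excel(path, sheet = 1)

# Number of data rows dropped by skipping
nrow(full) - nrow(skipped)
```

Note that skip counts the header row too, which is why 16 rows are skipped to drop 15 observations.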

The file urbanpop.xlsx is available in your directory; it has column names in the first row of each sheet.

In [6]:
# Import the second sheet of urbanpop.xlsx, skipping the first 21 rows: urbanpop_sel
urbanpop_sel <- read_excel("urbanpop.xlsx", sheet = 2, col_names = FALSE, skip = 21)

# Print out the first observation from urbanpop_sel
head(urbanpop_sel, 1)

New names:
* `` -> ...1
* `` -> ...2
* `` -> ...3
* `` -> ...4
* `` -> ...5
* ...


...1,...2,...3,...4,...5,...6,...7,...8,...9
Benin,382022.1,411859.5,443013.1,475611.4,515819.5,557937.6,602093.2,648409.7


### Work that Excel data!
Now that you can read in Excel data, let's try to clean and merge it. You already used the cbind() function in an earlier exercise. Let's take it one step further now.

The urbanpop.xls dataset is available in your working directory. The file still contains three sheets, and has column names in the first row of each sheet.
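
The combine-and-clean pattern this exercise relies on can be tried on toy data first. The two small data frames below are made up for illustration and stand in for imported sheets that share the same first column.

```r
# Hypothetical stand-ins for two imported sheets with a shared key column
sheet_a <- data.frame(country = c("A", "B", "C"),
                      y1960 = c(10, NA, 30))
sheet_b <- data.frame(country = c("A", "B", "C"),
                      y1961 = c(11, 21, 31))

# Drop the duplicated country column from the second sheet before binding
combined <- cbind(sheet_a, sheet_b[-1])

# na.omit() removes every row that contains at least one NA
combined_clean <- na.omit(combined)

combined_clean
```

cbind() matches rows purely by position, so this approach only works when all sheets list the countries in the same order.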

In [12]:
# Add code to import data from all three sheets in urbanpop.xls
path <- "urbanpop.xls"
urban_sheet1 <- read_excel(path, sheet = 1)
urban_sheet2 <- read_excel(path, sheet = 2)
urban_sheet3 <- read_excel(path, sheet = 3)

# Extend the cbind() call to include urban_sheet3: urban
urban <- cbind(urban_sheet1, urban_sheet2[-1], urban_sheet3[-1])

# Remove all rows with NAs from urban: urban_clean
urban_clean <- na.omit(urban)

# Print out a summary of urban_clean
summary(urban_clean)

   country               1960                1961                1962          
 Length:197         Min.   :     3378   Min.   :     3433   Min.   :     3481  
 Class :character   1st Qu.:    87735   1st Qu.:    92905   1st Qu.:    98331  
 Mode  :character   Median :   599714   Median :   630788   Median :   659464  
                    Mean   :  5012388   Mean   :  5282488   Mean   :  5440972  
                    3rd Qu.:  3130085   3rd Qu.:  3155370   3rd Qu.:  3250211  
                    Max.   :126469700   Max.   :129268133   Max.   :131974143  
      1963                1964                1965          
 Min.   :     3532   Min.   :     3586   Min.   :     3644  
 1st Qu.:   104988   1st Qu.:   112084   1st Qu.:   119322  
 Median :   704989   Median :   740609   Median :   774957  
 Mean   :  5612312   Mean   :  5786961   Mean   :  5964970  
 3rd Qu.:  3416490   3rd Qu.:  3585464   3rd Qu.:  3666724  
 Max.   :134599886   Max.   :137205240   Max.   :139663053  
      1966   