# L03-1-Reading Flat Files
## Assignment Instructions
Rename with your name in place of Studentname and make your edits and updates here.


# Reading Flat Files

It doesn’t get more universal than importing the typical ‘flat file’. As you can imagine, there are many ways and packages in R that can do this. And this has been streamlined as much as possible, while providing customizable options a mile long. The good news is that we can expect a simple single line of code to do the trick only providing a minimum of information, letting R make all the rest of the assumptions for us. 

Fortunately, there is a tidyverse version of file import, so that is what we will use. Hopefully, as we explore more and more of the tidyverse, it will start to become more intuitive. I can often guess what a function will be called or what it will return. It is like guessing the meaning of a word you haven’t heard before and being reasonably close, because it is consistent with the rest of the language you have already learned.

In this exercise, you will read comma-separated value (csv) files into R using read_csv(). Note the underscore between read and csv. This version is from the readr package, a part of the tidyverse. The base R version, read.csv(), uses a period in place of the underscore. Remember to use the underscore version of the function.

read_csv() has several parameters, most with default values. The only required parameter is the filename, since that is the only thing it can’t guess. It can’t get much simpler than that. The remaining parameters adjust the default behavior when you need something different.
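To make the underscore-versus-period distinction concrete, here is a minimal sketch. It assumes the readr package is installed and writes a small throwaway file to a temporary location; the file contents and variable names are made up for illustration and are not part of the exercise.

```r
library(readr)

# Write a tiny csv file to a temporary location (illustrative data)
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("name,score", "ann,90", "bob,85"), tmp_csv)

# readr version (underscore): returns a tibble
tidy_df <- read_csv(tmp_csv)

# base R version (period): returns a plain data.frame
base_df <- read.csv(tmp_csv)

class(tidy_df)
class(base_df)
```

Only the filename was supplied in both calls; everything else was left to the defaults.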

There is also a read_tsv() to use for tab-separated values. Just substitute tsv for csv in the function call. There is also a read_delim() function where you can specify your own column delimiter in case the file uses something other than a comma or tab. read_csv() and read_tsv() are called wrapper functions around read_delim(), meaning that they set default values to make it easy to use for a specific context. So instead of setting the column delimiter in read_delim(), you simply call read_csv(), which sets that parameter for you. This is just like geom_jitter(), which we encountered before as a wrapper function around geom_point() with the position set to jitter.
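The wrapper relationship can be sketched as follows. This is an illustration under the assumption that readr is installed; the temporary file and the names `via_wrapper` and `via_delim` are hypothetical.

```r
library(readr)

# Small illustrative file
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("id,city", "1,Austin", "2,Boston"), tmp_csv)

# Wrapper: the comma delimiter is set for us
via_wrapper <- read_csv(tmp_csv, col_types = "ic")

# Underlying function: we state the delimiter ourselves
via_delim <- read_delim(tmp_csv, delim = ",", col_types = "ic")

# Both calls should give the same column names and values
identical(names(via_wrapper), names(via_delim))
identical(via_wrapper$city, via_delim$city)
```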

I find the wrapper functions easier to read than parameters, so I use those when I can. But since they are wrapper functions, that means you can call the underlying read_delim() function and have access to all the parameters for complete customization.

There are also write versions of the read functions, like write_csv() and write_tsv(), which take a data frame and write it to a file. This is a useful way to export a copy of the data, especially after doing a bunch of data wrangling.

There is a write function specifically for files that will be imported into Excel, called write_excel_csv(). It adds a byte order mark (BOM) that may be helpful for tools that have trouble reading csv files without it. I don’t find my Excel to have any problems with standard CSV files, and sometimes the byte order mark can be annoying because it may get interpreted as part of the name of the first column. So, use write_excel_csv() if you need it; otherwise I would stick with write_csv() or write_tsv(). Coming from a data warehouse background, I prefer tab delimiters over commas since tabs are less prevalent in the data than commas, but in this course we will tend to use commas.
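One way to see the byte order mark is to look at the first three bytes of each file. The sketch below assumes readr is installed; the data frame and file names are made up for illustration.

```r
library(readr)

small_df <- data.frame(id = 1:2, label = c("a", "b"))

plain_file <- tempfile(fileext = ".csv")
excel_file <- tempfile(fileext = ".csv")

write_csv(small_df, plain_file)
write_excel_csv(small_df, excel_file)

# Read the first three bytes of each file as raw values
readBin(plain_file, what = "raw", n = 3)
readBin(excel_file, what = "raw", n = 3)  # the UTF-8 BOM is the bytes EF BB BF
```

The plain file starts directly with the header text, while the Excel version starts with the three BOM bytes.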

We will explore skipping columns, handling header rows, renaming columns, and setting data types. We will also see what can go wrong during data type conversion and how to get details about the errors with problems(). 

If you cannot resolve a data type conversion issue during import, just bring the column in as a string column and then handle it after it is in a data frame where we have the full power of R at our disposal. This is most common when dealing with various date formats. If we are lucky we can convert to a date during import, but sometimes the import tools cannot handle the wide diversity of dates. So, we bring it in as a string and then use different R packages that specialize in date manipulation to coerce them into dates.
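The "import as a string, convert later" approach might look like the sketch below. It assumes readr and lubridate are installed (lubridate is just one date package that could be used here), and the file, column names, and date format are invented for the example.

```r
library(readr)
library(lubridate)

# Illustrative file with month/day/year dates
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("order_id,order_date", "1,03/15/2021", "2,12/01/2021"), tmp_csv)

# Import the date column as a plain string ("c") so nothing fails on import
orders <- read_csv(tmp_csv, col_types = "ic")

# Convert afterwards with a date-specific helper (month/day/year here)
orders$order_date <- mdy(orders$order_date)

orders
```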


## R Features
* library()
* write_csv()
* write_tsv()
* write_excel_csv()
* getwd()
* dir()
* read_csv()
* read_tsv()
* glimpse()
* names()
* print()
* c()
* head()
* spec()
* problems()

## Datasets
* mpg


In [1]:
# Load libraries
library(tidyverse)


Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats


Notice that readr is one of the packages loaded with the tidyverse.

## write_csv() - write_tsv() -  write_excel_csv()
Save a data frame to a delimited file. This is about twice as fast as write.csv, and never writes row names.

write_csv(x, path, na = "NA", append = FALSE, col_names = !append)

write_tsv(x, path, na = "NA", append = FALSE, col_names = !append)

write_excel_csv(x, path, na = "NA", append = FALSE, col_names = !append)
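The row-name difference is easy to check by comparing header lines. This is a small sketch that assumes readr is installed and uses the built-in mtcars data (not part of this exercise) because it stores car names as row names.

```r
library(readr)

base_file  <- tempfile(fileext = ".csv")
readr_file <- tempfile(fileext = ".csv")

# mtcars stores the car names as row names, not as a column
write.csv(mtcars, base_file)    # base R: writes row names by default
write_csv(mtcars, readr_file)   # readr: never writes row names

# Compare the header line of each file
readLines(base_file, n = 1)
readLines(readr_file, n = 1)
```

If the row names carry information you want to keep, one option is to move them into a regular column (for example with tibble::rownames_to_column()) before writing.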


In [2]:
# We are going to create the 
# files we need to import by first
# exporting them or writing them to disk
# Let's start with help on write_csv()
?write_csv

Notice there are a few versions. We will explore csv, tsv, excel_csv. We pass these functions a data frame and a filename string

In [6]:
# Let's create a csv file
# data: mpg 
# file: fmpg.csv
# function: write_csv()
write_csv(mpg, "fmpg.csv")

# Now write it as a tsv file
# Make sure the file extension is tsv
# data: mpg 
# file: fmpg.tsv
# function: write_tsv()
write_tsv(mpg, "fmpg.tsv")

# Now write it as an excel csv
# data: mpg 
# file: fmpg_excel.csv
# function: write_excel_csv()
write_excel_csv(mpg, "fmpg_excel.csv")


## getwd(), setwd()

Get or Set Working Directory. getwd returns an absolute filepath representing the current working directory of the R process; setwd(dir) is used to set the working directory to dir.

getwd()

setwd(dir)

In [7]:
# Pull up help on getwd()
?getwd

In [8]:
# Let's find where it put these files
# Unless we specify the full path
# they go into the current working directory
# We have a function to see what this is
# function: getwd()
getwd()


## dir()
List the Files in a Directory/Folder. These functions produce a character vector of the names of files or directories in the named directory.

dir(path = ".", pattern = NULL, all.files = FALSE,
           full.names = FALSE, recursive = FALSE,
           ignore.case = FALSE, include.dirs = FALSE, no.. = FALSE)

In [9]:
# Pull up help on dir()
?dir

In [10]:
# Let's list the files in the 
# current working directory
# function: dir()
dir()


Notice the files you had written are listed.
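When a folder holds many files, the pattern argument of dir() can narrow the listing. The sketch below uses only base R; the folder name `flat_file_demo` and the file names are hypothetical and are created inside R's temporary directory so nothing in your working directory is touched.

```r
# Work in a fresh sub-folder of the temporary directory (hypothetical name)
demo_dir <- file.path(tempdir(), "flat_file_demo")
dir.create(demo_dir, showWarnings = FALSE)

# Create a few empty files with different extensions
file.create(file.path(demo_dir, c("a.csv", "b.tsv", "c.csv")))

# List everything, then only the csv files
dir(demo_dir)
csv_files <- dir(demo_dir, pattern = "\\.csv$")
csv_files
```

The pattern is a regular expression, so the dot is escaped and `$` anchors the match to the end of the file name.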

Now that we have some files of known data to experiment on, let's read them in.

In [11]:
# file: fmpg.csv
# variable: foo
# function: read_csv()
foo <- read_csv("fmpg.csv")


Parsed with column specification:
cols(
  manufacturer = col_character(),
  model = col_character(),
  displ = col_double(),
  year = col_integer(),
  cyl = col_integer(),
  trans = col_character(),
  drv = col_character(),
  cty = col_integer(),
  hwy = col_integer(),
  fl = col_character(),
  class = col_character()
)


Notice the column specification output. The default behavior is to review the first 1000 rows and determine the most appropriate data type.

In [13]:
# Let's compare the imported data frame
# with the original to see
# how well it matches

# glimpse the original mpg
glimpse(mpg)

# glimpse the imported data frame
glimpse(foo)


Observations: 234
Variables: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi"...
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro"...
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0,...
$ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, ...
$ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, ...
$ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "a...
$ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4",...
$ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17...
$ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25...
$ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p",...
$ class        <chr> "compact", "compact", "compact", "compact", "compact",...
Observations: 234
Variables: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi"

Notice that both have the same number of rows and columns, the same variable names in the same order, and the same data types.

In [14]:
# Let's read in the tsv file
# and compare it to mpg
# Use the above code blocks as your guide
# You can use the same df variable
# It will overwrite the prior contents

# read_tsv()
df <- read_tsv("fmpg.tsv")

# glimpse() imported variable
glimpse(df)

# glimpse() mpg
glimpse(mpg)



Parsed with column specification:
cols(
  manufacturer = col_character(),
  model = col_character(),
  displ = col_double(),
  year = col_integer(),
  cyl = col_integer(),
  trans = col_character(),
  drv = col_character(),
  cty = col_integer(),
  hwy = col_integer(),
  fl = col_character(),
  class = col_character()
)


Observations: 234
Variables: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi"...
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro"...
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0,...
$ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, ...
$ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, ...
$ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "a...
$ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4",...
$ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17...
$ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25...
$ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p",...
$ class        <chr> "compact", "compact", "compact", "compact", "compact",...
Observations: 234
Variables: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi"

Notice any differences?

In [15]:
# One more time with the 
# fmpg_excel.csv file

# read_csv()
df <- read_csv("fmpg_excel.csv")

# glimpse() imported variable
glimpse(df)

# glimpse() mpg
glimpse(mpg)



Parsed with column specification:
cols(
  manufacturer = col_character(),
  model = col_character(),
  displ = col_double(),
  year = col_integer(),
  cyl = col_integer(),
  trans = col_character(),
  drv = col_character(),
  cty = col_integer(),
  hwy = col_integer(),
  fl = col_character(),
  class = col_character()
)


Observations: 234
Variables: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi"...
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro"...
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0,...
$ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, ...
$ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, ...
$ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "a...
$ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4",...
$ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17...
$ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25...
$ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p",...
$ class        <chr> "compact", "compact", "compact", "compact", "compact",...
Observations: 234
Variables: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi"

Notice any differences?

## read_csv()
Read a delimited file into a data frame. 

read_csv and read_tsv are special cases of the general read_delim. They're useful for reading the most common types of flat file data, comma separated values and tab separated values, respectively. read_csv2 uses ; for separators, instead of ,. This is common in European countries which use , as the decimal separator.

`
read_csv(file, col_names = TRUE, col_types = NULL,
  locale = default_locale(), na = c("", "NA"), quoted_na = TRUE,
  comment = "", trim_ws = TRUE, skip = 0, n_max = Inf,
  guess_max = min(1000, n_max), progress = interactive())
`
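As a quick illustration of the read_csv2() case mentioned above, here is a minimal sketch. It assumes readr is installed; the file contents are invented.

```r
library(readr)

# Semicolon-delimited file that uses a comma as the decimal mark
tmp_csv2 <- tempfile(fileext = ".csv")
writeLines(c("item;price", "bread;1,5", "milk;2,25"), tmp_csv2)

# read_csv2() treats ";" as the delimiter and "," as the decimal mark
prices <- read_csv2(tmp_csv2, col_types = "cd")
prices$price
```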

Some notable parameters are listed below. We will explore some of them in more detail.

### trim_ws 
Trims whitespace from the beginning and end of each value. I always want this cleaning step, and it is on by default.

### guess_max
The maximum number of rows used to guess the data types. The default is 1000 (or n_max, if that is smaller), but you can go higher if you think it would make a difference.

### col_names
Controls the column names. It serves two purposes. First, you can set it to TRUE or FALSE to state whether the first row is a header row. Second, you can provide a character vector of column names to use instead, which is handy when the file has no header. Note that when you supply a vector, read_csv() treats the first row as a data row, not a header row.

### skip
This allows you to skip the first n rows you specify. I have found files that have summary info at the top that need to be skipped. skip can also be used when you want to use col_names to rename the columns and skip the header row at the same time.
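A file with summary lines at the top could be handled as in this sketch. It assumes readr is installed; the report text and column names are invented for the example.

```r
library(readr)

# Two lines of summary text sit above the real header
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("Monthly report",
             "Generated by the finance team",
             "region,sales",
             "east,100",
             "west,250"), tmp_csv)

# Skip the two summary lines; the third line is then used as the header
sales <- read_csv(tmp_csv, skip = 2, col_types = "ci")
names(sales)
nrow(sales)
```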

### col_types
This allows you to define the column types upon import. An automatic attempt is made, but you may want a different data type or if you are like me, you want to be more deterministic and not leave it to the data to determine its type. Another useful feature is to drop or skip columns you don't want. This is really a trade off of when to do the data wrangling, right now or after it is already in a data frame variable. As always, it is a personal preference and often a balance.
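A deterministic import that also drops an unwanted column might be sketched like this. It assumes readr is installed and uses the cols() helper with one col_*() function per column; the file and column names are hypothetical.

```r
library(readr)

tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("id,name,notes,score", "1,ann,n/a,9.5", "2,bob,late,7"), tmp_csv)

# Spell out every type; col_skip() drops a column during import
scores <- read_csv(tmp_csv, col_types = cols(
  id    = col_integer(),
  name  = col_character(),
  notes = col_skip(),
  score = col_double()
))

names(scores)
```

Doing the same after import would mean reading the notes column and then removing it, so this is purely a question of where you prefer to do the wrangling.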

In [None]:
# Let's start overriding the parameters
# to get a sense of what features we have 
# Start with pulling up help on read_csv()
___


Notice it covers read_csv(), read_tsv(), and the generic read_delim(). read_csv2() is for locales that use the comma as the decimal separator.

In [16]:
# Let's set col_names to false
# What do you think will happen?
df <- read_csv("fmpg.csv", col_names = FALSE)

# Compare the output to mpg
glimpse(df)
glimpse(mpg)



Parsed with column specification:
cols(
  X1 = col_character(),
  X2 = col_character(),
  X3 = col_character(),
  X4 = col_character(),
  X5 = col_character(),
  X6 = col_character(),
  X7 = col_character(),
  X8 = col_character(),
  X9 = col_character(),
  X10 = col_character(),
  X11 = col_character()
)


Observations: 235
Variables: 11
$ X1  <chr> "manufacturer", "audi", "audi", "audi", "audi", "audi", "audi",...
$ X2  <chr> "model", "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro"...
$ X3  <chr> "displ", "1.8", "1.8", "2", "2", "2.8", "2.8", "3.1", "1.8", "1...
$ X4  <chr> "year", "1999", "1999", "2008", "2008", "1999", "1999", "2008",...
$ X5  <chr> "cyl", "4", "4", "4", "4", "6", "6", "6", "4", "4", "4", "4", "...
$ X6  <chr> "trans", "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "a...
$ X7  <chr> "drv", "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "...
$ X8  <chr> "cty", "18", "21", "20", "21", "16", "18", "18", "18", "16", "2...
$ X9  <chr> "hwy", "29", "29", "31", "30", "26", "26", "27", "26", "25", "2...
$ X10 <chr> "fl", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p...
$ X11 <chr> "class", "compact", "compact", "compact", "compact", "compact",...
Observations: 234
Variables: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi"

Notice that the column names are automatically generated as X1 to X11. Notice that all columns came in as character.

Do you know why?

Notice that there is an extra row in df. 

What is that extra row?

In [17]:
# We created a simulated troubleshooting opportunity!
# Let's start fixing this by skipping the first line
df <- read_csv("fmpg.csv", col_names = FALSE, skip = 1)

# Compare the output to mpg
glimpse(df)
glimpse(mpg)



Parsed with column specification:
cols(
  X1 = col_character(),
  X2 = col_character(),
  X3 = col_double(),
  X4 = col_integer(),
  X5 = col_integer(),
  X6 = col_character(),
  X7 = col_character(),
  X8 = col_integer(),
  X9 = col_integer(),
  X10 = col_character(),
  X11 = col_character()
)


Observations: 234
Variables: 11
$ X1  <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "audi",...
$ X2  <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "a4 qua...
$ X3  <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.8, 2.8...
$ X4  <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 2008, 200...
$ X5  <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, 8, 8, ...
$ X6  <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto(l5)",...
$ X7  <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4", "4"...
$ X8  <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 15, 15,...
$ X9  <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 25, 24,...
$ X10 <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p"...
$ X11 <chr> "compact", "compact", "compact", "compact", "compact", "compact...
Observations: 234
Variables: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi"

Notice row count and data types are corrected. Column names still need work.

In [19]:
# Let's add our own column names
# Display the column names from mpg.
# What was that function again??
names(mpg)

In [20]:
# Now let's store that into a variable
# which is called a vector in R
v_mpg_col_names <- names(mpg)

In [21]:
# To see what is in a variable 
# you can use print()
print(v_mpg_col_names)

# or you can just type the variable name
v_mpg_col_names

 [1] "manufacturer" "model"        "displ"        "year"         "cyl"         
 [6] "trans"        "drv"          "cty"          "hwy"          "fl"          
[11] "class"       


In [22]:
# Let's use our new names variable to add the column
# names back in
df <- read_csv("fmpg.csv", col_names = v_mpg_col_names, skip = 1)

# Compare the output to mpg
glimpse(df)
glimpse(mpg)



Parsed with column specification:
cols(
  manufacturer = col_character(),
  model = col_character(),
  displ = col_double(),
  year = col_integer(),
  cyl = col_integer(),
  trans = col_character(),
  drv = col_character(),
  cty = col_integer(),
  hwy = col_integer(),
  fl = col_character(),
  class = col_character()
)


Observations: 234
Variables: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi"...
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro"...
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0,...
$ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, ...
$ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, ...
$ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "a...
$ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4",...
$ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17...
$ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25...
$ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p",...
$ class        <chr> "compact", "compact", "compact", "compact", "compact",...
Observations: 234
Variables: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi"

Notice everything is as it should be now.

In [23]:
# Let's say we had done our homework and 
# knew what column names we wanted

# We can create our own names vector
# using the c() function short for combine
# which combines comma separated values into a vector
v_mpg_col_names <- c('vehicle', 'model', 'disp', 'year', 
                     'cyl', 'trans', 'drv', 'cty', 'hwy', 'fl', 'class')

# You can use single quotes or double quotes. Since I copied and pasted
# the above from the earlier output window, it had single quotes so I used them
# I normally use double quotes so they don't clash with SQL syntax that uses
# single quotes when I am embedding SQL strings into R
# Again it is a personal preference.

# Also notice that I added a carriage return after 'year', 
# Any time there is whitespace outside of a string, you can 
# add a carriage return (newline) for readability

# Notice that I changed column names manufacturer to vehicle and displ to disp
# so we can determine if our code is really working.

# Let's try with the new names
df <- read_csv("fmpg.csv", col_names = v_mpg_col_names, skip = 1)

# Compare the output to mpg
glimpse(df)
glimpse(mpg)


Parsed with column specification:
cols(
  vehicle = col_character(),
  model = col_character(),
  disp = col_double(),
  year = col_integer(),
  cyl = col_integer(),
  trans = col_character(),
  drv = col_character(),
  cty = col_integer(),
  hwy = col_integer(),
  fl = col_character(),
  class = col_character()
)


Observations: 234
Variables: 11
$ vehicle <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "au...
$ model   <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "a4...
$ disp    <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.8,...
$ year    <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 2008,...
$ cyl     <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, 8,...
$ trans   <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto(l...
$ drv     <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4",...
$ cty     <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 15,...
$ hwy     <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 25,...
$ fl      <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p",...
$ class   <chr> "compact", "compact", "compact", "compact", "compact", "com...
Observations: 234
Variables: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi"

Now that we can easily rename our columns, let's set the data types manually.

We will use the col_types parameter with a compact string that has one shorthand letter per column:
* c = character
* i = integer
* n = number, a flexible numeric parser that ignores grouping marks and other non-numeric characters
* d = double
* l = logical
* D = date
* T = date time
* t = time
* ? = guess
* _ or - = skip the column
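The shorthand letters can be tried on a small file before applying them to real data. The sketch below assumes readr is installed; the file and column names are invented. It uses `c`, `n`, and `_` to show a character column, a number column with currency formatting, and a skipped column.

```r
library(readr)

# Balances are quoted because they contain commas
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c('customer,balance,comment',
             'ann,"$1,234",ok',
             'bob,"5,678.5",check'), tmp_csv)

# c = character, n = number, _ = skip the third column
accounts <- read_csv(tmp_csv, col_types = "cn_")
accounts$balance
names(accounts)
```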


In [24]:
# One letter for each column
v_mpg_col_type <- "ccdiicciicc"

df <- read_csv("fmpg.csv", col_names = v_mpg_col_names, skip = 1, col_types = v_mpg_col_type)

# Compare the output to mpg
glimpse(df)
glimpse(mpg)


Observations: 234
Variables: 11
$ vehicle <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "au...
$ model   <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "a4...
$ disp    <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.8,...
$ year    <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 2008,...
$ cyl     <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, 8,...
$ trans   <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto(l...
$ drv     <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4",...
$ cty     <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 15,...
$ hwy     <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 25,...
$ fl      <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p",...
$ class   <chr> "compact", "compact", "compact", "compact", "compact", "com...
Observations: 234
Variables: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi"

Notice we got it all back and we have demonstrated full control over the column names and data types. Also notice that the specification information isn't displayed when specifying the column type.

## spec()

Examine the column specifications for a data frame. spec extracts the full column specifications. cols_condense takes a spec object and condenses its definition by setting the default column type to the most frequent type and only listing columns with a different type.

spec(x)
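Here is a minimal sketch of both functions on a throwaway file. It assumes readr is installed; the file and the names `df_spec_demo` and `full_spec` are made up for illustration.

```r
library(readr)

tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("a,b,c,d", "x,1,2,3", "y,4,5,6"), tmp_csv)

df_spec_demo <- read_csv(tmp_csv, col_types = "ciii")

# Full specification: one entry per column
full_spec <- spec(df_spec_demo)
full_spec

# Condensed form: the most frequent type becomes the default
cols_condense(full_spec)
```

The printed specification is itself valid R code, so it can be copied into a col_types argument to make a later import deterministic.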

In [30]:
my_class <- as.factor(mpg$class)
my_class
class(my_class)
levels(my_class)
as.numeric(my_class)

In [37]:
# lm() expects a model formula first and the data frame in the data argument
# The formula here (hwy explained by displ) is an assumed example
my_model <- lm(hwy ~ displ, data = mpg)

?lm


In [None]:
# Pull up help on spec()
___readr::spec

In [None]:
# We can see the specification using
# spec()
___(df)

In [None]:
# One more thing, what
# happens when there is a parsing error
# What does that look like and 
# how can we best deal with it.

# Let's attempt to create an error
# by taking a string and asking for a double
# on the first column and a 
# date on the last column
v_mpg_col_names <- c('vehicle', 'model', 'disp', 'year', 
                     'cyl', 'trans', 'drv', 'cty', 'hwy', 'fl', 'class')
v_mpg_col_type <- "dcdiicciicD"

df <- read_csv("fmpg.csv", col_names = ___, skip = 1, col_types = ___)

# Compare the output to mpg
___
___(mpg)

# Notice the warning message
# including problems() function

# Notice the NA values for the first and 
# last columns. It put NA as a placeholder
# meaning Not Available



In [None]:
# Let's run 
# problems()
# pass it in your data frame
# Let's also just view the top 10 rows
___(df) %>% ___(10)

# Notice for each instance it has the column
# name, what it expects, and what was provided
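
The same workflow can be seen on a small standalone file. This sketch assumes readr is installed; the file, the column names, and `bad_import` are hypothetical, and the exact columns returned by problems() can vary between readr versions.

```r
library(readr)

tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("id,amount", "1,10", "2,abc", "3,30"), tmp_csv)

# Ask for integers even though one amount is text; expect a parsing warning
bad_import <- read_csv(tmp_csv, col_types = "ii")

# The value that failed to parse becomes NA
bad_import$amount

# problems() reports where the failure occurred and what was expected
problems(bad_import)
```

If the failures cannot be fixed at import time, reading the column as character ("c") and cleaning it afterwards is the fallback described at the start of this notebook.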