
# Data Frames in R

To begin to understand data frames in R, let's build a simple example by hand with three vectors we want to relate in a table.

In [1]:
album <- c("Low End Theory", "Nevermind", "Port of Morrow",
           "Dark Side of the Moon", "Naked", "OK Computer",
           "Abbey Road", "Thriller", "Rumours", "The Joshua Tree")
year <- c(1991, 1991, 2012, 1973, 1988, 1997, 1969, 1982, 1977, 1987)
digital <- c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE)

df <- data.frame(Album = album, Year = year, Digital = digital)

In [None]:
df

## Column Naming

We can also set or change the column names of an existing data frame, whether we imported it or built it by hand:

In [None]:
df <- data.frame(Album = album, Year = year, Digital = digital)

In [None]:
names(df) <- c("Album", "Year", "Digital")
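To rename a single column rather than replacing every name at once, you can index into `names()`. This is a minimal sketch; the `albums` data frame and the new name `ReleaseYear` are illustrative and not part of the notebook's data.

```r
# Small stand-in data frame (illustrative values only)
albums <- data.frame(Album = c("Nevermind", "Abbey Road"),
                     Year = c(1991, 1969))

# Rename only the column that is currently called "Year"
names(albums)[names(albums) == "Year"] <- "ReleaseYear"

names(albums)
```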

## Insert a Column

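Columns can be added to or removed from a data frame at any time after it is created. Use `$` to refer to a column by name.

In [None]:
df$Album    # an existing column, returned as a vector
df$Artist   # NULL, because no Artist column exists yet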
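One common way to insert a column is to assign a vector to a new `$` name, and to remove one by assigning `NULL`. The sketch below works on a small stand-in data frame named `albums` (a hypothetical name) so that it does not alter `df`.

```r
# Small stand-in data frame (illustrative values only)
albums <- data.frame(Album = c("Nevermind", "Abbey Road"),
                     Year = c(1991, 1969))

# Add a column: the new vector needs one value per row
albums$Artist <- c("Nirvana", "The Beatles")

# Remove a column by assigning NULL to it
albums$Year <- NULL

names(albums)
```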

## Structure

To fetch the structure of a data frame, use `str()`. This tells us how many observations (rows) it contains along with how many variables (columns) each observation contains.

The output looks much like a list. This shouldn't be surprising, since a data frame is a list of columns and can hold mixed data types. However, all values within a single column must share one data type. (Album values are all strings, Year values are all numbers, and Digital values are all logicals.)

In [None]:
str(df)
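If you only need the dimensions or the column types rather than the full `str()` report, base R has narrower helpers. A short sketch on a hypothetical `albums` data frame; `stringsAsFactors = FALSE` is set explicitly so the result does not depend on the R version.

```r
# Small stand-in data frame (illustrative values only)
albums <- data.frame(Album = c("Nevermind", "Abbey Road"),
                     Year = c(1991, 1969),
                     Digital = c(FALSE, FALSE),
                     stringsAsFactors = FALSE)

# Number of rows, then number of columns
dim(albums)

# The class of each column, returned as a named character vector
col_types <- sapply(albums, class)
col_types
```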

## Selecting Columns

Put the indexes of the columns you want in the second half of the slice brackets `[ ]`, after the comma. Leaving the first half empty returns every row. The indexes can be a range separated by a colon `:`.

In [None]:
df[,1:2]
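Columns can also be selected by name, or excluded with a negative index, which tends to be easier to read than positions. A sketch using a hypothetical `albums` data frame:

```r
# Small stand-in data frame (illustrative values only)
albums <- data.frame(Album = c("Nevermind", "Abbey Road"),
                     Year = c(1991, 1969),
                     Digital = c(FALSE, FALSE))

# Select columns by name
by_name <- albums[, c("Album", "Year")]

# Drop the third column by using a negative index
without_third <- albums[, -3]

names(by_name)
names(without_third)
```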

## Selecting & Querying

Put the indexes of the rows you want in the first half of the slice brackets `[ ]`, followed by a comma. Leaving the second half empty returns every column.

The command below asks for all columns of records 1-3:

In [None]:
df[1:3,]

Or this example asks for records 1-3, and columns 1-2.

In [None]:
df[1:3,1:2]

Select specific rows by combining them into the first half of the slice bracket `[ ]`.

In [None]:
df[c(1,3,4),]

Select specific rows based on a filter. Here we **query** the data frame for all albums with a `FALSE` value in the `Digital` column.

In [None]:
df[df$Digital == FALSE,]

In [None]:
# Or query for the opposite value:
df[df$Digital == TRUE,]

In [None]:
# Or using mathematical operators to filter:
df[df$Year > 1990,]

In [None]:
# Or combine operators to filter more carefully. All comparison operators can be used
# ( ==, !=, <, >, <=, >= )
#
# as well as all logical operators
#   - AND: &
#   -  OR: |
#   - NOT: !

df[df$Year > 1990 & df$Digital == TRUE,]

In [None]:
df[df$Year > 1990 | df$Digital == FALSE,]
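Base R's `subset()` expresses the same kind of query without repeating the data frame name inside the condition. The sketch below compares both forms on a hypothetical `albums` data frame.

```r
# Small stand-in data frame (illustrative values only)
albums <- data.frame(Album = c("Nevermind", "Port of Morrow", "Abbey Road"),
                     Year = c(1991, 2012, 1969),
                     Digital = c(FALSE, TRUE, FALSE))

# Bracket form: column names must be prefixed with the data frame
bracket <- albums[albums$Year > 1990 & albums$Digital == TRUE, ]

# subset() form: column names are looked up inside the data frame
via_subset <- subset(albums, Year > 1990 & Digital)

bracket
via_subset
```

One difference worth knowing: `subset()` drops rows where the condition evaluates to `NA`, whereas the bracket form returns them as rows filled with `NA`.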

## Sorting

In [None]:
df_by_year <- df[order(df$Year),]
df_by_year
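`order()` also accepts a `decreasing` argument and more than one sort key. A brief sketch on a hypothetical `albums` data frame, sorting newest first and then by year with ties broken by album name:

```r
# Small stand-in data frame (illustrative values only)
albums <- data.frame(Album = c("Nevermind", "Low End Theory", "Abbey Road"),
                     Year = c(1991, 1991, 1969))

# Newest first
newest_first <- albums[order(albums$Year, decreasing = TRUE), ]

# Oldest first, with ties on Year broken alphabetically by Album
by_year_then_album <- albums[order(albums$Year, albums$Album), ]

newest_first
by_year_then_album
```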

## `tidyverse`

All of the above operations -- selecting, querying, filtering, sorting -- are also possible (more easily) using methods built into the `tidyverse` library.

Import that and then we will review those operations.

```
Understand the PIPE in tidyverse: %>%
```
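The pipe passes the value on its left as the first argument of the function on its right, so `x %>% f()` means `f(x)`. The sketch below writes the same calculation nested and piped; it loads `dplyr`, which makes `%>%` available, and the `years` vector is illustrative.

```r
library(dplyr)  # provides the %>% pipe

years <- c(1991, 1973, 2012)

# Nested calls read inside-out
nested <- round(mean(years))

# Piped calls read top to bottom: take years, and then mean, and then round
piped <- years %>%
  mean() %>%
  round()

identical(nested, piped)
```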

In [None]:
library(tidyverse)

In [None]:
df %>%
  select(Album, Year) %>%
  filter(Year > 1990)

| Album | Year |
|---|---|
| `<chr>` | `<dbl>` |
| Low End Theory | 1991 |
| Nevermind | 1991 |
| Port of Morrow | 2012 |
| OK Computer | 1997 |


## Import CSV Data

Just as with Pandas in Python, it is much more common to load data from files, such as CSV. This is possible manually, as well as by using the `tidyverse` library.

### Import a CSV Manually

R has a native `read.csv()` method. A few things to note as you run this cell:

- This method automatically creates a data frame from the CSV data.
- The file can be local or remote via URL.
- Since the data file has a header row, names are automatically assigned to columns.

In [None]:
titanic <- read.csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
titanic
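To try `read.csv()` without a network connection, the `text` argument lets it parse CSV content held in a string. This is a sketch with made-up passenger rows (the names and values are hypothetical), showing that the header row becomes the column names and that `NA` in the data is read as a missing value.

```r
# Inline CSV content; the first line is the header row
csv_text <- "Name,Age,Fare
Alice,29,7.25
Bob,NA,71.28"

passengers <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Column names come from the header row
names(passengers)
str(passengers)
```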

We can now use the same filtering methods as above to inspect the data. Here each step passes its result to the next through the pipe, so no intermediate data frames need to be saved.

In [None]:
titanic %>%
  select(Pclass, Name, Sex, Fare, Age) %>%
  filter(Sex == "female") %>%
  filter(Pclass == 3) %>%
  filter(Age < 20) %>%
  arrange(Name)

Or we can explore one of the built-in data sets from R. In this case let's use the `starwars` data set.

In [None]:
starwars %>%   # and-then
  select(gender, mass, height, species) %>%
  filter(species == "Human") %>%
  na.omit()

In [None]:
starwars %>%   # and-then
  select(gender, mass, height, species) %>%
  filter(species == "Human") %>%
  na.omit() %>%
  mutate(height = height / 100) %>%
  mutate(BMI = mass / height^2) %>%
  group_by(gender) %>%
  summarise(Average_BMI = mean(BMI))
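The same `mutate()`, `group_by()`, `summarise()` pattern can be checked by hand on a tiny data frame. The `people` data below is invented for illustration; heights are in centimetres, as in `starwars`, so they are divided by 100 before computing BMI = mass / height².

```r
library(dplyr)

# Made-up data: mass in kg, height in cm
people <- data.frame(group = c("a", "a", "b"),
                     mass = c(80, 60, 75),
                     height = c(200, 150, 150))

result <- people %>%
  mutate(height = height / 100,      # convert cm to m
         BMI = mass / height^2) %>%  # uses the converted height
  group_by(group) %>%
  summarise(Average_BMI = mean(BMI))

result
```

Group `a` has BMIs of 80 / 2² = 20 and 60 / 1.5² ≈ 26.67, so its average is about 23.33; group `b` has a single BMI of 75 / 1.5² ≈ 33.33.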

Finally, we can explore the `msleep` (Mammal Sleep) sample data set using `tidyverse`.

In [None]:
my_data <- msleep %>%
  select(name, order, bodywt, sleep_total) %>%
  filter(order == "Primates", bodywt > 20) %>%
  arrange(bodywt)

my_data

## Data Cleaning

Normal cleaning and management operations are just as common in R as they are in Python. The most frequent cleaning tasks are:

- Identifying and removing duplicate rows.
- Finding empty / NULL / `NA` values and deciding what to do with them, e.g. deleting or imputing.


### Duplicate Rows

To remove duplicate rows from a data frame there is a simple one-line command. This will select all NON-duplicated rows from the `df` data frame and pass them into a new data frame named `df2`:

In [None]:
df2 <- df[!duplicated(df), ]

In [None]:
# or, using dplyr

library(dplyr)
df2 <- df %>% distinct()
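It can help to see what `duplicated()` returns before negating it. In this sketch the `readings` data frame is invented, with its third row repeating its second.

```r
# Made-up data: row 3 is an exact copy of row 2
readings <- data.frame(id = c(1, 2, 2, 3),
                       value = c("a", "b", "b", "c"))

# One logical per row; only the later copy of a repeated row is flagged TRUE
dup_flags <- duplicated(readings)
dup_flags

# Keep the rows that are NOT flagged
unique_rows <- readings[!dup_flags, ]
nrow(unique_rows)
```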

Let's import the `very-messy-data.csv` file from an earlier homework to inspect and clean.

In [None]:
df <- read.csv("https://raw.githubusercontent.com/nmagee/ds1002/main/data/very-messy-data.csv")

# Use str() to get the structure of the data frame:
str(df)

In [None]:
# Now remove the dupe rows
df2 <- df[!duplicated(df), ]
str(df2)

### Remove Rows with `NA` values

A simple way to do this is to extract only valid data out of the data frame with the `na.omit` method:

In [None]:
df_no_empty <- na.omit(df)
str(df_no_empty)

In [None]:
# Two other methods to achieve this:

#Remove rows with NA's using complete.cases
df <- df[complete.cases(df), ]

#Remove rows with NA's using rowSums()
df <- df[rowSums(is.na(df)) == 0, ]

# Or with the tidyverse library
library("tidyr")

#Remove rows with NA's using drop_na()
df <- df %>% drop_na()
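Before dropping rows, it is often worth counting how many `NA` values each column holds, since removing incomplete rows can discard a lot of data. A sketch on an invented `measurements` data frame:

```r
# Made-up data with one NA in each column, on different rows
measurements <- data.frame(length = c(5.1, NA, 4.7),
                           width = c(3.5, 3.0, NA))

# Count of NA values per column
na_per_column <- colSums(is.na(measurements))
na_per_column

# Only rows with no NA in any column survive
complete_only <- na.omit(measurements)
nrow(complete_only)
```

Here two of the three rows are lost even though each column is missing only one value.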

### Imputation of Missing Data

The question surrounding imputation is WHAT to replace `NA` values with. This question is a data/statistical one and should not be treated lightly. The answer can throw off results greatly.

With that caveat in mind, here is the method for imputing missing values and replacing them with the mean of the rest of the data.

The R code below updates the sepal and petal columns by replacing missing values with the mean of the valid values within each column.

In [None]:
df2$sepal_length[is.na(df2$sepal_length)] <- mean(df2$sepal_length, na.rm = TRUE)
df2$sepal_width[is.na(df2$sepal_width)] <- mean(df2$sepal_width, na.rm = TRUE)
df2$petal_length[is.na(df2$petal_length)] <- mean(df2$petal_length, na.rm = TRUE)
df2$petal_width[is.na(df2$petal_width)] <- mean(df2$petal_width, na.rm = TRUE)

df2
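When many columns need the same treatment, the repeated lines can be replaced with a loop over the numeric columns. This is a sketch on an invented `flowers` data frame, not the course data; non-numeric columns are deliberately left untouched because a mean is not defined for them.

```r
# Made-up data: one numeric column and one character column, each with an NA
flowers <- data.frame(sepal_length = c(5.0, NA, 6.0),
                      species = c("setosa", "setosa", NA),
                      stringsAsFactors = FALSE)

for (col in names(flowers)) {
  if (is.numeric(flowers[[col]])) {
    # Mean of the non-missing values in this column
    col_mean <- mean(flowers[[col]], na.rm = TRUE)
    # Replace only the missing entries
    flowers[[col]][is.na(flowers[[col]])] <- col_mean
  }
}

flowers
```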

In [None]:
# Another way to achieve this is using the Hmisc package

df3 <- read.csv("https://raw.githubusercontent.com/nmagee/ds1002/main/data/very-messy-data.csv")
df3 <- df3[!duplicated(df3), ]

install.packages("Hmisc")
library(Hmisc)

# impute() returns a new vector, so assign it back to update df3
df3$sepal_length <- impute(df3$sepal_length, median)

In [None]:
df3

In [None]:
# data()
glimpse(faithful)
?faithful

## Extract Column Data back into a Vector

To extract a column back into a vector, append `$ColName` to the name of the data frame.

In [None]:
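# This uses the album data frame built at the top of the notebook;
# rebuild it first if `df` has since been overwritten with other data.
df$Album

# Or assign into a var
albums_extracted <- df$Album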
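Double brackets `[[ ]]` extract a column as a vector just like `$`, while single brackets `[ ]` with only a column name return a one-column data frame. A sketch on a hypothetical `albums` data frame:

```r
# Small stand-in data frame (illustrative values only)
albums <- data.frame(Album = c("Nevermind", "Abbey Road"),
                     Year = c(1991, 1969),
                     stringsAsFactors = FALSE)

# Double brackets: a plain vector, useful when the name is stored in a variable
col_name <- "Album"
as_vector <- albums[[col_name]]

# Single brackets: still a data frame, with one column
as_frame <- albums[col_name]

is.character(as_vector)
is.data.frame(as_frame)
```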