
## Data Cleaning

Routine cleaning and data-management operations are just as common in R as they are in Python. Two of the most frequent cleaning tasks are:

- Identifying and removing duplicate rows.
- Finding empty / NULL / `NA` values and deciding what to do with them, e.g. deleting or imputing them.
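
Before cleaning anything, it helps to measure how much there is to clean. The sketch below counts both kinds of problem in one pass. The `toy` data frame and its column names are made up for illustration and are not part of the course data.

```r
# Hypothetical toy data frame: one repeated row and two missing scores
toy <- data.frame(
  id    = c(1, 2, 2, 3, 4),
  score = c(10, 20, 20, NA, NA)
)

# duplicated() flags every row that repeats an earlier row
n_dupes <- sum(duplicated(toy))

# is.na() gives a TRUE/FALSE matrix; colSums() counts the TRUEs per column
na_by_col <- colSums(is.na(toy))

n_dupes
na_by_col
```

Checking these two numbers first makes it easier to judge whether dropping rows is acceptable or whether too much data would be lost.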


In [1]:
album <- c("Low End Theory", "Nevermind", "Port of Morrow",
           "Dark Side of the Moon", "Naked", "OK Computer",
           "Abbey Road", "Thriller", "Rumours", "The Joshua Tree")
year <- c(1991, 1991, 2012, 1973, 1988, 1997, 1969, 1982, 1977, 1987)
digital <- c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE)

df <- data.frame(Album = album, Year = year, Digital = digital)

In [None]:
df

### Duplicate Rows

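Removing duplicate rows from a data frame takes a single line. `duplicated()` flags every row that repeats an earlier one, so negating it keeps the first occurrence of each row. The command below selects all non-duplicated rows from the `df` data frame and stores them in a new data frame named `df2`: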

In [None]:
df2 <- df[!duplicated(df), ]

In [None]:
# or, using dplyr

library(dplyr)
df2 <- df %>% distinct()
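
Sometimes rows should count as duplicates when only some columns repeat. As a minimal sketch, the example below uses a small stand-in data frame named `albums` (an assumption, so the block runs on its own) and keeps the first row for each distinct `Year`.

```r
# Small stand-in for the album data frame, so this block is self-contained
albums <- data.frame(
  Album = c("Low End Theory", "Nevermind", "OK Computer"),
  Year  = c(1991, 1991, 1997),
  stringsAsFactors = FALSE
)

# No row is a full duplicate, so all three rows survive here
nrow(albums[!duplicated(albums), ])

# Compare on Year alone: the second 1991 row is treated as a duplicate
first_per_year <- albums[!duplicated(albums$Year), ]
first_per_year
```

Which columns define a duplicate is a judgment about the data, so it is worth stating that choice explicitly in the code.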

Let's import the `very-messy-data.csv` file from an earlier homework to inspect and clean.

In [None]:
df <- read.csv("https://raw.githubusercontent.com/nmagee/ds1002/main/data/very-messy-data.csv")

# Use str() to get the structure of the data frame:
str(df)
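
By default `read.csv()` turns blank fields in text columns into empty strings rather than `NA`, so they can slip past `is.na()`. Whether the messy file contains such blanks is an assumption here; the sketch below uses a small inline CSV (hypothetical data) to show how the `na.strings` argument handles them.

```r
# Hypothetical inline CSV: a blank name on row 2, a blank score on row 3
csv_text <- "name,score\nann,1\n,2\nbob,\n"

# Default behaviour: the blank name is read as "" and is not counted as NA
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
sum(is.na(raw$name))

# Treat empty strings as missing while reading
clean <- read.csv(text = csv_text, na.strings = c("", "NA"),
                  stringsAsFactors = FALSE)
sum(is.na(clean$name))
```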

In [None]:
# Now remove the dupe rows
df2 <- df[!duplicated(df), ]
str(df2)

### Remove Rows with `NA` values

A simple way to do this is with the `na.omit()` function, which returns only the rows that contain no `NA` values in any column:

In [None]:
df_no_empty <- na.omit(df)
str(df_no_empty)

In [None]:
# Three other ways to get the same result.
# Each one is assigned to df_no_empty so the original df stays intact.

# Remove rows with NAs using complete.cases()
df_no_empty <- df[complete.cases(df), ]

# Remove rows with NAs using rowSums()
df_no_empty <- df[rowSums(is.na(df)) == 0, ]

# Or with the tidyr package (part of the tidyverse)
library(tidyr)

# Remove rows with NAs using drop_na()
df_no_empty <- df %>% drop_na()
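
All of the approaches above drop a row if any column is missing. When only one column matters for an analysis, filtering on that column alone keeps more data. This is a rough sketch on a hypothetical `measurements` data frame, not the course file.

```r
# Hypothetical data: each of rows 2 and 3 is missing something different
measurements <- data.frame(
  sepal_length = c(5.1, NA, 4.7, 5.0),
  sepal_width  = c(3.5, 3.0, NA, 3.6),
  species      = c("setosa", "setosa", NA, "setosa"),
  stringsAsFactors = FALSE
)

# Drops every row with at least one NA (rows 2 and 3)
all_complete <- na.omit(measurements)

# Drops only rows where sepal_length itself is missing (row 2)
has_length <- measurements[!is.na(measurements$sepal_length), ]

nrow(all_complete)
nrow(has_length)
```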

### Imputation of Missing Data

The question surrounding imputation is WHAT to replace `NA` values with. This question is a data/statistical one and should not be treated lightly. The answer can throw off results greatly.

With that caveat in mind, one common method is to replace each missing value with the mean of the remaining values in the same column.

The R code below updates the sepal and petal columns by replacing empty values with the mean of the valid values within each column.

In [None]:
# na.rm = TRUE makes mean() skip the NA values it is about to replace
df2$sepal_length[is.na(df2$sepal_length)] <- mean(df2$sepal_length, na.rm = TRUE)
df2$sepal_width[is.na(df2$sepal_width)] <- mean(df2$sepal_width, na.rm = TRUE)
df2$petal_length[is.na(df2$petal_length)] <- mean(df2$petal_length, na.rm = TRUE)
df2$petal_width[is.na(df2$petal_width)] <- mean(df2$petal_width, na.rm = TRUE)

df2
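
Repeating the same line for each column gets tedious as a data frame grows. One possible alternative is to loop over every numeric column, sketched below on a hypothetical `flowers` data frame. Non-numeric columns are left untouched because a mean is not defined for them.

```r
# Hypothetical data frame with gaps in two numeric columns and one text column
flowers <- data.frame(
  sepal_length = c(5.0, NA, 6.0),
  petal_length = c(1.0, 2.0, NA),
  species      = c("a", "b", NA),
  stringsAsFactors = FALSE
)

# Names of the columns that hold numbers
numeric_cols <- names(flowers)[sapply(flowers, is.numeric)]

# Replace each NA with the mean of the valid values in the same column
for (col in numeric_cols) {
  missing <- is.na(flowers[[col]])
  flowers[[col]][missing] <- mean(flowers[[col]], na.rm = TRUE)
}

flowers
```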

In [None]:
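# Another way to achieve this is with the Hmisc package
# (this example fills in the median rather than the mean)

df3 <- read.csv("https://raw.githubusercontent.com/nmagee/ds1002/main/data/very-messy-data.csv")
df3 <- df3[!duplicated(df3), ]

install.packages("Hmisc")
library(Hmisc)

# impute() returns the filled-in vector, so assign it back to the column
df3$sepal_length <- impute(df3$sepal_length, median)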

In [None]:
df3
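
Median imputation can also be done in base R without installing a package. The sketch below works on a made-up vector `x` and records which positions were filled, so the imputed values can still be identified later.

```r
# Hypothetical vector with two missing values
x <- c(4.9, NA, 5.1, 7.0, NA)

# Remember which entries were missing before filling them in
was_missing <- is.na(x)

# Replace each NA with the median of the valid values
x_filled <- ifelse(was_missing, median(x, na.rm = TRUE), x)

x_filled
which(was_missing)
```

The median is less sensitive to extreme values than the mean, which is one reason to prefer it for skewed columns.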

In [None]:
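# data()          # uncomment to list the datasets that ship with R
# glimpse() comes from dplyr, which was loaded earlier
glimpse(faithful)
# Open the help page for the built-in faithful dataset
?faithful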

## Extract Column Data back into a Vector

To extract a column from a data frame back into a vector, append `$ColName` to the name of the data frame.

In [None]:
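# df now holds the messy data, so rebuild the album data frame
# from the vectors defined in the first cell
df <- data.frame(Album = album, Year = year, Digital = digital)
df$Album

# Or assign into a var
albums_extracted <- df$Album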
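
The `$` operator is not the only way to pull a column out. As a brief sketch on a stand-in `albums` data frame (an assumption so the block runs on its own), double brackets return the same vector, while single brackets return a one-column data frame.

```r
# Small stand-in data frame so this block is self-contained
albums <- data.frame(
  Album = c("Nevermind", "Thriller"),
  Year  = c(1991, 1982),
  stringsAsFactors = FALSE
)

by_dollar <- albums$Album       # vector
by_name   <- albums[["Album"]]  # same vector; the name can come from a variable
still_df  <- albums["Album"]    # one-column data frame, not a vector

identical(by_dollar, by_name)
is.data.frame(still_df)
```

The double-bracket form is handy inside loops and functions, where the column name is stored in a variable rather than typed out.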