
## Normalized Data

Clean, ordered, well-organized data makes your next steps easier. Take the built-in `airquality` sample data set, for example:

In [None]:
airquality

In [None]:
boxplot(airquality)

In [None]:
# hist() needs a numeric vector rather than a whole data frame, so plot one column
hist(airquality$Ozone)

## Data Cleaning

Normal cleaning and management operations are just as common in R as they are in Python. The most frequent cleaning tasks are:

- Identifying and removing duplicate rows.
- Normalizing data values.
- Finding empty / NULL / `NA` values and deciding what to do with them, e.g. deleting or imputing them.
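
As a rough sketch of how the three tasks above fit together, the following uses a small made-up data frame (`toy` and its columns are hypothetical, not part of the course data). It is only meant to show the order of operations, not a complete cleaning routine.

```r
# Hypothetical toy data, not the course data set.
toy <- data.frame(
  id    = c(1, 2, 2, 3, 4),
  city  = c("Boston", "chicago", "chicago", "", "BOSTON"),
  score = c(10, 8, 8, NA, 9),
  stringsAsFactors = FALSE
)

# 1. Drop exact duplicate rows.
toy <- toy[!duplicated(toy), ]

# 2. Normalize values: use one consistent case for city names.
toy$city <- tolower(toy$city)

# 3. Turn empty strings into NA, then count missing values per column.
toy$city[toy$city == ""] <- NA
na_counts <- colSums(is.na(toy))

print(toy)
print(na_counts)
```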


In [42]:
df <- read.csv("https://raw.githubusercontent.com/nmagee/ds1002/main/data/very-messy-data-2.csv")

In [None]:
# Use str() to get the structure of the data frame:

str(df)

In [None]:
# Use summary() to get summary data for each attribute, including empty values.

summary(df)

In [None]:
df

### Duplicate Rows

To see how many duplicate rows exist in a data frame:

In [None]:
nrow(df[duplicated(df), ])

To remove duplicate rows from a data frame there is a simple one-line command. This will select all NON-duplicated rows from the `df` data frame and pass them into a new data frame named `df2`:

In [35]:
df2 <- df[!duplicated(df), ]

In [None]:
# Or, using dplyr, pass the data frame to distinct()

library(dplyr)
df2 <- df %>% distinct()
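
Both approaches above only catch rows that are identical in every column. If duplicates should instead be judged on a key column, `duplicated()` can be applied to that column alone. A minimal sketch, assuming a hypothetical `readings` data frame where `id` is the key:

```r
# Hypothetical example data: id 2 appears twice with different readings.
readings <- data.frame(
  id    = c(1, 2, 2, 3),
  value = c(5.1, 4.9, 5.0, 6.2)
)

# No row is an exact copy of another, so this finds nothing.
exact_dupes <- sum(duplicated(readings))

# Checking only the id column flags the second occurrence of id 2.
key_dupes <- sum(duplicated(readings$id))

# Keep the first row for each id.
first_per_id <- readings[!duplicated(readings$id), ]

# Keep the last row for each id instead.
last_per_id <- readings[!duplicated(readings$id, fromLast = TRUE), ]

print(first_per_id)
print(last_per_id)
```

Whether the first or the last occurrence is the right one to keep depends on the data, so this is a choice to make deliberately.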

### Look for Irregularities

Sometimes a value will fall outside the range of expected data values. A good example is a `logical` column, where you expect to see only `TRUE` and `FALSE`. It's useful to look at the distinct values in a column, and the `unique()` function returns them.

In [None]:
unique(df2$sepal_length)
unique(df2$sepal_width)
unique(df2$petal_length)
unique(df2$petal_width)
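
When a column has many distinct values, scanning the `unique()` output by eye gets tedious. One possible way to narrow the search, sketched here on a made-up vector (`widths` is hypothetical), is to count values with `table()` and to flag entries that cannot be converted to numbers:

```r
# Hypothetical column that should be numeric but was read as text.
widths <- c("0.5", "1.2", "gg28", "", "3", NA)

# table() counts each distinct value; useNA = "ifany" also counts NA.
counts <- table(widths, useNA = "ifany")
print(counts)

# Values that are present but cannot be converted to a number.
as_num <- suppressWarnings(as.numeric(widths))
suspect <- widths[!is.na(widths) & is.na(as_num)]
print(suspect)
```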

### Update Values As Needed

To replace a specific value anywhere in a data frame, for example to set it to `NA`, index the data frame with a logical comparison and assign the new value to the matching cells.

Suppose you want to replace "empty" values with `NA`. Use this syntax:

```
df2[df2==""] <- NA
```

In [36]:
df2[df2=="gg28"] <- NA
df2[df2=="gg29"] <- NA
df2[df2==""] <- NA
df2

| | id `<int>` | sepal_length `<dbl>` | sepal_width `<dbl>` | petal_length `<dbl>` | petal_width `<chr>` | species `<chr>` |
|---|---|---|---|---|---|---|
| 1 | 1 | 3.5 | 2.9 | 1.4 | 0.5 | virginica |
| 2 | 2 | 3.6 | 3.2 | 3.0 | 0.5 | setosa |
| 3 | 3 | 3.8 | NA | 2.2 | 1.5 | setosa |
| 4 | 4 | 5.8 | 2.7 | 2.6 | 1.2 | virginica |
| 5 | 5 | 4.9 | 3.6 | 3.0 | 1.2 | virginica |
| 6 | 6 | 5.0 | 2.7 | 1.4 | 2.3 | setosa |
| 7 | 7 | 4.8 | 3.0 | 2.6 | 1.4 | setosa |
| 8 | 8 | 5.5 | 2.2 | 2.1 | 2.1 | virginica |
| 9 | 9 | 5.5 | 2.9 | 1.1 | 3 | setosa |
| 10 | 10 | NA | 3.4 | 2.6 | 1.9 | virginica |


The above method is useful whenever you need to push a replacement value into specific cells.
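
When several placeholder values need the same treatment, repeating one assignment per value becomes unwieldy. A possible alternative, sketched on a hypothetical `messy` data frame, is to list the unwanted values once and replace them column by column with `%in%`:

```r
# Hypothetical data frame with several placeholder strings.
messy <- data.frame(
  width   = c("0.5", "gg28", "1.2", ""),
  species = c("setosa", "", "gg29", "virginica"),
  stringsAsFactors = FALSE
)

# Values to treat as missing (an assumption; adjust for your data).
bad_values <- c("gg28", "gg29", "")

# Replace every listed value with NA, column by column.
# Assigning to messy[] keeps the result a data frame.
messy[] <- lapply(messy, function(col) replace(col, col %in% bad_values, NA))

print(messy)
```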

### Remove Rows with `NA` values

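A simple way to do this is to extract only the complete rows from the data frame with the `na.omit()` function:

**This is a destructive action!** Every row containing at least one `NA` is dropped entirely, so assign the result to a new data frame if you want to keep the original.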

In [None]:
df_no_empty <- na.omit(df)
str(df_no_empty)

In [None]:
# Other methods to achieve this (each one overwrites df):

# Remove rows with NAs using complete.cases()
df <- df[complete.cases(df), ]

# Remove rows with NAs using rowSums()
df <- df[rowSums(is.na(df)) == 0, ]

# Or with the tidyr package from the tidyverse
library(tidyr)

# Remove rows with NAs using drop_na()
df <- df %>% drop_na()
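
Because dropping rows cannot be undone once the original is overwritten, it can help to check how much data would be lost first. A minimal sketch on a hypothetical `obs` data frame:

```r
# Hypothetical data with scattered missing values.
obs <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("x", "y", NA, "z"),
  c = c(TRUE, FALSE, TRUE, NA)
)

# Missing values per column.
na_per_column <- colSums(is.na(obs))
print(na_per_column)

# Number of rows that would survive na.omit().
rows_kept <- sum(complete.cases(obs))
print(rows_kept)

# Write the result to a new name so the original stays available.
obs_complete <- na.omit(obs)
```

In this made-up case each column has only one missing value, yet three of the four rows are lost, which is the kind of result worth knowing before committing to row removal.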

### Impute Missing Data

The question surrounding imputation is WHAT to replace `NA` values with. This question is a data/statistical one and should not be treated lightly. The answer can throw off results greatly.

With that caveat in mind, here is the method for imputing missing values and replacing them with the mean of the rest of the data.

The R code below updates the sepal and petal columns by replacing `NA` values with the mean of the valid values within each column.

In [37]:
df2$sepal_length[is.na(df2$sepal_length)] <- mean(df2$sepal_length, na.rm = TRUE)
df2$sepal_width[is.na(df2$sepal_width)] <- mean(df2$sepal_width, na.rm = TRUE)
df2$petal_length[is.na(df2$petal_length)] <- mean(df2$petal_length, na.rm = TRUE)
df2$petal_width[is.na(df2$petal_width)] <- mean(df2$petal_width, na.rm = TRUE)

df2

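Warning: "argument is not numeric or logical: returning NA"

This warning comes from the `petal_width` line. That column was read as character (`<chr>`), so `mean()` cannot average it, returns `NA`, and nothing is imputed for that column.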


| | id `<int>` | sepal_length `<dbl>` | sepal_width `<dbl>` | petal_length `<dbl>` | petal_width `<chr>` | species `<chr>` |
|---|---|---|---|---|---|---|
| 1 | 1 | 3.500 | 2.900000 | 1.4 | 0.5 | virginica |
| 2 | 2 | 3.600 | 3.200000 | 3.0 | 0.5 | setosa |
| 3 | 3 | 3.800 | 2.987694 | 2.2 | 1.5 | setosa |
| 4 | 4 | 5.800 | 2.700000 | 2.6 | 1.2 | virginica |
| 5 | 5 | 4.900 | 3.600000 | 3.0 | 1.2 | virginica |
| 6 | 6 | 5.000 | 2.700000 | 1.4 | 2.3 | setosa |
| 7 | 7 | 4.800 | 3.000000 | 2.6 | 1.4 | setosa |
| 8 | 8 | 5.500 | 2.200000 | 2.1 | 2.1 | virginica |
| 9 | 9 | 5.500 | 2.900000 | 1.1 | 3 | setosa |
| 10 | 10 | 4.525 | 3.400000 | 2.6 | 1.9 | virginica |
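
The `petal_width` column above is still listed as `<chr>`, which is why mean imputation has no effect on it. One way to handle this, sketched here on a hypothetical stand-in vector rather than the real column, is to convert to numeric before computing the mean:

```r
# Hypothetical stand-in for a column that was read as character.
petal_width <- c("0.5", "0.5", "1.5", NA, "3")

# mean() on character input returns NA with a warning, so convert first.
petal_width <- as.numeric(petal_width)

# Replace the missing entry with the mean of the observed values.
petal_width[is.na(petal_width)] <- mean(petal_width, na.rm = TRUE)

print(petal_width)
```

Note that `as.numeric()` turns any remaining non-numeric text into `NA` (with a warning), so it is worth inspecting the column's distinct values before converting.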


In [None]:
# Another way to achieve this is using the Hmisc package

df3 <- read.csv("https://raw.githubusercontent.com/nmagee/ds1002/main/data/very-messy-data.csv")
df3 <- df3[!duplicated(df3), ]

install.packages("Hmisc")
library(Hmisc)

In [None]:
df3$sepal_length <- impute(df3$sepal_length, median)
df3$sepal_width <- impute(df3$sepal_width, median)
df3$petal_length <- impute(df3$petal_length, median)
df3$petal_width <- impute(df3$petal_width, median)

In [None]:
df3
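
If installing an extra package is not an option, a similar median imputation can be written in base R. This is a sketch under assumptions, not a replacement for `Hmisc::impute()`: the helper `impute_median()` and the data frame `m` are hypothetical names introduced for illustration.

```r
# Hypothetical numeric data with gaps.
m <- data.frame(
  sepal_length = c(5.1, NA, 4.7, 5.0),
  sepal_width  = c(3.5, 3.0, NA, 3.6)
)

# Hypothetical helper: replace NA in a vector with the vector's median.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Apply the helper to numeric columns only.
numeric_cols <- vapply(m, is.numeric, logical(1))
m[numeric_cols] <- lapply(m[numeric_cols], impute_median)

print(m)
```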

In [None]:
# data()
glimpse(faithful)
?faithful

### Remove Whitespace

Create a simple data frame with extra space characters thrown in:

In [None]:
df_space <- data.frame(first  = c("Boston ", " Chicago ", "New York ", " Minneapolis", " Portland"),
                        second = c("Massachusetts", " Illinois", "New  York", "Minnesota", "  Oregon"),
                        third = c("New England  ", "Mid-West", " New England", "Mid-West", "North-West ")
                      )

In [None]:
df_space$first

In [None]:
library(dplyr)
library(stringr)

df_space %>%
  mutate(across(where(is.character), str_trim))

In [None]:
df_space$first

Note that the `dplyr` pipeline above does not save the cleaned data; it only prints the result. To keep the cleaned values, assign the output back to the same variable or to a new one:

In [None]:
df_space <- df_space %>%
  mutate(across(where(is.character), str_trim))
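
Trimming only removes whitespace at the start and end of a string, so a value such as `"New  York"` keeps its doubled internal space. A base R sketch of both steps, using a short hypothetical vector `cities`:

```r
cities <- c("Boston ", " Chicago ", "New  York")

# trimws() removes leading and trailing whitespace only.
trimmed <- trimws(cities)

# Collapse runs of internal whitespace to a single space as well.
squished <- gsub("\\s+", " ", trimmed)

print(trimmed)
print(squished)
```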

## Extracting Column Data Back into a Vector

To extract a column back into a vector, append `$ColName` to the name of the data frame.

In [None]:
df$species

# Or assign it to a variable
species_extracted <- df$species
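
Besides `$`, base R offers two bracket forms that are easy to confuse. The sketch below uses a hypothetical `pets` data frame to show which ones return a vector and which return a data frame:

```r
pets <- data.frame(
  name = c("Rex", "Milo"),
  age  = c(3, 5),
  stringsAsFactors = FALSE
)

by_dollar  <- pets$age       # numeric vector
by_bracket <- pets[["age"]]  # the same vector
still_df   <- pets["age"]    # one-column data frame, not a vector

# [[ ]] also works when the column name is stored in a variable.
col_name <- "name"
by_variable <- pets[[col_name]]

print(by_dollar)
print(class(still_df))
print(by_variable)
```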