## Categorical and Text Data

Categorical and text data are often among the messiest parts of a dataset because they are unstructured. In this chapter, you'll learn how to fix whitespace and capitalization inconsistencies in category labels, collapse multiple categories into one, and reformat strings for consistency.

### Not a member
Now that you've practiced identifying membership constraint problems, it's time to fix these problems in a new dataset. Throughout this chapter, you'll be working with a dataset called sfo, containing survey responses from passengers taking flights from San Francisco International Airport (SFO). Participants were asked questions about the airport's cleanliness, wait times, safety, and their overall satisfaction.

A few issues during data collection left some inconsistencies in the dataset. In this exercise, you'll be working with the dest_size column, which categorizes the size of the destination airport that the passengers were flying to. A data frame called dest_sizes is available that contains all the possible destination sizes. Your mission is to find rows with invalid dest_size values and remove them from the data frame.

In [4]:
library(dplyr)

# data
sfo <- readRDS("sfo_survey_ch2_1.rds")
str(sfo)

# build dest_sizes: lookup table of the valid destination sizes
dest_size <- c("Small", "Medium", "Large", "Hub")
passenger_per_day <- c("0-20K", "20K-70K", "70K-100K", "100K+")
dest_sizes <- data.frame(dest_size, passenger_per_day)
print(dest_sizes)

# Count the number of occurrences of dest_size
sfo %>%
  count(dest_size)



'data.frame':	2809 obs. of  12 variables:
 $ id           : int  1842 1844 1840 1837 1833 3010 1838 1845 2097 1846 ...
 $ day          : chr  "Monday" "Monday" "Monday" "Monday" ...
 $ airline      : chr  "TURKISH AIRLINES" "TURKISH AIRLINES" "TURKISH AIRLINES" "TURKISH AIRLINES" ...
 $ destination  : chr  "ISTANBUL" "ISTANBUL" "ISTANBUL" "ISTANBUL" ...
 $ dest_region  : chr  "Middle East" "Middle East" "Middle East" "Middle East" ...
 $ dest_size    : chr  "Hub" "Hub" "Hub" "Hub" ...
 $ boarding_area: chr  "Gates 91-102" "Gates 91-102" "Gates 91-102" "Gates 91-102" ...
 $ dept_time    : chr  "2018-12-31" "2018-12-31" "2018-12-31" "2018-12-31" ...
 $ wait_min     : num  255 315 165 225 175 ...
 $ cleanliness  : chr  "Average" "Somewhat clean" "Average" "Somewhat clean" ...
 $ safety       : chr  "Neutral" "Somewhat safe" "Somewhat safe" "Somewhat safe" ...
 $ satisfaction : chr  "Somewhat satsified" "Somewhat satsified" "Somewhat satsified" "Somewhat satsified" ...
dest_size,passenger_per_day
Small,0-20K
Medium,20K-70K
Large,70K-100K
Hub,100K+

dest_size,n
Small,1
Hub,1
Hub,1756
Large,143
Large,1
Medium,682
Small,225


In [6]:
# Use the correct type of filtering join on the sfo data frame and the
# dest_sizes data frame to get the rows of sfo with invalid dest_size values.

# Find bad dest_size rows
sfo %>% 
  # Join with dest_sizes data frame to get bad dest_size rows
  anti_join(dest_sizes, by = "dest_size") %>%
  # Select id, airline, destination, and dest_size cols
  select(id, airline, destination, dest_size)

id,airline,destination,dest_size
982,LUFTHANSA,MUNICH,Hub
2063,AMERICAN,PHILADELPHIA,Large
777,UNITED INTL,SAN JOSE DEL CABO,Small


In [8]:
# Use the correct filtering join on sfo and dest_sizes to get the rows of sfo
# that have a valid dest_size.

# Remove bad dest_size rows
sfo %>% 
  # Join with dest_sizes
  semi_join(dest_sizes, by = "dest_size") %>%
  # Count the number of each dest_size
  count(dest_size)

dest_size,n
Hub,1756
Large,143
Medium,682
Small,225
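
The two filtering joins used above can be hard to tell apart when the printed output hides whitespace, so here is a minimal sketch on toy data. The `survey` and `valid_sizes` tibbles and their values are hypothetical and only stand in for sfo and dest_sizes; the last line shows what is assumed to be an equivalent base R filter.

```r
library(dplyr)

# Hypothetical toy data: two labels carry stray whitespace
survey <- tibble(
  id = 1:4,
  dest_size = c("Hub", "  Hub", "Large", "Small  ")
)
valid_sizes <- tibble(dest_size = c("Small", "Medium", "Large", "Hub"))

# anti_join keeps the rows of survey with NO match in valid_sizes
invalid_rows <- survey %>%
  anti_join(valid_sizes, by = "dest_size")

# semi_join keeps the rows of survey WITH a match in valid_sizes
valid_rows <- survey %>%
  semi_join(valid_sizes, by = "dest_size")

# Base R equivalent of the semi join, using %in%
valid_base <- survey[survey$dest_size %in% valid_sizes$dest_size, ]
```

Neither join adds columns from the lookup table; both only filter rows, which is why they suit membership checks.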


### Identifying inconsistency
In the video exercise, you learned about different kinds of inconsistencies that can occur within categories, making it look like a variable has more categories than it should.

In this exercise, you'll continue working with the sfo dataset. You'll examine the dest_size column again, as well as the cleanliness column, and determine what kind of issues, if any, these two categorical variables face.

In [9]:
# Count the number of occurrences of each category of the dest_size variable of sfo.

# Count dest_size
sfo %>%
  count(dest_size)

dest_size,n
Small,1
Hub,1
Hub,1756
Large,143
Large,1
Medium,682
Small,225


In [11]:
# Count cleanliness
sfo %>%
  count(cleanliness)

cleanliness,n
Average,433
Clean,970
Dirty,2
Somewhat clean,1254
Somewhat dirty,30
,120
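
Duplicate-looking labels in a count table usually point to invisible differences such as padding or letter case. One possible way to surface them is sketched below on hypothetical toy vectors (`labels` and `cleanliness_raw` are illustrative, not columns taken from sfo): quote each label so the padding becomes visible, and compare the number of distinct values before and after lowercasing.

```r
library(dplyr)
library(stringr)

# Hypothetical labels: some padded with whitespace
labels <- tibble(dest_size = c("Hub", "  Hub", "Large", "Large  ", "Small"))

flagged <- labels %>%
  count(dest_size) %>%
  mutate(
    # Wrap each label in quotes so leading/trailing spaces are visible
    quoted = str_c("'", dest_size, "'"),
    # TRUE when trimming changes the label, i.e. it carries padding
    has_padding = dest_size != str_trim(dest_size)
  )

# Hypothetical labels: same category written with different capitalization
cleanliness_raw <- c("Clean", "clean", "CLEAN", "Dirty")

# Fewer distinct values after lowercasing signals a capitalization problem
has_case_issue <- n_distinct(str_to_lower(cleanliness_raw)) <
  n_distinct(cleanliness_raw)
```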


### Correcting inconsistency
Now that you've identified that dest_size has whitespace inconsistencies and cleanliness has capitalization inconsistencies, you'll use the new tools at your disposal to fix the inconsistent values in sfo instead of removing those data points entirely. Dropping rows could add bias to your dataset if more than 5% of the data points would need to be removed.

In [13]:
library(stringr)

# Add new columns to sfo
sfo <- sfo %>%
  # dest_size_trimmed: dest_size without whitespace
  mutate(dest_size_trimmed = str_trim(dest_size),
         # cleanliness_lower: cleanliness converted to lowercase
         cleanliness_lower = str_to_lower(cleanliness))

# Count values of dest_size_trimmed
sfo %>%
  count(dest_size_trimmed)

# Count values of cleanliness_lower
sfo %>%
  count(cleanliness_lower)

dest_size_trimmed,n
Hub,1757
Large,144
Medium,682
Small,226


cleanliness_lower,n
average,433
clean,970
dirty,2
somewhat clean,1254
somewhat dirty,30
,120
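
The 5% rule of thumb mentioned above can be checked with simple arithmetic. The sketch below uses the figures reported earlier in this chapter (2809 rows in sfo, 3 rows returned by the anti join); the variable names are illustrative only.

```r
# Figures taken from the outputs above
n_total <- 2809   # rows reported by str(sfo)
n_invalid <- 3    # rows returned by the anti_join on dest_size

# Share of the data that would be lost by dropping the invalid rows
share_dropped <- n_invalid / n_total

# Compare against the 5% rule of thumb
below_threshold <- share_dropped < 0.05
```

Here the share is far below 5%, so dropping the rows would likely be harmless, but trimming the labels keeps every response and avoids the question altogether.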


### Collapsing categories
One of the tablets on which participants filled out the SFO survey was not properly configured, so the response for dest_region could be entered as free text instead of being chosen from a dropdown menu. This resulted in some inconsistencies in the dest_region variable that you'll need to correct in this exercise to ensure that the numbers you report to your boss are as accurate as possible.

In [14]:
library(forcats)

# Count categories of dest_region
sfo %>%
  count(dest_region)

# Categories to map to Europe
europe_categories <- c("eur", "Europ", "EU")

# Add a new col dest_region_collapsed
sfo %>%
  # Map all categories in europe_categories to Europe
  mutate(dest_region_collapsed = fct_collapse(dest_region, 
                                     Europe = europe_categories)) %>%
  # Count categories of dest_region_collapsed
  count(dest_region_collapsed)

dest_region,n
Asia,260
Australia/New Zealand,66
Canada/Mexico,220
Central/South America,29
East US,498
Europe,401
Middle East,79
Midwest US,281
West US,975


Problem with `mutate()` input `dest_region_collapsed`.
i Unknown levels in `f`: eur, Europ, EU

dest_region_collapsed,n
Asia,260
Australia/New Zealand,66
Canada/Mexico,220
Central/South America,29
East US,498
Europe,401
Middle East,79
Midwest US,281
West US,975
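
In the output above, dest_region already contains only the label Europe, so fct_collapse() reports the three variants as unknown levels and the counts come back unchanged. To show what the collapse does when the variants are present, here is a minimal sketch on a hypothetical vector (`regions` is made up for illustration and is not drawn from sfo).

```r
library(forcats)

# Hypothetical free-text responses with several spellings of Europe
regions <- factor(c("Europe", "eur", "Europ", "EU", "Asia", "Europe"))

# Categories to map to Europe
europe_categories <- c("eur", "Europ", "EU")

# Check which variants actually occur before collapsing
present <- intersect(europe_categories, levels(regions))

# Map every variant onto the single level "Europe"
collapsed <- fct_collapse(regions, Europe = europe_categories)

# Count the collapsed categories
collapsed_counts <- table(collapsed)
```

Checking `present` first is one way to see in advance whether the collapse will change anything.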
