# Data cleaning notebook

This notebook contains the examples from chapter 2 of the book. Let us start by reading the data into a data frame named `bike`.

In [1]:
bike <- read.csv("https://raw.githubusercontent.com/jgendron/com.packtpub.intro.r.bi/master/Chapter2-DataCleaning/data/Ch2_raw_bikeshare_data.csv", stringsAsFactors = FALSE)

## 1 Summarizing your data for inspection

We will start by taking a first look at the data.

In [2]:
str(bike)

'data.frame':	17379 obs. of  13 variables:
 $ datetime  : chr  "1/1/2011 0:00" "1/1/2011 1:00" "1/1/2011 2:00" "1/1/2011 3:00" ...
 $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
 $ weather   : int  1 1 1 1 1 2 1 1 1 1 ...
 $ temp      : num  9.84 9.02 9.02 9.84 9.84 ...
 $ atemp     : num  14.4 13.6 13.6 14.4 14.4 ...
 $ humidity  : chr  "81" "80" "80" "75" ...
 $ windspeed : num  0 0 0 0 0 ...
 $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
 $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
 $ count     : int  16 40 32 13 1 1 2 3 8 14 ...
 $ sources   : chr  "ad campaign" "www.yahoo.com" "www.google.fi" "AD campaign" ...


Two problems are already clear: the `datetime` column is stored as character strings rather than in a proper R date-time format, and the `humidity` variable is also character, although it looks like it should be numeric or integer.
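
A compact way to list the type of every column is `sapply()` combined with `class()`. The sketch below uses a small made-up data frame (`toy` and its values are illustrative only, not part of the bike data):

```r
# Toy data frame (made up) mimicking the two problem columns noted above
toy <- data.frame(
  datetime = c("1/1/2011 0:00", "1/1/2011 1:00"),
  humidity = c("81", "x61"),
  count    = c(16L, 40L),
  stringsAsFactors = FALSE
)

# One class name per column
col_types <- sapply(toy, class)
print(col_types)

# Names of the columns stored as character strings
chr_cols <- names(col_types)[col_types == "character"]
print(chr_cols)
```

Applied to `bike`, the same two calls would point at the character columns without reading through the full `str` output.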

In [3]:
dim(bike)
head(bike)
tail(bike)

datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,sources
1/1/2011 0:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16,ad campaign
1/1/2011 1:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40,www.yahoo.com
1/1/2011 2:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32,www.google.fi
1/1/2011 3:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13,AD campaign
1/1/2011 4:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1,Twitter
1/1/2011 5:00,1,0,0,2,9.84,12.88,75,6.0032,0,1,1,www.bing.com


Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,sources
17374,12/31/2012 18:00,1,0,1,2,10.66,13.635,48,8.9981,10,326,336,facebook page
17375,12/31/2012 19:00,1,0,1,2,10.66,12.88,60,11.0014,5,206,211,
17376,12/31/2012 20:00,1,0,1,2,10.66,12.88,60,11.0014,4,140,144,AD campaign
17377,12/31/2012 21:00,1,0,1,1,10.66,12.88,60,11.0014,3,96,99,AD campaign
17378,12/31/2012 22:00,1,0,1,1,10.66,13.635,56,8.9981,4,90,94,ad campaign
17379,12/31/2012 23:00,1,0,1,1,10.66,13.635,65,8.9981,3,50,53,direct


## 2 Finding and fixing flawed data

We will now try to find and fix errors in the data. Note that this is detective work: there is not always one answer, or one method that finds and fixes all errors.

### Missing values

Missing values are a common problem that occurs in most data sets. A value is missing when a cell in a tidy data frame has no value. Sometimes a specific value is used to indicate that the value is missing; in R this is represented by `NA`. In the `tail` output above you can see such a case. First, let us see how to count the number of `NA` values in a data frame:

In [5]:
table(is.na(bike))


 FALSE   TRUE 
225373    554 

This shows us that our data frame `bike` contains 554 `NA` values, while 225373 values are not missing. The following code can help us see whether all variables contain `NA`:

In [6]:
library(stringr)
str_detect(bike, "NA")

“argument is not an atomic vector; coercing”

The result shows that only the last column contains `NA` values (the warning appears because `str_detect` is given a whole data frame rather than a vector). Another way to confirm this is:

In [7]:
table(is.na(bike$sources))


FALSE  TRUE 
16825   554 

We will fix this error later.
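
A per-column count gives the same information in one step and avoids the coercion warning. This is a minimal sketch on a made-up data frame (`toy` is illustrative), using only base R:

```r
# Toy data frame (made up) with missing values in one column only
toy <- data.frame(
  count   = c(16L, 40L, 32L),
  sources = c("Twitter", NA, NA),
  stringsAsFactors = FALSE
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Columns that contain at least one NA
cols_with_na <- names(na_per_column)[na_per_column > 0]
print(cols_with_na)
```

Running `colSums(is.na(bike))` on the real data should report all 554 missing values under `sources`, in line with the table above.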

### Dealing with missing values

This is a huge topic on its own...

### Erroneous values

We now return to the issue with the `humidity` attribute. We will start by searching for letters in the column:

In [8]:
bad_data <- str_subset(bike$humidity, "[a-zA-Z]")
bad_data

From this we can see that the value `x61` appears somewhere in the `humidity` column. This is clearly an error. There is not always a clear answer or solution, but in this case it seems that someone simply mistyped `61` as `x61`. Let us find the location of this error:

In [11]:
location <- str_detect(bike$humidity, bad_data)
bike[location, ]

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,sources
14177,8/18/2012 21:00,3,0,0,1,27.06,31.06,x61,0,90,248,338,www.bing.com


Note that the `str_detect` function gives us a vector of `TRUE` and `FALSE` values that is `TRUE` only for the row containing the error. We can use this vector to subset the `bike` data frame and see that row.
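
The idea of subsetting with a logical vector can be sketched on a tiny made-up vector. Here `grepl()` is used as the base R counterpart of `str_detect()`, and the values in `humidity` are illustrative only:

```r
# Made-up humidity values; "x61" plays the role of the mistyped entry
humidity <- c("81", "80", "x61", "75")

# grepl() returns one TRUE/FALSE per element, like str_detect()
location <- grepl("[[:alpha:]]", humidity)
print(location)

# which() turns the logical vector into positions
print(which(location))

# Logical subsetting keeps only the elements where location is TRUE
print(humidity[location])
```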

We can now replace the erroneous value and check that the fix worked:

In [12]:
bike$humidity <- str_replace_all(bike$humidity, "x61", "61")
bike[location, ]

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,sources
14177,8/18/2012 21:00,3,0,0,1,27.06,31.06,61,0,90,248,338,www.bing.com


## 3 Converting inputs to data types suitable for analysis

We will now convert the columns that are in the wrong format to a proper data type. There is not always one right data type for your data; it depends on what you want to do with it later. Sometimes it is preferable to keep a column's values as character strings, and other times it is preferable to have them as factors. However, in most cases it makes sense to turn dates into a proper date format instead of character strings.

We will start by turning the `humidity` column into a numeric column, now that we have fixed the issue with the non-numeric value:

In [13]:
bike$humidity <- as.numeric(bike$humidity)
# bike <- mutate(bike, humidity = as.numeric(humidity))
str(bike)

'data.frame':	17379 obs. of  13 variables:
 $ datetime  : chr  "1/1/2011 0:00" "1/1/2011 1:00" "1/1/2011 2:00" "1/1/2011 3:00" ...
 $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
 $ weather   : int  1 1 1 1 1 2 1 1 1 1 ...
 $ temp      : num  9.84 9.02 9.02 9.84 9.84 ...
 $ atemp     : num  14.4 13.6 13.6 14.4 14.4 ...
 $ humidity  : num  81 80 80 75 75 75 80 86 75 76 ...
 $ windspeed : num  0 0 0 0 0 ...
 $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
 $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
 $ count     : int  16 40 32 13 1 1 2 3 8 14 ...
 $ sources   : chr  "ad campaign" "www.yahoo.com" "www.google.fi" "AD campaign" ...


Next we will look at the `holiday` and `workingday` variables, which are stored as integers, although it is more natural to have them as factors. Here is how to fix that:

In [14]:
bike$holiday <- factor(bike$holiday, levels = c(0, 1), labels = c("no", "yes"))
bike$workingday <- factor(bike$workingday, levels = c(0, 1), labels = c("no", "yes"))
str(bike)

'data.frame':	17379 obs. of  13 variables:
 $ datetime  : chr  "1/1/2011 0:00" "1/1/2011 1:00" "1/1/2011 2:00" "1/1/2011 3:00" ...
 $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ holiday   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ workingday: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ weather   : int  1 1 1 1 1 2 1 1 1 1 ...
 $ temp      : num  9.84 9.02 9.02 9.84 9.84 ...
 $ atemp     : num  14.4 13.6 13.6 14.4 14.4 ...
 $ humidity  : num  81 80 80 75 75 75 80 86 75 76 ...
 $ windspeed : num  0 0 0 0 0 ...
 $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
 $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
 $ count     : int  16 40 32 13 1 1 2 3 8 14 ...
 $ sources   : chr  "ad campaign" "www.yahoo.com" "www.google.fi" "AD campaign" ...


Note that this code turned `0` into "no" and `1` into "yes". To make this decision we of course need to know that the mapping is correct, that is, we need to know something about the data set. This kind of information can often be found in a data dictionary, if the data set has one.

In a similar manner we can turn the `season` and `weather` columns into factors, which seems to be the right thing to do in this case:
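
One way to double-check such a relabelling is to cross-tabulate the raw codes against the new labels. The sketch below does this on a made-up vector (`holiday_raw` is illustrative and stands in for the original column):

```r
# Made-up 0/1 codes standing in for the holiday column
holiday_raw <- c(0L, 0L, 1L, 0L, 1L)

holiday <- factor(holiday_raw, levels = c(0, 1), labels = c("no", "yes"))

# Cross-tabulate codes against labels: each code should map to exactly one label
mapping <- table(code = holiday_raw, label = holiday)
print(mapping)
```

If every row of the table has a single non-zero cell, each code went to exactly one label. Whether that label is the right one still has to come from the data dictionary.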

In [17]:
bike$season <- factor(bike$season, levels = c(1, 2, 3, 4), labels = c("spring", "summer", "fall", "winter"), ordered = TRUE)
bike$weather <- factor(bike$weather, levels = c(1, 2, 3, 4), labels = c("clr_part_cloud", "mist_cloudy", "lt_rain_snow", "hvy_rain_snow"), ordered = TRUE)
str(bike)

datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,sources
2011-01-01 00:00:00,spring,no,no,clr_part_cloud,9.84,14.395,81,0.0000,3,13,16,ad campaign
2011-01-01 01:00:00,spring,no,no,clr_part_cloud,9.02,13.635,80,0.0000,8,32,40,www.yahoo.com
2011-01-01 02:00:00,spring,no,no,clr_part_cloud,9.02,13.635,80,0.0000,5,27,32,www.google.fi
2011-01-01 03:00:00,spring,no,no,clr_part_cloud,9.84,14.395,75,0.0000,3,10,13,AD campaign
2011-01-01 04:00:00,spring,no,no,clr_part_cloud,9.84,14.395,75,0.0000,0,1,1,Twitter
2011-01-01 05:00:00,spring,no,no,mist_cloudy,9.84,12.880,75,6.0032,0,1,1,www.bing.com
2011-01-01 06:00:00,spring,no,no,clr_part_cloud,9.02,13.635,80,0.0000,2,0,2,ad campaign
2011-01-01 07:00:00,spring,no,no,clr_part_cloud,8.20,12.880,86,0.0000,1,2,3,www.yahoo.com
2011-01-01 08:00:00,spring,no,no,clr_part_cloud,9.84,14.395,75,0.0000,1,7,8,www.yahoo.com
2011-01-01 09:00:00,spring,no,no,clr_part_cloud,13.12,17.425,76,0.0000,8,6,14,www.bing.com


**NOTE:** It is not always a good idea to turn character strings into factors. In fact, I would advise against doing it at this stage. You can always do it later if you realize it will make something easier for you or if it is required by other functions you want to use.
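
One reason for caution, sketched here on a made-up vector: once numeric-looking strings have become a factor, `as.numeric()` returns the internal level codes rather than the original numbers.

```r
# Made-up numeric-looking strings
humidity_chr <- c("81", "80", "75")

# As character strings, conversion gives the numbers we expect
print(as.numeric(humidity_chr))

# As a factor, as.numeric() returns level codes (levels sort as "75", "80", "81")
humidity_fct <- factor(humidity_chr)
codes <- as.numeric(humidity_fct)
print(codes)

# Going through as.character() first recovers the original numbers
recovered <- as.numeric(as.character(humidity_fct))
print(recovered)
```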

We will now finally fix the date format issue with the `datetime` column. To do this we first need to understand what format the column is in. The `str` output above indicates that it follows the pattern "month/day/year hour:minute", for example `1/1/2011 0:00`. The "lubridate" package is very convenient for working with dates and times. It even has a function for this particular format, `mdy_hm`, whose name hopefully gives away why it is useful in our case! So let us use it to transform the `datetime` column into a proper format:

In [16]:
library(lubridate)
bike$datetime <- mdy_hm(bike$datetime)
str(bike)


Attaching package: ‘lubridate’

The following object is masked from ‘package:base’:

    date



'data.frame':	17379 obs. of  13 variables:
 $ datetime  : POSIXct, format: "2011-01-01 00:00:00" "2011-01-01 01:00:00" ...
 $ season    : Ord.factor w/ 4 levels "spring"<"summer"<..: 1 1 1 1 1 1 1 1 1 1 ...
 $ holiday   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ workingday: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ weather   : Ord.factor w/ 4 levels "clr_part_cloud"<..: 1 1 1 1 1 2 1 1 1 1 ...
 $ temp      : num  9.84 9.02 9.02 9.84 9.84 ...
 $ atemp     : num  14.4 13.6 13.6 14.4 14.4 ...
 $ humidity  : num  81 80 80 75 75 75 80 86 75 76 ...
 $ windspeed : num  0 0 0 0 0 ...
 $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
 $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
 $ count     : int  16 40 32 13 1 1 2 3 8 14 ...
 $ sources   : chr  "ad campaign" "www.yahoo.com" "www.google.fi" "AD campaign" ...


Note that the date-time format the function turned the column into is called "POSIXct". You can look it up if you want, but we will not go into any more detail for now.
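
As a brief, optional illustration of what a POSIXct value is, the sketch below parses a single made-up timestamp with base R's `as.POSIXct()` instead of lubridate. The time zone is assumed to be UTC, which is also the default of `mdy_hm`.

```r
# Parse one timestamp in the "month/day/year hour:minute" layout (UTC assumed)
x <- as.POSIXct("1/1/2011 5:00", format = "%m/%d/%Y %H:%M", tz = "UTC")
print(x)

# Underneath, POSIXct is a number: seconds since 1970-01-01 00:00:00 UTC
secs <- as.numeric(x)
print(secs)

# Components can be pulled out with format()
hour_of_day <- as.integer(format(x, "%H"))
print(hour_of_day)

# Date-times support arithmetic: adding 3600 seconds moves one hour forward
print(x + 3600)
```

Because the value is numeric underneath, sorting, differencing and grouping by hour or weekday all work directly on the column.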

## 4 Adapting string variables to a standard

Finally, we will look at adapting string variables to a standard. The issue here is that string values are sometimes hard to work with and may contain more information than needed. In such cases, with some manipulation, we can turn the variable into a factor with fewer values that are still meaningful to us. This is the case for the `sources` column, which we will now take a closer look at. First we look at all the unique values this column takes:

In [18]:
unique(bike$sources)

There is some obvious cleaning we can do here! First of all, there are two values for Twitter that should be merged into one, and there are three values for ad campaign that should probably also be merged. Moreover, we might want to replace `NA` with "unknown" in this case. The "stringr" package can again help us, and we can solve these issues in the following way:

In [19]:
bike$sources <- tolower(bike$sources)
bike$sources <- str_trim(bike$sources)
na_loc <- is.na(bike$sources)
bike$sources[na_loc] <- "unknown"
unique(bike$sources)

This is much better, but we might also want to group all web pages into one category, that is, all sources that start with "www.". We can do this using the DataCombine package in the following way:

In [20]:
install.packages("DataCombine")
library(DataCombine)
web_sites <- "(www\\.[a-z]*\\.[a-z]*)"  # dots escaped so they match a literal "."
current <- unique(str_subset(bike$sources, web_sites))
replace <- rep("web", length(current))
replacements <- data.frame(from = current, to = replace)
bike <- FindReplace(data = bike, Var = "sources", replacements, from = "from", to = "to", exact = FALSE)
bike$sources <- as.factor(bike$sources)
unique(bike$sources)

Installing package into ‘/home/nbuser/R’
(as ‘lib’ is unspecified)
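
The same grouping can also be sketched without an extra package. This is an alternative to the `FindReplace` call above, not what the notebook ran. It uses base R's `sub()` on a made-up vector (`sources` here is illustrative only):

```r
# Made-up source labels after lower-casing and trimming
sources <- c("ad campaign", "www.yahoo.com", "www.google.fi", "twitter", "unknown")

# Replace any value that starts with "www." by the single label "web";
# the dot is escaped so it matches a literal "."
grouped <- sub("^www\\..*$", "web", sources)
print(grouped)

# The resulting factor has one level per remaining category
print(levels(factor(grouped)))
```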


We now have a nice and clean dataset in the right format, which we will use for further analysis in the later lectures. Have a look at it:

In [21]:
str(bike)
bike

'data.frame':	17379 obs. of  13 variables:
 $ datetime  : POSIXct, format: "2011-01-01 00:00:00" "2011-01-01 01:00:00" ...
 $ season    : Ord.factor w/ 4 levels "spring"<"summer"<..: 1 1 1 1 1 1 1 1 1 1 ...
 $ holiday   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ workingday: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ weather   : Ord.factor w/ 4 levels "clr_part_cloud"<..: 1 1 1 1 1 2 1 1 1 1 ...
 $ temp      : num  9.84 9.02 9.02 9.84 9.84 ...
 $ atemp     : num  14.4 13.6 13.6 14.4 14.4 ...
 $ humidity  : num  81 80 80 75 75 75 80 86 75 76 ...
 $ windspeed : num  0 0 0 0 0 ...
 $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
 $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
 $ count     : int  16 40 32 13 1 1 2 3 8 14 ...
 $ sources   : Factor w/ 7 levels "ad campaign",..: 1 7 7 1 5 7 1 7 7 7 ...


datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,sources
2011-01-01 00:00:00,spring,no,no,clr_part_cloud,9.84,14.395,81,0.0000,3,13,16,ad campaign
2011-01-01 01:00:00,spring,no,no,clr_part_cloud,9.02,13.635,80,0.0000,8,32,40,web
2011-01-01 02:00:00,spring,no,no,clr_part_cloud,9.02,13.635,80,0.0000,5,27,32,web
2011-01-01 03:00:00,spring,no,no,clr_part_cloud,9.84,14.395,75,0.0000,3,10,13,ad campaign
2011-01-01 04:00:00,spring,no,no,clr_part_cloud,9.84,14.395,75,0.0000,0,1,1,twitter
2011-01-01 05:00:00,spring,no,no,mist_cloudy,9.84,12.880,75,6.0032,0,1,1,web
2011-01-01 06:00:00,spring,no,no,clr_part_cloud,9.02,13.635,80,0.0000,2,0,2,ad campaign
2011-01-01 07:00:00,spring,no,no,clr_part_cloud,8.20,12.880,86,0.0000,1,2,3,web
2011-01-01 08:00:00,spring,no,no,clr_part_cloud,9.84,14.395,75,0.0000,1,7,8,web
2011-01-01 09:00:00,spring,no,no,clr_part_cloud,13.12,17.425,76,0.0000,8,6,14,web
