# Text and Regex
___

General principles for text in datasets (variable names in particular):
* Use all lower case when possible.
* Be descriptive.
* Avoid duplicates; each name should be distinct.
* Avoid underscores, dots, and white space.
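
As a minimal sketch of how these principles might be applied with base R, the snippet below cleans a few column names. The names in `raw.names` are hypothetical and invented for illustration; they are not taken from any of the datasets used later.

```r
# Hypothetical raw column names, invented for illustration
raw.names <- c("Location.1", "cross_Street", "Street Name", "direction")

# Lower-case everything, then strip underscores, dots, and white space
clean.names <- gsub("[_.[:space:]]", "", tolower(raw.names))
clean.names

# anyDuplicated() returns 0 when every cleaned name is distinct
anyDuplicated(clean.names) == 0
```

Inside a character class the dot is literal, so it needs no escaping here.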

Useful links:
* [A friendly and simple explanation of RegEx.](https://www.lunametrics.com/regex-book/Regular-Expressions-Google-Analytics.pdf)
* [A cheeky cheat-sheet.](http://www.rexegg.com/regex-quickstart.html)

## Data gathering
The Baltimore City data website no longer seems to accept programmatic download requests, so the dataset must be downloaded manually from its page. Download the dataset [here](https://data.baltimorecity.gov/Transportation/Baltimore-Fixed-Speed-Cameras/dz54-2aru) and save it as `data/Baltimore_Fixed_Speed_Cameras.csv`.

In [3]:
camera.data <- read.csv("data/Baltimore_Fixed_Speed_Cameras.csv")

In [17]:
file.URL1 <- "https://raw.githubusercontent.com/jtleek/dataanalysis/master/week2/007summarizingData/data/reviews.csv"
file.URL2 <- "https://raw.githubusercontent.com/jtleek/dataanalysis/master/week2/007summarizingData/data/solutions.csv"

if (!file.exists("data/reviews.csv")) {
    download.file(file.URL1, "data/reviews.csv", method="curl", extra = "-L")
}

if (!file.exists("data/solutions.csv")) {
    download.file(file.URL2, "data/solutions.csv", method="curl", extra = "-L")
}

reviews <- read.csv("data/reviews.csv")
solutions <- read.csv("data/solutions.csv")

## Extracting Names

In [4]:
names(camera.data)

In [5]:
str(camera.data)

'data.frame':	80 obs. of  6 variables:
 $ address     : Factor w/ 71 levels "E 33RD ST & THE ALAMEDA",..: 49 49 70 57 1 14 14 31 5 35 ...
 $ direction   : Factor w/ 4 levels "E/B","N/B","S/B",..: 2 3 1 3 1 1 4 3 4 1 ...
 $ street      : Factor w/ 61 levels "\nPulaski Hwy \n",..: 2 2 60 56 6 11 11 3 30 38 ...
 $ crossStreet : Factor w/ 66 levels "33rd St","4th St",..: 6 6 49 1 58 40 40 36 7 38 ...
 $ intersection: Factor w/ 74 levels "\nPulaski Hwy \n & Moravia Park Drive",..: 3 3 73 69 8 14 14 4 36 48 ...
 $ Location.1  : Factor w/ 76 levels "(39.1999130165, -76.5559766825)",..: 7 6 8 49 48 35 36 74 32 29 ...


In [6]:
tolower(names(camera.data))

## Extract first element from a list

In [9]:
splitNames <- strsplit(names(camera.data), "\\.")

In [10]:
splitNames

In [15]:
firstElement <- function(x){x[1]}

In [16]:
sapply(splitNames, firstElement)
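
The split pattern used above is `"\\."` rather than `"."` because `strsplit` treats its pattern as a regular expression, in which an unescaped dot matches any character. A small sketch of the difference, using the column name `Location.1` from the `str` output above:

```r
x <- "Location.1"

# An unescaped "." matches any character, so every character acts as a
# delimiter and only empty strings are left
strsplit(x, ".")[[1]]

# Escaping the dot makes it match a literal period
strsplit(x, "\\.")[[1]]

# fixed = TRUE disables regex matching, so no escaping is needed
strsplit(x, ".", fixed = TRUE)[[1]]
```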

## Remove characters from strings

Use `sub` or `gsub` to remove or replace characters within strings. `sub` replaces only the first match in each string; `gsub` replaces every match.

In [18]:
names(reviews)

In [20]:
sub("_", "", names(reviews))
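
The difference between `sub` and `gsub` only shows up when a string contains more than one match. A short sketch, using a hypothetical string invented for illustration:

```r
# Hypothetical name containing several underscores
x <- "this_is_a_test"

sub("_", "", x)    # "thisis_a_test": only the first underscore is removed
gsub("_", "", x)   # "thisisatest": every underscore is removed
```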

## Finding values
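`grep` and `grepl` search a character vector for a pattern. `grep` returns the indices of the matching elements (or the elements themselves with `value=TRUE`); `grepl` returns a logical vector with one entry per element.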

In [None]:
?grep

In [22]:
grep("Alameda", camera.data$intersection)

In [29]:
grep("Alameda", camera.data$intersection, value=TRUE)

In [30]:
grep("GoyderTown", camera.data$intersection)

In [31]:
length(grep("GoyderTown", camera.data$intersection))

In [23]:
table(grepl("Alameda", camera.data$intersection))


FALSE  TRUE 
   77     3 

To keep every row except those matching "Alameda", negate the `grepl` result.

In [25]:
camera.data.2 <- camera.data[!grepl("Alameda", camera.data$intersection), ]
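
The same ideas can be tried on a small vector without loading the camera data. The sketch below uses hypothetical intersection strings, invented for illustration, and also shows the `ignore.case` and `invert` arguments of base R's `grep` family:

```r
# Hypothetical intersections, invented for illustration
streets <- c("The Alameda & 33rd St",
             "Pulaski Hwy & Moravia Park Drive",
             "the alameda & Belvedere Ave")

grep("Alameda", streets)                        # index of the match: 1
grepl("Alameda", streets)                       # TRUE FALSE FALSE
grepl("alameda", streets, ignore.case = TRUE)   # TRUE FALSE TRUE

# Two equivalent ways to keep only the non-matching elements
streets[!grepl("Alameda", streets)]
grep("Alameda", streets, invert = TRUE, value = TRUE)
```

Matching is case-sensitive by default, which is why the third element only matches once `ignore.case = TRUE` is set.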

## stringr package
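The `stringr` package contains a wealth of string manipulation functions. Of the functions shown below, only `str_trim` comes from `stringr`; `nchar`, `substr`, `paste`, and `paste0` are base R functions and work without loading the package.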

In [32]:
library(stringr)

In [34]:
nchar("Hello, world")

In [36]:
substr("Joshua Goyder", 1, 6)

In [37]:
paste("Joshua", "Goyder")

In [38]:
paste("Joshua", "Goyder", sep = " is truly amazing and unrelated to the nice man known only as ")

In [39]:
paste0("what", "what")

In [40]:
str_trim("    Whaaaaaat        ")
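
If `stringr` is not available, base R's `trimws` offers a comparable way to strip surrounding white space. A minimal sketch, reusing the string from the cell above:

```r
x <- "    Whaaaaaat        "

trimws(x)                    # "Whaaaaaat"
trimws(x, which = "left")    # leading white space removed only
trimws(x, which = "right")   # trailing white space removed only
```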