# PC Session 1

**Author:**
[Helge Liebert](https://hliebert.github.io/)

# Reading the PDF files' content as data

### Libraries

In [None]:
library("stringr")
library("readr")

### Extract source files

You will need to extract the source files to a folder. Pick the character encoding that fits your operating system.

In [None]:
## use txt-utf-8.zip if you are on macOS or Linux
## unzip("txt-utf-8.zip")

In [None]:
## use txt-latin-1.zip if you are on Windows
## unzip("txt-latin-1.zip")
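
The choice between the two archives can also be made programmatically. This is a minimal sketch, assuming both archives sit in the working directory; `zip_file` is a name introduced here for illustration.

```r
## pick the archive by operating system;
## .Platform$OS.type is either "windows" or "unix" (macOS and Linux)
zip_file <- if (.Platform$OS.type == "windows") "txt-latin-1.zip" else "txt-utf-8.zip"

## extract only if the archive is actually present
if (file.exists(zip_file)) {
  unzip(zip_file)
} else {
  message("Archive not found: ", zip_file)
}
```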

### Get file names, with and without paths

In [None]:
## pattern is a regular expression, not a shell glob
files <- list.files(path = "txt/", pattern = "\\.txt$", full.names = TRUE)
head(files)

In [None]:
## pattern is a regular expression, not a shell glob
names <- list.files(path = "txt/", pattern = "\\.txt$")
head(names)
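
Listing the directory twice is not strictly necessary, because `basename()` strips the directory part from a full path. Here is a small sketch on a throwaway directory; the directory and file names are made up for illustration.

```r
## create a temporary directory with two empty example files
demo_dir <- file.path(tempdir(), "txt-demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("A - 2001 - One.txt", "B - 2002 - Two.txt")))

## full paths first, then the bare file names derived from them
demo_files <- list.files(path = demo_dir, pattern = "\\.txt$", full.names = TRUE)
demo_names <- basename(demo_files)
demo_names
```

Deriving the names from the paths guarantees that both vectors have the same length and order.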

### Read files

In [None]:
## read only the first 5000 characters, to save memory
content <- lapply(files, function(f) readChar(f, nchars = 5000))

## read all
## content <- lapply(files, readr::read_file)
## content <- lapply(files, function(f) readChar(f, nchars = file.info(f)$size))
                  
head(content)

### Read as data

In [None]:
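## content is a list of strings; unlist it so that both columns are plain character vectors
data <- data.frame(names = names, content = unlist(content), stringsAsFactors = FALSE)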
head(data)

### Extract more info from file name

In [None]:
## regex to get author names
data$names <- gsub("\\.txt$", "", data$names)
data$author <- gsub(" - .*$", "", data$names)
head(data)

In [None]:
## cleaner, no false positives (check first obs)
data$author <- str_extract(data$names, "^.*?( - )")
data$author <- gsub(" - ", "", data$author)
head(data)

In [None]:
## same for year
data$year <- str_extract(data$names, " - (20|19)[0-9][0-9] - ")
data$year <- gsub(" - ", "", data$year)
head(data)

In [None]:
## same for title
#data$title <- str_extract(data$names, " - .*?$") ## not good, title may contain hyphen
data$title <- str_extract(data$names, " - (20|19)[0-9][0-9] - .*$")
data$title <- gsub("^ - (20|19)[0-9][0-9] - ", "", data$title)
head(data)

In [None]:
## trim whitespace everywhere
data$author <- trimws(data$author)
data$year <- trimws(data$year)
data$title <- trimws(data$title)
head(data)
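
The three extraction steps above can also be sketched as a single regular expression with capture groups, using `str_match()` from `stringr`. The file names below are hypothetical and only assume the `Author - Year - Title` naming scheme; `demo` and `parts` are illustrative names.

```r
library("stringr")

## hypothetical file names following the "Author - Year - Title" scheme
demo <- c("Smith - 2010 - A title - with a hyphen",
          "Doe and Roe - 1999 - Another title")

## one regex, three capture groups: author (lazy), four-digit year, title (the rest)
parts <- str_match(demo, "^(.*?) - ((?:19|20)[0-9]{2}) - (.*)$")
colnames(parts) <- c("match", "author", "year", "title")
parts[, c("author", "year", "title")]
```

Because the year anchors the split, a hyphen inside the title does not cut the title short.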

### Filter/clean content

In [None]:
## remove supplementary material
data <- data[!grepl("^Supplemental", data$content), ]

In [None]:
## check initial content metadata
data$content[5]

In [None]:
## remove JSTOR metadata page
data$content <- gsub("^.* are collaborating with JSTOR to digitize.*?\\.", "", data$content)
data$content[5]
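
To see what such a pattern removes, it can help to try it on a short made-up string first. The sketch below uses `perl = TRUE` with the `(?s)` flag so that `.` is guaranteed to match line breaks. That is a variant chosen for this illustration, not the call used above, and the text is invented.

```r
## invented stand-in for a JSTOR cover page followed by the article text
toy <- paste0("Published by: Some Press\n",
              "Some Press and Some Society are collaborating with JSTOR to digitize, ",
              "preserve and extend access to Some Journal.\n",
              "The actual article text starts here.")

## the greedy .* runs up to the JSTOR sentence, the lazy .*? stops at its first period
cleaned <- gsub("(?s)^.* are collaborating with JSTOR to digitize.*?\\.", "", toy, perl = TRUE)
trimws(cleaned)
```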

In [None]:
## More
## ...

# Read single text file and transform it to a data frame

In [None]:
jobs <- read_file("example-unix.txt")
## jobs <- read_file("example.txt")

In [None]:
## TASK: Create a data frame with ids in one column and job ad text in another
## ...

In [None]:
ids <- str_extract_all(jobs, "32[0-9]{12}")
ids <- unlist(ids)
ids

In [None]:
## split at the ids; the first piece is the text before the first id, so drop it
posts <- str_split(jobs, "32[0-9]{12}")[[1]][-1]
posts

In [None]:
posts <- trimws(posts)
posts

In [None]:
jobs.df <- data.frame(ids, posts)
jobs.df
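
The same extract-and-split logic can be checked on a miniature input before running it on the real file. The string below is invented; it only assumes that every ad is preceded by a 14-digit id starting with 32.

```r
library("stringr")

## invented miniature input: two ads, each preceded by its id
toy_jobs <- "32000000000001\nFirst ad text\n32000000000002\nSecond ad text\n"

toy_ids <- unlist(str_extract_all(toy_jobs, "32[0-9]{12}"))
## the first piece is whatever precedes the first id (here an empty string), so drop it
toy_posts <- trimws(str_split(toy_jobs, "32[0-9]{12}")[[1]][-1])

## ids and posts must line up one-to-one before they go into a data frame
stopifnot(length(toy_ids) == length(toy_posts))
data.frame(id = toy_ids, post = toy_posts)
```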

# Convert single text file to csv using sed, then read it 

### Transformation

The approach above reads the complete text file into memory, which may be infeasible or very slow for large files. Command-line tools like `sed` process the file as a stream and handle such text manipulations much faster. I included this section for you to explore in self-study.

I recommend doing this directly in a shell (e.g., `bash` or `zsh`), not in R, because escaping is tedious in R. Note that the output of `system()` calls is not visible in the notebook. You can check it in RStudio or, better yet, run the commands directly from a shell instead of calling `system()` in R.
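
If `sed` is not available, a similar line-wise transformation can be sketched in R without loading the whole file, by reading it in chunks through a connection. `ads_to_csv()` is a hypothetical helper, not part of the course material. It assumes that the file starts with an id line, that each id sits alone on its line, and that the ad text contains no double quotes.

```r
## hypothetical helper: turn "id line + ad text" blocks into id;"text" records
ads_to_csv <- function(infile, outfile, chunk_size = 10000) {
  con_in <- file(infile, open = "r")
  con_out <- file(outfile, open = "w")
  on.exit({ close(con_in); close(con_out) })
  seen_id <- FALSE
  repeat {
    ## read the next block of lines; an empty result means end of file
    lines <- readLines(con_in, n = chunk_size, warn = FALSE)
    if (length(lines) == 0) break
    idx <- grep("^32[0-9]{12}$", lines)
    if (length(idx) > 0) {
      ## id line becomes: closing quote, line break, id, separator, opening quote
      lines[idx] <- paste0("\"\n", lines[idx], ";\"")
      if (!seen_id) {
        ## the very first id has no preceding field to close: drop the quote and line break
        lines[idx[1]] <- substring(lines[idx[1]], 3)
        seen_id <- TRUE
      }
    }
    writeLines(lines, con_out)
  }
  ## close the last quoted field
  writeLines("\"", con_out)
  invisible(outfile)
}

## usage on an invented two-ad file, with a tiny chunk size to exercise the loop
demo_in <- tempfile(fileext = ".txt")
demo_out <- tempfile(fileext = ".csv")
writeLines(c("32000000000001", "First ad", "second line",
             "32000000000002", "Second ad"), demo_in)
ads_to_csv(demo_in, demo_out, chunk_size = 2)
readLines(demo_out)
```

Each line is transformed independently of its neighbours, so the chunk size affects only memory use, not the result.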

In [None]:
## advanced, for self study. this works on very large files.
## crude way of creating a readable file quickly using shell programs.
## uses sed to insert a ';' separator and line break based on a regex pattern,
## such that ID and Text field can be read as a csv. you could also do this from
## the command line.  note: this requires sed to be installed on your system.
## also, R requires double backslash escaping, and escaping nested quotations --
## overall easier to do this directly in a shell.

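## These three lines are all you need to execute. They use GNU sed syntax;
## BSD/macOS sed differs, e.g. its -i option requires an explicit suffix argument.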
## sed -e 's/^32[0-9]\{12\}$/"\n\0;"/' example-unix.txt > example.csv
## sed -i '1d' example.csv
## sed -i '$a"' example.csv

In [None]:
## check structure
system("head -n 100 example-unix.txt", intern = TRUE)

In [None]:
## match id, then replace with closing quote ("), linebreak (\n), matched id (\0), separator (;), opening quote (")
system("sed -e 's/^32[0-9]\\{12\\}$/\"\\n\\0;\"/' example-unix.txt", intern = TRUE)

In [None]:
## same substitution as above, with the output redirected to a file
system("sed -e 's/^32[0-9]\\{12\\}$/\"\\n\\0;\"/' example-unix.txt > example.csv", intern = TRUE)

In [None]:
# check and fix first/last row (could also do this in an editor)
system("head example.csv", intern = TRUE)
system("tail example.csv", intern = TRUE)

In [None]:
## -i edits the file in place, 1d deletes the first line
system("sed -i '1d' example.csv")
## last row: $ selects the last line, a appends the following text as a new line after it
system("sed -i '$a\"' example.csv")

In [None]:
# check
system("head example.csv", intern = TRUE)
system("tail example.csv", intern = TRUE)

### Read file

In [None]:
example <- read.table("example.csv", sep = ";")
## print large numbers such as the ids in fixed rather than scientific notation
options(scipen = 9999)
stopifnot(ncol(example) == 2)

names(example) <- c("id", "ad")
example$ad <- trimws(example$ad)
head(example)

### Fix character encoding

In [None]:
## declare the ad text as UTF-8 so that non-ASCII characters print correctly
Encoding(example$ad) <- "UTF-8"
head(example)
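
Note that `Encoding<-` only changes how R labels the bytes of a string; it does not convert them. If a file was actually written as Latin-1, `iconv()` is the function that re-encodes it. Here is a small illustration with a made-up string; `x` and `y` are illustrative names.

```r
## the four bytes of "café" in Latin-1 (0xe9 is é)
x <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))
validUTF8(x)  # checks whether the bytes form valid UTF-8

## state the true source encoding and convert the bytes to UTF-8
y <- iconv(x, from = "latin1", to = "UTF-8")
validUTF8(y)
Encoding(y)
```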