# Preprocessing text data

The purpose of this notebook is to try out different preprocessing steps for text data and to see how changes in preprocessing influence the data that is passed on to modeling.

## Dataset for the exercise

* [New York Times Comments](https://www.kaggle.com/aashita/nyt-comments/data): a set of readers' comments on articles published in the New York Times.

## Tools

R has a variety of preprocessing tools, including the popular libraries [tm](https://cran.r-project.org/web/packages/tm/tm.pdf) and [quanteda](https://cran.r-project.org/web/packages/quanteda/quanteda.pdf). This example uses quanteda, but the same steps could be performed with other tools as well.

## Read data

In [None]:
path <- 'data/nyt-comments/'
files <- list.files( path ) ## Get all files from directory path
files <- files[ grepl("Comments", files) ] ## Get only files with reader comments

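# For the purposes of the example, let's use only one of the comment data files
# stringsAsFactors = FALSE keeps text columns as character vectors in R versions before 4.0
data <- read.csv( paste0(path, files[1]), stringsAsFactors = FALSE )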

print( "Data size" )
print( nrow(data) )

## Preprocess and create Document-Term Matrix

Here we try several basic preprocessing steps: removing html tags, punctuation, numbers and stopwords, lowercasing and stemming words, and finally removing infrequent and very frequent words. Each of these has implications for the resulting Document-Term Matrix. You can try out different options below and see their influence.
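
To make the effect of these steps concrete, the following minimal sketch compares a Document-Term Matrix built from raw tokens with one built after preprocessing. The three toy texts and the object names (`toy`, `raw`, `processed`) are made up for illustration and are not part of the NYT data.

```r
library(quanteda)

# Hypothetical toy comments, not taken from the NYT data
toy <- c(
    d1 = "The Senators voted on 3 bills.",
    d2 = "A senator votes; the bill passed!",
    d3 = "Voting on the bills was delayed."
)

# Variant 1: tokenize only, keep case, punctuation and numbers
raw <- dfm( tokens(toy), tolower = FALSE )

# Variant 2: the preprocessing steps used in this notebook
toks <- tokens( toy, remove_punct = TRUE, remove_numbers = TRUE )
toks <- tokens_tolower( toks )
toks <- tokens_remove( toks, pattern = stopwords('en') )
toks <- tokens_wordstem( toks )
processed <- dfm( toks )

nfeat( raw )            # Number of features without preprocessing
nfeat( processed )      # Number of features after preprocessing
featnames( processed )  # The features that remain
```

Comparing the two feature counts shows how much of the vocabulary comes from case variants, punctuation, stopwords and inflected forms.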

In [None]:
library(quanteda)

In [None]:
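# Remove html tags before creating corpus; replace them with a space so that
# words separated only by a tag (e.g. "end.<br/>Next") are not glued together
data$commentBody <- gsub("<.*?>", " ", data$commentBody)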

corp <- corpus( data, text_field = "commentBody" ) # Create corpus

In [None]:
# Use a separate name so that the quanteda function stopwords() is not shadowed
stopword_list <- stopwords('en') ## Add to or replace this list to use custom stopwords

# Split texts to word tokens
token <- tokens( 
    corp, 
    remove_punct=TRUE, # Remove punctuation
    remove_numbers=TRUE # Remove numbers
)

token <- tokens_tolower( token ) # Lowercase words
token <- tokens_select( token, pattern = stopword_list, selection = 'remove') # Remove stopwords before stemming
token <- tokens_wordstem( token ) # Stem words
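
One possible way to use custom stopwords is to append extra words to the built-in English list before removing them. The sketch below is only an illustration: the sentence and the extra words ("mr", "said") are made up, and which words count as noise depends on your own data.

```r
library(quanteda)

# Hypothetical additions to the built-in English stopword list
custom_stopwords <- c( stopwords('en'), "mr", "said" )

# Hypothetical example sentence
toks <- tokens( c(d1 = "Mr Smith said the plan is good"), remove_punct = TRUE )
toks <- tokens_tolower( toks )  # Lowercase first so that "Mr" matches "mr"
toks <- tokens_remove( toks, pattern = custom_stopwords )

as.list( toks )[[1]]  # Tokens that remain after stopword removal
```

Lowercasing before stopword removal keeps the matching predictable, which is also the order used in the cell above.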

In [None]:
# Create the DTM (quanteda calls it 'dfm' for 'Document-Feature Matrix')
dtm <- dfm( token )
dtm

In [None]:
# Remove infrequent and very frequent words
dtm <- dfm_trim( 
    dtm, 
    min_docfreq = 10, # Remove words that occur in fewer than 10 documents 
    max_docfreq = ndoc(dtm) * 0.9 # Remove words that occur in more than 90% of the documents
)
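
The effect of the two document-frequency thresholds is easier to see on a small example. The sketch below uses four made-up documents and a lower minimum threshold (2 instead of 10) so that both ends of the trimming are visible; the texts and object names are hypothetical.

```r
library(quanteda)

# Hypothetical toy documents
toy <- c(
    d1 = "apple banana",
    d2 = "apple cherry",
    d3 = "apple banana cherry",
    d4 = "apple date"
)
toy_dfm <- dfm( tokens(toy) )

docfreq( toy_dfm )  # Number of documents each word occurs in

trimmed <- dfm_trim(
    toy_dfm,
    min_docfreq = 2,                   # Drop words found in fewer than 2 documents
    max_docfreq = ndoc(toy_dfm) * 0.9  # Drop words found in more than 90% of documents
)

featnames( trimmed )  # Words that survive the trimming
```

Here "apple" occurs in every document and "date" in only one, so both fall outside the thresholds, while "banana" and "cherry" are kept.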

In [None]:
topfeatures( dtm, scheme='count', n=10 ) # Get 10 most frequent words, based on word count
topfeatures( dtm, scheme='docfreq', n=10 ) # Get 10 most frequent words, based on document frequency
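
The two schemes can rank words differently: a word repeated many times in a single comment has a high count but a low document frequency. A minimal sketch with made-up documents:

```r
library(quanteda)

# Hypothetical toy documents: "tax" is repeated within one document only
toy <- c(
    d1 = "tax tax tax tax vote",
    d2 = "vote budget",
    d3 = "vote budget"
)
toy_dfm <- dfm( tokens(toy) )

by_count   <- topfeatures( toy_dfm, scheme = 'count', n = 2 )    # Ranked by total occurrences
by_docfreq <- topfeatures( toy_dfm, scheme = 'docfreq', n = 2 )  # Ranked by number of documents

by_count
by_docfreq
```

By count, "tax" ranks first because it occurs four times in total; by document frequency, "vote" ranks first because it appears in all three documents.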

In [None]:
# Get list of words in the DTM
featnames(dtm)

## Things to try out and think about

* Check the top words and list of words in the Document-Term Matrix. Do you see anything that should still be removed?
* Modify the stopword list to remove unwanted words.
* There might be some strings and symbols that quanteda does not recognize as, for example, punctuation or numbers. How would you go about removing these?
* You can check the quanteda [documentation](https://cran.r-project.org/web/packages/quanteda/quanteda.pdf), and especially the tokens function (p. 77), to see the available preprocessing options. Using regular expressions with the [gsub function](https://www.digitalocean.com/community/tutorials/sub-and-gsub-function-r) might also help, as with the html tags above.
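
As one possible starting point for the regular-expression approach, the sketch below cleans two made-up comments with base R only. The patterns are illustrative assumptions, not a complete solution: for instance, the last substitution keeps only letters, spaces and apostrophes, which may be too aggressive for some analyses.

```r
# Hypothetical example comments
comments <- c(
    "Great piece!<br/>See https://example.com for more &amp; more.",
    "Cost was $1,200 - too much???"
)

clean <- gsub( "<[^>]+>", " ", comments )                    # Remove html tags
clean <- gsub( "https?://[^[:space:]]+", " ", clean )         # Remove URLs
clean <- gsub( "&[a-z]+;", " ", clean )                       # Remove html entities such as &amp;
clean <- gsub( "[^[:alpha:][:space:]']", " ", clean )         # Keep only letters, spaces and apostrophes
clean <- trimws( gsub( "[[:space:]]+", " ", clean ) )         # Collapse repeated whitespace

clean
```

Applying the substitutions to the text column before creating the corpus, as was done for the html tags, keeps the later quanteda steps unchanged.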