# Preprocessing text data

This notebook tries out different preprocessing steps for text data and shows how preprocessing choices change the data that is passed on to modeling.

## Dataset for the exercise

* [New York Times Comments](https://www.kaggle.com/aashita/nyt-comments/data): a set of readers' comments on articles published in the New York Times.

## Tools

Python has a variety of preprocessing tools. This example uses some of the tools built into [scikit-learn](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) and the [nltk package](https://www.nltk.org/). However, the same steps could be performed with other tools as well.

## Read data

In [None]:
import os
import csv

In [None]:
path = 'data/nyt-comments/'
files = os.listdir( path ) ## Get all files from directory path
files = [f for f in files if 'Comments' in f] ## Get only files with reader comments

# For the purposes of the example, let's use only one of the data files
file = files[0]

documents = []

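# Iterate over entries in data and add their comment body to list
# (the file is assumed to be UTF-8 encoded; the with-block closes it after reading)
with open( os.path.join( path, file ), newline='', encoding='utf-8' ) as csv_file:
    for entry in csv.DictReader( csv_file ):
        documents.append( entry['commentBody'] )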

print( "Data size" )
print( len(documents) )

## Preprocess and create Document-Term Matrix

Here we try several basic preprocessing steps: removing HTML tags, punctuation, numbers, and stopwords, lowercasing and stemming words, and finally removing infrequent and very frequent words. Each of these has implications for the resulting Document-Term Matrix. You can try out different options below and see their influence.

In [None]:
import nltk
from nltk.corpus import stopwords

# Let's use nltk's in-built stopword list
nltk.download('stopwords')

# Add to or replace this list to use custom stopwords
stopwords = stopwords.words('english')

In [None]:
import string
from string import digits
import re

# Function to perform basic preprocessing
def preprocess( doc ):
    
    # Remove html
    p = re.compile(r'<.*?>')
    doc = p.sub('', doc)
    
    # Remove punctuation
    doc = doc.translate( str.maketrans('', '', string.punctuation) )
    
    # Remove numbers
    doc = doc.translate( str.maketrans('', '', digits) )
    
    # Lowercase
    doc = doc.lower()
    
    # Remove extra whitespaces
    doc = re.sub(' +', ' ', doc) 
    doc = doc.strip()
    
    return doc

In [None]:
# Try preprocessing on one document
print( documents[0] )
print( preprocess( documents[0] ) )

In [None]:
# Preprocess both stopwords and actual data using same steps
documents = [preprocess(doc) for doc in documents]
stopwords = [preprocess(stop) for stop in stopwords]
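
The reason for running the stopwords through the same steps as the documents can be seen in a small sketch. The `strip_punctuation` helper and the sample strings below are made up for illustration; the helper only mirrors the punctuation and lowercasing steps of `preprocess`.

```python
import string

# Hypothetical helper mirroring the punctuation and lowercasing steps of preprocess()
def strip_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation)).lower()

sample_doc = strip_punctuation("I don't agree.")   # becomes 'i dont agree'
raw_stopwords = ["i", "don't"]                      # stopwords before preprocessing
clean_stopwords = [strip_punctuation(s) for s in raw_stopwords]

# With raw stopwords, "dont" survives because it no longer matches "don't"
kept_raw = [w for w in sample_doc.split() if w not in raw_stopwords]

# With preprocessed stopwords, "dont" is filtered out as intended
kept_clean = [w for w in sample_doc.split() if w not in clean_stopwords]

print(kept_raw)    # ['dont', 'agree']
print(kept_clean)  # ['agree']
```

If the stopword list were left untouched, contractions such as "don't" would stay in the vocabulary after punctuation removal.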

In [None]:
from nltk.stem.snowball import EnglishStemmer

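stemmer = EnglishStemmer() # nltk's built-in Snowball stemmer

# word_tokenize needs nltk's Punkt tokenizer data; the resource is named
# 'punkt' in older nltk releases and 'punkt_tab' in newer ones
for resource in ('punkt', 'punkt_tab'):
    nltk.download(resource)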

# Function for stemming texts
def stem( text ):
    words = nltk.word_tokenize(text)
    return [ stemmer.stem(w) for w in words ]

In [None]:
# Stem words – this might take some time!
documents_stemmed = [' '.join( stem(d) ) for d in documents]
stopwords_stemmed = stem( ' '.join( stopwords ) )

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

tf_vectorizer = CountVectorizer(
    max_df=0.90, # Remove words that occur in more than 90% of the documents
    min_df=10, # Remove words that occur in fewer than 10 documents
    stop_words=stopwords_stemmed, # Remove stop words
    analyzer = "word"
)

# Create the Document-Term Matrix
dtm = tf_vectorizer.fit_transform(documents_stemmed)

# Get list of words in the DTM
list( tf_vectorizer.get_feature_names_out() )
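
To get a feel for what the two document-frequency thresholds do, the filtering can be sketched in plain Python on a toy corpus. This is an illustration rather than scikit-learn's implementation: the corpus and the `filter_vocabulary` helper are invented, and the sketch assumes `min_df` is given as an absolute document count and `max_df` as a proportion, as in the cell above.

```python
from collections import Counter

# Toy corpus, invented for illustration
toy_docs = [
    "the cat sat",
    "the dog sat",
    "the bird flew",
    "the cat flew",
]

# Document frequency: number of documents in which each term appears
doc_freq = Counter()
for doc in toy_docs:
    doc_freq.update(set(doc.split()))

def filter_vocabulary(doc_freq, n_docs, min_df, max_df):
    """Keep terms whose document frequency is at least min_df
    and at most max_df * n_docs."""
    return sorted(
        term for term, df in doc_freq.items()
        if min_df <= df <= max_df * n_docs
    )

vocab = filter_vocabulary(doc_freq, len(toy_docs), min_df=2, max_df=0.90)
print(vocab)  # ['cat', 'flew', 'sat']
```

Here "the" is dropped because it occurs in all four documents, and "dog" and "bird" are dropped because each occurs in only one.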

In [None]:
import pandas as pd

# Transform DTM to pandas dataframe to examine word frequencies
# (this creates a dense array, which can use a lot of memory for a large corpus)
dtm_df = pd.DataFrame( dtm.toarray(), columns=tf_vectorizer.get_feature_names_out() )
dtm_df

In [None]:
word_counts = dtm_df.sum() # Get word frequencies for each term in vocabulary
doc_counts = dtm_df.astype(bool).sum() # Get document frequencies for each term in vocabulary
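
The difference between the two measures can be shown with a plain-Python sketch on a made-up corpus. The `toy_` names are hypothetical and are chosen so that they do not overwrite the notebook's own variables.

```python
from collections import Counter

# Toy corpus, invented for illustration
toy_docs = [
    "good good good article",
    "good point",
    "bad article",
]

toy_word_counts = Counter()  # total occurrences of each term
toy_doc_counts = Counter()   # number of documents containing each term

for doc in toy_docs:
    tokens = doc.split()
    toy_word_counts.update(tokens)
    toy_doc_counts.update(set(tokens))  # count each term once per document

print(toy_word_counts["good"], toy_doc_counts["good"])        # 4 2
print(toy_word_counts["article"], toy_doc_counts["article"])  # 2 2
```

A term repeated many times within a few documents ranks high by word count but not by document frequency, which is why the two top-10 lists can differ.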

In [None]:
# 10 most frequent words, based on word count
word_counts.nlargest(10)

In [None]:
# 10 most frequent words, based on document frequency
doc_counts.nlargest(10)

## Things to try out and think about

* Check the top words and the list of words in the Document-Term Matrix. Do you see anything that should still be removed?
* Modify the stopword list to remove unwanted words.
* Think about different ways of performing preprocessing and try them out. For example, should punctuation be replaced with whitespace or removed altogether, as in the code above? Should numbers be removed? Also try different thresholds for removing terms on the basis of document frequency.
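
As a starting point for the punctuation question, the two options can be compared on a single sentence. This is a minimal sketch; the sample text is made up, and the second option is only one possible way of replacing punctuation with whitespace.

```python
import re
import string

text = "A well-known writer,not a newcomer."

# Option 1: delete punctuation, as preprocess() above does
removed = text.translate(str.maketrans('', '', string.punctuation))

# Option 2: replace each punctuation character with a space, then collapse spaces
to_space = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
spaced = re.sub(' +', ' ', text.translate(to_space)).strip()

tokens_removed = removed.lower().split()
tokens_spaced = spaced.lower().split()

print(tokens_removed)  # ['a', 'wellknown', 'writernot', 'a', 'newcomer']
print(tokens_spaced)   # ['a', 'well', 'known', 'writer', 'not', 'a', 'newcomer']
```

Neither option is right in every case. Deleting punctuation glues together words that were separated only by a comma or hyphen, while replacing it with whitespace splits contractions such as "don't" into "don" and "t".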