# M1 Loading and Preparing the Dataset

## Objective

The goal of this preliminary milestone is to load and preprocess the dataset. The raw text is noisy and we want to remove nonwords and non-ASCII characters, keep punctuation to a minimum, and reduce the overall vocabulary of the corpus.

- Although this corpus is not as noisy as a text directly extracted from a social network (for example, Twitter or Facebook), it is still not as structured as academic papers or newspaper articles. Furthermore, the corpus displays some interesting particularities, such as the presence of HTML markup and LaTeX-formatted equations. The corpus is also rich in specific entities, names of theorems, and statistical test algorithms, and it mixes colloquial writing with more formally structured paragraphs.


- The garbage-in, garbage-out golden rule of machine learning is also applicable to language models. Simply put, if we skip the preprocessing/cleaning part of the project, the vocabulary of our language model will be too vast and noisy to make any sense. Generated text, for instance, may mix in mathematical symbols with punctuation signs or random HTML tags and numbers. By reducing the volume of the corpus vocabulary, we increase the relevance and quality of the generated text and improve the reliability of sentence selection based on their respective probabilities. We also reduce the memory imprint of our code and its execution time.


- Preprocessing the text to reduce noise and vocabulary size is an iterative process. You should start simple and further refine the preprocessing steps after building and evaluating your first language models.~

In [2]:
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.tokenize.treebank import TreebankWordTokenizer, TreebankWordDetokenizer

## Load the dataset into a pandas DataFrame

In [3]:
df = pd.read_csv('~/data/stackexchange_812k.csv')

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 812132 entries, 0 to 812131
Data columns (total 5 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   post_id     812132 non-null  int64  
 1   parent_id   75535 non-null   float64
 2   comment_id  553076 non-null  float64
 3   text        812132 non-null  object 
 4   category    812132 non-null  object 
dtypes: float64(2), int64(1), object(2)
memory usage: 31.0+ MB


In [5]:
df.head()

Unnamed: 0,post_id,parent_id,comment_id,text,category
0,1,,,Eliciting priors from experts,title
1,2,,,What is normality?,title
2,3,,,What are some valuable Statistical Analysis op...,title
3,4,,,Assessing the significance of differences in d...,title
4,6,,,The Two Cultures: statistics vs. machine learn...,title


In [6]:
df.text.sample(10)

498456    S. Haykin, Adaptive Filter Theory, 5th Edition...
89243     Understanding the violation of the independenc...
15152     Confusion over lmer and p-values: how do p-val...
15904     Is a large control sample better than a balanc...
170280    <p>Figure 1 there clarifies things a bit. All ...
649371    Fair enough, I agree/stand corrected. I still ...
112754    <p>There are some angles on this to consider. ...
810641                  Any question @RiturajSinghRathore ?
120845    <p>Your example is a very good one because it ...
66418     Question about notation of expectation operato...
Name: text, dtype: object

## Use regular expressions to remove elements that are not words, such as HTML tags, LaTeX expressions, URLs, digits, and line returns.

In [7]:
HTML = "<[^>]*>"
LATEX = "\$[^>]*\$"
URLS = "http\S+"
CRS = "[\r\n]+"
DIGITS = "\$[^>]*\$"
SPACES = "\s\s+"
PUNCT = '"#$%&()*+/:;<=>@[\\]^_`{|}~”“'
pattern = r"[{}]".format(PUNCT)

def clean_text(text):
    """
        text: a string        
        return: modified initial string
    """
    text = re.sub(HTML,' ', text)
    text = re.sub(LATEX,' ', text)
    text = re.sub(URLS,' ', text)
    text = re.sub(CRS,' ', text)
    text = re.sub(DIGITS,' ', text)
    text = re.sub(pattern,' ', text)
    text = re.sub(SPACES,' ', text)
    text = re.sub(DIGITS,' ', text)
    return text.strip()

In [8]:
clean_text('Formulate hypotheses when $\mu_A < \mu_B$')

'Formulate hypotheses when'

In [9]:
clean_text('See my response to <a href="https://stackoverflow.com/questions/2252144/datasets-for-running-statistical-analysis-on')

'See my response to a href'

In [10]:
# Sample of comments
for p in df[df.category == 'comment'].text.sample(3).values:
  print('-' * 20)
  print(p)

--------------------
May fit better under: http://stats.stackexchange.com/questions/1906/data-mining-conferences?
--------------------
$X$ is the mean of something and is distributed as $Exp(1)$ and Y is the actual value (i.e. not the mean) with parameter $X = x$ such that $Y | X = x$ is distributed as $Pois(X = x)$. I'll give my own answer right now and hopefully, you can confirm it.
--------------------
What result do you get if you just use a random forest regression model instead of the classifier and then regressor?


In [11]:
df.text = df.text.apply(clean_text)

In [12]:
# Post clean sample of comments
for p in df[df.category == 'comment'].text.sample(3).values:
  print('-' * 20)
  print(p)

--------------------
Glen b and samooch and Marius The formula expansion of factor symbol factor time will include main effects for both symbol and time. Furthermore, the 0 will only change the labeling of the effects. Instead of an Intercept term you will see the estimates for interaction of the lowest levels for symbol and time. If you wanted to avoid estimating a main effect for time you would need to use factor symbol factor time
--------------------
I don't understand your question. Are you asking what does it mean if all variables are most strongly correlated with PC1?
--------------------
Yes, I didn't know the term eigenface but that's what they are when I have just one output neuron with 6 output neurons I still get faces but they are much more noisy .


## Remove texts that contain blanks only.

In [13]:
df.text.count()

812132

In [14]:
df[df.text.str.len() == 0].text.count()

1422

1422 out of 812132 entries have a zero length text.

In [15]:
df = df[df.text.str.len() > 0]

In [16]:
df.text.count()

810710

## Remove texts that are extremely large or too short to add any information to the model. 

We want to keep paragraphs that contain at least a few words and remove the paragraphs that are composed of large numerical tables.

In [17]:
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
df['tokens'] = df.text.apply(lambda t : tokenizer.tokenize(t.lower()))

In [18]:
df['n_tokens'] = df.tokens.apply(len)

In [19]:
df.n_tokens.describe()

count    810710.000000
mean         63.246199
std         122.586727
min           1.000000
25%          16.000000
50%          36.000000
75%          72.000000
max       14835.000000
Name: n_tokens, dtype: float64

In [20]:
df.n_tokens.max()

14835

In [21]:
df = df[(df.n_tokens > 4) & (df.n_tokens < 5000)].reset_index(drop = True)
df.shape

(791172, 7)

## Use a tokenizer to create a version of the original text that is a string of space-separated lowercase tokens. 

For instance,

- Thank you!, This equation y = ax + by=ax+b, is very helpful.

    would be transformed to:

    thank you ! this equation , is very helpful .

- “retrieve a distance matrix” is a matter of coding. It also might be irrelevant: one can imagine creative answers.

    becomes, if you choose to remove double quotes from the original text:

    retrieve a distance matrix is a matter of coding. it also might be irrelevant : one can imagine creative answers .

Note that punctuation signs (, . : !) are also represented as tokens.

In [22]:
from nltk import word_tokenize
from nltk import Text

In [23]:
def space_separated_lower(text):
    tokens = word_tokenize(text.lower())
    return " ".join(list(filter(lambda x: x not in ['“', "”"], tokens)))

In [24]:
text = '“retrieve a distance matrix” is a matter of coding. It also might be irrelevant: one can imagine creative answers.'
space_separated_lower(text)

'retrieve a distance matrix is a matter of coding . it also might be irrelevant : one can imagine creative answers .'

In [25]:
df['tokens'] = df.text.apply(space_separated_lower)

## Export the resulting DataFrame into a CSV file.

In [27]:
import csv
df.to_csv("../data/stackexchange_cleaned.csv", quoting = csv.QUOTE_ALL, index = False)