# Milestone 1: Clean and Tokenize Text
The goal of this milestone is to reduce the noise in the raw text and tokenize it to prepare it for the language model. We need to remove everything that is not plain text (e.g. html tags, math equations, urls), filter out rows with very short or very long texts, and finally tokenize the text. The cleaned and tokenized output will be the input of the next milestone, where we build our n-gram model.

In [2]:
# We only need the following libraries
import pandas as pd
import re
import string
import csv
import numpy as np
from nltk.tokenize import WordPunctTokenizer

In [32]:
# If there's a problem with the versions of the libraries, you can uncomment this line and install the proper versions

# !pip install -r requirements.txt

Let's load the dataset and shuffle it.

In [4]:
#Load data using pandas read_csv
data = pd.read_csv('stackexchange_812k.csv').sample(frac = 1, random_state = 0).reset_index(drop = True)

In [5]:
#Check data has 812132 rows
assert data.shape == (812132, 5), "The dataset does not have the right dimensions"

And start by exploring the dataset.

In [6]:
#Check the first few rows of the data
data.head()

Unnamed: 0,post_id,parent_id,comment_id,text,category
0,291254,,601672.0,The condition makes the gradient unbiased. (it...,comment
1,115372,,221284.0,"Yes, that sounds fine to me.",comment
2,327356,,,<p>Consider gaussian variables belonging to a ...,post
3,186923,,355055.0,Thanks S. Catterall. ^-^ Integrability: I knew...,comment
4,433143,,,Feature with very few extreme values,title


We have 3 types of text:

In [7]:
#Check the value counts for the category column
data.category.value_counts()

comment    553076
post       167304
title       91752
Name: category, dtype: int64

In [8]:
#Print a few examples of titles 
for p in data[data.category == 'title'].text.sample(3).values:
    print('-' * 20)
    print(p)

--------------------
comparison of coefficients (they are not Standardized Beta ones)
--------------------
Can data ever be too high dimensional for the Lasso?
--------------------
Which test should I use to predict a ratio from multilevel, time series data?


We see that the text of posts contains html tags and latex-formatted equations.

In [9]:
#Print a few examples of posts 
for p in data[data.category == 'post'].text.sample(3).values:
    print('-' * 20)
    print(p)

--------------------
<p>I'm looking for a suitable image dataset to train an SVM, a CNN and possibly an MLP as classifiers and to compare the results. Since an SVM archieves good results with small data sets and a CNN and above all an MLP requires a very long time for training with large datasets, this dataset should be rather smaller. But this search is quite difficult because the dataset should not be too big but not too small. The dataset should be suitable to train a CNN und possible even an MLP in the shortest possible time, I mean with GPU max. 1 day. It would probably be better if the image resolution is not too high. Since the features have to be extracted in advance in SVM, it would probably be better if the data is not too complex.</p>

<p>Regarding CNN, I know that there are the so-called pre-trained networks, but I am not sure if they are really suitable for such a comparison.</p>

<p>Does anyone have experience with it? I would be very grateful for any tip on a suitable da

In [10]:
#Print a few examples of comments 
for p in data[data.category == 'comment'].text.sample(3).values:
    print('-' * 20)
    print(p)

--------------------
I did also tutorials in R about Random Forest. The problem is not how to program, my problem is that I don't know what to program if the random forest has to be used. It is not about programming
--------------------
Never use regression with time series data.  Use a Transfer Function model approach.
--------------------
This is an awfully complex procedure. Let's step back for a minute.  What is the point of this?  What are you trying to do? Most likely we will not end up going this route.


# Clean up raw text
We're going to remove the following elements:
* html tags
* line returns
* urls
* mentions: @someone
* latex equations
* digits
* most of the punctuation
* extra spaces

For that we will use a series of simple regex patterns and the following pandas dataframe pattern:

```
pattern = r" some regex pattern"
df.text.apply(lambda t : re.sub(pattern,' ', t) )
```

Note that it's up to you to decide which elements should be removed or kept. This sequence of transformations can be modified. 

Note also that the regex patterns we use here are chosen for their simplicity. Feel free to use more precise patterns.
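As a minimal sketch, the whole sequence can also be wrapped in a single function that applies the patterns in order. The names `clean_text`, `PATTERNS` and `PUNCTUATION` are illustrative and not part of the notebook; the patterns mirror the ones used in the cells of this section, and the sample sentence is made up.

```python
import re

# Hypothetical helper: the same substitutions as in this section, applied in order.
PUNCTUATION = '"#$%&()*+/:;<=>@[\\]^_`{|}~”“'
PATTERNS = [
    r"<[^>]*>",                   # html tags
    r"[\r\n]+",                   # line returns
    r"http\S+",                   # urls
    r"@\S+",                      # mentions
    r"\$[^>]*\$",                 # latex equations
    r"\d+",                       # digits
    r"[{}]".format(PUNCTUATION),  # most of the punctuation
    r"\s\s+",                     # runs of whitespace
]
COMPILED = [re.compile(p) for p in PATTERNS]

def clean_text(text):
    """Replace every match of every pattern with a space, then strip."""
    for pattern in COMPILED:
        text = pattern.sub(" ", text)
    return text.strip()

print(clean_text("<p>See http://example.com @bob: $x^2$ is 42!</p>"))
# prints: See is !
```

With a dataframe this would be applied once, for example `data.text.apply(clean_text)`, instead of one `apply` per pattern. The order matters: the latex step must run before the punctuation step, which removes the `$` signs.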





In [11]:
# Remove html tags
data['text'] = data.text.apply(lambda t : re.sub("<[^>]*>",' ', t) )

In [12]:
# Remove line returns
data['text'] = data.text.apply(lambda t : re.sub("[\r\n]+",' ', t) )


In [13]:
# Remove urls
data['text'] = data.text.apply(lambda t : re.sub(r"http\S+",' ', t) )


In [14]:
# Remove mentions
data['text'] = data.text.apply(lambda t : re.sub(r"@\S+",' ', t) )


In [15]:
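# Remove latex equations written between $ signs
# Note: [^>]* is greedy, so a match can span from the first $ to the last $ of a text
data['text'] = data.text.apply(lambda t : re.sub(r"\$[^>]*\$",' ', t) )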
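The latex pattern above reuses the `[^>]` class from the html step, so a single match can run from the first `$` to the last `$` in a text and swallow the prose in between. The sketch below compares it with a tighter alternative, `\$[^$]*\$`. That alternative is an assumption of this example, not what the notebook ran, and the sample sentence is made up.

```python
import re

sample = "Let $x$ be the mean and $y$ the variance."

# Pattern used in the notebook: greedy, it only stops at '>' characters.
greedy = re.sub(r"\$[^>]*\$", " ", sample)
# Tighter alternative (assumption): a match never crosses another '$'.
tight = re.sub(r"\$[^$]*\$", " ", sample)

print(repr(greedy))  # the words between the two equations are gone
print(repr(tight))   # only the two equations are replaced
```

The tighter pattern is not a complete solution either: display equations delimited by `$$` would need their own pattern, applied before this one.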


In [16]:
# Remove digits
data['text'] = data.text.apply(lambda t : re.sub(r"\d+",' ', t) )


In [17]:
# Remove some of the punctuation but keep , . ! ? - and the apostrophe
remove = '"#$%&()*+/:;<=>@[\\]^_`{|}~”“'
pattern = r"[{}]".format(remove)
data['text'] = data.text.apply(lambda t : re.sub(pattern,' ', t) )
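Building a character class by hand relies on getting the escaping right. As a hedged alternative, `re.escape` can escape every character before it is placed between the brackets. The sample string below is made up, and this variant is a sketch rather than the notebook's method.

```python
import re

remove = '"#$%&()*+/:;<=>@[\\]^_`{|}~”“'
# re.escape protects characters such as ']', '^' and '\' inside the class.
pattern = re.compile("[" + re.escape(remove) + "]")

sample = 'f(x) = a_b [see "notes"] -- ok, fine!?'
cleaned = pattern.sub(" ", sample)
print(cleaned)
```

One behavioral difference is worth knowing: in the hand-built pattern the backslash only serves to escape `]`, so literal backslashes in the text are kept, whereas this variant removes them as well.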


In [18]:
# Collapse runs of whitespace into a single space
data['text'] = data.text.apply(lambda t : re.sub(r"\s\s+",' ', t) )


In [19]:
# Finally remove leading and trailing spaces with strip()
data['text'] = data.text.apply(lambda t : t.strip() )


Let's check out the resulting text for the different types:

In [20]:
# Print examples of titles again - they should be mostly unchanged
for p in data[data.category == 'title'].text.sample(3).values:
    print('-' * 20)
    print(p)

--------------------
Does a quadratic log-likehood mean the MLE is approximately normally distributed?
--------------------
How to test if a value is over-represented in one sample vs another
--------------------
what is the Probability of selecting SNPs from a list of SNPs simply by chance


In [21]:
# Print examples of posts again - they should have much less clutter
for p in data[data.category == 'post'].text.sample(3).values:
    print('-' * 20)
    print(p)

--------------------
Being very new to recommender-systems and Matrix Factorization i was wondering how to get topN recommendations for a given user. So far my strategy is to create all possible User Item combinations, predict all of them and take the best N ones based on their predicted rating. This seems very inefficient to me as the number of possible combinations grows very fast. Question Is there a better method? If yes which one? In practice my approach by works as follows Having data as follows taken from movie lens in R-Package libFMexe User Movie Rating lt fctr gt lt fctr gt lt dbl gt Toy Story GoldenEye Toy Story Richard III Build all possible combinations test data.frame User factor levels levels dat User , Rating L User Movie Rating lt fctr gt lt fctr gt lt int gt GoldenEye Toy Story Richard III GoldenEye Toy Story Richard III Finally predict all user movie combinations and take the topN ratings per User as predictions. require libFMexe libFM dat, test, Rating User Movie, t

In [22]:
# Print examples of comments again - should also be less noisy
for p in data[data.category == 'comment'].text.sample(3).values:
    print('-' * 20)
    print(p)

--------------------
A kernel is NOT a mapping into feature space. It's a function that computes inner products in feature space. One could say that the choice of kernel implicitly determines a feature space mapping, but this is different than the kernel itself being such a mapping.
--------------------
metafor rma is a specific function in R. Could you more generally explain your model and how you currently calculate SE?
--------------------
you can only say over represented if you know define what it means to be normally represented. It’s much like a question about outliers first you have to decide what’s an in-lier. Statistics doesn’t do it for you.


# Tokenize

Let's tokenize the text.
This will allow us to count the number of tokens in each text and subsequently remove texts that are too long or too short.
You can use other libraries (spaCy for instance) or another tokenizer. Here we use the [WordPunctTokenizer](https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.regexp.WordPunctTokenizer) from NLTK.

We store the result in a new column called tokens.
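To make the tokenizer's behavior concrete: `WordPunctTokenizer` splits text with the regular expression `\w+|[^\w\s]+`, so runs of word characters and runs of punctuation become separate tokens. The sketch below reproduces that rule with the standard `re` module only; the helper name `word_punct_tokenize` is hypothetical.

```python
import re

# Same rule as NLTK's WordPunctTokenizer: word runs or punctuation runs.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

def word_punct_tokenize(text):
    """Lowercase the text, then split it into word and punctuation tokens."""
    return WORD_PUNCT.findall(text.lower())

print(word_punct_tokenize("Yes, that sounds fine to me."))
# ['yes', ',', 'that', 'sounds', 'fine', 'to', 'me', '.']
print(word_punct_tokenize("It's a log-likelihood!?"))
# ['it', "'", 's', 'a', 'log', '-', 'likelihood', '!?']
```

Note that contractions and hyphenated words are split into several tokens, which inflates the token counts slightly compared with a whitespace split.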




In [23]:
#Create a tokenizer object using WordPunctTokenizer from NLTK 
tokenizer = WordPunctTokenizer()
#Apply the tokenizer to the text column (convert to lowercase first) and save the output into a new column called tokens
data['tokens'] = data.text.apply(lambda t : tokenizer.tokenize(t.lower())) 

Let's now count the tokens in each piece of text


In [24]:
#Count the number of tokens in each text and save output into new column called n_tokens
data['n_tokens'] = data.tokens.apply(len)

In [25]:
#Describe the number of tokens
data.n_tokens.describe()

count    812132.000000
mean         60.074186
std          99.416031
min           0.000000
25%          16.000000
50%          35.000000
75%          70.000000
max       10874.000000
Name: n_tokens, dtype: float64

In [26]:
#Plot a histogram to check the distribution of the number of tokens
data.n_tokens.hist(bins = 100)

<matplotlib.axes._subplots.AxesSubplot at 0x12eb58e48>

We see that we have some extremely long texts. Let's look at the longest one.

In [27]:
# Print the first text with more than 10000 tokens
print(data[data.n_tokens > 10000].text.values[0])

My sample includes subjects, of which belong to group L , while the other to group L please see data below . I used GLM for a binary outcome to test for group differences in background variables - summary pre lt - glm L g a m p e, family binomial logit , data df ...yielding significant differences for of them g, a, m, p, and e . So I modeled these background variables as covariates when testing for an association between my predictor, chr and my outcome rsk , in each one of the groups L , L , again using GLM for binary outcome summary fit lt - glm rsk chr g a m p e, family binomial logit , data df which df L , The results showed that a significant association does exist for L but not for L . I would appreciate your help in how to test whether significance non-significance can be attributed to the group condition? . Or in other words, is it true that for subjects L , a significant correlation is evident, while for L ' it's absent. Thanks for responders! Uri structure list L c , , , , , 

We can see that most of the longest texts are composed of tables with limited semantic value. 
We will remove rows that have more than an arbitrary number of tokens (let's say 5000) as well as rows that have too few tokens.
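The filter uses strict bounds, `(n_tokens > 4) & (n_tokens < 5000)`, so a text is kept only if it has between 5 and 4999 tokens. A plain-Python sketch of the same rule, with hypothetical names `MIN_TOKENS`, `MAX_TOKENS` and `keep`, makes the boundary cases explicit:

```python
MIN_TOKENS = 4     # texts with this many tokens or fewer are dropped
MAX_TOKENS = 5000  # texts with this many tokens or more are dropped

def keep(n_tokens, low=MIN_TOKENS, high=MAX_TOKENS):
    """Mirror the condition (n_tokens > 4) & (n_tokens < 5000)."""
    return low < n_tokens < high

lengths = [0, 4, 5, 35, 4999, 5000, 10874]
print([n for n in lengths if keep(n)])
# [5, 35, 4999]
```

Both thresholds are arbitrary and can be tuned after looking at the distribution of `n_tokens`.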

In [28]:
#Keep only rows with more than 4 and fewer than 5000 tokens
data = data[(data.n_tokens > 4) & (data.n_tokens < 5000)].reset_index(drop = True)
#Check the number of rows after filtering
data.shape

(789649, 7)

In [29]:
#Lets check the value counts for each category again
data.category.value_counts()

comment    540587
post       165377
title       83685
Name: category, dtype: int64

# Export data
We could export the dataframe as is using the pickle format.

However, if we want to keep the original csv format, it is easier to transform the list of tokens into a space-separated string.

On retrieval we will only have to split the string to get back the list of tokens.
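A small round-trip check, using only the standard library and an in-memory buffer instead of a file, illustrates why joining and splitting is safe as long as no token contains a space. The sample tokens are made up.

```python
import csv
import io

tokens = ["yes", ",", "that", "sounds", "fine", "to", "me", "."]

# Write a header and one row with every field quoted (csv.QUOTE_ALL).
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
writer.writerow(["tokens"])
writer.writerow([" ".join(tokens)])

# Read the rows back and split on spaces to recover the list.
buffer.seek(0)
rows = list(csv.reader(buffer))
restored = rows[1][0].split(" ")
print(restored == tokens)
```

Quoting every field keeps tokens such as `,` from being mistaken for column separators when the file is read back.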

In [30]:
#Change the tokens column to a space-separated string of tokens rather than a list
data['tokens'] = data.tokens.apply(lambda tk : ' '.join(tk))
#Check the first few rows
data.tokens.head()

0    the condition makes the gradient unbiased . it...
1                       yes , that sounds fine to me .
2    consider gaussian variables belonging to a gau...
3    thanks s . catterall . - integrability i knew ...
4                 feature with very few extreme values
Name: tokens, dtype: object

And finally let's export the dataframe into a csv file.

We will use that csv file as the new cleaned and tokenized dataset to build our language model in milestone 2.

In [31]:
#Write to csv to upload
data.to_csv('stackexchange_812k_tokenized.csv', quoting = csv.QUOTE_ALL, index = False)