<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" align="left" src="https://i.creativecommons.org/l/by-nc-sa/4.0/80x15.png" /></a>&nbsp;| [Emmanuel Rachelson](https://personnel.isae-supaero.fr/emmanuel-rachelson?lang=en) | <a href="https://supaerodatascience.github.io/machine-learning/">https://supaerodatascience.github.io/machine-learning/</a>

<div style="font-size:22pt; line-height:25pt; font-weight:bold; text-align:center;">Text data pre-processing</div>

In this exercise, we shall load a database of email messages and pre-format them so that we can design automated classification methods or use off-the-shelf classifiers. The general purpose of this notebook is to give a practical notion (through this example) of how important data pre-processing can be in a Machine Learning workflow, and generalize it to other situations.

"What is there to pre-process?" you might ask. Well, actually, text data comes in a very noisy form that we, humans, have become accustomed to and filter out effortlessly to grasp the core meaning of the text. It has a lot of formatting (fonts, colors, typography...), punctuation, abbreviations, common words, grammatical rules, etc. that we might wish to discard before even starting the data analysis.

Here are some pre-processing steps that can be performed on text:
1. loading the data, removing attachments, merging title and body;
2. tokenizing - splitting the text into atomic "words";
3. removal of stop-words - very common words;
4. removal of non-words - punctuation, numbers, gibberish;
5. lemmatization - merging inflected forms of the same word, e.g. "find" and "finds".

The final goal is to be able to represent a document as a mathematical object, e.g. a vector, that our machine learning black boxes can process.
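
As a rough illustration of steps 2 to 4 above, here is a minimal sketch in plain Python. The stop-word list `STOP_WORDS`, the regular expression and the function name `bag_of_words` are illustrative assumptions, not part of the notebook; lemmatization is left out because it needs a linguistic resource such as WordNet.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "on", "for"}

def bag_of_words(text):
    """Return a Counter mapping each kept token to its number of occurrences."""
    # Tokenize: lowercase, then split into alphanumeric runs and single punctuation marks.
    tokens = re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower())
    # Remove stop-words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Remove non-words: keep purely alphabetical tokens of at least 2 characters.
    tokens = [t for t in tokens if t.isalpha() and len(t) > 1]
    return Counter(tokens)

counts = bag_of_words("Call for papers: the 6th symposium on language, and language teachers!")
print(counts)
```

The resulting counter is already a sparse vector in disguise: each distinct word is a dimension and its count is the coordinate along that dimension.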



<img id="fig1" src="https://imgs.xkcd.com/comics/constructive.png"> 

A tech company comes to you to create a moderation system for their social network: they want to detect spam comments and, later on, also detect offensive content in order to remove it automatically.


<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#1.-Load-the-data" data-toc-modified-id="1.-Load-the-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>1. Load the data</a></span></li><li><span><a href="#2.-Filtering-out-the-noise" data-toc-modified-id="2.-Filtering-out-the-noise-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>2. Filtering out the noise</a></span></li><li><span><a href="#3.-Even-better-filtering" data-toc-modified-id="3.-Even-better-filtering-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>3. Even better filtering</a></span></li><li><span><a href="#4.-Term-frequency-times-inverse-document-frequency" data-toc-modified-id="4.-Term-frequency-times-inverse-document-frequency-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>4. Term frequency times inverse document frequency</a></span></li><li><span><a href="#5.-Utility-function" data-toc-modified-id="5.-Utility-function-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>5. Utility function</a></span></li></ul></div>

# 1. Load the data

To showcase our proposed system, we load a database of email messages and pre-format them so that we can design automated classification methods or use off-the-shelf classifiers.


**Questions**:
- What simple statistics could you print about the dataset?
- Why are there multiple folders in the dataset?

In [None]:
#!git clone git@github.com:SupaeroDataScience/machine-learning.git

In [1]:
import os

train_dir = '../data/lingspam_public/bare/'
# train_dir = 'machine-learning/data/ling-spam/train-mails/'

email_path = []
email_label = []
for d in os.listdir(train_dir):
    folder = os.path.join(train_dir,d)
    email_path += [os.path.join(folder,f) for f in os.listdir(folder)]
    email_label += [f[0:3]=='spm' for f in os.listdir(folder)]
print("number of emails",len(email_path))
email_nb = 8 # index of the email to display; change it to browse the corpus
print("email file:", email_path[email_nb])
print("email is a spam:", email_label[email_nb])
with open(email_path[email_nb]) as f:  # close the file once it has been read
    print(f.read())

number of emails 2893
email file: ../data/lingspam_public/bare/part7/6-476msg3.txt
email is a spam: False
Subject: request for discourse list

dear linguists i 'd like to know if there are any listservs on discourse anlysis text linguistics and pragmatics . thanks gul durmusoglu



# 2. Filtering out the noise

One nice thing about scikit-learn is that it has lots of preprocessing utilities. For instance, [`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) converts a collection of text documents to a matrix of token counts.

- To remove stop-words, we set: `stop_words='english'`
- To convert all words to lowercase: `lowercase=True`
- The default tokenizer in scikit-learn drops punctuation and only keeps tokens of at least 2 alphanumeric characters (so numbers such as `00` are kept).

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
countvect = CountVectorizer(input='filename', stop_words='english', lowercase=True)
word_count = countvect.fit_transform(email_path)

In [3]:
print("Number of documents:", len(email_path))
words = countvect.get_feature_names_out()
print("Number of words:", len(words))
print("Document - words matrix:", word_count.shape)
print("First words:", words[0:100])

Number of documents: 2893
Number of words: 60618
Document - words matrix: (2893, 60618)
First words: ['00' '000' '0000' '00001' '00003000140' '00003003958' '00007' '0001'
 '00010' '00014' '0003' '00036' '000bp' '000s' '000yen' '001' '0010'
 '0010010034' '0011' '00133' '0014' '00170' '0019' '00198' '002' '002656'
 '0027' '003' '0030' '0031' '00333' '0037' '0039' '003n7' '004' '0041'
 '0044' '0049' '005' '0057' '006' '0067' '007' '00710' '0073' '0074'
 '00799' '008' '009' '00919680' '0094' '00a' '00am' '00arrival' '00b'
 '00coffee' '00congress' '00d' '00dinner' '00f' '00h' '00hfstahlke' '00i'
 '00j' '00l' '00m' '00p' '00pm' '00r' '00t' '00tea' '00the' '00uzheb' '01'
 '0100' '01003' '01006' '0104' '0106' '01075' '0108' '011' '0111' '0117'
 '0118' '01202' '01222' '01223' '01225' '01232' '01235' '01273' '013'
 '0131' '01334' '0135' '01364' '0139' '013953' '013a']


# 3. Even better filtering

That's already quite good, but this pre-processing does not perform lemmatization, the list of stop-words could be better, and we may want to remove non-English words (misspelled words, tokens containing numbers, etc.).

A slightly better preprocessing uses the [Natural Language Toolkit](https://www.nltk.org/). The one below:
- tokenizes;
- removes punctuation;
- removes stop-words;
- removes non-English and misspelled words (optional);
- removes 1-character words;
- removes non-alphabetical words (numbers and codes essentially).

In [None]:
# Run only if nltk is not installed
#%pip install nltk
#import nltk
#nltk.download('stopwords')
#nltk.download('words')
#nltk.download('wordnet')
#nltk.download('omw-1.4')

In [None]:
from nltk import wordpunct_tokenize      #→ splits text into tokens (words & punctuation).    
from nltk.stem import WordNetLemmatizer  #→ reduces words to their base/lemma (e.g., cats → cat; words are treated as nouns by default).
from nltk.corpus import stopwords        #→ built-in English stop word list.
from nltk.corpus import words            #→ dictionary of valid English words from NLTK.
from string import punctuation           #→ string of punctuation symbols
class LemmaTokenizer(object):
    def __init__(self, remove_non_words=True):
        self.wnl = WordNetLemmatizer()
        self.stopwords = set(stopwords.words('english'))
        self.words = set(words.words())
        self.remove_non_words = remove_non_words
    def __call__(self, doc):
        # tokenize words and punctuation
        word_list = wordpunct_tokenize(doc)
        # remove stopwords
        word_list = [word for word in word_list if word not in self.stopwords]
        # remove non words
        if(self.remove_non_words):
            word_list = [word for word in word_list if word in self.words]
        # remove 1-character words
        word_list = [word for word in word_list if len(word)>1]
        # remove non alpha
        word_list = [word for word in word_list if word.isalpha()]
        return [self.wnl.lemmatize(t) for t in word_list]

countvect = CountVectorizer(input='filename',tokenizer=LemmaTokenizer(remove_non_words=True))
word_count = countvect.fit_transform(email_path)
feat2word = {v: k for k, v in countvect.vocabulary_.items()}



In [5]:
print("Number of documents:", len(email_path))
words = countvect.get_feature_names_out()
print("Number of words:", len(words))
print("Document - words matrix:", word_count.shape)
print("First words:", words[0:100])

Number of documents: 2893
Number of words: 14282
Document - words matrix: (2893, 14282)
First words: ['aa' 'aal' 'aba' 'aback' 'abacus' 'abandon' 'abandoned' 'abandonment'
 'abbas' 'abbreviation' 'abdomen' 'abduction' 'abed' 'aberrant'
 'aberration' 'abide' 'abiding' 'abigail' 'ability' 'ablative' 'ablaut'
 'able' 'abler' 'aboard' 'abolition' 'abord' 'aboriginal' 'aborigine'
 'abound' 'abox' 'abreast' 'abridged' 'abroad' 'abrogate' 'abrook'
 'abruptly' 'abscissa' 'absence' 'absent' 'absolute' 'absolutely'
 'absoluteness' 'absolutist' 'absolutive' 'absolutization' 'absorbed'
 'absorption' 'abstract' 'abstraction' 'abstractly' 'abstractness'
 'absurd' 'absurdity' 'abu' 'abundance' 'abundant' 'abuse' 'abusive'
 'abyss' 'academe' 'academic' 'academically' 'academician' 'academy'
 'accelerate' 'accelerated' 'accelerative' 'accent' 'accentuate'
 'accentuation' 'accept' 'acceptability' 'acceptable' 'acceptance'
 'acceptation' 'accepted' 'acception' 'access' 'accessibility'
 'accessible' 'acce

# 4. Term frequency times inverse document frequency

After this first preprocessing, each document is summarized by a vector of size "number of words in the extracted dictionary". For example, the first email in the list has become:

In [6]:
mail_number = 0
text = open(email_path[mail_number]).read()
print("Original email:")
print(text)
#print(LemmaTokenizer()(text))
#print(len(set(LemmaTokenizer()(text))))
#print(len([feat2word[i] for i in word_count2[mail_number, :].nonzero()[1]]))
#print(len([word_count2[mail_number, i] for i in word_count2[mail_number, :].nonzero()[1]]))
#print(set([feat2word[i] for i in word_count2[mail_number, :].nonzero()[1]])-set(LemmaTokenizer()(text)))
emailBagOfWords = {feat2word[i]: word_count[mail_number, i] for i in word_count[mail_number, :].nonzero()[1]}
print("Bag of words representation (", len(emailBagOfWords), " words in dict):", sep='')
print(emailBagOfWords)
print("\nVector representation (", word_count[mail_number, :].nonzero()[1].shape[0], " non-zero elements):", sep='')
print(word_count[mail_number, :])

Original email:
Subject: ials ( 6th lang teacher ed )

the university of edinburgh institute for applied language studies ( ials ) 6th symposium for language teacher educators evaluation and research in language teacher education - - - - - - - - - - edinburgh - - - - - - - - - - - wednesday 18th november - friday 20th november 1998 call for papers * the role of research and evaluation in language teacher education * methods of researching and evaluating language teacher education * the ethics of evaluation and research in language teacher education * evaluating programmes , trainers , and materials in language teacher education * researching the influence of context on the delivery of language teacher education * researching and evaluating methodologies of language teacher education * research as part of the process of training and teacher development * assessing the development of trainee skills * investigating how teachers change * researching supervision and post-lesson feedback * r

**Questions**:
- What is a bag-of-words representation?
- What kind of feature selection or feature engineering could you derive from this bag-of-words representation?

Counting words is a good start but there is an issue: longer documents will have higher average count values than shorter documents, even though they might talk about the same topics.

To avoid these potential discrepancies it suffices to divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called `tf` for Term Frequencies.
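
The effect of this normalization can be checked on a made-up pair of documents, where the second is simply the first repeated three times. The helper name `term_frequencies` is an assumption for this sketch.

```python
from collections import Counter

def term_frequencies(tokens):
    """Divide each word count by the total number of words in the document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

short_doc = "language teacher education".split()
long_doc = short_doc * 3  # same topic, three times longer

# Raw counts differ between the two documents, term frequencies do not.
print(Counter(short_doc)["teacher"], Counter(long_doc)["teacher"])
print(term_frequencies(short_doc)["teacher"], term_frequencies(long_doc)["teacher"])
```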

Another refinement on top of `tf` is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus.

This downscaling is called `tf–idf` for “Term Frequency times Inverse Document Frequency” and again, scikit-learn does the job for us with the [`TfidfTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) class.

---

## **TF–IDF = Term Frequency × Inverse Document Frequency**

It’s a statistical measure used in **text mining and NLP** to evaluate how important a word is to a document within a collection (corpus).

---

### **1. Term Frequency (TF)**

* How often a word appears in a document.
* Formula:

  $$
  TF(t,d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}
  $$

👉 Example:
In document *d1*:
*"the cat sat on the mat"*

* `cat` appears 1 time, total words = 6
* TF(`cat`, d1) = 1/6

---

### **2. Inverse Document Frequency (IDF)**

* Measures how unique or rare a word is across the corpus.
* Formula:

  $$
  IDF(t) = \log \frac{N}{1 + n_t}
  $$

  where:

  * $N$ = total number of documents
  * $n_t$ = number of documents containing term $t$

👉 Meaning:

* If a word appears in **all documents**, IDF ≈ 0 (not useful for distinguishing).
* If a word appears in **few documents**, IDF is high (more important).

---

### **3. TF–IDF Score**

$$
TF\text{–}IDF(t,d) = TF(t,d) \times IDF(t)
$$

👉 Interpretation:

* High TF–IDF = word is frequent in the document but rare across other documents → very informative.
* Low TF–IDF = word is either common everywhere (*the, is, and*) or not frequent → not informative.

---

### **Example**

Corpus = 2 documents:

* d1: "cat sat on mat"

* d2: "dog sat on log"

* Word **“sat”** → appears in both documents → IDF low.

* Word **“cat”** → only in d1 → IDF high.

So:

* TF–IDF("cat", d1) > TF–IDF("sat", d1).
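
The comparison can be checked numerically with a few lines of plain Python, following the TF and IDF definitions given in this section. The function names are assumptions for this sketch. With only two documents, the `1 +` in the denominator makes the IDF of `sat` slightly negative and that of `cat` exactly zero, so the ordering holds but the absolute values only become meaningful on larger corpora.

```python
import math

corpus = {
    "d1": "cat sat on mat".split(),
    "d2": "dog sat on log".split(),
}

def tf(term, doc):
    """Occurrences of `term` in `doc` divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log(N / (1 + n_t)), with n_t the number of documents containing `term`."""
    n_t = sum(term in doc for doc in docs)
    return math.log(len(docs) / (1 + n_t))

def tf_idf(term, doc, docs):
    """Product of the term frequency and the inverse document frequency."""
    return tf(term, doc) * idf(term, docs)

docs = list(corpus.values())
score_cat = tf_idf("cat", corpus["d1"], docs)
score_sat = tf_idf("sat", corpus["d1"], docs)
print(score_cat, score_sat)
```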

---

✅ **Why use TF–IDF?**

* It’s a classic term-weighting scheme used by **search engines** (ranking documents by relevance).
* Helps filter out **common words** (like stopwords) without needing a predefined list.
* Used in **text classification, clustering, and keyword extraction**.

---



In [7]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer().fit_transform(word_count)
tfidf.shape

(2893, 14282)

Now every email in the corpus has a vector representation that filters out irrelevant tokens and retains the significant information.
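
scikit-learn's weighting differs slightly from the textbook formula. According to its documentation, with the default settings (`smooth_idf=True`, `norm='l2'`) `TfidfTransformer` uses the raw count as term frequency, computes `idf(t) = ln((1 + N) / (1 + n_t)) + 1`, and rescales each row to unit Euclidean norm. The sketch below checks this on a made-up count matrix; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Made-up count matrix: 3 documents (rows) x 3 terms (columns).
counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [3, 1, 0],
])

sk_tfidf = TfidfTransformer().fit_transform(counts).toarray()

# Manual computation with the smoothed idf and l2 row normalization.
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)               # n_t for each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

print(np.allclose(sk_tfidf, manual))
```

Because of the row normalization, the values printed for an email are relative weights within that email rather than plain products of a frequency and an idf.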

In [8]:
print("email 0:")
print(tfidf[0,:])

email 0:
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 58 stored elements and shape (1, 14282)>
  Coords	Values
  (0, 428)	0.021668194656170932
  (0, 480)	0.03965204853794966
  (0, 658)	0.07930409707589932
  (0, 1126)	0.029642508334120857
  (0, 1733)	0.02491156311317319
  (0, 1980)	0.03686932232020423
  (0, 2688)	0.039045919819268096
  (0, 2890)	0.0584370988942504
  (0, 3251)	0.05255481414582111
  (0, 3403)	0.06891276945435791
  (0, 3965)	0.3709094897191037
  (0, 4317)	0.07106820000914878
  (0, 4346)	0.1750506972872621
  (0, 4407)	0.048629512469554305
  (0, 4694)	0.06178348135429918
  (0, 4980)	0.029559492528951074
  (0, 5018)	0.042615690094226964
  (0, 5824)	0.055212812782293695
  (0, 6138)	0.05683061305812067
  (0, 6305)	0.04530175029244509
  (0, 6384)	0.05014336855822714
  (0, 6391)	0.01970221977995616
  (0, 6486)	0.07045254621446256
  (0, 6678)	0.06393903942186312
  (0, 7020)	0.26167024631502506
  :	:
  (0, 8913)	0.03208251597430934
  (0, 9195)	0.030352501688857378


# 5. Utility function

Let's put all this loading process into a separate file so that we can reuse it in other experiments.

In [9]:
import load_spam
spam_data = load_spam.spam_data_loader()
spam_data.load_data()

In [10]:
spam_data.print_email(8)

email file: ../data/lingspam_public/bare/part7/6-476msg3.txt
email is a spam: False
Subject: request for discourse list

dear linguists i 'd like to know if there are any listservs on discourse anlysis text linguistics and pragmatics . thanks gul durmusoglu

Bag of words representation (12 words in dictionary):
{'subject': np.int64(1), 'like': np.int64(1), 'thanks': np.int64(1), 'linguistics': np.int64(1), 'request': np.int64(1), 'know': np.int64(1), 'discourse': np.int64(2), 'pragmatic': np.int64(1), 'list': np.int64(1), 'dear': np.int64(1), 'text': np.int64(1), 'gul': np.int64(1)}
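
The module `load_spam` itself is not shown in this notebook. As a rough idea of what the loading part of such a helper could look like, here is a hypothetical sketch that only reuses the directory-walking logic of section 1; the function name `list_emails` and its return format are assumptions, not the actual `load_spam` API.

```python
import os

def list_emails(root_dir):
    """Return (paths, labels) for every file found in the sub-folders of root_dir.

    A file is labelled as spam when its name starts with 'spm',
    which is the naming convention used in section 1.
    """
    paths, labels = [], []
    for folder_name in sorted(os.listdir(root_dir)):
        folder = os.path.join(root_dir, folder_name)
        if not os.path.isdir(folder):
            continue  # skip stray files such as a README
        for file_name in sorted(os.listdir(folder)):
            paths.append(os.path.join(folder, file_name))
            labels.append(file_name.startswith("spm"))
    return paths, labels

# Hypothetical usage, assuming the data directory of section 1 exists:
# paths, labels = list_emails('../data/lingspam_public/bare/')
```

Sorting the listings makes the email indices reproducible across machines, since `os.listdir` returns entries in arbitrary order.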
