# Data Preprocessing - why is it important?

The main focus of this project is to show a few __techniques__ that are commonly used in __data science projects__.
<br><br>
The techniques that we are going to use are:
<br>
<p>1-Case alignment</p>
<p>2-Tokenization</p>
<p>3-Stopwords removal</p>
<p>4-Stemming</p>
<p>5-Lemmatization</p>

## Collecting the data

We are going to use this PDF file* as the source text for all the preprocessing steps: alex.smola.org/drafts/thebook.pdf. _(By the way, great content!)_
<br>
<sub>*Not the entire PDF, just two pages, because the main objective of this notebook is to show the techniques.</sub>
<br><br><br>
For this, we will use the pdftotext library, which provides text extraction for PDF files.

If you want to run this project on your own, download [the pdf file](https://alex.smola.org/drafts/thebook.pdf) that I've used and update the <code>path</code> variable with the location of the file on your computer.

In [160]:
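import pdftotext

path = "/home/miohana/thebook.pdf"

# Open the PDF in binary mode; the with block closes the file afterwards
with open(path, 'rb') as file_object:
    pdf = pdftotext.PDF(file_object)

# We only want page 11 (enumerate starts counting at 1 here)
pages = ''
pages_by_index = {}  # renamed from `dict`, which shadowed the built-in type
for index, page in enumerate(pdf, 1):
    if index == 11:
        pages += str(page)
        pages_by_index[index] = page
        break

result = pages
print(result)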

                                       1
                               Introduction
Over the past two decades Machine Learning has become one of the main-
stays of information technology and with that, a rather central, albeit usually
hidden, part of our life. With the ever increasing amounts of data becoming
available there is good reason to believe that smart data analysis will become
even more pervasive as a necessary ingredient for technological progress.
   The purpose of this chapter is to provide the reader with an overview over
the vast range of applications which have at their heart a machine learning
problem and to bring some degree of order to the zoo of problems. After
that, we will discuss some basic tools from statistics and probability theory,
since they form the language in which many machine learning problems must
be phrased to become amenable to solving. Finally, we will outline a set of
fairly basic yet effective algorithms to solve an important problem, namely
that

## 1 - Case alignment

Lowercasing the text avoids duplicated words: <code>Computer</code> and <code>computer</code> have the same meaning and should be counted as one word.

In [161]:
result = result.lower()
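
As a rough illustration of why this matters, the sketch below counts distinct words before and after lowercasing. The sample sentence is made up for this example and is not taken from the PDF.

```python
from collections import Counter

# Made-up sample text, only for illustration
text = "Computer science is fun. A computer is a tool. COMPUTER!"
words = [w.strip(".!") for w in text.split()]

before = Counter(words)                   # case-sensitive counts
after = Counter(w.lower() for w in words)  # counts after lowercasing

print(len(before), "distinct words before lowercasing")
print(len(after), "distinct words after lowercasing")
print(after["computer"], "occurrences of 'computer' after lowercasing")
```

Without lowercasing, <code>Computer</code>, <code>computer</code> and <code>COMPUTER</code> would each get their own entry in the vocabulary.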

## Removing the "Introduction" title and the section titles, which aren't needed
For this, we need to find the starting index of the word "over", because that is the word the text starts with.

In [162]:
print(result.find('over'))

85


In [163]:
print(result[85:])
result = result[85:]

over the past two decades machine learning has become one of the main-
stays of information technology and with that, a rather central, albeit usually
hidden, part of our life. with the ever increasing amounts of data becoming
available there is good reason to believe that smart data analysis will become
even more pervasive as a necessary ingredient for technological progress.
   the purpose of this chapter is to provide the reader with an overview over
the vast range of applications which have at their heart a machine learning
problem and to bring some degree of order to the zoo of problems. after
that, we will discuss some basic tools from statistics and probability theory,
since they form the language in which many machine learning problems must
be phrased to become amenable to solving. finally, we will outline a set of
fairly basic yet effective algorithms to solve an important problem, namely
that of classification. more sophisticated tools, a discussion of more general
problems a
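
The index 85 above was copied by hand from the output of <code>find</code>. A minimal sketch of the same idea without a hard-coded number is shown below; <code>drop_before</code> and <code>drop_from</code> are hypothetical helper names introduced for this example, and the sample string is made up.

```python
def drop_before(text, marker):
    """Return text starting at the first occurrence of marker."""
    start = text.find(marker)
    # str.find returns -1 when the marker is absent; text[-1:] would then
    # silently keep only the last character, so return the text unchanged.
    return text if start == -1 else text[start:]


def drop_from(text, marker):
    """Return text up to (not including) the first occurrence of marker."""
    end = text.find(marker)
    return text if end == -1 else text[:end]


# Made-up sample that mimics the layout of the extracted page
sample = "   1\n   introduction\nover the past two decades. to achieve this goal we"
trimmed = drop_from(drop_before(sample, "over"), "to achieve this goal")
print(repr(trimmed))
```

Note that <code>find</code> matches substrings, so a marker such as "over" would also match inside a longer word if one came first in the text.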

In [165]:
result = result.replace('1.1 a taste of machine learning', '')

In [166]:
result = result.replace('1.1.1 applications', '')

In [167]:
print(result)

over the past two decades machine learning has become one of the main-
stays of information technology and with that, a rather central, albeit usually
hidden, part of our life. with the ever increasing amounts of data becoming
available there is good reason to believe that smart data analysis will become
even more pervasive as a necessary ingredient for technological progress.
   the purpose of this chapter is to provide the reader with an overview over
the vast range of applications which have at their heart a machine learning
problem and to bring some degree of order to the zoo of problems. after
that, we will discuss some basic tools from statistics and probability theory,
since they form the language in which many machine learning problems must
be phrased to become amenable to solving. finally, we will outline a set of
fairly basic yet effective algorithms to solve an important problem, namely
that of classification. more sophisticated tools, a discussion of more general
problems a
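
The two <code>replace</code> calls above remove the section titles one at a time. If there were many of them, a regular expression could remove every numbered title in one pass. This is only a sketch: it assumes that no line of body text starts with a number like "1.1", and the body lines in the sample are made up.

```python
import re

# A whole line that starts with a dotted section number ("1.1", "1.1.1", ...)
# followed by a title. Assumption: body lines never start this way.
SECTION_TITLE = re.compile(r"^[ \t]*\d+(?:\.\d+)+[ \t]+\S.*\n?", re.MULTILINE)

sample = (
    "first body line.\n"
    "1.1 a taste of machine learning\n"
    "second body line.\n"
    "1.1.1 applications\n"
    "third body line.\n"
)

cleaned = SECTION_TITLE.sub("", sample)
print(cleaned)
```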

### We also don't want the final sentence of the text or anything after it on the page.

In [168]:
print(result.find('to achieve this goal'))

1988


In [169]:
print(result[:1988])
result = result[:1988]

over the past two decades machine learning has become one of the main-
stays of information technology and with that, a rather central, albeit usually
hidden, part of our life. with the ever increasing amounts of data becoming
available there is good reason to believe that smart data analysis will become
even more pervasive as a necessary ingredient for technological progress.
   the purpose of this chapter is to provide the reader with an overview over
the vast range of applications which have at their heart a machine learning
problem and to bring some degree of order to the zoo of problems. after
that, we will discuss some basic tools from statistics and probability theory,
since they form the language in which many machine learning problems must
be phrased to become amenable to solving. finally, we will outline a set of
fairly basic yet effective algorithms to solve an important problem, namely
that of classification. more sophisticated tools, a discussion of more general
problems a

## 2 - Tokenization (by sentence)

In [170]:
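from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords  # used later, in the stopwords removal step

# Split the text into sentences
tokens_by_sentence = sent_tokenize(result)
# Put each sentence on one line: newlines become spaces, and the '- ' left by
# a word hyphenated across a line break is removed ('main- stays' -> 'mainstays')
tokens_by_sentence = [s.replace('\n', ' ').replace('- ', '') for s in tokens_by_sentence]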

In [171]:
print(tokens_by_sentence)

['over the past two decades machine learning has become one of the mainstays of information technology and with that, a rather central, albeit usually hidden, part of our life.', 'with the ever increasing amounts of data becoming available there is good reason to believe that smart data analysis will become even more pervasive as a necessary ingredient for technological progress.', 'the purpose of this chapter is to provide the reader with an overview over the vast range of applications which have at their heart a machine learning problem and to bring some degree of order to the zoo of problems.', 'after that, we will discuss some basic tools from statistics and probability theory, since they form the language in which many machine learning problems must be phrased to become amenable to solving.', 'finally, we will outline a set of fairly basic yet effective algorithms to solve an important problem, namely that of classification.', 'more sophisticated tools, a discussion of more genera

## 2 - Tokenization (by word)

In [172]:
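from nltk.tokenize import word_tokenize

# Split the text into word and punctuation tokens
tokens_by_word = word_tokenize(result)
# Note: word tokens contain no newlines or spaces, so these replace calls have
# no effect here; 'main-' and 'stays' stay as two separate tokens
tokens_by_word = [w.replace('\n', ' ').replace('- ', '') for w in tokens_by_word]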

In [173]:
print(tokens_by_word)

['over', 'the', 'past', 'two', 'decades', 'machine', 'learning', 'has', 'become', 'one', 'of', 'the', 'main-', 'stays', 'of', 'information', 'technology', 'and', 'with', 'that', ',', 'a', 'rather', 'central', ',', 'albeit', 'usually', 'hidden', ',', 'part', 'of', 'our', 'life', '.', 'with', 'the', 'ever', 'increasing', 'amounts', 'of', 'data', 'becoming', 'available', 'there', 'is', 'good', 'reason', 'to', 'believe', 'that', 'smart', 'data', 'analysis', 'will', 'become', 'even', 'more', 'pervasive', 'as', 'a', 'necessary', 'ingredient', 'for', 'technological', 'progress', '.', 'the', 'purpose', 'of', 'this', 'chapter', 'is', 'to', 'provide', 'the', 'reader', 'with', 'an', 'overview', 'over', 'the', 'vast', 'range', 'of', 'applications', 'which', 'have', 'at', 'their', 'heart', 'a', 'machine', 'learning', 'problem', 'and', 'to', 'bring', 'some', 'degree', 'of', 'order', 'to', 'the', 'zoo', 'of', 'problems', '.', 'after', 'that', ',', 'we', 'will', 'discuss', 'some', 'basic', 'tools', 'f
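
In the output above, 'main-' and 'stays' are still two tokens, because the word was hyphenated across a line break in the PDF. One possible fix is to join such words in the text before tokenizing it. The sketch below does this with the standard <code>re</code> module; <code>join_hyphenated_line_breaks</code> is a hypothetical helper name, and <code>str.split</code> stands in for <code>word_tokenize</code> so that the example runs without NLTK.

```python
import re


def join_hyphenated_line_breaks(text):
    """Join words split across lines, then turn newlines into spaces."""
    # "main-" at the end of one line + "stays" on the next -> "mainstays"
    text = re.sub(r"(\w)-\n[ \t]*(\w)", r"\1\2", text)
    # Any remaining line break (with surrounding whitespace) becomes a space
    return re.sub(r"\s*\n\s*", " ", text).strip()


sample = "one of the main-\nstays of information technology\n   the purpose"
flat = join_hyphenated_line_breaks(sample)
print(flat)
print(flat.split())
```

The cleaned string could then be passed to <code>word_tokenize</code>. One caveat: a genuinely hyphenated compound that happens to break at the end of a line would also lose its hyphen.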

## 3 - Stopwords removal

In [174]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [175]:
print(stopwords.words('portuguese')[:10])

['de', 'a', 'o', 'que', 'e', 'do', 'da', 'em', 'um', 'para']


In [176]:
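def remove_stopwords(tokens):
    # A set makes each membership test O(1) instead of scanning a list
    stop_words = set(stopwords.words('english'))
    return [word for word in tokens if word not in stop_words]

# Note: result was the page text; from here on it is the filtered token list
result = remove_stopwords(tokens_by_word)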

In [177]:
print(result)

['past', 'two', 'decades', 'machine', 'learning', 'become', 'one', 'main-', 'stays', 'information', 'technology', ',', 'rather', 'central', ',', 'albeit', 'usually', 'hidden', ',', 'part', 'life', '.', 'ever', 'increasing', 'amounts', 'data', 'becoming', 'available', 'good', 'reason', 'believe', 'smart', 'data', 'analysis', 'become', 'even', 'pervasive', 'necessary', 'ingredient', 'technological', 'progress', '.', 'purpose', 'chapter', 'provide', 'reader', 'overview', 'vast', 'range', 'applications', 'heart', 'machine', 'learning', 'problem', 'bring', 'degree', 'order', 'zoo', 'problems', '.', ',', 'discuss', 'basic', 'tools', 'statistics', 'probability', 'theory', ',', 'since', 'form', 'language', 'many', 'machine', 'learning', 'problems', 'must', 'phrased', 'become', 'amenable', 'solving', '.', 'finally', ',', 'outline', 'set', 'fairly', 'basic', 'yet', 'effective', 'algorithms', 'solve', 'important', 'problem', ',', 'namely', 'classification', '.', 'sophisticated', 'tools', ',', 'di
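
The filtered list above still contains punctuation tokens such as ',' and '.'. A minimal sketch of removing stopwords and non-alphabetic tokens in one step is shown below. The stopword set here is a tiny hypothetical one so that the example runs on its own; in this notebook it would be <code>set(stopwords.words('english'))</code>.

```python
# Hypothetical, tiny stopword set for illustration only
STOP_WORDS = {"the", "of", "and", "a", "with", "that", "has", "our", "over"}


def content_words(tokens, stop_words=STOP_WORDS):
    """Keep alphabetic tokens that are not stopwords."""
    return [t for t in tokens if t.isalpha() and t not in stop_words]


tokens = ["over", "the", "past", "two", "decades", ",", "machine",
          "learning", "has", "become", "one", "of", "the", "main-", "stays", "."]
filtered = content_words(tokens)
print(filtered)
```

Note that <code>isalpha()</code> also drops fragments like 'main-', so hyphenated words are better repaired before this step.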

## 4 - Stemming

For non-English projects (such as Portuguese), a good option is the __RSLP Stemmer__.
<br>
For our purpose, we will use the __SnowballStemmer__, which provides good support for the English language.
<br><br>
Keep in mind that many libraries offer similar functionality. The right thing to do is to test each one and choose the one that gives the best result.
<br>
Examples of other stemmers:
* PorterStemmer (English)
* LancasterStemmer (English)
* ISRIStemmer (Arabic)
* RSLPStemmer (Portuguese)
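
One way to follow that advice is to run the same words through several stemmers and compare the results side by side. The sketch below does this for the three English stemmers that ship with NLTK; the word list is a small hand-picked sample from the page, and NLTK is assumed to be installed.

```python
from nltk.stem import LancasterStemmer, PorterStemmer
from nltk.stem.snowball import SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
    "snowball": SnowballStemmer("english"),
}

# A few words taken from the page used in this notebook
sample_words = ["decades", "learning", "technology", "increasing", "available"]

# For each word, the stem produced by each stemmer
comparison = {
    word: {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
    for word in sample_words
}

for word, stems in comparison.items():
    print(f"{word:12}", stems)
```

Which stemmer is "best" depends on the task, so it helps to look at the stems for words that matter in your own data.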

In [194]:
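from nltk.stem.snowball import SnowballStemmer

englishStemmer = SnowballStemmer('english')

# Alphabetic tokens only (stopwords already removed); computed but not used below
words = [word for word in result if word.isalpha()]

# The stemmer is applied to every token, stopwords and punctuation included,
# which is why the output still contains tokens like 'the' and ','
stemmed = [englishStemmer.stem(word) for word in tokens_by_word]
print(stemmed[:50])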

['over', 'the', 'past', 'two', 'decad', 'machin', 'learn', 'has', 'becom', 'one', 'of', 'the', 'main-', 'stay', 'of', 'inform', 'technolog', 'and', 'with', 'that', ',', 'a', 'rather', 'central', ',', 'albeit', 'usual', 'hidden', ',', 'part', 'of', 'our', 'life', '.', 'with', 'the', 'ever', 'increas', 'amount', 'of', 'data', 'becom', 'avail', 'there', 'is', 'good', 'reason', 'to', 'believ', 'that']


## 5 - Lemmatization

In [197]:
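from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

# pos="v" treats every token as a verb, so 'has' becomes 'have' and 'hidden'
# becomes 'hide', while nouns such as 'decades' are left unchanged
lemmatized = [wordnet_lemmatizer.lemmatize(word, pos="v") for word in tokens_by_word]
print(lemmatized[:50])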

['over', 'the', 'past', 'two', 'decades', 'machine', 'learn', 'have', 'become', 'one', 'of', 'the', 'main-', 'stay', 'of', 'information', 'technology', 'and', 'with', 'that', ',', 'a', 'rather', 'central', ',', 'albeit', 'usually', 'hide', ',', 'part', 'of', 'our', 'life', '.', 'with', 'the', 'ever', 'increase', 'amount', 'of', 'data', 'become', 'available', 'there', 'be', 'good', 'reason', 'to', 'believe', 'that']
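
Because every token is lemmatized as a verb, nouns like 'decades' stay in the plural. A possible refinement is to choose the part of speech per token. The sketch below only shows the mapping step from Penn Treebank tags (the format returned by <code>nltk.pos_tag</code>) to the single-letter codes that <code>lemmatize</code> accepts in its <code>pos</code> argument. <code>penn_to_wordnet_pos</code> is a hypothetical helper name, and the tagged pairs are written by hand, not produced by a tagger.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag (e.g. "VBD", "NNS") to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, also used as the fallback


# Hand-written (token, tag) pairs in the format that nltk.pos_tag returns;
# they are not the output of a real tagger run.
tagged = [("decades", "NNS"), ("has", "VBZ"), ("hidden", "VBN"),
          ("usually", "RB"), ("good", "JJ")]

pos_for_lemmatizer = [(token, penn_to_wordnet_pos(tag)) for token, tag in tagged]
print(pos_for_lemmatizer)
```

Each pair could then be passed on as <code>wordnet_lemmatizer.lemmatize(token, pos=pos)</code>.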
