# Data Pre-Processing

This Jupyter notebook gathers all our pre-processing methods and compares their effect on the corpus using a few metrics.

***Author:*** Paulo Ribeiro

## Import 

In [1]:
from data_helpers import CorpusClean
import warnings
warnings.filterwarnings('ignore')

%load_ext autoreload
%autoreload 2

## Data

First, we instantiate the corpus loader.

In [2]:
corpus = CorpusClean(
    corpus_path="data/corpus.json/corpus.json",
)

Loading corpus... (can take up to 3 minutes)
Corpus Loaded ! 



## Split by Languages

The first step is to split the corpus by language, so that each language can be processed separately.
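As a rough illustration of what such a split does, here is a minimal sketch. It is not the `CorpusClean` implementation: the function name `group_by_language` and the record fields `docid`, `lang`, and `text` are assumptions made for the example.

```python
from collections import defaultdict


def group_by_language(docs):
    """Group document records by their language code (hypothetical helper)."""
    per_lang = defaultdict(list)
    for doc in docs:
        per_lang[doc["lang"]].append(doc)
    return dict(per_lang)


# Toy records; the field names are assumed, not read from the real corpus.
docs = [
    {"docid": "doc-en-1", "lang": "en", "text": "Mars Hill Church was a megachurch."},
    {"docid": "doc-fr-1", "lang": "fr", "text": "La production de café au Costa Rica."},
    {"docid": "doc-en-2", "lang": "en", "text": "It was based in Seattle."},
]

groups = group_by_language(docs)
print({lang: len(items) for lang, items in groups.items()})
# {'en': 2, 'fr': 1}
```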

In [3]:
corpus.split_per_lang()

Stats for 'en' docs:   0%|          | 0/207363 [00:00<?, ?it/s]

Stats for 'fr' docs:   0%|          | 0/10676 [00:00<?, ?it/s]

Stats for 'de' docs:   0%|          | 0/10992 [00:00<?, ?it/s]

Stats for 'es' docs:   0%|          | 0/11019 [00:00<?, ?it/s]

Stats for 'it' docs:   0%|          | 0/11250 [00:00<?, ?it/s]

Stats for 'ko' docs:   0%|          | 0/7893 [00:00<?, ?it/s]

Stats for 'ar' docs:   0%|          | 0/8829 [00:00<?, ?it/s]

| lang | num_docs | max_words | min_words | mean_words | median_words | 95th_quantile_words |
|------|----------|-----------|-----------|------------|--------------|---------------------|
| en | 207363 | 66333 | 80 | 2163.921269 | 1478.0 | 5892.0 |
| fr | 10676 | 68896 | 311 | 5596.36952 | 4886.5 | 13543.75 |
| de | 10992 | 28677 | 3 | 4474.812591 | 4339.0 | 11067.15 |
| es | 11019 | 75804 | 294 | 5689.988021 | 5298.0 | 12636.7 |
| it | 11250 | 49172 | 404 | 5419.6064 | 4910.5 | 11942.4 |
| ko | 7893 | 35738 | 183 | 2700.349677 | 2511.0 | 6702.8 |
| ar | 8829 | 45428 | 293 | 4563.532676 | 4125.0 | 11775.4 |
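The per-language statistics above could be computed along the following lines. This is only a sketch under assumptions: the helper `word_count_stats` is hypothetical, words are counted by splitting on whitespace, and the 95th quantile uses linear interpolation. The notebook's own method may differ on any of these points.

```python
import statistics


def word_count_stats(texts):
    """Summarise whitespace-delimited word counts for a list of texts."""
    counts = sorted(len(text.split()) for text in texts)
    # n=100 yields 99 cut points; index 94 is the 95th percentile.
    q95 = statistics.quantiles(counts, n=100, method="inclusive")[94]
    return {
        "num_docs": len(counts),
        "max_words": counts[-1],
        "min_words": counts[0],
        "mean_words": statistics.mean(counts),
        "median_words": statistics.median(counts),
        "95th_quantile_words": q95,
    }


# Toy input with word counts 3, 2, 5 and 1.
texts = ["one two three", "one two", "one two three four five", "one"]
stats = word_count_stats(texts)
print(stats)
```

Note that `statistics.quantiles` needs at least two documents, so a language with a single document would need special handling.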


In [4]:
corpus.show_current_docs()

--------------------------------------------------
Current state of documents (first 300 characters):
--------------------------------------------------

Language: en
Docid: doc-en-9633
--------------------------------------------------
Text:
Mars Hill Church was a Christian megachurch, founded by Mark Driscoll, Lief Moi, and Mike Gunn. It was a multi-site church based in Seattle, Washington and grew from a home Bible study to 15 locations in 4 U.S. states. Services were offered at its 15 locations; the church also podcast content of wee

Language: fr
Docid: doc-fr-1447
--------------------------------------------------
Text:
La production de café au Costa Rica représente en 2016 environ 1,2 % de la production mondiale de café, ce qui fait le  grand producteur du monde derrière la Côte d'Ivoire.

Particularités 
Les grains de café du Costa Rica, sont considérés comme parmi les meilleurs dans le monde. Tarrazú est pensé p

Language: de
Docid: doc-de-9933
--------------------------------

## Lower-Case

The second step is to convert all texts to lowercase. This is necessary because methods such as TF-IDF and BM25 match terms exactly, so "Church" and "church" would otherwise be counted as two different terms.
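To make the effect concrete, the sketch below counts term frequencies before and after lower-casing. The sentence and the whitespace tokenisation are illustrative assumptions, not taken from the corpus or from `CorpusClean`.

```python
from collections import Counter

# Illustrative sentence; tokenisation is a plain whitespace split.
text = "Church members founded the church"

raw_counts = Counter(text.split())
lowered_counts = Counter(text.lower().split())

# Without lower-casing, the two spellings are counted as separate terms.
print(raw_counts["Church"], raw_counts["church"])  # 1 1
# After lower-casing, both occurrences fall under one term.
print(lowered_counts["church"])  # 2
```

Any term-frequency-based score built on these counts inherits the same behaviour.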

In [5]:
corpus.lower_case()

Lower casing 'en' docs:   0%|          | 0/207363 [00:00<?, ?it/s]

Lower casing 'fr' docs:   0%|          | 0/10676 [00:00<?, ?it/s]

Lower casing 'de' docs:   0%|          | 0/10992 [00:00<?, ?it/s]

Lower casing 'es' docs:   0%|          | 0/11019 [00:00<?, ?it/s]

Lower casing 'it' docs:   0%|          | 0/11250 [00:00<?, ?it/s]

Lower casing 'ko' docs:   0%|          | 0/7893 [00:00<?, ?it/s]

Lower casing 'ar' docs:   0%|          | 0/8829 [00:00<?, ?it/s]

In [6]:
corpus.show_current_docs()

--------------------------------------------------
Current state of documents (first 300 characters):
--------------------------------------------------

Language: en
Docid: doc-en-9633
--------------------------------------------------
Text:
mars hill church was a christian megachurch, founded by mark driscoll, lief moi, and mike gunn. it was a multi-site church based in seattle, washington and grew from a home bible study to 15 locations in 4 u.s. states. services were offered at its 15 locations; the church also podcast content of wee

Language: fr
Docid: doc-fr-1447
--------------------------------------------------
Text:
la production de café au costa rica représente en 2016 environ 1,2 % de la production mondiale de café, ce qui fait le  grand producteur du monde derrière la côte d'ivoire.

particularités 
les grains de café du costa rica, sont considérés comme parmi les meilleurs dans le monde. tarrazú est pensé p

Language: de
Docid: doc-de-9933
--------------------------------

In [None]:
corpus.store(path='clean_data/lower_case')

## Remove StopWords

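The third step is to remove stop words so that only the more informative words remain. Note that the output below shows no stop-word pass for 'ko', and the 'ko' statistics are unchanged from the previous table.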
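A minimal sketch of stop-word filtering follows. The stop-word set here is a tiny hand-picked sample and `remove_stop_words` is a hypothetical helper; a real pipeline would use a full per-language list, and the list used by `CorpusClean` is not shown in this notebook.

```python
# Tiny illustrative stop-word set; a real list would be far larger.
STOP_WORDS = {"was", "a", "by", "and", "it", "the", "in", "to", "of"}


def remove_stop_words(text, stop_words):
    """Drop whitespace-delimited tokens that appear in the stop-word set."""
    return " ".join(tok for tok in text.split() if tok not in stop_words)


text = "mars hill church was a christian megachurch, founded by mark driscoll"
filtered = remove_stop_words(text, STOP_WORDS)
print(filtered)
# mars hill church christian megachurch, founded mark driscoll
```

Because tokens are compared as-is, punctuation stays attached to words (as in `megachurch,`), and the text must already be lower-cased for the lookup to match.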

In [7]:
corpus.stop_words()

Removing stop words for 'en' docs:   0%|          | 0/207363 [00:00<?, ?it/s]

Removing stop words for 'fr' docs:   0%|          | 0/10676 [00:00<?, ?it/s]

Removing stop words for 'de' docs:   0%|          | 0/10992 [00:00<?, ?it/s]

Removing stop words for 'es' docs:   0%|          | 0/11019 [00:00<?, ?it/s]

Removing stop words for 'it' docs:   0%|          | 0/11250 [00:00<?, ?it/s]

Removing stop words for 'ar' docs:   0%|          | 0/8829 [00:00<?, ?it/s]

Stats for 'en' docs:   0%|          | 0/207363 [00:00<?, ?it/s]

Stats for 'fr' docs:   0%|          | 0/10676 [00:00<?, ?it/s]

Stats for 'de' docs:   0%|          | 0/10992 [00:00<?, ?it/s]

Stats for 'es' docs:   0%|          | 0/11019 [00:00<?, ?it/s]

Stats for 'it' docs:   0%|          | 0/11250 [00:00<?, ?it/s]

Stats for 'ko' docs:   0%|          | 0/7893 [00:00<?, ?it/s]

Stats for 'ar' docs:   0%|          | 0/8829 [00:00<?, ?it/s]

| lang | num_docs | max_words | min_words | mean_words | median_words | 95th_quantile_words |
|------|----------|-----------|-----------|------------|--------------|---------------------|
| en | 207363 | 46759 | 64 | 1399.159247 | 950.0 | 3769.0 |
| fr | 10676 | 42867 | 255 | 3593.587018 | 3168.0 | 8512.5 |
| de | 10992 | 22234 | 2 | 3098.057041 | 2848.5 | 7979.85 |
| es | 11019 | 41715 | 198 | 3126.667846 | 2913.0 | 6916.4 |
| it | 11250 | 31037 | 319 | 3492.4488 | 3167.0 | 7575.05 |
| ko | 7893 | 35738 | 183 | 2700.349677 | 2511.0 | 6702.8 |
| ar | 8829 | 37106 | 235 | 3692.840072 | 3270.0 | 9346.0 |


In [8]:
corpus.show_current_docs()

--------------------------------------------------
Current state of documents (first 300 characters):
--------------------------------------------------

Language: en
Docid: doc-en-9633
--------------------------------------------------
Text:
mars hill church christian megachurch, founded mark driscoll, lief moi, mike gunn. multi-site church based seattle, washington grew home bible study 15 locations 4 u.s. states. services offered 15 locations; church also podcast content weekend services, conferences, internet 260,000 sermon views onl

Language: fr
Docid: doc-fr-1447
--------------------------------------------------
Text:
production café costa rica représente 2016 environ 1,2 % production mondiale café, fait grand producteur monde derrière côte d'ivoire. particularités grains café costa rica, considérés comme parmi meilleurs monde. tarrazú pensé produire plus désirable grains café costa rica. 2012, café tarrazú deven

Language: de
Docid: doc-de-9933
--------------------------------

In [None]:
corpus.store(path='clean_data/lower_case_stop_words')

## Lemmatization

The final step is lemmatization: each word is reduced to its base form (lemma), so that variants of the same word, such as singular and plural forms or different verb tenses, map to a single term.
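The idea can be illustrated with a toy dictionary-based lemmatizer. This is a stand-in, not what `corpus.lemmatization()` does: the lookup table and the `lemmatize` helper are hypothetical, and a real lemmatizer relies on a full morphological dictionary or model for each language.

```python
# Hypothetical lookup table mapping inflected forms to their lemma.
LEMMA_TABLE = {
    "churches": "church",
    "services": "service",
    "offered": "offer",
    "grew": "grow",
    "was": "be",
    "were": "be",
}


def lemmatize(text, table):
    """Replace each token by its lemma when known, else keep it unchanged."""
    return " ".join(table.get(tok, tok) for tok in text.split())


lemmatized = lemmatize("services offered churches grew", LEMMA_TABLE)
print(lemmatized)
# service offer church grow
```

After this step, "church" and "churches" contribute to the same term count, which is the property the retrieval methods benefit from.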

In [10]:
corpus.lemmatization()

In [None]:
corpus.show_current_docs()

In [None]:
corpus.store(path='clean_data/lower_case_stop_words_lemmatization')