# Text Preprocessing

In any machine learning task, cleaning or preprocessing the data is as important as building the model. Text is one of the least structured forms of data available, and human language is particularly complex to process.
In this brief we will work on preprocessing text data using [NLTK](http://www.nltk.org).

## Technology watch: Natural Language Processing (NLP)
1- Use cases of NLP in everyday life  
2- How Facebook, Google, and Amazon use NLP  
3- Preparing text data  

## Setup


In [3]:

# import the required libraries
import nltk

def display(to_display):
    print()
    print(to_display)
    print()


In [None]:

# download the NLTK data
# (recent NLTK releases may also ask for 'punkt_tab'; follow the LookupError message if one appears)
nltk.download('punkt')
nltk.download('stopwords')


## Data cleaning

In this part we will use [NLTK](http://www.nltk.org) to clean a text from [Wikipedia](https://en.wikipedia.org/wiki/Natural_language_processing) that defines NLP:  
"Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves."

In [5]:

# lowercase: convert the whole text to lowercase
text= 'Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.'

text = text.lower() 

display(text)



natural language processing (nlp) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. the goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. the technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.



In [9]:

# remove punctuation
import string

text = text.translate(str.maketrans('', '', string.punctuation))

display(text)



natural language processing nlp is a subfield of linguistics computer science and artificial intelligence concerned with the interactions between computers and human language in particular how to program computers to process and analyze large amounts of natural language data the goal is a computer capable of understanding the contents of documents including the contextual nuances of the language within them the technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves
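
One caveat: `string.punctuation` contains only the 32 ASCII punctuation characters, so typographic quotes pass through untouched, and deleting a hyphen or an apostrophe glues the neighbouring characters together. The sketch below illustrates this on a made-up sample string (the sample text and the variable names are assumptions, not part of the original notebook).

```python
import string

# Translation table that maps every ASCII punctuation character to None (deletion).
table = str.maketrans('', '', string.punctuation)

# Made-up sample: hyphens, a comma, an apostrophe, a question mark and curly quotes.
sample = 'state-of-the-art NLP, isn\'t it? “yes”'
cleaned = sample.translate(table)

print(len(string.punctuation))  # number of ASCII punctuation characters
print(cleaned)                  # hyphenated words are merged, curly quotes remain
```

If hyphenated words should stay separate, one option is to map punctuation to spaces instead, for example with `str.maketrans(string.punctuation, ' ' * len(string.punctuation))`.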



### Word Tokenization
Tokenization ([Tokenize](https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.casual)) splits character strings into individual words, without spaces or tabs.


In [10]:

from nltk import word_tokenize

tokens = word_tokenize(text)

display(tokens)



['natural', 'language', 'processing', 'nlp', 'is', 'a', 'subfield', 'of', 'linguistics', 'computer', 'science', 'and', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', 'language', 'in', 'particular', 'how', 'to', 'program', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data', 'the', 'goal', 'is', 'a', 'computer', 'capable', 'of', 'understanding', 'the', 'contents', 'of', 'documents', 'including', 'the', 'contextual', 'nuances', 'of', 'the', 'language', 'within', 'them', 'the', 'technology', 'can', 'then', 'accurately', 'extract', 'information', 'and', 'insights', 'contained', 'in', 'the', 'documents', 'as', 'well', 'as', 'categorize', 'and', 'organize', 'the', 'documents', 'themselves']
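
To make the idea concrete without relying on NLTK, here is a deliberately simplified regex tokenizer. It is a stand-in for illustration only, not the algorithm behind `nltk.word_tokenize`, and the function name `simple_tokenize` is hypothetical. It shows why tokenizing is more than calling `str.split()`: punctuation attached to a word becomes a token of its own.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens (toy stand-in)."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches exactly one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

raw = "The goal is a computer capable of understanding."

print(raw.split())           # the final period stays glued to the last word
print(simple_tokenize(raw))  # the final period is a separate token
```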




### Stopwords
Stopwords are words that add no significant meaning to the text. Use NLTK to list the stopwords and remove them from the text.


In [12]:

from nltk.corpus import stopwords

# get the stopwords
english_stopwords = stopwords.words('english')

display(english_stopwords)

# remove the stopwords
display("    > Length with stopwords: " + str(len(tokens)))

tokens = [word for word in tokens if word not in english_stopwords]

display(tokens)
display("    > Length without stopwords: " + str(len(tokens)))




['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 
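
The filtering step itself is a plain membership test. A minimal sketch, assuming a tiny hand-picked stopword set (`TOY_STOPWORDS` is hypothetical and far smaller than NLTK's English list), shows the effect on the first few tokens of the text:

```python
# Hypothetical, hand-picked subset used only for illustration.
TOY_STOPWORDS = {'is', 'a', 'of', 'and', 'the', 'with'}

sample_tokens = ['natural', 'language', 'processing', 'nlp', 'is', 'a',
                 'subfield', 'of', 'linguistics']

# Keep only the tokens that are not in the stopword set.
filtered = [word for word in sample_tokens if word not in TOY_STOPWORDS]

print(len(sample_tokens), len(filtered))
print(filtered)
```

Storing the stopwords in a set rather than a list makes each lookup constant time. The comparison is case-sensitive and NLTK's English list is lowercase, which is one reason the text is lowercased before this step.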

### Stemming
Stemming is the process of reducing words to their stem, base, or root form ([Stemming](https://en.wikipedia.org/wiki/Stemming)).

In [13]:

from nltk.stem.porter import PorterStemmer
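
The cell above only imports `PorterStemmer`. To give an intuition for what a stemmer does, the sketch below strips a few common English suffixes. It is a toy, not the Porter algorithm (which applies several ordered rule phases with conditions on the remaining stem); the function name `naive_stem` and the suffix list are assumptions made for illustration.

```python
# Hypothetical suffix list, tried in this order; this is not Porter's rule set.
SUFFIXES = ('ing', 'ed', 'es', 's')

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix if at least min_stem_len characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

words = ['computers', 'processing', 'documents', 'contained', 'is']
print([naive_stem(w) for w in words])
```

A real stemmer is more aggressive: the Porter output later in this notebook reduces `computers` to `comput`, so that `computer`, `computers`, and `computing` can share one stem.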


## Writing the functions

Implement each text preprocessing step as a function.

In [18]:

# lowercase: convert the whole text to lowercase
def lowercase(text):
    return text.lower()

# remove punctuation
def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))
    
# tokenization
def tokenize(text):
    return nltk.word_tokenize(text)

# stopwords
def remove_stopwords(words, lang='english'):
    # a set makes each membership test constant time
    _stopwords = set(nltk.corpus.stopwords.words(lang))
    return [word for word in words if word not in _stopwords]
    
# stemming
def stemmize(tokens):
    stemmer = PorterStemmer()
    return [stemmer.stem(token) for token in tokens]

tokens = stemmize(tokens) 

display(tokens)
display("    > Length tokens: " + str(len(tokens)))



['natur', 'languag', 'process', 'nlp', 'subfield', 'linguist', 'comput', 'scienc', 'artifici', 'intellig', 'concern', 'interact', 'comput', 'human', 'languag', 'particular', 'program', 'comput', 'process', 'analyz', 'larg', 'amount', 'natur', 'languag', 'data', 'goal', 'comput', 'capabl', 'understand', 'content', 'document', 'includ', 'contextu', 'nuanc', 'languag', 'within', 'technolog', 'accur', 'extract', 'inform', 'insight', 'contain', 'document', 'well', 'categor', 'organ', 'document']


    > Length tokens: 47
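
Once each step is a function, the whole preprocessing can be expressed as one pipeline. The sketch below assumes a hypothetical helper `run_pipeline` and uses small stdlib-only stand-ins for the tokenization and stopword steps so that it runs without NLTK downloads. In the notebook, the same helper could receive `lowercase`, `remove_punctuation`, `tokenize`, `remove_stopwords`, and `stemmize` instead.

```python
import string
from functools import reduce

def run_pipeline(data, steps):
    """Feed data through each step in order; each output is the next step's input."""
    return reduce(lambda value, step: step(value), steps, data)

# Stand-in steps (assumptions for illustration, not the NLTK-based versions).
def lowercase(text):
    return text.lower()

def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

def split_on_whitespace(text):
    return text.split()

def drop_toy_stopwords(words, stopwords=frozenset({'the', 'is', 'a', 'of'})):
    return [word for word in words if word not in stopwords]

steps = [lowercase, remove_punctuation, split_on_whitespace, drop_toy_stopwords]
result = run_pipeline('The goal is a computer capable of "understanding".', steps)
print(result)
```

Keeping the steps in a list makes the order explicit and easy to change, which matters here: stopword removal expects lowercase tokens, so it has to come after lowercasing and tokenization.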



# What about Twitter messages !! :)

In this part we will apply the text preprocessing steps to a dataset of Twitter messages.

In [19]:

import nltk                                
from nltk.corpus import twitter_samples    
import matplotlib.pyplot as plt            
import random       


In [None]:

nltk.download('twitter_samples')


In [21]:

# select the set of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')


In [24]:

# print a random positive tweet in green
# (random.choice avoids the out-of-range index that randint(0, 5000) can produce)
display('\033[92m' + random.choice(all_positive_tweets))

# print a random negative tweet in red
display('\033[91m' + random.choice(all_negative_tweets))



@aaronbethunee I'm sofa surfing :) cunt


@thebodycoach Joe I'm sick can you come round and make me soup ? :(
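
Tweets contain elements that the generic steps above handle poorly: user handles, URLs, and emoticons such as `:)` or `:(`, which `remove_punctuation` would delete even though they carry the sentiment. A minimal regex sketch, assuming a hypothetical helper `clean_tweet` and simplified patterns (real handles and URLs have more edge cases), strips handles and links while leaving emoticons intact:

```python
import re

HANDLE_RE = re.compile(r'@\w+')        # simplified pattern for @user mentions
URL_RE = re.compile(r'https?://\S+')   # simplified pattern for http/https links

def clean_tweet(tweet):
    """Remove handles and URLs, then collapse repeated whitespace."""
    tweet = HANDLE_RE.sub('', tweet)
    tweet = URL_RE.sub('', tweet)
    return ' '.join(tweet.split())

example = "@thebodycoach Joe I'm sick can you come round and make me soup ? :("
print(clean_tweet(example))
```

NLTK also ships a tweet-aware tokenizer in `nltk.tokenize.casual` (the module linked in the tokenization section), which may be a better fit than hand-written patterns for real data.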

