# Data preprocessing
Before applying the chosen unsupervised learning algorithms (*k-means* and *hierarchical clustering*), we first need to vectorize our data. We do this with TF-IDF weighting, which is a sparse bag-of-words representation rather than a word embedding in the strict sense. \
\
This notebook produces two dataframes, one per sub-corpus, after TF-IDF has been applied. The resulting `.csv` files serve as input data for our unsupervised algorithms.

## Step one: cleaning the questions
We remove all *stop words* and keep only the *stems* of the remaining words. A stem is not comparable to a lemma: linguistically speaking, it is not very informative, and this may well be the biggest weakness of the preprocessing method.
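
To make the stem/lemma distinction concrete, here is a minimal sketch using the same Snowball stemmer as the rest of the notebook. The word list is hypothetical and chosen only for illustration. The stemmer strips suffixes by rule, so its output is a truncated form rather than a dictionary word, which is why forms such as `journ` or `longu` appear in the cleaned questions shown at the end of this notebook.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("french")

# Hypothetical sample words, chosen only for illustration
words = ["temps", "journée", "longue", "détails", "orléans"]

# Map each word to its rule-based stem
stems = {word: stemmer.stem(word) for word in words}
for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

A lemmatizer would instead return a dictionary form, at the cost of needing a lexicon or a tagger for French.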

In [1]:
# import modules
import pandas as pd
import re
import string

import nltk
from nltk.corpus import stopwords

In [2]:
# load the two sub-corpora
data_prep = pd.read_csv("../data/corpus_prep.csv")
data_spon = pd.read_csv("../data/corpus_spon.csv")

In [3]:
nltk.download('stopwords')
stemmer = nltk.SnowballStemmer("french")
stopword = set(stopwords.words('french'))


[nltk_data] Downloading package stopwords to /home/lena/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
# question-cleaning function
def clean(question):
    # lowercase, then strip bracketed text, URLs and HTML-like tags
    question = str(question).lower()
    question = re.sub(r'\[.*?\]', '', question)
    question = re.sub(r'https?://\S+|www\.\S+', '', question)
    question = re.sub(r'<.*?>+', '', question)
    # drop ASCII punctuation (this also glues elided forms: "c'est" becomes "cest")
    question = re.sub('[%s]' % re.escape(string.punctuation), '', question)
    question = re.sub(r'\n', '', question)
    # drop tokens that contain a digit
    question = re.sub(r'\w*\d\w*', '', question)
    # remove stop words, then stem what is left
    words = [word for word in question.split(' ') if word not in stopword]
    words = [stemmer.stem(word) for word in words]
    return " ".join(words)
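
The regular-expression steps can be tried in isolation with the sketch below, which uses only the standard library. The helper name `strip_noise` and the sample sentences are hypothetical; stop-word removal and stemming are deliberately left out so that the effect of the substitutions alone is visible.

```python
import re
import string

def strip_noise(text):
    """Hypothetical helper: the regex steps of clean(), without stop words or stemming."""
    text = str(text).lower()
    text = re.sub(r"\[.*?\]", "", text)                # bracketed annotations
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # URLs
    text = re.sub(r"<.*?>+", "", text)                 # HTML-like tags
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)  # ASCII punctuation
    text = re.sub(r"\n", "", text)                     # line breaks
    text = re.sub(r"\w*\d\w*", "", text)               # tokens containing a digit
    return text

# Made-up sentences in the spirit of the corpus
samples = [
    "Depuis combien de temps habitez-vous à Orléans ?",
    "Qu'est-ce que c'est ?",
]
for sample in samples:
    print(repr(strip_noise(sample)))
```

Because hyphens and apostrophes are deleted rather than replaced by a space, `habitez-vous` collapses into `habitezvous`, the form visible in the `corpus_prep.head()` output. Glued forms such as `cest` no longer match the stop-word list, so they survive the filtering step.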

In [5]:
# apply the function to each sub-corpus
data_spon["question"] = data_spon["question"].apply(clean)
data_spon["previous_5_turn"] = data_spon["previous_5_turn"].apply(clean)
data_spon["next_5_turn"] = data_spon["next_5_turn"].apply(clean)

data_prep["question"] = data_prep["question"].apply(clean)
data_prep["previous_5_turn"] = data_prep["previous_5_turn"].apply(clean)
data_prep["next_5_turn"] = data_prep["next_5_turn"].apply(clean)

In [6]:
# keep only the columns used for clustering
clustering_data_spon = data_spon[["previous_5_turn", "question", "next_5_turn"]]
clustering_data_prep = data_prep[["previous_5_turn", "question", "next_5_turn"]]


## Step two: vectorization and TF-IDF
We turn each question into a vector of TF-IDF term weights.
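
As a rough illustration of what a TF-IDF weight is, the sketch below computes the weights by hand with the standard library only. It assumes the smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalization, which matches scikit-learn's default settings. The function name `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)  # raw term counts in this document
        raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()} if norm else {})
    return vectors

# Toy stemmed questions, for illustration only
docs = ["combien temp orléan", "plais orléan", "journ travail"]
vectors = tfidf(docs)
for vec in vectors:
    print({t: round(w, 3) for t, w in vec.items()})
```

A term shared by several documents (here `orléan`) receives a lower weight than a term specific to one document, which is the property the clustering step relies on.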

In [7]:
# apply TF-IDF weighting

# import modules
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.snowball import FrenchStemmer
from nltk.tokenize import RegexpTokenizer

In [8]:
# questions-only corpus
corpus_spon = clustering_data_spon["question"]
corpus_prep = clustering_data_prep["question"]

In [9]:
def vectors_tfidf(corpus):
    # split on runs of word characters
    tokenizer = RegexpTokenizer(r"\w+")
    stopwords_fr = stopwords.words('french')
    # unigrams to trigrams; keep terms found in at least 5 documents and in at most 90% of them
    vectorizer = TfidfVectorizer(tokenizer=tokenizer.tokenize, stop_words=stopwords_fr,
                                analyzer="word", ngram_range=(1, 3), min_df=5, max_df=0.9)
    x = vectorizer.fit_transform(corpus)
    # dense dataframe: one row per question, one column per term
    tf_idf = pd.DataFrame(data=x.toarray(),
                          columns=vectorizer.get_feature_names_out())
    return tf_idf

tf_idf_spon = vectors_tfidf(corpus_spon)
tf_idf_prep = vectors_tfidf(corpus_prep)
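
The `ngram_range=(1, 3)` setting means that sequences of one, two and three consecutive tokens all become candidate features, which is why a two-word column such as `a chos` can appear in the resulting table. The sketch below shows the idea in plain Python; `word_ngrams` is a hypothetical helper, not part of scikit-learn.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """List every run of min_n to max_n consecutive tokens, joined by a space."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

# One cleaned question taken from the spontaneous sub-corpus
print(word_ngrams("quoi ça consist".split()))
# ['quoi', 'ça', 'consist', 'quoi ça', 'ça consist', 'quoi ça consist']
```

In the notebook, `min_df=5` then discards most of these candidates, since longer n-grams rarely occur in five or more questions.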



In [14]:
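# number of questions (rows) in the spontaneous sub-corpus
print("{} rows".format(tf_idf_spon.shape[0]))
# five terms with the highest TF-IDF weight in the first question (column 0 once transposed);
# a notebook cell renders only its last expression, so only the prep table appears below
tf_idf_spon.T.nlargest(5, 0)
tf_idf_prep.T.nlargest(5, 0)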

634 rows


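| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 435 | 436 | 437 | 438 | 439 | 440 | 441 | 442 | 443 | 444 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| temp | 0.613833 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.646865 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| combien | 0.583904 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| orléan | 0.531287 | 1.0 | 0.905057 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| a | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.126186 | 0.252564 | 0.0 | 0.0 |
| a chos | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.206196 | 0.0 | 0.0 |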


Finally, we save the results to two separate `.csv` files.

In [None]:
tf_idf_spon.to_csv("data/tf_idf_spon.csv", encoding="utf-8")
tf_idf_prep.to_csv("data/tf_idf_prep.csv", encoding="utf-8")

In [15]:
corpus_spon.head()

0                    pourquoi ça 
1        quest fait comm travail 
2                quoi ça consist 
3                    tous détail 
4    alor cest journ assez longu 
Name: question, dtype: object

In [16]:
corpus_prep.head()

0    depuis combien temp habitezvous orléan 
1                           plais à  orléan 
2                estce compt rest à  orléan 
3      estce voul bien décrir journ travail 
4                      questc plaît travail 
Name: question, dtype: object