# Automatically categorizing questions: OC project 6 #

## Generating a test set for project 6 ##

We will evaluate the prediction quality of our algorithms on a set of questions that is completely independent of the data used for training. This evaluation set contains about 10,000 questions.

In [1]:
import pandas as pd
import numpy as np
import nltk
import seaborn as sns
import matplotlib.pyplot as plt
import string
import re

from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize

## I - Importing the data ##

In [2]:
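# Load the five query exports (QueryResults_20.csv to QueryResults_24.csv).
table_raw = []
for i in range(5):
    df = pd.read_csv(f"QueryResults_{i + 20}.csv", sep=",", encoding="utf-8")
    table_raw.append(df)
# Stack the exports, then drop duplicate rows and rows without a title.
table_raw = pd.concat(table_raw)
table_raw.drop_duplicates(inplace=True)
table_raw.dropna(axis=0, how="any", subset=["Title"], inplace=True)
# reset_index keeps the old index as an "index" column, which is dropped later.
table_raw.reset_index(inplace=True)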

## II - Processing the data ##

In [3]:
table_raw.shape

(66232, 23)

In [4]:
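# Drop every column that contains at least one missing value.
table_raw.dropna(axis=1, how="any", inplace=True)
# Remove the remaining identifier and activity columns.
table_raw.drop(["index", "Id", "PostTypeId", "ViewCount",
                "LastActivityDate", "AnswerCount", "CommentCount"], axis=1, inplace=True)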

In [7]:
# Draw a random 15% sample (no seed is set, so the rows differ between runs).
table_eval = table_raw.sample(frac=0.15)
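
The sampling step above can be made repeatable. The following is a minimal sketch, not the notebook's own code: it assumes a toy DataFrame `toy` standing in for `table_raw`, and a `random_state` value of 42 chosen arbitrarily. Passing `random_state` to `DataFrame.sample` makes the same rows come out on every run.

```python
import pandas as pd

# Toy stand-in for table_raw (hypothetical data, for illustration only).
toy = pd.DataFrame({"Title": [f"question {i}" for i in range(100)]})

# random_state fixes the seed, so both calls draw exactly the same rows.
eval_a = toy.sample(frac=0.15, random_state=42)
eval_b = toy.sample(frac=0.15, random_state=42)

# Rows that were not sampled can be recovered through the index.
rest = toy.drop(eval_a.index)
```

With `frac=0.15`, pandas rounds the requested fraction to a whole number of rows, which is why 66232 rows give a sample of 9935.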

In [8]:
table_eval.shape

(9935, 5)

## III - Cleaning the data ##

In [10]:
# Strip the HTML markup from each question body, keeping only the text.
table_eval["Body"] = table_eval["Body"].apply(lambda x: BeautifulSoup(x, "html.parser").get_text())
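
To see what this cell does to a single value, here is a small sketch on a made-up HTML string (the `html` sample is hypothetical, not taken from the dataset). `get_text()` returns the concatenated text nodes and discards the tags.

```python
from bs4 import BeautifulSoup

# Hypothetical question body, for illustration only.
html = "<p>Use <code>math.log(x, n)</code> for a base-n logarithm.</p>"

# Parse with the built-in html.parser backend and keep only the text.
text = BeautifulSoup(html, "html.parser").get_text()
```

Note that code snippets inside `<code>` tags are kept as plain text, so they remain part of the cleaned body.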

In [11]:
# Turn "<tag1><tag2>" strings into Python lists of tag names.
table_eval["Tags"] = table_eval["Tags"].apply(lambda x: re.sub(r'><',' ',x))
table_eval["Tags"] = table_eval["Tags"].apply(lambda x: re.sub(r'[<>]','',x).split())
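
The two substitutions above can also be written as a single pattern match. This is a sketch under the assumption, inferred from the regular expressions in the cell, that raw tags look like `<java><eclipse><package>`; the helper name `parse_tags` is hypothetical.

```python
import re

def parse_tags(raw):
    """Return the tag names found between angle brackets in `raw`."""
    # Each match captures the text between one "<" and the next ">".
    return re.findall(r"<([^<>]+)>", raw)

tags = parse_tags("<java><eclipse><package>")
```

For well-formed tag strings without whitespace inside the brackets, this gives the same lists as the replace-then-split approach.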

In [12]:
# Concatenate title and body into a single text field.
table_eval["BodyTitle"] = table_eval["Title"] + " " + table_eval["Body"]

In [13]:
table_eval.head(10)

| | CreationDate | Score | Body | Title | Tags | BodyTitle |
|---|---|---|---|---|---|---|
| 60921 | 2011-02-25 12:32:06 | 16 | My Eclipse Java package is treated as a folder... | Why is my Eclipse Java package being treated a... | [java, eclipse, package] | Why is my Eclipse Java package being treated a... |
| 19397 | 2010-12-21 02:11:39 | 7 | I've been wondering how to do "true" (semi) re... | Best approach for (cross-platform) real-time d... | [php, javascript, push] | Best approach for (cross-platform) real-time d... |
| 38220 | 2011-01-23 22:27:42 | 5 | This one seems like an easy one, but I'm havin... | Calculating Base-n logarithm in Ruby | [ruby, math, logarithm] | Calculating Base-n logarithm in Ruby This one ... |
| 12795 | 2010-11-16 08:33:22 | 18 | I've researched a bit about how to achieve wha... | C# How to simply encrypt a text file with a PG... | [c#, encryption, public-key, pgp, public-key-e... | C# How to simply encrypt a text file with a PG... |
| 56978 | 2011-02-20 07:26:35 | 12 | I would like to return the contents of a cell ... | How can I detect when a user is finished editi... | [delphi, events, tstringgrid] | How can I detect when a user is finished editi... |
| 64663 | 2011-03-03 02:43:09 | 13 | I'm getting a lot of "Unknown type" warnings w... | Why doesn't Closure Compiler recognize type de... | [javascript, design-patterns, google-closure-c... | Why doesn't Closure Compiler recognize type de... |
| 32638 | 2011-01-14 20:20:21 | 15 | At my place of employment we have a temperamen... | Mirroring the official nuget package repository | [powershell, nuget] | Mirroring the official nuget package repositor... |
| 657 | 2010-10-24 19:37:10 | 7 | Suppose I have a class\nclass C {\n C(in... | constructor with one default parameter | [c++, constructor, default-parameters] | constructor with one default parameter Suppose... |
| 13167 | 2010-11-16 19:49:02 | 5 | So, here's the problem. iPhones are awesome, ... | Single request to multiple asynchronous responses | [iphone, objective-c, networking, httprequest,... | Single request to multiple asynchronous respon... |
| 51092 | 2011-02-11 09:21:08 | 7 | I have an interface TestInterface<U,V> that ha... | Guice annotatedWith for interface with Generics | [guice] | Guice annotatedWith for interface with Generic... |


## IV - Saving the test data ##

In [14]:
table_eval.to_json("table_eval.json")
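
A file written this way can be loaded again with `pd.read_json`. The sketch below checks the round trip on a toy DataFrame; the frame `toy`, the file name `toy_eval.json`, and the temporary directory are all assumptions made for illustration. One reason JSON suits this table is that list-valued cells such as `Tags` come back as Python lists.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for table_eval (hypothetical rows, for illustration only).
toy = pd.DataFrame(
    {
        "Title": ["Eclipse package question", "Guice question"],
        "Tags": [["java", "eclipse", "package"], ["guice"]],
    },
    index=[60921, 51092],
)

# Write to a temporary location, then read the file back.
path = os.path.join(tempfile.mkdtemp(), "toy_eval.json")
toy.to_json(path)
loaded = pd.read_json(path)
```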