# OC IML Project 5: Automatically categorize questions

Stack Overflow is a well-known question-and-answer site for software development.
The goal of this project is to develop *a tag suggestion system* for the site. It takes the form of a machine learning algorithm that automatically assigns several relevant tags to a question.


This notebook contains:
- API preparation

## Imports

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import seaborn as sns
sns.set(color_codes=True, font_scale=1.33)

import numpy as np
import pandas as pd
pd.set_option("display.max_columns", 25)

import string
from string import punctuation 

import re

from sklearn import metrics
from sklearn.model_selection import GridSearchCV
import scipy.stats as st

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from collections import defaultdict
from nltk.stem.snowball import EnglishStemmer
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.decomposition import NMF

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.dummy import DummyClassifier
from sklearn.metrics import hamming_loss
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn import model_selection
# sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone package
import joblib
from skmultilearn.problem_transform import BinaryRelevance

import time
import datetime

import pickle

import math

[nltk_data] Downloading package punkt to /Users/gregory/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/gregory/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/gregory/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Definitions

In [3]:
# source paths
PATH_SOURCE_QUESTIONS = '../../data/QueryResults.csv' 
# export path
PATH_EXPORT_FOLDER = '../../data/'

## Useful functions

In [31]:
# translation table mapping every punctuation character to a space
replace_punctuation = str.maketrans(string.punctuation,
                                    ' '*len(string.punctuation))
def cleaning_text(questions_curr):
    '''
    Clean a raw question (title + HTML body) before vectorization.
    Relies on the global stopword list `sw`.
    '''
    # tokenize, lowercase and drop stopwords
    questions_curr = ' '.join([w.lower() for w in
                               nltk.word_tokenize(questions_curr)
                               if w.lower() not in sw])
    # collapse whitespace runs (including newlines) into one space
    questions_curr = re.sub(r'\s+', ' ', questions_curr)
    # replace single quotes with a space
    questions_curr = re.sub(r"'", " ", questions_curr)
    # delete HTML tags
    questions_curr = re.sub(r'<[^<]+?>', ' ', questions_curr)
    # delete standalone numbers
    # (example: delete "123" but not "a123")
    questions_curr = re.sub(r'\b\d+\b', '', questions_curr)
    # replace punctuation with spaces
    questions_curr = questions_curr.translate(replace_punctuation)

    return questions_curr
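
To see what the regex and translation steps do on their own, here is a minimal sketch that uses only the standard library. The helper name `strip_markup_numbers_punct` and the sample string are hypothetical. Unlike `cleaning_text`, it skips NLTK tokenization, lowercasing, and stopword removal, and it collapses whitespace last, so its output will not match the notebook's function exactly.

```python
import re
import string

# Same translation table as in the notebook: each punctuation char -> one space
replace_punctuation = str.maketrans(string.punctuation,
                                    ' ' * len(string.punctuation))

def strip_markup_numbers_punct(text):
    """Hypothetical helper: regex/translate steps only, no NLTK tokenization."""
    text = re.sub(r'<[^<]+?>', ' ', text)        # drop HTML tags
    text = re.sub(r'\b\d+\b', '', text)          # drop standalone numbers
    text = text.translate(replace_punctuation)   # punctuation -> spaces
    return re.sub(r'\s+', ' ', text).strip()     # collapse whitespace

sample = "<p>Insert row 1214 into table a123, please!</p>"
cleaned = strip_markup_numbers_punct(sample)
print(cleaned)  # Insert row into table a123 please
```

The standalone number `1214` disappears while `a123` survives, because `\b\d+\b` only matches digit runs bounded by non-word characters.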

In [9]:
stemmer = EnglishStemmer()

def stem_tokens(tokens, stemmer):
    '''
    Stem the tokens, keeping only those that start with
    at least 3 alphanumeric characters.
    '''
    stemmed = []
    for item in tokens:
        if re.match('[a-zA-Z0-9]{3,}',item):
            stemmed.append(stemmer.stem(item))
    return stemmed

def myTokenizer(text):
    '''
    Create tokens from text
    '''
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems
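
The token filter in `stem_tokens` uses `re.match`, which anchors only at the start of the token. A token is therefore kept when its first three characters are alphanumeric, whatever follows. The sketch below isolates that filter without stemming, so it needs no NLTK. The helper `keeps` and the token list are made up for illustration.

```python
import re

# Same pattern as in stem_tokens
TOKEN_PATTERN = re.compile(r'[a-zA-Z0-9]{3,}')

def keeps(token):
    """True if the notebook's filter would keep this token (before stemming)."""
    return TOKEN_PATTERN.match(token) is not None

tokens = ["sql", "to", "n't", "wordid", "c++", "abc!"]
kept = [t for t in tokens if keeps(t)]
print(kept)  # ['sql', 'wordid', 'abc!']
```

If the intent were to require the whole token to be alphanumeric, `re.fullmatch` would be the stricter choice. Changing it would also change the vocabulary seen by the already-trained vectorizer, so that is only worth doing when retraining.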

## Supervised model

### Compress models for API

In [4]:
# load (joblib.load accepts a path, so no file handle is left open)
clf = joblib.load(PATH_EXPORT_FOLDER + \
    'model_RF_tags51_max_depthNone_max_features31_min_samples_split2_n_estimators25.pkl')

In [6]:
# compress
joblib.dump(clf, PATH_EXPORT_FOLDER + \
    'mdl_cmp_RF_tags51_max_depthNone_max_features31_min_samples_split2_n_estimators25.pkl',
    compress=True)

['../../data/mdl_cmp_RF_tags51_max_depthNone_max_features31_min_samples_split2_n_estimators25.pkl']
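
As a rough illustration of what `compress=True` buys, the sketch below dumps the same object with and without compression and compares file sizes. The payload is a made-up, highly repetitive dictionary, not the notebook's random forest, so the ratio seen here says nothing about the real model.

```python
import os
import tempfile

import joblib

# Hypothetical payload: very repetitive, so it compresses well
payload = {"weights": [0.0] * 100_000}

with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "raw.pkl")
    cmp_path = os.path.join(tmp, "cmp.pkl")

    joblib.dump(payload, raw_path)                 # no compression
    joblib.dump(payload, cmp_path, compress=True)  # zlib, default level

    raw_size = os.path.getsize(raw_path)
    cmp_size = os.path.getsize(cmp_path)

    # joblib.load detects the compression on its own
    restored = joblib.load(cmp_path)

print(f"raw: {raw_size} bytes, compressed: {cmp_size} bytes")
```

A smaller file is easier to ship with an API, at the cost of slower loading because the file has to be decompressed.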

### Load from disk other useful tools

In [10]:
# CountVectorizer
tf_vectorizer_sup_1 = joblib.load(PATH_EXPORT_FOLDER + 'cvect_tags51.pkl')
# TfidfTransformer 
tfidf_transformer_sup_1 = joblib.load(PATH_EXPORT_FOLDER + 'tfidf_tags51.pkl')
# MultiLabelBinarizer
mlb = joblib.load(PATH_EXPORT_FOLDER + 'mlb_tags51.pkl')

### Load stopwords [TODO]

In [30]:
sw = ["p"]

### Predict tags

#### Input Test Question

In [11]:
df_quest = pd.read_csv(PATH_SOURCE_QUESTIONS, sep=',')

In [28]:
quest_text = df_quest[df_quest["Id"] == 50000005]["Title"] + " " + \
    df_quest[df_quest["Id"] == 50000005]["Body"]
quest_text = quest_text.values[0]
print("Question Text:\n", quest_text)

Question Text:
 How to insert an entry to a table only if it does not exist <p>My table looks like this  on sql server</p>

<pre><code>wordId     word      
----------------
1214       pen           
1215       men    
1216       cat  
</code></pre>

<p>WordId and word is being passed with the stored procedure and,I need to check on my stored procedure if the wordId already exists on the table or not, and only if the wordId doesn't exists I need to execute the insert statement.  </p>



#### Clean Text

In [34]:
quest_text_cleaned = cleaning_text(quest_text)
print("Question cleaned:\n", quest_text_cleaned)

Question cleaned:
 how to insert an entry to a table only if it does not exist   my table looks like this on sql server       wordid word                          pen  men  cat       wordid and word is being passed with the stored procedure and   i need to check on my stored procedure if the wordid already exists on the table or not   and only if the wordid does n t exists i need to execute the insert statement    


#### CountVectorize

In [36]:
contVectValue = tf_vectorizer_sup_1.transform([quest_text_cleaned])
contVectValue

<1x1000 sparse matrix of type '<class 'numpy.int64'>'
	with 17 stored elements in Compressed Sparse Row format>

#### TfidfTransform

In [37]:
tfIdfValue = tfidf_transformer_sup_1.transform(contVectValue)
tfIdfValue

<1x1000 sparse matrix of type '<class 'numpy.float64'>'
	with 17 stored elements in Compressed Sparse Row format>

#### Predict

In [38]:
encoded_y_pred = clf.predict(tfIdfValue)

#### Decode tags

In [39]:
mlb.inverse_transform(encoded_y_pred)

[('sql', 'sql-server')]
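
For the API, the steps above (vectorize, TF-IDF transform, predict, decode) can be wrapped in a single function. The sketch below is one possible shape, trained on a tiny invented corpus so that it runs on its own. The function name `predict_tags`, the toy documents, and the toy tags are assumptions. In the real API the fitted objects would be the ones loaded from disk, and the input would first go through `cleaning_text`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import MultiLabelBinarizer

# Toy training data, invented for illustration (not the notebook's dataset)
docs = [
    "insert row sql server table",
    "select join sql query",
    "python list comprehension loop",
    "python pandas dataframe merge",
]
tags = [("sql", "sql-server"), ("sql",), ("python",), ("python", "pandas")]

vectorizer = CountVectorizer()
tfidf = TfidfTransformer()
mlb = MultiLabelBinarizer()

X = tfidf.fit_transform(vectorizer.fit_transform(docs))
Y = mlb.fit_transform(tags)  # binary indicator matrix, one column per tag
clf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, Y)

def predict_tags(text):
    """Hypothetical API helper: cleaned text in, list with one tuple of tags out."""
    counts = vectorizer.transform([text])        # term counts
    weights = tfidf.transform(counts)            # TF-IDF weights
    encoded = clf.predict(weights).astype(int)   # 0/1 indicator row
    return mlb.inverse_transform(encoded)        # back to tag names

result = predict_tags("insert into sql server table")
print(result)
```

Keeping the four fitted objects together (or bundling them in a scikit-learn `Pipeline` at training time) avoids pairing a model with a vectorizer fitted on a different vocabulary.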