# Quora Insincere Questions Classification
# Part 1: Preprocessing

## Introduction
We want to classify Quora questions (mostly a single sentence each) into 2 classes:
* Normal question
* Toxic trolling

To do so, we have a training set of 1306122 labeled questions (0/1 labels) and a test set of 56372 unlabeled questions. Of the 1306122 training questions, 80810 are labeled as trolling and the remaining 1225312 are not. This gives us a heavily imbalanced data set (we might want to use weight balancing for that: https://www.kdnuggets.com/2018/12/handling-imbalanced-datasets-deep-learning.html).
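
As a rough illustration of what weight balancing could look like with these counts, here is a minimal sketch using the common "balanced" heuristic (weight of a class = total / (number of classes × class count)). The heuristic and the `class_weight` name are assumptions for illustration, not something already chosen in this notebook.

```python
# Class counts reported above for train.csv
n_genuine = 1225312
n_troll = 80810
n_total = n_genuine + n_troll

# "Balanced" heuristic: each class contributes the same total weight to the loss
class_weight = {
    0: n_total / (2 * n_genuine),  # genuine questions, slightly down-weighted
    1: n_total / (2 * n_troll),    # troll questions, strongly up-weighted
}
print(class_weight)
```

A dictionary of this shape is the kind of thing many training APIs accept as per-class weights, so the rare class counts as much as the frequent one overall.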

## Ideas
Things I want to try:
* weight balancing (see above);
* specializing the projection on the training data (will need a small enough model to specialize, GloVe is humongous);
* cross-validation on hyperparameters (probably not...);
* looking at similar problems, such as:
 * labeled emotion detection;
 * etc.
* A model that progressively refines its idea of the class while reading the sentence, word by word
* Taking the average projection of a sentence makes us lose the sentence length as a variable
* A two-layer embedding to capture more semantic content
* Put the GloVe data in an SQL database for memory-manageable projections
* Note to self: python multithreading is still garbage in practice.

## Data Preprocessing

### Training / Test Data Split
To start, we are going to split the training data into training / testing sets and project them onto one of the given projections. We will keep the same ratio of normal / toxic data in the training and testing sets.

Analysing the data shows that some lines have a weird, hard-to-parse structure. Most of them are of type:
* *id,"Question",[01]*

But some questions span several lines and contain intricate """ quoting. Well, let's hope pandas can deal with that.
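
To get a feel for why pandas should cope, here is a small sketch on a made-up two-row CSV (the rows are invented for illustration; only the column names come from the real file). Standard CSV quoting allows a quoted field to contain line breaks and doubled quotes, and `pd.read_csv` follows that convention by default.

```python
import io
import pandas as pd

# Invented sample: the second question spans two lines and contains "" escapes
raw = (
    'qid,question_text,target\n'
    'a1,"A plain one-line question?",0\n'
    'a2,"A question with a ""quoted"" word\n'
    'that continues on a second line?",1\n'
)

df = pd.read_csv(io.StringIO(raw))
print(df.shape)
print(df.loc[1, "question_text"])
```

The multi-line question is read back as a single cell, with the line break kept inside the text.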

In [2]:
import os
import numpy as np
import tensorflow as tf
import pandas as pd

import src.quora.preproc as pp

model_path = "data/embedded"

In [5]:
data = pd.read_csv("data/train.csv")
print(data.shape)
print(data.iloc[522266]["question_text"])

(1306122, 3)
In "Star Trek 2013" why did they :

*Spoilers*
*Spoilers*
*Spoilers*
*Spoilers*

1)Make warping look quite a bit like an hyperspace jump
2)what in the world were those bright particles as soon as they jumped.
3)Why in the world did they make it possible for two entities to react in warp space in separate jumps.
4)Why did Spock get emotions for this movie.
5)What was the point of hiding the "Enterprise" underwater.
6)When they were intercepted by the dark ship, how come they reached Earth when they were far away from her.(I don't seem to remember the scene where they warp to earth).
7)How did the ship enter earth's atmosphere when it wasnt even in orbit.
8)When Scotty opened the door of the black ship , how come pike and khan didn't slow down?


It seems OK.
Let's split our data into train / test samples, say with a 90 / 10 split (we only have about 80000 trolling questions, so the testing set should not be too small). pp.split_quora_csv will do the job. This function keeps the same proportion of troll data in both training and testing sets.

In [1]:
import src.quora.preproc as pp
pp.split_quora_csv("data/train.csv",train_prop=0.9,
                   output_train=os.path.join(model_path,"training_90.csv"),
                   output_test=os.path.join(model_path,"test_90.csv"))

Amount of troll data = 80810
Amount of genuine data = 1225312
Train data shape = (1175509, 3)
Test data shape = (130613, 3)
Saved in training_90.csv and test_90.csv


A quick grep on both files confirms that they have the same proportion of troll inputs, about 6.19% (80810 / 1306122).
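
The code of pp.split_quora_csv is not shown here, so the following is only a sketch of how a split that preserves the class ratio could be written with pandas. The function name `stratified_split`, its parameters and the toy frame are hypothetical.

```python
import pandas as pd

def stratified_split(data, train_prop=0.9, label_col="target", seed=0):
    """Split `data` so each class keeps the same share in both parts.

    Assumes the index of `data` is unique.
    """
    # Sample train_prop of the rows separately inside each class
    parts = [
        group.sample(frac=train_prop, random_state=seed)
        for _, group in data.groupby(label_col)
    ]
    train = pd.concat(parts)
    # Everything that was not sampled goes to the test set
    test = data.drop(train.index)
    return train, test

# Toy frame: 10 troll rows out of 100
toy = pd.DataFrame({"qid": range(100), "target": [1] * 10 + [0] * 90})
train, test = stratified_split(toy)
print(train["target"].mean(), test["target"].mean())
```

Sampling inside each class, instead of over the whole frame, is what keeps the troll share identical on both sides of the split.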

### Data Projection
The next step is to project our data onto one of the given vector spaces. We are going to start with GloVe and go all the way to a proper classifier before trying other spaces or specializing one.

In [3]:
# Loading data
train_data = pd.read_csv(os.path.join(model_path,"training_90.csv"))
test_data = pd.read_csv(os.path.join(model_path,"test_90.csv"))

# Loading glove projections in a dictionary
space = pp.load_glove("glove/glove.6B.50d.txt")
print("Elements in space: {}".format(len(space)))
print("Projection shape: {}".format(next(iter(space.values())).shape))

Elements in space: 400000
Projection shape: (50, 1)
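
For reference, a GloVe text file holds one word per line followed by its space-separated coordinates. The sketch below shows one way such a file could be read into a dictionary of column vectors, matching the (50, 1) shape printed above. It is not the actual pp.load_glove; the name `load_glove_sketch` and the tiny 3-dimensional file are made up for the example.

```python
import os
import tempfile
import numpy as np

def load_glove_sketch(path):
    """Read a GloVe-style text file into {word: array of shape (d, 1)}."""
    space = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, values = parts[0], parts[1:]
            # Store each vector as a column, like the (50, 1) shape above
            space[word] = np.asarray(values, dtype=np.float64).reshape(-1, 1)
    return space

# Tiny invented file with 3-dimensional vectors
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_glove.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("cats 0.1 0.2 0.3\nlike 0.4 0.5 0.6\n")
    toy_space = load_glove_sketch(path)

print(len(toy_space), toy_space["cats"].shape)
```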


Next we project sentences, word by word, into this space. In our first naive approach, we will take the mean of all the word vectors in the sentence as the input.
Let's try with a simple sentence:

In [4]:
#sentence = train_data.sample(n=1,random_state=1)["question_text"].values[0]
sentence = "I like cats"
proj = pp.project_sentence(sentence,space)
print(sentence)
sentence = sentence.lower()
words = sentence.split()
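# Check the first 3 dimensions: each projected value should be
# the mean of the 3 word vectors along that dimension
for dim in range(3):
    print("({} + {} + {}) / 3 = {}".format(
        space[words[0]][dim][0],space[words[1]][dim][0],space[words[2]][dim][0],proj[dim][0]))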

I like cats
(0.11891 + 0.36808 + 0.43082) / 3 = 0.3059366666666667
(0.15255 + 0.20834 + -0.85216) / 3 = -0.1637566666666667
(-0.082073 + -0.22319 + -0.55639) / 3 = -0.28721766666666665


Seems to work well. 
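
The check above is consistent with a plain average of the word vectors. A minimal sketch of such an averaging function is given below; it is not the actual pp.project_sentence. The name `project_sentence_sketch`, the `dim` parameter, and the choice to skip unknown words and return zeros for an empty result are all assumptions.

```python
import numpy as np

def project_sentence_sketch(sentence, space, dim):
    """Average the vectors of the known words of `sentence`."""
    # Lowercase and split on whitespace, as in the check above
    vectors = [space[w] for w in sentence.lower().split() if w in space]
    if not vectors:
        # Assumption: a sentence with no known word maps to the zero vector
        return np.zeros((dim, 1))
    return np.mean(vectors, axis=0)

# Toy 2-dimensional space, invented for the example
toy_space = {
    "i": np.array([[1.0], [0.0]]),
    "like": np.array([[0.0], [1.0]]),
    "cats": np.array([[2.0], [2.0]]),
}
proj = project_sentence_sketch("I like cats", toy_space, dim=2)
print(proj.ravel())
```

With an average, "cats" and "cats cats" get exactly the same projection, which is the loss of sentence length noted in the ideas list.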

Let's now project the whole set and save it. project_data runs in parallel, using 4 threads by default.

In [11]:
# For training data first
%time train_proj = pp.project_data(train_data,space)
pp.save_projections(train_data["qid"],train_proj,train_data["target"],os.path.join(model_path,"training_90p_50d.proj"))

CPU times: user 2min 44s, sys: 224 ms, total: 2min 45s
Wall time: 2min 45s


In [12]:
# For test data
%time test_proj = pp.project_data(test_data,space)
pp.save_projections(test_data["qid"],test_proj,test_data["target"],os.path.join(model_path,"test_90p_50d.proj"))

CPU times: user 18 s, sys: 27.5 ms, total: 18.1 s
Wall time: 18.1 s


The data are now projected and saved. The next step will be to connect them to a neural network model.
For the sake of fast prototyping and testing, we also save a reduced training set of only the first 50,000 elements.

In [13]:
# let's check how many 1s are in the 50000 elements
nb_el = 50000
print("Global ratio = {}".format(sum(train_data["target"][:])/train_data.shape[0]))
print("{} ratio = {}".format(nb_el,sum(train_data["target"][:nb_el])/nb_el))

Global ratio = 0.06187021962400968
50000 ratio = 0.05996


In [14]:
# And let's save that
pp.save_projections(train_data["qid"][:nb_el],train_proj[:nb_el],train_data["target"][:nb_el],
                    os.path.join(model_path,"training_90p_50d_{}.proj".format(nb_el)))