# Quora Insincere Questions Classification

## Introduction
We want to classify Quora questions (mostly a single sentence each) into two classes:
* Normal question
* Toxic trolling

To do so, we have a labeled training set of 1306122 questions (0/1 labels) and an unlabeled test set of 56372 questions. Of the 1306122 training questions, 80810 are labeled as trolling and the remaining 1225312 are not. This gives us a heavily imbalanced data set (we might want to use weight balancing for that: https://www.kdnuggets.com/2018/12/handling-imbalanced-datasets-deep-learning.html).
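As a rough illustration of weight balancing, the sketch below derives per-class weights from the label counts quoted above and applies them in a weighted log loss. The "balanced" formula (total samples divided by number of classes times class count) is one common heuristic, not necessarily the one that will be used later, and `weighted_log_loss` is a hypothetical helper written for this example.

```python
import numpy as np

# Class counts reported for the training data (index 0 = genuine, 1 = trolling)
counts = np.array([1225312, 80810])

# "Balanced" heuristic: weight_c = n_samples / (n_classes * count_c)
class_weights = counts.sum() / (len(counts) * counts)

def weighted_log_loss(y_true, p_pred, weights, eps=1e-12):
    """Binary cross-entropy where each sample is weighted by its class weight."""
    y_true = np.asarray(y_true)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    sample_w = weights[y_true]  # look up the weight of each sample's class
    losses = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return np.sum(sample_w * losses) / np.sum(sample_w)

print(class_weights)
```

With these counts, a mistake on a trolling question costs roughly fifteen times as much as a mistake on a genuine one, which is the inverse of the class ratio.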

## Ideas
Things I want to try:
* weight balancing (see above);
* specializing the projection on the training data (this will need a model small enough to specialize; GloVe is humongous);
* cross-validation on hyperparameters (probably not...);
* finding similar problems, for example:
 * label emotion detection;
 * etc
* a model that progressively refines its idea of the class while reading the sentence, word by word;
* taking the average projection of a sentence makes us lose the sentence length as a variable;
* two embedding layers to capture more semantic content.

## Data Preprocessing

### Training / Test Data Split
To start, we are going to split the training data into training / testing sets and project them onto one of the given projections. We will keep the same ratio of normal / toxic data in the training and testing sets.

Analysing the data shows that some lines have a weird, hard-to-parse structure. Most of them are of the form:
* *id,"Question",[01]*

But some questions span several lines and contain intricate """ sequences. Well, let's hope pandas can deal with that.
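The quoting involved is standard CSV: a quoted field may contain line breaks, and a literal quote inside it is written as two quotes. Here is a minimal sketch with Python's built-in `csv` module; the three rows are invented for illustration and are not taken from train.csv.

```python
import csv
import io

# Made-up sample: the second record spans two lines and contains doubled quotes
raw = (
    'qid,question_text,target\n'
    'a1,"Short question?",0\n'
    'a2,"First line\nsecond line with ""quotes"" inside",1\n'
)

rows = list(csv.reader(io.StringIO(raw, newline="")))
for row in rows:
    print(row)
```

The multi-line record comes back as a single row whose text field keeps its line break. As far as I know, `pd.read_csv` applies the same quoting rules by default, which would explain why it copes with these lines.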

In [1]:
import os
import numpy as np
import tensorflow as tf
import pandas as pd

datadir = "/media/marc/DATA/DATA/kaggle/01-quora"

In [7]:
data = pd.read_csv(os.path.join(datadir,"train.csv"))
print(data.shape)
print(data.iloc[522266]["question_text"])

(1306122, 3)
In "Star Trek 2013" why did they :

*Spoilers*
*Spoilers*
*Spoilers*
*Spoilers*

1)Make warping look quite a bit like an hyperspace jump
2)what in the world were those bright particles as soon as they jumped.
3)Why in the world did they make it possible for two entities to react in warp space in separate jumps.
4)Why did Spock get emotions for this movie.
5)What was the point of hiding the "Enterprise" underwater.
6)When they were intercepted by the dark ship, how come they reached Earth when they were far away from her.(I don't seem to remember the scene where they warp to earth).
7)How did the ship enter earth's atmosphere when it wasnt even in orbit.
8)When Scotty opened the door of the black ship , how come pike and khan didn't slow down?


It seems to be OK with long descriptions. Using spaCy or NLTK should do the job on the text.

Let's split our data into train / test samples, say a 90 / 10 split (we only have about 80000 trolling questions, so the testing set should not be too small).
First we isolate the trolling data from the rest.

In [8]:
troll_mask = np.array(data["target"] == 1)
print("Amount of troll data = {}".format(np.sum(troll_mask)))
troll_data = data[["qid","question_text","target"]][troll_mask]
genu_mask = np.logical_not(troll_mask)
print("Amount of genuine data = {}".format(np.sum(genu_mask)))
genu_data = data[["qid","question_text","target"]][genu_mask]
assert data.shape[0] == troll_data.shape[0] + genu_data.shape[0],"Kind data + troll data != total number"

Amount of troll data = 80810
Amount of genuine data = 1225312


In [9]:
# Let's shuffle the data
troll_shuff = troll_data.sample(n=troll_data.shape[0],random_state=1).reset_index(drop=True)
genu_shuff = genu_data.sample(n=genu_data.shape[0],random_state=1).reset_index(drop=True)

assert troll_shuff.shape[0] + genu_shuff.shape[0] == data.shape[0]

And finally we fuse the data into independent training and test sets.

In [10]:
train_prop = 0.9

# troll data
train_part = int(troll_shuff.shape[0]*train_prop)
train_data = troll_shuff[:train_part][:]
test_data = troll_shuff[train_part:][:]
assert troll_shuff.shape[0] == train_data.shape[0] + test_data.shape[0]

# add the genuine data (note the index resetting)
train_part = int(genu_shuff.shape[0]*train_prop)
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
train_data = pd.concat([train_data,genu_shuff[:train_part]]).reset_index(drop=True)
test_data = pd.concat([test_data,genu_shuff[train_part:]]).reset_index(drop=True)

# Check out resulting size
print("Train data shape = {}".format(train_data.shape))
print("Test data shape = {}".format(test_data.shape))
assert train_data.shape[0] + test_data.shape[0] == data.shape[0]

Train data shape = (1175509, 3)
Test data shape = (130613, 3)


Finally, we shuffle the data one last time and save it.

In [12]:
train_data = train_data.sample(n=train_data.shape[0],random_state=1).reset_index(drop=True)
test_data = test_data.sample(n=test_data.shape[0],random_state=1).reset_index(drop=True)
train_data.to_csv("training_90.csv",index=False)
test_data.to_csv("test_90.csv",index=False)
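The split built step by step above can also be condensed into one index-based helper. This is a sketch only: `stratified_split` is a hypothetical function, it uses NumPy's random generator rather than `DataFrame.sample`, and so it will not reproduce the exact rows chosen above.

```python
import numpy as np

def stratified_split(labels, train_prop=0.9, seed=1):
    """Return (train_idx, test_idx) with the class ratio preserved in both parts."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_parts, test_parts = [], []
    for cls in np.unique(labels):
        # shuffle the row positions of this class, then cut at train_prop
        idx = rng.permutation(np.flatnonzero(labels == cls))
        cut = int(len(idx) * train_prop)
        train_parts.append(idx[:cut])
        test_parts.append(idx[cut:])
    # final shuffle so that the classes are interleaved
    train_idx = rng.permutation(np.concatenate(train_parts))
    test_idx = rng.permutation(np.concatenate(test_parts))
    return train_idx, test_idx

# Toy labels: 90 genuine, 10 trolling
toy_labels = np.array([0] * 90 + [1] * 10)
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))
```

On the real data the indices would be used as `data.iloc[train_idx]` and `data.iloc[test_idx]`, with `data["target"]` passed as `labels`.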

### Data Projection
The next step is to project our data onto one of the given vector spaces. We are going to start with GloVe and go all the way to a working classifier before trying another space or specializing one.

In [134]:
# Loading data
train_data = pd.read_csv("training_90.csv")
test_data = pd.read_csv("test_90.csv")

In [76]:
# projection data
def load_space(filename):
    """Load a GloVe-style text file into a dict: word -> (vsize,1) column vector."""
    space = {}
    with open(filename,"rt") as f:
        for i,line in enumerate(f,start=1):
            word,*coeff = line.split()
            try:
                # stored directly as a column vector (instead of shape (size,))
                space[word] = np.array(coeff,dtype=float).reshape(-1,1)
            except ValueError as err:
                print("Error \"{}\" at line {}".format(err,i))
    return space

%time space = load_space("glove/glove.6B.50d.txt") # let's try with something small

CPU times: user 4.08 s, sys: 31.1 ms, total: 4.12 s
Wall time: 4.11 s


In [77]:
print("Elements in space: {}".format(len(space)))
print("Projection shape: {}".format(space["to"].shape))

Elements in space: 400000
Projection shape: (50, 1)


Next we project sentences, word by word, into this space. In our first naive approach, we will take the mean of all the word vectors in the sentence as the input.
Let's try it on a sample sentence from the data set:

In [96]:
def project_sentence(sent,space):
    sentence = sent.lower() # all these projection sets are lower case only
    vsize = next(iter(space.values())).shape[0]
    proj = np.zeros((vsize,1))
    count = 0
    for word in sentence.split():
        if word in space:
            proj += space[word]
            count +=1
    if count != 0:
        proj /= count
    return proj

# quick check: first 5 components of the first three word vectors, next to the sentence mean
sentence = train_data.sample(n=1,random_state=1)["question_text"].values[0]
proj = project_sentence(sentence,space)
print(sentence)
words = sentence.lower().split()
# proj averages every in-vocabulary word, so it is not the mean of these three values alone
for i in range(5):
    print("{}, {}, {} -> sentence mean = {}".format(space[words[0]][i][0],space[words[1]][i][0],space[words[2]][i][0],proj[i][0]))

Why do some Quora writers need to write War and Peace, when a terse answer would do?
0.32386, 0.29605, 0.92871 -> sentence mean = 0.26804880000000003
0.011154, -0.13841, -0.10834 -> sentence mean = 0.02078619999999998
0.23443, 0.043774, 0.21497 -> sentence mean = 0.02956493333333334
-0.18039, -0.38744, -0.50237 -> sentence mean = -0.27284440000000004
0.6233, 0.12262, 0.10379 -> sentence mean = 0.24397293333333334
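Note that `project_sentence` splits on whitespace only, so tokens such as `peace,` or `do?` in the sample sentence keep their punctuation and will probably not match any vocabulary entry. A minimal regex-based sketch that separates punctuation from words is shown below. `tokenize` is a hypothetical helper, and spaCy or NLTK would be the more thorough option.

```python
import re

def tokenize(sentence):
    """Lower-case the sentence and split it into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence.lower())

tokens = tokenize("Why write War and Peace, when a terse answer would do?")
print(tokens)
```

Swapping `sentence.split()` for `tokenize(sentence)` inside `project_sentence` would be the only change needed. One caveat: contractions such as `don't` are broken into three tokens by this pattern.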


In [152]:
vsize = next(iter(space.values())).shape[0]

# Project training data
projections = np.zeros((train_data.shape[0],vsize,1))
for i in range(train_data.shape[0]):
    sentence = train_data.iloc[i]["question_text"]
    projections[i] = project_sentence(sentence,space)

# Save in a complete dataframe
train_complete = train_data.copy(deep=True)
train_complete["projections"] = None
for i in range(projections.shape[0]):
    train_complete.at[i,"projections"] = projections[i]

# project test data
projections = np.zeros((test_data.shape[0],vsize,1))
for i in range(test_data.shape[0]):
    sentence = test_data.iloc[i]["question_text"]
    projections[i] = project_sentence(sentence,space)

# Save in a complete dataframe
test_complete = test_data.copy(deep=True)
test_complete["projections"] = None
for i in range(projections.shape[0]):
    test_complete.at[i,"projections"] = projections[i]
    
# and save
train_complete.to_csv("training_90p_50d.csv",index=False)
test_complete.to_csv("test_90p_50d.csv",index=False)

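One caveat with the cell above: `to_csv` writes each array in the "projections" column as its printed string, which then has to be parsed back on reload. A possible alternative, sketched here under assumptions, is to keep the projections in a single 2-D array and store it in NumPy's binary format. `project_all`, the toy vectors and the file name are all made up for this example.

```python
import os
import tempfile
import numpy as np

def project_all(sentences, space, vsize):
    """Mean word vector of each sentence, as an (n_sentences, vsize) array."""
    out = np.zeros((len(sentences), vsize))
    for row, sent in enumerate(sentences):
        vecs = [space[w].ravel() for w in sent.lower().split() if w in space]
        if vecs:  # sentences without any known word stay at zero
            out[row] = np.mean(vecs, axis=0)
    return out

# Toy stand-in for the GloVe dictionary (hypothetical values)
toy_space = {"why": np.array([[1.0], [0.0]]), "do": np.array([[0.0], [1.0]])}
toy_proj = project_all(["Why do", "unknown words only"], toy_space, vsize=2)

# Round trip through a binary .npy file
npy_path = os.path.join(tempfile.mkdtemp(), "toy_projections.npy")
np.save(npy_path, toy_proj)
reloaded = np.load(npy_path)
print(reloaded)
```

The rows of the array follow the order of the sentences, so they stay aligned with the rows of training_90.csv and test_90.csv as long as those files are not reshuffled.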