# Insincere Question Classification
## Quora Insincere Questions Classification Kaggle Challenge

The purpose of this project is to study questions from Quora and to classify whether each question is sincere or not.

To study this problem, we will use a hybrid network of convolutional and LSTM units. Note that the data can be obtained from the [Kaggle challenge](https://www.kaggle.com/c/quora-insincere-questions-classification).

We begin by importing a number of packages.

In [None]:
# General packages for data-analysis
import numpy as np
import pandas as pd
import tensorflow as tf
import re

# Functions needed for tokenizing our comments
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Function to split data randomly
from sklearn.model_selection import train_test_split

# Various layers for our eventual convolutional neural network (CNN)
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import Conv1D
from tensorflow.keras.layers import MaxPooling1D
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dropout
from tensorflow.keras import Sequential

# For the F1 score, which is how the competition is judged
from sklearn.metrics import f1_score

# In order to compute the decision threshold to maximize the F1 score
from scipy.optimize import minimize

## Data analysis

We would like to know more about the data we are about to analyze. To do this, we can load it into a Pandas dataframe.

In [None]:
# Loading our data into dataframes
trainingdf = pd.read_csv('../input/train.csv')
unknowndf = pd.read_csv('../input/test.csv')
# We will take a look at the training data 
trainingdf

Now we take a look at the distribution of sincere and insincere labels.

In [None]:
insincereexmaplesdf = trainingdf[trainingdf['target']==1]
sincereexmaplesdf = trainingdf[trainingdf['target']==0]

print("Number of insincere questions: "+str(len(insincereexmaplesdf)))
print("Number of sincere questions: "+str(len(sincereexmaplesdf)))
print("Percentage of insincere questions: "+str(len(insincereexmaplesdf)/float(len(trainingdf))*100))

So we can see that this is an imbalanced classification problem. We will likely need to deal with this properly later.
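
To illustrate why the imbalance matters, here is a minimal sketch with made-up toy labels (not the competition data). It shows that a model which always predicts "sincere" can reach a high accuracy while its F1 score on the insincere class is zero. The helper names `accuracy` and `f1` are hypothetical and written out by hand for clarity.

```python
# Toy labels (made up for illustration): 1 = insincere, 0 = sincere.
y_true = [0] * 94 + [1] * 6
# A trivial "model" that always predicts sincere.
y_pred = [0] * len(y_true)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred):
    """F1 score of the positive class, taken as 0 when there are no true positives."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)

print("accuracy:", accuracy(y_true, y_pred))
print("F1:", f1(y_true, y_pred))
```

This is one reason the competition is scored with F1 rather than accuracy.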

In [None]:
# For debugging/optimizing purposes, we will cut down on the amount of data to feed the neural network.
#trainingdf = trainingdf[:20000]
#insincereexmaplesdf = trainingdf[trainingdf['target']==1]
#sincereexmaplesdf = trainingdf[trainingdf['target']==0]

#print("Number of insincere questions: "+str(len(insincereexmaplesdf)))
#print("Number of sincere questions: "+str(len(sincereexmaplesdf)))
#print("Percentage of insincere questions: "+str(len(insincereexmaplesdf)/float(len(trainingdf))*100))

## Cleaning and tokenizing

With some data analysis out of the way, we can look at cleaning and tokenizing the comments. We first want to replace contractions, most of which do not appear in the embeddings used later.

We then use the tokenizer to split the comments into tokens (words) and assign each token an integer. After some analysis of the tokenizing, we can then split the data into test and training sets.

To start, we do some cleaning. There are two ways to do this: the more accurate one takes minutes on this data, compared to seconds for the less accurate one. The faster, less accurate one is commented out for now.
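
As a small illustration of the difference between the two approaches, the sketch below uses a tiny hypothetical mapping (`toy_mapping`) instead of the full dictionaries. The split-based variant only replaces tokens that match a key exactly, so a contraction followed by a comma is missed, while the regex variant replaces it wherever it occurs.

```python
import re

# A tiny, hypothetical subset of the correction dictionary.
toy_mapping = {"don't": "do not", "what's": "what is"}

def clean_by_split(text):
    """Fast variant: look up whole whitespace-separated tokens only."""
    return " ".join(toy_mapping.get(t, t) for t in text.lower().split(" "))

def clean_by_regex(text):
    """Slower variant: substitute every occurrence, wherever it appears."""
    text = text.lower()
    for wrong, right in toy_mapping.items():
        text = re.sub(re.escape(wrong), right, text)
    return text

question = "Why don't, you know, people ask what's next?"
print(clean_by_split(question))   # "don't," is left untouched
print(clean_by_regex(question))   # both contractions are expanded
```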

In [None]:
#%%time
# Dictionaries of various corrections
#contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not", "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",  "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would", "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have", "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is", "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as", "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would", "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are", 
"we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",  "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have", "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have","you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have"}
#punct_mapping = {"‘": "'", "₹": "e", "´": "'", "°": "", "€": "e", "™": "tm", "√": " sqrt ", "×": "x", "²": "2", "—": "-", "–": "-", "’": "'", "_": "-", "`": "'", '“': '"', '”': '"', '“': '"', "£": "e", '∞': 'infinity', 'θ': 'theta', '÷': '/', 'α': 'alpha', '•': '.', 'à': 'a', '−': '-', 'β': 'beta', '∅': '', '³': '3', 'π': 'pi'}
#spell_mapping = {'colour': 'color', 'centre': 'center', 'favourite': 'favorite', 'travelling': 'traveling', 'counselling': 'counseling', 'theatre': 'theater', 'cancelled': 'canceled', 'labour': 'labor', 'organisation': 'organization', 'wwii': 'world war 2', 'citicise': 'criticize', 'youtu ': 'youtube ', 'Qoura': 'Quora', 'sallary': 'salary', 'Whta': 'What', 'narcisist': 'narcissist', 'howdo': 'how do', 'whatare': 'what are', 'howcan': 'how can', 'howmuch': 'how much', 'howmany': 'how many', 'whydo': 'why do', 'doI': 'do I', 'theBest': 'the best', 'howdoes': 'how does', 'mastrubation': 'masturbation', 'mastrubate': 'masturbate', "mastrubating": 'masturbating', 'pennis': 'penis', 'Etherium': 'Ethereum', 'narcissit': 'narcissist', 'bigdata': 'big data', '2k17': '2017', '2k18': '2018', 'qouta': 'quota', 'exboyfriend': 'ex boyfriend', 'airhostess': 'air hostess', "whst": 'what', 'watsapp': 'whatsapp', 'demonitisation': 'demonetization', 'demonitization': 'demonetization', 'demonetisation': 'demonetization', 'pokémon': 'pokemon'}
# Full dictionary of corrections to apply
#correction_mapping = {**contraction_mapping,**punct_mapping,**spell_mapping}
# Define function that will take the contractions and replace them with the non-contracted phrase
# Note this version does not replace contractions next to punctuation.
# The version in the next cell does, but takes a few minutes instead of a few seconds.
#def clean_questions(text):
#    specials = ["’", "‘", "´", "`"]
#    for s in specials:
#        text = text.replace(s, "'")
#    text = ' '.join([correction_mapping[t] if t in correction_mapping else t for t in text.lower().split(" ")])
#    return text
# Now apply the function to the data
# Note we include the unknown data here too, as it must be converted for the model
#trainingdf_cleaned = trainingdf.iloc[:,1].apply(lambda x: clean_questions(x))
#unknowndf_cleaned = unknowndf.iloc[:,1].apply(lambda x: clean_questions(x))

In [None]:
#%%time
# Dictionaries of various corrections
contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not", "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",  "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would", "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have", "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is", "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as", "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would", "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are", 
"we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",  "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have", "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have","you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have"}
punct_mapping = {"‘": "'", "₹": "e", "´": "'", "°": "", "€": "e", "™": "tm", "√": " sqrt ", "×": "x", "²": "2", "—": "-", "–": "-", "’": "'", "_": "-", "`": "'", '“': '"', '”': '"', '“': '"', "£": "e", '∞': 'infinity', 'θ': 'theta', '÷': '/', 'α': 'alpha', '•': '.', 'à': 'a', '−': '-', 'β': 'beta', '∅': '', '³': '3', 'π': 'pi'}
spell_mapping = {'colour': 'color', 'centre': 'center', 'favourite': 'favorite', 'travelling': 'traveling', 'counselling': 'counseling', 'theatre': 'theater', 'cancelled': 'canceled', 'labour': 'labor', 'organisation': 'organization', 'wwii': 'world war 2', 'citicise': 'criticize', 'youtu ': 'youtube ', 'Qoura': 'Quora', 'sallary': 'salary', 'Whta': 'What', 'narcisist': 'narcissist', 'howdo': 'how do', 'whatare': 'what are', 'howcan': 'how can', 'howmuch': 'how much', 'howmany': 'how many', 'whydo': 'why do', 'doI': 'do I', 'theBest': 'the best', 'howdoes': 'how does', 'mastrubation': 'masturbation', 'mastrubate': 'masturbate', "mastrubating": 'masturbating', 'pennis': 'penis', 'Etherium': 'Ethereum', 'narcissit': 'narcissist', 'bigdata': 'big data', '2k17': '2017', '2k18': '2018', 'qouta': 'quota', 'exboyfriend': 'ex boyfriend', 'airhostess': 'air hostess', "whst": 'what', 'watsapp': 'whatsapp', 'demonitisation': 'demonetization', 'demonitization': 'demonetization', 'demonetisation': 'demonetization', 'pokémon': 'pokemon'}
# Full dictionary of corrections to apply
correction_mapping = {**contraction_mapping,**punct_mapping,**spell_mapping}
# Define function that replaces each dictionary key (contraction, symbol, misspelling) with its correction
def correct_questions(text):
    # Normalize the different apostrophe characters first
    specials = ["’", "‘", "´", "`"]
    for s in specials:
        text = text.replace(s, "'")
    # Lowercase once; note that keys containing capitals (e.g. 'Qoura') can then never match
    text = text.lower()
    for wrong, replacement in correction_mapping.items():
        # re.escape makes sure each key is matched literally
        text = re.sub(re.escape(wrong), replacement, text)
    return text
# Now apply the function to the data
# Note we include the unknown data here too, as it must be converted for the model
trainingdf_corrected = trainingdf.iloc[:,1].apply(lambda x: correct_questions(x))
unknowndf_corrected = unknowndf.iloc[:,1].apply(lambda x: correct_questions(x))
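
Since the loop above runs one substitution per dictionary entry, a possible way to speed it up is to compile all keys into a single pattern and replace them in one pass. The sketch below is only an illustration with a hypothetical `toy_mapping`, not what this notebook runs, and it has not been timed on the competition data.

```python
import re

# Hypothetical subset of the corrections; the full dictionary would be used in practice.
toy_mapping = {"can't": "cannot", "it's": "it is", "colour": "color"}

# Longest keys first, so a longer key is preferred over any shorter key it contains.
keys = sorted(toy_mapping, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(k) for k in keys))

def correct_once(text):
    """Replace every dictionary key in a single pass over the lowercased text."""
    return pattern.sub(lambda m: toy_mapping[m.group(0)], text.lower())

print(correct_once("It's a colour I can't describe"))
```

One difference to keep in mind: in a single pass the replaced text is not scanned again, whereas the sequential loop can apply a later correction to the output of an earlier one.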

In [None]:
#%%time
# We can also try to take care of possessives, many of which are not in the embedding
# Note we do this after we take care of contractions to avoid removing it's and other similar contractions
trainingdf_cleaned = trainingdf_corrected.apply(lambda x: re.sub("'s", "", x))
unknowndf_cleaned = unknowndf_corrected.apply(lambda x: re.sub("'s", "", x))

In [None]:
# Set up the tokenizer first
# This will split the words and lowercase them
tokenizer = Tokenizer()
# Fit the tokenizer to the comments, so it can build the integer assignments
tokenizer.fit_on_texts(pd.concat([trainingdf_cleaned,unknowndf_cleaned]))
# Use the word-integer assignments to convert the comments into a list of integers
sequences = tokenizer.texts_to_sequences(trainingdf_cleaned)
unknownsequences = tokenizer.texts_to_sequences(unknowndf_cleaned)

# Save the dictionary between words and integers for later use
word_index = tokenizer.word_index
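
To make the tokenizing step more concrete, here is a simplified pure-Python sketch of the idea: count the words, give the most frequent word the integer 1, the next one 2, and so on, then replace each word by its integer. The function names are hypothetical, and the real Keras `Tokenizer` does more (for example, it also filters punctuation), so this is only an approximation of its behaviour.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

def to_sequences(texts, word_index):
    """Convert each text to a list of integers, skipping unseen words."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

toy_texts = ["how do i learn python", "how do i learn to cook"]
toy_index = build_word_index(toy_texts)
print(toy_index)
print(to_sequences(toy_texts, toy_index))
```

The integer 0 is left unused on purpose, so that it can serve as the padding value later.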

In [None]:
# In order to put the comments (now integer lists) through the neural network, they need to be the same length.
# So we find the longest comment, and then pad the rest of the shorter comments with zeros.
# Note this does involve having to consider the unknown (test) sequences as well.
maxsequncelength = max([len(max(sequences, key=len)),len(max(unknownsequences, key=len))])
data = pad_sequences(sequences, maxlen=maxsequncelength)
unknowndata = pad_sequences(unknownsequences, maxlen=maxsequncelength)
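
As a sketch of what the padding does, the hypothetical helper `pad_left` below mimics the default behaviour of `pad_sequences`: zeros are added at the front of shorter lists, and lists that are too long lose items from the front.

```python
def pad_left(sequences, maxlen):
    """Left-pad each integer list with zeros, keeping only the last maxlen items."""
    padded = []
    for seq in sequences:
        seq = seq[max(0, len(seq) - maxlen):]   # drop items from the front if too long
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded

toy_sequences = [[4, 8], [1, 2, 3, 5], [7]]
longest = max(len(s) for s in toy_sequences)
print(pad_left(toy_sequences, longest))
```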

In [None]:
# Split data into testing and training sets. The split ratio can be adjusted; here 20% is held out.
# Note our "test" data acts as validation data here, as the true test data would be submitted to the competition.
testtrainsplit = 0.20
X_train, X_test, y_train, y_test = train_test_split(data, trainingdf.iloc[:,2:], test_size = testtrainsplit, random_state=0)

In [None]:
# This will resample the data to include more insincere questions. However, it doesn't seem to help much.
from imblearn.over_sampling import RandomOverSampler

X_res, y_res = RandomOverSampler(sampling_strategy=0.2).fit_resample(X_train, y_train.values.ravel())
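
The idea behind random oversampling can be sketched in a few lines of plain Python. The hypothetical helper below repeats randomly chosen minority rows until the minority class is roughly `ratio` times the size of the majority class, which is how a float `sampling_strategy` is interpreted. It assumes label 1 is the minority class, and it is only an illustration, not the imblearn implementation.

```python
import random

def oversample_minority(labels, ratio, seed=0):
    """Return row indices with minority rows repeated up to ratio * majority size."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    if not minority:
        raise ValueError("no minority examples to resample")
    target = round(ratio * len(majority))        # desired minority count
    extra = max(0, target - len(minority))       # never remove existing rows
    resampled = minority + [rng.choice(minority) for _ in range(extra)]
    return majority + resampled

toy_labels = [0] * 100 + [1] * 5                 # made-up labels
idx = oversample_minority(toy_labels, ratio=0.2)
print(len(idx), sum(toy_labels[i] for i in idx))
```

As in the cell above, resampling should only be applied to the training split, so that the validation data keeps the original class balance.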

## Embedding

In order to feed this to a neural network, we want to do some dimensional reduction/normalization of the data. The comments so far have been turned into vectors of integers. But these values are not normalized, which they need to be for the neural network.

The standard approach for machine learning would be to one-hot encode each word instead of using a single integer, so that each comment is represented by an array. However, this would result in a very high-dimensional, sparse set of vectors/arrays, which is undesirable for the neural network.

Instead, it is better to encode the words as word vectors that capture co-occurrence. To start, we will be using a pre-trained word vector representation.
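
A quick back-of-the-envelope comparison shows the size difference between the two encodings. The question length and vocabulary size below are hypothetical round numbers chosen for illustration; only the 300-dimensional word vectors correspond to what is used in this notebook.

```python
def one_hot_entries(seq_len, vocab_size):
    """Numbers needed to one-hot encode one padded question."""
    return seq_len * vocab_size

def embedded_entries(seq_len, embedding_dim):
    """Numbers needed when each word is a dense vector instead."""
    return seq_len * embedding_dim

seq_len = 100         # hypothetical padded question length
vocab_size = 200_000  # hypothetical vocabulary size
embedding_dim = 300   # dimension of the pre-trained word vectors

sparse = one_hot_entries(seq_len, vocab_size)
dense = embedded_entries(seq_len, embedding_dim)
print(sparse, dense)
```

Under these assumptions the one-hot array holds several hundred times more numbers per question, almost all of them zero.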

In [None]:
# Here we initialize our embedding dictionary
embeddings_index = {}
# There are a number of different dimensional word-representations
# However, for this project we are limited in which word-vectors to choose
embeddingdim = 300
# Below we just read the file and store it in the initialized dictionary
with open('../input/embeddings/wiki-news-300d-1M/wiki-news-300d-1M.vec',encoding='UTF-8') as f:
    for line in f:
        values = line.split(' ')
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
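
For reference, the layout of such a `.vec` text file can be sketched with a small in-memory example. This assumes the file starts with a header line giving the number of words and the dimension, followed by one word and its coefficients per line; the helper `read_vec` is hypothetical and, unlike the cell above, skips the header and any malformed lines.

```python
import io

def read_vec(handle):
    """Parse word vectors from a text handle in the assumed .vec layout."""
    vectors = {}
    first = handle.readline().split()
    if not first:
        return vectors                        # empty input
    if len(first) == 2:
        # Header line: "<number of words> <dimension>"; nothing to store.
        dim = int(first[1])
    else:
        # No header: treat the first line as a regular entry.
        dim = len(first) - 1
        vectors[first[0]] = [float(v) for v in first[1:]]
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue                          # skip malformed lines
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors

toy_file = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
toy_vectors = read_vec(toy_file)
print(toy_vectors)
```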

With the word vectors in hand, we set up the translation between our tokenized word integer and their corresponding word vector representation as a matrix.

In [None]:
# Start with an array of zeros. Note the shape is (num. of words + 1, dim. of the word vectors)
embedding_matrix = np.zeros((len(word_index) + 1, embeddingdim))
unknown_words = []
# Go through our word/integer dictionary
for word, i in word_index.items():
    # For each word, find the corresponding pre-trained vector and set the embedding vector
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
    else:
        unknown_words.append(word)
    # Note if the embedding_vector is not found, the row remains a vector of zeros as initialized
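
The list of unknown words can be used to judge how well the cleaning worked. A minimal sketch of such a coverage check is given below with made-up counts; in the notebook, something like `tokenizer.word_counts` and the keys of `embeddings_index` could play the roles of `toy_counts` and `toy_known`. The helper name is hypothetical.

```python
def embedding_coverage(word_counts, known_words):
    """Return (share of distinct words covered, share of all word occurrences covered)."""
    covered = [w for w in word_counts if w in known_words]
    vocab_share = len(covered) / len(word_counts)
    total = sum(word_counts.values())
    text_share = sum(word_counts[w] for w in covered) / total
    return vocab_share, text_share

toy_counts = {"what": 50, "is": 40, "quora": 9, "qoura": 1}   # made-up counts
toy_known = {"what", "is", "quora"}
print(embedding_coverage(toy_counts, toy_known))
```

A low share of covered occurrences would suggest that more cleaning is needed before tokenizing, since every uncovered word is fed to the network as a vector of zeros.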

With the embedding matrix found, we simply set up the embedding layer with this matrix as the first layer of our neural network.

In [None]:
embedding_layer = Embedding(len(word_index) + 1,
                            embeddingdim,
                            weights=[embedding_matrix],
                            input_length=maxsequncelength,
                            trainable=False)
# Note the first two arguments are the shape of the weights, 
# but the expected input length is still the length of the longest comment
# Also we have set trainable=False, as we do not want the weights to change.
# Finally, note the embedding_matrix has knowledge of the unknown (true test) data
# but since this is not a trainable layer, it shouldn't be a problem.

## Hybrid Neural Network

Now we build our network. This will begin with our embedding layer, followed by one convolutional layer with ReLU activation. This will feed into an LSTM layer, which will then feed into a dense layer, also with ReLU activation, before heading to the final layer with sigmoid activation (as opposed to softmax, as there is only a single binary label).

The model will use binary cross-entropy as the loss function, as appropriate for a binary label, and will be scored with the accuracy metric during training. The optimizer is Adam with its default settings.
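
To keep track of how the sequence length changes through the first layers, the following sketch computes the output lengths of an unpadded, stride-1 convolution and of non-overlapping max pooling, which are the Keras defaults for `Conv1D` and `MaxPooling1D`. The input length of 100 is a hypothetical value; the kernel and pool sizes match the hyperparameters chosen in this notebook.

```python
def conv1d_length(length, kernel_size):
    """Output length of a stride-1 convolution without padding."""
    return length - kernel_size + 1

def maxpool1d_length(length, pool_size):
    """Output length of non-overlapping pooling without padding."""
    return length // pool_size

seq_len = 100                                            # hypothetical padded question length
after_conv = conv1d_length(seq_len, kernel_size=5)       # receptive field size of 5
after_pool = maxpool1d_length(after_conv, pool_size=2)   # pooling window of 2
print(after_conv, after_pool)
```

The LSTM then reads the pooled sequence step by step and, with its default settings, returns a single vector for the whole question.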

In [None]:
# Now we will define a function to setup the model
def model_setup(rfs_var, mps_var, actc_var, actd_var, met_var, drpout_var, node_num_var):
    # We will begin with a simple sequential model
    model = Sequential()
    # First layer is the embedding layer as we had defined above
    model.add(embedding_layer)
    # We next have one convolutional layer, with relu activation and pooling layer
    model.add(Conv1D(node_num_var, rfs_var, activation=actc_var))
    model.add(MaxPooling1D(mps_var))
    # Next we add a LSTM layer
    model.add(LSTM(node_num_var))
    model.add(Dropout(drpout_var))
    # Next we flatten and send into the dense layer
    model.add(Flatten())
    model.add(Dense(node_num_var, activation=actd_var))
    # Finally we forward the results from the dense layer into the final prediction layer.
    model.add(Dense(1, activation='sigmoid'))
    # We then compile the model with the layers, using cross-entropy 
    model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=[met_var])
    # We can then fit the model, using our test data as validation
    model.fit(X_res, y_res, validation_data=(X_test, y_test),
          epochs=1, batch_size=node_num_var)
    
    return model
    
# Model hyperparameters, these are chosen by some optimization below
# Size of the receptive field for convolutional layers
rfs = 5
# Size of the max pooling windows
mps = 2
# Activation for convolutional layer
actc = 'relu'
# Activation for dense layer
actd = 'relu'
# Metric for the eval
met = 'acc'
# The dropout ratio for the dropout layers
drpout = 0.1
# The number of nodes for the layers
node_num = 256

# Now we run the model
model = model_setup(rfs, mps, actc, actd, met, drpout, node_num)

Now we turn to computing the F1 score. To do this, we need to find the optimal decision threshold. We can use the scipy function minimize to optimize this. Since we actually want a maximum, the sign needs to be flipped, but otherwise it is very straightforward.

We will do this maximization on the training data and use the threshold from that on the test/validation data. This will allow us to examine possible under/overfitting of the model.

In [None]:
def f1_score_computation():
    # In order to compute the F1 score, we need to apply the model to the data
    CNN_probas_train = model.predict(X_train)
    CNN_probas_test = model.predict(X_test)

    # Start by defining the function to be optimized. Note the minus sign!
    def f1_score_function(x):
        # np.int was removed from recent NumPy versions, so the builtin int is used
        return -f1_score(y_train, (CNN_probas_train > x).astype(int))
    # Do multiple minimizations from different starting points to find the global min instead of a local one
    # Save as dict w/ threshold:score, making sure to flip sign for true score
    results = {}
    for i in range(1,10):
        # Run each minimization once and reuse its result
        opt = minimize(f1_score_function, i*0.1, method='nelder-mead')
        results[opt.x[0]] = -opt.fun
    # Extract the threshold and F1 score of best threshold
    threshold, max_f1score = max(results.items(), key=lambda x: x[1])
    
    return (threshold, max_f1score, CNN_probas_test)

f1_results = f1_score_computation()
# Print the results
print("Decision threshold: "+str(f1_results[0]))
print("Max training F1 score: "+str(f1_results[1]))

# To check for overfitting, we can compare with the F1 score of the test/validation set
print("Testing F1 score: "+str(f1_score(y_test, (f1_results[2] > f1_results[0]).astype(int))))
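
As an alternative to the minimizer, the threshold can also be found with a plain scan over candidate values. The F1 score only changes when the threshold crosses one of the predicted probabilities, so it is piecewise constant, and a simplex search may stop on a flat stretch. The sketch below is a swapped-in technique shown on made-up toy values, not the method used above; the function names are hypothetical.

```python
def f1_at_threshold(y_true, probas, threshold):
    """F1 of the positive class when predicting 1 for probabilities above threshold."""
    preds = [1 if p > threshold else 0 for p in probas]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, preds))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, preds))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, preds))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(y_true, probas, steps=99):
    """Scan evenly spaced thresholds in (0, 1) and return the best (threshold, F1)."""
    candidates = [(i + 1) / (steps + 1) for i in range(steps)]
    scored = ((t, f1_at_threshold(y_true, probas, t)) for t in candidates)
    return max(scored, key=lambda pair: pair[1])

toy_labels = [0, 0, 0, 0, 1, 0, 1, 1]                            # made-up labels
toy_probas = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.90]    # made-up predictions
best_t, best_f1 = best_threshold(toy_labels, toy_probas)
print(best_t, best_f1)
```

As with the minimizer, the scan would be run on the training predictions and the resulting threshold reused on the validation data.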

Note that these scores are not great compared to the competition scores, but there could be a lot more done to improve this model. This includes more data cleaning before tokenization, tuning hyperparameters in the network, treating the class imbalance differently, etc.

## Examining Results and Debugging

Below is largely troubleshooting and more specific details of the results from the machine learning algorithm. This can all be commented out for evaluation.

In [None]:
# If we want, we can have a summary of the weights and shapes of the layers
#model.summary()

In [None]:
# We will also look at the results to check reasonableness.
# e.g. we are not picking all the same prediction, they are not too small or large, etc
#f1_results[2]

In [None]:
# We can go through a number of different hyperparameters here
# This is done by hand, but it is likely that there is a better method

# Size of the receptive field for convolutional layers
#trial_rfs = [3, 4, 5]
# Size of the max pooling windows
#trial_mps = [2, 3]
# Activation for convolutional layer
#trial_actc = ['tanh', 'relu']
# Activation for dense layer
#trial_actd = ['relu']
# Metric for the eval
#trial_met = ['acc']
# The dropout ratio for the dropout layers
#trial_drpout = [0.1, 0.3, 0.5, 0.7, 0.9]
# The number of nodes for the layers
#trial_node_num = [128, 256]

#import itertools
#results = {}
#for a, b, c, d, e, f, g in itertools.product(trial_rfs, trial_mps, trial_actc, trial_actd, trial_met, trial_drpout, trial_node_num):
#    model = model_setup(a, b, c, d, e, f, g)
#    f1_results = f1_score_computation()
#    results[f1_score(y_test, (f1_results[2] > f1_results[0]).astype(int))] = [a,b,c,d,e,f,g]
    
#max_f1_score_trials = max(results.items(), key=lambda x: x[0])[0]
#max_f1_vars_trials = max(results.items(), key=lambda x: x[0])[1]

In [None]:
#(max_f1_score_trials, max_f1_vars_trials)

In [None]:
#results

In [None]:
#Now we want to see which examples of the validation set were mislabeled
#Start by comparing the test labels with the predictions
#test_labels_df = y_test.reset_index().drop(['index'],axis=1)
#predictions_df = pd.DataFrame((f1_results[2] > f1_results[0]).astype(int).flatten(),columns=['target'])
#match_df = (test_labels_df==predictions_df).rename(columns={'target':'match'})
#comparison_df = pd.concat([y_test.reset_index()['index'], match_df,test_labels_df],axis=1)

# Extract the confusion matrix
#true_positive = comparison_df[(comparison_df['match']==True) & (comparison_df['target']==1)]
#true_negative = comparison_df[(comparison_df['match']==True) & (comparison_df['target']==0)]
#false_negative = comparison_df[(comparison_df['match']==False) & (comparison_df['target']==1)]
#false_positive = comparison_df[(comparison_df['match']==False) & (comparison_df['target']==0)]

#confusion_matrix = pd.DataFrame([['TN: '+str(len(true_negative)),'FN: '+str(len(false_negative))],['FP: '+str(len(false_positive)),'TP: '+str(len(true_positive))]],columns = ['Sincere','Insincere'], index = ['Pred. Sincere','Pred. Insincere'])
#print('Sincere: '+str(len(test_labels_df[test_labels_df['target']==0])))
#print('Insincere: '+str(len(test_labels_df[test_labels_df['target']==1])))
#confusion_matrix

## Applying the model

Next we can apply the trained model to the test data. We will use the threshold calculated above to compute the predictions.

In [None]:
submission = pd.DataFrame(
    {'qid':unknowndf['qid'], 'prediction':(model.predict(unknowndata) > f1_results[0]).astype(int).flatten()},
    columns = ['qid','prediction'])
submission

In [None]:
submission.to_csv('submission.csv',index=False)

The Kaggle score for this submission was 0.65369, which was within the top 31% of submissions.