## Contents

1. [Introduction](#1)
2. [Pre-Processing for Baseline](#2)
3. [Baseline Model](#3)

## Introduction <a id="1"></a>

This notebook documents a project I started in order to learn NLP, using the Quora Insincere Questions dataset. I am attempting the challenge in a Kaggle Kernel after the competition deadline has passed. Once the project is complete, I will download the notebook and push it to my GitHub repo.

The notebook begins with pre-processing and a quick baseline model that uses TF-IDF features with logistic regression. I will then perform slightly different pre-processing for word embeddings (following advice from experienced Kagglers) and build an LSTM model.

In [39]:
import pandas as pd
import re
import string
import nltk
from nltk.corpus import stopwords

from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

In [40]:
train_set = pd.read_csv("../input/quora-insincere-questions-classification/train.csv")
test_set = pd.read_csv("../input/quora-insincere-questions-classification/test.csv")

In [41]:
print("Train shape : ",train_set.shape)
print("Test shape : ",test_set.shape)

Train shape :  (1306122, 3)
Test shape :  (375806, 2)


In [42]:
train_set.head()

| | qid | question_text | target |
|---|---|---|---|
| 0 | 00002165364db923c7e6 | How did Quebec nationalists see their province... | 0 |
| 1 | 000032939017120e6e44 | Do you have an adopted dog, how would you enco... | 0 |
| 2 | 0000412ca6e4628ce2cf | Why does velocity affect time? Does velocity a... | 0 |
| 3 | 000042bf85aa498cd78e | How did Otto von Guericke used the Magdeburg h... | 0 |
| 4 | 0000455dfa3e01eae3af | Can I convert montra helicon D to a mountain b... | 0 |


In [43]:
train_set['target'].value_counts()

0    1225312
1      80810
Name: target, dtype: int64
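
The counts above show a heavily imbalanced target. As a quick check of the arithmetic, the sketch below recomputes the class shares from the two counts printed above; the variable names are mine and chosen only for illustration.

```python
# Class counts copied from the value_counts() output above
counts = {0: 1225312, 1: 80810}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    # Print each class as a percentage of all training questions
    print(f"target={label}: {share:.2%}")
```

Roughly 6% of the training questions are labelled insincere. In the notebook itself, `train_set['target'].value_counts(normalize=True)` should give the same proportions directly.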

## Pre-Processing for Baseline <a id="2"></a>

In [44]:
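# Pads each punctuation mark with spaces so it becomes its own token,
# then splits on whitespace. Punctuation is kept as tokens, not removed.
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')
def tokenize(s):
    return re_tok.sub(r' \1 ', s).split()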
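
To make the tokenizer's behaviour concrete, here is a small self-contained sketch that repeats the definition from the cell above and runs it on a made-up sample question. The sample string is hypothetical and not taken from the dataset.

```python
import re
import string

# Same pattern as the cell above: capture each punctuation character
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')

def tokenize(s):
    # Surround every punctuation mark with spaces, then split on whitespace
    return re_tok.sub(r' \1 ', s).split()

sample = "Isn't velocity relative?"  # hypothetical sample question
tokens = tokenize(sample)
print(tokens)
# ['Isn', "'", 't', 'velocity', 'relative', '?']
```

The apostrophe and the question mark survive as separate tokens, so they can still contribute features to the TF-IDF model.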

In [45]:
train_text = train_set['question_text']
test_text = test_set['question_text']
train_target = train_set['target']

In [46]:
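# Unigrams and bigrams; drop terms that appear in fewer than 3 questions
# or in more than 90% of them
tfidf_vectoriser = TfidfVectorizer(ngram_range=(1,2), tokenizer=tokenize,
                                   min_df=3, max_df=0.9, strip_accents='unicode')
# Fit the vocabulary on the training text only, then reuse it for the test text
train_tfidf = tfidf_vectoriser.fit_transform(train_text)
test_tfidf = tfidf_vectoriser.transform(test_text)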
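
The sketch below illustrates, in plain Python, what I understand `ngram_range=(1,2)`, `min_df=3` and `max_df=0.9` to do when the vocabulary is built. It is not scikit-learn's implementation and it does not compute TF-IDF weights; the helper names and the toy corpus are hypothetical.

```python
from collections import Counter

def word_ngrams(tokens, min_n=1, max_n=2):
    """Return all n-grams of length min_n..max_n, joined with spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def filter_by_document_frequency(docs, min_df=3, max_df=0.9):
    """Keep terms found in at least min_df documents (absolute count)
    and in at most max_df of all documents (proportion)."""
    df = Counter()
    for tokens in docs:
        df.update(set(word_ngrams(tokens)))  # count each term once per document
    max_count = max_df * len(docs)
    return {term for term, count in df.items() if min_df <= count <= max_count}

# Hypothetical toy corpus, already tokenized
docs = [
    ["why", "is", "the", "sky", "blue"],
    ["why", "is", "the", "sea", "blue"],
    ["why", "is", "grass", "green"],
    ["how", "is", "the", "sky", "formed"],
]
vocab = filter_by_document_frequency(docs, min_df=3, max_df=0.9)
print(sorted(vocab))
# ['is the', 'the', 'why', 'why is']
```

Here "is" is dropped because it occurs in all four documents (more than 90%), while rare terms such as "sky" are dropped for appearing in fewer than three.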

In [47]:
classifier = LogisticRegression().fit(train_tfidf,train_target)
y_pred = classifier.predict(test_tfidf)



In [48]:
y_pred

array([1, 0, 0, ..., 0, 0, 0])

In [49]:
submit_df = pd.DataFrame({"qid": test_set["qid"], "prediction": y_pred})
submit_df.to_csv("submission.csv", index=False)

Leaderboard score: ~0.56 for both the SVM and the logistic regression model (default parameters).
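
The leaderboard score is only visible after submitting, so a local estimate on held-out data would be useful. The sketch below assumes the competition metric is the F1 score of the positive (insincere) class, which this notebook does not state. The labels and predictions are made up, and `f1_binary` is a helper written here purely for illustration.

```python
def f1_binary(y_true, y_pred):
    """F1 score for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up validation labels and predictions, for illustration only
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.2f}, F1 = {f1_binary(y_true, y_pred):.2f}")
# accuracy = 0.70, F1 = 0.40
```

With an imbalanced target, accuracy can look reasonable while F1 stays low. In the notebook, the already imported `train_test_split` could hold out part of `train_set`, and `sklearn.metrics.f1_score` could stand in for the helper.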