# Quora Insincere Questions Classification 
## - Data Cleanup and Wrangling
*******

**The dataset of Quora questions comes from a Kaggle competition and is available at: https://www.kaggle.com/c/quora-insincere-questions-classification/data <br>
I will use the training dataset because it includes the classification labels, which makes this a supervised machine learning problem. <br>
The dataset contains about 1.3 million rows, each with a question ID, the question text, and a label (sincere vs. insincere). The first step is to extract features from the question text column, since it is the main part of the data and will be used to solve the classification problem. I will then process the question text itself to make it more manageable for the subsequent analysis.**


In [47]:
import pandas as pd

In [48]:
data = pd.read_csv('train.csv')

In [49]:
data.head()

Unnamed: 0,qid,question_text,target
0,00002165364db923c7e6,How did Quebec nationalists see their province...,0
1,000032939017120e6e44,"Do you have an adopted dog, how would you enco...",0
2,0000412ca6e4628ce2cf,Why does velocity affect time? Does velocity a...,0
3,000042bf85aa498cd78e,How did Otto von Guericke used the Magdeburg h...,0
4,0000455dfa3e01eae3af,Can I convert montra helicon D to a mountain b...,0


**Now, we will use the question text column to extract information which might be useful in classification. This includes:**
* Length of the string
* Number of capital letters
* Ratio of number of capital letters to the length of the question
* Number of words used
* Number of unique words
* Ratio of unique words to total number of words
* Number of exclamation marks, question marks and other punctuation marks
* Number of symbols / special characters used

In [50]:
# Character length of each question
data['length'] = data['question_text'].str.len()
# Number of uppercase characters
data['capitals'] = data['question_text'].apply(lambda comment: sum(1 for c in comment if c.isupper()))
# Share of uppercase characters; column division avoids a slow row-wise apply
data['caps_vs_length'] = data['capitals'] / data['length']

In [51]:
data['num_words'] = data['question_text'].apply(lambda comment: len(comment.split()))
data['num_unique_words'] = data['question_text'].apply(lambda comment: len(set(comment.split())))
# Despite its name, this column holds unique words divided by total words
data['words_vs_unique'] = data['num_unique_words'] / data['num_words']

In [52]:
data['num_exclamation_marks'] = data['question_text'].apply(lambda comment: comment.count('!'))
data['num_question_marks'] = data['question_text'].apply(lambda comment: comment.count('?'))
data['num_punctuation'] = data['question_text'].apply(lambda comment: sum(comment.count(w) for w in '.,;:'))
data['num_symbols'] = data['question_text'].apply(lambda comment: sum(comment.count(w) for w in '*&$%'))

In [53]:
data.head()

Unnamed: 0,qid,question_text,target,length,capitals,caps_vs_length,num_words,num_unique_words,words_vs_unique,num_exclamation_marks,num_question_marks,num_punctuation,num_symbols
0,00002165364db923c7e6,How did Quebec nationalists see their province...,0,72,2,0.027778,13,13,1.0,0,1,0,0
1,000032939017120e6e44,"Do you have an adopted dog, how would you enco...",0,81,1,0.012346,16,15,0.9375,0,1,1,0
2,0000412ca6e4628ce2cf,Why does velocity affect time? Does velocity a...,0,67,2,0.029851,10,8,0.8,0,2,0,0
3,000042bf85aa498cd78e,How did Otto von Guericke used the Magdeburg h...,0,57,4,0.070175,9,9,1.0,0,1,0,0
4,0000455dfa3e01eae3af,Can I convert montra helicon D to a mountain b...,0,77,3,0.038961,15,15,1.0,0,1,0,0


In [54]:
data.describe()

Unnamed: 0,target,length,capitals,caps_vs_length,num_words,num_unique_words,words_vs_unique,num_exclamation_marks,num_question_marks,num_punctuation,num_symbols
count,1306122.0,1306122.0,1306122.0,1306122.0,1306122.0,1306122.0,1306122.0,1306122.0,1306122.0,1306122.0,1306122.0
mean,0.06187018,70.67884,2.860756,0.04557576,12.80361,12.13578,0.9666687,0.001742563,1.057475,0.2904055,0.0154373
std,0.2409197,38.78428,2.573325,0.03919927,7.052437,6.040779,0.05687022,0.04676167,0.258872,0.7637911,0.1502499
min,0.0,1.0,0.0,0.0,1.0,1.0,0.0952381,0.0,0.0,0.0,0.0
25%,0.0,45.0,1.0,0.02,8.0,8.0,0.9354839,0.0,1.0,0.0,0.0
50%,0.0,60.0,2.0,0.03389831,11.0,11.0,1.0,0.0,1.0,0.0,0.0
75%,0.0,85.0,4.0,0.05882353,15.0,15.0,1.0,0.0,1.0,0.0,0.0
max,1.0,1017.0,180.0,1.0,134.0,96.0,1.0,5.0,10.0,35.0,20.0


**Now that we have extracted useful information about the format and other characteristics of the text, the next step is to normalize the text so that it can be used in the eventual analysis. The normalization steps are:**
* Removing accented characters
* Expanding contractions 
* Removing special characters
* Lemmatizing the text to retain only the base words
* Removing stop words which add unnecessary noise

In [56]:
import spacy
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
import re
import contractions
import unicodedata
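# The parse/tag/entity keyword arguments are omitted because spaCy 3.x does not accept them
nlp = spacy.load('en_core_web_sm')
tokenizer = ToktokTokenizer()
# Requires the NLTK stopword corpus: nltk.download('stopwords')
stopword_list = nltk.corpus.stopwords.words('english')
# Keep 'no' and 'not' in the text by taking them out of the stopword list
stopword_list.remove('no')
stopword_list.remove('not')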

In [57]:
def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text
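
As a quick illustration of what this step does, the sketch below applies the same NFKD decomposition to a sample string chosen for this example (it is not taken from the dataset). Decomposition splits an accented letter into a base letter plus a combining mark, and the ASCII encode with `'ignore'` then drops the mark. The name `strip_accents` is introduced here only for the demonstration.

```python
import unicodedata

def strip_accents(text):
    # Same approach as remove_accented_chars: decompose, then drop non-ASCII
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('utf-8', 'ignore')

sample = 'Café déjà vu'   # illustrative string, not from the dataset
stripped = strip_accents(sample)
print(stripped)           # Cafe deja vu
```

Characters that have no decomposition into an ASCII base letter, such as `ø`, are removed entirely rather than transliterated.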

In [58]:
def expand_contractions(text):
    expanded_text = contractions.fix(text)
    return expanded_text

In [59]:
def remove_special_characters(text, remove_digits=False):
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text
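
One detail worth checking in the character class: the range `A-z` spans ASCII codes 65 to 122, which includes the six characters that sit between `Z` and `a` (`[`, `\`, `]`, `^`, `_` and the backtick), while `A-Z` covers only the uppercase letters. The short check below compares the two ranges on a made-up string.

```python
import re

sample = 'snake_case [tag] ^caret'   # made-up string for illustration

# A-z also keeps the characters between 'Z' and 'a' in the ASCII table
wide = re.sub(r'[^a-zA-z0-9\s]', '', sample)
# A-Z keeps only letters, digits and whitespace
strict = re.sub(r'[^a-zA-Z0-9\s]', '', sample)

print(wide)    # snake_case [tag] ^caret
print(strict)  # snakecase tag caret
```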

In [60]:
def lemmatize_text(text):
    text = nlp(text)
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text

In [61]:
def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text

In [65]:
def preprocess(text):
    text = text.lower()
    text = remove_accented_chars(text)
    text = expand_contractions(text)
    text = remove_special_characters(text)
    text = lemmatize_text(text)
    text = remove_stopwords(text)
    text = re.sub(r'[\r\n]+', ' ', text)    # replace newlines with a space
    text = re.sub(' +', ' ', text)    # collapse repeated spaces
    return text

In [68]:
data['processed_text'] = data['question_text'].apply(preprocess)

KeyboardInterrupt: 
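
The notebook does not record why this cell was interrupted; one plausible reason (an assumption) is that running the full pipeline row by row over about 1.3 million questions takes a long time. Below is a minimal sketch for estimating the total runtime from a small sample before committing to the full run. The helper `estimate_total_seconds` and the stand-in `fake_preprocess` are hypothetical names introduced for this example; in the notebook, `preprocess` would take the place of the stand-in.

```python
import time

def estimate_total_seconds(func, sample, total_rows):
    """Time func on a sample and extrapolate linearly to total_rows."""
    if not sample:
        raise ValueError('sample must contain at least one item')
    start = time.perf_counter()
    for item in sample:
        func(item)
    elapsed = time.perf_counter() - start
    return elapsed / len(sample) * total_rows

def fake_preprocess(text):
    # Stand-in for the notebook's preprocess(); lowercases and squeezes spaces
    return ' '.join(text.lower().split())

# Sample text repeated to form a small timing batch
sample_questions = ['Why does velocity affect time?'] * 1000
estimate = estimate_total_seconds(fake_preprocess, sample_questions, 1306122)
print(f'Estimated runtime: {estimate:.1f} s')
```

In the notebook this could be called as `estimate_total_seconds(preprocess, data['question_text'].head(1000).tolist(), len(data))`. The extrapolation assumes the sample is representative of the full dataset.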

**Next, there are a few more informative features we can derive after processing the question text:**
* Number of words after processing the text
* Ratio of words before and after processing
* Ratio of length before and after processing
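
The notebook does not include code for these three features, so the following is only a sketch of one way to compute them for a single question. The function name, the dictionary keys, and the direction of the ratios (processed divided by raw) are assumptions made for this example.

```python
def post_processing_features(raw_text, processed_text):
    """Compare a question before and after text processing."""
    raw_words = len(raw_text.split())
    processed_words = len(processed_text.split())
    return {
        'num_words_processed': processed_words,
        # Ratios are processed / raw; 0.0 guards against empty raw text
        'words_ratio': processed_words / raw_words if raw_words else 0.0,
        'length_ratio': len(processed_text) / len(raw_text) if raw_text else 0.0,
    }

# Made-up pair of strings standing in for question_text and processed_text
features = post_processing_features('Why is the sky blue?', 'sky blue')
print(features)
# {'num_words_processed': 2, 'words_ratio': 0.4, 'length_ratio': 0.4}
```

Once `processed_text` exists, the same quantities can be computed for every row and stored as new columns before the data is saved.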

In [None]:
data.to_csv('processed_data.csv', index=False)