## Quora Insincerity Detection
---
Started: 30 Jan 2022

## Step Zero
#### Aim: Import the necessary libraries, prepare constants (if any), and load the data
---

In [1]:
#!pip install pyspellchecker
# import all necessary libraries

# For dataframes
import pandas as pd 

# For numerical arrays
import numpy as np 

# For stemming/Lemmatisation/POS tagging
import spacy

# For getting stopwords
from spacy.lang.en.stop_words import STOP_WORDS

# For K-Fold cross validation
from sklearn.model_selection import KFold

# For visualizations
import matplotlib.pyplot as plt

# For regular expressions
import re

# For handling string
import string

# For all torch-supported actions
import torch

# For spell-check
# from spellchecker import SpellChecker

# For performing mathematical operations
import math

# For dictionary related activities
from collections import defaultdict

# For counting actions (EDA)
from collections import  Counter

# For count vectorisation (EDA)
from sklearn.feature_extraction.text import CountVectorizer

# For one-hot encoding
from tensorflow.keras.utils import to_categorical

# For DL model
from tensorflow.keras.layers import Dense, Input, GlobalMaxPooling1D
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Embedding, LSTM
from tensorflow.keras.models import Model, Sequential

# For generating random integers
from random import randint

# For making wordclouds
from wordcloud import WordCloud 

# For TF-IDF vectorisation
from sklearn.feature_extraction.text import TfidfVectorizer

# For padding
from tensorflow.keras.preprocessing.sequence import pad_sequences

# For tokenization
from tensorflow.keras.preprocessing.text import Tokenizer

# For plotting
import seaborn as sns

print("Necessary libraries imported")


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [2]:
df=pd.read_csv('../input/quora-insincere-questions-classification/train.csv')
df.head(10)

## Step One
#### Aim: Time to do some EDA baby!
---
This phase involves building a complete understanding of what is in the dataset, and the key nuances that need to be understood before framing the input-output ML pipeline. We perform the following:

- Dataset description (to know what's presented and what's not available)

#### 1.1 Dataset Description
Studying the basic statistics of the dataset, which covers the following aspects:

- Analysing columns
- Null-Value statistics
- Overall column-wise stats
- Highest and lowest word length, input length
- Target types and frequency

In [38]:
print('Total rows in dataset: ',len(df),'Rows\n')
print('Dataset columns: ')
print(df.columns)
print('\nNull Statistics (in %): ')
print(df.isnull().sum()* 100 / len(df))
print('\nDataset description: ')
print(df.describe())
print('\nQuestion text frequency: ')
print(df.question_text.describe().loc[['count','unique']])

print('\nMax and Min statistics for word/char count of question_text: ')
print('MAX\t',max(df.question_text.apply(lambda x: len(x))),'characters,',
     max(df.question_text.apply(lambda x: len(x.split()))), 'words')
print('MIN\t',min(df.question_text.apply(lambda x: len(x))),'character(s),',
     min(df.question_text.apply(lambda x: len(x.split()))), 'word(s)')

print('Pie plot against Target')
df.target.value_counts().plot(title='Target categories',kind='pie')


**Inference:**
1. There are `1306122` rows, all of which are non-null and unique *(That's a lot!)*
2. We only have 3 columns: one column is the Q-ID (won't contribute to decision making), the other column is the Question Text *(the main input)*, and the last column is the Target *(the expected output)*
3. The data is highly imbalanced, with 1225k sincere questions (`target=0`) and only 80k insincere questions (`target=1`) *(which means we've got to be careful while training our models, because the dataset itself leans heavily towards sincerity)*
4. A `question_text` can run past 100 words or 1000 characters *(i.e., our model should be able to handle long inputs)*
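
The imbalance noted above can be quantified as a per-class share. The snippet below is a minimal sketch on toy labels; `toy_targets` and `class_shares` are hypothetical names used only for illustration, not part of the original notebook.

```python
from collections import Counter

# Toy labels standing in for df.target (0 = sincere, 1 = insincere); assumed data
toy_targets = [0] * 15 + [1]

def class_shares(labels):
    """Return {label: fraction of rows} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

shares = class_shares(toy_targets)
for label, share in sorted(shares.items()):
    print(f"target={label}: {share:.1%}")
```

On the real dataframe, `df.target.value_counts(normalize=True)` should give the same kind of breakdown directly.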

#### 1.2 Effect of question word lengths on target
Here, we observe how the targets are distributed against question length, capturing the following trends:

- Total words vs target
- Word length vs target

In [39]:
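def get_avg_length(text):
    """Mean characters per whitespace-separated word, rounded to 2 decimals."""
    words=text.split()
    if not words:  # guard against empty or whitespace-only text
        return 0.0
    return round(sum(len(word) for word in words)/len(words),2)

df['avg_word_length']=df.question_text.apply(get_avg_length)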
print("ScatterPlot for Target vs average word length")
plt.figure(figsize=(10,10))  # plt.plot() does not take figsize; plt.figure() sets the canvas size
plt.title('Target versus average word length')

sns.stripplot(data=df,
    x="target", y="avg_word_length")

plt.show()



In [40]:
df['total_words']=df.question_text.apply(lambda x: len(x.split()))
print("Boxplot for Target vs total words")
plt.figure(figsize=(8,15))  # plt.plot() does not take figsize; plt.figure() sets the canvas size
plt.title('Target versus total words')

sns.boxplot(data=df,
    x="target", y="total_words")
plt.show()

**Inference:**
1. The average word length for both categories is scattered around the 5-15 mark, but a few outliers have an average word length above 60. <br>*(we have noise in our data: we can expect sentences with grammar, syntax and spelling errors.)*
2. Although the total word count can cross 100, the majority of questions fall in the 0-50 range, barring outliers. <br>*(Most sentences in this dataset sit within a manageable range, while a minority could pose a problem due to an unusually high number of words.)*
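
One way to turn the "most questions are short, a few are very long" observation into a number is a length percentile, which could later guide a padding or truncation length. This is only a sketch under assumptions: `length_percentile` and `toy_questions` are hypothetical, and the nearest-rank percentile is just one of several possible definitions.

```python
import math

def length_percentile(texts, pct):
    """Nearest-rank percentile of whitespace-token counts across texts."""
    lengths = sorted(len(t.split()) for t in texts)
    rank = max(1, math.ceil(pct / 100 * len(lengths)))
    return lengths[rank - 1]

# Toy questions (assumed data): three short ones and one 120-word outlier
toy_questions = [
    "How do I learn Python?",
    "What is the capital of France?",
    "Why?",
    "word " * 120,
]

print(length_percentile(toy_questions, 75))   # length covering 75% of the questions
print(length_percentile(toy_questions, 100))  # the longest question
```

Picking a high percentile rather than the maximum keeps the outlier from dictating the input size for every other question.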

#### 1.3 Unigram analysis
The primary goal here is to see what words are most frequently used. This will be done in the following ways:

- Frequency of stop-words used
- Most commonly occurring non-stop-words in each target category

In [41]:
print("Stop-word frequency")

fig, axes = plt.subplots(1,len(df.target.unique()), figsize=(20,8))
fig.suptitle('Stop-word frequency')

for index,target in enumerate(df.target.unique()):
    curdf=df[df['target']==target]  
    
    # Join all questions with a space separator (so words from adjacent questions don't fuse), then split into one list of words for counting
    allwordsarr=curdf.question_text.str.cat(sep=' ').split()
    counter=Counter(allwordsarr)
    most=counter.most_common()
    x=[]
    y=[]
    for word,count in most[:30]:
        if word.lower() in STOP_WORDS:  # lowercase first, since STOP_WORDS entries are lowercase
            x.append(word)
            y.append(count)
    sns.barplot(ax=axes[index],x=y,y=x)
    axes[index].set_title("Target: "+str(target))


In [42]:
print("Wordcloud")

fig, axes = plt.subplots(1,len(df.target.unique()), figsize=(20,8))

    
for index,target in enumerate(df.target.unique()):
    curdf=df[df['target']==target]  
    
    # Join all questions into one string, separated by spaces so adjacent questions don't fuse into a single word
    df_fullstring=curdf.question_text.str.cat(sep=' ')
    wordcloud = WordCloud(background_color='white',max_words=100,
                      max_font_size=40,
                      scale=3,
                      random_state=1).generate(df_fullstring)

    axes[index].imshow(wordcloud)
    axes[index].set_title('Target: '+str(target))
    axes[index].axis('off')
    
fig.suptitle('Wordcloud')


In [43]:
print("Most commonly occurring words in all categories")

fig, axes = plt.subplots(1,len(df.target.unique()), figsize=(20,8))
fig.suptitle('Common-word frequency')

for index,target in enumerate(df.target.unique()):
    curdf=df[df['target']==target]  
    # Space separator keeps the last word of one question from fusing with the first word of the next
    allwordsarr=curdf.question_text.str.cat(sep=' ').split()
    counter=Counter(allwordsarr)
    most=counter.most_common()
    x=[]
    y=[]
    for word,count in most[:100]:
        if (word.lower() not in STOP_WORDS):
            x.append(word)
            y.append(count)
    sns.barplot(ax=axes[index],x=y,y=x)
    axes[index].set_title("Target: "+str(target))


**Inference**

We get the following information from these two series of bar charts:

- Stop-words are very frequent in each class *(raising the need to consider stopword-removal during data cleaning and feature-formatting)*
- No single word appears to be strongly associated with only one category. For instance, the popular word "India" appears on both sides 
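
The non-stop-word counts discussed above can also be computed question by question, which sidesteps any word-fusing at question boundaries. The following is a minimal standard-library sketch; `TOY_STOP_WORDS`, `top_content_words` and the sample questions are assumptions for illustration (the notebook itself uses spaCy's `STOP_WORDS`).

```python
from collections import Counter
import string

# Tiny stand-in stop-word set (assumption; spaCy's STOP_WORDS is far larger)
TOY_STOP_WORDS = {"the", "is", "a", "of", "in", "why", "what", "do", "are"}

def top_content_words(texts, stop_words, n=3):
    """Count lowercase tokens, stripped of edge punctuation, skipping stop-words."""
    counter = Counter()
    for text in texts:
        for token in text.lower().split():
            token = token.strip(string.punctuation)
            if token and token not in stop_words:
                counter[token] += 1
    return counter.most_common(n)

toy_questions = [
    "What is the capital of India?",
    "Why do people in India like cricket?",
    "Is cricket a sport?",
]

top = top_content_words(toy_questions, TOY_STOP_WORDS, n=2)
print(top)
```

Stripping punctuation from token edges means "India?" and "India" are counted as the same word, which a plain `split()` would treat as two different entries.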

#### 1.4 Bigram analysis
The primary goal here is to analyse the trend of bigrams used in each category

In [44]:
def get_top_bigrams(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

print("Bigram analysis")

fig, axes = plt.subplots(1,len(df.target.unique()), figsize=(20,8))
fig.suptitle('Bigram analysis')

for index,target in enumerate(df.target.unique()):
    top_bigrams=get_top_bigrams(df[df['target']==target].question_text)[:50]  # 50 most frequent bigrams for this class
    x,y=map(list,zip(*top_bigrams))
    sns.barplot(ax=axes[index],x=y,y=x)
    axes[index].set_title("Target: "+str(target))

plt.show()

**Inference**
- The data is heavily populated with stop words; almost all of the top occurrences contain stop-words
- It's surprising to see a good number of region-specific, religion-specific and community-specific terms. For instance, the bigram "Donald Trump" is very popular in the `target=1` class *(i.e., there is geographical bias, and we have to ensure that the model we build WILL NOT blindly absorb this trait)*
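
For intuition about what the bigram counts measure, here is a small standard-library sketch that pairs adjacent tokens with `zip`. It is a swapped-in illustration rather than the `CountVectorizer` approach used in the cell above; the token regex only roughly mirrors `CountVectorizer`'s default pattern (tokens of two or more word characters), and `top_bigrams` and the toy questions are hypothetical.

```python
from collections import Counter
import re

def top_bigrams(texts, n=3):
    """Count adjacent token pairs within each text and return the n most common."""
    counter = Counter()
    for text in texts:
        # Tokens of 2+ word characters, lowercased
        tokens = re.findall(r"\b\w\w+\b", text.lower())
        counter.update(zip(tokens, tokens[1:]))
    return [(" ".join(pair), count) for pair, count in counter.most_common(n)]

toy_questions = [
    "Why is Donald Trump popular?",
    "Is Donald Trump a good president?",
    "What is the best book?",
]

bigrams = top_bigrams(toy_questions, n=2)
print(bigrams)
```

Because pairs are formed per question, no bigram spans the boundary between two different questions.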

#### 1.5 Other stuff

Here, we analyse the following
- Presence of HTML tags
- Presence of URLs
- Presence of emojis
- Capitalisation
- Punctuation-statistics

In [45]:
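# UNICODE_EMOJI is not available in newer releases of the emoji package (assumed: 2.0+), so fall back to EMOJI_DATA
try:
    from emoji import UNICODE_EMOJI
    EMOJI_CHARS = UNICODE_EMOJI['en']
except ImportError:
    from emoji import EMOJI_DATA
    EMOJI_CHARS = EMOJI_DATA

def count_emojis(s):
    """Count every occurrence of any known emoji character in s."""
    count = 0
    for emoji in EMOJI_CHARS:
        count += s.count(emoji)
    return count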

allvalues=[]
plt.figure(figsize=(8,8))
for index,target in enumerate(df.target.unique()):
    cur_dict=dict()
    # Work on a standalone Series so nothing is assigned back onto a slice of df
    questions=df.loc[df['target']==target,'question_text'].astype(str)
    
    # Original casing (whitespace normalised), used for capitalisation and emoji counts
    df_caps=" ".join(questions.str.cat(sep=' ').split())
    # Lowercased copy, used for tag and URL counts
    df_fullstring=df_caps.lower()
    cur_dict['HTML tags']=len(re.findall(r"<[^<>]+>",df_fullstring))  # one match per tag instead of one greedy match
    cur_dict['URL']=len(re.findall(r"http",df_fullstring))
    cur_dict['Capitalised']=len(re.findall(r"[^.!?]\s[A-Z]\w+[\W_]",df_caps))  # capitalised word not right after sentence-ending punctuation
    cur_dict['Emojis']=count_emojis(df_caps)
    
    x_keys = list(cur_dict.keys())
    y_values = list(cur_dict.values())
    x_axis = np.arange(len(x_keys))
    bars=plt.bar(x_axis - 0.2+0.4*index, y_values,0.4, label = 'Target: '+str(target))
    allvalues+=y_values
    for bar in bars:
        yval = bar.get_height()
        plt.text(bar.get_x()+0.15, yval+5, yval)


    
plt.xticks(x_axis, x_keys)
plt.xlabel("Elements")
plt.ylabel("Quantity of occurrence")
plt.legend()
plt.show()



In [46]:

def count_punctuations(s):
    arr=list()
    for punct in string.punctuation:
        count = s.count(punct)
        arr.append(count)
    return arr

allvalues=[]
plt.figure(figsize=(20,20))
for index,target in enumerate(df.target.unique()):
    cur_dict=dict()
    # Standalone Series, so nothing is assigned back onto a slice of df
    df_fullstring=df.loc[df['target']==target,'question_text'].astype(str).str.cat(sep=' ')
    
    x_keys = list(string.punctuation)  # one tick label per punctuation character
    y_values = count_punctuations(df_fullstring)
    x_axis = np.arange(len(x_keys))
    bars=plt.bar(x_axis - 0.2+0.4*index, y_values,0.4, label = 'Target: '+str(target))
    


    
plt.xticks(x_axis, x_keys)
plt.xlabel("Punctuation")
plt.ylabel("Quantity of occurrence")
plt.legend()
plt.show()


**Inference**
- HTML tags and HTTP URLs do occur in our dataset *(ignoring them might lose significant information, hence we need to find a way to treat them)*
- There are A LOT of capitalised words, **indicating** that there could be a huge number of proper nouns in the dataset *(Proper nouns like names of people, cities, etc. may be rare or unseen tokens for pre-trained transformers like BERT. We may or may not lose information by lowercasing them)*
- There are emojis in the dataset, but almost all of them are one of the following: `©`, `™`, `®` *(hence, we don't need to worry about cleaning emojis)*
- Punctuation... there's a huge load of it. It's not at all surprising to see the question mark top the charts, because after all... Quora is for Questions!
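
One pitfall when counting tags on a single concatenated string is greedy matching: a pattern like `<.*>` stretches from the first `<` to the last `>` and reports one match, however many tags there are. The sketch below contrasts it with a bounded pattern on a made-up string; `joined` and the URL pattern are assumptions for illustration only.

```python
import re

# Made-up text standing in for the concatenated question_text column
joined = "Is <b>bold</b> text ok? See http://example.com and <i>this</i>."

greedy = re.findall(r"<.*>", joined)       # spans from the first '<' to the last '>'
tags = re.findall(r"<[^<>]+>", joined)     # one match per individual tag
urls = re.findall(r"https?://\S+", joined) # http or https followed by non-space characters

print(len(greedy), len(tags), urls)
```

Note that a bounded pattern can still pick up non-HTML text such as `a < b and c > d`, so the resulting count is better read as an upper estimate.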

### **EDA: the conclusion**

We gained the following knowledge from Exploratory Data Analysis:

- There are no null rows, and no duplicate rows
- Oh boy, we have a hell of a lot of data imbalance! (~94% `target=0` and ~6% `target=1`)
- There's bias, i.e., we can see community-specific and location-specific terms skewed towards a category
- Both unigram and bigram analysis show the abundance of stop words *(Spoiler alert: we're not doing anything to treat it😉)*
- We are unable to draw a clear relation between question length, average word length and target class *(they come in all shapes and sizes 🏈⚽)*
- We have a few HTML tags, and a few HTTP URLs
- We only have ©orporate emoji™ in our dataset®
- The dataset is rich in punctuation; keeping it could hopefully contribute to the knowledge mining process❕‼

### STEPS 2 and above

For the data pre-processing, feature engineering and model development pipeline of this problem statement, hop over to [**Part two** of this notebook](https://www.kaggle.com/kodoorakiller/a-bertilicious-way)




In [52]:
# ---------------------------------------------------------------------------------------------------------
# Copy, edit or upvote if you find this notebook useful. Reach out via comments or email if you wish to!
# Author: kodooraKILLER 
# ---------------------------------------------------------------------------------------------------------