<a href="https://colab.research.google.com/github/imaaditya-stack/SpamFilterForQuoraQuestions-DeepLearning/blob/master/SpamFilterForQuora.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook preprocesses the text data. <br>
I have used the **NLTK** toolkit together with custom functions to clean the raw text so that better features can be extracted from it. <br>
NLTK (Natural Language Toolkit) is a large Python library that covers much of the Natural Language Processing (NLP) workflow: splitting paragraphs into sentences, splitting sentences into words, tagging the part of speech of each word, recognizing named entities, and more.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
#The easiest way to install the NLTK module is with pip.
#!pip install nltk

In [2]:
#importing packages
import pandas as pd
import re

#download the NLTK resources used below (tokenizer models, stop-word list, WordNet)
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [0]:
#Reading data
df = pd.read_csv("/content/drive/My Drive/spamfilterdata.csv")

In [0]:
df.head()

Unnamed: 0,qid,question_text,target
0,00002165364db923c7e6,How did Quebec nationalists see their province...,0
1,000032939017120e6e44,"Do you have an adopted dog, how would you enco...",0
2,0000412ca6e4628ce2cf,Why does velocity affect time? Does velocity a...,0
3,000042bf85aa498cd78e,How did Otto von Guericke used the Magdeburg h...,0
4,0000455dfa3e01eae3af,Can I convert montra helicon D to a mountain b...,0


In [0]:
df.shape

(1306122, 3)

In [4]:
#Checking For any Nan values
df.isnull().any()

qid              False
question_text    False
target           False
dtype: bool

In [0]:
df.target.value_counts()

0    1225312
1      80810
Name: target, dtype: int64

The dataset is heavily imbalanced: only 80,810 of the 1,306,122 questions (about 6.2%) have target 1.
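The class shares follow directly from the counts printed above. A minimal sketch, assuming only the two numbers shown in the `df.target.value_counts()` output; the names `counts`, `shares`, and `imbalance_ratio` are illustrative and not part of the original notebook:

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 1225312, 1: 80810}, name="target")

# Fraction of the dataset that falls in each class.
shares = counts / counts.sum()

# Number of class-0 questions for every class-1 question.
imbalance_ratio = counts[0] / counts[1]

print(shares.round(4))
print(round(imbalance_ratio, 1))
```

On the full DataFrame, `df.target.value_counts(normalize=True)` should give the same shares in one call.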

In [0]:
#Dropping the qid column: it is only an identifier and has no effect on the target variable
df.drop(columns="qid", inplace=True)

In [0]:
df.head()

Unnamed: 0,question_text,target
0,How did Quebec nationalists see their province...,0
1,"Do you have an adopted dog, how would you enco...",0
2,Why does velocity affect time? Does velocity a...,0
3,How did Otto von Guericke used the Magdeburg h...,0
4,Can I convert montra helicon D to a mountain b...,0


### Number of Words
Calculate number of words in each question


In [0]:
df["word_count"] = df["question_text"].apply(lambda x: len(str(x).split(" ")))
df[["question_text","word_count"]].head()

Unnamed: 0,question_text,word_count
0,How did Quebec nationalists see their province...,13
1,"Do you have an adopted dog, how would you enco...",16
2,Why does velocity affect time? Does velocity a...,10
3,How did Otto von Guericke used the Magdeburg h...,9
4,Can I convert montra helicon D to a mountain b...,15


In [0]:
max(df["word_count"]), min(df["word_count"])

(122, 1)

The longest question contains 122 words and the shortest contains 1. Because the text is split on single spaces, consecutive spaces produce empty strings that are also counted as words.
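The word count above uses `split(" ")`, which treats every single space as a separator. A small sketch of how that differs from `split()` with no argument; the sample string is made up for illustration:

```python
text = "How  are you"  # note the double space

# Splitting on a single space keeps an empty string between the two spaces.
tokens_single_space = text.split(" ")

# Splitting with no argument collapses runs of whitespace.
tokens_whitespace = text.split()

print(tokens_single_space)  # ['How', '', 'are', 'you']
print(tokens_whitespace)    # ['How', 'are', 'you']
```

If the count is meant to exclude those empty strings, `len(str(x).split())` would be the variant to use.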

### Number of characters
Calculate number of characters in each question


In [0]:
df['char_count'] = df['question_text'].str.len() ## this also includes spaces
df[['question_text','char_count']].head()

Unnamed: 0,question_text,char_count
0,How did Quebec nationalists see their province...,72
1,"Do you have an adopted dog, how would you enco...",81
2,Why does velocity affect time? Does velocity a...,67
3,How did Otto von Guericke used the Magdeburg h...,57
4,Can I convert montra helicon D to a mountain b...,77


In [0]:
max(df["char_count"]), min(df["char_count"])

(1017, 1)

The longest question is 1017 characters long and the shortest is 1 character.

### Number of hashtags
Count the words that start with `#` in each question

In [0]:
df['hastags'] = df['question_text'].apply(lambda text: len([word for word in text.split() if word.startswith('#')]))
df[['question_text','hastags']].head()

Unnamed: 0,question_text,hastags
0,How did Quebec nationalists see their province...,0
1,"Do you have an adopted dog, how would you enco...",0
2,Why does velocity affect time? Does velocity a...,0
3,How did Otto von Guericke used the Magdeburg h...,0
4,Can I convert montra helicon D to a mountain b...,0


In [0]:
max(df["hastags"]), min(df["hastags"])

(10, 0)

The number of words starting with `#` ranges from 0 to 10 per question.
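The count above only looks at whitespace-separated tokens that begin with `#`. A small sketch of what that rule does and does not catch; `count_hashtag_tokens` is a hypothetical helper that mirrors the lambda used in the cell above, and the sample strings are made up:

```python
def count_hashtag_tokens(text):
    """Count whitespace-separated tokens that begin with '#'."""
    return sum(1 for token in text.split() if token.startswith("#"))

print(count_hashtag_tokens("Studying is #hard. Just #budget the time"))  # 2
print(count_hashtag_tokens("Is C# faster than F#?"))                     # 0
print(count_hashtag_tokens("# of users vs #users"))                      # 2
```

A `#` in the middle or at the end of a token is ignored, while a lone `#` counts as a match.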

### Number of numerics
Count the purely numeric words in each question

In [0]:
df['numerics'] = df['question_text'].apply(lambda text: len([word for word in text.split() if word.isdigit()]))
df[['question_text','numerics']].head()

Unnamed: 0,question_text,numerics
0,How did Quebec nationalists see their province...,0
1,"Do you have an adopted dog, how would you enco...",0
2,Why does velocity affect time? Does velocity a...,0
3,How did Otto von Guericke used the Magdeburg h...,0
4,Can I convert montra helicon D to a mountain b...,0


In [0]:
max(df["numerics"]), min(df["numerics"])

(15, 0)

The number of purely numeric words ranges from 0 to 15 per question.
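`str.isdigit()` is true only when every character of the token is a digit, so numbers with attached punctuation or a decimal point are not counted. A brief sketch; `count_numeric_tokens` is a hypothetical helper mirroring the lambda in the cell above, and the sample strings are made up:

```python
def count_numeric_tokens(text):
    """Count whitespace-separated tokens made up entirely of digits."""
    return sum(1 for token in text.split() if token.isdigit())

print(count_numeric_tokens("I scored 95 in 2019"))         # 2
print(count_numeric_tokens("Is 3.5 GPA enough in 2019?"))  # 0
```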

### Finding Common Words

In [0]:
freq_common = pd.Series(' '.join(df['question_text']).split()).value_counts()[:10]
freq_common

the     653948
What    417802
to      403183
a       402682
in      363131
is      331647
of      330825
I       306261
How     261930
and     251973
dtype: int64
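The frequency count above joins all questions into one large string before splitting it. A hedged alternative sketch using `collections.Counter`, which accumulates counts question by question; the `questions` list is toy data standing in for `df['question_text']`:

```python
from collections import Counter

# Toy stand-in for df['question_text'] (an assumption for illustration).
questions = [
    "What is the capital of the country",
    "What is the time",
]

word_counts = Counter()
for question in questions:
    # Same whitespace split as in the cell above, applied one question at a time.
    word_counts.update(question.split())

# The most frequent words, analogous to value_counts()[:10].
print(word_counts.most_common(3))
```

Like the original cell, this is case-sensitive and keeps punctuation attached to words, so `What` and `what` are counted separately.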

### Finding Rare words

In [0]:
freq_rare = pd.Series(' '.join(df['question_text']).split()).value_counts()[-10:]
freq_rare

phonepe         1
Capoeria        1
BASF?           1
Mumi&Bubi       1
Ashu            1
avation?        1
lunch/dinner    1
(drunk)         1
movies/films    1
IDW             1
dtype: int64

# Pre Processing 

In [0]:
def tokenize(words):
  """Divides the text into a sequence of words"""
  return(word_tokenize(words))

def to_lowercase(words):
  """Convert all characters to lowercase from list of tokenized words"""
  new_words = []
  for word in words:
      new_word = word.lower()
      new_words.append(new_word)
  return new_words

def remove_punctuation(words):
  """Remove punctuation from list of tokenized words"""
  new_words = []
  for word in words:
      new_word = re.sub(r'[^\w\s]', '', word)
      if new_word != '':
          new_words.append(new_word)
  return new_words

def remove_digits(words):
  """Remove Digits from the list of tokenized words"""
  pattern = '[0-9]'
  new_words = []
  for word in words:
      new_word = re.sub(pattern, '', word)
      if new_word != '':
          new_words.append(new_word) 
  return new_words

#Build the stop-word set once: reloading the list for every word is very slow on 1.3M questions
STOP_WORDS = set(stopwords.words('english'))

def remove_stopwords(words):
  """Remove stop words from list of tokenized words"""
  return [word for word in words if word not in STOP_WORDS]

def get_wordnet_pos(treebank_tag):
  """Map a Penn Treebank POS tag to the matching WordNet POS constant (helper for POS-aware lemmatization; normalize() below does not call it)"""
  if treebank_tag.startswith('J'):
      return wordnet.ADJ
  elif treebank_tag.startswith('V'):
      return wordnet.VERB
  elif treebank_tag.startswith('N'):
      return wordnet.NOUN
  elif treebank_tag.startswith('R'):
      return wordnet.ADV
  else:
      return ''

def lemmatize_verbs(words):
  """Lemmatize every token in the list, treating each one as a verb"""
  lemmatizer = WordNetLemmatizer()
  lemmas = []
  for word in words:
      lemma = lemmatizer.lemmatize(word, wordnet.VERB)
      lemmas.append(lemma)
  return lemmas

def remove_duplicates(words):
  """Remove duplicate words from list of tokenized words""" 
  final_words = [] 
  for word in words: 
      if word not in final_words: 
          final_words.append(word) 
  return final_words 

def normalize(words):
  """Apply all the above functions sequentially to a given input string"""
  words = tokenize(words)
  words = to_lowercase(words)
  words = remove_punctuation(words)
  words = remove_digits(words)
  words = remove_stopwords(words)
  words = lemmatize_verbs(words)
  words = remove_duplicates(words)
  return words
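`remove_duplicates` above scans the result list once for every word. A minimal sketch of an equivalent order-preserving approach that relies on dictionaries keeping insertion order (Python 3.7+); `remove_duplicates_fast` is a hypothetical name, not a function from the notebook:

```python
def remove_duplicates_fast(words):
    """Drop repeated words while keeping the first occurrence of each, in order."""
    return list(dict.fromkeys(words))

print(remove_duplicates_fast(["time", "space", "time", "geometry", "space"]))
# ['time', 'space', 'geometry']
```

For short questions the difference is negligible; it matters mainly for very long token lists.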

## Testing Pre Processing Function for an individual sample

### Sample Text 1
Studying is #hard. Sometimes you won’t be #prepared. But taking #prescription #drugs is a #dangerous way & get an #edge. Some people have gotten hooked on these #medicines & even died. Just #budget the time to hunker down and hit those #books.?

In [0]:
#Output after preprocessing
print(normalize("Studying is #hard. Sometimes you won’t be #prepared. But taking #prescription #drugs is a #dangerous way & get an #edge. Some people have gotten hooked on these #medicines & even died. Just #budget the time to hunker down and hit those #books.?"))

['study', 'hard', 'sometimes', 'prepare', 'take', 'prescription', 'drug', 'dangerous', 'way', 'get', 'edge', 'people', 'hook', 'medicine', 'even', 'die', 'budget', 'time', 'hunker', 'hit', 'book']


### Sample Text 2
What is [math]\frac{\int_{1x^5}^{3x^{-5}} \tan(\tan({\boxed{\int_{1x^0}^{1x^2} \sum_{\varpi=1}^{\infty} \int_{2x^{-3}}^{2x^2} \sum_{\alpha=7}^{\infty} \underbrace{\sqrt[2]{1x^5}}_{\text{Gauss's Law of Theoretical Probability.}} d\tau dx}}^{1x^0})) d\mu}{\int_{2x^{-3}}^{1x^5} \cos(\int_{2x^2}^{1x^{-3}} \frac{\sqrt[2]{\overbrace{\underbrace{\frac{3x^3+3x^5}{\sqrt[3]{2x^{-3}}}}_{\text{Gauss's Law of Theoretical Probability.}} \times \overbrace{\tan(2x^0)}^{\text{Gauss's Law of Theoretical Probability.}}-\sum_{4=7}^{\infty} \boxed{3x^{-5}}}^{\text{Inverse Function.}}}}{{\boxed{\int_{2x^2}^{2x^4} 3x^1 d9} \div \sum_{6=6}^{\infty} \sqrt[3]{2x^2}+\sqrt[4]{\sin(2x^0+3x^0)}}^{2x^{-4}}+\boxed{\frac{\vec{\boxed{\sum_{\gamma=10}^{\infty} 1x^{-5}}}}{\frac{\sum_{\iota=2}^{\infty} 1x^{-5}-\frac{3x^{-1}}{1x^{-4}}}{\sin(\tan(3x^{-2}))}}}} \times \boxed{\sqrt[2]{{{{\sqrt[5]{2x^5}}^{2x^{-1}}}^{2x^{-1}} \div \sum_{\chi=6}^{\infty} \int_{1x^4}^{2x^{-4}} 3x^2 d\vartheta+{2x^{-3}}^{2x^{-5}}}^{3x^{-4}}}} d\mu) d\iota}[/math]?

In [0]:
#Output after preprocessing
print(normalize(r"What is [math]\frac{\int_{1x^5}^{3x^{-5}} \tan(\tan({\boxed{\int_{1x^0}^{1x^2} \sum_{\varpi=1}^{\infty} \int_{2x^{-3}}^{2x^2} \sum_{\alpha=7}^{\infty} \underbrace{\sqrt[2]{1x^5}}_{\text{Gauss's Law of Theoretical Probability.}} d\tau dx}}^{1x^0})) d\mu}{\int_{2x^{-3}}^{1x^5} \cos(\int_{2x^2}^{1x^{-3}} \frac{\sqrt[2]{\overbrace{\underbrace{\frac{3x^3+3x^5}{\sqrt[3]{2x^{-3}}}}_{\text{Gauss's Law of Theoretical Probability.}} \times \overbrace{\tan(2x^0)}^{\text{Gauss's Law of Theoretical Probability.}}-\sum_{4=7}^{\infty} \boxed{3x^{-5}}}^{\text{Inverse Function.}}}}{{\boxed{\int_{2x^2}^{2x^4} 3x^1 d9} \div \sum_{6=6}^{\infty} \sqrt[3]{2x^2}+\sqrt[4]{\sin(2x^0+3x^0)}}^{2x^{-4}}+\boxed{\frac{\vec{\boxed{\sum_{\gamma=10}^{\infty} 1x^{-5}}}}{\frac{\sum_{\iota=2}^{\infty} 1x^{-5}-\frac{3x^{-1}}{1x^{-4}}}{\sin(\tan(3x^{-2}))}}}} \times \boxed{\sqrt[2]{{{{\sqrt[5]{2x^5}}^{2x^{-1}}}^{2x^{-1}} \div \sum_{\chi=6}^{\infty} \int_{1x^4}^{2x^{-4}} 3x^2 d\vartheta+{2x^{-3}}^{2x^{-5}}}^{3x^{-4}}}} d\mu) d\iota}[/math]?"))

['math', 'frac', 'int_', 'x', 'tan', 'box', 'sum_', 'varpi', 'infty', 'alpha', 'underbrace', 'sqrt', '_', 'text', 'gauss', 'law', 'theoretical', 'probability', 'dtau', 'dx', 'dmu', 'cos', 'overbrace', 'xx', 'time', 'inverse', 'function', 'div', 'sin', 'vec', 'gamma', 'iota', 'chi', 'dvartheta', 'diota']
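Tokens such as `'int_'`, `'sum_'` and `'_'` survive in the output above because the pattern `[^\w\s]` used in `remove_punctuation` treats the underscore as a word character. A small sketch of that behaviour, plus a variant pattern that also strips underscores; the variant is an assumption for illustration, not what the notebook does:

```python
import re

tokens = ["int_", "sum_", "_", "\\frac", "x^5"]

# Pattern from remove_punctuation: '_' matches \w, so it is kept.
kept = [re.sub(r"[^\w\s]", "", token) for token in tokens]

# Variant that additionally removes underscores, then drops empty results.
stripped = [re.sub(r"[^\w\s]|_", "", token) for token in tokens]
stripped = [token for token in stripped if token != ""]

print(kept)      # ['int_', 'sum_', '_', 'frac', 'x5']
print(stripped)  # ['int', 'sum', 'frac', 'x5']
```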


## Applying Pre Processing Function on some data samples 

In [0]:
sampledata = pd.DataFrame(df["question_text"][:5])

In [0]:
sampledata

Unnamed: 0,question_text
0,How did Quebec nationalists see their province...
1,"Do you have an adopted dog, how would you enco..."
2,Why does velocity affect time? Does velocity a...
3,How did Otto von Guericke used the Magdeburg h...
4,Can I convert montra helicon D to a mountain b...


In [0]:
##Output after preprocessing
sampledata["question_text"].apply(lambda x: " ".join(normalize(x)))

0              quebec nationalists see province nation
1                adopt dog would encourage people shop
2                  velocity affect time space geometry
3          otto von guericke use magdeburg hemispheres
4    convert montra helicon mountain bike change tyres
Name: question_text, dtype: object

In [0]:
#Applying the preprocessing function on the whole dataset
df["Question_text_modified"] = df["question_text"].apply(lambda x: " ".join(normalize(x)))

# Saving the cleaned dataset to a CSV file for training

In [0]:
df[["Question_text_modified","target"]].to_csv(path_or_buf='SpamFilterCleanedData.csv',sep=",",index=False,encoding='utf-8')
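One thing to watch when this file is read back: a question made up only of stop words, digits, or punctuation normalizes to an empty string, and `pd.read_csv` loads empty fields as `NaN` by default. A minimal round-trip sketch on toy data (the `cleaned` frame and the in-memory buffer are assumptions for illustration):

```python
import io
import pandas as pd

# Toy frame: the second row mimics a question that normalized to an empty string.
cleaned = pd.DataFrame(
    {"Question_text_modified": ["quebec nationalists", ""], "target": [0, 1]}
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# Default read: the empty field comes back as NaN.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded["Question_text_modified"].isna().sum())

# keep_default_na=False keeps the empty field as an empty string.
buffer.seek(0)
reloaded_as_text = pd.read_csv(buffer, keep_default_na=False)
print(reloaded_as_text["Question_text_modified"].tolist())
```

Calling `.fillna("")` on the text column after a default read is another way to get the same result.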