This project is implemented in three main phases:
1. PHASE-1: PRE-PROCESSING & EDA
2. PHASE-2: FEATURE EXTRACTION
3. PHASE-3: MODELLING

## PHASE-1: PRE-PROCESSING & EDA

This phase is further divided into the following three parts:
1. DATA CLEANING
2. DATA VISUALIZATION
3. INSIGHTS DISCOVERY

#### Importing required libraries/packages

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

import string
import re
from tqdm.notebook import tqdm

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize, sent_tokenize

from textblob import TextBlob

import contractions
from spellchecker import SpellChecker

nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('punkt') 
nltk.download('stopwords')

# initialize WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Create a spellchecker object for English
spell = SpellChecker(language='en')

warnings.filterwarnings("ignore")

### 1.A DATA CLEANING

List of functions created to implement Data Cleaning:
1. text_cleaning(text)
2. lemmatize_with_pos(word)
3. count_spelling_mistakes(df)
4. count_pos_tags(tokens)
5. assign_score_category(row)
6. process_df(df)

Let's look at each of them in detail:

#### 1. text_cleaning(text):

    INPUT: raw essay  
    OUTPUT: a single string of cleaned, lowercased tokens (words) separated by spaces

    This function performs the following operations:
    1. Contractions - expands shortened words (e.g. don't -> do not)
    2. Tokenization - splits the text into word and punctuation tokens
    3. Cleaning - removes extra whitespace, newlines, tabs, stop words, punctuation, and words of two characters or fewer

In [2]:
# register apostrophe-less spellings of common contractions
contractions.contractions_dict['dont'] = 'do not'
contractions.contractions_dict['didnt'] = 'did not'
contractions.contractions_dict['couldnt'] = 'could not'
contractions.contractions_dict['cant'] = 'can not'
contractions.contractions_dict['doesnt'] = 'does not'
contractions.contractions_dict['wont'] = 'will not'
contractions.contractions_dict['shouldnt'] = 'should not'

In [3]:
def text_cleaning(text):

    # expand contractions word by word (e.g. "don't" -> "do not")
    expanded_text = ' '.join(contractions.fix(word) for word in text.split())

    # tokenize, then re-join so that punctuation becomes separate tokens
    tokens = word_tokenize(expanded_text)

    # convert text to lowercase and remove leading/trailing white space
    text = ' '.join(tokens).lower().strip()

    # remove newlines, tabs, and extra white spaces
    text = re.sub(r'\n|\r|\t', ' ', text)
    text = re.sub(r' +', ' ', text).strip()

    # remove stop words
    stop_words = set(stopwords.words('english'))
    cleaned_text = ' '.join(word for word in text.split() if word not in stop_words)

    # remove punctuation characters
    cleaned_text = ''.join(char for char in cleaned_text if char not in string.punctuation)

    # keep only words longer than two characters
    cleaned_text = ' '.join(word for word in cleaned_text.split() if len(word) > 2)

    return cleaned_text
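
To see what each cleaning step does without installing NLTK or `contractions`, here is a minimal dependency-free sketch of the same pipeline. The lookup tables `CONTRACTIONS` and `STOP_WORDS` and the name `simple_text_cleaning` are hypothetical stand-ins for illustration only; the notebook itself relies on the `contractions` package and NLTK's English stop-word list.

```python
import re
import string

# Hypothetical, deliberately tiny lookup tables (assumptions for this sketch).
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}
STOP_WORDS = {"i", "do", "not", "it", "is", "the", "a", "to", "and"}


def simple_text_cleaning(text):
    # 1. lowercase and expand contractions word by word
    words = [CONTRACTIONS.get(w.lower(), w.lower()) for w in text.split()]
    text = " ".join(words)
    # 2. collapse newlines, tabs and repeated spaces
    text = re.sub(r"\s+", " ", text).strip()
    # 3. drop stop words, then punctuation, then words of length <= 2
    kept = [w for w in text.split() if w not in STOP_WORDS]
    no_punct = "".join(c for c in " ".join(kept) if c not in string.punctuation)
    return " ".join(w for w in no_punct.split() if len(w) > 2)


print(simple_text_cleaning("I don't like the\tweather, it's too cold!"))
# prints: like weather too cold
```

Note that "not" is treated as a stop word in this sketch, so the negation disappears from the cleaned text. NLTK's English stop-word list also contains "not", so the same effect should be expected from `text_cleaning`.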

#### 2. lemmatize_with_pos(word):

    INPUT: a single (spell-corrected) word  
    OUTPUT: the word in its base form (lemma)

    This function reduces a word to its base form using part-of-speech tagging. It determines the POS tag of the word with the `get_wordnet_pos()` function, which maps the first character of the tag to the POS constant expected by the WordNetLemmatizer. The helper `lemmatize_text()` applies it to every word in a list.

In [4]:
# define a function to apply lemmatization with POS tagging to each word
def lemmatize_with_pos(word):
    pos = get_wordnet_pos(word)
    if pos:
        return lemmatizer.lemmatize(word, pos=pos)
    else:
        return lemmatizer.lemmatize(word)

# define a function to get the appropriate POS tag for a word
def get_wordnet_pos(word):
    """Map POS tag to first character used by WordNetLemmatizer"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)  # default to noun if not found

# define a function to apply lemmatization to each word
def lemmatize_text(text):
    return [lemmatize_with_pos(word) for word in text]
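
The tag mapping inside `get_wordnet_pos()` can be looked at in isolation. The sketch below reproduces it on hand-written Penn Treebank tags, using plain strings in place of the `wordnet.ADJ`, `wordnet.NOUN`, `wordnet.VERB` and `wordnet.ADV` constants (whose values are `'a'`, `'n'`, `'v'` and `'r'`). The function name `penn_to_wordnet` is an assumption made for this example.

```python
# Plain-string equivalents of the WordNet POS constants used by the lemmatizer.
WORDNET_ADJ, WORDNET_NOUN, WORDNET_VERB, WORDNET_ADV = "a", "n", "v", "r"

TAG_PREFIX_TO_WORDNET = {
    "J": WORDNET_ADJ,
    "N": WORDNET_NOUN,
    "V": WORDNET_VERB,
    "R": WORDNET_ADV,
}


def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS, defaulting to noun."""
    return TAG_PREFIX_TO_WORDNET.get(tag[:1].upper(), WORDNET_NOUN)


for tag in ["NN", "VBD", "JJR", "RB", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
# prints:
# NN -> n
# VBD -> v
# JJR -> a
# RB -> r
# IN -> n
```

Because the mapping falls back to noun, it never returns an empty value, so the `else` branch in `lemmatize_with_pos()` is effectively never reached. It is also worth keeping in mind that tagging one word at a time gives the tagger no sentence context, which may yield different tags than tagging the whole token list at once.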

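#### 3. count_spelling_mistakes(df):

    INPUT: DataFrame with a `cleaned_tokenize_text` column  
    OUTPUT: the same DataFrame with the added columns `corrected_text`, `mistakes`, `corrected_words` and `num_mistakes`

    For each essay, every misspelled word for which the spell checker returns a correction is replaced by its corrected version; all other words are kept unchanged. `num_mistakes` is the number of distinct misspelled words that were corrected.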

In [5]:
# Define a function that corrects the misspelled words of every essay and records the mistakes, the corrections and their count
def count_spelling_mistakes(df):
    for j in tqdm(range(len(df))):
        essay = df.cleaned_tokenize_text[j].split()
        
        misspelled = list(spell.unknown(essay))
        
        mistakes, corrections = [], []
        for i, m in enumerate(misspelled):
            s = spell.correction(m)
            if s:
                corrections.append(s)
                mistakes.append(m)
        
        # replace every occurrence of each misspelled word with its correction
        for k, m in enumerate(mistakes):
            indexes = [pos for pos, word in enumerate(essay) if word == m]
            for ind in indexes:
                essay[ind] = corrections[k]

        essay = " ".join(essay)
        n_mistakes = len(mistakes)
        mistakes = " ".join(mistakes)
        corrections = " ".join(corrections)

        df.loc[j, ['corrected_text', 'mistakes', 'corrected_words', 'num_mistakes']] = [essay, mistakes, corrections, n_mistakes]
    return df
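
The replace-and-count logic can be illustrated on a single essay without the spell checker. In this sketch the `corrections` dictionary is hypothetical and hand-written; in the notebook the corrections come from `spell.correction()`. The helper name `apply_corrections` is likewise an assumption for illustration.

```python
def apply_corrections(tokens, corrections):
    """Replace every token that has an entry in `corrections`.

    Returns the corrected token list and the sorted list of distinct
    misspelled words that were replaced.
    """
    corrected = [corrections.get(tok, tok) for tok in tokens]
    mistakes = sorted({tok for tok in tokens if tok in corrections})
    return corrected, mistakes


tokens = "teh cat sat near teh windw".split()
# hypothetical corrections, standing in for the spell checker's suggestions
corrections = {"teh": "the", "windw": "window"}

corrected, mistakes = apply_corrections(tokens, corrections)
print(" ".join(corrected))       # prints: the cat sat near the window
print(mistakes, len(mistakes))   # prints: ['teh', 'windw'] 2
```

The count is 2 even though "teh" occurs twice, because only distinct misspelled words are counted. The same holds for `num_mistakes` above, since `spell.unknown()` returns a set of words.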

#### 4. count_pos_tags(tokens):

    INPUT: list of words in base form (lemmatized words)  
    OUTPUT: dictionary with the number of nouns, verbs, adjectives and adverbs in an essay

In [6]:
def count_pos_tags(tokens):
    noun_count = 0
    verb_count = 0
    adjective_count = 0
    adverb_count = 0
    
    # loop through each token and increment the corresponding counter
    for token, tag in nltk.pos_tag(tokens):
        if tag.startswith('N'):  # noun
            noun_count += 1
        elif tag.startswith('V'):  # verb
            verb_count += 1
        elif tag.startswith('J'):  # adjective
            adjective_count += 1
        elif tag.startswith('R'):  # adverb
            adverb_count += 1
    
    # return a dictionary with the counts
    return {'noun': noun_count, 'verb': verb_count, 'adjective': adjective_count, 'adverb': adverb_count}
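
As a quick check of the counting rule, the sketch below applies the same prefix test to a hand-tagged token list, so it runs without the NLTK tagger. The tags are written by hand for illustration (in the notebook they come from `nltk.pos_tag`), and `count_tagged` is a hypothetical helper name.

```python
PREFIX_TO_NAME = {"N": "noun", "V": "verb", "J": "adjective", "R": "adverb"}


def count_tagged(tagged_tokens):
    """Count nouns, verbs, adjectives and adverbs in (token, tag) pairs."""
    counts = dict.fromkeys(PREFIX_TO_NAME.values(), 0)
    for _token, tag in tagged_tokens:
        name = PREFIX_TO_NAME.get(tag[:1])
        if name:
            counts[name] += 1
    return counts


# hand-written Penn Treebank tags, for illustration only
tagged = [
    ("student", "NN"), ("write", "VBP"), ("long", "JJ"),
    ("essay", "NN"), ("quickly", "RB"), ("in", "IN"),
]
print(count_tagged(tagged))
# prints: {'noun': 2, 'verb': 1, 'adjective': 1, 'adverb': 1}
```

One consequence of testing only the first letter is that the Penn Treebank particle tag `RP` also starts with `R`, so particles would be counted as adverbs by this rule.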

In [7]:
def list_to_string(lst):
    return ' '.join(lst)

#### 5. assign_score_category(row):

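    INPUT: one row of the DataFrame containing the six target scores  
    OUTPUT: categorical label: low, medium or high

    A row is labelled low when at least five of the six scores are 2.5 or below, high when at least five are 4 or above, and medium otherwise.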

In [8]:
# define a function to assign score category based on scores
def assign_score_category(row):
    if (row[['cohesion', 'syntax', 'vocabulary', 'phraseology', 'grammar', 'conventions']] <= 2.5).sum() > 4:
        return 'low'
    elif (row[['cohesion', 'syntax', 'vocabulary', 'phraseology', 'grammar', 'conventions']] >= 4).sum() > 4:
        return 'high'
    else:
        return 'medium'
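
The threshold rule is easy to misread, since `> 4` on a count of six scores means "at least five". The following sketch applies the same rule to plain dictionaries instead of DataFrame rows. The score values are made up for illustration, and `score_category` and `make_row` are hypothetical helper names.

```python
SCORE_COLUMNS = ["cohesion", "syntax", "vocabulary",
                 "phraseology", "grammar", "conventions"]


def score_category(scores):
    """Apply the low/high/medium rule to a mapping of the six scores."""
    values = [scores[col] for col in SCORE_COLUMNS]
    if sum(v <= 2.5 for v in values) > 4:   # at least five low scores
        return "low"
    if sum(v >= 4 for v in values) > 4:     # at least five high scores
        return "high"
    return "medium"


def make_row(values):
    """Build a score mapping from six values given in SCORE_COLUMNS order."""
    return dict(zip(SCORE_COLUMNS, values))


# made-up scores, for illustration only
print(score_category(make_row([2.0, 2.5, 2.0, 2.5, 2.0, 3.0])))  # prints: low
print(score_category(make_row([4.0, 4.5, 4.0, 4.0, 3.5, 3.5])))  # prints: medium
print(score_category(make_row([4.0, 4.5, 4.0, 4.0, 5.0, 3.5])))  # prints: high
```

The second example has four scores of 4 or above, which is not enough for "high", so it falls through to "medium".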

#### 6. process_df(df):

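    INPUT: raw DataFrame with a `full_text` column (and, optionally, the six target score columns)  
    OUTPUT: processed DataFrame with the cleaned, corrected and lemmatized text plus the derived count columns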

In [9]:
def process_df(df):

    # 1. apply the text_cleaning function to the 'full_text' column using apply() method
    print("Cleaning essays")
    df['cleaned_tokenize_text'] = df['full_text'].apply(text_cleaning)

    # 2. correct spelling in the cleaned text and store the corrections and mistake counts in new columns
    print("Getting count of spelling mistakes")
    df = count_spelling_mistakes(df)
    
    # 3. apply lemmatize_text function to the corrected_text
    print("Lemmatizing text")
    df['lemmatized_text'] = df['corrected_text'].apply(lambda x: lemmatize_text(x.split()))
    
    # 4. Count the number of sentences in each essay
    print("Analyzing sentences")
    df['sent_count'] = df['full_text'].apply(lambda x: len(sent_tokenize(x)))

    # 5. Compute the average number of words in a sentence in an essay
    print("Getting average sentence length")
    df['sent_len'] = df['full_text'].apply(lambda x: np.mean([len(w.split()) for w in sent_tokenize(x)]))

    # 6. Apply the count_pos_tags function to each row
    print("POS tagging")
    df['pos_counts'] = df['lemmatized_text'].apply(count_pos_tags)
    
    # 7. Compute the word count for each essay
    df['word_count'] = df.full_text.apply(lambda x: len(x.split()))

    # 8. Extract the count for each POS tag into a separate column
    print("Counting POS tags")
    df['noun_count'] = df['pos_counts'].apply(lambda x: x['noun'])
    df['verb_count'] = df['pos_counts'].apply(lambda x: x['verb'])
    df['adjective_count'] = df['pos_counts'].apply(lambda x: x['adjective'])
    df['adverb_count'] = df['pos_counts'].apply(lambda x: x['adverb'])

    # 9. assign a score category when the target score columns are present
    if "cohesion" in df.columns:
        df['score_category'] = df.apply(assign_score_category, axis=1)
    
    # 10. drop the intermediate pos_counts column
    df = df.drop(['pos_counts'], axis=1)
    
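    # 11. join the lemmatized tokens back into a string and cast the counts to integers (sent_len is truncated)
    df['lemmatized_text'] = df['lemmatized_text'].apply(list_to_string)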
    df['num_mistakes'] = df['num_mistakes'].apply(int)
    df['sent_len'] = df['sent_len'].apply(int)
    df['sent_count'] = df['sent_count'].apply(int)
    
    return df
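
The sentence-level statistics computed in steps 4, 5 and 7 can be reproduced on a single string. The sketch below is an approximation under stated assumptions: it uses a naive regular-expression splitter in place of NLTK's `sent_tokenize`, and the function name `sentence_stats` is hypothetical.

```python
import re


def sentence_stats(text):
    """Return sentence count, truncated mean sentence length and word count."""
    # naive splitter (assumption): break after '.', '!' or '?' followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lengths = [len(s.split()) for s in sentences]
    # guard against empty input instead of averaging an empty list
    mean_len = sum(lengths) / len(lengths) if lengths else 0.0
    return {
        "sent_count": len(sentences),
        "sent_len": int(mean_len),        # truncated, as in process_df
        "word_count": len(text.split()),
    }


print(sentence_stats("I like school. It is fun! We learn many new things every day."))
# prints: {'sent_count': 3, 'sent_len': 4, 'word_count': 13}
```

The mean sentence length here is 13 / 3 ≈ 4.33, which is truncated to 4 by the integer cast. The explicit guard for empty input is a design choice of this sketch: in `process_df`, an essay with no sentences would make `np.mean` return NaN, and casting NaN to `int` raises a `ValueError`.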

In [10]:
df = pd.read_csv('../data/train.csv')
df = process_df(df)
df.to_csv("../data/processed_essays.csv", index=False)

Cleaning essays
Getting count of spelling mistakes


  0%|          | 0/3911 [00:00<?, ?it/s]

Lemmatizing text
Analyzing sentences
Getting average sentence length
POS tagging
Counting POS tags
