## Feature Extraction

In this notebook we sanitize the data and extract features from it. The data is taken from Kaggle and consists of more than 12,000 essays.

As part of extracting features we use ideas from multiple research papers referenced below.

### Novelty
In addition to the ideas already proposed in these papers, we use **Latent Semantic Indexing** to extract concepts from the essay texts and to measure the similarity between essays.


#### REFERENCES:
1. Mahana, M., Johns, M., & Apte, A. (2012). Automated essay grading using machine learning. Mach. Learn. Session, Stanford University.

2. Suresh, A., & Jha, M. (2018). Automated essay grading using natural language processing and support vector machine. International Journal of Computing and Technology, 5(2), 18-21.

3. Rokade, A., Patil, B., Rajani, S., Revandkar, S., & Shedge, R. (2018, April). Automated Grading System Using Natural Language Processing. In 2018 Second International Conference on Inventive Communication and Computational Technologies (ICICCT) (pp. 1123-1127). IEEE.

4. Song, S., & Zhao, J. (2013). Automated essay scoring using machine learning. Stanford University.

5. Kakkonen, T., Myller, N., & Sutinen, E. (2006). Applying Part-of-Speech Enhanced LSA to Automatic Essay Grading. arXiv preprint cs/0610118.

In [1]:
import re, collections
import pandas as pd
import numpy as np
import enchant
import warnings
warnings.filterwarnings('ignore')

In [2]:
import nltk
nltk.download('vader_lexicon')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import TfidfTransformer 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/nishal/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to /home/nishal/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/nishal/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/nishal/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


### Loading data set

The data set contains 8 essay sets, totalling 12977 unique essays.

In [3]:
data = pd.read_excel('training_set_rel3.xlsx')

In [4]:
essay_set_num = 3
data = data.loc[data['essay_set'] == essay_set_num]
data

Unnamed: 0,essay_id,essay_set,essay,rater1_domain1,rater2_domain1,rater3_domain1,domain1_score,rater1_domain2,rater2_domain2,domain2_score,...,rater2_trait3,rater2_trait4,rater2_trait5,rater2_trait6,rater3_trait1,rater3_trait2,rater3_trait3,rater3_trait4,rater3_trait5,rater3_trait6
3583,5978,3,The features of the setting affect the cyclist...,1.0,1.0,,1.0,,,,...,,,,,,,,,,
3584,5979,3,The features of the setting affected the cycli...,2.0,2.0,,2.0,,,,...,,,,,,,,,,
3585,5980,3,Everyone travels to unfamiliar places. Sometim...,1.0,1.0,,1.0,,,,...,,,,,,,,,,
3586,5981,3,I believe the features of the cyclist affected...,1.0,1.0,,1.0,,,,...,,,,,,,,,,
3587,5982,3,The setting effects the cyclist because of the...,2.0,2.0,,2.0,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5304,7704,3,"In the story, the setting affected the cyclist...",2.0,2.0,,2.0,,,,...,,,,,,,,,,
5305,7705,3,The features of the setting affect the cyclist...,1.0,1.0,,1.0,,,,...,,,,,,,,,,
5306,7706,3,The setting greatly affects the cyclist trying...,1.0,2.0,,2.0,,,,...,,,,,,,,,,
5307,7707,3,The features of the setting affected the cycli...,2.0,2.0,,2.0,,,,...,,,,,,,,,,


### Filtering data set

We use only the essay text and the domain score for training (the essay ID is kept as an identifier); all other columns are discarded.

We do this for feature extraction of every set.

In [5]:
# Keep only the essay ID, the essay text and the domain 1 score
data = data[['essay_id', 'essay', 'domain1_score']].reset_index(drop=True)

In [6]:
num_essays = data.shape[0]
data

Unnamed: 0,essay_id,essay,domain1_score
0,5978,The features of the setting affect the cyclist...,1.0
1,5979,The features of the setting affected the cycli...,2.0
2,5980,Everyone travels to unfamiliar places. Sometim...,1.0
3,5981,I believe the features of the cyclist affected...,1.0
4,5982,The setting effects the cyclist because of the...,2.0
...,...,...,...
1721,7704,"In the story, the setting affected the cyclist...",2.0
1722,7705,The features of the setting affect the cyclist...,1.0
1723,7706,The setting greatly affects the cyclist trying...,2.0
1724,7707,The features of the setting affected the cycli...,2.0


In [8]:
def get_wordlist(sentence):
    # Remove non-alphanumeric characters
    sentence = re.sub("[^a-zA-Z0-9]"," ", sentence)
    words = nltk.word_tokenize(sentence)

    return words

In [9]:
def get_tokenized_sentences(essay):
    sentences = nltk.sent_tokenize(essay.strip())
    
    tokenized_sentences = []
    for sentence in sentences:
        if len(sentence) > 0:
            tokenized_sentences.append(get_wordlist(sentence))
    
    return tokenized_sentences

### Numerical features 
Features such as the average word length, the word count and the sentence count give us an idea of the writer's fluency and dexterity with the language.

Reference - 

1. Mahana, M., Johns, M., & Apte, A. (2012). Automated essay grading using machine learning. Mach. Learn. Session, Stanford University.

In [10]:
def get_word_length_average(essay):
    # Sanitize essay
    essay = re.sub(r'\W', ' ', essay)
    words = nltk.word_tokenize(essay)
    avg = sum(len(word) for word in words) / len(words)
    
    return avg

In [11]:
def get_word_count(essay):
    essay = re.sub(r'\W', ' ', essay)
    count = len(nltk.word_tokenize(essay))
    
    return count

In [12]:
def get_sentence_count(essay):
    sentences = nltk.sent_tokenize(essay)
    count = len(sentences)
    
    return count
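
As a rough illustration of the three numerical features, the sketch below computes them for a toy essay using only the standard library. The regex-based tokenizer and sentence splitter are simplifying assumptions standing in for `nltk.word_tokenize` and `nltk.sent_tokenize`, and `simple_numeric_features` is a hypothetical helper name, so results on real essays may differ slightly from the functions above.

```python
import re

def simple_numeric_features(essay):
    """Toy stand-in for the NLTK-based functions (regex tokenization is an assumption)."""
    # Words: runs of word characters, mirroring the re.sub(r'\W', ' ', ...) sanitization.
    words = re.findall(r"\w+", essay)
    # Sentences: split after '.', '!' or '?' followed by whitespace (cruder than punkt).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", essay.strip()) if s]
    word_count = len(words)
    avg_word_len = sum(len(w) for w in words) / word_count if word_count else 0.0
    return word_count, len(sentences), avg_word_len

toy_essay = "The road was long. The cyclist kept going!"
print(simple_numeric_features(toy_essay))  # (8, 2, 4.125)
```

Unlike `get_word_length_average`, this sketch returns 0.0 for an empty essay instead of dividing by zero.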

### Lemmatization and Part of Speech tagging

Lemmatization uses a vocabulary to perform a morphological analysis of words and reduce them to their base forms (lemmas). The number of distinct lemmas, together with the proportions of nouns, adjectives, verbs and adverbs, helps us gauge the lexical density and overall semantic difficulty of the essay.

Reference - 

2. Suresh, A., & Jha, M. (2018). Automated essay grading using natural language processing and support vector machine. International Journal of Computing and Technology, 5(2), 18-21.
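
The lemmatizer needs a WordNet part of speech, while the tagger emits Penn Treebank tags, so the two have to be mapped onto each other. The sketch below isolates that mapping. The helper name `treebank_to_wordnet` and the hand-tagged tokens are hypothetical, and the single-letter constants are written out so the example runs without NLTK.

```python
# WordNet's part-of-speech constants are single letters; they are spelled out
# here so the sketch runs without NLTK (wordnet.NOUN == 'n', wordnet.ADJ == 'a',
# wordnet.VERB == 'v', wordnet.ADV == 'r').
NOUN, ADJ, VERB, ADV = "n", "a", "v", "r"

def treebank_to_wordnet(pos_tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS, defaulting to noun."""
    if pos_tag.startswith("J"):
        return ADJ
    if pos_tag.startswith("V"):
        return VERB
    if pos_tag.startswith("R"):
        return ADV
    return NOUN

# Hypothetical, hand-tagged tokens standing in for the output of nltk.pos_tag.
tagged = [("cyclist", "NN"), ("rode", "VBD"), ("slowly", "RB"), ("rough", "JJ"), ("the", "DT")]
print([(word, treebank_to_wordnet(tag)) for word, tag in tagged])
```

Any tag that is not an adjective, verb or adverb (for example the determiner `DT`) falls back to noun, which matches the default used in this notebook.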

In [13]:
def get_lemma_count(essay):
    tokenized_sentences = get_tokenized_sentences(essay)      
    
    lemmas = []
    for sentence in tokenized_sentences:
        pos_tagged_tokens = nltk.pos_tag(sentence) 
        for token_tuple in pos_tagged_tokens:
            word = token_tuple[0]
            pos_tag = token_tuple[1]
            # assume default part of speech to be noun
            pos = wordnet.NOUN
            if pos_tag.startswith('J'):
                pos = wordnet.ADJ
            elif pos_tag.startswith('V'):
                pos = wordnet.VERB
            elif pos_tag.startswith('R'):
                pos = wordnet.ADV
                
            lemmas.append(WordNetLemmatizer().lemmatize(word, pos))
    
    lemma_count = len(set(lemmas))
    
    return lemma_count

In [14]:
def get_pos_counts(essay):
    tokenized_sentences = get_tokenized_sentences(essay)
    
    nouns, adjectives, verbs, adverbs = 0, 0, 0, 0
    
    essay = re.sub(r'\W', ' ', essay)
    words = nltk.word_tokenize(essay)
    
    for sentence in tokenized_sentences:
        pos_tagged_tokens = nltk.pos_tag(sentence)
        for token_tuple in pos_tagged_tokens:
            pos_tag = token_tuple[1]
            if pos_tag.startswith('N'): 
                nouns += 1
            elif pos_tag.startswith('J'):
                adjectives += 1
            elif pos_tag.startswith('V'):
                verbs += 1
            elif pos_tag.startswith('R'):
                adverbs += 1
    
    return nouns/len(words), adjectives/len(words), verbs/len(words), adverbs/len(words)

### Spell Check (Orthography)
Correct spelling indicates command over the language and facility of use. To test for these characteristics, we use the PyEnchant spell checker with the hunspell dictionary to find the misspelt words in each essay, and we record the fraction of words that are misspelt.

Reference - 

1. Mahana, M., Johns, M., & Apte, A. (2012). Automated essay grading using machine learning. Mach. Learn. Session, Stanford University.
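
To make the spelling feature concrete, here is a minimal sketch of the ratio being computed. It assumes a small hypothetical word set (`known`) in place of the PyEnchant dictionary, and it lowercases tokens before the lookup, which is a simplification of how a real spell checker treats case.

```python
import re

def spell_error_ratio(essay, known_words):
    """Fraction of tokens not found in `known_words` (a stand-in for a dictionary check)."""
    words = re.findall(r"\w+", essay)
    if not words:
        return 0.0
    misspelt = sum(1 for w in words if w.lower() not in known_words)
    return misspelt / len(words)

# Hypothetical mini-dictionary; a real run would use enchant.Dict("en_US") instead.
known = {"the", "cyclist", "was", "tired"}
print(spell_error_ratio("The cyclist was tierd", known))  # 0.25
```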

In [15]:
def get_spell_error_count(essay):
    essay = re.sub(r'\W', ' ', essay)
    words = nltk.word_tokenize(essay)

    d = enchant.Dict("en_US")
    # Count the tokens that the dictionary does not recognise
    misspelt_count = sum(1 for word in words if not d.check(word))

    # Return the fraction of misspelt words rather than the raw count
    total_words = len(words)
    error_prob = misspelt_count / total_words

    return error_prob

### Sentiment Analysis (Opinion mining)
Sentiment analysis allows us to gauge the emotive effectiveness of the essay. We use VADER, a lexicon- and rule-based sentiment analysis tool provided by the nltk package, and record the negative, positive and neutral scores. VADER also reports a compound score summarizing the overall degree of positivity or negativity, but we do not use it here.

Reference - 

4. Song, S., & Zhao, J. (2013). Automated essay scoring using machine learning. Stanford University.

In [16]:
def get_sentiment_tags(essay):
    # polarity_scores returns 'neg', 'neu', 'pos' and 'compound'; 'compound' is ignored
    ss = SentimentIntensityAnalyzer().polarity_scores(essay)

    return ss['neg'], ss['pos'], ss['neu']

### Term Frequency & Inverse Document Frequency

Term frequency (TF) is the number of times a word appears in a document. Document frequency (DF) is the number of documents in which a word occurs. Multiplying TF by the inverse of DF gives the TF-IDF score, which is high for words that are frequent in one essay but rare across the collection. The resulting vectors help us measure the correspondence between essays.

Reference -

2. Suresh, A., & Jha, M. (2018). Automated essay grading using natural language processing and support vector machine. International Journal of Computing and Technology, 5(2), 18-21.
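
The sketch below works through the textbook form of TF-IDF on three toy documents, using only the standard library. It is an illustration under stated assumptions, not what `TfidfVectorizer` computes: by default scikit-learn uses a smoothed IDF, `ln((1 + N) / (1 + df)) + 1`, and L2-normalizes each row, so its values will differ. The function name `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Plain TF-IDF: raw term count times log(N / df). Not sklearn's smoothed variant."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({term: count * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return vectors

docs = ["the cyclist rode on", "the road was rough", "the cyclist was tired"]
for vec in tfidf(docs):
    print(vec)
```

In this toy corpus "the" occurs in every document, so its weight is `log(3 / 3) = 0`, while "rode" occurs in only one document and receives the largest weight, `log(3)`.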

In [17]:
def get_tfidf_vectors(essays):
    vectorizer = TfidfVectorizer(stop_words='english')
    
    words = []
    for essay in essays:
        essay = re.sub(r'\W', ' ', essay)
        words.append(nltk.word_tokenize(essay))
        
    docs_lemmatized = [[WordNetLemmatizer().lemmatize(j) for j in i]for i in words]
    
    corpus = [' '.join(i) for i in docs_lemmatized]
    vectors = vectorizer.fit_transform(corpus)
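    # get_feature_names() was removed in scikit-learn 1.2; on versions older than 1.0 use get_feature_names() instead
    feature_names = vectorizer.get_feature_names_out()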
    
    return feature_names, vectors

In [18]:
feature_names,vectors_all = get_tfidf_vectors(data['essay'])
print("num of essays X number of features",vectors_all.shape)

num of essays X number of features (1726, 5648)


### Latent Semantic Analysis
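
Latent Semantic Analysis factorizes the TF-IDF matrix with a truncated singular value decomposition. As a sketch of what the cells in this section compute, let $X$ denote the essays-by-terms TF-IDF matrix and $k$ the number of retained components (both symbols are introduced here for illustration only):

$$
X \approx U_k \Sigma_k V_k^{\top}, \qquad
Z = X V_k = U_k \Sigma_k, \qquad
Z Z^{\top} = U_k \Sigma_k^{2} U_k^{\top} \approx X X^{\top}.
$$

Under this reading, `svd_model.components_` corresponds to $V_k^{\top}$, `lsa` to $Z$, and `lsa_similarity` to $Z Z^{\top}$. Because `TfidfVectorizer` L2-normalizes each row by default, the entries of $X X^{\top}$ are cosine similarities, which is consistent with the diagonal of 1.0 in the similarity matrix. Note that `n_components` is set to the number of essays here, so the decomposition appears to keep nearly all of the information in $X$ rather than compress it; a smaller $k$ would give a lower-dimensional concept space.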



In [19]:
# SVD represents documents and terms as vectors in a shared concept space
reduced_dim = vectors_all.shape[0]
svd_model = TruncatedSVD(n_components=reduced_dim, algorithm='randomized', random_state=122)
lsa = svd_model.fit_transform(vectors_all)

pd.DataFrame(svd_model.components_,index=range(reduced_dim), columns=feature_names)

Unnamed: 0,12,55,aase,abadond,abait,abanded,abandon,abandond,abandone,abandoned,...,yoth,youg,youknow,young,younger,youself,youth,yuuth,zone,zoo
0,0.010919,0.010589,0.000196,0.000522,0.000887,0.002316,0.004920,0.002370,0.000482,0.040612,...,0.000420,0.000260,0.000437,0.008669,0.000992,0.000319,0.000885,0.000313,0.000409,0.000826
1,0.009312,0.013621,-0.000654,-0.000953,-0.002802,-0.006772,-0.005069,-0.000470,0.000315,-0.068142,...,-0.000706,-0.001107,-0.000473,-0.008043,-0.003698,0.000918,-0.002483,-0.001822,-0.000620,-0.000925
2,0.019438,0.028952,0.000551,0.000380,-0.000939,0.003592,0.008468,0.001364,0.001377,0.021686,...,-0.000519,0.001079,0.000517,0.007915,0.001213,-0.000061,-0.000091,0.001323,0.000328,-0.000749
3,-0.001358,0.003005,-0.002566,0.000628,-0.002755,-0.007728,-0.007111,0.001989,0.000207,-0.040350,...,-0.000285,-0.001345,-0.001264,-0.009069,-0.007983,-0.001563,-0.002063,-0.002036,0.000056,0.002726
4,0.000941,-0.007515,0.000158,0.001456,0.000538,-0.001993,-0.002700,-0.001185,0.001731,-0.003726,...,0.002063,-0.000265,-0.000007,0.005642,-0.001221,-0.001155,0.003615,0.000578,0.001756,0.002488
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1721,-0.031210,0.008621,-0.013941,0.021602,0.006533,0.039129,-0.010050,0.028601,-0.016180,-0.020155,...,0.004322,-0.001586,0.009875,-0.021218,0.022644,0.011482,-0.001963,-0.005019,-0.004151,-0.007997
1722,0.004471,0.048659,-0.016070,-0.000399,-0.010275,-0.011194,-0.002908,0.017613,0.005814,0.027142,...,-0.005631,0.002202,0.008449,-0.003417,-0.033736,-0.013038,-0.003654,0.000134,-0.006852,0.024344
1723,-0.000077,0.042585,0.012905,0.004978,-0.012835,0.006479,0.003837,-0.017387,0.014774,-0.012416,...,0.014551,-0.003136,-0.008727,0.030928,0.015458,-0.020861,-0.007210,0.001279,0.014539,0.007602
1724,0.018108,-0.000879,0.127454,-0.041874,0.006283,-0.027222,0.032730,0.039435,0.014156,-0.003497,...,-0.000882,-0.010780,0.000950,0.000684,-0.002534,-0.022508,-0.002193,-0.000254,-0.010710,0.003600


In [20]:
# Compute document similarity as the dot products of the LSA document vectors
lsa_similarity = lsa @ lsa.T
pd.DataFrame(lsa_similarity,index=range(num_essays), columns=range(num_essays))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1716,1717,1718,1719,1720,1721,1722,1723,1724,1725
0,1.000000,0.068853,0.054585,0.018912,0.135495,0.086298,0.067861,0.046194,0.063801,0.092982,...,0.104991,0.063121,1.327585e-01,0.150642,0.082806,0.103764,0.104746,0.097137,0.148862,0.048891
1,0.068853,1.000000,0.037035,0.019983,0.114694,0.077461,0.152446,0.007161,0.094809,0.189670,...,0.091184,0.043505,1.908301e-02,0.178060,0.237521,0.080757,0.041102,0.118562,0.110990,0.027557
2,0.054585,0.037035,1.000000,0.041501,0.006919,0.068392,0.101424,0.012773,0.006297,0.023608,...,0.007473,0.015083,6.160939e-02,0.012050,0.018089,0.021317,0.014779,0.027017,0.018006,0.004985
3,0.018912,0.019983,0.041501,1.000000,0.017226,0.019920,0.064073,0.055700,0.001408,0.013777,...,0.005621,0.024752,2.190728e-02,0.092889,0.028568,0.021729,0.005262,0.009036,0.011590,0.044532
4,0.135495,0.114694,0.006919,0.017226,1.000000,0.119695,0.082859,0.023144,0.049212,0.086310,...,0.085996,0.016014,1.567855e-01,0.100934,0.061280,0.115156,0.036284,0.068402,0.111283,0.058079
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1721,0.103764,0.080757,0.021317,0.021729,0.115156,0.061339,0.075399,0.028776,0.080834,0.139366,...,0.064384,0.128776,1.487662e-02,0.113234,0.101874,1.000000,0.124081,0.104924,0.091863,0.067853
1722,0.104746,0.041102,0.014779,0.005262,0.036284,0.026057,0.003884,0.120326,0.182113,0.244980,...,0.097394,0.142939,5.352473e-02,0.059046,0.210343,0.124081,1.000000,0.042728,0.119699,0.116887
1723,0.097137,0.118562,0.027017,0.009036,0.068402,0.049055,0.040805,0.062196,0.099133,0.079486,...,0.166395,0.077638,-1.212138e-16,0.098877,0.090348,0.104924,0.042728,1.000000,0.098030,0.034389
1724,0.148862,0.110990,0.018006,0.011590,0.111283,0.038355,0.060236,0.041603,0.073863,0.073370,...,0.142405,0.052338,1.453900e-02,0.103304,0.123063,0.091863,0.119699,0.098030,1.000000,0.040581


### Cosine Similarity

Using the TF-IDF vectors obtained above, we measure how similar each essay is to the highest-scoring essays by averaging the cosine similarity between the corresponding vector pairs.

This feature has not been taken from any of the papers.
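
The following standard-library sketch shows the idea on hypothetical 3-term vectors: compute the cosine similarity between one essay vector and each top-scoring essay vector, then take the mean. The helper names `cosine_sim` and `mean_similarity_to_best` are assumptions for illustration; the notebook itself relies on `sklearn.metrics.pairwise.cosine_similarity` applied to sparse TF-IDF rows.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def mean_similarity_to_best(vector, best_vectors):
    """Average cosine similarity between one essay vector and the top-scoring essays."""
    return sum(cosine_sim(vector, b) for b in best_vectors) / len(best_vectors)

# Hypothetical 3-term vectors: two top-scoring essays and one essay to compare.
best = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(mean_similarity_to_best([1.0, 0.0, 0.0], best))  # 0.5
```

One design point worth keeping in mind: an essay that itself has the highest score is also compared against itself, which contributes a similarity of 1 to its average.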

In [21]:
highest = max(data['domain1_score'].tolist())

def get_cosine_similarity(essay_id):
    index_high = data.index[data['domain1_score'] == highest].tolist()
    n = len(index_high)

    j = data.index[data['essay_id'] == essay_id]
    similarity = 0
    for i in index_high:
        similarity += cosine_similarity(vectors_all[i,:],vectors_all[j,:])
    similarity /= n
    
    # np.asscalar was removed in NumPy 1.23; .item() extracts the same scalar
    return similarity.item()

### Extraction

Finally, we apply all of the above functions to our data set and store the results in a CSV file.

In [22]:
def extract_features(data):
    
    features = data.copy()
    
    features['word_count'] = features['essay'].apply(get_word_count)
    print("Added 'word_count' feature successfully.")
    
    features['sent_count'] = features['essay'].apply(get_sentence_count)
    print("Added 'sent_count' feature successfully.")
    
    features['avg_word_len'] = features['essay'].apply(get_word_length_average)
    print("Added 'avg_word_len' feature successfully.")
    
    features['lemma_count'] = features['essay'].apply(get_lemma_count)
    print("Added 'lemma_count' feature successfully.")
    
    features['spell_err_count'] = features['essay'].apply(get_spell_error_count)
    print("Added 'spell_err_count' feature successfully.")
    
    features['noun_count'], features['adj_count'], features['verb_count'], features['adv_count'] = zip(*features['essay'].map(get_pos_counts))
    print("Added 'noun_count', 'adj_count', 'verb_count' and 'adv_count' features successfully.")
    
    features['neg_score'], features['pos_score'], features['neu_score'] = zip(*features['essay'].map(get_sentiment_tags))
    print("Added 'neg_score', 'pos_score' and 'neu_score' features successfully.")
    
    features['cosine_similarity'] = features['essay_id'].apply(get_cosine_similarity)
    print("Added 'similarity' feature successfully.")
    
    return features

In [23]:
features_set1 = extract_features(data)

Added 'word_count' feature successfully.
Added 'sent_count' feature successfully.
Added 'avg_word_len' feature successfully.
Added 'lemma_count' feature successfully.
Added 'spell_err_count' feature successfully.
Added 'noun_count', 'adj_count', 'verb_count' and 'adv_count' features successfully.
Added 'neg_score', 'pos_score' and 'neu_score' features successfully.
Added 'similarity' feature successfully.


In [24]:
features_set1

Unnamed: 0,essay_id,essay,domain1_score,word_count,sent_count,avg_word_len,lemma_count,spell_err_count,noun_count,adj_count,verb_count,adv_count,neg_score,pos_score,neu_score,cosine_similarity
0,5978,The features of the setting affect the cyclist...,1.0,51,3,4.098039,30,0.019608,0.294118,0.019608,0.176471,0.000000,0.094,0.000,0.906,0.086098
1,5979,The features of the setting affected the cycli...,2.0,179,12,4.418994,102,0.089385,0.273743,0.072626,0.162011,0.039106,0.098,0.043,0.860,0.108432
2,5980,Everyone travels to unfamiliar places. Sometim...,1.0,97,8,4.164948,67,0.010309,0.226804,0.082474,0.195876,0.113402,0.151,0.116,0.733,0.041446
3,5981,I believe the features of the cyclist affected...,1.0,87,3,3.896552,60,0.114943,0.218391,0.126437,0.149425,0.034483,0.067,0.108,0.825,0.030414
4,5982,The setting effects the cyclist because of the...,2.0,134,3,4.126866,73,0.067164,0.231343,0.111940,0.156716,0.059701,0.120,0.029,0.851,0.071810
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1721,7704,"In the story, the setting affected the cyclist...",2.0,66,6,4.136364,50,0.000000,0.287879,0.075758,0.212121,0.015152,0.101,0.064,0.834,0.097250
1722,7705,The features of the setting affect the cyclist...,1.0,54,3,4.333333,46,0.037037,0.277778,0.092593,0.185185,0.129630,0.141,0.042,0.817,0.085422
1723,7706,The setting greatly affects the cyclist trying...,2.0,113,5,4.159292,73,0.000000,0.230088,0.061947,0.185841,0.061947,0.148,0.000,0.852,0.082425
1724,7707,The features of the setting affected the cycli...,2.0,152,7,4.302632,84,0.013158,0.217105,0.072368,0.197368,0.098684,0.057,0.052,0.891,0.100491


In [25]:
filename = 'features_set_' + str(essay_set_num) + '.csv'
features_set1.to_csv(filename, index=False)