## Feature Extraction

In this notebook we sanitize the data and extract features from it. The data comes from Kaggle and consists of more than 12000 essays.

To extract features, we use ideas from the research papers referenced below.

### Novelty
In addition to the ideas already proposed in those papers, we use **Latent Semantic Indexing** to extract concepts from the essays and to measure the similarity between them.


#### REFERENCES:
1. Mahana, M., Johns, M., & Apte, A. (2012). Automated essay grading using machine learning. Mach. Learn. Session, Stanford University.

2. Suresh, A., & Jha, M. (2018). Automated essay grading using natural language processing and support vector machine. International Journal of Computing and Technology, 5(2), 18-21.

3. Rokade, A., Patil, B., Rajani, S., Revandkar, S., & Shedge, R. (2018, April). Automated Grading System Using Natural Language Processing. In 2018 Second International Conference on Inventive Communication and Computational Technologies (ICICCT) (pp. 1123-1127). IEEE.

4. Song, S., & Zhao, J. (2013). Automated essay scoring using machine learning. Stanford University.

5. Kakkonen, T., Myller, N., & Sutinen, E. (2006). Applying Part-of-Speech Enhanced LSA to Automatic Essay Grading. arXiv preprint cs/0610118.

In [1]:
import re, collections
import pandas as pd
import numpy as np
import enchant
import warnings
warnings.filterwarnings('ignore')

In [2]:
import nltk
nltk.download('vader_lexicon')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import TfidfTransformer 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/nishal/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to /home/nishal/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/nishal/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/nishal/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


### Loading data set

The data set contains 8 essay sets, totalling 12977 unique essays.

In [3]:
data = pd.read_excel('training_set_rel3.xlsx')

In [4]:
essay_set_num = 7
data = data.loc[data['essay_set'] == essay_set_num]
data

Unnamed: 0,essay_id,essay_set,essay,rater1_domain1,rater2_domain1,rater3_domain1,domain1_score,rater1_domain2,rater2_domain2,domain2_score,...,rater2_trait3,rater2_trait4,rater2_trait5,rater2_trait6,rater3_trait1,rater3_trait2,rater3_trait3,rater3_trait4,rater3_trait5,rater3_trait6
10686,17834,7,Patience is when your waiting .I was patience ...,8.0,7.0,,15.0,,,,...,2.0,2.0,,,,,,,,
10687,17836,7,"I am not a patience person, like I can’t sit i...",6.0,7.0,,13.0,,,,...,2.0,1.0,,,,,,,,
10688,17837,7,One day I was at basketball practice and I was...,7.0,8.0,,15.0,,,,...,2.0,2.0,,,,,,,,
10689,17838,7,I going to write about a time when I went to t...,8.0,9.0,,17.0,,,,...,2.0,3.0,,,,,,,,
10690,17839,7,It can be very hard for somebody to be patient...,7.0,6.0,,13.0,,,,...,1.0,2.0,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12250,19558,7,One time I was getting a cool @CAPS1 game it w...,6.0,6.0,,12.0,,,,...,2.0,1.0,,,,,,,,
12251,19559,7,A patent person in my life is my mom. Aicason ...,9.0,7.0,,16.0,,,,...,2.0,3.0,,,,,,,,
12252,19561,7,A time when someone else I know was patient wa...,11.0,8.0,,19.0,,,,...,2.0,2.0,,,,,,,,
12253,19562,7,I hate weddings. I love when people get marrie...,12.0,10.0,,22.0,,,,...,2.0,3.0,,,,,,,,


### Filtering data set

We only use the actual essay along with the domain score for training; all of the other columns are discarded.

The same filtering is applied before extracting features for every essay set.

In [5]:
# Keep only the columns we need, selected by name rather than by position
data = data[['essay_id', 'essay', 'domain1_score']].reset_index(drop=True)

In [6]:
num_essays = data.shape[0]
data

Unnamed: 0,essay_id,essay,domain1_score
0,17834,Patience is when your waiting .I was patience ...,15.0
1,17836,"I am not a patience person, like I can’t sit i...",13.0
2,17837,One day I was at basketball practice and I was...,15.0
3,17838,I going to write about a time when I went to t...,17.0
4,17839,It can be very hard for somebody to be patient...,13.0
...,...,...,...
1564,19558,One time I was getting a cool @CAPS1 game it w...,12.0
1565,19559,A patent person in my life is my mom. Aicason ...,16.0
1566,19561,A time when someone else I know was patient wa...,19.0
1567,19562,I hate weddings. I love when people get marrie...,22.0


In [7]:
def get_wordlist(sentence):
    # Remove non-alphanumeric characters
    sentence = re.sub("[^a-zA-Z0-9]"," ", sentence)
    words = nltk.word_tokenize(sentence)

    return words

In [8]:
def get_tokenized_sentences(essay):
    sentences = nltk.sent_tokenize(essay.strip())
    
    tokenized_sentences = []
    for sentence in sentences:
        if len(sentence) > 0:
            tokenized_sentences.append(get_wordlist(sentence))
    
    return tokenized_sentences

### Numerical features 
Features like the average word length, the word count and the sentence count give us an idea of the writer's fluency and dexterity with the language.

Reference - 

1. Mahana, M., Johns, M., & Apte, A. (2012). Automated essay grading using machine learning. Mach. Learn. Session, Stanford University.

In [9]:
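def get_word_length_average(essay):
    # Sanitize essay: replace non-word characters with spaces
    essay = re.sub(r'\W', ' ', essay)
    words = nltk.word_tokenize(essay)
    # Guard against essays with no tokens to avoid a division by zero
    if not words:
        return 0.0
    avg = sum(len(word) for word in words) / len(words)

    return avg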

In [10]:
def get_word_count(essay):
    essay = re.sub(r'\W', ' ', essay)
    count = len(nltk.word_tokenize(essay))
    
    return count

In [11]:
def get_sentence_count(essay):
    sentences = nltk.sent_tokenize(essay)
    count = len(sentences)
    
    return count
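
As a quick sanity check of the three numerical features, the sketch below computes them on a single sentence pair without NLTK. It is only an approximation: `simple_numeric_features` is a hypothetical helper, and its regex-based sentence split is much cruder than `nltk.sent_tokenize`.

```python
import re

def simple_numeric_features(essay):
    """Regex-only stand-in for the NLTK-based helpers (an approximation)."""
    # Words: runs of word characters, i.e. what survives re.sub(r'\W', ' ', ...)
    words = re.findall(r'\w+', essay)
    # Sentences: split after '.', '!' or '?' followed by whitespace (assumed heuristic)
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', essay.strip()) if s]
    word_count = len(words)
    # Guard against empty essays to avoid dividing by zero
    avg_word_len = sum(len(w) for w in words) / word_count if word_count else 0.0
    return word_count, len(sentences), avg_word_len

sample = "I hate weddings. I love when people get married."
word_count, sent_count, avg_word_len = simple_numeric_features(sample)
print(word_count, sent_count, round(avg_word_len, 3))
```

On real essays the counts can differ slightly from the NLTK-based functions, mainly because of abbreviations and missing spaces after full stops.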

### Lemmatization and Part of Speech tagging

Lemmatization uses a vocabulary and morphological analysis to reduce each word to its base form (lemma). Counting the distinct lemmas, along with the proportions of nouns, adjectives, verbs and adverbs, helps us gauge the lexical density and overall semantic difficulty of the essay.

Reference - 

2. Suresh, A., & Jha, M. (2018). Automated essay grading using natural language processing and support vector machine. International Journal of Computing and Technology, 5(2), 18-21.

In [12]:
def get_lemma_count(essay):
    tokenized_sentences = get_tokenized_sentences(essay)
    # Create the lemmatizer once instead of once per token
    lemmatizer = WordNetLemmatizer()

    lemmas = []
    for sentence in tokenized_sentences:
        for word, pos_tag in nltk.pos_tag(sentence):
            # Map the Penn Treebank tag to a WordNet part of speech; default to noun
            pos = wordnet.NOUN
            if pos_tag.startswith('J'):
                pos = wordnet.ADJ
            elif pos_tag.startswith('V'):
                pos = wordnet.VERB
            elif pos_tag.startswith('R'):
                pos = wordnet.ADV

            lemmas.append(lemmatizer.lemmatize(word, pos))

    # Number of distinct lemmas in the essay
    lemma_count = len(set(lemmas))

    return lemma_count

In [13]:
def get_pos_counts(essay):
    tokenized_sentences = get_tokenized_sentences(essay)
    
    nouns, adjectives, verbs, adverbs = 0, 0, 0, 0
    
    essay = re.sub(r'\W', ' ', essay)
    words = nltk.word_tokenize(essay)
    
    for sentence in tokenized_sentences:
        pos_tagged_tokens = nltk.pos_tag(sentence)
        for token_tuple in pos_tagged_tokens:
            pos_tag = token_tuple[1]
            if pos_tag.startswith('N'): 
                nouns += 1
            elif pos_tag.startswith('J'):
                adjectives += 1
            elif pos_tag.startswith('V'):
                verbs += 1
            elif pos_tag.startswith('R'):
                adverbs += 1
    
    return nouns/len(words), adjectives/len(words), verbs/len(words), adverbs/len(words)
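
To show what `get_pos_counts` returns, here is a small sketch that computes the same kind of ratios from tokens that are already tagged. The `tagged` list is hand-written for illustration and is not real `nltk.pos_tag` output, and the `coarse` mapping is a hypothetical name for the first-letter rule used above.

```python
from collections import Counter

# Hand-written (token, Penn Treebank tag) pairs standing in for nltk.pos_tag output;
# the tags are illustrative assumptions, not real tagger output.
tagged = [("I", "PRP"), ("really", "RB"), ("hate", "VBP"), ("long", "JJ"),
          ("weddings", "NNS"), ("and", "CC"), ("loud", "JJ"), ("music", "NN")]

# First letter of the tag -> coarse class, mirroring the startswith() checks
coarse = {"N": "noun", "J": "adjective", "V": "verb", "R": "adverb"}
counts = Counter(coarse[tag[0]] for _, tag in tagged if tag[0] in coarse)

# Each count is divided by the total number of tokens, so the features are ratios
total = len(tagged)
ratios = {name: counts[name] / total for name in coarse.values()}
print(ratios)
```

Tags outside the four classes (here `PRP` and `CC`) count towards the total but not towards any ratio, so the four values need not sum to 1.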

### Spell Check (Orthography)
Correct spelling indicates command over the language and facility of use. To test for these characteristics, we compute the fraction of misspelt words per essay (misspelt words divided by total words). We use the PyEnchant spell checker along with the Hunspell dictionary to identify the misspelt words.

Reference - 

1. Mahana, M., Johns, M., & Apte, A. (2012). Automated essay grading using machine learning. Mach. Learn. Session, Stanford University.

In [14]:
def get_spell_error_count(essay):
    essay = re.sub(r'\W', ' ', essay)
    words = nltk.word_tokenize(essay)

    d = enchant.Dict("en_US")
    # Count the tokens that the dictionary does not recognize
    misspelt_count = sum(1 for word in words if not d.check(word))

    # Despite the function name, the result is the fraction of misspelt words
    total_words = get_word_count(essay)
    error_prob = misspelt_count / total_words

    return error_prob
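
The ratio itself is simple to reproduce without PyEnchant. The sketch below swaps the real dictionary for a tiny hand-written vocabulary; `KNOWN_WORDS` and `misspelt_ratio` are hypothetical names, and the lower-casing step is an assumption that the notebook's function does not make.

```python
import re

# Toy vocabulary standing in for the en_US dictionary (illustration only)
KNOWN_WORDS = {"my", "mom", "is", "a", "very", "patient", "person"}

def misspelt_ratio(essay, vocabulary=KNOWN_WORDS):
    words = re.findall(r'\w+', essay)
    if not words:
        return 0.0  # avoid dividing by zero on empty input
    # A word counts as misspelt when the vocabulary does not contain it
    misspelt = sum(1 for w in words if w.lower() not in vocabulary)
    return misspelt / len(words)

ratio = misspelt_ratio("My mom is a very pacient person.")
print(round(ratio, 4))
```

A dictionary lookup of this kind only catches non-words. A correctly spelt but wrong word, such as "patent" written for "patient", is not counted as an error.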

### Sentiment Analysis (Opinion mining)
Sentiment analysis allows us to gauge the emotive effectiveness of the essay. We use VADER, a lexicon- and rule-based sentiment analysis tool provided by the nltk package, to obtain the negative, positive and neutral scores of each essay. VADER also provides an overall compound polarity score, although we don't use it in this case.

Reference - 

4. Song, S., & Zhao, J. (2013). Automated essay scoring using machine learning. Stanford University.

In [15]:
def get_sentiment_tags(essay):
    ss = SentimentIntensityAnalyzer().polarity_scores(essay)

    # Keep the negative, positive and neutral scores; the 'compound' score is ignored
    return ss['neg'], ss['pos'], ss['neu']

### Term Frequency & Inverse Document Frequency

Term frequency (TF) is the number of times a word appears in a document. Document frequency (DF) is the number of documents in which the word occurs. Multiplying TF by the inverse of DF (IDF) gives the TF-IDF score, which down-weights words that are common across all essays. The resulting vectors help us measure the correspondence between essays.

Reference -

2. Suresh, A., & Jha, M. (2018). Automated essay grading using natural language processing and support vector machine. International Journal of Computing and Technology, 5(2), 18-21.
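
A worked example may help make the weighting concrete. The sketch below computes TF-IDF by hand on three made-up documents, using the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is the formula scikit-learn documents as its default. Unlike the notebook, it performs no stop-word removal or lemmatization, and `tfidf` is a hypothetical helper name.

```python
import math
from collections import Counter

# Three made-up documents, already lower-cased and whitespace-tokenizable
docs = [
    "patience is waiting calmly",
    "waiting in line takes patience",
    "the game was fun",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    tf = Counter(tokens)
    # Smoothed IDF, ln((1 + n) / (1 + df)) + 1, multiplied by the raw term count
    weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    # L2-normalise so that every document vector has unit length
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vectors = [tfidf(tokens) for tokens in tokenized]
print(df["patience"], df["game"])
print(vectors[0]["calmly"] > vectors[0]["patience"])
```

In the first document, "calmly" occurs in only one document and therefore receives a larger weight than "patience", which occurs in two.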

In [16]:
def get_tfidf_vectors(essays):
    vectorizer = TfidfVectorizer(stop_words='english')
    # Create the lemmatizer once instead of once per token
    lemmatizer = WordNetLemmatizer()

    words = []
    for essay in essays:
        essay = re.sub(r'\W', ' ', essay)
        words.append(nltk.word_tokenize(essay))

    docs_lemmatized = [[lemmatizer.lemmatize(j) for j in i] for i in words]

    corpus = [' '.join(i) for i in docs_lemmatized]
    vectors = vectorizer.fit_transform(corpus)
    # get_feature_names() was removed in scikit-learn 1.2;
    # get_feature_names_out() is available from version 1.0 onwards
    feature_names = vectorizer.get_feature_names_out()

    return feature_names, vectors

In [17]:
feature_names,vectors_all = get_tfidf_vectors(data['essay'])
print("num of essays X number of features",vectors_all.shape)

num of essays X number of features (1569, 9138)


### Latent Semantic Analysis

Latent Semantic Analysis applies a truncated singular value decomposition (SVD) to the TF-IDF matrix. Each SVD component can be read as a latent concept, and essays are then compared in the resulting concept space.

Reference -

5. Kakkonen, T., Myller, N., & Sutinen, E. (2006). Applying Part-of-Speech Enhanced LSA to Automatic Essay Grading. arXiv preprint cs/0610118.



In [18]:
# SVD represents documents and terms as vectors in a latent concept space
reduced_dim = vectors_all.shape[0]
svd_model = TruncatedSVD(n_components=reduced_dim, algorithm='randomized', random_state=122)
lsa = svd_model.fit_transform(vectors_all)

pd.DataFrame(svd_model.components_,index=range(reduced_dim), columns=feature_names)

Unnamed: 0,00,000,15,190,21,27,30,30am,30pm,35,...,œsix,œthe,œthen,œwas,œwellâ,œwhat,œwhatâ,œwhy,œyou,œyouâ
0,0.001022,0.001117,0.000238,0.000398,0.000538,0.000458,0.000364,0.000550,0.000605,0.000268,...,0.000272,0.000272,0.000091,0.000081,0.000231,0.000231,0.000231,0.000231,0.000231,0.000081
1,-0.001041,-0.000316,-0.000378,-0.000644,-0.000627,0.000821,-0.000094,-0.000719,-0.000619,-0.000190,...,-0.000365,-0.000365,0.000189,-0.000095,-0.000219,-0.000219,-0.000219,-0.000219,-0.000219,-0.000095
2,0.002137,-0.000618,-0.000045,-0.000488,-0.000634,-0.000058,-0.000892,-0.000798,0.001315,0.001227,...,0.000696,0.000696,0.000969,0.000787,0.001675,0.001675,0.001675,0.001675,0.001675,0.000787
3,-0.001304,-0.000886,-0.000349,-0.001009,0.001308,-0.000210,0.002147,0.000143,-0.000550,-0.000144,...,0.000627,0.000627,0.000299,0.000222,0.000104,0.000104,0.000104,0.000104,0.000104,0.000222
4,-0.002155,-0.000991,-0.000763,-0.000685,-0.000519,-0.000841,0.000893,-0.000221,-0.000459,-0.000334,...,-0.000456,-0.000456,0.000643,0.000094,0.000133,0.000133,0.000133,0.000133,0.000133,0.000094
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1564,-0.002234,0.018293,0.005340,-0.003878,0.001401,0.012321,-0.005822,-0.007661,-0.016320,0.005765,...,0.002224,0.002224,-0.000048,-0.000808,-0.000495,-0.000495,-0.000495,-0.000495,-0.000495,-0.000808
1565,0.004450,-0.010192,-0.002901,-0.009475,-0.009178,0.007652,0.003318,-0.004794,-0.006527,-0.001624,...,0.003654,0.003654,0.001742,0.002626,-0.001118,-0.001118,-0.001118,-0.001118,-0.001118,0.002626
1566,-0.001575,-0.034740,0.000245,-0.000368,-0.000731,-0.005036,0.002357,-0.005813,0.012520,0.001081,...,0.001796,0.001796,-0.003287,0.000641,-0.006879,-0.006879,-0.006879,-0.006879,-0.006879,0.000641
1567,0.054281,-0.097997,-0.006105,-0.005157,-0.000963,0.004370,-0.001572,-0.002846,-0.001660,-0.006483,...,-0.008133,-0.008133,0.000648,0.000604,-0.001723,-0.001723,-0.001723,-0.001723,-0.001723,0.000604


In [19]:
# Compute document similarity as dot products between the LSA document vectors
lsa_similarity = lsa @ lsa.T
pd.DataFrame(lsa_similarity, index=range(num_essays), columns=range(num_essays))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1559,1560,1561,1562,1563,1564,1565,1566,1567,1568
0,1.000000,0.071469,0.015369,0.193639,0.051428,0.029865,0.077420,0.013215,0.388682,0.213989,...,0.077083,0.004276,0.072728,0.076294,0.131251,0.021802,0.053630,0.037008,0.072476,0.046536
1,0.071469,1.000000,0.026020,0.045808,0.128025,0.015626,0.047426,0.019096,0.094359,0.028794,...,0.023090,0.037797,0.039388,0.060182,0.080888,0.034003,0.012451,0.029164,0.033281,0.007367
2,0.015369,0.026020,1.000000,0.063186,0.039994,0.040058,0.014183,0.028007,0.048132,0.020856,...,0.024036,0.031381,0.061835,0.071120,0.070754,0.067503,0.039460,0.025963,0.027108,0.009317
3,0.193639,0.045808,0.063186,1.000000,0.061093,0.076080,0.086670,0.043097,0.069822,0.117171,...,0.056199,0.053734,0.106519,0.107885,0.085123,0.077114,0.036716,0.044122,0.079889,0.032604
4,0.051428,0.128025,0.039994,0.061093,1.000000,0.018170,0.018797,0.018461,0.027674,0.061834,...,0.010182,0.069484,0.032971,0.056238,0.055449,0.033205,0.041778,0.021484,0.038159,0.011424
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1564,0.021802,0.034003,0.067503,0.077114,0.033205,0.087350,0.017830,0.130354,0.106623,0.046599,...,0.061299,0.029110,0.048785,0.160802,0.108553,1.000000,0.025467,0.056428,0.050847,0.034714
1565,0.053630,0.012451,0.039460,0.036716,0.041778,0.071719,0.024900,0.022927,0.044425,0.038887,...,0.070641,0.059585,0.031530,0.060079,0.057630,0.025467,1.000000,0.037601,0.054673,0.080925
1566,0.037008,0.029164,0.025963,0.044122,0.021484,0.084145,0.020094,0.028937,0.062734,0.040779,...,0.045872,0.069468,0.032464,0.090366,0.139172,0.056428,0.037601,1.000000,0.041949,0.042769
1567,0.072476,0.033281,0.027108,0.079889,0.038159,0.092889,0.049453,0.030616,0.076587,0.043874,...,0.065519,0.046898,0.041646,0.129504,0.101350,0.050847,0.054673,0.041949,1.000000,0.028050
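
The two LSA cells above can be mimicked on a toy matrix with plain NumPy. This is a minimal sketch under stated assumptions: the matrix `X` holds made-up weights, `k = 2` is an arbitrary choice, and `np.linalg.svd` is used in place of `TruncatedSVD`, so the numbers are only analogous to what the notebook computes.

```python
import numpy as np

# Toy document-term matrix: 4 documents x 5 terms (made-up weights)
X = np.array([
    [2.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0, 1.0],
    [0.0, 0.0, 2.0, 1.0, 1.0],
])
# L2-normalise the rows, as TF-IDF vectors are by default
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Full SVD, then keep only the k strongest components (the latent concepts)
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
doc_concepts = U[:, :k] * s[:k]   # documents in concept space, shape (4, k)

# Dot products between documents in the reduced space
similarity = doc_concepts @ doc_concepts.T
print(np.round(similarity, 3))
```

Documents 0 and 1 share vocabulary and end up with a high similarity, while documents 0 and 2 share no terms and stay at approximately zero. Choosing `k` smaller than the number of documents is what makes the decomposition a reduction.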


### Cosine Similarity

Using the TF-IDF vectors obtained above, we measure how similar each essay is to the highest-scoring essays, as the average cosine similarity between the vector pairs.

This feature has not been taken from any of the papers.

In [20]:
highest = max(data['domain1_score'].tolist())

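def get_cosine_similarity(essay_id):
    # Rows of the essays that received the highest score
    index_high = data.index[data['domain1_score'] == highest].tolist()
    n = len(index_high)

    # Row of the essay being compared
    j = data.index[data['essay_id'] == essay_id][0]
    similarity = 0.0
    for i in index_high:
        # cosine_similarity returns a 1x1 array for a pair of row vectors
        similarity += cosine_similarity(vectors_all[i, :], vectors_all[j, :])[0, 0]
    similarity /= n

    # np.asscalar was removed in NumPy 1.23; return a plain float instead
    return float(similarity)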
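
For reference, cosine similarity between two vectors is their dot product divided by the product of their lengths. The dependency-free sketch below averages it over a set of reference vectors, in the same spirit as `get_cosine_similarity`. The vectors `top_scoring` and `candidate` are made up for illustration, and `cosine` is a hypothetical helper.

```python
import math

def cosine(u, v):
    # Dot product divided by the product of the Euclidean norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Define the similarity with an all-zero vector as 0 to avoid dividing by zero
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

top_scoring = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]   # hypothetical vectors of top essays
candidate = [1.0, 1.0, 0.0]                         # hypothetical essay vector

# Average similarity of the candidate to every top-scoring vector
feature = sum(cosine(candidate, t) for t in top_scoring) / len(top_scoring)
print(round(feature, 4))
```

Note that in the notebook a top-scoring essay is also compared with itself, which contributes a similarity of 1 to its own average.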

### Extraction

Finally, we apply all of the above functions to our data set and store the results in a CSV file.

In [21]:
def extract_features(data):
    
    features = data.copy()
    
    features['word_count'] = features['essay'].apply(get_word_count)
    print("Added 'word_count' feature successfully.")
    
    features['sent_count'] = features['essay'].apply(get_sentence_count)
    print("Added 'sent_count' feature successfully.")
    
    features['avg_word_len'] = features['essay'].apply(get_word_length_average)
    print("Added 'avg_word_len' feature successfully.")
    
    features['lemma_count'] = features['essay'].apply(get_lemma_count)
    print("Added 'lemma_count' feature successfully.")
    
    features['spell_err_count'] = features['essay'].apply(get_spell_error_count)
    print("Added 'spell_err_count' feature successfully.")
    
    features['noun_count'], features['adj_count'], features['verb_count'], features['adv_count'] = zip(*features['essay'].map(get_pos_counts))
    print("Added 'noun_count', 'adj_count', 'verb_count' and 'adv_count' features successfully.")
    
    features['neg_score'], features['pos_score'], features['neu_score'] = zip(*features['essay'].map(get_sentiment_tags))
    print("Added 'neg_score', 'pos_score' and 'neu_score' features successfully.")
    
    features['cosine_similarity'] = features['essay_id'].apply(get_cosine_similarity)
    print("Added 'similarity' feature successfully.")
        
    # TODO: LSA 
    
    return features

In [22]:
features_set1 = extract_features(data)

Added 'word_count' feature successfully.
Added 'sent_count' feature successfully.
Added 'avg_word_len' feature successfully.
Added 'lemma_count' feature successfully.
Added 'spell_err_count' feature successfully.
Added 'noun_count', 'adj_count', 'verb_count' and 'adv_count' features successfully.
Added 'neg_score', 'pos_score' and 'neu_score' features successfully.
Added 'similarity' feature successfully.


In [23]:
features_set1

Unnamed: 0,essay_id,essay,domain1_score,word_count,sent_count,avg_word_len,lemma_count,spell_err_count,noun_count,adj_count,verb_count,adv_count,neg_score,pos_score,neu_score,cosine_similarity
0,17834,Patience is when your waiting .I was patience ...,15.0,94,3,4.063830,60,0.021277,0.255319,0.021277,0.276596,0.053191,0.061,0.000,0.939,0.055627
1,17836,"I am not a patience person, like I can’t sit i...",13.0,96,3,3.708333,59,0.093750,0.250000,0.041667,0.187500,0.072917,0.127,0.000,0.873,0.045038
2,17837,One day I was at basketball practice and I was...,15.0,156,1,3.814103,83,0.057692,0.217949,0.051282,0.250000,0.038462,0.015,0.031,0.955,0.049515
3,17838,I going to write about a time when I went to t...,17.0,224,13,3.830357,118,0.017857,0.205357,0.040179,0.250000,0.062500,0.047,0.055,0.898,0.081108
4,17839,It can be very hard for somebody to be patient...,13.0,155,12,4.083871,73,0.012903,0.148387,0.064516,0.238710,0.109677,0.056,0.045,0.899,0.048745
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1564,19558,One time I was getting a cool @CAPS1 game it w...,12.0,73,6,3.273973,51,0.095890,0.260274,0.082192,0.232877,0.000000,0.042,0.063,0.895,0.068362
1565,19559,A patent person in my life is my mom. Aicason ...,16.0,223,9,4.188341,121,0.035874,0.237668,0.058296,0.183857,0.085202,0.046,0.108,0.846,0.039163
1566,19561,A time when someone else I know was patient wa...,19.0,172,13,3.924419,94,0.034884,0.215116,0.034884,0.209302,0.087209,0.010,0.080,0.910,0.053159
1567,19562,I hate weddings. I love when people get marrie...,22.0,300,30,3.896667,152,0.110000,0.256667,0.063333,0.216667,0.096667,0.093,0.085,0.822,0.081143


In [24]:
filename = 'features_set_' + str(essay_set_num) + '.csv'
features_set1.to_csv(filename, index=False)