## Feature Extraction

In this notebook we sanitize the data and extract features from it. The data is taken from Kaggle and consists of more than 12000 essays.

The features draw on ideas from several research papers, referenced below.

### Novelty
In addition to the ideas already proposed in these papers, we use **Latent Semantic Indexing** to extract concepts from the essays and to measure the similarity between them.


#### REFERENCES:
1. Mahana, M., Johns, M., & Apte, A. (2012). Automated essay grading using machine learning. Mach. Learn. Session, Stanford University.

2. Suresh, A., & Jha, M. (2018). Automated essay grading using natural language processing and support vector machine. International Journal of Computing and Technology, 5(2), 18-21.

3. Rokade, A., Patil, B., Rajani, S., Revandkar, S., & Shedge, R. (2018, April). Automated Grading System Using Natural Language Processing. In 2018 Second International Conference on Inventive Communication and Computational Technologies (ICICCT) (pp. 1123-1127). IEEE.

4. Song, S., & Zhao, J. (2013). Automated essay scoring using machine learning. Stanford University.

5. Kakkonen, T., Myller, N., & Sutinen, E. (2006). Applying Part-of-Speech Enhanced LSA to Automatic Essay Grading. arXiv preprint cs/0610118.

In [1]:
import re, collections
import pandas as pd
import numpy as np
import enchant
import warnings
warnings.filterwarnings('ignore')

In [2]:
import nltk
nltk.download('vader_lexicon')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import TfidfTransformer 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/nishal/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to /home/nishal/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/nishal/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/nishal/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


### Loading data set

Eight essay sets, totalling 12977 unique essays.

In [3]:
data = pd.read_excel('training_set_rel3.xlsx')

In [4]:
data

Unnamed: 0,essay_id,essay_set,essay,rater1_domain1,rater2_domain1,rater3_domain1,domain1_score,rater1_domain2,rater2_domain2,domain2_score,...,rater2_trait3,rater2_trait4,rater2_trait5,rater2_trait6,rater3_trait1,rater3_trait2,rater3_trait3,rater3_trait4,rater3_trait5,rater3_trait6
0,1,1,"Dear local newspaper, I think effects computer...",4.0,4.0,,8.0,,,,...,,,,,,,,,,
1,2,1,"Dear @CAPS1 @CAPS2, I believe that using compu...",5.0,4.0,,9.0,,,,...,,,,,,,,,,
2,3,1,"Dear, @CAPS1 @CAPS2 @CAPS3 More and more peopl...",4.0,3.0,,7.0,,,,...,,,,,,,,,,
3,4,1,"Dear Local Newspaper, @CAPS1 I have found that...",5.0,5.0,,10.0,,,,...,,,,,,,,,,
4,5,1,"Dear @LOCATION1, I know having computers has a...",4.0,4.0,,8.0,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12973,21626,8,In most stories mothers and daughters are eit...,17.0,18.0,,35.0,,,,...,4.0,4.0,4.0,3.0,,,,,,
12974,21628,8,I never understood the meaning laughter is th...,15.0,17.0,,32.0,,,,...,4.0,4.0,4.0,3.0,,,,,,
12975,21629,8,"When you laugh, is @CAPS5 out of habit, or is ...",20.0,26.0,40.0,40.0,,,,...,5.0,5.0,5.0,5.0,4.0,4.0,4.0,4.0,4.0,4.0
12976,21630,8,Trippin' on fen...,20.0,20.0,,40.0,,,,...,4.0,4.0,4.0,4.0,,,,,,


### Filtering data set

We use only the essay text and its domain score for training; all other columns are discarded.

We use the first 2000 essays for feature extraction.

In [5]:
# Keep only the essay identifier, the essay text and the domain 1 score
data = data[['essay_id', 'essay', 'domain1_score']]

# Keep the first 2000 essays
num_essays = 2000
data = data.iloc[:num_essays].copy()

In [6]:
data

Unnamed: 0,essay_id,essay,domain1_score
0,1,"Dear local newspaper, I think effects computer...",8.0
1,2,"Dear @CAPS1 @CAPS2, I believe that using compu...",9.0
2,3,"Dear, @CAPS1 @CAPS2 @CAPS3 More and more peopl...",7.0
3,4,"Dear Local Newspaper, @CAPS1 I have found that...",10.0
4,5,"Dear @LOCATION1, I know having computers has a...",8.0
...,...,...,...
1995,3190,I believe that they should not be pulled off o...,4.0
1996,3191,When have you ever went into a library and fou...,4.0
1997,3192,When I go to a library I @MONTH1 find some stu...,3.0
1998,3193,"Certain people beleive that offensive books, m...",3.0


In [7]:
def get_wordlist(sentence):
    # Remove non-alphanumeric characters
    sentence = re.sub("[^a-zA-Z0-9]"," ", sentence)
    words = nltk.word_tokenize(sentence)

    return words

In [8]:
def get_tokenized_sentences(essay):
    sentences = nltk.sent_tokenize(essay.strip())
    
    tokenized_sentences = []
    for sentence in sentences:
        if len(sentence) > 0:
            tokenized_sentences.append(get_wordlist(sentence))
    
    return tokenized_sentences

### Numerical features 
Features such as the average word length, the word count and the sentence count give an indication of the writer's fluency and dexterity with the language.

Reference - 

1. Mahana, M., Johns, M., & Apte, A. (2012). Automated essay grading using machine learning. Mach. Learn. Session, Stanford University.
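
As a quick illustration of these three quantities, the sketch below computes them for a toy text. It assumes plain regular expressions as a stand-in for the nltk tokenizers, so the counts can differ from the notebook's own functions on real essays; `toy_text` and the helper names are made up for this example.

```python
import re

toy_text = "Computers help people. They are fast!"

def toy_words(text):
    # Runs of letters and digits; a rough stand-in for nltk.word_tokenize
    return re.findall(r"[A-Za-z0-9]+", text)

def toy_sentences(text):
    # Split after '.', '!' or '?' followed by whitespace;
    # a rough stand-in for nltk.sent_tokenize
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

words = toy_words(toy_text)
word_count = len(words)
sentence_count = len(toy_sentences(toy_text))
avg_word_len = sum(len(w) for w in words) / word_count

print(word_count, sentence_count, avg_word_len)
```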

In [9]:
def get_word_length_average(essay):
    # Sanitize essay
    essay = re.sub(r'\W', ' ', essay)
    words = nltk.word_tokenize(essay)
    avg = sum(len(word) for word in words) / len(words)
    
    return avg

In [10]:
def get_word_count(essay):
    essay = re.sub(r'\W', ' ', essay)
    count = len(nltk.word_tokenize(essay))
    
    return count

In [11]:
def get_sentence_count(essay):
    sentences = nltk.sent_tokenize(essay)
    count = len(sentences)
    
    return count

### Lemmatization and Part of Speech tagging

Lemmatization uses a vocabulary and morphological analysis to reduce each word to its base form (lemma). The number of distinct lemmas, together with the proportions of nouns, adjectives, verbs and adverbs, gives a picture of the lexical density and overall semantic difficulty of the essay.

Reference - 

2. Suresh, A., & Jha, M. (2018). Automated essay grading using natural language processing and support vector machine. International Journal of Computing and Technology, 5(2), 18-21.
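
To show what the part-of-speech proportions look like, here is a small sketch that works on a hand-tagged sentence. The tags in `tagged` are written by hand in Penn Treebank style for illustration and are not actual `nltk.pos_tag` output; `pos_proportions` is a hypothetical helper.

```python
from collections import Counter

# Hand-written (word, tag) pairs in Penn Treebank style; illustrative only
tagged = [
    ("The", "DT"), ("quick", "JJ"), ("fox", "NN"), ("runs", "VBZ"),
    ("very", "RB"), ("fast", "RB"), ("today", "NN"), ("indeed", "RB"),
]

# First letter of the tag -> coarse word class, as in the notebook's functions
COARSE = {"N": "noun", "J": "adjective", "V": "verb", "R": "adverb"}

def pos_proportions(tagged_tokens):
    # Tags outside the four classes are counted under None and ignored below
    counts = Counter(COARSE.get(tag[0]) for _, tag in tagged_tokens)
    total = len(tagged_tokens)
    return {name: counts[name] / total for name in COARSE.values()}

proportions = pos_proportions(tagged)
print(proportions)
```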

In [12]:
def get_lemma_count(essay):
    tokenized_sentences = get_tokenized_sentences(essay)
    lemmatizer = WordNetLemmatizer()  # create once, reuse for every token

    lemmas = []
    for sentence in tokenized_sentences:
        for word, pos_tag in nltk.pos_tag(sentence):
            # Map the Penn Treebank tag to a WordNet POS; default to noun
            if pos_tag.startswith('J'):
                pos = wordnet.ADJ
            elif pos_tag.startswith('V'):
                pos = wordnet.VERB
            elif pos_tag.startswith('R'):
                pos = wordnet.ADV
            else:
                pos = wordnet.NOUN

            lemmas.append(lemmatizer.lemmatize(word, pos))

    # Number of distinct lemmas in the essay
    return len(set(lemmas))

In [13]:
def get_pos_counts(essay):
    tokenized_sentences = get_tokenized_sentences(essay)
    
    nouns, adjectives, verbs, adverbs = 0, 0, 0, 0
    
    essay = re.sub(r'\W', ' ', essay)
    words = nltk.word_tokenize(essay)
    
    for sentence in tokenized_sentences:
        pos_tagged_tokens = nltk.pos_tag(sentence)
        for token_tuple in pos_tagged_tokens:
            pos_tag = token_tuple[1]
            if pos_tag.startswith('N'): 
                nouns += 1
            elif pos_tag.startswith('J'):
                adjectives += 1
            elif pos_tag.startswith('V'):
                verbs += 1
            elif pos_tag.startswith('R'):
                adverbs += 1
    
    return nouns/len(words), adjectives/len(words), verbs/len(words), adverbs/len(words)

### Spell Check (Orthography)
Correct spelling indicates command of the language and facility of use. To test for these characteristics, we use the PyEnchant spell checker with the hunspell dictionary to count the misspelt words in each essay, and report that count as a fraction of the essay's total word count.

Reference - 

1. Mahana, M., Johns, M., & Apte, A. (2012). Automated essay grading using machine learning. Mach. Learn. Session, Stanford University.
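
The error rate itself is a simple ratio. The sketch below uses a tiny hand-made word set, `KNOWN_WORDS`, as a stand-in for the PyEnchant dictionary, so it runs without any spell-checking backend; both the word set and `toy_error_rate` are hypothetical.

```python
import re

# Tiny stand-in for a real dictionary; a real checker knows far more words
KNOWN_WORDS = {"i", "believe", "computers", "are", "helpful"}

def toy_error_rate(text):
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return 0.0  # avoid dividing by zero on empty input
    misspelt = [w for w in words if w.lower() not in KNOWN_WORDS]
    return len(misspelt) / len(words)

rate = toy_error_rate("I beleive computers are helpfull")
print(rate)
```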

In [14]:
def get_spell_error_count(essay):
    essay = re.sub(r'\W', ' ', essay)
    words = nltk.word_tokenize(essay)

    d = enchant.Dict("en_US")
    # Count the words the dictionary does not recognise
    misspelt_count = sum(1 for word in words if not d.check(word))

    # Despite the function name, the result is the fraction of misspelt words
    total_words = get_word_count(essay)
    error_prob = misspelt_count / total_words

    return error_prob

### Sentiment Analysis (Opinion mining)
Sentiment analysis lets us gauge the emotive effectiveness of the essay. We use VADER, a lexicon- and rule-based sentiment analysis tool provided by the nltk package. VADER also reports the overall degree of positiveness or negativeness (its `compound` score), although we do not use it here.

Reference - 

4. Song, S., & Zhao, J. (2013). Automated essay scoring using machine learning. Stanford University.

In [15]:
def get_sentiment_tags(essay):
    # polarity_scores returns the keys 'neg', 'neu', 'pos' and 'compound'
    ss = SentimentIntensityAnalyzer().polarity_scores(essay)

    # The compound score is deliberately ignored
    return ss['neg'], ss['pos'], ss['neu']

### Term Frequency & Inverse Document Frequency

Term frequency (TF) is the number of times a word appears in a document. Document frequency (DF) is the number of documents in which the word occurs. Multiplying TF by the inverse of DF gives the TF-IDF score. The resulting vectors let us measure the correspondence between essays.

Reference -

2. Suresh, A., & Jha, M. (2018). Automated essay grading using natural language processing and support vector machine. International Journal of Computing and Technology, 5(2), 18-21.
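
The following sketch works through the computation by hand on three tiny documents. It assumes the smoothed variant that scikit-learn's `TfidfVectorizer` uses with its default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each document vector; the toy documents and the `tfidf` helper are made up for illustration.

```python
import math
from collections import Counter

docs = [
    ["computer", "help", "people"],
    ["computer", "harm", "people"],
    ["library", "book"],
]

def tfidf(documents):
    n = len(documents)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in documents for term in set(doc))
    # Smoothed inverse document frequency (assumed scikit-learn default)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}

    vectors = []
    for doc in documents:
        tf = Counter(doc)
        weights = {t: tf[t] * idf[t] for t in tf}
        # L2-normalise so every document vector has unit length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

vectors = tfidf(docs)
print(vectors[0])
```

In the first document, `help` receives a larger weight than `computer` because it occurs in fewer documents.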

In [16]:
def get_tfidf_vectors(essays):
    vectorizer = TfidfVectorizer(stop_words='english')
    lemmatizer = WordNetLemmatizer()  # create once, reuse for every token

    words = []
    for essay in essays:
        essay = re.sub(r'\W', ' ', essay)
        words.append(nltk.word_tokenize(essay))

    docs_lemmatized = [[lemmatizer.lemmatize(j) for j in i] for i in words]

    corpus = [' '.join(i) for i in docs_lemmatized]
    vectors = vectorizer.fit_transform(corpus)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    feature_names = vectorizer.get_feature_names_out()

    return feature_names, vectors

In [17]:
feature_names,vectors_all = get_tfidf_vectors(data['essay'])
print("num of essays X number of features",vectors_all.shape)

num of essays X number of features (2000, 15378)


### Latent Semantic Analysis



In [25]:
# SVD represents documents and terms as vectors
reduced_dim = 2000
svd_model = TruncatedSVD(n_components=reduced_dim, algorithm='randomized', random_state=122)
lsa = svd_model.fit_transform(vectors_all)

pd.DataFrame(svd_model.components_,index=range(reduced_dim), columns=feature_names)

Unnamed: 0,00,000,00pm,aa,aamous,aand,aare,abad,abandon,abandoned,...,yup,zap,zero,zingbobway,zip,zombie,zone,zoning,zoo,zoom
0,0.000136,0.000324,0.000300,0.000202,0.000221,0.000053,0.000335,1.036299e-04,0.000142,0.000152,...,0.000167,0.000406,0.001298,0.000226,0.000130,0.000913,0.000091,0.000163,0.001020,0.000350
1,-0.000017,-0.000173,0.000004,0.000015,-0.000218,0.001072,-0.000092,-4.487398e-07,-0.000067,-0.000032,...,-0.000018,-0.000030,-0.000355,-0.000185,-0.000028,0.000020,-0.000081,-0.000049,0.000363,-0.000019
2,-0.000401,-0.000022,-0.000363,-0.000706,0.001078,-0.000022,-0.000639,-1.363574e-04,-0.000600,0.000704,...,0.000089,0.001566,-0.000198,0.000663,0.000263,-0.000770,-0.000386,-0.000026,0.000954,0.000181
3,-0.000053,-0.000220,-0.000755,-0.000142,0.001396,0.000333,-0.000016,2.614697e-04,0.000320,-0.000077,...,-0.000688,0.001051,-0.002359,0.000774,-0.000734,-0.001660,0.000245,-0.000514,-0.001230,-0.000176
4,-0.000483,-0.001170,-0.000850,0.000362,-0.000025,0.000244,0.000113,-2.421830e-04,-0.000740,-0.001198,...,-0.000708,0.000610,-0.000440,-0.001458,-0.000609,-0.003020,-0.000834,-0.000549,-0.003298,-0.001048
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,0.003748,0.004954,0.007427,0.001501,-0.009587,0.004324,-0.003213,1.464198e-04,0.001533,0.004019,...,-0.000129,0.001986,0.005908,0.002301,0.002396,0.002392,-0.001746,-0.002693,0.001786,-0.003040
1996,-0.005518,0.000429,-0.004113,-0.014792,-0.000835,0.002408,-0.004457,-8.130669e-04,-0.004209,-0.004715,...,0.002359,-0.004622,0.003274,-0.004002,0.004485,-0.000175,-0.001691,0.000440,-0.006926,0.009475
1997,-0.007540,-0.000406,-0.000246,0.000815,0.002653,0.001489,-0.000886,-7.077058e-04,0.002643,-0.003071,...,-0.002596,-0.000471,0.000359,0.000770,-0.001592,-0.003722,0.001559,-0.002417,0.012219,-0.004841
1998,-0.004616,-0.012992,0.005516,-0.006076,-0.011491,-0.000809,0.004747,-1.410273e-03,0.001028,-0.002323,...,0.004501,-0.010876,-0.005693,-0.006089,0.005860,0.008448,0.000202,-0.000009,0.005299,0.003088


In [36]:
# Compute document similarity as dot products of the LSA document vectors
lsa_similarity = lsa @ lsa.T
pd.DataFrame(lsa_similarity,index=range(num_essays), columns=range(num_essays))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999
0,1.000000,0.140902,0.171522,0.095338,0.138435,0.096193,0.145993,0.159541,0.161309,0.120384,...,0.144524,0.082640,0.082743,0.049397,0.127117,0.058964,0.093549,0.036924,0.079958,0.075325
1,0.140902,1.000000,0.193965,0.139094,0.156889,0.116773,0.172582,0.155856,0.134259,0.105740,...,0.069009,0.042250,0.017035,0.049591,0.021204,0.034626,0.055880,0.051850,0.025982,0.023517
2,0.171522,0.193965,1.000000,0.130630,0.145819,0.095854,0.174459,0.160876,0.119314,0.141472,...,0.042253,0.047994,0.064691,0.066445,0.043972,0.043620,0.082474,0.045107,0.099334,0.071228
3,0.095338,0.139094,0.130630,1.000000,0.144374,0.094235,0.111451,0.150861,0.095013,0.079430,...,0.028193,0.039631,0.045939,0.034298,0.055913,0.040684,0.048377,0.023817,0.064920,0.044435
4,0.138435,0.156889,0.145819,0.144374,1.000000,0.107828,0.167058,0.187601,0.190546,0.144120,...,0.046729,0.069090,0.070003,0.035328,0.066820,0.042474,0.062947,0.032841,0.047874,0.060753
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,0.058964,0.034626,0.043620,0.040684,0.042474,0.040073,0.053939,0.049708,0.077236,0.039848,...,0.215721,0.264417,0.115308,0.134424,0.240917,1.000000,0.278135,0.292377,0.376166,0.285387
1996,0.093549,0.055880,0.082474,0.048377,0.062947,0.050669,0.051481,0.113600,0.065854,0.057337,...,0.258010,0.304455,0.120156,0.102158,0.380408,0.278135,1.000000,0.161604,0.258801,0.300367
1997,0.036924,0.051850,0.045107,0.023817,0.032841,0.031405,0.051775,0.086812,0.044223,0.035252,...,0.113711,0.201387,0.045754,0.054098,0.151760,0.292377,0.161604,1.000000,0.417356,0.256883
1998,0.079958,0.025982,0.099334,0.064920,0.047874,0.051362,0.063268,0.084928,0.047775,0.045450,...,0.189957,0.282968,0.163649,0.050309,0.249007,0.376166,0.258801,0.417356,1.000000,0.310659
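
As a small illustration of the two steps above, the sketch below factorises a made-up document-term matrix with NumPy's SVD, keeps the top `k` components, and takes dot products between the reduced document vectors. The matrix values and `k` are assumptions chosen for the example; multiplying the left singular vectors by the singular values plays the same role as the output of `TruncatedSVD.fit_transform`, up to the sign of each component.

```python
import numpy as np

# Toy document-term matrix (rows: documents, columns: terms); values are made up
X = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0, 1.0],
])
# L2-normalise the rows so each document vector has unit length
X = X / np.linalg.norm(X, axis=1, keepdims=True)

# Thin SVD: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of latent components to keep (assumed for this example)
docs_k = U[:, :k] * s[:k]  # documents expressed in the k-dimensional latent space

# Pairwise document similarity as dot products of the reduced vectors
similarity = docs_k @ docs_k.T
print(np.round(similarity, 3))
```

Documents 0 and 1 share most of their terms and end up far more similar to each other than either is to document 2.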


### Cosine Similarity

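Using the TF-IDF vectors obtained above, we measure how similar two essays are by the cosine similarity between their vectors. For each essay, the feature is its average cosine similarity to the essays whose `domain1_score` is 4.

This feature has not been taken from any of the papers.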
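
A minimal sketch of the measure in plain Python, assuming dense lists of numbers rather than the sparse TF-IDF matrix used in the notebook. `cosine` and `mean_similarity` are hypothetical helpers; the second one mirrors the idea of averaging an essay's similarity over a set of reference essays.

```python
import math

def cosine(u, v):
    # Dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_similarity(vector, references):
    # Average cosine similarity between one vector and each reference vector
    return sum(cosine(vector, r) for r in references) / len(references)

essay = [1.0, 0.0]
references = [[1.0, 1.0], [0.0, 1.0]]
score = mean_similarity(essay, references)
print(score)
```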

In [30]:
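def get_cosine_similarity(essay_id):
    # Rows of the reference essays (those with domain1_score == 4)
    index_high = data.index[data['domain1_score'] == 4].tolist()
    # Row of the essay being scored
    j = data.index[data['essay_id'] == essay_id].tolist()

    # Cosine similarity between the essay and every reference essay
    similarities = cosine_similarity(vectors_all[index_high, :], vectors_all[j, :])

    # np.asscalar was removed in NumPy 1.23; return the mean as a plain float
    return float(similarities.mean())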

### Extraction

Finally, we apply all of the above functions to our data set and store the results in a CSV file.

In [31]:
def extract_features(data):
    
    features = data.copy()
    
    features['word_count'] = features['essay'].apply(get_word_count)
    print("Added 'word_count' feature successfully.")
    
    features['sent_count'] = features['essay'].apply(get_sentence_count)
    print("Added 'sent_count' feature successfully.")
    
    features['avg_word_len'] = features['essay'].apply(get_word_length_average)
    print("Added 'avg_word_len' feature successfully.")
    
    features['lemma_count'] = features['essay'].apply(get_lemma_count)
    print("Added 'lemma_count' feature successfully.")
    
    features['spell_err_count'] = features['essay'].apply(get_spell_error_count)
    print("Added 'spell_err_count' feature successfully.")
    
    features['noun_count'], features['adj_count'], features['verb_count'], features['adv_count'] = zip(*features['essay'].map(get_pos_counts))
    print("Added 'noun_count', 'adj_count', 'verb_count' and 'adv_count' features successfully.")
    
    features['neg_score'], features['pos_score'], features['neu_score'] = zip(*features['essay'].map(get_sentiment_tags))
    print("Added 'neg_score', 'pos_score' and 'neu_score' features successfully.")
    
    features['cosine_similarity'] = features['essay_id'].apply(get_cosine_similarity)
    print("Added 'similarity' feature successfully.")
        
    # TODO: LSA 
    
    return features

In [32]:
features_set1 = extract_features(data)

Added 'word_count' feature successfully.
Added 'sent_count' feature successfully.
Added 'avg_word_len' feature successfully.
Added 'lemma_count' feature successfully.
Added 'spell_err_count' feature successfully.
Added 'noun_count', 'adj_count', 'verb_count' and 'adv_count' features successfully.
Added 'neg_score', 'pos_score' and 'neu_score' features successfully.
Added 'similarity' feature successfully.


In [33]:
features_set1

Unnamed: 0,essay_id,essay,domain1_score,word_count,sent_count,avg_word_len,lemma_count,spell_err_count,noun_count,adj_count,verb_count,adv_count,neg_score,pos_score,neu_score,cosine_similarity
0,1,"Dear local newspaper, I think effects computer...",8.0,350,16,4.237143,162,0.045714,0.237143,0.051429,0.211429,0.068571,0.000,0.170,0.830,0.090943
1,2,"Dear @CAPS1 @CAPS2, I believe that using compu...",9.0,423,20,4.312057,185,0.061466,0.252955,0.044917,0.200946,0.044917,0.014,0.219,0.766,0.049000
2,3,"Dear, @CAPS1 @CAPS2 @CAPS3 More and more peopl...",7.0,283,14,4.342756,145,0.031802,0.289753,0.070671,0.183746,0.056537,0.045,0.197,0.759,0.069262
3,4,"Dear Local Newspaper, @CAPS1 I have found that...",10.0,530,27,4.813208,236,0.122642,0.335849,0.079245,0.183019,0.054717,0.008,0.152,0.840,0.056878
4,5,"Dear @LOCATION1, I know having computers has a...",8.0,473,30,4.334038,190,0.035941,0.241015,0.067653,0.190275,0.076110,0.026,0.096,0.879,0.071470
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,3190,I believe that they should not be pulled off o...,4.0,368,28,4.105978,152,0.024457,0.195652,0.097826,0.190217,0.081522,0.121,0.092,0.786,0.221566
1996,3191,When have you ever went into a library and fou...,4.0,610,36,4.013115,199,0.031148,0.178689,0.054098,0.254098,0.068852,0.144,0.051,0.805,0.212200
1997,3192,When I go to a library I @MONTH1 find some stu...,3.0,196,11,4.076531,102,0.045918,0.183673,0.102041,0.250000,0.051020,0.156,0.107,0.737,0.134513
1998,3193,"Certain people beleive that offensive books, m...",3.0,382,19,4.526178,141,0.031414,0.225131,0.083770,0.206806,0.078534,0.128,0.074,0.798,0.206831


In [37]:
features_set1.to_csv('features.csv', index=False)