# Corpus Preprocessing and Feature Extraction 

SEM-2012-SharedTask-CD-SCO is the corpus released for the \*SEM 2012 Shared Task on resolving the scope and focus of negation. "CD-SCO" refers to the Conan Doyle stories annotated with negation cues and their scope. Negation detection is the task of identifying the words that signal a negation (cues such as "not", "no", or "never") and the part of the sentence they affect (the scope).

The data consists of Sherlock Holmes stories by Arthur Conan Doyle. The training file used here covers *The Hound of the Baskervilles* (file names `baskervilles01` to `baskervilles14`). The texts are tokenized, and in this simplified version each token carries one of three labels: "O" (not part of a negation cue), "B-NEG" (beginning of a negation cue), and "I-NEG" (continuation of a multi-token negation cue).

This dataset is used to train and evaluate negation detection systems. It is challenging because negation cues are rare compared to ordinary tokens and because negation can be expressed in several ways, including single words, multi-word expressions, and affixes such as "un-".

In [37]:
# libraries
import csv
import sys

import numpy as np
import pandas as pd
import nltk
from nltk import pos_tag
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# One-time downloads of the NLTK resources used below
#nltk.download('wordnet')
#nltk.download('omw-1.4')
#nltk.download('averaged_perceptron_tagger')

The columns in the data are as follows:<br>

**file_name** (e.g. *baskervilles01*): the name of the file that the token comes from.<br>
**sentence_num** (e.g. *0*): the sentence number of the token in the text.<br>
**word_number** (e.g. *0, 1, 2, 3, 4*): the position of the token within the sentence.<br>
**word** (e.g. *Chapter, 1., Mr., Sherlock, Holmes*): the token itself.<br>
**coreference_label** (e.g. *O*): the negation cue label of the token. Despite the column name used in this notebook, the values are negation labels, not coreference labels:<br>
- *O*: the token is not part of a negation cue.<br>
- *B-NEG*: the token starts a negation cue.<br>
- *I-NEG*: the token continues a multi-token negation cue.<br>

In short, these columns give the position of each token, the token itself, and its negation cue label, which makes it possible to locate every negation cue in the text. This dataset is typically used to train and evaluate negation cue detection systems.
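To make the B-NEG / I-NEG scheme concrete, the following is a minimal sketch that collects the cue spans of one sentence from its label sequence. The function name `extract_cues` and the example sentence are illustrative and not part of the corpus tooling. The sketch assumes that the tokens of a multi-token cue are adjacent, which may not hold for every cue in the data.

```python
from typing import List, Tuple


def extract_cues(tokens: List[str], labels: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, text) for every negation cue in one sentence."""
    cues = []
    start = None  # index where the currently open cue began, if any

    for i, label in enumerate(labels):
        if label == "B-NEG":
            # A new cue starts; close the previous one if it is still open
            if start is not None:
                cues.append((start, i, " ".join(tokens[start:i])))
            start = i
        elif label == "I-NEG":
            # Continue the open cue; tolerate an I-NEG without a preceding B-NEG
            if start is None:
                start = i
        else:
            # Any other label ("O") closes an open cue
            if start is not None:
                cues.append((start, i, " ".join(tokens[start:i])))
                start = None

    # Close a cue that runs to the end of the sentence
    if start is not None:
        cues.append((start, len(tokens), " ".join(tokens[start:])))
    return cues


# Made-up sentence for illustration (not taken from the corpus)
tokens = ["By", "no", "means", "was", "he", "not", "ready", "."]
labels = ["B-NEG", "I-NEG", "I-NEG", "O", "O", "B-NEG", "O", "O"]
print(extract_cues(tokens, labels))
```

Applied per sentence, a helper of this kind gives a quick overview of which cue expressions occur in the data and how often.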

In [38]:
# Read the contents of the file
df = pd.read_csv('datas/SEM-2012-SharedTask-CD-SCO-training-simple.v2.txt', delimiter='\t',
                 names=['file_name', 'sentence_num', 'word_number', 'word', 'coreference_label'])
df.head()

Unnamed: 0,file_name,sentence_num,word_number,word,coreference_label
0,baskervilles01,0,0,Chapter,O
1,baskervilles01,0,1,1.,O
2,baskervilles01,0,2,Mr.,O
3,baskervilles01,0,3,Sherlock,O
4,baskervilles01,0,4,Holmes,O


In [39]:
df["coreference_label"].value_counts()

O        64448
B-NEG      987
I-NEG       16
Name: coreference_label, dtype: int64

## Preprocessing Steps

Since the data is already tokenized, there is no need to tokenize it again. However, lowercasing, stemming, or lemmatization can be applied to normalize the tokens, and POS tags and WordNet synonyms can be added as further features.

In [40]:
# lemmatizing
def lemmatization_feature(tokens):
    '''
    This function applies lemmatization to tokens

    :param tokens: the list of tokenized data
    :type tokens: list

    :returns: a list with lemmatized tokens
    '''
    wnl = WordNetLemmatizer()

    # Without a POS argument, lemmatize() treats every token as a noun
    return [wnl.lemmatize(t) for t in tokens]

In [41]:
def stemming_feature(tokens):
    '''
    This function applies stemming to tokens

    :param tokens: the list of tokenized data
    :type tokens: list

    :returns: a list with stemmed tokens
    '''
    ps = PorterStemmer()

    return [ps.stem(t) for t in tokens]

In [42]:
def lowercasing(tokens):
    '''
    This function converts tokens to lowercase

    :param tokens: the list of tokenized data
    :type tokens: list

    :returns: a list with lowercased tokens
    '''
    return [t.lower() for t in tokens]

In [43]:
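def pos_tagging(tokens):
    '''
    This function assigns a Penn Treebank POS tag to each token
    :param tokens: the tokenized data
    :type tokens: pandas.Series
    :returns: a list with one POS tag per token
    '''
    # Each token is tagged in isolation, so the tagger sees no sentence context
    tags = tokens.apply(lambda x: nltk.pos_tag([x])[0][1]).tolist()

    return tags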
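Tagging every token on its own means the tagger never sees the neighbouring words. The sketch below shows one way to group the rows by sentence first and tag each sentence as a whole. The helper `tag_by_sentence` and the stand-in `toy_tagger` are hypothetical names introduced for illustration. The sketch assumes the rows are in file order, so that tokens of the same sentence are adjacent.

```python
from itertools import groupby
from typing import Callable, Iterable, List, Tuple

Row = Tuple[str, int, str]  # (file_name, sentence_num, word)
Tagger = Callable[[List[str]], List[Tuple[str, str]]]


def tag_by_sentence(rows: Iterable[Row], tagger: Tagger) -> List[str]:
    """Tag whole sentences at once and return one tag per input row."""
    tags: List[str] = []
    # Consecutive rows that share file name and sentence number form one sentence
    for _, group in groupby(rows, key=lambda r: (r[0], r[1])):
        words = [r[2] for r in group]
        tags.extend(tag for _, tag in tagger(words))
    return tags


def toy_tagger(words: List[str]) -> List[Tuple[str, str]]:
    """Hypothetical stand-in tagger: marks the first word of each sentence."""
    return [(w, "FIRST" if i == 0 else "REST") for i, w in enumerate(words)]


rows = [
    ("baskervilles01", 0, "Chapter"),
    ("baskervilles01", 0, "1."),
    ("baskervilles01", 1, "Mr."),
    ("baskervilles01", 1, "Sherlock"),
    ("baskervilles01", 1, "Holmes"),
]
print(tag_by_sentence(rows, toy_tagger))
```

In the notebook, `nltk.pos_tag` could be passed as the `tagger` argument, since it takes a list of tokens and returns (token, tag) pairs, and the rows could come from `df[["file_name", "sentence_num", "word"]].itertuples(index=False)`. The resulting tags would differ from the per-token tags shown in the output tables.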

In [44]:
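def find_syn_ant(word, pos, syn_ant):
    '''
    This function looks up WordNet synonyms or antonyms of a word
    :param word: the token to look up
    :param pos: a WordNet POS constant, or '' if the token has no WordNet POS
    :param syn_ant: 'syn' for synonyms, 'ant' for antonyms
    :returns: a list of lemma names (duplicates are possible)
    '''
    if pos == '':
        return []
    synsets = wordnet.synsets(word, pos=pos)
    if syn_ant == 'syn':
        # First lemma name of every synset of the word
        return [syn.lemmas()[0].name() for syn in synsets]
    elif syn_ant == 'ant':
        # Antonyms of the first lemma of every synset of the word
        return [ant.lemmas()[0].name() for syn in synsets for ant in syn.lemmas()[0].antonyms()]
    else:
        return []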

In [45]:
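def change_context(df, syn_ant):
    '''
    This function adds a "syn_ant" column with the WordNet synonyms ('syn')
    or antonyms ('ant') of each word

    :returns: the dataframe with the added column
    '''
    def get_wordnet_pos(treebank_tag):
        # Map a Penn Treebank tag to the matching WordNet POS constant
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        elif treebank_tag.startswith('V'):
            return wordnet.VERB
        elif treebank_tag.startswith('N'):
            return wordnet.NOUN
        elif treebank_tag.startswith('R'):
            return wordnet.ADV
        else:
            return ''

    def lookup(word):
        # Tag the word in isolation, then look it up with the mapped POS
        pos = get_wordnet_pos(nltk.pos_tag([word])[0][1])
        return find_syn_ant(word, pos, syn_ant)

    df["syn_ant"] = df["word"].apply(lookup)
    return df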

In [46]:
df = change_context(df, 'syn')

In [47]:
df["lemma"] = lemmatization_feature(df["word"])
df["lowercase"] = lowercasing(df["word"])
df["stem"] = stemming_feature(df["word"])
df["pos_tag"] = pos_tagging(df["word"])

In [48]:
df.head(10)

Unnamed: 0,file_name,sentence_num,word_number,word,coreference_label,syn_ant,lemma,lowercase,stem,pos_tag
0,baskervilles01,0,0,Chapter,O,"[chapter, chapter, chapter, chapter, chapter]",Chapter,chapter,chapter,NN
1,baskervilles01,0,1,1.,O,[],1.,1.,1.,CD
2,baskervilles01,0,2,Mr.,O,[Mister],Mr.,mr.,mr.,NNP
3,baskervilles01,0,3,Sherlock,O,[private_detective],Sherlock,sherlock,sherlock,NN
4,baskervilles01,0,4,Holmes,O,"[Sherlock_Holmes, Holmes, Holmes, Holmes]",Holmes,holmes,holm,NNS
5,baskervilles01,1,0,Mr.,O,[Mister],Mr.,mr.,mr.,NNP
6,baskervilles01,1,1,Sherlock,O,[private_detective],Sherlock,sherlock,sherlock,NN
7,baskervilles01,1,2,Holmes,O,"[Sherlock_Holmes, Holmes, Holmes, Holmes]",Holmes,holmes,holm,NNS
8,baskervilles01,1,3,",",O,[],",",",",",",","
9,baskervilles01,1,4,who,O,[],who,who,who,WP


In [49]:
df.tail(10)

Unnamed: 0,file_name,sentence_num,word_number,word,coreference_label,syn_ant,lemma,lowercase,stem,pos_tag
65441,baskervilles14,270,53,it,O,[],it,it,it,PRP
65442,baskervilles14,270,54,merged,O,"[unify, blend, unite]",merged,merged,merg,VBN
65443,baskervilles14,270,55,into,O,[],into,into,into,IN
65444,baskervilles14,270,56,the,O,[],the,the,the,DT
65445,baskervilles14,270,57,russet,O,[russet],russet,russet,russet,NN
65446,baskervilles14,270,58,slopes,O,"[slope, gradient]",slope,slopes,slope,NNS
65447,baskervilles14,270,59,of,O,[],of,of,of,IN
65448,baskervilles14,270,60,the,O,[],the,the,the,DT
65449,baskervilles14,270,61,moor,O,"[Moor, moor]",moor,moor,moor,NN
65450,baskervilles14,270,62,.,O,[],.,.,.,.


Replacing words with their synonyms or antonyms is one way to change how a negation is expressed, and it is sometimes used in text classification and sentiment analysis tasks. The idea is to replace a negated expression with a word or phrase of similar meaning, so that the same content is expressed without an explicit negation cue. Depending on the substitute chosen, this can also shift the overall sentiment of the sentence.

An example of this technique is as follows:<br>

*Sentence*: "The movie was not bad, but it was not good either."<br>
*Replacement*: "**not bad**" with "**mediocre**" and "**not good**" with "**average**"<br>
*Result*: "The movie was mediocre, but it was average either."<br>
In the above example, the original sentence has a roughly neutral sentiment. After replacing "**not bad**" with "**mediocre**" and "**not good**" with "**average**", the sentence contains no negation cue and its sentiment is carried by the adjectives alone, with "mediocre" leaning slightly more negative than "not bad". The literal substitution also leaves "either" stranded, so the result needs further editing to read fluently.<br>

Another example is:<br>

*Sentence*: "The food was not great, but it was not terrible either."<br>
*Replacement*: "**not great**" with "**average**" and "**not terrible**" with "**decent**"<br>
*Result*: "The food was average, but it was decent either."<br>
Here, the original sentence also has a roughly neutral sentiment. Replacing "**not great**" with "**average**" and "**not terrible**" with "**decent**" keeps a similar meaning while removing both negation cues, so the sentiment is stated directly rather than through negation.<br>
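The two examples above can be reproduced with a small phrase-replacement sketch. The mapping `REPLACEMENTS` and the function name `replace_negated_phrases` are illustrative assumptions. A real system would need a much larger lexicon and some handling of leftover words such as "either".

```python
import re
from typing import Dict

# Illustrative mapping taken from the two examples in the text
REPLACEMENTS: Dict[str, str] = {
    "not bad": "mediocre",
    "not good": "average",
    "not great": "average",
    "not terrible": "decent",
}


def replace_negated_phrases(sentence: str, replacements: Dict[str, str]) -> str:
    """Replace every listed negated phrase with its cue-free substitute."""
    if not replacements:
        return sentence
    lowered = {k.lower(): v for k, v in replacements.items()}
    # Try longer phrases first so that overlapping keys resolve predictably
    keys = sorted(lowered, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: lowered[m.group(0).lower()], sentence)


movie = "The movie was not bad, but it was not good either."
food = "The food was not great, but it was not terrible either."
print(replace_negated_phrases(movie, REPLACEMENTS))
print(replace_negated_phrases(food, REPLACEMENTS))
```

The substitution is purely literal, which is why the stranded "either" survives in both results.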

Furthermore, such replacements can reduce the problem of negation scope in NLP. Negation scope refers to the words that are affected by a negation cue. In some cases a cue affects only one word, while in other cases it affects several words or a whole clause. When a negated expression is replaced by a single word without a cue, there is no scope left to resolve, and the sentiment of the sentence is stated more explicitly.
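To illustrate what a scope looks like in practice, here is a deliberately simple baseline sketch: it treats every token after the cue, up to the next punctuation token, as being in scope. This is a rough heuristic chosen for illustration, not the annotation scheme of the corpus, where a scope can also include tokens before the cue. The function name `naive_scope` and the example sentences are assumptions.

```python
import string
from typing import List


def naive_scope(tokens: List[str], cue_index: int) -> List[int]:
    """Return the indices of the tokens after the cue, up to the next punctuation token."""
    scope = []
    for i in range(cue_index + 1, len(tokens)):
        # Stop at the first token that consists only of punctuation characters
        if all(ch in string.punctuation for ch in tokens[i]):
            break
        scope.append(i)
    return scope


# Made-up sentences for illustration
short = ["The", "movie", "was", "not", "bad", ",", "but", "it", "was", "fun", "."]
longer = ["I", "did", "not", "see", "the", "moor", "."]

print([short[i] for i in naive_scope(short, 3)])
print([longer[i] for i in naive_scope(longer, 2)])
```

In the first sentence the cue affects a single word, in the second it affects three, which matches the distinction described above.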




