# Word Sense Disambiguation and Feature Selection

# 2.2 Selecting Salient Features

In the class you saw that there are two kinds of features: 
i) collocational and 
ii) bag of words. 

In the previous problem you extracted collocational features by hand to disambiguate a word with multiple senses. Now let's find out which features matter most for disambiguating word senses on a larger corpus. You can combine collocational and bag-of-words features when performing feature selection.
Dataset: We will use the English Lexical Sample task from Senseval for this problem. The data files for this project are available here: https://bit.ly/2kKEgwx

It consists of i) a corpus file (wsd_data.xml) and ii) a dictionary (dict.xml) that describes commonly used senses for each word. Both files are in XML format. Every lexical item in the dictionary file contains multiple sense items, and each instance in the training data is annotated with the correct sense of the target word for a given context.

The file wsd_data.xml contains several <welt> tags, one for each word in the corpus. Each <welt> tag has an attribute item, whose value is “word.pos”, where “word” is the target word and “pos” is its part of speech: ‘n’, ‘v’, and ‘a’ stand for noun, verb, and adjective, respectively. Each <welt> tag has several <instance> tags, each of which is one instance of the word named by the parent <welt> tag. Each <instance> tag has an id attribute and contains one or more <ans> tags and a <context> tag. Every <ans> tag has two attributes, instance and senseid. The senseid attribute identifies the correct sense from the dictionary for the target word in the current context. The special value “U” indicates that the correct sense is unclear. You can discard such instances from your feature extraction process for this assignment (we keep these cases so that you can look at them and think about how they could be used in real-world applications).
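As a rough illustration of the layout described above, the following sketch walks a small, invented XML fragment with the standard library and keeps only instances whose sense is clear. The `<corpus>` root tag, the ids, the sense ids, and the sentences are assumptions made up for this example, and the real corpus file may not parse as strict XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment following the described layout; the root tag,
# ids, sense ids and sentences are invented for illustration.
SAMPLE_XML = """
<corpus>
  <welt item="bank.n">
    <instance id="bank.n.001">
      <ans instance="bank.n.001" senseid="bank%1"/>
      <context>They walked along the <head>bank</head> of the river .</context>
    </instance>
    <instance id="bank.n.002">
      <ans instance="bank.n.002" senseid="U"/>
      <context>The <head>bank</head> was closed .</context>
    </instance>
  </welt>
</corpus>
"""


def labelled_instances(xml_text):
    """Yield (item, instance_id, senseid) for instances with a clear sense."""
    root = ET.fromstring(xml_text.strip())
    for welt in root.iter("welt"):
        item = welt.get("item")
        for instance in welt.iter("instance"):
            ans = instance.find("ans")  # only the first <ans> is considered
            if ans is None or ans.get("senseid") == "U":
                continue  # skip instances whose sense is unclear
            yield item, instance.get("id"), ans.get("senseid")


rows = list(labelled_instances(SAMPLE_XML))
```

Only the first instance survives here, because the second one carries the special sense id “U”.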

A <context> tag contains:
prev-context <head> target-word </head> next-context
1. prev-context is the actual text given before the target word.
2. head is the actual appearance of the target word. Note that it may be a morphological variant of the target word. For example, the word “begin.v” could show up as “beginning” instead of “begin” (lemma).
3. next-context is the actual text that follows the target word.
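The three parts of a context can be pulled apart with a small helper like the one below. This is a sketch only: it assumes the target word is wrapped as `<head>...</head>`, it uses just the first head it finds, and the function name `split_context` and the sample sentence are hypothetical.

```python
import re


def split_context(context_markup):
    """Split 'prev <head>target</head> next' into its three parts.

    Only the first <head> element is used; returns None if there is none.
    """
    match = re.search(r"<head>(.*?)</head>", context_markup, flags=re.DOTALL)
    if match is None:
        return None
    prev_context = context_markup[:match.start()].strip()
    head = match.group(1).strip()
    next_context = context_markup[match.end():].strip()
    return prev_context, head, next_context


# Hypothetical context string, not taken from the corpus
sample = "The show was only <head>beginning</head> when the lights failed ."
prev_context, head, next_context = split_context(sample)
```

Note that `head` holds the surface form ("beginning"), not the lemma, so any feature that needs the lemma has to take it from the item attribute instead.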

The dictionary file contains a gloss field for every sense item, giving the corresponding definition. Each gloss consists of commonly used definitions delimited by semicolons, and may also include several real usage examples, each wrapped in quotation marks and likewise delimited by semicolons.
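A gloss in this format can be separated into definitions and examples with a few lines of string handling. The sketch below assumes straight double quotes around examples and no semicolons inside an example; the helper name `parse_gloss` and the sample gloss are invented for illustration.

```python
def parse_gloss(gloss):
    """Separate a gloss into definitions and quoted usage examples."""
    definitions, examples = [], []
    for part in gloss.split(";"):
        part = part.strip()
        if not part:
            continue
        # A part fully wrapped in double quotes is treated as an example
        if len(part) >= 2 and part[0] == '"' and part[-1] == '"':
            examples.append(part[1:-1].strip())
        else:
            definitions.append(part)
    return definitions, examples


# Hypothetical gloss, written for illustration only
gloss = 'set in motion; cause to start; "begin a new chapter"; "the meeting began at noon"'
definitions, examples = parse_gloss(gloss)
```

The words in the definitions and examples could then serve as extra bag-of-words evidence for a sense, which is one possible direction for the "new type of feature" task.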

# 2.2.1 Feature extraction
Your first task is to extract features from the aforementioned corpus.   
(1) Start with bag-of-words features and collocation features (define your own window size, see Hints below).  
(2) Design a new type of feature. Submit the code and output for both.  

In [83]:
import glob
import os
import re
import pandas as pd
from bs4 import BeautifulSoup
import sys
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from collections import Counter
from sklearn.feature_selection import SelectKBest, chi2

In [84]:
def get_relavant_text_and_pos_tags(text, focus_elements, window_size, y_label):
    
#     print(text)
    # Get word tokens
    word_tokens = word_tokenize(text)

    #Get POS tags
    pos_tags = pos_tag(word_tokens)
    
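    # Locate each <head> occurrence among the tokens
    i = 0      # token index
    head = 0   # index into focus_elements

    head_index_list = []
    # Stop at the end of the tokens so an unmatched head cannot cause an IndexError
    while head < len(focus_elements) and i < len(word_tokens):
        if word_tokens[i] == focus_elements[head].text:
            # Keep the head only if a full window fits on both sides
            if window_size <= i < len(word_tokens) - window_size:
                head_index_list.append((head, i))
            # Move on to the next head even when this one is skipped
            head = head + 1
        i = i + 1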
    
    result = []
    for head, index in head_index_list:
        result_dict = {}
        
        previous_context_words = word_tokens[index-window_size:index]
        previous_context_tags = pos_tags[index-window_size:index]
        previous_context_bigrams = previous_context_words[-1] + " " + word_tokens[index]
        
        next_context_words = word_tokens[index+1:index+window_size+1]
        next_context_tags = pos_tags[index+1:index+window_size+1]
        next_context_bigrams = word_tokens[index] + " " + next_context_words[0]
        
        result_dict['text'] = word_tokens[index-window_size:index+window_size+1]
        result_dict['previous_context'] = {
            'words': previous_context_words,
            'tags': previous_context_tags,
            'bigrams':previous_context_bigrams
        }
        
        result_dict['head'] = {
            'words': word_tokens[index],
            'tags': pos_tags[index]
        }
        
        result_dict['next_context'] = {
            'words': next_context_words,
            'tags': next_context_tags,
            'bigrams':next_context_bigrams
        }
        
        result_dict['y'] = y_label
        
        result.append(result_dict)
        
    return result

In [85]:
def load_wsd_data(input_file):
    result = {}
    # read file
    with open(input_file) as file_object:
        content = file_object.read()

    # clean the content: turn line breaks into spaces so words are not glued together
    content = re.sub(r'\n+', ' ', content)

    # parse content
    soup = BeautifulSoup(content, 'html.parser')
    
    # iterate through all welt elements
    for welt in soup.find_all('welt'):
        parsed_result_list = []
        item_name = welt.get('item')
        print("Parsing welt: {}".format(item_name))
        # visit only <instance> tags, skipping any whitespace text nodes
        for instance in welt.find_all('instance'):
            # Getting only the first sense id
            senseid = instance.find_all('ans')[0].get('senseid')
            
            # Ignoring confusing tags
            if senseid != "U":
                # Get the label
                y = item_name + '_' + senseid

                # Get context
                context = instance.find('context')

                # Get focus elements
                focus_elements = instance.findAll('head')
                parsed_result = get_relavant_text_and_pos_tags(context.text, focus_elements, 2, y)
                parsed_result_list.extend(parsed_result)
                
        result[item_name] = parsed_result_list
                
    #             break
#         break
    return result

In [86]:
def create_features(instance_list, windows_size):
    bow_tokens = []
    bigrams_tokens = []
    
    # Get all vocab and bigrams (extra feature)
    for i in range(len(instance_list)):
        instance = instance_list[i]
#         print(instance)
        bow_tokens.extend(instance['previous_context']['words'])
        bow_tokens.extend(instance['next_context']['words'])
        
        bigrams_tokens.append(instance['previous_context']['bigrams'])
        bigrams_tokens.append(instance['next_context']['bigrams'])
        
    bow = Counter(bow_tokens)
    bigrams = Counter(bigrams_tokens)
    
    # Get collocation features names
    collocation_features_names = []
    for i in range(1, windows_size+1):
        collocation_features_names.append('pos-'+str(i))
        collocation_features_names.append('pos'+str(i))
    
    # Create column names
    column_names = list(bow.keys()) + list(bigrams.keys()) + collocation_features_names + ['y']
    
    print("Create dataframe: {}".format(len(instance_list)))
    # Create dataframe
    df = pd.DataFrame(columns=column_names)
    
    # Populate dataframe
    for i in range(len(instance_list)):
        instance = instance_list[i]
        df.loc[i, column_names] = [0] * len(column_names)
        # Populate the bag of word tokens
        temp_bow_tokens = instance['previous_context']['words'] + instance['next_context']['words']
        for temp_bow_token in temp_bow_tokens:
            df.loc[i, temp_bow_token] = 1
        
        # Populate the bigrams tokens
        temp_bigrams_tokens = [instance['previous_context']['bigrams']] + [instance['next_context']['bigrams']]
        for temp_bigram_token in temp_bigrams_tokens:
            df.loc[i, temp_bigram_token] = 1
        
        # Populate the pos tags
        for j in range(1, windows_size+1):
            df.loc[i, 'pos-'+str(j)] = instance['previous_context']['tags'][-j][1]
            df.loc[i, 'pos'+str(j)] = instance['next_context']['tags'][j-1][1]
        
        # Set y label
        df.loc[i, 'y'] = instance['y']
    
    print("Hot encode dataframe")
    # One-hot encode the categorical POS-tag columns
    df = pd.get_dummies(df, columns=collocation_features_names, prefix=collocation_features_names)
    return df

In [75]:
# dict_data = load_data('../input/WSD/dict.xml')
# wsd_data = load_wsd_data('../input/WSD/wsd_data_custom.xml')
wsd_data = load_wsd_data('../input/WSD/wsd_data.xml')

Parsing welt: activate.v
Parsing welt: add.v
Parsing welt: appear.v
Parsing welt: argument.n
Parsing welt: arm.n
Parsing welt: ask.v
Parsing welt: atmosphere.n
Parsing welt: audience.n
Parsing welt: bank.n
Parsing welt: begin.v
Parsing welt: climb.v
Parsing welt: decide.v
Parsing welt: degree.n
Parsing welt: difference.n
Parsing welt: different.a
Parsing welt: difficulty.n
Parsing welt: disc.n
Parsing welt: eat.v
Parsing welt: encounter.v
Parsing welt: expect.v
Parsing welt: express.v
Parsing welt: hear.v
Parsing welt: hot.a
Parsing welt: image.n
Parsing welt: important.a
Parsing welt: interest.n
Parsing welt: judgment.n
Parsing welt: lose.v
Parsing welt: mean.v
Parsing welt: miss.v
Parsing welt: note.v
Parsing welt: operate.v
Parsing welt: organization.n
Parsing welt: paper.n
Parsing welt: party.n
Parsing welt: performance.n
Parsing welt: plan.n
Parsing welt: play.v
Parsing welt: produce.v
Parsing welt: provide.v
Parsing welt: receive.v
Parsing welt: remain.v
Parsing welt: rule.v
Pars

Additional feature: bigrams that include the head word (the head with its preceding token, and the head with its following token).  
  
Window size: +/-2. A smaller window yields fewer features to encode, so the feature matrix is less sparse than it would be with a very large window. In most cases +/-2 should be sufficient as a starting point.  
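To make the effect of the window size concrete, here is a minimal sketch of positional word features around a target token. It is not the notebook's implementation: the helper `window_features`, the `<pad>` placeholder, and the sample sentence are hypothetical, and unlike the code above (which skips heads that lack a full window) it pads positions that fall outside the sentence.

```python
def window_features(tokens, index, window_size, pad="<pad>"):
    """Return positional word features around tokens[index]."""
    features = {}
    for offset in range(1, window_size + 1):
        left = index - offset
        right = index + offset
        # Positions outside the sentence are filled with the pad symbol
        features["w-{}".format(offset)] = tokens[left] if left >= 0 else pad
        features["w+{}".format(offset)] = tokens[right] if right < len(tokens) else pad
    return features


# Hypothetical sentence, not taken from the corpus
tokens = ["she", "sat", "on", "the", "bank", "of", "the", "river"]
features = window_features(tokens, tokens.index("bank"), 2)
```

Each instance contributes `2 * window_size` positional slots, so widening the window multiplies the number of distinct (position, word) pairs that must be encoded, which is where the extra sparsity comes from.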

# 2.2.2 Feature selection

Now, with the extracted features, perform feature selection to list the top 10 features that would be most important to disambiguate a word sense. (1) Design your own feature selection algorithm and explain the intuition. (2) List the top 10 features in your answer key and also provide your code for this task. Submit the code and output for both.

You can use the following resources to read about ways to perform feature selection:
https://scikit-learn.org/stable/modules/feature_selection.html
https://www.datacamp.com/community/tutorials/feature-selection-python

In [89]:
def select_top_k_fetures(x, y, k):
    # Create and fit selector
    # Pass k by keyword; recent scikit-learn versions reject it as a positional argument
    selector = SelectKBest(score_func=chi2, k=k)
    selector.fit(x, y)
    
    # Get columns(index) to keep and map to corresponding column names
    column_indices = selector.get_support(indices=True)
    selected_column_names = []
    for column_index in column_indices:
        selected_column_names.append(x.columns[column_index])

    # Create new dataframe with only desired columns
    x_new = x.loc[:, selected_column_names]
    return x_new

In [None]:
features_output_file = open('../output/features.txt', 'w')
features_select_output_file = open('../output/features_select.txt', 'w')
for key in wsd_data.keys():
    print(key)
    # Create features and feature vectors
    print("Creating features")
    df = create_features(wsd_data[key], 2)
    x = df[df.columns.difference(['y'])]
    y = df[['y']]
    features_output_file.write('{}: '.format(key))
    for column in x.columns:
         features_output_file.write('{},'.format(column))
    features_output_file.write('\n\n')
    
    # Store feature vector and corresponding y as a csv file
    x.to_csv('../output/'+key+'_x.csv', index=False)
    y.to_csv('../output/'+key+'_y.csv', index=False)
    
    # Select top k features
    print("Selecting features")
    x_new = select_top_k_fetures(x, y, 10)
    features_select_output_file.write('{}: {}\n\n'.format(key, x_new.columns))
    x_new.to_csv('../output/'+key+'_fs_x.csv', index=False)
    
features_output_file.close()
features_select_output_file.close()

activate.v
Creating features
Create dataframe: 230
Hot encode dataframe
Selecting features
add.v
Creating features
Create dataframe: 264
Hot encode dataframe
Selecting features
appear.v
Creating features
Create dataframe: 267


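To select features, I use the SelectKBest class from scikit-learn, which takes a scoring function as an argument. Here I chose chi2 as the scoring function: it measures the dependence between each non-negative feature and the class label, so features that occur independently of the sense receive low scores and are dropped.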
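The intuition can be checked on a toy example. The sketch below is an assumption-laden illustration, not output from the assignment data: the feature names, the two sense labels, and the tiny binary matrix are all invented. Two features line up perfectly with the sense, while the other two occur about equally often under both senses.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical binary bag-of-words features for six instances of "bank"
feature_names = ["river", "money", "the", "of"]
X = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
])
# Invented sense labels: first three instances are one sense, last three the other
y = np.array(["bank_shore"] * 3 + ["bank_finance"] * 3)

selector = SelectKBest(score_func=chi2, k=2)
selector.fit(X, y)

# Names of the k features that were kept
selected = [feature_names[i] for i in selector.get_support(indices=True)]

# All features ranked by their chi-squared score, highest first
ranked = sorted(zip(feature_names, selector.scores_),
                key=lambda pair: pair[1], reverse=True)
```

Here `selected` contains the two sense-specific words, and the function words end up at the bottom of `ranked` because their counts barely differ between the two senses.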