<H1> Argumentative Unit Segmentation </H1>
<ul>
    <li>
        <b>Objective:</b> To implement a token-based argumentative unit classifier.
    </li>
    <li>
        <b>Tutors</b>
        <ul>
            <li>Denis Kuchelev</li>
            <li>Enri Ozuni</li>
            <li>Nikit Srivastava</li>
        </ul>
    </li>
</ul>

<h3>Task Description</h3>

In this tutorial, we train classifiers on argument-annotated data so that they can label each token of an argumentative document as belonging to an argumentative unit or to non-argumentative text. We then measure the performance of the classifiers.

<h3>Reference</h3>

Inspired by the work of <a href="https://aclweb.org/anthology/papers/W/W17/W17-5115/">Ajjour et al.</a>
<ul>
    <li><b>Paper:</b> <a href="https://pdfs.semanticscholar.org/7494/ca6484b1e63a1e92a37299d64e52aabd63c9.pdf">Unit Segmentation of Argumentative Texts</a></li>
    <li><b>Published:</b> Copenhagen, Denmark, September 8, 2017</li>
</ul>

<h3>Example</h3>

<b>Input Text:</b>
The last 50 years have seen a significant increase in the number of tourist travelling worldwide. While some might think the tourism bring large profit for the destination countries, I would contend that <mark>this industry has affected the cultural attributes and damaged the natural environment of the tourist destinations</mark>.
Firstly, it is an undeniable fact that <mark>tourists from different cultures will probably cause changes to the cultural identity of the tourist destinations</mark>.


<b>Argumentative Units:</b>
1. this industry has affected the cultural attributes and damaged the natural environment of the tourist destinations
2. tourists from different cultures will probably cause changes to the cultural identity of the tourist destinations

<h3>Action Plan</h3>

<b>Steps to Argumentative Unit Segmentation:</b>
1. Load the annotated data
2. Perform IOB-Tagging for tokens
3. Prepare feature extraction for each token
4. Split dataset into training and test
5. Select and train the classifier
6. Perform predictions on the test set
7. Evaluate the model

<h3>Required Libraries</h3>


<code>pip install python-crfsuite</code><br>
<code>pip install nltk</code><br>
<code>pip install scikit-learn</code><br>
<code>pip install gensim</code><br>
<code>pip install stanfordnlp</code><br>
<code>pip install numpy</code><br>
<code>pprint</code> is part of the Python standard library and needs no installation.

<h3>Prepare Word Embeddings model</h3>

In [1]:
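from gensim.models import KeyedVectors
# glove2word2vec converts a raw GloVe file to word2vec text format; it is not called in this cell
from gensim.scripts.glove2word2vec import glove2word2vec

# Path to GloVe vectors in word2vec text format (adjust to your local copy)
pathToGloveModel = '/Users/nikitsrivastava/Downloads/glove.6B/gensim-glove.6B.300d.txt'
w2vModel = KeyedVectors.load_word2vec_format(pathToGloveModel, binary=False)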

# Path to google news word2vec model
# pathToW2VModel = '/Users/nikitsrivastava/Downloads/GoogleNews-vectors-negative300.bin.gz'
# w2vModel = KeyedVectors.load_word2vec_format(pathToW2VModel, binary=True)
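
The file name above suggests that the GloVe vectors were already converted to word2vec text format, presumably with the glove2word2vec helper imported in this cell. The two text formats differ only in a leading header line of the form "vocab_size dimensions". The following is a minimal standard-library sketch of that conversion; the function name add_word2vec_header and the toy vectors are made up for illustration and are not part of gensim or of this tutorial's main.py.

```python
import os
import tempfile


def add_word2vec_header(glove_path, output_path):
    """Copy a GloVe text file and prepend the 'vocab_size dimensions' header."""
    vocab_size, dimensions = 0, 0
    # First pass: count the vectors and read the dimensionality from line one.
    with open(glove_path, encoding="utf-8") as src:
        for line in src:
            if vocab_size == 0:
                # Each line is "word v1 v2 ... vN", so N = fields - 1.
                dimensions = len(line.rstrip().split(" ")) - 1
            vocab_size += 1
    # Second pass: write the header, then copy the vectors unchanged.
    with open(glove_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        dst.write(f"{vocab_size} {dimensions}\n")
        for line in src:
            dst.write(line)
    return vocab_size, dimensions


# Tiny made-up vectors, only to demonstrate the format change.
workdir = tempfile.mkdtemp()
glove_file = os.path.join(workdir, "toy-glove.txt")
w2v_file = os.path.join(workdir, "toy-w2v.txt")
with open(glove_file, "w", encoding="utf-8") as f:
    f.write("house 0.1 0.2 0.3\nhome 0.2 0.1 0.3\n")

shape = add_word2vec_header(glove_file, w2v_file)
with open(w2v_file, encoding="utf-8") as f:
    header = f.readline().strip()
print(shape, header)
```

A file produced this way has the header that a loader for word2vec text format expects before the vectors.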

In [2]:
# check if word embeddings model is working
w2vModel.similarity("house","home")

0.5005335110285759
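
The similarity score above is the cosine similarity between the two word vectors. As a sketch of what that measure computes, here it is in plain Python on two made-up three-dimensional vectors (the real GloVe vectors used in this notebook have 300 dimensions):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors for illustration only.
house = [1.0, 2.0, 0.0]
home = [2.0, 1.0, 0.0]
print(round(cosine_similarity(house, home), 2))  # 4 / (sqrt(5) * sqrt(5)) = 0.8
```

Values close to 1 indicate vectors pointing in nearly the same direction, values near 0 indicate unrelated directions.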

<h3>Import the Model Implementation from "main.py"</h3>

In [3]:
import importlib
import main
importlib.reload(main)
from main import SvmModel
argAnnModel = SvmModel(w2vModel)

<h3>Step 1: Load the annotated data</h3>

In [4]:
argAnnModel.load_data("./data")
print("Data loaded successfully!")

Data loaded successfully!


In [5]:
argAnnModel.sample_loaded_data()

{'annotations': [[1699, 1770],
                 [1847, 1949],
                 [1962, 2043],
                 [343, 439],
                 [457, 598],
                 [600, 722],
                 [737, 879],
                 [887, 1025],
                 [1038, 1146],
                 [1207, 1321],
                 [1329, 1415],
                 [1426, 1673]],
 'tokens': [('Government', 0, 10),
            ('should', 11, 17),
            ('be', 18, 20),
            ('responsible', 21, 32),
            ('for', 33, 36),
            ('education', 37, 46),
            ('and', 47, 50),
            ('health', 51, 57),
            ('care', 58, 62),
            ('or', 63, 65),
            ('not', 66, 69),
            ('?', 69, 70),
            ('Despite', 72, 79),
            ('the', 80, 83),
            ('development', 84, 95),
            ('of', 96, 98),
            ('modern', 99, 105),
            ('society', 106, 113),
            (',', 113, 114),
            ('the', 115, 118),
          

<h3>Step 2: Perform IOB-Tagging for tokens</h3>

In [6]:
argAnnModel.prepare_labels()
print("IOB-Tagging done successfully!")

IOB-Tagging done successfully!
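
The implementation of prepare_labels lives in main.py and is not shown here. As a rough sketch of the idea, the loaded data pairs each token with its character offsets and lists the annotated units as [start, end] offsets, so a token can be tagged B (first token of a unit), I (inside a unit) or O (outside any unit). The function iob_tags and the toy sentence are hypothetical, and the sketch assumes end offsets are exclusive, as the token offsets in the sample suggest.

```python
def iob_tags(tokens, annotations):
    """Assign B/I/O labels to (text, start, end) tokens from [start, end] spans."""
    # The sample data lists annotations out of order, so sort them first.
    spans = sorted((start, end) for start, end in annotations)
    labels = []
    previous_span = None
    for _text, tok_start, tok_end in tokens:
        current_span = None
        for span in spans:
            # A token belongs to a span when it lies fully inside it.
            if span[0] <= tok_start and tok_end <= span[1]:
                current_span = span
                break
        if current_span is None:
            labels.append("O")
        elif current_span == previous_span:
            labels.append("I")
        else:
            labels.append("B")
        previous_span = current_span
    return labels


# Toy text: "Still, fees help. Taxes hurt."
tokens = [("Still", 0, 5), (",", 5, 6), ("fees", 7, 11), ("help", 12, 16),
          (".", 16, 17), ("Taxes", 18, 23), ("hurt", 24, 28), (".", 28, 29)]
annotations = [[18, 28], [7, 16]]
labels = iob_tags(tokens, annotations)
print(list(zip([t[0] for t in tokens], labels)))
```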


In [7]:
argAnnModel.sample_labels()

[('GOVERNMENT', 'O'),
 ('AND', 'O'),
 ('EDUCATION', 'O'),
 ('Primary', 'O'),
 ('and', 'O'),
 ('secondary', 'O'),
 ('education', 'O'),
 ('provide', 'O'),
 ('a', 'O'),
 ('basic', 'O'),
 ('knowledge', 'O'),
 ('and', 'O'),
 ('skills', 'O'),
 ('to', 'O'),
 ('the', 'O'),
 ('people', 'O'),
 ('in', 'O'),
 ('this', 'O'),
 ('world', 'O'),
 ('.', 'O'),
 ('I', 'O'),
 ('completely', 'O'),
 ('agree', 'O'),
 ('that', 'O'),
 ('government', 'O'),
 ('should', 'O'),
 ('provide', 'O'),
 ('those', 'O'),
 ('education', 'O'),
 ('to', 'O'),
 ('their', 'O'),
 ('people', 'O'),
 (';', 'O'),
 ('however', 'O'),
 (',', 'O'),
 ('for', 'B'),
 ('the', 'I'),
 ('more', 'I'),
 ('higher', 'I'),
 ('education', 'I'),
 ('like', 'I'),
 ('university', 'I'),
 (',', 'I'),
 ('the', 'I'),
 ('students', 'I'),
 ('or', 'I'),
 ('their', 'I'),
 ('parents', 'I'),
 ('should', 'I'),
 ('pay', 'I'),
 ('for', 'I'),
 ('the', 'I'),
 ('fees', 'I'),
 ('.', 'O'),
 ('Some', 'B'),
 ('parents', 'I'),
 ('might', 'I'),
 ('not', 'I'),
 ('have', 'I'),
 

<h3>Step 3: Prepare feature extraction for each token</h3>

In [8]:
argAnnModel.extract_features()
print("Feature extraction done successfully!")

Feature extraction done successfully!
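
What extract_features computes is defined in main.py and not shown in this notebook; the feature group names used later ("length", "bow", "pos", "w2v") hint at token length, bag-of-words, part-of-speech and embedding features. The sketch below illustrates only the first two, plus a one-token context window, using the standard library. The function token_features and its feature keys are assumptions made for illustration.

```python
def token_features(tokens, index):
    """Build a feature dictionary for the token at the given position."""
    token = tokens[index]
    features = {
        "length": len(token),          # number of characters
        "bow": token.lower(),          # lowercased surface form
        "is_title": token.istitle(),   # capitalised, e.g. sentence start
        "is_punct": not any(ch.isalnum() for ch in token),
    }
    # Neighbouring tokens give the classifier some local context.
    features["prev_bow"] = tokens[index - 1].lower() if index > 0 else "<START>"
    features["next_bow"] = (
        tokens[index + 1].lower() if index < len(tokens) - 1 else "<END>"
    )
    return features


tokens = ["However", ",", "fees", "help", "."]
features = [token_features(tokens, i) for i in range(len(tokens))]
print(features[0])
```

Dictionaries of this shape can be turned into numeric vectors before training; part-of-speech tags and embedding values would be added as further keys.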


In [None]:
#argAnnModel.sample_features()

<h3>Step 4: Split dataset into training and test</h3>

In [None]:
argAnnModel.split_dataset()
print("Dataset split done successfully!")

Dataset split done successfully!


In [None]:
print("Training dataset document count: ",len(argAnnModel.train_data))
print("Test dataset document count: ",len(argAnnModel.test_data))

Training dataset document count:  64
Test dataset document count:  16
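
The counts above are consistent with an 80/20 split over 80 documents. How split_dataset selects the documents is not shown, so the following is only a sketch of one common approach: shuffle the documents with a fixed seed and cut at the document level, which keeps all tokens of one essay on the same side of the split. The function split_documents, the seed and the document names are hypothetical.

```python
import random


def split_documents(documents, test_fraction=0.2, seed=42):
    """Shuffle documents reproducibly and return (train, test) lists."""
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)
    test_size = round(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


documents = [f"essay_{i:02d}" for i in range(80)]
train_docs, test_docs = split_documents(documents)
print(len(train_docs), len(test_docs))  # 64 16
```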


<h3>Step 5: Select and train the classifier</h3>

In [None]:
argAnnModel.train(["length", "bow", "pos", "w2v"])
print("Classifier trained successfully!")
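
The class name SvmModel suggests a support vector machine, but the training code itself is in main.py and not shown. As a hedged sketch of how token feature dictionaries could be fed to a linear SVM with scikit-learn, the example below combines DictVectorizer and LinearSVC in a pipeline. The feature keys and the five training tokens are made up, and this is not claimed to be the configuration used by SvmModel.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up token features and IOB labels, for illustration only.
train_features = [
    {"bow": "however", "length": 7},
    {"bow": ",", "length": 1},
    {"bow": "fees", "length": 4},
    {"bow": "help", "length": 4},
    {"bow": ".", "length": 1},
]
train_labels = ["O", "O", "B", "I", "O"]

# DictVectorizer one-hot encodes string values and passes numbers through.
classifier = make_pipeline(DictVectorizer(), LinearSVC())
classifier.fit(train_features, train_labels)

predicted = list(classifier.predict([{"bow": "fees", "length": 4}]))
print(predicted)
```

With real data, each token of the training documents contributes one feature dictionary and one label.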

<h3>Step 6: Perform predictions on the test set</h3>

In [None]:
argAnnModel.predict()
print("Predictions completed!")

In [None]:
argAnnModel.sample_prediction()
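
Predicted IOB tags can be turned back into argumentative units by grouping each B token with the I tokens that follow it. The helper below is a sketch of that decoding step; units_from_iob is a hypothetical name, and treating an I tag without a preceding B as the start of a new unit is a design choice made here, not something taken from main.py.

```python
def units_from_iob(tokens, labels):
    """Group tokens into units: each unit is a B token plus following I tokens."""
    units, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B" or (label == "I" and not current):
            # Start a new unit, closing any unit that is still open.
            if current:
                units.append(" ".join(current))
            current = [token]
        elif label == "I":
            current.append(token)
        else:
            # An O tag ends the current unit, if there is one.
            if current:
                units.append(" ".join(current))
            current = []
    if current:
        units.append(" ".join(current))
    return units


tokens = ["Still", ",", "fees", "help", ".", "Taxes", "hurt", "."]
labels = ["O", "O", "B", "I", "O", "B", "I", "O"]
units = units_from_iob(tokens, labels)
print(units)  # ['fees help', 'Taxes hurt']
```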

<h3>Step 7: Evaluate the model</h3>

In [None]:
argAnnModel.printEval(argAnnModel.evaluate())
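
The metrics reported by evaluate are not shown in this notebook. A common way to score a token-level tagger is per-label precision, recall and F1 together with overall accuracy. The sketch below computes these with the standard library; evaluate_tags and the toy tag sequences are assumptions made for illustration.

```python
def evaluate_tags(gold, predicted, labels=("B", "I", "O")):
    """Return per-label precision/recall/F1 and overall token accuracy."""
    pairs = list(zip(gold, predicted))
    report = {}
    for label in labels:
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(1 for g, p in pairs if g == p) / len(pairs)
    return report, accuracy


gold = ["O", "O", "B", "I", "O", "B", "I", "O"]
pred = ["O", "O", "B", "I", "I", "B", "I", "O"]
report, accuracy = evaluate_tags(gold, pred)
print(accuracy)  # 7 of 8 tokens correct: 0.875
print(report["I"])
```

In this toy case the single mistake (an O token tagged as I) lowers the precision of I and the recall of O, while B is unaffected.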