# Word Clustering

In this notebook we will evaluate different word representations using clustering techniques. We will compare a standard representation for the task of Argument Mining, developed by Stab and Gurevych (2016), with several unsupervised representations obtained with the algorithms word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014).

## Dataset

We will analyze a dataset of texts from the European Court of Human Rights (ECHR). We downloaded all the English documents available on their website, https://hudoc.echr.coe.int, using this [web scraper](https://github.com/MIREL-UNC/echr_dataset).


### Preprocess

All the word representations require some common preprocessing steps. In a first preprocessing step we create, from the raw text, an intermediate representation of each document called [UnlabeledDocument](https://github.com/mit0110/argument_mining/blob/master/preprocess/annotated_documents.py#L10).

To build the `UnlabeledDocument` we apply the following steps:
  1. Split the raw text into paragraphs on newline characters.
  2. Divide each paragraph into sentences with nltk's `sent_tokenize`.
  3. For each sentence:
      1. Tokenize it using nltk's `word_tokenize`.
      2. Apply the Stanford `LexicalizedStanfordParser` to it and store the resulting tree.
      3. Obtain the PoS tag sequence.
  4. Store the result.
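
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code of the actual preprocessing module: the `preprocess` function and its nested-list output are hypothetical, the two regex tokenizers are naive stand-ins so the example runs without extra downloads, and the `parse` and `pos_tag` arguments are placeholders for the parser and tagger.

```python
import re


def simple_sent_tokenize(paragraph):
    # Naive stand-in (assumption): split after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def simple_word_tokenize(sentence):
    # Naive stand-in (assumption): runs of word characters or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)


def preprocess(raw_text, sent_tokenize=simple_sent_tokenize,
               word_tokenize=simple_word_tokenize, parse=None, pos_tag=None):
    """Return a list of paragraphs, each a list of per-sentence dictionaries."""
    document = []
    # Step 1: split paragraphs on newline characters, skipping empty lines.
    for paragraph in raw_text.split("\n"):
        if not paragraph.strip():
            continue
        sentences = []
        # Step 2: divide the paragraph into sentences.
        for sentence in sent_tokenize(paragraph):
            # Step 3: tokenize, then parse and tag if callables were provided.
            tokens = word_tokenize(sentence)
            sentences.append({
                "tokens": tokens,
                "tree": parse(tokens) if parse else None,
                "pos": pos_tag(tokens) if pos_tag else None,
            })
        # Step 4: store the result.
        document.append(sentences)
    return document


text = ("The Court finds the application admissible. It rejects the claim.\n"
        "\n"
        "The applicant appealed.")
document = preprocess(text)
print(len(document), "paragraphs;", len(document[0]), "sentences in the first one")
```

With nltk installed and its tokenizer and tagger data downloaded, `nltk.sent_tokenize`, `nltk.word_tokenize` and `nltk.pos_tag` can be passed in place of the stand-ins.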

The `UnlabeledDocument` class allows us to store several characteristics of the documents:
  * Paragraph and sentence structure.
  * Parse trees.
  * Absolute and relative position of the words
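
To make the stored information more concrete, here is a small sketch of a container that keeps the paragraph and sentence structure and derives word positions from it. The `DocumentSketch` class is hypothetical and is not the real `UnlabeledDocument`. It also assumes that the absolute position is the index of the word in the document and the relative position is that index divided by the total number of words; the actual class may define them differently.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DocumentSketch:
    """Hypothetical container: paragraphs -> sentences -> tokens."""
    paragraphs: List[List[List[str]]] = field(default_factory=list)

    def word_positions(self):
        # Flatten the nested structure, remembering where each word came from.
        words = [
            (p, s, token)
            for p, paragraph in enumerate(self.paragraphs)
            for s, sentence in enumerate(paragraph)
            for token in sentence
        ]
        total = len(words)
        return [
            {
                "word": token,
                "paragraph": p,           # index of the paragraph in the document
                "sentence": s,            # index of the sentence in its paragraph
                "absolute": i,            # assumed: index of the word in the document
                "relative": i / total,    # assumed: absolute position over document length
            }
            for i, (p, s, token) in enumerate(words)
        ]


doc = DocumentSketch([[["The", "Court", "finds"]], [["It", "rejects"]]])
positions = doc.word_positions()
print(positions[3])
```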


## Word representation

### Co-occurrence matrix

To represent each of our instances, i.e. words, the simplest solution is to use the other words in a context window of fixed size. The resulting word vector counts the words that appear, for example, within two positions to the left or to the right of the target word. This representation is called a word co-occurrence matrix.
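
As an illustration of this representation, the following sketch counts co-occurrences within a symmetric window of two positions over tokenized sentences. The function names `cooccurrence_counts` and `to_matrix` and the example sentence are made up for this example, and the window is assumed not to cross sentence boundaries.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(sentences, window=2):
    """For each target word, count the words within `window` positions of it."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, target in enumerate(tokens):
            row = counts[target]  # creates the row even if the word has no context
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    row[tokens[j]] += 1
    return counts


def to_matrix(counts):
    """Turn the nested counts into a vocabulary list and a dense square matrix."""
    vocab = sorted(counts)
    index = {word: k for k, word in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in vocab]
    for target, context in counts.items():
        for word, count in context.items():
            matrix[index[target]][index[word]] = count
    return vocab, matrix


sentences = [["the", "court", "finds", "the", "application", "admissible"]]
counts = cooccurrence_counts(sentences, window=2)
vocab, matrix = to_matrix(counts)
for word, row in zip(vocab, matrix):
    print(f"{word:12s} {row}")
```

Each row of the matrix is the vector for one word. With a real corpus the vocabulary is large and most entries are zero, so a sparse matrix would be a more practical choice than the nested lists used here.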

To represent each of our instances, i.e. words, we will first use a set of handcrafted features that have been shown to improve classification in the detection of argumentative sentences.

## References

Stab, Christian and Gurevych, Iryna. (2016). Parsing Argumentation Structures in Persuasive Essays. Computational Linguistics, 43. doi:10.1162/COLI_a_00295.

Mikolov, Tomas et al. (2013). Efficient Estimation of Word Representations in Vector Space. CoRR abs/1301.3781.

Pennington, Jeffrey, Socher, Richard and Manning, Christopher D. (2014). GloVe: Global Vectors for Word Representation.