# TP 2

First of all, install the `collatex` and `levenshtein` packages in your Python environment.

(For the Levenshtein distance, see: [Wikipedia: Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance).)

In [None]:
!pip install --upgrade collatex
!pip install levenshtein

Tell Python you want to use CollateX.

In [2]:
from collatex import *

Create an object that will contain your collation (`collation` is an arbitrary variable name).


In [3]:
collation = Collation()

Add witnesses to your `collation` object.

In [4]:
collation.add_plain_witness("A", "The quick brown fox jumped over the lazy dog.")
collation.add_plain_witness("B", "The brown fox jumped over the dog.")
collation.add_plain_witness("C", "The bad fox jumped over the lazy dog.")

Now let CollateX do the collation for you. We store the result in the variable `alignment_table`, and then we just have to `print` that variable.

In [None]:
alignment_table = collate(collation)
print(alignment_table)

The `collate` function takes several arguments.

# 1. Parameters

## 1.1. Segmentation
The `segmentation` parameter determines whether each token is output separately (`False`) or whether similar adjacent tokens are merged into a single cell (`True`, the default, as in the example above).

In the preceding example, "fox jumped over the" is merged into a single cell. To make the results easier to read, you may want to use just one word per cell.

Let's try to add the parameter `segmentation=False`.

In [None]:
alignment_table = collate(collation, segmentation=False)
print(alignment_table)

## 1.2. Near Match
The `near_match` parameter controls the behavior of CollateX Python in some situations where no exact alignment is possible because there is neither string-equality nor a forced-match environment.

Consider:

In [None]:
collation = Collation()
collation.add_plain_witness("A", "The big gray koala")
collation.add_plain_witness("B", "The grey koala")
alignment_table = collate(collation, segmentation=False)
print(alignment_table)

Because “gray” and “grey” are not string-equal, CollateX Python does not know to align them, which means that it does not know whether “grey” in Witness B should be aligned with “big” or with “gray” in witness A. In situations like this, CollateX Python always chooses the leftmost option, which means that in this case it aligns “grey” with “big”, rather than with “gray”.

Let's now try with the `near_match` parameter set to `True`:

In [None]:
alignment_table = collate(collation, segmentation=False, near_match=True)
print(alignment_table)

The "closest" alignment is defined in CollateX on the basis of the Levenshtein distance (i.e. the number of single-character modifications needed to transform word_1 into word_2).
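To make the distance concrete, here is a minimal plain-Python sketch of the Levenshtein distance applied to the koala example. The helper name `levenshtein` is hypothetical and this is not CollateX's internal code; it only illustrates why "grey" is closer to "gray" than to "big". The `levenshtein` package installed at the top of this notebook provides an optimized implementation of the same measure.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] is the distance between the prefix of a seen so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into an empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


candidates = ["big", "gray"]  # tokens of witness A that "grey" could align with
target = "grey"               # token of witness B

distances = {word: levenshtein(word, target) for word in candidates}
closest = min(candidates, key=distances.get)

print(distances)  # {'big': 4, 'gray': 1}
print(closest)    # gray
```

With a distance of 1 against 4, "gray" is the nearest candidate, which matches the alignment obtained with `near_match=True`.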

## 1.3. Layout
The `layout` parameter allows you to choose between a `vertical` and a `horizontal` alignment.

Try to display the above alignment table vertically.

## 1.4. Output
Now let's look at the different output options for your table.
By default, CollateX outputs an ASCII table.

### 1.4.1. html
1. Create a `Collation()` object
2. Add the three following manuscripts:

`"MS_A", "The quick brown fox jumped over the lazy dog."`

`"MS_B", "The brown fox jumped over the dog."`

`"MS_C", "The bad fox jumped over the lazy dog."`

3. Export your collation with the `output` parameter set to `html`

The `html2` output provides a more readable display with colored lines (cyan = same, red = variant).

### 1.4.2. Graph
Your result may also be displayed as a graph, using the `output` values `svg` and `svg_simple`.

Try both of them.

### 1.4.3. csv
You may want your results in a machine-readable format such as `csv`, XML/TEI or `json`, so that you can process them in a more complex workflow: for example, to open a very large table in Excel or Google Sheets, or to add lemmatization and morphological analysis using a `spacy` or `bert` model.

Now try the `output='csv'` parameter.

Save the CSV file to your disk:

In [16]:
# Save the CSV string to a file
with open('alignment_table.csv', 'w', encoding='utf-8') as f:
    f.write(alignment_table)
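Once the file is on disk, it can be read back with Python's standard `csv` module. The sketch below is self-contained: `sample_csv` is a made-up stand-in for the string returned by `collate`, so its column layout is an assumption and may differ from what CollateX actually produces. It writes to a temporary folder so that it does not overwrite your real file.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical stand-in for the CSV string returned by collate();
# the real column layout may differ.
sample_csv = "A,B\nThe,The\nbig,\ngray,grey\nkoala,koala\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "alignment_table.csv"

    # Write the string to disk, as in the cell above
    with open(csv_path, "w", encoding="utf-8", newline="") as f:
        f.write(sample_csv)

    # Read it back; the csv module recommends newline="" when opening files
    with open(csv_path, encoding="utf-8", newline="") as f:
        rows = list(csv.reader(f))

header, body = rows[0], rows[1:]
print(header)
for row in body:
    print(row)
```

An empty string in a row marks a gap, that is, a place where a witness has no token.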

### 1.4.4. XML / XML TEI
You may also want to export the result in XML.

(To display your XML nicely, you can use BeautifulSoup; note that its `'xml'` parser requires the `lxml` package.)

In [None]:
from bs4 import BeautifulSoup

bs = BeautifulSoup(alignment_table, 'xml')
print(bs.prettify())


And there you go! Your critical apparatus in XML TEI is done, without tying your brain in knots :-)
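If BeautifulSoup is not available, the standard library can do a similar job. The following sketch uses `xml.dom.minidom` on a small hand-written string; `sample_xml` is a hypothetical apparatus fragment, not actual CollateX output, and the element and attribute names (`app`, `rdg`, `wit`) are assumed here for illustration only.

```python
from xml.dom import minidom

# Hypothetical apparatus fragment, written by hand for this example
sample_xml = (
    '<root>The <app>'
    '<rdg wit="#A">big gray</rdg>'
    '<rdg wit="#B">grey</rdg>'
    '</app> koala</root>'
)

document = minidom.parseString(sample_xml)

# Indented display, comparable to BeautifulSoup's prettify()
print(document.toprettyxml(indent="  "))

# Collect each reading together with the witness it belongs to
readings = [
    (rdg.getAttribute("wit"), rdg.firstChild.data)
    for rdg in document.getElementsByTagName("rdg")
]
print(readings)  # [('#A', 'big gray'), ('#B', 'grey')]
```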

### 1.4.5. JSON
Finally, you can also export your alignment table in JSON format. This is the most complete output format, and therefore a common choice for further processing.

To pretty-print your `json`, run the next cell:

In [None]:
import json
parsed = json.loads(alignment_table)
print(json.dumps(parsed, indent=4))
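Once parsed, the JSON is an ordinary Python dictionary that you can walk through. The sketch below works on a hand-written sample and assumes a layout with a `witnesses` list and a `table` holding one row per witness, where each cell is either a list of token objects with a `t` (text) key or `null` for a gap. Check this assumption against your own pretty-printed output before relying on it; the helper `cell_text` is hypothetical.

```python
import json

# Hand-written sample; the layout (keys "witnesses", "table", "t") is an assumption
sample_json = """
{
  "witnesses": ["A", "B"],
  "table": [
    [[{"t": "The"}], [{"t": "big"}], [{"t": "gray"}], [{"t": "koala"}]],
    [[{"t": "The"}], null, [{"t": "grey"}], [{"t": "koala"}]]
  ]
}
"""
parsed = json.loads(sample_json)


def cell_text(cell):
    """Join the token texts of one cell; a gap (None) becomes an empty string."""
    if cell is None:
        return ""
    return " ".join(token["t"] for token in cell)


# Map each witness sigil to the list of its cell texts
readings = {
    sigil: [cell_text(cell) for cell in row]
    for sigil, row in zip(parsed["witnesses"], parsed["table"])
}

for sigil, cells in readings.items():
    print(sigil, cells)
```

From a dictionary like `readings` it is straightforward to feed each token to a lemmatizer or to compare witnesses column by column.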
