## Timbuktu Chronicles: Research Project Notes
Ayah Aboelela and Gregory Crane

### Introduction
The Timbuktu Chronicles (https://en.wikipedia.org/wiki/Timbuktu_Chronicles) are a series of 17th-century manuscripts written in Timbuktu about the history of West Africa.
Tarikh al-Fattash by Mahmud Kati and Tarikh al-Sudan by ʻAbd al-Raḥmān ibn ʻAbd Allāh Saʻdī are two of these chronicles, and they are the central focus of this project. The project aims to make these texts, or selected passages from them, more accessible by providing the best possible digital editions of them. Selected passages may be published with curated annotations. Possible additions include geo-tagging, named-entity recognition, and improved Arabic OCR for the texts.

#### Audiences
We aim to provide the following capabilities for different audiences:

- Allowing readers with no knowledge of Arabic to see what Arabic terms lie behind translations. Readers should be able to search for Arabic words and see how they are used in different contexts, even when they are translated somewhat differently.
- Scaffolding for readers with a basic knowledge of Arabic (e.g., one year of MSA with the standard al-Kitab textbook) to work directly with these sources.
- Providing visualizations of information that we can extract from these sources (e.g., maps of places mentioned, basic information about people). Here we apply named-entity recognition (NER) and information extraction to the Arabic and/or the French (probably both).
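
As a rough illustration of the first capability (searching for an Arabic word across passages), here is a minimal sketch in plain Python. It assumes we already have Arabic lines aligned with their translations, which this project does not have yet. The function names `normalize_arabic` and `find_usages` and the sample pair are hypothetical, and the normalization (dropping diacritics and tatweel, folding alef variants) is only one simple choice; a morphological tool such as CAMeL Tools would be needed to match inflected forms.

```python
import unicodedata

# Fold alef variants to bare alef and drop tatweel (assumed normalization choices).
_CHAR_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> alef
    "\u0625": "\u0627",  # alef with hamza below -> alef
    "\u0622": "\u0627",  # alef with madda -> alef
    "\u0640": None,      # tatweel (elongation character) is removed
})


def normalize_arabic(text):
    """Strip diacritics and fold common spelling variants for matching."""
    folded = text.translate(_CHAR_MAP)
    # Arabic vowel marks are nonspacing marks (Unicode category "Mn").
    return "".join(ch for ch in folded if unicodedata.category(ch) != "Mn")


def find_usages(query, aligned_lines):
    """Return the (arabic, translation) pairs whose Arabic contains the query."""
    needle = normalize_arabic(query)
    return [pair for pair in aligned_lines if needle in normalize_arabic(pair[0])]


# Hypothetical aligned data: a vocalized Arabic word and an English gloss.
sample = [("كِتَاب", "book")]
matches = find_usages("كتاب", sample)
```

Because both the query and the text are normalized, an unvocalized query can find vocalized text and the other way round.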


### Tasks and Updates
#### For the week of 6/14:

- Continue correcting some lines from the OCR output of HathiTrust for Tarikh al-Fattash/al-Sudan
	- So far ~270 lines from Tarikh al-Fattash are corrected; considering correcting 400 lines from it and 400 from Tarikh al-Sudan
- Pick a cool passage from Tarikh al-Fattash to explore with other tools
- From Greg Crane: "compare Google Translate from the Arabic with Google Translate and DeepL from the French. Which gives us the best translation? Here you need to explain the Arabic a bit. You don’t need to do a lot."
- Explore CAMeL Tools: prioritize the camel_morphology tool and get it working, so that we can check how well the indirect translation from the French matches the Arabic
- From Greg Crane: evaluate NLTK and spaCy NER (Named Entity Recognition)

Google Colab notebook, which currently has simple experiments for CAMeL Tools and spaCy NER:

https://colab.research.google.com/drive/1fJishB2763Mua79tVAj0ZtnhdZBXA2ZX?usp=sharing


#### For the week of 6/21:

- Continue selecting passages to annotate (first from the Arabic, then find corresponding French, then translate in DeepL, then paste it in correctedEnglishPassages.txt, and then correct it by removing footnotes/repetitions)
- Look into training spaCy:

	- https://spacy.io/usage/training
	- https://towardsdatascience.com/train-ner-with-custom-training-data-using-spacy-525ce748fab7
	- https://www.machinelearningplus.com/training-custom-ner-model-in-spacy/ 
		- this is probably the most helpful link regarding the format of the training data
	- Training data: https://spacy.io/api/data-formats#training
- Correct/clean the English translation samples for training
- Correct the spaCy output (saved as XML files, although they may not actually be XML)
- Upload the corrected files back to spaCy and use them for training
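
The tutorials linked above describe NER training examples as a text plus a list of `(start, end, label)` character offsets. Here is a minimal sketch of how such an example could be built and checked in plain Python. The helper names `annotate` and `overlapping` are hypothetical, the sentence is invented for illustration (it is not a quotation from the chronicles), and the tuple layout is the older tutorial-style format; spaCy v3 expects this to be converted into its binary format, as described on the data-formats page.

```python
def annotate(text, entities):
    """Build one (text, {"entities": [...]}) training example.

    `entities` is a list of (surface_string, label) pairs. Every occurrence
    of each surface string is recorded as (start, end, label).
    """
    spans = []
    for surface, label in entities:
        start = text.find(surface)
        while start != -1:
            spans.append((start, start + len(surface), label))
            start = text.find(surface, start + len(surface))
    return (text, {"entities": sorted(spans)})


def overlapping(spans):
    """Return adjacent (sorted) spans that overlap; an empty list means none do."""
    ordered = sorted(spans)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b[0] < a[1]]


# Invented example sentence with hand-chosen labels.
example = annotate(
    "Askia Muhammad travelled from Gao to Timbuktu.",
    [("Askia Muhammad", "PERSON"), ("Gao", "GPE"), ("Timbuktu", "GPE")],
)
# example[1]["entities"] is
# [(0, 14, "PERSON"), (30, 33, "GPE"), (37, 45, "GPE")]
```

Checking for overlaps before training is worthwhile because overlapping entity spans are a common cause of errors when hand-corrected labels are loaded.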


#### For the week of 6/28:
- Correct the labels for NER training in the samples
- Train different spaCy models and see which ones work best (https://spacy.io/models/en)
- Collect more samples from the chronicles for NER training/labeling
- Start using eScriptorium for OCR

### Transcriptions

We need optical character recognition (OCR) to produce a clear, interactive edition of the Arabic texts. HathiTrust currently uses Google's OCR engine for this, but its output contains many mistakes because of the unusual typeface these texts use. Ideally, we could take an existing Arabic OCR tool and train it on these texts. Options include OpenITI and some of the other tools listed below, such as Kraken and Tesseract.

Other Arabic OCR programs that may be possible to train include:
- https://github.com/HusseinYoussef/Arabic-OCR/
- https://github.com/msfasha/ArabicDLOCR
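
To compare OCR engines, one option is to measure the character error rate (CER) of each engine's output against the lines we have corrected by hand. The sketch below is a plain-Python illustration, not something the project has run yet. The function names are hypothetical, and it assumes the OCR lines and the corrected lines are already paired one to one.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_lines, corrected_lines):
    """Total edit distance divided by the total length of the corrected text."""
    if len(ocr_lines) != len(corrected_lines):
        raise ValueError("line lists must be the same length")
    total_edits = sum(edit_distance(o, c) for o, c in zip(ocr_lines, corrected_lines))
    total_chars = sum(len(c) for c in corrected_lines)
    return total_edits / total_chars if total_chars else 0.0
```

A lower CER on the same corrected lines would suggest a better engine for this typeface, although Arabic diacritics and ligatures may deserve separate treatment.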


### Translations
For the purposes of this project, we need an English translation of these texts. The existing English translations are not open access, so we cannot use them. Our options include:
- Machine translation directly from the Arabic into English
- Machine translation indirectly, from the French translation into English
    - for this we can use www.deepl.com/translate
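
When comparing the output of two translation routes for the same passage, it can help to see where they diverge before judging which is better. Here is a small sketch using Python's standard `difflib`. The function names are hypothetical, and word-level agreement says nothing about which version is correct; a reader who knows the Arabic still has to decide.

```python
import difflib


def token_similarity(a, b):
    """Ratio in [0, 1] of matching words between two translations."""
    return difflib.SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def differing_words(a, b):
    """List (tag, words_in_a, words_in_b) for each stretch that differs."""
    words_a, words_b = a.split(), b.split()
    matcher = difflib.SequenceMatcher(None, words_a, words_b)
    return [
        (tag, words_a[i1:i2], words_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

The passages where the two outputs differ most are probably the ones where a short explanation of the Arabic is most useful.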

### Geotagging and Named Entity Recognition

https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da covers NER in Python with NLTK and spaCy. We might need to run NER on an English translation and then link each entity back to the Arabic word.
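
Once place names have been found in the English text, geotagging can start as a simple gazetteer lookup. The sketch below is plain Python with a tiny hand-made gazetteer. The coordinates are approximate and rounded, the `geotag` function and its output fields are hypothetical, and a real version would take names from the NER output and coordinates from a curated gazetteer.

```python
import re

# Illustrative gazetteer: name -> approximate (latitude, longitude).
GAZETTEER = {
    "Timbuktu": (16.77, -3.01),
    "Gao": (16.27, -0.04),
    "Djenné": (13.91, -4.55),
}


def geotag(text, gazetteer=GAZETTEER):
    """Find gazetteer names in text and return their offsets and coordinates."""
    hits = []
    for name, (lat, lon) in gazetteer.items():
        # Word boundaries keep "Gao" from matching inside a longer word.
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            hits.append({
                "name": name,
                "start": match.start(),
                "end": match.end(),
                "lat": lat,
                "lon": lon,
            })
    return sorted(hits, key=lambda hit: hit["start"])
```

The character offsets make it possible to highlight the mention in the passage, and the coordinates can be passed to any mapping tool.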

### Tools Overview

#### OpenITI
- Dissertation: https://drive.google.com/file/d/13poyZhEh-Hnq9Gzy-6pNrSxP95jR0IAY/view
- OpenITI's homepage: https://openiti.org
- OpenITI's CorpusBuilder: https://openiti.org/projects/corpusbuilder
- More info: https://alraqmiyyat.github.io/OpenITI/
	- The link they give for the app, iti-corpus.github.io/, is invalid
- Technical overview:
	- https://github.com/berkmancenter/corpusbuilder/wiki/Technical-Overview - this says that you can use either Tesseract or Kraken as a back end. More info on those here: https://github.com/berkmancenter/corpusbuilder/wiki/Rolling-out
- Guidelines for uploading and correcting text for OCR (but no info on where the app is or how to access it): https://openiti.org/guidelines
- CorpusBuilder "isn't meant to be used directly. It's purpose is to be both the database of corpuses and the tools to work on them — all to be consumed by an external application" (via https://github.com/berkmancenter/corpusbuilder/blob/master/README.md) 
- I think we might be able to use SHARIAsource as the external application: https://github.com/berkmancenter/SHARIAsource

#### CAMeL Tools
CAMeL Tools is an "Open source Python Toolkit for Arabic Natural Language Processing.  CAMeL Tools currently provides utilities for pre-processing, morphological modeling, Dialect Identification, Named Entity Recognition and Sentiment Analysis."
- Documentation: https://camel-tools.readthedocs.io/en/latest/
- GitHub repo: https://github.com/CAMeL-Lab/camel_tools
- Jupyter Notebook (Google Colab) tutorial: https://colab.research.google.com/drive/1Y3qCbD6Gw1KEw-lixQx1rI6WlyWnrnDS?usp=sharing


These are the steps I followed to set up CAMeL Tools, based on https://github.com/CAMeL-Lab/camel_tools and https://camel-tools.readthedocs.io/en/latest/getting_started.html

In an Anaconda environment:

`pip install camel-tools -f https://download.pytorch.org/whl/torch_stable.html`

I think this step may be unnecessary with the "install from source" option, so I then did:

`pip uninstall camel-tools`

`git clone https://github.com/CAMeL-Lab/camel_tools.git`

`cd camel_tools`

`pip install -f https://download.pytorch.org/whl/torch_stable.html .`

`pip install --upgrade -f https://download.pytorch.org/whl/torch_stable.html .`

`camel_data full` (the `camel_data` command only works in Anaconda for me. I am not sure why; it may be related to pip, which also only works in Anaconda on my machine.)

Some commands did not work on my local machine, so I ran them in the Google Colab notebook instead (linked above, under Tasks and Updates).

#### Kraken
A trainable OCR tool that works for Arabic. We could perhaps compare it with OpenITI's functionality.

However, https://digitalorientalist.com/2019/11/05/using-kraken-to-train-your-own-ocr-models/ says that it only works on Mac (Kraken's own documentation lists Linux and macOS as supported, but not Windows).

#### Tesseract
Another OCR tool that supports multiple languages; I think it also works on Windows. Beginner's guide: https://betterprogramming.pub/beginners-guide-to-tesseract-ocr-using-python-10ecbb426c3d

#### deepl.com
Indirect machine translation: can be used to translate the French translation into English

#### Stanza
https://stanfordnlp.github.io/stanza/
I tried their online demo, and the named entity recognition was not great for Arabic. Maybe CAMeL Tools or NLTK/spaCy would be a better option.

#### Others: NLTK, spaCy




### Related Work / Other Examples
#### al-Thurayya Project
Pretty cool: "A gazetteer (al-Ṯurayyā Gazetteer, or al-Thurayyā Gazetteer) and a geospatial model of the early Islamic world"

Maybe we can use this or expand it for some of the places mentioned in the Timbuktu Chronicles.

Links:

- https://alraqmiyyat.github.io/althurayya/
- https://althurayya.github.io/

#### Scaife Viewer

https://scaife-viewer.org/

If we want to produce something similar to what is available on the Scaife Viewer, maybe we can use the code from their open-source, MIT-licensed repos, for example https://github.com/scaife-viewer/explorehomer


#### Visualizing Homer
- https://github.com/jtauber/homer-ngram
- https://jtauber.github.io/homer-ngram/viewer/iliad.html
    - This shows the lines that are repeated
    - Not related to our project but it would be interesting to do the same for the Quran, since repetition is an important quality of it

