## Load Arabic Word Embeddings

Contributed by **Ali Ahmed**

A utility to load word-embedding models.

The **AraVec N-Gram** model is used as the source of word embeddings because it provides embeddings for a larger share of the words in **AraWordNet**. **fastText** and **AraVec uni-gram** are both uni-gram models, so they are missing many of the words in the WordNet. This notebook loads all three models; the N-gram model is the one we use.

### Prerequisite:
- Define `fasttext_path`, `uni_gram_aravec_path` and `n_gram_aravec_path` variables
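
A minimal sketch of how these variables might be defined in an earlier notebook cell. The directory `embeddings` and the three file names are assumptions for illustration only; substitute the locations of the files you actually downloaded.

```python
from pathlib import Path

# Hypothetical download directory; adjust to your own setup.
embeddings_dir = Path("embeddings")

# Hypothetical file names; use the names of the files you downloaded.
fasttext_path = str(embeddings_dir / "cc.ar.300.vec")
uni_gram_aravec_path = str(embeddings_dir / "full_uni_sg_300_wiki.mdl")
n_gram_aravec_path = str(embeddings_dir / "full_grams_sg_300_wiki.mdl")

# In an IPython/Jupyter session, persist the variables so that
# `%store -r` can restore them in this notebook:
# %store fasttext_path
# %store uni_gram_aravec_path
# %store n_gram_aravec_path
```

The `%store` lines are commented out because they are IPython magics and only work inside a notebook or IPython shell.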

## Import and Setup

In [None]:
import io
import gensim

%store -r fasttext_path
%store -r uni_gram_aravec_path
%store -r n_gram_aravec_path

## Load fastText word embedding model

fastText[1] word embeddings can be found at https://fasttext.cc/docs/en/crawl-vectors.html. We use the Arabic word vectors.

[1] E. Grave, P. Bojanowski, P. Gupta, A. Joulin, T. Mikolov, “Learning Word Vectors for 157 Languages”, in Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), 2018.

In [None]:
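# Load word vectors from a fastText .vec text file.
# Adapted from the fastText documentation: https://fasttext.cc/docs/en/crawl-vectors.html
def load_vectors(fname):
    # Use a context manager so the file is closed once reading is done
    with io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore') as fin:
        # Header line: number of words and vector dimension (not used further)
        n, d = map(int, fin.readline().split())
        data = {}
        for line in fin:
            tokens = line.rstrip().split(' ')
            # First token is the word, the rest are the vector components
            data[tokens[0]] = list(map(float, tokens[1:]))
    return data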

fasttext_model = load_vectors(fasttext_path)
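
Since `load_vectors` returns a plain dictionary that maps each word to a list of floats, the vectors can be compared without any extra library. The following is a minimal sketch of a cosine-similarity helper; `cosine_similarity` and the three-dimensional `toy_model` are hypothetical stand-ins for illustration, not part of the original notebook.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # similarity is undefined for a zero vector; return 0 by convention
    return dot / (norm_u * norm_v)

# Hypothetical toy dictionary with the same shape as the output of load_vectors
toy_model = {
    "كتاب": [1.0, 0.0, 1.0],
    "قلم": [1.0, 0.0, 0.0],
    "بحر": [0.0, 1.0, 0.0],
}

sim_close = cosine_similarity(toy_model["كتاب"], toy_model["قلم"])
sim_far = cosine_similarity(toy_model["كتاب"], toy_model["بحر"])
print(sim_close, sim_far)
```

With the real model, the same call would be `cosine_similarity(fasttext_model[w1], fasttext_model[w2])` for any two words present in the dictionary.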

## Load AraVec Uni-Gram word embedding model

AraVec[2] uni-gram word embeddings can be found at https://github.com/bakrianoo/aravec#unigrams-models. We use the Wikipedia-SkipGram model with vector size 300.

[2] A. Soliman, K. Eisa, and S. R. El-Beltagy, “AraVec: A set of Arabic Word Embedding Models for use in Arabic NLP”, in Proceedings of the 3rd International Conference on Arabic Computational Linguistics (ACLing 2017), Dubai, UAE, 2017.

In [None]:
unigram_aravec_model = gensim.models.Word2Vec.load(uni_gram_aravec_path)

## Load AraVec N-Gram word embedding model

AraVec n-gram word embeddings can be found at https://github.com/bakrianoo/aravec#n-grams-models-1. We use the Wikipedia-SkipGram model with vector size 300.

In [None]:
ngram_aravec_model = gensim.models.Word2Vec.load(n_gram_aravec_path)
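
To check how many WordNet words each model covers, a simple membership test is enough. The sketch below is illustrative only: `coverage` is a hypothetical helper, the toy vocabularies stand in for the loaded models, and joining the words of a multi-word entry with an underscore is an assumption about how the n-gram model stores such entries.

```python
def coverage(words, vocab):
    """Return (fraction of `words` found in `vocab`, list of missing words)."""
    words = list(words)
    missing = [w for w in words if w not in vocab]
    found = len(words) - len(missing)
    ratio = found / len(words) if words else 0.0
    return ratio, missing

# Hypothetical word list; the last entry is a multi-word term joined by "_"
toy_words = ["كتاب", "مدرسة", "بيت_الشعر"]

# Hypothetical stand-ins for a uni-gram and an n-gram vocabulary
toy_unigram_vocab = {"كتاب": [0.1], "مدرسة": [0.2]}
toy_ngram_vocab = {"كتاب": [0.1], "مدرسة": [0.2], "بيت_الشعر": [0.3]}

uni_ratio, uni_missing = coverage(toy_words, toy_unigram_vocab)
ngram_ratio, ngram_missing = coverage(toy_words, toy_ngram_vocab)
print(uni_ratio, uni_missing)
print(ngram_ratio, ngram_missing)
```

With the real models, `vocab` could be `fasttext_model` (a dictionary) or the `.wv` attribute of a loaded gensim model such as `ngram_aravec_model.wv`, since both support the `in` operator.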

<table style="width:100%">
  <tr>
      <td colspan="1" style="text-align:left;background-color:#0071BD;color:white">
        <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">
            <img alt="Creative Commons License" style="border-width:0;float:left;padding-right:10pt"
                 src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" />
        </a>
        &copy; T. Dong, C. Bauckhage<br/>
        Licensed under a 
        <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/" style="color:white">
            CC BY-NC 4.0
        </a>.
      </td>
      <td colspan="2" style="text-align:left;background-color:#66A5D1">
          <b>Acknowledgments:</b>
          This material was prepared within the project
          <a href="http://www.b-it-center.de/b-it-programmes/teaching-material/p3ml/" style="color:black">
              P3ML
          </a> 
          which is funded by the Ministry of Education and Research of Germany (BMBF)
          under grant number 01/S17064. The authors gratefully acknowledge this support.
      </td>
  </tr>
</table>