**Arabicthon**

**Ibn Sidah Team**

Prof. Yaser Hifny
yhifny@yahoo.com

Dr. Waleed Nazeeh
w.nazeeh@gmail.com

Mr. Amr ElGendy
amr.algendy@gmail.com



# **Mount Google Drive, define paths, and install required tools**

In [1]:
# Print CPU and memory details
import tensorflow as tf
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

!lscpu |grep 'Model name'

print('Normal CPU')
print('Processor model')
!cat /proc/cpuinfo  | grep 'name'| uniq
print('Number of processors')
!cat /proc/cpuinfo  | grep process| wc -l
print('Memory details')
!free -h

Your runtime has 54.8 gigabytes of available RAM

You are using a high-RAM runtime!
Model name:          Intel(R) Xeon(R) CPU @ 2.20GHz
Normal CPU
Processor model
model name	: Intel(R) Xeon(R) CPU @ 2.20GHz
Number of processors
8
Memory details
              total        used        free      shared  buff/cache   available
Mem:            51G        847M         48G        1.2M        1.8G         49G
Swap:            0B          0B          0B


In [2]:
import sys
import re
import os
from pathlib import Path
from google.colab import drive, files
# Mount google drive folders
drive.mount('/content/drive')


Mounted at /content/drive


In [3]:
# Prepare required paths

# Project path
PROJ_PATH = '/content/drive/My Drive/Sense_Gram_Project'
# model evaluation path
model_eval_path = (os.path.join(PROJ_PATH, 'model_evaluation'))
# Utilities path
util_path = (os.path.join(PROJ_PATH, 'utilities'))

# wikipedia dataset path
wiki_path = (os.path.join(PROJ_PATH, 'datasets/arabic_wikipedia'))
# billion words dataset path
billion_path = (os.path.join(PROJ_PATH, 'datasets/arabic_billion_words'))
# classic dataset path
classic_path = (os.path.join(PROJ_PATH, 'datasets/classic_misc'))

# Set current directory to the project directory
os.chdir(PROJ_PATH)
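
Since `os.chdir` fails when Drive is not mounted or a folder is missing, a quick check of the directories can save a rerun. The following is a minimal sketch; the helper name `missing_paths` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

def missing_paths(paths):
    """Return the entries of `paths` that do not exist as directories."""
    return [p for p in paths if not Path(p).is_dir()]

# Demo on the current directory and a path that is assumed not to exist
print(missing_paths(['.', './no_such_dir_for_demo']))
```

In the notebook it could be called with the variables defined above, for example `missing_paths([PROJ_PATH, model_eval_path, util_path, wiki_path, billion_path, classic_path])`.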

In [4]:
# Install CAMeL Tools and the datasets required to extract lemmas
!pip install camel-tools
!camel_data -i all


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting camel-tools
  Downloading camel_tools-1.4.0-py3-none-any.whl (104 kB)
Collecting emoji
  Downloading emoji-1.7.0.tar.gz (175 kB)
Collecting transformers>=3.0.2
  Downloading transformers-4.19.2-py3-none-any.whl (4.2 MB)
Collecting camel-kenlm
  Downloading camel-kenlm-2021.12.27.tar.gz (418 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.7.0-py3-none-any.whl (86 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)

# **Utilities methods**

In [None]:
# A list of words was prepared by Mr. Amr ElGendy (team member) for use in the model evaluation.
# For each word in this list we extracted up to 10 contexts from the dataset.
# Mr. Amr revised these contexts and chose the best group of them.
# We then removed this group from the dataset so that it is not part of model training (i.e., unseen data).

# Inputs: dataset file, file listing the words whose contexts are wanted,
# max. number of contexts to extract per word, and output file
def extract_contexts(dataset_file, words_list_file, max_occ, contexts_list_file):
  # Load the dataset into a list of lines
  with open(dataset_file, 'r', encoding='utf-8') as f:
    list_dataset = f.readlines()
  # Load the list of words whose contexts are wanted
  with open(words_list_file, 'r', encoding='utf-8') as f:
    list_words = f.readlines()

  # Output file: every word followed by its contexts
  with open(contexts_list_file, 'w', encoding='utf-8') as f:
    for word in list_words:
      f.write('Current Word: ')
      f.write(word)
      word = word.strip()      # Remove \n
      word_count = 0
      for line in list_dataset:
        # Split the line into sentences on '. ', '! ' and '? '
        sentences = re.split(r'\. |\! |\? ', line)
        for x in sentences:
          # Surround the word with spaces to match the EXACT word (no prefix or suffix).
          # Note: a word at the very start or end of a sentence is therefore not matched.
          if ' ' + word + ' ' in x:
            f.write(x + '\n')
            word_count = word_count + 1
        # Checked once per dataset line, so slightly more than max_occ contexts may be written
        if word_count >= max_occ:
          break
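
To make the matching rule concrete, here is a small in-memory sketch of the same sentence splitting and whole-word test. The helper name `find_contexts` and the English sample line are hypothetical; the notebook itself works on Arabic text files.

```python
import re

def find_contexts(text_line, word):
    """Return the sentences of `text_line` that contain `word` surrounded by spaces."""
    # Same separators as extract_contexts: '. ', '! ' and '? '
    sentences = re.split(r'\. |\! |\? ', text_line)
    return [s for s in sentences if f' {word} ' in s]

line = 'the bank is closed. we sat on the river bank today! banking is hard'
print(find_contexts(line, 'bank'))
# → ['the bank is closed', 'we sat on the river bank today']
```

The third sentence is skipped because "banking" only contains the word as a prefix. The same rule also skips a sentence that begins or ends with the word, since no space surrounds it on that side.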


In [None]:
# Remove the contexts that will be used in model evaluation from the dataset before building the SenseGram model

# Inputs: dataset file, file listing the contexts to remove (one per line), and the new dataset file to write
def remove_contexts(dataset_file, contexts_list_file, new_dataset_file):
  with open(dataset_file, 'r', encoding='utf-8') as f:
    text = f.read()
  with open(contexts_list_file, 'r', encoding='utf-8') as f:
    # Skip blank lines: replacing '' would insert a space between every character
    context_list = [i.strip() for i in f if i.strip()]
  for context in context_list:
    text = text.replace(context, " ")
  with open(new_dataset_file, 'w', encoding='utf-8') as f:
    f.write(text)
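
The removal step is a plain substring replacement over the whole corpus. A minimal in-memory sketch (the name `strip_contexts` and the sample strings are hypothetical) shows the behaviour and why blank lines in the contexts file need care.

```python
def strip_contexts(text, contexts):
    """Replace every non-empty context found in `text` with a single space."""
    for context in contexts:
        context = context.strip()
        # An empty context must be skipped: 'abc'.replace('', ' ') gives ' a b c '
        if context:
            text = text.replace(context, ' ')
    return text

corpus = 'alpha beta gamma. delta epsilon. zeta'
held_out = ['delta epsilon', '', '  ']
print(strip_contexts(corpus, held_out))
```

Because the match is exact, a held-out context is only removed if it appears in the corpus character for character, including spacing and punctuation.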


In [None]:
# Use CAMeL Tools to lemmatize a group of files and save each output
# in a new file with the same name plus ".lemma"

# Input: the directory that contains the dataset part files to lemmatize
def create_lemma(dataset_folder):

  from camel_tools.utils.dediac import dediac_ar
  from camel_tools.tokenizers.word import simple_word_tokenize
  from camel_tools.disambig.mle import MLEDisambiguator

  # Load the pretrained disambiguator once and reuse it for every file
  mle = MLEDisambiguator.pretrained()

  # Iterate over the files in the dataset folder that have "part" in their name
  files = Path(dataset_folder).glob('*.part.*')
  for filename in files:
    # Skip outputs of a previous run
    if filename.suffix == '.lemma':
      continue
    print('Processing file: ' + str(filename))
    with open(filename, 'r', encoding='utf-8') as in_file, \
         open(str(filename) + '.lemma', 'w', encoding='utf-8') as out_file:
      for line in in_file:
        line = line.strip()
        if line == '':
          continue
        # The disambiguator expects pre-tokenized text
        sentence = simple_word_tokenize(line)
        disambig = mle.disambiguate(sentence)
        # Keep the lemma ('lex') of the top-ranked analysis; skip tokens with no analysis
        lemmas = " ".join([d.analyses[0].analysis['lex'] for d in disambig if d.analyses])
        # Remove diacritization
        out_file.write(dediac_ar(lemmas) + '\n')
        out_file.flush()
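
The final step of `create_lemma` strips diacritics from the lemmas. As a rough illustration of what that means, the sketch below removes the common Arabic diacritic marks with a regular expression. It is an approximation written for this note, not the CAMeL Tools implementation, and the exact character set handled by `dediac_ar` may differ.

```python
import re

# Assumed mark set: fathatan..sukun (U+064B-U+0652) and dagger alif (U+0670)
_AR_DIACRITICS = re.compile('[\u064B-\u0652\u0670]')

def strip_diacritics(text):
    """Remove Arabic diacritic marks, leaving the base letters untouched."""
    return _AR_DIACRITICS.sub('', text)

# meem + fatha + alef + hamza + dammatan -> meem + alef + hamza
print(strip_diacritics('\u0645\u064E\u0627\u0621\u064C') == '\u0645\u0627\u0621')
# → True
```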

# **Prepare Arabic wikipedia dataset**

## 1) Download dataset and install wiki extractor

In [None]:
# Change current directory to the dataset directory
print(wiki_path)
os.chdir(wiki_path)


/content/drive/My Drive/Sense_Gram_Project/datasets/arabic_wikipedia


In [None]:
# Download Arabic wikipedia
!wget https://dumps.wikimedia.org/arwiki/latest/arwiki-latest-pages-articles.xml.bz2


--2022-05-26 04:10:30--  https://dumps.wikimedia.org/arwiki/latest/arwiki-latest-pages-articles.xml.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.7, 2620:0:861:1:208:80:154:7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1419710855 (1.3G) [application/octet-stream]
Saving to: ‘arwiki-latest-pages-articles.xml.bz2’


2022-05-26 04:15:36 (4.44 MB/s) - ‘arwiki-latest-pages-articles.xml.bz2’ saved [1419710855/1419710855]



In [None]:
# Install wikiextractor
!pip install wikiextractor

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wikiextractor
  Downloading wikiextractor-3.0.6-py3-none-any.whl (46 kB)
Installing collected packages: wikiextractor
Successfully installed wikiextractor-3.0.6


## 2) Extract text and clean dataset by removing all tags

In [None]:
# Ref: https://github.com/attardi/wikiextractor/blob/master/README.md
# --processes: number of processes used to extract articles in parallel
# --no-templates: do not expand templates
# --bytes: maximum number of bytes per output file (100 megabytes here)
!python -m wikiextractor.WikiExtractor arwiki-latest-pages-articles.xml.bz2 --output ara_wiki_without_tags --processes 8 --no-templates --bytes 100M


INFO: Starting page extraction from arwiki-latest-pages-articles.xml.bz2.
INFO: Using 8 extract processes.
INFO: Extracted 100000 articles (1309.7 art/s)
INFO: Extracted 200000 articles (2262.3 art/s)
INFO: Extracted 300000 articles (2054.4 art/s)
INFO: Extracted 400000 articles (2017.8 art/s)
INFO: Extracted 500000 articles (2314.6 art/s)
INFO: Extracted 600000 articles (2686.0 art/s)
INFO: Extracted 700000 articles (3499.8 art/s)
INFO: Extracted 800000 articles (2390.1 art/s)
INFO: Extracted 900000 articles (2251.1 art/s)
INFO: Extracted 1000000 articles (2343.8 art/s)
INFO: Extracted 1100000 articles (1514.9 art/s)
INFO: Extracted 1200000 articles (2983.9 art/s)
INFO: Extracted 1300000 articles (3290.6 art/s)
INFO: Extracted 1400000 articles (3520.7 art/s)
INFO: Extracted 1500000 articles (3818.1 art/s)
INFO: Extracted 1600000 articles (3150.8 art/s)
INFO: Extracted 1700000 articles (3468.1 art/s)
INFO: Extracted 1800000 articles (3258.3 art/s)
INFO: Extracted 1900000 articles (2810

In [None]:
# Concatenate all files and remove leftover tags such as templatestyles src= and styles.css
!cat ara_wiki_without_tags/AA/wiki* > ara_wiki.txt
!sed -i 's/<[^>]*>/ /g' ara_wiki.txt
!grep -v template ara_wiki.txt > ara_wiki_clean.txt
# Print some lines to check that all tags are removed
!head -500 ara_wiki_clean.txt


 
ماء

الماء مادةٌ شفافةٌ عديمة اللون والرائحة، وهو المكوّن الأساسي للجداول والبحيرات والبحار والمحيطات وكذلك للسوائل في جميع الكائنات الحيّة، وهو أكثر المركّبات الكيميائيّة انتشاراً على سطح الأرض. يتألّف جزيء الماء من ذرّة أكسجين مركزية ترتبط بها ذرّتا هيدروجين على طرفيها برابطة تساهميّة بحيث تكون صيغته الكيميائية H2O. عند الظروف القياسية من الضغط ودرجة الحرارة يكون الماء سائلاً؛ أمّا الحالة الصلبة فتتشكّل عند نقطة التجمّد، وتدعى بالجليد؛ أمّا الحالة الغازية فتتشكّل عند نقطة الغليان، وتسمّى بخار الماء.
إنّ الماء هو أساس وجود الحياة على كوكب الأرض، وهو يغطّي 71% من سطحها، وتمثّل مياه البحار والمحيطات أكبر نسبة للماء على الأرض، حيث تبلغ حوالي 96.5%. وتتوزّع النسب الباقية بين المياه الجوفيّة وبين جليد المناطق القطبيّة (1.7% لكليهما)، مع وجود نسبة صغيرة على شكل بخار ماء معلّق في الهواء على هيئة سحاب (غيوم)، وأحياناً أخرى على هيئة ضباب أو ندى، بالإضافة إلى الزخات المطريّة أو الثلجيّة. تبلغ نسبة الماء العذب حوالي 2.5% فقط من الماء الموجود على الأرض، وأغلب هذه الكمّيّة (حوالي 99%) موجودة في 
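
For readers who prefer to stay in Python, the `sed` and `grep -v` steps of this cell can be approximated as follows. This is a sketch under the assumption that the file is processed line by line, as both shell tools do; the names `TAG_RE` and `clean_lines` and the sample lines are hypothetical.

```python
import re

# Same pattern as the sed command: anything between '<' and the next '>'
TAG_RE = re.compile(r'<[^>]*>')

def clean_lines(lines):
    """Replace tags with a space, then drop lines that still mention 'template'."""
    for line in lines:
        line = TAG_RE.sub(' ', line)
        if 'template' not in line:
            yield line

sample = [
    '<doc id="1" title="x">\n',
    'plain text\n',
    '<templatestyles src="a/styles.css" />\n',
    'see templatestyles here\n',
    '</doc>\n',
]
print(list(clean_lines(sample)))
# → [' \n', 'plain text\n', ' \n', ' \n']
```

As with the shell version, the order matters: a line whose only mention of "template" sits inside a tag is kept as a blank line, while a line with the word in its visible text is dropped entirely.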

## 3) Extract contexts to be used in model evaluation

In [None]:
# Extract contexts that will be used in model evaluation
extract_contexts('ara_wiki_clean.txt', os.path.join(model_eval_path, 'modern_ambiguous_AMR_words.txt'),
                 10, os.path.join(model_eval_path, 'modern_ambiguous_AUTO_contexts_wiki_dataset.txt'))


In [None]:
# Mr. Amr ElGendy revised the previous output file and prepared a new one called "modern_ambiguous_AMR_contexts_wiki_dataset.txt"
# Remove the contexts that will be used in model evaluation from the dataset before building the SenseGram model
remove_contexts('ara_wiki_clean.txt', os.path.join(model_eval_path, 'modern_ambiguous_AMR_contexts_wiki_dataset.txt'),
                 'ara_wiki_clean_without_evaluation_contexts.txt')


## 4) Extract lemmas from the dataset

In [None]:
# Divide the dataset into files with 5 million lines per file
!split --lines=5000000 ara_wiki_clean_without_evaluation_contexts.txt ara_wiki_clean_without_evaluation_contexts.txt.part.
# The output will be a series of files ara_wiki_clean_without_evaluation_contexts.txt.part.aa, .ab, .ac, ...

# Create lemmas for all parts of the dataset
create_lemma(wiki_path)

# Concatenate all lemma files into one file containing the dataset lemmas
!cat *.lemma > ara_wiki_clean_without_evaluation_contexts_lemmas.txt


Processing file: /content/drive/My Drive/Sense_Gram_Project/datasets/arabic_wikipedia/ara_wiki_clean.txt.part.aa
Processing file: /content/drive/My Drive/Sense_Gram_Project/datasets/arabic_wikipedia/ara_wiki_clean.txt.part.ab
Processing file: /content/drive/My Drive/Sense_Gram_Project/datasets/arabic_wikipedia/ara_wiki_clean.txt.part.ac
Processing file: /content/drive/My Drive/Sense_Gram_Project/datasets/arabic_wikipedia/ara_wiki_clean.txt.part.ad
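
The `split --lines` step can also be sketched in Python, which may help when the shell tool is unavailable. The function name `split_by_lines` is hypothetical, and the sketch only generates two-letter suffixes (`aa` to `zz`), so it supports at most 676 parts. Each chunk is held in memory before it is written.

```python
from itertools import islice, product
from pathlib import Path
from string import ascii_lowercase

def split_by_lines(src, lines_per_part, prefix=None):
    """Write consecutive chunks of `src` to prefix+'aa', prefix+'ab', ... and return the paths."""
    prefix = prefix or str(src) + '.part.'
    # Two-letter suffixes in the same order as GNU split: aa, ab, ..., zz
    suffixes = (''.join(p) for p in product(ascii_lowercase, repeat=2))
    parts = []
    with open(src, 'r', encoding='utf-8') as f:
        while True:
            chunk = list(islice(f, lines_per_part))
            if not chunk:
                break
            part = Path(prefix + next(suffixes))
            part.write_text(''.join(chunk), encoding='utf-8')
            parts.append(part)
    return parts
```

A call such as `split_by_lines('ara_wiki_clean_without_evaluation_contexts.txt', 5000000)` would produce part files with the same naming scheme that `create_lemma` looks for.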


# **Prepare Arabic billion words dataset**

## 1) Download dataset

In [None]:
# Change current directory to Arabic billion words dataset directory
print(billion_path)
os.chdir(billion_path)


/content/drive/My Drive/Sense_Gram_Project/datasets/arabic_billion_words


In [None]:
# Download arabic_billion_words
!wget http://www.abuelkhair.net/corpus/Alittihad_XML_utf_8.rar
!unrar e Alittihad_XML_utf_8.rar

!wget http://www.abuelkhair.net/corpus/Almasryalyoum_XML_utf_8.rar
!unrar e Almasryalyoum_XML_utf_8.rar

!wget http://www.abuelkhair.net/corpus/Almustaqbal_XML_utf_8.rar
!unrar e Almustaqbal_XML_utf_8.rar

!wget http://www.abuelkhair.net/corpus/Alqabas_XML_utf_8.rar
!unrar e Alqabas_XML_utf_8.rar

!wget http://www.abuelkhair.net/corpus/Echoroukonline_XML_utf_8.rar
!unrar e Echoroukonline_XML_utf_8.rar

!wget http://www.abuelkhair.net/corpus/Ryiadh_XML_utf_8.rar
!unrar e Ryiadh_XML_utf_8.rar

!wget http://www.abuelkhair.net/corpus/Sabanews_XML_utf_8.rar
!unrar e Sabanews_XML_utf_8.rar

!wget http://www.abuelkhair.net/corpus/SaudiYoum_XML_utf_8.rar
!unrar e SaudiYoum_XML_utf_8.rar

!wget http://www.abuelkhair.net/corpus/Techreen_XML_utf_8.rar
!unrar e Techreen_XML_utf_8.rar

!wget http://www.abuelkhair.net/corpus/Youm7_XML_utf_8.rar
!unrar e Youm7_XML_utf_8.rar


--2022-05-26 06:53:27--  http://www.abuelkhair.net/corpus/Alittihad_XML_utf_8.rar
Resolving www.abuelkhair.net (www.abuelkhair.net)... 162.241.244.55
Connecting to www.abuelkhair.net (www.abuelkhair.net)|162.241.244.55|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 348259999 (332M) [application/x-rar-compressed]
Saving to: ‘Alittihad_XML_utf_8.rar’


2022-05-26 06:53:55 (12.2 MB/s) - ‘Alittihad_XML_utf_8.rar’ saved [348259999/348259999]


UNRAR 5.50 freeware      Copyright (c) 1993-2017 Alexander Roshal


Extracting from Alittihad_XML_utf_8.rar

Extracting  Alittihad_utf_8.xml

## 2) Extract news headlines and their text only

In [None]:
# Extract headline and text tags only
!cat *.xml > temp.cat
!grep 'Headline\|Text' temp.cat > arabic_billion_words.txt
!unlink temp.cat

# Remove headline and text tags
!sed -i 's/<[^>]*>/ /g' arabic_billion_words.txt

# Print the first 500 lines to check the output
!head -500 arabic_billion_words.txt


  بن غليطة: 396 شركة معتمدة لدى التنظيم العقاري في دبي 
  أكد المهندس مروان بن غليطة المدير التنفيذي لمؤسسة التنظيم العقاري بدبي أن استخدام العقوبات والغرامات المنصوص عليها في القانون ضد الشركات العقارية غير المسجلة بحسابات الضمان، أو تلك التي لم تحصل على ترخيص لممارسة النشاط العقاري، آخر ما تفكر فيه المؤسسة. وشدد على أن ''''التنظيم العقاري'''' لا تستخدم القانون سيفاً مسلطاً ضد الشركات بل هدفها العمل على خلق شراكات عمل واضحة ومرنة مع كل الشركات والفاعلين في قطاع التطوير العقاري من أجل تحقيق الهدف الاستراتيجي من إنشاء المؤسسة والرامي إلى تنظيم القطاع وايجاد قاعدة بيانات تساهم في تطوير وتعزيز هذا القطاع الحيوي.
وأفاد: لا توجد أرقام نهائية حول الشركات التي تمارس نشاطاً عقارياً، وبالتالي فتسجيل 396 شركة لدى المؤسسة بنهاية المهلة التي انتهت في 27 ديسمبر الماضي، نعتبرها البداية، واللبنة الأولى لقاعدة بيانات حول السوق العقارية في دبي، وهذه الخطوة ستليها خطوات أخرى لتشكل في مجملها مرجع معلوماتي حول النشاط العقاري، سواء من حيث عدد الشركات والمشروعات التطويرية وحجم القطاع وكل ما يتعلق به. وحول 
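
The `grep` and `sed` steps of this cell keep only lines that mention `Headline` or `Text` and then blank out the tags. A line-based Python sketch of the same idea follows; the function name `headline_and_text` and the sample records are hypothetical, and the real corpus files may use additional element names.

```python
import re

TAG_RE = re.compile(r'<[^>]*>')

def headline_and_text(lines):
    """Keep lines containing 'Headline' or 'Text', with every tag replaced by a space."""
    for line in lines:
        if 'Headline' in line or 'Text' in line:
            yield TAG_RE.sub(' ', line)

sample = [
    '<Sabanews>\n',
    '<ID>1</ID>\n',
    '<Headline>Title here</Headline>\n',
    '<Text>Body here</Text>\n',
    '</Sabanews>\n',
]
print(list(headline_and_text(sample)))
# → [' Title here \n', ' Body here \n']
```

Like the `grep` call, this matches on the substring anywhere in the line, so any other line that happens to contain "Text" or "Headline" is kept as well.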

## 3) Extract contexts to be used in model evaluation

In [None]:
# Extract contexts that will be used in model evaluation
extract_contexts('arabic_billion_words.txt', os.path.join(model_eval_path, 'modern_ambiguous_AMR_words.txt'),
                 10, os.path.join(model_eval_path, 'modern_ambiguous_AUTO_contexts_billion_dataset.txt'))


In [None]:
# Mr. Amr ElGendy revised the previous output file and prepared a new one called "modern_ambiguous_AMR_contexts_billion_dataset.txt"
# Remove the contexts that will be used in model evaluation from the dataset before building the SenseGram model
remove_contexts('arabic_billion_words.txt', os.path.join(model_eval_path, 'modern_ambiguous_AMR_contexts_billion_dataset.txt'),
                 'arabic_billion_words_without_evaluation_contexts.txt')


## 4) Extract lemmas from the dataset

Because lemma creation for the Arabic billion words dataset takes a very long time, we distributed the dataset parts over 3 PCs in addition to Google Colab to meet the competition deadline. We then uploaded all files to Google Drive and resumed the dataset creation process.
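
The notebook does not record how the parts were divided among the machines. One simple possibility is a round-robin assignment, sketched below; the function `assign_parts`, the worker labels, and the part names are all hypothetical.

```python
def assign_parts(part_names, workers):
    """Assign part files to workers in round-robin order and return {worker: [parts]}."""
    assignment = {w: [] for w in workers}
    for i, name in enumerate(sorted(part_names)):
        assignment[workers[i % len(workers)]].append(name)
    return assignment

demo_parts = ['x.part.ab', 'x.part.aa', 'x.part.ac', 'x.part.ad', 'x.part.ae']
demo_workers = ['pc1', 'pc2', 'pc3', 'colab']
print(assign_parts(demo_parts, demo_workers))
# → {'pc1': ['x.part.aa', 'x.part.ae'], 'pc2': ['x.part.ab'], 'pc3': ['x.part.ac'], 'colab': ['x.part.ad']}
```

Each machine would then run `create_lemma` on a folder holding only its own parts, and the resulting `.lemma` files are collected in one place before concatenation.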

In [None]:
# Divide the dataset into files with 5 million lines per file
!split --lines=5000000 arabic_billion_words_without_evaluation_contexts.txt arabic_billion_words_without_evaluation_contexts.txt.part.

# Create lemmas for all parts of the dataset
create_lemma(billion_path)

!cat *lemma > arabic_billion_words_without_evaluation_contexts_lemmas.txt


Processing file: arabic_billion_words_without_evaluation_contexts.txt.part.aa
Processing file: arabic_billion_words_without_evaluation_contexts.txt.part.ab
Processing file: arabic_billion_words_without_evaluation_contexts.txt.part.ac


# **Prepare Arabic classic dataset**

This dataset was prepared by Mr. Amr ElGendy (team member) from classic Arabic books.

In [6]:
# Change current directory to the dataset directory
print(classic_path)
os.chdir(classic_path)

/content/drive/My Drive/Sense_Gram_Project/datasets/classic_misc


## Extract lemmas from the dataset

In [8]:
# Divide the dataset into files with 2 million lines per file
!split --lines=2000000 classic_misc.txt classic_misc.txt.part.
# The output will be a series of files classic_misc.txt.part.aa, classic_misc.txt.part.ab ...

# Create lemmas for all parts of the dataset
create_lemma(classic_path)

# Concatenate all lemma files into one file
!cat *lemma > classic_misc_lemma.txt


Processing file: classic_misc.txt.part.aa
Processing file: classic_misc.txt.part.ab
