# Utility: Normalize Arabic Text
Contributed by Ali Ahmed.

A utility to clean and normalize Arabic text. It normalizes the text following the instructions from AraVec [1] so that we can get its embedding.

[1] https://github.com/bakrianoo/aravec#code-samples

## Import and Setup

In [None]:
import re

## Clean/Normalize Arabic Text

Normalizing the text is done in 4 steps:
1. Remove "tashkil": tashkil [3] is a form of vowel marks that represent missing vowel letters and are used to pronounce a word correctly. While tashkil is used in the Arabic WordNet, it is not used in the word embeddings, so we have to normalize the text by removing all tashkil occurrences.
2. Remove elongation: elongation (repeating a letter) is used to communicate a long vowel pronunciation. We have to remove that too, so that the text matches both the word embeddings and WordNet.
3. Substitute characters with their normalized versions: this step is similar to normalizing two German texts where one uses 'ä' and the other uses the alternative form 'ae'. In the same way, some Arabic characters have to be normalized so that they are written identically in both texts. Symbols are also handled in this step: most are removed or replaced by a space, while question and exclamation marks are padded with spaces.
4. Trim: remove whitespace before and after the text.

[3] https://en.wikipedia.org/wiki/Arabic_diacritics
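
Steps 1 and 2 can be tried in isolation with the minimal sketch below. The helper names `strip_tashkil` and `collapse_elongation` and the two sample words are illustrative assumptions, not part of the notebook; the regular expressions are the same ones used in `clean_str`. The samples are written as Unicode escapes so the code points are unambiguous.

```python
import re

# Same diacritic ranges as in the notebook's clean_str
TASHKIL = re.compile(r'[\u0617-\u061A\u064B-\u0652]')
# Any character repeated two or more times in a row
REPEATS = re.compile(r'(.)\1+')

def strip_tashkil(text):
    """Step 1 (hypothetical helper): drop the diacritic marks."""
    return TASHKIL.sub('', text)

def collapse_elongation(text):
    """Step 2 (hypothetical helper): shrink runs to two characters,
    then reduce doubled waw/ya/alef to a single letter."""
    text = REPEATS.sub(r'\1\1', text)
    for letter in ('\u0648', '\u064A', '\u0627'):  # waw, ya, alef
        text = text.replace(letter * 2, letter)
    return text

# Illustrative inputs (assumptions, not taken from the notebook)
with_marks = '\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F'  # "Muhammad" written with tashkil
stretched = '\u062C\u0645\u064A\u064A\u064A\u064A\u0644'         # "jamil" with a stretched ya

print(strip_tashkil(with_marks))
print(collapse_elongation(stretched))
```

Note that the run-collapsing pattern is not specific to Arabic letters: it shortens any repeated character, so a run such as `aaaa` becomes `aa`.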

In [None]:
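def clean_str(text):
    """Normalize Arabic text: remove tashkil and elongation, substitute characters, trim."""
    # search[i] is replaced by replace[i]; both lists must have the same length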
    search = ["أ","إ","آ","ة","_","-","/",".","،"," و "," يا ",'"',"ـ","'","ى","\\",'\n', '\t','&quot;','?','؟','!']
    replace = ["ا","ا","ا","ه"," "," ","","",""," و"," يا","","","","ي","",' ', ' ',' ',' ? ',' ؟ ',' ! ']

    # Remove tashkil (Arabic diacritics)
    p_tashkeel = re.compile(r'[\u0617-\u061A\u064B-\u0652]')
    text = p_tashkeel.sub("", text)

    # Remove elongation: collapse any run of a repeated character to two occurrences
    p_longation = re.compile(r'(.)\1+')
    subst = r"\1\1"
    text = p_longation.sub(subst, text)

    # Reduce doubled waw, ya and alef to a single letter
    text = text.replace('وو', 'و')
    text = text.replace('يي', 'ي')
    text = text.replace('اا', 'ا')

    # Normalize characters: apply the substitutions pairwise, in list order
    for old, new in zip(search, replace):
        text = text.replace(old, new)

    # Trim
    text = text.strip()

    return text
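
The character substitution and trimming steps (3 and 4) can also be sketched on their own. The example below is a reduced illustration, assuming a hypothetical helper `normalize_chars` and a small subset of the pairs from the `search`/`replace` lists above; it is not a replacement for `clean_str`. Code points are given as escapes, with the intended letter named in a comment.

```python
# Reduced, illustrative subset of the notebook's search/replace pairs
CHAR_MAP = {
    '\u0623': '\u0627',  # alef with hamza above -> bare alef
    '\u0625': '\u0627',  # alef with hamza below -> bare alef
    '\u0622': '\u0627',  # alef with madda -> bare alef
    '\u0629': '\u0647',  # teh marbuta -> heh
    '\u0649': '\u064A',  # alef maksura -> yeh
    '\u0640': '',        # tatweel (kashida) is dropped
}

def normalize_chars(text):
    """Hypothetical helper: substitute characters, then trim whitespace."""
    for old, new in CHAR_MAP.items():
        text = text.replace(old, new)
    return text.strip()

# Illustrative inputs (assumptions, not taken from the notebook)
name = '\u0623\u062D\u0645\u062F'              # "Ahmad" spelled with a hamza on the alef
school = '  \u0645\u062F\u0631\u0633\u0629 '   # "madrasa" with surrounding spaces

print(normalize_chars(name))
print(normalize_chars(school))
```

As in `clean_str`, the substitutions are applied one after another, so the order of the pairs matters as soon as one replacement can produce text that a later pair matches.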

<table style="width:100%">
  <tr>
      <td colspan="1" style="text-align:left;background-color:#0071BD;color:white">
        <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">
            <img alt="Creative Commons License" style="border-width:0;float:left;padding-right:10pt"
                 src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" />
        </a>
        &copy; T. Dong, C. Bauckhage<br/>
        Licensed under a 
        <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/" style="color:white">
            CC BY-NC 4.0
        </a>.
      </td>
      <td colspan="2" style="text-align:left;background-color:#66A5D1">
          <b>Acknowledgments:</b>
          This material was prepared within the project
          <a href="http://www.b-it-center.de/b-it-programmes/teaching-material/p3ml/" style="color:black">
              P3ML
          </a> 
          which is funded by the Ministry of Education and Research of Germany (BMBF)
          under grant number 01/S17064. The authors gratefully acknowledge this support.
      </td>
  </tr>
</table>