# SECTION 1: IMPORTING QARAFI'S DHAKHIRA FROM OPENITI

In [1]:
import urllib.request

I got the URL below from https://kitab-corpus-metadata.azurewebsites.net/, which has a useful searchable index of all the texts that OpenITI has processed.

In [2]:
book_url="https://raw.githubusercontent.com/OpenITI/0700AH/master/data/0684ShihabDinQarafi/0684ShihabDinQarafi.Dhakhira/0684ShihabDinQarafi.Dhakhira.JK003500-ara1"

In [3]:
book_text=urllib.request.urlopen(book_url).read()

In [4]:
book_text=book_text.decode('utf-8')
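The download and decode steps above can also be wrapped in one small function. This is only a sketch: the name `fetch_text`, the 30-second timeout, and the error message are my own choices for illustration, not part of the original workflow.

```python
import urllib.error
import urllib.request


def fetch_text(url, timeout=30, encoding="utf-8"):
    """Download `url` and return its contents decoded as text.

    Raises RuntimeError with a readable message if the download fails.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            raw_bytes = response.read()
    except urllib.error.URLError as err:
        raise RuntimeError("could not download {}: {}".format(url, err)) from err
    # The OpenITI files are UTF-8, so decoding turns the bytes into a str.
    return raw_bytes.decode(encoding)


# Usage (needs network access):
# book_text = fetch_text(book_url)
```

Keeping the two steps together makes it harder to forget the decode step and end up with a bytes object instead of text.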

# SECTION 2: PREPARING TEXT FOR SEMANTIC SIMILARITY PROCESSING

## SECTION 2.1: LOCATING THE testimony CHAPTER (KITAB AL-SHAHADAT) IN THE DHAKHIRA

In [6]:
import re

I use regular expressions to find all instances of chapter titles in the text of the Dhakhira. The two cells below define the regular expression and then use it to create a list of all chapter titles in the text. Since we will ultimately partition only the chapter on testimony, we need to know exactly how the beginning and end of that chapter are demarcated in the text.

In [12]:
chaptertitles=re.compile(r'#.+\|.+\(.كتاب.+')

In [14]:
titleinstances=chaptertitles.findall(book_text)

In [15]:
titleinstances

['# | 1 ( كتاب الطهارة )',
 '# | 1 ( كتاب الصلاة )',
 '# | 1 ( كتاب الصيام )',
 '# | ( كتاب الزكاة )',
 '# | ( كتاب الحج )',
 '# | ( كتاب الجهاد )',
 '# | ( كتاب النذر )',
 '# | 1 ( كتاب الأطعمة )',
 '# | ( كتاب الأشربة )',
 '# | ( كتاب الذبائح )',
 '# | ( كتاب الأضحية )',
 '# | ( كتاب العقيقة )',
 '# | ( كتاب الصيد )',
 '# | 1 ( كتاب البيوع )',
 '# | 1 ( كتاب الصلح )',
 '# | ( كتاب الإجارة )',
 '# | 1 ( كتاب القراض )',
 '# | 1 ( كتاب العارية )',
 '# | ( كتاب القسمة )',
 '# | ( كتاب الشفعة )',
 '# | 1 ( كتاب الوكالة )',
 '# | 1 ( كتاب الشركة )',
 '# | 1 ( كتاب الرهون )',
 '# | 1 ( كتاب التفليس وديون الميت )',
 '# | ( كتاب اللقطة )',
 '# | ( كتاب اللقيط )',
 '# | ( كتاب الوديعة )',
 '# | ( كتاب الحمالة )',
 '# | ( كتاب الحوالة )',
 '# | ( كتاب الإقرار )',
 '# | ( كتاب الأقضية )',
 '# | ( كتاب الشهادات )',
 '# | ( كتاب الوثائق )',
 '# | ( كتاب الدعاوي )',
 '# | ( كتاب الأيمان )',
 '# | ( كتاب العتق )',
 '# | ( كتاب التدبير )',
 '# | ( كتاب الكتابة )',
 '# | ( كتاب أمهات الأولاد )',
 '# |

In [16]:
len(titleinstances)

44

In [19]:
titleinstances[31]

'# | ( كتاب الشهادات )'
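To see what the pattern does and does not match, here is a small sketch run on a made-up sample rather than on the Dhakhira itself. The sample lines and the variable names are mine; the pattern is the one defined above, written as a raw string.

```python
import re

# Same pattern as above: '#', a '|', an opening parenthesis, then the word "kitab".
chapter_title_pattern = re.compile(r'#.+\|.+\(.كتاب.+')

# Toy sample (invented for illustration, not taken from the real file).
sample_text = "\n".join([
    '# | ( كتاب الأقضية )',
    'نص قصير',
    '# | 1 ( كتاب الشهادات )',
    '~~ تتمة السطر',
    '# | ( باب في الشهادة )',
])

# '.' never matches a newline, so every match stays within a single line.
sample_titles = chapter_title_pattern.findall(sample_text)
print(sample_titles)
```

Only the two lines containing كتاب inside parentheses are returned; the باب heading and the ordinary text lines are skipped.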

### Removing Artificial Line Breaks and Partitioning Dhakhira by Paragraph

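For readability, OpenITI breaks long paragraphs into shorter lines and marks each continuation line with two tildes ("~~"). To partition the text into paragraphs, we need to rejoin those lines, so the cell below replaces each newline that is followed by two tildes with a space (the tildes themselves are removed later, during cleaning).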

In [21]:
book_text_prep_for_paras=book_text.replace('\n~~',' ~~')

In [22]:
book_text_para=re.split('\n',book_text)

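The book_text_para variable now contains the Dhakhira in list form. Note that the split above is applied to book_text rather than to book_text_prep_for_paras, so each element is one line of the OpenITI file (continuation lines beginning with ~~ included), and the positions reported below are positions in that list.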
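As a toy illustration of the line-joining idea described in this subsection, the sketch below merges continuation lines and then splits the merged text. The function name and the sample string are hypothetical. Unlike the cell above, it drops the tildes immediately and splits the merged text, so list positions produced this way would not match the positions printed in this notebook.

```python
def split_into_paragraphs(raw_text):
    """Join '~~' continuation lines to the line before them, then split on newlines."""
    merged = raw_text.replace('\n~~', ' ')
    return merged.split('\n')


# Toy sample: a title, one paragraph wrapped over two lines, and a second paragraph.
toy_text = "# | ( كتاب الشهادات )\nالسطر الأول\n~~تتمة السطر\nفقرة ثانية"

toy_lines = toy_text.split('\n')                  # 4 elements: split by line
toy_paragraphs = split_into_paragraphs(toy_text)  # 3 elements: split by paragraph
print(len(toy_lines), len(toy_paragraphs))
```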

### Locating the Paragraphs that Belong to the testimony Chapter

The command below finds the position of the paragraph that marks the beginning of the testimony chapter, taking as input the title of the testimony chapter: '# | ( كتاب الشهادات )'

In [23]:
book_text_para.index(titleinstances[31])

69069

In [24]:
book_text_para[69069]

'# | ( كتاب الشهادات )'

In [25]:
book_text_para.index(titleinstances[32])

72435

By using the index command above, we have discovered that the testimony (Shahada) chapter consists of paragraphs 69069 to 72434. Paragraph 72435 is the title of the next chapter, so it is excluded.

In [29]:
kitab_shahada_text_para=book_text_para[69069:72435]

The command above puts only the paragraphs that constitute the testimony chapter into the variable kitab_shahada_text_para. The command below takes those paragraphs and sews them back together into a single text, stored in one variable that is a string and NOT a list.

In [30]:
kitab_shahada_text=" ".join(kitab_shahada_text_para)
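The locate-and-slice steps above could be bundled into one helper so the same approach works for any chapter. This is a minimal sketch; `extract_chapter` and the toy list are names I made up for illustration.

```python
def extract_chapter(paragraphs, start_title, end_title):
    """Return the paragraphs from start_title up to (not including) end_title."""
    start = paragraphs.index(start_title)
    # Search for the closing title only after the opening one.
    end = paragraphs.index(end_title, start + 1)
    return paragraphs[start:end]


# Toy list standing in for book_text_para.
toy_book = [
    '# | ( كتاب الأقضية )', 'فقرة',
    '# | ( كتاب الشهادات )', 'فقرة أولى', 'فقرة ثانية',
    '# | ( كتاب الوثائق )', 'فقرة',
]

toy_chapter = extract_chapter(toy_book, '# | ( كتاب الشهادات )', '# | ( كتاب الوثائق )')
toy_chapter_text = " ".join(toy_chapter)
```

Like the cells above, the slice keeps the chapter's own title and stops just before the next chapter's title.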

## SECTION 2.2 CLEANING THE testimony CHAPTER

One of the most important steps in getting semantic similarity scores is to remove from the testimony chapter every character that is not an Arabic letter: Latin characters, digits, punctuation marks, and so on. One way to do that is to count all instances of every unique character in the testimony chapter, then build a list of the characters you want to remove, and then delete them.

The command below takes a string (or text) as input and returns a Counter, a dictionary-like object that maps each unique character to the number of times it occurs in the text.

In [31]:
from collections import Counter

In [32]:
letters_kitab_shahada_text=Counter(kitab_shahada_text)

In [33]:
letters_kitab_shahada_text

Counter({' ': 44045,
         '#': 330,
         '&': 73,
         '(': 409,
         ')': 411,
         '0': 304,
         '1': 292,
         '2': 249,
         '3': 159,
         '4': 64,
         '5': 71,
         '6': 63,
         '7': 62,
         '8': 61,
         '9': 139,
         '@': 260,
         'B': 65,
         'E': 65,
         'P': 362,
         'Q': 130,
         'V': 181,
         '^': 38,
         'a': 181,
         'e': 181,
         'g': 181,
         'm': 139,
         's': 139,
         '|': 616,
         '~': 6072,
         'ء': 425,
         'آ': 163,
         'أ': 1919,
         'ؤ': 149,
         'إ': 531,
         'ئ': 376,
         'ا': 29430,
         'ب': 6670,
         'ة': 3149,
         'ت': 5476,
         'ث': 1295,
         'ج': 2173,
         'ح': 3049,
         'خ': 1396,
         'د': 6428,
         'ذ': 1732,
         'ر': 5862,
         'ز': 1106,
         'س': 2407,
         'ش': 2771,
         'ص': 1459,
         'ض': 910,
         'ط': 862,
 

What we are interested in is not the frequency of each character in the text, but just the unique characters themselves. To get those, we convert the Counter into a list of its keys. The two commands below do that.

In [34]:
letters_and_count_kitab_shahada_text=dict(letters_kitab_shahada_text)

In [35]:
letters_kitab_shahada_text_list=list(letters_and_count_kitab_shahada_text.keys())

I manually generated the list below. Based on the output of the "Counter" command, I identified all the non-Arabic characters that occur in the testimony chapter and put them into a list. I named the list 'cancel'. I'll use this list to go through the testimony chapter and delete each of these characters.

In [36]:
cancel=['#','|','(',")",':','~','3','@','Q','B','E','P','a','g','e','V','2','0','5','m','s','4','7','^','6','&','8','%','9','-','.','_','$','،','\'','1','!']
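A list like this could also be derived automatically instead of by hand. The sketch below treats any character outside the range U+0621 to U+064A (hamza through ya) as non-Arabic; that range is my own assumption about what counts as an Arabic letter here, and the function name is hypothetical. Spaces are kept because they are needed to separate words.

```python
from collections import Counter


def non_arabic_characters(text):
    """Return the unique characters in text that are neither spaces nor Arabic letters."""
    counts = Counter(text)
    return sorted(
        ch for ch in counts
        if ch != ' ' and not ('\u0621' <= ch <= '\u064A')
    )


# Toy sample mixing Arabic words with the kinds of markup seen in the Counter output.
toy_sample = '# | ( كتاب الشهادات ) ، PageV02P123 ms'
toy_cancel = non_arabic_characters(toy_sample)
print(toy_cancel)
```

Note that the Arabic comma (،) falls outside the range and is therefore flagged for removal, which matches the hand-made list above.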

In the command below, I assign the contents of the testimony chapter to a new variable. Why? In case my deletion code messes up, I still have an untouched original I can return to. I'll do my deletions on the new variable, called kitab_shahada_text2.

In [37]:
kitab_shahada_text2=kitab_shahada_text

The command below uses a loop to go through and delete each character I placed in the list named 'cancel'.

In [38]:
for i in cancel:
    kitab_shahada_text2=kitab_shahada_text2.replace(i,'')
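The same deletion can be done in a single pass with Python's built-in `str.maketrans` and `str.translate`. This is an alternative sketch rather than the method used in this notebook; the function name and the toy inputs are made up.

```python
def delete_characters(text, unwanted):
    """Remove every character listed in `unwanted` from `text` in one pass."""
    # The third argument of str.maketrans lists characters to delete.
    table = str.maketrans('', '', ''.join(unwanted))
    return text.translate(table)


toy_cancel_list = ['#', '|', '(', ')', '1', '2']
toy_dirty = '# | ( كتاب الشهادات ) 12'

toy_clean = delete_characters(toy_dirty, toy_cancel_list)

# The loop-based approach from the cell above, for comparison.
toy_clean_loop = toy_dirty
for ch in toy_cancel_list:
    toy_clean_loop = toy_clean_loop.replace(ch, '')
```

Both approaches give the same string. The translate version scans the text once instead of once per character in the list, which may matter on longer texts.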

I want to check to make sure I deleted all non-Arabic characters. So I run the Counter command on the variable kitab_shahada_text2. And sure enough, I was successful. (Not on the first try of course!! LOL).

In [39]:
Counter(kitab_shahada_text2)

Counter({' ': 44045,
         'ء': 425,
         'آ': 163,
         'أ': 1919,
         'ؤ': 149,
         'إ': 531,
         'ئ': 376,
         'ا': 29430,
         'ب': 6670,
         'ة': 3149,
         'ت': 5476,
         'ث': 1295,
         'ج': 2173,
         'ح': 3049,
         'خ': 1396,
         'د': 6428,
         'ذ': 1732,
         'ر': 5862,
         'ز': 1106,
         'س': 2407,
         'ش': 2771,
         'ص': 1459,
         'ض': 910,
         'ط': 862,
         'ظ': 231,
         'ع': 6646,
         'غ': 923,
         'ف': 5190,
         'ق': 5306,
         'ك': 3683,
         'ل': 21989,
         'م': 9942,
         'ن': 9869,
         'ه': 9116,
         'و': 10101,
         'ى': 1684,
         'ي': 10524})

## SECTION 2.3 CREATING THE 20-GRAMS FROM THE testimony CHAPTER

We will use a function called 'ngrams', which is found in a popular natural language processing library called NLTK. But in order to create 20-grams with ngrams, I have to feed it the testimony chapter as a list in which each word is one element. The command below does that.

In [40]:
words_kitab_shahada_text=kitab_shahada_text2.split()

You've probably seen this 'len' command up above. It is a very useful command: applied to a list, it returns the number of elements in the list, and applied to a string (or text), it returns the number of characters that make up the string.

In [42]:
len(words_kitab_shahada_text)

41717

The command below imports the ngrams function from the nltk.util module. NLTK must already be installed (for example with pip install nltk); the import itself does not download anything.

In [43]:
from nltk.util import ngrams

Now, we are ready to feed the ngrams command our list of words that make up the testimony chapter. We also specify that we want it to create 20-grams, as opposed to 7, or 9, or whatever number you want.

In [44]:
twentygrams_kitab_shahada_text=ngrams(words_kitab_shahada_text,20)

The output of the ngrams command, which I stored in the 'twentygrams_kitab_shahada_text' variable, is a lazy iterator rather than a list we can index or measure with len. But we need 20-grams that our Google semantic search service can take as input, which are regular phrases. So we first convert the output into a list. Each element of the list is a tuple of 20 words, so we in fact have a list of tuples.

In [46]:
twentygrams_kitab_shahada=list(twentygrams_kitab_shahada_text)

We can't feed GSSS a list of words. We need to recombine them into phrases of 20 words each. The code below does just that, using a loop to go through the 20-grams in the twentygrams_kitab_shahada variable and join the words of each one with spaces.

In [47]:
joined_twentygrams_kitab_shahada=[]
for twentygram in twentygrams_kitab_shahada:
    # join the 20 words of this tuple into one space-separated phrase
    joined_twentygrams_kitab_shahada.append(' '.join(twentygram))

In [48]:
len(joined_twentygrams_kitab_shahada)

41698

In [49]:
joined_twentygrams_kitab_shahada[0]

'كتاب الشهادات شهد في لسان العرب له ثلاثة معان شهد بمعنى علم ومنه قوله تعإلى وكنا لحكمهم شاهدين والله على'
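To make the sliding-window idea concrete, here is a pure-Python stand-in for the ngrams-then-join steps. It is not NLTK's implementation, and the function name is my own. It also shows why 41,717 words yield 41,698 twenty-grams: a list of N words has N - n + 1 windows of length n.

```python
def sliding_ngrams(words, n):
    """Return every run of n consecutive words, joined into one phrase."""
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]


toy_words = 'البينة على المدعي واليمين على من أنكر'.split()  # 7 words
toy_trigrams = sliding_ngrams(toy_words, 3)

print(len(toy_trigrams))  # 7 - 3 + 1 windows
print(toy_trigrams[0])
```

Each fragment overlaps the next by n - 1 words, so a phrase in the chapter is covered by several fragments no matter where it starts.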

We've made a lot of progress. From time to time it is good to save that progress in an external file. So I saved the twenty-grams to a CSV file, one fragment per row.

In [50]:
import csv

In [41]:
with open('20gram_kitab_shahada_fragments.csv','w',encoding='utf-8',newline='\n') as fragmentfile:
    wr=csv.writer(fragmentfile,dialect='excel')
    for fragment in joined_twentygrams_kitab_shahada:
        wr.writerow([fragment])
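It can be worth reading the file back to confirm that the Arabic text survived the round trip. The sketch below writes and re-reads a toy list; the helper names are hypothetical, and it passes newline='' when opening the file, which is what the Python csv documentation recommends.

```python
import csv


def write_fragments(path, fragments):
    """Write one fragment per row to a UTF-8 CSV file."""
    with open(path, 'w', encoding='utf-8', newline='') as fragmentfile:
        writer = csv.writer(fragmentfile, dialect='excel')
        for fragment in fragments:
            writer.writerow([fragment])


def read_fragments(path):
    """Read the fragments back, skipping any empty rows."""
    with open(path, encoding='utf-8', newline='') as fragmentfile:
        return [row[0] for row in csv.reader(fragmentfile) if row]


# Usage:
# write_fragments('20gram_kitab_shahada_fragments.csv', joined_twentygrams_kitab_shahada)
# assert read_fragments('20gram_kitab_shahada_fragments.csv') == joined_twentygrams_kitab_shahada
```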

# SECTION 3: SENDING FRAGMENTS AND RECEIVING SIMILARITY SCORES FROM GSSS

Up until now, you haven't needed anything beyond a working installation of Python and Jupyter Notebook (plus NLTK) on your laptop. To proceed further, you need to have done several other things. As noted in my blog, the GSSS is an experimental service provided by Google Cloud. That means it is not commercially available yet, so you need to get approved to use it. You can apply for the service here: https://events.withgoogle.com/ai-workshop/registrations/my-rsvp/confirm-account/. Then you need to follow their detailed instructions in order to get the service set up. Unfortunately, those detailed instructions are only provided if you are accepted to the experimental program.

The code below imports the client library that Google provides for sending requests to their cloud services from within Python.

In [52]:
import googleapiclient.discovery

The code below is a template that GSSS provides to correctly format the "input text" and the "candidate texts" and to send them, along with the security credentials needed to get access, to the cloud service.

In [57]:
def rank_candidates(input_text,
                  candidate_text_list,
                  project='semantic-similarity-for-text',
                  model='universal_encoder',
                  version=None):
 """Rank candidates against given input.

 Args:
   input_text: input text
   candidate_text_list: list of candidate text to rank.
   project: cloud ml project id
   model: model name
   version: model version

 Returns:
    list of tuples of (candidate text, score) sorted by score in reverse order.

 Raises:
   RuntimeError: if API call failed.
 """
 service = googleapiclient.discovery.build('ml', 'v1')
 name = 'projects/{}/models/{}'.format(project, model)
 if version is not None:
   name += '/versions/{}'.format(version)

 instances = []
 for r in candidate_text_list:
   instances.append({'Input': input_text, 'Candidate': r})

 response = service.projects().predict(
     name=name,
     body={
         'instances': instances,
     }).execute()

 if 'error' in response:
   raise RuntimeError(response['error'])

 return sorted(
     zip(candidate_text_list,
         [p['similarity_score'][0] for p in response['predictions']]),
     reverse=True,
     key=lambda p: p[1])
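To see what the last step of the function does without calling the service, the sketch below applies the same sort to a made-up response. The dictionary shape mirrors the keys the template reads ('predictions' and 'similarity_score'); the scores themselves are invented for illustration and are not real GSSS output.

```python
def sort_by_score(candidates, response):
    """Pair each candidate with its score and sort from highest to lowest."""
    scores = [p['similarity_score'][0] for p in response['predictions']]
    return sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)


toy_candidates = ['fragment a', 'fragment b', 'fragment c']

# Invented scores, one prediction per candidate, in the same order as the candidates.
fake_response = {
    'predictions': [
        {'similarity_score': [0.2]},
        {'similarity_score': [0.9]},
        {'similarity_score': [0.5]},
    ]
}

ranked = sort_by_score(toy_candidates, fake_response)
print(ranked)
```

Because the result is sorted by score, the fragments in each returned batch are no longer in their original text order.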

The code above takes, at a minimum, two inputs: the input text, which in our case is the text of the canon we are searching for, and a list of 20-gram fragments for which we want semantic scores. GSSS says not to send a list of more than 1000 candidate texts. But we have 41,698 20-grams. How do we handle this? We use a loop to send one batch at a time until the list is finished.

Here's the problem: when I tried to send 1000 at a time, the service sometimes timed out. With some experimentation, I determined that 200 works best. The next two cells contain the code that loops through the 20-grams list, sending GSSS the information it wants, and receiving and recording its response.

Because the service would time out, I had to make a simple way of keeping track of which scores I had successfully received and recorded versus the ones I still needed scores for.
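The batching and progress-tracking idea can be sketched as a function that retries a failed batch and reports where to resume. Everything here is hypothetical scaffolding: `score_batch` stands in for a call such as rank_candidates with the canon already filled in, and the retry count and wait time are arbitrary defaults, not values recommended by GSSS.

```python
import time


def score_in_batches(fragments, score_batch, start=0, batch_size=200,
                     max_retries=3, wait_seconds=0):
    """Score fragments batch by batch.

    Returns (results, next_position). If a batch keeps failing, next_position
    is the index to pass as `start` on the next attempt.
    """
    results = []
    position = start
    while position < len(fragments):
        batch = fragments[position:position + batch_size]
        for attempt in range(1, max_retries + 1):
            try:
                results.extend(score_batch(batch))
                break
            except Exception:
                if attempt == max_retries:
                    # Give up for now and report where to resume from.
                    return results, position
                time.sleep(wait_seconds)
        position += len(batch)
    return results, position


# Usage (assumes rank_candidates and input_text are defined as in this notebook):
# scores, resume_at = score_in_batches(
#     joined_twentygrams_kitab_shahada,
#     lambda batch: rank_candidates(input_text, batch))
```

Returning the resume position plays the same role as printing the current iteration: after a timeout you know exactly which fragment to start from.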

Note: before you execute the two cells below, you need to log in to GSSS's cloud service so you have the right security credentials to get a response back from GSSS.

In [55]:
outputmid=[]
#variable that will record GSSS's response, before it is saved to a file external to python.

input_text='البينة على المدعي واليمين على من أنكر'
#the canon we are searching for

initial=0
#the initial list position we begin with

interval=200
#how many fragments we want to send to GSSS at one time.

final=400
#the position of the last 20-gram fragment you want to stop at. Sometimes this is useful, 
#if you don't want all scores all at the same time.

In [56]:
for i in range(initial,final,interval):
    
    print('current iteration is: ' + str(i))
    # this helps you keep track of the last set of fragments you have sent to GSSS
    
    outputmid=rank_candidates(input_text,joined_twentygrams_kitab_shahada[i:i+interval])
    # this is the core line of code. It is what sends the canon and the list of 20-grams to GSSS.
    
    with open('bayyina_shahada_similarity_scores.csv','a+',encoding='utf-8',newline='\n') as scorefile:
        wr=csv.writer(scorefile,dialect='excel')
        wr.writerows(outputmid)
    #This takes the output received from GSSS and appends it to an external file.
        
    print('just added '+str(len(outputmid))+' scores!')
    #This tells you that you were successfully able to save x number of scores to the file.

current iteration is: 0


NameError: name 'rank_candidates' is not defined