## NOTE
Please run the cells of this notebook on Google Colab or a local Jupyter server to see the performance and results of this solution.

To use these functions in your own code, a Python file containing all the functions in this notebook can be found in the same directory as the notebook. Move that file into your code's current working directory, then import the functions from it by name.


All code is commented and explained.

In [2]:
import spacy # install spaCy and its 'en_core_web_sm' model when running on a local Jupyter server.
import numpy as np
import pandas as pd




## Problem
Given a student's email as text that contains the stem 'share' and the word 'email', build a function that classifies the email as either (i) the student has shared the email address, or (ii) the student wants to know whether they can share it.

## Solution Explanation
Since the emails from the students are basic one-liners, this problem can be solved by looking at the structure of the English text. The task reduces to checking whether an email is structured as a Yes/No question or not.

For example, 'Can I share your email?', 'Am I allowed to share your email' and 'I can share your email, can't I?' are all Yes/No questions. In English, a Yes/No question has recognizable features. The solution relies on the following three:

- The auxiliary verb comes immediately before the pronoun/subject at the beginning or end of the sentence. In a statement, it is the other way round. For example, in 'Can I share your email?' the auxiliary verb 'can' comes immediately before the subject/pronoun 'I'. The same holds for the tag question in 'I can share your email, can't I?'.
This is by far the strongest indicator of a Yes/No question and consequently has the greatest weight in my solution.

- A question mark '?' at the end of the sentence, which marks it as a question.

- Specific to this problem, the past form of share: 'shared'. The word 'shared' in a student's email indicates that the email address had already been shared before the student sent the email in question.
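The three cues can be illustrated without spaCy on hand-tagged sentences. This is only a minimal sketch: the function name `extract_signals` and the hard-coded Penn Treebank style tags are assumptions made for illustration, whereas in the notebook the tags come from spaCy's tagger.

```python
# Illustrative only: tokens are hand-tagged as (text, fine_tag, coarse_pos)
# instead of being produced by a real POS tagger.

def extract_signals(tagged_tokens):
    """Return (aux_before_pronoun, has_question_mark, shared_in_past) as 0/1 flags."""
    # Cue 1: an auxiliary (coarse POS 'AUX' or modal tag 'MD') directly
    # followed by a personal pronoun (fine tag 'PRP').
    aux_before_pronoun = int(any(
        (cur[2] == "AUX" or cur[1] == "MD") and nxt[1] == "PRP"
        for cur, nxt in zip(tagged_tokens, tagged_tokens[1:])
    ))
    # Cue 2: a question mark anywhere in the email.
    has_question_mark = int(any(tok[0] == "?" for tok in tagged_tokens))
    # Cue 3: 'shared' tagged as a past participle ('VBN').
    shared_in_past = int(any(
        tok[0].lower() == "shared" and tok[1] == "VBN" for tok in tagged_tokens
    ))
    return aux_before_pronoun, has_question_mark, shared_in_past


question = [("Can", "MD", "AUX"), ("I", "PRP", "PRON"), ("share", "VB", "VERB"),
            ("your", "PRP$", "PRON"), ("email", "NN", "NOUN"), ("?", ".", "PUNCT")]
statement = [("I", "PRP", "PRON"), ("have", "VBP", "AUX"), ("shared", "VBN", "VERB"),
             ("your", "PRP$", "PRON"), ("email", "NN", "NOUN")]

print(extract_signals(question))   # (1, 1, 0)
print(extract_signals(statement))  # (0, 0, 1)
```

In the statement, the auxiliary 'have' follows the pronoun, so the first cue stays at 0 even though an auxiliary is present.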



## Implementation of Solution
I use the three points above to build a solution to the problem.

To do this effectively, I need a part-of-speech (POS) tagger. I use the built-in POS tagger in the spaCy library to retrieve the part of speech of every word in an email text.

The code below is commented. Five functions are defined:

- `batch_classify()`: This is the main function. It takes a single email string or a list of email strings and calls all the other functions. For each email it produces either 'Student wants to know if your email can be shared.' or 'Student has shared your email.'.

- `batch_preprocess()`: This function takes an email (provided it contains the stem 'share' and the word 'email') and tokenizes it using the spaCy library. It then collects the POS tags of each token. Using these tags, we can later check whether the auxiliary verb comes immediately before the subject/pronoun. It also uses the tags to check whether 'share' appears in its base form or its past participle form ('shared') (the tense score) and whether there is a '?' in the email text (the punctuation score). Tokenization and tagging are run with multiprocessing.

- `batch_get_score()`: This function checks whether the auxiliary verb comes immediately before the pronoun/subject at the beginning or end of the sentence. It assigns a score of 1 if it does and 0 if it does not (the verb_before_pronoun score).

- `batch_calc_total_score()`: It combines the three criteria above into a total score using a predefined formula (a weighted average).

- `pad_sequences()`: This function pads every tokenized email to the same length so that multiple emails can be processed in batches.
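As a rough illustration of the padding step, the sketch below pads token lists on the right with the same placeholder tuple used later in the notebook. The helper name `pad_to_length` is hypothetical, and unlike the notebook's `pad_sequences` this variant returns new lists instead of modifying its input.

```python
PAD = ("sp", "sp", "sp", "sp")  # placeholder token, mirroring the notebook's pad value


def pad_to_length(token_lists, max_length, pad_value=PAD):
    """Return copies of the token lists, right-padded to max_length."""
    return [
        tokens + [pad_value] * (max_length - len(tokens))
        for tokens in token_lists
    ]


batch = [
    [("Can", "MD", "AUX", "verb, modal auxiliary"), ("I", "PRP", "PRON", "pronoun, personal")],
    [("I", "PRP", "PRON", "pronoun, personal")],
]
padded = pad_to_length(batch, max_length=3)

print([len(seq) for seq in padded])  # [3, 3]
print(len(batch[1]))                 # 1, the original list is unchanged
```

Equal lengths matter because the padded batch is later converted to a single rectangular NumPy array.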

#### Workflow
When called, `batch_classify` passes its arguments to `batch_preprocess`, which first checks whether each email contains the stem 'share' and the word 'email'. Relevant emails are tokenized, and the POS information of each token is stored in a tuple together with the token itself.

The tense form of the verb 'share' is also recorded: 1 if it is a past participle, 0 if it is in base or present form **(tense_score)**. The function also records whether the email contains a '?': 1 if it does, 0 if it does not **(punctuation score)**.

The lists of tuples are padded to equal length using `pad_sequences` and then passed to `batch_get_score`, which checks whether an auxiliary verb comes immediately before a pronoun/subject. I take advantage of multidimensional NumPy arrays to evaluate this condition for a whole batch at once. If an email meets the condition, the function returns 1 for it, otherwise 0 **(verb_before_pronoun_score)**.
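The batched check can be sketched with `np.roll`, which aligns every token with its successor so the comparison runs on the whole array at once. The sketch below is a simplified stand-in for the uncontracted case only; the hand-written tags are illustrative, and the wrap-around mask is an addition that is not part of the notebook's `batch_get_score`.

```python
import numpy as np

PAD = ("sp", "sp", "sp", "sp")

# Two padded sentences; each token is (text, tag, pos, explanation).
batch = np.array([
    [("Can", "MD", "AUX", "verb, modal auxiliary"), ("I", "PRP", "PRON", "pronoun, personal"),
     ("share", "VB", "VERB", "verb, base form"), PAD],
    [("I", "PRP", "PRON", "pronoun, personal"), ("can", "MD", "AUX", "verb, modal auxiliary"),
     ("share", "VB", "VERB", "verb, base form"), PAD],
])
# batch.shape == (2, 4, 4): (batch_size, tokens_per_sentence, properties)

# Shift tokens one step left along the token axis, so position i holds token i + 1.
next_token = np.roll(batch, -1, axis=1)

is_aux = batch[:, :, 2] == "AUX"             # coarse POS of the current token
next_is_prp = next_token[:, :, 1] == "PRP"   # fine tag of the following token

# np.roll wraps around, pairing the last position with the first token.
# Mask that position so no match is reported across the sentence boundary.
next_is_prp[:, -1] = False

aux_before_pronoun = np.any(is_aux & next_is_prp, axis=1).astype(int)
print(aux_before_pronoun)  # [1 0]
```

Without the mask, a sentence that fills the whole padded length, ends in an auxiliary and starts with a pronoun would be matched through the wrap-around.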

These three scores **(tense_score, punctuation score, verb_before_pronoun_score)** are then passed to `batch_calc_total_score`, which uses a set of customizable predefined weights to calculate a final score. If no weights are passed (`weight=None`), it calculates a simple average of the three scores.


If this final score is >= 0.5, `batch_classify` outputs 'Student wants to know if your email can be shared.'; otherwise it outputs 'Student has shared your email.'. **This is done for each email if a list is provided.**
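To make the arithmetic concrete, here is a small scalar sketch of the scoring and thresholding step. The names `total_score` and `label` are hypothetical; the weights, the `1 - tense` term and the 0.5 threshold follow the notebook's `batch_calc_total_score` and `batch_classify`.

```python
def total_score(aux_before_pronoun, shared_in_past, has_question_mark,
                weights=(0.5, 0.3, 0.2)):
    """Combine the three 0/1 signals; weights=None gives a plain average."""
    # 'shared' counts against a question, so the tense signal is inverted.
    signals = (aux_before_pronoun, 1 - shared_in_past, has_question_mark)
    if weights is None:
        return sum(signals) / 3
    return sum(w * s for w, s in zip(weights, signals))


def label(score, threshold=0.5):
    """Map a score to one of the two output strings."""
    if score >= threshold:
        return "Student wants to know if your email can be shared."
    return "Student has shared your email."


# "I have shared your email. Haven't I":
# auxiliary before pronoun (1), 'shared' present (1), no '?' (0).
weighted = total_score(1, 1, 0)
averaged = total_score(1, 1, 0, weights=None)

print(weighted, label(weighted))
print(round(averaged, 6), label(averaged))
```

With the default weights the score is 0.5 * 1 + 0.3 * 0 + 0.2 * 0 = 0.5, which just reaches the threshold; the plain average is 1/3, which does not. This is why the two modes can label the same email differently.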


Later in the notebook, the solution is tested on a list of 30 manually crafted emails (including some that lack the stem 'share' or the word 'email') so that its behaviour can be inspected. The dataset is then scaled to about 12,000 emails to test batch processing.




## Note
The algorithm has been optimized to process emails in batches. This was shown to perform 319 times faster than serial (one-at-a-time) processing: on average, the optimized algorithm took 1.5 seconds to process 567 emails, compared with about 383 seconds when processing one email at a time.


In [4]:
def batch_preprocess(sentences, batch_size=50):
  """
    Takes a sentence (or a list of sentences) and converts it into a list of lists of tuples. Each tuple contains a token and its tag properties.
    It also records the tense form of the stem word 'share' in each email and the presence of the '?' punctuation mark.
    Args:
    - sentences. A text string or a list of strings (emails)
    - batch_size. int. number of emails spaCy processes per batch.
    Output (None if no email is relevant):
    - appr_email_index. list of int. contains the indexes of the emails that have the stem 'share' and the word 'email' in them.
    - sent_tokenized_list. list of list of tuples. list of emails tokenized along with their tag properties.
    - tense_scores. array of int. contains a score for each sentence (email) signifying if it contains the past participle of 'share'
    - punc_scores. array of int. contains a score for each sentence (email) signifying if it contains a '?'.
    - sentence_length. list of int. contains the number of tokens in each email or sentence.
  """

  if isinstance(sentences, str):# Check if the sentence argument is a string or a list of email text
    sentences = [sentences]# if yes, put the sentence into a list.

  sent_tokenized_list = [] # create list to store the tokenized form of the email texts along with its properties.
  tense_scores = [] # Create a list variable that scores the tense form of 'share' in the email. if in past participle {shared}, it appends 1 else it appends 0.
  appr_email_index = [] # Create a list that stores the indexes of all relevant emails (that contain the stem 'share' and the word 'email')
  sentence_length=[] # Create a list that stores the length of tokens in each sentence.
  punc_scores = [] # Create a list that stores 1 if a sentence contains a '?' and 0 if it does not.

  # Load our tokenizer object from spacy library
  nlp = spacy.load('en_core_web_sm')
  # For each email in the email_list:
  for ind, sentence in enumerate(nlp.pipe(sentences, n_process=-1, batch_size = batch_size, disable=["tok2vec",'ner', 'textcat', 'parser', "attribute_ruler"])):
    # if email contains the stem 'share' and the word 'email' (lemmas are joined once and reused):
    lemmas = " ".join([token.lemma_ for token in sentence])
    if 'share' in lemmas and 'email' in lemmas:
      sent_tokenized = [] # create list to store the tokens of the index sentence along with its properties. It stores each token and attributes as a tuple

      is_punc = 0 # flag: set to 1 when the email contains a '?'

      tense_score = 0 # set to 1 when 'share' appears as a past participle ('shared')

      # number of tokens (words and punctuation) in this email
      length_of_tokens = 0

      # For each token in the current sentence
      for token in sentence:
        # update the token count
        length_of_tokens += 1
        # If the lemma of the token is 'share' (for example, the lemma of 'sharing' or 'shared' is 'share'):
        if token.lemma_ == 'share':
          # Append the lemma with its properties, then set tense_score to 1 if the token is a past participle.
          sent_tokenized.append((token.lemma_, token.tag_, token.pos_, spacy.explain(token.tag_)))
          if spacy.explain(token.tag_) == 'verb, past participle':
            tense_score = 1

        elif token.lemma_ == '?': # if the token is '?', set is_punc to 1; the token itself is not stored
          is_punc = 1
        else: # otherwise, append the token and its properties to the list
          sent_tokenized.append((token, token.tag_, token.pos_, spacy.explain(token.tag_)))
      
      # update all the lists created at the beginning of the function accordingly at each sentence (email text) level.
      sentence_length.append(length_of_tokens)
      tense_scores.append(tense_score)
      punc_scores.append(is_punc)
      appr_email_index.append(ind)
      sent_tokenized_list.append(sent_tokenized)
    else: # Else, skip the email
      continue

  # If there are relevant emails (that contain the stem 'share' and the word 'email') in the list of email text, return all the updated list
  if len(appr_email_index) != 0 : 
    return appr_email_index, sent_tokenized_list, np.array(tense_scores), np.array(punc_scores), sentence_length
  else: # Else return None signifying that all emails in the list are invalid for this operation
    return None


def batch_get_score(token_sent):
  '''
    Calculates a score (0 or 1) that represents whether an auxiliary verb in each sentence comes immediately before a pronoun.
    Args:
    - token_sent. list of list of tuples (list of list of tokens and corresponding tag properties)
    Output:
    - comes_before_pronoun. a list of scores : (0 or 1)
  '''
  # token_sent.shape == (batch_size, word_length_per_sentence, word_properties {4})
  token_sent = np.array(token_sent)

  # Pair each token with the token that follows it.
  # token_array.shape == (batch_size, word_length_per_sentence, word_properties*2)
  token_array = np.concatenate((token_sent,np.roll(token_sent, -1, 1)), axis= -1)

  # Check if an uncontracted auxiliary verb comes immediately before a pronoun/subject. E.g 'CAN I send your email'
  # Columns 2 and 3 are the pos and tag explanation of the current token; column 5 is the tag of the next token.
  # temp.shape == (batch_size, 2)
  temp = np.concatenate((np.any(np.all(token_array[:,:,[2,5]]== ['AUX','PRP'], axis=2), axis=1, keepdims=True),
                         np.any(np.all(token_array[:,:,[3,5]]== ['verb, modal auxiliary','PRP'], axis=2 ), axis=1, keepdims=True)),axis= -1) 
  
  # verb_before_pronoun_score.shape == (batch_size, 1)
  verb_before_pronoun_score = np.any(temp, axis=-1, keepdims=True)

  # Check if a contracted auxiliary verb comes immediately before a pronoun/subject. E.g 'I can share your email, CAN'T I'
  # Each token is paired with the next two tokens (auxiliary, "n't", pronoun).
  # token_array.shape == (batch_size, word_length_per_sentence, word_properties*3)
  token_array = np.concatenate((token_sent, np.roll(token_sent, -1, 1), np.roll(token_sent, -2, 1)), axis=-1)

  # temp.shape == (batch_size, 2)
  temp = np.concatenate((np.any(np.all(token_array[:,:,[2,5,6,7,9]]== ['AUX','RB', 'PART','adverb','PRP'], axis=2), axis=1, keepdims=True),
                         np.any(np.all(token_array[:,:,[3,5,6,7,9]]== ['verb, modal auxiliary','RB', 'PART',
                           'adverb', 'PRP'], axis=2), axis=1, keepdims=True)), axis= -1)
  
  # Check if each sentence has the contracted or the uncontracted version of an auxiliary verb coming before the pronoun.
  # verb_before_pronoun_score.shape == (batch_size, 2)
  verb_before_pronoun_score = np.concatenate((verb_before_pronoun_score, np.any(temp, axis=-1, keepdims=True)), axis=-1)
  # verb_before_pronoun_score.shape == (batch_size,)
  verb_before_pronoun_score = np.any(verb_before_pronoun_score, axis=-1).astype('int')

  return verb_before_pronoun_score


def batch_calc_total_score(verb_before_pronoun_score, tense_score, punc_score, weight = [0.5, 0.3, 0.2]):
  '''
     Calculate the final score given 3 scores provided as positional arguments. It uses the weight parameter to calculate the final score if given,
     otherwise it calculates a plain average over the 3 scores.
     Args:
     - verb_before_pronoun_score : array of integers (0,1). whether an auxiliary verb comes before the pronoun. 1 if it does else 0.
     - tense_score : array of integers (0,1). represents whether a sentence contains the past participle form of 'share' or not.
     - punc_score : array of integers (0,1). represents whether a sentence contains a '?' or not.
     - weight : list of 3 values or None. Modifiable. it's used to calculate a weighted average of the 3 arguments above.
    Output:
    - score. an array of calculated scores for each sentence (email)
  '''
  # If weight is given:
  if weight is not None:
    # Calculate weighted average score using contents of the weight.
    # The tense score is inverted because 'shared' counts against a question.
    score = verb_before_pronoun_score*weight[0] + (1 - tense_score) * weight[1] + punc_score * weight[2]
  # Else, calculate the average of the 3 scores.
  else:
    score = (verb_before_pronoun_score + (1 - tense_score) + punc_score)/3

  return score



def pad_sequences(sentences, max_length , pad_value= [('sp','sp','sp','sp')]):
  '''
    Pads all emails to the same token length using the value of the pad_value argument. The input lists are modified in place.
    Args:
    - sentences: list of list of tuples. list of tokenized emails.
    - max_length: int. length to which to pad sequences.
    - pad_value: list containing one tuple. the value to use in padding the sequences.
    Output: 
    - sentences. list of sequences padded to equal length.
  '''
  # Go through each email and pad its token sequence up to max_length with the provided pad_value.
  for ind  in range(len(sentences)):
    sentences[ind] += pad_value * (max_length - len(sentences[ind]))
  return sentences


def batch_classify(sentences, batch_size=250,  weight = [0.5, 0.3, 0.2], return_score=True):
  '''
    Main function: processes the list of email text, passing it to all other functions then returns the result in string format.
    Args:
    - sentences: list of str or str. list of emails, or a single email in string format.
    - batch_size: int. number of emails to process at once.
    - weight: list of float or None. weights to use for the calculation; None gives a plain average.
    - return_score: bool. whether the calculated scores should be returned as well.
    Output:
    - A list of strings whose elements states either : 'Student wants to know if your email can be shared' or 'Student has shared your email'
  '''
  # Check if sentences is a string in the case of just one email passed.
  # Set the is_text variable True or False accordingly.
  if isinstance(sentences, str):
    is_text = True
  else:
    is_text = False

  # Call the batch_preprocess function. It returns None when no email is relevant,
  # so check for that before unpacking the result.
  preprocessed = batch_preprocess(sentences, batch_size = batch_size)

  # If no email contains the stem 'share' and the word 'email', return the string
  if preprocessed is None:
    return "The text provided does not contain the stem 'share' and the word 'email' and so, could not be processed."

  email_ids , sentences, tense_scores , punc_scores , sent_length = preprocessed
  
  # Create a list to append the result of the algorithm
  result = []
  # Create a list to store all the scores for each of the email text.
  scores = []

  # pad each processed email in the list.
  sentences = pad_sequences(sentences, max(sent_length))

  # process and calculate score in batches (using the batch_size):
  for start in range(0, len(sentences), batch_size):
    # Get the score that represents whether the auxiliary verb comes before the pronoun:
    verb_before_pronoun_score = batch_get_score(sentences[start: start+batch_size])
    # Calculate the final scores for each email in the list
    score = batch_calc_total_score(verb_before_pronoun_score, tense_scores[start: start+batch_size], punc_scores[start: start+batch_size], weight= weight)
    
    # Convert scores to output string results : 'Student wants to know if your email can be shared.' or 'Student has shared your email.'
    # Then append to the result list.
    result.extend(list(np.where(score >= 0.5, 'Student wants to know if your email can be shared.', 'Student has shared your email.')))
    # append scores to the scores list.
    scores.extend(list(score))

  # If one email was passed, return the result of the only element in the result list.
  if is_text:
    # if return score, return the calculated score of the email as well.
    if return_score:
      return result[0], scores[0]
    else:
      return result[0]
  else: # Else, return the list of relevant email ids , corresponding results +/- scores.
    if return_score:
      return email_ids, result, scores
    else: 
      return email_ids, result


## Test Solution
Below, I provide a list of 30 manually crafted sentences that will be used to test the solution. The 2nd and 3rd email messages in the list do not contain 'share' and 'email'. The 13th (index 12) does not contain 'email'. Because a comma is missing between the fifth and sixth string literals, Python concatenates them into a single entry, which is why the list holds 30 elements.

The results are shown below and saved to a CSV file called 'validation.csv'. This file can be found in the same directory as this notebook.

In [11]:

sentence = ["I have shared your email",'I have learnt a lot from you, thank you.', "Thank you for helping me, mary", "I have shared your email. Haven't I" , "Your email has been shared with my friends." "I have shared your email. Haven't I", "May i share your email?", 'I may share your email', "I can share your email, can't I?",
            "Can you help my friend if i share your email.", "Do you mind if i share your email", "Will it be possible for me to share your email",
            "Can I keep sharing your email?", "I have shared it, haven't I?","Can i share your email", "I will share your email", "I shall share your email",
            "I've shared your email", "Should I share your email", "Am I allowed to share your email","Am I able to share your email", "I am able to share your email",
            "Will you help my friends if I share your email with them?", "I can keep sharing your email, can't I?", "Should I not share your email", "I will keep sharing your email",
            "Can I continue sharing your email", "Do you want me to share your email","Can I not share your email", "Your email will be shared to my friends",
            "Do you mind if I share your email with my friends",]

In [12]:
len(sentence)

30

In [13]:
email_ids , result , scores= batch_classify(sentence)

In [14]:
len(email_ids), len(result), len(scores)

(27, 27, 27)

In [15]:
df = pd.DataFrame({'Email indexes': email_ids, 'Sentences': np.array(sentence)[email_ids] , 'Targets' : result , 'Scores': scores})
df.to_csv('validation.csv')
df.head(-1)

| | Email indexes | Sentences | Targets | Scores |
|---|---|---|---|---|
| 0 | 0 | I have shared your email | Student has shared your email. | 0.0 |
| 1 | 3 | I have shared your email. Haven't I | Student wants to know if your email can be sha... | 0.5 |
| 2 | 4 | Your email has been shared with my friends.I h... | Student wants to know if your email can be sha... | 0.5 |
| 3 | 5 | May i share your email? | Student wants to know if your email can be sha... | 1.0 |
| 4 | 6 | I may share your email | Student has shared your email. | 0.3 |
| 5 | 7 | I can share your email, can't I? | Student wants to know if your email can be sha... | 1.0 |
| 6 | 8 | Can you help my friend if i share your email. | Student wants to know if your email can be sha... | 0.8 |
| 7 | 9 | Do you mind if i share your email | Student wants to know if your email can be sha... | 0.8 |
| 8 | 10 | Will it be possible for me to share your email | Student wants to know if your email can be sha... | 0.8 |
| 9 | 11 | Can I keep sharing your email? | Student wants to know if your email can be sha... | 1.0 |


Depending on whether weights are provided to calculate the final score, the results might differ slightly.

For example, take the 2nd row of the table above:

I have shared your email. Haven't I

The result differs between the two modes. This statement suggests that the student believes the email has been shared in the past but has forgotten or is no longer certain.

In [16]:
email_ids, result , scores = batch_classify(sentence, weight=None)
df = pd.DataFrame({'Email indexes': email_ids, 'Sentences': np.array(sentence)[email_ids] , 'Targets' : result , 'Scores': scores})
df.to_csv('validation2.csv')
df.head(-1)

| | Email indexes | Sentences | Targets | Scores |
|---|---|---|---|---|
| 0 | 0 | I have shared your email | Student has shared your email. | 0.0 |
| 1 | 3 | I have shared your email. Haven't I | Student has shared your email. | 0.333333 |
| 2 | 4 | Your email has been shared with my friends.I h... | Student has shared your email. | 0.333333 |
| 3 | 5 | May i share your email? | Student wants to know if your email can be sha... | 1.0 |
| 4 | 6 | I may share your email | Student has shared your email. | 0.333333 |
| 5 | 7 | I can share your email, can't I? | Student wants to know if your email can be sha... | 1.0 |
| 6 | 8 | Can you help my friend if i share your email. | Student wants to know if your email can be sha... | 0.666667 |
| 7 | 9 | Do you mind if i share your email | Student wants to know if your email can be sha... | 0.666667 |
| 8 | 10 | Will it be possible for me to share your email | Student wants to know if your email can be sha... | 0.666667 |
| 9 | 11 | Can I keep sharing your email? | Student wants to know if your email can be sha... | 1.0 |


## Evaluation of Model Efficiency On Scaled Data

Now, let's evaluate the efficiency of the model on more data. We simply extend our manually crafted set from 30 samples to 12030 samples by repeating it, then measure the time the model takes to process 1 sample and all 12030 samples.

In [17]:
sentence.extend(sentence * 400)
len(sentence)

12030

In [18]:
from time import time
# Time a single-email call.
start = time()
result = batch_classify(sentence[0])
print(f'Time taken for 1 sample processing : {time() - start} seconds ')

Time taken for 1 sample processing : 0.8039522171020508 seconds 


In [19]:
from time import time
# Time one batched call over the whole scaled dataset.
start = time()
result = batch_classify(sentence, batch_size=2000)
print(f'Time taken for {len(sentence)} samples processing : {time() - start} seconds ')

Time taken for 12030 samples processing : 7.7364184856414795 seconds 


It took about 7.7 seconds to process 12030 samples and about 0.8 seconds to process just 1 sample.
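The two timings can be turned into per-email figures with a little arithmetic. The sketch below only uses the numbers printed above; the projected one-by-one cost assumes that every separate call costs as much as the single-sample call, which includes fixed overhead such as `spacy.load` being executed inside `batch_preprocess` on each call.

```python
# Timings printed by the two cells above (seconds).
single_call_seconds = 0.8039522171020508   # 1 email
batch_call_seconds = 7.7364184856414795    # 12030 emails
n_emails = 12030

seconds_per_email = batch_call_seconds / n_emails
emails_per_second = n_emails / batch_call_seconds

# Assumption: classifying each email in its own call costs the same as the single call.
projected_one_by_one_seconds = single_call_seconds * n_emails

print(f"{seconds_per_email * 1000:.2f} ms per email in batch mode")
print(f"{emails_per_second:.0f} emails per second in batch mode")
print(f"{projected_one_by_one_seconds / 60:.0f} minutes if every email were classified in its own call")
```

The comparison suggests that most of the single-sample time is fixed per-call overhead and not per-email work, which is what batching amortizes.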
