# TezaAI programming assignment

To run this notebook, follow the steps below.

Requirement:
1. Python 3.6.5

Open Terminal

1. Navigate to the project folder (cd {downloaded_file}/teza_ai) 
2. Activate the virtual environment (. ./setup-env.sh)
3. Install the dependencies (pip install -r requirements.txt)
4. Start "Jupyter Notebook" (jupyter notebook)
5. Click on the notebook "TezaAI Programming Assignment"

In [42]:
# Libraries import

import cerberus
import json
import os
import re
import numpy as np
import string
from collections import Counter 
from pprint import pprint
from deepmerge import always_merger
import spacy
from spacy import displacy
from spacy.tokens import Span
from spacy.matcher import PhraseMatcher
from spacy.pipeline import SentenceSegmenter

In [43]:
nlp = spacy.load('en')

## Input Files Validation schema

The code below validates the format of the input files,
i.e. it checks that each file contains the required fields with the expected types.

In [3]:
# building validation dict based on the structure below

# {
#  "_id": "https://www.ekmhinnovators.com/ekmh-innovators-blog-beta/interview-ourcrowd-ceo-jon-medved-on-impact-investing-crowdfunding",
#  "title": "Interview: OurCrowd CEO Jon Medved on Crowdfunding, Beyond ...",
#  "body": "EKMH Innovators Interview Series An interview ...",
#  "origin": "google custom search",
#  "feedId": 103,
#  "jobId": "37b3e04c-cf7d-4032-82ad-a2bd89dc90ac",
#  "person": {
# 	 "id": "16",
# 	 "name": "Jon Medved"
#  }
# }

STRING_MANDATORY = {'type': 'string', 'empty': False, 'required': True}
INT_MANDATORY = {'type': 'integer', 'empty': False, 'required': True}
VALIDATION_SCHEMA = {
    '_id': STRING_MANDATORY, 
    'title': STRING_MANDATORY, 
    'body': STRING_MANDATORY, 
    'origin': STRING_MANDATORY, 
    'feedId': INT_MANDATORY, 
    'jobId': STRING_MANDATORY, 
    'person': {
        'type': 'dict', 'required': True, 'empty': False, 
            'schema': {
                'id': STRING_MANDATORY,
                'name': STRING_MANDATORY
            }
        }
    }

In [4]:
# displaying validation dictionary

pprint(VALIDATION_SCHEMA)

{'_id': {'empty': False, 'required': True, 'type': 'string'},
 'body': {'empty': False, 'required': True, 'type': 'string'},
 'feedId': {'empty': False, 'required': True, 'type': 'integer'},
 'jobId': {'empty': False, 'required': True, 'type': 'string'},
 'origin': {'empty': False, 'required': True, 'type': 'string'},
 'person': {'empty': False,
            'required': True,
            'schema': {'id': {'empty': False,
                              'required': True,
                              'type': 'string'},
                       'name': {'empty': False,
                                'required': True,
                                'type': 'string'}},
            'type': 'dict'},
 'title': {'empty': False, 'required': True, 'type': 'string'}}


In [5]:
# Data global var

DATA = None

In [6]:
"""
Validate input data against the schema; raise ValueError on mismatch
"""
def validate_schema(entities, schema, filepath):
    checker = cerberus.Validator()
    checker.allow_unknown = True  # tolerate extra fields not listed in the schema
    if checker.validate(entities, schema):
        print(f'Input schema validated for {filepath} !')
    else:
        raise ValueError(f'Format mismatch for the input {filepath}: {checker.errors}')

In [7]:
"""
Validate and read the JSON data
filepath: path of the json file
"""
def read_and_validate_data(filepath):
    with open(filepath) as json_file:
        data = json.load(json_file)
    validate_schema(entities=data, schema=VALIDATION_SCHEMA,
                    filepath=filepath)
    return data

Before executing the code below, you can add more files (interview JSONs) to the data folder.

The data folder shipped with the source code contains the 6 provided JSON interview files.



In [8]:
# Path of the data files
# Add more files to the project folder or change this path to read from any 
# other directory path
DATA_FOLDER = os.path.join(os.path.curdir, 'data') # by default it takes the JSONs from project folder

# only considering JSON files at the moment
interview_files_list = [os.path.join(DATA_FOLDER, f) for f in os.listdir(DATA_FOLDER)
                        if os.path.splitext(f)[1] == '.json']


print(interview_files_list)
if len(interview_files_list) < 1:
    proceed = False
    error = "No data file available"

['./data/0.json', './data/1.json', './data/2.json', './data/3.json', './data/4.json', './data/5.json']


In [9]:
# Reading in the data input files while also 
# validating them
DATA = [read_and_validate_data(file) for file in interview_files_list]

Input schema validated for ./data/0.json !
Input schema validated for ./data/1.json !
Input schema validated for ./data/2.json !
Input schema validated for ./data/3.json !
Input schema validated for ./data/4.json !
Input schema validated for ./data/5.json !


In [10]:
interview_one = nlp(DATA[5]['body'])

In [11]:
for word in interview_one:
    print(f"{word.text:{10}} {word.pos_:{10}} {word.tag_:{10}} {spacy.explain(word.tag_)}")

The        DET        DT         determiner
following  VERB       VBG        verb, gerund or present participle
is         AUX        VBZ        verb, 3rd person singular present
a          DET        DT         determiner
script     NOUN       NN         noun, singular or mass
from       ADP        IN         conjunction, subordinating or preposition
"          PUNCT      ``         opening quotation mark
Inside     PROPN      NNP        noun, proper singular
Apple      PROPN      NNP        noun, proper singular
"          PUNCT      ''         closing quotation mark
which      DET        WDT        wh-determiner
aired      VERB       VBD        verb, past tense
on         ADP        IN         conjunction, subordinating or preposition
Dec.       PROPN      NNP        noun, proper singular
20         NUM        CD         cardinal number
,          PUNCT      ,          punctuation mark, comma
2015       NUM        CD         cardinal number
.          PUNCT      .          punctuati

we         PRON       PRP        pronoun, personal
can        VERB       MD         verb, modal auxiliary
do         AUX        VB         verb, base form
.          PUNCT      .          punctuation mark, sentence closer
[          PUNCT      -LRB-      left round bracket
Tim        PROPN      NNP        noun, proper singular
Cook       PROPN      NNP        noun, proper singular
:          PUNCT      :          punctuation mark, colon or ellipsis
This       DET        DT         determiner
is         AUX        VBZ        verb, 3rd person singular present
the        DET        DT         determiner
future     NOUN       NN         noun, singular or mass
of         ADP        IN         conjunction, subordinating or preposition
television NOUN       NN         noun, singular or mass
,          PUNCT      ,          punctuation mark, comma
coming     VERB       VBG        verb, gerund or present participle
now        ADV        RB         adverb
.          PUNCT      .          punctua

very       ADV        RB         adverb
private    ADJ        JJ         adjective
person     NOUN       NN         noun, singular or mass
.          PUNCT      .          punctuation mark, sentence closer
But        CCONJ      CC         conjunction, coordinating
it         PRON       PRP        pronoun, personal
became     VERB       VBD        verb, past tense
increasingly ADV        RB         adverb
clear      ADJ        JJ         adjective
to         ADP        IN         conjunction, subordinating or preposition
me         PRON       PRP        pronoun, personal
that       SCONJ      IN         conjunction, subordinating or preposition
if         SCONJ      IN         conjunction, subordinating or preposition
I          PRON       PRP        pronoun, personal
said       VERB       VBD        verb, past tense
something  PRON       NN         noun, singular or mass
,          PUNCT      ,          punctuation mark, comma
that       SCONJ      IN         conjunction, subordinating

In [12]:
pos_counts = interview_one.count_by(spacy.attrs.POS)

In [13]:
for k,v in sorted(pos_counts.items()):
    print(f"{k}: {interview_one.vocab[k].text:{5}} {v}")

84: ADJ   287
85: ADP   388
86: ADV   270
87: AUX   370
89: CCONJ 147
90: DET   535
91: INTJ  20
92: NOUN  757
93: NUM   84
94: PART  160
95: PRON  432
96: PROPN 494
97: PUNCT 695
98: SCONJ 99
99: SYM   4
100: VERB  555
101: X     1


In [14]:
print(interview_one[85].text, interview_one.vocab[85].text)

the ADP


In [15]:
options = {
    'distance': 110,
    'compact': True,
    'color': 'white',
    'bg': '#09a3d5',
    'font': 'Times'
}

In [16]:
spans = list(interview_one.sents)

In [17]:
# displacy.render(interview_one[:100], style='dep', jupyter=True, options=options)

# displacy.serve(spans, style='dep', options=options)

In [18]:
def show_ents(doc):
    
    persons = set()
    
    entities = doc.ents
    if entities:
        for entity in entities:
            if entity.label_ == 'PERSON':
                persons.add(entity.text)
#                 print(entity.text + ' - ' + entity.label_ +' _ ' +str(spacy.explain(entity.label)))
    else:
        print('No entities found')
    return persons


In [46]:
def split_on_newlines(doc):
    start = 0
    seen_newline = False
    for word in doc:
        if seen_newline:
            yield doc[start:word.i]
            start = word.i
            seen_newline = False
        elif word.text.startswith('\n'):
            seen_newline = True
    yield doc[start:]

In [19]:
show_ents(interview_one)

{'Ahrendts',
 'Andrew Bast',
 'Angela Ahrendts',
 'Charlie',
 'Charlie Rose',
 'Cook',
 'Dan Riccio',
 'Eddy Cue',
 'Foxconn',
 'Glen Rochkind',
 'Graham',
 'Graham Townsend',
 'Jeff Williams',
 'Jobs',
 'Jony',
 'Jony I',
 'Michael Radutzky',
 'Phil Schiller',
 'Steve',
 'Steve Jobs',
 'Tim Cook',
 'co--',
 'to--'}

## Common programming methods

Let us define some common methods before progressing with interview quotes extraction

In [21]:
"""
Remove special character except the exceptions list
"""
def remove_special_character(text_to_process, exceptions=[], strip=False):
    new_string = str(text_to_process)
    if strip:
        new_string = new_string.strip()
    special_chars = string.printable[62:]
    special_chars = [x for x in special_chars if x not in exceptions]
    for char in text_to_process:
        if char in special_chars:
            new_string = new_string.replace(char, '')
    return new_string

In [22]:
"""
Remove special characters but keep spaces (plus any extra exceptions)
"""
def remove_special_character_without_spaces(text_to_process, exceptions=[]):
    return remove_special_character(text_to_process, [' '] + list(exceptions), strip=True)

In [23]:
"""
Remove special character with custom
"""
def remove_special_character_with_custom(text_to_process, custom=[]):
    return remove_special_character(text_to_process, custom, strip=True)
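
To make the behaviour of these helpers concrete, here is a minimal standalone sketch. The name `strip_special` and the sample labels are made up for illustration and are not part of the notebook. It shows that `string.printable[62:]` covers ASCII punctuation and whitespace only, so a non-ASCII character such as the curly quote `”` is left in place, which may explain why a label like '” then they say' survives the cleaning step later on.

```python
import string

# Same character class the notebook strips: ASCII punctuation plus whitespace.
SPECIAL_CHARS = string.printable[62:]

def strip_special(text, keep=()):
    """Drop every character in SPECIAL_CHARS that is not listed in keep."""
    drop = set(SPECIAL_CHARS) - set(keep)
    return ''.join(ch for ch in text if ch not in drop)

# Brackets are ASCII punctuation, so they are removed; spaces are kept.
cleaned = strip_special('[laughter] phil libin', keep=[' '])
print(cleaned)

# The curly quote is not in string.printable, so it passes through untouched.
curly = strip_special('” then they say', keep=[' '])
print(curly)
```

If such labels should be cleaned too, one option would be to filter with `str.isalnum()` instead of a fixed character list, at the cost of also dropping characters that may be wanted.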

In [24]:
"""
Remove spaces
"""
def remove_spaces(text_to_process):
    return text_to_process.replace(" ", "")

In [25]:
"""
Reverse the array
"""
def reverse_the_array(arr):
    return arr[::-1]

## Custom variables and methods for finding Interviewee and Interviewers

In [26]:
REGULAR_EXPRESSION = r'(?<=\{})(.*?)(?=\:)' # captures the text between a sentence ending and the next colon, i.e. a candidate speaker name


# a sentence in the article generally ends with a full stop or a question mark;
# add more to the list if required
INTERVIEW_ENDING = ['.', '?']  


# Name threshold
THRESHOLD = 20
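
As an illustration of how the pattern is meant to be used, the sketch below fills the placeholder with each sentence ending and collects the text between that ending and the next colon. The sample sentence is invented, and the length filter only mirrors the role of `THRESHOLD`; this is a simplified view of the idea, not the full extraction logic.

```python
import re

REGULAR_EXPRESSION = r'(?<=\{})(.*?)(?=\:)'
INTERVIEW_ENDING = ['.', '?']
THRESHOLD = 20

# Invented sample; the real input is the lower-cased article body.
sample = "thanks for coming. charlie rose: what is next? tim cook: more products."

raw_matches = []
for ending in INTERVIEW_ENDING:
    # For '.', the formatted pattern is (?<=\.)(.*?)(?=\:)
    pattern = REGULAR_EXPRESSION.format(ending)
    raw_matches += re.findall(pattern, sample)

# Keep only short candidates; long captures are unlikely to be names.
names = [m.strip() for m in raw_matches if len(m) < THRESHOLD]
print(names)
```

Because the capture starts right after a sentence ending, any words between that ending and the speaker label end up in the candidate as well, so some noisy candidates are to be expected.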

In [27]:
"""
Predicts the interviewee name used in each article, based on
the name provided in the JSON metadata and on how often each
form of that name occurs in the article text


Assumption: there is only one interviewee per article
"""

def predict(name_doc_mapping={}, softening=False):
    predicted_interviewee = {}
    for key, value in name_doc_mapping.items():
        
        highest = max(value.values())
        
        # on a tie, the last name form with the highest count wins
        for k, v in value.items():
            if v == highest:
                predicted_interviewee[key] = k
        
                
    return predicted_interviewee
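
The selection rule can be hard to see at a glance, so here is a small standalone sketch of it. `pick_most_frequent` is a hypothetical helper that mirrors the loop in `predict` for a single document, and the second sample dictionary is made up. It relies on dictionaries keeping insertion order (guaranteed from Python 3.7, an implementation detail of CPython 3.6).

```python
def pick_most_frequent(counts):
    """Return the last key that holds the highest count."""
    best = max(counts.values())
    winner = None
    for name, count in counts.items():
        if count == best:
            winner = name  # later ties overwrite earlier ones
    return winner

# 'medved' and 'jon medved' tie; the later entry is returned.
tied = pick_most_frequent({'jon': 0, 'medved': 11, 'jon medved': 11})
print(tied)

# Made-up edge case: a name is still returned when nothing was found.
unseen = pick_most_frequent({'joe': 0, 'lonsdale': 0})
print(unseen)
```

The second call suggests that a caller may want to treat a maximum of zero as "no prediction" rather than trusting the returned name.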

In [28]:
"""
Count how often the name occurs as a speaker label ("name:") in the interview text
"""

def element_occurence(name, interview_text):
    # escape the name so that any regex metacharacters in it are matched literally
    return len(re.findall(f'{re.escape(name)}:', interview_text))
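
The following standalone sketch shows what counting "name:" labels does on a short invented transcript. `count_speaker_labels` is a hypothetical stand-in for the function above. Note that a surname and a full name give the same count whenever the label is the full name, while a first name gives zero because it is never directly followed by a colon.

```python
import re

def count_speaker_labels(name, text):
    """Count occurrences of 'name:' in text, matching the name literally."""
    return len(re.findall(re.escape(name) + ':', text))

# Invented transcript in the lower-cased form the notebook works with.
text = "tim cook: hello. charlie rose: hi. tim cook: thanks."

full = count_speaker_labels('tim cook', text)
last = count_speaker_labels('cook', text)
first = count_speaker_labels('tim', text)
print(full, last, first)
```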

# In cell interpretation

In [29]:
# Approach 1: Interviewee name based on name provided in JSON
# And using that info evaluate what form of name is used
# e.g. for a name Tom Hilton, in the article only "Tom" could be used

# ------------------------------------------------------------------
# in-cell compilation of logic ..... Fetch names based 
# ------------------------------------------------------------------

doc = {}

for idx, entity in enumerate(DATA):
    
    name_dict = {}
    
    interview_text = str.lower(entity.get('body')) # interview content
    interviewee = entity.get('person')['name'] # person being interviewed
    
    sub_names = interviewee.split()
    sub_names.append(interviewee)
    names = [x.lower() for x in sub_names] # name list because only part of the name could be used
    
    
    # create a dictionary mapping each form of the interviewee's name to its occurrence count in the article
    for name in names:
        name_dict[name] = element_occurence(name, interview_text)
    doc[f'doc_{idx}'] = name_dict



In [30]:
doc

{'doc_0': {'joe': 23, 'lonsdale': 0, 'joe lonsdale': 0},
 'doc_1': {'balaji': 0, 'srinivasan': 13, 'balaji srinivasan': 0},
 'doc_2': {'jon': 0, 'medved': 11, 'jon medved': 11},
 'doc_3': {'phil': 0, 'libin': 64, 'phil libin': 64},
 'doc_4': {'john': 0, 'mackey': 22, 'john mackey': 1},
 'doc_5': {'tim': 0, 'cook': 43, 'tim cook': 43}}

In [31]:
# Approach 2: Interviewee & Interviewers name based on the article contents 
# and evaluating the results with the approach 1 result to find consolidated results
# ------------------------------------------------------------------
# in-cell compilation of logic ..... Fetch article based entities
# ------------------------------------------------------------------

doc_exprsn = {}

for idx, entity in enumerate(DATA):
    
    count_dict = {}
    all_matches = []
    
    interview_text = str.lower(entity.get('body')) # interview content
    
    # Now we will try to create the same kind of map for the interviewers & interviewees based
    # on article contents
    for ex in INTERVIEW_ENDING:
        expression = REGULAR_EXPRESSION.format(ex)
        matches = re.findall(expression, interview_text)
        all_matches = all_matches + matches
    all_matches = list(map(remove_special_character_without_spaces, all_matches))
    counter = Counter(all_matches)
    for elm in set(all_matches):
        if len(elm) < THRESHOLD:
            count_dict[elm] = counter[elm]
    doc_exprsn[f'doc_{idx}'] = count_dict




In [32]:
doc_exprsn

{'doc_0': {'joe': 16, '” then they say': 1, 'martin': 2},
 'doc_1': {'srinivisan': 1,
  'srinivasan': 12,
  'jackson': 2,
  'balaji s srinivasan': 1},
 'doc_2': {'ekmh': 1, 'jon medved': 8},
 'doc_3': {'laughter phil libin': 1,
  'yeah nicole torres': 1,
  'nicole torres': 18,
  'phil libin': 53},
 'doc_4': {' mackey': 1, 'mackey': 19, 'reason': 1, 'john mackey': 1},
 'doc_5': {'phil schiller': 1,
  'graham townsend': 1,
  'angela ahrendts': 1,
  'tim cook': 30,
  ' tim cook': 2,
  'jony ive': 8,
  'charlie rose': 14}}

## Predicting Interviewers and Interviewees

### Predicting Interviewee

In [33]:
print('Predicting Interviewees names used in the article .......\n')
pprint(predict(doc))

Predicting Interviewees names used in the article .......

{'doc_0': 'joe',
 'doc_1': 'srinivasan',
 'doc_2': 'jon medved',
 'doc_3': 'phil libin',
 'doc_4': 'mackey',
 'doc_5': 'tim cook'}


### Finding potential interviewer

In [34]:
print('Predicting Interviewers names used in the article .......\n')

count = 0
article_name_dict = {}
for key, value in doc_exprsn.items():
    temp_dict = {}
    name_keys = list(value.keys())
    data = remove_special_character_with_custom(str.lower(DATA[count].get('body')), custom=[':', ' '])
    interviewer_names = doc.get(key)
    potential_removers = [ j for k in interviewer_names.keys() for j in name_keys if k in j]
    
    for item in set(potential_removers):
        name_keys.remove(item)
    
    for name_key in name_keys:
        name_key = name_key.strip()
        occurence = element_occurence(name_key, data)
        if  occurence:
            temp_dict[name_key] = occurence
    article_name_dict[f'doc_{count}'] = temp_dict
    count = count + 1
pprint(article_name_dict)


Predicting Interviewers names used in the article .......

{'doc_0': {'martin': 24, '” then they say': 1},
 'doc_1': {'jackson': 14, 'srinivisan': 1},
 'doc_2': {'ekmh': 12},
 'doc_3': {'nicole torres': 62, 'yeah nicole torres': 1},
 'doc_4': {'reason': 22},
 'doc_5': {'angela ahrendts': 2,
           'charlie rose': 54,
           'graham townsend': 5,
           'jony ive': 13,
           'phil schiller': 1}}


## Conclusion

The goal of this exercise was to find the quotes attributed to a person in interview articles taken from websites.

I was able to write a program that tries to predict the interviewer and the interviewee.

E.g. interviewee prediction:


{

    'doc_0': 'joe', 
    
    'doc_1': 'srinivasan', 
    
    'doc_2': 'jon medved',
    
    'doc_3': 'phil libin',
    
    'doc_4': 'mackey',
    
    'doc_5': 'tim cook'
    
}

and interviewer prediction:


{

    'doc_0': {
        'martin': 24, 
        '” then they say': 1
        },
     'doc_1': {
         'jackson': 14, 
         'srinivisan': 1
         },
     'doc_2': {
         'ekmh': 12
     },
     'doc_3': {
         'nicole torres': 62, 
         'yeah nicole torres': 1
      },
     'doc_4': {
         'reason': 22
       },
     'doc_5': {
         'angela ahrendts': 2,
         'charlie rose': 54,
         'graham townsend': 5,
         'jony ive': 13,
         'phil schiller': 1}
         
}


This is a solid start towards finding the quotes by a speaker, who could be either the interviewer or the interviewee.

However, the algorithm that extracts the interviewer and interviewee names needs further tuning.

E.g. the interview article in "1.json" contains typos and uses different forms of the interviewee's name:

**Balaji Srinivasan** 
                    
                    -----> Balaji S. Srinivasan 
                    |
                    |
                    |--> Srinivisan
                    
                    
 Similarly, the interviewer predictions clearly contain outliers.
 
 
 E.g. in "3.json", the algorithm extracts "yeah nicole torres" as an interviewer, which is incorrect; the label should be just "nicole torres".
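
 One possible way to fold such variants together, sketched here as an assumption and not something the notebook implements, is to treat a candidate as the same speaker when it ends with a known name or is a close fuzzy match to it. `same_speaker` and the `cutoff` value of 0.8 are made up for illustration; the comparison uses `difflib.SequenceMatcher` from the standard library.

```python
from difflib import SequenceMatcher

def same_speaker(candidate, known, cutoff=0.8):
    """Guess whether candidate is a noisy variant of the known name."""
    # Case 1: extra words in front, e.g. 'yeah nicole torres'.
    if candidate == known or candidate.endswith(' ' + known):
        return True
    # Case 2: small spelling differences, e.g. 'srinivisan'.
    return SequenceMatcher(None, candidate, known).ratio() >= cutoff

prefixed = same_speaker('yeah nicole torres', 'nicole torres')
typo = same_speaker('srinivisan', 'srinivasan')
other = same_speaker('jackson', 'srinivasan')
print(prefixed, typo, other)
```

 The cutoff would need tuning on real data: too low and different speakers get merged, too high and typos slip through.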
 
 Writing a generic solution that handles all these edge cases **will require more time**.
 
 But the basic steps would be the same:
 
 Step 1: Find the Interviewer accurately
 
 Step 2: Find the Interviewee accurately
 
 Step 3: Parse through colons, and based on the info from step 1 and 2 extract sentences
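
 Step 3 can be sketched as follows, assuming the speaker names from steps 1 and 2 are already known. `split_turns` and the sample transcript are hypothetical; the sketch splits the text at every "name:" label and pairs each label with the text that follows it.

```python
import re

def split_turns(transcript, speakers):
    """Split a 'name: text' transcript into (speaker, text) pairs."""
    # Longer names first, so a longer label is preferred over a shorter
    # one that starts at the same position.
    ordered = sorted(speakers, key=len, reverse=True)
    label = re.compile('(' + '|'.join(re.escape(s) for s in ordered) + r'):\s*')
    # With a capturing group, re.split keeps the matched names:
    # [preamble, speaker1, text1, speaker2, text2, ...]
    parts = label.split(transcript)
    return [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts) - 1, 2)]

# Invented sample in the lower-cased form used elsewhere in the notebook.
transcript = ("intro. charlie rose: what is next? "
              "tim cook: more products. charlie rose: thank you.")

turns = split_turns(transcript, ['charlie rose', 'tim cook'])
interviewee_quotes = [text for speaker, text in turns if speaker == 'tim cook']
print(interviewee_quotes)
```

 This only works as well as the name lists it is given, which is why steps 1 and 2 need to be accurate first.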

 
 

In [35]:
pprint(show_ents(interview_one))

# 'doc_5': {
#      'angela ahrendts': 2,
#      'charlie rose': 54,
#      'graham townsend': 5,
#      'jony ive': 13,
#      'phil schiller': 1
#     }

{'Ahrendts',
 'Andrew Bast',
 'Angela Ahrendts',
 'Charlie',
 'Charlie Rose',
 'Cook',
 'Dan Riccio',
 'Eddy Cue',
 'Foxconn',
 'Glen Rochkind',
 'Graham',
 'Graham Townsend',
 'Jeff Williams',
 'Jobs',
 'Jony',
 'Jony I',
 'Michael Radutzky',
 'Phil Schiller',
 'Steve',
 'Steve Jobs',
 'Tim Cook',
 'co--',
 'to--'}


In [36]:
options = {
    'ents': ["PERSON"]
}

In [41]:
displacy.render(
    interview_one, 
    style='ent', 
    jupyter=True, 
    options=options)