# Assignment : Information Extraction

In [1]:
import nltk
import re
import string
#nltk.download('all')


---

### For subsequent tasks in this assignment, you will work with the documents in `football_players.txt` to perform various information extraction tasks.

In [2]:
# Download the text file (uncomment the line below in this cell, if not already downloaded from Blackboard)
#!curl "https://ideone.com/plain/OvwDXZ" > football_players.txt

 Read all the documents from `football_players.txt` into a list called `docs`.

In [3]:
docs = []
## one document per line; the context manager closes the file after reading
with open('football_players.txt', encoding="utf8") as f:
    for line in f:
        docs.append(line)

## docs now holds the documents from football_players.txt, one per line
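
The reading step can also be wrapped in a small helper. The sketch below assumes one document per line and skips blank lines; the name `read_docs` and the throwaway demo file are illustrative and not part of the assignment.

```python
import os
import tempfile


def read_docs(path):
    """Return the non-empty lines of a UTF-8 text file, one document per line."""
    with open(path, encoding="utf8") as handle:
        # strip() removes the trailing newline; blank lines are skipped
        return [line.strip() for line in handle if line.strip()]


# Demo on a throwaway file (hypothetical content, not the assignment data)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_docs.txt")
    with open(demo_path, "w", encoding="utf8") as handle:
        handle.write("First document.\n\nSecond document.\n")
    demo_docs = read_docs(demo_path)

print(demo_docs)  # ['First document.', 'Second document.']
```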

## Task 2
Write a function that takes a document and returns a list of sentences with part-of-speech tags.

Please keep in mind that the expected output is a list within a list as shown below.


In [4]:
def preprocess_postag(document):
    """Return the POS-tagged sentences of a document.

    The input must be a string. It is split into sentences, each sentence is
    tokenized, and every token is POS tagged. The result is a list of
    sentences, each a list of (token, tag) tuples."""
    pos=[]
    try:
        sentences=nltk.sent_tokenize(document)   ## sentence segmentation: string -> list of sentences
        words=[nltk.word_tokenize(sent) for sent in sentences]   ## tokenization: sentence -> list of tokens
        for j in words:
            pos.append(nltk.pos_tag(j))   ## POS tagging: token -> (token, tag)
        return pos
    except TypeError:
        print('Input should be a string')
  

Run the cell below to verify your result for the second sentence in the first document.
Expected output: 
`[('He', 'PRP'), ('is', 'VBZ'), ('a', 'DT'), ('forward', 'NN'), ('and', 'CC'), ('serves', 'NNS'), ('as', 'IN'), ('captain', 'NN'), ('for', 'IN'), ('Portugal', 'NNP'), ('.', '.')]`

In [5]:
"""first_doc is the first document in the given file."""

first_doc = docs[0]
tagged_sentences = preprocess_postag(first_doc)  ## POS-tagged sentences of the first document
tagged_sentences[1]

[('He', 'PRP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('forward', 'NN'),
 ('and', 'CC'),
 ('serves', 'NNS'),
 ('as', 'IN'),
 ('captain', 'NN'),
 ('for', 'IN'),
 ('Portugal', 'NNP'),
 ('.', '.')]

## Task 3 
Write a function that takes a list of tokens with POS tags for a sentence and returns a list of named entities (NE). 

Hint: Use `binary = True` while calling NE chunk function

In [6]:
def find_named_entities(sent):
    """Return the named entities in one POS-tagged sentence.

    `sent` is a list of (token, tag) tuples. With binary=True, ne_chunk labels
    every named entity as NE; otherwise the classifier adds category labels
    such as PERSON, ORGANIZATION, and GPE."""
    try:
        tree=nltk.ne_chunk(sent,binary=True)       ## chunk tree in which named entities are labelled NE

        named_entities = []                       ## list to store all named entities

        for subtree in tree.subtrees():
            if subtree.label()=='NE':           ## join the tokens of each NE subtree into one string
                entity=' '.join(leaf[0] for leaf in subtree.leaves())
                named_entities.append(entity)

        return named_entities
    except Exception:
        print('Please make sure the input is a list of tokens with POS tags')

Run the cell below to verify your result for the first sentence in the first document.
Expected output: `['Cristiano Ronaldo', 'Santos Aveiro', 'ComM', 'GOIH', 'Portuguese', 'Portuguese', 'Spanish', 'Real Madrid', 'Portugal']`

In [7]:
tagged_sent_doc1 = preprocess_postag(docs[0])    ## POS-tagged sentences of document 1
doc1_sent1=find_named_entities(tagged_sent_doc1[0])   ## named entities in the first sentence of document 1
print(doc1_sent1)

['Cristiano Ronaldo', 'Santos Aveiro', 'ComM', 'GOIH', 'Portuguese', 'Portuguese', 'Spanish', 'Real Madrid', 'Portugal']
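
The traversal inside `find_named_entities` can be tried without running the tagger or chunker. The sketch below builds a small chunk tree by hand as a stand-in for the output of `nltk.ne_chunk(..., binary=True)`; the sentence and the helper name `entities_from_tree` are made up for illustration.

```python
from nltk.tree import Tree

# Hand-built chunk tree standing in for real ne_chunk output (hypothetical sentence)
demo_tree = Tree('S', [
    Tree('NE', [('Real', 'NNP'), ('Madrid', 'NNP')]),
    ('signed', 'VBD'),
    Tree('NE', [('Ronaldo', 'NNP')]),
    ('.', '.'),
])


def entities_from_tree(tree):
    """Join the tokens under every NE subtree into one string per entity."""
    return [
        ' '.join(token for token, _tag in subtree.leaves())
        for subtree in tree.subtrees()
        if subtree.label() == 'NE'
    ]


print(entities_from_tree(demo_tree))  # ['Real Madrid', 'Ronaldo']
```

`subtrees()` also yields the root `S` node, which is why the label check is needed.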


## Task 4 

Implement the `find_all_named_entities` function below to find **all** NEs in a given document.

Hint: Use `find_named_entities` implemented above for this task.

In [8]:
def find_all_named_entities(doc):
    """Return all named entities in a document given as a single string."""
    named_entity=[]
    tagged=preprocess_postag(doc)
    for sent in tagged:
        ne=find_named_entities(sent)
        named_entity.extend(ne)
    
    return named_entity
        

How many named entities did you find in the first document?

In [9]:
named_ent=len(find_all_named_entities(docs[0]))

print("The total no. of named entities in the first document is ",named_ent)


The total no. of named entities in the first document is  56


## Task 5 

Find named entities across **all** documents in `football_players.txt`, and save the result into a single flat list.

In [10]:
def all_named_ent(file):
    all_named_entities = []
    for doc in file:
        all_named_entities.extend(find_all_named_entities(doc))
    
    return all_named_entities
        

How many named entities did you find across all documents?

In [11]:
ne_list=all_named_ent(docs)   ## list of all the named entities across all documents

print("The total no. of named entities across all documents is "+str(len(ne_list)))



The total no. of named entities across all documents is 380
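
The count above depends on the result being a flat list, which is why `extend` is used rather than `append`. A minimal sketch of the difference, with made-up per-document results standing in for the output of `find_all_named_entities`:

```python
from itertools import chain

# Hypothetical per-document results, not taken from the assignment data
per_doc_entities = [
    ['Cristiano Ronaldo', 'Portugal'],
    [],
    ['Lionel Messi', 'Barcelona', 'Argentina'],
]

nested = []
flat = []
for entities in per_doc_entities:
    nested.append(entities)  # append keeps one sub-list per document
    flat.extend(entities)    # extend adds the entities themselves

# Equivalent flattening in one expression
flat_alt = list(chain.from_iterable(per_doc_entities))

print(len(nested), len(flat))  # 3 5
```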


## Task 6 

Write functions to extract the name of the player, country of origin and date of birth as well as the following relations: team(s) of the player and position(s) of the player.

Hint: Use the `re.compile()` function to create the extraction patterns.

Reference: https://docs.python.org/3/howto/regex.html

In [24]:
def name_of_the_player(doc):
    
    """Return the player's name.

    The name appears in the opening sentence together with the word 'born', so a regex first
    extracts everything up to the closing parenthesis after 'born'. That text is tokenized and
    POS tagged, and an NP chunk grammar is parsed over the tags. The leaves of the first NP
    chunk are the name."""
    
    noun_phrases = []
    pattern1=re.compile(r'.*\bborn\b.+?(?=\))') ## text up to (not including) the first ')' after 'born'
    names=pattern1.findall(doc)
    token=nltk.word_tokenize(names[0])
    pos = nltk.pos_tag(token)
    grammar= r"""NP: {<NNP><NN|NNP|,>*<NNP>}"""    ## chunk grammar: proper nouns, optionally with nouns or commas in between
    chunker = nltk.RegexpParser(grammar)
    tree = chunker.parse(pos)   ## parsing with the grammar yields a chunk tree
    
    for subtree in tree.subtrees():   ## the first subtree is the whole sentence, the NP chunks follow
        noun_phrases.append(subtree.leaves())           
    print(noun_phrases)   ## debug output
    ## each entry is a list of (token, tag) tuples; entry 1 is the first NP chunk
    name=[tuple_[0] for tuple_ in noun_phrases[1]]
    name=" ".join(name)   
    return name.replace(',',"")   ## drop any commas inside the name
    
def country_of_origin(doc):
    
    """Return the country of origin: the capitalized word immediately before 'national team',
    extracted with a capturing group."""
    
    pattern=re.compile(r'([A-Z][a-z]+)\s\bnational team')
    c_list=pattern.findall(doc)
    return c_list[0]   ## first mention; indexing a set would return an arbitrary element

def date_of_birth(doc):
    
    r"""Return the date of birth as written after the word 'born'.

    The capturing group matches a day of one or two digits (\d{1,2}), the month name (\w+),
    and a four-digit year (\d{4})."""
    
    dob=re.compile(r'born\b\s(\d{1,2}\s\w+\s\d{4})')
    date=dob.findall(doc)
    return date[0]

    
def team_of_the_player(doc):
    
    """Return the teams of the player as a list.

    Club names follow the word 'club', so the regex captures the rest of the sentence after
    'club' and find_all_named_entities extracts the named entities from it. Some documents give
    no national team information, so the player is assumed to play for his country of origin
    (this might not be true in every case)."""
    
    regex=re.compile(r'[A-Z][a-z]{3,} \bclub(.*?)\.')
    doc1=regex.findall(doc)
    country=country_of_origin(doc)
    national_team=country+' national team'
    ne=[]
    if doc1:       ## some documents contain two sentences that match the pattern
        for i in doc1:
            ## known limitation: in document 4, 'Paris Saint-Germain' is captured only as 'Paris'
            x=find_all_named_entities(i)     
            ne.extend(x)
            
        ## if the country is among the named entities, turn it into '<country> national team';
        ## otherwise add the national team as a separate entry (assumption)
        if country in ne:      
            return [i.replace(country, national_team) for i in ne]
        else: 
            ne.append(national_team)
            return ne
    else:   ## no club found: return only the national team (assumption, might not be true)
        return [national_team]
    
    
def position_of_the_player(doc):    
    
    """Return the position(s) of the player.

    Positions follow different phrases such as 'is a', 'as a' and 'as an', so several patterns
    are used and the position is captured with a group. Positive lookbehind is used to find
    positions that occur after those phrases."""

    pattern1=re.compile(r'(?<=\bis a|\bas a)\s([a-z]+\s)for.[A-Z]+[a-z]+')
    pattern2=re.compile(r'(?<=\bis a)( [a-z]+)[\w+\s+]+for [A-Z][a-z]+')
    pattern3=re.compile(r'(?<=\bas a) (\w+\s[a-z]{5,})')
    pattern4=re.compile(r'(?<=\bas an)( \w*ing\s[a-z]+[\w+\s]+)\.*')
    pattern5=re.compile(r'(?<=\bcareer as a) ([a-z]+)')
    
    list1=pattern1.findall(doc)
    list2=pattern2.findall(doc)
    list3=pattern3.findall(doc)
    list4=pattern4.findall(doc)
    list5=pattern5.findall(doc)
    
    position=list1+list2+list3+list4+list5
        
    if re.search(r'\bor\b', position[0]):   ## several positions separated by 'or': split them into a list
        position=re.split(r'\bor\b|[^a-zA-Z ]+',position[0])
        position=[i.strip() for i in position]
    else:
        position=position[0].strip()
        
        
    return position
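
The `(?<=...)` construct in the patterns above is a positive lookbehind: it requires certain text before the match without including it in the result. Python's `re` module only accepts fixed-width lookbehinds, which is why alternatives such as `\bis a` and `\bas a` must have the same length. A positive lookahead, `(?=...)`, checks the text after the match instead. The sketch below contrasts the two on a made-up sentence; the sentence and the simplified patterns are illustrative and are not the ones used in `position_of_the_player`.

```python
import re

# Hypothetical sentence, not taken from football_players.txt
sentence = "He plays as a striker for Example FC and is a forward for Examplia."

# Positive lookbehind: the word must be preceded by "is a " or "as a "
lookbehind = re.compile(r'(?<=\bis a |\bas a )[a-z]+')
# Positive lookahead: the word must be followed by " for"
lookahead = re.compile(r'\b[a-z]+(?= for\b)')

print(lookbehind.findall(sentence))  # ['striker', 'forward']
print(lookahead.findall(sentence))   # ['striker', 'forward']
```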

Execute the cell below to verify the `date_of_birth` function for the third player. Expected output `5 February 1992`


In [25]:
print(date_of_birth(docs[2]))

5 February 1992
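
If the date is needed for comparison or sorting rather than display, the extracted string can be converted to a `date` object. This is a sketch under two assumptions: dates are written as day, full English month name, year, and the default C/English locale is active. The helper name `parse_birth_date` and the sample sentence are made up for illustration.

```python
import re
from datetime import datetime

DOB_PATTERN = re.compile(r'\bborn\s+(\d{1,2}\s+[A-Z][a-z]+\s+\d{4})')


def parse_birth_date(text):
    """Return the first 'born D Month YYYY' date in text as a date, or None."""
    match = DOB_PATTERN.search(text)
    if match is None:
        return None
    # %B expects a full month name in the current locale (English assumed here)
    return datetime.strptime(match.group(1), '%d %B %Y').date()


sample = "Example Player (born 5 February 1992) is a forward."
print(parse_birth_date(sample))             # 1992-02-05
print(parse_birth_date("No date in here"))  # None
```

Returning `None` when nothing matches avoids the `IndexError` that indexing an empty `findall` result would raise.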


In [26]:
print(name_of_the_player(docs[1]))

[[('Lionel', 'NNP'), ('Andrés', 'NNP'), ('``', '``'), ('Leo', 'NNP'), ("''", "''"), ('Messi', 'NNP'), ('(', '('), ('Spanish', 'JJ'), ('pronunciation', 'NN'), (':', ':'), ('[', 'JJ'), ('ljoˈnel', 'NNS'), ('anˈdɾes', 'VBP'), ('ˈmesi', 'JJ'), (']', 'NNP'), (';', ':'), ('born', 'VBN'), ('24', 'CD'), ('June', 'NNP'), ('1987', 'CD')], [('Lionel', 'NNP'), ('Andrés', 'NNP')]]
Lionel Andrés


## Task 6 
Identify one other relation (besides team and player) and write a function to extract it.

In [23]:
def debut_age_year(doc):
    
    """Return the relation between a player's debut age and debut year.

    The age is captured as the two digits that follow the word 'age' after 'debut'; positive
    lookbehinds anchor the match on those two words. The year is captured as the first four
    digits that follow the word 'debut'."""
    
    Relation={}   
    
    age=re.compile(r'(?<=debut)[\s\w+\s,]+(?<=age)[a-z+\s+]*([\d]{2})') ## regex for capturing the debut age of the player 
    year=re.compile(r'debut[\s\w+\s]*?(\d{4}?)')   ## regex for capturing the debut year of the player
    group_age=age.findall(doc)
    group_year=year.findall(doc)
    
    if group_age:    ## some documents do not mention an age
        Relation['Age']=group_age[0] ## document 3 returns two ages; the second is the international debut age, so the first is used
    else:
        Relation['Age']='not mentioned'
        
    if group_year:  ## some documents do not mention a year, so the list can be empty
        Relation['Debut Year']=group_year[0] ## several debut years can occur at different levels; the first match is the football debut
    else:
        Relation['Debut Year']='not mentioned'
        
    return Relation

In [24]:
print(debut_age_year(docs[0]))

{'Age': '18', 'Debut Year': '2003'}
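
The same relation can be sketched with simpler patterns that stay inside the sentence containing "debut" and fall back to `'not mentioned'` when nothing matches. These patterns are an alternative formulation, not the ones used in `debut_age_year`, and the sample sentence and the name `debut_relation` are made up for illustration.

```python
import re

# [^.]*? keeps the match inside the sentence that contains "debut"
DEBUT_YEAR = re.compile(r'\bdebut\b[^.]*?\b(\d{4})\b')
DEBUT_AGE = re.compile(r'\bdebut\b[^.]*?\bage(?:d| of)?\s+(\d{1,2})\b')


def debut_relation(text):
    """Return the debut age and year found in text, or 'not mentioned'."""
    year = DEBUT_YEAR.search(text)
    age = DEBUT_AGE.search(text)
    return {
        'Age': age.group(1) if age else 'not mentioned',
        'Debut Year': year.group(1) if year else 'not mentioned',
    }


sample = "He made his senior debut in 2003 at age 18. He later moved abroad."
print(debut_relation(sample))               # {'Age': '18', 'Debut Year': '2003'}
print(debut_relation("He retired early."))  # {'Age': 'not mentioned', 'Debut Year': 'not mentioned'}
```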
