# Resume NER
## Extract Information from Resumes using NER (Named Entity Recognition)

### Part 1 - Data Exploration and preprocessing
In this first part of the challenge, we will load and examine the dataset we will be working with. We will also prepare the data for training, which begins in the second part of the challenge. You will need to write some basic Python for file loading, data conversion, and manipulating dictionaries and lists. If you are experienced with Python, this will be easy. If you are new to Python or to programming, it is a good opportunity to learn the basics you will need for data loading and exploration.

* *If you need help setting up Python or running this notebook, please ask the professor's assistants*
* *It might be helpful to try your code out first in a Python IDE like PyCharm before copying it and running it here in this notebook*

#### Load the Dataset
The dataset we will be using is located in the dataset folder included in the project. Verify that the data is available by executing the code cell below.

In [1]:
import os
dataset_path = "../dataset/Entity Recognition in Resumes.json"
print("Path exists? {}".format(os.path.exists(dataset_path)))

Path exists? True


So far so good? OK then let's load the dataset. The dataset is structured so that each line of text is a resume. 
You will do the following:
1. Using Python's built-in "open" function, get a file handle to the dataset (tip: don't forget the file is UTF-8!)
2. Load the data into an array of resumes (each text line is one resume)
3. Use the print function to print how many resumes were loaded
4. Use the print function to output one of the resumes so we can see how the resumes look in raw text form


In [2]:
## use the "open" function to get a filehandle. 
with open("../dataset/Entity Recognition in Resumes.json",encoding="utf8") as f:
    ## use the filehandle to read all lines into an array of text lines. 
    lines = f.readlines()
    ## print how many lines were loaded
    print("{} resumes loaded.".format(len(lines)))
    ## now print one resume/line to see how the resumes look in raw text form
    print()
    print("Sample resume:")
    #print(lines[81])  ## uncomment to print one raw resume (the output is long)


701 resumes loaded.

Sample resume:


#### Convert the dataset to json
As we can see, the resumes are not in a convenient human-readable form, but are json dictionaries. We want to work with the resumes as python dictionaries and not as raw text, so we will convert the resumes from text to dictionaries. We will do the following:
1. Import the json module
2. Loop through all of the text lines and use the json 'loads' function to convert the line to a python dictionary. Tip - you can use a 'for' loop, or if you want to get fancy, a python 'list comprehension' to accomplish this. 
3. Select one of the converted resumes so that we can examine its structure.   


In [3]:
## import json module to load json strings
import json
## using a for loop or a list comprehension, cycle through all lines (loaded above) and convert them to dictionaries 
## using json loads function. Make sure all converted resumes are stored in the 'all_resumes' array below  
all_resumes = [json.loads(s) for s in lines]
## select one resume to examine from the all_resumes list
resume = all_resumes[42]
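
The list comprehension above can also be written as an explicit `for` loop. The following is a minimal sketch, assuming we want to skip blank lines and record which lines fail to parse; the helper name `parse_json_lines` and the sample lines are hypothetical and not part of the dataset.

```python
import json

def parse_json_lines(lines):
    """Parse each non-empty line as JSON and collect the numbers of lines that fail."""
    parsed, bad_line_numbers = [], []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines, e.g. a trailing empty line at the end of the file
        try:
            parsed.append(json.loads(line))
        except json.JSONDecodeError:
            bad_line_numbers.append(line_number)
    return parsed, bad_line_numbers

# Hypothetical sample lines standing in for the real dataset
sample_lines = [
    '{"content": "Jane Doe\\nTest Engineer", "annotation": []}\n',
    "\n",
    "not valid json\n",
]
resumes, bad = parse_json_lines(sample_lines)
print(len(resumes), bad)
```

With the real dataset, you would pass the `lines` list loaded earlier instead of `sample_lines`.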

##### Explore the resume data structure
You should have one sample resume saved in the "resume" variable. Now we will examine the resume dictionary. Execute the code cell below to see the keys in the dictionary 

In [4]:
## explore keys in resume
print("keys in resume:")
print(list(resume.keys()))
print()
for key in resume.keys():
    print("Type of key '{}' is {} ".format(key,type(resume[key])))

keys in resume:
['content', 'annotation', 'extras', 'metadata']

Type of key 'content' is <class 'str'> 
Type of key 'annotation' is <class 'list'> 
Type of key 'extras' is <class 'NoneType'> 
Type of key 'metadata' is <class 'dict'> 


##### Question: which key do you think points to the text content of the resume?
*Answer here*
##### Question: which key do you think points to the list of entity annotations? 
*Answer here*

Check your answers above by completing and executing the code below, which prints the text content and the entity list.

In [5]:
## print the resume text
print("resume content:")
print(resume["content"])
## print the resume's list of entity annotations
print("resume entity list:")
print(resume["annotation"])

resume content:
Navas Koya
Test Engineer

Mangalore, Karnataka - Email me on Indeed: indeed.com/r/Navas-Koya/23c1e4e94779b465

Willing to relocate to: Mangalore, Karnataka - Bangalore, Karnataka - Chennai, Tamil Nadu

WORK EXPERIENCE

System Engineer

Infosys -

August 2014 to Present

.NET application Maintenance and do the code changes if required

Test Engineer

Infosys -

June 2015 to February 2016

PrProject 2:

Title: RBS W&G Proving testing.
Technology: Manual testing
Role: Software Test Engineer

Domain: Banking
Description:

Write test cases & descriptions. Review the entries. Upload and map the documents into
HP QC. Execute the testing operations in TPROD mainframe. Upload the result in QC along with
the proof.
Roles and Responsibilities:
•Prepared the Test Scenarios

•Prepared and Executed Test Cases
•Performed functional, Regression testing, Sanity testing.

•Reviewed the Test Reports and Preparing Test Summary Report.
•Upload Test cases to the QC.
•Execute in TPROD Mainfra

##### Explore the list of entity labels
The entity list is a list of dictionaries. We want to explore this list:
1. Cycle through the entities in the list. You can use a 'for' loop for this
2. For each entity - which will be a dictionary - print out each key and each value for the key

In [6]:
## explore entity list
for entity in resume["annotation"]:
    for key in entity:
        print("Key: '{}' Value: {}".format(key, entity[key]))
    print()

Key: 'label' Value: ['Skills']
Key: 'points' Value: [{'start': 2110, 'end': 2403, 'text': 'SKILL SET • ASP.NET, C# • QA tools\n\n• Coding and modularization • Excellent communication skills\n\n• VB, VB.net, ASP • Technical specifications creation\n\n• HTML • System backups\n\n• Sql server 2005, Oracle • System upgrades\n\n• Java/C/C++ • Excellent problem-solving abilities\n\nNavas Najeer Koya 3'}]

Key: 'label' Value: ['Location']
Key: 'points' Value: [{'start': 2055, 'end': 2063, 'text': 'Mangalore'}]

Key: 'label' Value: ['Skills']
Key: 'points' Value: [{'start': 1895, 'end': 1946, 'text': 'C# (Less than 1 year), .NET, SQL Server, Css, Html5\n'}]

Key: 'label' Value: ['Graduation Year']
Key: 'points' Value: [{'start': 1880, 'end': 1884, 'text': ' 2014'}]

Key: 'label' Value: ['Location']
Key: 'points' Value: [{'start': 1851, 'end': 1859, 'text': 'Mangalore'}]

Key: 'label' Value: ['Location']
Key: 'points' Value: [{'start': 1829, 'end': 1837, 'text': 'Mangalore'}]

Key: 'label' Value

##### Question: What keys do the entity entries have? What is the datatype of the values of these keys?
*Answer here*
##### Question: What do these keys and values mean? (think of their significance as entity labels)
*Answer here*

##### Convert data to "spaCy" offset format
Before we go any further, we need to convert the data into a slightly more compact format. This is the format we will use to train our first models in the next part of the challenge. Here we will do the following:
1. Define the data conversion function
2. Convert the data with that function, storing the results in a variable
3. Inspect the converted data

In [7]:
## define data conversion method
def convert_data(data):
    """
    Creates NER training data in Spacy format from JSON dataset
    Outputs the Spacy training data which can be used for Spacy training.
    """
    text = data['content']
    entities = []
    if data['annotation'] is not None:
        for annotation in data['annotation']:
            # each annotation is assumed to contain a single point (span) in the text
            point = annotation['points'][0]
            labels = annotation['label']
            # handle both a list of labels and a single label
            if not isinstance(labels, list):
                labels = [labels]
            for label in labels:
                # Dataturks indices are inclusive on both ends [start, end], but spaCy expects an exclusive end [start, end)
                entities.append((point['start'], point['end'] + 1, label))
    return (text, {"entities": entities})
   
## using a loop or list comprehension, convert each resume in all_resumes using the convert function above, storing the result
converted_resumes = [convert_data(resume) for resume in all_resumes]
## print the number of resumes in converted resumes 
print("Converted resumes has {} resumes.".format(len(converted_resumes)))


Converted resumes has 701 resumes.
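
The `+ 1` in `convert_data` accounts for the inclusive end index in the source annotations. In the entity output above, for example, 'Mangalore' runs from start 2055 to end 2063, which is nine positions for nine characters. The following is a small sketch of the same idea on a made-up string; the helper `to_exclusive_span` and the sample text are hypothetical.

```python
def to_exclusive_span(start, end_inclusive):
    """Convert an inclusive [start, end] span to Python's half-open [start, end)."""
    return start, end_inclusive + 1

# Hypothetical text, not taken from the dataset
text = "Lives in Mangalore, Karnataka"
start = text.index("Mangalore")
end_inclusive = start + len("Mangalore") - 1  # index of the final 'e'

# Slicing directly with the inclusive end drops the last character
print(repr(text[start:end_inclusive]))

# After conversion, the slice covers the whole entity
s, e = to_exclusive_span(start, end_inclusive)
print(repr(text[s:e]))
```

Python slices exclude the end index, so the converted offsets can be used directly as `text[start:end]`.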


##### Question: how is the converted data different than the original data? How is it the same? 
*Answer here*

##### Filter out resumes without annotations
A few of the resumes have an empty entity list. We want to filter these resumes out of our data before continuing. We will do the following:
1. Cycle through all resumes using a for loop or list comprehension
2. For each resume, if the resume has no labeled entities, ignore it. Otherwise, save it to a new resume list

In [8]:
## filter out resumes whose entity list is empty (you can do this in a one-line list comprehension)
## save to converted_resumes variable
converted_resumes = [res for res in converted_resumes if res[1]['entities']]
## print length of new filtered converted_resumes.  
print(len(converted_resumes))

690


##### Print all entities for one converted resume
The converted data also has an entity list. You should be able to examine it using techniques similar to those we used above. In the next code block you will write code that prints all of the entities for one resume. Tip: each entry in the 'entities' list consists of the start index of the entity in the resume text, an end index, and the entity label. We will do the following:
1. Store one converted resume in the 'converted_resume' variable
2. Find the entity list in the converted_resume
3. Cycle through the entities, and - using the start and end index - print the label of the entity and the value of the entity. This will be the text substring pointed to by the start and end index

In [9]:
## store a resume in the variable
converted_resume = converted_resumes[73]
## find text content and store in variable
text = converted_resume[0]
## find the entities list and store in variable
entities_list = converted_resume[1]['entities']
## for each entity, print the label, and the text (text content substring pointed to by start and end index)
for entity in entities_list:
    print("'{}': '{}'".format(entity[2],text[entity[0]:entity[1]]))
    print()



'Skills': 'CRM (10+ years), CUSTOMER RELATIONSHIP MANAGEMENT (10+ years), TESTING (10+ years),
UI (10+ years), USER INTERFACE (10+ years)
'

'Graduation Year': '2003'

'College Name': 'St. Joseph's convent school
'

'Graduation Year': '2009'

'College Name': 'Padmanava College of Engineering
'

'Degree': 'B.E in CSE
'

'Companies worked at': 'SAP Labs '

'Companies worked at': 'SAP Labs '

'Companies worked at': 'SAP Labs '

'Designation': 'Quality Engineer'

'Companies worked at': 'SAP AG'

'Designation': 'Offshore SAP CRM Functional Consultant
'

'Email Address': 'Indeed: indeed.com/r/Shabnam-Saba/dc70fc366accb67f'

'Location': 'Bengaluru'

'Designation': 'Offshore SAP CRM Functional Consultant
'

'Name': 'Shabnam Saba'
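
Several of the values printed above end in a space or a newline (for example 'SAP Labs ' and 'B.E in CSE' followed by a line break). If such whitespace turns out to be a problem later, one possible approach is to shrink each span until it starts and ends on a non-whitespace character. This is only a sketch under that assumption; the helper `trim_span` and the sample text are hypothetical.

```python
def trim_span(text, start, end):
    """Shrink a half-open [start, end) span so it excludes leading and trailing whitespace."""
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end

# Hypothetical text with a span that includes the trailing newline
text = "Degree: B.E in CSE\nYear: 2009"
start, end = 8, 19
print(repr(text[start:end]))

new_start, new_end = trim_span(text, start, end)
print(repr(text[new_start:new_end]))
```

A span that consists only of whitespace collapses to an empty span (`start == end`), which could then be dropped.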



##### Question: What are some of the entity labels you see? Are there any entity values that seem surprising or particularly interesting? 
*Answer here*

##### Collect unique labels of all entities in dataset
Now we are interested in finding out all of the (unique) entity labels which exist in our dataset. Complete and execute the code below to do this.

In [10]:
## collect names of all entities in complete resume dataset
## create empty list where we will store all entity labels
all_labels = list()
for res in converted_resumes:
    entity_list = res[1]["entities"]
    all_labels.extend([ent[2] for ent in entity_list])           
## all_labels is not yet unique. Make a set to contain only unique values
unique_labels = set(all_labels)
print("Entity labels: ",unique_labels)

Entity labels:  {'Skills', 'Graduation Year', 'Email Address', 'College Name', 'Can Relocate to', 'links', 'Certifications', 'Links', 'Rewards and Achievements', 'UNKNOWN', 'College', 'Location', 'Companies worked at', 'Relocate to', 'Name', 'Years of Experience', 'Degree', 'Designation', 'state', 'des', 'University', 'abc', 'Address', 'projects', 'training'}
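
A set tells us which labels exist, but not how often each one occurs. As a rough sketch, `collections.Counter` can tally the labels so that rare or noisy ones (such as 'abc' or 'des' above) stand out. The helper `count_labels` and the miniature dataset are hypothetical.

```python
from collections import Counter

def count_labels(converted):
    """Count how often each entity label occurs across (text, {"entities": [...]}) pairs."""
    counts = Counter()
    for _text, annotations in converted:
        counts.update(label for _start, _end, label in annotations["entities"])
    return counts

# Hypothetical miniature dataset in the converted format
sample = [
    ("text a", {"entities": [(0, 4, "Name"), (5, 6, "Skills")]}),
    ("text b", {"entities": [(0, 4, "Name")]}),
]
for label, count in count_labels(sample).most_common():
    print(label, count)
```

With the real data, you would call `count_labels(converted_resumes)` instead.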


Now we see all entity labels in our dataset. Do some of them seem particularly interesting to you? 

Choose up to 3 Entities from the list that you would like to use for training a named entity recognition model. 
##### Question: which entities did you choose? 
*Answer here*

##### Validate entities
Now we need to check that there is adequate training data for the entities you have chosen. 

In [11]:
## store entity label names in an array 
chosen_entity_label = ["Companies worked at","Degree"]
## for each chosen entity label, count how many documents have a labeled entity for that label, and how many labeled entities total there are 
## for that entity
for chosen in chosen_entity_label:
    found_docs_with_entity = 0
    entity_count = 0
    for resume in converted_resumes:
        entity_list = resume[1]["entities"]
        _,_,labels = zip(*entity_list)
        if chosen in labels:
            found_docs_with_entity+=1
            entity_count+=len([l for l in labels if l == chosen])
    print("Docs with {}: {}".format(chosen,found_docs_with_entity))
    print("Total count of {}: {}".format(chosen,entity_count))

Docs with Companies worked at: 627
Total count of Companies worked at: 2830
Docs with Degree: 606
Total count of Degree: 1012


##### Question: Is adequate training data available for the entities you have chosen? (There should be at least several hundred examples total of each entity.)
*Answer here*

##### Save converted data for later use
We are almost done with the first part of the challenge! One more detail. We need to save the "converted_resumes" list so we can load it in the next notebook. We will do the following:
1. Store the location we want to save the data to in the 'converted_resumes_path' variable
2. Using python's 'open' function and the 'json' module's 'dump' function, save the data to disk. Make sure to create missing directories (if applicable) using python's "os.makedirs" function. Save the file with a ".json" file extension
3. Check the filesystem to confirm that the file exists and is complete

In [12]:
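converted_resumes_path = "../dataset/converted_resumes.json"
## create the target directory if it is missing (os was imported in the first cell)
os.makedirs(os.path.dirname(converted_resumes_path), exist_ok=True)
## 'w' creates the file or overwrites an existing one
with open(converted_resumes_path,'w',encoding='utf8') as f:
    json.dump(converted_resumes,f)
    print("Done!")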

Done!
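
One detail to keep in mind when loading the file again: JSON has no tuple type, so the tuples in `converted_resumes` come back as lists. The sketch below shows this with a round trip through a JSON string (rather than a file, to keep it self-contained) and one possible way to restore the tuples; the sample data is hypothetical.

```python
import json

# Hypothetical single-resume list in the converted format
converted = [("Jane Doe", {"entities": [(0, 8, "Name")]})]

# Round-trip through a JSON string; tuples are serialized as JSON arrays
reloaded = json.loads(json.dumps(converted))
print(reloaded)

# Restore the tuple structure if later code expects it
restored = [
    (text, {"entities": [tuple(ent) for ent in ann["entities"]]})
    for text, ann in reloaded
]
print(restored == converted)
```

Whether the restoration step is needed depends on what the code in the next notebook expects.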


### Congratulations!
We are done with part one. Next, we will train our own NER models with the dataset and the entities we have chosen.