# Resume NER
## Extract Information from Resumes using NER (Named Entity Recognition)

### Part 1 - Data Exploration and preprocessing
In this first part of the challenge, we will load and examine the dataset we will be working with. We will also prepare the data for training, which starts in the second part of the challenge. You will need to write some basic Python for file loading, data conversion, and simple dictionary and list manipulation. If you are experienced with Python, this will be easy. If you are new to Python or programming, it is a good opportunity to learn the basics you will need for data loading and exploration.

* *If you need help setting up Python or running this notebook, please ask the professor's assistants*
* *It might be helpful to try your code out first in a Python IDE like PyCharm before copying it and running it here in this notebook*

#### Load the Dataset
The dataset we will be using is located in the `data` folder included in the project. Verify the data is available by executing the code cell below.

In [1]:
import os
dataset_path = "./data/Entity Recognition in Resumes.json"
print("Path exists? {}".format(os.path.exists(dataset_path)))

Path exists? True


So far so good? OK, then let's load the dataset. The dataset is structured so that each line of text is one resume.
You will do the following:
1. Using Python's built-in "open" function, get a file handle to the dataset (tip: don't forget the file is UTF-8 encoded!)
2. Load the data into a list of resumes (each text line is one resume)
3. Use the print function to print how many resumes were loaded
4. Use the print function to output one of the resumes so we can see how the resumes look in raw text form


In [2]:
## use the "open" function to get a filehandle. 
with open(dataset_path,encoding="utf8") as f:
    ## use the filehandle to read all lines into an array of text lines. 
    lines = f.readlines()
    ## print how many lines were loaded
    print(len(lines))
    ## now print one resume/line to see how the resumes look in raw text form
    print("Sample resume:")
    #TODO print sample resume
    print(lines[0])


701
Sample resume:
{"content": "Afreen Jamadar\nActive member of IIIT Committee in Third year\n\nSangli, Maharashtra - Email me on Indeed: indeed.com/r/Afreen-Jamadar/8baf379b705e37c6\n\nI wish to use my knowledge, skills and conceptual understanding to create excellent team\nenvironments and work consistently achieving organization objectives believes in taking initiative\nand work to excellence in my work.\n\nWORK EXPERIENCE\n\nActive member of IIIT Committee in Third year\n\nCisco Networking -  Kanpur, Uttar Pradesh\n\norganized by Techkriti IIT Kanpur and Azure Skynet.\nPERSONALLITY TRAITS:\n• Quick learning ability\n• hard working\n\nEDUCATION\n\nPG-DAC\n\nCDAC ACTS\n\n2017\n\nBachelor of Engg in Information Technology\n\nShivaji University Kolhapur -  Kolhapur, Maharashtra\n\n2016\n\nSKILLS\n\nDatabase (Less than 1 year), HTML (Less than 1 year), Linux. (Less than 1 year), MICROSOFT\nACCESS (Less than 1 year), MICROSOFT WINDOWS (Less than 1 year)\n\nADDITIONAL INFORMATION\n\nTECH

#### Convert the dataset to json
As we can see, each line is not plain human-readable text but a JSON object stored as a string. We want to work with the resumes as Python dictionaries rather than raw text, so we will convert each line into a dictionary. We will do the following:
1. Import the json module
2. Loop through all of the text lines and use the json 'loads' function to convert each line to a Python dictionary. Tip: you can use a 'for' loop or, if you want to get fancy, a Python 'list comprehension' to accomplish this.
3. Select one of the converted resumes so that we can examine its structure.


In [3]:
## import json module to load json strings
import json
## using a for loop or a list comprehension, cycle through all lines (loaded above) and convert them to dictionaries 
## using json loads function. Make sure all converted resumes are stored in the 'all_resumes' array below  
all_resumes = []
for line in lines:
    all_resumes.append(json.loads(line))

## select one resume to examine from the all_resumes list
resume = all_resumes[0]
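
The list comprehension mentioned in the tip could look roughly like the sketch below. It is self-contained, so `sample_lines` and `sample_resumes` are hypothetical stand-ins for `lines` and `all_resumes`, and the two JSON strings are made-up examples rather than real dataset entries.

```python
import json

# Two hypothetical JSON lines standing in for lines read from the dataset file.
sample_lines = [
    '{"content": "Jane Doe\\nPython developer", "annotation": []}\n',
    '{"content": "John Roe\\nData analyst", "annotation": null}\n',
]

# json.loads parses one JSON document per call; the trailing newline is ignored.
sample_resumes = [json.loads(line) for line in sample_lines]

print(len(sample_resumes))
print(sample_resumes[0]["content"])
```

Note that a JSON `null` becomes Python's `None`, which is why later code checks the annotation value before looping over it.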

##### Explore the resume data structure
You should have one sample resume saved in the "resume" variable. Now we will examine the resume dictionary. Complete the code below to see the keys in the dictionary 

In [4]:
## explore keys in the resume
print("keys and values in resume:")
## TODO print out the keys and values for the sample resume
print(resume.keys())

keys and values in resume:
dict_keys(['content', 'annotation', 'extras', 'metadata'])


##### Question: which key do you think points to the text content of the resume?
content
##### Question: which key do you think points to the list of entity annotations? 
annotation

Based on your answers above, see if you were right by printing the text content and the entity list by completing and executing the code below

In [5]:
## TODO print the resume text
print("resume content:")
print(resume["content"])
## TODO print the resume's list of entity annotations
print("resume entity list:")
print(resume["annotation"])

resume content:
Afreen Jamadar
Active member of IIIT Committee in Third year

Sangli, Maharashtra - Email me on Indeed: indeed.com/r/Afreen-Jamadar/8baf379b705e37c6

I wish to use my knowledge, skills and conceptual understanding to create excellent team
environments and work consistently achieving organization objectives believes in taking initiative
and work to excellence in my work.

WORK EXPERIENCE

Active member of IIIT Committee in Third year

Cisco Networking -  Kanpur, Uttar Pradesh

organized by Techkriti IIT Kanpur and Azure Skynet.
PERSONALLITY TRAITS:
• Quick learning ability
• hard working

EDUCATION

PG-DAC

CDAC ACTS

2017

Bachelor of Engg in Information Technology

Shivaji University Kolhapur -  Kolhapur, Maharashtra

2016

SKILLS

Database (Less than 1 year), HTML (Less than 1 year), Linux. (Less than 1 year), MICROSOFT
ACCESS (Less than 1 year), MICROSOFT WINDOWS (Less than 1 year)

ADDITIONAL INFORMATION

TECHNICAL SKILLS:

• Programming Languages: C, C++, Java, .ne

##### Explore the list of entity labels
The entity list is a list of dictionaries. We want to explore this list:
1. Cycle through the entities in the list. You can use a 'for' loop for this
2. For each entity - which will be a dictionary - print out each key and each value for the key

In [6]:
## explore entity list
##TODO print out each key and each value for each entity in the entities list
for annotation in resume["annotation"]:
    for key in annotation.keys():
        print("Key:")
        print(key)
        print("Value:")
        print(annotation[key])

Key:
label
Value:
['Email Address']
Key:
points
Value:
[{'start': 1155, 'end': 1198, 'text': 'indeed.com/r/Afreen-Jamadar/8baf379b705e37c6'}]
Key:
label
Value:
['Links']
Key:
points
Value:
[{'start': 1143, 'end': 1239, 'text': 'https://www.indeed.com/r/Afreen-Jamadar/8baf379b705e37c6?isid=rex-download&ikw=download-top&co=IN'}]
Key:
label
Value:
['Skills']
Key:
points
Value:
[{'start': 743, 'end': 1140, 'text': 'Database (Less than 1 year), HTML (Less than 1 year), Linux. (Less than 1 year), MICROSOFT\nACCESS (Less than 1 year), MICROSOFT WINDOWS (Less than 1 year)\n\nADDITIONAL INFORMATION\n\nTECHNICAL SKILLS:\n\n• Programming Languages: C, C++, Java, .net, php.\n• Web Designing: HTML, XML\n• Operating Systems: Windows […] Windows Server 2003, Linux.\n• Database: MS Access, MS SQL Server 2008, Oracle 10g, MySql.'}]
Key:
label
Value:
['Graduation Year']
Key:
points
Value:
[{'start': 729, 'end': 732, 'text': '2016'}]
Key:
label
Value:
['College Name']
Key:
points
Value:
[{'start': 675, '

##### Question: What keys do the entity entries have? What is the datatype of the values of these keys?
Each entry has the keys label and points. The value of label is a list containing a single string. The value of points is a list containing a single dictionary with the keys start, end and text, where start and end are integers and text is a string.
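
One way to double-check these datatypes is to print the type of each value and of its first element. The sketch below uses a single entry copied from the output above; the variable name `entry` is only for illustration.

```python
# One entity entry, copied from the printed output above.
entry = {
    "label": ["Graduation Year"],
    "points": [{"start": 729, "end": 732, "text": "2016"}],
}

for key, value in entry.items():
    # Both values are lists; print the list type and the type of its first element.
    print(key, type(value).__name__, type(value[0]).__name__)

# The keys of the dictionary inside 'points'.
print(sorted(entry["points"][0].keys()))
```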

##### Convert  data to "spacy" offset format
Before we go any further, we need to convert the data into a slightly more compact format. This format is the format we will be using to train our first models in the next part of the challenge. Here we will do the following:
1. Use the provided data conversion function
2. Convert the data with that function, storing the results in a variable
3. Inspect the converted data

In [7]:
## data conversion method
def convert_data(data):
    """
    Creates NER training data in Spacy format from JSON dataset
    Outputs the Spacy training data which can be used for Spacy training.
    """
    text = data['content']
    entities = []
    if data['annotation'] is not None:
        for annotation in data['annotation']:
            # only a single point in text annotation.
            point = annotation['points'][0]
            labels = annotation['label']
            # handle both list of labels or a single label.
            if not isinstance(labels, list):
                labels = [labels]
            for label in labels:
                # dataturks indices are inclusive on both ends, [start, end], but spacy
                # expects an exclusive end, [start, end), so we add 1 to 'end'
                entities.append((point['start'], point['end'] + 1, label))
    return (text, {"entities": entities})
   
## TODO using a loop or list comprehension, convert each resume in all_resumes using the convert function above, storing the result
converted_resumes = []
for resume in all_resumes:
    converted_resumes.append(convert_data(resume))

## TODO print the number of resumes in converted resumes 
print(len(converted_resumes))
print(converted_resumes[0])

701
('Afreen Jamadar\nActive member of IIIT Committee in Third year\n\nSangli, Maharashtra - Email me on Indeed: indeed.com/r/Afreen-Jamadar/8baf379b705e37c6\n\nI wish to use my knowledge, skills and conceptual understanding to create excellent team\nenvironments and work consistently achieving organization objectives believes in taking initiative\nand work to excellence in my work.\n\nWORK EXPERIENCE\n\nActive member of IIIT Committee in Third year\n\nCisco Networking -  Kanpur, Uttar Pradesh\n\norganized by Techkriti IIT Kanpur and Azure Skynet.\nPERSONALLITY TRAITS:\n• Quick learning ability\n• hard working\n\nEDUCATION\n\nPG-DAC\n\nCDAC ACTS\n\n2017\n\nBachelor of Engg in Information Technology\n\nShivaji University Kolhapur -  Kolhapur, Maharashtra\n\n2016\n\nSKILLS\n\nDatabase (Less than 1 year), HTML (Less than 1 year), Linux. (Less than 1 year), MICROSOFT\nACCESS (Less than 1 year), MICROSOFT WINDOWS (Less than 1 year)\n\nADDITIONAL INFORMATION\n\nTECHNICAL SKILLS:\n\n• Program

##### Question: how is the converted data different than the original data? How is it the same? 
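The converted data combines the text content and a dictionary of the annotations into a tuple.
The dictionary's "entities" entry is a list with one triple per annotation, of the form (start, end + 1, label).
The text and the labels themselves are unchanged; only the content and annotation keys of the original dictionary are carried over.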
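
The `+ 1` matters because Python slices exclude the end index. A minimal sketch with a made-up text and annotation (the string and offsets below are hypothetical, not taken from the dataset):

```python
text = "Name: Ada Lovelace"
# Inclusive annotation: 'start' and 'end' both point at characters of the entity.
point = {"start": 6, "end": 17, "text": "Ada Lovelace"}

# Slicing with the inclusive end drops the last character.
too_short = text[point["start"]:point["end"]]
# Adding 1 turns the inclusive end into the exclusive end a slice expects.
exact = text[point["start"]:point["end"] + 1]

print(repr(too_short))
print(repr(exact))
```

With the exclusive end stored in the converted triples, `text[start:end]` returns the entity text directly.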

##### Filter out resumes without annotations
A few of the resumes have an empty entity list. We want to filter these resumes out of our data before continuing. We will do the following:
1. Cycle through all resumes using a for loop or list comprehension
2. For each resume, if the resume has no labeled entities, ignore it. Otherwise save it to the new resume list

In [8]:
## TODO filter out resumes whose entities list is empty (you can do this in a one-line list comprehension)
## save to converted_resumes variable
converted_resumes = [(text, entitylist) for (text, entitylist) in converted_resumes if len(entitylist["entities"]) > 0]
## TODO print length of new filtered converted_resumes.  
print(len(converted_resumes))

690


##### Print all entities for one converted resume
The converted data also has an entity list. You should be able to examine it using techniques similar to those we used above. In the next code block you will write code that prints all of the entities for one resume. Tip: each entry in the 'entities' list consists of the start index of the entity in the resume text, an end index, and the entity label. We will do the following:
1. Store one converted resume in the 'converted_resume' variable
2. Find the entity list in the converted_resume
3. Cycle through the entities, and - using the start and end index - print the label of the entity and the value of the entity. This will be the text substring pointed to by the start and end index

In [9]:
## store one resume in the variable
converted_resume = converted_resumes[0]
## find text content and store in variable
text = converted_resume[0]
## find the entities list and store in variable
entities_list = converted_resume[1]["entities"]
## TODO for each entity, print the label, and the text (text content substring pointed to by start and end index)
for (start, end, label) in entities_list:
    print(label + ": " + text[start:end])

Email Address: indeed.com/r/Afreen-Jamadar/8baf379b705e37c6
Links: https://www.indeed.com/r/Afreen-Jamadar/8baf379b705e37c6?isid=rex-download&ikw=download-top&co=IN
Skills: Database (Less than 1 year), HTML (Less than 1 year), Linux. (Less than 1 year), MICROSOFT
ACCESS (Less than 1 year), MICROSOFT WINDOWS (Less than 1 year)

ADDITIONAL INFORMATION

TECHNICAL SKILLS:

• Programming Languages: C, C++, Java, .net, php.
• Web Designing: HTML, XML
• Operating Systems: Windows […] Windows Server 2003, Linux.
• Database: MS Access, MS SQL Server 2008, Oracle 10g, MySql.
Graduation Year: 2016
College Name: Shivaji University Kolhapur 
Degree: Bachelor of Engg in Information Technology
Graduation Year: 2017

College Name: CDAC ACTS
Degree: PG-DAC
Companies worked at: Cisco Networking
Email Address: indeed.com/r/Afreen-Jamadar/8baf379b705e37c6
Location: Sangli
Name: Afreen Jamadar


##### Question: What are some of the entity labels you see? Are there any entity values that seem surprising or particularly interesting? 
Skills is probably the most interesting label, because a single Skills annotation can span several lines and sections of the resume and can contain almost anything.

##### Collect unique labels of all entities in dataset
Now we are interested in finding out all of the (unique) entity labels which exist in our dataset. Complete and execute the code below to do this.

In [10]:
## collect names of all entities in complete resume dataset
all_labels = list()
for res in converted_resumes:
    ## entity list of res
    entity_list = res[1]["entities"]
    ## TODO extend all_labels with labels of entities
    for (start, end, label) in entity_list:
        all_labels.append(label)
## TODO all_labels is not yet unique. Make the list a set of unique values
unique_labels = list(set(all_labels))
## Print unique entity labels
print("Entity labels: ",unique_labels)

Entity labels:  ['links', 'University', 'des', 'Graduation Year', 'Location', 'Name', 'Email Address', 'projects', 'Companies worked at', 'UNKNOWN', 'Skills', 'Relocate to', 'state', 'Can Relocate to', 'Degree', 'Years of Experience', 'College Name', 'training', 'College', 'Address', 'Rewards and Achievements', 'Designation', 'Links', 'Certifications', 'abc']


Now we see all entity labels in our dataset. Do some of them seem particularly interesting to you? 

Choose up to 3 Entities from the list that you would like to use for training a named entity recognition model. 
##### Question: which entities did you choose? 
 Designation, Companies worked at, Skills

##### Validate entities
Now we need to check that there is adequate training data for the entities you have chosen. 

In [13]:
## TODO store entity label names for the entities you want to work with in an array 
chosen_entity_label = ['Designation', 'Companies worked at', 'Skills']
##['Skills', 'projects', 'Rewards and Achievements', 'Designation']
## for each chosen entity label, count how many documents have a labeled entity for that label, and how many labeled entities total there are 
## for that entity
for chosen in chosen_entity_label:
    found_docs_with_entity = 0
    entity_count = 0
    for resume in converted_resumes:
        entity_list = resume[1]["entities"]
        _,_,labels = zip(*entity_list)
        if chosen in labels:
            found_docs_with_entity+=1
            entity_count+=len([l for l in labels if l == chosen])
    print("Docs with {}: {}".format(chosen,found_docs_with_entity))
    print("Total count of {}: {}".format(chosen,entity_count))

Docs with Designation: 650
Total count of Designation: 2842
Docs with Companies worked at: 627
Total count of Companies worked at: 2830
Docs with Skills: 536
Total count of Skills: 2152


##### Question: Is adequate training data available for the entities you have chosen? (there should be at least several hundred examples total of each entity)
Yes. Each chosen entity appears in more than 500 documents and has more than 2,000 labeled examples in total.
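
The same two numbers (documents containing a label, and total labeled entities) could also be gathered in a single pass with `collections.Counter`. This is only a sketch of an alternative, assuming the `(text, {"entities": [...]})` format produced by `convert_data`; `sample_converted` is hypothetical stand-in data, not the real dataset.

```python
from collections import Counter

# Hypothetical converted resumes in the (text, {"entities": [...]}) format.
sample_converted = [
    ("resume one", {"entities": [(0, 6, "Skills"), (7, 10, "Skills"), (0, 3, "Name")]}),
    ("resume two", {"entities": [(0, 6, "Designation")]}),
    ("resume three", {"entities": [(0, 6, "Skills")]}),
]

total_counts = Counter()  # labeled entities per label
doc_counts = Counter()    # documents containing each label at least once

for _, annotations in sample_converted:
    labels = [label for _, _, label in annotations["entities"]]
    total_counts.update(labels)
    # A set counts each label at most once per document.
    doc_counts.update(set(labels))

for label in ("Skills", "Designation"):
    print("Docs with {}: {}".format(label, doc_counts[label]))
    print("Total count of {}: {}".format(label, total_counts[label]))
```

Because the counters cover every label at once, the same pass can be reused to compare other candidate labels without looping over the resumes again.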

##### Save converted data for later use
We are almost done with the first part of the challenge! One more detail. We need to save the "converted_resumes" list so we can load it in the next notebook. We will do the following:
1. Store the location we want to save the data to in the 'converted_resumes_path' variable
2. Using python's 'open' function and the 'json' module's 'dump' function, save the data to disk. Make sure to create missing directories (if applicable) using python's "os.makedirs" function. Save the file with a ".json" file extension
3. Check the filesystem if the file exists and is complete

In [14]:
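converted_resumes_path = "./data/converted_resumes.json"
## create the target directory if it does not exist yet (os was imported in the first cell)
os.makedirs(os.path.dirname(converted_resumes_path), exist_ok=True)
##TODO save converted resumes to path using "open" and json's "dump" function. 
with open(converted_resumes_path, 'w', encoding="utf8") as output:
    json.dump(converted_resumes, output)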
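
For step 3, one possible check is to read the file back and compare it with what was written. The sketch below is self-contained and writes to a temporary directory, so `sample_converted` and `sample_path` are hypothetical stand-ins for `converted_resumes` and `converted_resumes_path`.

```python
import json
import os
import tempfile

# Hypothetical stand-in for the converted_resumes list.
sample_converted = [("Jane Doe", {"entities": [(0, 8, "Name")]})]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "data", "converted_sample.json")
    # Create the missing 'data' directory before writing.
    os.makedirs(os.path.dirname(sample_path), exist_ok=True)
    with open(sample_path, "w", encoding="utf8") as output:
        json.dump(sample_converted, output)

    # Check that the file exists and read it back.
    file_exists = os.path.exists(sample_path)
    with open(sample_path, encoding="utf8") as f:
        reloaded = json.load(f)

print(file_exists, len(reloaded) == len(sample_converted))
# JSON has no tuple type, so tuples come back as lists.
print(reloaded[0])
```

Keep in mind when loading the data in the next notebook that each resume and each entity will be a list rather than a tuple.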

### Congratulations!
We are done with part one. Now we will go on to train our own NER Models with the dataset and the entities we have chosen. 