# Homework: Modeling and Querying Linked Data Using LinkedIn



Have you ever wondered about (1) what it takes to be a data scientist or "data person", and (2) how social networks and recommender systems work?

This homework is focused on (1) working with hierarchical data stored in dataframes, (2) traversing relationships among data. 

We will focus on questions about data scientists from a crawl of the LinkedIn dataset.

In [0]:
!pip install pymongo[tls,srv]
!pip install swifter
!pip install lxml

Collecting dnspython<2.0.0,>=1.16.0; extra == "srv"
[?25l  Downloading https://files.pythonhosted.org/packages/ec/d3/3aa0e7213ef72b8585747aa0e271a9523e713813b9a20177ebe1e939deb0/dnspython-1.16.0-py2.py3-none-any.whl (188kB)
[K     |████████████████████████████████| 194kB 2.7MB/s 
[?25hInstalling collected packages: dnspython
Successfully installed dnspython-1.16.0
Collecting swifter
  Downloading https://files.pythonhosted.org/packages/1f/22/0a46b4d2a417824d7e883a8bd8e01c3b000bbdeaa7c154891b7cba94cbf7/swifter-0.296-py3-none-any.whl
Collecting tqdm>=4.33.0
[?25l  Downloading https://files.pythonhosted.org/packages/a5/13/cd55c23e3e158ed5b87cae415ee3844fc54cb43803fa3a0a064d23ecb883/tqdm-4.40.0-py2.py3-none-any.whl (54kB)
[K     |████████████████████████████████| 61kB 3.3MB/s 
Collecting partd>=0.3.8; extra == "complete"
  Downloading https://files.pythonhosted.org/packages/44/e1/68dbe731c9c067655bff1eca5b7d40c20ca4b23fd5ec9f3d17e201a6f36b/partd-1.1.0-py3-none-any.whl
Collecting locke



In [0]:
import pandas as pd
import numpy as np
import json
import sqlite3
from lxml import etree
import urllib
import zipfile

import time
import swifter
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError, OperationFailure

# Step 1: Acquire and load data

We need to pull the zipfile with LinkedIn data to your local machine or the Google Colab cloud-hosted machine.  Only when the data is local can we efficiently parse it (and we'll read directly out of a zip file).

The zip file contains two files with the same schema.  You can start with the `tiny` instance to test your queries, then go on to `small`. 


* `linkedin_small.json` (100K records)
* `linkedin_tiny.json` (10K records)

The cell below will download the zip file, and may take a while. **You do not need to modify the two cells below.**

**INSTRUCTOR:  Replace X with the url of the zip file you have created.**

In [0]:
url = 'https://upenn-bigdataanalytics.s3.amazonaws.com/linkedin.zip'
# url = 'X'
filehandle, _ = urllib.request.urlretrieve(url,filename='local.zip')

The cell below creates pointers to the two versions of our dataset. To switch between them, simply change the `file` variable in the cell below.

In [0]:
def fetch_file(fname):
    zip_file_object = zipfile.ZipFile(filehandle, 'r')
    for file in zip_file_object.namelist():
        file = zip_file_object.open(file)
        if file.name == fname: return file
    return None
    
linkedin_tiny = fetch_file('linkedin_tiny.json')
linkedin_small = fetch_file('linkedin_small.json')
file = linkedin_tiny

## Step 1.1:  Store the data in dataframes

In the cell below, adapt the data loading code from the associated lecture notebook.  You will need the function that extracts relations from JSON files and the function that converts relations to dataframes. Read in a maximum of 20000 people. Put the code that reads a line of the file, extracts the relations, removes the interval field, and stores the field information with a try statement, just in case. In the error case, just use a `pass` command to move on. At the end of the next cell, you should have nine dataframes with the following names:

1. `people_df`
2. `names_df`
3. `education_df`
4. `groups_df`
5. `skills_df`
6. `experience_df`
7. `honors_df`
8. `also_view_df`
9. `events_df`

In [0]:
# TODO: Adapt the data loading code from class.

# YOUR CODE HERE
### BEGIN SOLUTION
def get_df(rel):
    ret = pd.DataFrame(rel).fillna('')
    for k in ret.keys():
        ret[k] = ret[k].astype(str)
    return ret

def extract_relation(rel, name):
    '''
    Pull out a nested list that has a key, and return it as a list
    of dictionaries suitable for treating as a relation / dataframe
    '''
    # We'll return a list
    ret  = []
    if name in rel:
        ret2 = rel.pop(name)
        try:
            # Try to parse the string as a dictionary
            ret2 = json.loads(ret2.replace('\'','\"'))
        except:
            # If we get an error in parsing, we'll leave as a string
            pass
        
        # If it's a dictionary, add it to our return results after
        # adding a key to the parent
        if isinstance(ret2, dict):
            item = ret2
            item['person'] = rel['_id']
            ret.append(item)
        else:
            # If it's a list, iterate over each item
            index = 0
            for r in ret2:
                item = r
                if not isinstance(item, dict):
                    item = {'person': rel['_id'], 'value': item}
                else:
                    item['person'] = rel['_id']
                    
                # A fix to a typo in the data
                if 'affilition' in item:
                    item['affiliation'] = item.pop('affilition')
                    
                item['pos'] = index
                index = index + 1
                ret.append(item)
    return ret
    

names = []
people = []
groups = []
education = []
skills = []
experience = []
honors = []
also_view = []
events = []


lines = []
i = 1
LIMIT = 20000  # Max records to parse
for line in file:
    line = line.decode('utf-8')
    try:
        person = json.loads(line)

        # By inspection, all of these are nested dictionary or list content
        nam = extract_relation(person, 'name')
        edu = extract_relation(person, 'education')
        grp = extract_relation(person, 'group')
        skl = extract_relation(person, 'skills')
        exp = extract_relation(person, 'experience')
        hon = extract_relation(person, 'honors')
        als = extract_relation(person, 'also_view')
        eve = extract_relation(person, 'events')

        # This doesn't seem relevant and it's the only
        # non-string field that's sometimes null
        if 'interval' in person:
            person.pop('interval')

        lines.append(person)
        names = names + nam
        education = education + edu
        groups  = groups + grp
        skills = skills + skl
        experience = experience + exp
        honors = honors + hon
        also_view = also_view + als
        events = events + eve
    except:
        pass
    
    i = i + 1
    if i >= LIMIT:
        break

people_df = get_df(pd.DataFrame(lines))
names_df = get_df(pd.DataFrame(names))
education_df = get_df(pd.DataFrame(education))
groups_df = get_df(pd.DataFrame(groups))
skills_df = get_df(pd.DataFrame(skills))
experience_df = get_df(pd.DataFrame(experience))
honors_df = get_df(pd.DataFrame(honors))
also_view_df = get_df(pd.DataFrame(also_view))
events_df = get_df(pd.DataFrame(events))
### END SOLUTION

In [0]:
# Sanity Check 1.1 - please do not modify or delete this cell!

display(experience_df)


Unnamed: 0,org,title,end,start,desc,person,pos
0,Citi Staffing,Temp,Present,2011,"Filled in as Receptionist, aided in data entry...",in-emilymaggiotto,0
1,BIG Management,"Concert Booking, Digital Media Strategy, and P...",,November 2011,Managing artists- Booking concerts- Social Med...,in-emilymaggiotto,1
2,The ITO Partnership,Project Management Consultant,,March 2011,-\tCreated and maintained timelines and aided ...,in-emilymaggiotto,2
3,"New Universal Entertainment (NUE) Agency, LLC",Office Manager / Assistant to CEO,,June 2010,NUE Agency is a young boutique talent agency r...,in-emilymaggiotto,3
4,Downtown Music LLC,Assistant to Executive VP / General Counsel,,September 2008,Downtown Music is a music company consisting o...,in-emilymaggiotto,4
...,...,...,...,...,...,...,...
47301,CARREFOUR GLOBAL SOURCING LTD,HR REGIONAL MANAGER,,October 2006,Mission: initiated human resources organizatio...,in-fgauchet,2
47302,Carrefour Global Sourcing,manager,,2006,,in-fgauchet,3
47303,CARREFOUR,MANAGER,,2006,,in-fgauchet,4
47304,CARREFOUR MARCHANDISES INTERNATIONALES,Human Resources Development Manager,,2000,Mission: initiated and coordinated various hum...,in-fgauchet,5


## Step 1.2: Save data to SQLite

Next save the data to SQLite, using the same approach as in the associated lecture notebook.

In [0]:
conn = sqlite3.connect('linkedin.db')

# YOUR CODE HERE
### BEGIN SOLUTION
people_df.to_sql('people', conn, if_exists='replace', index=False)
names_df.to_sql('names', conn, if_exists='replace', index=False)
education_df.to_sql('education', conn, if_exists='replace', index=False)
groups_df.to_sql('groups', conn, if_exists='replace', index=False)
skills_df.to_sql('skills', conn, if_exists='replace', index=False)
experience_df.to_sql('experience', conn, if_exists='replace', index=False)
honors_df.to_sql('honors', conn, if_exists='replace', index=False)
also_view_df.to_sql('also_view', conn, if_exists='replace', index=False)
events_df.to_sql('events', conn, if_exists='replace', index=False)
### END SOLUTION

In [0]:
# Sanity Check 1.2.1 - please do not modify or delete this cell!

people_df.describe()

Unnamed: 0,_id,locality,industry,url,summary,interests,specilities,overview_html,homepage
count,10137,10137,10137,10137,10137.0,10137.0,10137.0,10137.0,10137.0
unique,10137,1237,950,10137,5566.0,2859.0,2929.0,514.0,279.0
top,in-erwinfschoellkopf,San Francisco Bay Area,Information Technology and Services,http://ar.linkedin.com/in/federicoemiliani,,,,,
freq,1,385,749,1,4569.0,7270.0,7208.0,9624.0,9859.0


In [0]:
# Sanity Check 1.2.2 - please do not modify or delete this cell!

skills_df.describe()

Unnamed: 0,person,value,pos
count,104647,104647,104647
unique,7529,16043,50
top,in-enjoywithfranklin,Social Media,0
freq,50,806,7529


In [0]:
# Sanity Check 1.2.3 - please do not modify or delete this cell!

experience_df.describe()

Unnamed: 0,org,title,end,start,desc,person,pos
count,47306,47306,47306.0,47306,47306.0,47306,47306
unique,34771,28932,9.0,1286,27496.0,8795,62
top,IBM,Consultant,,January 2010,,in-fernandofortunocitoler,0
freq,146,455,35966.0,514,19595.0,62,8795


# Step 2: What is a data scientist?

In this homework, we will use LinkedIn to analyze what it means to be a data scientist (as of a few years ago).

## Step 2.1: What are common skills for data scientists?

Our first question is:  for anyone who's job revolves around data (database administrators, data curators, data engineers, data scientists), *what are the most common skills*?

### Step 2.1.1: Collect skills (Pandas)

Complete the `collect_skills` function below. The function should:

1. Using `experience_df`, find all people with a position containing "data" in the title. Remember upper versus lower case.
2. Using `skills_df`, find all people with "data science" as a skill. Again, remember to account for case.
3. For all of the unique people found in steps 1 and 2, find the rest of their skills
4. Return a dataframe of the top 15 skills, by frequency  (see pandas.DataFrame.sort_values).  The columns should be called `skill` (the name of the skill) and `scientists` (the count of the number of data scientists with this skill).

In [0]:
# TODO: Find the top 15 skills for data scientists (Pandas)

def collect_skills(experience_df, people_df, skills_df):
    ### BEGIN SOLUTION
    # Any experience with the word (fragment) data, which includes database XYZ, data analyst, data engineer
    data_titles_df = experience_df[experience_df['title'].apply(lambda s: s.lower().find('data') >= 0)]
    print(data_titles_df.shape)
    # Which people?
    data_scientists_df = people_df.merge(data_titles_df, left_on=['_id'], right_on=['person'])[['_id']]
    print(data_scientists_df.shape)

    # Now find anyone with data science as a skill
    ds_skills_df = skills_df[skills_df['value'].apply(lambda s: s.lower() == 'data science')]
    print(ds_skills_df.shape)

    data_skilled_df = people_df.merge(ds_skills_df, left_on=['_id'], right_on=['person'])[['_id']]
    data_scientists_df = pd.concat([data_scientists_df, data_skilled_df]).drop_duplicates()
        
    data_scientist_skills_df = data_scientists_df.merge(skills_df, left_on=['_id'], right_on=['person'])[['person','value']].rename(columns={'value': 'skill', 'person': 'scientists'})
    return data_scientist_skills_df.groupby('skill').count().sort_values('scientists',ascending=False).reset_index().head(15)
   
    ### END SOLUTION

In [0]:
# Sanity Check 2.1.1 - please do not modify or delete this cell!

top_skills_df = collect_skills(experience_df, people_df, skills_df)
display(top_skills_df)

if "skill" not in top_skills_df:
    raise AssertionError("skill column not defined")
if "scientists" not in top_skills_df:
    raise AssertionError("scientists column not defined")
if len(top_skills_df) != 15:
    raise AssertionError("dataframe does not have top 15")  

(325, 7)
(325, 1)
(3, 3)


Unnamed: 0,skill,scientists
0,SQL,28
1,Databases,23
2,Business Intelligence,22
3,Business Analysis,19
4,MySQL,19
5,Java,17
6,Microsoft SQL Server,17
7,JavaScript,17
8,Linux,17
9,Data Warehousing,16


### Step 2.1.2: Top skills (SQL)

Compute the same table as in 2.1.1 using SQL. Store it as a datafrane called `top_skills_sql` but otherwise matching the schema and other properties. Be sure to save the data to SQLLite in a table called `top_skills`.

In [0]:
# TODO: Find the top 15 skills for data scientists (SQL)

# YOUR CODE HERE
### BEGIN SOLUTION
# Any experience with the word (fragment) data, which includes database XYZ, data analyst, data engineer
conn.execute('drop view if exists data_titles')
conn.execute('create view data_titles as select * from experience where instr(lower(title), "data")')

#pd.read_sql_query('select * from data_titles', conn)

# Which people?
conn.execute('drop view if exists data_scientists')
conn.execute('create view data_scientists as select distinct _id from people join data_titles on _id = person union select distinct _id from people join skills on _id = person where lower(value)="data science"')
pd.read_sql_query('select * from data_scientists', conn)

conn.execute('drop view if exists data_scientist_skills')
conn.execute('create view data_scientist_skills as select value as skill, person as scientists from data_scientists join skills on _id = person')
top_skills_sql = pd.read_sql_query('select skill, count(*) as scientists from data_scientist_skills group by skill order by count(*) desc limit 15', conn)

top_skills_sql.to_sql('top_skills', conn, if_exists='replace', index=False)
### END SOLUTION

display(top_skills_sql)

Unnamed: 0,skill,scientists
0,SQL,28
1,Databases,23
2,Business Intelligence,22
3,Business Analysis,19
4,MySQL,19
5,Java,17
6,JavaScript,17
7,Linux,17
8,Microsoft SQL Server,17
9,Data Warehousing,16


In [0]:
# Sanity Check 2.1.2 - please do not modify or delete this cell!

if "skill" not in top_skills_sql:
    raise AssertionError("skill column not defined")
if "scientists" not in top_skills_sql:
    raise AssertionError("scientists column not defined")
if len(top_skills_df) < 1:
    raise AssertionError("dataframe has no results")  
if len(top_skills_sql.merge(top_skills_df)) != len(top_skills_sql):
    raise AssertionError("Pandas and SQL versions are not of the same length")

## Step 2.2: What are common titles for those with data science skills?

Complete the `collect_titles` function below that aggregates the most recent titles of people with data science skills. This function should use the given dataframes as input and return a two column dataframe: one column called `title` and the other called `count`. You should only consider people who have at least `min_skills` of the top skills for a data scientist. You should also only keep those titles that appear at least `min_count` times.

For extra practice, you can also do this in SQL.

In [0]:
# TODO: Find the common titles (Pandas)

def collect_titles(top_skills_df, skills_df, people_df, experience_df, min_skills, min_count):
    # YOUR CODE HERE
    ### BEGIN SOLUTION
    people_skills_df = top_skills_df.merge(skills_df, left_on='skill', right_on='value').\
        merge(people_df, left_on='person', right_on='_id')[['_id','skill']]

    ds_titles_df = people_skills_df.groupby('_id').count().reset_index().sort_values('skill', ascending=False)
    ds_titles_df = ds_titles_df[ds_titles_df['skill'] >= min_skills]
    ds_titles_df = ds_titles_df.merge(experience_df,left_on='_id', right_on='person')

    # ds_titles_df = ds_titles_df[(ds_titles_df['end'] == 'Present') | (ds_titles_df['pos'] == 0)].\
    #     groupby('title').count().reset_index(f)[['title','_id']].sort_values('_id', ascending=False).\
    #     rename(columns={'_id': 'count'})
    ds_titles_df = ds_titles_df[(ds_titles_df['end'] == 'Present') | (ds_titles_df['pos'] == 0)].\
        groupby('title').count().reset_index()[['title','_id']].sort_values('_id', ascending=False).\
        rename(columns={'_id': 'count'})

    return ds_titles_df[ds_titles_df['count'] >= min_count]
    ### END SOLUTION

In [0]:
# Sanity Check 2.2 - please do not modify or delete this cell!

ds_titles_df = collect_titles(top_skills_df, skills_df, people_df, experience_df, 6, 2)
display(ds_titles_df)

if "title" not in ds_titles_df:
    raise AssertionError("title column not defined")
if "count" not in ds_titles_df:
    raise AssertionError("count column not defined")
if len(ds_titles_df) < 1:
    raise AssertionError("dataframe has no results")

Unnamed: 0,title,count
79,Software Engineer,6
43,Owner,6
77,Software Developer,4
71,Senior Software Engineer,4
86,Web Developer,2
30,IT Manager,2
82,System Engineer,2
81,Sr. Software Engineer,2
35,Lead Engineer,2
10,"Co-Founder, CTO",2


## Step 2.3: Who employs "data people" based on title?

Now let's find the list of companies that have employed people with the above titles, ranked by number of employees who have had these titles.

### Step 2.3.1: Data employers

Complete the `collect_employers` function below that aggregates the employers with positions corresponding to the most recent titles of people with data science skills. This function should use the given dataframes as input and return a two column dataframe: one column called `org` and the other called `people`. Show the names of companies (in field `org`) with at least `min_count` employees who are "data people" (include that count in the `people` column). Order the dataframe by the count of data people in the company in descending order.

In [0]:
# TODO: Find the data employers
def collect_employers(experience_df, ds_titles_df, min_count):
    # YOUR CODE HERE
    ### BEGIN SOLUTION
    experiences_df = experience_df.merge(ds_titles_df[['title']],left_on='title',right_on='title')[['org','person']].groupby('org').count().reset_index().rename(columns={'person':'people'}).sort_values('people',ascending=False)
    return experiences_df[experiences_df['people'] >= min_count]
    ### END SOLUTION

In [0]:
# Sanity Check 2.3.1 - please do not modify or delete this cell!

employers_df = collect_employers(experience_df, ds_titles_df, 5)
display(employers_df)

if "IBM" not in employers_df['org'].tolist():
    raise AssertionError("Missing IBM")
    
if employers_df['people'].min() < 4:
    raise AssertionError("Not filtering properly")

Unnamed: 0,org,people
655,Google,11
560,Facebook,5
718,IBM,5
1439,Thomson Reuters,5
1087,Oracle,5


### Step 2.3.2:  Employees of Data Employers

Complete the `collect_employees` function below that aggregates the employees of employers with positions corresponding to the most recent titles of people with data science skills. In other words, who are the employees of the data employers you found before and what are their titles? This function should use the given dataframes as input and return the `org`, `family_name`, `given_name`, and `title` of each person.

In [0]:
# TODO: Find the employees of the data employers

# YOUR CODE HERE
### BEGIN SOLUTION
def collect_employees(people_df, experience_df, employers_df, names_df, ds_titles_df):
    return people_df.merge(experience_df.merge(employers_df[['org']],left_on='org',right_on='org'), left_on='_id', right_on='person')[['_id','org','title']].\
        merge(names_df,left_on='_id',right_on='person').merge(ds_titles_df[['title']],left_on='title',right_on='title')[['org','family_name','given_name','title']].rename(columns={'org': 'organization'})
### END SOLUTION

In [0]:
# Sanity Check 2.3.2 - please do not modify or delete this cell!

title_people_df = collect_employees(people_df, experience_df, employers_df, names_df, ds_titles_df)
display(title_people_df)

if len(title_people_df.columns) != 4:
    raise AssertionError('Wrong number of columns. Check schema again')

Unnamed: 0,organization,family_name,given_name,title
0,Facebook,Agafonov,Anton,Software Engineer
1,Google,Belinsky,Eran,Software Engineer
2,Google,Ribas,Eduardo,Software Engineer
3,Thomson Reuters,Erichsen,David,Software Engineer
4,Facebook,Sheripov,Eliskhan,Software Engineer
5,IBM,Wang,Jiajun,Software Engineer
6,Oracle,Zhang,Ethan,Software Engineer
7,Oracle,Pasquini,Ettore,Software Engineer
8,Facebook,Bogatov,Eugene,Software Engineer
9,IBM,Evanchik,Stephen,Software Engineer


## Step 2.4: Find peers

In many common social graph settings, we can make recommendations to people based on their similarity with other people. In this case, we define similarity in terms of the number of identical skills.

Suppose A and B have similar skills: A -> X1 and B -> X1, A -> X2 and B -> X2, etc. up to A -> Xk and B -> Xk.

Then given that A and B have similar skills, we might recommend A's employer to B, and vice versa.

### Step 2.4.1: Compute the top pairs of peers

Let's consider only the first 100 people in `people_df`.
Find, out of this set, the pairs of people with the most shared/common skills, and return the closest 20 pairs in descending order.  We'll then use this to make a *recommendation* for a potential employer and position to each person.

Complete the `collect_peers` function below that finds the top `num` pairs of peers. In other words, compare each person with each *other* person, counting the total set of skills in common. This function should use the given dataframes and `num` as input and return a three column dataframe: `person_1`, `person_2`, and `common_skills`. The first two columns should be person IDs and the last column should be the number of skills that this pair of people shares.

**Hint:** Doing this requires a *Cartesian product*, i.e., every ID paired with every other ID.  Think about how to create a dataframe just with people IDs, then add a field to this dataframe that will let us combine every record with every record.

In [0]:
# TODO: Finish the collect_peers function

people_df_subset = people_df.head(100)

def collect_peers(people_df_subset, skills_df, num):
    # YOUR CODE HERE
    ### BEGIN SOLUTION
    people_skills_df = people_df_subset.merge(skills_df, left_on='_id', right_on='person')[['_id','industry','value']]

    people_ids_df = people_df_subset[['_id']]
    people_ids_df.loc[:,'_id2'] = 0

    cartesian_df = people_ids_df.merge(people_ids_df,left_on='_id2',right_on='_id2')[['_id_x','_id_y']]
    cartesian_df = cartesian_df[cartesian_df['_id_x'] != cartesian_df['_id_y']]
    cartesian_df = cartesian_df.rename(columns={'_id_x': 'person_1', '_id_y': 'person_2'})

    recs_df = people_skills_df.merge(cartesian_df, left_on='_id', right_on='person_1').merge(people_skills_df, left_on=['person_2','value'], right_on=['_id','value'])[['person_1','person_2','value']].\
        groupby(by=['person_1','person_2']).count().reset_index().sort_values('value', ascending=False).head(num)

    return recs_df.rename(columns={"value":"common_skills"})
    ### END SOLUTION


In [0]:
# Sanity Check 2.4.1 - please do not modify or delete this cell!

recs_df = collect_peers(people_df_subset, skills_df, 20)
display(recs_df)

if "person_1" not in recs_df:
    raise AssertionError("person_1 column not defined")
if "person_2" not in recs_df:
    raise AssertionError("person_2 column not defined")
if "common_skills" not in recs_df:
    raise AssertionError("common_skills column not defined")
if(len(recs_df) != 20):
    raise AssertionError('Wrong number of rows in recs_df')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


Unnamed: 0,person_1,person_2,common_skills
867,in-emilyprog,in-emilyngerhard,9
477,in-emilyngerhard,in-emilyprog,9
718,in-emilypetroff,in-emilymcmonagle,7
764,in-emilyping,in-emilymerillj2t,7
272,in-emilymerillj2t,in-emilypopescu,7
270,in-emilymerillj2t,in-emilyping,7
813,in-emilypopescu,in-emilymerillj2t,7
220,in-emilymcmonagle,in-emilypetroff,7
233,in-emilymcmonagle,in-emilysimarski,7
1327,in-emilysimarski,in-emilymcmonagle,7


### Step 2.4.2: Get the last jobs

Complete the `last_job` function below that takes `experience_df` as input and returns the `person`, `title`, and `org` corresponding to each person's **last** (most recent) employment experience (three column dataframe).

In [0]:
# TODO: Complete the last_job function

def last_job(experience_df):
    # YOUR CODE HERE
    ### BEGIN SOLUTION
    return experience_df[experience_df['pos'] == '0'][['person','org','title']].sort_values('person')
    ### END SOLUTION

In [0]:
# Sanity Check 2.4.2 - please do not modify or delete this cell!

last_job_df = last_job(experience_df)
display(last_job_df)

if(len(last_job_df.columns) != 3):
    raise AssertionError('Wrong number of columns in last_job_df')

Unnamed: 0,person,org,title
0,in-emilymaggiotto,Citi Staffing,Temp
10,in-emilymain,Goldman Sachs,"HCM Senior Business Analyst, Consultant"
19,in-emilymalleytaylor,UCLA Anderson,"Associate Director, MBA Career Education & Com..."
25,in-emilymansfield,UK Web Media,Account Manager
28,in-emilymartin09,"Ernst & Young, LLP","Assurance Senior, Nevada CPA"
...,...,...,...
47274,in-fgarciagarcia,LLORENTE & CUENCA,"Director, Online Communication"
47287,in-fgarciagrial,GRIAL,research GRoup in InterAction and eLearning Head
47290,in-fgarciapolite,"Institut Químic de Sarrià, Ramon Llull University",Laboratory Assistant
47294,in-fgarriga,Zapnus Ltd.,Android Developer


### Step 2.4.3: Recommend jobs

Complete the `recommend_jobs` function below that takes `recs_df`, `names_df`, and `last_job_df` as input and returns for each `person_1`, `person_2`'s most recent `title` and `org`.

In [0]:
# TODO: Complete the recommend_jobs function

def recommend_jobs(recs_df, names_df, last_job_df):
    # YOUR CODE HERE
    ### BEGIN SOLUTION
    return recs_df.merge(names_df,left_on='person_1',right_on='person')[['family_name','given_name','person_1','person_2']].\
        merge(last_job_df,left_on='person_2',right_on='person', how="left")[['family_name','given_name','person_2','org','title']].sort_values('family_name')
    ### END SOLUTION

In [0]:
# Sanity Check 2.4.3 - please do not modify or delete this cell!

recommended_df = recommend_jobs(recs_df, names_df, last_job_df)
display(recommended_df)

if "family_name" not in recommended_df:
    raise AssertionError("person_1 column not defined")
if "given_name" not in recommended_df:
    raise AssertionError("person_2 column not defined")
if "person_2" not in recommended_df:
    raise AssertionError("common_skills column not defined")
if "org" not in recommended_df:
    raise AssertionError("common_skills column not defined")
if "title" not in recommended_df:
    raise AssertionError("common_skills column not defined")

Unnamed: 0,family_name,given_name,person_2,org,title
19,Amanti,Emily,in-emilymonsell,Cafedirect,Consumer Communications Manager
1,Gerhard,Emily,in-emilyprog,Michigan Healthcare Professionals,Clerical Assistant
8,Kimberly,Emily,in-emilyping,Matrix Resources,Account Executive
7,Kimberly,Emily,in-emilymerillj2t,"J2T Recruiting Consultants, Inc.",Executive Recruiter-IT
17,Maggiotto,Emily,in-emilysit,salesforce.com,Marketing Operations (APAC)
10,McMonagle,Emily,in-emilysimarski,Chic Little Devil Stylehouse & CLD PR,Account Manager
9,McMonagle,Emily,in-emilypetroff,Revenue Canvas LLC,Coach/Consultant & Owner
5,Merrill,Emily,in-emilypopescu,11th Hour Search LLC,Recruiter
6,Merrill,Emily,in-emilyping,Matrix Resources,Account Executive
18,Merriman,Emily,in-emilynordby,Koehler & Dramm Wholesale Florist,Marketing & Communications Director
