# Lecture Notebook: Making Choices about Data Representation and Processing

## LinkedIn Social Analysis

This module explores concepts in:

* Designing data representations to capture important relationships
* Reasoning over graphs
* Exploring and traversing graphs
* Performance implications of design choices
* Techniques for indexing, parallelism, and sequence

It sets the stage understanding cloud/cluster-compute (parallel) data processing.



In [0]:
!pip install pymongo[tls,srv]
!pip install swifter
!pip install lxml

Collecting dnspython<2.0.0,>=1.16.0; extra == "srv"
[?25l  Downloading https://files.pythonhosted.org/packages/ec/d3/3aa0e7213ef72b8585747aa0e271a9523e713813b9a20177ebe1e939deb0/dnspython-1.16.0-py2.py3-none-any.whl (188kB)
[K     |████████████████████████████████| 194kB 2.7MB/s 
[?25hInstalling collected packages: dnspython
Successfully installed dnspython-1.16.0
Collecting swifter
  Downloading https://files.pythonhosted.org/packages/1f/22/0a46b4d2a417824d7e883a8bd8e01c3b000bbdeaa7c154891b7cba94cbf7/swifter-0.296-py3-none-any.whl
Collecting tqdm>=4.33.0
[?25l  Downloading https://files.pythonhosted.org/packages/b9/08/8505f192efc72bfafec79655e1d8351d219e2b80b0dec4ae71f50934c17a/tqdm-4.38.0-py2.py3-none-any.whl (53kB)
[K     |████████████████████████████████| 61kB 3.5MB/s 
Collecting partd>=0.3.8; extra == "complete"
  Downloading https://files.pythonhosted.org/packages/8b/17/09c352519da1db31634979c3aa9126078e9ece0f561c5f641e0649b78905/partd-1.0.0-py2.py3-none-any.whl
Collecting l



In [0]:
import pandas as pd
import numpy as np

# JSON parsing
import json

# HTML parsing
from lxml import etree
import urllib

# SQLite RDBMS
import sqlite3

# Time conversions
import time

# Parallel processing
import swifter

# NoSQL DB
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError, OperationFailure

import os
import zipfile

# Part A: Getting the Data

We will use a crawl of LinkedIn, stored as a sequence of JSON objects (one per line).  It is taken from Kaggle (https://www.kaggle.com/linkedindata/linkedin-crawled-profiles-dataset).

The Lecture Notebook on Modeling Data and Knowledge shows how to get the data into the SQLite database, and we will use this database for tasks in this notebook. If you haven't already created the database, you may run the following remote Python script, otherwise, you can skip this step.  Instructions on how to get the Kaggle data in Colab are here: https://stackoverflow.com/questions/49310470/using-kaggle-datasets-in-google-colab

In [0]:
url = 'https://raw.githubusercontent.com/chenleshang/OpenDS4All/master/Module2/module2dataloading.py'
urllib.request.urlretrieve(url,filename='module2dataloading.py')
url = 'https://upenn-bigdataanalytics.s3.amazonaws.com/linkedin.zip'
#url = 'X'
filehandle, _ = urllib.request.urlretrieve(url,filename='local.zip')

In [0]:
def fetch_file(fname):
    zip_file_object = zipfile.ZipFile(filehandle, 'r')
    for file in zip_file_object.namelist():
        file = zip_file_object.open(file)
        if file.name == fname: return file
    return None
    
linkedin_small = fetch_file('linkedin_small.json')# 100K records
# note that linkedin_tiny.json has bug. Do not use! 

from module2dataloading import *
import importlib

In [0]:
data_loading(file=fetch_file('linkedin_small.json'), dbname='linkedin.db', filetype='localobj', LIMIT=20000)

10000
20000


# Part B: Big Data Takes a Long Time to Process

This dataset is very big, and processing it may take a long time depending on how the processing is performed.  We'll explore this, and see how we can improve performance.  Then we'll see how an SQL database automatically finds good ways to execute queries.

In [0]:
%%time
# 100,000 records from linkedin
# Note that we are loading all the data into a dataframe first, then selecting the rows we want.
linked_in = fetch_file('linkedin_small.json')
people = []

i=1
for line in linked_in:
    person = json.loads(line)
    people.append(person)

    if(i % 10000==0):
      print(i)
      if(i == 20000):
        break

    i += 1
    
people_df = pd.DataFrame(people)
people_df[people_df['industry'] == 'Medical Devices']

10000
20000
CPU times: user 2.52 s, sys: 175 ms, total: 2.7 s
Wall time: 2.7 s


In [0]:
%%time
# 100,000 records from linkedin
# Note that we are selecting the data we want as we loading the data into a dataframe.
linked_in = fetch_file('linkedin_small.json')

people = []

i = 1
for line in linked_in:
    person = json.loads(line)
    if 'industry' in person and person['industry'] == 'Medical Devices':
        people.append(person)

    if(i % 10000 == 0):
      print(i)
      if( i == 20000):
        break
    i += 1
    
people_df = pd.DataFrame(people)
people_df

10000
20000
CPU times: user 2.38 s, sys: 41.6 ms, total: 2.42 s
Wall time: 2.42 s


## SQL query without an index

In the above, we rewrote the processing to perform the filter (industry is Medical Devices) early.  However, SQL databases will automatically "push down" selection and projection where feasible.  They also don't need to parse the data.  Here we assume that the data is already in a relational database (so it is not a head-to-head comparison with the above).

In [0]:
conn = sqlite3.connect('linkedin.db')

## This is just to reset things so we don't have an index
conn.execute('begin transaction')
conn.execute('drop index if exists people_industry')
conn.execute('commit')

<sqlite3.Cursor at 0x7f0cbf9cb0a0>

In [0]:
%%time

pd.read_sql_query('select * from people where industry="Medical Devices"', conn)

CPU times: user 8.47 ms, sys: 7.04 ms, total: 15.5 ms
Wall time: 15.4 ms


Unnamed: 0,_id,locality,industry,summary,url,overview_html,specilities,interests,homepage
0,in-00000001,United States,Medical Devices,SALES MANAGEMENT / BUSINESS DEVELOPMENT / PROJ...,http://www.linkedin.com/in/00000001,,,,
1,in-13806219531,China,Medical Devices,,http://cn.linkedin.com/in/13806219531,,,,
2,in-1scottsanderson,Greater Nashville Area,Medical Devices,"Whether achieving new highs in medical sales, ...",http://www.linkedin.com/in/1scottsanderson,,"Customer Service, Sales Growth, Direct Sales, ...",,
3,in-2008annvu,"Rochester, New York Area",Medical Devices,Change agent and proactive leader that drives ...,http://www.linkedin.com/in/2008annvu,,,,
4,in-2johnstroh,"Orange County, California Area",Medical Devices,Contact –email: johnstroh@verizon.netmobile: 7...,http://www.linkedin.com/in/2johnstroh,,,"John Stroh – President, CEO, COO, CFO, Directo...",
...,...,...,...,...,...,...,...,...,...
101,in-amygrubbnarcotta,Greater Boston Area,Medical Devices,,http://www.linkedin.com/in/amygrubbnarcotta,,,,
102,in-amykeller,San Francisco Bay Area,Medical Devices,,http://www.linkedin.com/in/amykeller,,,,
103,in-amylenger,Greater New York City Area,Medical Devices,,http://www.linkedin.com/in/amylenger,,,,
104,in-amymtran,San Francisco Bay Area,Medical Devices,Dedicated biomedical engineer with a diverse t...,http://www.linkedin.com/in/amymtran,"<dl id=""overview""><dt id=""overview-summary-cur...","Proficient in MATLAB. Experience with SQL, Vis...",,


## Let's build an index now...

To speed up the SQL query processing, we can build an index. 

In [0]:
conn = sqlite3.connect('linkedin.db')

conn.execute('begin transaction')
conn.execute('drop index if exists people_industry')
conn.execute("create index people_industry on people(industry)")
conn.execute('commit')

<sqlite3.Cursor at 0x7f0cc33cd8f0>

In [0]:
%%time
# Treat the view as a table, see what's there
pd.read_sql_query('select * from people where industry="Medical Devices"', conn)

# In our tests, this was 5x faster!

CPU times: user 5.73 ms, sys: 28 µs, total: 5.75 ms
Wall time: 5.14 ms


Unnamed: 0,_id,locality,industry,summary,url,overview_html,specilities,interests,homepage
0,in-00000001,United States,Medical Devices,SALES MANAGEMENT / BUSINESS DEVELOPMENT / PROJ...,http://www.linkedin.com/in/00000001,,,,
1,in-13806219531,China,Medical Devices,,http://cn.linkedin.com/in/13806219531,,,,
2,in-1scottsanderson,Greater Nashville Area,Medical Devices,"Whether achieving new highs in medical sales, ...",http://www.linkedin.com/in/1scottsanderson,,"Customer Service, Sales Growth, Direct Sales, ...",,
3,in-2008annvu,"Rochester, New York Area",Medical Devices,Change agent and proactive leader that drives ...,http://www.linkedin.com/in/2008annvu,,,,
4,in-2johnstroh,"Orange County, California Area",Medical Devices,Contact –email: johnstroh@verizon.netmobile: 7...,http://www.linkedin.com/in/2johnstroh,,,"John Stroh – President, CEO, COO, CFO, Directo...",
...,...,...,...,...,...,...,...,...,...
101,in-amygrubbnarcotta,Greater Boston Area,Medical Devices,,http://www.linkedin.com/in/amygrubbnarcotta,,,,
102,in-amykeller,San Francisco Bay Area,Medical Devices,,http://www.linkedin.com/in/amykeller,,,,
103,in-amylenger,Greater New York City Area,Medical Devices,,http://www.linkedin.com/in/amylenger,,,,
104,in-amymtran,San Francisco Bay Area,Medical Devices,Dedicated biomedical engineer with a diverse t...,http://www.linkedin.com/in/amymtran,"<dl id=""overview""><dt id=""overview-summary-cur...","Proficient in MATLAB. Experience with SQL, Vis...",,


In [0]:
conn = sqlite3.connect('linkedin.db')

people_df = pd.read_sql_query('select * from people limit 500', conn)
experience_df = pd.read_sql_query('select * from experience limit 5000', conn)
skills_df = pd.read_sql_query('select * from skills limit 8000', conn)

print ("%d people"%len(people_df))
print ("%d experiences"%len(experience_df))
print ("%d skills"%len(skills_df))

500 people
5000 experiences
8000 skills


In [0]:
# Implement a dataframe merge in Python.

def merge(S,T,l_on,r_on):
    ret = pd.DataFrame()
    count = 0
    for s_index in range(0, len(S)):
        for t_index in range(0, len(T)):
            count = count + 1
            if S.loc[s_index, l_on] == T.loc[t_index, r_on]:
                ret = ret.append(S.loc[s_index].append(T.loc[t_index].drop(labels=r_on)), ignore_index=True)

    print('Merge compared %d tuples'%count)
    return ret

In [0]:
%%time
# Here's a test join, with people and their experiences.  We can see how many
# comparisons are made

merge(people_df, experience_df, '_id', 'person')

Merge compared 2500000 tuples
CPU times: user 44.9 s, sys: 5.62 ms, total: 44.9 s
Wall time: 44.9 s


Unnamed: 0,_id,desc,end,homepage,industry,interests,locality,org,overview_html,pos,specilities,start,summary,title,url
0,in-00001,Biomarker Leader for compounds in clinical dev...,Present,,Pharmaceuticals,,"Antwerp Area, Belgium",Johnson and Johnson,"<dl id=""overview""><dt id=""overview-summary-cur...",0,"Biomarkers in Oncology, Cancer Genomics, Molec...",November 2009,Ph.D. scientist with background in cancer rese...,"Senior Scientist, Oncology Biomarkers",http://be.linkedin.com/in/00001
1,in-00001,Single Cell Gene expression.,,,Pharmaceuticals,,"Antwerp Area, Belgium",Albert Einstein Medical Center,"<dl id=""overview""><dt id=""overview-summary-cur...",1,"Biomarkers in Oncology, Cancer Genomics, Molec...",September 2008,Ph.D. scientist with background in cancer rese...,Associate at Dept of Molecular Genetics,http://be.linkedin.com/in/00001
2,in-00001,Work on peptide to restore wt p53 function in ...,,,Pharmaceuticals,,"Antwerp Area, Belgium",Columbia University,"<dl id=""overview""><dt id=""overview-summary-cur...",2,"Biomarkers in Oncology, Cancer Genomics, Molec...",August 2006,Ph.D. scientist with background in cancer rese...,Associate Research Scientist,http://be.linkedin.com/in/00001
3,in-00001,Molecular profiling of colorectal cancer.,,,Pharmaceuticals,,"Antwerp Area, Belgium",Memorial Sloan Kettering Cancer Center,"<dl id=""overview""><dt id=""overview-summary-cur...",3,"Biomarkers in Oncology, Cancer Genomics, Molec...",January 2003,Ph.D. scientist with background in cancer rese...,Post Doctoral Research Fellow,http://be.linkedin.com/in/00001
4,in-00001,Cancer Research at Dept of Surgery.Molecular p...,,,Pharmaceuticals,,"Antwerp Area, Belgium",Sahlgrenska University Hospital,"<dl id=""overview""><dt id=""overview-summary-cur...",4,"Biomarkers in Oncology, Cancer Genomics, Molec...",November 2001,Ph.D. scientist with background in cancer rese...,Research Scientist,http://be.linkedin.com/in/00001
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2223,in-3256068,Develops and maintains business relationship w...,,,Real Estate,"movies, travel and making friends","Chengdu City, China",Servcorp,,1,"advertising, cash management, cashier, closing...",October 2007,My company specializes offering a total busine...,PR & Marketing Manager,http://cn.linkedin.com/in/3256068
2224,in-3256068,Assists the store manager in executing store o...,,,Real Estate,"movies, travel and making friends","Chengdu City, China",Starbucks,,2,"advertising, cash management, cashier, closing...",January 2006,My company specializes offering a total busine...,Shift Supervisor,http://cn.linkedin.com/in/3256068
2225,in-3256068,,,,Real Estate,"movies, travel and making friends","Chengdu City, China",McDonald's Corporation,,3,"advertising, cash management, cashier, closing...",January 2001,My company specializes offering a total busine...,PR & Marketing,http://cn.linkedin.com/in/3256068
2226,in-3256068,Hires and trains marketing coordinatorsDevelop...,,,Real Estate,"movies, travel and making friends","Chengdu City, China",McDonald's Corporation,,4,"advertising, cash management, cashier, closing...",January 2001,My company specializes offering a total busine...,Marketing Manager,http://cn.linkedin.com/in/3256068


In [0]:
# Let's find all people (by ID) who have Marketing as a skill

mktg_df = skills_df[skills_df['value'] == 'Marketing'].reset_index()[['person']]
mktg_df

Unnamed: 0,person
0,in-01011985
1,in-01mihaipop
2,in-021370900310
3,in-02k17m87
4,in-0311101678
5,in-05stephaniemartinez
6,in-12magazine
7,in-140hours
8,in-19655
9,in-1alyssalee


In [0]:
%%time
# Test differences in join order (Part 1)
merge(merge(people_df, experience_df, '_id', 'person'), mktg_df, '_id', 'person')

Merge compared 2500000 tuples
Merge compared 51244 tuples
CPU times: user 45.9 s, sys: 914 µs, total: 45.9 s
Wall time: 45.9 s


Unnamed: 0,_id,desc,end,homepage,industry,interests,locality,org,overview_html,pos,specilities,start,summary,title,url
0,in-01011985,,Present,,Biotechnology,,"Hyderabad Area, India",BioGenex,,0,"Marketing , Operations Management , P&L Head, ...",September 2012,•Having 12 Yrs of Experience in Marketing & In...,Senior Manager -IBD,http://in.linkedin.com/in/01011985
1,in-01mihaipop,"Shake Advertising is an integrated agency, we ...",Present,,Marketing și publicitate,,Romania,SHAKE advertising,,0,"IT&C/Internet, Media / Publishing, Services, A...",August 2010,Engineer...Product manager FMCG...Product Mana...,Managing partner,http://ro.linkedin.com/in/01mihaipop
2,in-01mihaipop,Company with a wide area of products oriented ...,,,Marketing și publicitate,,Romania,Saint Discount,,1,"IT&C/Internet, Media / Publishing, Services, A...",January 2010,Engineer...Product manager FMCG...Product Mana...,Owner,http://ro.linkedin.com/in/01mihaipop
3,in-01mihaipop,Construction company dealing in diamond cuttin...,,,Marketing și publicitate,,Romania,Zygo Construct,,2,"IT&C/Internet, Media / Publishing, Services, A...",March 2008,Engineer...Product manager FMCG...Product Mana...,Managing partner,http://ro.linkedin.com/in/01mihaipop
4,in-01mihaipop,"Direct Fastening, Screw Fastening & Rotary Dri...",,,Marketing și publicitate,,Romania,Hilti,,3,"IT&C/Internet, Media / Publishing, Services, A...",August 2007,Engineer...Product manager FMCG...Product Mana...,Product Manager,http://ro.linkedin.com/in/01mihaipop
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75,in-3256068,Develops and maintains business relationship w...,,,Real Estate,"movies, travel and making friends","Chengdu City, China",Servcorp,,1,"advertising, cash management, cashier, closing...",October 2007,My company specializes offering a total busine...,PR & Marketing Manager,http://cn.linkedin.com/in/3256068
76,in-3256068,Assists the store manager in executing store o...,,,Real Estate,"movies, travel and making friends","Chengdu City, China",Starbucks,,2,"advertising, cash management, cashier, closing...",January 2006,My company specializes offering a total busine...,Shift Supervisor,http://cn.linkedin.com/in/3256068
77,in-3256068,,,,Real Estate,"movies, travel and making friends","Chengdu City, China",McDonald's Corporation,,3,"advertising, cash management, cashier, closing...",January 2001,My company specializes offering a total busine...,PR & Marketing,http://cn.linkedin.com/in/3256068
78,in-3256068,Hires and trains marketing coordinatorsDevelop...,,,Real Estate,"movies, travel and making friends","Chengdu City, China",McDonald's Corporation,,4,"advertising, cash management, cashier, closing...",January 2001,My company specializes offering a total busine...,Marketing Manager,http://cn.linkedin.com/in/3256068


In [0]:
%%time 
# Test differences in join order (Part 2)
merge(merge(people_df, mktg_df, '_id', 'person'), experience_df, '_id', 'person')

Merge compared 11500 tuples
Merge compared 85000 tuples
CPU times: user 1.8 s, sys: 4.78 ms, total: 1.81 s
Wall time: 1.81 s


Unnamed: 0,_id,desc,end,homepage,industry,interests,locality,org,overview_html,pos,specilities,start,summary,title,url
0,in-01011985,,Present,,Biotechnology,,"Hyderabad Area, India",BioGenex,,0,"Marketing , Operations Management , P&L Head, ...",September 2012,•Having 12 Yrs of Experience in Marketing & In...,Senior Manager -IBD,http://in.linkedin.com/in/01011985
1,in-01mihaipop,"Shake Advertising is an integrated agency, we ...",Present,,Marketing și publicitate,,Romania,SHAKE advertising,,0,"IT&C/Internet, Media / Publishing, Services, A...",August 2010,Engineer...Product manager FMCG...Product Mana...,Managing partner,http://ro.linkedin.com/in/01mihaipop
2,in-01mihaipop,Company with a wide area of products oriented ...,,,Marketing și publicitate,,Romania,Saint Discount,,1,"IT&C/Internet, Media / Publishing, Services, A...",January 2010,Engineer...Product manager FMCG...Product Mana...,Owner,http://ro.linkedin.com/in/01mihaipop
3,in-01mihaipop,Construction company dealing in diamond cuttin...,,,Marketing și publicitate,,Romania,Zygo Construct,,2,"IT&C/Internet, Media / Publishing, Services, A...",March 2008,Engineer...Product manager FMCG...Product Mana...,Managing partner,http://ro.linkedin.com/in/01mihaipop
4,in-01mihaipop,"Direct Fastening, Screw Fastening & Rotary Dri...",,,Marketing și publicitate,,Romania,Hilti,,3,"IT&C/Internet, Media / Publishing, Services, A...",August 2007,Engineer...Product manager FMCG...Product Mana...,Product Manager,http://ro.linkedin.com/in/01mihaipop
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75,in-3256068,Develops and maintains business relationship w...,,,Real Estate,"movies, travel and making friends","Chengdu City, China",Servcorp,,1,"advertising, cash management, cashier, closing...",October 2007,My company specializes offering a total busine...,PR & Marketing Manager,http://cn.linkedin.com/in/3256068
76,in-3256068,Assists the store manager in executing store o...,,,Real Estate,"movies, travel and making friends","Chengdu City, China",Starbucks,,2,"advertising, cash management, cashier, closing...",January 2006,My company specializes offering a total busine...,Shift Supervisor,http://cn.linkedin.com/in/3256068
77,in-3256068,,,,Real Estate,"movies, travel and making friends","Chengdu City, China",McDonald's Corporation,,3,"advertising, cash management, cashier, closing...",January 2001,My company specializes offering a total busine...,PR & Marketing,http://cn.linkedin.com/in/3256068
78,in-3256068,Hires and trains marketing coordinatorsDevelop...,,,Real Estate,"movies, travel and making friends","Chengdu City, China",McDonald's Corporation,,4,"advertising, cash management, cashier, closing...",January 2001,My company specializes offering a total busine...,Marketing Manager,http://cn.linkedin.com/in/3256068


In [0]:
experience_df.loc[0].drop(labels='person')

org                                    Johnson and Johnson
title                Senior Scientist, Oncology Biomarkers
end                                                Present
start                                        November 2009
desc     Biomarker Leader for compounds in clinical dev...
pos                                                      0
Name: 0, dtype: object

In [0]:
%%time

# Slide 21
conn.execute('drop view if exists people500')
conn.execute('drop view if exists experience5000')
conn.execute('drop view if exists skills8000')
conn.execute('create view people500 as select * from people limit 500')
conn.execute('create view experience5000 as select * from experience limit 500')
conn.execute('create view skills8000 as select * from skills limit 500')

pd.read_sql_query('select * from (people500 join skills8000 on _id=person) ps join ' + \
                  'experience5000 ex on ps._id=ex.person and value="Marketing"', conn)

CPU times: user 8.82 ms, sys: 1.98 ms, total: 10.8 ms
Wall time: 34.7 ms


In [0]:
# Join using a *map*, which is a kind of in-memory index
# from keys to (single) values
def merge_map(S,T,l_on,r_on):
    ret = pd.DataFrame()
    T_map = {}
    count = 0
    # Take each value in the r_on field, and
    # make a map entry for it
    for t_index in range(0, len(T)):
        # Make sure we aren't overwriting an entry!
        assert (T.loc[t_index,r_on] not in T_map)
        T_map[T.loc[t_index,r_on]] = T.loc[t_index]
        count = count + 1

    # Now find matches
    for s_index in range(0, len(S)):
        count = count + 1
        if S.loc[s_index, l_on] in T_map:
                ret = ret.append(S.loc[s_index].append(T_map[S.loc[s_index, l_on]].drop(labels=r_on)), ignore_index=True)

    print('Merge compared %d tuples'%count)
    return ret

In [0]:
%%time
# Here's a test join, with people and their experiences.  We can see how many
# comparisons are made
merge_map(experience_df, people_df, 'person', '_id')

Merge compared 5500 tuples
CPU times: user 10.2 s, sys: 2.58 ms, total: 10.2 s
Wall time: 10.2 s


Unnamed: 0,desc,end,homepage,industry,interests,locality,org,overview_html,person,pos,specilities,start,summary,title,url
0,Biomarker Leader for compounds in clinical dev...,Present,,Pharmaceuticals,,"Antwerp Area, Belgium",Johnson and Johnson,"<dl id=""overview""><dt id=""overview-summary-cur...",in-00001,0,"Biomarkers in Oncology, Cancer Genomics, Molec...",November 2009,Ph.D. scientist with background in cancer rese...,"Senior Scientist, Oncology Biomarkers",http://be.linkedin.com/in/00001
1,Single Cell Gene expression.,,,Pharmaceuticals,,"Antwerp Area, Belgium",Albert Einstein Medical Center,"<dl id=""overview""><dt id=""overview-summary-cur...",in-00001,1,"Biomarkers in Oncology, Cancer Genomics, Molec...",September 2008,Ph.D. scientist with background in cancer rese...,Associate at Dept of Molecular Genetics,http://be.linkedin.com/in/00001
2,Work on peptide to restore wt p53 function in ...,,,Pharmaceuticals,,"Antwerp Area, Belgium",Columbia University,"<dl id=""overview""><dt id=""overview-summary-cur...",in-00001,2,"Biomarkers in Oncology, Cancer Genomics, Molec...",August 2006,Ph.D. scientist with background in cancer rese...,Associate Research Scientist,http://be.linkedin.com/in/00001
3,Molecular profiling of colorectal cancer.,,,Pharmaceuticals,,"Antwerp Area, Belgium",Memorial Sloan Kettering Cancer Center,"<dl id=""overview""><dt id=""overview-summary-cur...",in-00001,3,"Biomarkers in Oncology, Cancer Genomics, Molec...",January 2003,Ph.D. scientist with background in cancer rese...,Post Doctoral Research Fellow,http://be.linkedin.com/in/00001
4,Cancer Research at Dept of Surgery.Molecular p...,,,Pharmaceuticals,,"Antwerp Area, Belgium",Sahlgrenska University Hospital,"<dl id=""overview""><dt id=""overview-summary-cur...",in-00001,4,"Biomarkers in Oncology, Cancer Genomics, Molec...",November 2001,Ph.D. scientist with background in cancer rese...,Research Scientist,http://be.linkedin.com/in/00001
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2223,Develops and maintains business relationship w...,,,Real Estate,"movies, travel and making friends","Chengdu City, China",Servcorp,,in-3256068,1,"advertising, cash management, cashier, closing...",October 2007,My company specializes offering a total busine...,PR & Marketing Manager,http://cn.linkedin.com/in/3256068
2224,Assists the store manager in executing store o...,,,Real Estate,"movies, travel and making friends","Chengdu City, China",Starbucks,,in-3256068,2,"advertising, cash management, cashier, closing...",January 2006,My company specializes offering a total busine...,Shift Supervisor,http://cn.linkedin.com/in/3256068
2225,,,,Real Estate,"movies, travel and making friends","Chengdu City, China",McDonald's Corporation,,in-3256068,3,"advertising, cash management, cashier, closing...",January 2001,My company specializes offering a total busine...,PR & Marketing,http://cn.linkedin.com/in/3256068
2226,Hires and trains marketing coordinatorsDevelop...,,,Real Estate,"movies, travel and making friends","Chengdu City, China",McDonald's Corporation,,in-3256068,4,"advertising, cash management, cashier, closing...",January 2001,My company specializes offering a total busine...,Marketing Manager,http://cn.linkedin.com/in/3256068


In [0]:
%%time

# An exercise: how can you modify merge_map to make this work?  (This can be skipped if you wish.)

merge_map(people_df, experience_df, '_id', 'person')

AssertionError: ignored