# Lecture Notebook: Making Choices about Data Representation and Processing

## LinkedIn Social Analysis

This module explores concepts in:

* Designing data representations to capture important relationships
* Reasoning over graphs
* Exploring and traversing graphs
* Performance implications of design choices
* Techniques for indexing, parallelism, and sequence

It sets the stage for understanding cloud/cluster-compute (parallel) data processing.



In [0]:
!pip install pymongo[tls,srv]
!pip install swifter
!pip install lxml




In [0]:
import pandas as pd
import numpy as np

# JSON parsing
import json

# HTML parsing
from lxml import etree
import urllib

# SQLite RDBMS
import sqlite3

# Time conversions
import time

# Parallel processing
import swifter

# NoSQL DB
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError, OperationFailure

import os
import zipfile

# Part A: Getting the Data

We use synthetic LinkedIn data to test this notebook. The Lecture Notebook on Modeling Data and Knowledge shows how to get and process the data. We wrap those steps up in a file, 'module2dataloading.py', for you to use.

In [0]:
# Getting the data processing script, which was covered in the modelling data module. 
# url = 'https://XXX/module2dataloading.py'
# urllib.request.urlretrieve(url,filename='module2dataloading.py')

# Also, get the linkedin data. 
# url = 'X'
# filehandle, _ = urllib.request.urlretrieve(url,filename='local.zip')

('module2dataloading.py', <http.client.HTTPMessage at 0x7f67887c36d8>)

In [0]:
# def fetch_file(fname):
#     zip_file_object = zipfile.ZipFile(filehandle, 'r')
#     for file in zip_file_object.namelist():
#         file = zip_file_object.open(file)
#         if file.name == fname: return file
#     return None
    
# linkedin_small = fetch_file('linkedin_small.json')  # 100K records
# # Note that linkedin_tiny.json has a bug. Do not use it!

from module2dataloading import *
import importlib

In [0]:
# If you use Colab and want to mount Google Drive as a 'local' folder, run this cell.
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

In [0]:
# If you want to load the data locally, use the open() function.
data_loading(file=open('/content/drive/My Drive/Colab Notebooks/test_data_10000.json'), dbname='linkedin.db', filetype='localobj', LIMIT=20000)

10000


# Part B: Big Data Takes a Long Time to Process

This dataset is very big, and processing it may take a long time depending on how the processing is performed.  We'll explore this, and see how we can improve performance.  Then we'll see how an SQL database automatically finds good ways to execute queries.

In [0]:
%%time
# 10,000 records from linkedin
# Note that we are loading all the data into a dataframe first, then selecting the rows we want.
# linked_in = fetch_file('linkedin_small.json')
linked_in = open('/content/drive/My Drive/Colab Notebooks/test_data_10000.json')

people = []

i=1
for line in linked_in:
    person = json.loads(line)
    people.append(person)

    if(i % 10000==0):
      print(i)
      if(i == 20000):
        break

    i += 1
    
people_df = pd.DataFrame(people)
people_df[people_df['industry'] == 'Medical Devices']

10000
CPU times: user 990 ms, sys: 147 ms, total: 1.14 s
Wall time: 1.17 s


In [0]:
%%time
# 10,000 records from linkedin
# Note that we are selecting the data we want as we load it, before building the dataframe.
# linked_in = fetch_file('linkedin_small.json')
linked_in = open('/content/drive/My Drive/Colab Notebooks/test_data_10000.json')

people = []

i = 1
for line in linked_in:
    person = json.loads(line)
    if 'industry' in person and person['industry'] == 'Medical Devices':
        people.append(person)

    if(i % 10000 == 0):
      print(i)
      if( i == 20000):
        break
    i += 1
    
people_df = pd.DataFrame(people)
people_df

10000
CPU times: user 908 ms, sys: 53.2 ms, total: 961 ms
Wall time: 967 ms


## SQL query without an index

In the above, we rewrote the processing to perform the filter (industry is Medical Devices) early.  However, SQL databases will automatically "push down" selection and projection where feasible.  They also don't need to parse the data.  Here we assume that the data is already in a relational database (so it is not a head-to-head comparison with the above).
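
As a minimal sketch of what pushing down selection and projection means, the snippet below applies the filter and keeps only the needed fields while reading each record, rather than after everything is loaded. The sample records and the `wanted` field list are made up for illustration; only the `industry` field and the 'Medical Devices' value come from the cells above.

```python
import io
import json

# Hypothetical stand-in for the JSON-lines file: one JSON object per line.
sample = io.StringIO(
    '{"_id": "a1", "industry": "Medical Devices", "locality": "Chicago, Illinois"}\n'
    '{"_id": "b2", "industry": "Banking", "locality": "Greater Boston Area"}\n'
    '{"_id": "c3", "locality": "Katowice, Poland"}\n'
)

wanted = ['_id', 'locality']  # projection: the only fields we keep

rows = []
for line in sample:
    person = json.loads(line)
    # Selection pushed down: discard non-matching records immediately.
    if person.get('industry') == 'Medical Devices':
        # Projection pushed down: copy only the needed fields.
        rows.append({field: person.get(field) for field in wanted})

print(rows)
```

Later steps then only ever see the small, already-filtered `rows` list, which is the effect a query optimizer aims for when it reorders a plan.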

In [0]:
conn = sqlite3.connect('linkedin.db')

## This is just to reset things so we don't have an index
conn.execute('begin transaction')
conn.execute('drop index if exists people_industry')
conn.execute('commit')

<sqlite3.Cursor at 0x7f6788801ea0>

In [0]:
%%time

# String literals use single quotes in SQL; double quotes denote identifiers
pd.read_sql_query("select * from people where industry='Medical Devices'", conn)

CPU times: user 7.02 ms, sys: 12.1 ms, total: 19.1 ms
Wall time: 21.6 ms


Unnamed: 0,locality,industry,summary,url,specilities,interests,_id,overview_html,homepage
0,"Tianjin Suburb, China",Medical Devices,A highly effective senior Procurement professi...,qfacoyibhrwshwwrqoyfurkiehlpudqpvodgoreyhxw813,"Analyse financière, optimisation de financemen...","new technology, innovation, management reading...",zhgmkbseftvzngxjoujarpfpxoslb813,"<dl id=""overview""><dt id=""overview-summary-cur...",{'photography portfolio': ['http://www.gijsbek...
1,"Rest of Zhejiang, China",Medical Devices,Widely established qualifications in Electrica...,aolkrkybmkezvujdyjykzlwogyhhdgazxxdvewqtcaq1022,"Solvency II, Management Information, Data and ...","Social media, Emerging technology, Photography...",vwgjcqgnlpwpzsklpswfxhkfbqhrm1022,"<dl id=""overview""><dt id=""overview-summary-cur...",{'Bedrijfswebsite': ['http://www.theresearcher...
2,"Cockeysville, Maryland",Medical Devices,At this moment I am interested in gaining more...,vmxcbuehcrqdyoaulwzhgsmqnbkklykvmhwwkmpbowe1158,Financial Model Development,"Cloud computing, SaaS, Continuous Delivery, Sy...",pnudxebyienbkqcmsgmaxazwrevpv1158,"<dl id=""overview""><dt id=""overview-summary-cur...",{'Unternehmenswebseite': ['http://www.hug-ge.c...
3,Greater Boston Area,Medical Devices,As Chairman of one of our London-based Chief E...,nsqgdvxuuthsqafkrozofwanluhcictzczejtmmxwuz1180,"Excel, Power Point, MS Project, Word, VBA, WBS...","Traveling\nWriting\nPhotography\nFootball, Box...",tcdlugklajpnwpohubnnvbkfovthb1180,,{'Company Website': ['http://www.saflexcones.c...
4,"Jerez de La Frontera y alrededores, España",Medical Devices,A senior management level professional with 15...,apatuarvxvqutxpthnsunnyxjjjqntzrlzytnfndfsb1410,Broad background in information technology man...,"Finance and investing, entrepreneurial venture...",lqofdxehnheexmdydojiupdtmeuiy1410,"<dl id=""overview""><dt id=""overview-summary-cur...",{'Company Website': ['http://research.ict.csir...
5,"South San Francisco, California",Medical Devices,"Freelance 3D Instructor for 3ds Max, Maya, Xsi...",vbhpithejtqriexizsbwzzyldjslfyhelfiyzhdgyzx1922,"Branding and identity, brand essence and value...",Exploring new places!,ztzvgtzuejnhnlgmadttrqehhlrou1922,"<dl id=""overview""><dt id=""overview-summary-cur...",
6,"Katowice, Silesian District, Poland",Medical Devices,Master in FInance program in IE Business Schoo...,xchpnyvtfaesbnardrcqdgwrsrefzoknzdcobxngvbx2440,"Continuous improvement, CRM, process design, i...",Aikido,ljyjxkbpnmohcqoxkdqcbaldsjesc2440,"<dl id=""overview""><dt id=""overview-summary-cur...",
7,Greater Milwaukee Area,Medical Devices,Mission: Intend to grow as a professional in a...,xhvjjriqffyqtugxkicpilltftalpfruzbmsnncfqdw2839,"Strategic Planning, Sales, Marketing, Client R...","Economics, Sales Strategy, New Business Develo...",iufzzahpiqbjnrvqgebrzjnqzbkdi2839,"<dl id=""overview""><dt id=""overview-summary-cur...",{'Company Website': ['http://www.wegmanspms.co...
8,"Chicago, Illinois",Medical Devices,,uffitmoafjwrakphobssqdedqkomtryktjetqjubukq2920,,,ltvouhudbhwsmtwxmpozivlxomsuf2920,,
9,"Guadalajara Area, Mexico",Medical Devices,Highly-dedicated and experienced development l...,rdhllocwdttzyijodiybctrehmlzsqabybavwpoxije4024,"Internal Audit, Auditor, Senior Auditor, Inter...","Reading, Traveling, Writing, Teaching, Learnin...",hlmlufhnxfgndjpjqpbqrdfcppxxf4024,"<dl id=""overview""><dt id=""overview-summary-cur...",


## Let's build an index now...

To speed up the SQL query processing, we can build an index. 
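
To see why an index helps, one option is to ask SQLite for its query plan before and after creating the index. The sketch below uses a throwaway in-memory database with a made-up `people` table (not the notebook's `linkedin.db`), and the helper name `plan` is hypothetical. The exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

# Throwaway in-memory database; the table and rows are made up for illustration.
demo = sqlite3.connect(':memory:')
demo.execute('create table people (_id text, industry text)')
demo.executemany(
    'insert into people values (?, ?)',
    [('a1', 'Medical Devices'), ('b2', 'Banking'), ('c3', 'Mining & Metals')],
)

query = "select * from people where industry = 'Medical Devices'"

def plan(connection, sql):
    # The last column of each plan row is the human-readable detail string.
    return [row[3] for row in connection.execute('explain query plan ' + sql)]

plan_before = plan(demo, query)   # without an index: a scan of the whole table
demo.execute('create index people_industry on people(industry)')
plan_after = plan(demo, query)    # with the index: a search that names people_industry

print(plan_before)
print(plan_after)
```

A scan touches every row, while an index search jumps to the matching entries, which is where the speedup comes from.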

In [0]:
conn = sqlite3.connect('linkedin.db')

conn.execute('begin transaction')
conn.execute('drop index if exists people_industry')
conn.execute("create index people_industry on people(industry)")
conn.execute('commit')

<sqlite3.Cursor at 0x7f6788801f10>

In [0]:
%%time
# Run the same query again, now that the index exists
pd.read_sql_query("select * from people where industry='Medical Devices'", conn)

# In our tests, this was up to 5x faster (about 4x in wall time for the run recorded below).

CPU times: user 4.66 ms, sys: 0 ns, total: 4.66 ms
Wall time: 5.46 ms


Unnamed: 0,locality,industry,summary,url,specilities,interests,_id,overview_html,homepage
0,"Tianjin Suburb, China",Medical Devices,A highly effective senior Procurement professi...,qfacoyibhrwshwwrqoyfurkiehlpudqpvodgoreyhxw813,"Analyse financière, optimisation de financemen...","new technology, innovation, management reading...",zhgmkbseftvzngxjoujarpfpxoslb813,"<dl id=""overview""><dt id=""overview-summary-cur...",{'photography portfolio': ['http://www.gijsbek...
1,"Rest of Zhejiang, China",Medical Devices,Widely established qualifications in Electrica...,aolkrkybmkezvujdyjykzlwogyhhdgazxxdvewqtcaq1022,"Solvency II, Management Information, Data and ...","Social media, Emerging technology, Photography...",vwgjcqgnlpwpzsklpswfxhkfbqhrm1022,"<dl id=""overview""><dt id=""overview-summary-cur...",{'Bedrijfswebsite': ['http://www.theresearcher...
2,"Cockeysville, Maryland",Medical Devices,At this moment I am interested in gaining more...,vmxcbuehcrqdyoaulwzhgsmqnbkklykvmhwwkmpbowe1158,Financial Model Development,"Cloud computing, SaaS, Continuous Delivery, Sy...",pnudxebyienbkqcmsgmaxazwrevpv1158,"<dl id=""overview""><dt id=""overview-summary-cur...",{'Unternehmenswebseite': ['http://www.hug-ge.c...
3,Greater Boston Area,Medical Devices,As Chairman of one of our London-based Chief E...,nsqgdvxuuthsqafkrozofwanluhcictzczejtmmxwuz1180,"Excel, Power Point, MS Project, Word, VBA, WBS...","Traveling\nWriting\nPhotography\nFootball, Box...",tcdlugklajpnwpohubnnvbkfovthb1180,,{'Company Website': ['http://www.saflexcones.c...
4,"Jerez de La Frontera y alrededores, España",Medical Devices,A senior management level professional with 15...,apatuarvxvqutxpthnsunnyxjjjqntzrlzytnfndfsb1410,Broad background in information technology man...,"Finance and investing, entrepreneurial venture...",lqofdxehnheexmdydojiupdtmeuiy1410,"<dl id=""overview""><dt id=""overview-summary-cur...",{'Company Website': ['http://research.ict.csir...
5,"South San Francisco, California",Medical Devices,"Freelance 3D Instructor for 3ds Max, Maya, Xsi...",vbhpithejtqriexizsbwzzyldjslfyhelfiyzhdgyzx1922,"Branding and identity, brand essence and value...",Exploring new places!,ztzvgtzuejnhnlgmadttrqehhlrou1922,"<dl id=""overview""><dt id=""overview-summary-cur...",
6,"Katowice, Silesian District, Poland",Medical Devices,Master in FInance program in IE Business Schoo...,xchpnyvtfaesbnardrcqdgwrsrefzoknzdcobxngvbx2440,"Continuous improvement, CRM, process design, i...",Aikido,ljyjxkbpnmohcqoxkdqcbaldsjesc2440,"<dl id=""overview""><dt id=""overview-summary-cur...",
7,Greater Milwaukee Area,Medical Devices,Mission: Intend to grow as a professional in a...,xhvjjriqffyqtugxkicpilltftalpfruzbmsnncfqdw2839,"Strategic Planning, Sales, Marketing, Client R...","Economics, Sales Strategy, New Business Develo...",iufzzahpiqbjnrvqgebrzjnqzbkdi2839,"<dl id=""overview""><dt id=""overview-summary-cur...",{'Company Website': ['http://www.wegmanspms.co...
8,"Chicago, Illinois",Medical Devices,,uffitmoafjwrakphobssqdedqkomtryktjetqjubukq2920,,,ltvouhudbhwsmtwxmpozivlxomsuf2920,,
9,"Guadalajara Area, Mexico",Medical Devices,Highly-dedicated and experienced development l...,rdhllocwdttzyijodiybctrehmlzsqabybavwpoxije4024,"Internal Audit, Auditor, Senior Auditor, Inter...","Reading, Traveling, Writing, Teaching, Learnin...",hlmlufhnxfgndjpjqpbqrdfcppxxf4024,"<dl id=""overview""><dt id=""overview-summary-cur...",


In [0]:
conn = sqlite3.connect('linkedin.db')

people_df = pd.read_sql_query('select * from people limit 500', conn)
experience_df = pd.read_sql_query('select * from experience limit 5000', conn)
skills_df = pd.read_sql_query('select * from skills limit 8000', conn)

print ("%d people"%len(people_df))
print ("%d experiences"%len(experience_df))
print ("%d skills"%len(skills_df))

500 people
5000 experiences
8000 skills


In [0]:
# Implement a dataframe merge in Python (a nested-loop join).
# Assumes S and T both have the default 0..n-1 index.

def merge(S,T,l_on,r_on):
    rows = []
    count = 0
    for s_index in range(0, len(S)):
        for t_index in range(0, len(T)):
            count = count + 1
            if S.loc[s_index, l_on] == T.loc[t_index, r_on]:
                # Series.append was removed in pandas 2.0; pd.concat joins the two rows
                rows.append(pd.concat([S.loc[s_index], T.loc[t_index].drop(labels=r_on)]))

    print('Merge compared %d tuples'%count)
    return pd.DataFrame(rows).reset_index(drop=True)

In [0]:
%%time
# Here's a test join, with people and their experiences.  We can see how many
# comparisons are made

merge(people_df, experience_df, '_id', 'person')

Merge compared 2500000 tuples
CPU times: user 47.9 s, sys: 2.7 ms, total: 47.9 s
Wall time: 47.9 s


Unnamed: 0,_id,desc,end,homepage,industry,interests,locality,org,overview_html,pos,specilities,start,summary,title,url
0,ichsrrdhpxlojntrimsvrbzexeeyi0,Talent Acquistion Strategy at UBS Investment Bank,Present,,Animasyon,"Reading books, watching movies and nature trip...","Eskisehir, Turkey",UBS,,0,Areas Of Practice: Criminal Defense; DUI; Misd...,July 2012,An experienced general and financial manager i...,Executive Director - Talent Acquisition,ftsycpiuiasaecavdmmkooqckeojdsriefopopyeeat0
1,ichsrrdhpxlojntrimsvrbzexeeyi0,,,,Animasyon,"Reading books, watching movies and nature trip...","Eskisehir, Turkey",Barclays Bank,,1,Areas Of Practice: Criminal Defense; DUI; Misd...,2009,An experienced general and financial manager i...,Head of Resourcing - Retail and Business Bank ...,ftsycpiuiasaecavdmmkooqckeojdsriefopopyeeat0
2,ichsrrdhpxlojntrimsvrbzexeeyi0,,,,Animasyon,"Reading books, watching movies and nature trip...","Eskisehir, Turkey",Enterprise Plc,,2,Areas Of Practice: Criminal Defense; DUI; Misd...,2008,An experienced general and financial manager i...,Interim Talent & Development Director,ftsycpiuiasaecavdmmkooqckeojdsriefopopyeeat0
3,ichsrrdhpxlojntrimsvrbzexeeyi0,,,,Animasyon,"Reading books, watching movies and nature trip...","Eskisehir, Turkey",Vodafone,,3,Areas Of Practice: Criminal Defense; DUI; Misd...,2008,An experienced general and financial manager i...,Interim Global Resourcing & Talent Mobility Ma...,ftsycpiuiasaecavdmmkooqckeojdsriefopopyeeat0
4,ichsrrdhpxlojntrimsvrbzexeeyi0,,,,Animasyon,"Reading books, watching movies and nature trip...","Eskisehir, Turkey",AstraZeneca,,4,Areas Of Practice: Criminal Defense; DUI; Misd...,2006,An experienced general and financial manager i...,Global Resourcing & Talent Manager,ftsycpiuiasaecavdmmkooqckeojdsriefopopyeeat0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2262,kqdvnuxlxteqeylpkuowwboluewwi499,,,,Bankwesen,Playing the trumpet\nLearning Italian\nSnowboa...,"Murrieta, California",PwC,"<dl id=""overview""><dt>\nConnections\n</dt>\n<d...",3,"Identity design, digtal design, digtal services",April 2007,Your ideal customer is the one already searchi...,Senior Manager,syauzbaqggxzklufogdjwkleiwccknmwfrhofsvtgxy499
2263,kqdvnuxlxteqeylpkuowwboluewwi499,,,,Bankwesen,Playing the trumpet\nLearning Italian\nSnowboa...,"Murrieta, California",GEM CONSULTING,"<dl id=""overview""><dt>\nConnections\n</dt>\n<d...",4,"Identity design, digtal design, digtal services",June 2004,Your ideal customer is the one already searchi...,Manager of Information Technology & Telecommun...,syauzbaqggxzklufogdjwkleiwccknmwfrhofsvtgxy499
2264,kqdvnuxlxteqeylpkuowwboluewwi499,,,,Bankwesen,Playing the trumpet\nLearning Italian\nSnowboa...,"Murrieta, California",ACCENTURE,"<dl id=""overview""><dt>\nConnections\n</dt>\n<d...",5,"Identity design, digtal design, digtal services",April 2000,Your ideal customer is the one already searchi...,Senior Consultant in Communications & High Tec...,syauzbaqggxzklufogdjwkleiwccknmwfrhofsvtgxy499
2265,kqdvnuxlxteqeylpkuowwboluewwi499,,,,Bankwesen,Playing the trumpet\nLearning Italian\nSnowboa...,"Murrieta, California",OTE PLUS,"<dl id=""overview""><dt>\nConnections\n</dt>\n<d...",6,"Identity design, digtal design, digtal services",July 1999,Your ideal customer is the one already searchi...,Business Process Consultant,syauzbaqggxzklufogdjwkleiwccknmwfrhofsvtgxy499


In [0]:
# Let's find all people (by ID) who have Marketing as a skill

mktg_df = skills_df[skills_df['value'] == 'Marketing'].reset_index()[['person']]
mktg_df

Unnamed: 0,person
0,tbmhpotienijsuyshhlfhymkbqscr16
1,zijelfcuxmuyinlxajtzeozjvkxuq39
2,yypbfkalsyehapbryfwruiburuayb47
3,slqzulzudmucmrvgdalpsfrembzso117
4,caaepjzmsdyzdnvgfhhxrzjzhkgps140
5,pimwmmzfmriracqvynaoaekvhwrco185
6,ynrqhzmeisoyrfgxinuhqlzunbfdq201
7,vrdijyrmlcquplatflfgwdkzvunww214
8,oxnwtfohqwaqwwjltqtgykmlyqutm350
9,mclppfviyumavtawcfvsmxnyjvply396


In [0]:
%%time
# Test differences in join order (Part 1)
merge(merge(people_df, experience_df, '_id', 'person'), mktg_df, '_id', 'person')

Merge compared 2500000 tuples
Merge compared 38539 tuples
CPU times: user 48.5 s, sys: 737 µs, total: 48.5 s
Wall time: 48.6 s


Unnamed: 0,_id,desc,end,homepage,industry,interests,locality,org,overview_html,pos,specilities,start,summary,title,url
0,tbmhpotienijsuyshhlfhymkbqscr16,Overall responsibility for HR and Safety withi...,Present,,컴퓨터 소프트웨어,"Adaptive Systems, Adaptive Hypermedia, User Mo...","Hitchin, Hertfordshire, United Kingdom","Schindler - Paris, France","<dl id=""overview""><dt id=""overview-summary-cur...",0,My functional strengths are in areas of sales ...,January 2012,Human Resources Professional with experience i...,"VP HR, Europe South, Middle East, Africa",rmcrprqnspgxnxymcxoalhkhwxqlqpwdupismbijerj16
1,tbmhpotienijsuyshhlfhymkbqscr16,"European HR Consulting, with a strong focus on...",,,컴퓨터 소프트웨어,"Adaptive Systems, Adaptive Hypermedia, User Mo...","Hitchin, Hertfordshire, United Kingdom",Paun HR Consult SPRL - Belgium,"<dl id=""overview""><dt id=""overview-summary-cur...",1,My functional strengths are in areas of sales ...,December 2008,Human Resources Professional with experience i...,Managing Partner and Owner,rmcrprqnspgxnxymcxoalhkhwxqlqpwdupismbijerj16
2,tbmhpotienijsuyshhlfhymkbqscr16,Brought on board to improve the quality of top...,,,컴퓨터 소프트웨어,"Adaptive Systems, Adaptive Hypermedia, User Mo...","Hitchin, Hertfordshire, United Kingdom","Schindler AG, Switzerland","<dl id=""overview""><dt id=""overview-summary-cur...",2,My functional strengths are in areas of sales ...,2008,Human Resources Professional with experience i...,Head of Executive Recruitment- Corporate HR,rmcrprqnspgxnxymcxoalhkhwxqlqpwdupismbijerj16
3,tbmhpotienijsuyshhlfhymkbqscr16,Managed expatriate employees all over the worl...,,,컴퓨터 소프트웨어,"Adaptive Systems, Adaptive Hypermedia, User Mo...","Hitchin, Hertfordshire, United Kingdom",France Telecom- Orange,"<dl id=""overview""><dt id=""overview-summary-cur...",3,My functional strengths are in areas of sales ...,October 2005,Human Resources Professional with experience i...,"VP Group International Mobility- Paris, France",rmcrprqnspgxnxymcxoalhkhwxqlqpwdupismbijerj16
4,tbmhpotienijsuyshhlfhymkbqscr16,"Oversaw HR, Real-Estate, Facilities, and Inter...",,,컴퓨터 소프트웨어,"Adaptive Systems, Adaptive Hypermedia, User Mo...","Hitchin, Hertfordshire, United Kingdom",Mobistar Belgium (France Telecom Group),"<dl id=""overview""><dt id=""overview-summary-cur...",4,My functional strengths are in areas of sales ...,July 2000,Human Resources Professional with experience i...,Human Resources Director,rmcrprqnspgxnxymcxoalhkhwxqlqpwdupismbijerj16
5,tbmhpotienijsuyshhlfhymkbqscr16,Recruited by startup telecom to define HR stra...,,,컴퓨터 소프트웨어,"Adaptive Systems, Adaptive Hypermedia, User Mo...","Hitchin, Hertfordshire, United Kingdom",Orange Romania (France Telecom Group),"<dl id=""overview""><dt id=""overview-summary-cur...",5,My functional strengths are in areas of sales ...,March 1997,Human Resources Professional with experience i...,HR Director,rmcrprqnspgxnxymcxoalhkhwxqlqpwdupismbijerj16
6,tbmhpotienijsuyshhlfhymkbqscr16,Recruited to create an HR organization from th...,,,컴퓨터 소프트웨어,"Adaptive Systems, Adaptive Hypermedia, User Mo...","Hitchin, Hertfordshire, United Kingdom",Dac Air - Regional Airline,"<dl id=""overview""><dt id=""overview-summary-cur...",6,My functional strengths are in areas of sales ...,1995,Human Resources Professional with experience i...,HR Director,rmcrprqnspgxnxymcxoalhkhwxqlqpwdupismbijerj16
7,tbmhpotienijsuyshhlfhymkbqscr16,Established HR practice for one of Romania’s f...,,,컴퓨터 소프트웨어,"Adaptive Systems, Adaptive Hypermedia, User Mo...","Hitchin, Hertfordshire, United Kingdom",CableVision of Romania,"<dl id=""overview""><dt id=""overview-summary-cur...",7,My functional strengths are in areas of sales ...,1993,Human Resources Professional with experience i...,HR Director,rmcrprqnspgxnxymcxoalhkhwxqlqpwdupismbijerj16
8,zijelfcuxmuyinlxajtzeozjvkxuq39,Overall responsibility for HR and Safety withi...,Present,{'CECS Department': ['http://www.cs.louisville...,Mining & Metals,TED Talks; Environmentalism; Music Creation; R...,"Witbank Area, South Africa","Schindler - Paris, France",,0,"Search Engine Marketing, Email Campaigns, Sear...",January 2012,7+ Years of experience in the storage stack of...,"VP HR, Europe South, Middle East, Africa",wacghnjvtjehyfjuzwztlhdlwnlobvsvtzrseiyryjw39
9,zijelfcuxmuyinlxajtzeozjvkxuq39,"European HR Consulting, with a strong focus on...",,{'CECS Department': ['http://www.cs.louisville...,Mining & Metals,TED Talks; Environmentalism; Music Creation; R...,"Witbank Area, South Africa",Paun HR Consult SPRL - Belgium,,1,"Search Engine Marketing, Email Campaigns, Sear...",December 2008,7+ Years of experience in the storage stack of...,Managing Partner and Owner,wacghnjvtjehyfjuzwztlhdlwnlobvsvtzrseiyryjw39


In [0]:
%%time 
# Test differences in join order (Part 2)
merge(merge(people_df, mktg_df, '_id', 'person'), experience_df, '_id', 'person')

Merge compared 8500 tuples
Merge compared 75000 tuples
CPU times: user 1.53 s, sys: 2.94 ms, total: 1.53 s
Wall time: 1.54 s


Unnamed: 0,_id,desc,end,homepage,industry,interests,locality,org,overview_html,pos,specilities,start,summary,title,url
0,tbmhpotienijsuyshhlfhymkbqscr16,Overall responsibility for HR and Safety withi...,Present,,컴퓨터 소프트웨어,"Adaptive Systems, Adaptive Hypermedia, User Mo...","Hitchin, Hertfordshire, United Kingdom","Schindler - Paris, France","<dl id=""overview""><dt id=""overview-summary-cur...",0,My functional strengths are in areas of sales ...,January 2012,Human Resources Professional with experience i...,"VP HR, Europe South, Middle East, Africa",rmcrprqnspgxnxymcxoalhkhwxqlqpwdupismbijerj16
1,tbmhpotienijsuyshhlfhymkbqscr16,"European HR Consulting, with a strong focus on...",,,컴퓨터 소프트웨어,"Adaptive Systems, Adaptive Hypermedia, User Mo...","Hitchin, Hertfordshire, United Kingdom",Paun HR Consult SPRL - Belgium,"<dl id=""overview""><dt id=""overview-summary-cur...",1,My functional strengths are in areas of sales ...,December 2008,Human Resources Professional with experience i...,Managing Partner and Owner,rmcrprqnspgxnxymcxoalhkhwxqlqpwdupismbijerj16
2,tbmhpotienijsuyshhlfhymkbqscr16,Brought on board to improve the quality of top...,,,컴퓨터 소프트웨어,"Adaptive Systems, Adaptive Hypermedia, User Mo...","Hitchin, Hertfordshire, United Kingdom","Schindler AG, Switzerland","<dl id=""overview""><dt id=""overview-summary-cur...",2,My functional strengths are in areas of sales ...,2008,Human Resources Professional with experience i...,Head of Executive Recruitment- Corporate HR,rmcrprqnspgxnxymcxoalhkhwxqlqpwdupismbijerj16
3,tbmhpotienijsuyshhlfhymkbqscr16,Managed expatriate employees all over the worl...,,,컴퓨터 소프트웨어,"Adaptive Systems, Adaptive Hypermedia, User Mo...","Hitchin, Hertfordshire, United Kingdom",France Telecom- Orange,"<dl id=""overview""><dt id=""overview-summary-cur...",3,My functional strengths are in areas of sales ...,October 2005,Human Resources Professional with experience i...,"VP Group International Mobility- Paris, France",rmcrprqnspgxnxymcxoalhkhwxqlqpwdupismbijerj16
4,tbmhpotienijsuyshhlfhymkbqscr16,"Oversaw HR, Real-Estate, Facilities, and Inter...",,,컴퓨터 소프트웨어,"Adaptive Systems, Adaptive Hypermedia, User Mo...","Hitchin, Hertfordshire, United Kingdom",Mobistar Belgium (France Telecom Group),"<dl id=""overview""><dt id=""overview-summary-cur...",4,My functional strengths are in areas of sales ...,July 2000,Human Resources Professional with experience i...,Human Resources Director,rmcrprqnspgxnxymcxoalhkhwxqlqpwdupismbijerj16
5,tbmhpotienijsuyshhlfhymkbqscr16,Recruited by startup telecom to define HR stra...,,,컴퓨터 소프트웨어,"Adaptive Systems, Adaptive Hypermedia, User Mo...","Hitchin, Hertfordshire, United Kingdom",Orange Romania (France Telecom Group),"<dl id=""overview""><dt id=""overview-summary-cur...",5,My functional strengths are in areas of sales ...,March 1997,Human Resources Professional with experience i...,HR Director,rmcrprqnspgxnxymcxoalhkhwxqlqpwdupismbijerj16
6,tbmhpotienijsuyshhlfhymkbqscr16,Recruited to create an HR organization from th...,,,컴퓨터 소프트웨어,"Adaptive Systems, Adaptive Hypermedia, User Mo...","Hitchin, Hertfordshire, United Kingdom",Dac Air - Regional Airline,"<dl id=""overview""><dt id=""overview-summary-cur...",6,My functional strengths are in areas of sales ...,1995,Human Resources Professional with experience i...,HR Director,rmcrprqnspgxnxymcxoalhkhwxqlqpwdupismbijerj16
7,tbmhpotienijsuyshhlfhymkbqscr16,Established HR practice for one of Romania’s f...,,,컴퓨터 소프트웨어,"Adaptive Systems, Adaptive Hypermedia, User Mo...","Hitchin, Hertfordshire, United Kingdom",CableVision of Romania,"<dl id=""overview""><dt id=""overview-summary-cur...",7,My functional strengths are in areas of sales ...,1993,Human Resources Professional with experience i...,HR Director,rmcrprqnspgxnxymcxoalhkhwxqlqpwdupismbijerj16
8,zijelfcuxmuyinlxajtzeozjvkxuq39,Overall responsibility for HR and Safety withi...,Present,{'CECS Department': ['http://www.cs.louisville...,Mining & Metals,TED Talks; Environmentalism; Music Creation; R...,"Witbank Area, South Africa","Schindler - Paris, France",,0,"Search Engine Marketing, Email Campaigns, Sear...",January 2012,7+ Years of experience in the storage stack of...,"VP HR, Europe South, Middle East, Africa",wacghnjvtjehyfjuzwztlhdlwnlobvsvtzrseiyryjw39
9,zijelfcuxmuyinlxajtzeozjvkxuq39,"European HR Consulting, with a strong focus on...",,{'CECS Department': ['http://www.cs.louisville...,Mining & Metals,TED Talks; Environmentalism; Music Creation; R...,"Witbank Area, South Africa",Paun HR Consult SPRL - Belgium,,1,"Search Engine Marketing, Email Campaigns, Sear...",December 2008,7+ Years of experience in the storage stack of...,Managing Partner and Owner,wacghnjvtjehyfjuzwztlhdlwnlobvsvtzrseiyryjw39
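
The gap between the two join orders can be reproduced with simple arithmetic: a nested-loop join compares every row on the left with every row on the right. In the sketch below, the intermediate sizes (17 Marketing rows, 2267 and 15 intermediate rows) are not stated in the notebook; they are inferred by dividing the "Merge compared" counts printed above, and the function name is hypothetical.

```python
def nested_loop_comparisons(left_rows, right_rows):
    # A nested-loop join compares every left row with every right row.
    return left_rows * right_rows

# Input sizes from the notebook; the rest inferred from the printed counts.
people, experience = 500, 5000
marketing = 17               # 8500 / 500
people_x_experience = 2267   # 38539 / 17
people_x_marketing = 15      # 75000 / 5000

# Part 1: (people join experience) join marketing
order_1 = (nested_loop_comparisons(people, experience)
           + nested_loop_comparisons(people_x_experience, marketing))

# Part 2: (people join marketing) join experience
order_2 = (nested_loop_comparisons(people, marketing)
           + nested_loop_comparisons(people_x_marketing, experience))

print(order_1, order_2, round(order_1 / order_2, 1))
```

Joining with the small, selective table first keeps the intermediate result tiny, so the expensive pass over `experience_df` happens against 15 rows instead of 500. The roughly 30x difference in comparisons is in line with the recorded wall times of 48.6 s and 1.54 s.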


In [0]:
experience_df.loc[0].drop(labels='person')

org                                                    UBS
title              Executive Director - Talent Acquisition
end                                                Present
start                                            July 2012
desc     Talent Acquistion Strategy at UBS Investment Bank
pos                                                      0
Name: 0, dtype: object

In [0]:
%%time

# Slide 21
conn.execute('drop view if exists people500')
conn.execute('drop view if exists experience5000')
conn.execute('drop view if exists skills8000')
conn.execute('create view people500 as select * from people limit 500')
# Limits now match the view names (and the dataframes loaded earlier)
conn.execute('create view experience5000 as select * from experience limit 5000')
conn.execute('create view skills8000 as select * from skills limit 8000')

pd.read_sql_query('select * from (people500 join skills8000 on _id=person) ps join ' + \
                  "experience5000 ex on ps._id=ex.person and value='Marketing'", conn)

CPU times: user 11.5 ms, sys: 3 ms, total: 14.5 ms
Wall time: 47.8 ms


In [0]:
# Join using a *map*, which is a kind of in-memory index
# from keys to (single) values
def merge_map(S,T,l_on,r_on):
    rows = []
    T_map = {}
    count = 0
    # Take each value in the r_on field, and
    # make a map entry for it
    for t_index in range(0, len(T)):
        # Make sure we aren't overwriting an entry!
        assert (T.loc[t_index,r_on] not in T_map)
        T_map[T.loc[t_index,r_on]] = T.loc[t_index]
        count = count + 1

    # Now find matches
    for s_index in range(0, len(S)):
        count = count + 1
        if S.loc[s_index, l_on] in T_map:
            # Series.append was removed in pandas 2.0; pd.concat joins the two rows
            rows.append(pd.concat([S.loc[s_index], T_map[S.loc[s_index, l_on]].drop(labels=r_on)]))

    print('Merge compared %d tuples'%count)
    return pd.DataFrame(rows).reset_index(drop=True)

In [0]:
%%time
# Here's a test join, with people and their experiences.  We can see how many
# comparisons are made
merge_map(experience_df, people_df, 'person', '_id')

Merge compared 5500 tuples
CPU times: user 11.7 s, sys: 7.46 ms, total: 11.7 s
Wall time: 11.7 s


Unnamed: 0,desc,end,homepage,industry,interests,locality,org,overview_html,person,pos,specilities,start,summary,title,url
0,Talent Acquistion Strategy at UBS Investment Bank,Present,,Animasyon,"Reading books, watching movies and nature trip...","Eskisehir, Turkey",UBS,,ichsrrdhpxlojntrimsvrbzexeeyi0,0,Areas Of Practice: Criminal Defense; DUI; Misd...,July 2012,An experienced general and financial manager i...,Executive Director - Talent Acquisition,ftsycpiuiasaecavdmmkooqckeojdsriefopopyeeat0
1,,,,Animasyon,"Reading books, watching movies and nature trip...","Eskisehir, Turkey",Barclays Bank,,ichsrrdhpxlojntrimsvrbzexeeyi0,1,Areas Of Practice: Criminal Defense; DUI; Misd...,2009,An experienced general and financial manager i...,Head of Resourcing - Retail and Business Bank ...,ftsycpiuiasaecavdmmkooqckeojdsriefopopyeeat0
2,,,,Animasyon,"Reading books, watching movies and nature trip...","Eskisehir, Turkey",Enterprise Plc,,ichsrrdhpxlojntrimsvrbzexeeyi0,2,Areas Of Practice: Criminal Defense; DUI; Misd...,2008,An experienced general and financial manager i...,Interim Talent & Development Director,ftsycpiuiasaecavdmmkooqckeojdsriefopopyeeat0
3,,,,Animasyon,"Reading books, watching movies and nature trip...","Eskisehir, Turkey",Vodafone,,ichsrrdhpxlojntrimsvrbzexeeyi0,3,Areas Of Practice: Criminal Defense; DUI; Misd...,2008,An experienced general and financial manager i...,Interim Global Resourcing & Talent Mobility Ma...,ftsycpiuiasaecavdmmkooqckeojdsriefopopyeeat0
4,,,,Animasyon,"Reading books, watching movies and nature trip...","Eskisehir, Turkey",AstraZeneca,,ichsrrdhpxlojntrimsvrbzexeeyi0,4,Areas Of Practice: Criminal Defense; DUI; Misd...,2006,An experienced general and financial manager i...,Global Resourcing & Talent Manager,ftsycpiuiasaecavdmmkooqckeojdsriefopopyeeat0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2262,,,,Bankwesen,Playing the trumpet\nLearning Italian\nSnowboa...,"Murrieta, California",PwC,"<dl id=""overview""><dt>\nConnections\n</dt>\n<d...",kqdvnuxlxteqeylpkuowwboluewwi499,3,"Identity design, digtal design, digtal services",April 2007,Your ideal customer is the one already searchi...,Senior Manager,syauzbaqggxzklufogdjwkleiwccknmwfrhofsvtgxy499
2263,,,,Bankwesen,Playing the trumpet\nLearning Italian\nSnowboa...,"Murrieta, California",GEM CONSULTING,"<dl id=""overview""><dt>\nConnections\n</dt>\n<d...",kqdvnuxlxteqeylpkuowwboluewwi499,4,"Identity design, digtal design, digtal services",June 2004,Your ideal customer is the one already searchi...,Manager of Information Technology & Telecommun...,syauzbaqggxzklufogdjwkleiwccknmwfrhofsvtgxy499
2264,,,,Bankwesen,Playing the trumpet\nLearning Italian\nSnowboa...,"Murrieta, California",ACCENTURE,"<dl id=""overview""><dt>\nConnections\n</dt>\n<d...",kqdvnuxlxteqeylpkuowwboluewwi499,5,"Identity design, digtal design, digtal services",April 2000,Your ideal customer is the one already searchi...,Senior Consultant in Communications & High Tec...,syauzbaqggxzklufogdjwkleiwccknmwfrhofsvtgxy499
2265,,,,Bankwesen,Playing the trumpet\nLearning Italian\nSnowboa...,"Murrieta, California",OTE PLUS,"<dl id=""overview""><dt>\nConnections\n</dt>\n<d...",kqdvnuxlxteqeylpkuowwboluewwi499,6,"Identity design, digtal design, digtal services",July 1999,Your ideal customer is the one already searchi...,Business Process Consultant,syauzbaqggxzklufogdjwkleiwccknmwfrhofsvtgxy499


In [0]:
%%time

# An exercise: how can you modify merge_map to make this work?  (This can be skipped if you wish.)

merge_map(people_df, experience_df, '_id', 'person')

AssertionError: ignored
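
One possible approach to the exercise (a sketch, not necessarily the intended solution): the assertion fails because `experience_df` holds several rows per person, so a map from each key to a single value cannot represent it. Mapping each key to a list of rows removes that restriction. The name `merge_multimap` and the sample records are hypothetical, and plain lists of dicts stand in for DataFrames to keep the example short.

```python
from collections import defaultdict

def merge_multimap(S, T, l_on, r_on):
    # Build phase: map each key to *all* T rows that carry it.
    T_map = defaultdict(list)
    for t_row in T:
        T_map[t_row[r_on]].append(t_row)

    # Probe phase: one lookup per S row, one output row per match.
    ret = []
    for s_row in S:
        for t_row in T_map.get(s_row[l_on], []):
            combined = dict(s_row)
            # Drop the join column from the right side, as merge_map does.
            combined.update({k: v for k, v in t_row.items() if k != r_on})
            ret.append(combined)
    return ret

# Made-up records: person p1 has two experiences, p2 has none.
people = [{'_id': 'p1', 'industry': 'Banking'},
          {'_id': 'p2', 'industry': 'Mining & Metals'}]
experience = [{'person': 'p1', 'org': 'UBS'},
              {'person': 'p1', 'org': 'Vodafone'},
              {'person': 'p3', 'org': 'PwC'}]

joined = merge_multimap(people, experience, '_id', 'person')
print(joined)
```

The work is still one pass over each input plus one output per match, so it keeps the advantage of the map-based join over the nested-loop version.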