# Homework: Understanding Performance using a LinkedIn Dataset


This homework focuses on understanding performance using a LinkedIn dataset.  It is the same dataset that was used in the module entitled "Modeling Data and Knowledge".

In [1]:
!pip install pymongo[tls,srv]
!pip install swifter
!pip install lxml




In [0]:
import pandas as pd
import numpy as np
import json
import sqlite3
from lxml import etree
import urllib.request
import zipfile

import time
import swifter
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError, OperationFailure
from sklearn.utils import shuffle

# Step 1: Acquire and load the data

We will download a zip file containing the LinkedIn dataset, together with a helper script hosted on GitHub, so that the data can be parsed efficiently on the local machine.  **Do not modify this cell.**

The cell below will download the file, and may take a while. 

**If you have already downloaded the LinkedIn dataset and stored it in an SQLite database while working on the homework for the module "Modeling Data and Knowledge" you can skip this step.**


In [0]:
url = 'https://raw.githubusercontent.com/chenleshang/OpenDS4All/master/Module3/homework3filewrapper.py'
urllib.request.urlretrieve(url,filename='homework3filewrapper.py')
url = 'https://upenn-bigdataanalytics.s3.amazonaws.com/linkedin.zip'
filehandle, _ = urllib.request.urlretrieve(url,filename='local.zip')

The next cell opens the (abbreviated) LinkedIn dataset inside the downloaded zip file and imports the helper script that prepares the dataset for this homework. **Do not modify this cell.**

In [0]:
def fetch_file(fname):
    zip_file_object = zipfile.ZipFile(filehandle, 'r')
    for file in zip_file_object.namelist():
        file = zip_file_object.open(file)
        if file.name == fname: return file
    return None

linkedin_small = fetch_file('linkedin_small.json')

from homework3filewrapper import *

The next cell replays the data preparation for the LinkedIn dataset from the module "Modeling Data and Knowledge". Afterwards, you should have eleven dataframes with the names listed below. The first nine are the same as in the lecture notebook; the last two are constructed by queries over the first nine, and their meanings are described in the list.

1. `people_df`
2. `names_df`: Stores the first and last name of each person indexed by ID. 
3. `education_df`
4. `groups_df`
5. `skills_df`
6. `experience_df`
7. `honors_df`
8. `also_view_df`
9. `events_df`
10. `recs_df`: The 20 pairs of people who share the most skills, in descending order of the number of shared skills. We will use this to recommend a potential employer and position to each person.
11. `last_job_df`: The person ID, together with the `title` and `org` of that person's last (most recent) employment experience (a three-column dataframe).

The number of rows extracted from the dataset is controlled by `LIMIT`.  Here, we are limiting it to 20,000; you can set it to something smaller (e.g. 10,000) while debugging your code. **Do not modify this cell except to change LIMIT.**

The data is also being stored in an SQLite database so that you can see the effect of indexing on the performance of queries.



In [5]:
people_df, names_df, education_df, groups_df, skills_df, experience_df, honors_df, also_view_df, events_df, recs_df, last_job_df =\
    data_loading(file=fetch_file('linkedin_small.json'), dbname='linkedin.db', filetype='localobj', LIMIT=20000)

10000
20000


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


In [0]:
conn = sqlite3.connect('linkedin.db')

In [23]:
# Sanity Check 1.1 - please do not modify or delete this cell!

recs_df

| | person_1 | person_2 | common_skills |
|---|---|---|---|
| 247 | in-01mihaipop | in-07960 | 11 |
| 504 | in-07960 | in-01mihaipop | 11 |
| 443 | in-062898 | in-0robertvale0 | 10 |
| 715 | in-0robertvale0 | in-062898 | 10 |
| 456 | in-07061976 | in-00000001 | 9 |
| 11 | in-00000001 | in-07061976 | 9 |
| 6 | in-00000001 | in-01mihaipop | 8 |
| 568 | in-08michaelwright | in-00000001 | 8 |
| 226 | in-01mihaipop | in-00000001 | 8 |
| 497 | in-07960 | in-00000001 | 8 |


In [24]:
# Sanity Check 1.2 - please do not modify or delete this cell!

names_df

| | family_name | given_name | person |
|---|---|---|---|
| 0 | Mazalu MBA | Dr Catalin | in-00000001 |
| 1 | Forslund | Ann | in-00001 |
| 2 | Douglas | Shawn | in-00006 |
| 3 | Kilimann | Edric | in-000montgomery |
| 4 | Chauhan, PMP | Vijay | in-000vijaychauhan |
| ... | ... | ... | ... |
| 19994 | Louro | Lic. Anahi | in-anahilouro |
| 19995 | Quiroga | Anahi | in-anahiquiroga |
| 19996 | Nemat | Anahita | in-anahita |
| 19997 | Charna | Anahita | in-anahitacharna |


In [25]:
# Sanity Check 1.3 - please do not modify or delete this cell!

last_job_df

| | person | org | title |
|---|---|---|---|
| 0 | in-00001 | Johnson and Johnson | Senior Scientist, Oncology Biomarkers |
| 5 | in-00006 | UCSF | Assistant Professor |
| 7 | in-000montgomery | \<Online Recruiting Company\> | Ning |
| 36 | in-001adambutler | Brand New Music | Founding Partner and Client Services Director |
| 46 | in-001monica | Canadian MedicAlert Foundation | Manager, Marketing |
| ... | ... | ... | ... |
| 92031 | in-anahigadea | Avaya | Avaya Professional Services Sales Support Asso... |
| 92039 | in-anahilouro | AAAP - Asociación Argentina de Agencias de Pub... | Docente a cargo de la asignatura Creatividad, ... |
| 92048 | in-anahiquiroga | Mead Johnson Nutrition | Finance Manager |
| 92053 | in-anahitacharna | JPMorgan Chase | Executive Director |


# Step 2: Compare Evaluation Orders using DataFrames

We will now explore the effect of various optimizations, including reordering execution steps and (in the case of database operations) creating indices.

We'll start with the code from our lecture notebooks, which joins dataframes.  The next cell defines two functions, `merge` and `merge_map`, whose efficiency we will compare.  **You do not need to modify this cell.**

In [0]:
# Join using nested loops
def merge(S,T,l_on,r_on):
    ret = pd.DataFrame()
    count = 0
    S_ = S.reset_index().drop(columns=['index'])
    T_ = T.reset_index().drop(columns=['index'])
    for s_index in range(0, len(S)):
        for t_index in range(0, len(T)):
            count = count + 1
            if S_.loc[s_index, l_on] == T_.loc[t_index, r_on]:
                ret = ret.append(S_.loc[s_index].append(T_.loc[t_index].drop(labels=r_on)), ignore_index=True)

    print('Merge compared %d tuples'%count)
    return ret
  
# Join using a *map*, which is a kind of in-memory index
# from keys to (single) values
def merge_map(S,T,l_on,r_on):
    ret = pd.DataFrame()
    T_map = {}
    count = 0
    # Take each value in the r_on field, and
    # make a map entry for it
    T_ = T.reset_index().drop(columns=['index'])
    for t_index in range(0, len(T)):
        # Make sure we aren't overwriting an entry!
        assert (T_.loc[t_index,r_on] not in T_map)
        T_map[T_.loc[t_index,r_on]] = T_.loc[t_index]
        count = count + 1

    # Now find matches
    S_ = S.reset_index().drop(columns=['index'])
    for s_index in range(0, len(S)):
        count = count + 1
        if S_.loc[s_index, l_on] in T_map:
                ret = ret.append(S_.loc[s_index].append(T_map[S_.loc[s_index, l_on]].drop(labels=r_on)), ignore_index=True)

    print('Merge compared %d tuples'%count)
    return ret
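
To see why the two strategies differ, the following minimal sketch counts comparisons on plain Python lists instead of dataframes. The helper names `nested_loop_join` and `map_join` and the toy data are hypothetical, chosen only for illustration; like `merge_map`, the map version assumes the join key is unique in `T`.

```python
# Illustrative sketch only: toy stand-ins for merge and merge_map
# that operate on lists of dicts and return (rows, comparison_count).

def nested_loop_join(S, T, l_on, r_on):
    rows, count = [], 0
    for s in S:
        for t in T:
            count += 1                      # one comparison per (s, t) pair
            if s[l_on] == t[r_on]:
                rows.append({**t, **s})
    return rows, count

def map_join(S, T, l_on, r_on):
    # Build phase: one step per row of T (keys assumed unique, as in merge_map)
    t_map, count = {}, 0
    for t in T:
        assert t[r_on] not in t_map
        t_map[t[r_on]] = t
        count += 1
    # Probe phase: one dictionary lookup per row of S
    rows = []
    for s in S:
        count += 1
        if s[l_on] in t_map:
            rows.append({**t_map[s[l_on]], **s})
    return rows, count

# Hypothetical toy inputs
S = [{'person_1': 'a', 'person_2': 'b'}, {'person_1': 'c', 'person_2': 'a'}]
T = [{'person': 'a', 'name': 'Ann'}, {'person': 'b', 'name': 'Bob'},
     {'person': 'c', 'name': 'Cy'}]

nl_rows, nl_count = nested_loop_join(S, T, 'person_1', 'person')
mp_rows, mp_count = map_join(S, T, 'person_1', 'person')
print(nl_count, mp_count)
```

The nested-loop count grows as `len(S) * len(T)`, whereas the map count grows as `len(S) + len(T)`; both versions return the same rows on this input.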

## Step 2.1: Find a good order of evaluation.

The following function, `recommend_jobs_basic`, takes as input `recs_df`, `names_df` and `last_job_df`, and returns the name of each `person_1` together with the most recent `title` and `org` of the corresponding `person_2`.

We will time how long it takes to execute `recommend_jobs_basic` using the ordering `recs_df`, `names_df` and `last_job_df`.

Your task is to improve this time by changing the join ordering used in `recommend_jobs_basic`.

In [0]:
def recommend_jobs_basic(recs_df, names_df, last_job_df):
    return merge(merge(recs_df,names_df,'person_1','person')[['family_name','given_name','person_1','person_2']],
        last_job_df,'person_2','person')[['family_name','given_name','person_2','org','title']].sort_values('family_name')

In [28]:
%%time

recs_new_df = recommend_jobs_basic(recs_df, names_df, last_job_df)

if(len(recs_new_df.columns) != 5):
    raise AssertionError('Wrong number of columns in recs_new_df')

Merge compared 399980 tuples
Merge compared 347260 tuples
CPU times: user 12.3 s, sys: 6.2 ms, total: 12.3 s
Wall time: 12.3 s


Modify the function `recommend_jobs_basic` in the cell below. Improve the efficiency by changing the join ordering to reduce the number of comparisons made in the `merge` function. 

In [0]:
# TODO: modify the order of joins to reduce comparisons

def recommend_jobs_basic_reordered(recs_df, names_df, last_job_df):
    # YOUR CODE HERE
    ### BEGIN SOLUTION
    return merge(merge(last_job_df, recs_df,'person','person_2').rename(columns={'person':'person_2'}),
        names_df,'person_1','person')[['family_name','given_name','person_2','org','title']].sort_values('family_name')
    ### END SOLUTION

In [30]:
%%time

recs_new_df = recommend_jobs_basic_reordered(recs_df, names_df, last_job_df)

if(len(recs_new_df.columns) != 5):
    raise AssertionError('Wrong number of columns in recs_new_df')

Merge compared 347260 tuples
Merge compared 319984 tuples
CPU times: user 10.7 s, sys: 3.31 ms, total: 10.7 s
Wall time: 10.7 s
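
The comparison counts printed above follow from the input sizes: a nested-loop `merge` compares every row of its left input with every row of its right input. The sketch below reproduces the four printed counts. The row counts are inferred from those outputs under the assumption that `recs_df` has 20 rows (as described in Step 1); they are not read from the dataframes themselves.

```python
# Row counts inferred from the printed comparison counts,
# assuming recs_df has 20 rows.
n_recs = 20
n_names = 399980 // n_recs      # 19999
n_jobs = 347260 // n_recs       # 17363

# Original order: (recs JOIN names) JOIN last_job.
first = n_recs * n_names
# The second printed count implies the intermediate result kept all 20 rows.
n_after_names = 20
second = n_after_names * n_jobs

# Reordered: (last_job JOIN recs) JOIN names.
first_r = n_jobs * n_recs
# The second printed count implies only 16 rows survived the first join.
n_after_jobs = 319984 // n_names
second_r = n_after_jobs * n_names

print(first + second, first_r + second_r)
```

Under these assumptions, joining with `last_job_df` first shrinks the intermediate result from 20 rows to 16, so the larger `names_df` is scanned fewer times in the second join.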


## Step 2.2: Perform selections early using `merge` and `merge_map`
 
Reimplement `recommend_jobs_basic` using our custom `merge` and `merge_map` functions rather than pandas' built-in merge. Try to find the **most efficient** implementation by also considering the join ordering.

In [0]:
# TODO: Reimplement recommend jobs using our custom merge and merge_map functions

def recommend_jobs_new(recs_df, names_df, last_job_df):
    # YOUR CODE HERE
    ### BEGIN SOLUTION
    return merge_map(merge_map(recs_df,names_df,'person_1','person')[['family_name','given_name','person_1','person_2']],
        last_job_df,'person_2','person')[['family_name','given_name','person_2','org','title']].sort_values('family_name')
    ### END SOLUTION

In [32]:
# Sanity Check 2.1 - please do not modify or delete this cell!

%%time

recs_new_df = recommend_jobs_new(recs_df, names_df, last_job_df)

if(len(recs_new_df.columns) != 5):
    raise AssertionError('Wrong number of columns in recs_new_df')

Merge compared 20019 tuples
Merge compared 17383 tuples
CPU times: user 8.39 s, sys: 17.3 ms, total: 8.4 s
Wall time: 8.43 s
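
For comparison, the same two-join pipeline can be written with pandas' built-in `DataFrame.merge`. This is a minimal sketch on made-up toy frames: the three small dataframes and the function name `recommend_jobs_pandas` are hypothetical and only mimic the column names used in this homework. It is not part of the required solution.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the homework frames
recs = pd.DataFrame({'person_1': ['p1', 'p2'], 'person_2': ['p2', 'p3']})
names = pd.DataFrame({'person': ['p1', 'p2', 'p3'],
                      'family_name': ['Zed', 'Abel', 'Moss'],
                      'given_name': ['Zoe', 'Al', 'Mia']})
last_job = pd.DataFrame({'person': ['p2', 'p3'],
                         'org': ['Acme', 'Initech'],
                         'title': ['Engineer', 'Analyst']})

def recommend_jobs_pandas(recs_df, names_df, last_job_df):
    # Join recs with names on person_1, then with last_job on person_2
    with_names = recs_df.merge(names_df, left_on='person_1', right_on='person')
    with_names = with_names[['family_name', 'given_name', 'person_1', 'person_2']]
    joined = with_names.merge(last_job_df, left_on='person_2', right_on='person')
    cols = ['family_name', 'given_name', 'person_2', 'org', 'title']
    return joined[cols].sort_values('family_name')

toy_result = recommend_jobs_pandas(recs, names, last_job)
print(toy_result)
```

Running the custom implementation and a built-in version on the same small input is a quick way to check that a reordering did not change the result.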


# Step 3. Query Optimization in Databases

Relational databases optimize queries by performing selections (and projections) as early as possible and by finding a good join ordering. We will therefore implement the `recommend_jobs_basic` function using SQLite and see whether it is faster.

The dataframes `names_df`, `recs_df` and `last_job_df` are already stored in the database `linkedin.db` under the table names `names`, `recs` and `lastjob`, respectively.

## Step 3.1 
In the cell below, implement the `recommend_jobs_basic` function in SQL. Since the query is very fast, we will run the query 100 times to get an accurate idea of the execution time.

In [33]:
%%time
for i in range(0, 100):
    # YOUR CODE HERE
    ### BEGIN SOLUTIONS
    pd.read_sql_query('''
        select family_name, given_name, person_2, org, title
        from names N, recs R, lastjob L
        where N.person=R.person_1 and R.person_2=L.person
        ''', conn)
    ### END SOLUTIONS

CPU times: user 479 ms, sys: 7.86 ms, total: 487 ms
Wall time: 491 ms
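
Because a single run is too short to time reliably, repeating the query and averaging is a common approach. The sketch below does this with `time.perf_counter` on a small in-memory database; the helper name `average_query_time` and the toy table are hypothetical and merely stand in for `linkedin.db`.

```python
import sqlite3
import time

def average_query_time(conn, sql, repeats=100):
    """Return the mean wall-clock seconds per execution of sql."""
    start = time.perf_counter()
    for _ in range(repeats):
        conn.execute(sql).fetchall()   # fetch all rows so the query fully runs
    return (time.perf_counter() - start) / repeats

# Toy in-memory database standing in for linkedin.db
toy_conn = sqlite3.connect(':memory:')
toy_conn.execute('create table names (person text, family_name text)')
toy_conn.executemany('insert into names values (?, ?)',
                     [('p%d' % i, 'name%d' % i) for i in range(1000)])

avg = average_query_time(toy_conn, 'select count(*) from names')
print('%.6f seconds per query' % avg)
```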


## Step 3.2
Although the execution is already fast, we can create indices to make it even faster. Use the syntax `CREATE INDEX I ON T(C)` to create an index on each of the three tables `recs`, `names`, and `lastjob`. Replace `I` with the name you wish to give the index, `T` with the name of the table, and `C` with the name of the column.

If you need to change the indices, you must drop them first using the following syntax: 
`conn.execute('drop index if exists I')`
where I is the name of the index to be dropped.
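
To see which indices currently exist before creating or dropping one, SQLite's `sqlite_master` catalog can be queried. The following sketch uses an in-memory database with a toy `names` table; the index name `names_person_idx` and the helper `list_indices` are hypothetical.

```python
import sqlite3

# Toy in-memory database standing in for linkedin.db
toy_conn = sqlite3.connect(':memory:')
toy_conn.execute('create table names (person text, family_name text)')

def list_indices(conn):
    # sqlite_master holds one row per schema object; keep only indices
    rows = conn.execute(
        "select name, tbl_name from sqlite_master where type = 'index'")
    return rows.fetchall()

toy_conn.execute('create index names_person_idx on names(person)')
after_create = list_indices(toy_conn)

# 'if exists' makes the drop safe to rerun
toy_conn.execute('drop index if exists names_person_idx')
toy_conn.execute('drop index if exists names_person_idx')
after_drop = list_indices(toy_conn)

print(after_create, after_drop)
```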

In [43]:
conn.execute('begin transaction')
# YOUR CODE HERE
### BEGIN SOLUTIONS
conn.execute('drop index if exists names_person')
conn.execute('drop index if exists recs_person')
conn.execute('drop index if exists lastjob_person')
conn.execute("create index names_person on names(person)")
conn.execute("create index recs_person on recs(person_1, person_2)")
conn.execute("create index lastjob_person on lastjob(person)")
### END SOLUTIONS
conn.execute('commit')

<sqlite3.Cursor at 0x7fb5c80c1ea0>

In the cell below, rerun the query that you defined in Step 3.1 100 times to get a new timing. The database will now use the indices that you created if they are beneficial to the execution.

Is the query faster? 

In [44]:
%%time
for i in range(0, 100):
    # YOUR CODE HERE
    ### BEGIN SOLUTIONS
    pd.read_sql_query('''
    select family_name, given_name, person_2, org, title
    from names N, recs R, lastjob L
    where N.person=R.person_1 and R.person_2=L.person
    ''', conn)
    ### END SOLUTIONS

CPU times: user 399 ms, sys: 4.07 ms, total: 403 ms
Wall time: 407 ms
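
One way to check whether SQLite actually uses an index is `EXPLAIN QUERY PLAN`. The sketch below builds toy versions of the three tables in an in-memory database (the data and the helper `query_plan` are hypothetical) and prints the plan before and after creating indices. The exact wording of the plan varies across SQLite versions.

```python
import sqlite3

# Toy in-memory database standing in for linkedin.db
toy_conn = sqlite3.connect(':memory:')
toy_conn.executescript('''
    create table names (person text, family_name text, given_name text);
    create table recs (person_1 text, person_2 text, common_skills integer);
    create table lastjob (person text, org text, title text);
    insert into names values ('p1', 'Zed', 'Zoe'), ('p2', 'Abel', 'Al');
    insert into recs values ('p1', 'p2', 3);
    insert into lastjob values ('p2', 'Acme', 'Engineer');
''')

query = '''
    select family_name, given_name, person_2, org, title
    from names N
    join recs R on N.person = R.person_1
    join lastjob L on R.person_2 = L.person
'''

def query_plan(conn, sql):
    # The last column of each plan row is a human-readable description
    return [row[-1] for row in conn.execute('explain query plan ' + sql)]

plan_before = query_plan(toy_conn, query)

toy_conn.executescript('''
    create index names_person on names(person);
    create index recs_person on recs(person_1, person_2);
    create index lastjob_person on lastjob(person);
''')
plan_after = query_plan(toy_conn, query)
rows = toy_conn.execute(query).fetchall()

print(plan_before)
print(plan_after)
```

If an index is being used, its name appears in one of the plan lines; on tables this small the timing difference is negligible, so the plan is the more reliable signal.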
