# Homework: Understanding Performance using a LinkedIn Dataset


This homework focuses on understanding performance using a LinkedIn dataset.  It is the same dataset that was used in the module entitled "Modeling Data and Knowledge".

In [0]:
!pip install pymongo[tls,srv]
!pip install swifter
!pip install lxml

In [0]:

import pandas as pd
import numpy as np
import json
import sqlite3
from lxml import etree
import urllib.request  # urlretrieve lives in the urllib.request submodule
import zipfile

import time
import swifter
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError, OperationFailure
from sklearn.utils import shuffle

# Step 1: Acquire and load the data

We will pull a zip file containing the LinkedIn dataset from a URL or Google Drive so that it can be parsed efficiently on the local machine. The detailed steps are covered in the "Modeling Data and Knowledge" module; refer to that module's instructor notes if you haven't already.

The cell below will download/open the file, and may take a while. 


In [0]:
url = 'https://raw.githubusercontent.com/chenleshang/OpenDS4All/master/Module3/homework3filewrapper.py'
urllib.request.urlretrieve(url,filename='homework3filewrapper.py')
# url = 'https://upenn-bigdataanalytics.s3.amazonaws.com/linkedin.zip'
# filehandle, _ = urllib.request.urlretrieve(url,filename='local.zip')

The next cell defines a helper for opening the (abbreviated) LinkedIn dataset and imports a script that prepares the dataset for use in this homework.

In [0]:
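def fetch_file(fname):
    # NOTE: relies on `filehandle`, which is only defined if the commented-out
    # urlretrieve call for linkedin.zip in the previous cell is enabled.
    zip_file_object = zipfile.ZipFile(filehandle, 'r')
    for name in zip_file_object.namelist():
        # Open only the matching archive member instead of every file.
        if name == fname:
            return zip_file_object.open(name)
    return None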

# linked_in = fetch_file('test_data_10000.json')

from homework3filewrapper import *

The next cell replays the data preparation for the LinkedIn dataset done in the module "Modeling Data and Knowledge". After this, you should have eleven dataframes with the following names. The first nine are as in the lecture notebook; the last two are constructed using queries over the first nine, and their meanings are given  below. 

1. `people_df`
2. `names_df`: Stores the first and last name of each person indexed by ID. 
3. `education_df`
4. `groups_df`
5. `skills_df`
6. `experience_df`
7. `honors_df`
8. `also_view_df`
9. `events_df`
10. `recs_df`: 20 pairs of people with the most shared/common skills in descending order. We will use this to make a recommendation for a potential employer and position to each person.
11. `last_job_df`: Person name, and the title and org corresponding to the person's last (most recent) employment experience (a three column dataframe).

The number of rows extracted from the dataset is controlled by the `LIMIT` argument. Here it is set to 10,000; you can set it to something much smaller (e.g. 1,000) while debugging your code.

The data is also being stored in an SQLite database so that you can see the effect of indexing on the performance of queries.



In [0]:
# If you use a file stored on Google Drive, mount the drive in Colab first.
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

In [0]:
# Use open() to read a local file, or fetch_file() to read the file from a remote zip archive.
people_df, names_df, education_df, groups_df, skills_df, experience_df, honors_df, also_view_df, events_df, recs_df, last_job_df =\
    data_loading(file=open('/content/drive/My Drive/Colab Notebooks/test_data_10000.json'), dbname='linkedin.db', filetype='localobj', LIMIT=10000)

In [0]:
conn = sqlite3.connect('linkedin.db')

In [0]:
# Sanity Check 1.1 - please do not modify or delete this cell!

recs_df

In [0]:
# Sanity Check 1.2 - please do not modify or delete this cell!

names_df

In [0]:
# Sanity Check 1.3 - please do not modify or delete this cell!

last_job_df

# Step 2: Compare Evaluation Orders using DataFrames

We will now explore the effect of various optimizations, including reordering execution steps and (in the case of database operations) creating indices.

We'll start with the code from our lecture notebooks, which does joins between dataframes.  The next cell creates two functions, merge and merge_map, which we explore in terms of efficiency.  **You do not need to modify this cell.**

In [0]:
# Join using nested loops
def merge(S,T,l_on,r_on):
    ret = pd.DataFrame()
    count = 0
    S_ = S.reset_index().drop(columns=['index'])
    T_ = T.reset_index().drop(columns=['index'])
    for s_index in range(0, len(S)):
        for t_index in range(0, len(T)):
            count = count + 1
            if S_.loc[s_index, l_on] == T_.loc[t_index, r_on]:
                ret = ret.append(S_.loc[s_index].append(T_.loc[t_index].drop(labels=r_on)), ignore_index=True)

    print('Merge compared %d tuples'%count)
    return ret
  
# Join using a *map*, which is a kind of in-memory index
# from keys to (single) values
def merge_map(S,T,l_on,r_on):
    ret = pd.DataFrame()
    T_map = {}
    count = 0
    # Take each value in the r_on field, and
    # make a map entry for it
    T_ = T.reset_index().drop(columns=['index'])
    for t_index in range(0, len(T)):
        # Make sure we aren't overwriting an entry!
        assert (T_.loc[t_index,r_on] not in T_map)
        T_map[T_.loc[t_index,r_on]] = T_.loc[t_index]
        count = count + 1

    # Now find matches
    S_ = S.reset_index().drop(columns=['index'])
    for s_index in range(0, len(S)):
        count = count + 1
        if S_.loc[s_index, l_on] in T_map:
                ret = ret.append(S_.loc[s_index].append(T_map[S_.loc[s_index, l_on]].drop(labels=r_on)), ignore_index=True)

    print('Merge compared %d tuples'%count)
    return ret
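To see why the two strategies differ in cost, here is a minimal sketch that mirrors the counting logic of `merge` and `merge_map` on plain lists of dictionaries. The helper names (`nested_loop_join`, `map_join`) and the toy rows are hypothetical and are not part of the homework code; the sketch uses only built-in Python types.

```python
def nested_loop_join(left, right, l_on, r_on):
    """Join two lists of dicts by comparing every pair of rows."""
    comparisons = 0
    out = []
    for l_row in left:
        for r_row in right:
            comparisons += 1
            if l_row[l_on] == r_row[r_on]:
                merged = dict(l_row)
                merged.update({k: v for k, v in r_row.items() if k != r_on})
                out.append(merged)
    return out, comparisons


def map_join(left, right, l_on, r_on):
    """Join by first building a dict over the right side (keys must be unique)."""
    steps = 0
    index = {}
    for r_row in right:
        steps += 1  # one step per row inserted into the map
        index[r_row[r_on]] = r_row
    out = []
    for l_row in left:
        steps += 1  # one lookup per left row
        match = index.get(l_row[l_on])
        if match is not None:
            merged = dict(l_row)
            merged.update({k: v for k, v in match.items() if k != r_on})
            out.append(merged)
    return out, steps


# Hypothetical toy data: 3 recommendation rows and 5 name rows.
toy_recs = [
    {'person_1': 1, 'person_2': 4},
    {'person_1': 2, 'person_2': 5},
    {'person_1': 3, 'person_2': 1},
]
toy_names = [{'person': i, 'given_name': 'P%d' % i} for i in range(1, 6)]

rows_nl, count_nl = nested_loop_join(toy_recs, toy_names, 'person_1', 'person')
rows_map, count_map = map_join(toy_recs, toy_names, 'person_1', 'person')
print(count_nl, count_map)
```

Both helpers return the same joined rows. The nested-loop version performs `len(left) * len(right)` comparisons, while the map version performs roughly `len(left) + len(right)` steps.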

## Step 2.1: Find a good order of evaluation.

The following function, `recommend_jobs_basic`, takes `recs_df`, `names_df`, and `last_job_df` as input and returns the name of each `person_1` together with the most recent `title` and `org` of the corresponding `person_2`.

We will time how long it takes to execute `recommend_jobs_basic` using the ordering `recs_df`, `names_df` and `last_job_df`.

Your task is to improve this time by changing the join ordering used in `recommend_jobs_basic`.

In [0]:
def recommend_jobs_basic(recs_df, names_df, last_job_df):
    return merge(merge(recs_df,names_df,'person_1','person')[['family_name','given_name','person_1','person_2']],
        last_job_df,'person_2','person')[['family_name','given_name','person_2','org','title']].sort_values('family_name')

In [0]:
%%time

recs_new_df = recommend_jobs_basic(recs_df, names_df, last_job_df)

if(len(recs_new_df.columns) != 5):
    raise AssertionError('Wrong number of columns in recs_new_df')
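Because the nested-loop `merge` compares every row of its left input with every row of its right input, the cost of a join order can be estimated before running it. The sketch below is a back-of-the-envelope calculation only: the helper name, the table sizes, and the assumed intermediate result sizes are all hypothetical and are not taken from the dataset.

```python
def nested_loop_comparisons(plan):
    """Total row comparisons for a sequence of nested-loop joins.

    `plan` is a list of (left_rows, right_rows) pairs, one per join.
    """
    return sum(left * right for left, right in plan)


# Hypothetical sizes, chosen only for illustration.
small, big_a, big_b = 20, 10_000, 8_000

# Order A: small x big_a first; assume the result keeps `small` rows,
# which are then joined with big_b.
order_a = [(small, big_a), (small, big_b)]

# Order B: big_a x big_b first; assume the result keeps `big_b` rows,
# which are then joined with small.
order_b = [(big_a, big_b), (big_b, small)]

cost_a = nested_loop_comparisons(order_a)
cost_b = nested_loop_comparisons(order_b)
print(cost_a, cost_b)
```

Under these assumptions, joining the two large inputs first costs orders of magnitude more comparisons than starting from the small input. The `Merge compared %d tuples` output printed by `merge` lets you check such estimates against the real data.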

Modify the function `recommend_jobs_basic` in the cell below. See if it is possible to improve the efficiency by changing the join ordering to reduce the number of comparisons made in the `merge` function. 

In [0]:
# TODO: modify the order of joins to reduce comparisons

def recommend_jobs_basic_reordered(recs_df, names_df, last_job_df):
    # YOUR CODE HERE
    

In [0]:
%%time

recs_new_df = recommend_jobs_basic_reordered(recs_df, names_df, last_job_df)

if(len(recs_new_df.columns) != 5):
    raise AssertionError('Wrong number of columns in recs_new_df')

In [0]:
names_df

In [0]:
recs_df

In [0]:
last_job_df

## Step 2.2: Perform selections early using `merge` and `merge_map`
 
Reimplement `recommend_jobs_basic` as `recommend_jobs_new`, using the custom `merge` and `merge_map` functions defined above (not pandas' built-in `merge`). Try to find the **most efficient** implementation by also considering the join ordering.

In [0]:
# TODO: Reimplement recommend jobs using our custom merge and merge_map functions

def recommend_jobs_new(recs_df, names_df, last_job_df):
    # YOUR CODE HERE


In [0]:
%%time
# Sanity Check 2.1 - please do not modify or delete this cell!

recs_new_df = recommend_jobs_new(recs_df, names_df, last_job_df)

if(len(recs_new_df.columns) != 5):
    raise AssertionError('Wrong number of columns in recs_new_df')

# Step 3. Query Optimization in Databases

Relational databases optimize queries by performing selections (and projections) as early as possible and by finding a good join ordering. We will therefore implement the `recommend_jobs_basic` function using SQLite and see whether it is faster.

The dataframes `names_df`, `recs_df`, and `last_job_df` are already stored in the database `linkedin.db` under the table names `names`, `recs`, and `lastjob`, respectively.
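As a warm-up that is independent of the homework tables, the sketch below shows the general pattern of running a SQL join through `sqlite3` and timing repeated executions. The in-memory database and the tables `toy_people` and `toy_jobs` are hypothetical stand-ins, not the tables in `linkedin.db`.

```python
import sqlite3
import time

# In-memory database with two hypothetical toy tables.
toy_conn = sqlite3.connect(':memory:')
toy_conn.execute('CREATE TABLE toy_people (person INTEGER, given_name TEXT)')
toy_conn.execute('CREATE TABLE toy_jobs (person INTEGER, title TEXT)')
toy_conn.executemany(
    'INSERT INTO toy_people VALUES (?, ?)',
    [(i, 'P%d' % i) for i in range(1000)],
)
toy_conn.executemany(
    'INSERT INTO toy_jobs VALUES (?, ?)',
    [(i, 'T%d' % i) for i in range(0, 1000, 10)],
)

query = '''
    SELECT p.given_name, j.title
    FROM toy_people AS p
    JOIN toy_jobs AS j ON p.person = j.person
    ORDER BY p.person
'''

# Run the query repeatedly so the total time is large enough to measure.
start = time.perf_counter()
for _ in range(100):
    toy_rows = toy_conn.execute(query).fetchall()
elapsed = time.perf_counter() - start
print('%d rows, %.4f s for 100 runs' % (len(toy_rows), elapsed))
```

The query only states which rows should be matched; SQLite's planner decides how to evaluate the join.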

## Step 3.1 
In the cell below, implement the `recommend_jobs_basic` function in SQL. Since the query is very fast, we will run the query 100 times to get an accurate idea of the execution time.

In [0]:
%%time
for i in range(0, 100):
    # YOUR CODE HERE


## Step 3.2
Although the execution is already fast, we can create indices to make it even faster. Use the syntax `CREATE INDEX I ON T(C)` to create indices on the three tables `recs`, `names`, and `lastjob`. Replace `I` with the name of the index that you wish to use, `T` with the name of the table, and `C` with the name of the column.

If you need to change the indices, you must drop them first using the following syntax: 
`conn.execute('drop index if exists I')`
where I is the name of the index to be dropped.
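One way to check whether SQLite actually uses an index is `EXPLAIN QUERY PLAN`. The sketch below is a minimal illustration on a hypothetical in-memory table `toy_jobs` with a hypothetical index name `toy_jobs_person_idx`; the `plan` helper is also an assumption of this example, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

# In-memory database with one hypothetical toy table.
idx_conn = sqlite3.connect(':memory:')
idx_conn.execute('CREATE TABLE toy_jobs (person INTEGER, title TEXT)')
idx_conn.executemany(
    'INSERT INTO toy_jobs VALUES (?, ?)',
    [(i, 'T%d' % i) for i in range(1000)],
)


def plan(conn, sql):
    """Return the detail text (last column) of each EXPLAIN QUERY PLAN row."""
    return [row[-1] for row in conn.execute('EXPLAIN QUERY PLAN ' + sql)]


lookup = 'SELECT title FROM toy_jobs WHERE person = 42'

# Without an index, the lookup has to scan the table.
plan_before = plan(idx_conn, lookup)

# Drop-then-create makes the cell safe to rerun.
idx_conn.execute('DROP INDEX IF EXISTS toy_jobs_person_idx')
idx_conn.execute('CREATE INDEX toy_jobs_person_idx ON toy_jobs(person)')

# With the index in place, the plan should mention it by name.
plan_after = plan(idx_conn, lookup)

print(plan_before)
print(plan_after)
```

Running the same check against your own query on `linkedin.db` shows which of the indices you create are picked up by the planner.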

In [0]:
conn.execute('begin transaction')

# YOUR CODE HERE


conn.execute('commit')

In the cell below, rerun the query that you defined in Step 3.1 100 times to get a new timing. The database will now use the indices that you created if they benefit the execution.

Is the query faster? 

In [0]:
%%time
for i in range(0, 100):
    # YOUR CODE HERE
