# Homework: Modeling and Querying Linked Data Using LinkedIn



Have you ever wondered about (1) what it takes to be a data scientist or "data person", and (2) how social networks and recommender systems work?

This homework focuses on (1) working with hierarchical data stored in dataframes and (2) traversing relationships within the data.

We will answer questions about data scientists using a crawl of LinkedIn profiles.

In [0]:
!pip install pymongo[tls,srv]
!pip install swifter
!pip install lxml

In [0]:
import pandas as pd
import numpy as np
import json
import sqlite3
from lxml import etree
import urllib.request
import zipfile

import time
import swifter
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError, OperationFailure

# Step 1: Acquire and load data

**Notice:** You must load the data correctly before this notebook will run. You can either fetch the data from a URL or open it from a local or mounted directory. See the **instructor notes** or **README** for details.

If you fetch the data from a URL, substitute the location of the dataset for X in the cell below (see Instructor Notes); `urllib.request` will save the result in a file called `local.zip`. If you instead open extracted data from a local or mounted directory, use the `open()` function.

**About the zip file content:** We need to pull the zip file with LinkedIn data onto your local machine or the Google Colab cloud-hosted machine. We can parse the data efficiently only when it is local (and we'll read directly out of the zip file).

The zip file contains a line-based JSON file that serves as a synthetic LinkedIn crawl of people's profiles:
* `test_data_10000.json` (10K records)

The cell below will download the zip file, and may take a while.


In [0]:
# from google.colab import files

# uploaded = files.upload()

# for fn in uploaded.keys():
#   print('User uploaded file "{name}" with length {length} bytes'.format(
#       name=fn, length=len(uploaded[fn])))

In [0]:
# url = 'X'
# filehandle, _ = urllib.request.urlretrieve(url,filename='local.zip')

The cell below defines a helper, `fetch_file`, that returns a handle to a named file inside the downloaded zip archive. To read a different file from the archive, change the file name passed to `fetch_file`.

In [0]:
# def fetch_file(fname):
#     zip_file_object = zipfile.ZipFile(filehandle, 'r')
#     for file in zip_file_object.namelist():
#         file = zip_file_object.open(file)
#         if file.name == fname: return file
#     return None
    
# linked_in= fetch_file('test_data_10000.json')

In [0]:
## Mount Google Drive to Colab as a 'local' directory, if used. 
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

In [0]:
## Each time you use local data, call open(). We recommend extracting zipped files beforehand.
myfile = open('/content/drive/My Drive/Colab Notebooks/test_data_10000.json')

## Step 1.1:  Store the data in dataframes

In the cell below, adapt the data loading code from the associated lecture notebook. You will need the function that extracts relations from JSON files and the function that converts relations to dataframes. Read in a maximum of 10000 people. Wrap the code that reads a line of the file, extracts the relations, removes the interval field, and stores the field information in a `try` statement, just in case; in the error case, use `pass` to move on. At the end of the next cell, you should have nine dataframes with the following names:

1. `people_df`
2. `names_df`
3. `education_df`
4. `groups_df`
5. `skills_df`
6. `experience_df`
7. `honors_df`
8. `also_view_df`
9. `events_df`
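
As a rough illustration of the pattern described above (not the lecture notebook's code), the sketch below parses line-delimited JSON, skips bad lines with `try`/`except`, and splits a nested list into its own relation. The field names `_id`, `locality`, and `skills`, the output column names, and the helper `extract_relations` are all assumptions made for this toy example; the real records have more fields.

```python
import io
import json
import pandas as pd

# Toy stand-in for the line-based JSON file; the field names are hypothetical.
sample = io.StringIO(
    '{"_id": "p1", "locality": "Philadelphia", "skills": ["SQL", "Python"]}\n'
    'this line is not valid JSON\n'
    '{"_id": "p2", "locality": "Boston", "skills": ["R"]}\n'
)

def extract_relations(record, people, skills):
    """Split one nested record into flat rows for two relations."""
    people.append({"_id": record["_id"], "locality": record.get("locality")})
    for skill in record.get("skills", []):
        skills.append({"person": record["_id"], "value": skill})

people_rows, skill_rows = [], []
LIMIT = 10000  # read at most this many people

for line in sample:
    if len(people_rows) >= LIMIT:
        break
    try:
        extract_relations(json.loads(line), people_rows, skill_rows)
    except (ValueError, KeyError):
        # Malformed line or missing key: skip it and move on.
        pass

toy_people_df = pd.DataFrame(people_rows)
toy_skills_df = pd.DataFrame(skill_rows)
```

Collecting plain dictionaries in lists and building each dataframe once at the end tends to be much faster than appending to a dataframe row by row.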

In [0]:
# TODO: Adapt the data loading code from class.

# YOUR CODE HERE


In [0]:
people_df

In [0]:
# Sanity Check 1.1 - please do not modify or delete this cell!

display(experience_df)


## Step 1.2: Save data to SQLite

Next save the data to SQLite, using the same approach as in the associated lecture notebook.
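
One possible way to do this, shown here on a toy dataframe and an in-memory database rather than `linkedin.db`, is `DataFrame.to_sql`. The table name `skills` and the column names are assumptions for illustration only.

```python
import sqlite3
import pandas as pd

toy_df = pd.DataFrame({"person": ["p1", "p2"], "value": ["SQL", "R"]})

# In-memory database used only for this demonstration.
conn_demo = sqlite3.connect(":memory:")

# if_exists="replace" lets the cell be re-run without a "table exists" error;
# index=False avoids writing the dataframe index as an extra column.
toy_df.to_sql("skills", conn_demo, if_exists="replace", index=False)

# Read the table back to confirm that the rows were stored.
round_trip = pd.read_sql_query("SELECT person, value FROM skills", conn_demo)
```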

In [0]:
conn = sqlite3.connect('linkedin.db')

# YOUR CODE HERE

In [0]:
# Sanity Check 1.2.1 - please do not modify or delete this cell!

people_df.describe()

In [0]:
# Sanity Check 1.2.2 - please do not modify or delete this cell!

skills_df.describe()

In [0]:
# Sanity Check 1.2.3 - please do not modify or delete this cell!

experience_df.describe()

# Step 2: What is a data scientist?

In this homework, we will use LinkedIn to analyze what it means to be a data scientist (as of a few years ago).

## Step 2.1: What are common skills for data scientists?

Our first question is: for anyone whose job revolves around data (database administrators, data curators, data engineers, data scientists), *what are the most common skills*?

### Step 2.1.1: Collect skills (Pandas)

Complete the `collect_skills` function below. The function should:

1. Using `experience_df`, find all people with a position containing "data" in the title. Remember upper versus lower case.
2. Using `skills_df`, find all people with "data science" as a skill. Again, remember to account for case.
3. For all of the unique people found in steps 1 and 2, find the rest of their skills.
4. Return a dataframe of the top 15 skills, by frequency (see `pandas.DataFrame.sort_values`). The columns should be called `skill` (the name of the skill) and `scientists` (the number of data scientists with this skill).
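
The steps above can be sketched on toy data as follows. The column names `person`, `title`, and `value` are assumptions about the schema, and the exact-match test for "data science" is one possible reading of step 2; adjust both to your own dataframes.

```python
import pandas as pd

toy_experience = pd.DataFrame({
    "person": ["p1", "p2", "p3"],
    "title": ["Data Engineer", "Chef", "database administrator"],
})
toy_skills = pd.DataFrame({
    "person": ["p1", "p1", "p2", "p3", "p3", "p4"],
    "value": ["SQL", "Python", "Cooking", "SQL", "Oracle", "Data Science"],
})

# Case-insensitive substring match; na=False treats missing titles as non-matches.
by_title = toy_experience[
    toy_experience["title"].str.contains("data", case=False, na=False)
]["person"]
by_skill = toy_skills[toy_skills["value"].str.lower() == "data science"]["person"]

# Union of the two sets of people, without duplicates.
data_people = pd.concat([by_title, by_skill]).drop_duplicates()

counts = (
    toy_skills[toy_skills["person"].isin(data_people)]
    .drop_duplicates(["person", "value"])   # count each person once per skill
    .groupby("value")["person"].count()
    .reset_index()
    .rename(columns={"value": "skill", "person": "scientists"})
    .sort_values("scientists", ascending=False)
    .head(15)
)
```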

In [0]:
# TODO: Find the top 15 skills for data scientists (Pandas)

def collect_skills(experience_df, people_df, skills_df):
    # YOUR CODE HERE

In [0]:
# Sanity Check 2.1.1 - please do not modify or delete this cell!

top_skills_df = collect_skills(experience_df, people_df, skills_df)
display(top_skills_df)

if "skill" not in top_skills_df:
    raise AssertionError("skill column not defined")
if "scientists" not in top_skills_df:
    raise AssertionError("scientists column not defined")
if len(top_skills_df) != 15:
    raise AssertionError("dataframe does not have top 15")  

### Step 2.1.2: Top skills (SQL)

Compute the same table as in 2.1.1 using SQL. Store it as a dataframe called `top_skills_sql`, otherwise matching the schema and other properties. Be sure to save the data to SQLite in a table called `top_skills`.
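
A minimal sketch of the SQL pattern, run against an in-memory database with toy tables. The table names `experience` and `skills` and their columns are assumptions, and only the title filter is shown (the "data science" skill filter is left out for brevity).

```python
import sqlite3
import pandas as pd

demo_conn = sqlite3.connect(":memory:")
pd.DataFrame({
    "person": ["p1", "p2", "p3"],
    "title": ["Data Engineer", "Chef", "DATA analyst"],
}).to_sql("experience", demo_conn, index=False)
pd.DataFrame({
    "person": ["p1", "p1", "p2", "p3"],
    "value": ["SQL", "Python", "Cooking", "SQL"],
}).to_sql("skills", demo_conn, index=False)

query = """
    SELECT s.value AS skill, COUNT(DISTINCT s.person) AS scientists
    FROM skills AS s
    WHERE s.person IN (
        SELECT person FROM experience WHERE LOWER(title) LIKE '%data%'
    )
    GROUP BY s.value
    ORDER BY scientists DESC, skill
    LIMIT 15
"""
demo_top = pd.read_sql_query(query, demo_conn)

# Persist the result as its own table so that later queries can reuse it.
demo_top.to_sql("top_skills", demo_conn, if_exists="replace", index=False)
```

SQLite's `LIKE` is already case-insensitive for ASCII characters by default, so the `LOWER(...)` call is a precaution that also keeps the query portable to other databases.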

In [0]:
# TODO: Find the top 15 skills for data scientists (SQL)

# YOUR CODE HERE

display(top_skills_sql)

In [0]:
# Sanity Check 2.1.2 - please do not modify or delete this cell!

if "skill" not in top_skills_sql:
    raise AssertionError("skill column not defined")
if "scientists" not in top_skills_sql:
    raise AssertionError("scientists column not defined")
if len(top_skills_sql) < 1:
    raise AssertionError("dataframe has no results")  
if len(top_skills_sql.merge(top_skills_df)) != len(top_skills_sql):
    raise AssertionError("Pandas and SQL versions do not match")

## Step 2.2: What are common titles for those with data science skills?

Complete the `collect_titles` function below that aggregates the most recent titles of people with data science skills. This function should use the given dataframes as input and return a two column dataframe: one column called `title` and the other called `count`. You should only consider people who have at least `min_skills` of the top skills for a data scientist. You should also only keep those titles that appear at least `min_count` times.

For extra practice, you can also do this in SQL.
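
The two thresholds can be sketched on toy data as below. The frame `toy_latest` stands in for "each person's most recent title" and is assumed to be computed already; the column names `person`, `value`, and `title` are likewise assumptions.

```python
import pandas as pd

toy_top = pd.DataFrame({"skill": ["SQL", "Python", "R"]})
toy_skills = pd.DataFrame({
    "person": ["p1", "p1", "p2", "p3", "p3", "p3"],
    "value": ["SQL", "Python", "SQL", "SQL", "R", "Cooking"],
})
toy_latest = pd.DataFrame({
    "person": ["p1", "p2", "p3"],
    "title": ["Data Scientist", "Analyst", "Data Scientist"],
})
min_skills, min_count = 2, 2

# Count, per person, how many distinct top skills they have.
matches = (
    toy_skills[toy_skills["value"].isin(toy_top["skill"])]
    .groupby("person")["value"].nunique()
)
qualified = matches[matches >= min_skills].index

# Count titles among the qualified people and keep only the frequent ones.
title_counts = (
    toy_latest[toy_latest["person"].isin(qualified)]
    .groupby("title").size()
    .reset_index(name="count")
)
title_counts = title_counts[title_counts["count"] >= min_count]
```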

In [0]:
# TODO: Find the common titles (Pandas)

def collect_titles(top_skills_df, skills_df, people_df, experience_df, min_skills, min_count):
    # YOUR CODE HERE

In [0]:
# Sanity Check 2.2 - please do not modify or delete this cell!

ds_titles_df = collect_titles(top_skills_df, skills_df, people_df, experience_df, 6, 2)
display(ds_titles_df)

if "title" not in ds_titles_df:
    raise AssertionError("title column not defined")
if "count" not in ds_titles_df:
    raise AssertionError("count column not defined")
if len(ds_titles_df) < 1:
    raise AssertionError("dataframe has no results")

## Step 2.3: Who employs "data people" based on title?

Now let's find the list of companies that have employed people with the above titles, ranked by number of employees who have had these titles.

### Step 2.3.1: Data employers

Complete the `collect_employers` function below that aggregates the employers with positions corresponding to the most recent titles of people with data science skills. This function should use the given dataframes as input and return a two column dataframe: one column called `org` and the other called `people`. Show the names of companies (in field `org`) with at least `min_count` employees who are "data people" (include that count in the `people` column). Order the dataframe by the count of data people in the company in descending order.
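
A small sketch of the join, count, filter, and sort sequence described above, on toy data. Counting each person once per company (via `nunique`) is an assumption; counting every matching position instead would be an equally defensible reading.

```python
import pandas as pd

toy_experience = pd.DataFrame({
    "person": ["p1", "p2", "p3", "p4", "p1"],
    "org": ["Acme", "Acme", "Globex", "Acme", "Acme"],
    "title": ["Data Scientist", "Data Scientist", "Data Scientist",
              "Chef", "Data Scientist"],
})
toy_titles = pd.DataFrame({"title": ["Data Scientist"], "count": [3]})
min_count = 2

employers = (
    # Inner join keeps only positions whose title is in the title list.
    toy_experience.merge(toy_titles[["title"]], on="title")
    .groupby("org")["person"].nunique()
    .reset_index(name="people")
)
employers = (
    employers[employers["people"] >= min_count]
    .sort_values("people", ascending=False)
)
```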

In [0]:
# TODO: Find the data employers
def collect_employers(experience_df, ds_titles_df, min_count):
    # YOUR CODE HERE

In [0]:
# Sanity Check 2.3.1 - please do not modify or delete this cell!

employers_df = collect_employers(experience_df, ds_titles_df, 5)
display(employers_df)

# if "IBM" not in employers_df['org'].tolist():
#     raise AssertionError("Missing IBM")
    
if employers_df['people'].min() < 4:
    raise AssertionError("Not filtering properly")

### Step 2.3.2:  Employees of Data Employers

Complete the `collect_employees` function below that aggregates the employees of employers with positions corresponding to the most recent titles of people with data science skills. In other words, who are the employees of the data employers you found before and what are their titles? This function should use the given dataframes as input and return the `org`, `family_name`, `given_name`, and `title` of each person.

In [0]:
# TODO: Find the employees of the data employers

def collect_employees(people_df, experience_df, employers_df, names_df, ds_titles_df):
    # YOUR CODE HERE


In [0]:
# Sanity Check 2.3.2 - please do not modify or delete this cell!

title_people_df = collect_employees(people_df, experience_df, employers_df, names_df, ds_titles_df)
display(title_people_df)

if len(title_people_df.columns) != 4:
    raise AssertionError('Wrong number of columns. Check schema again')

## Step 2.4: Find peers

In many common social graph settings, we can make recommendations to people based on their similarity with other people. In this case, we define similarity in terms of the number of identical skills.

Suppose A and B have similar skills: A -> X1 and B -> X1, A -> X2 and B -> X2, etc. up to A -> Xk and B -> Xk.

Then given that A and B have similar skills, we might recommend A's employer to B, and vice versa.

### Step 2.4.1: Compute the top pairs of peers

Let's consider only the first 100 people in `people_df`.
Find, out of this set, the pairs of people with the most shared/common skills, and return the closest 20 pairs in descending order.  We'll then use this to make a *recommendation* for a potential employer and position to each person.

Complete the `collect_peers` function below that finds the top `num` pairs of peers. In other words, compare each person with each *other* person, counting the total set of skills in common. This function should use the given dataframes and `num` as input and return a three column dataframe: `person_1`, `person_2`, and `common_skills`. The first two columns should be person IDs and the last column should be the number of skills that this pair of people shares.

**Hint:** Doing this requires a *Cartesian product*, i.e., every ID paired with every other ID.  Think about how to create a dataframe just with people IDs, then add a field to this dataframe that will let us combine every record with every record.
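
The constant-key trick from the hint can be sketched as follows on three toy people. The column names `_id`, `person`, and `value` are assumptions, as is the choice to keep each unordered pair only once.

```python
import pandas as pd

toy_people = pd.DataFrame({"_id": ["p1", "p2", "p3"]})
toy_skills = pd.DataFrame({
    "person": ["p1", "p1", "p2", "p2", "p3"],
    "value": ["SQL", "Python", "SQL", "Python", "SQL"],
})

# Add a constant key so that merging on it pairs every row with every row.
ids = toy_people[["_id"]].assign(key=1)
pairs = ids.merge(ids, on="key", suffixes=("_1", "_2")).drop(columns="key")
pairs = pairs.rename(columns={"_id_1": "person_1", "_id_2": "person_2"})

# Keep each unordered pair once and drop self-pairs.
pairs = pairs[pairs["person_1"] < pairs["person_2"]]

# Attach person_1's skills, then keep only rows where person_2 has the same skill.
s1 = toy_skills.rename(columns={"person": "person_1"})
s2 = toy_skills.rename(columns={"person": "person_2"})
shared = (
    pairs.merge(s1, on="person_1")
    .merge(s2, on=["person_2", "value"])
    .groupby(["person_1", "person_2"]).size()
    .reset_index(name="common_skills")
    .sort_values("common_skills", ascending=False)
)
```

Recent pandas versions (1.2 and later) also accept `how="cross"` in `merge`, which produces the same Cartesian product without the helper column.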

In [0]:
# TODO: Finish the collect_peers function

people_df_subset = people_df.head(100)

def collect_peers(people_df_subset, skills_df, num):
    # YOUR CODE HERE



In [0]:
# Sanity Check 2.4.1 - please do not modify or delete this cell!

recs_df = collect_peers(people_df_subset, skills_df, 20)
display(recs_df)

if "person_1" not in recs_df:
    raise AssertionError("person_1 column not defined")
if "person_2" not in recs_df:
    raise AssertionError("person_2 column not defined")
if "common_skills" not in recs_df:
    raise AssertionError("common_skills column not defined")
if(len(recs_df) != 20):
    raise AssertionError('Wrong number of rows in recs_df')

### Step 2.4.2: Get the last jobs

Complete the `last_job` function below that takes `experience_df` as input and returns the `person`, `title`, and `org` corresponding to each person's **last** (most recent) employment experience (three column dataframe).
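
One way to pick a single row per person is to sort and then keep the last row of each group. The sketch below assumes a hypothetical `start` column that orders positions in time; check which field (or row order) in your own `experience_df` actually identifies the most recent position.

```python
import pandas as pd

toy_experience = pd.DataFrame({
    "person": ["p1", "p1", "p2"],
    "title": ["Analyst", "Data Scientist", "Chef"],
    "org": ["Acme", "Globex", "Bistro"],
    "start": ["2010", "2015", "2012"],  # hypothetical ordering column
})

latest = (
    toy_experience.sort_values(["person", "start"])
    .drop_duplicates("person", keep="last")   # last row per person after sorting
    [["person", "title", "org"]]
    .reset_index(drop=True)
)
```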

In [0]:
# TODO: Complete the last_job function

def last_job(experience_df):
    # YOUR CODE HERE


In [0]:
# Sanity Check 2.4.2 - please do not modify or delete this cell!

last_job_df = last_job(experience_df)
display(last_job_df)

if(len(last_job_df.columns) != 3):
    raise AssertionError('Wrong number of columns in last_job_df')

### Step 2.4.3: Recommend jobs

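Complete the `recommend_jobs` function below. It takes `recs_df`, `names_df`, and `last_job_df` as input and returns, for each `person_1`, the most recent `title` and `org` of the matching `person_2`. The result should include the columns `family_name`, `given_name`, `person_2`, `title`, and `org`.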
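
A sketch of the two joins involved, on toy data. It assumes that `names_df` is keyed by a `person` column and that the names attached to each row are those of `person_1` (the person receiving the recommendation); both are assumptions to verify against your own dataframes.

```python
import pandas as pd

toy_recs = pd.DataFrame({
    "person_1": ["p1"], "person_2": ["p2"], "common_skills": [3],
})
toy_names = pd.DataFrame({
    "person": ["p1", "p2"],
    "family_name": ["Doe", "Roe"],
    "given_name": ["Jane", "Richard"],
})
toy_last_job = pd.DataFrame({
    "person": ["p1", "p2"],
    "title": ["Analyst", "Data Scientist"],
    "org": ["Acme", "Globex"],
})

rec = (
    toy_recs
    # Look up person_1's name.
    .merge(toy_names, left_on="person_1", right_on="person")
    .drop(columns="person")
    # Look up person_2's most recent job.
    .merge(toy_last_job, left_on="person_2", right_on="person")
    .drop(columns="person")
    [["family_name", "given_name", "person_2", "title", "org"]]
)
```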

In [0]:
# TODO: Complete the recommend_jobs function

def recommend_jobs(recs_df, names_df, last_job_df):
    # YOUR CODE HERE


In [0]:
# Sanity Check 2.4.3 - please do not modify or delete this cell!

recommended_df = recommend_jobs(recs_df, names_df, last_job_df)
display(recommended_df)

if "family_name" not in recommended_df:
    raise AssertionError("family_name column not defined")
if "given_name" not in recommended_df:
    raise AssertionError("given_name column not defined")
if "person_2" not in recommended_df:
    raise AssertionError("person_2 column not defined")
if "org" not in recommended_df:
    raise AssertionError("org column not defined")
if "title" not in recommended_df:
    raise AssertionError("title column not defined")