# Lecture Notebook: Modeling Data and Knowledge

## Making Choices about Data Representation and Processing using LinkedIn 
This module explores concepts in:

* Designing data representations to capture important relationships
* Reasoning over graphs
* Exploring and traversing graphs


It sets the stage for a deeper understanding of issues related to performance and to cloud/cluster-compute data processing.




We'll use MongoDB on the cloud as a sample NoSQL database.

In [0]:
!pip install pymongo[tls,srv]
!pip install swifter
!pip install lxml




In [0]:
import pandas as pd
import numpy as np

# JSON parsing
import json

# HTML parsing
from lxml import etree
import urllib.request

# SQLite RDBMS
import sqlite3

# Time conversions
import time

# Parallel processing
import swifter

# NoSQL DB
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError, OperationFailure

import zipfile
import os

In [0]:
!curl ipecho.net/plain

35.237.201.176

## Our Example Dataset

The example dataset is a crawl of LinkedIn, stored as a sequence of JSON objects (one per line).  
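As a minimal sketch of the one-object-per-line layout, the snippet below parses a small in-memory sample. The sample records and the `parse_json_lines` helper are hypothetical and made up for illustration; only the `_id` and `skills` field names match fields that appear later in this notebook.

```python
import io
import json

# Hypothetical sample in the one-JSON-object-per-line layout;
# these records are invented, not taken from the crawl.
sample = io.StringIO(
    '{"_id": "in-example1", "skills": ["Marketing", "Sales"]}\n'
    '{"_id": "in-example2", "skills": ["Python"]}\n'
)

def parse_json_lines(lines):
    """Parse an iterable of lines, each holding one JSON object."""
    records = []
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))
    return records

records = parse_json_lines(sample)
print(len(records), "records")
print(records[0]["skills"])
```

Because each line is a complete JSON object, the file can be read one record at a time without loading it all into memory first.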

**Note to instructor:** In the cell below, substitute the location of this dataset for X (see Instructor Notes). `urllib.request.urlretrieve` saves the download to a file called 'local.zip'.

In [0]:
url = 'https://upenn-bigdataanalytics.s3.amazonaws.com/linkedin.zip'
# url = 'X'
filehandle, _ = urllib.request.urlretrieve(url,filename='local.zip')

In [0]:
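def fetch_file(fname):
    # filehandle holds the path of the downloaded archive ('local.zip')
    zip_file_object = zipfile.ZipFile(filehandle, 'r')
    # Open only the requested member rather than every file in the archive
    if fname in zip_file_object.namelist():
        return zip_file_object.open(fname)
    return None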
    
# linkedin_small = fetch_file('linkedin_small.json')

In [0]:
%%time
# 100K records from linkedin
linked_in = fetch_file('linkedin_small.json')
    
people = []

for line in linked_in:
    person = json.loads(line)
    people.append(person)
    
people_df = pd.DataFrame(people)
print ("%d records"%len(people_df))

people_df

100000 records
CPU times: user 15.7 s, sys: 1.69 s, total: 17.4 s
Wall time: 17.4 s


## NoSQL storage

For this part you need to access MongoDB (see the Instructor Notes).  One option is to sign up at:

https://www.mongodb.com/cloud

Click on "Get started", sign up, agree to terms of service, and create a new cluster on AWS free tier (Northern Virginia).  Use this location as 'Y' in the client creation below.

Eventually you'll need to tell MongoDB to allow connections from your IP address (so you can talk to the machine; the `curl ipecho.net/plain` cell above prints this machine's public address), and you'll need to create a database called 'linkedin'.

In [0]:
# Store in MongoDB some number of records (here limit=10k) and in an in-memory list

START = 0
LIMIT = 10000

# client = MongoClient('mongodb+srv://Y')
client = MongoClient('mongodb+srv://cis545:1course4all@cluster0-cy1yu.mongodb.net/test?retryWrites=true&w=majority')

linkedin_db = client['linkedin']
linked_in = fetch_file('linkedin_small.json')

# Build a list of the JSON elements
list_for_comparison = []

people = 0
for line in linked_in:
    person = json.loads(line)
    if people >= START:
        try:
            list_for_comparison.append(person)
            linkedin_db.posts.insert_one(person)
        except DuplicateKeyError:
            pass
        except OperationFailure:
            # If the above still uses our cluster, you'll get this error in
            # attempting to write to our MongoDB client
            pass
    people = people + 1
    # if(people % 1000 == 0): 
    #   print (people)
    if people >= START + LIMIT:
        break

In [0]:
# Two ways of looking up skills, one based on an in-memory
# list, one based on MongoDB queries

def find_skills_in_list(skill):
    for post in list_for_comparison:
        if 'skills' in post:
            skills = post['skills']
            for this_skill in skills:
                if this_skill == skill:
                    return post
    return None

def find_skills_in_mongodb(skill):
    return linkedin_db.posts.find_one({'skills': skill})

In [0]:
%%time
find_skills_in_list('Marketing')

CPU times: user 72 µs, sys: 0 ns, total: 72 µs
Wall time: 75.8 µs


{'_id': 'in-01011985',
 'also_view': [{'id': 'pub-murli-shukla-13-b68-27a',
   'url': 'http://in.linkedin.com/pub/murli-shukla/13/b68/27a'},
  {'id': 'pub-sumeet-mehta-20-18-736',
   'url': 'http://in.linkedin.com/pub/sumeet-mehta/20/18/736'},
  {'id': 'pub-killol-bhatt-5-2ab-96',
   'url': 'http://ke.linkedin.com/pub/killol-bhatt/5/2ab/96'},
  {'id': 'pub-vijay-javiya-16-239-70b',
   'url': 'http://in.linkedin.com/pub/vijay-javiya/16/239/70b'},
  {'id': 'in-skdash1969', 'url': 'http://in.linkedin.com/in/skdash1969'},
  {'id': 'in-umarshervani', 'url': 'http://in.linkedin.com/in/umarshervani'},
  {'id': 'in-mehtanilesh', 'url': 'http://in.linkedin.com/in/mehtanilesh'},
  {'id': 'pub-amit-dwivedi-a-3a3-968',
   'url': 'http://in.linkedin.com/pub/amit-dwivedi/a/3a3/968'},
  {'id': 'pub-pritesh-patel-12-604-a23',
   'url': 'http://in.linkedin.com/pub/pritesh-patel/12/604/a23'},
  {'id': 'pub-moumita-chakraborty-13-538-52',
   'url': 'http://in.linkedin.com/pub/moumita-chakraborty/13/538/5

In [0]:
%%time
find_skills_in_mongodb('Marketing')

CPU times: user 2.29 ms, sys: 801 µs, total: 3.09 ms
Wall time: 89.3 ms


{'_id': 'in-01011985',
 'also_view': [{'id': 'pub-murli-shukla-13-b68-27a',
   'url': 'http://in.linkedin.com/pub/murli-shukla/13/b68/27a'},
  {'id': 'pub-sumeet-mehta-20-18-736',
   'url': 'http://in.linkedin.com/pub/sumeet-mehta/20/18/736'},
  {'id': 'pub-killol-bhatt-5-2ab-96',
   'url': 'http://ke.linkedin.com/pub/killol-bhatt/5/2ab/96'},
  {'id': 'pub-vijay-javiya-16-239-70b',
   'url': 'http://in.linkedin.com/pub/vijay-javiya/16/239/70b'},
  {'id': 'in-skdash1969', 'url': 'http://in.linkedin.com/in/skdash1969'},
  {'id': 'in-umarshervani', 'url': 'http://in.linkedin.com/in/umarshervani'},
  {'id': 'in-mehtanilesh', 'url': 'http://in.linkedin.com/in/mehtanilesh'},
  {'id': 'pub-amit-dwivedi-a-3a3-968',
   'url': 'http://in.linkedin.com/pub/amit-dwivedi/a/3a3/968'},
  {'id': 'pub-pritesh-patel-12-604-a23',
   'url': 'http://in.linkedin.com/pub/pritesh-patel/12/604/a23'},
  {'id': 'pub-moumita-chakraborty-13-538-52',
   'url': 'http://in.linkedin.com/pub/moumita-chakraborty/13/538/5

## Designing a relational schema from hierarchical data

Given that we already have a predefined set of fields / attributes / features, we don't need to spend a lot of time defining our table *schemas*, except that we need to unnest data.

* Nested relationships can be captured by creating a second table, which has a **foreign key** pointing to the identifier (key) for the main (parent) table.
* Ordered lists can be captured by encoding an index number or row number.
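The two bullets above can be sketched on a single record, as a rough illustration rather than the notebook's actual loading code. The record below is made up; the names `_id`, `skills`, `person`, `pos`, and `value` mirror those used later in this notebook, while `locality` and `demo_conn` are hypothetical.

```python
import sqlite3

# Hypothetical nested record; the values are invented for illustration.
person = {
    "_id": "in-example1",
    "locality": "Philadelphia",
    "skills": ["Marketing", "Sales", "Python"],
}

# Parent row: the scalar fields only.
parent_row = (person["_id"], person["locality"])

# Child rows: one per list element, carrying the parent's key
# ('person', the foreign key) and the element's position ('pos').
child_rows = [
    (person["_id"], pos, skill)
    for pos, skill in enumerate(person["skills"])
]

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("create table people (_id text primary key, locality text)")
demo_conn.execute(
    "create table skills ("
    " person text references people(_id),"
    " pos integer,"
    " value text)"
)
demo_conn.execute("insert into people values (?, ?)", parent_row)
demo_conn.executemany("insert into skills values (?, ?, ?)", child_rows)

# Joining on the foreign key and ordering by pos recovers the original list.
restored = [
    row[0]
    for row in demo_conn.execute(
        "select value from people join skills on _id = person"
        " where _id = ? order by pos",
        (person["_id"],),
    )
]
print(restored)
```

The `pos` column matters because rows in a relational table have no inherent order; without it, the original ordering of the list would be lost.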

In [0]:
'''
Simple code to pull out data from JSON and load it into SQLite
'''
# linked_in = urllib.request.urlopen('X')
linked_in = fetch_file('linkedin_small.json')

START = 0
LIMIT = 10000 # Limit the max number of records to be 10K. 

def get_df(rel):
    ret = pd.DataFrame(rel).fillna('')
    for k in ret.keys():
        ret[k] = ret[k].astype(str)
    return ret

def extract_relation(rel, name):
    '''
    Pull out a nested list that has a key, and return it as a list
    of dictionaries suitable for treating as a relation / dataframe
    '''
    # We'll return a list
    ret  = []
    if name in rel:
        ret2 = rel.pop(name)
        try:
            # Try to parse the string as a dictionary
            ret2 = json.loads(ret2.replace('\'','\"'))
        except (AttributeError, ValueError):
            # Not a string, or not valid JSON: leave the value unchanged
            pass
        
        # If it's a dictionary, add it to our return results after
        # adding a key to the parent
        if isinstance(ret2, dict):
            item = ret2
            item['person'] = rel['_id']
            ret.append(item)
        else:
            # If it's a list, iterate over each item
            index = 0
            for r in ret2:
                item = r
                if not isinstance(item, dict):
                    item = {'person': rel['_id'], 'value': item}
                else:
                    item['person'] = rel['_id']
                    
                # A fix to a typo in the data
                if 'affilition' in item:
                    item['affiliation'] = item.pop('affilition')
                    
                item['pos'] = index
                index = index + 1
                ret.append(item)
    return ret
    

names = []
people = []
groups = []
education = []
skills = []
experience = []
honors = []
also_view = []
events = []


conn = sqlite3.connect('linkedin.db')

lines = []
i = 1
for line in linked_in:
    if i > START + LIMIT:
        break
    elif i >= START:
        person = json.loads(line)

        # By inspection, all of these are nested dictionary or list content
        nam = extract_relation(person, 'name')
        edu = extract_relation(person, 'education')
        grp = extract_relation(person, 'group')
        skl = extract_relation(person, 'skills')
        exp  = extract_relation(person, 'experience')
        hon = extract_relation(person, 'honors')
        als = extract_relation(person, 'also_view')
        eve = extract_relation(person, 'events')
        
        # This doesn't seem relevant and it's the only
        # non-string field that's sometimes null
        if 'interval' in person:
            person.pop('interval')
        
        lines.append(person)
        # extend() appends in place instead of copying the whole list each time
        names.extend(nam)
        education.extend(edu)
        groups.extend(grp)
        skills.extend(skl)
        experience.extend(exp)
        honors.extend(hon)
        also_view.extend(als)
        events.extend(eve)
        
    i = i + 1

people_df = get_df(pd.DataFrame(lines))
names_df = get_df(pd.DataFrame(names))
education_df = get_df(pd.DataFrame(education))
groups_df = get_df(pd.DataFrame(groups))
skills_df = get_df(pd.DataFrame(skills))
experience_df = get_df(pd.DataFrame(experience))
honors_df = get_df(pd.DataFrame(honors))
also_view_df = get_df(pd.DataFrame(also_view))
events_df = get_df(pd.DataFrame(events))

In [0]:
# Save these to the SQLite database

people_df.to_sql('people', conn, if_exists='replace', index=False)
names_df.to_sql('names', conn, if_exists='replace', index=False)
education_df.to_sql('education', conn, if_exists='replace', index=False)
groups_df.to_sql('groups', conn, if_exists='replace', index=False)
skills_df.to_sql('skills', conn, if_exists='replace', index=False)
experience_df.to_sql('experience', conn, if_exists='replace', index=False)
honors_df.to_sql('honors', conn, if_exists='replace', index=False)
also_view_df.to_sql('also_view', conn, if_exists='replace', index=False)
events_df.to_sql('events', conn, if_exists='replace', index=False)

In [0]:
groups_df

Unnamed: 0,affilition,person,member
0,"['ASMALLWORLD.net', 'Biomarker Research & Exec...",in-00001,
1,"['Big Data, Low Latency', ""Experts Answer's"", ...",in-000montgomery,
2,"['AeSI Alumni Association', 'Aircraft Electron...",in-000vijaychauhan,"Member of Project Management Institute, Life M..."
3,"['Canadian Marketing Association', 'LeadingLoy...",in-001monica,
4,"['CFA Institute Candidates', 'Economist Intell...",in-00789123,Associate Member of SAMRA
...,...,...,...
6331,"['EADA Alumni', 'Entrepreneurs Network Barcelo...",in-albertocanasrojas,EADA Alumni
6332,"['CUDA Developers', 'CUDA Users Group', 'Data ...",in-albertocanorojas,
6333,"['Sony Ericsson Global', 'WE LOVE ADVERTISING'...",in-albertocarcedo,
6334,"['COMPANY PHARMA TALENT', 'Chemical / O&G Oppo...",in-albertocarimati,


In [0]:
pd.read_sql_query('select _id, org from people join experience on _id=person', conn)

Unnamed: 0,_id,org
0,in-00001,Albert Einstein Medical Center
1,in-00001,Columbia University
2,in-00001,Johnson and Johnson
3,in-00001,Memorial Sloan Kettering Cancer Center
4,in-00001,Sahlgrenska University Hospital
...,...,...
46106,in-albertocastellano,Reply
46107,in-albertocastellano,Vodafone IT
46108,in-albertocesani,Atari Games
46109,in-albertocesani,Koch Media srl


In [0]:
pd.read_sql_query("select _id, group_concat(org) as experience "
                  "from people left join experience on _id=person group by _id", conn)

Unnamed: 0,_id,experience
0,in-00000001,
1,in-00001,"Albert Einstein Medical Center,Columbia Univer..."
2,in-00006,"UCSF,Wyss Institute for Biologically Inspired ..."
3,in-000montgomery,"000Montgomery.Com,<Advertising Company>,<Adver..."
4,in-000vijaychauhan,
...,...,...
9995,in-albertocarimati,"BASF,Basf Italia,Lonza Polymer and,Lonza Singa..."
9996,in-albertocarrasco,"Glassdrive España,Saint-Gobain Glassdrive Espa..."
9997,in-albertocarreroderoa,"ArcelorMittal,Corporacion Alimentaria Penasant..."
9998,in-albertocastellano,"Amadeus,Amadeus IT Group,Astek,Reply,Vodafone IT"


## Views

We may want to see all of a person's experiences in one place rather than in separate rows, so we will create a view that lists them as a single comma-separated string (in a column named `experience`). The following code creates this view within a transaction (the statements between "begin" and "commit" or "rollback"). If the view already exists, the code drops it and creates a new one.

In [0]:
conn.execute('begin transaction')
conn.execute('drop view if exists people_experience')
conn.execute("create view people_experience as select _id, group_concat(org) as experience "
             "from people left join experience on _id=person group by _id")
conn.execute('commit')

# Treat the view as a table, see what's there
pd.read_sql_query('select * from people_experience', conn)

Unnamed: 0,_id,experience
0,in-00000001,
1,in-00001,"Albert Einstein Medical Center,Columbia Univer..."
2,in-00006,"UCSF,Wyss Institute for Biologically Inspired ..."
3,in-000montgomery,"000Montgomery.Com,<Advertising Company>,<Adver..."
4,in-000vijaychauhan,
...,...,...
9995,in-albertocarimati,"BASF,Basf Italia,Lonza Polymer and,Lonza Singa..."
9996,in-albertocarrasco,"Glassdrive España,Saint-Gobain Glassdrive Espa..."
9997,in-albertocarreroderoa,"ArcelorMittal,Corporacion Alimentaria Penasant..."
9998,in-albertocastellano,"Amadeus,Amadeus IT Group,Astek,Reply,Vodafone IT"
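
One property of views worth seeing directly is that a view stores the query, not its result, so it reflects later changes to the underlying tables. The following is a small sketch on an in-memory database; the `demo` connection and the `p1`/`p2` rows are hypothetical stand-ins, while the view definition follows the one above.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("create table people (_id text)")
demo.execute("create table experience (person text, org text)")
demo.executemany("insert into people values (?)", [("p1",), ("p2",)])
demo.execute("insert into experience values ('p1', 'Org A')")

demo.execute(
    "create view people_experience as"
    " select _id, group_concat(org) as experience"
    " from people left join experience on _id = person group by _id"
)

before = demo.execute(
    "select * from people_experience order by _id"
).fetchall()

# Add a row to the base table; the view itself is not recreated.
demo.execute("insert into experience values ('p2', 'Org B')")

after = demo.execute(
    "select * from people_experience order by _id"
).fetchall()

print(before)
print(after)
```

Before the insert, `p2` appears in the view with an empty (`None`) experience, because the left join keeps people who have no matching experience rows; after the insert, the same view returns `Org B` for `p2`.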
