# Lecture Notebook: Modeling Data and Knowledge

## Making Choices about Data Representation and Processing using LinkedIn 
This module explores concepts in:

* Designing data representations to capture important relationships
* Reasoning over graphs
* Exploring and traversing graphs


It sets the stage for a deeper understanding of performance issues and of cloud/cluster-compute data processing.




We'll use MongoDB on the cloud as a sample NoSQL database.

In [0]:
!pip install pymongo[tls,srv]
!pip install swifter
!pip install lxml



In [0]:
import pandas as pd
import numpy as np

# JSON parsing
import json

# HTML parsing
from lxml import etree
import urllib

# SQLite RDBMS
import sqlite3

# Time conversions
import time

# Parallel processing
import swifter

# NoSQL DB
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError, OperationFailure

import zipfile
import os

## Our Example Dataset

The example dataset is a crawl of LinkedIn, stored as a sequence of JSON objects (one per line).  

**Notice:** You need to load the data correctly before this notebook will run. You can either fetch the data from a URL or open/mount it locally. See the **instructor notes** or the **README** for details.

**Note:** When fetching the data from a URL in the cells below, substitute the location of this dataset for X (see the Instructor Notes). `urllib.request` will place the result in a file called 'local.zip'. Otherwise, when the data is mounted or stored in a local directory, use the `open()` function instead.
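To make the one-object-per-line layout concrete, here is a minimal sketch that parses a small in-memory sample the same way the loading cell further down does. The two sample records and their values are invented for illustration; only the `_id` and `skills` field names are borrowed from the dataset.

```python
import io
import json

# Hypothetical two-record sample in the same one-JSON-object-per-line layout
sample = io.StringIO(
    '{"_id": "person-a", "skills": ["Marketing", "Sales"]}\n'
    '{"_id": "person-b", "locality": "Somewhere"}\n'
)

# Each line is a complete JSON document, so we can parse line by line
sample_records = [json.loads(line) for line in sample if line.strip()]

# Records need not share the same fields; collect the union of field names
sample_fields = sorted({key for rec in sample_records for key in rec})

print("%d records" % len(sample_records))
print(sample_fields)
```

Because each line stands alone, a large file in this format can be processed as a stream without loading it all into memory first.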

In [0]:
# url = 'X'
# filehandle, _ = urllib.request.urlretrieve(url,filename='local.zip')

In [0]:
# def fetch_file(fname):
#     zip_file_object = zipfile.ZipFile(filehandle, 'r')
#     for file in zip_file_object.namelist():
#         file = zip_file_object.open(file)
#         if file.name == fname: return file
#     return None
    
# linked_in = fetch_file('test_data_10000.json')

In [0]:
# Mount your Google Drive to the Colab instance, if you are using Colab.
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

In [0]:
# Open the data locally, whether on Colab or your own machine. Example follows.
myfile = open('/content/drive/My Drive/Colab Notebooks/test_data_10000.json')

In [0]:
%%time
# 10K records from LinkedIn
# If you fetch the zipped data from a URL, use 'fetch_file' instead:
# linked_in = fetch_file('linkedin_small.json')
# If the data is available locally (on Colab or your own machine), open it directly.
linked_in = open('/content/drive/My Drive/Colab Notebooks/test_data_10000.json')
    
people = []

for line in linked_in:
    # print(line)
    person = json.loads(line)
    people.append(person)
    
people_df = pd.DataFrame(people)
print ("%d records"%len(people_df))

people_df

10000 records
CPU times: user 986 ms, sys: 173 ms, total: 1.16 s
Wall time: 1.19 s


## NoSQL storage

For this part you need to access MongoDB (see the Instructor Notes).  One option is to sign up at:

https://www.mongodb.com/cloud

Click on "Get started", sign up, agree to terms of service, and create a new cluster on AWS free tier (Northern Virginia).  Use this location as 'Y' in the client creation below.

You'll also need to add your IP address to the cluster's access list (so you can talk to the machine) and create a database called 'linkedin'.

In [0]:
# Store some number of records (here LIMIT = 10K) both in MongoDB and in an in-memory list

START = 0
LIMIT = 10000

client = MongoClient('mongodb+srv://Y')
# client = MongoClient('mongodb+srv://leshangc:<password>@cluster0-vpplx.mongodb.net/test?retryWrites=true&w=majority')
# client = MongoClient('mongodb+srv://cis545:1course4all@cluster0-cy1yu.mongodb.net/test?retryWrites=true&w=majority')

linkedin_db = client['linkedin']
# linked_in = fetch_file('test_data_10000.json')
linked_in = open('/content/drive/My Drive/Colab Notebooks/test_data_10000.json')

# Build a list of the JSON elements
list_for_comparison = []

people = 0
for line in linked_in:
    person = json.loads(line)
    if people >= START:
        try:
            list_for_comparison.append(person)
            linkedin_db.posts.insert_one(person)
        except DuplicateKeyError:
            pass
        except OperationFailure:
            # If the above still uses our cluster, you'll get this error in
            # attempting to write to our MongoDB client
            pass
    people = people + 1
    # if(people % 1000 == 0): 
    #   print (people)
    if people >= LIMIT:
        break

In [0]:
# Two ways of looking up skills, one based on an in-memory
# list, one based on MongoDB queries

def find_skills_in_list(skill):
    for post in list_for_comparison:
        if 'skills' in post:
            skills = post['skills']
            for this_skill in skills:
                if this_skill == skill:
                    return post
    return None

def find_skills_in_mongodb(skill):
    return linkedin_db.posts.find_one({'skills': skill})

In [0]:
%%time
find_skills_in_list('Marketing')

CPU times: user 23 ms, sys: 0 ns, total: 23 ms
Wall time: 23.2 ms


In [0]:
%%time
find_skills_in_mongodb('Marketing')

CPU times: user 2.39 ms, sys: 45 µs, total: 2.43 ms
Wall time: 64.9 ms


{'_id': 'in-01011985',
 'also_view': [{'id': 'pub-murli-shukla-13-b68-27a',
   'url': 'http://in.linkedin.com/pub/murli-shukla/13/b68/27a'},
  {'id': 'pub-sumeet-mehta-20-18-736',
   'url': 'http://in.linkedin.com/pub/sumeet-mehta/20/18/736'},
  {'id': 'pub-killol-bhatt-5-2ab-96',
   'url': 'http://ke.linkedin.com/pub/killol-bhatt/5/2ab/96'},
  {'id': 'pub-vijay-javiya-16-239-70b',
   'url': 'http://in.linkedin.com/pub/vijay-javiya/16/239/70b'},
  {'id': 'in-skdash1969', 'url': 'http://in.linkedin.com/in/skdash1969'},
  {'id': 'in-umarshervani', 'url': 'http://in.linkedin.com/in/umarshervani'},
  {'id': 'in-mehtanilesh', 'url': 'http://in.linkedin.com/in/mehtanilesh'},
  {'id': 'pub-amit-dwivedi-a-3a3-968',
   'url': 'http://in.linkedin.com/pub/amit-dwivedi/a/3a3/968'},
  {'id': 'pub-pritesh-patel-12-604-a23',
   'url': 'http://in.linkedin.com/pub/pritesh-patel/12/604/a23'},
  {'id': 'pub-moumita-chakraborty-13-538-52',
   'url': 'http://in.linkedin.com/pub/moumita-chakraborty/13/538/5

## Designing a relational schema from hierarchical data

Given that we already have a predefined set of fields / attributes / features, we don't need to spend a lot of time defining our table *schemas*, except that we need to unnest data.

* Nested relationships can be captured by creating a second table, which has a **foreign key** pointing to the identifier (key) for the main (parent) table.
* Ordered lists can be captured by encoding an index number or row number.
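
The two rules above can be sketched on a single toy record. This is only an illustration, not the extraction code used in this notebook: the record contents are invented, and the column names `person`, `pos`, and `value` are chosen to match the convention the notebook uses for the foreign key, the list position, and the list element.

```python
import sqlite3

# Hypothetical nested record; only the general shape mirrors the dataset
record = {
    '_id': 'person-a',
    'locality': 'Somewhere',
    'skills': ['Marketing', 'Sales', 'Python'],
}

# Parent row: everything except the nested list
parent = {k: v for k, v in record.items() if k != 'skills'}

# Child rows: one per list element, with a foreign key back to the parent
# ('person') and the element's position in the original list ('pos')
children = [
    {'person': record['_id'], 'pos': pos, 'value': skill}
    for pos, skill in enumerate(record['skills'])
]

# A throwaway in-memory database, separate from linkedin.db
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute('create table people (_id text primary key, locality text)')
demo_conn.execute('create table skills '
                  '(person text references people(_id), pos integer, value text)')
demo_conn.execute('insert into people values (:_id, :locality)', parent)
demo_conn.executemany('insert into skills values (:person, :pos, :value)', children)

# Joining on the foreign key and ordering by pos recovers the original list
rebuilt = [row[0] for row in demo_conn.execute(
    'select value from people join skills on _id = person '
    'where _id = ? order by pos', ('person-a',))]
print(rebuilt)
```

Without the `pos` column, the child table would still record which skills belong to which person, but the original list order would be lost.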

In [0]:
'''
Simple code to pull out data from JSON and load into SQLite
'''
# linked_in = urllib.request.urlopen('X')
# linked_in = fetch_file('linkedin_small.json')
linked_in = open('/content/drive/My Drive/Colab Notebooks/test_data_10000.json')

START = 0
LIMIT = 10000 # Limit the max number of records to be 10K. 

def get_df(rel):
    ret = pd.DataFrame(rel).fillna('')
    for k in ret.keys():
        ret[k] = ret[k].astype(str)
    return ret

def extract_relation(rel, name):
    '''
    Pull out a nested list that has a key, and return it as a list
    of dictionaries suitable for treating as a relation / dataframe
    '''
    # We'll return a list
    ret  = []
    if name in rel:
        ret2 = rel.pop(name)
        try:
            # Try to parse the string as a dictionary
            ret2 = json.loads(ret2.replace('\'','\"'))
        except (AttributeError, ValueError):
            # Not a string, or not parseable as JSON: leave the value as-is
            pass
        
        # If it's a dictionary, add it to our return results after
        # adding a key to the parent
        if isinstance(ret2, dict):
            item = ret2
            item['person'] = rel['_id']
            ret.append(item)
        else:
            # If it's a list, iterate over each item
            index = 0
            for r in ret2:
                item = r
                if not isinstance(item, dict):
                    item = {'person': rel['_id'], 'value': item}
                else:
                    item['person'] = rel['_id']
                    
                # A fix to a typo in the data
                if 'affilition' in item:
                    item['affiliation'] = item.pop('affilition')
                    
                item['pos'] = index
                index = index + 1
                ret.append(item)
    return ret
    

names = []
people = []
groups = []
education = []
skills = []
experience = []
honors = []
also_view = []
events = []


conn = sqlite3.connect('linkedin.db')

lines = []
i = 1
for line in linked_in:
    if i > START + LIMIT:
        break
    elif i >= START:
        person = json.loads(line)

        # By inspection, all of these are nested dictionary or list content
        nam = extract_relation(person, 'name')
        edu = extract_relation(person, 'education')
        grp = extract_relation(person, 'group')
        skl = extract_relation(person, 'skills')
        exp  = extract_relation(person, 'experience')
        hon = extract_relation(person, 'honors')
        als = extract_relation(person, 'also_view')
        eve = extract_relation(person, 'events')
        
        # This doesn't seem relevant and it's the only
        # non-string field that's sometimes null
        if 'interval' in person:
            person.pop('interval')
        
        lines.append(person)
        names.extend(nam)
        education.extend(edu)
        groups.extend(grp)
        skills.extend(skl)
        experience.extend(exp)
        honors.extend(hon)
        also_view.extend(als)
        events.extend(eve)
        
    i = i + 1

people_df = get_df(pd.DataFrame(lines))
names_df = get_df(pd.DataFrame(names))
education_df = get_df(pd.DataFrame(education))
groups_df = get_df(pd.DataFrame(groups))
skills_df = get_df(pd.DataFrame(skills))
experience_df = get_df(pd.DataFrame(experience))
honors_df = get_df(pd.DataFrame(honors))
also_view_df = get_df(pd.DataFrame(also_view))
events_df = get_df(pd.DataFrame(events))

In [0]:
people_df

Unnamed: 0,locality,industry,summary,url,specilities,interests,_id,overview_html,homepage
0,"Eskisehir, Turkey",Animasyon,An experienced general and financial manager i...,ftsycpiuiasaecavdmmkooqckeojdsriefopopyeeat0,Areas Of Practice: Criminal Defense; DUI; Misd...,"Reading books, watching movies and nature trip...",ichsrrdhpxlojntrimsvrbzexeeyi0,,
1,"York, United Kingdom",Pharmaceuticals,1 - Strategic Management of Multi Business Fir...,zramixfzvfpoiysoamdvudwaecragisdfopegjybxdz1,"redacción, comunicación, franquicias, coaching...","Travel, Live Music, Photography, New Technolog...",sxaybhrpeceeanwwqeexnxhclcwhr1,"<dl id=""overview""><dt id=""overview-summary-cur...",
2,Greater Nashville Area,Asigurări,An experienced and motivated ecommerce directo...,ftcisrobxsxkayhotkfvadgjoacjsikbtquekcevpcu2,"Applied microeconomics, econometrics","Sales and business coaching, skiing, triathlon...",cfogybqdnddowlhcamixbpvvxydzs2,"<dl id=""overview""><dt id=""overview-summary-cur...",{'Personal Website': ['http://www.davare.net']}
3,"Madison, Wisconsin Area",Higher Education,Passionate about the internet and the new tech...,xgjwidohcoapijxrzvlpgyiuhdxguzqgpjluejbmyjy3,"DSP (speech, audio, image, video), Detection, ...","Reading Magzines,Newspapers,\nFootabll,Chess,P...",vzcoxvvnuarepgqxmuoqbdduchhmw3,"<dl id=""overview""><dt id=""overview-summary-cur...","{'Blog': ['http://www.DataApprentice.com'], 'C..."
4,"Santa Monica, California",Compagnie aérienne/Aviation,20+ years of consumer product sales and market...,podsgvxoxjtqytutodyxrhsbqephtblnsxbtgrsezgm4,"Territory Management (Sales), Customer Service...","Networks, algorithms\nStrategy board games, Po...",qcocuvfhzactuygqszqlfehfdmzvr4,"<dl id=""overview""><dt id=""overview-summary-cur...",
...,...,...,...,...,...,...,...,...,...
9995,Estonia,Computer Software,Adrian is a results driven technical programme...,vdwsnniwrrqqslmcvuzbklwciyaccgyzcfuyjtblqrb9995,"General Management, Leadership, Marketing","Reading historical fiction, biographies, gen x...",inulhbnjilhretplcplemcpefvpmv9995,"<dl id=""overview""><dt>\nContactos\n</dt>\n<dd ...",{'Old stuff - Bing Music': ['http://www.bing.c...
9996,Seychelles,Cosmetics,I am a self motivated individual with a desire...,qwbkoeytroapckmstbujeesstxetfsdvhqizcfeuazj9996,Specialising in ICT Strategic & Investment Pla...,"Music, Hiking,Bicycling, Wine, Astronomy",mleibnoxuvlaoubbricbrqkajqdit9996,"<dl id=""overview""><dt id=""overview-summary-cur...",{'Personal Website': ['http://bristol.academia...
9997,"Bakersfield, California Area",Real Estate,State of Califorina - Licensed General Buildin...,nhmmiajhufybjbsombsqsrynpdbymqtfgwegdodoaln9997,"Collating and Validating intelligence,building...","Web Development, Python,Django, Google App Eng...",chmibprnchhepzmytjttiylzvhmms9997,"<dl id=""overview""><dt id=""overview-summary-cur...","{'Blog': ['http://blog.adelahmed.com'], 'Perso..."
9998,Greater San Diego Area,Computer Software,1.) Project Management experience in large sca...,kqpdtfcwtsaeoxefbkprbrxjmrhpriayejprlowxgrk9998,While I value my versatility which enables me ...,"Advertising, Web 2.0, Buzz marketing",fbignswrwlvuqlpuyscrxwwntbroh9998,"<dl id=""overview""><dt id=""overview-summary-cur...",{'Instagram': ['http://instagram.com/ajaysampa...


In [0]:
# Save these to the SQLite database

people_df.to_sql('people', conn, if_exists='replace', index=False)
names_df.to_sql('names', conn, if_exists='replace', index=False)
education_df.to_sql('education', conn, if_exists='replace', index=False)
groups_df.to_sql('groups', conn, if_exists='replace', index=False)
skills_df.to_sql('skills', conn, if_exists='replace', index=False)
experience_df.to_sql('experience', conn, if_exists='replace', index=False)
honors_df.to_sql('honors', conn, if_exists='replace', index=False)
also_view_df.to_sql('also_view', conn, if_exists='replace', index=False)
events_df.to_sql('events', conn, if_exists='replace', index=False)

In [0]:
groups_df

Unnamed: 0,affilition,person,member
0,"['Entertainment Ticketing Professionals', 'La ...",sxaybhrpeceeanwwqeexnxhclcwhr1,
1,"[""Denim People's Group"", 'Official FIDM Alumni...",cfogybqdnddowlhcamixbpvvxydzs2,
2,"['BRASIL: VAGAS EXECUTIVAS', 'Best Practices i...",vzcoxvvnuarepgqxmuoqbdduchhmw3,
3,"['A&I - Accoglienza e Integrazione', 'AIDP Ass...",qcocuvfhzactuygqszqlfehfdmzvr4,
4,"['AngloINFO Luxembourg', 'Atlas Consulting SA'...",zclxbuwmpxdfyjqodhgnfsedpdnxw5,Luxembourg Testing Board
...,...,...,...
6331,"['ADR Abruzzo-Vasto', 'ADR MEDIAZIONE E CONCIL...",izzqotaogtikoedukxzwikvhxdjap9992,
6332,"['UST Class of 2010', 'University of St. Thoma...",rqyomndkmyrqpbgxwndffoxetbdvu9994,
6333,"['CISCO', 'CISCO CERTIFICATIONS', 'CISCO CERTI...",inulhbnjilhretplcplemcpefvpmv9995,"NJHIMSS, DVHIMSS, HIMSS"
6334,"['Account Manager Group', 'Alumni Universidad ...",mleibnoxuvlaoubbricbrqkajqdit9996,


In [0]:
pd.read_sql_query('select _id, org from people join experience on _id=person', conn)

Unnamed: 0,_id,org
0,ichsrrdhpxlojntrimsvrbzexeeyi0,AstraZeneca
1,ichsrrdhpxlojntrimsvrbzexeeyi0,Barclays Bank
2,ichsrrdhpxlojntrimsvrbzexeeyi0,Enterprise Plc
3,ichsrrdhpxlojntrimsvrbzexeeyi0,The Stroll Group
4,ichsrrdhpxlojntrimsvrbzexeeyi0,UBS
...,...,...
46060,fbignswrwlvuqlpuyscrxwwntbroh9998,HCL Technologies
46061,fbignswrwlvuqlpuyscrxwwntbroh9998,Skilrock Technologies
46062,rtsrqeqecuidakdpkdaxjbyokgiae9999,Brocade
46063,rtsrqeqecuidakdpkdaxjbyokgiae9999,Calsoft


In [0]:
pd.read_sql_query("select _id, group_concat(org) as experience " +\
                  " from people left join experience on _id=person group by _id", conn)

Unnamed: 0,_id,experience
0,aaalwqmfowcfucbvaxjkoikrlpcvc5074,"L&T Infotech,Tenth Planet,XLabz Technologies"
1,aacellsxfgzptvksqjoytpyphqjkl8187,"Partners for Marketing,SECO Financial"
2,aacvsrpfbzooiadlyewesddlwqpzl8612,"Ben's Heart Foundation,eCarList,txGarage"
3,aaczxilpkjoqfzmqlwipsupmblxoy3406,"K2 Partnering Solutions,Pizza express,SThree plc"
4,aadphcbqlygoozpuurwodjzsuywgk5685,Deloitte
...,...,...
9995,zzgyiojcowmmdqomjhimanonuvtjn2363,"Euro RSCG,Euro RSCG 4D Digital / Portland,Hawt..."
9996,zzmaecuuosawidnbuqadgiauceurc6319,"ARCADIS,Eclipsys,P. Almeida Distrib.,Ronin, LL..."
9997,zztglyvgdgaymjjthfpqmatxxdgtr955,"Mahle Stuttgart,Performance Training & Consult..."
9998,zzwbyiurtdyynjfdahchbjzysrybh7122,"Guadalupe Center of Immokalee,NOISE, Inc.,Pepp..."


## Views

Since we may want to see all the experiences of a person in one place rather than in separate rows, we will create a view in which they are listed as a string (column named experience).  The following code creates this view within the context of a transaction (the code between "begin" and "commit" or "rollback"). If the view already exists, it removes it and creates a new one.

In [0]:
conn.execute('begin transaction')
conn.execute('drop view if exists people_experience')
conn.execute("create view people_experience as select _id, group_concat(org) as experience " +\
                  " from people left join experience on _id=person group by _id")
conn.execute('commit')

# Treat the view as a table, see what's there
pd.read_sql_query('select * from people_experience', conn)

Unnamed: 0,_id,experience
0,aaalwqmfowcfucbvaxjkoikrlpcvc5074,"L&T Infotech,Tenth Planet,XLabz Technologies"
1,aacellsxfgzptvksqjoytpyphqjkl8187,"Partners for Marketing,SECO Financial"
2,aacvsrpfbzooiadlyewesddlwqpzl8612,"Ben's Heart Foundation,eCarList,txGarage"
3,aaczxilpkjoqfzmqlwipsupmblxoy3406,"K2 Partnering Solutions,Pizza express,SThree plc"
4,aadphcbqlygoozpuurwodjzsuywgk5685,Deloitte
...,...,...
9995,zzgyiojcowmmdqomjhimanonuvtjn2363,"Euro RSCG,Euro RSCG 4D Digital / Portland,Hawt..."
9996,zzmaecuuosawidnbuqadgiauceurc6319,"ARCADIS,Eclipsys,P. Almeida Distrib.,Ronin, LL..."
9997,zztglyvgdgaymjjthfpqmatxxdgtr955,"Mahle Stuttgart,Performance Training & Consult..."
9998,zzwbyiurtdyynjfdahchbjzysrybh7122,"Guadalupe Center of Immokalee,NOISE, Inc.,Pepp..."
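
To see what "commit" versus "rollback" means for the view, here is a small self-contained sketch on a throwaway in-memory database. The table contents are invented, and the connection is opened with `isolation_level=None` (an assumption made for this sketch, not something the notebook does) so that the explicit `begin`, `rollback`, and `commit` statements are the only transaction control in play.

```python
import sqlite3

# Throwaway in-memory database; isolation_level=None leaves transaction
# control entirely to the explicit begin/commit/rollback statements below
demo_conn = sqlite3.connect(':memory:', isolation_level=None)
demo_conn.execute('create table people (_id text)')
demo_conn.execute('create table experience (person text, org text)')
demo_conn.executemany('insert into people values (?)', [('p1',), ('p2',)])
demo_conn.executemany('insert into experience values (?, ?)',
                      [('p1', 'Acme'), ('p1', 'Globex')])

view_sql = ('create view people_experience as '
            'select _id, group_concat(org) as experience '
            'from people left join experience on _id = person group by _id')

# First attempt: create the view, then roll back, so nothing is kept
demo_conn.execute('begin transaction')
demo_conn.execute(view_sql)
demo_conn.execute('rollback')
views_after_rollback = demo_conn.execute(
    "select name from sqlite_master where type = 'view'").fetchall()

# Second attempt: create the view and commit, so it persists
demo_conn.execute('begin transaction')
demo_conn.execute('drop view if exists people_experience')
demo_conn.execute(view_sql)
demo_conn.execute('commit')

# The left join keeps p2 even though it has no experience rows;
# group_concat over no non-null values yields NULL (None in Python)
view_rows = demo_conn.execute(
    'select * from people_experience order by _id').fetchall()
print(views_after_rollback)
print(view_rows)
```

Note that SQLite does not guarantee the order in which `group_concat` joins its elements unless an ordering is imposed, so the concatenated string should not be relied on to preserve the `pos` order.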
