# Notebook 10: Update the :ID Fields of All NODE and RELATION Files for Export to Neo4j

#### This notebook writes the following files to the _final_neo4j_files_ folder:
```
(OCCUPATION) NODE					occupation__node.csv
occupation_id:ID
onet_code
occupation_title
occupation_synonyms[]
occupation_description
occupation_salary
:LABEL = "OCCUPATION"

[BELONGS_TO] RELATION					belongs_to__relation.csv
:START_ID = listing_id
:END_ID = occupation_id
:TYPE = "BELONGS_TO"

(LISTING) NODE						listing__node.csv
listing_id:ID
listing_title
description
:LABEL = "LISTING"

[NEEDS] RELATION					needs__relation.csv
:START_ID = listing_id
:END_ID = skill_id
:TYPE = "NEEDS"

(SKILL) NODE						skill__node.csv
skill_id:ID
skill_name
aliases[]
:LABEL = "SKILL"

[TEACHES] RELATION					teaches__relation.csv
:START_ID = course_id
:END_ID = skill_id
:TYPE = "TEACHES"

(COURSE) NODE						course__node.csv
course_id:ID
course_name
course_difficulty_level
course_url
:LABEL = "COURSE"

[LOCATED_IN] RELATION					located_in__relation.csv
:START_ID = listing_id
:END_ID = location_id
:TYPE = "LOCATED_IN"

(LOCATION) NODE						location__node.csv
location_id:ID
location_name
:LABEL = "LOCATION"

[POSTED] RELATION					posted__relation.csv
:START_ID = company_id
:END_ID = listing_id
:TYPE = "POSTED"

(COMPANY) NODE						company__node.csv
company_id:ID
company_name
:LABEL = "COMPANY"


[HAS_FUTURE] RELATION					has_future__relation.csv
:START_ID = occupation_id
:END_ID = career_outlook_id
:TYPE = "HAS_FUTURE"



(CAREER_OUTLOOK) NODE					career_outlook__node.csv
career_outlook_id:ID
career_outlook
:LABEL = "CAREER_OUTLOOK"

```
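
The layout above implies a simple header contract: each node file has exactly one `:ID` column plus `:LABEL`, and each relation file has `:START_ID`, `:END_ID`, and `:TYPE`. The following is a minimal sketch of a header check under that assumption; `check_import_headers` and the toy frames are hypothetical names introduced here, not part of the notebook.

```python
import pandas as pd

def check_import_headers(df, kind):
    """Return a list of header problems for a node or relation frame."""
    cols = list(df.columns)
    problems = []
    if kind == "node":
        # Node files are expected to carry exactly one '<name>:ID' column.
        id_cols = [c for c in cols if c.endswith(":ID")]
        if len(id_cols) != 1:
            problems.append(f"expected exactly one ':ID' column, found {id_cols}")
        if ":LABEL" not in cols:
            problems.append("missing ':LABEL' column")
    elif kind == "relation":
        for required in (":START_ID", ":END_ID", ":TYPE"):
            if required not in cols:
                problems.append(f"missing '{required}' column")
    else:
        raise ValueError("kind must be 'node' or 'relation'")
    return problems

# Toy frames standing in for the real CSV files (illustrative data only).
good_node = pd.DataFrame({"skill_id:ID": [0], "skill_name": ["lan"], ":LABEL": ["SKILL"]})
bad_node = pd.DataFrame({"career_outlook_id:ID": [0], "career_outlook": ["Bright"]})
good_rel = pd.DataFrame({":START_ID": [0], ":END_ID": [1], ":TYPE": ["NEEDS"]})

print(check_import_headers(good_node, "node"))
print(check_import_headers(bad_node, "node"))
print(check_import_headers(good_rel, "relation"))
```

A check like this would flag a node file that is still missing its `:LABEL` column before it is written to the final folder.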

In [None]:
import pandas as pd
import ast

In [None]:
# This cell supports running the notebook both locally and in Google Colab.

mydrive = ""  # local run: paths are relative to the working directory

try:
    from google.colab import drive  # only importable inside Colab
    drive.mount('/content/drive')
    # In Colab, keep exactly one of the following paths active:
    # mydrive = "/content/drive/MyDrive/DSE 203 — etl/DSE203_Project/"  # Leslie
    mydrive = "/content/drive/MyDrive/DSE203_Project/"  # Sergey
except ImportError:
    pass  # not running in Colab, keep the local default

input_dir = mydrive+"input_datasets/"
output_dir = mydrive+"output_datasets/"
temp_dir = mydrive+"temp_datasets/"
final_neo4j_dir = mydrive+"final_neo4j_files/"

Mounted at /content/drive


## Prepare (COURSE)->[TEACHES]->(SKILL)

In [None]:
course_df = pd.read_csv(output_dir+'course__node.csv')
skill_df = pd.read_csv(output_dir+'skill__node.csv')
teaches_df = pd.read_csv(output_dir+'teaches__relation.csv')
course_df.head(10)

Unnamed: 0,course_id:ID,course_name,course_difficulty_level,course_url,:LABEL
0,0,Write A Feature Length Screenplay For Film Or ...,Beginner,https://www.coursera.org/learn/write-a-feature...,COURSE
1,1,Business Strategy Business Model Canvas Analys...,Beginner,https://www.coursera.org/learn/canvas-analysis...,COURSE
2,2,Silicon Thin Film Solar Cells,Advanced,https://www.coursera.org/learn/silicon-thin-fi...,COURSE
3,3,Finance for Managers,Intermediate,https://www.coursera.org/learn/operational-fin...,COURSE
4,4,Retrieve Data using SingleTable SQL Queries,Beginner,https://www.coursera.org/learn/single-table-sq...,COURSE
5,5,Building Test Automation Framework using Selen...,Beginner,https://www.coursera.org/learn/building-test-a...,COURSE
6,6,Doing Business in China Capstone,Advanced,https://www.coursera.org/learn/doing-business-...,COURSE
7,7,"Programming Languages, Part A",Intermediate,https://www.coursera.org/learn/programming-lan...,COURSE
8,8,The Roles and Responsibilities of Nonprofit Bo...,Intermediate,https://www.coursera.org/learn/nonprofit-gov-2,COURSE
9,9,Business Russian Communication. Part,Intermediate,https://www.coursera.org/learn/business-russia...,COURSE


In [None]:
skill_df.head(3)

Unnamed: 0,skill_id:ID,skill_name,aliases[],:LABEL
0,0,ecommerceretail qa,ecommerceretail qa,SKILL
1,1,lan,lan,SKILL
2,2,peoplesoft,peoplesoft,SKILL


In [None]:
teaches_df.head(3)

Unnamed: 0,:START_ID,:END_ID,:TYPE
0,0,20261,TEACHES
1,330,20261,TEACHES
2,1906,20261,TEACHES


#### SKILL IDs must be shifted so they don't overlap with COURSE IDs, and the relations that reference them must be updated to match

In [None]:
last_node_course = course_df['course_id:ID'].max()
next_node_skill = last_node_course + 1
next_node_skill

3522

In [None]:
# update node ids of skills in (SKILL)
skill_df['skill_id:ID'] = skill_df['skill_id:ID'] + next_node_skill
skill_df

Unnamed: 0,skill_id:ID,skill_name,aliases[],:LABEL
0,3522,ecommerceretail qa,ecommerceretail qa,SKILL
1,3523,lan,lan,SKILL
2,3524,peoplesoft,peoplesoft,SKILL
3,3525,bourne shell scripting,bourne shell scripting,SKILL
4,3526,groovy,groovy,SKILL
...,...,...,...,...
29418,32940,nosqldatabase,nosqldatabase,SKILL
29419,32941,programmingdevelopment,programmingdevelopment;program development,SKILL
29420,32942,programming on win xp788.1,programming on win xp788.1,SKILL
29421,32943,skills win32 programming expertcc++ programming,skills win32 programming expertcc++ programming,SKILL


In [None]:
# update node ids of skills in [TEACHES]
teaches_df[":END_ID"] = teaches_df[":END_ID"] + next_node_skill
teaches_df.tail(10)

Unnamed: 0,:START_ID,:END_ID,:TYPE
37359,3515,4029,TEACHES
37360,3515,4156,TEACHES
37361,3515,9194,TEACHES
37362,3515,17763,TEACHES
37363,3515,32251,TEACHES
37364,3516,27352,TEACHES
37365,3516,5119,TEACHES
37366,3516,3547,TEACHES
37367,3516,6648,TEACHES
37368,3518,27949,TEACHES
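
The same shift has to be applied to the node file and to every relation column that points at those nodes. A minimal sketch of that pattern on toy data follows; `shift_ids` is a hypothetical helper, and the three small frames are made up for illustration.

```python
import pandas as pd

def shift_ids(df, columns, offset):
    """Return a copy of df with `offset` added to each listed ID column."""
    out = df.copy()
    for col in columns:
        out[col] = out[col] + offset
    return out

# Toy stand-ins for course__node.csv, skill__node.csv and teaches__relation.csv.
courses = pd.DataFrame({"course_id:ID": [0, 1, 2]})
skills = pd.DataFrame({"skill_id:ID": [0, 1]})
teaches = pd.DataFrame({":START_ID": [0, 2], ":END_ID": [1, 0], ":TYPE": ["TEACHES"] * 2})

# Skills start right after the last course ID.
offset = int(courses["course_id:ID"].max()) + 1

skills_shifted = shift_ids(skills, ["skill_id:ID"], offset)
# Only :END_ID points at skills; :START_ID points at courses and stays as it is.
teaches_shifted = shift_ids(teaches, [":END_ID"], offset)

print(skills_shifted)
print(teaches_shifted)
```

Returning a copy rather than modifying in place keeps the original frames available, which makes it harder to apply the same offset twice by re-running a cell.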


In [None]:
# save updated versions
course_df.sort_values('course_id:ID').to_csv(final_neo4j_dir+'course__node.csv', index=False)
teaches_df.sort_values(':START_ID').drop_duplicates().to_csv(final_neo4j_dir+'teaches__relation.csv', index=False)
skill_df.sort_values('skill_id:ID').to_csv(final_neo4j_dir+'skill__node.csv', index=False)

In [None]:
last_node_skill = skill_df['skill_id:ID'].max()
next_node_listing = last_node_skill + 1
next_node_listing

32945
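
Each section of this notebook starts its IDs at the previous maximum plus one. If the local IDs in every node file are contiguous and start at 0 (an assumption, not something the notebook verifies), the same offsets can be computed as a running total of node counts. In the sketch below, `cumulative_offsets` is a hypothetical helper and the counts are inferred from the offsets printed in this notebook under that assumption.

```python
def cumulative_offsets(sizes):
    """Map each node type to its first global ID, given ordered (name, count) pairs."""
    offsets = {}
    start = 0
    for name, count in sizes:
        offsets[name] = start
        start += count
    return offsets, start

# Counts inferred from the printed offsets, assuming contiguous local IDs from 0.
sizes = [
    ("course", 3522),
    ("skill", 29423),
    ("listing", 16268),
    ("location", 1404),
    ("company", 3817),
    ("occupation", 1016),
    ("career_outlook", 3),
]

offsets, next_free = cumulative_offsets(sizes)
for name, start in offsets.items():
    print(name, start)
print("next free ID:", next_free)
```

Using `max() + 1`, as the notebook does, also works when local IDs have gaps, so this variant is mainly useful as a cross-check.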

## Prepare (LISTING)->[NEEDS]->(SKILL)

In [None]:
listing_df = pd.read_csv(output_dir+'listing__node.csv')
needs_df = pd.read_csv(output_dir+'needs__relation.csv')
listing_df.head(10)

Unnamed: 0,listing_id:ID,listing_title,description,:LABEL
0,0,AUTOMATION TEST ENGINEER,Looking for Selenium engineers. must have soli...,LISTING
1,1,Information Security Engineer,The University of Chicago has a rapidly growin...,LISTING
2,2,Business Solutions Architect,"GalaxE.SolutionsEvery day, our solutions affec...",LISTING
3,3,"Java Developer (mid level)- FT- GREAT culture,...","Java DeveloperFulltimedirecthireBolingbrook, I...",LISTING
4,4,DevOps Engineer,Midtown based high tech firm has an immediate ...,LISTING
5,5,SAP FICO Architect,We are looking for a Senior SAP FICO Architect...,LISTING
6,6,Network Engineer,Network Engineer Job Description A Network Eng...,LISTING
7,7,Sr. Web Application Developer (Cloud Team) - C...,Bluebeam is looking for talented sr. web devel...,LISTING
8,8,Front End Developer,This is a fulltime position for a Javascript d...,LISTING
9,9,Application Support Engineer,SummaryOur client is the leading provider of o...,LISTING


In [None]:
needs_df.head(10)

Unnamed: 0,:START_ID,:END_ID,:TYPE
0,0,0,NEEDS
1,0,1,NEEDS
2,299,1,NEEDS
3,310,1,NEEDS
4,491,1,NEEDS
5,756,1,NEEDS
6,912,1,NEEDS
7,921,1,NEEDS
8,1247,1,NEEDS
9,1314,1,NEEDS


In [None]:
listing_df['listing_id:ID'] = listing_df['listing_id:ID'] + next_node_listing
listing_df.head(10)

Unnamed: 0,listing_id:ID,listing_title,description,:LABEL
0,32945,AUTOMATION TEST ENGINEER,Looking for Selenium engineers. must have soli...,LISTING
1,32946,Information Security Engineer,The University of Chicago has a rapidly growin...,LISTING
2,32947,Business Solutions Architect,"GalaxE.SolutionsEvery day, our solutions affec...",LISTING
3,32948,"Java Developer (mid level)- FT- GREAT culture,...","Java DeveloperFulltimedirecthireBolingbrook, I...",LISTING
4,32949,DevOps Engineer,Midtown based high tech firm has an immediate ...,LISTING
5,32950,SAP FICO Architect,We are looking for a Senior SAP FICO Architect...,LISTING
6,32951,Network Engineer,Network Engineer Job Description A Network Eng...,LISTING
7,32952,Sr. Web Application Developer (Cloud Team) - C...,Bluebeam is looking for talented sr. web devel...,LISTING
8,32953,Front End Developer,This is a fulltime position for a Javascript d...,LISTING
9,32954,Application Support Engineer,SummaryOur client is the leading provider of o...,LISTING


In [None]:
needs_df[':START_ID'] = needs_df[':START_ID'] + next_node_listing
needs_df[':END_ID'] = needs_df[':END_ID'] + next_node_skill
needs_df

Unnamed: 0,:START_ID,:END_ID,:TYPE
0,32945,3522,NEEDS
1,32945,3523,NEEDS
2,33244,3523,NEEDS
3,33255,3523,NEEDS
4,33436,3523,NEEDS
...,...,...,...
120478,49211,32940,NEEDS
120479,49212,32941,NEEDS
120480,49212,32942,NEEDS
120481,49212,32943,NEEDS


In [None]:
listing_df.sort_values('listing_id:ID').to_csv(final_neo4j_dir+'listing__node.csv', index=False)
needs_df.sort_values(':START_ID').drop_duplicates().to_csv(final_neo4j_dir+'needs__relation.csv', index=False)
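
Because listings and skills are shifted by different offsets, it is easy to apply the wrong one to a relation column. A possible sanity check before saving is to confirm that every relation endpoint exists in the corresponding node file. This is a sketch on toy data; `dangling_endpoints` is a hypothetical helper.

```python
import pandas as pd

def dangling_endpoints(rel_df, start_ids, end_ids):
    """Return the rows whose :START_ID or :END_ID has no matching node ID."""
    ok_start = rel_df[":START_ID"].isin(start_ids)
    ok_end = rel_df[":END_ID"].isin(end_ids)
    return rel_df[~(ok_start & ok_end)]

# Toy stand-ins for the shifted listing, skill and needs frames.
listings = pd.DataFrame({"listing_id:ID": [10, 11]})
skills = pd.DataFrame({"skill_id:ID": [3, 4]})
needs = pd.DataFrame({
    ":START_ID": [10, 11, 12],  # 12 does not exist among the listings
    ":END_ID": [3, 4, 3],
    ":TYPE": ["NEEDS"] * 3,
})

bad = dangling_endpoints(needs, listings["listing_id:ID"], skills["skill_id:ID"])
print(bad)
```

An empty result means every relation row connects two nodes that are actually present.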

In [None]:
last_node_listing = listing_df['listing_id:ID'].max()
next_node_location = last_node_listing + 1
next_node_location

49213

## Prepare (LISTING)->[LOCATED_IN]->(LOCATION)

In [None]:
location_df = pd.read_csv(output_dir+'location__node.csv')
located_in_df = pd.read_csv(output_dir+'located_in__relation.csv')
location_df.head(10)

Unnamed: 0,location_id:ID,location_name,:LABEL
0,0,"Atlanta, GA",LOCATION
1,1,"Chicago, IL",LOCATION
2,2,"Schaumburg, IL",LOCATION
3,3,"Bolingbrook, IL",LOCATION
4,4,"New York, NY",LOCATION
5,5,"Seattle, WA",LOCATION
6,6,"Highlands Ranch, CO",LOCATION
7,7,"Portland, OR",LOCATION
8,8,"Los Angeles, CA",LOCATION
9,9,"Las Vegas, NV",LOCATION


In [None]:
located_in_df.head(10)

Unnamed: 0,:START_ID,:END_ID,:TYPE
0,0,0,LOCATED_IN
1,4,0,LOCATED_IN
2,6,0,LOCATED_IN
3,26,0,LOCATED_IN
4,4646,0,LOCATED_IN
5,4668,0,LOCATED_IN
6,4699,0,LOCATED_IN
7,4728,0,LOCATED_IN
8,4783,0,LOCATED_IN
9,4785,0,LOCATED_IN


In [None]:
location_df['location_id:ID'] = location_df['location_id:ID'] + next_node_location
location_df.head(10)

Unnamed: 0,location_id:ID,location_name,:LABEL
0,49213,"Atlanta, GA",LOCATION
1,49214,"Chicago, IL",LOCATION
2,49215,"Schaumburg, IL",LOCATION
3,49216,"Bolingbrook, IL",LOCATION
4,49217,"New York, NY",LOCATION
5,49218,"Seattle, WA",LOCATION
6,49219,"Highlands Ranch, CO",LOCATION
7,49220,"Portland, OR",LOCATION
8,49221,"Los Angeles, CA",LOCATION
9,49222,"Las Vegas, NV",LOCATION


In [None]:
located_in_df[':START_ID'] = located_in_df[':START_ID'] + next_node_listing
located_in_df[':END_ID'] = located_in_df[':END_ID'] + next_node_location
located_in_df

Unnamed: 0,:START_ID,:END_ID,:TYPE
0,32945,49213,LOCATED_IN
1,32949,49213,LOCATED_IN
2,32951,49213,LOCATED_IN
3,32971,49213,LOCATED_IN
4,37591,49213,LOCATED_IN
...,...,...,...
16263,48393,50612,LOCATED_IN
16264,48478,50613,LOCATED_IN
16265,48687,50614,LOCATED_IN
16266,48784,50615,LOCATED_IN


In [None]:
location_df.sort_values('location_id:ID').to_csv(final_neo4j_dir+'location__node.csv', index=False)
located_in_df.sort_values(':START_ID').drop_duplicates().to_csv(final_neo4j_dir+'located_in__relation.csv', index=False)

In [None]:
last_node_location = location_df['location_id:ID'].max()
next_node_company = last_node_location + 1
next_node_company

50617

## Prepare (COMPANY)->[POSTED]->(LISTING)

In [None]:
company_df = pd.read_csv(output_dir+'company__node.csv')
posted_df = pd.read_csv(output_dir+'posted__relation.csv')
company_df.head(10)

Unnamed: 0,company_id:ID,company_name,:LABEL
0,0,"Digital Intelligence Systems, LLC",COMPANY
1,1,University of Chicago/IT Services,COMPANY
2,2,"Galaxy Systems, Inc.",COMPANY
3,3,TransTech LLC,COMPANY
4,4,Matrix Resources,COMPANY
5,5,Yash Technologies,COMPANY
6,6,Noble1,COMPANY
7,7,"Bluebeam Software, Inc.",COMPANY
8,8,Genesis10,COMPANY
9,9,"VanderHouwen & Associates, Inc.",COMPANY


In [None]:
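# keep only the import columns; .copy() makes the later ID updates act on an independent frame
posted_df = posted_df[[':START_ID', ':END_ID', ':TYPE']].copy()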
posted_df.head(10)

Unnamed: 0,:START_ID,:END_ID,:TYPE
0,0,0,POSTED
1,0,283,POSTED
2,0,655,POSTED
3,0,1021,POSTED
4,0,2055,POSTED
5,0,2598,POSTED
6,0,3070,POSTED
7,0,3417,POSTED
8,0,3434,POSTED
9,0,3514,POSTED


In [None]:
company_df['company_id:ID'] = company_df['company_id:ID'] + next_node_company
company_df.head(10)

Unnamed: 0,company_id:ID,company_name,:LABEL
0,50617,"Digital Intelligence Systems, LLC",COMPANY
1,50618,University of Chicago/IT Services,COMPANY
2,50619,"Galaxy Systems, Inc.",COMPANY
3,50620,TransTech LLC,COMPANY
4,50621,Matrix Resources,COMPANY
5,50622,Yash Technologies,COMPANY
6,50623,Noble1,COMPANY
7,50624,"Bluebeam Software, Inc.",COMPANY
8,50625,Genesis10,COMPANY
9,50626,"VanderHouwen & Associates, Inc.",COMPANY


In [None]:
posted_df[':START_ID'] = posted_df[':START_ID'] + next_node_company
posted_df[':END_ID'] = posted_df[':END_ID'] + next_node_listing
posted_df

Unnamed: 0,:START_ID,:END_ID,:TYPE
0,50617,32945,POSTED
1,50617,33228,POSTED
2,50617,33600,POSTED
3,50617,33966,POSTED
4,50617,35000,POSTED
...,...,...,...
16263,54429,49156,POSTED
16264,54430,49180,POSTED
16265,54431,49184,POSTED
16266,54432,49194,POSTED


In [None]:
company_df.sort_values('company_id:ID').to_csv(final_neo4j_dir+'company__node.csv', index=False)
posted_df.sort_values(':START_ID').drop_duplicates().to_csv(final_neo4j_dir+'posted__relation.csv', index=False)

In [None]:
last_node_company = company_df['company_id:ID'].max()
next_node_occupation = last_node_company + 1
next_node_occupation

54434

## Prepare (LISTING)->[BELONGS_TO]->(OCCUPATION)

In [None]:
occupation_df = pd.read_csv(output_dir+'occupation__node.csv')
belongs_to_df = pd.read_csv(output_dir+'belongs_to__relation.csv')
occupation_df.head(10)

Unnamed: 0,occupation_id:ID,onet_code,occupation_title,occupation_synonyms,occupation_description,occupation_salary,:LABEL
0,0,13-2011.00,Accountants and Auditors,"['Accountant', 'Accounting Officer', 'Audit Pa...","Examine, analyze, and interpret accounting rec...",77250.0,OCCUPATION
1,1,27-2011.00,Actors,"['Actor', 'Actress', 'Comedian', 'Comic', 'Com...","Play parts in stage, television, radio, video,...",,OCCUPATION
2,2,15-2011.00,Actuaries,"['Actuarial Analyst', 'Actuarial Associate', '...","Analyze statistical data, such as mortality, a...",105900.0,OCCUPATION
3,3,29-1291.00,Acupuncturists,"['Acupuncture Physician', 'Acupuncture Provide...","Diagnose, treat, and prevent disorders by stim...",60570.0,OCCUPATION
4,4,29-1141.01,Acute Care Nurses,"['Cardiac Interventional Care Nurse', 'Charge ...",Provide advanced nursing care for patients wit...,77600.0,OCCUPATION
5,5,25-2059.01,Adapted Physical Education Specialists,"['Adapted Physical Activity Specialist', 'Adap...",Provide individualized physical education inst...,61720.0,OCCUPATION
6,6,51-9191.00,Adhesive Bonding Machine Operators and Tenders,"['Coater Operator', 'Glue Line Operator', 'Glu...",Operate or tend bonding machines that use adhe...,37630.0,OCCUPATION
7,7,23-1021.00,"Administrative Law Judges, Adjudicators, and H...","['Adjudications Specialist', 'Adjudicator', 'A...",Conduct hearings to recommend or make decision...,102550.0,OCCUPATION
8,8,11-3012.00,Administrative Services Managers,"['Administrative Coordinator', 'Administrative...","Plan, direct, or coordinate one or more admini...",100170.0,OCCUPATION
9,9,25-3011.00,"Adult Basic Education, Adult Secondary Educati...",['Adult Basic Education Instructor (ABE Instru...,Teach or instruct out-of-school youths and adu...,59720.0,OCCUPATION


In [None]:
# Convert occupation_synonyms into a ';'-delimited string so Neo4j can import it as a list
occupation_df['occupation_synonyms'] = occupation_df['occupation_synonyms'].fillna("no synonyms")

# regex=False treats '[' and ']' as literal characters rather than regex syntax
occupation_df['occupation_synonyms'] = (
    occupation_df['occupation_synonyms']
    .str.replace('[', '', regex=False)
    .str.replace(']', '', regex=False)
    .str.replace("'", '', regex=False)
    .str.replace(", ", ';', regex=False)
)

occupation_df.rename(columns={'occupation_synonyms': 'occupation_synonyms[]'}, inplace=True)
occupation_df


Unnamed: 0,occupation_id:ID,onet_code,occupation_title,occupation_synonyms[],occupation_description,occupation_salary,:LABEL
0,0,13-2011.00,Accountants and Auditors,Accountant;Accounting Officer;Audit Partner;Au...,"Examine, analyze, and interpret accounting rec...",77250.0,OCCUPATION
1,1,27-2011.00,Actors,Actor;Actress;Comedian;Comic;Community Theater...,"Play parts in stage, television, radio, video,...",,OCCUPATION
2,2,15-2011.00,Actuaries,Actuarial Analyst;Actuarial Associate;Actuaria...,"Analyze statistical data, such as mortality, a...",105900.0,OCCUPATION
3,3,29-1291.00,Acupuncturists,Acupuncture Physician;Acupuncture Provider;Acu...,"Diagnose, treat, and prevent disorders by stim...",60570.0,OCCUPATION
4,4,29-1141.01,Acute Care Nurses,Cardiac Interventional Care Nurse;Charge Nurse...,Provide advanced nursing care for patients wit...,77600.0,OCCUPATION
...,...,...,...,...,...,...,...
1011,1011,51-7099.00,"Woodworkers, All Other",no synonyms,All woodworkers not listed separately.,,OCCUPATION
1012,1012,51-7042.00,"Woodworking Machine Setters, Operators, and Te...",Boring Machine Operator;Cabinet Maker;Knot Saw...,"Set up, operate, or tend woodworking machines,...",36090.0,OCCUPATION
1013,1013,43-9022.00,Word Processors and Typists,Clerk Specialist;Clerk Typist;Keyboard Special...,"Use word processor, computer, or typewriter to...",44030.0,OCCUPATION
1014,1014,27-3043.00,Writers and Authors,Advertisement Agency Copywriter (Ad Agency Cop...,"Originate and prepare written material, such a...",69510.0,OCCUPATION
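
The string replacements above also turn a `", "` that occurs inside a single synonym into a `;`, and they strip apostrophes from inside words. An alternative sketch, assuming the column holds stringified Python lists as the earlier output suggests, parses each value with `ast.literal_eval` (the notebook imports `ast` but does not use it). `synonyms_to_array` and the sample values are hypothetical.

```python
import ast
import pandas as pd

def synonyms_to_array(value, placeholder="no synonyms"):
    """Turn a stringified Python list into a ';'-joined string."""
    if pd.isna(value):
        return placeholder
    items = ast.literal_eval(value)  # safely parses "['a', 'b']" into a list
    return ";".join(str(item).strip() for item in items)

# Illustrative values: the second entry has a comma inside one synonym.
raw = pd.Series(["['Actor', 'Actress']", "['Clerk, Senior', 'Typist']", None])
converted = raw.map(synonyms_to_array)
print(converted.tolist())
```

Only the list separators become `;` here, so a synonym such as `Clerk, Senior` stays in one piece.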


In [None]:
belongs_to_df.columns = [':START_ID', ':END_ID', ':TYPE']
belongs_to_df

Unnamed: 0,:START_ID,:END_ID,:TYPE
0,0.0,835.0,BELONGS_TO
1,1.0,509.0,BELONGS_TO
2,2.0,183.0,BELONGS_TO
3,4.0,169.0,BELONGS_TO
4,5.0,532.0,BELONGS_TO
...,...,...,...
16263,16246.0,889.0,BELONGS_TO
16264,16250.0,87.0,BELONGS_TO
16265,16254.0,889.0,BELONGS_TO
16266,16264.0,889.0,BELONGS_TO


In [None]:
occupation_df['occupation_id:ID'] = occupation_df['occupation_id:ID'] + next_node_occupation
occupation_df.head(10)

Unnamed: 0,occupation_id:ID,onet_code,occupation_title,occupation_synonyms[],occupation_description,occupation_salary,:LABEL
0,54434,13-2011.00,Accountants and Auditors,Accountant;Accounting Officer;Audit Partner;Au...,"Examine, analyze, and interpret accounting rec...",77250.0,OCCUPATION
1,54435,27-2011.00,Actors,Actor;Actress;Comedian;Comic;Community Theater...,"Play parts in stage, television, radio, video,...",,OCCUPATION
2,54436,15-2011.00,Actuaries,Actuarial Analyst;Actuarial Associate;Actuaria...,"Analyze statistical data, such as mortality, a...",105900.0,OCCUPATION
3,54437,29-1291.00,Acupuncturists,Acupuncture Physician;Acupuncture Provider;Acu...,"Diagnose, treat, and prevent disorders by stim...",60570.0,OCCUPATION
4,54438,29-1141.01,Acute Care Nurses,Cardiac Interventional Care Nurse;Charge Nurse...,Provide advanced nursing care for patients wit...,77600.0,OCCUPATION
5,54439,25-2059.01,Adapted Physical Education Specialists,Adapted Physical Activity Specialist;Adapted P...,Provide individualized physical education inst...,61720.0,OCCUPATION
6,54440,51-9191.00,Adhesive Bonding Machine Operators and Tenders,Coater Operator;Glue Line Operator;Glue Reel O...,Operate or tend bonding machines that use adhe...,37630.0,OCCUPATION
7,54441,23-1021.00,"Administrative Law Judges, Adjudicators, and H...",Adjudications Specialist;Adjudicator;Administr...,Conduct hearings to recommend or make decision...,102550.0,OCCUPATION
8,54442,11-3012.00,Administrative Services Managers,Administrative Coordinator;Administrative Dire...,"Plan, direct, or coordinate one or more admini...",100170.0,OCCUPATION
9,54443,25-3011.00,"Adult Basic Education, Adult Secondary Educati...",Adult Basic Education Instructor (ABE Instruct...,Teach or instruct out-of-school youths and adu...,59720.0,OCCUPATION


In [None]:
belongs_to_df[':START_ID'] = belongs_to_df[':START_ID'] + next_node_listing
belongs_to_df[':END_ID'] = belongs_to_df[':END_ID'] + next_node_occupation
belongs_to_df

Unnamed: 0,:START_ID,:END_ID,:TYPE
0,32945.0,55269.0,BELONGS_TO
1,32946.0,54943.0,BELONGS_TO
2,32947.0,54617.0,BELONGS_TO
3,32949.0,54603.0,BELONGS_TO
4,32950.0,54966.0,BELONGS_TO
...,...,...,...
16263,49191.0,55323.0,BELONGS_TO
16264,49195.0,54521.0,BELONGS_TO
16265,49199.0,55323.0,BELONGS_TO
16266,49209.0,55323.0,BELONGS_TO


In [None]:
belongs_to_df[':START_ID'] = belongs_to_df[':START_ID'].astype(int)
belongs_to_df[':END_ID'] = belongs_to_df[':END_ID'].astype(int)
belongs_to_df

Unnamed: 0,:START_ID,:END_ID,:TYPE
0,32945,55269,BELONGS_TO
1,32946,54943,BELONGS_TO
2,32947,54617,BELONGS_TO
3,32949,54603,BELONGS_TO
4,32950,54966,BELONGS_TO
...,...,...,...
16263,49191,55323,BELONGS_TO
16264,49195,54521,BELONGS_TO
16265,49199,55323,BELONGS_TO
16266,49209,55323,BELONGS_TO
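
The `[BELONGS_TO]` IDs arrive as floats, which pandas typically produces when a column contained missing values at some earlier stage; the notebook does not say whether that is the cause here. A plain `astype(int)` silently truncates fractional values, so a slightly stricter cast can be sketched as follows. `to_int_ids` is a hypothetical helper.

```python
import pandas as pd

def to_int_ids(series):
    """Cast an ID column to int64, failing loudly on missing or fractional values."""
    if series.isna().any():
        raise ValueError(f"{series.name}: contains missing IDs")
    if (series % 1 != 0).any():
        raise ValueError(f"{series.name}: contains non-integer IDs")
    return series.astype("int64")

# Illustrative float IDs like the ones shown above.
ids = pd.Series([32945.0, 32946.0], name=":START_ID")
clean = to_int_ids(ids)
print(clean.tolist(), clean.dtype)
```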


In [None]:
occupation_df.sort_values('occupation_id:ID').to_csv(final_neo4j_dir+'occupation__node.csv', index=False)
belongs_to_df.sort_values(':START_ID').drop_duplicates().to_csv(final_neo4j_dir+'belongs_to__relation.csv', index=False)

In [None]:
last_node_occupation = occupation_df['occupation_id:ID'].max()
next_node_career_outlook = last_node_occupation + 1
next_node_career_outlook

55450

## Prepare (OCCUPATION)->[HAS_FUTURE]->(CAREER_OUTLOOK)

In [92]:
career_outlook_df = pd.read_csv(output_dir+'career_outlook__node.csv')
has_future_df = pd.read_csv(output_dir+'has_future__relation.csv')
career_outlook_df.head(10)

Unnamed: 0,career_outlook_id:ID,career_outlook
0,0,Bright
1,1,Average
2,2,Below Average


In [97]:
# keep only the columns needed for the Neo4j import
has_future_df = has_future_df[[':START_ID', ':END_ID', ':TYPE']]
has_future_df

Unnamed: 0,:START_ID,:END_ID,:TYPE
0,0,0,HAS_FUTURE
1,1,0,HAS_FUTURE
2,2,0,HAS_FUTURE
3,4,0,HAS_FUTURE
4,10,0,HAS_FUTURE
...,...,...,...
918,1006,2,HAS_FUTURE
919,1008,2,HAS_FUTURE
920,1012,2,HAS_FUTURE
921,1013,2,HAS_FUTURE


In [98]:
career_outlook_df['career_outlook_id:ID'] = career_outlook_df['career_outlook_id:ID'] + next_node_career_outlook
career_outlook_df.head(10)

Unnamed: 0,career_outlook_id:ID,career_outlook
0,55450,Bright
1,55451,Average
2,55452,Below Average


In [99]:
# work on an explicit copy so the assignments below don't trigger SettingWithCopyWarning
has_future_df = has_future_df.copy()
has_future_df[':START_ID'] = has_future_df[':START_ID'] + next_node_occupation
has_future_df[':END_ID'] = has_future_df[':END_ID'] + next_node_career_outlook
has_future_df


Unnamed: 0,:START_ID,:END_ID,:TYPE
0,54434,55450,HAS_FUTURE
1,54435,55450,HAS_FUTURE
2,54436,55450,HAS_FUTURE
3,54438,55450,HAS_FUTURE
4,54444,55450,HAS_FUTURE
...,...,...,...
918,55440,55452,HAS_FUTURE
919,55442,55452,HAS_FUTURE
920,55446,55452,HAS_FUTURE
921,55447,55452,HAS_FUTURE


In [102]:
# career_outlook__node.csv has no :LABEL column yet, so add it before saving
career_outlook_df[':LABEL'] = 'CAREER_OUTLOOK'
career_outlook_df.sort_values('career_outlook_id:ID').to_csv(final_neo4j_dir+'career_outlook__node.csv', index=False)
has_future_df.sort_values(':START_ID').drop_duplicates().to_csv(final_neo4j_dir+'has_future__relation.csv', index=False)

In [103]:
last_node_career_outlook = career_outlook_df['career_outlook_id:ID'].max()
next_free_id = last_node_career_outlook + 1  # first ID available for any further node type
next_free_id

55453
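
The point of all the offsets is that no ID appears in more than one node file. A final check along those lines could look like the sketch below; `find_id_collisions` is a hypothetical helper, and the toy frames include a deliberate collision to show what a failure looks like.

```python
import pandas as pd

def find_id_collisions(node_frames):
    """Return {id: [labels]} for every ID that occurs more than once across all frames."""
    seen = {}
    for label, df in node_frames.items():
        # Each node frame is assumed to have exactly one '<name>:ID' column.
        id_col = next(c for c in df.columns if c.endswith(":ID"))
        for node_id in df[id_col]:
            seen.setdefault(int(node_id), []).append(label)
    return {i: labels for i, labels in seen.items() if len(labels) > 1}

# Toy frames; LISTING deliberately reuses ID 3 from SKILL.
frames = {
    "COURSE": pd.DataFrame({"course_id:ID": [0, 1]}),
    "SKILL": pd.DataFrame({"skill_id:ID": [2, 3]}),
    "LISTING": pd.DataFrame({"listing_id:ID": [3, 4]}),
}

collisions = find_id_collisions(frames)
print(collisions)
```

Run against the frames written to the final folder, an empty dictionary would confirm that the ID ranges are disjoint.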