Found on UpWork:
Need someone to download the following file into a Neo4J database, then download and parse the following file, and attach/match it to the hierarchy file (the first file).
If successful, we will keep doing this with other files.
Must have Neo4J experience, must have data parsing and matching experience.
- develop in 100% Python
- Files should reside in S3 or be read from local file
  - could be http(s) as well??
- Neo4j and python development should run in a container
  - DockerFile or devcontainer??
    - Ended up using a single Dockerfile.
      - Devcontainer had problems mounting/working with attached database directory from Neo4J
- should be updatable with new XML data
- set up S3 bucket with no public access. Access will be by AWS "keys" attached to a new IAM user/group
- set up Python dev environment in a dev container with python, s3, and neo4j support
- set up neo4j in the same environment
  - develop data structures for Neo4J access
    - JSON Structure
      - related by `[HAS_CHILD]` between Parent/Child in JSON

            typeId: "{our-generated-type-id}",
            identifier: "1.130",
            label: "\u00a7 1.130 Type II securities; guidelines for obligations issued for university and housing purposes.",
            label_level: "\u00a7 1.130",
            label_description: "Type II securities; guidelines for obligations issued for university and housing purposes.",
            reserved: false,
            type: "section",
            volumes: [
              "1"
            ],
            received_on: "2017-01-07T00:00:00-0500"

    - XML Structure
      - related by `[HAS_XML]` between related JSON node

            typeId: "{our-generated-type-id}",
            elementId: "B",
            xml: "{quote-encoded-xml-as-in-the-original}"

  - read file from s3 into local environment
    - used dotenv to control environment variables (AWS keys injected into the container)
  - parse JSON file recursively and populate Neo4J
    - inserted full node from JSON file less children
      - used existing JSON format
      - children related using `(a:Division)-[r:HAS_CHILD]->(b:Division)`
    - used TypeId to generate unique node IDs
  - parse XML recursively and attach it to Neo4J nodes from JSON data
    - similar approach to JSON/Step 3
      - used TypeID to generate unique node IDs
      - carried "N" identifier into structure
      - xml inserted directly as quote escaped XML string
      - xml nodes related using `(a:Division)-[r:HAS_XML]->(b:XMLData)`
- would be interesting to see if we can do this using AWS cloud resources
  - Use EC2 w/mounted EBS for Neo4J DB
  - Use lambda function tied to S3 bucket(s) for updating JSON/XML files and serverless Neptune
  - Some combo of the above
- adjust Cypher queries to use "best practices"
  - use `MERGE` instead of `CREATE`
  - create `CONSTRAINTS` on `typeId` for uniqueness and value
  - standardize on property creation so each node has the same "primary" key(??)
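
The Cypher "best practices" items above could look something like the sketch below. It assumes Neo4j 5 constraint syntax and a session from the official `neo4j` Python driver; the constraint name `division_type_id` and the helper names `division_params` and `write_division` are illustrative and not taken from the project. Note that `MERGE` on `typeId` is only idempotent if the same `typeId` is reused when a record is loaded again.

```python
"""Sketch: idempotent Cypher for Division nodes (all names are illustrative)."""

# Uniqueness constraint on typeId (Neo4j 5 syntax; server version is assumed).
CREATE_CONSTRAINT = (
    "CREATE CONSTRAINT division_type_id IF NOT EXISTS "
    "FOR (d:Division) REQUIRE d.typeId IS UNIQUE"
)

# MERGE matches on the key only, then sets the remaining properties.
MERGE_DIVISION = "MERGE (d:Division {typeId: $typeId}) SET d += $props"

# Both endpoints must already exist; the relationship itself is merged.
MERGE_HAS_CHILD = (
    "MATCH (p:Division {typeId: $parentId}) "
    "MATCH (c:Division {typeId: $childId}) "
    "MERGE (p)-[:HAS_CHILD]->(c)"
)


def division_params(record: dict) -> dict:
    """Split a JSON record into the MERGE key and the remaining properties."""
    props = {k: v for k, v in record.items() if k not in ("typeId", "children")}
    return {"typeId": record["typeId"], "props": props}


def write_division(session, record: dict) -> None:
    """Run the MERGE through a driver session (hypothetical wiring)."""
    session.run(MERGE_DIVISION, division_params(record))
```

Keeping the key in the `MERGE` pattern and the rest in `SET d += $props` means a re-run updates properties on the existing node rather than creating a second node with slightly different data.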

There are 2 general purpose "nodes" in the graph:

- `Division` nodes represent an individual "record" from the JSON file
  - each `Division` has an additional label that is derived from the `type` property, e.g. "Section", "Chapter", etc
  - each `Division` node contains a `typeId` property that represents a unique ID for the node since one doesn't seem to exist within the existing data structure. The `typeId` property is generated using the "typeid-python" module
  - each `Division` node also contains all the properties from the original JSON record with the exception of `children`
  - Each "child" of a `Division` becomes its own `Division` node, i.e. all `children` records of a `Division` are themselves `Divisions` and related to their parent `Division` by a `HAS_CHILD` relationship.
- `XMLData` nodes represent the collection of elements within an XML `DIVn` element, e.g. `DIV5`. HTML/XML tags with the element name of `DIV` are ignored.
  - each `XMLData` node contains a `typeId` property that represents a unique ID for the node since one doesn't seem to exist within the existing data structure. The `typeId` property is generated using the "typeid-python" module
  - each `XMLData` node also contains all the properties from the original XML `DIVn` element - including text - with the exception of embedded `DIVn` elements
  - each "child" `DIVn` element of a `DIVn` element becomes its own `XMLData` node.
  - Each `XMLData` node is associated with a `Division` node by the following:
    - discover the related parent/child `Division` nodes via the `HAS_CHILD` relationship that contain matching `identifier` properties from the parent/child `DIVn/N` element values
      - i.e. `MATCH (a:Division)-[:HAS_CHILD]->(b:Division) RETURN a.typeId, b.typeId`
    - use the appropriate `typeId` to create a `[:HAS_XML]` relationship between the `Division` and `XMLData` nodes
- The population of the JSON `Division` nodes and `XMLData` nodes is recursive
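
The recursive population of `Division` nodes described above can be sketched roughly as follows. This is a minimal illustration and not the project's code: `new_type_id` uses `uuid4` as a stand-in for the "typeid-python" module, the function names are made up, and the extra label is assumed to be the capitalized `type` value.

```python
import uuid


def new_type_id(prefix: str) -> str:
    """Stand-in ID generator; the project uses the typeid-python module instead."""
    return f"{prefix}_{uuid.uuid4().hex}"


def flatten_divisions(record, parent_id=None, nodes=None, edges=None):
    """Walk a JSON record depth-first, collecting node rows and HAS_CHILD pairs."""
    nodes = [] if nodes is None else nodes
    edges = [] if edges is None else edges

    # Copy every property except the nested children.
    props = {k: v for k, v in record.items() if k != "children"}
    props["typeId"] = new_type_id("division")

    # Extra label derived from the `type` property, e.g. "section" -> "Section".
    label = str(record.get("type", "")).capitalize()
    labels = ["Division"] + ([label] if label else [])
    nodes.append({"labels": labels, "props": props})

    if parent_id is not None:
        # (parent typeId, child typeId) stands for parent -[:HAS_CHILD]-> child.
        edges.append((parent_id, props["typeId"]))

    # Each child record becomes its own Division, linked to this one.
    for child in record.get("children") or []:
        flatten_divisions(child, props["typeId"], nodes, edges)
    return nodes, edges
```

Collecting plain rows first, instead of writing to Neo4j inside the recursion, keeps the tree walk testable without a database; the rows can then be written in batches.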

`(a:Division)-[:HAS_CHILD]->(b:Division)`

`(a:Division)-[:HAS_XML]->(x:XMLData)`

Each `Division` may have multiple `[:HAS_CHILD]` relationships.

Each `Division` may have only 1 `[:HAS_XML]` relationship.
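
One way the `XMLData` split and the one-`HAS_XML`-per-`Division` rule could be expressed is sketched below, using only the standard library. The function names, the record keys (`n`, `parent_n`), and the shape of `division_pairs` are assumptions made for illustration, and `uuid4` again stands in for "typeid-python".

```python
import copy
import re
import uuid
import xml.etree.ElementTree as ET

DIVN = re.compile(r"^DIV\d+$")  # DIV1, DIV5, ...; a bare DIV tag does not match


def split_divn(elem, parent_n=None, out=None):
    """Recursively turn each DIVn element into an XMLData-style record."""
    out = [] if out is None else out
    if DIVN.match(elem.tag):
        own = copy.deepcopy(elem)
        # Drop direct DIVn children (and their tail text); each one
        # becomes its own record in the recursion below.
        for child in list(own):
            if DIVN.match(child.tag):
                own.remove(child)
        out.append({
            "typeId": f"xmldata_{uuid.uuid4().hex}",  # stand-in ID
            "n": elem.get("N"),
            "parent_n": parent_n,
            "xml": ET.tostring(own, encoding="unicode"),
        })
        parent_n = elem.get("N")
    for child in elem:
        split_divn(child, parent_n, out)
    return out


def link_xml(division_pairs, xml_records):
    """Return {Division typeId: XMLData typeId}, at most one per Division.

    division_pairs: (parent identifier, child identifier, child typeId)
    tuples, e.g. built from a HAS_CHILD query (assumed shape).
    """
    by_key = {(p, c): type_id for p, c, type_id in division_pairs}
    links = {}
    for rec in xml_records:
        div_id = by_key.get((rec["parent_n"], rec["n"]))
        if div_id is not None and div_id not in links:
            links[div_id] = rec["typeId"]  # first match wins
    return links
```

Matching on the parent/child pair rather than on the child `N` value alone is meant to avoid linking to the wrong `Division` when the same identifier appears under different parents.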