<font size="6">**APOC Library Updates**</font>

We will look at some functions, procedures and features introduced in the last year or so:

- Export and import of compressed files
- Import binary data
- Read JS-generated HTML
- Read and write with Redis
- Read and write with Apache Arrow
- Detect graph cycles
- apoc.load.directory*

---

### Setup

- Neo4j 5.1 instance
- APOC Core 5.1.0
- APOC Extended 5.1.0 (called APOC Full in 4.x versions)

#### Dataset

- The one created via `:play movies`

#### Notebook setup
- cy2py: to connect neo4j with jupyter
    - cytoscape: graph visualization
    - pandas: table visualization



In [None]:
# table style
import pandas
pandas.set_option('display.max_colwidth', 500)
pandas.set_option('html.use_mathjax', False)


# custom node colors
colors = {
  ':Person': '#fffb00',
  ':CompressedNode': 'red'
}

# custom graph layout
layout = {
    'layout': 'grid', 
    'padding': 100,
    'nodeSpacing': 100
}

# custom node captions (default is :LabelName)
caption = {':CompressedNode': ['name']}

# connect neo4j with jupyter
%reload_ext cy2py

# url and credential
neo4j_url = "bolt://localhost:7688"
neo4j_user = "neo4j"
neo4j_pwd = "apoc"

# check the connection, apply the custom options above and create the dataset
%cypher -u $neo4j_url -us $neo4j_user -pw $neo4j_pwd \
    -co $colors -la $layout -ca $caption \
    call apoc.cypher.runFile('movies.cypher')

# Export and import compressed files

<span style="color:#33f" size="7"> ***Introduced for both APOC Core and Full/Extended in 4.4.0.6*** </span>

All `apoc.export.*` procedures support file compression.

Likewise, all `apoc.import.*` and `apoc.load.*` procedures (except `apoc.load.directory*`) 
can read a compressed file. In both cases the algorithm is selected via the configuration parameter `compression: <ALGO>`.




## normal way

In [None]:
%%cypher

match (n:Person) with collect(n) as people
call apoc.export.csv.data(people, [], "normal.csv", {}) 
yield done return done

## compressed way



In [None]:
%%cypher

match (n:Person) with collect(n) as people
call apoc.export.csv.data(people, [], "compressed.csv.gz", {compression: 'GZIP'})
yield done return done

#### Possible compression algorithms: 

- `NONE` (default)
- `GZIP`
- `BZIP2`
- `DEFLATE`
- `BLOCK_LZ4`
- `FRAMED_SNAPPY`

## import and load compressed

In [12]:
%%cypher
CALL apoc.load.csv('compressed.csv.gz', {compression: 'GZIP'})

Unnamed: 0,lineNo,list,strings,map,stringMap
0,0,"[1, :Person, 1964, Keanu Reeves, , , ]",[],"{'_end': '', '_start': '', 'born': '1964', 'name': 'Keanu Reeves', '_type': '', '_id': '1', '_labels': ':Person'}",{}
1,1,"[2, :Person, 1967, Carrie-Anne Moss, , , ]",[],"{'_end': '', '_start': '', 'born': '1967', 'name': 'Carrie-Anne Moss', '_type': '', '_id': '2', '_labels': ':Person'}",{}
2,2,"[3, :Person, 1961, Laurence Fishburne, , , ]",[],"{'_end': '', '_start': '', 'born': '1961', 'name': 'Laurence Fishburne', '_type': '', '_id': '3', '_labels': ':Person'}",{}
3,3,"[4, :Person, 1960, Hugo Weaving, , , ]",[],"{'_end': '', '_start': '', 'born': '1960', 'name': 'Hugo Weaving', '_type': '', '_id': '4', '_labels': ':Person'}",{}
4,4,"[5, :Person, 1967, Lilly Wachowski, , , ]",[],"{'_end': '', '_start': '', 'born': '1967', 'name': 'Lilly Wachowski', '_type': '', '_id': '5', '_labels': ':Person'}",{}
...,...,...,...,...,...
394,394,"[642, :Person, 1943, Penny Marshall, , , ]",[],"{'_end': '', '_start': '', 'born': '1943', 'name': 'Penny Marshall', '_type': '', '_id': '642', '_labels': ':Person'}",{}
395,395,"[643, :Person, , Paul Blythe, , , ]",[],"{'_end': '', '_start': '', 'born': '', 'name': 'Paul Blythe', '_type': '', '_id': '643', '_labels': ':Person'}",{}
396,396,"[644, :Person, , Angela Scope, , , ]",[],"{'_end': '', '_start': '', 'born': '', 'name': 'Angela Scope', '_type': '', '_id': '644', '_labels': ':Person'}",{}
397,397,"[645, :Person, , Jessica Thompson, , , ]",[],"{'_end': '', '_start': '', 'born': '', 'name': 'Jessica Thompson', '_type': '', '_id': '645', '_labels': ':Person'}",{}


In [None]:
%%cypher
CALL apoc.import.csv(
    [{fileName: 'compressed.csv.gz', labels: ['CompressedNode']}], // nodes
    [], // rels
    {compression: 'GZIP'})

In [None]:
%%cypher
MATCH (n:CompressedNode) RETURN n

In [None]:
%%cypher
MATCH (n:Person) RETURN n

## String compression

<span style="color:#33f" size="7"> ***Introduced in APOC Core, 4.4.0.7*** </span>

We can use the `apoc.util.compress` function to compress a string into a `byte[]`.

Conversely, `apoc.util.decompress` turns a compressed `byte[]` back into a string.


Both accept the same values as the export/import `compression` configuration, but the default is `"GZIP"`.




In [13]:
%%cypher
return apoc.util.compress("name,born\nFoo,1999\nBar,2001")

Unnamed: 0,"apoc.util.compress(""name,born\nFoo,1999\nBar,2001"")"
0,"b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\xff\xcbK\xccM\xd5I\xca/\xca\xe3r\xcb\xcf\xd71\xb4\xb4\xb4\xe4rJ,\xd21200\x04\x00\xd6\x15&\x7f\x1b\x00\x00\x00'"


In [14]:
%%cypher
return apoc.util.compress("name,born\nFoo,1999\nBar,2001", {compression: 'DEFLATE'})

Unnamed: 0,"apoc.util.compress(""name,born\nFoo,1999\nBar,2001"", {compression: 'DEFLATE'})"
0,"b'x\x9c\xcbK\xccM\xd5I\xca/\xca\xe3r\xcb\xcf\xd71\xb4\xb4\xb4\xe4rJ,\xd21200\x04\x00y\xc4\x07\xc3'"


In [None]:
%%cypher

// unlike the export procedures, with compression "NONE" we still get a byte[]: the result of `String.getBytes()`

return apoc.util.compress("name,born\nFoo,1999\nBar,2001", {compression: 'NONE'})
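
The byte arrays shown in the outputs above can be checked on the client side. The following sketch uses Python's `zlib` and `gzip` modules to decode the `DEFLATE` result copied from the output cell and to recognise the format from its first bytes. The `detect_format` helper is a hypothetical name, and the header check is a simplification that only covers these two formats.

```python
import gzip
import zlib

original = b"name,born\nFoo,1999\nBar,2001"

# Bytes copied from the DEFLATE output cell above (a zlib-wrapped stream).
deflate_bytes = (
    b"x\x9c\xcbK\xccM\xd5I\xca/\xca\xe3r\xcb\xcf\xd71\xb4\xb4\xb4\xe4rJ,"
    b"\xd21200\x04\x00y\xc4\x07\xc3"
)

decoded = zlib.decompress(deflate_bytes)


def detect_format(raw):
    """Guess the compression format from the leading bytes (hypothetical helper)."""
    if raw[:2] == b"\x1f\x8b":  # gzip magic number
        return "GZIP"
    if raw[:1] == b"\x78":  # common first byte of a zlib header
        return "DEFLATE"
    return "unknown"


print(decoded == original)
print(detect_format(deflate_bytes))
print(detect_format(gzip.compress(original)))
```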

# Import and load binaries

<span style="color:#33f" size="7"> ***Introduced in both APOC Core and Full/Extended in 4.4.0.6*** </span>

Besides importing a file from a URL, 
we can pass a `byte[]` as a parameter.

This is useful in cloud environments where we cannot store files on the file system, or when we don't want to expose data on the internet.


In [15]:
%%cypher

// transform a string into a `byte[]`
with apoc.util.compress('{"name": "Foo", "born": 2001} {"name": "Bar", "born": 2001}') 
as binaryJson

// read binary
call apoc.load.json(binaryJson, 
                    null, // JsonPath parameter,
                    {compression: 'GZIP'})
yield value return value

Unnamed: 0,value
0,"{'born': 2001, 'name': 'Foo'}"
1,"{'born': 2001, 'name': 'Bar'}"


In [17]:
%%cypher

// With CSV and the DEFLATE algorithm

with apoc.util.compress('name,born\nFoo,1999\nBar,2001', {compression: 'DEFLATE'}) as binaryCsv

// read binary
call apoc.load.csv(binaryCsv,  {compression: 'DEFLATE'})
yield list return list

Unnamed: 0,list
0,"[Foo, 1999]"
1,"[Bar, 2001]"
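
In practice the `byte[]` will often be produced by a client application rather than by `apoc.util.compress`. As a rough sketch, this is how such a payload could be built in Python. The `gzip_json_payload` helper is a hypothetical name, and passing the result to Neo4j as a query parameter (for example `$payload` in `CALL apoc.load.json($payload, null, {compression: 'GZIP'})`) assumes a client driver that maps Python bytes to a Cypher byte array.

```python
import gzip
import json


def gzip_json_payload(records):
    """Serialise records as space-separated JSON objects and gzip the result."""
    text = " ".join(json.dumps(record) for record in records)
    return gzip.compress(text.encode("utf-8"))


payload = gzip_json_payload(
    [{"name": "Foo", "born": 2001}, {"name": "Bar", "born": 2001}]
)

# The payload round-trips to the same text used in the Cypher example above.
print(gzip.decompress(payload).decode("utf-8"))
```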


# Apache Arrow

<span style="color:#33f" size="7"> ***Introduced in APOC Core 4.4.0.4*** </span>

Apache Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations.

This makes it useful for interoperability with other frameworks such as Spark and Kafka.

todo - CONFIG, batchSize

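### Export procedures

- `apoc.export.arrow.all`
- `apoc.export.arrow.graph`
- `apoc.export.arrow.query`

### Export stream procedures:

Similar to e.g. `apoc.export.csv.all(null, {stream: true, compression: '<ALGO>'})`:
they stream the result as `byte[]` values, one per batch.
- `apoc.export.arrow.stream.all`
- `apoc.export.arrow.stream.graph`
- `apoc.export.arrow.stream.query`

    

### Load procedures:
    
- `apoc.load.arrow`

### Load stream procedures:
`apoc.load.arrow.stream` reads an Arrow `byte[]` and returns a map for each row.
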

In [None]:
%%cypher

// export

CALL apoc.export.arrow.query('query_test.arrow', "MATCH (n:Person) RETURN n")


In [None]:
%%cypher

CALL apoc.load.arrow('query_test.arrow')

In [None]:
%%cypher
// export stream

CALL apoc.export.arrow.stream.all()

In [None]:
%%cypher

// roundtrip export-load

CALL apoc.export.arrow.stream.all() YIELD value WITH value as byteArray
CALL apoc.load.arrow.stream(byteArray) YIELD value RETURN value

<hr style="border:1px solid #ccc"> 

# Load html with js generated code


By default, the `apoc.load.html` procedure leverages the [jsoup](https://jsoup.org/) library to parse the HTML file.



However, with the following HTML we cannot read the JS-generated content (i.e. the `p` tag created by the script):
```
...
<body>
	<div id="addStuff"></div>

	<script type="text/javascript">
		const newTag = document.createElement("p");
		newTag.innerText = "This is a new tag";
		document.getElementById("addStuff").appendChild(newTag);
	</script>
</body>
...
```
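
To illustrate why a static parser misses the generated tag, the sketch below feeds a similar page to Python's built-in `html.parser`, which here stands in for jsoup (an assumption made for illustration, the two libraries are not equivalent). The `ParagraphCollector` class and the sample page are hypothetical. The parser never runs the script, so it only sees the paragraph that is physically present in the markup.

```python
from html.parser import HTMLParser

# Hypothetical sample page, modelled on the snippet above.
PAGE = """
<body>
    <div id="addStuff"></div>
    <script type="text/javascript">
        const newTag = document.createElement("p");
        newTag.innerText = "This is a new tag";
        document.getElementById("addStuff").appendChild(newTag);
    </script>
    <p>my paragraph</p>
</body>
"""


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> tag found in the static markup."""

    def __init__(self):
        super().__init__()
        self._inside_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._inside_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._inside_p = False

    def handle_data(self, data):
        # Script content arrives here as plain text and is ignored.
        if self._inside_p:
            self.paragraphs[-1] += data


collector = ParagraphCollector()
collector.feed(PAGE)
collector.close()
print(collector.paragraphs)  # ['my paragraph']
```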

To handle these cases, we can leverage the [Selenium WebDriver](https://www.selenium.dev/),
a tool for automating browsers (mostly for testing purposes).

With it, we can open a browser in headless mode, i.e. without a graphical interface, and let it execute the JS inside the HTML file.

So, unlike jsoup, it does more than just parse.


To do this, we pass the option `{browser: "CHROME"}` or `{browser: "FIREFOX"}` in `$config`,
so that the HTML is read together with the JS-generated content.


#### Note
```
To use this option we need to download an optional jar,
https://github.com/neo4j-contrib/neo4j-apoc-procedures/releases/download/<APOC_VERSION>/apoc-selenium-dependencies-<APOC_VERSION>-all.jar,
and put it in the `plugins` folder.

For example, with APOC 5.1.0: `https://github.com/neo4j-contrib/neo4j-apoc-procedures/releases/download/5.1.0/apoc-selenium-dependencies-5.1.0-all.jar`.

```


#### Cons: 

- It relies on an installed browser (Chrome or Firefox), so it is slower.
- It requires additional jars. 

So use it only when needed, not for HTML that we know to be static.






In [18]:
%%cypher

// file with the above js code

CALL apoc.load.html("wikipediaWithJs.html", {newNode: 'p'}, {browser: 'CHROME'})

Unnamed: 0,value
0,"{'newNode': [{'text': 'This is a new tag', 'tagName': 'p'}, {'text': 'my paragraph', 'tagName': 'p'}]}"


In [19]:
%%cypher

// default way

CALL apoc.load.html("wikipediaWithJs.html", {newNode: 'p'}, {})

Unnamed: 0,value
0,"{'newNode': [{'text': 'my paragraph', 'tagName': 'p'}]}"


Additionally, when `browser` is `CHROME` or `FIREFOX`, we can set several optional configurations, which work like the ones [described here](https://bonigarcia.dev/webdrivermanager/) in `Table 1. Configuration capabilities for driver management`.
 
The possible configs are:

- `driverVersion`
- `browserVersion`
- `architecture`
- `operatingSystem`
- `driverRepositoryUrl`
- `versionsPropertiesUrl`
- `commandsPropertiesUrl`
- `cachePath`
- `resolutionCachePath`
- `proxy`
- `proxyUser`
- `proxyPass`
- `gitHubToken`
- `forceDownload`
- `useBetaVersions`
- `useMirror`
- `avoidExport`
- `avoidOutputTree`
- `clearDriverCache`
- `clearResolutionCache`
- `avoidFallback`
- `avoidBrowserDetection`
- `avoidReadReleaseFromRepository`
- `avoidTmpFolder`
- `useLocalVersionsPropertiesFirst`
- `timeout`
- `ttl`
- `ttlBrowsers`

# Load html as a string

In addition to `apoc.load.html`, there is another procedure that works similarly 
and accepts the same parameters,
but returns a textual representation instead of a list of maps describing the tags:

`CALL apoc.load.htmlPlainText(uri, query, config)`


In [23]:
%%cypher

/*
File content
<body>
    ....

    <ul>
        <li>one</li>
        <li>two</li>
        <li>three</li>
    </ul>
    <br>
    <br>
    <p>my paragraph</p>
</body>
*/

with "wikipediaWithJs.html" as url

call apoc.load.htmlPlainText(url, {content: "body"}) 
yield value 
with url, value.content as valueString // valueString gets a textual representation
call apoc.load.html(url, {content: "body"}) 
yield value return valueString, value.content as valueListMap

Unnamed: 0,valueString,valueListMap
0,\n - one \n - two \n - three \n\n\nmy paragraph \n\n\n,"[{'data': '  const newTag = document.createElement(""p"");  newTag.innerText = ""This is a new tag"";  document.getElementById(""addStuff"").appendChild(newTag); 	', 'text': 'one two three my paragraph', 'tagName': 'body'}]"


In [24]:
%%cypher

// htmlPlainText with browser 
call apoc.load.htmlPlainText("wikipediaWithJs.html", {content: "body"}, {browser: "CHROME"}) 

Unnamed: 0,value
0,{'content': ' This is a new tag - one - two - three my paragraph '}


### [NEXT CHAPTER](http://localhost:8888/notebooks/Read%2C%20write%20and%20other%20utils.ipynb)