<font size="6">**APOC Library Updates**</font>

We will take a look at some functions, procedures and features introduced in the last year or so:

- Export and import of compressed files
- Import binary data
- Read JS-generated HTML
- Read HTML as a string
- Read and write with Redis
- Read and write with Apache Arrow
- Detect graph cycles
- apoc.load.directory* procedures

We will cover both APOC Core and Full/Extended functions/procedures.


---

### Setup

- Neo4j 5.1.0 instance
- APOC Core 5.1.0
- APOC Extended 5.1.0 (called APOC Full in 4.x versions)

#### Dataset

- The one created via `:play movies`

#### Notebook setup
- cy2py: library to connect neo4j with jupyter
    - ipython-cypher: to create cypher queries using `%%cypher` and `%cypher`
    - cytoscape: graph visualization
    - pandas: table visualization



In [None]:
# table style
import pandas
pandas.set_option('display.max_colwidth', 500)
pandas.set_option('html.use_mathjax', False)


# custom node colors
colors = {
  ':Person': '#fffb00',
  ':CompressedNode': 'red'
}

# custom graph layout
layout = {
    'layout': 'grid', 
    'padding': 100,
    'nodeSpacing': 100
}

# custom node captions (default is `LabelName`)
caption = {':CompressedNode': ['name']}

# connect neo4j with jupyter
%reload_ext cy2py

# url and credential
neo4j_url = "bolt://localhost:7687"
neo4j_user = "neo4j"
neo4j_pwd = "apoc"

# we check the connections, set the above custom options and create the dataset
%cypher -u $neo4j_url -us $neo4j_user -pw $neo4j_pwd \
    -co $colors -la $layout -ca $caption \
    call apoc.cypher.runFile('movies.cypher')

# Export and import compressed files

<span style="color:#33f" size="7"> ***For 4.4, introduced for both APOC Core and Full/Extended in 4.4.0.6*** </span>

All `apoc.export.*` procedures support file compression
via the configuration parameter `compression: <ALGO>`.

Likewise, all `apoc.import.*` and `apoc.load.*` procedures (except `apoc.load.directory*`)
can read a compressed file through the same parameter.


#### Possible compression algorithms:

- `NONE` (default)
- `GZIP`
- `BZIP2`
- `DEFLATE`
- `BLOCK_LZ4`
- `FRAMED_SNAPPY`


## normal way

In [None]:
%%cypher

match (n:Person) with collect(n) as people
call apoc.export.csv.data(people, [], "normal.csv", {}) 
yield done return done

## compressed way



In [None]:
%%cypher

match (n:Person) with collect(n) as people
call apoc.export.csv.data(people, [], "compressed.csv.gz", {compression: 'GZIP'})
yield done return done

## stream normal way

In [None]:
%%cypher

// it returns a `String` stream

match (n:Person) with collect(n) as people
call apoc.export.csv.data(people, [], null, 
            {stream: true})
yield data return data

## stream compressed way

In [None]:
%%cypher

// it returns a `byte[]` stream

match (n:Person) with collect(n) as people
call apoc.export.csv.data(people, [], null, 
            {compression: 'GZIP', stream: true})
yield data return data

## load and import compressed

We can load or import compressed files in the same way as we export them.

The file does not have to come from an export with `{compression: ALGO}`:
we can also manually compress a file previously exported with `{compression: NONE}`, for example via the `gzip` terminal command.
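
As a rough illustration of the manual route, the following Python sketch compresses a plain CSV file the way `gzip` would and reads it back. It uses only the standard library; the helper names `gzip_file` and `read_gzipped_csv` and the sample rows are hypothetical, and the temporary `normal.csv` merely stands in for a file exported with `{compression: NONE}`.

```python
import csv
import gzip
import shutil
import tempfile
from pathlib import Path


def gzip_file(source: Path, target: Path) -> Path:
    """Compress `source` into `target`, like `gzip -c source > target`."""
    with source.open("rb") as plain, gzip.open(target, "wb") as packed:
        shutil.copyfileobj(plain, packed)
    return target


def read_gzipped_csv(path: Path) -> list[dict[str, str]]:
    """Read a gzip-compressed CSV back as a list of row dictionaries."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Stand-in for a file exported with {compression: 'NONE'}; the rows are made up.
work_dir = Path(tempfile.mkdtemp())
plain_csv = work_dir / "normal.csv"
plain_csv.write_text("name,born\nFoo,1999\nBar,2001\n", encoding="utf-8")

compressed_csv = gzip_file(plain_csv, work_dir / "compressed.csv.gz")
rows = read_gzipped_csv(compressed_csv)
```

A file produced this way should then be readable with `{compression: 'GZIP'}`, exactly like one written by the export procedure.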


In [None]:
%%cypher
CALL apoc.load.csv('compressed.csv.gz', {compression: 'GZIP'})

In [None]:
%%cypher
CALL apoc.import.csv(
    [{fileName: 'compressed.csv.gz', labels: ['CompressedNode']}], // nodes
    [], // rels
    {compression: 'GZIP'})

In [None]:
%%cypher
MATCH (n:CompressedNode) RETURN n

## Import and load archive

APOC can also read a file from inside an archive
(`.zip`, `.tar`, `.tar.gz` or `.tgz`),
which works differently from reading a single compressed file.

For example, via `apoc.load.json`:
```
CALL apoc.load.json("pathToArchive/file.<archiveExt>!pathToJsonFileInArchive/fileName.json")
```

Here we don't have to specify `compression: ALGO`:
APOC automatically recognizes the archive format from the file extension.
Currently, the only supported extensions are `.zip`, `.tar`, `.tar.gz` and `.tgz`.
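
To make the `archive!entry` convention concrete, here is a small Python sketch that builds a `testload.tar.gz` holding a `person.json` file and then reads that entry back through a path of the same shape. It only mimics the addressing scheme and is not how APOC resolves the path internally; the helper names and the content of `person.json` are assumptions made for the example.

```python
import io
import json
import tarfile
import tempfile
from pathlib import Path


def split_archive_path(path: str) -> tuple[str, str]:
    """Split 'archive!entry' into the archive path and the entry inside it."""
    archive, _, entry = path.partition("!")
    return archive, entry


def build_tar_gz(archive: Path, entry_name: str, payload: bytes) -> None:
    """Write a .tar.gz archive holding a single in-memory file."""
    info = tarfile.TarInfo(name=entry_name)
    info.size = len(payload)
    with tarfile.open(archive, "w:gz") as tar:
        tar.addfile(info, io.BytesIO(payload))


def read_entry(path: str) -> bytes:
    """Read one entry from an archive addressed as 'archive!entry'."""
    archive, entry = split_archive_path(path)
    # "r:*" lets tarfile detect the compression on its own.
    with tarfile.open(archive, "r:*") as tar:
        member = tar.extractfile(entry)
        if member is None:
            raise FileNotFoundError(entry)
        return member.read()


work_dir = Path(tempfile.mkdtemp())
archive_path = work_dir / "testload.tar.gz"
# The content of person.json is invented for this example.
build_tar_gz(archive_path, "person.json", json.dumps({"name": "Foo", "born": 1999}).encode())
person = json.loads(read_entry(f"{archive_path}!person.json"))
```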


#### Note
```
The tar.gz, tgz and tar archives are supported only from APOC 4.3.0.9, 4.4.0.10 and 5.x.
```


In [None]:
%%cypher

// # testload.tar.gz contains a `person.json` file
CALL apoc.load.json("testload.tar.gz!person.json")


<hr style="border:1px solid #ccc"> 

# String compression

<span style="color:#33f" size="7"> ***For 4.4, introduced in APOC Core, 4.4.0.7*** </span>

This is useful, for example, when we need to store large string values in nodes or relationships.

It also lets us compress huge source data on a client, send the compressed byte array to the server, and then decompress, parse and process it on the server.

We can use `apoc.util.compress` to compress a `string`.

Conversely, we can use `apoc.util.decompress` to read a compressed `byte[]`.


Both accept the same values as the export/import `compression` configuration (but the default is `"GZIP"`).
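
For readers who want to experiment on the client side, the sketch below shows a compress/decompress round trip in Python with the standard-library counterparts of three of the algorithm names. The `CODECS` table and the two helper functions are hypothetical, `zlib` is assumed as the stand-in for `DEFLATE`, and byte-level compatibility with what APOC produces has not been verified here.

```python
import bz2
import gzip
import zlib

# Standard-library stand-ins for some of the algorithm names; BLOCK_LZ4 and
# FRAMED_SNAPPY need third-party packages and are left out.
CODECS = {
    "GZIP": (gzip.compress, gzip.decompress),
    "BZIP2": (bz2.compress, bz2.decompress),
    "DEFLATE": (zlib.compress, zlib.decompress),
    "NONE": (lambda raw: raw, lambda raw: raw),
}


def compress_text(text: str, compression: str = "GZIP") -> bytes:
    """Encode the text as UTF-8 and compress it with the chosen codec."""
    encode, _ = CODECS[compression]
    return encode(text.encode("utf-8"))


def decompress_text(data: bytes, compression: str = "GZIP") -> str:
    """Reverse `compress_text` for the same codec name."""
    _, decode = CODECS[compression]
    return decode(data).decode("utf-8")


sample = "name,born\nFoo,1999\nBar,2001"
round_trips = {name: decompress_text(compress_text(sample, name), name) for name in CODECS}
```

With `"NONE"` the helper simply returns the UTF-8 bytes of the string, which mirrors the behaviour noted for `apoc.util.compress` in the cells that follow.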



In [None]:
%%cypher
return apoc.util.compress("name,born\nFoo,1999\nBar,2001")

In [None]:
%%cypher
return apoc.util.compress("name,born\nFoo,1999\nBar,2001", {compression: 'DEFLATE'})

In [None]:
%%cypher

// # with compression "NONE", unlike the export procedures, it returns a `String.getBytes()`

return apoc.util.compress("name,born\nFoo,1999\nBar,2001", {compression: 'NONE'})


<hr style="border:1px solid #ccc"> 

# Import and load binaries

<span style="color:#33f" size="7"> ***For 4.4, introduced in both APOC Core and Full/Extended in 4.4.0.6*** </span>

Besides importing a file from a URL,
we can pass a `byte[]` as a parameter.

This is useful in cloud environments where we cannot store files on the file system, or when we don't want to expose data on the internet.
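
A client could prepare such a `byte[]` entirely in memory, without touching the file system. The following Python sketch gzips a few JSON records and decodes them again to check the payload; `build_payload` and `read_payload` are hypothetical helpers, and passing the bytes to Neo4j as a query parameter (for example through a driver) is assumed and not shown.

```python
import gzip
import json


def build_payload(records: list[dict]) -> bytes:
    """Serialize records as space-separated JSON objects and gzip them in memory."""
    text = " ".join(json.dumps(record) for record in records)
    return gzip.compress(text.encode("utf-8"))


def read_payload(payload: bytes) -> list[dict]:
    """Decompress the payload and decode the concatenated JSON objects one by one."""
    text = gzip.decompress(payload).decode("utf-8")
    decoder = json.JSONDecoder()
    records, position = [], 0
    while position < len(text):
        record, position = decoder.raw_decode(text, position)
        records.append(record)
        # raw_decode does not skip leading whitespace, so step over it here.
        while position < len(text) and text[position].isspace():
            position += 1
    return records


people = [{"name": "Foo", "born": 2001}, {"name": "Bar", "born": 2001}]
binary_json = build_payload(people)
decoded = read_payload(binary_json)
```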


In [None]:
%%cypher

// # transform a string into a `byte[]`
with apoc.util.compress('{"name": "Foo", "born": 2001} {"name": "Bar", "born": 2001}') 
as binaryJson

// # read binary
call apoc.load.json(binaryJson, 
                    null, // JsonPath parameter,
                    {compression: 'GZIP'})
yield value return value

In [None]:
%%cypher

// # With csv and DEFLATE algorithm

with apoc.util.compress('name,born\nFoo,1999\nBar,2001', {compression: 'DEFLATE'}) as binaryCsv

// # read binary
call apoc.load.csv(binaryCsv, {compression: 'DEFLATE'})
yield list return list


<hr style="border:1px solid #ccc"> 

# Apache Arrow

<span style="color:#33f" size="7"> ***For 4.4, introduced in APOC Core 4.4.0.4*** </span>

[Apache Arrow](https://arrow.apache.org/) defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations.

It's useful for interoperability with other Apache frameworks like Spark and Kafka.

#### Note

```
To use these procedures we need to download an additional jar (not included in the APOC jars) from the Maven repository
https://mvnrepository.com/artifact/org.apache.arrow/arrow-memory-netty,
and put it in the `plugins` folder.
```

### Export procedures

As with the other export formats,
there are 3 procedures to export to Arrow (currently there is no equivalent of `apoc.export.csv.data`):

- `apoc.export.arrow.all(file, $config)` - Exports the full database


- `apoc.export.arrow.graph(file, graph, $config)` - Exports the given graph (i.e. `{nodes: [nodeList], relationships: [relList]}`)


- `apoc.export.arrow.query(file, query, $config)` - Exports the results from the given Cypher query

### Export stream procedures:

Conceptually similar to, e.g., `apoc.export.csv.all(null, {stream: true, compression: '<ALGO>'})`: instead of exporting to a file, they stream `byte[]` values, one for each batch:

- `apoc.export.arrow.stream.all($config)`


- `apoc.export.arrow.stream.graph(graph, $config)`


- `apoc.export.arrow.stream.query(query, $config)`


Currently, `$config` supports just one property, `batchSize`, which defaults to `2000`.
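
To get a feeling for what `batchSize` implies, the plain-Python sketch below splits a list of rows into chunks of at most `batch_size` elements, each chunk corresponding to one streamed `byte[]`. It is only an illustration of the batching idea, not APOC's Arrow serialization; the `batches` helper and the sample rows are hypothetical.

```python
import math
from itertools import islice
from typing import Iterable, Iterator


def batches(rows: Iterable[dict], batch_size: int = 2000) -> Iterator[list[dict]]:
    """Yield consecutive chunks of at most `batch_size` rows."""
    iterator = iter(rows)
    while chunk := list(islice(iterator, batch_size)):
        yield chunk


# 25 made-up rows with a batch size of 10 give chunks of 10, 10 and 5 rows.
rows = [{"id": index} for index in range(25)]
chunk_sizes = [len(chunk) for chunk in batches(rows, batch_size=10)]
expected_chunks = math.ceil(len(rows) / 10)
```

A smaller `batchSize` therefore means more, smaller `byte[]` values in the stream, and a larger one means fewer, bigger values.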

### Load procedures:

It reads an `.arrow` file and returns a map for each row

- `apoc.load.arrow(fileName)`

### Load stream procedures:
It reads an Arrow `byte[]` and returns a map for each row

- `apoc.load.arrow.stream(bytes)`


```
Unlike CSV, GraphML and JSON, there is no `apoc.import.arrow`,
so we have to use the `apoc.load.arrow*` procedures to create nodes, if needed.
```

In [None]:
%%cypher

// # export file
CALL apoc.export.arrow.query('query_test.arrow', "MATCH (n:Person) RETURN n")


In [None]:
%%cypher

// # load file
CALL apoc.load.arrow('query_test.arrow')

In [None]:
%%cypher

// # export a stream of `byte[]`, based on `batchSize`

MATCH (n:Person) 
WITH collect(n) as nodes
CALL apoc.export.arrow.stream.graph({nodes: nodes, relationships: []}, {batchSize: 10})
YIELD value RETURN value

In [None]:
%%cypher

// # roundtrip export-load stream

CALL apoc.export.arrow.stream.query("MATCH (n:Person) RETURN n")
YIELD value
WITH value as byteArray
CALL apoc.load.arrow.stream(byteArray)
YIELD value RETURN value

<hr style="border:1px solid #ccc"> 

# Load html with js generated code


By default, the `apoc.load.html(url, selector, $config)` procedure uses the jsoup library to parse the html file:  https://jsoup.org/.



However, with the following HTML, jsoup cannot read the content generated by JS (i.e. the `h1` tag):
```
...
<body>
	<div id="addStuff"></div>

	<script type="text/javascript">
		const newTag = document.createElement("h1");
		newTag.innerText = "This is a new tag";
		document.getElementById("addStuff").appendChild(newTag);
	</script>
</body>
...
```
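
The limitation is easy to reproduce outside Neo4j. In this sketch Python's `html.parser` plays the role of a static parser (an assumption made for illustration, since jsoup itself is a Java library): it sees the `script` element but never executes it, so no `h1` tag shows up. The `TagCollector` class is hypothetical.

```python
from html.parser import HTMLParser

PAGE = """
<body>
    <div id="addStuff"></div>
    <script type="text/javascript">
        const newTag = document.createElement("h1");
        newTag.innerText = "This is a new tag";
        document.getElementById("addStuff").appendChild(newTag);
    </script>
</body>
"""


class TagCollector(HTMLParser):
    """Record the name of every start tag found in the markup."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: list[str] = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(PAGE)
collector.close()

# Only the tags written in the markup are found; the script is never run.
static_tags = collector.tags
```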

To handle these cases, we can leverage [Selenium WebDriver](https://www.selenium.dev/),
a tool for automating browsers (mostly for testing purposes).

With it, we can open a browser in headless mode, i.e. without a graphical interface, that interprets the JS inside the HTML file.

So, unlike jsoup, it is not just a parser.


To do this, we can pass in `$config` the option `{browser: "CHROME"}` or `{browser: "FIREFOX"}`,
in order to read HTML with JS-generated content.


#### Note
```
To use this feature we need to download an additional jar
https://github.com/neo4j-contrib/neo4j-apoc-procedures/releases/download/<APOC_VERSION>/apoc-selenium-dependencies-<APOC_VERSION>-all.jar,
and put it in the `plugins` folder.

So for example with apoc 5.1.0, `https://github.com/neo4j-contrib/neo4j-apoc-procedures/releases/download/5.1.0/apoc-selenium-dependencies-5.1.0-all.jar`.

```


#### Cons: 

- It relies on an installed browser, Chrome or Firefox, so it's slower.
- It requires additional jars.


So if we know that the HTML we have to read is static,
it is better not to use it.




In [None]:
%%cypher

// # file with the above js code
// # we create a map {newNode: [listOfH1Tags]}
CALL apoc.load.html("wikipediaWithJs.html", {newNode: 'h1'}, {browser: 'CHROME'})

In [None]:
%%cypher

// # default way

CALL apoc.load.html("wikipediaWithJs.html", {newNode: 'h1'}, {})

Additionally, when `browser` is `CHROME` or `FIREFOX`, we can set various optional configurations. They work like the ones [described here](https://bonigarcia.dev/webdrivermanager/), in `Table 1. Configuration capabilities for driver management`, and have the same default values.
 
The possible configs are:

- `driverVersion`
- `browserVersion`
- `architecture`
- `operatingSystem`
- `driverRepositoryUrl`
- `versionsPropertiesUrl`
- `commandsPropertiesUrl`
- `cachePath`
- `resolutionCachePath`
- `proxy`
- `proxyUser`
- `proxyPass`
- `gitHubToken`
- `forceDownload`
- `useBetaVersions`
- `useMirror`
- `avoidExport`
- `avoidOutputTree`
- `clearDriverCache`
- `clearResolutionCache`
- `avoidFallback`
- `avoidBrowserDetection`
- `avoidReadReleaseFromRepository`
- `avoidTmpFolder`
- `useLocalVersionsPropertiesFirst`
- `timeout`
- `ttl`
- `ttlBrowsers`

In [None]:
%%cypher

// # Force downloading chrome driver (even if it is already in the cache) 

CALL apoc.load.html("wikipediaWithJs.html", {newNode: 'p'}, 
            {browser: 'CHROME', forceDownload: true})

<hr style="border:1px solid #ccc"> 

# Load html as a string

<span style="color:#33f" size="7"> ***For 4.4, introduced in APOC Full 4.4.0.9*** </span>

In addition to `apoc.load.html`, there is another procedure that works similarly
and accepts the same parameters as `apoc.load.html`,
but returns a textual representation instead of a list of maps describing the tags:

`CALL apoc.load.htmlPlainText(uri, query, config)`
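
The difference between the two result shapes can be sketched in Python with the standard `html.parser` module: the same markup is turned once into a list of maps (one per element) and once into a single string. The `ElementCollector` class is hypothetical, the map keys `tagName` and `text` are assumptions chosen for illustration, and the exact formatting produced by `apoc.load.htmlPlainText` is not reproduced here.

```python
from html.parser import HTMLParser

SNIPPET = "<ul><li>one</li><li>two</li><li>three</li></ul><p>my paragraph</p>"


class ElementCollector(HTMLParser):
    """Collect one {tagName, text} map per element that directly holds text."""

    def __init__(self) -> None:
        super().__init__()
        self.open_tags: list[str] = []
        self.elements: list[dict[str, str]] = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.open_tags:
            self.elements.append({"tagName": self.open_tags[-1], "text": text})


collector = ElementCollector()
collector.feed(SNIPPET)
collector.close()

as_maps = collector.elements  # structured: one map per element
as_plain_text = "\n".join(element["text"] for element in as_maps)  # flattened text
```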


In [None]:
%%cypher

/*
File content
<body>
    ....

    <ul>
        <li>one</li>
        <li>two</li>
        <li>three</li>
    </ul>
    <br>
    <br>
    <p>my paragraph</p>
</body>
*/

with "wikipediaWithJs.html" as url

call apoc.load.htmlPlainText(url, {content: "body"}) 
yield value 
with url, value.content as valueString // valueString gets a textual representation
call apoc.load.html(url, {content: "body"}) 
yield value return valueString, value.content as valueListMap

In [None]:
%%cypher

// # htmlPlainText with browser 
call apoc.load.htmlPlainText("wikipediaWithJs.html", {content: "h1"}, {browser: "CHROME"}) 

### [NEXT CHAPTER](http://localhost:8888/notebooks/Read%2C%20write%20and%20other%20utils.ipynb)