This notebook shows how to use `CensysDatasetManager` to convert Censys files into a DataFrame abstraction, and how to enrich and manage the resulting dataset.

In [None]:
from pyspark.sql import SparkSession
from tlhop.converters import CensysDatasetManager

In [None]:
spark = SparkSession.builder\
            .master("local[10]")\
            .config("spark.driver.memory", "20g")\
            .getOrCreate()

In [None]:
INPUT_SNAPSHOT_FOLDER = "<SNAPSHOT_FOLDER>"
TMP_OUTPUT = "/home/<USER>/censys.delta"

In [None]:
censys_mgr = CensysDatasetManager(filter_by_contry='Brazil')

In [None]:
censys_mgr.convert_files(INPUT_SNAPSHOT_FOLDER, TMP_OUTPUT)

In [None]:
# Alternatively, convert the snapshot into a DataFrame without writing a new file:
# 
# df = censys_mgr.convert_dump_to_df(INPUT_SNAPSHOT_FOLDER)

After converting all files, users can access the dataset using Spark's native API:

In [None]:
df = spark.read.format("delta").load(TMP_OUTPUT)
df.count()

### Dataset optimization

Spark may generate many small files over time. For that reason, we expose a method (`optimize_delta`) that optimizes the dataset by merging small files into larger ones.

In [None]:
censys_mgr.optimize_delta(TMP_OUTPUT)
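
To judge whether the dataset is fragmented, it can help to look at the number of files and their average size before and after optimizing. The helper below is a minimal sketch, not part of `CensysDatasetManager`: the name `delta_file_stats` is hypothetical, and it assumes a Delta Lake version whose Python `DeltaTable` provides `detail()`.

```python
def delta_file_stats(delta_table):
    """Summarize file count and average file size of a Delta table.

    `delta_table` is assumed to be a `delta.tables.DeltaTable` instance;
    `detail()` returns a one-row DataFrame describing the current version.
    """
    row = delta_table.detail().select("numFiles", "sizeInBytes").first()
    num_files = row["numFiles"]
    size_in_bytes = row["sizeInBytes"]
    # Guard against an empty table to avoid dividing by zero.
    avg_bytes = size_in_bytes / num_files if num_files else 0.0
    return {
        "num_files": num_files,
        "size_in_bytes": size_in_bytes,
        "avg_file_mb": avg_bytes / 1024**2,
    }

# Example usage in this notebook:
# from delta.tables import DeltaTable
# delta_file_stats(DeltaTable.forPath(spark, TMP_OUTPUT))
```

A very small `avg_file_mb` combined with a large `num_files` suggests that compaction is worthwhile.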

### Cleaning old versions

The Delta format supports time travel. To support this feature, older file versions are kept inside the dataset folder (for instance, the version that existed before `optimize_delta` was executed). Once we are sure that older dataset versions are no longer needed, we can use the `remove_old_delta_versions` method to force the removal of these old versions.

In [None]:
censys_mgr.remove_old_delta_versions(TMP_OUTPUT)
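
For reference, time travel itself can be sketched with the standard Delta reader option `versionAsOf`. The helper name `read_delta_version` is hypothetical and only illustrates the idea. It is only useful before `remove_old_delta_versions` is called, since reading a version whose files were already removed is expected to fail.

```python
def read_delta_version(spark, path, version):
    """Load a specific historical version of a Delta table (time travel).

    `spark` is an active SparkSession with Delta support, `path` is the
    dataset folder, and `version` is a commit number from the table history.
    """
    # `versionAsOf` is a standard Delta Lake reader option.
    return (
        spark.read.format("delta")
        .option("versionAsOf", version)
        .load(path)
    )

# Example usage in this notebook (version 0 is the first commit):
# df_v0 = read_delta_version(spark, TMP_OUTPUT, 0)
# df_v0.count()
```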

### Further Delta operations

Because we use Delta, further operations are also available through the native Delta API. For instance, we can inspect the complete dataset history:

In [None]:
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, TMP_OUTPUT)
deltaTable.history().toPandas()
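
The full history can be wide, so a narrower view is often easier to read. The following is a small sketch with a hypothetical helper name (`summarize_history`); it keeps only a few of the columns that Delta's history DataFrame provides and limits the result to the most recent commits.

```python
def summarize_history(delta_table, limit=10):
    """Return the most recent commits with a reduced set of columns.

    `delta_table` is assumed to be a `delta.tables.DeltaTable` instance.
    """
    # `history(limit)` returns the latest `limit` commits, newest first.
    return delta_table.history(limit).select(
        "version", "timestamp", "operation", "operationParameters"
    )

# Example usage in this notebook:
# summarize_history(deltaTable, limit=5).toPandas()
```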

In [None]:
spark.stop()