# Set Up Dremio in a Kubernetes Cluster



### What is Dremio?

Dremio is an open-source, distributed analytics engine designed for high-performance analytics on data lakes. It provides a simple, self-service interface for data exploration, transformation, and collaboration. Dremio's architecture is built on top of Apache Arrow, a high-performance columnar memory format, and leverages the Parquet file format for efficient storage. By combining these technologies, Dremio enables users to query data from various sources, including Hadoop, NoSQL databases, and cloud storage, without the need for data movement or copies.

There are several reasons why Dremio has gained popularity among data professionals:

* **Self-service data exploration and transformation**: Dremio empowers data analysts and scientists to explore, clean, and transform data using SQL or visual interfaces. This self-service model reduces the burden on IT teams and speeds up the data analysis process.

* **Unified data access**: Dremio connects to a wide range of data sources, allowing users to query data from multiple systems simultaneously. This unified access simplifies data management and promotes collaboration between teams.

* **High-performance queries**: By leveraging Apache Arrow and columnar storage, Dremio delivers fast query performance on large datasets. Additionally, it employs advanced techniques like predicate pushdown and column pruning to optimize queries further.

* **Scalability**: Dremio's distributed architecture can scale horizontally to accommodate growing data volumes and concurrent user workloads. This scalability ensures that Dremio can handle the demands of even the largest enterprises.

* **Security and governance**: Dremio supports robust security features, including data access controls, encryption, and integration with enterprise security solutions like LDAP and Active Directory. These features help organizations maintain compliance and protect sensitive data.

## Set Up Dremio OSS in Kubernetes


To deploy Dremio in a Kubernetes cluster, we can use Helm charts. Helm is a package manager for Kubernetes that helps us to deploy, manage, and upgrade applications in a Kubernetes cluster. Helm charts provide a set of templates and configurations that define how an application should be deployed in Kubernetes.

In this scenario, we will use the Dremio OSS (Open Source Software) image to deploy 1 Master, 3 Executors and 3 Zookeepers. The Master node is responsible for coordinating the cluster, while the Executors are responsible for processing the data. By deploying multiple Executors, we can parallelize the data processing and improve the performance of our cluster.

To store the data in a distributed manner, we will use a MinIO bucket as the distributed storage. MinIO is a high-performance, distributed object storage system designed for cloud-native applications. Whenever new files are uploaded to Dremio, they will be stored in the MinIO bucket. This allows us to store and process large amounts of data in a scalable and distributed manner.

### Prerequisites
To follow the instructions in the notebook below, you will need:
* A Kubernetes cluster. You can use Minikube to set up a local Kubernetes cluster on your machine.
* Helm, the package manager for Kubernetes. You can follow the official Helm installation guide to install Helm on your machine.
* A MinIO server running on bare metal or Kubernetes. You can follow [this](https://min.io/docs/minio/linux/operations/installation.html#install-and-deploy-minio) guide to install MinIO on bare metal or [this](https://min.io/docs/minio/kubernetes/upstream/index.html) guide to install MinIO on Kubernetes, or you can use the [play server](https://play.min.io/) for testing purposes.
* A MinIO client (mc) to access the MinIO server. You can follow [this](https://docs.min.io/docs/minio-client-quickstart-guide.html) guide to install mc on your machine.



### Create MinIO bucket

Let's create a MinIO bucket `openlake` with a `dremio` prefix (`openlake/dremio`), which Dremio will use as its distributed storage.

In [None]:
!mc mb play/openlake
!mc mb play/openlake/dremio
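An `mc` target has the form `alias/bucket/prefix`, and the chart's values file later refers to the same location as a `bucketName` plus a `path`. The following is a minimal sketch of that mapping; the helper name `split_mc_target` is hypothetical and only illustrates how the pieces line up.

```python
def split_mc_target(target: str) -> dict:
    """Split an mc target such as 'play/openlake/dremio' into its parts."""
    # The first component is the mc alias, the second is the bucket,
    # and whatever remains is the prefix inside that bucket.
    alias, _, rest = target.partition("/")
    bucket, _, prefix = rest.partition("/")
    if not alias or not bucket:
        raise ValueError(f"expected alias/bucket[/prefix], got {target!r}")
    return {
        "alias": alias,
        "bucketName": bucket,
        "path": "/" + prefix.strip("/"),
    }


print(split_mc_target("play/openlake/dremio"))
```

For the target used in this notebook, the bucket is `openlake` and the path is `/dremio`.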

### Clone dremio-cloud-tools repo

We will use the Helm charts from this repo to set up Dremio.

In [None]:
!git clone https://github.com/dremio/dremio-cloud-tools

We will use the `dremio_v2` version of the charts, together with the `values.minio.yaml` file found under `charts/` in the current directory, to set up Dremio. Let's copy the YAML to `dremio-cloud-tools/charts/dremio_v2`.

In [None]:
!cp charts/values.minio.yaml dremio-cloud-tools/charts/dremio_v2/

In [None]:
%cd dremio-cloud-tools/charts/

In [None]:
!ls dremio_v2 # you should see values.minio.yaml

### Install Dremio using Helm

In [None]:
!helm install dremio dremio_v2 -f dremio_v2/values.minio.yaml --namespace dremio --create-namespace

The command above installs a Helm release named `dremio` in the namespace `dremio`, creating the namespace if it does not already exist.

Note: Make sure to update your MinIO endpoint, access key, and secret key in `values.minio.yaml` before running the install.
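As an alternative to editing credentials into the file, the same install can be assembled programmatically with the keys supplied as Helm overrides. This is only a sketch: the function name `helm_install_args` is hypothetical, and the override keys assume the `distStorage.aws.credentials` layout shown in `values.minio.yaml`.

```python
import shlex


def helm_install_args(access_key, secret_key, release="dremio",
                      chart="dremio_v2",
                      values_file="dremio_v2/values.minio.yaml",
                      namespace="dremio"):
    """Build the argument list for the install command used in this notebook."""
    return [
        "helm", "install", release, chart,
        "-f", values_file,
        "--namespace", namespace, "--create-namespace",
        # --set-string keeps the values as strings even if they look numeric.
        "--set-string", f"distStorage.aws.credentials.accessKey={access_key}",
        "--set-string", f"distStorage.aws.credentials.secret={secret_key}",
    ]


# Placeholder credentials; substitute the keys of your own MinIO deployment.
args = helm_install_args("minioadmin", "minioadmin")
print(shlex.join(args))
```

The resulting list can be passed to `subprocess.run(args, check=True)`. Values containing commas would need escaping for `--set-string`, which this sketch does not handle.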

In [None]:
!kubectl -n dremio get pods # after the Helm install completes, it takes some time for the pods to be up and running
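Because the pods take a while to start, it can help to check readiness programmatically instead of re-running the cell. The sketch below inspects the JSON that `kubectl -n dremio get pods -o json` returns; the function name `unready_pods` is hypothetical, and the sample data is hand-written for illustration rather than captured from a real cluster.

```python
def unready_pods(pod_list: dict) -> list:
    """Return names of pods that are not Running with every container ready."""
    not_ready = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        containers = status.get("containerStatuses", [])
        ready = (
            status.get("phase") == "Running"
            and bool(containers)
            and all(c.get("ready") for c in containers)
        )
        if not ready:
            not_ready.append(pod["metadata"]["name"])
    return not_ready


# Illustrative sample shaped like `kubectl get pods -o json` output.
sample = {
    "items": [
        {"metadata": {"name": "dremio-master-0"},
         "status": {"phase": "Running",
                    "containerStatuses": [{"ready": True}]}},
        {"metadata": {"name": "dremio-executor-0"},
         "status": {"phase": "Pending"}},
    ]
}
print(unready_pods(sample))
```

To use it against a live cluster, parse the command's standard output with `json.loads` and call `unready_pods` in a loop until it returns an empty list.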

In [None]:
!kubectl -n dremio get svc # List all the services in namespace dremio

In [None]:
!mc ls play/openlake/dremio # we should see new prefixes being created that Dremio will use later

### Login to Dremio

To log in to Dremio, let's port-forward the `dremio-client` service to our localhost. After executing the command below, go to http://localhost:9047

In [None]:
!kubectl -n dremio port-forward svc/dremio-client 9047 # stop the cell once you are done exploring

You will need to set up a new user the first time you launch Dremio.

![Dremio login screen](./img/login-screen.png)

Once we have set up the user, we will be greeted with a welcome page. To keep this workflow simple, let's upload a sample dataset that is included in the repo, `data/nyc_taxi_small.csv`, and start querying it.

We can upload `nyc_taxi_small.csv` by clicking on the `+` at the top right corner of the home page, as shown below.
![Add](./img/add.png)

Dremio will automatically parse the CSV and recommend formatting, as shown below. We will accept it and proceed.
![Format](./img/format.png)

In [None]:
!mc ls --summarize --recursive play/openlake/dremio/uploads # you will see the CSV file uploaded to the MinIO bucket

We will be taken to the SQL query console, where we can start executing queries. Here are 2 sample queries that you can try:

```sql
SELECT count(*) FROM nyc_taxi_small;

SELECT * FROM nyc_taxi_small;
```

Paste the above into the console and click `Run`. You should see something like the following:

![console](./img/console.png)

You can click on the `Query1` tab to see the number of rows in the dataset.

![console-query1](./img/console1.png)

You can click on the `Query2` tab to see the rows in the dataset.
![console-query2](./img/console2.png)

Now that we have an end-to-end Dremio workflow working, let us take a look at `values.minio.yaml` to see which configurations were set up.

## Deployment Walkthrough


Now let's take a deep dive into the `values.minio.yaml` file to see some of the modifications made to the `distStorage` section.

```yaml
distStorage:
  aws:
    bucketName: "openlake"
    path: "/dremio"
    authentication: "accessKeySecret"
    credentials:
     accessKey: "minioadmin"
     secret: "minioadmin"

    extraProperties: |
     <property>
       <name>fs.s3a.endpoint</name>
       <value>play.min.io</value>
     </property>
     <property>
       <name>fs.s3a.path.style.access</name>
       <value>true</value>
     </property>
     <property>
       <name>dremio.s3.compat</name>
       <value>true</value>
     </property>
```

We set `distStorage` to `aws` with `openlake` as the bucket name, so all of Dremio's storage lives under the prefix `dremio` (that is, `s3://openlake/dremio`). Because we are using MinIO, we also add `extraProperties` to specify the endpoint (`fs.s3a.endpoint`). Two additional properties are needed to make Dremio work with MinIO: `fs.s3a.path.style.access` must be set to `true`, and `dremio.s3.compat` must be set to `true` so that Dremio knows this is an S3-compatible object store.
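The `extraProperties` value is a sequence of Hadoop-style `<property>` elements. As a small sketch of how that fragment can be generated from plain name/value pairs when the endpoint changes, consider the following; the helper name `extra_properties` is hypothetical and not part of the chart.

```python
from xml.sax.saxutils import escape


def extra_properties(props: dict) -> str:
    """Render name/value pairs as <property> blocks for extraProperties."""
    blocks = []
    for name, value in props.items():
        blocks.append(
            "<property>\n"
            f"  <name>{escape(name)}</name>\n"
            f"  <value>{escape(str(value))}</value>\n"
            "</property>"
        )
    return "\n".join(blocks)


# The three properties used in values.minio.yaml above.
print(extra_properties({
    "fs.s3a.endpoint": "play.min.io",
    "fs.s3a.path.style.access": "true",
    "dremio.s3.compat": "true",
}))
```

The output still has to be indented to sit under the `extraProperties: |` block scalar in the YAML file.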

Apart from this, we can customize several other settings, such as `executor` CPU and memory, depending on the K8s cluster capacity. We can also specify how many executors we need, depending on the size of the workloads Dremio is going to handle.
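To get a rough feel for sizing, the sketch below totals what a set of executors would request and builds matching Helm overrides. It assumes the chart exposes `executor.count`, `executor.cpu`, and `executor.memory` (with memory in MB); verify those keys and units against the chart's own `values.yaml` before relying on them. Both function names are hypothetical.

```python
def total_request(count: int, cpu: int, memory_mb: int) -> dict:
    """Total CPU cores and memory (MB) requested by all executors combined."""
    return {"cpu": count * cpu, "memory_mb": count * memory_mb}


def executor_overrides(count: int, cpu: int, memory_mb: int) -> list:
    """Build --set flags for the executor settings (keys are assumed)."""
    if count < 1 or cpu < 1 or memory_mb < 1:
        raise ValueError("count, cpu and memory_mb must all be positive")
    return [
        "--set", f"executor.count={count}",
        "--set", f"executor.cpu={cpu}",
        "--set", f"executor.memory={memory_mb}",
    ]


# Illustrative numbers only: 3 executors with 4 cores and 16 GiB each.
print(total_request(3, 4, 16384))
print(" ".join(executor_overrides(3, 4, 16384)))
```

Comparing the totals with the allocatable CPU and memory of the cluster's nodes gives a quick check on whether the executors can be scheduled at all.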

So far we have seen how to deploy Dremio in a K8s cluster with MinIO as the distributed storage. We also saw how to upload a sample dataset to Dremio and start querying it. We have just touched the tip of the iceberg 😜. In the following notebook we will see how to manage an `Apache Iceberg` table created by a processing engine such as Spark, as shown [here](../spark/spark-iceberg-minio.ipynb), without any hassle, and how to access data that you already have in MinIO.