# Open data studio

[Open data studio](https://open-datastudio.io) is a managed computing service on [Staroid](https://staroid.com). Run your machine learning and large-scale data processing workloads without managing clusters and servers.

The [ods](https://github.com/open-datastudio/ods) library makes the service easy to use from a Python environment. Currently, the library supports the following computing frameworks.

 - Apache Spark

Let's get started!

## Configure

First, you need an SKE (Star Kubernetes Engine) cluster from [staroid.com](https://staroid.com) and an access token for it. SKE provides a fully managed, serverless Kubernetes namespace on the cloud.

  - Sign up at [staroid.com](https://staroid.com).
  - Click 'Kubernetes' -> 'New Kubernetes cluster' to create a new SKE cluster, then set the `STAROID_SKE` environment variable to its name.
  - Get an access token from the 'Account' -> ['Access tokens'](https://staroid.com/settings/accesstokens) menu, then set the `STAROID_ACCESS_TOKEN` environment variable to it.

In [None]:
import os

os.environ["STAROID_SKE"]="<your ske cluster name>"
os.environ["STAROID_ACCESS_TOKEN"]="<your staroid access token>"

# optionally configure your AWS keys to test with data on S3
os.environ["AWS_ACCESS_KEY_ID"]=""
os.environ["AWS_SECRET_ACCESS_KEY"]=""
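
Before moving on, it can help to confirm that the two required variables really are set. The snippet below is a minimal sketch, not part of the ods library: `missing_settings` and `REQUIRED_VARS` are hypothetical helper names introduced here. It treats a variable as missing when it is unset, blank, or still holds a `<...>` placeholder like the ones in the cell above.

```python
import os

# The two variables this guide asks you to set.
REQUIRED_VARS = ("STAROID_SKE", "STAROID_ACCESS_TOKEN")


def missing_settings(env=os.environ, required=REQUIRED_VARS):
    """Return names of required variables that are unset, blank, or still a <placeholder>."""
    missing = []
    for name in required:
        value = env.get(name, "").strip()
        if not value or (value.startswith("<") and value.endswith(">")):
            missing.append(name)
    return missing


problems = missing_settings()
if problems:
    print("Set these variables before continuing:", ", ".join(problems))
```

The function accepts any mapping, so it can be checked against a plain dict without touching the real environment.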

Now you're ready to go!
Let's install and initialize the [ods](https://github.com/open-datastudio/ods) module.

## Install

In [None]:
!pip install -q ods

import ods
ods.init()

## Spark cluster

Getting a Spark cluster is simple: create a Spark session using the ods library. The library will download Spark (3.x), configure it, create workers on the cloud, and connect to them automatically.

In [None]:
my_cluster = ods.spark("spark1", worker_num=2, delta=True) # you can replace 'spark1' with a unique name for the instance
spark = my_cluster.session() # get spark session

Great! Now you have a Spark session backed by powerful worker machines running on the cloud.

Load your data and run your job!

In [None]:
# run any Spark job
df = spark.createDataFrame([{"hello": "world"} for x in range(100)])
df.show()

Your Spark session creates its Spark executors remotely, on the cloud.

In [None]:
# if you have configured AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, you can try loading the following data
dataLocation = "s3a://us-east-1.elasticmapreduce.samples/flightdata/input/"
df = spark.read.parquet(dataLocation)
df.createOrReplaceTempView("flights")

In [None]:
spark.sql("select flightdate, count(1) from flights group by flightdate order by flightdate desc").show()
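
To make the query's semantics concrete, here is the same aggregation (group by `flightdate`, count rows, newest date first) in plain Python. This is only an illustrative sketch: the rows are made up and are not taken from the flight dataset, the helper name `count_by_date` is hypothetical, and the date format of the real column is an assumption.

```python
from collections import Counter

# Made-up sample rows for illustration only; not real flight data.
rows = [
    {"flightdate": "2020-01-02"},
    {"flightdate": "2020-01-01"},
    {"flightdate": "2020-01-02"},
]


def count_by_date(rows):
    """Count rows per flightdate and return (flightdate, count) pairs, newest date first."""
    counts = Counter(row["flightdate"] for row in rows)  # group by flightdate, count(1)
    # order by flightdate desc (ISO-style date strings sort chronologically)
    return sorted(counts.items(), key=lambda item: item[0], reverse=True)


for flightdate, n in count_by_date(rows):
    print(flightdate, n)
```

On the cluster, Spark performs this grouping in parallel across the executors instead of in a single Python process.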

## Instance management menu and Spark UI

Open the [instance management menu](https://staroid.com/g/open-datastudio/spark-serverless/instances). You'll find your spark-serverless cluster instance there. Click it to see the status of your executors.

You can also find a link to the Spark UI there.

## Stop Spark session and clean up

When Spark is no longer needed, you can stop the Spark session and release the executors.

In [None]:
spark.stop() # stop spark session and release executors
my_cluster.delete() # delete all cluster resources on the cloud.
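
If a job fails partway through, the two cleanup calls above may never run and the executors keep running on the cloud. One way to guard against that is a small context manager. This is a sketch under assumptions: `spark_session` is a hypothetical helper, not part of ods, and it relies only on the `session()`, `stop()`, and `delete()` calls used in this guide.

```python
from contextlib import contextmanager


@contextmanager
def spark_session(cluster):
    """Yield cluster.session(), then stop it and delete the cluster, even on error."""
    session = None
    try:
        session = cluster.session()
        yield session
    finally:
        try:
            if session is not None:
                session.stop()  # stop the Spark session and release executors
        finally:
            cluster.delete()  # delete all cluster resources on the cloud


# Usage, assuming `my_cluster` was created with ods.spark(...) as shown earlier:
# with spark_session(my_cluster) as spark:
#     spark.createDataFrame([{"hello": "world"}]).show()
```

The nested `finally` means the cluster is deleted even if stopping the session itself raises.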

## Documentation

Visit http://open-datastudio.io/computing/spark/index.html for the documentation.

## Get involved

Open data studio is an open source project. Please give us feedback and feel free to get involved!

 - Feedback and questions - [ods issue tracker](https://github.com/open-datastudio/ods/issues)
 - Open data studio slack channel - [Join](https://join.slack.com/t/opendatastudio/shared_invite/zt-jq449y9j-DIPBteeWC15xBbQAqi4J4g)

## Commercial support

[Staroid](https://staroid.com) actively contributes to Open data studio and provides commercial support. Please [contact us](https://staroid.com/site/contact).