<a id='section-id0'></a>
# Working with data in Yandex DataSphere

1. [Working with primary data sources](#section-id1)
1. [Datasets](#section-id2)
1. [Sharing output](#section-id3) 
1. [More about Yandex DataSphere](#section-id4) 

In [3]:
# %pip install boto3 if needed
# set os environment variables aws_access_key_id and aws_secret_access_key

import boto3
import os
from pathlib import Path


def download_files(s3_client, bucket_name: str, local_path: str, file_name: str) -> None:
    local_path = Path(local_path)

    local_path.mkdir(parents=True, exist_ok=True)

    file_path = Path.joinpath(local_path, file_name)
    file_path.parent.mkdir(parents=True, exist_ok=True)
    s3_client.download_file(
        bucket_name,
        file_name,
        str(file_path)
    )


S3_CREDS = {
    "aws_access_key_id": 'YCAJEAeZbsb8c7fvN8GNL2vXz',
    "aws_secret_access_key": 'YCNNhq867MpjKO1w5vZhuCvHlpO64mCyI4tkJeOq'
}

bucket = "bucket-datalens-test-abacaba"
file_name = 'part-0.parquet'

client = boto3.client(
    service_name='s3',
    endpoint_url='https://storage.yandexcloud.net',
    **S3_CREDS)

download_files(
    client,
    bucket,
    "from-s3-folder",
    file_name
)


In [6]:
%pip install pyarrow==20.0.*

Defaulting to user installation because normal site-packages is not writeable
Collecting pyarrow==20.0.*
  Downloading pyarrow-20.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Downloading pyarrow-20.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (42.3 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.3/42.3 MB[0m [31m96.1 MB/s[0m eta [36m0:00:00[0m:00:01[0m0:01[0m
[?25hInstalling collected packages: pyarrow
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pandas-gbq 0.17.9 requires pyarrow<10.0dev,>=3.0.0, but you have pyarrow 20.0.0 which is incompatible.[0m[31m
[0mSuccessfully installed pyarrow-20.0.0

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32

In [7]:
import pyarrow.parquet as pq
parquet_file = pq.ParquetFile('./from-s3-folder/part-0.parquet')

In [22]:
parquet_file.metadata

<pyarrow._parquet.FileMetaData object at 0x7f0e650358f0>
  created_by: parquet-cpp-arrow version 16.1.0
  num_columns: 213
  num_rows: 3191103
  num_row_groups: 1729
  format_version: 2.6
  serialized_size: 38489415

In [36]:
parquet_file.schema

<pyarrow._parquet.ParquetSchema object at 0x7f0e650486c0>
required group field_id=-1 schema {
  optional binary field_id=-1 inn (String);
  optional binary field_id=-1 ogrn (String);
  optional binary field_id=-1 region (String);
  optional binary field_id=-1 region_taxcode (String);
  optional int32 field_id=-1 creation_date (Date);
  optional int32 field_id=-1 dissolution_date (Date);
  optional double field_id=-1 age;
  optional double field_id=-1 eligible;
  optional binary field_id=-1 exemption_criteria (String);
  optional double field_id=-1 financial;
  optional double field_id=-1 filed;
  optional double field_id=-1 imputed;
  optional double field_id=-1 simplified;
  optional double field_id=-1 articulated;
  optional double field_id=-1 totals_adjustment;
  optional double field_id=-1 outlier;
  optional binary field_id=-1 okved (String);
  optional binary field_id=-1 okved_section (String);
  optional binary field_id=-1 okpo (String);
  optional binary field_id=-1 okopf (Stri

In [53]:
import json

In [34]:
s = []
for i in range(213):
    col = parquet_file.schema.column(i)
    t = col.physical_type.lower()
    if t.startswith('byte'):
        t = 'string'
    s.append(dict(name=col.name, type=t, required=False))
with open('out.txt', 'w') as f:
    f.write(json.dumps(s))
    

<a id='section-id1'></a>
## 1. Working with data sources

DataSphere includes the ability to handle all major data sources. Using standard Jupyter Notebook tools, you can copy existing notebooks and data from your local machine.
You can work with Git, Yandex.Disk, Google Drive, S3, FTP, and Spark. To make it easier to work with all data sources, DataSphere has Snippets: examples of code describing the handling of data sources. You can also find connection examples in [our documentation](https://cloud.yandex.ru/docs/datasphere/operations/#data-source).

![](https://storage.yandexcloud.net/onboarding-notebooks/screenshots/snippets.png)

In [None]:
# For example, you can get a cell like this if you select Snippets -> S3 -> Get file.py in the notebook top panel

# %pip install boto3 if needed
# set os environment variables aws_access_key_id and aws_secret_access_key

import boto3
import os
from pathlib import Path


def download_files(s3_client, bucket_name: str, local_path: str, file_name: str) -> None:
    local_path = Path(local_path)

    local_path.mkdir(parents=True, exist_ok=True)

    file_path = Path.joinpath(local_path, file_name)
    file_path.parent.mkdir(parents=True, exist_ok=True)
    s3_client.download_file(
        bucket_name,
        file_name,
        str(file_path)
    )


S3_CREDS = {
    "aws_access_key_id": os.environ['aws_access_key_id'],
    "aws_secret_access_key": os.environ['aws_secret_access_key']
}

bucket = "my_bucket"
file_name = 'path/to/file'

client = boto3.client(
    service_name='s3',
    endpoint_url='https://storage.yandexcloud.net',
    **S3_CREDS)

download_files(
    client,
    bucket,
    "from-s3-folder",
    file_name
)


In DataSphere, you can work with Git repositories, both local and remote. 
All steps are described in detail in [our documentation](https://cloud.yandex.ru/docs/datasphere/operations/projects/work-with-git).

DataSphere works as a service and uses the Python shell for file system access. No terminal is available in DataSphere, but you can run Bash commands. To do this, specify the `#!:bash` prefix in the cell header.

In [1]:
#!:bash 

# For example, you can perform an operation like this on a text string
echo "This is a test" | sed 's/a test/yet another test/'

This is yet another test


<a id='section-id2'></a>

## 2. Datasets

Datasets are a convenient way of storing large sets of data that do not need to be modified during computations. Datasets can store up to 4 TB of data and provide faster access than project storage. 

You must populate a dataset immediately upon creation and initialization. After initialization, a dataset will become read-only.

You can view all datasets of a project in the **Dataset** section in the project resources. You can create and initialize a dataset in a cell with the `#pragma dataset init` command. 

Here is the template you will get if you select **Snippets -> Datasets -> Create custom dataset.py** in the top menu. The minimum size of a new dataset is 1 GB.

In [None]:
#pragma dataset init DATASET_NAME --size 1Gb

# TODO: fill dataset here
# Dataset will be created in /home/jupyter/mnt/datasets/DATASET_NAME


Datasets can be populated from files using a link as well as file storage objects. For dataset creation examples, see [our documentation](https://cloud.yandex.ru/docs/datasphere/concepts/dataset).

<a id='section-id3'></a>

## 3. Sharing output

#### Publishing a notebook

You can export a notebook as a link to an HTML report. The link is active for a week.

![](https://storage.yandexcloud.net/onboarding-notebooks/screenshots/export-notebook.png)

#### Exporting a notebook to external projects

To export a notebook to external projects, you can use the procedure for working with Git as described above. 
You can also export a project as a ZIP archive.

<a id='section-id4'></a>

## 4. More about Yandex DataSphere
We have a detailed [documentation](https://cloud.yandex.ru/docs/datasphere/) available.

**Videos and demos**

- Presentation of a new DataSphere version with an updated interface: [Yandex DataSphere: New UI and collaborative ML development capabilities
](https://www.youtube.com/watch?v=xzEW5g7WVd4&themeRefresh=1).

- Enabling collaboration through a community and projects: [DataSphere features for distributed ML teams
](https://www.youtube.com/watch?v=xM0qdz5wJdE) 

**Projects implemented in Yandex DataSphere**

- Environmental monitoring of the Lake Baikal to forecast its state and impact of climate change on its ecosystem, as well as measure fish populations: [About project](https://cloud.yandex.ru/special/baikal/), [Habr](https://habr.com/ru/companies/yandex/articles/689592/), [GitHub](https://github.com/baikal-zooplankton)

- Forecasting the El Niño natural anomaly in the Pacific Ocean: [About project](https://cloud.yandex.ru/blog/posts/2023/04/el-nino)

Join the <a href="https://t.me/yandex_datasphere">DataSphere community chat on Telegram</a>! 

In [1]:
%pip install polar seaborn pandas scikit-learn scipy matplotlib numpy nltk -U

Defaulting to user installation because normal site-packages is not writeable

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m


In [2]:
import pyarrow.dataset as ds
import polar
RFSD = ds.dataset("from-s3-folder", partitioning="hive")

ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

In [51]:
RFSD_2023_crit = pl.from_arrow(
    RFSD.to_table(
        filter=ds.field('year') == 2023,
        columns=['inn', 'exemption_criteria']
        )
)


NameError: name 'pl' is not defined