In [None]:
# Upgrade Oracle ADS to pick up latest features and maintain compatibility with Oracle Cloud Infrastructure.

!pip install -U oracle-ads

Oracle Data Science service sample notebook.

Copyright (c) 2020, 2022 Oracle, Inc. All rights reserved. Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).

---

# <font color="red">Loading Data from Different Sources Using Pandas and Dask</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color="teal">Oracle Cloud Infrastructure Data Science Service.</font></p>

---


Compatible conda pack: [General Machine Learning](https://docs.oracle.com/en-us/iaas/data-science/using/conda-generalml-fam.htm) for CPU on Python 3.8 (version 1.0)

In [None]:
import ads
import logging
import numpy as np
import os
import pandas as pd
import shutil
import tempfile
import warnings
from ads.dataset.factory import DatasetFactory
from os import path

warnings.filterwarnings("ignore")
logging.basicConfig(format="%(levelname)s:%(message)s", level=logging.INFO)

<a id='src'></a>
## Loading Datasets From Various Sources

Data can be loaded into ADS in several ways. Supported sources include a local or network file system, Hadoop Distributed File System (HDFS), Oracle Object Storage, Amazon S3, Google Cloud Storage, Azure Blob Storage, Oracle Database, Autonomous Data Warehouse (ADW), Elasticsearch, NoSQL databases, MongoDB, and more. This notebook demonstrates some of the more common data sources; the same approach generalizes to the others.
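
The simplest of these sources is the local file system. A minimal sketch, assuming a throwaway CSV written to a temporary directory (the file name `sample.csv` and its contents are made up for illustration):

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data, written to a temporary directory for illustration.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sample.csv")
    pd.DataFrame({"id": [1, 2, 3], "fare": [7.25, 71.28, 8.05]}).to_csv(
        csv_path, index=False
    )

    # Read the file back while the temporary directory still exists.
    local_df = pd.read_csv(csv_path)

print(local_df.shape)
```

The remote examples in this notebook use the same `pd.read_csv` call with a different path and credentials.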



<a id='Object Storage'></a>
### Oracle Cloud Infrastructure Object Storage

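Use `pandas` to load data from Object Storage. The file is addressed with a URI of the form `oci://<bucket>@<namespace>/<path>`, and `storage_options` carries the credentials returned by `ads.common.auth.default_signer()`.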

In [None]:
bucket_name = "hosted-ds-datasets"
namespace = "bigdatadatasciencelarge"


file_name = "titanic/titanic.csv"
df = pd.read_csv(
    f"oci://{bucket_name}@{namespace}/{file_name}",
    storage_options=ads.common.auth.default_signer(),
)
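
When several objects are read from the same bucket, the f-string above can be wrapped in a small function. A minimal sketch, assuming a hypothetical helper named `oci_uri` (it is not part of ADS or pandas):

```python
def oci_uri(bucket: str, namespace: str, object_path: str) -> str:
    """Build an Object Storage URI of the form oci://<bucket>@<namespace>/<path>."""
    # Hypothetical helper for illustration; a leading slash on the object path
    # is dropped so the URI does not contain a double slash.
    return f"oci://{bucket}@{namespace}/{object_path.lstrip('/')}"


uri = oci_uri("hosted-ds-datasets", "bigdatadatasciencelarge", "titanic/titanic.csv")
print(uri)
```

The resulting string can be passed to `pd.read_csv` together with `storage_options`, exactly as in the cell above.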

In [None]:
df.head()

Create an `ads.dataset.dataset.ADSDataset` object from the pandas dataframe to visualize and analyze the data.

In [None]:
ds = DatasetFactory.from_dataframe(df)

The `ds` variable is an object from the class `ads.dataset.dataset.ADSDataset`. Objects of this class have a method `show_in_notebook` that provides a wealth of exploratory data analysis (EDA) information. It displays summary statistics, correlations, visualizations, and warnings about the condition of the data.

In [None]:
ds.show_in_notebook()
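
For comparison, some of the same information can be computed by hand with plain pandas. This is only a rough sketch of summary statistics, missing-value counts, and correlations, not a reproduction of what `show_in_notebook` renders; the sample dataframe below is made up:

```python
import pandas as pd

# Made-up sample data; substitute the dataframe loaded earlier.
sample = pd.DataFrame(
    {
        "age": [22.0, 38.0, None, 35.0],
        "fare": [7.25, 71.28, 7.92, 53.10],
        "sex": ["male", "female", "female", "female"],
    }
)

# Summary statistics for the numeric columns.
summary = sample.describe()

# Number of missing values in each column.
missing = sample.isna().sum()

# Pairwise Pearson correlations between the numeric columns.
correlations = sample.select_dtypes("number").corr()

print(summary)
print(missing)
print(correlations)
```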

#### Loading files from a folder using glob patterns

Sometimes you have a folder full of CSV or Parquet files. Use [`dask`](https://www.dask.org/) to read them efficiently from an Object Storage path.

In [None]:
import dask.dataframe as dd

bucket_name = "hosted-ds-datasets"
namespace = "bigdatadatasciencelarge"


ddf = dd.read_parquet(
    f"oci://{bucket_name}@{namespace}/nyc_tlc/2017/",
    storage_options=ads.common.auth.default_signer(),
    engine="pyarrow",
)

# Materialize the lazy dask dataframe as an in-memory pandas dataframe.
df = ddf.compute()

In [None]:
df.head()
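
The idea behind a glob pattern is to match many files with one expression and treat them as a single table. The sketch below illustrates this eagerly with the standard-library `glob` module and plain pandas on a temporary local folder, so it runs without `dask` or Object Storage access; the file names and contents are made up:

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write three small, made-up partitions to a temporary folder.
    for i in range(3):
        part = pd.DataFrame({"part": [i, i], "value": [i * 10, i * 10 + 1]})
        part.to_csv(os.path.join(tmp_dir, f"part-{i}.csv"), index=False)

    # Expand the glob pattern, then read and concatenate the matching files.
    paths = sorted(glob.glob(os.path.join(tmp_dir, "part-*.csv")))
    combined = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)

print(len(paths), combined.shape)
```

With `dask`, a glob string such as `part-*.csv` can be passed directly to `dd.read_csv`, and the files are read lazily, partition by partition, instead of all at once.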

<a id='adb'></a>
### Oracle Autonomous Database

`oracle-ads` provides drop-in replacements for `pandas.read_sql` and `pandas.DataFrame.to_sql` for reading and writing data in Oracle Database, Big Data Service Hive, and MySQL. To learn more, read the [user guide](https://accelerated-data-science.readthedocs.io/en/latest/user_guide/loading_data/connect.html#oracle-database).

In [None]:
# If you are using a wallet file, provide the path to the wallet zip file as `wallet_location`.
connection_parameters = {
    "user_name": "<username>",
    "password": "<password>",
    "service_name": "<service_name_{high|med|low}>",
    "wallet_location": "/full/path/to/my_wallet.zip",
}
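
The values above are templates and must be replaced before connecting. A minimal sketch of a pre-flight check, assuming a hypothetical helper named `unfilled_keys` (not part of ADS) that reports which entries still look like `<placeholder>` text:

```python
import re

# Matches values that are still template placeholders such as "<username>".
PLACEHOLDER = re.compile(r"^<.*>$")


def unfilled_keys(params: dict) -> list:
    """Return the keys whose values still look like template placeholders."""
    # Hypothetical helper for illustration; not part of ADS.
    return [key for key, value in params.items() if PLACEHOLDER.match(str(value))]


example_parameters = {
    "user_name": "<username>",
    "password": "<password>",
    "service_name": "mydb_high",  # made-up service name
    "wallet_location": "/full/path/to/my_wallet.zip",
}

print(unfilled_keys(example_parameters))
```

An empty list means every placeholder has been replaced; it does not confirm that the credentials themselves are valid.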

In [None]:
import pandas as pd
import ads

# Simple read of a SQL query into a dataframe.
df = pd.DataFrame.ads.read_sql(
    "SELECT * FROM SH.SALES",
    connection_parameters=connection_parameters,
)

# The same query written as a multi-line string.
df = pd.DataFrame.ads.read_sql(
    """
    SELECT
    *
    FROM
    SH.SALES
    """,
    connection_parameters=connection_parameters,
)

<a id='adb'></a>
### Big Data Service Hive

Learn more about the options for reading and writing Hive data in the [user guide](https://accelerated-data-science.readthedocs.io/en/latest/user_guide/loading_data/connect.html#bds-hive).

In [None]:
connection_parameters = {
    "host": "<hive hostname>",
    "port": "<hive port number>",
}

In [None]:
import pandas as pd
import ads

# simple read of a SQL query into a dataframe with no bind variables
df = pd.DataFrame.ads.read_sql(
    "SELECT * FROM EMPLOYEE", connection_parameters=connection_parameters, engine="hive"
)

# read of a SQL query into a dataframe with a bind variable. Use bind variables
# rather than string substitution to avoid the SQL injection attack vector.
df = pd.DataFrame.ads.read_sql(
    """
    SELECT
    *
    FROM
    EMPLOYEE
    WHERE
        `emp_no` <= ?
    """,
    bind_variables=(1000,),
    connection_parameters=connection_parameters,
    engine="hive",
)
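
The effect of a bind variable can be reproduced without a Hive cluster. The sketch below uses the standard-library `sqlite3` module as a stand-in (it is not Hive and does not use ADS), with a made-up `employee` table, to show that a bound value is treated as data and never parsed as SQL:

```python
import sqlite3

# sqlite3 stands in for Hive here; the table and rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_no INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [(999, "Ada"), (1000, "Grace"), (1001, "Edsger")],
)

# The value is passed separately from the SQL text through the ? placeholder.
rows = conn.execute(
    "SELECT emp_no, name FROM employee WHERE emp_no <= ? ORDER BY emp_no",
    (1000,),
).fetchall()

# A malicious string bound the same way is compared as a literal name,
# so the OR clause inside it is never executed.
injected = conn.execute(
    "SELECT emp_no FROM employee WHERE name = ?",
    ("Ada' OR '1'='1",),
).fetchall()

conn.close()
print(rows, injected)
```

Had the malicious string been spliced into the query with string formatting, the `OR '1'='1'` clause would have matched every row.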

<a id="reference"></a>
# References

- [ADS Library Documentation](https://accelerated-data-science.readthedocs.io/en/latest/index.html)
- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)
- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)
- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)
- [scikit-learn](https://scikit-learn.org/stable/)