# Data

This notebook contains information about how to use the files that are either:

* Included with the AIQC Python package (a few KB).
* Remotely stored in the AIQC GitHub repository.

These files are used throughout the documentation because they are well suited to classification (binary and multi-class) and regression models.

## Prerequisites

If you've already completed the instructions on the [Installation](installation.html) page, then let's get started.

In [2]:
import aiqc
from aiqc import datum

> The module is named `datum` so that it does not overlap with commonly used words like 'data', 'files', or 'datasets'.

## Prepackaged Local Data

The `list_datums()` method provides metadata about each file that is included in the package, so that you can find one that suits your purposes.

> By default it returns a Pandas DataFrame, but you can call `list_datums(format='list')` to get a list instead.

In [3]:
datum.list_datums()

| | name | dataset_type | analysis_type | label | label_classes | features | samples | description | location |
|---|---|---|---|---|---|---|---|---|---|
| 0 | exoplanets.parquet | tabular | regression | SurfaceTempK | | 8 | 433 | Predict temperature of exoplanet. | local |
| 1 | heart_failure.parquet | tabular | regression | died | 2.0 | 12 | 299 | Biometrics to predict loss of life. | local |
| 2 | iris.tsv | tabular | classification_multi | species | 3.0 | 4 | 150 | 3 species of flowers. Only 150 rows, so cross-folds not represent population. | local |
| 3 | sonar.csv | tabular | classification_binary | object | 2.0 | 60 | 208 | Detecting either a rock "R" or mine "M". Each feature is a sensor reading. | local |
| 4 | houses.csv | tabular | regression | price | | 12 | 506 | Predict the price of the house. | local |
| 5 | iris_noHeaders.csv | tabular | classification multi | species | 3.0 | 4 | 150 | For testing; no column names. | local |
| 6 | iris_10x.tsv | tabular | classification multi | species | 3.0 | 4 | 1500 | For testing; duplicated 10x so cross-folds represent population. | local |
| 7 | brain_tumor.csv | image | classification_binary | status | 2.0 | 1 color x 160 wide x 120 tall | 80 | csv acts as label and manifest of image urls. | remote |
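
Once you have this metadata, picking a file is a matter of filtering on its columns. The sketch below assumes that the list format yields one record per file, keyed by the column names in the table above; that record shape is an assumption, not something this page confirms. The `datums` list is a hand-copied stand-in for a few rows, and `find_datums` and `min_samples` are hypothetical names used only for illustration.

```python
# Stand-in for a few rows of the metadata table above, copied by hand.
# Assumption: each record is a dict keyed by the table's column names.
datums = [
    {"name": "exoplanets.parquet", "analysis_type": "regression", "samples": 433},
    {"name": "iris.tsv", "analysis_type": "classification_multi", "samples": 150},
    {"name": "sonar.csv", "analysis_type": "classification_binary", "samples": 208},
    {"name": "houses.csv", "analysis_type": "regression", "samples": 506},
]


def find_datums(records, analysis_type, min_samples=0):
    """Return the names of records with the given analysis type and enough samples."""
    return [
        record["name"]
        for record in records
        if record["analysis_type"] == analysis_type
        and record["samples"] >= min_samples
    ]


# Regression files with at least 500 samples.
regression_files = find_datums(datums, "regression", min_samples=500)
print(regression_files)
```

With the default DataFrame format, the same selection would normally be written as a boolean mask over the `analysis_type` and `samples` columns.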


> The location where Python packages are installed varies from system to system so `pkg_resources` is used to find the location of these files dynamically.
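
To illustrate the idea of locating bundled files dynamically, here is a minimal sketch. It uses the standard library's `importlib.resources` (Python 3.9+) rather than `pkg_resources`, so it shows the general technique and not AIQC's own implementation. The helper name `bundled_resource` is hypothetical, and the standard library's `json` package stands in for AIQC so that the example runs without it installed.

```python
from importlib import resources


def bundled_resource(package, filename):
    """Return a handle to a file shipped inside an installed package."""
    # files() resolves the package's install location at runtime,
    # so no site-packages path is ever hardcoded.
    return resources.files(package).joinpath(filename)


# Demonstrated on the standard library's `json` package. The AIQC equivalent
# would presumably be bundled_resource("aiqc", "data/houses.csv"), but that
# package layout is an assumption.
resource = bundled_resource("json", "__init__.py")
print(resource.is_file())
```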

Using the value of the `name` column, you can fetch that file via `to_pandas()`.

In [3]:
df = datum.to_pandas(name='houses.csv')
df.head()

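| | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | lstat | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 5.33 | 36.2 |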


Alternatively, if you prefer to work directly with the file itself, you can obtain its location via the `get_path()` method.

In [4]:
datum.get_path('houses.csv')

'/Users/layne/.pyenv/versions/3.8.7/envs/jupyterlab3/lib/python3.8/site-packages/aiqc/data/houses.csv'
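
With a path in hand, any file-based tool can read the data. As a rough sketch, the snippet below reads the first rows of a CSV using only the standard library. The helper `read_rows` is hypothetical, and a small temporary file with three of the `houses.csv` columns stands in for the real file so that the example runs without AIQC installed.

```python
import csv
import itertools
import os
import tempfile


def read_rows(path, limit=5):
    """Read up to `limit` rows of a CSV file as dicts keyed by the header row."""
    with open(path, newline="") as handle:
        return list(itertools.islice(csv.DictReader(handle), limit))


# Stand-in file: three columns and two rows borrowed from the table above.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "houses.csv")
    with open(path, "w", newline="") as handle:
        handle.write("crim,zn,price\n0.00632,18.0,24.0\n0.02731,0.0,21.6\n")
    rows = read_rows(path, limit=1)

print(rows)
```

Note that `csv.DictReader` returns every value as a string, so numeric columns need converting before use.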

---

## Remote Data

To avoid bloating the package, larger dummy datasets are *not* included in it. These kinds of datasets consist of many large files.

They are located within the repository at `https://github.com/aiqc/aiqc/remote_datum`.

> If you want to fetch them on your own, you can do so by appending `?raw=true` to the end of an individual file's URL.
>
> Otherwise, their locations are hardcoded into the datum methods.
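
As a small illustration of the `?raw=true` convention, the following sketch appends the parameter to a file's URL using the standard library. The helper name `to_raw_url` is hypothetical; the example URL is one of the image locations from the `brain_tumor.csv` manifest with its query string removed.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def to_raw_url(file_url):
    """Add raw=true to a GitHub file URL, keeping any existing query parameters."""
    parts = urlsplit(file_url)
    query = dict(parse_qsl(parts.query))
    query["raw"] = "true"
    return urlunsplit(parts._replace(query=urlencode(query)))


page_url = (
    "https://github.com/aiqc/aiqc/blob/file_schema/remote_datum/"
    "image/brain_tumor/images/healthy_0.jpg"
)
raw_url = to_raw_url(page_url)
print(raw_url)
```

Parsing the URL instead of concatenating strings means the helper is safe to call on a URL that already ends in `?raw=true`.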

We'll use the `brain_tumor.csv` file as an example.

In [5]:
df = datum.to_pandas(name='brain_tumor.csv')
df.head()

| | status | url |
|---|---|---|
| 0 | 0 | https://github.com/aiqc/aiqc/blob/file_schema/remote_datum/image/brain_tumor/images/healthy_0.jpg?raw=true |
| 1 | 0 | https://github.com/aiqc/aiqc/blob/file_schema/remote_datum/image/brain_tumor/images/healthy_1.jpg?raw=true |
| 2 | 0 | https://github.com/aiqc/aiqc/blob/file_schema/remote_datum/image/brain_tumor/images/healthy_2.jpg?raw=true |
| 3 | 0 | https://github.com/aiqc/aiqc/blob/file_schema/remote_datum/image/brain_tumor/images/healthy_3.jpg?raw=true |
| 4 | 0 | https://github.com/aiqc/aiqc/blob/file_schema/remote_datum/image/brain_tumor/images/healthy_4.jpg?raw=true |


This file acts as a manifest for multi-modal datasets in that each row of this CSV represents a sample:

* The `'status'` column of this file serves as the Label of that sample. We'll construct a `Dataset.Tabular` from this.
* Meanwhile, the `'url'` column acts as a manifest in that it contains the URL of the image file for that sample. We'll construct a `Dataset.Image` from this.
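
The split described above can be sketched with the standard library alone. This is only an illustration of how one manifest yields two aligned sequences, not how AIQC ingests the file. The helper `split_manifest` is hypothetical, and the first two manifest rows are inlined so that the example runs offline.

```python
import csv
import io

# The first two rows of the manifest, inlined so the sketch runs offline.
manifest_text = """status,url
0,https://github.com/aiqc/aiqc/blob/file_schema/remote_datum/image/brain_tumor/images/healthy_0.jpg?raw=true
0,https://github.com/aiqc/aiqc/blob/file_schema/remote_datum/image/brain_tumor/images/healthy_1.jpg?raw=true
"""


def split_manifest(text):
    """Split manifest rows into parallel lists of labels and image URLs."""
    labels, urls = [], []
    for row in csv.DictReader(io.StringIO(text)):
        labels.append(int(row["status"]))
        urls.append(row["url"])
    return labels, urls


labels, urls = split_manifest(manifest_text)
print(labels)
print(len(urls))
```

Because both lists are built in the same pass, position `i` in `labels` always describes the image at position `i` in `urls`.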



In [6]:
img_urls = datum.get_remote_urls('brain_tumor.csv')
img_urls[:3]

['https://github.com/aiqc/aiqc/blob/file_schema/remote_datum/image/brain_tumor/images/healthy_0.jpg?raw=true',
 'https://github.com/aiqc/aiqc/blob/file_schema/remote_datum/image/brain_tumor/images/healthy_1.jpg?raw=true',
 'https://github.com/aiqc/aiqc/blob/file_schema/remote_datum/image/brain_tumor/images/healthy_2.jpg?raw=true']

At this point, you can use either the [high](api_high_level.html) or [low](api_low_level.html) level AIQC APIs (e.g. `aiqc.Dataset.Image.from_urls()`) to ingest that data and work with it as normal.