# Getting started

To start using Ibis, you need a Python environment with Ibis installed. If you're running through this tutorial on your own machine (rather than on Binder), please follow the [installation instructions](https://ibis-project.org/install/) to set up an environment.

You'll also need access to the `geography.db` database hosted [here](https://storage.googleapis.com/ibis-tutorial-data/geography.db). Every notebook in the tutorial starts with the following code to download the database if it doesn't already exist.

In [None]:
from tutorial_utils import setup
setup()
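
The source of `tutorial_utils.setup` is not shown in this notebook. As a rough idea of what a download-if-missing helper could look like, here is a minimal sketch that uses only the standard library. The function name `ensure_database` and its parameters are hypothetical and are not part of the tutorial's helper module.

```python
import urllib.request
from pathlib import Path

# URL of the tutorial database, as linked above.
DATA_URL = "https://storage.googleapis.com/ibis-tutorial-data/geography.db"


def ensure_database(path="geography.db", url=DATA_URL):
    """Download the file at `url` to `path` unless `path` already exists.

    Hypothetical helper for illustration only; the real `setup()` may
    behave differently.
    """
    target = Path(path)
    if not target.exists():
        # urlretrieve writes the response body directly to the target path.
        urllib.request.urlretrieve(url, str(target))
    return target
```

Calling `ensure_database()` a second time would skip the download, because the file is already present.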

You should now have `ibis` and the tutorial data all set up. We're ready to get started. First, let's import `ibis`.

In [None]:
import ibis

To make things easier in this tutorial, we will be using Ibis's "interactive mode". This is the recommended mode for interactive or iterative work with `ibis`. When deploying production code, you'll typically run in non-interactive mode. Ibis's non-interactive mode is covered in more detail in [a later notebook](./03-Expressions-Lazy-Mode-Logging.ipynb).

To enable interactive mode, run:

In [None]:
ibis.options.interactive = True

Next, we need to create a connection object. The connection defines where the data is stored and where the computations will be performed.

To compare with pandas: the connection is not merely the place the data is imported from (as with `pandas.read_sql`). pandas loads the data into memory and performs the computations itself. Ibis neither loads the data nor performs any computation itself; instead, it leaves the data in the backend defined by the connection and _asks_ the backend to perform the computations.
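
To make the distinction concrete, the sketch below uses only Python's standard `sqlite3` module rather than Ibis, with a small made-up in-memory table (the table name `toy` and its rows are hypothetical). The first approach pulls every row into Python and computes there. The second sends the computation to the database and retrieves only the result, which is the style of work Ibis delegates to its backend.

```python
import sqlite3

# Hypothetical toy data, not the tutorial's geography.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO toy VALUES (?, ?)",
    [("A", 10), ("B", 20), ("C", 30)],
)

# Load-then-compute: every row travels to Python before being summed.
rows = conn.execute("SELECT name, population FROM toy").fetchall()
total_in_python = sum(population for _, population in rows)

# Ask-the-backend: the database computes the sum and returns one value.
(total_in_backend,) = conn.execute("SELECT SUM(population) FROM toy").fetchone()

print(total_in_python, total_in_backend)
conn.close()
```

Both approaches give the same total, but the second one moves a single value out of the database instead of the whole table.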

In this tutorial we will use a SQLite connection for its simplicity (no installation is needed). Ibis can also work with many other backends, including big data systems and GPU-accelerated analytical databases, as well as most common relational databases (PostgreSQL, MySQL, ...).

In [None]:
connection = ibis.sqlite.connect('geography.db')

### Exploring the data

To list the tables in the `connection` object, we can use the `.list_tables()` method. If you are using Jupyter, you can see all the methods and attributes of the `connection` object by writing `connection.` and pressing the `<TAB>` key.

In [None]:
connection.list_tables()

These two tables include data about countries, and about GDP by country and year.

The data in the `countries` table was obtained from [GeoNames](https://www.geonames.org/countries/).
The GDP table will be used in the next tutorial; its data was obtained from the
[World Bank website](https://data.worldbank.org/indicator/NY.GDP.MKTP.CD).

Next, we want to access a specific table in the database. We can create a handle to the `countries` table with:

In [None]:
countries = connection.table('countries')

To list the columns of the `countries` table, we can use the `columns` attribute.

Again, Jupyter users can see all the methods and attributes of the `countries` object by typing `countries.` and pressing `<TAB>`.

In [None]:
countries.columns

We can now look at a sample of the data. Let's focus on the `name`, `continent` and `population` columns to start with. We can display the values of these columns with:

In [None]:
countries['name', 'continent', 'population']

The table is too big for all the results to be displayed, and we probably don't want to see all of them at once anyway. For this reason, only the beginning and the end of the results are displayed. Often, the number of rows will be so large that this operation could take a long time.

To check how many rows a table has, we can use the `.count()` method:

In [None]:
countries.count()

To fetch just a subset of the rows, we can use the `.limit(n)` method, where `n` is the number of rows we want. In this case we will fetch the first `3` countries from the table:

In [None]:
countries['name', 'continent', 'population'].limit(3)

### Filters and order

Now that we have a feel for the data available in the `countries` table, we will extract some information from it by applying filters and sorting the data.

Let's focus on a single continent. We can see a list of unique continents in the table using the `.distinct()` method:

In [None]:
countries[['continent']].distinct()

We will focus on Asia (`AS` in the table). We can identify which rows belong to Asian countries using the standard Python `==` operator:

In [None]:
countries['continent'] == 'AS'

The result has the value `True` for rows where the condition holds, and `False` for rows where it does not.

We can pass this expression to the `.filter()` method, and save the result in the variable `asian_countries` for future use.

In [None]:
asian_countries = countries['name', 'continent', 'population'].filter(
    countries['continent'] == 'AS'
)
asian_countries
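
For readers who know SQL, a filter like this corresponds roughly to a `WHERE` clause. The sketch below shows that analogy using Python's standard `sqlite3` module and a made-up in-memory table (the rows are hypothetical and are not taken from `geography.db`). The exact SQL that Ibis sends to the backend may differ.

```python
import sqlite3

# Hypothetical toy rows standing in for the real countries table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE countries (name TEXT, continent TEXT, population INTEGER)"
)
conn.executemany(
    "INSERT INTO countries VALUES (?, ?, ?)",
    [("X", "AS", 5), ("Y", "EU", 7), ("Z", "AS", 9)],
)

# Select three columns and keep only rows whose continent is 'AS'.
# ORDER BY name is added only to make the output order deterministic.
asian_rows = conn.execute(
    "SELECT name, continent, population "
    "FROM countries WHERE continent = ? ORDER BY name",
    ("AS",),
).fetchall()

print(asian_rows)
conn.close()
```

Only the rows matching the condition come back from the database; the non-matching row is never transferred.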

We can check how many countries exist in Asia (based on the information in the database) by using the `.count()` method we've already seen:

In [None]:
asian_countries.count()

Next, we want to find the most populous countries in Asia. To obtain them, we are going to sort the countries by the `population` column and fetch just the first 10. To sort by a column in Ibis, we can use the `.order_by()` method:

In [None]:
asian_countries.order_by('population').limit(10)

This returns the least populated countries, since `.order_by()` sorts in ascending order by default (as in `1, 2, 3, 4`). This behavior is consistent with SQL `ORDER BY`.
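
The SQL side of that comparison can be checked with a small sketch using Python's standard `sqlite3` module and a made-up in-memory table (the table name `toy` and its rows are hypothetical). It shows that a plain `ORDER BY` gives the same result as an explicit `ORDER BY ... ASC`.

```python
import sqlite3

# Hypothetical toy data, inserted deliberately out of order.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO toy VALUES (?, ?)",
    [("A", 30), ("B", 10), ("C", 20)],
)

# ORDER BY without a direction keyword sorts in ascending order.
default_order = [
    name for (name,) in conn.execute("SELECT name FROM toy ORDER BY population")
]
# Spelling out ASC gives the same ordering.
explicit_asc = [
    name for (name,) in conn.execute("SELECT name FROM toy ORDER BY population ASC")
]

print(default_order)
print(default_order == explicit_asc)
conn.close()
```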

To sort in descending order, we can use `ibis.desc()`:

In [None]:
asian_countries.order_by(ibis.desc('population')).limit(10)

This is the list of the 10 most populous countries in Asia, based on the data from [GeoNames](https://www.geonames.org/).

To learn more about Ibis, continue to the next tutorial.