# DS-SF-25 | Codealong and Independent Practice 17 | Introduction to Databases

In [1]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 10)

import sqlite3

## Accessing databases from `pandas`

While databases provide many analytical capabilities, often it's useful to pull the data back into Python for more flexible programming.

Large, fixed operations are more efficient in a database, while `pandas` is better suited to interactive processing:
- E.g., if you just want to aggregate login or sales data for a report or dashboard, the operation runs over a large dataset and rarely changes, so it is a good fit for the database.
- However, if we want to investigate the login or sales data further and ask more interactive questions, Python comes in very handy.

`pandas` can be used to connect to most relational databases.

Here, we will create and connect to a `SQLite` database.  `SQLite` creates portable relational databases saved in a single file.

These databases are stored in a very efficient manner and allow fast querying, making them ideal for small databases or databases that need to be moved across machines.

We can create a `SQLite` database as follows:

In [2]:
db = sqlite3.connect('rossmann.db')

This creates a file, `rossmann.db`, which will store our SQL database.

## Writing data into a `SQLite` database

Data in `pandas` can be loaded into a relational database.  For the most part, `pandas` can use the DataFrame's column names and data types to infer the schema for the table it creates.

Let's return to the Rossmann sales data and load it into our database.

In [3]:
df = pd.read_csv(os.path.join('..', 'datasets', 'rossmann-sales.csv'),
                 skipinitialspace = True,
                 low_memory = False)

In [4]:
df

Unnamed: 0,Store,DayOfWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday
0,1,5,2015-07-31,5263,555,1,1,0,1
1,2,5,2015-07-31,6064,625,1,1,0,1
2,3,5,2015-07-31,8314,821,1,1,0,1
3,4,5,2015-07-31,13995,1498,1,1,0,1
4,5,5,2015-07-31,4822,559,1,1,0,1
...,...,...,...,...,...,...,...,...,...
1017204,1111,2,2013-01-01,0,0,0,0,a,1
1017205,1112,2,2013-01-01,0,0,0,0,a,1
1017206,1113,2,2013-01-01,0,0,0,0,a,1
1017207,1114,2,2013-01-01,0,0,0,0,a,1


Data is moved to the database with the `to_sql` method, similar to `to_csv`.

`to_sql` takes several arguments:
- `name` - the name of the table to create
- `con` - a connection to a database
- `index` - whether to write the DataFrame index as a column
- `schema` - the database schema (namespace) to write the table into, for databases that support schemas; column types can be set with the separate `dtype` argument
- `if_exists` - what to do if the table already exists: `'fail'` (the default), `'replace'`, or `'append'`

The following code loads the Rossmann sales data to our database:

In [5]:
df.to_sql(name = 'rossmann_sales',
          con = db,
          index = False,
          if_exists = 'replace')
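
To make the `if_exists` options concrete, here is a minimal sketch. It assumes an in-memory database and a tiny made-up table (`demo_sales`, `demo_df`, and `count_rows` are hypothetical names, not part of the Rossmann data) so it can be run without touching `rossmann.db`.

```python
import sqlite3

import pandas as pd

# In-memory database (an assumption for this sketch; the lesson uses rossmann.db).
demo_db = sqlite3.connect(':memory:')

# Tiny made-up frame standing in for the sales data.
demo_df = pd.DataFrame({'Store': [1, 2], 'Sales': [100, 200]})


def count_rows(con):
    """Return the number of rows currently in demo_sales."""
    result = pd.read_sql('SELECT COUNT(*) AS n FROM demo_sales;', con=con)
    return int(result['n'][0])


# The first write creates the table.
demo_df.to_sql(name='demo_sales', con=demo_db, index=False, if_exists='replace')

# 'append' adds the rows to the existing table.
demo_df.to_sql(name='demo_sales', con=demo_db, index=False, if_exists='append')
n_after_append = count_rows(demo_db)

# 'replace' drops the existing table and recreates it from the frame.
demo_df.to_sql(name='demo_sales', con=demo_db, index=False, if_exists='replace')
n_after_replace = count_rows(demo_db)

# 'fail' raises ValueError because the table already exists.
try:
    demo_df.to_sql(name='demo_sales', con=demo_db, index=False, if_exists='fail')
    failed = False
except ValueError:
    failed = True

print(n_after_append, n_after_replace, failed)  # prints: 4 2 True
```

Using `'replace'` in the codealong means the cell can be re-run without duplicating rows.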

Once we have data in the database, we can use `pandas` to query it.

Querying is done through the `read_sql` function in the `pandas.io.sql` module (also available as `pd.read_sql`).  E.g.,

In [6]:
pd.io.sql.read_sql('SELECT * ' +
                   'FROM rossmann_sales ' +
                   'LIMIT 10;', con = db)

Unnamed: 0,Store,DayOfWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday
0,1,5,2015-07-31,5263,555,1,1,0,1
1,2,5,2015-07-31,6064,625,1,1,0,1
2,3,5,2015-07-31,8314,821,1,1,0,1
3,4,5,2015-07-31,13995,1498,1,1,0,1
4,5,5,2015-07-31,4822,559,1,1,0,1
5,6,5,2015-07-31,5651,589,1,1,0,1
6,7,5,2015-07-31,15344,1414,1,1,0,1
7,8,5,2015-07-31,8492,833,1,1,0,1
8,9,5,2015-07-31,8565,687,1,1,0,1
9,10,5,2015-07-31,7185,681,1,1,0,1


This runs the query passed in and returns a dataframe with the results.
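
When part of a query comes from a Python variable, `read_sql` also accepts a `params` argument, so the value does not have to be pasted into the SQL string. The sketch below assumes an in-memory database and a made-up `demo_sales` table; the `?` placeholder is the style used by the `sqlite3` driver, and other drivers may use a different one.

```python
import sqlite3

import pandas as pd

# In-memory database with a tiny made-up table (hypothetical data).
demo_db = sqlite3.connect(':memory:')
pd.DataFrame({'Store': [1, 1, 2],
              'Sales': [5263, 5020, 6064]}).to_sql(name='demo_sales',
                                                   con=demo_db,
                                                   index=False)

# The value for the ? placeholder is passed separately through params.
store_id = 1
query = 'SELECT Store, Sales FROM demo_sales WHERE Store = ?;'
result = pd.read_sql(query, con=demo_db, params=(store_id,))

print(result)
```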

## Activity

1. Load the Rossmann Store metadata in `rossmann-stores.csv` and create a table in the database with it.

In [7]:
df2 = pd.read_csv(os.path.join('..', 'datasets', 'rossmann-stores.csv'),
                  skipinitialspace = True,
                  low_memory = False)

In [8]:
df2.to_sql(name = 'rossmann_stores',
           con = db,
           index = False,
           if_exists = 'replace')

In [9]:
pd.io.sql.read_sql('SELECT * ' +
                   'FROM rossmann_stores ' +
                   'LIMIT 10;', con = db)

Unnamed: 0,Store,StoreType,Assortment,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear,PromoInterval
0,1,c,a,1270.0,9.0,2008.0,0,,,
1,2,a,a,570.0,11.0,2007.0,1,13.0,2010.0,"Jan,Apr,Jul,Oct"
2,3,a,a,14130.0,12.0,2006.0,1,14.0,2011.0,"Jan,Apr,Jul,Oct"
3,4,c,c,620.0,9.0,2009.0,0,,,
4,5,a,a,29910.0,4.0,2015.0,0,,,
5,6,a,a,310.0,12.0,2013.0,0,,,
6,7,a,c,24000.0,4.0,2013.0,0,,,
7,8,a,a,7520.0,10.0,2014.0,0,,,
8,9,a,c,2030.0,8.0,2000.0,0,,,
9,10,a,a,3160.0,9.0,2009.0,0,,,


## SQL syntax | `SELECT`, `WHERE`, `GROUP BY`, `ORDER BY`, and `JOIN`

### `SELECT`

Every query should start with `SELECT`.  `SELECT` is followed by the names of the columns in the output.

`SELECT` is always paired with `FROM`, which identifies the table(s) to retrieve data from.

```SQL
SELECT <columns>
    FROM <table>;
```

`SELECT * FROM <table>` returns all of the columns of the table.

E.g.,

```SQL
SELECT Store, Sales
    FROM rossmann_sales;
```

In [None]:
pd.io.sql.read_sql('SELECT Store, Sales ' +
                   'FROM rossmann_sales;', con = db)

### Activity

1. Write a query for the Rossmann Sales data that returns Store, Date, and Customers.

In [None]:
# TODO

### `WHERE`

`WHERE` is used to filter a table using specific criteria.  The `WHERE` clause follows the `FROM` clause.

```SQL
SELECT <columns>
    FROM <table>
    WHERE <condition>;
```

The condition is some filter applied to the rows, where rows that match the condition will be output.

E.g.,

```SQL
SELECT Store, Sales
    FROM rossmann_sales
    WHERE Store = 1;
```

In [None]:
pd.io.sql.read_sql('SELECT Store, Sales ' +
                   'FROM rossmann_sales ' +
                   'WHERE Store = 1;', con = db)

E.g.,

```SQL
SELECT Store, Sales
    FROM rossmann_sales
    WHERE Store = 1 AND Open = 1;
```

In [None]:
pd.io.sql.read_sql('SELECT Store, Sales ' +
                   'FROM rossmann_sales ' +
                   'WHERE Store = 1 AND Open = 1;', con = db)
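
For comparison, the same filter can be written in `pandas` with a boolean mask. This is a minimal sketch on a made-up four-row table (`demo_sales` and `demo_df` are hypothetical), assuming an in-memory database.

```python
import sqlite3

import pandas as pd

demo_db = sqlite3.connect(':memory:')

# Made-up rows: each store has one open day and one closed day.
demo_df = pd.DataFrame({'Store': [1, 1, 2, 2],
                        'Sales': [5263, 0, 6064, 0],
                        'Open': [1, 0, 1, 0]})
demo_df.to_sql(name='demo_sales', con=demo_db, index=False)

# SQL: the WHERE clause keeps rows that satisfy both conditions.
from_sql = pd.read_sql('SELECT Store, Sales ' +
                       'FROM demo_sales ' +
                       'WHERE Store = 1 AND Open = 1;', con=demo_db)

# pandas: the same conditions combined with & in a boolean mask.
mask = (demo_df['Store'] == 1) & (demo_df['Open'] == 1)
from_pandas = demo_df.loc[mask, ['Store', 'Sales']].reset_index(drop=True)

print(from_sql)
print(from_pandas)
```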

### Activity

1. Write a query for the Rossmann Sales data that returns Store, Date, and Customers for stores that were open and running a promotion.

In [None]:
# TODO

### `GROUP BY`

`GROUP BY` allows us to aggregate over any field in the table by applying the split-apply-combine concept.

We identify some key with which we want to segment the rows.  Then, we roll up or compute some statistics over all of the rows that match that key.

`GROUP BY` is normally paired with an aggregate function in the `SELECT` clause: the statistic we want to compute over the rows of each group.

`COUNT(*)` counts all of the rows.  Other commonly available aggregate functions are `AVG` (average), `MAX`, `MIN`, and `SUM`.

If we want to aggregate over the entire table, without results specific to any key, we can use an aggregate function in the `SELECT` clause and omit the `GROUP BY` clause.

E.g.,

```SQL
SELECT Store, SUM(Sales), AVG(Customers)
    FROM rossmann_sales
    WHERE Open = 1
    GROUP BY Store;
```

In [None]:
pd.io.sql.read_sql('SELECT Store, SUM(Sales), AVG(Customers) ' +
                   'FROM rossmann_sales ' +
                   'WHERE Open = 1 ' +
                   'GROUP BY Store;', con = db)

E.g.,

```SQL
SELECT SUM(Sales), AVG(Customers)
    FROM rossmann_sales
    WHERE Open = 1;
```

In [None]:
pd.io.sql.read_sql('SELECT SUM(Sales), AVG(Customers) ' +
                   'FROM rossmann_sales ' +
                   'WHERE Open = 1;', con = db)
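
The difference between aggregating per key and aggregating over the whole table can be sketched on a tiny made-up table (`demo_sales` is hypothetical, and the database is assumed to be in memory). The `pandas` `groupby` line is included only to show the split-apply-combine counterpart.

```python
import sqlite3

import pandas as pd

demo_db = sqlite3.connect(':memory:')

# Made-up data: two rows per store.
demo_df = pd.DataFrame({'Store': [1, 1, 2, 2],
                        'Sales': [100, 300, 50, 150]})
demo_df.to_sql(name='demo_sales', con=demo_db, index=False)

# One output row per store.
per_store = pd.read_sql('SELECT Store, SUM(Sales) AS total_sales, COUNT(*) AS n_rows ' +
                        'FROM demo_sales ' +
                        'GROUP BY Store;', con=demo_db)

# No GROUP BY: a single output row for the whole table.
overall = pd.read_sql('SELECT SUM(Sales) AS total_sales, COUNT(*) AS n_rows ' +
                      'FROM demo_sales;', con=demo_db)

# The pandas counterpart of the per-store query.
pandas_per_store = demo_df.groupby('Store')['Sales'].sum()

print(per_store)
print(overall)
print(pandas_per_store)
```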

### Activity

1. Write a query that returns the total sales on promotion and non-promotion days.

In [None]:
# TODO

### `ORDER BY`

`ORDER BY` is used to sort the results of a query.

```SQL
SELECT <columns>
    FROM <table>
    WHERE <condition>
    ORDER BY <columns>;
```

You can order by multiple columns in ascending (`ASC`) or descending (`DESC`) order.

E.g.,

```SQL
SELECT Store, SUM(Sales) AS total_sales, AVG(Customers)
    FROM rossmann_sales
    WHERE Open = 1
    GROUP BY Store
    ORDER BY total_sales DESC;
```

`SUM(Sales) AS total_sales` renames the `SUM(Sales)` value to `total_sales` so we can refer to it later in the `ORDER BY` clause.

In [None]:
pd.io.sql.read_sql('SELECT Store, SUM(Sales) AS total_sales, AVG(Customers) ' +
                   'FROM rossmann_sales ' +
                   'WHERE Open = 1 ' +
                   'GROUP BY Store ' +
                   'ORDER BY total_sales DESC;', con = db)
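
Ordering by more than one column, each with its own direction, might look like the following sketch. The table and its values are made up (`demo_sales` is hypothetical) and the database is assumed to be in memory.

```python
import sqlite3

import pandas as pd

demo_db = sqlite3.connect(':memory:')

# Made-up rows, deliberately out of order.
pd.DataFrame({'Store': [2, 1, 2, 1],
              'DayOfWeek': [5, 5, 4, 4],
              'Sales': [6064, 5263, 6000, 5020]}).to_sql(name='demo_sales',
                                                         con=demo_db,
                                                         index=False)

# Sort by Store ascending, then by Sales descending within each store.
ordered = pd.read_sql('SELECT Store, DayOfWeek, Sales ' +
                      'FROM demo_sales ' +
                      'ORDER BY Store ASC, Sales DESC;', con=demo_db)

print(ordered)
```

The first column listed in `ORDER BY` takes priority; later columns only break ties.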

### `JOIN`

`JOIN` allows us to access data across many tables.  We specify how a row in one table links to another.

```SQL
SELECT a.Store, a.Sales, s.CompetitionDistance
    FROM rossmann_sales AS a
    JOIN rossmann_stores AS s
    ON a.Store = s.Store;
```

Here, `JOIN` on its own denotes an inner join, and `ON` gives the condition used to match rows between the two tables.

An inner join returns a row only when there is a match in both tables.

If we want to keep the rows of one table even if there is no matching counterpart, we can perform an outer join.

Outer joins can be `LEFT`, `RIGHT`, or `FULL`, meaning keep all of the left rows, all of the right rows, or all of the rows from both tables, respectively.  Note that `SQLite` versions before 3.39 support only `LEFT` outer joins.

In [None]:
pd.io.sql.read_sql('SELECT a.Store, a.Sales, s.CompetitionDistance ' +
                   'FROM rossmann_sales AS a ' +
                   'JOIN rossmann_stores AS s ' +
                   'ON a.Store = s.Store;', con = db)
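
To see how an inner join and a left outer join differ, consider this minimal sketch. It assumes an in-memory database and two made-up tables (`demo_sales` and `demo_stores` are hypothetical) in which store 3 has sales but no metadata row.

```python
import sqlite3

import pandas as pd

demo_db = sqlite3.connect(':memory:')

# Made-up tables: store 3 appears in sales but not in stores.
pd.DataFrame({'Store': [1, 2, 3],
              'Sales': [5263, 6064, 8314]}).to_sql(name='demo_sales',
                                                   con=demo_db,
                                                   index=False)
pd.DataFrame({'Store': [1, 2],
              'CompetitionDistance': [1270.0, 570.0]}).to_sql(name='demo_stores',
                                                              con=demo_db,
                                                              index=False)

# Inner join: only stores present in both tables are returned.
inner = pd.read_sql('SELECT a.Store, a.Sales, s.CompetitionDistance ' +
                    'FROM demo_sales AS a ' +
                    'JOIN demo_stores AS s ' +
                    'ON a.Store = s.Store;', con=demo_db)

# Left outer join: every sales row is kept; a missing match becomes NULL,
# which pandas shows as NaN.
left = pd.read_sql('SELECT a.Store, a.Sales, s.CompetitionDistance ' +
                   'FROM demo_sales AS a ' +
                   'LEFT JOIN demo_stores AS s ' +
                   'ON a.Store = s.Store;', con=demo_db)

print(inner)
print(left)
```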

## Independent Practice

1. Load the Walmart sales and store features data
1. Create a table for each of those datasets
1. Select the store, date and fuel price on days it was over 90 degrees
1. Select the store, date, weekly sales, and temperature
1. What were the average sales on holidays vs. non-holidays?
1. What were the average sales on holidays vs. non-holidays when the temperature was below 32 degrees?