# Course 2: Project - Task E - Build a database

<a name="task-e-top"></a>
This notebook is concerned with task E.

**Contents:**
* [Imports](#task-e-imports)
* [Data loading](#task-e-data-loading)

## Imports<a name="task-e-imports"></a> ([top](#task-e-top))
---

In [44]:
# Standard library:
import itertools
import pathlib
import re
import typing as t
import unicodedata

# 3rd party:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pandas.io.formats.style
import seaborn as sns
from pandas.plotting import register_matplotlib_converters

# Project:
import ingredients
import utils

%matplotlib inline
register_matplotlib_converters()

## Data loading<a name="task-e-data-loading"></a> ([top](#task-e-top))
---

First, we load the subset of the cleaned-up dataset that we need:

In [34]:
base_name = pathlib.Path.cwd().joinpath('en.openfoodfacts.org.products.clean')

In [35]:
# The columns to load:
usecols = ['created_on', 'last_modified_on']

# Load:
data_types, parse_dates = utils.amend_dtypes(utils.load_dtypes(base_name))
# We can only parse dates in the columns that we are loading:
parse_dates = list(set(parse_dates) & set(usecols))
df = pd.read_csv(
    f'{base_name}.csv',
    header=0,
    parse_dates=parse_dates,
    usecols=usecols,
    dtype=data_types,
)

We get some general information:

In [36]:
# `show_counts` replaces `null_counts`, which was removed in pandas 2.0:
df.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 355569 entries, 0 to 355568
Data columns (total 2 columns):
created_on          355569 non-null datetime64[ns, UTC]
last_modified_on    355569 non-null datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](2)
memory usage: 5.4 MB


We look at the first few rows:

In [37]:
df.head()

| | created_on | last_modified_on |
|---|---|---|
| 0 | 2016-09-17 09:17:46+00:00 | 2016-09-17 09:18:13+00:00 |
| 1 | 2017-03-09 14:32:37+00:00 | 2017-03-09 14:32:37+00:00 |
| 2 | 2017-03-09 14:32:37+00:00 | 2017-03-09 14:32:37+00:00 |
| 3 | 2017-03-09 10:35:31+00:00 | 2017-03-09 10:35:31+00:00 |
| 4 | 2017-03-09 10:34:13+00:00 | 2017-03-09 10:34:13+00:00 |


## Build a database<a name="task-e-build-a-database"></a> ([top](#task-e-top))
---

**Task:** You will build a database to hold your data. It is up to you to define appropriate tables as well as primary keys for connecting them. (The focus is on using the basic methods and tools introduced in the course, not on building a complex database.) In particular, you can follow these steps:

* restrict your data to 1000 entries and 5 columns of your choice
* create a connection to a sqlite3 database
* create one or more tables; at least one of them should have a PRIMARY KEY
* fill the database with your data
* run at least one query to demonstrate that it works correctly
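
The steps above could be sketched roughly as follows, using only the standard-library `sqlite3` module. This is a minimal illustration, not the notebook's actual solution: the table name `products`, the integer `id` key, and the in-memory database are assumptions made for the example, and the five rows are copied from the `df.head()` output shown earlier rather than taken from the full 1000-entry subset. Only the two columns loaded in this notebook are used; the remaining columns required by the task are left open.

```python
import sqlite3

# Rows copied from the df.head() output; in the notebook they would come
# from the DataFrame instead (for example from df.head(1000)).
rows = [
    ("2016-09-17 09:17:46+00:00", "2016-09-17 09:18:13+00:00"),
    ("2017-03-09 14:32:37+00:00", "2017-03-09 14:32:37+00:00"),
    ("2017-03-09 14:32:37+00:00", "2017-03-09 14:32:37+00:00"),
    ("2017-03-09 10:35:31+00:00", "2017-03-09 10:35:31+00:00"),
    ("2017-03-09 10:34:13+00:00", "2017-03-09 10:34:13+00:00"),
]

# Assumption: an in-memory database, so the sketch leaves no file behind.
# Pass a file path instead of ":memory:" to persist the data.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one table whose integer `id` is the PRIMARY KEY.
conn.execute(
    """
    CREATE TABLE products (
        id INTEGER PRIMARY KEY,
        created_on TEXT NOT NULL,
        last_modified_on TEXT NOT NULL
    )
    """
)

# Fill the table with a parameterized insert; the row position is the key.
conn.executemany(
    "INSERT INTO products (id, created_on, last_modified_on) VALUES (?, ?, ?)",
    [(i, created, modified) for i, (created, modified) in enumerate(rows)],
)
conn.commit()

# Queries to check the result: the row count, and the ids of products that
# were modified after creation. The timestamps share one fixed-width format,
# so comparing them as text orders them chronologically.
n_rows = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
modified_later = conn.execute(
    "SELECT id FROM products WHERE last_modified_on > created_on ORDER BY id"
).fetchall()
print(n_rows, modified_later)

conn.close()
```

Storing the timestamps as text keeps the sketch simple, since SQLite has no dedicated datetime type. With pandas available, `DataFrame.to_sql` is an alternative way to fill a table from `df` directly.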