# Database Design and Load Exercise

### Steps
 1. Analyze
 2. Design
 3. Data Carpentry
 4. Data Loading
 5. Analytical Queries

# You are designing a system to track shopping habits in a _Smart Store_.

You are (hopefully) familiar with the online shopping experience and the concept of an online shopping cart.
Now imagine a physical store can capture shopper behavior in a similar fashion.

Your analysis of the use case for a database has identified the following Entities and their associated attributes.



### orders :
* `order_id`: order identifier
* `user_id`: customer identifier
* `eval_set`: which evaluation set this order belongs in (see `SET` described below)
* `order_number`: the order sequence number for this user (1 = first, n = nth)
* `order_dow`: the day of the week the order was placed on
* `order_hour_of_day`: the hour of the day the order was placed on
* `days_since_prior`: days since the last order, capped at 30 (with NAs for `order_number` = 1)

### products :
* `product_id`: product identifier
* `product_name`: name of the product
* `aisle_id`: foreign key
* `department_id`: foreign key

### aisles :
* `aisle_id`: aisle identifier
* `aisle`: the name of the aisle

### departments :
* `department_id`: department identifier
* `department`: the name of the department

### order_products :
* `order_id`: foreign key
* `product_id`: foreign key
* `add_to_cart_order`: order in which each product was added to cart
* `reordered`: 1 if this product has been ordered by this user in the past, 0 otherwise

# Task 1

 1. Convert the above Entities and attributes into an ERD model.
    * Any application is acceptable; you only need to capture a screenshot or export an **image**.
 1. Upload the image to the _COURSE_/`Day2/module2/exercises/` folder. 
 1. In the markdown cell below, double click and put the name of the image file within the `()`.
   * Example: change `![ERD MISSING]()` to `![ERD MISSING](erd.jpg)`

![ERD](db_design.png)

![ERD MISSING]()

# Task 2
 1. Convert the Entities and attributes into a Database schema for Postgres.
 1. Remember to prefix table names with your database id, e.g., _ncw24_.
    * Example: `CREATE TABLE ncw24.Order ... `
    
**Remember to specify your Primary Keys and Foreign Keys for each table!**
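
To see what primary and foreign keys enforce, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for Postgres (an assumption made only so the example runs without a server). The two-table schema is a cut-down version of `aisles` and `products`, and the aisle name and the rejected rows are made-up values.

```python
import sqlite3

# In-memory SQLite stands in for Postgres here (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

conn.execute("CREATE TABLE aisles (aisle_id INTEGER PRIMARY KEY, aisle TEXT NOT NULL)")
conn.execute(
    """
    CREATE TABLE products (
        product_id INTEGER PRIMARY KEY,
        product_name TEXT NOT NULL,
        aisle_id INTEGER REFERENCES aisles (aisle_id)
    )
    """
)

# 'example aisle' is a placeholder name, not taken from the data set.
conn.execute("INSERT INTO aisles VALUES (61, 'example aisle')")
conn.execute("INSERT INTO products VALUES (1, 'Chocolate Sandwich Cookies', 61)")


def try_insert(statement):
    """Return True if the statement succeeds, False on a constraint violation."""
    try:
        conn.execute(statement)
        return True
    except sqlite3.IntegrityError:
        return False


# Duplicate primary key: rejected.
duplicate_ok = try_insert("INSERT INTO products VALUES (1, 'Duplicate', 61)")
# Foreign key pointing at an aisle that does not exist: rejected.
orphan_ok = try_insert("INSERT INTO products VALUES (2, 'Orphan', 999)")
print(duplicate_ok, orphan_ok)
```

Both inserts are refused, which is the behavior the key declarations in your Postgres schema should give you as well.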

In [None]:
%%sql
-- Add all the DDL to create your schema to this cell
-- --------------------------------------------------
CREATE TABLE orders(   
    order_id int PRIMARY KEY,
    user_id int NOT NULL,
    eval_set text NOT NULL,
    order_number int NOT NULL,
    order_dow int NOT NULL CHECK(order_dow >= 0 AND order_dow < 7),
    order_hour_of_day int NOT NULL CHECK(order_hour_of_day >= 0 and order_hour_of_day < 24),
    days_since_prior int NOT NULL CHECK(days_since_prior <= 30)
);

CREATE TABLE aisles(
    aisle_id int PRIMARY KEY,
    aisle text NOT NULL
);

CREATE TABLE departments(
    department_id int PRIMARY KEY,
    department text NOT NULL
);

CREATE TABLE products(
    product_id int PRIMARY KEY,
    product_name text NOT NULL,
    aisle_id int,
    department_id int,
    FOREIGN KEY(aisle_id) REFERENCES aisles,
    FOREIGN KEY(department_id) REFERENCES departments
);

CREATE TABLE order_products(
    order_id int,
    product_id int,
    add_to_cart_order int NOT NULL,
    reordered bool NOT NULL,
    PRIMARY KEY(order_id, product_id),
    FOREIGN KEY(order_id) REFERENCES orders,
    FOREIGN KEY(product_id) REFERENCES products
);











# Task 3 
Load data from the following files:

## `/dsa/data/all_datasets/instacart/orders.csv`
 * 3421084 Rows
 * File Preview 
```
order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
2539329,1,prior,1,2,08,
2398795,1,prior,2,3,07,15.0
473747,1,prior,3,3,12,21.0
2254736,1,prior,4,4,07,29.0
```
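
The preview shows two details that need care when loading: the hour is zero-padded (`08`) and `days_since_prior_order` is empty for a user's first order and a float string (`15.0`) otherwise. The sketch below parses the first two preview rows with the standard `csv` module. Mapping the empty field to `None` (SQL `NULL`) is an assumption; a column declared `NOT NULL` would need a sentinel such as `0` instead.

```python
import csv
import io

# The header and first two rows from the preview above, used as stand-in input.
SAMPLE = """order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
2539329,1,prior,1,2,08,
2398795,1,prior,2,3,07,15.0
"""


def parse_order(row):
    """Convert one CSV row (a dict of strings) into a typed tuple."""
    gap = row["days_since_prior_order"]
    return (
        int(row["order_id"]),
        int(row["user_id"]),
        row["eval_set"],
        int(row["order_number"]),
        int(row["order_dow"]),
        int(row["order_hour_of_day"]),      # int("08") is 8
        int(float(gap)) if gap else None,   # empty field becomes None
    )


orders = [parse_order(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
print(orders)
```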

## `/dsa/data/all_datasets/instacart/products.csv`
 * 49689 Rows
 * File Preview 
```
product_id,product_name,aisle_id,department_id
1,Chocolate Sandwich Cookies,61,19
2,All-Seasons Salt,104,13
3,Robust Golden Unsweetened Oolong Tea,94,7
4,Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce,38,1
```
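
Product names are free text, so some of them may contain commas and be wrapped in double quotes; a plain `split(',')` would break such rows. As a hedged sketch, `csv.reader` handles the quoting without altering the name. The first data row comes from the preview; the second is a made-up row (hypothetical id and name) added only to show a quoted field.

```python
import csv
import io

# Header and first row from the preview above; the last row is hypothetical
# data illustrating a quoted name that contains a comma.
SAMPLE = '''product_id,product_name,aisle_id,department_id
1,Chocolate Sandwich Cookies,61,19
9999,"Example Snack, Family Size",61,19
'''

reader = csv.reader(io.StringIO(SAMPLE))
next(reader)  # skip the header row

products = [
    (int(product_id), name, int(aisle_id), int(department_id))
    for product_id, name, aisle_id, department_id in reader
]
print(products)
```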

## `/dsa/data/all_datasets/instacart/aisles.csv`
 * 135 Rows
 * File Preview 
```
aisle_id,aisle
1,prepared soups salads
2,specialty cheeses
3,energy granola bars
4,instant foods
```

## `/dsa/data/all_datasets/instacart/departments.csv`
 * 22 Rows
 * File Preview 
```
department_id,department
1,frozen
2,other
3,bakery
4,produce
```

## `/dsa/data/all_datasets/instacart/order_products.csv`
 * 1384618 Rows
 * File Preview 
```
order_id,product_id,add_to_cart_order,reordered
1,49302,1,1
1,11109,2,1
1,10246,3,0
1,49683,4,0
```
     

## In each designated cell, load the data using Python
 * Either psycopg2 or SQLAlchemy
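
Some of these files have millions of rows, so it can help to insert in bounded chunks rather than building one giant list. The helper below is a minimal sketch of that idea; `batched` and the sample rows are hypothetical names and data, and no database connection is made here.

```python
from itertools import islice


def batched(rows, size):
    """Yield lists of at most `size` items from any iterable of rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Stand-in rows; in the notebook these would be parsed CSV tuples.
rows = [(i, f"name {i}") for i in range(1, 8)]
sizes = [len(chunk) for chunk in batched(rows, 3)]
print(sizes)
```

With psycopg2, each chunk could then be passed to `execute_values(cursor, "INSERT INTO ... VALUES %s", chunk)` inside a loop, committing once at the end.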

---


In [None]:
from getpass import getpass
DB_NAME = None
DB_USER = None
DB_HOST = None
DB_PASSWORD = getpass()

### Orders 

In [None]:
## Add your code below
## --------------------
import psycopg2 as sql
from psycopg2.extras import execute_values

# connect
conn = sql.connect(dbname=DB_NAME, host=DB_HOST, user=DB_USER, password=DB_PASSWORD)

with open('/dsa/data/all_datasets/instacart/orders.csv') as file:
    data = file.readlines()

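del data[0]  # drop the header row

finished_data = list()
for l in data:
    d = l.rstrip("\n").split(',')
    tmp = list()
    for i, item in enumerate(d):
        if i == 2:
            # eval_set is the only text column
            tmp.append(item)
        elif i == 6 and item == "":
            # first order for a user has no prior order; stored as 0
            # because days_since_prior is declared NOT NULL
            tmp.append(0)
        else:
            # numeric columns may arrive as "08" or "15.0"
            tmp.append(int(float(item)))
    finished_data.append(tuple(tmp))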

with conn.cursor() as cursor:
    execute_values(cursor, "INSERT INTO orders VALUES %s", finished_data)

conn.commit()
conn.close()

---

### Products







In [None]:
## Add your code below
## --------------------

import csv

import psycopg2 as sql
from psycopg2.extras import execute_values

# connect
# NOTE: products references aisles and departments, so the Aisles and
# Departments cells must be run before this one.
conn = sql.connect(dbname=DB_NAME, host=DB_HOST, user=DB_USER, password=DB_PASSWORD)

finished_data = list()
with open('/dsa/data/all_datasets/instacart/products.csv', encoding="UTF-8", newline="") as file:
    reader = csv.reader(file)
    next(reader)  # skip the header row
    for row in reader:
        try:
            # csv.reader keeps quoted product names (including commas) intact
            product_id, product_name, aisle_id, department_id = row
            finished_data.append(
                (int(product_id), product_name, int(aisle_id), int(department_id))
            )
        except ValueError:
            # report rows that do not have four well-formed fields and skip them
            print("Couldn't convert: {}".format(row))

with conn.cursor() as cursor:
    execute_values(cursor, "INSERT INTO products VALUES %s", finished_data)

conn.commit()
conn.close()

---

### Aisles

In [None]:
## Add your code below
## --------------------

import psycopg2 as sql
from psycopg2.extras import execute_values

# connect
conn = sql.connect(dbname=DB_NAME, host=DB_HOST, user=DB_USER, password=DB_PASSWORD)

with open('/dsa/data/all_datasets/instacart/aisles.csv', "rb") as file:
    data = file.readlines()

del data[0]

finished_data = list()
for l in data:
    d = tuple(l.decode("UTF-8").replace("\n", "").split(','))
    tmp = list()
    for i,item in enumerate(d):
        if i == 1:
            tmp.append(item)
        else:
            try:
                tmp.append(int(float(item)))
            except ValueError:
                print("Couldn't convert: {}".format(item))
                print(l)
    finished_data.append(tuple(tmp))


with conn.cursor() as cursor:
    execute_values(cursor, "INSERT INTO aisles VALUES %s", finished_data)

conn.commit()
conn.close()

---

### Departments

In [None]:
## Add your code below
## --------------------

import psycopg2 as sql
from psycopg2.extras import execute_values

# connect
conn = sql.connect(dbname=DB_NAME, host=DB_HOST, user=DB_USER, password=DB_PASSWORD)

with open('/dsa/data/all_datasets/instacart/departments.csv', "rb") as file:
    data = file.readlines()

del data[0]

finished_data = list()
for l in data:
    d = tuple(l.decode("UTF-8").replace("\n", "").split(','))
    tmp = list()
    for i,item in enumerate(d):
        if i == 1:
            tmp.append(item)
        else:
            try:
                tmp.append(int(float(item)))
            except ValueError:
                print("Couldn't convert: {}".format(item))
                print(l)
    finished_data.append(tuple(tmp))


with conn.cursor() as cursor:
    execute_values(cursor, "INSERT INTO departments VALUES %s", finished_data)

conn.commit()
conn.close()

---

### Order Products

In [None]:
## Add your code below
## --------------------

import psycopg2 as sql
from psycopg2.extras import execute_values

# connect
conn = sql.connect(dbname=DB_NAME, host=DB_HOST, user=DB_USER, password=DB_PASSWORD)

with open('/dsa/data/all_datasets/instacart/order_products.csv', "rb") as file:
    data = file.readlines()

del data[0]

finished_data = list()
for l in data:
    d = tuple(l.decode("UTF-8").replace("\n", "").split(','))
    tmp = list()
    for i,item in enumerate(d):
        if i == 3:
            # reordered is stored as a boolean: 1 -> True, 0 -> False
            tmp.append(int(item) == 1)
        else:
            try:
                tmp.append(int(float(item)))
            except ValueError:
                print("Couldn't convert: {}".format(item))
                print(l)
    finished_data.append(tuple(tmp))

with conn.cursor() as cursor:
    execute_values(cursor, "INSERT INTO order_products VALUES %s", finished_data)

conn.commit()
conn.close()

--- 

# Task 4

In each of the cells below, use Python to pull the data out of the database. 

#### Task 4.A : Find the top 10 products, based on number of orders.
Display in a table!
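
As a rough illustration of what the query has to compute, the sketch below reproduces the "count orders per product, sort descending, keep the top 10" logic in plain Python. The first four `(order_id, product_id)` pairs come from the `order_products.csv` preview; the last two are made-up rows so that one product appears more than once.

```python
from collections import Counter

# (order_id, product_id) pairs: four from the preview, two hypothetical.
order_products = [
    (1, 49302), (1, 11109), (1, 10246), (1, 49683),
    (2, 11109), (3, 11109),
]

# Equivalent of: SELECT product_id, COUNT(order_id) ... GROUP BY product_id
order_counts = Counter(product_id for _, product_id in order_products)

# Equivalent of: ORDER BY count DESC LIMIT 10
top = order_counts.most_common(10)

# Print a small fixed-width table.
print(f"{'product_id':>10}  {'orders':>6}")
for product_id, n in top:
    print(f"{product_id:>10}  {n:>6}")
```

In the database the same grouping is done by `GROUP BY`, and the join to `products` supplies the product name for display.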

In [None]:
%%sql
-- Add your code below
-- --------------------
SELECT product_id, product_name, COUNT(order_id) AS order_count
FROM order_products JOIN products USING(product_id)
GROUP BY product_id, product_name
ORDER BY order_count DESC
LIMIT 10;


#### Task 4.B : Visualize a histogram of the number of Products across the Departments

In [None]:
## Add your code below
## --------------------
from matplotlib import pyplot as plt
import psycopg2 as sql

# connect
conn = sql.connect(dbname=DB_NAME, host=DB_HOST, user=DB_USER, password=DB_PASSWORD)
with conn.cursor() as cursor:
    cursor.execute("select department_id, department, count(product_id) \
     from products join departments using(department_id) group by 1,2")
    data = cursor.fetchall()

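conn.close()

# one bar per department, bar height = number of products in that department
names = [d[1] for d in data]
counts = [d[2] for d in data]

plt.bar(names, counts)
plt.xticks(rotation=90)
plt.ylabel("number of products")
plt.show()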






## Optional Extra Task

Visualize a two-dimensional histogram of the product counts across the Departments and the Aisles
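
One possible starting point, sketched under assumptions: count products per (department, aisle) pair and arrange the counts in a grid with one row per department and one column per aisle. The first four rows below come from the `products.csv` preview; product 9999 is a made-up row so that one cell has a count of 2.

```python
from collections import Counter

# (product_id, aisle_id, department_id): four rows from the preview,
# plus one hypothetical row (9999).
products = [
    (1, 61, 19), (2, 104, 13), (3, 94, 7), (4, 38, 1), (9999, 61, 19),
]

cell_counts = Counter((dept, aisle) for _, aisle, dept in products)
departments = sorted({dept for dept, _ in cell_counts})
aisles = sorted({aisle for _, aisle in cell_counts})

# Rows are departments, columns are aisles.
matrix = [[cell_counts[(d, a)] for a in aisles] for d in departments]
for dept, row in zip(departments, matrix):
    print(dept, row)
```

A grid like `matrix` can be handed to matplotlib, for example `plt.imshow(matrix)`, with `departments` and `aisles` used as tick labels.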

In [None]:
## Add your code below
## --------------------










# Save your notebook, then `File > Close and Halt`

---