# Week10. Database

# Recap: Classes


## Class Definition

```python
class ClassName(OptionalParentClass):
    # class block

    # Class variables - common parameters (same for all instances),
    # shared among all instances, that is, there is only one copy
    common_class_variable = <value>

    # define methods
    def __init__(self, other_arguments_1):
        # method block
        # this one particularly is called initializer

        # call parent initializer
        super().__init__(other_arguments_1)

        # instance variables - local to a particular instance
        self.local_to_self_variable = <value>
        <statement...>

    def method_name(self, other_arguments_2):
        # method block
        <statement...>

    # a class method doesn't read or update information from the object (self);
    # it is related to the class rather than to a particular instance
    @classmethod
    def class_method_name(cls, other_arguments_3):
        # method block
        <statement...>

    # a static method doesn't read or update information from the object (self) or the class
    @staticmethod
    def static_method_name(other_arguments_4):
        # method block
        <statement...>
```

## Creating a new instance
A new instance is typically created by calling the class (class name + parentheses, e.g. `ClassName()`);
the arguments are passed to the initializer:
```python
new_instance = ClassName(m_other_arguments_1)
```
## Calling instance methods and using instance variables
```python
new_instance.method_name(m_other_arguments_2)
new_instance.local_to_self_variable = "new value"
```
## Calling class or static methods and using class variables
```python
ClassName.class_method_name(m_other_arguments_3)
ClassName.static_method_name(m_other_arguments_4)
ClassName.common_class_variable = "new value"
```
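
To make the template above concrete, here is a minimal runnable sketch. The names `Shape`, `Square`, `how_many` and `square` are made up for illustration and are not part of the course material.

```python
class Shape:
    # Class variable: one copy shared by every instance
    count = 0

    def __init__(self, name):
        # Instance variable: local to this particular instance
        self.name = name
        Shape.count += 1

    def area(self):
        return 0

    def describe(self):
        return f"{self.name} with area {self.area()}"

    @classmethod
    def how_many(cls):
        # Works with the class (cls), not with a particular instance
        return cls.count

    @staticmethod
    def square(x):
        # Uses neither the instance nor the class
        return x * x


class Square(Shape):
    def __init__(self, side):
        super().__init__("square")  # call the parent initializer
        self.side = side

    def area(self):
        return Shape.square(self.side)


sq = Square(3)
print(sq.describe())     # square with area 9
print(Shape.how_many())  # 1
```

`Square` inherits `describe` from `Shape` but overrides `area`, so the inherited method picks up the subclass behavior.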
## Some special methods

see https://docs.python.org/3/reference/datamodel.html

* operator overloading (should return result of corresponding operation)
  * `object.__add__(self, other)` - called by `object + other`
  * `object.__sub__(self, other)` - called by `object - other`
  * `object.__mul__(self, other)` - called by `object * other`
  * `object.__truediv__(self, other)` - called by `object / other`
  * `object.__floordiv__(self, other)` - called by `object // other`
  * `object.__mod__(self, other)` - called by `object % other`
  * `object.__and__(self, other)` - called by `object & other` (the keyword `and` cannot be overloaded)
  * `object.__or__(self, other)` - called by `object | other` (the keyword `or` cannot be overloaded)
* Comparison methods (should return bool)
  * `object.__lt__(self, other)` - called by `object < other`
  * `object.__le__(self, other)` - called by `object <= other`
  * `object.__eq__(self, other)` - called by `object == other`
  * `object.__ne__(self, other)` - called by `object != other`
  * `object.__gt__(self, other)` - called by `object > other`
  * `object.__ge__(self, other)` - called by `object >= other`
* Iterable methods
  * `__iter__` returns an iterator, which can be `self`
  * Iterator method
    * `__next__` returns the next item and raises `StopIteration` when there are no more items
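
A short sketch of a few of the special methods listed above. The classes `Vector2D` and `Countdown` are hypothetical examples, not taken from the course code.

```python
class Vector2D:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __add__(self, other):
        # Called by: self + other
        return Vector2D(self.x + other.x, self.y + other.y)

    def __eq__(self, other):
        # Called by: self == other
        return (self.x, self.y) == (other.x, other.y)

    def __iter__(self):
        # Returns a fresh iterator, so the object can be looped over repeatedly
        return iter((self.x, self.y))


class Countdown:
    """An iterator: __iter__ returns self, __next__ produces the items."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        self.current -= 1
        return self.current + 1


total = Vector2D(1, 2) + Vector2D(3, 4)
print(total == Vector2D(4, 6))  # True
print(list(total))              # [4, 6]
print(list(Countdown(3)))       # [3, 2, 1]
```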

# Week10. Database
## Database
- After you have parsed your data, you need to store it
- Databases allow you to persist (save) data
- You can access the data later
- SQL databases support transactions, that is, they are transactional
    - This means that the database is ACID
    - ACID - Atomic, Consistent, Isolated and Durable
      - Atomicity - the whole transaction is applied or none of it
      - Consistency - the database moves from one valid state to another
      - Isolation - transactions executed concurrently should have the same effect as if executed sequentially
      - Durability - committed transactions should be persistent and not affected by factors like a power outage
    - A transaction is a series of operations that either all happen or none happen
    - Example: transferring money -- decrement one account and increment another account
- We will use SQLite -- the database is saved as a single file. Unlike flat files (CSV files, for example), the data is stored in an optimized way so you can run queries. It is portable -- you can share your database file. There is no database server to set up and no client/server interaction: the application reads and writes the database file directly.
    - Self-contained
    - Serverless
    - Zero-configuration
    - Transactional -- all or nothing; either all instructions are executed successfully or nothing happens.
    - Uses dynamic types for tables -- a column can store a value of almost any type, regardless of its declared type
    - SQLite implements a dialect of SQL (Structured Query Language) and has a lot in common with other SQL dialects.
- Other popular SQL databases:
  - MySQL
  - MariaDB (MySQL fork)
  - Microsoft SQL Server
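
The money-transfer example above can be sketched with Python's `sqlite3` module. This is only an illustration under assumptions: the `accounts` table, the `transfer` helper and the starting balances are invented, and a `CHECK` constraint is used to make the second transfer fail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, sender, receiver, amount):
    """Move money between accounts; either both updates happen or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # violates the CHECK, rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

In the failed transfer the increment to the receiver has already run when the decrement is rejected; the rollback undoes it, which is the atomicity property.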

- Database terminology
    - A database contains tables
    - tables store information in rows and columns
    - a relational database -- there is a relationship between two or more tables


- Working with SQLite (Similar to other SQL dialects):
  - Command line interface (CLI):
    - download from https://www.sqlite.org/
  - Graphical user interface (GUI):
    - Use DB Browser for SQLite (https://sqlitebrowser.org/dl/)
  - Programmatically (that is, with Python):
    - using the `sqlite3` module (https://docs.python.org/3/library/sqlite3.html)
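
As a first taste of the programmatic route, here is a minimal sketch with the standard-library `sqlite3` module. It uses an in-memory database and two invented rows so it can run anywhere; pass a file name such as `"student.db"` to `connect` to persist the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a file name here would create/open a database file
cur = conn.cursor()
cur.execute("CREATE TABLE students (username TEXT, exam1 INTEGER)")
cur.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [("asmith", 91), ("bjones", 78)],  # hypothetical sample rows
)
conn.commit()

rows = cur.execute(
    "SELECT username, exam1 FROM students ORDER BY exam1 DESC"
).fetchall()
print(rows)  # [('asmith', 91), ('bjones', 78)]
conn.close()
```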


- Reference: https://www.sqlite.org/
  - Docs: https://www.sqlite.org/doc.html
  - Download: https://www.sqlite.org/download.html
- Reference: https://www.sqlitetutorial.net
- Reference: https://www.quackit.com/sqlite/tutorial
- Data: Data generated using faker
- Data: https://www.sqlitetutorial.net/sqlite-sample-database/
- Data: https://www.kaggle.com/airbnb/seattle

- Basic commands for CLI
  * `sqlite3` - start SQLite
  * `.help`  - list available commands
  * `.quit` - quit sqlite3
  * Commands that start with a dot (".") are intercepted and interpreted by the sqlite3 program itself

## Load simple database
```
create table students (last_name TEXT, first_name TEXT, username TEXT, exam1 INTEGER, exam2 INTEGER, exam3 INTEGER);
.separator "\t"
.import students.tsv Students
.save student.db
.headers ON
.mode column
```
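
The `.import` step above has a rough Python counterpart. The sketch below assumes a tab-separated file with six fields per line and no header row; the two sample rows and the temporary file are invented so that the block runs on its own.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical sample rows standing in for the real students.tsv
sample = "Smith\tAlex\tasmith\t91\t85\t77\nJones\tMelissa\tmjones\t78\t88\t94\n"
tsv_path = Path(tempfile.mkdtemp()) / "students.tsv"
tsv_path.write_text(sample)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (last_name TEXT, first_name TEXT, username TEXT, "
    "exam1 INTEGER, exam2 INTEGER, exam3 INTEGER)"
)

# Read the file, skipping empty lines
with open(tsv_path, newline="") as f:
    rows = [row for row in csv.reader(f, delimiter="\t") if row]

with conn:  # commit all inserts together
    conn.executemany("INSERT INTO students VALUES (?, ?, ?, ?, ?, ?)", rows)

loaded = conn.execute(
    "SELECT username, exam1 FROM students ORDER BY username"
).fetchall()
print(loaded)  # [('asmith', 91), ('mjones', 78)]
```

The grades are read from the file as text, but because the columns are declared `INTEGER`, SQLite stores them as integers.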

## Basic commands -- Part I
```
.help
.database
.table
.schema <name_of_table>
.quit
.q
```

### SELECT command
```
SELECT * FROM students;
SELECT username, exam1 FROM STUDENTS;
SELECT username, exam1 FROM STUDENTS ORDER BY username;
SELECT username, exam1 FROM STUDENTS ORDER BY username LIMIT 20;
SELECT username, exam1, exam2 FROM STUDENTS ORDER BY exam1 ASC LIMIT 10;
SELECT username, exam1 FROM STUDENTS ORDER BY exam1 DESC LIMIT 10;
```

### Save output to a file
```
.output output.txt
SELECT username, exam1 FROM STUDENTS ORDER BY -exam1 LIMIT 10;
.output stdout
```

### Backup whole database
```
.output dump.sql
.dump
.output stdout
```
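
A similar text backup can be produced from Python with `Connection.iterdump()`, which yields the SQL statements needed to rebuild the database. The table and row below are invented for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (username TEXT, exam1 INTEGER)")
conn.execute("INSERT INTO students VALUES ('asmith', 91)")
conn.commit()

# Collect the dump as one SQL script (this could be written to dump.sql instead)
dump_sql = "\n".join(conn.iterdump())

# Restore the dump into a fresh database
restored = sqlite3.connect(":memory:")
restored.executescript(dump_sql)
restored_rows = restored.execute("SELECT * FROM students").fetchall()
print(restored_rows)  # [('asmith', 91)]
```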

### Calculating Average
```
SELECT
 avg(exam1) as `Exam1 Average`, avg(exam2), avg(exam3)
 FROM students;
```

### Find all exam1 grades greater than 80
```
SELECT
username, exam1
FROM
students
WHERE exam1 > 80;
```

### Find all exam1 grades between 80 and 90 (inclusive)
```
SELECT
username, exam1
FROM
students
WHERE exam1 BETWEEN 80 and 90;
```
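
When the same `WHERE` filters are run from Python, the threshold is usually passed as a parameter with a `?` placeholder instead of being pasted into the SQL string. A small sketch with three invented rows, which also shows that `BETWEEN` includes both end points:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (username TEXT, exam1 INTEGER)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [("asmith", 91), ("bjones", 78), ("cdoe", 80)],  # hypothetical rows
)

threshold = 80
above = conn.execute(
    "SELECT username, exam1 FROM students WHERE exam1 > ? ORDER BY username",
    (threshold,),
).fetchall()

between = conn.execute(
    "SELECT username FROM students WHERE exam1 BETWEEN ? AND ? ORDER BY username",
    (80, 90),
).fetchall()

print(above)    # [('asmith', 91)]
print(between)  # [('cdoe',)]  (a grade of exactly 80 is included)
```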



### Get count of exam1 greater than 80
```
SELECT
count(exam1)
FROM
students
WHERE exam1 > 80;
```

### Find students with same first name
```
SELECT
*
FROM
students
WHERE first_name == 'Melissa';
```

### Find students IN list
```
SELECT
*
FROM
students
WHERE first_name in ('Melissa', 'Stephanie', 'Alex');
```


### Find students whose first name starts with a pattern (LIKE 'Alex%')
```
SELECT
*
FROM
students
WHERE first_name LIKE 'Alex%';
```

### Find students whose first name contains a pattern (LIKE '%ath%')
```
SELECT
*
FROM
students
WHERE first_name LIKE '%ath%';
```

### Histogram of exam1 grades using GROUP BY
```
SELECT exam1, count(exam1) as c  FROM students GROUP BY exam1 ORDER BY c DESC;
SELECT first_name, count(first_name) as name_count  FROM students GROUP BY first_name ORDER BY name_count DESC;
```

### Adding null values
```
-- Columns that are not listed in the INSERT are filled with NULL
INSERT INTO Students (exam1) VALUES (52);
SELECT * FROM Students;
```

### Adding NOT NULL constraint
```
DROP TABLE IF EXISTS students;
create table students (
    last_name TEXT NOT NULL,
    first_name TEXT NOT NULL,
    username TEXT NOT NULL,
    exam1 REAL,
    exam2 REAL,
    exam3 REAL
);
.separator "\t"
.import students.tsv Students

SELECT * FROM Students;
.schema Students

-- This insert now fails because last_name, first_name and username are NOT NULL
INSERT INTO Students (exam1) VALUES (52);
```
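
From Python, a violated `NOT NULL` constraint surfaces as an `sqlite3.IntegrityError`. A minimal sketch using a shortened version of the table above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students ("
    "last_name TEXT NOT NULL, first_name TEXT NOT NULL, "
    "username TEXT NOT NULL, exam1 REAL)"
)

try:
    # The three NOT NULL columns are omitted, so the insert is rejected
    conn.execute("INSERT INTO students (exam1) VALUES (52)")
    inserted = True
except sqlite3.IntegrityError as exc:
    inserted = False
    print("rejected:", exc)

row_count = conn.execute("SELECT count(*) FROM students").fetchone()[0]
print(inserted, row_count)  # False 0
```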


### Primary Key
```
.headers ON
PRAGMA foreign_keys;
PRAGMA foreign_keys = ON;
PRAGMA foreign_keys;
CREATE TABLE Teachers(
  TeacherName  TEXT NOT NULL
);

.tables
.schema Teachers

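INSERT INTO Teachers (TeacherName) VALUES ('John Smith');
SELECT * FROM Teachers;
-- The same name can be inserted again: nothing makes the row unique yet
INSERT INTO Teachers (TeacherName) VALUES ('John Smith');

-- An ordinary table has a hidden rowid column that identifies each row
SELECT rowid, * FROM Teachers;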
```

When you create a table that has an `INTEGER NOT NULL PRIMARY KEY` column, that column becomes an alias for the `rowid` column. It uniquely identifies a record/row, and SQLite assigns a value automatically when an insert omits it.

```
.headers ON
DROP TABLE Teachers;

CREATE TABLE Teachers (
   TeacherId INTEGER NOT NULL PRIMARY KEY,
   TeacherName  TEXT NOT NULL
);
INSERT INTO Teachers (TeacherName) VALUES ('John Smith');
INSERT INTO Teachers (TeacherName) VALUES ('John Smith');
SELECT * FROM Teachers;
```
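
The alias relationship between `TeacherId` and `rowid` can be checked from Python. In this sketch `cursor.lastrowid` reports the key that SQLite assigned to each inserted row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Teachers ("
    "TeacherId INTEGER NOT NULL PRIMARY KEY, "
    "TeacherName TEXT NOT NULL)"
)

cur = conn.execute("INSERT INTO Teachers (TeacherName) VALUES ('John Smith')")
first_id = cur.lastrowid
cur = conn.execute("INSERT INTO Teachers (TeacherName) VALUES ('John Smith')")
second_id = cur.lastrowid

rows = conn.execute(
    "SELECT rowid, TeacherId, TeacherName FROM Teachers ORDER BY TeacherId"
).fetchall()
print(first_id, second_id)  # 1 2
print(rows)  # [(1, 1, 'John Smith'), (2, 2, 'John Smith')]
```

Both rows carry the same name, yet each has its own key, and `rowid` and `TeacherId` always agree.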


### Making a teacher unique with a UNIQUE constraint
```
.headers ON
DROP TABLE Teachers;

CREATE TABLE Teachers (
   TeacherId INTEGER NOT NULL PRIMARY KEY,
   TeacherName  TEXT NOT NULL,
   TeacherEmployeeID INTEGER NOT NULL,
   UNIQUE (TeacherEmployeeID)
);
INSERT INTO Teachers (TeacherName, TeacherEmployeeID) VALUES ('John Smith', 100001);
INSERT INTO Teachers (TeacherName, TeacherEmployeeID) VALUES ('John Smith', 100002);
SELECT * FROM Teachers;
```
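
The SQL above only inserts rows that satisfy the constraint. The sketch below also tries a duplicate `TeacherEmployeeID`, which is rejected with `sqlite3.IntegrityError`; the teacher 'Jane Doe' is invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE Teachers (
        TeacherId INTEGER NOT NULL PRIMARY KEY,
        TeacherName TEXT NOT NULL,
        TeacherEmployeeID INTEGER NOT NULL,
        UNIQUE (TeacherEmployeeID)
    )"""
)

insert = "INSERT INTO Teachers (TeacherName, TeacherEmployeeID) VALUES (?, ?)"
conn.execute(insert, ("John Smith", 100001))
conn.execute(insert, ("John Smith", 100002))  # same name, different ID: allowed

try:
    conn.execute(insert, ("Jane Doe", 100001))  # duplicate employee ID: rejected
    duplicate_accepted = True
except sqlite3.IntegrityError:
    duplicate_accepted = False

teacher_count = conn.execute("SELECT count(*) FROM Teachers").fetchone()[0]
print(duplicate_accepted, teacher_count)  # False 2
```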



# Same Manipulations with Python

```python
filename = 'students.tsv'

# Each student as a list of fields
students_list = []
with open(filename) as file:
    for line in file:
        if not line.strip():
            continue
        line = line.strip()
        students_list.append(line.split('\t'))

# Each student as a dictionary keyed by column name
header = ['last_name', 'first_name', 'username', 'exam1', 'exam2', 'exam3']
students_dict = []
with open(filename) as file:
    for line in file:
        if not line.strip():
            continue
        line = line.strip()
        students_dict.append(dict(zip(header, line.split('\t'))))
```

```python
# SELECT * FROM Students;

for student in students_list:
    print(student)

for student in students_dict:
    print(student)
```


```python
# SELECT username, exam1 FROM STUDENTS;

for student in students_list:
    print(student[2], student[3])

print('_'*100)

for student in students_dict:
    print(student.get('username'), student.get('exam1'))
```

```python
# SELECT username, exam1 FROM STUDENTS ORDER BY username;

for student in sorted(students_list, key=lambda student: student[2]):
    print(student[2], student[3])
```

```python
# SELECT username, exam1 FROM STUDENTS ORDER BY exam1;

for student in sorted(students_list, key=lambda student: int(student[3])):
    print(student[2], student[3])
```

```python
# SELECT username, exam1 FROM STUDENTS ORDER BY -exam1 LIMIT 10
# SELECT username, exam1 FROM STUDENTS ORDER BY exam1 desc LIMIT 10

for student in sorted(students_list, key=lambda student: int(student[3]), reverse=True)[:10]:
    print(student[2], student[3])
```

```python
# SELECT username, exam1 FROM STUDENTS ORDER BY exam1 ASC LIMIT 10 ## default behavior

for student in sorted(students_list, key=lambda student: int(student[3]))[:10]:
    print(student[2], student[3])
```

```python
# SELECT
#     avg(exam1) as `Exam1 Average`, avg(exam2), avg(exam3)
# FROM students;

exam1_avg = sum([int(student[3]) for student in students_list])/len(students_list)
exam2_avg = sum([int(student[4]) for student in students_list])/len(students_list)
exam3_avg = sum([int(student[5]) for student in students_list])/len(students_list)

print(exam1_avg, exam2_avg, exam3_avg)
```


```python
# SELECT
#     username, exam1
# FROM Students
# WHERE
#     exam1 > 80;

for student in filter(lambda student: int(student[3]) > 80, students_list):
    print(student[2], student[3])
```

```python
# SELECT
#     username, exam1
# FROM students
# WHERE exam1 BETWEEN 80 and 90;

# BETWEEN includes both end points, so the comparison uses <= on both sides
for student in filter(lambda student: 80 <= int(student[3]) <= 90, students_list):
    print(student[2], student[3])
```

```python
# SELECT
# count(exam1) as `Exam 1 GT 80`
# FROM students
# WHERE exam1 > 80;

len(list(filter(lambda student: int(student[3]) > 80, students_list)))
```

```python
# SELECT
# *
# FROM students
# WHERE first_name == 'Melissa'

list(filter(lambda student: student[1] == 'Melissa', students_list))
```

```python
# SELECT
# *
# FROM Students
# WHERE first_name IN ('Melissa', 'Stephanie', 'Alex')
# ORDER BY last_name

sorted(
    filter(lambda student: student[1] in ('Melissa', 'Stephanie', 'Alex'), students_list),
    key=lambda student: student[0],
)
```

```python
# SELECT
# *
# FROM students
# WHERE first_name LIKE 'Alex%';

# SQLite's LIKE ignores case for ASCII letters by default, so compare in lower case
list(filter(lambda student: student[1].lower().startswith('alex'), students_list))
```

```python
# SELECT
# *
# FROM
# students
# WHERE first_name LIKE '%ath%';

# SQLite's LIKE ignores case for ASCII letters by default, so compare in lower case
list(filter(lambda student: 'ath' in student[1].lower(), students_list))
```

```python
sub_string = 'efg'
string = 'abcdefghijk'

sub_string in string
```

```python
# SELECT
#     exam1,
#     count(exam1) as Exam1Count
# FROM students
# GROUP BY exam1
# ORDER BY Exam1Count DESC;

output = {}

for student in students_list:
    exam1 = int(student[3])
    if exam1 not in output:
        output[exam1] = 0
    output[exam1] += 1

for ele in sorted(output.items(), key=lambda ele: ele[1], reverse=True):
    print(ele)
```
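
The counting loop above can also be written with `collections.Counter` from the standard library. This is an alternative sketch rather than the approach used in these notes, and `sample_students` is an invented stand-in for `students_list`.

```python
from collections import Counter

# Hypothetical rows in the same layout as students_list
sample_students = [
    ["Smith", "Alex", "asmith", "91", "85", "77"],
    ["Jones", "Melissa", "mjones", "78", "88", "94"],
    ["Doe", "Alex", "adoe", "91", "70", "80"],
]

# Count how many times each exam1 grade occurs (GROUP BY exam1)
exam1_counts = Counter(int(student[3]) for student in sample_students)

# most_common() returns (grade, count) pairs, highest count first (ORDER BY count DESC)
for grade, count in exam1_counts.most_common():
    print(grade, count)
```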

```python
list(output.items())
```

# sqlite3 module

```python
import pandas as pd
import sqlite3
conn = sqlite3.connect("student.db")
cur = conn.cursor()
sql_statement = "select username, exam1 FROM Students;"
df = pd.read_sql_query(sql_statement, conn)
df
```

```python
import pandas as pd
import sqlite3
conn = sqlite3.connect("student.db")
cur = conn.cursor()

sql_statement="""
SELECT
    first_name,
    count(first_name) as name_count
FROM students
GROUP BY first_name ORDER BY name_count DESC LIMIT 5;
"""
df = pd.read_sql_query(sql_statement, conn)
df
```
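
pandas is convenient but not required. The same kind of query can be run with `sqlite3` alone; setting `row_factory = sqlite3.Row` lets each row be indexed by column name. The sketch uses an in-memory table with three invented rows instead of `student.db`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows can be indexed by column name
conn.execute("CREATE TABLE students (first_name TEXT, username TEXT)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [("Alex", "asmith"), ("Alex", "adoe"), ("Melissa", "mjones")],  # hypothetical rows
)

sql_statement = """
SELECT first_name, count(first_name) AS name_count
FROM students
GROUP BY first_name
ORDER BY name_count DESC
LIMIT 5;
"""
name_counts = [
    (row["first_name"], row["name_count"]) for row in conn.execute(sql_statement)
]
print(name_counts)  # [('Alex', 2), ('Melissa', 1)]
```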

```python
# Plain-Python equivalent of the GROUP BY first_name query (LIMIT 5)
output = {}

for student in students_list:
    first_name = student[1]
    if first_name not in output:
        output[first_name] = 0
    output[first_name] += 1

for ele in sorted(output.items(), key=lambda ele: ele[1], reverse=True)[:5]:
    print(ele)
```

## Database Normalization

- Why use a database?
  - Ref: https://www.bbc.co.uk/bitesize/guides/z8yg87h/revision/4
  - Data is stored efficiently; saves space
  - Because data is stored efficiently, you can access it faster; easy to search
  - Because data is stored efficiently, you can easily update and remove data
  - Easily sort and group data
- What is database normalization?
  - Ref: https://www.complexsql.com/database-normalization/
  - Ref: http://www.databasedev.co.uk/1norm_form.html
  - The purpose of database normalization is to:
    - eliminate redundant data
    - reduce the complexity of the data, making it easier to manage and change
    - ensure logical data dependencies
- How is database normalization achieved?
  - By fulfilling five normal forms. Each normal form represents an increasingly stringent set of rules. Usually fulfilling the first three normal forms is sufficient.
  - Ref: https://www.1keydata.com/database-normalization/first-normal-form-1nf.php
- First Normal Form (1NF) -- a table is in 1NF if:
  1. there are no repeating groups
  2. all values are atomic, meaning they are the smallest meaningful values
- Second Normal Form (2NF) -- a table is in 2NF if:
  1. the table is in first normal form
  2. each non-key field is functionally dependent on the entire primary key
- Third Normal Form (3NF) -- a table is in 3NF if:
  1. the table is in second normal form
  2. there are no transitive dependencies, i.e. no non-key field depends on another non-key field
- Ref: https://arctype.com/blog/2nf-3nf-normalization-example/

- Problems with example1
  - Repeating group of fields
  - The project and time fields are not made up of atomic values
  - Can't sort by last name
  - Can't sort by time because field is type text
  - Assumed relationship between project and time

- Analysis of example2
  - Can sort now!
  - How can you add another project?


- Analysis of example3 -- first normal form
  - Can do groups by employeeid or projectnum
  - Can sort by time
  - Can sort by name

- Analysis of example4
  - How would you update the project title for a given project? Have to edit in many places
  - Can you add a project without an employeeid?
  - How can you delete a project?

- Analysis of example5
  - second normal form

- Analysis of example6
  - Phone number, which is a non-key field, has a transitive dependency on another non-key field.

- Analysis of example7
  - Removed transitive dependency
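
The example tables themselves are not reproduced in these notes, so the following is only a sketch of what a normalized employee/project design might look like. The table names, column names and rows are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, last_name TEXT NOT NULL);
    CREATE TABLE projects (project_num INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE assignments (
        employee_id INTEGER NOT NULL REFERENCES employees(employee_id),
        project_num INTEGER NOT NULL REFERENCES projects(project_num),
        hours REAL NOT NULL,
        PRIMARY KEY (employee_id, project_num)
    );
    INSERT INTO employees VALUES (1, 'Smith'), (2, 'Jones');
    INSERT INTO projects VALUES (10, 'Website');
    INSERT INTO assignments VALUES (1, 10, 12.5), (2, 10, 4.0);
    """
)

# The title is stored once, so renaming the project touches a single row
conn.execute("UPDATE projects SET title = 'Website redesign' WHERE project_num = 10")

report = conn.execute(
    """
    SELECT e.last_name, p.title, a.hours
    FROM assignments AS a
    JOIN employees AS e ON e.employee_id = a.employee_id
    JOIN projects AS p ON p.project_num = a.project_num
    ORDER BY e.last_name
    """
).fetchall()
print(report)
# [('Jones', 'Website redesign', 4.0), ('Smith', 'Website redesign', 12.5)]
```

Because each fact lives in one place, a project can be added without an employee, and the hours column depends on the whole (employee, project) key rather than on part of it.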