# Building a Database for Crime Reports

In this project, I will be building a database for storing data related to crimes that occurred in Boston. The data can be found in a locally stored file called `boston.csv`. A description of each column is as follows:
* incident_number: identifier of the crime
* offense_code: numeric identifier code for committed crime
* description: details of nature of the crime
* date: date on which the crime happened
* day_of_the_week: corresponding day of the week
* lat: latitude coordinate of incident
* long: longitude coordinate of incident

The goal of this project is to create a database named `crimedb` containing a schema named `crimes`. Inside the schema will be a table named `boston_crimes`, using appropriate datatypes for storing the data from the `boston.csv` file. I will also create the `readonly` and `readwrite` groups with the appropriate privileges, along with one user for each of the groups.

First, I will connect to the "dq" database as the user "dq" (short for DataQuest, as this is a DataQuest project) and create the new database. Because `CREATE DATABASE` cannot run inside a transaction block, I will set `connection.autocommit` to `True` before executing it. After closing this first connection, I will connect to the new `crimedb` database with a second connection object and cursor object.

In [1]:
import psycopg2
conn1 = psycopg2.connect(dbname="dq", user="dq")
cur1 = conn1.cursor()
conn1.autocommit = True
cur1.execute("create database crimedb;")
conn1.autocommit = False
conn1.close()

In [2]:
conn2 = psycopg2.connect(dbname="crimedb", user="dq")
cur2 = conn2.cursor()
cur2.execute("create schema crimes;")
conn2.commit()

Now that I have a database and schema, I can begin creating tables. I will first read the header row and the first data row from `boston.csv`.

In [3]:
import csv
with open("boston.csv") as file:
    reader = csv.reader(file)
    col_header = next(reader)
    first_row = next(reader)
    
print(first_row)

['1', '619', 'LARCENY ALL OTHERS', '2018-09-02', 'Sunday', '42.35779134', '-71.13937053']


Next, I will define a function `get_col_value_set()` that will help identify the proper datatypes for each of the columns. Given the name of the csv file and a column index (starting at 0), the function computes a Python set with all distinct values contained in that column. 

In [None]:
def get_col_value_set(csv_filename, col_index):
    # Collect the distinct values found in one column of the csv file
    value_set = set()
    with open(csv_filename, 'r') as file:
        reader = csv.reader(file)
        next(reader)  # skip the header row
        for row in reader:
            # set.add() ignores duplicates, so no membership test is needed
            value_set.add(row[col_index])
    return value_set

In [None]:
index0 = get_col_value_set("boston.csv", 0)
index1 = get_col_value_set("boston.csv", 1)
index2 = get_col_value_set("boston.csv", 2)
index3 = get_col_value_set("boston.csv", 3)
index4 = get_col_value_set("boston.csv", 4)
index5 = get_col_value_set("boston.csv", 5)
index6 = get_col_value_set("boston.csv", 6)

print("Number of different values in incident_number:", len(index0))
print("Number of different values in offense_code:", len(index1))
print("Number of different values in description:", len(index2))
print("Number of different values in date:", len(index3))
print("Number of different values in day_of_the_week:", len(index4))
print("Number of different values in lat:", len(index5))
print("Number of different values in long:", len(index6))
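
The same sets of distinct values can also hint at how wide an integer column needs to be. Below is a minimal sketch, assuming a small made-up sample in place of the real file (only the first data row comes from `boston.csv`; the header line and the other rows are hypothetical). The helper `narrowest_int_type()` is my own illustrative name, and the ranges are PostgreSQL's documented limits for `smallint`, `integer`, and `bigint`.

```python
import csv
import io

# PostgreSQL integer type ranges (inclusive), narrowest first.
INT_RANGES = {
    "smallint": (-32768, 32767),
    "integer": (-2147483648, 2147483647),
    "bigint": (-9223372036854775808, 9223372036854775807),
}

def narrowest_int_type(values):
    """Return the narrowest PostgreSQL integer type that holds every value."""
    numbers = [int(v) for v in values]
    low, high = min(numbers), max(numbers)
    for type_name, (type_min, type_max) in INT_RANGES.items():
        if type_min <= low and high <= type_max:
            return type_name
    return "numeric"  # too large even for bigint

# Hypothetical sample: only the first data row is taken from boston.csv.
sample_csv = io.StringIO(
    "incident_number,offense_code,description,date,day_of_the_week,lat,long\n"
    "1,619,LARCENY ALL OTHERS,2018-09-02,Sunday,42.35779134,-71.13937053\n"
    "2,1402,VANDALISM,2018-08-21,Tuesday,42.30682138,-71.06030035\n"
    "298329,3115,INVESTIGATE PERSON,2018-06-03,Sunday,42.33418175,-71.07866441\n"
)
reader = csv.reader(sample_csv)
next(reader)  # skip the header row
rows = list(reader)

incident_numbers = {row[0] for row in rows}
offense_codes = {row[1] for row in rows}

print(narrowest_int_type(incident_numbers))  # integer
print(narrowest_int_type(offense_codes))     # smallint
```

On the real file, the result depends on the actual minimum and maximum values, so this is only a way to sanity-check a type choice, not a replacement for looking at the data.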

I have now determined the number of distinct values in each column of the `boston.csv` file. From the first row printed above, I already know that the two textual columns in the data set are `description` and `day_of_the_week`. I will now print the header row to find the index of the `description` column.

In [None]:
print(col_header)

Since `description` has an index of 2, I will use the function I created to collect its distinct values and then compute the maximum number of characters in any description.

In [None]:
descriptions = get_col_value_set('boston.csv', 2)

lengths = []
for description in descriptions:
    lengths.append(len(description))

max_length = max(lengths)
print(max_length)

I have found that the maximum number of characters in any description is 58. 
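
As a side note, the same result can be reached more compactly with `max()` and `key=len`, which also returns the longest string itself. This is a minimal sketch on a made-up set of descriptions (only `LARCENY ALL OTHERS` appears in the data shown above; the other two values are hypothetical).

```python
# Hypothetical sample; only the first value is taken from boston.csv.
descriptions = {"LARCENY ALL OTHERS", "VANDALISM", "VERBAL DISPUTE"}

# max() with key=len returns the longest string rather than its length.
longest = max(descriptions, key=len)
max_length = len(longest)

print(longest, max_length)  # LARCENY ALL OTHERS 18
```

Seeing the longest value alongside its length makes it easier to judge how much headroom a `varchar` limit should leave for future rows.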

I will now create a table named `boston_crimes` inside the `crimes` schema of the `crimedb` database. Before creating the table, I will create the enumerated data type `enum_day`, which will be used for the column that contains the day of the week that the crime happened. The `incident_number` column is the primary key. The other data types are as follows:
* incident_number: integer since it is a whole, positive number
* offense_code: integer since it is a whole, positive number
* description: varchar(58) since we now know the maximum length
* date: date since it is in date format
* day_of_the_week: enum_day, an enumerated datatype since there is a fixed set of options for this column
* lat (stored as `latitude`): decimal(10,8) since the values have two digits before the decimal point and 8 digits after it
* long (stored as `longitude`): decimal(10,8) since the values have two digits before the decimal point and 8 digits after it (the minus sign does not count toward the precision)
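
The reasoning behind `decimal(10,8)` can be checked with Python's `decimal` module. This is a minimal sketch, where `fits_decimal()` is a hypothetical helper of my own: it counts the digits on each side of the decimal point and compares them with the precision and scale. It is stricter than PostgreSQL, which rounds extra fractional digits instead of rejecting them.

```python
from decimal import Decimal

def fits_decimal(text, precision, scale):
    """Check that `text` fits decimal(precision, scale) without rounding."""
    sign, digits, exponent = Decimal(text).as_tuple()
    fractional_digits = max(0, -exponent)
    integer_digits = max(0, len(digits) + exponent)
    return fractional_digits <= scale and integer_digits <= precision - scale

# Coordinates from the first data row of boston.csv.
print(fits_decimal("42.35779134", 10, 8))    # True
print(fits_decimal("-71.13937053", 10, 8))   # True

# Hypothetical longitude with three digits before the decimal point.
print(fits_decimal("-122.41941550", 10, 8))  # False
```

The last line shows why this choice only works for coordinates near Boston: a longitude with three integer digits would need a precision of at least 11.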

In [None]:
cur2.execute('''
    create type enum_day as enum (
        'Sunday',
        'Monday', 
        'Tuesday', 
        'Wednesday',
        'Thursday',
        'Friday',
        'Saturday'
    );
''')

conn2.commit()

In [None]:
cur2.execute('''
    create table crimes.boston_crimes (
        incident_number integer primary key,
        offense_code integer,
        description varchar(58),
        date date,
        day_of_the_week enum_day,
        latitude decimal(10,8),
        longitude decimal(10,8)
    );
''')

conn2.commit()

Now that the table is created, I can load data into it using the `cursor.copy_expert()` method.

In [None]:
with open("boston.csv") as file:
    cur2.copy_expert("copy crimes.boston_crimes from stdin with csv header;", file)
    conn2.commit()
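
By default, `COPY` stops at the first row it cannot load, so it can help to scan the file beforehand. The following is a minimal sketch, assuming a made-up sample in place of the real file (only the first data row comes from `boston.csv`). The helper `find_bad_rows()` is a hypothetical name, and it only checks the field count and the day of the week, not every column.

```python
import csv
import io

VALID_DAYS = {"Sunday", "Monday", "Tuesday", "Wednesday",
              "Thursday", "Friday", "Saturday"}
EXPECTED_FIELDS = 7

def find_bad_rows(file):
    """Return (line_number, reason) pairs for rows that look unloadable."""
    reader = csv.reader(file)
    next(reader)  # skip the header row
    problems = []
    for row in reader:
        if len(row) != EXPECTED_FIELDS:
            problems.append((reader.line_num, "wrong field count"))
        elif row[4] not in VALID_DAYS:
            problems.append((reader.line_num, "unknown day: " + row[4]))
    return problems

# Hypothetical sample: the second and third data rows are deliberately broken.
sample = io.StringIO(
    "incident_number,offense_code,description,date,day_of_the_week,lat,long\n"
    "1,619,LARCENY ALL OTHERS,2018-09-02,Sunday,42.35779134,-71.13937053\n"
    "2,1402,VANDALISM,2018-08-21,Sun,42.30682138,-71.06030035\n"
    "3,3115,INVESTIGATE PERSON,2018-06-03,Sunday,42.33418175\n"
)
print(find_bad_rows(sample))
# [(3, 'unknown day: Sun'), (4, 'wrong field count')]
```

With the real file, `find_bad_rows(open("boston.csv"))` would be the equivalent call; an empty list would suggest these two checks pass for every row.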

So far, I have created a database with a schema inside it for holding data about crimes, selected the correct datatypes for storing the data, created a table, and loaded the Boston crime data from `boston.csv` into it.

Now, I will create two new user groups: `readonly` and `readwrite`. Following the principle of least privilege, I will first revoke all privileges from the `public` group on both the `public` schema and the `crimedb` database.

In [None]:
cur2.execute("revoke all on schema public from public;")
conn2.commit()

In [None]:
cur2.execute("revoke all on database crimedb from public;")
conn2.commit()

Now that I have removed the default privileges, I can create the two user groups. The `readonly` group will be able to perform `SELECT` queries, while the `readwrite` group will be able to perform `SELECT`, `INSERT`, `DELETE`, and `UPDATE` queries.

In [None]:
cur2.execute("create group readonly nologin;")
cur2.execute("create group readwrite nologin;")
cur2.execute("grant connect on database crimedb to readonly;")
cur2.execute("grant connect on database crimedb to readwrite;")
cur2.execute("grant usage on schema crimes to readonly;")
cur2.execute("grant usage on schema crimes to readwrite;")
cur2.execute("grant select on all tables in schema crimes to readonly;")
cur2.execute("grant select, insert, delete, update on all tables in schema crimes to readwrite;")
conn2.commit()
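
The grants above follow the same pattern for both groups, so the privilege plan can also be written down as data and turned into statements. This is only a sketch of that idea: `GROUP_PRIVILEGES` and `build_grants()` are hypothetical names, and plain string formatting is acceptable here only because every name is hard-coded rather than taken from user input.

```python
# Privileges each group should hold on the tables in the crimes schema.
GROUP_PRIVILEGES = {
    "readonly": ["select"],
    "readwrite": ["select", "insert", "delete", "update"],
}

def build_grants(database, schema, group_privileges):
    """Build the GRANT statements for each group as plain SQL strings."""
    statements = []
    for group, privileges in group_privileges.items():
        statements.append(f"grant connect on database {database} to {group};")
        statements.append(f"grant usage on schema {schema} to {group};")
        statements.append(
            f"grant {', '.join(privileges)} on all tables in schema {schema} to {group};"
        )
    return statements

grants = build_grants("crimedb", "crimes", GROUP_PRIVILEGES)
for statement in grants:
    print(statement)
```

Each printed statement matches one of the `grant` calls in the cell above, which makes the privilege plan easy to review in one place before running it.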

Since I have created user groups, the next step is to create users. I will create one user in each group.

In [None]:
cur2.execute("create user data_analyst with password 'secret1';")
cur2.execute("grant readonly to data_analyst;")
cur2.execute("create user data_scientist with password 'secret2';")
cur2.execute("grant readwrite to data_scientist;")
conn2.commit()

I will now inspect the `information_schema.table_privileges` table to ensure that everything looks the way I planned. 

In [None]:
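cur2.execute("select grantee, privilege_type from information_schema.table_privileges where grantee in ('readonly', 'readwrite');")
# Print each group's table privileges so they can be checked by eye
for grantee, privilege_type in cur2.fetchall():
    print(grantee, privilege_type)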

In [None]:
conn2.rollback()

This project is now complete: the database contains a schema, a table with appropriate datatypes holding the rows from `boston.csv`, two user groups, and a user assigned to each group.