# Building a database for crime reports
The goal of this project is to create a Postgres database from a CSV file containing data on crimes in Boston.

We will: 
* create a database with a table that uses appropriate datatypes for storing the data
* create readonly and readwrite groups
* create one user for each of these groups

## Creating the Crime Database

In [1]:
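import psycopg2

# Connect to the existing dq database (no password) to create the new one.
conn = psycopg2.connect(dbname='dq', user='dq')
# CREATE DATABASE cannot run inside a transaction block, so enable autocommit.
conn.autocommit = True
cur = conn.cursor()
cur.execute("CREATE DATABASE crime_db;")
conn.close()

# Reconnect to the new database and create the schema that will hold the table.
conn = psycopg2.connect(dbname='crime_db', user='dq')
cur = conn.cursor()
cur.execute("CREATE SCHEMA crimes;")
conn.commit()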

## Obtaining the Column Names and Sample

In [2]:
# We analyze the dataset in boston.csv
import csv
with open('boston.csv','r') as f:
    reader = csv.reader(f)
    col_headers = next(reader)
    first_row = next(reader)

print(col_headers)
print(first_row)

['incident_number', 'offense_code', 'description', 'date', 'day_of_the_week', 'lat', 'long']
['1', '619', 'LARCENY ALL OTHERS', '2018-09-02', 'Sunday', '42.35779134', '-71.13937053']



|Index|Columns|
|---|---|
|0|incident_number|
|1|offense_code|
|2|description|
|3|date|
|4|day_of_the_week|
|5|lat|
|6|long|
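
As a first hint at suitable datatypes, we can try to parse each value of the sample row. The sketch below is only a rough heuristic based on a single row, and `guess_type` is a hypothetical helper, not part of the original notebook. The header and row are copied from the output above.

```python
from datetime import datetime

def guess_type(value):
    """Return a rough type label for one CSV string value (heuristic only)."""
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        float(value)
        return "decimal"
    except ValueError:
        pass
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return "date"
    except ValueError:
        return "text"

# Header and first row copied from the output above.
col_headers = ['incident_number', 'offense_code', 'description', 'date',
               'day_of_the_week', 'lat', 'long']
first_row = ['1', '619', 'LARCENY ALL OTHERS', '2018-09-02', 'Sunday',
             '42.35779134', '-71.13937053']

guesses = {name: guess_type(value) for name, value in zip(col_headers, first_row)}
for name, kind in guesses.items():
    print(name, kind)
```

A single row cannot prove that every value in a column parses the same way, so the guesses still need to be checked against the full file.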


## Creating an Auxiliary Function
Before we create a table for storing the crime data, we need to identify the proper datatypes for the columns. To help with that, let's create a function, *get_col_value_set()*, that takes the name of a CSV file and a column index (starting at 0) and returns a Python set of all distinct values in that column.

This function will be useful for two reasons:

1. Checking whether an enumerated datatype might be a good choice for representing a column.
2. Computing the maximum length of any text-like column to select appropriate sizes for VARCHAR columns.

In [5]:
def get_col_value_set(csv_filename, col_index):
    """Return the set of distinct values in one column of a CSV file."""
    value_set = set()
    with open(csv_filename, 'r') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            value_set.add(row[col_index])
    return value_set

for i in range(7):
    print('Num of distinct values at index {}: {}\n'.format(i,len(get_col_value_set('boston.csv',i))))
    

Num of distinct values at index 0: 298329

Num of distinct values at index 1: 219

Num of distinct values at index 2: 239

Num of distinct values at index 3: 1177

Num of distinct values at index 4: 7

Num of distinct values at index 5: 18177

Num of distinct values at index 6: 18177



|Index|Columns|Distinct Values|
|---|---|---|
|0|incident_number|298,329|
|1|offense_code|219|
|2|description|239|
|3|date|1,177|
|4|day_of_the_week|7|
|5|lat|18,177|
|6|long|18,177|
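
Calling the function once per column reads the file seven times. A possible single-pass variant is sketched below. `get_all_col_value_sets` is a hypothetical name, and the tiny CSV written to a temporary file is made up for illustration; it is not the real boston.csv.

```python
import csv
import os
import tempfile

def get_all_col_value_sets(csv_filename):
    """Return one set of distinct values per column, reading the file once."""
    with open(csv_filename, "r", newline="") as f:
        reader = csv.reader(f)
        headers = next(reader)
        value_sets = [set() for _ in headers]
        for row in reader:
            for value_set, value in zip(value_sets, row):
                value_set.add(value)
    return value_sets

# Tiny made-up sample (not the real boston.csv) to exercise the function.
sample = (
    "incident_number,offense_code,day_of_the_week\n"
    "1,619,Sunday\n"
    "2,619,Monday\n"
    "3,3115,Sunday\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write(sample)
    sample_path = tmp.name

distinct_counts = [len(s) for s in get_all_col_value_sets(sample_path)]
os.remove(sample_path)
print(distinct_counts)
```

The trade-off is memory: all seven sets are held at once, which should be fine for a file of this size.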

## Finding the Maximum Length
There are two textual columns: description and day_of_the_week. Since day_of_the_week has only 7 distinct values, we can use an enumerated datatype for that column; its longest value, Wednesday, has length 9. For the description column, we compute the maximum length over all of its values.

In [7]:
max_length = max(len(value) for value in get_col_value_set('boston.csv', 2))
print('Max Length: {}'.format(max_length))

Max Length: 58
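
The same computation can be wrapped in a small helper that also copes with an empty column. This is a minimal sketch: `max_value_length` is a hypothetical name, and of the sample descriptions only the first appears in the sample row shown earlier; the other two are invented.

```python
def max_value_length(values):
    """Length of the longest string in values, or 0 if values is empty."""
    return max((len(v) for v in values), default=0)

# Only "LARCENY ALL OTHERS" comes from the sample row; the rest are made up.
descriptions = {"LARCENY ALL OTHERS", "VANDALISM", "TOWED MOTOR VEHICLE"}
longest = max_value_length(descriptions)
print('Max Length: {}'.format(longest))
```

With the real data, `max_value_length(get_col_value_set('boston.csv', 2))` should give the same result as the cell above.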


## Creating the table
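We will use five data types to create the table:

1. VARCHAR(80) for the column description (the longest value has 58 characters, so 80 leaves some headroom)
2. An enumerated type for the column day_of_the_week
3. INTEGER for the columns incident_number and offence_code
4. DATE for the column date
5. DECIMAL(10,8) for lat and long (precision 10, scale 8)

Note that the table spells the column offence_code while the CSV header uses offense_code. This is harmless for loading, because COPY with the HEADER option skips the header line and fills the columns by position.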
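
A precision of 10 with a scale of 8 leaves only two digits before the decimal point. The sketch below checks whether a value fits without rounding or overflow. `fits_decimal` is a hypothetical helper, the first two values come from the sample row, and `-171.5` is an invented longitude that shows the limit.

```python
from decimal import Decimal

def fits_decimal(text, precision, scale):
    """Check whether a numeric string fits DECIMAL(precision, scale) without rounding."""
    _sign, digits, exponent = Decimal(text).as_tuple()
    frac_digits = max(-exponent, 0)              # digits after the decimal point
    int_digits = max(len(digits) + exponent, 0)  # digits before the decimal point
    return frac_digits <= scale and int_digits <= precision - scale

# The first two values are from the sample row; -171.5 is a made-up longitude.
for value in ['42.35779134', '-71.13937053', '-171.5']:
    print(value, fits_decimal(value, 10, 8))
```

Boston coordinates have two integer digits, so DECIMAL(10,8) is enough here. A dataset that covers longitudes beyond ±99 would need a larger precision.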

### Creating Enumerated datatype

In [8]:
cur.execute("CREATE TYPE day_of_week as ENUM ('Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday');")
conn.commit()

In [9]:
cur.execute('''
CREATE TABLE crimes.boston_crimes (
    incident_number INTEGER PRIMARY KEY,
    offence_code INTEGER,
    description VARCHAR(80),
    date DATE,
    day_of_the_week DAY_OF_WEEK,
    lat DECIMAL(10,8),
    long DECIMAL(10,8)
);
''')
conn.commit()

## Loading the Data

In [14]:
with open('boston.csv','r') as f:
    cur.copy_expert("COPY crimes.boston_crimes FROM STDIN WITH CSV HEADER",f)
conn.commit()    
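
A simple sanity check after a bulk load is to compare the number of data rows in the CSV with the number of rows in the table. The sketch below assumes a DB-API cursor such as the `cur` used above. Both function names are hypothetical, and the in-memory CSV is made up so the counting function can be tried without a database.

```python
import csv
import io

def count_csv_data_rows(f):
    """Count the rows after the header in an open CSV file object."""
    reader = csv.reader(f)
    next(reader, None)  # skip the header; tolerate an empty file
    return sum(1 for _ in reader)

def count_table_rows(cur):
    """Count the rows in crimes.boston_crimes using a DB-API cursor."""
    cur.execute("SELECT COUNT(*) FROM crimes.boston_crimes;")
    return cur.fetchone()[0]

# Made-up in-memory CSV to try the counting function without a database.
sample = io.StringIO("incident_number,offense_code\n1,619\n2,619\n")
expected_rows = count_csv_data_rows(sample)
print(expected_rows)
```

With the real file, the two counts can be compared by opening boston.csv, passing the file object to `count_csv_data_rows`, and checking the result against `count_table_rows(cur)`.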

## Revoking Public Privileges

In [15]:
cur.execute("REVOKE ALL ON SCHEMA public FROM public;")
cur.execute("REVOKE ALL ON DATABASE crime_db FROM public;")
conn.commit()

## Creating User Groups

In [25]:
# Creating Groups
cur.execute("CREATE GROUP readonly NOLOGIN;")
cur.execute("CREATE GROUP readwrite NOLOGIN;")

# Granting privileges on tables in the crimes schema
cur.execute("GRANT SELECT ON ALL TABLES in SCHEMA crimes to readonly;")
cur.execute("GRANT SELECT, INSERT, DELETE, UPDATE on ALL TABLES in SCHEMA crimes to readwrite;")

# Granting CONNECT on database crime_db
cur.execute("GRANT CONNECT ON DATABASE crime_db to readonly,readwrite;")

# Granting Usage on schema crimes
cur.execute("GRANT USAGE ON SCHEMA crimes to readonly, readwrite;")
conn.commit()

## Creating Users

In [26]:
cur.execute("CREATE USER data_analyst WITH PASSWORD 'secret1';")
cur.execute("GRANT readonly to data_analyst;")
cur.execute("CREATE USER data_scientist WITH PASSWORD 'secret2';")
cur.execute("GRANT readwrite to data_scientist;")
conn.commit()



## Testing

In [28]:
cur.execute('''
SELECT grantee, privilege_type
    FROM information_schema.table_privileges
    WHERE grantee IN ('readonly', 'readwrite');
''')
result = cur.fetchall()
print(result)

[('readonly', 'SELECT'), ('readwrite', 'INSERT'), ('readwrite', 'SELECT'), ('readwrite', 'UPDATE'), ('readwrite', 'DELETE')]
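
Instead of reading the printed rows by eye, we can group them by grantee and compare them with what the GRANT statements were meant to produce. This is a minimal sketch: the rows are copied from the query output, and the `expected` mapping is our own reading of the grants above.

```python
from collections import defaultdict

# Rows copied from the output of the privileges query.
rows = [
    ("readonly", "SELECT"),
    ("readwrite", "INSERT"),
    ("readwrite", "SELECT"),
    ("readwrite", "UPDATE"),
    ("readwrite", "DELETE"),
]

# Privileges each group should hold, according to the GRANT statements.
expected = {
    "readonly": {"SELECT"},
    "readwrite": {"SELECT", "INSERT", "UPDATE", "DELETE"},
}

# Collect the privileges per grantee into sets so row order does not matter.
actual = defaultdict(set)
for grantee, privilege in rows:
    actual[grantee].add(privilege)

assert dict(actual) == expected
print("Group privileges match the intended grants.")
```

In the notebook, `rows` would simply be the result of `cur.fetchall()`.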

