# Building a Postgres DB for Crime Reports

>This project walks through building a Postgres database named crime_db, with data types chosen to fit crime report data that is primarily available as CSV files.

## 1. Setup

In [4]:
!pip3 install psycopg2-binary

In [43]:
import psycopg2
import csv
import pandas as pd

>Connecting to the default postgres db with default postgres user

In [9]:
conn = psycopg2.connect("dbname=postgres user=postgres")
cur = conn.cursor()

>Creating a new database crime_db

In [10]:
conn.autocommit = True
cur.execute("CREATE DATABASE crime_db;")
conn.close()
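>Re-running the cell above fails once crime_db exists, because Postgres has no `CREATE DATABASE IF NOT EXISTS`. A minimal sketch of a guard follows; the helper name `ensure_database` is hypothetical, and `conn` is assumed to be an open psycopg2 connection to the default postgres database.

```python
def ensure_database(conn, name):
    """Create database `name` if it is missing; return True if it was created.

    Hypothetical helper: `conn` is assumed to be an open psycopg2 connection.
    """
    # CREATE DATABASE cannot run inside a transaction block
    conn.autocommit = True
    cur = conn.cursor()
    # pg_database is the system catalog listing every database in the cluster
    cur.execute("SELECT 1 FROM pg_database WHERE datname = %s;", (name,))
    if cur.fetchone() is not None:
        return False
    # Identifiers cannot be passed as query parameters, so quote the name by hand
    quoted = '"' + name.replace('"', '""') + '"'
    cur.execute("CREATE DATABASE {};".format(quoted))
    return True
```

>With the connection from the earlier cell this would be called as `ensure_database(conn, "crime_db")`. The name is quoted by hand only to keep the sketch short; psycopg2's `sql.Identifier` is an alternative.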

>Connecting to the new db and creating the schema

In [12]:
conn = psycopg2.connect("dbname=crime_db user=postgres")
cur = conn.cursor()
cur.execute("CREATE SCHEMA crimes;")

>Reading the data

In [58]:
with open("data/boston.csv", "r") as f:
    # read the first two lines and strip the trailing newline from each
    col_headers = next(f).strip()
    first_row = next(f).strip()
print("Headers: ", col_headers)
print("First Row: ", first_row)
# converting col headers to a list
col_headers = col_headers.split(",")

Headers:  incident_number,offense_code,description,date,day_of_the_week,lat,long

First Row:  1,619,LARCENY ALL OTHERS,2018-09-02,Sunday,42.35779134,-71.13937053
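
>Splitting a line on commas works for this header, but it would break on a quoted field that contains a comma. As a sketch of an alternative, `csv.reader` parses each line into a list and drops the line terminator. The in-memory `sample` below stands in for data/boston.csv and holds only the two lines printed above.

```python
import csv
import io

# In-memory stand-in for data/boston.csv (the two lines printed above)
sample = io.StringIO(
    "incident_number,offense_code,description,date,day_of_the_week,lat,long\n"
    "1,619,LARCENY ALL OTHERS,2018-09-02,Sunday,42.35779134,-71.13937053\n"
)

reader = csv.reader(sample)
col_headers = next(reader)  # a list of column names, no newline on the last one
first_row = next(reader)    # a list of string values

# Pair each header with its value to inspect the first record
print(dict(zip(col_headers, first_row)))
```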



## 2. Tables

### 2.1 Defining Optimal Data Types

>The function get_unique() returns the set of unique values in a given column. It helps determine the maximum length of the existing text values and whether a column has few enough distinct values to be stored as an enumerated data type.

In [64]:
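def get_unique(file, col_index):
    """Return the set of unique values in the column at position col_index."""
    f = pd.read_csv(file)
    # iloc[:, col_index] selects every row of a single column
    col = f.iloc[:, col_index]
    return set(col)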

> Computing the number of unique values for each column

In [66]:
for i, col in enumerate(col_headers):
    data = get_unique("data/boston.csv", i)
    length = len(data)
    print(i, " ", col, " # of unique values: ",length)

0   incident_number  # of unique values:  298329
1   offense_code  # of unique values:  219
2   description  # of unique values:  239
3   date  # of unique values:  1177
4   day_of_the_week  # of unique values:  7
5   lat  # of unique values:  18177
6   long  # of unique values:  18177
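
>With only 7 unique values, day_of_the_week is a candidate for an enumerated type. A minimal sketch of building the statement follows. The helper `enum_type_sql` and the type name `crimes.weekday` are hypothetical, and the labels are assumed to be the seven English weekday names (the first row shows 'Sunday').

```python
def enum_type_sql(type_name, values):
    """Build a CREATE TYPE ... AS ENUM statement (hypothetical helper).

    `type_name` is inserted as-is, so it must come from trusted code.
    """
    # Double any single quote so each label is a valid SQL string literal
    labels = ", ".join("'" + v.replace("'", "''") + "'" for v in values)
    return "CREATE TYPE {} AS ENUM ({});".format(type_name, labels)

# Assumed labels: the seven English weekday names
weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
statement = enum_type_sql("crimes.weekday", weekdays)
print(statement)
```

>The resulting string could then be passed to `cur.execute(statement)` on the crime_db connection.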


> Computing the maximum length of the text values in each column

In [77]:
out = dict()
for i, col in enumerate(col_headers):
    data = get_unique("data/boston.csv", i)
    # only text values have a meaningful length; this skips numbers and NaN
    lengths = [len(d) for d in data if isinstance(d, str)]
    if lengths:
        out[col] = max(lengths)
out

{'description': 58, 'date': 10, 'day_of_the_week': 9}
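
>The two loops above re-read the CSV once per column. As a sketch of an alternative, the unique count and the maximum length can be gathered in a single pass using only the standard library. The helper `profile_columns` is hypothetical, and the second row of `sample` is invented for illustration. Because `csv` reads every field as text, lengths are reported for numeric columns as well, and empty fields appear as `''` rather than NaN, so counts can differ slightly from the pandas version.

```python
import csv
import io

def profile_columns(lines):
    """Return {column: (n_unique, max_text_length)} from one pass over CSV lines."""
    reader = csv.reader(lines)
    headers = next(reader)
    seen = [set() for _ in headers]
    for row in reader:
        # zip stops at the shorter sequence, so short or long rows do not raise
        for values, cell in zip(seen, row):
            values.add(cell)
    return {
        header: (len(values), max(map(len, values), default=0))
        for header, values in zip(headers, seen)
    }

# Stand-in for data/boston.csv; the second data row is invented for illustration
sample = io.StringIO(
    "incident_number,offense_code,description,date,day_of_the_week,lat,long\n"
    "1,619,LARCENY ALL OTHERS,2018-09-02,Sunday,42.35779134,-71.13937053\n"
    "2,619,LARCENY ALL OTHERS,2018-09-03,Monday,42.35779134,-71.13937053\n"
)
profile = profile_columns(sample)
print(profile["description"])
```

>On the real file, the same function could be called inside `with open("data/boston.csv", newline="") as f:` by passing `f`.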