<IMG SRC="https://github.com/jacquesroy/byte-size-data-science/raw/master/images/Banner.png" ALT="BSDS Banner" WIDTH=1195 HEIGHT=200>

# Data Understanding and Preparation
To understand data, we need to explore it.

This notebook builds on the following videos:
- <a href="https://youtu.be/xSDP6u_Xqhc">017-Spark Data Exploration</a>
- <a href="https://youtu.be/AeeHapnLhyE">018-Python Pandas Data Exploration</a>
- <a href="https://youtu.be/qw4FtewQFZE">032-JDBC Data Exploration</a>

This time, we dig deeper and look at subjects such as:
- Basic stats
- Normalization
- Reducing categorical choices
- Correlation
- Visualization 


## 060-Data Understanding and Preparation
Execute the next cell if you want to see the `Byte Size Data Science` YouTube channel video.

In [None]:
from IPython.display import IFrame

IFrame(src="https://www.youtube.com/embed/vVX8GLEDwoY?rel=0&amp;controls=0&amp;showinfo=0", width=560, height=315)


## Import the appropriate libraries and set up needed connections

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import ibm_db
import ibm_db_dbi
import math

from ftplib import FTP
import requests, zipfile, io

import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
credentials = {
    'username': 'bluadmin',
    'password': """PASSWORD""",
    'sg_service_url': 'https://sgmanager.ng.bluemix.net',
    'database': 'BLUDB',
    'host': 'dashdb-enterprise-. . . .bluemix.net',
    'port': '50001',
    'url': 'https://undefined'
}
schema="CHICAGO"

In [None]:
dsn = (
    "DRIVER={{IBM DB2 ODBC DRIVER}};"
    "DATABASE={0};"
    "HOSTNAME={1};"
    "PORT={2};"
    "PROTOCOL=TCPIP;"
    "SECURITY=ssl;"
    "UID={3};"
    "PWD={4};").format(credentials['database'], credentials['host'],
                       credentials['port'], credentials['username'],
                       credentials['password'])

conn = ibm_db.connect(dsn, "", "")
pconn = ibm_db_dbi.Connection(conn)

# Chicago Accident data
We use the same data as in the videos mentioned above.

This time, we assume that the data is in a database table, as shown in video 059-CSV to DB.

The original data is at: https://github.com/jacquesroy/byte-size-data-science/raw/master/data/ChicagoTrafficCrashes20180917.csv.zip

### Staging table definition
```
CREATE TABLE CHICAGO.staging_ChicagoAccidents (
  RD_NO                          VARCHAR(8) NOT NULL PRIMARY KEY,
  CRASH_DATE_EST_I               CHAR(3),
  CRASH_DATE                     TIMESTAMP,
  POSTED_SPEED_LIMIT             INTEGER,
  TRAFFIC_CONTROL_DEVICE         VARCHAR(23),
  DEVICE_CONDITION               VARCHAR(24),
  WEATHER_CONDITION              VARCHAR(22),
  LIGHTING_CONDITION             VARCHAR(22),
  FIRST_CRASH_TYPE               VARCHAR(28),
  TRAFFICWAY_TYPE                VARCHAR(31),
  LANE_CNT                       INTEGER,
  ALIGNMENT                      VARCHAR(21),
  ROADWAY_SURFACE_COND           VARCHAR(15),
  ROAD_DEFECT                    VARCHAR(17),
  REPORT_TYPE                    VARCHAR(26),
  CRASH_TYPE                     VARCHAR(32),
  INTERSECTION_RELATED_I         CHAR(3),
  NOT_RIGHT_OF_WAY_I             CHAR(3),
  HIT_AND_RUN_I                  CHAR(3),
  DAMAGE                         VARCHAR(13),
  DATE_POLICE_NOTIFIED           TIMESTAMP,
  PRIM_CONTRIBUTORY_CAUSE        VARCHAR(80),
  SEC_CONTRIBUTORY_CAUSE         VARCHAR(80),
  STREET_NO                      INTEGER,
  STREET_DIRECTION               CHAR(3),
  STREET_NAME                    VARCHAR(31),
  BEAT_OF_OCCURRENCE             INTEGER,
  PHOTOS_TAKEN_I                 CHAR(3),
  STATEMENTS_TAKEN_I             CHAR(3),
  DOORING_I                      CHAR(3),
  WORK_ZONE_I                    CHAR(3),
  WORK_ZONE_TYPE                 VARCHAR(12),
  WORKERS_PRESENT_I              CHAR(3),
  NUM_UNITS                      INTEGER,
  MOST_SEVERE_INJURY             VARCHAR(24),
  INJURIES_TOTAL                 INTEGER,
  INJURIES_FATAL                 INTEGER,
  INJURIES_INCAPACITATING        INTEGER,
  INJURIES_NON_INCAPACITATING    INTEGER,
  INJURIES_REPORTED_NOT_EVIDENT  INTEGER,
  INJURIES_NO_INDICATION         INTEGER,
  INJURIES_UNKNOWN               INTEGER,
  CRASH_HOUR                     INTEGER,
  CRASH_DAY_OF_WEEK              INTEGER,
  CRASH_MONTH                    INTEGER,
  LATITUDE                       DOUBLE,
  LONGITUDE                      DOUBLE 
) ORGANIZE BY ROW;
```

## Looking at distinct values

In [None]:
sql = """
SELECT * FROM {0}.staging_ChicagoAccidents LIMIT 2
""".format(schema)
data_pd = pd.read_sql(sql, pconn)
data_pd.head(5)

## How many non null values and distinct values in columns
This is something we did in videos 17, 18, and 32. This time, we create an SQL statement programmatically.

In [None]:
# Use the column names to create an SQL statement
# The staging table defined above has no location (geometry) column, so no column needs to be skipped
sql = "SELECT "
for name in data_pd.columns.to_list() :
    sql = sql + "count({0}) {1}, count(distinct {0}) {2},\n".format(name, name + "_count", name + "_distinct")

sql = sql[:-2] + "\n FROM {0}.staging_ChicagoAccidents".format(schema)

result_pd = pd.read_sql(sql, pconn)
result_dict = result_pd.iloc[0].to_dict()

for name in data_pd.columns.to_list() :
    print("{0:30s}COUNT {1:8.0f}\tDISTINCT {2:8.0f}".format(name,result_dict[name + '_COUNT'],result_dict[name + '_DISTINCT'] ))
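
For a small extract that is already in a DataFrame, the same two numbers can be computed without SQL. This is only a minimal sketch: `sample_pd` is a hypothetical four-row sample, not the real staging table.

```python
import pandas as pd

# Hypothetical sample standing in for a few rows of the staging table
sample_pd = pd.DataFrame({
    "WEATHER_CONDITION": ["CLEAR", "RAIN", "CLEAR", None],
    "POSTED_SPEED_LIMIT": [30, 30, 25, 30],
})

# count() ignores nulls; nunique() counts distinct non-null values
summary_pd = pd.DataFrame({
    "COUNT": sample_pd.count(),
    "DISTINCT": sample_pd.nunique(),
})
print(summary_pd)
```

On the full table, the SQL approach is usually preferable because the counting happens in the database and only one row comes back.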

## Normalization
Here, we are talking about relational database normal forms.

We have multiple columns that are categorical. For example:
- `TRAFFIC_CONTROL_DEVICE         VARCHAR(23)` - 15 unique values
- `DEVICE_CONDITION               VARCHAR(24)` -  8 unique values
- `WEATHER_CONDITION              VARCHAR(22)` -  9 unique values
- `LIGHTING_CONDITION             VARCHAR(22)` -  6 unique values
- `FIRST_CRASH_TYPE               VARCHAR(28)` - 15 unique values
- `TRAFFICWAY_TYPE                VARCHAR(31)` - 11 unique values

We can create a small table for each categorical column and populate it with the column's unique values.
Then we replace the string in the ChicagoAccidents table with a 4-byte integer.

For example, we would create a `TRAFFIC_CONTROL_DEVICE` table with two columns and 15 rows like:

```
ID  DESCRIPTION
1	LANE USE MARKING
2	NO CONTROLS
3	NO PASSING
4	OTHER
5	OTHER RAILROAD CROSSING
```

Then, the ChicagoAccidents column `TRAFFIC_CONTROL_DEVICE` would be replaced by `TRAFFIC_CONTROL_DEVICE_ID`.

This saves storage space and converts our text to a number that is more appropriate for modeling.
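
The idea can be illustrated in pandas before doing it in the database. This is a minimal sketch on hypothetical values; the names `acc_pd`, `lookup_pd`, and `id_by_description` are assumptions for illustration, and the ids here come from alphabetical order rather than from a database identity column.

```python
import pandas as pd

# Hypothetical values standing in for the TRAFFIC_CONTROL_DEVICE column
acc_pd = pd.DataFrame({
    "TRAFFIC_CONTROL_DEVICE": ["NO CONTROLS", "OTHER", "NO CONTROLS", "NO PASSING"]
})

# Lookup table: one row per distinct description, ids starting at 1
descriptions = sorted(acc_pd["TRAFFIC_CONTROL_DEVICE"].unique())
lookup_pd = pd.DataFrame({
    "ID": range(1, len(descriptions) + 1),
    "DESCRIPTION": descriptions,
})

# Replace the text with the matching integer id
id_by_description = dict(zip(lookup_pd["DESCRIPTION"], lookup_pd["ID"]))
acc_pd["TRAFFIC_CONTROL_DEVICE_ID"] = acc_pd["TRAFFIC_CONTROL_DEVICE"].map(id_by_description)
acc_pd = acc_pd.drop(columns=["TRAFFIC_CONTROL_DEVICE"])

print(lookup_pd)
print(acc_pd)
```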

In [None]:
# Categorical columns that we want to convert into separate tables
cat_columns = ['TRAFFIC_CONTROL_DEVICE','DEVICE_CONDITION','WEATHER_CONDITION','LIGHTING_CONDITION',
           'FIRST_CRASH_TYPE','TRAFFICWAY_TYPE','ALIGNMENT','ROADWAY_SURFACE_COND','ROAD_DEFECT',
           'REPORT_TYPE','CRASH_TYPE','DAMAGE','PRIM_CONTRIBUTORY_CAUSE','SEC_CONTRIBUTORY_CAUSE',
           'WORK_ZONE_TYPE','MOST_SEVERE_INJURY'
          ]

In [None]:
table_def = """
CREATE TABLE {0}.{1}_TABLE (
    id          INT GENERATED ALWAYS AS IDENTITY
                    (START WITH 1, INCREMENT BY 1),
    description VARCHAR(80),
    group_id    INT DEFAULT -1,

    PRIMARY KEY(id)
) ORGANIZE BY ROW;
"""

for col in cat_columns :
    sql = table_def.format(schema,col)
    cur = pconn.cursor()
    cur.execute(sql)
    print("Table {0}_TABLE created".format(col))

In [None]:
insert_sql = """
  INSERT INTO {0}.{1}_TABLE(description)
    SELECT distinct {1} AS description 
    FROM {0}.staging_ChicagoAccidents
"""
for col in cat_columns :
    sql = insert_sql.format(schema,col)
    cur = pconn.cursor()
    cur.execute(sql)
    print("Table {0}_TABLE populated".format(col))
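
Once the lookup tables are populated, each text value can be mapped to its id with a join. The helper below is a hypothetical sketch, not part of the original notebook: it only builds the SQL text, and the `_ID` column alias is an assumed naming convention.

```python
def build_id_select(schema, col):
    """Build a SELECT that returns, for each crash, the lookup id of one categorical column."""
    return (
        "SELECT acc.RD_NO, attr.id AS {1}_ID\n"
        "FROM {0}.staging_ChicagoAccidents acc\n"
        "LEFT JOIN {0}.{1}_TABLE attr\n"
        "  ON acc.{1} = attr.description"
    ).format(schema, col)

# Example with one of the categorical columns
sql_text = build_id_select("CHICAGO", "WEATHER_CONDITION")
print(sql_text)
```

A `LEFT JOIN` is used so that rows whose value is NULL are kept with a NULL id instead of being dropped.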

## Reducing categorical choices
The posted speed limit has 31 distinct values. We could reduce that number in two ways:
- Remove suspicious values from our analysis
- Group the values
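
Both options can be sketched in pandas. The sample values, the plausibility rule (multiples of 5 between 5 and 70), and the bin edges below are assumptions chosen for illustration, not values derived from the Chicago data.

```python
import pandas as pd

# Hypothetical posted speed limits, including a few suspicious ones
speed_pd = pd.DataFrame({"POSTED_SPEED_LIMIT": [0, 3, 15, 25, 30, 30, 35, 45, 55, 99]})

# Option 1: remove suspicious values (assumed rule: multiples of 5 from 5 to 70)
limit = speed_pd["POSTED_SPEED_LIMIT"]
plausible = limit.between(5, 70) & (limit % 5 == 0)
clean_pd = speed_pd[plausible]

# Option 2: group the remaining values into a few ranges (assumed bin edges)
bins = [0, 20, 30, 40, 70]
labels = ["<=20", "21-30", "31-40", "41-70"]
grouped = pd.cut(clean_pd["POSTED_SPEED_LIMIT"], bins=bins, labels=labels)
group_counts = grouped.value_counts().sort_index()
print(group_counts)
```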

## Looking at count per categorical value in each attribute
We need to add a few columns that are numeric but still categorical.

In [None]:
other_cat_columns = ['POSTED_SPEED_LIMIT','LANE_CNT','NUM_UNITS', 'INJURIES_TOTAL',
                     'CRASH_HOUR','CRASH_DAY_OF_WEEK','CRASH_MONTH']
cat_all = cat_columns + other_cat_columns

### Build the SQL statement
We need the count of each value for every categorical column.

In [None]:
# Do the same thing but with queries to the tables.
# Build a series of SQL statements and combine them with UNION ALL

query = """
SELECT '{1}' COLNAME, attr.id COLVALUE, COUNT(*) VALCOUNT
FROM {0}.staging_ChicagoAccidents acc, {0}.{1}_table attr
WHERE acc.{1} = attr.description
GROUP BY '{1}', attr.id 
"""

query2 = """
SELECT '{1}' COLNAME, {1} COLVALUE, COUNT(*) VALCOUNT
FROM {0}.staging_ChicagoAccidents acc
GROUP BY '{1}', {1}
"""

sql = ""
for name in cat_columns :
    if (len(sql) > 0 ) :
        sql = sql + "UNION ALL"
    sql = sql + query.format(schema,name)
for name in other_cat_columns :
    sql = sql + "UNION ALL"
    sql = sql + query2.format(schema,name)

stats_pd = pd.read_sql(sql, pconn)
print("Number of records: {0}".format(stats_pd.shape[0]))

### Display the graphs

In [None]:
nb_rows = math.ceil(len(cat_all) / 2)

fig, axes = plt.subplots(nrows=nb_rows, ncols=2)
fig.set_figheight(75)
fig.set_figwidth(15)
for ix, ax in enumerate(axes.flatten()) :
    if (ix < len(cat_all) ) :
        tmp_pd = stats_pd[stats_pd['COLNAME'] == cat_all[ix]].sort_values(by=['COLVALUE'])
        tmp_pd.plot.bar(ax=ax, x='COLVALUE', y='VALCOUNT',title=cat_all[ix], legend=False)
        ax.set_xlabel('')
    else:
        fig.delaxes(ax)