# Predicting Blight Violations

_This project was completed as a part of the Applied Data Science with Python Specialization from Coursera._

In this project, we evaluate the performance and predictive power of a model trained and tested on data from Blight Violation Notices (BVN), or Blight Tickets. These tickets are issued to property owners who violate City of Detroit ordinances governing how the exterior of a property must be maintained. Blight Tickets are issued by city inspectors, police officers, neighborhood city hall managers, and other city officials who investigate complaints of blight; the tickets are managed by the Department of Administrative Hearings.

The target variable is compliance, which is __True__ if the ticket was paid early, on time, or within one month of the hearing date; __False__ if the ticket was paid more than one month after the hearing date or not at all; and __Null__ if the violator was found not responsible. Compliance is not available in the dataset, so it must be calculated.
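
As a worked illustration of the labeling rule above, the sketch below classifies a single ticket from its hearing and payment dates. The function name `label_compliance`, its arguments, and the 30-day reading of "one month" are assumptions made for this example only; the notebook's own implementation is in the pre-processing cell.

```python
from datetime import date, timedelta

def label_compliance(responsible, hearing_date, payment_date=None, window_days=30):
    """Return True, False, or None (null) for one ticket.

    Hypothetical helper: window_days=30 is an assumed reading of
    "within one month of the hearing date".
    """
    if not responsible:
        return None   # violator found not responsible: no label
    if payment_date is None:
        return False  # ticket was never paid
    # Early, on-time, and up-to-one-month-late payments all count as compliant
    return payment_date <= hearing_date + timedelta(days=window_days)

hearing = date(2016, 5, 10)
print(label_compliance(True, hearing, date(2016, 5, 1)))   # paid early
print(label_compliance(True, hearing, date(2016, 6, 5)))   # paid 26 days after the hearing
print(label_compliance(True, hearing, date(2016, 8, 1)))   # paid 83 days after the hearing
print(label_compliance(True, hearing))                     # never paid
print(label_compliance(False, hearing, date(2016, 5, 1)))  # not responsible
```

The three-valued return mirrors the True/False/Null target: `None` rows would be dropped before training rather than treated as non-compliant.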

The dataset for this project originates from the City of Detroit's Open Data Portal initiative and is updated daily. For this project, the dataset spans tickets issued from March 2004 to March 2018. The dataset is available at: https://data.detroitmi.gov/Property-Parcels/Blight-Violations/ti6p-wcg4

## Pre-Processing

This step imports the raw data, calculates compliance for each ticket, removes features that cause data leakage, and separates the data into two datasets based on when the violation occurred. Violations issued before 2017 will be used to train and test the model, and tickets issued in 2017 or later will be used to validate the model.
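
The date-based split described above could look roughly like the sketch below. It runs on a small made-up frame; the names `tickets`, `cutoff`, `train_test_df`, and `validation_df` are hypothetical, and the cutoff of January 1, 2017 is an assumed reading of "before 2017".

```python
import pandas as pd

# Toy stand-in for the processed tickets (values are made up for illustration)
tickets = pd.DataFrame(
    {
        "violation_date": pd.to_datetime(
            ["2015-03-14", "2016-12-31", "2017-01-01", "2018-02-20"]
        ),
        "compliance": [0.0, 1.0, 0.0, 1.0],
    },
    index=pd.Index([101, 102, 103, 104], name="ticket_id"),
)

cutoff = pd.Timestamp("2017-01-01")  # assumed boundary: first day of 2017

# Tickets before the cutoff feed training/testing; the rest are held out
train_test_df = tickets[tickets["violation_date"] < cutoff]
validation_df = tickets[tickets["violation_date"] >= cutoff]

print(len(train_test_df), len(validation_df))
```

Splitting on time rather than at random keeps the validation set strictly later than anything the model was fitted on, which is closer to how the model would be used on new tickets.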

In [1]:
# Common imports
import os
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import normalize
from scipy.sparse import coo_matrix, hstack

# To make the notebook's output stable across runs
random_seed = 12062017
np.random.seed(random_seed)

In [2]:
def load_blight_raw(blight_path):
    '''
    Loads the raw blight_violations.csv
    Returns a pandas dataframe.
    '''
    csv_path = os.path.join(blight_path, "Blight_Violations.csv")
    blight_raw = pd.read_csv(csv_path, encoding='ISO-8859-1')
    print("RAW Blight Violations dataset has {} observations with {} features each.".format(*blight_raw.shape))
    return blight_raw

def pre_processing(blight_df):
    ''' 
    Takes the blight pandas dataframe and makes column names more readable, sets ticket_id as the index, creates compliance (the 
    prediction variable), and creates compliance_detail (explains why a ticket was labeled as compliant or non-compliant).
    Returns a pandas dataframe.
    '''
    # Makes column names more readable
    # Gets columns from dataframe
    columns_names = pd.Series(blight_df.columns)
    # Drops any parenthesized suffix and strips surrounding whitespace
    columns_names = columns_names.str.split('(').str[0].str.strip()
    # Converts column names to lowercase and replaces spaces with underscores
    columns_names = columns_names.str.lower().str.replace(' ', '_')
    blight_df.columns = columns_names
    # Sets ticket_id as index
    blight_df.set_index('ticket_id', inplace=True)
    # Defines prediction variable compliance and helper variable compliance_detail
    # These are placeholder values; populate_compliance() sets the real ones
    blight_df['compliance'] = 0
    blight_df['compliance_detail'] = np.nan
    return blight_df

def clean_up(blight_df):
    # Keeps only rows whose 'violation_date' looks like MM/DD/20YY
    blight_df = blight_df[blight_df['violation_date'].str.contains(r'[0-9]{2}/[0-9]{2}/20[0-9]{2}') == True]
    # Converts 'violation_date' to datetime
    blight_df.loc[:, 'violation_date'] = pd.to_datetime(blight_df['violation_date'])
    # Converts 'payment_date' to datetime (expand=False keeps the extracted dates as a Series)
    blight_df.loc[:, 'payment_date'] = pd.to_datetime(blight_df['payment_date'].str.extract(r'([0-9]{2}/[0-9]{2}/20[0-9]{2})', expand=False))
    # Converts bad values in 'ticket_issued_time' to NaNs (Dataset error removed by publisher)
    #blight_df.loc[blight_df['violation_date'] == '00000000000000.000', 'violation_date'] = np.nan
    # Converts 'ticket_issued_time' to datetime.time
    blight_df.loc[:, 'ticket_issued_time'] = pd.DatetimeIndex(blight_df['ticket_issued_time']).time
    # Converts 'hearing_time' to datetime.time
    blight_df.loc[:, 'hearing_time'] = pd.DatetimeIndex(blight_df['hearing_time']).time
    # Removes "$" from 'judgment_amount'
    blight_df.loc[:, 'judgment_amount'] = blight_df['judgment_amount'].str.strip('$')    
    # Converts 'judgment_amount' to float    
    blight_df.loc[:, 'judgment_amount'] = blight_df['judgment_amount'].astype(float)
    # Removes "$" from 'payment_amount' and converts it to float
    blight_df.loc[:, 'payment_amount'] = blight_df['payment_amount'].str.strip('$').astype(float)
    return blight_df

def null_compliance(blight_df):
    '''
    Compliance = np.NaN
    Tickets that cannot be classified as compliant or non-compliant because they were ruled as not responsible in disposition
    or they are awaiting judgement.
    Returns a pandas dataframe
    '''
    # Dispositions that are ruled as not responsible or are still pending
    null_dispositions = ['Not responsible by Dismissal', 'Not responsible by City Dismissal', 'PENDING JUDGMENT', 
                 'Not responsible by Determination','SET-ASIDE (PENDING JUDGMENT)']
    # Loops over dispositions, setting 'compliance' to NaN and
    # 'compliance_detail' to 'Not Responsible/Pending Judgement'
    for disposition in null_dispositions:
        blight_df.loc[blight_df['disposition'] == disposition, 'compliance'] = np.nan
        blight_df.loc[blight_df['disposition'] == disposition, 'compliance_detail'] = 'Not Responsible/Pending Judgement'
    return blight_df

def compliant(blight_df):
    '''
    Compliance = 1
    Tickets that are classified as compliant because they had no fine, the fine was waived, a payment was made with a hearing
    pending, or the payment was early (hearing not pending), on time, or within one month after the hearing date.
    Returns a pandas dataframe
    '''
    ## Compliant by no fine
    # Fine Waived by Determination
    blight_df.loc[blight_df['disposition'] == 'Responsible (Fine Waived) by Determination', 'compliance'] = 1
    blight_df.loc[blight_df['disposition'] == 'Responsible (Fine Waived) by Determination', 'compliance_detail'] = 'Compliant by no fine'
    # Fine Waived by Admission
    blight_df.loc[blight_df['disposition'] == 'Responsible (Fine Waived) by Admission', 'compliance'] = 1
    blight_df.loc[blight_df['disposition'] == 'Responsible (Fine Waived) by Admission', 'compliance_detail'] = 'Compliant by no fine'
    ## Compliant by Payment
    # Payment with PENDING hearing
    blight_df.loc[(
        (blight_df['hearing_date'] == 'PENDING') &
        (blight_df['payment_amount'] > 0) &
        (blight_df['compliance_detail'].isnull())), 'compliance'] = 1
    blight_df.loc[(
        (blight_df['hearing_date'] == 'PENDING') &
        (blight_df['payment_amount'] > 0) &
        (blight_df['compliance_detail'].isnull())), 'compliance_detail'] = 'Compliant by payment with PENDING hearing'
    # Converts 'hearing_date' to pandas datetimes (the original strings are restored below)
    dummy_hearing_date = blight_df['hearing_date'].copy()
    blight_df.loc[blight_df['hearing_date'] == 'PENDING', 'hearing_date'] = np.nan
    blight_df.loc[:, 'hearing_date'] = pd.to_datetime(blight_df['hearing_date'].str.extract(r'([0-9]{2}/[0-9]{2}/20[0-9]{2})', expand=False))
    # Early Payment, payment before hearing date
    blight_df.loc[(
        (blight_df['payment_date'] < blight_df['hearing_date']) &
        (blight_df['payment_amount'] > 0) &
        (blight_df['compliance_detail'].isnull())), 'compliance'] = 1
    blight_df.loc[(
        (blight_df['payment_date'] < blight_df['hearing_date']) &
        (blight_df['payment_amount'] > 0) &
        (blight_df['compliance_detail'].isnull())), 'compliance_detail'] = 'Compliant by early payment'
    # Payment on time, payment on hearing date
    blight_df.loc[(
        (blight_df['payment_date'] == blight_df['hearing_date']) &
        (blight_df['payment_amount'] > 0) &
        (blight_df['compliance_detail'].isnull())), 'compliance'] = 1
    blight_df.loc[(
        (blight_df['payment_date'] == blight_df['hearing_date']) &
        (blight_df['payment_amount'] > 0) &
        (blight_df['compliance_detail'].isnull())), 'compliance_detail'] = 'Compliant by payment on time'
    # Payment within one month after hearing date
    blight_df.loc[(
        ((blight_df['payment_date'] - blight_df['hearing_date'])/np.timedelta64(1, 'M') <= 1.0) &
        (blight_df['payment_amount'] > 0) &
        (blight_df['compliance_detail'].isnull())), 'compliance'] = 1
    blight_df.loc[(
        ((blight_df['payment_date'] - blight_df['hearing_date'])/np.timedelta64(1, 'M') <= 1.0) &
        (blight_df['payment_amount'] > 0) &
        (blight_df['compliance_detail'].isnull())), 'compliance_detail'] = 'Compliant by payment within 1 Month'
    # Sets 'hearing_date' to original state
    blight_df.loc[:, 'hearing_date'] = dummy_hearing_date
    return blight_df

def non_compliant(blight_df):
    '''
    Compliance = 0
    Tickets that are classified as non-compliant because they made no payment or a payment after one month (late payment).
    Returns a pandas dataframe
    '''
    # Converts 'hearing_date' to pandas datetimes (the original strings are restored below)
    dummy_hearing_date = blight_df['hearing_date'].copy()
    blight_df.loc[blight_df['hearing_date'] == 'PENDING', 'hearing_date'] = np.nan
    blight_df.loc[:, 'hearing_date'] = pd.to_datetime(blight_df['hearing_date'].str.extract(r'([0-9]{2}/[0-9]{2}/20[0-9]{2})', expand=False))
    # Non-compliant by late payment
    blight_df.loc[(
        ((blight_df['payment_date'] - blight_df['hearing_date'])/np.timedelta64(1, 'M') > 1.0) &
        (blight_df['payment_amount'] > 0) &
        (blight_df['compliance_detail'].isnull())), 'compliance_detail'] = 'Non-compliant by late payment more than 1 month'
    # Non-compliant by no payment
    blight_df.loc[blight_df['compliance_detail'].isnull(), 'compliance_detail'] = 'Non-compliant by no payment'
    # Sets 'hearing_date' to original state
    blight_df.loc[:, 'hearing_date'] = dummy_hearing_date
    return blight_df

def populate_compliance(blight_df):
    '''
    Populates compliance and compliance_detail in blight_df with the correct values.
    Returns a pandas dataframe.
    '''
    blight_df = null_compliance(blight_df)
    blight_df = compliant(blight_df)
    blight_df = non_compliant(blight_df)  
    return blight_df

def remove_leakage(blight_df):
    ''' 
    Removes variables from blight_df that would cause data leakage, along with variables that are mostly NaNs.
    Returns a pandas dataframe
    '''
    blight_df = blight_df[['agency_name', 'inspector_name', 'violator_name','violation_street_number', 'violation_street_name', 
                       'mailing_address_street_name', 'mailing_address_street_name', 'mailing_address_city', 
                       'mailing_address_state', 'mailing_address_zip_code', 'mailing_address_non-usa_code', 
                       'mailing_address_country', 'violation_date', 'hearing_date', 'violation_code', 'violation_description',
                       'disposition', 'fine_amount', 'admin_fee', 'state_fee', 'late_fee', 'discount_amount', 
                       'judgment_amount', 'violation_latitude', 'violation_longitude', 'compliance']]  
    return blight_df 

def process_blight(blight_path, pre_process):
    '''
    If pre_process is False, loads the raw Blight_Violations.csv, cleans the data, computes compliance, removes features with 
    data leakage, and saves the resulting dataframe. If pre_process is True, loads the previously saved pre-processed data.
    Returns a pandas dataframe.
    '''
    if not pre_process:
        blight_df = load_blight_raw(blight_path)
        blight_df = pre_processing(blight_df)
        blight_df = clean_up(blight_df)
        blight_df = populate_compliance(blight_df)
        blight_df = remove_leakage(blight_df)
        save_blight_data(blight_path, blight_df)
    else:
        blight_df = load_blight(blight_path)
    print("PROCESSED Blight Violations dataset has {} observations with {} features each.".format(*blight_df.shape))        
    return blight_df

def save_blight_data(blight_path, blight_df):
    '''
    Saves the processed blight_df
    Returns nothing.
    '''
    csv_path = os.path.join(blight_path, "Blight_Violations_Processed.csv")
    blight_df.to_csv(csv_path)
    
def load_blight(blight_path):
    '''
    Loads the preprocessed blight_violations.csv
    Returns a pandas dataframe.
    '''
    csv_path = os.path.join(blight_path, "Blight_Violations_Processed.csv")
    blight_df = pd.read_csv(csv_path)
    blight_df.set_index('ticket_id', inplace=True)
    return blight_df

In [3]:
# Loads the previously pre-processed blight data (pre_process=True skips the raw pipeline)
blight_df = process_blight(r"C:\Users\Adrian\Google Drive\Datasets\Blight-Violations", True)
# Checks to make sure the data loaded properly
blight_df.info()

PROCESSED Blight Violations dataset has 373844 observations with 26 features each.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 373844 entries, 47056 to 247933
Data columns (total 26 columns):
agency_name                      373844 non-null object
inspector_name                   373844 non-null object
violator_name                    373842 non-null object
violation_street_number          373844 non-null int64
violation_street_name            373779 non-null object
mailing_address_street_name      373841 non-null object
mailing_address_street_name.1    373841 non-null object
mailing_address_city             372028 non-null object
mailing_address_state            371565 non-null object
mailing_address_zip_code         372025 non-null object
mailing_address_non-usa_code     1819 non-null object
mailing_address_country          1830 non-null object
violation_date                   373844 non-null object
hearing_date                     373234 non-null object
violation_code         

Now let's remove irrelevant fields and unusable observations:
* Observations where compliance is _NaN_, since they carry no label to learn from
* Features that are too noisy or contain many _NaNs_: inspector_name, violator_name, violation_street_number, violation_street_name, mailing_address_street_name, mailing_address_street_name.1, violation_code, and violation_description
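
The two removals listed above can be sketched on a small made-up frame as follows. The names `toy`, `noisy_columns`, and `cleaned` are hypothetical, and only a subset of the listed columns is included to keep the example short.

```python
import numpy as np
import pandas as pd

# Toy frame using a few of the real column names (values are made up)
toy = pd.DataFrame(
    {
        "agency_name": ["Agency A", "Agency B", "Agency A"],
        "inspector_name": ["x", "y", "z"],
        "violation_code": ["9-1-36", "22-2-88", "9-1-104"],
        "fine_amount": [250.0, 100.0, 50.0],
        "compliance": [0.0, np.nan, 1.0],
    }
)

noisy_columns = ["inspector_name", "violation_code"]  # subset of the list above

# Keep only labeled tickets, then drop the noisy features.
# errors="ignore" skips any listed column that is not present.
cleaned = (
    toy.dropna(subset=["compliance"])
       .drop(columns=noisy_columns, errors="ignore")
)
print(cleaned.shape)
```

Returning a new frame instead of using `inplace=True` leaves the original untouched, which makes it easier to re-run a cell without reloading the data.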

In [11]:
# Drop observations where compliance is NaNs
blight_df.dropna(axis=0, subset=['compliance'], inplace=True)

In [5]:
blight_df.columns

Index(['agency_name', 'inspector_name', 'violator_name',
       'violation_street_number', 'violation_street_name',
       'mailing_address_street_name', 'mailing_address_street_name.1',
       'mailing_address_city', 'mailing_address_state',
       'mailing_address_zip_code', 'mailing_address_non-usa_code',
       'mailing_address_country', 'violation_date', 'hearing_date',
       'violation_code', 'violation_description', 'disposition', 'fine_amount',
       'admin_fee', 'state_fee', 'late_fee', 'discount_amount',
       'judgment_amount', 'violation_latitude', 'violation_longitude',
       'compliance'],
      dtype='object')

In [None]:
blight_df[['inspector_name', 'violator_name',
       'violation_street_number', 'violation_street_name',
       'mailing_address_street_name', 'mailing_address_street_name.1', 'mailing_address_zip_code', 'mailing_address_non-usa_code',
        'violation_code', 'violation_description']]

### Violation Description

In [22]:
blight_df.dropna(axis=0, subset=['violation_description'], inplace=True)

In [26]:
from sklearn.feature_extraction.text import CountVectorizer
import gensim

# Use CountVectorizer to find tokens of three or more characters, remove stop_words,
# remove tokens that don't appear in at least 20 documents,
# remove tokens that appear in more than 20% of the documents
vect = CountVectorizer(min_df=20, max_df=0.2, stop_words='english', token_pattern='(?u)\\b\\w\\w\\w+\\b')

# Fit and transform
X = vect.fit_transform(blight_df['violation_description'])

# Convert sparse matrix to gensim corpus.
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)

# Mapping from word IDs to words (To be used in LdaModel's id2word parameter)
id_map = dict((v, k) for k, v in vect.vocabulary_.items())
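
To see what the `token_pattern` above keeps, the sketch below applies the same regular expression to a lowercased phrase with the standard `re` module. This reproduces only the tokenization step; stop-word removal and the `min_df`/`max_df` filters are not modeled here, and the sample phrase is an illustrative fragment rather than a full description from the dataset.

```python
import re

# Same pattern passed to CountVectorizer above: runs of 3+ word characters
token_pattern = re.compile(r"(?u)\b\w\w\w+\b")

description = "Excessive weeds or plant growth"
tokens = token_pattern.findall(description.lower())
print(tokens)  # ['excessive', 'weeds', 'plant', 'growth']
```

Two-letter words such as "or" never reach the vocabulary, so short connectives are removed even before the stop-word list is applied.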



In [27]:
# Estimate LDA model parameters on the corpus with gensim's LdaModel
# and save the fitted model to the variable `ldamodel`
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=10, 
        id2word=id_map, passes=25, random_state=34)

In [30]:
ldamodel.show_topics(num_topics=5, num_words=8)

[(5,
  '0.083*"comply" + 0.074*"unlawful" + 0.069*"order" + 0.069*"emergency" + 0.068*"danger" + 0.068*"imminent" + 0.068*"structure" + 0.060*"occupancy"'),
 (8,
  '0.221*"weeds" + 0.221*"excessive" + 0.221*"growth" + 0.221*"plant" + 0.019*"failed" + 0.015*"land" + 0.013*"use" + 0.012*"requirements"'),
 (3,
  '0.105*"hours" + 0.100*"bulk" + 0.100*"time" + 0.100*"designated" + 0.100*"deposited" + 0.059*"owned" + 0.056*"depositing" + 0.054*"permit"'),
 (1,
  '0.468*"rental" + 0.388*"registration" + 0.012*"residential" + 0.011*"defective" + 0.011*"vehicles" + 0.010*"open" + 0.006*"improperly" + 0.006*"stored"'),
 (4,
  '0.148*"collection" + 0.108*"private" + 0.084*"containers" + 0.081*"secure" + 0.077*"city" + 0.077*"services" + 0.041*"public" + 0.028*"permit"')]
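
Each topic above is returned as a formatted string of `weight*"word"` terms. If the weights are needed as numbers, one option is to parse the string, as in the sketch below; the variable names are hypothetical and the string is the first five terms of topic 8 from the output above. gensim's `show_topics` also accepts `formatted=False`, which should return word/weight pairs directly and avoid the parsing step.

```python
import re

# First five terms of topic 8, copied from the show_topics output above
topic = '0.221*"weeds" + 0.221*"excessive" + 0.221*"growth" + 0.221*"plant" + 0.019*"failed"'

# Each term has the form weight*"word"
pairs = [
    (word, float(weight))
    for weight, word in re.findall(r'([\d.]+)\*"([^"]+)"', topic)
]
print(pairs)
```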

In [21]:
blight_df.loc[blight_df['violation_description'].isnull(), :]

ticket_id,agency_name,inspector_name,violator_name,violation_street_number,violation_street_name,mailing_address_street_name,mailing_address_street_name.1,mailing_address_city,mailing_address_state,mailing_address_zip_code,...,disposition,fine_amount,admin_fee,state_fee,late_fee,discount_amount,judgment_amount,violation_latitude,violation_longitude,compliance
329109,Department of Public Works,Maydell Bell,PHILLIP COLE,15707,PREST,GRANDMONT,GRANDMONT,DETROIT,MI,48227,...,Responsible by Default,,$20.00,$10.00,$0.00,$0.00,30.0,42.405392,-83.198039,0.0


ticket_id,inspector_name,violator_name,violation_street_number,violation_street_name,mailing_address_street_name,mailing_address_street_name.1,mailing_address_zip_code,mailing_address_non-usa_code,violation_code,violation_description
47056,Dennis Williams,ARTHUR WALKER,5196,TRUMBULL,ROSELAWN,ROSELAWN,48228,,22-2-17,Improper storage and separation of solid waste...
78398,Aaron Rose,Clifford Brookins,16901,BURGESS,Burgess,Burgess,48219,,9-1-110(a),Inoperable motor vehicle(s) one- or two-family...
72605,Amanda Bickers-Holmes,SCOTT MANAGEMENT,12696,CHAPEL,GREINER,GREINER,48205,,9-1-105,Rodent harborage one-or two-family dwelling or...
65652,Paul Gray,SHIRLEY NORMAN,20447,FLEMING,FLEMING,FLEMING,48243,,22-2-88,"Failure of owner to keep property, its sidewal..."
70572,Billy Hayes,HOWARD STEEL JR.,14504,ST MARYS,BRIARCLIFF,BRIARCLIFF,48221,,22-2-88,"Failure of owner to keep property, its sidewal..."
73928,Doris Houston,BYRON DANDRIDGE,651,MANISTIQUE,MANISTIQUE,MANISTIQUE,48215,,22-2-88,"Failure of owner to keep property, its sidewal..."
76597,Marilyn Evans,FANNIE MAE,15367,STRATHMOOR,VISION DR,VISION DR,43219,,22-2-88,"Failure of owner to keep property, its sidewal..."
75770,Marilyn Evans,BETTY HAMM,6364,LINSDALE,LINSDALE,LINSDALE,48204,,22-2-88,"Failure of owner to keep property, its sidewal..."
77704,Orbie Gailes,FISHER MAINTENANCE CO.,14893,SUSSEX,BIRCHCREST DRIVE,BIRCHCREST DRIVE,48221,,9-1-81(a),Failure to obtain certificate of registration ...
93281,Doris Houston,SUSIE WILSON,16551,KENTUCKY,KENTUCKY,KENTUCKY,48221,,9-1-104,Excessive weeds or plant growth one- or two-fa...


In [27]:
len(blight_df['violation_code'].value_counts())

276
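
Some of these codes carry a subsection suffix, such as `9-1-110(a)` in the table above, so grouping them by base code reduces the number of categories. The sketch below shows one way to do that in plain Python. The helper `base_code` is hypothetical, and the added whitespace stripping assumes that some variants differ only by stray spaces, which has not been verified against the raw data.

```python
def base_code(code):
    """Reduce a violation code such as '9-1-110(a)' to its base '9-1-110'.

    Hypothetical helper: cuts at the first '(', '/', or '.', then strips
    surrounding whitespace (an assumption about stray variants).
    """
    for separator in ("(", "/", "."):
        code = code.split(separator)[0]
    return code.strip()

codes = ["9-1-110(a)", "9-1-81(a)", "9-1-50 ", "22-2-88"]
print([base_code(c) for c in codes])  # ['9-1-110', '9-1-81', '9-1-50', '22-2-88']
```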

In [14]:
blight_df['violation_code'].str.split('(').str[0].str.split('/').str[0].str.split('.').str[0].value_counts()

9-1-36       77458
22-2-88      48573
9-1-104      37920
9-1-81       26468
22-2-45       7945
9-1-43        6349
9-1-110       6317
22-2-43       4137
9-1-105       3821
9-1-82        3727
9-1-103       3612
22-2-22       2930
9-1-113       2887
9-1-111       2883
22-2-61       1814
22-2-17       1717
22-2-83       1687
9-1-45        1648
61-5-21       1397
9-1-50        1026
9-1-50         837
9-1-201        522
61-81          464
9-1-101        373
9-1-83 -       365
22-2-49        327
22-2-21        319
9-1-107        305
61-5-18        299
9-1-206        294
             ...  
9-1-84           2
61-14-175        2
22-2-92          2
61-5-20          2
61-104           2
9-1-375          2
9-1-308          2
9-1-352          1
9-1-471          1
9-1-302          1
61-4-38          1
9-1-353          1
9-1-437          1
61-111           1
9-1-401          1
61-45            1
61-47            1
9-1-474          1
22-2-84          1
61-122           1
61-119           1
9-1-382     