# Understanding and Predicting Property Maintenance Fines

This project is based on a data challenge from the Michigan Data Science Team ([MDST](http://midas.umich.edu/mdst/)). 

The Michigan Data Science Team ([MDST](http://midas.umich.edu/mdst/)) and the Michigan Student Symposium for Interdisciplinary Statistical Sciences ([MSSISS](https://sites.lsa.umich.edu/mssiss/)) have partnered with the City of Detroit to help solve one of the most pressing problems facing Detroit - blight. [Blight violations](http://www.detroitmi.gov/How-Do-I/Report/Blight-Complaint-FAQs) are issued by the city to individuals who allow their properties to remain in a deteriorated condition. Every year, the city of Detroit issues millions of dollars in fines to residents and every year, many of these fines remain unpaid. Enforcing unpaid blight fines is a costly and tedious process, so the city wants to know: how can we increase blight ticket compliance?

The first step is understanding when and why a resident might fail to comply with a blight ticket. This is where predictive modeling comes in. For this assignment, your task is to predict whether a given blight ticket will be paid on time.

All data for this assignment has been provided to us through the [Detroit Open Data Portal](https://data.detroitmi.gov/). **Only the data already included in your Coursera directory can be used for training the model for this assignment.** Nonetheless, we encourage you to look into data from other Detroit datasets to help inform feature creation and model selection. We recommend taking a look at the following related datasets:

* [Building Permits](https://data.detroitmi.gov/Property-Parcels/Building-Permits/xw2a-a7tf)
* [Trades Permits](https://data.detroitmi.gov/Property-Parcels/Trades-Permits/635b-dsgv)
* [Improve Detroit: Submitted Issues](https://data.detroitmi.gov/Government/Improve-Detroit-Submitted-Issues/fwz3-w3yn)
* [DPD: Citizen Complaints](https://data.detroitmi.gov/Public-Safety/DPD-Citizen-Complaints-2016/kahe-efs3)
* [Parcel Map](https://data.detroitmi.gov/Property-Parcels/Parcel-Map/fxkw-udwf)

___

We have two data files for use in training and validating your models: train.csv and test.csv. Each row in these two files corresponds to a single blight ticket, and includes information about when, why, and to whom each ticket was issued. The target variable is compliance, which is True if the ticket was paid early, on time, or within one month of the hearing data, False if the ticket was paid after the hearing date or not at all, and Null if the violator was found not responsible. Compliance, as well as a handful of other variables that will not be available at test-time, are only included in train.csv.

<br>

**File descriptions** (Use only this data for training your model!)

    readonly/train.csv - the training set (all tickets issued 2004-2011)
    readonly/test.csv - the test set (all tickets issued 2012-2016)
    readonly/addresses.csv & readonly/latlons.csv - mapping from ticket id to addresses, and from addresses to lat/lon coordinates. 
     Note: misspelled addresses may be incorrectly geolocated.

<br>

**Data fields**

train.csv & test.csv

    ticket_id - unique identifier for tickets
    agency_name - Agency that issued the ticket
    inspector_name - Name of inspector that issued the ticket
    violator_name - Name of the person/organization that the ticket was issued to
    violation_street_number, violation_street_name, violation_zip_code - Address where the violation occurred
    mailing_address_str_number, mailing_address_str_name, city, state, zip_code, non_us_str_code, country - Mailing       address of the violator
    ticket_issued_date - Date and time the ticket was issued
    hearing_date - Date and time the violator's hearing was scheduled
    violation_code, violation_description - Type of violation
    disposition - Judgment and judgement type
    fine_amount - Violation fine amount, excluding fees
    admin_fee - $20 fee assigned to responsible judgments
    state_fee - $10 fee assigned to responsible judgments
    late_fee - 10% fee assigned to responsible judgments
    discount_amount - discount applied, if any
    clean_up_cost - DPW clean-up or graffiti removal cost
    judgment_amount - Sum of all fines and fees
    grafitti_status - Flag for graffiti violations
    
train.csv only

    payment_amount - Amount paid, if any
    payment_date - Date payment was made, if it was received
    payment_status - Current payment status as of Feb 1 2017
    balance_due - Fines and fees still owed
    collection_status - Flag for payments in collections
    compliance [target variable for prediction] 
     Null = Not responsible
     0 = Responsible, non-compliant
     1 = Responsible, compliant
    compliance_detail - More information on why each ticket was marked compliant or non-compliant




In [9]:
def blight_model():
    
    # Your code here
    import pandas as pd
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.svm import SVC
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV
    from sklearn.metrics import roc_curve,auc,roc_auc_score
    
   
    #cleaning data

    train_data=pd.read_csv('train.csv',encoding='ISO-8859-1')
    test_data = pd.read_csv('test.csv')

    address=pd.read_csv('addresses.csv')
    latlons=pd.read_csv('latlons.csv')
    address_latlons=address.set_index('address').join(latlons.set_index('address'),how='left').dropna().reset_index(drop=False)
    train_data=pd.merge(train_data,address_latlons,on='ticket_id').set_index('ticket_id')

    test_data=pd.merge(test_data,address_latlons,on='ticket_id',how='left').set_index('ticket_id')

    train_data['lat'].fillna(method='pad',inplace=True)
    train_data['lon'].fillna(method='pad',inplace=True)
    train_data['state'].fillna(method='pad',inplace=True)

    test_data['lat'].fillna(method='pad',inplace=True)
    test_data['lon'].fillna(method='pad',inplace=True)
    test_data['state'].fillna(method='pad',inplace=True)



    train_data=train_data[(train_data['compliance']==0)|(train_data['compliance']==1)]

    train_data['compliance']=train_data['compliance'].astype(int)


#calculate time gap
    from datetime import datetime
    def time_gap(hearing_date_str,ticket_issued_date_str):
        
        if not hearing_date_str or type(hearing_date_str)!=str:return 73
        hearing_date = datetime.strptime(hearing_date_str,'%Y-%m-%d %H:%M:%S')
        ticket_issued_date = datetime.strptime(ticket_issued_date_str,'%Y-%m-%d %H:%M:%S')
        gap = hearing_date-ticket_issued_date
        return gap.days

   
    train_data['time_gap']=train_data.apply(lambda row:time_gap(row['hearing_date'],row['ticket_issued_date']),axis=1)
    test_data['time_gap']=test_data.apply(lambda row:time_gap(row['hearing_date'],row['ticket_issued_date']),axis=1)
    
    
    feature_to_be_splitted={'agency_name':'category',
                            'state':'category',
                            'disposition':'category'
                            }
  
    
    
    list_to_remove_train=['balance_due','collection_status','compliance_detail','payment_amount','payment_date','payment_status']
    
    
    list_to_remove_train_test=['inspector_name','violator_name','zip_code','country','city','violation_street_number',
                               'violation_street_name','violation_zip_code','violation_description', 'mailing_address_str_number',
                               'mailing_address_str_name','non_us_str_code','ticket_issued_date', 'hearing_date',
                               'grafitti_status', 'violation_code']
    
    feature_columns=['agency_name',
                    'state',
                    
                     'late_fee',
                     'fine_amount',
                     'discount_amount',
                     'judgment_amount',
                     'lat',
                     'lon',
                     'time_gap']
    train_data.drop(list_to_remove_train,axis=1,inplace=True)
    train_data.drop(list_to_remove_train_test,axis=1,inplace=True)
    test_data.drop(list_to_remove_train_test,axis=1,inplace=True)
    
    
   
    
    for df in [train_data,test_data]:
        for col,col_type in feature_to_be_splitted.items():
            if col in df:
                if col_type=='category':
                    df[col]=df[col].astype(col_type)
                    
    cat_columns=train_data.select_dtypes(['category']).columns
    for df in [train_data, test_data]:
        df[cat_columns]=df[cat_columns].apply(lambda x:x.cat.codes)
        
    X=train_data[feature_columns].copy()
    y=train_data['compliance']
    X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=0)
    
    test=test_data[feature_columns].copy()
    GBC=GradientBoostingClassifier(learning_rate=0.01,max_depth=3,random_state=0).fit(X_train,y_train)
    
    y_score_GBC=GBC.decision_function(X_test)
    fpr_GBC,tpr_GBC,_=roc_curve(y_test,y_score_GBC)
    roc_auc_GBC=auc(fpr_GBC,tpr_GBC)
    accuracy_GBC=GBC.score(X_test,y_test)
    print("accuracy = {:.3f}  AUC= {:.3f}".format(accuracy_GBC,roc_auc_GBC))
    y_proba=GBC.predict_proba(test)[:,1]
    test['compliance']=y_proba
    return  test['compliance']# Your answer here

In [10]:
blight_model()

  if self.run_code(code, result):


accuracy = 0.934  AUC= 0.766


ticket_id
284932    0.069823
285362    0.045330
285361    0.069823
285338    0.076646
285346    0.076646
285345    0.076646
285347    0.076646
285342    0.237046
285530    0.056847
284989    0.052213
285344    0.076646
285343    0.056847
285340    0.056847
285341    0.076646
285349    0.076646
285348    0.076646
284991    0.052213
285532    0.056847
285406    0.045330
285001    0.056847
285006    0.056847
285405    0.045330
285337    0.045330
285496    0.069823
285497    0.069823
285378    0.045330
285589    0.045330
285585    0.069823
285501    0.069823
285581    0.045330
            ...   
376367    0.045330
376366    0.047276
376362    0.047276
376363    0.069823
376365    0.045330
376364    0.047276
376228    0.050075
376265    0.047276
376286    0.127654
376320    0.047276
376314    0.047276
376327    0.237046
376385    0.237046
376435    0.237046
376370    0.237046
376434    0.069823
376459    0.069823
376478    0.045330
376473    0.047276
376484    0.048018
376482    0.045330
37