---

_You are currently looking at **version 1.1** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-machine-learning/resources/bANLa) course resource._

---

## Assignment 4 - Understanding and Predicting Property Maintenance Fines

This assignment is based on a data challenge from the Michigan Data Science Team ([MDST](http://midas.umich.edu/mdst/)). 

The Michigan Data Science Team ([MDST](http://midas.umich.edu/mdst/)) and the Michigan Student Symposium for Interdisciplinary Statistical Sciences ([MSSISS](https://sites.lsa.umich.edu/mssiss/)) have partnered with the City of Detroit to help solve one of the most pressing problems facing Detroit - blight. [Blight violations](http://www.detroitmi.gov/How-Do-I/Report/Blight-Complaint-FAQs) are issued by the city to individuals who allow their properties to remain in a deteriorated condition. Every year, the city of Detroit issues millions of dollars in fines to residents and every year, many of these fines remain unpaid. Enforcing unpaid blight fines is a costly and tedious process, so the city wants to know: how can we increase blight ticket compliance?

The first step in answering this question is understanding when and why a resident might fail to comply with a blight ticket. This is where predictive modeling comes in. For this assignment, your task is to predict whether a given blight ticket will be paid on time.

All data for this assignment has been provided to us through the [Detroit Open Data Portal](https://data.detroitmi.gov/). **Only the data already included in your Coursera directory can be used for training the model for this assignment.** Nonetheless, we encourage you to look into data from other Detroit datasets to help inform feature creation and model selection. We recommend taking a look at the following related datasets:

* [Building Permits](https://data.detroitmi.gov/Property-Parcels/Building-Permits/xw2a-a7tf)
* [Trades Permits](https://data.detroitmi.gov/Property-Parcels/Trades-Permits/635b-dsgv)
* [Improve Detroit: Submitted Issues](https://data.detroitmi.gov/Government/Improve-Detroit-Submitted-Issues/fwz3-w3yn)
* [DPD: Citizen Complaints](https://data.detroitmi.gov/Public-Safety/DPD-Citizen-Complaints-2016/kahe-efs3)
* [Parcel Map](https://data.detroitmi.gov/Property-Parcels/Parcel-Map/fxkw-udwf)

___

We provide you with two data files for use in training and validating your models: train.csv and test.csv. Each row in these two files corresponds to a single blight ticket, and includes information about when, why, and to whom each ticket was issued. The target variable is compliance, which is True if the ticket was paid early, on time, or within one month of the hearing data, False if the ticket was paid after the hearing date or not at all, and Null if the violator was found not responsible. Compliance, as well as a handful of other variables that will not be available at test-time, are only included in train.csv.

Note: All tickets where the violators were found not responsible are not considered during evaluation. They are included in the training set as an additional source of data for visualization, and to enable unsupervised and semi-supervised approaches. However, they are not included in the test set.

<br>

**File descriptions** (Use only this data for training your model!)

    train.csv - the training set (all tickets issued 2004-2011)
    test.csv - the test set (all tickets issued 2012-2016)
    addresses.csv & latlons.csv - mapping from ticket id to addresses, and from addresses to lat/lon coordinates. 
     Note: misspelled addresses may be incorrectly geolocated.

<br>

**Data fields**

train.csv & test.csv

    ticket_id - unique identifier for tickets
    agency_name - Agency that issued the ticket
    inspector_name - Name of inspector that issued the ticket
    violator_name - Name of the person/organization that the ticket was issued to
    violation_street_number, violation_street_name, violation_zip_code - Address where the violation occurred
    mailing_address_str_number, mailing_address_str_name, city, state, zip_code, non_us_str_code, country - Mailing address of the violator
    ticket_issued_date - Date and time the ticket was issued
    hearing_date - Date and time the violator's hearing was scheduled
    violation_code, violation_description - Type of violation
    disposition - Judgment and judgement type
    fine_amount - Violation fine amount, excluding fees
    admin_fee - $20 fee assigned to responsible judgments
state_fee - $10 fee assigned to responsible judgments
    late_fee - 10% fee assigned to responsible judgments
    discount_amount - discount applied, if any
    clean_up_cost - DPW clean-up or graffiti removal cost
    judgment_amount - Sum of all fines and fees
    grafitti_status - Flag for graffiti violations
    
train.csv only

    payment_amount - Amount paid, if any
    payment_date - Date payment was made, if it was received
    payment_status - Current payment status as of Feb 1 2017
    balance_due - Fines and fees still owed
    collection_status - Flag for payments in collections
    compliance [target variable for prediction] 
     Null = Not responsible
     0 = Responsible, non-compliant
     1 = Responsible, compliant
    compliance_detail - More information on why each ticket was marked compliant or non-compliant


___

## Evaluation

Your predictions will be given as the probability that the corresponding blight ticket will be paid on time.

The evaluation metric for this assignment is the Area Under the ROC Curve (AUC). 

Your grade will be based on the AUC score computed for your classifier. A model which with an AUROC of 0.7 passes this assignment, over 0.75 will recieve full points.
___

For this assignment, create a function that trains a model to predict blight ticket compliance in Detroit using `train.csv`. Using this model, return a series of length 61001 with the data being the probability that each corresponding ticket from `test.csv` will be paid, and the index being the ticket_id.

Example:

    ticket_id
       284932    0.531842
       285362    0.401958
       285361    0.105928
       285338    0.018572
                 ...
       376499    0.208567
       376500    0.818759
       369851    0.018528
       Name: compliance, dtype: float32
       
### Hints

* Make sure your code is working before submitting it to the autograder.

* Print out your result to see whether there is anything weird (e.g., all probabilities are the same).

* Generally the total runtime should be less than 10 mins. You should NOT use Neural Network related classifiers (e.g., MLPClassifier) in this question. 

* Try to avoid global variables. If you have other functions besides blight_model, you should move those functions inside the scope of blight_model.

* Refer to the pinned threads in Week 4's discussion forum when there is something you could not figure it out.

In [11]:
import pandas as pd
import numpy as np
    
def blight_model_1(debug=False):
    
    # Your code here
    # read data set
    train_only_df = pd.read_csv("train.csv", encoding="ISO-8859-1", low_memory=False)
    test_only_df = pd.read_csv("test.csv", encoding="ISO-8859-1", low_memory=False)
    addr_df = pd.read_csv("addresses.csv", encoding="ISO-8859-1")
    latlon_df = pd.read_csv("latlons.csv", encoding="ISO-8859-1")
    
#     if debug: 
#         print("\nOriginal Training DF labels:")
#         train_only_df.info()
#         print("\nOriginal Test DF labels: ")
#         test_only_df.info()
#         print("\nAddress DF labels: ")
#         addr_df.info()
#         print("\nLatitude & Longitude S=DF labels:")
#         latlon_df.info()
        
    # merge train and test df with address and latlon info 
    # join on indices, merge on column
    # Ref: https://stackoverflow.com/questions/22676081/what-is-the-difference-between-join-and-merge-in-pandas
    train_df = train_only_df.merge(addr_df, on=("ticket_id")).merge(latlon_df, on=("address"))
    test_df = test_only_df.merge(addr_df, on=("ticket_id")).merge(latlon_df, on=("address"))
    
    # define selected features (columns)
    train_df['date_diff'] = pd.to_datetime(train_df['hearing_date']) - pd.to_datetime(train_df['ticket_issued_date'])
    test_df['date_diff'] = pd.to_datetime(test_df['hearing_date']) - pd.to_datetime(test_df['ticket_issued_date'])
    
    if debug:
        print("\ntrain DF labels: ")
        train_df.info()
        
    return test_df # Your answer here

# blight_model_1(debug=True)

  if self.run_code(code, result):



train DF labels: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 250306 entries, 0 to 250305
Data columns (total 38 columns):
ticket_id                     250306 non-null int64
agency_name                   250306 non-null object
inspector_name                250306 non-null object
violator_name                 250272 non-null object
violation_street_number       250306 non-null float64
violation_street_name         250306 non-null object
violation_zip_code            0 non-null float64
mailing_address_str_number    246704 non-null float64
mailing_address_str_name      250302 non-null object
city                          250306 non-null object
state                         250213 non-null object
zip_code                      250305 non-null object
non_us_str_code               3 non-null object
country                       250306 non-null object
ticket_issued_date            250306 non-null object
hearing_date                  237815 non-null object
violation_code                

Unnamed: 0,ticket_id,agency_name,inspector_name,violator_name,violation_street_number,violation_street_name,violation_zip_code,mailing_address_str_number,mailing_address_str_name,city,...,state_fee,late_fee,discount_amount,clean_up_cost,judgment_amount,grafitti_status,address,lat,lon,date_diff
0,284932,Department of Public Works,"Granberry, Aisha B","FLUELLEN, JOHN A",10041.0,ROSEBERRY,,141,ROSEBERRY,DETROIT,...,10.0,20.0,0.0,0.0,250.0,,"10041 roseberry, Detroit MI",42.407581,-82.986642,14 days 19:00:00
1,285362,Department of Public Works,"Lusk, Gertrina","WHIGHAM, THELMA",18520.0,EVERGREEN,,19136,GLASTONBURY,DETROIT,...,10.0,100.0,0.0,0.0,1130.0,,"18520 evergreen, Detroit MI",42.426239,-83.238259,31 days 23:10:00
2,285361,Department of Public Works,"Lusk, Gertrina","WHIGHAM, THELMA",18520.0,EVERGREEN,,19136,GLASTONBURY,DETROIT,...,10.0,10.0,0.0,0.0,140.0,,"18520 evergreen, Detroit MI",42.426239,-83.238259,31 days 23:10:00
3,285338,Department of Public Works,"Talbert, Reginald","HARABEDIEN, POPKIN",1835.0,CENTRAL,,2246,NELSON,WOODHAVEN,...,10.0,20.0,0.0,0.0,250.0,,"1835 central, Detroit MI",42.309661,-83.122426,32 days 22:35:00
4,285346,Department of Public Works,"Talbert, Reginald","CORBELL, STANLEY",1700.0,CENTRAL,,3435,MUNGER,LIVONIA,...,10.0,10.0,0.0,0.0,140.0,,"1700 central, Detroit MI",42.308830,-83.121116,39 days 22:40:00
5,285345,Department of Public Works,"Talbert, Reginald","CORBELL, STANLEY",1700.0,CENTRAL,,3435,MUNGER,LIVONIA,...,10.0,20.0,0.0,0.0,250.0,,"1700 central, Detroit MI",42.308830,-83.121116,39 days 22:40:00
6,285347,Department of Public Works,"Talbert, Reginald","CORBELL, STANLEY",1700.0,CENTRAL,,3435,MUNGER,LIVONIA,...,10.0,5.0,0.0,0.0,85.0,,"1700 central, Detroit MI",42.308830,-83.121116,33 days 00:10:00
7,285342,Department of Public Works,"Talbert, Reginald","NICKOLA CORPORATION, W & H",1605.0,LIVERNOIS,,1382,WHITEHOUSE CT,ROCHESTER HILLS,...,10.0,0.0,0.0,0.0,230.0,,"1605 livernois, Detroit MI",42.313314,-83.108636,32 days 23:10:00
8,285530,Department of Public Works,"Buchanan, Daryl","INTERSTATE INVESTMENT GROUP LL, .",3408.0,BEATRICE,,341,HAMPTON,GILBERT,...,10.0,100.0,0.0,0.0,1130.0,,"3408 beatrice, Detroit MI",42.261245,-83.160878,34 days 02:00:00
9,284989,Department of Public Works,"Buchanan, Daryl","YAMAN, BATURAY",8040.0,SARENA,,43494,ELLSWORTH # 20,FREMONT,...,10.0,50.0,0.0,0.0,580.0,,"8040 sarena, Detroit MI",42.342537,-83.148025,20 days 00:20:00


In [4]:
import pandas as pd
import numpy as np

def blight_model_2():
    import pandas as pd
    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.preprocessing import MinMaxScaler

    # Load train and test data
    train_data = pd.read_csv('train.csv', encoding = 'ISO-8859-1')
    test_data = pd.read_csv('test.csv')

    # Filter null valued compliance rows
    train_data = train_data[(train_data['compliance'] == 0) | (train_data['compliance'] == 1)]
    address =  pd.read_csv('addresses.csv')

    # Load address and location information
    latlons = pd.read_csv('latlons.csv')
    address = address.set_index('address').join(latlons.set_index('address'), how='left')

    # Join address and location to train and test data
    train_data = train_data.set_index('ticket_id').join(address.set_index('ticket_id'))
    test_data = test_data.set_index('ticket_id').join(address.set_index('ticket_id'))

    # Filter null valued hearing date rows
    train_data = train_data[~train_data['hearing_date'].isnull()]

    # Remove Non Existing Features In Test Data
    train_remove_list = [
            'balance_due',
            'collection_status',
            'compliance_detail',
            'payment_amount',
            'payment_date',
            'payment_status'
        ]

    train_data.drop(train_remove_list, axis=1, inplace=True)

    # Remove String Data
    string_remove_list = ['violator_name', 'zip_code', 'country', 'city',
            'inspector_name', 'violation_street_number', 'violation_street_name',
            'violation_zip_code', 'violation_description',
            'mailing_address_str_number', 'mailing_address_str_name',
            'non_us_str_code', 'agency_name', 'state', 'disposition',
            'ticket_issued_date', 'hearing_date', 'grafitti_status', 'violation_code'
        ]

    train_data.drop(string_remove_list, axis=1, inplace=True)
    test_data.drop(string_remove_list, axis=1, inplace=True)

    # Fill NA Lat Lon Values
    train_data.lat.fillna(method='pad', inplace=True)
    train_data.lon.fillna(method='pad', inplace=True)
    test_data.lat.fillna(method='pad', inplace=True)
    test_data.lon.fillna(method='pad', inplace=True)

    # Select target value as y train and remove it from x train
    y_train = train_data.compliance
    X_train = train_data.drop('compliance', axis=1)

    # Do nothing with test data and select as x test, we don't have y_test
    X_test = test_data
    
    # Scale Features To Reduce Computation Time
    scaler = MinMaxScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Build And Train Classifier Model
    clf = MLPClassifier(hidden_layer_sizes = [100, 10],
                        alpha=0.001,
                        random_state = 0, 
                        solver='lbfgs', 
                        verbose=0)
    clf.fit(X_train_scaled, y_train)
    
    # Predict probabilities
    y_proba = clf.predict_proba(X_test_scaled)[:,1]
    
    # Integrate with reloaded test data
    test_df = pd.read_csv('test.csv', encoding = "ISO-8859-1")
    test_df['compliance'] = y_proba
    test_df.set_index('ticket_id', inplace=True)
    
    return test_df.compliance # Your answer here

# Your AUC of 0.744519114573 was awarded a value of 0.96 out of 1.0 total grades

In [8]:
# Ref: https://www.kaggle.com/kylingu/apply-machine-learning-module-4-assignment

def blight_model_trial2(debug=False):
    import pandas as pd
    import numpy as np
    import os
    
    # print(os.listdir("../input"))
    df_train = pd.read_csv('train.csv', encoding = 'ISO-8859-1', low_memory=False).set_index('ticket_id')
    df_address = pd.read_csv('addresses.csv', encoding = 'ISO-8859-1', low_memory=False).set_index('ticket_id')

    city, state= zip(*df_address.address.apply(lambda x: x.split(', ')[1].split(' ')))
    violation_address = pd.DataFrame({'vio_city': city, 'vio_state': state}, index=df_address.index)
    violation_address.describe(include='all')
    
    # clean data
    df_train.compliance.value_counts(dropna=False)/len(df_train)
    
    df_train_all = df_train.copy()
    df_train = df_train.dropna(subset=['compliance'])
    
    df_train.groupby('agency_name').compliance.agg(['count', 'sum', 'mean', 'std'])
    
    if debug:
        print(df_train_all.disposition.unique(), df_train.disposition.unique())

    disposition_replace = {'Responsible by Default': 'By default',
                           'Responsible by Determination': 'By determination', 
                           'Responsible (Fine Waived) by Deter': 'Fine Waived',
                           'Responsible by Admission': 'By admission',
                           'SET-ASIDE (PENDING JUDGMENT)': 'Pending',
                           'PENDING JUDGMENT': 'Pending',
                           'Not responsible by Dismissal': 'Not responsible',
                           'Not responsible by City Dismissal': 'Not responsible',
                           'Not responsible by Determination': 'Not responsible'
                          }

    df_train_all.disposition.replace(disposition_replace, inplace=True)
    df_train_all.groupby('disposition').compliance.agg(['count', 'sum', 'mean', 'std'])
    
    df_train.disposition.replace(disposition_replace, inplace=True)
    df_train.groupby('disposition').compliance.agg(['count', 'sum', 'mean', 'std'])
    
    df_train.groupby('country').compliance.agg(['count', 'sum', 'mean', 'std'])

    a = df_train.groupby('state').compliance.agg(
        ['count', 'sum', 'mean', 'std']).sort_values('count', ascending=False)
    a['compl_rate'] = a['sum']/a['count']
    a['count_rate'] = a['count']/len(df_train)
    # a
    
    us_statesus_state  = ['AL','AK','AZ','AR','CA','CO','CT','DE','FL','GA','HI','ID','IL',
             'IN','IA','KS','KY','LA','ME','MD','MA','MI','MN','MS','MO','MT',
             'NE','NV','NH','NJ','NM','NY','NC','ND','OH','OK','OR','PA','RI',
             'SC','SD','TN','TX','UT','VT','VA','WA','WV','WI','WY']

    df_train['is_in_state'] = df_train.state.apply(lambda x: x if x in us_statesus_state else 'foreign')
    df_train.is_in_state[(df_train.is_in_state != 'foreign') 
                         & (df_train.is_in_state != 'MI')] = 'out_of_state'
    
    df_train.groupby('is_in_state').compliance.agg(
        ['count', 'sum', 'mean', 'std']).sort_values('count', ascending=False)
    
    df_train.groupby('inspector_name').compliance.agg(
        ['count', 'sum', 'mean', 'std']).sort_values('count', ascending=False)
    
    df_train.groupby('violation_code').compliance.agg(
        ['count', 'sum', 'mean', 'std']).sort_values('count', ascending=False)
    
    if debug: 
        %matplotlib inline
        df_train.groupby('discount_amount').compliance.agg(
            ['count', 'sum', 'mean', 'std']).sort_values('count', ascending=False)
        
    df_train['is_discount'] = df_train.discount_amount.apply(lambda x:1 if x > 0 else 0)
    
    df_train.groupby('judgment_amount').compliance.agg(
        ['count', 'sum', 'mean', 'std']).sort_index(ascending=False)
    
    df_train['judgment_level'] = pd.cut(df_train.judgment_amount, bins=[-1, 140, 305, float("inf")])

    # df_train.judgment_amount.apply(lambda x: )
    df_train.groupby('judgment_level').compliance.agg(['count', 'sum', 'mean', 'std']).sort_index(ascending=False)
    
    # characteristics
    selection = ['judgment_level', 'is_discount', 'is_in_state', 'disposition', 'agency_name', 'compliance']
    df_selected = df_train[selection]
    
    X = df_selected.drop(columns='compliance')
    Y = df_selected.compliance
    # X.shape, Y.shape
    
    X_encoded = pd.get_dummies(X)
    # X_encoded.head(3)
    
    X_encoded['disposition_Not responsible'] = np.zeros(len(X))
    X_encoded['disposition_Pending'] = np.zeros(len(X))
    # X_encoded.shape
    
    # generate training and test dataset
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test= train_test_split(X_encoded, Y, random_state = 0)
    
    
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import GridSearchCV
    # GDBC Classifier
    def gradientboosting(debug=False):
        from sklearn.ensemble import GradientBoostingClassifier
        
        clf = GradientBoostingClassifier().fit(X_train, y_train)
  
        # ROC-AUC Score
        y_train_pred = clf.predict(X_train)
        y_pred = clf.predict(X_test)

        if debug:
            print('train score ', roc_auc_score(y_train, y_train_pred))
            print('test score ', roc_auc_score(y_test, y_pred))

        # GridSearch
        params= {'learning_rate': [0.1, 0.3, 1, 3], 'n_estimators':[100], 'max_depth':[3, 5, 8]}

        clf = GradientBoostingClassifier(random_state=0)
        gscv = GridSearchCV(estimator=clf, param_grid=params, scoring='roc_auc', cv=5, n_jobs=8)
        gscv.fit(X_encoded, Y)

        y_pred = gscv.predict(X_test)

        if debug:
            print("Gradient Boosting Classifier")
            print("Grid Search Scores: {} \Grid Search Best Parameters: {}"
                  .format(gscv.best_score_, gscv.best_params))
            print('test score ', roc_auc_score(y_test, y_pred))
        
        return y_pred
    
    # Radom Forest Classifier
    def random_forest(debug=False):
        from sklearn.ensemble import RandomForestClassifier

        params = {'n_estimators':range(1, 50, 5)}
        clf = RandomForestClassifier(random_state=0)
        gscv_rfc = GridSearchCV(estimator=clf, param_grid=params, scoring='roc_auc', cv=5, n_jobs=8)
        gscv_rfc.fit(X_encoded, Y)
        
        y_pred = gscv.predict(X_test)
    
        if debug:
            print("Grid Search Scores: {}\nGrid Search Best Parameteres: {}"
                 .format(gscv_rfc.best_score_, gscv_rfc.best_params_))
            print('train score ', roc_auc_score(y_train, y_train_pred))
            print('test score ', roc_auc_score(y_test, y_pred))
        
        return y_pred
        
    # logistic Regression
    def logistic_reg(debug=False):
        from sklearn.linear_model import LogisticRegression

        clf = LogisticRegression(C=1).fit(X_train, y_train)
        y_train_pred = clf.predict(X_train)
        y_pred = clf.predict(X_test)

        if debug:
            print('Logistic Regression')
            print('train score ', roc_auc_score(y_train, y_train_pred))
            print('test score ', roc_auc_score(y_test, y_pred))
        
        return y_pred

    # KNN
    def knn(debug=False):
        from sklearn.neighbors import KNeighborsClassifier

        knn = KNeighborsClassifier(n_jobs=8, n_neighbors=10)
        knn.fit(X_train, y_train)

        y_train_pred = clf.predict(X_train)
        y_pred = clf.predict(X_test)
        
        if debug:
            print('KNeighbours')
            print('train score ', roc_auc_score(y_train, y_train_pred))
            print('test score ', roc_auc_score(y_test, y_pred))
            
        return y_pred
        
    
    gdb_pred = gradientboosting(debug=True)
    rf_pred = random_forest(debug=True)
    lr_pred = logistic_reg(debug=True)
    knn_pred = knn(debug=True)
    
    def findal_test(debug=False):
        df_test = pd.read_csv('test.csv', encoding = 'ISO-8859-1', 
                              low_memory=False).set_index('ticket_id')
        
        # daat clean
        df_test['judgment_level'] = pd.cut(df_test.judgment_amount, bins=[-1, 140, 305, float("inf")])
        df_test['is_discount'] = df_test.discount_amount.apply(lambda x:1 if x > 0 else 0)
        us_statesus_state  = ['AL','AK','AZ','AR','CA','CO','CT','DE','FL','GA','HI','ID','IL',
                     'IN','IA','KS','KY','LA','ME','MD','MA','MI','MN','MS','MO','MT',
                     'NE','NV','NH','NJ','NM','NY','NC','ND','OH','OK','OR','PA','RI',
                     'SC','SD','TN','TX','UT','VT','VA','WA','WV','WI','WY']

        df_test['is_in_state'] = df_test.state.apply(lambda x: x if x in us_statesus_state else 'foreign')
        df_test.is_in_state[(df_test.is_in_state != 'foreign') & (df_test.is_in_state != 'MI')] = 'out_of_state'

        disposition_replace = {'Responsible by Default': 'By default',
                               'Responsible by Determination': 'By determination', 
                               'Responsible (Fine Waived) by Deter': 'Fine Waived',
                               'Responsible by Admission': 'By admission',
                               'SET-ASIDE (PENDING JUDGMENT)': 'Pending',
                               'PENDING JUDGMENT': 'Pending',
                               'Not responsible by Dismissal': 'Not responsible',
                               'Not responsible by City Dismissal': 'Not responsible',
                               'Not responsible by Determination': 'Not responsible',
                               'Responsible (Fine Waived) by Admis': 'Fine Waived',
                               'Responsible - Compl/Adj by Default': 'By default',
                               'Responsible - Compl/Adj by Determi': 'By determination',
                               'Responsible by Dismissal': 'By default'
                              }

        df_test.disposition.replace(disposition_replace, inplace=True)
    
        df_test.disposition.value_counts()
        
        selection = ['judgment_level', 'is_discount', 'is_in_state', 'disposition', 'agency_name']
        df_test_selected = df_test[selection]
        # df_test_selected.head(3)
        
        final_df_test = pd.get_dummies(df_test_selected)
        final_df_test.head()
        df_test_selected.agency_name.unique()
        
        final_df_test['agency_name_Health Department'] = np.zeros(len(final_df_test), dtype=np.int)
        final_df_test['agency_name_Neighborhood City Halls'] = np.zeros(len(final_df_test), dtype=np.int)
        final_df_test['disposition_Not responsible'] = np.zeros(len(final_df_test))
        final_df_test['disposition_Pending'] = np.zeros(len(final_df_test))
        
        ret = gscv.predict_proba(final_df_test)[:, None, 1]
        predict_probs = pd.Series(ret.reshape(len(final_df_test),), index=final_df_test.index)
        predict_probs.rename('compliance').astype('float32')
        

import pandas as pd
import numpy as np

def blight_model_3(debug=False):
    
    # Your code here
    # loading data
    if debug:
        print('1. loading data')
    
    df_train = pd.read_csv('train.csv', encoding = 'ISO-8859-1', low_memory=False).set_index('ticket_id')
    
    # cleaing data and adjust data
    if debug:
        print('2. cleaning data')
    
    df_train = df_train.dropna(subset=['compliance'])
    disposition_replace = {'Responsible by Default': 'By default',
                       'Responsible by Determination': 'By determination', 
                       'Responsible (Fine Waived) by Deter': 'Fine Waived',
                       'Responsible by Admission': 'By admission',
                       'SET-ASIDE (PENDING JUDGMENT)': 'Pending',
                       'PENDING JUDGMENT': 'Pending',
                       'Not responsible by Dismissal': 'Not responsible',
                       'Not responsible by City Dismissal': 'Not responsible',
                       'Not responsible by Determination': 'Not responsible',
                       'Responsible (Fine Waived) by Admis': 'Fine Waived',
                       'Responsible - Compl/Adj by Default': 'By default',
                       'Responsible - Compl/Adj by Determi': 'By determination',
                       'Responsible by Dismissal': 'By default'
                      }

    df_train.disposition.replace(disposition_replace, inplace=True)
    
    if debug:
        print('3. feature engineering')
    
    us_statesus_state  = ['AL','AK','AZ','AR','CA','CO','CT','DE','FL','GA','HI','ID','IL',
             'IN','IA','KS','KY','LA','ME','MD','MA','MI','MN','MS','MO','MT',
             'NE','NV','NH','NJ','NM','NY','NC','ND','OH','OK','OR','PA','RI',
             'SC','SD','TN','TX','UT','VT','VA','WA','WV','WI','WY']

    df_train['is_in_state'] = df_train.state.apply(lambda x: x if x in us_statesus_state else 'foreign')
    df_train.is_in_state[(df_train.is_in_state != 'foreign') & (df_train.is_in_state != 'MI')] = 'out_of_state'
    df_train['is_discount'] = df_train.discount_amount.apply(lambda x:1 if x > 0 else 0)
    df_train['judgment_level'] = pd.cut(df_train.judgment_amount, bins=[-1, 140, 305, float("inf")])
    selection = ['judgment_level', 'is_discount', 'is_in_state', 'disposition', 'agency_name', 'compliance']
    df_selected = df_train[selection]
    X = df_selected.drop('compliance', axis=1)
    Y = df_selected.compliance
    X_encoded = pd.get_dummies(X)
    X_encoded.head(3)
    X_encoded['disposition_Not responsible'] = np.zeros(len(X))
    X_encoded['disposition_Pending'] = np.zeros(len(X))
    
    # train
    if debug:
        print('4. training')
    
    from sklearn.model_selection import GridSearchCV
    
    # using GBDT, and best param for GBDT
    from sklearn.ensemble import GradientBoostingClassifier
    params= {'learning_rate': [0.3], 'n_estimators':[100], 'max_depth':[3]}
    clf = GradientBoostingClassifier(random_state=0)
    
    # try SVM, too slow...
    #   from sklearn.svm import SVC
    #   params = {'gamma':[0.001], 'kernel':['rbf'], }
    #   clf = SVC(random_state=0)
    
    gscv = GridSearchCV(estimator=clf, param_grid=params, scoring='roc_auc', cv=5, n_jobs=-1)
    gscv.fit(X_encoded, Y)
    
    if debug:
        print('5. training complete with best score', gscv.best_score_, gscv.best_params_)
    
    # test 
    if debug:
        print('6. final test')
    
    df_test = pd.read_csv('test.csv').set_index('ticket_id')
    df_test['judgment_level'] = pd.cut(df_test.judgment_amount, bins=[-1, 140, 305, float("inf")])
    df_test['is_discount'] = df_test.discount_amount.apply(lambda x:1 if x > 0 else 0)
    df_test['is_in_state'] = df_test.state.apply(lambda x: x if x in us_statesus_state else 'foreign')
    df_test.is_in_state[(df_test.is_in_state != 'foreign') & (df_test.is_in_state != 'MI')] = 'out_of_state'
    df_test.disposition.replace(disposition_replace, inplace=True)
    
    selection = ['judgment_level', 'is_discount', 'is_in_state', 'disposition', 'agency_name']
    df_test_selected = df_test[selection]
    
    final_df_test = pd.get_dummies(df_test_selected)
    final_df_test['agency_name_Health Department'] = np.zeros(len(final_df_test), dtype=np.int)
    final_df_test['agency_name_Neighborhood City Halls'] = np.zeros(len(final_df_test), dtype=np.int)
    final_df_test['disposition_Not responsible'] = np.zeros(len(final_df_test))
    final_df_test['disposition_Pending'] = np.zeros(len(final_df_test))
    ret = gscv.predict_proba(final_df_test)[:, None, 1]
    predict_probs = pd.Series(ret.reshape(len(final_df_test),), index=final_df_test.index)
    
    return predict_probs.rename('compliance').astype('float32')# Your answer here

# blight_model_3(debug=True)

# Your AUC of 0.771187138132 was awarded a value of 1.0 out of 1.0 total grades

In [16]:
import pandas as pd
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.tree import DecisionTreeClassifier
from adspy_shared_utilities import plot_decision_tree
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import GradientBoostingClassifier

def blight_model_4(debug=False):
    
    # Cleaning Up the Data
    df = pd.read_csv("train.csv",encoding='ISO-8859-1')
    df1 = pd.read_csv("test.csv",encoding='ISO-8859-1')

    #Fill the na values
    df_label = df['compliance'].fillna(0)

    # Removing the Unnecessary features
    df = df.drop( ['payment_amount', 'payment_date', 'payment_status', 'balance_due', 'collection_status',
                   'agency_name', 'inspector_name', 'violation_code','country','mailing_address_str_name',
                  'city','state','violator_name','violation_street_name','violation_description', 
                   'compliance_detail','mailing_address_str_number','ticket_issued_date','hearing_date',
                   'non_us_str_code','compliance'],axis=1)
    df = df.fillna(0)
    df1 = df1.drop( ['agency_name', 'inspector_name', 'violation_code','country','mailing_address_str_name',
                   'city','state','violator_name','violation_street_name','violation_description',
                     'mailing_address_str_number','non_us_str_code','ticket_issued_date','hearing_date'],axis=1)
    df1 = df1.fillna(0)
    df = pd.get_dummies(data=df, columns=['grafitti_status', 'disposition'])
    df1 = pd.get_dummies(data=df1, columns=['grafitti_status', 'disposition'])

    #Calculating new features using the existing ones to improve the predictions of the classifier
    df['late_amount'] = df['judgment_amount']*df['late_fee']
    df1['late_amount'] = df1['judgment_amount']*df1['late_fee']
    
    #df['hearing_date'] = pd.to_datetime(df['hearing_date']).fillna(0)
    #df['ticket_issued_date'] = pd.to_datetime(df['ticket_issued_date']).fillna(0)
    #df['date_diff'] = (pd.to_datetime(df['hearing_date']).dt.date - 
    #                         pd.to_datetime(df['ticket_issued_date']).dt.date).fillna(0)

    # Converting the datatype according to the data.
    df = df.convert_objects(convert_numeric=True).fillna(0)
    df1 = df1.convert_objects(convert_numeric=True).fillna(0)
    df1.violation_zip_code = df1.violation_zip_code.astype('float').fillna(0)
    #print(df1.dtypes)

    #Splitting the data into training and test set
    X_train, X_test, y_train, y_test = train_test_split(df, df_label, random_state=0)

    #Fit the classifier and predict the values for test set
    #clf = DecisionTreeClassifier().fit(X_train, y_train)
    clf = GradientBoostingClassifier().fit(X_train, y_train)
    tree_predicted = clf.predict(X_test)
    # clf = lr.fit(X_train, y_train).decision_function(X_test)

    #Calculating the Area Under the Curve
    fpr_lr, tpr_lr, _ = roc_curve(y_test, tree_predicted)
    roc_auc_lr = auc(fpr_lr, tpr_lr)

    if debug:
        print(tree_predicted.shape)
        print(df1['ticket_id'].shape)
    
    df1['disposition_SET-ASIDE (PENDING JUDGMENT)'] = df['disposition_SET-ASIDE (PENDING JUDGMENT)']
    
    if debug:
        print(len(df.columns),"---",len(df1.columns))
        print(df.head())
        print(df1.head())

    
    preds = clf.predict(df1)
    preds = pd.DataFrame(data=preds)
    preds.set_index(df1['ticket_id'],inplace=True)

    if debug:
        print(preds.head())
        print(preds.dtypes)
        print()
        print('Accuracy of DT classifier on training set: {:.2f}'.format(clf.score(X_train, y_train)))
        print('Accuracy of DT classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))
        print('ROC Score on test set: {:.2f}'.format(roc_auc_lr))
        print(df.head())


    
    return preds

# blight_model_4(debug=True)

# Your AUC of 0.704128760648 was awarded a value of 0.8 out of 1.0 total grades

In [None]:
import pandas as pd
import numpy as np

def blight_model(debug=False):
    from sklearn.neural_network import MLPClassifier
    from sklearn.preprocessing import MinMaxScaler
    from datetime import datetime
    def time_gap(hearing_date_str, ticket_issued_date_str):
        if not hearing_date_str or type(hearing_date_str)!=str: return 73
        hearing_date = datetime.strptime(hearing_date_str, "%Y-%m-%d %H:%M:%S")
        ticket_issued_date = datetime.strptime(ticket_issued_date_str, "%Y-%m-%d %H:%M:%S")
        gap = hearing_date - ticket_issued_date
        return gap.days
    train_data = pd.read_csv('train.csv', encoding = 'ISO-8859-1', low_memory=False)
    test_data = pd.read_csv('test.csv', encoding = 'ISO-8859-1', low_memory=False)
    train_data = train_data[(train_data['compliance'] == 0) | (train_data['compliance'] == 1)]
    address =  pd.read_csv('addresses.csv')
    latlons = pd.read_csv('latlons.csv')
    address = address.set_index('address').join(latlons.set_index('address'), how='left')
    train_data = train_data.set_index('ticket_id').join(address.set_index('ticket_id'))
    test_data = test_data.set_index('ticket_id').join(address.set_index('ticket_id'))
    train_data = train_data[~train_data['hearing_date'].isnull()]
    train_data['time_gap'] = train_data.apply(lambda row: time_gap(row['hearing_date'], row['ticket_issued_date']), axis=1)
    test_data['time_gap'] = test_data.apply(lambda row: time_gap(row['hearing_date'], row['ticket_issued_date']), axis=1)
    feature_to_be_splitted = ['agency_name', 'state', 'disposition']
    train_data.lat.fillna(method='pad', inplace=True)
    train_data.lon.fillna(method='pad', inplace=True)
    train_data.state.fillna(method='pad', inplace=True)

    test_data.lat.fillna(method='pad', inplace=True)
    test_data.lon.fillna(method='pad', inplace=True)
    test_data.state.fillna(method='pad', inplace=True)
    train_data = pd.get_dummies(train_data, columns=feature_to_be_splitted)
    test_data = pd.get_dummies(test_data, columns=feature_to_be_splitted)
    list_to_remove_train = [
        'balance_due',
        'collection_status',
        'compliance_detail',
        'payment_amount',
        'payment_date',
        'payment_status'
    ]
    list_to_remove_all = ['fine_amount', 'violator_name', 'zip_code', 'country', 'city',
                          'inspector_name', 'violation_street_number', 'violation_street_name',
                          'violation_zip_code', 'violation_description',
                          'mailing_address_str_number', 'mailing_address_str_name',
                          'non_us_str_code',
                          'ticket_issued_date', 'hearing_date', 'grafitti_status', 'violation_code']
    train_data.drop(list_to_remove_train, axis=1, inplace=True)
    train_data.drop(list_to_remove_all, axis=1, inplace=True)
    test_data.drop(list_to_remove_all, axis=1, inplace=True)
    train_features = train_data.columns.drop('compliance')
    train_features_set = set(train_features)
    
    for feature in set(train_features):
        if feature not in test_data:
            train_features_set.remove(feature)
    train_features = list(train_features_set)
    
    X_train = train_data[train_features]
    y_train = train_data.compliance
    X_test = test_data[train_features]
    
    scaler = MinMaxScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    clf = MLPClassifier(hidden_layer_sizes = [100, 10], alpha = 5,
                       random_state = 0, solver='lbfgs', verbose=0)
    # clf = DecisionTreeClassifier()
    clf.fit(X_train_scaled, y_train)

    test_proba = clf.predict_proba(X_test_scaled)[:,1]

    test_df['compliance'] = test_proba
    test_df.set_index('ticket_id', inplace=True)
    
    if debug:
        print()
    
    
    return test_df.compliance

# blight_model(debug=True)

In [None]:
blight_model()

  if self.run_code(code, result):


ticket_id
284932    0.055695
285362    0.011789
285361    0.047673
285338    0.061302
285346    0.071665
285345    0.061497
285347    0.077322
285342    0.638586
285530    0.016982
284989    0.039724
285344    0.077300
285343    0.017237
285340    0.017181
285341    0.077122
285349    0.067679
285348    0.063926
284991    0.048184
285532    0.041581
285406    0.028670
285001    0.018109
285006    0.003585
285405    0.022596
285337    0.014510
285496    0.067036
285497    0.053202
285378    0.022769
285589    0.034097
285585    0.041315
285501    0.062277
285581    0.011600
            ...   
376367    0.005907
376366    0.029196
376362    0.184266
376363    0.226109
376365    0.005907
376364    0.029196
376228    0.037536
376265    0.028917
376286    0.219657
376320    0.034757
376314    0.034940
376327    0.304472
376385    0.285798
376435    0.704513
376370    0.615777
376434    0.061396
376459    0.067699
376478    0.000114
376473    0.034847
376484    0.031778
376482    0.020191
37