# Lab 6 - Wide and Deep Networks

### Eric Smith and Jake Carlson

## Introduction
For this lab, we will again be examining the Global Terrorism Database maintained by the National Consortium for the Study of Terrorism and Responses to Terrorism (START) at the University of Maryland. We will be looking at attacks that happened in the United States over the whole time span of the data set, since its creation in 1974.

## Business Understanding

### Motivations
Protecting the United States from terror threats has been a major objective of the federal government, as exemplified by the founding of the Department of Homeland Security in 2002. But predicting when an attack will happen based on certain attributes is next to impossible. Attempting to train a model on the Global Terrorism Database to learn when terrorist attacks happen would produce a model that is overfit to the GTD and fails to predict any such attacks. Not to mention, such a system would have to be accompanied by a large-scale communication monitoring and processing system capable of feeding the model relevant inputs that exemplify a possible attack.

Instead of trying to predict when an attack will happen, our goal is to create a model that can predict the cost associated with an individual attack. Immediately after an attack has happened, law enforcement can feed in information about the attack, such as the attack type, the number of people injured, and the target type, and receive an approximation of the amount of property damage dealt to their city. Such a model would allow city officials and law enforcement to estimate in real time how much an attack will cost their city. Knowing the estimated cost would enable city officials to determine more quickly whether they need to request support from the federal government. Furthermore, cities could plan their future budgets to incorporate funding for the response to a terrorist attack.

Cities have to submit requests to FEMA for non-disaster grants to aid in the prevention and response to terrorist activity. The Department of Homeland Security can also issue grants to aid in the prevention of terrorism. Grant policies start with Congress allocating funds for federal grants of this type. The Executive Branch provides input for how the policy should be implemented. Then grant issuing agencies develop their own policies for how to allocate grant money.

Each state defines its own thresholds for when an attack is severe enough that it will ask for federal assistance. Our model will allow officials to immediately decide if they need to file for a federal grant. Smaller cities have lower thresholds, and larger cities can handle higher costs before needing assistance.

### Objectives
Based on the characteristics of an attack, such as the target type and the date, we want to assign an estimated cost label to the entity. Because our system will be used to estimate the cost for local city governments, perfect classification of cost is not required. However, it is important that these estimations are accurate because a request for a grant will need to be formed and sent to the federal government.

Based on the distribution of our classes, we want to achieve an accuracy that is greater than the proportion of the data set made up by the majority class, which is the accuracy a model would get by always predicting that class. The class counts are given by:

    Catastrophic (class 0): 4
    Major (class 1): 52
    Minor (class 2): 1770
    Unknown (class 3): 190

The majority class is class 2, which constitutes roughly 88% of the data set (1770 of 2016 events). We want to achieve a classification accuracy greater than this for our model to be useful.
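
As a quick check of this baseline, the class counts listed above can be turned into a majority-class proportion. This is only a minimal sketch; the names `class_counts` and `baseline_accuracy` are ours for illustration and do not appear elsewhere in the lab.

```python
# Class counts copied from the list above.
class_counts = {
    "Catastrophic": 4,
    "Major": 52,
    "Minor": 1770,
    "Unknown": 190,
}

total = sum(class_counts.values())

# The majority class is the label with the largest count.
majority_label = max(class_counts, key=class_counts.get)

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = class_counts[majority_label] / total

print(f"{majority_label}: {baseline_accuracy:.1%} of {total} attacks")
```

Any model we train should beat this figure before we consider it useful.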

## REDO everything above this

### Evaluation
Because we are predicting the group that conducted the attack, it would be an issue if we predicted the wrong group. Law enforcement could waste time and resources following the incorrect prediction and the perpetrators would have more time to get away or plan another attack. We will evaluate our model using the precision score in order to minimize the false positive rate. We will use macro precision so all of the groups are weighted equally.

Because some of the groups are over-represented, we will use stratified 10-fold cross validation so the classes in each fold match the distribution of the original data set. Running training and testing ten times will also allow us to be confident in the generalization performance of the model.
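
The evaluation scheme described above could look roughly like the following. This is a sketch under stated assumptions: the data is synthetic (generated with `make_classification`), and a logistic regression stands in for whatever model is eventually evaluated. Only the combination of stratified 10-fold splitting and macro precision is taken from the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, precision_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data (an assumption, not the GTD).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, n_classes=3,
    weights=[0.8, 0.15, 0.05], random_state=64)

# Macro precision averages the per-class precision, so every class
# counts equally; zero_division=0 scores a never-predicted class as 0.
macro_precision = make_scorer(
    precision_score, average="macro", zero_division=0)

# Each of the 10 folds keeps roughly the original class proportions.
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=64)

scores = cross_val_score(
    LogisticRegression(max_iter=1000),  # placeholder model
    X_demo, y_demo, cv=kfold, scoring=macro_precision)

print("macro precision: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```

Reporting the spread across folds alongside the mean gives a rough sense of how stable the score is.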

## Data Preparation

### Attributes
Here is the list of attributes we will keep in our data set to use for classification.

#### General Information
- **iyear** (ordinal): The year the event occurred in
- **imonth** (ordinal): The month the event occurred in
- **iday** (ordinal): The day the event occurred on
- **extended** (binary): 1 if the incident was longer than 24 hours, 0 otherwise
    - **resolution** (ordinal): The date an extended incident was resolved if *extended* is 1


- **inclusion criteria** (binary): There are three inclusion criteria where a 1 indicates the event meets that criteria
    - **crit1**: Political, economic, religious, or social goal
    - **crit2**: Intention to coerce, intimidate, or publicize
    - **crit3**: Outside international humanitarian law


#### Location
We will provide the name of the city to the model. An alternative method would be to train a separate logistic regression model for each city where our system is deployed.
- **city** (text): Name of the city in which the event occurred
- **vicinity** (nominal/binary): A 1 indicates the event occurred in the immediate vicinity of *city*, 0 indicates the event occurred in *city*
- **latitude** (ratio): The latitude of the *city* in which the event occurred
- **longitude** (ratio): The longitude of the *city* in which the event occurred

#### Attack Type
The most severe method of attack. This will be our class label. Although the original data set contains columns for three different attack types, the attack types are ranked by their severity. Many attacks only have one attack type. By removing the second and third attack types from our data set, we will still be predicting the most severe of the attack types.
- **attacktype1** (ordinal): Most severe attack type

- The attack types are ranked in the following hierarchy:
    1. Assassination
    2. Armed Assault
    3. Bombing/Explosion
    4. Hijacking
    5. Barricade Incident
    6. Kidnapping
    7. Facility/Infrastructure Attack 
    8. Unarmed Assault
    9. Unknown


- **suicide** (nominal/binary): A 1 indicates there was evidence the attacker did not make an effort to escape with their life

#### Target Type
We will only be considering the first target type of the attack. The set of target attributes is provided below:
- **targtype1, targtype1_txt** (nominal): The general type of target from the following list:
    1. Business
    2. Government (General)
    3. Police
    4. Military
    5. Abortion related
    6. Airports and aircraft
    7. Government (Diplomatic)
    8. Educational institution
    9. Food or water supply
    10. Journalists and media
    11. Maritime
    12. NGO
    13. Other
    14. Private citizens and property
    15. Religious figures and institutions
    16. Telecommunication
    17. Terrorists and non-state militias
    18. Tourists
    19. Transportation
    20. Unknown
    21. Utilities
    22. Violent political parties
    

- **targsubtype1, targsubtype1_txt** (nominal): There are a number of subtypes for each of the above target types

#### Perpetrator Information
The data set provides information on up to three perpetrators if the attack was conducted by multiple groups. We will only be considering the first group, or the one decided to have the most responsibility for the attack.
- **individual** (binary): A 1 indicates the individuals carrying out the attack are not affiliated with a terror organization
- **nperps** (ratio): Indicates the total number of terrorists participating in the event
- **nperpcap** (ratio): Number of perpetrators taken into custody
- **claimed** (binary): A 1 indicates a person or group claimed responsibility for the attack
- **claimmode** (nominal): Records the method the terror group used to claim responsibility for the attack. Can be one of the ten following categories:
    1. Letter
    2. Call (post-incident)
    3. Call (pre-incident)
    4. E-mail
    5. Note left at scene
    6. Video
    7. Posted to website
    8. Personal claim
    9. Other
    10. Unknown


#### Casualties and Consequences
- **nkill** (ratio): Records the number of confirmed kills for the incident
- **nkillter** (ratio): Indicates the number of terrorists who were killed in the event
- **nwound** (ratio): Indicates the number of people who sustained non-fatal injuries in the event
- **nwoundte** (ratio): Indicates the number of terrorists who sustained non-lethal injuries
- **property** (binary): A 1 indicates the event resulted in property damage. We will only select entities that resulted in property damage
- **propextent** (ordinal): If *property* is a 1, this field records the extent of the property damage following the scheme:
    <ol start='0'>
        <li>Catastrophic (likely > \$1 billion)</li>
        <li>Major (likely > \$1 million and < \$1 billion)</li>
        <li>Minor (likely < \$1 million)</li>
        <li>Unknown</li>
    </ol>

### Data Cleaning
We will clean the data set so only the above attributes are present.

In [9]:
import pandas as pd

df = pd.read_csv('./data/After_911.csv', encoding='ISO-8859-1', low_memory=False)
df.head()

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
0,200000000000.0,2001,9,11,,0,,34,Burundi,11,...,At least 10 army soldiers were killed by Front...,"""Burundi: Rebels Ambush Minibus North of Bujum...",,,CETIS,-9,-9,0,-9,
1,200000000000.0,2001,9,11,,0,,229,Democratic Republic of the Congo,11,...,The soldiers arrived at the location of the at...,"""DRCongo: Four Killed in Shooting in Rebel Hel...",,,CETIS,-9,-9,0,-9,
2,200000000000.0,2001,9,11,,0,,97,Israel,10,...,Israeli investigators believed the attack was ...,"""Two Border Policemen Killed, One Wounded in a...",,,CETIS,-9,-9,0,-9,
3,200000000000.0,2001,9,11,,0,,217,United States,1,...,This attack was one of four related incidents ...,"United States Government, The 9/11 Commission ...","Lindsay Kines, ñUnited States on high alert af...","Joe Frolick, ñHijackers Ram Two Airliners Into...",CETIS,0,1,0,1,"200109110004, 200109110005, 200109110006, 2001..."
4,200000000000.0,2001,9,11,,0,,217,United States,1,...,This attack was one of four related incidents ...,"United States Government, The 9/11 Commission ...","Lindsay Kines, ñUnited States on high alert af...","Joe Frolick, ñHijackers Ram Two Airliners Into...",CETIS,0,1,0,1,"200109110005, 200109110004, 200109110006, 2001..."


In [10]:
df.columns.values

array(['eventid', 'iyear', 'imonth', 'iday', 'approxdate', 'extended',
       'resolution', 'country', 'country_txt', 'region', 'region_txt',
       'provstate', 'city', 'latitude', 'longitude', 'specificity',
       'vicinity', 'location', 'summary', 'crit1', 'crit2', 'crit3',
       'doubtterr', 'alternative', 'alternative_txt', 'multiple',
       'success', 'suicide', 'attacktype1', 'attacktype1_txt',
       'attacktype2', 'attacktype2_txt', 'attacktype3', 'attacktype3_txt',
       'targtype1', 'targtype1_txt', 'targsubtype1', 'targsubtype1_txt',
       'corp1', 'target1', 'natlty1', 'natlty1_txt', 'targtype2',
       'targtype2_txt', 'targsubtype2', 'targsubtype2_txt', 'corp2',
       'target2', 'natlty2', 'natlty2_txt', 'targtype3', 'targtype3_txt',
       'targsubtype3', 'targsubtype3_txt', 'corp3', 'target3', 'natlty3',
       'natlty3_txt', 'gname', 'gsubname', 'gname2', 'gsubname2', 'gname3',
       'gsubname3', 'motive', 'guncertain1', 'guncertain2', 'guncertain3',
       'in

In [11]:
to_keep = ['eventid', 'extended', 'iyear', 'imonth', 'approxdate', 'iday', 'crit1', 'crit2',
           'crit3', 'country', 'city', 'vicinity', 'latitude', 'longitude',
           'attacktype1_txt', 'attacktype2_txt',
           'attacktype3_txt', 'success', 'suicide',
           'targtype1_txt', 'gname', 'individual', 
           'nperps', 'nperpcap', 'claimed', 'nkill', 'nkillter', 'nwound', 'nwoundte',
           'property', 'propextent', 'propextent_txt', 'propvalue',
           'ishostkid', 'nhostkid', 'nreleased']
# copy so the in-place edits below act on an independent frame, not a view
df = df[to_keep].copy()

In [12]:
import numpy as np
from sklearn import preprocessing
from datetime import datetime
import dateutil.parser

logical_cols = ['extended', 'vicinity', 'crit1', 'crit2', 'crit3',
                'suicide', 'individual', 'claimed', 'success', 'property']
categorical_cols = ['attacktype1_txt', 'attacktype2_txt', 'attacktype3_txt', 
                    'targtype1_txt', 'country', 'city', 'gname', 'propextent_txt']
ratio_cols = ['latitude', 'longitude', 'nperps', 'nperpcap', 'nkill',
              'nkillter', 'nwound', 'nwoundte', 'propvalue', 'ishostkid']

# replace unknowns with nan
logical_replace = dict((l, {-9:np.nan}) for l in logical_cols)
ratio_replace = dict((r, {-99:np.nan, -9:np.nan}) for r in ratio_cols)
df.replace(to_replace=logical_replace, inplace=True)
df.replace(to_replace=ratio_replace, inplace=True)

# replace unknowns with median
logical_replace = dict((l, {np.nan:df[l].median()}) for l in logical_cols)
ratio_replace = dict((r, {np.nan:df[r].median()}) for r in ratio_cols)
df.replace(to_replace=logical_replace, inplace=True)
df.replace(to_replace=ratio_replace, inplace=True)

# impute nhostkid column: no hostages were taken when ishostkid is 0
df.loc[df.ishostkid == 0, 'nhostkid'] = 0

# convert logical cols to bools
for l in logical_cols:
    df[l] = df[l].astype('bool')

# replace dates with the first date of the approximate range
for index, row in df[ df.approxdate.notnull() ].iterrows():
    date = dateutil.parser.parse( row.approxdate.split('-')[0] )
    df.loc[index, 'imonth'] = date.month
    df.loc[index, 'iday'] = date.day


# normalize ratio cols
min_max_scaler = preprocessing.MinMaxScaler()
df[ratio_cols] = min_max_scaler.fit_transform(df[ratio_cols])

# standardize date attributes
# use year, month, and day to get day number in year
day_list = []
for r in df[['iyear', 'imonth', 'iday']].iterrows():
    # fudge day 0 to 1
    if r[1].iday == 0:
        day_list.append(
            datetime(r[1].iyear, r[1].imonth, 1).timetuple().tm_yday)
    else:
        day_list.append(
            datetime(r[1].iyear, r[1].imonth, r[1].iday).timetuple().tm_yday)
        
df = df.assign(dayn=day_list)

# drop month and day attributes
df.drop(['imonth', 'iday'], axis=1, inplace=True)

# normalize day number and year col
df['iyear'] = df['iyear'].astype(np.float64)
df['dayn'] = df['dayn'].astype(np.float64)
df[['iyear', 'dayn']] = min_max_scaler.fit_transform(df[['iyear', 'dayn']])

# drop unknown groups
df = df[df.gname != "Unknown"]

# one-hot encode categorical cols
# df = pd.get_dummies(df, prefix=categorical_cols, columns=categorical_cols)


df.head()

Unnamed: 0,eventid,extended,iyear,approxdate,crit1,crit2,crit3,country,city,vicinity,...,nwound,nwoundte,property,propextent,propextent_txt,propvalue,ishostkid,nhostkid,nreleased,dayn
3,200000000000.0,False,0.0,,True,True,True,217,New York City,False,...,1.0,0.0,True,1.0,Catastrophic (likely > $1 billion),0.002911,1.0,88.0,0.0,0.081967
4,200000000000.0,False,0.0,,True,True,True,217,New York City,False,...,0.999864,0.0,True,1.0,Catastrophic (likely > $1 billion),0.002911,1.0,59.0,0.0,0.081967
5,200000000000.0,False,0.0,,True,True,True,217,Arlington,False,...,0.01439,0.0,True,1.0,Catastrophic (likely > $1 billion),0.002911,1.0,59.0,0.0,0.081967
6,200000000000.0,False,0.0,,True,True,True,217,Shanksville,True,...,0.000679,0.0,True,1.0,Catastrophic (likely > $1 billion),0.002911,1.0,40.0,0.0,0.081967
9,200000000000.0,False,0.0,,True,True,True,160,Barira,False,...,0.000815,0.0,False,,,0.002911,0.0,0.0,,0.098361


In [28]:
# save full clean data set
df.to_csv('./clean-data/After_911_clean.csv', sep=',')

We have done a number of things to prepare our data for modeling. First, we replaced unknown values in each column with the median for that column. Second, we converted attributes that encode a logical value to booleans. Third, we normalized the ratio attributes so they all lie in the range 0 to 1. Fourth, we converted the year, month, and day attributes into a single numeric attribute that represents the day number in the year on which the attack occurred, and then dropped the month and day columns. We still want the year attribute because of the change in attack frequency we noticed in Lab 1, so we min-max scale the year and day number columns as well. Finally, we one-hot encode all of the categorical attributes in our data set, which creates the additional columns needed to represent our data in this way.
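
The steps above can be illustrated on a few toy rows. This is a sketch, not the cell we actually ran: the values in `toy` are made up, only the column names come from the GTD, and the date conversion uses a vectorized `pd.to_datetime` call in place of the row-by-row loop.

```python
import numpy as np
import pandas as pd

# Toy rows (assumed values) that reuse GTD column names.
toy = pd.DataFrame({
    "iyear":   [2001, 2005, 2010, 2016],
    "imonth":  [9, 1, 12, 3],
    "iday":    [11, 0, 31, 1],     # 0 means the day is unknown
    "nkill":   [10, -99, 0, 2],    # -99 and -9 encode "unknown"
    "claimed": [1, -9, 0, 0],
})

# 1. Unknown codes become NaN, then NaN becomes the column median.
cols = ["nkill", "claimed"]
toy[cols] = toy[cols].replace({-99: np.nan, -9: np.nan})
toy = toy.fillna(toy.median(numeric_only=True))

# 2. Logical attributes become booleans.
toy["claimed"] = toy["claimed"].astype(bool)

# 3. Ratio attributes are min-max scaled into [0, 1].
nkill = toy["nkill"]
toy["nkill"] = (nkill - nkill.min()) / (nkill.max() - nkill.min())

# 4. Year, month, and day collapse into a day-of-year number;
#    an unknown day (0) is treated as the first of the month.
dates = pd.to_datetime(pd.DataFrame({
    "year": toy["iyear"],
    "month": toy["imonth"],
    "day": toy["iday"].replace(0, 1),
}))
toy["dayn"] = dates.dt.dayofyear
toy = toy.drop(columns=["imonth", "iday"])

print(toy)
```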

### Crossed Columns
We will create several crossed columns to make the data set wider. This will help our model with memorization of the training data. We will cross attack type with property extent, target type, city, and country. We will also cross property extent with target type.
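
As a small illustration of what crossing two columns means, the sketch below joins the string values of two categorical columns row by row and encodes each distinct combination as an integer, which is the form an embedding layer expects. The rows in `toy` are invented for illustration; only the column names come from our data set.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows (assumed values) with two GTD categorical columns.
toy = pd.DataFrame({
    "attacktype1_txt": ["Bombing/Explosion", "Armed Assault",
                        "Bombing/Explosion"],
    "targtype1_txt": ["Business", "Police", "Business"],
})

# Cross the columns: one combined label per row.
crossed = toy["attacktype1_txt"].str.cat(toy["targtype1_txt"], sep="_")

# Map every distinct combination to an integer id.
encoder = LabelEncoder()
crossed_ids = encoder.fit_transform(crossed)

print(list(encoder.classes_))
print(crossed_ids)
```

Rows that share the same attack type and target type receive the same id, so the model can memorize behavior specific to that combination.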

In [42]:
import pandas as pd
import numpy as np
df = pd.read_csv('./clean-data/After_911_clean.csv')

y = df['propextent']
X = df.drop(['propextent'], axis=1)
# y_ints, y_levels = pd.factorize(y)

In [44]:
set(y.values)

{nan, 1.0, 2.0, nan, nan, 4.0, nan, nan, nan, nan, 3.0, nan, ...}
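
The many `nan` entries in this output appear because NaN never compares equal to itself, so each missing value read from the array ends up as a separate element of the set. A tidier way to inspect the labels is sketched below on toy values (assumed for illustration) using pandas, which groups missing values together.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the propextent column (values assumed).
y_demo = pd.Series([1.0, np.nan, 2.0, np.nan, 4.0, 3.0, np.nan])

# Distinct labels with missing values removed.
labels = sorted(y_demo.dropna().unique())

# Count per label; dropna=False keeps one row for all missing values.
counts = y_demo.value_counts(dropna=False)

print(labels)
print(counts)
```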

In [87]:
import pandas as pd

df = pd.read_csv('./data/After_911.csv', encoding='ISO-8859-1')

# drop rows without property damage or unknown city
orig_len = df.shape[0]
df = df[df['property'] == 1]
df = df[df['city'] != "Unknown"]
new_len = df.shape[0]
print("Percent maintained: ", new_len/orig_len*100, "%")

# select columns of interest
df = df[['iyear', 'imonth', 'iday', 'extended', 'country', 'city', 'vicinity',
         'latitude', 'longitude', 'crit1', 'crit2', 'crit3', 'suicide',
         'attacktype1_txt', 'targtype1_txt', 'individual', 'nperps', 'nperpcap', 
         'claimed', 'claimmode', 'nkill', 'nkillter', 'nwound', 
         'nwoundte', 'propextent']]

import numpy as np
from sklearn import preprocessing
from datetime import datetime

logical_cols = ['extended', 'vicinity', 'crit1', 'crit2', 'crit3',
                'suicide', 'individual', 'claimed']
categorical_cols = ['attacktype1_txt', 'targtype1_txt', 'country', 'city', 'claimmode']
ratio_cols = ['latitude', 'longitude', 'nperps', 'nperpcap', 'nkill',
              'nkillter', 'nwound', 'nwoundte']

# replace unknowns with nan
logical_replace = dict((l, {-9:np.nan}) for l in ['claimed'])
ratio_replace = dict((r, {-99:np.nan, -9:np.nan}) for r in ratio_cols)
df.replace(to_replace=logical_replace, inplace=True)
df.replace(to_replace=ratio_replace, inplace=True)

# might want to remove outliers here

# replace NA's with median for col
# numeric_only=True skips text columns such as city when taking medians
df.fillna(value=df.median(numeric_only=True), inplace=True)

# convert logical cols to bools
for l in logical_cols:
    df[l] = df[l].astype('bool')

# normalize ratio cols
min_max_scaler = preprocessing.MinMaxScaler()
df[ratio_cols] = min_max_scaler.fit_transform(df[ratio_cols])

# standardize date attributes
# use year, month, and day to get day number in year
day_list = []
for r in df[['iyear', 'imonth', 'iday']].iterrows():
    # fudge day 0 to 1
    if r[1].iday == 0:
        day_list.append(
            datetime(r[1].iyear, r[1].imonth, 1).timetuple().tm_yday)
    else:
        day_list.append(
            datetime(r[1].iyear, r[1].imonth, r[1].iday).timetuple().tm_yday)
        
df = df.assign(dayn=day_list)

# drop month and day attributes
df.drop(['imonth', 'iday'], axis=1, inplace=True)

# normalize day number and year col
df['iyear'] = df['iyear'].astype(np.float64)
df['dayn'] = df['dayn'].astype(np.float64)
df[['iyear', 'dayn']] = min_max_scaler.fit_transform(df[['iyear', 'dayn']])

# convert categorical cols as ints
# for c in categorical_cols:
#     df[c] = df[c].astype(np.int64)

# one-hot encode categorical cols
df = pd.get_dummies(df, prefix=categorical_cols, columns=categorical_cols)

# zero-index propextent
df.propextent = df.propextent - 1

df.head()


Percent maintained:  36.354092102123595 %


Unnamed: 0,iyear,extended,vicinity,latitude,longitude,crit1,crit2,crit3,suicide,individual,...,claimmode_1.0,claimmode_2.0,claimmode_3.0,claimmode_4.0,claimmode_5.0,claimmode_6.0,claimmode_7.0,claimmode_8.0,claimmode_9.0,claimmode_10.0
3,0.0,False,False,0.758239,0.161304,True,True,True,True,False,...,0,0,0,0,0,1,0,0,0,0
4,0.0,False,False,0.758239,0.161304,True,True,True,True,False,...,0,0,0,0,0,1,0,0,0,0
5,0.0,False,False,0.740602,0.150956,True,True,True,True,False,...,0,0,0,0,0,1,0,0,0,0
6,0.0,False,True,0.751555,0.144956,True,True,True,True,False,...,0,0,0,0,0,1,0,0,0,0
10,0.0,False,False,0.733236,0.452702,True,True,True,False,False,...,0,0,0,0,0,0,1,0,0,0


In [88]:
y = df['propextent']
X = df.drop(['propextent'], axis=1)
y_ints = y.values

In [59]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Input
from keras.layers import Embedding, Flatten, concatenate
from keras.models import Model
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold, cross_val_score

cross_columns = [['attacktype1_txt','targtype1_txt'],
                 ['attacktype1_txt','city'],
                 ['attacktype1_txt','country']]

X_ints = []
all_inputs = []
all_branch_outputs = []

for cols in cross_columns:
    # encode as ints for the embedding
    enc = LabelEncoder()
    
    # create crossed labels
    # join the values of the crossed columns row by row
    X_crossed = df[cols].astype(str).apply(lambda row: '_'.join(row), axis=1)
    
    enc.fit(X_crossed)
    X_crossed = enc.transform(X_crossed)
    X_ints.append(X_crossed)
    
    # get the number of categories
    N = max(X_ints[-1]+1)
    
    # create embedding branch from the number of categories
    inputs = Input(shape=(1,),dtype='int32')
    all_inputs.append(inputs)
    x = Embedding(input_dim=N, output_dim=int(np.sqrt(N)), input_length=1)(inputs)
    x = Flatten()(x)
    all_branch_outputs.append(x)

# merge the branches together
final_branch = concatenate(all_branch_outputs)
# propextent has four classes (0-3), so use a 4-way softmax output
final_branch = Dense(units=4,activation='softmax')(final_branch)

# def create_model():
#     model = Model(inputs=all_inputs, outputs=final_branch)

#     model.compile(optimizer='sgd',
#                   loss='mean_squared_error',
#                   metrics=['accuracy'])
#     return model

model = Model(inputs=all_inputs, outputs=final_branch)

# integer class labels pair with sparse categorical cross-entropy
model.compile(optimizer='sgd',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# model = KerasClassifier(build_fn=create_model, epochs=10, batch_size=32, verbose=1)
# kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=64)
# results = cross_val_score(model, X_ints, df.gname.values, cv=kfold)

# print(results.mean())
## replace this with the train test pipeline
model.fit(X_ints,
          y_ints, epochs=10, batch_size=32, verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x137741668>

In [89]:
X.columns.values

array(['iyear', 'extended', 'vicinity', ..., 'claimmode_8.0',
       'claimmode_9.0', 'claimmode_10.0'], dtype=object)

In [94]:
from sklearn.model_selection import StratifiedShuffleSplit

col_names = X.columns.values

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=64)
for train_idx, test_idx in sss.split(X.values, y.values):
    # X_train - 80% training attribute set
    # X_test - 20% test attribute set
    # y_train - 80% training labels
    # y_test - 20% test labels
    X_train, X_test = pd.DataFrame(X.values[train_idx], columns=col_names), pd.DataFrame(X.values[test_idx], columns=col_names)
    y_train, y_test = pd.DataFrame(y.values[train_idx], columns=["propextent"]), pd.DataFrame(y.values[test_idx], columns=["propextent"])

X_train, X_test = X_train.values, X_test.values
y_train, y_test = y_train.values, y_test.values

In [105]:
set(y_train.flatten())

{0.0, 1.0, 2.0, 3.0}

In [110]:
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics as mt

# ohe = LabelEncoder()
# X_train_ohe = ohe.fit_transform(X_train[categorical_cols].values)
# X_test_ohe = ohe.transform(X_test[categorical_cols].values)

# inputs = Input(shape=(X_train.shape[1],),sparse=False)

# # a layer instance is callable on a tensor, and returns a tensor
# x = Dense(units=100, activation='relu')(inputs)
# predictions = Dense(4,activation='softmax')(x)

def build_model(input_dim):
    model = Sequential()
    model.add(Dense(units=20, input_dim=input_dim, activation='relu'))
    model.add(Dense(units=4, activation='softmax'))
    model.compile(optimizer='sgd',
                  loss='mean_squared_error',
                  metrics=['accuracy'])
    return model
    

# This creates a model that includes
# the Input layer and three Dense layers
# model = Model(inputs=inputs, outputs=predictions)

# model.compile(optimizer='sgd',
#               loss='mean_squared_error',
#               metrics=['accuracy'])

model = build_model(X_train.shape[1])
model.summary()

model.fit(X_train, y_train, epochs=10, batch_size=50, verbose=1)

# test on the data
yhat = np.round(model.predict(X_test))
print(mt.confusion_matrix(y_test,yhat),mt.accuracy_score(y_test,yhat))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_36 (Dense)             (None, 20)                68660     
_________________________________________________________________
dense_37 (Dense)             (None, 4)                 84        
Total params: 68,744
Trainable params: 68,744
Non-trainable params: 0
_________________________________________________________________




ValueError: Error when checking target: expected dense_37 to have shape (None, 4) but got array with shape (6094, 1)
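
The error above arises because the softmax layer emits four values per sample while `y_train` holds a single integer label per row. One possible fix, sketched here with NumPy only and on assumed toy labels, is to one-hot encode the labels before fitting (Keras provides `keras.utils.to_categorical` for the same purpose) and to decode predictions with `argmax` rather than `np.round`. Alternatively, the integer labels could be kept as they are and the loss switched to `sparse_categorical_crossentropy`. The array `probs` below is a hand-written stand-in for the output of `model.predict`.

```python
import numpy as np

n_classes = 4

# Toy integer labels shaped like y_train: (n_samples, 1). Values assumed.
y_demo = np.array([[2], [0], [3], [2], [1]])

# One-hot encode: row i gets a 1 in the column of its class,
# giving the (n_samples, 4) shape the softmax layer expects.
y_onehot = np.eye(n_classes)[y_demo.ravel().astype(int)]

# Stand-in for model.predict output: one probability per class per row.
probs = np.array([[0.1, 0.1, 0.7, 0.1],
                  [0.6, 0.2, 0.1, 0.1],
                  [0.2, 0.2, 0.2, 0.4],
                  [0.1, 0.2, 0.6, 0.1],
                  [0.3, 0.4, 0.2, 0.1]])

# Decode each row to the most probable class id.
yhat = probs.argmax(axis=1)

print(y_onehot.shape)
print(yhat)
```

With labels in this shape, a categorical cross-entropy loss would also be a more natural pairing for the softmax output than mean squared error.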