# Lab 4 - Extending Logistic Regression

### Eric Smith and Jake Carlson

## Introduction
For this lab, we will again be examining the Global Terrorism Database (GTD) maintained by the National Consortium for the Study of Terrorism and Responses to Terrorism (START) at the University of Maryland. We will be looking at attacks that happened in the United States over the whole time span of the data set, which begins in 1970.

We will be extending binary logistic regression based classification to support multiple classes. Specifically, based on the attributes of the input entity, we want to predict what the attack type of the entity is.
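
The lab does not state at this point how the binary classifier is extended. One common option is one-vs-rest: train one binary logistic regression per class and predict the class whose classifier is most confident. The block below is a minimal sketch under that assumption. The function names, toy data, learning rate, and iteration count are illustrative and do not come from the lab.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_binary(Xb, y, lr=0.1, iters=2000):
    """Fit one binary logistic regression by gradient ascent on the log-likelihood.
    Xb must already contain a leading bias column of ones."""
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        w += lr * Xb.T @ (y - sigmoid(Xb @ w)) / len(y)
    return w

def fit_one_vs_rest(X, y, **kwargs):
    """Train one binary classifier per class (that class vs. all others)."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    classes = np.unique(y)
    W = np.vstack([fit_binary(Xb, (y == c).astype(float), **kwargs) for c in classes])
    return classes, W

def predict(X, classes, W):
    """Pick the class whose binary classifier reports the highest probability."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return classes[np.argmax(sigmoid(Xb @ W.T), axis=1)]

# Toy data (hypothetical): three well-separated 2-D clusters labeled 1, 2, 3
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [5, 5], [0, 5]])
X_toy = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])
y_toy = np.repeat([1, 2, 3], 30)

classes, W = fit_one_vs_rest(X_toy, y_toy)
accuracy = np.mean(predict(X_toy, classes, W) == y_toy)
print("Training accuracy:", accuracy)
```

Each row of `W` holds the weights of one class-vs-rest classifier, so adding a class only adds a row and leaves the binary training routine unchanged.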

## Business Understanding

### Motivations
Protecting the United States from terror threats has been a major objective of the federal government. This is exemplified by the founding of the Department of Homeland Security in 2002. But predicting when an attack will happen based on certain attributes is next to impossible. Attempting to train a model on the Global Terrorism Database to learn when terrorist attacks happen would likely produce a model that overfits the GTD and fails to predict new attacks. Moreover, such a system would have to be accompanied by a large-scale communication monitoring and processing system capable of feeding the model relevant inputs that signal a possible attack.

Instead of trying to predict when an attack will happen, our goal is to create a model that can predict the cost associated with an individual attack. Immediately after an attack has happened, law enforcement can feed in information about the attack, such as the attack type, the number of people injured, and the target type, and receive an approximation of the amount of property damage dealt to their city. Such a model would allow city officials and law enforcement to estimate in real time how much an attack will cost their city. Knowing the estimated cost would let city officials decide more quickly whether they need to request support from the federal government. Furthermore, cities could plan their future budgets to include funding for responding to a terrorist attack.

Cities have to submit requests to FEMA for non-disaster grants to aid in the prevention and response to terrorist activity. The Department of Homeland Security can also issue grants to aid in the prevention of terrorism. Grant policies start with Congress allocating funds for federal grants of this type. The Executive Branch provides input for how the policy should be implemented. Then grant issuing agencies develop their own policies for how to allocate grant money.

Each state defines its own thresholds for when an attack is severe enough that it will ask for federal assistance. Our model will allow officials to immediately decide if they need to file for a federal grant. Smaller cities have lower thresholds and larger cities can handle higher costs before needing assistance.

### Objectives
Based on the characteristics of an attack, such as the target type and the date, we want to assign an estimated cost label to the entity. Because our system will be used to estimate the cost for local city governments, perfect classification of cost is not required. However, it is important that these estimates are as accurate as possible because a request for a grant will need to be prepared and sent to the federal government.

...

## Data Preparation

### Attributes
Here is the list of attributes we will keep in our data set to use for classification.

#### General Information
- **iyear** (ordinal): The year in which the event occurred
- **imonth** (ordinal): The month in which the event occurred
- **iday** (ordinal): The day on which the event occurred
- **extended** (binary): 1 if the incident was longer than 24 hours, 0 otherwise
    - **resolution** (ordinal): The date an extended incident was resolved if *extended* is 1


- **inclusion criteria** (binary): There are three inclusion criteria where a 1 indicates the event meets that criteria
    - **crit1**: Political, economic, religious, or social goal
    - **crit2**: Intention to coerce, intimidate, or publicize
    - **crit3**: Outside international humanitarian law


#### Location
We will provide the name of the city to the model. An alternative method would be to train a separate logistic regression model for each city where our system is deployed.
- **city** (text): Name of the city in which the event occurred
- **vicinity** (nominal/binary): A 1 indicates the event occurred in the immediate vicinity of *city*, 0 indicates the event occurred in *city*

#### Attack Type
The most severe method of attack. This will be our class label. Although the original data set contains columns for three different attack types, the attack types are ranked by their severity. Many attacks only have one attack type. By removing the second and third attack types from our data set, we will still be predicting the most severe of the attack types.
- **attacktype1** (ordinal): Most severe attack type

- The attack types are ranked in the following hierarchy:
    1. Assassination
    2. Armed Assault
    3. Bombing/Explosion
    4. Hijacking
    5. Barricade Incident
    6. Kidnapping
    7. Facility/Infrastructure Attack 
    8. Unarmed Assault
    9. Unknown


- **suicide** (nominal/binary): A 1 indicates there was evidence the attacker did not make an effort to escape with their life

#### Target Type
We will only be considering the first target type of the attack. The set of target attributes is provided below:
- **targtype1, targtype1_txt** (nominal): The general type of target from the following list:
    1. Business
    2. Government (General)
    3. Police
    4. Military
    5. Abortion related
    6. Airports and aircraft
    7. Government (Diplomatic)
    8. Educational institution
    9. Food or water supply
    10. Journalists and media
    11. Maritime
    12. NGO
    13. Other
    14. Private citizens and property
    15. Religious figures and institutions
    16. Telecommunication
    17. Terrorists and non-state militias
    18. Tourists
    19. Transportation
    20. Unknown
    21. Utilities
    22. Violent political parties
    

- ? **targsubtype1, targsubtype1_txt** (nominal): There are a number of subtypes for each of the above target types

#### Perpetrator Information
The data set provides information on up to three perpetrator groups if the attack was conducted by multiple groups. We will only be considering the first group, which is the one judged to bear the most responsibility for the attack.
- **individual** (binary): A 1 indicates the individuals carrying out the attack are not affiliated with a terror organization
- **nperps** (ratio): Indicates the total number of terrorists participating in the event
- **nperpcap** (ratio): Number of perpetrators taken into custody
- **claimed** (binary): A 1 indicates a person or group claimed responsibility for the attack
- **claimmode** (nominal): Records the method the terror group used to claim responsibility for the attack. Can be one of the following ten categories:
    1. Letter
    2. Call (post-incident)
    3. Call (pre-incident)
    4. E-mail
    5. Note left at scene
    6. Video
    7. Posted to website
    8. Personal claim
    9. Other
    10. Unknown


#### Casualties and Consequences
- **nkill** (ratio): Records the number of confirmed kills for the incident
- **nkillter** (ratio): Indicates the number of terrorists who were killed in the event
- **nwound** (ratio): Indicates the number of people who sustained non-fatal injuries in the event
- **nwoundte** (ratio): Indicates the number of terrorists who sustained non-lethal injuries
- **property** (binary): A 1 indicates the event resulted in property damage. We will only select entities that resulted in property damage
- **propextent** (ordinal): If *property* is a 1, this field records the extent of the property damage following the scheme:
    1. Catastrophic (likely > \$1 billion)
    2. Major (likely > \$1 million and < \$1 billion)
    3. Minor (likely < \$1 million)
    4. Unknown
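
The coding scheme above can be written as a lookup table. The sketch below uses a small made-up series rather than the real column, and it assumes that code 4 (Unknown) should be treated as missing when *propextent* is used as a cost label. The lab does not state that choice, and the names `PROPEXTENT_LABELS`, `example`, and `known` are illustrative.

```python
import numpy as np
import pandas as pd

# Labels taken from the propextent scheme listed above
PROPEXTENT_LABELS = {1: 'Catastrophic', 2: 'Major', 3: 'Minor', 4: 'Unknown'}

# Made-up values in the same float-with-NaN form the column has after loading
example = pd.Series([3.0, 2.0, 4.0, np.nan, 1.0], name='propextent')

# Human-readable labels for inspection
labels = example.map(PROPEXTENT_LABELS)

# Assumption: an "Unknown" extent carries no cost information, so mask it as NaN
known = example.where(example != 4)

print(pd.DataFrame({'code': example, 'label': labels, 'known_code': known}))
```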


### Data Cleaning
We will clean the data set so only the above attributes are present.

In [36]:
import pandas as pd

df = pd.read_csv('../Lab1/data/us_only.csv', encoding='ISO-8859-1')
df.head()

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
0,197001010002,1970,1,1,,0,,217,United States,1,...,"The Cairo Chief of Police, William Petersen, r...","""Police Chief Quits,"" Washington Post, January...","""Cairo Police Chief Quits; Decries Local 'Mili...","Christopher Hewitt, ""Political Violence and Te...",Hewitt Project,-9,-9,0,-9,
1,197001020002,1970,1,2,,0,,217,United States,1,...,"Damages were estimated to be between $20,000-$...",Committee on Government Operations United Stat...,"Christopher Hewitt, ""Political Violence and Te...",,Hewitt Project,-9,-9,0,-9,
2,197001020003,1970,1,2,,0,,217,United States,1,...,The New Years Gang issue a communiqué to a loc...,"Tom Bates, ""Rads: The 1970 Bombing of the Army...","David Newman, Sandra Sutherland, and Jon Stewa...","The Wisconsin Cartographers' Guild, ""Wisconsin...",Hewitt Project,0,0,0,0,
3,197001030001,1970,1,3,,0,,217,United States,1,...,"Karl Armstrong's girlfriend, Lynn Schultz, dro...",Committee on Government Operations United Stat...,"Tom Bates, ""Rads: The 1970 Bombing of the Army...","David Newman, Sandra Sutherland, and Jon Stewa...",Hewitt Project,0,0,0,0,
4,197001050001,1970,1,1,,0,,217,United States,1,...,,,,,PGIS,0,0,0,0,


In [37]:
df.columns.values

array(['eventid', 'iyear', 'imonth', 'iday', 'approxdate', 'extended',
       'resolution', 'country', 'country_txt', 'region', 'region_txt',
       'provstate', 'city', 'latitude', 'longitude', 'specificity',
       'vicinity', 'location', 'summary', 'crit1', 'crit2', 'crit3',
       'doubtterr', 'alternative', 'alternative_txt', 'multiple',
       'success', 'suicide', 'attacktype1', 'attacktype1_txt',
       'attacktype2', 'attacktype2_txt', 'attacktype3', 'attacktype3_txt',
       'targtype1', 'targtype1_txt', 'targsubtype1', 'targsubtype1_txt',
       'corp1', 'target1', 'natlty1', 'natlty1_txt', 'targtype2',
       'targtype2_txt', 'targsubtype2', 'targsubtype2_txt', 'corp2',
       'target2', 'natlty2', 'natlty2_txt', 'targtype3', 'targtype3_txt',
       'targsubtype3', 'targsubtype3_txt', 'corp3', 'target3', 'natlty3',
       'natlty3_txt', 'gname', 'gsubname', 'gname2', 'gsubname2', 'gname3',
       'gsubname3', 'motive', 'guncertain1', 'guncertain2', 'guncertain3',
       'in

In [38]:
# drop rows without property damage or unknown city
orig_len = df.shape[0]
df = df[df['property'] == 1]
df = df[df['city'] != "Unknown"]
new_len = df.shape[0]
print("Percent maintained: ", new_len/orig_len*100, "%")

# select columns of interest
df = df[['iyear', 'imonth', 'iday', 'extended', 'city', 'vicinity',
         'crit1', 'crit2', 'crit3', 'suicide', 'attacktype1', 'targtype1',
         'individual', 'nperps', 'nperpcap', 'claimed', 'claimmode',
         'nkill', 'nkillter', 'nwound', 'nwoundte', 'propextent']]

Percent maintained:  73.42277012327773 %


In [39]:
df.head()

Unnamed: 0,iyear,imonth,iday,extended,city,vicinity,crit1,crit2,crit3,suicide,...,individual,nperps,nperpcap,claimed,claimmode,nkill,nkillter,nwound,nwoundte,propextent
0,1970,1,1,0,Cairo,0,1,1,1,0,...,0,-99.0,-99.0,0.0,,0.0,0.0,0.0,0.0,3.0
1,1970,1,2,0,Oakland,0,1,1,1,0,...,0,-99.0,-99.0,0.0,,0.0,0.0,0.0,0.0,3.0
2,1970,1,2,0,Madison,0,1,1,1,0,...,0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,3.0
3,1970,1,3,0,Madison,0,1,1,1,0,...,0,1.0,1.0,0.0,,0.0,0.0,0.0,0.0,3.0
5,1970,1,6,0,Denver,0,1,1,1,0,...,0,-99.0,-99.0,0.0,,0.0,0.0,0.0,0.0,3.0


In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2025 entries, 0 to 2756
Data columns (total 22 columns):
iyear          2025 non-null int64
imonth         2025 non-null int64
iday           2025 non-null int64
extended       2025 non-null int64
city           2025 non-null object
vicinity       2025 non-null int64
crit1          2025 non-null int64
crit2          2025 non-null int64
crit3          2025 non-null int64
suicide        2025 non-null int64
attacktype1    2025 non-null int64
targtype1      2025 non-null int64
individual     2025 non-null int64
nperps         1193 non-null float64
nperpcap       1155 non-null float64
claimed        1157 non-null float64
claimmode      339 non-null float64
nkill          1958 non-null float64
nkillter       1178 non-null float64
nwound         1942 non-null float64
nwoundte       1166 non-null float64
propextent     1490 non-null float64
dtypes: float64(9), int64(12), object(1)
memory usage: 363.9+ KB
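
The non-null counts above are easier to compare as percentages of missing values. The sketch below recomputes them from the counts printed by `df.info()`. On the real frame the same figures should come from `df.isna().mean() * 100`. The variable names are illustrative, and these counts do not yet treat the -9 and -99 placeholder codes as missing.

```python
import pandas as pd

# Counts copied from the df.info() output above (2025 rows in total)
total_rows = 2025
non_null = {
    'nperps': 1193, 'nperpcap': 1155, 'claimed': 1157, 'claimmode': 339,
    'nkill': 1958, 'nkillter': 1178, 'nwound': 1942, 'nwoundte': 1166,
    'propextent': 1490,
}

# Percentage of rows with no value, largest first
missing_pct = (1 - pd.Series(non_null) / total_rows) * 100
missing_pct = missing_pct.sort_values(ascending=False).round(1)
print(missing_pct)
```

By this measure *claimmode* is missing for most rows, and *propextent* is missing for roughly a quarter of them.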


In [48]:
import numpy as np

logical_cols = ['extended', 'vicinity', 'crit1', 'crit2', 'crit3',
                'suicide', 'individual', 'claimed']
categorical_cols = ['city', 'attacktype1', 'targtype1', 'claimmode',
                    'propextent']
ratio_cols = ['nperps', 'nperpcap', 'nkill', 'nkillter', 'nwound',
              'nwoundte']

# replace the "unknown" placeholder codes (-9, -99) with NaN
logical_replace = {col: {-9: np.nan} for col in ['claimed']}
ratio_replace = {col: {-99: np.nan, -9: np.nan} for col in ratio_cols}
# reassign instead of using inplace=True, since df was built by filtering
df = df.replace(to_replace=logical_replace)
df = df.replace(to_replace=ratio_replace)

# convert logical cols to bools

# normalize ratio cols

# one-hot encode categorical cols

{'nperps': {-99: nan, -9: nan}, 'nperpcap': {-99: nan, -9: nan}, 'nkill': {-99: nan, -9: nan}, 'nkillter': {-99: nan, -9: nan}, 'nwound': {-99: nan, -9: nan}, 'nwoundte': {-99: nan, -9: nan}}
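
The three placeholder comments in the cell above (convert logical columns, normalize ratio columns, one-hot encode categorical columns) are left unimplemented. The block below is one possible sketch on a small made-up frame. It assumes pandas' nullable `boolean` dtype for the logical columns and z-score standardization for "normalize", neither of which the lab specifies. The frame `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows using a few of the column names selected above
toy = pd.DataFrame({
    'extended': [0, 1, 0, 0],
    'claimed': [0.0, 1.0, np.nan, 0.0],
    'nkill': [0.0, 2.0, np.nan, 6.0],
    'targtype1': [1, 3, 14, 3],
})

# Logical columns: the nullable boolean dtype keeps missing values as <NA>
for col in ['extended', 'claimed']:
    toy[col] = toy[col].astype('boolean')

# Ratio columns: z-score standardization (assumed); NaNs are skipped by mean/std
for col in ['nkill']:
    toy[col] = (toy[col] - toy[col].mean()) / toy[col].std()

# Categorical columns: one indicator column per observed category
encoded = pd.get_dummies(toy, columns=['targtype1'])
print(encoded.columns.tolist())
```

One-hot encoding *city* in the same way would add one column per distinct city, so that column may need separate treatment.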


## References
Federal Grant Policy:
<a href="https://www.grants.gov/web/grants/learn-grants/grant-policies.html">A Short History of the Federal Grant Policy</a>