## Capstone Project 1

**Predicting the outcome of terrorist attacks**

## Data Cleaning

In [1]:
import pandas as pd
import numpy as np

First, I use pandas to read the data from two CSV files (I split the dataset into two files to stay under GitHub's 100 MB file size limit), selecting only the columns I am interested in. I use the missing-value codes from the Guidebook to convert coded missing values to NaN.

In [2]:
df = pd.read_csv('global_terrorism1.csv', encoding = 'latin1')
# Print each column's positional index next to its name
for i, col in enumerate(df.columns):
    print(i, col)

0 eventid
1 iyear
2 imonth
3 iday
4 approxdate
5 extended
6 resolution
7 country
8 country_txt
9 region
10 region_txt
11 provstate
12 city
13 latitude
14 longitude
15 specificity
16 vicinity
17 location
18 summary
19 crit1
20 crit2
21 crit3
22 doubtterr
23 alternative
24 alternative_txt
25 multiple
26 success
27 suicide
28 attacktype1
29 attacktype1_txt
30 attacktype2
31 attacktype2_txt
32 attacktype3
33 attacktype3_txt
34 targtype1
35 targtype1_txt
36 targsubtype1
37 targsubtype1_txt
38 corp1
39 target1
40 natlty1
41 natlty1_txt
42 targtype2
43 targtype2_txt
44 targsubtype2
45 targsubtype2_txt
46 corp2
47 target2
48 natlty2
49 natlty2_txt
50 targtype3
51 targtype3_txt
52 targsubtype3
53 targsubtype3_txt
54 corp3
55 target3
56 natlty3
57 natlty3_txt
58 gname
59 gsubname
60 gname2
61 gsubname2
62 gname3
63 gsubname3
64 motive
65 guncertain1
66 guncertain2
67 guncertain3
68 individual
69 nperps
70 nperpcap
71 claimed
72 claimmode
73 claimmode_txt
74 claim2
75 claimmode2
76 claimmode2_txt

In [3]:
# Positional indexes of the columns to keep
col_indexes = [1,2,9,13,14,19,20,21,26,27,28,30,32,34,42,50,68,81,85,89,93,98,101,104,105]

# Missing-value codes from the Guidebook, applied per column
missing_codes = {'attacktype1': 9, 'weaptype1': 13, 'targtype1': 20, 'property': -9, 'propextent': 4}

terrorism_df1 = pd.read_csv('global_terrorism1.csv', encoding = 'latin1', usecols = col_indexes,
                            na_values = missing_codes)

terrorism_df2 = pd.read_csv('global_terrorism2.csv', encoding = 'latin1', usecols = col_indexes,
                            na_values = missing_codes)

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two parts the same way
terrorism_df = pd.concat([terrorism_df1, terrorism_df2])
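
As a small illustration of how a per-column `na_values` dictionary behaves, the sketch below reads a made-up three-row CSV from memory. The data and the name `toy_df` are assumptions for the example only; the point is that a code is treated as missing only in the column it is listed for.

```python
import io
import pandas as pd

# Toy CSV standing in for the real files; the values are made up.
csv_text = """attacktype1,weaptype1,nkill
3,6,2
9,13,0
2,9,9
"""

# Each column gets its own missing-value code.
missing_codes = {'attacktype1': 9, 'weaptype1': 13}

toy_df = pd.read_csv(io.StringIO(csv_text), na_values=missing_codes)

# In the second row both codes match, so both cells become NaN.
# The 9s in the third row stay as numbers, because 9 is a missing
# code only for attacktype1.
print(toy_df)
print(toy_df.isna().sum())
```

Columns that pick up NaN values are read as floats, which is why the coded columns appear as `float64` in the `info()` output.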

In [4]:
terrorism_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 170350 entries, 0 to 59601
Data columns (total 25 columns):
iyear          170350 non-null int64
imonth         170350 non-null int64
region         170350 non-null int64
latitude       165744 non-null float64
longitude      165744 non-null float64
crit1          170350 non-null int64
crit2          170350 non-null int64
crit3          170350 non-null int64
success        170350 non-null int64
suicide        170350 non-null int64
attacktype1    163925 non-null float64
attacktype2    5630 non-null float64
attacktype3    374 non-null float64
targtype1      165477 non-null float64
targtype2      10018 non-null float64
targtype3      1034 non-null float64
individual     170350 non-null int64
weaptype1      156498 non-null float64
weaptype2      11843 non-null float64
weaptype3      1660 non-null float64
weaptype4      74 non-null float64
nkill          160668 non-null float64
nwound         155025 non-null float64
property       150771 non-

The dataframe has missing values in many columns, so I drop rows that are missing a value in certain columns. Note that a few features, such as attack type, are spread across multiple columns (since an attack can be classified as multiple types, involve multiple weapons, etc.). Because I convert all of these columns to dummy variables, I only need to filter on the first column: if 'attacktype1' is missing for a row, then 'attacktype2' and 'attacktype3' are also missing, and likewise for the other features. I also filter only on columns that I end up using in my analysis.
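
The claim that a missing first column implies missing secondary columns can be checked directly. The following is a minimal sketch on a made-up frame; `count_orphans` is a hypothetical helper, not part of the original notebook. A count of zero would indicate that filtering on the first column alone is sufficient.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; the column names match the dataset.
toy_df = pd.DataFrame({
    'attacktype1': [3.0, np.nan, 2.0, 1.0],
    'attacktype2': [2.0, np.nan, np.nan, np.nan],
    'attacktype3': [np.nan, np.nan, np.nan, np.nan],
})

def count_orphans(df, primary, secondary_cols):
    """Count rows where the primary column is missing but a secondary one is filled."""
    primary_missing = df[primary].isna()
    any_secondary = df[secondary_cols].notna().any(axis=1)
    return int((primary_missing & any_secondary).sum())

orphans = count_orphans(toy_df, 'attacktype1', ['attacktype2', 'attacktype3'])
print(orphans)
```

The same call could be run on `terrorism_df` for the weapon and target columns before dropping rows.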

In [5]:
#Remove rows with missing values
filter_cols = ['attacktype1', 'targtype1', 'weaptype1', 'nkill', 'latitude', 'longitude']

cleaned_df = terrorism_df.dropna(axis = 0, how = 'any', subset = filter_cols)
cleaned_df = cleaned_df.reset_index(drop=True)
cleaned_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140928 entries, 0 to 140927
Data columns (total 25 columns):
iyear          140928 non-null int64
imonth         140928 non-null int64
region         140928 non-null int64
latitude       140928 non-null float64
longitude      140928 non-null float64
crit1          140928 non-null int64
crit2          140928 non-null int64
crit3          140928 non-null int64
success        140928 non-null int64
suicide        140928 non-null int64
attacktype1    140928 non-null float64
attacktype2    4907 non-null float64
attacktype3    337 non-null float64
targtype1      140928 non-null float64
targtype2      8843 non-null float64
targtype3      890 non-null float64
individual     140928 non-null int64
weaptype1      140928 non-null float64
weaptype2      10587 non-null float64
weaptype3      1515 non-null float64
weaptype4      63 non-null float64
nkill          140928 non-null float64
nwound         135589 non-null float64
property       123859 non-n

Next, I convert the categorical, non-binary columns (region, attack type, weapon type, and target type) into dummy variables. For the features with multiple columns, such as attack type, the dummy column for a given attack type has a value of 1 if any of 'attacktype1', 'attacktype2', or 'attacktype3' holds the code for that type. For each feature, I leave out one value that I do not convert to a dummy column, as the last column in a set of dummy variables is redundant. Weapon type and target type have codes corresponding to "other", so I leave out that value; for region and attack type, I leave out the last code.

In [6]:
region_dict = {1: 'North America', 2: 'Central America', 3: 'South America', 4: 'East Asia', 5: 'SE Asia', 6: 'South Asia', 
7: 'Central Asia', 8: 'West Europe',9: 'East Europe',10: 'ME and North Africa', 
11: 'Sub-Saharan Africa', 12: 'Oceania'}

# Leave out the last region (12, Oceania) so the dummy set is not redundant
for i in range(1,12):
    cleaned_df[region_dict[i]] = (cleaned_df['region'] == i).astype(int)
    
type_dict = {1:'Assassination', 2:'Armed Assault',
            3:'Bombing', 4:'Hijacking', 5:'Hostage (Barricade)',
             6:'Kidnapping', 7:'Infrastructure', 8:'Unarmed Assault'}

# Leave out the last attack type (8, Unarmed Assault)
for i in range(1,8):
    cleaned_df[type_dict[i]] = ((cleaned_df['attacktype1'] == i) | (cleaned_df['attacktype2'] == i)
                                | (cleaned_df['attacktype3'] == i)).astype(int)
    
weapon_dict = {1:'Biological',2:'Chemical',3:'Radiological',
            4:'Nuclear',5:'Firearms',6:'Explosives',
            7:'Fake Weapon', 8:'Incendiary', 9:'Melee',
            10:'Vehicle',11:'Sabotage Equipment', 12:'Other'}

# Leave out 12 (Other)
for i in range(1,12):
    cleaned_df[weapon_dict[i]] = ((cleaned_df['weaptype1'] == i) | (cleaned_df['weaptype2'] == i) 
                                   | (cleaned_df['weaptype3'] == i) | (cleaned_df['weaptype4'] == i)).astype(int)
    
target_dict = {1:'Business', 2:'Government', 3:'Police',4:'Military',5:'Abortion',6:'Aviation',7:'Diplomatic',8:'Education',9:'FoodWater',10:'Media',
              11:'Maritime', 12:'NGO',13:'Other',14:'Private',15:'Religious',16:'Telecommunication',
              17:'Terrorists', 18:'Tourists', 19:'Transportation',21:'Utilities',22:'Violent Parties'}

# Leave out 13 (Other); 20 is the missing-value code for target type
target_codes = [i for i in range(1,23) if i != 13 and i != 20]

for i in target_codes:
    cleaned_df[target_dict[i]] = ((cleaned_df['targtype1'] == i) | (cleaned_df['targtype2'] == i)
                                | (cleaned_df['targtype3'] == i)).astype(int)
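
The multi-column encoding above can also be written more compactly. This is a sketch of an equivalent formulation on made-up data, not the notebook's own code; `add_dummies` and `type_names` are hypothetical names introduced for the example.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; the column names match the dataset.
toy_df = pd.DataFrame({
    'attacktype1': [3.0, 2.0, 1.0],
    'attacktype2': [2.0, np.nan, np.nan],
    'attacktype3': [np.nan, np.nan, np.nan],
})
type_names = {1: 'Assassination', 2: 'Armed Assault', 3: 'Bombing'}

def add_dummies(df, source_cols, names):
    """Add one 0/1 column per code, set when any source column equals that code."""
    for code, name in names.items():
        # NaN never equals a code, so missing secondary columns count as 0.
        df[name] = df[source_cols].eq(code).any(axis=1).astype(int)
    return df

encoded = add_dummies(toy_df, ['attacktype1', 'attacktype2', 'attacktype3'], type_names)
print(encoded[['Assassination', 'Armed Assault', 'Bombing']])
```

In the first toy row, both 'Armed Assault' and 'Bombing' are set to 1, since the attack carries two type codes.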

Finally, I export the cleaned dataframe to a new CSV file.

In [7]:
cleaned_df.to_csv('cleaned.csv')
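
One detail worth knowing when reading the file back: by default, `to_csv` also writes the row index as an unnamed first column. The sketch below shows the round trip on a made-up two-row frame, using an in-memory buffer instead of a file; `toy_df` and `restored` are names assumed for the example.

```python
import io
import pandas as pd

# Toy frame with made-up values.
toy_df = pd.DataFrame({'iyear': [1970, 1971], 'success': [1, 0]})

# Write to an in-memory buffer with the default settings (index included).
buffer = io.StringIO()
toy_df.to_csv(buffer)

# The header starts with an empty field: that is the unnamed index column.
buffer.seek(0)
header = buffer.readline().strip()
print(header)

# Passing index_col=0 restores the index instead of creating an extra column.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(list(restored.columns))
```

Alternatively, passing `index=False` to `to_csv` omits the index column from the file.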

## Exploratory Data Analysis