# Checkpoint Three: Cleaning Data

Now you are ready to clean your data. Before starting coding, provide the link to your dataset below.

My dataset:

Import the necessary libraries and create your dataframe(s).

In [2]:
# Import Libraries
import pandas as pd
import numpy as np

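# Dataframe Creation
df_gtd = pd.read_csv("gtd_uk_usa.csv")
# Note: despite the df_ prefix, this is a plain list of US presidential election years, not a DataFrame.
df_us_elections = [2000, 2004, 2008, 2012, 2016, 2020, 2024]
df_uk_elections = pd.read_csv("uk-gen_elections-voteshare-trunc.csv")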

In [6]:
# Reviewing basic dataframe information

df_gtd.info()
df_gtd.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1999 entries, 0 to 1998
Columns: 110 entries, eventid to related
dtypes: float64(42), int64(21), object(47)
memory usage: 1.7+ MB


(1999, 110)

In [7]:
df_gtd.head(6)

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country_txt,region_txt,provstate,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
0,200001010027,2000,1,1,,0,,United States,North America,Michigan,...,The four perpetrators were also indicted for t...,"""AG Hall Arsonist Sentenced to 21 Years in Pri...","Art Bukowski, ""Activists planned MSU arson plo...","Justice Department Documents and Publications,...",Eco Project 2010,0,1,0,1,
1,200001030007,2000,1,3,,0,,United States,North America,California,...,,"""Animal Liberation Front says they attack Cali...","FBI ""Terrorism in the United States 2000/2001""...",,CETIS,0,1,0,1,
2,200001030008,2000,1,3,,0,,United States,North America,Ohio,...,,"""Abortion Clinics Evacuated After Threats of A...","Patricia Baird-Windle and Eleanor J. Bader, ""T...",,Anti-Abortion Project 2010,-9,-9,0,-9,"200001030008, 200001030009"
3,200001030009,2000,1,3,,0,,United States,North America,Ohio,...,This is part of a multiple attack with 2000010...,"""Abortion Clinics Evacuated After Threats of A...","Patricia Baird-Windle and Eleanor J. Bader, ""T...",,Anti-Abortion Project 2010,-9,-9,0,-9,"200001030008, 200001030009"
4,200001070003,2000,1,7,,0,,United Kingdom,Western Europe,Northern Ireland,...,"Denver Smith, 32, had been murdered at the sam...","“Blast in N. Ireland Town; None Injured,” Pres...",,,CETIS,-9,-9,1,1,
5,200001100001,2000,1,10,,0,,United Kingdom,Western Europe,Northern Ireland,...,,"David Sharrock, “Ulster Terrorist Leader Murde...","Chris Anderson, “Ulster Police Expect Loyalist...","Deric Henderson, “UK: LVF Blamed After Rival L...",CETIS,0,0,1,1,


In [4]:
df_uk_elections.info()
df_uk_elections.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   party_name  90 non-null     object 
 1   alignment   90 non-null     object 
 2   vote_share  78 non-null     float64
 3   vote_year   90 non-null     int64  
dtypes: float64(1), int64(1), object(2)
memory usage: 2.9+ KB


(90, 4)

In [8]:
df_uk_elections.head(6)

Unnamed: 0,party_name,alignment,vote_share,vote_year
0,Labour,centre-left,33.7,2024
1,Conservative,centre-right,23.704,2024
2,Reform UK,right-populism,14.293,2024
3,Liberal Democrat,centre-left,12.215,2024
4,Green Party,left,6.398,2024
5,Scottish National Party,centre-left,2.516,2024


<div class="alert alert-block alert-info">
Because I parsed and built the UK elections dataset myself, cleaning it as I went, my focus throughout this checkpoint will be on the GTD dataset.
</div>

## Missing Data

Test your dataset for missing data and handle it as needed. Make notes in the form of code comments as to your thought process.

In [None]:
# Counting null values per column.
# .pipe() with a lambda keeps only the columns that have at least one null; the result is sorted in descending order.

df_gtd.isna().sum().pipe(lambda s: s[s > 0]).sort_values(ascending=False)

ransomamtus         1999
kidhijcountry       1999
divert              1999
weapsubtype4_txt    1999
weaptype4_txt       1999
                    ... 
nkillter               4
natlty1_txt            3
nkillus                2
target1                1
ishostkid              1
Length: 75, dtype: int64

<div class="alert alert-block alert-info">
Since the dataset has 1999 entries, any column with 1999 nulls is entirely empty. For an analysis limited to these two countries (USA and UK), columns that hold nothing but nulls, such as ransom amount, kidnapping/hijacking country, and diversion [from hijacking/kidnapping], are neither interesting nor valuable. As such, we can delete these columns.
</div>

In [23]:
# Using the same logic to identify those columns that are entirely null values
# (null count equal to the number of rows, 1999 here):

df_gtd.isna().sum().pipe(lambda s: s[s == len(df_gtd)]).sort_values(ascending=False)

gname3              1999
gsubname3           1999
guncertain3         1999
claim3              1999
claimmode3          1999
claimmode3_txt      1999
weaptype4_txt       1999
weapsubtype4_txt    1999
divert              1999
kidhijcountry       1999
ransomamt           1999
ransomamtus         1999
ransompaid          1999
ransompaidus        1999
dtype: int64

<div class="alert alert-block alert-info">
The columns listed above consist entirely of null values. This shows that among the incidents in the United States and United Kingdom between 2000 and 2020, none had a third perpetrator group identified (so there is no third group name, uncertainty flag, or claim of responsibility and claim mode), a fourth weapon type used, etc. Following the comment above, we can delete these columns as well.
</div>
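
<div class="alert alert-block alert-info">
A minimal sketch of the same check that does not depend on knowing the row count in advance. The <code>sample</code> frame is a hypothetical stand-in for <code>df_gtd</code> (only the column names are borrowed from it); <code>isna().all()</code> and <code>dropna(how="all")</code> are standard pandas calls.
</div>

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_gtd; the values are made up for illustration.
sample = pd.DataFrame({
    "eventid": [1, 2, 3],
    "nkill": [0.0, np.nan, 1.0],
    "gname3": [np.nan, np.nan, np.nan],
    "divert": [np.nan, np.nan, np.nan],
})

# isna().all() is True only for columns in which every row is null,
# so the check works regardless of how many rows the frame has.
all_null_cols = sample.columns[sample.isna().all()].tolist()
print(all_null_cols)

# Equivalent one-step drop: remove every column whose values are all null.
cleaned = sample.dropna(axis="columns", how="all")
print(cleaned.columns.tolist())
```

Listing the columns first, as the notebook does, keeps a visible record of what was removed; the one-step drop is shorter but silent.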

In [25]:
# Creating a list of columns to drop for ease:

cols_to_drop = [
    "gname3",
    "gsubname3",
    "guncertain3",
    "claim3",
    "claimmode3",
    "claimmode3_txt",
    "weaptype4_txt",
    "weapsubtype4_txt",
    "kidhijcountry",
    "divert",
    "ransomamt",
    "ransomamtus",
    "ransompaid",
    "ransompaidus"
]

# Dropping the columns
df_gtd = df_gtd.drop(columns=cols_to_drop)

In [26]:
# Confirming dropped columns (no column should have as many nulls as there are rows):

df_gtd.isna().sum().pipe(lambda s: s[s == len(df_gtd)]).sort_values(ascending=False)

Series([], dtype: int64)

In [27]:
# Counting null values per column again, now that the all-null columns are gone.
# .pipe() with a lambda keeps only the columns that have at least one null; the result is sorted in descending order.

df_gtd.isna().sum().pipe(lambda s: s[s > 0]).sort_values(ascending=False)

attacktype3        1998
attacktype3_txt    1998
ndays              1997
claimmode2_txt     1994
claimmode2         1994
                   ... 
nwound                4
natlty1_txt           3
nkillus               2
target1               1
ishostkid             1
Length: 61, dtype: int64

In [11]:
df_gtd.isna().sum()

eventid          0
iyear            0
imonth           0
iday             0
approxdate    1955
              ... 
INT_LOG          0
INT_IDEO         0
INT_MISC         0
INT_ANY          0
related       1650
Length: 110, dtype: int64

<div class="alert alert-block alert-info">
From the EDA performed in Checkpoint 2, we also know that 'approxdate' and 'related' have null values. Before deciding what to do, it is important to understand why 'approxdate' is null and why rows with a null there should not be removed. The 'approxdate' column is filled in only when the exact date of an incident is unknown; when the exact date is known, 'approxdate' is null. With 1955 nulls among the 1999 entries, 44 incidents have only an approximate date. These 44 rows could be removed, but our analysis uses incident year rather than specific dates, so there is no harm in keeping them. Instead, we can drop the column itself, since the rows with approximate dates still have values in the corresponding fields ('iyear', 'imonth').
</div>
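
<div class="alert alert-block alert-info">
The reasoning above can be checked before dropping the column. This is a sketch on a hypothetical <code>sample</code> frame; the 'approxdate' strings are invented, and treating 0 in 'imonth' as "month unknown" is an assumption that should be confirmed against the codebook.
</div>

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for df_gtd; the approxdate strings are made up.
sample = pd.DataFrame({
    "iyear": [2000, 2001, 2005, 2010],
    "imonth": [1, 9, 0, 6],
    "approxdate": [np.nan, np.nan, "2005", "June 2010"],
})

# Rows where only an approximate date was recorded.
approx_rows = sample[sample["approxdate"].notna()]
n_approx = len(approx_rows)

# The column is safe to drop for a year-level analysis if every such row has a year.
year_always_present = bool(approx_rows["iyear"].notna().all())

# Assumption: 0 is the placeholder for an unknown month.
n_unknown_month = int((approx_rows["imonth"] == 0).sum())

print(n_approx, year_always_present, n_unknown_month)
```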

In [None]:
# Dropping the column 'approxdate':

df_gtd = df_gtd.drop(columns=["approxdate"])

In [29]:
# Sanity check:
"approxdate" in df_gtd

False

## Irregular Data

Detect outliers in your dataset and handle them as needed. Use code comments to make notes about your thought process.

In [None]:
# Overview of the dataset using .describe()

df_gtd.describe()

Unnamed: 0,eventid,iyear,imonth,iday,extended,specificity,vicinity,crit1,crit2,crit3,...,nhostkidus,nhours,ndays,ransom,hostkidoutcome,nreleased,INT_LOG,INT_IDEO,INT_MISC,INT_ANY
count,1999.0,1999.0,1999.0,1999.0,1999.0,1999.0,1999.0,1999.0,1999.0,1999.0,...,34.0,33.0,2.0,115.0,34.0,34.0,1999.0,1999.0,1999.0,1999.0
mean,201256800000.0,2012.502751,6.350175,15.136568,0.004002,1.013507,0.036018,0.96048,0.993997,0.996998,...,-1.352941,-56.121212,-49.5,-0.156522,3.235294,-4.735294,-6.784892,-6.717359,0.592296,-2.397699
std,635381500.0,6.351867,3.28759,8.787154,0.06315,0.168363,0.186382,0.194877,0.077265,0.054717,...,36.628257,50.736918,70.003571,1.181668,1.670761,48.779224,3.87773,4.002083,0.535395,4.722257
min,200001000000.0,2000.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,-99.0,-99.0,-99.0,-9.0,2.0,-99.0,-9.0,-9.0,-9.0,-9.0
25%,200808200000.0,2008.0,4.0,8.0,0.0,1.0,0.0,1.0,1.0,1.0,...,0.0,-99.0,-74.25,0.0,2.0,0.25,-9.0,-9.0,0.0,-9.0
50%,201501100000.0,2015.0,6.0,15.0,0.0,1.0,0.0,1.0,1.0,1.0,...,0.0,-99.0,-49.5,0.0,2.0,1.0,-9.0,-9.0,1.0,1.0
75%,201801300000.0,2018.0,9.0,23.0,0.0,1.0,0.0,1.0,1.0,1.0,...,1.0,1.0,-24.75,0.0,4.0,1.0,-9.0,-9.0,1.0,1.0
max,202012300000.0,2020.0,12.0,31.0,1.0,4.0,1.0,1.0,1.0,1.0,...,86.0,5.0,0.0,0.0,7.0,200.0,0.0,1.0,1.0,1.0


<div class="alert alert-block alert-info">
The above is not particularly helpful: there are too many columns, and most of the numerical values are codes for a text value (see codebook) or yes/no flags. Let's narrow down to a handful of columns that do have possible quantitative significance. Numerical variable columns can also be identified by searching for 'Numerical Variable' within the codebook.
</div>
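
<div class="alert alert-block alert-info">
Several columns in the summary above have a minimum of -99 (or -9 for the flag-style columns), which cannot be real counts. Assuming these are the codebook's placeholders for "unknown", they pull means and quartiles downward. Below is a sketch, on a hypothetical <code>sample</code> frame, of masking -99 in count columns before summarizing; the constant <code>UNKNOWN</code> and the column list are illustrative choices, not part of the original cleaning.
</div>

```python
import pandas as pd

# Hypothetical values; only the column names come from the GTD data.
sample = pd.DataFrame({
    "nperps": [-99, 1, 4, -99, 2],
    "nperpcap": [0, -99, 1, 0, 2],
    "nkill": [0, 0, 3, 1, 0],
})

UNKNOWN = -99  # assumed "unknown" code; confirm in the codebook
count_cols = ["nperps", "nperpcap"]

# Work on a copy so the raw codes stay available in the original frame.
masked = sample.copy()
# mask() replaces values with NaN wherever the condition is True.
masked[count_cols] = sample[count_cols].mask(sample[count_cols] == UNKNOWN)

print(sample[count_cols].mean())  # distorted by the -99 codes
print(masked[count_cols].mean())  # computed over known values only
```

Masking on a copy, rather than overwriting the codes, keeps the distinction between "unknown" and "not recorded" available for later.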

In [37]:
df_gtd[["nperps","nperpcap","nkill","nkillus","nkillter","nwound","nwoundus","nwoundte"]].describe()

Unnamed: 0,nperps,nperpcap,nkill,nkillus,nkillter,nwound,nwoundus,nwoundte
count,1790.0,1972.0,1999.0,1997.0,1995.0,1995.0,1987.0,1992.0
mean,-61.523464,-2.736308,1.78089,1.652479,0.039098,12.571429,0.385506,0.01757
std,48.649305,17.219225,44.056259,42.75975,0.284108,345.003475,3.990971,0.174152
min,-99.0,-99.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,-99.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,-99.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,15.0,11.0,1385.0,1361.0,5.0,10878.0,151.0,4.0


<div class="alert alert-block alert-info">
There are outliers in this dataset when it comes to the number killed or wounded, particularly from mass-casualty incidents such as September 11, 2001. However, removing these outliers would remove important data points and distort the broader picture of terrorism incidents over time. As such, they will not be removed.
</div>
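
<div class="alert alert-block alert-info">
Keeping the outliers does not rule out labeling them. A sketch on a hypothetical <code>sample</code> frame: list the largest incidents and add a boolean flag. The event IDs are placeholders and <code>MASS_CASUALTY_THRESHOLD</code> is an arbitrary illustrative cutoff. A fixed cutoff is used because the 75th percentile of 'nkill' is 0 in the summary above, so a quartile-based rule would flag every incident with at least one death.
</div>

```python
import pandas as pd

# Hypothetical incidents; eventid values are placeholders, not real GTD IDs.
sample = pd.DataFrame({
    "eventid": [1, 2, 3, 4, 5, 6],
    "iyear": [2000, 2001, 2005, 2013, 2017, 2019],
    "nkill": [0, 1385, 1, 2, 0, 0],
})

MASS_CASUALTY_THRESHOLD = 100  # illustrative cutoff, an assumption

# List the incidents with the highest death counts for manual review.
top = sample.nlargest(2, "nkill")[["eventid", "iyear", "nkill"]]
print(top)

# Flag, rather than remove, the extreme rows so later plots can treat them separately.
sample["mass_casualty"] = sample["nkill"] >= MASS_CASUALTY_THRESHOLD
print(sample["mass_casualty"].sum())
```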

## Unnecessary Data

Look for the different types of unnecessary data in your dataset and address it as needed. Make sure to use code comments to illustrate your thought process.

<div class="alert alert-block alert-info">
As noted in Checkpoint 2:
<div class="alert alert-block alert-success">
<i>Prior to this EDA, I pre-emptively removed columns from the original dataset in order to reduce the size of the .csv. This included limiting the timeframe from 2000 to 2020, excluding the data from 1970 to 1999. This early cleaning also included removing the columns with numerical values that also had corresponding text values (e.g. 'country'; 'country_txt') or those with interesting qualitative information but not useful in a quantitative analysis without further coding (e.g. 'summary'). Some of these fields were left to add additional context for future reviewers (e.g. source citations). Other fields, also with potentially valuable information but outside the scope of this project, were also removed, such as latitude/longitude.</i>
</div>
<br>
As such, much of the unnecessary data was already removed before the EDA began, and more was removed earlier in Checkpoint 3.
</div>

In [None]:
# Reviewing the current dataframe:

df_gtd.head(6)

Unnamed: 0,eventid,iyear,imonth,iday,extended,resolution,country_txt,region_txt,provstate,city,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
0,200001010027,2000,1,1,0,,United States,North America,Michigan,Mesick,...,The four perpetrators were also indicted for t...,"""AG Hall Arsonist Sentenced to 21 Years in Pri...","Art Bukowski, ""Activists planned MSU arson plo...","Justice Department Documents and Publications,...",Eco Project 2010,0,1,0,1,
1,200001030007,2000,1,3,0,,United States,North America,California,Petaluma,...,,"""Animal Liberation Front says they attack Cali...","FBI ""Terrorism in the United States 2000/2001""...",,CETIS,0,1,0,1,
2,200001030008,2000,1,3,0,,United States,North America,Ohio,Cincinnati,...,,"""Abortion Clinics Evacuated After Threats of A...","Patricia Baird-Windle and Eleanor J. Bader, ""T...",,Anti-Abortion Project 2010,-9,-9,0,-9,"200001030008, 200001030009"
3,200001030009,2000,1,3,0,,United States,North America,Ohio,Cincinnati,...,This is part of a multiple attack with 2000010...,"""Abortion Clinics Evacuated After Threats of A...","Patricia Baird-Windle and Eleanor J. Bader, ""T...",,Anti-Abortion Project 2010,-9,-9,0,-9,"200001030008, 200001030009"
4,200001070003,2000,1,7,0,,United Kingdom,Western Europe,Northern Ireland,Antrim,...,"Denver Smith, 32, had been murdered at the sam...","“Blast in N. Ireland Town; None Injured,” Pres...",,,CETIS,-9,-9,1,1,
5,200001100001,2000,1,10,0,,United Kingdom,Western Europe,Northern Ireland,Portadown,...,,"David Sharrock, “Ulster Terrorist Leader Murde...","Chris Anderson, “Ulster Police Expect Loyalist...","Deric Henderson, “UK: LVF Blamed After Rival L...",CETIS,0,0,1,1,


In [40]:
# Because we are limited to the United States and United Kingdom with country_txt, the region_txt does not add any value. This can be dropped.
# Dropping 'region_txt' column:

df_gtd = df_gtd.drop(columns=["region_txt"])
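
<div class="alert alert-block alert-info">
A sketch of how the redundancy could be confirmed before dropping: if every country maps to exactly one region, 'region_txt' adds no information beyond 'country_txt'. The <code>sample</code> frame is a hypothetical stand-in for <code>df_gtd</code>.
</div>

```python
import pandas as pd

# Hypothetical stand-in for df_gtd with the two columns of interest.
sample = pd.DataFrame({
    "country_txt": ["United States", "United States", "United Kingdom", "United Kingdom"],
    "region_txt": ["North America", "North America", "Western Europe", "Western Europe"],
})

# Number of distinct regions recorded for each country.
regions_per_country = sample.groupby("country_txt")["region_txt"].nunique()
print(regions_per_country)

# The region column is redundant only if each country has exactly one region.
redundant = bool((regions_per_country == 1).all())
if redundant:
    sample = sample.drop(columns=["region_txt"])
```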

## Inconsistent Data

Check for inconsistent data and address any that arises. As always, use code comments to illustrate your thought process.

In [41]:
df_gtd.head(51)

Unnamed: 0,eventid,iyear,imonth,iday,extended,resolution,country_txt,provstate,city,specificity,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
0,200001010027,2000,1,1,0,,United States,Michigan,Mesick,1.0,...,The four perpetrators were also indicted for t...,"""AG Hall Arsonist Sentenced to 21 Years in Pri...","Art Bukowski, ""Activists planned MSU arson plo...","Justice Department Documents and Publications,...",Eco Project 2010,0,1,0,1,
1,200001030007,2000,1,3,0,,United States,California,Petaluma,1.0,...,,"""Animal Liberation Front says they attack Cali...","FBI ""Terrorism in the United States 2000/2001""...",,CETIS,0,1,0,1,
2,200001030008,2000,1,3,0,,United States,Ohio,Cincinnati,1.0,...,,"""Abortion Clinics Evacuated After Threats of A...","Patricia Baird-Windle and Eleanor J. Bader, ""T...",,Anti-Abortion Project 2010,-9,-9,0,-9,"200001030008, 200001030009"
3,200001030009,2000,1,3,0,,United States,Ohio,Cincinnati,1.0,...,This is part of a multiple attack with 2000010...,"""Abortion Clinics Evacuated After Threats of A...","Patricia Baird-Windle and Eleanor J. Bader, ""T...",,Anti-Abortion Project 2010,-9,-9,0,-9,"200001030008, 200001030009"
4,200001070003,2000,1,7,0,,United Kingdom,Northern Ireland,Antrim,1.0,...,"Denver Smith, 32, had been murdered at the sam...","“Blast in N. Ireland Town; None Injured,” Pres...",,,CETIS,-9,-9,1,1,
5,200001100001,2000,1,10,0,,United Kingdom,Northern Ireland,Portadown,1.0,...,,"David Sharrock, “Ulster Terrorist Leader Murde...","Chris Anderson, “Ulster Police Expect Loyalist...","Deric Henderson, “UK: LVF Blamed After Rival L...",CETIS,0,0,1,1,
6,200001150004,2000,1,15,0,,United States,California,Petaluma,1.0,...,,"FBI, ""Terrorism 2000/2001,""FBI, DOJ, 2001.","Suzanne Bohan, ""Some cage-free egg producers f...","Debra J. Saunders, ""Save the Chickens,"" San Fr...",Eco Project 2010,0,1,0,1,
7,200001200002,2000,1,20,0,,United States,Indiana,Bloomington,1.0,...,Homeowners and the Indiana Crime and Arson ass...,"Associated Press. ""Bloomington house fire rewa...","Associated Press ""Local ELF group feels backla...",,CETIS,0,1,0,1,
8,200001240007,2000,1,24,0,,United States,California,Redwood City,1.0,...,,"James W. Sweeney, ""Rights Group Takes Credit f...","""Animal Liberation Front Says They Attacked Ca...","Pamela J. Podger, ""Animal-Rights Cell Says It ...",Eco Project 2010,0,1,0,1,
9,200002060001,2000,2,6,0,,United Kingdom,Northern Ireland,Irvinestown,1.0,...,,"“Bomb Explodes at Northern Ireland Hotel,” Lon...","“IRA Dissidents Suspected in Bombing,” Associa...","Ian Graham, “Ulster Peace in Crisis After Hote...",CETIS,0,0,1,1,


<div class="alert alert-block alert-info">
Given that this is an academic dataset maintained by a research university with documented criteria for how incidents are included, the data appears consistent and there are no concerns at this time for inconsistencies.
</div>
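
<div class="alert alert-block alert-info">
A few quick checks could back up this assessment. The sketch below runs on a hypothetical <code>sample</code> frame seeded with deliberate problems (a repeated ID, a trailing space, a casing difference) to show what each check would catch; none of these problems are known to exist in the real data.
</div>

```python
import pandas as pd

# Hypothetical rows with deliberate inconsistencies for illustration.
sample = pd.DataFrame({
    "eventid": [1, 2, 2, 3],
    "country_txt": ["United States", "United States ", "United States", "United Kingdom"],
    "provstate": ["Michigan", "California", "california", "Northern Ireland"],
})

# 1. Each incident should appear once: count repeated event IDs.
dup_ids = int(sample["eventid"].duplicated().sum())

# 2. Stray whitespace inflates the number of distinct labels.
raw_countries = sample["country_txt"].nunique()
clean_countries = sample["country_txt"].str.strip().nunique()

# 3. Spellings that differ only by case or whitespace share one normalized key.
key = sample["provstate"].str.strip().str.casefold()
variants = sample.groupby(key)["provstate"].nunique()
inconsistent = variants[variants > 1]

print(dup_ids, raw_countries, clean_countries)
print(inconsistent)
```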

### Finalizing Dataset(s)

In [42]:
# Converting to .csv for continued use:

df_gtd.to_csv("gtd_uk_usa_v2.csv", index=False)

## Summarize Your Results

Make note of your answers to the following questions.

1. Did you find all four types of dirty data in your dataset?

<div class="alert alert-block alert-info">
I did not necessarily find all four types of "dirty" data within the dataset. Much of the cleaning conducted during Checkpoint 3 related to removing superfluous or unnecessary data as it pertains to the analysis I intend to conduct. Null values were not inherently missing data but more often fields not applicable to an incident. Further, outliers are not inherently problematic when understanding this type of data as the outliers in and of themselves are important to note, recognize, and even research more; this is to say, outliers in this dataset are not problematic entries to be removed but interesting data points.
</div>

2. Did the process of cleaning your data give you new insights into your dataset?

<div class="alert alert-block alert-info">
This process helped further reduce the size of my dataset by removing unnecessary fields. It also led me to reference the codebook for the dataset once more and look more closely at the numerical variables, so I now have additional ideas for relationships to explore and visualize.
</div>

3. Is there anything you would like to make note of when it comes to manipulating the data and making visualizations?

<div class="alert alert-block alert-info">
It may be interesting to identify the more impactful and "known" terrorism incidents and demarcate them on my visualizations. This could also add further context to spikes in activity and/or those killed/injured.
</div>