# What do terrorists really want?
Google Colab [1/4] - https://colab.research.google.com/drive/1599wFpPYiRBZvVURqFiYF0_ivEbWPZZo?usp=sharing

In [1]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import pickle
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
import gensim
from gensim.utils import simple_preprocess
import string
import nltk
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *

np.random.seed(2018)

# Visualisation libraries
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

# Map stuff
# import folium
# from folium.plugins import MarkerCluster

# Misc
# from collections import Counter
# sns.set()
# %matplotlib inline

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

ModuleNotFoundError: No module named 'pandas'

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

Of all the fields given to us in the dataset, we would like to take a deeper look at 2 relevant text fields: "summary" and "motive".

According to the GTD Codebook, the **"summary"** column gives a very brief narrative summary of the incident, following the 5W1H (Who, What, When, Where, Why, How) guidelines.

The **"motive"** field, on the other hand, records the motive when it is explicitly mentioned in the official reports. It may also contain general information about the political, social, or economic situation at the time of the attack if the researchers determined that it had an impact on the motivation of the incident.

Additionally, note that these 2 fields were only implemented starting from 1998. Hence, in order to run our analysis on more useful data, we will be taking only the events from 1998 onwards.

In [None]:
# GTD Dataset from 1970 - 2019
df_all = pd.read_excel("/content/drive/MyDrive/NTU/Sem 1.2/SC1015/SC1015 Mini-Project/globalterrorismdb_0221dist.xlsx")
file_name = 'df_all'
df_all.to_pickle(file_name)  # where to save it, usually as a .pkl
# df.info

NameError: name 'pd' is not defined

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
df = pd.read_pickle(file_name)

# Drop rows that are before the year 1998
df.drop(df[df['iyear'] < 1998].index, inplace=True)

# Keep only the two text columns we need (the positional axis argument is not accepted by pandas >= 2.0)
df.drop(columns=df.columns.difference(['summary', 'motive']), inplace=True)

In [None]:
# Export the data to an Excel file
df.to_excel('filename1.xlsx', sheet_name = 'New_Sheet')

In [None]:
# Drop rows with NAN for motive
df.dropna(subset = ['motive'], inplace=True)
df = df[df["motive"].str.contains("Unknown")==False]
df = df[df['motive'] != 'The specific motive for the attack is unknown.']
# df.info

From some research beforehand, we understand that we need NLP to help us answer this question. We think that being able to categorise the words used in the "summary" and "motive" fields will help us to better understand the motives of the terrorists.

The methods that we could try include:
*   **Topic Modelling**<br>
A mix of art and science that identifies and quantifies the mix of topics within a document. Topic Modelling is used when we have a set of text documents and would like to find out which topics they cover, and then group them by these topics. Two analyses that we could try here:
  *   Latent Semantic Analysis (LSA)<br>
  Like Naive Bayes, this is based on the word frequencies of the dataset. The general idea is that the algorithm counts the frequency of each word and groups together words that tend to appear in the same documents.
  *   Latent Dirichlet Allocation (LDA)<br>
  This has a fixed number of topics, and each topic is represented by a set of words. LDA tries to map all the words to the topics (working backwards from the documents to the topics that could have generated them).

*   **Topic Classification**<br>
Note that this ML approach is supervised, whereas Topic Modelling is unsupervised.
  *   Empirical Topic Classification (ETC)<br>
This combines NLP methods with human intuition to identify the word features that result in a more meaningful identification of the terrorists' motive categories.

To give some insight, some of the perpetrators' motives could include retaliation, intimidation, causing instability, or raising contempt for a certain community.

References:<br>
https://www.mdpi.com/2076-0760/11/1/23#<br>
https://monkeylearn.com/topic-analysis/

# Step 1: Dealing with Motive first

This process would involve:
<!-- ![picture](https://drive.google.com/uc?export=view&id=1r-rjElUan2DorRG2gUNEGuKzZg399PTi) -->

References:<br>
https://towardsdatascience.com/text-classification-supervised-unsupervised-learning-approaches-9fd5e01a036<br>
pyLDAvis: https://towardsdatascience.com/topic-modelling-in-python-with-nltk-and-gensim-4ef03213cd21


## Data Pre-processing

*   Change the text into lowercase
*   Remove all punctuation
*   Tokenization<br>
Split the text into sentences and the sentences into words.
*   All stopwords are removed.
*   Words are lemmatized<br>
Words in the third person are changed to the first person, and verbs in the past and future tenses are changed to the present.
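
As a rough illustration of the first four steps, here is a minimal standard-library sketch on a single made-up sentence. The tiny `TOY_STOPWORDS` set and the `preprocess` helper are assumptions made for this example only (the notebook itself uses NLTK's English stopword list), and lemmatization is left out because it needs the WordNet data.

```python
import re
import string

# Hypothetical mini stopword list, standing in for NLTK's English list
TOY_STOPWORDS = {"the", "a", "an", "in", "to", "of", "was", "for", "on"}

def preprocess(text, stopwords=TOY_STOPWORDS):
    """Lowercase, strip punctuation and digits, tokenise, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d+", "", text)
    tokens = text.split()
    return [tok for tok in tokens if tok not in stopwords]

tokens = preprocess("The attack was carried out in retaliation for the 2 arrests.")
print(tokens)
```

Each row ends up as a list of tokens, which is the shape the dictionary and bag-of-words steps further down expect.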



In [None]:
# Converting all of the strings into lowercase
df['motive'] = df['motive'].str.lower()

# Removing punctuation (regex=True is needed for pattern matching in pandas >= 2.0)
df['motive'] = df['motive'].str.replace('[{}]'.format(string.punctuation), '', regex=True)

# Removing digits
df['motive'] = df['motive'].str.replace(r'\d+', '', regex=True)

df['motive'].head()

In [None]:
# Removing stopwords
stop = stopwords.words('english')
newStopWords = ['unknown','however','specific','motive','sources','noted','reported','related','victim',\
                'attack','targeted','speculated','claimed','responsibility','incident','stated','scheduled',\
                'larger','part','suspected','carried','accused','may','january','february','march','april','june',\
                'july','august','september','october','november','december','people','believed','meant','considered']
stop.extend(newStopWords)
df['motive'] = df['motive'].apply(lambda x: [item for item in x.split() if item not in stop])
df['motive'].head()

In [None]:
# Tokenization — splitting the text into sentences and sentences into words
# df['motive'].apply(word_tokenize)

# df['motive'].head

In [None]:
# Initialise the WordNet Lemmatizer once and reuse it for every row
lemmatizer = WordNetLemmatizer()

# Lemmatize each word - group together different forms of the same word (e.g. "cars" -> "car")
# Note: lemmatize() treats every word as a noun by default, so verb forms are only reduced if pos='v' is passed
def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in text]
df['motive'] = df['motive'].apply(lemmatize_text)
print(df['motive'].head())

In [None]:
# Splitting the Dataset - Wouldn't need to split dataset for this unsupervised learning model
# X_train, X_test = train_test_split(df['motive'], test_size=0.2, random_state=1)

In [None]:
# Export the preprocessed data to an Excel file
df.to_excel('filename.xlsx', sheet_name = 'New_sheet')

## Bag of Words on the Dataset
A popular technique for developing sentiment analysis models is to use a bag-of-words model that transforms documents into vectors where each word in the document is assigned a score.
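
To make the idea concrete, here is a minimal standard-library sketch of what a bag of words looks like for two made-up documents. The `token2id` mapping and `to_bow` helper are illustrative assumptions, not gensim's implementation, and gensim may assign ids in a different order, but the output has the same shape as `doc2bow`: a list of (word id, count) pairs per document.

```python
from collections import Counter

docs = [
    ["retaliation", "arrest", "leader"],
    ["election", "protest", "election"],
]

# Build the vocabulary: every distinct token gets an integer id
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one tokenised document."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

bow = [to_bow(doc) for doc in docs]
print(bow)
```

Word order is discarded entirely: only which words occur, and how often, survives.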

In [None]:
# Create a dictionary from df['motive'] that maps each unique word to an integer id
dictionary = gensim.corpora.Dictionary(df['motive'])

# Preview the first 10 (id, word) pairs
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count >= 10:
        break

In [None]:
# For each row, build a bag of words: a list of (word id, number of times the word appears) pairs
bagofwords = [dictionary.doc2bow(doc) for doc in df['motive']]

In [None]:
# Preview the bag of words for the row at index 10
bagofwords[10]

## Running LDA using BoW

In [None]:
# Train our LDA model with the gensim models LDAMulticore
# num_topics: the number of requested latent topics to be extracted from the training corpus. We will stick to 6 for now.
lda_bow = gensim.models.LdaMulticore(bagofwords, num_topics=6, id2word=dictionary, workers=3)

for idx, topic in lda_bow.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

In [None]:
# Importing pyLDAvis module for visualisation
# !pip install pyLDAvis
import pyLDAvis
import pyLDAvis.gensim_models

lda_display = pyLDAvis.gensim_models.prepare(lda_bow, bagofwords, dictionary, sort_topics=True)
pyLDAvis.display(lda_display)

In [None]:
# save the model to disk
filename = 'lda_bow.sav'
pickle.dump(lda_bow, open(filename, 'wb'))

## Running LDA using TFIDF
"In information retrieval, tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling."<br>

[Taken from Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
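
As a worked example of the definition above, here is a minimal sketch of the textbook weighting, term frequency times log2(N / document frequency), on three made-up documents. The `tfidf` helper is an assumption for illustration only; gensim's `TfidfModel` may differ in details such as normalisation.

```python
import math
from collections import Counter

docs = [
    ["retaliation", "arrest", "leader"],
    ["election", "protest", "election"],
    ["election", "retaliation"],
]

n_docs = len(docs)
# Document frequency: in how many documents each token occurs
doc_freq = Counter(token for doc in docs for token in set(doc))

def tfidf(doc):
    """Weight each token by raw term frequency times log2(N / document frequency)."""
    term_freq = Counter(doc)
    return {tok: tf * math.log2(n_docs / doc_freq[tok]) for tok, tf in term_freq.items()}

weights = tfidf(docs[1])
print(weights)
```

Although "election" appears twice in the second document, it is also common across the corpus, so the rarer word "protest" ends up with the higher weight.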

In [None]:
# Creating the tf-idf model
from gensim import corpora, models
from pprint import pprint

tfidf = models.TfidfModel(bagofwords)
corpus_tfidf = tfidf[bagofwords]

In [None]:
lda_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=6, id2word=dictionary, workers=3)
for idx, topic in lda_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))

In [None]:
lda_display = pyLDAvis.gensim_models.prepare(lda_tfidf, corpus_tfidf, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

In [None]:
# save the model to disk
filename = 'lda_tfidf.sav'
pickle.dump(lda_tfidf, open(filename, 'wb'))

BoW and TF-IDF are based purely on the frequency of the words found. One drawback of this approach is that it does not capture context. This is where word embedding techniques like Word2Vec, with its Continuous Bag of Words (CBOW) and Skip-gram architectures, would come in.

For the sake of simplicity, we will not delve into such algorithms here, but simply note that there are many other ways to improve the analysis of the dataset.

## Name the topics obtained using Google [Not working]
It gave me some results before, though. I think Google may have blocked my IP for running the cell too many times. A VPN doesn't work either, so I think they blacklisted my account.

Update: it allowed me to run the cell again the next day, so I assume that I simply can't run it too many times in a short span of time.

In [None]:
# !pip install cssselect

# Thank you to Sam H. who shared this on https://stackoverflow.com/questions/43985683/automatic-labeling-of-lda-generated-topics
from urllib.parse import urlencode, urlparse, parse_qs
from lxml.html import fromstring
from requests import get
from collections import Counter
import re
import requests
from bs4 import BeautifulSoup

def get_srp_text(search_term):
    # Let requests URL-encode the search term instead of pasting it into the URL
    raw = get("https://www.google.com/search", params={"q": search_term}).text
    page = fromstring(raw)

    blob = ""

    for result in page.cssselect("a"):
        for res in result.findall("div"):
            blob += ' '
            blob += res.text if res.text else " "
            blob += ' '
    return blob

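def blob_cleaner(blob):
    # Turn separators into spaces first (str.replace does not understand regex patterns),
    # then keep only alphanumeric characters and whitespace
    clean_blob = re.sub(r'[/(),:_\-]', ' ', blob)
    return ''.join(e for e in clean_blob if e.isalnum() or e.isspace())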

def get_name_from_srp_blob(clean_blob):
    blob_tokens = list(filter(bool, map(lambda x: x if len(x) > 2 else '', clean_blob.split(' '))))
    print(blob_tokens)
    c = Counter(blob_tokens)
    most_common = c.most_common(10)

    name = f"{most_common[0][0]}-{most_common[1][0]}"
    return name

pipeline = lambda x: get_name_from_srp_blob(blob_cleaner(get_srp_text(x)))

In [None]:
# topic_dict = {'Topic_' + str(i): [token for token, score in lda_bow.show_topic(i, topn=10)] for i in range(0, lda_bow.num_topics)}
# # print(topic_dict)
# for key in topic_dict:
#     joined = " ".join(topic_dict[key])
#     # print(joined)
#     name = pipeline(joined)
#     print(name)

topic_terms = "delivery area mile option partner traffic hub thanks city way"
name = pipeline(topic_terms)
print(name)

# topic_terms = "package address time customer apartment delivery number item support door"
# name = pipeline(topic_terms)
# print(name)

Results obtained don't really make sense.

topic_terms = "delivery area mile option partner traffic hub thanks city way". Gives me ['mediumcom', 'sidewalktalk', 'thefutureoflastmiledeliveryhasarrived', 'deptswashingtonedu', 'sctlctr', 'newsevents', 'inthenews', 'futurelastmil', 'wwwcapgeminicom', 'ReportDigitalLastMileDeliveryChallenge1', 'wwwmckinseycom', 'travellogisticsandinfrastructure', 'ourinsights', 'wwwmckinseycom', 'industries', 'ourinsights', 'orderingintherapidevo', 'wwwbcgcom', 'publications', 'solvingthepackagedeliverysystemprobl', 'gigglefinancecom', 'whichfooddeliveryservicepaysthemost', 'wwwgovtmonitorcom', 'page', 'View', 'all', 'wwworegonlivecom', 'business', '202009', 'amazonplans1000smalldel']
wwwmckinseycom-ourinsights

There is no specific or recommended way to label the topic terms obtained. However, after a little screening, we realise that the top few words in each topic do make some sense. We will be using the topic breakdown from lda_bow, because it seems to give a more balanced categorisation.

Hence, in the following lines of code, we are going to categorise the motive column into these 6 groups:

*   Retaliation
*   Religion
*   Extortion
*   Fear
*   Political
*   Violence

Referenced: https://medcraveonline.com/FRCIJ/motivation-leading-to-radicalization-in-terrorists.html



## Evaluation of LDA

In [None]:
# Creating a dictionary to map each topic id to its category
motive_dict = {
    '0': "Retaliation",
    '1': "Religion",
    '2': "Extortion",
    '3': "Fear",
    '4': "Political",
    '5': "Violence"
}

In [None]:
# Extracting topics from the corpus
print(lda_bow.print_topics(num_topics=6, num_words=5))
print(lda_tfidf.print_topics(num_topics=6, num_words=5))

In [None]:
# https://stackoverflow.com/questions/61198009/classify-text-with-gensim-lda-model
motiveCat = []
for line in df['motive']:  # each row is already a list of tokens
    line_bow = dictionary.doc2bow(line)
    doc_lda = lda_bow[line_bow]  # list of (topic id, probability) pairs

    # Assign the row to the topic with the highest probability
    best_topic = max(doc_lda, key=lambda x: x[1])[0]
    motiveCat.append(motive_dict[str(best_topic)])

df['motiveCat1'] = motiveCat

In [None]:
df['motiveCat1'].value_counts()
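
The assignment step above comes down to picking the most probable topic from a list of (topic id, probability) pairs. Here is a minimal sketch with a made-up distribution; the `doc_topics` values and the `dominant_topic` helper are hypothetical and do not come from the trained model.

```python
motive_labels = {0: "Retaliation", 1: "Religion", 2: "Extortion",
                 3: "Fear", 4: "Political", 5: "Violence"}

# Hypothetical topic distribution for one document, as (topic_id, probability) pairs
doc_topics = [(0, 0.08), (2, 0.15), (4, 0.71), (5, 0.06)]

def dominant_topic(topic_probs):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_probs, key=lambda pair: pair[1])

topic_id, prob = dominant_topic(doc_topics)
label = motive_labels[topic_id]
print(label, prob)
```

The `key` argument matters: a plain `max(doc_topics)` compares the tuples by topic id first and would return topic 5 here, regardless of its probability.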

In [None]:
# https://stackoverflow.com/questions/61198009/classify-text-with-gensim-lda-model
motiveCat = []
for line in df['motive']:  # each row is already a list of tokens
    line_bow = dictionary.doc2bow(line)
    # Apply the same TF-IDF weighting that lda_tfidf was trained on
    doc_lda = lda_tfidf[tfidf[line_bow]]  # list of (topic id, probability) pairs

    # Assign the row to the topic with the highest probability
    best_topic = max(doc_lda, key=lambda x: x[1])[0]
    motiveCat.append(motive_dict[str(best_topic)])

df['motiveCat2'] = motiveCat

In [None]:
df['motiveCat2'].value_counts()

# Step 2: Dealing with Summary (Using the LDA models)
We will use the motive categories assigned by the previous lda_bow model as the label for each row, and train classifiers to predict these categories from each summary.

## Data Pre-processing

In [None]:
# Converting all of the strings into lowercase
df['summary'] = df['summary'].str.lower()

# Removing punctuation (regex=True is needed for pattern matching in pandas >= 2.0)
df['summary'] = df['summary'].str.replace('[{}]'.format(string.punctuation), '', regex=True)

# Removing digits
df['summary'] = df['summary'].str.replace(r'\d+', '', regex=True)

df['summary'].head()

In [None]:
# for i, row in df.iterrows():
#     print(row['summary'])

In [None]:
# Removing stopwords
stop = stopwords.words('english')
newStopWords = ['unknown','however','specific','motive','sources','noted','reported','related','victim',\
                'attack','targeted','speculated','claimed','responsibility','incident','stated','scheduled',\
                'larger','part','suspected','carried','accused','may','january','february','march','april','june',\
                'july','august','september','october','november','december','people','believed','meant','considered']
stop.extend(newStopWords)

df['summary'] = df['summary'].apply(lambda x: [word for word in x.split() if word not in (stop)])
df['summary'].head()

In [None]:
# Initialise the WordNet Lemmatizer once and reuse it for every row
lemmatizer = WordNetLemmatizer()

# Lemmatize each word - group together different forms of the same word (e.g. "cars" -> "car")
# Note: lemmatize() treats every word as a noun by default, so verb forms are only reduced if pos='v' is passed
def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in text]
df['summary'] = df['summary'].apply(lemmatize_text)
print(df['summary'].head())

## Bag of Words on the Dataset

In [None]:
# Create a dictionary from df['summary'] that maps each unique word to an integer id
dictionary = gensim.corpora.Dictionary(df['summary'])

# Preview the first 10 (id, word) pairs
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count >= 10:
        break

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Build a binary bag of words: one column per word, with 1 if the word appears in the row and 0 otherwise
vectorizer = CountVectorizer(stop_words='english', binary=True)

# CountVectorizer expects strings, so join each row's tokens back into a single string
df['summary'] = df['summary'].str.join(' ')
bagofwords = vectorizer.fit_transform(df['summary'])

In [None]:
# df["summary"]= df["summary"].str.join(" ")
X = bagofwords
y = df['motiveCat1'].values

print(np.shape(X))
print(np.shape(y))
print(X[:1])
print(y[:10])

In [None]:
# Splitting the dataset into Training set and Test set 
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

## KernelSVM

In [None]:
# Fitting Kernel SVM to the Training set 
from sklearn.svm import SVC 
model = SVC(kernel = 'rbf', random_state = 0) 
model.fit(X_train, y_train) 

In [None]:
# Save Model
filename = '2_kernelSVM.sav'
pickle.dump(model, open(filename, 'wb'))

In [None]:
from sklearn.metrics import confusion_matrix 
from sklearn.metrics import roc_auc_score

# Helper that evaluates the current global `model` on the test set
# (note: the name shadows Python's built-in eval)
def eval():
    # Predicting the Test set results
    y_pred = model.predict(X_test)

    # Making the Classification Report
    print(classification_report(y_test, y_pred))

    # Using the ROC AUC score matrix
    # roc_auc_score(y_test, y_pred, average=None)

In [None]:
eval()

## Logistic Regression (LR)

In [None]:
# Fitting Log Reg to the Training set 
model = LogisticRegression(solver="sag", max_iter=400)
model.fit(X_train, y_train)

In [None]:
# Save Model
filename = '2_logreg.sav'
pickle.dump(model, open(filename, 'wb'))

In [None]:
eval()

## Stochastic Gradient Descent (SGD)

In [None]:
from sklearn.linear_model import SGDClassifier
model = SGDClassifier(loss="hinge", penalty="l2", max_iter=5)
model.fit(X_train, y_train)

In [None]:
# Save Model
filename = '3_sgd.sav'
pickle.dump(model, open(filename, 'wb'))

In [None]:
eval()

## Decision Tree (DT)

In [None]:
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
# Create Decision Tree classifier object
model = DecisionTreeClassifier()

# Train Decision Tree classifier
model = model.fit(X_train,y_train)

In [None]:
# Save Model
filename = '4_dt.sav'
pickle.dump(model, open(filename, 'wb'))

In [None]:
eval()

In [None]:
# Visualizing Decision Trees
# !pip install six
# from six import StringIO  
# from IPython.display import Image  
# from sklearn.tree import export_graphviz
# import pydotplus

# feature_cols = ['Exortion', 'Fear', 'Political', 'Religion','Retaliation','Violence']
# dot_data = StringIO()
# export_graphviz(model, out_file=dot_data,  
#                 filled=True, rounded=True,
#                 special_characters=True, feature_names = feature_cols, class_names=['0','1'])
# graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
# graph.write_png('diabetes.png')
# Image(graph.create_png())

## Random Forest (RF)

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=50)
model.fit(X_train, y_train)

In [None]:
# Save Model
filename = '5_rf.sav'
pickle.dump(model, open(filename, 'wb'))

In [None]:
eval()

Interestingly, Random Forest also gives us a good accuracy here, as it did in our previous question, beating out the rest of the models that we have trained.

## k-Nearest Neighbors (kNN)

In [None]:
model = KNN(n_neighbors=7)
model.fit(X_train, y_train)

In [None]:
# Save Model
filename = '6_knn.sav'
pickle.dump(model, open(filename, 'wb'))

In [None]:
eval()

## AdaBoost

In [None]:
from sklearn.ensemble import AdaBoostClassifier
model = AdaBoostClassifier()
model.fit(X_train, y_train)

In [None]:
# Save Model
filename = '7_adaBoost.sav'
pickle.dump(model, open(filename, 'wb'))

In [None]:
eval()

## Gradient Boosting

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier()
model.fit(X_train, y_train)

In [None]:
# Save Model
filename = '8_gradientBoosting.sav'
pickle.dump(model, open(filename, 'wb'))

In [None]:
eval()

Gradient Boosting and AdaBoost are both ensemble learning methods, and more specifically boosting methods. Boosting combines several weak "learners" into a stronger model. The concept is quite interesting because each new predictor is fitted to the errors made by the models before it.

AdaBoost gave us a lower accuracy of 0.40 at predicting the motive categories.

Gradient Boosting obtained an accuracy of 0.49, which indicates that it was no better than the previous models (without boosting).
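
To illustrate the idea of fitting each new predictor to the errors of the previous ones, here is a minimal sketch of gradient boosting with squared error on a made-up 1-D regression problem. It uses one-split "stumps" as the weak learners. This is a toy illustration under those assumptions, not what scikit-learn's `GradientBoostingClassifier` does internally, and all names and data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]

def predict_stump(stump, x):
    t, lv, rv = stump
    return lv if x <= t else rv

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # errors made so far
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return base, stumps, preds

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Made-up data with three rough "steps"
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.8, 5.2, 5.0]

base, stumps, preds = boost(xs, ys)
mse_before = mse(ys, [base] * len(ys))
mse_after = mse(ys, preds)
print(mse_before, mse_after)
```

No single stump can follow three steps, but the sum of many small corrections can, which is why the training error drops from round to round.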

## K-Fold Cross Validation

In [None]:
# Applying k-Fold Cross Validation - not helpful right now
from sklearn.model_selection import cross_val_score 
accuracies = cross_val_score(estimator = model, X = X_train, y = y_train, cv = 5) 

In [None]:
# Save the cross-validation scores
filename = '9_kfold.sav'
pickle.dump(accuracies, open(filename, 'wb'))

In [None]:
print(accuracies.mean())
print(accuracies.std())

K-Fold Cross Validation does not improve a model by itself; it gives a more reliable estimate of the model's accuracy by averaging over several train/validation splits. The other models that were executed also had an average accuracy of around 0.49, so the cross-validated score is in line with what we have already seen. Hence we can conclude that executing K-Fold Cross Validation does not help us in this case.
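
For reference, here is a minimal sketch of how k-fold splitting partitions the data: every sample is used for validation exactly once and for training in the other folds. The `k_fold_indices` helper is a hypothetical illustration; scikit-learn uses a stratified variant for classifiers when `cv` is an integer, which this sketch does not reproduce.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first few folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for train, val in folds:
    print("validate on", val, "train on", train)
```

The model is retrained from scratch on each training part, and the reported mean and standard deviation are taken over the k validation scores.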

# Step 3: Evaluation
Summing everything up, we have used LDA to determine the motive categories and obtained 6 of them that we think would be most suitable to represent the topic labels.

*   Retaliation
*   Religion
*   Extortion
*   Fear
*   Political
*   Violence

The above are the reasons and the main goals of the terrorists for initiating such attacks.

Taking a look at the performance metrics for each model: we learnt from the previous question, "What makes a successful terrorist attack?", that accuracy alone cannot be depended on:

*   Confusion Matrix - Great for visualisation (based on TP, TN, FN, FP)
*   Accuracy - The most commonly used metric: the number of correct predictions as a ratio of all predictions. Not a very good indicator on its own.
*   Precision - TP/(TP+FP). The proportion of predicted positives that are actually positive.
*   Recall - TP/(TP+FN). The proportion of actual positives that are correctly identified.
*   F1-score - The harmonic mean of precision and recall, taking both FP and FN into account.
*   ROC-AUC score - AUC represents an average indicator of performance across all the possible classification thresholds.

Reference: <br>
Performance Metrics: https://machinelearningmastery.com/metrics-evaluate-machine-learning-algorithms-python/
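
As a worked example of the formulas in the list above, here is a minimal sketch that computes precision, recall and F1 for a single class from raw counts. The counts are made up for illustration and are not results from our models.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts for one class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for one motive category
precision, recall, f1 = precision_recall_f1(tp=40, fp=10, fn=40)
print(round(precision, 2), round(recall, 2), round(f1, 3))
```

Here the classifier is right 80% of the time when it predicts the category, yet it finds only half of the true cases, and the F1-score sits between the two, closer to the weaker one.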



## Conclusion
We also understand that through this process, we have cut down significantly on the number of data points. After selecting the columns and preprocessing the data for NLP, the once 200k data points are reduced to around 20k. However, this is still enough for us to draw an analysis.

Now to finally answer the question: "What do terrorists really want?". We can see that terrorists usually attack civilians to maximise their political gains (the political category has the highest number of data points, 966). Perhaps it is to make a statement, coming from a place of wanting to be heard. Although the models created were not very accurate at determining the categories, using LDA to find the most frequent words really helps us to understand the motives better, and hence group them under 6 topic labels.

One of the ways that we thought could help in countering such terrorist attacks is to diminish the benefits gained from an attack. Since we can know the terrorists' aims beforehand, we would be able to strategise against their goals and thwart the attack, hopefully stopping it in the early planning stages.

Originally, we wanted to use the models to predict the motive from certain keywords, like the country, the situation, and background information obtained before an attack was executed. From there, we would be able to predict what the terrorists want before they state it to us.