## Gunman Description 
To do:
- isolate the search for adjectives describing the gunman to only the articles that are actually about the shooting at hand 
- compare the words across the different shootings/ partisan differences

#### IMPORTs: packages, variables, files

In [294]:
import All_Functions as af 
from All_Functions import *
import glob as glob 
from ast import literal_eval
import nltk
import collections
from collections import Counter

In [295]:
# Functions
def print_most_common(text):
    counter = Counter(text)
    most = counter.most_common()
    print(most[:10])
    
# FUNCTIONS
def get_relevant_articles(shooting_list):
    """ INPUT: a list of lists where each sublist is an article w/ corresponding data from a given shooting 
        ASSUMPTIONS: We're asserting that if an article contains more than 3 keywords, the article is *about* the shooting 
        OUTPUT: a list of the shooting we're talking about, the # of articles about the shooting, the total # of articles in the dataset, the avg word count of the relevant articles 
    """
    count = [0 for x in range(len(shooting_list))] # set up an index to keep track of when/ how many keywords appear 
    kw_appearance = [[] for x in range(len(shooting_list))]
    location = shooting_list[1][-2] # the "name" of the shooting which links to the dictionary

    for i in range(len(shooting_list)):
        lower = shooting_list[i][2].lower() 
        shooting_list[i][2] = lower
         
        for word in nltk.word_tokenize(shooting_list[i][2]): # for single words in the dict
            if word in shooting_keywords[location]:
                count[i] +=1
                kw_appearance[i].append(word)
        for word in af.extract_ngrams(nltk.word_tokenize(shooting_list[i][2]), 2): # for bigrams in the dict
            if word in shooting_keywords[location]:
                count[i] += 1
                kw_appearance[i].append(word)
        for word in af.extract_ngrams(nltk.word_tokenize(shooting_list[i][2]), 3): # for trigrams in the dict
            if word in shooting_keywords[location]:
                count[i] += 1
                kw_appearance[i].append(word)
#     for index in range(len(shooting_list)):
#         shooting_list[i].append(count[i])
    relevant_articles = []
    for i in range(len(shooting_list)):
        if count[i] > 3: # Here is where we enter the assumption of 3
            relevant_articles.append(shooting_list[i])
            
    # CAN Also return kw_appearance if you want to see the words that are pinging
    return relevant_articles

In [296]:
# Get all the unclean text files
all_files = [x for x in glob.glob('newspaper-text/unclean-sentencesplit' + "/*.csv")]
all_text = af.import_text_data(all_files)

### ACTIONS

In [92]:
# Get just the text 
just_text = []
for i in range(len(all_text)):
    for j in range(1,len(all_text[i])):
        mini_text = []
        literal = literal_eval(all_text[i][j][2]) # for some reason the lists are imported as strings so this fixes that 
        tokenized = [nltk.word_tokenize(x) for x in literal]
#         lower= af.remove_capitalization(tokenized)
#         for x in literal:
#             tokenized = nltk.word_tokenize(x)
#             mini_text.append(x) # now we have just one list of every string from every shooting article
        for x in tokenized:
            lower= af.remove_capitalization(x)
            just_text.append(lower)

#### Find synonyms of the word "gunman"
Using Word2Vec, find synonyms of the word "gunman"

In [93]:
from gensim.models import Word2Vec
model = Word2Vec(sentences=just_text, window=5, min_count=1, workers=4)
model.save("word2vec.model")

In [94]:
model = Word2Vec.load("word2vec.model")
word_vectors = model.wv

In [95]:
# for x in word_vectors.key_to_index:
#     print(x)

In [103]:
vector = model.wv['gunman']  # get numpy vector of a word
sims = model.wv.most_similar('gunman', topn=30)  # get other similar words

In [104]:
sims

[('shooter', 0.8254361748695374),
 ('suspect', 0.7077057361602783),
 ('gunmen', 0.6907996535301208),
 ('attacker', 0.6741224527359009),
 ('man', 0.6522361040115356),
 ('paddock', 0.6377201676368713),
 ('assailant', 0.6344540119171143),
 ('killer', 0.629865288734436),
 ('kelley', 0.5895920395851135),
 ('perpetrator', 0.5850415229797363),
 ('rampage', 0.5806468725204468),
 ('craddock', 0.5793517827987671),
 ('ator', 0.5637785196304321),
 ('massacre', 0.5572240352630615),
 ('attack', 0.5498966574668884),
 ('suspects', 0.5496650338172913),
 ('cassidy', 0.5487622618675232),
 ('xaver', 0.5304782390594482),
 ('sniper', 0.5276758074760437),
 ('mateen', 0.5257592797279358),
 ('shooting', 0.525025486946106),
 ('atchison', 0.5226256847381592),
 ('assailants', 0.5135753154754639),
 ('wooden-framed', 0.5122582912445068),
 ('gunfire', 0.5016877055168152),
 ('bar', 0.5012877583503723),
 ('hotel', 0.5009043216705322),
 ('roof', 0.49991390109062195),
 ('farook', 0.49825799465179443),
 ('shoplifter', 0.

In [111]:
gunman_words = ['gunman','shooter','suspect','gunmen','attacker','assailant',
                'killer','perpetrator','suspects']

#### Set up a POS tagger to identify the words that describe "gunman" and its synonyms

In [99]:
import nltk
nltk.download('brown')
nltk.download('treebank')
from nltk.corpus import brown
from nltk.tag import UnigramTagger
from nltk.tag import UnigramTagger, BigramTagger
from nltk.tag.brill_trainer import BrillTaggerTrainer
from nltk.tag.brill import brill24
from nltk.tag import brill, brill_trainer

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\khahn\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package treebank to
[nltk_data]     C:\Users\khahn\AppData\Roaming\nltk_data...
[nltk_data]   Package treebank is already up-to-date!


In [100]:
n_cutoff = 20000
brown_sents_train = brown.tagged_sents()[0:n_cutoff] # training corpus

In [101]:
uni_tagger = UnigramTagger(brown_sents_train)
bi_tagger_backoff = BigramTagger(brown_sents_train, backoff=uni_tagger)
bi_tagger = BigramTagger(brown_sents_train)

In [106]:
### Add a timer to this next part: it takes ages

In [102]:
#with the backoff
templates = nltk.tag.brill.brill24()
brill_tagger = brill_trainer.BrillTaggerTrainer(bi_tagger, templates)
trainer = brill_tagger.train(brown_sents_train)
brill_tagger_backoff = brill_trainer.BrillTaggerTrainer(bi_tagger_backoff, templates)
trainer_backoff = brill_tagger_backoff.train(brown_sents_train)

In [None]:
trainer_backoff.tag(x)


In [144]:
"""Now I want to be able to connect the results to a specific shooting, so I need to build exactly the same database except with location attached"""
"""The issue is many of the article from a specific shooting aren't really about that shooting"""
# Get just the text 
text_loc = []
for i in range(len(all_text)):
    for j in range(1,len(all_text[i])):
        literal = literal_eval(all_text[i][j][2]) # for some reason the lists are imported as strings so this fixes that 
        tokenized = [nltk.word_tokenize(x) for x in literal]
        location = all_text[i][j][-2]
        for x in tokenized:
            lower= af.remove_capitalization(x)
            text_loc.append([lower, [location]])

In [145]:
gunman_sentences =[]
for i in range(len(text_loc)):
    for word in text_loc[i][0]:
        if word in gunman_words:
            gunman_sentences.append(text_loc[i])


In [146]:
for i in range(len(gunman_sentences)):
    tags = trainer_backoff.tag(gunman_sentences[i][0])
    gunman_sentences[i].append(tags)

In [186]:
# more sophisticated version with 
adjs = ['JJ','JJS','JJR']
shootings = []
gunman_adjs = [[] for x in range(10)]
for i in range(len(gunman_sentences)):
    # First, get the location & index of the location in "shootings"
    location = gunman_sentences[i][1]
    if location not in shootings:
        shootings.append(location)
    for j in range(len(shootings)):
        if location == shootings[j]:
            index = j
    # then get all the adjectives from the sentences
    for ii in range(len(gunman_sentences[i][-1])): 
        if ii > 0 and gunman_sentences[i][-1][ii][0] in gunman_words:
            if gunman_sentences[i][-1][ii-1][1] in adjs:
#                 print( gunman_sentences[i][-1][ii-1][0],gunman_sentences[i][-1][ii][0] )
                gunman_adjs[index].append(gunman_sentences[i][-1][ii-1][0])
            if gunman_sentences[i][-1][ii-2][1] in adjs:
#                 print(gunman_sentences[i][-1][ii-2][0], gunman_sentences[i][-1][ii-1][0],gunman_sentences[i][-1][ii][0] )
#                 print(gunman_sentences[i][-1][ii-2][1], gunman_sentences[i][-1][ii-1][1],gunman_sentences[i][-1][ii][1] )
                gunman_adjs[index].append(gunman_sentences[i][-1][ii-2][0])
                gunman_adjs[index].append(gunman_sentences[i][-1][ii-1][0])
            if gunman_sentences[i][-1][ii+1][1] in adjs:
#                 print(gunman_sentences[i][-1][ii][0], gunman_sentences[i][-1][ii+1][0] )
#                 print(gunman_sentences[i][-1][ii-2][1], gunman_sentences[i][-1][ii-1][1],gunman_sentences[i][-1][ii][1] )
                gunman_adjs[index].append(gunman_sentences[i][-1][ii+1][0])

In [188]:
print('Number of words that describe the gunman in each shooting:')
for index in range(len(gunman_adjs)):
    print(shootings[index], len(gunman_adjs[index]))

Number of words that describe the gunman in each shooting:
['Bogue'] 99
['Boulder'] 520
['DC'] 147
['Houston'] 146
['Odessa'] 333
['Pittsburgh'] 265
['Plano'] 1242
['SanBernadino'] 360
['Vegas'] 1244
['VirginiaBeach'] 150


In [191]:
print('Most common words to describe each shooter:')

for index in range(len(gunman_adjs)):
    print(shootings[index])
    print_most_common(gunman_adjs[index])

Most common words to describe each shooter:
['Bogue']
[('active', 34), ('lone', 9), ('to', 6), ('shooting', 5), ('white', 4), ('reasonable', 4), ('multiple', 4), ('the', 3), ('possible', 2), ('correct', 2)]
['Boulder']
[('active', 195), ('lone', 47), ('teenage', 36), ('21-year-old', 32), ('dead', 14), ('white', 13), ('possible', 13), ('middle-aged', 10), ('old', 7), ('the', 7)]
['DC']
[('lone', 19), ('possible', 18), ('dead', 17), ('sandy', 12), ('hook', 12), ('second', 9), ('potential', 8), ('crazy', 8), ('active', 6), ('elementary', 3)]
['Houston']
[('active', 52), ('26-year-old', 10), ('lone', 8), ('late', 7), ('would-be', 5), ('dead', 5), ('young', 4), ('such', 4), ('responsible', 3), ('the', 3)]
['Odessa']
[('active', 110), ('lone', 23), ('white', 14), ('21-year-old', 10), ('unknown', 10), ('mass', 9), ('recent', 9), ('supremacist', 9), ('potential', 8), ('dead', 7)]
['Pittsburgh']
[('active', 115), ('lone', 24), ('dead', 16), ('possible', 11), ('shooting', 11), ('white', 9), ('ma

### Repeat this process but just for the relevant articles

In [297]:
shooting_keywords = {'Plano':["spencer hight","dallas cowboys","meredith hight","dallas","plano","sunday","caleb edwards","deffner","rushin"],
                     'Pittsburgh':["pittsburgh","synagogue","bowers","tree of life","squirrel hill","jewish","anti-semitism","jews"],
                     'Vegas':["paddock","mandalay bay hotel","route 91 harvest","las vegas","aldean","concert","mesquite","hotel","lombardo"],
                     'SanBernadino':['syed','rizwan','farook','tashfeen','malik','SUV',"inland regional center","san bernardino","redlands","christmas party","public health department","bomb"],
                     'Houston':['david','conley','harris county','ex-girlfriend','houston','saturday','valerie jackson','window','black','arrested'],
                     'Odessa':['saturday','midland','odessa','seth','aaron','ator','west texas','traffic stop','white van','movie theater','random','white'],
                     'Bogue':['willie','cory','godbolt','arrested','lincoln county','bogue chitto','brookhaven','barbara mitchell'],
                     'DC':['washington navy yard','aaron alexis','monday','contractor','12 people'],
                     'Boulder':['king soopers','boulder','ahmad al','aliwi','al-issa','arrested','9mm handgun','table mesa drive','eric talley','boulder police','monday','in custody'],
                     'VirginiaBeach':['virginia beach','dewayne','craddock','employee','nettleton','.45-caliber','engineer','municipal']
                    }

In [298]:
relevant_articles = [get_relevant_articles(x) for x in all_text]

In [299]:
# print(rel_art[1][2].split())
# nltk.word_tokenize(rel_art[1][2])
# af.extract_ngrams(nltk.word_tokenize(rel_art[5][2]), 2)

In [300]:
rel_art, kw= []
rel_art, kw = get_relevant_articles(all_text[-1])
for i in range(len(rel_art)):
    if len(kw[i]) > 0:
        print(f"Keywords: {kw[i]}")
        print(f"Index: {i}")
        print(f"Title: {rel_art[i][4]}")
        print(f"Url: {rel_art[i][1]}")
        print(" ")

ValueError: not enough values to unpack (expected 2, got 0)

In [301]:
for x in relevant_articles:
    print(len(x))

4
556
57
74
306
449
363
477
1840
141


In [302]:
# Get just the text 
text_loc_v2 = []
for i in range(len(relevant_articles)):
    for j in range(1,len(relevant_articles[i])):
        literal = literal_eval(relevant_articles[i][j][2]) # for some reason the lists are imported as strings so this fixes that 
        tokenized = [nltk.word_tokenize(x) for x in literal]
        location = all_text[i][j][-2]
        for x in tokenized:
            lower= af.remove_capitalization(x)
            text_loc_v2.append([lower, [location]])

In [303]:
gunman_sentences = []
for i in range(len(text_loc_v2)):
    for word in text_loc_v2[i][0]:
        if word in gunman_words:
            gunman_sentences.append(text_loc_v2[i])

In [304]:
for i in range(len(gunman_sentences)):
    tags = trainer_backoff.tag(gunman_sentences[i][0])
    gunman_sentences[i].append(tags)

In [305]:
# more sophisticated version with 
adjs = ['JJ','JJS','JJR']
shootings = []
gunman_adjs = [[] for x in range(10)]
for i in range(len(gunman_sentences)):
    # First, get the location & index of the location in "shootings"
    location = gunman_sentences[i][1]
    if location not in shootings:
        shootings.append(location)
    for j in range(len(shootings)):
        if location == shootings[j]:
            index = j
    # then get all the adjectives from the sentences
    for ii in range(len(gunman_sentences[i][-1])): 
        if ii > 0 and gunman_sentences[i][-1][ii][0] in gunman_words:
            if gunman_sentences[i][-1][ii-1][1] in adjs:
#                 print( gunman_sentences[i][-1][ii-1][0],gunman_sentences[i][-1][ii][0] )
                gunman_adjs[index].append(gunman_sentences[i][-1][ii-1][0])
            if gunman_sentences[i][-1][ii-2][1] in adjs:
#                 print(gunman_sentences[i][-1][ii-2][0], gunman_sentences[i][-1][ii-1][0],gunman_sentences[i][-1][ii][0] )
#                 print(gunman_sentences[i][-1][ii-2][1], gunman_sentences[i][-1][ii-1][1],gunman_sentences[i][-1][ii][1] )
                gunman_adjs[index].append(gunman_sentences[i][-1][ii-2][0])
                gunman_adjs[index].append(gunman_sentences[i][-1][ii-1][0])
            if gunman_sentences[i][-1][ii+1][1] in adjs:
#                 print(gunman_sentences[i][-1][ii][0], gunman_sentences[i][-1][ii+1][0] )
#                 print(gunman_sentences[i][-1][ii-2][1], gunman_sentences[i][-1][ii-1][1],gunman_sentences[i][-1][ii][1] )
                gunman_adjs[index].append(gunman_sentences[i][-1][ii+1][0])

In [306]:
print('Number of words that describe the gunman in each shooting:')
for index in range(len(gunman_adjs)):
    print(shootings[index], len(gunman_adjs[index]))

Number of words that describe the gunman in each shooting:
['Bogue'] 0
['Boulder'] 193
['DC'] 74
['Houston'] 20
['Odessa'] 130
['Pittsburgh'] 82
['Plano'] 684
['SanBernadino'] 194
['Vegas'] 970
['VirginiaBeach'] 38


In [307]:
print('Most common words to describe each shooter:')

for index in range(len(gunman_adjs)):
    print(shootings[index])
    print_most_common(gunman_adjs[index])

Most common words to describe each shooter:
['Bogue']
[]
['Boulder']
[('active', 79), ('21-year-old', 25), ('lone', 13), ('middle-aged', 10), ('prime', 6), ('sure', 5), ('white', 4), ('this', 4), ('possible', 4), ('unknown', 3)]
['DC']
[('lone', 17), ('possible', 16), ('dead', 14), ('second', 9), ('potential', 7), ('additional', 2), ('well-armed', 2), ('active', 2), ('nationwide', 1), ('multiple', 1)]
['Houston']
[('late', 7), ('such', 4), ('possible', 2), ('would-be', 1), ('active', 1), ('potential', 1), ('medical', 1), ('layman', 1), ('glad', 1), ('the', 1)]
['Odessa']
[('active', 30), ('lone', 13), ('white', 11), ('supremacist', 8), ('21-year-old', 6), ('mass', 5), ('potential', 4), ('recent', 4), ('years', 4), ('new', 4)]
['Pittsburgh']
[('active', 38), ('possible', 11), ('shooting', 11), ('lone', 7), ('outspoken', 4), ('mass', 3), ('likely', 1), ('apparent', 1), ('notorious', 1), ('responsible', 1)]
['Plano']
[('active', 497), ('lone', 82), ('sole', 46), ('dead', 7), ('26-year-old

### Hands on Approach
Okay now gonna try a more hands on approach for Bogue bc there have got to be sentences that describe the shooter 

In [314]:
bogue = all_text[0]
for i in range(len(bogue[:82])):
    tokenized = nltk.word_tokenize(bogue[i][2])
    for j in range(len(tokenized)):
        if tokenized[j] == 'godbolt':
            print(tokenized[j-2],tokenized[j-1],tokenized[j],tokenized[j+1])

Cant find evidence of a single article about the Bogue shooting....