# Natural Language Processing of ISIS Content
This notebook documents the efforts of **Kamal Kamalaldin** and **Will Fitzgerald** to filter ISIS content that violates the terms of service from the Internet Archive (www.archive.org).

## Requirements
1. **Python 3+** (current version is 3.5). This should be straightforward to [install](https://www.python.org/downloads/).
2. [**pip3**](https://pip.pypa.io/en/stable/installing/) or [**homebrew**](http://brew.sh/) command line tools. We will use these tools to install other, more essential tools.
3. **The Internet Archive command line tools**, referred to throughout this notebook as *ia*. Here we use pip to install *ia*; it can also be installed with homebrew. Because of some issues with importing modules in jupyter (IPython notebook), **if you plan on using this notebook to run all the code**, *ia* must be installed in a directory where jupyter can find the module. Run the pip3 command in the cell below before continuing.
    - using homebrew: "brew install internetarchive"
    - using pip3: "pip3 install internetarchive"
4. [**Vowpal Wabbit**](http://hunch.net/~vw/). On Macs, this can be installed using homebrew. On Windows, perhaps [scoop](http://scoop.sh/) will work.

In [None]:
! python3 -V

In [None]:
! EDIT THE FILEPATH AND DELETE THIS LINE BEFORE EXECUTING
! DONT EXECUTE UNTIL EDITING PATH IN CAPS pip3 install internetarchive -t /PATH/TO/CURRENT/ANACONDA/ENVIRONMENT/lib/python3.5/site-packages/internetarchive

In [None]:
import internetarchive as ia

## Data collection
At the beginning of our endeavour, we attempted to categorize Arabic item entries into those violating the terms of service and those that do not. Items can violate the terms of service if they are "deemed offensive, disturbing, pornographic, racist, sexist, bizarre, misleading, fraudulent, or otherwise objectionable" (Terms of Service, Internet Archive). As Kamal was the only person knowledgeable enough in Arabic to understand and categorize items, it quickly became apparent that items were being categorized too slowly to gather enough data to train a model. Therefore, another approach was considered.

For the data portion, a directory tree was created.

In [None]:
! mkdir data

##### ISIS Collection

A collection of ~5,000 items that mentioned ISIS/ISIL or their Arabic counterparts was gathered. This collection represents items that are ISIS related and are assumed to be malicious.

First, we search for items containing "الدولة الاسلامية", the Arabic equivalent of "Islamic State". We first examine the number of items returned, then we download their identifiers.

In [None]:
! mkdir data/ISIS
! echo "Number of matching searches found : "
! ia search "الدولة الاسلامية" -i -n 
! echo "Downloading item identifiers..."
! ia search "الدولة الاسلامية" -i > data/ISIS/existingIDs.txt
! echo "Number of item identifiers downloaded:"
! wc -l data/ISIS/existingIDs.txt

If you encounter an error running these commands, ensure that you have the latest version of *ia* installed (See **Requirements** above). 

The -i flag specifies that we are only interested in the ID of each item, not the whole item (we will download the metadata later). The -n flag specifies that we only want to know the number of search results for now.
In the second search command we ask for the IDs to be downloaded and put in a file called existingIDs.txt in the data/ISIS directory. The wc -l command counts the lines, and therefore the identifiers, in the file we just created.
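The identifier list can also be written from Python instead of the shell. The following is a minimal sketch, not the notebook's method: `save_identifiers` is a hypothetical helper, and it assumes that `internetarchive.search_items(query)` yields dicts carrying an `"identifier"` key.

```python
def save_identifiers(results, path):
    """Write one identifier per line and return how many were written.

    `results` is any iterable of dicts with an "identifier" key; this is
    assumed to match what internetarchive.search_items(query) yields.
    """
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for result in results:
            handle.write(result["identifier"] + "\n")
            count += 1
    return count


# Hypothetical usage (needs network access and the internetarchive package):
#   import internetarchive as ia
#   save_identifiers(ia.search_items("الدولة الاسلامية"), "data/ISIS/existingIDs.txt")
```

Taking the results as a plain iterable keeps the helper easy to try with a hand-written list before running a real search.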

We now extract the IDs from the file and download the metadata for each item. The files are downloaded for future inspection and caching. **This is currently a time-consuming procedure.**

In [None]:
! mkdir data/ISIS/metadata

In [None]:
import json

def getIDsFromFile(file):
    # Read one item identifier per line from an already opened file, then close it
    IDs = []
    print("Opening ID file...")
    print("Reading IDs from file...")
    for line in file:
        IDs.append(line.rstrip())
    print("Finished reading IDs from file. Closing file...")
    file.close()
    print("ID File closed!")
    return IDs

def downloadMetaFiles(IDs, directoryToSave):
    # Save each item's metadata as JSON in directoryToSave/<ID>.txt
    print("Downloading " + str(len(IDs)) + " item metadatas...")
    count = 1
    for ID in IDs:
        meta = ia.get_item(ID).metadata
        # The with block closes each file, so the JSON is flushed to disk
        with open(directoryToSave + ID + ".txt", 'w') as file:
            json.dump(meta, file)
        if(count % 31 == 0):
            print('downloaded ' + str(count) + ' metadata files so far!')
        count += 1
    print("Successfully downloaded " + str(len(IDs)) + " metadatas!")


ISIS_ID_file = open('data/ISIS/existingIDs.txt', 'r')
posIDs = getIDsFromFile(ISIS_ID_file)
downloadMetaFiles(posIDs, "data/ISIS/metadata/")
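Because the download is slow and may be interrupted, one option is to skip identifiers whose metadata file already exists. The following is a minimal sketch: `ids_missing_metadata` is a hypothetical helper, and it assumes that any existing `<ID>.txt` file is complete, which may not hold for a file left behind by an interrupted run.

```python
import os


def ids_missing_metadata(ids, directory):
    """Return the IDs that have no '<ID>.txt' file in `directory` yet."""
    existing = set(os.listdir(directory)) if os.path.isdir(directory) else set()
    return [item_id for item_id in ids if item_id + ".txt" not in existing]


# Hypothetical usage with the notebook's own function:
#   remaining = ids_missing_metadata(posIDs, "data/ISIS/metadata")
#   downloadMetaFiles(remaining, "data/ISIS/metadata/")
```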

We now have the metadata readily available in files, and reading it from disk should be much faster than downloading it again if need be. We define the functions below to load the metadata back from the files.

In [2]:
import os
import json

def getMetaFromFile(fileDir):
    # Returns the parsed JSON, or None if the file does not contain valid JSON
    with open(fileDir, 'r') as file:
        try:
            return json.load(file)
        except ValueError:
            print("Error reading JSON from " + fileDir)
            return None
    
def readMetaTextInDirectory(directory):
    return [getMetaFromFile(directory + os.sep + f) 
            for f in os.listdir(directory)
            if  f.endswith('.txt')]

posMetadata = readMetaTextInDirectory('data/ISIS/metadata')
len(posMetadata)

1063

This list of metadata will be the actual data points which we will process, develop our model with, and test against!
We need to get it in VW format first. To see what that looks like, skim through the [input format page](https://github.com/JohnLangford/vowpal_wabbit/wiki/Input-format). In a nutshell, each data point (metadata entry) must be on its own line. Each line must begin with a +1 or -1 to signify whether it is a positive or negative example (ISIS or not ISIS, respectively, in this case), followed by one or more namespaces, each introduced by a pipe "|" and followed by its features. The format is: "label |namespace1 feature1 feature2 |namespace2 feature3 |namespace3 feature4...\n". "|" and ":" are reserved by VW ("|" starts a namespace and ":" separates a feature from its numeric value), so we must distinguish their occurrence in the metadata from their occurrence in the VW format. We replace "|" with "PIPE" and ":" with "COLON".
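To make the target format concrete, here is a minimal sketch that builds one VW line from a small dictionary. The helper names `escape_vw` and `to_vw_line` and the sample values are made up for illustration; the notebook's own functions follow in the next cell.

```python
def escape_vw(text):
    """Replace the characters VW reserves and flatten line breaks."""
    return (text.replace(":", "COLON")
                .replace("|", "PIPE")
                .replace("\n", " ")
                .replace("\r", " "))


def to_vw_line(label, namespaces):
    """Build 'label |ns1 tokens |ns2 tokens' from a dict of namespace -> text."""
    parts = [str(label)]
    for name, text in namespaces.items():
        parts.append("|" + name + " " + escape_vw(text))
    return " ".join(parts)


example = {"title": "a:b", "subject": "x|y z"}  # made-up metadata values
print(to_vw_line("+1", example))
# +1 |title aCOLONb |subject xPIPEy z
```

Each namespace name follows its pipe directly, with no space in between, and every whitespace-separated token after it becomes one feature.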

In [26]:
def proccessText(text):
    newText = text.replace(':', 'COLON').replace('|', 'PIPE').replace('\n', ' ').replace('\r', ' ').replace("@", ' ')
    newText = newText.replace('<br>', ' ').replace('</br>', ' ')
    newText = newText.replace('الدولة الاسلامية', ' ')
    newText = newText.replace('سكر', ' ')
    n2 = []
    for word in newText.split():
        if len(word) < 50:
            n2.append(word)
    return ' '.join(n2[:500])

def metadataToVWline(metadata: dict, positive: bool):
    ignored_keys = ['mediatype', 'sound', 'color', 'curation']
    data = ''
    if metadata is None:
        print('Found null metadata. Skipping.')
    else:
        for key in metadata:
            if key in ignored_keys:
                continue
            else:
                if(type(metadata[key]) == list):
                    string = ' '.join(metadata[key])
                    data += key + ' ' + proccessText(string) + " |"
                else:
                    data += key + ' ' + proccessText(metadata[key]) + " |"
        #remove last trailing pipe and add new line
        data = data.rstrip('|')
        data += '\n'

        if(positive):
            data = "+1 |" + data
        else:
            data = "-1 |" + data
        
    return data

#     ID = metadata.get('identifier', 'NONE').replace(':', 'COLON').replace('|', 'PIPE')
#     uploader = metadata.get('uploader', 'NONE').replace(':', 'COLON').replace('|', 'PIPE')
#     title = metadata.get('title', 'NONE').replace(':', 'COLON').replace('|', 'PIPE')
#     subject = metadata.get('subject', 'NONE').replace(':', 'COLON').replace('|', 'PIPE')
#     description = metadata.get('description', 'NONE').replace(':', 'COLON').replace('|', 'PIPE')
#     publishDate = metadata.get('publishdate', 'NONE').replace(':', 'COLON').replace('|', 'PIPE')
#     addedDate = metadata.get('addeddate', 'NONE').replace(':', 'COLON').replace('|', 'PIPE')
#     collection = metadata.get('collection', 'NONE').replace(':', 'COLON').replace('|', 'PIPE')
#     ID = metadata['identifier']
#     uploader = metadata['uploader']
#     title = metadata['title']
#     subject = metadata['subject']
#     description = metadata['description']
#     publishDate = metadata['publicdate']
#     addedDate = metadata['addeddate']
#     collection = metadata['collection']
#     print(metadata)
    
#     data = "identifier: ".join(ID) + " | uploader: ".join(uploader) + " | title: ".join(title) + " | subject: "
#     data = data + ''.join(subject) + " | description".join(description) + " | publishDate: ".join(publishDate)
#     data = data + " | addedDate: ".join(addedDate) + " | collection: ".join(collection) + "\n"


def writeVWlinesToFile(vwLines, file):
    print("opening Vwneg.txt file to write metadata")
    print("Writing lines to file...")
    for line in vwLines:
        file.write(line)
    print("Closing file...")
    file.close()
    print("File closed!")


In [27]:
posVwLines = []
for meta in posMetadata:
    posVwLines.append(metadataToVWline(meta, True))
    
ISIS_Metadata_vw_file = open('data/ISIS/VWpos.txt', 'w')
writeVWlinesToFile(posVwLines, ISIS_Metadata_vw_file)

opening Vwneg.txt file to write metadata
Writing lines to file...
Closing file...
File closed!


A sample (~40 items) of this collection was inspected to determine the accuracy of the assumption (how many items in the sample are *actually* ISIS content that violates the terms of service). The accuracy of our assumption was recorded as


$AccViol = \frac{v(s)}{l(s)}$

where $v(s)$ is the number of items violating the terms of service in the sample $s$ and $l(s)$ is the number of items in $s$
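The ratio above can be computed from a list of manual review decisions. The following is a minimal sketch: `draw_sample` and `sample_accuracy` are hypothetical helper names, and the review decisions themselves still have to come from a person reading the items.

```python
import random


def draw_sample(items, size, seed=1234):
    """Pick `size` items at random for manual review (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(items, min(size, len(items)))


def sample_accuracy(labels):
    """Fraction of reviewed items judged to violate the terms of service.

    `labels` holds one boolean per item in the sample s,
    so sum(labels) is v(s) and len(labels) is l(s).
    """
    if not labels:
        raise ValueError("the sample is empty")
    return sum(labels) / len(labels)
```

For instance, a review in which 27 of 30 sampled items were judged to be violations would give `sample_accuracy([True] * 27 + [False] * 3)`, which is 27/30.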

In [None]:
# import random
# random.shuffle(metadata)
# AccViol = 27/30


Main points:
- get negative data
- run it all through VW
- get a process to get new data (IDs + metadata)

##### Arabic-non-ISIS Collection

Another collection of ~5,000 items that included general Arabic words was collected. This collection represents items that include Arabic text but do not violate the terms of service. A general keyword was needed for searching content not related to ISIS. سكر (sugar) was chosen. This keyword surprisingly collected a lot of Quran recitations, which is good since it means the data will force the model not to discriminate based on religious text.

In [None]:
! mkdir data/Arabic-non-ISIS
! mkdir data/Arabic-non-ISIS/metadata
! echo "Number of matching searches found : "
! ia search "سكر" -i -n 
! echo "Downloading item identifiers..."
! ia search "سكر" -i > data/Arabic-non-ISIS/existingIDs.txt
! echo "Number of item identifiers downloaded:"
! wc -l data/Arabic-non-ISIS/existingIDs.txt

**SINCE THERE COULD BE A LOT OF METADATA TO DOWNLOAD, PLEASE FEEL FREE TO STOP THIS KERNEL PROCESS MANUALLY WHEN YOU THINK YOU HAVE ENOUGH DATA**

In [None]:
Arabic_non_ISIS_ID_file = open('data/Arabic-non-ISIS/existingIDs.txt', 'r')
negIDs = getIDsFromFile(Arabic_non_ISIS_ID_file)
downloadMetaFiles(negIDs, "data/Arabic-non-ISIS/metadata/")

Now we read the metadata that we downloaded into a list.

In [28]:
negMetadata = readMetaTextInDirectory('data/Arabic-non-ISIS/metadata')
len(negMetadata)

Error reading JSON from data/Arabic-non-ISIS/metadata/MA2436724357324747374A12.txt


1826

We process the items in this list into lines readable by vw, and then we write these lines to a file for backup and inspection.

In [29]:
negVwLines = []
for meta in negMetadata:
    negVwLines.append(metadataToVWline(meta, False))
    
# Arabic_metadata_VW_file = open('data/Arabic-non-ISIS/VWneg.txt', 'w')
# writeVWlinesToFile(negVwLines, Arabic_metadata_VW_file)

Found null metadata. Skipping.


A sample (~40 items) of this collection was inspected to determine the accuracy of the assumption (how many items in the sample are *actually* general Arabic content that does not violate the terms of service). The accuracy of our assumption was recorded as $AccAra$

The lists of negative and positive examples are combined and shuffled.

In [30]:
allVWLines = posVwLines + negVwLines
import random
random.seed(1234)
random.shuffle(allVWLines)
len(allVWLines)

2889

Then we split the shuffled VW lines into training and testing data: the first 1,400 lines for training and the remainder for testing. We write each to its own file. *sentiment.tr* will be our training data, and *sentiment.te* will be our testing data.

In [31]:
def writeToVWFile(filename, examples):
    with open(filename, 'w') as h:
        for ex in examples:
            h.write(ex)
writeToVWFile('data/sentiment.tr', allVWLines[:1400])
writeToVWFile('data/sentiment.te', allVWLines[1400:])
!wc -l data/sentiment.tr data/sentiment.te

    1400 data/sentiment.tr
    1488 data/sentiment.te
    2888 total
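The list holds 2889 entries but `wc` counts 2888 lines. A likely explanation is that `metadataToVWline` returns an empty string for the one null metadata entry, which writes nothing to the file. The following is a minimal sketch of a split that drops such entries first; `split_examples` is a hypothetical helper, not part of the notebook.

```python
def split_examples(lines, n_train):
    """Drop empty lines, then split into (training, testing) lists."""
    usable = [line for line in lines if line.strip()]
    return usable[:n_train], usable[n_train:]


# Hypothetical usage with the notebook's own names:
#   train, test = split_examples(allVWLines, 1400)
#   writeToVWFile('data/sentiment.tr', train)
#   writeToVWFile('data/sentiment.te', test)
```

Filtering before slicing guarantees that the training file really contains `n_train` examples, at the cost of shifting the split point by one when an empty entry falls in the first part.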


Now we run the model against our training data and cross our fingers that it learns something.

In [32]:
!vw --binary data/sentiment.tr

Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = data/sentiment.tr
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
0.000000 0.000000            1            1.0  -1.0000  -1.0000      554
0.500000 1.000000            2            2.0   1.0000  -1.0000      172
0.500000 0.500000            4            4.0   1.0000  -1.0000       24
0.250000 0.000000            8            8.0  -1.0000  -1.0000      528
0.125000 0.000000           16           16.0   1.0000   1.0000      100
0.093750 0.062500           32           32.0  -1.0000  -1.0000      532
0.062500 0.031250           64           64.0  -1.0000  -1.0000      519
0.031250 0.000000          128          128.0   1.0000   1.0000       49
0.019531 0.007812          256          256.0  -1.0000  -1.0000      533
0.015625 0.011719          512          512.0  -1.0000  -1.0

Now that we see what it looks like, it is time to train vw over the data sufficiently (with many passes) and produce a model file that will be used to make predictions against the testing data.

In [33]:
! vw --binary data/sentiment.tr --passes 20 -c -k -f data/sentiment.model

final_regressor = data/sentiment.model
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = data/sentiment.tr.cache
Reading datafile = data/sentiment.tr
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
0.000000 0.000000            1            1.0  -1.0000  -1.0000      554
0.500000 1.000000            2            2.0   1.0000  -1.0000      172
0.500000 0.500000            4            4.0   1.0000  -1.0000       24
0.250000 0.000000            8            8.0  -1.0000  -1.0000      528
0.125000 0.000000           16           16.0  -1.0000  -1.0000      526
0.093750 0.062500           32           32.0  -1.0000  -1.0000      535
0.062500 0.031250           64           64.0  -1.0000  -1.0000      524
0.031250 0.000000          128          128.0   1.0000   1.0000       23
0.023438 0.015625          256         

This produces a model in data/sentiment.model! Notice that the **average loss** is very low by the time the model is finally created, which suggests that the extra passes improved the model, and that on its last pass through the training data it could predict (presumably) all of the items correctly.
Since we have a model, it is now time to run it against the testing data we left out.

In [34]:
! vw --binary -t -i data/sentiment.model -p data/sentiment.te.pred data/sentiment.te

only testing
predictions = data/sentiment.te.pred
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = data/sentiment.te
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
0.000000 0.000000            1            1.0  -1.0000  -1.0000      526
0.000000 0.000000            2            2.0   1.0000   1.0000       29
0.000000 0.000000            4            4.0  -1.0000  -1.0000      544
0.000000 0.000000            8            8.0  -1.0000  -1.0000      531
0.000000 0.000000           16           16.0  -1.0000  -1.0000      530
0.031250 0.062500           32           32.0  -1.0000  -1.0000      528
0.031250 0.031250           64           64.0  -1.0000  -1.0000      531
0.015625 0.000000          128          128.0   1.0000   1.0000       30
0.011719 0.007812          256          256.0  -1.0000  -1.0000      539
0.015625 0

The average loss is only 0.011, which means that the model is 98.9% accurate (with --binary, the loss is the fraction of misclassified examples)! Since this is a little suspicious, let us look at what the most predictive features were.
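The accuracy figure can also be cross-checked directly from the prediction file. The following is a minimal sketch with hypothetical helper names; it assumes that the file written by `-p` holds one prediction per line with the predicted value as the first field, and that each test line starts with its label.

```python
def read_labels(vw_path):
    """Return the leading label of each non-empty VW example line as a float."""
    with open(vw_path, encoding="utf-8") as handle:
        return [float(line.split(None, 1)[0]) for line in handle if line.strip()]


def read_predictions(pred_path):
    """Return the first field of each non-empty prediction line as a float."""
    with open(pred_path, encoding="utf-8") as handle:
        return [float(line.split()[0]) for line in handle if line.strip()]


def accuracy(labels, predictions):
    """Fraction of examples whose predicted sign matches the label sign."""
    if len(labels) != len(predictions):
        raise ValueError("label and prediction counts differ")
    if not labels:
        raise ValueError("no examples to score")
    correct = sum((l > 0) == (p > 0) for l, p in zip(labels, predictions))
    return correct / len(labels)


# Hypothetical usage:
#   accuracy(read_labels('data/sentiment.te'),
#            read_predictions('data/sentiment.te.pred'))
```

One minus this value should match the final average loss that vw reports for the test run.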

In [35]:
!vw -i data/sentiment.model -t --invert_hash data/sentiment.model.readable data/sentiment.tr --quiet

As quoted from the GettingTheMost vwnlp notebook: this says "start from that pre-trained model; go into test mode (so that you don't adjust any of the weights of the model); store the resulting readable model (--invert_hash) into the specified file; and read from data/sentiment.tr (you have to re-read from the same training data).
We can now look at data/sentiment.model.readable to see what's going on."

After making this readable copy of the model, we need to inspect it. To do that, we skip the header lines with tail -n+13 and sort the features by weight (details in the notebook quoted above). We look at the 30 most predictive features by passing -n30 to head at the end.

In [36]:
!cat data/sentiment.model.readable  | tail -n+13 | sort -t: -k3nr | head -n30

collection^opensource_movies:102092:0.130581
description^باقية:220112:0.130282
uploader^mail.ru:184078:0.110775
collection^loggedin:13007:0.097585
scanner^Archive:202608:0.088549
scanner^HTML5:130839:0.088549
scanner^Internet:229369:0.088549
scanner^Uploader:37599:0.088549
scanner^1.6.1:93547:0.086958
uploader^oma_222:64648:0.084879
title^في:64074:0.084830
subject^العراق:205019:0.079868
language^eng:144814:0.074802
description^ولاية:255536:0.071500
subject^Islam:246233:0.066713
year^2014:139254:0.065090
title^من:174608:0.064924
uploader^gmail.com:184177:0.062123
subject^shamikh1.info/vb;:43030:0.061989
creator^ابو:151298:0.061721
subject^ولاية:246401:0.061025
uploader^fofo.bobo.82:169643:0.059059
title^ولاية:23670:0.057314
subject^في:202856:0.055348
uploader^rock201110:34556:0.054932
collection^iraq_middleeast:218420:0.053134
collection^iraq_war:117857:0.053134
collection^newsandpublicaffairs:164064:0.053134
creator^المهاجر:132187:0.052726
description^بغداد:143779:0.052462
sort: write 
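The same ranking can be sketched in Python, which avoids depending on the exact number of header lines. This is a minimal sketch under assumptions: `top_features` is a hypothetical helper, feature lines are taken to have the form `name:hash:weight` as in the listing above, and the two header lines in the sample are invented stand-ins for whatever header the readable model really has.

```python
def top_features(lines, n=30):
    """Return the n (name, hash, weight) triples with the largest weights."""
    features = []
    for line in lines:
        fields = line.strip().rsplit(":", 2)
        if len(fields) != 3:
            continue  # header or malformed line
        name, hash_text, weight_text = fields
        try:
            features.append((name, int(hash_text), float(weight_text)))
        except ValueError:
            continue  # not a 'name:hash:weight' feature line
    return sorted(features, key=lambda f: f[2], reverse=True)[:n]


sample = [
    "Version 8.1.1",  # invented header line
    "bits:18",        # invented header line
    "scanner^HTML5:130839:0.088549",
    "description^باقية:220112:0.130282",
    "collection^opensource_movies:102092:0.130581",
]
for name, _, weight in top_features(sample, n=2):
    print(name, weight)
```

Sorting in descending order shows the features that push hardest toward the positive (ISIS) label; sorting in ascending order would show the features most associated with the negative class.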

## Removing Scanner
Since we are not particularly interested in which scanner was used to scan the items when they were added to the Internet Archive, we do not expect the scanner namespace to be an important one, so we would like to remove it. Furthermore, we don't want the model to be biased against items in one collection or another, so we remove that namespace as well. The redefined function below also ignores the creator namespace. We remove these namespaces by ignoring them when we are processing the metadata, so we redefine that function.

In [37]:
def metadataToVWline(metadata: dict, positive: bool):
    ignored_keys = ['mediatype', 'sound', 'color', 'curation', 'collection', 'creator', 'scanner']
    data = ''
    if metadata is None:
        print('Found null metadata. Skipping.')
    else:
        for key in metadata:
            if key in ignored_keys:
                continue
            else:
                if(type(metadata[key]) == list):
                    string = ' '.join(metadata[key])
                    data += key + ' ' + proccessText(string) + " |"
                else:
                    data += key + ' ' + proccessText(metadata[key]) + " |"
        #remove last trailing pipe and add new line
        data = data.rstrip('|')
        data += '\n'

        if(positive):
            data = "+1 |" + data
        else:
            data = "-1 |" + data
        
    return data
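A quick sanity check after a change like this is to list the namespaces that actually appear in a generated line. The following is a minimal sketch; `namespaces_in_line` is a hypothetical helper that assumes lines of the form `label |ns tokens |ns tokens`, as produced above.

```python
def namespaces_in_line(vw_line):
    """List the namespace names in a VW line built as 'label |ns tokens |ns tokens'."""
    sections = vw_line.strip().split("|")[1:]  # drop the label before the first pipe
    return [section.split()[0] for section in sections if section.split()]


# Hypothetical usage with the notebook's own function:
#   line = metadataToVWline(posMetadata[0], True)
#   assert "scanner" not in namespaces_in_line(line)
```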

We re-process the metadata and run it through vw.

In [38]:
posVwLines = []
for meta in posMetadata:
    posVwLines.append(metadataToVWline(meta, True))
negVwLines = []
for meta in negMetadata:
    negVwLines.append(metadataToVWline(meta, False))
allVWLines = posVwLines + negVwLines
import random
random.seed(1234)
random.shuffle(allVWLines)
len(allVWLines)
writeToVWFile('data/sentiment.tr', allVWLines[:1400])
writeToVWFile('data/sentiment.te', allVWLines[1400:])
! vw --binary data/sentiment.tr --passes 20 -c -k -f data/sentiment.model
!vw -i data/sentiment.model -t --invert_hash data/sentiment.model.readable data/sentiment.tr --quiet
!cat data/sentiment.model.readable  | tail -n+13 | sort -t: -k3nr | head -n30

Found null metadata. Skipping.
final_regressor = data/sentiment.model
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = data/sentiment.tr.cache
Reading datafile = data/sentiment.tr
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
0.000000 0.000000            1            1.0  -1.0000  -1.0000      548
0.500000 1.000000            2            2.0   1.0000  -1.0000      168
0.250000 0.000000            4            4.0   1.0000   1.0000       18
0.125000 0.000000            8            8.0  -1.0000  -1.0000      522
0.125000 0.125000           16           16.0  -1.0000  -1.0000      520
0.093750 0.062500           32           32.0  -1.0000  -1.0000      529
0.062500 0.031250           64           64.0  -1.0000  -1.0000      518
0.031250 0.000000          128          128.0   1.0000   1.0000       17
0.023438

In [None]:
print("Placeholder code")
! ia search 'الصداقة' -i -n
! ia search 'سكر' -i -n

AccAra = 0