# Natural Language Processing of ISIS Content
This notebook documents the efforts of **Kamal Kamalaldin** and **Will Fitzgerald** to filter ISIS content that violates the terms of service of the Internet Archive (www.archive.org).

## Requirements
1. **Python 3+** (current version is 3.5). This should be straightforward to [install](https://www.python.org/downloads/).
2. [**pip3**](https://pip.pypa.io/en/stable/installing/) or [**homebrew**](http://brew.sh/) command line tools. We will use these tools to install other, more essential tools.
3. **The Internet Archive command line tools**, referred to throughout this notebook as *ia*. Here we use pip to install *ia*, but it can also be installed with either of the command line tools listed above. Because of some issues with importing modules in jupyter (IPython notebook), **if you plan on using this notebook to run all the code**, the *ia* tools must be installed in a directory in which jupyter can find the module. Run the pip3 command in the cell below before continuing.
    - using homebrew: "brew install internetarchive"
    - using pip3: "pip3 install internetarchive"
4. [**Vowpal Wabbit**](http://hunch.net/~vw/). On macOS, this can be installed using homebrew. On Windows, perhaps [scoop](http://scoop.sh/) will work.
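
Before going further, it can help to confirm that the command line tools are visible from the environment that runs this notebook. The following is a minimal sketch, assuming the executables are named `ia` and `vw` (adjust the names if your installation differs); the helper `missing_tools` is a hypothetical name introduced here for illustration.

```python
import shutil
import sys


def missing_tools(tools=("ia", "vw")):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [name for name in tools if shutil.which(name) is None]


print("Python version:", sys.version.split()[0])
missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All command line tools were found.")
```

A tool reported as missing may still be installed in a location that is not on the PATH of the jupyter process, so treat the result as a hint rather than a verdict.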

In [94]:
! python3 -V

Python 3.5.1


In [95]:
# Edit the target path below (the part in capitals) before executing this cell.
! pip3 install internetarchive -t /PATH/TO/CURRENT/ANACONDA/ENVIRONMENT/lib/python3.5/site-packages/internetarchive



In [96]:
import internetarchive as ia

## Data collection
At the beginning of our endeavour, we attempted to categorize Arabic item entries into those violating the terms of service and those that do not. Items can violate the terms of service if they are "deemed offensive, disturbing, pornographic, racist, sexist, bizarre, misleading, fraudulent, or otherwise objectionable" (Terms of Service, Internet Archive). As Kamal was the only person knowledgeable enough in the Arabic language to understand and categorize items, it quickly became apparent that the rate at which items were categorized was too slow to obtain sufficient data to train a model. Therefore, another approach was considered.

For the data portion, a directory tree was created.

In [97]:
! mkdir data

mkdir: data: File exists


##### ISIS Collection

A collection of ~5,000 items that mention ISIS/ISIL or their Arabic counterparts was gathered. This collection represents items that are ISIS related and are assumed to be malicious.

First, we search for items containing "الدولة الاسلامية", the Arabic equivalent of "Islamic State". We first examine the number of items returned, then download their identifiers.

In [98]:
! mkdir data/ISIS
! echo "Number of matching searches found : "
! ia search "الدولة الاسلامية" -i -n 
! echo "Downloading item identifiers..."
! ia search "الدولة الاسلامية" -i > data/ISIS/IDList.txt
! echo "Number of item identifiers downloaded:"
! wc -l data/ISIS/IDList.txt

mkdir: data/ISIS: File exists
Number of matching searches found : 
1051
Downloading item identifiers...
Number of item identifiers downloaded:
    1051 data/ISIS/IDList.txt


If you encounter an error running these commands, ensure that you have the latest version of *ia* installed (See **Requirements** above). 

The -i flag specifies that we are only interested in the identifier of each item, not the whole item (we will download those later). The -n flag specifies that we only want to know the number of search results for now.
The second search command writes the identifiers to a file called IDList.txt in the data/ISIS directory. The wc -l command then counts the lines of the file we just created, which should match the number of search results.
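
The same line count can be checked from Python, which also makes it easy to spot blank lines or repeated identifiers. This is only a sketch: `summarize_id_lines` is a hypothetical helper, and the toy input stands in for the real contents of the ID list.

```python
def summarize_id_lines(lines):
    """Count non-blank identifiers and report how many of them are unique."""
    ids = [line.strip() for line in lines if line.strip()]
    return {"total": len(ids), "unique": len(set(ids))}


# Toy input standing in for the contents of data/ISIS/IDList.txt
example_lines = ["item-a\n", "item-b\n", "\n", "item-a\n"]
print(summarize_id_lines(example_lines))
```

To run it on the real list, pass an open file object for data/ISIS/IDList.txt in place of `example_lines`.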

We now extract the IDs from the file and download the metadata for each item with that identifier.
**This is currently a time-consuming procedure. Further improvements can be made by downloading the metadata files separately and then manipulating them.**

In [99]:
def getIDsFromFile():
    """Read the item identifiers from the ID list, one per line."""
    print("Opening ID file...")
    # The context manager closes the file even if reading fails
    with open('data/ISIS/IDList.txt', 'r') as ISIS_ID_file:
        print("Reading IDs from file...")
        IDs = [line.rstrip() for line in ISIS_ID_file]
        print("Finished reading IDs from file. Closing file...")
    print("ID File closed!")
    return IDs

def getItemsMetadata(item_identifiers):
    """Download the metadata dictionary of every item in item_identifiers."""
    metadata_list = []
    print("Downloading " + str(len(item_identifiers)) + " item metadatas...")
    # Count from 1 so the progress message reports the items completed
    for count, ID in enumerate(item_identifiers, start=1):
        item = ia.get_item(ID)
        metadata_list.append(item.metadata)
        if count % 31 == 0:
            print('downloaded ' + str(count) + ' metadata files so far!')
    print("Successfully downloaded " + str(len(metadata_list)) + " metadatas!")
    return metadata_list

IDs = getIDsFromFile()
metadata_list = getItemsMetadata(IDs)

Opening ID file...
Reading IDs from file...
Finished reading IDs from file. Closing file...
ID File closed!
Downloading 1051 item metadatas...
downloaded 31 metadata files so far!
downloaded 62 metadata files so far!
downloaded 93 metadata files so far!
downloaded 124 metadata files so far!
downloaded 155 metadata files so far!
downloaded 186 metadata files so far!
downloaded 217 metadata files so far!
downloaded 248 metadata files so far!
downloaded 279 metadata files so far!
downloaded 310 metadata files so far!
downloaded 341 metadata files so far!
downloaded 372 metadata files so far!
downloaded 403 metadata files so far!
downloaded 434 metadata files so far!
downloaded 465 metadata files so far!
downloaded 496 metadata files so far!
downloaded 527 metadata files so far!
downloaded 558 metadata files so far!
downloaded 589 metadata files so far!
downloaded 620 metadata files so far!
downloaded 651 metadata files so far!
downloaded 682 metadata files so far!
downloaded 713 metadata
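
The note above about this step being time consuming suggests keeping the metadata on disk. One possible approach, sketched here under stated assumptions, is to cache each item's metadata as a JSON file so that a repeated run only downloads what is missing. The helper `load_or_fetch_metadata`, its `fetch` parameter, and the stand-in `fake_fetch` are hypothetical names; the sketch also assumes that identifiers are safe to use as file names.

```python
import json
import os
import tempfile


def load_or_fetch_metadata(identifiers, fetch, cache_dir):
    """Return one metadata dict per identifier, reusing cached JSON files."""
    os.makedirs(cache_dir, exist_ok=True)
    metadata_list = []
    for identifier in identifiers:
        cache_path = os.path.join(cache_dir, identifier + ".json")
        if os.path.exists(cache_path):
            # Already downloaded in an earlier run: read it from disk
            with open(cache_path, "r", encoding="utf-8") as cache_file:
                metadata = json.load(cache_file)
        else:
            # Not cached yet: fetch it once and store it for later runs
            metadata = fetch(identifier)
            with open(cache_path, "w", encoding="utf-8") as cache_file:
                json.dump(metadata, cache_file, ensure_ascii=False)
        metadata_list.append(metadata)
    return metadata_list


# Demonstration with a stand-in fetch function (no network access needed)
calls = []


def fake_fetch(identifier):
    calls.append(identifier)
    return {"identifier": identifier, "title": "example"}


with tempfile.TemporaryDirectory() as cache_dir:
    first = load_or_fetch_metadata(["item-a", "item-b"], fake_fetch, cache_dir)
    second = load_or_fetch_metadata(["item-a", "item-b"], fake_fetch, cache_dir)

# The second pass is served from the cache, so fetch ran once per identifier
print(len(calls))
```

In this notebook, `fetch` could be something like `lambda ID: ia.get_item(ID).metadata`, with `cache_dir` pointing at a directory under data/ISIS (the exact directory name is left open here).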

This list of metadata will be the actual data points which we will process, develop our model with, and test against!
We need to get it in VW format first. To see what that looks like, skim through the [input format page](https://github.com/JohnLangford/vowpal_wabbit/wiki/Input-format). In a nutshell, each data point (metadata entry) must be on its own line. Each line must begin with a +1 or -1 to signify whether it is a positive or negative example (ISIS or not ISIS, respectively, in this case), followed by a pipe "|" that separates each namespace and its features. The format we use for namespaces and features is: "namespace1: feature1 |namespace2: feature2 |namespace3: feature3...\n". "|" and ":" are reserved by VW, so we must distinguish their occurrence in the metadata from their occurrence in the VW format. We replace "|" with "PIPE" and ":" with "COLON".
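
To make the replacement rule concrete, here is a minimal sketch applied to a toy value. The helper name `escape_for_vw` and the sample title are made up for illustration and are not taken from the collected metadata.

```python
def escape_for_vw(text):
    """Replace the characters reserved by VW with plain-text tokens."""
    return text.replace(':', 'COLON').replace('|', 'PIPE')


# Toy value, not taken from the collected metadata
title = "News | 12:30 report"
print(escape_for_vw(title))  # prints "News PIPE 12COLON30 report"
```

Neither replacement token contains a reserved character, so the order of the two `replace` calls does not matter.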

In [105]:


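def metadataToVWline(metadata: dict, positive: bool):
    """Convert one metadata dictionary into a single VW-formatted line."""
    ignored_keys = ['mediatype', 'sound', 'color', 'curation']
    data = ''
    for key in metadata:
        if key in ignored_keys:
            continue
        value = metadata[key]
        # Multi-valued fields arrive as lists; join them into one string
        if isinstance(value, list):
            value = ' '.join(value)
        # ':' and '|' are reserved by VW, so replace them inside the values
        value = value.replace(':', 'COLON').replace('|', 'PIPE')
        # Collapse line breaks so that each example stays on a single line
        value = ' '.join(value.split())
        data += key + ': ' + value + " |"
    # Remove the trailing pipe and add a new line
    data = data.rstrip('|')
    data += '\n'

    # Prefix the label: +1 for a positive example, -1 for a negative one
    if positive:
        data = "+1 | " + data
    else:
        data = "-1 | " + data

    return data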
#     ID = metadata.get('identifier', 'NONE').replace(':', 'COLON').replace('|', 'PIPE')
#     uploader = metadata.get('uploader', 'NONE').replace(':', 'COLON').replace('|', 'PIPE')
#     title = metadata.get('title', 'NONE').replace(':', 'COLON').replace('|', 'PIPE')
#     subject = metadata.get('subject', 'NONE').replace(':', 'COLON').replace('|', 'PIPE')
#     description = metadata.get('description', 'NONE').replace(':', 'COLON').replace('|', 'PIPE')
#     publishDate = metadata.get('publishdate', 'NONE').replace(':', 'COLON').replace('|', 'PIPE')
#     addedDate = metadata.get('addeddate', 'NONE').replace(':', 'COLON').replace('|', 'PIPE')
#     collection = metadata.get('collection', 'NONE').replace(':', 'COLON').replace('|', 'PIPE')
#     ID = metadata['identifier']
#     uploader = metadata['uploader']
#     title = metadata['title']
#     subject = metadata['subject']
#     description = metadata['description']
#     publishDate = metadata['publicdate']
#     addedDate = metadata['addeddate']
#     collection = metadata['collection']
#     print(metadata)
    
#     data = "identifier: ".join(ID) + " | uploader: ".join(uploader) + " | title: ".join(title) + " | subject: "
#     data = data + ''.join(subject) + " | description".join(description) + " | publishDate: ".join(publishDate)
#     data = data + " | addedDate: ".join(addedDate) + " | collection: ".join(collection) + "\n"


def writeVWlinesToFile(vwLines):
    print("opening Vwneg.txt file to write metadata")
    # The context manager closes the file even if a write fails;
    # utf-8 is set explicitly because the metadata contains Arabic text
    with open('data/ISIS/VWneg.txt', 'w', encoding='utf-8') as ISIS_Metadata_vw_file:
        print("Writing lines to Vwneg.txt ...")
        ISIS_Metadata_vw_file.writelines(vwLines)
        print("Closing Vwneg.txt file...")
    print("File Vwneg.txt closed!")


# Every item in this collection is treated as a positive (+1) example
vwLines = [metadataToVWline(meta, True) for meta in metadata_list]
writeVWlinesToFile(vwLines)

    

opening Vwneg.txt file to write metadata
Writing lines to Vwneg.txt ...
Closing Vwneg.txt file...
File Vwneg.txt closed!
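
Since each example must sit on its own line and begin with a label, a quick sanity check on the written lines can catch formatting problems early, for instance a value whose line break split one example in two. The sketch below uses a hypothetical helper, `count_vw_labels`, and toy lines in the shape described above.

```python
from collections import Counter


def count_vw_labels(lines):
    """Count the labels and raise ValueError on a line without +1 or -1."""
    counts = Counter()
    for number, line in enumerate(lines, start=1):
        # Everything before the first pipe should be the label
        label = line.split('|', 1)[0].strip()
        if label not in ('+1', '-1'):
            raise ValueError("line " + str(number) + " has no +1/-1 label")
        counts[label] += 1
    return counts


# Toy lines in the same shape as the ones written above
example = ["+1 | title: first item \n", "-1 | title: second item \n"]
print(count_vw_labels(example))
```

Passing an open file object for the written file in place of `example` would run the same check on the real data.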


A sample (~40 items) of this collection was inspected to determine the accuracy of the assumption (how many items in the sample are *actually* ISIS content that violates the terms of service). The accuracy of our assumption was recorded as $AccViol$.

In [101]:
print("Placeholder code")

AccViol = 0

Placeholder code
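
Until the placeholder above is filled in, one way the sampling and the accuracy figure could be computed is sketched here. It assumes the accuracy is simply the fraction of inspected items for which the assumption held; `draw_sample`, `sample_accuracy`, and the toy labels are hypothetical and do not reflect real inspection results.

```python
import random


def draw_sample(identifiers, size=40, seed=0):
    """Pick up to `size` identifiers at random for manual inspection."""
    identifiers = list(identifiers)
    rng = random.Random(seed)  # fixed seed so the same sample can be redrawn
    return rng.sample(identifiers, min(size, len(identifiers)))


def sample_accuracy(labels):
    """Fraction of inspected items for which the assumption held."""
    labels = list(labels)
    if not labels:
        raise ValueError("no items were inspected")
    return sum(1 for held in labels if held) / len(labels)


# Hypothetical inspection results: True means the assumption held for the item
toy_labels = [True, True, False, True]
print(sample_accuracy(toy_labels))  # prints 0.75
```

The same two helpers would apply to any other collection that is checked by manual inspection of a sample.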


##### Arabic-non-ISIS Collection

Another collection of ~5,000 items that included general Arabic words was gathered. This collection represents items that include Arabic text but do not violate the terms of service.

In [102]:
print("Placeholder Code")

Placeholder Code


A sample (~40 items) of this collection was inspected to determine the accuracy of the assumption (how many items in the sample are *actually* general Arabic content that does not violate the terms of service). The accuracy of our assumption was recorded as $AccAra$.

In [103]:
print("Placeholder code")

AccAra = 0

Placeholder code
