# DeepCheck - Smarter Gun Background Checks
Introducing a smart and robust background check.

## The Current Process of U.S. Firearm Checks
1) Firearm Buyer: Fills out an ATF Form 4473 with `name`, `age`, `address`, `place of birth`, `race`, `citizenship`, and `Social Security (optional)`, and answers the following questions:
  - Have you ever been convicted of a felony?
  - Have you ever been convicted of a misdemeanor crime of domestic violence?
  - Are you an unlawful user of, or addicted to, marijuana or any other depressant, stimulant, narcotic drug, or any other controlled substance?
  - Are you a fugitive from justice?
  - Have you ever been committed to a mental institution?

2) Firearm Seller: Submits the information to the FBI via a toll-free phone line or over the internet, and the agency checks the applicant's information against its databases.

3) FBI: Conducts background check with the submitted form (can take minutes). The FBI will deny a claim to Fire

*Source: https://www.cnn.com/2018/02/15/us/gun-background-checks-florida-school-shooting/index.html*


## The Purpose of DeepCheck
DeepCheck builds upon the parameters already used in background checks and adds a candidate's digital interactions on social media to further assess their risk of improper firearm use. The social media platform we currently use is Twitter. DeepCheck gathers a candidate's recent (up to 500) status updates (tweets), retweets, and favorites. It then runs a natural language processing algorithm that applies sentiment analysis to spotlight offensive language, hate speech, and any encouragement of violent crime. The purpose of this in-depth analysis is to catch hidden sentiments or motives in individuals who have no prior history with crime or law enforcement.

## Dataset and Feature Engineering

The dataset and feature engineering methods were obtained from the following publication:

Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. "Automated Hate Speech Detection and the Problem of Offensive Language." Proceedings of the 11th International Conference on Web and Social Media (ICWSM).

Git Repo: https://github.com/t-davidson/hate-speech-and-offensive-language


# Setup

## Importing Libraries

In [1]:
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# IMPORT AZURE LIBRARIES
# Azure Notebook Libraries
import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
import logging

# IMPORT DATA SCIENCE LIBRARIES
import pandas as pd 
import scipy
import numpy as np
import csv
from sklearn.datasets import load_svmlight_file
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

## Accessing the Azure Workspace

In [2]:
# Load workspace
from azureml.core import Workspace

ws = Workspace.from_config()

Found the config file in: C:\Users\house\Documents\GitHub\config.json


# Creating an Experiment

In [3]:
# Choose a name for the experiment and specify the project folder.
from azureml.core.experiment import Experiment

experiment_name = 'hatespeech_detection'
project_folder = './hatespeech_project'

experiment = Experiment(ws, experiment_name)

# Preprocess Data

In [4]:
df = pd.read_csv('text_data.csv',encoding='utf-8')

In [5]:
import matplotlib.pyplot as plt
print('class imbalance:')
# Normalize by the number of rows, not by the sum of the original class labels
df['new_class'].value_counts() / df['new_class'].count()

class imbalance:


1   0.88
0   0.12
Name: new_class, dtype: float64

Because of this class imbalance, we chose to downsample class 1 (the majority class) to match the size of class 0 (innocuous tweets).

In [5]:
print('Imbalanced Class Frequencies:')
print(df['new_class'].value_counts() / df['new_class'].count())
print('First count: %i' % df['class'].count())

# Separate the minority (0) and majority (1) classes
df_0 = df[df['new_class']==0]
df_1 = df[df['new_class']==1]

# Number of observations in the minority class
n_class0 = len(df_0)

# Downsample the majority class to match the minority class
df_1_downsampled = resample(df_1, 
                            replace=False,          # sample without replacement
                            n_samples=n_class0,     # to match minority class
                            random_state=123)       # reproducible results

# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_1_downsampled, df_0])
 
# Display new class counts
print('\nNew Class Frequencies:')
print(df_downsampled['new_class'].value_counts() / df_downsampled['new_class'].count())
print('Updated count: %i' % df_downsampled['new_class'].count())


Imbalanced Class Frequencies:
1   0.88
0   0.12
Name: new_class, dtype: float64
First count: 24783

New Class Frequencies:
1   0.50
0   0.50
Name: new_class, dtype: float64
Updated count: 5744
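
The balancing step above can be sketched without pandas as follows. This is a minimal illustration on made-up labels, not the notebook's data; `downsample_majority` and the toy rows are hypothetical names introduced here. It draws, without replacement, as many rows from each class as the smallest class contains.

```python
import random
from collections import Counter


def downsample_majority(rows, label_key="new_class", seed=123):
    """Return a balanced list by sampling every class, without
    replacement, down to the size of the smallest class."""
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    n_min = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed for reproducible results
    balanced = []
    for group in by_label.values():
        # A class that is already at the minimum size keeps all its rows.
        balanced.extend(rng.sample(group, n_min))
    return balanced


# Toy data (hypothetical): 8 rows of class 1 and 2 rows of class 0.
toy_rows = [{"tweet": f"t{i}", "new_class": 1} for i in range(8)]
toy_rows += [{"tweet": f"u{i}", "new_class": 0} for i in range(2)]

balanced_rows = downsample_majority(toy_rows)
class_counts = Counter(row["new_class"] for row in balanced_rows)
print(class_counts)
```

Sampling without replacement discards majority-class rows rather than duplicating minority-class rows, which is why the updated count is smaller than the first count.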


## Feature Extraction
The dataset contains over 20,000 labeled tweets. Most tweets contain special characters and 

In [6]:
tweets = df_downsampled["tweet"]
import feature_extraction

X, vectorizer, tfidf, pos_vectorizer = feature_extraction.get_features(tweets)


## Use Cloud Services

## Upload Data to Datastore
The training data must be uploaded to the datastore so that training can run in the cloud. To do this, the training data is first written locally to `.tsv` files and then uploaded to the datastore with the `ds.upload()` command. Uploading to the datastore is a one-time task.

In [7]:
import os

# store the data in a temporary folder
if not os.path.isdir('data'):
    os.mkdir('data')

# store the get_data() script in the project folder
if not os.path.exists(project_folder):
    os.makedirs(project_folder)

In [8]:
# Split the test train sets 
y = df_downsampled["new_class"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
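
For reference, the sizes produced by this split can be worked out by hand. The sketch assumes that scikit-learn rounds the test count up when `test_size` is a fraction, and it uses the balanced row count of 5744 reported above; `split_sizes` is a hypothetical helper introduced for illustration.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size,
    assuming the test count is rounded up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_train, n_test = split_sizes(5744, 0.33)
print(n_train, n_test)
```

Under that assumption the test set holds 1896 rows, which agrees with the number of predictions reported in the test section (974 + 922).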

In [9]:
pd.DataFrame(X_train).to_csv("data/X_train.tsv", index=False, header=False, quoting=csv.QUOTE_ALL, sep="\t")
pd.DataFrame(y_train).to_csv("data/y_train.tsv", index=False, header=False, sep="\t")
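
A minimal sketch of the same file layout using only the standard library: tab-separated values, no header row, every field quoted. The rows and the `features.tsv` file name are made up for illustration. Reading the file back shows that the delimiter and quoting used for writing are the settings a reader such as `get_data()` has to mirror.

```python
import csv
import os
import tempfile

# Hypothetical feature rows standing in for X_train.
rows = [[0.0, 1.5, 2.0], [3.0, 0.0, 4.25]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "features.tsv")

    # Write: tab delimiter, all fields quoted, no header.
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", quoting=csv.QUOTE_ALL)
        writer.writerows(rows)

    # Read back with the matching delimiter and quote character.
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quotechar='"')
        restored = [[float(value) for value in record] for record in reader]

print(restored == rows)
```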

In [10]:
ds = ws.get_default_datastore()
ds.upload(src_dir='./data', target_path='hatespeech_data', overwrite=True, show_progress=True)

Uploading ./data\X_train.tsv
Uploading ./data\y_train.tsv
Uploaded ./data\y_train.tsv, 1 files out of an estimated total of 2
Uploaded ./data\X_train.tsv, 2 files out of an estimated total of 2


$AZUREML_DATAREFERENCE_aef24cf01c0e47adb29ebe0d34af7878

In [12]:
from azureml.core.runconfig import DataReferenceConfiguration
dr = DataReferenceConfiguration(datastore_name=ds.name, 
                   path_on_datastore='hatespeech_data', 
                   path_on_compute='/tmp/azureml_runs',
                   mode='download', # download files from datastore to compute target
                   overwrite=False)

# Train the Model with Cloud Computing

In [13]:
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget

# Choose a name for your cluster.
amlcompute_cluster_name = "deepcheck"

found = False
# Check if this compute target already exists in the workspace.
cts = ws.compute_targets
if amlcompute_cluster_name in cts and cts[amlcompute_cluster_name].type == 'AmlCompute':
    found = True
    print('Found existing compute target.')
    compute_target = cts[amlcompute_cluster_name]
    
if not found:
    print('Creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = "STANDARD_NC6",
                                                                max_nodes = 6)

    # Create the cluster.
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, provisioning_config)
    
    # Can poll for a minimum number of nodes and for a specific timeout.
    # If no min_node_count is provided, it will use the scale settings for the cluster.
    compute_target.wait_for_completion(show_output = True, min_node_count = None, timeout_in_minutes = 20)
    
    # For a more detailed view of current AmlCompute status, use get_status().

Found existing compute target.


In [14]:
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies

# create a new RunConfig object
conda_run_config = RunConfiguration(framework="python")

# Set compute target to AmlCompute
conda_run_config.target = compute_target
conda_run_config.environment.docker.enabled = True
conda_run_config.environment.docker.base_image = azureml.core.runconfig.DEFAULT_CPU_IMAGE

# set the data reference of the run configuration
conda_run_config.data_references = {ds.name: dr}

cd = CondaDependencies.create(pip_packages=['azureml-sdk[automl]'], conda_packages=['numpy','py-xgboost<=0.80'])
conda_run_config.environment.python.conda_dependencies = cd


In [15]:
%%writefile $project_folder/get_data.py

import pandas as pd

def get_data():
    X_train = pd.read_csv("/tmp/azureml_runs/hatespeech_data/X_train.tsv", delimiter="\t", header=None, quotechar='"')
    y_train = pd.read_csv("/tmp/azureml_runs/hatespeech_data/y_train.tsv", delimiter="\t", header=None, quotechar='"')

    return { "X" : X_train.values, "y" : y_train[0].values }

Overwriting ./hatespeech_project/get_data.py


In [16]:
automl_settings = {
    "iteration_timeout_minutes": 10,
    "iterations": 10,
    "n_cross_validations": 3,
    "primary_metric": 'AUC_weighted',
    "verbosity": logging.INFO
}

automl_config = AutoMLConfig(task = 'classification',
                             debug_log = 'automl_errors.log',
                             path = project_folder,
                             run_configuration=conda_run_config,
                             data_script = project_folder + "/get_data.py",
                             **automl_settings
                            )
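
The primary metric, `AUC_weighted`, is described in the Azure documentation as the per-class AUC averaged with weights proportional to the number of true instances of each class. For a two-class problem like this one, the underlying quantity is the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The sketch below computes that quantity directly on made-up scores; `binary_auc` is a hypothetical helper, not the AutoML implementation.

```python
def binary_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison:
    the share of (positive, negative) pairs ranked correctly,
    counting ties as half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores for illustration.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = binary_auc(labels, scores)
print(auc)
```

The pairwise form is quadratic in the number of examples, so it is only meant to make the definition concrete.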

In [17]:
run = experiment.submit(automl_config, show_output = True)

Running on remote compute: deepcheck
Parent Run ID: AutoML_c7056a49-84c9-4cb9-a1ff-44e2aed23d34
********************************************************************************************************************
ITERATION: The iteration being evaluated.
PIPELINE: A summary description of the pipeline being evaluated.
SAMPLING %: Percent of the training data to sample.
DURATION: Time taken for the current iteration.
METRIC: The result of computing score on the fitted pipeline.
BEST: The best observed score thus far.
********************************************************************************************************************

 ITERATION   PIPELINE                                       SAMPLING %  DURATION      METRIC      BEST
         0   MaxAbsScaler LightGBM                          100.0000    0:01:20       0.9053    0.9053
         1   RobustScaler LightGBM                          100.0000    0:01:19       0.9588    0.9588
         2   RobustScaler LogisticRegression       

In [18]:
run

Experiment,Id,Type,Status,Details Page,Docs Page
hatespeech_detection,AutoML_c7056a49-84c9-4cb9-a1ff-44e2aed23d34,automl,Completed,Link to Azure Portal,Link to Documentation


# Test the Model

In [19]:
# Retrieve the best run and its fitted model according to the primary metric (AUC_weighted)
best_run, fitted_model = run.get_output(iteration=None, metric=None)

In [20]:
# Evaluate the fitted model on the held-out test set.
from azureml.core.model import Model

y_test = np.array(y_test)

predicted = fitted_model.predict(X_test)

unique, counts = np.unique(predicted, return_counts=True)

print(dict(zip(unique, counts)))

from sklearn.metrics import accuracy_score
print('Accuracy score: %.2f' % accuracy_score(y_true=y_test, y_pred=predicted))

pd.DataFrame(zip(y_test,predicted)).to_csv("predictions.csv")

{0: 974, 1: 922}
Accuracy score: 0.91
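
Accuracy alone does not show which kind of error the model makes. The sketch below, on made-up labels rather than the notebook's predictions, counts the four outcome types and derives accuracy, precision, and recall from them; `confusion_counts` is a hypothetical helper introduced here.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(y_true, y_pred))


# Made-up labels: 1 = flagged tweet, 0 = innocuous tweet.
y_true_toy = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_toy = [1, 1, 0, 0, 0, 1, 1, 0]

counts_toy = confusion_counts(y_true_toy, y_pred_toy)
tp = counts_toy[(1, 1)]  # flagged tweets predicted as flagged
fn = counts_toy[(1, 0)]  # flagged tweets the model missed
fp = counts_toy[(0, 1)]  # innocuous tweets wrongly flagged
tn = counts_toy[(0, 0)]  # innocuous tweets predicted as innocuous

accuracy = (tp + tn) / len(y_true_toy)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(accuracy, precision, recall)
```

In a screening setting, false positives (innocuous tweets wrongly flagged) and false negatives carry different costs, so it may be worth reporting both alongside the accuracy score.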


# Register the Model

In [21]:
model = run.register_model('deepcheck')
# Best model is: AutoML54f661779best

Registering model AutoMLc7056a498best


# Predict on Real Twitter Data

In [40]:
# Import the necessary package to process data in JSON format
try:
    import json
except ImportError:
    import simplejson as json

# Import the tweepy library
import tweepy
import pprint
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

pp = pprint.PrettyPrinter(indent=4)

'''
The access tokens are hidden from the open-source code.
Set up tweepy to authenticate with Twitter credentials by defining
the variables that contain the user credentials for the Twitter API:
ACCESS_TOKEN = <YOUR ACCESS TOKEN>
ACCESS_SECRET = <YOUR ACCESS SECRET>
CONSUMER_KEY = <YOUR CONSUMER KEY>
CONSUMER_SECRET = <YOUR CONSUMER SECRET>
'''

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

# Create the API client to connect to Twitter with your credentials
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True, compression=True)
# Fetch the first page of the user's timeline (the API returns at most 200 tweets per request)
status_cursor = tweepy.Cursor(api.user_timeline, screen_name="dishsrivastava", count=1000)
status_list = status_cursor.iterator.next()

# Collect the text of each status
user_tweets = [status.text for status in status_list]

user_df = pd.DataFrame(user_tweets,columns=['tweet'])

print(user_df['tweet'])

0      RT @DuolingoUS: Если вам пришлось перевести, ч...
1      RT @ericswalwell: Are you ready America? Let's...
2      RT @CNN: JUST IN: Democratic Rep. Eric Swalwel...
3      RT @BernieSanders: This is a real national eme...
4      RT @umasscs: Congratulations to all the winnin...
5      RT @mic: The way men send emails vs. women rev...
6                                     @saltystvph Not ok
7      RT @TDisfromNYC: Sup twitter fam 🖖🏾.\n\nI'm a ...
8      RT @MforMEGAN: @DMetriaT @quenblackwell Here y...
9      RT @quenblackwell: pick the llama up....now. h...
10     RT @nihilisims: BRO THIS IS THE CUTEST SWEETES...
11     RT @RepTedLieu: Dear @POTUS: Your weird belief...
12     this is ridiculous that’s literally a mang tee...
13     RT @Jessie46914117: Hey everybody. My momma ha...
14     RT @ElaLaineReeves: I shared this yesterday an...
15     RT @milkygoddess: WHEN SHE LOOKS AT THE MONITO...
16     RT @tabitchaaa: Look what you bitches are doin...
17     RT @abby_thatsme: you gu

In [41]:

def basic_tokenize(tweet):
    """Same as tokenize but without the stemming"""
    # Use + rather than *: since Python 3.7, re.split also splits on empty matches
    tweet = " ".join(re.split(r"[^a-zA-Z.,!?]+", tweet.lower())).strip()
    return tweet.split()


def preprocess(text_string):
    space_pattern = r'\s+'
    giant_url_regex = (r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|'
        r'[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    mention_regex = r'@[\w\-]+'
    parsed_text = re.sub(space_pattern, ' ', text_string)
    parsed_text = re.sub(giant_url_regex, '', parsed_text)
    parsed_text = re.sub(mention_regex, '', parsed_text)
    return parsed_text


def transform_inputs(tweets, tf_vectorizer, idf_vector, pos_vectorizer):
    tf_array = tf_vectorizer.transform(tweets).toarray()
    #print(tf_array.shape)
    #tfidf_array = tf_array*idf_vector
    #print("Built TF-IDF array")

    tweet_tags = get_pos_tags(tweets)
    pos_array = pos_vectorizer.transform(pd.Series(tweet_tags)).toarray()
    #print("Built POS array")
    
    oth_array = get_feature_array(tweets)
    #print("Built other feature array")
    #print(tf_array.shape)
    #print(pos_array.shape)
    #print(oth_array.shape)
    M = np.concatenate([tf_array, pos_array, oth_array],axis=1)
    return pd.DataFrame(M)

def get_pos_tags(tweets):
    """Takes a list of strings (tweets) and
    returns a list of strings of (POS tags).
    """
    tweet_tags = []
    for t in tweets:
        tokens = basic_tokenize(preprocess(t))
        tags = nltk.pos_tag(tokens)
        tag_list = [x[1] for x in tags]
        #for i in range(0, len(tokens)):
        tag_str = " ".join(tag_list)
        tweet_tags.append(tag_str)
    return tweet_tags

def get_oth_features(tweets):
    """Takes a list of tweets, generates features for
    each tweet, and returns a numpy array of tweet x features"""
    feats=[]
    for t in tweets:
        feats.append(other_features(t))
    return np.array(feats)

def get_feature_array(tweets):
    feats=[]
    for t in tweets:
        feats.append(other_features(t))
    return np.array(feats)

def other_features(tweet):
    """This function takes a string and returns a list of features.
    These include Sentiment scores, Text and Readability scores,
    as well as Twitter specific features"""
    ##SENTIMENT
    sentiment_analyzer = VS()
    sentiment = sentiment_analyzer.polarity_scores(tweet)
    
    words = preprocess(tweet) #Get text only
    
    syllables = textstat.syllable_count(words) #count syllables in words
    num_chars = sum(len(w) for w in words) #num chars in words
    num_chars_total = len(tweet)
    num_terms = len(tweet.split())
    num_words = len(words.split())
    avg_syl = round(float((syllables+0.001))/float(num_words+0.001),4)
    num_unique_terms = len(set(words.split()))
    
    ###Modified FK grade, where avg words per sentence is just num words/1
    FKRA = round(float(0.39 * float(num_words)/1.0) + float(11.8 * avg_syl) - 15.59,1)
    ##Modified FRE score, where sentence fixed to 1
    FRE = round(206.835 - 1.015*(float(num_words)/1.0) - (84.6*float(avg_syl)),2)
    
    twitter_objs = count_twitter_objs(tweet) #Count #, @, and http://
    retweet = 0
    if "rt" in words:
        retweet = 1
    features = [FKRA, FRE,syllables, avg_syl, num_chars, num_chars_total, num_terms, num_words,
                num_unique_terms, sentiment['neg'], sentiment['pos'], sentiment['neu'], sentiment['compound'],
                twitter_objs[2], twitter_objs[1],
                twitter_objs[0], retweet]
    #features = pandas.DataFrame(features)
    return features

def count_twitter_objs(text_string):
    space_pattern = r'\s+'
    giant_url_regex = (r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|'
        r'[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    mention_regex = r'@[\w\-]+'
    hashtag_regex = r'#[\w\-]+'
    parsed_text = re.sub(space_pattern, ' ', text_string)
    parsed_text = re.sub(giant_url_regex, 'URLHERE', parsed_text)
    parsed_text = re.sub(mention_regex, 'MENTIONHERE', parsed_text)
    parsed_text = re.sub(hashtag_regex, 'HASHTAGHERE', parsed_text)
    return(parsed_text.count('URLHERE'),parsed_text.count('MENTIONHERE'),parsed_text.count('HASHTAGHERE'))
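
To make the Twitter-specific counts concrete, here is a small self-contained variation on `count_twitter_objs` applied to a made-up tweet. It reuses the same three patterns but counts matches with `re.subn` instead of inserting placeholder strings; `count_objects` and the sample text are hypothetical and only serve as an illustration.

```python
import re

# Same patterns as in the notebook, written as raw strings.
URL_RE = re.compile(
    r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|"
    r"[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
)
MENTION_RE = re.compile(r"@[\w\-]+")
HASHTAG_RE = re.compile(r"#[\w\-]+")


def count_objects(text):
    """Return (urls, mentions, hashtags) found in text."""
    text = re.sub(r"\s+", " ", text)
    # subn returns the new string and the number of substitutions made.
    text, n_urls = URL_RE.subn("", text)
    text, n_mentions = MENTION_RE.subn("", text)
    _, n_hashtags = HASHTAG_RE.subn("", text)
    return n_urls, n_mentions, n_hashtags


sample = "RT @user: check https://example.com/a #news #ai"
object_counts = count_objects(sample)
print(object_counts)
```

URLs are removed first, as in the original function, so that an `@` or `#` inside a link is not counted as a mention or hashtag.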


In [42]:
from azureml.core.model import Model
import os 
import re
import nltk
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer as VS
from textstat.textstat import *

tweets = user_tweets
X = transform_inputs(tweets, vectorizer, tfidf, pos_vectorizer)
y_preds = fitted_model.predict(X)

unique, counts = np.unique(y_preds, return_counts=True)
print(dict(zip(unique, counts)))
pd.DataFrame(zip(user_tweets,y_preds)).to_csv("DishaPredictions.csv")

{0: 153, 1: 45}
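
One simple way to summarize these per-tweet predictions is the share of tweets assigned to class 1. The sketch below applies that arithmetic to the counts printed above. `flagged_share` is a hypothetical helper, and the notebook does not define a threshold for turning this share into a decision, so none is assumed here.

```python
def flagged_share(counts, flagged_label=1):
    """Fraction of predictions that carry the flagged label."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no predictions to summarize")
    return counts.get(flagged_label, 0) / total


# Prediction counts copied from the output above.
prediction_counts = {0: 153, 1: 45}
share = flagged_share(prediction_counts)
summary = f"{share:.1%} of {sum(prediction_counts.values())} tweets flagged"
print(summary)
```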
