# Intro: DeepCheck - Smarter Gun Background Checks

The present-day gun background check process is as follows:
## The Process of U.S. Firearm Checks

1) Firearm Buyer: Fills out an ATF Form 4473 with `name`, `age`, `address`, `place of birth`, `race`, `citizenship`, and `Social Security number (optional)`, and answers the following questions:
  - Have you ever been convicted of a felony?
  - Have you ever been convicted of a misdemeanor crime of domestic violence?
  - Are you an unlawful user of, or addicted to, marijuana or any other depressant, stimulant, narcotic drug, or any other controlled substance?
  - Are you a fugitive from justice?
  - Have you ever been committed to a mental institution?

2) Firearm Seller: Submits the information to the FBI via a toll-free phone line or over the internet; the agency then checks the applicant's information against its databases.

3) FBI: Conducts background check with the submitted form (can take minutes). The FBI will deny a claim to Fire

*Source: https://www.cnn.com/2018/02/15/us/gun-background-checks-florida-school-shooting/index.html*

## When Is Someone Denied the Right to Firearms?

* Convicted of a crime punishable by imprisonment
* Convicted of a violent misdemeanor
* Addicted to any controlled substance
* Committed to a mental institution
* Illegal immigrant
* Harassing, stalking, or threatening an intimate partner
* Renounced his/her US citizenship

*Source: https://www.fbi.gov/services/cjis/nics/about-nics*


## Dataset

The labeled tweet dataset used in this notebook was obtained from:

Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. "Automated Hate Speech Detection and the Problem of Offensive Language." Proceedings of the 11th International Conference on Web and Social Media (ICWSM).






# Setup

## Importing Libraries

In [76]:
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# IMPORT AZURE LIBRARIES
# Azure Notebook Libraries
import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
import logging

# IMPORT DATA SCIENCE LIBRARIES
import pandas as pd 
import scipy
import numpy as np
from sklearn.datasets import load_svmlight_file
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.utils import resample
import unicodedata
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.stem.porter import *
from nltk.corpus import stopwords
import string
import re
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer as VS
from textstat.textstat import *
from nltk import PorterStemmer

## Accessing the Azure Workspace

In [17]:
# Load workspace
from azureml.core import Workspace

ws = Workspace.from_config()

Found the config file in: C:\Users\house\Documents\GitHub\config.json


# Creating an Experiment

In [18]:

# Choose a name for the experiment and specify the project folder.
from azureml.core.experiment import Experiment

experiment_name = 'hatespeech_detection'
project_folder = './hatespeech_project'

experiment = Experiment(ws, experiment_name)

# Preprocess Data

In [68]:
df = pd.read_csv('text_data.csv',encoding='utf-8')

In [69]:
import matplotlib.pyplot as plt

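# Class proportions. normalize=True divides each count by the number of rows.
# The recorded output below came from dividing by df['class'].sum() (the sum
# of the label values), which is why it does not sum to 1.
df['class'].value_counts(normalize=True)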

1   0.70
2   0.15
0   0.05
Name: class, dtype: float64
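
Class shares should sum to 1, but the recorded values above (0.70 + 0.15 + 0.05) do not. The following is a minimal sketch with made-up labels (not the real data) of why the denominator matters: dividing the counts by the number of rows gives proportions, while dividing by the sum of the label values does not.

```python
from collections import Counter

# Hypothetical labels for illustration only, coded 0, 1, 2 like the `class` column.
labels = [1, 1, 1, 1, 1, 1, 1, 2, 2, 0]

counts = Counter(labels)

# Correct: divide each count by the number of rows.
proportions = {cls: n / len(labels) for cls, n in counts.items()}

# Incorrect: the sum of the label values (here 11) is not the number of rows (10).
by_label_sum = {cls: n / sum(labels) for cls, n in counts.items()}

print(proportions)
print(sum(proportions.values()))   # approximately 1
print(sum(by_label_sum.values()))  # less than 1
```

In pandas, `df['class'].value_counts(normalize=True)` performs the first calculation directly.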

## Feature generation
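The dataset contains over 20,000 labeled tweets. Most tweets contain special characters and Twitter-specific elements such as URLs and mentions, which the functions below strip out or count before the features are built.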
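
As a rough illustration of the cleanup that the functions below perform, here is a minimal standalone sketch. The URL pattern is a simplified stand-in (an assumption, not the notebook's regex), and the sample tweet is invented.

```python
import re

URL_RE = re.compile(r"https?://\S+")   # simplified URL pattern (assumption)
MENTION_RE = re.compile(r"@[\w\-]+")   # same mention pattern as the notebook
SPACE_RE = re.compile(r"\s+")

def clean_tweet(text):
    """Drop URLs and @mentions, then collapse runs of whitespace."""
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    return SPACE_RE.sub(" ", text).strip()

def simple_tokens(text):
    """Lowercase and split on runs of non-letters.

    The pattern uses '+', not '*': a pattern that can match the empty
    string makes re.split break between every character on Python 3.7+.
    """
    return [t for t in re.split(r"[^a-zA-Z]+", text.lower()) if t]

sample = "RT @someone: check   this out!! https://example.com/page #wow"
cleaned = clean_tweet(sample)
print(cleaned)
print(simple_tokens(cleaned))
```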

In [70]:
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.stem.porter import *
from nltk.corpus import stopwords
import string
import re

#nltk.download('averaged_perceptron_tagger')


def preprocess(text_string):
    """
    Accepts a text string and:
    1) collapses runs of whitespace into a single space
    2) removes urls
    3) removes mentions

    This leaves only the tweet text, without links
    or the specific people mentioned.
    """
    # Raw strings keep the regex escapes (\s, \w) from being
    # interpreted as (invalid) Python string escapes.
    space_pattern = r'\s+'
    giant_url_regex = (r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|'
        r'[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    mention_regex = r'@[\w\-]+'
    parsed_text = re.sub(space_pattern, ' ', text_string)
    parsed_text = re.sub(giant_url_regex, '', parsed_text)
    parsed_text = re.sub(mention_regex, '', parsed_text)
    return parsed_text

def tokenize(tweet):
    """Removes punctuation & excess whitespace, sets to lowercase,
    and stems tweets. Returns a list of stemmed tokens."""
    # Split on runs of non-letters. The pattern must use "+", not "*":
    # on Python 3.7+ a pattern that can match the empty string makes
    # re.split break the text between every character.
    tweet = " ".join(re.split("[^a-zA-Z]+", tweet.lower())).strip()
    tokens = [stemmer.stem(t) for t in tweet.split()]
    return tokens

def basic_tokenize(tweet):
    """Same as tokenize but without the stemming
    (and keeping the punctuation marks . , ! ?)"""
    tweet = " ".join(re.split("[^a-zA-Z.,!?]+", tweet.lower())).strip()
    return tweet.split()


def count_twitter_objs(text_string):
    """
    Accepts a text string and replaces:
    1) urls with URLHERE
    2) lots of whitespace with one instance
    3) mentions with MENTIONHERE
    4) hashtags with HASHTAGHERE

    This allows us to get standardized counts of urls and mentions
    Without caring about specific people mentioned.
    
    Returns counts of urls, mentions, and hashtags.
    """
    space_pattern = r'\s+'
    giant_url_regex = (r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|'
        r'[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    mention_regex = r'@[\w\-]+'
    hashtag_regex = r'#[\w\-]+'
    parsed_text = re.sub(space_pattern, ' ', text_string)
    parsed_text = re.sub(giant_url_regex, 'URLHERE', parsed_text)
    parsed_text = re.sub(mention_regex, 'MENTIONHERE', parsed_text)
    parsed_text = re.sub(hashtag_regex, 'HASHTAGHERE', parsed_text)
    return (parsed_text.count('URLHERE'),
            parsed_text.count('MENTIONHERE'),
            parsed_text.count('HASHTAGHERE'))

def other_features(tweet):
    """This function takes a string and returns a list of features.
    These include Sentiment scores, Text and Readability scores,
    as well as Twitter specific features"""
    ##SENTIMENT
    sentiment = sentiment_analyzer.polarity_scores(tweet)
    
    words = preprocess(tweet) #Get text only
    
    syllables = textstat.syllable_count(words) #count syllables in words
    num_chars = sum(len(w) for w in words) #num chars in words
    num_chars_total = len(tweet)
    num_terms = len(tweet.split())
    num_words = len(words.split())
    avg_syl = round(float((syllables+0.001))/float(num_words+0.001),4)
    num_unique_terms = len(set(words.split()))
    
    ###Modified FK grade, where avg words per sentence is just num words/1
    FKRA = round(float(0.39 * float(num_words)/1.0) + float(11.8 * avg_syl) - 15.59,1)
    ##Modified FRE score, where sentence fixed to 1
    FRE = round(206.835 - 1.015*(float(num_words)/1.0) - (84.6*float(avg_syl)),2)
    
    twitter_objs = count_twitter_objs(tweet) #Count #, @, and http://
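    # Flag retweets by looking for "rt" as a whole token; a plain substring
    # test would also fire on words such as "art" or "party".
    retweet = 1 if "rt" in words.lower().split() else 0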
    features = [FKRA, FRE,syllables, avg_syl, num_chars, num_chars_total, num_terms, num_words,
                num_unique_terms, sentiment['neg'], sentiment['pos'], sentiment['neu'], sentiment['compound'],
                twitter_objs[2], twitter_objs[1],
                twitter_objs[0], retweet]
    #features = pandas.DataFrame(features)
    return features

def get_feature_array(tweets):
    feats=[]
    for t in tweets:
        feats.append(other_features(t))
    return np.array(feats)
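
The two readability features in `other_features` are the standard Flesch-Kincaid grade level, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59, and Flesch reading ease, 206.835 − 1.015 × (words / sentences) − 84.6 × (syllables / words), with the sentence count fixed to 1. The sketch below isolates that arithmetic. The function name and the sample counts are hypothetical, and the syllable count is passed in directly rather than computed with `textstat`.

```python
def modified_readability(num_words, num_syllables):
    """Flesch-Kincaid grade and Flesch reading ease with the sentence
    count fixed to 1, mirroring the arithmetic in other_features."""
    # The 0.001 terms avoid division by zero for empty tweets.
    avg_syl = round((num_syllables + 0.001) / (num_words + 0.001), 4)
    fkra = round(0.39 * num_words + 11.8 * avg_syl - 15.59, 1)
    fre = round(206.835 - 1.015 * num_words - 84.6 * avg_syl, 2)
    return fkra, fre

# Hypothetical tweet with 5 words and 7 syllables.
fkra, fre = modified_readability(num_words=5, num_syllables=7)
print(fkra, fre)  # 2.9 83.33
```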




In [71]:
#Now get other features
sentiment_analyzer = VS()
stemmer = PorterStemmer()
    
tweets = df["tweet"]

stopwords = nltk.corpus.stopwords.words("english")

other_exclusions = ["#ff", "ff", "rt"]
stopwords.extend(other_exclusions)

vectorizer = TfidfVectorizer(
    #vectorizer = sklearn.feature_extraction.text.CountVectorizer(
    tokenizer=tokenize,
    preprocessor=preprocess,
    ngram_range=(1, 3),
    stop_words=stopwords, #Remove English stopwords plus the exclusions above
    use_idf=True,
    smooth_idf=False,
    norm=None, #No row normalization (norm='l2' would apply l2 normalization)
    decode_error='replace',
    max_features=10000
    )

#Construct tfidf matrix and get relevant scores
tfidf = vectorizer.fit_transform(tweets).toarray()
vocab = {v:i for i, v in enumerate(vectorizer.get_feature_names())}
idf_vals = vectorizer.idf_
idf_dict = {i:idf_vals[i] for i in vocab.values()} #keys are indices; values are IDF scores

#Get POS tags for tweets and save as a string
tweet_tags = []
for t in tweets:
    tokens = basic_tokenize(preprocess(t))
    tags = nltk.pos_tag(tokens)
    tag_list = [x[1] for x in tags]
    #for i in range(0, len(tokens)):
    tag_str = " ".join(tag_list)
    tweet_tags.append(tag_str)
        #print(tokens[i],tag_list[i])


# We can use the TFIDF vectorizer to get a token matrix for the POS tags
pos_vectorizer = TfidfVectorizer(
    #vectorizer = sklearn.feature_extraction.text.CountVectorizer(
    tokenizer=None,
    lowercase=False,
    preprocessor=None,
    ngram_range=(1, 3),
    stop_words=None, #We do better when we keep stopwords
    use_idf=False,
    smooth_idf=False,
    norm=None, #No row normalization (norm='l2' would apply l2 normalization)
    decode_error='replace',
    max_features=5000,
    )

pos = pos_vectorizer.fit_transform(pd.Series(tweet_tags)).toarray()
pos_vocab = {v:i for i, v in enumerate(pos_vectorizer.get_feature_names())}

sentiment_analyzer = VS()

other_features_names = ["FKRA", "FRE","num_syllables", "avg_syl_per_word", "num_chars", "num_chars_total", \
                    "num_terms", "num_words", "num_unique_words", "vader neg","vader pos","vader neu", "vader compound", \
                    "num_hashtags", "num_mentions", "num_urls", "is_retweet"]
feats = get_feature_array(tweets)
X = np.concatenate([tfidf,pos,feats],axis=1)
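
To make the `TfidfVectorizer` settings above more concrete, here is a minimal pure-Python sketch of the weighting that scikit-learn documents for `smooth_idf=False` and `norm=None`: each raw term count is multiplied by `ln(n_docs / df) + 1`, with no row normalization. The toy documents are invented, and the sketch ignores n-grams, stemming, and `max_features`.

```python
import math
from collections import Counter

# Toy tokenized documents (hypothetical, for illustration only).
docs = [["gun", "check"], ["gun", "law", "gun"], ["check", "law"]]

n_docs = len(docs)
vocab = sorted({tok for doc in docs for tok in doc})

# Document frequency: number of documents containing each term.
df = Counter(tok for doc in docs for tok in set(doc))

# IDF without smoothing: ln(n / df) + 1.
idf = {tok: math.log(n_docs / df[tok]) + 1.0 for tok in vocab}

def tfidf_row(doc):
    """Raw term count times IDF, with no row normalization."""
    tf = Counter(doc)
    return [tf[tok] * idf[tok] for tok in vocab]

matrix = [tfidf_row(doc) for doc in docs]
print(vocab)
for row in matrix:
    print([round(v, 3) for v in row])
```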

# Train the Model

In [29]:
# Split into training and test sets


print(X.shape)
y = df["class"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

y_train = np.array(y_train)
y_test = np.array(y_test)

automl_config = AutoMLConfig(task = 'classification',
                             debug_log = 'automl_errors.log',
                             primary_metric = 'AUC_weighted',
                             iteration_timeout_minutes = 60,
                             iterations = 1,
                             n_cross_validations = 3,
                             verbosity = logging.INFO,
                             X = X_train, 
                             y = y_train,
                             path = project_folder)


(1000, 12668)
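
The split above is purely random. With classes as imbalanced as the distribution shown earlier, a random split can leave the rarest class under-represented in one of the two sets. The sketch below shows a stratified split in plain Python; it is a swapped-in illustration, not what the notebook does, and the labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.33, seed=42):
    """Return (train_idx, test_idx) with roughly the same class mix in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical imbalanced labels: 70 of class 1, 20 of class 2, 10 of class 0.
labels = [1] * 70 + [2] * 20 + [0] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

In scikit-learn the same effect is available by passing `stratify=y` to `train_test_split`.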


In [None]:
local_run = experiment.submit(automl_config, show_output = True)

# Test the Model

In [25]:
# Show the model with log loss minimized
lookup_metric = "log_loss"
best_run, fitted_model = local_run.get_output(metric = lookup_metric)

NameError: name 'local_run' is not defined

In [None]:
# Predict on the held-out test set and report accuracy.
from azureml.core.model import Model

y_test = np.array(y_test)

predicted = fitted_model.predict(X_test)

from sklearn.metrics import accuracy_score
print('Accuracy score: %.2f' % accuracy_score(y_true=y_test, y_pred=predicted))
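
Accuracy alone can be misleading on imbalanced classes: a model that always predicts the majority class still scores well. The sketch below uses hypothetical labels and predictions to compare accuracy with per-class recall.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {cls: hits[cls] / totals[cls] for cls in sorted(totals)}

# Hypothetical case: a model that always predicts class 1.
y_true = [1] * 7 + [2] * 2 + [0]
y_pred = [1] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_class_recall(y_true, y_pred)
print(accuracy)  # 0.7
print(recall)    # {0: 0.0, 1: 1.0, 2: 0.0}
```

`sklearn.metrics.classification_report` reports the same per-class recall, along with precision and F1.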

# Register the Model

In [None]:
model = local_run.register_model('deepcheck')

# Predict on Real Twitter Data

In [37]:
# Import the necessary package to process data in JSON format
try:
    import json
except ImportError:
    import simplejson as json

# Import the tweepy library
import tweepy
import pprint
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

pp = pprint.PrettyPrinter(indent=4)

# Setup tweepy to authenticate with Twitter credentials:

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

# Create the API client that connects to Twitter with your credentials
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True, compression=True)

#status_cursor = tweepy.Cursor(api.user_timeline, screen_name="realDonaldTrump", count=100)
# The user timeline endpoint returns at most 200 tweets per page
status_cursor = tweepy.Cursor(api.user_timeline, screen_name="realDonaldTrump", count=200)
status_list = status_cursor.iterator.next()

user_tweets = [status.text for status in status_list]

model = joblib.load('final_model.pkl')

user_df = pd.DataFrame(user_tweets,columns=['tweet'])

print(df['tweet'])
print(user_df['tweet'])

0      !!! RT @mayasolovely: As a woman you shouldn't...
1      !!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2      !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3      !!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4      !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...
5      !!!!!!!!!!!!!!!!!!"@T_Madison_x: The shit just...
6      !!!!!!"@__BrighterDays: I can not just sit up ...
7      !!!!&#8220;@selfiequeenbri: cause I'm tired of...
8      " &amp; you might not get ya bitch back &amp; ...
9      " @rhythmixx_ :hobbies include: fighting Maria...
10     " Keeks is a bitch she curves everyone " lol I...
11                    " Murda Gang bitch its Gang Land "
12     " So hoes that smoke are losers ? " yea ... go...
13         " bad bitches is the only thing that i like "
14                               " bitch get up off me "
15                       " bitch nigga miss me with it "
16                                " bitch plz whatever "
17                             
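
The authentication cell above references `CONSUMER_KEY`, `CONSUMER_SECRET`, `ACCESS_TOKEN`, and `ACCESS_SECRET`, which are not defined anywhere in the notebook. One common way to supply them without hard-coding secrets is to read them from environment variables. The sketch below assumes hypothetical variable names and uses a stand-in mapping in place of the real environment.

```python
import os

# Hypothetical environment variable names; use whatever names your setup defines.
REQUIRED = ("TWITTER_CONSUMER_KEY", "TWITTER_CONSUMER_SECRET",
            "TWITTER_ACCESS_TOKEN", "TWITTER_ACCESS_SECRET")

def load_twitter_credentials(env=os.environ):
    """Return the four credentials as a dict, or raise if any is missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED}

# Example with a stand-in mapping instead of the real environment.
fake_env = {name: "placeholder" for name in REQUIRED}
creds = load_twitter_credentials(fake_env)
print(sorted(creds))
```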

In [75]:
def transform_inputs(tweets, tf_vectorizer, idf_vector, pos_vectorizer):
    """
    This function takes a list of tweets, along with the fitted
    vectorizers and IDF vector used to transform the tweets into
    the format accepted by the model.
    Each tweet is decomposed into
    (a) An array of TF-IDF scores for a set of n-grams in the tweet.
    (b) An array of POS tag sequences in the tweet.
    (c) An array of features including sentiment, vocab, and readability.
    Returns a pandas dataframe where each row is the set of features
    for a tweet. The features are a subset selected using a Logistic
    Regression with L1-regularization on the training data.

    tf_vectorizer is expected to return raw term frequencies
    (use_idf=False) and idf_vector to hold one IDF weight per
    vocabulary term.
    """
    # Use transform, not fit_transform: refitting would learn a new
    # vocabulary and change the columns the model was trained on.
    tf_array = tf_vectorizer.transform(tweets).toarray()
    tfidf_array = tf_array*idf_vector
    print("Built TF-IDF array")

    pos_tags = get_pos_tags(tweets)
    pos_array = pos_vectorizer.transform(pos_tags).toarray()
    print("Built POS array")

    oth_array = get_oth_features(tweets)
    print("Built other feature array")
    print(tfidf_array.shape)
    print(pos_array.shape)
    print(oth_array.shape)
    M = np.concatenate([tfidf_array, pos_array, oth_array],axis=1)
    return pd.DataFrame(M)

def get_pos_tags(tweets):
    """Takes a list of strings (tweets) and
    returns a list of strings of (POS tags).
    """
    tweet_tags = []
    for t in tweets:
        tokens = basic_tokenize(preprocess(t))
        tags = nltk.pos_tag(tokens)
        tag_list = [x[1] for x in tags]
        #for i in range(0, len(tokens)):
        tag_str = " ".join(tag_list)
        tweet_tags.append(tag_str)
    return tweet_tags

def get_oth_features(tweets):
    """Takes a list of tweets, generates features for
    each tweet, and returns a numpy array of tweet x features"""
    feats=[]
    for t in tweets:
        feats.append(other_features_(t))
    return np.array(feats)

def other_features_(tweet):
    """This function takes a string and returns a list of features.
    These include Sentiment scores, Text and Readability scores,
    as well as Twitter specific features.
    This is modified to only include those features in the final
    model."""

    sentiment = sentiment_analyzer.polarity_scores(tweet)

    words = preprocess(tweet) #Get text only

    syllables = textstat.syllable_count(words) #count syllables in words
    num_chars = sum(len(w) for w in words) #num chars in words
    num_chars_total = len(tweet)
    num_terms = len(tweet.split())
    num_words = len(words.split())
    avg_syl = round(float((syllables+0.001))/float(num_words+0.001),4)
    num_unique_terms = len(set(words.split()))

    ###Modified FK grade, where avg words per sentence is just num words/1
    FKRA = round(float(0.39 * float(num_words)/1.0) + float(11.8 * avg_syl) - 15.59,1)
    ##Modified FRE score, where sentence fixed to 1
    FRE = round(206.835 - 1.015*(float(num_words)/1.0) - (84.6*float(avg_syl)),2)

    twitter_objs = count_twitter_objs(tweet) #Count #, @, and http://
    features = [FKRA, FRE, syllables, num_chars, num_chars_total, num_terms, num_words,
                num_unique_terms, sentiment['compound'],
                twitter_objs[2], twitter_objs[1],]
    #features = pandas.DataFrame(features)
    return features

#Load ngram dict
#Load pos dictionary
#Load function to transform data


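# idf_vector must hold one IDF weight per term (idf_vals), not the full tfidf matrix.
# Note: `vectorizer` was built with use_idf=True, so its output is already IDF-weighted.
X = transform_inputs(tweets, vectorizer, idf_vals, pos_vectorizer)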


Built TF-IDF array
Built POS array
Built other feature array
(24783, 10000)
(24783, 5000)
(24783, 11)
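
The column counts printed above only match what the model was trained on if the vectorizers reuse the vocabulary they learned during training. The toy class below (an illustration only, not scikit-learn's implementation) sketches the difference between transforming new text with a fitted vectorizer and refitting on it.

```python
class TinyCountVectorizer:
    """Toy bag-of-words vectorizer (illustration only, not scikit-learn)."""

    def fit(self, docs):
        # Learn a fixed column order from the training documents.
        self.vocab_ = sorted({tok for doc in docs for tok in doc.split()})
        return self

    def transform(self, docs):
        # Count only known terms, so the column layout never changes.
        return [[doc.split().count(tok) for tok in self.vocab_] for doc in docs]

    def fit_transform(self, docs):
        return self.fit(docs).transform(docs)

train_docs = ["gun check law", "gun law"]
new_docs = ["check check unknown"]

vec = TinyCountVectorizer().fit(train_docs)
print(vec.vocab_)               # columns learned from the training data
print(vec.transform(new_docs))  # same 3 columns; the unseen word is ignored

refit = TinyCountVectorizer()
print(refit.fit_transform(new_docs))  # 2 different columns the model never saw
```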


In [80]:
from azureml.core.model import Model
import os 

ws = Workspace.from_config()

model=Model(ws, id='AutoML7d3a81240best:1')
model.download(target_dir=os.getcwd(), exist_ok=True)

clf = joblib.load( os.path.join(os.getcwd(), 'model.pkl'))
y_hat = clf.predict(X_test)

y_preds = clf.predict(X)  # predict with the loaded estimator, not the registry Model object

Found the config file in: C:\Users\house\Documents\GitHub\config.json


ValueError: operands could not be broadcast together with shapes (330,12668) (11166,) (330,12668) 
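
The `ValueError` above reports a (330, 12668) array being combined with a vector of length 11166. One plausible reading, offered here as an assumption, is that the features passed to `predict` were built with a different number of columns than the model was fitted on. The sketch below reimplements NumPy's broadcasting rule in plain Python to show why those two shapes cannot be combined.

```python
from itertools import zip_longest

def broadcast_shape(a, b):
    """Return the broadcast result shape of a and b, following NumPy's rule:
    compare dimensions right to left; each pair must be equal or contain a 1."""
    result = []
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x != y and x != 1 and y != 1:
            raise ValueError(f"shapes {a} and {b} are not broadcastable")
        result.append(max(x, y))
    return tuple(reversed(result))

print(broadcast_shape((330, 12668), (12668,)))  # compatible: (330, 12668)
try:
    broadcast_shape((330, 12668), (11166,))     # shapes from the traceback
except ValueError as err:
    print(err)
```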