# YINS Datahack 2017

## Challenges
The following challenges are available to choose from:
* 1) Police Misconduct Data (Yale Policy Lab): http://datahack.yale.edu/files/yale-policy-lab.pdf
    * Every day a civilian is shot by a police officer in the United States. While rare compared to other types of violence, the use of force by the police is of potentially greater importance: Force, utilized improperly or routinely, can erode trust between citizens and their government, prompt disengagement with the law, and shake the very foundation of our democracy.
(Figure: Communities of officers in Chicago.)
Despite its importance, and the recent attention to police
misconduct as a pressing social problem, there is still
much we do not know about the phenomenon - including
the rates of police misconduct, its distribution, and its
root causes. In large part, this is due to a lack of large-scale
scientific inquiry, which itself results from a lack
of data; neither police departments nor the government
are compelled to compile and analyze incidents of police
(mis)behavior in any systematic fashion.
The recent series of high-profile shootings of unarmed
citizens, however, has drawn the nation’s gaze to this
issue once again, and citizens, policy makers, and policing
experts are mobilizing to uncover what data do exist. One such repository comes from
Chicago, IL, and contains records of allegations of police misconduct and investigations
of police shootings dating back to 1971. The goal of this DataHack is to leverage this
database to answer the White House’s call to develop smarter, more data-driven methods
to understand and improve policing in the United States, and to reduce the use of force by
officers.
Participants will have access to a database - compiled by The Policy Lab and The
Justice Collaboratory at Yale Law School - containing more than 150,000 records of alleged
police misconduct (built on reports made by both citizens and other police officers), as well
as related datasets - including departmental and (anonymized) personnel records, maps of
the city, and details as to its physical and political infrastructure. Hackers will be charged
with uncovering actionable insights into the correlates, and possible drivers, of police
misconduct, and will be evaluated based on the novelty, robustness, and scalability of their
finished product, as well as the degree to which it highlights new conceptual threads as the
foundation for future research.
(Challenge brief by The Policy Lab & The Justice Collaboratory, dated February 2017.)
    * At a Glance: Possible Guiding Questions
        * Improving our Understanding of Police Misconduct
            * Is police misconduct a group or networked phenomenon? If so, what do police misconduct networks look like? To what degree do police misconduct networks look and act like social and behavioral networks? For example, do people who work and train together engage in misconduct together?
            * What are the behavioral and network-related correlates of police misconduct? Are there particular characteristics by which we can differentiate officers who do and do not display more serious forms of misconduct?
        * Improving our Prediction of Police Misconduct
            * How accurately can we predict future aggression by police officers? Can we achieve top-shelf accuracy and efficiency in our predictions, but do so using models that can later be interrogated, so as to help policy makers and researchers better understand the underlying behavioral mechanisms at work (i.e., using something other than machine learning)?
            * How can we make all of these insights most useful to police departments, in real time? Can we develop visualization tools to aid police departments and citizens in understanding how misconduct concentrates within social networks?
        * Improving our Study of Police Misconduct
            * How can we achieve the objectives listed above, but avoid over-fitting? These data come from one city (Chicago), within which some patterns may be idiosyncratic. Can we build models that are likely to scale to other cities?
            * How can we estimate the scale and contents of missing data? These allegations do not represent every incident of police misconduct; rather, they reflect only those that were officially reported, a time- and energy-intensive process that also requires that citizens trust the police in the first place.
    
    
* 2) Leveraging Social Networks in Driving Corporate Performance (McChrystal Group): http://datahack.yale.edu/files/mcchrystal-group.pdf
    * The corporate world is awash with buzzwords: organizations live in fear of black swan events, often desperate to stay ahead of disruptors. No executive wants to be featured in the Kodak- or Blockbuster-style case studies covered by the core curriculum of nearly every MBA. Wall Street analysts, private equity investors, and companies themselves are devoting unprecedented resources to dissecting the numbers behind what makes for a successful organization. But one area that remains under-studied is what creates organizational success in the first place: the people and how those people really interact. As more and more companies move beyond simple line and block hierarchical reporting into cross-functional, layered matrices, executives are looking to leverage their companies’ internal networks to drive organizational performance. Though organizational network analysis has established itself as a major field in academia, it is still burgeoning in the world of corporate organizational design. Analysts studying networks in the corporate world face unique challenges. Corporate management often lacks the understanding of how to best utilize network analysis findings, in worst-case scenarios using them to make performance management decisions rather than insightful re-alignments of their organization’s networking and communications. As such, the data that analysts obtain from surveys are often anonymous to protect employee identity, making the linking of a respondent to his/her exact networks difficult. Additionally, given a world in which we are asked to rate our experience every time we take a flight or order a pizza, analysts must cope with increasing survey fatigue, and hence, less complete datasets. Facing all of these challenges, we are asking hackers to **leverage several organizational performance and network analysis datasets collected from a variety of industries in order to better understand the nature of social network influence in the corporate world**. 
More specifically, looking across datasets from companies diverse in size, location, revenue, and industry, **what kind of impact can influencers have on both their peers as well as the company’s overall success**? What **actionable insights can we derive from these datasets in order to improve a company’s performance**? We ask Hackers to offer their **perceptions of the generalizability/scalability of their findings**. **Do they believe that their insights and selected implementations are unique to the datasets they are working with**, or **do they believe these trends could be consistent for an entire corporation, across industry, etc.**? 
    
    
* 3) Identifying Groups of Traders that Manipulate the Financial Markets (Goldman Sachs Group): http://datahack.yale.edu/files/goldman-sachs.pdf
    * It is common practice for the Compliance Departments within big financial organizations to **set the appropriate controls and procedures to ensure that employees comply** with applicable rules and regulations. For example, "spoofing" is a practice in which traders attempt to give an artificial impression of market conditions by entering and quickly canceling large buy or sell orders on an exchange, in an attempt to manipulate prices. The 2010 Dodd-Frank Act specifically forbids spoofing. Monitoring and identifying spoofing at an individual trader level is straightforward; however, finding a group of traders that manipulate the market is more involved. One of the mandates of a contemporary compliance officer is to **identify potential group spoofing activities**, defined below. In this datahack, we are going to explore methodologies that identify Potential Group Spoofing Activities within big financial organizations.
        * Definition (Potential Group Spoofing Activity, PGSA): We say that **there is a Potential Group Spoofing Activity when two traders trade the same financial instrument (e.g., a stock) at some timestamp t and they communicate (for example, via email or phone) at the same timestamp t**.
        * Dataset: We are given **trading data and communication data - all corresponding to a single day**.
            1. Trading data: **500 traders trade (buy/sell positions) 1,000 stocks at 100 different timestamps**.
            2. Communication data: **500 traders communicate with each other at the same 100 timestamps**.
        * Tasks:
            1. **Determine if there is at least one PGSA at each timestamp**.
            2. Find the **timestamp where the fewest PGSAs occur**.
            3. **Find all the PGSAs** in this dataset.
            4. Find the **“riskiest” PGSA** and explain your reasoning in detail.
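
The PGSA definition and the first three tasks can be sketched in a few lines of Python. This is a minimal sketch, not the challenge's reference solution: the record layouts `(trader, stock, t)` and `(trader_a, trader_b, t)`, the function names, and the toy data are all assumptions, since the actual file format is not described here.

```python
from collections import Counter, defaultdict


def find_pgsas(trades, comms):
    """Return the set of PGSAs as (t, trader_a, trader_b, stock) tuples.

    Assumed (hypothetical) layouts:
      trades: iterable of (trader, stock, t)
      comms:  iterable of (trader_a, trader_b, t)
    """
    # Stocks each trader touched at each timestamp
    stocks_traded = defaultdict(set)
    for trader, stock, t in trades:
        stocks_traded[(t, trader)].add(stock)

    pgsas = set()
    for a, b, t in comms:
        if a == b:
            continue  # ignore self-communication
        common = stocks_traded.get((t, a), set()) & stocks_traded.get((t, b), set())
        first, second = sorted((a, b))  # treat the pair as unordered
        for stock in common:
            pgsas.add((t, first, second, stock))
    return pgsas


def pgsas_per_timestamp(pgsas, timestamps):
    """Count PGSAs at every timestamp, including timestamps with none."""
    counts = Counter(t for t, _, _, _ in pgsas)
    return {t: counts.get(t, 0) for t in timestamps}


# Toy data (hypothetical): three traders, two stocks, two timestamps
trades = [("T1", "AAA", 0), ("T2", "AAA", 0), ("T3", "BBB", 0),
          ("T1", "AAA", 1), ("T2", "BBB", 1)]
comms = [("T1", "T2", 0), ("T2", "T1", 0), ("T1", "T3", 0), ("T1", "T2", 1)]

pgsas = find_pgsas(trades, comms)                               # task 3
counts = pgsas_per_timestamp(pgsas, timestamps=[0, 1])
every_timestamp_has_one = all(c > 0 for c in counts.values())   # task 1
quietest = min(counts, key=counts.get)                          # task 2 (first minimum on ties)
```

Iterating over communicating pairs, rather than over all pairs of traders, keeps the work proportional to the number of messages. Task 4 (the "riskiest" PGSA) needs a scoring rule that the brief leaves open, so it is not attempted here.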


Police: complaint data and co-complaints. Links are co-complaints observed over a period of time (time-series data).
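
One way to turn complaint records into the co-complaint links described above is to connect two officers whenever they are named in the same complaint, and to keep one weighted edge list per time period. The sketch below assumes a hypothetical record layout `(complaint_id, officer_id, year)`; the real column names in the dataset may differ, and the toy records are made up for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def co_complaint_edges(records):
    """Build per-year weighted co-complaint edges.

    records: iterable of (complaint_id, officer_id, year)  (assumed layout)
    Returns {year: {(officer_a, officer_b): number of shared complaints}}.
    """
    # Officers named in each complaint, grouped by year
    officers = defaultdict(set)
    for complaint_id, officer_id, year in records:
        officers[(year, complaint_id)].add(officer_id)

    edges = defaultdict(Counter)
    for (year, _), group in officers.items():
        # Every unordered pair of officers on the same complaint is one link
        for a, b in combinations(sorted(group), 2):
            edges[year][(a, b)] += 1
    return {year: dict(counter) for year, counter in edges.items()}


# Toy records (hypothetical)
records = [
    ("c1", "o1", 2010), ("c1", "o2", 2010),
    ("c2", "o1", 2010), ("c2", "o2", 2010), ("c2", "o3", 2010),
    ("c3", "o1", 2011), ("c3", "o3", 2011),
    ("c4", "o4", 2011),  # a single-officer complaint creates no link
]
edges_by_year = co_complaint_edges(records)
```

Each year's pairs could then be written out as whitespace-separated lines, which is the edge-list format the notebook's `read_network` function parses.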

# Police Misconduct Data

In [2]:
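import scipy as sp
from scipy.sparse import linalg, csr_matrix
import math
import numpy as np
import networkx as nx            # used by get_centrality
import matplotlib.pyplot as plt  # used by the plotting cell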

def read_network(fn, undirected = True):
    """Read a whitespace-separated edge list and return (adjacency matrix, node count)."""
    edges = list()
    nodes_id = dict()   # node label -> row/column index
    with open(fn, 'r') as f:
        for line in f:
            if not line.strip():
                continue   # skip blank lines
            nodei, nodej = line.split()
            for node in (nodei, nodej):
                if node not in nodes_id:
                    nodes_id[node] = len(nodes_id)
            edges.append((nodei, nodej))

    n = len(nodes_id)
    A = np.zeros((n, n), dtype = float)
    for nodei, nodej in edges:
        i = nodes_id[nodei]
        j = nodes_id[nodej]
        A[i, j] = 1.0
        if undirected:
            A[j, i] = 1.0

    return A, n

# dA, dN = read_network('data/smallnet2.txt', undirected = False)

def get_centrality(G, type_centrality):
    
    if type_centrality == "degree":
        return nx.degree_centrality(G)
        
    elif type_centrality == "closeness":
        return nx.closeness_centrality(G)
    
    elif type_centrality == "betweenness":
        return nx.betweenness_centrality(G)
    
    elif type_centrality == "eigenvector":
        return nx.eigenvector_centrality(G, max_iter = 1000, tol = 1e-06)
    
    elif type_centrality == "katz":
        return nx.katz_centrality(G, alpha = 0.01, beta = 0.01, max_iter = 1000, tol = 1e-06)
    
    elif type_centrality == "pagerank":
        return nx.pagerank(G, 0.85)
    else:
        print("wrong type of centrality")
        return None
    
def get_centrality_distribution(centrality, number_of_bins = 50, log_binning = False, base = 10):
    if isinstance(centrality, dict):
        # list() so the values work with np.histogram under Python 3
        centrality = list(centrality.values())
    elif not isinstance(centrality, list):
        print("wrong type of centrality. must be either dictionary or list")
        return None, None
    
    # We need to define the support of our distribution
    lower_bound = min(centrality)
    upper_bound = max(centrality)
    
    # And the bins
    if log_binning:
        log = np.log2 if base == 2 else np.log10
        lower_bound = log(lower_bound) if lower_bound > 0 else -1
        upper_bound = log(upper_bound)
        bins = np.logspace(lower_bound, upper_bound, number_of_bins, base = base)
    else:
        bins = np.linspace(lower_bound,upper_bound,number_of_bins)
        
    # Then we can compute the histogram using numpy
    y, __ = np.histogram(centrality, 
                           bins = bins,
                           density = True)
    #print centrality
    # Now, we need to compute for each y the value of x
    x = bins[1:] - np.diff(bins) / 2.0
        
    return x, y



In [1]:
from operator import itemgetter
import scipy.stats as stats
#stats.kendalltau is for two arrays. We have centralities as dictionary

def k_tau(dict1, dict2):
    # Pair the two scores of each node; Kendall's tau then compares the
    # rankings the two centralities induce on the same set of nodes
    nodes = [node for node in dict1 if node in dict2]
    scores_1 = [dict1[node] for node in nodes]
    scores_2 = [dict2[node] for node in nodes]
    
    kendall_tau, p_value = stats.kendalltau(scores_1, scores_2)
    
    return kendall_tau, p_value
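
For reference, Kendall's tau between two centrality measures checks, for every pair of nodes, whether both measures order the pair the same way. The pure-Python sketch below computes the simplest variant (tau-a, with no tie correction); `scipy.stats.kendalltau` computes tau-b by default, and the two agree when there are no ties. The function name and the toy scores are illustrative assumptions, not part of the dataset.

```python
from itertools import combinations


def kendall_tau_a(scores1, scores2):
    """Kendall's tau-a between two {node: score} dicts over their shared nodes."""
    nodes = [node for node in scores1 if node in scores2]
    concordant = discordant = 0
    for u, v in combinations(nodes, 2):
        # Positive product: both measures rank u and v in the same order
        sign = (scores1[u] - scores1[v]) * (scores2[u] - scores2[v])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(nodes) * (len(nodes) - 1) // 2
    if n_pairs == 0:
        return float("nan")  # fewer than two shared nodes
    return (concordant - discordant) / n_pairs


# Toy scores (hypothetical)
measure_a = {"a": 0.5, "b": 0.3, "c": 0.2}
same_order = {"a": 0.40, "b": 0.35, "c": 0.25}
reversed_order = {"a": 0.1, "b": 0.2, "c": 0.7}
one_swap = {"a": 0.3, "b": 0.5, "c": 0.2}

tau_same = kendall_tau_a(measure_a, same_order)
tau_reversed = kendall_tau_a(measure_a, reversed_order)
tau_swap = kendall_tau_a(measure_a, one_swap)
```

Note that the scores are paired by node before comparison; comparing two lists of node ids sorted by score would not measure the same thing.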

In [None]:
# `net` is assumed to be a networkx graph built beforehand (it is not defined in this excerpt)
plt.figure(figsize=(12,5))
color_pool = ['r', 'g']
for idx, method in enumerate(['pagerank', 'katz']):
    centralities = get_centrality(net, method)
    x, y = get_centrality_distribution(centralities, number_of_bins = 50, log_binning = True, base = 10)
    plt.loglog(x, y, 'o', color = color_pool[idx], label = method)

plt.xlabel("Centrality")
plt.ylabel("P")
plt.legend(numpoints = 1)
plt.show()



full_misconduct_data = 