![RiskLens](logo.png)
# RiskLens Data Science Candidate Task

This is the take-home task for candidates interviewing to join the
Data Science team at RiskLens in Spokane, WA.

The purpose of this task is to let the candidate explore some
industry-relevant cybersecurity incident data, and to let the hiring
manager see a sample of the candidate's work on data that the
candidate has probably not seen before.

# Instructions

See the data directory for the data set and data description. We are
using data from the VERIS Community Database (VCDB) that has been
filtered and transformed to be more easily workable in a few hours. The
data has been filtered to be only incidents from the healthcare
industry.

You should further filter the data to the years 2010-2018, inclusive of both endpoints.
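
As a minimal sketch of the inclusive year filter, assuming the rows have already been loaded as dictionaries keyed by the column names in the code book (the sample rows and the helper name `filter_years` are hypothetical):

```python
# Hypothetical rows; the real data set lives in the data directory.
incidents = [
    {"incident_id": "a", "timeline.incident.year": 2009},
    {"incident_id": "b", "timeline.incident.year": 2010},
    {"incident_id": "c", "timeline.incident.year": 2014},
    {"incident_id": "d", "timeline.incident.year": 2018},
    {"incident_id": "e", "timeline.incident.year": 2019},
]

FIRST_YEAR, LAST_YEAR = 2010, 2018


def filter_years(rows, first=FIRST_YEAR, last=LAST_YEAR):
    """Keep rows whose incident year falls in [first, last], both ends included."""
    return [r for r in rows if first <= r["timeline.incident.year"] <= last]


kept = filter_years(incidents)
print([r["incident_id"] for r in kept])
```

Both 2010 and 2018 are kept; only the 2009 and 2019 rows are dropped.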

Please create a **reproducible report** answering the following
questions:

1.  Complete an exploratory analysis of the data set and answer at least
    the following questions:

    1.  How many total incidents are in the database for each year?
    2.  Grouping by `action` and year, what are some trends in the action types that you notice over time? Please note that since this is an open source database, the total number of incidents in a given year is more of a function of community involvement in incident reporting than a good representation of the total number of incidents. Given this fact, for your trend analysis it may be better to look at proportions of actions for each action type in a given year rather than total number of incidents for a given action.
    3.  Repeat the `action` trend analysis for `actor`.
    4.  Repeat the `action` trend analysis for `asset`.
    5.  Repeat the `action` trend analysis for the three `attribute` variables.
    6.  Do we see a trend for the proportion of incidents within the US versus outside of the US?
    7.  Please feel free to share any other notable findings as you explore the data.

2.  Modeling questions. Note: The [RiskLens
    Platform](https://www.risklens.com/platform) uses
    [PERT](https://www.statisticshowto.datasciencecentral.com/pert-distribution/)
    distributions for users to report minimum, most likely, and maximum
    values for estimates in [FAIR analyses](https://www.risklens.com/what-is-fair). However, you
    don't need to restrict yourself to reporting values in this manner
    if you discover a different distribution for your data.

    1.  Let's assume that you work for a large healthcare employer
        (1001 employees or larger), and you are scoping a risk scenario
        where you are worried about an insider threat (actor: internal)
        compromising the confidentiality of medical records. Assuming
        that you will have a cybersecurity incident this year, based on
        the data you have can you come up with a model that will
        estimate, with 90% confidence, the range of the counts of
        medical records that will be compromised in such an incident
        (minimum and maximum)? Within that 90% confidence interval, what is the most likely count of the breached records? You may ignore those employers with unknown employee counts for the sake of time. Bonus points if you integrate the year of breach into your model -- are the trends changing?

    2.  If you have time: How does your model change if you estimate total
        record count instead of just medical records?

3.  *Optional and bonus*: Can you do anything fun and interesting with the
    text in the summary column?
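
The proportion-based trend view suggested in the exploratory questions could be sketched roughly as below. This is one possible approach, not a required one: the helper name `yearly_proportions` is hypothetical and the rows are made up purely to show the shape of the result.

```python
from collections import Counter, defaultdict


def yearly_proportions(rows, column, year_column="timeline.incident.year"):
    """Return {year: {category: share of that year's incidents}}."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[year_column]][row[column]] += 1
    return {
        year: {cat: n / sum(cats.values()) for cat, n in cats.items()}
        for year, cats in sorted(counts.items())
    }


# Made-up rows; the category labels are illustrative, not taken from the data.
rows = [
    {"timeline.incident.year": 2010, "action": "Error"},
    {"timeline.incident.year": 2010, "action": "Misuse"},
    {"timeline.incident.year": 2011, "action": "Error"},
    {"timeline.incident.year": 2011, "action": "Error"},
    {"timeline.incident.year": 2011, "action": "Hacking"},
    {"timeline.incident.year": 2011, "action": "Misuse"},
]

shares = yearly_proportions(rows, "action")
for year, by_category in shares.items():
    print(year, by_category)
```

Because each year's shares sum to 1, years with very different reporting volumes become comparable. If pandas is available, `pd.crosstab(df[year_column], df[column], normalize="index")` produces the same row-wise shares.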
    

**A reproducible report will allow us to re-run your code with the
data set and obtain the same results and figures, given dependencies
and relevant instructions. The most popular formats for this type of
report are R Markdown notebooks and Jupyter Notebook (aka IPython
Notebook).**
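
One practical part of reproducibility is seeding any random sampling so that a re-run yields identical numbers. The sketch below illustrates this with a PERT distribution, assuming the standard form (a Beta distribution rescaled to the minimum and maximum, with a shape weight of 4). The function name `sample_pert` and all input values are hypothetical and are not estimates from the data set.

```python
import random
import statistics


def sample_pert(minimum, mode, maximum, size, seed):
    """Draw `size` values from a PERT(minimum, mode, maximum) distribution.

    Assumes the standard PERT form: a Beta distribution rescaled to
    [minimum, maximum], with shape parameters weighted by 4.
    """
    if not minimum <= mode <= maximum or minimum == maximum:
        raise ValueError("need minimum <= mode <= maximum and minimum < maximum")
    span = maximum - minimum
    alpha = 1 + 4 * (mode - minimum) / span
    beta = 1 + 4 * (maximum - mode) / span
    rng = random.Random(seed)  # local generator, so the seed fully fixes the draws
    return [minimum + span * rng.betavariate(alpha, beta) for _ in range(size)]


# Illustrative numbers only.
draws = sample_pert(minimum=100, mode=1_000, maximum=50_000, size=10_000, seed=42)
cuts = statistics.quantiles(draws, n=20)  # 19 cut points: 5%, 10%, ..., 95%
low, high = cuts[0], cuts[-1]
print(f"90% interval: {low:,.0f} to {high:,.0f}; mean {statistics.fmean(draws):,.0f}")
```

Calling `sample_pert` twice with the same seed returns the same list, which is what lets a reviewer reproduce the reported interval exactly.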

*Note:* the Lead Data Scientist at RiskLens loves data
visualizations :)

# Data Description
## Modified VERIS Incident Data: Medical Industry

For this exercise, we have taken [VERIS](https://github.com/vz-risk/veris)-formatted incident data from the VERIS Community Database [VCDB](https://github.com/vz-risk/VCDB) and modified it to be more quickly understood and explored for the purposes of this exercise.

VERIS, the Vocabulary for Event Recording and Incident Sharing, is a common language for describing security incidents in a structured manner. It was invented and is maintained by the Verizon RISK team, and the most current documentation for the entire schema may be [found on GitHub](https://github.com/vz-risk/veris) and at [veriscommunity.net](http://veriscommunity.net/). The VCDB is an open-source data set in VERIS format, also maintained by the Verizon RISK team and several community contributors. The data set is licensed with the Creative Commons Attribution-ShareAlike 4.0 International Public License. Please see the [VCDB License file](https://github.com/vz-risk/VCDB/blob/master/LICENSE.txt) for more information. 

### How we Modified the Data

As of the writing of this document, the VCDB consists of 8192 incidents, and when parsed with the [verispy](https://github.com/RiskLens/verispy) Python package, has 2330 features (columns). 

Because this would be extremely unwieldy for a take-home exercise with a target time of 4-5 hours, we have simplified the data set by doing the following:  

  * Limiting the data set to incidents in the medical industry (NAICS 2-digit code: 62).  
  * Compacting some enumerations and wholly eliminating many of them. Unfortunately, this does cause a problem when, for instance, there is more than one actor or one action involved in an incident. So, user beware: your findings may be interesting and applicable in the context of what you are asked to do for this exercise, but you should use the full VCDB data set if you wish to make grand pronouncements about the state of cybersecurity breaches.  

 This leaves us with a current data set of 2252 rows and 22 columns as of this writing, which should be more tractable for this exercise.  

 For documentation and code showing how the data set was built, please see the [Build_Data_Set](Build_Data_Set.ipynb) Jupyter Notebook in this repository (not required).  

## Variable Descriptions (Code Book)

A listing of all the features -- their names and descriptions -- for our modified data set is shown below. Links to additional information are included for the curious, but are not necessary to complete this exercise. 

   * **incident_id**: Incident or case ID. Corresponds to the JSON filename in the [VCDB json directory](https://github.com/vz-risk/VCDB/tree/master/data/json/validated). 
   * **timeline.incident.day**:  Day of month incident occurred.  
   * **timeline.incident.month**: Month incident occurred.  
   * **timeline.incident.time**: Time incident occurred.  
   * **timeline.incident.year**: Year incident occurred.  
   * **actor**: Entities that cause or contribute to an incident. [source](http://veriscommunity.net/actors.html) 
   * **action**: Describe what the threat actor did to cause or contribute to the incident. [source](http://veriscommunity.net/actions.html) 
   * **attribute.confidentiality**: Was this a [confidentiality](https://resources.infosecinstitute.com/cia-triad/) breach (T/F)? [source](http://veriscommunity.net/attributes.html#section-confidentiality) 
   * **attribute.integrity**: Was this an [integrity](https://resources.infosecinstitute.com/cia-triad/) incident (T/F)? [source](http://veriscommunity.net/attributes.html#section-integrity)
   * **attribute.availability**: Was this an [availability](https://resources.infosecinstitute.com/cia-triad/) incident (T/F)? [source](http://veriscommunity.net/attributes.html#section-availability)
   * **asset**: The information assets that were compromised during the incident. [source](http://veriscommunity.net/assets.html) 
   * **asset.variety**: The variety of the asset that was compromised during the event. Prepended with a single-letter abbreviation of the asset class.
   * **confidentiality.medical_records**: Count of the number of medical records breached.  
   * **confidentiality.payment_records**: Count of the number of payment records breached.  
   * **confidentiality.personal_records**: Count of the number of personal records breached.  
   * **confidentiality.total_record_count**: Count of the total records breached (includes other classes besides the previous three).  
   * **victim.employee_count**: Number of employees for the victim organization. Small: 1,000 employees or fewer. Large: 1,001 employees or more. [source](http://veriscommunity.net/enums.html#section-victims)
   * **victim.state**: Victim organization's state (if country == US) 
   * **victim.country**: Victim organization's country, 2-letter code.  See [code_to_country.json](https://github.com/vz-risk/veris/blob/master/code_to_country.json) for the list of country codes (not required for this exercise).  
   * **victim.victim_id**: Name of the victim organization.  
   * **summary**:  Free-text summary written by the user who entered the event into VCDB.  
   * **reference**: Usually a link to a news source for the breach.  
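
As a rough sketch of how these columns combine for the insider-threat scenario in the modeling questions, the snippet below selects medical-record counts for large victims with an internal actor. It rests on assumptions that should be checked against the data: that the size label is spelled `Large`, the actor label `Internal`, that the T/F flag is loaded as a boolean, and that missing counts are `None`. The helper name and the rows are hypothetical.

```python
import statistics


def scenario_record_counts(rows):
    """Medical-record counts for large victims hit by an internal actor."""
    return [
        r["confidentiality.medical_records"]
        for r in rows
        if str(r["victim.employee_count"]).lower() == "large"  # assumed label
        and str(r["actor"]).lower() == "internal"  # assumed label
        and r["attribute.confidentiality"]  # assumed boolean flag
        and r["confidentiality.medical_records"] is not None  # assumed missing marker
    ]


def make_row(size, actor, confidential, records):
    """Build one made-up incident row with only the columns used here."""
    return {
        "victim.employee_count": size,
        "actor": actor,
        "attribute.confidentiality": confidential,
        "confidentiality.medical_records": records,
    }


# Made-up rows for illustration only.
rows = [
    make_row("Large", "Internal", True, 120),
    make_row("Large", "Internal", True, 4500),
    make_row("Large", "External", True, 900),
    make_row("Small", "Internal", True, 60),
    make_row("Large", "Internal", True, None),
    make_row("Large", "Internal", True, 800),
]

counts = scenario_record_counts(rows)
print(counts, statistics.median(counts))
```

Breach sizes tend to be heavily right-skewed, so it may be worth inspecting the selected counts on a log scale before choosing a distribution.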

## Notes  

  * An event may have one or more attributes (e.g. it could be a confidentiality breach and an integrity incident at the same time). 
  * Events often have more than one actor, action, or asset affected. In order to keep the data relatively simplified, in these cases we chose just a single entry for these features for each incident.  
  * The `attribute` features are part of the [CIA triad](https://whatis.techtarget.com/definition/Confidentiality-integrity-and-availability-CIA). 
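
Because one event can carry several attributes, the three `attribute` shares are computed independently and need not sum to 1. A minimal sketch, assuming the T/F values are loaded as Python booleans (the helper name and rows are hypothetical):

```python
ATTRIBUTE_COLUMNS = (
    "attribute.confidentiality",
    "attribute.integrity",
    "attribute.availability",
)


def attribute_shares(rows):
    """Share of incidents in which each attribute flag is true."""
    total = len(rows)
    if total == 0:
        return {column: 0.0 for column in ATTRIBUTE_COLUMNS}
    return {
        column: sum(1 for r in rows if r[column]) / total
        for column in ATTRIBUTE_COLUMNS
    }


# Made-up incidents; two of them carry more than one attribute.
rows = [
    dict(zip(ATTRIBUTE_COLUMNS, (True, True, False))),
    dict(zip(ATTRIBUTE_COLUMNS, (True, False, False))),
    dict(zip(ATTRIBUTE_COLUMNS, (True, True, True))),
    dict(zip(ATTRIBUTE_COLUMNS, (False, False, False))),
]

shares = attribute_shares(rows)
print(shares, sum(shares.values()))
```

Here the shares add up to 1.5, which is expected with overlapping flags and is why each attribute is best plotted as its own series.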


# Write Your Code Here