# Introduction to Data Science

This is a presentation given to Iowa State University's Women's Alliance for Cybersecurity on April 7th, 2018.

## What is data science?

When a statistician and a database administrator love each other very much, they share a special hug, and 9 months later a data scientist is born.

"Data science" is not a well-defined field. Here's how I think about it\*:
- **Data management:** Storing and moving data effectively within an organization
- **Data analysis:** Using data to learn about the world and make decisions
    - **Mathematical modeling:** Analysis via mathematical models
        - **Statistics:** Analyzing the process that generated the data
        - **Machine learning:** Making predictions
    - **Data visualization:** Analysis via charts, interactive dashboards, etc.
    
These are not mutually exclusive: A plot may include statistical error bars, a machine learning method may require more data than fits in memory, etc.

I will focus more on the data analysis side of things in this presentation.

###### \*Many people will disagree with this breakdown.

## Why should this interest women in cybersecurity?

Deciding whether to flag a network event as a potential threat, particularly automatically, may be a data-driven task:
- Log files are data. The dataset can become quite large over time.
- What features of an event would make you suspicious of it? What does a "typical event" look like?
- Relatedly, is there a way to measure the similarity of events to each other?
- False negatives (missed threats) are far more problematic than false positives (false alarms).
- These classifications may need to be performed in real time.

Problems in cybersecurity touch on many aspects of data science.
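
One rough way to approach the "typical event" and similarity questions above is to standardize a few numeric features and measure each event's distance from the average event. The sketch below assumes made-up feature values and an arbitrary choice of features (duration, packets, bytes); it illustrates the idea and is not a recommended detection method.

```python
import numpy as np

# Hypothetical events: one row per event, columns are
# (duration in seconds, packets sent, bytes sent). All values are made up.
events = np.array([
    [1.0, 10.0, 1_000.0],
    [2.0, 12.0, 1_200.0],
    [1.5, 11.0, 1_100.0],
    [1.2, 9.0, 900.0],
    [60.0, 500.0, 900_000.0],  # deliberately unusual event
])

# Standardize each feature so that no single unit dominates the distance.
typical_event = events.mean(axis=0)
spread = events.std(axis=0)
standardized = (events - typical_event) / spread

# Euclidean distance of each standardized event from the "typical event",
# which sits at the origin after standardization.
distance_from_typical = np.linalg.norm(standardized, axis=1)

# Index of the event that looks least like the others.
most_unusual = int(np.argmax(distance_from_typical))
```

Which features to include, and how far from typical an event must be before it is flagged, are judgment calls that depend on how costly a missed threat is compared with a false alarm.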

## Resources

See the ISU data science club's [resources repository](https://github.com/ISU-DataScienceClub/Resources).

## Example

This example will demonstrate some of the commonly used Python data science packages. We will look at a small subset of Los Alamos National Laboratory's [Network Event Dataset](https://csr.lanl.gov/data/2017.html).

First, we set things up with the following libraries:
- [Matplotlib](https://matplotlib.org/) is Python's most well-established plotting package. It is at its best for static, 2-dimensional plots; its support for 3D and interactive graphics is more limited.
- [Pandas](http://pandas.pydata.org/) provides data frames, which are a fundamental data structure in data science. Data frames allow for easy, intuitive manipulation of datasets.
- [Seaborn](http://seaborn.pydata.org/) improves Matplotlib's default plot styles, and provides tools to make relatively complex plots quickly.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# This allows Matplotlib to play nicely with the Jupyter notebook format
%matplotlib inline

# This replaces Matplotlib's ugly default plot styles with something more pleasant
sns.set()

### Using Pandas to explore the data

In [None]:
dat = pd.read_feather("data/network_event_data.feather")

Field descriptions from the website:
- **`time`:** The start time of the event in epoch time format.
- **`duration`:** The duration of the event in seconds.
- **`src_device`:** The device that likely initiated the event.
- **`dst_device`:** The receiving device.
- **`protocol`:** The protocol number.
- **`src_port`:** The port used by the SrcDevice.
- **`dst_port`:** The port used by the DstDevice.
- **`src_packets`:** The number of packets the SrcDevice sent during the event.
- **`dst_packets`:** The number of packets the DstDevice sent during the event.
- **`src_bytes`:** The number of bytes the SrcDevice sent during the event.
- **`dst_bytes`:** The number of bytes the DstDevice sent during the event.
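
Epoch times are hard to read directly, so it can help to convert them to timestamps. The sketch below uses a small made-up data frame and assumes `time` counts seconds from the Unix epoch (1970-01-01); if the dataset measures time from a different origin, the resulting dates are only relative. The `start` and `end` column names are hypothetical.

```python
import pandas as pd

# Hypothetical rows; the values are made up for illustration.
toy = pd.DataFrame({"time": [0, 86_400, 90_000], "duration": [5, 0, 120]})

# Interpret `time` as seconds since the Unix epoch (an assumption about the unit).
toy["start"] = pd.to_datetime(toy["time"], unit="s")

# Adding the duration (in seconds) gives an end timestamp for each event.
toy["end"] = toy["start"] + pd.to_timedelta(toy["duration"], unit="s")
```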

In [None]:
dat.head()

In [None]:
dat.info()

### Cleaning the source and destination port fields

We looked at the first few rows of the dataset earlier. Note that the `src_port` and `dst_port` fields are inconsistent: some entries are prefixed with the word "Port", while others are bare numbers.

In [None]:
dat[["src_port", "dst_port"]].head()

To fix this, we will need to:
1. Strip the word "Port" from each entry of the column, if it's there.
2. Convert the data type of the column to integer.

In [None]:
# Remove the literal text "Port" where present, then convert to integer
dat["src_port"] = dat["src_port"].str.replace("Port", "", regex=False).astype(int)
dat["src_port"].head()

Now do the same to the destination port column:

In [None]:
dat["dst_port"] = dat["dst_port"].str.replace("Port", "", regex=False).astype(int)
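
To see the two cleaning steps in isolation, here is a minimal sketch on a made-up series of port values (these are not taken from the dataset). It also illustrates a subtlety: `str.strip("Port")` treats its argument as a set of characters to trim from both ends rather than as a prefix, which happens to give the same result when the remaining text is all digits.

```python
import pandas as pd

# Hypothetical port values mixing the two formats; not taken from the dataset.
ports = pd.Series(["Port12345", "80", "Port443"])

# Remove the literal substring "Port" (regex=False treats it as plain text),
# then convert the remaining digits to integers.
cleaned = ports.str.replace("Port", "", regex=False).astype(int)

# str.strip("Port") trims any of the characters P, o, r, t from both ends of
# each string. Digits are never trimmed, so the result matches here.
stripped = ports.str.strip("Port").astype(int)
```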

### How many distinct protocols are used on this network?

In [None]:
unique_protocols = dat["protocol"].unique()
print(f"There are {len(unique_protocols)} distinct protocols used.")

I select `time` below, but any column without missing values would give the same result, because `count` tallies the non-missing entries in each column.

In [None]:
dat.groupby("protocol").count()["time"]
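
The behavior of `count` is easiest to see on a small made-up data frame with one missing value. The sketch below is illustrative only; the column values are hypothetical. It also shows `size`, which counts rows per group and therefore does not depend on which column is selected.

```python
import numpy as np
import pandas as pd

# Hypothetical events; one `duration` value is deliberately missing.
toy = pd.DataFrame({
    "protocol": [6, 6, 17, 6],
    "time": [100, 101, 102, 103],
    "duration": [1.0, np.nan, 3.0, 4.0],
})

# count() tallies non-missing values per column, so the answer depends on
# which column is selected when data are missing.
by_time = toy.groupby("protocol")["time"].count()
by_duration = toy.groupby("protocol")["duration"].count()

# size() counts rows per group regardless of missing values.
rows_per_protocol = toy.groupby("protocol").size()
```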

### Which source device sends the largest average number of packets?

In [None]:
avg_packets_sent = dat.groupby("src_device")["src_packets"].mean()

avg_packets_sent.head()

In [None]:
avg_packets_sent = avg_packets_sent.sort_values(ascending=False)

avg_packets_sent.head()

In [None]:
print(f"The device {avg_packets_sent.index[0]} sends the largest number of packets on average.")
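
If only the top device is needed, sorting the whole series is optional. A minimal sketch on made-up data, using `idxmax` to get the label of the largest value and `nlargest` to get the top few (the device names and packet counts are hypothetical):

```python
import pandas as pd

# Hypothetical events; device names and packet counts are made up.
toy = pd.DataFrame({
    "src_device": ["A", "A", "B", "B", "C"],
    "src_packets": [10, 20, 100, 300, 50],
})

# Average packets sent per source device.
avg = toy.groupby("src_device")["src_packets"].mean()

# idxmax() returns the index label (here, the device) of the largest value.
top_device = avg.idxmax()
top_average = avg.max()

# nlargest(n) returns the n largest values, already sorted in descending order.
top_two = avg.nlargest(2)
```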

### Use Matplotlib (+ Seaborn) to plot the data

In [None]:
# Rescale to millions of seconds so the axis ticks stay tied to the data
ax = (dat["duration"] / 1e6).plot(kind="hist", log=True, figsize=(10, 6))
ax.set(title="Histogram of Network Event Durations",
       xlabel="Duration (millions of seconds)");

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))
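# A scatter plot shows one point per event; a line plot would connect unrelated events
ax.scatter(dat["src_packets"], dat["dst_packets"], s=5, alpha=0.3)

ax.set(title="Packets Sent by Source vs. Destination, per Event",
       xlabel="Packets sent by source (src_packets)",
       ylabel="Packets sent by destination (dst_packets)");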
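
Packet counts often span several orders of magnitude, in which case logarithmic axes can make the relationship easier to see. The sketch below uses made-up counts and assumes every count is positive; events with zero packets cannot be drawn on a log scale and would need separate handling.

```python
import matplotlib
matplotlib.use("Agg")  # only needed outside a notebook; drop this line in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical packet counts spanning several orders of magnitude.
toy = pd.DataFrame({
    "src_packets": [1, 10, 100, 1000],
    "dst_packets": [2, 8, 150, 900],
})

fig, ax = plt.subplots(figsize=(10, 6))

# One semi-transparent point per event, so overlapping points remain visible.
ax.scatter(toy["src_packets"], toy["dst_packets"], alpha=0.5)

# Logarithmic axes spread out small values and compress large ones.
ax.set_xscale("log")
ax.set_yscale("log")
ax.set(xlabel="Packets sent by source", ylabel="Packets sent by destination")
```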