# Introduction to Python and Useful Data Science Libraries

There are a couple of well-established data science libraries that you will find useful when exploring security data.

* pandas
* numpy
* matplotlib
* seaborn
* sklearn

The Pandas library stands for Python Data Analysis Library. Pandas is a game changer when it comes to analyzing data with Python, and it is one of the most preferred and widely used tools in data science.

Pandas takes data (from a CSV or TSV file, or a SQL database) and creates a Python object with rows and columns called a data frame, which looks similar to a table in statistical software (Excel, SPSS, R). This makes the data easier to work with than lists and/or dictionaries processed through for loops or list comprehensions.

To use Pandas in your Jupyter Notebook, you need to import the Pandas library first. Importing the library loads it into memory so that it is available for you to work with. To import Pandas, all you have to do is run the following code:

In [1]:
import pandas as pd
import numpy as np

The second part, `as pd`, allows you to access Pandas with `pd.command` instead of needing to write `pandas.command` every time you use it. I also imported NumPy (as `np`) because it is a very useful library for scientific computing with Python. Now Pandas is ready for use! You need to repeat these imports every time you start a new Jupyter Notebook.

### Loading and Saving Data with Pandas

When you want to use Pandas for data analysis, you'll usually use it in one of three different ways:
* Convert a Python list, dictionary, or NumPy array to a Pandas data frame
* Open a local file using Pandas, usually a CSV file, but it could also be a delimited text file (like TSV), an Excel file, etc.
* Open a remote file or database, like a CSV or JSON file on a website through a URL, or read from a SQL table/database
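
The first two routes listed above can be sketched with a few lines of Pandas. This is a minimal illustration, not part of the original notebook: the column names and values are made up, and an in-memory `io.StringIO` buffer stands in for a local file path or URL (`pd.read_csv` accepts all three).

```python
import io

import numpy as np
import pandas as pd

# 1. From a Python dictionary: the keys become column names.
#    The columns and values here are invented for illustration.
df_from_dict = pd.DataFrame({
    "host": ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "bytes": [512, 2048, 128],
})

# 2. From a NumPy array: column names are supplied explicitly.
df_from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])

# 3. From delimited text. The StringIO buffer stands in for a file path or URL.
tsv_text = "host\tport\n10.0.0.1\t21\n10.0.0.2\t80\n"
df_from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(df_from_dict)
print(df_from_array)
print(df_from_tsv)
```

Reading from a SQL database follows the same pattern with `pd.read_sql`, which additionally needs a database connection.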

Below we will demonstrate reading a local file (a Bro log) into a Pandas data frame.

To read a Bro log into a Pandas data frame, we rely on the Bro Analysis Toolkit (BAT) library. The BAT Python package supports the processing and analysis of Bro data with Pandas, scikit-learn, and Spark. The goals of BAT are:

* Offload: Complex tasks like statistics, state machines, machine learning, etc. are offloaded from Bro so that Bro can focus on the efficient processing of high-volume network traffic
* Data Analysis: Use a large set of classes that help bridge from raw Bro data to packages like Pandas, scikit-learn, and Spark

In [None]:
from bat.log_to_dataframe import LogToDataFrame

# Create a Pandas dataframe from a Bro log
df = LogToDataFrame('data/ftp.log')

#### Computer Network Traffic Data 

The U.S. National CyberWatch Mid-Atlantic Collegiate Cyber Defense Competition (MACCDC) is a unique experience for college and university students to test their cybersecurity knowledge and skills in a competitive environment. The MACCDC takes great pride in being one of the premier events of this type in the United States.

While similar to other cyber defense competitions in many aspects, the MACCDC, as part of the National CCDC, is unique in that it focuses on the operational aspects of managing and protecting an existing network infrastructure. The teams are physically co-located in the same building. Each team is given physically identical computer configurations at the start of the competition. Throughout the competition, the teams have to ensure the systems supply the specified services while under attack from a volunteer Red Team. In addition, the teams have to satisfy periodic “injects” that simulate business activities IT staff must deal with in the real world.

MACCDC2012: generated with Bro from the 2012 dataset. A nice dataset that has everything from scanning/recon through exploitation, as well as some c99 shell traffic.

In [None]:
print(df)

When the DataFrame is large, like the one above, you can still print it to the screen, or you can simply print the first 5 rows of the DataFrame with the `.head()` method.

In [None]:
df.head()
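
As a small sketch of how to peek at a large DataFrame, the example below uses a hypothetical 100-row stand-in (not the Bro log) to show that `.head()` accepts a row count, that `.tail()` works the same way from the end, and that `.shape` reports the size without printing any rows.

```python
import pandas as pd

# Hypothetical stand-in for a large log DataFrame.
sample = pd.DataFrame({"id": range(100), "value": [i * 2 for i in range(100)]})

# (number of rows, number of columns)
print(sample.shape)

# First 3 rows instead of the default 5.
first_three = sample.head(3)
print(first_three)

# Last 2 rows.
last_two = sample.tail(2)
print(last_two)
```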

### Basic DataFrame Manipulation

The rows and columns of a DataFrame may have names (as you can see from the DataFrame above, when we printed it to the screen). To find out which names are used for the columns, use the `keys()` method, which is accessible with the dot syntax. You can also loop through the names of the columns.

In [None]:
print('Names of Columns:')
print(df.keys())
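
Looping through the column names might look like the sketch below. The DataFrame and its columns (`ts`, `user`, `port`) are hypothetical and are not the actual fields of the Bro log.

```python
import pandas as pd

# Hypothetical columns for illustration only.
frame = pd.DataFrame({
    "ts": [1.0, 2.0],
    "user": ["alice", "bob"],
    "port": [21, 21],
})

column_names = []
for name in frame.keys():  # frame.columns yields the same labels
    column_names.append(name)
    # Print each column name next to the data type Pandas inferred for it.
    print(name, frame[name].dtype)
```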

## Numpy

NumPy (short for Numerical Python) provides an efficient interface to store and operate on dense data buffers. In some ways, NumPy arrays are like Python's built-in list type, but NumPy arrays provide much more efficient storage and data operations as the arrays grow larger in size. NumPy arrays form the core of nearly the entire ecosystem of data science tools in Python, so time spent learning to use NumPy effectively will be valuable no matter what aspect of data science interests you.
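
To make the comparison with Python lists concrete, here is a minimal sketch (the values are arbitrary). With a list, an element-wise operation needs an explicit loop or comprehension; with a NumPy array, the same operation is applied to the whole array at once.

```python
import numpy as np

values = list(range(5))

# Plain list: element-wise work needs a loop or a list comprehension.
doubled_list = [v * 2 for v in values]

# NumPy array: the multiplication is applied to every element at once.
arr = np.array(values)
doubled_arr = arr * 2

print(doubled_list)
print(doubled_arr)

# All elements share one data type and are stored in a single dense buffer.
print(arr.dtype, arr.nbytes)
```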

### numpy functions for DataFrames

DataFrame objects can often be treated as arrays, especially when they contain numeric data. Most NumPy functions work on DataFrame objects, but they can also be accessed with the dot syntax, like `dataframe_name.function()`. Simply type `df.` in a code cell, then hit the tab key to see all the functions that are available (there are many). In the code cell below, we compute the maximum number of flows and the mean number of flows.

In [None]:
print('maximum number of flows:', df.f.max())
print('mean number of flows:', df.f.mean())
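
The two calling styles can be compared side by side. The sketch below uses a hypothetical column named `flows` with made-up values, since the exact columns of the log are not shown here; it only illustrates that the NumPy function and the dot-syntax method give the same result.

```python
import numpy as np
import pandas as pd

# Hypothetical flow counts; the column name "flows" is an assumption.
flows_df = pd.DataFrame({"flows": [3, 7, 1, 9]})

# Dot syntax: methods on the column itself.
max_method = flows_df["flows"].max()
mean_method = flows_df["flows"].mean()

# NumPy functions applied to the same column.
max_numpy = np.max(flows_df["flows"])
mean_numpy = np.mean(flows_df["flows"])

print("maximum:", max_method, max_numpy)
print("mean:", mean_method, mean_numpy)
```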

### Exercise 1

We have already loaded the data you need. Using the DataFrame, perform the following tasks:
* Report 10 of the local IP addresses
* Print the minimum number of flows

### Plotting DataFrames

You can plot a column or row of a DataFrame with pandas. The plotting capabilities of pandas also use the dot syntax, like `dataframe.plot()`. All columns can be plotted simultaneously.
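
A minimal sketch of the dot-syntax plotting follows. The DataFrame, its column names (`inbound`, `outbound`), and the values are invented for illustration. In a notebook with inline plotting enabled the figures are rendered automatically; in a plain script you would also need to call `matplotlib.pyplot.show()`.

```python
import pandas as pd

# Hypothetical per-minute connection counts.
traffic = pd.DataFrame(
    {"inbound": [5, 8, 6, 12], "outbound": [3, 4, 9, 7]},
    index=[0, 1, 2, 3],
)

# Plot all numeric columns at once; each column becomes one line.
ax_all = traffic.plot(title="All columns")
ax_all.set_xlabel("minute")

# Plot a single column by naming it with the y argument.
ax_single = traffic.plot(y="inbound", title="Inbound only")
```

Both calls return a matplotlib `Axes` object, which can be customized further with the usual matplotlib methods.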

### Matplotlib

`matplotlib.pyplot` is a collection of command-style functions that make matplotlib work like MATLAB. Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc.

In `matplotlib.pyplot`, various states are preserved across function calls, so it keeps track of things like the current figure and plotting area, and the plotting functions are directed to the current axes. (Please note that "axes" here and in most places in the documentation refers to the axes part of a figure and not the strict mathematical term for more than one axis.)

In [None]:
import matplotlib
%matplotlib inline

import matplotlib.pyplot as plt
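
The stateful behavior described above can be seen in a short sketch (the data points and labels are arbitrary). Each call acts on the current figure and axes without them being passed around explicitly, and `plt.gca()` returns the axes that the earlier calls modified.

```python
import matplotlib.pyplot as plt

# Create a figure and make it the current one.
plt.figure()

# Plot onto the current axes (created on demand inside the current figure).
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])

# Decorate the same axes; no figure or axes object is passed explicitly.
plt.xlabel("x")
plt.ylabel("x squared")
plt.title("pyplot keeps track of the current axes")

# Retrieve the axes that all of the calls above were directed to.
current_axes = plt.gca()
print(current_axes.get_xlabel(), current_axes.get_ylabel())
```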

References:
* https://towardsdatascience.com/a-quick-introduction-to-the-pandas-python-library-f1b678f34673
* https://www.secrepo.com/
* https://github.com/SuperCowPowers/bat