# **Python Data Science Workshop**
![LISA logo](https://raw.githubusercontent.com/wshand/Python-Data-Science-Workshop/master/assets/LISA_logo_medium.jpg)

This notebook introduces some of the basic tools commonly used for data science and statistics in Python.

## Introduction
In this workshop, we're going to look at the following Python tools for analyzing data:

* Jupyter Notebook
* pandas
* matplotlib

## Jupyter Notebook
[Jupyter Notebook](https://jupyter.org/) is an interactive tool for running code and visualizing data. To run code, click a code cell (like the one below) and then do one of the following:
* Press `Shift + Enter` on your keyboard
* On the toolbar at the top of this notebook, press the <button class="btn btn-default" title="Run"><i class="fa-step-forward fa"></i><span class="toolbar-btn-label">Run</span></button> button.

In [None]:
print("Hello, world!")

In addition to running code, you can render [Markdown](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html) in your notebooks (just like the text you're reading right now). For instance, try double-clicking the text below and adding the following text on a new line underneath it: `### This is a heading!`. Then run the cell using either `Shift + Enter` or the "Run" button, as you did above.

__*Double-click this text!*__

One thing that makes Jupyter great is that it's highly customizable. As a basic example, we can create [widgets](https://ipywidgets.readthedocs.io/en/latest/index.html) that provide a nice user interface for interacting with data. Try running the code cell below, which creates two plots and displays them in adjacent tabs.

In [None]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
import ipywidgets as widgets
from IPython.display import display
from scipy.stats import norm, linregress

out = [widgets.Output() for _ in range(2)]
tabs = widgets.Tab(children=[out[0], out[1]])
tabs.set_title(0, 'Linear regression')
tabs.set_title(1, 'Normal distribution')

with out[0]:
    # Fit line to some random data
    x = np.random.uniform(size=30)
    y = x + np.random.normal(scale=0.1, size=30)
    slope, intercept, _, _, _ = linregress(x,y)
    u = np.linspace(0, 1)
    
    # Plot
    fig1, axes1 = plt.subplots()
    axes1.scatter(x, y)
    axes1.plot(u, slope * u + intercept, 'k')
    plt.show()

with out[1]:
    # Plot the probability density function (pdf) of the
    # standard normal distribution.
    x = np.linspace(-3.5, 3.5, num=100)
    p = norm.pdf(x)
    
    # Plot
    fig2, axes2 = plt.subplots()
    axes2.plot(x, p)
    plt.show()

display(tabs)

You can create much richer and more complex interfaces that include buttons, sliders, progress bars, and more with Jupyter's `ipywidgets` library.

Jupyter Notebooks are also simple to share with others. For example, [JupyterHub](https://jupyterhub.readthedocs.io/en/stable/) is a multi-user server that lets a group of people run notebooks from a shared deployment.

## pandas
[pandas](https://pandas.pydata.org/) is a Python library that provides useful data structures and tools for analyzing data.

The fundamental data structure of the pandas library is the `DataFrame`. In the following code, we load the [iris flower dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set) using the [seaborn library](https://seaborn.pydata.org/), which returns the dataset as a pandas `DataFrame`.

In [None]:
import pandas as pd
import seaborn as sns

iris = sns.load_dataset('iris')

# `iris` is stored as a pandas DataFrame
print('Type of "iris":', type(iris))

# Show the first few entries in this DataFrame
iris.head()

Let's get some information about these data. We want to do the following:

1. Find out how many columns there are in the `DataFrame` object, and what kinds of data are in each column
2. Calculate the average petal length
3. Determine what species of flowers are in the dataset
4. Get an overall summary of the dataset

In [None]:
# 1. Column labels, and types of data in each column
iris.dtypes

In [None]:
# 2. Calculate the average petal length
iris['petal_length'].mean()

In [None]:
# 3. Determine which species of flowers are in the dataset
iris['species'].unique()

In [None]:
# 4. Summary of the data
iris.describe()
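
A few other `DataFrame` attributes can answer the same kinds of questions. The sketch below is only an illustration: it uses a small hand-made table named `toy` (a hypothetical stand-in whose values are made up, not rows from the real iris dataset) so that it runs without downloading anything.

```python
import pandas as pd

# Hypothetical stand-in for `iris`: similar column names, made-up values.
toy = pd.DataFrame({
    "petal_length": [1.4, 1.5, 4.5, 4.7, 6.0, 5.1],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# `shape` is a (number of rows, number of columns) tuple
n_rows, n_cols = toy.shape

# Column labels as a plain Python list
column_names = list(toy.columns)

# Number of rows belonging to each species
species_counts = toy["species"].value_counts()

print(n_rows, n_cols)
print(column_names)
print(species_counts)
```

The same attributes and methods should work on `iris` itself, since it is also a `DataFrame`.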

In [None]:
# Create one DataFrame corresponding to each species of flower. Below are
# two different ways of filtering the data like this; pick whichever
# method you prefer as they are equivalent.

# Method 1: "query" function
setosa     = iris.query('species == "setosa"')
versicolor = iris.query('species == "versicolor"')

# Method 2: index into the DataFrame
virginica = iris[iris['species'] == 'virginica']

# Note: the choice of which method to use for which species was arbitrary.
# You could just as easily do something like
#
#     setosa = iris[iris['species'] == 'setosa']
#
# or
#
#     virginica = iris.query('species == "virginica"')
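
To see that the two filtering methods select the same rows, here is a minimal sketch on a small hand-made table. The name `toy` and its values are assumptions made for illustration, not part of the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for `iris`: similar column names, made-up values.
toy = pd.DataFrame({
    "petal_length": [1.4, 1.5, 4.5, 4.7, 6.0, 5.1],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Method 1: "query" takes the condition as a string
by_query = toy.query('species == "setosa"')

# Method 2: build a boolean Series (True where the species matches),
# then use it to index into the DataFrame
mask = toy["species"] == "setosa"
by_mask = toy[mask]

# `equals` checks that both results have the same rows, columns and values
print(by_query.equals(by_mask))
```

Both results keep the row labels of the original table, so the filtered rows can still be traced back to where they came from.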

## matplotlib
Python has a *massive* number of libraries that can be used for data visualization; [this article](https://www.anaconda.com/blog/developer-blog/python-data-visualization-2018-why-so-many-libraries/) gives a high-level overview of many of them. For this workshop we'll be using [matplotlib](https://matplotlib.org/), since it is one of Python's most widely used libraries for visualization.

In [None]:
# Create a bar chart with average petal length for each species of flower
import matplotlib.pyplot as plt

x_ind = [0, 1, 2]   # 3 positions on the x-axis for 3 bars
width = 0.35        # width of the bars

# Mean petal lengths and their standard deviations for each species
iris_means = []
iris_stds  = []
for species in (setosa, versicolor, virginica):
    iris_means.append(species['petal_length'].mean())
    iris_stds.append(species['petal_length'].std())

plt.bar(x_ind, iris_means, width, yerr=iris_stds)

# Set axis labels and plot title
plt.xlabel("Species")
plt.ylabel("Average petal length")
plt.title("Mean petal length for each species of flower")

# Add species name to the x-axis
plt.xticks(x_ind, ("Setosa", "Versicolor", "Virginica"))

# Outside of a notebook, matplotlib generally requires you to call "show" in
# order to display plots. In Jupyter Notebook, though, plots are usually
# displayed even if you don't call "show".
plt.show()
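
The loop above that collects a mean and standard deviation for each species could also be written with pandas' `groupby`. This is a sketch of an alternative, not the approach used in the cell above, and it runs on a small hand-made table (`toy`, with made-up values) rather than the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for `iris`: similar column names, made-up values.
toy = pd.DataFrame({
    "petal_length": [1.4, 1.5, 4.5, 4.7, 6.0, 5.1],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Group the rows by species, then compute the mean and (sample) standard
# deviation of petal length within each group. The result has one row
# per species, sorted by species name.
stats = toy.groupby("species")["petal_length"].agg(["mean", "std"])

iris_means = stats["mean"].tolist()
iris_stds = stats["std"].tolist()

print(stats)
```

The resulting `iris_means` and `iris_stds` lists could then be passed to `plt.bar` as the bar heights and the `yerr` argument, just like the lists built by the loop.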

## Additional References

* O'Reilly provides a couple of good books that go in-depth about these tools and more:
  * [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do)
  * [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do) -- this book was published in 2012 and may be slightly dated. However, the author provides some Jupyter Notebooks for free in [this repository](https://github.com/wesm/pydata-book) that you may find helpful.
* Check out the full documentation for Jupyter on the [Project Jupyter site](https://jupyter.org/documentation).