<a href="https://colab.research.google.com/github/joh06288/AMIA2019_W07/blob/master/Intro_to_Jupyter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to the Data Science Workshop

## Medinfo 2019
### Data Science Workshop
#### August 26, 2019

# Python Data Science Ecosystem

The Python community has done a tremendous job of developing an ecosystem of tools for Data Science.  The Anaconda organization (https://anaconda.org/) provides a standardized and maintained distribution of the Python Data Science environment.  The environment includes:

- NumPy - Numerical library
- Pandas - Data manipulation using DataFrames
- Scikit-learn - Machine learning library
- Matplotlib, Seaborn, Bokeh - Graphing and data visualization
- Jupyter - A notebook interface for developing and documenting projects



## NumPy
NumPy is the fundamental package for scientific computing with Python. It contains among other things:

- a powerful N-dimensional array object
- sophisticated (broadcasting) functions
- tools for integrating C/C++ and Fortran code
- useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
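
As a minimal sketch of the N-dimensional array and broadcasting features listed above, the example below adds a 1-D array to every row of a 2-D array. The variable names and values are illustrative assumptions, not part of the workshop material.

```python
import numpy as np

# A 2-D array with 3 rows and 4 columns holding the values 0..11
grid = np.arange(12).reshape(3, 4)

# Broadcasting: the 1-D array of 4 offsets is applied to every row of grid
col_offsets = np.array([10, 20, 30, 40])
shifted = grid + col_offsets

# A scalar is broadcast to every element
doubled = grid * 2

print(shifted.shape)
print(shifted[0])
print(doubled[2, 3])
```

Broadcasting avoids writing explicit Python loops, which is usually both shorter and faster for array arithmetic.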

Link: http://www.numpy.org/

## Pandas

Pandas is the Python Data Analysis Library. It is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

pandas is well suited for many different kinds of data:

- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
- Ordered and unordered (not necessarily fixed-frequency) time series data
- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
- Any other form of observational / statistical data sets; the data need not be labeled at all to be placed into a pandas data structure
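
Two of the data shapes listed above can be sketched as follows: a table whose columns have different types, and a time series with irregular spacing. The column names and values are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical visit records: each column has its own data type
visits = pd.DataFrame({
    "patient_id": [101, 102, 103],
    "clinic": ["North", "South", "North"],
    "weight_kg": [70.5, 82.0, 65.3],
})
print(visits.dtypes)

# A time series whose observations are not evenly spaced
readings = pd.Series(
    [98.6, 99.1, 98.9],
    index=pd.to_datetime(["2019-08-26", "2019-08-27", "2019-08-30"]),
    name="temperature_f",
)

# Date strings can be used to slice a Series with a datetime index
recent = readings.loc["2019-08-27":]
print(recent)
```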

Link: https://pandas.pydata.org/index.html

## Scikit-learn

Scikit-learn is a comprehensive machine learning library for Python:
- Simple and efficient tools for data mining and data analysis
- Accessible to everybody, and reusable in various contexts
- Built on NumPy, SciPy, and matplotlib
- Open source, commercially usable - BSD license

Scikit-learn provides the ability to easily perform Preprocessing, Classification, Regression, Clustering, Dimensionality reduction and Model selection.

Link: http://scikit-learn.org/stable/index.html

## Matplotlib

Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.

Link: https://matplotlib.org/

## Seaborn

Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.

Link: https://seaborn.pydata.org/

## Bokeh

Bokeh is an interactive visualization library that targets modern web browsers for presentation. Its goal is to provide elegant, concise construction of versatile graphics, and to extend this capability with high-performance interactivity over very large or streaming datasets. Bokeh can help anyone who would like to quickly and easily create interactive plots, dashboards, and data applications.

Link: https://bokeh.pydata.org/en/latest/

# Jupyter

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.

### Strengths
- Works for multiple languages including Python, R, SQL and shell commands
- Document the data science process
- Reproduce all steps
- Enhance productivity and reuse
- Share your work

### Weaknesses
- Weak support for version control, debugging, and long-running tasks


Link: http://jupyter.org/

# Setup Jupyter Environment


In [0]:
# Import required modules
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import matplotlib as mplot
%matplotlib inline
import IPython
import os
from os import listdir

## Jupyter Quick Start

Useful keyboard shortcuts are listed below, followed by an example of how to get help on a function.

In [0]:
# Keyboard shortcuts

# Executing cells - Shift-Enter, Ctrl-Enter
# Adding a cell above - Ctrl-M,A
# Adding a cell below - Ctrl-M,B
# Split and Merge Cells - Shift-Ctrl-_, Ctrl-M,Shift-M
# Copy, Cut, Paste - Ctrl-C, Ctrl-X, Ctrl-V
# Undo - Ctrl-Z
# Select all - Ctrl-A
# Comment Cell - Ctrl-/

In [1]:
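# Get help for a function
# (run the import cell above first; if `pd` is undefined, IPython reports that the object is not found)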
pd.read_csv??

Object `pd.read_csv` not found.
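
The `??` syntax only works inside IPython or Jupyter, and only after `pandas` has been imported as `pd`. As a rough alternative that also works in a plain Python script, the standard-library `inspect` module can retrieve the same documentation. This is a sketch of one possible approach, not the workshop's own method.

```python
import inspect
import pandas as pd

# Fetch the docstring and show its first line
doc_lines = inspect.getdoc(pd.read_csv).splitlines()
print(doc_lines[0])

# List the parameter names without reading the whole docstring
params = list(inspect.signature(pd.read_csv).parameters)
print(params[:3])
```

The built-in `help(pd.read_csv)` prints the full documentation in the same way.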


In [0]:
# Example of a stack trace.  Uncomment the line below to raise a NameError (steve is not defined)
#print(steve)
    

# Data Science in the Cloud
### Many choices for using the cloud for data science
- Microsoft: Data Science Virtual Machine
- Amazon: SageMaker
- Google: Colab

#### Advantages
- Choose high-performance hardware and GPUs
- Parallel execution across multiple machines
- Easy “tear-down” and “scale-up”
- Cost effective for research, pilots
- High security, availability and performance

#### Disadvantages
- Can be costly for high-volume, production systems
- Vendor lock-in
- Must move data to the cloud



## Pandas Dataframes

Pandas is one of the most important packages for data science in Python.  The DataFrame makes it easy to manipulate data in a spreadsheet-like way by organizing it into rows and columns.

In Pandas, each column of a DataFrame is a `Series`.  A `Series` has a data type and methods that operate on all of the data in that column at once.  Several `Series` can be combined to form a DataFrame.

In [4]:
# Create some Series to use for our tutorial
s1 = pd.Series([1,3,5,7,9], name='odds', dtype=int)
s2 = pd.Series([2,4,6,8,10], name='evens', dtype=int)
s3 = pd.Series(['Monday','Tuesday','Wednesday','Thursday','Friday'], name='days', dtype=str)
                
# How many elements?
print(len(s2))

# Apply functions to the whole Series
print(s1.mean())

5
5.0
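
To illustrate what it means for a Series to operate on a whole column at once, here is a small sketch. It rebuilds shortened versions of the Series so that it runs on its own; the variable names are chosen for this example only.

```python
import pandas as pd

odds = pd.Series([1, 3, 5, 7, 9], name="odds")
evens = pd.Series([2, 4, 6, 8, 10], name="evens")
days = pd.Series(["Monday", "Tuesday", "Wednesday"], name="days")

# Arithmetic is applied element by element, aligned on the index
totals = odds + evens
print(totals.tolist())

# Text columns expose vectorized string methods through .str
upper_days = days.str.upper()
print(upper_days.tolist())

# Each Series carries a single dtype
print(odds.dtype)
```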


In [5]:
# Now let's create a DataFrame
# axis=1 means to treat each Series as a column.  axis=0 would instead stack the Series end to end
df = pd.concat([s1,s2,s3],axis=1)

# Jupyter displays DataFrames in a nice-looking table.  
# We will sometimes convert objects into DataFrames just to make the output pretty
display(df)

|   | odds | evens | days |
|---|------|-------|------|
| 0 | 1 | 2 | Monday |
| 1 | 3 | 4 | Tuesday |
| 2 | 5 | 6 | Wednesday |
| 3 | 7 | 8 | Thursday |
| 4 | 9 | 10 | Friday |
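
The effect of the `axis` argument to `pd.concat` can be seen by comparing both settings on the same input. The sketch below uses shortened Series so that it runs on its own.

```python
import pandas as pd

s1 = pd.Series([1, 3, 5], name="odds")
s2 = pd.Series([2, 4, 6], name="evens")

# axis=1: each Series becomes a column, aligned on the shared index
side_by_side = pd.concat([s1, s2], axis=1)
print(side_by_side.shape)

# axis=0 (the default): the Series are stacked end to end into one longer Series
stacked = pd.concat([s1, s2], axis=0)
print(stacked.shape)

# ignore_index=True replaces the repeated 0..2 labels with a fresh 0..5 index
stacked_clean = pd.concat([s1, s2], axis=0, ignore_index=True)
print(stacked_clean.index.tolist())
```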


In [6]:
# Usually, DataFrames are much bigger than 5 rows.  
# Use .head and .tail to see the first and last rows

display(df.head(3))
display(df.tail(3))

|   | odds | evens | days |
|---|------|-------|------|
| 0 | 1 | 2 | Monday |
| 1 | 3 | 4 | Tuesday |
| 2 | 5 | 6 | Wednesday |


|   | odds | evens | days |
|---|------|-------|------|
| 2 | 5 | 6 | Wednesday |
| 3 | 7 | 8 | Thursday |
| 4 | 9 | 10 | Friday |


In [7]:
# The .shape attribute is also useful to see how many rows and columns are in a DataFrame

print(df.shape)

(5, 3)


In [8]:
# DataFrames have a powerful set of features for selecting data

# .iloc : Indexing rows and columns using integers 
# Get the first row and all of the columns.  
# The ":" is a Python slice which means "all of the columns"
first_row = df.iloc[0,:]
display(type(first_row), first_row)

# Get the first column
first_col = df.iloc[:,0]
display(type(first_col), first_col)


pandas.core.series.Series

odds          1
evens         2
days     Monday
Name: 0, dtype: object

pandas.core.series.Series

0    1
1    3
2    5
3    7
4    9
Name: odds, dtype: int64
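
Beyond whole rows and columns, `.iloc` also accepts single positions, slices, and negative positions. The following sketch rebuilds the tutorial DataFrame from a dictionary so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "odds": [1, 3, 5, 7, 9],
    "evens": [2, 4, 6, 8, 10],
    "days": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
})

# A single cell: row position 1, column position 2
cell = df.iloc[1, 2]
print(cell)

# Rows 1 up to (not including) 3, and the first two columns
block = df.iloc[1:3, 0:2]
print(block)

# Negative positions count from the end, as with Python lists
last_odd = df.iloc[-1, 0]
print(last_odd)
```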

In [9]:
# We can also get columns using their names
first_col = df['odds']
display(first_col)

# Or multiple columns
first_2_cols = df[['odds','evens']]
display(first_2_cols)

0    1
1    3
2    5
3    7
4    9
Name: odds, dtype: int64

|   | odds | evens |
|---|------|-------|
| 0 | 1 | 2 |
| 1 | 3 | 4 |
| 2 | 5 | 6 |
| 3 | 7 | 8 |
| 4 | 9 | 10 |
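
The two selections above return different types, which is why one prints as plain text and the other as a table. A short sketch of the difference, using a smaller stand-in DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"odds": [1, 3, 5], "evens": [2, 4, 6]})

# Single brackets with one name return a Series
as_series = df["odds"]
print(type(as_series).__name__)

# Double brackets (a list of names) return a DataFrame, even for one column
as_frame = df[["odds"]]
print(type(as_frame).__name__)
print(as_frame.shape)
```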


In [10]:
# More powerful selection is accomplished using Boolean Indexing
# First, we create a boolean mask using a function or conditional
# (named `mask` to avoid shadowing Python's built-in filter function)

mask = df['odds'] > 5

# This creates a Series with cells that are True if the condition holds or False if it does not
display(mask)

# We can use this mask to keep only the rows that we want
filtered_rows = df[mask]
display(filtered_rows)

# You can combine Boolean Indexing for more complex selections
display(df[(df['odds'] > 5) | (df['days'] == 'Monday')])


0    False
1    False
2    False
3     True
4     True
Name: odds, dtype: bool

|   | odds | evens | days |
|---|------|-------|------|
| 3 | 7 | 8 | Thursday |
| 4 | 9 | 10 | Friday |


|   | odds | evens | days |
|---|------|-------|------|
| 0 | 1 | 2 | Monday |
| 3 | 7 | 8 | Thursday |
| 4 | 9 | 10 | Friday |
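
The `|` operator used above means "or". Conditions can also be combined with `&` ("and"), negated with `~`, or expressed with `.isin`. The sketch below rebuilds the tutorial DataFrame so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "odds": [1, 3, 5, 7, 9],
    "evens": [2, 4, 6, 8, 10],
    "days": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
})

# & requires both conditions to hold.  Each condition needs its own parentheses
# because & and | bind more tightly than comparison operators
both = df[(df["odds"] > 1) & (df["evens"] < 8)]
print(both["days"].tolist())

# ~ negates a boolean Series
not_monday = df[~(df["days"] == "Monday")]
print(len(not_monday))

# .isin tests membership in a list of values
midweek = df[df["days"].isin(["Tuesday", "Wednesday"])]
print(midweek["odds"].tolist())
```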


In [11]:
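# We can apply functions to every numeric column in the DataFrame at the same time
# numeric_only=True skips the text column 'days', which recent pandas versions would otherwise reject
display(df.mean(numeric_only=True))
display(df.corr(numeric_only=True))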

odds     5.0
evens    6.0
dtype: float64

|       | odds | evens |
|-------|------|-------|
| odds  | 1.0 | 1.0 |
| evens | 1.0 | 1.0 |
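
The output above contains only `odds` and `evens` because the text column `days` has no mean or correlation. One way to make that restriction explicit, and to avoid the error that recent pandas releases raise when an aggregation meets a text column, is to select the numeric columns first. This is a sketch of one option, using a rebuilt copy of the tutorial DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    "odds": [1, 3, 5, 7, 9],
    "evens": [2, 4, 6, 8, 10],
    "days": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
})

# Keep only the numeric columns before aggregating
numeric = df.select_dtypes(include="number")
print(numeric.columns.tolist())

means = numeric.mean()
print(means)

# odds and evens differ by a constant, so they are perfectly correlated
correlations = numeric.corr()
print(correlations)
```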


In [14]:
# We can apply an arbitrary function to each column using .apply
# Python lambdas are small anonymous functions defined inline
# The code below will generate an error.  Why?
display(df.apply(lambda x: x/2))

# Comment out the code above and uncomment the code below to fix the error
# (the output shown below is from this corrected version)
#display(df[['odds','evens']].apply(lambda x: x/2))

|   | odds | evens |
|---|------|-------|
| 0 | 0.5 | 1.0 |
| 1 | 1.5 | 2.0 |
| 2 | 2.5 | 3.0 |
| 3 | 3.5 | 4.0 |
| 4 | 4.5 | 5.0 |
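
By default `.apply` passes one column at a time to the function. Passing `axis=1` makes it work row by row instead. The sketch below shows both, using a small numeric-only stand-in DataFrame and illustrative lambda functions.

```python
import pandas as pd

df = pd.DataFrame({"odds": [1, 3, 5], "evens": [2, 4, 6]})

# Default axis=0: the function receives one column (a Series) at a time
col_ranges = df.apply(lambda col: col.max() - col.min())
print(col_ranges.tolist())

# axis=1: the function receives one row at a time
row_sums = df.apply(lambda row: row["odds"] + row["evens"], axis=1)
print(row_sums.tolist())

# For simple arithmetic, operating on the DataFrame directly gives the same result
halved = df / 2
print(halved.equals(df.apply(lambda col: col / 2)))
```

For element-wise arithmetic the direct form (`df / 2`) is usually preferable; `.apply` is most useful when the function needs a whole column or row at once.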


## Further reading



*   Tutorial on Jupyter notebooks: https://www.dataquest.io/blog/jupyter-notebook-tutorial/
*   Tutorial on using Pandas: https://towardsdatascience.com/exploratory-data-analysis-with-pandas-and-jupyter-notebooks-36008090d813

