# Introduction to the Hands-On Full Life Cycle Data Science Workshop

### Steve Johnson and Lisianne Pruinelli

#### 2018 Nursing Knowledge: Big Data Science Pre-Conference

#### 6/13/18

# Data Science Ecosystem - Python


# Setup Jupyter Environment


In [None]:
# Import required modules
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import matplotlib as mplot
%matplotlib inline
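import IPython
from IPython.display import display  # display() is used below; the explicit import also works on older IPython versions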
import os
from os import listdir

## Jupyter Quick Start

In [None]:
# Keyboard shortcuts

# Executing cells - Shift-Enter, Ctrl-Enter
#     Return
# Run All, above, below
# Adding cells - A, B
# Code vs Markdown
# Moving cells
# Split and Merge Cells - Shift-Ctrl-_, Ctrl-M,Shift-M
# Copy, Cut, Paste - Ctrl-C, Ctrl-X, Ctrl-V
# Undo - Ctrl-Z
# Select all - Ctrl-A
# Comment Cell - Ctrl-/

In [None]:
# Get help for a function
pd.read_csv??

In [None]:
# Debug / Stack trace
#print(steve)
    

## Pandas Dataframes

Pandas is one of the most important packages for data science in Python. Its DataFrame organizes data into rows and columns, so you can manipulate the data much as you would in a spreadsheet.

In Pandas, each column of data is a `Series`. A `Series` has a data type and provides methods that operate on every value in the column at once. Several `Series` can be combined to form a DataFrame.
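
As a minimal sketch of what operating on a whole column at once looks like, the example below converts a column of temperatures in a single expression. The `temps_c` Series and its values are made up for illustration and are not part of the workshop data.

```python
import pandas as pd

# Hypothetical example values, not taken from the workshop dataset
temps_c = pd.Series([36.5, 37.0, 38.2], name='temp_c')

# One expression converts every value in the Series; no explicit loop is needed
temps_f = temps_c * 9 / 5 + 32

# Every Series carries a single data type
print(temps_c.dtype)
print(temps_f.round(1).tolist())
```

The arithmetic is applied element by element, and the result is a new Series of the same length.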

In [None]:
# Create some Series to use for our tutorial
s1 = pd.Series([1,3,5,7,9], name='odds', dtype=int)
s2 = pd.Series([2,4,6,8,10], name='evens', dtype=int)
s3 = pd.Series(['Monday','Tuesday','Wednesday','Thursday','Friday'], name='days', dtype=str)
                
# How many elements?
print(len(s2))

# Apply functions 
print(s1.mean())
                
            

In [None]:
# Now let's create a DataFrame
# axis=1 treats each Series as a column.  axis=0 (the default) would instead stack the Series end to end into one long Series
df = pd.concat([s1,s2,s3],axis=1)

# Jupyter displays DataFrames in a nice-looking table.  
# We will sometimes convert objects into DataFrames just to make the output pretty
display(df)
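
To see how the `axis` argument changes the result of `pd.concat`, here is a small sketch that rebuilds two of the Series and concatenates them both ways. It is self-contained so it can be run on its own; the variable names `side_by_side` and `stacked` are only for illustration.

```python
import pandas as pd

s1 = pd.Series([1, 3, 5, 7, 9], name='odds')
s2 = pd.Series([2, 4, 6, 8, 10], name='evens')

# axis=1: each Series becomes a column of a DataFrame
side_by_side = pd.concat([s1, s2], axis=1)

# axis=0: the Series are stacked end to end into one longer Series
# ignore_index=True renumbers the result 0..9 instead of repeating 0..4
stacked = pd.concat([s1, s2], axis=0, ignore_index=True)

print(side_by_side.shape)
print(stacked.shape)
```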

In [None]:
# Usually, DataFrames are much bigger than 5 rows.  
# Use .head and .tail to see the first and last rows

display(df.head(3))
display(df.tail(3))

In [None]:
# The .shape attribute is also useful to see how many rows and columns are in a DataFrame

print(df.shape)

In [None]:
# DataFrames have a powerful set of features for selecting data

# .iloc : Indexing rows and columns using integers 
# Get the first row and all of the columns.  
# The ":" is a Python slice that means "everything along this axis" (here, all of the columns)
first_row = df.iloc[0,:]
display(type(first_row), first_row)

# Get the first column
first_col = df.iloc[:,0]
display(type(first_col), first_col)
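
`.iloc` also accepts slices with start and stop positions and negative positions. The sketch below assumes a DataFrame with the same three columns as the one built above, recreated here so the example runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'odds': [1, 3, 5, 7, 9],
    'evens': [2, 4, 6, 8, 10],
    'days': ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'],
})

# Rows 1 and 2, columns 0 and 1 (the stop position is excluded, as in Python lists)
block = df.iloc[1:3, 0:2]

# Negative positions count from the end: this is the last row
last_row = df.iloc[-1, :]

print(block)
print(last_row)
```

Selecting a single row or column returns a `Series`, while selecting a block of rows and columns returns a DataFrame.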


In [None]:
# We can also get columns using their names
first_col = df['odds']
display(first_col)

# Or multiple columns
first_2_cols = df[['odds','evens']]
display(first_2_cols)

In [None]:
# More powerful selection is accomplished using Boolean Indexing
# First, we create a boolean mask using a function or conditional
# (avoid the name `filter`, which would shadow Python's built-in function)

mask = df['odds'] > 5

# This creates a Series whose values are True where the condition holds and False where it does not
display(mask)

# We can use the mask to keep only the rows that we want
filtered_rows = df[mask]
display(filtered_rows)

# You can combine conditions with | (or) and & (and); wrap each condition in parentheses
display(df[(df['odds'] > 5) | (df['days'] == 'Monday')])
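
Besides `|`, boolean masks can be combined with `&` (and) and inverted with `~` (not). The following sketch recreates the tutorial DataFrame so it is runnable by itself; the names `both` and `not_monday` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'odds': [1, 3, 5, 7, 9],
    'evens': [2, 4, 6, 8, 10],
    'days': ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'],
})

# & keeps rows where both conditions are True
both = df[(df['odds'] > 3) & (df['evens'] < 10)]

# ~ inverts a mask: every row except Monday
not_monday = df[~(df['days'] == 'Monday')]

print(both)
print(not_monday)
```

The parentheses matter because `&`, `|`, and `~` bind more tightly than comparison operators such as `>` and `==`.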


In [None]:
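# We can apply functions to every column of a DataFrame at the same time
# The 'days' column holds strings, so select the numeric columns first;
# recent pandas versions raise an error rather than silently skipping text columns
numeric = df[['odds','evens']]
display(numeric.mean())
display(numeric.corr())

# We can apply an arbitrary function using .apply
# Python lambdas are small anonymous functions that don't have to be pre-defined
# Dividing the strings in 'days' by 2 would raise a TypeError, so we use only the numeric columns
display(numeric.apply(lambda x: x/2))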
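
By default `.apply` passes each column to the function. Passing `axis=1` passes each row instead. The sketch below shows both, assuming the same numeric columns as the tutorial DataFrame; the names `nums`, `halved`, and `row_sums` are illustrative.

```python
import pandas as pd

nums = pd.DataFrame({
    'odds': [1, 3, 5, 7, 9],
    'evens': [2, 4, 6, 8, 10],
})

# Default (axis=0): the lambda receives one column (a Series) at a time
halved = nums.apply(lambda col: col / 2)

# axis=1: the lambda receives one row at a time
row_sums = nums.apply(lambda row: row['odds'] + row['evens'], axis=1)

print(halved)
print(row_sums.tolist())
```

For simple arithmetic like this, writing `nums / 2` or `nums['odds'] + nums['evens']` directly gives the same result and is usually faster; `.apply` is most useful when the logic does not fit a built-in operation.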