# Introduction to analysis with Jupyter, Python and Pandas  -- Paul Rodgers

## Who am I?
* Sr. Data Analyst at Virgin Pulse
* One of the few people to have booted both a PDP-8 and a Raspberry Pi
* Early adopter
  
  Windows from 5 1/4" floppies<br>
  Microsoft Certified Professional ~ 180
  
* Generalist<br>
  Oversaw the architectural design and development of a document repository for a large-scale (>10^6 documents) litigation database<br>
  Currently using Python and Pandas for real-time unification of Salesforce and in-house data stores<br>
* Evangelist

## What is Jupyter?
* Interactive code execution environment
* Tells a story 
  
  Allows the use of data, code and rich content  
  Enables the author to create a narrative   
  Engages the audience  
  Increases comprehension
  
## Who's using Jupyter?
* Academics  
* Journalists    
* Data Scientists
* Netflix!!

## What is Pandas?
* Programmable two-dimensional tabular data management tool (Excel on steroids)
* Similar to R dataframes
* Leverages Numpy (fast array math)
* Top notch CSV importer
* RAM based
* Rich data manipulation features
* Database style joins

## Who's using Pandas?
* Data Scientists 
* Financial Analysts (where Pandas was born)

## What is Python (in this context)?
* IPython kernel

### Jupyter Kernels
* __Julia__
* __Python__
* __R__
* Bash
* Haskell
* Perl
* ...

## Why Notebooks?
* Reproducibility
* Sharing
* Transparency


## This environment
* Python
* Pandas
* Jupyter Lab
* Cookiecutter
    https://github.com/drivendata/cookiecutter-data-science
* Simple-salesforce


## A bit of history (acknowledgments)
* 2001 - Fernando Perez creates IPython<br>

  Interactive shell<br>
  Features introspection, rich media, shell syntax, tab completion, and history<br>
  Latest stable 7.0 Sept 29, 2018<br>
  BSD License
* 2014 - Project Jupyter spun off from IPython<br>
  Language agnostic<br>
  BSD License
* 2008 - Wes McKinney introduces Pandas<br>
  BSD License
* IPython, Pandas and Jupyter are part of the SciPy ecosystem https://www.scipy.org/index.html
* All are sponsored projects at NumFOCUS https://numfocus.org/community/mission
  


## Notebook Server
* ZeroMQ for interprocess communications
* Tornado for HTTP server

## Security
* Notebooks are inherently tied to a notebook server
* By definition, one runs arbitrary code on that server
* Servers can provide disk access
* Servers can be sent shell commands
* Jupyter OUTPUT is rendered in an executable environment (your browser)
* Anything that your browser can do, Jupyter output can do
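
Since output is rendered by the browser, untrusted text that reaches an HTML output cell can carry markup. The sketch below shows one possible precaution, escaping a value before it is shown as HTML. The `untrusted` string is a made-up example, and this is only an illustration of the risk, not a description of what Jupyter does internally.

```python
import html

# Hypothetical untrusted value, e.g. a text field read from an external data source.
untrusted = '<script>alert("hi")</script>'

# Escaping turns markup characters into entities, so a browser
# displays the text instead of interpreting it as a script tag.
safe = html.escape(untrusted)
print(safe)
```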

## Server Security
* Token authentication by default
* Server instance token
* Browser initialization token
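
The idea behind token authentication can be sketched with the standard library alone. This is a minimal illustration, not the notebook server's actual code: `server_token` and `is_authorized` are hypothetical names, and the token length is an arbitrary choice.

```python
import hmac
import secrets

# Generated once when the (hypothetical) server starts; 24 random bytes as hex.
server_token = secrets.token_hex(24)


def is_authorized(presented_token: str) -> bool:
    """Return True only if the presented token matches the server token."""
    # compare_digest is designed so that its running time does not reveal
    # where the two strings first differ.
    return hmac.compare_digest(presented_token, server_token)


print(is_authorized(server_token))  # a client that knows the token
print(is_authorized("guess"))       # a client that does not
```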

## Client Security

### Inherent issues

### Security Model
* User responsibility
* Output vs Input
* Did the current user direct the action?

## Interacting with the notebook

## Interacting with the shell

## Pandas basics
* Series
* Dataframe
* Index
* Data types
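
A small sketch of how these four pieces fit together. The member names and step counts are invented sample data.

```python
import pandas as pd

# A DataFrame is a table; the row labels form its Index.
df = pd.DataFrame(
    {
        "steps": [4200, 8100, 10050],
        "team": ["red", "blue", "red"],
        "active": [True, False, True],
    },
    index=["ann", "bob", "cy"],
)

# Each column of a DataFrame is a Series that shares the table's Index.
steps = df["steps"]

print(type(steps))
print(df.index)
# Every column has its own data type.
print(df.dtypes)
```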

# Let's get to it!!! 

## Bringing data into dataframes
* csv
* json
* sql
* Salesforce (via simple-salesforce abstraction)
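
A minimal sketch of the first three sources, using in-memory text and an in-memory SQLite database so it runs without any files. The table and column names are invented for illustration. Salesforce is left out here because it needs live credentials.

```python
import io
import sqlite3

import pandas as pd

# CSV: read_csv accepts a path or any file-like object.
csv_text = "member_id,steps\n1,4200\n2,8100\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: a list of records becomes one row per record.
json_text = '[{"member_id": 1, "team": "red"}, {"member_id": 2, "team": "blue"}]'
df_json = pd.read_json(io.StringIO(json_text))

# SQL: run a query against a database connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members (member_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO members VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
df_sql = pd.read_sql("SELECT * FROM members", conn)
conn.close()

print(df_csv, df_json, df_sql, sep="\n\n")
```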

#### Examples -- information about the data
* Dataframe summary information
* Mean, min, max etc.
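
One way to get a first look at a dataframe, sketched with invented sample data:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "team": ["red", "blue", "red", "blue"],
        "steps": [4200, 8100, 10050, 6650],
    }
)

# Column names, non-null counts, and data types.
df.info()

# Count, mean, std, min, quartiles, and max for the numeric columns.
summary = df.describe()
print(summary)

# The same statistics are available one at a time on a column.
print(df["steps"].mean(), df["steps"].min(), df["steps"].max())
```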
#### Examples -- subsetting data with slicing and using .loc and .iloc
* \[row, column\]
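
A short sketch of the difference between the two indexers, with invented sample data. `.loc` works with labels and `.iloc` with integer positions.

```python
import pandas as pd

df = pd.DataFrame(
    {"team": ["red", "blue", "red"], "steps": [4200, 8100, 10050]},
    index=["ann", "bob", "cy"],
)

# .loc selects by label; both ends of a label slice are included.
by_label = df.loc["ann":"bob", "steps"]

# .iloc selects by integer position; the end of the slice is excluded.
by_position = df.iloc[0:2, 1]

# A single row label and column label returns one value.
single = df.loc["cy", "team"]

print(by_label, by_position, single, sep="\n\n")
```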
#### Examples -- filtering data by its characteristics
* Row restrictions
* Combining row restrictions
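
Row restrictions are usually written as boolean masks. The sketch below uses invented sample data. Note the parentheses around each condition, which are needed because `&` and `|` bind more tightly than comparisons.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "team": ["red", "blue", "red", "blue"],
        "steps": [4200, 8100, 10050, 6650],
    }
)

# A single restriction: keep rows where the condition is True.
high = df[df["steps"] > 7000]

# Combining restrictions: & means "and", | means "or".
red_and_high = df[(df["team"] == "red") & (df["steps"] > 7000)]
blue_or_low = df[(df["team"] == "blue") | (df["steps"] < 5000)]

print(high, red_and_high, blue_or_low, sep="\n\n")
```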
#### Examples -- combining data

* Appending (concatenating) data sets
* Merging datasets with sql style joins
* Aggregating data
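
The three operations above can be sketched end to end. The monthly tables and the `members` lookup table are invented sample data.

```python
import pandas as pd

jan = pd.DataFrame({"member_id": [1, 2], "steps": [4200, 8100]})
feb = pd.DataFrame({"member_id": [1, 3], "steps": [5000, 3000]})
members = pd.DataFrame({"member_id": [1, 2], "team": ["red", "blue"]})

# Appending: stack the rows of both months and renumber the index.
all_steps = pd.concat([jan, feb], ignore_index=True)

# Merging: a SQL-style left join keeps every steps row, matched or not.
joined = all_steps.merge(members, on="member_id", how="left")

# Aggregating: total steps per team. Rows without a team are left out
# because groupby drops missing keys by default.
per_team = joined.groupby("team")["steps"].sum()

print(joined)
print(per_team)
```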


# What's in the pipeline?

## Netflix
* Papermill
## Jupyter Lab


## Further reading etc.
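* Atlantic article https://www.theatlantic.com/science/archive/2018/04/the-scientific-paper-is-obsolete/556676/
* Paul Romer's response to the Atlantic article https://paulromer.net/jupyter-mathematica-and-the-future-of-the-research-paper/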
* JupyterCon 2018 videos https://www.youtube.com/watch?v=Ql2f1eF52P8&list=PL055Epbe6d5b572IRmYAHkUgcq3y6K3Ae
* Reuven Lerner on quick setup of a shared Jupyter server http://blog.lerner.co.il/five-minute-guide-setting-jupyter-notebook-server/


In [5]:
%alias

Total number of aliases: 12


[('cat', 'cat'),
 ('cp', 'cp'),
 ('ldir', 'ls -F -G -l %l | grep /$'),
 ('lf', 'ls -F -l -G %l | grep ^-'),
 ('lk', 'ls -F -l -G %l | grep ^l'),
 ('ll', 'ls -F -l -G'),
 ('ls', 'ls -F -G'),
 ('lx', 'ls -F -l -G %l | grep ^-..x'),
 ('mkdir', 'mkdir'),
 ('mv', 'mv'),
 ('rm', 'rm'),
 ('rmdir', 'rmdir')]

In [6]:
%ldir 

drwxr-xr-x   3 paulrodgers  staff     96 Jun  4 20:34 __pycache__/
drwxr-xr-x  10 paulrodgers  staff    320 Jun  1 20:22 data/
drwxr-xr-x   2 paulrodgers  staff     64 Aug  4 12:27 data_public/
drwxr-xr-x   4 paulrodgers  staff    128 Jun 11 14:50 notebook/
drwxr-xr-x  20 paulrodgers  staff    640 Jun 18 21:51 salesforce/
drwxr-xr-x   9 paulrodgers  staff    288 May 29 23:45 sfdc/


In [7]:
%colors nocolor

In [8]:
%config 

Available objects for config:
     AliasManager
     DisplayFormatter
     HistoryManager
     IPCompleter
     IPKernelApp
     LoggingMagics
     MagicsManager
     PrefilterManager
     ScriptMagics
     StoreMagics
     ZMQInteractiveShell


In [9]:
%config HistoryManager

HistoryManager options
--------------------
HistoryManager.connection_options=<Dict>
    Current: {}
    Options for configuring the SQLite connection
    These options are passed as keyword args to sqlite3.connect when
    establishing database connections.
HistoryManager.db_cache_size=<Int>
    Current: 0
    Write to database every x commands (higher values save disk access & power).
    Values of 1 or less effectively disable caching.
HistoryManager.db_log_output=<Bool>
    Current: False
    Should the history database include output? (default: no)
HistoryManager.enabled=<Bool>
    Current: True
    enable the SQLite history
    set enabled=False to disable the SQLite history, in which case there will be
    no stored history, no SQLite connection, and no background saving thread.
    This may be necessary in some threaded environments where IPython is
    embedded.
HistoryManager.hist_file=<Unicode>
    Current: '/Users/paulrodgers/.ipython/profile_default/history.sqlite'
    Pat

<a id='Kernel_Intros'></a>Anchor is here

In [18]:
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/S_f2qV2_U00?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

In [12]:
HTML('<iframe width="560" height="315" src="http://toastytech.com/guis/win1x2xdraw.png" frameborder="0" allowfullscreen></iframe>')

In [16]:
HTML('<iframe width="560" height="315" src="https://www.theatlantic.com/science/archive/2018/04/the-scientific-paper-is-obsolete/556676/" frameborder="0" allowfullscreen></iframe>')

In [15]:
HTML('<iframe width="560" height="315" src="https://en.wikipedia.org/wiki/PDP-8#/media/File:Digital_pdp8-e2.jpg" frameborder="0" allowfullscreen></iframe>')