# Intro to Python for business analytics

# Outline

* Overview of Python
* Working with Jupyter notebooks
* Learning Python via analysis and visualization of some healthcare data
    
    - Basic math, numpy, basic plots
    - Loops, lists, file globbing
    - Conditional logic, dictionaries and reading files

# What is Python?

* Widely used, general purpose programming language
* Designed to be readable and require less code than C-like languages
* Supports multiple programming paradigms (OO, procedural, functional)
* Dynamic typing (“duck” typing), automatic memory management, full featured standard library (“batteries included”)
* Has caught on like wildfire in the analytics and scientific computing community
    - great “scripting” language
    - huge ecosystem of sci-computing related libraries
    - some credit data science with greatly contributing to the popularity and widespread use of Python
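
Here is a minimal sketch of what dynamic and “duck” typing look like in practice. The classes `Duck` and `Robot` and the function `describe` are made up for this illustration; they are not part of any library.

```python
class Duck:
    def speak(self):
        return "Quack"


class Robot:
    def speak(self):
        return "Beep"


def describe(thing):
    # No type declarations: any object that has a speak() method will work
    return f"{type(thing).__name__} says {thing.speak()}"


# Dynamic typing: the same name can refer to values of different types
x = 42
x = "forty-two"

messages = [describe(Duck()), describe(Robot())]
print(messages)
```

Neither class inherits from a shared base class; `describe` only cares that the object it receives can `speak()`.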

# The Zen of Python

In [None]:
import this

# ... design principles

Designed to be extensible - users create importable libraries

Offers choice but 

> there should be one—and preferably only one—obvious way to do it

--> **the Pythonic way**

but not dogmatic about it
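
As a rough illustration of "the Pythonic way", the sketch below computes the same result twice: once with an index-based loop in the style of C-like languages, and once with a list comprehension, which many Python programmers would consider the more idiomatic form. The variable names are made up for this example.

```python
values = [3, 1, 4, 1, 5]

# Index-based loop, as you might write it coming from a C-like language
squares_loop = []
for i in range(len(values)):
    squares_loop.append(values[i] ** 2)

# List comprehension: the same computation in one readable line
squares_pythonic = [v ** 2 for v in values]

print(squares_loop == squares_pythonic)
```

Both versions work; Python does not forbid the first, it just nudges you toward the second.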

# Python milestones

* 1989 – [Guido van Rossum](https://gvanrossum.github.io/) begins implementation of Python
    - Guido is still heavily involved but retired as BDFL in 2018
* 2000 - Python 2.0 released 
* 2008, 2010 - Python 2.6, 2.7 released - widely used
* 2008 - Python 3.0 released
    - a major, “backwards incompatible” release meant to fix structural problems in Python 2
    - adoption is pretty widespread at this point
    - no new development on Python 2.7.x, which reached end of life on January 1, 2020
    - We will use Python 3.8.x or newer in our class
* 2012 - [Anaconda Python distribution](https://www.anaconda.com/products/individual) released, aimed at data science
* 2021 - Python 3.10 released


# Some features

* **Modular design:** [core language](https://www.python.org/) + [many contributed libraries](https://pypi.python.org/pypi)
    - The core language has a large standard library (“batteries included”)
* **Strong graphics:** [matplotlib](http://matplotlib.org/), [seaborn](https://stanford.edu/~mwaskom/software/seaborn/), and [more](https://www.analyticsvidhya.com/blog/2020/03/6-data-visualization-python-libraries/) allow creation of sophisticated and beautiful graphs
* **Multiple use modes:**
    - Interactive: IDLE, IPython, Jupyter (consoles and browser based notebooks)
    - Scripts (programs)
    - Python is a full-featured programming language
* **Vibrant user community:**
    - [StackOverflow](http://stackoverflow.com/questions/tagged/python), http://planet.python.org/, and our course website all have links to tons of relevant resources
    - If there's some feature you need, chances are, someone has created a library for it. If not, you can always create it yourself.

# The Python scientific computing stack

## https://www.scipy.org/

![scipy](img/scipy.jpg)

In [None]:
from IPython.display import Image  # needed to display an image file in a notebook cell
Image("img/scipy.jpg")

# Python is free software

* Python is free, both as in beer and in speech
* With free software, you are granted
    - The freedom to run the program, for any purpose (**freedom 0**).
    - The freedom to study how the program works, and adapt it to your needs (**freedom 1**). Access to the source code is a precondition for this.
    - The freedom to redistribute copies so you can help your neighbor (**freedom 2**).
    - The freedom to improve the program, and release your improvements to the public, so that the whole community benefits (**freedom 3**). Access to the source code is a precondition for this.
* Linux, R, Python and a JILLION more software packages are free
* This entire course (except for Camtasia for the videos) was developed with free software
    - Linux, VirtualBox, LibreOffice, R, Anaconda Python, Geany, Firefox, Moodle, pandoc, git

# Why Python and/or R for analytics

* Both Python and R are widely used in the data science and business analytics worlds
* A quote from [Enterprise Data Analysis and Visualization: An Interview Study](http://web.cse.ohio-state.edu/~machiraju.1/teaching/CSE5544/Visweek2012/vast/papers/kandel.pdf) on the growing need for technically adept analysts:
    - When discussing recruitment, one Chief Scientist said “analysts that can’t program are disenfranchised here”
* Both support a combination of interactive use via tools like IPython and RStudio along with programmatic use via text scripting
* Huge communities and ecosystems supporting Python and R for analytics work
* Both facilitate [reproducible analysis](https://www.coursera.org/learn/reproducible-research)
* Some things that are hideously difficult to do in tools like Excel or a database are simple in Python and/or R
    - Group-by or pivot-style analysis for statistics such as percentiles
    - [Small multiples](https://ggplot2-book.org/facet.html) and other complex graphing/charting/plotting
    - Documenting and reproducing complex series of data cleaning and transformations
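
To make the group-by percentile point concrete, here is a small sketch that uses only the standard library. The records are made-up length-of-stay values, not real data, and the department names are hypothetical. Later in the course, pandas offers the same idea more compactly through `DataFrame.groupby(...)` followed by `.quantile(...)`.

```python
from collections import defaultdict
from statistics import quantiles

# Made-up (department, length of stay in days) records, for illustration only
records = [
    ("cardiology", 1), ("cardiology", 2), ("cardiology", 3),
    ("cardiology", 4), ("cardiology", 5),
    ("oncology", 10), ("oncology", 20), ("oncology", 30),
    ("oncology", 40), ("oncology", 50),
]

# Step 1: group the values by department
stays_by_dept = defaultdict(list)
for dept, days in records:
    stays_by_dept[dept].append(days)

# Step 2: compute the 25th, 50th and 75th percentiles within each group
quartiles_by_dept = {
    dept: quantiles(days, n=4, method="inclusive")
    for dept, days in stays_by_dept.items()
}

for dept, (q1, q2, q3) in quartiles_by_dept.items():
    print(f"{dept}: 25th={q1}, 50th={q2}, 75th={q3}")
```

Changing `n=4` to `n=10` or `n=100` gives deciles or percentiles with no other changes.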

# Some cons of Python for analytics

* While there are thousands of contributed libraries, it can be hard to find what you need
* Packaging system is a little nightmarish but getting better
* The Python 2 vs. Python 3 controversy is distracting but is just about done
* For super computationally heavy work, Python might have some speed issues

Ummm, I can't think of any more

IMHO, pros greatly outweigh cons

# Starting to learn Python

* Hello World! a bunch of ways
* Learn the fundamentals of Python via data analysis tasks
    - Basic math, numpy, basic plots, writing functions, loops, lists, file globbing, conditional logic and images

* We'll do this by working through a nice set of tutorials from [Software Carpentry](https://software-carpentry.org/lessons/) (that I've modified for our class) using [Jupyter notebooks](https://jupyter.org/) 
    - problem based using a simple healthcare related dataset
    - facilitate reproducible, well documented analyses (much like R Markdown)
    - mix Markdown + Python and can turn into HTML and other formats
* You will get a ton of hands-on experience with Python and Jupyter notebooks
* Later we'll use IDEs like [Spyder](https://www.spyder-ide.org/) or [PyCharm](https://www.jetbrains.com/help/pycharm/quick-start-guide.html)
* Then we'll spend some time learning:
    - [pandas](https://pandas.pydata.org/), the go-to library for data analysis 
    - [matplotlib](https://matplotlib.org/) & [Seaborn](https://seaborn.pydata.org/) for plotting 
    - [scikit-learn](https://scikit-learn.org/stable/) for machine learning

# Hello World!

`hello_world.py` is in the Downloads folder

Let's `cat hello_world.py`

We can run a shell command directly from this Jupyter notebook by prefixing it with `!`.


In [None]:
# Linux / macOS
!cat hello_world.py

In [None]:
# Windows
!type hello_world.py
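
The `!` commands above differ by operating system. A portable alternative, sketched below, reads the file with Python's own `pathlib` module. The helper name `show_file` is made up for this example, and the demo writes a throwaway copy of the one-line program to a temporary folder so the sketch runs anywhere; in the notebook you would call `show_file("hello_world.py")` directly.

```python
from pathlib import Path
import tempfile


def show_file(path):
    """Return the text of a file; behaves the same on Linux, macOS and Windows."""
    return Path(path).read_text(encoding="utf-8")


# Demo on a temporary copy (assumed content: the one-line Hello World program)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "hello_world.py"
    demo_path.write_text('print("Hello, World!")\n', encoding="utf-8")
    contents = show_file(demo_path)

print(contents)
```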

## Running via python at the command line

Let's open a terminal window in that folder by browsing to it in the File Manager and hitting F4

    $ python hello_world.py



## Running via Jupyter notebook

We can type the whole `hello_world.py` program within a single cell.

In [3]:
print("Hello, World!")

Hello, World!


## Running via Spyder

We'll be running Spyder later as we go through the notebooks for this session.

# Python Basics

* `01-basics-lookahead-pcda.ipynb`
    - variables, numpy, math, peek at plotting and dataframes
* `02-loop-conditionals-pcda.ipynb`
    - repeating actions and using if-then-else logic
* `03-lists-pcda.ipynb`
    - flexible data storage, indexing and slicing
* `04-intro-dictionaries-readingfiles-pcda.ipynb`
    - another storage container, more on reading data files
* `05-file-globbing-pcda.ipynb`
    - processing a bunch of data files
* `06-more-on-conditions-pcda.ipynb`
    - more on conditional logic and truthiness
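
As a quick preview of what these notebooks cover, here is a short sketch that touches lists, loops, conditionals, truthiness, dictionaries and file globbing. The readings and the file names (`data-01.csv` and so on) are made up for illustration and are not the course data files.

```python
import glob
import os
import tempfile

# Lists: ordered, indexable, sliceable
readings = [0, 3, 5, 2, 0]
first_three = readings[:3]

# Loops and conditionals: count the non-zero readings
nonzero = 0
for r in readings:
    if r:  # truthiness: 0 is falsy, any other number is truthy
        nonzero += 1

# Dictionaries: look values up by key instead of by position
summary = {"max": max(readings), "nonzero": nonzero}

# File globbing: find every file whose name matches a wildcard pattern
with tempfile.TemporaryDirectory() as tmp:
    for name in ("data-01.csv", "data-02.csv", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()  # create empty demo files
    pattern = os.path.join(tmp, "data-*.csv")
    csv_files = sorted(os.path.basename(p) for p in glob.glob(pattern))

print(first_three, summary, csv_files)
```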