# Using Python for Data Science
Python is a versatile and widely-used programming language with many applications. This webinar explores how to use Python for data science, and is designed as an introduction to the analytical capabilities of Python programming.

**Agenda**
1. How do I write and execute a Python script? *An introduction to the Jupyter notebook.*
2. What can I do with Python? *An overview of 6 Python packages every data scientist should know about: NumPy, Pandas, Beautiful Soup, Matplotlib, StatsModels, Scikit-Learn*
3. How do I get started with real data? *Example of reading, writing, and summarizing data with the pandas library.*
4. How do I learn Python for Data Science? *Resources and tips for accelerating your learning path.*

**About the Instructor**
                                          
<img src="profile.jpg" width="300"> Josiah Davis works as a data scientist out of Slalom’s San Francisco office. Josiah's professional experience spans Machine Learning, Natural Language Processing, General Linear Modeling, Survival Analysis, Forecasting, and Data Visualization. In addition to client work, Josiah is a conference speaker, data science instructor, and mentor. Previously, Josiah graduated from the University of Maryland with a Bachelor's degree in Mechanical Engineering. Josiah can be found online [@josiahjdavis](https://twitter.com/josiahjdavis).

# 1) How do I write and execute a Python script?
There are many tools for developing Python scripts. For this webinar, we will execute everything in the Jupyter notebook. For those of you who are new to the Jupyter project, here are some things you should know about it.

**Item #0: Jupyter notebooks are a great way to develop and share code**

Here is the definition from the [Jupyter project website](http://jupyter.org/).

> The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more.

**Item #1: Jupyter notebooks are organized into cells, which can contain markdown or code** 

This is a cell with markdown

In [None]:
# This is a cell with code
def simple_calculator(first_number, second_number):
    return first_number + second_number
simple_calculator(5, 6)

**Item #2: There are two modes of interacting with cells in a Jupyter notebook: Command and Edit**

* *Command* is useful for skimming through a notebook, executing cells, and changing between code and markdown.
* *Edit* is useful for developing code and writing markdown

**Item #3: There are many commonly used keyboard shortcuts**

Here are some shortcuts I use a lot. To get the full list go to *Help* -> *Keyboard Shortcuts*. 

|          | Mac    | PC    |
|:----------|:--------|:-------|
| **Command Mode**  | esc    | Esc |
| Delete   | d, d     | d, d |
| Markdown   | m     | m |
| Run Cell   | control, return    | Ctrl, Enter |
| Run Cell and Insert Below   | option, return    | Alt, Enter |
| Insert Above   | a    | a |
| Insert Below   | b    | b |
| **Edit Mode**     | return | Enter |
| Run Cell | control, return     | Ctrl, Enter |
| Run Cell and Insert Below   | option, return    | Alt, Enter |

**Item #4: Use ```Shift + Tab``` for help**

* `Shift + Tab 1x` gives you abbreviated help
* `Shift + Tab 2x` gives you full help
* `Shift + Tab 4x` gives you full help in a separate window

In [None]:
import numpy as np
np.linspace(2, 3, 10)

**Item #5: The [NBViewer](https://nbviewer.jupyter.org/) is a great way to explore and share notebooks**

**Item #6: Type ```jupyter notebook``` into the command line or terminal to start up your own notebook application**

This does two things: it starts the notebook server, which needs to keep running in the background, and it opens the notebook web application in your browser. A Python kernel is then started for each notebook you open.

**Item #7: The Jupyter project supports many languages, including Python, R, Julia, and Scala.**

Here is a [link](https://nbviewer.jupyter.org/github/ipython/ipython/blob/4.0.x/examples/Notebook/Index.ipynb) to more information about the Jupyter Project. Previously Jupyter notebooks were called IPython notebooks, but as support has expanded past Python, this name has been replaced.

# 2) What can I do with Python?

Here are some of the most commonly used packages for data science in Python, and what they are used for:
* *NumPy*: Analysis with arrays
* *Pandas*: Analysis with dataframes
* *Beautiful Soup*: Web Scraping
* *Matplotlib*: Data visualizations
* *StatsModels*: Statistical Modeling
* *NLTK*: Natural Language Processing
* *Scikit Learn*: Machine Learning

![NumPy](numpy.png)

**Package #1: [NumPy](http://www.numpy.org/)** is used for data analysis with arrays. A NumPy array is a homogeneously typed n-dimensional array (for instance, you can't have strings and integers in the same array). Here is an example of a NumPy array:

```
array([[ 0.        ,  0.36363636],
       [ 0.72727273,  1.09090909],
       [ 1.45454545,  1.81818182],
       [ 2.18181818,  2.54545455],
       [ 2.90909091,  3.27272727],
       [ 3.63636364,  4.        ]])
```
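
An array like the one above can be built in a couple of lines. This is a minimal sketch; the variable names `arr` and `doubled` are chosen for illustration and are not part of the webinar material.

```python
import numpy as np

# Twelve evenly spaced values from 0 to 4, reshaped into 6 rows and 2 columns
arr = np.linspace(0, 4, 12).reshape(6, 2)

print(arr.shape)  # the dimensions of the array
print(arr.dtype)  # a single data type shared by every element

# Arithmetic is applied element-wise, with no explicit loop
doubled = arr * 2
```

Because every element shares one type, operations such as `arr * 2` can run over the whole array at once.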

![Pandas](pandas2.png)

**Package #2: [Pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html) is the main Python package for data manipulation.** Pandas' most common data structure, the DataFrame, is a heterogeneously typed 2D table with row and column labels. Here is an example of a pandas DataFrame:

|          | revenue    | company    |
|:----------|:--------|:-------|
| 2013-01-01  | 100    |   A    |
| 2013-01-02   | 230     |   A    |
| 2013-01-03   | 506     |   A    |
| 2013-01-01  | 111    |   B    |
| 2013-01-02   | 451     |   B    |
| 2013-01-03   | 210     |   B    |

Pandas is a very powerful tool: it handles dates and times well, it has a powerful group-by engine, and it comes with intuitive indexing.
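
As a rough illustration, the example table can be rebuilt and summarized as follows. The variable names (`df`, `revenue_by_company`, `daily_total`) are hypothetical, and this is only one of several ways to construct the same DataFrame.

```python
import pandas as pd

# Rebuild the example table; the dates become the row labels (index)
dates = pd.to_datetime(["2013-01-01", "2013-01-02", "2013-01-03"] * 2)
df = pd.DataFrame(
    {
        "revenue": [100, 230, 506, 111, 451, 210],
        "company": ["A", "A", "A", "B", "B", "B"],
    },
    index=dates,
)

# Group-by: total revenue per company
revenue_by_company = df.groupby("company")["revenue"].sum()

# Group by the date index: total revenue per day across companies
daily_total = df.groupby(df.index)["revenue"].sum()

print(revenue_by_company)
print(daily_total)
```

Note that the two columns hold different types (integers and strings), which is what "heterogeneously typed" refers to.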

<img src="beautiful_soup.png" width="300">

**Package #3: [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python package for web scraping.** Here is an example of the data that one might get from the internet:

```
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
```
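
A short sketch of how Beautiful Soup could pull structured data out of a page like this one. The HTML is an abridged copy of the snippet above, stored in a string rather than downloaded, and it uses Python's built-in `html.parser` so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Abridged copy of the example page, held in a string for illustration
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# The text of the <title> tag
title = soup.title.string

# The text and target of every link with class "sister"
names = [a.get_text() for a in soup.find_all("a", class_="sister")]
links = [a["href"] for a in soup.find_all("a", class_="sister")]

print(title)
print(names)
print(links)
```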


![Matplotlib](matplotlib2.png)

**Package #4: Matplotlib is a package for doing data visualizations.** 

Matplotlib is not the only option for data visualizations. [Bokeh](http://bokeh.pydata.org/en/latest/) is growing in popularity and so is [seaborn](https://stanford.edu/~mwaskom/software/seaborn/). Pandas also has some limited visualization capabilities.
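
A minimal Matplotlib sketch, plotting a sine curve chosen purely for illustration. The `Agg` backend line is an assumption for running outside a notebook; inside Jupyter you would typically use `%matplotlib inline` instead.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; omit this inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Sample the sine function at 100 points between 0 and 2*pi
x = np.linspace(0, 2 * np.pi, 100)

# Create a figure with one set of axes and draw a labeled line on it
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.set_title("A simple line plot")
ax.legend()
```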


![StatsModels](statsmodels.png)

**Package #5: StatsModels is a useful package for doing general linear modeling.** Here is an example of what an output looks like.

```
OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 4.020e+06
Date:                Sun, 01 Feb 2015   Prob (F-statistic):          2.83e-239
Time:                        09:32:32   Log-Likelihood:                -146.51
No. Observations:                 100   AIC:                             299.0
Df Residuals:                      97   BIC:                             306.8
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          1.3423      0.313      4.292      0.000         0.722     1.963
x1            -0.0402      0.145     -0.278      0.781        -0.327     0.247
x2            10.0103      0.014    715.745      0.000         9.982    10.038
==============================================================================
Omnibus:                        2.042   Durbin-Watson:                   2.274
Prob(Omnibus):                  0.360   Jarque-Bera (JB):                1.875
Skew:                           0.234   Prob(JB):                        0.392
Kurtosis:                       2.519   Cond. No.                         144.
==============================================================================
```

![scikit-learn](scikitlearn.png)

**Package #6: scikit-learn is a package for doing machine learning.**

scikit-learn supports a wide variety of supervised and unsupervised machine learning tasks, including linear modeling, tree-based modeling, dimension reduction, and clustering. Additionally, scikit-learn can perform cross-validation, bootstrapping, data pre-processing, and text feature extraction for natural language processing.
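
As a small sketch of what this looks like in practice, the example below cross-validates a logistic regression classifier. The dataset is generated synthetically with `make_classification`, and the sample size, feature count, and number of folds are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification problem: 200 rows, 5 features
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Estimate out-of-sample accuracy with 5-fold cross-validation
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5)

print(scores)         # one accuracy score per fold
print(scores.mean())  # average accuracy across folds
```

Most scikit-learn estimators share the same `fit` and `predict` interface, so swapping in a different model usually means changing only the line that creates `model`.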

# 3) How do I get started with real data?

The data used in this section describes credit card defaults. The dataset was taken from the [UC Irvine Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients).

In [None]:
# Let's import some of these packages
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
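# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split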

In [None]:
# Call the magic function to view plots in notebook
%matplotlib inline

In [None]:
# Read data into memory
d = pd.read_excel('default of credit card clients.xls', index_col = 'ID', skiprows=1)

In [None]:
# Check the data types
d.dtypes

In [None]:
# View first couple of rows
d.head()

In [None]:
# Describe the numeric columns
d.describe()

In [None]:
# Rename a column
d.rename(columns={'default payment next month': 'default'}, inplace=True)

In [None]:
# Transform codes into meaningful values
d['SEX'] = d['SEX'].replace({1: 'M', 2: 'F'})

In [None]:
# Split the data into train and test sets
train, test = train_test_split(d, test_size=0.3, random_state=1)
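
To see what the split does without loading the credit card file, here is a sketch on a tiny made-up DataFrame. The names `toy`, `toy_train`, and `toy_test` and the ten rows of data are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# A made-up dataset with ten rows
toy = pd.DataFrame({"x": range(10), "default": [0, 1] * 5})

# Hold out 30% of the rows for testing; random_state makes the split repeatable
toy_train, toy_test = train_test_split(toy, test_size=0.3, random_state=1)

print(len(toy_train), len(toy_test))
```

Every row ends up in exactly one of the two sets, and the original row labels are preserved in both.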

In [None]:
# Select column and perform aggregation
train['default'].mean()

In [None]:
# Perform aggregation over categorical variables
train.groupby('SEX').default.mean()

In [None]:
# Compare with the default rate by sex across the full dataset
d.groupby('SEX').default.mean()

In [None]:
# Create a scatter plot of the age vs. limit balance
train.plot(x='LIMIT_BAL', y='AGE', kind='scatter', alpha=0.05)
plt.ylim([10,85]); plt.xlim([0, 800000])

In [None]:
# Mark defaults with a different color and symbol
train_nd = train[train.default == 0]
train_d = train[train.default == 1]
plt.figure()
plt.scatter(train_d.LIMIT_BAL, train_d.AGE, alpha = .3, marker='o', edgecolors = 'r', facecolors = 'none')
plt.scatter(train_nd.LIMIT_BAL, train_nd.AGE, alpha = .2, marker='+', edgecolors = 'b', c= 'b')
plt.ylim([10,85]); plt.xlim([0, 800000])
plt.legend( ('default', 'no default'), loc='upper right')

In [None]:
# Fit a logistic regression of default on age and limit balance
balance = smf.logit('default ~ AGE + LIMIT_BAL', data = train).fit()
balance.summary()

In [None]:
# Create predictions with the fitted model
predictions = balance.predict(test)
predictions
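
The values returned by `predict` for a logit model are predicted probabilities of default, not class labels. One common way to turn them into labels is to apply a cutoff. The sketch below uses made-up probabilities and outcomes, and the 0.5 threshold is an assumption rather than a recommendation.

```python
import pandas as pd

# Hypothetical predicted probabilities and actual outcomes for five clients
probs = pd.Series([0.10, 0.35, 0.62, 0.80, 0.45])
actual = pd.Series([0, 0, 1, 1, 1])

# Label a client as a predicted default when the probability is at least 0.5
predicted = (probs >= 0.5).astype(int)

# Share of clients whose predicted label matches the actual outcome
accuracy = (predicted == actual).mean()
print(accuracy)
```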

# 4) How do I learn Python for Data Science?

**Here is a simple recipe that I would recommend:**
1. Download the [Anaconda Distribution](https://www.continuum.io/downloads) of Python 3.x
2. Work through the Python Software Foundation's [base Python tutorial](https://docs.python.org/3/tutorial/index.html)
3. Run through the [10 minutes to pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html) tutorial
4. Pick a data science problem that you would like to use python to solve
5. Learn the necessary techniques to solve that particular problem
6. Repeat

**Here are a few additional tips**

**Tip #1: Learn from the documentation**

**Tip #2: Concepts first, code second**

**Tip #3: Start with pandas**