# Data Analysis and Visualization in Python

In this lesson we'll explore how we, as researchers and staff, can use Python to perform data analysis in a reproducible manner.

Python is a fantastic tool for data analysis because it combines powerful data analysis packages with a general-purpose programming language. This is in contrast to languages like R and MATLAB, which were designed primarily as scripting languages for statistical analysis and engineering, respectively.

The strength of this combination is the ability to **automate** data analysis workflows (e.g. large-scale data analysis on high-performance computing systems) while still having access to cutting-edge data analysis frameworks.

## Jupyter Notebook

A large part of Python's support for reproducibility comes from **Jupyter Notebooks**, the web-based interface that you are currently using. With Jupyter notebooks you can perform an entire analysis on one page, with code and visualizations together.

Jupyter notebooks are great for quick prototyping of software and analysis, but they aren't what you'd typically use for actual automation. For example, if you wanted to develop a new analysis or test a new script to automate some data management, a notebook would be a great place to start.

***

## Python basics 

First, let's quickly review some basic programming functionality that you can perform in Python.

In [4]:
# Variable assignment
x = 10
y = 20
print(x)
print(y)

10
20


In [5]:
# Arithmetic operations
a = x+y
b = x-y
c = x*y
d = x/y
e = x**2
print(a,b,c,d,e)

30 -10 200 0.5 100


A key thing to note is that the computer only knows about variables once we run the "cells" (the blocks of code) that define them. If you *don't run a cell, then the computer doesn't know about the code you wrote*. In addition, the computer remembers what you've already run. So for example:

In [6]:
# x is still defined because we ran the cell that assigned it
print(x)

10


So any variables you define by running a cell are saved in *memory*. Look at the following example:

In [7]:
counter = 1

In [19]:
# Hit Ctrl + Enter several times
counter = counter + 1
print(counter)

13


Notice that the variable gets incremented every time you run the block. This is because <code>counter</code> is stored in memory, and every time you run the block you assign it a new value: its current value plus 1.
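
The same behaviour can be reproduced in a single block by writing the increment out several times. This is only a sketch of what happens when the cell is re-run; the choice of three repetitions is arbitrary:

```python
counter = 1             # the initial assignment, run once

counter = counter + 1   # first run of the increment cell
counter = counter + 1   # second run
counter = counter + 1   # third run

print(counter)          # 4
```

Python also offers the shorthand `counter += 1`, which does the same thing as `counter = counter + 1`.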

We can run loops as well!

In [20]:
for i in range(0,10):
    print(i)

0
1
2
3
4
5
6
7
8
9


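As expected, <code>i</code> is printed on every iteration of the loop. Note that <code>range(0,10)</code> starts at 0 and stops before 10, so the last value printed is 9.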
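
Loops become more useful when they update a variable on each iteration. A small sketch, assuming we want the sum of the numbers printed above (the name `total` is just an illustrative choice):

```python
total = 0
for i in range(0, 10):   # i takes the values 0, 1, ..., 9
    total = total + i    # add the current value to the running total

print(total)             # 45
```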

## Functions

Functions are arguably the most important concept in all of programming, and especially in Python. A **function** is a way to define a set of operations so you can perform them using one-line *function calls*. To make this clear, let's start with an example:

In [2]:
# All functions start with "def function_name(inputs if any):"
def print_hello_world():
    print('Hello world!')

In [4]:
print_hello_world()

Hello world!


Notice a couple of things. All functions start with the following signature:
<code> def function_name(): </code>
In plain English, it means "define function_name as the following", and the indented lines below it form the body of the function. The empty parentheses mean the function takes no inputs. If you want your function to take inputs, you can do it as follows:

In [6]:
def print_square(x):
    print(x*x)

In [8]:
print_square(10)

100


Notice here that <code>print_square</code> takes an input that is defined inside the parentheses.
The **variable** <code>x</code> used in the function definition could have been named anything; what matters is that the names used in the function body match the ones given in the function definition. Here are some examples to clarify this:

In [11]:
def print_square(y):
    print(y*y)
print_square(10)

100


In [13]:
def print_square(x):
    print(y)
print_square(1000)

NameError: name 'y' is not defined

Because we defined <code>print_square</code> with <code>x</code> as its input, and no variable named <code>y</code> has been defined anywhere else in this session, the name <code>y</code> isn't known to the function! You can also have multiple inputs:

In [15]:
def print_multiplication(x,y):
    print(x*y)
print_multiplication(10,20)

200
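
The `NameError` seen earlier has a counterpart in the other direction: variables created inside a function are not visible outside it. A minimal sketch of this, where the function `print_doubled` and the variable names are hypothetical and chosen only for illustration:

```python
def print_doubled(x):
    doubled = x * 2      # 'doubled' exists only inside the function
    print(doubled)

print_doubled(5)         # prints 10

try:
    print(doubled)       # fails: 'doubled' was never defined out here
    message = "no error"
except NameError as err:
    message = str(err)

print(message)           # name 'doubled' is not defined
```

The `try`/`except` is only there so the block runs to the end; in a notebook you would simply see the `NameError` as in the cell above.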


Finally we'd like to define functions to make our life easier when running complicated routines. Naturally we'd want to use the **outputs** of functions. In Python outputs are defined using a <code>return</code> statement. An example of this is:

In [16]:
def square(x):
    sq = x*x
    return sq

In [18]:
z = square(20)
print(z)

400


Similar to how we can specify multiple inputs we can also specify multiple outputs:

In [19]:
def get_square_and_cube(x):
    sq = x*x
    cube = x**3
    return sq,cube

In [21]:
a,b = get_square_and_cube(10)

In [22]:
print(a)
print(b)

100
1000
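
Under the hood, a function that returns several values hands back a single tuple, which Python then unpacks into the names on the left of the `=`. A short sketch using the same function (the variable name `result` is our own):

```python
def get_square_and_cube(x):
    sq = x * x
    cube = x ** 3
    return sq, cube            # packed into one tuple

result = get_square_and_cube(3)
print(result)                  # (9, 27)
print(type(result))            # <class 'tuple'>

sq, cube = result              # unpack the tuple into two names
print(sq, cube)                # 9 27
```

This is why `a,b = get_square_and_cube(10)` needs exactly two names on the left-hand side: one for each returned value.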


We'll be using functions frequently during this course because they provide two important aspects of programming:
1. Organization - having well-named functions allows you to re-use code and know what it does.
2. Simplification - wrapping complex routines in a function allows you to make your code more readable.
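
Both points can be seen in a small sketch. The temperature conversion below is a made-up example, not part of the course material, and `fahrenheit_to_celsius` is a hypothetical name:

```python
def fahrenheit_to_celsius(temp_f):
    # The formula is written once, under a name that says what it does
    return (temp_f - 32) * 5 / 9

# Re-use: the same routine is called for different inputs
print(fahrenheit_to_celsius(212))   # 100.0
print(fahrenheit_to_celsius(32))    # 0.0
```

Anyone reading the calls can tell what is being computed without having to decode the arithmetic each time.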

***

## Python Packages

Python packages are the meat of what Python is all about. Essentially they are additional functionality, provided by the developers of Python and by other individuals, that you can load on demand and use as you'd like.
Let's start with the most basic example, the **math** package, which provides some basic mathematical operations.

In [22]:
import math

Now we can use the <code> . </code> notation to call functions contained within the math package. For example, let's compute a square root:

In [23]:
x = 100
y = math.sqrt(x)
print(y)

10.0


By typing in <code>math.</code> and hitting Tab, we can view the list of everything the package provides:

In [None]:
math.
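
Tab completion only works interactively. Outside a notebook, Python's built-in `dir` function gives a similar listing. A small sketch (the variable name `names` is our own):

```python
import math

names = dir(math)         # list of every name defined in the math package
print(len(names))         # how many names there are
print("sqrt" in names)    # True
```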

If typing <code> math </code> every time is annoying, you can also use *aliases*, which give another name to <code>math</code>:

In [26]:
import math as m

In [27]:
print(m.sqrt(x))

10.0


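In some cases loading packages is **expensive**, that is, they take time to import because they are large or require loading many things in the background. Often we also only need a few functions from a package. We can import specific functions by name using <code>from</code>, which lets us call them without the package prefix. Note that Python still loads the whole package behind the scenes; <code>from</code> only controls which names become available to us.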

In [28]:
from math import sqrt, exp
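
As a quick check of what this kind of import does, the sketch below uses the standard `sys.modules` dictionary, which records every module Python has loaded. The inputs 16 and 0 are arbitrary:

```python
import sys
from math import sqrt, exp

print(sqrt(16))                 # 4.0, called without the "math." prefix
print(exp(0))                   # 1.0
print("math" in sys.modules)    # True: the math module itself was loaded
```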

Notice we can import multiple functions by separating them with <code>,</code>. Furthermore, we can combine this with an alias!

In [29]:
from math import sqrt as s, exp as e

Here we imported square root and exponential as <code> s </code> and <code> e </code> respectively! Note that each name needs its own <code>as</code>.
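
One detail is worth checking when combining `from` with aliases: an `as` applies only to the single name written immediately before it. A minimal sketch of the two forms side by side:

```python
import math

# Each alias is written right after the name it renames
from math import sqrt as s, exp as e
print(s(100))          # 10.0 (square root)
print(e(0))            # 1.0  (exponential)

# Caution: here only exp is aliased (to s). sqrt keeps its own name,
# and e is the constant math.e, not the exponential function.
from math import sqrt, exp as s, e
print(s is math.exp)   # True
print(e == math.e)     # True
```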

Packages will form the foundation for how we use Python for data analysis, visualization and automation.
A few extremely useful packages that many Python analysts, myself included, use day to day are:

- <code> numpy </code> - a Python package for dealing with arrays and matrices
- <code> pandas </code> - for data wrangling and manipulation using tables
- <code> matplotlib </code> - a Python library for data visualization (its plotting interface is modeled on MATLAB's)
- <code> seaborn </code> - a statistical visualization library
- <code> scipy </code> - a scientific computing library

For this course we'll be using <code> pandas </code>, which allows us to treat data like spreadsheet tables.