# DSTEP20

Jan 8, 2020

## Lightning Introduction to Python and Jupyter

---

Of all the "software" that is available for data analysis, Python is the most flexible and adaptable to a variety of situations.  Want to scrape data from the web?  Python.  Want to run machine learning algorithms?  Python.  Want to make pretty plots for publications?  Python.

Python can also be used to build websites, control devices, apply filters to images, etc., and it has one of the most (if not the most) understandable syntaxes of all of the "serious coding languages".

A lot of effort over the past decade has turned python from what was called a "scripting" language into something usable for even the most challenging data analysis tasks.  And over the past several years, there has been a parallel effort to make the power of python easily usable by everyone.  That effort culminated in Jupyter, which follows the "notebook-style" format that you're reading right now and allows python to be used interactively, giving you very close access to data.

<b><i>This document will start at the very beginning of python and introduce much of the core functionality we'll use in this class.</i></b>

### Data types

Here are a few of the most common basic data types in python.

In [0]:
# This is a comment, everything that comes after a hash symbol on a given line 
# is ignored by python.  Comments are used to describe what you are about to 
# do or are currently doing.

i = 1 # integer
f = 1.0 # float
s = "hello world" # string
l = [3, 74, -22, 550] # list
t = (3, 74, -22, 550) # tuple
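To check which type a value has, python's built-in `type` function can be used.  The sketch below reuses the variable names from the cell above; the `isinstance` check and the name `is_number` are added here purely for illustration.

```python
# -- the same variables as in the cell above
i = 1
f = 1.0
s = "hello world"
l = [3, 74, -22, 550]
t = (3, 74, -22, 550)

# -- type() returns the type of each object
for value in (i, f, s, l, t):
  print(type(value).__name__, value)

# -- isinstance() tests whether a value has one of the given types
is_number = isinstance(f, (int, float))
print(is_number)
```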

### Arithmetic operations

These basic data types can be manipulated to, for example, make python a fancy calculator:

In [0]:
# -- demonstration of mathematical operations
x = 50.
y = 3.
i = 50
j = 3

print(x / y)
print(i / j)
print(i * j)
print(x * y)
print(x * i)
print(y**j)
print(i // j) # integer division
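The difference between `/` and `//` is easy to miss.  In Python 3, `/` always returns a float, `//` rounds down to a whole number, and `%` gives the remainder.  A minimal sketch with the same values as above (the names `true_div`, `quotient`, and `remainder` are introduced here for illustration):

```python
i = 50
j = 3

true_div = i / j   # float division
quotient = i // j  # integer (floor) division
remainder = i % j  # remainder after integer division

print(true_div, quotient, remainder)

# -- the quotient and remainder reconstruct the original number
print(quotient * j + remainder == i)

# -- floor division rounds down, not toward zero
print(-50 // 3)
```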

### Expanding functionality with modules

The built-in math functions are pretty limited.  To enhance our functionality, we need access to *the most* useful python "library" (aka "module"), called numpy:

In [0]:
# -- to access a module, it needs to be imported in one of three ways

# -- first way
import numpy # all numpy functions can be accessed with numpy.something
print("first way", numpy.log10(x))
print(log10(x))

# -- second way
import numpy as np
print(np.log10(x))
print(log10(x))

# -- third way (NEVER TO BE USED)
from numpy import *
print(log10(x))

Note that that produced an error (with an "error trace") at the line pointed to by the arrow.  Try commenting out that line and seeing if you get another error.
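The error arises because importing a module only makes the module's name available; a function inside it, such as `log10`, still has to be reached through that name.  The sketch below shows this behavior.  It uses `try`/`except`, which is not covered in this notebook, only so that the cell runs to completion instead of stopping at the error.

```python
import numpy as np

x = 50.

# -- the function lives inside the module, so the prefix is required
print(np.log10(x))

# -- without the prefix, python does not recognize the name
try:
  print(log10(x))
except NameError as err:
  message = str(err)
  print("NameError:", message)
```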

### Core data analysis "stack"

There are modules for everything under the sun, but the core modules we'll be using most frequently in this course are:

In [0]:
# -- importing the core modules
import os # "operating system" information
import numpy as np # arrays and vectors
import scipy as sp # scientific python functions
import matplotlib.pyplot as plt # plotting and visualizations
import pandas as pd # tabular data manipulations
import sklearn as sk # machine learning functions

Jupyter lets you explore an imported module with dropdown menus.  Use a question mark to get more information about a function.

In [0]:
# -- delete the dot, retype it, then wait a moment
np.

In [0]:
# -- execute this cell
np.log10?

### Lists, tuples, and arrays

Lists are collections of data:

In [0]:
mylist = [-4, -2, 0, 2, 4]

print(mylist[0])
print(mylist[3])
print(mylist[2])
print(mylist[-2])

In [0]:
mylist.

In [0]:
mylist.append(-4.5)

print(mylist)

Notice two things in the last example: 1.) the append "method" modifies the list itself "in place" and 2.) lists can be of mixed data types.  Another, more complicated example:

In [0]:
mylist2 = [-4, 2.7, "hello world!", np.log10]

print(mylist2)

Tuples are a lot like lists, but they **cannot be modified**:

In [0]:
my_list = [-10, -9, -8, -7, -6]
my_tuple = (10, 9, 8, 7, 6)

print(my_list, my_tuple)

In [0]:
my_list[3] = 1776

print(my_list)

In [0]:
my_tuple[3] = 1776

print(my_tuple)
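The assignment to `my_tuple[3]` fails with a `TypeError`.  As a rough sketch of what is going on, and of one common workaround, the cell below catches the error and then builds a *new* tuple by way of a temporary list (the names `tmp` and `new_tuple` are made up for this example):

```python
my_tuple = (10, 9, 8, 7, 6)

# -- assigning to an element of a tuple raises a TypeError
try:
  my_tuple[3] = 1776
except TypeError as err:
  print("TypeError:", err)

# -- to "change" a tuple, build a new one (here via a temporary list)
tmp = list(my_tuple)
tmp[3] = 1776
new_tuple = tuple(tmp)

print(my_tuple)
print(new_tuple)
```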

Numpy arrays (aka, ndarrays) are <u>the</u> most useful object in the extended python universe when it comes to data analysis, and they are the backbone for many more complicated objects.  One of their best features is that they support **"vectorized"** operations:

In [0]:
my_array = np.array([-5, -4, -2, 0, 2, 4, 5])

print(my_array)

print(my_array + 1)

print(my_array / 10.)

print(my_array % 4)

print(my_array**2)

print(my_array == 2)

print(my_array <= -2)

and they support **"slicing, broadcasting, and indexing"** operations:

In [0]:
print(my_array[::2])

print(my_array[1::2])

print(my_array[1:4:2])

print(my_array[my_array <= -2])
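To make the notation above a little more explicit: a slice is written `start:stop:step` (with `stop` not included), and a comparison produces a boolean array, or "mask", that can itself be used as an index.  A short sketch using the same array, with intermediate variable names added for illustration:

```python
import numpy as np

my_array = np.array([-5, -4, -2, 0, 2, 4, 5])

# -- slices are written as start:stop:step, and stop is not included
evens = my_array[::2]     # every other element starting at index 0
middle = my_array[1:4:2]  # indices 1 and 3

# -- a comparison produces a boolean array (a "mask")...
mask = my_array <= -2

# -- ...which selects only the elements where the mask is True
selected = my_array[mask]

# -- broadcasting: the single number is applied to every element
shifted = my_array + 1

print(evens, middle, mask, selected, shifted)
```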

### Pandas dataframes 

Dataframes in pandas are fairly complex abstractions.  That said, pandas dataframes are, at their core, tabular objects and have a lot of the properties that an Excel table might have.  Most of the time, we'll be reading them in from data files, but they can be constructed from scratch:

In [0]:
# -- create an empty dataframe
my_df = pd.DataFrame()

print(my_df)

In [0]:
# -- define some parameters to put in the dataframe

ages = np.random.randint(18, 96, 5)
names = ["Mohammed", "Danielle", "Jeffrey", "Rebecca", "Lingxiao"]
ends_with_vowel = [False, True, False, True, True]
heights = (1.0 * ages)**2

print(ages, names, ends_with_vowel, heights)

In [0]:
# -- put values in the dataframe

my_df["age"] = ages
my_df["name"] = names
my_df["ewv"] = ends_with_vowel
my_df["height"] = heights


print(my_df)

To access columns in a dataframe, use brackets and column names:

In [0]:
# -- access a data frame column and do some arithmetic
vals = my_df["age"]
vals2 = my_df["age"] * 2

print(vals, vals2)

Dataframes **also** support broadcasting, slicing, and indexing:

In [0]:
print(my_df[my_df["ewv"] == True])
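Row selection in a dataframe works much like the boolean masks used with numpy arrays.  The sketch below rebuilds a small version of the dataframe with fixed, made-up ages (rather than random ones) so that the result is repeatable, and shows how two conditions might be combined with `&`:

```python
import pandas as pd

# -- fixed (made-up) ages instead of random ones, so results are repeatable
my_df = pd.DataFrame({
  "age": [25, 40, 63, 31, 58],
  "name": ["Mohammed", "Danielle", "Jeffrey", "Rebecca", "Lingxiao"],
  "ewv": [False, True, False, True, True],
})

# -- a boolean column can be used as a mask directly
vowel_rows = my_df[my_df["ewv"]]

# -- combine conditions with & (and) or | (or), wrapping each comparison
# -- in parentheses
older_vowel_rows = my_df[(my_df["ewv"]) & (my_df["age"] > 35)]

print(vowel_rows)
print(older_vowel_rows)
```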

### Making plots with matplotlib

Making plots involves a couple of extra lines, but matplotlib gives you a lot of control:

In [0]:
# -- generate some fake data
xx = np.linspace(-10., 10., 100)
yy = np.cos(xx)

print("xx = {0}\n\nyy = {1}".format(xx, yy))

In [0]:
# -- make a plot...

# set up the figure
fig, ax = plt.subplots()

# plot the data
lin, = ax.plot(xx, yy)

# overplot some more data
lin2, = ax.plot(xx, (2.0 * yy)**2, "o", color="darkred")

# always set the axis labels
ax.set_xlabel("time [seconds]")
ax.set_ylabel(r"PM2.5 [$\mu$g/m$^3$]") # raw string so \m is not read as an escape

# show the figure
fig.show()

FWIW, dataframes have a built-in plotting method:

In [0]:
my_df.plot("age", "height")
my_df.plot("age", "height", marker="o", color="r", linewidth=0)

### Loops and functions

Sometimes you might want to do something over and over again but don't want to type it all out, like printing the integers between 0 and 5.  That's what loops are for:

In [0]:
print(0)
print(1)
print(2)
print(3)
print(4)
print(5)

print("\n that was a lot of typing...\n")

for ii in range(6):
  print(ii)
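Loops are not limited to counting from zero.  As a small sketch, `range` can also take a start and a stop value, and a `for` loop can run directly over the items of a list (the variable `total` is introduced here for illustration):

```python
# -- range(start, stop) counts from start up to, but not including, stop
for ii in range(2, 5):
  print(ii)

# -- a loop can also run directly over the items of a list
mylist = [-4, -2, 0, 2, 4]
total = 0
for val in mylist:
  total = total + val**2

print(total)
```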

We've already seen a lot of functions (e.g., cos), but many times you'll want to make your own:


In [0]:
def metric(age, height):
  my_metric = np.sqrt(age) / (height + (height == 0))
  
  return my_metric

In [0]:
print(metric(100., 20))

for index, row in my_df.iterrows():
  print("my metric for {0} is {1}".format(row["name"], 
                                          metric(row["age"], row["height"])))
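Because `metric` is built from numpy operations, it should also work on whole arrays at once, which is usually faster and shorter than looping over rows.  The sketch below uses made-up values (not the dataframe above) and also shows what the `(height == 0)` term does: it adds 1 to the denominator only where the height is zero, which avoids dividing by zero.

```python
import numpy as np

def metric(age, height):
  # (height == 0) is True (counts as 1) only where height is zero,
  # so the denominator is never zero
  my_metric = np.sqrt(age) / (height + (height == 0))

  return my_metric

# -- made-up values for illustration
ages = np.array([16., 25., 100.])
heights = np.array([2., 0., 20.])

# -- one call handles every element, no loop required
results = metric(ages, heights)
print(results)
```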

### Truly unlocking the power of python: search engines and stackoverflow

Got an error message?  Google it.  Want to know how to do something in python?  Google it.  There are a **tremendous** number of python resources available on the web and it is nearly 100% certain that someone else has had the exact same question/problem.  Look especially for results from stackoverflow.com; it is an absolutely indispensable resource for literally everyone (including advanced coders!).