# Python for Data Science 

#### AAWG Dev Day 6/14/2019 

------------



### Python 

Is an interesting and fun language. It is rapidly taking over in both engineering and data science. It can be extremely powerful, and powers some of the everyday products you use. But it can also be painfully slow and clunky if you use it incorrectly. Here are a few important considerations to keep in mind: 

#### History 

* Conceived in the late 1980s; first released in 1991 
* Python 2.0 released in 2000, which is also about the time it became a popular, powerful language 
* 3.0 released 2008. Unless you have a very good reason, **always begin a project in python 3 instead of python 2!** 
* The name and lots of instructional/infrastructural concepts are borrowed from Monty Python 

#### Programming paradigms 

Python supports procedural, object-oriented, and functional programming. 

Everything we do today (like most data science workloads) is *functional* programming. Most engineering projects require *object-oriented* programming. 

#### Opinionated 

Python has some quirks, but in general is a very friendly language. Here are a few important things to keep in mind: 

* Whitespace is extremely important 
  * Indentation is a part of the language and cannot be ignored. It must be consistent throughout a file (tabs vs. spaces, number of spaces, etc.). 
* Documentation is often very difficult to understand when you get started, so you'll rely on StackOverflow etc for most problems and examples. Don't let SO super-users get to you. They are often not nice. 
* Jupyter is a nice place to play and experiment. Production code generally lives in a package of scripts (modules) 
* Unlike compiled languages, python will let you execute a script that won't work. You'll catch errors as the interpreter finds them. Thus **work in small chunks and test often!!** 
* If you indent more than 2 levels, rethink (no more than 2 nested if statements, for example) 
* Each function should do something specific and do it in a clean way. Resist the urge to pack everything into a single function 
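
The "work in small chunks and test often" advice above can be illustrated with a small sketch. The function name `broken_total` and its typo are made up for this example. Python happily defines the function and only complains when the faulty line actually runs:

```python
def broken_total(values):
    """Sum a list of numbers (contains a deliberate typo)."""
    total = 0
    for v in values:
        total += v
    return totl  # typo: this name does not exist

# Defining the function raised no error. The problem only
# shows up when the faulty line is actually executed.
try:
    broken_total([1, 2, 3])
    error_message = None
except NameError as err:
    error_message = str(err)

print(error_message)
```

Running each small piece right after you write it surfaces this kind of mistake early, instead of deep inside a long script.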



#### Philosophy 

* Beautiful is better than ugly
* Explicit is better than implicit
* Simple is better than complex
* Complex is better than complicated
* Readability counts

Try the `import this` command below :) 

#### Comments 

    # this is invisible to python. 
    # Use this for short comments 
    # It is ok for each line to have a comment! 
    print('but this will be evaluated') 
    
    """
    This is a block comment. 
    Use these to be verbose about 
    your function or block of code. 
    You are writing comments so that 
    your future self won't be embarassed! 
    All functions need to begin with a block comment! 
    """


#### Additional Resources 

Some examples in this exercise were taken from 2 excellent books: 

* [Machine Learning with Python Cookbook.](https://www.amazon.com/Machine-Learning-Python-Cookbook-Preprocessing/dp/1491989386/ref=asc_df_1491989386/?tag=hyprod-20&linkCode=df0&hvadid=312114711253&hvpos=1o2&hvnetw=g&hvrand=6564831004064313021&hvpone=&hvptwo=&hvqmt=&hvdev=c&hvdvcmdl=&hvlocint=&hvlocphy=9031977&hvtargid=pla-440699598191&psc=1) Chris Albon. 
* [Data Science from Scratch](https://www.amazon.com/Data-Science-Scratch-Principles-Python/dp/1492041130/ref=asc_df_1492041130/?tag=hyprod-20&linkCode=df0&hvadid=343276535408&hvpos=1o1&hvnetw=g&hvrand=12496244881379603440&hvpone=&hvptwo=&hvqmt=&hvdev=c&hvdvcmdl=&hvlocint=&hvlocphy=9031977&hvtargid=pla-699588372177&psc=1&tag=&ref=&adgrpid=74543737372&hvpone=&hvptwo=&hvadid=343276535408&hvpos=1o1&hvnetw=g&hvrand=12496244881379603440&hvqmt=&hvdev=c&hvdvcmdl=&hvlocint=&hvlocphy=9031977&hvtargid=pla-699588372177). Joel Grus. 

These are great resources for all levels of python data scientists. 


## Basic data types. 

### Lists 

Lists are the simplest collection of data objects. You can mix and match within a list, but this will create problems sometimes. Generally you should keep a single list to the same type (number vs text). 

You can start a new empty list like this, and then use the `append` method to add items to the list: 

In [None]:
l = [] 
l.append('item 1')
l

Or you can initiate a list with data already in it. Here are several lists we'll reuse below. Note there are mixed data types. 

**See if you can deduce what assumptions/choices python is making when we do this.** 

In [None]:
first_name = ['Jason', 'Molly', 'Tina', 'Jake', 'Amy']
last_name = ['Miller', 'Jacobson', ".", 'Milner', 'Cooze'] 
age = [42, 52, 36, 24, 73] 
preTestScore = [4, 24, 31, ".", "."]
postTestScore = ["25,000", "94,000", 57, 62, 70]

In [None]:
# type(preTestScore[1])
type(preTestScore[4])

Now that we have lists, we need to learn to extract information from them. We can do this with simple indexing, which in python begins with 0 (not 1). 

**What is the `[-1]` index doing?** 

**Print the list in reverse order?**  
 
**Print all but the first** 

In [None]:
first_name[0]
# first_name[0:1]
# first_name[2:]
# first_name[-1]

In [None]:
first_name[::-1]
# first_name[1:]
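
As a compact recap of the indexing questions above, here is a minimal sketch that stores each slice in a variable. The variable names are made up for illustration; the list is the same `first_name` list used in this section:

```python
first_name = ['Jason', 'Molly', 'Tina', 'Jake', 'Amy']

last_item = first_name[-1]         # negative indexes count from the end
reversed_names = first_name[::-1]  # a step of -1 walks the list backwards
all_but_first = first_name[1:]     # from index 1 through the end
first_two = first_name[0:2]        # the stop index (2) is not included

print(last_item)
print(reversed_names)
print(all_but_first)
print(first_two)
```

Slicing always returns a new list, so none of these change `first_name` itself.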

### DataFrames

For data science jobs, lists are usually a starting place, but sometimes not useful by themselves. So next we can put them into a `pandas` **DataFrame** object. **This is the canonical data science format, so knowing DF is crucial!** 

First, we need the `pandas` library. This will essentially always be used for data science jobs in python. 

In [None]:
import pandas as pd 

There are multiple ways to get lists into a DF. Here is an example of using the `zip` function to *zipper* together multiple lists. 

In [None]:
df = pd.DataFrame(list(zip(first_name, last_name, age)), 
                  columns=['first_name', 'last_name', 'age'])
df

**Create a 1-column DataFrame out of a single list.** 

In [None]:
pd.DataFrame(first_name, columns=['first_name'])

### Dictionaries 

Dictionaries are the second essential data structure you'll need for data science in python. These are kind of like lists, but are based on a `{key: data}` structure. 

In [None]:
d = {'key': [1, 2, 3]}
d['key']

Notice you can retrieve data using key words, and dictionaries can be much more complex than lists. 

**See if you can add another key and associated data into the `d` dict we created above.**

**Are there any limitations? Does it need to be the same length as the original key+data? How about mixing data types?**


In [None]:
d['key1'] = [4, 5, 6, 7, 'fish']
d
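
One way to explore the "limitations" question: a plain dict does not care about lengths or mixed types, but the constraint shows up once the dict is handed to `pandas`. The sketch below uses a made-up dict named `ragged` and assumes `pandas` is installed:

```python
import pandas as pd

# A plain dict accepts values of any length and any mix of types.
ragged = {'key': [1, 2, 3], 'key1': [4, 5, 6, 7, 'fish']}
lengths = {k: len(v) for k, v in ragged.items()}
print(lengths)

# pandas needs every column to have the same length, so this fails.
try:
    pd.DataFrame(ragged)
    built = True
except ValueError as err:
    built = False
    print('pandas refused:', err)
```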

Next, let's reproduce the first/last name DataFrame using a dictionary this time. 

**What is the `columns` arg doing? What happens if you rearrange the column order?**

In [None]:
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
            'last_name': ['Miller', 'Jacobson', ".", 'Milner', 'Cooze'], 
            'age': [42, 52, 36, 24, 73], 
            'preTestScore': [4, 24, 31, ".", "."],
            'postTestScore': ["25,000", "94,000", 57, 62, 70]}
df = pd.DataFrame(raw_data, 
                  columns = ['first_name', 
                             'last_name', 
                             'age', 
                             'preTestScore', 
                             'postTestScore'], 
                 index = ['a', 'b', 'c', 'd', 'e'])
df

In [None]:
df.iloc[1]

In [None]:
df.loc['b']

**Print the first and third columns together**

**Print the last name `Milner`** 

**Now do the same thing with a slightly different command** 

In [None]:
df.iloc[:, [0, 2]]

In [None]:
df.iloc[3,1]

In [None]:
df.loc['d', 'last_name']

#### Tuples 

... Are generally not necessary in data science workstreams. But you will occasionally run into them, so it is best to understand how to deal with them. If you remember anything from today, don't let it be tuples. 

In [None]:
t1 = (1, 4, ['this', 'that', 'the other'])
t2 = (2, 5, 7)

In [None]:
t1[2][1]

**Replace 'this' with something else.**

**Then replace 4 with something else. ?!?!?** 

**Just for fun throw a pandas DataFrame into one of your tuples. Why not?**

In [None]:
t1[2][0] = 'those'
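
The "?!?!?" in the exercise comes from the fact that a tuple's own slots cannot be reassigned, even though a list stored inside it can still be changed. A minimal sketch (the replacement value `99` is arbitrary):

```python
t1 = (1, 4, ['this', 'that', 'the other'])

# The list stored inside the tuple is still a normal, mutable list.
t1[2][0] = 'those'

# The tuple's own slots cannot be reassigned.
try:
    t1[1] = 99
    replaced = True
except TypeError as err:
    replaced = False
    print('tuples are immutable:', err)

print(t1)
```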

In [None]:
t4 = (1, 2, 3, pd.DataFrame(['another', 'data', 'frame']))
t4

Tuples can be collected into lists or dicts, and then you can cycle through them like you would other data structures. 

In [None]:
t3 = [t1, t2]
t3

Now that we have a list of tuples, we can pull the 2nd element from each one. 

In [None]:
[t[1] for t in t3]

### Changing data in a DataFrame

The print commands above don't actually change the DF. Sometimes it is important to replace/remove/add data. 

Run the commands below, and see what happens. Is this what you expected to happen? 


In [None]:
df_trimmed = df.iloc[1:3, :]
df_trimmed.loc['c', 'age'] = 15

In [None]:
df

**Ok, now see if you can find one of the many ways to change data in just a *copy* of your DataFrame**

In [None]:
df_copy = df.copy()
df_copy.loc['c', 'age'] = 21

In [None]:
df

Congrats! Now that you figured that out, you have a version control problem... 
So there is a good reason why python developers made that difficult. 

But you know how to do it when you need to. 

In reality, you might want to make a habit of creating a raw, untouched copy each time you begin transforming your data. That way it is handy in case you make mistakes. Then you can always come back to the original. But generally avoid making lots of copies of your data!! 
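
The "raw, untouched copy" habit might look like the sketch below. The names `df_demo` and `df_raw` and the tiny dataset are made up for illustration:

```python
import pandas as pd

df_demo = pd.DataFrame({'age': [42, 52, 36]}, index=['a', 'b', 'c'])

# Keep an untouched copy before transforming anything.
df_raw = df_demo.copy()

# Work on the main frame; the raw copy is unaffected.
df_demo.loc['c', 'age'] = 21

print(df_demo)
print(df_raw)
```

One raw copy is usually enough. Every extra copy is another version you have to keep track of.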


There are **lots and lots** of important operations that `numpy` and `pandas` can do. 
We've just scratched the surface. 
When you need more, you'll get really comfortable with package documentation and StackOverflow! 

### Dealing with files

Reading data into python can be a pain. Luckily `pandas` has nice functions to smooth the process. 

First, orient within the file system. Note that Colab gives us a temporary directory.

In [None]:
!pwd

In [None]:
!ls

#### CSV

In [None]:
df.to_csv('test_df.csv')

Remove the data frame just to illustrate that it is gone. 

In [None]:
del df
# df

In [None]:
df = pd.read_csv('test_df.csv', index_col=0)

In [None]:
df

#### Excel

In [None]:
df.to_excel('test_df.xlsx') #, index=0)

In [None]:
df = pd.read_excel('test_df.xlsx') #, index_col = 0)

In [None]:
df

#### SQL database

In [None]:
from sqlalchemy import create_engine

In [None]:
engine = create_engine('sqlite:///test_df.db', echo=False)

In [None]:
# if_exists='replace' drops any existing `users` table before writing
# (`engine.execute` was removed in SQLAlchemy 2.0)
df.to_sql('users', con=engine, if_exists='replace')

In [None]:
df = pd.read_sql_query("SELECT * FROM users", engine, index_col='index')

# df.index.name = None 
df

Notice that the name of the index is now visible. It is still just an index, but now you can call it by name. 

**See if you can remove the name `index` from the index column so the table prints out just like it did a few cells above.**

In [None]:
df.index.name = None

In [None]:
df

### Looping

Often in python, the fastest and most efficient way to do something with data is to loop through. Not always! But often. All python workloads will have a bit of looping. There are multiple ways to make it happen. 

First, let's loop through a list the most common way. 

In [None]:
first_name = ['Jason', 'Molly', 'Tina', 'Jake', 'Amy']

In [None]:
for name in first_name: 
    print(name) 

In [None]:
for name in first_name: print(name) 

There is a really powerful, compact way to do something simple in a loop. This is known as a 'list comprehension'. Get good at it! 

In [None]:
[print(name) for name in first_name]

Note that this creates a little list of `None` values as an artifact. That's because `print` returns `None`, so the `[]` list brackets collect one `None` per name. List comprehension is most useful when creating a list by doing something simple to another list or other group of objects: 

In [None]:
annotated_names = ['first name = ' + name for name in first_name]
annotated_names 
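
A list comprehension can also filter while it transforms. The sketch below is one possible illustration, with made-up variable names, and shows the equivalent explicit loop for comparison:

```python
first_name = ['Jason', 'Molly', 'Tina', 'Jake', 'Amy']

# Keep only names starting with 'J' and upper-case them.
j_names = [name.upper() for name in first_name if name.startswith('J')]

# The equivalent explicit loop, for comparison.
j_names_loop = []
for name in first_name:
    if name.startswith('J'):
        j_names_loop.append(name.upper())

print(j_names)
```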

Sometimes there is not a known finite number of times we need to do something. So instead of traditional looping, we can keep doing a thing until we get the result we want: 

In [None]:
i = 0
while first_name[i] != 'Amy': 
    print('first name = ' + first_name[i])
    i += 1
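
A `while` loop like this one runs off the end of the list (raising an `IndexError`) if `'Amy'` is never found. A hedged variation, using a made-up list without `'Amy'` and collecting the output in a list named `printed`, adds a length check to the condition:

```python
first_name = ['Jason', 'Molly', 'Tina', 'Jake']  # no 'Amy' this time

printed = []
i = 0
# Stop at 'Amy' OR at the end of the list, whichever comes first.
while i < len(first_name) and first_name[i] != 'Amy':
    printed.append('first name = ' + first_name[i])
    i += 1

print(printed)
```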
    

### `if` statements

You essentially always need some sort of `if` decision logic during a project, and there are many ways to do it. Here are some simple examples. 

In [None]:
if 'Jake' in first_name: 
    print('Found Jake!')

In [None]:
if 'James' not in first_name: 
    print('James is missing')

In [None]:
if 'James' in first_name: 
    print('Found James!')
elif 'Jake' in first_name: 
    print('Found Jake!')
else: 
    print('found nothing.')

### More about Data Frames and their Framed Data

The reason to get data quickly into `pandas` is so that you can use the fabulous selection of data-manipulation and stats functions on your data. Here are a few simple examples using some datasets already loaded into the data science packages. 

In [None]:
from sklearn import datasets

In [None]:
iris_data = datasets.load_iris()

**What format is the iris dataset by default?**

**Can you turn it into a DataFrame named `df`?**

In [None]:
iris_data

In [None]:
df = pd.DataFrame(iris_data.data, columns = iris_data.feature_names)

The `iris` table is a classic plant physiology dataset that is often used to test ML clustering and classification algorithms. 

In [None]:
df.head(5)

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.columns = ['sepal length', 'sepal width', 'petal length', 'petal width']
df.columns

In [None]:
df_big_sepal = df[df['sepal length'] > 7]
df_big_sepal

**Now find all the rows with sepal width ALSO > 3** 

In [None]:
df_bigger_sepal = df[(df['sepal width'] > 3) & (df['sepal length'] > 7)]

In [None]:
df_bigger_sepal.max()

In [None]:
df_big_sepal['petal width'].max()

**Find the average of the second and third columns**

In [None]:
df_bigger_sepal.iloc[:, 1:3].mean()

### Dealing with missing/bad data

Almost always an issue. So let's deal with it. First let's create some missing data. 

numpy has a simple NaN representation that works nicely for our needs. 

In [None]:
import numpy as np 
df['petal length'] = df['petal length'].replace(6.1, np.nan)

In [None]:
df

And then remove those entries. 

In [None]:
df = df.dropna()

In [None]:
df
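
Dropping rows is only one option. As a sketch of an alternative, the example below counts the missing values first and then fills them with the column mean. The small `demo` frame is made up, and filling with the mean is just one possible choice, not a recommendation for every dataset:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'petal length': [1.4, np.nan, 5.1, np.nan]})

# Count the missing values before deciding what to do.
n_missing = demo['petal length'].isna().sum()

# Option 1: drop rows with missing values (as in the cell above).
dropped = demo.dropna()

# Option 2: keep the rows and fill the gaps with the column mean.
filled = demo.fillna(demo['petal length'].mean())

print(n_missing, len(dropped), filled['petal length'].tolist())
```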

### Grouping and summarizing your data

In [None]:
# assign each row to a random group using numpy's random module
df['group'] = np.random.choice(['r', 'g', 'b'], df.shape[0])

In [None]:
df

In [None]:
df.groupby('group').mean()

### `apply` functions

You can get much more creative about applying functions to entire columns or groups, but you'll need to pass the data through a function like `apply`. 

In [None]:
df.groupby('group').apply(lambda x: x.count())
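
For common summaries, `agg` is often a more direct alternative to `apply` with a lambda. The sketch below uses a small made-up `demo` frame with fixed groups, so the result is reproducible:

```python
import pandas as pd

demo = pd.DataFrame({
    'group': ['r', 'g', 'r', 'b', 'g', 'r'],
    'petal width': [0.2, 1.3, 0.4, 2.1, 1.5, 0.3],
})

# Several summaries of one column, computed per group.
summary = demo.groupby('group')['petal width'].agg(['count', 'mean', 'max'])
print(summary)
```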

### Plotting data

`matplotlib` will do 90% or more of the plotting you need during routine data science. There are **many, many, many** different things you can do, and it (with other functionality) is becoming as good as R for visualization. 

We'll just look at a super simple example here, with one simple style option invoked. Every variation of plotting in python is accessible with a quick search. 

Notice also that we're reloading a clean version of the `iris` dataset with its true response category (`species`).

In [None]:
import matplotlib.pyplot as plt
import seaborn; seaborn.set()
iris = pd.DataFrame(iris_data.data, columns = iris_data.feature_names)
iris['species'] = iris_data.target
iris

In [None]:
iris.plot(kind='scatter', x='petal width (cm)', y='petal length (cm)', color=iris['species'])
plt.show()

**Do the `sepal` variables show the same pattern?** 

In [None]:
iris.plot(kind='scatter', x='sepal width (cm)', y='sepal length (cm)', color=iris['species'])
plt.show()

### Writing functions

If your workflow is extensive at all, you'll certainly need to write efficient, readable functions that do complex operations. In general, any task that gets coded more than 2x should be moved into a clean function. 

For example, let's do something special to some of the items in the dataset. We'll use some logic in a function. 

When writing functions and loops, pay close attention to your indentation!! This matters in python. Mixed tabs and spaces will cause problems, as will any other mismatched indentation. 

In [None]:
def show_me(data, length=6): 
    """
    Takes a single element and prints its value 
    if it is greater than the threshold `length` arg. 
    """
    if data > length: 
        print(data)

In [None]:
for data in list(df['sepal length']): 
    show_me(data, 7)
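
A possible variation on `show_me`, sketched under the assumption that you want the values back rather than printed. The function name `longer_than` and the sample numbers are made up for illustration:

```python
def longer_than(values, length=6):
    """
    Return the items in `values` that are greater than `length`.
    Returning a list (instead of printing) lets other code reuse the result.
    """
    return [v for v in values if v > length]

sepal_lengths = [5.1, 7.2, 6.3, 7.7, 4.9]  # made-up sample values
big = longer_than(sepal_lengths, 7)
print(big)
```

Because the description is a docstring, `help(longer_than)` will display it.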

This is just a tiny taste of python's functional programming capability. It goes **way way** deeper than this. And then there is object-oriented programming, which we don't even touch here. 

Often during a data science process, I end up with a script of small functions that do repeatable things, and then a runner script that calls those functions. Or you can put the function near the top of your notebook, and then call it below. 

Notice something really important above. The commented description tells you everything you need to know to use the function correctly. It is not a problem 