# T81-558: Applications of Deep Neural Networks
**Module 2: Python for Machine Learning**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module 2 Material

Main video lecture:

* Part 2.1: Introduction to Pandas [[Video]](https://www.youtube.com/watch?v=bN4UuCBdpZc&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_02_1_python_pandas.ipynb)
* Part 2.2: Categorical Values [[Video]](https://www.youtube.com/watch?v=4a1odDpG0Ho&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_02_2_pandas_cat.ipynb)
* **Part 2.3: Grouping, Sorting, and Shuffling in Python Pandas** [[Video]](https://www.youtube.com/watch?v=YS4wm5gD8DM&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_02_3_pandas_grouping.ipynb)
* Part 2.4: Using Apply and Map in Pandas for Keras [[Video]](https://www.youtube.com/watch?v=XNCEZ4WaPBY&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_02_4_pandas_functional.ipynb)
* Part 2.5: Feature Engineering in Pandas for Deep Learning in Keras [[Video]](https://www.youtube.com/watch?v=BWPTj4_Mi9E&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_02_5_pandas_features.ipynb)

# Part 2.3: Grouping, Sorting, and Shuffling  

### Shuffling a Dataset
The following code shuffles a data set and then resets its index.  Setting a random seed makes the shuffle reproducible from run to run.

In [38]:
import os
import numpy as np
import pandas as pd

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv", 
    na_values=['NA', '?'])

#np.random.seed(42) # Uncomment this line to get the same shuffle each time
df = df.reindex(np.random.permutation(df.index))
df.reset_index(inplace=True, drop=True)
display(df[0:10])

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,25.0,4,98.0,,2046,19.0,71,1,ford pinto
1,19.0,6,250.0,88.0,3302,15.5,71,1,ford torino 500
2,16.0,8,302.0,140.0,4141,14.0,74,1,ford gran torino
3,18.2,8,318.0,135.0,3830,15.2,79,1,dodge st. regis
4,32.2,4,108.0,75.0,2265,15.2,80,3,toyota corolla
5,36.1,4,98.0,66.0,1800,14.4,78,1,ford fiesta
6,44.0,4,97.0,52.0,2130,24.6,82,2,vw pickup
7,10.0,8,360.0,215.0,4615,14.0,70,1,ford f250
8,30.7,6,145.0,76.0,3160,19.6,81,2,volvo diesel
9,19.1,6,225.0,90.0,3381,18.7,80,1,dodge aspen
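
As a possible alternative to `reindex` with a permutation, pandas also provides `DataFrame.sample`.  The sketch below uses a small made-up frame rather than the Auto MPG file, and `shuffle_frame` is a hypothetical helper name.  It illustrates that a fixed `random_state` should give the same shuffle each time.

```python
import pandas as pd

# Tiny illustrative frame; the rows are made up for this sketch.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e"],
    "mpg": [18.0, 15.0, 36.1, 44.0, 10.0],
})

def shuffle_frame(frame, seed=None):
    """Return a shuffled copy of frame with a fresh 0..n-1 index."""
    return frame.sample(frac=1, random_state=seed).reset_index(drop=True)

first = shuffle_frame(toy, seed=42)
second = shuffle_frame(toy, seed=42)

# Same seed, same row order; every original row is still present.
print(first.equals(second))
print(sorted(first["name"]) == sorted(toy["name"]))
```

Passing `seed=None` leaves the shuffle unseeded, which mirrors leaving the `np.random.seed` line commented out.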


### Sorting a Data Set

Data sets can also be sorted.  This code sorts the MPG dataset by name and displays the first car.

In [39]:
import os
import pandas as pd

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv", 
    na_values=['NA', '?'])

df = df.sort_values(by='name', ascending=True)
print(f"The first car is: {df['name'].iloc[0]}")
display(df[0:5])

The first car is: amc ambassador brougham


Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
96,13.0,8,360.0,175.0,3821,11.0,73,1,amc ambassador brougham
9,15.0,8,390.0,190.0,3850,8.5,70,1,amc ambassador dpl
66,17.0,8,304.0,150.0,3672,11.5,72,1,amc ambassador sst
315,24.3,4,151.0,90.0,3003,20.1,80,1,amc concord
257,19.4,6,232.0,90.0,3210,17.2,78,1,amc concord
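
The output above contains two rows named "amc concord", so a second sort key can be useful for breaking ties.  The following is a minimal sketch on a small hand-built frame (the three rows reuse values shown in this notebook) that sorts by name and then by descending MPG.

```python
import pandas as pd

# Three rows taken from the outputs shown in this notebook.
cars = pd.DataFrame({
    "name": ["amc concord", "ford pinto", "amc concord"],
    "mpg": [19.4, 25.0, 24.3],
})

# Sort by name ascending, then by mpg descending within each name.
ordered = cars.sort_values(
    by=["name", "mpg"], ascending=[True, False]
).reset_index(drop=True)

print(ordered)
```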


### Grouping a Data Set

Grouping is a common operation on data sets.  In SQL, this operation is referred to as "GROUP BY".  Grouping is used to summarize data.  Because of this summarization, the row count will either stay the same or, more likely, shrink after a grouping is applied.

The Auto MPG dataset is used to demonstrate grouping.

In [40]:
import os
import pandas as pd

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv", 
    na_values=['NA', '?'])
display(df[0:5])

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,1,ford torino


The above data set can be used with grouping to produce summaries.  For example, the following code groups the rows by **cylinders** and computes the average (mean) **mpg** of each group.  In addition to mean, other aggregating functions, such as **sum** or **count**, can be used.

In [46]:
g = df.groupby('cylinders')['mpg'].mean()
g

cylinders
3    20.550000
4    29.286765
5    27.366667
6    19.985714
8    14.963107
Name: mpg, dtype: float64
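
Several aggregating functions can also be requested in one call.  This is a small sketch on made-up data, not the Auto MPG file, that asks for mean, sum, and count together through `agg`.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "cylinders": [4, 4, 6, 8, 8],
    "mpg": [30.0, 28.0, 20.0, 15.0, 13.0],
})

# One row per cylinder count, one column per aggregate.
summary = toy.groupby("cylinders")["mpg"].agg(["mean", "sum", "count"])
print(summary)
```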

It might be useful to have these **mean** values as a dictionary.

In [42]:
d = g.to_dict()
d

{3: 20.55,
 4: 29.28676470588236,
 5: 27.366666666666664,
 6: 19.985714285714284,
 8: 14.963106796116508}

This allows you to quickly access an individual element, such as looking up the mean for 6 cylinders.  This is used in target encoding, which is presented in this module.

In [43]:
d[6]

19.985714285714284
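
To hint at how such a dictionary might feed target encoding, the sketch below maps each row's cylinder count to its group mean.  The data are made up, and the column name `cyl_mean_mpg` is an assumption chosen for this illustration.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "cylinders": [4, 4, 6, 8, 8],
    "mpg": [30.0, 28.0, 20.0, 15.0, 13.0],
})

# Dictionary of mean mpg per cylinder count, as in the cell above.
means = toy.groupby("cylinders")["mpg"].mean().to_dict()

# Replace each cylinder count with the mean of its group.
toy["cyl_mean_mpg"] = toy["cylinders"].map(means)
print(toy)
```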

The code below shows how to count the number of rows that match each cylinder count.

In [44]:
df.groupby('cylinders')['mpg'].count().to_dict()

{3: 4, 4: 204, 5: 3, 6: 84, 8: 103}

# Part 2.4: Apply and Map

The **apply** and **map** functions can also be applied to Pandas **dataframes**.

### Using Map with Dataframes

In [47]:
import os
import pandas as pd
import numpy as np

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv", 
    na_values=['NA', '?'])

display(df[0:10])

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,1,ford torino
5,15.0,8,429.0,198.0,4341,10.0,70,1,ford galaxie 500
6,14.0,8,454.0,220.0,4354,9.0,70,1,chevrolet impala
7,14.0,8,440.0,215.0,4312,8.5,70,1,plymouth fury iii
8,14.0,8,455.0,225.0,4425,10.0,70,1,pontiac catalina
9,15.0,8,390.0,190.0,3850,8.5,70,1,amc ambassador dpl


In [48]:
df['origin_name'] = df['origin'].map({1: 'North America', 2: 'Europe', 3: 'Asia'})
display(df[0:50])

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name,origin_name
0,18.0,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu,North America
1,15.0,8,350.0,165.0,3693,11.5,70,1,buick skylark 320,North America
2,18.0,8,318.0,150.0,3436,11.0,70,1,plymouth satellite,North America
3,16.0,8,304.0,150.0,3433,12.0,70,1,amc rebel sst,North America
4,17.0,8,302.0,140.0,3449,10.5,70,1,ford torino,North America
5,15.0,8,429.0,198.0,4341,10.0,70,1,ford galaxie 500,North America
6,14.0,8,454.0,220.0,4354,9.0,70,1,chevrolet impala,North America
7,14.0,8,440.0,215.0,4312,8.5,70,1,plymouth fury iii,North America
8,14.0,8,455.0,225.0,4425,10.0,70,1,pontiac catalina,North America
9,15.0,8,390.0,190.0,3850,8.5,70,1,amc ambassador dpl,North America
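
One detail worth knowing when mapping with a dictionary: values that have no key in the dictionary become missing (NaN).  The sketch below uses a made-up series with an extra code of 4 to show this, and the `'Unknown'` label is an assumption for illustration.

```python
import pandas as pd

# Origin codes; 4 is a made-up code that the dictionary does not cover.
origin = pd.Series([1, 2, 3, 4])

names = origin.map({1: 'North America', 2: 'Europe', 3: 'Asia'})

# Unmapped codes come back as NaN; fill them with a placeholder label.
filled = names.fillna('Unknown')

print(names.isna().tolist())
print(filled.tolist())
```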


### Using Apply with Dataframes

If the **apply** function is directly executed on the data frame, the lambda function is called once per column or row, depending on the value of axis.  For axis = 1, rows are used. 

The following code calculates a series called **efficiency** that is the **displacement** divided by **horsepower**. 

In [49]:
effi = df.apply(lambda x: x['displacement']/x['horsepower'], axis=1)
display(effi[0:10])

0    2.361538
1    2.121212
2    2.120000
3    2.026667
4    2.157143
5    2.166667
6    2.063636
7    2.046512
8    2.022222
9    2.052632
dtype: float64
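
For simple arithmetic like this, a column-wise expression is usually an equivalent and faster option than a row-wise **apply**.  The sketch below uses the first three displacement and horsepower values from the table above and compares the two approaches.

```python
import pandas as pd

# First three rows of displacement and horsepower from the table above.
toy = pd.DataFrame({
    "displacement": [307.0, 350.0, 318.0],
    "horsepower": [130.0, 165.0, 150.0],
})

# Row-wise apply, as in the cell above.
by_apply = toy.apply(lambda x: x["displacement"] / x["horsepower"], axis=1)

# Vectorized division of whole columns.
vectorized = toy["displacement"] / toy["horsepower"]

# Largest absolute difference between the two results.
print((by_apply - vectorized).abs().max())
```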

### Feature Engineering with Apply and Map

In this section we will see how to calculate a complex feature using map, apply, and grouping.  The data set is the following CSV:

* https://www.irs.gov/pub/irs-soi/16zpallagi.csv 

This is US Government public data for "SOI Tax Stats - Individual Income Tax Statistics".  The primary website is here:

* https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-2016-zip-code-data-soi 

Documentation describing this data is at the above link.

For this feature, we will attempt to estimate the adjusted gross income (AGI) for each of the zipcodes.  The data file contains many columns; however, you will only use the following:

* STATE - The state (e.g. MO)
* zipcode - The zipcode (e.g. 63017)
* agi_stub - Six different brackets of annual income (1 through 6) 
* N1 - The number of tax returns for each of the agi_stubs

Note that the file has 6 rows for each zipcode, one for each of the agi_stub brackets. You can skip rows whose zipcode is 0 or 99999.

We will create an output CSV with these columns; however, only one row per zip code. Calculate a weighted average of the income brackets. For example, the following 6 rows are present for 63017:


|zipcode |agi_stub | N1 |
|--|--|-- |
|63017	 |1 | 4710 |
|63017	 |2 | 2780 |
|63017	 |3 | 2130 |
|63017	 |4 | 2010 |
|63017	 |5 | 5240 |
|63017	 |6 | 3510 |


We must combine these six rows into one.  For privacy reasons, AGIs are broken out into 6 buckets.  We need to combine the buckets and estimate the actual AGI of a zipcode. To do this, consider the values for agi_stub:

* 1 = \$1 to \$25,000
* 2 = \$25,000 to \$50,000
* 3 = \$50,000 to \$75,000
* 4 = \$75,000 to \$100,000
* 5 = \$100,000 to \$200,000
* 6 = \$200,000 or more

The median of each of these ranges is approximated as follows (brackets 5 and 6 are wide or open-ended, so their values are rough estimates):

* 1 = \$12,500
* 2 = \$37,500
* 3 = \$62,500 
* 4 = \$87,500
* 5 = \$112,500
* 6 = \$212,500

Using this you can estimate 63017's average AGI as:

```
>>> totalCount = 4710 + 2780 + 2130 + 2010 + 5240 + 3510
>>> totalAGI = 4710 * 12500 + 2780 * 37500 + 2130 * 62500 + 2010 * 87500 + 5240 * 112500 + 3510 * 212500
>>> print(totalAGI / totalCount)

88689.89205103042
```
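
The same weighted average can be sketched with NumPy, assuming the counts and approximate medians listed above.  `np.average` accepts a `weights` argument, so the bracket medians are averaged with the return counts as weights.

```python
import numpy as np

# Number of returns (N1) per bracket for zipcode 63017, from the table above.
counts = np.array([4710, 2780, 2130, 2010, 5240, 3510])

# Approximate median income assigned to each bracket.
medians = np.array([12500, 37500, 62500, 87500, 112500, 212500])

# Weighted average of the medians, weighted by the number of returns.
estimate = np.average(medians, weights=counts)
print(estimate)
```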

In [50]:
import pandas as pd

df=pd.read_csv('https://www.irs.gov/pub/irs-soi/16zpallagi.csv')

First, we trim all zipcodes that are either 0 or 99999.  We also select the four fields that we need.

In [51]:
df=df.loc[(df['zipcode']!=0) & (df['zipcode']!=99999),['STATE','zipcode','agi_stub','N1']]

In [52]:
df

Unnamed: 0,STATE,zipcode,agi_stub,N1
6,AL,35004,1,1510
7,AL,35004,2,1410
8,AL,35004,3,950
9,AL,35004,4,650
10,AL,35004,5,630
11,AL,35004,6,60
12,AL,35005,1,1310
13,AL,35005,2,960
14,AL,35005,3,450
15,AL,35005,4,200


We replace all of the **agi_stub** values with the corresponding median values using the **map** function.

In [53]:
medians = {1:12500,2:37500,3:62500,4:87500,5:112500,6:212500}
df['agi_stub']=df.agi_stub.map(medians)

In [54]:
df

Unnamed: 0,STATE,zipcode,agi_stub,N1
6,AL,35004,12500,1510
7,AL,35004,37500,1410
8,AL,35004,62500,950
9,AL,35004,87500,650
10,AL,35004,112500,630
11,AL,35004,212500,60
12,AL,35005,12500,1310
13,AL,35005,37500,960
14,AL,35005,62500,450
15,AL,35005,87500,200


Next the dataframe is grouped by zip code.

In [55]:
groups = df.groupby(by='zipcode')

A lambda is applied across the groups and the AGI estimate is calculated.

In [56]:
df = pd.DataFrame(groups.apply(lambda x:sum(x['N1']*x['agi_stub'])/sum(x['N1']))).reset_index()

In [57]:
df

Unnamed: 0,zipcode,0
0,1001,52895.322940
1,1002,64528.451001
2,1003,15441.176471
3,1005,54694.092827
4,1007,63654.353562
5,1008,57575.757576
6,1009,45576.923077
7,1010,61303.191489
8,1011,49807.692308
9,1012,53214.285714


The unnamed result column (labeled 0) is renamed to **agi_estimate**.

In [58]:
df.columns = ['zipcode','agi_estimate']

In [59]:
display(df[0:10])

Unnamed: 0,zipcode,agi_estimate
0,1001,52895.32294
1,1002,64528.451001
2,1003,15441.176471
3,1005,54694.092827
4,1007,63654.353562
5,1008,57575.757576
6,1009,45576.923077
7,1010,61303.191489
8,1011,49807.692308
9,1012,53214.285714


We can also see that our zipcode of 63017 gets the correct value.

In [60]:
df[ df['zipcode']==63017 ]

Unnamed: 0,zipcode,agi_estimate
19909,63017,88689.892051
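
The grouped lambda works, but the same estimate can likely be computed without a per-group Python function.  The sketch below is one possible vectorized variant, built on the twelve rows shown earlier for zipcodes 35004 and 63017.  The intermediate column name `agi_total` is an assumption for illustration.

```python
import pandas as pd

# Rows copied from the tables above for two zipcodes.
rows = pd.DataFrame({
    "zipcode": [35004] * 6 + [63017] * 6,
    "agi_stub": [1, 2, 3, 4, 5, 6] * 2,
    "N1": [1510, 1410, 950, 650, 630, 60,
           4710, 2780, 2130, 2010, 5240, 3510],
})

medians = {1: 12500, 2: 37500, 3: 62500, 4: 87500, 5: 112500, 6: 212500}

# Total estimated income contributed by each bracket row.
rows["agi_total"] = rows["N1"] * rows["agi_stub"].map(medians)

# Sum numerator and denominator per zipcode, then divide.
sums = rows.groupby("zipcode")[["agi_total", "N1"]].sum()
estimate = (sums["agi_total"] / sums["N1"]).rename("agi_estimate").reset_index()
print(estimate)
```

Because the result is built from a named series, the column already carries the name `agi_estimate` and no separate rename step is needed.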


# Part 2.5: Feature Engineering

Feature engineering is a very important part of machine learning.  Later in this course we will see some techniques for automatic feature engineering.  

## Calculated Fields

It is possible to add new fields to the dataframe that are calculated from the other fields.  We can create a new column that gives the weight in kilograms.  The equation to calculate a metric weight, given a weight in pounds is:

$ m_{(kg)} = m_{(lb)} \times 0.45359237 $

This can be used with the following Python code:

In [61]:
import os
import pandas as pd

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv", 
    na_values=['NA', '?'])

df.insert(1, 'weight_kg', (df['weight'] * 0.45359237).astype(int))
df

Unnamed: 0,mpg,weight_kg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,18.0,1589,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,1675,8,350.0,165.0,3693,11.5,70,1,buick skylark 320
2,18.0,1558,8,318.0,150.0,3436,11.0,70,1,plymouth satellite
3,16.0,1557,8,304.0,150.0,3433,12.0,70,1,amc rebel sst
4,17.0,1564,8,302.0,140.0,3449,10.5,70,1,ford torino
5,15.0,1969,8,429.0,198.0,4341,10.0,70,1,ford galaxie 500
6,14.0,1974,8,454.0,220.0,4354,9.0,70,1,chevrolet impala
7,14.0,1955,8,440.0,215.0,4312,8.5,70,1,plymouth fury iii
8,14.0,2007,8,455.0,225.0,4425,10.0,70,1,pontiac catalina
9,15.0,1746,8,390.0,190.0,3850,8.5,70,1,amc ambassador dpl
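
Note that `astype(int)` truncates toward zero rather than rounding.  The sketch below uses the first three weights from the table above to compare truncation with rounding to the nearest kilogram; whether rounding is preferable depends on the application.

```python
import pandas as pd

# First three weights in pounds from the table above.
weight_lb = pd.Series([3504, 3693, 3436])

kg = weight_lb * 0.45359237

# Truncation, as used in the cell above.
truncated = kg.astype(int)

# Rounding to the nearest whole kilogram before converting to int.
rounded = kg.round().astype(int)

print(truncated.tolist())
print(rounded.tolist())
```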


## Google API Keys

Sometimes you will use external APIs to obtain data.  The following examples show how to use the Google API keys to encode addresses for use with neural networks.  To use these, you will need your own Google API key.  The key I have below is not a real key; you need to put your own in there.  Google will ask for a credit card, but unless you use a very large number of lookups, there will be no actual cost.  YOU ARE NOT required to get a Google API key for this class; this only shows you how.  If you would like to get a Google API key, visit this site and obtain one for **geocode**.

[Google API Keys](https://developers.google.com/maps/documentation/embed/get-api-key)

In [62]:
GOOGLE_KEY = 'INSERT_YOUR_KEY'

# Other Examples: Dealing with Addresses

Addresses can be difficult to encode into a neural network.  There are many different approaches, and you must consider how you can transform the address into something more meaningful.  Map coordinates can be a good approach.  [Latitude and longitude](https://en.wikipedia.org/wiki/Geographic_coordinate_system) can be a useful encoding.  Thanks to the power of the Internet, it is relatively easy to transform an address into its latitude and longitude values.  The following code determines the coordinates of [Washington University](https://wustl.edu/):

In [63]:
import requests

address = "1 Brookings Dr, St. Louis, MO 63130"

response = requests.get('https://maps.googleapis.com/maps/api/geocode/json?key={}&address={}'.format(GOOGLE_KEY,address))

resp_json_payload = response.json()

if 'error_message' in resp_json_payload:
    print(resp_json_payload['error_message'])
else:
    print(resp_json_payload['results'][0]['geometry']['location'])

{'lat': 38.648238, 'lng': -90.30487459999999}
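
Addresses contain spaces and commas, so it can be safer to URL-encode the query string explicitly.  The sketch below builds the same geocode URL with the standard library's `urlencode`; `build_geocode_url` is a hypothetical helper name, and no request is actually sent.

```python
from urllib.parse import urlencode

# Endpoint used by the request in the cell above.
GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

def build_geocode_url(address, key):
    """Return the geocode URL with key and address safely encoded."""
    query = urlencode({"key": key, "address": address})
    return "{}?{}".format(GEOCODE_URL, query)

url = build_geocode_url("1 Brookings Dr, St. Louis, MO 63130", "INSERT_YOUR_KEY")
print(url)
```

The `requests` library can perform equivalent encoding when the values are passed through its `params` argument instead of being formatted into the URL string.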


If latitude and longitude are simply fed into the neural network as two features, they might not be overly helpful.  These two values would allow your neural network to cluster locations on a map.  Sometimes clustering locations on a map can be useful.  Consider the percentage of the population that smokes in the USA by state:

![Smokers by State](https://raw.githubusercontent.com/jeffheaton/t81_558_deep_learning/master/images/class_6_smokers.png "Smokers by State")

The above map shows that certain behaviors, like smoking, can be clustered by geographic region.

However, often you will want to transform the coordinates into distances.  It is reasonably easy to estimate the distance between any two points on Earth by using the [great circle distance](https://en.wikipedia.org/wiki/Great-circle_distance) between two points on a sphere:

$\Delta\sigma=\arccos\bigl(\sin\phi_1\cdot\sin\phi_2+\cos\phi_1\cdot\cos\phi_2\cdot\cos(\Delta\lambda)\bigr)$

$d = r \, \Delta\sigma$

Here $\phi_1$ and $\phi_2$ are the two latitudes, $\Delta\lambda$ is the difference in longitude, and $r$ is the radius of the Earth.  The following code computes this distance using the mathematically equivalent haversine form of the formula:

In [64]:
from math import sin, cos, sqrt, atan2, radians

# Distance function
def distance_lat_lng(lat1,lng1,lat2,lng2):
    # approximate radius of earth in km
    R = 6373.0

    # degrees to radians (lat/lon are in degrees)
    lat1 = radians(lat1)
    lng1 = radians(lng1)
    lat2 = radians(lat2)
    lng2 = radians(lng2)

    dlng = lng2 - lng1
    dlat = lat2 - lat1

    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlng / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    return R * c

# Find lat lon for address
def lookup_lat_lng(address):
    response = requests.get('https://maps.googleapis.com/maps/api/geocode/json?key={}&address={}'.format(GOOGLE_KEY,address))
    payload = response.json()
    if len(payload['results']) == 0:
        print("Can't find: {}".format(address))
        return 0,0
    location = payload['results'][0]['geometry']['location']
    return location['lat'],location['lng']


# Distance between two locations

import requests

address1 = "1 Brookings Dr, St. Louis, MO 63130" 
address2 = "3301 College Ave, Fort Lauderdale, FL 33314"

lat1, lng1 = lookup_lat_lng(address1)
lat2, lng2 = lookup_lat_lng(address2)

print("Distance, St. Louis, MO to Ft. Lauderdale, FL: {} km".format(
        distance_lat_lng(lat1,lng1,lat2,lng2)))

Distance, St. Louis, MO to Ft. Lauderdale, FL: 1684.9161446533758 km
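
The distance function above uses the haversine form, while the formula is written with an arccos.  The sketch below implements both on fixed coordinates, with no API call, to check that they agree.  The Washington University coordinates come from the geocoding output above; the Chicago coordinates are approximate values assumed for illustration, and the function names are hypothetical.

```python
from math import sin, cos, sqrt, atan2, acos, radians

R_KM = 6373.0  # approximate Earth radius in km, matching the function above

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance using the haversine form."""
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return R_KM * 2 * atan2(sqrt(a), sqrt(1 - a))

def arccos_km(lat1, lng1, lat2, lng2):
    """Great-circle distance using the arccos form of the formula."""
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    x = sin(lat1) * sin(lat2) + cos(lat1) * cos(lat2) * cos(lng2 - lng1)
    x = max(-1.0, min(1.0, x))  # guard against rounding just outside [-1, 1]
    return R_KM * acos(x)

washu = (38.648238, -90.30487459999999)  # from the geocoding output above
chicago = (41.8781, -87.6298)            # assumed approximate coordinates

d_hav = haversine_km(*washu, *chicago)
d_cos = arccos_km(*washu, *chicago)
print(d_hav, d_cos)
```

The haversine form is often preferred for nearby points, where the argument of the arccos sits very close to 1 and rounding error matters more.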


Distances can be a useful way to encode addresses.  You must consider what distance might be useful for your dataset.  Consider:

* Distance to major metropolitan area
* Distance to competitor
* Distance to distribution center
* Distance to retail outlet

The following code calculates the distance between 10 universities and Washington University:

In [65]:
# Encoding other universities by their distance to Washington University

schools = [
    ["Princeton University, Princeton, NJ 08544", 'Princeton'],
    ["Massachusetts Hall, Cambridge, MA 02138", 'Harvard'],
    ["5801 S Ellis Ave, Chicago, IL 60637", 'University of Chicago'],
    ["Yale, New Haven, CT 06520", 'Yale'],
    ["116th St & Broadway, New York, NY 10027", 'Columbia University'],
    ["450 Serra Mall, Stanford, CA 94305", 'Stanford'],
    ["77 Massachusetts Ave, Cambridge, MA 02139", 'MIT'],
    ["Duke University, Durham, NC 27708", 'Duke University'],
    ["University of Pennsylvania, Philadelphia, PA 19104", 'University of Pennsylvania'],
    ["Johns Hopkins University, Baltimore, MD 21218", 'Johns Hopkins']
]

lat1, lng1 = lookup_lat_lng("1 Brookings Dr, St. Louis, MO 63130")

for address, name in schools:
    lat2,lng2 = lookup_lat_lng(address)
    dist = distance_lat_lng(lat1,lng1,lat2,lng2)
    print("School '{}', distance to wustl is: {}".format(name,dist))

School 'Princeton', distance to wustl is: 1354.4748428037537
School 'Harvard', distance to wustl is: 1670.6348910867227
School 'University of Chicago', distance to wustl is: 418.07123096093096
School 'Yale', distance to wustl is: 1508.209168740192
School 'Columbia University', distance to wustl is: 1418.2846378506144
School 'Stanford', distance to wustl is: 2780.6884662205066
School 'MIT', distance to wustl is: 1672.4354422735219
School 'Duke University', distance to wustl is: 1046.7924543575177
School 'University of Pennsylvania', distance to wustl is: 1307.1873732319766
School 'Johns Hopkins', distance to wustl is: 1184.3754484499111


# Module 2 Assignment

You can find the assignment for this module here: [assignment 2](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/assignments/assignment_yourname_class2.ipynb)