MQM 2018-2019  
Summer Term  
Duke University, The Fuqua School of Business  
**Data Infrastructure**  
Lecture 11

# Pandas Basics

* Today we're going to review the NumPy and pandas packages
* Specifically, we'll focus on the following...
  * A quick review of some important Python points
  * What is NumPy?  What is an ndarray?
  * What is pandas?  What is a DataFrame?
  * How do we get data into a DataFrame?
  * Since we're now SQL pros, what operations are equivalent in pandas?
  * How do we perform other basic manipulations on a DataFrame?

## Quick Python Review

* When we assign a value to a variable, we are really just assigning a reference to the value on the right of the equal sign.  But the reference can behave differently depending on whether it points to something that is "mutable" or "immutable"...

In [43]:
a=1
b=a
b=2
print(a)
print(b)
c=3
d=c
c=4
# An integer object is immutable
print(c)
print(d)
names = ["Larry","Jeff"]
type(names)
more_names = names
more_names.append("Marty")
# But a list is mutable
print(more_names)
print(names)
list1 = ["Larry","Moe"]
list2 = list1
list1.append("Curly")
print(list1)
print(list2)

1
2
4
3
['Larry', 'Jeff', 'Marty']
['Larry', 'Jeff', 'Marty']
['Larry', 'Moe', 'Curly']
['Larry', 'Moe', 'Curly']
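
One way to check whether two names refer to the same object is the `is` operator. The following is a minimal sketch; the variable names (`stooges`, `alias`, `snapshot`) are hypothetical and chosen only for this illustration.

```python
# Hypothetical names, used only for this illustration
stooges = ["Larry", "Moe"]
alias = stooges            # a second name for the same list object
snapshot = list(stooges)   # a new list holding the same items

stooges.append("Curly")

print(alias is stooges)     # True: both names point at one object
print(snapshot is stooges)  # False: snapshot is a separate list
print(alias)                # ['Larry', 'Moe', 'Curly']
print(snapshot)             # ['Larry', 'Moe']
```

Because `snapshot` is a separate list, the later `append` does not affect it.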


* In Python, list indexes can be a little confusing.  A few points to keep in mind...
    * The first item in the list has index 0
    * Slicing a list includes the first index but excludes the second index
    * Reading from right to left, the first item has index -1
    

In [34]:
my_list = [1,4,9,16]
# Select the 2nd item in my_list
print(my_list[1])
# Print the type of the 4th item in my list
print(type(my_list[3]))
# Slicing using : ...
# Select the 1st and 2nd items in my_list at the same time
print(my_list[0:2])
# Similarly to SQL, we can also read from "right to left"
print(my_list[-1])

my_list.append(25)
# Select everything from the 2nd item on
print(my_list[1:])
# If I want to select every other value, starting with the first item...
my_list[::2]

# Can we avoid our previous mutable object reference situation (with lists)? Yes!
my_new_list = my_list[:]
my_new_list.append(36)
print(my_list)
print(my_new_list)

# You can also find the index of a specific value in a list...
print(my_new_list.index(25))

# Finally, to delete items from a list, use the del statement...
del my_new_list[5]
print(my_new_list)

# Functions, when applied to different types, can do different things...
len(my_new_list)
len("Data Infrastructure")

4
<class 'int'>
[1, 4]
16
[4, 9, 16, 25]
[1, 4, 9, 16, 25]
[1, 4, 9, 16, 25, 36]
4
[1, 4, 9, 16, 25]


19

In [37]:
# All elements in a list do NOT need to have the same type
print(type(names))
print(type(names[0]))
names.append(47)
print(type(names[3]))

<class 'list'>
<class 'str'>
<class 'int'>


* So, values in a Python list do not need to have the same type.  This is not the case for a NumPy ndarray...
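
As a rough illustration of that difference, the sketch below builds a mixed list and then passes similar values to `np.array`. The values are made up; the point is only that NumPy converts everything to one common type.

```python
import numpy as np

# A list keeps each element's own type
mixed_list = [1, 2.5, "three"]
print([type(x).__name__ for x in mixed_list])  # ['int', 'float', 'str']

# An ndarray has a single dtype: the int is upcast to float64
ints_and_floats = np.array([1, 2.5])
print(ints_and_floats.dtype)                   # float64

# With a string in the mix, every element becomes a string
with_string = np.array([1, 2.5, "three"])
print(with_string.dtype.kind)                  # U (Unicode string)
```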

In [38]:
help(str)

Help on class str in module builtins:

class str(object)
 |  str(object='') -> str
 |  str(bytes_or_buffer[, encoding[, errors]]) -> str
 |  
 |  Create a new string object from the given object. If encoding or
 |  errors is specified, then the object must expose a data buffer
 |  that will be decoded using the given encoding and error handler.
 |  Otherwise, returns the result of object.__str__() (if defined)
 |  or repr(object).
 |  encoding defaults to sys.getdefaultencoding().
 |  errors defaults to 'strict'.
 |  
 |  Methods defined here:
 |  
 |  __add__(self, value, /)
 |      Return self+value.
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __format__(...)
 |      S.__format__(format_spec) -> str
 |      
 |      Return a formatted version of S as described by format_spec.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getatt

## NumPy and ndarray

* First, we need to import the numpy package in order to use it...

In [2]:
import numpy as np

* "NumPy" is short for "Numerical Python"
* Powerful package that can perform mathematical operations quickly
* Can operate on entire arrays (without writing loops)
* Can integrate with C, C++, and Fortran (why would we want to do this?)
* pandas is built on top of NumPy (which is why we will usually import both packages when working with pandas, although pandas itself only requires that NumPy is installed)
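
To make the "without writing loops" point concrete, here is a small sketch that scales a few made-up batting averages two ways: once with an explicit Python loop and once with a single NumPy expression.

```python
import numpy as np

averages = [0.287, 0.300, 0.251]   # made-up values for illustration

# Plain Python: visit each element explicitly
scaled_loop = [round(a * 1000) for a in averages]

# NumPy: one expression operates on the whole array at once
scaled_vec = np.round(np.array(averages) * 1000).astype(int)

print(scaled_loop)          # [287, 300, 251]
print(scaled_vec.tolist())  # [287, 300, 251]
```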

In [46]:
TripSlash = [[.287,42,103],[.300,24,88]]
TSarr = np.array(TripSlash)
TSarr

array([[   0.287,   42.   ,  103.   ],
       [   0.3  ,   24.   ,   88.   ]])

In [48]:
Temp = [[.287,42,103],[.300,24]]
Temparr = np.array(Temp)
Temparr

array([[0.287, 42, 103], [0.3, 24]], dtype=object)

In [49]:
TSarr*10

array([[    2.87,   420.  ,  1030.  ],
       [    3.  ,   240.  ,   880.  ]])

In [50]:
TSarr-TSarr

array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

In [51]:
Temparr*10

array([ [0.287, 42, 103, 0.287, 42, 103, 0.287, 42, 103, 0.287, 42, 103, 0.287, 42, 103, 0.287, 42, 103, 0.287, 42, 103, 0.287, 42, 103, 0.287, 42, 103, 0.287, 42, 103],
       [0.3, 24, 0.3, 24, 0.3, 24, 0.3, 24, 0.3, 24, 0.3, 24, 0.3, 24, 0.3, 24, 0.3, 24, 0.3, 24]], dtype=object)

In [52]:
print(TSarr.ndim)
print(TSarr.shape)
print(TSarr.dtype)

2
(2, 3)
float64


In [53]:
# NumPy offers lots of functions, including np.place()
np.place(TSarr,TSarr==24,23)
TSarr

array([[   0.287,   42.   ,  103.   ],
       [   0.3  ,   23.   ,   88.   ]])

In [54]:
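# ndarrays also have methods, such as .astype().  The built-in str
# is used here because the np.str alias was removed in NumPy 1.24.
TSarrSTR = TSarr.astype(str)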
print(TSarrSTR.dtype)
print(type(TSarrSTR))
print(TSarrSTR)

<U32
<class 'numpy.ndarray'>
[['0.287' '42.0' '103.0']
 ['0.3' '23.0' '88.0']]


In [55]:
np.place(TSarrSTR,TSarrSTR=="23.0","ERROR")
TSarrSTR

array([['0.287', '42.0', '103.0'],
       ['0.3', 'ERROR', '88.0']], 
      dtype='<U32')

In [56]:
np.place(TSarrSTR,TSarrSTR=="ERROR",23)
TSarrSTR

array([['0.287', '42.0', '103.0'],
       ['0.3', '23', '88.0']], 
      dtype='<U32')

In [57]:
np.place(TSarr,TSarr==23,"ERROR")

ValueError: could not convert string to float: 'ERROR'

Some important notes from the above cells...
* If, for example, we try to generate an ndarray from a list of lists, each of the inner lists must be the same size (otherwise NumPy falls back to a generic "object" array, as in the output above; NumPy 1.24 and later raise a ValueError instead unless dtype=object is passed explicitly)
* Scalar operations are applied to each element of the array (this is NOT the case for a "list of lists")
* Array operations are applied elementwise for arrays of equal shape
* All elements of an ndarray must be of the same data type
* NumPy will attempt to perform implicit type conversion where possible (but it isn't always successful)

Next, let's do a little slicing and dicing...

In [58]:
# How would we select the first triple slash?
TSarr[0]

array([   0.287,   42.   ,  103.   ])

In [59]:
# How would we select everything?
TSarr[:]

array([[   0.287,   42.   ,  103.   ],
       [   0.3  ,   23.   ,   88.   ]])

In [60]:
# How would we select the HRs (i.e. the second value in each list)?
TSarr[:,1]

array([ 42.,  23.])
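
A few more slicing patterns, sketched on a stand-alone copy of the same values (the name `stats` is hypothetical): a single element, a block of columns, and a boolean mask that keeps only some rows.

```python
import numpy as np

stats = np.array([[0.287, 42, 103],
                  [0.300, 23, 88]])

# A single value: row 1, column 2
print(stats[1, 2])               # 88.0

# All rows, columns 1 onward (HR and RBI)
print(stats[:, 1:])

# Boolean mask: keep rows whose HR value (column 1) exceeds 30
print(stats[stats[:, 1] > 30])
```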

## pandas and the DataFrame

* So why not just use NumPy for all of our data analysis?
* pandas provides a lot more functionality and is significantly easier to use
* A pandas DataFrame is similar to a relational database table in that it...
  * Organizes data in a tabular format
  * Requires that each "column" has a specific data type
* However, the DataFrame is significantly different from a relational database table in one key respect:  the DataFrame has both row and column indices (and therefore, the "location" of an individual data value at the intersection of a column and a row has actual meaning)

In [3]:
# It is straightforward to construct a DataFrame from a dictionary
import pandas as pd
baseball_data = {'avg':[.287,.300],
                 'hr':[42,24],
                 'rbi':[103,88]
                }
df = pd.DataFrame(baseball_data)

In [62]:
df

Unnamed: 0,avg,hr,rbi
0,0.287,42,103
1,0.3,24,88


* So, the dictionary keys become the column names and an index is generated for each row with values from 0 to N-1
* To be clear, the column labels will always be referenced with "columns" and the row labels will always be referenced with "index"
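
As a quick sketch of those two attributes, the snippet below builds a small stand-alone DataFrame (named `demo` here only for illustration) and inspects its column labels and row labels.

```python
import pandas as pd

demo = pd.DataFrame({"avg": [0.287, 0.300], "hr": [42, 24]})

print(list(demo.columns))  # ['avg', 'hr']
print(list(demo.index))    # [0, 1]
print(demo.shape)          # (2, 2)
```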

In [64]:
# What will the following do?
df1 = pd.DataFrame(baseball_data, columns=["avg","hr","runs"])

In [65]:
df1

Unnamed: 0,avg,hr,runs
0,0.287,42,
1,0.3,24,


* So, the above code shows that if we're passing a dictionary to the DataFrame constructor, the columns argument selects columns by name rather than renaming them: a name that is not a key in the dictionary (here, "runs") produces a column of missing values.  But, there are a number of ways that we can rename the columns, including using a "list comprehension" below...

In [66]:
df.columns = ['runs' if x=='rbi' else x for x in df.columns]
df

Unnamed: 0,avg,hr,runs
0,0.287,42,103
1,0.3,24,88
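
Another of the "number of ways" is the `rename()` method, which takes a dictionary mapping old names to new names. This is a minimal sketch on a stand-alone copy of the data (`demo` is a hypothetical name).

```python
import pandas as pd

demo = pd.DataFrame({"avg": [0.287, 0.300],
                     "hr": [42, 24],
                     "rbi": [103, 88]})

# rename() returns a new DataFrame; columns not in the mapping are kept as-is
renamed = demo.rename(columns={"rbi": "runs"})

print(list(renamed.columns))  # ['avg', 'hr', 'runs']
print(list(demo.columns))     # ['avg', 'hr', 'rbi']
```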


In [68]:
# We can also change the index...
df.index=[2016,2017]
df
df.loc[2017,"hr"]

24

In [69]:
# And if we don't like our change, we can reset it back
# to the default...
df=df.reset_index(drop=True)
df

Unnamed: 0,avg,hr,runs
0,0.287,42,103
1,0.3,24,88


In [71]:
# Strings even work too!
df.index=["a","b"]
df
df.loc["b","hr"]

24

In [79]:
df.index=["a","a"]
df
df.loc['a',"hr"]

a    42
a    24
Name: hr, dtype: int64

* ***IMPORTANT***:  The DataFrame is not required to have a unique index!
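
If a unique index matters for the analysis, it can be checked with the `is_unique` attribute of the index. A small sketch on made-up data:

```python
import pandas as pd

demo = pd.DataFrame({"hr": [42, 24]}, index=["a", "a"])

print(demo.index.is_unique)                # False
# With a duplicated label, .loc returns every matching row
print(type(demo.loc["a", "hr"]).__name__)  # Series

demo.index = ["a", "b"]
print(demo.index.is_unique)                # True
print(demo.loc["b", "hr"])                 # 24
```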

In [74]:
df.reset_index(drop=True)

Unnamed: 0,avg,hr,runs
0,0.287,42,103
1,0.3,24,88


In [75]:
df.iloc[1,1]

24.0
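
The difference between `.loc` and `.iloc` is that `.loc` looks values up by label, while `.iloc` looks them up by integer position. A minimal sketch with a hypothetical DataFrame indexed by year:

```python
import pandas as pd

demo = pd.DataFrame({"avg": [0.287, 0.300], "hr": [42, 24]},
                    index=[2016, 2017])

# Label-based: row labelled 2017, column labelled "hr"
by_label = demo.loc[2017, "hr"]

# Position-based: second row, second column
by_position = demo.iloc[1, 1]

print(by_label, by_position)   # 24 24
```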

## Importing Data

* It is incredibly straightforward to load data into a DataFrame from a .csv file (there are lots of additional options that you can specify)
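
To give a feel for a few of those options, the sketch below reads made-up CSV text from memory instead of a file. `usecols`, `dtype`, and `nrows` are real `read_csv` parameters; the player IDs and values are invented for illustration.

```python
import io
import pandas as pd

# Made-up CSV content standing in for a file on disk
csv_text = """playerID,yearID,HR
player_a,2014,36
player_b,2014,7
player_c,2014,12
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["playerID", "HR"],   # load only these columns
    dtype={"HR": "int64"},        # set a column's type explicitly
    nrows=2,                      # read only the first two data rows
)
print(sample)
```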

In [4]:
# This is a .csv file posted to the Canvas homepage
df1=pd.read_csv('300AB_14_16.csv')
df1.head()

Unnamed: 0,playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,...,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP,RANK() OVER(PARTITION BY yearID ORDER BY HR DESC)
0,abreujo02,2014,1,CHA,AL,145,556,80,176,35,...,3,1,51,131,15,11,0,4,14,4
1,ackledu01,2014,1,SEA,AL,143,502,64,123,27,...,8,4,32,90,1,3,3,2,10,93
2,adamsma01,2014,1,SLN,NL,142,527,55,152,34,...,3,2,26,114,5,3,0,7,9,84
3,altuvjo01,2014,1,HOU,AL,158,660,85,225,47,...,56,9,36,53,7,5,1,5,20,172
4,alvarpe01,2014,1,PIT,NL,122,398,46,92,13,...,8,3,45,113,6,2,0,0,12,61


In [5]:
df1.columns.size

23

In [6]:
# Here, after checking for the number of columns in the
# DataFrame (in the previous cell), I use this information
# to change the name of the last column.  rename() is used
# because assigning into df1.columns.values is not reliable.
df1 = df1.rename(columns={df1.columns[22]: 'HR_Rank'})

In [7]:
df1.head()

Unnamed: 0,playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,...,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP,HR_Rank
0,abreujo02,2014,1,CHA,AL,145,556,80,176,35,...,3,1,51,131,15,11,0,4,14,4
1,ackledu01,2014,1,SEA,AL,143,502,64,123,27,...,8,4,32,90,1,3,3,2,10,93
2,adamsma01,2014,1,SLN,NL,142,527,55,152,34,...,3,2,26,114,5,3,0,7,9,84
3,altuvjo01,2014,1,HOU,AL,158,660,85,225,47,...,56,9,36,53,7,5,1,5,20,172
4,alvarpe01,2014,1,PIT,NL,122,398,46,92,13,...,8,3,45,113,6,2,0,0,12,61


In [8]:
# How would we select playerID, yearID, and HR for "freemfr01"?
df1.loc[df1['playerID'] == 'freemfr01',['playerID','yearID','HR']]

Unnamed: 0,playerID,yearID,HR
79,freemfr01,2014,18
314,freemfr01,2015,18
550,freemfr01,2016,34


* Above we utilize the .loc indexer of our DataFrame and specify the [ rows , columns ] of interest
* It is also easy to load SQL query results directly into a DataFrame
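
Besides the `%sql` magic used in these notes, pandas can read a query result itself through `pd.read_sql_query`. The sketch below substitutes an in-memory SQLite database for the MySQL server used in class; the `batting` table and its rows are made up for illustration.

```python
import sqlite3
import pandas as pd

# An in-memory SQLite database stands in for the course MySQL server
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE batting (playerID TEXT, yearID INTEGER, HR INTEGER)")
conn.executemany(
    "INSERT INTO batting VALUES (?, ?, ?)",
    [("player_a", 2016, 30), ("player_b", 2016, 5)],
)
conn.commit()

# Run the query and receive the result as a DataFrame, column names included
from_sql = pd.read_sql_query(
    "SELECT playerID, HR FROM batting WHERE HR >= 10", conn
)
conn.close()

print(from_sql)
```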

In [9]:
import pymysql
pymysql.install_as_MySQLdb()
%reload_ext sql
%sql mysql://:@mqm-db/
%sql USE lahman2016;

0 rows affected.


[]

In [13]:
# Store the query result in a local variable...
# NOTE:  I cannot get this to run properly with a %%sql cell magic
result2 = %sql SELECT * FROM Batting WHERE yearID = 2016;
type(result2)

1483 rows affected.


sql.run.ResultSet

In [15]:
# This is the "sql-magic" recommended method for getting
# query results into a pandas DataFrame
df2 = result2.DataFrame()
type(df2)

pandas.core.frame.DataFrame

In [16]:
df2.head()

Unnamed: 0,playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,...,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
0,abadfe01,2016,1,MIN,AL,39,1,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,abadfe01,2016,2,BOS,AL,18,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,abreujo02,2016,1,CHA,AL,159,624,67,183,32,...,100,0,2,47,125,7,15,0,9,21
3,achteaj01,2016,1,LAA,AL,27,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ackledu01,2016,1,NYA,AL,28,61,6,9,0,...,4,0,0,8,9,0,0,0,1,0


In [17]:
# This method also works.  However, it does not load the
# column names.  A likely explanation is that the ResultSet
# behaves like a plain list of rows, so pandas only sees the
# values and falls back to default integer column labels...
df3 = pd.DataFrame(result2)

In [18]:
df3.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,12,13,14,15,16,17,18,19,20,21
0,abadfe01,2016,1,MIN,AL,39,1,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,abadfe01,2016,2,BOS,AL,18,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,abreujo02,2016,1,CHA,AL,159,624,67,183,32,...,100,0,2,47,125,7,15,0,9,21
3,achteaj01,2016,1,LAA,AL,27,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ackledu01,2016,1,NYA,AL,28,61,6,9,0,...,4,0,0,8,9,0,0,0,1,0


* Wow, that was easy!  And guess what?  We can recreate our SQL clauses in pandas too!  Let's work through each clause in turn, and then put everything back together.
* Note:  I am adapting this information from...  https://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html

### SELECT

In [22]:
# Simply include a list of columns that you want to SELECT
df2[['playerID','yearID','HR','RBI']].head()

Unnamed: 0,playerID,yearID,HR,RBI
0,abadfe01,2016,0,0
1,abadfe01,2016,0,0
2,abreujo02,2016,25,100
3,achteaj01,2016,0,0
4,ackledu01,2016,0,4


### WHERE

In [25]:
# Suppose we want to SELECT all of the records with
# HR >= 25 and RBI >= 100.  There are a number of ways to
# filter data in pandas, but we'll walk step-by-step through
# "boolean indexing"
df2[['playerID','yearID','HR','RBI']][(df2['HR'] >= 25) & (df2['RBI'] >= 100)]

Unnamed: 0,playerID,yearID,HR,RBI
2,abreujo02,2016,25,100
51,arenano01,2016,41,133
96,beltrad01,2016,32,104
108,bettsmo01,2016,31,113
156,bryankr01,2016,39,102
180,cabremi01,2016,38,108
193,canoro01,2016,39,103
287,cruzne02,2016,43,105
306,daviskh01,2016,42,102
366,duvalad01,2016,33,103


In [26]:
# How does this work?  Each condition generates a set of 
# True/False values
df2['HR'] >= 25

0       False
1       False
2        True
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
1453    False
1454    False
1455    False
1456    False
1457    False
1458    False
1459    False
1460    False
1461    False
1462    False
1463    False
1464    False
1465    False
1466    False
1467    False
1468    False
1469    False
1470    False
1471    False
1472    False
1473    False
1474    False
1475    False
1476    False
1477    False
1478    False
1479    False
1480    False
1481    False
1482    False
Name: HR, Length: 1483, dtype: bool

In [27]:
# Here is the second condition...
df2['RBI'] >= 100

0       False
1       False
2        True
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
1453    False
1454    False
1455    False
1456    False
1457    False
1458    False
1459    False
1460    False
1461    False
1462    False
1463    False
1464    False
1465    False
1466    False
1467    False
1468    False
1469    False
1470    False
1471    False
1472    False
1473    False
1474    False
1475    False
1476    False
1477    False
1478    False
1479    False
1480    False
1481    False
1482    False
Name: RBI, Length: 1483, dtype: bool

In [28]:
# Next, the "&" combines the two sets element by element (a logical AND)...
(df2['HR'] >= 25) & (df2['RBI'] >= 100)

0       False
1       False
2        True
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
1453    False
1454    False
1455    False
1456    False
1457    False
1458    False
1459    False
1460    False
1461    False
1462    False
1463    False
1464    False
1465    False
1466    False
1467    False
1468    False
1469    False
1470    False
1471    False
1472    False
1473    False
1474    False
1475    False
1476    False
1477    False
1478    False
1479    False
1480    False
1481    False
1482    False
Length: 1483, dtype: bool

In [29]:
# Finally, this set of boolean values is utilized as a filter
# on the rows in the dataframe.  Any record with a value of
# "False" is filtered out!
df2[['playerID','yearID','HR','RBI']][(df2['HR'] >= 25) & (df2['RBI'] >= 100)]

Unnamed: 0,playerID,yearID,HR,RBI
2,abreujo02,2016,25,100
51,arenano01,2016,41,133
96,beltrad01,2016,32,104
108,bettsmo01,2016,31,113
156,bryankr01,2016,39,102
180,cabremi01,2016,38,108
193,canoro01,2016,39,103
287,cruzne02,2016,43,105
306,daviskh01,2016,42,102
366,duvalad01,2016,33,103
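
The same WHERE can be written in a single indexing step with `.loc`, or with the `query()` method. This is a sketch on a tiny made-up table (`batting` and its values are hypothetical), not a claim about which form the course prefers.

```python
import pandas as pd

batting = pd.DataFrame({
    "playerID": ["p1", "p2", "p3"],
    "HR": [25, 41, 10],
    "RBI": [100, 90, 120],
})

# Build the boolean mask once, then select rows and columns together
mask = (batting["HR"] >= 25) & (batting["RBI"] >= 100)
via_loc = batting.loc[mask, ["playerID", "HR", "RBI"]]

# query() accepts the condition as a string
via_query = batting.query("HR >= 25 and RBI >= 100")[["playerID", "HR", "RBI"]]

print(via_loc)
```

Only `p1` satisfies both conditions, so both forms return that single row.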


### GROUP BY

In [30]:
# While the syntax for GROUP BY is very straightforward,
# the output can be a little problematic.  You'll want to
# remember to use the size() aggregate function instead of
# count(): size() counts every row in a group, while count()
# skips missing values.
# Suppose we want to GROUP BY "teamID" and calculate the
# number of records and sum of home runs for each group... 
gb=df2.groupby(['teamID']).agg({'teamID': [np.size],'HR': [np.size, np.sum]})
gb

Unnamed: 0_level_0,teamID,HR,HR
Unnamed: 0_level_1,size,size,sum
teamID,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
ARI,50,50,190
ATL,60,60,122
BAL,47,47,253
BOS,50,50,208
CHA,50,50,168
CHN,45,45,199
CIN,52,52,164
CLE,49,49,185
COL,47,47,204
DET,44,44,211


* So, we pass a list of the columns that we want to group on to the ***groupby()*** method and a dictionary (keys=columns, values=list of aggregate functions) to the ***agg()*** method.
* This looks great, and creates a new dataframe.  The issue is that we're now dealing with a hierarchical index...
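
One way to sidestep the hierarchical columns is "named aggregation", where each output column gets a name and a (column, function) pair. The sketch below uses a small made-up table with one missing HR value, which also shows how `size` and `count` differ.

```python
import numpy as np
import pandas as pd

batting = pd.DataFrame({
    "teamID": ["AAA", "AAA", "BBB"],
    "HR": [10.0, np.nan, 7.0],   # one missing value on purpose
})

summary = (
    batting.groupby("teamID")
    .agg(
        n_rows=("HR", "size"),    # counts every row in the group
        n_hr=("HR", "count"),     # counts only non-missing values
        hr_sum=("HR", "sum"),
    )
    .reset_index()                # turn teamID back into a regular column
)
print(summary)
```

The result has flat column names (`teamID`, `n_rows`, `n_hr`, `hr_sum`), so it can be sorted with an ordinary `sort_values("hr_sum")`.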

In [31]:
type(gb)

pandas.core.frame.DataFrame

In [32]:
gb.columns

MultiIndex(levels=[['teamID', 'HR'], ['size', 'sum']],
           labels=[[0, 1, 1], [0, 0, 1]])

In [33]:
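# Unfortunately, the following command will not work.  Note that
# the AttributeError shown below comes from the misspelled method
# name (sort_value); the correct method is sort_values(), and with
# hierarchical columns it needs the full column tuple, as in the
# next cell.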
gb.sort_value(['teamID'])

AttributeError: 'DataFrame' object has no attribute 'sort_value'

In [34]:
# To sort on a hierarchical column, we need to include all
# levels of the column in a tuple, within a list (yes, I
# know this is a little complex)
gb.sort_values([('teamID','size')],ascending=False)

Unnamed: 0_level_0,teamID,HR,HR
Unnamed: 0_level_1,size,size,sum
teamID,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
ATL,60,60,122
SDN,58,58,177
PIT,55,55,153
LAN,55,55,189
SEA,54,54,223
LAA,53,53,156
NYA,53,53,183
MIA,53,53,128
TEX,52,52,215
CIN,52,52,164


In [35]:
# Notice that our grouping column has been turned into an
# index!  To sort on the index we need to use a slightly
# different function--sort_index()
gb.sort_index()

Unnamed: 0_level_0,teamID,HR,HR
Unnamed: 0_level_1,size,size,sum
teamID,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
ARI,50,50,190
ATL,60,60,122
BAL,47,47,253
BOS,50,50,208
CHA,50,50,168
CHN,45,45,199
CIN,52,52,164
CLE,49,49,185
COL,47,47,204
DET,44,44,211


In [36]:
# And, if we want to revert the index back to the way it was,
# we can do the following...
gb=gb.reset_index()
gb

Unnamed: 0_level_0,teamID,teamID,HR,HR
Unnamed: 0_level_1,Unnamed: 1_level_1,size,size,sum
0,ARI,50,50,190
1,ATL,60,60,122
2,BAL,47,47,253
3,BOS,50,50,208
4,CHA,50,50,168
5,CHN,45,45,199
6,CIN,52,52,164
7,CLE,49,49,185
8,COL,47,47,204
9,DET,44,44,211


In [37]:
# Now, to sort on our original teamID...
gb.sort_values([('teamID','')],ascending=False)

Unnamed: 0_level_0,teamID,teamID,HR,HR
Unnamed: 0_level_1,Unnamed: 1_level_1,size,size,sum
29,WAS,43,43,203
28,TOR,49,49,221
27,TEX,52,52,215
26,TBA,48,48,216
25,SLN,41,41,225
24,SFN,45,45,130
23,SEA,54,54,223
22,SDN,58,58,177
21,PIT,55,55,153
20,PHI,49,49,161


In [38]:
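# Flatten the hierarchical column labels by joining the levels
# of each column into a single string...
gb.columns = gb.columns.map(''.join)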
gb

Unnamed: 0,teamID,teamIDsize,HRsize,HRsum
0,ARI,50,50,190
1,ATL,60,60,122
2,BAL,47,47,253
3,BOS,50,50,208
4,CHA,50,50,168
5,CHN,45,45,199
6,CIN,52,52,164
7,CLE,49,49,185
8,COL,47,47,204
9,DET,44,44,211
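
Joining the levels with an empty string yields names such as `teamIDsize`. A possible variation, sketched below on a hand-built MultiIndex that mirrors the columns above, joins the levels with an underscore and skips empty levels.

```python
import pandas as pd

# Hand-built column labels mirroring the grouped result after reset_index()
cols = pd.MultiIndex.from_tuples(
    [("teamID", ""), ("teamID", "size"), ("HR", "size"), ("HR", "sum")]
)

# Join the non-empty levels of each column with an underscore
flat = ["_".join(part for part in col if part) for col in cols]

print(flat)  # ['teamID', 'teamID_size', 'HR_size', 'HR_sum']
```

The resulting list can be assigned to the `columns` attribute of the grouped DataFrame.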


#### HW\#6, Question \#3
* So, let's try to put all of this together and solve the homework problem.

In [40]:
import pymysql
pymysql.install_as_MySQLdb()
%reload_ext sql
%sql mysql://:@mqm-db/
%sql USE sanford;

0 rows affected.


[]

In [41]:
result3 = %sql SELECT * FROM health;

150822 rows affected.


In [42]:
q3 = result3.DataFrame()
len(q3)

150822

In [43]:
q3.head()

Unnamed: 0,id,sex,age,status,hypertension,vasc_disease,payor,diabetes,a1c,bmi,visits_sched,visits_miss,dbp,sbp,smoke
0,2,Male,82,Alive,0,0,Medicare,0,,25.53,4,0,72,139,4
1,8,Female,50,Alive,0,0,Private Ins/Other,0,,31.34,3,0,91,150,5
2,9,Female,60,Alive,1,0,Private Ins/Other,1,6.4,30.85,8,0,78,122,5
3,11,Male,59,Alive,1,0,Private Ins/Other,1,8.0,32.36,5,0,79,116,4
4,12,Male,65,Alive,1,0,Private Ins/Other,0,,34.52,7,1,82,117,4


In [44]:
q3.dtypes

id               int64
sex             object
age             object
status          object
hypertension     int64
vasc_disease     int64
payor           object
diabetes         int64
a1c             object
bmi             object
visits_sched    object
visits_miss     object
dbp              int64
sbp              int64
smoke            int64
dtype: object

In [45]:
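# From HW#1 we remember that SQL was converting the "90+"
# age to 90.  So, we'll do that here and then change the
# age field to a number.  (errors='ignore' leaves columns
# that cannot be converted unchanged; note that this option
# is deprecated as of pandas 2.2.)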
q3['age']=q3['age'].replace({"90+":"90"})
q3=q3.apply(pd.to_numeric, errors='ignore')
q3.dtypes

id                int64
sex              object
age               int64
status           object
hypertension      int64
vasc_disease      int64
payor            object
diabetes          int64
a1c             float64
bmi             float64
visits_sched     object
visits_miss     float64
dbp               int64
sbp               int64
smoke             int64
dtype: object

In [46]:
# Here we first do the WHERE, then the GROUP BY and aggregations
q3=q3[(q3['status'] == 'Alive') & (q3['sex'] != 'Unknown')].groupby(['sex','payor','hypertension']).agg({'age': [np.size,np.mean]})

In [47]:
# Here we'll do the rounding and the HAVING clause
q3=q3.round(2)
q3=q3[q3['age']['size']>=2000]
q3

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,age,age
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,size,mean
sex,payor,hypertension,Unnamed: 3_level_2,Unnamed: 4_level_2
Female,Medicare,0,18843,79.19
Female,Medicare,1,24507,73.32
Female,Private Ins/Other,0,11051,52.35
Female,Private Ins/Other,1,16699,55.68
Male,Medicare,0,13229,76.08
Male,Medicare,1,20084,72.15
Male,Private Ins/Other,0,14173,51.99
Male,Private Ins/Other,1,22366,54.32


In [48]:
# Next let's cleanup the output and then sort...
q3=q3.reset_index()
q3.columns=q3.columns.map(''.join)
q3=q3.sort_values(['sex','payor','hypertension'])

In [49]:
q3

Unnamed: 0,sex,payor,hypertension,agesize,agemean
0,Female,Medicare,0,18843,79.19
1,Female,Medicare,1,24507,73.32
2,Female,Private Ins/Other,0,11051,52.35
3,Female,Private Ins/Other,1,16699,55.68
4,Male,Medicare,0,13229,76.08
5,Male,Medicare,1,20084,72.15
6,Male,Private Ins/Other,0,14173,51.99
7,Male,Private Ins/Other,1,22366,54.32
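
The WHERE, GROUP BY, and HAVING steps above can be condensed into one chain. The sketch below uses a tiny made-up table (`patients`) and a threshold of 2 instead of 2000, and it swaps in named aggregation so that the result has flat column names.

```python
import pandas as pd

patients = pd.DataFrame({
    "status": ["Alive", "Alive", "Alive", "Deceased"],
    "sex": ["Female", "Female", "Male", "Male"],
    "age": [50, 60, 70, 80],
})

grouped = (
    patients[patients["status"] == "Alive"]             # WHERE
    .groupby("sex")                                     # GROUP BY
    .agg(n=("age", "size"), mean_age=("age", "mean"))   # aggregates
    .round(2)
)

# HAVING: filter on an aggregated column after grouping
result = grouped[grouped["n"] >= 2].reset_index()
print(result)
```

Only the Female group has at least two rows, so it is the only group that survives the HAVING step.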


### JOIN
* Finally, let's examine how to JOIN dataframes together.  Both the ***join()*** and ***merge()*** methods are available for this task (I tend to favor ***merge()***).
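
As a sketch of how `merge()` maps onto SQL join types, the example below joins two small made-up tables. `how` selects the join type, and `indicator=True` adds a `_merge` column that records where each row came from.

```python
import pandas as pd

batting = pd.DataFrame({"playerID": ["p1", "p2"], "HR": [30, 5]})
people = pd.DataFrame({"playerID": ["p1", "p3"], "birthYear": [1985, 1990]})

# INNER JOIN: only players present in both tables
inner = batting.merge(people, on="playerID", how="inner")

# LEFT JOIN: keep every batting row; unmatched rows get missing values
left = batting.merge(people, on="playerID", how="left", indicator=True)

print(inner)
print(left)
```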

In [50]:
# Let's pull data from the Master table for all players
# born after 1970...
%sql USE lahman2016;
result4 = %sql SELECT * FROM Master WHERE birthYear > 1970;

0 rows affected.
4528 rows affected.


In [51]:
df4 = result4.DataFrame()

In [52]:
df4.head()

Unnamed: 0,playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,...,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID
0,aardsda01,1981,12,27,USA,CO,Denver,,,,...,Aardsma,David Allan,215,75,R,R,2004-04-06,2015-08-23,aardd001,aardsda01
1,abadan01,1972,8,25,USA,FL,Palm Beach,,,,...,Abad,Fausto Andres,184,73,L,L,2001-09-10,2006-04-13,abada001,abadan01
2,abadfe01,1985,12,17,D.R.,La Romana,La Romana,,,,...,Abad,Fernando Antonio,220,73,L,L,2010-07-28,2016-09-25,abadf001,abadfe01
3,abbotje01,1972,8,17,USA,GA,Atlanta,,,,...,Abbott,Jeffrey William,190,74,R,L,1997-06-10,2001-09-29,abboj002,abbotje01
4,abercre01,1980,7,15,USA,GA,Columbus,,,,...,Abercrombie,Reginald Damascus,215,75,R,R,2006-04-04,2008-09-28,aberr001,abercre01


In [53]:
df2.head()

Unnamed: 0,playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,...,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
0,abadfe01,2016,1,MIN,AL,39,1,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,abadfe01,2016,2,BOS,AL,18,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,abreujo02,2016,1,CHA,AL,159,624,67,183,32,...,100,0,2,47,125,7,15,0,9,21
3,achteaj01,2016,1,LAA,AL,27,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ackledu01,2016,1,NYA,AL,28,61,6,9,0,...,4,0,0,8,9,0,0,0,1,0


In [54]:
df4=df4[['birthYear','playerID']]

In [55]:
df4.head()

Unnamed: 0,birthYear,playerID
0,1981,aardsda01
1,1972,abadan01
2,1985,abadfe01
3,1972,abbotje01
4,1980,abercre01


In [56]:
# Let's try to merge birthYear into the Batting data...
merged=df2.merge(df4, on='playerID', how='inner')

In [57]:
merged[['playerID','yearID','birthYear']].head()

Unnamed: 0,playerID,yearID,birthYear
0,abadfe01,2016,1985
1,abadfe01,2016,1985
2,abreujo02,2016,1987
3,achteaj01,2016,1988
4,ackledu01,2016,1988


In [58]:
merged.dtypes

playerID     object
yearID        int64
stint         int64
teamID       object
lgID         object
G             int64
AB            int64
R             int64
H             int64
2B            int64
3B            int64
HR            int64
RBI           int64
SB            int64
CS            int64
BB            int64
SO            int64
IBB          object
HBP          object
SH           object
SF           object
GIDP         object
birthYear     int64
dtype: object

In [59]:
# Estimate each player's age during the season as the season
# year minus the birth year...
merged['age_est']=merged['yearID']-merged['birthYear']

In [60]:
merged.columns

Index(['playerID', 'yearID', 'stint', 'teamID', 'lgID', 'G', 'AB', 'R', 'H',
       '2B', '3B', 'HR', 'RBI', 'SB', 'CS', 'BB', 'SO', 'IBB', 'HBP', 'SH',
       'SF', 'GIDP', 'birthYear', 'age_est'],
      dtype='object')

In [61]:
merged.head()

Unnamed: 0,playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,...,CS,BB,SO,IBB,HBP,SH,SF,GIDP,birthYear,age_est
0,abadfe01,2016,1,MIN,AL,39,1,0,0,0,...,0,0,1,0,0,0,0,0,1985,31
1,abadfe01,2016,2,BOS,AL,18,0,0,0,0,...,0,0,0,0,0,0,0,0,1985,31
2,abreujo02,2016,1,CHA,AL,159,624,67,183,32,...,2,47,125,7,15,0,9,21,1987,29
3,achteaj01,2016,1,LAA,AL,27,0,0,0,0,...,0,0,0,0,0,0,0,0,1988,28
4,ackledu01,2016,1,NYA,AL,28,61,6,9,0,...,0,8,9,0,0,0,1,0,1988,28
