<h1>Pandas</h1>

<li>Integrated data manipulation and analysis capabilities
<li>Integration with data visualization libraries
<li>Integration with machine learning libraries
<li>Built-in time-series capabilities (Pandas was originally designed for financial time series data)
<li>Optimized for speed
<li>Built-in support for reading data from multiple sources (csv, xls, html tables); remote sources such as Yahoo, World Bank, and FRED are reached through the companion pandas-datareader package
<li>Integrated data manipulation support (messy data, missing data, feature construction)
<li><b>End to end support for data manipulation, data visualization, data analysis, and presenting results</b>

<h3>Getting pandas</h3>
<li>Included with anaconda python</li>
<li>But upgrade to the latest version</li>

In [1]:
import numpy as np
np.__version__

'1.26.0'

In [2]:
import pandas as pd
pd.__version__ #2.1.0

'2.1.0'

<h4>Uncomment the following box and run to ensure all libraries are updated</h4>
<li><b>Important</b>: Restart kernel after upgrading a library</li>

In [3]:
# !pip install pandas --upgrade
# !pip install numpy --upgrade
# !pip install numexpr --upgrade
# !pip install bottleneck --upgrade
# !pip install matplotlib --upgrade

<h3>Imports</h3>
<li>pandas and numpy</li>
<li>also matplotlib</li>

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

<h3>Documentation</h3>
<li><a href="https://pandas.pydata.org/pandas-docs/stable/">pandas</a></li>
<li><a href="https://matplotlib.org/users/index.html">matplotlib</a></li>

<h3>Pandas organizes data into two data objects</h3>
<li>Series: A one dimensional array object
<li>DataFrame: A two dimensional <b>table</b> object
<ul>
    <li>Each column in a dataframe corresponds to a named series</li>
    <li>Rows in a dataframe are <b>indexed</b></li>
    <li>Data can be accessed using row data index values and column data names (rather than integer indexes)</li>
    
</ul>
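A minimal sketch of the two objects, using made-up data. The names `prices` and `table` are illustrative and are not used elsewhere in this notebook.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels
prices = pd.Series([10.5, 11.0, 9.75], index=["Mon", "Tue", "Wed"], name="price")

# A DataFrame: a table whose columns are Series sharing one row index
table = pd.DataFrame({"price": prices, "volume": [100, 150, 90]})

print(type(table["price"]))       # each column comes back as a Series
print(table.loc["Tue", "price"])  # access by row label and column name
```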

<h1>DataFrames</h1>
<li><b>The most important data structure in pandas!</b></li>
<li>2-Dimensional structure
<li>Columns can contain data of different types (like an Excel spreadsheet)
<li>Can contain an index (or indices)
<li>Columns (and indices) can be named


<h3>Constructing a dataframe</h3>
<li>Note that pandas automatically gives the data an index and names columns</li>
<li>But, of course, you can specify them as well</li>

In [5]:
df = pd.DataFrame([["Rao","Tucson",33422.12,27127.22],
                   ["Montalbano","Chicago",45233.27,41322.13],
                   ["Zhang","Miami",36234.22,39123.45],
                   ["Brown","New York",57322.83,41486.28],
                   ["Achebe","Los Angeles",23490.81,22540.36]])

df

Unnamed: 0,0,1,2,3
0,Rao,Tucson,33422.12,27127.22
1,Montalbano,Chicago,45233.27,41322.13
2,Zhang,Miami,36234.22,39123.45
3,Brown,New York,57322.83,41486.28
4,Achebe,Los Angeles,23490.81,22540.36


In [6]:
df = pd.DataFrame([["Rao","Tucson",33422.12,27127.22],
                   ["Montalbano","Chicago",45233.27,41322.13],
                   ["Zhang","Miami",36234.22,39123.45],
                   ["Brown","New York",57322.83,41486.28],
                   ["Achebe","Los Angeles",23490.81,22540.36]],
                  columns=["Manager","City","Revenue","Expenses"],
                 index=["SouthWest","Central","South","East","West"])

df

Unnamed: 0,Manager,City,Revenue,Expenses
SouthWest,Rao,Tucson,33422.12,27127.22
Central,Montalbano,Chicago,45233.27,41322.13
South,Zhang,Miami,36234.22,39123.45
East,Brown,New York,57322.83,41486.28
West,Achebe,Los Angeles,23490.81,22540.36


<h4>Pandas dataframes work like dictionaries</h4>
<li>Column names can be used to access columnar data</li>
<li>The accessed column is returned as a <b>pandas series</b></li>

In [7]:
type(df)

pandas.core.frame.DataFrame

In [8]:
type(df["Revenue"])

pandas.core.series.Series

In [9]:
#OR
revenue = df.Revenue
revenue

SouthWest    33422.12
Central      45233.27
South        36234.22
East         57322.83
West         23490.81
Name: Revenue, dtype: float64

In [10]:
revenue.sum()

195703.25

<h3>Series are indexed</h3>
<li>Every series contains an index and the values associated with each index item
<li>Series items can be accessed:
    <ol>
        <li>using the row number in a list or numpy like way</li>
        <li>using the row number with the iloc attribute</li>
        <li>using the index to get at the corresponding value</li>
    </ol>
<li>Think of a Series as a sequence of (key,value) pairs. The key is the index.
<li>Index labels must be hashable (immutable values such as strings, numbers, or tuples); they do not have to be unique
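A small sketch of position-based and label-based access on a toy Series (the data and the name `s` are made up for illustration). It also shows what happens when an index label is repeated.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "a"])  # label "a" appears twice

first_by_position = s.iloc[0]   # position-based: always a single value
by_unique_label = s.loc["b"]    # unique label: a single value
by_repeated_label = s.loc["a"]  # repeated label: a Series of every match

print(first_by_position, by_unique_label)
print(by_repeated_label)
```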

In [11]:
type(df["Revenue"])

pandas.core.series.Series

In [12]:
print(revenue["West"])
print(revenue.iloc[4]) # not good practice to access by row index, instead use the row name


23490.81
23490.81


<h3>Dataframes: Selecting rows</h3>
<li>rows can be selected using the index df.loc[index_value]
<li>or using row number df.iloc[row_number]
<li>Note that both methods use square-bracket (dictionary-like) indexing, not function-call parentheses!

In [13]:
df

Unnamed: 0,Manager,City,Revenue,Expenses
SouthWest,Rao,Tucson,33422.12,27127.22
Central,Montalbano,Chicago,45233.27,41322.13
South,Zhang,Miami,36234.22,39123.45
East,Brown,New York,57322.83,41486.28
West,Achebe,Los Angeles,23490.81,22540.36


In [14]:
#df['East'] #WILL NOT WORK!! df[...] looks up column names, and 'East' is a row label
df.loc['East'] #This will work, we are accessing all of the values in the row labeled 'East'


Manager        Brown
City        New York
Revenue     57322.83
Expenses    41486.28
Name: East, dtype: object

In [15]:
x= df.loc["East"]

In [16]:
x["City"]

'New York'
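The sketch below puts label-based and position-based row selection side by side. It rebuilds a cut-down version of the regional dataframe under the hypothetical name `regions` so that it runs on its own.

```python
import pandas as pd

regions = pd.DataFrame(
    {"Manager": ["Rao", "Montalbano", "Brown"],
     "City": ["Tucson", "Chicago", "New York"]},
    index=["SouthWest", "Central", "East"],
)

by_label = regions.loc["East"]              # row selected by index label
by_position = regions.iloc[2]               # same row selected by position
several = regions.loc[["East", "Central"]]  # a list of labels returns a DataFrame

print(by_label.equals(by_position))
print(several)
```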

<h3>Dataframes: Selecting cells</h3>
<li>rows can be selected using the index df.loc[index_value]
<li>Columns can be selected using dictionary style indexing
<li>Use a combination of the two to zero in on an individual cell
<li>Or extract a series and use the series index to grab a value

In [17]:
df

Unnamed: 0,Manager,City,Revenue,Expenses
SouthWest,Rao,Tucson,33422.12,27127.22
Central,Montalbano,Chicago,45233.27,41322.13
South,Zhang,Miami,36234.22,39123.45
East,Brown,New York,57322.83,41486.28
West,Achebe,Los Angeles,23490.81,22540.36


In [18]:
# the following two cells are equivalent
df.loc['East']['City']
# df['City']['East']

'New York'

In [19]:
df['City']['East']

'New York'
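Chained lookups such as `df.loc['East']['City']` work for reading, but they take two steps. As a sketch, the same cell can be reached in one call by passing the row label and the column name together, which is also the safer form when assigning. The `regions` dataframe here is a cut-down stand-in for the one above.

```python
import pandas as pd

regions = pd.DataFrame(
    {"City": ["Tucson", "New York"], "Revenue": [33422.12, 57322.83]},
    index=["SouthWest", "East"],
)

city = regions.loc["East", "City"]       # row label and column name in one call
revenue = regions.at["East", "Revenue"]  # .at is a scalar-only accessor

# Assignment through a single .loc call updates the dataframe itself
regions.loc["East", "Revenue"] = 60000.0
print(city, revenue, regions.at["East", "Revenue"])
```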

<h2>Add a new column to a DataFrame</h2>
<li>Similar to adding a key to a dictionary</li>
<li>A df is mutable so assignment works as expected</li>
<li><b>Columns can also be created as a result of an elementwise operation</b></li>

In [20]:
df["Profit"] = df["Revenue"] - df["Expenses"] # create a new column 
df

Unnamed: 0,Manager,City,Revenue,Expenses,Profit
SouthWest,Rao,Tucson,33422.12,27127.22,6294.9
Central,Montalbano,Chicago,45233.27,41322.13,3911.14
South,Zhang,Miami,36234.22,39123.45,-2889.23
East,Brown,New York,57322.83,41486.28,15836.55
West,Achebe,Los Angeles,23490.81,22540.36,950.45


<h4>Selecting multiple columns</h4>
<li>Use a <b>list</b> containing the names of the desired columns</li>
<li>A copy of the dataframe is returned</li>
<li>Changes will not be reflected in the original dataframe</li>

In [21]:
df[['Revenue','Expenses']] # select multiple columns

Unnamed: 0,Revenue,Expenses
SouthWest,33422.12,27127.22
Central,45233.27,41322.13
South,36234.22,39123.45
East,57322.83,41486.28
West,23490.81,22540.36
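To illustrate the copy behaviour, the sketch below changes a value in a column subset and then checks the original. The explicit `.copy()` is an addition for clarity rather than something the notebook uses, and `regions` is a two-row stand-in for the dataframe above.

```python
import pandas as pd

regions = pd.DataFrame(
    {"Revenue": [33422.12, 45233.27], "Expenses": [27127.22, 41322.13]},
    index=["SouthWest", "Central"],
)

subset = regions[["Revenue", "Expenses"]].copy()  # explicit copy makes the intent clear
subset.loc["Central", "Revenue"] = 0.0            # modify the subset only

print(regions.loc["Central", "Revenue"])  # the original value is untouched
```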


<h4>Creating a new column using np.where</h4>
<li>Identify regions where the profit is positive</li>
<li>np.where takes a condition, the value to use where the condition is true, and the value to use where it is false</li>
<li>Assign the result to a new column name to add a Column</li>

In [22]:
df

Unnamed: 0,Manager,City,Revenue,Expenses,Profit
SouthWest,Rao,Tucson,33422.12,27127.22,6294.9
Central,Montalbano,Chicago,45233.27,41322.13,3911.14
South,Zhang,Miami,36234.22,39123.45,-2889.23
East,Brown,New York,57322.83,41486.28,15836.55
West,Achebe,Los Angeles,23490.81,22540.36,950.45


In [23]:
np.where(df["Profit"]>=0,"Profit","Loss") # conditional, if True, if False

array(['Profit', 'Profit', 'Loss', 'Profit', 'Profit'], dtype='<U6')

In [24]:
df["P/L Flag"] = np.where(df["Profit"]>=0,"Profit","Loss")
df

Unnamed: 0,Manager,City,Revenue,Expenses,Profit,P/L Flag
SouthWest,Rao,Tucson,33422.12,27127.22,6294.9,Profit
Central,Montalbano,Chicago,45233.27,41322.13,3911.14,Profit
South,Zhang,Miami,36234.22,39123.45,-2889.23,Loss
East,Brown,New York,57322.83,41486.28,15836.55,Profit
West,Achebe,Los Angeles,23490.81,22540.36,950.45,Profit


<h1>Grouping data in Pandas</h1>
<li>Pandas allows grouping by value as well as grouping by functions</li>
<li>The <span style="color:blue">groupby</span> function specifies the parameters for grouping data</li>
<li>groupby returns a groupby object that, roughly speaking, contains the groupings</li>
<li>grouping parameters can be functions or columns</li>

<li>Create a sample dataframe</li>
<li>note the use of <span style="color:blue">transpose</span>; <span style="color:blue">set_index</span>; and <span style="color:blue">columns</span>

In [2]:
import pandas as pd
import numpy as np
emp_id = np.array([100,101,102,103,104,105,106,107,108,109,110,111])
names = np.array(['Bill','Ludovica','Qing','Savitri','Giovanni',"Birgit",
                  "Bercù","Elodie","Gurumul","Kwame","Rosa","João"])
bonus = np.array([232300.56,478123.45,3891.24,98012.36,52123.50,0,
                  321000.23,37345.22,121200,59621.33,94123.5,45123.2])
department = np.array(['1','2','1','2','1','1','1','2',"1","2","1","1"])
city = np.array(["New York","Catania","Paris","New York","Sydney","Sydney",
                 "Paris","New York","Sydney","Paris","New York","Paris"])
salary = np.array([455000,722321,95223,135000,132033,700000,832123,
                   78123.11,13243.32,456122.17,912321.22,31123])
columns=["name","department","city","salary","bonus"]


df = pd.DataFrame([names,department,city,salary,bonus]).transpose()#.set_index(emp_id)


df.columns = columns

df['salary'] =df['salary'].astype('float64')
df['bonus'] =df['bonus'].astype('float64')


df


Unnamed: 0,name,department,city,salary,bonus
0,Bill,1,New York,455000.0,232300.56
1,Ludovica,2,Catania,722321.0,478123.45
2,Qing,1,Paris,95223.0,3891.24
3,Savitri,2,New York,135000.0,98012.36
4,Giovanni,1,Sydney,132033.0,52123.5
5,Birgit,1,Sydney,700000.0,0.0
6,Bercù,1,Paris,832123.0,321000.23
7,Elodie,2,New York,78123.11,37345.22
8,Gurumul,1,Sydney,13243.32,121200.0
9,Kwame,2,Paris,456122.17,59621.33
10,Rosa,1,New York,912321.22,94123.5
11,João,1,Paris,31123.0,45123.2


In [3]:
# add row names
df = pd.DataFrame([names,department,city,salary,bonus]).transpose().set_index(emp_id)
df.columns = columns
df['salary'] =df['salary'].astype('float64')
df['bonus'] =df['bonus'].astype('float64')

df

Unnamed: 0,name,department,city,salary,bonus
100,Bill,1,New York,455000.0,232300.56
101,Ludovica,2,Catania,722321.0,478123.45
102,Qing,1,Paris,95223.0,3891.24
103,Savitri,2,New York,135000.0,98012.36
104,Giovanni,1,Sydney,132033.0,52123.5
105,Birgit,1,Sydney,700000.0,0.0
106,Bercù,1,Paris,832123.0,321000.23
107,Elodie,2,New York,78123.11,37345.22
108,Gurumul,1,Sydney,13243.32,121200.0
109,Kwame,2,Paris,456122.17,59621.33
110,Rosa,1,New York,912321.22,94123.5
111,João,1,Paris,31123.0,45123.2


In [4]:
df.info()
# The main takeaway from info() is the non-null count per column: it shows whether any data is missing.
# Missing data points could distort later analysis of the data

<class 'pandas.core.frame.DataFrame'>
Index: 12 entries, 100 to 111
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   name        12 non-null     object 
 1   department  12 non-null     object 
 2   city        12 non-null     object 
 3   salary      12 non-null     float64
 4   bonus       12 non-null     float64
dtypes: float64(2), object(3)
memory usage: 576.0+ bytes
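The transpose-based construction above stacks rows of mixed types, so every column starts out as `object`, which is why the `astype` calls are needed. One alternative, sketched here with only the first three employees and the hypothetical name `staff`, is to build the dataframe from a dictionary of columns so that pandas infers a type per column.

```python
import numpy as np
import pandas as pd

emp_id = np.array([100, 101, 102])
staff = pd.DataFrame(
    {
        "name": ["Bill", "Ludovica", "Qing"],
        "department": ["1", "2", "1"],
        "salary": [455000, 722321, 95223],
        "bonus": [232300.56, 478123.45, 3891.24],
    },
    index=emp_id,  # row labels supplied directly, no set_index needed
)
print(staff.dtypes)  # numeric columns keep numeric dtypes
```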


<h2>group the data by department </h2>
<li>Notice the use of elementwise operations as in numpy arrays</li>
<li><span style="font-size:large">for loops = BAD PRACTICE!</span></li>
<li>Then apply aggregate functions (size, mean, sum) on the groups</li>

In [5]:
department_groups = df.groupby('department') #Returns a groupby object that contains groupings
department_groups

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fd1e0493a30>

In [6]:
department_groups.size()

department
1    8
2    4
dtype: int64

<li>Which department is better paid </li>


In [7]:
department_groups.mean(numeric_only=True)

department,salary,bonus
1,396383.3175,108720.27875
2,347891.57,168275.59


In [8]:
department_groups['salary'].mean() + department_groups['bonus'].mean()

department
1    505103.59625
2    516167.16000
dtype: float64

<li>Which department makes more in bonus as a pct of their compensation?</li>

In [9]:
department_groups['bonus'].mean()

department
1    108720.27875
2    168275.59000
Name: bonus, dtype: float64

In [10]:

department_groups['bonus'].mean() / department_groups['salary'].mean()

department
1    0.274281
2    0.483701
dtype: float64
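The ratio above divides the mean bonus by the mean salary. If "compensation" is read as salary plus bonus, the bonus share could instead be computed from group totals, as in this sketch. The numbers and the name `pay` are made up for illustration.

```python
import pandas as pd

pay = pd.DataFrame(
    {"department": ["1", "1", "2", "2"],
     "salary": [100.0, 300.0, 200.0, 200.0],
     "bonus": [50.0, 50.0, 100.0, 300.0]}
)

# Sum salary and bonus within each department
totals = pay.groupby("department")[["salary", "bonus"]].sum()

# Bonus as a fraction of total compensation (salary + bonus)
bonus_share = totals["bonus"] / (totals["salary"] + totals["bonus"])
print(bonus_share)
```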

<h2>multi-level grouping</h2>
<li>include columns in the group order</li>
<li>Does relative department compensation differ by city? Group first by city and then, within each city, by department</li>

In [11]:
df

Unnamed: 0,name,department,city,salary,bonus
100,Bill,1,New York,455000.0,232300.56
101,Ludovica,2,Catania,722321.0,478123.45
102,Qing,1,Paris,95223.0,3891.24
103,Savitri,2,New York,135000.0,98012.36
104,Giovanni,1,Sydney,132033.0,52123.5
105,Birgit,1,Sydney,700000.0,0.0
106,Bercù,1,Paris,832123.0,321000.23
107,Elodie,2,New York,78123.11,37345.22
108,Gurumul,1,Sydney,13243.32,121200.0
109,Kwame,2,Paris,456122.17,59621.33
110,Rosa,1,New York,912321.22,94123.5
111,João,1,Paris,31123.0,45123.2


In [16]:
city_dept_group = df.groupby(['city','department'])
cdg = city_dept_group['salary'].mean() + city_dept_group['bonus'].mean()
cdg


city      department
Catania   2             1.200444e+06
New York  1             8.468726e+05
          2             1.742403e+05
Paris     1             4.428279e+05
          2             5.157435e+05
Sydney    1             3.395333e+05
dtype: float64
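A multi-level result like the one above can be easier to compare when one level is moved into columns. The sketch below uses `unstack` on toy data (the name `staff` and its values are illustrative). Combinations that do not occur show up as NaN.

```python
import pandas as pd

staff = pd.DataFrame(
    {"city": ["Paris", "Paris", "Sydney", "Paris"],
     "department": ["1", "2", "1", "1"],
     "salary": [100.0, 200.0, 300.0, 500.0]}
)

by_city_dept = staff.groupby(["city", "department"])["salary"].mean()
wide = by_city_dept.unstack("department")  # departments become columns
print(wide)
```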

<h4>formatting print</h4>
<li>format decimal numbers to two decimal places (and avoid seeing exponent form output!)</li>
<li>use <span style="color:blue">pd.options.display.float_format</span></li>
<li><a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html">other display settings and options</a></li>

In [36]:
pd.options.display.float_format = '{:,.2f}'.format # ',' adds a thousands separator; '.2f' keeps two places after the decimal pt
print(city_dept_group['salary'].mean() + city_dept_group['bonus'].mean())

city      department
Catania   2            1,200,444.45
New York  1              846,872.64
          2              174,240.35
Paris     1              442,827.89
          2              515,743.50
Sydney    1              339,533.27
dtype: float64
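Setting `pd.options.display.float_format` changes the display for the rest of the session. If the format is only wanted for one output, `pd.option_context` can scope it to a block, as in this sketch. The two values are taken from the output above, and `amounts` is an illustrative name.

```python
import pandas as pd

amounts = pd.Series([1200444.45, 846872.64], index=["Catania", "New York"])

# The format applies only inside the with-block
with pd.option_context("display.float_format", "{:,.2f}".format):
    formatted = amounts.to_string()

plain = amounts.to_string()  # rendered with whatever setting was active before
print(formatted)
```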


<h2>Custom grouping function</h2>
<li>We've looked at grouping by column values, but we can also group by a custom grouping function</li>
<li>Divide the employees into "High Bonus" (more than 20% of salary as bonus) and "Low Bonus"</li>


<li>We need to write a function that takes a dataframe, a row index, and a threshold as arguments
<li>The dataframe and the row index together point to <span style="color:blue">two values</span> in the dataframe (the row's bonus and salary)
<li>We can test the ratio of these values against the threshold to see which group the row belongs to
<li>And return the group label

In [37]:
def ratio_group(df,index,threshold):
    # A missing value (NaN) makes the comparison False, so such rows fall through to "Low Bonus".
    # The try/except covers lookups or divisions that raise (e.g. an unknown index label)
    try:
        if df.loc[index,'bonus']/df.loc[index,'salary'] > threshold:
            return "High Bonus"
    except (KeyError, TypeError, ZeroDivisionError):
        pass
    return "Low Bonus"

In [38]:
ratio_group(df,107,0.2)

'High Bonus'

<h2>Digression: Lambda functions</h2>
<li>lambda functions are anonymous functions, created on the fly, and typically meant to be used once</li>
<li>since they are unnamed, they cannot be referred to again later; they are meant to be used in place</li>
<li>but, since python functions are first-class objects, you can assign a name to the function using =</li>


In [39]:
def square(x):
    return x*x
square(33)

1089

In [40]:
#Single use example. This function will not be available for future use
(lambda x: x*x)(33)

1089

In [41]:
#Multiple use example. This function will be callable in the future
square = lambda x: x*x
square(33)

1089

In [42]:
(lambda x,y: x*y)(3,2)

6

<li>lambda functions are most often used when they are an argument to another function</li>
<li>most basic example of use: defining a sort order when sorting</li>
<li>A list containing student names and scores</li>
<li>sort by scores</li>
<li>sort functions (also min, max) have a key argument: a function applied to each element to produce the value used for comparison</li>
    

In [43]:
grades = [("Jack",27),("Jack",83),("Jill",94),("Qing",73),("Birgit",73),("Parsifal",99),("Birgit",24)]
simple_sort = sorted(grades) #Sorts tuples element by element: by name, then by score
score_sort = sorted(grades,key=lambda x: x[1]) #Sorts ONLY by the second element of each tuple (the score)

In [44]:
simple_sort

[('Birgit', 24),
 ('Birgit', 73),
 ('Jack', 27),
 ('Jack', 83),
 ('Jill', 94),
 ('Parsifal', 99),
 ('Qing', 73)]

In [45]:
score_sort

[('Birgit', 24),
 ('Jack', 27),
 ('Qing', 73),
 ('Birgit', 73),
 ('Jack', 83),
 ('Jill', 94),
 ('Parsifal', 99)]

In [46]:
name_sort = sorted(grades,key=lambda x: x[0])
name_sort # sort only by names

[('Birgit', 73),
 ('Birgit', 24),
 ('Jack', 27),
 ('Jack', 83),
 ('Jill', 94),
 ('Parsifal', 99),
 ('Qing', 73)]

<li>the lambda function used for score_sort takes one argument (x)</li>
<li>each x value corresponds to a tuple (name,score)</li>
<li>and returns  the second value in the tuple, the score</li>
<li>and sort uses these values to order the elements</li>
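A key function can also return a tuple, which gives a multi-level sort order. The sketch below reuses the `grades` list and, as an illustration, ranks by score from highest to lowest with ties broken by name. The names `ranked` and `top` are new here.

```python
grades = [("Jack", 27), ("Jack", 83), ("Jill", 94), ("Qing", 73),
          ("Birgit", 73), ("Parsifal", 99), ("Birgit", 24)]

# Highest score first; ties on score are broken alphabetically by name
ranked = sorted(grades, key=lambda x: (-x[1], x[0]))

# max accepts the same kind of key function
top = max(grades, key=lambda x: x[1])

print(ranked)
print(top)
```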

<li>We can give the function a name using an assignment statement</li>

<li>lambda functions can have only one expression and they return whatever the expression returns</li>
<li>the if .. else .. structure in a lambda function is in the form of an "expression if"</li>
<li>multiple arguments are separated by a comma</li>



In [47]:
(lambda x,y,z: x if x > y else y if y > z else z) (4,2,3)

4
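Chained expression ifs are read left to right, so it can help to see the same logic written as an ordinary function. In this sketch `pick` and `pick_lambda` are illustrative names. Note that the expression returns the first value whose test succeeds, which is not always the largest of the three.

```python
def pick(x, y, z):
    # Spelled-out version of: x if x > y else (y if y > z else z)
    if x > y:
        return x
    elif y > z:
        return y
    else:
        return z

pick_lambda = lambda x, y, z: x if x > y else y if y > z else z

print(pick(4, 2, 3), pick_lambda(4, 2, 3))
print(pick(4, 2, 9))  # the first test succeeds, so this is 4 rather than 9
```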

<h3>Expression if</h3>
<li>An ordinary if statement is a statement: it produces no value. An expression if (conditional expression) evaluates to a value</li>

In [48]:
x=5
if x<10:
    y = x
else:
    y = 200

In [49]:
x=5
y = x if x<10 else 200 # same result as the if/else statement in the previous cell
y

5

<h3>Grouping using lambda functions</h3>
<li>Finally, we'll pass the function to groupby using a lambda function; groupby calls it once for each index label
<li>It will use the values returned by the function to group the data (note that the threshold used here is 0.5)
<li>Then we can get group level statistics

In [55]:
groups = df.groupby(lambda x: ratio_group(df,x,0.5))
print(groups.size())
print(groups.mean(numeric_only=True))
print(groups.std(numeric_only=True))

High Bonus    5
Low Bonus     7
dtype: int64
               salary      bonus
High Bonus 271,337.46 194,951.91
Low Bonus  457,992.21  81,157.86
               salary      bonus
High Bonus 308,229.44 172,384.97
Low Bonus  362,195.02 110,661.32
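The row-by-row function works, but the same grouping can be sketched without calling a Python function per row: compute the ratio for the whole column, turn it into labels with np.where, and group by that array. The three-row `staff` dataframe is a stand-in built from the first few employees, and this is an alternative approach rather than the method used above.

```python
import numpy as np
import pandas as pd

staff = pd.DataFrame(
    {"salary": [455000.0, 95223.0, 135000.0],
     "bonus": [232300.56, 3891.24, 98012.36]},
    index=[100, 102, 103],
)

# Compute the ratio for every row at once, then label each row
ratio = staff["bonus"] / staff["salary"]
labels = np.where(ratio > 0.5, "High Bonus", "Low Bonus")

groups = staff.groupby(labels)  # an array aligned with the rows works as a key
sizes = groups.size()
print(sizes)
```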


<h2>In-class problem</h2>
<li>For our employee example, write a grouping function that groups the data into two groups</li>
<li>The "living well" group has a total income > 600,000 after adjusting for the cost of living</li>
<li>The "living hand to mouth" group has a total income <= 600,000 after adjusting for the cost of living</li>
<li>The cost of living index is in a dictionary, "cost_of_living_index". Divide the total income (salary + bonus) by the corresponding cost of living for the city the employee lives in to get the adjusted value</li>
<li>How many employees are "living well" and how many are "living hand to mouth"?</li>
<li>What is the average income of the "living well" employees and that of the "living hand to mouth" employees?</li>

For the sizes and the means, you should get:

<pre>
lives hand to mouth    8
lives well             4
dtype: int64


lives hand to mouth     255,685.63
lives well            1,015,003.10
dtype: float64
</pre>

In [51]:
import pandas as pd
import numpy as np
emp_id = np.array([100,101,102,103,104,105,106,107,108,109,110,111])
names = np.array(['Bill','Ludovica','Qing','Savitri','Giovanni',"Birgit",
                  "Bercù","Elodie","Gurumul","Kwame","Rosa","João"])
bonus = np.array([232300.56,478123.45,3891.24,98012.36,52123.50,0,
                  321000.23,37345.22,121200,59621.33,94123.5,45123.2])
department = np.array(['1','2','1','2','1','1','1','2',"1","2","1","1"])
city = np.array(["New York","Catania","Paris","New York","Sydney","Sydney",
                 "Paris","New York","Sydney","Paris","New York","Paris"])
salary = np.array([455000,722321,95223,135000,132033,700000,832123,
                   78123.11,13243.32,456122.17,912321.22,31123])

cost_of_living_index = {"New York":1.25,
                        "Catania":0.8,
                        "Paris": 1.1,
                        "Sydney":0.9}

columns=["name","department","city","salary","bonus"]


df = pd.DataFrame([names,department,city,salary,bonus]).transpose().set_index(emp_id)


df.columns = columns

df['salary'] =df['salary'].astype('float64')
df['bonus'] =df['bonus'].astype('float64')


df.info()


<class 'pandas.core.frame.DataFrame'>
Index: 12 entries, 100 to 111
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   name        12 non-null     object 
 1   department  12 non-null     object 
 2   city        12 non-null     object 
 3   salary      12 non-null     float64
 4   bonus       12 non-null     float64
dtypes: float64(2), object(3)
memory usage: 576.0+ bytes


In [52]:
def grouping_function(df,index,cost_of_living_index):
    row = df.loc[index]
    city = row.city
    c_o_l_index = cost_of_living_index[city]
    compensation = row.salary + row.bonus
    adjusted_comp = compensation/c_o_l_index
    if adjusted_comp > 600000:
        return "Living well"
    
    return "Living hand to mouth"
grouping_function(df,106,cost_of_living_index)


groups = df.groupby(lambda x: grouping_function(df,x,cost_of_living_index))

#Number of people in each group
groups.size()

Living hand to mouth    8
Living well             4
dtype: int64

In [53]:
#Average income in each group
groups.salary.mean() + groups.bonus.mean()

Living hand to mouth     255,685.63
Living well            1,015,003.10
dtype: float64
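As a possible alternative to the row-by-row solution, the cost-of-living lookup can be done for the whole column with `Series.map`. The sketch below uses four of the employees under the hypothetical name `staff`; it illustrates the idea and does not replace the solution above.

```python
import numpy as np
import pandas as pd

cost_of_living_index = {"New York": 1.25, "Catania": 0.8, "Paris": 1.1, "Sydney": 0.9}
staff = pd.DataFrame(
    {"city": ["New York", "Catania", "Paris", "Sydney"],
     "salary": [455000.0, 722321.0, 95223.0, 700000.0],
     "bonus": [232300.56, 478123.45, 3891.24, 0.0]},
    index=[100, 101, 102, 105],
)

# Look up each row's index value, then adjust total compensation in one step
adjusted = (staff["salary"] + staff["bonus"]) / staff["city"].map(cost_of_living_index)
labels = np.where(adjusted > 600000, "Living well", "Living hand to mouth")

sizes = staff.groupby(labels).size()
print(sizes)
```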