# Pandas Introduction

**Week02, Pandas Introduction**

ISM6136

&copy; 2023 Dr. Tim Smith


<a target="_blank" href="https://colab.research.google.com/github/prof-tcsmith/dm-f23/blob/main/W02/W02c-Pandas.ipynb#offline=1">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

---

## Introduction

In this notebook we review the fundamentals of Pandas. We will cover the following topics:
    
1. Pandas Series
2. Pandas DataFrames
3. Data Selection
4. Data Manipulation
5. Looping over DataFrame
6. Comparison Operators
7. Demonstration of a sample of common Pandas methods


In [None]:
# if running on colab, uncomment the following lines
#!pip install matplotlib
#!pip install numpy
#!pip install pandas

In [273]:
# let's begin with our imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## 1.0 Pandas Series

A pandas Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating-point numbers, Python objects, etc.). The axis labels are collectively called the index. You can think of a Series as similar to a single column in an Excel sheet.

In [274]:
# create a series from a list
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64


In [275]:
# add an index to a series
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=['a', 'b', 'c', 'd', 'e', 'f'])
print(s)

a    1.0
b    3.0
c    5.0
d    NaN
e    6.0
f    8.0
dtype: float64


In [276]:
# create a series from a dictionary
d = {'a': 0., 'b': 1., 'c': 2.}
s = pd.Series(d)
print(s)

a    0.0
b    1.0
c    2.0
dtype: float64


In [277]:
# create a series from a dictionary with a different index
s = pd.Series(d, index=['b', 'c', 'd', 'a'])
print(s)

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64


In [278]:
# create a series from a scalar
s = pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])

In [279]:
# create a series from a numpy array
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(s)

a    0.127591
b   -0.655786
c    0.185745
d   -0.414263
e    0.094432
dtype: float64


In [280]:
# add a new element to a series
s['g'] = 10
print(s)

a     0.127591
b    -0.655786
c     0.185745
d    -0.414263
e     0.094432
g    10.000000
dtype: float64


In [281]:
# remove an element from a series
s = s.drop('g')
print(s)

a    0.127591
b   -0.655786
c    0.185745
d   -0.414263
e    0.094432
dtype: float64


In [282]:
# select an element from a series
print(s['a'])

0.12759077197930593


In [283]:
# select a range of elements from a series
print(s['a':'c'])

a    0.127591
b   -0.655786
c    0.185745
dtype: float64
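
One detail of the slice above is easy to miss: a label-based slice includes its end label, while a position-based slice through iloc excludes its end position. The following is a minimal sketch of the difference, assuming a small made-up Series (s_demo and its values are illustrative and not part of the notebook's data):

```python
import pandas as pd

# Hypothetical values chosen only for illustration
s_demo = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

by_label = s_demo['a':'c']       # label slice: the end label 'c' is included
by_position = s_demo.iloc[0:2]   # position slice: position 2 is excluded

print(by_label)
print(by_position)
```

The label slice returns three elements and the position slice returns two, even though both start at the first element.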


## 2.0 Pandas DataFrames

Pandas DataFrames are two-dimensional, labeled tables and a powerful tool for data manipulation and analysis. They are built on top of NumPy arrays and closely resemble R data frames and SQL tables. Each column of a DataFrame is a Series, and all columns share the same index. We will use DataFrames extensively in this course.

In [284]:
# create a dataframe from a dictionary
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df)

   A  B
0  1  4
1  2  5
2  3  6


In [285]:
# create a dataframe from a list of lists
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['A', 'B', 'C'])
print(df)

   A  B  C
0  1  2  3
1  4  5  6


In [286]:
# create a dataframe from a list of dictionaries
df = pd.DataFrame([{'A': 1, 'B': 2}, {'A': 3, 'B': 4}])
print(df)

   A  B
0  1  2
1  3  4


In [287]:
# create a dataframe from a numpy array
df = pd.DataFrame(np.random.rand(3, 2), columns=['A', 'B'])
print(df)

          A         B
0  0.713841  0.875272
1  0.004488  0.619932
2  0.491812  0.372714


In [288]:
# create a dataframe from a series
df = pd.DataFrame({'A': pd.Series([1, 2, 3]), 'B': pd.Series([4, 5, 6])})
print(df)

   A  B
0  1  4
1  2  5
2  3  6


In [289]:
# create a dataframe from a series with a name
df = pd.DataFrame({'A': pd.Series([1, 2, 3], name='foo'), 'B': pd.Series([4, 5, 6], name='bar')})
print(df)

   A  B
0  1  4
1  2  5
2  3  6


In [290]:
# create a dataframe from named series and an explicit index
# (the series are labeled 0, 1, 2, which do not match 'a', 'b', 'c', so every value is NaN)
df = pd.DataFrame({'A': pd.Series([1, 2, 3], name='foo'), 'B': pd.Series([4, 5, 6], name='bar')}, index=['a', 'b', 'c'])
print(df)

    A   B
a NaN NaN
b NaN NaN
c NaN NaN
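
The all-NaN result above follows from index alignment: pandas matches values by label, and the two Series carry the default labels 0, 1, 2, none of which appear in the requested index. A minimal sketch of one way to get the values onto the new labels, assuming the intent was simply to relabel the existing rows (the names labels, aligned, and mismatched are illustrative):

```python
import pandas as pd

labels = ['a', 'b', 'c']

# Give each Series the same labels that the DataFrame will use
aligned = pd.DataFrame({
    'A': pd.Series([1, 2, 3], index=labels),
    'B': pd.Series([4, 5, 6], index=labels),
})
print(aligned)

# Mismatched labels: the Series is indexed 0..2, so nothing lines up with 'a'..'c'
mismatched = pd.DataFrame({'A': pd.Series([1, 2, 3])}, index=labels)
print(mismatched['A'].isna().all())
```

Passing plain lists instead of Series would also work here, since lists carry no labels to align.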


In [291]:
# save a dataframe to a csv file
df.to_csv('data.csv')

In [292]:
# create a dataframe from a csv file (index_col=0 restores the saved row labels as the index)
df2 = pd.read_csv('data.csv', index_col=0)
print(df2)

    A   B
a NaN NaN
b NaN NaN
c NaN NaN


## 3.0 Data Selection in DataFrames

In [293]:
# let's begin by creating a dataframe to demonstrate some of the pandas functionality
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'age': [42, 52, 36, 24, 73],
        'preTestScore': [4, 24, 31, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, columns = ['name', 'age', 'preTestScore', 'postTestScore'], index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
print(df)

             name  age  preTestScore  postTestScore
Cochice     Jason   42             4             25
Pima        Molly   52            24             94
Santa Cruz   Tina   36            31             57
Maricopa     Jake   24             2             62
Yuma          Amy   73             3             70


### 3.1 Using loc

The loc indexer accesses data in a DataFrame by label. You can use it to select rows and columns based on their index and column labels. Here are some examples:

In [294]:
df.loc['Pima'] # select a single row

name             Molly
age                 52
preTestScore        24
postTestScore       94
Name: Pima, dtype: object

In [295]:
df.loc[ : ,'age'] # select a single column

Cochice       42
Pima          52
Santa Cruz    36
Maricopa      24
Yuma          73
Name: age, dtype: int64

In [296]:
df.loc['Pima': 'Yuma'] # select multiple contiguous rows (a label slice includes both endpoints)

             name  age  preTestScore  postTestScore
Pima        Molly   52            24             94
Santa Cruz   Tina   36            31             57
Maricopa     Jake   24             2             62
Yuma          Amy   73             3             70


In [297]:
# select multiple non-contiguous rows and columns
df.loc[['Pima', 'Yuma', 'Santa Cruz'], ['age', 'name']]

            age   name
Pima         52  Molly
Yuma         73    Amy
Santa Cruz   36   Tina


### 3.2 Using iloc

The iloc indexer accesses data in a DataFrame by integer position. You can use it to select rows and columns based on where they sit, counting from 0. Here are some examples:

In [298]:
# select one row using iloc

df.iloc[0]

name             Jason
age                 42
preTestScore         4
postTestScore       25
Name: Cochice, dtype: object

In [299]:
# select multiple contiguous rows (the end position, 3, is excluded)

df.iloc[0:3]

             name  age  preTestScore  postTestScore
Cochice     Jason   42             4             25
Pima        Molly   52            24             94
Santa Cruz   Tina   36            31             57


In [300]:
# select a column using iloc (position 0 is the name column)
df.iloc[:, 0]

Cochice       Jason
Pima          Molly
Santa Cruz     Tina
Maricopa       Jake
Yuma            Amy
Name: name, dtype: object

In [301]:
# select multiple rows and columns using iloc
df.iloc[0:3, 0:2]

             name  age
Cochice     Jason   42
Pima        Molly   52
Santa Cruz   Tina   36


In [302]:
# select multiple non-contiguous rows and columns
df.iloc[[0, 2, 3], [1, 2]]

            age  preTestScore
Cochice      42             4
Santa Cruz   36            31
Maricopa     24             2
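
The difference between loc and iloc is easiest to see when the index labels are themselves integers. The sketch below assumes a small hypothetical frame (demo, with a made-up score column) whose labels deliberately differ from the row positions:

```python
import pandas as pd

# Hypothetical frame: integer labels 10, 20, 30 sit at positions 0, 1, 2
demo = pd.DataFrame({'score': [88, 92, 79]}, index=[10, 20, 30])

first_by_label = demo.loc[10, 'score']    # looks up the row labeled 10
first_by_position = demo.iloc[0, 0]       # looks up the row at position 0

# Slices differ too: loc includes the end label, iloc excludes the end position
loc_rows = demo.loc[10:20]
iloc_rows = demo.iloc[0:1]

print(first_by_label, first_by_position)
print(len(loc_rows), len(iloc_rows))
```

Both lookups reach the same first row, but the two slices return a different number of rows.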


## 4.0 Data Manipulation with Pandas

In [303]:
df.iloc[1, 2] = 55 # change the value in the 2nd row and 3rd column
df

             name  age  preTestScore  postTestScore
Cochice     Jason   42             4             25
Pima        Molly   52            55             94
Santa Cruz   Tina   36            31             57
Maricopa     Jake   24             2             62
Yuma          Amy   73             3             70


In [304]:
df.iloc[1:3,1] = 200 # set the age (column position 1) of the 2nd and 3rd rows to 200
df

             name  age  preTestScore  postTestScore
Cochice     Jason   42             4             25
Pima        Molly  200            55             94
Santa Cruz   Tina  200            31             57
Maricopa     Jake   24             2             62
Yuma          Amy   73             3             70


In [305]:
df.loc['Pima', 'age'] = 333 # change a single value using its row and column labels
df

             name  age  preTestScore  postTestScore
Cochice     Jason   42             4             25
Pima        Molly  333            55             94
Santa Cruz   Tina  200            31             57
Maricopa     Jake   24             2             62
Yuma          Amy   73             3             70


In [306]:
df['age-squared'] = df['age'] ** 2 # add a new column computed from an existing one
df

             name  age  preTestScore  postTestScore  age-squared
Cochice     Jason   42             4             25         1764
Pima        Molly  333            55             94       110889
Santa Cruz   Tina  200            31             57        40000
Maricopa     Jake   24             2             62          576
Yuma          Amy   73             3             70         5329


In [307]:
df.drop('age-squared', axis=1, inplace=True) # remove the column (axis=1) from df itself (inplace=True)
df

             name  age  preTestScore  postTestScore
Cochice     Jason   42             4             25
Pima        Molly  333            55             94
Santa Cruz   Tina  200            31             57
Maricopa     Jake   24             2             62
Yuma          Amy   73             3             70
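
The cells above modify df in place. As a hedged alternative, the sketch below adds and removes a column while leaving the original frame untouched, using assign and drop without inplace=True. The frame demo and the column name age_squared are illustrative assumptions:

```python
import pandas as pd

demo = pd.DataFrame({'name': ['Jason', 'Molly'], 'age': [42, 52]})

# assign returns a new DataFrame with the extra column; demo itself is unchanged
with_square = demo.assign(age_squared=demo['age'] ** 2)

# drop also returns a new DataFrame unless inplace=True is passed
without_square = with_square.drop(columns='age_squared')

print(demo.columns.tolist())
print(with_square.columns.tolist())
print(without_square.columns.tolist())
```

Working on copies like this can make a notebook easier to re-run, because earlier cells do not depend on later ones having undone their changes.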


## 5.0 Looping over a Pandas DataFrame


You will often need to loop over the data in a dataframe and do something with it. For example, you may need to visit each row and compute a value from the data in that row. There are several ways to do this; we cover two in this notebook. The first uses a for loop, and the second uses the apply method.

### 5.1 For Loops & DataFrames

In [308]:
# iterating over a dataframe directly yields its column names
for something in df:
    print(something)

name
age
preTestScore
postTestScore


In [309]:
# iterrows yields one (index label, row as a Series) tuple per row
for something in df.iterrows():
    print(something)

('Cochice', name             Jason
age                 42
preTestScore         4
postTestScore       25
Name: Cochice, dtype: object)
('Pima', name             Molly
age                333
preTestScore        55
postTestScore       94
Name: Pima, dtype: object)
('Santa Cruz', name             Tina
age               200
preTestScore       31
postTestScore      57
Name: Santa Cruz, dtype: object)
('Maricopa', name             Jake
age                24
preTestScore        2
postTestScore      62
Name: Maricopa, dtype: object)
('Yuma', name             Amy
age               73
preTestScore       3
postTestScore     70
Name: Yuma, dtype: object)


In [310]:
for label, row in df.iterrows():
    print(label)
    print(row)


Cochice
name             Jason
age                 42
preTestScore         4
postTestScore       25
Name: Cochice, dtype: object
Pima
name             Molly
age                333
preTestScore        55
postTestScore       94
Name: Pima, dtype: object
Santa Cruz
name             Tina
age               200
preTestScore       31
postTestScore      57
Name: Santa Cruz, dtype: object
Maricopa
name             Jake
age                24
preTestScore        2
postTestScore      62
Name: Maricopa, dtype: object
Yuma
name             Amy
age               73
preTestScore       3
postTestScore     70
Name: Yuma, dtype: object


In [311]:
for label, row in df.iterrows():
    print(label)
    print(row['age'])

Cochice
42
Pima
333
Santa Cruz
200
Maricopa
24
Yuma
73


In [312]:
for label, row in df.iterrows():
    print(f"{label}:{row['age']}")

Cochice:42
Pima:333
Santa Cruz:200
Maricopa:24
Yuma:73


In [313]:
# Let's add a column that is the count of letters in name
# (the first assignment creates the column filled with NaN, so its values are stored as floats)
for label, row in df.iterrows():
    df.loc[label,'name-len'] = len(row['name'])

df

             name  age  preTestScore  postTestScore  name-len
Cochice     Jason   42             4             25       5.0
Pima        Molly  333            55             94       5.0
Santa Cruz   Tina  200            31             57       4.0
Maricopa     Jake   24             2             62       4.0
Yuma          Amy   73             3             70       3.0
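
The loop above works, but it stores the lengths as floats because the new column starts out filled with NaN. A minimal sketch of two alternatives follows, assuming a small stand-in frame (demo and the column name name_len are illustrative): the vectorized .str.len() accessor computes all lengths in one call, and itertuples offers a lighter-weight way to loop over rows when a loop is really needed.

```python
import pandas as pd

demo = pd.DataFrame({'name': ['Jason', 'Molly', 'Tina']},
                    index=['Cochice', 'Pima', 'Santa Cruz'])

# Vectorized string method: one call for the whole column, integer result
demo['name_len'] = demo['name'].str.len()

# itertuples yields one namedtuple per row; Index holds the row label
for row in demo.itertuples():
    print(row.Index, row.name, row.name_len)
```

Column names with hyphens or spaces cannot be used as namedtuple attributes, which is why this sketch uses name_len rather than name-len.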


### 5.2 Using the apply method

The apply method is available on every pandas Series and DataFrame. It lets you apply a function to every value in a Series, or to every row or column of a DataFrame. Here is an example:

In [314]:
df['name-length'] = df['name'].apply(len) 

df

             name  age  preTestScore  postTestScore  name-len  name-length
Cochice     Jason   42             4             25       5.0            5
Pima        Molly  333            55             94       5.0            5
Santa Cruz   Tina  200            31             57       4.0            4
Maricopa     Jake   24             2             62       4.0            4
Yuma          Amy   73             3             70       3.0            3


Notice that the apply method takes a function as an argument. In the example above, we used the built-in function len, which accepts a Python sequence or container (such as a string) and returns the number of items in it. We passed len to the apply method, which called it on every value in the name column and returned the results; we stored them in a new column called name-length.

Here is another example of using apply. This time we only display the results on the screen and do not store them in the dataframe:

In [315]:
df['age'].apply(lambda x: x ** 2) # this simply returns the values to the screen


Cochice         1764
Pima          110889
Santa Cruz     40000
Maricopa         576
Yuma            5329
Name: age, dtype: int64

In this last example, we used a lambda function. A lambda function is a small anonymous function defined inline, without being given a name. It is a convenient way to define a function that you only need once. Here, the lambda takes a number as its argument and returns its square. We passed this lambda to the apply method, which called it on every value in the age column and returned the results to the screen.
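
To make the role of the function argument concrete, the sketch below applies a lambda, an equivalent named function, and plain vectorized arithmetic to the same data. The Series ages and the function name square are illustrative assumptions, not part of the notebook's data:

```python
import pandas as pd

ages = pd.Series([42, 52, 36], index=['Cochice', 'Pima', 'Santa Cruz'])

def square(x):
    """Return x squared."""
    return x ** 2

via_lambda = ages.apply(lambda x: x ** 2)   # anonymous function
via_named = ages.apply(square)              # the same logic with a name
via_vectorized = ages ** 2                  # no apply needed for simple arithmetic

print(via_lambda)
print(via_named)
print(via_vectorized)
```

All three produce the same values. For simple arithmetic the vectorized form is usually the shortest, while apply is most useful when the logic does not have a built-in vectorized equivalent.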

Finally, let's store the result of an apply call in a new column in the dataframe:

In [316]:
df['age-squared'] = df['age'].apply(lambda x: x ** 2) # this adds a new column to the dataframe

df

             name  age  preTestScore  postTestScore  name-len  name-length  age-squared
Cochice     Jason   42             4             25       5.0            5         1764
Pima        Molly  333            55             94       5.0            5       110889
Santa Cruz   Tina  200            31             57       4.0            4        40000
Maricopa     Jake   24             2             62       4.0            4          576
Yuma          Amy   73             3             70       3.0            3         5329


## 6.0 Comparison operators and Pandas (Filtering Pandas Dataframes)


Using standard comparison operators (such as <, >, ==, !=, <=, >=) we can filter a pandas dataframe. A comparison returns a boolean Series, which can then be used to select the matching rows. Here are a few examples:

In [317]:
df['age'] > 99


Cochice       False
Pima           True
Santa Cruz     True
Maricopa      False
Yuma          False
Name: age, dtype: bool

In [319]:
# use the boolean Series as a mask to keep the matching rows, then select the name column
df[df['age'] > 99][['name']]


             name
Pima        Molly
Santa Cruz   Tina


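When we want to combine multiple conditions, we cannot use Python's and / or keywords directly on a Series. Instead, we can use NumPy's logical_and and logical_or functions. Here are some examples: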

In [320]:
np.logical_and(df['age'] > 99, df['name-len'] < 5)


Cochice       False
Pima          False
Santa Cruz     True
Maricopa      False
Yuma          False
dtype: bool

In [321]:
df[np.logical_or(df['age'] > 99, df['name-len'] < 5)]


             name  age  preTestScore  postTestScore  name-len  name-length  age-squared
Pima        Molly  333            55             94       5.0            5       110889
Santa Cruz   Tina  200            31             57       4.0            4        40000
Maricopa     Jake   24             2             62       4.0            4          576
Yuma          Amy   73             3             70       3.0            3         5329
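
The same combinations can also be written with the element-wise operators & (and), | (or), and ~ (not). The sketch below assumes a small stand-in frame (demo and name_len are illustrative names); note that each condition needs its own parentheses, because & and | bind more tightly than the comparison operators.

```python
import pandas as pd

demo = pd.DataFrame({'name': ['Jason', 'Molly', 'Tina', 'Jake'],
                     'age': [42, 333, 200, 24]})
name_len = demo['name'].str.len()

both = demo[(demo['age'] > 99) & (name_len < 5)]        # both conditions hold
either = demo[(demo['age'] > 99) | (name_len < 5)]      # at least one holds
neither = demo[~((demo['age'] > 99) | (name_len < 5))]  # neither holds

print(both)
print(either)
print(neither)
```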


## 7.0 Summarizing Data from dataframes

A common task when working with dataframes is summarizing the data within the dataframe. In this section we cover some of the more common ways in which we summarize a dataframe. We will see many examples of this during the course as we prepare data for modeling. 

Let's begin by reading a dataframe to use in our examples. We create this dataframe from a csv file located on a website -- in this case, a file from one of my public GitHub repos:

In [322]:
df = pd.read_csv('https://raw.githubusercontent.com/prof-tcsmith/data/master/UniversalBank.csv')

#### .info() method

The .info() method provides a list of columns, null count information, and the data type of each column.

In [323]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIP Code            5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal Loan       5000 non-null   int64  
 10  Securities Account  5000 non-null   int64  
 11  CD Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB


#### .describe()

The .describe() method provides a statistical summary (count, mean, standard deviation, minimum, quartiles, and maximum) of the numeric columns in the dataframe.

In [324]:
df.describe()

                ID        Age  Experience     Income     ZIP Code    Family     CCAvg  Education    Mortgage  Personal Loan  Securities Account  CD Account    Online  CreditCard
count       5000.0     5000.0      5000.0     5000.0       5000.0    5000.0    5000.0     5000.0      5000.0         5000.0              5000.0      5000.0    5000.0      5000.0
mean        2500.5    45.3384     20.1046    73.7742    93152.503    2.3964  1.937938      1.881     56.4988          0.096              0.1044      0.0604    0.5968       0.294
std    1443.520003  11.463166   11.467954  46.033729  2121.852197  1.147663  1.747659   0.839869  101.713802       0.294621            0.305809     0.23825  0.490589    0.455637
min            1.0       23.0        -3.0        8.0       9307.0       1.0       0.0        1.0         0.0            0.0                 0.0         0.0       0.0         0.0
25%        1250.75       35.0        10.0       39.0      91911.0       1.0       0.7        1.0         0.0            0.0                 0.0         0.0       0.0         0.0
50%         2500.5       45.0        20.0       64.0      93437.0       2.0       1.5        2.0         0.0            0.0                 0.0         0.0       1.0         0.0
75%        3750.25       55.0        30.0       98.0      94608.0       3.0       2.5        3.0       101.0            0.0                 0.0         0.0       1.0         1.0
max         5000.0       67.0        43.0      224.0      96651.0       4.0      10.0        3.0       635.0            1.0                 1.0         1.0       1.0         1.0


#### .value_counts()

Value counts provides the number of each category (or class) found in a categorical variable.

In [325]:
df['CreditCard'].value_counts()

CreditCard
0    3530
1    1470
Name: count, dtype: int64

Value counts are of limited use when summarizing a continuous variable, because there are usually too many distinct values for the counts to be informative:

In [326]:
df['Income'].value_counts() # the parentheses are required; without them pandas displays the method object rather than the counts
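
One common way to make value counts informative for a continuous variable is to group the values into ranges first. The sketch below is a minimal illustration using pd.cut; the income values, the bin edges, and the labels low, mid, and high are all made up for the example and are not taken from the UniversalBank data.

```python
import pandas as pd

# Hypothetical incomes (in thousands), for illustration only
income = pd.Series([49, 34, 11, 100, 45, 40, 15, 24, 49, 83])

# Group the values into three labeled ranges: (0, 40], (40, 80], (80, 120]
bands = pd.cut(income, bins=[0, 40, 80, 120], labels=['low', 'mid', 'high'])

# Count how many observations fall into each range
band_counts = bands.value_counts()
print(band_counts)
```

By default pd.cut treats each bin as open on the left and closed on the right, so a value of exactly 40 lands in the low band.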

#### .unique() and .nunique()

.unique() returns an array of the distinct values found in a dataframe column. .nunique() returns the number of distinct values in a given column.

In [327]:
df['Education'].unique() # find the unique values in the Education column

array([1, 2, 3], dtype=int64)

In [328]:
df['Age'].unique() # find the unique values in the Age column. This is rarely applied to a continuous variable, but it is possible

array([25, 45, 39, 35, 37, 53, 50, 34, 65, 29, 48, 59, 67, 60, 38, 42, 46,
       55, 56, 57, 44, 36, 43, 40, 30, 31, 51, 32, 61, 41, 28, 49, 47, 62,
       58, 54, 33, 27, 66, 24, 52, 26, 64, 63, 23], dtype=int64)
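
The cells above show .unique() only. As a small sketch of how .nunique() relates to it, the example below uses a made-up frame (demo, with illustrative Education and Age values) rather than the full dataset:

```python
import pandas as pd

demo = pd.DataFrame({'Education': [1, 2, 3, 1, 2],
                     'Age': [25, 45, 39, 25, 37]})

distinct_values = demo['Education'].unique()    # array of the distinct values
distinct_count = demo['Education'].nunique()    # how many distinct values there are
per_column = demo.nunique()                     # distinct count for every column

print(distinct_values)
print(distinct_count)
print(per_column)
```

Calling nunique on the whole frame is a quick way to spot which columns behave like categories (few distinct values) and which behave like continuous measurements.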

#### .sum()

Provides the sum of each numeric column when called on a dataframe, or of a single column when called on a Series.

In [329]:
df.sum()

ID                    1.250250e+07
Age                   2.266920e+05
Experience            1.005230e+05
Income                3.688710e+05
ZIP Code              4.657625e+08
Family                1.198200e+04
CCAvg                 9.689690e+03
Education             9.405000e+03
Mortgage              2.824940e+05
Personal Loan         4.800000e+02
Securities Account    5.220000e+02
CD Account            3.020000e+02
Online                2.984000e+03
CreditCard            1.470000e+03
dtype: float64

In [330]:
df['Income'].sum()

368871

In [331]:
# df['Education'].sum() # Education is stored as int64, so this runs, but the total of category codes is not a meaningful number

#### .count()

Provides a count of the non-null observations in each column. This differs from value_counts(), which counts how many times each distinct value occurs within a single column.

In [332]:
df.count()

ID                    5000
Age                   5000
Experience            5000
Income                5000
ZIP Code              5000
Family                5000
CCAvg                 5000
Education             5000
Mortgage              5000
Personal Loan         5000
Securities Account    5000
CD Account            5000
Online                5000
CreditCard            5000
dtype: int64
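
Because every column in this dataset is complete, count() returns 5000 everywhere, which hides what it actually measures. The sketch below uses a tiny made-up column with one missing value (demo and card are illustrative names) to contrast the row count, count(), and value_counts():

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'card': [1, 0, 1, np.nan, 1]})

n_rows = len(demo)                        # all rows, including the missing one
n_present = demo['card'].count()          # non-missing values only
per_value = demo['card'].value_counts()   # occurrences of each distinct value

print(n_rows)
print(n_present)
print(per_value)
```

By default value_counts also skips missing values, so its counts add up to the result of count() rather than to the number of rows.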

#### .min(), .max(), .mean(), and .median()

In [333]:
df.min() # produces the minimum value in each column

ID                       1.0
Age                     23.0
Experience              -3.0
Income                   8.0
ZIP Code              9307.0
Family                   1.0
CCAvg                    0.0
Education                1.0
Mortgage                 0.0
Personal Loan            0.0
Securities Account       0.0
CD Account               0.0
Online                   0.0
CreditCard               0.0
dtype: float64

In [334]:
df['Income'].min() # produces the minimum value in the INCOME column

8

In [335]:
df['Income'].max() # produces the maximum value in the INCOME column

224

In [336]:
df['Income'].mean() # produces the mean value in the INCOME column

73.7742

In [337]:
df['Income'].median() # produces the median value in the INCOME column

64.0

#### .groupby()

In [338]:
df.groupby('Education').sum()

                ID    Age  Experience  Income   ZIP Code  Family    CCAvg  Mortgage  Personal Loan  Securities Account  CD Account  Online  CreditCard
Education
1          5183663  94244       42057  179389  195236537    4687   4738.7    129171             93                 224         118    1255         633
2          3492891  63191       27738   90232  130869983    3721  2364.18     72001            182                 150          88     860         400
3          3825946  69257       30728   99250  139655995    3574  2586.81     81322            205                 148          96     869         437


In [339]:
df.groupby('Education')['Income'].sum()

Education
1    179389
2     90232
3     99250
Name: Income, dtype: int64
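
The two cells above split the rows by Education level and then sum within each group. The same pattern extends to several summaries at once through agg. The following is a minimal sketch on a made-up frame (demo and its values are illustrative, not taken from UniversalBank.csv):

```python
import pandas as pd

demo = pd.DataFrame({'Education': [1, 1, 2, 2, 3],
                     'Income': [50, 70, 40, 60, 90]})

# One row per Education level, one column per requested summary
summary = demo.groupby('Education')['Income'].agg(['count', 'mean', 'max'])
print(summary)
```

Selecting the column of interest before aggregating, as done here with ['Income'], avoids computing summaries for columns such as ID or ZIP Code where they carry no meaning.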