<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

This notebook is adapted by Zhuo Chen from the notebooks created by [Nathan Kelber](http://nkelber.com), [William Mattingly](https://datascience.si.edu/people/dr-william-mattingly) and [Melanie Walsh](https://melaniewalsh.org) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email zhuo.chen@ithaka.org or nathan.kelber@ithaka.org.<br />
___

# Pandas 2

**Description:** This notebook describes how to:
* Create a dataframe from a .csv file
* Filter data 
* Update data in a dataframe

This is the second notebook in a series on learning to use Pandas. 

**Use Case:** For Learners (Detailed explanation, not ideal for researchers)

**Difficulty:** Intermediate

**Knowledge Required:** 
* [Pandas 1](./pandas-1.ipynb)
* Python Basics ([Start Python Basics I](./python-basics-1.ipynb))

**Knowledge Recommended:** 
* [Python Intermediate 2](./python-intermediate-2.ipynb)
* [Python Intermediate 4](./python-intermediate-4.ipynb)

**Completion Time:** 90 minutes

**Data Format:** CSV (.csv)

**Libraries Used:** Pandas

**Research Pipeline:** None
___


In [1]:
### Download the sample file for this Lesson
import urllib.request
url = 'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/failed_bank_since_2000.csv'
urllib.request.urlretrieve(url, './data/' + url.rsplit('/', 1)[-1])
print('Sample file retrieved.')

Sample file retrieved.


## Create a dataframe from a .csv file

When working with DataFrames, we are usually using a dataset that has been compiled by someone else. Often the data will be in the form of a CSV or Excel file. 

We can convert the data in a .csv file to a Pandas DataFrame using the `pd.read_csv()` function. We pass in the location of the .csv file.

Use the **File > Open** menu above to navigate to `failed_bank_since_2000.csv` in the `/data` folder. Preview its structure before we load it into a dataframe.
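To see what `pd.read_csv()` does without depending on the downloaded file, here is a minimal sketch that reads CSV text from memory. The sample rows, including `Example Bank A`, are hypothetical and only mimic the layout of the lesson's file.

```python
import io
import pandas as pd

# Hypothetical sample that mimics the layout of the lesson's CSV file
csv_text = """Bank Name,City,State,Acquiring Institution
Almena State Bank,Almena,KS,Equity Bank
Example Bank A,Springfield,IL,
The First State Bank,Barboursville,WV,"MVB Bank, Inc."
"""

# pd.read_csv() accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
```

Note that the quoted value `"MVB Bank, Inc."` stays in one cell even though it contains a comma, and the empty cell in the second row is read in as a missing value.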

In [2]:
import pandas as pd

In [3]:
# Create a DataFrame `df` from a CSV file
df = pd.read_csv('data/failed_bank_since_2000.csv')
df

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date,Fund
0,Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20,10538
1,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537
2,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536
3,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20,10535
4,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534
...,...,...,...,...,...,...,...
558,"Superior Bank, FSB",Hinsdale,IL,32646,"Superior Federal, FSB",27-Jul-01,6004
559,Malta National Bank,Malta,OH,6629,North Valley Bank,3-May-01,4648
560,First Alliance Bank & Trust Co.,Manchester,NH,34264,Southern New Hampshire Bank & Trust,2-Feb-01,4647
561,National State Bank of Metropolis,Metropolis,IL,3815,Banterra Bank of Marion,14-Dec-00,4646


In [4]:
# Change the display setting: show 40 rows when a long dataframe is truncated
pd.set_option('display.min_rows', 40)

In [5]:
# Print out the dataframe
df

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date,Fund
0,Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20,10538
1,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537
2,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536
3,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20,10535
4,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534
5,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19,10533
6,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19,10532
7,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19,10531
8,Washington Federal Bank for Savings,Chicago,IL,30570,Royal Savings Bank,15-Dec-17,10530
9,The Farmers and Merchants State Bank of Argonia,Argonia,KS,17719,Conway Bank,13-Oct-17,10529
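If you only want to change a display setting temporarily, `pd.option_context` may be handy. The sketch below uses a hypothetical 100-row dataframe named `toy` and assumes the default `display.max_rows` of 60, so the dataframe is long enough to be truncated.

```python
import pandas as pd

# A toy dataframe with 100 rows, more than the default display.max_rows of 60
toy = pd.DataFrame({'n': range(100)})

before = pd.get_option('display.min_rows')

# option_context changes a setting only inside the `with` block
with pd.option_context('display.min_rows', 20):
    inside = pd.get_option('display.min_rows')
    print(toy)  # the truncated view now shows 20 rows

# Outside the block, the previous setting is restored
after = pd.get_option('display.min_rows')
print(before, inside, after)
```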


By convention, a dataframe variable is called `df` but we could give it any valid Python variable name.

In [6]:
# Get some info about the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 563 entries, 0 to 562
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Bank Name              563 non-null    object
 1   City                   563 non-null    object
 2   State                  563 non-null    object
 3   Cert                   563 non-null    int64 
 4   Acquiring Institution  532 non-null    object
 5   Closing Date           563 non-null    object
 6   Fund                   563 non-null    int64 
dtypes: int64(2), object(5)
memory usage: 30.9+ KB
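Besides `.info()`, a few attributes and methods give the same facts one at a time. This is a small sketch on a hypothetical three-row dataframe named `toy`, not on the lesson's data.

```python
import pandas as pd

# Hypothetical toy data with one missing value
toy = pd.DataFrame({
    'Bank Name': ['Bank A', 'Bank B', 'Bank C'],
    'Cert': [101, 102, 103],
    'Acquiring Institution': ['Bank X', None, 'Bank Y'],
})

n_rows, n_cols = toy.shape   # (number of rows, number of columns)
non_null = toy.count()       # non-null count for each column

print(n_rows, n_cols)
print(toy.dtypes)            # data type of each column
print(non_null)
```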


<h3 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h3>

In the exercises in this notebook, we'll work on a dataset built from Constellate.

We'll use the `constellate` client to automatically retrieve the [metadata](https://constellate.org/docs/key-terms/#metadata) for a [dataset](https://constellate.org/docs/key-terms/#dataset). We can retrieve [metadata](https://constellate.org/docs/key-terms/#metadata) in a [CSV file](https://constellate.org/docs/key-terms/#csv-file) using the `get_metadata` method.


In [7]:
# download and import constellate library
!pip install constellate-client
import constellate

Constellate: use and download of datasets is covered by the Terms & Conditions of Use: https://constellate.org/terms-and-conditions/


In [8]:
# Creating a variable `dataset_id` to hold our dataset ID
# The default dataset is Shakespeare Quarterly, 1950-present
# retrieve the metadata
dataset_id = "7e41317e-740f-e86a-4729-20dab492e925"
metadata = constellate.get_metadata(dataset_id)

All documents from JSTOR published in Shakespeare Quarterly from 1950 - 2020. 6745 documents.
INFO:root:File /Users/zchen/data/7e41317e-740f-e86a-4729-20dab492e925-sampled-metadata.csv exists. Not re-downloading.


The metadata is stored in a .csv file. In the following code cell, read in the data using Pandas. 

In [9]:
shake = pd.read_csv(metadata)
shake

Unnamed: 0,id,title,isPartOf,publicationYear,doi,docType,provider,datePublished,issueNumber,volumeNumber,url,creator,publisher,language,pageStart,pageEnd,placeOfPublication,wordCount,pageCount,outputFormat
0,http://www.jstor.org/stable/2902131,Fragments of Nationalism in Troilus and Cressida,Shakespeare Quarterly,2000,,article,jstor,2000-07-01,2.0,51,http://www.jstor.org/stable/2902131,Matthew A. Greenfield,Folger Shakespeare Library,eng,181,200,,9538,20,unigram; bigram; trigram
1,http://www.jstor.org/stable/2871217,"Shakespearean Chronology, Ideological Complici...",Shakespeare Quarterly,1994,,article,jstor,1994-07-01,2.0,45,http://www.jstor.org/stable/2871217,Barbara Freedman,Folger Shakespeare Library,eng,190,210,,12472,21,unigram; bigram; trigram
2,http://www.jstor.org/stable/2866912,Review Article,Shakespeare Quarterly,1962,,article,jstor,1962-01-01,1.0,13,http://www.jstor.org/stable/2866912,R. G. Cox,Folger Shakespeare Library,eng,96,97,,573,2,unigram; bigram; trigram
3,http://www.jstor.org/stable/41819770,Review Article,Shakespeare Quarterly,2012,,article,jstor,2012-12-01,4.0,63,http://www.jstor.org/stable/41819770,Darryl Gless,Folger Shakespeare Library,eng,580,587,,3658,8,unigram; bigram; trigram
4,http://www.jstor.org/stable/2868793,Review Article,Shakespeare Quarterly,1974,,article,jstor,1974-07-01,3.0,25,http://www.jstor.org/stable/2868793,Marion O'Connor,Folger Shakespeare Library,eng,370,373,,1914,4,unigram; bigram; trigram
5,http://www.jstor.org/stable/2869261,Falstaff in Windsor Forest: Villain or Victim?,Shakespeare Quarterly,1975,,article,jstor,1975-01-01,1.0,26,http://www.jstor.org/stable/2869261,Jeanne Addison Roberts,Folger Shakespeare Library,eng,8,15,,4193,8,unigram; bigram; trigram
6,http://www.jstor.org/stable/2870776,Coriolanus: Body Politic and Private Parts,Shakespeare Quarterly,1990,,article,jstor,1990-12-01,4.0,41,http://www.jstor.org/stable/2870776,Zvi Jagendorf,Folger Shakespeare Library,eng,455,469,,8209,15,unigram; bigram; trigram
7,http://www.jstor.org/stable/2869933,"Shakespeare in West Germany, 1983",Shakespeare Quarterly,1984,,article,jstor,1984-07-01,2.0,35,http://www.jstor.org/stable/2869933,Wilhelm Hortmann,Folger Shakespeare Library,eng,214,218,,2764,5,unigram; bigram; trigram
8,http://www.jstor.org/stable/2866221,Review Article,Shakespeare Quarterly,1951,,article,jstor,1951-04-01,2.0,2,http://www.jstor.org/stable/2866221,A. H. R. Fairchild,Folger Shakespeare Library,eng,133,134,,1207,2,unigram; bigram; trigram
9,http://www.jstor.org/stable/2870918,Review Article,Shakespeare Quarterly,1992,,article,jstor,1992-04-01,1.0,43,http://www.jstor.org/stable/2870918,John W. Velz,Folger Shakespeare Library,eng,107,109,,1517,3,unigram; bigram; trigram


In the code cell below, can you set the rows to display to 30?

Use a Pandas method to explore the dataframe. How many rows does it have? How many columns does it have? What is the data type of the data in each column?

In [10]:
shake.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   id                  1500 non-null   object 
 1   title               1500 non-null   object 
 2   isPartOf            1500 non-null   object 
 3   publicationYear     1500 non-null   int64  
 4   doi                 0 non-null      float64
 5   docType             1500 non-null   object 
 6   provider            1500 non-null   object 
 7   datePublished       1500 non-null   object 
 8   issueNumber         1467 non-null   float64
 9   volumeNumber        1500 non-null   int64  
 10  url                 1500 non-null   object 
 11  creator             1209 non-null   object 
 12  publisher           1500 non-null   object 
 13  language            1500 non-null   object 
 14  pageStart           1443 non-null   object 
 15  pageEnd             1443 non-null   object 
 16  placeO

## Filter dataframe
We have learned how to use `.loc` and `.iloc` to select part of a dataframe in [Pandas 1](./pandas-1.ipynb). We will learn more ways to do data filtering in this section.
### Work with missing values
Datasets often have missing values. As you may have already noticed, blank cells in a CSV file show up as `NaN` in a Pandas DataFrame. For example, in our dataset, the `Acquiring Institution` column gives the name of the acquirer when a failed bank was acquired by another institution and is empty otherwise.

In [11]:
# Use isna() to check whether a dataframe has missing values
df.isna()

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date,Fund
0,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False
6,False,False,False,False,False,False,False
7,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False
9,False,False,False,False,False,False,False


In [12]:
# Use isna() to check whether a Series has missing values
df['Acquiring Institution'].isna()

0      False
1      False
2      False
3      False
4      False
5      False
6      False
7      False
8      False
9      False
10     False
11     False
12     False
13     False
14     False
15     False
16     False
17     False
18     False
19     False
       ...  
543    False
544    False
545    False
546    False
547     True
548     True
549    False
550    False
551     True
552    False
553     True
554    False
555    False
556    False
557    False
558    False
559    False
560    False
561    False
562    False
Name: Acquiring Institution, Length: 563, dtype: bool
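A long column of True/False values is hard to read, so it often helps to count or select the missing entries instead. The following is a minimal sketch on a hypothetical dataframe named `toy`.

```python
import pandas as pd

# Hypothetical toy data: two banks have no acquirer
toy = pd.DataFrame({
    'Bank Name': ['Bank A', 'Bank B', 'Bank C', 'Bank D'],
    'Acquiring Institution': ['Bank X', None, 'Bank Y', None],
})

# True counts as 1, so summing the mask counts the missing values per column
missing_per_column = toy.isna().sum()

# The mask for one column can also be used to select the affected rows
rows_with_missing = toy[toy['Acquiring Institution'].isna()]

print(missing_per_column)
print(rows_with_missing)
```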

We can use the `.dropna()` method to remove rows or columns with missing values. It returns a new dataframe and leaves the original unchanged.

In [13]:
# Use .dropna() to remove all rows with at least one missing value
df.dropna() # no argument passed in

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date,Fund
0,Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20,10538
1,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537
2,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536
3,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20,10535
4,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534
5,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19,10533
6,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19,10532
7,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19,10531
8,Washington Federal Bank for Savings,Chicago,IL,30570,Royal Savings Bank,15-Dec-17,10530
9,The Farmers and Merchants State Bank of Argonia,Argonia,KS,17719,Conway Bank,13-Oct-17,10529


In [14]:
# Use .dropna() to remove all rows with at least one missing value
df.dropna(axis=0) # Set the axis to 0

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date,Fund
0,Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20,10538
1,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537
2,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536
3,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20,10535
4,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534
5,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19,10533
6,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19,10532
7,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19,10531
8,Washington Federal Bank for Savings,Chicago,IL,30570,Royal Savings Bank,15-Dec-17,10530
9,The Farmers and Merchants State Bank of Argonia,Argonia,KS,17719,Conway Bank,13-Oct-17,10529


In [15]:
# Use .dropna() to remove all rows with at least one missing value
df.dropna(axis='rows') # Set the axis to 'rows'

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date,Fund
0,Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20,10538
1,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537
2,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536
3,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20,10535
4,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534
5,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19,10533
6,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19,10532
7,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19,10531
8,Washington Federal Bank for Savings,Chicago,IL,30570,Royal Savings Bank,15-Dec-17,10530
9,The Farmers and Merchants State Bank of Argonia,Argonia,KS,17719,Conway Bank,13-Oct-17,10529


In [16]:
# Use .dropna() to remove all columns with at least one missing value
df.dropna(axis=1) # set the axis to 1

Unnamed: 0,Bank Name,City,State,Cert,Closing Date,Fund
0,Almena State Bank,Almena,KS,15426,23-Oct-20,10538
1,First City Bank of Florida,Fort Walton Beach,FL,16748,16-Oct-20,10537
2,The First State Bank,Barboursville,WV,14361,3-Apr-20,10536
3,Ericson State Bank,Ericson,NE,18265,14-Feb-20,10535
4,City National Bank of New Jersey,Newark,NJ,21111,1-Nov-19,10534
5,Resolute Bank,Maumee,OH,58317,25-Oct-19,10533
6,Louisa Community Bank,Louisa,KY,58112,25-Oct-19,10532
7,The Enloe State Bank,Cooper,TX,10716,31-May-19,10531
8,Washington Federal Bank for Savings,Chicago,IL,30570,15-Dec-17,10530
9,The Farmers and Merchants State Bank of Argonia,Argonia,KS,17719,13-Oct-17,10529


In [17]:
# Use .dropna() to remove all columns with at least one missing value
df.dropna(axis='columns') # set the axis to 'columns'

Unnamed: 0,Bank Name,City,State,Cert,Closing Date,Fund
0,Almena State Bank,Almena,KS,15426,23-Oct-20,10538
1,First City Bank of Florida,Fort Walton Beach,FL,16748,16-Oct-20,10537
2,The First State Bank,Barboursville,WV,14361,3-Apr-20,10536
3,Ericson State Bank,Ericson,NE,18265,14-Feb-20,10535
4,City National Bank of New Jersey,Newark,NJ,21111,1-Nov-19,10534
5,Resolute Bank,Maumee,OH,58317,25-Oct-19,10533
6,Louisa Community Bank,Louisa,KY,58112,25-Oct-19,10532
7,The Enloe State Bank,Cooper,TX,10716,31-May-19,10531
8,Washington Federal Bank for Savings,Chicago,IL,30570,15-Dec-17,10530
9,The Farmers and Merchants State Bank of Argonia,Argonia,KS,17719,13-Oct-17,10529


In [18]:
# Use .dropna() to remove all rows that have a missing value in a specific column
df.dropna(subset=['Acquiring Institution'])

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date,Fund
0,Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20,10538
1,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537
2,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536
3,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20,10535
4,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534
5,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19,10533
6,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19,10532
7,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19,10531
8,Washington Federal Bank for Savings,Chicago,IL,30570,Royal Savings Bank,15-Dec-17,10530
9,The Farmers and Merchants State Bank of Argonia,Argonia,KS,17719,Conway Bank,13-Oct-17,10529
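Because the previews above show only the first rows, the differences between these calls are easy to miss. Here is a small sketch on a hypothetical dataframe named `toy` where each variant gives a visibly different result.

```python
import pandas as pd

# Hypothetical toy data with missing values in two different columns
toy = pd.DataFrame({
    'Bank Name': ['Bank A', 'Bank B', 'Bank C'],
    'City': ['Almena', None, 'Newark'],
    'Acquiring Institution': ['Bank X', 'Bank Y', None],
})

rows_dropped = toy.dropna()                                    # keeps only row 0
cols_dropped = toy.dropna(axis='columns')                      # keeps only 'Bank Name'
subset_dropped = toy.dropna(subset=['Acquiring Institution'])  # keeps rows 0 and 1

print(rows_dropped)
print(cols_dropped)
print(subset_dropped)
print(toy.shape)  # the original dataframe is unchanged
```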


Sometimes you want to keep the rows and columns that have missing values, but you need the values in those cells to have the same data type as the other cells in the column. That way, when you apply a function to the column, you will not run into a `TypeError`. A common way to handle this is the `.fillna()` method, which replaces each missing value with a value you choose.

In [19]:
# Fill the missing values (returns a new Series; df itself is not changed)
df['Acquiring Institution'].fillna('No Acquirer')

0                              Equity Bank
1                United Fidelity Bank, fsb
2                           MVB Bank, Inc.
3               Farmers and Merchants Bank
4                          Industrial Bank
5                       Buckeye State Bank
6        Kentucky Farmers Bank Corporation
7                       Legend Bank, N. A.
8                       Royal Savings Bank
9                              Conway Bank
10               United Fidelity Bank, fsb
11     First-Citizens Bank & Trust Company
12                            Whitney Bank
13                       Cache Valley Bank
14                     State Bank of Texas
15     First-Citizens Bank & Trust Company
16                            Today's Bank
17                             United Bank
18     First-Citizens Bank & Trust Company
19              The Bank of Fayette County
                      ...                 
543                         Earthstar Bank
544                          The Park Bank
545        
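To illustrate why the data type matters, here is a minimal sketch on a hypothetical Series named `acquirer`. A missing value is stored as a float `NaN`, so a string function such as `len()` fails on it until the gap is filled.

```python
import pandas as pd

# Hypothetical toy Series with one missing value
acquirer = pd.Series(['Equity Bank', float('nan'), 'Conway Bank'],
                     name='Acquiring Institution')

# len() cannot be applied to the float NaN, so this raises a TypeError
try:
    acquirer.apply(len)
    failed = False
except TypeError:
    failed = True

# fillna() returns a new Series; assign it to a name to keep the result
filled = acquirer.fillna('No Acquirer')
lengths = filled.apply(len)  # now every cell is a string

print(failed)
print(filled)
print(lengths)
```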

### Drop columns or rows
If a certain column or a row is no longer useful in the data analysis, we can drop it from a dataframe.

In [20]:
# Drop a column by setting the columns parameter
df.drop(columns='Fund')

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date
0,Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20
1,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20
2,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20
3,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20
4,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19
5,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19
6,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19
7,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19
8,Washington Federal Bank for Savings,Chicago,IL,30570,Royal Savings Bank,15-Dec-17
9,The Farmers and Merchants State Bank of Argonia,Argonia,KS,17719,Conway Bank,13-Oct-17


In [21]:
# Drop a column by setting the axis parameter
df.drop('Fund', axis=1)

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date
0,Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20
1,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20
2,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20
3,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20
4,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19
5,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19
6,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19
7,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19
8,Washington Federal Bank for Savings,Chicago,IL,30570,Royal Savings Bank,15-Dec-17
9,The Farmers and Merchants State Bank of Argonia,Argonia,KS,17719,Conway Bank,13-Oct-17


In [22]:
# Drop multiple columns by setting the columns parameter
df.drop(columns=['Fund', 'Cert'])

Unnamed: 0,Bank Name,City,State,Acquiring Institution,Closing Date
0,Almena State Bank,Almena,KS,Equity Bank,23-Oct-20
1,First City Bank of Florida,Fort Walton Beach,FL,"United Fidelity Bank, fsb",16-Oct-20
2,The First State Bank,Barboursville,WV,"MVB Bank, Inc.",3-Apr-20
3,Ericson State Bank,Ericson,NE,Farmers and Merchants Bank,14-Feb-20
4,City National Bank of New Jersey,Newark,NJ,Industrial Bank,1-Nov-19
5,Resolute Bank,Maumee,OH,Buckeye State Bank,25-Oct-19
6,Louisa Community Bank,Louisa,KY,Kentucky Farmers Bank Corporation,25-Oct-19
7,The Enloe State Bank,Cooper,TX,"Legend Bank, N. A.",31-May-19
8,Washington Federal Bank for Savings,Chicago,IL,Royal Savings Bank,15-Dec-17
9,The Farmers and Merchants State Bank of Argonia,Argonia,KS,Conway Bank,13-Oct-17


In [23]:
# Drop multiple columns by setting the axis parameter
df.drop(['Fund', 'Cert'], axis=1)

Unnamed: 0,Bank Name,City,State,Acquiring Institution,Closing Date
0,Almena State Bank,Almena,KS,Equity Bank,23-Oct-20
1,First City Bank of Florida,Fort Walton Beach,FL,"United Fidelity Bank, fsb",16-Oct-20
2,The First State Bank,Barboursville,WV,"MVB Bank, Inc.",3-Apr-20
3,Ericson State Bank,Ericson,NE,Farmers and Merchants Bank,14-Feb-20
4,City National Bank of New Jersey,Newark,NJ,Industrial Bank,1-Nov-19
5,Resolute Bank,Maumee,OH,Buckeye State Bank,25-Oct-19
6,Louisa Community Bank,Louisa,KY,Kentucky Farmers Bank Corporation,25-Oct-19
7,The Enloe State Bank,Cooper,TX,"Legend Bank, N. A.",31-May-19
8,Washington Federal Bank for Savings,Chicago,IL,Royal Savings Bank,15-Dec-17
9,The Farmers and Merchants State Bank of Argonia,Argonia,KS,Conway Bank,13-Oct-17


In [24]:
# Drop a row by setting the index parameter
df.drop(index=0)

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date,Fund
1,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537
2,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536
3,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20,10535
4,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534
5,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19,10533
6,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19,10532
7,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19,10531
8,Washington Federal Bank for Savings,Chicago,IL,30570,Royal Savings Bank,15-Dec-17,10530
9,The Farmers and Merchants State Bank of Argonia,Argonia,KS,17719,Conway Bank,13-Oct-17,10529
10,Fayette County Bank,Saint Elmo,IL,1802,"United Fidelity Bank, fsb",26-May-17,10528


In [25]:
# Drop a row by setting the axis parameter
df.drop(0, axis=0)

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date,Fund
1,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537
2,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536
3,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20,10535
4,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534
5,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19,10533
6,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19,10532
7,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19,10531
8,Washington Federal Bank for Savings,Chicago,IL,30570,Royal Savings Bank,15-Dec-17,10530
9,The Farmers and Merchants State Bank of Argonia,Argonia,KS,17719,Conway Bank,13-Oct-17,10529
10,Fayette County Bank,Saint Elmo,IL,1802,"United Fidelity Bank, fsb",26-May-17,10528


In [26]:
# Drop multiple rows by passing a list of index labels
df.drop(index=[0, 1])


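None of the calls above changes `df`; each one returns a new dataframe. If you want the change to persist, either assign the result back to a variable (for example, `df = df.drop(columns='Fund')`) or set the parameter `inplace=True`.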
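The difference between returning a new dataframe and changing one in place can be sketched as follows. The dataframe `toy` and the other variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'Bank Name': ['Bank A', 'Bank B', 'Bank C', 'Bank D'],
    'Fund': [1, 2, 3, 4],
})

# drop() returns a new dataframe and leaves `toy` untouched
without_rows = toy.drop(index=[0, 2])

# Option 1: keep the result by assigning it to a name
trimmed = toy.drop(columns='Fund')

# Option 2: modify a dataframe in place (the call itself returns None)
toy_copy = toy.copy()
result = toy_copy.drop(columns='Fund', inplace=True)

print(without_rows)
print(toy.shape)          # still 4 rows and 2 columns
print(toy_copy.columns)   # 'Fund' is gone
```

Assigning the result to a name keeps the original available, which is often convenient while exploring data.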

In [27]:
# Reset the index after dropping rows
df.reset_index(drop=True)

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date,Fund
0,Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20,10538
1,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537
2,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536
3,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20,10535
4,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534
5,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19,10533
6,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19,10532
7,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19,10531
8,Washington Federal Bank for Savings,Chicago,IL,30570,Royal Savings Bank,15-Dec-17,10530
9,The Farmers and Merchants State Bank of Argonia,Argonia,KS,17719,Conway Bank,13-Oct-17,10529
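Since `df` above still has all of its rows, resetting its index changes nothing visible. The sketch below, on a hypothetical dataframe named `toy`, shows the effect after a row has actually been dropped.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'Bank Name': ['Bank A', 'Bank B', 'Bank C']})

dropped = toy.drop(index=0)                  # index labels are now 1, 2
renumbered = dropped.reset_index(drop=True)  # index labels become 0, 1

# Without drop=True, the old labels are kept in a new column named 'index'
kept = dropped.reset_index()

print(dropped)
print(renumbered)
print(kept)
```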


<h3 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h3>

When you explored the Shakespeare dataframe, what did you find about the `doi` column? What did you find about the `placeOfPublication` column? Are there any non-null values in them?

In [28]:
# Drop the `doi` and `placeOfPublication` columns and make the change permanent


### Filter data using conditionals
Conditional selection using `df.loc[]` is a very common way to filter a dataframe.

You write a filtering condition on a target column. Pandas checks, for each cell in that column, whether it fulfills the condition, and returns the results as a Series of True/False values. The `.loc` indexer then uses this Series to select the rows whose value is True.

Suppose you are interested in the banks in the state of Georgia that have failed since 2000. From the original dataframe, you would like to select all the rows for failed banks in Georgia. How do you do it?

In [29]:
# Write a filtering condition
df['State'] == 'GA' # Returns a Series of True/False values for the column 'State'

0      False
1      False
2      False
3      False
4      False
5      False
6      False
7      False
8      False
9      False
10     False
11     False
12     False
13     False
14     False
15     False
16     False
17      True
18     False
19     False
       ...  
543    False
544    False
545    False
546    False
547    False
548     True
549    False
550    False
551    False
552    False
553    False
554    False
555    False
556    False
557    False
558    False
559    False
560    False
561    False
562    False
Name: State, Length: 563, dtype: bool

In [30]:
# Assign the filtering condition to a variable
filt = (df['State'] == 'GA') # Use parentheses for readability

In [31]:
# Pass the Series returned by the filtering condition to df.loc[]
df.loc[filt]

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date,Fund
17,The Woodbury Banking Company,Woodbury,GA,11297,United Bank,19-Aug-16,10521
22,The Bank of Georgia,Peachtree City,GA,35259,Fidelity Bank,2-Oct-15,10516
26,Capitol City Bank & Trust Company,Atlanta,GA,33938,First-Citizens Bank & Trust Company,13-Feb-15,10512
34,Eastside Commercial Bank,Conyers,GA,58125,Community & Southern Bank,18-Jul-14,10504
59,Sunrise Bank,Valdosta,GA,58185,Synovus Bank,10-May-13,10481
61,Douglas County Bank,Douglasville,GA,21649,Hamilton State Bank,26-Apr-13,10476
67,Frontier Bank,LaGrange,GA,16431,HeritageBank of the South,8-Mar-13,10471
72,Hometown Community Bank,Braselton,GA,57928,"CertusBank, National Association",16-Nov-12,10466
83,Jasper Banking Company,Jasper,GA,16240,Stearns Bank N.A.,27-Jul-12,10455
86,First Cherokee State Bank,Woodstock,GA,32711,Community & Southern Bank,20-Jul-12,10450


Out of the rows that fulfill the filtering condition, we can further specify which columns should be returned.

In [32]:
# Specify a single column to be returned
df.loc[filt, 'Bank Name']

17          The Woodbury Banking Company
22                   The Bank of Georgia
26     Capitol City Bank & Trust Company
34              Eastside Commercial Bank
59                          Sunrise Bank
61                   Douglas County Bank
67                         Frontier Bank
72               Hometown Community Bank
83                Jasper Banking Company
86             First Cherokee State Bank
87                    Georgia Trust Bank
90               Montgomery Bank & Trust
92                Security Exchange Bank
108                Covenant Bank & Trust
110                 Global Commerce Bank
112              Central Bank of Georgia
120                 The First State Bank
126           Community Bank of Rockmart
131               Community Capital Bank
132                   Decatur First Bank
                     ...                
450         Security Bank of North Metro
451        Security Bank of North Fulton
452     Security Bank of Gwinnett County
457             

Of course, we can also select multiple columns from the filtered rows.

In [33]:
# Specify multiple columns to be returned
df.loc[filt, ['Bank Name', 'Fund']]

Unnamed: 0,Bank Name,Fund
17,The Woodbury Banking Company,10521
22,The Bank of Georgia,10516
26,Capitol City Bank & Trust Company,10512
34,Eastside Commercial Bank,10504
59,Sunrise Bank,10481
61,Douglas County Bank,10476
67,Frontier Bank,10471
72,Hometown Community Bank,10466
83,Jasper Banking Company,10455
86,First Cherokee State Bank,10450


#### Conjunction of multiple filtering conditions: `&`

Often you will want to filter a dataframe based on more complex conditions. For example, suppose you would like to get the banks in GA that were closed between 2008 and 2010. How do you use `df.loc[ ]` to do that?

The location of the failed banks is stored in the `State` column. The closing year of the banks is stored in the `Closing Date` column. 

In [34]:
# Create the first filtering condition restricting the state
filt1 = (df['State'] == 'GA')

How do we get the closing year of the banks? Recall what we learned in [Pandas 1](./pandas-1.ipynb) about creating a new column based on an existing one. How do you extract the closing year from the `Closing Date` column? We will need to use the `.apply()` method.

In [35]:
# Create a new column storing the two-digit closing year of the banks (e.g. 20 for 2020)
df['Closing Year'] = df['Closing Date'].apply(lambda r: r.split('-')[2])
df['Closing Year'] = df['Closing Year'].astype(int)
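The `.apply()` approach above yields two-digit years. As an alternative sketch (not the approach used in this notebook), `pd.to_datetime` can parse the whole date and give four-digit years. The Series `dates` is a hypothetical sample in the same `day-Mon-yy` layout, and the example assumes English month abbreviations.

```python
import pandas as pd

# Hypothetical sample in the same layout as the 'Closing Date' column
dates = pd.Series(['23-Oct-20', '3-May-01', '14-Dec-00'], name='Closing Date')

# The notebook's approach: split the string and keep the last piece
two_digit = dates.apply(lambda r: int(r.split('-')[2]))

# Alternative: parse the dates, then read the year from the .dt accessor
# %d = day, %b = abbreviated month name, %y = two-digit year
parsed = pd.to_datetime(dates, format='%d-%b-%y')
four_digit = parsed.dt.year

print(two_digit.tolist())
print(four_digit.tolist())
```

With `%y`, two-digit values from 00 to 68 are read as 2000 to 2068, which suits a dataset that starts in 2000.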

In [36]:
# Create the second filtering condition: two-digit years 8, 9 and 10 (2008 to 2010)
filt2 = (df['Closing Year'] > 7) & (df['Closing Year'] < 11)

With the two filtering conditions, we are ready to extract the banks in GA that failed between 2008 and 2010.

In [37]:
# Use filt1 and filt2 to get the target rows
df.loc[filt1 & filt2]

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date,Fund,Closing Year
216,"United Americas Bank, N.A.",Atlanta,GA,35065,State Bank and Trust Company,17-Dec-10,10323,10
217,"Appalachian Community Bank, FSB",McCaysville,GA,58495,Peoples Bank of East Tennessee,17-Dec-10,10319,10
218,Chestatee State Bank,Dawsonville,GA,34578,Bank of the Ozarks,17-Dec-10,10320,10
226,Darby Bank & Trust Co.,Vidalia,GA,14580,Ameris Bank,12-Nov-10,10312,10
227,Tifton Banking Company,Tifton,GA,57831,Ameris Bank,12-Nov-10,10313,10
235,The First National Bank of Barnesville,Barnesville,GA,2119,United Bank,22-Oct-10,10304,10
236,The Gordon Bank,Gordon,GA,33904,Morris Bank,22-Oct-10,10305,10
248,The Peoples Bank,Winder,GA,182,Community & Southern Bank,17-Sep-10,10292,10
249,First Commerce Community Bank,Douglasville,GA,57448,Community & Southern Bank,17-Sep-10,10289,10
250,Bank of Ellijay,Ellijay,GA,58197,Community & Southern Bank,17-Sep-10,10287,10
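
The range condition in `filt2` can also be written with the `.between()` method, which includes both endpoints by default. The sketch below uses a small made-up dataframe (`banks`) to check that the two forms select the same rows.

```python
import pandas as pd

# Made-up data with two-digit closing years, as in the lesson
banks = pd.DataFrame({
    'Bank Name': ['Bank A', 'Bank B', 'Bank C', 'Bank D'],
    'Closing Year': [7, 8, 10, 11],
})

# Two comparisons joined with &
with_operators = (banks['Closing Year'] > 7) & (banks['Closing Year'] < 11)

# The same range with .between(); 8 and 10 are both included
with_between = banks['Closing Year'].between(8, 10)

print(banks.loc[with_between, 'Bank Name'].tolist())
```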


Note that when we extract rows that fulfill multiple conditions, we use `&` in Pandas, not `and`. If you replace `&` with `and`, you will get an error. This differs from what we learned about boolean operators in [Python basics 2](./python-basics-2.ipynb). In plain Python, we use `and`, `or` and `not`. When combining Pandas filtering conditions, we use `&`, `|` and `~` instead. 

|Pandas Operator|Boolean|Requires|
|---|---|---|
|&|and|All must be `True`|
|\||or|If any are `True`|
|~|not|The opposite|

Although we use different symbols for these boolean operators, the truth table for them stays the same. For a quick review of the truth table, see [Python basics 2](./python-basics-2.ipynb).
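
To see the difference in practice, here is a small sketch using a made-up dataframe (`banks`). It combines two conditions with `&` and then shows that `and` raises a `ValueError`, because Python cannot reduce a whole column of `True`/`False` values to a single truth value.

```python
import pandas as pd

# Made-up data for illustration
banks = pd.DataFrame({
    'State': ['GA', 'GA', 'NY', 'FL'],
    'Closing Year': [9, 15, 10, 8],
})

in_ga = (banks['State'] == 'GA')
in_range = (banks['Closing Year'] > 7) & (banks['Closing Year'] < 11)

# & compares the two conditions row by row
combined = in_ga & in_range
print(combined.tolist())

# `and` tries to treat each whole Series as one True/False value and fails
try:
    in_ga and in_range
    raised = False
except ValueError as err:
    raised = True
    print('ValueError:', err)
```

The parentheses around each comparison matter: `&` and `|` bind more tightly than `==`, `>` and `<`, so leaving them out changes what is being compared.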

#### Disjunction of multiple filtering conditions: `|`
Suppose you would like to take a look at all the failed banks in the state of Georgia or the state of New York. How do you use `df.loc[ ]` to get the target rows?

In [38]:
# Create the two filtering conditions restricting the state to GA and NY
filt1 = (df['State'] == 'GA')
filt2 = (df['State'] == 'NY')

In [39]:
# Use filt1 and filt2 to get the target rows
df.loc[filt1|filt2]

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date,Fund,Closing Year
17,The Woodbury Banking Company,Woodbury,GA,11297,United Bank,19-Aug-16,10521,16
22,The Bank of Georgia,Peachtree City,GA,35259,Fidelity Bank,2-Oct-15,10516,15
26,Capitol City Bank & Trust Company,Atlanta,GA,33938,First-Citizens Bank & Trust Company,13-Feb-15,10512,15
34,Eastside Commercial Bank,Conyers,GA,58125,Community & Southern Bank,18-Jul-14,10504,14
59,Sunrise Bank,Valdosta,GA,58185,Synovus Bank,10-May-13,10481,13
61,Douglas County Bank,Douglasville,GA,21649,Hamilton State Bank,26-Apr-13,10476,13
67,Frontier Bank,LaGrange,GA,16431,HeritageBank of the South,8-Mar-13,10471,13
72,Hometown Community Bank,Braselton,GA,57928,"CertusBank, National Association",16-Nov-12,10466,12
83,Jasper Banking Company,Jasper,GA,16240,Stearns Bank N.A.,27-Jul-12,10455,12
86,First Cherokee State Bank,Woodstock,GA,32711,Community & Southern Bank,20-Jul-12,10450,12


If you would like to get the data of the failed banks in six states (Georgia, New York, New Jersey, Florida, California and West Virginia), you will not want to write six filtering conditions and connect them all with the vertical bar `|`. That would be too repetitive. In this case, we can use the `.isin()` method to create a filtering condition.

In [40]:
# Create a list of the states
states = ['GA', 'NY', 'NJ', 'FL', 'CA', 'WV']

In [41]:
# Create a filtering condition
filt = (df['State'].isin(states))

In [42]:
# Use filt to find all failed banks in the six states
df.loc[filt]

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date,Fund,Closing Year
1,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537,20
2,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536,20
4,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534,19
15,Harvest Community Bank,Pennsville,NJ,34951,First-Citizens Bank & Trust Company,13-Jan-17,10523,17
17,The Woodbury Banking Company,Woodbury,GA,11297,United Bank,19-Aug-16,10521,16
22,The Bank of Georgia,Peachtree City,GA,35259,Fidelity Bank,2-Oct-15,10516,15
26,Capitol City Bank & Trust Company,Atlanta,GA,33938,First-Citizens Bank & Trust Company,13-Feb-15,10512,15
28,First National Bank of Crestview,Crestview,FL,17557,First NBC Bank,16-Jan-15,10510,15
30,"Frontier Bank, FSB D/B/A El Paseo Bank",Palm Desert,CA,34738,"Bank of Southern California, N.A.",7-Nov-14,10508,14
34,Eastside Commercial Bank,Conyers,GA,58125,Community & Southern Bank,18-Jul-14,10504,14
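
As a quick check of how `.isin()` relates to `|`, the sketch below uses a made-up dataframe (`banks`) and a two-state list. It also shows that `.isin()` can be combined with `~` to keep the rows that are not in the list.

```python
import pandas as pd

# Made-up data for illustration
banks = pd.DataFrame({
    'Bank Name': ['Bank A', 'Bank B', 'Bank C', 'Bank D'],
    'State': ['GA', 'TX', 'NY', 'FL'],
})
states = ['GA', 'NY']

# One .isin() call ...
with_isin = banks['State'].isin(states)

# ... gives the same filter as chaining == comparisons with |
with_or = (banks['State'] == 'GA') | (banks['State'] == 'NY')

# ~ flips the filter: banks outside the listed states
not_in = ~banks['State'].isin(states)
print(banks.loc[not_in, 'Bank Name'].tolist())
```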


#### Negation of a certain condition: `~`
Now, suppose you would like to filter all the failed banks that were **not** closed in 2008. How do you do it?

In [43]:
# Create the filtering condition excluding the closing year 2008 (stored as 8)
filt = (~(df['Closing Year'] == 8))

In [44]:
# Use the filtering condition to get the target rows with specified columns
df.loc[filt, ['Bank Name', 'City']]

Unnamed: 0,Bank Name,City
0,Almena State Bank,Almena
1,First City Bank of Florida,Fort Walton Beach
2,The First State Bank,Barboursville
3,Ericson State Bank,Ericson
4,City National Bank of New Jersey,Newark
5,Resolute Bank,Maumee
6,Louisa Community Bank,Louisa
7,The Enloe State Bank,Cooper
8,Washington Federal Bank for Savings,Chicago
9,The Farmers and Merchants State Bank of Argonia,Argonia
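
The parentheses around the comparison are important when using `~`. The sketch below, on a made-up Series of two-digit years, shows that `~(years == 8)` matches `years != 8`, while dropping the parentheses applies `~` to the numbers first and gives a different result.

```python
import pandas as pd

# Made-up two-digit closing years
years = pd.Series([8, 9, 20])

# Negate the comparison: True wherever the year is not 8
negated = ~(years == 8)

# For a single comparison, != gives the same filter
not_equal = (years != 8)

# Without parentheses, ~ is applied to the integers before the comparison
without_parens = (~years == 8)

print(negated.tolist())
print(without_parens.tolist())
```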


<h3 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h3>

Let's do some filtering!

From the Shakespeare dataframe, get the title and the creator of the documents published between August of 2000 **and** May of 2010.

From the Shakespeare dataframe, get the creator of the documents shorter than 10 pages **or** longer than 50 pages. 

From the Shakespeare dataframe, get the title of the documents whose publisher is **not** Folger Shakespeare Library. 

## Update a dataframe
We can make changes to the data in a dataframe.
### Update headers
We can update the column names of a dataframe.

In [45]:
# Take a look at the columns
df.columns

Index(['Bank Name', 'City', 'State', 'Cert', 'Acquiring Institution',
       'Closing Date', 'Fund', 'Closing Year'],
      dtype='object')

In [46]:
# Access a column using the `.ColumnName` attribute
df.City

0                  Almena
1       Fort Walton Beach
2           Barboursville
3                 Ericson
4                  Newark
5                  Maumee
6                  Louisa
7                  Cooper
8                 Chicago
9                 Argonia
10             Saint Elmo
11              Milwaukee
12            New Orleans
13     Cottonwood Heights
14                Chicago
15             Pennsville
16               Mulberry
17               Woodbury
18        King of Prussia
19                Memphis
              ...        
543          Philadelphia
544        Blanchardville
545              Torrance
546           Cheneyville
547                 Alamo
548               Atlanta
549               Chicago
550              Stamford
551       Shelby Township
552            Boca Raton
553               Phoenix
554               Oakwood
555         Sierra Blanca
556                 Miami
557              Gravette
558              Hinsdale
559                 Malta
560         

In [47]:
# If a column name has a space in it
df.Bank Name

SyntaxError: invalid syntax (3808230729.py, line 2)

We could replace all the spaces in column names with an `_`. In this way, we can access all the columns using the `.ColumnName` attribute.

In [48]:
# Replace spaces in column names with underscores
df.columns = df.columns.str.replace(' ', '_')
df

Unnamed: 0,Bank_Name,City,State,Cert,Acquiring_Institution,Closing_Date,Fund,Closing_Year
0,Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20,10538,20
1,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537,20
2,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536,20
3,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20,10535,20
4,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534,19
5,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19,10533,19
6,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19,10532,19
7,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19,10531,19
8,Washington Federal Bank for Savings,Chicago,IL,30570,Royal Savings Bank,15-Dec-17,10530,17
9,The Farmers and Merchants State Bank of Argonia,Argonia,KS,17719,Conway Bank,13-Oct-17,10529,17
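
Note that renaming is only needed for the `.ColumnName` shortcut. As a small sketch with a made-up one-row dataframe (`banks`), bracket notation works with spaces in the name, and attribute access works once the spaces are replaced.

```python
import pandas as pd

# Made-up data with spaces in the column names
banks = pd.DataFrame({
    'Bank Name': ['Almena State Bank'],
    'Closing Date': ['23-Oct-20'],
})

# Bracket notation works even when the name contains a space
by_bracket = banks['Bank Name']

# After replacing spaces with underscores, attribute access works too
banks.columns = banks.columns.str.replace(' ', '_')
by_attribute = banks.Bank_Name

print(list(banks.columns))
```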


You could also change the case of the headers.

In [49]:
# Change all headers to upper case
df.columns = df.columns.str.upper()
df

Unnamed: 0,BANK_NAME,CITY,STATE,CERT,ACQUIRING_INSTITUTION,CLOSING_DATE,FUND,CLOSING_YEAR
0,Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20,10538,20
1,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537,20
2,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536,20
3,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20,10535,20
4,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534,19
5,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19,10533,19
6,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19,10532,19
7,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19,10531,19
8,Washington Federal Bank for Savings,Chicago,IL,30570,Royal Savings Bank,15-Dec-17,10530,17
9,The Farmers and Merchants State Bank of Argonia,Argonia,KS,17719,Conway Bank,13-Oct-17,10529,17


We have been updating the column names all at one time. However, oftentimes we just want to update specific columns. In this case, we could use the `df.rename()` method and pass in a **dictionary** where the keys are the original column names and the values are the new column names.

In [50]:
# Change the column name of 'CERT' to 'CERTIFICATE_NUM'
df.rename(columns = {'CERT':'CERTIFICATE_NUM'})

Unnamed: 0,BANK_NAME,CITY,STATE,CERTIFICATE_NUM,ACQUIRING_INSTITUTION,CLOSING_DATE,FUND,CLOSING_YEAR
0,Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20,10538,20
1,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537,20
2,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536,20
3,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20,10535,20
4,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534,19
5,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19,10533,19
6,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19,10532,19
7,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19,10531,19
8,Washington Federal Bank for Savings,Chicago,IL,30570,Royal Savings Bank,15-Dec-17,10530,17
9,The Farmers and Merchants State Bank of Argonia,Argonia,KS,17719,Conway Bank,13-Oct-17,10529,17


To change multiple column names, we just pass in a dictionary to `df.rename()` with multiple key:value pairs.

In [51]:
# Change multiple column names
df.rename(columns = {'CERT':'CERTIFICATE_NUM', 'FUND':'FUND_IN_THOUSAND'})

Unnamed: 0,BANK_NAME,CITY,STATE,CERTIFICATE_NUM,ACQUIRING_INSTITUTION,CLOSING_DATE,FUND_IN_THOUSAND,CLOSING_YEAR
0,Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20,10538,20
1,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537,20
2,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536,20
3,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20,10535,20
4,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534,19
5,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19,10533,19
6,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19,10532,19
7,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19,10531,19
8,Washington Federal Bank for Savings,Chicago,IL,30570,Royal Savings Bank,15-Dec-17,10530,17
9,The Farmers and Merchants State Bank of Argonia,Argonia,KS,17719,Conway Bank,13-Oct-17,10529,17


In [52]:
# Make the change permanent
df.rename(columns = {'CERT':'CERTIFICATE_NUM', 'FUND':'FUND_IN_THOUSAND'}, inplace = True)
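
To make the role of `inplace = True` concrete, the sketch below uses a made-up dataframe (`banks`). Without `inplace`, `.rename()` returns a new dataframe and leaves the original untouched. Assigning the result back to the same variable is another common way to keep the change.

```python
import pandas as pd

# Made-up data for illustration
banks = pd.DataFrame({'CERT': [15426, 16748], 'FUND': [10538, 10537]})

# Without inplace, rename returns a renamed copy ...
renamed = banks.rename(columns={'CERT': 'CERTIFICATE_NUM'})

# ... and the original column names are unchanged
columns_after_plain_call = list(banks.columns)

# Assigning the result back keeps the change
banks = banks.rename(columns={'CERT': 'CERTIFICATE_NUM', 'FUND': 'FUND_IN_THOUSAND'})

print(columns_after_plain_call)
print(list(banks.columns))
```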

### Update rows 
How do we update the values in a row? In [Pandas 1](./pandas-1.ipynb), we learned how to look up values using `.loc` and `.iloc`.

To update a row, we could use `.loc` or `.iloc` to locate it and then assign the new values to that row.

In [53]:
# Change an entire row
df.loc[0] = ['Almena State Bank', 'Almena', 'KS', 15426, 'Equity Bank', '23-Oct-20', 10000, 20]

It is tedious to enter a value for every column of a row when we only want to update some of them. If the dataframe has hundreds or even thousands of columns, it will take too much time. Manually entering all the values is also error-prone. Instead, we can pass a column name to `.loc[]` to update a single value.

In [54]:
# Change a specific value in a row
df.loc[0, 'FUND_IN_THOUSAND'] = 10001
df

Unnamed: 0,BANK_NAME,CITY,STATE,CERTIFICATE_NUM,ACQUIRING_INSTITUTION,CLOSING_DATE,FUND_IN_THOUSAND,CLOSING_YEAR
0,Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20,10001,20
1,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537,20
2,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536,20
3,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20,10535,20
4,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534,19
5,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19,10533,19
6,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19,10532,19
7,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19,10531,19
8,Washington Federal Bank for Savings,Chicago,IL,30570,Royal Savings Bank,15-Dec-17,10530,17
9,The Farmers and Merchants State Bank of Argonia,Argonia,KS,17719,Conway Bank,13-Oct-17,10529,17


We could change multiple specific values in a row using `.loc[]`. 

In [55]:
# Change multiple values in a row
df.loc[0, ['BANK_NAME', 'FUND_IN_THOUSAND']] = ['Almena Bank', 12000]
df

Unnamed: 0,BANK_NAME,CITY,STATE,CERTIFICATE_NUM,ACQUIRING_INSTITUTION,CLOSING_DATE,FUND_IN_THOUSAND,CLOSING_YEAR
0,Almena Bank,Almena,KS,15426,Equity Bank,23-Oct-20,12000,20
1,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537,20
2,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536,20
3,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20,10535,20
4,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534,19
5,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19,10533,19
6,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19,10532,19
7,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19,10531,19
8,Washington Federal Bank for Savings,Chicago,IL,30570,Royal Savings Bank,15-Dec-17,10530,17
9,The Farmers and Merchants State Bank of Argonia,Argonia,KS,17719,Conway Bank,13-Oct-17,10529,17
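
The examples above use `.loc`, which works with labels. As a minimal sketch of the `.iloc` counterpart, the made-up dataframe below (`banks`) is given the index labels 10 and 20 so that labels and integer positions are easy to tell apart.

```python
import pandas as pd

# Made-up data; the index labels (10, 20) differ from the positions (0, 1)
banks = pd.DataFrame(
    {
        'BANK_NAME': ['Almena State Bank', 'Ericson State Bank'],
        'FUND_IN_THOUSAND': [10538, 10535],
    },
    index=[10, 20],
)

# .loc uses the row label and the column name
banks.loc[10, 'FUND_IN_THOUSAND'] = 10001

# .iloc uses integer positions: second row, second column
banks.iloc[1, 1] = 10002

print(banks)
```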


### Update columns
There are multiple methods we can use to update columns: `.apply()`, `.map()` and `.replace()`.

In [56]:
# Use apply to update the column CLOSING_YEAR
df['CLOSING_YEAR'] = df['CLOSING_YEAR'].apply(lambda r: r+2000)
df

Unnamed: 0,BANK_NAME,CITY,STATE,CERTIFICATE_NUM,ACQUIRING_INSTITUTION,CLOSING_DATE,FUND_IN_THOUSAND,CLOSING_YEAR
0,Almena Bank,Almena,KS,15426,Equity Bank,23-Oct-20,12000,2020
1,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537,2020
2,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536,2020
3,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20,10535,2020
4,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534,2019
5,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19,10533,2019
6,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19,10532,2019
7,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19,10531,2019
8,Washington Federal Bank for Savings,Chicago,IL,30570,Royal Savings Bank,15-Dec-17,10530,2017
9,The Farmers and Merchants State Bank of Argonia,Argonia,KS,17719,Conway Bank,13-Oct-17,10529,2017


In [57]:
# Use .map() to update specific values in a column; values not in the dictionary become NaN
df['BANK_NAME'].map({'Almena Bank': 'Almena State Bank', 'The First State Bank': 'West Virginia Bank'})

0       Almena State Bank
1                     NaN
2      West Virginia Bank
3                     NaN
4                     NaN
5                     NaN
6                     NaN
7                     NaN
8                     NaN
9                     NaN
10                    NaN
11                    NaN
12                    NaN
13                    NaN
14                    NaN
15                    NaN
16                    NaN
17                    NaN
18                    NaN
19                    NaN
              ...        
543                   NaN
544                   NaN
545                   NaN
546                   NaN
547                   NaN
548                   NaN
549                   NaN
550                   NaN
551                   NaN
552                   NaN
553                   NaN
554                   NaN
555                   NaN
556                   NaN
557                   NaN
558                   NaN
559                   NaN
560         

In [58]:
# Use .replace() to update specific values in a column while maintaining the rest
df['BANK_NAME'].replace({'Almena Bank': 'Almena State Bank', 'The First State Bank': 'West Virginia Bank'})

0                                      Almena State Bank
1                             First City Bank of Florida
2                                     West Virginia Bank
3                                     Ericson State Bank
4                       City National Bank of New Jersey
5                                          Resolute Bank
6                                  Louisa Community Bank
7                                   The Enloe State Bank
8                    Washington Federal Bank for Savings
9        The Farmers and Merchants State Bank of Argonia
10                                   Fayette County Bank
11     Guaranty Bank, (d/b/a BestBank in Georgia & Mi...
12                                        First NBC Bank
13                                         Proficio Bank
14                         Seaway Bank and Trust Company
15                                Harvest Community Bank
16                                           Allied Bank
17                          The
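
Neither of the two cells above changes `df`, because the result is only displayed and not assigned back. The sketch below uses a made-up Series (`names`) to compare the two methods side by side and to show one way, under these assumptions, to keep the unmatched values when using `.map()`.

```python
import pandas as pd

# Made-up bank names for illustration
names = pd.Series(['Almena Bank', 'Ericson State Bank', 'The First State Bank'])
changes = {'Almena Bank': 'Almena State Bank', 'The First State Bank': 'West Virginia Bank'}

# .map() returns NaN for every value that is not a key in the dictionary
mapped = names.map(changes)

# .replace() leaves values that are not in the dictionary as they are
replaced = names.replace(changes)

# One way to keep unmatched values with .map(): fill the gaps from the original
mapped_keep = names.map(changes).fillna(names)

print(replaced.tolist())
```

To keep such a change in a dataframe, assign the result back to the column, for example `df['BANK_NAME'] = df['BANK_NAME'].replace(...)`.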

We can also use a filtering condition to locate the target rows and then update a column in those rows. 

For example, we can locate all the banks that failed in 2020 and change their closing date to 'Recent'.

In [59]:
# Make a filtering condition to get the banks that failed in 2020
filt = (df['CLOSING_YEAR'] == 2020)

In [60]:
# Use the filtering condition to locate the rows and update their CLOSING_DATE
df.loc[filt, 'CLOSING_DATE'] = 'Recent'
df

Unnamed: 0,BANK_NAME,CITY,STATE,CERTIFICATE_NUM,ACQUIRING_INSTITUTION,CLOSING_DATE,FUND_IN_THOUSAND,CLOSING_YEAR
0,Almena Bank,Almena,KS,15426,Equity Bank,Recent,12000,2020
1,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",Recent,10537,2020
2,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",Recent,10536,2020
3,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,Recent,10535,2020
4,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534,2019
5,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19,10533,2019
6,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19,10532,2019
7,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19,10531,2019
8,Washington Federal Bank for Savings,Chicago,IL,30570,Royal Savings Bank,15-Dec-17,10530,2017
9,The Farmers and Merchants State Bank of Argonia,Argonia,KS,17719,Conway Bank,13-Oct-17,10529,2017
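
Before overwriting values with a filter, it can help to check how many rows the filter matches. The sketch below does this on a made-up dataframe (`banks`): summing a filter counts its `True` values, and the assignment then changes only those rows.

```python
import pandas as pd

# Made-up data for illustration
banks = pd.DataFrame({
    'CLOSING_DATE': ['23-Oct-20', '1-Nov-19', '3-Apr-20'],
    'CLOSING_YEAR': [2020, 2019, 2020],
})

filt = (banks['CLOSING_YEAR'] == 2020)

# True counts as 1, so the sum is the number of matching rows
rows_to_change = int(filt.sum())

# Only the matching rows are updated; the others keep their value
banks.loc[filt, 'CLOSING_DATE'] = 'Recent'

print(rows_to_change)
print(banks)
```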


<h3 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h3>

Make all column names in the Shakespeare dataframe upper case. 

Get all documents whose current title is 'Review Article' and change their title to 'Review'.

Get all documents whose word count exceeds 5000 and change their word count to the string 'Long article'.

___
## Lesson Complete

Congratulations! You have completed *Pandas 2*.

### Start Next Lesson: [Pandas 3 ->](./pandas-3.ipynb)

### Exercise Solutions
Here are a few solutions for exercises in this lesson.

In [61]:
# Read in the metadata
shake = pd.read_csv(metadata)

In [62]:
# Set the maximum number of columns to display to 20
pd.set_option('display.max_columns', 20)

In [63]:
# Explore the dataframe
shake.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   id                  1500 non-null   object 
 1   title               1500 non-null   object 
 2   isPartOf            1500 non-null   object 
 3   publicationYear     1500 non-null   int64  
 4   doi                 0 non-null      float64
 5   docType             1500 non-null   object 
 6   provider            1500 non-null   object 
 7   datePublished       1500 non-null   object 
 8   issueNumber         1467 non-null   float64
 9   volumeNumber        1500 non-null   int64  
 10  url                 1500 non-null   object 
 11  creator             1209 non-null   object 
 12  publisher           1500 non-null   object 
 13  language            1500 non-null   object 
 14  pageStart           1443 non-null   object 
 15  pageEnd             1443 non-null   object 
 16  placeO

In [64]:
# Drop the doi and placeOfPublication columns, make the change permanent
shake.drop(columns=['doi', 'placeOfPublication'], inplace=True)

In [66]:
# Get the title and the creator of the documents published from 2000 through 2010 (this filters by year only, not by month)
filt = (shake['publicationYear']>1999) & (shake['publicationYear']<2011)
shake.loc[filt, ['title', 'creator']]

Unnamed: 0,title,creator
0,Fragments of Nationalism in Troilus and Cressida,Matthew A. Greenfield
15,Citizens' Games: Differentiating Collaboration...,Nina Levine
26,Review Article,Heather Hirschfeld
28,Review Article,Christa Jansohn
31,The Alcestis and the Statue Scene in the Winte...,Sarah Dewar-Watson
35,Review Article,Michelle Ephraim
39,Review Article,Jane Donawerth
40,"When Theaters Were Bear-Gardens; Or, What's at...",Jason Scott-Warren
48,Back Matter,
51,Review Article,Arthur F. Kinney


In [67]:
# get the creator of the documents shorter than 10 pages or longer than 50 pages
filt = (shake['pageCount']<10)|(shake['pageCount']>50)
shake.loc[filt, 'creator']

2                                 R. G. Cox
3                              Darryl Gless
4                           Marion O'Connor
5                    Jeanne Addison Roberts
7                          Wilhelm Hortmann
8                        A. H. R. Fairchild
9                              John W. Velz
10                         Anna Maria Crinò
11                                      NaN
13                                      NaN
16                        Alexander Leggatt
17                        Ian Forbes Fraser
18                            Russ McDonald
19                                      NaN
20                             E. G. Rogers
21                            Bonamy Dobrée
22                                      NaN
23                              David Frost
24                           G. B. Harrison
25                     Louisa Foulke Newlin
                       ...                 
1476                                    NaN
1477                       Marvi

In [68]:
# get the title of the documents whose publisher is not Folger Shakespeare Library
filt = (shake['publisher']=='Folger Shakespeare Library')
shake.loc[~filt, 'title']

48                                            Back Matter
118     Bibliography: Shakespeare: Annotated World Bib...
257     Bibliography: Shakespeare: Annotated World Bib...
287                                          Front Matter
366     BIBLIOGRAPHY: World Shakespeare Bibliography 1992
664                                          Front Matter
1089                                         Front Matter
1100    BIBLIOGRAPHY: World Shakespeare Bibliography 1993
Name: title, dtype: object

In [69]:
# Make all column names in the Shakespeare dataframe upper case
shake.columns = shake.columns.str.upper()

In [70]:
# Get all documents whose current title is 'Review Article' and change their title to 'Review'
shake.loc[shake['TITLE']=='Review Article', 'TITLE'] = 'Review'

In [71]:
# Get all documents whose word count exceeds 5000 and change their word count to the string 'Long article'
shake.loc[shake['WORDCOUNT']>5000, 'WORDCOUNT'] = 'Long article'
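
One caveat for the last solution: the `WORDCOUNT` column holds integers, and recent versions of Pandas warn about (or may refuse) writing a string into an integer column. A cautious sketch, using a small made-up dataframe (`docs`) in place of `shake`, is to build the filter first and convert the column to the generic `object` type before assigning the string.

```python
import pandas as pd

# Made-up data standing in for the Shakespeare dataframe
docs = pd.DataFrame({
    'TITLE': ['Doc A', 'Doc B', 'Doc C'],
    'WORDCOUNT': [1200, 7300, 5000],
})

# Build the filter while the column is still numeric
long_docs = (docs['WORDCOUNT'] > 5000)

# Convert to object so the column can hold both numbers and strings
docs['WORDCOUNT'] = docs['WORDCOUNT'].astype(object)

# Now the string can be assigned to the matching rows
docs.loc[long_docs, 'WORDCOUNT'] = 'Long article'

print(docs)
```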