<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

This notebook is adapted by Zhuo Chen from the notebooks created by [Nathan Kelber](http://nkelber.com), [William Mattingly](https://datascience.si.edu/people/dr-william-mattingly) and [Melanie Walsh](https://melaniewalsh.org) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email zhuo.chen@ithaka.org or nathan.kelber@ithaka.org.<br />
___

# Pandas 2

**Description:** This notebook describes how to:
* Sort a dataframe
* Filter data in a dataframe
* Update data in a dataframe

This is the second notebook in a series on learning to use Pandas. 

**Use Case:** For Learners (Detailed explanation, not ideal for researchers)

**Difficulty:** Intermediate

**Knowledge Required:** 
* [Pandas 1](./pandas-1.ipynb)
* Python Basics ([Start Python Basics I](./python-basics-1.ipynb))

**Knowledge Recommended:** 
* [Python Intermediate 2](./python-intermediate-2.ipynb)
* [Python Intermediate 4](./python-intermediate-4.ipynb)

**Completion Time:** 90 minutes

**Data Format:** CSV (.csv)

**Libraries Used:** Pandas

**Research Pipeline:** None
___


In [1]:
# Import pandas library, `as pd` allows us to shorten typing `pandas` to `pd` when we call pandas
import pandas as pd

## Sort a dataframe

In this section, we will continue working with the dataframe we created in Pandas 1 storing data on the most recent 10 World Cup games. 

In [2]:
# Create a dataframe with world cup data
wcup = pd.DataFrame({"Year": [2022, 
                              2018, 
                              2014, 
                              2010, 
                              2006, 
                              2002, 
                              1998, 
                              1994, 
                              1990,
                              1986], 
                     "Champion": ["Argentina", 
                                  "France", 
                                  "Germany", 
                                  "Spain", 
                                  "Italy", 
                                  "Brazil", 
                                  "France", 
                                  "Brazil", 
                                  "Germany", 
                                  "Argentina"], 
                     "Host": ["Qatar", 
                              "Russia", 
                              "Brazil", 
                              "South Africa", 
                              "Germany", 
                              "Korea/Japan", 
                              "France", 
                              "USA", 
                              "Italy", 
                              "Mexico"],
                     "Score": ["7-5", 
                               "4-2", 
                               "1-0", 
                               "1-0", 
                               "6-4", 
                               "2-0", 
                               "3-0", 
                               "3-2", 
                               "1-0", 
                               "3-2"]
                    })
wcup['Goals Scored'] = wcup['Score'].apply(lambda r: r.split('-')[0])
wcup['Goals Conceded'] = wcup['Score'].apply(lambda r: r.split('-')[1])
wcup['Difference'] = wcup['Goals Scored'].astype(int) - wcup['Goals Conceded'].astype(int)
wcup

Unnamed: 0,Year,Champion,Host,Score,Goals Scored,Goals Conceded,Difference
0,2022,Argentina,Qatar,7-5,7,5,2
1,2018,France,Russia,4-2,4,2,2
2,2014,Germany,Brazil,1-0,1,0,1
3,2010,Spain,South Africa,1-0,1,0,1
4,2006,Italy,Germany,6-4,6,4,2
5,2002,Brazil,Korea/Japan,2-0,2,0,2
6,1998,France,France,3-0,3,0,3
7,1994,Brazil,USA,3-2,3,2,1
8,1990,Germany,Italy,1-0,1,0,1
9,1986,Argentina,Mexico,3-2,3,2,1


### Set, reset and use indexes
We have seen that by default, the rows in a dataframe are numbered by integer indexes starting from 0. The indexes look like a column to the far left without a name. 

We can set the index column to one of the columns in the dataframe. This is desirable because a range of integers is not descriptive but a column with a name is descriptive. When we want to locate specific data, descriptive labels are much more useful. 

In [3]:
# Set index column to 'Host'
wcup.set_index('Year')

Unnamed: 0_level_0,Champion,Host,Score,Goals Scored,Goals Conceded,Difference
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2022,Argentina,Qatar,7-5,7,5,2
2018,France,Russia,4-2,4,2,2
2014,Germany,Brazil,1-0,1,0,1
2010,Spain,South Africa,1-0,1,0,1
2006,Italy,Germany,6-4,6,4,2
2002,Brazil,Korea/Japan,2-0,2,0,2
1998,France,France,3-0,3,0,3
1994,Brazil,USA,3-2,3,2,1
1990,Germany,Italy,1-0,1,0,1
1986,Argentina,Mexico,3-2,3,2,1


Take a look at the original dataframe, is it changed? 

In [4]:
# Take a look at the original dataframe
wcup

Unnamed: 0,Year,Champion,Host,Score,Goals Scored,Goals Conceded,Difference
0,2022,Argentina,Qatar,7-5,7,5,2
1,2018,France,Russia,4-2,4,2,2
2,2014,Germany,Brazil,1-0,1,0,1
3,2010,Spain,South Africa,1-0,1,0,1
4,2006,Italy,Germany,6-4,6,4,2
5,2002,Brazil,Korea/Japan,2-0,2,0,2
6,1998,France,France,3-0,3,0,3
7,1994,Brazil,USA,3-2,3,2,1
8,1990,Germany,Italy,1-0,1,0,1
9,1986,Argentina,Mexico,3-2,3,2,1


The original dataframe is **NOT** changed after we use the `.set_index()` method to change the index column. This is because in Pandas, we have a distinction between a view and a copy. When a view of the dataframe is returned, any change we make will affect the original dataframe, but when a copy is returned, any change we make only affects the copy, not the original dataframe. The `.set_index()` method returns a copy, this is why the original dataframe is not affected.  

If you want to make the change permanent, there is a parameter `inplace` you can use. If you set this parameter to `True`, the change will be made in place and the original dataframe will be changed. 

In [5]:
# Change the index column and commit the change
wcup.set_index('Year', inplace=True)
wcup

Unnamed: 0_level_0,Champion,Host,Score,Goals Scored,Goals Conceded,Difference
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2022,Argentina,Qatar,7-5,7,5,2
2018,France,Russia,4-2,4,2,2
2014,Germany,Brazil,1-0,1,0,1
2010,Spain,South Africa,1-0,1,0,1
2006,Italy,Germany,6-4,6,4,2
2002,Brazil,Korea/Japan,2-0,2,0,2
1998,France,France,3-0,3,0,3
1994,Brazil,USA,3-2,3,2,1
1990,Germany,Italy,1-0,1,0,1
1986,Argentina,Mexico,3-2,3,2,1


You could also sort the index column. Here, we have a numerical column as our index colummn. When we sort the indexes, by default, the dataframe will be sorted by the index column in an ascending order. 

In [6]:
# Sort the indexes
wcup.sort_index()

Unnamed: 0_level_0,Champion,Host,Score,Goals Scored,Goals Conceded,Difference
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1986,Argentina,Mexico,3-2,3,2,1
1990,Germany,Italy,1-0,1,0,1
1994,Brazil,USA,3-2,3,2,1
1998,France,France,3-0,3,0,3
2002,Brazil,Korea/Japan,2-0,2,0,2
2006,Italy,Germany,6-4,6,4,2
2010,Spain,South Africa,1-0,1,0,1
2014,Germany,Brazil,1-0,1,0,1
2018,France,Russia,4-2,4,2,2
2022,Argentina,Qatar,7-5,7,5,2


You could set the parameter `ascending=False` to sort the indexes in a descending order.

In [7]:
# Specify the ascending order
wcup.sort_index(ascending=False)

Unnamed: 0_level_0,Champion,Host,Score,Goals Scored,Goals Conceded,Difference
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2022,Argentina,Qatar,7-5,7,5,2
2018,France,Russia,4-2,4,2,2
2014,Germany,Brazil,1-0,1,0,1
2010,Spain,South Africa,1-0,1,0,1
2006,Italy,Germany,6-4,6,4,2
2002,Brazil,Korea/Japan,2-0,2,0,2
1998,France,France,3-0,3,0,3
1994,Brazil,USA,3-2,3,2,1
1990,Germany,Italy,1-0,1,0,1
1986,Argentina,Mexico,3-2,3,2,1


Note that the sorting change is not committed by default. If you want to make the change permanent, again, you will have to add `inplace=True`.

Sometimes we would want to change the index column back to the integer column. In this case, we can use the method `reset_index()`. But again, to make the reset permanent, you will have to add `inplace=True`.

In [8]:
# Reset the index and update the dataframe
wcup.reset_index(inplace=True)
wcup

Unnamed: 0,Year,Champion,Host,Score,Goals Scored,Goals Conceded,Difference
0,2022,Argentina,Qatar,7-5,7,5,2
1,2018,France,Russia,4-2,4,2,2
2,2014,Germany,Brazil,1-0,1,0,1
3,2010,Spain,South Africa,1-0,1,0,1
4,2006,Italy,Germany,6-4,6,4,2
5,2002,Brazil,Korea/Japan,2-0,2,0,2
6,1998,France,France,3-0,3,0,3
7,1994,Brazil,USA,3-2,3,2,1
8,1990,Germany,Italy,1-0,1,0,1
9,1986,Argentina,Mexico,3-2,3,2,1


### Sort by one column

We can sort the entire dataframe by a column other than the index column. 

In [9]:
# Sort the dataframe by the column 'Goals Scored'
wcup.sort_values(by=['Goals Scored'])

Unnamed: 0,Year,Champion,Host,Score,Goals Scored,Goals Conceded,Difference
2,2014,Germany,Brazil,1-0,1,0,1
3,2010,Spain,South Africa,1-0,1,0,1
8,1990,Germany,Italy,1-0,1,0,1
5,2002,Brazil,Korea/Japan,2-0,2,0,2
6,1998,France,France,3-0,3,0,3
7,1994,Brazil,USA,3-2,3,2,1
9,1986,Argentina,Mexico,3-2,3,2,1
1,2018,France,Russia,4-2,4,2,2
4,2006,Italy,Germany,6-4,6,4,2
0,2022,Argentina,Qatar,7-5,7,5,2


### Sort by multiple columns
It is a convention to sort the soccer results first by difference (i.e. how many more goals the champion scored than the runner-up) and then by goals conceded (i.e. how many goals the champion lost). Pandas can easily do that. 

In [10]:
# Sort the dataframe by Difference column in descending order 
# then by Goals Conceded column in ascending order
wcup.sort_values(by=['Difference', 'Goals Conceded'], ascending=[False, True])

Unnamed: 0,Year,Champion,Host,Score,Goals Scored,Goals Conceded,Difference
6,1998,France,France,3-0,3,0,3
5,2002,Brazil,Korea/Japan,2-0,2,0,2
1,2018,France,Russia,4-2,4,2,2
4,2006,Italy,Germany,6-4,6,4,2
0,2022,Argentina,Qatar,7-5,7,5,2
2,2014,Germany,Brazil,1-0,1,0,1
3,2010,Spain,South Africa,1-0,1,0,1
8,1990,Germany,Italy,1-0,1,0,1
7,1994,Brazil,USA,3-2,3,2,1
9,1986,Argentina,Mexico,3-2,3,2,1


## A quick review of how to create a dataframe from a file

In [Pandas 1](./pandas-1.ipynb), we learned how to create a dataframe by passing a **dictionary** to the `DataFrame` method or by reading in a csv or an excel file. 

For example, we can convert the data in a .csv file to a Pandas DataFrame using the `.read_csv()` method. We pass in the location of the .csv file.

In [11]:
### Download the sample file for this Lesson
import urllib
url = 'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/Pandas1_failed_banks_since_2000.csv'
urllib.request.urlretrieve(url, './data/' + url.rsplit('/', 1)[-1])
print('Sample file retrieved.')

Sample file retrieved.


Use the `**File > Open**` menu above to navigate to the `failed_banks_since_2000.csv` in the `/data` folder. Preview its structure before we load it into a dataframe.

In [12]:
# Create a DataFrame `df` from a CSV file using the .read_csv() method
df = pd.read_csv('data/Pandas1_failed_banks_since_2000.csv') # pass in the location of the file
df

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date,Fund
0,Signature Bank,New York,NY,57053,"Signature Bridge Bank, N.A.",12-Mar-23,10540
1,Silicon Valley Bank,Santa Clara,CA,24735,"Silicon Valley Bridge Bank, N.A.",10-Mar-23,10539
2,Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20,10538
3,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537
4,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536
...,...,...,...,...,...,...,...
560,"Superior Bank, FSB",Hinsdale,IL,32646,"Superior Federal, FSB",27-Jul-01,6004
561,Malta National Bank,Malta,OH,6629,North Valley Bank,3-May-01,4648
562,First Alliance Bank & Trust Co.,Manchester,NH,34264,Southern New Hampshire Bank & Trust,2-Feb-01,4647
563,National State Bank of Metropolis,Metropolis,IL,3815,Banterra Bank of Marion,14-Dec-00,4646


In [13]:
# Change the display setting
pd.set_option('display.min_rows', 20) # set the minimum number of rows to display to 20
df

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date,Fund
0,Signature Bank,New York,NY,57053,"Signature Bridge Bank, N.A.",12-Mar-23,10540
1,Silicon Valley Bank,Santa Clara,CA,24735,"Silicon Valley Bridge Bank, N.A.",10-Mar-23,10539
2,Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20,10538
3,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537
4,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536
5,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20,10535
6,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534
7,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19,10533
8,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19,10532
9,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19,10531


By default, Pandas displays the first five rows and the last five rows of the dataframe. You can change the display setting using the `.set_option()` method.

The display setting is global throughout the notebook. Therefore, any dataframe in the current notebook will have this setting in effect.

Now, you see that Pandas displays the first 10 rows and the last 10 rows of the dataframe.

By convention, a dataframe variable is called `df` but we could give it any valid Python variable name. Here, we follow the convention. 

In [14]:
# Get some info about the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 565 entries, 0 to 564
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Bank Name              565 non-null    object
 1   City                   565 non-null    object
 2   State                  565 non-null    object
 3   Cert                   565 non-null    int64 
 4   Acquiring Institution  534 non-null    object
 5   Closing Date           565 non-null    object
 6   Fund                   565 non-null    int64 
dtypes: int64(2), object(5)
memory usage: 31.0+ KB


The `info()` method tells us that there are 565 rows and 7 columns in the dataframe. Almost all columns have 565 non-null values, except the column of `Acquiring Institution`.

<h3 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h3>

In the exercises in this notebook, we'll work on a dataset built from Constellate.

We'll use the `constellate` client to automatically retrieve the [metadata](https://constellate.org/docs/key-terms/#metadata) for a [dataset](https://constellate.org/docs/key-terms/#dataset). We can retrieve [metadata](https://constellate.org/docs/key-terms/#metadata) in a [CSV file](https://constellate.org/docs/key-terms/#csv-file) using the `get_metadata` method.


In [15]:
# Creating a variable `dataset_id` to hold our dataset ID
# The default dataset is Shakespeare Quarterly, 1950-present
# retrieve the metadata
import constellate
dataset_id = "7e41317e-740f-e86a-4729-20dab492e925"
metadata = constellate.get_metadata(dataset_id)

Constellate: use and download of datasets is covered by the Terms & Conditions of Use: https://constellate.org/terms-and-conditions/
All documents from JSTOR published in Shakespeare Quarterly from 1950 - 2020. 6745 documents.
INFO:root:File /Users/zchen/data/7e41317e-740f-e86a-4729-20dab492e925-sampled-metadata.csv exists. Not re-downloading.


The metadata is stored in a .csv file. In the following code cell, read in the data using Pandas. Give the dataframe a name other than `df`. Then print out the dataframe to take a look. 

Use a Pandas method to explore the dataframe. How many rows does it have? How many columns does it have? What is the data type of the data in each column?

## Filter dataframe

A common pipeline in data processing in Pandas is that you create a dataframe from a file and then reduce the dataframe only to the rows and columns that you are interested in. 

We have learned how to use `.loc` and `.iloc` to select part of a dataframe in [Pandas 1](./pandas-1.ipynb). We will learn more ways to do data filtering in this section.

### Work with missing values
It is a common case that datasets have missing values. As you may have already noticed, blank cells in a CSV file show up as NaN in a Pandas DataFrame. For example, in the dataset of failed banks, the `Acquiring Institution` column gives the name when a failed bank was acquired by another institution and is empty otherwise.

In Pandas, we have a bunch of methods that can create a boolean mask over the data.

In [16]:
# Use isna() to check whether a dataframe has missing values
df.isna()

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date,Fund
0,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False
6,False,False,False,False,False,False,False
7,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False
9,False,False,False,False,False,False,False


The `.isna()` method put a mask on the original dataframe. The cells with a non-null value are masked with the boolean value of `False`. The cells with a null value are masked with the boolean value of `True`.

We can also use `.isna()` to check whether a specific column has missing values. 

In [17]:
# Use isna() to check whether a column has missing values
df['Acquiring Institution'].isna()

0      False
1      False
2      False
3      False
4      False
5      False
6      False
7      False
8      False
9      False
       ...  
555     True
556    False
557    False
558    False
559    False
560    False
561    False
562    False
563    False
564    False
Name: Acquiring Institution, Length: 565, dtype: bool

### Drop rows and columns with missing values

If you want to exclude the rows and columns with missing values from your data analysis, you can use the `.dropna()` method to do that.

By default, the `.dropna()` method drops the rows with at least one missing value. 

In [18]:
# Use .dropna() to remove all rows with at least one missing value
df.dropna() # no argument passed in

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date,Fund
0,Signature Bank,New York,NY,57053,"Signature Bridge Bank, N.A.",12-Mar-23,10540
1,Silicon Valley Bank,Santa Clara,CA,24735,"Silicon Valley Bridge Bank, N.A.",10-Mar-23,10539
2,Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20,10538
3,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537
4,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536
5,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20,10535
6,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534
7,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19,10533
8,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19,10532
9,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19,10531


You can also set the axis parameter to 0 to drop the rows with missing values.

In [19]:
# Use .dropna() to remove all rows with at least one missing value
df.dropna(axis=0) # Set the axis to 0

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date,Fund
0,Signature Bank,New York,NY,57053,"Signature Bridge Bank, N.A.",12-Mar-23,10540
1,Silicon Valley Bank,Santa Clara,CA,24735,"Silicon Valley Bridge Bank, N.A.",10-Mar-23,10539
2,Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20,10538
3,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537
4,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536
5,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20,10535
6,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534
7,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19,10533
8,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19,10532
9,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19,10531


Or, you can set the axis parameter to 'rows' drop the rows with missing values.

In [20]:
# Use .dropna() to remove all rows with at least one missing value
df.dropna(axis='rows') # Set the axis to 'rows'

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date,Fund
0,Signature Bank,New York,NY,57053,"Signature Bridge Bank, N.A.",12-Mar-23,10540
1,Silicon Valley Bank,Santa Clara,CA,24735,"Silicon Valley Bridge Bank, N.A.",10-Mar-23,10539
2,Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20,10538
3,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537
4,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536
5,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20,10535
6,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534
7,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19,10533
8,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19,10532
9,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19,10531


If you set the axis parameter to 1, you will drop the columns with missing values. 

In [21]:
# Use .dropna() to remove all columns with at least one missing value
df.dropna(axis=1) # set the axis to 1

Unnamed: 0,Bank Name,City,State,Cert,Closing Date,Fund
0,Signature Bank,New York,NY,57053,12-Mar-23,10540
1,Silicon Valley Bank,Santa Clara,CA,24735,10-Mar-23,10539
2,Almena State Bank,Almena,KS,15426,23-Oct-20,10538
3,First City Bank of Florida,Fort Walton Beach,FL,16748,16-Oct-20,10537
4,The First State Bank,Barboursville,WV,14361,3-Apr-20,10536
5,Ericson State Bank,Ericson,NE,18265,14-Feb-20,10535
6,City National Bank of New Jersey,Newark,NJ,21111,1-Nov-19,10534
7,Resolute Bank,Maumee,OH,58317,25-Oct-19,10533
8,Louisa Community Bank,Louisa,KY,58112,25-Oct-19,10532
9,The Enloe State Bank,Cooper,TX,10716,31-May-19,10531


You can also drop the columns with missing values by setting the axis parameter to 'columns'.

In [22]:
# Use .dropna() to remove all columns with at least one missing value
df.dropna(axis='columns') # set the axis to 'columns'

Unnamed: 0,Bank Name,City,State,Cert,Closing Date,Fund
0,Signature Bank,New York,NY,57053,12-Mar-23,10540
1,Silicon Valley Bank,Santa Clara,CA,24735,10-Mar-23,10539
2,Almena State Bank,Almena,KS,15426,23-Oct-20,10538
3,First City Bank of Florida,Fort Walton Beach,FL,16748,16-Oct-20,10537
4,The First State Bank,Barboursville,WV,14361,3-Apr-20,10536
5,Ericson State Bank,Ericson,NE,18265,14-Feb-20,10535
6,City National Bank of New Jersey,Newark,NJ,21111,1-Nov-19,10534
7,Resolute Bank,Maumee,OH,58317,25-Oct-19,10533
8,Louisa Community Bank,Louisa,KY,58112,25-Oct-19,10532
9,The Enloe State Bank,Cooper,TX,10716,31-May-19,10531


Sometimes we would want to drop a row only if that row has a missing value in a specific column. We can use the subset parameter to specify the column(s) to look for missing values. 

In [23]:
# Specify the columns to look for missing values
df.dropna(subset=['Acquiring Institution'])

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date,Fund
0,Signature Bank,New York,NY,57053,"Signature Bridge Bank, N.A.",12-Mar-23,10540
1,Silicon Valley Bank,Santa Clara,CA,24735,"Silicon Valley Bridge Bank, N.A.",10-Mar-23,10539
2,Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20,10538
3,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537
4,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536
5,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20,10535
6,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534
7,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19,10533
8,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19,10532
9,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19,10531


Note that the `.dropna()` method only returns a copy, not a view. This means that any change you make using the `.dropna()` method will not affect the original dataframe. To make the change permanent, you could either assign the result to the variable where you store the original dataframe to update it; or you could use the parameter `inplace=True`.

<h3 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h3>

When you explore the Shakespeare dataframe, what did you find about the column `doi`? What did you find about the column `placeOfPublication`? Is there any non-null value in them?

We have seen how to exclude rows and columns with at least one missing value. Actually, there is a threshold parameter we can use to specify at least how many non-null values are required to be present in a row or a column for it **not** to be dropped. Read the documentation on the `.dropna()` method, figure out how to use the `threshold` parameter and in the next code cell write a line of code to drop the columns in the Shakespeare dataset which have at least 2 missing values.  

In [24]:
# Drop the columns in the Shakespeare dataset with at least 2 missing values


Sometimes you may want to exclude a row/column from your consideration when you decide whether to drop a row/column. In other words, even if a row/column has a missing value, you don't want to drop it. How do you do that? 

In the dataset with data on failed banks, let's say we want to drop any row with missing values except the rows in `Acquiring Institution`. In other words, we want to preserve the rows in `Acquiring Institution` no matter whether it has missing values or not. 

In [25]:
# Drop any row with missing values except the rows in 'Acquiring Institution'
cols = df.columns.tolist()
cols.remove('Acquiring Institution')
df.dropna(subset=cols)

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date,Fund
0,Signature Bank,New York,NY,57053,"Signature Bridge Bank, N.A.",12-Mar-23,10540
1,Silicon Valley Bank,Santa Clara,CA,24735,"Silicon Valley Bridge Bank, N.A.",10-Mar-23,10539
2,Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20,10538
3,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537
4,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536
5,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20,10535
6,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534
7,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19,10533
8,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19,10532
9,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19,10531


Sometimes, you would want to maintain the rows and columns that have missing values. However, you would want to fill the cells with NaN values with some values which are of the same data type as the other cells in the same column. In this way, when you apply a certain function to a column in a dataframe, you will not run into type error. A common practice to deal with this kind of problem is to use the `.fillna()` method. 

In [26]:
# Fill the missing values
df['Acquiring Institution'].fillna('No Acquirer')

0              Signature Bridge Bank, N.A.
1         Silicon Valley Bridge Bank, N.A.
2                              Equity Bank
3                United Fidelity Bank, fsb
4                           MVB Bank, Inc.
5               Farmers and Merchants Bank
6                          Industrial Bank
7                       Buckeye State Bank
8        Kentucky Farmers Bank Corporation
9                       Legend Bank, N. A.
                      ...                 
555                            No Acquirer
556         The State Bank & Trust Company
557       The Security State Bank of Pecos
558       Israel Discount Bank of New York
559                     Delta Trust & Bank
560                  Superior Federal, FSB
561                      North Valley Bank
562    Southern New Hampshire Bank & Trust
563                Banterra Bank of Marion
564                     Bank of the Orient
Name: Acquiring Institution, Length: 565, dtype: object

### Drop certain columns or rows

We have seen how to drop rows or columns with missing values. Sometimes, even if a row or a column does not have a missing value, you still want to drop them because you will not use them in your analysis anyways. In this case, we will use the `.drop()` method to remove those rows or columns.

You can specify which column you want to drop using the 'columns' parameter. 

In [27]:
# Drop a column by setting the columns parameter
df.drop(columns='Fund')

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date
0,Signature Bank,New York,NY,57053,"Signature Bridge Bank, N.A.",12-Mar-23
1,Silicon Valley Bank,Santa Clara,CA,24735,"Silicon Valley Bridge Bank, N.A.",10-Mar-23
2,Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20
3,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20
4,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20
5,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20
6,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19
7,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19
8,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19
9,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19


You can drop multiple columns at one time. 

In [28]:
# Drop multiple columns by setting the columns parameter
df.drop(columns=['Fund', 'Cert'])

Unnamed: 0,Bank Name,City,State,Acquiring Institution,Closing Date
0,Signature Bank,New York,NY,"Signature Bridge Bank, N.A.",12-Mar-23
1,Silicon Valley Bank,Santa Clara,CA,"Silicon Valley Bridge Bank, N.A.",10-Mar-23
2,Almena State Bank,Almena,KS,Equity Bank,23-Oct-20
3,First City Bank of Florida,Fort Walton Beach,FL,"United Fidelity Bank, fsb",16-Oct-20
4,The First State Bank,Barboursville,WV,"MVB Bank, Inc.",3-Apr-20
5,Ericson State Bank,Ericson,NE,Farmers and Merchants Bank,14-Feb-20
6,City National Bank of New Jersey,Newark,NJ,Industrial Bank,1-Nov-19
7,Resolute Bank,Maumee,OH,Buckeye State Bank,25-Oct-19
8,Louisa Community Bank,Louisa,KY,Kentucky Farmers Bank Corporation,25-Oct-19
9,The Enloe State Bank,Cooper,TX,"Legend Bank, N. A.",31-May-19


Another way to drop a column is to give the label of the column you want to drop and then set the axis parameter to 1.

In [29]:
# Drop a column by setting the axis parameter
df.drop('Fund', axis=1)

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date
0,Signature Bank,New York,NY,57053,"Signature Bridge Bank, N.A.",12-Mar-23
1,Silicon Valley Bank,Santa Clara,CA,24735,"Silicon Valley Bridge Bank, N.A.",10-Mar-23
2,Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20
3,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20
4,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20
5,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20
6,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19
7,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19
8,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19
9,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19


You can also drop multiple columns by setting the axis parameter. 

In [30]:
# Drop multiple columns by setting the axis parameter
df.drop(['Fund', 'Cert'], axis=1)

Unnamed: 0,Bank Name,City,State,Acquiring Institution,Closing Date
0,Signature Bank,New York,NY,"Signature Bridge Bank, N.A.",12-Mar-23
1,Silicon Valley Bank,Santa Clara,CA,"Silicon Valley Bridge Bank, N.A.",10-Mar-23
2,Almena State Bank,Almena,KS,Equity Bank,23-Oct-20
3,First City Bank of Florida,Fort Walton Beach,FL,"United Fidelity Bank, fsb",16-Oct-20
4,The First State Bank,Barboursville,WV,"MVB Bank, Inc.",3-Apr-20
5,Ericson State Bank,Ericson,NE,Farmers and Merchants Bank,14-Feb-20
6,City National Bank of New Jersey,Newark,NJ,Industrial Bank,1-Nov-19
7,Resolute Bank,Maumee,OH,Buckeye State Bank,25-Oct-19
8,Louisa Community Bank,Louisa,KY,Kentucky Farmers Bank Corporation,25-Oct-19
9,The Enloe State Bank,Cooper,TX,"Legend Bank, N. A.",31-May-19


To drop a row, you can specify which row you want to drop using the 'index' parameter.

In [31]:
# Drop a row by setting the index parameter
df.drop(index=0)

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date,Fund
1,Silicon Valley Bank,Santa Clara,CA,24735,"Silicon Valley Bridge Bank, N.A.",10-Mar-23,10539
2,Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20,10538
3,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537
4,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536
5,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20,10535
6,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534
7,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19,10533
8,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19,10532
9,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19,10531
10,Washington Federal Bank for Savings,Chicago,IL,30570,Royal Savings Bank,15-Dec-17,10530


In the next code cell, can you write some code to drop multiple rows from df?

In [32]:
# Drop multiple rows using the index parameter


Another way to drop a row is to give the label of the row you want to drop and then set the axis parameter to 0.

In [33]:
# Drop a row by setting the axis parameter
df.drop(0, axis=0)

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date,Fund
1,Silicon Valley Bank,Santa Clara,CA,24735,"Silicon Valley Bridge Bank, N.A.",10-Mar-23,10539
2,Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20,10538
3,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537
4,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536
5,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20,10535
6,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534
7,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19,10533
8,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19,10532
9,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19,10531
10,Washington Federal Bank for Savings,Chicago,IL,30570,Royal Savings Bank,15-Dec-17,10530


We know that by default, the rows are indexed with integer numbers. You could set the index column to one of the columns of the dataframe. In the next code cell, can you set the index column to the `State` column and then drop all rows with the label 'GA' or 'KS'?

In [34]:
# Drop multiple rows by setting the axis parameter


You might want to drop multiple consecutive rows at one time. The `.drop()` method does not have a parameter for slicing but we can come up with a workaround.

We can use the `.index` property to get the range of indexes for the rows we want to drop and pass them to the `.drop()` method.

In [35]:
# Drop multiple consecutive rows
df.drop(df.index[2:5], axis=0) 

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date,Fund
0,Signature Bank,New York,NY,57053,"Signature Bridge Bank, N.A.",12-Mar-23,10540
1,Silicon Valley Bank,Santa Clara,CA,24735,"Silicon Valley Bridge Bank, N.A.",10-Mar-23,10539
5,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20,10535
6,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534
7,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19,10533
8,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19,10532
9,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19,10531
10,Washington Federal Bank for Savings,Chicago,IL,30570,Royal Savings Bank,15-Dec-17,10530
11,The Farmers and Merchants State Bank of Argonia,Argonia,KS,17719,Conway Bank,13-Oct-17,10529
12,Fayette County Bank,Saint Elmo,IL,1802,"United Fidelity Bank, fsb",26-May-17,10528


To drop multiple consecutive columns, we can use the `.columns` attribute to get the range of the indexes for the columns we want to drop and then pass it to the `.drop()` method. 

In [36]:
# Drop multiple consecutive columns
df.drop(df.columns[3:5], axis=1)

Unnamed: 0,Bank Name,City,State,Closing Date,Fund
0,Signature Bank,New York,NY,12-Mar-23,10540
1,Silicon Valley Bank,Santa Clara,CA,10-Mar-23,10539
2,Almena State Bank,Almena,KS,23-Oct-20,10538
3,First City Bank of Florida,Fort Walton Beach,FL,16-Oct-20,10537
4,The First State Bank,Barboursville,WV,3-Apr-20,10536
5,Ericson State Bank,Ericson,NE,14-Feb-20,10535
6,City National Bank of New Jersey,Newark,NJ,1-Nov-19,10534
7,Resolute Bank,Maumee,OH,25-Oct-19,10533
8,Louisa Community Bank,Louisa,KY,25-Oct-19,10532
9,The Enloe State Bank,Cooper,TX,31-May-19,10531


The `.drop()` method returns a copy, not a view. Therefore, whatever change you make using it will not affect the original dataframe. 

<h3 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h3>

When you explore the Shakespeare dataframe, what did you find about the column `doi`? What did you find about the column `placeOfPublication`? Is there any non-null value in them?

In [37]:
# Drop the columns of doi and the placeofPublication, make the change permanent


### Filter data using conditionals
Conditional selection using `df.loc[]` is a very common method to filter a dataframe. 

You write a filtering condition to filter a target column. The condition then checks, for each cell in the target column, whether it fulfills the condition or not. The results will be returned as a Series of True/False values. The `.loc` indexer then uses this Series to select the rows that have True values. 

Suppose you are interested in the banks that failed since 2000 in the state of Georgia. From the original dataframe, you would like to get all the rows of the failed banks in Georgia. How do you do it?

In [38]:
# Write a filtering condition
df['State'] == 'GA' # Create a boolean mask over the column 'State'

0      False
1      False
2      False
3      False
4      False
5      False
6      False
7      False
8      False
9      False
       ...  
555    False
556    False
557    False
558    False
559    False
560    False
561    False
562    False
563    False
564    False
Name: State, Length: 565, dtype: bool

In [39]:
# Assign the filtering condition to a variable
filt = (df['State'] == 'GA') # Use parenthesis for better reading

In [40]:
# Put the Series returned by the filtering condition within the hard brackets of df.loc[]
df.loc[filt]

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date,Fund
19,The Woodbury Banking Company,Woodbury,GA,11297,United Bank,19-Aug-16,10521
24,The Bank of Georgia,Peachtree City,GA,35259,Fidelity Bank,2-Oct-15,10516
28,Capitol City Bank & Trust Company,Atlanta,GA,33938,First-Citizens Bank & Trust Company,13-Feb-15,10512
36,Eastside Commercial Bank,Conyers,GA,58125,Community & Southern Bank,18-Jul-14,10504
61,Sunrise Bank,Valdosta,GA,58185,Synovus Bank,10-May-13,10481
63,Douglas County Bank,Douglasville,GA,21649,Hamilton State Bank,26-Apr-13,10476
69,Frontier Bank,LaGrange,GA,16431,HeritageBank of the South,8-Mar-13,10471
74,Hometown Community Bank,Braselton,GA,57928,"CertusBank, National Association",16-Nov-12,10466
85,Jasper Banking Company,Jasper,GA,16240,Stearns Bank N.A.,27-Jul-12,10455
88,First Cherokee State Bank,Woodstock,GA,32711,Community & Southern Bank,20-Jul-12,10450


Out of the rows that fulfill the filtering condition, we can further specify which columns to be returned.

In [41]:
# Specify a single column to be returned
df.loc[filt, 'Bank Name']

19          The Woodbury Banking Company
24                   The Bank of Georgia
28     Capitol City Bank & Trust Company
36              Eastside Commercial Bank
61                          Sunrise Bank
63                   Douglas County Bank
69                         Frontier Bank
74               Hometown Community Bank
85                Jasper Banking Company
88             First Cherokee State Bank
                     ...                
495                       FirstCity Bank
496              Freedom Bank of Georgia
506         FirstBank Financial Services
514                     Haven Trust Bank
515         First Georgia Community Bank
518                       Community Bank
522                   Alpha Bank & Trust
528                       Integrity Bank
539                              NetBank
550           AmTrade International Bank
Name: Bank Name, Length: 93, dtype: object

Of course, we can select muliple columns to be returned out of the filtered rows. 

In [42]:
# Specify multiple columns to be returned
df.loc[filt, ['Bank Name', 'Fund']]

Unnamed: 0,Bank Name,Fund
19,The Woodbury Banking Company,10521
24,The Bank of Georgia,10516
28,Capitol City Bank & Trust Company,10512
36,Eastside Commercial Bank,10504
61,Sunrise Bank,10481
63,Douglas County Bank,10476
69,Frontier Bank,10471
74,Hometown Community Bank,10466
85,Jasper Banking Company,10455
88,First Cherokee State Bank,10450


Now suppose you want to get all the failed banks whose name contains the word 'community'.

In [43]:
# Get all the banks with the word 'community' in their name
filt = (df['Bank Name'].str.contains('Community'))
df.loc[filt, ['Bank Name']]

Unnamed: 0,Bank Name
8,Louisa Community Bank
17,Harvest Community Bank
29,Highland Community Bank
49,"Texas Community Bank, National Association"
52,The Community's Bank
54,Community South Bank
56,First Community Bank of Southwest Florida (als...
62,Pisgah Community Bank
65,Chipola Community Bank
72,Westside Community Bank


#### Conjunction of multiple filtering conditions: `&`

Oftentimes, you would want to filter a dataframe based on more complex conditions. For example, suppose you would like to get the banks in GA that were closed between 2008 and 2010. How do you use `df.loc[ ]` to achieve it?

The location of the failed banks is stored in the `State` column. The closing year of the banks is stored in the `Closing Date` column. 

In [44]:
# Create the first filtering condition restricting the state
filt1 = (df['State'] == 'GA')

How to get the closing year of the banks? Recall what we have learned in [Pandas 1](./pandas-1.ipynb) about creating a new column based on an old one. How do you extract the closing year out of the column `Closing Date`? We can use the `.apply()` method.

In [45]:
# Create a new column storing the closing year of the banks
df['Closing Year'] = df['Closing Date'].apply(lambda r: r.split('-')[2])
df['Closing Year'] = df['Closing Year'].astype(int)

In [46]:
# Take a look at the dataframe
df

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date,Fund,Closing Year
0,Signature Bank,New York,NY,57053,"Signature Bridge Bank, N.A.",12-Mar-23,10540,23
1,Silicon Valley Bank,Santa Clara,CA,24735,"Silicon Valley Bridge Bank, N.A.",10-Mar-23,10539,23
2,Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20,10538,20
3,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537,20
4,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536,20
5,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20,10535,20
6,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534,19
7,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19,10533,19
8,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19,10532,19
9,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19,10531,19


In [47]:
# Create the second filtering condition restricting the closing year
filt2 = (df['Closing Year'] > 7) & (df['Closing Year'] < 11)

With the two filtering conditions, we are ready to extract the banks in GA that failed between 2008 and 2010.

In [48]:
# Use filt1 and filt2 to get the target rows
df.loc[filt1 & filt2]

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date,Fund,Closing Year
218,"United Americas Bank, N.A.",Atlanta,GA,35065,State Bank and Trust Company,17-Dec-10,10323,10
219,"Appalachian Community Bank, FSB",McCaysville,GA,58495,Peoples Bank of East Tennessee,17-Dec-10,10319,10
220,Chestatee State Bank,Dawsonville,GA,34578,Bank of the Ozarks,17-Dec-10,10320,10
228,Darby Bank & Trust Co.,Vidalia,GA,14580,Ameris Bank,12-Nov-10,10312,10
229,Tifton Banking Company,Tifton,GA,57831,Ameris Bank,12-Nov-10,10313,10
237,The First National Bank of Barnesville,Barnesville,GA,2119,United Bank,22-Oct-10,10304,10
238,The Gordon Bank,Gordon,GA,33904,Morris Bank,22-Oct-10,10305,10
250,The Peoples Bank,Winder,GA,182,Community & Southern Bank,17-Sep-10,10292,10
251,First Commerce Community Bank,Douglasville,GA,57448,Community & Southern Bank,17-Sep-10,10289,10
252,Bank of Ellijay,Ellijay,GA,58197,Community & Southern Bank,17-Sep-10,10287,10


Note that when we extract rows that fulfill multiple conditions, we use `&` in Pandas, not `and`. If you replace `&` with `and`, you will get an error. This is different than what we have learned about boolean operators in [Python basics 2](./python-basics-2.ipynb). In Python, we use `and`, `or` and `not`. In Pandas, we use `&`, `|` and `~` intead. 

|Pandas Operator|Boolean|Requires|
|---|---|---|
|&|and|All required to `True`|
|\||or|If any are `True`|
|~|not|The opposite|

Although we use different symbols for these boolean operators, the truth table for them stays the same. For a quick review of the truth table, see [Python basics 2](./python-basics-2.ipynb).

#### Disjunction of multiple filtering conditions: `|`
Suppose you would like to take a look at all the failed banks in the state of Georgia or the state of New York. How do you use `df.loc[ ]` to get the target rows?

In [49]:
# Create the two filtering conditions restricting the state to GA and NY
filt1 = (df['State'] == 'GA')
filt2 = (df['State'] == 'NY')

In [50]:
# Use filt1 and filt2 to get the target rows
df.loc[filt1|filt2]

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date,Fund,Closing Year
0,Signature Bank,New York,NY,57053,"Signature Bridge Bank, N.A.",12-Mar-23,10540,23
19,The Woodbury Banking Company,Woodbury,GA,11297,United Bank,19-Aug-16,10521,16
24,The Bank of Georgia,Peachtree City,GA,35259,Fidelity Bank,2-Oct-15,10516,15
28,Capitol City Bank & Trust Company,Atlanta,GA,33938,First-Citizens Bank & Trust Company,13-Feb-15,10512,15
36,Eastside Commercial Bank,Conyers,GA,58125,Community & Southern Bank,18-Jul-14,10504,14
61,Sunrise Bank,Valdosta,GA,58185,Synovus Bank,10-May-13,10481,13
63,Douglas County Bank,Douglasville,GA,21649,Hamilton State Bank,26-Apr-13,10476,13
69,Frontier Bank,LaGrange,GA,16431,HeritageBank of the South,8-Mar-13,10471,13
74,Hometown Community Bank,Braselton,GA,57928,"CertusBank, National Association",16-Nov-12,10466,12
85,Jasper Banking Company,Jasper,GA,16240,Stearns Bank N.A.,27-Jul-12,10455,12


If you would like to get the data of the failed banks in the following six states --- Georgia, New York, New Jersey, Florida, California and West Virginia, you will not want to write six filtering conditions and use the vertical bar `|` to connect all of them. That would be too repetitive. In this case, we can use the `.isin()` method to create a filtering condition.

In [51]:
# Create a list of the states
states = ['GA', 'NY', 'NJ', 'FL', 'CA', 'WV']

In [52]:
# Create a filtering condition
filt = (df['State'].isin(states))

In [53]:
# Use filt to find all failed banks in the six states
df.loc[filt]

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date,Fund,Closing Year
0,Signature Bank,New York,NY,57053,"Signature Bridge Bank, N.A.",12-Mar-23,10540,23
1,Silicon Valley Bank,Santa Clara,CA,24735,"Silicon Valley Bridge Bank, N.A.",10-Mar-23,10539,23
3,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537,20
4,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536,20
6,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534,19
17,Harvest Community Bank,Pennsville,NJ,34951,First-Citizens Bank & Trust Company,13-Jan-17,10523,17
19,The Woodbury Banking Company,Woodbury,GA,11297,United Bank,19-Aug-16,10521,16
24,The Bank of Georgia,Peachtree City,GA,35259,Fidelity Bank,2-Oct-15,10516,15
28,Capitol City Bank & Trust Company,Atlanta,GA,33938,First-Citizens Bank & Trust Company,13-Feb-15,10512,15
30,First National Bank of Crestview,Crestview,FL,17557,First NBC Bank,16-Jan-15,10510,15


#### Negation of a certain condition:`~`
Now, suppose you would like to get all the failed banks that were **not** closed in 2008. How do you do it?

In [54]:
# Create the filtering condition restricting the closing year to non-2008
filt = (~(df['Closing Year'] == 8))

In [55]:
# Use the filtering condition to get the target rows with specified columns
df.loc[filt, ['Bank Name', 'City']]

Unnamed: 0,Bank Name,City
0,Signature Bank,New York
1,Silicon Valley Bank,Santa Clara
2,Almena State Bank,Almena
3,First City Bank of Florida,Fort Walton Beach
4,The First State Bank,Barboursville
5,Ericson State Bank,Ericson
6,City National Bank of New Jersey,Newark
7,Resolute Bank,Maumee
8,Louisa Community Bank,Louisa
9,The Enloe State Bank,Cooper


<h3 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h3>

Let's do some filtering!

From the Shakespeare dataframe, get the title and the creator of the documents published between 2000 **and** 2010.

From the Shakespeare dataframe, get the creator of the documents shorter than 10 pages **or** longer than 50 pages. 

From the Shakespeare dataframe, get the title of the documents whose publisher is **not** Folger Shakespeare Library. 

## Update a dataframe
We can make changes to the data in a dataframe.
### Update headers
We can update the column names of a dataframe.

In [56]:
# Take a look at the columns
df.columns

Index(['Bank Name', 'City', 'State', 'Cert', 'Acquiring Institution',
       'Closing Date', 'Fund', 'Closing Year'],
      dtype='object')

In [57]:
# Access a column using the dot notation
df.City

0               New York
1            Santa Clara
2                 Almena
3      Fort Walton Beach
4          Barboursville
5                Ericson
6                 Newark
7                 Maumee
8                 Louisa
9                 Cooper
             ...        
555              Phoenix
556              Oakwood
557        Sierra Blanca
558                Miami
559             Gravette
560             Hinsdale
561                Malta
562           Manchester
563           Metropolis
564             Honolulu
Name: City, Length: 565, dtype: object

In [58]:
# If a column name has a space in it
df.Bank Name

SyntaxError: invalid syntax (3808230729.py, line 2)

We could replace all the spaces in column names with an `_`. In this way, we can access all the columns using the dot notation.

In [60]:
# Replace spaces in column names with underscores
df.columns = df.columns.str.replace(' ', '_')
df

Unnamed: 0,Bank_Name,City,State,Cert,Acquiring_Institution,Closing_Date,Fund,Closing_Year
0,Signature Bank,New York,NY,57053,"Signature Bridge Bank, N.A.",12-Mar-23,10540,23
1,Silicon Valley Bank,Santa Clara,CA,24735,"Silicon Valley Bridge Bank, N.A.",10-Mar-23,10539,23
2,Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20,10538,20
3,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537,20
4,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536,20
5,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20,10535,20
6,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534,19
7,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19,10533,19
8,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19,10532,19
9,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19,10531,19


You could also change the case of the headers.

In [61]:
# Change all headers to upper case
df.columns.str.upper()

Index(['BANK_NAME', 'CITY', 'STATE', 'CERT', 'ACQUIRING_INSTITUTION',
       'CLOSING_DATE', 'FUND', 'CLOSING_YEAR'],
      dtype='object')

We have been updating the column names all at one time. However, oftentimes we just want to update specific columns. In this case, we could use the `df.rename()` method and pass in a **dictionary** where the keys are the original column names and the values are the new column names.

In [62]:
# Change the column name of 'CERT' to 'CERTIFICATE_NUM'
df.rename(columns = {'Cert':'Certificate_Num'})

Unnamed: 0,Bank_Name,City,State,Certificate_Num,Acquiring_Institution,Closing_Date,Fund,Closing_Year
0,Signature Bank,New York,NY,57053,"Signature Bridge Bank, N.A.",12-Mar-23,10540,23
1,Silicon Valley Bank,Santa Clara,CA,24735,"Silicon Valley Bridge Bank, N.A.",10-Mar-23,10539,23
2,Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20,10538,20
3,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537,20
4,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536,20
5,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20,10535,20
6,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534,19
7,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19,10533,19
8,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19,10532,19
9,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19,10531,19


To change multiple column names, we just pass in a dictionary to `df.rename()` with multiple key:value pairs.

In [63]:
# Change multiple column names
df.rename(columns = {'Cert':'Certificate_Num', 'Fund':'Financial_Institution_Num'})

Unnamed: 0,Bank_Name,City,State,Certificate_Num,Acquiring_Institution,Closing_Date,Financial_Institution_Num,Closing_Year
0,Signature Bank,New York,NY,57053,"Signature Bridge Bank, N.A.",12-Mar-23,10540,23
1,Silicon Valley Bank,Santa Clara,CA,24735,"Silicon Valley Bridge Bank, N.A.",10-Mar-23,10539,23
2,Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20,10538,20
3,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537,20
4,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536,20
5,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20,10535,20
6,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534,19
7,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19,10533,19
8,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19,10532,19
9,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19,10531,19


In [64]:
# Make the change permanent
df.rename(columns = {'Cert':'Certificate_Num', 'Fund':'Financial_Institution_Num'}, inplace = True)

### Update rows 
How to update the values in a row? In [Pandas 1](./pandas-1.ipynb), we have learned how to look up values using `.loc` and `.iloc`.

To update a row, we could use `.loc` or `.iloc` to locate it and then assign the new values to that row.

In [65]:
# Change an entire row
df.loc[0] = ['Almena State Bank', 'Almena', 'KS', 15426, 'Equity Bank', '23-Oct-20', 10000, 20]

You can locate a specific cell in a row and update the value in that cell alone.

In [66]:
# Change a specific value in a row
df.loc[0, 'Financial_Institution_Num'] = 10001
df

Unnamed: 0,Bank_Name,City,State,Certificate_Num,Acquiring_Institution,Closing_Date,Financial_Institution_Num,Closing_Year
0,Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20,10001,20
1,Silicon Valley Bank,Santa Clara,CA,24735,"Silicon Valley Bridge Bank, N.A.",10-Mar-23,10539,23
2,Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20,10538,20
3,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537,20
4,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536,20
5,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20,10535,20
6,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534,19
7,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19,10533,19
8,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19,10532,19
9,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19,10531,19


We could change multiple specific values in a row using `.loc[]`. 

In [67]:
# Change multiple values in a row
df.loc[0, ['Bank_Name', 'Financial_Institution_Num']] = ['Almena Bank', 12000]
df

Unnamed: 0,Bank_Name,City,State,Certificate_Num,Acquiring_Institution,Closing_Date,Financial_Institution_Num,Closing_Year
0,Almena Bank,Almena,KS,15426,Equity Bank,23-Oct-20,12000,20
1,Silicon Valley Bank,Santa Clara,CA,24735,"Silicon Valley Bridge Bank, N.A.",10-Mar-23,10539,23
2,Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20,10538,20
3,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537,20
4,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536,20
5,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20,10535,20
6,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534,19
7,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19,10533,19
8,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19,10532,19
9,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19,10531,19


### Update columns
There are multiple methods we can use to update columns. Let's take a look at two methods `.map()` and `replace()`.

In [68]:
# Use .map() to update specific values in a column
df['Bank_Name'].map({'Almena Bank': 'Almena State Bank', 'The First State Bank': 'West Virginia Bank'})

0       Almena State Bank
1                     NaN
2                     NaN
3                     NaN
4      West Virginia Bank
5                     NaN
6                     NaN
7                     NaN
8                     NaN
9                     NaN
              ...        
555                   NaN
556                   NaN
557                   NaN
558                   NaN
559                   NaN
560                   NaN
561                   NaN
562                   NaN
563                   NaN
564                   NaN
Name: Bank_Name, Length: 565, dtype: object

In [69]:
# Use .replace() to update specific values in a column while maintaining the rest
df['Bank_Name'].replace({'Almena Bank': 'Almena State Bank', 'The First State Bank': 'West Virginia Bank'})

0                      Almena State Bank
1                    Silicon Valley Bank
2                      Almena State Bank
3             First City Bank of Florida
4                     West Virginia Bank
5                     Ericson State Bank
6       City National Bank of New Jersey
7                          Resolute Bank
8                  Louisa Community Bank
9                   The Enloe State Bank
                     ...                
555                         NextBank, NA
556             Oakwood Deposit Bank Co.
557                Bank of Sierra Blanca
558                    Hamilton Bank, NA
559               Sinclair National Bank
560                   Superior Bank, FSB
561                  Malta National Bank
562      First Alliance Bank & Trust Co.
563    National State Bank of Metropolis
564                     Bank of Honolulu
Name: Bank_Name, Length: 565, dtype: object

We can also use a filtering condition to locate the target columns and then make changes. 

For example, we can locate all the banks that failed in 2020 and change their closing date to 'Recent'.

In [70]:
# Make a filtering condition to get the banks that failed in 2020
filt = (df['Closing_Year'] == 2020)

In [71]:
# Use the filtering condition to locate the columns and update them
df.loc[filt, ['Financial_Institution_Num', 'Closing_Year']] = [1000, 'Recent']
df

Unnamed: 0,Bank_Name,City,State,Certificate_Num,Acquiring_Institution,Closing_Date,Financial_Institution_Num,Closing_Year
0,Almena Bank,Almena,KS,15426,Equity Bank,23-Oct-20,12000,20
1,Silicon Valley Bank,Santa Clara,CA,24735,"Silicon Valley Bridge Bank, N.A.",10-Mar-23,10539,23
2,Almena State Bank,Almena,KS,15426,Equity Bank,23-Oct-20,10538,20
3,First City Bank of Florida,Fort Walton Beach,FL,16748,"United Fidelity Bank, fsb",16-Oct-20,10537,20
4,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20,10536,20
5,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20,10535,20
6,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19,10534,19
7,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19,10533,19
8,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19,10532,19
9,The Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19,10531,19


<h3 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h3>

Make all column names in the Shakespeare dataframe upper case. 

Get all documents whose current title is 'Review Article' and change their title to 'Review'.

Get all documents whose word count exceeds 5000 and change their word count to the string 'Long article'.

___
## Lesson Complete

Congratulations! You have completed *Pandas 2*.

### Start Next Lesson: [Pandas 3 ->](./pandas-3.ipynb)

### Exercise Solutions
Here are a few solutions for exercises in this lesson.

In [72]:
# Read in the metadata
shake = pd.read_csv(metadata)

In [73]:
# Set the rows to display to 30
pd.set_option('display.max_rows', 30)

In [74]:
# Explore the dataframe
shake.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   id                  1500 non-null   object 
 1   title               1500 non-null   object 
 2   isPartOf            1500 non-null   object 
 3   publicationYear     1500 non-null   int64  
 4   doi                 0 non-null      float64
 5   docType             1500 non-null   object 
 6   provider            1500 non-null   object 
 7   datePublished       1500 non-null   object 
 8   issueNumber         1467 non-null   float64
 9   volumeNumber        1500 non-null   int64  
 10  url                 1500 non-null   object 
 11  creator             1209 non-null   object 
 12  publisher           1500 non-null   object 
 13  language            1500 non-null   object 
 14  pageStart           1443 non-null   object 
 15  pageEnd             1443 non-null   object 
 16  placeO

In [75]:
### Drop the rows and columns with at least 2 missing values

# Get the tuple (# of rows, # of columns) 
df.shape

# Store the num of rows and num of columns in two variables
num_rows = df.shape[0]
num_cols = df.shape[1]

# Drop all columns which have at least 2 missing values
df.dropna(thresh=num_rows-1, axis=1)

Unnamed: 0,Bank_Name,City,State,Certificate_Num,Closing_Date,Financial_Institution_Num,Closing_Year
0,Almena Bank,Almena,KS,15426,23-Oct-20,12000,20
1,Silicon Valley Bank,Santa Clara,CA,24735,10-Mar-23,10539,23
2,Almena State Bank,Almena,KS,15426,23-Oct-20,10538,20
3,First City Bank of Florida,Fort Walton Beach,FL,16748,16-Oct-20,10537,20
4,The First State Bank,Barboursville,WV,14361,3-Apr-20,10536,20
5,Ericson State Bank,Ericson,NE,18265,14-Feb-20,10535,20
6,City National Bank of New Jersey,Newark,NJ,21111,1-Nov-19,10534,19
7,Resolute Bank,Maumee,OH,58317,25-Oct-19,10533,19
8,Louisa Community Bank,Louisa,KY,58112,25-Oct-19,10532,19
9,The Enloe State Bank,Cooper,TX,10716,31-May-19,10531,19


In [76]:
# Drop the columns of doi and the placeofPublication, make the change permanent
shake.drop(columns=['doi', 'placeOfPublication'], inplace=True)

In [77]:
# get the title and the creator of the documents published between 2000 and 2010
filt = (shake['publicationYear']>1999) & (shake['publicationYear']<2011)
shake.loc[filt, ['title', 'creator']]

Unnamed: 0,title,creator
0,Fragments of Nationalism in Troilus and Cressida,Matthew A. Greenfield
15,Citizens' Games: Differentiating Collaboration...,Nina Levine
26,Review Article,Heather Hirschfeld
28,Review Article,Christa Jansohn
31,The Alcestis and the Statue Scene in the Winte...,Sarah Dewar-Watson
35,Review Article,Michelle Ephraim
39,Review Article,Jane Donawerth
40,"When Theaters Were Bear-Gardens; Or, What's at...",Jason Scott-Warren
48,Back Matter,
51,Review Article,Arthur F. Kinney


In [78]:
# get the creator of the documents shorter than 10 pages or longer than 50 pages
filt = (shake['pageCount']<10)|(shake['pageCount']>50)
shake.loc[filt, 'creator']

2                    R. G. Cox
3                 Darryl Gless
4              Marion O'Connor
5       Jeanne Addison Roberts
7             Wilhelm Hortmann
8           A. H. R. Fairchild
9                 John W. Velz
10            Anna Maria Crinò
11                         NaN
13                         NaN
                 ...          
1489         Claire McGlinchee
1490               David Riggs
1491     George Burke Johnston
1492        Frankie Rubinstein
1493    George Walton Williams
1494         M. Lindsay Kaplan
1495         T. H. Howard-Hill
1496             Yar Slavutych
1498              Thomas Pyles
1499                 S. Thomas
Name: creator, Length: 1258, dtype: object

In [79]:
# get the title of the documents whose publisher is not Folger Shakespeare Library
filt = (shake['publisher']=='Folger Shakespeare Library')
shake.loc[~filt, 'title']

48                                            Back Matter
118     Bibliography: Shakespeare: Annotated World Bib...
257     Bibliography: Shakespeare: Annotated World Bib...
287                                          Front Matter
366     BIBLIOGRAPHY: World Shakespeare Bibliography 1992
664                                          Front Matter
1089                                         Front Matter
1100    BIBLIOGRAPHY: World Shakespeare Bibliography 1993
Name: title, dtype: object

In [80]:
# Make all column names in the Shakespeare dataframe upper case
shake.columns = shake.columns.str.upper()

In [81]:
# Get all documents whose current title is 'Review Article' and change their title to 'Review'
shake.loc[shake['TITLE']=='Review Article', 'TITLE'] = 'Review'

In [82]:
# Get all documents whose word count exceeds 5000 and change their word count to the string 'Long article'
shake.loc[shake['WORDCOUNT']>5000, 'WORDCOUNT'] = 'Long article'