<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

This notebook is adapted by Zhuo Chen from the notebooks created by [Nathan Kelber](http://nkelber.com), [William Mattingly](https://datascience.si.edu/people/dr-william-mattingly) and [Melanie Walsh](https://melaniewalsh.org) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email zhuo.chen@ithaka.org or nathan.kelber@ithaka.org.<br />
___

# Pandas 2

**Description:** This notebook describes how to:
* Sort a dataframe
* Filter data in a dataframe
* Update data in a dataframe

This is the second notebook in a series on learning to use Pandas. 

**Use Case:** For Learners (Detailed explanation, not ideal for researchers)

**Difficulty:** Intermediate

**Knowledge Required:** 
* [Pandas 1](./pandas-1.ipynb)
* Python Basics ([Start Python Basics I](./python-basics-1.ipynb))

**Knowledge Recommended:** 
* [Python Intermediate 2](./python-intermediate-2.ipynb)
* [Python Intermediate 4](./python-intermediate-4.ipynb)

**Completion Time:** 90 minutes

**Data Format:** CSV (.csv)

**Libraries Used:** Pandas

**Research Pipeline:** None
___


In [None]:
# Import pandas library, `as pd` allows us to shorten typing `pandas` to `pd` when we call pandas
import pandas as pd

## Sort a dataframe

In this section, we will continue working with the dataframe we created in Pandas 1, which stores data on the 10 most recent World Cup finals. 

In [None]:
# Create a dataframe with world cup data
wcup = pd.DataFrame({"Year": [2022, 
                              2018, 
                              2014, 
                              2010, 
                              2006, 
                              2002, 
                              1998, 
                              1994, 
                              1990,
                              1986], 
                     "Champion": ["Argentina", 
                                  "France", 
                                  "Germany", 
                                  "Spain", 
                                  "Italy", 
                                  "Brazil", 
                                  "France", 
                                  "Brazil", 
                                  "Germany", 
                                  "Argentina"], 
                     "Host": ["Qatar", 
                              "Russia", 
                              "Brazil", 
                              "South Africa", 
                              "Germany", 
                              "Korea/Japan", 
                              "France", 
                              "USA", 
                              "Italy", 
                              "Mexico"],
                     "Score": ["7-5", 
                               "4-2", 
                               "1-0", 
                               "1-0", 
                               "6-4", 
                               "2-0", 
                               "3-0", 
                               "3-2", 
                               "1-0", 
                               "3-2"]
                    })
wcup['Goals Scored'] = wcup['Score'].apply(lambda r: r.split('-')[0])
wcup['Goals Conceded'] = wcup['Score'].apply(lambda r: r.split('-')[1])
wcup['Difference'] = wcup['Goals Scored'].astype(int) - wcup['Goals Conceded'].astype(int)
wcup

### Set, reset and use indexes
We have seen that by default, the rows in a dataframe are numbered by integer indexes starting from 0. The indexes look like a column to the far left without a name. 

We can set the index column to one of the columns in the dataframe. This is desirable because a range of integers is not descriptive but a column with a name is descriptive. When we want to locate specific data, descriptive labels are much more useful. 

In [None]:
# Set index column to 'Year'
wcup.set_index('Year')

Take a look at the original dataframe. Has it changed? 

In [None]:
# Take a look at the original dataframe
wcup

The original dataframe is **NOT** changed after we use the `.set_index()` method to change the index column. This is because Pandas distinguishes between a view and a copy. When a view of the dataframe is returned, any change we make affects the original dataframe; when a copy is returned, any change we make affects only the copy. The `.set_index()` method returns a copy, which is why the original dataframe is not affected.  

If you want to make the change permanent, there is a parameter `inplace` you can use. If you set this parameter to `True`, the change will be made in place and the original dataframe will be changed. 

In [None]:
# Change the index column and commit the change
wcup.set_index('Year', inplace=True)
wcup

You could also sort the index column. Here, we have a numerical column as our index column. When we sort the index, the dataframe is sorted by the index column in ascending order by default. 

In [None]:
# Sort the indexes
wcup.sort_index()

You could set the parameter `ascending=False` to sort the indexes in a descending order.

In [None]:
# Specify the ascending order
wcup.sort_index(ascending=False)

Note that the sorting change is not committed by default. If you want to make the change permanent, again, you will have to add `inplace=True`.
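
An alternative to `inplace=True` is to assign the returned dataframe back to a variable. The sketch below uses a small made-up dataframe (`games` and `sorted_games` are hypothetical names, not part of this lesson's data) to show both the unchanged original and the reassignment pattern.

```python
import pandas as pd

# Hypothetical three-row dataframe used only for illustration
games = pd.DataFrame({"Year": [2014, 2022, 2018],
                      "Champion": ["Germany", "Argentina", "France"]})
games = games.set_index("Year")

# sort_index() returns a new dataframe; `games` itself keeps its order
sorted_games = games.sort_index()
print(list(games.index))         # original order is kept
print(list(sorted_games.index))  # sorted in ascending order

# Reassigning the result has the same effect as using inplace=True
games = games.sort_index(ascending=False)
print(list(games.index))
```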

Sometimes we would want to change the index column back to the integer column. In this case, we can use the method `reset_index()`. But again, to make the reset permanent, you will have to add `inplace=True`.

In [None]:
# Reset the index and update the dataframe
wcup.reset_index(inplace=True)
wcup

### Sort by one column

We can sort the entire dataframe by a column other than the index column. 

In [None]:
# Sort the dataframe by the column 'Goals Scored'
wcup.sort_values(by=['Goals Scored'])
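
One thing to watch for: `Goals Scored` was built by splitting the `Score` strings, so it holds strings rather than numbers, and strings are sorted character by character. With single-digit scores the result looks the same, but a two-digit value would be misplaced. A minimal sketch with made-up values (the `scores` dataframe is hypothetical):

```python
import pandas as pd

# Hypothetical scores stored as strings, including one two-digit value
scores = pd.DataFrame({"Goals Scored": ["7", "10", "2"]})

# Sorting strings compares them character by character, so "10" comes before "2"
print(scores.sort_values(by="Goals Scored")["Goals Scored"].tolist())

# Converting to integers first gives a numeric sort
scores["Goals Scored"] = scores["Goals Scored"].astype(int)
print(scores.sort_values(by="Goals Scored")["Goals Scored"].tolist())
```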

### Sort by multiple columns
It is a convention to sort soccer results first by goal difference (i.e. how many more goals the champion scored than the runner-up) and then by goals conceded (i.e. how many goals the champion let in). Pandas can easily do that. 

In [None]:
# Sort the dataframe by Difference column in descending order 
# then by Goals Conceded column in ascending order
wcup.sort_values(by=['Difference', 'Goals Conceded'], ascending=[False, True])

## A quick review of how to create a dataframe from a file

In [Pandas 1](./pandas-1.ipynb), we learned how to create a dataframe by passing a **dictionary** to the `DataFrame()` constructor or by reading in a CSV or an Excel file. 

For example, we can convert the data in a .csv file to a Pandas DataFrame using the `.read_csv()` method. We pass in the location of the .csv file.

In [None]:
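### Download the sample file for this lesson
import os
import urllib.request # importing `urllib` alone does not guarantee that `urllib.request` is available

url = 'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/Pandas1_failed_banks_since_2000.csv'
os.makedirs('./data', exist_ok=True) # create the data folder if it does not exist yet
urllib.request.urlretrieve(url, './data/' + url.rsplit('/', 1)[-1])
print('Sample file retrieved.')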

Use the **File > Open** menu above to navigate to `Pandas1_failed_banks_since_2000.csv` in the `/data` folder. Preview its structure before we load it into a dataframe.

In [None]:
# Create a DataFrame `df` from a CSV file using the .read_csv() method
df = pd.read_csv('data/Pandas1_failed_banks_since_2000.csv') # pass in the location of the file
df

In [None]:
# Change the display setting
pd.set_option('display.min_rows', 20) # set the minimum number of rows to display to 20
df

By default, Pandas displays the first five rows and the last five rows of the dataframe. You can change the display setting using the `.set_option()` method.

The display setting is global throughout the notebook. Therefore, any dataframe in the current notebook will have this setting in effect.

Now, you see that Pandas displays the first 10 rows and the last 10 rows of the dataframe.
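
If a display setting should apply only to part of a notebook, `pd.option_context()` changes it temporarily, and `pd.reset_option()` restores the default. The following is a small sketch of both; the variable names `before`, `inside` and `after` are only for illustration.

```python
import pandas as pd

before = pd.get_option("display.min_rows")

# Inside the `with` block the setting is 20; it reverts automatically afterwards
with pd.option_context("display.min_rows", 20):
    inside = pd.get_option("display.min_rows")

after = pd.get_option("display.min_rows")
print(before, inside, after)

# reset_option() restores the pandas default for a setting changed with set_option()
pd.reset_option("display.min_rows")
```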

By convention, a dataframe variable is called `df` but we could give it any valid Python variable name. Here, we follow the convention. 

In [None]:
# Get some info about the dataframe
df.info()

The `info()` method tells us that there are 565 rows and 7 columns in the dataframe. Almost all columns have 565 non-null values, except the column of `Acquiring Institution`.

<h3 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h3>

In the exercises in this notebook, we'll work on a dataset built from Constellate.

We'll use the `constellate` client to automatically retrieve the [metadata](https://constellate.org/docs/key-terms/#metadata) for a [dataset](https://constellate.org/docs/key-terms/#dataset). We can retrieve [metadata](https://constellate.org/docs/key-terms/#metadata) in a [CSV file](https://constellate.org/docs/key-terms/#csv-file) using the `get_metadata` method.


In [None]:
# Creating a variable `dataset_id` to hold our dataset ID
# The default dataset is Shakespeare Quarterly, 1950-present
# retrieve the metadata
import constellate
dataset_id = "7e41317e-740f-e86a-4729-20dab492e925"
metadata = constellate.get_metadata(dataset_id)

The metadata is stored in a .csv file. In the following code cell, read in the data using Pandas. Give the dataframe a name other than `df`. Then print out the dataframe to take a look. 

Use a Pandas method to explore the dataframe. How many rows does it have? How many columns does it have? What is the data type of the data in each column?

## Filter dataframe

A common pipeline in data processing in Pandas is that you create a dataframe from a file and then reduce the dataframe only to the rows and columns that you are interested in. 

We have learned how to use `.loc` and `.iloc` to select part of a dataframe in [Pandas 1](./pandas-1.ipynb). We will learn more ways to do data filtering in this section.

### Work with missing values
It is a common case that datasets have missing values. As you may have already noticed, blank cells in a CSV file show up as NaN in a Pandas DataFrame. For example, in the dataset of failed banks, the `Acquiring Institution` column gives the name when a failed bank was acquired by another institution and is empty otherwise.

In Pandas, we have a bunch of methods that can create a boolean mask over the data.

In [None]:
# Use isna() to check whether a dataframe has missing values
df.isna()

The `.isna()` method puts a mask on the original dataframe. The cells with a non-null value are masked with the boolean value `False`. The cells with a null value are masked with the boolean value `True`.

We can also use `.isna()` to check whether a specific column has missing values. 

In [None]:
# Use isna() to check whether a column has missing values
df['Acquiring Institution'].isna()
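
Because `True` counts as 1 when summed, chaining `.sum()` after `.isna()` gives the number of missing values. A small sketch on a made-up dataframe (`banks` and its values are hypothetical, not the lesson's data):

```python
import pandas as pd

# Hypothetical data with two missing acquirers
banks = pd.DataFrame({
    "Bank Name": ["Bank A", "Bank B", "Bank C"],
    "Acquiring Institution": ["Bank X", None, None],
})

# Number of missing values in each column
missing_per_column = banks.isna().sum()
print(missing_per_column)

# Number of missing values in a single column
missing_acquirer = banks["Acquiring Institution"].isna().sum()
print(missing_acquirer)
```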

### Drop rows and columns with missing values

If you want to exclude the rows and columns with missing values from your data analysis, you can use the `.dropna()` method to do that.

By default, the `.dropna()` method drops the rows with at least one missing value. 

In [None]:
# Use .dropna() to remove all rows with at least one missing value
df.dropna() # no argument passed in

You can also set the axis parameter to 0 to drop the rows with missing values.

In [None]:
# Use .dropna() to remove all rows with at least one missing value
df.dropna(axis=0) # Set the axis to 0

Or, you can set the axis parameter to 'rows' to drop the rows with missing values.

In [None]:
# Use .dropna() to remove all rows with at least one missing value
df.dropna(axis='rows') # Set the axis to 'rows'

If you set the axis parameter to 1, you will drop the columns with missing values. 

In [None]:
# Use .dropna() to remove all columns with at least one missing value
df.dropna(axis=1) # set the axis to 1

You can also drop the columns with missing values by setting the axis parameter to 'columns'.

In [None]:
# Use .dropna() to remove all columns with at least one missing value
df.dropna(axis='columns') # set the axis to 'columns'

Sometimes we would want to drop a row only if that row has a missing value in a specific column. We can use the subset parameter to specify the column(s) to look for missing values. 

In [None]:
# Specify the columns to look for missing values
df.dropna(subset=['Acquiring Institution'])

Note that the `.dropna()` method only returns a copy, not a view. This means that any change you make using the `.dropna()` method will not affect the original dataframe. To make the change permanent, you could either assign the result to the variable where you store the original dataframe to update it; or you could use the parameter `inplace=True`.

<h3 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h3>

When you explore the Shakespeare dataframe, what did you find about the column `doi`? What did you find about the column `placeOfPublication`? Is there any non-null value in them?

We have seen how to exclude rows and columns with at least one missing value. The `.dropna()` method also has a `thresh` parameter, which specifies the minimum number of non-null values a row or a column must have in order **not** to be dropped. Read the documentation on the `.dropna()` method, figure out how to use the `thresh` parameter, and in the next code cell write a line of code to drop the columns in the Shakespeare dataset that have at least 2 missing values.  

In [None]:
# Drop the columns in the Shakespeare dataset with at least 2 missing values


Sometimes you may want to ignore a particular column when you decide whether to drop a row (or a particular row when you decide whether to drop a column). In other words, a missing value in that column alone should not cause the row to be dropped. How do you do that? 

In the dataset with data on failed banks, let's say we want to drop any row with missing values, ignoring the `Acquiring Institution` column. In other words, we want to keep a row whether or not its `Acquiring Institution` value is missing, as long as all of its other columns have values. 

In [None]:
# Drop any row with a missing value in any column other than 'Acquiring Institution'
cols = df.columns.tolist()
cols.remove('Acquiring Institution')
df.dropna(subset=cols)

Sometimes, you want to keep the rows and columns that have missing values, but fill the NaN cells with values of the same data type as the other cells in the same column. That way, when you apply a function to a column in the dataframe, you will not run into a type error. A common way to do this is to use the `.fillna()` method. 

In [None]:
# Fill the missing values
df['Acquiring Institution'].fillna('No Acquirer')
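
Like `.dropna()`, the `.fillna()` method returns a new object and leaves the original column as it was. To keep the filled values, the result can be assigned back to the column. A minimal sketch on a made-up dataframe (`banks` is hypothetical):

```python
import pandas as pd

# Hypothetical data with one missing acquirer
banks = pd.DataFrame({
    "Bank Name": ["Bank A", "Bank B"],
    "Acquiring Institution": ["Bank X", None],
})

# fillna() returns a new Series; assign it back to the column to keep the result
banks["Acquiring Institution"] = banks["Acquiring Institution"].fillna("No Acquirer")
print(banks)
```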

### Drop certain columns or rows

We have seen how to drop rows or columns with missing values. Sometimes, even if a row or a column does not have a missing value, you still want to drop it because you will not use it in your analysis. In this case, we use the `.drop()` method to remove those rows or columns.

You can specify which column you want to drop using the 'columns' parameter. 

In [None]:
# Drop a column by setting the columns parameter
df.drop(columns='Fund')

You can drop multiple columns at one time. 

In [None]:
# Drop multiple columns by setting the columns parameter
df.drop(columns=['Fund', 'Cert'])

Another way to drop a column is to give the label of the column you want to drop and then set the axis parameter to 1.

In [None]:
# Drop a column by setting the axis parameter
df.drop('Fund', axis=1)

You can also drop multiple columns by setting the axis parameter. 

In [None]:
# Drop multiple columns by setting the axis parameter
df.drop(['Fund', 'Cert'], axis=1)

To drop a row, you can specify which row you want to drop using the 'index' parameter.

In [None]:
# Drop a row by setting the index parameter
df.drop(index=0)

In the next code cell, can you write some code to drop multiple rows from df?

In [None]:
# Drop multiple rows using the index parameter


Another way to drop a row is to give the label of the row you want to drop and then set the axis parameter to 0.

In [None]:
# Drop a row by setting the axis parameter
df.drop(0, axis=0)

We know that by default, the rows are indexed with integer numbers. You could set the index column to one of the columns of the dataframe. In the next code cell, can you set the index column to the `State` column and then drop all rows with the label 'GA' or 'KS'?

In [None]:
# Drop multiple rows by setting the axis parameter


You might want to drop multiple consecutive rows at one time. The `.drop()` method does not have a parameter for slicing but we can come up with a workaround.

We can use the `.index` property to get the range of indexes for the rows we want to drop and pass them to the `.drop()` method.

In [None]:
# Drop multiple consecutive rows
df.drop(df.index[2:5], axis=0) 

To drop multiple consecutive columns, we can use the `.columns` attribute to get the range of the indexes for the columns we want to drop and then pass it to the `.drop()` method. 

In [None]:
# Drop multiple consecutive columns
df.drop(df.columns[3:5], axis=1)

The `.drop()` method returns a copy, not a view. Therefore, whatever change you make using it will not affect the original dataframe. 

<h3 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h3>

When you explored the Shakespeare dataframe, what did you find about the column `doi`? What did you find about the column `placeOfPublication`? Is there any non-null value in them? In the next code cell, drop these two columns and make the change permanent.

In [None]:
# Drop the columns doi and placeOfPublication, make the change permanent


### Filter data using conditionals
Conditional selection using `df.loc[]` is a very common method to filter a dataframe. 

You write a filtering condition to filter a target column. The condition then checks, for each cell in the target column, whether it fulfills the condition or not. The results will be returned as a Series of True/False values. The `.loc` indexer then uses this Series to select the rows that have True values. 

Suppose you are interested in the banks that failed since 2000 in the state of Georgia. From the original dataframe, you would like to get all the rows of the failed banks in Georgia. How do you do it?

In [None]:
# Write a filtering condition
df['State'] == 'GA' # Create a boolean mask over the column 'State'

In [None]:
# Assign the filtering condition to a variable
filt = (df['State'] == 'GA') # Use parenthesis for better reading

In [None]:
# Put the Series returned by the filtering condition within the hard brackets of df.loc[]
df.loc[filt]

Out of the rows that fulfill the filtering condition, we can further specify which columns to be returned.

In [None]:
# Specify a single column to be returned
df.loc[filt, 'Bank Name']

Of course, we can select multiple columns to be returned out of the filtered rows. 

In [None]:
# Specify multiple columns to be returned
df.loc[filt, ['Bank Name', 'Fund']]

Now suppose you want to get all the failed banks whose name contains the word 'Community'. Note that the match is case sensitive.

In [None]:
# Get all the banks with the word 'Community' in their name
filt = (df['Bank Name'].str.contains('Community'))
df.loc[filt, ['Bank Name']]
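
To match the word regardless of capitalization, `.str.contains()` accepts `case=False`. The `na=False` argument treats missing names as non-matches. The sketch below uses a made-up Series (`names` is hypothetical) rather than the lesson's dataframe.

```python
import pandas as pd

# Hypothetical bank names, including a lower-case match and a missing value
names = pd.Series(["First Community Bank", "community trust", "Almena State Bank", None])

# case=False ignores upper/lower case; na=False treats missing names as non-matches
filt = names.str.contains("community", case=False, na=False)
print(names[filt].tolist())
```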

#### Conjunction of multiple filtering conditions: `&`

Oftentimes, you would want to filter a dataframe based on more complex conditions. For example, suppose you would like to get the banks in GA that were closed between 2008 and 2010. How do you use `df.loc[ ]` to achieve it?

The location of the failed banks is stored in the `State` column. The closing year of the banks is stored in the `Closing Date` column. 

In [None]:
# Create the first filtering condition restricting the state
filt1 = (df['State'] == 'GA')

How do we get the closing year of the banks? Recall what we have learned in [Pandas 1](./pandas-1.ipynb) about creating a new column based on an old one. How do you extract the closing year out of the column `Closing Date`? We can use the `.apply()` method. The dates look like '23-Oct-20', so the last piece after splitting on '-' is the two-digit year (20 for 2020, 8 for 2008).

In [None]:
# Create a new column storing the closing year of the banks
df['Closing Year'] = df['Closing Date'].apply(lambda r: r.split('-')[2])
df['Closing Year'] = df['Closing Year'].astype(int)
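
If a four-digit year is preferred, one alternative (not what the rest of this notebook uses) is to parse the dates with `pd.to_datetime()`. The sketch assumes the dates follow the day-month-year pattern seen in values like '23-Oct-20' and that the month abbreviations are in English. The `dates` Series is made up for illustration.

```python
import pandas as pd

# Hypothetical closing dates in the same style as the lesson's data
dates = pd.Series(["23-Oct-20", "25-Jul-08", "02-Feb-01"])

# %d = day, %b = abbreviated month name, %y = two-digit year
parsed = pd.to_datetime(dates, format="%d-%b-%y")

# .dt.year extracts the four-digit year as an integer
years = parsed.dt.year
print(years.tolist())
```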

In [None]:
# Take a look at the dataframe
df

In [None]:
# Create the second filtering condition restricting the closing year
# (years are stored as two digits, so 2008 to 2010 are 8, 9 and 10)
filt2 = (df['Closing Year'] > 7) & (df['Closing Year'] < 11)

With the two filtering conditions, we are ready to extract the banks in GA that failed between 2008 and 2010.

In [None]:
# Use filt1 and filt2 to get the target rows
df.loc[filt1 & filt2]

Note that when we extract rows that fulfill multiple conditions, we use `&` in Pandas, not `and`. If you replace `&` with `and`, you will get an error. This is different from what we have learned about boolean operators in [Python basics 2](./python-basics-2.ipynb). In Python, we use `and`, `or` and `not`. In Pandas, we use `&`, `|` and `~` instead. 

|Pandas Operator|Boolean|Requires|
|---|---|---|
|&|and|All must be `True`|
|\||or|If any are `True`|
|~|not|The opposite|

Although we use different symbols for these boolean operators, the truth table for them stays the same. For a quick review of the truth table, see [Python basics 2](./python-basics-2.ipynb).
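
The difference between `&` and `and` can be seen in a short sketch. It uses a made-up dataframe (`banks` is hypothetical) with two-digit years, as in this lesson, and the flag `and_failed` exists only to record what happened.

```python
import pandas as pd

# Hypothetical data: two banks in GA and one in NY
banks = pd.DataFrame({"State": ["GA", "GA", "NY"],
                      "Closing Year": [9, 15, 9]})

# Each comparison needs its own parentheses because & binds tighter than == and >
filt = (banks["State"] == "GA") & (banks["Closing Year"] > 7) & (banks["Closing Year"] < 11)
print(banks.loc[filt])

# Python's `and` asks for a single True/False value for a whole Series, which fails
and_failed = False
try:
    (banks["State"] == "GA") and (banks["Closing Year"] > 7)
except ValueError as err:
    and_failed = True
    print("ValueError:", err)
```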

#### Disjunction of multiple filtering conditions: `|`
Suppose you would like to take a look at all the failed banks in the state of Georgia or the state of New York. How do you use `df.loc[ ]` to get the target rows?

In [None]:
# Create the two filtering conditions restricting the state to GA and NY
filt1 = (df['State'] == 'GA')
filt2 = (df['State'] == 'NY')

In [None]:
# Use filt1 and filt2 to get the target rows
df.loc[filt1|filt2]

If you would like to get the data of the failed banks in the following six states --- Georgia, New York, New Jersey, Florida, California and West Virginia, you will not want to write six filtering conditions and use the vertical bar `|` to connect all of them. That would be too repetitive. In this case, we can use the `.isin()` method to create a filtering condition.

In [None]:
# Create a list of the states
states = ['GA', 'NY', 'NJ', 'FL', 'CA', 'WV']

In [None]:
# Create a filtering condition
filt = (df['State'].isin(states))

In [None]:
# Use filt to find all failed banks in the six states
df.loc[filt]

#### Negation of a certain condition:`~`
Now, suppose you would like to get all the failed banks that were **not** closed in 2008. How do you do it?

In [None]:
# Create the filtering condition restricting the closing year to non-2008
# (the year 2008 is stored as the two-digit value 8)
filt = (~(df['Closing Year'] == 8))

In [None]:
# Use the filtering condition to get the target rows with specified columns
df.loc[filt, ['Bank Name', 'City']]

<h3 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h3>

Let's do some filtering!

From the Shakespeare dataframe, get the title and the creator of the documents published between 2000 **and** 2010.

From the Shakespeare dataframe, get the creator of the documents shorter than 10 pages **or** longer than 50 pages. 

From the Shakespeare dataframe, get the title of the documents whose publisher is **not** Folger Shakespeare Library. 

## Update a dataframe
We can make changes to the data in a dataframe.
### Update headers
We can update the column names of a dataframe.

In [None]:
# Take a look at the columns
df.columns

In [None]:
# Access a column using the dot notation
df.City

In [None]:
# If a column name has a space in it, the dot notation raises a SyntaxError
df.Bank Name

We could replace all the spaces in column names with an `_`. In this way, we can access all the columns using the dot notation.
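
Bracket notation works for any column name, including names with spaces, so renaming is a convenience rather than a requirement. A short sketch with a made-up dataframe (`banks` is hypothetical):

```python
import pandas as pd

# Hypothetical data with a space in one column name
banks = pd.DataFrame({"Bank Name": ["Bank A", "Bank B"],
                      "City": ["Almena", "Barboursville"]})

# Bracket notation accepts column names that contain spaces
print(banks["Bank Name"])

# After replacing spaces with underscores, dot notation works as well
banks.columns = banks.columns.str.replace(" ", "_")
print(banks.Bank_Name)
```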

In [None]:
# Replace spaces in column names with underscores
df.columns = df.columns.str.replace(' ', '_')
df

You could also change the case of the headers.

In [None]:
# Change all headers to upper case (this returns a new Index; df.columns itself is not changed)
df.columns.str.upper()

We have been updating the column names all at one time. However, oftentimes we just want to update specific columns. In this case, we could use the `df.rename()` method and pass in a **dictionary** where the keys are the original column names and the values are the new column names.

In [None]:
# Change the column name of 'Cert' to 'Certificate_Num'
df.rename(columns = {'Cert':'Certificate_Num'})

To change multiple column names, we just pass in a dictionary to `df.rename()` with multiple key:value pairs.

In [None]:
# Change multiple column names
df.rename(columns = {'Cert':'Certificate_Num', 'Fund':'Financial_Institution_Num'})

In [None]:
# Make the change permanent
df.rename(columns = {'Cert':'Certificate_Num', 'Fund':'Financial_Institution_Num'}, inplace = True)

### Update rows 
How do we update the values in a row? In [Pandas 1](./pandas-1.ipynb), we learned how to look up values using `.loc` and `.iloc`.

To update a row, we could use `.loc` or `.iloc` to locate it and then assign the new values to that row.

In [None]:
# Change an entire row
df.loc[0] = ['Almena State Bank', 'Almena', 'KS', 15426, 'Equity Bank', '23-Oct-20', 10000, 20]

You can locate a specific cell in a row and update the value in that cell alone.

In [None]:
# Change a specific value in a row
df.loc[0, 'Financial_Institution_Num'] = 10001
df

We could change multiple specific values in a row using `.loc[]`. 

In [None]:
# Change multiple values in a row
df.loc[0, ['Bank_Name', 'Financial_Institution_Num']] = ['Almena Bank', 12000]
df

### Update columns
There are multiple methods we can use to update columns. Let's take a look at two of them: `.map()` and `.replace()`.

In [None]:
# Use .map() to update specific values in a column; values not listed in the dictionary become NaN
df['Bank_Name'].map({'Almena Bank': 'Almena State Bank', 'The First State Bank': 'West Virginia Bank'})

In [None]:
# Use .replace() to update specific values in a column while maintaining the rest
df['Bank_Name'].replace({'Almena Bank': 'Almena State Bank', 'The First State Bank': 'West Virginia Bank'})
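
The difference between the two methods is easiest to see on a small example. In the sketch below, the `names` Series and the `new_names` dictionary are made up for illustration.

```python
import pandas as pd

# Hypothetical bank names; only the first one appears in the dictionary
names = pd.Series(["Almena Bank", "The First State Bank", "Ericson State Bank"])
new_names = {"Almena Bank": "Almena State Bank"}

# .map() looks up every value; values missing from the dictionary become NaN
mapped = names.map(new_names)
print(mapped.tolist())

# .replace() changes only the values listed in the dictionary
replaced = names.replace(new_names)
print(replaced.tolist())
```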

We can also use a filtering condition to locate the target rows and then make changes to specific columns in them. 

For example, we can locate all the banks that failed in 2020 and change their closing year to 'Recent'.

In [None]:
# Make a filtering condition to get the banks that failed in 2020
# (the closing year is stored as two digits, so 2020 is 20)
filt = (df['Closing_Year'] == 20)

In [None]:
# Use the filtering condition to locate the rows and update two of their columns
df.loc[filt, ['Financial_Institution_Num', 'Closing_Year']] = [1000, 'Recent']
df
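
Writing a string such as 'Recent' into an integer column mixes data types. Depending on the pandas version, this may trigger a warning or an error, and the column can no longer be used for numeric comparisons. One possible alternative, sketched here on a made-up dataframe (`banks` and the `Closing_Label` column are hypothetical), is to keep the numeric column and store the label in a separate column.

```python
import pandas as pd

# Hypothetical data with two-digit closing years
banks = pd.DataFrame({"Bank_Name": ["Bank A", "Bank B"],
                      "Closing_Year": [20, 8]})

# Keep the numeric column untouched and put the text label in a new column
banks["Closing_Label"] = banks["Closing_Year"].astype(str)
banks.loc[banks["Closing_Year"] == 20, "Closing_Label"] = "Recent"
print(banks)
```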

<h3 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h3>

Make all column names in the Shakespeare dataframe upper case. 

Get all documents whose current title is 'Review Article' and change their title to 'Review'.

Get all documents whose word count exceeds 5000 and change their word count to the string 'Long article'.

___
## Lesson Complete

Congratulations! You have completed *Pandas 2*.

### Start Next Lesson: [Pandas 3 ->](./pandas-3.ipynb)

### Exercise Solutions
Here are a few solutions for exercises in this lesson.

In [None]:
# Read in the metadata
shake = pd.read_csv(metadata)

In [None]:
# Set the rows to display to 30
pd.set_option('display.max_rows', 30)

In [None]:
# Explore the dataframe
shake.info()

In [None]:
### Drop the columns with at least 2 missing values

# Get the tuple (# of rows, # of columns) 
shake.shape

# Store the num of rows and num of columns in two variables
num_rows = shake.shape[0]
num_cols = shake.shape[1]

# Drop all columns which have at least 2 missing values
# (a column is kept only if it has at least num_rows-1 non-null values)
shake.dropna(thresh=num_rows-1, axis=1)

In [None]:
# Drop the columns doi and placeOfPublication, make the change permanent
shake.drop(columns=['doi', 'placeOfPublication'], inplace=True)

In [None]:
# get the title and the creator of the documents published between 2000 and 2010
filt = (shake['publicationYear']>1999) & (shake['publicationYear']<2011)
shake.loc[filt, ['title', 'creator']]

In [None]:
# get the creator of the documents shorter than 10 pages or longer than 50 pages
filt = (shake['pageCount']<10)|(shake['pageCount']>50)
shake.loc[filt, 'creator']

In [None]:
# get the title of the documents whose publisher is not Folger Shakespeare Library
filt = (shake['publisher']=='Folger Shakespeare Library')
shake.loc[~filt, 'title']

In [None]:
# Make all column names in the Shakespeare dataframe upper case
shake.columns = shake.columns.str.upper()

In [None]:
# Get all documents whose current title is 'Review Article' and change their title to 'Review'
shake.loc[shake['TITLE']=='Review Article', 'TITLE'] = 'Review'

In [None]:
# Get all documents whose word count exceeds 5000 and change their word count to the string 'Long article'
shake.loc[shake['WORDCOUNT']>5000, 'WORDCOUNT'] = 'Long article'