<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Created by [Nathan Kelber](http://nkelber.com) and Ted Lawless for [JSTOR Labs](https://labs.jstor.org/) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email nathan.kelber@ithaka.org.<br />
___

**Pandas I**

**Description:** This notebook describes how to:
* Create a Pandas Series or DataFrame
* Access rows, columns, and elements using `.loc` and `.iloc`
* Change data in rows, columns, and elements

This is the first notebook in a series on learning to use Pandas. 

**Difficulty:** Intermediate

**Knowledge Required:** 
* Python Basics

**Knowledge Recommended:** None

**Completion Time:** 75 minutes

**Data Format:** None

**Libraries Used:** Pandas
___

# When to use Pandas

Pandas is a Python data analysis and manipulation library. When it comes to viewing and manipulating data, most people are familiar with commercial spreadsheet software, such as Microsoft Excel or Google Sheets. While spreadsheet software and Pandas can accomplish similar tasks, each has significant advantages depending on the use-case.

**Advantages of Spreadsheet Software**
* Point and click
* Easier to learn
* Great for small datasets (<10,000 rows)
* Better for browsing data

**Advantages of Pandas**
* More powerful data manipulation with Python
* Can work with large datasets (millions of rows)
* Faster for complicated manipulations
* Better for cleaning and/or pre-processing data
* Can automate workflows in a larger data pipeline

In short, spreadsheet software is better for browsing small datasets and making moderate adjustments. Pandas is better for automating data cleaning processes that require large or complex data manipulation.

Pandas can interpret a wide variety of data sources, including Excel files, CSV files, and Python objects like lists and dictionaries. Pandas converts these into two fundamental objects: 

* Series: a single column of data
* DataFrame: a table of data containing multiple columns and rows

# Pandas Series

We can think of a Series as a single column of data. A DataFrame then is made by combining Series objects side-by-side into a table that has both height and width. Let's create a Series based on this data about the world's ten most-populated countries [according to Wikipedia](https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population).

|Population (in millions)|
|---|
|1,404|
|1,366|
|330|
|269|
|220|
|211|
|206|
|169|
|146|
|127|

We can put the population data into a Series.

In [None]:
# Import pandas; `as pd` lets us type `pd` instead of `pandas` each time we use it
import pandas as pd

In [None]:
# Create a data series in Pandas
worldpop = pd.Series([1404, 1366, 330, 269, 220, 211, 206, 169, 146, 127])

# Give our series a name
worldpop.name = 'World Population (In Millions)'
print(worldpop)

Underneath the Series is a `dtype` which describes the way the data is stored in the Series. Here we see `int64`, denoting the data is a 64-bit integer.
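As a small illustration of how the `dtype` follows from the data, the sketch below builds two throwaway Series and checks the `dtype` of each. The names `whole_numbers` and `decimals` are ours and are not part of the lesson's data.

```python
import pandas as pd

# Hypothetical example Series, used only to illustrate dtypes
whole_numbers = pd.Series([1404, 1366, 330])
decimals = pd.Series([1.404, 1.366, 0.33])

# A list of integers is stored as int64; decimals are stored as float64
print(whole_numbers.dtype)
print(decimals.dtype)
```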

## `.iloc[]` Integer Location Selection

To the left of each value in the Series is an index number. This index number is very similar to a Python list index; it can help us reference a particular row for data retrieval. Also, like a Python list, the index to a Series begins with 0. We can retrieve individual elements in a Series using the `.iloc` attribute, which stands for "integer location."

In [None]:
# Return the 4th element in our series
worldpop.iloc[3]

In [None]:
# Return a slice of elements in our series
# The slice stops before position 4, so it returns positions 2 and 3
worldpop.iloc[2:4]
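Because `.iloc` follows the same rules as Python list indexing, negative positions and open-ended slices should work too. Here is a minimal sketch that uses a separate Series, `demo_pop` (a name chosen here so that `worldpop` is left alone).

```python
import pandas as pd

# A stand-in Series with the same values as `worldpop`
demo_pop = pd.Series([1404, 1366, 330, 269, 220, 211, 206, 169, 146, 127])

# Negative positions count from the end, as in a Python list
last_value = demo_pop.iloc[-1]

# A slice with only a stop value starts at position 0
first_three = demo_pop.iloc[:3]

print(last_value)
print(first_three.tolist())
```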

By default, our Series has a numerical index like a Python list, but it would be much easier to use if our Series had names like the keys of a Python dictionary. It is cumbersome to remember the index number for each country, so we can instead give each row a named index.

In [None]:
# Rename the index to use names instead of numerical indexes
worldpop.index = [
    'China',
    'India',
    'United States',
    'Indonesia',
    'Pakistan',
    'Brazil',
    'Nigeria',
    'Bangladesh',
    'Russia',
    'Mexico'
]

worldpop

## `.loc[]` Location Selection
Now we can also reference each element by its index name, very similar to how we can supply a key to a dictionary to get a value. We use the `.loc` attribute.

In [None]:
# Return the series value for Nigeria
worldpop.loc['Nigeria']

In [None]:
# Return a series value for Indonesia and Mexico
worldpop.loc[['Indonesia', 'Mexico']]

In [None]:
# Return a slice from Nigeria to Russia
# This slice will include the final element!
worldpop.loc['Nigeria':'Russia']
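To see the two slicing rules side by side, here is a short sketch on a four-row stand-in Series (the name `demo` is ours). The label slice keeps its end point, while the position slice stops just before it.

```python
import pandas as pd

# A small stand-in Series with country names as the index
demo = pd.Series(
    [206, 169, 146, 127],
    index=['Nigeria', 'Bangladesh', 'Russia', 'Mexico'],
)

# Label slice with `.loc`: 'Russia' is included
by_label = demo.loc['Nigeria':'Russia']

# Position slice with `.iloc`: stops before position 2, so 'Russia' is left out
by_position = demo.iloc[0:2]

print(by_label.index.tolist())     # ['Nigeria', 'Bangladesh', 'Russia']
print(by_position.index.tolist())  # ['Nigeria', 'Bangladesh']
```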

A Series is like an ordered dictionary. In fact, we can create a Series out of a list (where the index will automatically be numerical starting at 0) or a dictionary (where the keys are the index).

In [None]:
# Creating a Series from a dictionary
# Based on most populous cities in the world according to Wikipedia

worldcitiespop = pd.Series({
    'Tokyo': 37,
    'Delhi': 28,
    'Shanghai': 25,
    'São Paulo': 21,
    'Mexico City': 21,
    'Cairo': 20,
    'Mumbai': 19,
    'Beijing': 19,
    'Dhaka': 19,
    'Osaka': 19,
}, name='World City Populations (In Millions)')

#Return the series
worldcitiespop
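The difference between building a Series from a list and from a dictionary shows up in the index. A minimal comparison, using the hypothetical names `from_list` and `from_dict`:

```python
import pandas as pd

# From a list: Pandas supplies a numerical index starting at 0
from_list = pd.Series([37, 28, 25])

# From a dictionary: the keys become the index
from_dict = pd.Series({'Tokyo': 37, 'Delhi': 28, 'Shanghai': 25})

print(from_list.index.tolist())  # [0, 1, 2]
print(from_dict.index.tolist())  # ['Tokyo', 'Delhi', 'Shanghai']
```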

## Boolean Expressions

We have seen already how we can select a particular value in a series by using an index name or number. We can also select particular values using Boolean expressions. Applied to a Series, a Boolean expression evaluates to a Truth Table: a Series of `True` and `False` values, one for each row.

In [None]:
# Which countries have populations greater than 200 million?
worldpop > 200

We can also pass the expression to `.loc[]` to get back a smaller Series containing only the rows where the expression is `True`.

In [None]:
# Evaluate worldpop for `worldpop > 200`
worldpop.loc[worldpop > 200]

# If we wanted to save this to a new series variable
#new_series = worldpop[worldpop > 200]

Pandas uses `|` to represent `or` operations. It uses `&` to represent `and` operations. We can also use `~` for negation.

|Pandas Operator|Boolean|Requires|
|---|---|---|
|&|and|All must be `True`|
|\||or|At least one must be `True`|
|~|not|The opposite (`True` becomes `False`)|

In [None]:
# Return countries with populations above 500 million or below 250 million
worldpop.loc[(worldpop > 500) | (worldpop < 250)]
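The other two operators work the same way. The sketch below uses a five-row stand-in Series, `demo_pop` (our name), to try `&` and `~`. Each comparison is wrapped in parentheses because `&`, `|`, and `~` bind more tightly than `>` and `<` in Python.

```python
import pandas as pd

# A stand-in Series with the five largest values from the table above
demo_pop = pd.Series(
    [1404, 1366, 330, 269, 220],
    index=['China', 'India', 'United States', 'Indonesia', 'Pakistan'],
)

# `&`: both conditions must be True (between 250 and 1,000 million)
mid_sized = demo_pop.loc[(demo_pop > 250) & (demo_pop < 1000)]

# `~`: flips every True/False value in the expression
not_large = demo_pop.loc[~(demo_pop > 1000)]

print(mid_sized.index.tolist())  # ['United States', 'Indonesia']
print(not_large.index.tolist())  # ['United States', 'Indonesia', 'Pakistan']
```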

## Modifying a Series

We can use an assignment statement to change a value in our Series.

In [None]:
# Change the population of China to 1500
worldpop.loc['China'] = 1500
print(worldpop)

In [None]:
# Change the population of several countries based on an expression
worldpop.loc[worldpop < 300] = 25
worldpop
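The assignments above overwrite values in `worldpop` for good. One cautious pattern, sketched here with the hypothetical names `original` and `revised`, is to make the changes on a copy created with `.copy()` so the starting values remain available.

```python
import pandas as pd

original = pd.Series(
    [1404, 1366, 330],
    index=['China', 'India', 'United States'],
)

# Work on a copy so the original values stay untouched
revised = original.copy()

# Change a single value by index name
revised.loc['China'] = 1500

# Change every value that matches an expression
revised.loc[revised < 1000] = 0

print(original.tolist())  # [1404, 1366, 330]
print(revised.tolist())   # [1500, 1366, 0]
```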

## Summary of Pandas Series

* A Series is a single column of data that may contain a Name and Index
* Use `.iloc` to select a row by index number
* Use `.loc` to select a row by index name
* Use an assignment statement to change values
* Boolean operators include `&` (and), `|` (or), `~` (negation)

# Pandas DataFrame

If a Series is like a column of data, a DataFrame is like a table connecting multiple columns together. DataFrames can contain thousands or millions of rows and columns. When working with DataFrames, we are usually using a dataset that has been compiled by someone else. Often the data will be in the form of a CSV or Excel file. 

In [1]:
import pandas as pd

# Create a DataFrame `df` from the CSV file 'sample2.csv'
# `index_col` makes the `Username` column the row index
df = pd.read_csv('sample2.csv', index_col='Username')
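If `sample2.csv` is not at hand, a similar DataFrame can be built from text held in memory. The sketch below copies three rows from the outputs shown in this notebook; the exact layout of the CSV text is our assumption about what the file contains, and `demo_df` is a name chosen so that `df` is not overwritten.

```python
import io

import pandas as pd

# Three rows copied from this notebook's outputs; the CSV layout
# is an assumption about the contents of 'sample2.csv'
csv_text = """Username,Login email,Identifier,First name,Last name
booker12,rachel@example.com,9012,Rachel,Booker
grey07,,2070,Laura,Grey
smith79,jamie@example.com,5079,Jamie,Smith
"""

# `read_csv` accepts a file-like object as well as a file path
demo_df = pd.read_csv(io.StringIO(csv_text), index_col='Username')

print(demo_df.shape)
print(demo_df.index.tolist())
```

The empty email field for `grey07` is read in as `NaN`, the marker Pandas uses for missing data.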

## Exploring DataFrame Contents
Now that we have a DataFrame called `df`, we need to learn a little more about its contents. The first step is usually to explore the DataFrame's attributes. Attributes are properties of the dataset (not functions), so they do not have parentheses `()` after them. 

|Attribute|Reveals|
|---|---|
|.shape| The number of rows and columns|
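|.info| A method rather than an attribute: written without parentheses it displays the DataFrame's contents, while `.info()` summarizes each column's name, non-null count, and dtype|
|.columns| The name of each column|
|.index| The name of each row|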

In [2]:
# Use `.shape` to find rows and columns in the DataFrame
df.shape

(12, 4)

In [3]:
# `.info` without parentheses displays the bound method, which includes the DataFrame's contents
# Call `df.info()` instead for a summary of columns, non-null counts, and dtypes
df.info

<bound method DataFrame.info of                      Login email  Identifier First name Last name
Username                                                         
booker12      rachel@example.com        9012     Rachel    Booker
grey07                       NaN        2070      Laura      Grey
johnson81                    NaN        4081      Craig   Johnson
jenkins46       mary@example.com        9346       Mary   Jenkins
smith79        jamie@example.com        5079      Jamie     Smith
redtree333      phil@example.com        3332     Philip     Marks
ghost032       tonya@example.com        2310      Tonya     Smith
french999       fre1@example.com        2343       Toby    French
Yolandam719  ymurphy@example.com        8300    Yolanda    Murphy
sandy333      sandra@example.com        1132     Sandra   Hammond
tristan299   tristan@example.com        2143    Tristan   Markham
broom378      brooms@example.com        8002     Brooke    Carver>

In [4]:
# Use `.columns` to find the name of each column (if they are named)
df.columns

Index(['Login email', 'Identifier', 'First name', 'Last name'], dtype='object')

We can use the `.index` attribute to discover the name for each row in our DataFrame. We set the index column to `Username`, but `Identifier` would also make sense. If no column is chosen, a numeric index is created starting at 0.

In [16]:
# Use `.index` to list the rows of our DataFrame
df.index

Index(['booker12', 'grey07', 'johnson81', 'jenkins46', 'smith79', 'redtree333',
       'ghost032', 'french999', 'Yolandam719', 'sandy333', 'tristan299',
       'broom378'],
      dtype='object', name='Username')
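To make the distinction between attributes and methods concrete, here is a minimal sketch on a two-row stand-in DataFrame (`demo_df` is our name). The attributes are read without parentheses, while `.info()` is called like a function; passing it a text buffer through its `buf` parameter lets us keep the summary in a variable.

```python
import io

import pandas as pd

demo_df = pd.DataFrame(
    {'Identifier': [9012, 2070], 'First name': ['Rachel', 'Laura']},
    index=pd.Index(['booker12', 'grey07'], name='Username'),
)

# Attributes: no parentheses
rows, cols = demo_df.shape
column_names = demo_df.columns.tolist()
row_names = demo_df.index.tolist()

# `.info()` is a method; capture its printed summary in a buffer
buffer = io.StringIO()
demo_df.info(buf=buffer)
summary = buffer.getvalue()

print(rows, cols)
print(column_names)
print(row_names)
print(summary)
```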

## Preview with `.head()` and `.tail()`
We can also use the `.head()` and `.tail()` methods to get a preview of our DataFrame.

In [6]:
# Use `.head()` to see the first five lines
# Pass an integer into .head() to see a different number of lines
df.head()

|Username|Login email|Identifier|First name|Last name|
|---|---|---|---|---|
|booker12|rachel@example.com|9012|Rachel|Booker|
|grey07|NaN|2070|Laura|Grey|
|johnson81|NaN|4081|Craig|Johnson|
|jenkins46|mary@example.com|9346|Mary|Jenkins|
|smith79|jamie@example.com|5079|Jamie|Smith|


In [7]:
# Use `.tail()` to see the last five lines
# Pass an integer into .tail() to see a different number of lines
df.tail()

|Username|Login email|Identifier|First name|Last name|
|---|---|---|---|---|
|french999|fre1@example.com|2343|Toby|French|
|Yolandam719|ymurphy@example.com|8300|Yolanda|Murphy|
|sandy333|sandra@example.com|1132|Sandra|Hammond|
|tristan299|tristan@example.com|2143|Tristan|Markham|
|broom378|brooms@example.com|8002|Brooke|Carver|


### Display More Rows or Columns
By default, Pandas limits the number of rows and columns to display. If desired, we can increase or decrease the number to display. If your DataFrame has a limited number of rows or columns, you may wish to show all of them.

In [8]:
# Show all columns
# Replace `None` with an integer to show a set number
pd.set_option('display.max_columns', None)

# Show all rows
# Replace `None` with an integer to show a set number
# Be careful if your dataset is thousands of lines long!
pd.set_option('display.max_rows', None)
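`pd.set_option` changes the display for the rest of the session. If a change should apply only briefly, one alternative is `pd.option_context`, sketched below; the variable names `before`, `inside`, and `after` are ours.

```python
import pandas as pd

# Remember whatever the setting is right now
before = pd.get_option('display.max_rows')

# Inside the `with` block, at most 5 rows are displayed
with pd.option_context('display.max_rows', 5):
    inside = pd.get_option('display.max_rows')

# After the block, the earlier setting is back in place
after = pd.get_option('display.max_rows')

print(before, inside, after)
```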

## Change Column Names
If we wanted to change the column names, one option is to modify the original data file. We can also change the column names in the DataFrame.

In [40]:
# Updating all column names at once
df.columns = ['email', 'Identifier', 'First name', 'Last name']
df

|Username|email|Identifier|First name|Last name|
|---|---|---|---|---|
|booker12|rachel@example.com|9012|Rachel|Booker|
|grey07|NaN|2070|Laura|Grey|
|johnson81|NaN|4081|Craig|Johnson|
|jenkins46|mary@example.com|9346|Mary|Jenkins|
|smith79|jamie@example.com|5079|Jamie|Smith|
|redtree333|phil@example.com|3332|Philip|Marks|
|ghost032|tonya@example.com|2310|Tonya|Smith|
|french999|fre1@example.com|2343|Toby|French|
|Yolandam719|ymurphy@example.com|8300|Yolanda|Murphy|
|sandy333|sandra@example.com|1132|Sandra|Hammond|


In [41]:
# Updating a single column name
df.rename(columns={'email': 'Login email'}, inplace=True)
df

|Username|Login email|Identifier|First name|Last name|
|---|---|---|---|---|
|booker12|rachel@example.com|9012|Rachel|Booker|
|grey07|NaN|2070|Laura|Grey|
|johnson81|NaN|4081|Craig|Johnson|
|jenkins46|mary@example.com|9346|Mary|Jenkins|
|smith79|jamie@example.com|5079|Jamie|Smith|
|redtree333|phil@example.com|3332|Philip|Marks|
|ghost032|tonya@example.com|2310|Tonya|Smith|
|french999|fre1@example.com|2343|Toby|French|
|Yolandam719|ymurphy@example.com|8300|Yolanda|Murphy|
|sandy333|sandra@example.com|1132|Sandra|Hammond|


## Reset the Index

When we created the DataFrame, we used the `index_col` parameter to set the index column to the `Username` column.

```df = pd.read_csv('sample2.csv', index_col='Username')```

We could reset the index to a numerical index starting at 0 using the `.reset_index()` method.

In [9]:
# Reset the index for the DataFrame to integers,
# moving `Username` into a regular column
# Without `inplace=True`, this only previews the change
df.reset_index()

| |Username|Login email|Identifier|First name|Last name|
|---|---|---|---|---|---|
|0|booker12|rachel@example.com|9012|Rachel|Booker|
|1|grey07|NaN|2070|Laura|Grey|
|2|johnson81|NaN|4081|Craig|Johnson|
|3|jenkins46|mary@example.com|9346|Mary|Jenkins|
|4|smith79|jamie@example.com|5079|Jamie|Smith|
|5|redtree333|phil@example.com|3332|Philip|Marks|
|6|ghost032|tonya@example.com|2310|Tonya|Smith|
|7|french999|fre1@example.com|2343|Toby|French|
|8|Yolandam719|ymurphy@example.com|8300|Yolanda|Murphy|
|9|sandy333|sandra@example.com|1132|Sandra|Hammond|


For many operations that alter a DataFrame, such as `.reset_index()`, the method returns a changed copy and leaves the original untouched unless an `inplace=True` parameter is passed. This lets us preview changes to the data before making them permanent. Of course, you should always work on a copy of your data in case a manipulation goes awry.

In [11]:
# Confirm index has not been changed
df

|Username|Login email|Identifier|First name|Last name|
|---|---|---|---|---|
|booker12|rachel@example.com|9012|Rachel|Booker|
|grey07|NaN|2070|Laura|Grey|
|johnson81|NaN|4081|Craig|Johnson|
|jenkins46|mary@example.com|9346|Mary|Jenkins|
|smith79|jamie@example.com|5079|Jamie|Smith|
|redtree333|phil@example.com|3332|Philip|Marks|
|ghost032|tonya@example.com|2310|Tonya|Smith|
|french999|fre1@example.com|2343|Toby|French|
|Yolandam719|ymurphy@example.com|8300|Yolanda|Murphy|
|sandy333|sandra@example.com|1132|Sandra|Hammond|


In [10]:
# Make the change to reset the index
df.reset_index(inplace=True)
# Print the index, now changed
df

| |Username|Login email|Identifier|First name|Last name|
|---|---|---|---|---|---|
|0|booker12|rachel@example.com|9012|Rachel|Booker|
|1|grey07|NaN|2070|Laura|Grey|
|2|johnson81|NaN|4081|Craig|Johnson|
|3|jenkins46|mary@example.com|9346|Mary|Jenkins|
|4|smith79|jamie@example.com|5079|Jamie|Smith|
|5|redtree333|phil@example.com|3332|Philip|Marks|
|6|ghost032|tonya@example.com|2310|Tonya|Smith|
|7|french999|fre1@example.com|2343|Toby|French|
|8|Yolandam719|ymurphy@example.com|8300|Yolanda|Murphy|
|9|sandy333|sandra@example.com|1132|Sandra|Hammond|


In [12]:
# Change the index back to `Username`
df.set_index('Username', inplace=True)
df

|Username|Login email|Identifier|First name|Last name|
|---|---|---|---|---|
|booker12|rachel@example.com|9012|Rachel|Booker|
|grey07|NaN|2070|Laura|Grey|
|johnson81|NaN|4081|Craig|Johnson|
|jenkins46|mary@example.com|9346|Mary|Jenkins|
|smith79|jamie@example.com|5079|Jamie|Smith|
|redtree333|phil@example.com|3332|Philip|Marks|
|ghost032|tonya@example.com|2310|Tonya|Smith|
|french999|fre1@example.com|2343|Toby|French|
|Yolandam719|ymurphy@example.com|8300|Yolanda|Murphy|
|sandy333|sandra@example.com|1132|Sandra|Hammond|
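The difference between previewing a change and making it can be sketched on a two-row stand-in DataFrame (`demo_df` and `preview` are our names). Assigning the returned DataFrame back to the same variable is a common alternative to passing `inplace=True`.

```python
import pandas as pd

demo_df = pd.DataFrame(
    {'Identifier': [9012, 2070]},
    index=pd.Index(['booker12', 'grey07'], name='Username'),
)

# Without `inplace=True`, a new DataFrame is returned
preview = demo_df.reset_index()

# `demo_df` itself still uses `Username` as its index
print(demo_df.index.name)        # Username
print(preview.columns.tolist())  # ['Username', 'Identifier']

# Assigning the result back makes the change stick
demo_df = demo_df.reset_index()
print(demo_df.index.tolist())    # [0, 1]
```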


## Sorting the Index

We can sort the index by using `sort_index()`.

In [15]:
# Sort the DataFrame by ascending order
df.sort_index()

|Username|Login email|Identifier|First name|Last name|
|---|---|---|---|---|
|Yolandam719|ymurphy@example.com|8300|Yolanda|Murphy|
|booker12|rachel@example.com|9012|Rachel|Booker|
|broom378|brooms@example.com|8002|Brooke|Carver|
|french999|fre1@example.com|2343|Toby|French|
|ghost032|tonya@example.com|2310|Tonya|Smith|
|grey07|NaN|2070|Laura|Grey|
|jenkins46|mary@example.com|9346|Mary|Jenkins|
|johnson81|NaN|4081|Craig|Johnson|
|redtree333|phil@example.com|3332|Philip|Marks|
|sandy333|sandra@example.com|1132|Sandra|Hammond|


In [18]:
# Sort by descending order
df.sort_index(ascending=False)

|Username|Login email|Identifier|First name|Last name|
|---|---|---|---|---|
|tristan299|tristan@example.com|2143|Tristan|Markham|
|smith79|jamie@example.com|5079|Jamie|Smith|
|sandy333|sandra@example.com|1132|Sandra|Hammond|
|redtree333|phil@example.com|3332|Philip|Marks|
|johnson81|NaN|4081|Craig|Johnson|
|jenkins46|mary@example.com|9346|Mary|Jenkins|
|grey07|NaN|2070|Laura|Grey|
|ghost032|tonya@example.com|2310|Tonya|Smith|
|french999|fre1@example.com|2343|Toby|French|
|broom378|brooms@example.com|8002|Brooke|Carver|
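In the ascending sort, `Yolandam719` comes first because uppercase letters sort before lowercase ones. If a case-insensitive order is preferred, one possibility is the `key` parameter of `sort_index()`, sketched below on a three-row stand-in Series. This assumes a Pandas version that supports `key` (1.1 or later), and the names `demo`, `default_order`, and `ignore_case` are ours.

```python
import pandas as pd

demo = pd.Series(
    [9012, 8300, 2070],
    index=['booker12', 'Yolandam719', 'grey07'],
)

# Default sort: uppercase 'Y' comes before lowercase letters
default_order = demo.sort_index().index.tolist()

# Lowercase the index labels before comparing them
ignore_case = demo.sort_index(key=lambda idx: idx.str.lower()).index.tolist()

print(default_order)  # ['Yolandam719', 'booker12', 'grey07']
print(ignore_case)    # ['booker12', 'grey07', 'Yolandam719']
```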


## `.loc[]` and `.iloc[]` Selection

Like Series, DataFrames can use `.iloc[]` and `.loc[]` for selection. To select a particular element, we need to supply a row *and* a column.


In [22]:
# View our DataFrame for reference
df

|Username|Login email|Identifier|First name|Last name|
|---|---|---|---|---|
|booker12|rachel@example.com|9012|Rachel|Booker|
|grey07|NaN|2070|Laura|Grey|
|johnson81|NaN|4081|Craig|Johnson|
|jenkins46|mary@example.com|9346|Mary|Jenkins|
|smith79|jamie@example.com|5079|Jamie|Smith|
|redtree333|phil@example.com|3332|Philip|Marks|
|ghost032|tonya@example.com|2310|Tonya|Smith|
|french999|fre1@example.com|2343|Toby|French|
|Yolandam719|ymurphy@example.com|8300|Yolanda|Murphy|
|sandy333|sandra@example.com|1132|Sandra|Hammond|


In [23]:
# Return the value at row position 6, column position 3
df.iloc[6, 3]

'Smith'

In [19]:
# Return the value at row label 'booker12', column label 'First name'
df.loc['booker12', 'First name']

'Rachel'

In [29]:
# Select an entire row
df.loc['redtree333', :]

Login email    phil@example.com
Identifier                 3332
First name               Philip
Last name                 Marks
Name: redtree333, dtype: object

Technically, we could also use `df.loc['redtree333']` for the same result, but including the `, :` makes our row and column selections explicit: the `:` is a slice that includes every column. Using a `:` is required if we want to select an entire column using `.loc[]`, since the row selection comes before the column selection.

In [34]:
# Select an entire column
df.loc[:, 'Login email']

Username
booker12        rachel@example.com
grey07                         NaN
johnson81                      NaN
jenkins46         mary@example.com
smith79          jamie@example.com
redtree333        phil@example.com
ghost032         tonya@example.com
french999         fre1@example.com
Yolandam719    ymurphy@example.com
sandy333        sandra@example.com
tristan299     tristan@example.com
broom378        brooms@example.com
Name: Login email, dtype: object

Of course, we can use the `:` to make a slice using `.iloc[]` or `.loc[]`.

In [35]:
# Slicing rows and columns using `.iloc`
df.iloc[0:3, 1:4]

|Username|Identifier|First name|Last name|
|---|---|---|---|
|booker12|9012|Rachel|Booker|
|grey07|2070|Laura|Grey|
|johnson81|4081|Craig|Johnson|


**Note that `.iloc[]` slicing is not inclusive of the final value, similar to a Python list**. On the other hand, `.loc[]` slicing *is* inclusive. If `.loc[]` slicing excluded the final value, the code would be confusing, since we would need to supply whatever name comes *after* the last name we want to include.

In [37]:
# Slicing rows and columns using `.loc`
df.loc['booker12':'french999', 'Login email':'First name']

|Username|Login email|Identifier|First name|
|---|---|---|---|
|booker12|rachel@example.com|9012|Rachel|
|grey07|NaN|2070|Laura|
|johnson81|NaN|4081|Craig|
|jenkins46|mary@example.com|9346|Mary|
|smith79|jamie@example.com|5079|Jamie|
|redtree333|phil@example.com|3332|Philip|
|ghost032|tonya@example.com|2310|Tonya|
|french999|fre1@example.com|2343|Toby|
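Slices select neighboring rows and columns. To pick out rows or columns that are not next to each other, `.loc[]` also accepts lists of names, just as it did for a Series. A minimal sketch on a three-row stand-in DataFrame (`demo_df` and `picked` are our names):

```python
import pandas as pd

demo_df = pd.DataFrame(
    {
        'Login email': ['rachel@example.com', None, 'jamie@example.com'],
        'Identifier': [9012, 2070, 5079],
        'Last name': ['Booker', 'Grey', 'Smith'],
    },
    index=pd.Index(['booker12', 'grey07', 'smith79'], name='Username'),
)

# Lists pick out specific rows and columns, in the order given
picked = demo_df.loc[['smith79', 'booker12'], ['Last name', 'Identifier']]

print(picked)
```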


## Boolean Expressions
We can also use Boolean expressions to select based on the contents of the elements. We can use these expressions to create filters for selecting particular rows or columns.

|Pandas Operator|Boolean|Requires|
|---|---|---|
|&|and|All must be `True`|
|\||or|At least one must be `True`|
|~|not|The opposite (`True` becomes `False`)|


In [42]:
df

|Username|Login email|Identifier|First name|Last name|
|---|---|---|---|---|
|booker12|rachel@example.com|9012|Rachel|Booker|
|grey07|NaN|2070|Laura|Grey|
|johnson81|NaN|4081|Craig|Johnson|
|jenkins46|mary@example.com|9346|Mary|Jenkins|
|smith79|jamie@example.com|5079|Jamie|Smith|
|redtree333|phil@example.com|3332|Philip|Marks|
|ghost032|tonya@example.com|2310|Tonya|Smith|
|french999|fre1@example.com|2343|Toby|French|
|Yolandam719|ymurphy@example.com|8300|Yolanda|Murphy|
|sandy333|sandra@example.com|1132|Sandra|Hammond|


In [44]:
# Return a Truth Table for the `Identifier` column
# Where the Identifier is more than 4000
df.loc[:, 'Identifier'] > 4000

Username
booker12        True
grey07         False
johnson81       True
jenkins46       True
smith79         True
redtree333     False
ghost032       False
french999      False
Yolandam719     True
sandy333       False
tristan299     False
broom378        True
Name: Identifier, dtype: bool

In [47]:
# Preview every row where the Identifier is more than 4000
id_filter = (df.loc[:, 'Identifier'] > 4000)
df.loc[id_filter, :]

# Alternatively, the whole expression can be written out
# But this can be a little more difficult to read
# In this case, it is a good idea to include parentheses
# To make clear the row filter is one expression
#df.loc[(df.loc[:, 'Identifier'] > 4000), :]

|Username|Login email|Identifier|First name|Last name|
|---|---|---|---|---|
|booker12|rachel@example.com|9012|Rachel|Booker|
|johnson81|NaN|4081|Craig|Johnson|
|jenkins46|mary@example.com|9346|Mary|Jenkins|
|smith79|jamie@example.com|5079|Jamie|Smith|
|Yolandam719|ymurphy@example.com|8300|Yolanda|Murphy|
|broom378|brooms@example.com|8002|Brooke|Carver|


In [58]:
# Preview every row with a `Last name` of Smith
name_filter = df.loc[:, 'Last name'] == 'Smith'
df.loc[name_filter, :]

|Username|Login email|Identifier|First name|Last name|
|---|---|---|---|---|
|smith79|jamie@example.com|5079|Jamie|Smith|
|ghost032|tonya@example.com|2310|Tonya|Smith|


In [61]:
# Select the row with a `First name` of Jamie
# and a `Last name` of Smith
name_filter = (df.loc[:, 'Last name'] == 'Smith') & (df.loc[:, 'First name'] == 'Jamie')
df.loc[name_filter, :]


|Username|Login email|Identifier|First name|Last name|
|---|---|---|---|---|
|smith79|jamie@example.com|5079|Jamie|Smith|


In [62]:
# Find every row where `Last name` is not Smith
name_filter = (df.loc[:, 'Last name'] == 'Smith')
df.loc[~name_filter, :]

# Or alternatively
#name_filter = (df.loc[:, 'Last name'] != 'Smith')
#df.loc[name_filter, :]

|Username|Login email|Identifier|First name|Last name|
|---|---|---|---|---|
|booker12|rachel@example.com|9012|Rachel|Booker|
|grey07|NaN|2070|Laura|Grey|
|johnson81|NaN|4081|Craig|Johnson|
|jenkins46|mary@example.com|9346|Mary|Jenkins|
|redtree333|phil@example.com|3332|Philip|Marks|
|french999|fre1@example.com|2343|Toby|French|
|Yolandam719|ymurphy@example.com|8300|Yolanda|Murphy|
|sandy333|sandra@example.com|1132|Sandra|Hammond|
|tristan299|tristan@example.com|2143|Tristan|Markham|
|broom378|brooms@example.com|8002|Brooke|Carver|
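The `|` operator has not appeared in a DataFrame filter yet. Here is a minimal sketch on a four-row stand-in DataFrame (`demo_df`, `either`, and `first_names` are our names). It also shows that a row filter can be paired with a single column name instead of `:`.

```python
import pandas as pd

demo_df = pd.DataFrame(
    {
        'Identifier': [9012, 2070, 5079, 2310],
        'First name': ['Rachel', 'Laura', 'Jamie', 'Tonya'],
        'Last name': ['Booker', 'Grey', 'Smith', 'Smith'],
    },
    index=pd.Index(['booker12', 'grey07', 'smith79', 'ghost032'], name='Username'),
)

# Rows with a `Last name` of Smith OR an `Identifier` above 9000
either = (demo_df.loc[:, 'Last name'] == 'Smith') | (demo_df.loc[:, 'Identifier'] > 9000)

# Keep only the `First name` column for the matching rows
first_names = demo_df.loc[either, 'First name']

print(first_names.tolist())  # ['Rachel', 'Jamie', 'Tonya']
```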


## Modifying a DataFrame

In [None]:


# Check for a string in cells
# Use a string method
# filt = df['column'].str.contains('string', na=False)
#
# `na=False` treats NaN cells as not containing the string,
# so the filter holds only True/False values
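The notes above can be turned into a runnable sketch. The example below is one possible way to combine a string filter with an assignment through `.loc[]`, following the same pattern used to modify a Series earlier. The stand-in DataFrame `demo_df`, the `.org` address, and the replacement value are invented for illustration. Passing `regex=False` makes `str.contains` treat the search text literally.

```python
import pandas as pd

# Stand-in data; the `.org` address is invented for this example
demo_df = pd.DataFrame(
    {'Login email': ['rachel@example.com', None, 'jamie@example.org']},
    index=pd.Index(['booker12', 'grey07', 'smith79'], name='Username'),
)

# `na=False` counts missing (NaN) cells as "does not contain"
filt = demo_df['Login email'].str.contains('example.com', na=False, regex=False)

# Assignment through `.loc` changes only the rows the filter selects
demo_df.loc[filt, 'Login email'] = 'updated@example.com'

print(filt.tolist())  # [True, False, False]
print(demo_df)
```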

## Dropping Rows Without Data