<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Created by [Nathan Kelber](http://nkelber.com) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email nathan.kelber@ithaka.org.<br />
___

# Pandas 2

**Description:** This notebook describes how to:
* Create a Pandas DataFrame
* Access rows, columns, and elements using `.loc` and `.iloc`
* Create filters using boolean operators
* Change data in the DataFrame

This is the second notebook in a series on learning to use Pandas. 

**Use Case:** For Learners (Detailed explanation, not ideal for researchers)

**Difficulty:** Intermediate

**Knowledge Required:** 
* [Pandas 1](./pandas-1.ipynb)
* Python Basics ([Start Python Basics I](./python-basics-1.ipynb))

**Knowledge Recommended:** 
* [Working with Dataset Files](./working-with-dataset-files.ipynb)

**Completion Time:** 75 minutes

**Data Format:** CSV (.csv)

**Libraries Used:** Pandas

**Research Pipeline:** None
___


In [None]:
### Download Sample Files for this Lesson
import urllib.request
download_urls = [
    'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/sample2.csv'
]

for url in download_urls:
    urllib.request.urlretrieve(url, './data/' + url.rsplit('/', 1)[-1])
print('Sample files retrieved.')

## Pandas DataFrames

If a Series represents a column of data, a DataFrame represents a full table composed of multiple columns together. DataFrames can contain thousands or millions of rows and columns. When working with DataFrames, we are usually using a dataset that has been compiled by someone else. Often the data will be in the form of a CSV or Excel file. 

We can convert the data in a .csv file to a Pandas DataFrame using the `pd.read_csv()` function. We pass in the location of the .csv file. Additionally, we can name the column to use as the index with `index_col`.

Use the **File > Open** menu above to navigate to `sample2.csv` in the `/data` folder. Preview its structure before we load it into a dataframe.

In [None]:
import pandas as pd

# Create a DataFrame `df` from the CSV file 'sample2.csv'
df = pd.read_csv('data/sample2.csv', index_col='Username')

By convention, a dataframe variable is called `df` but we could give it any valid Python variable name.
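
To see what `index_col` changes, here is a minimal sketch that reads the same CSV text twice. The CSV text and the names `csv_text`, `default_df`, and `indexed_df` are hypothetical and used only for illustration; they are not the contents of `sample2.csv`.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = """Username,Identifier,First name
user_a,101,Ada
user_b,202,Ben
"""

# Without `index_col`, pandas creates a numeric index starting at 0
default_df = pd.read_csv(io.StringIO(csv_text))

# With `index_col`, the named column becomes the row labels
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col='Username')

print(default_df.index.tolist())
print(indexed_df.index.tolist())
```

Note that a column used as the index no longer appears among the regular columns of the DataFrame.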

### Exploring DataFrame Contents
Now that we have a DataFrame called `df`, we need to learn a little more about its contents. The first step is usually to explore the DataFrame's attributes. Attributes are properties of the dataset (not functions), so they do not have parentheses `()` after them. 

|Attribute|Reveals|
|---|---|
|.shape| The number of rows and columns|
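|.info| For a long DataFrame, the first and last 5 rows plus the shape (strictly a method: `.info()` with parentheses prints column types and non-null counts instead)|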
|.columns| The name of each column|

In [None]:
# Use `.shape` to find rows and columns in the DataFrame
df.shape

In [None]:
# Use `.info` to find the shape plus the first and last five rows of the DataFrame

In [None]:
# Use `.columns` to find the name of each column (if they are named)

We can use the `.index` attribute to discover the name for each row in our DataFrame. We set the index column (`index_col=`) to `Username`, but `Identifier` might also make sense with this data. If no column is chosen, a numeric index is created starting at 0.

In [None]:
# Use `.index` to list the rows of our DataFrame
df.index

### Preview with `.head()` and `.tail()`
We can also use the `.head()` and `.tail()` methods to get a preview of our DataFrame.

In [None]:
# Use `.head()` to see the first five lines
# Pass an integer into .head() to see a different number of lines
df.head()

In [None]:
# Use `.tail()` to see the last five lines
# Pass an integer into .tail() to see a different number of lines
df.tail()

### Display More Rows or Columns with `.set_option()`
By default, Pandas limits the number of rows and columns to display. If desired, we can increase or decrease the number to display. If your DataFrame has a limited number of rows or columns, you may wish to show all of them.

In [None]:
# Show all columns
# Set `None` to an integer to show a set number
pd.set_option('display.max_columns', None)

# Show all rows
# Set `None` to an integer to show a set number
# Be careful if your dataset is millions of lines long!
pd.set_option('display.max_rows', None)
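
If you only want a different display limit for a single preview, one possible approach is the `pd.option_context()` context manager, which restores the previous setting when the block ends. The sketch below uses a hypothetical DataFrame called `numbers_df`; the limit of 6 rows is an arbitrary choice.

```python
import pandas as pd

# Hypothetical DataFrame with 100 rows, used only for illustration
numbers_df = pd.DataFrame({'value': range(100)})

# Remember the current setting so we can compare afterwards
limit_before = pd.get_option('display.max_rows')

# Inside the `with` block, the row limit is temporarily set to 6
with pd.option_context('display.max_rows', 6):
    limit_inside = pd.get_option('display.max_rows')
    print(numbers_df)

# Outside the block, the previous setting is back in effect
limit_outside = pd.get_option('display.max_rows')
```

This avoids accidentally leaving "show all rows" switched on for a very large dataset.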

### Change Column Names with `.columns`
We can change the column names with the `.columns` attribute.

In [None]:
# Updating all column names at once
df.columns = ['email', 'Identifier', 'First name', 'Last name']
df

We can also use the `.rename()` method to change the name of a single given column name.

In [None]:
# Updating a single column name
df = df.rename(columns={'email': 'Login email'})
df

### An important note on previewing and permanent changes in Pandas

In order to make our changes stick, we had to use an assignment statement:
`df = df.rename(columns={'email': 'Login email'})`
If we had just written:
`df.rename(columns={'email': 'Login email'})` 
Pandas would preview the change but not actually change the dataframe. The assignment statement tells Pandas we want the change to be permanent.

There is no "undo" when making a change to a Pandas dataframe, so it is a good idea to always preview changes before committing them to an assignment statement. **Always back up your data so a dataframe manipulation mistake will not ruin your data.**
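
The difference between previewing and assigning can be sketched with a small example. The DataFrame `demo_df` and its contents are hypothetical and exist only to illustrate the behavior.

```python
import pandas as pd

# Hypothetical data for illustration only
demo_df = pd.DataFrame({'email': ['a@example.com', 'b@example.com']})

# Without assigning back to `demo_df`, `.rename()` returns a renamed copy
# and leaves the original DataFrame untouched
preview = demo_df.rename(columns={'email': 'Login email'})
names_after_preview = demo_df.columns.tolist()

# Assigning the result back to `demo_df` makes the change stick
demo_df = demo_df.rename(columns={'email': 'Login email'})
names_after_assignment = demo_df.columns.tolist()

print(names_after_preview)
print(names_after_assignment)
```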

#### An alternative way to permanently change a dataframe
There is another alternative for making permanent changes. You can pass the argument `inplace=True` without using an assignment statement. We do not recommend this as a good practice, and it is possible it may be removed in the future. We mention it here, however, since it may appear in other code you find.

In [None]:
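# Updating a single column name using `inplace=True`
# Note: if the earlier cells were run, the column is now called 'Login email',
# so there is no 'email' column to rename. `.rename()` ignores missing
# labels by default, and this cell leaves the column names unchanged.
df.rename(columns={'email': 'email'}, inplace=True)
df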

### Reset the Index

When we created the dataframe, we used the `index_col` parameter to set the index column to the `Username` column:

```df = pd.read_csv('data/sample2.csv', index_col='Username')```

We could reset the index to a numerical index starting at 0 using the `.reset_index()` method.

In [None]:
# Change the dataframe to a numerical index
df = df.reset_index()
df

### Set the Index with `.set_index()`
We can change the index to a different column with the `.set_index()` method. By default, `.set_index()` discards the current index, so use `.reset_index()` first if you want to keep the data currently in the index as a column. If you accidentally delete a column, load the dataframe back in from the original CSV file. If you need to delete a column, you can use: `df.drop('column_name', axis=1)`.

In [None]:
# The previous cell changed the index to a numerical index
# Now assign a new index (dropping the numerical index)
df = df.set_index('Username')
df
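
Here is a minimal sketch of why the order of `.reset_index()` and `.set_index()` matters. The data and the names `demo_df`, `lost_df`, and `kept_df` are hypothetical.

```python
import pandas as pd

# Hypothetical data, indexed by 'Username'
demo_df = pd.DataFrame(
    {'Username': ['user_a', 'user_b'],
     'Identifier': [101, 202],
     'First name': ['Ada', 'Ben']}
).set_index('Username')

# Setting a new index directly discards the old one: 'Username' is gone
lost_df = demo_df.set_index('Identifier')

# Resetting first moves 'Username' back into a regular column, so it survives
kept_df = demo_df.reset_index().set_index('Identifier')

print(lost_df.columns.tolist())
print(kept_df.columns.tolist())
```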

### Sorting the Index

We can sort the index by using the `.sort_index()` method.

In [None]:
# Sort the DataFrame by its index in ascending order
df.sort_index()

To sort the index in descending order, pass the argument `ascending=False`.

In [None]:
# Sort the DataFrame by its index in descending order
df.sort_index(ascending=False)

### `.loc[]` and `.iloc[]` Selection

Like a Series, a DataFrame can use the `.iloc[]` and `.loc[]` indexers for selection. To select a particular element, we need to supply a row *and* a column.

In [None]:
# View our DataFrame for reference
df

When we use index numbers with `.iloc[]`, the column names and index column (shown in bold in the DataFrame preview above) are not counted.

In [None]:
# Return the value for the specified row and column
# based on index numbers. The index column and column names are not counted.
df.iloc[1, 3]

When we use index names with `.loc[]`, we need to supply the row name and the column name to select a single element. 

In [None]:
# Return the value for the specified row and column
df.loc['booker12', 'First name']
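
To compare the two approaches side by side, the sketch below selects the same element by position and by label. The DataFrame `demo_df` and its values are hypothetical and are not taken from `sample2.csv`.

```python
import pandas as pd

# Hypothetical data with string row labels
demo_df = pd.DataFrame(
    {'Identifier': [101, 202, 303],
     'First name': ['Ada', 'Ben', 'Cy']},
    index=['user_a', 'user_b', 'user_c']
)

# Position 1 is the second row; position 0 is the first data column
# (the index column itself is not counted)
by_position = demo_df.iloc[1, 0]

# The same element selected by row label and column name
by_label = demo_df.loc['user_b', 'Identifier']

print(by_position, by_label)
```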

If we want to select an entire row of data, we pass in the row and a `:` for the column. The colon—without a start or stop specified—creates a slice that contains every column.

In [None]:
# Select an entire row
df.loc['redtree333', :]

Technically, we could also use: `df.loc['redtree333']` for the same result, but including the `, :` makes our row *and column* selection explicit.

In [None]:
# Select an entire row without a colon
df.loc['redtree333']

If we select an entire column with `.loc[]`, the colon is required since the row selection comes before the column selection.

In [None]:
# Select an entire column with `.loc[]`
df.loc[:, 'Login email']

We can select columns in a more compact way without using `.loc[]`. Passing the column name into `df[]` will return the full column.

In [None]:
# A shorter way to select a column without using `.loc[]`
# This only works with columns and returns a Series 
df['Login email']

As you might expect, we can use `:` to make a slice using `.iloc[]` or `.loc[]`. Try this below.

In [None]:
# Slicing rows and columns using `.iloc`

In [None]:
# Slicing rows and columns using `.loc`

**As a quick reminder**, `.iloc[]` slicing does not include the final value. On the other hand, `.loc[]` slicing does include the final value.
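
The difference in slicing behavior can be seen in a small sketch. The DataFrame `letters_df` is hypothetical and deliberately simple.

```python
import pandas as pd

# Hypothetical data with single-letter row labels
letters_df = pd.DataFrame(
    {'value': [10, 20, 30, 40]},
    index=['a', 'b', 'c', 'd']
)

# `.iloc[]` uses positions; the stop position (2) is excluded
by_position = letters_df.iloc[0:2]

# `.loc[]` uses labels; the stop label ('c') is included
by_label = letters_df.loc['a':'c']

print(by_position.index.tolist())
print(by_label.index.tolist())
```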

### Boolean Expressions
We can also use Boolean expressions to select based on the contents of the elements. We can use these expressions to create filters for selecting particular rows or columns.

|Pandas Operator|Boolean|Requires|
|---|---|---|
|&|and|All required to be `True`|
|\||or|If any are `True`|
|~|not|The opposite|
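
As a rough illustration of the three operators in the table, the sketch below builds two filters and combines them. The DataFrame `pets_df` and all of its values are hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration only
pets_df = pd.DataFrame(
    {'species': ['cat', 'dog', 'cat', 'bird'],
     'age': [2, 7, 9, 4]},
    index=['Milo', 'Rex', 'Luna', 'Kiwi']
)

# Each comparison produces a Series of True/False values
is_cat = pets_df['species'] == 'cat'
is_old = pets_df['age'] > 5

old_cats = pets_df.loc[is_cat & is_old]    # both must be True
cat_or_old = pets_df.loc[is_cat | is_old]  # either may be True
not_cats = pets_df.loc[~is_cat]            # inverts the filter

# When combining comparisons inline, wrap each one in parentheses,
# because & and | bind more tightly than == or >
inline = pets_df.loc[(pets_df['species'] == 'cat') & (pets_df['age'] > 5)]
```

Storing each filter in its own variable, as above, tends to be easier to read than one long inline expression.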


In [None]:
df

In [None]:
# Return a Boolean Series (`True` or `False` for every row)
# showing whether the Identifier is over 4000
df['Identifier'] > 4000

# The output is the same as
#df.loc[:, 'Identifier'] > 4000

We can store this expression in a variable to filter the dataframe using `.loc[]`. 

In [None]:
# Put the expression from above into a variable
id_filter = df['Identifier'] > 4000

# Return all the rows where the expression is true
df.loc[id_filter]

# The output is also the same as
#df.loc[id_filter, :]

Try using a similar approach to find every person with the last name "Smith."

In [None]:
# Preview every row with Last name of "Smith"


If we were looking for "Jamie Smith" we could create a filter that specifies the first and last name.

In [None]:
# Preview every row with the first name Jamie and last name Smith
# You will need to use the & operator in your filter variable


In [None]:
# Find every row with Last Name not `Smith`


### Changing a Value in the DataFrame

A single element can be changed with an assignment statement using `.loc[]`.

In [None]:
# Change a single value using `.loc[]`
df.loc['jenkins46', 'First name'] = 'Mark'
df
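
The same pattern can change many elements at once if a filter is supplied in place of a single row name. This is a minimal sketch with hypothetical data; `scores_df`, the passing mark of 60, and the `result` column are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data for illustration only
scores_df = pd.DataFrame(
    {'score': [45, 82, 58, 91],
     'result': ['', '', '', '']},
    index=['ann', 'bob', 'cy', 'dee']
)

# Build a filter, then assign to every matching row in one statement
passed = scores_df['score'] >= 60
scores_df.loc[passed, 'result'] = 'pass'
scores_df.loc[~passed, 'result'] = 'fail'

print(scores_df)
```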

### Removing Rows or Columns

We can use filters to make more widespread changes. For example, we could filter out certain rows based on what is contained in those rows.

In [None]:
# Remove all rows where the identifier is 7000 or less
# Create a filter variable
id_filter = df['Identifier'] > 7000

# Create a new dataframe based on the old dataframe with the filter applied
filtered_df = df[id_filter]

# Preview the new dataframe
filtered_df

# Optionally, overwrite the old dataframe
# df = filtered_df

To drop particular columns, we can also use the `.drop()` method, which accepts either a single label or a Python list of labels. We must pass the parameter `axis=1` to indicate we are dropping columns.

In [None]:
# Preview dropping the given column from the dataframe
df.drop('Login email', axis=1)

We can use the same technique to drop rows by specifying `axis=0` and passing a row label from the index. Try this in the next code cell.

In [None]:
# Preview dropping a row from the dataframe

### Drop Rows with Missing Data with `.dropna()`

We can remove rows with missing data by using the `.dropna()` method. To check only certain columns for missing values, we pass a Python list of column names to the `subset` parameter.

In [None]:
# Remove all rows without a `Login email` using `.dropna()`
df.dropna(subset=['Login email'])

To see more options for dropping rows or columns without data, check out the [pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html).
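
As a quick sketch of how `subset` changes the result, compare the default behavior with a call that checks a single column. The DataFrame `contacts_df` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical data with some missing values
contacts_df = pd.DataFrame(
    {'Login email': ['a@example.com', None, 'c@example.com'],
     'Phone': [None, '555-0100', '555-0199']},
    index=['user_a', 'user_b', 'user_c']
)

# Default: drop any row that has at least one missing value
complete_rows = contacts_df.dropna()

# `subset`: only the listed columns are checked for missing values
has_email = contacts_df.dropna(subset=['Login email'])

print(complete_rows.index.tolist())
print(has_email.index.tolist())
```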

### Summary of Pandas DataFrames

* A DataFrame has multiple rows and columns
* Use attributes along with `.head()` and `.tail()` to explore the DataFrame
* Use `.iloc[]` and `.loc[]` to select a column, row, or element
* Filters and Boolean operators can be powerful selectors
* Use an assignment statement to change one or many elements
* Drop selected rows or columns with `.drop()`
* Drop rows without data using `.dropna()`
___
Learn about apply, map, and replace with Pandas DataFrames in [Pandas 3 ->](./pandas-3.ipynb)