<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Created by [Nathan Kelber](http://nkelber.com) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email nathan.kelber@ithaka.org.<br />
___

# Pandas 1

**Description:** This notebook describes how to:
* Create a Pandas Series or DataFrame
* Access data rows, columns, and elements using `.loc` and `.iloc`
* Create filters using Boolean operators
* Change data in rows, columns, and elements

This is the first notebook in a series on learning to use Pandas. 

**Use Case:** For Learners (Detailed explanation, not ideal for researchers)

**Difficulty:** Intermediate

**Knowledge Required:** 
* Python Basics ([Start Python Basics I](./python-basics-1.ipynb))

**Knowledge Recommended:** 
* [Working with Dataset Files](./working-with-dataset-files.ipynb)

**Completion Time:** 120 minutes

**Data Format:** CSV (.csv)

**Libraries Used:** Pandas

**Research Pipeline:** None
___

In [None]:
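### Download Sample Files for this Lesson
import os
import urllib.request

download_urls = [
    'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/sample2.csv'
]

# Make sure the `data` folder exists before saving files into it
os.makedirs('./data', exist_ok=True)

for url in download_urls:
    # Save each file under its original filename
    urllib.request.urlretrieve(url, './data/' + url.rsplit('/', 1)[-1])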

## When to use Pandas

Pandas is a Python data analysis and manipulation library. When it comes to viewing and manipulating data, most people are familiar with commercial spreadsheet software, such as Microsoft Excel or Google Sheets. While spreadsheet software and Pandas can accomplish similar tasks, each has significant advantages depending on the use-case.

**Advantages of Spreadsheet Software**
* Point and click
* Easier to learn
* Great for small datasets (<10,000 rows)
* Better for browsing data

**Advantages of Pandas**
* More powerful data manipulation with Python
* Can work with large datasets (millions of rows)
* Faster for complicated manipulations
* Better for cleaning and/or pre-processing data
* Can automate workflows in a larger data pipeline

In short, spreadsheet software is better for browsing small datasets and making moderate adjustments. Pandas is better for automating data cleaning processes that require large or complex data manipulation.

Pandas can interpret a wide variety of data sources, including Excel files, CSV files, and Python objects like lists and dictionaries. Pandas converts these into two fundamental objects: 

* Series: a single column of data
* DataFrame: a table of data containing multiple columns and rows

This lesson introduces their basic affordances. 

## Pandas Series

We can think of a Series as a single column of data. A DataFrame then is made by combining Series objects side-by-side into a table that has both height and width. Let's create a Series based on the world's ten most-populated countries [according to Wikipedia](https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population).

|Population (in millions)|
|---|
|1,404|
|1,366|
|330|
|269|
|220|
|211|
|206|
|169|
|146|
|127|

We will put these population numbers into a Pandas Series.

In [None]:
# import pandas, `as pd` allows us to shorten typing `pandas` to `pd` when we call pandas
import pandas as pd

To create our Series, we pass a list into the `pd.Series()` constructor:

`variable_name = pd.Series([1, 2, 3])`

In [None]:
# Create a data series object in Pandas
worldpop = pd.Series([1404, 1366, 330, 269, 220, 211, 206, 169, 146, 127])
print(worldpop)


Underneath the Series is a `dtype` which describes the way the data is stored in the Series. Here we see `int64`, denoting the data are 64-bit integers. We can assign a name to our series using `.name`.

In [None]:
# Give our series a name
worldpop.name = 'World Population (In Millions)'
print(worldpop)

### `.iloc[]` Integer Location Selection

To the left of each row in a Series are index numbers. The index numbers are similar to the index numbers for a Python list; they help us reference a particular row for data retrieval. Also, like a Python list, the index to a Series begins with 0. We can retrieve individual elements in a Series using the `.iloc` attribute, which stands for "integer location." 

In [None]:
# Return the 4th element in our series
worldpop.iloc[3]

Just like a Python list, we can also slice a series into a smaller series. When slicing a Pandas series with `.iloc[]`, the new series will not include the row at the final index.

In [None]:
# Return a slice of elements in our series
# This slice stops before index 4, so it returns the rows at index 2 and 3
worldpop.iloc[2:4]
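As a minimal sketch of how `.iloc[]` slicing mirrors list slicing, the cell below compares the two side by side. The `scores_list` and `scores` names and their values are hypothetical and are not part of this lesson's data.

```python
import pandas as pd

# Hypothetical example data, not part of the lesson's dataset
scores_list = [90, 85, 70, 65, 50]
scores = pd.Series(scores_list)

# Both slices start at position 1 and stop *before* position 3
list_slice = scores_list[1:3]
series_slice = scores.iloc[1:3]

print(list_slice)
print(series_slice)

# Negative positions count from the end, as with a list
last_score = scores.iloc[-1]
print(last_score)
```

Note that the sliced series keeps the original index numbers (1 and 2) rather than starting again at 0.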

By default, our Series has a numerical index like a Python list, but we can also give each row an identifier (like a key within a Python dictionary). We do this by using:

`series_name.index = [name_1, name_2, name_3]`

Since we are storing the populations of countries, it would also be helpful to include the name of each country within our index. 

In [None]:
# Rename the index to use names instead of numerical indexes
worldpop.index = [
    'China',
    'India',
    'United States',
    'Indonesia',
    'Pakistan',
    'Brazil',
    'Nigeria',
    'Bangladesh',
    'Russia',
    'Mexico'
]

worldpop

### `.loc[]` Location Selection
Now we can also reference each element by its index name, similar to how we can supply a key to a dictionary to get a value. We use the `.loc[]` attribute to reference by name (as opposed to integer/index location using `.iloc[]`).

Try finding the value for Nigeria using both `.iloc[]` and `.loc[]` selection.

In [None]:
# Use `.iloc[]` to return the series value for Nigeria

In [None]:
# Use `.loc[]` to return the series value for Nigeria

Instead of a value, we can return a new series by supplying a list. This will return the value *with the index names* as well. 

In [None]:
# Return a new series containing only Nigeria
# Note that we use two sets of brackets

worldpop.loc[['Nigeria']]

In [None]:
# Return a series value for Indonesia and Mexico
worldpop.loc[['Indonesia', 'Mexico']]

Instead of supplying a list of every index name, we can use a slice notation using a `:`. There is, however, a significant difference in how this slice is created with *index names*: the final named index **is included**.

In [None]:
# Return a slice from Nigeria to Russia
# This slice will include the final element!
# This behavior is different than a list slice

worldpop.loc['Nigeria':'Russia']
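To make the difference concrete, here is a small sketch that slices the same series by name and by position. The `letters` series is a hypothetical example, not part of the lesson's data.

```python
import pandas as pd

# Hypothetical example data
letters = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# Slicing by name with `.loc[]` includes the final name ('c')
by_name = letters.loc['b':'c']

# Slicing by position with `.iloc[]` stops before the final position (2)
by_position = letters.iloc[1:2]

print(by_name)
print(by_position)
```

Here `by_name` contains two rows (`b` and `c`), while `by_position` contains only one (`b`).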

Although we created this Pandas series from a list, a series with index names is kind of like an ordered dictionary. Indeed, we could have created our Pandas series from a dictionary instead of a list.

In [None]:
# Creating a Series from a dictionary
# Based on most populous cities in the world according to Wikipedia

citiespop = pd.Series({
    'Tokyo': 37,
    'Delhi': 28,
    'Shanghai': 25,
    'São Paulo': 21,
    'Mexico City': 21,
    'Cairo': 20,
    'Mumbai': 19,
    'Beijing': 19,
    'Dhaka': 19,
    'Osaka': 19,
}, name='World City Populations (In Millions)') # We can also specify the series name as an argument

#Return the series
citiespop

### Boolean Expressions

We have seen already how we can select a particular value in a series by using an index name or number. We can also select particular values using Boolean expressions. A Boolean expression applied to a series evaluates to a Truth Table: a new series holding `True` or `False` for each row.

In [None]:
# Which countries have populations greater than 200 million?
worldpop > 200

By passing this expression into `.loc[]`, we can retrieve just the rows that would evaluate to `True`.

In [None]:
# Evaluate worldpop for `worldpop > 200`
worldpop.loc[worldpop > 200]

Note that we have not changed the values of `worldpop` but only evaluated an expression. `worldpop` remains the same.

In [None]:
worldpop

If we wanted to store the evaluation, we would need to use an assignment statement, either for `worldpop` or a new variable.

In [None]:
# If we wanted to save this to a new series variable
new_series = worldpop[worldpop > 200]

new_series

We can also evaluate multiple expressions, but there is a difference in syntax between Python generally and Pandas. Python Boolean operators are written `and`, `or`, and `not`. Pandas Boolean operators are written `&`, `|`, and `~`. When combining comparisons in Pandas, wrap each comparison in parentheses.

|Pandas Operator|Boolean|Requires|
|---|---|---|
|&|and|All must be `True`|
|\||or|If any are `True`|
|~|not|The opposite|
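The sketch below shows each operator on a small series. The `temps` series and its values are hypothetical. Each comparison is wrapped in parentheses because `&` and `|` bind more tightly than `>` and `<` in Python.

```python
import pandas as pd

# Hypothetical example data (temperatures in degrees Celsius)
temps = pd.Series([12, 25, 31, 18, 5], index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri'])

# & (and): both conditions must be True
mild = temps.loc[(temps > 10) & (temps < 20)]

# | (or): at least one condition must be True
extreme = temps.loc[(temps < 10) | (temps > 30)]

# ~ (not): flips every True to False and every False to True
not_mild = temps.loc[~((temps > 10) & (temps < 20))]

print(mild)
print(extreme)
print(not_mild)
```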

Try returning a series from `worldpop` using `.loc[]` for countries with populations either over 500 or under 250.

In [None]:
# Return a series from `worldpop` with populations
# over 500 or under 250

### Modifying a Series

So far, we have been returning expressions but not actually changing our original Pandas series object. We can use an assignment statement to make a change to the original series object. The syntax is very similar to changing an item value in a Python dictionary.

In [None]:
# Change the population of China to 1500
worldpop.loc['China'] = 1500
worldpop

We could also change the value of multiple rows based on an expression.

In [None]:
# Change the population of several countries based on an expression
worldpop.loc[worldpop < 300] = 25
worldpop

### Summary of Pandas Series

* A series is a single column of data that may have a name
* A particular row in a series can be referenced by index number or index name
* Use `.iloc` to select a row by index number
* Use `.loc` to select a row by index name
* Use an assignment statement to change values
* Boolean operators include & (and), | (or), ~ (negation) 

## Pandas DataFrame

If a Series represents a column of data, a DataFrame represents a full table composed of multiple columns together. DataFrames can contain thousands or millions of rows and columns. When working with DataFrames, we are usually using a dataset that has been compiled by someone else. Often the data will be in the form of a CSV or Excel file. 

We can convert the data in a .csv file to a Pandas DataFrame using the `pd.read_csv()` function. We pass in the location of the .csv file. Additionally, we can supply an index column name with `index_col`.

Use the **File > Open** menu above to navigate to the `sample2.csv` in the `/data` folder. Preview its structure before we load it into a dataframe.

In [None]:
import pandas as pd

# Create a DataFrame `df` from the CSV file 'sample2.csv'
df = pd.read_csv('data/sample2.csv', index_col='Username')

By convention, a dataframe variable is called `df` but we could give it any valid Python variable name.
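To see what `index_col` changes without depending on a file, the sketch below reads a tiny CSV from a string. The CSV text, its column names, and the `numbered` and `labeled` variable names are all hypothetical. `io.StringIO` simply lets `pd.read_csv()` treat the string as if it were a file.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = """Username,Identifier,City
ada01,1001,London
grace02,1002,New York
"""

# Without `index_col`, pandas builds a numeric index starting at 0
numbered = pd.read_csv(io.StringIO(csv_text))

# With `index_col`, the chosen column becomes the row labels
labeled = pd.read_csv(io.StringIO(csv_text), index_col='Username')

print(numbered)
print(labeled)
```

In `labeled`, `Username` is no longer an ordinary column, so the DataFrame has two columns instead of three.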

### Exploring DataFrame Contents
Now that we have a DataFrame called `df`, we need to learn a little more about its contents. The first step is usually to explore the DataFrame's attributes. Attributes are properties of the dataset (not functions), so they do not have parentheses `()` after them. 

|Attribute|Reveals|
|---|---|
|.shape| The number of rows and columns|
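|.info| A preview of the DataFrame (for long DataFrames, the first and last 5 rows plus the shape). Called as a method, `.info()`, it instead prints a summary of the columns and their data types|
|.columns| The name of each column|
|.index| The name of each row|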

In [None]:
# Use `.shape` to find rows and columns in the DataFrame
df.shape

In [None]:
# Use `.info` to preview the DataFrame, or call `.info()` to summarize its columns and data types

In [None]:
# Use `.columns` to find the name of each column (if they are named)

We can use the `.index` attribute to discover the name for each row in our DataFrame. We set the index column (`index_col=`) to `Username`, but `Identifier` might also make sense with this data. If no column is chosen, a numeric index is created starting at 0.

In [None]:
# Use `.index` to list the rows of our DataFrame
df.index
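As a quick sketch, the cell below reads several attributes from a small DataFrame and then calls the `.info()` method for comparison. The `people` DataFrame and its values are hypothetical, not the lesson's `sample2.csv` data.

```python
import pandas as pd

# Hypothetical example data
people = pd.DataFrame(
    {'Identifier': [1001, 1002, 1003], 'City': ['London', 'Paris', 'Lima']},
    index=['ada01', 'grace02', 'alan03'],
)

# Attributes: no parentheses
print(people.shape)           # (number of rows, number of columns)
print(list(people.columns))   # column names
print(list(people.index))     # row names
print(people.dtypes)          # data type of each column

# Method: parentheses required; prints a summary of the DataFrame
people.info()
```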

### Preview with `.head()` and `.tail()`
We can also use the `.head()` and `.tail()` methods to get a preview of our DataFrame.

In [None]:
# Use `.head()` to see the first five lines
# Pass an integer into .head() to see a different number of lines
df.head()

In [None]:
# Use `.tail()` to see the last five lines
# Pass an integer into .tail() to see a different number of lines
df.tail()

### Display More Rows or Columns with `.set_option()`
By default, Pandas limits the number of rows and columns to display. If desired, we can increase or decrease the number to display. If your DataFrame has a limited number of rows or columns, you may wish to show all of them.

In [None]:
# Show all columns
# Replace `None` with an integer to show a set number
pd.set_option('display.max_columns', None)

# Show all rows
# Replace `None` with an integer to show a set number
# Be careful if your dataset is millions of lines long!
pd.set_option('display.max_rows', None)
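If you only want to change a display setting temporarily, one possible approach is `pd.option_context()`, which applies a setting inside a `with` block and restores the previous value afterward. This is a sketch rather than a required step in the lesson, and the variable names `before`, `inside`, and `after` are only for illustration.

```python
import pandas as pd

# Record the current setting so we can compare later
before = pd.get_option('display.max_rows')

# Inside the `with` block, the row display limit is 5
with pd.option_context('display.max_rows', 5):
    inside = pd.get_option('display.max_rows')
    print(pd.Series(range(100)))  # the long series is shown truncated

# Outside the block, the previous setting is back
after = pd.get_option('display.max_rows')
print(before, inside, after)

# `reset_option` restores the pandas default for an option
pd.reset_option('display.max_rows')
```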

### Change Column Names with `.columns`
We can change the column names with the `.columns` attribute.

In [None]:
# Updating all column names at once
df.columns = ['email', 'Identifier', 'First name', 'Last name']
df

We can also use the `.rename()` method to change a single column name.

In [None]:
# Updating a single column name
df = df.rename(columns={'email': 'Login email'})
df

### An important note on previewing and permanent changes in Pandas

In order to make our changes stick, we had to use an assignment statement:
`df = df.rename(columns={'email': 'Login email'})`
If we had just written:
`df.rename(columns={'email': 'Login email'})` 
Pandas would return a renamed copy for us to preview but would not actually change the dataframe. The assignment statement tells Pandas that we want the change to permanently change the dataframe. 

There is no "undo" when making a change to a Pandas dataframe, so it is a good idea to always preview changes before committing them to an assignment statement. **Always back up your data so a dataframe manipulation mistake will not ruin your data.**
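The sketch below illustrates the difference between previewing and committing a change, and shows one way to keep an in-memory backup using `.copy()`. The `original` DataFrame and its values are hypothetical. An in-memory copy is not a substitute for backing up the data files themselves.

```python
import pandas as pd

# Hypothetical example data
original = pd.DataFrame({'email': ['a@example.com'], 'Identifier': [1001]})

# Without assignment to `original`, `rename` returns a new DataFrame
# and leaves `original` unchanged
preview = original.rename(columns={'email': 'Login email'})
print(list(original.columns))

# `.copy()` makes an independent backup before we commit to a change
backup = original.copy()

# Assigning the result back to `original` makes the change stick
original = original.rename(columns={'email': 'Login email'})
print(list(original.columns))
print(list(backup.columns))
```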

#### An alternative way to permanently change a dataframe
There is another alternative for making permanent changes. You can pass the argument `inplace=True` without using an assignment statement. We do not recommend this as a good practice, and it is possible it may be removed in the future. We mention it here, however, since it may appear in other code you find.

In [None]:
# Updating a single column name using `inplace=True`
# Note: this mapping leaves the column names unchanged; it only demonstrates the syntax
df.rename(columns={'email': 'email'}, inplace=True)
df

### Reset the Index

When we created the dataframe, we used the `index_col` argument to set the index column to the `Username` column:

```df = pd.read_csv('data/sample2.csv', index_col='Username')```

We could reset the index to a numerical index starting at 0 using the `.reset_index()` method.

In [None]:
# Preview the dataframe after using `reset_index()`
df.reset_index()

Note that the above change was not made to the dataframe since there was no assignment statement.

In [None]:
# Confirm dataframe index has not been changed
df

In [None]:
# Make the permanent change to reset the index
df = df.reset_index()
# Print the dataframe with the changed index
df

### Set the Index with `.set_index()`
We can change the index back to `Username` with the `.set_index()` method. Try this in the next code cell.

In [None]:
# Change the index back to `Username`

### Sorting the Index

We can sort the index by using the `.sort_index()` method.

In [None]:
# Sort the DataFrame by ascending order


To sort the index in descending order, pass the argument `ascending=False`.

In [None]:
# Sort the DataFrame by descending order


### `.loc[]` and `.iloc[]` Selection

Like a Series, DataFrames can use the `.iloc[]` and `.loc[]` attributes for selection. To select a particular element, we need to supply a row *and* a column.

In [None]:
# View our DataFrame for reference
df

When we use index numbers with `.iloc[]`, the column names and the index column (written in bold in the DataFrame preview above) are not counted.

In [None]:
# Return the value for the specified row and column
# based on index numbers. The index column and the
# column names are not counted.
df.iloc[1, 3]

When we use index names with `.loc[]`, we need to supply the row name and the column name to select a single element. 

In [None]:
# Return the value for the specified row and column
df.loc['booker12', 'First name']

If we want to select an entire row of data, we pass in the row and a `:` for the column. The colon—without a start or stop specified—creates a slice that contains every column.

In [None]:
# Select an entire row
df.loc['redtree333', :]

Technically, we could also use: `df.loc['redtree333']` for the same result, but including the `, :` makes our row and column selections explicit.

When we want to select an entire column, the colon must be included using `.loc[]` since the row selection comes before the column selection. 

In [None]:
# Select an entire column
df.loc[:, 'Login email']

As you might expect, we can use `:` to make a slice using `.iloc[]` or `.loc[]`.

In [None]:
# Slicing rows and columns using `.iloc`
df.iloc[0:3, 1:4]

In [None]:
# Slicing rows and columns using `.loc`
df.loc['booker12':'french999', 'Login email':'First name']

**As a quick reminder**, remember that `.iloc[]` slicing is not inclusive of the final value. On the other hand, `.loc[]` slicing *is* inclusive. The reason for this difference is that an exclusive `.loc[]` slice would make the code confusing: we would need to supply whatever name comes *after* the last name we want to include.
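A small sketch can make the inclusive and exclusive behavior easier to see in two dimensions. The `grid` DataFrame below is hypothetical, with row names `r0` to `r2` and column names `c0` to `c2`.

```python
import pandas as pd

# Hypothetical example data
grid = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=['r0', 'r1', 'r2'],
    columns=['c0', 'c1', 'c2'],
)

# `.iloc[]` stops before position 1 on both axes: one row, one column
by_position = grid.iloc[0:1, 0:1]

# `.loc[]` includes 'r1' and 'c1': two rows, two columns
by_label = grid.loc['r0':'r1', 'c0':'c1']

print(by_position)
print(by_label)
```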

### Boolean Expressions
We can also use Boolean expressions to select based on the contents of the elements. We can use these expressions to create filters for selecting particular rows or columns.

|Pandas Operator|Boolean|Requires|
|---|---|---|
|&|and|All must be `True`|
|\||or|If any are `True`|
|~|not|The opposite|


In [None]:
df

In [None]:
# Return a Truth Table for every row
# where the Identifier is over 4000
df.loc[:, 'Identifier'] > 4000

We can store this expression in a variable to filter the dataframe using `.loc[]`. 

In [None]:
# Put the expression from above into a variable
id_filter = df.loc[:, 'Identifier'] > 4000

# Return all the rows where the expression is true
df.loc[id_filter, :]

# This is equivalent to
#df.loc[(df.loc[:, 'Identifier'] > 4000), :]
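Filter variables can also be combined with `&`, `|`, and `~`. The sketch below uses a small illustrative `books` DataFrame that is not part of this lesson's data; the filter names are likewise only for illustration.

```python
import pandas as pd

# Illustrative example data, not from the lesson's dataset
books = pd.DataFrame({
    'Author': ['Austen', 'Austen', 'Shelley', 'Eliot'],
    'Year': [1811, 1813, 1818, 1871],
})

# Build one filter per condition
austen_filter = books.loc[:, 'Author'] == 'Austen'
late_filter = books.loc[:, 'Year'] > 1812

# Rows where both filters are True
both = books.loc[austen_filter & late_filter, :]

# Rows where the author filter is False
not_austen = books.loc[~austen_filter, :]

print(both)
print(not_austen)
```

Storing each condition in its own variable keeps the final selection readable and avoids having to parenthesize long expressions inline.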

Try using a similar approach to find every person with the last name "Smith."

In [None]:
# Preview every row with Last name of "Smith"


If we were looking for "Jamie Smith" we could create a filter that specifies the first and last name.

In [None]:
# Preview every row with the first name Jamie and last name Smith
# You will need to use the & operator in your filter variable


In [None]:
# Find every row with Last Name not `Smith`


### Modifying a DataFrame

A single element can be changed with an assignment statement.

In [None]:
# Change a value using `.loc[]`
df.loc['jenkins46', 'First name'] = 'Mark'
df

We can use filters to make more widespread changes. For example, we could remove all rows based on a filter.

In [None]:
# Remove all rows where the identifier is 7000 or less
# Create a filter variable
id_filter = df.loc[:, 'Identifier'] > 7000

# Create a new dataframe based on the old dataframe with the filter applied
filtered_df = df.loc[id_filter, :]

# Preview the new dataframe
filtered_df

# Optionally, overwrite the old dataframe
# df = filtered_df
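Filters can also change values in place instead of removing rows, much like the earlier Series example. The following sketch assumes a hypothetical `orders` DataFrame; the column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical example data
orders = pd.DataFrame(
    {'Total': [15, 120, 60], 'Status': ['open', 'open', 'open']},
    index=['o1', 'o2', 'o3'],
)

# Create a filter for orders with a total over 50
large = orders.loc[:, 'Total'] > 50

# Change the `Status` column only in the rows where the filter is True
orders.loc[large, 'Status'] = 'review'

print(orders)
```

Supplying both the filter (rows) and a column name to `.loc[]` limits the change to exactly those elements.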

### Summary of Pandas DataFrames

* A DataFrame has multiple rows and columns
* Use attributes along with `.head()` and `.tail()` to explore the DataFrame
* Use `.iloc` and `.loc` to select a column, row, or element
* Filters and Boolean Operators can be powerful selectors
* Use an assignment statement to change one or many elements