
# FRC Analytics with Python - Session 09
# Tabular Data
**Last Updated: 20 December 2020**

## I. Introduction
Tabular data is data that is organized into rows and columns. It's everywhere -- in newspapers and magazines, on websites, in computer databases, and in Microsoft Excel files. Evaluating and manipulating tabular data is an essential skill for any analyst.

So far we've experimented with lists, tuples, and dictionaries. These data structures are indispensable, but they are not optimal for working with tables. In this session we will review several techniques and tools for working with tabular data.

## II. Comma Separated Value (CSV) Files
### A. CSV File Structure
Tabular data is often stored in comma separated value (CSV) files. CSV files are text files that use commas and newline characters (I'll explain what those are shortly) to organize the contents of the file into a table. Let's look at an example. The `space.csv` file contains information on 4,324 space launches, starting with the launch of the Sputnik spacecraft by the Soviet Union in 1957. The dataset is available on the [Kaggle website](https://www.kaggle.com/agirlcoding/all-space-missions-from-1957). The Python code below opens the file and displays the first five lines.

In [None]:
# Open a text file and print the first five lines
# Don't worry if you don't understand all of this code.
with open("space.csv", "rt", encoding="UTF-8") as csv_file:
    for row in range(5):
        print(csv_file.readline())

The first row of text contains the column headings, with each column separated by a comma. The subsequent rows contain the data, with one row for each space launch. The rows are separated from each other with a newline character.

Commas are also used for separation in the data rows, but it's a bit difficult to keep track of what text belongs to which column. Many of the data values contain commas inside the data. For example, the second column in the first row contains the value *"LC-39A, Kennedy Space Center, Florida, USA"*. The commas within quotation marks are part of the data and are not used for column separation.

CSV files are popular because they are simple to create and can be opened and read with any text editor. Still, the content can be tedious to read. All of the values within a row are smashed together and the columns in the data rows do not line up with the column headers.

### B. Python CSV Module
We saw earlier how we can use a built-in Python function like `open()` to read data from a CSV file on disk, but the results can be difficult to read. The Python Standard Library has a `csv` module that makes things a little better.

In [None]:
import csv
space_csv = []
with open("space.csv", "rt", encoding="UTF-8") as csv_file:
    reader = csv.reader(csv_file)
    for row in reader:
        space_csv.append(row)

In [None]:
space_csv[:3]

The `csv` module converts every row of the CSV file to a Python list. We appended every row to an outer list to create a list of lists, also called a nested list. Each value is now displayed on its own line. We can even extract individual values from the nested list. For example, to get the third element of the third row:

In [None]:
space_csv[2][2]
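
The quoting rule described in section A can be checked without the `space.csv` file. The sketch below is a minimal illustration that assumes a small, hand-typed CSV sample (the two data rows are hypothetical, not copied from the dataset) and reads it from a string with `io.StringIO`.

```python
import csv
import io

# Hypothetical two-launch sample in CSV format (not taken from space.csv)
sample_text = (
    'Company Name,Location,Status Mission\n'
    'SpaceX,"LC-39A, Kennedy Space Center, Florida, USA",Success\n'
    'CASC,"Site 9401 (SLS-2), Jiuquan Satellite Launch Center, China",Success\n'
)

# io.StringIO lets csv.reader treat a string like an open text file
rows = list(csv.reader(io.StringIO(sample_text)))

print(len(rows))     # header row plus the data rows
print(rows[1][1])    # the quoted location is kept as a single value
print(len(rows[1]))  # number of columns in the first data row
```

Because the commas inside the quotation marks are treated as data, each data row still splits into three values.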

But what if we wanted to figure out how many space launches occurred in China on Wednesdays since the year 2000? That would require us to write several lines of code to read through all of the rows of data and count the applicable launches. Fortunately Python has a better tool for working with tabular data.

## III. Pandas Package
The *Pandas* package is an excellent tool for working with tabular data. Pandas is not included by default when installing Python, but it can easily be installed by running the command `conda install pandas`.

Pandas has so many features that it would take a book to explain them all. This session omits several important Pandas topics, such as timeseries data and multi-level indices. Students are encouraged to become familiar with the official Pandas documentation, which is located here: https://pandas.pydata.org/pandas-docs/stable/index.html. The remaining sections of this notebook contain links to applicable portions of the documentation, which provide additional information on each topic.

### A. Getting Started
Let's see how our space data looks when we use Pandas to view it.

In [None]:
import pandas as pd
space_df = pd.read_csv("space.csv", thousands=",")
space_df.head()

Now that is much better. All of the data lines up with the column headers. Pandas even adds row numbers and shades alternate rows to make everything easy to read. And we did everything in three short lines of code:
* The first line imports the pandas module and renames it `pd`.
* The next line reads the CSV file and creates a `DataFrame` object.
* The final line displays the `DataFrame` object. The `.head()` method causes only the first five rows to be displayed. You can customize the number of rows displayed with `.head()` by putting the number in the parentheses.

By the way, the package isn't named *Pandas* because the developers really like pandas (but who doesn't like pandas?). *Pandas* is short for *panel data*. Panel data is common in the social sciences. It is multi-dimensional data on multiple entities, with measurements taken at several points in time. For example, suppose we're conducting a study on family income over time. We might collect multiple pieces of data on each family, such as income, number of children, education level of parents, age of parents, whether they own their home, etc. If we collect such information on 500 families, and then update the information every year for five years, we have panel data.

Python's `len()` function can be used with dataframes to get the number of rows.

In [None]:
# Using len() with DataFrames
len(space_df)

Pandas `DataFrame` objects have a `shape` attribute that contains a two-element tuple (a tuple is like an immutable list). The first element is the number of rows and the second is the number of columns.

In [None]:
print("Number of rows and columns:", space_df.shape)
print("Just the number of columns:", space_df.shape[1])
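
Since `shape` is an ordinary tuple, it can be unpacked into two variables. Here is a minimal sketch that assumes a tiny, hand-built dataframe (the `team` and `score` columns are made up for illustration).

```python
import pandas as pd

# Hypothetical three-row table, built by hand for illustration
small_df = pd.DataFrame({
    "team": ["1318", "2976", "4131"],
    "score": [45, 38, 52],
})

# Unpack the (rows, columns) tuple into two variables
n_rows, n_cols = small_df.shape
print(n_rows, n_cols)

# len() agrees with the first element of shape
print(len(small_df) == n_rows)
```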

##### More Information
* [Intro to Dataframes](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/01_table_oriented.html)
* [Read and Write Tabular Data](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/02_read_write.html)
* [10 Minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)
* [Essential Functionality](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html)

### B. Easy Pandas Exercises

**Ex. III.1** The `.head()` method will accept an integer argument that represents the number of rows to display. Display the first eight rows of the dataframe.

In [None]:
# Ex. III.1


**Ex. III.2** There is also a `tail()` method that will display the last few rows of a dataframe. Display the last 4 rows of the `space_df` dataframe.

In [None]:
# Ex. III.2


### C. Pandas Data Types
Pandas provides two different types of data structures: `Series` and `DataFrame`. 

#### DataFrame
What type of data structure is the `space_df` object?

In [None]:
# What type of object is space_df?
type(space_df)

The `space_df` object is an object of type [`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dataframe). Most `DataFrame` objects are two-dimensional with rows and columns. `DataFrames` can be modified to contain data with three or more dimensions, such as panel data, but we won't bother with that in this course.

Pay attention to the capitalization of `DataFrame` (capital F). We're using the word "dataframe" in a couple of different ways. A dataframe is a two-dimensional data structure that can be found in Python, [R](https://en.wikipedia.org/wiki/R_(programming_language)) (a language for statistical analysis), and [Julia](https://en.wikipedia.org/wiki/Julia_(programming_language)) (a relatively new language that is good for numerical analysis). A `DataFrame`, on the other hand, is a Python data type provided by the *Pandas* package.

#### Series
The following code extracts a single column from the `space_df` dataframe and displays its type.

In [None]:
# pandas.Series data type
# Extract a single column (named Datum in this case) and keep the top six rows
datum_series = space_df.Datum.head(6)
print(datum_series)
type(datum_series)

See how we extracted a single column from the `DataFrame` by appending a period and its name to the name of the `DataFrame`? Extracting a single column results in a Pandas [`Series`](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#series) object. A `Series` is similar to a list, but with some differences:
* Unlike a Python list, all elements of a `Series` must have the same data type. The *Datum* column's type is *object*, which is the type Pandas uses for strings.
* The contents of a Pandas `Series` are stored in memory more efficiently than lists. Because of this, calculations on `Series` objects are often faster than equivalent calculations on lists.

We will mostly use `DataFrame` objects instead of `Series` objects. But it's important to know what `Series` objects are because that's what we'll end up with whenever we extract a single column from a dataframe.
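
To make the comparison between a list and a `Series` concrete, here is a minimal sketch. It assumes a short list of made-up mission costs and doubles every value, first with a list comprehension and then with a single `Series` expression.

```python
import pandas as pd

# Hypothetical mission costs in $ millions
costs_list = [50.0, 62.0, 90.0]
costs = pd.Series(costs_list)

# With a list, we need a loop (here, a list comprehension)
doubled_list = [cost * 2 for cost in costs_list]

# With a Series, the multiplication is applied to every element at once
doubled = costs * 2

print(costs.dtype)                       # one datatype for the whole Series
print(doubled.tolist() == doubled_list)  # both approaches give the same numbers
```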

##### More Information
* [Pandas Data Structures](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html)

### D. Selecting Data within Pandas Dataframes
Pandas provides an immense number of ways to select and extract data from a dataframe, many more than can be covered in this class. Check out the [official Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) to see for yourself. The Pandas documentation often refers to selecting data as *indexing*, which may seem strange at first. Think about it like this - to extract the fourth item from a list called `mylist`, we put the number 3 in square brackets: `fourth_num = mylist[3]`. The number 3 is the index position of the list's fourth element. We're selecting a value from the list by passing an index to the list.

#### 1. Selecting a Single Column
Select a single column by using a dictionary-style notation. Place the column name in quotes and square brackets. This technique works for all columns regardless of their name.

In [None]:
# Extracting a single column
space_df["Status Mission"]

The dictionary style also works with a string variable that contains the column name.

In [None]:
col_var = "Status Rocket"
space_df[col_var]
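
The reason the dictionary style works for every column is that the name is passed as a string. The dot notation used earlier only works when the column name is a valid Python identifier. The sketch below assumes a small, made-up dataframe with one column name that contains a space.

```python
import pandas as pd

# Hypothetical two-row dataframe for illustration
demo_df = pd.DataFrame({
    "Company Name": ["SpaceX", "NASA"],
    "Rocket": [50.0, 450.0],
})

# Dot notation works when the column name is a valid Python identifier
same = demo_df.Rocket.equals(demo_df["Rocket"])
print(same)

# "Company Name" contains a space, so only bracket notation can reach it
companies = demo_df["Company Name"]
print(companies.tolist())
```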

#### 2. Selecting Part of a Single Column
We can select one or more rows from a single column the same way we select portions of a list.

In [None]:
space_df.Detail[100:105] # Detail is the name of the column

#### 3. Selecting Multiple Columns
Multiple columns can be selected by passing a list of column names within square brackets. We can even change the column order.

In [None]:
space_df[["Location", "Company Name", "Status Mission"]] 
# Notice how with more than 1 specified column name, you need 2 sets of square brackets

#### 4. Selecting Rows
You might be tempted to select a row from a `DataFrame` the same way we select an element from a list. Resist that temptation; it won't work. Use `.loc[]` instead.

In [None]:
# Selecting part of a DataFrame with the `.loc[]` indexer
space_df.loc[0:3, "Company Name":"Datum"]

To use `.loc[]`, pass two elements within square brackets, separated by a comma. The first element specifies what rows are selected, and the second specifies what columns are selected. List-style slice notation can be used to select ranges of rows and columns. In the example above, we selected rows 0 through 3 and columns "Company Name" through "Datum".

One difference between Pandas dataframe slice notation and Python list slice notation is that a Python list slice does NOT return the final element, whereas a `.loc[]` slice does. For example:

In [None]:
# Python list slices include all elements up to but NOT including the final
# element in the slice
# Will return 3 list items
tens = [0, 10, 20, 30, 40, 50, 60]
tens[0:3]  # The fourth element, 30, is not returned.

In [None]:
# Slices in DataFrames will include the final element of the slice
# Returns 4 rows and 4 columns
space_df.loc[0:3, "Detail":"Status Mission"]
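
The two slicing behaviors can be compared side by side on the same numbers. This is a minimal sketch that assumes the `tens` list from above is loaded into a one-column dataframe (the column name `value` is made up).

```python
import pandas as pd

tens = [0, 10, 20, 30, 40, 50, 60]
tens_df = pd.DataFrame({"value": tens})  # default integer index: 0, 1, 2, ...

list_slice = tens[0:3]                 # stops before position 3
loc_slice = tens_df.loc[0:3, "value"]  # includes the row labeled 3

print(len(list_slice), len(loc_slice))
```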

Rows and columns need not be contiguous. We can pass in lists of row indices and column names.

In [None]:
space_df.loc[[100, 200, 300, 400], ["Detail", "Datum", "Rocket"]]

#####  More Information
* [Selecting a Subset of a Dataframe (short intro)](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/03_subset_data.html)

### E. Pandas Indexing Exercises

#### Ex III.3
Display the datums and company names for rows 1318, 2976, and 4131.

In [None]:
# Ex III.3


#### Ex III.4
Display the final 10 rows of the *Detail* column.

In [None]:
# Ex III.4


### F. Searching Within a Dataframe
Being able to extract data by row and column numbers is helpful at times, but it requires that we know the exact location of the data we want. In large dataframes with thousands of rows, we typically do NOT know the exact location. Fortunately, Pandas provides many techniques for searching within a dataframe.

#### 1. Boolean Indexing
Boolean indexing is a powerful technique for filtering dataframes. Check out the following example.

In [None]:
# Searching for Specific Data in Dataframe
# Which space launches were conducted by the U.S. Navy? (first five shown)
space_df[space_df["Company Name"] == "US Navy"].head()

The syntax for filtering the dataframe to only US Navy launches may look strange. It will make sense once we know more about how Pandas works. First, let's see what the expression `space_df["Company Name"] == "US Navy"` does by itself.

In [None]:
print('Expression: space_df["Company Name"] == "US Navy"')
print("Type of object returned by expression:", type(space_df["Company Name"] == "US Navy"))
print("Object contents:")
space_df["Company Name"] == "US Navy"

The expression returned a `Series` object of Boolean values. The Series has a length of 4,324 items, which is the same as the number of rows in the dataframe. That's suspicious...

In the expression, we're taking the `Series` object returned by the expression `space_df["Company Name"]` and we're comparing it to the string "US Navy". Pandas then compares every single element (all 4,324) to the string "US Navy", generating a value of True if the value is equal to "US Navy" and False otherwise. This action generates 4,324 True or False values, which Pandas returns as a `Series`. An operation that is repeated on every single element of an array-like object (e.g., list, array, Series) is called an element-wise or vectorized operation. One of the benefits of using Pandas is that it is capable of many element-wise operations. Not having to write a `for` loop is convenient and makes code more concise. Also, element-wise operations are often faster than operations that use `for` loops.
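
To see that the element-wise comparison does the same work as a loop, here is a minimal sketch. It assumes a short, made-up `Series` of company names and builds the list of Boolean values both ways.

```python
import pandas as pd

# Hypothetical company names for illustration
companies = pd.Series(["US Navy", "NASA", "SpaceX", "US Navy"])

# The long way: compare one element at a time in a for loop
loop_result = []
for name in companies:
    loop_result.append(name == "US Navy")

# The element-wise way: one expression compares every element
vector_result = companies == "US Navy"

print(vector_result.tolist() == loop_result)
print(vector_result.sum())  # True counts as 1, so the sum is the number of matches
```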

When we pass a `Series` of Boolean values to the dataframe by placing it in square brackets, Pandas will return another dataframe with only the rows that correspond to values of True in the Boolean series. Let's experiment. The following cell generates a Boolean series that is the same length as `space_df`, with every value set to `False`.

In [None]:
# Create a series where every element is equal to False
bseries = pd.Series([False] * len(space_df))
bseries

Now we'll pass the series to the `space_df` dataframe.

In [None]:
# Pass series to space_df
space_df[bseries]

As expected, no dataframe rows were returned, because all `bseries` elements are false. Let's change a few of the elements to `True`.

In [None]:
bseries[1318] = True
bseries[1595] = True
bseries[2046] = True
bseries[2907] = True
bseries[2017] = True
space_df[bseries]

For comparison operators we are not limited to `==`. We can also use `<=`, `>=`, `<`, `>`, and `!=`.

#### 2. Compound Boolean Indexing
Compound Boolean expressions are also allowed. See the following example.

In [None]:
# Searching for Specific Data in Dataframe
# How many successful space launches were conducted by the U.S. Navy?
space_df[(space_df["Company Name"] == "US Navy") & (space_df["Status Mission"] == "Success")]

Please note that the Boolean indexing syntax uses different Boolean operators than standard Python code. Instead of `and`, Boolean indexing uses `&`. Instead of `or` and `not`, Boolean indexing uses `|` and `~` respectively. Also, each part of a compound Boolean expression *must* be grouped with parentheses, or the expression will generate an error.
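
The three operators can be tried out on a small table. The sketch below assumes a made-up, four-row dataframe. It stores each comparison in its own variable, which is one way to sidestep the parentheses requirement because each comparison is finished before `&`, `|`, or `~` is applied.

```python
import pandas as pd

# Hypothetical launch records for illustration
launches = pd.DataFrame({
    "Company Name": ["US Navy", "US Navy", "NASA", "SpaceX"],
    "Status Mission": ["Failure", "Success", "Success", "Failure"],
})

# Each comparison produces its own Boolean Series
is_navy = launches["Company Name"] == "US Navy"
is_success = launches["Status Mission"] == "Success"

navy_and_success = launches[is_navy & is_success]  # both conditions are True
navy_or_success = launches[is_navy | is_success]   # at least one condition is True
not_navy = launches[~is_navy]                      # condition is False

print(len(navy_and_success), len(navy_or_success), len(not_navy))
```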

The `.isin()` method is useful for comparing a column against several values at once.

In [None]:
space_df[space_df["Company Name"].isin(["SpaceX", "Blue Origin", "ULA"])]

#### 3. The `isin()` Method
The `.isin()` method returns `True` if the value is equal to any of the elements in the list that is passed as a parameter. We could get the same results with a long compound Boolean statement with many `|` (i.e., *or*) operators, but using `.isin()` is much easier. The result can be negated with the `~` operator.

In [None]:
space_df[~space_df["Company Name"].isin(["SpaceX", "Blue Origin", "ULA"])]

##### More Information
* [Boolean Indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-indexing)

#### 4. *NaN* Values
By now you've noticed that the value *NaN* appears frequently. It stands for *not a number* and is used to represent missing data. We can use the `.notna()` method to filter out rows with *NaN* values. Suppose we only want rows with no missing data in the *Rocket* column:

In [None]:
space_df[space_df["Rocket"].notna()].head()

Conversely, we can use the `.isna()` method to select rows that contain *NaN*.
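
Here is a minimal sketch of `.isna()` and `.notna()` side by side. It assumes a made-up, three-row dataframe in which one mission cost is missing.

```python
import pandas as pd

# Hypothetical costs; None is stored as NaN in a numeric column
costs_df = pd.DataFrame({
    "Company Name": ["SpaceX", "CASC", "NASA"],
    "Rocket": [50.0, None, 450.0],
})

missing = costs_df[costs_df["Rocket"].isna()]   # rows where Rocket is NaN
present = costs_df[costs_df["Rocket"].notna()]  # rows where Rocket has a value

print(len(missing), len(present))
```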

##### More Information
* [Pandas documentation on missing data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html)

#### 5. Using the Query Method

The `.query()` method can also be used to extract specific rows from a dataframe. Users pass a query string to the method instead of a Boolean series object. Our queries will be more effective if we make a couple of tweaks to the space dataframe first. Most of the dataframe's columns, including *Datum*, are currently stored as strings. The following cell stores the *Rocket* column as 32-bit floating point numbers and converts the *Datum* column to a datetime type.

In [None]:
# Convert Rocket and Datum columns to numeric and datetime data types
space_df["Rocket"] = space_df["Rocket"].astype('float32')
space_df["Datum"] = pd.to_datetime(space_df["Datum"])

The next cell uses the `.query()` method to find all launches by NASA where the mission cost more than $1,000,000,000 (The *Rocket* column contains the cost of the mission, in millions of dollars).

In [None]:
# Alternate search technique using the `query()` method
space_df.query("`Company Name` == 'NASA' and Rocket > 1000")

Query strings are more intuitive and less verbose than Boolean indexing. They can also be faster on large dataframes. They do have a few quirks:
* The query string specifies criteria that each row must meet in order to be included in the resulting dataframe. Column names that contain spaces or other special characters must be enclosed in backticks, like `` `Company Name` ``. Note that a backtick (\`) is not the same as a single quote ('). The backtick key on your keyboard should be near the upper left corner, whereas the single quote is next to the ENTER key.
* String values that appear in the query, like 'NASA', must be quoted. We use single quotes to denote strings in our example because the entire query string is enclosed in double quotes. But we could swap the single and double quotes --  `` '`Company Name` == "NASA" and Rocket > 1000' `` would work just as well as a query string.
* Query strings use `and`, `or`, and `not` as logical operators.
* Comparison operators in query strings are as expected: `==`, `<=`, `>=`, `<`, `>`, and `!=`.
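
The quirks listed above can be tried on a small table. The following sketch assumes a made-up, three-row dataframe. It runs the same query with the quote styles swapped, then uses `or` and `not`.

```python
import pandas as pd

# Hypothetical launch records; Rocket is the cost in $ millions
launches = pd.DataFrame({
    "Company Name": ["NASA", "NASA", "SpaceX"],
    "Rocket": [1160.0, 450.0, 50.0],
})

# Same query, with the single and double quotes swapped
double_outside = launches.query("`Company Name` == 'NASA' and Rocket > 1000")
single_outside = launches.query('`Company Name` == "NASA" and Rocket > 1000')
print(double_outside.equals(single_outside))

# Logical operators are spelled out as words inside a query string
cheap_or_costly = launches.query("Rocket < 100 or Rocket > 1000")
not_costly = launches.query("not (Rocket > 1000)")
print(len(cheap_or_costly), len(not_costly))
```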

Suppose the query was embedded in a function, with the *Company Name* passed in as a parameter. We can include a variable in the query string by putting the `@` character in front of the variable name.

In [None]:
def get_launches(company):
    return space_df.query("`Company Name` == @company and Rocket > 1000")

get_launches("NASA").head()

The `.query()` method accepts `in` and `not in` operators, which work like the `.isin()` method used with Boolean indexing.

In [None]:
companies = ["SpaceX", "Blue Origin", "ULA"]
space_df.query("`Company Name` in @companies").head()
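
The `not in` operator is the query-string counterpart of negating `.isin()` with `~`. Here is a minimal sketch, assuming a made-up, four-row dataframe, that checks the two approaches return the same rows.

```python
import pandas as pd

# Hypothetical launch providers for illustration
launches = pd.DataFrame({
    "Company Name": ["SpaceX", "NASA", "ULA", "CASC"],
})
companies = ["SpaceX", "Blue Origin", "ULA"]

# Query string with `not in` and a variable reference
others_query = launches.query("`Company Name` not in @companies")

# Equivalent Boolean indexing with ~ and .isin()
others_mask = launches[~launches["Company Name"].isin(companies)]

print(others_query.equals(others_mask))
```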

##### More Information
* [Pandas user guide section on the `.query()` method](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#the-query-method).
* [Query method API reference](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html#pandas.DataFrame.query)

#### 6. Boolean Indexing and Query Method Exercises

##### Ex III.5
The `grads` dataframe contains information on different college majors. How many majors are there in the *Engineering* category? How many in *Biology and Life Science*?

In [None]:
# Ex III.5
grads = pd.read_csv("recent-grads.csv")     # Leave this line alone



##### Ex III.6
Display the first 20 rows and only the *Major*, *Total*, *Unemployment_rate*, and *Median* columns of the grads dataframe.

In [None]:
# Ex III.6



##### Ex III.7

Use the `.sort_values()` method to sort the `grads` dataframe by *Median* in descending order and display the first five rows. Which major has the highest median income? Refer to the [Dataframe.sort_values documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html?highlight=sort_values#pandas.DataFrame.sort_values).

In [None]:
# Ex III.7



##### Ex III.8
Use Boolean indexing to extract majors in the categories of *Business*, *Physical Sciences*, or *Arts* that have greater than 2500 respondents (*Total* column). Don't forget the parentheses.

In [None]:
# Ex III.8


##### Ex III.9
Use the `.query()` method to identify majors for which the share of women is greater than 50% and the median salary is greater than $50,000.

In [None]:
# Ex III.9


### E. Column Datatypes
Every column in a Pandas `DataFrame` has a specific datatype. For the sake of demonstration, we will reload the space dataframe from the CSV file.

In [None]:
# Reload space dataframe
space_df2 = pd.read_csv("space.csv")

We can see the datatype of each column with the `.dtypes` attribute.

In [None]:
print("Column Datatypes")
space_df2.dtypes

The type *object* is Pandas' datatype for a string. You might think that the *Rocket* column should be a numeric datatype. Pandas decided it was a string because several entries use a comma as a thousands separator, e.g., `1,160`. The `read_csv()` function has a `thousands` argument that tells Pandas to read the *Rocket* column as a number, in spite of the commas.

In [None]:
# Reload space dataframe, with commas as thousands separator
space_df2 = pd.read_csv("space.csv", thousands=",")

print("Column Datatypes")
space_df2.dtypes

The *Rocket* column is now a floating point number. That's more like it.

But what about the *Datum* column? We can't do date calculations on that column because Pandas treats it as a string. We can convert the column to a special datetime type with the `pandas.to_datetime()` function.

In [None]:
# Convert Datum to a datetime object.
space_df2["Datum"] = pd.to_datetime(space_df2["Datum"], utc=True)
print("Column Datatypes")
space_df2.dtypes

Now that the column is converted to a datetime object, we can access parts of the date using the `.dt` accessor. For example, to get an integer representing the month:

In [None]:
space_df2.Datum.dt.month.head()
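
The `.dt` accessor offers more than the month. The sketch below assumes two hand-typed timestamps (used only for illustration, not read from `space.csv`) and extracts the year and the name of the weekday. The weekday name is the kind of information needed for the earlier question about launches on Wednesdays.

```python
import pandas as pd

# Two hand-typed timestamps, converted to timezone-aware datetimes
launch_times = pd.to_datetime(
    pd.Series(["2020-08-07 05:12", "1957-10-04 19:28"]), utc=True
)

years = launch_times.dt.year.tolist()
weekdays = launch_times.dt.day_name().tolist()

print(years)
print(weekdays)
```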

Another useful method is `.astype()`. For example:

In [None]:
space_df2.Rocket.astype("float32")

We converted the numbers in the *Rocket* column from 64-bit to 32-bit floating point numbers to save memory. Note that `.astype()` returns a new `Series`; assign the result back to the column to keep the change. Other common datatypes include "int32", "int64", "uint32" (unsigned, meaning negative numbers are not allowed), and "boolean".
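
The memory savings can be measured with the `nbytes` attribute. This is a minimal sketch that assumes a short `Series` of made-up mission costs.

```python
import pandas as pd

# Hypothetical mission costs; floats are stored as float64 by default
costs64 = pd.Series([50.0, 62.0, 90.0, 450.0])
costs32 = costs64.astype("float32")

# nbytes reports the memory used by the values themselves
print(costs64.dtype, costs64.nbytes)
print(costs32.dtype, costs32.nbytes)
```

The trade-off is precision: a 32-bit float keeps roughly seven significant decimal digits, which is plenty for costs recorded in millions of dollars.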

##### More Information
* [Pandas Datatypes](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#dtypes)

#### 1. Datatype Exercises

##### Ex III.10
Using the `grads` dataframe, convert the *Median* column's datatype to `uint32`.

In [None]:
# Ex III.10


### E. Statistics Functions
Pandas has several functions for extracting summary statistics.

In [None]:
print("Maximum mission cost in $millions:\t", space_df.Rocket.max())
print("Mean mission cost in $millions:\t\t", space_df.Rocket.mean())
print("Standard Deviation mission cost:\t", space_df.Rocket.std())

##### More Information
* [Getting Started Tutorial on Summary Statistics](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/06_calculate_statistics.html)

#### 1. Statistics Exercises

##### Ex III.11
Use the `.min()` and `.max()` methods to determine which majors have the lowest and highest unemployment rates.

In [None]:
# Ex III.11


##### Ex III.12
Repeat exercise III.11, but only consider majors with more than 2,000 respondents (*Total* column).

In [None]:
# Ex III.12


### G. Dataframe Indexes

#### 1. Getting started with Indexes
Indexes are an important concept in Pandas. To understand indexes, we'll use a dataframe that contains match data from the 2019 Pacific Northwest district competition at Glacier Peak High School in Snohomish, WA. First we have to load the dataframe from a file.

In [None]:
# Loading a dataframe from a pickle file.
import pickle
with open("matches.pickle", "rb") as pfile:
    matches = pickle.load(pfile)

The Python Standard Library includes a *pickle* module that allows us to save any Python object to a file and load it back into Python at a later time. Here are the first few rows from the dataframe that was contained in the *matches.pickle* file.
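
The save-and-load round trip can be sketched without touching the disk. The example below assumes a small, made-up dataframe and uses `pickle.dumps()` and `pickle.loads()`, which work with bytes in memory instead of a file.

```python
import pickle

import pandas as pd

# Hypothetical match scores for illustration
scores = pd.DataFrame({"red_score": [45, 38], "blue_score": [52, 30]})

# dumps() produces the same bytes that dump() would write to a file
payload = pickle.dumps(scores)

# loads() rebuilds the object from those bytes
restored = pickle.loads(payload)

print(restored.equals(scores))
```

Only load pickle files from sources you trust, because unpickling can run arbitrary code.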

In [None]:
matches.head(3)

There is a significant difference between the space and matches dataframes, besides the fact that they have different columns. Let's look at the first few rows of the space dataframe again.

In [None]:
space_df.head(3)

The far left column of the space dataframe contains integers, with the first row having integer 0, the second row having integer 1, and so on. They appear to be row numbers. Unlike all other columns, they are displayed using a bold font.

The matches dataframe also has a column on the far left that is displayed using a bold font, but it does not contain integers. It contains a string with a code that identifies the match, formatted like this: *{year}{event_code}{comp_level_code}m{match_number}*. The column also has a name: *key*. By the way, the syntax of the *key* values has nothing to do with Pandas. This data was retrieved from [The Blue Alliance](https://www.thebluealliance.com/) website, which uses this key value to uniquely identify every FRC match that occurs at every competition throughout the world.

The column on the far left is a special column. It is called the *index* and it can contain integers, strings, date values, or other data types. If we don't tell Pandas how to create the index when we create a dataframe, Pandas will create an integer index, where each index value is the row number. But when the matches dataframe was created, Pandas was told to use *The Blue Alliance's* match key for the index. (If you would like to see how the matches dataframe was created, check out the *convert_json_to_df.ipynb* notebook in the *setup* subfolder.)

#### 2. Indexes are Objects
Here is one way to think about indexes: just like every column has a unique name, every row has a unique name. The index is a column that contains the row names.

In [None]:
# Viewing column and row names for the matches dataframe.
print("Matches Dataframe Column Names:\n", matches.columns)
print("\nMatches Dataframe Row Names (first 10):\n", matches.index[:10])

Look closely at the output from the cell above. First of all, we can easily view all column or row names with the `DataFrame.columns` and `DataFrame.index` attributes. But look closer. See how the printed output starts with `Index([...`? The columns and index are themselves objects with a special datatype. We'll prove it.

In [None]:
# Index object types
print("Index object type:\t\t", type(matches.index))
print("Columns object type:\t\t", type(matches.columns))
print("Object type of a regular column:", type(matches.red_score))

Both the index and columns are the same object type: `Index`, which is different than the object type of a regular column like *red_score* (object type is `Series`). This means the index will behave differently than a regular column.

#### 3. Reason for Having Indexes - The `.loc()` Method
Why do Pandas dataframes have an index? The size of a Pandas dataframe is limited only by the amount of memory on your computer.  Pandas dataframes can have millions of rows -- that's not hyperbole. The designers of Pandas wanted to be able to extract data from dataframes quickly, regardless of the size of the dataframe. The `Index` objects make that possible. If we extract data using index values and column names, we can extract data just as quickly from large datasets as we can from small datasets.

There is a caveat. For indexing (i.e., extracting data) to be fast on large dataframes, each row and column must have a unique index value. Pandas will allow duplicate index values, but some operations will be slower, and there could be other problems. The best practice is to ensure each row and column has a unique index value. Your choice of index will depend on the data and how you want to use it.
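
Pandas can report whether an index is unique. The sketch below assumes a tiny stand-in for the matches dataframe. The match keys are made up to resemble the pattern described earlier and are not taken from *matches.pickle*.

```python
import pandas as pd

# Hypothetical match keys and scores for illustration
keys = ["2020wasno_qm1", "2020wasno_qm2", "2020wasno_qm3"]
demo_matches = pd.DataFrame(
    {"red_score": [45, 38, 52], "blue_score": [30, 41, 27]},
    index=keys,
)
demo_matches.index.name = "key"

# is_unique is True when no index value is repeated
print(demo_matches.index.is_unique)

# A unique index value identifies exactly one row
red_qm2 = demo_matches.loc["2020wasno_qm2", "red_score"]
print(red_qm2)
```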

#### 4. Extracting Data with Index Values
We have already seen how to extract data with index values: it's the `.loc[]` indexer from section D.4. Here's a review:

In [None]:
space_df.loc[0:3, "Company Name":"Datum"]

When we use `.loc[]` on the space dataframe, it appears to use row numbers. But that's only because the space dataframe has a simple integer index. The same technique will *not* work on the matches dataframe.

In [None]:
# This code generates an error
matches.loc[0:3, "red1":"blue3"]

At the bottom of the long error message, you should see a statement indicating that Pandas is unable to use integers to index this dataframe. The `.loc()` method requires us to pass index values.

In [None]:
# Table of teams in quarterfinal matches.
matches.loc["2020wasno_qf1m1":"2020wasno_qf4m3", "red1":"blue3"]

##### More Information
* [The loc Function](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#selection-by-label)

#### 5. Extracting Data with Row and Column Numbers -- Using `.iloc()`
There may be times when you don't care about the index or column name. For example, you just want the value from the 5th column of the 17th row. Pandas makes that easy.

In [None]:
# Getting value from specific row and column
# Remember, 1st row and 1st column are at index 0
matches.iloc[16, 4]

The `.iloc()` method works similarly to `.loc()`, but it only takes integer row and column numbers. Here's another example:

In [None]:
matches.iloc[30:35, :-9:-1]

The `.iloc()` method accepts list-style slices. In the example above, we selected a range of rows and the last eight columns, but in reverse order.
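
The contrast between label-based and position-based selection is easiest to see on the same small table. This sketch assumes a made-up dataframe with string index values that resemble match keys.

```python
import pandas as pd

# Hypothetical match keys and scores for illustration
demo_matches = pd.DataFrame(
    {"red_score": [45, 38, 52], "blue_score": [30, 41, 27]},
    index=["2020wasno_qm1", "2020wasno_qm2", "2020wasno_qm3"],
)

# Same cell, reached by label and by position
by_label = demo_matches.loc["2020wasno_qm2", "red_score"]
by_position = demo_matches.iloc[1, 0]
print(by_label == by_position)

# Position slices stop before the end, like list slices
first_two = demo_matches.iloc[0:2]

# Label slices include the end label
all_three = demo_matches.loc["2020wasno_qm1":"2020wasno_qm3"]
print(len(first_two), len(all_three))
```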

##### More Information
* [The iloc Function](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#selection-by-position)

#### 6. Index Exercises

##### Ex III.13
For the `grads` dataframe, use the `.set_index()` method to make the *Major_code* column the index. Refer to the [documentation for the `set_index()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html?highlight=set_index#pandas.DataFrame.set_index). Either assign the modified dataframe to a new variable, or use the *inplace* parameter.

In [None]:
# Ex III.13


##### Ex III.14
With the modified `grads` dataframe from exercise III.13, sort the dataframe in ascending order by the index. Use the `.sort_index()` method ([see the documentation here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_index.html?highlight=sort_index#pandas.DataFrame.sort_index)). Next, extract all columns from *Major_code* to *Employed*, but limit rows to those with major codes between 4000 and 4999.

In [None]:
# Ex III.14


### H. Making Your Own DataFrames from Scratch
The two most common ways to construct a `DataFrame` object from core Python objects are the column method and the row method.

For the column method, you pass a dictionary to the `pandas.DataFrame` constructor.
* Each key of the dictionary is a column name.
* Each value is a list of values that will go in the corresponding column.
* Each list must be the same length.

In [None]:
# Column method for creating a dataframe
x = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 4, 5]})
x
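
The requirement that every list be the same length can be checked directly. The following sketch deliberately passes lists of different lengths and catches the resulting `ValueError`. The variable names are made up for illustration.

```python
import pandas as pd

# The 'y' list is one item short, so Pandas refuses to build the dataframe
try:
    bad = pd.DataFrame({"x": [1, 2, 3], "y": [3, 4]})
    created = True
except ValueError as err:
    created = False
    print("Could not build dataframe:", err)
```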

For the row method, you pass a list of dictionaries to the `pandas.DataFrame` constructor.
* Each dictionary corresponds to one row of the dataframe.
* Each dictionary key is a column name and each dictionary value will be the value for the corresponding row and column.
* This method is more fault tolerant. If a dictionary is missing a column value, Pandas will not throw an error, but will insert a `NaN` value.

In [None]:
x2 = pd.DataFrame([{"x": 10, "y": 30},
                   {"x": 20, "y": 40, "z": 100},
                   {"x": 30, "y": 50}])
x2

##### More Information
* [Series and Dataframe Creation](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html#object-creation)

### I. Modifying Data in Dataframes
So far we've only been reading data from Pandas DataFrames. But we can change dataframes. They are mutable.

In [None]:
# You can also assign a dict to a row of a DataFrame
x.iloc[1] = {'x': 9, 'y': 99}
x

In [None]:
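# Add a row to the end of the dataframe.
# DataFrame.append() was removed in Pandas 2.0, so we use pd.concat().
x = pd.concat([x, pd.DataFrame([{'x': 5, 'y': 9}])], ignore_index=True)
x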

In [None]:
x['z'] = [1, 2, 3, 4]
x

In [None]:
z = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 4, 5], 'z': [45, 45, 56]})
z

In [None]:
# You can join two dataframes end to end.
# DataFrame.append() was removed in Pandas 2.0, so we use pd.concat().
x = pd.concat([x, z])
x
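
When two dataframes are joined end to end, each keeps its original row labels unless we ask Pandas to renumber them. Here is a minimal sketch using `pd.concat()` with two made-up dataframes, `a` and `b`.

```python
import pandas as pd

# Two hypothetical dataframes with the same columns
a = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
b = pd.DataFrame({"x": [5, 6], "y": [7, 8]})

# By default the original index labels are kept, so labels repeat
kept = pd.concat([a, b])

# ignore_index=True renumbers the rows from 0
renumbered = pd.concat([a, b], ignore_index=True)

print(list(kept.index))
print(list(renumbered.index))
```

Repeated index labels are legal, but they can make `.loc` return more than one row for a single label, so renumbering is often the safer choice.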

#### 1. Dataframe Creation Exercises

##### Ex III.15
Use list comprehensions to create three lists each with 20 numbers. The first list should contain integers ranging from 0 to 19. The second should contain the squares of the numbers in the first list, and the third list should contain the cubes. Create a dataframe with three columns, where each column contains one of the lists. Use descriptive, logical names for the columns.

In [None]:
# Ex III.15


### J. Grouping and Aggregating Data

Until now, we've been using only a fraction of Pandas' capabilities. Pandas can also be used to transform and analyze data.

Suppose we want to know the average alliance score for the entire competition. Since scores are contained in either the *blue_score* or *red_score* column, we could use the `.mean()` method to calculate the average of each column and then average those two values.

In [None]:
# Calculating the mean of the red and blue score columns
print("Blue alliance mean score:\t",
      round(matches.blue_score.mean(), 1))
print("Red alliance mean score:\t", round(matches.red_score.mean(), 1))
print("Overall mean score:\t\t",
      round((matches.red_score.mean() + matches.blue_score.mean()) / 2, 2))

With Pandas, there is usually more than one way to do something, and this example is no exception. For example, we could also add a column to the dataframe that contains the mean of the red and blue alliance scores for each match, and just take the mean of that column.

In [None]:
# Creating a new mean_score column
matches["mean_score"] = (matches["red_score"] + matches["blue_score"]) / 2    # line 1
print("Overall mean score:\t", round(matches.mean_score.mean(), 1)) # line 2

Pay close attention to line 1 -- there is a lot going on in that line.
* First, we're able to create a new column named *mean_score* simply by referencing the column and assigning something to it, e.g., `matches["mean_score"] = ...`.
* Line 1 is also using element-wise calculations. This is an important concept, so we'll cover it in detail.
    * The expressions `matches["red_score"]` and `matches["blue_score"]` both return a Pandas `Series` object.
    * Mathematical operators like `+` behave differently with `Series` objects than they behave with other data types, such as Python lists. See below for an example.

In [None]:
# Using '+' with Python lists and Pandas Series
list1 = [1, 2, 3]
list2 = [10, 20, 30]
print("Using '+' with Python lists:\t", list1 + list2)
series1 = pd.Series([1, 2, 3])
series2 = pd.Series([10, 20, 30])
series3 = series1 + series2
print("\nUsing '+' with Pandas series:")
series3

Adding two Python lists concatenates the two lists into a longer list. But adding two `Series` objects causes the first elements of each series to be summed, as well as the second elements, etc. This is what we mean by element-wise operations.

So when we added `matches["red_score"]` and `matches["blue_score"]` and divided the result by two, for every row in the `matches` dataframe, we took the red score and the blue score, added them together, divided the sum by two, and put the result in a new column called *mean_score*. The following cell shows the results.

In [None]:
matches[["blue_score", "red_score", "mean_score"]].head()
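
Element-wise behavior also applies when a series is combined with a single number: the number is applied to every element. The sketch below repeats the *mean_score* calculation on a small dataframe with made-up scores (`scores` is hypothetical, not the real `matches` data).

```python
import pandas as pd

# Made-up scores for three matches
scores = pd.DataFrame({"red_score": [10, 40, 25],
                       "blue_score": [20, 30, 35]})

# A series times a single number: every element is doubled
doubled = scores["red_score"] * 2

# Series plus series, then divided by a number, all element by element
mean_score = (scores["red_score"] + scores["blue_score"]) / 2

print(doubled.tolist())
print(mean_score.tolist())
```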

Now suppose we were interested in the average score for each level of competition. That is, suppose we wanted an average score for qualification matches, another average score for quarter-finals, another for semi-finals, and so on. We could use the indexing techniques to extract a dataframe containing only the rows corresponding to each level, and then calculate a mean from each separate dataframe, but that would be a lot of work. It's easier to do something like this:

In [None]:
matches.groupby("comp_level").agg({"mean_score": "mean"})

This result is actually quite remarkable. One line of code split our data into groups by competition level and then calculated a mean score for each level.
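
To see what that one line saves us, the sketch below computes the same per-level means by hand with Boolean indexing and compares them to the `groupby()` result. It uses a small, made-up dataframe (`demo`) rather than the real `matches` data.

```python
import pandas as pd

# Hypothetical data: five matches across three competition levels
demo = pd.DataFrame({
    "comp_level": ["qm", "qm", "qf", "qf", "f"],
    "mean_score": [10.0, 20.0, 30.0, 50.0, 60.0],
})

# Manual approach: filter the rows for each level, then take the mean
manual = {}
for level in demo["comp_level"].unique():
    rows = demo[demo["comp_level"] == level]
    manual[level] = rows["mean_score"].mean()

# Same calculation with groupby and agg
grouped = demo.groupby("comp_level").agg({"mean_score": "mean"})

print(manual)
print(grouped)
```

Both approaches give the same numbers, but the `groupby()` version does not need a loop and returns a dataframe that is ready for further analysis.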

First, the `groupby()` method created a Pandas `GroupBy` object, where rows are split into groups based on the content of the *comp_level* column. We can extract the individual groups with the `.get_group()` method. For example, the following line extracts the finals matches.

In [None]:
matches.groupby("comp_level").get_group("f")

Next, the `.agg()` method calculates the mean of the values in each group's *mean_score* column. The string "mean" refers to an aggregate function, which is a function that calculates a single number from many different numbers. Pandas provides numerous aggregate functions, including *sum*, *size*, *count*, *std* (standard deviation), *var* (variance), *min*, and *max*. We can add additional columns that calculate different summary statistics for each group. The following example adds a column with the earliest match start time for each group.

In [None]:
matches.groupby("comp_level").agg({"actual_time": "min", "mean_score": "mean"})
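
A column can also be summarized with several aggregate functions at once by passing a list of function names instead of a single string. This is a minimal sketch on a made-up dataframe (`demo` is hypothetical). When a list is passed, the result has a two-level column header.

```python
import pandas as pd

# Hypothetical data: three qualification matches and two finals matches
demo = pd.DataFrame({
    "comp_level": ["qm", "qm", "qm", "f", "f"],
    "mean_score": [10.0, 20.0, 30.0, 40.0, 60.0],
})

# Several summary statistics for the same column
summary = demo.groupby("comp_level").agg(
    {"mean_score": ["mean", "min", "max", "count"]}
)
print(summary)

# Columns are addressed with (column name, function name) pairs
qm_mean = summary.loc["qm", ("mean_score", "mean")]
```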

##### More Information
* [Group by: Split-Apply-Combine](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html)

#### 1. Aggregation Exercises

##### Ex III.16
Using the `grads` dataframe, calculate the average unemployment rate and the standard deviation of the median salaries for each category of majors.

In [None]:
# Ex III.16


## IV. Quiz
Answer the following questions by typing the answers as comments in the code block below each question.

**#1.** Of the data file formats we have studied so far, which one can be viewed in a standard text editor *and* can contain nested data?

In [None]:
# 1. 


**#2.** Tab separated value (TSV) files are very similar to comma separated value files. Tab characters (`'\t'`) are used to separate columns instead of commas. This format can be useful if the data itself contains commas.

Review the [documentation for the Python Standard Library's `csv` module](https://docs.python.org/3/library/csv.html?highlight=csv#module-csv). Can this module be used for reading TSV files? If so, write out the code that could be used to load a TSV file.

In [None]:
# 2.


**#3.**  This code will throw an error. Why? How do we fix it?

```python
space_df["Company Name", "Datum", "Detail"]
```

In [None]:
# 3. 


**#4.** What are two differences between a Python list and a Pandas Series?

In [None]:
# 4.


**#5.** What is the difference between the `.loc()` and `.iloc()` methods?

In [None]:
# 5.


**#6.** When are tick marks required when using the `.query()` method? How are tick marks different from quotation marks?

In [None]:
# 6.


**#7.** Suppose we have a dataframe `small_df` with two numeric columns, *col1* and *col2*.

|  |col1|col2|
|--|----|----|
|0 |1   |2   |
|1 |10  |20  |
|2 |100 |NaN |

Suppose we run the following line of code.
```python
small_df["col3"] = small_df["col1"] * small_df["col2"]
```
What values will *col3* contain? Review the [user guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#calculations-with-missing-data) to see how *NaN* values are handled within mathematical calculations.

In [None]:
# 7.


**#8.** Many Pandas methods, including several used earlier in this notebook, have an *inplace* parameter. What does this parameter do? Review the API documentation for some of the functions that we have used if you are not sure.

In [None]:
# 8.


**#9.** We discussed two different techniques for creating a Pandas dataframe from Python data structures. One technique used a dictionary of lists, and the other used a list of dictionaries. Which one is the column-centric approach, and which is the row-centric approach?

In [None]:
# 9.


**#10.** Review the *`setup/convert_json_to_df.ipynb`* notebook. This notebook was used to create the matches dataframe that we used in this session. The dataframe was created using data in the *`setup/matches.json`* file. Review the notebook and JSON file and try to figure out how it works. Refer to Python or Pandas documentation if any of the syntax looks unfamiliar.

There is a line of code in the notebook that merges two dataframes. Copy the line of code into your answer to this question.

In [None]:
# 10.


**#11.** The answer to question #10 contains a method that takes more than one argument. Explain what each argument does. You can easily find the documentation for the method by typing it into the search field on the Pandas documentation website (upper left).

In [None]:
# 11.


## V. Save Your Work
Once you have completed the exercises, save a copy of the notebook outside of the git repository (outside of the *pyclass_frc* folder). Include your name in the file name. Follow instructions from your instructor to get feedback on your work.

## VI. Concept and Terminology Review
You should be able to define the following terms or describe the concept.
* CSV files
* TSV files
* The csv module from the Python Standard Library
* Creating dataframes, including both the row and column approach
* Pandas `DataFrame` and `Series` datatypes
* Pandas `.head()` and `.tail()` methods
* `DataFrame.shape` attribute
* Selecting dataframe columns
* Selecting rows from a Series
* Pandas `.loc` and `.iloc` methods
* Boolean indexing
* Pandas `.query()` method
* Element-wise operations
* The `.isin()`, `.isna()`, and `.notna()` methods
* *NaN* values
* Column datatypes
* Pandas `.to_datetime()` and `.astype()` methods
* Pandas statistical functions
* Pandas Indexes
* The `DataFrame.set_index()` method
* Modifying data in dataframes
* Adding new columns to a dataframe
* Grouping and Aggregating data
* Sorting the dataframe by a column or by the index (`.sort_index()` and `.sort_values()` methods)
* Merging Pandas dataframes

[Table of Contents](../../index.ipynb)