# Data Analysis with Pandas

<img src="https://terpconnect.umd.edu/~gciampag/INST126/images/pandas.png" width="75%"/>

Pandas is a Python **library** designed for **data manipulation and analysis**, especially of tabular data, as in an SQL table or Excel spreadsheet. This includes things like:
* Time series data
* Arbitrary matrix data with meaningful row and column labels
* Any other form of observational / statistical data sets

<img src="https://terpconnect.umd.edu/~gciampag/INST126/images/latlon.png" width="50%"/>

# Importing the pandas library (getting started)

## What is a library?

You can think of a library as a **collection of functions and data structures**. You *import* a library (or subsets of it) into your program / notebook so you have access to special functions or data structures in your program.

You are already using Python's standard library, which includes built-in functions like `print()`, and built-in data structures like `str` and `dict`. Every time you fire up Python, these are "imported" into your program in the background.

As you advance in your programming career, you will often find that the (sub)problems you want to solve have already been tackled by others, who wrote a collection of functions and/or data structures that solve them really well and saved that collection into a library for anyone to use. Take advantage of this!

## &ldquo;importing&rdquo; a library: mechanics

Here's what it looks like to import a library and use it, first conceptually with a "fake" library, and then with the pandas library:

<img src="https://terpconnect.umd.edu/~gciampag/INST126/images/importing.png" width="50%"/>

In [None]:
import pandas

One problem with this is that we need to type the full name of the library every time we want to access a function in it.

This is tedious if the name of the library is long, and it becomes a real problem if the name conflicts with a variable in our code.

To solve this, we can use the `import ... as ...` form, which lets us make the library available under a name of our choice.

<img src="https://terpconnect.umd.edu/~gciampag/INST126/images/importingas.png" width="50%"/>

In [None]:
# import the pandas library, give it the name pd for easier access
import pandas as pd
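
As a quick check, the sketch below illustrates that `pd` is not a copy of the library but simply a second name for it. Nothing here goes beyond the two import forms shown above.

```python
import pandas
import pandas as pd

# Both names refer to the very same library object
print(pd is pandas)

# So anything reachable through one name is reachable through the other
print(pd.DataFrame is pandas.DataFrame)
```

Both lines print `True`, which is why `pd.read_csv(...)` and `pandas.read_csv(...)` are interchangeable.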

## The core of Pandas: The dataframe data structure

So far we've progressed from single-item data structures (`str`, `int`, `float`) to "basic" collections (`list`, `dict`).

Now we will learn about the `dataframe`, which has:
* nice properties of both lists (*orderable, indexable*) and dictionaries (can *retrieve things quickly by key, store associated values*)
* and other properties and *built-in algorithms and methods* that are useful for data analysis (e.g., summarizing, grouping, statistics, etc.)

Remember: **data structures and algorithms go hand in hand**: people made dataframes (and the associated pandas library) so we can do particular kinds of algorithms more easily.

Pandas represents data as a table, using rows and columns.

<img src="https://terpconnect.umd.edu/~gciampag/INST126/images/01_table_dataframe.svg" />

Individual columns in a data frame are called **series**; see the [official documentation](https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html#each-column-in-a-dataframe-is-a-series) for more information.

Dataframes combine the best characteristics of lists and dictionaries, and more!

- Can be sorted (like lists)
- Allow access to data by key (like dicts)
- _Bonus_: They can also reindex easily!
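
Here is a minimal sketch of these three properties on a tiny, made-up dataframe (the names and ages are hypothetical, not from the course data). It uses `.set_index()` as one possible way of relabeling rows.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({"name": ["Cy", "Ann", "Bo"], "age": [31, 25, 40]})

by_age = toy.sort_values(by="age")   # sorted, like a list
names = toy["name"]                  # accessed by key, like a dict
relabeled = toy.set_index("name")    # rows are now labeled by name

print(by_age)
print(names)
print(relabeled.loc["Ann", "age"])   # look up Ann's age by her label
```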

In [None]:
# integrated practice! how do we specify directions to the INST courses file?
fname = 'data/INST_courses.csv'  # Note: on Windows use data\INST_courses.csv instead!
df = pd.read_csv(fname)
df.head(10)

In [None]:
# show me the "columns"
df.columns

In [None]:
# get the column with the class codes
df['Code']

In [None]:
# get all the data for the first row only (using the .loc attribute)
df.loc[0]

### Common error: forgetting the `.loc` attribute when indexing by row

In [None]:
# this will result in KeyError!
df[0]
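
To see why this fails without stopping the notebook, the sketch below uses a small stand-in dataframe (its values are made up) and catches the error. Plain square brackets look for a *column* named `0`, whereas `.loc[0]` looks for the *row* labeled `0`.

```python
import pandas as pd

# Made-up stand-in for the courses data (values are hypothetical)
toy = pd.DataFrame({"Code": ["INST126", "INST201"], "Credits": [3.0, 3.0]})

try:
    toy[0]                 # looks for a COLUMN named 0: there is none
except KeyError as err:
    print("KeyError:", err)

first_row = toy.loc[0]     # looks for the ROW labeled 0
print(first_row)
```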

In [None]:
# get the number of credits for the first course
df.loc[0, 'Credits']

In [None]:
# find the courses that are 3 credits
df[df['Credits'] == 3.0]

In [None]:
# find all courses where the title contains the word introduction
df[df['Title'].str.contains("Introduction")]
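
Note that `.str.contains()` is case-sensitive by default. The sketch below, on a few made-up titles, shows the difference that the `case=False` argument makes.

```python
import pandas as pd

# Hypothetical titles, only for illustration
titles = pd.Series([
    "Introduction to Programming",
    "An introduction to Data",
    "Databases",
])

exact = titles.str.contains("Introduction")
any_case = titles.str.contains("introduction", case=False)

print(exact.tolist())     # [True, False, False]
print(any_case.tolist())  # [True, True, False]
```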

In [None]:
df.head(10) # show me the top 10 rows in the dataframe

# Common operations (basic)

Let's go over some common operations with dataframes. This will overlap with the PCE, mostly Q1&ndash;5 and Q8.

## Constructing a series

Series can be created from scratch from existing data structures. There are two methods depending on whether the data are labeled or unlabeled.

### Series from unlabeled data

The simplest is to create a series from some unlabeled data, for example in a list. Here is a set of ages of three people:

In [None]:
ages = pd.Series([22, 35, 58], name="Age")
ages

I can ask the age of the first person by using list-like indexing (using the `.iloc[]` indexer).

In [None]:
ages.iloc[0]

Note that pandas is still labeling the data. When we supply unlabeled data, by default pandas uses the position of each element (0, 1, 2, ...) as the labels of the series. 

This means that `ages[0]` also works here, but for a different reason than you might expect:

In [None]:
ages[0]  # Works, but this is a label lookup, not a positional one

Plain square brackets look up *labels*. The expression above succeeds only because the default labels happen to coincide with the positions. When a series has non-integer labels instead (as in the next section), pandas falls back to treating an integer key as a position. That fallback has been deprecated by the Pandas developers: recent versions show a `FutureWarning`, and it is slated for removal from the library.

So always use the `.iloc[]` indexer when you intend to use list-like access to a series.
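
The difference between labels and positions is easiest to see when the two do not coincide. In this sketch the integer labels `10, 20, 30` are an arbitrary choice made for illustration.

```python
import pandas as pd

# Same ages as above, but with labels that differ from the positions
s = pd.Series([22, 35, 58], index=[10, 20, 30], name="Age")

print(s.loc[10])   # by label: the entry labeled 10    -> 22
print(s.iloc[0])   # by position: the first entry      -> 22
print(s.iloc[2])   # by position: the third entry      -> 58
```

Here `s.loc[0]` would raise a `KeyError`, because no entry is labeled `0`.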

### Series from labeled data

Of course we can also specify labels for our data. This is achieved if instead of a list we use a dictionary.

Here is the same age data, but now we map each person's name (the key) to their age (the value). Pandas uses the key information to set up labels for the entries:

In [None]:
# labeled series
age_data = {
    "Giovanni": 22, 
    "Joel": 35, 
    "Frances": 58
}
ages = pd.Series(age_data, name="Age")
ages

Now Giovanni's age can be accessed like in a dictionary:

In [None]:
ages['Giovanni']

Note that even though the labels do not represent positions anymore, list-like indexing is still available thanks to the `.iloc[]` indexer:

In [None]:
ages.iloc[0]  # Giovanni's age is 22

## Constructing a dataframe

Now that we have seen how to create series, let's move to the main data structure in pandas: the dataframe (also called _data frame_).

There are two ways to construct them:
1. From other data structures, like series;
2. From external files.

### From other data structures (e.g., lists, dictionaries)

Dataframes are seldom created this way (usually we import data from an external file, like a `.csv` file, into a dataframe). Still, just like series, dataframes can be created from scratch from data already stored in Python objects.

The key to understanding how this works is to remember that Pandas sees dataframes as _collections of variables_, each variable being a column in the table.

So the simplest method is to collect all such columns in a dictionary that maps names to Python lists.

In [None]:
# Dictionary of lists
df_data = {"age": [22, 35, 58],
           "height_in": [87, 95, 75],
           "weight_lbs": [160, 199, 143]}

df = pd.DataFrame(df_data)
df

Behind the scenes, pandas creates a series for each list and then combines them together. In fact we can even do the first step explicitly, by passing a dict of series objects.

In [None]:
# Dictionary of series
df_data_series = {"age": pd.Series([22, 35, 58]),
                  "height_in": pd.Series([87, 95, 75]),
                  "weight_lbs": pd.Series([160, 199, 143])}

df = pd.DataFrame(df_data_series)
df

Like with series, if our data are unlabeled, by default the dataframe will employ integer positions as the labels. 

If we want the rows in the dataframe to be labeled, we can either supply labeled data (more on that below), or, if we want to label the rows of an otherwise unlabeled dataframe, we can supply them as a list through an extra argument, called `index=`.

In [None]:
# Dictionary of lists + row labels supplied separately
df_data = {"age": [22, 35, 58],
           "height_in": [87, 95, 75],
           "weight_lbs": [160, 199, 143]}

df = pd.DataFrame(df_data, index=["Giovanni", "Joel", "Frances"])
df

An intermediate method is to supply a list of _records_, where each record is really a Python dictionary (so a _list of dictionaries_) detailing the information about a person. So each record is going to be interpreted as a _row_.

In [None]:
# List of dictionaries (aka list of records)
records = [
    {'age': 22, 'height_in': 87, 'weight_lbs': 160},
    {'age': 35, 'height_in': 95, 'weight_lbs': 199},
    {'age': 58, 'height_in': 75, 'weight_lbs': 143}
]
df = pd.DataFrame(records)
df

As before we can also supply a non-numerical index with the names of the individuals by passing a list of strings to the `index=` argument: 

In [None]:
# List of dictionaries (aka list of records) + row labels supplied separately
records = [
    {'age': 22, 'height_in': 87, 'weight_lbs': 160},
    {'age': 35, 'height_in': 95, 'weight_lbs': 199},
    {'age': 58, 'height_in': 75, 'weight_lbs': 143}
]
df = pd.DataFrame(records, index=['Giovanni', 'Joel', 'Frances'])
df

In all the labeled examples so far, the information about the row labels was provided separately using the `index=` keyword argument. So they are methods to label some unlabeled data on the fly.

The last method is a variant of the original dictionary of lists method, and it is inspired by the method for creating series from labeled data.

Like the original method, it sees a dataframe as a collection of variables, all grouped together in a dictionary mapping columns (identified by name) to their data.

So we still pass a dictionary mapping each column to its data points / series. But instead of supplying the data points in an unlabeled sequence, we supply them as labeled mappings from labels (the keys) to data points (the values).

In [None]:
# Dict of dictionaries (aka dict of labeled data)
named_records = {
    'age': {'Giovanni': 22, 'Joel': 35, 'Frances': 58},
    'height_in': {'Giovanni': 87, 'Joel': 95, 'Frances': 75},
    'weight_lbs': {'Giovanni': 160, 'Joel': 199, 'Frances': 143} 
}
df = pd.DataFrame(named_records)
df

As before, pandas creates a series for each inner dictionary and then combines these series together. In fact we can even do the first step explicitly, by passing a dict of series objects, each series created from a dict.

In [None]:
# Dict of series, each created from a dict (aka dict of labeled *series*)
named_records_series = {
    'age': pd.Series({'Giovanni': 22, 'Joel': 35, 'Frances': 58}),
    'height_in': pd.Series({'Giovanni': 87, 'Joel': 95, 'Frances': 75}),
    'weight_lbs': pd.Series({'Giovanni': 160, 'Joel': 199, 'Frances': 143})
}
                            
df = pd.DataFrame(named_records_series)
df

What's important to remember is that when passing a dictionary (of lists, of other dictionaries, or of series objects), pandas always treats the keys (and their corresponding values) as _the columns_ of your data.
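
A consequence worth seeing once: if the outer keys of your dictionary are people rather than variables, the people end up as columns. The sketch below shows this, along with one possible remedy, `pd.DataFrame.from_dict(..., orient="index")`, which treats the outer keys as row labels instead. The dictionary `by_person` is a hypothetical rearrangement of the data above.

```python
import pandas as pd

# Hypothetical rearrangement: the outer keys are people, not variables
by_person = {
    "Giovanni": {"age": 22, "height_in": 87},
    "Joel": {"age": 35, "height_in": 95},
}

as_columns = pd.DataFrame(by_person)                         # people become columns
as_rows = pd.DataFrame.from_dict(by_person, orient="index")  # people become rows

print(as_columns.columns.tolist())  # ['Giovanni', 'Joel']
print(as_rows.columns.tolist())     # ['age', 'height_in']
print(as_rows.index.tolist())       # ['Giovanni', 'Joel']
```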

### From (external) data files

Of course most of the time we want to create a dataframe from data sitting in an external file. Most frequently this is done with `.read_csv()`, but there are many other common formats, such as JSON or XML. See [here](https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html) for more information from the Pandas tutorial.

The acronym **CSV** in `read_csv()` stands for **c**omma-**s**eparated **v**alues. It is a very common format since the files contain just *plain text*. This means any program that can read a string can read CSV files, even a text editor. Excel files (those that end in `.xls` or `.xlsx`) are instead binary files and typically cannot be opened with a text editor.
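
Because a CSV file is just text, we can even hand pandas a string that looks like one. The sketch below wraps a made-up, two-row CSV string in `io.StringIO` (which makes a string behave like an open file) and passes it to `pd.read_csv()`. The course rows are hypothetical and only mimic the shape of the real file.

```python
import io
import pandas as pd

# Hypothetical CSV content: first line is the header, then one row per line
csv_text = (
    "Code,Title,Credits\n"
    "INST126,Introduction to Programming,3\n"
    "INST201,Information Science,3\n"
)

demo = pd.read_csv(io.StringIO(csv_text))
print(demo)
print(demo.shape)  # (2, 3): two rows, three columns
```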

In [None]:
# (without pandas) csv files are plain text files. Let's grab all the lines.
fname = 'data/INST_courses.csv'  # NOTE: on Windows use data\INST_courses.csv instead.
with open(fname, 'r') as f:  # the with block closes the file for us
    flines = f.readlines()

In [None]:
for line in flines:
    elements = line.strip().split(",")
    print(elements)
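
One caveat about splitting by hand: a plain `.split(",")` breaks when a value itself contains a comma (CSV files wrap such values in double quotes). The sketch below contrasts it with Python's built-in `csv` module on a single made-up line.

```python
import csv
import io

# Hypothetical line whose title contains a comma, hence the quotes
line = 'INST126,"Programming, Introductory",3'

naive = line.split(",")
proper = next(csv.reader(io.StringIO(line)))

print(naive)   # 4 pieces: the title was cut in two
print(proper)  # 3 pieces: the quoted title stays whole
```

`pd.read_csv()` handles quoted values for us, which is one more reason to prefer it over manual splitting.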

In [None]:
# with pandas
fname = 'data/INST_courses.csv'  # NOTE: on Windows use data\INST_courses.csv instead.
df = pd.read_csv(fname) # needs a path to a csv file
df

## Inspecting your dataframe

Common operations:
- summarizing
- filtering / accessing
- sorting

### Summarizing

With:
- `.head()` / `.tail()`
- `.describe()`
- various stats

In [None]:
# print the first rows (5 by default) of the data frame
df.head()

In [None]:
# print the last rows (5 by default) of the data frame
df.tail()

Let's use a second dataset, which gives the height and weight of 25,000 individuals. (Source: UCLA Stats Dept)

Here is an example of what the data look like in the file:

	 Index,Height(Inches),Weight(Pounds)
	 1,65.78331,112.9925
	 2,71.51521,136.4873
	 3,69.39874,153.0269
	 4,68.2166,142.3354
	 5,67.78781,144.2971
	 6,68.69784,123.3024
	 7,69.80204,141.4947
	 8,70.01472,136.4623
	 9,67.90265,112.3723


Because the file already includes the index of each individual, we can tell pandas to use that column as the "index" of the data with the `index_col=` parameter.

In [None]:
fname = 'data/SOCR-HeightWeight.csv'  # NOTE: on Windows use data\SOCR-HeightWeight.csv
socr = pd.read_csv(fname, index_col=0)
socr.head()
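
To see what `index_col=0` changes, the sketch below reads the first two rows of the sample above from a string (via `io.StringIO`), once without and once with the parameter.

```python
import io
import pandas as pd

# First rows of the sample shown above, as a string
text = (
    "Index,Height(Inches),Weight(Pounds)\n"
    "1,65.78331,112.9925\n"
    "2,71.51521,136.4873\n"
)

plain = pd.read_csv(io.StringIO(text))                 # Index is a regular column
indexed = pd.read_csv(io.StringIO(text), index_col=0)  # Index labels the rows

print(plain.columns.tolist())    # ['Index', 'Height(Inches)', 'Weight(Pounds)']
print(indexed.columns.tolist())  # ['Height(Inches)', 'Weight(Pounds)']
print(indexed.index.tolist())    # [1, 2]
```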

The `.describe()` method instead summarizes data from a statistical point of view.

In [None]:
socr.describe()
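
The result of `.describe()` is itself a dataframe, with one row per statistic, so individual numbers can be pulled out of it with `.loc`. A minimal sketch on three made-up heights:

```python
import pandas as pd

# Hypothetical heights, only for illustration
toy = pd.DataFrame({"height": [60.0, 65.0, 70.0]})

summary = toy.describe()
print(summary.index.tolist())  # names of the statistics computed
print(summary.loc["mean", "height"])
```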

### Subsetting / getting / accessing parts of our dataframe

The most basic operation is getting a specific column. The syntax looks like the way we index things in lists or dictionaries.

In [None]:
socr['Height(Inches)']

Let's say you want a particular statistic for only one column. You can do this by accessing the series, and asking for a specific statistic.

In [None]:
socr['Height(Inches)'].median()
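
Other statistics work the same way. The sketch below uses four made-up heights so the results are easy to verify by hand.

```python
import pandas as pd

# Hypothetical heights, only for illustration
heights = pd.Series([65.0, 70.0, 68.0, 72.0], name="height")

print(heights.median())  # middle of the sorted values: (68 + 70) / 2 = 69.0
print(heights.mean())    # (65 + 70 + 68 + 72) / 4 = 68.75
print(heights.max())     # 72.0
```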

We can rename columns with the `.rename()` method. It takes a dictionary specifying which columns need to be renamed (keys) to what new name (values).

The dictionary must be passed to the `columns=` parameter.

Let's say I want to rename `"Height(Inches)"` to `"height"` (lowercase) and `"Weight(Pounds)"` to `"weight"`:

In [None]:
new_names = {
    'Height(Inches)': "height",
    'Weight(Pounds)': "weight"
}
socr = socr.rename(columns=new_names)
socr

### Filtering the data based on one or more columns

But we sometimes also want to get **subsets** of the data, depending on one or more column values. For this, we will use another dataset about college sports. You can find an explanation of some of the columns [on this glossary](https://www.sports-reference.com/cbb/about/glossary.html).

First let's look at the head of the data frame (first 5 rows):

In [None]:
fname = "data/ncaa-team-data.csv"  # NOTE: on Windows use data\ncaa-team-data.csv

ncaa = pd.read_csv(fname)
ncaa.head()

To find subsets of the rows, we use a special syntax of pandas Dataframes called _boolean indexing_. Here are a few examples of queries we may want to ask:

In [None]:
# find all of the seasons where Maryland won more than it lost (column `wl` > 0.5)
ncaa[(ncaa['wl'] > .5) & (ncaa['school'] == "maryland")]

In [None]:
# find all of the seasons where a Big Ten school won more than it lost (column `conf`)
ncaa[(ncaa['wl'] > .5) & (ncaa['conf'] == "Big Ten")]

In [None]:
# find all of the seasons where Duke lost more than it won
ncaa[(ncaa['wl'] < .5) & (ncaa['school'] == "duke")]

In [None]:
# All losing seasons for coach K
target = "Mike Krzyzewski"
ncaa[(ncaa['wl'] < .5) & (ncaa['coaches'].str.startswith(target))]
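
It can help to look at the pieces of such a query separately. Each comparison produces a series of `True`/`False` values (a *mask*), `&` combines two masks element by element, and indexing with the combined mask keeps only the rows marked `True`. The sketch below uses a three-row, made-up table rather than the real NCAA data.

```python
import pandas as pd

# Hypothetical mini version of the NCAA table
toy = pd.DataFrame({
    "school": ["maryland", "duke", "maryland"],
    "wl": [0.70, 0.40, 0.45],
})

winning = toy["wl"] > 0.5               # one True/False per row
is_md = toy["school"] == "maryland"     # another mask

print(winning.tolist())  # [True, False, False]
print(is_md.tolist())    # [True, False, True]

result = toy[winning & is_md]           # rows where BOTH masks are True
print(result)
```

When the conditions are written inline, each one needs its own parentheses, because `&` binds more tightly than `>` and `==` in Python.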

### Reshaping

The most basic form of reshaping is _sorting/reordering_ the rows in a dataframe. We will discuss more advanced operations, like _transposing_, next week. Let's sort the INST courses data frame by class code:

In [None]:
fname = "data/INST_courses.csv"  # Note: on Windows use data\INST_courses.csv instead!
df = pd.read_csv(fname)
df.columns

In [None]:
df.sort_values(by="Code")

In [None]:
# reverse the sort order
df.sort_values(by="Code", ascending=False)

By default, `.sort_values()` will return a new sorted data frame. So the last step did not change what's stored in variable `df`.

In [None]:
df

So to store the result you will need to save it to a new variable:

In [None]:
df_reversed = df.sort_values(by="Code", ascending=False)

In [None]:
df.head()

In [None]:
df_reversed.head()

But `.sort_values()` can also be told to modify the data frame *in place*:

In [None]:
df.sort_values(by="Code", ascending=False, inplace=True)

In [None]:
df.head()

We can also sort by multiple columns. Let's go back to the NCAA data:

In [None]:
ncaa.sort_values(by="wl")

In [None]:
# To sort by wl first and then by year (to break ties), pass a list with the column names
ncaa.sort_values(by=['wl', 'year'])
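
The order of the names in the list matters: the first column decides the order, and later columns only break ties. The sketch below shows this on a made-up three-row table, and also shows that `ascending=` accepts a list with one direction per column.

```python
import pandas as pd

# Hypothetical data with a tie in the wl column
toy = pd.DataFrame({"wl": [0.5, 0.7, 0.5], "year": [2001, 2000, 1999]})

# Sort by wl; among equal wl values, sort by year (both ascending)
both_up = toy.sort_values(by=["wl", "year"])
print(both_up["year"].tolist())  # [1999, 2001, 2000]

# Sort by wl ascending, but break ties with the most recent year first
mixed = toy.sort_values(by=["wl", "year"], ascending=[True, False])
print(mixed["year"].tolist())    # [2001, 1999, 2000]
```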

## Computing new columns from existing ones

To create new columns, we need to remember that dataframes are collections of columns that can be accessed like dictionaries. So the syntax for creating a new column is very similar to what you would do with a dictionary.

As an example, let's go back to the SOCR dataset and use again the formula for the Body Mass Index:

$$BMI = \frac{w}{h^2} \times 703$$ 

In [None]:
# Associate a new column in the data frame to the name 'bmi'
socr['bmi'] = socr['weight'] / socr['height'] ** 2 * 703

In [None]:
socr.head()

Note also that operations among columns result in a Pandas series.

In [None]:
socr['weight'] / socr['height'] ** 2 * 703
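
These operations are applied element by element: row 1's weight is combined with row 1's height, row 2's with row 2's, and so on. The sketch below checks this by hand on two made-up individuals (weights in pounds, heights in inches).

```python
import pandas as pd

# Hypothetical individuals, only for illustration
toy = pd.DataFrame({"weight": [150.0, 200.0], "height": [65.0, 70.0]})

# ** is evaluated before / and *, so this is (weight / height^2) * 703
toy["bmi"] = toy["weight"] / toy["height"] ** 2 * 703

# The first row, computed by hand with plain numbers
by_hand = 150.0 / 65.0 ** 2 * 703

print(toy.round(2))
print(round(by_hand, 2))  # about 24.96
```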

## Aside: dataframes are (mostly) immutable

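Pandas wants you to treat dataframes as immutable: by default, most methods that modify a dataframe (such as `.sort_values()` or `.rename()`) return a modified copy (just like string methods do), rather than modifying the dataframe itself. Assigning to a column, as we did with `socr['bmi']` above, is the main exception: it changes the dataframe directly.

You *can* get around this if you want, by passing an `inplace=True` argument to many of these methods.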
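
The sketch below shows both behaviors side by side on a made-up, three-row dataframe. Note that with `inplace=True` the method returns `None`, so its result should not be assigned to a variable.

```python
import pandas as pd

# Hypothetical data, only for illustration
toy = pd.DataFrame({"Code": ["B", "A", "C"]})

# Default: a sorted copy is returned, the original is untouched
result = toy.sort_values(by="Code")
print(toy["Code"].tolist())     # ['B', 'A', 'C']
print(result["Code"].tolist())  # ['A', 'B', 'C']

# In place: the original is changed and nothing is returned
returned = toy.sort_values(by="Code", inplace=True)
print(returned)                 # None
print(toy["Code"].tolist())     # ['A', 'B', 'C']
```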

# Coding Challenge

Using the `ncaa` data frame, find the rank (column `rk`) of Maryland in the last 10 years available in the dataset. Your dataframe should look something like this:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>year</th>
      <th>rk</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>10814</th>
      <td>2007.0</td>
      <td>10</td>
    </tr>
    <tr>
      <th>10813</th>
      <td>2008.0</td>
      <td>9</td>
    </tr>
    <tr>
      <th>10812</th>
      <td>2009.0</td>
      <td>8</td>
    </tr>
    <tr>
      <th>10811</th>
      <td>2010.0</td>
      <td>7</td>
    </tr>
    <tr>
      <th>10810</th>
      <td>2011.0</td>
      <td>6</td>
    </tr>
    <tr>
      <th>10809</th>
      <td>2012.0</td>
      <td>5</td>
    </tr>
    <tr>
      <th>10808</th>
      <td>2013.0</td>
      <td>4</td>
    </tr>
    <tr>
      <th>10807</th>
      <td>2014.0</td>
      <td>3</td>
    </tr>
    <tr>
      <th>10806</th>
      <td>2015.0</td>
      <td>2</td>
    </tr>
    <tr>
      <th>10805</th>
      <td>2016.0</td>
      <td>1</td>
    </tr>
  </tbody>
</table>

In [None]:
# Enter your solution here!
...

# What to do next: learn how to read documentation for libraries

Pandas is a huge topic and this notebook barely scratched the surface. As you begin your journey learning it, you should have handy access to (and know how to use):
- Docs for "ground truth"; the pandas website is a decent place to start: https://pandas.pydata.org/, and you can always access the inline help directly from Jupyter (see cells below);
- Some collection of examples for reference; this "cheat sheet" is also a really helpful guide to more common operations that you may run into later: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf;
- A set of principles for how to use pandas in your code; this &ldquo;style guide&rdquo; gives some recommendations for writing good code using pandas: https://tomaugspurger.github.io/posts/modern-1-intro/.

Last but not least, a note about collaborative learning. The cool thing about pandas and data analysis in python is that many people share notebooks that you can inspect / learn from / adapt code for your own projects (just like this one!). In Data Science (and many other fields) nobody writes anything all from scratch, unless they are trying to *really* learn something deeply. So learning how to use libraries is also good training for learning to code in teams, where using code from others is typically strongly expected.

In [None]:
# Open help page for the pandas library
import pandas
pandas?

In [None]:
# Open help page for the read_csv function in the pandas library
pandas.read_csv?
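
The `?` syntax is a Jupyter/IPython convenience and is not valid in a plain Python script. Outside a notebook, the built-in `help()` function shows the same text, which lives in the function's docstring. A minimal sketch:

```python
import pandas as pd

# help(pd.read_csv) would print the whole help page; here we only
# peek at the beginning of the same text, stored in the docstring.
doc = pd.read_csv.__doc__
print(doc[:300])
```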

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

---


# Solution

In [None]:
ncaa[ncaa['school'] == 'maryland'][['year', 'rk']].sort_values(by='year').tail(10)