# Data Analysis with Pandas: split-apply-combine

Last week, we learned:
- Pandas is a library in Python that is designed for data manipulation and analysis
- How to use libraries (import them, access their functions and data structures with `library.function_name()`)
- About the `dataframe` data structure: basically a smart spreadsheet, with rows of observations, and columns of variables/data for each observation - sort of a cross between a list (sortable, indexable) and a dictionary (quickly access data by key)
- Some basic operations: constructing a dataframe, summarizing, subsetting, reshaping

This week, we'll learn a bit more about summarization:
- Use `.value_counts()` to summarize categorical data
- Use `.crosstab()` to summarize categorical data across multiple columns

And more advanced operations for reshaping/modifying your dataframe:
- Use `.apply()` to apply functions to one or more columns to generate new columns
- Use `.groupby()` to split your data into subgroups, apply some function to their data, then combine them into a new dataframe for further analysis (the "**split-apply-combine**" pattern that is fundamental to data analysis with pandas)
- Use some basic plotting functions to explore your data

I'll then tie it all together to show how they map to problem formulations for your Project 4: all the projects have the same basic structure!

These roughly correspond to Qs 6-8 in your PCEs.

# Setup

The files we'll be working with in this session are the datasets we're giving for Project 4.

You can download them here:
* [bls-by-category.csv](https://terpconnect.umd.edu/~gciampag/INST126/data/bls-by-category.csv "bls-by-category.csv")
* [BreadBasket_DMS.csv](https://terpconnect.umd.edu/~gciampag/INST126/data/BreadBasket_DMS.csv "BreadBasket_DMS.csv")
* [ncaa-team-data.csv](https://terpconnect.umd.edu/~gciampag/INST126/data/ncaa-team-data.csv "ncaa-team-data.csv")
* [testudo_fall2020.csv](https://terpconnect.umd.edu/~gciampag/INST126/data/testudo_fall2020.csv "testudo_fall2020.csv")

<!--
Make sure you save them into the _same folder_ as the notebook. In Google Colab, also make sure to upload them to your runtime.

<img src="https://terpconnect.umd.edu/~gciampag/INST126/images/colab-files.png" width="15%">
-->

In [None]:
import pandas as pd

fn = 'testudo_fall2020.csv'

# read in the file into a dataframe called courses
courses = pd.read_csv(fn)

# use the .head() function to show the top 5 rows in the dataframe
courses.head(5) 

In [None]:
# quick summary for quantitative data (works on numerical columns only)
courses.describe() 

## Use `.value_counts()` to summarize categorical data in your dataframe

`.value_counts()` is another way to summarize a column, this time one that is *categorical*.

The counts correspond to how many times each particular _category_ (a value) appears in the column.

Results are sorted in descending order by default.
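Here is a minimal sketch of the same idea on a tiny made-up dataframe (the `toy` data below is invented for illustration and is not one of the course datasets), so you can check the counts by eye:

```python
import pandas as pd

# made-up toy data (NOT one of the Project 4 datasets)
toy = pd.DataFrame({"area": ["INST", "CMSC", "INST", "MATH", "INST", "CMSC"]})

# count how many times each category appears, most frequent first
toy_counts = toy["area"].value_counts()
print(toy_counts)

# normalize=True returns proportions instead of raw counts
toy_props = toy["area"].value_counts(normalize=True)
print(toy_props)
```

The `normalize=True` variant is handy when you care about shares rather than absolute numbers.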

*Hint: this could be useful for Project 4!*

In [None]:
# access the area column in the courses dataframe
area = courses['area']

# and apply the value_counts method to that column, which is a series data structure
area.value_counts()

A Pandas series is like a cross between a dictionary and a list.
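To make the "dictionary plus list" idea concrete, here is a small sketch on a hand-made series (the values are invented for illustration). Labels play the role of dictionary keys, and `.iloc` gives list-style access by position:

```python
import pandas as pd

# made-up series: the index holds the labels (like dict keys)
s = pd.Series([3, 2, 1], index=["INST", "CMSC", "MATH"])

# dictionary-like: look up a value by its label
by_label = s["CMSC"]

# list-like: look up a value by its position with .iloc
by_position = s.iloc[0]

# dictionary-like: membership checks look at the labels, not the values
has_cmsc = "CMSC" in s

print(by_label, by_position, has_cmsc)
```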

In [None]:
# same as above but stored in a new variable
area_counts = courses['area'].value_counts()

In [None]:
# can get value by named key like a dict
area_code = "INST"
print(f"{area_code}: {area_counts[area_code]}")

# and also by position (use .iloc: plain [0] on a labeled series is deprecated in recent pandas)
print(f"most frequent item count: {area_counts.iloc[0]}")

To list all the keys, use the `.keys()` method. It returns a pandas object called an "index".

In [None]:
area_counts.keys()

Let's say we want the top 5 most populous areas. We can slice/subset the series just like a list, and then get the keys from that subset.


In [None]:
# use a slice (like a list) and then get keys (like a dict)
area_counts[:5].keys()

In [None]:
bread = pd.read_csv('BreadBasket_DMS.csv')

# how do we get the frequency counts for items in the bread dataframe?
bread['Item'].value_counts()

## Use `.crosstab()` to summarize categorical data across multiple columns

If we have _multiple_ categorical columns, we may want to get the frequency of a particular combination of values from both columns.

Let's use the NCAA dataset to show this

In [None]:
# read data from CSV
ncaa = pd.read_csv("ncaa-team-data.csv")

# summarize first 5 rows
ncaa.head(5)

Let's explore the `ncaa_result` column.

In [None]:
# All possible results
ncaa['ncaa_result'].value_counts()

Let's explore the `coaches` column

We can count how many times each coach had a particular result (column `ncaa_result`).

One problem is that coaches can appear with different seasons (if they coached at different schools).

So, first, let's clean the `coaches` column to extract only the name without the season part.

In [None]:
def get_coach_name(x):
    """
    Extract the name of the (first) coach in the season
    """
    # split the string around spaces
    try:
        elements = x.split(' ')
    except AttributeError as er:
        elements = ['No', 'Coach']
    first_and_last_name = elements[:2]
    coach_name = " ".join(first_and_last_name)
    return coach_name

# apply get_coach_name function to each entry in column `coaches` and create new column `coach_name`
ncaa['coach_name'] = ncaa['coaches'].apply(get_coach_name)
ncaa
# create a sub-dataframe with just two columns -- ncaa_result and coach_name
sub_ncaa = ncaa[['coach_name', 'ncaa_result']]
sub_ncaa.value_counts()

This however gives me a series and not a dataframe.

The `crosstab` function instead organizes the same data in a tabular format (so as a dataframe instead of a series).
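As a rough sketch of the difference, here are both approaches on a tiny made-up table (the `games` data is invented for illustration). The counts are the same; only the shape of the result changes:

```python
import pandas as pd

# made-up toy data (NOT the NCAA dataset)
games = pd.DataFrame({
    "coach": ["A", "A", "B", "B", "B"],
    "result": ["Won", "Lost", "Won", "Won", "Lost"],
})

# value_counts on two columns: a series indexed by (coach, result) pairs
as_series = games[["coach", "result"]].value_counts()
print(as_series)

# crosstab: a dataframe with one row per coach and one column per result
as_table = pd.crosstab(games["coach"], games["result"])
print(as_table)
```

Because `as_table` is a dataframe, we can filter it with boolean indexing on its columns, which is what we do next.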

We can ask who are the people who won _some_ national finals but also lost _some_ regional finals.

In [None]:
coach_results = pd.crosstab(ncaa['coach_name'], ncaa['ncaa_result'])

coach_results[(coach_results["Won National Final"] > 0) & (coach_results['Lost Regional Final'] > 0)]

We can do the same with the UMD courses data. Let's see how many areas offer introductory courses.

To decide whether a course is &ldquo;introductory&rdquo;, we can check if it has the string `"Introduction"` in the title.

In [None]:
def is_intro(title):
    """ 
    Determine whether a course title has the word "Introduction" or not in it
    """
    if "Introduction" in title:
        return True
    else:
        return False

# apply the is_intro function to the title column in courses
# and save the results in the is_intro column in courses
courses['is_intro'] = courses['title'].apply(is_intro)

# show me the top 5 rows in the dataframe
courses.head()

Now use `.crosstab()` method to cross-tabulate the counts across columns `area` and `is_intro`

In [None]:
# cross-tabulate area and is_intro
intro_by_area = pd.crosstab(courses["area"], courses["is_intro"])
intro_by_area

Notice that `.crosstab()` works with boolean data too.

But the column names in the new data frame (`False` and `True`) are not very meaningful.

We can rename them with `.rename()`.

In [None]:
new_names = {
    True: "Yes",
    False: "No"
}
intro_by_area = intro_by_area.rename(columns=new_names)
intro_by_area

Now compute the fraction of intro courses per area. This is:

$$
\frac{\rm yes}{({\rm yes} + {\rm no})}
$$

In [None]:
intro_by_area['frac_intro'] = intro_by_area['Yes'] / (intro_by_area['Yes'] + intro_by_area['No'])
intro_by_area

Reset index if you want the `area` information to be usable for analysis/plotting/etc.
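A small sketch of what resetting the index changes, using made-up data (the `toy` dataframe and its `kind` column are invented for illustration): before the reset, `area` lives in the index; after it, `area` is a regular column.

```python
import pandas as pd

# made-up toy data (NOT the courses dataset)
toy = pd.DataFrame({
    "area": ["INST", "INST", "CMSC"],
    "kind": ["intro", "other", "intro"],
})

table = pd.crosstab(toy["area"], toy["kind"])

# area is the index of the cross-tabulation, not one of its columns
area_is_column_before = "area" in table.columns

# reset_index() moves the index into a regular column called area
flat = table.reset_index()
area_is_column_after = "area" in flat.columns

print(area_is_column_before, area_is_column_after)
print(flat)
```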

In [None]:
intro_by_area = intro_by_area.reset_index()
intro_by_area

## Computing data based on one or more columns using `.apply()`

All these examples involved modifying or creating new columns!

In data analysis, we often want to do things to data in our columns for data preparation/cleaning. 

Sometimes there is missing data we want to re-code, or there is data we want to re-describe or re-classify for our analysis. 

We can do this with a combination of functions and the `apply()` method. It comes in two flavors:

- With a single column (i.e. on a Series)
- With multiple columns (i.e. on a DataFrame)
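Before going through each flavor on the courses data, here is a minimal side-by-side sketch on made-up data (the `toy` dataframe and both helper functions are invented for illustration):

```python
import pandas as pd

# made-up toy data (NOT the courses dataset)
toy = pd.DataFrame({
    "title": ["Introduction to X", "Advanced Y"],
    "credits": [3, 4],
})

# flavor 1: Series.apply, the function receives ONE VALUE at a time
def title_length(title):
    return len(title)

toy["title_len"] = toy["title"].apply(title_length)

# flavor 2: DataFrame.apply with axis=1, the function receives ONE ROW at a time
def make_label(row):
    return f"{row['title']} ({row['credits']} credits)"

toy["label"] = toy.apply(make_label, axis=1)

print(toy)
```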

### `.apply()` with a single column

The `prereqs` column gives a string description of the prerequisites for the course

In [None]:
courses['prereqs'].head()

Let's say we want to have a `prereqs` column that is sortable. For example:

    0 = No prereqs 
    1 = has prereqs

#### Step 1: Define the function you want to apply

In [None]:
# Step 1: define the function you want to apply
def has_prereq(prereq_descr):
    """
    Determine whether a pre-requisite description includes the string "None"
    """
    if "None" in prereq_descr:
        return 0
    else:
        return 1

Test the function on some sample inputs to check it works

In [None]:
# this should yield 1
prereq = "BMGT301; or instructor permission" 
print(f"With {prereq=!r}: {has_prereq(prereq)=}")

# this should yield 0
prereq = "None"
print(f"With {prereq=!r}: {has_prereq(prereq)=}")

#### Step 2: Apply the function to a column

We can create a new column called `has_prereqs` to save this information in the dataframe

In [None]:
# Step 2: apply it to one or more columns

# This applies the has_prereq() function to every row in the prereqs column in the courses data frame
courses['has_prereqs'] = courses['prereqs'].apply(has_prereq) 
courses.head(10)

We can crosstab the new column with the previous `is_intro` column to see how many introductory courses have pre-requisites

In [None]:
pd.crosstab(courses['is_intro'], courses['has_prereqs'])

Interestingly, some courses that are called `"Introduction ..."` do have pre-requisites. 

We can use boolean indexing to see which courses these are!

In [None]:
courses[(courses['is_intro'] == True) & (courses['has_prereqs'] == 1)]

In [None]:
def has_intro(descr):
    if "intro" in descr.lower():
        return 1
    else:
        return 0

courses['has_intro'] = courses['description'].apply(has_intro)
courses.head(10)

#### What's happening under the hood

As another example, let's say I want to know how many courses of each level (100-, 200-, 300-level, etc.) we have in each area. We don't have that data in the dataset; at least not explicitly. Fortunately we can make it with some simple programming that you already know how to do! The problem here is, given a code (i.e., data from one column), how do we "extract" the level?

In [None]:
# Step 1: define the function
def extract_level(code):
    """
    Given a course code (e.g. INST126) extract the course level (100)
    
    Note that this function assumes course code starts with 4-letter area
    """
    return code[4] + '00'

In [None]:
c = "CMSC250"
extract_level(c)

Let's see how this works!

The `.apply()` method generates a new series that is the same length as the input column's number of rows, with a corresponding value for each input

(in this case, we have 414 rows in the data frame)

In [None]:
# Step 2: apply the function
courses['level'] = courses['code'].apply(extract_level)
courses.head(10)

This is equivalent to calling `extract_level` repeatedly in a for loop:

In [None]:
tmp = []
for i in range(len(courses)):
    row = courses.loc[i]
    code = row['code']
    level = extract_level(code)
    print(f"{i}: code={code}, level={level}")
    tmp.append(level)
courses['level'] = tmp

Another example with the bread basket data frame. 

Let's extract the hour of the day from the `Time` column and save it in a column called `Hour`

In [None]:
def extract_hour(time):
    return int(time.split(":")[0])

bread['Hour'] = bread['Time'].apply(extract_hour)
bread.sort_values(by="Hour")

#### Step 3: Save the resulting data from the `.apply()` into a new / existing column

What if we want to save the results so we can use it later? We can simply assign it to a column, new or existing. 

Remember, pandas prefers immutability in general (return a new object instead of modifying the object), and sometimes enforces it. 

With `.apply()`, it's enforced: you can't do it in place, you have to assign the returned series to a column (or a variable) if you want it to persist.
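A quick sketch of this behavior on made-up data (the `toy` dataframe and `get_area` helper are invented for illustration): calling `.apply()` leaves the dataframe untouched until we assign the result.

```python
import pandas as pd

# made-up toy data (NOT the courses dataset)
toy = pd.DataFrame({"code": ["INST126", "CMSC250"]})

def get_area(code):
    # assumes the area is the first 4 characters of the code
    return code[:4]

# .apply() returns a NEW series; the dataframe itself is not modified
result = toy["code"].apply(get_area)
columns_before = list(toy.columns)

# the result only persists once we assign it to a column
toy["area"] = result
columns_after = list(toy.columns)

print(columns_before, columns_after)
```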

### `.apply()` with data from multiple columns

What if you want to have a way to filter the courses in terms of "easy entry points" (i.e., both introductory *and* has no prerequisites)? 

That might also be interesting to analyze by area to see how many departments offer these easy entry points into the department for students from other departments.

Core thing we need to know here is that our `.apply()` will now apply a function that has a **row** as input, not an element of a single column. 

That way, we can access data from any column in the row: in this case, data from the `is_intro` and `has_prereqs` columns.

We tell `.apply()` to do this with the `axis` parameter. 

We need to pass `axis=1` when we call `.apply()` so it knows to pass a row into the function, not just a single column element. 

See here for more details: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
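To see what the `axis` parameter changes, here is a small sketch on made-up numbers (the `scores` dataframe and `total` helper are invented for illustration). The same function is handed whole columns with `axis=0` and whole rows with `axis=1`:

```python
import pandas as pd

# made-up toy data
scores = pd.DataFrame({"midterm": [80, 90], "final": [70, 95]})

def total(values):
    # values is a series: a full column (axis=0) or a full row (axis=1)
    return values.sum()

# axis=0 (the default): the function is called once per COLUMN
col_totals = scores.apply(total, axis=0)

# axis=1: the function is called once per ROW
row_totals = scores.apply(total, axis=1)

print(col_totals)
print(row_totals)
```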

In [None]:
# is_entry_point function
def is_entry_point(row):
    """
    Determine whether a course is an "entry point" based on these two conditions:
    - It is an "intro" course (is_intro = 1)
    - It has no prerequisites (has_prereqs = 0)
    """
    if row['is_intro'] == 1 and row['has_prereqs'] == 0: 
        return 1
    else:
        return 0

In [None]:
# this should yield 1
test_row = {
    'is_intro': 1,
    'has_prereqs': 0
}
print(f"With test_row = {test_row} is_entry_point(test_row) = {is_entry_point(test_row)}")
      
# this should yield 0
test_row = {
    'is_intro': 1,
    'has_prereqs': 1
}

print(f"With test_row = {test_row} is_entry_point(test_row) = {is_entry_point(test_row)}")

In [None]:
# Step 2 apply the function (to the whole data frame!) and save the result

# need to specify axis=1 to apply it to every row
courses['is_entrypoint'] = courses.apply(is_entry_point, axis=1) 
courses.head()

What courses are entry points?

In [None]:
courses[courses['is_entrypoint'] == 1]

In [None]:
majors = pd.crosstab(courses['area'], courses['is_entrypoint'])
majors['frac_entry_points'] = majors[1] / (majors[0] + majors[1])
majors

## The split-apply-combine pattern (with `.groupby()`)

We have seen we can &ldquo;reshape&rdquo; a dataframe in various ways: sorting, summarizing, cross-tabulation, etc.

Going more deeply on this path of &ldquo;reshaping&rdquo;, we often __want to compute data based on subsets of the data, grouped by some column__.

For example, we might want to see how many departments offer &ldquo;easy&rdquo; entry point courses. 

We can do this with the "split-apply-combine" pattern, which is implemented in the `.groupby()` function.

Basically, it goes like this:

1. **Split** the data into subgroups (e.g., split courses into department subgroups)

2. **Apply** some computation on each subgroup (e.g., find number of easy entry points for each department subgroup)

3. **Combine** subgroup-computation information into an overall new dataframe that has subgroups as entries


More info in this tutorial https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html
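The three steps can be sketched on a tiny made-up dataframe (the `toy` data is invented for illustration), first spelled out by hand and then as the equivalent one-liner:

```python
import pandas as pd

# made-up toy data (NOT the courses dataset)
toy = pd.DataFrame({
    "area": ["INST", "INST", "CMSC", "CMSC", "CMSC"],
    "credits": [3, 1, 4, 3, 2],
})

rows = []
# SPLIT: one sub-dataframe per area
for area, group in toy.groupby("area"):
    # APPLY: compute something on each sub-dataframe
    mean_credits = group["credits"].mean()
    rows.append({"area": area, "mean_credits": mean_credits})

# COMBINE: one row per group in a new dataframe
combined = pd.DataFrame(rows)
print(combined)

# the same computation, with pandas doing all three steps for us
shortcut = toy.groupby("area")["credits"].mean()
print(shortcut)
```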

### 1. Split 

We use the `.groupby()` method to split a dataframe into subgroups based on the values of a column

Let's split the courses dataframe by area and see how many courses are in each split.

In [None]:
# The total number of rows in the data frame
print(f"There are {len(courses)} rows in the data frame")

In [None]:
# Now we print the rows in each split
for area, area_courses in courses.groupby('area'):
    print(f"There are {len(area_courses)} rows in the data frame for area = {area}")

In [None]:
# Show the split from the last iteration of the loop: it is a dataframe (notice the area column)
area_courses

#### Split the manual way

We use indexing to find subgroups of rows with a given value, then we can apply some summarization statistics, like the average credits.

In [None]:
# get all the unique area values
course_areas = courses['area'].value_counts().keys()

# iterate through each unique area value
for area in course_areas:
    # get the subset of the course data that is associated with this area
    area_courses = courses[courses['area'] == area]
    
    # print the number of rows
    print(f"There are {len(area_courses)} rows in the data frame for area = {area}")

In [None]:
# Show the split from the last iteration of the loop: it is a dataframe (notice the area column)
area_courses

Notice also that these numbers correspond to the output of `courses['area'].value_counts()`

In [None]:
courses['area'].value_counts()

### Properties of `.groupby()`

In general `.groupby()` returns an object of class `DataFrameGroupBy` that performs the split.

We use this object to split the courses df into subsets grouped by area. 

By default, the groups are ordered by the sorted values of the grouping column.

The object also works as an _iterator_: we can iterate through the resulting collection of dataframe subsets where each step in the iteration allows us to grab:
1. the name of the subset, which is the shared value (in this case area)
2. the subset dataframe (here called `area_courses`)

In [None]:
courses.groupby('area')

On top of supporting iteration, objects returned by `.groupby()` also give you a number of methods/attributes:

- `.groupby(...).get_group(KEY)` &ndash; (_method_) returns the group associated to KEY
- `.groupby(...).ngroups` &ndash; (_attribute_) the number of groups
- `.groupby(...).groups` &ndash; (_attribute_) a dictionary whose keys are group names and the values the list of corresponding row indices. 

Here `...` represents the column with the groups.
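Here is a minimal sketch of these three on a made-up dataframe (the `toy` data is invented for illustration), small enough to verify by hand:

```python
import pandas as pd

# made-up toy data (NOT the courses dataset)
toy = pd.DataFrame({
    "area": ["INST", "CMSC", "INST"],
    "code": ["INST126", "CMSC131", "INST201"],
})

toy_grouped = toy.groupby("area")

# number of distinct groups
n = toy_grouped.ngroups

# the sub-dataframe for one group
inst = toy_grouped.get_group("INST")

# row labels that belong to one group
inst_rows = list(toy_grouped.groups["INST"])

print(n)
print(inst)
print(inst_rows)
```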

In [None]:
grouped = courses.groupby('area')

grouped.get_group('INST')

In [None]:
grouped.get_group

In [None]:
grouped.ngroups

In [None]:
grouped.groups['INST']

So yet another way to split the dataframe in groups is to use the `.groups` dict

In [None]:
grouped = courses.groupby('area')
for area in grouped.groups:
    idx = grouped.groups[area]  # <--- list of indices of the rows in this area
    area_df = courses.loc[idx]  # <--- need to pass list of indices to the .loc[] indexer
    print(f"There are {len(area_df)} rows in the data frame for area = {area}")

### 2. Apply and Combine

The "manual" way: apply and combine into a new dataframe we construct from scratch.

First, let's recreate the three columns we built last time using `.apply()`

In [None]:
def isintro(title):
    return 'Introduction' in title

courses['is_intro'] = courses['title'].apply(isintro)


def hasprereqs(description):
    # a course has prerequisites unless the description says "None"
    return 'None' not in description

courses['has_prereqs'] = courses['prereqs'].apply(hasprereqs)


def isentrypoint(row):
    if row['is_intro'] == 1 and row['has_prereqs'] == 0:
        return 1
    else:
        return 0

courses['is_entrypoint'] = courses.apply(isentrypoint, axis=1)

In [None]:
# create an empty list to hold the new rows of the new COMBINED dataframe
tmp = []

# SPLIT the dataframe by area, and iterate through each split
for area, areaData in courses.groupby('area'): 
  
    # APPLY operations on the dataframe split
    # ---------------------------------------
    
    # count the number of entry point courses in the subarea
    num_entrypoints = areaData['is_entrypoint'].sum()
    
    # count the number of total courses in the subarea
    num_classes = len(areaData)
    

    # COMBINE the resulting subcomputation into a new dataset
    # -------------------------------------------------------
    entry = {
      'area': area, # each row is an area
      'num_entrypoints': num_entrypoints, 
      'num_classes': num_classes,
    }
    tmp.append(entry) 
    
# convert the list of new entries into a dataframe
entry_courses_by_area = pd.DataFrame(tmp)
entry_courses_by_area

Another example: what is the busiest hour of the day at the restaurant?

In [None]:
bread

First, we use `.apply()` again to extract the hour of the transaction from the `Time` column

In [None]:
def gethour(time):
    return int(time.split(":")[0])

bread['hour'] = bread['Time'].apply(gethour)
bread

Then use the split-apply-combine pattern again

In [None]:
# create an empty list to hold the new rows of the new COMBINED dataframe
tmp = []

# SPLIT the dataframe by hour, and iterate through each split
for hour, hour_df in bread.groupby('hour'):
  
    # APPLY operations on the dataframe split
    # ---------------------------------------
    
    # count the number of transactions in this hour
    num_transactions = len(hour_df)    

    # COMBINE the resulting subcomputation into a new dataset
    # -------------------------------------------------------
    entry = {
      'hour': hour, # each row is an hour
      'num_transactions': num_transactions, 
    }
    tmp.append(entry) 
    
# convert the list of new entries into a dataframe
transactions_by_hour = pd.DataFrame(tmp)
transactions_by_hour    

#### Named aggregation: Shortcut apply-combine with `.groupby()` + `.agg()`

To make `.groupby()` more powerful, we tack on the `.agg()` function to it to tell pandas to *aggregate* particular columns in particular ways (e.g., count the number of entry point courses in a given department, vs. give an average *proportion* of classes that are entry points).

This is also called [named aggregation](https://pandas.pydata.org/docs/user_guide/groupby.html#named-aggregation) in the pandas official documentation.
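As a minimal sketch of the syntax on made-up data (the `toy` dataframe is invented for illustration): each keyword argument to `.agg()` names an output column, and its value is a `(input column, aggregation)` pair.

```python
import pandas as pd

# made-up toy data (NOT the courses dataset)
toy = pd.DataFrame({
    "area": ["INST", "INST", "CMSC", "CMSC", "CMSC"],
    "code": ["INST126", "INST201", "CMSC131", "CMSC132", "CMSC250"],
    "is_entrypoint": [1, 0, 1, 0, 0],
})

summary = toy.groupby("area", as_index=False).agg(
    # output column name = (input column, aggregation to apply)
    num_entrypoints=("is_entrypoint", "sum"),
    num_classes=("code", "count"),
)
print(summary)
```

With `as_index=False`, the group names stay in a regular `area` column instead of becoming the index.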

In [None]:
# SPLIT by area
courses.groupby("area", as_index=False).agg(
    # APPLY these computations and COMBINE into a new data frame using .agg
    # ----------------------------------------------------------------------

    # 1: apply `.sum()` to the `is_entrypoint` column of each subgroup
    num_entrypoints=('is_entrypoint', "sum"), 
    # 2: apply `.count()` to the `area` column of each subgroup (similar to taking len() of subgroup)
    num_classes=('area', "count")
)

Same with restaurant transactions data frame

In [None]:
bread.groupby("hour", as_index=False).agg(
    num_transactions=("Time", "count")
)

Named aggregation is a relatively new feature in Pandas.

The old school method is much more convoluted. You may still see it sometimes on StackOverflow or other websites.

In [None]:
bread.groupby("hour", as_index=False)[['Time']].count().rename(columns={'Time': 'num_transactions'})

## Anatomy of combining `.groupby()` with `.agg()` via named aggregation

<img src="https://terpconnect.umd.edu/~gciampag/INST126/images/groupby.png" width="100%" />

### Using groupby for further analysis

Sometimes we want to take the result of the split-apply-combined data frame and do further analysis on it.

Recall we said an entry point course (introduction + no prereqs) is an &ldquo;easy&rdquo; way to get a sense of what an area is like.

Areas with more entry points are more &ldquo;open&rdquo; to people wishing to change major for example.

What areas are more open?

Now that we know how many entry point courses there are per area, and also how many courses there are in total, we could compute an &ldquo;openness&rdquo; score.

Here is the aggregated data frame we built before

In [None]:
entry_courses_by_area = courses.groupby("area", as_index=False).agg(
    num_entrypoints=('is_entrypoint', "sum"), 
    num_classes=('area', "count")
)
entry_courses_by_area

Let's now compute the proportion of entry point classes, as a proxy for &ldquo;openness&rdquo;, and finally let's sort by that score.

In [None]:
# step 1: define the function
def openness(row):
    return row['num_entrypoints'] / row['num_classes']

# step 2: apply the function and save the results
entry_courses_by_area['openness'] = entry_courses_by_area.apply(openness, axis=1)

# step 3: sort by openness and reset index (passing ignore_index=True argument)
entry_courses_by_area.sort_values(by="openness", ascending=False, ignore_index=True)

### What can you aggregate?

<table>
<thead>
<tr><th>Function</th><th>Description</th></tr>
</thead>
<tbody>
<tr><td><code>mean()</code></td><td>Compute mean of groups</td></tr>
<tr><td><code>sum()</code></td><td>Compute sum of group values</td></tr>
<tr><td><code>size()</code></td><td>Compute group sizes</td></tr>
<tr><td><code>count()</code></td><td>Compute count of group</td></tr>
<tr><td><code>std()</code></td><td>Standard deviation of groups</td></tr>
<tr><td><code>var()</code></td><td>Compute variance of groups</td></tr>
<tr><td><code>sem()</code></td><td>Standard error of the mean of groups</td></tr>
<tr><td><code>describe()</code></td><td>Generates descriptive statistics</td></tr>
<tr><td><code>first()</code></td><td>Compute first of group values</td></tr>
<tr><td><code>last()</code></td><td>Compute last of group values</td></tr>
<tr><td><code>nth()</code></td><td>Take nth value, or a subset if n is a list</td></tr>
<tr><td><code>min()</code></td><td>Compute min of group values</td></tr>
<tr><td><code>max()</code></td><td>Compute max of group values</td></tr>
</tbody>
</table>
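A small sketch combining a few of these in one named aggregation, on made-up data with a missing value (the `toy` dataframe is invented for illustration). It also shows how `size` and `count` differ: `size` counts all rows in the group, while `count` skips missing values.

```python
import pandas as pd

# made-up toy data with one missing credits value
toy = pd.DataFrame({
    "area": ["INST", "INST", "CMSC"],
    "credits": [3.0, None, 4.0],
})

stats = toy.groupby("area").agg(
    n_rows=("credits", "size"),     # all rows, missing or not
    n_known=("credits", "count"),   # only non-missing values
    mean_credits=("credits", "mean"),
    max_credits=("credits", "max"),
)
print(stats)
```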

In [None]:
def getname(coaches):
    try:
        return " ".join(coaches.split(" ")[:2])
    except AttributeError:
        return "No Coach"

ncaa['coach'] = ncaa['coaches'].apply(getname)

ncaa.groupby('coach', as_index=False).agg(
    best_wl=('wl', 'max')
)

# Putting it all together

<img src="https://terpconnect.umd.edu/~gciampag/INST126/images/putting-together.png" width="100%" />

## Reminder: More resources

The pandas website is a decent place to start: https://pandas.pydata.org/

This "cheat sheet" is also a really helpful guide to more common operations that you may run into later: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

There are also many blogs that are helpful, like towardsdatascience.com

The cool thing about pandas and data analysis in python is that many people share notebooks that you can inspect / learn from / adapt code for your own projects (just like mine!).

## EXTRA: Plotting

The main library for plotting in Python is `matplotlib`. You can learn that library later. It has lots of fine-grained controls.

For now, you can use pandas "wrapper" over matplotlib (basically calling matplotlib from inside pandas), which is a bit easier to learn.

In [None]:
# plot openness by area
entry_courses_by_area.sort_values(by='openness', ascending=False).plot(
    x="area", 
    y="openness", 
    kind='bar', 
    xlabel="Major", 
    ylabel="Proportion of entry point classes",
    title="Average openness by major"
)

In [None]:
entry_courses_by_area.sort_values(by="num_classes", ascending=False).plot(
    x="area", 
    y="num_classes", 
    kind="bar",
    xlabel="Major",
    ylabel="Number of classes",
    title="Number of courses by major"
)