# Lab 05

## Introduction to Data Moves (Calculating and Grouping)

In this assignment, you’ll be introduced to the concept of *data moves* using functions from the `datascience` package. You’ll learn how to perform calculations on both numerical and categorical data, and how to group data for summary and comparison.

## Guidelines

- Follow good programming practices by using descriptive variable names, maintaining appropriate spacing for readability, and adding comments to clarify your code.

- Ensure written responses use correct spelling, complete sentences, and proper grammar.

**Name:**

**Section:**

**Date:**

Let's get started!

### More About the Datascience Package

The `datascience` package is a beginner-friendly Python library developed at UC Berkeley to support its introductory data science course, **Data 8**. It simplifies working with data by providing an easy-to-use `Table` structure, making it accessible to students with little or no programming experience.

The `Table` class functions like a spreadsheet. You can create tables from scratch or load them from CSV files using `Table.read_table()`. Once loaded, tables support a range of operations for analyzing and transforming data.

#### Common `Table` Operations

| Operation        | Method                          | Description                                       |
| ---------------- | ------------------------------- | ------------------------------------------------- |
| Show data        | `table.show(n)`                 | Display first `n` rows                            |
| Select columns   | `table.select('col1', ...)`     | Create new table with selected columns            |
| Drop columns     | `table.drop('col1', ...)`       | Remove specified columns                          |
| Filter rows      | `table.where('col', condition)` | Keep only rows that match a condition             |
| Sort data        | `table.sort('col')`             | Sort rows by a column                             |
| Group by values  | `table.group('col')`            | Group rows and count values                       |
| Aggregate values | `table.group('col', func)`      | Group and apply custom function (e.g., `np.mean`) |
| Join tables      | `table1.join('key', table2)`    | Merge tables based on a common column             |

The package also includes simple charting functions like `.scatter()`, `.barh()`, and `.hist()` that are built on Matplotlib and allow students to visualize data with minimal code.

Under the hood, the `datascience` package is built on top of NumPy and Matplotlib, which provide numerical computation and plotting capabilities.

In summary, `datascience` is a great starting point for exploring real data, performing analysis, and creating visualizations with clear and beginner-friendly Python code.

**Question 1.** Import the `datascience` package. 

In [None]:
...

**Question 2.** Load the `ceo_compensation_summary.csv` file into a table named `ceo_pay`.

In [None]:
ceo_pay = Table.read_table('data/ceo_compensation_summary.csv')

**Question 3.** Print the number of rows, number of columns, and all column labels from the `ceo_pay` table.

In [None]:
...

### Calculate (Numerical)

The calculate data move involves generating new columns or features based on operations performed on existing ones. This might include computing ratios, percentages, differences, or other derived values that help reveal patterns in the data.
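
As a small illustration of the calculate move, the sketch below derives a new list of values from two existing ones. The names `part`, `whole`, and `percent` and all of the numbers are made up for illustration, and plain Python lists stand in for table columns.

```python
# Hypothetical "columns" stored as plain lists (not taken from ceo_pay)
part = [25, 40, 9]
whole = [100, 80, 36]

# Derive a new column: each part as a percentage of its whole
percent = [100 * p / w for p, w in zip(part, whole)]

print(percent)  # [25.0, 50.0, 25.0]
```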

#### Total Compensation

The `ceo_pay` table does not have a column that contains the total compensation. Let's examine the values that go toward total compensation by looking at the first row of values.

Run the cell below.

In [None]:
ceo_pay.row(0)

We can select each individual component of pay by using the `.item` method of a row.

Run the cell below to see how we can access the salary.

In [None]:
ceo_pay.row(0).item(1)
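
Both `.row` and `.item` count from zero, so `row(0)` is the first row and `item(1)` is the second value in that row. A plain tuple follows the same numbering, as the sketch below shows. The tuple `row` and its values are invented for illustration.

```python
# A hypothetical row stored as a tuple: (label, first_value, second_value)
row = ("Example Co.", 1200, 300)

print(row[0])  # Example Co.  (index 0 is the first item)
print(row[1])  # 1200         (index 1 is the second item)
```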

If we wanted to calculate the total compensation, we could add all the relevant items together.

**Question 4.** Calculate the total compensation for the CEO represented in the first row.

**Hint:** If your expression is getting too long and runs off the page, try one of these techniques to keep your code neat and readable:

**Explicit Line Continuation with Backslash `(\)`**
```python
ceo_pay.row(0).item(1) + \
ceo_pay.row(0).item(2) + \
...
```

**Implicit Line Continuation Using Parentheses**
```python
(ceo_pay.row(0).item(1) + 
 ceo_pay.row(0).item(2) + 
 ...)
```

In [None]:
...

To do this for each row, we’d have to write a lot of code—most of it repetitive. That’s because the calculation stays the same; only the row number changes. A better approach is to write a single user-defined function and apply it to each row.

Let’s look at a couple of ways to do this, and then you can decide which one you’d like to use.

#### Separate Variables

We could create a new variable for each component, then sum them together to get the total.

**Question 5.**  Complete each assignment statement by writing the correct expression.

In [None]:
total = 0

salary = ...
bonus = ...
stock_awards = ...
option_awards = ...
non_equity_comp = ...
pension_change = ...
other_comp = ...

total = (salary +
         bonus +
         stock_awards +
         option_awards +
         non_equity_comp + 
         pension_change + 
         other_comp)

total

#### `for` loop

Since only the item index is changing, we can use a `for` loop.
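
The accumulator pattern behind such a loop can be sketched on made-up data. The tuple `receipt` and its values are hypothetical. The loop walks a range of index positions and adds each item to a running total.

```python
# Hypothetical data: a label followed by three amounts
receipt = ("groceries", 4.50, 3.25, 10.00)

running_total = 0
# range(1, 4) yields 1, 2, 3 (the stop value 4 is excluded)
for index in range(1, 4):
    running_total = running_total + receipt[index]

print(running_total)  # 17.75
```

Note that `range` excludes its stop value, so the stop must be one more than the last index you want to include.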

**Question 6.**  Write a `for` loop that calculates the total CEO compensation for the first row.

In [None]:
...

Both methods are effective, but how can we apply them to every row?

### User-Defined Functions in Python

In Python, a user-defined function lets you create your own custom block of code that performs a specific task. This is especially useful when you're doing the same kind of operation many times. Instead of copying and pasting the same code, you can write the logic once in a function and use it wherever you need it.

#### Calculating Total CEO Compensation

```python
def total_compensation(row):
    """
    Calculates the total CEO compensation for a given row.
    
    It returns the sum of the values in columns with index values from 1 to 7.

    Parameters:
    
    row: A row object from a Table.

    Returns:
    
    float: The total compensation based on salary, bonus, stock_awards, option_awards, 
           non_equity_comp, pension_change, and other_comp
    """
    
    total = ...
    
    return total
```

- `def`: This keyword tells Python you are defining a function.

- `total_compensation`: This is the name of the function. You can name it almost anything, but it should describe what the function does. Here, it suggests the function calculates total CEO compensation.

- `(row)`: This is the parameter the function takes in. In this case, the function expects to receive one row of a `Table` object in the `datascience` library.

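- The indented lines after the `def` line form the body of the function, which is the code that runs each time the function is called.

- `return`: This sends the value of `total` back to the code that called the function.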
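
To see these parts working together, here is a minimal sketch of a complete function on an unrelated task. The name `fahrenheit_to_celsius` and its parameter `temp_f` are hypothetical and chosen only for illustration.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    temp_c = (temp_f - 32) * 5 / 9  # body: do the calculation
    return temp_c                   # send the result back to the caller

# Call the function by name and pass in a value for the parameter
print(fahrenheit_to_celsius(212))  # 100.0
```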

**Question 7.**  Complete the function by adding code to the body.

In [None]:
def total_compensation(row):
    """
    Calculates the total CEO compensation for a given row.
    
    It returns the sum of the values in columns with index values from 1 to 7.

    Parameters:
    
    row: A row object from a Table.

    Returns:
    
    float: The total compensation based on salary, bonus, stock_awards, option_awards, 
           non_equity_comp, pension_change, and other_comp
    """
    
    ...
    
    total = ...
    
    return total

Now we can call the function with a row from the `ceo_pay` table.

In [None]:
total_compensation(ceo_pay.row(0))

To _**"apply"**_ this function to all the rows in the `ceo_pay` table, we can use the `.apply` method. When it is given only a function, the `.apply` method runs that function on each row of the table and returns an array of results, one for each row.

Run the cell below.

In [None]:
ceo_pay = ceo_pay.with_column(
    'total_compensation',
    ceo_pay.apply(total_compensation)
)

Verify that it worked.

In [None]:
ceo_pay.show(3)
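
Conceptually, calling `.apply` with only a function behaves much like calling that function once per row and collecting the results. The sketch below mimics the idea with plain tuples. The names `rows` and `row_total` are hypothetical stand-ins and are not part of `datascience`, and the results are collected in a plain list for simplicity.

```python
# Hypothetical rows: (name, first_amount, second_amount)
rows = [
    ("A", 10.0, 2.5),
    ("B", 20.0, 0.0),
    ("C", 5.0, 5.0),
]

def row_total(row):
    """Add the two numeric items of one row."""
    return row[1] + row[2]

# Call the function once per row and collect one result per row
totals = [row_total(row) for row in rows]

print(totals)  # [12.5, 20.0, 10.0]
```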

### Calculate (Categorical)

The calculate data move can also be used to create new categorical features. This involves generating a new column based on existing data by applying rules or conditions that assign category labels. For example, you might classify numerical values into categories like *"High"*, *"Medium"*, or *"Low"*, or create flags such as *"Pass"* or *"Fail"* based on a threshold. These calculated categories help organize the data in ways that make patterns easier to see and analyze.

We can categorize CEO pay by setting thresholds for total compensation. For example, compensation from 0 up to 1,000,000 could be labeled _"Low"_, above 1,000,000 up to 5,000,000 _"Moderate"_, above 5,000,000 up to 15,000,000 _"High"_, and anything above 15,000,000 _"Very High"_.

To do this type of calculation, we need to use **conditional logic**. Conditional logic in Python allows us to make decisions in our code by executing different actions depending on whether certain conditions are true or false. This is typically done using `if`, `elif`, and `else` statements.

For example, if we want to assign a category based on total compensation, we can use conditional logic to check which range a value falls into and return the appropriate label.

### Conditional Statements

Conditional statements let your program make decisions based on whether something is true or false. They allow you to control the flow of your code using conditions.

The structure looks like this:

```python
if condition:
    # Do something
elif another_condition:
    # Do something else
else:
    # Do something if the other conditions are not true
```

Each condition is typically a comparison (like `x > 5`) that evaluates to either `True` or `False`.

| Operator | Description              | Example  | Evaluates To |
| -------- | ------------------------ | -------- | ------------ |
| `==`     | Equal to                 | `5 == 5` | `True`       |
| `!=`     | Not equal to             | `5 != 3` | `True`       |
| `>`      | Greater than             | `10 > 7` | `True`       |
| `<`      | Less than                | `3 < 8`  | `True`       |
| `>=`     | Greater than or equal to | `5 >= 5` | `True`       |
| `<=`     | Less than or equal to    | `4 <= 6` | `True`       |
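
The following is a minimal runnable sketch of an `if`/`elif`/`else` chain. The function `score_label` and its cutoffs of 90 and 70 are hypothetical. Python checks the conditions from top to bottom and stops at the first one that is true, so the `elif` branch is reached only when the `if` condition was false.

```python
def score_label(score):
    """Label a score as 'High', 'Medium', or 'Low' (hypothetical cutoffs)."""
    if score >= 90:
        return "High"
    elif score >= 70:
        # Reached only when score is below 90
        return "Medium"
    else:
        # Reached only when score is below 70
        return "Low"

labels = [score_label(score) for score in [95, 90, 72, 40]]
print(labels)  # ['High', 'High', 'Medium', 'Low']
```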

Let's write a function that calculates the compensation tiers.

In [None]:
def compensation_tier(total_compensation):
    """
    Categorizes CEO total compensation into one of four categories:
    'Low', 'Moderate', 'High', or 'Very High'.

    Thresholds:
    - 0 to 1,000,000: 'Low'
    - 1,000,001 to 5,000,000: 'Moderate'
    - 5,000,001 to 15,000,000: 'High'
    - Over 15,000,000: 'Very High'

    Parameters:
        
    total_compensation (float): The total compensation value.

    Returns:
    
    str: A category label based on the compensation thresholds.
    """
    if total_compensation <= 1_000_000:
        return "Low"
    elif total_compensation <= 5_000_000:
        return "Moderate"
    elif total_compensation <= 15_000_000:
        return "High"
    else:
        return "Very High"

**Question 8.**  Apply the `compensation_tier` function to the `ceo_pay` table to add a column named `compensation_tier`.

In [None]:
ceo_pay = ...

Let's verify that our function works as intended.

In [None]:
ceo_pay.show(3)

### Grouping

The grouping data move is used to summarize data by combining rows that share a common value in one or more columns. This involves grouping the data based on a categorical variable like location or demographic, and then computing counts or summary statistics for each group. For example, you might group a dataset by _"School"_ to count the number of students in each. Grouping helps reveal patterns across categories, compare subgroups, and simplify large datasets for further analysis.
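
The counting form of grouping can be sketched in plain Python. The `students` records below are made up, and `collections.Counter` stands in for what a table grouping would do: it tallies how many rows share each value.

```python
from collections import Counter

# Hypothetical records: (student, school)
students = [
    ("Ana", "North"),
    ("Ben", "South"),
    ("Cal", "North"),
    ("Dee", "North"),
]

# Count how many rows share each school value
counts = Counter(school for _, school in students)

print(counts["North"], counts["South"])  # 3 1
```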

Let's compare total compensation across the different compensation tiers. We can do this by using the 5-number summary and a box plot to visualize how compensation values are distributed within each tier.

To compute the 5-number summary we can use functions from the NumPy library. Let's import NumPy.

In [None]:
import numpy as np

We can list all the values in the `compensation_tier` column.

In [None]:
ceo_pay.column('compensation_tier')

Now we can list only the unique values from the output using `np.unique`.

In [None]:
np.unique(ceo_pay.column('compensation_tier'))
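
One detail worth knowing is that `np.unique` returns its results in sorted order, which for text labels means alphabetical order rather than tier order. The short sketch below uses a made-up array `labels` to show this.

```python
import numpy as np

# Hypothetical labels with repeats, listed in no particular order
labels = np.array(["Low", "Very High", "High", "Low", "High"])

# Duplicates are removed and the remaining values are sorted
unique_labels = np.unique(labels)
print(unique_labels)
```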

We can filter rows from the `ceo_pay` table by tier. For each subset of the table, we can then calculate the 5-number summary.

First, let's filter the `ceo_pay` table for each compensation tier.

In [None]:
low = ceo_pay.where('compensation_tier', are.equal_to('Low'))
moderate = ceo_pay.where('compensation_tier', are.equal_to('Moderate'))
high = ceo_pay.where('compensation_tier', are.equal_to('High'))
very_high = ceo_pay.where('compensation_tier', are.equal_to('Very High'))

We can compute the 5-number summary for the `low` tier. Dividing each result by `1e6` expresses it in millions.

In [None]:
low_min = np.min(low.column('total_compensation'))/1e6
low_q1 = np.percentile(low.column('total_compensation'), 25)/1e6
low_median = np.median(low.column('total_compensation'))/1e6
low_q3 = np.percentile(low.column('total_compensation'), 75)/1e6
low_max = np.max(low.column('total_compensation'))/1e6
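
The same five NumPy calls are easy to check by hand on a tiny made-up array. In this sketch, `values` is hypothetical data chosen so that each statistic is obvious: the results are 1, 2, 3, 4, and 5 for the minimum, first quartile, median, third quartile, and maximum.

```python
import numpy as np

# Hypothetical data, already sorted so the summary is easy to verify
values = np.array([1, 2, 3, 4, 5])

summary = {
    "min": np.min(values),
    "q1": np.percentile(values, 25),   # 25th percentile (first quartile)
    "median": np.median(values),
    "q3": np.percentile(values, 75),   # 75th percentile (third quartile)
    "max": np.max(values),
}

for name, value in summary.items():
    print(name, float(value))
```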

Now we can make the table.

In [None]:
Table().with_columns(
    "Tier", ["Low"],
    "Min", [low_min],
    "Q1", [low_q1],
    "Median", [low_median],
    "Q3", [low_q3],
    "Max", [low_max]
)

**Question 9.**  Create another table with the values for the High compensation tier.

In [None]:
high_min = ...
high_q1 = ...
high_median = ...
high_q3 = ...
high_max = ...

Table().with_columns(
    "Tier", ...,
    "Min", ...,
    "Q1", ...,
    "Median", ...,
    "Q3", ...,
    "Max", ...
)

**Question 10.** Create a boxplot to compare the distribution of values for the _"Low"_ and _"High"_ tiers. The _"Low"_ tier has been done for you as an example.

Run the cell below to import `matplotlib.pyplot` and enable inline plots with the `%matplotlib inline` magic command.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
data = [..., ...]

plt.boxplot(data)
plt.xticks([1, 2], ["Low Tier", "High Tier"])  
plt.ylabel("Total Compensation (in millions)");

## Submission

Make sure that all cells in your assignment have been executed to display all output, images, and graphs in the final document.

**Note:** Save the assignment before proceeding to download the file.

After downloading, locate the `.ipynb` file and upload **only** this file to Moodle. The assignment will be automatically submitted to Gradescope for grading.