# Lab 05

## Introduction to Data Moves (Summarizing and Filtering)

In this assignment, you’ll be introduced to the concept of *data moves* using functions from the `datascience` package. You’ll learn how to summarize data both numerically and visually, and how to filter datasets to focus on specific subsets.

## Guidelines

- Follow good programming practices by using descriptive variable names, maintaining appropriate spacing for readability, and adding comments to clarify your code.

- Ensure written responses use correct spelling, complete sentences, and proper grammar.

**Name:**

**Section:**

**Date:**

Let's get started!

### More About the Datascience Package

The `datascience` package is a beginner-friendly Python library developed at UC Berkeley to support its introductory data science course, **Data 8**. It simplifies working with data by providing an easy-to-use `Table` structure, making it accessible to students with little or no programming experience.

The `Table` class functions like a spreadsheet. You can create tables from scratch or load them from CSV files using `Table.read_table()`. Once loaded, tables support a range of operations for analyzing and transforming data.

#### Common `Table` Operations

| Operation        | Method                          | Description                                       |
| ---------------- | ------------------------------- | ------------------------------------------------- |
| Show data        | `table.show(n)`                 | Display first `n` rows                            |
| Select columns   | `table.select('col1', ...)`     | Create new table with selected columns            |
| Drop columns     | `table.drop('col1', ...)`       | Remove specified columns                          |
| Filter rows      | `table.where('col', condition)` | Keep only rows that match a condition             |
| Sort data        | `table.sort('col')`             | Sort rows by a column                             |
| Group by values  | `table.group('col')`            | Group rows and count values                       |
| Aggregate values | `table.group('col', func)`      | Group and apply custom function (e.g., `np.mean`) |
| Join tables      | `table1.join('key', table2)`    | Merge tables based on a common column             |

The package also includes simple charting functions like `.scatter()`, `.barh()`, and `.hist()` that are built on Matplotlib and allow students to visualize data with minimal code.

Under the hood, the `datascience` package is built on top of NumPy and Matplotlib, which provide its numerical computation and plotting capabilities.

In summary, `datascience` is a great starting point for exploring real data, performing analysis, and creating visualizations with clear and beginner-friendly Python code.

**Question 1.** Import the `datascience` package. 

In [None]:
...

**Question 2.** Load the `colleges.csv` file into a table named `colleges`.

In [None]:
colleges = ...

**Question 3.** Print the number of rows, number of columns, and all column labels from the `colleges` table.

In [None]:
...

### Summarize (Numerical)

We can summarize numerical data using statistics like the mean and median. To calculate these summaries from table columns, we can use `np.mean()`, a function from the NumPy library, or `.mean()`, a method available on NumPy arrays returned by the `.column()` method.

#### Functions and Methods

A function like `np.mean()` is called independently and takes data as an argument, while a method like `.mean()` is called directly on an object (in this case, a NumPy array) and operates on that object. Functions and methods often perform similar tasks but differ in how they are applied.

For beginners, `np.mean()` is often preferred for its clarity and consistency, especially when working with data structures other than NumPy arrays.
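
The difference can be seen side by side in a short sketch. The `scores` values below are made up for illustration; the example imports NumPy itself so that it runs on its own.

```python
import numpy as np

scores = [70, 80, 90]            # a plain Python list
scores_array = np.array(scores)  # the same values as a NumPy array

# Function: called on its own, with the data passed in as an argument
function_result = np.mean(scores_array)

# Method: called on the array object itself using dot notation
method_result = scores_array.mean()

# Both approaches give the same answer for a NumPy array
print(function_result, method_result)

# np.mean() also accepts the plain list directly...
list_result = np.mean(scores)
print(list_result)

# ...but a plain list has no .mean() method of its own
print(hasattr(scores, 'mean'))
```

This is one reason `np.mean()` is the more consistent choice: it works whether the values are stored in a list or in an array.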

**Question 4.** Import NumPy with the appropriate alias. Then run the cells below to see how `np.mean()` handles different data structures.

In [None]:
...

In [None]:
# Calculates the mean of a list of integers: (1 + 2 + 3) / 3 = 2.0
np.mean([1, 2, 3])

In [None]:
# Calculates the mean of a tuple of integers: (1 + 2 + 3) / 3 = 2.0
np.mean((1, 2, 3))

In [None]:
# Calculates the mean of a range object from 0 to 3: (0 + 1 + 2 + 3) / 4 = 1.5
np.mean(range(4))

In [None]:
# Calculates the mean of a NumPy array: (1 + 2 + 3) / 3 = 2.0
np.mean(np.array([1, 2, 3]))

**Question 5.** Calculate the average `default_rate` for all institutions in the `colleges` table.

In [None]:
...

While the `datascience` package doesn't include a built-in function for calculating the five-number summary, we can use NumPy functions to compute each value individually and store the results in a dictionary for easy reference.

In [None]:
col = colleges.column('default_rate')  # Get the NumPy array from the column

summary = {
    "Minimum": np.min(col),
    "Q1": np.percentile(col, 25),
    "Median": np.median(col),
    "Q3": np.percentile(col, 75),
    "Maximum": np.max(col)
}

summary

### Summarize (Categorical)

We can also generate summary statistics from columns that contain categorical data, such as counting how often each category label appears.

**Question 6.** Count the frequency of each category in the `ownership` column.

In [None]:
...

To get a better sense of how institutions are distributed across ownership types, we can use proportions. This approach offers several advantages:

- Proportions adjust for different group sizes, making comparisons meaningful.

- Values range from 0 to 1, making patterns easier to interpret.

- Proportions show each category's share of the whole, not just raw totals.

Run the cells below to see how this can be done.

In [None]:
# Group and count
ownership_counts = colleges.group('ownership')
ownership_counts

In [None]:
# Add a new column with proportions
ownership_proportions = ownership_counts.with_column(
    'proportion', ownership_counts.column('count') / ownership_counts.column('count').sum()
)

ownership_proportions
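
To see why proportions make groups of different sizes comparable, consider a small sketch with made-up numbers. The two groups and their counts are hypothetical and are not taken from `colleges.csv`.

```python
import numpy as np

# Hypothetical counts for two groups of very different sizes
group_sizes = np.array([300, 50])     # total institutions in group A and group B
flagged_counts = np.array([30, 20])   # institutions in each group meeting some condition

# Raw counts suggest group A stands out (30 is more than 20)
print(flagged_counts)

# Dividing by each group's size tells a different story: 10% of A versus 40% of B
flagged_proportions = flagged_counts / group_sizes
print(flagged_proportions)
```

The division is applied element by element, so each count is compared against the size of its own group.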

### Summarize (Visualization)

Visualizations are also useful for summarizing both numerical and categorical data. A histogram displays the distribution of a single numerical variable, while a bar chart is ideal for summarizing categorical variables.

These visual tools help us understand the frequency and pattern of values across different categories or ranges.
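
Before drawing anything, it can help to see the numbers that sit behind each chart. The sketch below uses NumPy directly instead of a plotting method, and all values are hypothetical: a histogram is built from counts of values within ranges (bins), while a bar chart is built from counts of each category label.

```python
import numpy as np

# Hypothetical default rates in percent, invented for this example
rates_percent = np.array([2, 3, 4, 6, 7, 11, 12, 18])

# Histogram data: split 0 to 20 into four equal-width bins and count the values in each
bin_counts, bin_edges = np.histogram(rates_percent, bins=4, range=(0, 20))
print(bin_edges)    # the boundaries of the four bins
print(bin_counts)   # how many values fall inside each bin

# Bar chart data: count how often each category label appears
ownership_labels = np.array(['Public', 'Private', 'Public', 'Public'])
categories, category_counts = np.unique(ownership_labels, return_counts=True)
print(categories, category_counts)
```

A histogram draws one bar per bin with height based on `bin_counts`, and a bar chart draws one bar per entry in `categories`.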

Let's import `matplotlib.pyplot` using the appropriate alias, and include the `%matplotlib inline` magic command to ensure that visualizations are displayed directly within the notebook.

In [None]:
import matplotlib.pyplot as plt 
%matplotlib inline

**Question 7.** Create a histogram using the values from the `default_rate` column.

In [None]:
...

**Question 8.** Create a bar chart based on the categories in the `ownership` column.

In [None]:
...

### Filter (Numerical)

The filter data move allows us to narrow down a dataset by selecting only the rows that meet certain conditions. When filtering based on numerical features, we use comparisons (such as greater than, less than, or equal to) to isolate specific ranges or values of interest. This helps us focus on a subset of data that meets meaningful criteria, such as colleges with an average debt greater than $20,000 or an admit rate below 50%. Filtering enables more targeted analysis and can reveal patterns or trends that may be hidden in the full dataset.
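
The comparison step can be shown on its own with plain NumPy arrays. This is only a sketch of the idea, using made-up values that are not taken from `colleges.csv`: a comparison produces one True/False value per entry, and those values decide which entries are kept.

```python
import numpy as np

# Hypothetical values, invented for this example
average_debt = np.array([12000, 25000, 18000, 31000])
admit_rate = np.array([0.45, 0.80, 0.30, 0.55])

# Each comparison yields an array of True/False values, one per entry
has_high_debt = average_debt > 20000
is_selective = admit_rate < 0.50
print(has_high_debt)
print(is_selective)

# Indexing with the True/False array keeps only the entries marked True
high_debt_values = average_debt[has_high_debt]
selective_rates = admit_rate[is_selective]
print(high_debt_values)
print(selective_rates)
```

When filtering a table, the same kind of condition is applied to one column, and every row that satisfies it is kept in full.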

**Question 9.** Filter the `colleges` table to keep only the rows where the default rate is more than double the average default rate.

In [None]:
...

### Filter (Categorical)

The filter data move can also be used with categorical features to select rows that match specific category labels. This allows us to focus on particular groups within the data, for example by filtering a college dataset to show only public institutions or only schools located in the Midwest region. By narrowing the dataset to one or more categories, we can compare groups more directly or analyze patterns specific to that subset. Filtering by category is especially useful for making group-specific summaries or visualizations.

**Question 10.** Filter the `colleges` table to keep only the rows where the highest degree offered is a certificate.

In [None]:
...

## Submission

Make sure that all cells in your assignment have been executed to display all output, images, and graphs in the final document.

**Note:** Save the assignment before proceeding to download the file.

After downloading, locate the `.ipynb` file and upload **only** this file to Moodle. The assignment will be automatically submitted to Gradescope for grading.