# Python Quick Start for Supervised Learning

This notebook provides a minimal foundation in Python, NumPy, and Pandas for students following along with INSY 7120. It is not a comprehensive introduction to Python programming - for that, see INSY 5/6500.

The goal is to give you just enough context to understand what's happening in our scikit-learn notebooks.

## Python Essentials

Everything in Python is an *object*, and every object has a type and a value. The core object types in Python are:

| Type | What it holds | Example |
|------|---------------|---------|
| `int` | Whole numbers | `42` |
| `float` | Decimal numbers | `3.14` |
| `str` | Text (sequence of characters) | `"hello"` |
| `list` | Ordered, mutable sequence | `[1, 2, 3]` |
| `tuple` | Ordered, immutable sequence | `(1, 2, 3)` |
| `set` | Unordered collection of unique values | `{1, 2, 3}` |
| `dict` | Key-value mappings | `{"a": 1, "b": 2}` |
| `bool` | Logical values | `True`, `False` |
| `NoneType` | Absence of value | `None` |

*Mutable* objects can be changed after creation (e.g., append to a list). *Immutable* objects cannot - any "change" creates a new object.
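
A minimal sketch of the difference, using illustrative variable names that do not appear elsewhere in this notebook:

```python
# Lists are mutable: append changes the existing object in place
numbers = [1, 2, 3]
list_id_before = id(numbers)
numbers.append(4)
same_list = id(numbers) == list_id_before   # still the same object

# Tuples are immutable: item assignment raises TypeError
point = (1, 2, 3)
try:
    point[0] = 99
    tuple_changed = True
except TypeError:
    tuple_changed = False

# Strings are immutable too: "changing" one builds a new object
greeting = "hello"
shouted = greeting.upper()

print(numbers)            # [1, 2, 3, 4]
print(same_list)          # True
print(tuple_changed)      # False
print(greeting, shouted)  # hello HELLO
```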

### Program Structure

Python code is organized in a hierarchy:

| Level | Description |
|-------|-------------|
| Modules / Packages | Reusable blocks of functionality, composed of... |
| Scripts / Notebooks | Programs that accomplish a task, composed of... |
| Functions | Single-purpose tools (verbs), composed of... |
| Statements | Instructions that change internal or external state, composed of... |
| Expressions | Any combination of values, variables, operators, and function calls that evaluates to an object |

For example, `x * 2 + 1` is an expression. `y = x * 2 + 1` is a statement (it assigns the result to `y`). A function groups statements into a reusable tool. A script or notebook combines functions and statements to accomplish something. A module packages code for others to import.
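
The same hierarchy in a few lines of code. The function name `double_plus_one` is made up for this illustration:

```python
x = 5            # statement: assigns the value 5 to x
y = x * 2 + 1    # statement; its right-hand side is an expression


def double_plus_one(value):
    """Function: wraps the expression so it can be reused."""
    return value * 2 + 1


# A comprehension is itself an expression that calls the function repeatedly
results = [double_plus_one(v) for v in [1, 2, 3]]

print(y)        # 11
print(results)  # [3, 5, 7]
```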

### Classes

Python includes many object types beyond those in the table above, but the ones listed are the primary building blocks. The formal term for type is *class* (i.e., a class of object). If functions are Python's verbs, classes are its nouns.

Classes bundle variables and functions relevant to objects of that type, accessed using *dot notation*. The variables are called *attributes* and hold data (like `arr.shape`). The functions are called *methods* (like `df.head()`).

The entire Python ecosystem - including NumPy, Pandas, and scikit-learn - is built on this foundation.
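
Dot notation works the same way on built-in objects. This short sketch uses a complex number and a string, chosen only for illustration, to show an attribute next to a method:

```python
z = 3 + 4j              # an object of class complex

print(type(z))          # <class 'complex'>
print(z.real)           # attribute (no parentheses): 3.0
print(z.conjugate())    # method (called with parentheses): (3-4j)

name = "pandas"
print(name.upper())     # string method: PANDAS
```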

## Why NumPy?

Python is a general-purpose language. It's great for many tasks, but numerical computation with native Python data structures (like lists) is slow.

To use NumPy, we first import it. By convention, it's aliased as `np`.

In [None]:
import numpy as np

Consider a simple task: double every value in a collection of 1 million numbers.

In [None]:
# Create a list with 1 million elements
my_list = list(range(1_000_000))

Using a list comprehension (Python's concise loop syntax):

In [None]:
%timeit my_list2 = [x * 2 for x in my_list]

A list comprehension is shorthand for the following loop:

```python
my_list2 = []
for x in my_list:
    my_list2.append(x * 2)
```

Both approaches process one element at a time in the Python interpreter, which adds overhead to every step - that's why they're slow.

Now compare with NumPy:

In [None]:
my_array = np.array(my_list)  # convert Python list into NumPy array
%timeit my_array2 = my_array * 2

Typical results: 12.3 *milliseconds* (Python list) vs 469 *microseconds* (NumPy array) - about 26x faster (1 ms = 1000 μs, so 12,300 / 469 ≈ 26.2x). Exact timings vary by machine.

NumPy is typically 20-100x faster for numerical operations. Why?

1. Homogeneous types: Every element in a NumPy array has the same data type, so the computer knows exactly how much memory each element needs
2. Contiguous memory: Elements are stored next to each other in memory, making access predictable and fast
3. Vectorized operations: Operations happen in optimized C code, not Python loops

The key insight: Think in terms of array operations, not element-by-element loops.
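
The three properties above can be inspected directly on an array. This is a small sketch; the dtype is set explicitly so the element size is predictable:

```python
import numpy as np

arr = np.arange(5, dtype=np.int64)   # five 64-bit integers: 0, 1, 2, 3, 4

print(arr.dtype)                  # int64: one type shared by every element
print(arr.itemsize)               # 8 bytes per element
print(arr.nbytes)                 # 40 bytes in total (5 * 8)
print(arr.flags["C_CONTIGUOUS"])  # True: stored as one unbroken block
print(arr * 2)                    # [0 2 4 6 8], computed without a Python loop
```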

## NumPy Essentials

### Creating Arrays

Use `np.array()` to convert Python lists into NumPy arrays.

In [None]:
# 1D array from a list (like a table row)
row = [1, 2, 3, 4, 5]
arr_1d = np.array(row)
print(arr_1d)
print(type(arr_1d))

In [None]:
# 2D array from a nested list (like a table)
table = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
]
arr_2d = np.array(table)
print(arr_2d)

Notice that printed NumPy arrays display without commas between elements - this helps visually distinguish them from Python lists.

### Shape and Dimensions

Every array has a shape (the size of each axis) and a data type.

In [None]:
print("1D array:")
print("  shape:", arr_1d.shape)   # (5,) - one axis with 5 elements
print("  ndim:", arr_1d.ndim)     # 1 dimension
print("  dtype:", arr_1d.dtype)   # int64 on most platforms

print("\n2D array:")
print("  shape:", arr_2d.shape)   # (3, 3) - 3 rows, 3 columns
print("  ndim:", arr_2d.ndim)     # 2 dimensions
print("  dtype:", arr_2d.dtype)

Understanding shape is critical for scikit-learn. The shape tells you:
- How many samples (rows) you have
- How many features (columns) you have
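
As a sketch, unpacking the shape tuple gives both counts at once. The feature matrix here holds made-up numbers:

```python
import numpy as np

# Hypothetical feature matrix: 4 samples (rows), 2 features (columns)
X = np.array([
    [5.1, 3.5],
    [4.9, 3.0],
    [6.2, 2.9],
    [5.9, 3.2],
])

n_samples, n_features = X.shape   # unpack the shape tuple
print(n_samples, n_features)      # 4 2
```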

### Data Types

NumPy arrays are *homogeneous* - all elements must be the same type.

In [None]:
int_arr = np.array([1, 2, 3])
float_arr = np.array([1., 2., 3.])
text_arr = np.array(['a', 'b', 'c'])

print("Integer array:", int_arr.dtype)   # int64 on most platforms
print("Float array:", float_arr.dtype)   # float64
print("Text array:", text_arr.dtype)     # Unicode string

Mixed types get converted automatically (sometimes in surprising ways).

Mixing integers and floats upcasts to float:

In [None]:
mixed = np.array([1, 2, 3.5])
print("Mixed int/float:", mixed.dtype, mixed)

Mixing numbers and text converts everything to text:

In [None]:
mixed_text = np.array([1, 2, 'text'])
print("Mixed with text:", mixed_text.dtype, mixed_text)

### Indexing and Slicing

For 1D arrays, indexing works just like Python lists.

In [None]:
print("First element:", arr_1d[0])
print("Last element:", arr_1d[-1])
print("First three:", arr_1d[0:3])
print("Every other:", arr_1d[::2])

For 2D arrays, NumPy uses `[row, col]` syntax - both indices in a single set of brackets, separated by a comma. This is different from base Python, where you'd chain brackets: `list[row][col]`.

In [None]:
print(arr_2d)
print("Element at row 1, col 1:", arr_2d[1, 1])
print("First row:", arr_2d[0])

Each position can also contain a slice using standard Python `start:stop:step` notation. A lone `:` means "all" along that axis.

So `arr[:, 0]` means "all rows, column 0" - an easy way to select a column. In base Python, you'd need a loop to extract a column from a list of lists.

In [None]:
print("First column:", arr_2d[:, 0])

Slices work in either position:

In [None]:
print("First two rows:\n", arr_2d[:2])
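
A few more combinations, sketched on the same 3x3 values, show slices in the row position, the column position, or both:

```python
import numpy as np

arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

top_right = arr_2d[:2, 1:]   # rows 0-1, columns 1-2
row_step = arr_2d[1, ::2]    # row 1, every other column
col_step = arr_2d[::2, 0]    # every other row, column 0

print(top_right)   # [[2 3]
                   #  [5 6]]
print(row_step)    # [4 6]
print(col_step)    # [1 7]
```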

### Vectorized Operations

*Vectorized operations* apply to all elements at once, without explicit loops. This is the key to writing fast NumPy code.

Arithmetic on every element:

In [None]:
arr = np.array([1, 2, 3, 4, 5])

print("Double:", arr * 2)
print("Squared:", arr ** 2)
print("Add 10:", arr + 10)

In base Python, you'd need a loop for this. For 2D arrays, you'd need nested loops - one for rows, one for columns:

```python
result = []
for row in table:
    new_row = []
    for val in row:
        new_row.append(val * 2)
    result.append(new_row)
```

Aggregations (sum, mean, etc.) over nested lists also need loops in base Python. NumPy replaces all of this with concise, fast operations.
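
For comparison, here is a sketch of the same doubling and a grand total done both ways. The variable names are illustrative only:

```python
import numpy as np

table = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Base Python: nested comprehension to double, nested sum to total
doubled_list = [[val * 2 for val in row] for row in table]
total_list = sum(sum(row) for row in table)

# NumPy: one expression each
arr = np.array(table)
doubled_arr = arr * 2
total_arr = arr.sum()

print(doubled_arr.tolist() == doubled_list)  # True
print(total_list, total_arr)                 # 45 45
```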

Operations between arrays are element-wise:

In [None]:
arr2 = np.array([10, 20, 30, 40, 50])
print("Sum:", arr + arr2)
print("Product:", arr * arr2)

Aggregation methods reduce an array to a single value:

In [None]:
print("Sum of all:", arr.sum())
print("Mean:", arr.mean())
print("Max:", arr.max())

### Boolean Indexing

Comparisons produce arrays of True/False values. These boolean arrays can be used to filter data - only elements where the condition is True are selected.

In [None]:
arr = np.array([1, 5, 3, 8, 2, 9, 4])

print("Greater than 4:", arr > 4)
print("Values > 4:", arr[arr > 4])

### Quick Reference

| Attribute/Method | What it does | Example |
|------------------|--------------|---------|
| `.shape` | Dimensions as tuple | `(3, 4)` = 3 rows, 4 cols |
| `.ndim` | Number of dimensions | `2` |
| `.dtype` | Data type of elements | `int64`, `float64` |
| `.sum()` | Sum of all elements | |
| `.mean()` | Average of elements | |
| `.max()`, `.min()` | Extreme values | |

## Pandas Essentials

Pandas is built on NumPy but designed for *tabular data* - the kind of data you'd see in a spreadsheet or database table.

To use Pandas, we import it. By convention, it's aliased as `pd`.

In [None]:
import pandas as pd

### Observations and Attributes

Tabular data has a fundamental structure:
- Rows are observations (samples, records, instances) - each row describes one thing
- Columns are attributes (features, variables, measurements) - each column describes one property

When we look at a table, we naturally read row by row. But analysis is almost always about how attributes vary across observations:
- "What's the average age?" - summarize the age attribute
- "Find everyone over 30" - filter observations by an attribute
- "Does income predict spending?" - relate two attributes across observations

This is why Pandas is *column-based*. It stores and operates on data one column at a time, because that's how we analyze it.

Each column contains values of the same type (all numbers, all text, all dates), so it can be stored as a fast NumPy array. A DataFrame is essentially a collection of these column-arrays that share a row index.

This structure has implications:
- Selecting a column is instant: `df["age"]`
- Column operations are fast: `df["price"] * df["quantity"]`
- Data should be organized so each row is one observation and each column is one attribute (this is called "tidy" data)
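
The three questions above map onto column operations. This sketch uses a small made-up DataFrame; the column names and values are assumptions for illustration:

```python
import pandas as pd

# Hypothetical tidy data: one row per person, one column per attribute
people = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [40_000, 52_000, 81_000, 90_000],
    "spending": [30_000, 35_000, 50_000, 55_000],
})

mean_age = people["age"].mean()                        # summarize one attribute
over_30 = people[people["age"] > 30]                   # filter observations
income_vs_spending = people["income"].corr(people["spending"])  # relate two attributes
savings = people["income"] - people["spending"]        # column arithmetic, row by row

print(mean_age)          # 38.75
print(len(over_30))      # 3
print(income_vs_spending)
print(savings.tolist())  # [10000, 17000, 31000, 35000]
```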

### The Mental Model

Think of a Pandas DataFrame as a dictionary of named NumPy arrays that share a common row index.

In [None]:
# Create a DataFrame from a dictionary
data = {
    "state": ["Ohio", "Ohio", "Nevada", "Nevada"],
    "year": [2000, 2001, 2001, 2002],
    "pop": [1.5, 1.7, 2.4, 2.9]
}
df = pd.DataFrame(data)
print(df)

Each column is a Series (a 1D labeled array):

In [None]:
print(type(df["pop"]))
print(df["pop"])

### Loading Data

Most often you'll load data from a file (or a URL or zip archive - the syntax is the same):

In [None]:
url = "https://github.com/ageron/data/raw/main/lifesat/lifesat.csv"
lifesat = pd.read_csv(url)

### Exploring Data

A handful of methods tell you what you're working with.

First few rows:

In [None]:
lifesat.head()

Structure - columns, types, non-null counts:

In [None]:
lifesat.info()

Summary statistics for numeric columns:

In [None]:
lifesat.describe()

Shape as (rows, columns):

In [None]:
lifesat.shape

### Selecting Columns

Access columns by name using bracket notation.

Single column returns a Series (1D):

In [None]:
lifesat["Country"]

Multiple columns returns a DataFrame (2D):

In [None]:
lifesat[["Country", "GDP per capita (USD)"]]

### Selecting Rows and Cells

Pandas offers a variety of ways to access parts of a DataFrame, and best practice has evolved over time. For this course:

- Use brackets for column(s) by name.
  - `df["col"]` → column by name
- Use `.loc[]` or `.iloc[]` for everything else.
  - `.loc[row, col]` → by label
  - `.iloc[row, col]` → by position (integer index)

Select rows by index position:

In [None]:
print("First row:")
print(lifesat.iloc[0])

Select specific cell by label:

In [None]:
print("First country:", lifesat.loc[0, "Country"])

Select a range of rows:

In [None]:
print("Rows 0-2:")
print(lifesat.iloc[0:3])

Two gotchas:

1. Slicing with brackets refers to rows, not columns. This is an exception to the "brackets for columns" rule.
   - `df["a":"c"]` → rows by label
   - `df[0:3]` → rows by position
2. `.loc[]` slices are *inclusive* of the endpoint. `.iloc[]` slices are *exclusive* (like Python lists).
   - `df.loc["a":"c"]` → includes row "c"
   - `df.iloc[0:3]` → rows 0, 1, 2 (not 3)
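
Both gotchas are easier to see on a DataFrame with string row labels. This is a sketch with made-up data:

```python
import pandas as pd

# Hypothetical DataFrame with string row labels
df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

by_label = df.loc["a":"c"]    # label slice: includes "c"
by_position = df.iloc[0:3]    # position slice: stops before position 3
bracket_rows = df[0:2]        # bare bracket slice selects rows, not columns

print(by_label.index.tolist())      # ['a', 'b', 'c']
print(by_position.index.tolist())   # ['a', 'b', 'c']
print(bracket_rows.index.tolist())  # ['a', 'b']
```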

### Filtering Rows

Use boolean expressions to select rows that meet a condition.

In [None]:
wealthy = lifesat[lifesat["GDP per capita (USD)"] > 50000]
print(wealthy)

### Missing Values

Missing data is represented as `NaN` (Not a Number). It propagates through arithmetic - any calculation involving NaN produces NaN, although Pandas aggregations such as `.mean()` skip it by default.

In [None]:
# Create data with missing values
data_with_gaps = pd.DataFrame({
    "A": [1, 2, np.nan, 4],
    "B": [10, np.nan, 30, 40]
})
print(data_with_gaps)

Check for missing values:

In [None]:
print(data_with_gaps.isna().sum())

Pandas aggregations such as `.mean()` skip NaN by default, but plain arithmetic propagates it:

In [None]:
print("Mean of A:", data_with_gaps["A"].mean())  # Pandas handles NaN
print("Manual sum:", 1 + 2 + np.nan + 4)          # NaN propagates

Propagation is actually a feature - it forces you to deal with missing values explicitly. In plain arithmetic, an unhandled NaN turns the result into NaN, making the problem obvious. Pandas aggregations skip NaN by default, so a summary statistic can look fine while resting on incomplete data. Always check for missing data before analysis and decide how to handle it (drop rows, fill with a value, etc.).
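
Two common ways to handle the gaps, sketched on the same small DataFrame. Filling with the column mean is only one possible choice, used here for illustration:

```python
import numpy as np
import pandas as pd

data_with_gaps = pd.DataFrame({
    "A": [1, 2, np.nan, 4],
    "B": [10, np.nan, 30, 40],
})

# Option 1: keep only rows with no missing values
dropped = data_with_gaps.dropna()

# Option 2: fill each gap with the mean of its own column
filled = data_with_gaps.fillna(data_with_gaps.mean())

print(dropped.shape)              # (2, 2)
print(filled.isna().sum().sum())  # 0
```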

You may also encounter `None` (Python's null) or `pd.NA` (Pandas' modern missing value indicator). The advantage of `pd.NA` is that it works with nullable integer and boolean types, whereas `np.nan` is a float and forces type conversion. For most purposes, Pandas handles these consistently - just be aware they exist.
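
A short sketch of the type-conversion difference. `"Int64"` (capital I) is the Pandas nullable integer dtype:

```python
import numpy as np
import pandas as pd

with_nan = pd.Series([1, 2, np.nan])               # NaN forces a float column
with_na = pd.Series([1, 2, pd.NA], dtype="Int64")  # nullable integer keeps ints

print(with_nan.dtype)        # float64
print(with_na.dtype)         # Int64
print(with_na.isna().sum())  # 1
```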

### Quick Reference

| Method | What it does |
|--------|--------------|
| `pd.read_csv(path)` | Load data from CSV file or URL |
| `.head()`, `.tail()` | First/last rows (5 by default) |
| `.info()` | Column types and non-null counts |
| `.describe()` | Summary statistics |
| `.shape` | (rows, columns) tuple |
| `.isna()` | Boolean mask of missing values |
| `df["col"]` | Select column as Series |
| `df[["col"]]` | Select column(s) as DataFrame |
| `.loc[row, col]` | Select by label |
| `.iloc[row, col]` | Select by position |

## The Shape Gotcha

This is one of the most common sources of confusion when starting with scikit-learn.

scikit-learn expects:
- `X` (features) to be 2D with shape `(n_samples, n_features)`
- `y` (target) to be 1D with shape `(n_samples,)`

Even if you have only one feature, `X` must still be 2D.

### The Problem

Single brackets return a Series (1D):

In [None]:
gdp_series = lifesat["GDP per capita (USD)"]
print("Type:", type(gdp_series))
print("Shape:", gdp_series.shape)

Double brackets return a DataFrame (2D):

In [None]:
gdp_df = lifesat[["GDP per capita (USD)"]]
print("Type:", type(gdp_df))
print("Shape:", gdp_df.shape)

### The Solution

When preparing data for scikit-learn, use double brackets and `.values` to get the underlying NumPy array:

In [None]:
X = lifesat[["GDP per capita (USD)"]].values
y = lifesat["Life satisfaction"].values

print("X shape:", X.shape)
print("y shape:", y.shape)
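
If the feature is already a 1D NumPy array rather than a DataFrame column, `reshape` gives the same 2D result. This is a sketch with made-up values:

```python
import numpy as np

feature = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical single feature, 1D

X_2d = feature.reshape(-1, 1)   # -1 lets NumPy infer the number of rows
y_1d = X_2d.ravel()             # flatten back to 1D

print(feature.shape)   # (4,)
print(X_2d.shape)      # (4, 1)
print(y_1d.shape)      # (4,)
```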

### Why Double Brackets?

The bracket notation in Pandas works like this:

- `df["col"]` - returns the column named "col" as a Series
- `df[["col"]]` - returns a DataFrame containing the columns in the list

The inner brackets create a list of column names. When you pass a list to `df[...]`, Pandas always returns a DataFrame.

In [None]:
print(type(lifesat["Country"]))
print(type(lifesat[["Country"]]))

This is also how you select multiple columns:

In [None]:
lifesat[["Country", "GDP per capita (USD)"]].head()

### Quick Reference

| What you write | What you get | Shape |
|----------------|--------------|-------|
| `df["col"]` | Series | `(n,)` |
| `df[["col"]]` | DataFrame | `(n, 1)` |
| `df["col"].values` | 1D NumPy array | `(n,)` |
| `df[["col"]].values` | 2D NumPy array | `(n, 1)` |

For scikit-learn: use double brackets for X and single brackets for y.

## Summary

Why NumPy?
- Python loops are slow; NumPy operations are fast
- Think in terms of array operations, not element-by-element loops

NumPy essentials:
- `.shape` tells you the dimensions: `(rows, cols)`
- `.dtype` tells you the data type
- Use `[row, col]` for 2D indexing, `[:, col]` for columns
- Operations are vectorized - no loops needed

Pandas essentials:
- DataFrame = dictionary of named Series (columns) with shared row index
- `pd.read_csv()` to load data
- `.head()`, `.info()`, `.describe()` to explore
- `df["col"]` for Series, `df[["col"]]` for DataFrame
- `.loc[]` for labels, `.iloc[]` for positions

The shape gotcha:
- scikit-learn needs X to be 2D: `(n_samples, n_features)`
- Use `df[["col"]].values` not `df["col"].values` for X
- Use `.shape` to verify your data dimensions

Now revisit the previous notebook. You followed the logic well enough the first time through - now you can connect the dots on implementation details like data extraction, array shapes, and method calls.