# Class 2: Introduction to pandas

**Objective**: Learn how to use pandas to manage datasets, including creating Series and DataFrames, loading data, and performing basic cleaning.

**Topics**:
- What is pandas?
- Series: creating and basic operations
- DataFrames: creating, selecting, filtering
- Loading datasets with `pd.read_csv()`
- Basic data cleaning (missing values, data types)

This notebook includes explanations, examples, and exercises to get you comfortable with pandas. Follow along, run the code, and try the exercises! We'll also start working with the Iris dataset for our mini-project.

## 1. What is pandas?

pandas is an open-source Python library for data analysis. It works like a programmable spreadsheet: you can load, clean, reshape, and explore tabular datasets in code. That makes it a common first step when preparing data for AI projects.

Let’s import pandas (and NumPy, which pandas is built on and which we’ll use later for `np.nan`). Run the cell below:

In [None]:
import pandas as pd
import numpy as np

## 2. Series

A Series is a 1D array (like a column or list) with labels (an index). Think of it as a single column of data.

### Example 1: Creating and Using a Series

In [None]:
# Create a Series from a list
scores = pd.Series([85, 90, 78, 92], index=['Alice', 'Bob', 'Charlie', 'Diana'])
print("Scores Series:\n", scores)

# Access by index
print("Bob’s score:", scores['Bob'])

# Basic operation: compute mean
print("Average score:", scores.mean())

**Quick Check**: How would you get Diana’s score? (Hint: Try `scores['Diana']` in your head!)
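The "basic operations" on a Series go a bit beyond `.mean()`. The cell below is a small sketch of three of them: label versus position access, element-wise arithmetic, and filtering with a condition. It rebuilds the `scores` Series from Example 1 so it runs on its own; the 5-point bonus is a made-up value for illustration only.

```python
import pandas as pd

# Same data as Example 1, rebuilt so this cell is self-contained
scores = pd.Series([85, 90, 78, 92], index=['Alice', 'Bob', 'Charlie', 'Diana'])

# .loc looks up by label, .iloc looks up by position (starting at 0)
charlie_by_label = scores.loc['Charlie']
charlie_by_position = scores.iloc[2]
print("Charlie by label:", charlie_by_label)
print("Charlie by position:", charlie_by_position)

# Arithmetic applies to every element at once (hypothetical 5-point bonus)
curved = scores + 5
print("\nCurved scores:\n", curved)

# A comparison gives a True/False Series, which can be used as a filter
high_scores = scores[scores > 80]
print("\nScores above 80:\n", high_scores)
```

Both lookups return the same value here because `'Charlie'` is the third label. Using `.loc` and `.iloc` explicitly makes it clear which kind of lookup you intend.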

## 3. DataFrames

A DataFrame is a 2D table (like a spreadsheet) with rows and columns. It’s the main way to work with datasets in pandas.

### Example 2: Creating and Exploring a DataFrame

In [None]:
# Create a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22],
    'Score': [85, 90, 78]
}
df = pd.DataFrame(data)
print("DataFrame:\n", df)

# Show first few rows
print("\nFirst 2 rows:\n", df.head(2))

# Select a column
print("\nScores column:\n", df['Score'])
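Besides picking a column, it often helps to check a DataFrame's size and to pull out individual rows or cells. The following sketch rebuilds the small `df` from Example 2 and shows one way to do that with `.shape`, `.iloc`, and `.loc`. It assumes the default integer index (0, 1, 2) that pandas creates when no index is given.

```python
import pandas as pd

# Same data as Example 2, rebuilt so this cell is self-contained
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22],
    'Score': [85, 90, 78]
})

# Size of the table as (rows, columns), plus the column names
print("Shape:", df.shape)
print("Columns:", df.columns.tolist())

# One row by position: the result is a Series holding that row's values
first_row = df.iloc[0]
print("\nFirst row:\n", first_row)

# One cell by row label and column name (row labels default to 0, 1, 2)
bob_score = df.loc[1, 'Score']
print("\nBob's score:", bob_score)

# Rows that satisfy a condition
older = df[df['Age'] > 23]
print("\nPeople older than 23:\n", older)
```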

## 4. Loading Datasets

Much real-world data is stored in files such as CSVs (comma-separated values). pandas reads them into a DataFrame with `pd.read_csv()`.
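If you don't have a CSV file handy, you can still see how `pd.read_csv()` behaves. The sketch below feeds it a few lines of made-up CSV text through `io.StringIO`, which stands in for a file on disk. The column names and values are hypothetical and exist only for this demonstration.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """name,age,score
Alice,25,85
Bob,30,90
Charlie,22,78
"""

# pd.read_csv accepts a file path or any file-like object
df_from_csv = pd.read_csv(io.StringIO(csv_text))

# The first line becomes the column names
print(df_from_csv)

# pandas infers a data type for each column from its contents
print("\nColumn types:\n", df_from_csv.dtypes)
```

With a real file you would pass its path instead, for example `pd.read_csv('iris.csv')`.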

### Example 3: Loading the Iris Dataset

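Let’s load the Iris dataset (measurements of iris flowers). If you have `iris.csv`, load it with `pd.read_csv()`. Otherwise, we’ll build the same table from the copy bundled with scikit-learn (`sklearn`), which must be installed for Option 2 to run.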

In [None]:
# Option 1: Load from iris.csv (uncomment if you have the file)
# df_iris = pd.read_csv('iris.csv')

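# Option 2: Create from sklearn
from sklearn.datasets import load_iris
iris = load_iris()
# iris.data holds the measurements; iris.feature_names holds the column names
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
# iris.target stores each species as a code (0, 1, 2);
# indexing target_names with those codes converts them to species names
df_iris['species'] = iris.target_names[iris.target]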

# Show first 5 rows
print("Iris DataFrame:\n", df_iris.head())

# Basic info
print("\nDataset info:")
df_iris.info()

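**Note**: Built from `sklearn`, the Iris DataFrame has 150 rows and five columns: `sepal length (cm)`, `sepal width (cm)`, `petal length (cm)`, `petal width (cm)`, and the `species` column we added. If you loaded your own `iris.csv`, its column names may differ, so check `df_iris.columns` before running the later cells. We’ll use this dataset for our mini-project.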

## 5. Filtering Data

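You can filter rows in a DataFrame using conditions, like picking rows where a value is above a threshold. A condition such as `df_iris['petal length (cm)'] > 5` produces a Series of `True`/`False` values, and passing it inside `df_iris[...]` keeps only the rows marked `True`.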

### Example 4: Filtering the Iris Dataset

In [None]:
# Filter for petal length > 5 cm
long_petals = df_iris[df_iris['petal length (cm)'] > 5]
print("Flowers with petal length > 5 cm:\n", long_petals.head())

# Select specific columns
petal_data = df_iris[['petal length (cm)', 'petal width (cm)']]
print("\nPetal columns:\n", petal_data.head())
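Filters can also combine more than one condition. The sketch below shows one way to do this with `&` (and) and `|` (or), and how `.loc` can filter rows and select columns in a single step. To keep the cell self-contained it uses a small hand-made table called `flowers`; the values are illustrative, not real Iris rows.

```python
import pandas as pd

# Small hand-made sample (illustrative values, not real Iris rows)
flowers = pd.DataFrame({
    'petal length (cm)': [1.4, 4.7, 5.1, 6.0, 5.6],
    'petal width (cm)': [0.2, 1.4, 1.9, 2.5, 1.8],
    'species': ['setosa', 'versicolor', 'virginica', 'virginica', 'virginica']
})

# AND: wrap each condition in parentheses and join them with &
long_and_wide = flowers[
    (flowers['petal length (cm)'] > 5) & (flowers['petal width (cm)'] > 1.8)
]
print("Long AND wide petals:\n", long_and_wide)

# OR: join conditions with |
setosa_or_long = flowers[
    (flowers['species'] == 'setosa') | (flowers['petal length (cm)'] > 5.5)
]
print("\nSetosa OR very long petals:\n", setosa_or_long)

# .loc takes a row condition and a list of columns together
long_species = flowers.loc[
    flowers['petal length (cm)'] > 5, ['species', 'petal length (cm)']
]
print("\nSpecies and length for long petals:\n", long_species)
```

The parentheses matter: `&` and `|` bind more tightly than `>` in Python, so leaving them out raises an error or gives an unintended result. The same patterns should work on `df_iris` once it is loaded.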

## 6. Basic Data Cleaning

Real datasets often have issues like missing values, which pandas shows as `NaN`. You can find them with `isna()` and remove the affected rows with `dropna()`.

### Example 5: Handling Missing Values

The Iris dataset is clean, so let’s simulate a missing value.

In [None]:
# Create a small DataFrame with a missing value
df_test = pd.DataFrame({
    'petal_length': [5.1, np.nan, 4.9, 5.0],
    'species': ['setosa', 'setosa', 'setosa', 'setosa']
})
print("DataFrame with missing value:\n", df_test)

# Check for missing values
print("\nMissing values:\n", df_test.isna())

# Drop rows with missing values
df_clean = df_test.dropna()
print("\nAfter dropping missing values:\n", df_clean)
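Dropping rows is not the only option, and the topic list also mentions data types. The sketch below rebuilds `df_test` from Example 5 and shows one possible alternative: count the missing values, fill the gap with the column mean, and then convert the `species` column to the `category` type. Filling with the mean is just one illustrative choice; whether it suits a real dataset depends on the data.

```python
import numpy as np
import pandas as pd

# Same data as Example 5, rebuilt so this cell is self-contained
df_test = pd.DataFrame({
    'petal_length': [5.1, np.nan, 4.9, 5.0],
    'species': ['setosa', 'setosa', 'setosa', 'setosa']
})

# Count missing values per column (True counts as 1 when summed)
missing_counts = df_test.isna().sum()
print("Missing values per column:\n", missing_counts)

# .mean() skips NaN by default, so it averages the three known values
mean_length = df_test['petal_length'].mean()

# fillna returns a new DataFrame; df_test itself is left unchanged
df_filled = df_test.fillna({'petal_length': mean_length})
print("\nAfter filling with the column mean:\n", df_filled)

# Check the data types, then store the repeated species text as a category
print("\nTypes before:\n", df_filled.dtypes)
df_filled['species'] = df_filled['species'].astype('category')
print("\nTypes after:\n", df_filled.dtypes)
```

Unlike `dropna()`, this keeps all four rows, which can matter when a dataset is small.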

## Exercises

Time to practice! Complete the exercises below using pandas. Write your code in the provided cells and run them to check your work.

**Exercise 1**: Create a Series from the list `[10, 20, 30, 40]` with index `['a', 'b', 'c', 'd']`. Compute its mean.

In [None]:
# Your code here



**Exercise 2**: Using the Iris DataFrame (`df_iris`), select the `sepal length (cm)` and `species` columns. Show the first 3 rows.

In [None]:
# Your code here



**Exercise 3**: Filter the Iris DataFrame to show only rows where `petal width (cm)` is greater than 2.0. How many rows are returned?

In [None]:
# Your code here



**Exercise 4**: Simulate a missing value in the `petal length (cm)` column of the first row of `df_iris` (make a copy to avoid changing the original). Then drop rows with missing values.

In [None]:
# Your code here



## Mini-Project Progress

For our mini-project, we’re analyzing the Iris dataset. You’ve loaded it and explored its columns. Now, let’s prepare data for visualization (next class).

**Task**: Select the `petal length (cm)` and `petal width (cm)` columns from `df_iris` and save them to a new DataFrame called `df_petals`. Print its first 5 rows.

In [None]:
# Starter solution (try writing it yourself first)
df_petals = df_iris[['petal length (cm)', 'petal width (cm)']]
print(df_petals.head())

**Think Ahead**: What might a scatter plot of petal length vs. petal width look like? Could it help separate species? We’ll find out in Class 3!

**Optional Challenge**: Compute the average `petal length (cm)` for the `setosa` species. (Hint: Filter by species, then use `.mean()`.)

In [None]:
# Optional: Try it here



## Wrap-Up

Awesome work! You’ve learned how to:
- Create and use pandas Series.
- Build and explore DataFrames.
- Load datasets with `pd.read_csv()`, or build one from `sklearn` as we did with Iris.
- Filter data and handle missing values.
- Start preparing data for our mini-project.

Save this notebook and share your exercise results if asked. Next class, we’ll visualize data with Matplotlib!