# Class 4: Reading/Writing Data and Mini-Project Wrap-Up

**Objective**: Learn to read/write CSV files with pandas, combine NumPy, pandas, and Matplotlib skills, and complete the Iris dataset mini-project.

**Topics**:
- Reading CSV files with `pd.read_csv()`
- Writing CSV files with `df.to_csv()`
- Combining NumPy, pandas, and Matplotlib for data analysis
- Best practices: checking data integrity, saving plots
- Mini-project: Analyze and visualize the Iris dataset

This notebook includes examples, exercises, and the final mini-project. Run the code, try the exercises, and let’s finish strong!

## 1. Setup

We’ll use pandas for data I/O, NumPy for calculations, and Matplotlib for visualization. Run the cell below to import everything.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Ensure plots display in the notebook
%matplotlib inline

## 2. Reading CSV Files

pandas’ `pd.read_csv()` loads data from CSV files into DataFrames. You can specify headers, skip rows, or handle missing values.
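
The options mentioned above can be sketched as follows. This is a minimal illustration, not part of the class data: the CSV text, the junk first line, and the `?` placeholder are all made up, and `io.StringIO` stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical CSV text: one junk line, then a header row, then data.
# "?" and "NA" stand in for missing values.
raw = """exported by demo tool
name,score
Alice,85
Bob,?
Charlie,NA
"""

df = pd.read_csv(
    io.StringIO(raw),  # a file path such as 'scores.csv' is passed the same way
    skiprows=1,        # skip the junk line above the header
    header=0,          # the first remaining line holds the column names
    na_values=["?"],   # treat "?" as missing, in addition to pandas' defaults
)

print(df)
print("\nMissing values per column:\n", df.isna().sum())
```

Because the `score` column now contains missing values, pandas stores it as floats rather than integers, which is worth remembering when you check `df.dtypes`.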

### Example 1: Loading a Sample CSV

Let’s create a small DataFrame, write it to `sample.csv`, and read it back. (We’ll use Iris later.)

In [None]:
# Build a small DataFrame and write it to disk as sample.csv
sample_data = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'score': [85, 90, 78]
})
sample_data.to_csv('sample.csv', index=False)

# Read it back
df_sample = pd.read_csv('sample.csv')
print("Sample DataFrame:\n", df_sample)

# Check data types
print("\nData types:\n", df_sample.dtypes)

**Quick Check**: Why do we use `index=False` when saving? (Hint: It avoids saving the DataFrame index as a column.)
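
To see the effect for yourself, here is a small sketch that writes the same DataFrame with and without the index and compares the columns that come back. It uses in-memory buffers (`io.StringIO`) in place of files, and the variable names are only illustrative.

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "score": [85, 90]})

# Default behaviour (index=True): the row index is written as an extra, unnamed column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the real columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print("Saved with index:   ", cols_with_index)
print("Saved without index:", cols_without_index)
```

The first list contains an extra `Unnamed: 0` column, which is the saved index read back as data.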

## 3. Writing CSV Files

You can save DataFrames to CSV with `df.to_csv()` for sharing or later use.

### Example 2: Modifying and Saving a CSV

In [None]:
# Add a new column
df_sample['score_squared'] = df_sample['score'] ** 2
print("Modified DataFrame:\n", df_sample)

# Save to a new CSV
df_sample.to_csv('sample_modified.csv', index=False)
print("\nSaved to 'sample_modified.csv'. Check your folder!")

## 4. Combining Skills

Let’s combine NumPy, pandas, and Matplotlib to analyze data and visualize results.

### Example 3: Analysis and Visualization Workflow

In [None]:
# Load the Iris dataset from scikit-learn (or read your own iris.csv with pd.read_csv)
from sklearn.datasets import load_iris
iris = load_iris()
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
# Map the numeric targets (0, 1, 2) to their species names
df_iris['species'] = iris.target_names[iris.target]

# NumPy: Compute mean petal length
mean_petal_length = np.mean(df_iris['petal length (cm)'])
print("Mean petal length:", mean_petal_length)

# pandas: Filter for setosa species
df_setosa = df_iris[df_iris['species'] == 'setosa']
print("\nSetosa rows (first 5):\n", df_setosa.head())

# Matplotlib: Plot petal length distribution
plt.hist(df_iris['petal length (cm)'], bins=20, color='skyblue', edgecolor='black')
plt.title('Petal Length Distribution')
plt.xlabel('Petal Length (cm)')
plt.ylabel('Frequency')
plt.show()

## 5. Best Practices

- **Check Data Integrity**: Use `df.info()` or `df.isna().sum()` to ensure data loaded correctly.
- **File Paths**: Double-check paths when reading/writing CSVs (e.g., `'data/iris.csv'` if in a subfolder).
- **Save Plots**: Call `plt.savefig()` before `plt.show()`. Once `plt.show()` has run, the figure may already be closed, so saving afterward can produce a blank file.
- **Comment Code**: Explain your steps for clarity (e.g., “Filtering for versicolor”).
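
The practices above might look like this in code. This is a sketch on a tiny made-up DataFrame rather than the Iris data; the column names and the output file name are hypothetical, and the plot is written to a temporary folder so the example does not clutter your working directory.

```python
import os
import tempfile

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data with one missing value
df = pd.DataFrame({
    "width": [3.5, 3.0, np.nan, 3.1],
    "label": ["a", "b", "c", "d"],
})

# Check data integrity: column types, non-null counts, and missing values
df.info()
missing = df.isna().sum()
print("\nMissing values per column:\n", missing)

# File paths: build the path explicitly so it is easy to double-check
out_path = os.path.join(tempfile.mkdtemp(), "width_hist.png")  # hypothetical file name

# Save plots: write the file first, then display or close the figure
fig, ax = plt.subplots()
ax.hist(df["width"].dropna(), bins=5, color="skyblue", edgecolor="black")
ax.set_title("Width Distribution")
fig.savefig(out_path)
plt.close(fig)  # in a notebook, plt.show() would go here, after saving

print("Plot saved:", os.path.exists(out_path))
```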

## Exercises

Practice reading, writing, and combining skills with these exercises. Write your code in the provided cells.

**Exercise 1**: Read the Iris dataset into a DataFrame (use `iris.csv` or `df_iris` from above). Add a new column `petal_area`, computed by multiplying the `petal length (cm)` column by the `petal width (cm)` column. Save the result to `iris_with_area.csv`.

In [None]:
# Your code here



**Exercise 2**: Filter the Iris DataFrame for rows where `sepal length (cm)` is greater than 6.0. Save the filtered DataFrame to `large_sepals.csv`.

In [None]:
# Your code here



**Exercise 3**: Create a histogram of `sepal width (cm)` from the Iris dataset. Save the plot as `sepal_width_hist.png`.

In [None]:
# Your code here



**Exercise 4**: Using NumPy, compute the standard deviation of `petal length (cm)` for the `versicolor` species in the Iris dataset. Print the result.
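
One detail worth knowing before you start (a hint, not a solution): NumPy and pandas use different defaults for standard deviation. `np.std()` divides by `n` (`ddof=0`), while the pandas `.std()` method divides by `n - 1` (`ddof=1`), so the two can disagree on the same data. The sketch below shows the difference on the three sample scores from Example 1 rather than on the Iris data.

```python
import numpy as np
import pandas as pd

scores = pd.Series([85, 90, 78])  # the sample scores from Example 1
values = scores.to_numpy()

population_std = np.std(values)         # NumPy default: ddof=0, divide by n
sample_std_np = np.std(values, ddof=1)  # divide by n - 1 instead
sample_std_pd = scores.std()            # pandas default: ddof=1

print("NumPy, ddof=0: ", population_std)
print("NumPy, ddof=1: ", sample_std_np)
print("pandas default:", sample_std_pd)
```

Either convention is fine for this exercise as long as you know which one you are reporting.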

In [None]:
# Your code here



## Mini-Project: Iris Dataset Analysis

Let’s complete our mini-project! We’ll analyze the Iris dataset, create a scatter plot colored by species, and save our results.

**Tasks**:
1. Load the Iris dataset (use `df_iris` or `iris.csv`).
2. Filter for two species: `versicolor` and `virginica`.
3. Create a scatter plot of `petal length (cm)` vs. `petal width (cm)`, with different colors for each species.
4. Add a title, labels, legend, and grid.
5. Save the plot as `iris_species_scatter.png`.
6. Save the filtered DataFrame (versicolor and virginica only) to `iris_filtered.csv`.

In [None]:
# Your code here
# Step 1: Ensure df_iris is loaded (already done above, or uncomment below)
# df_iris = pd.read_csv('iris.csv')

# Step 2: Filter for versicolor and virginica
df_filtered = df_iris[df_iris['species'].isin(['versicolor', 'virginica'])]

# Steps 3-4: Scatter plot, one color per species
for species, color in [('versicolor', 'blue'), ('virginica', 'red')]:
    subset = df_filtered[df_filtered['species'] == species]
    plt.scatter(subset['petal length (cm)'], subset['petal width (cm)'],
                color=color, label=species.capitalize(), alpha=0.6)
plt.title('Petal Length vs. Petal Width by Species')
plt.xlabel('Petal Length (cm)')
plt.ylabel('Petal Width (cm)')
plt.legend()
plt.grid(True)

# Step 5: Save plot
plt.savefig('iris_species_scatter.png')
plt.show()

# Step 6: Save filtered DataFrame
df_filtered.to_csv('iris_filtered.csv', index=False)
print("Saved filtered data to 'iris_filtered.csv'")

**Reflect**: Does the scatter plot show clear separation between the two species? Which feature separates them more clearly, petal length or petal width? Share your thoughts if asked!

**Optional Challenge**: Compute the mean `petal length (cm)` for each species in `df_filtered` and visualize them in a bar plot. Save it as `species_means.png`.

In [None]:
# Optional: Try it here



## Wrap-Up

Congratulations! You’ve learned how to:
- Read and write CSV files with pandas.
- Combine NumPy, pandas, and Matplotlib for data analysis.
- Follow best practices for data workflows.
- Complete the Iris mini-project with analysis and visualization.

Save this notebook, your CSV files (`iris_with_area.csv`, `large_sepals.csv`, `iris_filtered.csv`), and plots (`sepal_width_hist.png`, `iris_species_scatter.png`). Share them with the instructor if requested. Great work finishing Week 3!