# Getting Hands-on with Pandas: An Iris Dataset Exploration

This notebook provides a practical guide to working with the **Pandas** library in Python, using the famous **Iris dataset**. We'll cover fundamental operations from data loading and exploration to manipulation and basic visualization.

---

## 1. Setup and Data Acquisition

First, we need to import the necessary libraries: **`pandas`** for data manipulation, **`seaborn`** to easily load the Iris dataset, and **`matplotlib.pyplot`** for basic visualizations.

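In a real-world scenario, you would typically load data from a CSV file, an Excel workbook, or a database using `pd.read_csv()`, `pd.read_excel()`, `pd.read_sql()`, etc. Here, we'll use `seaborn`'s `load_dataset()` for convenience. Note that it downloads the data from seaborn's online example-data repository on first use, so it needs an internet connection.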

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np # Will be used later for handling missing data demonstration

# Load the Iris dataset via seaborn (fetched from its online example-data repository)
df_iris = sns.load_dataset('iris')

print("Iris Dataset loaded successfully into a Pandas DataFrame!")
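If the data lived in a CSV file instead, loading it might look like the sketch below. The CSV text here is a hypothetical three-row stand-in for a file on disk, and `df_from_csv` is an illustrative name; `pd.read_csv()` accepts a file path or any file-like object, which is why an in-memory `io.StringIO` works.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

# read_csv infers column names from the header row and dtypes from the values
df_from_csv = pd.read_csv(io.StringIO(csv_text))

print(df_from_csv.dtypes)
print(df_from_csv.shape)
```

With a real file, you would pass its path in place of the `io.StringIO` object.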

---

## 2. Initial Data Exploration (Getting to Know Your Data)

Before diving into analysis, it's crucial to understand the structure, content, and basic statistics of your dataset.

### Display the first few rows
`df.head()` shows the top N (default 5) rows, giving a quick glance at the data.

In [None]:
print("\n--- First 5 rows of the DataFrame ---")
print(df_iris.head())

### Display the last few rows
`df.tail()` shows the bottom N (default 5) rows.

In [None]:
print("\n--- Last 5 rows of the DataFrame ---")
print(df_iris.tail())

### Get a concise summary of the DataFrame
`df.info()` prints a concise summary: the index dtype, each column's dtype and non-null count, and memory usage. It's excellent for quickly spotting missing values and unexpected data types.

In [None]:
print("\n--- DataFrame Info ---")
df_iris.info()

### Get descriptive statistics for numerical columns
`df.describe()` generates descriptive statistics (count, mean, std, min, 25%, 50%, 75%, max) for numerical columns.

In [None]:
print("\n--- Descriptive Statistics for Numerical Columns ---")
print(df_iris.describe())

### Check the number of rows and columns
`df.shape` returns a tuple representing the dimensionality of the DataFrame (rows, columns).

In [None]:
print("\n--- Shape of the DataFrame ---")
print(f"Rows: {df_iris.shape[0]}, Columns: {df_iris.shape[1]}")

### Get the column names
`df.columns` returns an Index object containing the column labels.

In [None]:
print("\n--- Column Names ---")
print(df_iris.columns)

### Check unique values and their counts in a categorical column
`df['column_name'].unique()` returns an array of unique values in a Series.
`df['column_name'].value_counts()` returns a Series containing counts of unique values.

In [None]:
print("\n--- Unique Species in the 'species' column ---")
print(df_iris['species'].unique())

print("\n--- Value Counts for each Species ---")
print(df_iris['species'].value_counts())

---

## 3. Data Selection and Filtering

Pandas provides powerful ways to select specific columns, rows, or subsets of data based on conditions.

### Select a single column (returns a Series)
You can access a column using dictionary-like notation.

In [None]:
print("\n--- 'sepal_length' column (first 5 values) ---")
print(df_iris['sepal_length'].head())

### Select multiple columns (returns a DataFrame)
Pass a list of column names to select multiple columns.

In [None]:
print("\n--- 'sepal_length' and 'petal_length' columns (first 5 rows) ---")
print(df_iris[['sepal_length', 'petal_length']].head())

### Select rows using `.loc[]` (label-based indexing)
`loc` is used for selection by label. You can select rows by their index labels and columns by their column labels.

In [None]:
print("\n--- Rows with index labels 0 to 2 (all columns) using .loc ---")
print(df_iris.loc[0:2]) # Inclusive range for labels

### Select rows using `.iloc[]` (integer-location based indexing)
`iloc` is used for selection by integer position. It behaves like standard Python slicing (exclusive of the end index).

In [None]:
print("\n--- First 3 rows, first 2 columns using .iloc ---")
print(df_iris.iloc[0:3, 0:2]) # Rows 0, 1, 2; Columns 0, 1
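The difference between `.loc` and `.iloc` is easy to miss on the Iris data, because its index labels happen to equal the row positions. The sketch below uses a hypothetical mini-frame (`df_small`) whose index labels are 10, 20, 30, 40 to make the two behaviors visible.

```python
import pandas as pd

# Hypothetical mini-frame whose index labels are NOT 0, 1, 2, ...
df_small = pd.DataFrame(
    {"sepal_length": [5.1, 4.9, 4.7, 4.6]},
    index=[10, 20, 30, 40],
)

# Label-based slice: both endpoints are included (labels 10, 20, 30)
by_label = df_small.loc[10:30]

# Position-based slice: end position is excluded (positions 0 and 1)
by_position = df_small.iloc[0:2]

print(by_label)
print(by_position)
```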

### Filter rows based on a single condition
This creates a boolean Series, which is then used to select rows where the condition is `True`.

In [None]:
print("\n--- Data for 'setosa' species ---")
setosa_df = df_iris[df_iris['species'] == 'setosa']
print(setosa_df.head())

### Filter with multiple conditions
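Combine conditions using `&` (AND) or `|` (OR); Python's `and`/`or` keywords do not work element-wise on a Series. Wrap each condition in parentheses, because `&` and `|` bind more tightly than comparison operators such as `==` and `>`.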

In [None]:
print("\n--- 'virginica' species with petal_length greater than 5.0 ---")
filtered_df = df_iris[(df_iris['species'] == 'virginica') & (df_iris['petal_length'] > 5.0)]
print(filtered_df.head())
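Two alternatives to chained boolean masks are worth knowing: `DataFrame.query()`, which takes the condition as a string, and `Series.isin()`, which tests membership in a list of values. The sketch below uses a small hypothetical frame (`df_demo`) rather than the full Iris data so that it runs on its own.

```python
import pandas as pd

# Hypothetical four-row stand-in for the Iris data
df_demo = pd.DataFrame({
    "species": ["setosa", "virginica", "virginica", "versicolor"],
    "petal_length": [1.4, 5.1, 4.8, 4.7],
})

# Boolean mask, as in the cell above
mask = (df_demo["species"] == "virginica") & (df_demo["petal_length"] > 5.0)
via_mask = df_demo[mask]

# The same filter expressed as a query string
via_query = df_demo.query("species == 'virginica' and petal_length > 5.0")

# Membership test: keep rows whose species is in the given list
setosa_or_versicolor = df_demo[df_demo["species"].isin(["setosa", "versicolor"])]

print(via_mask)
print(via_query)
print(setosa_or_versicolor)
```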

---

## 4. Data Manipulation and Aggregation

Pandas allows for easy creation of new columns, grouping data, and performing aggregate calculations.

### Create a new column
New columns can be created by performing operations on existing columns.

In [None]:
df_iris['sepal_area'] = df_iris['sepal_length'] * df_iris['sepal_width']
print("\n--- DataFrame with new 'sepal_area' column (first 5 rows) ---")
print(df_iris.head())

### Group data and calculate aggregates
`groupby()` is one of the most powerful Pandas features, allowing you to split data into groups based on some criteria and then apply a function (e.g., `mean()`, `sum()`, `count()`) to each group.

In [None]:
print("\n--- Mean values for all numerical columns, grouped by species ---")
print(df_iris.groupby('species').mean())

### Group by a column and apply multiple aggregate functions to a specific column
Use `.agg()` after `groupby()` to apply different aggregate functions to one or more columns.

In [None]:
print("\n--- Petal Length statistics (mean and standard deviation) by species ---")
print(df_iris.groupby('species')['petal_length'].agg(['mean', 'std']))
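To aggregate several columns with different functions in one call, pandas also supports named aggregation, where each keyword argument to `.agg()` names an output column and maps it to a `(column, function)` pair. This is a minimal sketch on a hypothetical four-row frame; the output column names are illustrative choices.

```python
import pandas as pd

# Hypothetical stand-in for the Iris data
df_demo = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "petal_length": [1.4, 1.6, 5.0, 6.0],
    "sepal_width": [3.5, 3.1, 3.0, 2.8],
})

# Each keyword is an output column: name=(source column, aggregate function)
summary = df_demo.groupby("species").agg(
    petal_length_mean=("petal_length", "mean"),
    petal_length_max=("petal_length", "max"),
    sepal_width_min=("sepal_width", "min"),
)

print(summary)
```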

### Sort the DataFrame
`sort_values()` sorts the DataFrame by one or more columns.

In [None]:
print("\n--- DataFrame sorted by 'sepal_length' in descending order (first 5 rows) ---")
print(df_iris.sort_values(by='sepal_length', ascending=False).head())
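Sorting by more than one column works by passing lists to `by` and `ascending`. The sketch below, on a hypothetical frame, sorts species alphabetically and then, within each species, by `sepal_length` from largest to smallest.

```python
import pandas as pd

# Hypothetical stand-in for the Iris data
df_demo = pd.DataFrame({
    "species": ["virginica", "setosa", "setosa", "virginica"],
    "sepal_length": [6.3, 5.1, 4.9, 7.1],
})

# One ascending flag per sort column, in the same order as `by`
sorted_df = df_demo.sort_values(
    by=["species", "sepal_length"],
    ascending=[True, False],
).reset_index(drop=True)  # renumber rows after sorting

print(sorted_df)
```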

### Drop a column
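`drop()` removes the specified rows or columns. Pass `columns=[...]` (equivalent to `axis=1`) to drop columns, or `index=[...]` (equivalent to `axis=0`) to drop rows. `inplace=True` modifies the DataFrame directly instead of returning a new one.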

In [None]:
df_iris.drop(columns=['sepal_area'], inplace=True)
print("\n--- DataFrame after dropping 'sepal_area' column (first 5 rows) ---")
print(df_iris.head())
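The different spellings of `drop()` can be compared side by side. This sketch uses a hypothetical three-column frame and leaves out `inplace=True`, so each call returns a new DataFrame and the original stays untouched.

```python
import pandas as pd

# Hypothetical frame used only to illustrate drop()
df_demo = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Two equivalent ways to drop a column
without_b = df_demo.drop(columns=["b"])
without_b_axis = df_demo.drop("b", axis=1)

# Dropping a row by its index label
without_row0 = df_demo.drop(index=[0])

print(without_b)
print(without_row0)
print(df_demo.columns.tolist())  # original still has all three columns
```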

---

## 5. Handling Missing Data (Illustrative Example)

Real-world datasets often have missing values. Pandas provides robust tools to detect and handle them. The Iris dataset is very clean, so we'll artificially introduce some **`NaN` (Not a Number)** values to demonstrate.

### Create a temporary copy and introduce NaNs

In [None]:
# Create a temporary copy to avoid modifying the original df_iris
df_temp = df_iris.copy()

# Introduce NaN in 'sepal_width' for index labels 5 through 10 (.loc includes the end label, so 6 rows)
df_temp.loc[5:10, 'sepal_width'] = np.nan
# Introduce NaN in 'petal_length' for index labels 20 through 25 (another 6 rows)
df_temp.loc[20:25, 'petal_length'] = np.nan

print("\n--- DataFrame with Artificial Missing Values (first 30 rows) ---")
print(df_temp.head(30))

### Check for missing values
`isnull()` returns a boolean DataFrame indicating missing values. `sum()` counts them per column.

In [None]:
print("\n--- Count of Missing values per column ---")
print(df_temp.isnull().sum())

### Fill missing values
`fillna()` replaces missing values. Common strategies include filling with the mean, median, mode, or a specific constant.

In [None]:
# Fill missing values in 'sepal_width' with its mean
mean_sepal_width = df_temp['sepal_width'].mean()
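# Assign the result back; calling fillna(..., inplace=True) on a selected column
# is chained assignment and may not update df_temp in recent pandas versions
df_temp['sepal_width'] = df_temp['sepal_width'].fillna(mean_sepal_width)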
print(f"\n--- 'sepal_width' after filling NaNs with mean ({mean_sepal_width:.2f}) (first 30 rows) ---")
print(df_temp.head(30))
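The other fill strategies mentioned above follow the same pattern. The sketch below, on a hypothetical six-row frame, fills with the overall median and, as one possible refinement, with the mean of each row's own species computed via `groupby(...).transform("mean")`. Whether a group-wise fill is appropriate depends on the data; it is shown here only as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with one missing petal_length per species
df_demo = pd.DataFrame({
    "species": ["setosa", "setosa", "setosa", "virginica", "virginica", "virginica"],
    "petal_length": [1.4, np.nan, 1.6, 5.0, np.nan, 6.0],
})

# Strategy 1: fill with the overall median (NaNs are ignored when computing it)
filled_median = df_demo["petal_length"].fillna(df_demo["petal_length"].median())

# Strategy 2: fill each gap with the mean of that row's species
group_means = df_demo.groupby("species")["petal_length"].transform("mean")
filled_by_group = df_demo["petal_length"].fillna(group_means)

print(filled_median)
print(filled_by_group)
```

The group-wise version keeps the filled setosa value close to other setosa measurements, whereas the overall median lands between the two species.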

### Drop rows with missing values
`dropna()` removes rows or columns containing missing values.

In [None]:
# Let's see how many NaNs are left before dropping (petal_length still has NaNs)
print("\n--- Missing values count before dropping NaNs ---")
print(df_temp.isnull().sum())

df_temp.dropna(inplace=True) # Drops rows where 'petal_length' is still NaN
print("\n--- DataFrame after dropping rows with any remaining missing values ---")
print(df_temp.isnull().sum()) # Should show 0 for all columns
print(f"New shape after dropping rows with NaNs: {df_temp.shape}")
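`dropna()` takes parameters that control what gets removed. This sketch, on a hypothetical four-row frame, contrasts dropping rows with any missing value, dropping rows only when a chosen column is missing (`subset`), and dropping columns instead of rows (`axis=1`).

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in two different columns
df_demo = pd.DataFrame({
    "sepal_width": [3.5, np.nan, 3.2, 3.1],
    "petal_length": [1.4, 1.4, np.nan, 1.5],
    "species": ["setosa"] * 4,
})

# Drop every row that has at least one NaN
rows_any = df_demo.dropna()

# Drop rows only when 'petal_length' is NaN; NaNs elsewhere are kept
rows_subset = df_demo.dropna(subset=["petal_length"])

# Drop every column that has at least one NaN
cols_any = df_demo.dropna(axis=1)

print(rows_any)
print(rows_subset)
print(cols_any)
```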

---

## 6. Basic Data Visualization

Pandas DataFrames integrate well with `matplotlib` and `seaborn` for quick and insightful visualizations.

### Scatter Plot: Sepal Length vs Sepal Width, colored by species
A scatter plot helps visualize the relationship between two numerical variables and how a third categorical variable (species) might influence it.

In [None]:
print("\n--- Generating Scatter Plot: Sepal Length vs Sepal Width ---")
plt.figure(figsize=(8, 6))
sns.scatterplot(x='sepal_length', y='sepal_width', hue='species', data=df_iris, s=100, alpha=0.8)
plt.title('Sepal Length vs Sepal Width by Species (Iris Dataset)')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.grid(True)
plt.show()

### Box Plot: Petal Length Distribution by Species
A box plot is excellent for showing the distribution (median, quartiles, outliers) of a numerical variable across different categories.

In [None]:
print("\n--- Generating Box Plot: Petal Length Distribution by Species ---")
plt.figure(figsize=(8, 6))
sns.boxplot(x='species', y='petal_length', data=df_iris)
plt.title('Petal Length Distribution by Species (Iris Dataset)')
plt.xlabel('Species')
plt.ylabel('Petal Length (cm)')
plt.grid(axis='y')
plt.show()

---

## Conclusion

This notebook provided a hands-on introduction to Pandas using the Iris dataset. You've learned how to:
- Load data into a DataFrame.
- Perform initial data exploration (`head`, `tail`, `info`, `describe`, `shape`, `columns`, `unique`, `value_counts`).
- Select and filter data using `.loc[]`, `.iloc[]`, and boolean indexing.
- Manipulate data by creating new columns, grouping, aggregating, and sorting.
- Handle missing data (detecting, filling, and dropping).
- Create basic visualizations to gain insights from your data.

To continue improving your Pandas skills, I highly recommend downloading other real-world datasets from platforms like Kaggle, UCI Machine Learning Repository, or Data.gov, and applying these techniques. Each new dataset will present unique challenges and learning opportunities!