# Comprehensive Pandas Tutorial

This notebook covers essential pandas operations for data analysis, manipulation, and visualization.

## 1. Introduction to Pandas

Pandas is a Python library for data manipulation and analysis. It provides two core data structures, the one-dimensional Series and the two-dimensional DataFrame, for working with labeled, tabular data.
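
As a minimal sketch of how the two structures relate (the names `ages` and `people` are made up for illustration), a DataFrame can be seen as a set of Series that share one row index:

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
ages = pd.Series([25, 30, 35], index=['Alice', 'Bob', 'Charlie'], name='Age')

# A DataFrame is a two-dimensional table whose columns share the same row index.
people = pd.DataFrame({'Age': ages, 'City': ['New York', 'Paris', 'London']})

# Selecting a single column returns a Series again.
age_column = people['Age']
print(type(age_column).__name__)
print(people.loc['Bob', 'City'])
```

Here the row labels come from `ages`, so the plain list passed for `City` only needs to have the same length.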

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)

## 2. Creating DataFrames

There are several ways to create DataFrames in pandas:

In [None]:
# From a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Paris', 'London', 'Tokyo']
}
df = pd.DataFrame(data)

# From a list of lists
data_list = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Paris'],
    ['Charlie', 35, 'London'],
    ['David', 40, 'Tokyo']
]
df_list = pd.DataFrame(data_list, columns=['Name', 'Age', 'City'])

# From a NumPy array
array = np.random.rand(4, 3)
df_array = pd.DataFrame(array, columns=['A', 'B', 'C'])

# Display the DataFrames
df, df_list, df_array
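
One more construction route, not shown in the cell above, is a list of dictionaries, which is a common shape for JSON-like records. The sketch below uses a hypothetical `records` list and assumes one record lacks a key, to show how pandas fills the gap:

```python
import pandas as pd

# Each dictionary is one row; keys become column names.
records = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30},  # 'City' is absent for this record
]
df_records = pd.DataFrame(records)

# The missing key shows up as a missing value in the 'City' column.
print(df_records)
print(df_records['City'].isna().tolist())
```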

## 3. Basic DataFrame Operations

Let's explore some fundamental operations with DataFrames.

In [None]:
# Create a sample DataFrame for demonstration
data = {
    'Product': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard'],
    'Price': [999, 699, 349, 249, 49],
    'Quantity': [10, 25, 15, 8, 30],
    'InStock': [True, True, False, True, True]
}
df_products = pd.DataFrame(data)

# Display basic information
df_products.info()

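# Show the first few rows (a notebook cell only displays its last expression,
# so intermediate results are printed explicitly)
print(df_products.head())

# Show summary statistics for the numeric columns
print(df_products.describe())

# Access a single column (a Series) and several columns (a DataFrame)
print(df_products['Product'])
print(df_products[['Product', 'Price']])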

# Filter rows
expensive_products = df_products[df_products['Price'] > 300]
in_stock_products = df_products[df_products['InStock']]  # a boolean column is already a mask

expensive_products, in_stock_products
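
The filters above each use one condition. As a hedged sketch, several conditions can be combined with `&`, `|`, and `~`, written with `query()`, or paired with a column selection through `.loc`. The variable names below are illustrative only:

```python
import pandas as pd

df_products = pd.DataFrame({
    'Product': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard'],
    'Price': [999, 699, 349, 249, 49],
    'Quantity': [10, 25, 15, 8, 30],
    'InStock': [True, True, False, True, True],
})

# Combine masks element-wise; each comparison needs its own parentheses.
expensive_in_stock = df_products[(df_products['Price'] > 300) & df_products['InStock']]

# The same filter expressed as a query string.
expensive_in_stock_q = df_products.query('Price > 300 and InStock')

# .loc takes a row mask and a column label in one call.
cheap_names = df_products.loc[df_products['Price'] < 300, 'Product']

print(expensive_in_stock)
print(cheap_names.tolist())
```

Python's `and`/`or` keywords do not work on Series outside of `query()`, which is why the boolean-indexing form uses `&`.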

## 4. Data Manipulation

This section adds a derived column, sorts rows, aggregates with `groupby`, and builds a pivot table.

In [None]:
# Create a sample DataFrame with more data
dates = pd.date_range('20230101', periods=10)
df_sales = pd.DataFrame({
    'Date': dates,
    'Product': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard', 
                'Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard'],
    'Quantity': [5, 10, 3, 7, 15, 8, 12, 2, 5, 20],
    'Price': [999, 699, 349, 249, 49, 999, 699, 349, 249, 49],
    'Region': ['North', 'South', 'East', 'West', 'North', 
               'South', 'East', 'West', 'North', 'South']
})

# Add a new column
df_sales['Revenue'] = df_sales['Quantity'] * df_sales['Price']

# Sort by date
df_sorted = df_sales.sort_values('Date')

# Group by product and calculate total revenue
grouped = df_sales.groupby('Product').agg({
    'Quantity': 'sum',
    'Revenue': 'sum',
    'Price': 'mean'
}).reset_index()

# Pivot table
pivot = pd.pivot_table(df_sales, 
                       values='Revenue', 
                       index='Product', 
                       columns='Region', 
                       aggfunc='sum', 
                       fill_value=0)

df_sorted.head(), grouped, pivot
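
Two variations on the cell above may be useful. Named aggregation lets you choose the output column names, and grouping by two keys followed by `unstack` gives the same kind of Product-by-Region table as `pivot_table`. This is a sketch on a reduced copy of the sales data (dates omitted); `summary` and `by_region` are illustrative names:

```python
import pandas as pd

# Reduced copy of the sales data used above, without the Date column.
df_sales = pd.DataFrame({
    'Product': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard'] * 2,
    'Quantity': [5, 10, 3, 7, 15, 8, 12, 2, 5, 20],
    'Price': [999, 699, 349, 249, 49] * 2,
    'Region': ['North', 'South', 'East', 'West', 'North',
               'South', 'East', 'West', 'North', 'South'],
})
df_sales['Revenue'] = df_sales['Quantity'] * df_sales['Price']

# Named aggregation: each keyword becomes an output column.
summary = df_sales.groupby('Product').agg(
    total_quantity=('Quantity', 'sum'),
    total_revenue=('Revenue', 'sum'),
    mean_price=('Price', 'mean'),
)

# Group by two keys, then move 'Region' from the row index into the columns.
by_region = (
    df_sales.groupby(['Product', 'Region'])['Revenue']
    .sum()
    .unstack(fill_value=0)
)

print(summary)
print(by_region)
```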

## 5. Data Visualization with Pandas

Pandas plotting methods are built on Matplotlib: `Series.plot` and `DataFrame.plot` draw onto Matplotlib axes, so both libraries can be mixed in one figure.

In [None]:
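# Set up the plot style ('seaborn' was renamed to 'seaborn-v0_8' in Matplotlib 3.6;
# on older Matplotlib versions use 'seaborn' instead)
plt.style.use('seaborn-v0_8')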

# Create a figure with multiple plots
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: Line plot of sales over time
df_sales.set_index('Date')['Revenue'].plot(ax=axes[0, 0], title='Revenue Over Time')

# Plot 2: Bar plot of total revenue by product
grouped.set_index('Product')['Revenue'].sort_values().plot(kind='bar', ax=axes[0, 1], title='Revenue by Product')

# Plot 3: Pie chart of quantity by region
df_sales.groupby('Region')['Quantity'].sum().plot(kind='pie', autopct='%1.1f%%', ax=axes[1, 0], title='Quantity by Region')

# Plot 4: Scatter plot of price vs quantity
colors = {'North':'red', 'South':'blue', 'East':'green', 'West':'purple'}
axes[1, 1].scatter(
    x=df_sales['Price'],
    y=df_sales['Quantity'],
    c=df_sales['Region'].map(colors),
    alpha=0.7
)
axes[1, 1].set_title('Price vs Quantity by Region')
axes[1, 1].set_xlabel('Price')
axes[1, 1].set_ylabel('Quantity')

plt.tight_layout()
plt.show()

## 6. Best Practices and Tips

1. **Use meaningful column names**: Makes your code more readable and self-documenting.
2. **Handle missing data**: Use `dropna()` or `fillna()` appropriately.
3. **Avoid chained indexing**: Use `.loc[]` for assignment to prevent SettingWithCopyWarning.
4. **Leverage vectorized operations**: They're faster than looping through rows.
5. **Use categorical data types**: Convert columns with few unique values to `category` to save memory.
6. **Optimize data types**: Use the smallest dtype that fits the data (e.g., `int32` instead of `int64`).
7. **Use query() for complex filtering**: Can be more readable than boolean indexing.
8. **Consider memory usage**: Especially important with large datasets.
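
A few of these tips can be sketched together on a synthetic frame. The data and names here (`df`, `threshold`, `large`) are invented for illustration, and the actual memory savings depend on your data and pandas version:

```python
import pandas as pd

# Synthetic data: 1000 rows, four repeating region labels.
df = pd.DataFrame({
    'Region': ['North', 'South', 'East', 'West'] * 250,
    'Quantity': list(range(1000)),
})

# Tip 3: assign through one .loc call instead of df[mask]['Quantity'] = ...
df.loc[df['Quantity'] > 990, 'Quantity'] = 990

# Tips 5, 6 and 8: measure memory, then shrink the dtypes.
before = df.memory_usage(deep=True).sum()
df['Region'] = df['Region'].astype('category')
df['Quantity'] = df['Quantity'].astype('int32')
after = df.memory_usage(deep=True).sum()

# Tip 7: query() can reference local variables with the @ prefix.
threshold = 900
large = df.query('Quantity >= @threshold')

print(f'{before} bytes -> {after} bytes')
print(len(large))
```

Downcasting to `int32` is only safe when every value fits in that range, so check the column's minimum and maximum first.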

## 7. Handling Missing Data

Pandas provides several methods to handle missing data:

In [None]:
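# Create a DataFrame with missing values
# (astype returns a copy; float columns can hold NaN without an implicit upcast)
df_missing = df_sales.astype({'Quantity': 'float64', 'Price': 'float64'})
df_missing.loc[[1, 3, 5], 'Quantity'] = np.nan
df_missing.loc[[2, 4, 6], 'Price'] = np.nan

# Count missing values per column (printed because it is not the cell's last expression)
print(df_missing.isna().sum())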

# Different ways to handle missing values

# 1. Drop rows with missing values
df_dropped = df_missing.dropna()

# 2. Fill with a specific value
df_filled_zero = df_missing.fillna(0)
df_filled_mean = df_missing.fillna({
    'Quantity': df_missing['Quantity'].mean(),
    'Price': df_missing['Price'].mean()
})

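# 3. Forward fill (fillna(method='ffill') is deprecated in recent pandas versions)
df_filled_ffill = df_missing.ffill()

# 4. Interpolate only the numeric columns that contain gaps
df_interpolated = df_missing.copy()
num_cols = ['Quantity', 'Price']
df_interpolated[num_cols] = df_missing[num_cols].interpolate()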

df_missing.head(), df_dropped.head(), df_filled_mean.head(), df_interpolated.head()
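
To see how the strategies differ, it can help to apply them to one small Series where the results are easy to check by hand. This is a sketch on made-up values, not on the sales data:

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 30.0, np.nan, np.nan, 60.0])

dropped = s.dropna()              # keeps only the three observed values
mean_filled = s.fillna(s.mean())  # mean() skips NaN, so gaps become 100 / 3
forward = s.ffill()               # each gap repeats the last observed value
linear = s.interpolate()          # gaps are filled on a straight line between neighbors

comparison = pd.DataFrame({
    'original': s,
    'mean_filled': mean_filled,
    'forward': forward,
    'linear': linear,
})
print(comparison)
```

Forward fill and interpolation both assume the row order is meaningful (for example, a time series), while mean filling ignores order entirely.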

## 8. Conclusion

This notebook covered the essential pandas operations for data analysis:
- Creating and exploring DataFrames
- Basic operations and filtering
- Data manipulation and transformation
- Grouping and aggregation
- Data visualization
- Handling missing data

Pandas can handle complex data operations with just a few lines of code. The key to mastering it is practice, so try these examples with your own datasets!