---
title: "NumPy and Pandas Mastery"
subtitle: "Essential Data Analysis Libraries for Python"
---

## Introduction

NumPy and Pandas are the foundation of data science in Python:

- **NumPy**: Efficient numerical computing with arrays
- **Pandas**: Data manipulation and analysis with DataFrames

This tutorial works from the basics of each library up to more advanced techniques, ending with two worked examples and performance tips.

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Display settings
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 10)
pd.set_option('display.precision', 3)
sns.set_theme(style='whitegrid')

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

# Part 1: NumPy - Numerical Python

## Why NumPy?

NumPy provides:
- Fast operations on arrays (often 10-100x faster than equivalent Python loops over lists, depending on the operation and array size)
- Broadcasting for element-wise operations
- Mathematical functions
- Linear algebra operations
- Random number generation

## Creating Arrays

In [None]:
# From lists
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([[1, 2, 3], [4, 5, 6]])

print("1D array:", arr1)
print("2D array:\n", arr2)
print(f"Shape: {arr2.shape}, Dimensions: {arr2.ndim}, Size: {arr2.size}")

In [None]:
# Special arrays
zeros = np.zeros((3, 4))           # 3x4 matrix of zeros
ones = np.ones((2, 3, 4))          # 2x3x4 tensor of ones
identity = np.eye(4)               # 4x4 identity matrix
random_values = np.random.rand(3, 3)  # 3x3 random values [0,1); named to avoid shadowing the stdlib `random` module

# Sequences
sequence = np.arange(0, 10, 2)     # [0, 2, 4, 6, 8]
linspace = np.linspace(0, 1, 5)    # 5 points from 0 to 1

print("Zeros:\n", zeros)
print("\nIdentity:\n", identity)
print("\nSequence:", sequence)
print("Linspace:", linspace)

## Array Operations

In [None]:
# Element-wise operations
a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

print("Addition:", a + b)
print("Multiplication:", a * b)
print("Power:", a ** 2)
print("Boolean:", a > 2)

# Mathematical functions
print("\nSqrt:", np.sqrt(a))
print("Exp:", np.exp(a))
print("Log(a + 1):", np.log(a + 1))

## Broadcasting

Broadcasting lets NumPy apply element-wise operations to arrays of different but compatible shapes, without copying the smaller array:

In [None]:
# Broadcasting examples
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Add scalar to matrix
print("Matrix + 10:")
print(matrix + 10)

# Add vector to each row
row_vector = np.array([1, 0, -1])
print("\nMatrix + row vector:")
print(matrix + row_vector)

# Add vector to each column
col_vector = np.array([[10], [20], [30]])
print("\nMatrix + column vector:")
print(matrix + col_vector)
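
The rule behind the three cases above can be checked directly: shapes are compared from the last dimension backwards, and two dimensions are compatible when they are equal or one of them is 1. The following is a minimal sketch, assuming NumPy 1.20 or later for `np.broadcast_shapes`; `bad_vector` and `broadcast_failed` are illustrative names.

```python
import numpy as np

matrix = np.arange(1, 10).reshape(3, 3)      # shape (3, 3)
row_vector = np.array([1, 0, -1])            # shape (3,), treated as (1, 3)
col_vector = np.array([[10], [20], [30]])    # shape (3, 1)

# np.broadcast_shapes reports the result shape without doing any arithmetic.
print(np.broadcast_shapes(matrix.shape, row_vector.shape))      # (3, 3)
print(np.broadcast_shapes(row_vector.shape, col_vector.shape))  # (3, 3)

# A length-2 vector cannot be stretched to length 3, so this raises ValueError.
bad_vector = np.array([1, 2])
try:
    matrix + bad_vector
    broadcast_failed = False
except ValueError as exc:
    broadcast_failed = True
    print("Broadcast error:", exc)
```

Note that a row vector plus a column vector produces a full 3x3 grid, which is a common source of accidental large arrays.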

## Indexing and Slicing

In [None]:
# 2D array indexing
arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [9, 10, 11, 12]])

print("Original array:")
print(arr)

# Access elements
print("\nElement [1,2]:", arr[1, 2])  # Row index 1, column index 2 (zero-based): 7
print("First row:", arr[0, :])         # or arr[0]
print("Last column:", arr[:, -1])

# Slicing
print("\nSubmatrix [0:2, 1:3]:")
print(arr[0:2, 1:3])

# Boolean indexing
mask = arr > 5
print("\nElements > 5:", arr[mask])
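
One detail the indexing examples above do not show is whether the result shares memory with the original array. As a small sketch (variable names here are illustrative): basic slicing returns a view, while boolean indexing returns a new array.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)

# Basic slicing returns a view: no data is copied.
view = arr[0:2, 1:3]
print(np.shares_memory(arr, view))      # True

# Boolean (and integer-array) indexing returns a new array.
selected = arr[arr > 5]
print(np.shares_memory(arr, selected))  # False

# Writing through the view changes the original array.
view[0, 0] = 100
print(arr[0, 1])                        # 100

# An explicit copy is independent of the original.
safe = arr[0:2, 1:3].copy()
safe[0, 0] = -1
print(arr[0, 1])                        # 100
```

If a slice will be modified and the original must stay intact, calling `.copy()` first is the safe choice.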

## Array Manipulation

In [None]:
# Reshaping
a = np.arange(12)
print("Original:", a)
print("Reshaped to 3x4:")
print(a.reshape(3, 4))
print("Reshaped to 2x2x3:")
print(a.reshape(2, 2, 3))

# Stacking
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print("\nVertical stack:")
print(np.vstack([a, b]))
print("Horizontal stack:")
print(np.hstack([a, b]))

# Splitting
arr = np.arange(9).reshape(3, 3)
print("\nOriginal:")
print(arr)
print("Split horizontally:", np.hsplit(arr, 3))

## Statistical Operations

In [None]:
# Generate sample data
data = np.random.randn(1000)  # Normal distribution

print(f"Mean: {np.mean(data):.3f}")
print(f"Median: {np.median(data):.3f}")
print(f"Std Dev: {np.std(data):.3f}")    # population std (ddof=0); pass ddof=1 for the sample estimate
print(f"Variance: {np.var(data):.3f}")  # population variance (ddof=0)
print(f"Min: {np.min(data):.3f}")
print(f"Max: {np.max(data):.3f}")
print(f"25th percentile: {np.percentile(data, 25):.3f}")
print(f"75th percentile: {np.percentile(data, 75):.3f}")

# Axis operations
matrix = np.random.randn(3, 4)
print("\nMatrix:")
print(matrix)
print("Column means:", np.mean(matrix, axis=0))
print("Row means:", np.mean(matrix, axis=1))
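
Axis reductions combine naturally with broadcasting. The sketch below standardizes each column of a matrix and centers each row; it assumes a seeded `np.random.default_rng` generator purely so the run is repeatable, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator for a repeatable example
matrix = rng.normal(loc=5.0, scale=2.0, size=(100, 4))

# Column statistics have shape (4,), which broadcasts across all 100 rows.
col_mean = matrix.mean(axis=0)
col_std = matrix.std(axis=0)
standardized = (matrix - col_mean) / col_std

# Row statistics need keepdims=True so the shape is (100, 1) rather than
# (100,), which would not broadcast against a (100, 4) matrix.
row_mean = matrix.mean(axis=1, keepdims=True)
row_centered = matrix - row_mean

print(standardized.mean(axis=0).round(6))  # close to 0 for every column
print(row_mean.shape)                      # (100, 1)
```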

## Linear Algebra

In [None]:
# Matrix operations
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print("Matrix A:")
print(A)
print("\nMatrix B:")
print(B)

# Matrix multiplication
print("\nA @ B (matrix product):")
print(A @ B)  # or np.dot(A, B)

# Other operations
print("\nTranspose of A:")
print(A.T)
print("\nDeterminant of A:", np.linalg.det(A))
print("\nInverse of A:")
print(np.linalg.inv(A))

# Eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
print("\nEigenvalues:", eigenvalues)
print("Eigenvectors:")
print(eigenvectors)
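
The inverse above is often computed only to solve a linear system. As a hedged sketch, `np.linalg.solve` does that directly and is generally preferred over forming the inverse. The right-hand side `b` is an illustrative value, not from the original. The loop at the end checks the defining property of the eigenpairs, using the fact that `np.linalg.eig` returns eigenvectors as columns.

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([5.0, 6.0])  # illustrative right-hand side

# Solve A @ x = b without forming the inverse explicitly.
x_solve = np.linalg.solve(A, b)
x_inv = np.linalg.inv(A) @ b  # same answer, but more work and less accurate in general

print(x_solve)                      # approximately [-4.   4.5]
print(np.allclose(A @ x_solve, b))  # True

# Each column v of `eigenvectors` satisfies A @ v = eigenvalue * v.
eigenvalues, eigenvectors = np.linalg.eig(A)
for value, vector in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(A @ vector, value * vector))  # True
```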

# Part 2: Pandas - Data Analysis

## Why Pandas?

Pandas provides:
- DataFrames for tabular data
- Missing data handling
- Data alignment and merging
- Time series functionality
- Input/Output tools for various formats

## Creating DataFrames

In [None]:
# From dictionary
df1 = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 28],
    'city': ['NYC', 'LA', 'Chicago', 'Houston'],
    'salary': [70000, 80000, 75000, 65000]
})

# From lists
df2 = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    columns=['A', 'B', 'C'],
    index=['row1', 'row2', 'row3']
)

# From NumPy array
df3 = pd.DataFrame(
    np.random.randn(5, 3),
    columns=['X', 'Y', 'Z']
)

print("DataFrame from dictionary:")
print(df1)
print("\nDataFrame with custom index:")
print(df2)

## Data Exploration

In [None]:
# Load sample data
tips = sns.load_dataset('tips')

# Basic information
print("Shape:", tips.shape)
print("\nFirst 5 rows:")
print(tips.head())
print("\nData types:")
print(tips.dtypes)
print("\nBasic statistics:")
print(tips.describe())
print("\nInfo:")
tips.info()

## Selecting and Filtering Data

In [None]:
# Column selection
print("Single column (Series):")
print(tips['total_bill'].head())

print("\nMultiple columns (DataFrame):")
print(tips[['total_bill', 'tip', 'day']].head())

# Row selection with loc (label-based; the end label is included)
print("\nRows 10-12, specific columns:")
print(tips.loc[10:12, ['total_bill', 'tip', 'day']])

# Row selection with iloc (position-based; the end position is excluded)
print("\nRows 0-2, columns 0-2:")
print(tips.iloc[0:3, 0:3])
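
With the default 0, 1, 2, ... index, labels and positions coincide, which hides the difference between `loc` and `iloc`. The sketch below uses a small hypothetical frame with a non-default index to make the two behaviours visible.

```python
import pandas as pd

# Hypothetical data: the index labels are 10, 20, 30, 40, not 0..3.
df = pd.DataFrame(
    {"total_bill": [16.99, 10.34, 21.01, 23.68]},
    index=[10, 20, 30, 40],
)

by_label = df.loc[10:30]     # labels 10 through 30, end label included
by_position = df.iloc[0:2]   # positions 0 and 1, end position excluded

print(by_label.index.tolist())     # [10, 20, 30]
print(by_position.index.tolist())  # [10, 20]
```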

In [None]:
# Boolean filtering
# Simple condition
high_bills = tips[tips['total_bill'] > 30]
print(f"High bills (>30): {len(high_bills)} rows")
print(high_bills.head())

# Multiple conditions
dinner_high_tip = tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5)]
print(f"\nDinner with high tip: {len(dinner_high_tip)} rows")

# Using isin
weekend = tips[tips['day'].isin(['Sat', 'Sun'])]
print(f"\nWeekend data: {len(weekend)} rows")

# Using query
result = tips.query('total_bill > 30 and tip > 5')
print(f"\nQuery result: {len(result)} rows")

## Data Manipulation

In [None]:
# Adding columns
tips_copy = tips.copy()
tips_copy['tip_percentage'] = tips_copy['tip'] / tips_copy['total_bill'] * 100
tips_copy['bill_per_person'] = tips_copy['total_bill'] / tips_copy['size']

print("New columns added:")
print(tips_copy[['total_bill', 'tip', 'tip_percentage', 'bill_per_person']].head())

# Modifying columns
tips_copy['day_type'] = tips_copy['day'].apply(
    lambda x: 'Weekend' if x in ['Sat', 'Sun'] else 'Weekday'
)

# Dropping columns
tips_copy = tips_copy.drop(['bill_per_person'], axis=1)

# Renaming columns
tips_copy = tips_copy.rename(columns={'size': 'party_size'})

print("\nModified DataFrame:")
print(tips_copy.head())

## Handling Missing Data

In [None]:
# Create data with missing values
df_missing = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, np.nan, 5],
    'C': [1, 2, 3, 4, 5],
    'D': [np.nan, np.nan, np.nan, np.nan, 5]
})

print("Data with missing values:")
print(df_missing)

# Check for missing values
print("\nMissing values per column:")
print(df_missing.isnull().sum())

# Drop missing values
print("\nDrop rows with any NaN:")
print(df_missing.dropna())

print("\nDrop columns with any NaN:")
print(df_missing.dropna(axis=1))

# Fill missing values
print("\nFill with constant:")
print(df_missing.fillna(0))

print("\nForward fill:")
print(df_missing.ffill())  # fillna(method='ffill') is deprecated in recent pandas

print("\nFill with mean:")
print(df_missing.fillna(df_missing.mean()))
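
In practice different columns often need different strategies. The following sketch, on a reduced two-column version of the frame above, shows a per-column fill via a dictionary, linear interpolation, and the share of missing values per column. The choice of median for `A` and 0 for `B` is only an illustration.

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame({
    "A": [1, 2, np.nan, 4, 5],
    "B": [np.nan, 2, 3, np.nan, 5],
})

# Share of missing values per column (mean of a boolean mask).
missing_share = df_missing.isna().mean()
print(missing_share)

# A dictionary maps each column to its own fill value.
filled = df_missing.fillna({"A": df_missing["A"].median(), "B": 0})
print(filled)

# Linear interpolation estimates a gap from its neighbours.
interpolated = df_missing["A"].interpolate()
print(interpolated.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```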

## GroupBy Operations

In [None]:
# Simple groupby
grouped = tips.groupby('day')
print("Mean by day:")
print(grouped[['total_bill', 'tip']].mean())

# Multiple grouping
print("\nMean by day and time:")
print(tips.groupby(['day', 'time'])['total_bill'].mean().unstack())

# Multiple aggregations
print("\nMultiple statistics:")
agg_result = tips.groupby('day').agg({
    'total_bill': ['mean', 'std', 'count'],
    'tip': ['mean', 'max']
})
print(agg_result)

In [None]:
# Transform - returns same-sized result
tips['bill_zscore'] = tips.groupby('day')['total_bill'].transform(
    lambda x: (x - x.mean()) / x.std()
)

print("Z-scores by day:")
print(tips[['day', 'total_bill', 'bill_zscore']].head(10))

# Apply - flexible operation
def top_tips(df, n=3):
    return df.nlargest(n, 'tip')[['total_bill', 'tip']]

print("\nTop 3 tips per day:")
print(tips.groupby('day').apply(top_tips))
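
The dictionary form of `agg` shown earlier produces two-level column names. As an alternative sketch, named aggregation gives flat, self-describing columns. The `bills` frame and the output column names are hypothetical stand-ins for the tips data.

```python
import pandas as pd

bills = pd.DataFrame({
    "day": ["Sat", "Sat", "Sun", "Sun", "Sun"],
    "total_bill": [20.0, 30.0, 10.0, 20.0, 30.0],
    "tip": [3.0, 5.0, 1.0, 2.0, 6.0],
})

# Each keyword is an output column: name=(input column, aggregation).
summary = bills.groupby("day").agg(
    avg_bill=("total_bill", "mean"),
    max_tip=("tip", "max"),
    n_bills=("total_bill", "count"),
)
print(summary)
```

The result has one row per day and the columns `avg_bill`, `max_tip`, and `n_bills`, so no flattening step is needed afterwards.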

## Merging and Joining

In [None]:
# Create sample DataFrames
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'city': ['NYC', 'LA', 'Chicago', 'Houston']
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 1, 3, 5],
    'amount': [250, 150, 300, 200, 175]
})

print("Customers:")
print(customers)
print("\nOrders:")
print(orders)

# Different join types
print("\nInner join:")
print(pd.merge(customers, orders, on='customer_id', how='inner'))

print("\nLeft join:")
print(pd.merge(customers, orders, on='customer_id', how='left'))

print("\nOuter join:")
print(pd.merge(customers, orders, on='customer_id', how='outer'))
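
When a join drops or adds rows unexpectedly, two `merge` parameters help diagnose it. This sketch reuses the customers and orders data from above (with the city column omitted): `indicator=True` adds a `_merge` column recording where each row came from, and `validate` raises an error if the key relationship is not what was assumed.

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "name": ["Alice", "Bob", "Charlie", "David"],
})
orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104, 105],
    "customer_id": [1, 2, 1, 3, 5],
    "amount": [250, 150, 300, 200, 175],
})

merged = pd.merge(
    customers, orders,
    on="customer_id", how="outer",
    indicator=True,            # adds a _merge column: left_only / right_only / both
    validate="one_to_many",    # raises MergeError if customer_id repeats in customers
)

customers_without_orders = merged[merged["_merge"] == "left_only"]
orders_without_customer = merged[merged["_merge"] == "right_only"]

print(customers_without_orders["name"].tolist())        # ['David']
print(orders_without_customer["customer_id"].tolist())  # [5]
```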

## Pivot Tables and Reshaping

In [None]:
# Pivot table
pivot = tips.pivot_table(
    values='total_bill',
    index='day',
    columns='time',
    aggfunc='mean'
)
print("Pivot table - mean bill by day and time:")
print(pivot)

# Melt (unpivot)
melted = pivot.reset_index().melt(
    id_vars='day',
    var_name='time',
    value_name='avg_bill'
)
print("\nMelted data:")
print(melted)

# Stack and unstack
stacked = tips.groupby(['day', 'time'])['total_bill'].mean()
print("\nStacked:")
print(stacked)
print("\nUnstacked:")
print(stacked.unstack())

## Time Series

In [None]:
# Create time series data
dates = pd.date_range('2023-01-01', periods=365, freq='D')
ts = pd.Series(np.random.randn(365).cumsum() + 100, index=dates)

print("Time series data:")
print(ts.head())

# Resampling
monthly = ts.resample('MS').mean()  # 'MS' labels each bin by month start; the 'M' alias is deprecated in pandas 2.2+
print("\nMonthly average:")
print(monthly)

# Rolling operations
rolling_mean = ts.rolling(window=30).mean()
rolling_std = ts.rolling(window=30).std()

# Plot
fig, ax = plt.subplots(figsize=(12, 6))
ts.plot(ax=ax, label='Daily', alpha=0.5)
rolling_mean.plot(ax=ax, label='30-day MA', linewidth=2)
ax.fill_between(rolling_mean.index, 
                rolling_mean - 2*rolling_std,
                rolling_mean + 2*rolling_std,
                alpha=0.2, label='±2 STD')
ax.legend()
ax.set_title('Time Series with Moving Average')
plt.show()
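
The 30-day rolling mean above starts with 29 missing values, because by default a window yields a result only once it is full. The sketch below shows the same effect on a short hypothetical series with a 3-day window, along with `min_periods`, which relaxes that requirement.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2023-01-01", periods=10, freq="D")
ts = pd.Series(np.arange(10, dtype=float), index=dates)  # values 0.0 .. 9.0

# Default: the first window - 1 results are NaN.
rolling = ts.rolling(window=3).mean()
print(rolling.isna().sum())   # 2

# min_periods=1 averages over whatever data is available so far.
rolling_partial = ts.rolling(window=3, min_periods=1).mean()
print(rolling_partial.head(3).tolist())  # [0.0, 0.5, 1.0]
```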

## String Operations

In [None]:
# String data
df = pd.DataFrame({
    'name': ['John Smith', 'jane doe', 'Bob JONES', 'alice wonderland'],
    'email': ['john@email.com', 'JANE@GMAIL.COM', 'bob@yahoo.com', 'alice@outlook.com']
})

print("Original:")
print(df)

# String methods
df['name_upper'] = df['name'].str.upper()
df['name_title'] = df['name'].str.title()
df['first_name'] = df['name'].str.split().str[0]
df['email_domain'] = df['email'].str.split('@').str[1].str.lower()
df['name_length'] = df['name'].str.len()

print("\nProcessed:")
print(df)

## Categorical Data

In [None]:
# Create categorical data
df = pd.DataFrame({
    'grade': ['A', 'B', 'A', 'C', 'B', 'A', 'D', 'C'],
    'score': [95, 85, 92, 75, 88, 96, 65, 78]
})

# Convert to categorical
df['grade'] = pd.Categorical(
    df['grade'],
    categories=['D', 'C', 'B', 'A'],
    ordered=True
)

print("Categorical data:")
print(df)
print("\nCategories:", df['grade'].cat.categories)
print("Ordered:", df['grade'].cat.ordered)

# Sort by categorical order
print("\nSorted by grade:")
print(df.sort_values('grade'))

# Filter using categorical order
print("\nGrades better than C:")
print(df[df['grade'] > 'C'])

# Part 3: Real-World Examples

## Example 1: Data Cleaning Pipeline

In [None]:
# Create messy data
messy_data = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-02', '2023/01/03', '01-04-2023', None],
    'Amount': ['$1,234.56', '2345.67', '$3,456.78', 'N/A', '4567.89'],
    'Category': ['Food', 'food', 'FOOD', 'Transport', None],
    'Status': ['Complete', 'complete', 'PENDING', 'pending', 'Complete']
})

print("Messy data:")
print(messy_data)

def clean_data(df):
    """Clean and standardize the data"""
    df = df.copy()
    
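    # Clean dates. The column mixes several formats, so let pandas infer the
    # format per value (format='mixed' requires pandas >= 2.0); without it,
    # pandas 2.x infers one format from the first value and turns the rest into NaT.
    df['Date'] = pd.to_datetime(df['Date'], format='mixed', errors='coerce')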
    
    # Clean amounts
    df['Amount'] = (df['Amount']
                   .str.replace('$', '', regex=False)
                   .str.replace(',', '', regex=False)
                   .replace('N/A', np.nan))
    df['Amount'] = pd.to_numeric(df['Amount'], errors='coerce')
    
    # Standardize categories
    df['Category'] = df['Category'].str.title()
    df['Category'] = df['Category'].fillna('Unknown')
    
    # Standardize status
    df['Status'] = df['Status'].str.title()
    
    # Remove rows with critical missing data
    df = df.dropna(subset=['Date', 'Amount'])
    
    return df

clean = clean_data(messy_data)
print("\nCleaned data:")
print(clean)
print("\nData types:")
print(clean.dtypes)

## Example 2: Sales Analysis

In [None]:
# Generate sales data
np.random.seed(42)
n_records = 1000

sales = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=n_records, freq='h'),  # hourly; uppercase 'H' is deprecated in pandas 2.2+
    'product': np.random.choice(['A', 'B', 'C', 'D'], n_records),
    'quantity': np.random.poisson(10, n_records),
    'price': np.random.uniform(10, 100, n_records),
    'region': np.random.choice(['North', 'South', 'East', 'West'], n_records)
})

sales['revenue'] = sales['quantity'] * sales['price']

# Analysis
print("Sales Summary:")
print(sales.describe())

# Daily aggregation
daily_sales = sales.set_index('date').resample('D').agg({
    'quantity': 'sum',
    'revenue': 'sum',
    'price': 'mean'
})

print("\nDaily sales (first week):")
print(daily_sales.head(7))

# Product performance
product_performance = sales.groupby('product').agg({
    'quantity': 'sum',
    'revenue': ['sum', 'mean'],
    'price': 'mean'
}).round(2)

print("\nProduct Performance:")
print(product_performance)

# Regional analysis
regional = sales.pivot_table(
    values='revenue',
    index='product',
    columns='region',
    aggfunc='sum'
).round(2)

print("\nRevenue by Product and Region:")
print(regional)

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Daily revenue trend
daily_sales['revenue'].plot(ax=axes[0, 0])
axes[0, 0].set_title('Daily Revenue Trend')
axes[0, 0].set_ylabel('Revenue ($)')

# Product distribution
sales.groupby('product')['revenue'].sum().plot(kind='bar', ax=axes[0, 1])
axes[0, 1].set_title('Total Revenue by Product')
axes[0, 1].set_ylabel('Revenue ($)')

# Regional distribution
sales.groupby('region')['revenue'].sum().plot(kind='pie', ax=axes[1, 0], autopct='%1.1f%%')
axes[1, 0].set_title('Revenue Distribution by Region')

# Heatmap
sns.heatmap(regional, annot=True, fmt='.0f', cmap='YlOrRd', ax=axes[1, 1])
axes[1, 1].set_title('Revenue Heatmap')

plt.tight_layout()
plt.show()

## Performance Tips

### NumPy Performance

In [None]:
import time

# Vectorization vs loops
size = 1000000
a = np.random.randn(size)
b = np.random.randn(size)

# Loop method
start = time.perf_counter()  # perf_counter is a monotonic, high-resolution timer
result_loop = []
for i in range(size):
    result_loop.append(a[i] + b[i])
loop_time = time.perf_counter() - start

# Vectorized method
start = time.perf_counter()
result_vector = a + b
vector_time = time.perf_counter() - start

print(f"Loop time: {loop_time:.4f} seconds")
print(f"Vectorized time: {vector_time:.4f} seconds")
print(f"Speedup: {loop_time/vector_time:.1f}x")

### Pandas Performance

In [None]:
# Efficient data types
df = pd.DataFrame({
    'int_col': np.random.randint(0, 100, 10000),
    'float_col': np.random.randn(10000),
    'category_col': np.random.choice(['A', 'B', 'C'], 10000)
})

print("Original memory usage:")
print(df.memory_usage(deep=True))

# Optimize data types
df['int_col'] = df['int_col'].astype('int8')  # Smaller int
df['float_col'] = df['float_col'].astype('float32')  # Smaller float
df['category_col'] = df['category_col'].astype('category')  # Categorical

print("\nOptimized memory usage:")
print(df.memory_usage(deep=True))

# Use vectorized string operations
# Bad: df['new'] = df['category_col'].apply(lambda x: x.lower())
# Good: df['new'] = df['category_col'].str.lower()

## Summary and Best Practices

### NumPy Best Practices
1. **Use vectorization** instead of loops
2. **Preallocate arrays** when size is known
3. **Use appropriate data types** (float32 vs float64)
4. **Leverage broadcasting** for operations
5. **Use views instead of copies** when possible
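
Items 2, 3 and 5 in the list above can be sketched in a few lines. This is an illustration under simple assumptions (array size and variable names are arbitrary), not a benchmark.

```python
import numpy as np

n = 100_000
a = np.ones(n, dtype=np.float64)
b = np.full(n, 2.0)

# Preallocate the result once and let the ufunc write into it via `out=`,
# instead of growing a Python list element by element.
result = np.empty_like(a)
np.add(a, b, out=result)

# float32 uses half the memory of float64, at the cost of precision.
a32 = a.astype(np.float32)
print(a.nbytes, a32.nbytes)   # 800000 400000

# Slicing gives a view onto the same memory; .copy() allocates new memory.
first_half = a[: n // 2]
print(first_half.base is a)   # True
```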

### Pandas Best Practices
1. **Use vectorized operations** (.str, .dt methods)
2. **Optimize data types** (categories, smaller ints)
3. **Use .loc/.iloc** for explicit indexing
4. **Chain operations** for readability
5. **Profile memory usage** for large datasets
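
As a sketch of item 4, the steps "add a column, filter, group, sort" can be written as one chain instead of a series of reassignments. The `bills` frame below is a hypothetical stand-in for the tips data, and `avg_tip_pct` is an illustrative column name.

```python
import pandas as pd

bills = pd.DataFrame({
    "day": ["Thur", "Sat", "Sat", "Sun", "Sun"],
    "total_bill": [10.0, 20.0, 40.0, 30.0, 50.0],
    "tip": [1.0, 4.0, 6.0, 3.0, 10.0],
})

weekend_summary = (
    bills
    .assign(tip_percentage=lambda d: d["tip"] / d["total_bill"] * 100)  # new column
    .query("day in ['Sat', 'Sun']")                                     # keep weekends
    .groupby("day", as_index=False)
    .agg(avg_tip_pct=("tip_percentage", "mean"))                        # one row per day
    .sort_values("avg_tip_pct", ascending=False)
)
print(weekend_summary)
```

Each step returns a new object, so the original `bills` frame is left unchanged.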

### When to Use What?
- **NumPy**: Numerical computations, linear algebra, image processing
- **Pandas**: Tabular data, time series, data cleaning, analysis

### Resources
- [NumPy Documentation](https://numpy.org/doc/)
- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [NumPy Tutorial](https://numpy.org/doc/stable/user/quickstart.html)
- [Pandas Tutorial](https://pandas.pydata.org/docs/getting_started/tutorials.html)

Master these two libraries and you'll have the core tools for most data analysis tasks in Python! 🚀📊