# Pandas Basics and DataFrames

Pandas is one of the most widely used Python libraries for data manipulation and analysis. It provides fast, labeled data structures and analysis tools, which makes it a staple of data science, NLP, and machine learning projects.

## Why Pandas?
- **Data structures**: DataFrame and Series for handling structured data
- **Data cleaning**: Tools for handling missing data, duplicates, etc.
- **Data transformation**: Grouping, merging, reshaping operations
- **File I/O**: Read/write CSV, Excel, JSON, SQL databases, etc.
- **Integration**: Works seamlessly with NumPy, Matplotlib, Scikit-learn

## Topics Covered
- Pandas installation and setup
- Series: 1D labeled arrays
- DataFrames: 2D labeled data structures
- Creating DataFrames from various sources
- Basic operations and attributes
- Indexing and selection
- Data inspection and summary statistics

## Installation and Setup

In [None]:
# Install pandas (if not already installed)
# !pip install pandas

import pandas as pd
import numpy as np

# Check pandas version
print(f"Pandas version: {pd.__version__}")

# Display options: show every column and let pandas detect the output width
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

## Pandas Series: 1D Labeled Arrays

In [None]:
# Creating Series from lists
fruits = pd.Series(['apple', 'banana', 'orange', 'grape'])
print("Fruits Series:")
print(fruits)
print(f"Type: {type(fruits)}")
print()

# Series with custom index
prices = pd.Series([1.50, 0.80, 2.00, 3.50], 
                  index=['apple', 'banana', 'orange', 'grape'],
                  name='Fruit Prices')
print("Prices Series:")
print(prices)
print()

In [None]:
# Series attributes
print(f"Values: {prices.values}")
print(f"Index: {prices.index.tolist()}")
print(f"Name: {prices.name}")
print(f"Size: {prices.size}")
print(f"Shape: {prices.shape}")
print(f"Data type: {prices.dtype}")
print()

# Accessing elements
print(f"Apple price: ${prices['apple']:.2f}")
print(f"First item: {prices.iloc[0]}")
print(f"Expensive fruits (>$2): \n{prices[prices > 2.0]}")

In [None]:
# Creating Series from dictionary
student_grades = {
    'Alice': 95,
    'Bob': 87,
    'Charlie': 92,
    'Diana': 88,
    'Eve': 94
}

grades_series = pd.Series(student_grades, name='Math Grades')
print("Student grades:")
print(grades_series)
print()

# Basic statistics
print(f"Mean grade: {grades_series.mean():.1f}")
print(f"Highest grade: {grades_series.max()}")
print(f"Student with highest grade: {grades_series.idxmax()}")
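
The index labels are what separate a Series from a plain list: arithmetic between two Series matches values by label, not by position. The following is a minimal sketch of that behavior. The `quantities` Series and the `kiwi` label are hypothetical and were added only for illustration.

```python
import pandas as pd

prices = pd.Series([1.50, 0.80, 2.00], index=["apple", "banana", "orange"])

# Hypothetical order quantities, deliberately in a different order
# and with one label ("kiwi") that has no price.
quantities = pd.Series([10, 4, 6], index=["orange", "apple", "kiwi"])

# Values are paired by label; labels present in only one Series give NaN.
cost = prices * quantities
print(cost)

print(f"Apple cost: {cost['apple']}")    # 1.50 * 4
print(f"Orange cost: {cost['orange']}")  # 2.00 * 10
print(f"Labels without a match: {cost[cost.isna()].index.tolist()}")
```

Because alignment happens automatically, unexpected `NaN` values after arithmetic are often a sign that two indexes do not share the same labels.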

## Pandas DataFrames: 2D Labeled Data Structures

In [None]:
# Creating DataFrame from dictionary
student_data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [20, 21, 19, 22, 20],
    'Grade': [95, 87, 92, 88, 94],
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney']
}

df = pd.DataFrame(student_data)
print("Student DataFrame:")
print(df)
print(f"\nType: {type(df)}")

In [None]:
# DataFrame attributes
print(f"Shape: {df.shape}")
print(f"Size: {df.size}")
print(f"Columns: {df.columns.tolist()}")
print(f"Index: {df.index.tolist()}")
print(f"Data types:\n{df.dtypes}")
print()

# Quick info about the DataFrame
print("DataFrame info:")
df.info()

In [None]:
# Creating DataFrame from list of dictionaries
employees = [
    {'name': 'John', 'department': 'IT', 'salary': 75000, 'experience': 5},
    {'name': 'Sarah', 'department': 'HR', 'salary': 65000, 'experience': 3},
    {'name': 'Mike', 'department': 'IT', 'salary': 85000, 'experience': 7},
    {'name': 'Lisa', 'department': 'Finance', 'salary': 70000, 'experience': 4},
    {'name': 'Tom', 'department': 'IT', 'salary': 80000, 'experience': 6}
]

emp_df = pd.DataFrame(employees)
print("Employee DataFrame:")
print(emp_df)
print()

In [None]:
# Creating DataFrame from NumPy array
np.random.seed(42)
data_matrix = np.random.randn(5, 4)
columns = ['Feature_A', 'Feature_B', 'Feature_C', 'Feature_D']
index = ['Sample_1', 'Sample_2', 'Sample_3', 'Sample_4', 'Sample_5']

numeric_df = pd.DataFrame(data_matrix, columns=columns, index=index)
print("Numeric DataFrame from NumPy:")
print(numeric_df.round(3))
print()
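
DataFrames are also commonly created from files. The sketch below reads CSV text with `pd.read_csv` and writes it back with `DataFrame.to_csv`. To keep it self-contained it uses an in-memory `io.StringIO` buffer instead of a file on disk; the CSV contents are made up for illustration, and with a real file you would pass a path instead.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would live in a file.
csv_text = """name,department,salary
John,IT,75000
Sarah,HR,65000
Mike,IT,85000
"""

# read_csv accepts a path or any file-like object.
from_csv = pd.read_csv(io.StringIO(csv_text))
print(from_csv)
print(from_csv.dtypes)

# Write the DataFrame back out; index=False omits the row labels.
buffer = io.StringIO()
from_csv.to_csv(buffer, index=False)

# Reading the written text again should reproduce the same table.
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))
print(f"Round trip matches: {round_trip.equals(from_csv)}")
```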

## Basic DataFrame Operations

In [None]:
# Head and tail
print("First 3 rows:")
print(emp_df.head(3))
print()

print("Last 2 rows:")
print(emp_df.tail(2))
print()

# Summary statistics
print("Summary statistics:")
print(emp_df.describe())
print()

In [None]:
# Adding new columns
emp_df['annual_bonus'] = emp_df['salary'] * 0.1
emp_df['total_compensation'] = emp_df['salary'] + emp_df['annual_bonus']

print("DataFrame with new columns:")
print(emp_df)
print()

# Dropping columns (drop returns a new DataFrame; emp_df itself is unchanged)
emp_df_clean = emp_df.drop(columns=['annual_bonus'])
print("After dropping 'annual_bonus' column:")
print(emp_df_clean.columns.tolist())
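
Bracket assignment changes a DataFrame in place, while `drop` returns a new object. One possible way to keep both steps non-mutating is `DataFrame.assign`, sketched here on a small hypothetical `emp` table so the example runs on its own.

```python
import pandas as pd

# Small hypothetical table standing in for the employee data.
emp = pd.DataFrame({"name": ["John", "Sarah"], "salary": [75000, 65000]})

# assign returns a copy with the extra column; emp is left untouched.
with_bonus = emp.assign(annual_bonus=emp["salary"] * 0.1)

# drop also returns a copy; with_bonus still has its salary column.
without_salary = with_bonus.drop(columns=["salary"])

print(emp.columns.tolist())
print(with_bonus.columns.tolist())
print(without_salary.columns.tolist())
```

Working on copies makes it easier to rerun notebook cells in any order, at the cost of some extra memory.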

## Indexing and Selection

In [None]:
# Column selection
print("Names column:")
print(emp_df['name'])
print()

# Multiple column selection
print("Name and salary columns:")
print(emp_df[['name', 'salary']])
print()

# Row selection by index
print("First row:")
print(emp_df.iloc[0])
print()

print("First 3 rows:")
print(emp_df.iloc[:3])
print()

In [None]:
# Boolean indexing
print("IT employees:")
it_employees = emp_df[emp_df['department'] == 'IT']
print(it_employees)
print()

print("High salary employees (>75000):")
high_salary = emp_df[emp_df['salary'] > 75000]
print(high_salary[['name', 'salary']])
print()

# Multiple conditions
print("IT employees with salary > 75000:")
condition = (emp_df['department'] == 'IT') & (emp_df['salary'] > 75000)
print(emp_df[condition][['name', 'salary', 'experience']])
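
Conditions are combined with `&` (and), `|` (or), and `~` (not), and each comparison needs its own parentheses because these operators bind more tightly than `==` or `>`. The sketch below rebuilds a reduced copy of the employee table and shows a few variations, including `isin` for membership tests and `.loc` for selecting rows and columns in one step.

```python
import pandas as pd

# Reduced copy of the employee data so the example is self-contained.
emp = pd.DataFrame({
    "name": ["John", "Sarah", "Mike", "Lisa", "Tom"],
    "department": ["IT", "HR", "IT", "Finance", "IT"],
    "salary": [75000, 65000, 85000, 70000, 80000],
})

# Naming each mask keeps compound filters readable.
is_it = emp["department"] == "IT"
well_paid = emp["salary"] > 75000

# .loc takes the row mask and the column list together.
it_and_well_paid = emp.loc[is_it & well_paid, ["name", "salary"]]
print(it_and_well_paid)

# isin tests membership in a list of values.
hr_or_finance = emp[emp["department"].isin(["HR", "Finance"])]
print(hr_or_finance)

# ~ inverts a mask.
not_it = emp[~is_it]
print(not_it)
```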

In [None]:
# Using .loc and .iloc
print("Using .loc (label-based):")
print(emp_df.loc[0, 'name'])  # Row labeled 0, column 'name'
print(emp_df.loc[:2, ['name', 'salary']])  # Labels 0 through 2 inclusive (3 rows), specific columns
print()

print("Using .iloc (integer-based):")
print(emp_df.iloc[0, 1])  # Position 0, column position 1 ('department')
print(emp_df.iloc[:3, [0, 2]])  # Positions 0-2 (end excluded), columns 0 and 2
print()
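
With the default integer index, labels and positions happen to coincide, which hides the difference between `.loc` and `.iloc`. The sketch below uses a hypothetical `scores` table with index labels 10, 20, 30, 40 to make the distinction visible: `.loc` slices by label and includes the end label, while `.iloc` slices by position and excludes the end position.

```python
import pandas as pd

# Hypothetical data with an index that does not start at 0.
scores = pd.DataFrame({"score": [95, 87, 92, 88]}, index=[10, 20, 30, 40])

# Label-based: rows labeled 10, 20, and 30 (end label included).
by_label = scores.loc[10:30]
print(by_label)

# Position-based: rows at positions 0 and 1 (end position excluded).
by_position = scores.iloc[0:2]
print(by_position)

# The same cell, reached by label and by position.
print(scores.loc[20, "score"], scores.iloc[1, 0])
```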

## Data Inspection and Summary

In [None]:
# Value counts
print("Department distribution:")
print(emp_df['department'].value_counts())
print()

# Unique values
print("Unique departments:")
print(emp_df['department'].unique())
print(f"Number of unique departments: {emp_df['department'].nunique()}")
print()

# Check for missing values
print("Missing values per column:")
print(emp_df.isnull().sum())
print()

# Memory usage
print("Memory usage:")
print(emp_df.memory_usage(deep=True))
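
The employee data contains no missing values, so `isnull().sum()` reports zero for every column. To show what the check looks like when gaps exist, here is a minimal sketch on a hypothetical `readings` table that was not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings with some gaps (np.nan marks a missing value).
readings = pd.DataFrame({
    "sensor": ["a", "b", "c"],
    "temp": [21.5, np.nan, 19.0],
    "humidity": [np.nan, np.nan, 40.0],
})

# isnull() gives a True/False table; summing counts the True values per column.
missing_per_column = readings.isnull().sum()
print(missing_per_column)

# Summing again gives the total number of missing cells.
total_missing = int(missing_per_column.sum())
print(f"Total missing cells: {total_missing}")
```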

In [None]:
# Statistical summary by group
# Note: 'std' is NaN for departments with a single employee (HR, Finance)
print("Salary statistics by department:")
dept_stats = emp_df.groupby('department')['salary'].agg([
    'count', 'mean', 'std', 'min', 'max'
])
print(dept_stats.round(2))
print()

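# Correlation matrix for numeric columns
# Note: annual_bonus and total_compensation are derived from salary,
# so they correlate perfectly with it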
print("Correlation matrix:")
numeric_cols = emp_df.select_dtypes(include=[np.number])
print(numeric_cols.corr().round(3))

## Working with Different Data Types

In [None]:
# Creating DataFrame with mixed data types
mixed_data = {
    'product_id': [1, 2, 3, 4, 5],
    'product_name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],
    'price': [999.99, 25.50, 75.00, 299.99, 89.99],
    'in_stock': [True, False, True, True, False],
    'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Accessories'],
    'release_date': ['2023-01-15', '2023-02-01', '2023-01-20', '2023-03-01', '2023-02-15']
}

products_df = pd.DataFrame(mixed_data)
print("Original data types:")
print(products_df.dtypes)
print()

# Convert data types
products_df['release_date'] = pd.to_datetime(products_df['release_date'])
products_df['category'] = products_df['category'].astype('category')

print("After type conversion:")
print(products_df.dtypes)
print()

print("DataFrame with proper types:")
print(products_df)
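
Converting types is worthwhile because it unlocks type-specific accessors. As a rough sketch (rebuilding only the two converted columns so it runs standalone), the `.dt` accessor exposes date parts on datetime columns and the `.cat` accessor exposes the levels and integer codes behind a categorical column.

```python
import pandas as pd

# The two converted columns from the products example, rebuilt standalone.
products = pd.DataFrame({
    "category": ["Electronics", "Accessories", "Accessories",
                 "Electronics", "Accessories"],
    "release_date": ["2023-01-15", "2023-02-01", "2023-01-20",
                     "2023-03-01", "2023-02-15"],
})
products["release_date"] = pd.to_datetime(products["release_date"])
products["category"] = products["category"].astype("category")

# .dt works only on datetime columns, not on date strings.
release_month = products["release_date"].dt.month
print(release_month.tolist())

# A categorical column stores each distinct label once plus an integer code per row.
category_levels = list(products["category"].cat.categories)
category_codes = products["category"].cat.codes.tolist()
print(category_levels)
print(category_codes)
```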

## Practical Example: Sales Data Analysis

In [None]:
# Create sample sales data
np.random.seed(42)
n_sales = 100

sales_data = {
    'sale_id': range(1, n_sales + 1),
    'customer_id': np.random.randint(1, 21, n_sales),
    'product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Headphones', 'Watch'], n_sales),
    'quantity': np.random.randint(1, 6, n_sales),
    'unit_price': np.random.uniform(50, 1000, n_sales).round(2),
    'date': pd.date_range('2023-01-01', periods=n_sales, freq='D')
}

sales_df = pd.DataFrame(sales_data)
sales_df['total_amount'] = sales_df['quantity'] * sales_df['unit_price']
sales_df['month'] = sales_df['date'].dt.month

print("Sales DataFrame (first 10 rows):")
print(sales_df.head(10))
print(f"\nShape: {sales_df.shape}")

In [None]:
# Basic analysis
print("Sales Summary:")
print(f"Total revenue: ${sales_df['total_amount'].sum():,.2f}")
print(f"Average order value: ${sales_df['total_amount'].mean():.2f}")
print(f"Total units sold: {sales_df['quantity'].sum():,}")
print(f"Unique customers: {sales_df['customer_id'].nunique()}")
print()

# Top products by revenue
print("Top products by revenue:")
product_revenue = sales_df.groupby('product')['total_amount'].sum().sort_values(ascending=False)
print(product_revenue.round(2))
print()

# Monthly sales
print("Monthly sales:")
monthly_sales = sales_df.groupby('month').agg({
    'total_amount': 'sum',
    'quantity': 'sum',
    'sale_id': 'count'
}).round(2)
monthly_sales.columns = ['Revenue', 'Units_Sold', 'Number_of_Sales']
print(monthly_sales)
print()

# Best customers
print("Top 5 customers by total spending:")
top_customers = sales_df.groupby('customer_id')['total_amount'].sum().sort_values(ascending=False).head(5)
print(top_customers.round(2))
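
The monthly summary above aggregates with a dictionary and then renames the columns in a second step. An alternative is named aggregation, where each output column is declared together with its source column and function. The sketch below uses a tiny hypothetical `sales` table rather than the random data, so the numbers are easy to check by hand.

```python
import pandas as pd

# Tiny hypothetical sales table with two sales in each of two months.
sales = pd.DataFrame({
    "sale_id": [1, 2, 3, 4],
    "month": [1, 1, 2, 2],
    "quantity": [2, 1, 3, 1],
    "total_amount": [200.0, 50.0, 300.0, 80.0],
})

# Each keyword is an output column: name=(source column, aggregation).
monthly = sales.groupby("month").agg(
    Revenue=("total_amount", "sum"),
    Units_Sold=("quantity", "sum"),
    Number_of_Sales=("sale_id", "count"),
)
print(monthly)
```

This keeps the column names next to the calculations that produce them, so reordering the aggregations cannot silently mislabel a column.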

## Key Takeaways

1. **Series**: 1D labeled arrays; each value is paired with an index label, so a Series behaves like a cross between a list and a dictionary
2. **DataFrame**: 2D labeled data structure, like a spreadsheet or SQL table
3. **Creation**: Can create from dictionaries, lists, NumPy arrays, and files
4. **Selection**: Use brackets, `.loc[]`, and `.iloc[]` for different selection needs
5. **Boolean indexing**: Filter data using conditions
6. **Data types**: Pandas infers a dtype for each column, but some conversions (for example, date strings to datetime) must be done explicitly
7. **Inspection**: Use `.info()`, `.describe()`, `.head()`, `.tail()` to understand data
8. **Groupby**: Powerful tool for aggregating data by categories

## Common DataFrame Attributes and Methods

**Attributes:**
- `.shape`, `.size`, `.columns`, `.index`, `.dtypes`
- `.values` (the data as a NumPy array; `.to_numpy()` is the recommended equivalent)

**Inspection:**
- `.head()`, `.tail()`, `.info()`, `.describe()`
- `.isnull()`, `.nunique()`, `.value_counts()`

**Selection:**
- `df['column']`, `df[['col1', 'col2']]`
- `.loc[]`, `.iloc[]`, boolean indexing

## Practice Exercises

1. Create a DataFrame of movie data (title, genre, year, rating, box_office) and analyze it
2. Load a CSV file and perform basic data exploration
3. Create a student gradebook system using pandas
4. Build a simple inventory management system
5. Analyze time series data (stock prices, weather, etc.)

## Next Steps

In the next notebook, we'll explore:
- Data cleaning and preprocessing
- Handling missing values
- Data transformation and reshaping
- Merging and joining DataFrames
- Advanced groupby operations