# Lecture 3: Introduction to Pandas & Data Structures - Building Your Foundation

## Learning Objectives

By the end of this lecture, you will be able to:
- Import and understand the pandas library for data manipulation
- Work with pandas Series and DataFrame structures effectively
- Load and inspect the Washington D.C. bike-sharing dataset
- Understand basic data types and structures in Python data analysis
- Apply fundamental pandas operations for data exploration

---

## 1. Your Journey as a Data Consultant Begins

Welcome to your first day as a junior data consultant! Your client, a growing bike-sharing startup, has asked you to help them understand their data and build better demand forecasting capabilities. Like any professional consultant, you need to start with the fundamentals - understanding your tools and your data.

Think of this like learning to be a craftsperson. Before a carpenter builds a house, they master their tools: saws, hammers, measuring devices. Before a chef creates a restaurant menu, they understand ingredients, cooking techniques, and basic preparations. As a data consultant, your primary tools are programming languages and libraries - and pandas is the most essential tool for data work in Python.

Today, you'll learn pandas not as an abstract programming concept, but as the foundation that will enable you to help your client make better business decisions. Every technique you master here will directly contribute to solving real transportation challenges.

## 2. The Pandas Library: Your Data Manipulation Powerhouse

### What is Pandas and Why Do You Need It?

Pandas is like having a powerful Swiss Army knife for working with data. Its name comes from "panel data" and also works as shorthand for "Python data analysis". Just as a mechanic wouldn't try to fix a car with their bare hands, you wouldn't want to analyze data without pandas. It provides the tools you need to load, clean, analyze, and manipulate data efficiently.

For your bike-sharing client, this means pandas will help you:
- Load their historical rental data from files
- Clean and prepare messy real-world data
- Calculate important business metrics like peak usage times
- Identify patterns that will inform demand predictions
- Transform raw operational data into actionable insights

Think of pandas as your data laboratory - a place where raw information becomes valuable business intelligence.

### Understanding Your Data Toolkit Components

Pandas provides two fundamental structures for organizing data, much like how a workshop has different types of containers for different purposes:

**Series**: Like a single column of a spreadsheet or a list with labels. Perfect for storing one type of information (like all the temperatures recorded, or all the bike rental counts).

**DataFrame**: Like a complete spreadsheet with rows and columns. This is where you'll store your full bike-sharing dataset with all variables together.

### Importing Pandas: Setting Up Your Workshop

Before you can use any tool, you need to make it available in your workspace. In Python, this means importing pandas:

In [None]:
import pandas as pd

This simple line does several important things:
- Makes all pandas functions available to use
- Creates the shorthand `pd` so you don't have to type `pandas` every time
- Follows the naming convention used in the pandas documentation and in most Python data analysis code

The `as pd` part is like giving pandas a nickname that everyone in the data science community recognizes. When you see `pd.` in code, you immediately know you're working with pandas operations.

## 3. Series: Your First Data Structure

### Understanding Series Through Bike-Sharing Examples

A Series is like a single column of data with labels for each value. Imagine you're tracking the number of bikes rented each hour during a typical Monday morning:

In [None]:
hourly_rentals = pd.Series([15, 23, 45, 67, 89, 156, 234, 287])
print(hourly_rentals)

This creates a Series where:
- The values (15, 23, 45, etc.) represent bike rentals for each hour
- Each value gets an automatic index (0, 1, 2, etc.) like row numbers
- You can think of the index as the "label" for each measurement

### Practical Series Creation for Transportation Data

Let's create a more realistic example using actual time labels:

In [None]:
morning_hours = ['6 AM', '7 AM', '8 AM', '9 AM', '10 AM']
morning_rentals = pd.Series([23, 67, 156, 89, 45], index=morning_hours)
print(morning_rentals)

Now your Series has meaningful labels that make business sense. When you show this to your client, they can immediately understand that 8 AM has the highest rental count (156 bikes), which aligns with morning commute patterns.
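
Labels are also how you retrieve values. The sketch below reuses the illustrative numbers from the example above to look up one hour, find the busiest hour, and slice a range of hours by label. The variable names `rentals_at_8`, `busiest_hour`, and `commute_window` are introduced here for illustration only.

```python
import pandas as pd

# Same illustrative numbers as the example above
morning_hours = ['6 AM', '7 AM', '8 AM', '9 AM', '10 AM']
morning_rentals = pd.Series([23, 67, 156, 89, 45], index=morning_hours)

# Look up a single value by its label
rentals_at_8 = morning_rentals['8 AM']

# idxmax() returns the label of the largest value
busiest_hour = morning_rentals.idxmax()

# Slicing by label includes both endpoints
commute_window = morning_rentals['7 AM':'9 AM']

print(rentals_at_8)
print(busiest_hour)
print(commute_window)
```

Unlike position-based slicing in plain Python lists, a label slice keeps the end label, so `commute_window` contains three hours.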

### Why Series Matter for Your Consulting Work

Series are more than just lists - they're intelligent containers that:
- Keep related information together (values and their meanings)
- Enable mathematical operations (calculating averages, totals, trends)
- Support easy filtering and selection (finding peak hours, low-demand periods)
- Export to formats that business reporting tools can read, such as CSV and Excel

For transportation consulting, Series help you organize time-based patterns, station-specific metrics, and categorical data like weather conditions or user types.
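
As a minimal sketch of the capabilities listed above, the following example computes a total and an average, filters with a boolean condition, and applies arithmetic to every value at once. It uses the same illustrative morning numbers; `above_average` and `share_of_morning` are names chosen for this example.

```python
import pandas as pd

morning_rentals = pd.Series(
    [23, 67, 156, 89, 45],
    index=['6 AM', '7 AM', '8 AM', '9 AM', '10 AM'],
)

# Mathematical operations work on the whole Series at once
total_rentals = morning_rentals.sum()
average_rentals = morning_rentals.mean()

# Filtering: the comparison produces True/False per hour,
# and indexing with it keeps only the True rows
above_average = morning_rentals[morning_rentals > average_rentals]

# Arithmetic is applied element by element, and the labels are preserved
share_of_morning = morning_rentals / total_rentals

print(total_rentals, average_rentals)
print(above_average)
print(share_of_morning.round(2))
```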

## 4. DataFrame: Your Complete Data Laboratory

### Understanding DataFrames as Digital Spreadsheets

If Series are like single columns, DataFrames are like complete spreadsheets with multiple columns and rows. For your bike-sharing client, a DataFrame would contain all the information about each rental period: time, weather, bike counts, user types, and more.

Think of a DataFrame as a comprehensive record where each row represents one time period (like one hour) and each column represents one type of measurement (like temperature, humidity, bike count).

### Creating Your First Transportation DataFrame

Let's build a small DataFrame that represents what your client's data looks like:

In [None]:
bike_data = pd.DataFrame({
    'hour': ['6 AM', '7 AM', '8 AM', '9 AM'],
    'temperature': [45, 47, 52, 55],
    'rentals': [23, 67, 156, 89],
    'weather': ['Clear', 'Clear', 'Partly Cloudy', 'Clear']
})
print(bike_data)

This creates a structured table where:
- Each row represents one hour of operations
- Each column represents one type of measurement
- Each column holds one data type, but different columns can hold different types (text in `weather`, numbers in `rentals`)
- All related information stays connected
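
To see how pandas has typed each column, you can inspect the `dtypes` attribute. This sketch rebuilds the small illustrative table and lists which columns are numeric. Text columns typically appear as `object`, though the exact name depends on your pandas version.

```python
import pandas as pd

bike_data = pd.DataFrame({
    'hour': ['6 AM', '7 AM', '8 AM', '9 AM'],
    'temperature': [45, 47, 52, 55],
    'rentals': [23, 67, 156, 89],
    'weather': ['Clear', 'Clear', 'Partly Cloudy', 'Clear'],
})

# One data type per column
print(bike_data.dtypes)

# Keep only the columns pandas considers numeric
numeric_columns = bike_data.select_dtypes(include='number').columns.tolist()
print(numeric_columns)
```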

### DataFrame Components and Business Value

Every DataFrame has key components that make it powerful for business analysis:

**Columns**: Each variable you're tracking (temperature, rentals, weather conditions). These become the factors you'll analyze to understand demand patterns.

**Rows**: Each observation or time period. In transportation, this is usually time-based (hourly, daily) but could be trip-based or station-based.

**Index**: The row identifiers. By default, these are numbers (0, 1, 2), but you can use dates, station IDs, or other meaningful identifiers.

**Values**: The actual data inside the table. This is the business information you'll analyze to generate insights and predictions.
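
Each of these components can be inspected directly. The sketch below uses the same illustrative table and then replaces the default numeric index with the `hour` column, which allows lookups by a meaningful label. The name `by_hour` is chosen for this example.

```python
import pandas as pd

bike_data = pd.DataFrame({
    'hour': ['6 AM', '7 AM', '8 AM', '9 AM'],
    'temperature': [45, 47, 52, 55],
    'rentals': [23, 67, 156, 89],
    'weather': ['Clear', 'Clear', 'Partly Cloudy', 'Clear'],
})

print(bike_data.columns)     # column names
print(bike_data.index)       # row identifiers (0, 1, 2, 3 by default)
print(bike_data.shape)       # (number of rows, number of columns)
print(bike_data.to_numpy())  # the underlying values as an array

# Use the hour as the row identifier instead of 0, 1, 2, ...
by_hour = bike_data.set_index('hour')

# .loc selects by row label and column name
rentals_at_8 = by_hour.loc['8 AM', 'rentals']
print(rentals_at_8)
```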

### Advanced DataFrame Understanding for Professional Work

DataFrames are not just storage containers - they're analysis platforms. When you create a DataFrame for your client's bike-sharing data, you're creating a foundation for:

**Trend Analysis**: Comparing bike rentals across different time periods to identify peak and off-peak patterns that inform station stocking decisions.

**Correlation Discovery**: Understanding relationships between weather conditions and demand to improve demand forecasting accuracy.

**Operational Insights**: Identifying patterns that help optimize bike redistribution, maintenance scheduling, and capacity planning.

**Business Reporting**: Creating summaries and visualizations that communicate findings to non-technical stakeholders in your client's organization.

## 5. Loading Real-World Data: Working with the Washington D.C. Dataset

### Understanding Your Client's Data Source

Your bike-sharing startup client has provided you with historical data from the Washington D.C. bike-sharing system. This real-world dataset contains many of the complexities you'll encounter in professional consulting: multiple variables, different data types, time series information, and the messiness of actual operations.

The dataset includes:
- **Temporal Information**: Date and time of each rental period
- **Weather Data**: Temperature, humidity, wind speed, and weather conditions
- **Usage Metrics**: Casual users, registered users, and total counts
- **Operational Context**: Holiday indicators, working day flags, and seasonal information

### The Data Loading Process

Loading data from files is a fundamental consulting skill. Your client's data is stored in CSV (Comma-Separated Values) format, which is the standard for sharing tabular data between systems:

In [None]:
# Define the path to your data file
data_path = "references/datasets/washington/dataset.csv"

# Load the data into a DataFrame
df = pd.read_csv(data_path)

# Display basic information about your dataset
print(df.head())

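The `pd.read_csv()` function handles several details for you by default:
- Treats the first row of the file as the column headers
- Infers a data type for each column from its contents
- Recognizes common missing-value markers (empty fields, `NA`, `NaN`) and stores them as missing values
- Accepts optional parameters for other delimiters, encodings, and formatting conventions

One thing it does not do by default is parse dates: a date column is loaded as text until you convert it, which Section 7 covers.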
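
You can watch these defaults at work without touching the real file. The sketch below reads a tiny CSV from an in-memory string. The three rows are made up for illustration and are not taken from the client dataset, although the column names follow the ones used later in this lecture.

```python
import io
import pandas as pd

# Made-up sample: one "NA" marker and one empty field
csv_text = """datetime,temp,humidity,count
2011-01-01 00:00:00,9.84,81,16
2011-01-01 01:00:00,9.02,NA,40
2011-01-01 02:00:00,,80,32
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.dtypes)          # humidity becomes float because it has a missing value
print(sample.isnull().sum())  # one missing value each in temp and humidity

# Optional: ask read_csv to parse the date column while loading
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['datetime'])
print(parsed['datetime'].dtype)
```

Note that `humidity` is read as a floating-point column even though its visible values are whole numbers, because the default integer type cannot represent a missing value.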

### Professional Data Inspection Techniques

Once you've loaded the data, professional practice requires systematic inspection to understand what you're working with. This isn't just technical due diligence - it's essential for providing accurate consulting advice:

**Dataset Shape and Size**:

In [None]:
print(f"Dataset contains {df.shape[0]} rows and {df.shape[1]} columns")

This tells your client how much historical data they have available for analysis, which affects the reliability of predictions you can make.

**Column Information and Data Types**:

In [None]:
df.info()  # info() prints its report directly and returns None, so print() is not needed

Understanding data types is crucial for choosing appropriate analysis techniques and identifying potential data quality issues.

**Statistical Summary**:

In [None]:
print(df.describe())

By default, `describe()` reports the count, mean, standard deviation, minimum, quartiles, and maximum of each numerical column. These ranges help you spot outliers and confirm that the data makes business sense.
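
Because `describe()` skips text columns by default, it helps to pair it with `value_counts()` for categorical information. The sketch below uses a small made-up table, not the client data, to show both summaries side by side.

```python
import pandas as pd

# Made-up values for illustration
sample = pd.DataFrame({
    'temp': [9.8, 9.0, 13.1, 15.6],
    'weather': ['Clear', 'Clear', 'Mist', 'Clear'],
    'count': [16, 40, 110, 95],
})

# Numerical columns only: 'weather' is left out
numeric_summary = sample.describe()
print(numeric_summary)

# Frequency of each category in a text column
weather_counts = sample['weather'].value_counts()
print(weather_counts)
```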

### Connecting Data Structure to Business Questions

Every column in your DataFrame corresponds to a potential business insight:

**Datetime Column**: Enables time-based analysis to identify peak usage periods, seasonal trends, and operational patterns that inform staffing and inventory decisions.

**Weather Variables**: Allow demand forecasting based on weather predictions, helping optimize bike availability before weather changes.

**User Type Segmentation**: Casual vs. registered users have different usage patterns, enabling targeted service improvements and pricing strategies.

**Count Variables**: The total count is the target variable you'll eventually predict, representing the business outcome your client cares most about. The casual and registered counts break that total down by user type.

Understanding these connections transforms technical data loading into strategic business preparation.

## 6. Essential DataFrame Operations for Transportation Analysis

### Data Selection: Finding What Matters to Your Client

Professional data analysis requires the ability to extract specific information that answers business questions. Pandas provides multiple ways to select data, each appropriate for different consulting scenarios:

**Column Selection for Focused Analysis**:

In [None]:
# Select single variable for trend analysis
bike_counts = df['count']

# Select multiple related variables for weather impact analysis
weather_data = df[['temp', 'humidity', 'windspeed', 'count']]

**Row Selection for Time-Based Analysis**:

In [None]:
# First 100 rows (the first 100 recorded hours)
early_operations = df.head(100)

# Approximate first month by row position (24 hours × 31 days = 744 rows).
# This only matches January if rows are sorted by time with no missing hours.
january_data = df.iloc[0:744]

**Conditional Selection for Business Intelligence**:

In [None]:
# High-demand periods (above average usage)
peak_usage = df[df['count'] > df['count'].mean()]

# Weather-specific analysis
rainy_days = df[df['weather'] == 3]  # Weather code 3 represents light rain; each row is one hour, not one day
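
Selecting by row position assumes every hour is present in the file. A more robust sketch, shown here on a small made-up table with one hour deliberately missing, selects by the date itself and combines several conditions. The column names follow this lecture; the values are invented for illustration.

```python
import pandas as pd

# Made-up hourly records across a month boundary; 02:00 on Feb 1 is missing
sample = pd.DataFrame({
    'datetime': pd.to_datetime([
        '2011-01-31 22:00', '2011-01-31 23:00',
        '2011-02-01 00:00', '2011-02-01 01:00', '2011-02-01 03:00',
    ]),
    'weather': [1, 3, 3, 1, 2],
    'count': [40, 12, 8, 25, 30],
})

# Select January by date rather than by row position
january = sample[sample['datetime'].dt.month == 1]

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
quiet_rainy = sample[(sample['weather'] == 3) & (sample['count'] < 10)]

# .loc takes a row condition and a list of columns in one step
busy_counts = sample.loc[sample['count'] > sample['count'].mean(), ['datetime', 'count']]

print(january)
print(quiet_rainy)
print(busy_counts)
```

If the data spans more than one year, `dt.month == 1` matches January of every year, so add a condition on `dt.year` when you need a single month.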

### Data Inspection for Quality Assurance

Before conducting any analysis for your client, professional practice requires validating data quality. This protects both your reputation and your client's business decisions:

**Missing Data Detection**:

In [None]:
# Count missing values in each column
missing_summary = df.isnull().sum()
print(missing_summary)

**Value Range Validation**:

In [None]:
# Check if bike counts fall within reasonable ranges
print(f"Bike count range: {df['count'].min()} to {df['count'].max()}")
print(f"Temperature range: {df['temp'].min()} to {df['temp'].max()}")

**Data Distribution Understanding**:

In [None]:
# Statistical summary for key business metrics
print(df[['temp', 'humidity', 'count']].describe())
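
These checks can be turned into explicit rules. The sketch below runs three of them on a small made-up table that contains one missing value, one impossible row, and one duplicate. The bounds used (humidity between 0 and 100, non-negative counts) are assumptions for illustration and should be confirmed against the client's data documentation.

```python
import pandas as pd

# Made-up values with deliberate quality problems
sample = pd.DataFrame({
    'temp': [9.8, 9.0, None, 15.6, 9.0],
    'humidity': [81, 80, 75, 130, 80],
    'count': [16, 40, 32, -5, 40],
})

# Share of missing values per column (0.2 means 20% missing)
missing_share = sample.isnull().mean()

# Rows that break the assumed rules: negative counts or humidity outside 0-100
invalid_rows = sample[(sample['count'] < 0) | ~sample['humidity'].between(0, 100)]

# Number of rows that exactly repeat an earlier row
duplicate_rows = sample.duplicated().sum()

print(missing_share)
print(invalid_rows)
print(duplicate_rows)
```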

### Connecting Operations to Business Outcomes

Each pandas operation you learn serves a specific business purpose:

**Selection Operations**: Enable focused analysis on specific aspects of the business (weather sensitivity, user behavior, temporal patterns) that directly inform strategic decisions.

**Inspection Operations**: Ensure data reliability, which builds client confidence and prevents costly mistakes based on flawed information.

**Summary Operations**: Provide the statistical foundation for making data-driven recommendations about operations, pricing, and strategic investments.

Mastering these fundamental operations creates the foundation for all advanced analysis you'll perform as a transportation consultant.

## 7. Data Types and Their Business Implications

### Understanding Data Types in Transportation Context

Not all data is created equal, and understanding the different types of information in your dataset directly affects the analysis techniques you can apply and the business insights you can generate.

**Numerical Data**: Information that can be measured or counted
- **Continuous**: Temperature, humidity, wind speed (can take any value within a range)
- **Discrete**: Bike counts, user counts (whole numbers only)

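**Categorical Data**: Information that represents categories or groups
- **Nominal**: Season, user type (categories with no ranking)
- **Ordinal**: Weather severity codes (1=Clear, 2=Misty, 3=Light Rain, 4=Heavy Rain), where a higher code means worse weather

Because these codes are stored as integers, pandas treats them as numbers unless you convert them. The order is meaningful, but the distance between codes is not: code 4 is not "twice as bad" as code 2.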

**Temporal Data**: Time-based information that enables trend analysis and forecasting
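
One way to tell pandas that a column is ordinal is an ordered categorical. The sketch below maps the weather codes listed above to their labels and declares their order. The sample codes are made up, and `severity_labels` is a name chosen for this example.

```python
import pandas as pd

# Made-up sequence of weather codes
weather_codes = pd.Series([1, 3, 2, 1, 4])

# Code meanings as listed in this section
severity_labels = {1: 'Clear', 2: 'Misty', 3: 'Light Rain', 4: 'Heavy Rain'}

# Declare the categories in order from mildest to most severe
weather = pd.Series(pd.Categorical(
    weather_codes.map(severity_labels),
    categories=['Clear', 'Misty', 'Light Rain', 'Heavy Rain'],
    ordered=True,
))

# Ordered categories support comparisons and min/max
worse_than_misty = weather[weather > 'Misty']
worst_weather = weather.max()

print(weather.dtype)
print(worse_than_misty)
print(worst_weather)
```

With the order declared, comparisons follow severity rather than alphabetical order, and arithmetic such as averaging the labels is no longer possible by accident.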

### Data Type Impact on Analysis Capabilities

Different data types enable different types of business analysis:

**Numerical Data Analysis**:
- Calculate averages, trends, and correlations
- Build predictive models for demand forecasting
- Identify optimal operating ranges (temperature ranges with highest demand)

**Categorical Data Analysis**:
- Compare performance across different conditions (sunny vs. rainy days)
- Segment users and customize services
- Identify operational factors that influence demand

**Temporal Data Analysis**:
- Forecast future demand based on historical patterns
- Identify seasonal trends for capacity planning
- Optimize maintenance and rebalancing schedules

Understanding these capabilities helps you choose the right analysis approach for each business question your client presents.
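
As a small illustration of how the data type drives the technique, the sketch below compares average rentals across a categorical column with `groupby` and measures the relationship between two numerical columns with `corr`. All values are made up, so the results say nothing about the real dataset.

```python
import pandas as pd

# Made-up values for illustration
sample = pd.DataFrame({
    'weather': ['Clear', 'Rain', 'Clear', 'Rain', 'Clear'],
    'temp': [10.0, 12.0, 20.0, 14.0, 25.0],
    'count': [50, 20, 120, 30, 150],
})

# Categorical analysis: compare average rentals across conditions
mean_by_weather = sample.groupby('weather')['count'].mean()

# Numerical analysis: Pearson correlation between temperature and rentals
temp_count_corr = sample['temp'].corr(sample['count'])

print(mean_by_weather)
print(round(temp_count_corr, 2))
```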

### Data Type Conversion for Enhanced Analysis

Sometimes you need to convert between data types to enable more sophisticated analysis:

In [None]:
# Convert datetime strings to actual datetime objects for time-based analysis
df['datetime'] = pd.to_datetime(df['datetime'])

# Extract useful time components for business analysis
df['hour'] = df['datetime'].dt.hour
df['day_of_week'] = df['datetime'].dt.day_name()
df['month'] = df['datetime'].dt.month

These conversions enable more sophisticated business questions:
- Which hours of the day have highest demand?
- Do weekends show different patterns than weekdays?
- How does rental demand vary from month to month across the seasons?
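
Here is a minimal sketch of how the extracted components answer the first two questions. It uses four made-up records, a Saturday and a Monday, rather than the client data. The `is_weekend` column is an addition for this example and relies on `dt.dayofweek`, which numbers Monday as 0 and Sunday as 6.

```python
import pandas as pd

# Made-up records: 2011-01-01 is a Saturday, 2011-01-03 is a Monday
sample = pd.DataFrame({
    'datetime': ['2011-01-01 08:00:00', '2011-01-01 17:00:00',
                 '2011-01-03 08:00:00', '2011-01-03 17:00:00'],
    'count': [20, 60, 150, 170],
})

# Convert text to datetime, then extract components
sample['datetime'] = pd.to_datetime(sample['datetime'])
sample['hour'] = sample['datetime'].dt.hour
sample['is_weekend'] = sample['datetime'].dt.dayofweek >= 5

# Which hour has the highest average demand?
mean_by_hour = sample.groupby('hour')['count'].mean()
peak_hour = mean_by_hour.idxmax()

# Do weekends differ from weekdays?
mean_by_weekend = sample.groupby('is_weekend')['count'].mean()

print(mean_by_hour)
print(peak_hour)
print(mean_by_weekend)
```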

---

## Summary and Transition to Data Quality Implementation

You now have the pandas fundamentals that the rest of this engagement depends on. You can create Series and DataFrames, load the Washington D.C. bike-sharing dataset from a CSV file, and inspect its shape, data types, and summary statistics. You can also select the rows and columns that answer a specific business question, check for missing or implausible values, and convert the datetime column into components such as hour, day of week, and month.

These skills turn raw data files into structured tables you can analyze with confidence. Combining technical accuracy with a clear link to the business question is what makes your findings reliable enough for your client to act on.

The foundation you've built in this lecture - understanding pandas structures, loading real-world data, and applying professional practices - supports everything you'll do as a transportation consultant. In our next lecture, we'll build on it by learning how to assess data quality and how to clean and prepare messy real-world data for analysis, turning the raw information into reliable insights your clients can act upon.