# Lecture 3: Introduction to Pandas & Data Structures - Building Your Foundation

## Learning Objectives

By the end of this lecture, you will be able to:
- Import and understand the pandas library for data manipulation
- Work with pandas Series and DataFrame structures effectively
- Load and inspect the Washington D.C. bike-sharing dataset
- Understand basic data types and structures in Python data analysis
- Apply fundamental pandas operations for data exploration

---

## 1. Your Journey as a Data Consultant Begins

Welcome to your first day as a junior data consultant! Your client, a growing bike-sharing startup, has asked you to help them understand their data and build better demand forecasting capabilities. Like any professional consultant, you need to start with the fundamentals - understanding your tools and your data.

Think of this like learning to be a craftsperson. Before a carpenter builds a house, they master their tools: saws, hammers, measuring devices. As a data consultant, your primary tools are programming languages and libraries - and pandas is the most essential tool for data work in Python.

Today, you'll learn pandas not as an abstract programming concept, but as the foundation that will enable you to help your client make better business decisions. Every technique you master here will directly contribute to solving real transportation challenges.

## 2. The Pandas Library: Your Data Manipulation Powerhouse

### What is Pandas and Why Do You Need It?

Pandas is like having a powerful Swiss Army knife for working with data. Just as a mechanic wouldn't try to fix a car with their bare hands, you wouldn't want to analyze data without pandas. It provides the tools you need to load, clean, and manipulate data efficiently.

For your bike-sharing client, this means pandas will help you:
- Load their historical rental data from files
- Clean and prepare messy real-world data
- Transform raw operational data into actionable insights

### Understanding Your Data Toolkit Components

Pandas provides two fundamental structures for organizing data, much like how a workshop has different types of containers for different purposes:

1. **Series**: Like a single column of a spreadsheet. Perfect for storing one type of information (like all the temperatures recorded).

2. **DataFrame**: Like a complete spreadsheet with rows and columns. This is where you'll store your full bike-sharing dataset with all variables together.

### Importing Pandas: Setting Up Your Workshop

Before you can use any tool, you need to make it available in your workspace. In Python, this means importing pandas:

In [None]:
import pandas as pd

This simple line does several important things:
- Makes all pandas functions available to use
- Creates the shorthand `pd` so you don't have to type `pandas` every time
- Follows the standard convention used throughout the Python data analysis community

The `as pd` part is like giving pandas a nickname that everyone in the data science community recognizes. When you see `pd.` in code, you immediately know you're working with pandas operations.

## 3. Series: Your First Data Structure

### Understanding Series Through Bike-Sharing Examples

A Series is like a single column of data with labels for each value. Imagine you're tracking the number of bikes rented each hour during a typical Monday morning:

In [None]:
hourly_rentals = pd.Series([15, 23, 45, 67, 89, 156, 234, 287])
print(hourly_rentals)

This creates a Series, named `hourly_rentals`, where:
- The values (15, 23, 45, etc.) represent bike rentals for each hour
- Each value gets an automatic index (0, 1, 2, etc.) like row numbers
- You can think of the index as the "label" for each measurement

> **Note:** In Python, numbering (indexing) starts at 0, not 1. So the first item is at position 0.

Note that since we already imported pandas as `pd`, we can now use `pd.Series` to easily create our first data structure.
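To make zero-based positions concrete, here is a minimal sketch that recreates the same Series and reads values back by position with `.iloc`. The names `first_hour` and `last_hour` are only illustrative.

```python
import pandas as pd

# Same values as the hourly_rentals example
hourly_rentals = pd.Series([15, 23, 45, 67, 89, 156, 234, 287])

# The automatic index runs from 0 to 7
print(list(hourly_rentals.index))

# .iloc selects by position: 0 is the first item, -1 is the last
first_hour = hourly_rentals.iloc[0]
last_hour = hourly_rentals.iloc[-1]
print(first_hour, last_hour)
```

Using `.iloc` makes it explicit that you are asking for a position rather than a label, which matters once the index is no longer the default 0, 1, 2 sequence.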

### Practical Series Creation for Transportation Data

Let's create a more realistic example using actual time labels:

In [None]:
morning_hours = ['6 AM', '7 AM', '8 AM', '9 AM', '10 AM']
morning_rentals = pd.Series([23, 67, 156, 89, 45], index=morning_hours)
print(morning_rentals)

Explanation of the code:
- `morning_hours = ['6 AM', '7 AM', '8 AM', '9 AM', '10 AM']` This line creates a Python list with time labels. These labels will serve as the **index** of the Series, representing the hour of the morning.
- `morning_rentals = pd.Series([23, 67, 156, 89, 45], index=morning_hours)` Here, we use `pd.Series` to create a pandas Series. The first argument `[23, 67, 156, 89, 45]` represents the bike rental counts, while the `index=morning_hours` argument assigns each value to a specific time label.
- `print(morning_rentals)` This line prints the Series to the console, displaying the bike rental counts along with their corresponding time labels.

Now your Series has meaningful labels that make business sense. When you show this to your client, they can immediately understand that 8 AM has the highest rental count (156 bikes), which aligns with morning commute patterns.
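With labels in place, you can look values up by name instead of by position. The short sketch below rebuilds the same Series and uses `idxmax()` to find the busiest hour; the names `rentals_at_8` and `busiest_hour` are illustrative.

```python
import pandas as pd

morning_hours = ['6 AM', '7 AM', '8 AM', '9 AM', '10 AM']
morning_rentals = pd.Series([23, 67, 156, 89, 45], index=morning_hours)

# Look up one value by its label
rentals_at_8 = morning_rentals['8 AM']

# idxmax() returns the label of the largest value
busiest_hour = morning_rentals.idxmax()
print(f"Busiest hour: {busiest_hour} with {morning_rentals.max()} rentals")
```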

### Why Series Matter for Your Consulting Work

Series are more than just lists - they're intelligent containers that:
- Keep related information together (values and their meanings)
- Enable mathematical operations (calculating averages, totals, trends)
- Support easy filtering and selection (finding peak hours, low-demand periods)

For transportation consulting, Series help you organize time-based patterns, station-specific metrics, and categorical data like weather conditions or user types.
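As a rough illustration of these capabilities, the sketch below applies a total, an average, and a filter to the morning rentals Series. The variable names are illustrative.

```python
import pandas as pd

morning_rentals = pd.Series(
    [23, 67, 156, 89, 45],
    index=['6 AM', '7 AM', '8 AM', '9 AM', '10 AM'],
)

# Mathematical operations work on the whole Series at once
total_rentals = morning_rentals.sum()
average_rentals = morning_rentals.mean()
print(total_rentals, average_rentals)

# Boolean filtering keeps only the hours above the average
busy_hours = morning_rentals[morning_rentals > average_rentals]
print(busy_hours)
```

The filter result is itself a Series, so the time labels stay attached to the values that pass the condition.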

## 4. DataFrame: Your Complete Data Laboratory

### Understanding DataFrames as Digital Spreadsheets

If Series are like single columns, DataFrames are like complete spreadsheets with multiple columns and rows. For your bike-sharing client, a DataFrame would contain all the information about each rental period: time, weather, bike counts, user types, and more.

Think of a DataFrame as a comprehensive record where each row represents one time period (like one hour) and each column represents one type of measurement (like temperature, humidity, bike count).

### Creating Your First Transportation DataFrame

Let's build a small DataFrame that represents what your client's data looks like:

In [None]:
bike_data = pd.DataFrame({
    'hour': ['6 AM', '7 AM', '8 AM', '9 AM'],
    'temperature': [45, 47, 52, 55],
    'rentals': [23, 67, 156, 89],
    'weather': ['Clear', 'Clear', 'Partly Cloudy', 'Clear']
})
print(bike_data)

Explanation of the code:
- `bike_data = pd.DataFrame({...})` This line creates a pandas `DataFrame` with the name `bike_data`.
- Inside the curly braces `{ ... }`, we define the DataFrame as a dictionary:
  - Each **key** (e.g., `hour`, `temperature`, `rentals`, `weather`) becomes a **column name**.
  - Each **list of values** associated with the key (e.g., `[45, 47, 52, 55]` for temperature) becomes the **column data**.
  - All lists must be the **same length**, because each position across the lists corresponds to one row in the table.
- In this example:
  - `'hour'` marks the time of day.
  - `'temperature'` gives the measured temperature (we used Fahrenheit in this example).
  - `'rentals'` shows the number of bikes rented.
  - `'weather'` describes conditions during that hour.
- `print(bike_data)` displays the DataFrame in a structured, tabular format where each row represents one observation (an hour of operations), and each column represents one variable being tracked.

This DataFrame builds on what you just learned about Series. Instead of working with a single column of data, you now have **multiple Series combined side by side** in one structured table. Each row corresponds to a specific hour, while each column captures a different type of measurement (time, temperature, rentals, weather). This way, all the related information for each hour stays neatly aligned.
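Two points from this explanation can be checked directly: a single column pulled out of a DataFrame is a Series, and lists of unequal length are rejected. The sketch below assumes the same `bike_data` table; `length_error` is an illustrative name used only to capture the error message.

```python
import pandas as pd

bike_data = pd.DataFrame({
    'hour': ['6 AM', '7 AM', '8 AM', '9 AM'],
    'temperature': [45, 47, 52, 55],
    'rentals': [23, 67, 156, 89],
    'weather': ['Clear', 'Clear', 'Partly Cloudy', 'Clear']
})

# Pulling out one column gives back a Series
rentals_column = bike_data['rentals']
print(type(rentals_column).__name__)

# Lists of different lengths cannot be lined up into rows
try:
    pd.DataFrame({'hour': ['6 AM', '7 AM'], 'rentals': [23, 67, 156]})
    length_error = None
except ValueError as err:
    length_error = str(err)
print(length_error)
```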

### Why DataFrames Matter for Your Consulting Work

Every DataFrame has key components that make it powerful for business analysis:

- **Columns**: Each variable you're tracking (temperature, rentals, weather conditions). These become the factors you'll analyze to understand demand patterns.
- **Rows**: Each observation or time period. In transportation, this is usually time-based (hourly, daily) but could be trip-based or station-based.
- **Index**: The row identifiers. By default, these are numbers (0, 1, 2), but you can use dates, station IDs, or other meaningful identifiers.
- **Values**: The actual data inside the table. This is the business information you'll analyze to generate insights and predictions.
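Each of these components can be inspected directly. The sketch below uses the small `bike_data` table again and, as one possible choice, promotes the `hour` column to the index; `by_hour` is an illustrative name.

```python
import pandas as pd

bike_data = pd.DataFrame({
    'hour': ['6 AM', '7 AM', '8 AM', '9 AM'],
    'temperature': [45, 47, 52, 55],
    'rentals': [23, 67, 156, 89],
    'weather': ['Clear', 'Clear', 'Partly Cloudy', 'Clear']
})

print(list(bike_data.columns))  # column names
print(bike_data.shape)          # (number of rows, number of columns)
print(list(bike_data.index))    # default row labels

# Use the hour as a more meaningful row identifier
by_hour = bike_data.set_index('hour')

# .loc looks up a value by row label and column name
print(by_hour.loc['8 AM', 'rentals'])
```

`set_index` returns a new DataFrame and leaves `bike_data` unchanged, so you can experiment with different indexes without losing the original layout.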

## 5. Loading Real-World Data: Working with the Washington D.C. Dataset

### Understanding Your Client's Data Source

Your bike-sharing startup client has provided you with historical data from the Washington D.C. bike-sharing system. This real-world dataset contains all the complexities you'll encounter in professional consulting: multiple variables, different data types, time series information, and the messiness of actual operations.

The dataset includes:
- **Temporal Information**: Date and time of each rental period
- **Weather Data**: Temperature, humidity, wind speed, and weather conditions
- **Usage Metrics**: Casual users, registered users, and total counts
- **Operational Context**: Holiday indicators, working day flags, and seasonal information

### The Data Loading Process

Loading data from files is a fundamental consulting skill. Your client's data is stored in CSV (Comma-Separated Values) format, which is standard for sharing tabular data between systems:

In [None]:
# Define the path to your data file
data_path = "https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/dataset/dataset.csv"

# Load the data into a DataFrame
df = pd.read_csv(data_path)

# Display the first five rows of your dataset
print(df.head())

The `pd.read_csv()` function is incredibly powerful and handles many complexities automatically:
- Automatically detects column headers
- Infers appropriate data types for each column
- Manages various CSV formatting conventions
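To see this inference at work without downloading anything, the sketch below reads a tiny CSV held in memory. The three rows are made-up stand-ins, not the client's file, and `csv_text`, `sample`, and `sample_dates` are illustrative names. It also shows the optional `parse_dates` argument, which is not used in the loading cell above.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for a real file
csv_text = """datetime,temp,humidity,count
2011-01-01 00:00:00,9.8,81,16
2011-01-01 01:00:00,9.0,80,40
2011-01-01 02:00:00,9.0,80,32
"""

# read_csv accepts a file path, a URL, or a file-like object
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.dtypes)  # numbers are inferred; the timestamp column stays as text

# parse_dates asks pandas to convert the named column to datetime values
sample_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=['datetime'])
print(sample_dates.dtypes)
```

Without `parse_dates`, the timestamps are loaded as plain text, which is worth remembering before attempting any time-based analysis.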

### First Inspection: Getting Familiar with the Dataset

Once you've loaded the data, the next step is to take a quick first look to understand its structure and contents. This initial inspection isn’t about deep analysis yet — it’s simply a way to get familiar with what you’re working with so you can start asking the right questions.

**Dataset Shape and Size**:

In [None]:
print(f"Dataset contains {df.shape[0]} rows and {df.shape[1]} columns")

This tells your client how much historical data they have available for analysis, which affects the reliability of predictions you can make.

**Column Information and Data Types**:

In [None]:
df.info()  # info() prints its own report and returns None, so no print() is needed

Understanding the data you have is crucial for choosing appropriate analysis techniques and identifying potential issues.

**Statistical Summary**:

In [None]:
print(df.describe())

This provides initial insights into data ranges, which helps identify outliers and validates that the data makes business sense.

## 6. Essential DataFrame Operations for Transportation Analysis

Pandas has different types of operations that serve different business purposes:

- **Selection Operations**: Enable focused analysis on specific aspects of the business (weather sensitivity, user behavior, temporal patterns) that directly inform strategic decisions.
- **Inspection Operations**: Ensure data reliability, which builds client confidence and prevents costly mistakes based on flawed information.
- **Summary Operations**: Provide the statistical foundation for making data-driven recommendations about operations, pricing, and strategic investments.

Mastering these fundamental operations creates the foundation for all advanced analysis you'll perform as a transportation consultant.

### Selection Operations: Finding What Matters to Your Client

Professional data analysis requires the ability to extract specific information that answers business questions. Pandas provides multiple ways to select data, each appropriate for different consulting scenarios:

**Column Selection**:

In [None]:
# Select single variable for trend analysis
bike_counts = df['count']
print(f"Selected bike counts column - Shape: {bike_counts.shape}")
print(f"Data type: {bike_counts.dtype}")
print(f"First 5 values:\n{bike_counts.head()}")
print()

# Select multiple related variables for weather impact analysis
weather_data = df[['temp', 'humidity', 'windspeed', 'count']]
print(f"Weather analysis dataset - Shape: {weather_data.shape}")
print(f"Selected columns: {list(weather_data.columns)}")
print(f"Sample data:\n{weather_data.head(3)}")

Column selection lets you focus your analysis on specific business questions. The output here demonstrates two useful approaches:

1. **Single-column selection (bike counts only)**
   - The `count` column (10,886 records) is isolated as our target variable.
   - The first five values shown range from 1 to 40 bikes per hour; this describes only the preview, not the whole column.
   - This makes it ideal for simple trend analysis.

2. **Multi-column selection (weather impact analysis)**
   - A subset is created with four columns: `temp`, `humidity`, `windspeed`, and `count`.
   - Example values: temperature ~9.8°C, humidity ~80%, windspeed = 0.0 (calm conditions).
   - This focused dataset links bike usage with environmental factors.

Together, these views highlight how column selection shapes the analysis. For example, the sample shows **low bike usage (16, 40, 32 rentals)** under similar weather conditions, hinting at patterns worth investigating further.
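The difference between the two selection styles comes down to the brackets. In the sketch below, which uses a small illustrative table loosely based on the sample values above, single brackets return a Series and a list of names returns a DataFrame.

```python
import pandas as pd

# Small illustrative table (not the full dataset)
sample = pd.DataFrame({
    'temp': [9.8, 9.0, 9.0],
    'humidity': [81, 80, 80],
    'count': [16, 40, 32],
})

one_column = sample['count']             # single brackets -> Series
two_columns = sample[['temp', 'count']]  # list of names -> DataFrame

print(type(one_column).__name__, one_column.shape)
print(type(two_columns).__name__, two_columns.shape)
```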

**Row Selection**:

In [None]:
# First 100 hours of operations
early_operations = df.head(100)
print(f"Early operations period - Shape: {early_operations.shape}")
print(f"Date range: {early_operations['datetime'].min()} to {early_operations['datetime'].max()}")
print(f"Average bike usage in early period: {early_operations['count'].mean():.1f}")
print()

# Specific time period for seasonal analysis
january_data = df.iloc[0:744]  # First 744 rows (24 × 31); equals January only if every hour has a row
print(f"January analysis dataset - Shape: {january_data.shape}")
print(f"Total hours covered: {len(january_data)} hours")
print(f"Date range: {january_data['datetime'].min()} to {january_data['datetime'].max()}")
print(f"January average usage: {january_data['count'].mean():.1f} bikes/hour")

Row selection allows you to analyze specific business periods. The output here highlights two comparisons:

1. **Early operations (first 100 hours: Jan 1–5, 2011)**
   - Average usage: **50.3 bikes/hour**
   - Sets a launch-phase baseline, well below the overall average of 191.6 bikes/hour reported later in this lecture.

2. **First 744 rows (intended as one full month)**
   - Average usage: **59.0 bikes/hour**
   - A **17% increase** compared to the launch period.
   - Suggests growing adoption and establishes a baseline for long-term demand.

This comparison helps distinguish between **startup dynamics** and **established usage patterns**. Keep in mind that `iloc[0:744]` counts rows, not calendar hours. The printed date range runs from Jan 1 to Feb 14, which indicates that not every hour has a row and that the slice extends beyond January. Looking at this wider range also reveals **seasonal variations**, which are important for staffing and inventory planning.
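Because a positional slice such as `iloc[0:744]` only matches a calendar month when every hour has a row, selecting by date is often a safer alternative. The sketch below is one way to do it, using a small made-up table with gaps; `is_january` and `january_rows` are illustrative names.

```python
import pandas as pd

# Hypothetical hourly records with gaps: three January rows, two February rows
sample = pd.DataFrame({
    'datetime': ['2011-01-01 00:00:00', '2011-01-01 01:00:00',
                 '2011-01-19 23:00:00', '2011-02-01 00:00:00',
                 '2011-02-01 01:00:00'],
    'count': [16, 40, 25, 30, 12],
})

# Convert text timestamps to real datetime values
sample['datetime'] = pd.to_datetime(sample['datetime'])

# Select by calendar month instead of by row position
is_january = (sample['datetime'].dt.year == 2011) & (sample['datetime'].dt.month == 1)
january_rows = sample[is_january]

print(f"January rows: {len(january_rows)}")
print(f"January average usage: {january_rows['count'].mean():.1f} bikes/hour")
```

The same mask works no matter how many hourly rows are missing, since it depends on the timestamp itself rather than on where a row sits in the table.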

**Conditional Selection**:

In [None]:
# High-demand periods (above average usage)
average_usage = df['count'].mean()
peak_usage = df[df['count'] > average_usage]
print(f"Overall average usage: {average_usage:.1f} bikes/hour")
print(f"Peak usage periods: {len(peak_usage)} hours ({len(peak_usage)/len(df)*100:.1f}% of time)")
print(f"Average usage during peak periods: {peak_usage['count'].mean():.1f} bikes/hour")
print(f"Peak period usage range: {peak_usage['count'].min()} to {peak_usage['count'].max()} bikes/hour")
print()

# Weather-specific analysis
rainy_days = df[df['weather'] == 3]  # Weather code 3 represents rain
print("Rainy weather analysis:")
print(f"Total rainy hours: {len(rainy_days)} ({len(rainy_days)/len(df)*100:.1f}% of time)")
print(f"Average usage on rainy days: {rainy_days['count'].mean():.1f} bikes/hour")
print(f"Average usage on all days: {df['count'].mean():.1f} bikes/hour")
print(f"Rain impact: {((rainy_days['count'].mean() / df['count'].mean() - 1) * 100):+.1f}% change in usage")

Conditional selection can reveal important business patterns. The output highlights two key insights:

1. **Peak usage periods (above average demand)**
    - 4,356 hours identified (40.0% of total time).
    - Average usage: **369.9 bikes/hour** — nearly double the overall average of **191.6 bikes/hour**.
    - Demand ranges from **192 to 977 bikes/hour**, showing when maximum bikes and staff are needed.

2. **Rainy weather periods**
    - 859 rainy hours recorded (7.9% of time).
    - Average usage drops to **118.8 bikes/hour**.
    - This is a **38.0% decrease** compared to the overall average of 191.6 bikes/hour.

Together, these insights help predict demand swings and guide operational planning. Peak periods show when to **scale up resources**, while rainy periods suggest opportunities for **maintenance or reduced staffing**.
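Conditions can also be combined. The sketch below, built on a small made-up table, filters on two conditions at once and repeats the percentage-change calculation used for the rain impact. All values and variable names are illustrative.

```python
import pandas as pd

# Made-up hourly records: weather code and rental count
sample = pd.DataFrame({
    'weather': [1, 3, 3, 2, 1],
    'count': [250, 40, 210, 120, 90],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
busy_in_rain = sample[(sample['weather'] == 3) & (sample['count'] > 100)]
print(busy_in_rain)

# Percentage change of the rainy-hour average relative to the overall average
rain_mean = sample.loc[sample['weather'] == 3, 'count'].mean()
overall_mean = sample['count'].mean()
rain_impact = (rain_mean / overall_mean - 1) * 100
print(f"Rain impact: {rain_impact:+.1f}% change in usage")
```

Note that the baseline in this calculation is the average over all hours, rainy ones included, so it measures the gap to the overall average rather than to dry conditions alone.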

### Inspection Operations: Ensuring Data Quality

Before conducting any analysis for your client, professional practice requires validating data quality. This protects both your reputation and your client's business decisions:

**Missing Data Detection**:

In [None]:
# Count missing values in each column
missing_summary = df.isnull().sum()
print(missing_summary)

Missing data can seriously affect business decisions, so it’s important to check for gaps early. The output shows that **no missing values were found** across all 12 columns (datetime, season, holiday, workingday, weather, temp, atemp, humidity, windspeed, casual, registered, and count).

Having a **complete dataset is excellent news**. Even small gaps in weather variables (like temperature or humidity) or user metrics (casual vs. registered) could lead to inaccurate demand forecasts. This clean dataset gives us a solid foundation for reliable analysis moving forward.
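Since this dataset has no gaps, it helps to see what the same check reports when values are missing. The sketch below uses a small made-up table with deliberate gaps; `missing_counts` and `rows_with_gaps` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up table with deliberately missing values (NaN)
sample = pd.DataFrame({
    'temp': [9.8, np.nan, 9.0, 10.7],
    'humidity': [81, 80, np.nan, np.nan],
    'count': [16, 40, 32, 13],
})

# Missing values per column
missing_counts = sample.isnull().sum()
print(missing_counts)

# Rows that contain at least one missing value
rows_with_gaps = sample[sample.isnull().any(axis=1)]
print(f"Rows with gaps: {len(rows_with_gaps)}")
```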

**Value Range Validation**:

In [None]:
# Check if bike counts fall within reasonable ranges
print(f"Bike count range: {df['count'].min()} to {df['count'].max()}")
print(f"Temperature range: {df['temp'].min()} to {df['temp'].max()}")

Checking value ranges helps confirm that our dataset doesn’t contain corrupted or unrealistic values. The output shows:

- **Bike counts**: Range from **1 to 977** bikes/hour. All values are positive and fall within realistic limits for a bike-sharing system.
- **Temperature**: Range from **0.82°C to 41.0°C**, which represents plausible seasonal variation - from near-freezing winter conditions to hot summer days.

These ranges confirm that there are no negative bike counts or impossible weather values. This is important because out-of-range values might signal **sensor malfunctions, data corruption, or recording errors** that must be addressed before analysis.
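Beyond reading the minimum and maximum, you can flag the individual rows that fall outside plausible limits. The sketch below is one way to do that on a small made-up table; the limits (-30 to 50 °C, non-negative counts) are assumptions chosen for illustration, not rules from the dataset.

```python
import pandas as pd

# Made-up table containing two deliberately implausible values
sample = pd.DataFrame({
    'temp': [9.8, 41.0, 150.0, 0.8],
    'count': [16, 977, 40, -5],
})

# Hypothetical plausibility limits chosen for this illustration
valid_temp = sample['temp'].between(-30, 50)
valid_count = sample['count'] >= 0

# Keep the rows that fail at least one check
suspect_rows = sample[~(valid_temp & valid_count)]
print(suspect_rows)
```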

### Summary Operations: Providing Statistical Insights

**Data Distribution**:

In [None]:
# Statistical summary for key business metrics
print(df[['temp', 'humidity', 'count']].describe())

Statistical summaries help reveal key operational patterns for business planning:

- **Bike usage**: The average is **191.6 bikes/hour** with a median of **145**, showing right-skewed demand where peaks rise well above typical usage. Demand is highly variable, with a standard deviation of **181.1** — almost equal to the mean.
- **Range of demand**: A quarter of all hours see **42 bikes/hour** or fewer (25th percentile), while a quarter reach **284 bikes/hour** or more (75th percentile). At maximum, demand spikes to **977 bikes/hour**, over 5x the average and nearly 7x the median.

These patterns show that demand fluctuates sharply, meaning the system requires a strong capacity buffer to handle peak loads.
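The link between skew and the mean-median gap is easier to see on a tiny example. The sketch below uses made-up counts with one large peak and computes the same statistics that `describe()` reports; all names and values are illustrative.

```python
import pandas as pd

# Made-up hourly counts: mostly modest values plus one large peak
counts = pd.Series([10, 20, 30, 40, 50, 60, 70, 600])

mean_count = counts.mean()
median_count = counts.median()

# Percentiles: the values below which 25% and 75% of observations fall
q25, q75 = counts.quantile([0.25, 0.75])

print(f"Mean: {mean_count}, median: {median_count}")
print(f"25th percentile: {q25}, 75th percentile: {q75}")
```

A single peak pulls the mean well above the median, which is the same right-skew signature seen in the bike counts.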

## 7. Data Types and Their Business Implications

### Understanding Data Types in Transportation Context

Not all data is created equal, and understanding the different types of information in your dataset directly affects the analysis techniques you can apply and the business insights you can generate. In transportation datasets, we usually encounter three main types of data:

**Numerical Data**

Information that can be measured or counted.
- **Continuous**: Temperature, humidity, wind speed (can take any value within a range)
- **Discrete**: Bike counts, user counts (whole numbers only)

**Categorical Data**

Information that represents categories or groups.
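- **Nominal**: Weather condition labels (such as 'Clear' or 'Partly Cloudy'), season (categories with no inherent ranking)
- **Ordinal**: Weather severity codes, which rank conditions from mild to severe (1=Clear, 2=Misty, 3=Light Rain, 4=Heavy Rain)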

**Temporal Data**

Time-based information that enables trend analysis and forecasting.
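In pandas, these three kinds of data map onto different column dtypes. The sketch below is one possible way to make them explicit on a small made-up table: it converts text timestamps to datetime values and stores the weather codes as an ordered categorical. The values and the choice to use `pd.Categorical` are assumptions for illustration.

```python
import pandas as pd

# Small made-up table covering each kind of data
sample = pd.DataFrame({
    'datetime': ['2011-01-01 00:00:00', '2011-01-01 01:00:00', '2011-01-01 02:00:00'],
    'temp': [9.8, 9.0, 9.0],   # continuous numerical
    'count': [16, 40, 32],     # discrete numerical
    'weather': [1, 1, 3],      # coded category
})

# Temporal: convert text to datetime values
sample['datetime'] = pd.to_datetime(sample['datetime'])

# Ordinal: an ordered categorical keeps the 1 < 2 < 3 < 4 ranking explicit
sample['weather'] = pd.Categorical(
    sample['weather'], categories=[1, 2, 3, 4], ordered=True
)

print(sample.dtypes)
print(f"Most severe weather code observed: {sample['weather'].max()}")
```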

### Data Type Impact on Analysis Capabilities

Different data types enable different types of business analysis:

**Numerical Data Analysis**

Used to measure patterns and relationships. Typical applications include:
- Calculate averages, trends, and correlations
- Build predictive models for demand forecasting
- Identify optimal operating ranges (temperature ranges with highest demand)

**Categorical Data Analysis**

Used to compare groups and categories. Typical applications include:
- Compare performance across different conditions (sunny vs. rainy days)
- Segment users and customize services
- Identify operational factors that influence demand

**Temporal Data Analysis**

Used to analyze time-dependent patterns. Typical applications include:
- Forecast future demand based on historical patterns
- Identify seasonal trends for capacity planning
- Optimize maintenance and rebalancing schedules

Understanding these capabilities helps you choose the right analysis approach for each business question your client presents.
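As a rough sketch of how each data type leads to a different operation, the example below runs one calculation of each kind on a small made-up table: a correlation for numerical data, a group comparison for categorical data, and an hour-of-day average for temporal data. All values and names are illustrative.

```python
import pandas as pd

# Made-up records: two days, two hours per day
sample = pd.DataFrame({
    'datetime': pd.to_datetime([
        '2011-01-01 08:00:00', '2011-01-01 17:00:00',
        '2011-01-02 08:00:00', '2011-01-02 17:00:00',
    ]),
    'temp': [9.0, 14.0, 8.0, 12.0],
    'weather': [1, 1, 3, 3],
    'count': [100, 180, 40, 80],
})

# Numerical: correlation between temperature and demand
temp_corr = sample['temp'].corr(sample['count'])

# Categorical: average demand for each weather code
by_weather = sample.groupby('weather')['count'].mean()

# Temporal: average demand for each hour of the day
by_hour = sample.groupby(sample['datetime'].dt.hour)['count'].mean()

print(f"Temperature vs. demand correlation: {temp_corr:.2f}")
print(by_weather)
print(by_hour)
```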

---

## Summary and Transition to Data Quality Implementation

Pandas fundamentals are the technical foundation for professional transportation consulting. Series and DataFrame structures, data loading, core operations, and data types together give you the toolkit that later, more sophisticated analyses rely on.

With the Washington D.C. bike-sharing dataset, you loaded the data, inspected its quality, and practiced core DataFrame operations. These skills support all the advanced analysis you'll perform as a transportation consultant in this course.

In our next lecture, we'll build on this foundation by learning how to clean and prepare messy real-world data for analysis, turning the raw information into reliable insights your clients can act upon.