# Lecture 3: Introduction to Pandas & Data Structures - Building Your Foundation

## Learning Objectives

By the end of this lecture, you will be able to:

- Import and understand the pandas library for data manipulation
- Work with pandas Series and DataFrame structures effectively
- Load and inspect the Washington D.C. bike-sharing dataset
- Understand basic data types and structures in Python data analysis
- Apply fundamental pandas operations for basic data exploration

---

## 1. Your Journey as a Data Consultant Begins

Welcome to your first day as a junior data consultant! Your client, a growing bike-sharing startup, has asked you to help them understand their data and build better demand forecasting capabilities. Like any professional consultant, you need to start with the fundamentals - understanding your tools and your data.

Think of this like learning to be a craftsperson. Before a carpenter builds a house, they master their tools: saws, hammers, measuring devices. As a data consultant, your primary tools are programming languages and libraries - and pandas is the most essential tool for data work in Python.

Today, you'll learn **pandas** not as an abstract programming concept, but as the foundation that will enable you to help your client make better business decisions. Every technique you master here will directly contribute to solving real transportation challenges.

## 2. The Pandas Library: Your Data Manipulation Powerhouse

### 2.1. What is Pandas and Why Do You Need It?

Pandas is like having a powerful Swiss Army knife for working with data. Just as a mechanic wouldn't try to fix a car with just their bare hands, you wouldn't want to analyze data without pandas. It provides the tools you need to load, clean, and manipulate data efficiently.

For your bike-sharing client, this means pandas will help you:

- Load their historical rental data from files
- Clean and prepare messy real-world data
- Transform raw operational data into actionable insights

### 2.2. Understanding Your Data Toolkit Components

Pandas provides two fundamental structures for organizing data, much like how a workshop has different types of containers for different purposes:

1. **Series**: Like a single column of a spreadsheet. Perfect for storing one type of information (like all the temperatures recorded).
2. **DataFrame**: Like a complete spreadsheet with rows and columns. This is where you'll store your full bike-sharing dataset with all variables together.

### 2.3. Importing Pandas: Setting Up Your Workshop

Before you can use any tool, you need to make it available in your workspace. In Python, this means importing pandas:

In [33]:
import pandas as pd

> **Note:** This is your first chance to try out Google Colab. Click **"Open in Colab"**, select this cell, and press **Shift+Enter** to run it. Nothing visible will happen yet—that’s normal! This step simply loads the **pandas** library. The notebook already shows executed results, but we encourage you to run the code yourself.

This simple line does several important things:

- Makes all pandas functions available to use
- Creates the shorthand `pd` so you don't have to type `pandas` every time
- Follows the standard naming convention used throughout the Python data community

The `as pd` part is like giving pandas a nickname that everyone in the data science community recognizes. When you see `pd.` in code, you immediately know you're working with pandas operations.
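
If you want visible confirmation that the import worked, one quick check is to ask pandas for its own name and version through the `pd` alias. This is only a small sanity check; the version string you see depends on your environment.

```python
import pandas as pd

# The alias `pd` refers to the pandas module itself
print(pd.__name__)       # the module's real name

# The installed version; the exact string depends on your environment
print(pd.__version__)
```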

## 3. Series: Your First Data Structure

### 3.1. Understanding Series Through Bike-Sharing Examples

A Series is like a single column of data with labels for each value. Imagine you're tracking the number of bikes rented each hour during a typical Monday morning:

In [34]:
hourly_rentals = pd.Series([15, 23, 45, 67, 89, 156, 234, 287])
print(hourly_rentals)

0     15
1     23
2     45
3     67
4     89
5    156
6    234
7    287
dtype: int64


Explanation of the code:

- `pd.Series([15, 23, 45, 67, 89, 156, 234, 287])` creates a pandas Series, and the assignment stores it in the variable `hourly_rentals`. The numbers in the list are the bike rentals for each hour, in sequence. For example, the first value `15` is the number of bikes rented in the first hour.
- By default, pandas automatically assigns an **index** to each value, starting at 0. So the first value `15` gets index 0, the second value `23` gets index 1, and so on.
- `print(hourly_rentals)` displays the Series with the index on the left and the values on the right, as in the output shown beneath the code. The final line, `dtype: int64`, reports that the values are stored as 64-bit integers.

Now let’s make the example more realistic by using actual time labels instead of default numeric indexes:

In [35]:
morning_hours = ['6 AM', '7 AM', '8 AM', '9 AM', '10 AM']
morning_rentals = pd.Series([23, 67, 156, 89, 45], index=morning_hours)
print(morning_rentals)

6 AM      23
7 AM      67
8 AM     156
9 AM      89
10 AM     45
dtype: int64


What’s new here:

- We added a custom index using the list `morning_hours = ['6 AM', '7 AM', '8 AM', '9 AM', '10 AM']`.
- By passing this list as the `index` argument, each rental value is now linked to a specific time label (e.g., `8 AM → 156`).

This makes the Series easier to interpret. For example, you can instantly see that **8 AM has the highest rental count (156 bikes)**, which matches the expected morning commute peak.
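
Custom labels also let you look values up by name. The short sketch below rebuilds the same Series and shows two common lookups: reading one value by its label, and asking pandas which label holds the largest value. The variable names `rentals_at_8` and `peak_hour` are introduced here only for illustration.

```python
import pandas as pd

morning_hours = ['6 AM', '7 AM', '8 AM', '9 AM', '10 AM']
morning_rentals = pd.Series([23, 67, 156, 89, 45], index=morning_hours)

# Look up a single value by its label
rentals_at_8 = morning_rentals['8 AM']
print(rentals_at_8)   # 156

# Find the label that holds the largest value
peak_hour = morning_rentals.idxmax()
print(peak_hour)      # 8 AM
```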

### 3.2. Why Series Matter for Your Consulting Work

Series are more than just lists - they're intelligent containers that:

- Keep related information together (values and their meanings)
- Enable mathematical operations (calculating averages, totals, trends)
- Support easy filtering and selection (finding peak hours, low-demand periods)

For transportation consulting, Series help you organize time-based patterns, station-specific metrics, and categorical data like weather conditions or user types.
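
To make these capabilities concrete, here is a minimal sketch using the morning rentals from Section 3.1. It computes a total and an average, then filters for the hours that were busier than average. The variable names are illustrative choices, not part of the client's dataset.

```python
import pandas as pd

morning_rentals = pd.Series(
    [23, 67, 156, 89, 45],
    index=['6 AM', '7 AM', '8 AM', '9 AM', '10 AM'],
)

# Mathematical operations work on the whole Series at once
total_rentals = morning_rentals.sum()
average_rentals = morning_rentals.mean()
print(total_rentals)     # 380
print(average_rentals)   # 76.0

# Filtering: keep only the hours with above-average rentals
busy_hours = morning_rentals[morning_rentals > average_rentals]
print(busy_hours)        # 8 AM (156) and 9 AM (89)
```

Notice that the filter keeps the labels attached to the values, so the result still tells you *which* hours were busy.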

## 4. DataFrame: Your Complete Data Laboratory

### 4.1. Understanding DataFrames as Digital Spreadsheets

If Series are like single columns, DataFrames are like complete spreadsheets with multiple columns and rows. For your bike-sharing client, a DataFrame would contain all the information about each rental period: time, weather, bike counts, user types, and more.

Think of a DataFrame as a comprehensive record where each row represents one time period (like one hour) and each column represents one type of measurement (like temperature, humidity, bike count).

Let's build a small DataFrame that represents what your client's data looks like:

In [36]:
bike_data = pd.DataFrame({
    'hour': ['6 AM', '7 AM', '8 AM', '9 AM'],
    'temperature': [45, 47, 52, 55],
    'rentals': [23, 67, 156, 89],
    'weather': ['Clear', 'Clear', 'Partly Cloudy', 'Clear']
})
print(bike_data)

   hour  temperature  rentals        weather
0  6 AM           45       23          Clear
1  7 AM           47       67          Clear
2  8 AM           52      156  Partly Cloudy
3  9 AM           55       89          Clear


Explanation of the code:

- `bike_data = pd.DataFrame({...})` creates a pandas `DataFrame` and assigns it to the variable `bike_data`.
- Inside the curly braces `{ ... }`, we define the DataFrame as a dictionary:
  - Each **key** (e.g., `hour`, `temperature`, `rentals`, `weather`) becomes a **column name**.
  - Each **list of values** associated with the key (e.g., `[45, 47, 52, 55]` for temperature) becomes the **column data**.
  - All lists must be the **same length**, because each position across the lists corresponds to one row in the table.
- In this example:
  - `'hour'` marks the time of day.
  - `'temperature'` gives the measured temperature (we used Fahrenheit in this example).
  - `'rentals'` shows the number of bikes rented.
  - `'weather'` describes conditions during that hour.
- `print(bike_data)` displays the DataFrame as a table, as in the output shown beneath the code. Each row represents one observation (an hour of operations), and each column represents one variable being tracked.

This DataFrame builds on what you just learned about Series. Instead of working with a single column of data, you now have **multiple Series combined side by side** in one structured table. Each row corresponds to a specific hour, while each column captures a different type of measurement (time, temperature, rentals, weather). This way, all the related information for each hour stays neatly aligned.
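
You can check the "multiple Series side by side" idea yourself. The sketch below rebuilds `bike_data` and pulls out one column; the variable name `rentals` is just an illustrative choice.

```python
import pandas as pd

bike_data = pd.DataFrame({
    'hour': ['6 AM', '7 AM', '8 AM', '9 AM'],
    'temperature': [45, 47, 52, 55],
    'rentals': [23, 67, 156, 89],
    'weather': ['Clear', 'Clear', 'Partly Cloudy', 'Clear']
})

# Selecting one column with square brackets returns a Series
rentals = bike_data['rentals']
print(type(rentals).__name__)                  # Series

# The column shares the DataFrame's row index, which keeps rows aligned
print(rentals.index.equals(bike_data.index))   # True
```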

### 4.2. Why DataFrames Matter for Your Consulting Work

Every DataFrame has key components that make it powerful for business analysis:

- **Columns**: Each variable you're tracking (temperature, rentals, weather conditions). These become the factors you'll analyze to understand demand patterns.
- **Rows**: Each observation or time period. In transportation, this is usually time-based (hourly, daily) but could be trip-based or station-based.
- **Index**: The row identifiers. By default, these are numbers (0, 1, 2), but you can use dates, station IDs, or other meaningful identifiers.
- **Values**: The actual data inside the table. This is the business information you'll analyze to generate insights and predictions.
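
Each of these components can be inspected directly. The following sketch uses the small `bike_data` example from Section 4.1 and also shows one way to swap the default numeric index for a more meaningful one. The name `by_hour` is a hypothetical variable introduced for this illustration.

```python
import pandas as pd

bike_data = pd.DataFrame({
    'hour': ['6 AM', '7 AM', '8 AM', '9 AM'],
    'temperature': [45, 47, 52, 55],
    'rentals': [23, 67, 156, 89],
    'weather': ['Clear', 'Clear', 'Partly Cloudy', 'Clear']
})

print(list(bike_data.columns))   # ['hour', 'temperature', 'rentals', 'weather']
print(list(bike_data.index))     # [0, 1, 2, 3]
print(bike_data.shape)           # (4, 4): 4 rows, 4 columns

# Use the 'hour' column as the row identifier instead of 0, 1, 2, ...
by_hour = bike_data.set_index('hour')

# .loc selects by row label and column name
print(by_hour.loc['8 AM', 'rentals'])   # 156
```

`set_index` returns a new DataFrame and leaves `bike_data` unchanged, which is convenient while you are still exploring.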

## 5. Loading Real-World Data: Working with the Washington D.C. Dataset

### 5.1. Understanding Your Client's Data Source

Now that you’re familiar with Series and DataFrame structures, it’s time to put them into action. Your bike-sharing startup client has provided you with historical data from the Washington D.C. bike-sharing system. This dataset will serve as the foundation for your consulting work throughout the course.

What makes this dataset valuable is that it mirrors the complexity of real consulting projects:

- **Multiple variables** capturing different aspects of the system (e.g., temperature, weather conditions, user types)
- **Mixed data types** that require careful handling
- **Time series information** essential for demand prediction
- And, of course, the **messiness of real-world operations**

By working with this data, you’ll gain hands-on experience tackling the same challenges professionals face when preparing transportation datasets for machine learning.

### 5.2. The Data Loading Process

Loading data from files is the first step of the consulting process. In this case, your client's data is stored in CSV (Comma-Separated Values) format, a standard way of sharing tabular data between systems:

In [37]:
# Define the path to your data file
data_path = "https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/dataset/dataset.csv"

# Load the data into a DataFrame
df = pd.read_csv(data_path)

The command `pd.read_csv()` tells pandas to:

1. **Read** the file located at the given path (`data_path` in this case), line by line.
2. **Break** each row of text into separate pieces of information whenever it finds a comma (each piece becomes a column value).
3. **Store** the resulting table in a **DataFrame**, pandas’ main data structure for tabular data.

In other words, this function converts raw CSV text into a structured DataFrame you can immediately start analyzing.
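
To see those three steps without downloading anything, you can hand `pd.read_csv()` a few lines of CSV text held in memory. This is a minimal sketch: the text contains only three columns and three rows (values taken from the first rows of the dataset), and `csv_text` and `sample_df` are names made up for this example.

```python
import io
import pandas as pd

# A tiny CSV: the first line holds the column names, the rest are data rows
csv_text = """datetime,temp,count
2011-01-01 00:00:00,9.84,16
2011-01-01 01:00:00,9.02,40
2011-01-01 02:00:00,9.02,32
"""

# io.StringIO lets pandas read the string as if it were a file
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df)          # three rows, split into columns at each comma
print(sample_df.dtypes)   # pandas inferred a data type for every column
```

The first line of the text became the column names, and pandas inferred numeric types for `temp` and `count` on its own.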

## 6. Essential DataFrame Operations for Transportation Analysis

Once your data is loaded, the next step is to get familiar with it. As a consultant, you don’t dive straight into modeling — you first **inspect and summarize** the dataset. This quick overview not only builds your own understanding but also helps you explain the data’s scope and quality to your client.

Let's go through some of the most important operations to start with.

### 6.1. Shape of the Dataset

In [38]:
print(f"Dataset contains {df.shape[0]} rows and {df.shape[1]} columns")

Dataset contains 10886 rows and 12 columns


This tells us that the dataset contains **10,886 hourly records** and **12 variables**. For your client, this means there is a solid amount of historical data available — enough to build models that can capture patterns across different times and conditions.
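
Under the hood, `.shape` is simply a tuple of `(rows, columns)`, which is why the code above indexes it with `[0]` and `[1]`. A minimal sketch on a small stand-in DataFrame (`small_df` is a hypothetical name, not the client's data):

```python
import pandas as pd

small_df = pd.DataFrame({'temp': [9.84, 9.02, 9.02], 'count': [16, 40, 32]})

# Unpack the tuple into two named variables
n_rows, n_cols = small_df.shape
print(n_rows, n_cols)   # 3 2

# len() on a DataFrame gives the number of rows
print(len(small_df))    # 3
```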

### 6.2. Column Information and Data Types

In [39]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   datetime    10886 non-null  object 
 1   season      10886 non-null  int64  
 2   holiday     10886 non-null  int64  
 3   workingday  10886 non-null  int64  
 4   weather     10886 non-null  int64  
 5   temp        10886 non-null  float64
 6   atemp       10886 non-null  float64
 7   humidity    10886 non-null  int64  
 8   windspeed   10886 non-null  float64
 9   casual      10886 non-null  int64  
 10  registered  10886 non-null  int64  
 11  count       10886 non-null  int64  
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.7+ KB


The `.info()` method shows all the variables, their data types, and whether they contain missing values.

- We see that every column has **10,886 non-null entries**, meaning no missing data — excellent news for reliability.
- Variables like `season`, `holiday`, and `count` are stored as integers, while `temp`, `atemp`, and `windspeed` are floats.
- The datetime column is still an object (text), which we’ll later convert to a proper date-time format for time-based analysis.

This overview helps you spot potential issues early, such as incorrect data types or missing values that could affect modeling.
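
If you only need one of these facts at a time, pandas offers more targeted checks. The sketch below runs them on a two-row stand-in (`sample_df` is a hypothetical name; its values come from the first rows of the dataset). On the real data you would call the same attributes on `df`.

```python
import pandas as pd

sample_df = pd.DataFrame({
    'datetime': ['2011-01-01 00:00:00', '2011-01-01 01:00:00'],
    'temp': [9.84, 9.02],
    'count': [16, 40],
})

# Data type of every column
print(sample_df.dtypes)

# Number of missing values in each column (0 means the column is complete)
missing_per_column = sample_df.isna().sum()
print(missing_per_column)
```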

### 6.3. Previewing the Data

In [40]:
print(df.head())

              datetime  season  holiday  workingday  weather  temp   atemp  \
0  2011-01-01 00:00:00       1        0           0        1  9.84  14.395   
1  2011-01-01 01:00:00       1        0           0        1  9.02  13.635   
2  2011-01-01 02:00:00       1        0           0        1  9.02  13.635   
3  2011-01-01 03:00:00       1        0           0        1  9.84  14.395   
4  2011-01-01 04:00:00       1        0           0        1  9.84  14.395   

   humidity  windspeed  casual  registered  count  
0        81        0.0       3          13     16  
1        80        0.0       8          32     40  
2        80        0.0       5          27     32  
3        75        0.0       3          10     13  
4        75        0.0       0           1      1  


The `.head()` method lets you **peek at the first rows** of the dataset (five by default). Here you can immediately recognize the structure: a datetime stamp, contextual variables (season, weather), and usage metrics (casual, registered, and total count). This helps you explain to your client what kind of data is being tracked.
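
You can also choose how many rows to preview, or look at the end of the table instead. This small sketch uses a hypothetical 24-row DataFrame named `hourly`, with placeholder counts, purely to show the calls.

```python
import pandas as pd

# Hypothetical data: one row per hour of a day, with placeholder counts
hourly = pd.DataFrame({'hour': range(24), 'count': range(0, 240, 10)})

first_three = hourly.head(3)   # first 3 rows instead of the default 5
last_two = hourly.tail(2)      # last 2 rows

print(first_three)
print(last_two)
```

Checking the last rows with `.tail()` is a simple way to confirm that a file loaded all the way to the end.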

### 6.4. Quick Statistical Summary

In [41]:
print(df.describe())

             season       holiday    workingday       weather         temp  \
count  10886.000000  10886.000000  10886.000000  10886.000000  10886.00000   
mean       2.506614      0.028569      0.680875      1.418427     20.23086   
std        1.116174      0.166599      0.466159      0.633839      7.79159   
min        1.000000      0.000000      0.000000      1.000000      0.82000   
25%        2.000000      0.000000      0.000000      1.000000     13.94000   
50%        3.000000      0.000000      1.000000      1.000000     20.50000   
75%        4.000000      0.000000      1.000000      2.000000     26.24000   
max        4.000000      1.000000      1.000000      4.000000     41.00000   

              atemp      humidity     windspeed        casual    registered  \
count  10886.000000  10886.000000  10886.000000  10886.000000  10886.000000   
mean      23.655084     61.886460     12.799395     36.021955    155.552177   
std        8.474601     19.245033      8.164537     49.96047

The `.describe()` method provides summary statistics for each numeric column.

- Bike demand (count) ranges from **1 to 977 bikes/hour**, with an average of ~192.
- The distribution is right-skewed: the mean (~192) sits well above the median (145), and the maximum (977) is more than six times the median.
- Weather variables look realistic too — temperatures span from near freezing (0.8°C) to hot summer levels (41°C).

This quick check reassures both you and your client that the dataset is complete and values are plausible.
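
The gap between mean and median is a handy skew signal, and you can read both from the `.describe()` result. The sketch below uses five made-up values (not taken from the client's data) in which one busy hour pulls the mean upward.

```python
import pandas as pd

# Hypothetical hourly counts with one unusually busy hour
counts = pd.Series([10, 20, 30, 40, 400])

summary = counts.describe()   # itself a Series, indexed by statistic name
print(summary['mean'])        # 100.0
print(summary['50%'])         # 30.0 (the median)
print(summary['max'])         # 400.0
```

A mean far above the median suggests that a few large values dominate the average, so for data like this the median is often the better "typical hour" figure to quote.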

## 7. Data Types and Their Business Implications

In the previous section, you learned how to take a **first snapshot of the dataset** using commands like `.shape`, `.info()`, `.head()`, and `.describe()`. One of the key things you might have noticed in the `.info()` output is that each column has a data type: for example, integers for `season`, floats for `temp`, and text (object) for `datetime`.

Data types are not just technical details. They define **what kind of analysis you can perform** and **what kind of business questions you can answer**. Misinterpreting them can lead to misleading results — for instance, treating a categorical variable like `season` as if it were a continuous number would distort any trend analysis.

In transportation datasets, we usually encounter three main types of data: numerical, categorical, and temporal. Let’s look more closely at each of them.

### 7.1. Numerical Data

Numerical data represents information that can be measured or counted. In our dataset, examples include `temp`, `humidity`, and `count`.

- **Continuous variables** like temperature or humidity can take on any value within a range. These are useful for finding patterns such as *“bike usage increases gradually as temperature rises up to 25°C, but then falls in very hot conditions.”*
- **Discrete variables** like the `count` of bikes or the number of `casual` users can only take whole numbers. These are central to forecasting business outcomes, for example, *“predicting how many bikes will be rented in the next hour.”*

Because numerical variables support arithmetic, they allow for statistical summaries (means, correlations) and predictive modeling that directly inform demand forecasting.
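
As a small illustration of that arithmetic, the sketch below computes a mean and a correlation on the four-row `bike_data` example from Section 4.1. With so few rows the numbers are not meaningful evidence; the point is only to show the calls.

```python
import pandas as pd

bike_data = pd.DataFrame({
    'temperature': [45, 47, 52, 55],
    'rentals': [23, 67, 156, 89],
})

# Average rentals per hour
mean_rentals = bike_data['rentals'].mean()
print(mean_rentals)   # 83.75

# Pearson correlation between temperature and rentals (between -1 and 1)
temp_rentals_corr = bike_data['temperature'].corr(bike_data['rentals'])
print(temp_rentals_corr)   # positive for this toy sample
```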

### 7.2. Categorical Data

Categorical data represents **groups or categories** rather than numeric scales. In our dataset, examples include `season`, `weather`, and `holiday`.

- **Nominal categories** like `season` (spring, summer, fall, winter) have no inherent order. They are ideal for comparing performance across conditions, for example, *“Does bike demand differ between summer and winter, or are spring and fall similar in usage patterns?”*
- **Ordinal categories** have a natural order, such as the weather severity codes (1 = clear, 2 = misty, 3 = light rain, 4 = heavy rain). These allow for ordered comparisons, like *“demand decreases steadily as weather severity worsens.”*

Categorical variables are crucial for **segmentation** — helping your client understand different user groups or conditions that influence demand.
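
One common way to work with coded categories is to translate the codes into labels and then compare groups. The sketch below is an illustration under stated assumptions: the counts are made up, and the mapping `1 = spring, 2 = summer, 3 = fall, 4 = winter` is assumed from the order in which the seasons are listed above, so confirm it against the dataset's documentation before relying on it.

```python
import pandas as pd

# Hypothetical rows: season codes with made-up rental counts
sample = pd.DataFrame({
    'season': [1, 1, 2, 3, 4, 4],
    'count': [16, 40, 120, 200, 90, 70],
})

# Assumed code-to-label mapping (verify against the dataset documentation)
season_labels = {1: 'spring', 2: 'summer', 3: 'fall', 4: 'winter'}

# Translate codes to labels and store them with the 'category' dtype
sample['season_name'] = sample['season'].map(season_labels).astype('category')

# Segmentation: average demand within each season
mean_by_season = sample.groupby('season_name', observed=True)['count'].mean()
print(mean_by_season)
```

Averaging the raw `season` codes would produce a number such as 2.5, which has no business meaning. Grouping by the category answers the question you actually care about.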

### 7.3. Temporal Data

Temporal data captures the **dimension of time**, such as the `datetime` column in our dataset. This type of data enables consultants to uncover trends and seasonality that are invisible in static snapshots.

For example:

- Hourly patterns can reveal peak commuting periods.
- Monthly or seasonal trends can guide capacity planning.
- Long-term patterns help forecast growth and support strategic decisions, such as when to expand the fleet.

Unlike other data types, temporal variables connect the dataset into a **time series**, allowing you to move from descriptive statistics toward forecasting future demand.
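
To unlock these time-based views, the text timestamps first need to become real date-time values. Here is a minimal sketch of one way to do that with `pd.to_datetime`, using three timestamps and counts from the first rows of the dataset. The `hour` and `weekday` columns are new names introduced for this example.

```python
import pandas as pd

sample = pd.DataFrame({
    'datetime': ['2011-01-01 00:00:00', '2011-01-01 01:00:00', '2011-01-01 02:00:00'],
    'count': [16, 40, 32],
})

# Convert the text column into proper date-time values
sample['datetime'] = pd.to_datetime(sample['datetime'])

# The .dt accessor exposes the parts of each timestamp
sample['hour'] = sample['datetime'].dt.hour
sample['weekday'] = sample['datetime'].dt.day_name()

print(sample)
```

Once the hour and weekday are ordinary columns, they can be used for grouping and comparison like any other variable.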

### 7.4. Why Data Types Matter for Consultants

By recognizing the role of each data type, you can match business questions with the right analytical tools:

- Numerical data supports **measurement, comparison, and prediction**.
- Categorical data enables **segmentation and condition-based insights**.
- Temporal data allows **trend analysis and forecasting**.

Together, these categories transform the dataset from raw information into a structured foundation for business recommendations.

---

## Summary and Transition to Data Quality Implementation

You now have the pandas fundamentals that the rest of this course builds on: Series and DataFrame structures, loading data from a CSV file, inspecting a dataset with `.shape`, `.info()`, `.head()`, and `.describe()`, and recognizing numerical, categorical, and temporal data types.

With the Washington D.C. bike-sharing dataset loaded and inspected, you are ready for the more advanced analysis you'll perform as a transportation consultant in this course.

In our next lecture, we'll build on this foundation by learning how to clean and prepare messy real-world data for analysis, turning the raw information into reliable insights your clients can act upon.