# Chapter 3: Python Pandas Tutorial

The `pandas` package is one of the most important tools available to data scientists and analysts working in Python today. Powerful machine learning and glamorous visualization tools may get all the attention, but `pandas` is the backbone of most data projects.

> [pandas] is derived from the term "**pan**el **da**ta", an econometrics term for data sets that include observations over multiple time periods for the same individuals. 

## What's `pandas` for?

`pandas` has so many uses that it might make sense to list the things it can't do instead of what it can do.

This tool is essentially your data's home. Through `pandas`, you get acquainted with your data by cleaning, transforming, and analyzing it.

For example, say you want to explore a dataset stored in a CSV on your computer. `pandas` will extract the data from that CSV into a `DataFrame` — a table, basically — then let you do things like:

- Calculate statistics and answer questions about the data, like
    - What's the average, median, max, or min of each column?
    - Does column A correlate with column B?
    - What does the distribution of data in column C look like?

- Clean the data by doing things like removing missing values and filtering rows or columns by some criteria
- Visualize the data with help from Matplotlib. Plot bars, lines, histograms, bubbles, and more.
- Store the cleaned, transformed data back into a CSV, another file format, or a database.

Before you jump into modeling or complex visualizations, you need a good understanding of the nature of your dataset, and `pandas` is the best avenue through which to get it.
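
The tasks listed above can be sketched in a few lines. This is only an illustration: the CSV content and the column names `A`, `B`, and `C` are made up, and the data is read from an in-memory string rather than a file on disk. Plotting is left out to keep the sketch short.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file on disk (columns A, B, C are made up).
csv_text = """A,B,C
1,2,10
2,4,
3,6,30
4,8,40
"""
df = pd.read_csv(io.StringIO(csv_text))

# Summary statistics for each column (missing values are skipped)
col_means = df.mean()
col_medians = df.median()

# Does column A correlate with column B?
corr_ab = df["A"].corr(df["B"])

# Clean: drop rows that contain a missing value
clean = df.dropna()

# Store the cleaned data as CSV text (pass a file path to write to disk instead)
csv_out = clean.to_csv(index=False)
```

Calling `df.describe()` would give most of these statistics in a single table.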

### First Steps: Install and import

`pandas` is an easy package to install. To install in this notebook, we just run:

In [32]:
!pip install pandas




The `!` at the beginning of the line tells the notebook to run that line as a shell command, as if it were typed in a terminal.

We usually import `pandas` under the shorter alias `pd`, since it's used so often:

In [33]:
import pandas as pd

### Core components of pandas: Series and DataFrames

The two primary components of `pandas` are the `Series` and the `DataFrame`.

A `Series` is essentially a column, and a `DataFrame` is a two-dimensional table made up of a collection of `Series`.

<img src="images/img1.png" />

`DataFrames` and `Series` are quite similar in that many operations that you can do with one you can do with the other, such as filling in null values and calculating the mean.
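
As a rough illustration of that overlap, the sketch below builds a small made-up table with one missing value, pulls a single column out as a `Series`, and then calls the same two methods (`mean` and `fillna`) on both objects. The data and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical table with one missing value (None becomes NaN)
df = pd.DataFrame({"durians": [4.0, None, 32.0], "apples": [3.0, 56.0, 1.0]})

# Selecting one column gives a Series
durians = df["durians"]

# The same methods exist on both objects
series_mean = durians.mean()        # one number; NaN is skipped by default
frame_means = df.mean()             # one mean per column, returned as a Series

series_filled = durians.fillna(0)   # fill missing values in one column
frame_filled = df.fillna(0)         # fill missing values in every column
```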

You'll see how these components work when we start working with data below.

### Creating `DataFrames` from scratch

Creating `DataFrames` right in Python is good to know and quite useful when testing new methods and functions you find in the pandas docs.

There are many ways to create a DataFrame from scratch, but a great option is to just use a simple dict.

Let's say we have a fruit stand that sells durians and apples. We want to have a column for each fruit and a row for each customer purchase. To organize this as a dictionary for pandas we could do something like:

In [34]:
data = {
    'durians': [4, 55, 32, 13], 
    'apples': [3, 56, 3, 1]
}

fruit = pd.DataFrame(data)
fruit

|  | durians | apples |
|---|---|---|
| 0 | 4 | 3 |
| 1 | 55 | 56 |
| 2 | 32 | 3 |
| 3 | 13 | 1 |


Each (key, value) item in `data` corresponds to a column in the resulting `DataFrame`.
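
The dict in the cell above is column-oriented. As one of the other ways to build a `DataFrame` from scratch, the sketch below passes a list of dicts instead, where each dict describes one row (one purchase). The shortened data is for illustration only.

```python
import pandas as pd

# Column-oriented: each key is a column
by_column = pd.DataFrame({"durians": [4, 55], "apples": [3, 56]})

# Row-oriented: each dict is one purchase (one row)
by_row = pd.DataFrame([
    {"durians": 4, "apples": 3},
    {"durians": 55, "apples": 56},
])

# Both constructions hold the same values under the same column names
same_values = by_column.values.tolist() == by_row.values.tolist()
same_columns = list(by_column.columns) == list(by_row.columns)
```

The row-oriented form tends to be convenient when records arrive one at a time, for example from an API or a loop.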

The Index of this DataFrame was given to us on creation as the numbers 0-3, but we could also create our own when we initialize the `DataFrame`.

Let's put customer names as our index:

In [35]:
fruit = pd.DataFrame(data, index=['Ali', 'Abu', 'Lim', 'Muthu'])

fruit

|  | durians | apples |
|---|---|---|
| Ali | 4 | 3 |
| Abu | 55 | 56 |
| Lim | 32 | 3 |
| Muthu | 13 | 1 |


### Locating the customer

We can locate a customer's order by passing their name to `.loc`.

In [None]:
fruit.loc['Ali']
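
A few more variations on label-based lookup, as a minimal sketch. It rebuilds the same `fruit` table so that it runs on its own; the variable names on the left are just for illustration.

```python
import pandas as pd

fruit = pd.DataFrame(
    {"durians": [4, 55, 32, 13], "apples": [3, 56, 3, 1]},
    index=["Ali", "Abu", "Lim", "Muthu"],
)

ali_order = fruit.loc["Ali"]                # one row, returned as a Series
ali_durians = fruit.loc["Ali", "durians"]   # a single cell
two_orders = fruit.loc[["Ali", "Lim"]]      # several rows, returned as a DataFrame
first_row = fruit.iloc[0]                   # same row as 'Ali', selected by position
```

`.loc` selects by label, while `.iloc` selects by integer position.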

## Mini Project: Birthday Explorer

Explore any date from 1920 to 2022 using Malaysia’s daily birth counts.

- Goal: Build a small, data-driven “explorer” to answer questions about birthdays using the official dataset.
- Data source: births.csv (daily births), 1920–2022  
  URL: https://storage.data.gov.my/demography/births.csv

What you’ll build
- Date lookup: Enter a date (YYYY-MM-DD) and show the number of births on that day.
- Birthday profile: For a given month-day (e.g., 14-Feb), summarize births across years (trend, averages, min/max).
- Seasonality views: Explore patterns by month and by day-of-week; visualize spikes (e.g., festive periods).
- Fun facts:
  - Most/least common birthdays.
  - Your birthday’s percentile/rarity across the year.
  - Leap day handling and nearby-day comparison for Feb 29 birthdays.
- Data quality checks: Identify missing dates, early-year gaps, and outliers.
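
As a taste of the data quality check in the list above, here is one possible way to find missing dates. It is a sketch on a tiny toy series, not the real dataset: one day is deliberately left out, and the full calendar is rebuilt with `pd.date_range` for comparison.

```python
import pandas as pd

# Toy daily series with 1920-01-03 deliberately left out
dates = pd.to_datetime(["1920-01-01", "1920-01-02", "1920-01-04", "1920-01-05"])
toy = pd.DataFrame({"births": [96, 115, 101, 95]}, index=dates)

# Every calendar day between the first and last observation
expected = pd.date_range(toy.index.min(), toy.index.max(), freq="D")

# Days that should be present but are not
missing = expected.difference(toy.index)
print(missing.strftime("%Y-%m-%d").tolist())   # ['1920-01-03']
```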


### Reading CSV File

First, we need to load the CSV file. To do this, we use the `read_csv()` function, which accepts a URL or a local file path.

In [36]:
url = "https://storage.data.gov.my/demography/births.csv"
births_df = pd.read_csv(url)

Alternatively, to load from a local file:

In [None]:
births_local_df = pd.read_csv("births.csv")

We can check the shape of the dataframe (rows, columns) as below:

In [37]:
births_df.shape

(37833, 3)

Now, let's see which columns are in the data.

In [38]:
births_df.columns.tolist()

['date', 'state', 'births']

Here are the first few rows of the dataframe.

In [39]:
births_df.head()

|  | date | state | births |
|---|---|---|---|
| 0 | 1920-01-01 | Malaysia | 96 |
| 1 | 1920-01-02 | Malaysia | 115 |
| 2 | 1920-01-03 | Malaysia | 111 |
| 3 | 1920-01-04 | Malaysia | 101 |
| 4 | 1920-01-05 | Malaysia | 95 |


### Clean the data

Remove the `state` column, since its value is 'Malaysia' in every row:

In [40]:
births_df = births_df.drop('state', axis=1)
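
Before dropping a column on the assumption that it is constant, it can be worth confirming that assumption. A minimal sketch on a small made-up frame with the same three columns (`sample` is a hypothetical name):

```python
import pandas as pd

# Small made-up frame with the same three columns as births.csv
sample = pd.DataFrame({
    "date": ["1920-01-01", "1920-01-02", "1920-01-03"],
    "state": ["Malaysia", "Malaysia", "Malaysia"],
    "births": [96, 115, 111],
})

# Only drop the column if it holds a single distinct value
if sample["state"].nunique() == 1:
    sample = sample.drop(columns="state")
```

`drop(columns="state")` is equivalent to `drop("state", axis=1)`.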

Convert the `date` column to a proper `datetime` type:

In [41]:
births_df['date'] = pd.to_datetime(births_df['date'])

Set `date` as the index for faster lookups:

In [42]:
births_df = births_df.set_index('date')
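
Besides lookups of a single date, a datetime index also allows selecting a whole month, a whole year, or a date range with plain strings. The sketch below shows this on a toy frame with made-up counts, not on the real dataset.

```python
import pandas as pd

# Toy frame with a DatetimeIndex (made-up counts)
idx = pd.to_datetime(["1995-08-16", "1995-08-17", "1995-09-01", "1996-01-01"])
toy = pd.DataFrame({"births": [1700, 1758, 1650, 1500]}, index=idx)

one_day = toy.loc[pd.Timestamp("1995-08-17"), "births"]   # exact date
august = toy.loc["1995-08"]                               # every row in August 1995
year_1995 = toy.loc["1995"]                               # every row in 1995
span = toy.loc["1995-08-17":"1995-09-01"]                 # date range, both ends included
```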

Now, let's see the first few rows of the dataframe after the operations above.

In [43]:
births_df.head()

| date | births |
|---|---|
| 1920-01-01 | 96 |
| 1920-01-02 | 115 |
| 1920-01-03 | 111 |
| 1920-01-04 | 101 |
| 1920-01-05 | 95 |
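
The three cleaning steps can also be folded into the `read_csv()` call itself. The sketch below is one way to do that, assuming the file has the three columns shown earlier; it reads from an in-memory string so it runs without network access, and `compact` is a hypothetical name.

```python
import io
import pandas as pd

# Stand-in for births.csv, using the first rows shown earlier
csv_text = """date,state,births
1920-01-01,Malaysia,96
1920-01-02,Malaysia,115
1920-01-03,Malaysia,111
"""

compact = pd.read_csv(
    io.StringIO(csv_text),        # a URL or file path works here as well
    usecols=["date", "births"],   # skip the 'state' column
    parse_dates=["date"],         # parse the column as datetime
    index_col="date",             # use it as the index
)
```

Doing the steps one by one, as in the cells above, is easier to follow while learning; the compact form is handy once the structure of the file is known.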


### Finding how common your birthdate is

In [44]:
def get_births_on_date(date_input):
    """
    Get number of births on a specific date.
    
    Args:
        date_input (str): Date in format 'YYYY-MM-DD'
    
    Returns:
        str: A message with the number of births, or an error message
    """
    try:
        target_date = pd.to_datetime(date_input)
        
        if target_date in births_df.index:
            births_count = births_df.loc[target_date, 'births']
            return f"On {date_input}, there were {births_count} births in Malaysia."
        else:
            return f"No data available for {date_input}. Available range: {births_df.index.min().date()} to {births_df.index.max().date()}"
    
    except Exception as e:
        return f"Invalid date format. Please use YYYY-MM-DD format. Error: {str(e)}"

In [45]:
your_date = "1995-08-17"  # Change this to any date you want to check
print(get_births_on_date(your_date))

On 1995-08-17, there were 1758 births in Malaysia.
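
The single-date lookup can be extended toward the "birthday profile" and "most common birthdays" features from the project outline. The sketch below is one possible approach on toy data with made-up counts; `birthday_profile` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

# Toy data: three years of made-up counts for two calendar days
idx = pd.to_datetime([
    "2000-02-14", "2001-02-14", "2002-02-14",
    "2000-08-17", "2001-08-17", "2002-08-17",
])
toy = pd.DataFrame({"births": [1400, 1450, 1500, 1700, 1750, 1800]}, index=idx)

def birthday_profile(df, month, day):
    """Summarize births on one month-day across all years (hypothetical helper)."""
    same_day = df[(df.index.month == month) & (df.index.day == day)]
    return same_day["births"].agg(["mean", "min", "max"])

profile = birthday_profile(toy, 8, 17)

# Average births per calendar day, most common first
by_day = toy.groupby([toy.index.month, toy.index.day])["births"].mean()
ranking = by_day.sort_values(ascending=False)
```

On the real dataset, the same calls would take `births_df` in place of `toy`.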
