<div style="display: flex; justify-content: space-between; align-items: center;">
    <div style="text-align: left; flex: 4">
        <strong>Author:</strong> Amirhossein Heydari — 
        📧 <a href="mailto:amirhosseinheydari78@gmail.com">amirhosseinheydari78@gmail.com</a> — 
        🐙 <a href="https://github.com/mr-pylin/pandas-workshop" target="_blank" rel="noopener">github.com/mr-pylin</a>
    </div>
    <div style="text-align: right; flex: 1;">
        <a href="https://pandas.pydata.org/" target="_blank" rel="noopener noreferrer">
            <img src="../assets/images/pandas/logo/pandas_white.svg" 
                 alt="Pandas Logo"
                 style="max-height: 48px; width: auto; background-color: #1f1f1f; border-radius: 8px;">
        </a>
    </div>
</div>
<hr>


**Table of contents**<a id='toc0_'></a>    
- [Dependencies](#toc1_)    
- [What is Pandas?](#toc2_)    
  - [The Role of Pandas](#toc2_1_)    
  - [Pandas vs. Native Python Structures](#toc2_2_)    
  - [Pandas in the Ecosystem](#toc2_3_)    
  - [First Steps with Data](#toc2_4_)    
    - [Create a Toy Dataset Manually](#toc2_4_1_)    
    - [Read a Small CSV into a DataFrame](#toc2_4_2_)    
    - [Quick Exploratory Commands](#toc2_4_3_)    
- [Motivation](#toc3_)    
  - [Example 1](#toc3_1_)    
    - [Using Raw Python](#toc3_1_1_)    
    - [Using Pandas](#toc3_1_2_)    
  - [Example 2](#toc3_2_)    
    - [Using Raw Python](#toc3_2_1_)    
    - [Using Pandas](#toc3_2_2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Dependencies](#toc0_)


In [None]:
import csv
from collections import defaultdict
from urllib.request import urlopen

import pandas as pd

In [None]:
# print wide DataFrames on one line instead of wrapping columns onto extra lines
pd.set_option('display.expand_frame_repr', False)

In [None]:
MOVIES_PATH = r"https://raw.githubusercontent.com/mr-pylin/datasets/refs/heads/main/data/tabular-data/movies/csv/dataset.csv"
SALES_PATH = r"https://raw.githubusercontent.com/mr-pylin/datasets/refs/heads/main/data/tabular-data/sales/dataset.csv"

# <a id='toc2_'></a>[What is Pandas?](#toc0_)

- Pandas is a Python library for data manipulation and analysis.

📝 **Official Documentation / Tutorials**:

- Pandas official docs: [pandas.pydata.org/docs/](https://pandas.pydata.org/docs/)
- Getting Started Tutorial: [pandas.pydata.org/docs/getting_started/index.html](https://pandas.pydata.org/docs/getting_started/index.html)
- 10 Minutes to Pandas: [pandas.pydata.org/docs/user_guide/10min.html](https://pandas.pydata.org/docs/user_guide/10min.html)

📺 **Useful YouTube Playlists**:

- Corey Schafer **[youtube]**: [youtube.com/playlist?list=PL-osiE80TeTt2d9bfVyTiXJA-UTHn6WwU](https://www.youtube.com/playlist?list=PL-osiE80TeTt2d9bfVyTiXJA-UTHn6WwU)
- Data School **[youtube]**: [youtube.com/playlist?list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y](https://www.youtube.com/playlist?list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y)


## <a id='toc2_1_'></a>[The Role of Pandas](#toc0_)

  - **Pandas vs. raw Python**: easier handling of structured/tabular data
  - **Pandas vs. Excel/CSV**: programmatic, reproducible, automatable
  - **Pandas in data science pipelines**: preprocessing $\rightarrow$ analysis $\rightarrow$ visualization


## <a id='toc2_2_'></a>[Pandas vs. Native Python Structures](#toc0_)

<table>
  <thead>
    <tr>
      <th>Feature</th>
      <th>Native Python (list/dict)</th>
      <th>Pandas (Series/DataFrame)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Labeled data</td>
      <td>No row/column labels beyond dict keys (tracked manually)</td>
      <td>Indexes and column names</td>
    </tr>
    <tr>
      <td>Missing data</td>
      <td>Manual checks</td>
      <td>NaN handling with built-in methods</td>
    </tr>
    <tr>
      <td>Vectorized operations</td>
      <td>Loops required</td>
      <td>Arithmetic applied element-wise</td>
    </tr>
    <tr>
      <td>Aggregations</td>
      <td>Custom loops or comprehensions</td>
      <td><code>sum()</code>, <code>mean()</code>, <code>groupby()</code></td>
    </tr>
    <tr>
      <td>Filtering &amp; selection</td>
      <td>Loops and conditionals</td>
      <td>Boolean indexing and slicing</td>
    </tr>
    <tr>
      <td>File I/O</td>
      <td>Manual parsing</td>
      <td><code>pd.read_csv()</code>, <code>DataFrame.to_csv()</code>, <code>pd.read_sql()</code></td>
    </tr>
  </tbody>
</table>
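
The contrast in the table can be sketched on a few values. The data below is hypothetical (it is not part of the workshop datasets), and the snippet is only a minimal illustration of labels, missing values, vectorized arithmetic, and boolean indexing:

```python
import pandas as pd

# hypothetical toy data; None marks a missing price
prices = [10.0, None, 30.0]
labels = ["a", "b", "c"]

# native Python: explicit loop logic plus a manual check for missing values
doubled_list = [p * 2 if p is not None else None for p in prices]

# pandas: labeled, vectorized, and NaN-aware
s = pd.Series(prices, index=labels)
doubled = s * 2       # element-wise arithmetic; the missing value stays NaN
total = s.sum()       # NaN is skipped by default
above_15 = s[s > 15]  # boolean indexing keeps only matching labels

# log
print(doubled_list)
print(doubled)
print(total)
print(above_15)
```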


## <a id='toc2_3_'></a>[Pandas in the Ecosystem](#toc0_)

- Pandas builds on NumPy: provides labeled axes and higher-level operations on arrays.
- Integrates seamlessly with Matplotlib / Seaborn for plotting.
- Works naturally with Scikit-learn for ML pipelines (features in DataFrames).
- Often used in data pipelines: ingestion → cleaning → transformation → visualization → modeling → export.
- Supports multiple file formats (CSV, Excel, Parquet, SQL, JSON) for real-world workflows.
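
As a small sketch of the NumPy relationship mentioned above, the snippet below moves between a DataFrame and a plain array. The table `df_toy` and its column names are made up for illustration:

```python
import numpy as np
import pandas as pd

# hypothetical toy table
df_toy = pd.DataFrame(
    {
        "height_cm": [160.0, 175.0, 190.0],
        "weight_kg": [55.0, 70.0, 85.0],
    }
)

# drop the labels and work on the raw values as a 2-D NumPy array
arr = df_toy.to_numpy()
col_means = np.mean(arr, axis=0)

# NumPy functions applied to a Series return a Series and keep its index
log_heights = np.log(df_toy["height_cm"])

# log
print(arr.shape)
print(col_means)
print(log_heights)
```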


## <a id='toc2_4_'></a>[First Steps with Data](#toc0_)


### <a id='toc2_4_1_'></a>[Create a Toy Dataset Manually](#toc0_)


In [None]:
# python dict
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"],
}

# log
print(data)

In [None]:
# create a small DataFrame
df = pd.DataFrame(data)

# log
print(df)

### <a id='toc2_4_2_'></a>[Read a Small CSV into a DataFrame](#toc0_)


In [None]:
# read CSV
df_movies = pd.read_csv(MOVIES_PATH)

# log
print(df_movies)

### <a id='toc2_4_3_'></a>[Quick Exploratory Commands](#toc0_)


In [None]:
shape = df_movies.shape      # number of rows and columns
columns = df_movies.columns  # columns
index = df_movies.index      # index

# log
print(f"shape   : {shape}")
print(f"index   : {index}")
print(f"columns : {list(columns)}")


In [None]:
# summary statistics for numeric columns
describe = df_movies.describe()

# log (printed on its own so the table header stays aligned)
print(describe)

In [None]:
# info about data types and non-null counts
# note: info() prints its report directly and returns None,
# so it is called on its own rather than wrapped in print()
df_movies.info()
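
A few other inspection calls are often useful at this stage. The sketch below runs offline on a variant of the toy dataset from earlier; the missing values are an assumption added here so that the counts are non-trivial:

```python
import pandas as pd

# toy data with two missing entries added for illustration
df_toy = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie"],
        "Age": [25, None, 35],
        "City": ["New York", "Los Angeles", None],
    }
)

first_rows = df_toy.head(2)               # first two rows
dtypes = df_toy.dtypes                    # data type of each column
missing_per_column = df_toy.isna().sum()  # count of missing values per column

# log
print(first_rows)
print(dtypes)
print(missing_per_column)
```

Note that `Age` becomes a float column here, because the missing entry is stored as `NaN`.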

# <a id='toc3_'></a>[Motivation](#toc0_)


## <a id='toc3_1_'></a>[Example 1](#toc0_)

- Find the top 5 **product categories** by **total revenue**.
- About `sales.csv`: [github.com/mr-pylin/datasets/raw/refs/heads/main/data/tabular-data/sales/dataset.csv](https://github.com/mr-pylin/datasets/raw/refs/heads/main/data/tabular-data/sales/dataset.csv)


### <a id='toc3_1_1_'></a>[Using Raw Python](#toc0_)


In [None]:
category_revenue = defaultdict(float)

with urlopen(SALES_PATH) as response:
    data = response.read().decode('latin-1')
    lines = data.splitlines()
    reader = csv.DictReader(lines)
    for row in reader:
        # calculate total revenue for each row
        total = float(row["QUANTITYORDERED"]) * float(row["PRICEEACH"])
        # add to the corresponding category
        category_revenue[row["PRODUCTLINE"]] += total

# sort categories by total revenue in descending order
sorted_categories = sorted(category_revenue.items(), key=lambda x: x[1], reverse=True)

# log
for category, revenue in sorted_categories[:5]:
    print(f"{category}: ${revenue:,.2f}")

### <a id='toc3_1_2_'></a>[Using Pandas](#toc0_)


In [None]:
df = pd.read_csv(SALES_PATH, encoding='latin-1')

df["Total Revenue"] = df["QUANTITYORDERED"] * df["PRICEEACH"]
category_revenue = df.groupby("PRODUCTLINE")["Total Revenue"].sum().sort_values(ascending=False)

# log
print(category_revenue.head(5))
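
To make the `groupby` chain easier to check by hand, here is a minimal offline sketch. It reuses the column names from the sales dataset, but the rows are hypothetical; `nlargest` is shown as one possible alternative to sorting and then taking the head:

```python
import pandas as pd

# hypothetical rows with the same column names as the sales dataset
df_demo = pd.DataFrame(
    {
        "PRODUCTLINE": ["Cars", "Cars", "Planes", "Ships"],
        "QUANTITYORDERED": [2, 1, 3, 4],
        "PRICEEACH": [100.0, 50.0, 20.0, 10.0],
    }
)

# per-row revenue, then one total per category, largest first
df_demo["Total Revenue"] = df_demo["QUANTITYORDERED"] * df_demo["PRICEEACH"]
revenue = df_demo.groupby("PRODUCTLINE")["Total Revenue"].sum().sort_values(ascending=False)

# alternative: pick the top entries directly
top_2 = revenue.nlargest(2)

# log
print(revenue)
print(top_2)
```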

## <a id='toc3_2_'></a>[Example 2](#toc0_)

- Find the top 3 **highest-rated Drama** movies released between **2007** and **2011**.
- About `movies.csv`: [github.com/mr-pylin/datasets/raw/refs/heads/main/data/tabular-data/movies/dataset.csv](https://github.com/mr-pylin/datasets/raw/refs/heads/main/data/tabular-data/movies/dataset.csv)


### <a id='toc3_2_1_'></a>[Using Raw Python](#toc0_)


In [None]:
movies = []

with urlopen(MOVIES_PATH) as response:
    data = response.read().decode('utf-8')
    lines = data.splitlines()
    reader = csv.DictReader(lines)
    for row in reader:
        if row["Genre"] == "Drama" and 2007 <= int(row["Year"]) <= 2011:
            movies.append({
                "Film": row["Film"], 
                "Year": int(row["Year"]), 
                "Audience score %": float(row["Audience score %"])
            })

# sort selected films by audience score in descending order
movies_sorted = sorted(movies, key=lambda x: x["Audience score %"], reverse=True)
top_3 = movies_sorted[:3]

# log
for movie in top_3:
    print(movie)

### <a id='toc3_2_2_'></a>[Using Pandas](#toc0_)


In [None]:
df = pd.read_csv(MOVIES_PATH)

top_movies = (
    df[(df["Genre"] == "Drama") & (df["Year"].between(2007, 2011))]
    .sort_values(
        "Audience score %",
        ascending=False,
    )
    .head(3)
)

# log
print(top_movies[["Film", "Year", "Genre", "Audience score %"]])
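
The filter above combines two boolean masks. The sketch below builds them step by step on hypothetical rows (same column names as the movies dataset, invented values), which may help when debugging a condition that selects nothing:

```python
import pandas as pd

# hypothetical rows with the same column names as the movies dataset
df_demo = pd.DataFrame(
    {
        "Film": ["A", "B", "C", "D", "E"],
        "Genre": ["Drama", "Comedy", "Drama", "Drama", "Drama"],
        "Year": [2008, 2009, 2012, 2010, 2007],
        "Audience score %": [70, 90, 95, 85, 60],
    }
)

is_drama = df_demo["Genre"] == "Drama"
in_range = df_demo["Year"].between(2007, 2011)  # both endpoints included by default
mask = is_drama & in_range                      # element-wise AND of the two masks

# keep matching rows, then take the two highest audience scores
top_2 = df_demo[mask].nlargest(2, "Audience score %")

# log
print(mask.tolist())
print(top_2[["Film", "Year", "Audience score %"]])
```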