<div style="display: flex; justify-content: space-between; align-items: center;">
    <div style="text-align: left; flex: 4;">
        📝 <strong>Author:</strong> Amirhossein Heydari -
        📧 <strong>Email:</strong> <a href="mailto:amirhosseinheydari78@gmail.com">amirhosseinheydari78@gmail.com</a> -
        📍 <strong>Origin:</strong> <a href="https://github.com/mr-pylin/pandas-workshop" target="_blank" rel="noopener">pandas-workshop</a>
    </div>
    <div style="text-align: right; flex: 1;">
        <img src="../assets/images/pandas/logo/pandas_white.svg" alt="Pandas Logo" style="max-height: 48px; width: auto; background-color: #1f1f1f; border-radius: 8px;">
    </div>
</div>
<hr>


**Table of contents**<a id='toc0_'></a>    
- [Dependencies](#toc1_)    
- [Pandas Data Structures](#toc2_)    
  - [Series](#toc2_1_)    
  - [DataFrame](#toc2_2_)    
  - [DataFrame vs. Structured Array](#toc2_3_)    
- [Common Attributes and Methods](#toc3_)    
  - [Core Attributes](#toc3_1_)    
    - [Shape and Size](#toc3_1_1_)    
    - [Index and Labels](#toc3_1_2_)    
    - [Content Access / Representation](#toc3_1_3_)    
    - [Metadata / Utility](#toc3_1_4_)    
  - [Core Methods](#toc3_2_)    
    - [Data Inspection / Quick View](#toc3_2_1_)    
    - [Unique / Value Analysis](#toc3_2_2_)    
    - [Missing Data](#toc3_2_3_)    
    - [Data Copy](#toc3_2_4_)    
    - [Sorting](#toc3_2_5_)    
- [Dtypes and Type Inference](#toc4_)    
  - [Data Types](#toc4_1_)    
  - [Type Inference in Pandas](#toc4_2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Dependencies](#toc0_)


In [None]:
import numpy as np
import pandas as pd

In [None]:
# set maximum line width for printing arrays
np.set_printoptions(linewidth=120)

# disable wrapping entirely
pd.set_option('display.expand_frame_repr', False)

In [None]:
data_1 = {
    "Employee_ID": [101, 102, 103, 104, 105, 106, 107, 108],
    "Name": [
        "Alice Johnson",
        "Bob Smith",
        "Charlie Brown",
        "Diana Davis",
        "Eve Wilson",
        "Frank Miller",
        "Grace Lee",
        "Henry Taylor",
    ],
    "Department": ["Engineering", "Marketing", "Sales", "Engineering", "HR", "Sales", "Marketing", "Engineering"],
    "Salary": [75000, 62000, 58000, None, 55000, 61000, 59000, 78000],
    "Experience_Years": [3, 5, 2, 7, 4, 3, 6, None],
}

In [None]:
data_2 = [
    [101, "Alice Johnson", "Engineering", 75000, 3],
    [102, "Bob Smith", "Marketing", 62000, 5],
    [103, "Charlie Brown", "Sales", 58000, 2],
    [104, "Diana Davis", "Engineering", 82000, 7],
    [105, "Eve Wilson", "HR", 55000, 4],
    [106, "Frank Miller", "Sales", 61000, 3],
    [107, "Grace Lee", "Marketing", 59000, 6],
    [108, "Henry Taylor", "Engineering", 78000, 5],
]

columns_2 = ["Employee_ID", "Name", "Department", "Salary", "Experience_Years"]

In [None]:
# note: NumPy arrays are homogeneous, so all data becomes strings
data_3 = np.array(
    [
        ["101", "Alice Johnson", "Engineering", "75000", "3"],
        ["102", "Bob Smith", "Marketing", "62000", "5"],
        ["103", "Charlie Brown", "Sales", "58000", "2"],
        ["104", "Diana Davis", "Engineering", "82000", "7"],
        ["105", "Eve Wilson", "HR", "55000", "4"],
        ["106", "Frank Miller", "Sales", "61000", "3"],
        ["107", "Grace Lee", "Marketing", "59000", "6"],
        ["108", "Henry Taylor", "Engineering", "78000", "5"],
    ]
)

columns_3 = ["Employee_ID", "Name", "Department", "Salary", "Experience_Years"]

# <a id='toc2_'></a>[Pandas Data Structures](#toc0_)

📚 **Tutorials**:

- What kind of data does pandas handle? [pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html](https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html)


## <a id='toc2_1_'></a>[Series](#toc0_)

- A `pandas.Series` is a **one-dimensional** labeled array capable of holding any data type.
- It contains **data values** (integers, strings, floats, etc.) and **indexes** (labels that identify each value).
- You can think of a `pandas.Series` as a **single column** of a **spreadsheet**, where each entry has both a **value** and a **label**.
- It is perfect for representing **simple, one-dimensional data** like daily **temperatures**, **student grades**, or **stock prices**.

<div style="text-align: center; padding-top: 10px;">
    <img src="../assets/images/pandas/tutorials/01/table_series.svg" alt="Pandas Series" style="min-width: 128px; max-width: 40%; height: auto; background-color: #DBDBDB; border-radius: 16px;">
    <p><em>Figure 1: Pandas Series structure</em> (<a href="https://pandas.pydata.org/docs/getting_started/index.html" target="_blank">source</a>)</p>
</div>

✍️ **Key Characteristics**

- **Homogeneous data type:** a Series has a single dtype, so its values are typically of the same type (`int`, `float`, `str`, etc.); mixed values fall back to the generic `object` dtype.
- **Automatic indexing:** if you don’t provide labels, Pandas assigns **integer** indexes starting from `0`.
- **Custom indexing:** you can assign meaningful labels (e.g., **names**, **dates**, **categories**).

📝 **Docs**:

- `pandas.Series`: [pandas.pydata.org/docs/reference/api/pandas.Series.html](https://pandas.pydata.org/docs/reference/api/pandas.Series.html)


In [None]:
# creating a Series from a list
name = data_1["Name"]
s1 = pd.Series(name)

# log
print(f"type(name) : {type(name)}")
print(f"name       : {name}")
print(f"type(s1)   : {type(s1)}\n")
print(f"s1:\n{s1}")

In [None]:
# creating a Series from an ndarray with a custom index
salary = data_3[:, 3]
s2 = pd.Series(salary, index=data_1["Name"])

# log
print(f"type(salary) : {type(salary)}")
print(f"salary       : {salary}")
print(f"type(s2)     : {type(s2)}\n")
print(f"s2:\n{s2}")

In [None]:
# creating a Series from a dictionary
experience = {k: v for k, v in zip(data_1["Name"], data_1["Experience_Years"])}
s3 = pd.Series(experience)


# log
print(f"type(experience) : {type(experience)}")
print(f"experience       : {experience}")
print(f"type(s3)         : {type(s3)}\n")
print(f"s3:\n{s3}")

## <a id='toc2_2_'></a>[DataFrame](#toc0_)

- A `pandas.DataFrame` is a **two-dimensional**, **size-mutable**, **heterogeneous** tabular data structure with labeled axes (rows and columns).
- It is essentially a collection of `pandas.Series` objects aligned on a **shared index**, similar to a spreadsheet with rows and columns.

<div style="text-align: center; padding-top: 10px;">
    <img src="../assets/images/pandas/tutorials/01/table_dataframe.svg" alt="Pandas DataFrame" style="min-width: 256px; max-width: 40%; height: auto; background-color: #DBDBDB; border-radius: 16px;">
    <p><em>Figure 2: Pandas DataFrame structure</em> (<a href="https://pandas.pydata.org/docs/getting_started/index.html" target="_blank">source</a>)</p>
</div>

❓ **Why is a DataFrame useful?**

- Most real-world data (e.g., **CSV files**, **Excel sheets**, **SQL tables**) is tabular, not just one-dimensional.
- DataFrames let you store and manipulate **heterogeneous data**: one column may hold integers, another strings, another dates, etc.
- They provide powerful tools for filtering, grouping, joining, and summarizing data.

📝 **Docs**:

- `pandas.DataFrame`: [pandas.pydata.org/docs/reference/api/pandas.DataFrame.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)


In [None]:
# creating a DataFrame from a dictionary
df1 = pd.DataFrame(data_1)

# log
print(f"type(df1): {type(df1)}\n")
print(f"df1:\n{df1}")

In [None]:
# creating a DataFrame from a list
df2 = pd.DataFrame(data_2, columns=columns_2)

# log
print(f"type(df2): {type(df2)}\n")
print(f"df2:\n{df2}")

In [None]:
# creating a DataFrame from an ndarray
df3 = pd.DataFrame(data_3, columns=columns_3)

# log
print(f"type(df3): {type(df3)}\n")
print(f"df3:\n{df3}")

In [None]:
# creating a DataFrame from Series
df4 = pd.DataFrame({"Salary": s2, "Experience_Years": s3})

# log
print(f"type(df4): {type(df4)}\n")
print(f"df4:\n{df4}")
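Because a DataFrame aligns its columns on a shared index, Series are matched by **label**, not by position. The sketch below uses made-up data (the names `salary` and `experience` and their values are illustrative only) to show what happens when the labels only partially overlap:

```python
import pandas as pd

# two Series whose labels only partially overlap (illustrative data)
salary = pd.Series({"Alice": 75000, "Bob": 62000, "Charlie": 58000})
experience = pd.Series({"Bob": 5, "Charlie": 2, "Diana": 7})

# columns are aligned by label; a label missing from one Series becomes NaN
aligned = pd.DataFrame({"Salary": salary, "Experience_Years": experience})

# log
print(f"aligned:\n{aligned}")
```

The resulting index is the union of both label sets, so every label appears once even if only one Series provides a value for it.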

## <a id='toc2_3_'></a>[DataFrame vs. Structured Array](#toc0_)

- A structured array in **NumPy** is a type of array where each element can have **named fields**, and each field can have its own **data type**.
- Fields are accessed by **name**, and the array is still a regular NumPy array with a single (compound) dtype shared by all elements.
- Structured arrays are **memory-efficient** and **simple** for **small**, **fixed** datasets.
- DataFrames are more **flexible** and designed for **real-world data**, especially when working with **heterogeneous types**, **missing values**, or **large** datasets.

✍️ **Key Differences**

<table style="margin: 0 auto;">
  <thead>
    <tr>
      <th>Feature</th>
      <th>Structured Array (NumPy)</th>
      <th>DataFrame (Pandas)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Column access</td>
      <td>By field name (<code>data["field"]</code>)</td>
      <td>By column name (<code>df["column"]</code>)</td>
    </tr>
    <tr>
      <td>Row labels / index</td>
      <td>None</td>
      <td>Flexible <code>.index</code></td>
    </tr>
    <tr>
      <td>Heterogeneous types</td>
      <td>Supported per field</td>
      <td>Supported per column</td>
    </tr>
    <tr>
      <td>Missing values</td>
      <td>Not supported natively</td>
      <td>Supports <code>NaN</code></td>
    </tr>
    <tr>
      <td>Operations</td>
      <td>Limited; mostly vectorized math</td>
      <td>Powerful: filtering, grouping, joining</td>
    </tr>
    <tr>
      <td>Integration with analysis tools</td>
      <td>Limited</td>
      <td>Excellent: plotting, I/O, NumPy interop</td>
    </tr>
  </tbody>
</table>

📝 **Docs**:

- Structured arrays **[numpy]**: [numpy.org/doc/stable/user/basics.rec.html](https://numpy.org/doc/stable/user/basics.rec.html)
- More info: [github.com/mr-pylin/numpy-workshop/blob/main/code/15-structured-array.ipynb](https://github.com/mr-pylin/numpy-workshop/blob/main/code/15-structured-array.ipynb)


In [None]:
# define a structured array dtype
dtype = [("Name", "U10"), ("Age", "i4"), ("City", "U15")]

# create the structured array
data = np.array(
    [
        ("Alice", 25, "New York"),
        ("Bob", 30, "Los Angeles"),
        ("Charlie", 35, "San Francisco"),
    ],
    dtype=dtype,
)

# log
print(f"dtype        : {dtype}")
print(f"data         : {data}")
# single quotes inside the f-string keep this valid on Python versions before 3.12
print(f"data['Name'] : {data['Name']}")
print(f"data['Age']  : {data['Age']}")
print(f"type(data)   : {type(data)}")

In [None]:
# equivalent DataFrame
df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie"],
        "Age": [25, 30, 35],
        "City": ["New York", "Los Angeles", "San Francisco"],
    }
)

# log
print(f"df:\n{df}\n")
# single quotes inside the f-string keep this valid on Python versions before 3.12
print(f"df['Name']:\n{df['Name']}\n")
print(f"df['Age']:\n{df['Age']}")
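The two representations can also be converted into each other. A minimal sketch, assuming a small structured array with the same layout as the example above (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# a small structured array (same field layout as the example above)
records = np.array(
    [("Alice", 25, "New York"), ("Bob", 30, "Los Angeles")],
    dtype=[("Name", "U10"), ("Age", "i4"), ("City", "U15")],
)

# each named field becomes a DataFrame column
df_from_records = pd.DataFrame(records)

# convert back to a NumPy record array (index=False leaves out the row index)
back_to_records = df_from_records.to_records(index=False)

# log
print(f"df_from_records:\n{df_from_records}\n")
print(f"back_to_records.dtype.names: {back_to_records.dtype.names}")
```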

# <a id='toc3_'></a>[Common Attributes and Methods](#toc0_)

- Before performing any analysis, it’s essential to understand the **structure** and **content** of your dataset.
  - How many rows and columns does it have?
  - What are the column names and index labels?
  - What data types are present?
  - Are there missing values?
- Inspection helps you build intuition about the data and spot potential issues early.


📝 **Docs**:

- `Series` Attributes: [pandas.pydata.org/docs/reference/api/pandas.Series.html](https://pandas.pydata.org/docs/reference/api/pandas.Series.html#:~:text=changed%20as%20well.-,Attributes,-T)
- `Series` Methods: [pandas.pydata.org/docs/reference/api/pandas.Series.html](https://pandas.pydata.org/docs/reference/api/pandas.Series.html#:~:text=on%20the%20dtype.-,Methods,-abs())
- `DataFrame` Attributes: [pandas.pydata.org/docs/reference/api/pandas.DataFrame.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#:~:text=1%0Ac%20%203-,Attributes,-T)
- `DataFrame` Methods: [pandas.pydata.org/docs/reference/api/pandas.DataFrame.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#:~:text=of%20the%20DataFrame.-,Methods,-abs())


## <a id='toc3_1_'></a>[Core Attributes](#toc0_)

- These attributes provide quick information about the structure of a `DataFrame` or `Series`:

<table style="margin: 0 auto;">
  <thead>
    <tr>
      <th>Attribute</th>
      <th>Applies to</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code>.shape</code></td>
      <td><code>Series, DataFrame</code></td>
      <td>Tuple representing dimensions; (rows, cols) for DataFrame, (n,) for Series</td>
    </tr>
    <tr>
      <td><code>.size</code></td>
      <td><code>Series, DataFrame</code></td>
      <td>Total number of elements (rows × columns for DataFrame)</td>
    </tr>
    <tr>
      <td><code>.ndim</code></td>
      <td><code>Series, DataFrame</code></td>
      <td>Number of array dimensions (1 for Series, 2 for DataFrame)</td>
    </tr>
    <tr>
      <td><code>.index</code></td>
      <td><code>Series, DataFrame</code></td>
      <td>Row index object that stores row labels</td>
    </tr>
    <tr>
      <td><code>.columns</code></td>
      <td><code>DataFrame</code></td>
      <td>Column index object containing column labels</td>
    </tr>
    <tr>
      <td><code>.dtypes</code></td>
      <td><code>Series, DataFrame</code></td>
      <td>Data type(s) of the Series or each column in a DataFrame</td>
    </tr>
    <tr>
      <td><code>.T</code></td>
      <td><code>Series, DataFrame</code></td>
      <td>Transpose of the object (rows become columns and vice versa); a Series is returned unchanged</td>
    </tr>
    <tr>
      <td><code>.values</code></td>
      <td><code>Series, DataFrame</code></td>
      <td>Underlying NumPy array representation (legacy, prefer <code>.to_numpy()</code>)</td>
    </tr>
    <tr>
      <td><code>.axes</code></td>
      <td><code>Series, DataFrame</code></td>
      <td>List of axis objects: index for Series, [rows, cols] for DataFrame</td>
    </tr>
    <tr>
      <td><code>.empty</code></td>
      <td><code>Series, DataFrame</code></td>
      <td>Boolean indicating whether the object has no elements</td>
    </tr>
    <tr>
      <td><code>.name</code></td>
      <td><code>Series</code></td>
      <td>Optional name of the Series; often useful when part of a DataFrame or groupby result</td>
    </tr>
    <tr>
      <td><code>.index.names</code>, <code>.columns.names</code></td>
      <td><code>DataFrame</code></td>
      <td>Names of the index and column axes (one per level), accessed through the axis objects; can be set for labeled or hierarchical indexing</td>
    </tr>
  </tbody>
</table>

📝 **Docs**:

- Attributes and underlying data: [pandas.pydata.org/docs/user_guide/basics.html#attributes-and-underlying-data](https://pandas.pydata.org/docs/user_guide/basics.html#attributes-and-underlying-data)


### <a id='toc3_1_1_'></a>[Shape and Size](#toc0_)


In [None]:
# dataframe
print(f"df1.shape : {df1.shape}")
print(f"df1.size  : {df1.size}")
print(f"df1.ndim  : {df1.ndim}")


In [None]:
# series
print(f"df1['Salary'].shape : {df1['Salary'].shape}")
print(f"df1['Salary'].size  : {df1['Salary'].size}")
print(f"df1['Salary'].ndim  : {df1['Salary'].ndim}")


### <a id='toc3_1_2_'></a>[Index and Labels](#toc0_)


In [None]:
# series
index_1 = df1['Salary'].index.to_list()  # df1['Salary'].axes[0]

# dataframe
# note: new names are used so the earlier columns_2 / columns_3 lists are not overwritten
index_2 = df1.index.to_list()            # df1.axes[0]
df1_columns = df1.columns.to_list()      # df1.axes[1]

index_3 = df4.index.to_list()
df4_columns = df4.columns.to_list()

# log
print(f"index_1     : {index_1}")
print(f"index_2     : {index_2}")
print(f"index_3     : {index_3}")
print(f"df1_columns : {df1_columns}")
print(f"df4_columns : {df4_columns}")

### <a id='toc3_1_3_'></a>[Content Access / Representation](#toc0_)


In [None]:
# transpose
df1.T

In [None]:
# <.values> is the legacy accessor [⚠️] -> prefer <.to_numpy()> in new code [modern ✅]
values_1 = df1['Salary'].values
values_2 = df1.values
values_3 = df4.values

# log
print(f"df1['Salary'].values:\n{values_1}\n")
print(f"df1.values:\n{values_2}\n")
print(f"df4.values:\n{values_3}")
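Since `.to_numpy()` is the recommended replacement for `.values`, here is a short sketch of how it behaves. The frame and its values are illustrative only:

```python
import pandas as pd

# illustrative frame with one numeric and one text column
frame = pd.DataFrame(
    {
        "Salary": [75000.0, 62000.0, None],
        "Name": ["Alice", "Bob", "Charlie"],
    }
)

# single column -> 1-D array that keeps the column's dtype
salary_array = frame["Salary"].to_numpy()

# mixed columns -> 2-D array with a common dtype (object here)
frame_array = frame.to_numpy()

# to_numpy() can also set the dtype and the value used for missing entries
salary_filled = frame["Salary"].to_numpy(dtype="float64", na_value=0.0)

# log
print(f"salary_array  : {salary_array}")
print(f"frame_array   :\n{frame_array}")
print(f"salary_filled : {salary_filled}")
```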

### <a id='toc3_1_4_'></a>[Metadata / Utility](#toc0_)


In [None]:
# assign names to index and columns
s1.name = "values"
df1.index.name = "row_id"
df4.columns.name = "attributes"

# log
print(f"s1:\n{s1}\n")
print(f"df1:\n{df1}\n")
print(f"df4:\n{df4}")

In [None]:
# empty check
print(f"s1.empty             : {s1.empty}")
print(f"df1.empty            : {df1.empty}")
print(f"df2.empty            : {df2.empty}")
print(f"pd.DataFrame().empty : {pd.DataFrame().empty}")

## <a id='toc3_2_'></a>[Core Methods](#toc0_)

- These methods provide common operations and inspection tools for a `DataFrame` or `Series`:

<table style="margin: 0 auto;">
  <thead>
    <tr>
      <th>Method</th>
      <th>Applies to</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code>.head()</code></td>
      <td><code>Series, DataFrame</code></td>
      <td>Return the first n rows (default 5) for quick inspection</td>
    </tr>
    <tr>
      <td><code>.tail()</code></td>
      <td><code>Series, DataFrame</code></td>
      <td>Return the last n rows (default 5) of the object</td>
    </tr>
    <tr>
      <td><code>.sample()</code></td>
      <td><code>Series, DataFrame</code></td>
      <td>Return a random sample of items; size and random_state optional</td>
    </tr>
    <tr>
      <td><code>.info()</code></td>
      <td><code>Series, DataFrame</code></td>
      <td>Summary including columns, data types, non-null counts, memory</td>
    </tr>
    <tr>
      <td><code>.describe()</code></td>
      <td><code>Series, DataFrame</code></td>
      <td>Generate descriptive statistics for numeric or object data</td>
    </tr>
    <tr>
      <td><code>.value_counts()</code></td>
      <td><code>Series, DataFrame</code></td>
      <td>Return counts of unique values (unique rows for a DataFrame), sorted in descending order</td>
    </tr>
    <tr>
      <td><code>.unique()</code></td>
      <td><code>Series</code></td>
      <td>Return an array of unique values in the Series</td>
    </tr>
    <tr>
      <td><code>.nunique()</code></td>
      <td><code>Series, DataFrame</code></td>
      <td>Count the number of distinct elements per Series or column</td>
    </tr>
    <tr>
      <td><code>.isna()</code></td>
      <td><code>Series, DataFrame</code></td>
      <td>Return a boolean mask indicating missing values (<code>None</code>, <code>NaN</code>, <code>NaT</code>, <code>pd.NA</code>)</td>
    </tr>
    <tr>
      <td><code>.notna()</code></td>
      <td><code>Series, DataFrame</code></td>
      <td>Return a boolean mask indicating non-missing values</td>
    </tr>
    <tr>
      <td><code>.copy()</code></td>
      <td><code>Series, DataFrame</code></td>
      <td>Create a deep copy to avoid modifying the original data</td>
    </tr>
    <tr>
      <td><code>.astype()</code></td>
      <td><code>Series, DataFrame</code></td>
      <td>Cast the object to a specified data type</td>
    </tr>
    <tr>
      <td><code>.sort_values()</code></td>
      <td><code>Series, DataFrame</code></td>
      <td>Sort by values along the specified axis</td>
    </tr>
    <tr>
      <td><code>.sort_index()</code></td>
      <td><code>Series, DataFrame</code></td>
      <td>Sort by index labels (row or column)</td>
    </tr>
  </tbody>
</table>


### <a id='toc3_2_1_'></a>[Data Inspection / Quick View](#toc0_)


In [None]:
# head and tail
print(f"df1.head(2):\n{df1.head(2)}\n")
print(f"df1.tail(3):\n{df1.tail(3)}")

In [None]:
# random sampling
print(f"df1.sample(2):\n{df1.sample(2)}\n")
print(f"df1.sample(2):\n{df1.sample(2)}")

In [None]:
# info
df1.info()


In [None]:
# describe
print(f"df1.describe():\n{df1.describe()}")
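By default `describe()` summarizes only numeric columns. The sketch below, on a small illustrative frame, shows how text columns are summarized and how `include="all"` covers every column at once:

```python
import pandas as pd

# illustrative data
frame = pd.DataFrame(
    {
        "Department": ["Engineering", "Sales", "Engineering", "HR"],
        "Salary": [75000, 58000, 82000, 55000],
    }
)

# default: numeric columns only (count, mean, std, min, quartiles, max)
numeric_summary = frame.describe()

# text column: count, unique, top (most frequent value), freq
text_summary = frame["Department"].describe()

# every column at once; statistics that do not apply to a column are NaN
full_summary = frame.describe(include="all")

# log
print(f"numeric_summary:\n{numeric_summary}\n")
print(f"text_summary:\n{text_summary}\n")
print(f"full_summary:\n{full_summary}")
```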

### <a id='toc3_2_2_'></a>[Unique / Value Analysis](#toc0_)


In [None]:
# series
s_unique = df1["Department"].unique()
s_nunique = df1["Department"].nunique()
s_value_counts = df1["Department"].value_counts()

# log
print(f"s_unique:\n{s_unique}\n")
print(f"s_nunique:\n{s_nunique}\n")
print(f"s_value_counts:\n{s_value_counts}")

In [None]:
# dataframe
df_nunique = df1.nunique()

# log
print(f"df_nunique:\n{df_nunique}")
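`value_counts()` can also report relative frequencies, and a DataFrame offers the same method for counting unique row combinations. A minimal sketch on illustrative data (the `Level` column is hypothetical):

```python
import pandas as pd

# illustrative data; "Level" is a made-up column for this example
frame = pd.DataFrame(
    {
        "Department": ["Engineering", "Sales", "Engineering", "HR", "Sales", "Engineering"],
        "Level": ["Senior", "Junior", "Senior", "Junior", "Junior", "Junior"],
    }
)

# relative frequencies instead of raw counts
dept_share = frame["Department"].value_counts(normalize=True)

# DataFrame.value_counts counts unique row combinations
combo_counts = frame.value_counts()

# log
print(f"dept_share:\n{dept_share}\n")
print(f"combo_counts:\n{combo_counts}")
```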

### <a id='toc3_2_3_'></a>[Missing Data](#toc0_)


In [None]:
df1.isna()  # df1.isnull()

In [None]:
df1.notna()  # df1.notnull() | ~df1.isna()
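The boolean masks become more useful once they are aggregated. A short sketch, using illustrative data, that counts missing values and selects the affected rows:

```python
import pandas as pd

# illustrative data with a few gaps
frame = pd.DataFrame(
    {
        "Salary": [75000, None, 58000, None],
        "Experience_Years": [3, 5, None, 7],
    }
)

# True counts as 1, so summing the mask gives the missing count per column
missing_per_column = frame.isna().sum()

# total number of missing cells
missing_total = int(frame.isna().sum().sum())

# rows that contain at least one missing value
rows_with_missing = frame[frame.isna().any(axis=1)]

# log
print(f"missing_per_column:\n{missing_per_column}\n")
print(f"missing_total: {missing_total}\n")
print(f"rows_with_missing:\n{rows_with_missing}")
```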

### <a id='toc3_2_4_'></a>[Data Copy](#toc0_)


In [None]:
# modifying copy does not affect original
df5 = df1.copy()
df5["Employee_ID"] += 10

# log
print(f"df1:\n{df1}\n")
print(f"df5:\n{df5}")
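For contrast, plain assignment does not copy anything; it only binds a second name to the same object. A minimal sketch with illustrative data:

```python
import pandas as pd

original = pd.DataFrame({"Employee_ID": [101, 102, 103]})

# plain assignment: a second name for the same object
alias = original

# copy(): an independent object (deep=True is the default)
independent = original.copy()

alias["Employee_ID"] += 10          # also visible through `original`
independent["Employee_ID"] += 100   # leaves `original` untouched

# log
print(f"alias is original       : {alias is original}")
print(f"independent is original : {independent is original}")
print(f"original:\n{original}\n")
print(f"independent:\n{independent}")
```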

### <a id='toc3_2_5_'></a>[Sorting](#toc0_)


In [None]:
s2.sort_index()

In [None]:
s2.sort_values(ascending=False)

In [None]:
df1.sort_values(by=["Experience_Years", "Salary"])
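When sorting by several columns, the direction can be set per column and the placement of missing values can be controlled. A sketch on illustrative data:

```python
import pandas as pd

# illustrative data; one experience value is missing
frame = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "Diana"],
        "Experience_Years": [3, 5, 3, None],
        "Salary": [75000, 62000, 58000, 82000],
    }
)

# experience ascending, ties broken by salary descending, missing values first
sorted_frame = frame.sort_values(
    by=["Experience_Years", "Salary"],
    ascending=[True, False],
    na_position="first",
)

# log (sorting returns a new object; `frame` keeps its original order)
print(f"sorted_frame:\n{sorted_frame}\n")
print(f"frame:\n{frame}")
```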

# <a id='toc4_'></a>[Dtypes and Type Inference](#toc0_)


## <a id='toc4_1_'></a>[Data Types](#toc0_)

- Every column (`Series`) has an associated **dtype** (data type) that tells Pandas how to **store** and **interpret** the values.


📈 **Common dtypes**

- **Numeric types:** `int64`, `float64`
- **Text type:** `object` (usually strings; depending on the pandas version and settings, a dedicated string dtype may be used instead)
- **Boolean:** `bool`
- **Datetime and Timedelta:** `datetime64`, `timedelta64`
- **Categorical:** `category` (for repeated values with limited categories)


‼️**Knowing the dtype is important!**

- **Memory usage:** some types are more efficient than others.
- **Operations:** the dtype determines which operations are valid (e.g., arithmetic on numeric columns, date arithmetic on datetime columns).
- **Interoperability:** when exporting to other tools or databases, **dtype compatibility** matters.

📝 **Docs**:

- dtypes: [pandas.pydata.org/docs/user_guide/basics.html#dtypes](https://pandas.pydata.org/docs/user_guide/basics.html#dtypes)


In [None]:
df = pd.DataFrame({
    "integers": [1, 2, 3, 4],
    "floats": [1.5, 2.5, 3.5, 4.5],
    "strings": ["apple", "banana", "cherry", "date"],
    "booleans": [True, False, True, False],
    "datetimes": pd.date_range("2023-01-01", periods=4),
    "categories": pd.Series(["A", "B", "A", "C"], dtype="category")
})

# log
print(f"df:\n{df}\n")
print(f"df.dtypes:\n{df.dtypes}")

## <a id='toc4_2_'></a>[Type Inference in Pandas](#toc0_)

- When you create a `Series` or `DataFrame`, Pandas **automatically** infers the **dtype** based on the **input data**.
- This inference makes it easy to start working with data, but sometimes it **may not match** your expectations.

✍️ **Key Points**

- Mixed types often default to `object`, which is **less** efficient.
- **Dates** and **times** may be read as **plain strings** unless **explicitly parsed**.
- You can **override** automatic inference by casting to a specific dtype.

⚒️ **Working with Dtypes**

- Sometimes you need to change or control dtypes:
  - **Inspection:** `.dtypes` shows the dtype of each column.
  - **Conversion:** `.astype()` lets you convert a column to a new dtype.
  - **Optimization:** converting columns to `category` or **smaller integer types** can save memory.

📝 **Docs**:

- `pandas.Series.astype`: [pandas.pydata.org/docs/reference/api/pandas.Series.astype.html](https://pandas.pydata.org/docs/reference/api/pandas.Series.astype.html)
- `pandas.DataFrame.astype`: [pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html)


In [None]:
# pandas infers dtypes automatically [implicit]
s4 = pd.Series([10, 20, 30])                  # integers -> int64   ✅
s5 = pd.Series([1.1, None, 3.3])              # floats   -> float64 ✅
s6 = pd.Series(["a", "b", "c"])               # strings  -> object  ✅
s7 = pd.Series(["2023-01-01", "2023-02-01"])  # date strings -> object ❌ (not parsed as datetime64)

# log
print(f"s4.dtype: {s4.dtype}")
print(f"s5.dtype: {s5.dtype}")
print(f"s6.dtype: {s6.dtype}")
print(f"s7.dtype: {s7.dtype}")

In [None]:
# casting with astype()
s7_casted = s7.astype('datetime64[ns]')

# log
print(f"s7_casted.dtype: {s7_casted.dtype}")

In [None]:
# explicit conversion with pd.to_datetime (overrides inference)
s7_converted = pd.to_datetime(s7)

# log
print(f"s7_converted.dtype: {s7_converted.dtype}")
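Numbers stored as text run into the same inference problem as date strings. A minimal sketch, assuming an illustrative column that contains one unparseable entry:

```python
import pandas as pd

# numbers stored as text, e.g. after loading from an all-string source
salary_text = pd.Series(["75000", "62000", "n/a", "58000"])

# errors="coerce" turns unparseable entries into NaN instead of raising
salary_numeric = pd.to_numeric(salary_text, errors="coerce")

# log
print(f"salary_text.dtype    : {salary_text.dtype}")
print(f"salary_numeric.dtype : {salary_numeric.dtype}")
print(f"salary_numeric:\n{salary_numeric}")
```

`.astype(float)` would raise on the `"n/a"` entry, which is why `pd.to_numeric` is used here.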

In [None]:
# mixed types default to object
s_mixed = pd.Series([1, "two", 3.0])

# log
print(f"s_mixed.dtype: {s_mixed.dtype}")

In [None]:
# optimization: use category for repeated strings
s_object = pd.Series(["dog", "cat", "dog", "dog", "cat"])
s_categorical = pd.Series(["dog", "cat", "dog", "dog", "cat"], dtype="category")

# check size [bytes]
s_object_size = s_object.memory_usage(deep=True)
s_categorical_size = s_categorical.memory_usage(deep=True)

# log
print(f"s_object.dtype      : {s_object.dtype}")
print(f"s_object_size       : {s_object_size} bytes")
print(f"s_categorical.dtype : {s_categorical.dtype}")
print(f"s_categorical_size  : {s_categorical_size} bytes")
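The same idea applies to integers: when the value range is known to be small, a narrower integer type uses less memory. A sketch with synthetic data (values 0 to 99, chosen so they fit into `int8`):

```python
import numpy as np
import pandas as pd

# small integers stored with the default 64-bit type (synthetic data)
ages_int64 = pd.Series(np.arange(1000) % 100, dtype="int64")

# all values fit into 8 bits (-128..127), so int8 is sufficient
ages_int8 = ages_int64.astype("int8")

# index=False excludes the index from the byte count
size_int64 = ages_int64.memory_usage(index=False)
size_int8 = ages_int8.memory_usage(index=False)

# log
print(f"size_int64 : {size_int64} bytes")
print(f"size_int8  : {size_int8} bytes")
```

Downcasting is only safe when every value fits into the target range, so check the minimum and maximum first.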