# 00. Pandas Introduction

<span style="color: grey">(10-15 min - quick intro)</span>

Welcome to the Pandas DataFrames tutorial series! In this notebook, we'll introduce you to Pandas and get you started with your first DataFrame.

## What is Pandas?

**Pandas** is a powerful Python library for data manipulation and analysis. It provides data structures and functions that make working with structured data (like tables) much easier than using basic Python lists and dictionaries.

### Why Use Pandas?

- **Easy data loading**: Read data from CSV, JSON, Excel, and many other formats
- **Powerful data manipulation**: Filter, merge, transform, and analyze data efficiently
- **Perfect for tables**: Designed for working with rows and columns of data
- **Essential for data work**: Used extensively in data science, research, and analysis

### Key Concepts

Pandas has two main data structures:

1. **Series**: A one-dimensional labeled array (like a single column in a spreadsheet)
2. **DataFrame**: A two-dimensional table (like an entire spreadsheet with rows and columns)

We'll focus primarily on **DataFrames** since they're what you'll use most for working with character data, variant tables, and dictionary information.
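
To make the distinction concrete, here is a minimal sketch that builds one of each by hand. The variable names (`characters`, `table`, `column`) are illustrative only, and the three character pairs are copied from the sample output shown later in this notebook.

```python
import pandas as pd

# A Series: one column of values with an index (0, 1, 2 by default)
characters = pd.Series(["亜", "悪", "圧"], name="character")

# A DataFrame: several columns that share the same index
table = pd.DataFrame({
    "character": ["亜", "悪", "圧"],
    "variant": ["亞", "惡", "壓"],
})

print(type(characters).__name__)  # Series
print(type(table).__name__)       # DataFrame
print(table.shape)                # (3, 2)

# Selecting a single column of a DataFrame gives back a Series
column = table["variant"]
print(type(column).__name__)      # Series
```

In other words, a DataFrame can be thought of as a collection of Series that share one index.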


## Installing Pandas

If you don't have Pandas installed yet, you can install it using pip:

```bash
pip install pandas
```

Let's check that Pandas is installed and which version we have:


In [1]:
# Import pandas - we typically import it as 'pd' for convenience
import pandas as pd

# Check the version
print(f"Pandas version: {pd.__version__}")


Pandas version: 2.3.3


## Loading Your First Dataset

Let's start by loading a simple dataset. We'll use data from the `cjkvi-variants` repository, which contains variant character relationships.

The file we'll load is `joyo-variants.txt`, which contains Japanese Jōyō kanji and their variant forms.


In [None]:
# Load a comma-separated text file with pandas.
# comment='#' skips comment lines; names=... supplies the column names,
# since the file has no header row of its own.
df = pd.read_csv('../submodules/cjkvi-variants/joyo-variants.txt',
                 sep=',',
                 comment='#',
                 names=['character', 'type', 'variant'])

# Display the first few rows
df.head()


| | character | type | variant |
|---|---|---|---|
| 0 | joyo/proper | `<rev>` | joyo/variant |
| 1 | joyo/proper | `<name>` | 正字（常用漢字表） |
| 2 | joyo/variant | `<name>` | 異体字（常用漢字表） |
| 3 | 亜 | joyo/variant | 亞 |
| 4 | 悪 | joyo/variant | 惡 |
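
If the submodule is not checked out, the same `read_csv` options can be tried on an in-memory string. This is only a sketch: the `sample` text below is a hypothetical stand-in that mimics the file layout, not the real contents of `joyo-variants.txt`, and `sample_df` is an illustrative name.

```python
import io
import pandas as pd

# Hypothetical sample in the same layout as the real file
sample = """# comment lines like this one are skipped
亜,joyo/variant,亞
悪,joyo/variant,惡
"""

sample_df = pd.read_csv(
    io.StringIO(sample),   # read_csv accepts file-like objects as well as paths
    sep=",",
    comment="#",           # ignore lines that start with '#'
    names=["character", "type", "variant"],  # no header row, so name the columns
)

print(sample_df)
print(sample_df.shape)  # (2, 3)
```

Because `names` is given, Pandas treats every remaining line as data rather than using the first one as a header.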


Great! We've loaded our first DataFrame. Let's see what information we can get about it:


In [3]:
# Get basic information about the DataFrame
print("DataFrame shape (rows, columns):", df.shape)
print("\nColumn names:", df.columns.tolist())
print("\nData types:")
print(df.dtypes)
print("\nFirst few rows:")
df.head(10)


DataFrame shape (rows, columns): (367, 3)

Column names: ['character', 'type', 'variant']

Data types:
character    object
type         object
variant      object
dtype: object

First few rows:


| | character | type | variant |
|---|---|---|---|
| 0 | joyo/proper | `<rev>` | joyo/variant |
| 1 | joyo/proper | `<name>` | 正字（常用漢字表） |
| 2 | joyo/variant | `<name>` | 異体字（常用漢字表） |
| 3 | 亜 | joyo/variant | 亞 |
| 4 | 悪 | joyo/variant | 惡 |
| 5 | 圧 | joyo/variant | 壓 |
| 6 | 囲 | joyo/variant | 圍 |
| 7 | 医 | joyo/variant | 醫 |
| 8 | 為 | joyo/variant | 爲 |
| 9 | 壱 | joyo/variant | 壹 |


## Understanding the Data

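This DataFrame has three columns:
- **character**: The modern Japanese character (shinjitai)
- **type**: The type of variant relationship (e.g., "joyo/variant")
- **variant**: The traditional form (kyūjitai) of the character

Note that the first three rows (index 0 to 2) are not character pairs. Their `type` values (`<rev>`, `<name>`) suggest they are metadata declarations from the source file, so the row count of 367 includes them.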

This is exactly the kind of data you'll work with when building character lookup tables and dictionaries!
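
The first three rows of the output above have `<rev>` or `<name>` in the `type` column rather than a relationship type. One possible way to keep only the character pairs is a boolean mask, sketched here on a small stand-in DataFrame (`demo`) copied from that output. The rule "metadata rows are those whose `type` starts with `<`" is an assumption based on the rows shown, not a documented property of the file.

```python
import pandas as pd

# Stand-in for df, copied from the first rows of the output above
demo = pd.DataFrame(
    [
        ["joyo/proper", "<rev>", "joyo/variant"],
        ["joyo/proper", "<name>", "正字（常用漢字表）"],
        ["joyo/variant", "<name>", "異体字（常用漢字表）"],
        ["亜", "joyo/variant", "亞"],
        ["悪", "joyo/variant", "惡"],
    ],
    columns=["character", "type", "variant"],
)

# Assumption: a row is metadata when its type looks like "<...>"
is_metadata = demo["type"].str.startswith("<")

# Keep the other rows and renumber the index from 0
pairs = demo[~is_metadata].reset_index(drop=True)
print(pairs)
```

The same two lines (building `is_metadata`, then filtering with `~is_metadata`) should work on the full `df`, provided the assumption about the `type` column holds for the whole file.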

## Basic DataFrame Operations

Let's try a few basic operations:


In [4]:
# How many rows do we have?
print(f"Total number of variant relationships: {len(df)}")

# Get a summary of the data
df.info()


Total number of variant relationships: 367
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 367 entries, 0 to 366
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   character  367 non-null    object
 1   type       367 non-null    object
 2   variant    367 non-null    object
dtypes: object(3)
memory usage: 8.7+ KB
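
Beyond `len()` and `.info()`, a quick way to summarize a text column is `value_counts()`, which counts how often each value occurs. The sketch below uses a small stand-in DataFrame (`demo`) with rows taken from the earlier output, so the counts here are not those of the real file.

```python
import pandas as pd

# Stand-in for df with a few rows from the earlier output
demo = pd.DataFrame({
    "character": ["joyo/proper", "亜", "悪", "圧"],
    "type": ["<rev>", "joyo/variant", "joyo/variant", "joyo/variant"],
    "variant": ["joyo/variant", "亞", "惡", "壓"],
})

print(len(demo))    # 4 rows
print(demo.shape)   # (4, 3)

# How many rows are there for each value of 'type'?
counts = demo["type"].value_counts()
print(counts)
```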


## What's Next?

In the next notebook, we'll learn:
- How to load different file formats (CSV, JSON, TSV)
- How to handle encoding for CJK characters
- How to specify column names and separators
- How to deal with comment lines and other file quirks

## Reference Material

For a more comprehensive introduction to Pandas, check out:
- **PANDAS-TUTORIAL** in `../PANDAS-TUTORIAL/01-What-is-Pandas.ipynb`
- [Pandas Official Documentation](https://pandas.pydata.org/docs/)

## Try It Yourself

1. Try loading a different variant file from `../submodules/cjkvi-variants/`
2. Experiment with `.head()` and `.tail()` to see different parts of the data
3. Try accessing a single column: `df['character']`
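
As a starting point for exercises 2 and 3, here is a minimal sketch on a stand-in DataFrame (`demo`), built from character pairs in the earlier output. With the real data, replace `demo` with `df`.

```python
import pandas as pd

# Stand-in data: character pairs from the earlier output
demo = pd.DataFrame({
    "character": ["亜", "悪", "圧", "囲", "医", "為", "壱"],
    "variant": ["亞", "惡", "壓", "圍", "醫", "爲", "壹"],
})

# .head(n) and .tail(n) return the first and last n rows (default n is 5)
print(demo.head(2))
print(demo.tail(2))

# Square brackets with a column name return that column as a Series
chars = demo["character"]
print(chars.tolist())
```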
