# 02. Exploring DataFrames

Once you've loaded data into a DataFrame, the next step is to explore and understand it. This notebook covers the essential methods for inspecting your data.


In [18]:
import pandas as pd

# Load some data to work with
df = pd.read_csv('../cjkvi-variants/joyo-variants.txt',
                 sep=',',
                 skiprows=4,
                 comment='#',
                 names=['character', 'type', 'variant'],
                 encoding='utf-8')


## Viewing Data

### Viewing the First and Last Rows


In [None]:
# View first 5 rows (default)
# df.head()

# View first 10 rows
df.head(10)

# View last 5 rows
# df.tail()

# View a random sample
# df.sample(5)


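|   | character | type | variant |
|---|-----------|------|---------|
| 0 | 亜 | joyo/variant | 亞 |
| 1 | 悪 | joyo/variant | 惡 |
| 2 | 圧 | joyo/variant | 壓 |
| 3 | 囲 | joyo/variant | 圍 |
| 4 | 医 | joyo/variant | 醫 |
| 5 | 為 | joyo/variant | 爲 |
| 6 | 壱 | joyo/variant | 壹 |
| 7 | 逸 | joyo/variant | 逸 |
| 8 | 隠 | joyo/variant | 隱 |
| 9 | 栄 | joyo/variant | 榮 |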
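
Because `.sample()` draws rows at random, its output changes from run to run. The sketch below assumes a small hand-built frame (`demo`, copied from the first rows shown above rather than loaded from the file) and shows how passing `random_state` makes a sample repeatable:

```python
import pandas as pd

# Small stand-in for the full dataset (an assumption for illustration),
# built from the first rows shown above
demo = pd.DataFrame({
    'character': ['亜', '悪', '圧', '囲', '医', '為'],
    'type': ['joyo/variant'] * 6,
    'variant': ['亞', '惡', '壓', '圍', '醫', '爲'],
})

first_two = demo.head(2)   # rows at positions 0 and 1
last_two = demo.tail(2)    # rows at positions 4 and 5

# The same random_state gives the same sample on every run
sample_a = demo.sample(3, random_state=0)
sample_b = demo.sample(3, random_state=0)
print(sample_a.equals(sample_b))  # True
```

Without `random_state`, two calls to `.sample()` will usually return different rows, which is fine for a quick look but awkward when you want others to reproduce what you saw.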


## Understanding Structure

### Shape, Columns, and Data Types


In [21]:
# Get dimensions: (rows, columns)
print("Shape:", df.shape)
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")

# Get column names
print("\nColumns:", df.columns.tolist())

# Get data types for each column
print("\nData types:")
print(df.dtypes)


Shape: (364, 3)
Rows: 364, Columns: 3

Columns: ['character', 'type', 'variant']

Data types:
character    object
type         object
variant      object
dtype: object
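
Two related attributes are easy to confuse with `.shape`. The sketch below uses a small hand-built frame (`demo`, an assumption for illustration, not the full dataset) to compare `.shape`, `len()`, and `.size`:

```python
import pandas as pd

# Small stand-in frame: 3 rows, 3 columns
demo = pd.DataFrame({
    'character': ['亜', '悪', '圧'],
    'type': ['joyo/variant'] * 3,
    'variant': ['亞', '惡', '壓'],
})

n_rows, n_cols = demo.shape   # unpack the (rows, columns) tuple
print(n_rows, n_cols)         # 3 3
print(len(demo))              # 3 (number of rows only)
print(demo.size)              # 9 (rows * columns, i.e. total cells)
```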


### Getting Detailed Information


In [22]:
# Get comprehensive information about the DataFrame
df.info()

# Get summary statistics
# Note: for text (object) columns, describe() reports count, unique, top, and freq
df.describe()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 364 entries, 0 to 363
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   character  364 non-null    object
 1   type       364 non-null    object
 2   variant    364 non-null    object
dtypes: object(3)
memory usage: 8.7+ KB


|   | character | type | variant |
|---|-----------|------|---------|
| count | 364 | 364 | 364 |
| unique | 362 | 1 | 364 |
| top | 弁 | joyo/variant | 灣 |
| freq | 3 | 364 | 1 |
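
To see where `unique`, `top`, and `freq` come from, it can help to reproduce them on a tiny frame. This is a sketch on hypothetical data: `demo` repeats one row on purpose so that one value is clearly the most frequent.

```python
import pandas as pd

# Hypothetical frame: the row for '亜' is deliberately repeated
demo = pd.DataFrame({
    'character': ['亜', '悪', '圧', '亜'],
    'type': ['joyo/variant'] * 4,
    'variant': ['亞', '惡', '壓', '亞'],
})

summary = demo.describe()
print(summary.index.tolist())   # ['count', 'unique', 'top', 'freq']

# 'top' is the most frequent value and 'freq' is how often it occurs;
# value_counts() gives the full breakdown behind those two numbers
counts = demo['character'].value_counts()
print(counts.index[0], counts.iloc[0])   # 亜 2
```

In the real output above, `freq` of 3 for `弁` means that character appears in three rows, which is why `unique` (362) is smaller than `count` (364).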


## Selecting Columns

### Single Column


In [23]:
# Select a single column (returns a Series)
characters = df['character']
print(type(characters))
characters.head()

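# Alternative attribute syntax (works only when the column name is a valid
# Python identifier and does not clash with a DataFrame attribute or method)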
characters = df.character
characters.head()


<class 'pandas.core.series.Series'>


0    亜
1    悪
2    圧
3    囲
4    医
Name: character, dtype: object
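
Attribute access is convenient, but it can silently return something other than the column. A minimal sketch, assuming a hypothetical frame with a column named `count` (that name and its values are not in the dataset and are used only to show the clash with `DataFrame.count`):

```python
import pandas as pd

# Hypothetical frame: the 'count' column is illustrative only
demo = pd.DataFrame({
    'character': ['亜', '悪'],
    'count': [1, 2],
})

by_bracket = demo['count']    # always the column
by_attribute = demo.count     # the DataFrame.count method, not the column

print(type(by_bracket).__name__)   # Series
print(callable(by_attribute))      # True
```

Bracket syntax also works for names containing spaces or punctuation, so it is the safer default in scripts.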

### Multiple Columns


In [24]:
# Select multiple columns (returns a DataFrame)
# Note: Use double brackets for multiple columns
df_subset = df[['character', 'variant']]
df_subset.head()


|   | character | variant |
|---|-----------|---------|
| 0 | 亜 | 亞 |
| 1 | 悪 | 惡 |
| 2 | 圧 | 壓 |
| 3 | 囲 | 圍 |
| 4 | 医 | 醫 |
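
The double brackets are a list of column names inside the indexing brackets, and that list is what makes pandas return a DataFrame. The sketch below, on a small hand-built frame (`demo`, an assumption for illustration), shows that this holds even when the list contains a single name:

```python
import pandas as pd

demo = pd.DataFrame({
    'character': ['亜', '悪', '圧'],
    'variant': ['亞', '惡', '壓'],
})

as_series = demo['character']     # single label -> Series
as_frame = demo[['character']]    # list of labels -> one-column DataFrame

print(as_series.shape)   # (3,)
print(as_frame.shape)    # (3, 1)
```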


## Indexing: Selecting Rows and Columns

### Using .iloc[] (Integer Location)


In [25]:
# Select by row index (integer position)
# First row
df.iloc[0]

# First 5 rows (the end position, 5, is excluded)
df.iloc[0:5]

# Specific row and column
df.iloc[0, 0]  # First row, first column

# Multiple rows and columns
df.iloc[0:5, [0, 2]]  # First 5 rows, columns 0 and 2


|   | character | variant |
|---|-----------|---------|
| 0 | 亜 | 亞 |
| 1 | 悪 | 惡 |
| 2 | 圧 | 壓 |
| 3 | 囲 | 圍 |
| 4 | 医 | 醫 |


### Using .loc[] (Label Location)


In [None]:
# Select by label (row index and column name)
# First row, all columns
df.loc[0]

# First row, specific column
df.loc[0, 'character']

# Multiple rows, specific columns
# Note: .loc slices include the end label, so 0:4 returns 5 rows
df.loc[0:4, ['character', 'variant']]
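
With the default index, `.iloc[]` and `.loc[]` look interchangeable, but they differ in two ways: how slices treat the end point, and what happens once labels and positions stop matching. A minimal sketch, assuming a small hand-built frame (`demo`) and a sliced copy (`later`), both introduced here for illustration:

```python
import pandas as pd

demo = pd.DataFrame({
    'character': ['亜', '悪', '圧', '囲', '医'],
    'variant': ['亞', '惡', '壓', '圍', '醫'],
})

# .iloc slices exclude the end position; .loc slices include the end label
print(len(demo.iloc[0:2]))   # 2
print(len(demo.loc[0:2]))    # 3

# After dropping rows, labels and positions no longer line up
later = demo.iloc[2:]                  # keeps index labels 2, 3, 4
print(later.iloc[0, 0])                # 圧 (first row by position)
print(later.loc[2, 'character'])       # 圧 (row whose label is 2)
# later.loc[0] would raise KeyError, because label 0 no longer exists
```

This matters as soon as you filter or sort a DataFrame, since both operations keep the original index labels.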


## Quick Summary

| Method | Purpose | Returns |
|--------|---------|---------|
| `.head(n)` | First n rows | DataFrame |
| `.tail(n)` | Last n rows | DataFrame |
| `.sample(n)` | Random n rows | DataFrame |
| `.shape` | Dimensions (rows, cols) | Tuple |
| `.columns` | Column names | Index |
| `.dtypes` | Data types | Series |
| `.info()` | Index range, non-null counts, dtypes, memory usage | None (prints) |
| `.describe()` | Statistics summary | DataFrame |
| `df['col']` | Single column | Series |
| `df[['col1', 'col2']]` | Multiple columns | DataFrame |
| `.iloc[row, col]` | Select by integer position (slice end excluded) | Value/Series/DataFrame |
| `.loc[row, col]` | Select by label (slice end included) | Value/Series/DataFrame |

## What's Next?

In the next notebook, we'll learn how to:
- Filter data based on conditions
- Use boolean indexing
- Filter by text patterns
- Handle missing values

## Try It Yourself

1. Load a different dataset and explore it using these methods
2. Try selecting different combinations of rows and columns
3. Use `.sample()` to get random rows and examine them
4. Check the `.shape` and `.info()` of different datasets you've loaded
