# Initial Data Analysis (IDA) of `people.csv`

## 1. Dataset Overview

- **File:** `people.csv`
- **Number of rows:** 24
- **Number of columns:** 2
- **Columns:**
    1. `Person` – Name of the person.
    2. `Region` – Geographic region associated with the person.
- **Type of data:** Reference/dimension dataset (categorical)
- **Primary use:** Could serve as an enrichment layer for analytics or reporting.

In [None]:
import pandas as pd
import numpy as np

people = pd.read_csv("../data/raw/people.csv")

print("First 5 rows:")
display(people.head())

First 5 rows:


|   | Person | Region |
|---|--------|--------|
| 0 | Marilène Rousseau | Caribbean |
| 1 | Andile Ihejirika | Central Africa |
| 2 | Nicodemo Bautista | Central America |
| 3 | Cansu Peynirci | Central Asia |
| 4 | Lon Bonher | Central US |


## First 5 Rows

The dataset is small enough to view in full if needed. The first five rows shown above:

- Confirm that each record pairs a person's name with a region.
- Show no obvious formatting issues.
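
The "no obvious formatting issues" observation can also be checked programmatically rather than by eye. The following is a minimal sketch, assuming a small stand-in frame built from the rows previewed above. `has_stray_whitespace` is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

# Stand-in for people.csv, built from the rows shown in the preview.
sample = pd.DataFrame({
    "Person": ["Marilène Rousseau", "Andile Ihejirika", "Nicodemo Bautista"],
    "Region": ["Caribbean", "Central Africa", "Central America"],
})

def has_stray_whitespace(series: pd.Series) -> bool:
    """Return True if any value has leading/trailing or doubled spaces."""
    padded = (series != series.str.strip()).any()
    doubled = series.str.contains("  ", regex=False).any()
    return bool(padded or doubled)

# One flag per column; False means no whitespace problems were found.
whitespace_report = {col: has_stray_whitespace(sample[col]) for col in sample.columns}
print(whitespace_report)
```

Running the same helper on the full `people` frame would cover all 24 rows instead of the five visible in the preview.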


In [9]:
print("Columns:")
print(people.columns)

Columns:
Index(['Person', 'Region'], dtype='str')


In [8]:
print("Shape:")
print(people.shape)

Shape:
(24, 2)


In [10]:
print("Info:")
people.info()  # info() prints its report directly and returns None

Info:
<class 'pandas.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Person  24 non-null     str  
 1   Region  24 non-null     str  
dtypes: str(2)
memory usage: 516.0 bytes


## Column Types and Missing Values

- Both columns (`Person` and `Region`) are strings.
- No missing values are present in either column.
- All 24 rows are complete, so no imputation or row removal is required.
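
These checks can be written as assertions so that the notebook fails loudly if a refreshed file no longer satisfies them. This is a sketch under the assumption that a stand-in frame with the same two columns behaves like the real file. The sample rows are copied from the preview above.

```python
import pandas as pd
from pandas.api.types import is_string_dtype

# Stand-in for people.csv, built from the rows shown in the preview.
sample = pd.DataFrame({
    "Person": ["Marilène Rousseau", "Andile Ihejirika", "Nicodemo Bautista"],
    "Region": ["Caribbean", "Central Africa", "Central America"],
})

# Every column should hold strings and contain no missing entries.
all_strings = all(is_string_dtype(sample[col]) for col in sample.columns)
missing_total = int(sample.isna().sum().sum())

assert all_strings, "expected only string columns"
assert missing_total == 0, "expected no missing values"
```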


In [11]:
print("Missing values per column:")
print(people.isnull().sum())

Missing values per column:
Person    0
Region    0
dtype: int64


In [12]:
print("Summary statistics:")
display(people.describe())

Summary statistics:


|        | Person | Region |
|--------|--------|--------|
| count  | 24 | 24 |
| unique | 24 | 24 |
| top    | Marilène Rousseau | Caribbean |
| freq   | 1 | 1 |


## Summary Statistics

- Both columns are categorical.
- `people.describe()` confirms:
    - 24 unique persons
    - 24 unique regions
- Each person and each region appears exactly once, so the table maps one person to one region.
- Neither column contains duplicate values.
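
The uniqueness claims can be confirmed directly instead of being read off the `describe()` table. This is a minimal sketch, assuming a stand-in frame built from the rows previewed earlier. The variable names are illustrative.

```python
import pandas as pd

# Stand-in for people.csv, built from the rows shown in the preview.
sample = pd.DataFrame({
    "Person": ["Marilène Rousseau", "Andile Ihejirika", "Nicodemo Bautista"],
    "Region": ["Caribbean", "Central Africa", "Central America"],
})

# Count fully duplicated rows.
duplicate_rows = int(sample.duplicated().sum())

# A column is unique when its distinct-value count equals the row count.
unique_counts = sample.nunique()
one_to_one = bool((unique_counts == len(sample)).all())

print(f"duplicate rows: {duplicate_rows}, one person per region: {one_to_one}")
```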


In [None]:
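# Treat every non-numeric column as categorical (here: Person and Region).
cat_cols = people.select_dtypes(exclude="number").columns

print("Categories per categorical column:")
for col in cat_cols:
    print(f"\n{col} value counts:")
    display(people[col].value_counts())


Categories per categorical column: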

Person value counts:


Person
Marilène Rousseau       1
Andile Ihejirika        1
Nicodemo Bautista       1
Cansu Peynirci          1
Lon Bonher              1
Wasswa Ahmed            1
Hadia Bousaid           1
Lynne Marchand          1
Oxana Lagunov           1
Dolores Davis           1
Lindiwe Afolayan        1
Miina Nylund            1
Kauri Anaru             1
Vasco Magalhães         1
Preecha Metharom        1
Nora Cuijper            1
Chandrakant Chaudhri    1
Gavino Bove             1
Flannery Newton         1
Katlego Akosua          1
Kaoru Xun               1
Angela Jephson          1
Gilbert Wolff           1
Derrick Snyders         1
Name: count, dtype: int64


Region value counts:


Region
Caribbean            1
Central Africa       1
Central America      1
Central Asia         1
Central US           1
Eastern Africa       1
Eastern Asia         1
Eastern Canada       1
Eastern Europe       1
Eastern US           1
North Africa         1
Northern Europe      1
Oceania              1
South America        1
Southeastern Asia    1
Southern Africa      1
Southern Asia        1
Southern Europe      1
Southern US          1
Western Africa       1
Western Asia         1
Western Canada       1
Western Europe       1
Western US           1
Name: count, dtype: int64

## Key Observations

1. The dataset is small (24 rows, 2 columns) and serves as **reference/dimension data**.
2. No missing values, no duplicates – clean and ready to use.
3. The table can be joined to other datasets (such as orders or returns) on a shared key if needed for analytics enrichment.
4. No transformations or cleaning are required at this stage.
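
The join mentioned in observation 3 might look like the sketch below. The `orders` table, its `OrderID` column, and the choice of `Region` as the join key are all assumptions made for illustration. The notebook does not show the actual schema of the orders or returns data.

```python
import pandas as pd

# Stand-in for people.csv, built from the rows shown in the preview.
people = pd.DataFrame({
    "Person": ["Marilène Rousseau", "Andile Ihejirika", "Cansu Peynirci"],
    "Region": ["Caribbean", "Central Africa", "Central Asia"],
})

# Hypothetical orders table; column names and values are made up.
orders = pd.DataFrame({
    "OrderID": [1, 2, 3],
    "Region": ["Caribbean", "Central Asia", "Caribbean"],
})

# validate="many_to_one" raises MergeError if Region is not unique in people,
# which guards the enrichment against accidental row duplication.
enriched = orders.merge(people, on="Region", how="left", validate="many_to_one")
print(enriched)
```

A left join keeps every order even when its region has no match in `people`, so unmatched rows surface as missing `Person` values rather than disappearing.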
