# <font color="#418FDE" size="6.5" uppercase>**Understanding Datasets**</font>

>Last update: 20260131.
    
By the end of this lecture, you will be able to:
- Describe the typical tabular structure of a basic machine learning dataset. 
- Distinguish between common data types such as numeric, categorical, and text. 
- Evaluate simple datasets for obvious quality issues that could affect learning. 


## **1. Tabular Dataset Basics**

### **1.1. Each Row A Case**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_02/Lecture_A/image_01_01.jpg?v=1769917319" width="250">



>* Each row is one real-world example
>* Columns store that example’s details for algorithms

>* Each row’s case depends on the question
>* Designer must choose one consistent unit per row

>* Mixed or split cases confuse models and patterns
>* Always define and apply one clear row meaning
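
The idea of one consistent unit per row can be sketched with a small example. The purchase log below is made up for illustration (the column names `customer_id` and `amount` are assumptions, not part of any real dataset). It starts with one row per purchase and is reshaped so that each row is one customer.

```python
import pandas as pd

# Hypothetical purchase log: one row per purchase,
# so the same customer can appear several times.
purchases = pd.DataFrame(
    {
        "customer_id": [1, 1, 2, 3, 3, 3],
        "amount": [20.0, 35.0, 15.0, 50.0, 5.0, 10.0],
    }
)

# If the question is about customers, reshape so each row is one customer.
customers = (
    purchases.groupby("customer_id")
    .agg(n_purchases=("amount", "size"), total_spent=("amount", "sum"))
    .reset_index()
)

print(customers)

# Confirm the row meaning: no customer appears more than once.
print("One row per customer:", customers["customer_id"].is_unique)
```

Which unit is right depends on the question: a model that predicts the size of the next purchase might keep one row per purchase instead.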



### **1.2. Feature Columns Explained**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_02/Lecture_A/image_01_02.jpg?v=1769917334" width="250">



>* Feature columns describe each case with measurements
>* Good, clear features let models learn patterns

>* Features include direct measurements and engineered values
>* Multiple feature columns give richer case descriptions

>* Each feature has unique meaning, type, scale
>* Understand every column to build useful models
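
As a minimal sketch of direct versus engineered features, the made-up housing table below holds three direct measurements and derives one extra column from them. All column names and values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical housing data: each column is a direct measurement of a house.
houses = pd.DataFrame(
    {
        "area_m2": [50.0, 80.0, 120.0],
        "rooms": [2, 3, 5],
        "year_built": [1990, 2005, 2018],
    }
)

# Engineered feature: computed from existing columns, not measured directly.
houses["area_per_room"] = houses["area_m2"] / houses["rooms"]

# Inspect the type and the value range of every feature column.
print(houses.dtypes)
print(houses.describe().loc[["min", "max"]])
```

Looking at types and ranges column by column is one simple way to check that each feature means what you think it means.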



### **1.3. Target Column Basics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_02/Lecture_A/image_01_03.jpg?v=1769917345" width="250">



>* Target column stores the outcome or answer
>* Model learns patterns linking features to target

>* Target values define the task: regression, classification, or ranking
>* Target type guides model choice and evaluation

>* Target values can be complex, delayed, or subjective
>* Ambiguous or biased targets weaken model reliability
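
The sketch below separates feature columns from a target column and shows how the target's type hints at the task. The table and the names `price` and `sold_fast` are invented for illustration, and the two targets are alternatives, not something a single model would use together.

```python
import pandas as pd

# Hypothetical dataset with two possible target columns.
df = pd.DataFrame(
    {
        "area_m2": [50.0, 80.0, 120.0, 65.0],
        "rooms": [2, 3, 5, 3],
        "price": [150000.0, 240000.0, 410000.0, 199000.0],
        "sold_fast": ["yes", "no", "no", "yes"],
    }
)

# Features describe each case; the target is the answer we want to predict.
feature_columns = ["area_m2", "rooms"]
X = df[feature_columns]

# A numeric target suggests a regression task.
y_regression = df["price"]
print("Numeric target:", pd.api.types.is_numeric_dtype(y_regression))

# A target with a few distinct labels suggests a classification task.
y_classification = df["sold_fast"]
print(y_classification.value_counts())
```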



## **2. Common Data Types**

### **2.1. Numeric Features**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_02/Lecture_A/image_02_01.jpg?v=1769917360" width="250">



>* Numeric features measure ordered, comparable quantities
>* They support averaging and track real-world magnitudes

>* Numeric features can be continuous or discrete
>* This choice influences modeling, visuals, and expectations

>* Numeric meaning depends on scale, units, and zero
>* Transformations change how models see differences
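
To illustrate how a transformation changes the differences a model sees, the sketch below applies a base-10 logarithm and a standardization to three made-up income values. The numbers are assumptions chosen only to make the effect visible.

```python
import numpy as np

# Hypothetical yearly incomes with one very large value.
incomes = np.array([20_000.0, 40_000.0, 1_000_000.0])

# On the raw scale, the second gap is far larger than the first.
raw_gaps = np.diff(incomes)
print("Raw gaps:", raw_gaps)

# On a log scale, gaps reflect ratios, so the large value is compressed.
log_gaps = np.diff(np.log10(incomes))
print("Log10 gaps:", log_gaps.round(3))

# Standardization removes the units: the result has mean 0 and
# standard deviation 1, but the relative spacing stays the same.
standardized = (incomes - incomes.mean()) / incomes.std()
print("Standardized:", standardized.round(3))
```

A log transform only makes sense for strictly positive values, which is one reason the meaning of zero matters for numeric features.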



### **2.2. Categorical Data Basics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_02/Lecture_A/image_02_02.jpg?v=1769917370" width="250">



>* Categorical data groups items using descriptive labels
>* Labels show categories, not amounts or averages

>* Nominal categories have no meaningful order between values
>* Ordinal categories are ordered and need order-preserving encoding

>* High cardinality and rare categories affect storage and learning
>* Messy, inconsistent, or missing labels require careful cleaning
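
The points above can be sketched in three steps: clean inconsistent labels, one-hot encode a nominal column, and map an ordinal column to ordered codes. The columns `city` and `size` and the chosen order are assumptions made for this example.

```python
import pandas as pd

# Hypothetical data with messy nominal labels and an ordinal column.
df = pd.DataFrame(
    {
        "city": ["London", " london", "Paris", "PARIS "],
        "size": ["small", "large", "medium", "small"],
    }
)

# Clean labels so that spelling variants collapse into one category.
df["city"] = df["city"].str.strip().str.title()
print(df["city"].value_counts())

# Nominal column: one-hot encoding creates one indicator column per label
# and does not imply any order between cities.
city_dummies = pd.get_dummies(df["city"], prefix="city")
print(city_dummies)

# Ordinal column: state the order explicitly, then use the ordered codes.
size_order = ["small", "medium", "large"]
df["size_code"] = pd.Categorical(
    df["size"], categories=size_order, ordered=True
).codes
print(df[["size", "size_code"]])
```

Stating the order explicitly matters: without it, codes would follow alphabetical order, which would place "large" before "medium".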



### **2.3. Working With Text**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_02/Lecture_A/image_02_03.jpg?v=1769917383" width="250">



>* Text data stores free-form words in columns
>* Meaning must be extracted through extra encoding steps

>* Text is nuanced, context-dependent, and often ambiguous
>* We convert text into numeric features for modeling

>* Text features add insights beyond other columns
>* They require cleaning, anonymization, and careful preprocessing
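
One simple way to turn text into numeric features is a word-count (bag-of-words) table. The sketch below builds one with the standard library only. The two reviews and the `tokenize` helper are made up for illustration, and real projects usually rely on a dedicated library for this step.

```python
import re
from collections import Counter

# Hypothetical free-text reviews.
reviews = [
    "Great product, great price!",
    "Terrible support. Product broke.",
]


def tokenize(text):
    """Lowercase the text and keep only runs of letters as words."""
    return re.findall(r"[a-z]+", text.lower())


tokens_per_review = [tokenize(review) for review in reviews]

# The vocabulary is the sorted set of all words seen in any review.
vocabulary = sorted({word for tokens in tokens_per_review for word in tokens})

# Each review becomes one row of word counts, one column per vocabulary word.
count_vectors = []
for tokens in tokens_per_review:
    counts = Counter(tokens)
    count_vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in count_vectors:
    print(vector)
```

Word counts ignore word order and context, so they capture only part of the nuance in the original text.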



## **3. Spotting Data Issues**

### **3.1. Detecting Missing Data**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_02/Lecture_A/image_03_01.jpg?v=1769917397" width="250">



>* Check datasets for obvious and subtle gaps
>* Scan columns for blanks, placeholders, and patterns

>* Missing data mechanisms can change your conclusions
>* Clustered missingness signals bias and unrepresentative data

>* Identify placeholder codes that hide missing information
>* Use documentation and checks to handle them



```python
#@title Python Code - Detecting Missing Data

# This script shows basic missing data detection.
# It uses a tiny made up dataset.
# Focus on simple checks beginners can understand.

# Import pandas for working with tabular data.
import pandas as pd

# Create a tiny dataset with obvious missing issues.
data = {
    "age": [25, None, 40, -1],
    "income": [50000, 60000, None, 999999],
    "city": ["London", "N/A", "Paris", ""],
}

# Build a DataFrame from the dictionary.
df = pd.DataFrame(data)

# Show the small dataset to understand its structure.
print("Dataset preview with possible missing issues:")
print(df)

# Use isna to count missing values in each column.
missing_counts = df.isna().sum()

# Print how many true missing values each column has.
print("\nTrue missing values per column:")
print(missing_counts)

# Define placeholder values that really mean missing data.
placeholder_values = {"age": [-1], "income": [999999], "city": ["N/A", ""]}

# Create a copy so original data stays unchanged.
df_checked = df.copy()

# Replace placeholder values with proper pandas missing markers.
for column, bad_values in placeholder_values.items():
    df_checked[column] = df_checked[column].replace(bad_values, pd.NA)

# Count missing values again after placeholder replacement.
missing_after = df_checked.isna().sum()

# Print updated counts to show the effect of cleaning.
print("\nMissing values per column after cleaning placeholders:")
print(missing_after)

# Print a short conclusion line summarizing the key idea.
print("\nCarefully checking placeholders reveals hidden missing data.")
```




### **3.2. Detecting Outliers**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_02/Lecture_A/image_03_02.jpg?v=1769917436" width="250">



>* Check for unusually large or small values
>* Uninvestigated outliers can distort models and predictions

>* Check value distributions against real-world expectations
>* Flag implausible extremes as likely misleading outliers

>* Some outliers are meaningful, not just errors
>* Investigate extremes and decide to keep or fix
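
A minimal sketch of flagging extremes, assuming a made-up list of ages. It combines a common rule of thumb (values more than 1.5 times the interquartile range beyond the quartiles) with a plain plausibility check. The 1.5 factor and the 0 to 120 age limits are illustrative assumptions, not fixed rules.

```python
import pandas as pd

# Hypothetical ages with one value that looks like a data entry error.
ages = pd.Series([23, 25, 27, 29, 31, 33, 35, 240], name="age")

# Rule of thumb: flag values far outside the middle half of the data.
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
flagged = ages[(ages < lower) | (ages > upper)]

print(f"Expected range: {lower} to {upper}")
print("Flagged by the IQR rule:")
print(flagged)

# Real-world expectation: human ages outside 0 to 120 are implausible.
implausible = ages[(ages < 0) | (ages > 120)]
print("Implausible ages:")
print(implausible)
```

Flagging is only the first step. Each flagged value still needs a decision about whether it is an error to fix or a meaningful extreme to keep.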



### **3.3. Sampling and Bias Clues**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_02/Lecture_A/image_03_03.jpg?v=1769917452" width="250">



>* Check who the dataset actually represents
>* Unrepresentative samples harm model performance and fairness

>* Check subgroup distributions and collection metadata carefully
>* Question representativeness and seek adjustments or more data

>* Consider how timing and incentives shaped data
>* Check whose experiences are missing and adjust



```python
#@title Python Code - Sampling and Bias Clues

# This script shows simple sampling bias clues.
# We use tiny synthetic customer churn data.
# Focus on distributions not complex modeling today.

# import required libraries for data handling.
import numpy as np
import pandas as pd

# set deterministic random seed for reproducibility.
np.random.seed(42)

# create a small synthetic customer dataset.
data_size = 40
ages = np.random.randint(18, 70, size=data_size)

# create regions with heavy bias toward one region.
regions = np.random.choice(
    ["North America", "Europe", "Asia"],
    size=data_size,
    p=[0.8, 0.15, 0.05],
)

# create income levels with limited diversity.
income_levels = np.random.choice(
    ["Low", "Medium", "High"],
    size=data_size,
    p=[0.1, 0.8, 0.1],
)

# create churn labels as simple binary values.
churned = np.random.choice([0, 1], size=data_size, p=[0.7, 0.3])

# build pandas dataframe from the synthetic arrays.
df = pd.DataFrame(
    {
        "age": ages,
        "region": regions,
        "income_level": income_levels,
        "churned": churned,
    }
)

# print first few rows to understand dataset structure.
print("Sample rows showing dataset structure:")
print(df.head(5))

# check basic counts for each region category.
region_counts = df["region"].value_counts(normalize=False)

# check region proportions to reveal sampling imbalance.
region_props = df["region"].value_counts(normalize=True).round(2)

# check income level proportions for additional bias clues.
income_props = df["income_level"].value_counts(normalize=True).round(2)

# print region distribution and highlight potential bias.
print("\nRegion counts and proportions:")
print(region_counts)
print(region_props)

# print income distribution and highlight limited diversity.
print("\nIncome level proportions:")
print(income_props)
```




# <font color="#418FDE" size="6.5" uppercase>**Understanding Datasets**</font>


In this lecture, you learned to:
- Describe the typical tabular structure of a basic machine learning dataset. 
- Distinguish between common data types such as numeric, categorical, and text. 
- Evaluate simple datasets for obvious quality issues that could affect learning. 

In the next lecture (Lecture B), we will go over 'Features And Targets'.