# Data Cleaning with Pandas: A Beginner's Guide

Welcome to this hands-on lab where you'll learn the essential first steps in any data analysis project: loading, exploring, and cleaning data. We'll be using the popular `pandas` library in Python to work with a real-world dataset from the World Bank.

**Lab Scenario:**
You will take on the role of a data analyst who has just received a new dataset. You will go through the realistic process of:
1. Loading the data as-is.
2. Discovering problems with the data (like incorrect data types).
3. Writing transformations to clean the data.
4. Handling missing values.
5. Saving your cleaned data for future analysis.

## 1. Setting Up Our Environment

First, we need to import the `pandas` library. We'll import it and give it the shorter alias `pd`, which is a common convention.

In [None]:
import pandas as pd

## 2. Loading the Data (The First Look)

Let's load our dataset, a CSV file named `world_bank_data.csv` stored in the `data/` folder. We will load it in the most basic way.

*Note: We pass `skipfooter=5` because the last five lines of the file are informational rows that are not part of the actual data. This is a common issue in real-world files. We also pass `engine='python'` because pandas' default C parser does not support `skipfooter`.*

In [None]:
df = pd.read_csv('data/world_bank_data.csv', skipfooter=5, engine='python')

# The 'df' variable now holds our data.
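
To see what `skipfooter` does in isolation, here is a minimal sketch on a tiny in-memory CSV. The contents, including the two footer lines, are made up for illustration and are not taken from the lab file.

```python
import io
import pandas as pd

# Hypothetical miniature file: two data rows followed by two footer lines
csv_text = (
    "Country Name,2020 [YR2020]\n"
    "Aruba,1.5\n"
    "Brazil,2.5\n"
    "Hypothetical footer line one\n"
    "Hypothetical footer line two\n"
)

# Without skipfooter, the footer lines are read as if they were data rows
with_footer = pd.read_csv(io.StringIO(csv_text))

# With skipfooter=2, the last two lines are dropped before parsing
without_footer = pd.read_csv(io.StringIO(csv_text), skipfooter=2, engine="python")

print(len(with_footer))     # 4
print(len(without_footer))  # 2
```

In practice, it is worth opening the raw file in a text editor first to count how many trailing lines need to be skipped.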

## 3. Observing the Problem

Now that the data is loaded, let's inspect it. A good analyst always checks their data types.

### Your Turn!

Use the `.info()` method on the DataFrame to see a summary of the data, including the data types of each column.

### Analysis of the Problem

Look at the output of `.info()`. Do you see the issue? 

The columns for the years (e.g., `1990 [YR1990]`, `2000 [YR2000]`, etc.) should be numeric, but pandas has loaded them as `object` type. In pandas, `object` usually means the column contains text (strings).

**Why did this happen?**
This happens when a column contains a mix of numbers and non-numeric characters. If even one value in a column is not a number, pandas will play it safe and treat the entire column as text. In our case, the dataset uses `..` to represent missing data, and these double dots are not numbers.
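
The effect is easy to reproduce on a small scale. The sketch below uses a made-up three-row CSV (not the lab data) in which one value is `..`. The exact dtype label pandas prints may vary by version (`object` in the versions this lab assumes), but in every case the column is not numeric and its values are strings.

```python
import io
import pandas as pd

# Hypothetical sample: one '..' placeholder in an otherwise numeric column
csv_text = "Country Name,2020 [YR2020]\nAruba,1.5\nBrazil,..\nChad,3.0\n"
demo = pd.read_csv(io.StringIO(csv_text))

print(demo.dtypes)

# Every value in the column was kept as text, including the valid numbers
values = demo["2020 [YR2020]"].tolist()
print(values)  # ['1.5', '..', '3.0']
```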

## 4. Transforming the Data

Now we will fix the data types.

### Task: Convert Numeric Columns to Floats

Our goal is to convert all the year columns from `object` to `float` (a float is a number that can have decimals). 

A robust way to do this is to use the `pd.to_numeric` function. We will apply this to all columns that should be numeric. The key is the `errors='coerce'` argument. This tells pandas: "Try to convert the values to numbers. If you find a value that you can't convert (like our `..` strings), don't raise an error. Instead, replace that value with `NaN` (Not a Number)."

`NaN` is the standard way pandas represents missing numeric data.
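
Here is a small sketch of `errors='coerce'` on a toy Series (the values are invented for illustration). It also shows that the default behavior, without `errors='coerce'`, is to raise a `ValueError`.

```python
import pandas as pd

raw = pd.Series(["1.5", "..", "3"])

# Default behavior: an unparseable value raises an error
try:
    pd.to_numeric(raw)
    raised = False
except ValueError:
    raised = True
print(raised)  # True

# With errors='coerce', the unparseable value becomes NaN instead
converted = pd.to_numeric(raw, errors="coerce")
print(converted.dtype)            # float64
print(converted.isna().tolist())  # [False, True, False]
```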

### Your Turn!

Complete the code below. We have already identified the list of columns that should *not* be numeric. Your task is to loop through all the columns in the DataFrame (`df.columns`) and if a column is **not** in our `non_numeric_cols` list, apply the `pd.to_numeric` function to it.

In [None]:
# Define the columns that should remain as text
non_numeric_cols = ['Series Name', 'Series Code', 'Country Name', 'Country Code']

# Loop through all columns in the DataFrame
for col in df.columns:
    # Check if the column is NOT in our list of non-numeric columns
    if col not in non_numeric_cols:
        # --- YOUR CODE GOES HERE --- #
        # Convert the column to numeric, coercing errors to NaN
        df[col] = pd.to_numeric(df[col], errors='coerce')
        # ------------------------- #

# Now, let's check the .info() again to see if our conversion worked!
df.info()

## 5. Handling Missing Values

Now that our data types are correct and our non-numeric values have been converted to `NaN`, we can properly handle the missing data.

### Why do we replace missing values?

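Missing values can quietly distort an analysis. pandas methods such as `.mean()` and `.sum()` skip `NaN` by default, so a result may be based on fewer observations than you expect. Outside pandas, calculations on plain NumPy arrays propagate `NaN`, and many modeling libraries reject it outright. Depending on the goal, we can either drop rows with missing data or fill them in with a reasonable substitute, such as: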

- **Mean:** A good choice when the data is fairly symmetrical and doesn't have extreme outliers.
- **Median:** A better choice when the data has outliers, as the median is less sensitive to extreme values.
- **Mode:** Used for categorical (text-based) data to fill in with the most frequent value.
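
The difference between these three choices shows up clearly on a toy example. The numbers and region names below are invented for illustration. Note how the single outlier (100.0) pulls the mean far above the median.

```python
import pandas as pd

# Numeric column with one outlier and one missing value
values = pd.Series([1.0, 2.0, 3.0, 100.0, None])
mean_value = values.mean()      # NaN is skipped by default
median_value = values.median()
print(mean_value)    # 26.5
print(median_value)  # 2.5

# Categorical column: the mode is the most frequent value
regions = pd.Series(["Asia", "Europe", "Asia", None])
mode_value = regions.mode()[0]
print(mode_value)  # Asia
```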

### Your Turn!

First, get a count of missing values in each column using `.isnull().sum()` to see the extent of the problem.
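
As a reference for what the output looks like, here is a sketch on a toy DataFrame. The column names mimic the lab's naming pattern, but the values are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Country Name": ["Aruba", "Brazil", "Chad"],
    "2021 [YR2021]": [1.0, np.nan, 3.0],
    "2022 [YR2022]": [np.nan, np.nan, 5.0],
})

# .isnull() marks each missing cell as True; .sum() counts the Trues per column
missing_counts = toy.isnull().sum()
print(missing_counts)
```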

Let's practice by filling the missing values in the `2022 [YR2022]` column with the *median* of that column.

In [None]:
# --- YOUR CODE GOES HERE --- #
# 1. Calculate the median of the '2022 [YR2022]' column


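# 2. Use .fillna() to replace the missing values with the median,
#    then assign the result back to the column. (Calling .fillna(..., inplace=True)
#    on a selected column triggers a warning in recent pandas versions and
#    may not update df.)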


# ------------------------- #

# Verify that the missing values in the column are filled
print(f"Missing values in 2022 column after filling: {df['2022 [YR2022]'].isnull().sum()}")
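
If you get stuck, the pattern can be sketched on a toy DataFrame as follows. This is one possible approach using invented values, and the variable name `median_2022` is only illustrative. The result of `.fillna()` is assigned back to the column rather than modified in place.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"2022 [YR2022]": [1.0, np.nan, 3.0, 10.0]})

# 1. The median ignores NaN, so it is computed from 1.0, 3.0 and 10.0
median_2022 = toy["2022 [YR2022]"].median()
print(median_2022)  # 3.0

# 2. Fill the gaps and assign the result back to the column
toy["2022 [YR2022]"] = toy["2022 [YR2022]"].fillna(median_2022)

remaining = toy["2022 [YR2022]"].isnull().sum()
print(remaining)  # 0
```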

## 6. Saving the Cleaned Data

Once you have cleaned your data, it's good practice to save the result to a new file. This way, you don't have to repeat the cleaning steps every time you want to perform an analysis.

### Your Turn!

Use the `.to_csv()` method to save your cleaned DataFrame (`df`) to a new file called `cleaned_world_bank_data.csv`.

**Hint:** Include the argument `index=False` to prevent pandas from writing the DataFrame index as a new column in your CSV file.
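
To see what `index=False` changes, this sketch calls `.to_csv()` without a file path, which returns the CSV text as a string instead of writing a file. The toy DataFrame is invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country Name": ["Aruba", "Brazil"],
    "2022 [YR2022]": [1.5, 2.5],
})

# With the default index=True, an unnamed index column is written first
header_with_index = toy.to_csv().splitlines()[0]
print(header_with_index)     # ,Country Name,2022 [YR2022]

# With index=False, only the real columns are written
header_without_index = toy.to_csv(index=False).splitlines()[0]
print(header_without_index)  # Country Name,2022 [YR2022]
```

Leaving the index in would add an extra unnamed column each time the file is saved and reloaded.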

In [None]:
# --- YOUR CODE GOES HERE --- #


# ------------------------- #

print("Cleaned data saved successfully!")

## 7. Congratulations!

You've completed this data cleaning lab. You have learned the realistic workflow of a data analyst:
- Loading raw data and identifying problems.
- Using transformations to fix data types.
- Strategically handling missing values.
- Saving your clean data for the next stage of analysis.

These are fundamental skills that you'll use in every data analysis project. Keep practicing!