# Data Cleaning with Pandas: A Beginner's Guide

Welcome to this hands-on lab where you'll learn the essential first steps in any data analysis project: loading, exploring, and cleaning data. We'll be using the popular `pandas` library in Python to work with a real-world dataset from the World Bank.

**Lab Scenario:**
You will take on the role of a data analyst who has just received a new dataset. You will go through the realistic process of:
1. Loading the data as-is.
2. Discovering problems with the data (like incorrect data types).
3. Writing transformations to clean the data.
4. Handling missing values.
5. Saving your cleaned data for future analysis.

## 1. Setting Up Our Environment

First, we need to import the `pandas` library. We'll import it and give it the shorter alias `pd`, which is a common convention.

In [None]:
import pandas as pd

## 2. Loading the Data (The First Look)

Let's load our dataset. We have a CSV file named `world_bank_data.csv`. We will load it in the most basic way.

In [None]:
df = pd.read_csv('data/world_bank_data.csv')

# The 'df' variable now holds our data.

You can preview the loaded data by running a code cell that contains just `df`, or by calling the `head()` or `tail()` method, e.g. `df.head()`. Try all three and observe the difference.

In [None]:
# your code

You should notice that the bottom (tail) rows contain metadata and empty rows rather than data. By default, `head()` and `tail()` display only 5 rows. You can change this by passing a numeric `n` **parameter** to these methods. Display more bottom rows to work out how many of them should be cut from the `df` DataFrame.

**Tip:** Pass a parameter by name like this: `some_object.some_function(parameter_name=<parameter_value>)`
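
As a minimal sketch of passing a parameter by name, the example below calls `tail()` with and without `n`. The `toy` DataFrame is made-up stand-in data, not the lab's dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the lab's DataFrame
toy = pd.DataFrame({"value": range(10)})

default_tail = toy.tail()     # no argument: the last 5 rows
longer_tail = toy.tail(n=8)   # named parameter: the last 8 rows

print(len(default_tail), len(longer_tail))
```

The same `n` parameter works for `head()`.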

In [None]:
# your code

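Now that you know the answer, let's trim these unnecessary rows.
We can do this by reading the file again with `read_csv`, this time adding the `skipfooter` parameter set to the number of rows to skip at the bottom. Also pass `engine='python'`: the default C parser does not support `skipfooter`, so without it pandas emits a warning and falls back to the Python parser.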
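
Here is a small sketch of how `skipfooter` behaves, assuming a made-up CSV held in memory with two footer lines (the footer length of the real file is for you to find). It passes `engine="python"` because the default C parser does not support `skipfooter`.

```python
import io
import pandas as pd

# Hypothetical file contents: two data rows followed by two footer lines
raw = (
    "Country Name,2022 [YR2022]\n"
    "Country A,1.5\n"
    "Country B,2.5\n"
    "Data from a made-up source\n"
    "Last Updated: unknown\n"
)

# skipfooter drops the given number of lines from the end of the file
trimmed = pd.read_csv(io.StringIO(raw), skipfooter=2, engine="python")
print(trimmed)
```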

In [None]:
# your code

## 3. Observing the Problem

Now that the data is loaded, let's inspect it. A good analyst always checks their data types.

### Your Turn!

Use the `.info()` method on the DataFrame to see a summary of the data, including the data types of each column.

In [None]:
# your code

### Analysis of the Problem

Look at the output of `.info()`. Do you see the issue?

The columns for the years (e.g., `1990 [YR1990]`, `2000 [YR2000]`, etc.) should be numeric, but pandas has loaded them as `object` type. In pandas, `object` usually means the column contains text (strings).

**Why did this happen?**
This happens when a column contains a mix of numbers and non-numeric characters. If even one value in a column is not a number, pandas will play it safe and treat the entire column as text. In our case, the dataset uses `..` to represent missing data, and these double dots are not numbers.
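
The effect can be reproduced on a tiny made-up CSV. In this sketch the column names `clean` and `with_dots` are hypothetical. Depending on your pandas version, the text column may be reported as `object` or as a dedicated string type, so the check below only asks whether the column is numeric.

```python
import io
import pandas as pd

# Hypothetical data: one fully numeric column, one containing a '..' placeholder
raw = "clean,with_dots\n1.0,1.0\n2.0,..\n3.0,3.0\n"
toy = pd.read_csv(io.StringIO(raw))

print(toy.dtypes)
print(pd.api.types.is_numeric_dtype(toy["clean"]))      # numeric column
print(pd.api.types.is_numeric_dtype(toy["with_dots"]))  # loaded as text
```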

## 4. Transforming the Data

Now we will fix the data types step by step.

### Step 1: Understanding Column Modification

Let's start by fixing just one column. We'll use the '2022 [YR2022]' column as an example.
To convert string values to numbers, pandas provides the `pd.to_numeric` function.

In [None]:
# Let's look at the first few values of our column before conversion
print("Before conversion:")
print(df['2022 [YR2022]'].head())

# Convert the column to numeric
df['2022 [YR2022]'] = pd.to_numeric(df['2022 [YR2022]'], errors='coerce')

# Look at the values after conversion
print("\nAfter conversion:")
print(df['2022 [YR2022]'].head())

# Check the dataframe info to see the type change
df.info()

The `errors='coerce'` parameter tells pandas:
- Try to convert each value to a number
- If a value can't be converted (like our '..' strings), replace it with `NaN` (Not a Number)
- `NaN` is how pandas represents missing numeric data
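
The behaviour described above can be checked on a three-value Series. This is a sketch on made-up values, not the lab's data.

```python
import pandas as pd

# Hypothetical raw values, including the '..' placeholder for missing data
raw_values = pd.Series(["1.5", "..", "3"])

# errors="coerce" turns anything that cannot be parsed into NaN
converted = pd.to_numeric(raw_values, errors="coerce")

print(converted)
print("Missing after conversion:", converted.isna().sum())
```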

#### **Your turn**
Convert another year column yourself, e.g. `1990 [YR1990]`. Note how many null values it contains after the conversion.

In [None]:
# your code

### Step 2: Working with Column Names

Before we fix all columns, let's learn how to work with column names.
The `df.columns` attribute gives us access to all column names in our DataFrame.

#### Part A: Listing Column Names
First, let's see all our column names:

In [None]:
# Let's print all column names using a for loop
print("Columns in our DataFrame:")
for column_name in df.columns:
    print(column_name)

#### Part B: Filtering Columns
Sometimes we want to process only certain columns. We can use if/else statements to decide which columns to work with.

Let's create a simple example that identifies which columns are year columns and which are not:

In [None]:
# Define which columns we want to ignore
columns_to_ignore = ['Series Name', 'Series Code', 'Country Name', 'Country Code']

# Loop through columns and check each one
print("\nChecking each column:")
for column_name in df.columns:
    if column_name in columns_to_ignore:
        print(f"{column_name}: This is a text column - skip conversion")
    else:
        print(f"{column_name}: This is a numeric column - should convert")

# We can also count how many columns of each type we have
numeric_columns = df.columns.difference(columns_to_ignore)
print("\nSummary:")
print(f"Text columns: {len(columns_to_ignore)}")
print(f"Numeric columns: {len(numeric_columns)}")

This helps us understand:
1. How to check if a column is in a list using `in`
2. How to use if/else to make decisions about each column
3. How many columns we'll need to convert to numeric type
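
One detail worth knowing: `Index.difference` returns its result sorted, whereas a list comprehension with `not in` keeps the original column order. The sketch below shows both on a made-up, empty DataFrame whose columns are deliberately out of order.

```python
import pandas as pd

# Hypothetical columns, with the year columns deliberately out of order
toy = pd.DataFrame(columns=["Country Name", "2000 [YR2000]", "1990 [YR1990]"])
columns_to_ignore = ["Country Name"]

# Set-style difference: the result comes back sorted
via_difference = list(toy.columns.difference(columns_to_ignore))

# List comprehension: keeps the order the columns have in the DataFrame
via_comprehension = [c for c in toy.columns if c not in columns_to_ignore]

print(via_difference)
print(via_comprehension)
```

Either approach gives the same count. The order only matters if you later rely on it.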

### Step 3: Fixing All Numeric Columns

Now that we know how to:
1. Convert a single column to numeric type
2. Loop through column names

We can combine these to fix all numeric columns at once. We'll need to skip the columns that should remain as text (like 'Country Name').

Here's what we'll do:
1. Define which columns should NOT be converted to numbers
2. Loop through all columns
3. Convert appropriate columns to numeric type

In [None]:
# Define the columns that should remain as text


# Loop through all columns in the DataFrame
    # Check if the column is NOT in our list of non-numeric columns
        # Convert the column to numeric, coercing errors to NaN

# Now, let's check the .info() again to see if our conversion worked!
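
Once your loop works, you may want to compare it with a loop-free alternative. The sketch below is not the lab's expected solution. It uses made-up data and applies `pd.to_numeric` to several columns at once with `DataFrame.apply`.

```python
import pandas as pd

# Hypothetical toy data mimicking the lab's layout
toy = pd.DataFrame({
    "Country Name": ["Country A", "Country B"],
    "1990 [YR1990]": ["1.0", ".."],
    "2000 [YR2000]": ["..", "4.0"],
})

text_columns = ["Country Name"]
year_columns = [c for c in toy.columns if c not in text_columns]

# apply() calls pd.to_numeric on each selected column, forwarding errors="coerce"
toy[year_columns] = toy[year_columns].apply(pd.to_numeric, errors="coerce")

print(toy.dtypes)
```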

## 5. Handling Missing Values

Now that our data types are correct and our non-numeric values have been converted to `NaN`, we can properly handle the missing data.

### Why do we replace missing values?

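Missing values can distort an analysis. pandas methods such as `.mean()` and `.sum()` skip `NaN` by default, so a result may silently be based on fewer values than you expect, while many NumPy functions and modeling libraries return `NaN` or raise an error when missing values are present. Depending on the goal, we can either drop rows with missing data or fill them in with a reasonable substitute. Common choices for the substitute value are: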

- **Mean:** A good choice when the data is fairly symmetrical and doesn't have extreme outliers.
- **Median:** A better choice when the data has outliers, as the median is less sensitive to extreme values.
- **Mode:** Used for categorical (text-based) data to fill in with the most frequent value.
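
To see why outliers matter for the choice between mean and median, consider this sketch on made-up numbers with one extreme value and one missing entry.

```python
import pandas as pd

# Hypothetical values: four small numbers, one outlier, one missing entry
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0, None])

mean_val = values.mean()      # pulled upward by the outlier
median_val = values.median()  # barely affected by the outlier

filled_with_mean = values.fillna(mean_val)
filled_with_median = values.fillna(median_val)

print("mean:", mean_val, "median:", median_val)
print(filled_with_mean.iloc[-1], filled_with_median.iloc[-1])
```

Here the mean is 22.0 and the median is 3.0, so the two strategies would fill the gap with very different values.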

### Your Turn!

First, count the missing values in each column with `.isnull().sum()` to see the extent of the problem. Use what you learned in the previous section and write a `for` loop that prints the null count for each column.

**Tip:** Example for a single column:

```python
print(f"Missing values in 2022 column before filling: {df['2022 [YR2022]'].isnull().sum()}")
```
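
For comparison with your loop, `.isnull().sum()` can also be called on a whole DataFrame, in which case it returns one count per column. This is a sketch on made-up data.

```python
import pandas as pd

# Hypothetical toy data with gaps in both columns
toy = pd.DataFrame({
    "1990 [YR1990]": [1.0, None, None],
    "2000 [YR2000]": [None, 4.0, 5.0],
})

# One count of missing values per column, returned as a Series
null_counts = toy.isnull().sum()
print(null_counts)
```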

In [None]:
# your code

Let's practice by filling the missing values in the `2022 [YR2022]` column with the *median* of that column.

In [None]:
# your code
# 1. Calculate the median of the '2022 [YR2022]' column
median_val = df['2022 [YR2022]'].median()

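# 2. Use .fillna() on the column to replace the missing values with the median,
#    and assign the result back: df[<column_name>] = df[<column_name>].fillna(<value>)
#    (Avoid df[<column_name>].fillna(..., inplace=True): recent pandas versions
#    warn about this chained assignment, and it may leave df unchanged.)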


# 3. Verify that the missing values in the column are filled
print(f"Missing values in 2022 column after filling: {df['2022 [YR2022]'].isnull().sum()}")
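
The fill pattern can be tried in isolation first. This sketch uses made-up values and assigns the filled column back to the DataFrame, a form that works across pandas versions, whereas `inplace=True` on a selected column triggers a chained-assignment warning in recent releases.

```python
import pandas as pd

# Hypothetical toy column with two missing values
toy = pd.DataFrame({"2022 [YR2022]": [1.0, None, 3.0, None, 8.0]})

# Median of the non-missing values 1, 3 and 8
median_val = toy["2022 [YR2022]"].median()

# Assign the filled column back to the DataFrame
toy["2022 [YR2022]"] = toy["2022 [YR2022]"].fillna(median_val)

print(toy["2022 [YR2022]"].isnull().sum())
```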

Now fill the missing values in all numeric columns using a `for` loop.

In [None]:
# your code

## 6. Saving the Cleaned Data

Once you have cleaned your data, it's a good practice to save the result to a new file. This way, you don't have to repeat the cleaning steps every time you want to perform analysis.

### Your Turn!

Use the `.to_csv()` method to save your cleaned DataFrame (`df`) to a new file called `cleaned_world_bank_data.csv` in the `data` directory.

**Hint:** Include the argument `index=False` to prevent pandas from writing the DataFrame index as a new column in your CSV file.
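
To see what `index=False` changes, this sketch writes a made-up one-row DataFrame to a string instead of a file (calling `to_csv()` without a path returns the CSV text).

```python
import pandas as pd

# Hypothetical one-row DataFrame
toy = pd.DataFrame({"Country Name": ["Country A"], "2022 [YR2022]": [1.5]})

with_index = toy.to_csv()                # index written as an unnamed first column
without_index = toy.to_csv(index=False)  # only the real columns are written

print(with_index)
print(without_index)
```

With the index included, the header line starts with an empty field, and reading the file back produces an extra column that usually shows up as `Unnamed: 0`.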

In [None]:
# your code

## 7. Congratulations!

You've completed this data cleaning lab. You have practiced a realistic data analyst workflow:
- Loading raw data and identifying problems.
- Using transformations to fix data types.
- Strategically handling missing values.
- Saving your clean data for the next stage of analysis.

These are fundamental skills that you'll use in every data analysis project. Keep practicing!