# Week 2 Tutorial: Pandas Basics for Social Impact

Welcome to your second hands-on tutorial! This week, you'll learn how to use Pandas to load, explore, and clean real-world datasets related to social issues.

## Learning Goals
- Load CSV data into Pandas DataFrames
- Explore dataset structure (rows, columns, data types)
- Perform basic data cleaning (handle missing values, rename columns)
- Calculate summary statistics
- Filter and sort data to answer questions
- Connect data analysis to social impact themes

## Part 1: Loading Data with Pandas

Let's start by loading a dataset about global education indicators.

In [14]:
import pandas as pd

# Load the dataset (replace with your file path or use a sample dataset)
df = pd.read_csv("https://ourworldindata.org/grapher/mean-years-of-schooling-long-run.csv?v=1&csvType=full&useColumnShortNames=true", storage_options = {'User-Agent': 'Our World In Data data fetch/1.0'})

print(f"Loaded dataset with {df.shape[0]:,} rows and {df.shape[1]} columns.")

Loaded dataset with 4,311 rows and 4 columns.
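
If the URL is unreachable (for example, when working offline), the same `pd.read_csv` call also accepts a file-like object. The following is a minimal sketch, assuming a tiny made-up CSV that only imitates the layout of the real file; the `schooling` column name and the `World` row are invented for illustration.

```python
import io
import pandas as pd

# Made-up sample that imitates the layout of the real file (illustrative values only)
csv_text = """Entity,Code,Year,schooling
Afghanistan,AFG,1870,0.01
Afghanistan,AFG,1875,0.01
World,,1870,0.5
"""

# read_csv accepts a path, a URL, or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = sample.shape
print(f"Loaded sample with {n_rows:,} rows and {n_cols} columns.")
```

The empty `Code` field in the last row is read as a missing value, which is handy for practicing the cleaning steps later on.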


## Part 2: Exploring DataFrame Structure

Let's look at the first few rows and get a sense of the data.

In [15]:
# Preview the data
df.head()

| | Entity | Code | Year | mf_youth_and_adults__15_64_years__average_years_of_education |
|---|---|---|---|---|
| 0 | Afghanistan | AFG | 1870 | 0.01 |
| 1 | Afghanistan | AFG | 1875 | 0.01 |
| 2 | Afghanistan | AFG | 1880 | 0.01 |
| 3 | Afghanistan | AFG | 1885 | 0.01 |
| 4 | Afghanistan | AFG | 1890 | 0.01 |


In [16]:
# List columns and data types
print("Columns:", df.columns.tolist())
print("\nData types:")
print(df.dtypes)

Columns: ['Entity', 'Code', 'Year', 'mf_youth_and_adults__15_64_years__average_years_of_education']

Data types:
Entity                                                           object
Code                                                             object
Year                                                              int64
mf_youth_and_adults__15_64_years__average_years_of_education    float64
dtype: object
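
Once the data types are known, it is often useful to split the columns into numeric and non-numeric groups, since each group is explored differently. A small sketch using `select_dtypes`, assuming a made-up three-row frame (the `toy` name and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame with one text column and two numeric columns
toy = pd.DataFrame({
    "Entity": ["A", "A", "B"],
    "Year": [2000, 2005, 2000],
    "schooling": [3.2, 3.9, 7.5],
})

# Columns that hold numbers (integers or floats)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else (here: the text column)
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Non-numeric:", text_cols)
```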


## Part 3: Summary Statistics

Let's calculate some basic statistics to understand the data.

In [17]:
# Summary statistics for numeric columns
df.describe()

| | Year | mf_youth_and_adults__15_64_years__average_years_of_education |
|---|---|---|
| count | 4311.0 | 4311.0 |
| mean | 1958.632568 | 4.722069 |
| std | 51.736581 | 4.131015 |
| min | 1870.0 | 0.0 |
| 25% | 1915.0 | 0.8 |
| 50% | 1960.0 | 3.73 |
| 75% | 2005.0 | 8.17 |
| max | 2040.0 | 15.48 |


In [18]:
# Count unique countries
print("Number of unique countries:", df['Entity'].nunique())

# Count years covered
print("Years covered:", df['Year'].min(), "to", df['Year'].max())

Number of unique countries: 153
Years covered: 1870 to 2040
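
When only a few statistics for a single column are needed, `Series.agg` with a list of function names is a compact alternative to `describe()`. A sketch under the assumption of a made-up four-row frame (the names `toy` and `schooling` are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: two entities, two years each
toy = pd.DataFrame({
    "Entity": ["A", "A", "B", "B"],
    "Year": [2000, 2005, 2000, 2005],
    "schooling": [3.0, 4.0, 7.0, 8.0],
})

# Selected statistics for one column, returned as a Series indexed by statistic name
stats = toy["schooling"].agg(["mean", "median", "min", "max"])
print(stats)

# Number of rows recorded for each entity
rows_per_entity = toy["Entity"].value_counts()
print(rows_per_entity)
```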


## Part 4: Data Cleaning

Let's handle missing values and rename columns for clarity.

In [19]:
# Check for missing values
df.isnull().sum().sort_values(ascending=False).head(10)

Code                                                            181
Entity                                                            0
Year                                                              0
mf_youth_and_adults__15_64_years__average_years_of_education      0
dtype: int64

In [22]:
# Check available columns
print(df.columns.tolist())

# Rename columns to short, consistent names
df_clean = df.rename(columns={
    'Entity': 'country',
    'Code': 'code',
    'Year': 'year',
    'mf_youth_and_adults__15_64_years__average_years_of_education': 'years_schooling',
})

# Only 'Code' has missing values (181 rows), so the schooling column needs no dropna.
# To drop the rows without a code as well, uncomment the next line:
# df_clean = df_clean.dropna(subset=['code'])

['Entity', 'Code', 'Year', 'mf_youth_and_adults__15_64_years__average_years_of_education']
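
The rename and missing-value steps can be tried on a small frame before touching the full dataset. This is a minimal sketch, assuming made-up rows and a placeholder column name (`long_schooling_column_name`); the idea that a row without a code is an aggregate such as `World` is an assumption for illustration, not something confirmed by the dataset above.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with one missing code and one missing value
raw = pd.DataFrame({
    "Entity": ["Afghanistan", "World", "Finland"],
    "Code": ["AFG", None, "FIN"],
    "Year": [2000, 2000, 2000],
    "long_schooling_column_name": [2.1, 6.5, np.nan],
})

# Rename columns to short, lowercase names
renamed = raw.rename(columns={
    "Entity": "country",
    "Code": "code",
    "Year": "year",
    "long_schooling_column_name": "years_schooling",
})

# Drop rows where the value we want to analyze is missing
has_value = renamed.dropna(subset=["years_schooling"])

# Keep only rows that carry a code
countries_only = has_value[has_value["code"].notna()]
print(countries_only)
```

Both `rename` and `dropna` return new DataFrames, so `raw` stays unchanged and each step can be inspected separately.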


## Part 5: Filtering and Sorting Data

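Let's answer some questions using filtering and sorting. The cells in Parts 5 through 8 expect a cleaned DataFrame named `df_clean` with `country`, `year`, `literacy`, and `edu_spending_gdp` columns (and optionally `region`). The schooling dataset from Part 1 has no literacy or spending columns, so swap in the column names from your own data before running them.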

In [None]:
# Top 5 countries by literacy in 2018
top_lit = df_clean[df_clean['year'] == 2018].sort_values('literacy', ascending=False)
top_lit[['country', 'literacy']].head()

In [None]:
# Countries with lowest education spending in 2018
low_spend = df_clean[df_clean['year'] == 2018].sort_values('edu_spending_gdp')
low_spend[['country', 'edu_spending_gdp']].head()
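
The filter-then-sort pattern can be checked on a small example where the answer is easy to verify by eye. A sketch assuming made-up countries and literacy values; `nlargest` is shown as an equivalent shortcut for "sort descending, then take the head":

```python
import pandas as pd

# Hypothetical toy data: three countries observed in two years
toy = pd.DataFrame({
    "country": ["A", "B", "C", "A", "B", "C"],
    "year": [2010, 2010, 2010, 2018, 2018, 2018],
    "literacy": [80.0, 95.0, 60.0, 85.0, 97.0, 66.0],
})

# Step 1: keep only the rows for 2018 (boolean mask)
in_2018 = toy[toy["year"] == 2018]

# Step 2: sort from highest to lowest and take the top 2
top2 = in_2018.sort_values("literacy", ascending=False).head(2)

# Same result in one call
top2_alt = in_2018.nlargest(2, "literacy")

print(top2[["country", "literacy"]])
```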

## Part 6: Grouping and Aggregation

Let's calculate average literacy by region.

In [None]:
# Group by region and calculate mean literacy
if 'region' in df_clean.columns:
    region_lit = df_clean[df_clean['year'] == 2018].groupby('region')['literacy'].mean().sort_values(ascending=False)
    print(region_lit)
else:
    print("No 'region' column in this dataset.")
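
To see what `groupby` produces when a `region` column is present, here is a minimal sketch on made-up data. The region names and literacy values are hypothetical; `agg` is used so that the group size appears next to the mean, which helps spot averages based on very few countries.

```python
import pandas as pd

# Hypothetical toy data: four countries in two regions, all from 2018
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "region": ["North", "North", "South", "South"],
    "year": [2018, 2018, 2018, 2018],
    "literacy": [90.0, 80.0, 60.0, 70.0],
})

# Mean literacy and number of countries per region, highest mean first
region_summary = (
    toy[toy["year"] == 2018]
    .groupby("region")["literacy"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(region_summary)
```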

## Part 7: Your Turn - Explore and Summarize

Now it's your turn! Try answering these questions:
- What is the average education spending as % of GDP for the top 10 countries by literacy?
- Which countries have the largest gap between education spending and literacy?
- How has literacy changed over time for a country of your choice?

Use filtering, sorting, and groupby as shown above.

## Part 8: Simple Functions for Reusable Analysis

Let's create a function to summarize a country's education stats.

In [None]:
def summarize_country(df, country):
    """Prints summary statistics for a given country."""
    data = df[df['country'] == country]
    if data.empty:
        print(f"No data for {country}.")
        return
    print(f"Summary for {country}:")
    print(f"Years covered: {data['year'].min()} - {data['year'].max()}")
    print(f"Average literacy: {data['literacy'].mean():.1f}%")
    print(f"Average education spending (% GDP): {data['edu_spending_gdp'].mean():.2f}")

# Try it out
summarize_country(df_clean, "Finland")
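
A variation worth considering is to return the numbers instead of printing them, so the results can be reused in tables or plots. This is a sketch, not a replacement for the function above: `country_summary` is a hypothetical name, and the toy data is made up.

```python
import pandas as pd

def country_summary(df, country):
    """Return summary statistics for one country as a dict, or None if absent."""
    data = df[df["country"] == country]
    if data.empty:
        return None
    return {
        "country": country,
        "first_year": int(data["year"].min()),
        "last_year": int(data["year"].max()),
        "mean_literacy": float(data["literacy"].mean()),
    }

# Hypothetical toy data to try the function on
toy = pd.DataFrame({
    "country": ["Finland", "Finland", "Chile"],
    "year": [2010, 2018, 2018],
    "literacy": [99.0, 100.0, 96.0],
})

result = country_summary(toy, "Finland")
print(result)
```

Returning a dict also makes it simple to build a comparison table, for example by collecting the results for several countries into a list and passing it to `pd.DataFrame`.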

## 🎉 Congratulations!

You've completed your first Pandas tutorial for social impact! You've learned:

✅ **Loading data** from CSVs  
✅ **Exploring DataFrames** (rows, columns, types)  
✅ **Cleaning data** (missing values, renaming)  
✅ **Calculating statistics** and answering questions  
✅ **Writing simple functions** for reusable analysis

## 📝 Take-Home Assignment

**Assignment:** Create a notebook that summarizes key statistics of a real dataset related to social impact.

**Requirements:**
1. **Load a provided dataset** about education, health, or development
2. **Explore the data structure** using Pandas methods
3. **Clean the data** by handling missing values and renaming columns
4. **Calculate summary statistics** (mean, median, min, max, counts)
5. **Answer 3 specific questions** about the data using filtering/grouping
6. **Document your findings** with clear explanations

Use the code patterns you learned here as a starting point!

## Next Steps
- Complete the Week 2 assignment
- Upload your notebook to your GitHub repository
- Get ready for Week 3: Data Visualization with Plotly!