# Pandas Data Exploration: Introduction

**Learning Objectives:** Understand Python fundamentals (variables, logic, functions), set up the environment for data handling, and successfully load and inspect datasets.

**Dataset:** World Bank GDP growth data (annual %) for various countries and regions.

## Phase 1: Python Foundations

We begin by reviewing fundamental Python concepts—variables, conditional statements, and functions—which are necessary building blocks for later data cleaning and transformation tasks.

### 1. Variables: Storing Data and Constants

Variables allow us to store constants or references, such as file paths or economic indicators, for easy access in our analysis.
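As a minimal sketch of the idea, assignment binds a name to a value that can be reused later. The names `indicator_name` and `base_year` are illustrative assumptions and are not part of the exercise below.

```python
# Hypothetical names and values, chosen for illustration only.
indicator_name = "GDP growth (annual %)"  # a string
base_year = 2020                          # an integer

# Reuse the stored values instead of retyping them.
label = f"{indicator_name}, {base_year}"
print(label)
```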

#### **Exercise 1.1: Defining Variables**
1. Define a string variable called `file_name` for the dataset file name ('world_bank_data.csv')
2. Define an integer variable called `target_year` for the year we want to analyze (2022)
3. Print both variables to confirm the assignment

**Expected Outcome:** 
```
world_bank_data.csv
2022
```

In [None]:
# Your solution here

### 2. Conditional Statements: Introducing Logic

Conditional statements (`if/elif/else`) allow us to execute specific code blocks based on whether a condition is true or false. This logic is foundational for data classification and analysis.
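The general shape of an `if/elif/else` chain might look like the sketch below. The variable `inflation_rate`, its value, and the thresholds are made-up assumptions for illustration; Python evaluates the branches top to bottom and runs only the first one whose condition is true.

```python
# Hypothetical value and thresholds, for illustration only.
inflation_rate = 1.8

if inflation_rate > 4.0:
    category = "Above range"
elif inflation_rate >= 1.0:
    category = "Within range"
else:
    category = "Below range"

print(category)
```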

#### **Exercise 2.1: Simple Logic Check**
1. Define a variable `growth_rate` with value 5.0 (representing 5% GDP growth)
2. Write a conditional statement to check if it exceeds a 3% target
3. Print "High growth" if true, "Normal growth" if false

**Expected Outcome:**
```
High growth
```

In [None]:
# Your solution here

### 3. Functions: Reusable Transformations

Functions are reusable blocks of code that help ensure consistency and efficiency. In data analytics, they are often used to apply the same transformation logic repeatedly.
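To illustrate the pattern, here is a small sketch of defining a function once and applying it to several inputs. The function `percent_to_fraction` is a hypothetical helper invented for this example, not something the dataset or pandas provides.

```python
def percent_to_fraction(percent):
    """Convert a percentage such as 5.0 into a fraction such as 0.05."""
    return percent / 100


# Apply the same transformation to several (made-up) values.
fractions = [percent_to_fraction(p) for p in (5.0, 2.5, -1.0)]
print(fractions)
```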

#### **Exercise 3.1: Defining and Calling a Function**
1. Define a function called `classify_growth` that takes a growth rate as input
2. The function should return:
   - "High growth" if rate > 5%
   - "Moderate growth" if rate is between 2% and 5% (inclusive)
   - "Low growth" if rate < 2%
3. Test the function with growth rates: 6.5, 3.2, and 1.5

**Expected Outcome:**
```
High growth
Moderate growth
Low growth
```

In [None]:
# Your solution here

## Phase 2: Environment Setup and Data Loading

The core hands-on activity for Lab 1 is importing CSV data into `pandas`. This requires importing the library and then calling its CSV reader, `pd.read_csv()`.

### 4. Importing Required Libraries

We'll use `pandas` for data analysis and manipulation. Import it under its conventional alias, `pd`, which the later exercises assume.

In [None]:
# Your solution here

### 5. Loading the Dataset

Now we'll load the World Bank GDP growth data into a pandas DataFrame. The data contains annual GDP growth rates for various countries and regions.
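To show how `pd.read_csv()` behaves without needing the real file, the sketch below reads a tiny in-memory CSV. The column headers follow the expected structure of the dataset, but every value (the series code `SERIES.X`, the country names and codes, and the numbers) is an invented placeholder, not World Bank data.

```python
import io

import pandas as pd

# Synthetic stand-in for the CSV file; all values are invented for illustration.
csv_text = """Series Name,Series Code,Country Name,Country Code,2022 [YR2022]
GDP growth (%),SERIES.X,Country A,AAA,3.1
GDP growth (%),SERIES.X,Country B,BBB,..
GDP growth (%),SERIES.X,Country C,CCC,-1.4
"""

# read_csv accepts a file path or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df.head())
```

Note that the '..' entry is kept as text, so the whole year column is read as strings rather than numbers. That is the situation the cleaning step later in the lab addresses.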

#### **Exercise 5.1: Load Data into DataFrame**
1. Use `pd.read_csv()` to load the dataset into a variable called `df`
2. Make sure to use the `file_name` variable defined earlier
3. Print the first few rows using `.head()`

**Expected structure of the data:**
```
   Series Name     Series Code     Country Name  Country Code  2022 [YR2022]
0  GDP growth (%)  NY.GDP.MKTP...  Afghanistan   AFG           -6.24017...
```

In [None]:
# Your solution here

## Phase 3: Inspecting and Describing Data

As per the learning objectives, the final step is to inspect and analyze the loaded dataset.

### 6. Initial Data Inspection

We use basic pandas functions to check the size, content, and data types of our World Bank dataset.

#### **Exercise 6.1: Basic Data Inspection**
Perform the following inspections:
1. Use `.info()` to check data types and non-null counts
2. Use `.shape` to see the dimensions of the dataset
3. Use `.describe()` to get summary statistics for the dataset

This will help us understand:
- How many countries/regions are in our dataset
- What years are covered
- Basic statistics about GDP growth rates
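A minimal sketch of these three inspections on a tiny hand-made DataFrame follows. `sample_df` and its values are invented for illustration and are not World Bank figures.

```python
import pandas as pd

# Hypothetical data: one text column and one numeric column with a gap.
sample_df = pd.DataFrame(
    {
        "Country Name": ["Country A", "Country B", "Country C"],
        "2022 [YR2022]": [3.1, None, -1.4],
    }
)

sample_df.info()                   # column dtypes and non-null counts
n_rows, n_cols = sample_df.shape   # (rows, columns); an attribute, not a method
summary = sample_df.describe()     # numeric columns only, by default
print(n_rows, n_cols)
print(summary)
```

By default `.describe()` skips non-numeric columns and ignores missing values when counting, so the count for the year column here is lower than the number of rows.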

#### **Exercise 6.2: Exploring Unique Values**
1. Use `.unique()` to check unique values in the 'Series Name' column
2. Count how many different countries we have in 'Country Name'
3. Print both results

Expected output will show:
- Different types of economic indicators in the dataset
- Total number of countries/regions being analyzed
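The difference between listing and counting distinct values can be sketched as follows. The DataFrame and the series names in it are made-up assumptions; `.unique()` returns the distinct values themselves, while `.nunique()` returns how many there are.

```python
import pandas as pd

# Hypothetical rows, for illustration only.
sample_df = pd.DataFrame(
    {
        "Series Name": ["GDP growth (%)", "Inflation (%)", "GDP growth (%)"],
        "Country Name": ["Country A", "Country A", "Country B"],
    }
)

series_names = sample_df["Series Name"].unique()   # distinct values, in order of appearance
n_countries = sample_df["Country Name"].nunique()  # number of distinct values

print(series_names)
print(n_countries)
```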

In [None]:
# Your solution here
# Check unique values in Series Name
# Count unique countries

### 7. Data Cleaning

Our numeric columns are currently stored as objects (text) because missing values are encoded with special characters such as '..'. Let's clean them up.

#### **Exercise 7.1: Converting Data Types**
Create a function to clean numeric columns, apply it to our DataFrame, and store the result in a new DataFrame called `df_clean`. The function should:
1. Handle special characters like '..' that represent missing values
2. Convert string values to float
3. Handle any conversion errors by setting them to NaN
4. Apply the conversion to all numeric columns (all except the text columns 'Series Name', 'Series Code', 'Country Name', and 'Country Code')
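One possible approach, shown here as a sketch rather than the required solution, relies on `pd.to_numeric` with `errors="coerce"`, which turns any value it cannot parse (including '..') into NaN. The DataFrame, the `label_columns` list, and all values are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical raw data: numbers stored as text, with '..' marking a gap.
sample_df = pd.DataFrame(
    {
        "Country Name": ["Country A", "Country B", "Country C"],
        "2022 [YR2022]": ["3.1", "..", "-1.4"],
    },
    dtype=object,
)

label_columns = ["Country Name"]  # text columns to leave untouched

# Work on a copy so the raw data stays available for comparison.
sample_clean = sample_df.copy()
for column in sample_clean.columns:
    if column not in label_columns:
        # Unparseable entries such as '..' become NaN instead of raising an error.
        sample_clean[column] = pd.to_numeric(sample_clean[column], errors="coerce")

print(sample_clean.dtypes)
```

Copying first is a design choice: it keeps the original strings intact, so the before and after states can be compared.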

In [None]:
# Your solution here

### 8. Inspecting Cleaned Data

Now that we have cleaned our data, let's inspect it again to see the improvements.

#### **Exercise 8.1: Re-inspection After Cleaning**
Using the cleaned DataFrame (`df_clean`):
1. Check the data types using `.info()`
2. Generate summary statistics using `.describe()`
3. Compare the results with our previous inspection

You should see that:
- Year columns are now of type float64
- Summary statistics are more meaningful
- Missing values are properly handled
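To see why the statistics become more meaningful, the sketch below compares `.describe()` on a text column and on its numeric conversion. The values are invented for illustration: a text column is summarized by counts and the most frequent value, while a numeric column gets a mean, standard deviation, and quantiles.

```python
import pandas as pd

# Hypothetical column, before and after conversion.
raw = pd.Series(["3.1", "..", "-1.4"], dtype=object, name="2022 [YR2022]")
cleaned = pd.to_numeric(raw, errors="coerce")

before = raw.describe()     # count, unique, top, freq
after = cleaned.describe()  # count, mean, std, min, quartiles, max

print(before)
print(after)
```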

In [None]:
# Your solution here
# Check data types
# Generate summary statistics

### 9. Optional Exercise: Data Filtering

Let's focus on specific economic indicators in our dataset.

#### **Exercise 9.1: Filtering Series**
1. Create a new DataFrame called `df_filtered` that contains only rows where:
   - 'Series Name' contains "GDP growth" or "Inflation"
2. Check how many rows are in the filtered dataset
3. Display the first few rows of the filtered data

This will help us focus on the key economic indicators we're interested in analyzing.
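A sketch of substring-based filtering on made-up rows is shown below. The series names are assumptions for illustration. `Series.str.contains` treats its pattern as a regular expression by default, so `|` means "or", and `na=False` treats missing names as non-matches.

```python
import pandas as pd

# Hypothetical rows, for illustration only.
sample_df = pd.DataFrame(
    {
        "Series Name": ["GDP growth (%)", "Inflation (%)", "Population", None],
        "Country Name": ["Country A", "Country A", "Country A", "Country B"],
    }
)

# Boolean mask: True where the name contains either phrase.
mask = sample_df["Series Name"].str.contains("GDP growth|Inflation", na=False)
sample_filtered = sample_df[mask]

print(len(sample_filtered))
print(sample_filtered.head())
```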

In [None]:
# Your solution here
# Filter the DataFrame
# Check the size
# Display first few rows