# 🐼 Introduction to Pandas for AI Beginners!

Your Comprehensive Guide to the Fundamentals of the Pandas Library

### 📘 Overview of Today's 2-Hour Session

Welcome to the world of data manipulation with **Pandas**! If NumPy gives us superpowers for numbers, Pandas gives us superpowers for data tables. It's one of the most widely used tools for cleaning, transforming, and analyzing data in Python, which is a crucial first step in any AI project.

**Why Pandas?** Real-world data is often messy, with missing values and different data types. Pandas helps us tame this chaos with ease!

### 🎯 Learning Objectives:

By the end of this session, you will be able to:
1.  Understand the two core Pandas data structures: `Series` and `DataFrame`.
2.  Load data and create your own `DataFrame` objects.
3.  Select specific rows, columns, and subsets of data using `.loc` and `.iloc`.
4.  Filter data based on conditions.
5.  Perform powerful operations like sorting, counting values, and applying custom functions.
6.  Handle common problems like missing data by dropping or filling it.

### ⚙️ Let's Get Set Up!

Just like with NumPy, we first need to import the Pandas library. The standard way to do this is `import pandas as pd`.

In [1]:
# Import the pandas library
import pandas as pd
import numpy as np # We often use numpy alongside pandas

--- 
## Topic 1: The Core Components - Series & DataFrame

Pandas has two main data structures you'll use all the time:

- **`Series`**: A one-dimensional labeled array. Think of it as a single column in a spreadsheet.
- **`DataFrame`**: A two-dimensional labeled table with columns of potentially different types. This is like a whole spreadsheet or a SQL table!

In [2]:
# Example 1: Creating a Series (a single column)
# Let's create a Series of student scores.
scores = pd.Series([10, 20, 30, 40], index=['Alice', 'Bob', 'Charlie', 'David'])

print("--- Our First Series ---")
print(scores)

--- Our First Series ---
Alice      10
Bob        20
Charlie    30
David      40
dtype: int64


In [3]:
# Example 2: Creating a DataFrame (a full table)
# We can create a DataFrame from a dictionary.

data = {
    'State': ['Ohio', 'Ohio', 'Nevada', 'Nevada'],
    'Year': [2000, 2001, 2001, 2002],
    'Population': [1.5, 1.7, 2.4, 2.9]
}

df = pd.DataFrame(data)

print("--- Our First DataFrame ---")
print(df)

--- Our First DataFrame ---
    State  Year  Population
0    Ohio  2000         1.5
1    Ohio  2001         1.7
2  Nevada  2001         2.4
3  Nevada  2002         2.9


### 🎯 Practice Task: Create Your Own DataFrame

Create a DataFrame named `movie_df` about your favorite movies (we'll reuse it in a later practice task). It should have three columns: 'Title', 'Genre', and 'Release_Year'. Include at least 3 movies.

In [4]:
# Your code here!

--- 
## Topic 2: Data Input (Reading from Files)

Most of the time, your data will be in a file. Pandas is fantastic at reading data from various formats like CSV, Excel, JSON, and more!

The most common function you'll use is `pd.read_csv()`.

```python
# This is how you would read a CSV file named 'students.csv'
# student_df = pd.read_csv('students.csv')
```
💡 **Note:** Since we don't have a file to load right now, the code above is commented out. But this is the exact command you would use.
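
If you'd like to try `pd.read_csv()` without a file on disk, one option is to hand it an in-memory text buffer instead of a filename. The sketch below assumes a tiny, made-up CSV (the names and scores are illustrative only) and uses Python's built-in `io.StringIO` so it runs as-is.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real 'students.csv' file
csv_text = """Name,Score
Alice,88
Bob,92
Charlie,79
"""

# read_csv accepts a file path or any file-like object
student_df = pd.read_csv(io.StringIO(csv_text))

print(student_df)
print(student_df.dtypes)
```

With a real file, you would pass the path (for example `'students.csv'`) in place of the `StringIO` object.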

--- 
## Topic 3: Selection - Getting the Data You Want

Once you have a DataFrame, you need to know how to grab specific pieces of it. This is called **selection** or **indexing**.

### 📄 Selecting Columns

You can select a single column using `df['ColumnName']`, which returns a Series. To select multiple columns, use `df[['Col1', 'Col2']]` (notice the double brackets!).

In [6]:
df

    State  Year  Population
0    Ohio  2000         1.5
1    Ohio  2001         1.7
2  Nevada  2001         2.4
3  Nevada  2002         2.9


In [7]:
# Let's use our state DataFrame from before
print("--- Original DataFrame ---")
print(df)

# Select a single column ('State')
states = df['State']
print("\n--- 'State' Column (a Series) ---")
print(states)

# Select multiple columns ('Year' and 'Population')
year_and_pop = df[['Year', 'Population']]
print("\n--- 'Year' & 'Population' Columns (a DataFrame) ---")
print(year_and_pop)

--- Original DataFrame ---
    State  Year  Population
0    Ohio  2000         1.5
1    Ohio  2001         1.7
2  Nevada  2001         2.4
3  Nevada  2002         2.9

--- 'State' Column (a Series) ---
0      Ohio
1      Ohio
2    Nevada
3    Nevada
Name: State, dtype: object

--- 'Year' & 'Population' Columns (a DataFrame) ---
   Year  Population
0  2000         1.5
1  2001         1.7
2  2001         2.4
3  2002         2.9


### 📄 Selecting Rows with `.loc` and `.iloc`

This is super important! Pandas gives us two main ways to select rows:

- `.loc`: Selects rows by their **label** (the index name).
- `.iloc`: Selects rows by their **integer position** (0, 1, 2, ...).

In [9]:
df

    State  Year  Population
0    Ohio  2000         1.5
1    Ohio  2001         1.7
2  Nevada  2001         2.4
3  Nevada  2002         2.9


In [10]:
# Using .loc to select the row with index label 0
row_0_loc = df.loc[0]
print("--- Row with index label 0 (using .loc) ---")
print(row_0_loc)

--- Row with index label 0 (using .loc) ---
State         Ohio
Year          2000
Population     1.5
Name: 0, dtype: object


In [11]:
# Using .iloc to select the first row (at integer position 0)
row_0_iloc = df.iloc[0]
print("--- First row at position 0 (using .iloc) ---")
print(row_0_iloc)

# Using .iloc to select the first two rows (slicing)
first_two_rows = df.iloc[0:2]
print("\n--- First two rows (using .iloc) ---")
print(first_two_rows)

--- First row at position 0 (using .iloc) ---
State         Ohio
Year          2000
Population     1.5
Name: 0, dtype: object

--- First two rows (using .iloc) ---
  State  Year  Population
0  Ohio  2000         1.5
1  Ohio  2001         1.7
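
With the default 0, 1, 2, ... index, labels and positions happen to be the same, so `.loc[0]` and `.iloc[0]` return the same row. The difference becomes visible once the index holds something else. Here is a minimal sketch, assuming the same state data with made-up string labels `'a'` to `'d'` (the name `labeled_df` is just for illustration):

```python
import pandas as pd

# Same state data as before, but with hypothetical string labels as the index
labeled_df = pd.DataFrame(
    {
        'State': ['Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'Year': [2000, 2001, 2001, 2002],
        'Population': [1.5, 1.7, 2.4, 2.9],
    },
    index=['a', 'b', 'c', 'd'],
)

# .loc uses labels; a label slice INCLUDES the end label
by_label = labeled_df.loc['a':'c']

# .iloc uses positions; a position slice EXCLUDES the end position
by_position = labeled_df.iloc[0:2]

print(by_label)      # rows a, b, c
print(by_position)   # rows a, b
```

Notice that the label slice keeps its end point while the position slice does not. This is a common source of off-by-one surprises.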


### 📄 Selecting Subsets of Rows and Columns

The real power of `.loc` and `.iloc` comes from combining row and column selections. The format is `df.loc[row_selector, column_selector]`.

In [13]:
df

    State  Year  Population
0    Ohio  2000         1.5
1    Ohio  2001         1.7
2  Nevada  2001         2.4
3  Nevada  2002         2.9


In [14]:
# Get the 'Year' and 'Population' for the rows with index 0 and 2.
subset = df.loc[[0, 2], ['Year', 'Population']]

print("--- A specific subset of rows and columns ---")
print(subset)

--- A specific subset of rows and columns ---
   Year  Population
0  2000         1.5
2  2001         2.4
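
The same `[row_selector, column_selector]` format works with `.iloc`, except that both selectors are integer positions. As a small sketch (the DataFrame is rebuilt here so the block runs on its own, and the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'State': ['Ohio', 'Ohio', 'Nevada', 'Nevada'],
    'Year': [2000, 2001, 2001, 2002],
    'Population': [1.5, 1.7, 2.4, 2.9]
})

# Rows at positions 0 and 2, columns at positions 1 and 2
# ('Year' and 'Population' in this particular DataFrame)
subset_by_position = df.iloc[[0, 2], [1, 2]]
print(subset_by_position)

# A single cell: row label 3, column 'State'
single_value = df.loc[3, 'State']
print(single_value)
```

Selecting columns by position depends on the column order, so `.loc` with column names is usually easier to read.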


### 📄 Index Setting and Resetting
Sometimes, it's useful to set one of your columns as the main index for the DataFrame. This is common when you have a unique ID for each row.
- `df.set_index('ColumnName')`: Sets a column as the new index.
- `df.reset_index()`: Resets the index back to the default 0, 1, 2, ...

In [15]:
df

    State  Year  Population
0    Ohio  2000         1.5
1    Ohio  2001         1.7
2  Nevada  2001         2.4
3  Nevada  2002         2.9


In [16]:
# Let's set the 'State' column as the index
# Note: This is not a great index as it has duplicates, but it's a good example!
df_with_state_index = df.set_index('State')

print("--- DataFrame with 'State' as Index ---")
print(df_with_state_index)

--- DataFrame with 'State' as Index ---
        Year  Population
State                   
Ohio    2000         1.5
Ohio    2001         1.7
Nevada  2001         2.4
Nevada  2002         2.9
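
To go the other way, `reset_index()` moves the index back into a regular column and restores the default 0, 1, 2, ... index. A minimal sketch, rebuilding the same DataFrame so the block runs on its own (`df_restored` is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({
    'State': ['Ohio', 'Ohio', 'Nevada', 'Nevada'],
    'Year': [2000, 2001, 2001, 2002],
    'Population': [1.5, 1.7, 2.4, 2.9]
})
df_with_state_index = df.set_index('State')

# reset_index() turns 'State' back into an ordinary column
df_restored = df_with_state_index.reset_index()
print(df_restored)

# set_index() returned a new DataFrame, so the original df is unchanged
print(df.columns.tolist())
```

Both methods return a new DataFrame by default, which is why the results are assigned to new variables.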


### 🎯 Practice Task: Select & Set Index

1. From the `movie_df` you created, select the 'Title' and 'Release_Year' for the first two rows using `.iloc`.
2. Create a new DataFrame from `movie_df` where the 'Title' is the index.

In [17]:
# Your code here!

--- 
## Topic 4: Conditional Selection (Filtering)

This is where Pandas really shines! You can filter your DataFrame to find rows that meet a certain condition. It's like asking a question about your data.

The syntax looks like this: `df[df['Column'] > some_value]`

In [18]:
# Let's find all the rows where the year is greater than 2001
modern_years = df[df['Year'] > 2001]

print("--- Rows where Year > 2001 ---")
print(modern_years)

--- Rows where Year > 2001 ---
    State  Year  Population
3  Nevada  2002         2.9


In [19]:
# You can also combine conditions with & (and) and | (or)
# Let's find rows where the Year > 2000 AND the State is 'Ohio'
ohio_after_2000 = df[(df['Year'] > 2000) & (df['State'] == 'Ohio')]

print("--- Ohio, after the year 2000 ---")
print(ohio_after_2000)

--- Ohio, after the year 2000 ---
  State  Year  Population
1  Ohio  2001         1.7
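
The `|` (or) operator follows the same pattern as `&`, and `~` flips a condition (not). Here is a short sketch on the same state data, rebuilt so the block runs on its own (variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'State': ['Ohio', 'Ohio', 'Nevada', 'Nevada'],
    'Year': [2000, 2001, 2001, 2002],
    'Population': [1.5, 1.7, 2.4, 2.9]
})

# | means OR: rows from the year 2000, or rows for Nevada
year_2000_or_nevada = df[(df['Year'] == 2000) | (df['State'] == 'Nevada')]
print(year_2000_or_nevada)

# ~ means NOT: every row that is not Ohio
not_ohio = df[~(df['State'] == 'Ohio')]
print(not_ohio)
```

The parentheses around each condition matter: `&` and `|` bind more tightly than comparisons like `>` and `==`, so leaving them out typically raises an error.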


--- 
## Topic 5: Powerful Operations on DataFrames

Beyond just selecting data, Pandas provides a huge library of functions to inspect, clean, and transform your data.

### 📄 Inspecting and Summarizing
- `.head()`: Shows the first 5 rows (useful for a quick peek).
- `.columns`: Shows a list of all column names.
- `.index`: Shows the index labels of the DataFrame.
- `.unique()`: Shows all the unique values in a column.
- `.value_counts()`: Counts how many times each unique value appears in a column.

In [20]:
import pandas as pd
# Let's create a more complex DataFrame for these examples
ops_df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Alice'],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago', 'Los Angeles', 'New York'],
    'Score': [88, 92, 79, 95, 85, 72, 90]
})

print("--- Column Names ---")
print(ops_df.columns)

print("\n--- Index Labels ---")
print(ops_df.index)

print("\n--- Unique cities ---")
print(ops_df['City'].unique())

print("\n--- Counts for each city ---")
print(ops_df['City'].value_counts())

--- Column Names ---
Index(['Name', 'City', 'Score'], dtype='object')

--- Index Labels ---
RangeIndex(start=0, stop=7, step=1)

--- Unique cities ---
['New York' 'Los Angeles' 'Chicago']

--- Counts for each city ---
City
New York       3
Los Angeles    2
Chicago        2
Name: count, dtype: int64
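
The cell above doesn't call `.head()`, so here is a quick sketch of it on the same data (rebuilt so the block runs on its own):

```python
import pandas as pd

ops_df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Alice'],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago', 'Los Angeles', 'New York'],
    'Score': [88, 92, 79, 95, 85, 72, 90]
})

# Default: the first 5 rows
print(ops_df.head())

# Pass a number to see a different count of rows
print(ops_df.head(2))
```

On a table with thousands of rows, this is a cheap way to check that the data loaded the way you expected.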


### 📄 Sorting, Ordering, and Replacing Values
- `.sort_values()`: Sorts the DataFrame by the values in a column.
- `.sort_index()`: Sorts the DataFrame by its index labels.
- `.replace()`: Replaces specified values in a column.

In [21]:
# Let's sort the DataFrame by the score, from highest to lowest
sorted_df = ops_df.sort_values(by='Score', ascending=False)
print("--- DataFrame Sorted by Score ---")
print(sorted_df)

--- DataFrame Sorted by Score ---
      Name         City  Score
3    David     New York     95
1      Bob  Los Angeles     92
6    Alice     New York     90
0    Alice     New York     88
4      Eva      Chicago     85
2  Charlie      Chicago     79
5    Frank  Los Angeles     72


In [22]:
# Let's replace 'New York' with 'NYC'
replaced_df = ops_df.replace('New York', 'NYC')
print("\n--- DataFrame with 'New York' replaced by 'NYC' ---")
print(replaced_df)


--- DataFrame with 'New York' replaced by 'NYC' ---
      Name         City  Score
0    Alice          NYC     88
1      Bob  Los Angeles     92
2  Charlie      Chicago     79
3    David          NYC     95
4      Eva      Chicago     85
5    Frank  Los Angeles     72
6    Alice          NYC     90
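
`.sort_index()` is handy after a `.sort_values()` call, because sorting by value leaves the index labels out of order. A minimal sketch, rebuilding `ops_df` so the block runs on its own (`restored_df` is an illustrative name):

```python
import pandas as pd

ops_df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Alice'],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago', 'Los Angeles', 'New York'],
    'Score': [88, 92, 79, 95, 85, 72, 90]
})

# Sorting by Score shuffles the index labels
sorted_df = ops_df.sort_values(by='Score', ascending=False)
print(sorted_df.index.tolist())

# sort_index() puts the rows back in index-label order
restored_df = sorted_df.sort_index()
print(restored_df.index.tolist())
```

Like the other methods in this topic, `.sort_index()` returns a new DataFrame and leaves the original untouched.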


### 🎯 Practice Task: Operations

Using the `ops_df` dataframe:
1. Find the number of unique names using `.unique()` (hint: `.unique()` returns the values themselves, so count them with `len()`).
2. Sort the DataFrame alphabetically by 'Name'.

In [23]:
# Your code here!

--- 
## Topic 6: Handling Missing Data

Real-world data is rarely perfect. Often, you'll have missing values, which Pandas represents as `NaN` (Not a Number).

### 📄 Null Value Check
The first step is to identify where the missing values are.
- `df.isnull()`: Returns a DataFrame of boolean values indicating if a cell is null.
- `df.isnull().sum()`: A powerful chain of commands that returns the total number of null values in each column.

In [24]:
# Let's create a DataFrame with some missing data
student_data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, np.nan, 22],
    'Score': [88, 92, 79, np.nan]
}
student_df = pd.DataFrame(student_data)

print("--- Original DataFrame with Missing Data ---")
print(student_df)

print("\n--- Null values per column ---")
print(student_df.isnull().sum())

--- Original DataFrame with Missing Data ---
      Name   Age  Score
0    Alice  25.0   88.0
1      Bob  30.0   92.0
2  Charlie   NaN   79.0
3    David  22.0    NaN

--- Null values per column ---
Name     0
Age      1
Score    1
dtype: int64
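
To see what `.isnull()` returns before it is summed, you can print the boolean table itself and use it to pull out the affected rows. A small sketch on the same student data, rebuilt so the block runs on its own (`null_mask` and `rows_with_missing` are illustrative names):

```python
import numpy as np
import pandas as pd

student_df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, np.nan, 22],
    'Score': [88, 92, 79, np.nan]
})

# True marks a missing cell, False marks a present one
null_mask = student_df.isnull()
print(null_mask)

# Keep only the rows that have at least one missing value
rows_with_missing = student_df[null_mask.any(axis=1)]
print(rows_with_missing)
```

Here `.any(axis=1)` checks each row and returns `True` if any cell in that row is missing.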


### 📄 Handling Missing Data
You have two main options:
1.  **Remove it:** Drop rows or columns with missing data using `.dropna()`.
2.  **Fill it:** Fill the missing values with something else using `.fillna()`.

In [25]:
# Option 1: Drop rows with any missing values
df_dropped = student_df.dropna()
print("\n--- DataFrame after dropping rows with NaN ---")
print(df_dropped)


--- DataFrame after dropping rows with NaN ---
    Name   Age  Score
0  Alice  25.0   88.0
1    Bob  30.0   92.0


In [26]:
# Option 2: Fill missing values
# Let's fill the missing Age with the average age
mean_age = student_df['Age'].mean()
df_filled = student_df.fillna({'Age': mean_age, 'Score': 0}) # Fill score with 0

print("\n--- DataFrame after filling missing values ---")
print(df_filled)


--- DataFrame after filling missing values ---
      Name        Age  Score
0    Alice  25.000000   88.0
1      Bob  30.000000   92.0
2  Charlie  25.666667   79.0
3    David  22.000000    0.0
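
`.dropna()` can also drop columns, or look at only some columns when deciding which rows to drop. A minimal sketch using its `axis` and `subset` parameters on the same student data (rebuilt so the block runs on its own; variable names are illustrative):

```python
import numpy as np
import pandas as pd

student_df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, np.nan, 22],
    'Score': [88, 92, 79, np.nan]
})

# axis=1 drops COLUMNS that contain any missing value
cols_dropped = student_df.dropna(axis=1)
print(cols_dropped)

# subset= drops a row only when 'Score' is missing
score_required = student_df.dropna(subset=['Score'])
print(score_required)
```

Dropping columns is a blunt tool: here it removes both 'Age' and 'Score' because each has one missing value, so `subset` or `.fillna()` is often the gentler choice.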


--- 
## 🎉 Final Revision Assignment 🎉

Amazing work! You've learned the most important Pandas skills. Let's combine everything you've learned in a final assignment.

**Scenario:** You are managing a dataset of products for an online store.

**Your Tasks:**

1.  **Create a DataFrame:** Create the product DataFrame from the dictionary provided below.
2.  **Initial Inspection:** Use `.value_counts()` on the 'Category' column to see how many products are in each category.
3.  **Filter for Electronics:** Find and print all products that belong to the 'Electronics' category.
4.  **Find High-Rated Products:** Filter the DataFrame to show only products with a 'Rating' greater than 4.0.
5.  **Handle Missing Price:** The price for the 'Keyboard' is missing. Fill this `NaN` value with the average price of the other products.
6.  **Add a New Column:** Use `.apply()` to create a new column called 'Price_with_Tax' that is 20% higher than the 'Price' column.
7.  **Sort and Finalize:** Sort the final DataFrame by 'Rating' in descending order. Then, drop the original 'Price' column to clean up the table.

In [None]:
# Your solution for the Final Assignment here!

# 1. Create a DataFrame
product_dict = {
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam', 'Headphones'],
    'Category': ['Electronics', 'Electronics', 'Electronics', 'Electronics', 'Accessories', 'Accessories'],
    'Price': [1200, 25, np.nan, 300, 50, 150],
    'Rating': [4.5, 4.2, 4.8, 4.9, 3.8, 4.6]
}
products_df = pd.DataFrame(product_dict)
print("--- Task 1: Full DataFrame ---")
print(products_df)

# 2. Initial Inspection
print("\n--- Task 2: Category Counts ---")
print(products_df['Category'].value_counts())

# 3. Filter for Electronics
electronics = products_df[products_df['Category'] == 'Electronics']
print("\n--- Task 3: Electronics only ---")
print(electronics)

# 4. Find High-Rated Products
high_rated = products_df[products_df['Rating'] > 4.0]
print("\n--- Task 4: High Rated Products ---")
print(high_rated)

# 5. Handle Missing Price
avg_price = products_df['Price'].mean()
# Assign the result back instead of using inplace=True on a selected column
products_df['Price'] = products_df['Price'].fillna(avg_price)
print("\n--- Task 5: DataFrame with filled price ---")
print(products_df)

# 6. Add a New Column
products_df['Price_with_Tax'] = products_df['Price'].apply(lambda x: x * 1.20)
print("\n--- Task 6: DataFrame with Tax Column ---")
print(products_df)

# 7. Sort and Finalize
final_products_df = products_df.sort_values(by='Rating', ascending=False)
final_products_df = final_products_df.drop('Price', axis=1)
print("\n--- Task 7: Final Sorted & Cleaned DataFrame ---")
print(final_products_df)
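
Task 6 asks for `.apply()`, which calls a function once per value. For simple arithmetic like this, multiplying the whole column at once should give the same numbers. The sketch below compares the two approaches on a hypothetical price list (values borrowed from the product table, minus the missing one):

```python
import numpy as np
import pandas as pd

# Hypothetical prices for illustration
prices = pd.Series([1200, 25, 300, 50, 150], name='Price')

# Element by element with .apply(), as in Task 6
with_apply = prices.apply(lambda x: x * 1.20)

# The same calculation on the whole column at once
vectorized = prices * 1.20

# Compare the two results
print(np.allclose(with_apply, vectorized))
```

`.apply()` becomes more useful when the per-value logic is too custom to express as plain column arithmetic.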

## 🥳 You've Mastered the Basics!

Excellent job! You now have a solid foundation in Pandas. These skills are exactly what you need to start exploring and cleaning datasets for your AI projects. The journey of a data scientist starts here!