# **AI TECH INSTITUTE** · *Intermediate AI & Data Science*
### Week 01 · Notebook 02 — DataFrames Fundamentals
**Instructor:** Amir Charkhi  |  **Goal:** Master DataFrame creation, selection, and basic operations.

> Format: build on Series knowledge → explore 2D data → practice real operations.


---
## From Series to DataFrames
Think of a DataFrame as a collection of Series that share the same index: each column is a Series, and the index labels the rows, much like a spreadsheet.

In [None]:
import pandas as pd
import numpy as np

# Remember Week 0: list of dictionaries?
students_list = [
    {'name': 'Alice', 'age': 22, 'grade': 85},
    {'name': 'Bob', 'age': 24, 'grade': 78},
    {'name': 'Charlie', 'age': 23, 'grade': 92}
]

# Now as a DataFrame!
df = pd.DataFrame(students_list)
print("Our first DataFrame:")
print(df)
print(f"\nShape: {df.shape} (rows, columns)")
print(f"Columns: {df.columns.tolist()}")
print(f"Data types:\n{df.dtypes}")
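
To make the "collection of Series" idea concrete, here is a minimal sketch that builds a small table from two Series and then pulls one column back out. The names `ages`, `grades`, and `roster` are illustrative and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Two Series that share the same index labels (illustrative data)
ages = pd.Series([22, 24, 23], index=['Alice', 'Bob', 'Charlie'])
grades = pd.Series([85, 78, 92], index=['Alice', 'Bob', 'Charlie'])

# A dict of Series becomes a DataFrame: keys -> column names, shared index -> row labels
roster = pd.DataFrame({'age': ages, 'grade': grades})
print(roster)

# Each column is itself a Series that carries the DataFrame's index
age_column = roster['age']
print(type(age_column).__name__)
print(age_column.index.equals(roster.index))
```

Because every column shares one index, a single label such as `'Bob'` identifies the same row in all of them.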

## 1. Creating DataFrames - Multiple Ways

In [None]:
# Method 1: From dictionary of lists
data_dict = {
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],
    'Price': [1200, 25, 80, 350, 120],
    'Stock': [15, 102, 45, 28, 33],
    'Category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Accessories']
}

products_df = pd.DataFrame(data_dict)
print("Products DataFrame:")
print(products_df)

# Method 2: From NumPy array with custom columns and index
data_array = np.array([[25, 180, 68], [28, 175, 72], [22, 165, 58]])
people_df = pd.DataFrame(
    data_array,
    columns=['Age', 'Height_cm', 'Weight_kg'],
    index=['Person_1', 'Person_2', 'Person_3']
)
print("\nPeople DataFrame:")
print(people_df)

**Exercise 1 — Sales Dashboard (easy)**  
Create a DataFrame with 5 days of sales data: date, revenue, customers, avg_order_value.


In [None]:
# Your turn


<details>
<summary><b>Solution</b></summary>

```python
sales_data = {
    'date': ['2025-08-19', '2025-08-20', '2025-08-21', '2025-08-22', '2025-08-23'],
    'revenue': [5420, 6100, 4850, 7200, 8900],
    'customers': [45, 52, 41, 58, 72],
    'avg_order_value': [120.44, 117.31, 118.29, 124.14, 123.61]
}
sales_df = pd.DataFrame(sales_data)
print(sales_df)
print(f"\nTotal revenue: ${sales_df['revenue'].sum():,.2f}")
```
</details>

## 2. Selecting Data - Columns and Rows

In [None]:
# Using our products DataFrame
print("Select single column (returns Series):")
print(products_df['Price'])

print("\nSelect multiple columns (returns DataFrame):")
print(products_df[['Product', 'Price']])

# Selecting rows by condition
print("\nProducts under $100:")
affordable = products_df[products_df['Price'] < 100]
print(affordable)

# Multiple conditions
print("\nElectronics with stock > 20:")
electronics_available = products_df[
    (products_df['Category'] == 'Electronics') & 
    (products_df['Stock'] > 20)
]
print(electronics_available)
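
The conditions above work because a comparison on a column produces a boolean Series (a mask) with one True/False per row. The sketch below shows the mask itself plus the OR (`|`), NOT (`~`), and `isin()` variants. It uses a stand-in DataFrame named `items` with the same columns as `products_df`; the variable names are illustrative.

```python
import pandas as pd

# Stand-in for products_df (same columns and values as the example above)
items = pd.DataFrame({
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],
    'Price': [1200, 25, 80, 350, 120],
    'Stock': [15, 102, 45, 28, 33],
    'Category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Accessories']
})

# A comparison yields a boolean Series: one True/False per row
is_cheap = items['Price'] < 100
print(is_cheap.tolist())

# | means OR and ~ means NOT; wrap each comparison in parentheses,
# because & and | bind more tightly than < and == in Python
cheap_or_scarce = items[(items['Price'] < 100) | (items['Stock'] < 20)]
not_electronics = items[~(items['Category'] == 'Electronics')]

# isin() tests membership in a list of values
picked = items[items['Product'].isin(['Mouse', 'Webcam'])]

print(cheap_or_scarce['Product'].tolist())
print(not_electronics['Product'].tolist())
print(picked['Product'].tolist())
```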

## 3. loc vs iloc - Precise Selection

In [None]:
# Create sample data
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [5, 6, 7, 8],
    'C': [9, 10, 11, 12]
}, index=['row1', 'row2', 'row3', 'row4'])

print("Original DataFrame:")
print(df)

# loc: label-based selection
print("\nUsing loc (labels):")
print("df.loc['row2', 'B']:", df.loc['row2', 'B'])
print("\ndf.loc['row1':'row3', 'A':'B']:")
print(df.loc['row1':'row3', 'A':'B'])

# iloc: position-based selection
print("\nUsing iloc (positions):")
print("df.iloc[1, 1]:", df.iloc[1, 1])
print("\ndf.iloc[0:3, 0:2]:")
print(df.iloc[0:3, 0:2])
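
One detail in the slices above is easy to miss: `loc` slices include the end label, while `iloc` slices stop before the end position, like ordinary Python slicing. The sketch below checks this on a small frame (`demo` and `nums` are illustrative names), including the default integer index, where the two are most often confused.

```python
import pandas as pd

demo = pd.DataFrame(
    {'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]},
    index=['row1', 'row2', 'row3', 'row4']
)

by_label = demo.loc['row1':'row3']   # end label 'row3' is included
by_position = demo.iloc[0:3]         # stops before position 3
print(by_label.equals(by_position))  # both select the first three rows

# With a default integer index the same-looking slice behaves differently
nums = demo.reset_index(drop=True)
print(len(nums.loc[0:2]))   # labels 0, 1, 2
print(len(nums.iloc[0:2]))  # positions 0, 1
```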

**Exercise 2 — Data Extraction (medium)**  
From products_df: 1) Get prices of first 3 products, 2) Get all info for products with stock < 40.


In [None]:
# Your turn


<details>
<summary><b>Solution</b></summary>

```python
# 1) Prices of first 3 products
first_three_prices = products_df.iloc[:3]['Price']
# or products_df.loc[:2, 'Price'] with the default index (loc slices include the end label)
print("First 3 prices:")
print(first_three_prices)

# 2) Products with stock < 40
low_stock = products_df[products_df['Stock'] < 40]
print("\nLow stock products:")
print(low_stock)
```
</details>

## 4. Adding and Modifying Columns

In [None]:
# Calculate new columns
products_df['Total_Value'] = products_df['Price'] * products_df['Stock']
products_df['Needs_Restock'] = products_df['Stock'] < 30

print("Enhanced products DataFrame:")
print(products_df)

# Add a tax-inclusive price column (1.1 multiplier), then overwrite it with rounded values
products_df['Price_After_Tax'] = products_df['Price'] * 1.1
products_df['Price_After_Tax'] = products_df['Price_After_Tax'].round(2)

# Using apply for complex operations
def categorize_price(price):
    if price < 50: return 'Budget'
    elif price < 200: return 'Mid-range'
    else: return 'Premium'

products_df['Price_Category'] = products_df['Price'].apply(categorize_price)
print("\nWith price categories:")
print(products_df[['Product', 'Price', 'Price_Category']])
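
For simple numeric bands like these, `pd.cut` is a possible alternative to `apply`. The sketch below reproduces the same three bands as `categorize_price` on a standalone Series; the variable names are illustrative, and `right=False` is chosen so that the bin edges match the `<` comparisons in the function.

```python
import numpy as np
import pandas as pd

prices = pd.Series([1200, 25, 80, 350, 120], name='Price')

# right=False makes each bin closed on the left:
# [-inf, 50) -> Budget, [50, 200) -> Mid-range, [200, inf) -> Premium
price_category = pd.cut(
    prices,
    bins=[-np.inf, 50, 200, np.inf],
    labels=['Budget', 'Mid-range', 'Premium'],
    right=False
)
print(price_category.tolist())
```

The result is a categorical column, which keeps the band order (Budget before Mid-range before Premium) available for later sorting.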

## 5. Basic DataFrame Operations

In [None]:
# Summary statistics
print("Numeric columns summary:")
print(products_df.describe())

# Info about DataFrame
# info() prints its report directly and returns None, so it is not wrapped in print()
print("\nDataFrame info:")
products_df.info()

# Sorting
print("\nSorted by price (descending):")
sorted_df = products_df.sort_values('Price', ascending=False)
print(sorted_df[['Product', 'Price']])

# Value counts
print("\nCategory distribution:")
print(products_df['Category'].value_counts())
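
Two small variations on the operations above are often useful: sorting by more than one column, and asking `value_counts` for proportions instead of raw counts. This is a sketch on a stand-in frame named `inventory` (an illustrative name) with the same products as before.

```python
import pandas as pd

inventory = pd.DataFrame({
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],
    'Price': [1200, 25, 80, 350, 120],
    'Category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Accessories']
})

# Sort by category A-Z, then by price high-to-low within each category
ordered = inventory.sort_values(['Category', 'Price'], ascending=[True, False])
print(ordered)

# normalize=True returns each category's share of the rows instead of a count
shares = inventory['Category'].value_counts(normalize=True)
print(shares)
```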

**Exercise 3 — Employee Analysis (hard)**  
Create an employee DataFrame with: name, department, salary, years_experience.
Add columns for: salary_level (Low/Mid/High), bonus (10% of salary if years_experience > 3, otherwise 0).
Find average salary by department.


In [None]:
# Your turn


<details>
<summary><b>Solution</b></summary>

```python
employees = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'department': ['IT', 'Sales', 'IT', 'HR', 'Sales'],
    'salary': [75000, 65000, 82000, 58000, 71000],
    'years_experience': [5, 2, 7, 3, 4]
})

# Add salary level
def salary_level(sal):
    if sal < 60000: return 'Low'
    elif sal < 75000: return 'Mid'
    else: return 'High'

employees['salary_level'] = employees['salary'].apply(salary_level)

# Add bonus
employees['bonus'] = employees.apply(
    lambda row: row['salary'] * 0.1 if row['years_experience'] > 3 else 0, 
    axis=1
)

print("Employee data:")
print(employees)

# Average salary by department
print("\nAverage salary by department:")
print(employees.groupby('department')['salary'].mean())
```
</details>
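
The bonus rule in the solution can also be written without a row-by-row `apply`. The sketch below uses `np.where`, which evaluates the condition on whole columns at once; `staff` is an illustrative stand-in containing only the two columns the rule needs.

```python
import numpy as np
import pandas as pd

staff = pd.DataFrame({
    'salary': [75000, 65000, 82000, 58000, 71000],
    'years_experience': [5, 2, 7, 3, 4]
})

# np.where(condition, value_if_true, value_if_false), applied column-wise
staff['bonus'] = np.where(staff['years_experience'] > 3, staff['salary'] * 0.1, 0)
print(staff)
```

On a five-row table the difference is negligible, but column-wise operations generally scale better than `apply(..., axis=1)` as the row count grows.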

## 6. Handling Missing Data in DataFrames

In [None]:
# Create DataFrame with missing values
data_with_gaps = pd.DataFrame({
    'Date': pd.date_range('2025-08-19', periods=5),
    'Sales': [1200, None, 1450, 1380, None],
    'Visitors': [120, 115, None, 135, 142],
    'Conversion': [0.10, 0.09, 0.11, None, 0.08]
})

print("Data with missing values:")
print(data_with_gaps)
print("\nMissing values per column:")
print(data_with_gaps.isnull().sum())

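# Different filling strategies
# Assign the result back: calling fillna(..., inplace=True) on a selected column is
# deprecated chained assignment in recent pandas, and fillna(method='ffill') is deprecated in favor of ffill()
filled_df = data_with_gaps.copy()
filled_df['Sales'] = filled_df['Sales'].fillna(filled_df['Sales'].mean())
filled_df['Visitors'] = filled_df['Visitors'].ffill()  # forward fill
filled_df['Conversion'] = filled_df['Conversion'].fillna(filled_df['Conversion'].median())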

print("\nAfter filling:")
print(filled_df)
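
Filling is not the only option. The sketch below shows two related calls on a stand-in frame named `gaps` (an illustrative name): `fillna` with a dict that gives one fill value per column, and `dropna` to remove incomplete rows instead of filling them.

```python
import numpy as np
import pandas as pd

gaps = pd.DataFrame({
    'Sales': [1200, np.nan, 1450, 1380, np.nan],
    'Visitors': [120, 115, np.nan, 135, 142]
})

# One fill value per column, passed as a dict; returns a new DataFrame
filled = gaps.fillna({
    'Sales': gaps['Sales'].mean(),
    'Visitors': gaps['Visitors'].median()
})
print(filled)

# Drop rows that have a missing value in any column
complete_rows = gaps.dropna()

# Drop rows only when a specific column is missing
has_sales = gaps.dropna(subset=['Sales'])
print(len(complete_rows), len(has_sales))
```

Which strategy is appropriate depends on the data: dropping rows loses information, while filling introduces values that were never observed.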

## 7. Mini-Challenges
- **M1 (easy):** Create a DataFrame of your favorite movies with title, year, rating
- **M2 (medium):** Add a 'decade' column and find average rating per decade
- **M3 (hard):** Create a function that highlights rows where rating > average

In [None]:
# Your turn - try the challenges!


<details>
<summary><b>Solutions</b></summary>

```python
# M1 & M2
movies = pd.DataFrame({
    'title': ['Inception', 'Matrix', 'Interstellar', 'Arrival', 'Blade Runner'],
    'year': [2010, 1999, 2014, 2016, 1982],
    'rating': [8.8, 8.7, 8.6, 7.9, 8.1]
})

# Add decade
movies['decade'] = (movies['year'] // 10) * 10
print("Movies with decades:")
print(movies)

# Average per decade
print("\nAverage rating per decade:")
print(movies.groupby('decade')['rating'].mean())

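# M3: "highlight" is interpreted here as flagging rows above the average rating
def flag_above_average(df, column='rating'):
    """Return a copy of df with a boolean 'above_avg' column."""
    flagged = df.copy()
    flagged['above_avg'] = flagged[column] > flagged[column].mean()
    return flagged

flagged_movies = flag_above_average(movies)
avg_rating = movies['rating'].mean()
print(f"\nMovies above average ({avg_rating:.1f}):")
print(flagged_movies[flagged_movies['above_avg']][['title', 'rating']])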
```
</details>

## Wrap-Up
✅ You can create DataFrames from various sources  
✅ You mastered selection with [], loc, and iloc  
✅ You can add columns and handle missing data  

**Next:** Data wrangling - merging, grouping, and reshaping!
