# **AI TECH INSTITUTE** · *Intermediate AI & Data Science*
### Week 01 · Notebook 02 — DataFrames Fundamentals
**Instructor:** Amir Charkhi  |  **Goal:** Master DataFrame creation, selection, and basic operations.

> Format: build on Series knowledge → explore 2D data → practice real operations.


---
## From Series to DataFrames
Think of a DataFrame as a collection of Series that share the same index, like the columns of a spreadsheet.

In [1]:
import pandas as pd
import numpy as np

# Remember Week 0: list of dictionaries?
students_list = [
    {'name': 'Alice', 'age': 22, 'grade': 85},
    {'name': 'Bob', 'age': 24, 'grade': 78},
    {'name': 'Charlie', 'age': 23, 'grade': 92}
]

# Now as a DataFrame!
df = pd.DataFrame(students_list)
print("Our first DataFrame:")
print(df)
print(f"\nShape: {df.shape} (rows, columns)")
print(f"Columns: {df.columns.tolist()}")
print(f"Data types:\n{df.dtypes}")

Our first DataFrame:
      name  age  grade
0    Alice   22     85
1      Bob   24     78
2  Charlie   23     92

Shape: (3, 3) (rows, columns)
Columns: ['name', 'age', 'grade']
Data types:
name     object
age       int64
grade     int64
dtype: object
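
To make the "collection of Series" idea concrete, here is a minimal sketch that rebuilds the same student data. The names `students`, `grade_col`, and `rebuilt` are illustrative and not part of the notebook.

```python
import pandas as pd

students = pd.DataFrame([
    {'name': 'Alice', 'age': 22, 'grade': 85},
    {'name': 'Bob', 'age': 24, 'grade': 78},
    {'name': 'Charlie', 'age': 23, 'grade': 92},
])

# Selecting one column gives back a Series...
grade_col = students['grade']
print(type(grade_col).__name__)

# ...and that Series carries the same row index as the DataFrame.
print(grade_col.index.equals(students.index))

# Going the other way: a dict of Series with a shared index forms a DataFrame.
rebuilt = pd.DataFrame({col: students[col] for col in students.columns})
print(rebuilt.equals(students))
```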


## 1. Creating DataFrames - Multiple Ways

In [2]:
# Method 1: From dictionary of lists
data_dict = {
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],
    'Price': [1200, 25, 80, 350, 120],
    'Stock': [15, 102, 45, 28, 33],
    'Category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Accessories']
}

products_df = pd.DataFrame(data_dict)
print("Products DataFrame:")
print(products_df)

# Method 2: From NumPy array with custom columns and index
data_array = np.array([[25, 180, 68], [28, 175, 72], [22, 165, 58]])
people_df = pd.DataFrame(
    data_array,
    columns=['Age', 'Height_cm', 'Weight_kg'],
    index=['Person_1', 'Person_2', 'Person_3']
)
print("\nPeople DataFrame:")
print(people_df)

Products DataFrame:
    Product  Price  Stock     Category
0    Laptop   1200     15  Electronics
1     Mouse     25    102  Accessories
2  Keyboard     80     45  Accessories
3   Monitor    350     28  Electronics
4    Webcam    120     33  Accessories

People DataFrame:
          Age  Height_cm  Weight_kg
Person_1   25        180         68
Person_2   28        175         72
Person_3   22        165         58
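
One detail of Method 2 that is easy to miss: a NumPy array holds a single dtype, so every column built from it starts out with that dtype. The sketch below uses made-up variable names (`int_df`, `mixed_df`) and a shortened version of the people data to show the effect.

```python
import numpy as np
import pandas as pd

cols = ['Age', 'Height_cm', 'Weight_kg']

# All integers: every column shares one integer dtype.
int_array = np.array([[25, 180, 68], [28, 175, 72]])
int_df = pd.DataFrame(int_array, columns=cols)
print(int_df.dtypes)

# One float anywhere in the array makes NumPy store every value as float,
# so every DataFrame column comes out as float64.
mixed_array = np.array([[25, 180.5, 68], [28, 175, 72]])
mixed_df = pd.DataFrame(mixed_array, columns=cols)
print(mixed_df.dtypes)
```

When columns need different types, the dictionary-of-lists approach from Method 1 is usually the simpler choice, since each column gets its own dtype.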


**Exercise 1 — Sales Dashboard (easy)**  
Create a DataFrame with 5 days of sales data: date, revenue, customers, avg_order_value.


In [6]:
# Your turn

sales_days = pd.date_range('2025-08-01', periods = 5)
revenue = np.random.uniform(50,300,5).round(2)
customer = np.random.choice(['A','B', 'C'],5)
avg_order_value = np.random.randint(5,50,5)

sales_df = pd.DataFrame({
    'sales_days' : sales_days,
    'revenue' : revenue,
    'customer' : customer,
    'avg_order_value' : avg_order_value
})
print(sales_df)
print()
print(sales_df.info())

  sales_days  revenue customer  avg_order_value
0 2025-08-01   241.05        A               37
1 2025-08-02   100.50        C               18
2 2025-08-03   293.26        B               33
3 2025-08-04   226.08        A               22
4 2025-08-05   157.64        A                7

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   sales_days       5 non-null      datetime64[ns]
 1   revenue          5 non-null      float64       
 2   customer         5 non-null      object        
 3   avg_order_value  5 non-null      int32         
dtypes: datetime64[ns](1), float64(1), int32(1), object(1)
memory usage: 272.0+ bytes
None


<details>
<summary><b>Solution</b></summary>

```python
sales_data = {
    'date': ['2025-08-19', '2025-08-20', '2025-08-21', '2025-08-22', '2025-08-23'],
    'revenue': [5420, 6100, 4850, 7200, 8900],
    'customers': [45, 52, 41, 58, 72],
    'avg_order_value': [120.44, 117.31, 118.29, 124.14, 123.61]
}
sales_df = pd.DataFrame(sales_data)
print(sales_df)
print(f"\nTotal revenue: ${sales_df['revenue'].sum():,.2f}")
```
</details>
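
In the solution above the dates are plain strings. A minimal sketch of converting them to real datetimes with `pd.to_datetime`, using a shortened copy of the solution data (the variable name `sales` is an assumption for this example):

```python
import pandas as pd

sales = pd.DataFrame({
    'date': ['2025-08-19', '2025-08-20', '2025-08-21'],
    'revenue': [5420, 6100, 4850],
})
print(sales['date'].dtype)  # text, not yet a datetime type

# Convert the column and assign it back.
sales['date'] = pd.to_datetime(sales['date'])
print(sales['date'].dtype)

# Datetime columns unlock the .dt accessor, e.g. weekday names.
print(sales['date'].dt.day_name().tolist())
```

Using `pd.date_range`, as in the worked attempt, produces datetime values directly and skips the conversion step.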

## 2. Selecting Data - Columns and Rows

In [7]:
# Using our products DataFrame
print("Select single column (returns Series):")
print(products_df['Price'])

print("\nSelect multiple columns (returns DataFrame):")
print(products_df[['Product', 'Price']])

# Selecting rows by condition
print("\nProducts under $100:")
affordable = products_df[products_df['Price'] < 100]
print(affordable)

# Multiple conditions
print("\nElectronics with stock > 20:")
electronics_available = products_df[
    (products_df['Category'] == 'Electronics') & 
    (products_df['Stock'] > 20)
]
print(electronics_available)

Select single column (returns Series):
0    1200
1      25
2      80
3     350
4     120
Name: Price, dtype: int64

Select multiple columns (returns DataFrame):
    Product  Price
0    Laptop   1200
1     Mouse     25
2  Keyboard     80
3   Monitor    350
4    Webcam    120

Products under $100:
    Product  Price  Stock     Category
1     Mouse     25    102  Accessories
2  Keyboard     80     45  Accessories

Electronics with stock > 20:
   Product  Price  Stock     Category
3  Monitor    350     28  Electronics
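
The filters above work because each comparison produces a boolean Series that pandas uses as a row mask. The sketch below rebuilds the product data under the name `products` (an illustrative name) and adds the OR (`|`) and NOT (`~`) operators alongside the `&` already shown.

```python
import pandas as pd

products = pd.DataFrame({
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],
    'Price': [1200, 25, 80, 350, 120],
    'Stock': [15, 102, 45, 28, 33],
    'Category': ['Electronics', 'Accessories', 'Accessories',
                 'Electronics', 'Accessories'],
})

# A condition is just a boolean Series, one True/False per row.
is_cheap = products['Price'] < 100
print(is_cheap.tolist())

# | means OR, ~ means NOT. Each comparison needs its own parentheses
# because & and | bind more tightly than < and ==.
cheap_or_scarce = products[(products['Price'] < 100) | (products['Stock'] < 20)]
not_accessories = products[~(products['Category'] == 'Accessories')]

print(cheap_or_scarce['Product'].tolist())
print(not_accessories['Product'].tolist())
```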


## 3. loc vs iloc - Precise Selection

In [8]:
# Create sample data
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [5, 6, 7, 8],
    'C': [9, 10, 11, 12]
}, index=['row1', 'row2', 'row3', 'row4'])

print("Original DataFrame:")
print(df)

# loc: label-based selection
print("\nUsing loc (labels):")
print("df.loc['row2', 'B']:", df.loc['row2', 'B'])
print("\ndf.loc['row1':'row3', 'A':'B']:")
print(df.loc['row1':'row3', 'A':'B'])

# iloc: position-based selection
print("\nUsing iloc (positions):")
print("df.iloc[1, 1]:", df.iloc[1, 1])
print("\ndf.iloc[0:3, 0:2]:")
print(df.iloc[0:3, 0:2])

Original DataFrame:
      A  B   C
row1  1  5   9
row2  2  6  10
row3  3  7  11
row4  4  8  12

Using loc (labels):
df.loc['row2', 'B']: 6

df.loc['row1':'row3', 'A':'B']:
      A  B
row1  1  5
row2  2  6
row3  3  7

Using iloc (positions):
df.iloc[1, 1]: 6

df.iloc[0:3, 0:2]:
      A  B
row1  1  5
row2  2  6
row3  3  7
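
Notice that `df.loc['row1':'row3']` and `df.iloc[0:3]` returned the same rows even though the slices look different: `loc` includes the end label, while `iloc` stops before the end position. A short sketch of that difference, using an illustrative frame named `demo`:

```python
import pandas as pd

demo = pd.DataFrame(
    {'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]},
    index=['row1', 'row2', 'row3', 'row4'],
)

# loc slices by label and includes the end label.
by_label = demo.loc['row1':'row3']
# iloc slices by position and stops before the end, like a Python list.
by_position = demo.iloc[0:2]
print(len(by_label), len(by_position))

# With the default 0, 1, 2, ... index, similar-looking slices differ in length.
numbered = demo.reset_index(drop=True)
print(numbered.loc[0:2, 'A'].tolist())
print(numbered.iloc[0:2]['A'].tolist())
```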


**Exercise 2 — Data Extraction (medium)**  
From products_df: 1) Get prices of first 3 products, 2) Get all info for products with stock < 40.


In [9]:
# Your turn
products_df

    Product  Price  Stock     Category
0    Laptop   1200     15  Electronics
1     Mouse     25    102  Accessories
2  Keyboard     80     45  Accessories
3   Monitor    350     28  Electronics
4    Webcam    120     33  Accessories


In [12]:
#1) prices of first 3 products
print(f"prices of first 3 products: \n{products_df.iloc[0:3,1]}")

prices of first 3 products: 
0    1200
1      25
2      80
Name: Price, dtype: int64


In [14]:
#2) Get all info for products with stock < 40
print(f"All info for products with stock < 40: \n{products_df[products_df['Stock']<40]}")

All info for products with stock < 40: 
   Product  Price  Stock     Category
0   Laptop   1200     15  Electronics
3  Monitor    350     28  Electronics
4   Webcam    120     33  Accessories


<details>
<summary><b>Solution</b></summary>

```python
# 1) Prices of first 3 products
first_three_prices = products_df.iloc[:3]['Price']
# or products_df.loc[:2, 'Price'] if using default index
print("First 3 prices:")
print(first_three_prices)

# 2) Products with stock < 40
low_stock = products_df[products_df['Stock'] < 40]
print("\nLow stock products:")
print(low_stock)
```
</details>

## 4. Adding and Modifying Columns

In [15]:
# Calculate new columns
products_df['Total_Value'] = products_df['Price'] * products_df['Stock']
products_df['Needs_Restock'] = products_df['Stock'] < 30

print("Enhanced products DataFrame:")
print(products_df)

# Add a derived column, then overwrite it with rounded values
products_df['Price_After_Tax'] = products_df['Price'] * 1.1
products_df['Price_After_Tax'] = products_df['Price_After_Tax'].round(2)

# Using apply for complex operations
def categorize_price(price):
    if price < 50: return 'Budget'
    elif price < 200: return 'Mid-range'
    else: return 'Premium'

products_df['Price_Category'] = products_df['Price'].apply(categorize_price)
print("\nWith price categories:")
print(products_df[['Product', 'Price', 'Price_Category']])

Enhanced products DataFrame:
    Product  Price  Stock     Category  Total_Value  Needs_Restock
0    Laptop   1200     15  Electronics        18000           True
1     Mouse     25    102  Accessories         2550          False
2  Keyboard     80     45  Accessories         3600          False
3   Monitor    350     28  Electronics         9800           True
4    Webcam    120     33  Accessories         3960          False

With price categories:
    Product  Price Price_Category
0    Laptop   1200        Premium
1     Mouse     25         Budget
2  Keyboard     80      Mid-range
3   Monitor    350        Premium
4    Webcam    120      Mid-range
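
`apply` with a Python function is easy to read, but binning numbers into labelled ranges can also be done without a row-by-row function. The following is one possible alternative using `pd.cut`, not the notebook's method. The bin edges mirror the thresholds in `categorize_price`, and `right=False` makes each bin include its lower edge, matching the `<` comparisons.

```python
import numpy as np
import pandas as pd

prices = pd.Series([1200, 25, 80, 350, 120], name='Price')

# Bins are [-inf, 50), [50, 200), [200, inf) because right=False.
price_category = pd.cut(
    prices,
    bins=[-np.inf, 50, 200, np.inf],
    labels=['Budget', 'Mid-range', 'Premium'],
    right=False,
)
print(price_category.tolist())
```

The result has a categorical dtype rather than plain strings, which keeps the labels ordered from Budget to Premium.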


## 5. Basic DataFrame Operations

In [16]:
# Summary statistics
print("Numeric columns summary:")
print(products_df.describe())

# Info about DataFrame
print("\nDataFrame info:")
print(products_df.info())

# Sorting
print("\nSorted by price (descending):")
sorted_df = products_df.sort_values('Price', ascending=False)
print(sorted_df[['Product', 'Price']])

# Value counts
print("\nCategory distribution:")
print(products_df['Category'].value_counts())

Numeric columns summary:
             Price       Stock   Total_Value  Price_After_Tax
count     5.000000    5.000000      5.000000         5.000000
mean    355.000000   44.600000   7582.000000       390.500000
std     488.313424   33.842281   6475.926189       537.144766
min      25.000000   15.000000   2550.000000        27.500000
25%      80.000000   28.000000   3600.000000        88.000000
50%     120.000000   33.000000   3960.000000       132.000000
75%     350.000000   45.000000   9800.000000       385.000000
max    1200.000000  102.000000  18000.000000      1320.000000

DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Product          5 non-null      object 
 1   Price            5 non-null      int64  
 2   Stock            5 non-null      int64  
 3   Category         5 non-null      object 
 4   Total_Value      5 
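
The sorting and counting calls from the cell above can be tried on a smaller copy of the product data. This sketch uses an illustrative frame named `products` and adds `nlargest`, which was not used in the notebook, as a shortcut for "sort descending and take the top rows".

```python
import pandas as pd

products = pd.DataFrame({
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],
    'Price': [1200, 25, 80, 350, 120],
    'Category': ['Electronics', 'Accessories', 'Accessories',
                 'Electronics', 'Accessories'],
})

# sort_values returns a new, sorted DataFrame; the original order is kept.
by_price = products.sort_values('Price', ascending=False)
print(by_price['Product'].tolist())

# nlargest picks the top rows by a column in one step.
top_two = products.nlargest(2, 'Price')
print(top_two['Product'].tolist())

# value_counts tallies how often each category appears.
counts = products['Category'].value_counts()
print(counts.to_dict())
```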

**Exercise 3 — Employee Analysis (hard)**  
Create an employee DataFrame with: name, department, salary, years_experience.
Add columns for: salary_level (Low/Mid/High), bonus (10% of salary if years > 3).
Find average salary by department.


In [17]:
# Your turn
name = ['A','B','C','D','E']
department = ['RD','IT','Sales','RD','IT']
salary = np.random.randint(70000,170000,5)
years_experience = np.random.randint(1,10,5)
employee_df = pd.DataFrame({
    'name':name,
    'department':department,
    'salary':salary,
    'years_experience':years_experience
})
print(employee_df)

  name department  salary  years_experience
0    A         RD   79295                 1
1    B         IT  121501                 6
2    C      Sales  133884                 4
3    D         RD  132149                 8
4    E         IT   75235                 2


In [18]:
def sal_level(salary):
    if salary<100000: return 'Low'
    elif salary<130000: return 'Mid'
    else: return 'High'

employee_df['salary_level']=employee_df['salary'].apply(sal_level)
print(employee_df)

  name department  salary  years_experience salary_level
0    A         RD   79295                 1          Low
1    B         IT  121501                 6          Mid
2    C      Sales  133884                 4         High
3    D         RD  132149                 8         High
4    E         IT   75235                 2          Low


In [25]:
def bonus_calc(row):
    if row['years_experience']>3: return row['salary']*0.1
    else: return 0
employee_df['bonus']=employee_df.apply(bonus_calc, axis=1)
print(employee_df)

  name department  salary  years_experience salary_level    bonus
0    A         RD   79295                 1          Low      0.0
1    B         IT  121501                 6          Mid  12150.1
2    C      Sales  133884                 4         High  13388.4
3    D         RD  132149                 8         High  13214.9
4    E         IT   75235                 2          Low      0.0


<details>
<summary><b>Solution</b></summary>

```python
employees = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'department': ['IT', 'Sales', 'IT', 'HR', 'Sales'],
    'salary': [75000, 65000, 82000, 58000, 71000],
    'years_experience': [5, 2, 7, 3, 4]
})

# Add salary level
def salary_level(sal):
    if sal < 60000: return 'Low'
    elif sal < 75000: return 'Mid'
    else: return 'High'

employees['salary_level'] = employees['salary'].apply(salary_level)

# Add bonus
employees['bonus'] = employees.apply(
    lambda row: row['salary'] * 0.1 if row['years_experience'] > 3 else 0, 
    axis=1
)

print("Employee data:")
print(employees)

# Average salary by department
print("\nAverage salary by department:")
print(employees.groupby('department')['salary'].mean())
```
</details>
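
Row-wise `apply(..., axis=1)` works for the bonus rule, but a simple if/else on whole columns can also be written with `np.where`. This is a sketch of that alternative on the solution's data, followed by the department average that the exercise asks for. The frame name `staff` is an assumption for this example.

```python
import numpy as np
import pandas as pd

staff = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'department': ['IT', 'Sales', 'IT', 'HR', 'Sales'],
    'salary': [75000, 65000, 82000, 58000, 71000],
    'years_experience': [5, 2, 7, 3, 4],
})

# np.where(condition, value_if_true, value_if_false) works on whole columns.
staff['bonus'] = np.where(staff['years_experience'] > 3,
                          staff['salary'] * 0.1,
                          0)
print(staff[['name', 'years_experience', 'bonus']])

# Average salary per department: group the rows, then average one column.
avg_by_dept = staff.groupby('department')['salary'].mean()
print(avg_by_dept)
```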

## 6. Handling Missing Data in DataFrames

In [28]:
# Create DataFrame with missing values
data_with_gaps = pd.DataFrame({
    'Date': pd.date_range('2025-08-19', periods=5),
    'Sales': [1200, None, 1450, 1380, None],
    'Visitors': [120, 115, None, 135, 142],
    'Conversion': [0.10, 0.09, 0.11, None, 0.08]
})

print("Data with missing values:")
print(data_with_gaps)
print(f"\nMissing values per column:")
print(data_with_gaps.isnull().sum())

# Different filling strategies
filled_df = data_with_gaps.copy()

# Assign each result back to its column: calling fillna(..., inplace=True) on a
# column selection modifies a temporary copy and is deprecated in recent pandas.
filled_df['Sales'] = filled_df['Sales'].fillna(filled_df['Sales'].mean())
filled_df['Visitors'] = filled_df['Visitors'].ffill()  # forward fill
filled_df['Conversion'] = filled_df['Conversion'].fillna(filled_df['Conversion'].median())

print("\nAfter filling:")
print(filled_df)

Data with missing values:
        Date   Sales  Visitors  Conversion
0 2025-08-19  1200.0     120.0        0.10
1 2025-08-20     NaN     115.0        0.09
2 2025-08-21  1450.0       NaN        0.11
3 2025-08-22  1380.0     135.0         NaN
4 2025-08-23     NaN     142.0        0.08

Missing values per column:
Date          0
Sales         2
Visitors      1
Conversion    1
dtype: int64

After filling:
        Date        Sales  Visitors  Conversion
0 2025-08-19  1200.000000     120.0       0.100
1 2025-08-20  1343.333333     115.0       0.090
2 2025-08-21  1450.000000     115.0       0.110
3 2025-08-22  1380.000000     135.0       0.095
4 2025-08-23  1343.333333     142.0       0.080
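
Filling is not the only option for gaps. The sketch below contrasts two alternatives that this notebook does not use, dropping incomplete rows and linear interpolation, on the same kind of data. The frame name `gaps` is illustrative.

```python
import pandas as pd

gaps = pd.DataFrame({
    'Sales': [1200, None, 1450, 1380, None],
    'Visitors': [120, 115, None, 135, 142],
    'Conversion': [0.10, 0.09, 0.11, None, 0.08],
})

# dropna() removes every row with any missing value, which can be costly.
complete_rows = gaps.dropna()
print(len(complete_rows), "of", len(gaps), "rows are complete")

# Restricting the check to one column keeps more data.
has_sales = gaps.dropna(subset=['Sales'])
print(len(has_sales), "rows have a Sales value")

# Linear interpolation fills a gap from its neighbours; a trailing gap
# is filled with the last known value.
sales_interp = gaps['Sales'].interpolate()
print(sales_interp.tolist())
```

Which strategy fits depends on the data: dropping rows is safest when gaps are rare, while interpolation assumes values change smoothly between observations.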


## 7. Mini-Challenges
- **M1 (easy):** Create a DataFrame of your favorite movies with title, year, rating
- **M2 (medium):** Add a 'decade' column and find average rating per decade
- **M3 (hard):** Create a function that highlights rows where rating > average

In [56]:
movies = pd.DataFrame({
    'title': ['Inception', 'Matrix', 'Interstellar', 'Arrival', 'Blade Runner'],
    'year': [2010, 1999, 2014, 2016, 1982],
    'rating': [8.8, 8.7, 8.6, 7.9, 8.1]
})

decade=(np.floor(movies['year']/10)*10).astype(int)
movies['decade'] = decade
movies['avg_rating_by_decade'] = movies.groupby('decade')['rating'].transform('mean').round(1)
print(f"Movies table with added average rating for each decade: \n{movies}")

above_average = movies[movies['rating']>movies['avg_rating_by_decade']]
print(f"\nMovies with ratings above average within the decade: \n{above_average}")

Movies table with added average rating for each decade: 
          title  year  rating  decade  avg_rating_by_decade
0     Inception  2010     8.8    2010                   8.4
1        Matrix  1999     8.7    1990                   8.7
2  Interstellar  2014     8.6    2010                   8.4
3       Arrival  2016     7.9    2010                   8.4
4  Blade Runner  1982     8.1    1980                   8.1

Movies with ratings above average within the decade: 
          title  year  rating  decade  avg_rating_by_decade
0     Inception  2010     8.8    2010                   8.4
2  Interstellar  2014     8.6    2010                   8.4


<details>
<summary><b>Solutions</b></summary>

```python
# M1 & M2
movies = pd.DataFrame({
    'title': ['Inception', 'Matrix', 'Interstellar', 'Arrival', 'Blade Runner'],
    'year': [2010, 1999, 2014, 2016, 1982],
    'rating': [8.8, 8.7, 8.6, 7.9, 8.1]
})

# Add decade
movies['decade'] = (movies['year'] // 10) * 10
print("Movies with decades:")
print(movies)

# Average per decade
print("\nAverage rating per decade:")
print(movies.groupby('decade')['rating'].mean())

# M3
avg_rating = movies['rating'].mean()
above_avg = movies[movies['rating'] > avg_rating]
print(f"\nMovies above average ({avg_rating:.1f}):")
print(above_avg[['title', 'rating']])
```
</details>
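
M3 asks for a function, while the solutions above filter inline. One possible way to wrap the idea in a reusable function is sketched here. The function name `flag_above_average` and the column name `above_avg` are hypothetical choices for this example, and "highlight" is interpreted as adding a boolean flag column.

```python
import pandas as pd

def flag_above_average(frame, column):
    """Return a copy of frame with a boolean column marking rows above the mean."""
    flagged = frame.copy()  # leave the caller's DataFrame untouched
    flagged['above_avg'] = flagged[column] > flagged[column].mean()
    return flagged

movies = pd.DataFrame({
    'title': ['Inception', 'Matrix', 'Interstellar', 'Arrival', 'Blade Runner'],
    'year': [2010, 1999, 2014, 2016, 1982],
    'rating': [8.8, 8.7, 8.6, 7.9, 8.1],
})

flagged = flag_above_average(movies, 'rating')
print(flagged[['title', 'rating', 'above_avg']])
```

This compares each rating with the overall mean. Comparing within each decade instead, as in the worked attempt, would replace the mean with `groupby('decade')[column].transform('mean')`.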

## Wrap-Up
✅ You can create DataFrames from various sources  
✅ You mastered selection with [], loc, and iloc  
✅ You can add columns and handle missing data  

**Next:** Data wrangling - merging, grouping, and reshaping!
