# DS-Tutor Getting Started

Welcome to DS-Tutor! This notebook will guide you through your first learning session.

## What is DS-Tutor?

DS-Tutor is an AI-powered learning system that teaches you Data Science right here in Jupyter notebooks.

**Features:**
- 📚 Progressive curriculum from beginner to advanced
- 🤖 AI-powered hints and feedback
- ✅ Automatic code validation
- 📊 Progress tracking
- 🎯 Hands-on exercises with real code

## Step 1: Load the Extension

Run the cell below to load DS-Tutor:

In [2]:
%cd ..

/home/preslaff/dstutor


In [7]:
%reload_ext dstutor

You should see a colorful welcome message! 🎉

## Step 2: Initialize DS-Tutor

Initialize your learning environment:

In [8]:
%dstutor init

## Step 3: Explore Available Topics

Check your current configuration first, then list the topics you can learn:

In [9]:
%dstutor config


| Setting | Value |
|---|---|
| auto_validate | True |
| hint_style | progressive |
| feedback_verbosity | normal |
| difficulty | medium |
| user_id | default |
| llm_enabled | True |
| current_topic | |
| current_lesson | |


In [10]:
%dstutor topics

## Step 4: Open a Lesson

Use `goto` with a lesson ID to jump straight to a specific lesson. The cell below opens the Pandas GroupBy lesson (`pandas_05`):

In [12]:
%dstutor goto pandas_05

# GroupBy - Split, Apply, Combine

One of Pandas' most powerful features - group data by categories and compute
statistics. Essential for data analysis!

**What you'll learn:**
- The split-apply-combine pattern
- Group by single and multiple columns
- Common aggregations (mean, sum, count, etc.)
- Custom aggregations


## The GroupBy Pattern

**Split-Apply-Combine:**
1. **Split** - Divide data into groups based on criteria
2. **Apply** - Compute a function on each group
3. **Combine** - Merge results into a new structure

**Basic Syntax:**
```python
df.groupby('column').agg_function()
```
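
To make the three steps concrete, here is a minimal sketch that performs split, apply, and combine by hand and compares the result with the built-in one-liner. The `Team` and `Points` columns are made-up illustration data, not part of the lesson.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    'Team': ['A', 'A', 'B', 'B', 'B'],
    'Points': [10, 20, 5, 15, 40],
})

# 1. Split: iterating a GroupBy yields (key, sub-DataFrame) pairs
# 2. Apply: take the mean of Points within each group
# 3. Combine: collect the per-group results into one Series
manual = pd.Series(
    {team: group['Points'].mean() for team, group in df.groupby('Team')},
    name='Points',
)

# The built-in aggregation does all three steps in one call
builtin = df.groupby('Team')['Points'].mean()

print(manual)
print(builtin)
```

Both Series hold the same per-team means; the manual version only exists to show what `groupby` does for you.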

**Common Aggregations:**
- `sum()` - Total
- `mean()` - Average
- `count()` - Number of items
- `min()`, `max()` - Range
- `std()` - Standard deviation
- `agg()` - Custom/multiple aggregations

**Real-world Examples:**
- Sales by region
- Average salary by department
- Student scores by class
- Revenue by product category


### 💡 Example: Basic GroupBy

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'Engineering', 'Engineering', 'Marketing'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Salary': [70000, 65000, 90000, 85000, 60000]
})

print("Original Data:")
print(df)
print()

# Average salary by department
avg_salary = df.groupby('Department')['Salary'].mean()
print("Average Salary by Department:")
print(avg_salary)
print()

# Count employees by department
counts = df.groupby('Department').size()
print("Employees per Department:")
print(counts)

```

**Expected Output:**
```
Original Data:
   Department Employee  Salary
0       Sales    Alice   70000
1       Sales      Bob   65000
2 Engineering  Charlie   90000
3 Engineering    David   85000
4   Marketing      Eve   60000

Average Salary by Department:
Department
Engineering    87500.0
Marketing      60000.0
Sales          67500.0
Name: Salary, dtype: float64

Employees per Department:
Department
Engineering    2
Marketing      1
Sales          2
dtype: int64

```

### 💡 Example: Multiple Aggregations

```python
import pandas as pd

sales_data = pd.DataFrame({
    'Region': ['East', 'East', 'West', 'West', 'East'],
    'Product': ['A', 'B', 'A', 'B', 'A'],
    'Sales': [100, 150, 120, 180, 110],
    'Units': [10, 15, 12, 18, 11]
})

print("Sales Data:")
print(sales_data)
print()

# Multiple aggregations
summary = sales_data.groupby('Region').agg({
    'Sales': ['sum', 'mean'],
    'Units': ['sum', 'max']
})

print("Summary by Region:")
print(summary)
print()

# Rename columns
summary_clean = sales_data.groupby('Region').agg({
    'Sales': 'sum',
    'Units': 'sum'
}).rename(columns={'Sales': 'Total_Sales', 'Units': 'Total_Units'})

print("Clean Summary:")
print(summary_clean)

```

**Expected Output:**
```
Sales Data:
  Region Product  Sales  Units
0   East       A    100     10
1   East       B    150     15
2   West       A    120     12
3   West       B    180     18
4   East       A    110     11

Summary by Region:
       Sales        Units
         sum  mean   sum max
Region
East      360 120.0    36  15
West      300 150.0    30  18

Clean Summary:
        Total_Sales  Total_Units
Region
East            360           36
West            300           30

```
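
The examples above use built-in aggregations only. As a rough sketch of a custom aggregation, you can pass your own function to `agg()`, here combined with named aggregation so the output columns get readable names directly. The helper `value_range` and the output column names are hypothetical choices for this illustration.

```python
import pandas as pd

sales_data = pd.DataFrame({
    'Region': ['East', 'East', 'West', 'West', 'East'],
    'Sales': [100, 150, 120, 180, 110],
})

def value_range(s):
    """Custom aggregation: spread between the largest and smallest value."""
    return s.max() - s.min()

# Named aggregation: output_name=(input_column, function_or_name)
summary = sales_data.groupby('Region').agg(
    total_sales=('Sales', 'sum'),
    sales_range=('Sales', value_range),
)

print(summary)
```

Named aggregation avoids the separate `rename()` step and the two-level column header that a dict of lists produces.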

### 💡 Example: Group by Multiple Columns

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['NYC', 'NYC', 'LA', 'LA', 'NYC'],
    'Category': ['Food', 'Tech', 'Food', 'Tech', 'Food'],
    'Revenue': [1000, 2000, 1200, 2500, 1100]
})

print("Data:")
print(df)
print()

# Group by multiple columns
grouped = df.groupby(['City', 'Category'])['Revenue'].sum()
print("Revenue by City and Category:")
print(grouped)
print()

# Unstack for better view
print("Unstacked view:")
print(grouped.unstack(fill_value=0))

```

**Expected Output:**
```
Data:
  City Category  Revenue
0  NYC     Food     1000
1  NYC     Tech     2000
2   LA     Food     1200
3   LA     Tech     2500
4  NYC     Food     1100

Revenue by City and Category:
City  Category
LA    Food        1200
      Tech        2500
NYC   Food        2100
      Tech        2000
Name: Revenue, dtype: int64

Unstacked view:
Category  Food  Tech
City
LA        1200  2500
NYC       2100  2000

```
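
If you would rather get a flat table than a MultiIndex, one option is `as_index=False`, which takes a boolean (not a column name). The sketch below reuses the data from the example above and shows that `reset_index()` gives the same rows.

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['NYC', 'NYC', 'LA', 'LA', 'NYC'],
    'Category': ['Food', 'Tech', 'Food', 'Tech', 'Food'],
    'Revenue': [1000, 2000, 1200, 2500, 1100]
})

# as_index=False keeps the group keys as ordinary columns
flat = df.groupby(['City', 'Category'], as_index=False)['Revenue'].sum()
print(flat)

# Equivalent: aggregate first, then move the index levels back into columns
same = df.groupby(['City', 'Category'])['Revenue'].sum().reset_index()
print(same)
```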

In [25]:
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'IT', 'HR', 'IT', 'Sales', 'Sales'],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Salary': [60000, 90000, 65000, 95000, 70000, 75000]
})
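# Total salary per department (assumed goal; the exercise prompt is not shown here).
# Group by 'Department' only; note that as_index expects True/False, not a column name.
result = df.groupby('Department')['Salary'].sum()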

In [28]:
%dstutor check

In [29]:
%dstutor next

# Handling Missing Data in Pandas

Real-world data is messy! Missing values are everywhere. Learning to handle them
properly is crucial for accurate analysis and modeling.

**What you'll learn:**
- Detect and count missing values
- Different strategies for handling them
- When to drop vs impute
- Best practices for missing data


## Missing Data in Pandas

**How Missing Data Appears:**
- `NaN` (Not a Number) - default for numeric columns
- `None` - Python's null value
- Empty strings (sometimes)
- Custom placeholders (-999, "N/A", etc.)
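
Pandas does not treat empty strings or custom placeholders as missing on its own. A minimal sketch of one way to convert them to `NaN` so the detection methods can see them; the column names and the placeholder values are made up for illustration.

```python
import pandas as pd
import numpy as np

# Hypothetical raw data with placeholder values
raw = pd.DataFrame({
    'Age': [25, -999, 35],
    'City': ['NYC', 'N/A', ''],
})

clean = raw.copy()
# Replace placeholders column by column, since each column has its own sentinel
clean['Age'] = raw['Age'].replace(-999, np.nan)
clean['City'] = raw['City'].replace(['N/A', ''], np.nan)

print(raw.isna().sum())    # placeholders are not counted as missing
print(clean.isna().sum())  # now they are
```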

**Detection Methods:**
- `.isnull()` or `.isna()` - Returns boolean DataFrame
- `.notnull()` or `.notna()` - Opposite of isnull
- `.isnull().sum()` - Count missing per column

**Handling Strategies:**

1. **Drop Missing Data**
   - `.dropna()` - Remove rows/columns with NaN
   - Use when: Small amount of missing data, non-critical rows

2. **Fill Missing Data**
   - `.fillna(value)` - Replace with specific value
   - `.ffill()` - Forward fill (use previous value)
   - `.bfill()` - Backward fill (use next value)
   - `.interpolate()` - Interpolate values

3. **Imputation**
   - Fill with mean/median/mode
   - Use domain knowledge
   - Advanced: ML-based imputation
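
Of the fill methods listed above, `.interpolate()` is the only one without an example later in this lesson. A small sketch with made-up values, using the default linear method:

```python
import pandas as pd
import numpy as np

# Hypothetical series with a two-value gap
s = pd.Series([10.0, np.nan, np.nan, 40.0])

# Linear interpolation (the default) spaces the missing values evenly
filled = s.interpolate()

print(filled)
```

The gap between 10 and 40 is filled with evenly spaced values, which usually only makes sense when the rows have a meaningful order.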

**When to Use Each:**
- **Drop**: <5% missing, random pattern
- **Mean**: Numeric, roughly symmetric distribution
- **Median**: Numeric, skewed distribution or outliers
- **Mode**: Categorical data
- **Forward/Backward Fill**: Time series
- **Keep as NaN**: Missingness is informative
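
When missingness itself may be informative, one common approach is to record it in a separate flag column before filling. This is only a sketch; the column names `Income_missing` and `Income_filled` are hypothetical.

```python
import pandas as pd
import numpy as np

# Hypothetical data for illustration
df = pd.DataFrame({'Income': [50000, np.nan, 70000, np.nan]})

# Keep the information that a value was missing...
df['Income_missing'] = df['Income'].isna()

# ...then fill a copy of the column so it can be used in calculations
df['Income_filled'] = df['Income'].fillna(df['Income'].median())

print(df)
```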


### 💡 Example: Detecting Missing Values

```python
import pandas as pd
import numpy as np

# Create DataFrame with missing values
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, np.nan, 35, 28, np.nan],
    'City': ['NYC', 'LA', None, 'Boston', 'Seattle'],
    'Salary': [70000, 80000, np.nan, 65000, 85000]
})

print("DataFrame:")
print(df)
print()

# Check for missing values
print("Missing values (boolean):")
print(df.isnull())
print()

# Count missing per column
print("Missing count per column:")
print(df.isnull().sum())
print()

# Percentage missing
print("Percentage missing:")
print((df.isnull().sum() / len(df) * 100).round(1))

```

**Expected Output:**
```
DataFrame:
      Name   Age     City   Salary
0   Alice  25.0      NYC  70000.0
1     Bob   NaN       LA  80000.0
2 Charlie  35.0     None      NaN
3   David  28.0   Boston  65000.0
4     Eve   NaN  Seattle  85000.0

Missing values (boolean):
    Name    Age   City  Salary
0  False  False  False   False
1  False   True  False   False
2  False  False   True    True
3  False  False  False   False
4  False   True  False   False

Missing count per column:
Name      0
Age       2
City      1
Salary    1
dtype: int64

Percentage missing:
Name       0.0
Age       40.0
City      20.0
Salary    20.0
dtype: float64

```
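
Counts per column do not tell you which rows are affected. As a small follow-up sketch on the same data, `.any(axis=1)` on the boolean mask selects the rows that contain at least one missing value:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, np.nan, 35, 28, np.nan],
    'City': ['NYC', 'LA', None, 'Boston', 'Seattle'],
    'Salary': [70000, 80000, np.nan, 65000, 85000]
})

# True for every row that has at least one missing value
has_gap = df.isnull().any(axis=1)

rows_with_gaps = df[has_gap]
print(rows_with_gaps)

# Number of fully complete rows
print("Complete rows:", int((~has_gap).sum()))
```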

### 💡 Example: Dropping Missing Values

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})

print("Original DataFrame:")
print(df)
print()

# Drop rows with ANY missing values
print("Drop rows with ANY NaN:")
print(df.dropna())
print()

# Drop rows where ALL values are missing
print("Drop rows where ALL are NaN:")
print(df.dropna(how='all'))
print()

# Drop columns with missing values
print("Drop columns with ANY NaN:")
print(df.dropna(axis=1))
print()

# Keep only rows with at least 2 non-NaN values
print("Keep rows with at least 2 non-NaN:")
print(df.dropna(thresh=2))

```

**Expected Output:**
```
Original DataFrame:
     A    B   C
0  1.0  5.0   9
1  2.0  NaN  10
2  NaN  NaN  11
3  4.0  8.0  12

Drop rows with ANY NaN:
     A    B   C
0  1.0  5.0   9
3  4.0  8.0  12

Drop rows where ALL are NaN:
     A    B   C
0  1.0  5.0   9
1  2.0  NaN  10
2  NaN  NaN  11
3  4.0  8.0  12

Drop columns with ANY NaN:
    C
0   9
1  10
2  11
3  12

Keep rows with at least 2 non-NaN:
     A    B   C
0  1.0  5.0   9
1  2.0  NaN  10
3  4.0  8.0  12

```
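
Often only some columns are critical. A short sketch, on the same data as above, of the `subset` parameter, which restricts the missing-value check to the columns you name:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})

# Drop a row only if column 'A' is missing; NaN in 'B' is tolerated
kept = df.dropna(subset=['A'])
print(kept)
```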

### 💡 Example: Filling Missing Values

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, np.nan, 3, np.nan, 5],
    'B': [10, 20, np.nan, 40, 50]
})

print("Original:")
print(df)
print()

# Fill with a constant
print("Fill with 0:")
print(df.fillna(0))
print()

# Fill with column mean
print("Fill with column mean:")
print(df.fillna(df.mean()))
print()

# Forward fill
print("Forward fill:")
print(df.ffill())
print()

# Backward fill
print("Backward fill:")
print(df.bfill())

```

**Expected Output:**
```
Original:
     A     B
0  1.0  10.0
1  NaN  20.0
2  3.0   NaN
3  NaN  40.0
4  5.0  50.0

Fill with 0:
     A     B
0  1.0  10.0
1  0.0  20.0
2  3.0   0.0
3  0.0  40.0
4  5.0  50.0

Fill with column mean:
     A     B
0  1.0  10.0
1  3.0  20.0
2  3.0  30.0
3  3.0  40.0
4  5.0  50.0

Forward fill:
     A     B
0  1.0  10.0
1  1.0  20.0
2  3.0  20.0
3  3.0  40.0
4  5.0  50.0

Backward fill:
     A     B
0  1.0  10.0
1  3.0  20.0
2  3.0  40.0
3  5.0  40.0
4  5.0  50.0

```

### 💡 Example: Different Strategies by Column

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Age': [25, np.nan, 35, np.nan, 45],
    'Income': [50000, 60000, np.nan, 70000, 80000],
    'Category': ['A', 'B', np.nan, 'A', 'B']
})

print("Original:")
print(df)
print()

# Fill numeric with mean, categorical with mode
df_filled = df.copy()
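# Assign the result back; inplace=True on a selected column is
# deprecated in recent pandas versions and may not update df_filled
df_filled['Age'] = df_filled['Age'].fillna(df['Age'].mean())
df_filled['Income'] = df_filled['Income'].fillna(df['Income'].median())
df_filled['Category'] = df_filled['Category'].fillna(df['Category'].mode()[0])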

print("Filled (Age=mean, Income=median, Category=mode):")
print(df_filled)

```

**Expected Output:**
```
Original:
    Age   Income Category
0  25.0  50000.0        A
1   NaN  60000.0        B
2  35.0      NaN      NaN
3   NaN  70000.0        A
4  45.0  80000.0        B

Filled (Age=mean, Income=median, Category=mode):
    Age   Income Category
0  25.0  50000.0        A
1  35.0  60000.0        B
2  35.0  65000.0        A
3  35.0  70000.0        A
4  45.0  80000.0        B

```
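
As an alternative sketch of the same idea, `fillna()` also accepts a dictionary that maps column names to fill values, so all columns can be handled in one call. The variable name `fill_values` is just an illustrative choice.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Age': [25, np.nan, 35, np.nan, 45],
    'Income': [50000, 60000, np.nan, 70000, 80000],
    'Category': ['A', 'B', np.nan, 'A', 'B']
})

# One fill value per column: mean, median, and mode respectively
fill_values = {
    'Age': df['Age'].mean(),
    'Income': df['Income'].median(),
    'Category': df['Category'].mode()[0],
}

# Returns a new DataFrame; the original df is left unchanged
df_filled = df.fillna(fill_values)
print(df_filled)
```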

In [8]:
import pandas as pd
result = pd.DataFrame()

The lesson content will appear above! 📚

Read through the explanation and examples, then try the exercise below.

## Step 5: Try the Exercise

Now it's your turn! Write code to solve the exercise.

**Exercise:** Create a NumPy array with values [10, 20, 30, 40, 50]

In [9]:
# Your code here
import numpy as np

result = np.array([10, 20, 30, 40, 50])

When you run the cell above, DS-Tutor will automatically validate your solution and give you feedback!
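
How DS-Tutor validates a solution internally is not shown in this notebook. If you want to sanity-check your answer yourself first, a manual check could look like the sketch below; the `expected` array and the specific checks are assumptions for illustration, not the tutor's actual logic.

```python
import numpy as np

result = np.array([10, 20, 30, 40, 50])

# Hypothetical manual check, not DS-Tutor's real validation
expected = np.array([10, 20, 30, 40, 50])

assert isinstance(result, np.ndarray), "result should be a NumPy array"
assert np.array_equal(result, expected), "values do not match"

print("Looks good:", result)
```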

## Getting Help

### Need a hint?

If you're stuck, get a hint:

In [10]:
%dstutor hint

### Need more help?

In [11]:
%dstutor hint 2  # More specific hint

### Want to see the solution?

In [12]:
%dstutor solution

## Navigation

### Move to the next lesson:

In [13]:
%dstutor next

# Pandas - Lesson 3

Learn Pandas data manipulation.

Pandas concepts here...


---

## ✏️ Exercise

Complete the Pandas exercise

**Setup Code** (Run this first):


```python
import pandas as pd
```


**Your Solution:**

```python
# Your code here
```

---
💡 **Tip:** Use `%dstutor hint` if you need help!



### Go back to the previous lesson:

In [14]:
%dstutor previous

# Pandas - Lesson 2

Learn Pandas data manipulation.

Pandas concepts here...


---

## ✏️ Exercise

Complete the Pandas exercise

**Setup Code** (Run this first):


```python
import pandas as pd
```


**Your Solution:**

```python
# Your code here
```

---
💡 **Tip:** Use `%dstutor hint` if you need help!



## Track Your Progress

View your learning progress:

In [15]:
%dstutor progress

## All Available Commands

Get a complete list of commands:

In [19]:
%dstutor help

| Command | Description |
|---|---|
| `%dstutor init` | Initialize the tutor |
| `%dstutor start <topic>` | Start learning a topic |
| `%dstutor next` | Go to next lesson |
| `%dstutor previous` | Go to previous lesson |
| `%dstutor hint [level]` | Get a hint (levels 1-3) |
| `%dstutor solution` | Show solution |
| `%dstutor progress` | Show progress dashboard |
| `%dstutor topics` | List available topics |
| `%dstutor reset` | Reset current lesson |
| `%dstutor goto <id>` | Jump to specific lesson |


## Tips for Success

1. **Try it yourself first** - Don't jump to hints immediately
2. **Experiment** - Modify examples to see what happens
3. **Practice daily** - Even 15-20 minutes helps
4. **Take notes** - Add your own markdown cells
5. **Review regularly** - Go back to reinforce concepts

---

## Ready to Learn?

You now know the basics! Here's your learning path:

### Level 1: Foundations
- ✅ **NumPy Mastery** - Array manipulation (Start here!)
- **Pandas Deep Dive** - Data manipulation
- **Matplotlib & Seaborn** - Data visualization

### Level 2: ML Pipeline
- **EDA** - Exploratory Data Analysis
- **Preprocessing** - Data cleaning
- **Scikit-Learn** - Machine learning models

### Level 3: Advanced
- **Deep Learning** - Keras & PyTorch
- **Model Interpretation** - SHAP, LIME
- **Specialized Topics** - Time series, NLP, CV

---

## Let's Begin!

Start your Data Science journey right now:

In [17]:
# Start learning!
%dstutor start pandas

# Pandas - Lesson 1

Learn Pandas data manipulation.

Pandas concepts here...


---

## ✏️ Exercise

Complete the Pandas exercise

**Setup Code** (Run this first):


```python
import pandas as pd
```


**Your Solution:**

```python
# Your code here
```

---
💡 **Tip:** Use `%dstutor hint` if you need help!



---

**Happy Learning! 🎓📊🤖**

Remember: The journey of a thousand miles begins with a single step. Your Data Science journey starts here!