# Sales Data Analysis Workshop — Notebook

References:
* [Data Cleaning Guide](https://colab.research.google.com/drive/1vLOybvHdk2D39v10RGbHkD3km98QOXi_?usp=sharing)
* [Filtering, Sorting, Selecting](https://colab.research.google.com/drive/1QugLu7m9T8lFocQpycfnn_5LRjEURqmF?usp=sharing)
* [Pivots and Plots](https://colab.research.google.com/drive/19uc_TPzNjjRnOvmEEfnsV4xepOT2mtSO)

* [Data files GitHub](https://github.com/najczuk/2025-pg-bda-notebooks/tree/main/data)


## Setup Instructions

Before starting, make sure you have Python, Jupyter or VS Code, and the packages `pandas`, `matplotlib`, and `seaborn` installed. The dataset is `data/sales_data.csv`.

Example (run in a code cell to test your environment):
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('data/sales_data.csv')
print('Dataset loaded successfully!')
print(df.head())
```
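
If the import cell above fails, it can help to see which package is missing before debugging further. The snippet below is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical and not part of the workshop materials.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report the status of each package the workshop relies on
for name in ('pandas', 'matplotlib', 'seaborn'):
    status = installed_version(name) or 'NOT INSTALLED'
    print(f'{name}: {status}')
```

Any package reported as missing can then be installed with `pip` or `conda` before continuing.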

## Assignment 1 — Data Inspection & Cleaning

**Goal**: Load the dataset, inspect its structure, and detect messy rows (e.g., string values in numeric columns or invalid categorical values).

**Skills recap**: `df.head()`, `df.info()`, `pd.to_numeric(..., errors='coerce')`, `df['col'].isin(...)`, `df.dropna()`.

**Useful reminders (examples only — do not copy as final solution):**
```python
# Coerce a column to numeric (invalid values become NaN)
cleaned = pd.to_numeric(df['column_name'], errors='coerce')
print(cleaned.isna().sum())  # count values that failed conversion

# Show rows that failed conversion (inspect before changing)
df[cleaned.isna()]

# Filter rows to keep only allowed categorical values
valid_values = ['a','b','c']
df_valid = df[df['column_name'].isin(valid_values)]
df_invalid = df[~df['column_name'].isin(valid_values)]

# Applying built-in string methods
df['Month'] = df['Month'].str.capitalize()
# Output: Jan, Jan, Jan, Jan, Feb, Feb, Feb, Feb, Feb

df['Month'] = df['Month'].str.upper()
# Output: JAN, JAN, FEB, ...

# Replacing values

## with a dict of old -> new values passed to replace()
month_map = {
    'January': 'Jan',
    'February': 'Feb'
}

df['Month'] = df['Month'].replace(month_map)

## regular string replace
df['column_name'] = df['column_name'].astype(str).str.replace('foo', 'bar', case=False)
# replaces 'foo' with 'bar', ignoring case


# Applying custom functions
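def expand_k_suffix(value):
    """
    Convert strings like '9k' to 9000.0; pass every other value through unchanged.
    Example: '9k' -> 9000.0, '9' -> '9', 9 -> 9
    (the pd.to_numeric call afterwards turns leftovers such as '9' into numbers)
    """
    if isinstance(value, str):
        if value.lower().endswith('k'):
            try:
                # [:-1] drops the trailing 'k'; multiply by 1000 to expand it
                return float(value[:-1]) * 1000
            except ValueError:
                return None  # e.g. 'ok' has no numeric part
    return value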

# Apply the helper, then coerce to numeric (anything still invalid becomes NaN)
df['column_name'] = df['column_name'].apply(expand_k_suffix)
df['column_name'] = pd.to_numeric(df['column_name'], errors='coerce')
```

Use some of the methods above to detect issues, then decide whether to correct, impute, or drop problematic rows.
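
To illustrate the detection step, here is a minimal sketch on a small made-up DataFrame. The values are hypothetical and not taken from `sales_data.csv`. It separates values that were missing from the start from values that are present but cannot be parsed as numbers, since the two cases often call for different fixes.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative only
toy = pd.DataFrame({
    'Revenue': ['1200', '9k', 'n/a', '15000', None],
})

# Coerce to numeric: anything that cannot be parsed becomes NaN
revenue_num = pd.to_numeric(toy['Revenue'], errors='coerce')

# Values that were already missing before conversion
was_missing = toy['Revenue'].isna()

# Values that are present but not parseable (candidates for correction)
unparseable = toy[revenue_num.isna() & ~was_missing]
print(unparseable)
```

Rows in `unparseable` are worth inspecting by eye: some may be fixable (a suffix such as `k`), while others may be placeholders for missing data.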

In [47]:
# Assignment 1 - your code
# your code

**Your Turn (Assignment 1)**:
- Inspect the data with `df.head()` and `df.info()`
- Detect and list rows with non-numeric `Revenue` values
- Decide and apply a strategy: correct obvious typos, drop, or impute missing values
- Standardize `Month` values to valid categories (e.g., 'Jan','Feb','Mar')

In [48]:
# Assignment 1 - tasks (student)
# your code

## Assignment 2 — Filtering & Selection

**Goal**: Use selection, boolean filtering, and sorting to extract useful subsets of the sales data.

**Useful snippets (examples only)**
```python
# Select a single column: 
df['column_name']
# Select multiple columns: 
df[['a','b']]
# Copy data frame
df_copy = df[['col1','col3']].copy()
# Filter rows by numeric condition: 
df[df['column_name'] > 1000]
# Combine conditions: 
df[(df['a']>1) & (df['b']=='x')]
# Filter membership: 
df[df['column_name'].isin(['a','b'])]
# Sort: 
df.sort_values('column_name', ascending=False).head()
```
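
The snippets above can be combined. The following sketch, on a made-up DataFrame with hypothetical values and thresholds, shows named boolean masks, the equivalent `DataFrame.query` syntax, and `nlargest` as a shortcut for sorting and then taking the top rows.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the workshop dataset
toy = pd.DataFrame({
    'Region': ['North', 'South', 'East', 'West'],
    'Revenue': [500, 1500, 2500, 3500],
    'Units_Sold': [10, 40, 20, 30],
})

# Naming each mask keeps combined conditions readable
is_large = toy['Revenue'] > 1000
is_target_region = toy['Region'].isin(['South', 'East'])
subset = toy[is_large & is_target_region]

# The same filter written with query()
same_subset = toy.query("Revenue > 1000 and Region in ['South', 'East']")

# Top 2 rows by Units_Sold without an explicit sort
top2 = toy.nlargest(2, 'Units_Sold')

print(subset)
print(top2)
```

Remember the parentheses when writing conditions inline: `&` binds more tightly than `>` or `==` in Python.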

**Your Turn (Assignment 2)**:
1. Select columns `Region`, `Salesperson`, `Revenue`, `Units_Sold`, `Month` into `df_view`.
2. Filter `df_view` to `Revenue` > 12_000 → `df_high`.
3. From `df_high` keep only `North` and `East` regions.
4. Find top 3 transactions by `Units_Sold`.
5. Show rows where `Revenue` > 10_000 AND `Units_Sold` < 350.

In [49]:
# Assignment 2 - your code
# your code

## Assignment 3 — Pivot Tables & Summary Statistics

**Goal**: Create summary tables that aggregate sales data to reveal patterns by region, salesperson, and month.

**Useful snippets (examples only)**
```python
# Pivot table: rows = index_col, columns = column_col, cells = sum of value_col
pd.pivot_table(df, index='index_col', columns='column_col', values='value_col', aggfunc='sum')
# Group, then aggregate different columns with different functions
grouped = df.groupby('group_col').agg({'value_col1':'sum', 'value_col2':'mean'})
# Named aggregations give explicit output column names
named_aggregates = df.groupby('group_col').agg(total=('value_col','sum'), cnt=('value_col','count'))
```
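
As a small worked example, the sketch below builds the same two-way summary with `pd.pivot_table` and with `groupby` plus `unstack`, using made-up numbers. It also shows that string month labels are sorted alphabetically by default, which matters once months need to appear in calendar order.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative only
toy = pd.DataFrame({
    'Region': ['North', 'North', 'East', 'East', 'East'],
    'Month': ['Jan', 'Feb', 'Jan', 'Jan', 'Feb'],
    'Revenue': [100, 200, 300, 400, 500],
})

# Two-way table with pivot_table; empty cells would be filled with 0
pivot = pd.pivot_table(toy, index='Region', columns='Month',
                       values='Revenue', aggfunc='sum', fill_value=0)

# The same table via groupby + unstack
grouped = toy.groupby(['Region', 'Month'])['Revenue'].sum().unstack(fill_value=0)

print(pivot)
print(list(pivot.columns))  # alphabetical order, not calendar order
```

One way to get calendar order is to convert `Month` to an ordered `pd.Categorical` before aggregating.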

**Your Turn (Assignment 3)**:
1. Build `pivot_region`: total `Revenue` by `Region`.
2. Build `pivot_month_region`: `Month` rows × `Region` columns showing total `Revenue`.
3. Compute average `Units_Sold` per `Salesperson`.
4. Count transactions per `Month`.
5. (Challenge) Create a table with both total `Revenue` and average `Units_Sold` per `Region`.

In [50]:
# Assignment 3 - your code
# your code

## Assignment 4 — Visualizations & Interpretation

**Goal**: Create clear visualizations (bar, line, box, heatmap) that summarize the cleaned sales data and support short written interpretations.

**Useful snippets (examples only)**
```python
# Bar plot from aggregated data
agg.plot.bar(x='group_col', y='value_col')
# seaborn bar plot
sns.barplot(data=agg, x='group_col', y='value_col')
# Line plot (sort by time first)
sns.lineplot(data=df_time_sorted, x='time_col', y='value_col')
# Box plot
sns.boxplot(data=df, x='group_col', y='value_col')
# Heatmap from a pivot table
sns.heatmap(pivot, annot=True, fmt='.0f')
```
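
Line plots over months only read correctly if the months are in calendar order. The sketch below, on made-up data, uses an ordered `pd.Categorical` for that and plots with plain matplotlib rather than seaborn. The month list and the numbers are assumptions for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data, deliberately out of calendar order
toy = pd.DataFrame({
    'Month': ['Mar', 'Jan', 'Feb', 'Jan', 'Mar'],
    'Revenue': [300, 100, 200, 150, 50],
})

# An ordered categorical makes groupby sort by calendar position
month_order = ['Jan', 'Feb', 'Mar']
toy['Month'] = pd.Categorical(toy['Month'], categories=month_order, ordered=True)

# Total revenue per month, in category order
monthly = toy.groupby('Month', observed=True)['Revenue'].sum()

# Draw the line plot and label the axes
fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(monthly.index.astype(str), monthly.to_numpy(), marker='o')
ax.set_xlabel('Month')
ax.set_ylabel('Total Revenue')
ax.set_title('Toy example: revenue by month')
fig.tight_layout()
```

In a notebook the figure is displayed when the cell finishes; in a plain script, call `plt.show()` at the end.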

**Your Turn (Assignment 4)**:
1. Bar plot `bar_region`: total `Revenue` by `Region`.
2. Line plot `line_month`: total `Revenue` across ordered months.
3. Box plot `box_region`: `Revenue` distribution by `Region`.
4. Heatmap `heat_region_month`: total `Revenue` for each `Month` × `Region`.
5. Write a 4–6 sentence summary answering which region looks strongest, which shows the most variability, and what trends appear across months.

In [51]:
# Assignment 4 - your code
# your code

## Assignment 5 — Capstone Analysis Project

**Goal**: Combine cleaning, filtering, pivoting, and visualization skills to perform a short end-to-end analysis and present findings.

**Project steps (scaffold)**:
1. Re-run cleaning and create `df_clean`.
2. Explore with filters/sorts to find interesting candidates.
3. Produce pivot tables to support your question.
4. Visualize key results (≥1 bar + 1 line/box plot).
5. Write a concise report (≤300 words) answering a clear business question and include 2–3 figures.
6. Save cleaned data: `df_clean.to_csv('sales_data_cleaned.csv', index=False)`

**Deliverables**: Notebook or markdown with cleaning steps, pivot tables, 2–3 plots, and a written conclusion.
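
Before submitting, it may be worth confirming that the saved CSV reloads as expected. The sketch below runs a round-trip check on a made-up stand-in for `df_clean` (the name `df_clean_toy` and its values are hypothetical) and writes to a temporary directory so that no real files are touched.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned DataFrame
df_clean_toy = pd.DataFrame({
    'Region': ['North', 'East'],
    'Revenue': [1000.0, 2500.5],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sales_data_cleaned.csv')
    # index=False avoids writing the row index as an extra column
    df_clean_toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Raises an AssertionError if the two frames differ
pd.testing.assert_frame_equal(df_clean_toy, reloaded)
print('Round trip OK:', reloaded.shape)
```

CSV files store no dtype information, so a column converted to an ordered categorical comes back as plain strings and has to be converted again after loading.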

In [52]:
# Assignment 5 - your code
# your code