## **Problem 3: Product Performance and Material Trends**

As part of a supply chain review, you're asked to analyze how different materials are performing:

* Extract material name, color, and supplier into a structured format.
* Normalize supplier names (strip, title case).
* Calculate total revenue per material (sum of `Total Price`).
* Categorize materials by total weight purchased into quartiles.
* Flag materials with `Average Price per Gram > 0.06`.
* Create a dummy variable DataFrame from material type (`PLA`, `ABS`, etc.) and store zip code.

*Hint: You’ll need to create bins, normalize units, and apply one-hot encoding.*

In [309]:
import pandas as pd
import numpy as np
import re

In [310]:
data = pd.read_csv('fila_heat_filament_sales_april2025.csv')

In [311]:
# read_csv already returns a DataFrame, so no conversion is needed; work on a copy
df = data.copy()

In [312]:
df.head(3)

Unnamed: 0,Date Purchased,Receipt Number,Customer Name,Customer Address,Phone Number,Email,Store Location,Product Name,Product Code,Bar Code,Material Name,Color,Weight,Supplier,Lot Number,Price,Quantity,Tax,Total Price
0,2025-04-01,1ff49b78-8946-4e85-b59c-de66bacfb3d0,Danielle Johnson,"3321 Brittany Bypass, North Jefferyhaven, 79408",8386379402,danielle.johnson@hotmail.com,"5423 Garcia Light, West Melanieview, 06196",Standard PLA Filament,PLA-792,6184960000000.0,PLA,Blue,500,3DFilaments,L5012,26.69,1,1.87,28.56
1,2025-04-01,434308bc-89fa-4a68-8fb5-d27bbeb79919,Tracie Wyatt,"64752 Kelly Skyway, Jacquelineland, 80341",+1-283-276-4835x0305,tracie.wyatt@yahoo.com,"1395 Diana Locks, Thomasberg, 32826",Flexible TPU Filament,TPU-338,9696530000000.0,TPU,Purple,500,ProtoPolymers,L1520,20.88,2,2.92,44.68
2,2025-04-01,52fbe43b-9954-4eb4-8025-7ad1eb2263dd,Eric Moore,"691 James Mountain, Tashatown, 89667",001-184-514-6270x4828,eric.moore@gmail.com,"489 Eric Track, New Stephanie, 70015",Flexible TPU Filament,TPU-325,7015430000000.0,TPU,Purple,1000,PrintPro,L4257,41.47,4,11.61,177.49


In [313]:
# 1. Extract material name, color, and supplier into a structured format.
materials = df[['Material Name', 'Color', 'Supplier']].copy()

In [314]:
materials.head(3)

Unnamed: 0,Material Name,Color,Supplier
0,PLA,Blue,3DFilaments
1,TPU,Purple,ProtoPolymers
2,TPU,Purple,PrintPro


In [315]:
# 1.1 Count how many colors per material
materials.groupby('Material Name')['Color'].nunique()

Material Name
ABS     2
PETG    2
PLA     3
TPU     2
Name: Color, dtype: int64

In [316]:
# 1.2. Group by material and supplier
materials.groupby(['Material Name', 'Supplier']).size().reset_index(name='Count')

Unnamed: 0,Material Name,Supplier,Count
0,ABS,3DFilaments,17
1,ABS,MakerStuff,15
2,ABS,PrintPro,22
3,ABS,ProtoPolymers,18
4,PETG,3DFilaments,20
5,PETG,MakerStuff,25
6,PETG,PrintPro,14
7,PETG,ProtoPolymers,15
8,PLA,3DFilaments,24
9,PLA,MakerStuff,33


In [317]:
# optional: pivot for readability
materials.groupby(['Material Name', 'Supplier']).size().unstack(fill_value=0)

Material Name,3DFilaments,MakerStuff,PrintPro,ProtoPolymers
ABS,17,15,22,18
PETG,20,25,14,15
PLA,24,33,26,30
TPU,35,20,23,23


In [318]:
# 1.3 Detect duplicates or inconsistencies
# a. Find exact duplicate rows:
materials.duplicated().sum()

np.int64(324)

In [319]:
materials[materials.duplicated()]

Unnamed: 0,Material Name,Color,Supplier
9,TPU,Yellow,3DFilaments
11,TPU,Yellow,3DFilaments
12,PLA,Red,ProtoPolymers
14,TPU,Purple,3DFilaments
17,ABS,Green,PrintPro
...,...,...,...
355,PLA,Red,MakerStuff
356,PETG,Clear,PrintPro
357,TPU,Yellow,PrintPro
358,PLA,Red,3DFilaments


In [320]:
# b. Find duplicate combinations of Material + Color:
materials[materials.duplicated(subset=['Material Name', 'Color'], keep=False)]

Unnamed: 0,Material Name,Color,Supplier
0,PLA,Blue,3DFilaments
1,TPU,Purple,ProtoPolymers
2,TPU,Purple,PrintPro
3,TPU,Yellow,3DFilaments
4,PETG,Orange,3DFilaments
...,...,...,...
355,PLA,Red,MakerStuff
356,PETG,Clear,PrintPro
357,TPU,Yellow,PrintPro
358,PLA,Red,3DFilaments


In [321]:
# c. Count the distinct suppliers for each material/color combination:
materials.groupby(['Material Name', 'Color'])['Supplier'].nunique()

Material Name  Color 
ABS            Black     4
               Green     4
PETG           Clear     4
               Orange    4
PLA            Blue      4
               Red       4
               White     4
TPU            Purple    4
               Yellow    4
Name: Supplier, dtype: int64

In [322]:
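# 2. Normalize supplier names (strip, title case).
# Preview only: the result is not assigned back to the column.
# Note: title() also lowercases inner capitals ('3DFilaments' -> '3Dfilaments').
materials['Supplier'].str.strip().str.title().unique()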

array(['3Dfilaments', 'Protopolymers', 'Printpro', 'Makerstuff'],
      dtype=object)
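
Because `str.title()` changes the brand spelling, one alternative is to match on a case-insensitive key and map back to a canonical spelling. The sketch below assumes the four supplier spellings seen in the data are the canonical ones; `CANONICAL` and `normalize_supplier` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Assumed canonical spellings, taken from the supplier values seen in the data
CANONICAL = ["3DFilaments", "ProtoPolymers", "PrintPro", "MakerStuff"]
lookup = {name.casefold(): name for name in CANONICAL}

def normalize_supplier(suppliers: pd.Series) -> pd.Series:
    """Strip whitespace and map case variants onto a canonical spelling."""
    stripped = suppliers.str.strip()
    key = stripped.str.casefold()
    # Unknown suppliers keep their stripped original value
    return key.map(lookup).fillna(stripped)

raw = pd.Series([" 3dfilaments ", "PRINTPRO", "Unknown Co "])
cleaned = normalize_supplier(raw)
```

Unrecognized names pass through unchanged, so new suppliers are not silently dropped.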

In [323]:
# Optional: Split PascalCase / CamelCase and Numbers from letters
def split_camel_case(name):
    # Insert space between a lowercase/number and an uppercase character
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', ' ', name)
    # Insert space between a letter and a digit
    name = re.sub(r'(?<=[A-Za-z])(?=[0-9])', ' ', name)
    # Insert space between a digit and a letter
    name = re.sub(r'(?<=[0-9])(?=[A-Za-z])', ' ', name)
    return name.title()  # Optional: Title-case the result

In [324]:
materials['Supplier Spaced'] = materials['Supplier'].apply(split_camel_case)

In [325]:
materials.head(3)

Unnamed: 0,Material Name,Color,Supplier,Supplier Spaced
0,PLA,Blue,3DFilaments,3 Dfilaments
1,TPU,Purple,ProtoPolymers,Proto Polymers
2,TPU,Purple,PrintPro,Print Pro
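
The output above splits `3DFilaments` into `3 Dfilaments`, which is probably not the intended reading. A possible variant, sketched here under the assumption that a digit followed by capitals (such as `3D`) should stay together, inserts spaces only at lowercase-to-uppercase boundaries and before the last capital of an uppercase run. `split_supplier_name` is a hypothetical helper name.

```python
import re

def split_supplier_name(name: str) -> str:
    """Split PascalCase names while keeping prefixes such as '3D' intact."""
    name = name.strip()
    # Boundary between a lowercase letter and an uppercase letter: 'ProtoPolymers'
    name = re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', name)
    # Boundary between an uppercase run and a capitalized word: '3DFilaments'
    name = re.sub(r'(?<=[A-Z])(?=[A-Z][a-z])', ' ', name)
    return name

examples = [split_supplier_name(s) for s in
            ["3DFilaments", "ProtoPolymers", "PrintPro", "MakerStuff"]]
```

This version skips the final `.title()` call, since title-casing is what lowercases the `F` in `3DFilaments`.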


In [326]:
# 3. Calculate total revenue per material (sum of `Total Price`).
materials_total_price = df[['Material Name', 'Total Price']].copy()

In [327]:
materials_total_price.head(3)

Unnamed: 0,Material Name,Total Price
0,PLA,28.56
1,TPU,44.68
2,TPU,177.49


In [328]:
materials_total_price.groupby('Material Name')['Total Price'].sum().sort_values(ascending=False)

Material Name
PLA     11430.24
TPU      9729.98
ABS      7163.03
PETG     6715.95
Name: Total Price, dtype: float64
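
If order counts and revenue share are useful alongside the totals, named aggregation can produce them in one pass. This is a minimal sketch on toy rows: the first three come from the preview at the top of the notebook and the ABS row is made up for illustration.

```python
import pandas as pd

# Toy data: first three rows from the notebook preview, ABS row is invented
sales = pd.DataFrame({
    "Material Name": ["PLA", "TPU", "TPU", "ABS"],
    "Total Price": [28.56, 44.68, 177.49, 30.00],
})

revenue = (
    sales.groupby("Material Name")
    .agg(revenue=("Total Price", "sum"), orders=("Total Price", "count"))
    .sort_values("revenue", ascending=False)
)
# Share of overall revenue contributed by each material
revenue["revenue_share"] = revenue["revenue"] / revenue["revenue"].sum()
```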

In [329]:
# 4. Categorize materials by total weight purchased into quartiles.
materials_weight = df[['Material Name', 'Weight']].copy()

In [330]:
materials_weight.head(3)

Unnamed: 0,Material Name,Weight
0,PLA,500
1,TPU,500
2,TPU,1000


In [331]:
bin_result = pd.qcut(materials_weight['Weight'], 4, duplicates='drop')

In [332]:
n_bins = bin_result.cat.categories.size

In [333]:
n_bins

2

In [334]:
labels = ['Low', 'Mid-Low', 'Mid-High', 'High'][:n_bins]

In [335]:
labels

['Low', 'Mid-Low']

In [336]:
materials_weight['Weight Quartile'] = pd.qcut(
    materials_weight['Weight'], 
    q=n_bins, 
    labels=labels, 
    duplicates="drop"
)

In [337]:
materials_weight['Weight'] = materials_weight['Weight'].astype('category')

In [338]:
materials_weight.head(3)

Unnamed: 0,Material Name,Weight,Weight Quartile
0,PLA,500,Low
1,TPU,500,Low
2,TPU,1000,Mid-Low
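
The cells above bin the per-row spool weight, which takes only a few distinct values and so collapses to two bins. The task wording ("total weight purchased") can also be read as a per-material total. The sketch below follows that reading, assuming `Weight` is grams per spool and `Quantity` is the number of spools; the toy rows are illustrative, not the real data.

```python
import pandas as pd

# Illustrative rows; Weight is assumed to be grams per spool
sales = pd.DataFrame({
    "Material Name": ["PLA", "PLA", "TPU", "ABS", "PETG", "TPU"],
    "Weight": [500, 1000, 500, 1000, 500, 1000],
    "Quantity": [1, 2, 2, 1, 1, 4],
})

sales["Total Weight (g)"] = sales["Weight"] * sales["Quantity"]
total_weight = sales.groupby("Material Name")["Total Weight (g)"].sum()

# Ranking first guarantees unique bin edges even when totals tie
quartile = pd.qcut(
    total_weight.rank(method="first"),
    q=4,
    labels=["Low", "Mid-Low", "Mid-High", "High"],
)
summary = pd.DataFrame({"Total Weight (g)": total_weight, "Weight Quartile": quartile})
```

With only four materials each quartile holds exactly one material, so the labels amount to a ranking.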


In [339]:
# 5. Flag materials with `Average Price per Gram > 0.06`.
materials_weight_total_price = df[['Material Name', 'Weight', 'Total Price']].copy()

In [340]:
materials_weight_total_price.head(3)

Unnamed: 0,Material Name,Weight,Total Price
0,PLA,500,28.56
1,TPU,500,44.68
2,TPU,1000,177.49


In [341]:
materials_weight_total_price['Average Price per Gram'] = materials_weight_total_price['Total Price'] / materials_weight_total_price['Weight']

In [342]:
materials_weight_total_price.head(3)

Unnamed: 0,Material Name,Weight,Total Price,Average Price per Gram
0,PLA,500,28.56,0.05712
1,TPU,500,44.68,0.08936
2,TPU,1000,177.49,0.17749


In [343]:
materials_weight_total_price['Flagged'] = np.where(materials_weight_total_price['Average Price per Gram'] > 0.06, 'FLAGGED', '')

In [344]:
materials_weight_total_price.head(3)

Unnamed: 0,Material Name,Weight,Total Price,Average Price per Gram,Flagged
0,PLA,500,28.56,0.05712,
1,TPU,500,44.68,0.08936,FLAGGED
2,TPU,1000,177.49,0.17749,FLAGGED
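
The calculation above divides `Total Price` by the weight of a single spool, so multi-spool orders look more expensive per gram than they are. A possible correction, assuming `Weight` is grams per spool and `Quantity` is the spool count, divides by total grams and aggregates per material. Note that `Total Price` includes tax, so the result is a tax-inclusive price per gram.

```python
import pandas as pd

# First three rows from the notebook preview
sales = pd.DataFrame({
    "Material Name": ["PLA", "TPU", "TPU"],
    "Weight": [500, 500, 1000],
    "Quantity": [1, 2, 4],
    "Total Price": [28.56, 44.68, 177.49],
})

sales["Grams"] = sales["Weight"] * sales["Quantity"]
per_material = sales.groupby("Material Name")[["Total Price", "Grams"]].sum()
per_material["Average Price per Gram"] = per_material["Total Price"] / per_material["Grams"]
# Boolean flag is easier to filter on than a text marker
per_material["Flagged"] = per_material["Average Price per Gram"] > 0.06
```

On these three rows neither material crosses the 0.06 threshold once quantity is taken into account.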


In [345]:
# 6. Create a dummy variable DataFrame from material type (`PLA`, `ABS`, etc.) and store zip code.
material_store_zip_code = df[['Material Name', 'Store Location']].copy()

In [346]:
material_store_zip_code.head(3)

Unnamed: 0,Material Name,Store Location
0,PLA,"5423 Garcia Light, West Melanieview, 06196"
1,TPU,"1395 Diana Locks, Thomasberg, 32826"
2,TPU,"489 Eric Track, New Stephanie, 70015"


In [347]:
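# Caution: this captures the first five-digit run, which may be a street number rather than the zip code.
material_store_zip_code['Zip Code'] = material_store_zip_code['Store Location'].str.extract(r'(\d{5})')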

In [348]:
material_store_zip_code.head(3)

Unnamed: 0,Material Name,Store Location,Zip Code
0,PLA,"5423 Garcia Light, West Melanieview, 06196",06196
1,TPU,"1395 Diana Locks, Thomasberg, 32826",32826
2,TPU,"489 Eric Track, New Stephanie, 70015",70015
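
An unanchored `\d{5}` pattern returns the first five-digit run in the address, which is the street number whenever that number has five digits. One way to avoid this, assuming the zip code is always the last token of `Store Location`, is to anchor the pattern at the end of the string. Both addresses below appear in the dataset.

```python
import pandas as pd

addresses = pd.Series([
    "5423 Garcia Light, West Melanieview, 06196",
    "93010 Carlos Bypass, Chadbury, 37001",
])

# Unanchored: picks up the five-digit street number in the second address
first_match = addresses.str.extract(r'(\d{5})', expand=False)

# Anchored: only a five-digit run at the very end of the address is accepted
anchored = addresses.str.extract(r'(\d{5})\s*$', expand=False)
```

Keeping the result as a string preserves leading zeros such as `06196`.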


In [358]:
dummies = pd.get_dummies(material_store_zip_code[['Material Name', 'Zip Code']]).astype(int)

In [359]:
dummies.head(3)

Unnamed: 0,Material Name_ABS,Material Name_PETG,Material Name_PLA,Material Name_TPU,Zip Code_00000,Zip Code_00471,Zip Code_00547,Zip Code_00595,Zip Code_00905,Zip Code_00956,...,Zip Code_97374,Zip Code_97585,Zip Code_97763,Zip Code_97906,Zip Code_98579,Zip Code_99187,Zip Code_99297,Zip Code_99704,Zip Code_99749,Zip Code_99758
0,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [360]:
material_store_zip_code_with_dummies = pd.concat([material_store_zip_code, dummies], axis=1)

In [361]:
material_store_zip_code_with_dummies

Unnamed: 0,Material Name,Store Location,Zip Code,Material Name_ABS,Material Name_PETG,Material Name_PLA,Material Name_TPU,Zip Code_00000,Zip Code_00471,Zip Code_00547,...,Zip Code_97374,Zip Code_97585,Zip Code_97763,Zip Code_97906,Zip Code_98579,Zip Code_99187,Zip Code_99297,Zip Code_99704,Zip Code_99749,Zip Code_99758
0,PLA,"5423 Garcia Light, West Melanieview, 06196",06196,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,TPU,"1395 Diana Locks, Thomasberg, 32826",32826,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,TPU,"489 Eric Track, New Stephanie, 70015",70015,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,TPU,"93010 Carlos Bypass, Chadbury, 37001",93010,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,PETG,"013 Richard Orchard, Port Richard, 08186",08186,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
355,PLA,"015 Steven Flat, South Shawnbury, 07563",07563,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
356,PETG,"6621 Richards Pine, Port Devon, 62760",62760,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
357,TPU,"497 Logan Pines, Fowlerland, 26540",26540,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
358,PLA,"203 Williams Mount, North Christopher, 07699",07699,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### 🧠 **What Are Dummy Variables?**

**Dummy variables** are **binary (0 or 1)** columns that represent **categorical data** in a numerical format, typically for use in:

* **Statistical models** (like regression)
* **Machine learning algorithms** (like decision trees, random forests, logistic regression, etc.)
* **Data analysis** when filtering or grouping based on categories

---

### 💡 Example

Let’s say you have a column like this:

| Material Name |
| ------------- |
| PLA           |
| ABS           |
| TPU           |

Most models **can't take text directly**, so you need to convert the column into a numerical format first.

#### 👉 Dummy variable format:

| Material Name\_ABS | Material Name\_PLA | Material Name\_TPU |
| ------------------ | ------------------ | ------------------ |
| 0                  | 1                  | 0                  |
| 1                  | 0                  | 0                  |
| 0                  | 0                  | 1                  |

Each row is marked with **1 for the category it belongs to**, and **0 for the rest**.
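
The table above can be reproduced with a small sketch. The three-row frame is a toy input, and `dtype=int` is used so the columns hold 0/1 instead of booleans.

```python
import pandas as pd

materials = pd.DataFrame({"Material Name": ["PLA", "ABS", "TPU"]})

# One column per category, named "<column>_<category>"
encoded = pd.get_dummies(materials, dtype=int)
```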

---

### 📈 Why Are They Important?

1. ✅ **Lets models use text categories** by turning them into numeric columns
2. ✅ **Avoids assigning numeric order to unordered categories** (e.g., encoding PLA as 1 and TPU as 2 would imply TPU > PLA)
3. ✅ **Keeps categories independent**: the model makes no false assumptions about ranking or distance between them

---

### 🔧 Tools to Generate Dummy Variables

```python
pd.get_dummies(df['Material Name'])
```

Also works on multiple columns:

```python
pd.get_dummies(material_store_zip_code[['Material Name', 'Zip Code']])
```

You can also control behavior:

* `drop_first=True` – drops the first category of each encoded column, which avoids perfect multicollinearity in linear models
* `prefix=...` – sets the prefix used in the generated column names
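
As a brief illustration of both options together, the sketch below encodes a toy frame with a short prefix and drops the first category. The `mat` prefix is an arbitrary choice for this example.

```python
import pandas as pd

sales = pd.DataFrame({"Material Name": ["PLA", "ABS", "TPU", "PETG"]})

encoded = pd.get_dummies(
    sales,
    columns=["Material Name"],
    prefix="mat",      # column names become mat_<category>
    drop_first=True,   # ABS (first in sorted order) becomes the baseline
    dtype=int,
)
```

The dropped category is then represented by a row of all zeros, which here is the ABS row.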
