# Polymers Classifier

EM 538
Instructor:  
Student: Mike Keating


## Preprocessing

Data were queried by polymer symbol (PP, PMMA, etc.) and saved to five separate CSV files. Because the search matches on symbol, each dataset likely includes blends and specialty materials, so we will want to pare the data down to a reasonable number of classes.
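
One way to decide which symbols deserve their own class is to count rows per `Generic Polymer Symbol` and keep only those above a minimum count. The sketch below uses a small made-up frame; the helper name `frequent_symbols` and the `min_rows` threshold are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def frequent_symbols(df, column="Generic Polymer Symbol", min_rows=2):
    """Return the symbols that appear in at least `min_rows` rows."""
    counts = df[column].value_counts()
    return counts[counts >= min_rows].index.tolist()


# Toy data standing in for the combined CSV files (values are made up).
toy = pd.DataFrame(
    {"Generic Polymer Symbol": ["ABS", "ABS", "PC+ABS", "PET", "PET", "PET"]}
)

kept = frequent_symbols(toy, min_rows=2)
print(kept)
```

On the real data the threshold would need to be much larger, and rare blends could be merged into a broader group instead of dropped.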


In [46]:
## Dependencies
import pandas as pd


In [57]:
## Combine
import os

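# Read every CSV in data/ and concatenate once, instead of re-copying df on each pass
frames = [
    pd.read_csv(os.path.join("data", file))
    for file in sorted(os.listdir("data"))  # sorted for a stable row order
    if file.endswith(".csv")
]
df = pd.concat(frames, ignore_index=True)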
# Overview
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36382 entries, 0 to 36381
Data columns (total 13 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Product Name                  36382 non-null  object 
 1   Grade                         36382 non-null  object 
 2   Generic Polymer Type          36382 non-null  object 
 3   Generic Polymer Symbol        36382 non-null  object 
 4   Density                       25865 non-null  float64
 5   Tensile Strength at Yield     18942 non-null  float64
 6   Flexural Modulus              26080 non-null  float64
 7   Flexural Strength             12903 non-null  float64
 8   Tensile Modulus               12543 non-null  float64
 9   Glass Transition Temperature  515 non-null    float64
 10  Melt Mass-Flow Rate (MFR)     25821 non-null  float64
 11  Polymer Code                  7891 non-null   object 
 12  Melting Temperature           1264 non-null   float64
dtypes: float64(8), object(5)

In [58]:
# Discard Glass Transition Temperature, Polymer Code, and Melting Temperature:
# each is missing for most rows
df = df.drop(columns=["Glass Transition Temperature", "Polymer Code", "Melting Temperature"])
# df.dropna(inplace=True)
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36382 entries, 0 to 36381
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Product Name               36382 non-null  object 
 1   Grade                      36382 non-null  object 
 2   Generic Polymer Type       36382 non-null  object 
 3   Generic Polymer Symbol     36382 non-null  object 
 4   Density                    25865 non-null  float64
 5   Tensile Strength at Yield  18942 non-null  float64
 6   Flexural Modulus           26080 non-null  float64
 7   Flexural Strength          12903 non-null  float64
 8   Tensile Modulus            12543 non-null  float64
 9   Melt Mass-Flow Rate (MFR)  25821 non-null  float64
dtypes: float64(6), object(4)
memory usage: 2.8+ MB
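
The three dropped columns could also be selected by their missing-value fraction instead of by name. The sketch below is one possible approach; `drop_sparse_columns` and the 0.75 threshold are assumptions for illustration. With the counts in the first `df.info()` output, the three dropped columns are roughly 78% to 99% missing, while the next sparsest column (Tensile Modulus) is about 65% missing, so a 0.75 cutoff would separate them.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_missing=0.75):
    """Drop columns whose fraction of missing values exceeds `max_missing`."""
    missing_fraction = df.isna().mean()
    sparse = missing_fraction[missing_fraction > max_missing].index
    return df.drop(columns=sparse)


# Toy frame: "Tg" is 80% missing, "Density" is 20% missing (values are made up).
toy = pd.DataFrame(
    {
        "Density": [0.9, 1.05, np.nan, 1.2, 0.95],
        "Tg": [np.nan, np.nan, 105.0, np.nan, np.nan],
    }
)

trimmed = drop_sparse_columns(toy, max_missing=0.75)
print(trimmed.columns.tolist())
```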


In [59]:
# Keep only the polymer symbols of interest


symbols = [
    "ABS",
    "PET",
    "MABS",
    "PP Homopolymer",
    "Acrylic (PMMA)",
    "PC+ABS",
    "PP Copolymer",
    "PS (GPPS)",
    "PEEK",
    "PS (HIPS)",
    "PVC, Rigid",
    "HDPE",
    "LDPE",
]


df = df[df["Generic Polymer Symbol"].isin(symbols)]
# Fill missing densities with a nominal per-polymer value (g/cm³)
# for the polymers listed in the mapping below
density_defaults = {
    "PVC, Rigid": 1.4,
    "PEEK": 1.3,
    "Acrylic (PMMA)": 1.2,
    "LDPE": 0.92,
    "HDPE": 0.95,
}

for polymer, default_density in density_defaults.items():
    mask = df["Generic Polymer Symbol"] == polymer
    df.loc[mask, "Density"] = df.loc[mask, "Density"].fillna(default_density)


df.groupby("Generic Polymer Symbol").count()
#df_clean = df.dropna()
#df_clean.groupby("Generic Polymer Symbol").mean(numeric_only=True)
#df_clean.groupby("Generic Polymer Symbol").count()


| Generic Polymer Symbol | Product Name | Grade | Generic Polymer Type | Density | Tensile Strength at Yield | Flexural Modulus | Flexural Strength | Tensile Modulus | Melt Mass-Flow Rate (MFR) |
|---|---|---|---|---|---|---|---|---|---|
| ABS | 3153 | 3153 | 3153 | 3025 | 1865 | 2693 | 2263 | 1379 | 2075 |
| Acrylic (PMMA) | 622 | 622 | 622 | 622 | 115 | 449 | 371 | 340 | 478 |
| HDPE | 2673 | 2673 | 2673 | 2673 | 1848 | 1776 | 214 | 533 | 2489 |
| LDPE | 1797 | 1797 | 1797 | 1797 | 484 | 334 | 31 | 179 | 1738 |
| MABS | 56 | 56 | 56 | 56 | 39 | 44 | 44 | 37 | 42 |
| PC+ABS | 1734 | 1734 | 1734 | 1691 | 1081 | 1432 | 1083 | 877 | 1071 |
| PEEK | 847 | 847 | 847 | 847 | 256 | 626 | 573 | 717 | 116 |
| PET | 648 | 648 | 648 | 478 | 99 | 292 | 226 | 240 | 66 |
| PP Copolymer | 2980 | 2980 | 2980 | 2844 | 1919 | 2566 | 616 | 792 | 2698 |
| PP Homopolymer | 4708 | 4708 | 4708 | 4082 | 3081 | 3890 | 1237 | 1434 | 4256 |


In [60]:
# Create meaningful groupings for class balancing
def map_polymer_group(symbol):
    if symbol in ["PP Homopolymer", "PP Copolymer"]:
        return "PP"
    elif symbol in ["PS (GPPS)", "PS (HIPS)"]:
        return "PS"
    elif symbol in ["Acrylic (PMMA)", "MABS"]:
        return "Acrylics"
    elif symbol in ["PC+ABS", "ABS"]:
        return "ABS/Blends"
    elif symbol == "HDPE":
        return "HDPE"
    elif symbol == "LDPE":
        return "LDPE"
    elif symbol == "PET":
        return "PET"
    elif symbol == "PEEK":
        return "PEEK"
    elif "pvc" in symbol.lower():
        return "PVC"
    else:
        return "Other"


df["Polymer Group"] = df["Generic Polymer Symbol"].apply(map_polymer_group)

print(df["Polymer Group"].value_counts())


Polymer Group
PP            7688
ABS/Blends    4887
HDPE          2673
LDPE          1797
PS            1148
PVC            986
PEEK           847
Acrylics       678
PET            648
Name: count, dtype: int64
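
Even after grouping, the classes stay imbalanced (648 rows for PET against 7688 for PP). One common way to compensate, sketched here as an assumption and not something the notebook does, is inverse-frequency class weights. The formula below is the same "balanced" heuristic that scikit-learn uses for `class_weight="balanced"`.

```python
import pandas as pd

# Group counts copied from the value_counts() output above.
counts = pd.Series(
    {
        "PP": 7688, "ABS/Blends": 4887, "HDPE": 2673, "LDPE": 1797, "PS": 1148,
        "PVC": 986, "PEEK": 847, "Acrylics": 678, "PET": 648,
    }
)

# "Balanced" heuristic: weight = n_samples / (n_classes * class_count).
class_weights = counts.sum() / (len(counts) * counts)
print(class_weights.round(2))
```

Rare groups such as PET receive the largest weights, so errors on them count more during training. `class_weights.to_dict()` gives a mapping of the form many classifiers accept for per-class weighting.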


In [61]:
from scipy import stats
import numpy as np

def fill_from_normal_distribution(df, feature_col, group_col):
    """Fill missing values by fitting normal distribution to existing data"""
    df_filled = df.copy()
    
    for group in df[group_col].unique():
        group_mask = df[group_col] == group
        existing_values = df.loc[group_mask, feature_col].dropna()
        
        if len(existing_values) >= 2:  # Need at least 2 points to fit distribution
            # Fit normal distribution
            mu, sigma = stats.norm.fit(existing_values)
            
            # Find missing values
            missing_mask = group_mask & df[feature_col].isna()
            n_missing = missing_mask.sum()
            
            if n_missing > 0:
                # Generate from fitted distribution
                generated_values = np.random.normal(mu, sigma, n_missing)
                df_filled.loc[missing_mask, feature_col] = generated_values
    
    return df_filled

# Apply to multiple features
numeric_features = df.select_dtypes(include=[np.number]).columns
for feature in numeric_features:
    if feature != 'Polymer Group':  # Skip target variable
        df = fill_from_normal_distribution(df, feature, 'Polymer Group')
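
Unbounded normal draws can produce physically impossible values, for example a negative melt flow rate when a group's spread is large relative to its mean. The sketch below is one possible guard, not the notebook's method: it draws from a seeded generator and clips the draws to the observed range. The name `fill_clipped_normal` and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def fill_clipped_normal(values, rng):
    """Fill NaNs with normal draws clipped to the observed min/max."""
    observed = values.dropna()
    filled = values.copy()
    missing = values.isna()
    if len(observed) < 2:
        return filled  # too little data to estimate a spread
    # Sample mean and sample std (ddof=1) of the observed values.
    draws = rng.normal(observed.mean(), observed.std(), size=int(missing.sum()))
    filled.loc[missing] = np.clip(draws, observed.min(), observed.max())
    return filled


rng = np.random.default_rng(42)  # seeded generator for repeatable draws
# Toy melt-flow-like values (made up): small mean, large spread.
mfr = pd.Series([0.5, 2.0, 30.0, np.nan, 1.0, np.nan, 4.0])
filled = fill_clipped_normal(mfr, rng)
print(filled)
```

Clipping piles some probability mass onto the observed minimum and maximum, so it trades a small distortion at the edges for values that stay within the measured range.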

In [None]:
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV

def find_optimal_bandwidth(data):
    """Find optimal bandwidth using cross-validation"""
    if len(data) < 5:
        return 0.1  # Default bandwidth for small datasets
    
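    # The grid is in the feature's raw units: 0.01 to 1 suits Density (values near 1)
    # but is very narrow for features in the thousands, such as Flexural Modulus
    bandwidths = np.logspace(-2, 0, 10)
    grid = GridSearchCV(KernelDensity(), {'bandwidth': bandwidths}, cv=3)  # len(data) >= 5 here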
    grid.fit(data.reshape(-1, 1))
    return grid.best_params_['bandwidth']

def fill_from_kde(df, feature_col, group_col):
    """Fill missing values using Kernel Density Estimation"""
    df_filled = df.copy()
    np.random.seed(42)  # For reproducibility
    
    for group in df[group_col].unique():
        group_mask = df[group_col] == group
        existing_values = df.loc[group_mask, feature_col].dropna().values
        
        if len(existing_values) >= 3:  # Need sufficient points for KDE
            # Find optimal bandwidth
            bandwidth = find_optimal_bandwidth(existing_values)
            
            # Fit KDE to existing values
            kde = KernelDensity(bandwidth=bandwidth, kernel='gaussian')
            kde.fit(existing_values.reshape(-1, 1))
            
            # Find missing values
            missing_mask = group_mask & df[feature_col].isna()
            n_missing = missing_mask.sum()
            
            if n_missing > 0:
                # Sample from KDE distribution
                samples = kde.sample(n_missing).flatten()
                df_filled.loc[missing_mask, feature_col] = samples
                
                print(f"KDE: Filled {n_missing} missing values for {group} in {feature_col}")
        else:
            print(f"KDE: Insufficient data for {group} in {feature_col} (only {len(existing_values)} samples)")
    
    return df_filled
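
Sampling from a Gaussian KDE amounts to picking an observed value at random and adding Gaussian noise whose standard deviation is the bandwidth. The NumPy-only sketch below illustrates that idea with a bandwidth that scales with the data. It swaps in Silverman's rule of thumb for the cross-validated search above, which is an assumption made for illustration; the function names and the toy data are hypothetical.

```python
import numpy as np


def silverman_bandwidth(x):
    """Silverman's rule-of-thumb bandwidth for a 1-D Gaussian KDE."""
    x = np.asarray(x, dtype=float)
    q75, q25 = np.percentile(x, [75, 25])
    iqr = q75 - q25
    std = x.std(ddof=1)
    spread = min(std, iqr / 1.34) if iqr > 0 else std
    return 0.9 * spread * len(x) ** (-0.2)


def sample_gaussian_kde(x, n_samples, rng):
    """Draw from a Gaussian KDE: pick observed points, add N(0, h^2) noise."""
    x = np.asarray(x, dtype=float)
    h = silverman_bandwidth(x)
    centers = rng.choice(x, size=n_samples, replace=True)
    return centers + rng.normal(0.0, h, size=n_samples)


rng = np.random.default_rng(0)
# Made-up bimodal "flexural modulus" values around 1500 and 9000.
observed = np.concatenate([rng.normal(1500, 100, 200), rng.normal(9000, 400, 200)])
samples = sample_gaussian_kde(observed, 1000, rng)
print(f"bandwidth: {silverman_bandwidth(observed):.1f}")
```

For data on this scale the rule-of-thumb bandwidth is far larger than 1, which is why a fixed 0.01 to 1 grid may under-smooth such features. Silverman's rule tends to over-smooth multi-modal data, so a cross-validated search over a grid scaled to the data could be a better fit in practice.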

In [None]:
def compare_imputation_methods(df_original, feature_col, group_col):
    """Compare normal distribution vs KDE imputation methods"""
    print(f"\n=== Comparing Imputation Methods for {feature_col} ===")
    
    # Create copies for each method
    df_normal = fill_from_normal_distribution(df_original, feature_col, group_col)
    df_kde = fill_from_kde(df_original, feature_col, group_col)
    
    print(f"\n{feature_col} Statistics by Polymer Group:")
    print("-" * 60)
    
    for group in df_original[group_col].unique():
        original = df_original[df_original[group_col] == group][feature_col].dropna()
        normal_filled = df_normal[df_normal[group_col] == group][feature_col]
        kde_filled = df_kde[df_kde[group_col] == group][feature_col]
        
        if len(original) > 0:
            print(f"\n{group}:")
            print(f"  Original    - Mean: {original.mean():.3f}, Std: {original.std():.3f}, Count: {len(original)}")
            print(f"  Normal Fill - Mean: {normal_filled.mean():.3f}, Std: {normal_filled.std():.3f}, Count: {len(normal_filled)}")
            print(f"  KDE Fill    - Mean: {kde_filled.mean():.3f}, Std: {kde_filled.std():.3f}, Count: {len(kde_filled)}")
            
            # Calculate how many values were imputed
            n_imputed = len(normal_filled) - len(original)
            if n_imputed > 0:
                print(f"  Imputed: {n_imputed} values")
    
    return df_normal, df_kde

In [None]:
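# Snapshot the data for comparison. Note: this is only a true "before" copy if it is
# taken before the imputation loop in the earlier cell runs; after that loop, df has
# few or no missing numeric values left for the methods below to fill.
df_before_imputation = df.copy()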

# Test both methods on density (main feature with missing values)
df_normal_density, df_kde_density = compare_imputation_methods(df_before_imputation, 'Density', 'Polymer Group')

# Test on other numeric features if they have missing values
print("\n" + "="*80)
print("TESTING OTHER NUMERIC FEATURES")
print("="*80)

numeric_features = df_before_imputation.select_dtypes(include=[np.number]).columns
for feature in numeric_features:
    if feature != 'Polymer Group' and df_before_imputation[feature].isna().sum() > 0:
        print(f"\nTesting {feature} (has {df_before_imputation[feature].isna().sum()} missing values)")
        compare_imputation_methods(df_before_imputation, feature, 'Polymer Group')

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

def visualize_imputation_comparison(df_original, df_normal, df_kde, feature_col, group_col):
    """Visualize the distribution differences between imputation methods"""
    
    # Plot a subset of groups to keep the figure readable
    main_groups = ['ABS/Blends', 'PP', 'PS']
    
    fig, axes = plt.subplots(len(main_groups), 3, figsize=(15, 4*len(main_groups)))
    fig.suptitle(f'Distribution Comparison: {feature_col}', fontsize=16)
    
    for i, group in enumerate(main_groups):
        if group in df_original[group_col].values:
            # Get data for each method
            original = df_original[df_original[group_col] == group][feature_col].dropna()
            normal_filled = df_normal[df_normal[group_col] == group][feature_col]
            kde_filled = df_kde[df_kde[group_col] == group][feature_col]
            
            # Original distribution
            axes[i, 0].hist(original, bins=15, alpha=0.7, density=True, color='blue')
            axes[i, 0].set_title(f'{group} - Original Data')
            axes[i, 0].set_ylabel('Density')
            
            # Normal imputation
            axes[i, 1].hist(normal_filled, bins=15, alpha=0.7, density=True, color='green')
            axes[i, 1].set_title(f'{group} - Normal Imputation')
            
            # KDE imputation
            axes[i, 2].hist(kde_filled, bins=15, alpha=0.7, density=True, color='red')
            axes[i, 2].set_title(f'{group} - KDE Imputation')
            
            # Set same x-axis limits for comparison
            x_min = min(original.min(), normal_filled.min(), kde_filled.min()) * 0.95
            x_max = max(original.max(), normal_filled.max(), kde_filled.max()) * 1.05
            for j in range(3):
                axes[i, j].set_xlim(x_min, x_max)
                axes[i, j].set_xlabel(feature_col)
    
    plt.tight_layout()
    plt.show()

# Visualize the comparison for density
print("Generating visualization comparison...")
visualize_imputation_comparison(df_before_imputation, df_normal_density, df_kde_density, 'Density', 'Polymer Group')

In [None]:
# Final decision: Choose the best method for your dataset
print("\n" + "="*80)
print("RECOMMENDATION")
print("="*80)

print("""
Based on the comparison above:

1. NORMAL DISTRIBUTION IMPUTATION:
   - Simpler and faster
   - Good when data is approximately normal
   - Preserves mean but may not preserve original variance exactly
   - Better for smaller datasets

2. KDE IMPUTATION:
   - More flexible, no distribution assumptions
   - Better preserves original data shape
   - Can handle skewed or multi-modal distributions
   - Requires more data per group (ideally 20+ samples)

For this polymer classification:
- Use NORMAL imputation for groups with < 20 observed values
- Use KDE imputation for well-represented groups (ABS/Blends, PP)
- Both methods keep the within-group spread, which simple mean imputation collapses

Next steps:
1. Choose one method and apply to your full dataset
2. Proceed with model training and evaluation
3. Compare model performance with different imputation strategies
""")

# Apply the chosen method (let's use Normal for consistency)
print("Applying normal distribution imputation to the dataset...")
numeric_features = df.select_dtypes(include=[np.number]).columns
for feature in numeric_features:
    if feature != 'Polymer Group':  # Skip target variable
        df = fill_from_normal_distribution(df, feature, 'Polymer Group')

print("Imputation completed!")
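
The comparison with mean imputation in the recommendation above can be illustrated numerically. The sketch below uses made-up data and is not taken from the notebook: filling every gap with the mean shrinks the standard deviation, while filling with random draws from a fitted normal keeps it close to the observed value.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up feature: 1000 values, of which roughly 40% are treated as missing.
values = rng.normal(50.0, 10.0, size=1000)
missing = rng.random(1000) < 0.4
observed = values[~missing]

# Mean imputation: every gap gets the same number.
mean_filled = values.copy()
mean_filled[missing] = observed.mean()

# Random-draw imputation: gaps get draws from a normal fitted to the observed data.
draw_filled = values.copy()
draw_filled[missing] = rng.normal(
    observed.mean(), observed.std(), size=int(missing.sum())
)

print(f"observed std:    {observed.std():.2f}")
print(f"mean-filled std: {mean_filled.std():.2f}")
print(f"draw-filled std: {draw_filled.std():.2f}")
```

A classifier trained on mean-filled features sees an artificial spike at each group mean, which can make groups look more separable than the measured data supports.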

In [67]:

df.dropna(inplace=True)

df.groupby("Polymer Group").mean(numeric_only=True).round(2)

| Polymer Group | Density | Tensile Strength at Yield | Flexural Modulus | Flexural Strength | Tensile Modulus | Melt Mass-Flow Rate (MFR) |
|---|---|---|---|---|---|---|
| ABS/Blends | 1.14 | 50.51 | 3158.88 | 84.23 | 3232.82 | 16.21 |
| Acrylics | 1.17 | 55.39 | 2593.43 | 91.48 | 2646.34 | 6.89 |
| HDPE | 0.95 | 26.65 | 1262.68 | 41.06 | 1508.06 | 5.21 |
| LDPE | 0.92 | 10.77 | 246.57 | 24.32 | 244.12 | 11.78 |
| PEEK | 1.42 | 106.22 | 9981.13 | 219.66 | 10467.99 | 15.85 |
| PET | 1.51 | 100.18 | 8877.25 | 193.23 | 9565.78 | 31.36 |
| PP | 1.03 | 32.79 | 2407.71 | 65.68 | 3082.12 | 23.47 |
| PS | 1.06 | 30.7 | 2612.28 | 60.3 | 2571.92 | 7.73 |
| PVC | 1.4 | 45.42 | 2782.43 | 77.81 | 2718.13 | 84.72 |
