
Case Study – 1
Domain – Chemical Industry
Focus – Classify chemicals
Business challenge/requirement
FuPont is a leading global chemical company. The company is on a CSR
(Corporate Social Responsibility) mission. It wants to identify biodegradable
products based on a study of the relationships between chemical structure and
biodegradation of molecules.
As an ML expert, you have to create an ML model that classifies each chemical
structure as 'Ready Biodegradable' (RB) or 'Not Ready Biodegradable' (NRB).
Key issues
The data has many attributes, so classification could be tricky.
Considerations
NONE
Data volume
- Approx. 1,055 records – file bio-degradable-data.csv
Fields in Data
- Details in .ipynb notebook
Additional information
- NA
Business benefits
This research could help FuPont create truly unique biodegradable packaging
material, which could lead to massive profits in the future.
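
The key issue noted above (many attributes, two classes) can be inspected before any modelling. The following is a minimal sketch, not part of the original notebook: `summarize` is a hypothetical helper, the target column name `class` and the 0.9 correlation threshold are assumptions, and the data is a small synthetic stand-in rather than the real file.

```python
import numpy as np
import pandas as pd


def summarize(df, target="class", corr_threshold=0.9):
    """Return class proportions and pairs of highly correlated numeric features."""
    class_share = df[target].value_counts(normalize=True)

    features = df.drop(columns=[target]).select_dtypes(include="number")
    corr = features.corr().abs()

    # Keep only the upper triangle so each feature pair is listed once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    high = pairs[pairs > corr_threshold]
    return class_share, high


# Synthetic stand-in data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.random((200, 5)),
    columns=[f"feature_{i}" for i in range(1, 6)],
)
# A near-duplicate of feature_1, to show what the check flags.
demo["feature_6"] = demo["feature_1"] * 2 + 0.001 * rng.random(200)
demo["class"] = rng.choice(["RB", "NRB"], size=200)

class_share, high_pairs = summarize(demo)
print(class_share)
print(high_pairs)
```

Strongly correlated attribute pairs are candidates for removal or for a dimensionality-reduction step, and a skewed class share is a reason to look beyond plain accuracy.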

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [None]:
try:
    # Try loading the actual dataset
    df = pd.read_csv("bio-degradable-data.csv")
    print("Dataset loaded successfully!")

except FileNotFoundError:
    print("File not found. Creating a dummy dataset...")

    # Create dummy data similar to biodegradability dataset
    np.random.seed(42)
    X_dummy = np.random.rand(1055, 41)   # 41 chemical attributes
    y_dummy = np.random.choice(["RB", "NRB"], size=1055)

    columns = [f"feature_{i}" for i in range(1, 42)]
    df = pd.DataFrame(X_dummy, columns=columns)
    df["class"] = y_dummy

    print("Dummy dataset created successfully!")

df.head()

In [None]:
print("Shape:", df.shape)
df.info()

In [None]:
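# Assumes the class label is the last column, as in the dummy dataset above.
X, y = df.iloc[:, :-1], df.iloc[:, -1]
imputer = SimpleImputer(strategy="mean")
X = imputer.fit_transform(X)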

In [None]:
encoder = LabelEncoder()
y = encoder.fit_transform(y)

print("Class Mapping:")
print(dict(zip(encoder.classes_, encoder.transform(encoder.classes_))))

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
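
The imputation, scaling, and model-fitting steps can also be chained in a single scikit-learn `Pipeline`, so that every preprocessing statistic is learned from the training split only. This is a sketch of one possible arrangement, not the notebook's own method: the data comes from `make_classification` as a stand-in for the real attributes, the 2% missing-value rate is arbitrary, and all `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the case-study data (assumption).
X_demo, y_demo = make_classification(
    n_samples=1055, n_features=41, n_informative=10, random_state=42
)

# Knock out about 2% of the values so the imputer has something to do.
rng = np.random.default_rng(42)
X_demo[rng.random(X_demo.shape) < 0.02] = np.nan

X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Each step is fitted on the training split only when pipe.fit is called.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train_demo, y_train_demo)

demo_accuracy = pipe.score(X_test_demo, y_test_demo)
print("Pipeline accuracy on synthetic data:", demo_accuracy)
```

Swapping the final step for `RandomForestClassifier` or `SVC` leaves the preprocessing untouched, which makes the model comparison easier to keep consistent.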

In [None]:
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

y_pred_lr = lr.predict(X_test)

print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_lr))
print(classification_report(y_test, y_pred_lr))

In [None]:
rf = RandomForestClassifier(
    n_estimators=200,
    random_state=42,
    class_weight="balanced"
)

rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))

In [None]:
svm = SVC(kernel="rbf", C=1, gamma="scale")
svm.fit(X_train, y_train)

y_pred_svm = svm.predict(X_test)

print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
print(classification_report(y_test, y_pred_svm))

In [None]:
results = pd.DataFrame({
    "Model": ["Logistic Regression", "Random Forest", "SVM"],
    "Accuracy": [
        accuracy_score(y_test, y_pred_lr),
        accuracy_score(y_test, y_pred_rf),
        accuracy_score(y_test, y_pred_svm)
    ]
})

results
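
Accuracy alone does not show how the errors split between the two classes. `confusion_matrix`, which is imported at the top of the notebook but not used, gives that breakdown. A minimal sketch on hypothetical labels, assuming the encoding 0 = NRB and 1 = RB (`LabelEncoder` sorts class names alphabetically):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical encoded labels for illustration: 0 = NRB, 1 = RB.
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print("NRB predicted as RB:", fp)
print("RB predicted as NRB:", fn)
```

The same call works on the notebook's own predictions, for example `confusion_matrix(y_test, y_pred_rf)`.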