This code performs a binary classification task: it predicts whether a startup's profit is above the dataset's mean profit, based on the features "R&D Spend," "Administration," "Marketing Spend," and "State."

**Importing Libraries**<br>
- Imports pandas for data manipulation, matplotlib.pyplot for plotting, seaborn for statistical visualization, and the scikit-learn modules used for splitting, preprocessing, modeling, and evaluation.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, roc_auc_score

**Loading the Dataset**<br>
- Loads the dataset "50_Startups.csv" into a pandas DataFrame called data.

In [None]:
# Load the dataset
data = pd.read_csv("../Datasets/50_Startups.csv")

**Data Preprocessing**<br>
- Separates the features (X) from the target variable (y) in the dataset.
- Binarizes the target variable (y) using its mean as the threshold: 1 if Profit is above the mean, 0 otherwise. This turns the regression target into a binary classification target.
- Defines numerical and categorical features.
- Sets up preprocessing steps for numerical features (scaling) and categorical features (one-hot encoding) using pipelines and column transformers.
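
The steps above can be sketched on a tiny table. The rows below are invented for illustration and are not taken from 50_Startups.csv; the `toy_` names are hypothetical and chosen so they do not overwrite the notebook's own variables.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows invented for illustration (assumption: not real dataset values)
toy = pd.DataFrame({
    "R&D Spend": [100.0, 200.0, 300.0, 400.0],
    "State": ["New York", "California", "Florida", "California"],
    "Profit": [10.0, 20.0, 30.0, 100.0],
})

# Binarize: 1 if profit is above the mean, else 0
toy_threshold = toy["Profit"].mean()  # (10 + 20 + 30 + 100) / 4 = 40.0
toy_labels = (toy["Profit"] > toy_threshold).astype(int)

# Scale the numeric column and one-hot encode the categorical one
toy_preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["R&D Spend"]),
    ("cat", OneHotEncoder(), ["State"]),
])
toy_features = toy_preprocessor.fit_transform(toy.drop(columns=["Profit"]))

print(toy_labels.tolist())  # [0, 0, 0, 1]
print(toy_features.shape)   # (4, 4): 1 scaled column + 3 one-hot columns
```

Note that a single large profit pulls the mean up, so only one of the four toy rows ends up in the positive class. A mean threshold does not guarantee balanced classes.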

In [None]:
# Separate features and target variable
X = data.drop(columns=["Profit"])
y = data["Profit"]

# Binarize the target variable
threshold = y.mean()  # Use mean as threshold for binary classification
y_binary = (y > threshold).astype(int)

# Define numerical and categorical features
numeric_features = ["R&D Spend", "Administration", "Marketing Spend"]
categorical_features = ["State"]

# Define preprocessing steps for numerical and categorical features
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))  # ignore any state not seen during fitting
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

**Model Definition and Training**<br>
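- Defines a single pipeline that chains the preprocessor with a logistic regression classifier, so the preprocessing steps are fit together with the model.
- Splits the dataset into training (80%) and testing (20%) sets using train_test_split with a fixed random_state for reproducibility.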
- Fits the logistic regression model to the training data (X_train and y_train).
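
A minimal sketch of why the pipeline is fit on the training split only, assuming synthetic data in place of the real dataset (the `demo` names and values are hypothetical). With 50 rows and test_size=0.2, the split is 40/10, and the scaler inside the pipeline learns its mean from the 40 training rows alone.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 50 rows with one feature and a made-up binary label
X_demo = np.linspace(0.0, 100.0, 50).reshape(-1, 1)
y_demo = (X_demo[:, 0] > 50.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Scaling and classification are chained, so fit() touches training rows only
demo_model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])
demo_model.fit(X_tr, y_tr)

# The scaler's learned mean matches the training mean, not the full-data mean
fitted_mean = demo_model.named_steps["scaler"].mean_[0]
print(X_tr.shape, X_te.shape)                     # (40, 1) (10, 1)
print(bool(np.isclose(fitted_mean, X_tr.mean())))  # True
```

Because the test rows never influence the scaler, the test-set score is a fairer estimate of performance on unseen data.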

In [None]:
# Define the logistic regression model
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('classifier', LogisticRegression())])

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.2, random_state=42)

# Fit the model
model.fit(X_train, y_train)

**Model Evaluation**<br>
- Makes predictions on the test data (X_test).
- Calculates the accuracy score of the model using accuracy_score.
- Calculates the confusion matrix to evaluate the model's performance further.
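
To make the relationship between the two metrics concrete, here is a small sketch using hand-written labels (an assumption for illustration, not model output). For binary labels, scikit-learn lays out the confusion matrix as [[TN, FP], [FN, TP]], and accuracy is the share of predictions on the diagonal.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-written labels for illustration only (hypothetical, not from the model)
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm_demo.ravel()  # row = actual class, column = predicted class

acc_demo = accuracy_score(y_true_demo, y_pred_demo)
acc_from_cm = (tp + tn) / cm_demo.sum()  # correct predictions / all predictions

print(tn, fp, fn, tp)  # 3 1 2 4
print(acc_demo)        # 0.7
```

Accuracy alone hides which kind of error dominates; here the two false negatives and the one false positive are visible only in the matrix.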

In [None]:
# Make predictions
y_pred = model.predict(X_test)

# Calculate accuracy: the fraction of test samples classified correctly
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Calculate confusion matrix
cm = confusion_matrix(y_test, y_pred)

**Visualization**<br>
- Plots the Receiver Operating Characteristic (ROC) curve, which shows the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity) as the classification threshold varies.
    - AUC (Area Under the Curve) summarizes the performance of the classifier across all possible classification thresholds; 0.5 corresponds to random guessing and 1.0 to perfect separation.
- Plots a scatter plot of the test samples on two features (R&D Spend and Marketing Spend), colored by predicted class.
- Plots a heatmap of the confusion matrix to visualize the model's performance in terms of true positives, true negatives, false positives, and false negatives.
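
As a rough illustration of what AUC measures, the sketch below uses four made-up scores (hypothetical values, not model output). It compares roc_auc_score with a manual count of how often a positive sample receives a higher score than a negative one, which is one standard interpretation of AUC.

```python
from itertools import product

from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels and scores for illustration only
y_true_roc = [0, 0, 1, 1]
y_score_roc = [0.1, 0.4, 0.35, 0.8]

# Each threshold on the score yields one (FPR, TPR) point on the curve
fpr_demo, tpr_demo, thr_demo = roc_curve(y_true_roc, y_score_roc)
auc_demo = roc_auc_score(y_true_roc, y_score_roc)

# Manual check: fraction of (positive, negative) pairs ranked correctly
pos_scores = [s for s, t in zip(y_score_roc, y_true_roc) if t == 1]
neg_scores = [s for s, t in zip(y_score_roc, y_true_roc) if t == 0]
pairs = list(product(pos_scores, neg_scores))
auc_manual = sum(p > n for p, n in pairs) / len(pairs)

print(auc_demo)    # 0.75
print(auc_manual)  # 0.75 (3 of 4 pairs ranked correctly)
```

With only about ten test samples, as in this notebook's 80/20 split of 50 rows, the ROC curve has few steps and the AUC estimate can change noticeably from one split to another.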

In [None]:
# Predicted probability of the positive class (profit above the mean)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Calculate ROC curve and AUC
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = roc_auc_score(y_test, y_pred_proba)

# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC Curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

# Scatter plot for binary classification predictions
plt.figure(figsize=(8, 6))
sns.scatterplot(x=X_test['R&D Spend'], y=X_test['Marketing Spend'], hue=model.predict(X_test), palette="Set1")
plt.title("Predicted Class (1 = Profit Above Mean, 0 = Below)")
plt.xlabel("R&D Spend")
plt.ylabel("Marketing Spend")
plt.legend(title="Prediction", loc="upper right")
plt.show()

# Confusion matrix
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

# Plot confusion matrix heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()