### Business Case: Sales Effectiveness Optimization for FicZon Inc
- FicZon Inc is a technology-driven IT solutions provider offering a range of products, from on-premise software to SaaS-based platforms. The company's primary source of business leads comes through digital channels, particularly its official website.

- However, with increasing market competition and sales stagnation, FicZon faces a growing need to optimize its sales process. One key challenge is that lead quality assessment is highly manual, subjective, and dependent on individual sales staff expertise. While there is an internal quality process for lead categorization, its use is limited to post-sale analysis, offering minimal support to improve real-time conversion rates.

- To tackle this, FicZon wants to leverage Machine Learning (ML) to pre-categorize leads based on historical sales and customer interaction data. This proactive categorization is expected to enhance sales effectiveness, reduce wasted efforts, and drive revenue growth.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('sales_data.csv')
df

In [None]:
pd.set_option('display.max_rows',None)

In [None]:
df.drop(['Mobile','EMAIL'],axis=1,inplace=True)

### Domain Analysis

### Creating the Target Column (High or Low Potential) from the Status Column

### 0 = low potential, 1 = high potential

In [None]:
# Define high and low potential statuses
high_potential = [
    'CONVERTED', 'converted', 'Potential',
    'In Progress Positive'
]

low_potential = [
    'Junk Lead', 'Not Responding', 'Just Enquiry',
    'In Progress Negative', 'LOST','Long Term' , 'Open'
]

# Function to map status to binary label
def map_status_to_target(status):
    if status in high_potential:
        return 1  # High Potential
    elif status in low_potential:
        return 0  # Low Potential
    else:
        return None  # Status not covered by either list

# Apply the function to create new column
df['Target'] = df['Status'].apply(map_status_to_target)

In [None]:
df = df.dropna(subset=['Target'])  # rows with an unmapped status cannot be cast to int
df['Target'] = df['Target'].astype(int)
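
A quick audit of the mapping can show which statuses, if any, are not covered by either list. The sketch below runs on a small made-up frame; the status `'On Hold'` is a hypothetical uncovered value used only for illustration, not one known to occur in the dataset.

```python
import pandas as pd

# Toy data; 'On Hold' is a hypothetical status that neither list covers.
toy = pd.DataFrame({'Status': ['CONVERTED', 'Junk Lead', 'On Hold', 'Potential', 'On Hold']})

high_potential = {'CONVERTED', 'converted', 'Potential', 'In Progress Positive'}
low_potential = {'Junk Lead', 'Not Responding', 'Just Enquiry',
                 'In Progress Negative', 'LOST', 'Long Term', 'Open'}

# Build one status -> label dictionary; statuses missing from it map to NaN.
label_map = {**{s: 1 for s in high_potential}, **{s: 0 for s in low_potential}}
toy['Target'] = toy['Status'].map(label_map)

# List the statuses that were left without a label, with their row counts.
unmapped = toy.loc[toy['Target'].isna(), 'Status'].value_counts()
print(unmapped)
```

If `unmapped` is empty on the real data, the integer cast is safe without dropping any rows.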

In [None]:
df.drop(['Status'],axis=1,inplace=True)

### Basic Checks

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.describe(include='O')

In [None]:
df.Target.value_counts()

### Handling Null Values

In [None]:
df.isnull().sum()

In [None]:
df.loc[df.Product_ID.isnull()]

In [None]:
np.nanmedian(df.Product_ID)

In [None]:
df.loc[df.Product_ID.isnull(),'Product_ID'] = np.nanmedian(df.Product_ID)

In [None]:
df.loc[[0,1,2,3,4,5,6,7,8,9]]

In [None]:
df.loc[df.Location.isnull(),'Location'] = 'Other Locations'

In [None]:
df.loc[[0,1,2,3,4,5,7]]

In [None]:
df.loc[df.Source.isnull(),'Source'] = 'Call'

In [None]:
df.loc[df.Sales_Agent.isnull(),'Sales_Agent'] = 'Sales-Agent-4'

In [None]:
df.isnull().sum()
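
The fills above use fixed values. If the intent is to fill with each column's most frequent category (an assumption; the notebook does not say how the values were chosen), the mode can be computed from the data instead of being hard-coded. A minimal sketch on made-up rows:

```python
import numpy as np
import pandas as pd

# Toy frame; the category names here are illustrative only.
toy = pd.DataFrame({
    'Source': ['Call', 'Call', np.nan, 'Website'],
    'Location': ['Bangalore', np.nan, 'Bangalore', 'Chennai'],
})

# Fill each categorical column with its own most frequent value.
for col in ['Source', 'Location']:
    most_frequent = toy[col].mode().iloc[0]
    toy[col] = toy[col].fillna(most_frequent)

print(toy.isnull().sum())
```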

### Converting the Created Column to Datetime

In [None]:
df.info()

In [None]:
df.Created = pd.to_datetime(df['Created'])

In [None]:
df.Product_ID = df.Product_ID.astype('int32')

In [None]:
df.info()

### EDA

#### Univariate Analysis

In [None]:
df.head()

In [None]:
plt.figure(figsize = (12,5))
sns.countplot(x=df.Product_ID)

In [None]:
plt.figure(figsize=(10,10))
sns.countplot(x=df.Source)
plt.xticks(rotation=90)
plt.tight_layout()

In [None]:
plt.figure(figsize=(10,10))
sns.countplot(x=df.Sales_Agent)
plt.xticks(rotation=90)
plt.tight_layout()

In [None]:
plt.figure(figsize=(10,10))
sns.countplot(x=df.Location)
plt.xticks(rotation=90)
plt.tight_layout()

In [None]:
plt.figure(figsize=(10,10))
sns.countplot(x=df.Delivery_Mode)

In [None]:
sns.countplot(x=df.Target)

In [None]:
df['Created'] = pd.to_datetime(df['Created'])

# Group by day
daily_counts = df.set_index('Created').resample('D').size()

# Plot
daily_counts.plot(figsize=(12, 5))
plt.title("Number of Records Over Time")
plt.ylabel("Count")
plt.xlabel("Date")
plt.tight_layout()
plt.show()

#### Bivariate Analysis

In [None]:
plt.figure(figsize = (12,5))
sns.countplot(x=df.Product_ID,hue=df.Target)

In [None]:
plt.figure(figsize=(10,10))
sns.countplot(x=df.Source,hue=df.Target)
plt.xticks(rotation=90)
plt.tight_layout()

In [None]:
plt.figure(figsize=(10,10))
sns.countplot(x=df.Sales_Agent,hue=df.Target)
plt.xticks(rotation=90)
plt.tight_layout()

In [None]:
plt.figure(figsize=(10,10))
sns.countplot(x=df.Location,hue=df.Target)
plt.xticks(rotation=90)
plt.tight_layout()

In [None]:
plt.figure(figsize=(10,10))
sns.countplot(x=df.Delivery_Mode,hue=df.Target)

In [None]:
df_sorted = df.sort_values('Created')
df_sorted['rolling_mean'] = df_sorted['Target'].rolling(window=100).mean()

plt.figure(figsize=(14, 5))
plt.plot(df_sorted['Created'], df_sorted['rolling_mean'])
plt.title("Rolling Conversion Rate Over Time")
plt.xlabel("Time")
plt.ylabel("Rolling Mean of Target (Conversion Rate)")
plt.tight_layout()
plt.show()

### Data Preprocessing

##### Handling Null Values
- Done Before EDA

##### Handling Categorical Data

In [None]:
df.head()

In [None]:
# Convert 'Created' to datetime and extract time-based features
df['Created'] = pd.to_datetime(df['Created'])
df['Hour'] = df['Created'].dt.hour
df['Day'] = df['Created'].dt.day
df['Month'] = df['Created'].dt.month
df['Weekday'] = df['Created'].dt.weekday
df.drop('Created', axis=1, inplace=True)

In [None]:
# -------------------------------
# 1. Combine rare categories (fewer than 50 rows) into 'Other'
location_counts = df['Location'].value_counts()
rare_locations = location_counts[location_counts < 50].index
df['Location'] = df['Location'].replace(rare_locations, 'Other')

# Same treatment for 'Delivery_Mode'
mode_counts = df['Delivery_Mode'].value_counts()
rare_modes = mode_counts[mode_counts < 50].index
df['Delivery_Mode'] = df['Delivery_Mode'].replace(rare_modes, 'Other')

# -------------------------------
# 2. Frequency encoding for 'Sales_Agent'
freq_sales = df['Sales_Agent'].value_counts(normalize=True)
df['Sales_Agent_Freq'] = df['Sales_Agent'].map(freq_sales)

# Drop the original 'Sales_Agent' column
df.drop(columns=['Sales_Agent'], inplace=True)

# -------------------------------
# 3. One-hot encoding for ['Source', 'Location', 'Delivery_Mode']
df = pd.get_dummies(df, columns=['Source', 'Location', 'Delivery_Mode'], drop_first=True)

print(df.head())
print("Shape after encoding:", df.shape)
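
The rare-category step is repeated for two columns, so it can be factored into a small helper. The function name `collapse_rare` and its parameters are hypothetical, introduced here for illustration only.

```python
import pandas as pd

def collapse_rare(series, min_count, other_label='Other'):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; substitute other_label for the rest.
    return series.where(~series.isin(rare), other_label)

# Toy series: 'C' and 'D' each appear once and fall below the threshold.
toy = pd.Series(['A'] * 5 + ['B'] * 3 + ['C'] + ['D'])
collapsed = collapse_rare(toy, min_count=3)
print(collapsed.value_counts())
```

On the notebook's data the equivalent call would be something like `collapse_rare(df['Location'], min_count=50)`.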


In [None]:
df.head()

In [None]:
# Convert all boolean columns to integers (True → 1, False → 0)
df = df.astype({col: 'int' for col in df.select_dtypes('bool').columns})

In [None]:
df.head()

In [None]:
print(df.columns.tolist())

In [None]:
df.rename(columns={
    # Frequency encoding columns
    'Sales_Agent_Freq': 'Sales_Agent',
    
    # Fixing spaces or long names in 'Source_' features
    'Source_CRM form': 'Source_CRM',
    'Source_E-Mail Message': 'Source_Email_Message',
    'Source_E-mail Campaign': 'Source_Email_Campaign',
    'Source_Live Chat-Adwords Remarketing': 'Source_Adwords_Remarketing',
    'Source_Live Chat-Blog': 'Source_Blog',
    'Source_Live Chat-CPC': 'Source_CPC',
    'Source_Live Chat-Direct': 'Source_Direct',
    'Source_Live Chat-Google Ads': 'Source_Google_Ads',
    'Source_Live Chat-Google Organic': 'Source_Google_Organic',
    'Source_Live Chat-Justdial': 'Source_Justdial',
    'Source_Live Chat-Quora': 'Source_Quora',
    'Source_Live Chat-Youtube': 'Source_Youtube',

    # Fixing location columns
    'Location_Other Locations': 'Location_OtherLoc',
    
    # Delivery Mode columns
    'Delivery_Mode_Mode-3': 'Delivery_Mode_3',
    'Delivery_Mode_Mode-4': 'Delivery_Mode_4',
    'Delivery_Mode_Mode-5': 'Delivery_Mode_5'
    
}, inplace=True)


In [None]:
print(df.columns.tolist())

In [None]:
df.info()

#### Handling outliers

##### We don't handle outliers for categorical data

### Scaling

In [None]:
df.head()

In [None]:
from sklearn.preprocessing import StandardScaler
std = StandardScaler()

In [None]:
df.Product_ID = std.fit_transform(df[['Product_ID']])

In [None]:
df.Hour = std.fit_transform(df[['Hour']])

In [None]:
df.Day = std.fit_transform(df[['Day']])

In [None]:
df.Month = std.fit_transform(df[['Month']])

In [None]:
df.head()
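
Fitting the scaler on the full frame lets test-set statistics influence the training features. One common alternative, sketched here on synthetic data under the assumption that the split happens first, is to fit `StandardScaler` on the training rows only and reuse it for the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's numeric columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Hour': rng.integers(0, 24, size=200).astype(float),
    'Day': rng.integers(1, 29, size=200).astype(float),
    'Target': rng.integers(0, 2, size=200),
})

num_cols = ['Hour', 'Day']
X_train, X_test, y_train, y_test = train_test_split(
    toy[num_cols], toy['Target'], test_size=0.25, random_state=42)

# Fit on the training rows only, then apply the same transform to the test rows.
scaler = StandardScaler()
X_train = X_train.copy()
X_test = X_test.copy()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

print(X_train.describe().round(2))
```

Using a single scaler for all numeric columns also keeps the fitted means and variances available for scoring new leads later.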

### Feature Selection

In [None]:
df.corr()

In [None]:
plt.figure(figsize=(15,15))
sns.heatmap(df.corr())

### Model Creation

In [None]:
x=df.drop('Target',axis=1)
x

In [None]:
y = df.Target
y

In [None]:
df.Target.value_counts()

### Balancing

In [None]:
from imblearn.over_sampling import SMOTE
sm = SMOTE()

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.25,random_state=42)

In [None]:
x_train

In [None]:
x_test

In [None]:
y_train

In [None]:
y_test

In [None]:
x_sm,y_sm = sm.fit_resample(x_train,y_train)
x_sm,y_sm

In [None]:
print(y_train.value_counts()) # before balancing
print(y_sm.value_counts())  # after balancing

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, classification_report

In [None]:
lr= LogisticRegression(class_weight='balanced')

In [None]:
lr.fit(x_sm,y_sm)

In [None]:
y_pred_lr =lr.predict(x_test)

In [None]:
y_pred_lr

In [None]:
np.array(y_test)

In [None]:
accuracy_score(y_test,y_pred_lr)

In [None]:
precision_score(y_test,y_pred_lr)

In [None]:
recall_score(y_test,y_pred_lr)

In [None]:
f1_score(y_test,y_pred_lr)

In [None]:
roc_auc_score(y_test, lr.predict_proba(x_test)[:, 1])  # AUC from probabilities, not hard labels

In [None]:
confusion_matrix(y_test,y_pred_lr)

In [None]:
print(classification_report(y_test,y_pred_lr))
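
ROC AUC measures how well scores rank positives above negatives, so it is normally computed from predicted probabilities; hard 0/1 labels collapse the curve to a single operating point. The sketch below computes both on synthetic data from `make_classification` (not the sales data) so the two numbers can be compared.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem used only for illustration.
X, y = make_classification(n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=500).fit(X_tr, y_tr)

# AUC from hard class labels (one threshold only).
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))
# AUC from the probability of the positive class (all thresholds).
auc_from_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print("AUC from labels:", auc_from_labels)
print("AUC from probabilities:", auc_from_scores)
```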

### Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
model = LogisticRegression(max_iter=200)

In [None]:
scores = cross_val_score(model, x_sm, y_sm, cv=5, scoring='accuracy')
print("Cross-Validation accuracy Scores:", scores)
print("Average accuracy Score:", scores.mean())
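
Cross-validating on already-oversampled data lets synthetic points derived from one fold's rows appear in another fold, which tends to inflate the scores. One alternative is to cross-validate on the untouched training data with stratified folds. The sketch below swaps SMOTE for class weighting so that it needs only scikit-learn; it runs on synthetic data and is not the notebook's own method.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced binary problem used only for illustration.
X, y = make_classification(n_samples=600, n_features=10, weights=[0.85, 0.15], random_state=0)

# Scaling and the classifier are refit inside every fold, so no fold sees
# statistics from its own validation rows.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight='balanced', max_iter=500),
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='f1')

print("Cross-validation F1 scores:", scores)
print("Average F1 score:", scores.mean())
```

To keep SMOTE itself, the same idea applies with `imblearn.pipeline.Pipeline`, which resamples only the training part of each fold.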

### Hyperparameter Tuning

In [None]:
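# Only valid solver/penalty pairs are listed; 'elasticnet' is left out because it also needs l1_ratio
param_dist = [
    {'solver': ['lbfgs'], 'penalty': ['l2', None],
     'C': [0.01, 0.1, 1, 10, 100], 'max_iter': [100, 200, 500]},
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'],
     'C': [0.01, 0.1, 1, 10, 100], 'max_iter': [100, 200, 500]},
    {'solver': ['saga'], 'penalty': ['l1', 'l2', None],
     'C': [0.01, 0.1, 1, 10, 100], 'max_iter': [100, 200, 500]},
]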

In [None]:
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(
    LogisticRegression(),
    param_grid=param_dist,
    scoring='accuracy',           # or 'roc_auc'
    n_jobs=-1,
    cv=5,
    verbose=2
)


In [None]:
grid_search.fit(x_sm, y_sm)

In [None]:
best_model = grid_search.best_estimator_
print("Best Parameters:", grid_search.best_params_)

In [None]:
best_lr_model = grid_search.best_estimator_
y_pred_log = best_lr_model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred_log)
print("Test Accuracy with Best Hyperparameters:", accuracy)

### Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
dt = DecisionTreeClassifier(random_state=42)
dt.fit(x_sm, y_sm)

In [None]:
y_pred_dt = dt.predict(x_test)

In [None]:
# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred_dt))
print("Precision:", precision_score(y_test, y_pred_dt))
print("Recall:", recall_score(y_test, y_pred_dt))
print("F1 Score:", f1_score(y_test, y_pred_dt))
print("ROC AUC:", roc_auc_score(y_test, dt.predict_proba(x_test)[:, 1]))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_dt))
print("\nClassification Report:\n", classification_report(y_test, y_pred_dt))
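
The same five metrics are printed for every model, so a small helper that returns them as a dictionary makes side-by-side comparison easier. The function name `evaluate` is hypothetical, and the example fits two models on synthetic data rather than on the sales data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def evaluate(model, X_test, y_test):
    """Return headline test metrics for a fitted binary classifier."""
    y_pred = model.predict(X_test)
    y_score = model.predict_proba(X_test)[:, 1]  # probability of class 1
    return {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred, zero_division=0),
        'recall': recall_score(y_test, y_pred),
        'f1': f1_score(y_test, y_pred),
        'roc_auc': roc_auc_score(y_test, y_score),
    }

# Synthetic binary problem used only for illustration.
X, y = make_classification(n_samples=400, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

models = {
    'logreg': LogisticRegression(max_iter=500).fit(X_tr, y_tr),
    'tree': DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr),
}

# One row per model, one column per metric.
results = pd.DataFrame({name: evaluate(m, X_te, y_te) for name, m in models.items()}).T
print(results.round(3))
```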

### Cross Validation

In [None]:
scores = cross_val_score(dt, x_sm, y_sm, cv=5, scoring='accuracy')

In [None]:
print("Cross-Validation accuracy Scores:", scores)

In [None]:
print("Average accuracy Score:", scores.mean())

### Hyperparameter Tuning

In [None]:
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 5, 10, 15, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [None, 'sqrt', 'log2']
}

In [None]:
grid_search = GridSearchCV(estimator=dt,
                           param_grid=param_grid,
                           cv=5,                # 5-fold cross-validation
                           scoring='f1',        # or 'roc_auc' / 'accuracy'
                           n_jobs=-1,           # use all CPU cores
                           verbose=1)

grid_search.fit(x_sm, y_sm)

In [None]:
print("Best Parameters:", grid_search.best_params_)

best_model = grid_search.best_estimator_

# Evaluate the tuned tree (not the untuned predictions in y_pred_dt)
y_pred = best_model.predict(x_test)

print(classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, best_model.predict_proba(x_test)[:, 1]))

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(x_train, y_train)
y_pred_rf = rf.predict(x_test)

acc = accuracy_score(y_test, y_pred_rf)
pr = precision_score(y_test, y_pred_rf)
re = recall_score(y_test, y_pred_rf)
f1 = f1_score(y_test, y_pred_rf)
roc = roc_auc_score(y_test, rf.predict_proba(x_test)[:, 1])
cm = confusion_matrix(y_test, y_pred_rf)
ct = pd.crosstab(y_test, y_pred_rf)

print('Accuracy score: ',acc )
print('Precision score: ',pr )
print('recall score: ',re )
print('f1 score: ',f1 )
print('roc and auc score: ',roc)
print(cm)
print(ct)

### Cross Validation

In [None]:
scores = cross_val_score(rf, x_train, y_train, cv=5, scoring='f1')  # you can also use 'accuracy', 'roc_auc', etc.
print("Cross-Validation F1 Scores:", scores)
print("Average F1 Score:", scores.mean())

### Hyperparameter Tuning

In [None]:
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [None, 5, 10, 15, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

rf = RandomForestClassifier(random_state=42)

random_search = RandomizedSearchCV(
    rf, 
    param_distributions=param_dist,
    n_iter=20,          # Try 20 random combinations
    cv=5,               # 5-fold cross-validation
    scoring='f1',       # or 'roc_auc', 'accuracy'
    verbose=2,
    random_state=42,
    n_jobs=-1           # Use all processors
)

random_search.fit(x_train, y_train)

print("Best Parameters:", random_search.best_params_)
print("Best F1 Score:", random_search.best_score_)


In [None]:
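# Evaluate the tuned forest; y_pred_rf came from the untuned model
best_rf_model = random_search.best_estimator_
y_pred_rf_best = best_rf_model.predict(x_test)
print("Accuracy:", accuracy_score(y_test, y_pred_rf_best))
print("Precision:", precision_score(y_test, y_pred_rf_best))
print("Recall:", recall_score(y_test, y_pred_rf_best))
print("F1 Score:", f1_score(y_test, y_pred_rf_best))
print("ROC AUC:", roc_auc_score(y_test, best_rf_model.predict_proba(x_test)[:, 1]))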

### XGBoost

In [None]:
from xgboost import XGBClassifier
xgb = XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42)
xgb.fit(x_sm, y_sm)
y_pred_xgb = xgb.predict(x_test)
# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred_xgb))
print("Precision:", precision_score(y_test, y_pred_xgb))
print("Recall:", recall_score(y_test, y_pred_xgb))
print("F1 Score:", f1_score(y_test, y_pred_xgb))
print("ROC AUC:", roc_auc_score(y_test, xgb.predict_proba(x_test)[:, 1]))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_xgb))
print("\nClassification Report:\n", classification_report(y_test, y_pred_xgb))

### Cross Validation

In [None]:
scores = cross_val_score(xgb, x_sm, y_sm, cv=5, scoring='f1')

print("Cross-Validation F1 Scores:", scores)
print("Average F1 Score:", scores.mean())


### Hyperparameter Tuning

In [None]:
# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2]
}

# Set up grid search with cross-validation
grid_search = GridSearchCV(
    estimator=XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42),
    param_grid=param_grid,
    scoring='f1',
    n_jobs=-1,
    cv=5,
    verbose=2
)

# Fit to the oversampled training data
grid_search.fit(x_sm, y_sm)

# Best model after hyperparameter tuning
best_xgb_model = grid_search.best_estimator_

# Predictions on test set
y_pred_xgb_best = best_xgb_model.predict(x_test)

# Evaluation
print("Test Accuracy with Best Hyperparameters:", accuracy_score(y_test, y_pred_xgb_best))


### LightGBM

In [None]:
from lightgbm import LGBMClassifier
# Train initial model
lgbm = LGBMClassifier()
lgbm.fit(x_sm, y_sm)
y_pred_lgbm = lgbm.predict(x_test)
print("LightGBM - Initial Metrics")
print("Accuracy:", accuracy_score(y_test, y_pred_lgbm))
print("Precision:", precision_score(y_test, y_pred_lgbm))
print("Recall:", recall_score(y_test, y_pred_lgbm))
print("F1 Score:", f1_score(y_test, y_pred_lgbm))
print("ROC AUC:", roc_auc_score(y_test, lgbm.predict_proba(x_test)[:, 1]))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_lgbm))
print("\nClassification Report:\n", classification_report(y_test, y_pred_lgbm))

### Cross Validation

In [None]:
scores = cross_val_score(lgbm, x_sm, y_sm, cv=5, scoring='f1')
print("Cross-Validation F1 Scores:", scores)
print("Average F1 Score:", scores.mean())

### Hyperparameter Tuning

In [None]:
# Define parameter grid
param_grid = {
    'num_leaves': [31, 50, 70],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 200, 500]
}

# Set up Grid Search with cross-validation
grid_search_lgbm = GridSearchCV(
    estimator=LGBMClassifier(),
    param_grid=param_grid,
    scoring='f1',
    n_jobs=-1,
    cv=5,
    verbose=2
)

# Fit to the oversampled training data
grid_search_lgbm.fit(x_sm, y_sm)

# Best model after hyperparameter tuning
best_lgbm_model = grid_search_lgbm.best_estimator_

# Print best parameters
print("Best Parameters:", grid_search_lgbm.best_params_)

# Predictions on test set
y_pred_lgbm = best_lgbm_model.predict(x_test)

# Evaluation
print("Test Accuracy with Best Hyperparameters:", accuracy_score(y_test, y_pred_lgbm))


### KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier
# Train initial model
knn = KNeighborsClassifier()
knn.fit(x_sm, y_sm)
y_pred_knn = knn.predict(x_test)
print("KNN - Initial Metrics")
print("Accuracy:", accuracy_score(y_test, y_pred_knn))
print("Precision:", precision_score(y_test, y_pred_knn))
print("Recall:", recall_score(y_test, y_pred_knn))
print("F1 Score:", f1_score(y_test, y_pred_knn))
print("ROC AUC:", roc_auc_score(y_test, knn.predict_proba(x_test)[:, 1]))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_knn))
print("\nClassification Report:\n", classification_report(y_test, y_pred_knn))

### Cross Validation

In [None]:
scores = cross_val_score(knn, x_sm, y_sm, cv=5, scoring='f1')
print("Cross-Validation F1 Scores:", scores)
print("Average F1 Score:", scores.mean())

### Hyperparameter Tuning

In [None]:
# Define parameter grid
param_grid = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

# Set up GridSearchCV
grid_search_knn = GridSearchCV(
    KNeighborsClassifier(),
    param_grid=param_grid,
    scoring='f1',     # or 'f1_macro' for multi-class
    n_jobs=-1,
    cv=5,
    verbose=2
)

# Fit on training data
grid_search_knn.fit(x_sm, y_sm)

# Get best model
best_knn_model = grid_search_knn.best_estimator_

# Print best parameters
print("Best Parameters:", grid_search_knn.best_params_)

# Evaluate on test data
y_pred_knn = best_knn_model.predict(x_test)
print('Test Accuracy with Best Hyperparameters:', accuracy_score(y_test, y_pred_knn))
