# Report on Binary Classification Model to Predict Company Acquisitions

## Introduction
The objective of this project was to build a binary classification model that predicts whether a company will be acquired, using the features provided in the dataset. Work began by comparing logistic regression and random forest models on the numerical features only, to decide which algorithm to develop further.

## Methodology
### Initial Model Selection:
- We started with logistic regression and random forest models, focusing solely on the numerical attributes of the dataset.
- The models' performance was evaluated to determine the most suitable approach for this prediction task.

### Incorporation of Categorical Data:
- To enhance the model's predictive power, categorical data, including industry and address_country_code, were incorporated.
- The keywords column, a text-based feature, was also added after appropriate preprocessing using TF-IDF vectorization and PCA for dimensionality reduction.
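
A minimal sketch of this preprocessing, assuming scikit-learn and a few invented rows that reuse the report's column names. It differs from the notebook in two labeled ways: categories are one-hot encoded rather than label encoded, and `TruncatedSVD` stands in for PCA because it accepts the sparse TF-IDF matrix directly. All transformers are fitted on the training rows only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy rows using the report's column names; the values are invented.
train = pd.DataFrame({
    "industry": ["software", "retail", "software", "biotech"],
    "address_country_code": ["US", "DE", "US", "GB"],
    "keywords": ["cloud saas analytics", "fashion ecommerce",
                 "saas security cloud", "genomics drug discovery"],
})
test = pd.DataFrame({
    "industry": ["fintech"],            # category not seen during fit
    "address_country_code": ["US"],
    "keywords": ["cloud payments"],
})

# TF-IDF followed by truncated SVD (assumption: swapped in for PCA).
keywords_pipe = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=100)),
    ("svd", TruncatedSVD(n_components=2, random_state=42)),
])

preprocess = ColumnTransformer([
    # Unseen categories are encoded as all zeros instead of raising an error.
    ("cat", OneHotEncoder(handle_unknown="ignore"),
     ["industry", "address_country_code"]),
    # A single column name (not a list) passes 1-D text to the vectorizer.
    ("kw", keywords_pipe, "keywords"),
])

X_train = preprocess.fit_transform(train)  # statistics learned from train only
X_test = preprocess.transform(test)        # reused unchanged on held-out rows
print(X_train.shape, X_test.shape)
```

Fitting the encoders inside a pipeline like this keeps vocabulary, IDF weights, and components from being influenced by validation or test rows.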

### Model Training and Validation:
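- For the final experiments, the dataset was divided into training (80%), validation (10%), and test (10%) sets, so that the decision threshold could be tuned on data held out from both training and final testing. The earlier exploratory runs used a simple 70/30 train/test split.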
- Random forest was used as the primary model due to its superior performance in the initial evaluation.
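- Hyperparameters were first explored with GridSearchCV on the numerical-only model and later tuned with RandomizedSearchCV (10 sampled configurations, 5-fold cross-validation) on the full feature set.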
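
The split and search described above can be sketched as follows. This is an illustration on synthetic data from `make_classification`, not the project's dataset; the `stratify` arguments and `scoring="f1"` are assumptions added here (the notebook uses neither), and the parameter ranges are shortened so the example runs quickly.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the company features (not the project's data).
X, y = make_classification(n_samples=500, n_features=12, random_state=42)

# 80% train, then split the remaining 20% evenly into validation and test.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42)

# Randomized search samples n_iter configurations instead of a full grid.
param_dist = {
    "n_estimators": randint(20, 60),
    "max_depth": [3, None],
    "min_samples_leaf": randint(1, 11),
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=5, cv=3, scoring="f1", random_state=42)
search.fit(X_train, y_train)  # cross-validation sees the training split only

print(len(X_train), len(X_val), len(X_test))
print(search.best_params_)
```

Because the search only touches the training split, the validation and test sets remain available for threshold selection and final evaluation.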

### Ensemble Model Creation:
- An ensemble model comprising different tree-based algorithms, including the tuned Random Forest, Gradient Boosting, and Extra Trees classifiers, was created.
- The ensemble approach aimed to improve model robustness and predictive accuracy.
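
As a small illustration of how a soft-voting ensemble of these three model types combines its members, the sketch below (synthetic data, reduced tree counts, untuned members) checks that the ensemble probability equals the unweighted mean of the members' probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)

# Synthetic data; the project's dataset is not used here.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
        ("et", ExtraTreesClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X, y)

# With voting="soft" and no weights, the ensemble probability is the plain
# average of the fitted members' class probabilities.
member_probas = [est.predict_proba(X) for est in ensemble.estimators_]
manual_average = np.mean(member_probas, axis=0)
print(np.allclose(manual_average, ensemble.predict_proba(X)))
```

Averaging probabilities rather than counting hard votes lets a confident member outweigh two uncertain ones, which matters when a custom decision threshold is applied afterwards.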

### Threshold Optimization and Model Evaluation:
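- The classification threshold was determined on the validation set by maximizing Youden's J statistic (true positive rate minus false positive rate) along the ROC curve. The F1 score, a key metric for binary classification with imbalanced outcomes, was then computed at that threshold.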
- Both the tuned random forest and the ensemble model were evaluated on the test set using this optimal threshold.
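
Two common ways to pick a threshold from validation scores are sketched below: Youden's J statistic on the ROC curve, which is what the notebook code computes, and direct maximization of F1 on the precision-recall curve, shown as an alternative. The helper names and the synthetic scores are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_curve, roc_curve

def youden_threshold(y_true, y_score):
    """Threshold that maximizes TPR - FPR on the ROC curve."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    return thresholds[np.argmax(tpr - fpr)]

def max_f1_threshold(y_true, y_score):
    """Threshold that maximizes F1 on the precision-recall curve."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # precision/recall have one more entry than thresholds; drop the last point.
    p, r = precision[:-1], recall[:-1]
    denom = p + r
    f1 = np.divide(2 * p * r, denom, out=np.zeros_like(denom), where=denom > 0)
    return thresholds[np.argmax(f1)]

# Synthetic validation labels and scores (positives score higher on average).
rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=200)
y_score = np.clip(0.35 * y_val + rng.normal(0.4, 0.2, size=200), 0, 1)

t_j = youden_threshold(y_val, y_score)
t_f1 = max_f1_threshold(y_val, y_score)
f1_at_j = f1_score(y_val, (y_score >= t_j).astype(int))
f1_at_best = f1_score(y_val, (y_score >= t_f1).astype(int))
print(f"Youden's J: threshold={t_j:.3f}, F1={f1_at_j:.3f}")
print(f"Max F1:     threshold={t_f1:.3f}, F1={f1_at_best:.3f}")
```

On the data used for selection, the F1-maximizing threshold can never score a lower F1 than the Youden threshold; whether that advantage carries over to the test set has to be checked empirically.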

## Results and Discussion
- In the initial comparison on numerical features, random forest outperformed logistic regression on accuracy (0.654 vs. 0.617) and F1 (0.507 vs. 0.246), although logistic regression had a marginally higher ROC AUC (0.688 vs. 0.679). Random forest was therefore selected for further analysis and enhancement.
- Adding categorical and text features improved the results step by step. Industry or keywords alone moved the random forest's test F1 from about 0.51 to about 0.53; combining both with a validation-tuned threshold gave about 0.57 (AUC 0.76); adding address_country_code raised it to about 0.70 (AUC 0.80). These runs differ in split and threshold, so the increments are indicative rather than a controlled ablation.
- On the test set, the tuned random forest reached an F1 score of about 0.70 (AUC 0.806) and the ensemble of tree-based models about 0.71 (AUC 0.808), so the ensemble added only a small gain over the tuned random forest.
- Choosing the threshold on a separate validation set kept the test set untouched during tuning, so the reported test scores are not biased by threshold selection and better reflect the model's ability to generalize.
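
For reference, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the precision and recall printed for the grid-searched random forest in the notebook; the helper name is illustrative.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall printed for the grid-searched random forest.
precision = 0.6592356687898089
recall = 0.42592592592592593
f1 = f1_from_precision_recall(precision, recall)
print(round(f1, 4))
```

The result agrees with the F1 score of 0.5175 reported for that model, which shows how a low recall pulls F1 well below the precision.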

## Conclusion
The project produced a binary classifier for company acquisitions that reaches a test-set F1 score of about 0.71 and an AUC of about 0.81. Most of the improvement over the numerical-only baseline came from adding categorical and keyword features together with a decision threshold chosen on a validation set; tuning the random forest and combining it with other tree-based models in a soft-voting ensemble added a smaller further gain.


In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix, classification_report,
                             average_precision_score, roc_auc_score,
                             precision_recall_fscore_support)

# Load the dataset
file_path = 'C:\\Users\\Parviz\\data_DEAN\\all_comp_csv\\all_comp.csv'
data = pd.read_csv(file_path)

# Drop identifier, free-text, and categorical columns so that mostly numerical features remain
data_reduced = data.drop(['company_name', 'website_standard', 'description', 'address_city', 'address_admin1_name', 'industry', 'address_country_code', 'keywords'], axis=1)

# Encoding categorical variables
categorical_columns = data_reduced.select_dtypes(include=['object']).columns
for col in categorical_columns:
    data_reduced[col] = LabelEncoder().fit_transform(data_reduced[col])

# Select features and target variable
X = data_reduced.drop('acquired', axis=1)
y = data_reduced['acquired']

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Logistic Regression Model
# -------------------------
log_reg = LogisticRegression()
log_reg.fit(X_train_scaled, y_train)
y_pred_lr = log_reg.predict(X_test_scaled)
y_pred_proba_lr = log_reg.predict_proba(X_test_scaled)[:, 1]  # probabilities for the positive class

# Evaluate the logistic regression model
accuracy_lr = accuracy_score(y_test, y_pred_lr)
average_precision_lr = average_precision_score(y_test, y_pred_proba_lr)
roc_auc_lr = roc_auc_score(y_test, y_pred_proba_lr)
precision_lr, recall_lr, f1_score_lr, _ = precision_recall_fscore_support(y_test, y_pred_lr, average='binary')

# Random Forest Model
# --------------------
random_forest = RandomForestClassifier(random_state=42)
random_forest.fit(X_train_scaled, y_train)
y_pred_rf = random_forest.predict(X_test_scaled)
y_pred_proba_rf = random_forest.predict_proba(X_test_scaled)[:, 1]  # probabilities for the positive class

# Evaluate the Random Forest model
accuracy_rf = accuracy_score(y_test, y_pred_rf)
average_precision_rf = average_precision_score(y_test, y_pred_proba_rf)
roc_auc_rf = roc_auc_score(y_test, y_pred_proba_rf)
precision_rf, recall_rf, f1_score_rf, _ = precision_recall_fscore_support(y_test, y_pred_rf, average='binary')

# Display results
print("Logistic Regression:")
print("Accuracy:", accuracy_lr)
print("Precision:", precision_lr)
print("Recall:", recall_lr)
print("F1 Score:", f1_score_lr)
print("Average Precision:", average_precision_lr)
print("ROC AUC:", roc_auc_lr)

print("\nRandom Forest:")
print("Accuracy:", accuracy_rf)
print("Precision:", precision_rf)
print("Recall:", recall_rf)
print("F1 Score:", f1_score_rf)
print("Average Precision:", average_precision_rf)
print("ROC AUC:", roc_auc_rf)

Logistic Regression:
Accuracy: 0.6169686985172982
Precision: 0.5801526717557252
Recall: 0.15637860082304528
F1 Score: 0.24635332252836306
Average Precision: 0.5505941169599564
ROC AUC: 0.6881465371048704

Random Forest:
Accuracy: 0.6540362438220758
Precision: 0.5901639344262295
Recall: 0.4444444444444444
F1 Score: 0.5070422535211268
Average Precision: 0.5888375048068542
ROC AUC: 0.6793073503142948


In [4]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define the parameter grid to search
param_grid = {
    'n_estimators': [100, 200, 300],  # Example: number of trees in the forest
    'max_depth': [10, 20, 30],        # Example: maximum depth of each tree
    'min_samples_split': [2, 5, 10],  # Example: minimum number of samples required to split a node
    'min_samples_leaf': [1, 2, 4]     # Example: minimum number of samples required at each leaf node
}

# Create a base model
rf = RandomForestClassifier(random_state=42)

# Instantiate the grid search model
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, 
                           cv=3, n_jobs=-1, verbose=2, scoring='accuracy')

# Fit the grid search to the data
grid_search.fit(X_train_scaled, y_train)

# Best parameters found
print("Best parameters found: ", grid_search.best_params_)

# Evaluate the best model
best_grid = grid_search.best_estimator_
y_pred_best = best_grid.predict(X_test_scaled)
accuracy_best = accuracy_score(y_test, y_pred_best)

# Display results
print("Best Random Forest Model:")
print("Accuracy:", accuracy_best)



Fitting 3 folds for each of 81 candidates, totalling 243 fits
Best parameters found:  {'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 300}
Best Random Forest Model:
Accuracy: 0.6820428336079077


In [5]:
# Random Forest Model
# --------------------
y_pred_rf = best_grid.predict(X_test_scaled)
y_pred_proba_rf = best_grid.predict_proba(X_test_scaled)[:, 1]  # probabilities for the positive class

# Evaluate the Random Forest model
accuracy_rf = accuracy_score(y_test, y_pred_rf)
average_precision_rf = average_precision_score(y_test, y_pred_proba_rf)
roc_auc_rf = roc_auc_score(y_test, y_pred_proba_rf)
precision_rf, recall_rf, f1_score_rf, _ = precision_recall_fscore_support(y_test, y_pred_rf, average='binary')



print("\nRandom Forest:")
print("Accuracy:", accuracy_rf)
print("Precision:", precision_rf)
print("Recall:", recall_rf)
print("F1 Score:", f1_score_rf)
print("Average Precision:", average_precision_rf)
print("ROC AUC:", roc_auc_rf)


Random Forest:
Accuracy: 0.6820428336079077
Precision: 0.6592356687898089
Recall: 0.42592592592592593
F1 Score: 0.5175000000000001
Average Precision: 0.6422215807468983
ROC AUC: 0.7251977061004838


In [6]:
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

# Load the dataset
data = pd.read_csv(file_path)

# Encode the 'industry' column
label_encoder = LabelEncoder()
data['industry_encoded'] = label_encoder.fit_transform(data['industry'])

# Prepare the dataset for training
# Ensure all columns except 'acquired' and non-numeric columns are included
X = data.select_dtypes(include=[np.number])

# Ensure 'acquired' is not part of the features
X = X.drop('acquired', axis=1)

# Target variable
y = data['acquired']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the Random Forest model
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_scaled, y_train)

# Predict and calculate accuracy
y_pred = rf.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy
print(f"Model Accuracy with 'Industry' as a feature: {accuracy}")
# Evaluate the Random Forest model
precision_rf, recall_rf, f1_score_rf, _ = precision_recall_fscore_support(y_test, y_pred, average='binary')
print("\nRandom Forest:")
print("Accuracy:", accuracy)  # this cell's accuracy, not the stale accuracy_rf from In [5]
print("Precision:", precision_rf)
print("Recall:", recall_rf)
print("F1 Score:", f1_score_rf)

Model Accuracy with 'Industry' as a feature: 0.6742174629324547

Random Forest:
Accuracy: 0.6742174629324547
Precision: 0.6291012838801712
Recall: 0.4537037037037037
F1 Score: 0.5271966527196652


In [7]:
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
import pandas as pd

# Load the dataset
data = pd.read_csv(file_path)

# Fill NaNs in 'keywords' column
data['keywords'] = data['keywords'].fillna('')

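# TF-IDF Vectorization
# Note: the vectorizer and PCA are fitted on the full dataset before the split,
# so test rows influence the vocabulary, IDF weights, and components.
tfidf = TfidfVectorizer(max_features=100)
keywords_tfidf = tfidf.fit_transform(data['keywords'])

# Apply PCA for Dimensionality Reduction
pca = PCA(n_components=10)  # Adjust the number of components as necessary
keywords_pca = pca.fit_transform(keywords_tfidf.toarray())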

# Prepare the final dataset
keywords_df = pd.DataFrame(keywords_pca, columns=[f'pca_{i}' for i in range(keywords_pca.shape[1])])
data.reset_index(drop=True, inplace=True)
data = pd.concat([data, keywords_df], axis=1)

# Drop the target and the raw text/categorical columns, then keep only numeric features
X = data.drop(['acquired', 'company_name', 'website_standard', 'description', 'address_city', 'address_admin1_name', 'industry', 'address_country_code', 'keywords'], axis=1)
X = X.select_dtypes(include=[np.number])

# Target variable
y = data['acquired']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the Random Forest model
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_scaled, y_train)

# Predict and calculate metrics
y_pred = rf.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print the metrics
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")

Accuracy: 0.685337726523888
Precision: 0.6580547112462006
Recall: 0.4454732510288066
F1 Score: 0.5312883435582823


In [8]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve, f1_score
import pandas as pd
import numpy as np

# Reuses `data` and `keywords_df` from the previous cell (In [7])

# Encode 'industry' column
label_encoder = LabelEncoder()
data['industry_encoded'] = label_encoder.fit_transform(data['industry'])

# Prepare the dataset for training
# Note: `data` already holds industry_encoded and the pca_* columns, so the
# select_dtypes() concat below adds them to X a second time.
X = pd.concat([data[['industry_encoded']], keywords_df], axis=1)
X = pd.concat([X, data.select_dtypes(include=[np.number]).drop(['acquired'], axis=1)], axis=1)
y = data['acquired']

# Split the data into train and temporary set (test + validation)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)

# Split the temporary set into validation and test sets
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

# Train the Random Forest model
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_scaled, y_train)

# Find the threshold that maximizes Youden's J statistic (TPR - FPR) on the ROC curve
def find_optimal_threshold(y_true, y_proba):
    fpr, tpr, thresholds = roc_curve(y_true, y_proba)
    optimal_idx = np.argmax(tpr - fpr)
    optimal_threshold = thresholds[optimal_idx]
    return optimal_threshold

# Predict probabilities on validation set and find optimal threshold
y_val_proba = rf.predict_proba(X_val_scaled)[:, 1]
optimal_threshold = find_optimal_threshold(y_val, y_val_proba)

# Predict on test set with optimal threshold
y_test_proba = rf.predict_proba(X_test_scaled)[:, 1]
y_pred_test = (y_test_proba >= optimal_threshold).astype(int)

# Evaluate on test set
auc_test = roc_auc_score(y_test, y_test_proba)
f1_test = f1_score(y_test, y_pred_test)

# Print the results
print(f"Optimal Threshold (from validation set): {optimal_threshold}")
print(f"Test Set AUC: {auc_test}")
print(f"Test Set F1 Score at Optimal Threshold: {f1_test}")


Optimal Threshold (from validation set): 0.4783333333333333
Test Set AUC: 0.7586337194533354
Test Set F1 Score at Optimal Threshold: 0.5694682675814752


In [9]:
from sklearn.metrics import roc_auc_score, roc_curve, f1_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# Load and preprocess the dataset
data = pd.read_csv(file_path)
label_encoder_industry = LabelEncoder()
data['industry_encoded'] = label_encoder_industry.fit_transform(data['industry'])
label_encoder_country = LabelEncoder()
data['country_encoded'] = label_encoder_country.fit_transform(data['address_country_code'])

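# Prepare the dataset for training (keywords_df is reused from In [7])
# Note: the encoded columns are also picked up by select_dtypes(), so they appear in X twice.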
X = pd.concat([data[['industry_encoded', 'country_encoded']], keywords_df], axis=1)
X = pd.concat([X, data.select_dtypes(include=[np.number]).drop(['acquired'], axis=1)], axis=1)
y = data['acquired']

# Split the data into train, validation, and test sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

# Train the Random Forest model
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_scaled, y_train)

# Find the threshold that maximizes Youden's J statistic (TPR - FPR) on the ROC curve
def find_optimal_threshold(y_true, y_proba):
    fpr, tpr, thresholds = roc_curve(y_true, y_proba)
    optimal_idx = np.argmax(tpr - fpr)
    optimal_threshold = thresholds[optimal_idx]
    return optimal_threshold

# Predict probabilities on validation set and find optimal threshold
y_val_proba = rf.predict_proba(X_val_scaled)[:, 1]
optimal_threshold = find_optimal_threshold(y_val, y_val_proba)

# Predict on test set with optimal threshold
y_test_proba = rf.predict_proba(X_test_scaled)[:, 1]
y_pred_test = (y_test_proba >= optimal_threshold).astype(int)

# Evaluate on test set
auc_test = roc_auc_score(y_test, y_test_proba)
f1_test = f1_score(y_test, y_pred_test)

# Print the results
print(f"Optimal Threshold (from validation set): {optimal_threshold}")
print(f"Test Set AUC: {auc_test}")
print(f"Test Set F1 Score at Optimal Threshold: {f1_test}")

Optimal Threshold (from validation set): 0.36
Test Set AUC: 0.7968424528781182
Test Set F1 Score at Optimal Threshold: 0.7005347593582888


In [10]:
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import roc_auc_score, roc_curve, f1_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier, VotingClassifier
from scipy.stats import randint as sp_randint
import pandas as pd
import numpy as np

# Load and preprocess the dataset
data = pd.read_csv(file_path)
label_encoder_industry = LabelEncoder()
data['industry_encoded'] = label_encoder_industry.fit_transform(data['industry'])
label_encoder_country = LabelEncoder()
data['country_encoded'] = label_encoder_country.fit_transform(data['address_country_code'])

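# Prepare the dataset for training (keywords_df is reused from In [7])
# Note: the encoded columns are also picked up by select_dtypes(), so they appear in X twice.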
X = pd.concat([data[['industry_encoded', 'country_encoded']], keywords_df], axis=1)
X = pd.concat([X, data.select_dtypes(include=[np.number]).drop(['acquired'], axis=1)], axis=1)
y = data['acquired']

# Split the data into train, validation, and test sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

# Randomized Search for Hyperparameter Tuning
rf = RandomForestClassifier(random_state=42)
param_dist = {
    'n_estimators': sp_randint(100, 500),
    'max_depth': [3, None],
    'max_features': sp_randint(1, 11),
    'min_samples_split': sp_randint(2, 11),
    'min_samples_leaf': sp_randint(1, 11),
    'bootstrap': [True, False]
}
random_search = RandomizedSearchCV(rf, param_distributions=param_dist, n_iter=10, cv=5, random_state=42)
random_search.fit(X_train_scaled, y_train)
rf_best = random_search.best_estimator_

# Train Ensemble Model
gb = GradientBoostingClassifier(random_state=42)
et = ExtraTreesClassifier(random_state=42)
ensemble = VotingClassifier(estimators=[('rf', rf_best), ('gb', gb), ('et', et)], voting='soft')
ensemble.fit(X_train_scaled, y_train)


VotingClassifier(estimators=[('rf',
                              RandomForestClassifier(max_features=8,
                                                     min_samples_leaf=5,
                                                     min_samples_split=8,
                                                     n_estimators=221,
                                                     random_state=42)),
                             ('gb',
                              GradientBoostingClassifier(random_state=42)),
                             ('et', ExtraTreesClassifier(random_state=42))],
                 voting='soft')

In [11]:
# Evaluate a model and pick the threshold that maximizes Youden's J (TPR - FPR) on the given data
def evaluate_model(model, X_scaled, y):
    y_pred_proba = model.predict_proba(X_scaled)[:, 1]
    auc = roc_auc_score(y, y_pred_proba)
    fpr, tpr, thresholds = roc_curve(y, y_pred_proba)
    optimal_idx = np.argmax(tpr - fpr)
    optimal_threshold = thresholds[optimal_idx]
    y_pred_optimal = (y_pred_proba >= optimal_threshold).astype(int)
    f1_optimal = f1_score(y, y_pred_optimal)
    return auc, optimal_threshold, f1_optimal

# Fine-tune threshold on the validation set
auc_val_rf, optimal_threshold_rf, f1_val_rf = evaluate_model(rf_best, X_val_scaled, y_val)
auc_val_ensemble, optimal_threshold_ensemble, f1_val_ensemble = evaluate_model(ensemble, X_val_scaled, y_val)

# Evaluate a model at a fixed, previously chosen threshold
def evaluate_model_with_threshold(model, X_scaled, y, optimal_threshold):
    y_pred_proba = model.predict_proba(X_scaled)[:, 1]
    auc = roc_auc_score(y, y_pred_proba)
    y_pred_optimal = (y_pred_proba >= optimal_threshold).astype(int)
    f1_optimal = f1_score(y, y_pred_optimal)
    return auc, optimal_threshold, f1_optimal

# Test Set Evaluation
auc_test_rf, _, f1_test_rf = evaluate_model_with_threshold(rf_best, X_test_scaled, y_test, optimal_threshold_rf)
auc_test_ensemble, _, f1_test_ensemble = evaluate_model_with_threshold(ensemble, X_test_scaled, y_test, optimal_threshold_ensemble)

print("\nTuned Random Forest - Test Set:")
print(f"AUC: {auc_test_rf}, F1 Score at Optimal Threshold: {f1_test_rf}")

print("\nEnsemble Model - Test Set:")
print(f"AUC: {auc_test_ensemble}, F1 Score at Optimal Threshold: {f1_test_ensemble}")


Tuned Random Forest - Test Set:
AUC: 0.805930244373317, F1 Score at Optimal Threshold: 0.701530612244898

Ensemble Model - Test Set:
AUC: 0.8082482345170959, F1 Score at Optimal Threshold: 0.7101449275362318
