## **MACHINE LEARNING MODELS**

**1. Data Preprocessing**
Objective: Prepare the dataset (Aba3_cleaned.csv) for machine learning.

Steps:

Handle Missing Values: Fill numeric columns with the mean and categorical columns with the mode.

Encode Categorical Features: Convert the "Head Quarter" and "Industry In" columns to integer codes using LabelEncoder.

Map RoundSeries: Assign numerical labels to the values of "Funding Round/Series" (e.g., "Pre-seed" → 0, "Seed" → 1); rounds that are not in the mapping are set to -1.

Feature Scaling: Standardize "Year Founded" and RoundSeries_Numerical using StandardScaler, fitted on the training split only.

Train-Test Split: Split data into training/testing sets (80/20).

Save Processed Data: Export scaled and encoded datasets (trainingdata.csv, testingdata.csv).

In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
import os


file_path = "F:\\school\\Azubi Africa\\LP1 Data Analytics Project\\LP-1-Project\\data\\Aba3_cleaned.csv"
df = pd.read_csv(file_path)


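# Fill missing values: column mean for numeric columns, most frequent value otherwise.
# Every column is processed, including the target 'Amount in ($)'.
for col in df.columns:
    if df[col].dtype == 'float64' or df[col].dtype == 'int64':
        df[col] = df[col].fillna(df[col].mean())
    else:
        df[col] = df[col].fillna(df[col].mode()[0])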


categorical_cols = [col for col in ['Head Quarter', 'Industry In'] if col in df.columns]

for col in categorical_cols:
    le = LabelEncoder()

    # Encode only columns whose values are strings (checked on the first row).
    if isinstance(df.loc[0, col], str):
        try:
            # Cast to str so every value is encoded consistently.
            df[col] = le.fit_transform(df[col].astype(str))
        except Exception as e:
            print(f"Error encoding {col}: {e}")


round_series_mapping = {
    'Pre-seed': 0,
    'Seed': 1,
    'Pre-series A': 2,
    'Series A': 3,
    'Series B': 4,
    'Series C': 5,
    'Series D': 6,
    'Series E': 7,
    'Debt': 8,
    'Bridge': 9,
}

df['RoundSeries_Numerical'] = df['Funding Round/Series'].map(round_series_mapping)


df['RoundSeries_Numerical'] = df['RoundSeries_Numerical'].fillna(-1)  


feature_columns=['Year Founded', 'Head Quarter', 'Industry In', 'RoundSeries_Numerical', 'AboutCompany']
target_column='Amount in ($)'

X_features=df[feature_columns]
y_target=df[target_column]


if y_target.dtype.name == "category" or y_target.dtype.name == "object":  
    X_train, X_test, y_train, y_test= train_test_split(
       X_features,y_target,test_size=0.2,
       random_state=42,stratify=y_target)
else:  
    X_train, X_test, y_train, y_test= train_test_split(
       X_features,y_target,test_size=0.2,
       random_state=42)

scaler=StandardScaler()

scaled_Xtrain=scaler.fit_transform(X_train[['Year Founded','RoundSeries_Numerical']])
scaled_Xtest=scaler.transform(X_test[['Year Founded','RoundSeries_Numerical']])


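# NOTE: these frames get a fresh 0..n-1 index, while X_train/X_test keep their
# original row labels from the shuffled split. The pd.concat(axis=1) calls below
# align on index, so mismatched labels produce NaN-padded rows. Passing
# index=X_train.index / index=X_test.index here would keep the rows aligned.
scaled_Xtrain_df=pd.DataFrame(scaled_Xtrain,
                              columns=['Founded_scaled','RoundSeries_scaled'])

scaled_Xtest_df=pd.DataFrame(scaled_Xtest,
                              columns=['Founded_scaled','RoundSeries_scaled'])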

save_dir="F:\\school\\Azubi Africa\\LP1 Data Analytics Project\\LP-1-Project\\data"

training_data_to_save=(pd.concat([
                          scaled_Xtrain_df,X_train[['Head Quarter','Industry In','AboutCompany']],y_train.to_frame()],
                          axis=1))

training_data_to_save.to_csv(os.path.join(save_dir,"trainingdata.csv"),index=False)

testing_data_to_save=(pd.concat([
                         scaled_Xtest_df,X_test[['Head Quarter','Industry In','AboutCompany']],y_test.to_frame()],
                         axis=1))

testing_data_to_save.to_csv(os.path.join(save_dir,"testingdata.csv"),index=False)
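The effect of index alignment in the `pd.concat(..., axis=1)` calls above can be reproduced on a toy frame. This is a minimal sketch with made-up values: the three-row `X_train` and the `Founded_scaled` numbers are hypothetical stand-ins, not the project's data. It shows that a frame with a fresh RangeIndex, concatenated with a frame that keeps shuffled row labels, yields extra rows padded with NaN, while reusing the original index keeps the rows aligned.

```python
import pandas as pd

# Toy stand-in for X_train after a shuffled split: non-consecutive row labels.
X_train = pd.DataFrame({"Head Quarter": [2, 0, 1]}, index=[3, 0, 4])

# Fresh RangeIndex (0, 1, 2), as produced by pd.DataFrame(scaler_output, columns=...).
scaled_default = pd.DataFrame({"Founded_scaled": [0.5, -1.2, 0.7]})

# Same values, but reusing the row labels of X_train.
scaled_aligned = pd.DataFrame(
    {"Founded_scaled": [0.5, -1.2, 0.7]}, index=X_train.index
)

# concat(axis=1) performs an outer join on the index.
misaligned = pd.concat([scaled_default, X_train], axis=1)
aligned = pd.concat([scaled_aligned, X_train], axis=1)

print(misaligned.shape, int(misaligned.isna().sum().sum()))  # (5, 2) 4
print(aligned.shape, int(aligned.isna().sum().sum()))        # (3, 2) 0
```

In the misaligned case only label 0 appears in both frames, and even there the scaled value belongs to a different original row than the unscaled one.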


### **1. Funding Prediction Model**

Objective: Predict the funding amount ("Amount in ($)") a startup receives.

Model: RandomForestRegressor

In [8]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd


train_data_path = "F:\\school\\Azubi Africa\\LP1 Data Analytics Project\\LP-1-Project\\data\\trainingdata.csv"
test_data_path = "F:\\school\\Azubi Africa\\LP1 Data Analytics Project\\LP-1-Project\\data\\testingdata.csv"

train_data = pd.read_csv(train_data_path)
test_data = pd.read_csv(test_data_path)


print("NaN values in training target (Amount in ($)):", train_data['Amount in ($)'].isna().sum())
print("NaN values in testing target (Amount in ($)):", test_data['Amount in ($)'].isna().sum())


train_data = train_data.dropna(subset=['Amount in ($)'])
test_data = test_data.dropna(subset=['Amount in ($)'])



X_train = train_data[['Founded_scaled', 'RoundSeries_scaled', 'Head Quarter', 'Industry In']]
y_train = train_data['Amount in ($)']

X_test = test_data[['Founded_scaled', 'RoundSeries_scaled', 'Head Quarter', 'Industry In']]
y_test = test_data['Amount in ($)']

#Random Forest Regressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

y_pred = rf_model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

NaN values in training target (Amount in ($)): 166
NaN values in testing target (Amount in ($)): 167
Mean Squared Error: 7525477236989724.0
R-squared: -0.1752069958291933


*Evaluation Metrics:*

MSE: 7.53e+15 in squared dollars, which corresponds to a root mean squared error of about $87 million (very poor performance).

R-squared: -0.18 (a negative value means the model performs worse than always predicting the mean of the test targets, so it shows no predictive capability).
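To make the meaning of a negative R-squared concrete, here is a small worked example on made-up numbers (not the project's data). R-squared is 1 minus the ratio of the model's squared error to the squared error of a constant mean prediction, so the mean predictor scores exactly 0 and anything worse than it scores below 0.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Baseline: always predict the mean of y_true, which gives R^2 = 0.
baseline = np.full_like(y_true, y_true.mean())

# A prediction that is further from the truth than the mean, so R^2 < 0.
bad_pred = np.array([4.0, 3.0, 2.0, 1.0])

r2_baseline = r2_score(y_true, baseline)
r2_bad = r2_score(y_true, bad_pred)  # 1 - 20/5 = -3.0

print(r2_baseline, r2_bad)
```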

*Interpretation:*

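The model does not capture meaningful patterns in the data: the error is extremely high and R-squared is negative.

Possible issues: The 166 training and 167 testing rows with a missing target are dropped before fitting, so they do not enter the model directly, but they point to a problem in preprocessing. There, the scaled features are placed in a DataFrame with a fresh 0..n-1 index and then concatenated with columns that keep their original row labels. pd.concat(axis=1) aligns on the index, which pads unmatched rows with NaN and can pair one row's scaled features with another row's "Head Quarter", "Industry In", and target. Other potential problems include significant data noise, outliers, insufficient or irrelevant features, label-encoded categorical columns whose integer codes carry no real order, and the distribution of the target variable ("Amount in ($)"). Addressing these issues may improve model performance.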
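One common way to handle a difficult target distribution is to fit the regressor on a log-transformed target. The sketch below assumes the target is non-negative and heavily right-skewed, which has not been verified for "Amount in ($)"; the data is synthetic and the feature matrix is a hypothetical stand-in. It uses scikit-learn's `TransformedTargetRegressor` so that predictions are mapped back to the original dollar scale automatically.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic, right-skewed target (assumption: stands in for funding amounts).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = np.exp(10 + 2 * X[:, 0] + rng.normal(scale=0.5, size=300))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit on log1p(y); predictions are passed through expm1 on the way out.
model = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=50, random_state=42),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X_tr, y_tr)
pred = model.predict(X_te)

print(f"R-squared on the original scale: {r2_score(y_te, pred):.3f}")
```

Whether this helps on the real data depends on how the amounts are actually distributed, so it is worth plotting a histogram of the target first.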

### **2. Startup Success Prediction**

Objective: Classify startups as "successful" (1) if funding > $1M, otherwise "unsuccessful" (0).

*Model: LogisticRegression*

To build the Startup Success Prediction model, we first need to define what "success" means using the available data. A reasonable proxy is whether the startup has received funding above a certain threshold, so we create a binary target from the "Amount in ($)" column: 1 if the amount exceeds $1,000,000, otherwise 0.
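One detail of this definition is worth illustrating: a comparison against NaN evaluates to False, so rows with a missing amount are silently labeled 0 unless they are removed first. The sketch below uses four made-up amounts (hypothetical values) to contrast the naive labeling with one that drops unknown amounts beforehand.

```python
import numpy as np
import pandas as pd

success_threshold = 1_000_000  # $1,000,000

# Made-up amounts, including one missing value and one exactly at the threshold.
amounts = pd.DataFrame(
    {"Amount in ($)": [250_000.0, 5_000_000.0, np.nan, 1_000_000.0]}
)

# Naive labeling: NaN > threshold is False, so the missing row becomes 0.
naive = (amounts["Amount in ($)"] > success_threshold).astype(int)

# Explicit labeling: drop rows with an unknown amount, then compare.
known = amounts.dropna(subset=["Amount in ($)"]).copy()
known["Success"] = (known["Amount in ($)"] > success_threshold).astype(int)

print(naive.tolist())             # [0, 1, 0, 0]
print(known["Success"].tolist())  # [0, 1, 0]
```

Note that an amount of exactly $1,000,000 is labeled 0 because the comparison is strict.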



In [9]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.impute import SimpleImputer


train_data_path = "F:\\school\\Azubi Africa\\LP1 Data Analytics Project\\LP-1-Project\\data\\trainingdata.csv"
test_data_path = "F:\\school\\Azubi Africa\\LP1 Data Analytics Project\\LP-1-Project\\data\\testingdata.csv"

train_data = pd.read_csv(train_data_path)
test_data = pd.read_csv(test_data_path)

# Define a threshold for "success" 
success_threshold = 1000000  # $1,000,000


train_data['Success'] = train_data['Amount in ($)'].apply(lambda x: 1 if x > success_threshold else 0)
test_data['Success'] = test_data['Amount in ($)'].apply(lambda x: 1 if x > success_threshold else 0)


X_train = train_data[['Founded_scaled', 'RoundSeries_scaled', 'Head Quarter', 'Industry In']]
y_train = train_data['Success']

X_test = test_data[['Founded_scaled', 'RoundSeries_scaled', 'Head Quarter', 'Industry In']]
y_test = test_data['Success']

# Initialize the imputer
imputer = SimpleImputer(strategy='mean')


X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

# train the Logistic Regression model
lr_model = LogisticRegression(random_state=42)
lr_model.fit(X_train, y_train)


y_pred = lr_model.predict(X_test)


accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print(f'Classification Report:\n{report}')

Accuracy: 0.3877005347593583
Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00       229
           1       0.39      1.00      0.56       145

    accuracy                           0.39       374
   macro avg       0.19      0.50      0.28       374
weighted avg       0.15      0.39      0.22       374





*Results:*

Accuracy: 38.8% (worse than the 50% expected from random guessing on a binary problem, and worse than always predicting class 0, which would score 61.2% on this test set).

*Classification Report:*

Precision/Recall for class 0 (unsuccessful): 0.00 precision and 0.00 recall, because the model never predicts this class.

Precision/Recall for class 1 (successful): 39% precision, 100% recall (the model predicts "success" for every row, so it captures all instances of this class, but only 39% of those predictions are correct).

*Interpretation:*

The model predicts class 1 for every test row, even though class 0 is the larger class in the test file (229 of 374 rows). Part of that class-0 count is an artifact of labeling: rows with a missing "Amount in ($)" are not dropped in this cell, and because a comparison with NaN evaluates to False, they are labeled 0 (the funding prediction cell counted 167 such rows in the testing file).

The model fails to generalize and does not separate the two classes at all. Hyperparameter tuning, tried next, may change accuracy slightly but does not address the underlying label and class-balance problems.
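When class balance is a concern, two inexpensive checks are a majority-class baseline and class-weighted training. The following is a minimal sketch on synthetic data (the features and the roughly 20% positive rate are assumptions for illustration). It compares `DummyClassifier` with `LogisticRegression(class_weight="balanced")` using balanced accuracy, which averages recall over both classes and so does not reward predicting a single class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem: roughly one positive in five.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 2))
y = (X[:, 0] + 0.3 * rng.normal(size=600) > 0.8).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)

# Logistic regression with errors reweighted inversely to class frequency.
weighted = LogisticRegression(class_weight="balanced", random_state=0).fit(X_tr, y_tr)

base_score = balanced_accuracy_score(y_te, baseline.predict(X_te))
weighted_score = balanced_accuracy_score(y_te, weighted.predict(X_te))

print(f"baseline: {base_score:.2f}, class-weighted: {weighted_score:.2f}")
```

A single-class predictor always scores 0.5 on balanced accuracy, which makes it an easy reference point for any real model.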

### **Logistic Regression with hyperparameter tuning using GridSearchCV for the Startup Success model.**

The model aims to identify factors that correlate with high funding amounts (above $1M). It answers:

Which companies are likely to secure significant funding?

Which features (founding year, funding round stage, location, industry) are most influential?
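The second question can be approached by inspecting the fitted coefficients. The sketch below is an illustration on synthetic data, not the project's result: the column names are reused from this notebook, but the values are random and the target is constructed to depend on `RoundSeries_scaled` only. Coefficient magnitudes are comparable only for features on a common scale, and the integer codes that LabelEncoder assigns to "Head Quarter" and "Industry In" have an arbitrary order, so their coefficients should not be read as importance without one-hot encoding first.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

feature_names = ["Founded_scaled", "RoundSeries_scaled", "Head Quarter", "Industry In"]

# Synthetic standardized features; the target depends on one column by design.
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=feature_names)
y = (2.0 * X["RoundSeries_scaled"] + 0.2 * rng.normal(size=500) > 0).astype(int)

model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0, random_state=42)
model.fit(X, y)

# Rank features by absolute coefficient size.
coefficients = pd.Series(model.coef_[0], index=feature_names)
ranked = coefficients.reindex(coefficients.abs().sort_values(ascending=False).index)
print(ranked)
```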

In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.impute import SimpleImputer
import pandas as pd


train_data_path = "F:\\school\\Azubi Africa\\LP1 Data Analytics Project\\LP-1-Project\\data\\trainingdata.csv"
test_data_path = "F:\\school\\Azubi Africa\\LP1 Data Analytics Project\\LP-1-Project\\data\\testingdata.csv"

train_data = pd.read_csv(train_data_path)
test_data = pd.read_csv(test_data_path)


# Define a threshold for "success" 
success_threshold = 1000000  # $1,000,000


train_data['Success'] = train_data['Amount in ($)'].apply(lambda x: 1 if x > success_threshold else 0)
test_data['Success'] = test_data['Amount in ($)'].apply(lambda x: 1 if x > success_threshold else 0)


X_train = train_data[['Founded_scaled', 'RoundSeries_scaled', 'Head Quarter', 'Industry In']]
y_train = train_data['Success']

X_test = test_data[['Founded_scaled', 'RoundSeries_scaled', 'Head Quarter', 'Industry In']]
y_test = test_data['Success']


imputer = SimpleImputer(strategy='mean')  # Replace NaN with the mean value
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

# Hyperparameter tuning using GridSearchCV
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']
}

lr_model = LogisticRegression(random_state=42)
grid_search = GridSearchCV(lr_model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Best model
best_lr_model = grid_search.best_estimator_

# Make predictions
y_pred = best_lr_model.predict(X_test)


accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f'Best Parameters: {grid_search.best_params_}')
print(f'Accuracy: {accuracy}')
print(f'Classification Report:\n{report}')

Best Parameters: {'C': 0.01, 'penalty': 'l2', 'solver': 'liblinear'}
Accuracy: 0.3877005347593583
Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00       229
           1       0.39      1.00      0.56       145

    accuracy                           0.39       374
   macro avg       0.19      0.50      0.28       374
weighted avg       0.15      0.39      0.22       374





*Model Behavior:*

The model is predicting every instance as class 1 (funding > $1M). This is evident because:

Recall for class 1 is 1.00: It correctly identifies all 145 "successes," but at the cost of misclassifying all 229 "failures" (class 0).

Precision for class 1 is 0.39: Only 39% of its "success" predictions are correct (145 correct predictions / 374 total predictions).

*Accuracy Deception:*

The 38.77% accuracy equals the proportion of class 1 in the test set (145/374 ≈ 38.77%), which is exactly what predicting class 1 for every row yields. A trivial baseline that always predicts the majority class (class 0, 61.23% of the test rows) would score higher, so the tuned model is worse than that baseline.

*Class 0 Failure:*

Precision and recall for class 0 are 0.00, indicating the model completely fails to identify companies with funding ≤ $1M.
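The figures above follow directly from the class counts in the report. As a check, the sketch below rebuilds the situation from those counts alone (229 class-0 rows, 145 class-1 rows, every prediction equal to 1) and recomputes the confusion matrix, precision, and recall. It uses no project data beyond the support counts printed in the report.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Class counts taken from the classification report's support column.
y_true = np.array([0] * 229 + [1] * 145)
y_pred = np.ones_like(y_true)  # the model predicts class 1 for every row

cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
precision_1 = precision_score(y_true, y_pred, pos_label=1)
recall_1 = recall_score(y_true, y_pred, pos_label=1)
majority_baseline = 229 / 374  # accuracy of always predicting class 0

print(cm)  # rows are true classes, columns are predicted classes
print(round(precision_1, 4), recall_1, round(majority_baseline, 4))
```

The precision for class 1 is 145/374, the same number as the accuracy, because every row is counted as a class-1 prediction.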

### **3. Industry Classification Model**

Objective: Classify startups into industries based on the "AboutCompany" text descriptions.

Model: MultinomialNB with TF-IDF vectorization.

Because the dataset labels each company with an industry ("Industry In"), those labels can serve as the target for a model that predicts a startup's industry from its description.

In [11]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import pandas as pd


file_path = "F:\\school\\Azubi Africa\\LP1 Data Analytics Project\\LP-1-Project\\data\\Aba3_cleaned.csv"
df = pd.read_csv(file_path)


df.dropna(subset=['AboutCompany', 'Industry In'], inplace=True)  


X = df['AboutCompany']  
y = df['Industry In']  


if y.isna().any():
    print("Warning: NaN values found in the target variable (y). Removing them.")
    df.dropna(subset=['Industry In'], inplace=True)  
    X = df['AboutCompany']  
    y = df['Industry In']  


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert text descriptions into numerical data using TF-IDF
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Initialize and train the Multinomial Naive Bayes model
nb_model = MultinomialNB()
nb_model.fit(X_train_tfidf, y_train)


y_pred = nb_model.predict(X_test_tfidf)


accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print(f'Classification Report:\n{report}')

Accuracy: 0.12560386473429952
Classification Report:
                                   precision    recall  f1-score   support

                       AI startup       0.00      0.00      0.00         3
                      AR platform       0.00      0.00      0.00         1
                    Advertisement       0.00      0.00      0.00         1
                         AgriTech       0.00      0.00      0.00         4
                         Agritech       0.00      0.00      0.00         2
                 Agritech startup       0.00      0.00      0.00         1
                Apparel & Fashion       0.00      0.00      0.00         1
                       Automation       0.00      0.00      0.00         1
            Automobile Technology       0.00      0.00      0.00         1
                       Automotive       0.00      0.00      0.00         5
           Automotive and Rentals       0.00      0.00      0.00         1
                   B2B E-commerce       0.00  



**Results:**

Accuracy: 12.6% (extremely low).

Classification Report: Every class visible in the report above has 0 precision, recall, and F1-score, indicating no correct predictions were made for these categories. Most of them have five or fewer test examples.

**Interpretation:**

The dataset contains a large number of industry classes with very few examples each, which leads to poor model performance. The labels are also not normalized: the report lists "AgriTech", "Agritech", and "Agritech startup" as three separate classes, which splits the already scarce data further. Additionally, the text features derived from the "AboutCompany" field may lack discriminative signal or require more advanced preprocessing (e.g., n-grams, word embeddings, or domain-specific feature engineering) to capture patterns that improve classification accuracy.
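Two of these ideas, merging label variants and adding n-grams, can be sketched briefly. The `normalize_label` helper and the four toy descriptions below are hypothetical and only illustrate the approach; the real label set would need its own review before any merging rule is applied.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def normalize_label(label: str) -> str:
    """Hypothetical rule: lowercase, trim, and drop a trailing ' startup'."""
    cleaned = label.strip().lower()
    if cleaned.endswith(" startup"):
        cleaned = cleaned[: -len(" startup")]
    return cleaned


# Label variants of the kind that appear in the classification report.
raw_labels = pd.Series(["AgriTech", "Agritech", "Agritech startup", "AI startup"])
normalized = raw_labels.map(normalize_label)
print(normalized.tolist())  # ['agritech', 'agritech', 'agritech', 'ai']

# Toy descriptions (made up) paired with normalized labels.
texts = [
    "platform connecting farmers with crop buyers",
    "soil sensors for farmers and crop monitoring",
    "machine learning models for image recognition",
    "deep learning platform for speech recognition",
]
labels = ["agritech", "agritech", "ai", "ai"]

# Unigrams plus bigrams, with sublinear term-frequency scaling.
model = make_pipeline(
    TfidfVectorizer(stop_words="english", ngram_range=(1, 2), sublinear_tf=True),
    MultinomialNB(),
)
model.fit(texts, labels)
prediction = model.predict(["crop advisory app for farmers"])[0]
print(prediction)
```

Merging variants reduces the number of classes and raises the number of examples per class, which matters more for Naive Bayes here than the choice of n-gram range.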