## **MACHINE LEARNING MODELS**

**1. Data Preprocessing**
Objective: Prepare the dataset (Aba3_cleaned.csv) for machine learning.

Steps:

Handle Missing Values: Fill numeric columns with the mean and categorical columns with the mode.

Encode Categorical Features: Convert Head_Quarter and Industry to numerical values using LabelEncoder.

Map RoundSeries: Assign numerical labels to funding rounds (e.g., "Pre-seed" → 0, "Seed" → 1).

Feature Scaling: Standardize Founded and RoundSeries_Numerical using StandardScaler.

Train-Test Split: Split data into training/testing sets (80/20).

Save Processed Data: Export scaled and encoded datasets (trainingdata.csv, testingdata.csv).

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
import os


file_path = "F:\\school\\Azubi Africa\\LP1 Data Analytics Project\\LP-1-Project\\data\\Aba3_cleaned.csv"
df = pd.read_csv(file_path)


for col in df.columns:
    if df[col].dtype == 'float64' or df[col].dtype == 'int64':
        df[col] = df[col].fillna(df[col].mean())
    else:
        df[col] = df[col].fillna(df[col].mode()[0])


categorical_cols = [col for col in ['Head Quarter', 'Industry In'] if col in df.columns]

for col in categorical_cols:
    le = LabelEncoder()
    
    if isinstance(df.loc[0, col], str):  
        try:
            encoded_col_values = le.fit_transform([str(x) for x in list(df.loc[:, col])])
            
            
            df.loc[:, col] = encoded_col_values
            
        except Exception as e:
            print(f"Error encoding {col}: {e}")


round_series_mapping = {
    'Pre-seed': 0,
    'Seed': 1,
    'Pre-series A': 2,
    'Series A': 3,
    'Series B': 4,
    'Series C': 5,
    'Series D': 6,
    'Series E': 7,
    'Debt': 8,
    'Bridge': 9,
}

df['RoundSeries_Numerical'] = df['Funding Round/Series'].map(round_series_mapping)


df['RoundSeries_Numerical'] = df['RoundSeries_Numerical'].fillna(-1)  


feature_columns=['Year Founded', 'Head Quarter', 'Industry In', 'RoundSeries_Numerical']
target_column='Amount in ($)'

X_features=df[feature_columns]
y_target=df[target_column]


if y_target.dtype.name == "category" or y_target.dtype.name == "object":  
    X_train, X_test, y_train, y_test= train_test_split(
       X_features,y_target,test_size=0.2,
       random_state=42,stratify=y_target)
else:  
    X_train, X_test, y_train, y_test= train_test_split(
       X_features,y_target,test_size=0.2,
       random_state=42)

scaler=StandardScaler()

scaled_Xtrain=scaler.fit_transform(X_train[['Year Founded','RoundSeries_Numerical']])
scaled_Xtest=scaler.transform(X_test[['Year Founded','RoundSeries_Numerical']])


scaled_Xtrain_df=pd.DataFrame(scaled_Xtrain,
                              columns=['Founded_scaled','RoundSeries_scaled'])

scaled_Xtest_df=pd.DataFrame(scaled_Xtest,
                              columns=['Founded_scaled','RoundSeries_scaled'])

final_combined_training_data_with_all_columns_including_both_scale_and_non_scale_versions=(
pd.concat([
       scaled_Xtrain_df,X_train[[f"Head Quarter",f"Industry In"]],
       y_train.to_frame()],  
       axis=1))

combined_testing_data_with_all_columns_including_both_scale_and_non_scale_versions=(
pd.concat([
      scaled_Xtest_df,X_test[[f"Head Quarter",f"Industry In"]],
      y_test.to_frame()],  
      axis=1))

save_dir="F:\\school\\Azubi Africa\\LP1 Data Analytics Project\\LP-1-Project\\data"

training_data_to_save=(pd.concat([
                          scaled_Xtrain_df,X_train[['Head Quarter','Industry In']],y_train.to_frame()],
                          axis=1))

training_data_to_save.to_csv(os.path.join(save_dir,"trainingdata.csv"),index=False)

testing_data_to_save=(pd.concat([
                         scaled_Xtest_df,X_test[['Head Quarter','Industry In']],y_test.to_frame()],
                         axis=1))

testing_data_to_save.to_csv(os.path.join(save_dir,"testingdata.csv"),index=False)


**1. Funding Prediction Model**

Objective: Predict the funding amount (Amount) a startup receives.

Model: RandomForestRegressor

In [3]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd

# Load the preprocessed training and testing data
train_data_path = "F:\\school\\Azubi Africa\\LP1 Data Analytics Project\\LP-1-Project\\data\\trainingdata.csv"
test_data_path = "F:\\school\\Azubi Africa\\LP1 Data Analytics Project\\LP-1-Project\\data\\testingdata.csv"

train_data = pd.read_csv(train_data_path)
test_data = pd.read_csv(test_data_path)

# Check for NaN values in the target variable (Amount)
print("NaN values in training target (Amount):", train_data['Amount'].isna().sum())
print("NaN values in testing target (Amount):", test_data['Amount'].isna().sum())

# Handle NaN values in the target variable
train_data = train_data.dropna(subset=['Amount'])
test_data = test_data.dropna(subset=['Amount'])


# Define features (X) and target (y)
X_train = train_data[['Founded_scaled', 'RoundSeries_scaled', 'Head_Quarter', 'Industry']]
y_train = train_data['Amount']

X_test = test_data[['Founded_scaled', 'RoundSeries_scaled', 'Head_Quarter', 'Industry']]
y_test = test_data['Amount']

# Initialize the Random Forest Regressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = rf_model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

NaN values in training target (Amount): 184
NaN values in testing target (Amount): 175
Mean Squared Error: 2571352183797737.0
R-squared: -0.13244843132855189


*Evaluation Metrics:*

MSE: 2.57e+15 (extremely high, indicating poor performance).

R-squared: -0.13 (negative value means the model performs worse than a horizontal line).

*Interpretation:*

The model fails to capture meaningful patterns in the data.

Possible issues: Data noise, outliers, insufficient features, or improper scaling.

**2. Startup Success Prediction**

Objective: Classify startups as "successful" (1) if funding > $1M, otherwise "unsuccessful" (0).

*Model: LogisticRegression*

First, we need to create a binary target variable indicating success based on the 'Amount' column.

To build the Startup Success Prediction model, we need to define what "success" means using the available data. A reasonable proxy could be whether the startup has received funding above a certain threshold. Here's how we can implement that.



In [4]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.impute import SimpleImputer

# Load the preprocessed training and testing data
train_data_path = "F:\\school\\Azubi Africa\\LP1 Data Analytics Project\\LP-1-Project\\data\\trainingdata.csv"
test_data_path = "F:\\school\\Azubi Africa\\LP1 Data Analytics Project\\LP-1-Project\\data\\testingdata.csv"

train_data = pd.read_csv(train_data_path)
test_data = pd.read_csv(test_data_path)

# Define a threshold for "success" (e.g., funding above $1,000,000)
success_threshold = 1000000  # $1,000,000

# Create a binary target variable: 1 for success, 0 for failure
train_data['Success'] = train_data['Amount'].apply(lambda x: 1 if x > success_threshold else 0)
test_data['Success'] = test_data['Amount'].apply(lambda x: 1 if x > success_threshold else 0)

# Define features (X) and target (y)
X_train = train_data[['Founded_scaled', 'RoundSeries_scaled', 'Head_Quarter', 'Industry']]
y_train = train_data['Success']

X_test = test_data[['Founded_scaled', 'RoundSeries_scaled', 'Head_Quarter', 'Industry']]
y_test = test_data['Success']

# Initialize the imputer
imputer = SimpleImputer(strategy='mean')

# Fit the imputer on the training data and transform both training and testing data
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

# Initialize and train the Logistic Regression model
lr_model = LogisticRegression(random_state=42)
lr_model.fit(X_train, y_train)

# Make predictions
y_pred = lr_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print(f'Classification Report:\n{report}')

Accuracy: 0.4177215189873418
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.01      0.02       232
           1       0.41      1.00      0.59       163

    accuracy                           0.42       395
   macro avg       0.71      0.50      0.30       395
weighted avg       0.76      0.42      0.25       395



*Results:*

Accuracy: 41.7% (slightly better than random guessing for a binary problem).

*Classification Report:*

Precision/Recall for class 0 (unsuccessful): Near 0%.

Precision/Recall for class 1 (successful): 41% precision, 100% recall (model biases toward predicting "success").

*Interpretation:*

Severe class imbalance (likely more "successful" startups in the dataset).

Model fails to generalize; hyperparameter tuning with GridSearchCV slightly improves accuracy but not performance.

**Logistic Regression with hyperparameter tuning using GridSearchCV for the Startup Success model.**

The model aims to identify factors that correlate with high funding amounts (above $1M). It answers:

Which companies are likely to secure significant funding?

Which features (founding year, funding round stage, location, industry) are most influential?

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.impute import SimpleImputer
import pandas as pd

# Load the preprocessed training and testing data
train_data_path = "F:\\school\\Azubi Africa\\LP1 Data Analytics Project\\LP-1-Project\\data\\trainingdata.csv"
test_data_path = "F:\\school\\Azubi Africa\\LP1 Data Analytics Project\\LP-1-Project\\data\\testingdata.csv"

train_data = pd.read_csv(train_data_path)
test_data = pd.read_csv(test_data_path)


# Define a threshold for "success" (e.g., funding above $1,000,000)
success_threshold = 1000000  # $1,000,000

# Create a binary target variable: 1 for success, 0 for failure
train_data['Success'] = train_data['Amount'].apply(lambda x: 1 if x > success_threshold else 0)
test_data['Success'] = test_data['Amount'].apply(lambda x: 1 if x > success_threshold else 0)

# Define features (X) and target (y)
X_train = train_data[['Founded_scaled', 'RoundSeries_scaled', 'Head_Quarter', 'Industry']]
y_train = train_data['Success']

X_test = test_data[['Founded_scaled', 'RoundSeries_scaled', 'Head_Quarter', 'Industry']]
y_test = test_data['Success']

# Handle missing values in X_train and X_test
imputer = SimpleImputer(strategy='mean')  # Replace NaN with the mean value
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

# Hyperparameter tuning using GridSearchCV
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']
}

lr_model = LogisticRegression(random_state=42)
grid_search = GridSearchCV(lr_model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Best model
best_lr_model = grid_search.best_estimator_

# Make predictions
y_pred = best_lr_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f'Best Parameters: {grid_search.best_params_}')
print(f'Accuracy: {accuracy}')
print(f'Classification Report:\n{report}')

Best Parameters: {'C': 0.01, 'penalty': 'l1', 'solver': 'liblinear'}
Accuracy: 0.41265822784810124
Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00       232
           1       0.41      1.00      0.58       163

    accuracy                           0.41       395
   macro avg       0.21      0.50      0.29       395
weighted avg       0.17      0.41      0.24       395



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


*Model Behavior:*

The model is predicting every instance as class 1 (funding > $1M). This is evident because:

Recall for class 1 is 1.00: It correctly identifies all 163 "successes," but at the cost of misclassifying all 232 "failures" (class 0).

Precision for class 1 is 0.41: Only 41% of its "success" predictions are correct (163 correct predictions / 395 total predictions).

*Accuracy Deception:*

The 41.26% accuracy matches the proportion of class 1 in the dataset (163/395 ≈ 41%). This means the model is no better than randomly guessing the majority class (class 0 is 58.7% of the data, but the model defaults to class 1).

*Class 0 Failure:*

Precision and recall for class 0 are 0.00, indicating the model completely fails to identify companies with funding ≤ $1M.

**3. Industry Classification Model**

Objective: Classify startups into industries based on About_Company text descriptions.

Model: MultinomialNB with TF-IDF vectorization.

Given the dataset I am working with only includes the company's industry, I am building a model to predict the startup industry.

In [5]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import pandas as pd

# Load dataset
file_path = "F:\\school\\Azubi Africa\\LP1 Data Analytics Project\\LP-1-Project\\data\\Aba3_cleaned.csv"
df = pd.read_csv(file_path)

# Handle missing values in the 'About_Company' and 'Industry' columns
df.dropna(subset=['About_Company', 'Industry'], inplace=True)  # Remove rows with missing descriptions or labels

# Prepare features (X) and target (y)
X = df['About_Company']  # Company descriptions
y = df['Industry']  # Industry labels

# Check for NaN values in y
if y.isna().any():
    print("Warning: NaN values found in the target variable (y). Removing them.")
    df.dropna(subset=['Industry'], inplace=True)  # Drop rows where 'Industry' is NaN
    X = df['About_Company']  # Update X after dropping rows
    y = df['Industry']  # Update y after dropping rows

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert text descriptions into numerical data using TF-IDF
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Initialize and train the Multinomial Naive Bayes model
nb_model = MultinomialNB()
nb_model.fit(X_train_tfidf, y_train)

# Make predictions
y_pred = nb_model.predict(X_test_tfidf)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print(f'Classification Report:\n{report}')

Accuracy: 0.1461187214611872
Classification Report:
                                   precision    recall  f1-score   support

                       AI company       0.00      0.00      0.00         1
                       AI startup       0.00      0.00      0.00         1
                    AR/VR startup       0.00      0.00      0.00         1
                         AgriTech       0.00      0.00      0.00         7
                         Agritech       0.00      0.00      0.00         2
                Apparel & Fashion       0.00      0.00      0.00         1
                       Automation       0.00      0.00      0.00         1
                       Automotive       0.00      0.00      0.00         5
                         Aviation       0.00      0.00      0.00         1
                    Ayurveda tech       0.00      0.00      0.00         1
                   B2B E-commerce       0.00      0.00      0.00         1
                B2B Manufacturing       0.00   

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


*Results:*

Accuracy: 14.6% (extremely low).

Classification Report: Most classes have 0 precision/recall.

*Interpretation:*

Too many industry classes (219 labels) with insufficient data per class.

Text features (About_Company) may lack meaningful signals or require better preprocessing (e.g., n-grams, embeddings).