Capstone Two: Modeling

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV
import numpy as np

# Load the dataset
file_path = 'Largest companies in world.csv'
df = pd.read_csv(file_path)

In [2]:
# Convert financial metrics to numeric
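def convert_to_numeric(value):
    # Missing values stay missing
    if pd.isnull(value):
        return None
    # Values that are already numeric pass through unchanged
    # (calling .replace on a float would raise AttributeError)
    if not isinstance(value, str):
        return value
    try:
        # Strip the 'B' suffix and scale by 1e9
        return float(value.replace('B', '')) * 1e9
    except ValueError:
        return None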

columns_to_convert = ['revenue', 'profits', 'assets', 'marketValue']
for column in columns_to_convert:
    df[column] = df[column].apply(convert_to_numeric)

In [3]:


# Creating dummy variables for 'country'
df = pd.get_dummies(df, columns=['country'], drop_first=True)

# Standardizing numeric features, excluding 'marketValue'
numeric_columns = ['revenue', 'profits', 'assets']
scaler = StandardScaler()
df[numeric_columns] = scaler.fit_transform(df[numeric_columns].fillna(0))

# Splitting the dataset into training and testing sets
X = df.drop(['marketValue', 'rank', 'organizationName'], axis=1)  # Features
y = df['marketValue'].fillna(0)  # Target, handling NaNs by replacing with 0
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
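
Note that the scaler in this cell is fit on every row before the split, so the test rows contribute to the scaling mean and standard deviation. A minimal sketch of the alternative ordering, split first and then fit the scaler on the training rows only, is shown here on made-up data. The `toy_*` names and all values are hypothetical and are not taken from the company dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the three numeric features (made-up values)
rng = np.random.default_rng(0)
toy_features = pd.DataFrame({
    "revenue": rng.normal(50, 10, 100),
    "profits": rng.normal(5, 2, 100),
    "assets": rng.normal(200, 40, 100),
})
toy_target = rng.normal(100, 20, 100)

# Split first, so the test rows are never seen while fitting the scaler
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_features, toy_target, test_size=0.3, random_state=42
)

# Fit on the training rows only, then reuse those statistics for the test rows
toy_scaler = StandardScaler()
toy_X_train_scaled = toy_scaler.fit_transform(toy_X_train)
toy_X_test_scaled = toy_scaler.transform(toy_X_test)

print("Train column means:", toy_X_train_scaled.mean(axis=0))
print("Test column means: ", toy_X_test_scaled.mean(axis=0))
```

The training columns come out centered at zero, while the test columns are only approximately centered, because they are scaled with statistics they did not contribute to.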



In [4]:
# Initialize the Linear Regression model
lr_model = LinearRegression()

# Fit the model
lr_model.fit(X_train, y_train)

# Predict on the testing set
y_pred = lr_model.predict(X_test)

# Evaluation metrics
print("Linear Regression R^2:", r2_score(y_test, y_pred))
print("Linear Regression MAE:", mean_absolute_error(y_test, y_pred))
print("Linear Regression RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))

# Repeat the process for Ridge Regression and Random Forest Regressor,
# including hyperparameter tuning with GridSearchCV if desired.


Linear Regression R^2: -1.363256410717323e+22
Linear Regression MAE: 2.7927199705543596e+20
Linear Regression RMSE: 6.931348683182647e+21
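
The closing comment in the cell above mentions repeating the process for Ridge regression with GridSearchCV. A minimal sketch of what that step could look like follows. It runs on synthetic data from `make_regression`, not on the company dataset, and the alpha grid, the step names `scaler` and `ridge`, and the `*_demo` variables are illustrative assumptions rather than tuned choices.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem (an assumption for illustration only)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler
ridge_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge()),
])

# Candidate regularization strengths (illustrative, not tuned)
alpha_grid = {"ridge__alpha": [0.1, 1.0, 10.0, 100.0]}

ridge_search = GridSearchCV(
    ridge_pipeline, alpha_grid, cv=5, scoring="neg_mean_squared_error"
)
ridge_search.fit(X_tr_demo, y_tr_demo)

# GridSearchCV refits the best pipeline on all training rows by default
ridge_pred = ridge_search.predict(X_te_demo)
ridge_r2 = r2_score(y_te_demo, ridge_pred)

print("Best alpha:", ridge_search.best_params_["ridge__alpha"])
print("Ridge R^2 on held-out data:", ridge_r2)
```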


Linear Regression Conclusion

The R-squared (R²) value is negative: R² measures the proportion of the variance in the dependent variable that is predictable from the independent variables. On the training data of a least-squares model with an intercept it lies between 0 (no fit) and 1 (perfect fit), but on held-out data it has no lower bound. A negative R² means the model fits worse than a horizontal line at the mean of the dependent variable. A value as extreme as -1.36e22 means the predictions are wildly off for at least some test rows, which points to a problem in the model or the preprocessing rather than a merely weak fit.
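
To make a negative R² concrete, here is a small sketch with toy numbers that are not drawn from the dataset. It compares a baseline that always predicts the mean against predictions that are far off, and checks `r2_score` against the definition R² = 1 - SS_res / SS_tot.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy target values (hypothetical, not from the company dataset)
y_true_toy = np.array([1.0, 2.0, 3.0, 4.0])

# Baseline: always predict the mean of the target
y_mean_toy = np.full_like(y_true_toy, y_true_toy.mean())

# Predictions that are far from the actual values
y_bad_toy = np.array([10.0, -5.0, 12.0, -8.0])

r2_mean = r2_score(y_true_toy, y_mean_toy)
r2_bad = r2_score(y_true_toy, y_bad_toy)

# Same quantity from the definition: R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true_toy - y_bad_toy) ** 2)
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("R^2 of the mean baseline:", r2_mean)
print("R^2 of the bad predictions:", r2_bad)
print("R^2 from the definition:", r2_manual)
```

The mean baseline scores 0, and the bad predictions score 1 - 355/5 = -70, because their squared error is 71 times that of the baseline.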

Mean Absolute Error (MAE) is extremely large: The MAE measures the average magnitude of the errors in a set of predictions, without considering their direction, and is expressed in the same units as the target. An MAE of about 2.8e20 indicates that the model’s predictions are, on average, extremely far from the actual market values.

Root Mean Squared Error (RMSE) is also extremely large: RMSE is the square root of the average squared difference between predictions and actual observations. It summarizes how spread out the residuals are, and because errors are squared before averaging, it penalizes large errors more heavily than MAE does. Here the RMSE (about 6.9e21) is roughly 25 times the MAE (about 2.8e20), which suggests that a small number of enormous errors dominate. Like MAE, a large RMSE indicates poor model performance.
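
The difference between MAE and RMSE can be illustrated with a small sketch on toy numbers (hypothetical, not from the dataset). Both sets of predictions below have the same total absolute error, but in the second set it is concentrated in a single prediction.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Ten toy observations, all with a true value of zero
y_zero_toy = np.zeros(10)

# Case 1: every prediction is off by 10
pred_even = np.full(10, 10.0)

# Case 2: nine perfect predictions and one that is off by 100
pred_outlier = np.zeros(10)
pred_outlier[0] = 100.0

mae_even = mean_absolute_error(y_zero_toy, pred_even)
mae_outlier = mean_absolute_error(y_zero_toy, pred_outlier)
rmse_even = np.sqrt(mean_squared_error(y_zero_toy, pred_even))
rmse_outlier = np.sqrt(mean_squared_error(y_zero_toy, pred_outlier))

print("Even errors:  MAE =", mae_even, " RMSE =", rmse_even)
print("One outlier:  MAE =", mae_outlier, " RMSE =", rmse_outlier)
```

The MAE is 10 in both cases, while the RMSE rises from 10 to about 31.6 when the error is concentrated in one row. An RMSE far above the MAE is therefore a sign of a few very large errors.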

Retry with Random Forest Regressor

In [6]:
# Helper function to convert financial metrics
def convert_to_numeric(value):
    # Check if value is a string and contains 'B'
    if isinstance(value, str) and 'B' in value:
        try:
            return float(value.replace('B', '')) * 1e9
        except ValueError:
            return np.nan
    # If value is numeric (float or int), return it as is
    elif isinstance(value, (float, int)):
        return value
    # For NaN and other non-numeric, non-string types
    else:
        return np.nan

In [8]:
# Apply the conversion
columns_to_convert = ['revenue', 'profits', 'assets', 'marketValue']
for column in columns_to_convert:
    df[column] = df[column].apply(convert_to_numeric)

# Drop rows with missing 'marketValue' as it's our target
df.dropna(subset=['marketValue'], inplace=True)

# Standardizing numeric features (excluding 'marketValue')
numeric_columns = ['revenue', 'profits', 'assets']
scaler = StandardScaler()
df[numeric_columns] = scaler.fit_transform(df[numeric_columns].fillna(0))


In [9]:

X = df.drop(['marketValue', 'rank', 'organizationName'], axis=1)  # Features
y = df['marketValue']  # Target

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [10]:
# Initialize the Random Forest Regressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Fit the model to the training data
rf_model.fit(X_train, y_train)


In [11]:
# Make predictions
y_pred = rf_model.predict(X_test)

# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"MAE: {mae}")
print(f"RMSE: {rmse}")
print(f"R²: {r2}")



MAE: 18933456427.378963
RMSE: 40726953279.64194
R²: 0.5521251597917307


Conclusions
Model Performance: The Random Forest Regressor shows a moderate ability to predict company market value, explaining about 55% of the variance in the test set (R² ≈ 0.55). However, the MAE of about 18.9 billion and the RMSE of about 40.7 billion, in the units of marketValue, leave substantial room for improvement, especially if predictions are to be accurate across companies of different sizes.

Potential Improvements:

Feature Engineering: More sophisticated feature engineering might help improve the model's predictive accuracy. This could include creating interaction terms, polynomial features, or extracting more informative features from the current dataset.