<a href="https://www.kaggle.com/code/mayarmohamedswilam/start-ups-companies-predection?scriptVersionId=144684275" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

* Import necessary libraries such as NumPy, pandas, and seaborn.
* List files in the '/kaggle/input/' directory.
* Read a CSV file ('companies.csv') into a DataFrame.
* Display basic information about the DataFrame using df.info().
* Drop unnecessary columns, fill missing numerical values with 0, and convert date columns to datetime format.
* Remove duplicate rows from the DataFrame.
* Drop columns with a high percentage of missing values.
* Drop specific columns ('normalized_name', 'entity_id', 'short_description').
* Filter data to select only operating companies with funding greater than 100,000 USD.
* Group the data by 'category_code' and calculate the mean of 'investment_rounds' within each category.
* Calculate the total funding received by each company by multiplying 'funding_total_usd' and 'funding_rounds' and store it in a new column 'total_funding'.
* Sort the DataFrame by 'total_funding' in descending order.
* Count the number of companies in each category using 'value_counts()'.
* Calculate the average funding received by companies with a 'status' of 'acquired'.
* Calculate the age of companies based on their 'founded_at' date and the current date.
* Group the data by 'status' and calculate the mean of 'milestones' and 'relationships'.
* Create a pivot table to analyze the mean 'investment_rounds' for each 'category_code' within different 'status' categories.

**Data Cleaning**

In [None]:
import pandas as pd

# Path to the CSV file in the read-only Kaggle input directory
csv_file_path = '/kaggle/input/ofhddd/companies.csv'

# Create a DataFrame from the CSV file
df = pd.read_csv(csv_file_path)
print(df.head())  # Display the first few rows of the DataFrame

In [None]:
df.info()

In [None]:
df.shape

**Drop unnecessary columns**

In [None]:
# Drop unnecessary columns
columns_to_drop = ['Unnamed: 0.1', 'permalink']
df.drop(columns=columns_to_drop, inplace=True)

**For demonstration purposes, we'll fill missing values with 0**

In [None]:
# For demonstration purposes, fill every missing value with 0.
# Note: fillna(0) applies to all columns (text and date columns included), not only numeric ones.
df.fillna(0, inplace=True)

In [None]:
# Convert date columns to datetime format
date_columns = ['first_milestone_at', 'last_milestone_at', 'created_at', 'updated_at']
for col in date_columns:
    df[col] = pd.to_datetime(df[col], errors='coerce')  # Convert invalid dates to NaT

In [None]:
df.info()

In [None]:
df.head()

In [None]:
missing_percentages = (df.isnull().sum() / len(df)) * 100
print(missing_percentages)

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
threshold = 0.7  # Maximum allowed fraction (not percentage) of NaN values per column
columns_to_drop = df.columns[df.isnull().mean() > threshold]
df.drop(columns=columns_to_drop, inplace=True)
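A missing-value threshold only has an effect if it is evaluated while the NaN values are still present, that is, before any fillna call. The following is a minimal sketch of that ordering on a made-up toy DataFrame; the column 'mostly_empty' and all values are invented for illustration and do not come from companies.csv.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "funding_total_usd": [1e6, np.nan, 5e5, np.nan],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
})

# 1. Measure missingness while the NaN values are still present
missing_fraction = toy.isnull().mean()

# 2. Drop columns whose missing fraction exceeds the threshold
threshold = 0.7
sparse_columns = missing_fraction[missing_fraction > threshold].index
cleaned = toy.drop(columns=sparse_columns)

# 3. Fill only the numeric columns, leaving text columns untouched
numeric_columns = cleaned.select_dtypes(include="number").columns
cleaned[numeric_columns] = cleaned[numeric_columns].fillna(0)

print(missing_fraction)
print(cleaned)
```

Filling only the numeric columns keeps text and date columns free of stray 0 values, which would otherwise show up later as a spurious category or an invalid date.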

In [None]:
columns_to_drop = ['normalized_name', 'entity_id', 'short_description']
df.drop(columns=columns_to_drop, inplace=True)

**Data Processing**

**Filtering Data**

In [None]:
filtered_df = df[(df['status'] == 'operating') & (df['funding_total_usd'] > 100000)]
print(filtered_df)

**Grouping Data**

In [None]:
grouped_df = df.groupby('category_code')['investment_rounds'].mean()
print(grouped_df)

**Calculations**

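Calculate a total-funding figure for each company. Note that if 'funding_total_usd' is already cumulative across rounds, multiplying it by 'funding_rounds' overstates the total, so the result is better read as a rough weighting.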

In [None]:
df['total_funding'] = df['funding_total_usd'] * df['funding_rounds']
print(df[['name', 'total_funding']])

**Sorting Data**

In [None]:
sorted_df = df.sort_values(by='total_funding', ascending=False)
print(sorted_df[['name', 'total_funding']])

**Counting Categories**

Count the number of companies in each category

In [None]:
category_counts = df['category_code'].value_counts()
print(category_counts)

**Conditional Calculations**

Calculate the average funding received by companies with a "status" of "acquired"

In [None]:
average_funding_acquired = df[df['status'] == 'acquired']['funding_total_usd'].mean()
print("Average funding for acquired companies:", average_funding_acquired)

**Creating New Columns**

Calculate the age of companies based on their "founded_at" date and the current date:


In [None]:
from datetime import datetime
# Convert 'founded_at' column to datetime format
df['founded_at'] = pd.to_datetime(df['founded_at'], errors='coerce')

# Calculate company age based on the 'founded_at' column
current_year = datetime.now().year
df['company_age'] = current_year - df['founded_at'].dt.year

print(df[['name', 'company_age']])

In [None]:
df[['name', 'company_age']].head()

**Aggregation with Grouping**

In [None]:
aggregated_group = df.groupby('status')[['milestones', 'relationships']].mean()
print(aggregated_group)

**Pivot Table**

In [None]:
pivot_table = df.pivot_table(index='status', columns='category_code', values='investment_rounds', aggfunc='mean')
print(pivot_table)

**Data Labeling**

**The average value of the "software" category for companies that went public (IPO)**

In [None]:
avg_software_for_ipo = pivot_table.loc['ipo', 'software']
print("Average value of the 'software' category for IPO companies:", avg_software_for_ipo)

**Which category has the lowest average value for companies that are acquired?**

In [None]:
lowest_avg_category_acquired = pivot_table.loc['acquired'].idxmin()
lowest_avg_value_acquired = pivot_table.loc['acquired'].min()
print("Category with the lowest average value for acquired companies:", lowest_avg_category_acquired)
print("Lowest average value for acquired companies:", lowest_avg_value_acquired)

**The overall average value across all categories for companies that are operating**

In [None]:
total_avg_value_operating = pivot_table.loc['operating'].mean()
print("Total average value of all categories for operating companies:", total_avg_value_operating)

**Which category has the highest variation (standard deviation) of average values across different statuses?**

In [None]:
highest_variation_category = pivot_table.std().idxmax()
highest_variation_value = pivot_table.std().max()
print("Category with the highest variation of average values:", highest_variation_category)
print("Highest variation value:", highest_variation_value)

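**For IPO companies, which category has the highest average value per company (excluding NaN values)?**

In [None]:
# count() returns the number of non-NaN categories in the IPO row, so this divides every
# value by the same constant; the ranking of categories is unchanged by the division
avg_per_company_for_ipo = pivot_table.loc['ipo'] / pivot_table.loc['ipo'].count()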
highest_avg_per_company_category_ipo = avg_per_company_for_ipo.idxmax()
highest_avg_per_company_value_ipo = avg_per_company_for_ipo.max()
print("Category with the highest average value per company for IPO companies:", highest_avg_per_company_category_ipo)
print("Highest average value per company for IPO companies:", highest_avg_per_company_value_ipo)

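**The most common status among companies in the "web" category**

In [None]:
# Count statuses in the raw data: the pivot table holds mean investment rounds, not company counts
most_common_status_web = df.loc[df['category_code'] == 'web', 'status'].value_counts().idxmax()
print("Most common status among companies in the 'web' category:", most_common_status_web)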

**The highest average value for companies that are closed**

In [None]:
highest_avg_value_closed_category = pivot_table.loc['closed'].idxmax()
highest_avg_value_closed = pivot_table.loc['closed'].max()
print("Category with the highest average value for closed companies:", highest_avg_value_closed_category)
print("Highest average value for closed companies:", highest_avg_value_closed)

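**The total number of companies in the "software" category for each status**

In [None]:
# Count rows in the raw data: summing the pivot table would add up means, not companies
software_counts_by_status = df.loc[df['category_code'] == 'software', 'status'].value_counts()
print("Number of companies in the 'software' category for each status:")
print(software_counts_by_status)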
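A pivot table built with aggfunc='mean' holds averages, so it cannot say how many companies fall into each group; counts have to come from the raw rows. A minimal sketch using pd.crosstab on a small invented DataFrame that reuses the notebook's column names (the rows themselves are made up):

```python
import pandas as pd

# Toy rows (hypothetical, for illustration only)
toy = pd.DataFrame({
    "status": ["operating", "operating", "acquired", "closed", "operating"],
    "category_code": ["web", "software", "web", "web", "web"],
})

# Count companies for every (status, category) pair
counts = pd.crosstab(toy["status"], toy["category_code"])
print(counts)

# The most common status within one category is the row with the largest count
most_common_status_web = counts["web"].idxmax()
print("Most common status in 'web':", most_common_status_web)
```

The same table could also be produced with pivot_table and aggfunc='count'; crosstab is used here only because it needs no value column.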

**The category with the highest average value for companies that are acquired**

In [None]:
# Note: the pivot table was built from 'investment_rounds', so these averages are not milestones
highest_avg_category_acquired = pivot_table.loc['acquired'].idxmax()
highest_avg_value_acquired = pivot_table.loc['acquired'].max()
print("Category with the highest average value for 'acquired' companies:", highest_avg_category_acquired)
print("Highest average value for 'acquired' companies:", highest_avg_value_acquired)

**The average value of the "semiconductor" category for companies that are operating**

In [None]:
avg_semiconductor_operating = pivot_table.loc['operating', 'semiconductor']
print("Average value of the 'semiconductor' category for operating companies:", avg_semiconductor_operating)

**The average value of the "advertising" category for companies that are operating**

In [None]:
avg_advertising_for_operating = pivot_table.loc['operating', 'advertising']
print("Average value of the 'advertising' category for operating companies:", avg_advertising_for_operating)

**The highest average value for companies that went public (IPO)**

In [None]:
highest_avg_category_ipo = pivot_table.loc['ipo'].idxmax()
highest_avg_value_ipo = pivot_table.loc['ipo'].max()
print("Category with the highest average value for IPO companies:", highest_avg_category_ipo)
print("Highest average value for IPO companies:", highest_avg_value_ipo)

**The average value of the "analytics" category for companies that are closed**

In [None]:
avg_analytics_for_closed = pivot_table.loc['closed', 'analytics']
print("Average value of the 'analytics' category for closed companies:", avg_analytics_for_closed)

**For companies that are operating, which category has the second-highest average value?**

In [None]:
second_highest_avg_category_operating = pivot_table.loc['operating'].nlargest(2).idxmin()
second_highest_avg_value_operating = pivot_table.loc['operating'].nlargest(2).min()
print("Category with the second-highest average value for operating companies:", second_highest_avg_category_operating)
print("Second-highest average value for operating companies:", second_highest_avg_value_operating)

**The average value of the "ecommerce" category for companies that are acquired**

In [None]:
avg_ecommerce_for_acquired = pivot_table.loc['acquired', 'ecommerce']
print("Average value of the 'ecommerce' category for acquired companies:", avg_ecommerce_for_acquired)

**Data Visualization**

In [None]:
import matplotlib.pyplot as plt

In [None]:
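# Pass figsize to DataFrame.plot directly; a separate plt.figure() call would leave an empty figure behind
pivot_table.T.plot(kind='bar', stacked=True, cmap='Set3', figsize=(12, 6))
plt.title("Average Investment Rounds by Category and Startup Status")
plt.xlabel("Category Code")
plt.ylabel("Mean Investment Rounds (stacked across statuses)")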
plt.xticks(rotation=45)
plt.legend(title="Startup Status")
plt.show()

In [None]:
plt.figure(figsize=(12, 6))
sns.violinplot(data=pivot_table, palette="Set3")
plt.title("Distribution of Category Values by Startup Status")
plt.xlabel("Category Code")
plt.ylabel("Value")
plt.xticks(rotation=45)
plt.show()

In [None]:
# Pass figsize to DataFrame.plot directly; a separate plt.figure() call would leave an empty figure behind
pivot_table.plot(kind='bar', figsize=(10, 6))
plt.title("Average Values of Categories by Startup Status")
plt.xlabel("Startup Status")
plt.ylabel("Average Value")
plt.xticks(rotation=0)
plt.legend(title="Category Code")
plt.show()

In [None]:
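import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 5))
categories = pivot_table.columns
statuses = pivot_table.index
x = np.arange(len(statuses))   # one group of bars per status
width = 0.8 / len(categories)  # split each group's width among the categories

# Offset each category so its bars sit beside the others instead of on top of them
for i, category in enumerate(categories):
    ax.bar(x + i * width, pivot_table[category], width=width, label=category)

# Centre the status labels under each group of bars
ax.set_xticks(x + width * (len(categories) - 1) / 2)
ax.set_xticklabels(statuses)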

ax.set_title("Average Values of Categories by Startup Status")
ax.set_xlabel("Startup Status")
ax.set_ylabel("Average Value")
ax.legend(title="Category Code")
plt.xticks(rotation=0)
plt.tight_layout()

plt.show()

In [None]:
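import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 5))
categories = pivot_table.columns
statuses = pivot_table.index
y_pos = np.arange(len(statuses))  # one group of bars per status
height = 0.8 / len(categories)    # split each group's height among the categories

# Offset each category so its bars sit beside the others instead of on top of them
for i, category in enumerate(categories):
    ax.barh(y_pos + i * height, pivot_table[category], height=height, label=category)

# Centre the status labels next to each group of bars
ax.set_yticks(y_pos + height * (len(categories) - 1) / 2)
ax.set_yticklabels(statuses)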

ax.set_title("Average Values of Categories by Startup Status")
ax.set_xlabel("Average Value")
ax.set_ylabel("Startup Status")
ax.legend(title="Category Code")
plt.tight_layout()

plt.show()

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 6))
statuses = pivot_table.index
categories = [c for c in pivot_table.columns if str(c) != '0']  # Exclude the '0' placeholder column left by fillna(0)

for i, category in enumerate(categories):
    ax.fill_between(statuses, pivot_table[category], label=category)

ax.set_title("Area Plot of Categories by Startup Status")
ax.set_xlabel("Startup Status")
ax.set_ylabel("Value")
ax.legend(title="Category Code")
plt.tight_layout()

plt.show()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Group by category_code and calculate average funding
average_funding_by_category = df.groupby('category_code')['funding_total_usd'].mean().reset_index()

# Set style
sns.set(style="whitegrid")

# Create bar plot
plt.figure(figsize=(12, 6))
sns.barplot(data=average_funding_by_category, x='category_code', y='funding_total_usd')
plt.xticks(rotation=45, ha='right')
plt.xlabel('Category Code')
plt.ylabel('Average Funding (USD)')
plt.title('Average Funding by Category Code')
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='status')
plt.xlabel('Status')
plt.ylabel('Count')
plt.title('Count of Companies by Status')
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='funding_total_usd', y='milestones', hue='status')
plt.xlabel('Funding (USD)')
plt.ylabel('Milestones')
plt.title('Funding vs. Milestones')
plt.legend(title='Status')
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(8, 6))
sns.boxplot(data=df, x='status', y='funding_total_usd')
plt.xlabel('Status')
plt.ylabel('Funding (USD)')
plt.title('Funding Distribution by Status')
plt.tight_layout()
plt.show()

In [None]:
# Convert 'first_funding_at' to datetime format
df['first_funding_at'] = pd.to_datetime(df['first_funding_at'], errors='coerce')

# Extract year from 'first_funding_at'
df['funding_year'] = df['first_funding_at'].dt.year

# Calculate average funding per year
average_funding_by_year = df.groupby('funding_year')['funding_total_usd'].mean().reset_index()

# Create line plot
plt.figure(figsize=(10, 6))
sns.lineplot(data=average_funding_by_year, x='funding_year', y='funding_total_usd')
plt.xlabel('Year')
plt.ylabel('Average Funding (USD)')
plt.title('Average Funding Over Time')
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='funding_total_usd', bins=30, kde=True)
plt.xlabel('Funding Total (USD)')
plt.ylabel('Frequency')
plt.title('Distribution of Funding Total')
plt.tight_layout()
plt.show()

In [None]:
# Select relevant numeric columns
numeric_columns = ['funding_total_usd', 'milestones', 'investment_rounds', 'relationships']

# Create pair plot
sns.pairplot(data=df[numeric_columns])
plt.tight_layout()
plt.show()

In [None]:
# Calculate correlation matrix
correlation_matrix = df[numeric_columns].corr()

# Create heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.tight_layout()
plt.show()

**PCA**

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the dataset
csv_file_path = '/kaggle/input/ofhddd/companies.csv'
df = pd.read_csv(csv_file_path)

# Drop unnecessary columns
columns_to_drop = ['Unnamed: 0.1', 'permalink']
df.drop(columns=columns_to_drop, inplace=True)

# Fill missing numerical values with 0
df.fillna(0, inplace=True)

# Convert date columns to datetime format
date_columns = ['first_milestone_at', 'last_milestone_at', 'created_at', 'updated_at']
for col in date_columns:
    df[col] = pd.to_datetime(df[col], errors='coerce')

# Filter data
filtered_df = df[(df['status'] == 'operating') & (df['funding_total_usd'] > 100000)]

# Perform PCA on selected numerical features
numerical_columns = ['investment_rounds', 'invested_companies', 'funding_rounds', 'funding_total_usd', 'milestones', 'relationships', 'lat', 'lng', 'ROI']
scaler = StandardScaler()
df[numerical_columns] = scaler.fit_transform(df[numerical_columns])

n_components = 5  # Choose the number of components
pca = PCA(n_components=n_components)
pca_result = pca.fit_transform(df[numerical_columns])

explained_variance = pca.explained_variance_ratio_
print("Explained Variance Ratios:", explained_variance)

# Feature engineering examples
df_encoded = pd.get_dummies(df, columns=['category_code'])  # One-hot encoding of categorical column
df['new_feature'] = df['investment_rounds'] * df['funding_rounds']  # Creating a new feature by interaction

# Extracting time-based features
df['month'] = df['created_at'].dt.month
df['day_of_week'] = df['created_at'].dt.dayofweek
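The cell above fixes the number of components at 5. One common alternative is to keep the smallest number of components that reaches a target share of the variance. The sketch below assumes synthetic random data and a 90% target; both are illustrative choices, not values taken from the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 9 features, two of which are made redundant
X = rng.normal(size=(200, 9))
X[:, 7] = X[:, 0] + 0.01 * rng.normal(size=200)
X[:, 8] = X[:, 1] - X[:, 2]

X_scaled = StandardScaler().fit_transform(X)

# Fit with all components, then inspect the cumulative explained variance
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

target = 0.90  # assumed target share of variance
n_components = int(np.searchsorted(cumulative, target) + 1)
print("Cumulative explained variance:", cumulative.round(3))
print("Components needed:", n_components)

# scikit-learn also accepts the target share directly as a float
pca_auto = PCA(n_components=target).fit(X_scaled)
print("Components chosen by PCA:", pca_auto.n_components_)
```

Passing a float between 0 and 1 as n_components lets scikit-learn make the same selection internally, which avoids hard-coding a component count.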

**Linear Regression**

Performing linear regression with preprocessed data

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the dataset
csv_file_path = '/kaggle/input/ofhddd/companies.csv'
df = pd.read_csv(csv_file_path)

# Drop unnecessary columns
columns_to_drop = ['Unnamed: 0.1', 'permalink']
df.drop(columns=columns_to_drop, inplace=True)

# Fill missing numerical values with 0
df.fillna(0, inplace=True)

# Convert date columns to datetime format
date_columns = ['first_milestone_at', 'last_milestone_at', 'created_at', 'updated_at']
for col in date_columns:
    df[col] = pd.to_datetime(df[col], errors='coerce')

# Filter data
filtered_df = df[(df['status'] == 'operating') & (df['funding_total_usd'] > 100000)]

# Define your target variable
target_column = 'ROI'

# Target variable (y) for the same rows as the filtered feature set
y = filtered_df[target_column]

# Exclude non-numeric columns and select numerical features for X
numerical_columns = ['investment_rounds', 'invested_companies', 'funding_rounds', 'funding_total_usd', 'milestones', 'relationships', 'lat', 'lng']
X = filtered_df[numerical_columns]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R2) Score: {r2}")

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA  # needed for the PCA step below
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the dataset
csv_file_path = '/kaggle/input/ofhddd/companies.csv'
df = pd.read_csv(csv_file_path)

# Drop unnecessary columns
columns_to_drop = ['Unnamed: 0.1', 'permalink']
df.drop(columns=columns_to_drop, inplace=True)

# Fill missing numerical values with 0
df.fillna(0, inplace=True)

# Convert date columns to datetime format
date_columns = ['first_milestone_at', 'last_milestone_at', 'created_at', 'updated_at']
for col in date_columns:
    df[col] = pd.to_datetime(df[col], errors='coerce')

# Filter data
filtered_df = df[(df['status'] == 'operating') & (df['funding_total_usd'] > 100000)]

# Perform PCA on selected numerical features
numerical_columns = ['investment_rounds', 'invested_companies', 'funding_rounds', 'funding_total_usd', 'milestones', 'relationships', 'lat', 'lng', 'ROI']
scaler = StandardScaler()
df[numerical_columns] = scaler.fit_transform(df[numerical_columns])

n_components = 5  # Choose the number of components
pca = PCA(n_components=n_components)
pca_result = pca.fit_transform(df[numerical_columns])

explained_variance = pca.explained_variance_ratio_
print("Explained Variance Ratios:", explained_variance)

# Feature engineering examples
df_encoded = pd.get_dummies(df, columns=['category_code'])  # One-hot encoding of categorical column
df['new_feature'] = df['investment_rounds'] * df['funding_rounds']  # Creating a new feature by interaction

# Extracting time-based features
df['month'] = df['created_at'].dt.month
df['day_of_week'] = df['created_at'].dt.dayofweek

# Define your target variable
target_column = 'ROI'

# Exclude non-numeric columns and select numerical features for X
numerical_columns = ['investment_rounds', 'invested_companies', 'funding_rounds', 'funding_total_usd', 'milestones', 'relationships', 'lat', 'lng']
X = df_encoded[numerical_columns]

# Split the data into training and testing sets
y = df_encoded[target_column]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R2) Score: {r2}")

**Create and train a Decision Tree Regressor model**

In [None]:
df.columns

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Load the dataset
csv_file_path = '/kaggle/input/ofhddd/companies.csv'
df = pd.read_csv(csv_file_path)

# Drop unnecessary columns
columns_to_drop = ['Unnamed: 0.1', 'permalink']
df.drop(columns=columns_to_drop, inplace=True)

# Fill missing numerical values with 0
df.fillna(0, inplace=True)

# Convert date columns to datetime format
date_columns = ['first_milestone_at', 'last_milestone_at', 'created_at', 'updated_at']
for col in date_columns:
    df[col] = pd.to_datetime(df[col], errors='coerce')

# Filter data
filtered_df = df[(df['status'] == 'operating') & (df['funding_total_usd'] > 100000)]

# Define your target variable
target_column = 'ROI'

# Exclude non-numeric columns and select numerical features for X
numerical_columns = ['investment_rounds', 'invested_companies', 'funding_rounds', 'funding_total_usd', 'milestones', 'relationships', 'lat', 'lng']
X = filtered_df[numerical_columns]

# Split the data into training and testing sets
y = filtered_df[target_column]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train a Decision Tree Regressor model
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model on the test set
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE) on Test Set: {mse}")
print(f"R-squared (R2) Score on Test Set: {r2}")
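An unconstrained decision tree can fit its training data almost perfectly, so the test-set score is most informative when read next to the training score. A minimal sketch on synthetic data; the data-generating function and the max_depth value of 4 are assumptions made for illustration only.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)

# Synthetic regression problem with noise (not the companies dataset)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Compare an unconstrained tree (max_depth=None) with a depth-limited one
scores = {}
for depth in (None, 4):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    scores[depth] = (
        r2_score(y_train, tree.predict(X_train)),
        r2_score(y_test, tree.predict(X_test)),
    )
    print(f"max_depth={depth}: train R2={scores[depth][0]:.3f}, test R2={scores[depth][1]:.3f}")
```

A large gap between the training and test scores suggests overfitting, in which case limiting max_depth or min_samples_leaf is a reasonable next experiment.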

**Importing SVM Classifier and Training the Model**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report
from sklearn.impute import SimpleImputer

# Load your dataset
csv_file_path = '/kaggle/input/ofhddd/companies.csv'
df = pd.read_csv(csv_file_path)

# Define your target column
target_column = 'ROI'

In [None]:
# Remove rows with missing target values
df.dropna(subset=[target_column], inplace=True)
y_classes = pd.cut(df[target_column], bins=[-float("inf"), 0, 100, float("inf")], labels=['low', 'medium', 'high'])

In [None]:
# Encode the categorical classes
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y_classes)

In [None]:
# Define your features (X) and the encoded target (y)
X = df[['investment_rounds', 'invested_companies', 'funding_rounds', 'funding_total_usd', 'milestones', 'relationships', 'lat', 'lng']]
y = y_encoded

# Impute missing values in X using mean imputation
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_imputed, y, test_size=0.2, random_state=42)

In [None]:
# Import the SVC classifier and the accuracy metric
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Train an SVC with default hyperparameters
svc = SVC()
svc.fit(X_train, y_train)

# Make predictions on the test set
y_pred = svc.predict(X_test)

# Compute and print the accuracy score
print('Model accuracy score with default hyperparameters: {0:0.4f}'.format(accuracy_score(y_test, y_pred)))

* Feature Selection: Examine the importance of each feature in your dataset. You can use techniques like feature importance scores from tree-based models or correlation analysis to identify the most relevant features.
* Feature Scaling: SVMs are sensitive to the scale of input features. Make sure to scale your features to have a mean of 0 and a standard deviation of 1. You can use StandardScaler from scikit-learn for this purpose.
* Polynomial Features: Try creating polynomial features by squaring or cubing existing features. This can capture nonlinear relationships in your data.
* Feature Interactions: Explore feature interactions by combining two or more features. For example, you can create a new feature that represents the product or ratio of two existing features.
* Dimensionality Reduction: If you have a large number of features, consider applying dimensionality reduction techniques like Principal Component Analysis (PCA) to reduce the feature space while retaining important information.
* Outlier Handling: Detect and handle outliers in your dataset. Outliers can negatively impact the performance of SVMs.
* Hyperparameter Tuning: Experiment with different SVM hyperparameters, such as the choice of kernel (e.g., linear, radial basis function), C (regularization parameter), and gamma (kernel coefficient for RBF). Use techniques like grid search or random search to find the best hyperparameters.
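Several of the suggestions above (scaling, kernel choice, tuning C and gamma) can be combined in a single scikit-learn pipeline, so that the scaler is refitted inside each cross-validation fold. The sketch below uses a synthetic classification dataset and a small, arbitrary parameter grid; neither comes from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class problem with 8 features (stand-in for the real data)
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaling lives inside the pipeline, so it only ever sees training folds
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC()),
])

# Small illustrative grid; gamma is ignored by the linear kernel
param_grid = {
    "svc__kernel": ["linear", "rbf"],
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.1],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
test_accuracy = search.score(X_test, y_test)

print("Best parameters:", search.best_params_)
print(f"Cross-validated accuracy: {search.best_score_:.4f}")
print(f"Test accuracy: {test_accuracy:.4f}")
```

Keeping the scaler in the pipeline means the held-out fold never influences the scaling statistics, which keeps the cross-validated score an honest estimate.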

Feature scaling using StandardScaler is applied to the features before training the SVM model.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer

In [None]:
# Load your dataset
csv_file_path = '/kaggle/input/ofhddd/companies.csv'
df = pd.read_csv(csv_file_path)

In [None]:
# Define your target column
target_column = 'ROI'

# Remove rows with missing target values
df.dropna(subset=[target_column], inplace=True)

# Convert 'ROI' to categorical labels
y_classes = pd.cut(df[target_column], bins=[-float("inf"), 0, 100, float("inf")], labels=['low', 'medium', 'high'])

# Encode the categorical classes
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y_classes)

In [None]:
# Define your features (X) and the encoded target (y)
X = df[['investment_rounds', 'invested_companies', 'funding_rounds', 'funding_total_usd', 'milestones', 'relationships', 'lat', 'lng']]
y = y_encoded

# Impute missing values in X using mean imputation
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

In [None]:
# Split first so the scaler is fitted on the training data only (avoids test-set leakage)
X_train, X_test, y_train, y_test = train_test_split(X_imputed, y, test_size=0.2, random_state=42)

# Scale features to mean=0 and std=1 using the training-set statistics
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
# Create and train an SVM Classifier model
model = SVC(kernel='linear', random_state=42)
model.fit(X_train, y_train)

In [None]:
# Make predictions on the test set
y_pred = model.predict(X_test)

In [None]:
# Calculate and print the accuracy score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy on Test Set: {accuracy:.4f}")