# Startup Data - Exploratory Data Analysis (EDA)
This notebook walks through an EDA of the startup dataset: column typing, visualizations, target creation, and missing value treatment. The goal is to understand patterns in the data and prepare it for modeling.

## 1. Column Types
We separate categorical and numerical columns for easier handling and visualization.

In [None]:
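# Plotting libraries used in the cells below; `df` is assumed to be loaded already.
import matplotlib.pyplot as plt
import seaborn as sns

# Text-like columns are treated as categorical, int64/float64 columns as numerical
cat_cols = df.select_dtypes(include='object').columns.tolist()
num_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
print("Categorical columns:", cat_cols)
print("Numerical columns:", num_cols)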
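One detail worth checking when splitting columns this way: listing `int64` and `float64` explicitly skips narrower numeric dtypes such as `int32`. The sketch below uses a hypothetical toy frame (the column names are made up for illustration) to compare that selection with `include="number"`.

```python
import pandas as pd

# Hypothetical toy frame; "rounds" is deliberately stored as int32.
toy = pd.DataFrame({
    "name": ["a", "b"],
    "rounds": pd.Series([1, 2], dtype="int32"),
    "funding": [1.5, 2.5],
})

# Explicit dtype list: only exact int64/float64 columns match.
strict = toy.select_dtypes(include=["int64", "float64"]).columns.tolist()

# Generic selector: every numeric dtype matches.
broad = toy.select_dtypes(include="number").columns.tolist()

print("strict:", strict)
print("broad:", broad)
```

If the dataset only contains `int64` and `float64` columns, both selections give the same result.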

## 2. Countplot for Categorical Columns
We visualize the frequency of values for each categorical feature, showing only the 10 most frequent categories per column.

In [None]:
for col in cat_cols:
    plt.figure(figsize=(8, 4))
    sns.countplot(data=df, x=col, order=df[col].value_counts().index[:10])
    plt.xticks(rotation=45)
    plt.title(f'Distribution of {col}')
    plt.tight_layout()
    plt.show()

## 3. Correlation Heatmap for Numerical Columns
The heatmap shows pairwise Pearson correlations (the default for `DataFrame.corr()`), which helps us identify strongly related numerical features.

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(df[num_cols].corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()
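With many columns, a heatmap can be hard to read. As a complement, the sketch below lists the most strongly correlated pairs directly. The helper name `top_correlated_pairs` and the toy data are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = frame.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

# Toy data standing in for df[num_cols]
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],  # exact multiple of "a"
    "c": [5, 3, 4, 1, 2],
})
top_pairs = top_correlated_pairs(toy, n=2)
print(top_pairs)
```

On the real data this would be called as `top_correlated_pairs(df[num_cols])`.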

## 4. Funding Distribution by Status
Visualize how funding amounts differ between startup statuses.

In [None]:
sns.boxplot(data=df, x='status', y='funding_total_usd')
plt.title('Funding Distribution by Startup Status')
plt.show()

## 5. Create Binary Target Column
Convert the 'status' column into a binary target for classification modeling: `acquired` maps to 1 and `closed` maps to 0. Any status value outside this mapping becomes NaN.

In [None]:
df['status_binary'] = df['status'].map({'acquired': 1, 'closed': 0})
df['status_binary'].value_counts(normalize=True)
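The behaviour of `Series.map` with a dictionary can be checked on a small example. In the sketch below, the `operating` status is a hypothetical extra category used only to show what happens to unmapped values; dropping those rows is one option among several, not necessarily the approach used for this dataset.

```python
import pandas as pd

# Toy statuses; "operating" is a hypothetical value not covered by the mapping.
toy = pd.DataFrame({"status": ["acquired", "closed", "operating", "acquired"]})

status_map = {"acquired": 1, "closed": 0}
toy["status_binary"] = toy["status"].map(status_map)

# Values missing from the mapping become NaN rather than raising an error.
n_unmapped = int(toy["status_binary"].isna().sum())
print("Unmapped rows:", n_unmapped)

# One option: keep only labelled rows and store the target as an integer.
labelled = toy.dropna(subset=["status_binary"]).copy()
labelled["status_binary"] = labelled["status_binary"].astype(int)
print(labelled["status_binary"].value_counts(normalize=True))
```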

## 6. Handle Missing Values
Drop columns with over 50% missing data, then fill remaining values using median (numerical) or mode (categorical).

In [None]:
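missing_percent = df.isnull().mean()
cols_to_drop = missing_percent[missing_percent > 0.5].index
df.drop(columns=cols_to_drop, inplace=True)

# Keep only the columns that survived the drop; indexing dropped columns raises a KeyError
num_cols = [c for c in num_cols if c in df.columns]
cat_cols = [c for c in cat_cols if c in df.columns]

# Fill missing values
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])

# Final check; `status_binary` is in neither list, so rows with an unmapped status still count here
print("Total missing values:", df.isnull().sum().sum())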
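The drop-then-impute steps can also be wrapped in a helper that derives the column lists after dropping, so it never indexes a column that no longer exists. This is a minimal sketch: the function name `drop_and_impute`, the toy columns, and the choice to treat every non-numeric column as categorical are assumptions for illustration.

```python
import pandas as pd

def drop_and_impute(frame: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Drop columns whose missing share exceeds threshold, then impute the rest."""
    out = frame.copy()
    missing_share = out.isnull().mean()
    out = out.drop(columns=missing_share[missing_share > threshold].index)

    # Derive the column lists after dropping.
    num = out.select_dtypes(include="number").columns
    cat = out.select_dtypes(exclude="number").columns

    if len(num) > 0:
        out[num] = out[num].fillna(out[num].median())
    if len(cat) > 0:
        # mode() can return several rows on ties; the first row is used.
        out[cat] = out[cat].fillna(out[cat].mode().iloc[0])
    return out

# Hypothetical toy data
toy = pd.DataFrame({
    "funding": [100.0, None, 300.0, 500.0],
    "city": ["NYC", None, "NYC", "SF"],
    "mostly_empty": [None, None, None, 1.0],  # 75% missing, so it is dropped
})
clean = drop_and_impute(toy)
print(clean)
print("Total missing values:", int(clean.isnull().sum().sum()))
```

Returning a copy instead of modifying the frame in place keeps the raw data available for comparison.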

## ✅ EDA Summary

We successfully completed the Exploratory Data Analysis (EDA) phase, which included:

- Identifying categorical and numerical features
- Visualizing distributions and correlations, and checking the class balance of the target
- Creating a binary target column (`status_binary`)
- Handling missing values using:
  - Dropping columns with over 50% missing values
  - Imputing with median (numerical) and mode (categorical)
- Applying log transformation to `funding_total_usd` to reduce skewness

The dataset is now clean, transformed, and ready for feature engineering and modeling.
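The summary lists a log transformation of `funding_total_usd`, but no cell above performs it. One common way to do this is sketched below, assuming `np.log1p` (which handles zero funding) and a hypothetical output column named `funding_log`. The toy values are made up, and the original analysis may have used a different transform.

```python
import numpy as np
import pandas as pd

# Toy funding amounts with a long right tail (illustrative values only).
toy = pd.DataFrame({
    "funding_total_usd": [0.0, 1_000.0, 50_000.0, 2_000_000.0, 750_000_000.0]
})

# log1p computes log(1 + x), so a funding amount of 0 maps to 0 instead of -inf.
toy["funding_log"] = np.log1p(toy["funding_total_usd"])

skew_before = toy["funding_total_usd"].skew()
skew_after = toy["funding_log"].skew()
print(f"Skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

On the real data the same line would read `df['funding_log'] = np.log1p(df['funding_total_usd'])`, provided the column is numeric and non-negative.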