### Comments and changes made:
* We should probably extract location into a separate table as well.
* Lead Investor (yes or no) should not be an attribute in the dim table, because it can change between rounds. Instead, we need to split between lead investor and other investors in the main table.
* Don't remove Investor Names
* Does it make sense to fill NaN values with 0? What if we want to calculate an average, for example? The 0-filled values would then be misleading.
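On the last point, a minimal sketch of how zero-filling shifts an average. The values are hypothetical toy numbers, not rows from the dataset; only the column name is borrowed from it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values: the second round has no disclosed amount
raised = pd.Series([1_000_000.0, np.nan, 3_000_000.0], name="Money Raised (in USD)")

# Series.mean() skips NaN by default, so only the two known rounds count
mean_ignoring_missing = raised.mean()

# After fillna(0) the undisclosed round is averaged in as if it raised nothing
mean_with_zero_fill = raised.fillna(0).mean()

print(mean_ignoring_missing)  # 2000000.0
print(mean_with_zero_fill)    # roughly 1333333.33
```

Keeping NaN in the stored tables and filling only where a specific calculation needs it would avoid this distortion.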

In [75]:
# Import necessary libraries
import os
import sys
import pandas as pd

# Modify sys.path for it to contain the main repo path so we can import modules
parent_dir = os.path.abspath(os.path.join(os.getcwd(), '..'))
if parent_dir not in sys.path:
    sys.path.insert(0, parent_dir)

from utils.data_utils import get_entire_df

In [76]:
class EDA:
    def __init__(self):
        self.df = get_entire_df()

    def print_overview(self):
        print("\nDataset Overview:")
        print(self.df.info())
        print("\nSummary Statistics:")
        print(self.df.describe())
        print("\nMissing Values:")
        print(self.df.isnull().sum())

    def check_for_duplicates(self):
        print("\nNumber of Duplicates:", self.df.duplicated().sum())

eda = EDA()

Created dataframe with shape: (3172, 26)


## After a review of the data we can see the following:
1. "Organization Location" holds City, Region, Country, and Continent in the same cell. This has to be split into separate columns.
2. "Organization Industries" and "Investor Names" include many different values in each cell. These have to be split up and put in separate tables.
3. The following columns contain both numeric values and NaN values: "Money Raised", "Money Raised (in USD)", "Pre-Money Valuation", "Pre-Money Valuation (in USD)", "Total Funding Amount", "Total Funding Amount (in USD)".

# NEED TO SEPARATE COMPANIES FROM ROUNDS FIRST
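One possible way to do that separation is sketched below. It assumes that "Organization Name" uniquely identifies a company, which has not been verified against the data. The helper name `split_companies_and_rounds` and the `OrganizationID` column are hypothetical.

```python
import pandas as pd

def split_companies_and_rounds(df, org_cols, key="Organization Name"):
    """Return (companies, rounds): one row per organization, and the
    round rows with the organization columns replaced by an OrganizationID."""
    # One row per company (assumption: `key` is unique per company)
    companies = df[org_cols].drop_duplicates(subset=key).reset_index(drop=True)
    companies["OrganizationID"] = companies.index + 1

    # Attach the surrogate key to every round, then drop the company columns
    rounds = df.merge(companies[[key, "OrganizationID"]], on=key, how="left")
    rounds = rounds.drop(columns=org_cols)
    return companies, rounds

# Hypothetical toy data: company A has two rounds, company B has one
toy = pd.DataFrame({
    "Transaction Name": ["Seed Round - A", "Series A - A", "Seed Round - B"],
    "Organization Name": ["A", "A", "B"],
    "Organization Location": ["X", "X", "Y"],
    "Money Raised (in USD)": [1.0, 2.0, 3.0],
})
companies, rounds = split_companies_and_rounds(
    toy, org_cols=["Organization Name", "Organization Location"]
)
print(companies)
print(rounds)
```

With the companies in their own table, location and industry processing can run once per company instead of once per round.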

In [97]:
class Preprocessing:
    def __init__(self, df):
        self.df = df
        self.locations_df = pd.DataFrame()
        self.industries_df = pd.DataFrame()
        self.investors_df = pd.DataFrame()

    def process_locations(self):
        self.locations_df['Location'] = self.df['Organization Location']
        self.locations_df = self.locations_df.drop_duplicates().reset_index(drop=True)
        self.locations_df['LocationID'] = self.locations_df.index + 1
        
        # Merge the original df with the locations_df on the 'Organization Location' column
        self.df = pd.merge(self.df, self.locations_df, how='left', left_on='Organization Location', right_on='Location')
        self.df['Organization Location'] = self.df['LocationID']
        self.df = self.df.drop(columns=['Location', 'LocationID'])

        # Split "City, Region, Country, Continent" into separate columns.
        # Splitting on ',' leaves leading spaces, so strip each part.
        location_split = self.locations_df['Location'].str.split(',', expand=True)
        location_split = location_split.apply(lambda col: col.str.strip())
        self.locations_df['City'] = location_split[0]
        self.locations_df['Region'] = location_split[1]
        self.locations_df['Country'] = location_split[2]
        self.locations_df['Continent'] = location_split[3]
        self.locations_df = self.locations_df.drop(columns=['Location'])

    def process_industries(self):
        self.industries_df = self.df[['Organization Industries']].copy()
        self.industries_df['Industry'] = self.industries_df['Organization Industries'].str.split(',')
        self.industries_df = self.industries_df.explode('Industry').drop(columns=['Organization Industries'])
        self.industries_df['Industry'] = self.industries_df['Industry'].str.strip()
        self.industries_df = self.industries_df.drop_duplicates().reset_index(drop=True)
        
        self.industries_df['IndustryID'] = self.industries_df.index + 1

        # Create a mapping between original dataframe and industries (many-to-many relationship)
        industry_mapping_df = self.df[['Organization Name', 'Organization Industries']].copy()
        industry_mapping_df['Industry'] = industry_mapping_df['Organization Industries'].str.split(',')
        industry_mapping_df = industry_mapping_df.explode('Industry')
        industry_mapping_df['Industry'] = industry_mapping_df['Industry'].str.strip()
        industry_mapping_df = pd.merge(industry_mapping_df, self.industries_df, how='left', on='Industry')
        industry_mapping_df = industry_mapping_df.drop(columns=['Organization Industries'])
        self.industry_mapping_df = industry_mapping_df

        print(self.industry_mapping_df)

        # self.df = pd.merge(self.df, self.industries_df, how='left', left_on='', right_on='Industry')
        # self.df['Organization Industries'] = self.df['IndustryID']
        # self.df = self.df.drop(columns=['IndustryID'])

    # def process_investors(self):

preprocessing = Preprocessing(eda.df)

preprocessing.process_locations()
preprocessing.process_industries()

print(preprocessing.industries_df)

      Organization Name                   Industry  IndustryID
0             Flagright                 Compliance           1
1             Flagright         Financial Services           2
2             Flagright                    FinTech           3
3             Flagright            Fraud Detection           4
4             Flagright     Information Technology           5
...                 ...                        ...         ...
11298       AirForestry                     Drones         367
11299       AirForestry  Environmental Engineering         155
11300       AirForestry                   Forestry          39
11301       AirForestry                  GreenTech          25
11302       AirForestry             Sustainability          27

[11303 rows x 3 columns]
                   Industry  IndustryID
0                Compliance           1
1        Financial Services           2
2                   FinTech           3
3           Fraud Detection           4
4    Information T
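The `process_investors` stub is still empty. A rough sketch of how it could follow the same pattern as `process_industries`, while also recording the lead/other split per round, is given below. The function name `build_investor_tables` and the `IsLead` and `InvestorID` columns are hypothetical. It also assumes that a lead investor may or may not be repeated in "Investor Names", so both cases are handled.

```python
import pandas as pd

def build_investor_tables(df, round_key="Transaction Name"):
    """Return (investors, round_investors) built from the comma-separated
    'Lead Investors' and 'Investor Names' columns."""
    frames = []
    for col, is_lead in [("Lead Investors", True), ("Investor Names", False)]:
        part = df[[round_key, col]].dropna(subset=[col]).copy()
        part["Investor"] = part[col].str.split(",")
        part = part.explode("Investor")
        part["Investor"] = part["Investor"].str.strip()
        part["IsLead"] = is_lead
        frames.append(part[[round_key, "Investor", "IsLead"]])
    links = pd.concat(frames, ignore_index=True)

    # An investor listed in both columns for a round is kept once, flagged as lead
    links = links.groupby([round_key, "Investor"], as_index=False, sort=False)["IsLead"].any()

    # Dimension table: one row per distinct investor name
    investors = links[["Investor"]].drop_duplicates().reset_index(drop=True)
    investors["InvestorID"] = investors.index + 1

    # Bridge table: one row per (round, investor) pair
    round_investors = links.merge(investors, on="Investor", how="left").drop(columns="Investor")
    return investors, round_investors

# Hypothetical toy data
toy = pd.DataFrame({
    "Transaction Name": ["Round 1", "Round 2"],
    "Lead Investors": ["Fund A", None],
    "Investor Names": ["Fund A, Angel B", "Angel B"],
})
investors, round_investors = build_investor_tables(toy)
print(investors)
print(round_investors)
```

Storing the lead flag on the round-investor link rather than on the investor keeps it free to differ from round to round.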

In [135]:
# 2. Creating separate tables for "Organization Industries" and "Investor Names"

# Split "Organization Industries" into separate rows (normalization)
organization_industries_table = data[['Transaction Name', 'Organization Industries']].copy()
organization_industries_table = organization_industries_table.assign(
    Organization_Industry=organization_industries_table['Organization Industries'].str.split(', ')
).explode('Organization_Industry').drop(columns=['Organization Industries'])

# Split "Investor Names" into separate rows (normalization)
investor_names_table = data[['Transaction Name', 'Investor Names']].copy()
investor_names_table = investor_names_table.assign(
    Investor_Name=investor_names_table['Investor Names'].str.split(', ')
).explode('Investor_Name').drop(columns=['Investor Names'])

# Display the first few rows of the new tables
print("\nOrganization Industries Table:")
print(organization_industries_table.head())
print("\nInvestor Names Table:")
print(investor_names_table.head())


Organization Industries Table:
         Transaction Name   Organization_Industry
0  Seed Round - Flagright              Compliance
0  Seed Round - Flagright      Financial Services
0  Seed Round - Flagright                 FinTech
0  Seed Round - Flagright         Fraud Detection
0  Seed Round - Flagright  Information Technology

Investor Names Table:
         Transaction Name        Investor_Name
0  Seed Round - Flagright   Charles Delingpole
0  Seed Round - Flagright     Donald Bringmann
0  Seed Round - Flagright     Erik Muttersbach
0  Seed Round - Flagright  Four Cities Capital
0  Seed Round - Flagright    Fredrik Thomassen


In [136]:
# After creating the new tables, we can drop the original "Organization Industries" column from the main DataFrame ("Investor Names" is kept)
data.drop('Organization Industries', axis=1, inplace=True)


Original columns dropped.


In [137]:
# 3. Change from NaN to 0 for the numeric columns: "Money Raised", "Money Raised (in USD)", "Pre-Money Valuation", "Pre-Money Valuation (in USD)", "Total Funding Amount", "Total Funding Amount (in USD)"

def fill_na_with_zero(df):
    numeric_columns = ['Money Raised', 'Money Raised (in USD)', 'Pre-Money Valuation', 'Pre-Money Valuation (in USD)', 'Total Funding Amount', 'Total Funding Amount (in USD)']
    df[numeric_columns] = df[numeric_columns].fillna(0)
    print("\nMissing values filled with 0 for numeric columns.")

fill_na_with_zero(data)

# Display the first few rows of the cleaned data, focusing on the changed columns
print("\nCleaned Data:")
print(data.head())


Missing values filled with 0 for numeric columns.

Cleaned Data:
                  Transaction Name  \
0           Seed Round - Flagright   
1             Seed Round - aboutuz   
2          Seed Round - Kubermatic   
3          Seed Round - MYNE Homes   
4  Pre Seed Round - Emulate Energy   

                                                         Transaction Name URL  \
0           https://www.crunchbase.com/funding_round/flagright-seed--82849d85   
1             https://www.crunchbase.com/funding_round/aboutuz-seed--9c881e5a   
2          https://www.crunchbase.com/funding_round/kubermatic-seed--286b6112   
3          https://www.crunchbase.com/funding_round/myne-homes-seed--3bf4b676   
4  https://www.crunchbase.com/funding_round/emulate-energy-pre-seed--71a33333   

        Lead Investors  Money Raised Money Raised Currency  \
0    Moonfire Ventures     2800000.0                   USD   
1        FasterCapital      632000.0                   USD   
2  NetApp Excellerator          

## Suggestion on Star Schema

### FactFunding
| Column                     | Description                        |
|----------------------------|------------------------------------|
| Transaction ID (PK)         | Unique identifier for the transaction |
| Organization ID (FK)        | Foreign key referencing DimensionOrganization |
| Lead Investor ID (FK)       | Foreign key referencing DimensionInvestor for the lead investors |
| Investor IDs (FK)           | Foreign key referencing DimensionInvestor for all investors |
| Money Raised                | Amount of money raised in local currency |
| Money Raised (in USD)       | Amount of money raised in USD |
| Funding Type                | Type of funding (e.g., Seed, Pre-Seed) |
| Announced Date              | Date when the funding was announced |
| Funding Stage               | Stage of funding (e.g., Seed, Series A) |
| Number of Funding Rounds    | Number of funding rounds for the organization |
| Total Funding Amount        | Total amount of funding raised in local currency |
| Total Funding Amount (in USD) | Total amount of funding raised in USD |
| Equity Only                 | Whether it was equity-only funding (Yes/No) |

---

### DimensionOrganization
| Column                     | Description                        |
|----------------------------|------------------------------------|
| Organization ID (PK)        | Unique identifier for the organization |
| Organization Name           | Name of the organization           |
| Location ID (FK)            | Reference to where the organization is located |
| Organization URL            | URL of the organization's Crunchbase profile |
| Organization Description    | Brief description of the organization |
| Organization Website        | Website of the organization        |

---

### DimensionLocation
| Column                     | Description                        |
|----------------------------|------------------------------------|
| Location ID (PK)            | Unique identifier for the location |
| City                        | City where the organization is located |
| Region                      | Region where the organization is located |
| Country                     | Country where the organization is located |
| Continent                   | Continent where the organization is located |

---

### DimensionInvestor
| Column                     | Description                        |
|----------------------------|------------------------------------|
| Investor ID (PK)            | Unique identifier for the investor |
| Investor Name               | Name of the investor               |

---

### DimensionIndustry
| Column                     | Description                        |
|----------------------------|------------------------------------|
| Industry ID (PK)            | Unique identifier for the industry |
| Industry Name               | Name of the industry               |
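
To check that the proposed tables can answer a typical question, here is a small sketch that totals money raised per country. The rows are hypothetical toy data shaped like the tables above, and only the columns needed for the query are included.

```python
import pandas as pd

# Hypothetical toy rows shaped like the proposed tables
dim_location = pd.DataFrame({
    "Location ID": [1, 2],
    "Country": ["Germany", "Sweden"],
})
dim_organization = pd.DataFrame({
    "Organization ID": [10, 11, 12],
    "Location ID": [1, 1, 2],
})
fact_funding = pd.DataFrame({
    "Transaction ID": [100, 101, 102, 103],
    "Organization ID": [10, 11, 12, 12],
    "Money Raised (in USD)": [1.0, 2.0, 3.0, 4.0],
})

# Fact -> organization -> location, then aggregate per country
raised_by_country = (
    fact_funding
    .merge(dim_organization, on="Organization ID", how="left")
    .merge(dim_location, on="Location ID", how="left")
    .groupby("Country")["Money Raised (in USD)"]
    .sum()
)
print(raised_by_country)
```

Because Location ID sits on DimensionOrganization rather than on FactFunding, the query needs two joins. If location-based queries turn out to be common, carrying Location ID on the fact table as well might be worth considering.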