### Comments and changes made:
* We should probably extract location into a separate table as well
* Lead Investor (yes or no) should not be an attribute in the dim table because that can change between rounds. Instead, we need to split between lead investors and other investors in the main table
* Don't remove Investor Names
* Does it make sense to fill empty values with 0 instead of keeping them as NaN? What if we e.g. want to calculate an average? Then the 0-filled values would be misleading.
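
The last point can be illustrated with a small sketch. The amounts below are made up for illustration and are not taken from the dataset. By default pandas skips NaN when computing a mean, while 0-filled values are counted and pull the result down.

```python
import numpy as np
import pandas as pd

# Hypothetical amounts; NaN marks a round whose amount was not disclosed.
money_raised = pd.Series([1_000_000.0, np.nan, 3_000_000.0])

# mean() skips NaN by default (skipna=True), so only the two known values count.
mean_with_nan = money_raised.mean()

# Filling with 0 treats the undisclosed round as if nothing had been raised.
mean_zero_filled = money_raised.fillna(0).mean()

print(mean_with_nan)     # 2000000.0
print(mean_zero_filled)  # roughly 1333333.33
```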

In [88]:
# Import necessary libraries
import os
import sys
import pandas as pd

# Modify sys.path for it to contain the main repo path so we can import modules
parent_dir = os.path.abspath(os.path.join(os.getcwd(), '..'))
if parent_dir not in sys.path:
    sys.path.insert(0, parent_dir)

from utils.data_utils import get_entire_df

In [89]:
class EDA:
    def __init__(self):
        self.df = get_entire_df()

    def print_overview(self):
        print("\nDataset Overview:")
        # info() prints its report itself and returns None, so it must not be wrapped in print()
        self.df.info()
        print("\nSummary Statistics:")
        print(self.df.describe())
        print("\nMissing Values:")
        print(self.df.isnull().sum())

    def check_for_duplicates(self):
        print("\nNumber of Duplicates:", self.df.duplicated().sum())

eda = EDA()

Created dataframe with shape: (3172, 26)


## After a review of the data we can see the following:
1. "Organization Location" contains City, Region, Country and Continent in the same cell. This has to be split into separate columns.
2. "Organization Industries" and "Investor Names" contain many different values in each cell. These have to be split up and put in separate tables.
3. The following columns contain both numeric values and NaN values: "Money Raised", "Money Raised (in USD)", "Pre-Money Valuation", "Pre-Money Valuation (in USD)", "Total Funding Amount", "Total Funding Amount (in USD)"
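
The first two observations can be sketched on a toy frame. The rows below are hypothetical and only mimic the structure described above. They assume a comma followed by a space as the separator, which should be checked against the real data.

```python
import pandas as pd

# Hypothetical rows that mimic the structure described above.
raw = pd.DataFrame({
    "Organization Location": [
        "Berlin, Berlin, Germany, Europe",
        "Lund, Skane Lan, Sweden, Europe",
    ],
    "Organization Industries": ["FinTech, Compliance", "Energy"],
})

# Point 1: split the location string into four columns and trim whitespace.
location_parts = raw["Organization Location"].str.split(",", expand=True)
location_parts = location_parts.apply(lambda col: col.str.strip())
location_parts.columns = ["City", "Region", "Country", "Continent"]

# Point 2: one row per (source row, industry) pair instead of many values per cell.
industries = (
    raw["Organization Industries"]
    .str.split(",")
    .explode()
    .str.strip()
    .reset_index()
    .rename(columns={"index": "row", "Organization Industries": "Industry"})
)

print(location_parts)
print(industries)
```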

## Suggestion on Star Schema

### FactFunding
| Column                     | Description                        |
|----------------------------|------------------------------------|
| Transaction ID (PK)         | Unique identifier for the transaction |
| Organization ID (FK)        | Foreign key referencing DimensionOrganization |
| Money Raised                | Amount of money raised in local currency |
| Money Raised (in USD)       | Amount of money raised in USD |
| Funding Type                | Type of funding (e.g., Seed, Pre-Seed) |
| Announced Date              | Date when the funding was announced |
| Funding Stage               | Stage of funding (e.g., Seed, Series A) |
| Number of Funding Rounds    | Number of funding rounds for the organization |
| Total Funding Amount        | Total amount of funding raised in local currency |
| Total Funding Amount (in USD) | Total amount of funding raised in USD |
| Equity Only                 | Whether it was equity-only funding (Yes/No) |
| Transaction Name            | Name of the transaction            |
| Transaction Name URL        | URL of the transaction             |
| Money Raised Currency       | Currency in which money was raised |
| Pre-Money Valuation         | Valuation before the money was raised |
| Pre-Money Valuation Currency | Currency of the pre-money valuation |
| Pre-Money Valuation (in USD) | Pre-money valuation in USD         |
| Funding Status              | Status of the funding              |
| Number of Investors         | Number of investors involved       |


### DimensionOrganization
| Column                     | Description                        |
|----------------------------|------------------------------------|
| Organization ID (PK)        | Unique identifier for the organization |
| Organization Name           | Name of the organization           |
| Location ID (FK)            | Reference to where the organization is located |
| Organization Name URL            | URL of the organization's Crunchbase profile |
| Organization Description    | Brief description of the organization |
| Organization Website        | Website of the organization        |


### DimensionLocation
| Column                     | Description                        |
|----------------------------|------------------------------------|
| Location ID (PK)            | Unique identifier for the location |
| City                        | City where the organization is located |
| Region                      | Region where the organization is located |
| Country                     | Country where the organization is located |
| Continent                   | Continent where the organization is located |


### DimensionInvestor
| Column                     | Description                        |
|----------------------------|------------------------------------|
| Investor ID (PK)            | Unique identifier for the investor |
| Investor                    | Name of the investor               |

### Investor Mapping Table (many-to-many)
| Column                     | Description                        |
|----------------------------|------------------------------------|
| Transaction ID (FK)            | Unique identifier for the transaction |
| Investor ID (FK)                    | Unique identifier for the investor               |
| IsLeadInvestor                   | Boolean flag: whether the investor is a lead investor in this specific transaction |


### DimensionIndustry
| Column                     | Description                        |
|----------------------------|------------------------------------|
| Industry ID (PK)            | Unique identifier for the industry |
| Industry               | Name of the industry               |

### Industry Mapping Table (many-to-many)
| Column                     | Description                        |
|----------------------------|------------------------------------|
| Organization ID (FK)            | Unique identifier for the organization |
| Industry ID (FK)                    | Unique identifier for the industry               |
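
To show how the mapping tables are meant to be read, here is a minimal sketch with made-up miniature tables. The investor names are placeholders. The column names without spaces (`InvestorID`, `TransactionID`) are an assumption about how the tables are stored as DataFrames.

```python
import pandas as pd

# Hypothetical miniature versions of DimensionInvestor and the investor mapping table.
dim_investor = pd.DataFrame({
    "InvestorID": [1, 2],
    "Investor": ["Investor A", "Investor B"],
})
investor_mapping = pd.DataFrame({
    "TransactionID": [1, 1, 2],
    "InvestorID": [1, 2, 1],
    "IsLeadInvestor": [True, False, False],
})

# Keep only the lead rows, then resolve the investor name via the dimension table.
lead_investors = (
    investor_mapping[investor_mapping["IsLeadInvestor"]]
    .merge(dim_investor, on="InvestorID", how="left")
    [["TransactionID", "Investor"]]
)

print(lead_investors)
```

In this toy data "Investor A" leads transaction 1 but only participates in transaction 2, which is why the flag sits in the mapping table rather than in DimensionInvestor.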

## This star schema will be implemented by the class below. 

In [90]:
class Preprocessing:
    def __init__(self, df):
        self.df = df
        self.organizations_df = pd.DataFrame()
        self.locations_df = pd.DataFrame()
        self.industries_df = pd.DataFrame()
        self.industry_mapping_df = pd.DataFrame()
        self.investors_df = pd.DataFrame()
        self.investor_mapping_df = pd.DataFrame()
        self.process_all()

    def process_all(self):
        self.process_organizations()
        self.process_locations()
        self.process_industries()
        self.process_investors()

    def process_organizations(self):
        # Split only on the first ' - ' so that hyphenated organization names are not truncated
        self.df['Organization Name'] = self.df['Transaction Name'].str.split(' - ', n=1).str[1].str.strip()
        self.organizations_df = self.df[['Organization Name', 'Organization Location', 'Organization Industries', 'Organization Website', 'Organization Name URL', 'Organization Description']].copy()
        self.organizations_df.drop_duplicates(subset='Organization Name', inplace=True)
        # Reset the index first so that the IDs are contiguous (1..n) after dropping duplicates
        self.organizations_df.reset_index(drop=True, inplace=True)
        self.organizations_df['OrganizationID'] = self.organizations_df.index + 1
        self.df = self.df.merge(self.organizations_df[['Organization Name', 'OrganizationID']],
                                              left_on='Organization Name', right_on='Organization Name')
        self.df.drop(columns=['Organization Name', 'Organization Location', 'Organization Industries', 'Organization Website', 'Organization Name URL', 'Organization Description'], inplace=True)

    def process_locations(self):
        self.locations_df['Location'] = self.organizations_df['Organization Location']
        self.locations_df = self.locations_df.drop_duplicates().reset_index(drop=True)
        self.locations_df['LocationID'] = self.locations_df.index + 1
        
        self.organizations_df = pd.merge(self.organizations_df, self.locations_df, how='left', left_on='Organization Location', right_on='Location')
        self.organizations_df.drop(columns=['Location', 'Organization Location'], inplace=True)

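        # Split on commas and strip the whitespace that follows each separator
        location_split = self.locations_df['Location'].str.split(',', expand=True).apply(lambda col: col.str.strip())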
        self.locations_df['City'] = location_split[0]
        self.locations_df['Region'] = location_split[1]
        self.locations_df['Country'] = location_split[2]
        self.locations_df['Continent'] = location_split[3]
        self.locations_df = self.locations_df.drop(columns=['Location'])

    def process_industries(self):
        industries_series = self.organizations_df['Organization Industries'].dropna().str.split(',').explode().str.strip()
        self.industries_df = pd.DataFrame(industries_series.unique(), columns=['Industry'])        
        self.industries_df['IndustryID'] = self.industries_df.index + 1

        industry_mapping = self.organizations_df[['OrganizationID', 'Organization Industries']].copy()
        industry_mapping = industry_mapping.dropna().set_index('OrganizationID')['Organization Industries'].str.split(',').explode().str.strip().reset_index()

        self.industry_mapping_df = industry_mapping.merge(self.industries_df, how='left', left_on='Organization Industries', right_on='Industry')
        self.industry_mapping_df = self.industry_mapping_df[['OrganizationID', 'IndustryID']]
        self.organizations_df.drop(columns=['Organization Industries'], inplace=True)

    def process_investors(self):
        self.df['TransactionID'] = self.df.index + 1

        investors_series = self.df['Investor Names'].dropna().str.split(',').explode().str.strip()
        self.investors_df = pd.DataFrame(investors_series.unique(), columns=['Investor'])
        self.investors_df['InvestorID'] = self.investors_df.index + 1

        investor_mapping = self.df[['TransactionID', 'Investor Names']].copy()
        investor_mapping = investor_mapping.dropna().set_index('TransactionID')['Investor Names'].str.split(',').explode().str.strip().reset_index()

        self.investor_mapping_df = investor_mapping.merge(self.investors_df, how='left', left_on='Investor Names', right_on='Investor')

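        # Flag lead investors per transaction: an investor can lead one round and only participate in another
        lead_pairs = self.df[['TransactionID', 'Lead Investors']].dropna()
        lead_pairs = lead_pairs.set_index('TransactionID')['Lead Investors'].str.split(',').explode().str.strip().reset_index().drop_duplicates()

        self.investor_mapping_df = self.investor_mapping_df.merge(lead_pairs, how='left', left_on=['TransactionID', 'Investor Names'], right_on=['TransactionID', 'Lead Investors'])
        self.investor_mapping_df['IsLeadInvestor'] = self.investor_mapping_df['Lead Investors'].notna()

        self.investor_mapping_df = self.investor_mapping_df[['TransactionID', 'InvestorID', 'IsLeadInvestor']]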
        self.df.drop(columns=['Investor Names', 'Lead Investors'], inplace=True)

preprocessing = Preprocessing(eda.df)

## Check envisioned star schema vs implementation

### FactFunding
| Column                     | Description                        |
|----------------------------|------------------------------------|
| Transaction ID (PK)         | Unique identifier for the transaction |
| Organization ID (FK)        | Foreign key referencing DimensionOrganization |
| Money Raised                | Amount of money raised in local currency |
| Money Raised (in USD)       | Amount of money raised in USD |
| Funding Type                | Type of funding (e.g., Seed, Pre-Seed) |
| Announced Date              | Date when the funding was announced |
| Funding Stage               | Stage of funding (e.g., Seed, Series A) |
| Number of Funding Rounds    | Number of funding rounds for the organization |
| Total Funding Amount        | Total amount of funding raised in local currency |
| Total Funding Amount (in USD) | Total amount of funding raised in USD |
| Equity Only                 | Whether it was equity-only funding (Yes/No) |
| Transaction Name            | Name of the transaction            |
| Transaction Name URL        | URL of the transaction             |
| Money Raised Currency       | Currency in which money was raised |
| Pre-Money Valuation         | Valuation before the money was raised |
| Pre-Money Valuation Currency | Currency of the pre-money valuation |
| Pre-Money Valuation (in USD) | Pre-money valuation in USD         |
| Funding Status              | Status of the funding              |
| Number of Investors         | Number of investors involved       |
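
The comparison between the envisioned and the implemented table can also be done programmatically. The helper below is a hypothetical sketch that is not part of the notebook. The expected column names are abbreviated and assumed from the table above.

```python
import pandas as pd

# Column names assumed from the FactFunding table above (abbreviated for the example).
expected_columns = {"TransactionID", "OrganizationID", "Money Raised", "Equity Only"}

def compare_columns(df: pd.DataFrame, expected: set) -> tuple:
    """Return (missing, unexpected) column names, each sorted alphabetically."""
    actual = set(df.columns)
    return sorted(expected - actual), sorted(actual - expected)

# Hypothetical fact table in which one column carries a different name.
fact = pd.DataFrame(
    columns=["TransactionID", "OrganizationID", "Money Raised", "Equity Only Funding"]
)

missing, unexpected = compare_columns(fact, expected_columns)
print(missing)     # ['Equity Only']
print(unexpected)  # ['Equity Only Funding']
```

In the notebook, such a helper could be called on `preprocessing.df` with the full list of envisioned columns.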

In [94]:
preprocessing.df.head()

Unnamed: 0,Transaction Name,Transaction Name URL,Money Raised,Money Raised Currency,Money Raised (in USD),Funding Type,Announced Date,Pre-Money Valuation,Pre-Money Valuation Currency,Pre-Money Valuation (in USD),Funding Stage,Number of Funding Rounds,Total Funding Amount,Total Funding Amount Currency,Total Funding Amount (in USD),Equity Only Funding,Funding Status,Number of Investors,OrganizationID,TransactionID
0,Seed Round - Flagright,https://www.crunchbase.com/funding_round/flagr...,2800000.0,USD,2800000.0,Seed,2022-07-07,,,,Seed,1,2800000.0,USD,2800000.0,Yes,Seed,11.0,1,1
1,Seed Round - aboutuz,https://www.crunchbase.com/funding_round/about...,632000.0,USD,632000.0,Seed,2022-03-01,,,,Seed,1,632000.0,USD,632000.0,Yes,Seed,1.0,2,2
2,Seed Round - Kubermatic,https://www.crunchbase.com/funding_round/kuber...,,,,Seed,2022-04-27,,,,Seed,3,8300000.0,USD,8300000.0,Yes,Seed,1.0,3,3
3,Seed Round - MYNE Homes,https://www.crunchbase.com/funding_round/myne-...,23500000.0,EUR,23938847.0,Seed,2022-07-08,,,,Seed,4,63500000.0,EUR,67268844.0,Yes,Early Stage Venture,15.0,4,4
4,Pre Seed Round - Emulate Energy,https://www.crunchbase.com/funding_round/emula...,21956027.0,SEK,2092864.0,Pre-Seed,2022-07-15,63892307.0,SEK,6090260.0,Seed,3,56561291.0,SEK,5396545.0,Yes,Seed,4.0,5,5


### DimensionOrganization
| Column                     | Description                        |
|----------------------------|------------------------------------|
| Organization ID (PK)        | Unique identifier for the organization |
| Organization Name           | Name of the organization           |
| Location ID (FK)            | Reference to where the organization is located |
| Organization Name URL            | URL of the organization's Crunchbase profile |
| Organization Description    | Brief description of the organization |
| Organization Website        | Website of the organization        |

In [95]:
preprocessing.organizations_df.head()

Unnamed: 0,Organization Name,Organization Website,Organization Name URL,Organization Description,OrganizationID,LocationID
0,Flagright,https://flagright.com,https://www.crunchbase.com/organization/flagright,AI-native AML compliance & risk management pla...,1,1
1,aboutuz,https://www.aboutuz.com,https://www.crunchbase.com/organization/aboutuz,aboutuz is the first network for users to safe...,2,2
2,Kubermatic,https://www.kubermatic.com,https://www.crunchbase.com/organization/kuberm...,Kubermatic empowers organizations worldwide to...,3,3
3,MYNE Homes,https://www.myne-homes.de,https://www.crunchbase.com/organization/myne-h...,MYNE Homes is a digital co-ownership platform ...,4,1
4,Emulate Energy,https://www.emulate.energy/,https://www.crunchbase.com/organization/emulat...,Emulate provides SaaS for utilities to offer b...,5,4


### DimensionLocation
| Column                     | Description                        |
|----------------------------|------------------------------------|
| Location ID (PK)            | Unique identifier for the location |
| City                        | City where the organization is located |
| Region                      | Region where the organization is located |
| Country                     | Country where the organization is located |
| Continent                   | Continent where the organization is located |


In [96]:
preprocessing.locations_df.head()

Unnamed: 0,LocationID,City,Region,Country,Continent
0,1,Berlin,Berlin,Germany,Europe
1,2,Feldkirchen,Bayern,Germany,Europe
2,3,Hamburg,Hamburg,Germany,Europe
3,4,Lund,Skane Lan,Sweden,Europe
4,5,Osnabrück,Niedersachsen,Germany,Europe


### DimensionInvestor
| Column                     | Description                        |
|----------------------------|------------------------------------|
| Investor ID (PK)            | Unique identifier for the investor |
| Investor                    | Name of the investor               |

### Investor Mapping Table (many-to-many)
| Column                     | Description                        |
|----------------------------|------------------------------------|
| Transaction ID (FK)            | Unique identifier for the transaction |
| Investor ID (FK)                    | Unique identifier for the investor               |
| IsLeadInvestor                   | Boolean flag: whether the investor is a lead investor in this specific transaction |
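
Besides inspecting the first rows, it may be worth checking that every foreign key in a mapping table resolves to a row in its dimension table. The function below is a hypothetical sketch with made-up data, not part of the `Preprocessing` class.

```python
import pandas as pd

def orphaned_keys(mapping: pd.DataFrame, dimension: pd.DataFrame, key: str) -> list:
    """Return key values used in the mapping table but absent from the dimension table."""
    is_missing = ~mapping[key].isin(dimension[key])
    return sorted(mapping.loc[is_missing, key].unique().tolist())

# Hypothetical tables: InvestorID 3 is referenced but never defined.
investors = pd.DataFrame({
    "InvestorID": [1, 2],
    "Investor": ["Investor A", "Investor B"],
})
mapping = pd.DataFrame({
    "TransactionID": [1, 1, 2],
    "InvestorID": [1, 2, 3],
    "IsLeadInvestor": [False, True, False],
})

print(orphaned_keys(mapping, investors, "InvestorID"))  # [3]
```

An empty list means that all references resolve. The same check would apply to `industry_mapping_df` against `industries_df` with the key `IndustryID`.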


In [98]:
preprocessing.investors_df.head()

Unnamed: 0,Investor,InvestorID
0,Charles Delingpole,1
1,Donald Bringmann,2
2,Erik Muttersbach,3
3,Four Cities Capital,4
4,Fredrik Thomassen,5


In [99]:
preprocessing.investor_mapping_df.head()

Unnamed: 0,TransactionID,InvestorID,IsLeadInvestor
0,1,1,False
1,1,2,False
2,1,3,False
3,1,4,False
4,1,5,False


### DimensionIndustry
| Column                     | Description                        |
|----------------------------|------------------------------------|
| Industry ID (PK)            | Unique identifier for the industry |
| Industry               | Name of the industry               |

### Industry Mapping Table (many-to-many)
| Column                     | Description                        |
|----------------------------|------------------------------------|
| Organization ID (FK)            | Unique identifier for the organization |
| Industry ID (FK)                    | Unique identifier for the industry               |

In [101]:
preprocessing.industry_mapping_df.head()

Unnamed: 0,OrganizationID,IndustryID
0,1,1
1,1,2
2,1,3
3,1,4
4,1,5
