# Researching CSV data

In this investigation, the author parses data from <code>yc_csv.csv</code> and draws some conclusions about what a startup needs to be successful.

## First glances at dataframe

In [23]:
import pandas as pd
df = pd.read_csv("assets/yc_csv.csv")

In [24]:
df

Unnamed: 0,xid,#,Companies,Company ID,Deal ID,Deal Date,Announced Date,Deal Size,Pre-money Valuation,Post Valuation,...,Lead/Sole Investors,Employees,Revenue Growth since last debt deal,Revenue,EBITDA,Total Debt (from financials),Deal Synopsis,Financing Status Note,CEO (at time of deal),CEO PBId
0,48635,48386,Parakey (Software Development Applications),52732-99,19117-54T,17/11/2006 00:00,,1.56,13.27,14.83,...,,,,,,,The company raised $1.56 million of Series A f...,The company was acquired by Facebook (NASDAQ: ...,{'62492-86P': 'Blake Ross'},62492-86P
1,58836,58587,Snaptalent,61729-75,32010-85T,01/01/2008 00:00,,2.00,4.03,6.03,...,,,,,,,The company joined Y Combinator as part of the...,The company is no longer actively in business ...,,
2,60141,59892,FriendFeed,41728-33,19653-94T,25/02/2008 00:00,,5.00,,,...,,,,,,,The company raised $5 million of Series A vent...,The company was acquired by Facebook for $50 m...,{'167058-55P': 'Jim Norris'},167058-55P
3,64785,64536,RescueTime,52794-46,135800-20T,23/09/2008 00:00,,0.90,,,...,,,,,,,"The company raised $900,000 of seed funding fr...","True Ventures, Y Combinator, Lowercase Capital...",{'43621-03P': 'Brian Fioca'},43621-03P
4,66671,66422,CarWoo,51122-80,18676-81T,01/01/2009 00:00,,1.93,2.69,4.62,...,,,,,,,The company raised $1.9 million of seed fundin...,The company is no longer actively in business ...,{'38143-27P': 'Robert McClung'},38143-27P
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1342,314141,313892,Reflex (Software Development Applications),520530-40,223029-82T,02/08/2023 00:00,,5.00,,,...,{'11229-04': 'Lux Capital'},,,,,,The company raised $5 million of seed funding ...,The company raised $5 million of seed funding ...,{'336293-83P': 'Nikhil Rao'},336293-83P
1343,314262,314013,MindsDB,223449-22,212197-42T,08/08/2023 00:00,07/02/2023 00:00,46.50,,,...,{'11133-01': 'Benchmark (San Francisco) (Cheta...,,,,,,The company raised $46.5 million of Series A v...,The company raised $46.5 million of Series A v...,{'100780-12P': 'Jorge Torres'},100780-12P
1344,314273,314024,Daybreak Health,434212-21,233351-02T,08/08/2023 00:00,,13.00,52.00,65.00,...,{'11323-45': 'Union Square Ventures'},,,,,,The company raised $13 million of Series B ven...,The company raised $13 million of Series B ven...,{'227761-57P': 'Alex Alvarado'},227761-57P
1345,314428,314179,Arpio,435028-33,219305-98T,14/08/2023 00:00,28/07/2023 00:00,8.20,,,...,"{'342995-41': 'Companyon Ventures', '42943-96'...",,,,,,The company raised $7.98 million of seed fundi...,The company raised $7.98 million of seed fundi...,{'229843-99P': 'Douglas Neumann'},229843-99P


In [132]:
import json
import numpy as np

In [137]:
# Map every investor ID to its name across all rows.
investors_keys_values = {}
for index, row in df.iterrows():
    # The column stores Python-style dict literals; swap the quotes so json.loads accepts them.
    current_investors = json.loads(row["Investors"].replace(" '", ' "').replace("',", '",').replace("'}", '"}').replace("{'", '{"').replace("':", '":'))
    investors_keys_values.update(current_investors)
len(investors_keys_values)

5238
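The chained `replace` calls above convert a Python-style dict literal into JSON by hand, which can break when a name contains an apostrophe. As a hedged alternative, the standard library's `ast.literal_eval` parses such literals directly. The sketch below assumes the `Investors` cells look like the dict strings visible in the table preview; the helper name `parse_investors`, the ID `00000-01`, and the name `O'Example Partners` are made up for illustration.

```python
import ast


def parse_investors(cell):
    """Parse a cell such as "{'11229-04': 'Lux Capital'}" into a dict.

    Non-string cells (for example NaN for a missing value) yield an empty dict.
    """
    if not isinstance(cell, str):
        return {}
    return ast.literal_eval(cell)


# Illustrative cells in the same format as the dataset's dict-like columns.
samples = [
    "{'11229-04': 'Lux Capital'}",
    "{'11229-04': 'Lux Capital', '00000-01': \"O'Example Partners\"}",
    float("nan"),
]

investors_by_id = {}
for cell in samples:
    investors_by_id.update(parse_investors(cell))

print(investors_by_id)
```

`ast.literal_eval` only evaluates literals (strings, numbers, dicts, lists and similar), so it does not execute arbitrary code from the file.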

## Research itself

In [50]:
companies = df["Companies"]
len(companies)

1347

Let's count the active companies in the table:

In [55]:
def is_active(row):
    return "The company is no longer actively in business" not in row["Financing Status Note"]

In [103]:

active_companies_counter = 0
for index, row in df.iterrows():
    if is_active(row):
        active_companies_counter += 1
active_companies_counter

1301
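The same count can be obtained without an explicit loop. The following is a minimal sketch on a toy frame (the rows and the names `toy` and `INACTIVE_MARKER` are illustrative); it assumes that a missing status note should count as "not marked inactive", which the original loop does not handle.

```python
import pandas as pd

INACTIVE_MARKER = "The company is no longer actively in business"

# Toy frame standing in for df; only the column used here is included.
toy = pd.DataFrame({
    "Financing Status Note": [
        "The company was acquired by Facebook",
        "The company is no longer actively in business",
        None,
    ]
})

# regex=False does a plain substring match; na=False treats a missing note
# as "marker not found".
is_inactive = toy["Financing Status Note"].str.contains(
    INACTIVE_MARKER, regex=False, na=False
)
active_count = int((~is_inactive).sum())
print(active_count)
```

The boolean mask `~is_inactive` can also be stored as a column and reused in later steps instead of calling `is_active` row by row.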

Let's compute the median number of investors for active companies:

In [57]:
investors_amount = []
for index, row in df.iterrows():
    investors_amount.append(len(json.loads(row["Investors"].replace(" '", ' "').replace("',", '",').replace("'}", '"}').replace("{'", '{"').replace("':", '":')).keys()))
df.loc[:, "Investors_amount"] = investors_amount

In [63]:
investors_amount_for_active = []
for index, row in df.iterrows():
    if is_active(row):
        investors_amount_for_active.append(row["Investors_amount"])

median_investors_amount = np.median(investors_amount_for_active)
median_investors_amount

10.0

In millions of dollars, the average revenue of active companies that report revenue is:

In [75]:
revenues_for_active = []
for index, row in df.iterrows():
    # Keep only active companies that report a revenue figure.
    if is_active(row) and pd.notna(row["Revenue"]):
        revenues_for_active.append(row["Revenue"])

average_revenue = np.mean(revenues_for_active)
average_revenue

57.72110344827586

Let's compute the same values for companies that were inactive as of 11-29-2023:

In [76]:
investors_amount_for_notactive = []
for index, row in df.iterrows():
    if not is_active(row):
        investors_amount_for_notactive.append(row["Investors_amount"])

median_investors_amount_notactive = np.median(investors_amount_for_notactive)
median_investors_amount_notactive

9.0

In [78]:
revenues_for_notactive = []
for index, row in df.iterrows():
    # Keep only inactive companies that report a revenue figure.
    if not is_active(row) and pd.notna(row["Revenue"]):
        revenues_for_notactive.append(row["Revenue"])

average_revenue_notactive = np.mean(revenues_for_notactive)
average_revenue_notactive

0.56
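The four loops above compute a median investor count and a mean revenue for each group. As a hedged sketch, the same summary can be produced in one `groupby` pass; the toy numbers and the column name `status` are assumptions for illustration, not values from the dataset. The extra `rows_with_revenue` column shows how many rows each mean is based on, which is useful when one group is small.

```python
import numpy as np
import pandas as pd

# Toy data; "status" would be derived from the Financing Status Note column.
toy = pd.DataFrame({
    "status": ["active", "active", "active", "inactive", "inactive"],
    "Investors_amount": [12, 8, 10, 9, 3],
    "Revenue": [100.0, np.nan, 20.0, 0.5, np.nan],
})

summary = toy.groupby("status").agg(
    median_investors=("Investors_amount", "median"),
    mean_revenue=("Revenue", "mean"),        # missing revenues are skipped
    rows_with_revenue=("Revenue", "count"),  # non-missing values per group
)
print(summary)
```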

We can see that the average revenue of inactive companies is roughly 100 times lower than that of active ones (0.56 versus 57.72 million dollars). Let's consider such companies "unsuccessful" and look at which investors contributed to them but did not contribute to active startups:

In [148]:
investors_for_active = set()
investors_for_notactive = set()

for index, row in df.iterrows():
    investors_ids = json.loads(row["Investors"].replace(" '", ' "').replace("',", '",').replace("'}", '"}').replace("{'", '{"').replace("':", '":')).keys()
    # print(investors_ids)
    if is_active(row):
        investors_for_active = investors_for_active.union(set(investors_ids))
    else:
        investors_for_notactive = investors_for_notactive.union(set(investors_ids))

intersection = investors_for_active.intersection(investors_for_notactive)
investors_difference = investors_for_notactive.difference(intersection)

investors_names = []
for id in investors_difference:
    investors_names.append(investors_keys_values[id])

print(f'{len(investors_names)} investors out of {len(investors_for_active.union(investors_for_notactive))}')
investors_names

113 investors out of 5238


['Sep Kamvar (Sep Kamvar)',
 'The Esports Observers',
 'Bruno Bowden (Bruno Bowden)',
 'Kiem Tjong (Kiem Tjong)',
 'Rivendell Investments',
 'Tomer Cohen (Tomer Cohen)',
 'Karel Obluk (Karel Obluk)',
 'Founders Equity Partners',
 'Alex Neth (Alex Neth)',
 'First Chair Ventures',
 'Andrey Shirben (Andrey Shirben)',
 'Chris Bennett (Chris Bennett)',
 'University of Pennsylvania Endowment',
 'Dynamo Ventures',
 'FoundersGuild (Avi Rosenbaum)',
 'Ali Vahabzadeh (Ali Vahabzadeh)',
 'Maneesh Arora (Maneesh Arora)',
 'Glenn Willen (Glenn Willen)',
 'Mihir Bhanot (Mihir Bhanot)',
 'Scott Heller (Scott Heller)',
 'MATH Venture Partners',
 'Zeev Ventures (Oren Zeev)',
 'Rocket Venture Fund (Ben Trumbull)',
 'Ophir Ashkenazi (Ophir Ashkenazi)',
 'Villi Iltchev (Villi Iltchev)',
 'Emagen Entertainment',
 'Roger Ehrenberg (Roger Ehrenberg)',
 'Eyal Navon (Eyal Navon)',
 'Ruxton Ventures',
 'Clara Shih (Clara Shih)',
 'Gregory Lee (Gregory Lee)',
 'Initial:Capital',
 'Smac Partners',
 'Steve Kaplan 

Thus, we can single out the investors that have never contributed to a big successful project.

Let's also consider investors that have **not** contributed to "unsuccessful" startups:

In [149]:
investors_for_active_difference = investors_for_active.difference(intersection)
investors_names_for_active_only = []
for id in investors_for_active_difference:
    investors_names_for_active_only.append(investors_keys_values[id])

print(f'{len(investors_names_for_active_only)} investors out of {len(investors_for_active | investors_for_notactive)}')
investors_names_for_active_only

4878 investors out of 5238


['Daniel Mathon (Daniel Mathon)',
 'Kenny Van Zant (Kenny Van Zant)',
 'Walden Catalyst',
 'Richard Socher (Richard Socher)',
 'Jetstream (San Francisco)',
 'James Conigliaro (James Conigliaro)',
 'Thomas Hulme (Thomas Hulme)',
 'P1 Ventures',
 'Cota Capital',
 'Tristan Handy (Tristan Handy)',
 'Dentsu Ventures',
 'Pentas Ventures',
 'Albert Ni (Albert Ni)',
 'Graphene Ventures',
 'Inspired Capital (New York)',
 'Shorewind Capital',
 'Crescent Fund',
 'AV8 Ventures (Baris Aksoy)',
 'Bernardus Verwaayen (Bernardus Verwaayen)',
 'Daniel Bragiel (Daniel Bragiel)',
 'Jan Deepen',
 'Passion Capital',
 'Driventure',
 'Andre Lorenceau',
 'Dalus Capital',
 'Romain Huet (Romain Huet)',
 'Atlas Capital (Singapore) (Djoann Fal)',
 'Marbruck Investments',
 'Prasanna Sankar (Prasanna Sankar)',
 'Nils Johnson (Nils Johnson)',
 'Tobias Knaup',
 'Darren Nix (Darren Nix)',
 'Grupo Bolívar (BOG: GRUBOLIVAR)',
 'Work Life Ventures',
 'Grit Partners',
 'Gordon Crawford (Gordon Crawford)',
 'White Buffalo 
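Both lists above come from the same set logic: removing the intersection from a set is equivalent to a plain set difference. A minimal sketch with toy investor IDs (the letters are placeholders) shows the three disjoint groups, and then applies the same identity to the counts printed in this notebook to get the number of investors that backed both active and inactive companies.

```python
# Toy investor IDs; the real sets hold IDs parsed from the Investors column.
investors_for_active = {"A", "B", "C", "D"}
investors_for_notactive = {"C", "D", "E"}

only_notactive = investors_for_notactive - investors_for_active
only_active = investors_for_active - investors_for_notactive
both = investors_for_active & investors_for_notactive
everyone = investors_for_active | investors_for_notactive

# The three groups are disjoint and together cover every investor.
assert len(only_notactive) + len(only_active) + len(both) == len(everyone)

# Same identity with the reported counts: total minus the two exclusive groups.
backed_both = 5238 - 113 - 4878
print(sorted(only_notactive), sorted(only_active), sorted(both), backed_both)
```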

Now let's find startups that didn't get investments within half a year.

In [150]:
companies_without_data_about_time = 0
companies_without_investments_in_half_year = 0
unactive_companies_without_investments_in_half_year = 0

df['Deal Date'] = pd.to_datetime(df['Deal Date'], format='%d/%m/%Y %H:%M')
df['Announced Date'] = pd.to_datetime(df['Announced Date'], format='%d/%m/%Y %H:%M')
for index, row in df.iterrows():
    if pd.isna(row["Deal Date"]) or pd.isna(row["Announced Date"]):
        companies_without_data_about_time += 1
        continue

    # Gap between the two dates for this row, in months (average month length of 30.44 days).
    difference = (row["Deal Date"] - row["Announced Date"]).days / 30.44
    if difference > 6:
        companies_without_investments_in_half_year += 1

    if not is_active(row) and difference > 6:
        unactive_companies_without_investments_in_half_year += 1

companies_amount = len(df["Companies"])
print(f'No data for {companies_without_data_about_time} companies out of {companies_amount}')
print(f'{unactive_companies_without_investments_in_half_year} inactive companies without investment for half a year out of {companies_amount - active_companies_counter}')
print(f'{companies_without_investments_in_half_year - unactive_companies_without_investments_in_half_year} active companies without investment for half a year out of {active_companies_counter}')

No data for 863 companies out of 1347
0 inactive companies without investment for half a year out of 46
0 active companies without investment for half a year out of 1301


As the output shows, among the 484 rows that have both dates (1347 minus 863), none was counted as having a gap of more than six months between <code>Deal Date</code> and <code>Announced Date</code>.
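The gap between the two date columns can also be computed column-wise, without iterating over rows. The sketch below uses three toy rows whose dates follow the `dd/mm/YYYY HH:MM` format of the dataset; the 183-day threshold for "half a year" and the names `toy`, `gap_days` and `HALF_YEAR_DAYS` are assumptions for illustration.

```python
import pandas as pd

HALF_YEAR_DAYS = 183  # assumed threshold: about half of 365 days

toy = pd.DataFrame({
    "Deal Date": ["08/08/2023 00:00", "14/08/2023 00:00", "08/08/2023 00:00"],
    "Announced Date": ["07/02/2023 00:00", "28/07/2023 00:00", None],
})

fmt = "%d/%m/%Y %H:%M"
deal = pd.to_datetime(toy["Deal Date"], format=fmt)
announced = pd.to_datetime(toy["Announced Date"], format=fmt)

# Whole days between the two dates; NaN where either date is missing.
gap_days = (deal - announced).dt.days

rows_without_dates = int(gap_days.isna().sum())
# Comparisons against NaN evaluate to False, so rows with missing dates are not counted.
rows_over_half_year = int((gap_days > HALF_YEAR_DAYS).sum())

print(rows_without_dates, rows_over_half_year)
```

Working in whole days avoids dividing by a "month" unit, whose length is not fixed.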

**Results:** We can conclude that a successful large project typically has around ten contributing investors: the median is ten for active companies, against nine for inactive ones. Active companies that report revenue average about 57.7 million dollars, compared with 0.56 million for inactive ones. The two investor lists found above can be useful when choosing investors.