# Examples of using analysis functionalities

Using the discovery_utils analysis functionality for investment data

Here, we'll find companies using their categories, but you can also use the search results from the process shown in cybersec_search.ipynb.

In [1]:
from discovery_utils.utils import (
    analysis_crunchbase,
    analysis,
    charts
)

In [2]:
from discovery_utils.getters import crunchbase
CB = crunchbase.CrunchbaseGetter()

2024-11-21 10:43:03,899 - discovery_utils.getters.crunchbase - INFO - Checking for latest version of data in S3 bucket: discovery-iss
2024-11-21 10:43:04,021 - discovery_utils.getters.crunchbase - INFO - Latest Crunchbase version found: Crunchbase_2024-11-18


Let's find categories to use for selecting companies

In [3]:
# Categories for cybersec
CB.find_similar_categories("cyber security", category_type="broad", n_results=10)

2024-11-21 10:43:04,034 - sentence_transformers.SentenceTransformer - INFO - Load pretrained SentenceTransformer: all-MiniLM-L6-v2
2024-11-21 10:43:05,618 - sentence_transformers.SentenceTransformer - INFO - Use pytorch device_name: mps



2024-11-21 10:43:06,533 - discovery_utils.getters.crunchbase - INFO - Downloading parquet file: data/crunchbase/Crunchbase_2024-11-18/category_groups.parquet
2024-11-21 10:43:06,620 - discovery_utils.getters.crunchbase - INFO - Successfully downloaded and read parquet file: data/crunchbase/Crunchbase_2024-11-18/category_groups.parquet



Unnamed: 0,group,similarity
37,Privacy and Security,0.613779
24,Information Technology,0.454696
25,Internet Services,0.432691
43,Software,0.392303
4,Artificial Intelligence (AI),0.389637
0,Administrative Services,0.351681
22,Hardware,0.339292
13,Data and Analytics,0.337529
14,Design,0.337226
15,Education,0.315132
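
The log lines above show a SentenceTransformer model (all-MiniLM-L6-v2) being loaded, which suggests that `find_similar_categories` ranks category names by how close their embeddings are to the query. Below is a minimal sketch of that idea, assuming cosine similarity is the metric. The three-dimensional vectors and the `rank_by_cosine` helper are made up for illustration and are not part of discovery_utils.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_cosine(query_vec, candidates, n_results=10):
    """Return (name, similarity) pairs, most similar first (hypothetical helper)."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in candidates.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n_results]


# Toy embeddings; a real model would produce much longer vectors
toy_embeddings = {
    "Privacy and Security": [0.9, 0.1, 0.0],
    "Software": [0.5, 0.5, 0.1],
    "Education": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.0, 0.0]

ranked = rank_by_cosine(toy_query, toy_embeddings, n_results=2)
```

Reading the real output the same way, a similarity of 0.61 for "Privacy and Security" against roughly 0.3 to 0.45 for the rest is a fairly clear gap, which is why only that broad group is used further down.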


In [4]:
# Categories for family and kids
CB.find_similar_categories("family", category_type="narrow", n_results=15)



Unnamed: 0,category,similarity
120,Family,1.0
128,Parenting,0.553819
116,Children,0.497603
125,Lifestyle,0.476997
563,Association,0.471604
566,Charity,0.46592
139,Teenagers,0.46051
117,Communities,0.458581
414,Child Care,0.450171
133,Religion,0.447418


Now get the companies in each set of categories and take the intersection of their ids to find those that match both.

This might take a minute the first time, as the data needs to be downloaded.

In [6]:
cybersec_df = CB.get_companies_in_categories(["Privacy and Security"], category_type="broad")
family_df = CB.get_companies_in_categories(["Family", "Parenting", "Children", "Teenagers"], category_type="narrow")

2024-11-21 10:43:22,836 - discovery_utils.getters.crunchbase - INFO - Downloading parquet file: data/crunchbase/enriched/organizations_full.parquet


2024-11-21 10:43:48,179 - discovery_utils.getters.crunchbase - INFO - Successfully downloaded and read parquet file: data/crunchbase/enriched/organizations_full.parquet


In [7]:
matching_ids = set(cybersec_df.id) & set(family_df.id)

In [8]:
len(matching_ids)

163
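
The `&` operator used above is Python's set intersection. As a small illustration with made-up ids (standing in for `cybersec_df.id` and `family_df.id`), the other set operators give related selections that may be useful when exploring category overlaps:

```python
# Toy id sets; the real ids come from the Crunchbase data
cybersec_ids = {"a1", "b2", "c3", "d4"}
family_ids = {"c3", "d4", "e5"}

both = cybersec_ids & family_ids            # companies in both groups
either = cybersec_ids | family_ids          # companies in at least one group
only_cybersec = cybersec_ids - family_ids   # cybersec companies outside the family group
```

Converting to sets also removes duplicate ids, so a company that appears under several of the narrow family categories is counted only once.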

You can inspect these companies by querying the enriched organisations table with the matching ids

In [9]:
matchings_orgs_df = CB.organisations_enriched.query("id in @matching_ids")
matchings_orgs_df[['name', 'homepage_url', 'short_description']]

Unnamed: 0,name,homepage_url,short_description
3646,IPowerApps,http://www.ipowerapps.com,iPhone and iPod Touch Software
64677,Qustodio,http://www.qustodio.com,Qustodio is a leading platform providing digit...
77769,7billionideas,http://www.7billionideas.com,7billionideas is a web platform allowing users...
98572,Notabli,http://www.notabli.com,Notabli is an ad-free digital space for privat...
132663,Pumpic,http://pumpic.com,Pumpic is a reliable parental control app that...
...,...,...,...
3359344,L'Hybridé,https://www.lhybride.com/,L'Hybridé is a non-profit organization that of...
3441652,Castle & Associates,https://www.castleandassociates.ca,Castle & Associates offers family law services...
3617326,Rikokatsu,https://ricokatsu.com,Rikokatsu offers online consultation services ...
3626448,SASENAI,https://sasenai.com/,SASENAI is a crime prevention app that detects...


Now get the funding rounds for the matching companies. You can specify which funding round types you need.

In [10]:
# Check what type of funding rounds there are
CB.unique_funding_round_types

2024-11-21 10:44:07,922 - discovery_utils.getters.crunchbase - INFO - Downloading parquet file: data/crunchbase/enriched/funding_rounds_full.parquet
2024-11-21 10:44:12,931 - discovery_utils.getters.crunchbase - INFO - Successfully downloaded and read parquet file: data/crunchbase/enriched/funding_rounds_full.parquet


['angel',
 'convertible_note',
 'corporate_round',
 'debt_financing',
 'equity_crowdfunding',
 'grant',
 'initial_coin_offering',
 'non_equity_assistance',
 'post_ipo_debt',
 'post_ipo_equity',
 'post_ipo_secondary',
 'pre_seed',
 'private_equity',
 'product_crowdfunding',
 'secondary_market',
 'seed',
 'series_a',
 'series_b',
 'series_c',
 'series_d',
 'series_e',
 'series_f',
 'series_g',
 'series_h',
 'series_i',
 'series_j',
 'series_unknown',
 'undisclosed']

In [11]:
funding_rounds_df = (
    CB.select_funding_rounds(org_ids=matching_ids, funding_round_types=["angel", "pre_seed", "seed", "series_a"])
)
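
Round type names are plain strings, so a typo such as "pre-seed" instead of "pre_seed" is easy to make. Whether `select_funding_rounds` validates its input is not shown here, so one cautious option is to check the requested types yourself first. The sketch below uses a hand-copied subset of the values listed above and a hypothetical `check_round_types` helper that is not part of discovery_utils.

```python
# Subset of the values listed by CB.unique_funding_round_types above
KNOWN_ROUND_TYPES = {"angel", "pre_seed", "seed", "series_a", "series_b", "grant"}


def check_round_types(requested, known=KNOWN_ROUND_TYPES):
    """Raise ValueError if any requested round type is not known (hypothetical helper)."""
    unknown = sorted(set(requested) - set(known))
    if unknown:
        raise ValueError(f"Unknown funding round types: {unknown}")
    return list(requested)


early_stage = check_round_types(["angel", "pre_seed", "seed", "series_a"])
```

In the notebook itself, `known` could be set to `CB.unique_funding_round_types` rather than a hard-coded set.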

Now let's generate a basic yearly time series of funding rounds, amounts raised and companies founded

In [14]:
ts_df = analysis_crunchbase.get_timeseries(matchings_orgs_df, funding_rounds_df, period='year', min_year=2014, max_year=2024)
ts_df

Unnamed: 0,time_period,year,n_rounds,raised_amount_usd_total,raised_amount_gbp_total,n_orgs_founded
0,2014-01-01,2014,5,2.25,1.367802,7
1,2015-01-01,2015,3,2.854647,1.867393,7
2,2016-01-01,2016,1,3.0,2.068337,4
3,2017-01-01,2017,4,2.44045,1.964278,4
4,2018-01-01,2018,3,13.715961,10.313301,4
5,2019-01-01,2019,1,1.14575,0.87888,3
6,2020-01-01,2020,1,0.12,0.092103,4
7,2021-01-01,2021,1,2.0,1.49281,3
8,2022-01-01,2022,3,6.0,4.948556,2
9,2023-01-01,2023,1,0.0,0.0,1
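
To make the table easier to interpret, here is a minimal sketch of the kind of aggregation a yearly time series involves: counting rounds and summing amounts per year, with empty years kept as zeros. It is not the implementation of `get_timeseries`. The records, the field names and the `yearly_totals` helper are hypothetical, and treating both year bounds as inclusive is an assumption.

```python
from collections import defaultdict


def yearly_totals(rounds, min_year, max_year):
    """Count rounds and sum amounts per year, keeping empty years (hypothetical helper)."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for r in rounds:
        if min_year <= r["year"] <= max_year:
            counts[r["year"]] += 1
            # An undisclosed amount (None) counts as a round but adds nothing to the total
            totals[r["year"]] += r["amount_gbp_m"] or 0.0
    return [
        {"year": y, "n_rounds": counts[y], "raised_gbp_m": totals[y]}
        for y in range(min_year, max_year + 1)
    ]


toy_rounds = [
    {"year": 2014, "amount_gbp_m": 0.5},
    {"year": 2014, "amount_gbp_m": 1.0},
    {"year": 2016, "amount_gbp_m": None},
    {"year": 2013, "amount_gbp_m": 2.0},  # outside the window, ignored
]
ts = yearly_totals(toy_rounds, min_year=2014, max_year=2016)
```

Under this reading, a row with one round and a total of 0.0, like 2023 above, could correspond to a round whose amount was not disclosed, although the data shown here does not confirm that.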


In [15]:
fig = charts.ts_bar(
    ts_df,
    variable='raised_amount_gbp_total',
    variable_title="Raised amount, £ millions",
    category_column="_category",
)
charts.configure_plots(fig, chart_title="")

Let's look into the breakdown of deal sizes and investment types

In [16]:
deals_df, deal_counts_df = analysis_crunchbase.get_funding_by_year_and_range(funding_rounds_df, 2014, 2024)
aggregated_funding_types_df = analysis_crunchbase.aggregate_by_funding_round_types(funding_rounds_df)

In [17]:
deals_df

Unnamed: 0,year,n/a,£0-5M,£5-20M,£20-100M,£100M+,total_amount
0,2014,0.0,1.367802,0.0,0.0,0.0,1.367802
1,2015,0.0,1.867393,0.0,0.0,0.0,1.867393
2,2016,0.0,2.068337,0.0,0.0,0.0,2.068337
3,2017,0.0,1.964278,0.0,0.0,0.0,1.964278
4,2018,0.0,0.545198,9.768104,0.0,0.0,10.313301
5,2019,0.0,0.87888,0.0,0.0,0.0,0.87888
6,2020,0.0,0.092103,0.0,0.0,0.0,0.092103
7,2021,0.0,1.49281,0.0,0.0,0.0,1.49281
8,2022,0.0,4.948556,0.0,0.0,0.0,4.948556
9,2023,0.0,0.0,0.0,0.0,0.0,0.0
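
The column headers of `deals_df` indicate that each deal is assigned to a size range before the amounts are summed per year. A minimal sketch of such binning is shown below. The bin edges are read off the headers, but which side of each edge is inclusive is an assumption, and `deal_size_label` is a hypothetical helper rather than part of discovery_utils.

```python
# Bin edges in £ millions, read off the column headers of deals_df
BINS = [(0, 5, "£0-5M"), (5, 20, "£5-20M"), (20, 100, "£20-100M")]


def deal_size_label(amount_gbp_m):
    """Map a deal size in £ millions to a range label (hypothetical helper)."""
    if amount_gbp_m is None:
        return "n/a"  # amount not disclosed
    if amount_gbp_m < 0:
        raise ValueError("Deal size cannot be negative")
    for lower, upper, label in BINS:
        # Assumption: lower edge inclusive, upper edge exclusive
        if lower <= amount_gbp_m < upper:
            return label
    return "£100M+"


labels = [deal_size_label(x) for x in (0.545198, 9.768104, 150.0, None)]
```

The first two toy values are the 2018 entries from the table, which land in the £0-5M and £5-20M columns respectively.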


In [18]:
analysis_crunchbase.chart_investment_types(aggregated_funding_types_df)

In [19]:
analysis_crunchbase.chart_deal_sizes(deals_df)

In [20]:
analysis_crunchbase.chart_deal_sizes_counts(deal_counts_df)