# Top AI Startups 2020

Analysis of [CBInsights' List of Top AI Startups](https://www.cbinsights.com/research/artificial-intelligence-top-startups) as a sample use case of data visualization using **Python** and **Jupyter**.

First let's load all the libraries that we need:
* **Pandas** for data manipulation
* **Matplotlib** for basic graphs
* **Plotly** for interactive graphs

*(We also disable some ugly Matplotlib to Plotly conversion warnings)*

In [0]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import matplotlib.pyplot as pt
import plotly.express as px
import plotly.graph_objects as go
from plotly.tools import mpl_to_plotly

Now we detect which environment the notebook is running in: local, [Google Colab](https://colab.research.google.com) or [DataLore](https://datalore.io)

This determines the path from which we load the CSV data file

In [0]:
import sys
IN_COLAB = 'google.colab' in sys.modules
IN_DATALORE = 'datalore' in sys.modules

data_file = "../data/ai2020.csv"
if IN_COLAB:
    data_file = "https://raw.githubusercontent.com/isaacdlp/datascience/master/data/ai2020.csv"
elif IN_DATALORE:
    data_file = "ai2020.csv"
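The same selection logic can be wrapped in a small function, which makes it easy to check without actually running inside Colab or DataLore. This is only a sketch: `pick_data_file` and its `loaded_modules` parameter are hypothetical names that do not appear in the notebook, while the three paths are the ones used above.

```python
import sys

def pick_data_file(loaded_modules):
    """Return the CSV location for the environment implied by the loaded modules.

    `loaded_modules` is any container of module names (hypothetical parameter).
    """
    if "google.colab" in loaded_modules:
        # Colab: read the raw file straight from GitHub
        return "https://raw.githubusercontent.com/isaacdlp/datascience/master/data/ai2020.csv"
    if "datalore" in loaded_modules:
        # DataLore: bare file name, as in the cell above
        return "ai2020.csv"
    # Local run: relative path inside the repository
    return "../data/ai2020.csv"

data_file = pick_data_file(sys.modules)
```

Passing the module collection as an argument, rather than reading `sys.modules` inside the function, is what allows each branch to be tried with a plain dictionary or set.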

## Load Data

We load the CSV file into a Pandas **dataframe**, perform some preliminary cleanup and look at the fields.

**Data density** is very good: every field is populated in at least 89 of the 100 rows.

In [0]:
df = pd.read_csv(data_file, encoding = "ISO-8859-1")
# df.to_csv("clean.csv", index=False, encoding = "ISO-8859-1")
# df.drop("Focus Area", axis=1, inplace=True)

# Without columns=, rename() targets the row index and leaves the column untouched
df.rename(columns = {"Round" : "Round Count"}, inplace = True)
df["Industries"] = df["Industries"].str.title()

for field in df:
    print(f"{field}:  {df[field].count()}")

Company:  100
Crunchbase:  96
Industries:  100
Location:  100
Employee Count:  95
Founded Date:  96
Investor Count:  94
Lead Investor Count:  89
Select Investors:  99
Funding:  94
Round Count:  94
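The counts above show how many values are present. The complementary view, how many are missing, is sometimes easier to read. Here is a minimal sketch on a toy dataframe (the values are made up for illustration, they are not taken from the CSV file):

```python
import pandas as pd

# Toy frame standing in for the real data (made-up values)
toy = pd.DataFrame({
    "Company": ["A", "B", "C"],
    "Funding": ["$10", None, "$5"],
    "Round": [2, None, None],
})

filled = toy.count()          # non-null values per column, like the loop above
missing = toy.isna().sum()    # null values per column
print(pd.DataFrame({"filled": filled, "missing": missing}))
```

For every column the two numbers add up to the row count, so either one is enough to judge data density.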


Focusing on the **Location** field, it has the format "*City, Region, Country*", which is less than ideal for data exploration. What we want is to **split it into columns**, thus generating three new fields in a geographic hierarchy:
* Country
* Region
* City

*(We also drop Location, we will not use it anymore)*

In [0]:
df[["City","Region", "Country"]] = df["Location"].str.split(",", expand = True)
df["City"] = df["City"].str.strip()
df["Region"] = df["Region"].str.strip()
df["Country"] = df["Country"].str.strip()
df.drop("Location", axis = 1, inplace = True)
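To see the split in isolation, here is a small sketch on two made-up locations that follow the same "*City, Region, Country*" format (the `toy` and `parts` names are only for illustration):

```python
import pandas as pd

# Made-up rows in the "City, Region, Country" format
toy = pd.DataFrame({"Location": ["Toronto, Ontario, Canada", "Tel Aviv, Tel Aviv, Israel"]})

# Split on the comma into three columns, then trim the leftover spaces
parts = toy["Location"].str.split(",", expand = True)
parts = parts.apply(lambda col: col.str.strip())
parts.columns = ["City", "Region", "Country"]
print(parts)
```

Note that this only works cleanly when every row has exactly three comma-separated parts, which is assumed here.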


The next field to clean up is **Funding**: remove spaces and the "$" sign, then convert from text to number

In [0]:
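df["Funding"] = df["Funding"].str.strip()
# regex = False: "$" is matched literally rather than as the end-of-string anchor
df["Funding"] = df["Funding"].str.replace("$", "", regex = False)
# Convert to numbers first, then treat missing funding as 0
df["Funding"] = pd.to_numeric(df["Funding"]).fillna(0)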
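The cleanup can be tried on a handful of values first. This sketch assumes the raw values look like "$91", possibly with surrounding spaces; the sample values are made up:

```python
import pandas as pd

# Made-up values in the assumed "$<number>" format, plus one missing entry
raw = pd.Series([" $91 ", "$370", None])

clean = (
    raw.str.strip()
       .str.replace("$", "", regex = False)   # literal "$", not a regex anchor
)
funding = pd.to_numeric(clean).fillna(0)      # missing funding counted as 0
print(funding)
```

Filling the gaps after the numeric conversion keeps the string operations working on text only.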

## Companies per Country

For this analysis we **group by Country** and compute the **Count of Companies** and the **Sum of Funding**

In [0]:
gr = df.groupby("Country").agg({"Company" : "count", "Funding" : "sum"})
gr

| Country | Company | Funding |
|---|---|---|
| Canada | 8 | 91.0 |
| Chile | 1 | 33.0 |
| China | 6 | 370.0 |
| France | 1 | 6.0 |
| Germany | 2 | 94.0 |
| Israel | 3 | 128.0 |
| Japan | 1 | 2.0 |
| South Africa | 1 | 0.0 |
| Spain | 1 | 17.0 |
| Sweden | 2 | 27.0 |
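As a side note, the same grouping can be written with Pandas named aggregation, which lets us choose the output column names directly. A minimal sketch on made-up rows (`companies` and `funding` are illustrative names, not the ones used in the rest of the notebook):

```python
import pandas as pd

# Made-up rows, only to show the mechanics
toy = pd.DataFrame({
    "Company": ["A", "B", "C"],
    "Country": ["Canada", "Canada", "Chile"],
    "Funding": [10.0, 5.0, 33.0],
})

summary = toy.groupby("Country").agg(
    companies = ("Company", "count"),   # number of companies per country
    funding = ("Funding", "sum"),       # total funding per country
)
print(summary)
```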


Let's sort it by **Company Count** in descending order and visualize it with a **bar chart**

In [0]:
gr.sort_values("Company", ascending = False, inplace = True)
fig = px.bar(gr, x = gr.index, y = "Company")
fig.update_layout(
    title = "Lead AI Companies per Country",
    xaxis_title = "Country",
    yaxis_title = "Company Count"
)
fig.show()

## Funding per Country

We will do the same with the **Total Funding** per Country

In [0]:
gr.sort_values("Funding", ascending = False, inplace = True)
fig = px.bar(gr, x = gr.index, y = "Funding")
fig.update_layout(
    title = "Total AI Funding per Country",
    xaxis_title = "Country",
    yaxis_title = "Total Funding"
)
fig.show()

A few insights emerge

The **USA** is dominant in both number of companies and funding, with **Canada**, the **UK**, **China**, **Germany** and **Israel** following at a long distance

However, Canada is **2nd** in count while being **6th** in funding. In terms of efficiency of investment, it deserves a mention!
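One way to put a number on that remark is funding per company. Reading "efficiency" this way is an interpretation, not something the dataset defines. The sketch below copies four rows from the grouped table shown earlier (the USA and the UK are not visible in that output, so they are left out):

```python
import pandas as pd

# Rows copied from the grouped table above; USA and UK are not shown there
gr_part = pd.DataFrame(
    {"Company": [8, 6, 2, 3], "Funding": [91.0, 370.0, 94.0, 128.0]},
    index = ["Canada", "China", "Germany", "Israel"],
)
gr_part["Funding per Company"] = gr_part["Funding"] / gr_part["Company"]
print(gr_part.sort_values("Funding per Company"))
```

Among these four countries, Canada has the lowest funding per listed company.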

## Companies per Lead Investor

Now let's look at which Lead Investors have been able to gather the highest number of Companies. As with Location, the **Select Investors** field contains a *comma-separated* list of Lead Investors "*Investor 1, Investor 2, ...*", which is not ideal for exploration, so we need to split it

This time, however, we need to **split it into rows**: that is, generate one entry per Company and Investor pair rather than a set of new columns

In [0]:
df.set_index("Company", drop = True, inplace = True)
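df["Select Investors"] = df["Select Investors"].fillna("None")
# One column per investor position, then stack the columns into rows
ds = pd.DataFrame(df["Select Investors"].str.split(",").tolist(), index = df.index).stack()
# Drop the position level and turn "Company" back into a regular column
ds = ds.reset_index(level = 1, drop = True).reset_index()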
ds.columns = ["Company", "Lead Investor"]
ds["Lead Investor"] = ds["Lead Investor"].str.strip()
#df = pd.merge(left = df, right = ds, left_on = "Company", right_on = "Company")
#df.drop("Select Investors", axis = 1, inplace = True)
ds

| | Company | Lead Investor |
|---|---|---|
| 0 | Count4Paradigm | Sequoia Capital China |
| 1 | Count4Paradigm | Industrial and Commercial Bank of China |
| 2 | Count4Paradigm | China Construction Bank |
| 3 | Count4Paradigm | Bank of China |
| 4 | Count4Paradigm | Genesis Capital |
| ... | ... | ... |
| 418 | Zhuiyi Technology | Morningside Venture Capital |
| 419 | Zhuiyi Technology | GGV Capital |
| 420 | Zhuiyi Technology | Gaorong Capital |
| 421 | Zhuiyi Technology | China Merchants Capital |
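Recent Pandas versions also offer `Series.explode`, which does the split into rows in one step. It is shown here only as an alternative sketch on made-up data, not as the method used in this notebook:

```python
import pandas as pd

# Made-up companies and investors
toy = pd.DataFrame({
    "Company": ["A", "B"],
    "Select Investors": ["Inv 1, Inv 2", "Inv 2"],
}).set_index("Company")

long = (
    toy["Select Investors"]
    .str.split(",")            # each cell becomes a list of names
    .explode()                 # one row per list element, index repeated
    .str.strip()               # trim the space left after each comma
    .rename("Lead Investor")
    .reset_index()             # "Company" back to a regular column
)
print(long)
```

The result has the same two columns, Company and Lead Investor, with one row per combination.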


We could merge the newly created dataframe with the main one, but for the time being we will keep them separate

The list has **423** entries (Investor and Company combinations). Let's group them by **Lead Investor** and count the Companies of each one

In [0]:
gr = ds.groupby("Lead Investor")["Company"].count()
gr.sort_values(ascending = True, inplace = True)
gr.count()

315

Still **315** entries. This indicates a significant dispersion of Lead Investors. Let's filter to **only Lead Investors with more than 2 Companies**

In [0]:
gr = gr[gr > 2]
gr.count()

25
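The effect of the threshold is easy to see on a tiny example. The pairs below are made up; only the `> 2` condition is the same as in the cell above:

```python
import pandas as pd

# Made-up Company / Lead Investor pairs
pairs = pd.DataFrame({
    "Company": ["A", "B", "C", "A", "B", "C"],
    "Lead Investor": ["X", "X", "X", "Y", "Y", "Z"],
})

counts = pairs.groupby("Lead Investor")["Company"].count()
frequent = counts[counts > 2]      # strictly more than 2 companies
print(frequent)
```

Investor Y, with exactly 2 companies, is dropped: the comparison is strict, so only investors with 3 or more companies remain.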

Now the list has come down to **25** entries, removing the **long tail** of investors with only one or two companies

Let's plot the list as we did before, but now as a **Horizontal Bar Chart** to better visualize the Lead Investors

In [68]:
fig = go.Figure(go.Bar(
            x = gr.values,
            y = gr.index.values,
            orientation = "h"))
fig.update_layout(
    title = "Companies per Lead Investor",
    xaxis_title = "Company Count",
    yaxis_title = "Lead Investor"
)
fig.show()