<a href="https://www.kaggle.com/code/mikedelong/one-big-little-scatter-plot?scriptVersionId=148235503" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
from pandas import read_csv
from numpy import nan
from arrow import Arrow

# this could be a lambda but it would be nasty
MONTH = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
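def date_transform(arg: str):
    # 'Oct 2018' -> datetime for 2018-10-15 (mid-month); only the first three letters of the month are matched
    pieces = arg.split()
    return Arrow(year=int(pieces[1]), month=1 + MONTH.index(pieces[0][:3]), day=15).datetime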

df = read_csv(filepath_or_buffer='/kaggle/input/the-rise-of-ai-based-llms/The Rise of AI-Based Large Language Models (LLMs) - LLM data.csv', 
                skiprows=1,).drop(columns=['link', 'note / * = parameters undisclosed'])
df['year'] = df['date'].apply(func=lambda x: int(x.split()[1]))
df['Date'] = df['date'].apply(func=date_transform)
# entries that are not plain decimal numbers become NaN
df['parameters (bn)'] = df['trained on x billion parameters'].apply(func=lambda x: float(x) if x.replace('.', '').isnumeric() else nan)
# let's consolidate some owners somewhat arbitrarily
owners = {
    'Meta / Facebook': 'Meta',
    'Facebook': 'Meta',
    'Meta AI': 'Meta',
    'OpenAI / Microsoft': 'OpenAI / Microsoft',
    'Microsoft / OpenAI' : 'OpenAI / Microsoft',
    'OpenAI': 'OpenAI / Microsoft',
    'Microsoft': 'OpenAI / Microsoft',
    'Open AI / Microsoft': 'OpenAI / Microsoft',
    'Google' : 'Google / DeepMind',
    'DeepMind' : 'Google / DeepMind',
    'Google Deepmind': 'Google / DeepMind',
}
df['Owner'] = df['owner'].apply(func=lambda x: owners.get(x, x))
df.head(n=5)

Unnamed: 0,name,owner,trained on x billion parameters,date,year,Date,parameters (bn),Owner
0,BERT,Google,0.34,Oct 2018,2018,2018-10-15 00:00:00+00:00,0.34,Google / DeepMind
1,GPT-2,OpenAI,1.5,Feb 2019,2019,2019-02-15 00:00:00+00:00,1.5,OpenAI / Microsoft
2,T5,Google,11.0,Oct 2019,2019,2019-10-15 00:00:00+00:00,11.0,Google / DeepMind
3,Megatron-11B,Meta / Facebook,11.0,Apr 2020,2020,2020-04-15 00:00:00+00:00,11.0,Meta
4,BlenderBot1,Meta / Facebook,9.4,Apr 2020,2020,2020-04-15 00:00:00+00:00,9.4,Meta
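
The two parsing steps above can be tried in isolation. The sketch below is a standard-library stand-in rather than the notebook's code: it swaps `arrow.Arrow` for `datetime`, the helper names `month_year_to_datetime` and `to_billions` are illustrative, and `'undisclosed'` is a hypothetical sample value. Note that `float()` in a `try` block is more permissive than the `isnumeric()` check used in the cell (it also accepts strings such as `'1e3'` or `'-1'`).

```python
from datetime import datetime, timezone

MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

def month_year_to_datetime(text: str) -> datetime:
    """Turn 'Oct 2018' into 2018-10-15 00:00 UTC (mid-month, as in date_transform)."""
    month_name, year = text.split()
    # Only the first three letters are matched, so 'October 2018' also works.
    return datetime(int(year), 1 + MONTHS.index(month_name[:3]), 15, tzinfo=timezone.utc)

def to_billions(text: str) -> float:
    """Return the parameter count as a float, or NaN when the text is not a number."""
    try:
        return float(text)
    except ValueError:
        return float('nan')

print(month_year_to_datetime('Oct 2018'))
print(to_billions('0.34'), to_billions('undisclosed'))  # 'undisclosed' is a made-up sample
```

Pinning every model to the 15th keeps the points centered within their month on the date axis, since the source data only gives month and year.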


Our model sizes tend to cluster in the middle of the range, but a log-scaled y axis smooths this out somewhat. Unfortunately, with 41 distinct owners we don't have many options for color schemes. Setting the height near 1000 (900 below) lets us see the whole legend.

In [2]:
from plotly.express import scatter
scatter(data_frame=df.sort_values(by='Owner'), x='Date', y='parameters (bn)', hover_name='name', color='Owner', height=900, log_y=True,
        trendline='ols', trendline_scope='overall')

We often talk about these models as if parameter count were a proxy for everything good about a model. Taken at face value, though, the trendline says models 'clearly' aren't getting bigger as time passes: our linear fit to rather noisy data has a negative slope.
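
One thing worth checking before reading much into that slope: to our understanding, Plotly Express fits the `ols` trendline to the untransformed y values even when the axis is drawn with `log_y=True`, unless `trendline_options=dict(log_y=True)` is also passed. A fit on raw sizes can be dominated by a few very large models. The sketch below uses made-up numbers (not the LLM dataset) and a hand-rolled helper, `ols_slope`, to show only that the two fits can disagree in sign. It says nothing about what the real data would give.

```python
from math import log10

def ols_slope(xs, ys):
    """Slope of the least-squares line through the points (xs[i], ys[i])."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    x_variance = sum((x - x_mean) ** 2 for x in xs)
    return covariance / x_variance

# Made-up illustration data, NOT taken from the LLM dataset:
# years since some start date, and sizes in billions of parameters.
years = [0, 1, 2, 3, 4]
sizes = [500, 1, 10, 100, 200]

raw_slope = ols_slope(years, sizes)                      # fit on raw sizes
log_slope = ols_slope(years, [log10(s) for s in sizes])  # fit on log10(sizes)

print(f'raw slope: {raw_slope:.2f}, log10 slope: {log_slope:.3f}')
```

Here a single large early value pulls the raw fit downward, while the log-space fit, which weighs orders of magnitude rather than absolute size, slopes upward.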

In [3]:
from plotly.express import bar
bar(data_frame=df, x='Owner', color='year')

This chart reinforces the idea that a handful of companies currently dominate the space, but it also suggests that this may not always be true. What percentage of the models listed were released this calendar year (2023)?

In [4]:
100 * round((df['year'] > 2022).mean(), 2)

36.0

About 36% of the models listed were released in this calendar year.
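
A small note on the arithmetic: rounding the fraction to two decimals before scaling by 100 yields a whole-number percentage. The sketch below uses hypothetical counts (`recent` and `total` are not the notebook's actual numbers) to compare that with scaling first and then rounding, which keeps one more digit.

```python
# Hypothetical counts for illustration only (not the notebook's data).
recent, total = 4, 11

share = recent / total
coarse = 100 * round(share, 2)  # round the fraction first, as in the cell above
fine = round(100 * share, 1)    # scale first, then keep one decimal

print(coarse, fine)
```

Either is fine for a headline figure like "about 36%"; the second form is only worth it when the extra decimal matters.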

In [5]:
from plotly.express import histogram
histogram(data_frame=df, x='year')