# AI Projects Growth Rate

## Table of Contents
- [Introduction](#introduction)
- [Database Connection](#connect-to-the-augur-database)
- [Load the Repositories](#load-the-urls-of-ai-repositories)
- [Retrieve Repository IDs](#retrieve-the-repository-ids-and-the-repository-names)
- [Fetch the Contribution Data](#fetch-the-contribution-data)
  - [Pull Request Contribution Data](#pull-request-contribution-data)
  - [Commit Contribution Data](#commit-contribution-data)
  - [Issue Contribution Data](#issue-contribution-data)
  - [Pull Request Review Contribution Data](#pull-request-review-contribution-data)
  - [Message Contribution Data](#message-contribution-data)
- [Growth Rate Calculation](#growth-rate-calculation)
  - [Last 6 Months Growth Rate Calculation](#last-6-months-contributions)
  - [First 6 Months Growth Rate Calculation](#first-6-months-contributions)
  - [Growth Rate Plots](#growth-rate-analysis---plots)
    - [Analysis Drawbacks](#analysis-and-drawbacks)
  - [Normalizing growth rate using Z-Score](#applying-z-score-normalization-on-growth-rate)
    - [Strategy Analysis and Outcomes](#strategy-analysis-and-learning-outcomes---standardized-growth-rates)
  - [Exponential Decay](#exponential-decay)
    - [Calculating Exponential Decay](#calculating-exponential-decay)
    - [Plotting Exponential Decay](#plotting-exponential-decay)
    - [Strategy Analysis and Outcomes](#strategy-analysis-and-learning-outcomes---exponential-decay)
- [Conclusion](#conclusion)

## Introduction

In this notebook, we analyze the growth rate of various AI/ML projects using different types of contributions: pull requests, commits, issues, pull request reviews, and messages. We first fetch the contribution data for each type and merge it into a single dataset. We then apply the formulas and models described in this [document](https://docs.google.com/document/d/1ZkPCLNq5UBHrhTNIgAta9cFqevk2rVZ5Vxl9jNuRbQc/edit?usp=sharing).

In [1]:
# Importing the required libraries

import json
import pandas as pd
from sqlalchemy import create_engine, text
import numpy as np
from utils.growth_rate_utils import calculate_log_monthly_growth
from utils.growth_rate_utils import plot_growth_rate_by_category
from utils.growth_rate_utils import plot_exponential_decay_by_category
from utils.growth_rate_utils import plot_standardized_growth_rate

In [2]:
# To ignore and to not display deprecation warnings
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

## Connect to the Augur database

The credentials to connect to the database are stored in `il_ai_creds.json`. Read the credentials from the JSON file and connect to the PostgreSQL database.

In [3]:
# Opening the JSON file containing database credentials and loading it into a dictionary
with open("data/il_ai_creds.json") as config_file:
    config = json.load(config_file)
    
# Creating a PostgreSQL database connection string using the credentials from the JSON file
database_connection_string = 'postgresql+psycopg2://{}:{}@{}:{}/{}'.format(
    config['user'],        # Username
    config['password'],    # Password
    config['host'],        # Hostname
    config['port'],        # Port number
    config['database']     # Database name
)

# Assigning the connection string to a variable
connection_string = database_connection_string

# Creating a SQLAlchemy engine using the connection string
engine = create_engine(connection_string)
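
The connection string above is built with plain string formatting, so a password containing characters such as `@` or `/` could be misread as part of the URL. The following is a minimal sketch of one way to guard against that, percent-encoding the user and password with the standard library. The helper name `build_connection_string` and the credentials in `example_config` are made up for illustration; only the key names come from the cell above.

```python
import json
from urllib.parse import quote


def build_connection_string(config):
    """Build a PostgreSQL URL, percent-encoding the user and password."""
    return "postgresql+psycopg2://{}:{}@{}:{}/{}".format(
        quote(str(config["user"]), safe=""),      # encode reserved characters
        quote(str(config["password"]), safe=""),  # e.g. '@' -> '%40', '/' -> '%2F'
        config["host"],
        config["port"],
        config["database"],
    )


# Hypothetical credentials with the same keys as il_ai_creds.json
example_config = json.loads(
    '{"user": "augur", "password": "p@ss/word", '
    '"host": "localhost", "port": 5432, "database": "augur"}'
)
example_url = build_connection_string(example_config)
print(example_url)
```

The resulting string can be passed to `create_engine` in the same way as `connection_string`.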

## Load the URLs of AI repositories

Retrieve the list of repositories that will be used for the growth rate analysis.

In [4]:
# Opening the JSON file containing AI repository data and loading it into a dictionary
# (the context manager closes the file automatically after loading the data)
with open('ai_repos.json') as f:
    urls_json_data = json.load(f)

# Print the collected repositories
from pprint import pprint
pprint(urls_json_data)


{'gen_ai': ['https://github.com/lucidrains/imagen-pytorch',
            'https://github.com/langchain-ai/langchain',
            'https://github.com/run-llama/llama_index',
            'https://github.com/microsoft/lora',
            'https://github.com/nvidia/nemo',
            'https://github.com/huggingface/peft',
            'https://github.com/microsoft/semantic-kernel',
            'https://github.com/chroma-core/chroma',
            'https://github.com/milvus-io/milvus',
            'https://github.com/qdrant/qdrant',
            'https://github.com/bigscience-workshop/promptsource',
            'https://github.com/automatic1111/stable-diffusion-webui'],
 'llm': ['https://github.com/huggingface/transformers',
         'https://github.com/huggingface/datasets',
         'https://github.com/huggingface/trl',
         'https://github.com/microsoft/deepspeed',
         'https://github.com/timdettmers/bitsandbytes',
         'https://github.com/mistralai/mistral-common',
         'ht

In [5]:
# Initializing an empty list to store repository git URLs
repo_git_set = []

# Extracting the list of repositories from the loaded JSON data
for key in urls_json_data.keys():
    repo_git_set.extend(urls_json_data.get(key))

## Retrieve the repository IDs and the repository names

Let's retrieve the repository IDs and names from the augur database.

In [6]:
# Initializing empty lists to store repository IDs and names
repo_set = []
repo_name_set = []

# Iterating through the list of repository git URLs
for repo_git in repo_git_set:
    
    # Creating a parameterized SQL query to fetch repository ID and name for each git URL
    # (a bound parameter avoids interpolating the URL directly into the SQL string)
    repo_query = text("""
                    SELECT 
                        b.repo_id,
                        b.repo_name
                    FROM
                        augur_data.repo_groups a,
                        augur_data.repo b
                    WHERE
                        a.repo_group_id = b.repo_group_id AND
                        b.repo_git = :repo_git
            """)

    # Using the connection to execute the query
    with engine.connect() as connection:
        t = connection.execute(repo_query, {"repo_git": repo_git})  # Executing the query
        results = t.mappings().all()  # Fetching all the results
        
        # Checking if results are found and extracting repo_id and repo_name
        if results:
            repo_id = results[0]['repo_id']
            repo_name = results[0]['repo_name']
        else:
            repo_id = None
            repo_name = None
        
        # Appending the fetched repository ID and name to the respective lists
        repo_set.append(repo_id)
        repo_name_set.append(repo_name)

# Printing the lists of repository IDs and names
print(repo_set)
print(repo_name_set)

[25495, 25498, 25497, 25501, 25500, 25504, 25503, 25557, 25502, 25499, 25511, 25514, 25515, 25505, 25512, 25516, 25507, 25506, 25510, 25509, 25508, 25513, 25518, 25519, 25523, 25522, 25520, 25521, 25517, 25511, 25533, 25525, 25530, 25524, 25528, 25532, 25529, 25481, 25527, 25543, 25541, 25537, 25546, 25534, 25540, 25538, 25545, 25535, 25542, 25536, 25539]
['numpy', 'tensorflow', 'networkx', 'pytorch', 'keras-io', 'tinygrad', 'pandas', 'polars', 'arrow', 'mlx', 'transformers', 'spacy', 'nltk', 'allennlp', 'gensim', 'corenlp', 'deepspeech', 'fasttext', 'sentence-transformers', 'opennmt', 'opennlp', 'cogcomp-nlp', 'mycroft-core', 'open-assistant', 'rhasspy', 'ovos-core', 'jarvis', 'leon', 'porcupine', 'transformers', 'datasets', 'trl', 'deepspeed', 'bitsandbytes', 'mistral-common', 'llama', 'text-to-text-transfer-transformer', 'instructlab', 'gemma', 'imagen-pytorch', 'langchain', 'llama_index', 'lora', 'nemo', 'peft', 'semantic-kernel', 'chroma', 'milvus', 'qdrant', 'promptsource', 'stab

In [7]:
# Creating a dictionary using zip to pair repo_set (IDs) and repo_name_set (names)
repo_dict = dict(zip(repo_set, repo_name_set))

# Printing the dictionary
print(repo_dict)

{25495: 'numpy', 25498: 'tensorflow', 25497: 'networkx', 25501: 'pytorch', 25500: 'keras-io', 25504: 'tinygrad', 25503: 'pandas', 25557: 'polars', 25502: 'arrow', 25499: 'mlx', 25511: 'transformers', 25514: 'spacy', 25515: 'nltk', 25505: 'allennlp', 25512: 'gensim', 25516: 'corenlp', 25507: 'deepspeech', 25506: 'fasttext', 25510: 'sentence-transformers', 25509: 'opennmt', 25508: 'opennlp', 25513: 'cogcomp-nlp', 25518: 'mycroft-core', 25519: 'open-assistant', 25523: 'rhasspy', 25522: 'ovos-core', 25520: 'jarvis', 25521: 'leon', 25517: 'porcupine', 25533: 'datasets', 25525: 'trl', 25530: 'deepspeed', 25524: 'bitsandbytes', 25528: 'mistral-common', 25532: 'llama', 25529: 'text-to-text-transfer-transformer', 25481: 'instructlab', 25527: 'gemma', 25543: 'imagen-pytorch', 25541: 'langchain', 25537: 'llama_index', 25546: 'lora', 25534: 'nemo', 25540: 'peft', 25538: 'semantic-kernel', 25545: 'chroma', 25535: 'milvus', 25542: 'qdrant', 25536: 'promptsource', 25539: 'stable-diffusion-webui'}


Let's convert `repo_set` from a list to a tuple so that it can be passed to the SQL queries as a bound parameter.

In [8]:
repo_set_tuple = tuple(repo_set)
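
The lookup loop above appends `None` when a URL has no match, and the printed ID list contains 25511 twice. Neither should change which rows an `IN` filter matches, but a cleaned tuple can make the bound parameter easier to reason about. A minimal sketch, using a made-up `example_repo_set` rather than the real list:

```python
# Hypothetical lookup result: one URL was not found (None) and one ID appears twice
example_repo_set = [25495, None, 25511, 25498, 25511]

# Drop missing IDs and duplicates while keeping the original order
clean_repo_set_tuple = tuple(
    dict.fromkeys(r for r in example_repo_set if r is not None)
)
print(clean_repo_set_tuple)
```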

In [9]:
# Define a function to execute SQL queries and return the fetched results as a DataFrame
def execute_query(query, engine):
    with engine.connect() as connection:
        result = connection.execute(query)
        return pd.DataFrame(result.fetchall(), columns=result.keys())

## Fetch the contribution data

Let's fetch the contribution data. This notebook pulls the following contribution types for the analysis:
- pull requests
- commits
- issues
- pull request reviews
- messages

### Pull Request Contribution data

Let's get the count of pull requests and also the count of merged pull requests for each repository in `repo_set`, grouped by the year and month of creation, and store the result in a pandas dataframe for further analysis.

In [10]:
from sqlalchemy.sql import text

pull_requests_query = text(f"""
    SELECT 
        repo_id, 
        CAST(DATE_PART('year', pr_created_at) AS INTEGER) AS year,
        CAST(DATE_PART('month', pr_created_at) AS INTEGER) AS month,
        COUNT(pull_request_id) AS pull_request_count,
        SUM(CASE WHEN pr_src_state = 'closed' AND pr_merged_at IS NOT NULL THEN 1 ELSE 0 END) AS merged_pull_request_count
    FROM 
        augur_data.pull_requests
    WHERE 
        repo_id IN :repo_set_tuple
        AND pr_created_at IS NOT NULL 
    GROUP BY 
        repo_id, year, month
    ORDER BY 
        repo_id, year, month;
""")

pull_requests_df = execute_query(pull_requests_query.bindparams(repo_set_tuple=repo_set_tuple), engine)

In [11]:
pull_requests_df

Unnamed: 0,repo_id,year,month,pull_request_count,merged_pull_request_count
0,25481,2024,2,93,85
1,25481,2024,3,390,289
2,25481,2024,4,183,121
3,25481,2024,5,120,89
4,25481,2024,6,180,123
...,...,...,...,...,...
2998,25557,2024,6,370,336
2999,25557,2024,7,301,258
3000,25557,2024,8,268,225
3001,25557,2024,9,258,234


Let's calculate the merged pull request ratio, `merged_pull_request_count / pull_request_count`. This may be a useful metric for future analysis, even if it is not used right away.

In [12]:
pull_requests_df['pr_merged_vs_raised_ratio'] = (
    pull_requests_df['merged_pull_request_count'] / pull_requests_df['pull_request_count']
).fillna(0)  

### Commit contribution data

Let's get the count of commits for each repository in `repo_set`, grouped by the year and month of creation, and store the result in a pandas dataframe for further analysis.

In [13]:
# Query to fetch commits data for repo_ids in repo_set
commits_query = text(f"""
    SELECT 
        repo_id, 
        CAST(DATE_PART('year', cmt_author_timestamp) AS INTEGER) AS year,
        CAST(DATE_PART('month', cmt_author_timestamp) AS INTEGER) AS month,
        COUNT(cmt_id) AS commit_count
    FROM 
        augur_data.commits
    WHERE 
        repo_id IN :repo_set_tuple
        AND cmt_author_timestamp IS NOT NULL 
    GROUP BY 
        repo_id, year, month
    ORDER BY 
        repo_id, year, month;
""")

commits_df = execute_query(commits_query.bindparams(repo_set_tuple=repo_set_tuple), engine)

In [14]:
commits_df

Unnamed: 0,repo_id,year,month,commit_count
0,25481,2024,2,346
1,25481,2024,3,699
2,25481,2024,4,632
3,25481,2024,5,384
4,25481,2024,6,807
...,...,...,...,...
3536,25557,2024,6,3115
3537,25557,2024,7,1616
3538,25557,2024,8,2080
3539,25557,2024,9,2246


### Issue contribution data

Let's get the count of issues for each repository in `repo_set`, grouped by the year and month of creation, and store the result in a pandas dataframe for further analysis.

In [15]:
# Query to fetch issues data for repo_ids in repo_set
issues_query = text(f"""
    SELECT 
        repo_id, 
        CAST(DATE_PART('year', created_at) AS INTEGER) AS year,
        CAST(DATE_PART('month', created_at) AS INTEGER) AS month,
        
        COUNT(issue_id) AS issue_count
    FROM 
        augur_data.issues
    WHERE 
        repo_id IN :repo_set_tuple
        AND created_at IS NOT NULL 
    GROUP BY 
        repo_id, year, month
    ORDER BY 
        repo_id, year, month;
""")

issues_df = execute_query(issues_query.bindparams(repo_set_tuple=repo_set_tuple), engine)

In [16]:
issues_df

Unnamed: 0,repo_id,year,month,issue_count
0,25481,2024,2,63
1,25481,2024,3,216
2,25481,2024,4,93
3,25481,2024,5,60
4,25481,2024,6,107
...,...,...,...,...
3014,25557,2024,6,305
3015,25557,2024,7,354
3016,25557,2024,8,264
3017,25557,2024,9,281


### Pull Request Review contribution data

Let's get the count of pull request reviews for each repository in `repo_set`, grouped by the year and month of creation, and store the result in a pandas dataframe for further analysis.

In [17]:
# Query to fetch pull request reviews data for repo_ids in repo_set
pr_reviews_query = text(f"""
    SELECT 
        repo_id, 
        CAST(DATE_PART('year', pr_review_submitted_at) AS INTEGER) AS year,
        CAST(DATE_PART('month', pr_review_submitted_at) AS INTEGER) AS month,
        COUNT(pr_review_id) AS review_count
    FROM 
        augur_data.pull_request_reviews
    WHERE 
        repo_id IN :repo_set_tuple
        AND pr_review_submitted_at IS NOT NULL
    GROUP BY 
        repo_id, year, month
    ORDER BY 
        repo_id, year, month;
""")

pr_reviews_df = execute_query(pr_reviews_query.bindparams(repo_set_tuple=repo_set_tuple), engine)

In [18]:
pr_reviews_df

Unnamed: 0,repo_id,year,month,review_count
0,25481,2024,2,133
1,25481,2024,3,1173
2,25481,2024,4,614
3,25481,2024,5,581
4,25481,2024,6,1040
...,...,...,...,...
1871,25557,2024,4,374
1872,25557,2024,5,428
1873,25557,2024,6,387
1874,25557,2024,7,420


### Message contribution data

Let's get the count of comments or messages for each repository in `repo_set`, grouped by the year and month of creation, and store the result in a pandas dataframe for further analysis.

In [19]:
# Query to fetch messages data for repo_ids in repo_set
messages_query = text(f"""
    SELECT 
        repo_id, 
        CAST(DATE_PART('year', msg_timestamp) AS INTEGER) AS year,
        CAST(DATE_PART('month', msg_timestamp) AS INTEGER) AS month,
        COUNT(msg_id) AS message_count
    FROM 
        augur_data.message
    WHERE 
        repo_id IN :repo_set_tuple
        AND msg_timestamp IS NOT NULL 
    GROUP BY 
        repo_id, year, month
    ORDER BY 
        repo_id, year, month;
""")

messages_df = execute_query(messages_query.bindparams(repo_set_tuple=repo_set_tuple), engine)

In [20]:
messages_df

Unnamed: 0,repo_id,year,month,message_count
0,25481,2024,2,297
1,25481,2024,3,2233
2,25481,2024,4,1307
3,25481,2024,5,1095
4,25481,2024,6,1733
...,...,...,...,...
3169,25557,2024,5,1775
3170,25557,2024,6,1764
3171,25557,2024,7,1769
3172,25557,2024,8,1555


In [21]:
# Merging the dataframes for all contribution types into one single dataframe

final_df = pull_requests_df.merge(commits_df, on=['repo_id', 'year', 'month'], how='outer')
final_df = final_df.merge(issues_df, on=['repo_id', 'year', 'month'], how='outer')
final_df = final_df.merge(pr_reviews_df, on=['repo_id', 'year', 'month'], how='outer')
final_df = final_df.merge(messages_df, on=['repo_id', 'year', 'month'], how='outer')

In [22]:
final_df

Unnamed: 0,repo_id,year,month,pull_request_count,merged_pull_request_count,pr_merged_vs_raised_ratio,commit_count,issue_count,review_count,message_count
0,25481,2024,2,93.0,85.0,0.913978,346.0,63.0,133.0,297.0
1,25481,2024,3,390.0,289.0,0.741026,699.0,216.0,1173.0,2233.0
2,25481,2024,4,183.0,121.0,0.661202,632.0,93.0,614.0,1307.0
3,25481,2024,5,120.0,89.0,0.741667,384.0,60.0,581.0,1095.0
4,25481,2024,6,180.0,123.0,0.683333,807.0,107.0,1040.0,1733.0
...,...,...,...,...,...,...,...,...,...,...
3848,25557,2024,6,370.0,336.0,0.908108,3115.0,305.0,387.0,1764.0
3849,25557,2024,7,301.0,258.0,0.857143,1616.0,354.0,420.0,1769.0
3850,25557,2024,8,268.0,225.0,0.839552,2080.0,264.0,281.0,1555.0
3851,25557,2024,9,258.0,234.0,0.906977,2246.0,281.0,,542.0


In [23]:
# Check for missing values in the final dataframe
missing_values = final_df.isnull().sum()
missing_values

repo_id                         0
year                            0
month                           0
pull_request_count            850
merged_pull_request_count     850
pr_merged_vs_raised_ratio     850
commit_count                  312
issue_count                   834
review_count                 1977
message_count                 679
dtype: int64

In [24]:
# Filling missing values with 0 and converting to integers
count_columns = ['pull_request_count', 'commit_count', 'issue_count', 'review_count', 'message_count', 'merged_pull_request_count']

final_df[count_columns] = final_df[count_columns].fillna(0).astype(int)
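
The outer merge and zero-fill steps above can be illustrated on a toy example. This is only a sketch: the two small dataframes and `repo_id` 1 are made up, and only two contribution types are shown. A month that exists in one dataframe but not the other appears as `NaN` after the outer merge, which is why the counts are filled with 0 afterwards.

```python
import pandas as pd

# Hypothetical monthly counts for one repository (repo_id 1 is made up)
prs = pd.DataFrame({
    "repo_id": [1, 1], "year": [2024, 2024], "month": [1, 2],
    "pull_request_count": [4, 6],
})
commits = pd.DataFrame({
    "repo_id": [1, 1], "year": [2024, 2024], "month": [2, 3],
    "commit_count": [10, 7],
})

keys = ["repo_id", "year", "month"]

# An outer merge keeps every (repo_id, year, month) found in either dataframe
merged = (
    prs.merge(commits, on=keys, how="outer")
    .sort_values(keys)
    .reset_index(drop=True)
)

# Months missing from one side show up as NaN
missing_before = merged.isnull().sum()

# Treat a missing month as zero activity and restore integer counts
count_cols = ["pull_request_count", "commit_count"]
merged[count_cols] = merged[count_cols].fillna(0).astype(int)
print(merged)
```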

In [25]:
final_df['total_contributions'] = (
    final_df['pull_request_count'] +
    final_df['commit_count'] +
    final_df['issue_count'] +
    final_df['review_count'] +
    final_df['message_count']
)

In [26]:
# Print the final dataframe
final_df.head()

Unnamed: 0,repo_id,year,month,pull_request_count,merged_pull_request_count,pr_merged_vs_raised_ratio,commit_count,issue_count,review_count,message_count,total_contributions
0,25481,2024,2,93,85,0.913978,346,63,133,297,932
1,25481,2024,3,390,289,0.741026,699,216,1173,2233,4711
2,25481,2024,4,183,121,0.661202,632,93,614,1307,2829
3,25481,2024,5,120,89,0.741667,384,60,581,1095,2240
4,25481,2024,6,180,123,0.683333,807,107,1040,1733,3867


In [27]:
# Create the 'repo_name' column by mapping 'repo_id' to 'repo_dict'
final_df['repo_name'] = final_df['repo_id'].map(repo_dict)

# Insert 'repo_name' next to 'repo_id'
repo_id_index = final_df.columns.get_loc('repo_id')  # Get the index of 'repo_id' column
final_df.insert(repo_id_index + 1, 'repo_name', final_df.pop('repo_name'))  # Insert 'repo_name' after 'repo_id'

Let's compare the values in `repo_name` from `final_df` with [ai_repos.json](ai_repos.json) and add a `category` column to `final_df`.

In [28]:
# Create a dictionary to map repo names to categories
repo_category_dict = {}

# Map each repo_name in final_df to its category
for category, repo_list in urls_json_data.items():
    for repo_url in repo_list:
        # Extract repo name from the GitHub URLs
        repo_name = repo_url.split('/')[-1]
        repo_category_dict[repo_name] = category

final_df['category'] = final_df['repo_name'].map(repo_category_dict)
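
Because `repo_category_dict` is keyed on the repository name, a repository listed under more than one category keeps only the category that is processed last. The repository ID list printed earlier contains 25511 twice, which suggests this happens for at least one repository. The following sketch shows one way to detect such cases. The `example_repos` dictionary is a hypothetical excerpt in the same shape as `ai_repos.json`, not the real file.

```python
from collections import defaultdict

# Hypothetical excerpt in the same shape as ai_repos.json
example_repos = {
    "nlp": [
        "https://github.com/huggingface/transformers",
        "https://github.com/explosion/spacy",
    ],
    "llm": [
        "https://github.com/huggingface/transformers",
        "https://github.com/huggingface/trl",
    ],
}

# Collect every category in which each repository name appears
categories_by_repo = defaultdict(list)
for category, repo_list in example_repos.items():
    for repo_url in repo_list:
        repo_name = repo_url.rstrip("/").split("/")[-1]
        categories_by_repo[repo_name].append(category)

# Repositories that belong to more than one category
multi_category = {
    name: cats for name, cats in categories_by_repo.items() if len(cats) > 1
}
print(multi_category)
```

With a one-to-one mapping like `repo_category_dict`, such a repository would be plotted only under the last category in the file.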

In [29]:
# Display updated dataframe with repo_name and category columns
final_df

Unnamed: 0,repo_id,repo_name,year,month,pull_request_count,merged_pull_request_count,pr_merged_vs_raised_ratio,commit_count,issue_count,review_count,message_count,total_contributions,category
0,25481,instructlab,2024,2,93,85,0.913978,346,63,133,297,932,llm
1,25481,instructlab,2024,3,390,289,0.741026,699,216,1173,2233,4711,llm
2,25481,instructlab,2024,4,183,121,0.661202,632,93,614,1307,2829,llm
3,25481,instructlab,2024,5,120,89,0.741667,384,60,581,1095,2240,llm
4,25481,instructlab,2024,6,180,123,0.683333,807,107,1040,1733,3867,llm
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3848,25557,polars,2024,6,370,336,0.908108,3115,305,387,1764,5941,math
3849,25557,polars,2024,7,301,258,0.857143,1616,354,420,1769,4460,math
3850,25557,polars,2024,8,268,225,0.839552,2080,264,281,1555,4448,math
3851,25557,polars,2024,9,258,234,0.906977,2246,281,0,542,3327,math


In [30]:
data = final_df.copy()

## Growth Rate Calculation

Let's calculate the growth rate of each repository for the first and last 6 months.

- We calculate the growth rate as the difference between the natural logarithms of `total_contributions` in consecutive months.
- The log transformation captures relative (percentage) changes rather than absolute changes. This makes it easier to compare growth rates across repositories, even if they have vastly different numbers of total contributions.
- Logarithmic growth rates reduce the impact of very large outliers, smoothing out extreme spikes or drops in the data.

**Calculation:**
- `np.log(df['total_contributions'])` takes the natural logarithm of `total_contributions` for each month.
- `df['total_contributions'].shift(1)` shifts the `total_contributions` data down by 1 row, aligning the value from the previous month with the current month. This allows us to compare the current month's contributions with the previous month's.
- `np.log(df['total_contributions']) - np.log(df['total_contributions'].shift(1))` calculates the difference in the logarithms of the contributions between the current month and the previous month. This difference is the logarithmic growth rate, or log-return, between the two periods.
- The result is multiplied by 100 to express the logarithmic growth rate as a percentage.
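
The steps above can be sketched as follows. This is an illustrative stand-in, not the implementation of `calculate_log_monthly_growth` in `utils.growth_rate_utils`, which may differ in details such as how the first month is handled. The function name `log_monthly_growth_sketch` is made up. The two sample values are the instructlab totals for May and June 2024 from the merged dataframe.

```python
import numpy as np
import pandas as pd


def log_monthly_growth_sketch(df):
    """Illustrative log growth rate: 100 * (ln(current) - ln(previous))."""
    df = df.sort_values(["year", "month"]).copy()
    current = df["total_contributions"]
    previous = current.shift(1)  # previous month's value aligned to this row
    df["log_growth_rate"] = (np.log(current) - np.log(previous)) * 100
    return df


# instructlab totals for May and June 2024, taken from the merged dataframe
sample = pd.DataFrame({
    "year": [2024, 2024],
    "month": [5, 6],
    "total_contributions": [2240, 3867],
})
growth = log_monthly_growth_sketch(sample)
print(growth)
```

The first row has no previous month, so its growth rate is `NaN`. A month with zero total contributions would produce an infinite value, so such months would need separate handling.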

In [31]:
# Sort the data by repo_id, year, and month to ensure chronological order
data = data.sort_values(by=['repo_id', 'year', 'month'])

### Last 6 months contributions

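Let's extract the last 6 months of contribution data from the whole dataframe. The function selects 7 months by default because each growth rate needs the previous month as a baseline, so 7 months of data yield 6 growth rates.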

In [32]:
# Function to filter and calculate log growth rate for the last 6 months of each repo
def get_last_6_months_growth(df, months_to_consider=7):
    
    # Sort by year and month to ensure chronological order
    df = df.sort_values(by=['year', 'month'])
    
    # Take the last 'months_to_consider' months
    if len(df) >= months_to_consider:
        df = df.tail(months_to_consider)
    else:
        return pd.DataFrame()  # Return empty DataFrame if less than required months

    # Apply the growth rate calculation for the last 6 months
    df = calculate_log_monthly_growth(df)
    
    return df

# Apply the function to each repository group to get the last 6 months of data
last_6months_df = data.groupby('repo_id').apply(get_last_6_months_growth)

# Flatten the groupby result and reset the index
last_6months_df = last_6months_df.reset_index(drop=True)

# Display or print the last 6 months dataframe
last_6months_df

Unnamed: 0,repo_id,repo_name,year,month,pull_request_count,merged_pull_request_count,pr_merged_vs_raised_ratio,commit_count,issue_count,review_count,message_count,total_contributions,category,log_growth_rate,x_axis
0,25481,instructlab,2024,5,120,89,0.741667,384,60,581,1095,2240,llm,-23.344743,1
1,25481,instructlab,2024,6,180,123,0.683333,807,107,1040,1733,3867,llm,54.600315,2
2,25481,instructlab,2024,7,261,213,0.816092,1149,143,2220,3294,7067,llm,60.295705,3
3,25481,instructlab,2024,8,149,116,0.778523,389,74,807,1523,2942,llm,-87.634644,4
4,25481,instructlab,2024,9,92,71,0.771739,323,73,0,484,972,llm,-110.748910,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
295,25557,polars,2024,6,370,336,0.908108,3115,305,387,1764,5941,math,22.752952,2
296,25557,polars,2024,7,301,258,0.857143,1616,354,420,1769,4460,math,-28.672870,3
297,25557,polars,2024,8,268,225,0.839552,2080,264,281,1555,4448,math,-0.269421,4
298,25557,polars,2024,9,258,234,0.906977,2246,281,0,542,3327,math,-29.038356,5


### First 6 months contributions

Let's extract the first 6 months of contribution data from the entire dataframe.

In [33]:
# Function to filter and calculate log growth rate for the first 6 months of each repo
def get_first_6_months_growth(df, months_to_consider=7):
    
    # Sort by year and month to ensure chronological order
    df = df.sort_values(by=['year', 'month'])

    # Take the first 'months_to_consider' months since the start of the project
    if len(df) >= months_to_consider:
        df = df[:months_to_consider]
    else:
        return pd.DataFrame()  # Return empty DataFrame if less than required months

    # Apply the growth rate calculation for the first 6 months
    df = calculate_log_monthly_growth(df)
    
    return df

# Apply the function to each repository group to get the first 6 months of data
first_6months_df = data.groupby('repo_id').apply(get_first_6_months_growth)

# Flatten the groupby result and reset the index
first_6months_df = first_6months_df.reset_index(drop=True)

# Display or print the first 6 months dataframe
first_6months_df

Unnamed: 0,repo_id,repo_name,year,month,pull_request_count,merged_pull_request_count,pr_merged_vs_raised_ratio,commit_count,issue_count,review_count,message_count,total_contributions,category,log_growth_rate,x_axis
0,25481,instructlab,2024,3,390,289,0.741026,699,216,1173,2233,4711,llm,162.032266,1
1,25481,instructlab,2024,4,183,121,0.661202,632,93,614,1307,2829,llm,-50.997691,2
2,25481,instructlab,2024,5,120,89,0.741667,384,60,581,1095,2240,llm,-23.344743,3
3,25481,instructlab,2024,6,180,123,0.683333,807,107,1040,1733,3867,llm,54.600315,4
4,25481,instructlab,2024,7,261,213,0.816092,1149,143,2220,3294,7067,llm,60.295705,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
295,25557,polars,2020,7,0,0,,448,24,0,18,490,math,-11.369305,2
296,25557,polars,2020,8,5,5,1.000000,368,23,0,13,409,math,-18.069024,3
297,25557,polars,2020,9,6,6,1.000000,493,22,1,32,554,math,30.344953,4
298,25557,polars,2020,10,3,2,0.666667,592,25,5,64,689,math,21.807658,5


### Growth Rate Analysis - Plots

Let's try to create growth rate plots for the projects based on their category. In order to plot based on the category, let's read the categories of repositories from the [ai_repos.json](ai_repos.json) file and iterate through each category and plot them.

In [34]:
# Fetch the categories from ai_repos.json
categories = list(urls_json_data.keys())

Let's plot the growth rate of the repositories for the first 6 months.

In [35]:
# Plot growth rate for all the repos in each individual category for the first 6 months
for category in categories:
    fig = plot_growth_rate_by_category(first_6months_df, category, title="First 6 Months Growth Rate Trend")
    fig.show()

Let's plot the growth rate of the repositories for the last 6 months.

In [36]:
# Plot growth rate for all the repos in each individual category for the last 6 months
for category in categories:
    fig = plot_growth_rate_by_category(last_6months_df, category, title="Last 6 Months Growth Rate Trend")
    fig.show()

### Analysis and Drawbacks

- By working with log-transformed growth rates, we mitigated the impact of extreme growth values. The log transformation helped smooth out large, potentially misleading variations, allowing for a more consistent comparison across different repositories over time.
- The graphs above are hard to interpret because the y-axis has no intuitive scale and its range changes from plot to plot.
- One thing we can observe is that all the repositories oscillate considerably, with no steady increase or decrease in growth rate.

So, let's use `Z-Score` as the normalization metric on the log_growth_rate parameter.

## Applying Z-Score Normalization on Growth Rate

**A z-score is a statistical measurement that shows how many standard deviations a data point is from the mean of a distribution.** 
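
In symbols, the z-score of a value $x$ is $z = (x - \mu) / \sigma$, where $\mu$ is the mean and $\sigma$ is the standard deviation. The sketch below applies this to `log_growth_rate` separately for each repository. It rests on two assumptions: the standardization is done per repository, and the population standard deviation (`ddof=0`) is used. The actual `plot_standardized_growth_rate` helper may make different choices. The growth rates and the repository names `a` and `b` are made up.

```python
import pandas as pd

# Hypothetical growth rates for two repositories with different magnitudes
rates = pd.DataFrame({
    "repo_name": ["a", "a", "a", "b", "b", "b"],
    "log_growth_rate": [10.0, 20.0, 30.0, -5.0, 0.0, 5.0],
})


def zscore(series):
    """Standardize a series: subtract its mean, divide by its population std."""
    return (series - series.mean()) / series.std(ddof=0)


# Standardize each repository against its own mean and standard deviation
rates["standardized_growth_rate"] = (
    rates.groupby("repo_name")["log_growth_rate"].transform(zscore)
)
print(rates)
```

Both repositories end up on the same scale even though their raw growth rates differ in size, which is what makes the standardized plots comparable.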

Let's plot the Standardized Growth Rates for all categories over the first 6 months.

In [37]:
# Plot for each individual category
for category in categories:
    fig = plot_standardized_growth_rate(first_6months_df, category, title="First 6 Months Standardized Growth Rate Trend")
    fig.show()

Let's plot the standardized growth plots for all categories over the last 6 months.

In [38]:
# Plot for each individual category
for category in categories:
    fig = plot_standardized_growth_rate(last_6months_df, category, title="Last 6 Months Standardized Growth Rate Trend")
    fig.show()

### Strategy Analysis and Learning Outcomes - Standardized Growth Rates

- We aim to identify general growth patterns while minimizing the noise introduced by fluctuations specific to individual repositories.
- We standardized the log-transformed growth rates by calculating the Z-score for each repository's growth rate, allowing us to compare activity levels on a common scale. The Z-score standardization enabled us to observe growth patterns relative to the mean, with a Z-score near 0 indicating stability close to the average growth rate. This helped reveal whether a repository's growth was consistently above or below the mean, regardless of absolute values.
- A Z-score above 1 means the repository's growth rate is more than one standard deviation above its mean. Repositories whose Z-scores are not consistently near zero can be considered volatile, with the fluctuations reflecting periods of intensified or reduced activity.
- This approach underscored the effectiveness of Z-score standardization in identifying meaningful activity trends over time. By focusing on standardized growth rates, we were able to create a baseline view of repository growth that minimizes temporary spikes or dips, making it easier to detect sustained patterns.
  
Here are some of the patterns observed from the plots.
1. **Variability Across Repositories**: Different repositories exhibit varying growth rate trends. Some have more fluctuations, while others remain relatively stable.
2. **Decline**: mlx in math, allennlp in nlp, and instructlab and bitsandbytes in llm show a notable decline in growth rate over the last couple of months.
3. **Stability**: tensorflow, pytorch, arrow, tinygrad, polars in math, sentence-transformers in nlp, jarvis in personal_assistants, llama, deepspeed, trl, transformers and instructlab in llm, and nemo, milvus, llama_index, semantic-kernel, chroma in gen_ai maintain relatively stable growth rates close to the mean (z-score around 0). That is, their growth rates stay within one standard deviation of the mean over the most recent 6 months.
4. **Overall Trends**: Most repositories hover around the mean, indicating that while there are fluctuations, they tend to stabilize over time.
5. **Interesting patterns**: TensorFlow demonstrates remarkable stability in its growth pattern over the last 6 months, dipping below the average (negative z-score) only once. This consistency may indicate a steady contributor base and sustained interest in the project.

## Exponential Decay

Let's apply dynamic exponential decay to the actions. Exponential decay emphasizes the recent contributions while progressively reducing the impact of older ones. The decay factor is calculated as `0.9 ^ months_since_action`, where more recent actions have larger weight, and older actions decay more.

**Example**
Let's consider one activity in the previous month (September, 1 month ago) and another in May (5 months ago).

$decay\_factor_{september} = 0.9^{1} = 0.9$

$decay\_factor_{may} = 0.9^{5} \approx 0.59$

We can clearly see that $decay\_factor_{september}$ is greater than $decay\_factor_{may}$.
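
The same comparison can be checked with a few lines of Python. This sketch assumes the current month is October, so that September is 1 month ago and May is 5 months ago. The helper name `decay_factor` is made up for illustration.

```python
DECAY_BASE = 0.9  # monthly decay base used in this notebook


def decay_factor(months_since_action, base=DECAY_BASE):
    """Weight applied to an action that happened the given number of months ago."""
    return base ** months_since_action


september = decay_factor(1)  # 1 month ago
may = decay_factor(5)        # 5 months ago
print(round(september, 2), round(may, 2))
```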

Weighted Decayed Activity is a score that reflects the recent contribution level of a repository, prioritizing newer, more significant contributions while gradually discounting older ones. So, let's assign weights to the different types of contributions in order to use this metric.

In [39]:
# Define weights (can be adjusted as needed)
weights = {
    'commit': 0.6,
    'issue': 0.5,
    'pr': 0.4,
    'merged_pr': 0.8,
    'review': 0.4,
    'message': 0.3
}
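
To make the metric concrete, here is a worked example for a single month. It uses the instructlab counts for June 2024 from the merged dataframe and assumes, for illustration only, that the reference month is November 2024, which makes June 5 months old. The names `example_weights` and `june_counts` are made up. The weights are the ones defined above.

```python
# Weights copied from the cell above
example_weights = {
    "commit": 0.6,
    "issue": 0.5,
    "pr": 0.4,
    "merged_pr": 0.8,
    "review": 0.4,
    "message": 0.3,
}

# instructlab counts for June 2024, taken from the merged dataframe
june_counts = {
    "commit": 807,
    "issue": 107,
    "pr": 180,
    "merged_pr": 123,
    "review": 1040,
    "message": 1733,
}

months_since_action = 5  # assumption: reference month is November 2024
decay = 0.9 ** months_since_action

# Each count is decayed, weighted by its contribution type, then summed
weighted_decayed_activity = sum(
    example_weights[kind] * june_counts[kind] * decay for kind in example_weights
)
print(round(weighted_decayed_activity, 2))
```

Merged pull requests are counted both in `pr` and in `merged_pr`, so a merged pull request contributes more to the score than an unmerged one.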

### Calculating Exponential Decay

Let's calculate the exponential decay for all the repositories that are available in our dataframe.

In [40]:
def calculate_dynamic_exponential_decay(data, weights, current_year=2024, latest_month=11):
    
    def calculate_decay_for_repo(df):

        # Sort by year and month
        df = df.sort_values(by=['year', 'month'])
        
        current_date_month = latest_month
        current_date_year = current_year
        
        # Calculate months since each action occurred
        df['months_since_action'] = ((current_date_year - df['year']) * 12) + (current_date_month - df['month'])
        
        # Calculate the decay factor dynamically for each action
        df['decay_factor'] = 0.9 ** df['months_since_action']
        
        # Calculate decayed activities
        df['decayed_commit_count'] = df['commit_count'] * df['decay_factor']
        df['decayed_issue_count'] = df['issue_count'] * df['decay_factor']
        df['decayed_pr_count'] = df['pull_request_count'] * df['decay_factor']
        df['decayed_merged_pull_request_count'] = df['merged_pull_request_count'] * df['decay_factor']
        df['decayed_review_count'] = df['review_count'] * df['decay_factor']
        df['decayed_message_count'] = df['message_count'] * df['decay_factor']
        
        # Weighted decayed activity using the provided weights
        df['weighted_decayed_activity'] = (
            weights.get('commit', 1.0) * df['decayed_commit_count'] +
            weights.get('issue', 1.0) * df['decayed_issue_count'] +
            weights.get('pr', 1.0) * df['decayed_pr_count'] +
            weights.get('merged_pr', 1.0) * df['decayed_merged_pull_request_count'] +
            weights.get('review', 1.0) * df['decayed_review_count'] +
            weights.get('message', 1.0) * df['decayed_message_count']
        )
        
        return df

    # Apply the decay calculation to each repository group
    decayed_df = data.groupby('repo_id').apply(calculate_decay_for_repo)

    # Flatten the groupby result and reset the index
    decayed_df = decayed_df.reset_index(drop=True)

    return decayed_df
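Because `months_since_action` and `decay_factor` depend only on each row's own `year` and `month`, the same score can also be computed without a per-repository `groupby`. The following is a minimal, vectorized sketch under that assumption. The names `COUNT_COLUMNS` and `weighted_decayed_activity` and the toy counts are hypothetical, and the sketch has not been run against the notebook's data.

```python
import pandas as pd

# Maps each key of the weights dict to the count column it scales.
COUNT_COLUMNS = {
    "commit": "commit_count",
    "issue": "issue_count",
    "pr": "pull_request_count",
    "merged_pr": "merged_pull_request_count",
    "review": "review_count",
    "message": "message_count",
}

def weighted_decayed_activity(df, weights, current_year, latest_month, base=0.9):
    """Add months_since_action, decay_factor and the weighted decayed score."""
    out = df.copy()
    out["months_since_action"] = (
        (current_year - out["year"]) * 12 + (latest_month - out["month"])
    )
    out["decay_factor"] = base ** out["months_since_action"]
    # Each count is decayed and then scaled by its weight (default weight 1.0).
    out["weighted_decayed_activity"] = sum(
        weights.get(key, 1.0) * out[col] * out["decay_factor"]
        for key, col in COUNT_COLUMNS.items()
    )
    return out

weights = {"commit": 0.6, "issue": 0.5, "pr": 0.4,
           "merged_pr": 0.8, "review": 0.4, "message": 0.3}

# Made-up counts, for illustration only.
toy = pd.DataFrame({
    "repo_id": [1, 1],
    "year": [2020, 2024],
    "month": [6, 9],
    "commit_count": [1, 10],
    "issue_count": [1, 0],
    "pull_request_count": [1, 0],
    "merged_pull_request_count": [1, 0],
    "review_count": [1, 0],
    "message_count": [1, 0],
})

result = weighted_decayed_activity(toy, weights, current_year=2024, latest_month=11)
print(result[["year", "month", "months_since_action",
              "decay_factor", "weighted_decayed_activity"]])
```

The June 2020 row is 53 months old, so its weight is `0.9 ** 53`, which is below 0.004. The September 2024 row is only 2 months old and keeps a weight of 0.81.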

In [41]:
exp_decay_df = calculate_dynamic_exponential_decay(data, weights)

In [42]:
exp_decay_df

Unnamed: 0,repo_id,repo_name,year,month,pull_request_count,merged_pull_request_count,pr_merged_vs_raised_ratio,commit_count,issue_count,review_count,...,category,months_since_action,decay_factor,decayed_commit_count,decayed_issue_count,decayed_pr_count,decayed_merged_pull_request_count,decayed_review_count,decayed_message_count,weighted_decayed_activity
0,25481,instructlab,2024,2,93,85,0.913978,346,63,133,...,llm,9,0.387420,134.047489,24.407491,36.030105,32.930742,51.526925,115.063885,188.518810
1,25481,instructlab,2024,3,390,289,0.741026,699,216,1173,...,llm,8,0.430467,300.896580,92.980917,167.882212,124.405024,504.938037,961.233280,884.050509
2,25481,instructlab,2024,4,183,121,0.661202,632,93,614,...,llm,7,0.478297,302.283641,44.481612,87.528333,57.873925,293.674297,625.134048,589.931396
3,25481,instructlab,2024,5,120,89,0.741667,384,60,581,...,llm,6,0.531441,204.073344,31.886460,63.772920,47.298249,308.767221,581.927895,499.820261
4,25481,instructlab,2024,6,180,123,0.683333,807,107,1040,...,llm,5,0.590490,476.525430,63.182430,106.288200,72.630270,614.109600,1023.319170,970.765560
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3848,25557,polars,2024,6,370,336,0.908108,3115,305,387,...,math,5,0.590490,1839.376350,180.099450,218.481300,198.404640,228.519630,1041.624360,1843.686927
3849,25557,polars,2024,7,301,258,0.857143,1616,354,420,...,math,4,0.656100,1060.257600,232.259400,197.486100,169.273800,275.562000,1160.640900,1425.114810
3850,25557,polars,2024,8,268,225,0.839552,2080,264,281,...,math,3,0.729000,1516.320000,192.456000,195.372000,164.025000,204.849000,1133.595000,1637.406900
3851,25557,polars,2024,9,258,234,0.906977,2246,281,0,...,math,2,0.810000,1819.260000,227.610000,208.980000,189.540000,0.000000,439.020000,1572.291000
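
As a sanity check, the `weighted_decayed_activity` of the first row (instructlab, February 2024) can be reproduced by hand from the decayed counts shown in that row and the weights defined earlier. This is a small sketch; the variable names are illustrative and the numbers are copied from the output above.

```python
weights = {"commit": 0.6, "issue": 0.5, "pr": 0.4,
           "merged_pr": 0.8, "review": 0.4, "message": 0.3}

# February 2024 seen from November 2024: (2024 - 2024) * 12 + (11 - 2) = 9 months.
months_since_action = (2024 - 2024) * 12 + (11 - 2)
decay = 0.9 ** months_since_action

# Decayed counts copied from row 0 of the dataframe output.
decayed = {
    "commit": 134.047489,
    "issue": 24.407491,
    "pr": 36.030105,
    "merged_pr": 32.930742,
    "review": 51.526925,
    "message": 115.063885,
}

score = sum(weights[key] * decayed[key] for key in weights)
print(months_since_action, round(decay, 6), round(score, 4))
```

The result agrees with the `decay_factor` (0.387420) and `weighted_decayed_activity` (188.518810) columns of that row.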


### Plotting Exponential Decay

In [43]:
def filter_first_6_months(df, months_to_consider=6):
    
    # Sort by year and month to ensure chronological order
    df = df.sort_values(by=['year', 'month'])
    
    # Take the first 'months_to_consider' months since the start of the project
    if len(df) >= months_to_consider:
        df = df.head(months_to_consider).copy()  # copy so the new column is not set on a slice
    else:
        return pd.DataFrame()  # Return empty DataFrame if less than required months
    
    # Create a sequential 'x-axis' column (1, 2, 3, ..., months_to_consider)
    df['x_axis'] = np.arange(1, months_to_consider + 1)
    
    return df

# Apply the function to each repository group to get the first 6 months of data
first_6m_exp_decay_df = exp_decay_df.groupby('repo_id').apply(filter_first_6_months)

# Flatten the groupby result and reset the index
first_6m_exp_decay_df = first_6m_exp_decay_df.reset_index(drop=True)

# Display the first 6 months dataframe
first_6m_exp_decay_df

Unnamed: 0,repo_id,repo_name,year,month,pull_request_count,merged_pull_request_count,pr_merged_vs_raised_ratio,commit_count,issue_count,review_count,...,months_since_action,decay_factor,decayed_commit_count,decayed_issue_count,decayed_pr_count,decayed_merged_pull_request_count,decayed_review_count,decayed_message_count,weighted_decayed_activity,x_axis
0,25481,instructlab,2024,2,93,85,0.913978,346,63,133,...,9,0.387420,134.047489,24.407491,36.030105,32.930742,51.526925,115.063885,188.518810,1
1,25481,instructlab,2024,3,390,289,0.741026,699,216,1173,...,8,0.430467,300.896580,92.980917,167.882212,124.405024,504.938037,961.233280,884.050509,2
2,25481,instructlab,2024,4,183,121,0.661202,632,93,614,...,7,0.478297,302.283641,44.481612,87.528333,57.873925,293.674297,625.134048,589.931396,3
3,25481,instructlab,2024,5,120,89,0.741667,384,60,581,...,6,0.531441,204.073344,31.886460,63.772920,47.298249,308.767221,581.927895,499.820261,4
4,25481,instructlab,2024,6,180,123,0.683333,807,107,1040,...,5,0.590490,476.525430,63.182430,106.288200,72.630270,614.109600,1023.319170,970.765560,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
295,25557,polars,2020,6,0,0,,532,14,0,...,53,0.003757,1.998778,0.052599,0.000000,0.000000,0.000000,0.011271,1.228948,2
296,25557,polars,2020,7,0,0,,448,24,0,...,52,0.004175,1.870202,0.100189,0.000000,0.000000,0.000000,0.075142,1.194758,3
297,25557,polars,2020,8,5,5,1.000000,368,23,0,...,51,0.004638,1.706930,0.106683,0.023192,0.023192,0.000000,0.060299,1.123420,4
298,25557,polars,2020,9,6,6,1.000000,493,22,1,...,50,0.005154,2.540811,0.113383,0.030923,0.030923,0.005154,0.164921,1.669823,5


In [44]:
def filter_last_6_months(df, months_to_consider=6):
    
    # Sort by year and month to ensure chronological order
    df = df.sort_values(by=['year', 'month'])
    
    # Take the last 'months_to_consider' months
    if len(df) >= months_to_consider:
        df = df.tail(months_to_consider).copy()  # copy so the new column is not set on a slice
    else:
        return pd.DataFrame()  # Return empty DataFrame if less than required months
    
    # Create a sequential 'x-axis' column (1, 2, 3, ..., months_to_consider)
    df['x_axis'] = np.arange(1, months_to_consider + 1)
    
    return df

# Apply the function to each repository group to get the last 6 months of data
last_6m_exp_decay_df = exp_decay_df.groupby('repo_id').apply(filter_last_6_months)

# Flatten the groupby result and reset the index
last_6m_exp_decay_df = last_6m_exp_decay_df.reset_index(drop=True)

# Display or print the last 6 months dataframe
last_6m_exp_decay_df

Unnamed: 0,repo_id,repo_name,year,month,pull_request_count,merged_pull_request_count,pr_merged_vs_raised_ratio,commit_count,issue_count,review_count,...,months_since_action,decay_factor,decayed_commit_count,decayed_issue_count,decayed_pr_count,decayed_merged_pull_request_count,decayed_review_count,decayed_message_count,weighted_decayed_activity,x_axis
0,25481,instructlab,2024,5,120,89,0.741667,384,60,581,...,6,0.531441,204.073344,31.88646,63.77292,47.298249,308.767221,581.927895,499.820261,1
1,25481,instructlab,2024,6,180,123,0.683333,807,107,1040,...,5,0.590490,476.525430,63.18243,106.28820,72.630270,614.109600,1023.319170,970.765560,2
2,25481,instructlab,2024,7,261,213,0.816092,1149,143,2220,...,4,0.656100,753.858900,93.82230,171.24210,139.749300,1456.542000,2161.193400,1910.497590,3
3,25481,instructlab,2024,8,149,116,0.778523,389,74,807,...,3,0.729000,283.581000,53.94600,108.62100,84.564000,588.303000,1110.267000,876.622500,4
4,25481,instructlab,2024,9,92,71,0.771739,323,73,0,...,2,0.810000,261.630000,59.13000,74.52000,57.510000,0.000000,392.040000,379.971000,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
295,25557,polars,2024,6,370,336,0.908108,3115,305,387,...,5,0.590490,1839.376350,180.09945,218.48130,198.404640,228.519630,1041.624360,1843.686927,2
296,25557,polars,2024,7,301,258,0.857143,1616,354,420,...,4,0.656100,1060.257600,232.25940,197.48610,169.273800,275.562000,1160.640900,1425.114810,3
297,25557,polars,2024,8,268,225,0.839552,2080,264,281,...,3,0.729000,1516.320000,192.45600,195.37200,164.025000,204.849000,1133.595000,1637.406900,4
298,25557,polars,2024,9,258,234,0.906977,2246,281,0,...,2,0.810000,1819.260000,227.61000,208.98000,189.540000,0.000000,439.020000,1572.291000,5
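
The two filters above differ only in whether they keep the head or the tail of each repository's history. One possible way to express both with a single helper, using `groupby(...).head`/`tail` and `cumcount` instead of `apply`, is sketched below. The helper name `take_months`, its `which` parameter, and the toy data are hypothetical and not part of the notebook.

```python
import pandas as pd

def take_months(df, n=6, which="first"):
    """Keep the first or last `n` monthly rows per repo; drop repos with fewer rows."""
    ordered = df.sort_values(["repo_id", "year", "month"])
    # Number of monthly rows available for each row's repository.
    rows_per_repo = ordered["repo_id"].map(ordered["repo_id"].value_counts())
    ordered = ordered[rows_per_repo >= n]
    grouped = ordered.groupby("repo_id")
    picked = (grouped.head(n) if which == "first" else grouped.tail(n)).copy()
    # Sequential x-axis 1..n within each repository.
    picked["x_axis"] = picked.groupby("repo_id").cumcount() + 1
    return picked.reset_index(drop=True)

# Toy data: repo 1 has 8 months of history, repo 2 only 3.
toy = pd.DataFrame({
    "repo_id": [1] * 8 + [2] * 3,
    "year": [2024] * 11,
    "month": list(range(1, 9)) + [1, 2, 3],
})

first = take_months(toy, n=6, which="first")
last = take_months(toy, n=6, which="last")
print(first)
print(last)
```

In this toy example repo 2 is dropped from both results because it has fewer than 6 months of data, which mirrors the empty-DataFrame branch in the functions above.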


Let's plot the exponential decay for the first 6 months.

In [45]:
# Plot exponential decay for all the repos in each individual category for the first 6 months
for category in categories:
    fig = plot_exponential_decay_by_category(first_6m_exp_decay_df, category, title="First 6 Months Exponential Decay")
    fig.show()

Let's plot the exponential decay of the projects in the last 6 months.

In [46]:
# Plot exponential decay for all the repos in each individual category for the last 6 months
for category in categories:
    fig = plot_exponential_decay_by_category(last_6m_exp_decay_df, category, title="Last 6 Months Exponential Decay")
    fig.show()

### Strategy Analysis and Learning Outcomes - Exponential Decay

- The aim of applying exponential decay is to assess repository activity by emphasizing recent contributions over older ones. This better reflects current engagement and lets us track the freshness of activity.
- For each activity metric (e.g., pull requests, commits, issues), we calculated a decay factor based on the time since the action occurred. To account for the relative importance of different contribution types, we combined the decayed values of each metric using predetermined weights.
- Applying exponential decay to activity data proved effective in capturing meaningful trends by reducing the influence of outdated contributions. 
- This strategy helped clarify which repositories show enduring momentum in contributor engagement and support. By dynamically weighing recent actions more heavily, we developed a model that more accurately reflects the ongoing vitality of each project, providing a foundation for identifying promising repositories in future analysis.

Patterns observed from the plots:
- Except for the gen_ai category, whose repositories are recent (2023), most repositories in the other categories show no significant decayed contributions, since the decay factor is computed dynamically from the time elapsed since each action and older activity is therefore heavily discounted.
- The weighted decayed activity ranged from 0-150 for personal_assistants to 0-10k for math repositories. Repositories in the math category are generally well known and heavily used, so we can expect them to have higher decayed activity.
- Repositories such as pytorch and tensorflow in math, sentence-transformers in nlp, transformers and instructlab in llms, and langchain and milvus in gen_ai dominate in terms of decayed contributions over the last 6 months, which indicates that they are receiving more contributions than other repositories in their respective categories.
- Repositories in personal_assistants underperform here, with no significant or consistent contributions.

## Conclusion

In this exploratory analysis, three distinct strategies were applied to understand and evaluate repository activity and growth: **log growth rate**, **Z-score (standardized log growth rate)**, and **exponential decay**. Each strategy offered unique insights into the data, serving different analytical purposes.

**Log Growth Rate:**
- Best for general trend analysis.
- Suitable for high-level comparison of growth rates, providing a broad perspective without focusing on recency or relative performance.
- While this method is effective for spotting general growth trends, it doesn’t allow direct comparison between repositories, as it lacks context for interpreting "above" or "below average" growth.

**Standardized Log Growth Rate (Z-Score):**
- Best for peer comparison and identifying above or below average growth.
- Useful for directly comparing repositories on a relative scale, especially when spotting outliers in growth.
- Z-score standardization does not necessarily show trends in the recency of activity, as it only measures deviations from the group mean without factoring in the timing of contributions.

**Exponential Decay:**
- Best for evaluating current and sustained activity.
- Ideal for identifying actively maintained repositories, emphasizing recency to detect momentum and freshness.
- This approach can underemphasize historically significant activity and potentially undervalue projects with recent decreases in engagement that still hold long-term importance.

Each strategy contributes a unique perspective to the analysis, with log growth rate offering trend smoothing, Z-score providing relative comparison, and exponential decay reflecting recent activity. Together, these strategies allow for a comprehensive view of repository dynamics, covering past trends, relative growth, and current engagement.

Future work could explore other metrics for defining growth rate, for example derivatives or slopes, or could include other types of GitHub actions and contributions.
