# Analyzing OSS Projects under the Apache Software Foundation using GH Archive Data

## Introduction

The Apache Software Foundation (ASF) is a non-profit organization that supports various open-source software projects. The ASF has a large number of projects, and understanding the activity and trends within these projects can provide valuable insights into the open-source ecosystem.
This analysis explores the activity of Apache projects using data from the GitHub Archive (GH Archive). GH Archive provides a record of public GitHub events going back to 2011, which can be used to analyze repository activity over time. The dataset encompasses billions of events such as commits, issues, and pull requests, and presents an opportunity for large-scale analysis of open-source projects.

This report's objectives are:
* Identify key activity metrics derivable from the GH Archive dataset.
* Provide descriptive analytics on these metrics over time for a selection of projects under the ASF.
* Describe the patterns and trends associated with the different phases of an open-source project's lifecycle.
* Develop data-informed heuristics suggesting optimal project adoption and migration points.


## Data Collection and Loading

### BigQuery Extract
The data for this analysis was collected from GH Archive. A copy of this dataset, updated daily, is available on Google BigQuery under the public `githubarchive` project. The data was extracted using the following SQL query:

```sql
SELECT
  type,
  payload,
  repo.name AS repo_name,
  DATE(created_at) AS event_date,
  EXTRACT(YEAR FROM created_at) AS event_year,
  EXTRACT(MONTH FROM created_at) AS event_month
FROM `githubarchive.year.202*`   -- wildcard: yearly tables whose names start with 202
WHERE repo.name LIKE 'apache/%'  -- repositories owned by the apache organization
```
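The per-row logic of the query above can be mirrored in plain Python to check what the filter and the derived columns do. This is only an illustrative sketch: the `extract_row` helper and the sample event are hypothetical, and the event is assumed to be a dictionary with the same `type`, `payload`, `repo.name`, and `created_at` fields that the query reads.

```python
from datetime import datetime, timezone


def extract_row(event):
    """Return the query's output columns for one event, or None if it is filtered out."""
    repo_name = event["repo"]["name"]
    # LIKE 'apache/%' behaves as a case-sensitive prefix match on the repository name.
    if not repo_name.startswith("apache/"):
        return None
    created_at = event["created_at"]
    return {
        "type": event["type"],
        "payload": event["payload"],
        "repo_name": repo_name,
        "event_date": created_at.date(),
        "event_year": created_at.year,
        "event_month": created_at.month,
    }


# Hypothetical sample event; the values are made up for illustration.
sample = {
    "type": "PushEvent",
    "payload": "{}",
    "repo": {"name": "apache/spark"},
    "created_at": datetime(2023, 3, 15, 12, 30, tzinfo=timezone.utc),
}
row = extract_row(sample)
```

Because the match is case-sensitive, a repository name such as `Apache/spark` would be dropped by the same filter.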
A screenshot of the export job's steps is shown below:

![Export Job](./images/bigquery_export.png)

The results of the query were exported as Parquet files to a Google Cloud Storage bucket and then copied to the `/home/ubuntu/lab1/data` folder of the EC2 instance used by the project.

In [2]:
!du -sh /home/ubuntu/lab1/data

22G	/home/ubuntu/lab1/data


### Loading Data

In [12]:
from pyspark.sql import SparkSession
spark = (SparkSession
     .builder
     .master('local[*]')
     .config("spark.sql.execution.arrow.pyspark.enabled", "true")
     # .config("spark.sql.parquet.columnarReaderBatchSize", "1024")
     # .config("spark.sql.parquet.enableVectorizedReader", "false")
     .getOrCreate())

In [13]:
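from pyspark.sql.types import (BooleanType, LongType, TimestampNTZType, StringType, StructType, StructField)

# Parquet files embed their own schema, so Spark reads it from the file metadata.
# `inferSchema` is a CSV/JSON option and `parquet()` takes no `schema` argument;
# to enforce an explicit schema instead, use spark.read.schema(<StructType>).parquet(...).
data = spark.read.parquet('./data/*.parquet')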

                                                                                

In [14]:
data.printSchema()

root
 |-- type: string (nullable = true)
 |-- payload: string (nullable = true)
 |-- repo_name: string (nullable = true)
 |-- event_date: date (nullable = true)
 |-- event_year: long (nullable = true)
 |-- event_month: long (nullable = true)
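
Note that `payload` is stored as a string: it holds the event's JSON document and has to be parsed before individual fields can be used. As a minimal sketch of how activity metrics could be derived from rows with this schema, the plain-Python example below counts events per repository, month, and type, and tallies the `action` field of pull request payloads. The sample rows and both helper functions are hypothetical and only illustrate the shape of the computation, not the analysis actually run in this report.

```python
import json
from collections import Counter

# Hypothetical sample rows shaped like the schema above; values are made up.
rows = [
    {"type": "PushEvent", "payload": '{"size": 3}',
     "repo_name": "apache/spark", "event_year": 2023, "event_month": 1},
    {"type": "PullRequestEvent", "payload": '{"action": "opened"}',
     "repo_name": "apache/spark", "event_year": 2023, "event_month": 1},
    {"type": "PullRequestEvent", "payload": '{"action": "closed"}',
     "repo_name": "apache/spark", "event_year": 2023, "event_month": 2},
]


def monthly_event_counts(rows):
    """Count events per (repo_name, event_year, event_month, type)."""
    counts = Counter()
    for row in rows:
        key = (row["repo_name"], row["event_year"], row["event_month"], row["type"])
        counts[key] += 1
    return counts


def pull_request_actions(rows):
    """Tally the `action` field found in the JSON payload of pull request events."""
    actions = Counter()
    for row in rows:
        if row["type"] == "PullRequestEvent":
            payload = json.loads(row["payload"])  # payload is a JSON-encoded string
            actions[payload.get("action", "unknown")] += 1
    return actions


counts = monthly_event_counts(rows)
actions = pull_request_actions(rows)
```

The same grouping could be expressed on the Spark DataFrame with a `groupBy` over the four key columns followed by a count.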



### Restricting the Analysis to the 10 Most Active ASF Repositories

Due to the size of the data,