# PGA files study

In this notebook, we'll compute some descriptive statistics, not on the PGA _index_ file, but on the siva files themselves. We want to understand the mean/min/max repository size, the number of blobs per repository, the average contribution per user, etc.

The steps I intend to complete to reach the results are:

1. Download siva files (following [PGA documentation](https://github.com/src-d/datasets/tree/master/PublicGitArchive/pga))
2. Extract siva files (following [siva documentation](https://github.com/src-d/go-siva))
3. Query repos using gitbase (following [source{d} documentation](https://docs.sourced.tech/intro/#analyzing-git-repositories))


## Step 1 - Download siva files

We will use the terminal to download the relevant siva files. I will follow the same criterion as in the study of the PGA index, so we are only interested in repos that have **Jupyter Notebook** files.

To understand how to install PGA and its commands, please follow [the documentation](https://github.com/src-d/datasets/tree/master/PublicGitArchive).

```bash
$ pga list --lang "Jupyter Notebook" -f csv > repos_jupyter.csv
```


This gives us a csv file with each repo's URL, the siva filenames for the repo, its languages, and much more information. Let's see what the csv looks like.

_**NOTE**: since the pga export doesn't include headers in the first row, I added them manually to the pandas dataframe using the ones from the index, as they are the same._

In [135]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("repos_jupyter.csv", names=["URL", "SIVA_FILENAMES", "FILE_COUNT", "LANGS", "LANGS_BYTE_COUNT", "LANGS_LINES_COUNT", "LANGS_FILES_COUNT", "COMMITS_COUNT", "BRANCHES_COUNT", "FORK_COUNT", "EMPTY_LINES_COUNT", "CODE_LINES_COUNT", "COMMENT_LINES_COUNT", "LICENSE"])

df.describe()

| | FILE_COUNT | COMMITS_COUNT | BRANCHES_COUNT | FORK_COUNT |
|---|---|---|---|---|
| count | 2606.0 | 2606.0 | 2606.0 | 2606.0 |
| mean | 471.105909 | 851.153108 | 142.298542 | 1.209133 |
| std | 2274.21056 | 3229.955311 | 677.735284 | 7.520236 |
| min | 1.0 | 1.0 | 2.0 | 0.0 |
| 25% | 25.0 | 27.0 | 3.0 | 0.0 |
| 50% | 73.0 | 95.0 | 9.0 | 0.0 |
| 75% | 257.75 | 415.0 | 48.0 | 0.0 |
| max | 70056.0 | 79104.0 | 14709.0 | 67.0 |


As we can see above (the `count` row), PGA lists 2,606 repos that have Jupyter Notebook files, in line with what we saw when analyzing the index file (PGA index study notebook).

Now let's see how many siva files there are.

Typically, one repo corresponds to one siva file, but there can be more than one siva file per repo if it has completely independent branches.

To get this number, we'll examine column B, which corresponds to "SIVA_FILENAMES".

In [136]:
count_sivafiles = 0

for row in df['SIVA_FILENAMES']:
    num_sivafiles = row.split(",")
    count_sivafiles += len(num_sivafiles)

count_sivafiles

6349
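The same total can also be computed without an explicit loop by using pandas string methods. The sketch below runs on a small made-up dataframe (the URLs and filenames are placeholders, not PGA data), so it can be tried without the csv:

```python
import pandas as pd

# Toy stand-in for the real dataframe; URLs and filenames are made up.
toy = pd.DataFrame({
    "URL": [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/c",
    ],
    "SIVA_FILENAMES": [
        "aa.siva",
        "bb.siva,cc.siva",
        "dd.siva,ee.siva,ff.siva",
    ],
})

# Split each comma-separated string into a list, then take each list's length.
siva_per_repo = toy["SIVA_FILENAMES"].str.split(",").str.len()

# Summing the per-repo counts gives the total number of siva files.
total_sivafiles = int(siva_per_repo.sum())
print(siva_per_repo.tolist(), total_sivafiles)
```

On the real data, replacing `toy` with `df` should give the same number as the loop above.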

Hm, this is odd. There are 2,606 repos, but 6,349 siva files. This is not normal at all.

Let's understand what's going on.

In [137]:
# We'll create a list that will keep the number of siva files per repo. The length of this list will be 2,606.

list_sivafiles = []

for row in df['SIVA_FILENAMES']:
    num_sivafiles = row.split(",")
    list_sivafiles.append(len(num_sivafiles))

# Checking what's the average number of siva files per repo and the standard deviation

import statistics
average = round(statistics.mean(list_sivafiles), 2)
stdev = round(statistics.stdev(list_sivafiles), 2)

print("The average number of siva files is", average, "with a standard deviation of", stdev)


The average number of siva files is 2.44 with a standard deviation of 64.53


We can see that the data is distorted by some anomaly: a standard deviation this large next to a mean of only 2.44 suggests a few extreme values. Let's keep digging and count the number of repos for each siva file count.

In [138]:
from collections import Counter

counter_files = Counter(list_sivafiles)

print("There are:")
for c in counter_files:
    print("-", counter_files[c], "repo(s) with", c, "siva files")

There are:
- 2249 repo(s) with 1 siva files
- 301 repo(s) with 2 siva files
- 40 repo(s) with 3 siva files
- 9 repo(s) with 4 siva files
- 4 repo(s) with 5 siva files
- 1 repo(s) with 6 siva files
- 1 repo(s) with 21 siva files
- 1 repo(s) with 3295 siva files
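To see how much a single repo distorts the statistics, here is a small sketch that rebuilds the per-repo list from the counts printed above and compares the mean, standard deviation, and median with and without the largest value. The helper name `summarize` is mine, not part of the original notebook.

```python
import statistics

# Number of repos for each siva-file count, copied from the output above.
repos_by_siva_count = {1: 2249, 2: 301, 3: 40, 4: 9, 5: 4, 6: 1, 21: 1, 3295: 1}

# Rebuild the per-repo list: one entry per repo.
siva_counts = [
    n for n, repos in repos_by_siva_count.items() for _ in range(repos)
]

def summarize(values):
    """Return rounded mean and stdev plus the median of a list of counts."""
    return {
        "mean": round(statistics.mean(values), 2),
        "stdev": round(statistics.stdev(values), 2),
        "median": statistics.median(values),
    }

with_outlier = summarize(siva_counts)
without_outlier = summarize([n for n in siva_counts if n != 3295])

print("with outlier:   ", with_outlier)
print("without outlier:", without_outlier)
```

The median stays at 1 in both cases, while the mean drops from 2.44 to about 1.17 once the largest value is removed, which is what we would expect when one extreme value dominates.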


WOWWWW, found the anomaly! There is **ONE** repo with 3,295 siva files.

Now we need to see which repo this is.

In [139]:
anomaly_sivafiles = max(list_sivafiles)

wheres_the_anomaly = [i for i, x in enumerate(list_sivafiles) if x == anomaly_sivafiles][0]

print(df.iloc[wheres_the_anomaly]['URL'])

https://github.com/google/skia-buildbot
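As a more compact alternative, pandas can locate the row with the largest count directly through `idxmax`. This is only a sketch on made-up data (placeholder URLs and filenames), not the cell that produced the output above:

```python
import pandas as pd

# Toy stand-in for the real dataframe; URLs and filenames are made up.
toy = pd.DataFrame({
    "URL": [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/c",
    ],
    "SIVA_FILENAMES": [
        "aa.siva",
        "bb.siva,cc.siva",
        "dd.siva,ee.siva,ff.siva",
    ],
})

# Number of siva files per repo, as a Series aligned with the dataframe index.
siva_per_repo = toy["SIVA_FILENAMES"].str.split(",").str.len()

# idxmax returns the index label of the first maximum value.
outlier_url = toy.loc[siva_per_repo.idxmax(), "URL"]
print(outlier_url)
```

Note that `idxmax` returns an index label, so it pairs with `.loc` rather than `.iloc`.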


So, the repo [https://github.com/google/skia-buildbot](https://github.com/google/skia-buildbot) is responsible for 3,295 siva files.

For the sake of our current study, we'll leave this repo aside, since it's a HUGE outlier. 

The new dataframe can be redefined as:

In [141]:
df = df[df['URL'] != 'https://github.com/google/skia-buildbot']

To download the siva files corresponding to these repos we'll run the following command on the terminal:

```bash
$ pga list -l "Jupyter Notebook" -f json | jq -r 'select(.url != "https://github.com/google/skia-buildbot") | .sivaFilenames[]' | pga get -i -o jupyter_siva_files
```

What are we doing here?

1. using the `pga list` command to list the repos that have Jupyter Notebook files, with the result printed as JSON (**stdout**)
2. that will be the input (**stdin**) for the `jq` command (read further [here](https://stedolan.github.io/jq/)), which we use to filter out the URL of the anomaly repo and output the siva filenames (**stdout** again) that
3. will serve as the input for the `pga get -i` command, which will then download the files.
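For readers less familiar with `jq`, the filtering step can be sketched in Python. This assumes the JSON export has one object per line with the `url` and `sivaFilenames` fields that the `jq` filter reads; the sample records are made up, and the function name `siva_filenames` is hypothetical.

```python
import json

ANOMALY_URL = "https://github.com/google/skia-buildbot"

def siva_filenames(json_lines, excluded_url=ANOMALY_URL):
    """Yield siva filenames for every repo whose url is not excluded_url."""
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        repo = json.loads(line)
        if repo["url"] != excluded_url:
            # Equivalent of `.sivaFilenames[]`: emit one filename at a time.
            yield from repo["sivaFilenames"]

# Made-up records shaped like the fields the jq filter reads.
sample = [
    '{"url": "https://example.com/a", "sivaFilenames": ["aa.siva"]}',
    '{"url": "https://github.com/google/skia-buildbot", "sivaFilenames": ["bb.siva", "cc.siva"]}',
    '{"url": "https://example.com/b", "sivaFilenames": ["dd.siva", "ee.siva"]}',
]

filenames = list(siva_filenames(sample))
print(filenames)
```

Unlike `jq`, this sketch does not handle objects that span several lines, so it is meant as an illustration of the logic rather than a drop-in replacement.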

Now we have a much more reasonable outcome, since we're downloading 3,054 files (6,349 minus the 3,295 of the anomaly repo). Sit back and relax, because this will take some time to download. (Or go watch a movie or study something else: the download will take hours, I'm telling you in advance :) )
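Since the download runs for hours, it can be worth checking afterwards that nothing is missing. The sketch below assumes that the names in `SIVA_FILENAMES` match the file names on disk and that they end in `.siva`; it makes no assumption about the folder layout inside the output directory, because it searches recursively. The function name `missing_siva_files` is hypothetical, and the demo uses a temporary directory instead of the real download.

```python
import tempfile
from pathlib import Path

def missing_siva_files(expected_names, download_dir):
    """Return the expected filenames not found anywhere under download_dir."""
    found = {p.name for p in Path(download_dir).rglob("*.siva")}
    return sorted(set(expected_names) - found)

# Demo on a temporary directory with one nested, empty placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "nested"
    nested.mkdir()
    (nested / "aa.siva").touch()
    missing = missing_siva_files(["aa.siva", "bb.siva"], tmp)

print(missing)
```

With the real data, `expected_names` could be built by splitting every entry of `df['SIVA_FILENAMES']` on commas, and `download_dir` would be `jupyter_siva_files`.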

## Step 2 - Extract siva files

We will follow [go siva documentation](https://github.com/src-d/go-siva) to understand how to (1) install the tool and (2) extract siva files.

[EDIT] We actually won't need to extract siva files, since Gitbase is able to read and query siva files!
![celebrate](https://media1.giphy.com/media/YTbZzCkRQCEJa/200.webp?cid=3640f6095bc4ab7034554b4f59a77afd)

## Step 3 - Query repos using gitbase

We will use [source{d} engine](https://docs.sourced.tech/engine#quickstart) from now on. The engine enables us to query the repos we just downloaded.
