# Acquiring Repository URLs

This notebook describes the process used to acquire the URLs for the repositories used in this project. Note that this script was executed on May 16, 2022 and may produce different results at a later date. A .csv file containing the URLs acquired for this project is provided with the project repository so that our results are reproducible.

---

## The Required Imports

Below we'll import everything needed to run this script.

In [1]:
import requests
from env import github_username, github_token

import pandas as pd

---

## Sending the API Request

We can use the GitHub API to acquire the URLs we need for this project. We're interested in the URLs that would be returned when searching for the term "bitcoin" on GitHub. We also want a reasonable amount of data to work with, but not too much, because acquiring the data can be a time-consuming process. For our purposes we'll acquire 500 URLs.

A GitHub account and a GitHub API token are needed to run this script. The GitHub username and API token can be saved in an env.py file. Instructions for setting up and using the GitHub API can be found [here](https://docs.github.com/en/rest).

In [2]:
# Acquire 100 items from GitHub using the search term "bitcoin"

headers = {'Authorization' : f'token {github_token}', 'user-agent' : f'{github_username}'}
data = requests.get('https://api.github.com/search/repositories?q=bitcoin&per_page=100', headers = headers)
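
The query string in the request above is written out by hand. As a minimal sketch, the same URL can be assembled from a dictionary of parameters, which makes it easier to change the search term, page size, or page number later. The helper name `build_search_url` and the constant `SEARCH_ENDPOINT` are hypothetical and not part of this project; the sketch relies only on the standard library's `urllib.parse.urlencode`.

```python
from urllib.parse import urlencode

# Endpoint taken from the request above.
SEARCH_ENDPOINT = 'https://api.github.com/search/repositories'

def build_search_url(term, per_page=100, page=1):
    """Return the repository search URL for a term (hypothetical helper)."""
    params = {'q': term, 'per_page': per_page, 'page': page}
    # urlencode escapes characters such as spaces in the search term.
    return f'{SEARCH_ENDPOINT}?{urlencode(params)}'

print(build_search_url('bitcoin'))
```

Alternatively, `requests.get` accepts a `params` argument, so the same dictionary could be passed directly instead of building the string first.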

---

## Exploring the Data

Now that we have the API response, let's explore the data to determine how we can acquire the information we need.

In [3]:
# Convert the response to json format

data = data.json()

In [4]:
# Look at the keys in the data dictionary

data.keys()

dict_keys(['total_count', 'incomplete_results', 'items'])

In [5]:
# Let's see what total_count is. This should be the total results for the search query.

data['total_count']

57654

In [6]:
# Let's grab the URL from the first item returned.

data['items'][0]['html_url']

'https://github.com/bitcoin/bitcoin'

In [7]:
# Let's see the size of the data dictionary.

len(data['items'])

100

Now we know how to grab the information we need, and we know that each API response will give us 100 items (GitHub returns a maximum of 100 items per page of search results).
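
Since each response holds at most 100 items, the number of requests needed for a given target follows from a ceiling division. The sketch below illustrates that arithmetic; `pages_needed` is a hypothetical helper name, not something defined in this project.

```python
import math

def pages_needed(n_items, per_page=100):
    """Number of requests needed to collect n_items (hypothetical helper)."""
    # Round up so that a partial last page still counts as one request.
    return math.ceil(n_items / per_page)

print(pages_needed(500))
```

For the 500 URLs targeted here this gives 5 requests. Keep in mind that the GitHub documentation describes the Search API as returning up to 1,000 results per query, so a search with a very large `total_count` cannot be paged through in full.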

---

## Grabbing the URLs For All Items

Now that we know how to get the information we need, let's put everything into a loop so that we can grab the URLs for all items.

In [8]:
# Use a loop to grab the URL for each item. The URL must have the base domain removed in order to work with 
# the acquire.py script that is provided with this project.

for item in data['items']:
    print(item['html_url'].replace('https://github.com/', ''))

bitcoin/bitcoin
bitcoinbook/bitcoinbook
bitcoin/bips
bitcoinjs/bitcoinjs-lib
spesmilo/electrum
bitcoin-wallet/bitcoin-wallet
etotheipi/BitcoinArmory
bitcoin-dot-org/Bitcoin.org
jgarzik/cpuminer
BitcoinExchangeFH/BitcoinExchangeFH
maxme/bitcoin-arbitrage
yenom/BitcoinKit
BitcoinUnlimited/BitcoinUnlimited
Bitcoin-ABC/bitcoin-abc
bisq-network/bisq
mobnetic/BitcoinChecker
bitcoin-abe/bitcoin-abe
petertodd/python-bitcoinlib
sipa/bitcoin-seeder
imfly/bitcoin-on-nodejs
PiSimo/BitcoinForecast
trottier/original-bitcoin
rust-bitcoin/rust-bitcoin
Bit-Wasp/bitcoin-php
btcpayserver/btcpayserver
bitcoin-core/bitcoincore.org
lian/bitcoin-ruby
GammaGao/bitcoinwhitepaper
tianmingyun/MasterBitcoin2CN
kylemanna/docker-bitcoind
pointbiz/bitaddress.org
BTCPrivate/BitcoinPrivate-legacy
jgarzik/python-bitcoinrpc
pooler/cpuminer
cryptean/bitcoinlib
HelloZeroNet/ZeroNet
progranism/Open-Source-FPGA-Bitcoin-Miner
xiaolai/bitcoin-whitepaper-chinese-translation
oleganza/CoreBitcoin
bitcoin-sv/bitcoin-sv
m0mchil/po
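
The loop above removes the base domain with `str.replace`, which would also alter any later occurrence of that string in a URL. As a hedged alternative, the sketch below strips the prefix only when it appears at the start. It also assumes that each search item carries a `full_name` field holding `owner/repo`; that field was not inspected in this notebook, so the code falls back to `html_url` when it is absent. The name `repo_path` is hypothetical.

```python
GITHUB_PREFIX = 'https://github.com/'

def repo_path(item):
    """Return 'owner/repo' for one search result item (hypothetical helper)."""
    # Assumption: items may include a 'full_name' field with the owner/repo path.
    if 'full_name' in item:
        return item['full_name']
    url = item['html_url']
    # Strip the base domain only when the URL actually starts with it.
    return url[len(GITHUB_PREFIX):] if url.startswith(GITHUB_PREFIX) else url

sample = {'html_url': 'https://github.com/bitcoin/bitcoin'}
print(repo_path(sample))
```

With the sample item this prints `bitcoin/bitcoin`, the same form the acquire.py script expects.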

In [9]:
# Now let's put all of this into a pandas series. Having the data in a series will make it easy to cache the data.

urls = pd.Series([item['html_url'].replace('https://github.com/', '') for item in data['items']])
urls.head()

0            bitcoin/bitcoin
1    bitcoinbook/bitcoinbook
2               bitcoin/bips
3    bitcoinjs/bitcoinjs-lib
4          spesmilo/electrum
dtype: object

---

## Acquire 500 URLs

Now that we know how to acquire 100 items, we need to add code so that we can acquire 500 items. We'll need to make 5 API requests to achieve this.

In [14]:
# We'll need to add the page parameter to the API request and we'll loop through pages 1 - 5.
# We'll store all the urls in a list, combine them all, and turn the list into a single-column dataframe.

urls = []

for page in range(1, 6):
    
    data = requests.get(
        f'https://api.github.com/search/repositories?q=bitcoin&per_page=100&page={page}',
        headers = headers
    ).json()
    
    urls.extend([item['html_url'].replace('https://github.com/', '') for item in data['items']])
    
urls = pd.DataFrame(urls, columns = ['URL'])
urls.shape

(500, 1)
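
The loop above assumes every response contains an `items` key. If a request fails (for example when a rate limit is reached), the API typically returns an error payload without that key, and the loop would stop with a `KeyError`. The sketch below shows one way to make that failure explicit. It is an illustration under stated assumptions rather than the method used for this project: `collect_repo_paths` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real API so the example runs without a token.

```python
def collect_repo_paths(fetch_page, n_pages=5):
    """Gather 'owner/repo' strings across pages.

    fetch_page is any callable that takes a page number and returns the
    parsed JSON dictionary for that page (hypothetical interface).
    """
    paths = []
    for page in range(1, n_pages + 1):
        data = fetch_page(page)
        # An error payload is assumed to lack the 'items' key.
        items = data.get('items')
        if items is None:
            raise RuntimeError(f'page {page} returned no items: {data}')
        paths.extend(item['html_url'].replace('https://github.com/', '') for item in items)
    return paths

# Offline stand-in for the API: 100 made-up items per page.
def fake_fetch(page):
    return {'items': [{'html_url': f'https://github.com/owner/repo-{page}-{i}'}
                      for i in range(100)]}

print(len(collect_repo_paths(fake_fetch)))
```

With the real API, `fetch_page` could be a small function that wraps the `requests.get(...).json()` call from the cell above.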

In [15]:
# Now let's cache this data in a .csv file

urls.to_csv('urls.csv', index = False)

In [16]:
# Let's read in the .csv file to make sure it worked

urls = pd.read_csv('urls.csv')
urls.shape

(500, 1)

In [17]:
urls.head()

                       URL
0          bitcoin/bitcoin
1  bitcoinbook/bitcoinbook
2             bitcoin/bips
3  bitcoinjs/bitcoinjs-lib
4        spesmilo/electrum


Everything looks good, and the URLs we'll be using are now cached.