# Data collection

In this notebook we collect README.md files through the GitHub API and save them locally to be analyzed in another notebook.

## Setup

### Installing packages

In [2]:
# Install a pip package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install markdown



### Importing
> External libraries used in the notebook

- [Requests](http://docs.python-requests.org/en/master/): library to make HTTP requests.
- [Regular Expressions](https://docs.python.org/3/library/re.html): library to operate on strings using regex.
- [Markdown](https://python-markdown.github.io/): library to convert Markdown to HTML.

In [17]:
import requests as rq
import re
import markdown
import math
import json
from functools import reduce

### Requests and Downloads

#### 1. fetchReadmeURL
> Query the GitHub API for the download URL of the README.md file from the specified repository.

**Parameters:** 
- `repoOwner`: name of the repository owner.
- `repoName`: name of the repository.

**Return:**
- download URL of README.md file.

In [4]:
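def fetchReadmeURL(repoOwner, repoName):
    # Endpoint that returns metadata about the repository's README
    baseURL = 'https://api.github.com/repos'
    requestURL = f'{baseURL}/{repoOwner}/{repoName}/readme'
    
    # The JSON metadata includes a direct link to the raw file
    responseJSON = rq.get(requestURL).json()
    readmeURL = responseJSON['download_url']
    
    return readmeURL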
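
When a repository has no README, or the request is rate limited, the API answers with an error payload that has a `message` key and no `download_url`, so indexing the key directly raises a `KeyError`. A minimal sketch of a more defensive extraction step follows. `extractDownloadURL` is a hypothetical helper, and the sample payloads are made up and reduced to the keys that matter here.

```python
def extractDownloadURL(payload):
    """Return the README download URL, or None when the payload has none."""
    # Error payloads carry "message" instead of "download_url",
    # so use .get() rather than indexing the key directly.
    return payload.get("download_url")


# Hypothetical payloads, reduced to the relevant keys
found = {"name": "README.md", "download_url": "https://example.com/README.md"}
missing = {"message": "Not Found"}

print(extractDownloadURL(found))    # https://example.com/README.md
print(extractDownloadURL(missing))  # None
```

Returning `None` lets the caller skip repositories without a README instead of stopping the whole collection run.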

#### 2. downloadReadme
> Download the README file and save it under the specified filename in the `data/READMES` folder.

**Parameters:** 
- `readmeURL`: download URL of README.md file.
- `filename`: name given to the file once downloaded.

**Side-effect:**
- new `filename.md` file saved in the `data/READMES` folder.

In [5]:
def downloadReadme(readmeURL, filename):
    # IPython shell escape: run curl inside data/READMES (the folder must already exist)
    !cd data/READMES && curl -o {filename + '.md'} {readmeURL} && cd -
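
The shell escape only runs inside IPython and depends on `curl` being installed. As a sketch of a pure-Python alternative, the version below swaps curl for the standard library's `urllib.request`. `downloadReadmePython` and its `folder` parameter are hypothetical names, not part of the notebook.

```python
from pathlib import Path
from urllib.request import urlopen


def downloadReadmePython(readmeURL, filename, folder="data/READMES"):
    """Download readmeURL and save it as <folder>/<filename>.md."""
    target = Path(folder)
    # Create the folder if needed, unlike the curl version above
    target.mkdir(parents=True, exist_ok=True)
    destination = target / f"{filename}.md"
    with urlopen(readmeURL) as response:
        destination.write_bytes(response.read())
    return destination
```

The function returns the path it wrote, which makes it easy to log or check which files were saved.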

#### 3. fetchRepositories
> Fetch a number of repositories that match the query parameters. Pagination may occur.

**Parameters:** 
- `parameters`: Search keywords and qualifiers (ex: "language:swift"). Qualifiers allow you to limit your search to specific areas of GitHub.
- `sort`: Parameter to sort the query (ex: "stars")
- `numResults`: Number of repositories wanted

**Return:**
- A tuple with the list of responses (one per request) and the list of query URLs that were requested.

In [29]:
gitHubPageLimit = 100

def pagesCount(numResults):
    # Build one page size per request: full pages first
    pageCount = []
    for x in range(math.floor(numResults / gitHubPageLimit)):
        pageCount.append(gitHubPageLimit)
    
    # Then a smaller last page for whatever is left over
    remainder = numResults % gitHubPageLimit
    if remainder != 0:
        pageCount.append(remainder)
        
    return pageCount

def assembleRepositoryQuery(parameters, sort, pageNumber, perPage):
    baseURL = "https://api.github.com/search/repositories?q="
    baseURL += parameters + "+sort:" + sort + "&per_page=" + str(perPage) + "&page=" + str(pageNumber)
    return baseURL

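def fetchRepositories(parameters, sort, numResults):
    pages = pagesCount(numResults)

    repositories = []
    queries = []
    # GitHub pages are 1-indexed and the offset is derived from per_page, so a
    # smaller last page only lines up when numResults is at most
    # gitHubPageLimit or a multiple of it.
    for pageNumber, perPage in enumerate(pages):
        queryURL = assembleRepositoryQuery(parameters, sort, pageNumber + 1, perPage)
        queries.append(queryURL)
        repositories.append(rq.get(queryURL))
        
    return (repositories, queries)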
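
Because GitHub derives the page offset from `per_page`, one way to request more than one page safely is to keep the page size constant and trim the surplus afterwards. The sketch below shows that alternative, which is not what the notebook does. `pagePlan` and `trimItems` are hypothetical helpers.

```python
import math

PAGE_LIMIT = 100  # largest per_page value the GitHub API accepts


def pagePlan(numResults, pageLimit=PAGE_LIMIT):
    """Return (pageNumber, perPage) pairs that use one constant page size."""
    if numResults <= 0:
        return []
    perPage = min(numResults, pageLimit)
    pageTotal = math.ceil(numResults / perPage)
    return [(page, perPage) for page in range(1, pageTotal + 1)]


def trimItems(items, numResults):
    """Drop the surplus items fetched by the last full-size page."""
    return items[:numResults]


print(pagePlan(250))  # [(1, 100), (2, 100), (3, 100)]
print(pagePlan(30))   # [(1, 30)]
print(len(trimItems(list(range(300)), 250)))  # 250
```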

In [7]:
def saveJSONFile(filename, content):
    # The context manager closes the file even if writing fails
    with open(f'data/{filename}.json', "w") as file:
        json.dump(content, file)
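
The analysis notebook will need to read these files back. Below is a minimal sketch of a matching loader, assuming the same `<folder>/<filename>.json` layout. `loadJSONFile` and its `folder` parameter are hypothetical and do not appear in the notebook.

```python
import json
from pathlib import Path


def loadJSONFile(filename, folder="data"):
    """Read back a file stored as <folder>/<filename>.json."""
    with open(Path(folder) / f"{filename}.json", encoding="utf-8") as file:
        return json.load(file)
```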


## API Queries

The GitHub API offers several ways of querying for projects. In this section we make some specific queries and some generic ones, to compare the results later.

The following parameters can be of interest for the queries:
- Programming language
- Topic
- Project type (framework, library, app, list)¹
- Domain (music, real estate, networking)¹
- Textual

The following parameters can be used to **sort/filter** the results:
- Stars
- Forks
- Creation date
- Last updated

¹ Would require a novel approach and further testing to prove its efficiency.
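
The parameters that GitHub search already supports combine into a single search string, with keywords and `qualifier:value` pairs joined by `+`. Below is a sketch of a small builder for such strings. `buildQuery` is a hypothetical helper, and it covers only the `language` and `topic` qualifiers used in this notebook.

```python
def buildQuery(keywords="", language=None, topic=None):
    """Join free-text keywords and qualifiers into a GitHub search string."""
    parts = []
    if keywords:
        parts.append(keywords)
    if language:
        parts.append(f"language:{language}")
    if topic:
        parts.append(f"topic:{topic}")
    return "+".join(parts)


print(buildQuery(language="Swift"))                          # language:Swift
print(buildQuery("library", language="Swift", topic="iOS"))  # library+language:Swift+topic:iOS
```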

In [37]:
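def jsonResponses(responses, urls):
    # Named `result` so the imported `json` module is not shadowed
    result = {}
    # Concatenate the "items" list of every page into a single list
    result["repos"] = reduce(lambda accum, response: accum + response.json()["items"], responses, [])
    result["urls"] = urls
    return result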

query = ""
language = "Swift"
topic = "iOS"
libraryQuery = "library"
sort = "stars"
numberRepos = 100

languageQuery = f'language:{language}'
languageTopicQuery = f'language:{language}+topic:{topic}'
typeLanguageTopicQuery = f'{libraryQuery}+language:{language}+topic:{topic}'
typeLanguageQuery = f'{libraryQuery}+language:{language}'

queries = [languageQuery, languageTopicQuery, typeLanguageTopicQuery, typeLanguageQuery]
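
`jsonResponses` merges the `items` list of every page into one list. The sketch below shows the same flattening with `itertools.chain` on stand-in objects, so it runs without network access. `FakeResponse` is a hypothetical substitute for `requests.Response`, and the payloads are made up.

```python
from itertools import chain


class FakeResponse:
    """Hypothetical stand-in exposing only the .json() method used here."""

    def __init__(self, payload):
        self._payload = payload

    def json(self):
        return self._payload


def flattenItems(responses):
    """Concatenate the "items" list of each page, preserving page order."""
    return list(chain.from_iterable(r.json()["items"] for r in responses))


pages = [
    FakeResponse({"items": [{"full_name": "a/one"}, {"full_name": "b/two"}]}),
    FakeResponse({"items": [{"full_name": "c/three"}]}),
]
print([item["full_name"] for item in flattenItems(pages)])
# ['a/one', 'b/two', 'c/three']
```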

In [38]:
for query in queries:
    (queryResponses, urls) = fetchRepositories(query, sort, numberRepos)
    print(urls)
    saveJSONFile(query, jsonResponses(queryResponses, urls))

['https://api.github.com/search/repositories?q=language:Swift+sort:stars&per_page=100&page=1']
['https://api.github.com/search/repositories?q=language:Swift+topic:iOS+sort:stars&per_page=100&page=1']
['https://api.github.com/search/repositories?q=library+language:Swift+topic:iOS+sort:stars&per_page=100&page=1']
['https://api.github.com/search/repositories?q=library+language:Swift+sort:stars&per_page=100&page=1']
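
Each repository in a search result carries a `full_name` field in `owner/name` form, which supplies the two arguments `fetchReadmeURL` needs. Below is a sketch of that split. `ownerAndName` is a hypothetical helper, and the sample item is made up and keeps only the one field used.

```python
def ownerAndName(item):
    """Split a search result's full_name into (owner, name)."""
    owner, name = item["full_name"].split("/", 1)
    return owner, name


# Hypothetical search result, reduced to the one field used
item = {"full_name": "octocat/Hello-World"}
print(ownerAndName(item))  # ('octocat', 'Hello-World')
```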


### Generic (stars)

In [None]:
# fetchRepositories takes (parameters, sort, numResults) and returns (responses, urls)
(queryResponses, urls) = fetchRepositories(languageQuery, sort, numberRepos)

saveJSONFile(f'{libraryQuery}-{language}-{sort}-{numberRepos}', jsonResponses(queryResponses, urls))

In [14]:
(responses, urls) = fetchRepositories(query, sort, numberRepos)
responses

[<Response [200]>]