# Data collection

In this notebook we collect README.md files through the GitHub API and save them locally, to be analyzed in another notebook.

## Setup

### Installing packages

In [2]:
# Install a pip package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install markdown



### Importing
> Libraries used in the notebook

- [Requests](http://docs.python-requests.org/en/master/): library to make HTTP requests.
- [Regular Expressions](https://docs.python.org/3/library/re.html): standard library module to operate on strings using regex.
- [Markdown](https://python-markdown.github.io/): library to convert Markdown to HTML.

In [49]:
import requests as rq
import re
import markdown
import math
import json
from functools import reduce
import os

### Requests and Downloads

#### 1. fetchRepositories
> Fetch a number of repositories that match the query parameters. Results are paginated when more than 100 are requested.

**Parameters:** 
- `parameters`: The search query, containing one or more search keywords and qualifiers. Qualifiers limit the search to specific areas of GitHub (ex: "language:swift").
- `sort`: Parameter to sort the query (ex: "stars")
- `numResults`: Number of repositories wanted

**Return:**
- A tuple with the list of responses (one per request) and the list of URLs requested.

In [29]:
# Maximum number of results the GitHub API returns per page
gitHubPageLimit = 100

def pagesCount(numResults):
    # Split numResults into page sizes: full pages of gitHubPageLimit, plus the remainder
    pageCount = [gitHubPageLimit] * (numResults // gitHubPageLimit)

    remainder = numResults % gitHubPageLimit
    if remainder != 0:
        pageCount.append(remainder)

    return pageCount

def assembleRepositoryQuery(parameters, sort, pageNumber, perPage):
    # Build the search URL; the sort criterion is passed as a qualifier inside the query
    baseURL = "https://api.github.com/search/repositories?q="
    baseURL += parameters + "+sort:" + sort + "&per_page=" + str(perPage) + "&page=" + str(pageNumber)
    return baseURL

def fetchRepositories(parameters, sort, numResults):
    pages = pagesCount(numResults)

    repositories = []
    queries = []
    # Pages are 1-indexed in the GitHub API
    for pageNumber, perPage in enumerate(pages):
        queryURL = assembleRepositoryQuery(parameters, sort, pageNumber + 1, perPage)
        queries.append(queryURL)
        repositories.append(rq.get(queryURL))

    # Return the raw responses together with the URLs that produced them
    return (repositories, queries)
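
A caveat worth checking when `numResults` is not a multiple of 100: with page-based pagination the offset is normally derived from `page` and `per_page` together, so requesting the last page with a smaller `per_page` would presumably return an earlier slice rather than the remaining items. The sketch below shows one possible alternative, assuming every request uses the same page size and the surplus is trimmed afterwards. `planPages` and `trimItems` are hypothetical helpers, not part of the notebook's code, and no request is made.

```python
import math

GITHUB_PAGE_LIMIT = 100  # assumed maximum value of per_page


def planPages(numResults, perPage=GITHUB_PAGE_LIMIT):
    """Return the 1-based page numbers needed to cover numResults items."""
    if numResults <= 0:
        return []
    return list(range(1, math.ceil(numResults / perPage) + 1))


def trimItems(items, numResults):
    """Drop the surplus items fetched on the last page."""
    return items[:numResults]


# 250 results need three full-size pages; the surplus is trimmed afterwards.
pages = planPages(250)
fetched = list(range(len(pages) * GITHUB_PAGE_LIMIT))  # stand-in for 300 fetched items
wanted = trimItems(fetched, 250)
print(pages, len(wanted))
```

With this plan every page uses `per_page=100`, so the page number alone determines which slice of the results is returned.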

#### 2. jsonResponses
> Turn request responses into JSON (dict)

**Parameters:** 
- `responses`: list of response objects from the Requests library.
- `urls`: URLs that were requested to generate the responses.

**Return:**
- JSON dictionary with `repos` information and `urls` queried.

In [42]:
def jsonResponses(responses, urls):
    result = {}
    # Concatenate the "items" list of every response into a single list
    result["repos"] = reduce(lambda accum, response: accum + response.json()["items"], responses, [])
    result["urls"] = urls
    return result
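
`response.json()["items"]` raises a `KeyError` when a request fails, for example when a rate limit is hit and the body carries only an error message. Below is a minimal sketch of a more defensive variant, assuming failed responses should be skipped and reported. `collectItems` is a hypothetical name, and `FakeResponse` only stands in for a Requests response so that the example runs offline.

```python
class FakeResponse:
    """Offline stand-in exposing the two members used below."""

    def __init__(self, status_code, payload):
        self.status_code = status_code
        self._payload = payload

    def json(self):
        return self._payload


def collectItems(responses, urls):
    """Merge the "items" of successful responses; record the URLs that failed."""
    repos, failed = [], []
    for response, url in zip(responses, urls):
        if response.status_code == 200:
            repos.extend(response.json().get("items", []))
        else:
            failed.append(url)
    return {"repos": repos, "urls": urls, "failed": failed}


responses = [
    FakeResponse(200, {"items": [{"full_name": "a/b"}]}),
    FakeResponse(403, {"message": "rate limit exceeded"}),
]
result = collectItems(responses, ["url-1", "url-2"])
print(len(result["repos"]), result["failed"])
```

Keeping the failed URLs in the output makes it possible to retry only those requests later.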

#### 3. saveJSONFile
> Write JSON content to a new (or existing) file on the system.

**Parameters:** 
- `foldername`: name of the folder, inside `data/`, where the file is saved.
- `filename`: name of existing or new JSON file.
- `content`: the JSON (dict) that is going to be written to the file.

**Side-effect:**
- A new `filename.json` file is saved in the `data/foldername/` folder.

In [53]:
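def saveJSONFile(foldername, filename, content):
    path = f'data/{foldername}'
    # Create the folder (and the parent `data` folder) if it does not exist yet
    os.makedirs(path, exist_ok=True)
    # The context manager closes the file even if writing fails
    with open(f'{path}/{filename}.json', "w") as file:
        file.write(json.dumps(content))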
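
The query strings used as file names later in this notebook contain `:` and `+`, and `:` is not allowed in file names on Windows. Below is a minimal sketch of a save helper that sanitizes the name first, assuming a lossy replacement with `_` is acceptable. `safeFilename` and `saveJSON` are hypothetical names, and the example writes to a temporary directory rather than `data/`.

```python
import json
import re
import tempfile
from pathlib import Path


def safeFilename(name):
    """Replace every character outside letters, digits, '.', '_' and '-' with '_'."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", name)


def saveJSON(baseDir, foldername, filename, content):
    """Write content as JSON to baseDir/foldername/<sanitized filename>.json."""
    folder = Path(baseDir) / foldername
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / f"{safeFilename(filename)}.json"
    with target.open("w", encoding="utf-8") as file:
        json.dump(content, file)
    return target


# Round trip: save into a temporary directory, then read the file back.
with tempfile.TemporaryDirectory() as tmp:
    path = saveJSON(tmp, "queries", "language:Swift+topic:iOS", {"repos": []})
    savedName = path.name
    with path.open(encoding="utf-8") as file:
        loaded = json.load(file)

print(savedName, loaded)
```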


## API Queries

The GitHub API offers several ways of querying for projects. In this section we run some specific queries and some generic ones, to compare the results later.

The following parameters can be of interest for the queries:
- Programming language
- Topic
- Project type (framework, library, app, list)¹
- Domain (music, real estate, networking)¹

The following parameters can be used to **sort/filter** the results:
- Stars
- Forks
- Creation date
- Last updated

¹ Would require a novel approach, with further testing needed to prove its efficiency.
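
The parameters above map onto search qualifiers (such as `language:`, `topic:`, `stars:` and `created:`) and onto the `sort` and `order` request parameters of the repository search endpoint. Below is a minimal sketch of assembling such a URL with the standard library. `buildSearchURL` is a hypothetical helper, and the qualifier values are illustrative assumptions only.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.github.com/search/repositories"


def buildSearchURL(keywords, qualifiers, sort="stars", order="desc", perPage=100, page=1):
    """Join keywords and qualifier pairs into q, then URL-encode all parameters."""
    terms = list(keywords) + [f"{key}:{value}" for key, value in qualifiers.items()]
    params = {
        "q": " ".join(terms),
        "sort": sort,
        "order": order,
        "per_page": perPage,
        "page": page,
    }
    return f"{SEARCH_URL}?{urlencode(params)}"


# Illustrative values: Swift iOS libraries with at least 500 stars, created since 2018.
url = buildSearchURL(
    ["library"],
    {"language": "Swift", "topic": "iOS", "stars": ">=500", "created": ">=2018-01-01"},
)
print(url)
```

Letting `urlencode` escape the query avoids hand-built strings breaking on characters such as `>` or spaces. Creation date appears here as a filter through the `created:` qualifier rather than as a sort key.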

### Assembling Queries

Combining different types of queries to see which ones generate a better result.

In [55]:
language = "Swift"
topic = "iOS"
libraryQuery = "library"
sort = "stars"
numberRepos = 100

languageQuery = f'language:{language}'
languageTopicQuery = f'language:{language}+topic:{topic}'
typeLanguageTopicQuery = f'{libraryQuery}+language:{language}+topic:{topic}'
typeLanguageQuery = f'{libraryQuery}+language:{language}'

queries = [languageQuery, languageTopicQuery, typeLanguageTopicQuery, typeLanguageQuery]

### Querying and Saving JSONs

Performing the assembled queries and saving the JSON files in the `data/queries/` folder.

In [56]:
for query in queries:
    (queryResponses, urls) = fetchRepositories(query, sort, numberRepos)
    print(urls)
    saveJSONFile("queries", query, jsonResponses(queryResponses, urls))

['https://api.github.com/search/repositories?q=language:Swift+sort:stars&per_page=100&page=1']
['https://api.github.com/search/repositories?q=language:Swift+topic:iOS+sort:stars&per_page=100&page=1']
['https://api.github.com/search/repositories?q=library+language:Swift+topic:iOS+sort:stars&per_page=100&page=1']
['https://api.github.com/search/repositories?q=library+language:Swift+sort:stars&per_page=100&page=1']
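
Since the goal is to compare the results of the different queries later, one simple measure is the overlap between the repositories they return. Below is a minimal sketch, assuming each saved dictionary has the `repos` list produced by `jsonResponses` and that each repository carries the `full_name` field from the search response. `repoNames` and `overlap` are hypothetical helpers, and the sample data is made up for illustration.

```python
def repoNames(result):
    """Return the set of full_name values in a saved query result."""
    return {repo["full_name"] for repo in result["repos"]}


def overlap(resultA, resultB):
    """Jaccard similarity between the repository sets of two results."""
    namesA, namesB = repoNames(resultA), repoNames(resultB)
    union = namesA | namesB
    if not union:
        return 0.0
    return len(namesA & namesB) / len(union)


# Made-up sample data with the same shape as the saved JSON files.
languageOnly = {"repos": [{"full_name": "x/one"}, {"full_name": "x/two"}, {"full_name": "x/three"}]}
languageTopic = {"repos": [{"full_name": "x/two"}, {"full_name": "x/three"}, {"full_name": "x/four"}]}

score = overlap(languageOnly, languageTopic)
print(score)
```

A score near 1 would suggest that the extra qualifier barely changes the result set, while a score near 0 would suggest the two queries surface mostly different repositories.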
