# Data collection

In this notebook we collect README.md files through the GitHub API and save them locally so they can be analyzed in another notebook.

## Download functions

### 0. Importing
> External libraries used in the notebook

- [Requests](http://docs.python-requests.org/en/master/): library for making HTTP requests.

In [12]:
import requests as rq

### 1. fetchReadmeURL
> Query the GitHub API for the download URL of the README.md file of the specified repository.

**Parameters:** 
- `repoOwner`: name of the repository owner.
- `repoName`: name of the repository.

**Return:**
- download URL of README.md file.

In [13]:
def fetchReadmeURL(repoOwner, repoName):
    baseURL = 'https://api.github.com/repos'
    requestURL = f'{baseURL}/{repoOwner}/{repoName}/readme'

    # The endpoint returns JSON metadata about the README;
    # 'download_url' points at the raw file contents.
    responseJSON = rq.get(requestURL).json()
    readmeURL = responseJSON['download_url']

    return readmeURL
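
When the repository or its README cannot be found, the API answers with a JSON object that carries a `message` key and no `download_url`, so the lookup above would fail with a `KeyError`. A minimal sketch of a more explicit check follows; `extract_download_url` is a hypothetical helper (not part of the original notebook), and the payloads are illustrative and trimmed to the keys used here.

```python
def extract_download_url(response_json):
    """Return the README download URL from a decoded GitHub API response.

    Hypothetical helper, not part of the original notebook.
    """
    # On failure (e.g. repository or README not found) the API returns a
    # JSON object with a 'message' key instead of 'download_url'.
    download_url = response_json.get('download_url')
    if download_url is None:
        reason = response_json.get('message', 'no download_url in response')
        raise ValueError(f'README not available: {reason}')
    return download_url


# Illustrative payloads, trimmed to the keys used here.
ok = {'name': 'README.md', 'download_url': 'https://example.com/README.md'}
missing = {'message': 'Not Found'}

print(extract_download_url(ok))
try:
    extract_download_url(missing)
except ValueError as err:
    print(err)
```

Raising a descriptive error makes it easier to skip or log repositories without a README when looping over many of them.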

### 2. downloadReadme
> Download the README file and save it under the given filename in the `data/` folder.

**Parameters:** 
- `readmeURL`: download URL of README.md file.
- `filename`: name given to the file once downloaded.

**Side-effect:**
- new `filename.md` file saved in the `data/` folder.

In [15]:
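def downloadReadme(readmeURL, filename):
    # IPython shell escape: run curl inside data/, then `cd -` returns to
    # the previous directory (and prints its path).
    !cd data && curl -o {filename + '.md'} {readmeURL} && cd -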
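
The shell escape above only works inside IPython/Jupyter. As a rough alternative, the saving step could be done in plain Python. The sketch below assumes the README bytes have already been fetched; `save_readme` and its `data_dir` parameter are hypothetical names introduced for illustration.

```python
from pathlib import Path


def save_readme(content, filename, data_dir='data'):
    """Write README bytes to <data_dir>/<filename>.md and return the path.

    Hypothetical helper; `data_dir` defaults to the notebook's data folder.
    """
    folder = Path(data_dir)
    folder.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    target = folder / f'{filename}.md'
    target.write_bytes(content)
    return target
```

With Requests, a call might look like `save_readme(rq.get(readmeURL).content, 'spotties')`, which keeps the download and the file write in the same Python process.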

## Testing download functions

In [16]:
test_downloadURL = fetchReadmeURL('hpbl', 'spotties')
downloadReadme(test_downloadURL, 'spotties')  # returns nothing; the file is written to data/spotties.md

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2542  100  2542    0     0    253      0  0:00:10  0:00:10 --:--:--   728
/Users/pintor/Documents/CIn/10/Data Science/2018-2-projeto-bubads/Notebooks


## Compiling useful repositories
The cells below compile a list of candidate iOS framework/library repositories written in Swift.

We query the GitHub API for the 100 most starred repositories tagged "iOS" and written in Swift, and save the JSON response to a local file as a backup.

In [37]:
url = "https://api.github.com/search/repositories?q=+topic:iOS+language:swift+sort:stars&per_page=100"
headers = "Accept: application/vnd.github.mercy-preview+json"
filename = 'top100Swift.json'

# Quote the header and the URL so the shell passes each one to curl as a single argument
!cd data && curl -H "{headers}" "{url}" > {filename}

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  605k  100  605k    0     0   239k      0  0:00:02  0:00:02 --:--:--  283k


Making the same request, but this time keeping the response in memory.

In [48]:
response = rq.get("https://api.github.com/search/repositories?q=+topic:iOS+language:swift+sort:stars&per_page=100").json()
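
The search URL used above can also be assembled from its parts, which makes the individual qualifiers easier to change. This is a small sketch using only the standard library; `build_search_url` and `BASE_URL` are hypothetical names, and the result matches the URL above except for the leading `+` after `q=`.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.github.com/search/repositories'


def build_search_url(qualifiers, per_page=100):
    """Assemble a repository search URL from a list of qualifiers."""
    params = {'q': ' '.join(qualifiers), 'per_page': per_page}
    # safe=':' keeps the colons readable; spaces are encoded as '+'.
    return f'{BASE_URL}?{urlencode(params, safe=":")}'


search_url = build_search_url(['topic:iOS', 'language:swift', 'sort:stars'])
print(search_url)
# https://api.github.com/search/repositories?q=topic:iOS+language:swift+sort:stars&per_page=100
```

The resulting string can be passed to `rq.get` in the same way as the hard-coded URL.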

Saving the URL of each repo to a text file, which will be reviewed by hand to see which repositories are libraries or frameworks.

In [53]:
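# Collect the web URL of every repository in the search results
reposURL = [element['html_url'] for element in response['items']]

# Write one URL per line, in the data/ folder referenced below
with open('data/possibleFrameworks.txt', 'w') as f:
    for repo in reposURL:
        f.write("%s\n" % repo)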

Only repositories that host frameworks or libraries are useful for our analysis, so we manually opened each link in [`data/possibleFrameworks.txt`](data/possibleFrameworks.txt) and looked at its README file.

The [`data/confirmedFrameworks.txt`](data/confirmedFrameworks.txt) list contains 85 links to repositories that belong to libraries or frameworks.

In [3]:
# Read the confirmed list; `with` closes the file once the lines are loaded
with open('data/confirmedFrameworks.txt', 'r') as confirmedFrameworks:
    frameworksURLs = confirmedFrameworks.readlines()

len(frameworksURLs)

85
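
Each line read this way still carries its trailing newline, and `fetchReadmeURL` expects an owner and a repository name, not a full URL. One possible way to bridge the two is sketched below; `parse_repo_url` is a hypothetical helper, and the sample line is illustrative (it reuses the repository from the test above).

```python
from urllib.parse import urlparse


def parse_repo_url(line):
    """Split a GitHub repository URL into (owner, name)."""
    # readlines() keeps the trailing newline, so strip it first.
    path = urlparse(line.strip()).path
    owner, name = path.strip('/').split('/')[:2]
    return owner, name


sample_lines = ['https://github.com/hpbl/spotties\n']  # illustrative entry
repos = [parse_repo_url(line) for line in sample_lines]
print(repos)
# [('hpbl', 'spotties')]
```

Each `(owner, name)` pair could then be passed to `fetchReadmeURL(owner, name)` and the result to `downloadReadme`.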