# Data collection

In this notebook we collect README.md files through the GitHub API and save them locally so they can be analyzed in another notebook.

## Setup

### Installing packages

In [1]:
# Install a pip package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install markdown

Collecting markdown
  Downloading https://files.pythonhosted.org/packages/f5/e4/d8c18f2555add57ff21bf25af36d827145896a07607486cc79a2aea641af/Markdown-3.1-py2.py3-none-any.whl (87kB)
    100% |████████████████████████████████| 92kB 635kB/s ta 0:00:01
Installing collected packages: markdown
Successfully installed markdown-3.1


### Importing
> Libraries used in the notebook

-  [Requests](http://docs.python-requests.org/en/master/): library for making HTTP requests.
-  [Regular Expressions](https://docs.python.org/3/library/re.html): standard-library module for operating on strings with regular expressions.
-  [Markdown](https://python-markdown.github.io/): library for converting Markdown to HTML.

In [2]:
import requests as rq
import re
import markdown

### 1. fetchReadmeURL
> Query the GitHub API for the download URL of the README.md file in the specified repository.

**Parameters:** 
- `repoOwner`: name of the repository owner.
- `repoName`: name of the repository.

**Return:**
- download URL of README.md file.

In [3]:
def fetchReadmeURL(repoOwner, repoName):
    baseURL = 'https://api.github.com/repos'
    requestURL = f'{baseURL}/{repoOwner}/{repoName}/readme'
    
    responseJSON = rq.get(requestURL).json()
    readmeURL = responseJSON['download_url']
    
    return readmeURL
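
`responseJSON['download_url']` raises a bare `KeyError` whenever the response has no such key, which is presumably what happens when the API answers with an error object (unknown repository, rate limit reached). The sketch below shows one way the lookup could report the reason instead. The helper name `extract_download_url` and both sample payloads are hypothetical; only the `download_url` key comes from the cell above, and the `message` key is assumed to be where the API puts its error text.

```python
def extract_download_url(payload):
    """Return the README download URL from a decoded `/readme` response.

    Raises LookupError when the payload carries no usable `download_url`,
    for example when the API answered with an error message instead.
    """
    url = payload.get("download_url")
    if not url:
        # Assumption: error responses describe the problem under "message".
        reason = payload.get("message", "no download_url in response")
        raise LookupError(f"README not available: {reason}")
    return url


# Made-up payloads shaped like a success and an error response.
ok_payload = {"name": "README.md", "download_url": "https://example.com/README.md"}
error_payload = {"message": "Not Found"}

print(extract_download_url(ok_payload))
try:
    extract_download_url(error_payload)
except LookupError as err:
    print(err)
```

Inside `fetchReadmeURL`, the last two lines could then become `return extract_download_url(responseJSON)`.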

### 2. downloadReadme
> Download the README file and save it under the given filename in the `data/READMES` folder.

**Parameters:** 
- `readmeURL`: download URL of README.md file.
- `filename`: name given to the file once downloaded.

**Side-effect:**
- a new `filename.md` file saved in the `data/READMES` folder.

In [4]:
def downloadReadme(readmeURL, filename):
    !cd data/READMES && curl -o {filename + '.md'} {readmeURL} && cd -
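
The same side effect can be produced without shelling out, which avoids depending on `curl` and on how the shell treats unusual characters in a filename. This is only a sketch: `save_readme` is a hypothetical helper, and it assumes the README bytes were already fetched (for instance with `rq.get(readmeURL).content`).

```python
import tempfile
from pathlib import Path


def save_readme(content, filename, directory="data/READMES"):
    """Write README bytes to <directory>/<filename>.md and return the path."""
    target_dir = Path(directory)
    # Create the folder on first use instead of assuming it already exists.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{filename}.md"
    target.write_bytes(content)
    return target


# Demonstration in a throwaway directory, with made-up README bytes.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_readme(b"# spotties\n", "spotties", directory=tmp)
    print(saved.name, saved.read_bytes())
```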

## Testing download functions

In [5]:
test_downloadURL = fetchReadmeURL('hpbl', 'spotties')
test_downloadREADME = downloadReadme(test_downloadURL, 'spotties')

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2542  100  2542    0     0   5351      0 --:--:-- --:--:-- --:--:--  5351
/home/jovyan/workspace


## Compiling useful repositories
The cells below compile a list of candidate iOS framework/library repositories written in Swift.

Querying the GitHub API for the 100 most starred repositories tagged "iOS" and written in Swift,
and saving the JSON response to a local file as a backup.

In [47]:
url = "https://api.github.com/search/repositories\?q\=+topic:iOS+language:swift+sort:stars\&per_page\=100"
headers = "Accept:application/vnd.github.mercy-preview+json"
filename = 'top100Swift.json'
# len(rq.get(url, headers=headers).json()['items'])

!cd data && curl -H {headers} {url} > {filename}

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  183k  100  183k    0     0   144k      0  0:00:01  0:00:01 --:--:--  522k


Making the same request, but now keeping the response in memory.

In [48]:
response = rq.get("https://api.github.com/search/repositories?q=+topic:iOS+language:swift+sort:stars&per_page=100").json()
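
The search URL is written by hand twice above, once with shell escapes and once without. A minimal sketch of building it from its parts with the standard library follows. The names `SEARCH_ENDPOINT` and `build_search_url` are hypothetical, and the leading `+` of the original query is dropped here on the assumption that it is only a stray separator.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

SEARCH_ENDPOINT = "https://api.github.com/search/repositories"


def build_search_url(qualifiers, per_page=100):
    """Join search qualifiers into one encoded query string."""
    # urlencode turns the spaces between qualifiers into '+'
    # and percent-encodes the colons.
    query = urlencode({"q": " ".join(qualifiers), "per_page": per_page})
    return f"{SEARCH_ENDPOINT}?{query}"


search_url = build_search_url(["topic:iOS", "language:swift", "sort:stars"])
print(search_url)
```

The colons come out percent-encoded, so the string differs from the handwritten one, but it should decode to the same qualifiers.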

Saving the URL of each repo to a text file, which will be reviewed by hand to see which ones are libraries or frameworks.

In [53]:
reposURL = []
for element in response['items']:
    reposURL.append(element['html_url'])

with open('possibleFrameworks.txt', 'w') as f:
    for repo in reposURL:
        f.write("%s\n" % repo)

For our analysis, only repositories of frameworks or libraries are useful, so we manually opened each link listed in [`data/possibleFrameworks.txt`](data/possibleFrameworks.txt) and looked at its README file.

The [`data/confirmedFrameworks.txt`](data/confirmedFrameworks.txt) list contains 85 links to repositories that belong to libraries or frameworks.

In [5]:
# Read the hand-curated list; the context manager closes the file afterwards
with open('data/confirmedFrameworks.txt', 'r') as confirmedFrameworks:
    frameworksURLs = confirmedFrameworks.readlines()

removeLineBreak = lambda x: re.sub('\\n$', '', x)

frameworksURLs = list(map(removeLineBreak, frameworksURLs))

len(frameworksURLs)

85

Getting the frameworks' names and owners from the URLs.

In [6]:
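# Repository name: everything after the last slash
getNameFromURL = lambda x: re.search(r'[^/]+$', x).group(0)
# Owner: the first path segment enclosed by slashes (word characters and hyphens)
getOwnerFromURL = lambda x: re.search(r'(?<=/)[\w-]+(?=/)', x).group(0)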

frameworkNames = list(map(getNameFromURL, frameworksURLs))
frameworkOwners = list(map(getOwnerFromURL, frameworksURLs))

frameworks = list(zip(frameworkOwners, frameworkNames, frameworksURLs))
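
As a cross-check on the two regular expressions, the owner and name can also be read from the path component of the URL. The sketch below assumes every line has the shape `https://github.com/<owner>/<name>`; `split_repo_url` is a hypothetical helper, not part of the notebook.

```python
from urllib.parse import urlparse


def split_repo_url(url):
    """Return (owner, name) from a URL shaped like https://github.com/<owner>/<name>."""
    # strip() drops the trailing line break left by readlines().
    parts = [part for part in urlparse(url.strip()).path.split("/") if part]
    if len(parts) < 2:
        raise ValueError(f"not a repository URL: {url!r}")
    return parts[0], parts[1]


print(split_repo_url("https://github.com/hpbl/spotties\n"))
```

Unlike the name regex, this version also accepts a URL that ends with a slash.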

Downloading each framework's README.md file. This step required authentication to avoid hitting the API rate limit; the credentials were removed.
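
Since the credentials are not shown, here is a minimal sketch of how a personal access token might be attached to each request. The environment variable name `GITHUB_TOKEN` and the helper `github_headers` are assumptions made for illustration, not what was actually used in this notebook.

```python
import os


def github_headers(token=None):
    """Build request headers, adding an Authorization header when a token is given."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"token {token}"
    return headers


# Hypothetical variable name; reading it from the environment keeps the
# secret out of the notebook file.
token = os.environ.get("GITHUB_TOKEN")
print(sorted(github_headers(token)))
```

`fetchReadmeURL` could then call `rq.get(requestURL, headers=github_headers(token))`.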

In [53]:
for framework in frameworks:
    readmeURL = fetchReadmeURL(framework[0], framework[1])
    downloadReadme(readmeURL, framework[1])

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  7550  100  7550    0     0  11128      0 --:--:-- --:--:-- --:--:-- 11135
/Users/pintor/Documents/CIn/10/Data Science/2018-2-projeto-bubads/Notebooks
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 10029  100 10029    0     0  18150      0 --:--:-- --:--:-- --:--:-- 18135
/Users/pintor/Documents/CIn/10/Data Science/2018-2-projeto-bubads/Notebooks
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  6361  100  6361    0     0  11792      0 --:--:-- --:--:-- --:--:-- 11779
/Users/pintor/Documents/CIn/10/Data Science/2018-2-projeto-bubads/Notebooks

Counting how many files were downloaded.

In [54]:
! cd data/READMES && ls | wc -l

      85


## Converting markdown to HTML

Using the Markdown library to convert all README.md files to HTML for easier parsing.

In [13]:
basePath = 'data/READMES/'

instance = markdown.Markdown()

for name in frameworkNames:
    filename = basePath + name
    instance.convertFile(filename + ".md", basePath + 'HTML/' + name + ".html")
    instance.reset()
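
The loop writes into `data/READMES/HTML/`, so that folder presumably has to exist before `convertFile` is called. The sketch below creates it and pairs each source file with its target using `pathlib`. `html_targets` is a hypothetical helper, and the demonstration runs in a temporary directory rather than on the real data.

```python
import tempfile
from pathlib import Path


def html_targets(names, base_path="data/READMES"):
    """Return (markdown_path, html_path) pairs, creating the HTML folder if needed."""
    base = Path(base_path)
    out_dir = base / "HTML"
    out_dir.mkdir(parents=True, exist_ok=True)
    return [(base / f"{name}.md", out_dir / f"{name}.html") for name in names]


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for source, target in html_targets(["spotties"], base_path=tmp):
        print(source.name, "->", target.parent.name + "/" + target.name)
```

Each pair could then be passed on as `instance.convertFile(str(source), str(target))`.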