# Who is the friendliest member of the IS team (*based on PR comments*)

## How will we do this
1. use the GitHub REST API to retrieve all comments from all PRs
2. clean the data
3. use sentiment analysis to make a decision

---
Disclaimer: Our implementation of sentiment analysis does not take into account essential characteristics: context, friendly banter, or the relationships and levels of seniority between our members. Even if we attempted to do that, it would not be an accurate indicator of workplace culture or conduct and should not be used as such. The goal is to examine the techniques behind this whole process and have a little fun with the remaining data.

## 1. GitHub RESTful API

### What is an API?
- stands for Application Programming Interface
- structured way for a program to offer a service, or in our case, data to another program
- we interact with APIs using endpoints
- menus are food APIs and appetizer, entrée, dessert, and drinks are endpoints!

### RESTful API
- REST is a set of conventions for building an API on top of HTTP
- REST stands for representational state transfer

Let's look at a basic example:

$
\overbrace{
    \underbrace{\text{https://api.github.com}}_\text{api}/
    \underbrace{\text{users}}_\text{endpoint}/
    \underbrace{\text{philipmassouh}}_\text{user}
   }^\text{request}
$

This is a trimmed version of the output. It's simply a JSON representation of the data you can see on my profile page. You can try this in your own browser!

```json
{
  "login": "philipmassouh",
  ...
  "name": "Philip Massouh",
  "company": "@ResearchAffiliates ",
  "blog": "http://www.philipmassouh.com",
  ...
  "public_repos": 17,
  "public_gists": 9,
  ...
}
```
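In Python, a response like this becomes a plain dictionary once it is parsed. A minimal sketch, using a hand-copied subset of the fields above instead of a live request (the names `sample_response`, `profile`, and `summary` are only for illustration):

```python
import json

# A hand-copied subset of the trimmed response shown above (no network call).
sample_response = """
{
  "login": "philipmassouh",
  "name": "Philip Massouh",
  "public_repos": 17,
  "public_gists": 9
}
"""

profile = json.loads(sample_response)   # JSON text -> Python dict
summary = f"{profile['name']} ({profile['login']}): {profile['public_repos']} public repos"
print(summary)
```

Fields are then read with ordinary dictionary lookups, which is how the comment data is handled later in this notebook.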

### Requests
- What if I want to access something private?
- What if I want to modify how much data or exactly what type of data?
<br>

- ***We need a way to provide options***
---
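Options of this kind usually travel as query parameters appended to the URL. A small sketch of what that looks like, using only the standard library; it assumes the `per_page` and `page` options and the comments URL that appear later in this notebook, and it only builds the URL without sending anything:

```python
from urllib.parse import urlencode

base = "https://api.github.com/repos/ResearchAffiliates/invsys/pulls/comments"
options = {"per_page": 100, "page": 1}

# urlencode turns the dict into "key=value" pairs joined by "&"
full_url = f"{base}?{urlencode(options)}"
print(full_url)
```

Libraries such as `requests` do this encoding for you when you pass a `params` dictionary.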

### Let's take a look at the *comments* *endpoint*

$
\overbrace{
    \underbrace{\text{https://api.github.com}}_\text{api}/
    \underbrace{\text{repos}}_\text{resource}/
    \underbrace{\text{ResearchAffiliates}}_\text{repo owner}/
    \underbrace{\text{invsys}}_\text{repo}/
    \underbrace{\text{pulls/comments}}_\text{endpoint}
   }^\text{request}
$

The investment systems repository is private, so we need to tell the API who we are. The Python `requests` library can help us with that.

In [3]:
import requests
import os

# read the personal access token from the environment; never hard-code it in the notebook
authtoken = os.environ['GAUTH']

comments = requests.get(
    url="https://api.github.com/repos/ResearchAffiliates/invsys/pulls/comments",    # request
    auth=("philipmassouh", authtoken),                                    # authorization
    params={                                                                        # extra options
        "per_page":100,
        "page":1
    }
)

Much like the first example, this returns a lot of unwanted information, too much to display here. So we are going to:
1. open the response as JSON
2. select the first comment
3. view only the sender and content

In [4]:
comment_with_metadata = comments.json()[0]          # open as json and select first
sender = comment_with_metadata['user']['login']     # get sender
comment = comment_with_metadata['body']             # get message
f"{sender}: {comment}"

'vhquang: I guess the check for new line does not run on its own configuration. :)'

You may have noticed that for this endpoint, the GitHub API accepts a `page` parameter. Much like the browser view, it wants to know which page of comments to retrieve.

So, to get all the comments, we need to know how many pages there are. We find that in the `link` header of the response.

In [5]:
comments.headers['link']

'<https://api.github.com/repositories/287143969/pulls/comments?per_page=100&page=2>; rel="next", <https://api.github.com/repositories/287143969/pulls/comments?per_page=100&page=65>; rel="last"'

This gives us a string. We can ignore the first item because it points to the next page, not the last. We will then extract the last page number using code in case it changes.

In [6]:
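from urllib.parse import urlparse, parse_qs

# requests parses the link header into a dict keyed by rel ("next", "last", ...)
last_url = comments.links['last']['url']
# pull the page number out of the query string, e.g. "...&page=65" -> 65
last_page = int(parse_qs(urlparse(last_url).query)['page'][0])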
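The `link` header shown above can also be parsed with only the standard library, which makes the extraction easy to check without calling the API. A minimal sketch, assuming the header always has the `<url>; rel="name"` shape seen in the output above; `parse_link_header` and `page_number` are hypothetical helper names, not part of `requests`:

```python
import re
from urllib.parse import urlparse, parse_qs

# The header value copied from the output above.
link_header = (
    '<https://api.github.com/repositories/287143969/pulls/comments?per_page=100&page=2>; rel="next", '
    '<https://api.github.com/repositories/287143969/pulls/comments?per_page=100&page=65>; rel="last"'
)

def parse_link_header(header):
    """Map each rel name ("next", "last", ...) to its URL."""
    return {rel: url for url, rel in re.findall(r'<([^>]+)>;\s*rel="([^"]+)"', header)}

def page_number(url):
    """Read the `page` query parameter of a URL as an int."""
    return int(parse_qs(urlparse(url).query)["page"][0])

links = parse_link_header(link_header)
last_page_example = page_number(links["last"])
print(last_page_example)
```

Reading the number from the query string, instead of splitting the header on fixed text, keeps working if the order of the entries changes.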

This loop runs from page 1 to the last page, fetching all of the comments for us, but...

In [12]:
import csv
import re

with open('is-comments.csv', 'w', newline='') as csvout:
    writer = csv.writer(csvout, delimiter=',')  # create the writer once, not once per row
    for page_number in range(1, last_page + 1):

        comments = requests.get(
            "https://api.github.com/repos/ResearchAffiliates/invsys/pulls/comments",
            auth=("philipmassouh", authtoken),
            params={
                "per_page": 100,
                "page": page_number
            }
        )

        for comment in comments.json():
            writer.writerow([
                comment['user']['login'],
                re.sub(r'\r\n', ' ', comment['body'])  # keep each comment on one line
            ])

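The CSV-writing step can be checked on its own, without calling the API. A minimal sketch, assuming two made-up comments with the same `user`/`login` and `body` fields as the real response, and an in-memory buffer standing in for the file:

```python
import csv
import io

# Made-up comments standing in for the API response.
fake_comments = [
    {"user": {"login": "alice"}, "body": "line one\r\nline two"},
    {"user": {"login": "bob"}, "body": "fine, thanks"},
]

buffer = io.StringIO()              # in-memory stand-in for the CSV file
writer = csv.writer(buffer)
for comment in fake_comments:
    writer.writerow([
        comment["user"]["login"],
        comment["body"].replace("\r\n", " "),   # keep each comment on one line
    ])

csv_text = buffer.getvalue()
print(csv_text)
```

Note that the `csv` module quotes any field containing a comma, so comment text does not need to be escaped by hand.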

It's very slow. Let's try to get more than one page at a time: if we can fetch 8 pages at once, the task should be close to 8x faster!

---

Split the task up with `ThreadPool` from the `multiprocessing.pool` module.

In order to run multiple requests concurrently, we need to wrap the task we want to parallelize in a function.

In [18]:
def get_comments(page_number):

    comments = requests.get(
        'https://api.github.com/repos/ResearchAffiliates/invsys/pulls/comments',
        auth=('philipmassouh', authtoken),
        params={
            'per_page':100,
            'page':page_number
        }
    )

    # since we are gathering almost 7000 comments and all their metadata,
    # I only want to keep the commenter and comment itself.
    output = []
    for comment in comments.json():
        try:
            output.append((comment['user']['login'], comment['body']))
        except TypeError:
            print(f"comment: {comment}")
    return output

In [19]:
get_comments(1)[0]

('vhquang',
 'I guess the check for new line does not run on its own configuration. :)')
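The `try`/`except TypeError` in `get_comments` matters when the response is not a list of comments. One plausible case, assumed here for illustration, is an error response that arrives as a single JSON object: iterating over a dict yields its string keys, and indexing a string with `'user'` raises `TypeError`. A small sketch with made-up payloads (`extract_pairs` is a hypothetical helper that mirrors the loop above but skips bad entries instead of printing them):

```python
def extract_pairs(payload):
    """Return (login, body) pairs, skipping anything that is not a comment dict."""
    pairs = []
    for comment in payload:
        try:
            pairs.append((comment["user"]["login"], comment["body"]))
        except TypeError:
            # `comment` was a string key of an error dict, not a comment
            continue
    return pairs

good_payload = [{"user": {"login": "alice"}, "body": "LGTM"}]
error_payload = {"message": "Bad credentials"}   # made-up error shape

print(extract_pairs(good_payload))
print(extract_pairs(error_payload))
```

Checking `comments.status_code` before parsing would be a more direct way to catch failed requests.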

In [20]:
from multiprocessing.pool import ThreadPool

max_threads = 8  # fetch 8 pages at a time, as discussed above

# first 11 pages only; use range(1, last_page + 1) to fetch everything
page_numbers = list(range(1, 12))

with ThreadPool(max_threads) as p:
    # map returns one list of (commenter, comment) pairs per page, in page order
    pages = p.map(get_comments, page_numbers)
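To see why threads help with a task that spends most of its time waiting on the network, here is a rough sketch that swaps the real request for a short `time.sleep`. Everything in it (`fake_fetch`, the 0.1 second delay, the 8 demo pages) is made up for illustration, and the actual speedup against the GitHub API will depend on latency and rate limits:

```python
import time
from multiprocessing.pool import ThreadPool

def fake_fetch(page_number):
    """Stand-in for a network request: wait a bit, then return one 'page'."""
    time.sleep(0.1)
    return [(f"user{page_number}", f"comment from page {page_number}")]

demo_pages = list(range(1, 9))

# one page after another
start = time.perf_counter()
sequential = [fake_fetch(n) for n in demo_pages]
sequential_time = time.perf_counter() - start

# eight pages at once
start = time.perf_counter()
with ThreadPool(8) as pool:
    parallel = pool.map(fake_fetch, demo_pages)
parallel_time = time.perf_counter() - start

print(f"sequential: {sequential_time:.2f}s, threaded: {parallel_time:.2f}s")
```

`map` returns its results in the same order as the inputs, so the threaded version produces the same list as the sequential one, only sooner.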


We are now left with a nested list containing (commenter, comment) pairs.
It looks like this: <br>
```
[
    [(commenter,comment),...,(commenter,comment)],
    ...,
    ...,
    ...,
    [(commenter,comment),...,(commenter,comment)]
]
```
Where each row corresponds to a page of comments. <br>

Now we want to unpack this into a flat list where each item is a (commenter, comment) pair.

In [None]:
all_comments = [comment for page in pages for comment in page]
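The nested comprehension reads left to right like two nested `for` loops. A quick sketch on made-up data, with `itertools.chain.from_iterable` shown as an equivalent alternative (`example_pages` is a hypothetical stand-in for the real pages):

```python
from itertools import chain

# Two made-up "pages", each a list of (commenter, comment) pairs.
example_pages = [
    [("alice", "LGTM"), ("bob", "nit: typo")],
    [("carol", "thanks!")],
]

# outer loop over pages, inner loop over the pairs in each page
flat = [comment for page in example_pages for comment in page]

# same result using the standard library
flat_chain = list(chain.from_iterable(example_pages))
print(flat)
```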

Now we need to clean 