# ICIS 2018 Python demo

In the following cell, we import the libraries required for this demo:
- __requests__: performs HTTP requests
- __json__: processes JSON documents
- __pandas__: data frame manipulation library. According to the pandas documentation, a data frame is "a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table."
- __numpy__: numerical library on which pandas data frames are built
- __pyplot__ (from matplotlib) and __seaborn__: visualization libraries

In [None]:
import requests
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

This function takes one parameter, a URL, and returns the response from an HTTP GET call.
Here we do not include an API token. Unauthenticated requests work for public data but are subject to stricter rate limits, and in a real enterprise setting you would typically need a token to access GitHub remotely.
More information: https://blog.github.com/2013-05-16-personal-api-tokens/

In [None]:
def make_http_call(url):
    # send a GET request; the timeout keeps the call from hanging indefinitely
    r = requests.get(url=url, timeout=10)
    return r
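
The function above sends unauthenticated requests. As a minimal sketch of how a token could be attached, the variant below adds an `Authorization` header when a token is supplied. The names `build_headers` and `make_authenticated_call` and the `GITHUB_TOKEN` environment variable are assumptions made for illustration; they are not part of the original demo.

```python
import os

import requests


def build_headers(token=None):
    """Return request headers, adding an Authorization header if a token is given."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers


def make_authenticated_call(url, token=None, timeout=10):
    """Hypothetical variant of make_http_call that sends the token, if any."""
    response = requests.get(url, headers=build_headers(token), timeout=timeout)
    # raise an exception on 4xx/5xx answers instead of failing later on the body
    response.raise_for_status()
    return response


# assumption: the token is stored in an environment variable named GITHUB_TOKEN
github_token = os.environ.get("GITHUB_TOKEN")
```

Reading the token from the environment keeps it out of the notebook itself, which matters if the notebook is shared.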

## Main program

### Search query
We start by forming and performing a search query for GitHub. Search criteria:
- Repositories written in Python
- Repositories that are not archived

We sort the results by number of stars (popularity) in descending order.
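
The search criteria and sort order described above end up as URL query parameters. As an illustrative sketch, the URL can be assembled from its parts with the standard library instead of being written by hand. The helper name `build_search_url` and the constant `SEARCH_ENDPOINT` are assumptions introduced here.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.github.com/search/repositories"


def build_search_url(language="python", archived=False, sort="stars", order="desc"):
    """Assemble a repository search URL from the search criteria."""
    # qualifiers are separated by spaces; urlencode turns ':' into %3A and ' ' into '+'
    qualifiers = f"language:{language} archived:{str(archived).lower()}"
    params = {"q": qualifiers, "sort": sort, "order": order}
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


print(build_search_url())
# https://api.github.com/search/repositories?q=language%3Apython+archived%3Afalse&sort=stars&order=desc
```

With the default arguments, this produces the same URL as the hard-coded query used in this demo.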

### Creating the project data frame
Next, we retrieve the results (in JSON format) and create a data frame with the following fields:
- project id
- project name
- star count
- fork count
- watchers count
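
To make the mapping from API fields to data frame columns concrete, here is a small offline sketch. The two sample items are made up and only mimic the shape of the entries in the `items` array of a search response; `FIELD_MAP` and `items_to_frame` are hypothetical names.

```python
import pandas as pd

# made-up sample data shaped like entries of the "items" array
sample_items = [
    {"id": 1, "name": "alpha", "stargazers_count": 50, "forks_count": 7, "watchers_count": 50},
    {"id": 2, "name": "beta", "stargazers_count": 20, "forks_count": 9, "watchers_count": 20},
]

# data frame column -> key in the JSON item
FIELD_MAP = {
    "project_id": "id",
    "project_name": "name",
    "star_count": "stargazers_count",
    "fork_count": "forks_count",
    "watch_count": "watchers_count",
}


def items_to_frame(items):
    """Build a data frame with one row per item and one column per mapped field."""
    records = [{column: item[key] for column, key in FIELD_MAP.items()} for item in items]
    return pd.DataFrame(records, columns=list(FIELD_MAP))


sample_projects = items_to_frame(sample_items)
print(sample_projects)
```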

### Displaying the resulting data frame
Finally, we display the contents of our data frame.

In [None]:
# here we retrieve projects/repositories based on their ratings (number of stars)
# see https://developer.github.com/v3/search/
search_query = 'https://api.github.com/search/repositories?q=language%3Apython+archived%3Afalse&sort=stars&order=desc'

# send the call to github
response = make_http_call(search_query)

# retrieve the response and load it as json
response_json = json.loads(response.text)

# each project's info is collected as one record (a dict)
# by default, GitHub returns 30 items per page
# to retrieve more than 30 results, we would need
# to handle pagination: https://developer.github.com/v3/guides/traversing-with-pagination/
records = []
for item in response_json['items']:
    records.append({
        'project_id': item['id'],
        'project_name': item['name'],
        'star_count': item['stargazers_count'],
        'fork_count': item['forks_count'],
        'watch_count': item['watchers_count'],
    })

# build the data frame in one step from the collected records
# (DataFrame.append was removed in pandas 2.0 and np.int in NumPy 1.24)
projects = pd.DataFrame(records, columns=['project_id', 'project_name', 'star_count',
                                          'fork_count', 'watch_count'])

# output the list
projects
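
The cell above only reads the first page of results. As a rough sketch of how pagination could be handled, the function below follows the `next` link that the `requests` library parses from the `Link` response header (`response.links`). The function name `fetch_all_items`, the `max_pages` cap, and the injectable `get` parameter are assumptions made for this example.

```python
import requests


def fetch_all_items(url, max_pages=3, get=requests.get):
    """Collect the 'items' of up to max_pages result pages, following 'next' links."""
    items = []
    for _ in range(max_pages):
        response = get(url, timeout=10)
        response.raise_for_status()
        items.extend(response.json().get("items", []))
        # requests exposes the parsed Link header as a dict keyed by rel
        next_link = response.links.get("next")
        if next_link is None:
            break  # last page reached
        url = next_link["url"]
    return items
```

The `max_pages` cap keeps the number of calls small, which matters because the search API is rate limited. A call such as `fetch_all_items(search_query, max_pages=2)` would return the raw items of the first two pages.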

Now we use the `describe()` method of the data frame to compute descriptive statistics for its numeric columns.

In [None]:
projects.describe()

The next cell selects the project with the highest fork count: `idxmax()` returns the index label of the maximum value, and `loc` retrieves the corresponding row.

In [None]:
projects.loc[projects['fork_count'].idxmax()]
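
To see what this lookup returns without calling the API, here is a minimal sketch on a small made-up frame. The frame `toy` and its values are purely illustrative.

```python
import pandas as pd

# made-up data for illustration only
toy = pd.DataFrame({
    "project_name": ["alpha", "beta", "gamma"],
    "star_count": [50, 20, 35],
    "fork_count": [7, 9, 3],
})

# idxmax() gives the index label of the largest fork_count (here: 1)
top_label = toy["fork_count"].idxmax()

# loc[] returns the whole row for that label as a Series
top_row = toy.loc[top_label]
print(top_row["project_name"])  # beta
```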

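Finally, we use the `scatterplot()` function from the seaborn library to display two scatterplots:
- the first plots star count against watchers count
- the second plots star count against fork count

Note: in the GitHub REST API, `watchers_count` mirrors `stargazers_count`, so the points in the first plot are expected to lie on a straight line.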

In [None]:
sns.scatterplot(x="watch_count", y="star_count", data=projects)
plt.show()

In [None]:
sns.scatterplot(x="fork_count", y="star_count", data=projects)
plt.show()

# End of demo. Thank you!