# Scraping Top Repositories for Topics on GitHub

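Web scraping is the process of extracting information from web pages programmatically. GitHub is a platform for hosting code repositories, and its topics page groups repositories by subject. In this project we scrape the list of topics and, for each topic, the top repositories shown on its page. The tools used are Python, requests, Beautiful Soup and Pandas.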

Here are the steps we'll follow:

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top 30 repositories in the topic from the topic page
- For each repository, we'll grab the repository name, username, stars and repository URL
- For each topic we'll create a CSV file in the following format:

```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```
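
As a minimal sketch of how a file in this layout could be produced, the snippet below builds a small table with Pandas and renders it as CSV text. The two rows are the sample rows shown above; the variable names are illustrative only.

```python
import pandas as pd

# Sample rows copied from the format shown above
rows = [
    {'Repo Name': 'three.js', 'Username': 'mrdoob', 'Stars': 69700,
     'Repo URL': 'https://github.com/mrdoob/three.js'},
    {'Repo Name': 'libgdx', 'Username': 'libgdx', 'Stars': 18300,
     'Repo URL': 'https://github.com/libgdx/libgdx'},
]

repos_df = pd.DataFrame(rows)

# With no path given, to_csv returns the CSV text; index=False omits the row index
csv_text = repos_df.to_csv(index=False)
print(csv_text)
```

Passing a file path as the first argument of `to_csv` writes the same text to disk instead of returning it.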

## Scrape the list of topics from GitHub

We'll do this in three steps:

- Use requests to download the page
- Use Beautiful Soup (BS4) to parse the page and extract information
- Convert the extracted information to a Pandas dataframe
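
The parse and convert steps can be sketched on a small inline HTML string, which stands in for the downloaded page so that no network access is needed. The HTML, the class names `title` and `desc`, and the sample text are made up for illustration and are not GitHub's real markup.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up HTML standing in for a downloaded page (response.text in real code)
html = """
<div><p class="title">3D</p><p class="desc">3D modeling topics</p></div>
<div><p class="title">Ajax</p><p class="desc">Asynchronous web requests</p></div>
"""

# Parse the HTML and pull out the text of the matching tags
doc = BeautifulSoup(html, 'html.parser')
titles = [tag.text for tag in doc.find_all('p', {'class': 'title'})]
descs = [tag.text for tag in doc.find_all('p', {'class': 'desc'})]

# Convert the extracted lists to a DataFrame
topics_df = pd.DataFrame({'title': titles, 'description': descs})
print(topics_df)
```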

Let's write a function to download the page.

In [95]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os

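def get_topics_page():
    # URL of the page that lists GitHub topics
    topics_url = 'https://github.com/topics'
    # Download the page
    response = requests.get(topics_url)
    # Stop early if the request did not succeed
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the HTML so it can be searched with Beautiful Soup
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc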

`get_topics_page()` can be used to load https://github.com/topics, which lists the top topics on **GitHub**.
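
An alternative way to write a download helper is sketched below. It is not the function used in this notebook: the name `fetch_page` and the 10 second default timeout are assumptions. It relies on `raise_for_status()` from requests, which raises only for 4xx and 5xx responses, so it is slightly more permissive than checking for exactly 200.

```python
import requests
from bs4 import BeautifulSoup

def fetch_page(url, timeout=10):
    """Download url and return it as a parsed BeautifulSoup document."""
    # timeout stops the call from waiting indefinitely on a slow connection
    response = requests.get(url, timeout=timeout)
    # raise_for_status raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')
```

Calling `fetch_page('https://github.com/topics')` would then play the same role as `get_topics_page()`.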

Let's create some helper functions to parse information from the page.

To get topic titles, we can pick `p` tags with the `class` ...

![](https://i.imgur.com/FldT48M.jpg)

Add some explanation

In [96]:
doc = get_topics_page()

In [97]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

`get_topic_titles` can be used to get the list of titles
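
One thing to keep in mind with this approach: when a string containing several classes is passed to `find_all`, Beautiful Soup compares it against the whole `class` attribute, so the classes must appear in exactly that order. The sketch below shows the difference on a made-up tag and uses a CSS selector via `select` as an order-independent alternative; the HTML is an assumption, not GitHub's actual markup.

```python
from bs4 import BeautifulSoup

# Made-up tag: the classes are the ones searched for, but in a different order
html = '<p class="lh-condensed f3 mb-0">Ajax</p>'
doc = BeautifulSoup(html, 'html.parser')

# A multi-class string is compared against the whole class attribute, so order matters
exact_matches = doc.find_all('p', {'class': 'f3 lh-condensed mb-0'})

# A CSS selector matches tags that carry all the listed classes, in any order
css_matches = doc.select('p.f3.lh-condensed')

print(len(exact_matches), len(css_matches))
```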

In [98]:
def get_topic_descs(doc):
    desc_selection_class = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selection_class})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

`get_topic_descs` can be used to get the list of descriptions

In [99]:
def get_topic_urls(doc):
    link_selection_class = 'no-underline flex-1 d-flex flex-column'
    topic_link_tags = doc.find_all('a', {'class': link_selection_class})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls

`get_topic_urls` can be used to get the list of URLs

**Let's put this all together into a single function**

In [100]:
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)

`scrape_topics` can be used to get a table of topics with the `title`, `description` and `url` of each topic
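
Building a DataFrame from a dict of lists works only when every list has the same length; otherwise `pd.DataFrame` raises a `ValueError`. If one of the selectors stops matching, the lists can drift apart, so a length check with a clearer message may help. The helper below is a hypothetical sketch (the name `build_topics_frame` and the sample values are made up).

```python
import pandas as pd

def build_topics_frame(titles, descs, urls):
    """Combine three parallel lists into a DataFrame, checking they line up."""
    lengths = {len(titles), len(descs), len(urls)}
    if len(lengths) != 1:
        raise ValueError(
            'Mismatched lengths: {} titles, {} descriptions, {} urls'.format(
                len(titles), len(descs), len(urls)))
    return pd.DataFrame({'title': titles, 'description': descs, 'url': urls})

# Made-up sample values, for illustration only
topics_df = build_topics_frame(
    ['3D', 'Ajax'],
    ['3D modeling', 'Asynchronous requests'],
    ['https://github.com/topics/3d', 'https://github.com/topics/ajax'])
print(topics_df.shape)
```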

## Get the top 30 repositories from a topic page

TODO - explanation and step

In [101]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

`get_topic_page` can be used to download the page at `topic_url`, which lists the top repositories for the selected topic, and return it as a parsed document (`topic_doc`)

In [102]:
def parse_star_counts(star_str):
    star_str = star_str.strip()
    if star_str[-1] == 'k':
        return int(float(star_str[:-1]) * 1000)
    return int(star_str)

`parse_star_counts` can be used to convert the star count into an integer, e.g. `18k` --> `18000`
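
A slightly more defensive variant is sketched below. It is a hypothetical alternative (named `parse_star_count` to keep it distinct) that assumes counts might also arrive with an uppercase `K` or with thousands separators, and it uses `round` so that floating-point error in the multiplication cannot drop a star.

```python
def parse_star_count(star_str):
    """Convert a star count such as '69.7k' or '950' to an integer."""
    # Assumption: counts may contain surrounding whitespace or thousands separators
    text = star_str.strip().lower().replace(',', '')
    if text.endswith('k'):
        # round() guards against floating-point results that land just below a whole number
        return round(float(text[:-1]) * 1000)
    return int(text)

for sample in ['69.7k', ' 18K ', '950']:
    print(sample.strip(), '->', parse_star_count(sample))
```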

In [103]:
def get_repo_info(h3_tag, star_tag):
    # The h3 tag holds two links: the first is the user, the second is the repository
    a_tags = h3_tag.find_all('a')
    user_name = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    # Convert the star count text (e.g. '18k') to an integer
    stars = parse_star_counts(star_tag.text)
    base_url = 'https://github.com'
    repo_url = base_url + a_tags[1]['href']
    return user_name, repo_name, stars, repo_url

`get_repo_info` can be used to get the `user_name`, `repo_name`, `stars` and `repo_url` of a single repository in the selected topic
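
To make the expected tag layout concrete, here is a small sketch that runs the same extraction steps on made-up HTML: an `h3` containing a user link followed by a repository link, and a separate `span` holding the star count. The markup and the `stars` class name are assumptions modeled on what the function reads, not a copy of GitHub's page.

```python
from bs4 import BeautifulSoup

# Made-up HTML shaped the way get_repo_info expects its two arguments
html = """
<h3><a href="/mrdoob">mrdoob</a> / <a href="/mrdoob/three.js">three.js</a></h3>
<span class="stars">69.7k</span>
"""
doc = BeautifulSoup(html, 'html.parser')
h3_tag = doc.find('h3')
star_tag = doc.find('span')

# First link is the user, second link is the repository
links = h3_tag.find_all('a')
user_name = links[0].text.strip()
repo_name = links[1].text.strip()
repo_url = 'https://github.com' + links[1]['href']
star_text = star_tag.text.strip()

print(user_name, repo_name, star_text, repo_url)
```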

In [104]:
def get_topic_repos(topic_doc):
    
    # Get h3_tag repo
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class} )
    
    # Get star_tag
    star_tags = topic_doc.find_all('span', { 'class': 'Counter js-social-count'})
    
    # Get repo info
    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }
    
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    return pd.DataFrame(topic_repos_dict)

`get_topic_repos` can be used to get a table of repository info for the selected topic from its `topic_doc`

In [105]:
def scrape_topic_content(topic_url, path):
    # Skip topics whose CSV file has already been written
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    # Scrape the topic page and save its top repositories as a CSV file
    topic_content_df = get_topic_repos(get_topic_page(topic_url))
    topic_content_df.to_csv(path, index=False)

`scrape_topic_content` can be used to save the top repositories of the topic at `topic_url` as a CSV file at `path`
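
The skip-if-exists check is what makes the scraper safe to re-run: files written by an earlier run are left alone. The pattern is sketched below in isolation, using a temporary directory so nothing real is touched. The helper name `write_once` and the sample text are hypothetical.

```python
import os
import tempfile

def write_once(path, text):
    """Write text to path unless the file already exists. Returns True if written."""
    if os.path.exists(path):
        print('The file {} already exists. Skipping...'.format(path))
        return False
    with open(path, 'w') as f:
        f.write(text)
    return True

# Demonstrate with a throwaway directory so no real data is touched
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'example.csv')
    first = write_once(csv_path, 'username,repo_name\n')
    second = write_once(csv_path, 'this text is never written\n')
    with open(csv_path) as f:
        saved = f.read()

print(first, second)
```

One consequence of this design is that an existing file is never refreshed, so stale CSV files have to be deleted by hand before re-scraping a topic.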

Here is an example, using the topic URL at index 5 of the list:

In [106]:
get_topic_urls(doc)[5]

'https://github.com/topics/angular'

In [107]:
get_topic_repos(get_topic_page(get_topic_urls(doc)[5]))

Unnamed: 0,username,repo_name,stars,repo_url
0,justjavac,free-programming-books-zh_CN,94200,https://github.com/justjavac/free-programming-...
1,angular,angular,82600,https://github.com/angular/angular
2,storybookjs,storybook,72600,https://github.com/storybookjs/storybook
3,leonardomso,33-js-concepts,49900,https://github.com/leonardomso/33-js-concepts
4,ionic-team,ionic-framework,47600,https://github.com/ionic-team/ionic-framework
5,prettier,prettier,43300,https://github.com/prettier/prettier
6,SheetJS,sheetjs,30600,https://github.com/SheetJS/sheetjs
7,angular,angular-cli,25500,https://github.com/angular/angular-cli
8,angular,components,22800,https://github.com/angular/components
9,NativeScript,NativeScript,21400,https://github.com/NativeScript/NativeScript


## Putting it all together

- We have a function to get the list of topics
- We have a function to create a CSV file for scraped repos from a topics page
- Let's create a function to put them together

In [108]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    # scrape_topics returns a table of topics with the title, description and url of each topic
    
    os.makedirs('data', exist_ok=True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic_content(row['url'], 'data/{}.csv'.format(row['title']))
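
Because each topic title is used directly as a file name, a title containing a path separator or other unusual characters could produce an invalid path. The helper below is a hypothetical sketch of one way to sanitize titles first; the name `safe_file_name`, the set of allowed characters, and the last sample title are assumptions.

```python
import re

def safe_file_name(title):
    """Return title with characters that are risky in file names replaced by '_'."""
    # Keep letters, digits, underscore, dot, plus, hash and hyphen; replace the rest
    cleaned = re.sub(r'[^\w.+#-]+', '_', title.strip())
    return cleaned.strip('_')

for title in ['Ajax', 'C++', 'ASP.NET', 'a/b topic']:
    print(title, '->', 'data/{}.csv'.format(safe_file_name(title)))
```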

- Let's run `scrape_topics_repos()` to scrape the top repos for all the topics on the first page of https://github.com/topics and save them as CSV files
- To view a CSV file, use `pd.read_csv` with the path of the file

In [110]:
pd.read_csv('data/Ajax.csv')

Unnamed: 0,username,repo_name,stars,repo_url
0,ljianshu,Blog,7200,https://github.com/ljianshu/Blog
1,metafizzy,infinite-scroll,7100,https://github.com/metafizzy/infinite-scroll
2,developit,unfetch,5400,https://github.com/developit/unfetch
3,jquery-form,form,5100,https://github.com/jquery-form/form
4,olifolkerd,tabulator,4700,https://github.com/olifolkerd/tabulator
5,Studio-42,elFinder,4300,https://github.com/Studio-42/elFinder
6,ded,reqwest,2900,https://github.com/ded/reqwest
7,dwyl,learn-to-send-email-via-google-script-html-no-...,2800,https://github.com/dwyl/learn-to-send-email-vi...
8,elbywan,wretch,2400,https://github.com/elbywan/wretch
9,LeaVerou,bliss,2400,https://github.com/LeaVerou/bliss


----------------------------------------------------------------------------------------------------------------------------

## References and Future Work

Summary of what we did

- We scraped https://github.com/topics
- We got a list of topics. For each topic, we got the topic title, topic page URL and topic description
- For each topic, we got the top 30 repositories in the topic from the topic page
- For each repository, we grabbed the repository name, username, stars and repository URL
- For each topic, we created a CSV file containing this repository information

References to links you found useful

- https://imgur.com/
- https://github.com/pavitramandal37
 
Ideas for future work

- We can build a user interface that asks users for the topics they are interested in and shows them the top repositories with direct links.
- We can create a recommendation engine that suggests repositories based on the hashtags a user has used.
