# Scraping Top Repositories on GitHub


#### Here are the steps we are going to follow:

  1. We're going to scrape https://github.com/topics
  2. We'll get a list of topics. 
  3. For each topic, we'll get topic title, topic page URL and topic description
  4. For each topic, we'll get the top repositories listed on the first page of the topic
  5. For each repository, we'll grab the repo name, username, stars and repo URL
  6. For each topic we'll create a CSV file in the following format:
 
    username,repo_name,stars,repo_url
    mrdoob,three.js,69700,https://github.com/mrdoob/three.js
    libgdx,libgdx,18300,https://github.com/libgdx/libgdx

### Scraping the list of topics

  - Use requests to download the page
  - Use Beautiful Soup (bs4) to parse the page and extract the information
  - Convert the results to a pandas DataFrame

In [49]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_topic_page():
    topic_url = 'https://github.com/topics'
    # Downloading the page
    response = requests.get(topic_url)
    # Checking status
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    
    # Parse the page using BeautifulSoup 
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

In [50]:
doc = get_topic_page()

Let's build some helper functions to extract the information.

For the topic titles, we can select the 'p' tags with the matching class name:

In [51]:
def get_topic_titles(doc):
    topic_title_tags = doc.find_all('p', class_='f3 lh-condensed mb-0 mt-1 Link--primary')
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

In [52]:
# get_topic_titles returns the list of topic titles

titles = get_topic_titles(doc)

In [53]:
len(titles)

30

In [54]:
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']
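
Passing a multi-class string to `class_` matches only when the tag's class attribute is exactly that string, so the lookup can break if the page reorders or adds classes. Below is a minimal sketch of a looser alternative using a CSS selector. The HTML snippet is made up for illustration and is not GitHub's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two topic titles whose class lists differ in order.
sample_html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="lh-condensed f3 mb-0 mt-1 Link--primary">Ajax</p>
"""
sample_doc = BeautifulSoup(sample_html, 'html.parser')

# Exact-string match: only tags whose class attribute equals this string.
exact = sample_doc.find_all('p', class_='f3 lh-condensed mb-0 mt-1 Link--primary')

# CSS selector: any <p> carrying both classes, regardless of order.
loose = sample_doc.select('p.f3.lh-condensed')

exact_titles = [tag.text for tag in exact]
loose_titles = [tag.text for tag in loose]
```

Which classes are stable enough to select on is a judgment call that depends on the live page, so the selector here is only an example.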

Similarly, we can write functions to get the topic URLs and descriptions.

In [55]:
def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a',class_='no-underline flex-1 d-flex flex-column')
    topic_urls = []
    base_url = 'http://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls

In [56]:
urls = get_topic_urls(doc)

In [57]:
urls[:5]

['http://github.com/topics/3d',
 'http://github.com/topics/ajax',
 'http://github.com/topics/algorithm',
 'http://github.com/topics/amphp',
 'http://github.com/topics/android']

In [58]:
def get_topic_desc(doc):
    topic_desc_tags = doc.find_all('p', class_='f5 color-fg-muted mb-0 mt-1')
    topic_descs = []
    for desc in topic_desc_tags:
        topic_descs.append(desc.text.strip())
    return topic_descs

In [59]:
desc = get_topic_desc(doc)

In [60]:
desc[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

Let us now put them into a single function

In [76]:
def scrape_topics():
    topic_url = 'https://github.com/topics'
    response = requests.get(topic_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title' : get_topic_titles(doc),
        'description': get_topic_desc(doc),
        'url' : get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)
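
`scrape_topics` builds the data frame from three independently collected lists, and `pd.DataFrame` raises a `ValueError` when those lists differ in length (for instance, if one topic had no description tag). Below is a small sketch of a guard that could run before the frame is built. `check_same_length` is a hypothetical helper and the values are made up.

```python
def check_same_length(columns):
    """Raise ValueError if the lists in `columns` (name -> list) differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return columns

# Example with made-up values; the result could be passed to pd.DataFrame.
topics = check_same_length({
    'title': ['3D', 'Ajax'],
    'description': ['3D modeling ...', 'Ajax is ...'],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
})
```

The error message lists the length of each column, which makes it easier to see which selector stopped matching.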

## Getting the top repositories from a topic page

In [77]:
def get_topic_page(topic_page_url):
    # Downloading the page
    response = requests.get(topic_page_url)
    # Checking status
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_page_url))
    
    # Parse the page using BeautifulSoup 
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

In [78]:
doc = get_topic_page('https://github.com/topics/3d')

Now, on this page we need to get the username and project name of each repository. Inspecting the page shows that they are contained in an 'h3' tag in the HTML code. Both the username and the repo name are in 'a' tags, so we first find all the 'a' tags inside the 'h3' and then use the index to separate them: the first is the username and the second is the repo name.
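
To make the indexing concrete, here is a minimal sketch run against a simplified, made-up 'h3' snippet. The real markup on GitHub may contain extra attributes and nesting.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified version of one repository heading.
sample_html = """
<h3 class="f3 color-fg-muted text-normal lh-condensed">
  <a href="/mrdoob">mrdoob</a> /
  <a href="/mrdoob/three.js">three.js</a>
</h3>
"""
h3_tag = BeautifulSoup(sample_html, 'html.parser').find('h3')
a_tags = h3_tag.find_all('a')

username = a_tags[0].text.strip()    # first link: the owner
repo_name = a_tags[1].text.strip()   # second link: the repository
repo_path = a_tags[1]['href']        # relative URL of the repository
```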

In [90]:
def get_repo_info(h3_tag, star_tag):
    base_url = 'http://github.com'
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = star_count(star_tag.text.strip())
    return username,repo_name,stars,repo_url

In [91]:
def get_topic_repos(topic_doc):
    
    # Find all the repo tags that contains username, repo URL and repo name
    repo_tags = topic_doc.find_all('h3', class_='f3 color-fg-muted text-normal lh-condensed')
    
    # Find all the star tags
    star_tags = topic_doc.find_all('span', class_='Counter js-social-count')
    
    topic_repo_dict = {
        'username' : [],
        'repo_name' : [],
        'stars' : [],
        'repo_url' : []
    }
    
    # Get repo info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i],star_tags[i])
        topic_repo_dict['username'].append(repo_info[0])
        topic_repo_dict['repo_name'].append(repo_info[1])
        topic_repo_dict['stars'].append(repo_info[2])
        topic_repo_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topic_repo_dict)

#### The following function converts the star count string into an integer; for example, '77.5k' becomes 77500

In [92]:
def star_count(star_str):
    star_str = star_str.strip()
    if star_str[-1] == 'k':
        return int(float(star_str[:-1]) * 1000)
    return int(star_str)
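
The function above assumes a lowercase 'k' suffix and no thousands separators. Below is a slightly more defensive sketch, under the assumption that counts might also appear as '1,234' or with an 'm' suffix (neither is confirmed by the pages scraped here). `parse_count` is a hypothetical name.

```python
def parse_count(count_str):
    """Convert strings such as '77.5k', '1,234' or '980' to an int."""
    text = count_str.strip().lower().replace(',', '')
    multipliers = {'k': 1_000, 'm': 1_000_000}
    if text and text[-1] in multipliers:
        # round() guards against float results that land just below
        # the intended whole number.
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)
```

An empty or non-numeric string still raises a `ValueError`, which seems preferable to silently recording a wrong count.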

In [93]:
import os

def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print('The file {} already exists. Skipping...'.format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=False)

## Putting it all together
  1. We have a function to get the list of topics
  2. We have a function to create a CSV file of the scraped repos from a topic page
  3. Let's create a function to put them together

In [94]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topic_df = scrape_topics()
    
    os.makedirs('data1', exist_ok=True)
    for index, row in topic_df.iterrows():
        print('Scraping top repositories')
        scrape_topic(row['url'], 'data1/{}.csv'.format(row['title']))
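
The topic title is used directly as the file name, which could fail for a title containing a path separator such as '/'. Below is a sketch of one way to sanitize titles first. `safe_filename` is a hypothetical helper, and the set of allowed characters is an assumption.

```python
import re

def safe_filename(title):
    """Replace runs of characters other than word characters, '.', '+', '#', '-' with '_'."""
    return re.sub(r'[^\w.+#-]+', '_', title.strip())
```

With such a helper, the path could be built as `'data1/{}.csv'.format(safe_filename(row['title']))`.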

Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics

In [95]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories
Scraping top repositories
Scraping top repositories
Scraping top repositories
Scraping top repositories
Scraping top repositories
Scraping top repositories
Scraping top repositories
Scraping top repositories
Scraping top repositories
Scraping top repositories
Scraping top repositories
Scraping top repositories
Scraping top repositories
Scraping top repositories
Scraping top repositories
Scraping top repositories
Scraping top repositories
Scraping top repositories
Scraping top repositories
Scraping top repositories
Scraping top repositories
Scraping top repositories
Scraping top repositories
Scraping top repositories
Scraping top repositories
Scraping top repositories
Scraping top repositories
Scraping top repositories
Scraping top repositories


We can check that the CSVs were created properly

In [98]:
# Let us read one of the files

df1 = pd.read_csv('data1/3D.csv')

In [99]:
df1.head()

|   | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 77600 | http://github.com/mrdoob/three.js |
| 1 | libgdx | libgdx | 19500 | http://github.com/libgdx/libgdx |
| 2 | pmndrs | react-three-fiber | 16300 | http://github.com/pmndrs/react-three-fiber |
| 3 | BabylonJS | Babylon.js | 15600 | http://github.com/BabylonJS/Babylon.js |
| 4 | aframevr | aframe | 13500 | http://github.com/aframevr/aframe |


In [100]:
df2 = pd.read_csv('data1/Ajax.csv')
df2.head()

|   | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | metafizzy | infinite-scroll | 7000 | http://github.com/metafizzy/infinite-scroll |
| 1 | ljianshu | Blog | 6900 | http://github.com/ljianshu/Blog |
| 2 | developit | unfetch | 5300 | http://github.com/developit/unfetch |
| 3 | jquery-form | form | 5100 | http://github.com/jquery-form/form |
| 4 | olifolkerd | tabulator | 4300 | http://github.com/olifolkerd/tabulator |
