## Scraping Top Repositories for Topics on GitHub



### What is Web Scraping?

Web scraping is a method used to automatically extract data from websites. It involves retrieving HTML content from web pages and then parsing and extracting the desired information. This technique is valuable for various purposes, including data collection, content aggregation, monitoring, automation, and building APIs.
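The parse-and-extract step can be illustrated without any network access. The sketch below is an assumption-laden stand-in: it uses the standard library's `html.parser` rather than Beautiful Soup, and the `SAMPLE_HTML` string, the `topic` class name, and the `TopicParser` class are all hypothetical, invented only for illustration.

```python
from html.parser import HTMLParser

# Hypothetical inline page standing in for a downloaded HTML document.
SAMPLE_HTML = """
<html><body>
  <p class="topic">3D</p>
  <p class="note">not a topic</p>
  <p class="topic">Ajax</p>
</body></html>
"""

class TopicParser(HTMLParser):
    """Collect the text of every <p> tag whose class is 'topic'."""

    def __init__(self):
        super().__init__()
        self.in_topic = False
        self.topics = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "p" and dict(attrs).get("class") == "topic":
            self.in_topic = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_topic = False

    def handle_data(self, data):
        # Only keep text that appears inside a matching <p> tag
        if self.in_topic and data.strip():
            self.topics.append(data.strip())

parser = TopicParser()
parser.feed(SAMPLE_HTML)
topics = parser.topics
print(topics)
```

Beautiful Soup, used in the rest of this project, wraps the same idea in a search API so that no hand-written state tracking is needed.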

### Problem Statement

GitHub is a widely used platform for hosting and collaborating on software projects using version control. However, navigating and extracting specific information from GitHub's vast repository of projects can be time-consuming. In this project, we aim to address this issue by utilizing web scraping techniques to extract data from GitHub's topics page. Our goal is to automatically gather information about topics and their top repositories, including repository name, username, stars, and repository URL. By automating this process, we can streamline data collection and make it more efficient for various purposes, such as research, analysis, and monitoring trends in software development.

### Tools used: Python, requests, Beautiful Soup, pandas

### Steps:

- Scrape the GitHub topics page to retrieve a list of topics.
- For each topic, extract the topic title, topic page URL, and topic description.
- For each topic, navigate to its page and scrape the top 25 repositories within that topic.
- For each repository, extract the repository name, username, number of stars, and repository URL.
- Compile this information into a CSV file with the following format:

```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```
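As a rough sketch of how rows in this format could be produced, the example below serializes the two sample rows shown above with the standard library's `csv` module. The `rows_to_csv` helper is hypothetical and writes to an in-memory buffer rather than a file; the project itself builds its CSV files through pandas.

```python
import csv
import io

# Sample rows taken from the format shown above.
rows = [
    {"Repo Name": "three.js", "Username": "mrdoob", "Stars": 69700,
     "Repo URL": "https://github.com/mrdoob/three.js"},
    {"Repo Name": "libgdx", "Username": "libgdx", "Stars": 18300,
     "Repo URL": "https://github.com/libgdx/libgdx"},
]

def rows_to_csv(rows):
    """Return the rows serialized as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["Repo Name", "Username", "Stars", "Repo URL"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = rows_to_csv(rows)
print(csv_text)
```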

### Scrape the list of topics from GitHub

- Use requests to download the page.
- Use Beautiful Soup (BS4) to parse the page and extract information.
- Convert the result to a pandas DataFrame.

Let's write a function to download the page.

In [151]:
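import requests
import pandas as pd
from bs4 import BeautifulSoup

def get_topic_page():
    # Download the GitHub topics listing page
    topic_url = 'https://github.com/topics'
    response = requests.get(topic_url)
    # Fail early if the request did not succeed
    if response.status_code != 200:
        raise Exception(f"Failed to load page {topic_url}")

    # Parse the HTML so it can be searched with Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc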

In [152]:
topic_doc = get_topic_page()

Let's create some helper functions to parse information from the page.

To get topic titles, we can pick p tags with the class ...

In [153]:
def get_topic_titles(doc):
    topic_title = doc.find_all('p', class_='f3 lh-condensed mb-0 mt-1 Link--primary')
    topic_titles = []
    
    for tag in topic_title:
        topic_titles.append(tag.text)
    return topic_titles

`get_topic_titles` returns the list of topic titles found on the page.

Similarly, we can define functions for the descriptions and URLs.

In [154]:
def get_topic_descs(doc):
    topic_description = doc.find_all('p', class_='f5 color-fg-muted mb-0 mt-1')
    topic_desc = []
    
    for tag in topic_description:
        topic_desc.append(tag.text.strip())
    return topic_desc

def get_topic_urls(doc):
    # Select the <a> tags that link to each topic page
    topic_link = doc.find_all('a', class_='no-underline flex-1 d-flex flex-column')
    topic_urls = []
    base_url = 'https://github.com'

    # Each href is a relative path, so prefix it with the site root
    for tag in topic_link:
        topic_urls.append(base_url + tag['href'])
    return topic_urls
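String concatenation works when every `href` is a relative path. As a hedged alternative, the standard library's `urllib.parse.urljoin` also copes with links that are already absolute. The `hrefs` list below is made-up sample input, not scraped data.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"

# Hypothetical href values: two relative paths and one absolute URL.
hrefs = ["/topics/3d", "/topics/ajax", "https://github.com/topics/android"]

# urljoin resolves relative paths against the base and leaves
# absolute URLs unchanged.
topic_urls = [urljoin(BASE_URL, href) for href in hrefs]
print(topic_urls)
```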

Let's put this all together into a single function

In [155]:
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception(f"Failed to load page {topics_url}")
    doc = BeautifulSoup(response.text, 'html.parser')
    
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    
    return pd.DataFrame(topics_dict)

### Get the top repositories from a topic page

In [156]:
def get_topic_page(topic_url):
    response = requests.get(topic_url)
    if response.status_code != 200:
        raise Exception(f"Failed to load page {topic_url}")
    
    topic_doc = BeautifulSoup(response.text, 'html.parser')  
    return topic_doc

In [157]:
doc = get_topic_page('https://github.com/topics/3d')

In [158]:
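def parse_ratings_count(ratings):
    # Tag text may carry surrounding whitespace, so strip it first
    ratings = ratings.text.strip()
    # A trailing 'k' means thousands, e.g. '69.7k'
    if ratings[-1] == 'k':
        return int(float(ratings[:-1]) * 1000)
    return int(ratings)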
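To try the conversion logic without a Beautiful Soup tag, here is a minimal string-based sketch. `parse_star_count` is a hypothetical variant, not the function used in this project; its handling of an `m` suffix and of thousands separators is an assumption about formats the page might show.

```python
def parse_star_count(text):
    """Convert strings such as '69.7k', '1,234' or '512' to an int."""
    cleaned = text.strip().lower().replace(",", "")
    multipliers = {"k": 1_000, "m": 1_000_000}
    suffix = cleaned[-1:]
    if suffix in multipliers:
        # Round instead of truncating, because float multiplication can
        # land slightly below the intended whole number.
        return int(round(float(cleaned[:-1]) * multipliers[suffix]))
    return int(cleaned)

examples = ["69.7k", " 512 ", "1,234", "1.2m"]
parsed = [parse_star_count(value) for value in examples]
print(parsed)
```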

In [159]:
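def get_repo_info(repo, ratings):
    # The h3 tag holds two links: the owner first, then the repository
    base_url = 'https://github.com'
    a_tags = repo.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    # The repository href is relative, so prefix it with the site root
    url = base_url + a_tags[1]['href']
    rating = parse_ratings_count(ratings)
    return username, repo_name, rating, url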

In [160]:
def get_topic_repos(topic_doc):
    
    # Get the h3 tags containing repo title, repo URL and username
    repo = topic_doc.find_all('h3',class_='f3 color-fg-muted text-normal lh-condensed')
    
    # Get ratings tags
    ratings = topic_doc.find_all('span',class_='Counter js-social-count')
    

    repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }
    
    # Get repo info
    for i in range(len(repo)):
        repo_info = get_repo_info(repo[i], ratings[i])
        repos_dict['username'].append(repo_info[0])
        repos_dict['repo_name'].append(repo_info[1])
        repos_dict['stars'].append(repo_info[2])
        repos_dict['repo_url'].append(repo_info[3])
    
    return pd.DataFrame(repos_dict)
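`get_topic_repos` pairs `repo[i]` with `ratings[i]` by position, which assumes both `find_all` calls return lists of the same length. The sketch below shows one way to make that assumption explicit. `pair_repos_with_stars` is a hypothetical helper, and plain strings stand in for the tags.

```python
def pair_repos_with_stars(repo_tags, star_tags):
    """Pair each repo tag with its star tag, failing loudly on a mismatch."""
    if len(repo_tags) != len(star_tags):
        raise ValueError(
            f"Found {len(repo_tags)} repos but {len(star_tags)} star counts"
        )
    return list(zip(repo_tags, star_tags))

# Strings stand in for the h3 and span tags found on a topic page.
pairs = pair_repos_with_stars(["three.js", "libgdx"], ["69.7k", "18.3k"])
print(pairs)
```

Failing with a clear message is easier to debug than an `IndexError` in the middle of the loop, or a silently misaligned table.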

In [161]:
import os

def scrape_topic(url, path):
    # Skip topics that were already scraped in a previous run
    if os.path.exists(path):
        print(f"The file {path} already exists. Skipping...")
        return
    topic_df = get_topic_repos(get_topic_page(url))
    topic_df.to_csv(path, index=False)
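The skip-if-exists pattern only saves work when the function returns before doing anything else. A minimal sketch of the pattern on its own, using a temporary directory instead of real scraped data (`write_once` is a hypothetical helper):

```python
import os
import tempfile

def write_once(path, text):
    """Write text to path unless the file already exists.

    Returns True if the file was written, False if it was skipped.
    """
    if os.path.exists(path):
        print(f"The file {path} already exists. Skipping...")
        return False
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return True

with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "3D.csv")
    first = write_once(target, "Repo Name,Username,Stars,Repo URL\n")
    second = write_once(target, "this would overwrite the file\n")
    with open(target, encoding="utf-8") as f:
        contents = f.read()

print(first, second)
```

The second call leaves the file untouched, which is what allows an interrupted scraping run to be resumed safely.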

### Putting it all together

- We have a function to get the list of topics
- We have a function to create a CSV file for scraped repos from a topics page
- Let's create a function to put them together

In [162]:
def scrape_topic_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('data', exist_ok=True)
    
    for index, row in topics_df.iterrows():
        print(f"Scraping top repositories for {row['title']}")
        scrape_topic(row['url'], 'data/{}'.format(row['title']))
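Topic titles such as `Awesome List` or `ASP.NET` are used directly as file names here. The sketch below shows one possible way to turn a title into a more predictable CSV path. `topic_filename` is a hypothetical helper that is not part of the code above, and both the `.csv` suffix and the replacement rule are assumptions.

```python
import re

def topic_filename(title, directory="data"):
    """Build a CSV path for a topic title, replacing unsafe characters."""
    # Keep letters, digits, dots, underscores and hyphens; replace
    # every other run of characters with a single underscore.
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", title.strip())
    return f"{directory}/{safe}.csv"

paths = [topic_filename(t) for t in ["3D", "ASP.NET", "Awesome List"]]
print(paths)
```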

Let's run `scrape_topic_repos` to scrape the top repos for all the topics on the first page of https://github.com/topics

In [163]:
scrape_topic_repos()

Scraping list of topics
Scraping top repositories for 3D
The file data/3D already exists. Skipping...
Scraping top repositories for Ajax
The file data/Ajax already exists. Skipping...
Scraping top repositories for Algorithm
The file data/Algorithm already exists. Skipping...
Scraping top repositories for Amp
The file data/Amp already exists. Skipping...
Scraping top repositories for Android
The file data/Android already exists. Skipping...
Scraping top repositories for Angular
The file data/Angular already exists. Skipping...
Scraping top repositories for Ansible
The file data/Ansible already exists. Skipping...
Scraping top repositories for API
The file data/API already exists. Skipping...
Scraping top repositories for Arduino
The file data/Arduino already exists. Skipping...
Scraping top repositories for ASP.NET
The file data/ASP.NET already exists. Skipping...
Scraping top repositories for Atom
The file data/Atom already exists. Skipping...
Scraping top repositories for Awesome List

We can check that the CSVs were created properly