# Scraping Top Repositories for Topics on GitHub

#### Project Outline

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top 20 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:

```
username,repo_name,stars,repo_url
mrdoob,three.js,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```
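
As a rough illustration of the target file layout, the sketch below writes two sample rows with the standard library's `csv` module. It uses the column names that the scraping code later in this notebook assigns (`username`, `repo_name`, `stars`, `repo_url`); the helper name `repos_to_csv` is hypothetical and not part of the notebook.

```python
import csv
import io

def repos_to_csv(rows):
    """Serialize (username, repo_name, stars, repo_url) tuples as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    # Header row, followed by one line per repository
    writer.writerow(['username', 'repo_name', 'stars', 'repo_url'])
    writer.writerows(rows)
    return buffer.getvalue()

sample = [
    ('mrdoob', 'three.js', 69700, 'https://github.com/mrdoob/three.js'),
    ('libgdx', 'libgdx', 18300, 'https://github.com/libgdx/libgdx'),
]
print(repos_to_csv(sample))
```

The notebook itself relies on `DataFrame.to_csv` for this step; the `csv` module is used here only to keep the sketch dependency-free.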

In [1]:
# import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os

### Scrape the list of topics from GitHub

- Use `requests` to download the page
- Use BeautifulSoup (`bs4`) to parse the page and extract information
- Convert the results to a pandas DataFrame

In [16]:
def get_topics_page():
    # Define the URL for the GitHub topics page
    topics_url = 'https://github.com/topics'
    
    # Send a GET request to the topics URL
    response = requests.get(topics_url)
    
    # Check if the response status code is not 200 (OK)
    if response.status_code != 200:
        # Raise an exception if the page failed to load
        raise Exception('Failed to load page {}'.format(topics_url))
    
    # Parse the HTML content of the response using BeautifulSoup
    doc = BeautifulSoup(response.text, 'html.parser')
    
    # Return the parsed document
    return doc


In [17]:
doc = get_topics_page()

Let's create some helper functions to parse information from the page.

To get topic titles, we can pick `p` tags with the `class` ...

![](https://i.imgur.com/OnzIdyP.png)
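
One detail is worth knowing before relying on a long class string. When the value given for `class` contains spaces, BeautifulSoup compares it against the tag's full `class` attribute as written, so the same classes in a different order will not match. The sketch below shows this on a made-up HTML snippet (the markup is an assumption for illustration, not GitHub's actual page) and uses `select` with a CSS selector as an order-independent alternative.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <p> lists the same classes in another order
html = '''
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="mb-0 f3 lh-condensed mt-1 Link--primary">Ajax</p>
<p class="f5 color-fg-muted mb-0 mt-1">A description</p>
'''
soup = BeautifulSoup(html, 'html.parser')

# Exact-string match on the class attribute: order matters
exact_tags = soup.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})
exact_titles = [tag.text for tag in exact_tags]

# CSS selector: matches any <p> carrying both classes, in any order
css_titles = [tag.text for tag in soup.select('p.f3.Link--primary')]

print(exact_titles)  # ['3D']
print(css_titles)    # ['3D', 'Ajax']
```

Either approach breaks if the page's class names change, so the selector strings should be re-checked against the live page.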

In [18]:
def get_topic_titles(doc):
    # Define the CSS class used for topic title paragraphs on the GitHub topics page
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    
    # Find all <p> tags with the specified class in the provided document
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    
    # Initialize an empty list to store the extracted topic titles
    topic_titles = []
    
    # Iterate through each tag found
    for tag in topic_title_tags:
        # Append the text content of each tag to the topic_titles list
        topic_titles.append(tag.text)
    
    # Return the list of topic titles
    return topic_titles


In [19]:
# `get_topic_titles` can be used to get the list of titles
titles = get_topic_titles(doc)

In [20]:
len(titles)

30

In [21]:
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

Similarly, we define functions for descriptions and URLs.

In [33]:
def get_topic_descs(doc):
    # Define the CSS class used for topic description paragraphs on the GitHub topics page
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    
    # Find all <p> tags with the specified class in the provided document
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    
    # Initialize an empty list to store the extracted topic descriptions
    topic_descs = []
    
    # Iterate through each tag found
    for tag in topic_desc_tags:
        # Append the stripped text content of each tag to the topic_descs list
        topic_descs.append(tag.text.strip())
    
    # Return the list of topic descriptions
    return topic_descs

In [34]:
descs = get_topic_descs(doc)

In [35]:
len(descs)

30

In [36]:
descs[:5]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [37]:
def get_topic_urls(doc):
    # Find all <a> tags with the specified class used for topic URLs on the GitHub topics page
    topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})
    
    # Initialize an empty list to store the full topic URLs
    topic_urls = []
    
    # Base GitHub URL to be prefixed to relative topic URLs
    base_url = 'https://github.com'
    
    # Iterate through each <a> tag found
    for tag in topic_link_tags:
        # Append the full URL by combining the base URL with the href attribute of the tag
        topic_urls.append(base_url + tag['href'])
    
    # Return the list of full topic URLs
    return topic_urls

In [38]:
urls = get_topic_urls(doc)

In [39]:
len(urls)

30

In [40]:
urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']
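
These URLs come from prefixing each relative `href` with the base URL. As a hedged alternative to string concatenation, the standard library's `urllib.parse.urljoin` resolves relative paths the same way and leaves already-absolute links untouched. The helper name `to_absolute_url` is hypothetical, not part of the notebook.

```python
from urllib.parse import urljoin

BASE_URL = 'https://github.com'

def to_absolute_url(href, base_url=BASE_URL):
    """Resolve an href against the base URL (hypothetical helper)."""
    return urljoin(base_url, href)

# A relative href is resolved against the base URL
print(to_absolute_url('/topics/3d'))
# An absolute href is returned unchanged
print(to_absolute_url('https://github.com/topics/ajax'))
```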

Let's put this all together into a single function.

In [41]:
def scrape_topics():
    # Download and parse the topics page (raises an exception if the request fails)
    doc = get_topics_page()

    # Collect topic titles, descriptions, and URLs as DataFrame columns
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }

    # Convert the topics dictionary into a DataFrame and return it
    return pd.DataFrame(topics_dict)

In [42]:
topics_df = scrape_topics()

In [43]:
topics_df

,title,description,url
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


## Get the top 20 repositories from a topic page


In [84]:
def get_topic_page(topic_url):
    # Download the page from the given topic URL
    response = requests.get(topic_url)
    
    # Check if the response is successful (status code 200)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    
    # Parse the HTML content of the response using BeautifulSoup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    
    return topic_doc

def parse_star_count(stars_str):
    # Remove leading and trailing whitespace from the stars string
    stars_str = stars_str.strip()

    # A trailing 'k' means thousands, e.g. '69.7k' -> 69700
    if stars_str.endswith('k'):
        # Round before converting: int() alone truncates, so a float product
        # that lands just below a whole number would come out one too small
        return int(round(float(stars_str[:-1]) * 1000))

    # Otherwise the string is already a plain integer
    return int(stars_str)

def get_repo_info(h3_tag, star_tag):
    # Base URL for GitHub
    base_url = 'https://github.com'
    
    # Extract the username and repository name from the h3 tag
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()  # Extract username
    repo_name = a_tags[1].text.strip()  # Extract repository name
    
    # Extract the repository URL
    repo_url = base_url + a_tags[1]['href']
    
    # Extract star count using the helper function
    stars = parse_star_count(star_tag.text.strip())
    
    return username, repo_name, stars, repo_url

def get_topic_repos(topic_doc):
    # Define the class used to identify repository title tags
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    
    # Get all h3 tags containing repository title, URL, and username
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})
    
    # Get all star count tags
    star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})
    
    # Initialize a dictionary to store repository information
    topic_repos_dict = {'username': [], 'repo_name': [], 'stars': [], 'repo_url': []}

    # Pair each repository tag with its star tag (zip stops at the shorter list)
    for h3_tag, star_tag in zip(repo_tags, star_tags):
        username, repo_name, stars, repo_url = get_repo_info(h3_tag, star_tag)
        topic_repos_dict['username'].append(username)
        topic_repos_dict['repo_name'].append(repo_name)
        topic_repos_dict['stars'].append(stars)
        topic_repos_dict['repo_url'].append(repo_url)
        
    # Convert the dictionary to a DataFrame and return it
    return pd.DataFrame(topic_repos_dict)

def scrape_topic(topic_url, path):
    # Check if the CSV file already exists
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    
    # Get repository data and convert to DataFrame
    topic_df = get_topic_repos(get_topic_page(topic_url))
    
    # Save the DataFrame to a CSV file at the specified path, without the index column
    topic_df.to_csv(path, index=False)
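
`parse_star_count` goes through `float` for abbreviated counts such as `69.7k`. Binary floats cannot represent most decimal fractions exactly, so a decimal-based variant is one way to keep the arithmetic exact. The sketch below is an alternative under stated assumptions, not the notebook's method: the name `parse_star_count_exact` is hypothetical, and the comma handling assumes counts might appear as `1,234`, which the original code does not cover.

```python
from decimal import Decimal

def parse_star_count_exact(stars_str):
    """Convert strings such as '69.7k', '1,234' or ' 987 ' to an int."""
    # Normalize: trim whitespace, accept 'K' as well as 'k', drop commas
    text = stars_str.strip().lower().replace(',', '')
    if text.endswith('k'):
        # Decimal keeps '69.7' exact, so multiplying by 1000 gives exactly 69700
        return int(Decimal(text[:-1]) * 1000)
    return int(text)

print(parse_star_count_exact('69.7k'))  # 69700
print(parse_star_count_exact('1,234'))  # 1234
```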

## Putting it all together

- We have a function to get the list of topics
- We have a function to create a CSV file for the repos scraped from a topic page
- Let's create a function to put them together


In [85]:
def get_topic_titles(doc):
    # Define the CSS class used for topic title paragraphs on the GitHub topics page
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    
    # Find all <p> tags with the specified class in the provided document
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    
    # Initialize an empty list to store the extracted topic titles
    topic_titles = []
    
    # Iterate through each tag found
    for tag in topic_title_tags:
        # Append the text content of each tag to the topic_titles list
        topic_titles.append(tag.text)
    
    # Return the list of topic titles
    return topic_titles

def get_topic_descs(doc):
    # Define the CSS class used for topic description paragraphs on the GitHub topics page
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    
    # Find all <p> tags with the specified class in the provided document
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    
    # Initialize an empty list to store the extracted topic descriptions
    topic_descs = []
    
    # Iterate through each tag found
    for tag in topic_desc_tags:
        # Append the stripped text content of each tag to the topic_descs list
        topic_descs.append(tag.text.strip())
        
    # Return the list of topic descriptions
    return topic_descs

def get_topic_urls(doc):
    # Find all <a> tags with the class used for topic links on the GitHub topics page
    topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})
    
    # Initialize an empty list to store the full topic URLs
    topic_urls = []
    
    # Base URL for GitHub
    base_url = 'https://github.com'
    
    # Iterate through each <a> tag found
    for tag in topic_link_tags:
        # Append the full URL by combining the base URL with the href attribute of the tag
        topic_urls.append(base_url + tag['href'])
    
    # Return the list of full topic URLs
    return topic_urls

def scrape_topics():
    # Define the URL for the GitHub topics page
    topics_url = 'https://github.com/topics'
    
    # Send a GET request to the topics URL
    response = requests.get(topics_url)
    
    # Check if the response status code is not 200 (OK)
    if response.status_code != 200:
        # Raise an exception if the page failed to load
        raise Exception('Failed to load page {}'.format(topics_url))
    
    # Parse the page using BeautifulSoup
    doc = BeautifulSoup(response.text, 'html.parser')
    
    # Get titles, descriptions, and URLs from the parsed document
    titles = get_topic_titles(doc)
    descs = get_topic_descs(doc)
    urls = get_topic_urls(doc)
    
    # Print lengths for debugging purposes
    print(f"Titles: {len(titles)}, Descriptions: {len(descs)}, URLs: {len(urls)}")
    
    # Ensure the lists have the same length
    min_length = min(len(titles), len(descs), len(urls))
    
    # Trim each list to the minimum length to ensure they match
    titles = titles[:min_length]
    descs = descs[:min_length]
    urls = urls[:min_length]
    
    # Create a dictionary to organize the topic data
    topics_dict = {
        'title': titles,
        'description': descs,
        'url': urls
    }
    
    # Convert the topics dictionary into a DataFrame and return it
    return pd.DataFrame(topics_dict)
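
Trimming every list to the shortest length keeps the DataFrame constructor from failing, but if one selector misses an element, the remaining rows can end up pairing a title with the wrong description or URL. A stricter option, sketched below with a hypothetical helper `combine_topic_columns`, is to fail loudly when the lengths differ.

```python
def combine_topic_columns(titles, descs, urls):
    """Zip three equal-length lists into row dicts, or raise on a mismatch."""
    lengths = {'title': len(titles), 'description': len(descs), 'url': len(urls)}
    if len(set(lengths.values())) > 1:
        # Report all three lengths so the failing selector is easy to spot
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return [
        {'title': title, 'description': desc, 'url': url}
        for title, desc, url in zip(titles, descs, urls)
    ]

rows = combine_topic_columns(
    ['3D', 'Ajax'],
    ['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
     'Ajax is a technique for creating interactive web applications.'],
    ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
)
print(rows[1]['title'], rows[1]['url'])
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame`.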


In [86]:
def scrape_topics_repos():
    # Print message indicating the start of topic scraping
    print('Scraping list of topics')
    
    # Call the scrape_topics function to get the DataFrame of topics
    topics_df = scrape_topics()
    
    # Create a 'data' directory if it doesn't exist to store CSV files
    os.makedirs('data', exist_ok=True)
    
    # Iterate through each row in the topics DataFrame
    for index, row in topics_df.iterrows():
        # Print message indicating which topic's top repositories are being scraped
        print('Scraping top repositories for "{}"'.format(row['title']))
        
        # Call the scrape_topic function to scrape top repositories for the current topic
        # Save the results as a CSV file named after the topic
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))
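
Topic titles are used directly as file names here. Titles may contain spaces or characters that are awkward in paths, so one option is to name each file after the last segment of the topic URL instead. This is a sketch under that assumption: `topic_csv_path` is a hypothetical helper, and adopting it would change the file names the notebook produces.

```python
from urllib.parse import urlparse

def topic_csv_path(topic_url, data_dir='data'):
    """Build a CSV path from the last path segment of a topic URL."""
    # '/topics/amphp' -> 'amphp'; a trailing slash is ignored
    slug = urlparse(topic_url).path.rstrip('/').rsplit('/', 1)[-1]
    if not slug:
        raise ValueError('No topic slug found in {!r}'.format(topic_url))
    return '{}/{}.csv'.format(data_dir, slug)

print(topic_csv_path('https://github.com/topics/amphp'))  # data/amphp.csv
```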

Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics

In [87]:
scrape_topics_repos()

Scraping list of topics
Titles: 30, Descriptions: 30, URLs: 30
Scraping top repositories for "3D"
The file data/3D.csv already exists. Skipping...
Scraping top repositories for "Ajax"
The file data/Ajax.csv already exists. Skipping...
Scraping top repositories for "Algorithm"
The file data/Algorithm.csv already exists. Skipping...
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "C