# Scraping the Top 30 Repositories for each Topic on GitHub

![Github Image](https://preview.redd.it/g38817mqb1361.png?width=960&crop=smart&auto=webp&s=063b72aee824ef4f41cd58b3944b877d8a7f23e8)



*Author: [Lafir](https://www.linkedin.com/in/lafir)*

## Introduction: 

In this project, we will scrape the top 30 repositories for each topic listed on GitHub using the `requests` and `BeautifulSoup` libraries.

### Project Outline:

- Scrape the paginated webpage https://github.com/topics to get the full list of topics available on GitHub
- Use the `requests` library to download the pages and the `BeautifulSoup` library to parse them and extract the information
- Create a dataframe for the list of topics containing the topic title, topic page URL, and topic description

- With the dataframe created, use the `requests` library to download the page for each topic using its topic page URL
- Use the `BeautifulSoup` library to parse the downloaded topic pages and extract the information
- Create a dataframe for each topic containing the username, repo name, stars, and repo URL of its top 30 repositories, in the format below:

```
username,repo_name,stars,repo_url
mrdoob,three.js,83700,https://github.com/mrdoob/three.js
libgdx,libgdx,20200,https://github.com/libgdx/libgdx
```

#### Note:
1. To create a CSV file, first prepare a dictionary of lists (one list per column)
2. Create a pandas dataframe from the dictionary of lists using `pd.DataFrame()`
3. Convert the dataframe into a CSV file using the `df.to_csv()` method
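
The three steps in the note can be sketched as follows, using the two sample rows shown above. This is a minimal illustration, not the notebook's final code; the variable names `repos_dict`, `repos_df`, and `csv_text` are chosen here for the example. Calling `to_csv()` without a path returns the CSV text as a string, which makes it easy to check the output before writing a file.

```python
import pandas as pd

# 1. Prepare a dictionary of lists (one list per column)
repos_dict = {
    'username': ['mrdoob', 'libgdx'],
    'repo_name': ['three.js', 'libgdx'],
    'stars': [83700, 20200],
    'repo_url': ['https://github.com/mrdoob/three.js',
                 'https://github.com/libgdx/libgdx'],
}

# 2. Create a dataframe from the dictionary of lists
repos_df = pd.DataFrame(repos_dict)

# 3. Convert the dataframe to CSV text (pass a file path to write a file instead)
csv_text = repos_df.to_csv(index=False)
print(csv_text)
```

Passing `index=False` keeps the row numbers out of the file, so the header matches the target format exactly.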


## Install and Import required libraries

In [1]:
# install the required libraries
!pip install requests beautifulsoup4 pandas jovian --upgrade --quiet

In [2]:
#to download webpages
import requests

#to parse and extract information from downloaded pages
from bs4 import BeautifulSoup

#to work with OS and files 
import os

#to create dataframes
import pandas as pd

#to save our notebook online
import jovian

#to prevent warning messages
import warnings
warnings.filterwarnings('ignore')

## Scrape the list of Topics

### Download and Create Soup Doc

Let's define a helper function that downloads the topics list, which spans 6 web pages, using the `requests` library and parses the downloaded HTML using the `BeautifulSoup` class from the `bs4` library.

In [3]:
def soupify_topics_list_pages():
    """
    Scrapes the entire list of topics, which spans 6 web pages.
    Returns a BeautifulSoup doc.
    """
    all_page_contents = ''
    topics_list_page_url = 'https://github.com/topics?page='
    for i in range(1, 7):
        page_url = topics_list_page_url + str(i)
        response = requests.get(page_url)
        # check successful response
        if response.status_code != 200:
            raise Exception('Failed to load page {}'.format(page_url))
        single_page_contents = response.text
        all_page_contents += single_page_contents
    # parse the combined HTML of all pages using Beautiful Soup
    doc = BeautifulSoup(all_page_contents, 'html.parser')
    return doc

### Extract Topic Titles

Once the soup doc has been created with `soupify_topics_list_pages()`, we go through the page's HTML to find the elements that hold the required information: topic title, topic description, and topic URL. The inspect element option in browsers such as Google Chrome, Brave, and Mozilla Firefox shows which tags and classes to target.

![Topic Title Class Image](https://i.imgur.com/F8DWSTB.png)

`topic_title_class` was identified as `'f3 lh-condensed mb-0 mt-1 Link--primary'` using the inspect elements option available under developer tools.

Using the `topic_title_class`, we collect all the `<p>` tags that contain a topic title and read each title from the tag's `.text` attribute.

In [4]:
def get_topic_titles(doc):
    """
    It accepts the soup doc as an input and returns topic titles list.
    """
    topic_title_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': topic_title_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text.strip())  #it's always safer to apply strip() fn. while extracting strings
    return topic_titles
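
To see how this lookup behaves without downloading anything, here is a small sketch that runs the same `find_all` call on a hand-written HTML snippet. The snippet is a simplified, hypothetical stand-in for the real topics page; only the class strings are taken from this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified HTML standing in for part of the topics page
sample_html = """
<div>
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary"> 3D </p>
  <p class="f5 color-fg-muted mb-0 mt-1">A made-up description</p>
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
</div>
"""

sample_doc = BeautifulSoup(sample_html, 'html.parser')

# Same lookup as get_topic_titles(): <p> tags with the full title class string
title_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
title_tags = sample_doc.find_all('p', {'class': title_class})

# strip() removes the surrounding whitespace around " 3D "
sample_titles = [tag.text.strip() for tag in title_tags]
print(sample_titles)
```

Because the class is passed as one multi-word string, Beautiful Soup compares it with the tag's full `class` attribute, so the lookup is sensitive to the order of the class names and will stop matching if GitHub changes them.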

### Extract Topic Descriptions

Similar to the process of extracting topic titles, topic descriptions were extracted after identifying `'f5 color-fg-muted mb-0 mt-1'` as `topic_desc_class`.

In [5]:
def get_topic_descs(doc):
    """
    It accepts the soup doc as an input and returns topic descriptions list.
    """
    topic_desc_class = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': topic_desc_class})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

### Extract Topic URLs

The URL for each topic is built by concatenating the base URL `"https://github.com"` with the `href` attribute of the `<a>` tag whose class is `'no-underline flex-1 d-flex flex-column'`.

In [6]:
def get_topic_urls(doc):
    """
    It accepts the soup doc as an input and returns topic URLs list.
    """
    topic_url_class = 'no-underline flex-1 d-flex flex-column'
    topic_url_tags = doc.find_all('a', {'class': topic_url_class})
    topic_urls = []
    base_url = "https://github.com"
    for tag in topic_url_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls
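
Plain concatenation works here because the `href` values are site-relative paths. As an alternative sketch (not what the function above does), the standard library's `urllib.parse.urljoin` builds the same URL and also leaves an already absolute link unchanged. The `relative_href` value below is an assumed example of the form such an `href` takes.

```python
from urllib.parse import urljoin

base_url = "https://github.com"
relative_href = "/topics/3d"  # assumed example of a site-relative href

# urljoin resolves the relative path against the base URL
topic_url = urljoin(base_url, relative_href)
print(topic_url)

# An absolute URL is returned as-is instead of being prefixed a second time
absolute_url = urljoin(base_url, "https://github.com/topics/ajax")
print(absolute_url)
```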

### Create Topics List Dataframe

We now have three functions, which return the lists of topic titles, topic descriptions, and topic URLs respectively.

Let's define another helper function that gathers these lists into a dictionary and converts it into a pandas dataframe.

In [7]:
def create_topics_list_df(doc):
    """
    Returns a dataframe that consists of Topic name, Topic Description and Topic URL.
    """
    topics_dict = {'title': get_topic_titles(doc),
            'description': get_topic_descs(doc),
            'url': get_topic_urls(doc)    
        }
    return pd.DataFrame(topics_dict)
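
`pd.DataFrame()` raises a `ValueError` when the lists in the dictionary differ in length, which would happen if, for example, one topic on the page had no description tag. A minimal sketch of a check that reports which column is short is shown below; `check_equal_lengths` is a hypothetical helper, not part of the original notebook, and the sample lists are made up.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the lists in `columns` differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return columns

# Usage with small made-up lists of equal length
checked = check_equal_lengths({
    'title': ['3D', 'Ajax'],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
})

# A mismatch is reported with the length of every column
try:
    check_equal_lengths({'title': ['3D', 'Ajax'], 'description': ['only one']})
except ValueError as error:
    mismatch_message = str(error)
    print(mismatch_message)
```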

---

## Scraping top repos for each topic

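Once the topics list has been scraped successfully, we scrape the top 30 repositories for each topic (each topic page lists only 30 repos).

1. Download each topic page using the topic URL from `topics_df`
2. Parse the downloaded page and extract the username, repo name, stars, and repo URL of each repository
3. Save the results for each topic as a CSV file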

In [8]:
def soupify_topics_page(topic_url):
    """
    It downloads each topics page using topic URL.
    Returns a soup doc using BeautifulSoup.   
    """
    # download the page
    response = requests.get(topic_url)
    #check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    #parse using beautiful soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

In [9]:
def create_top_repos_df(topic_doc):
    #get h3 tags containing reponame, username, repo url
    h3_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_class})
    span_id = 'repo-stars-counter-star'
    star_tags = topic_doc.find_all('span', {'id': span_id})
    
    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }

    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)


In [10]:
def get_repo_info(repo_tag, star_tag):
    # returns all the required info about a repository
    # base_url is defined here because the one in get_topic_urls() is local to that function
    base_url = "https://github.com"
    a_tags = repo_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url
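
The sketch below walks through the same extraction on a hand-written `<h3>` tag. The markup is a simplified, hypothetical stand-in for what a topic page contains (one `<h3>` holding two `<a>` tags, the first for the owner and the second for the repository); the real page has more attributes and nesting.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: one <h3> with two <a> tags (owner, repo)
sample_h3 = """
<h3 class="f3 color-fg-muted text-normal lh-condensed">
  <a href="/mrdoob"> mrdoob </a> /
  <a href="/mrdoob/three.js"> three.js </a>
</h3>
"""

h3_tag = BeautifulSoup(sample_h3, 'html.parser').find('h3')
a_tags = h3_tag.find_all('a')

# First link is the owner, second link is the repository
username = a_tags[0].text.strip()
repo_name = a_tags[1].text.strip()
repo_url = "https://github.com" + a_tags[1]['href']
print(username, repo_name, repo_url)
```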

In [11]:
def scrape_topic_page_and_create_csv():
    """ Scrape top repositories of each topic.
    """
    print('Scraping Topics List')
    topics_df = (create_topics_list_df(soupify_topics_list_pages()))
    
    os.makedirs('data', exist_ok=True)
    
    for index, row in topics_df.iterrows():
        print('Scraping top repos for "{}"'.format(row['title']))
        create_top_repos_csv(row['url'], 'data/{}.csv'.format(row['title']))

In [12]:
def create_top_repos_csv(topic_url, fpath):
    if os.path.exists(fpath):
        print('The file {} already exists. Skipping...'.format(fpath))
        return
    top_repos_df = create_top_repos_df(soupify_topics_page(topic_url))
    top_repos_df.to_csv(fpath, index=False)

In [13]:
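def parse_star_count(stars_str):
    # converts a star count string such as '83.7k' or '512' into an integer
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        # round() guards against float results that land just below a whole number
        return int(round(float(stars_str[:-1]) * 1000))
    return int(stars_str)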
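
As a quick check of the conversion, the sketch below applies a rounding variant of the parser to a few sample strings. Multiplying a float by 1000 can land slightly below the intended whole number, and `int()` truncates, so this variant uses `round()` instead. `parse_star_count_rounded` is a name used only for this illustration, and the sample inputs are made up.

```python
def parse_star_count_rounded(stars_str):
    """Variant of parse_star_count that rounds instead of truncating."""
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        # '83.7k' -> 83.7 * 1000, rounded to the nearest integer
        return round(float(stars_str[:-1]) * 1000)
    return int(stars_str)

examples = ['83.7k', '20.2k', ' 512 ']
parsed = [parse_star_count_rounded(s) for s in examples]
print(parsed)
```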

In [14]:
scrape_topic_page_and_create_csv()

Scraping Topics List
Scraping top repos for "3D"
The file data/3D.csv already exists. Skipping...
Scraping top repos for "Ajax"
The file data/Ajax.csv already exists. Skipping...
Scraping top repos for "Algorithm"
The file data/Algorithm.csv already exists. Skipping...
Scraping top repos for "Amp"
The file data/Amp.csv already exists. Skipping...
Scraping top repos for "Android"
The file data/Android.csv already exists. Skipping...
Scraping top repos for "Angular"
The file data/Angular.csv already exists. Skipping...
Scraping top repos for "Ansible"
The file data/Ansible.csv already exists. Skipping...
Scraping top repos for "API"
The file data/API.csv already exists. Skipping...
Scraping top repos for "Arduino"
The file data/Arduino.csv already exists. Skipping...
Scraping top repos for "ASP.NET"
The file data/ASP.NET.csv already exists. Skipping...
Scraping top repos for "Atom"
The file data/Atom.csv already exists. Skipping...
Scraping top repos for "Awesome Lists"
The file data/Awe

In [15]:
jovian.commit(project='github-top-repos-scraping')

[jovian] Updating notebook "lafirm/github-top-repo-scraping" on https://jovian.ai/
[jovian] Committed successfully! https://jovian.ai/lafirm/github-top-repo-scraping


'https://jovian.ai/lafirm/github-top-repo-scraping'