# scraping-github-topics

 - Web scraping is the process of using bots to extract content and data from a website. Unlike screen scraping, which only copies pixels displayed onscreen, web scraping extracts underlying HTML code and, with it, data stored in a database.
 - GitHub is a web-based version-control and collaboration platform for software developers. GitHub facilitates social coding by providing a web interface to the Git code repository and management tools for collaboration. GitHub can be thought of as a serious social networking site for software developers.
 - We'll use Python to write the project, the requests library to download the web pages, Beautiful Soup to parse the downloaded HTML, and pandas to store the extracted information as a data set.

### PROJECT OUTLINE

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:

Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
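
As a rough illustration of the CSV layout above, the sketch below writes the same two sample rows with Python's standard `csv` module. The notebook itself uses pandas for this step; `rows_to_csv` and `rows` are hypothetical names introduced only for this example.

```python
import csv
import io

# Sample rows copied from the example format above.
rows = [
    ("three.js", "mrdoob", 69700, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 18300, "https://github.com/libgdx/libgdx"),
]

def rows_to_csv(rows):
    """Serialize (repo name, username, stars, repo URL) tuples to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Header row matching the format described in the outline
    writer.writerow(["Repo Name", "Username", "Stars", "Repo URL"])
    writer.writerows(rows)
    return buffer.getvalue()

print(rows_to_csv(rows))
```

Writing to an in-memory buffer keeps the example free of side effects; swapping `io.StringIO()` for an open file would produce the per-topic file.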

### SCRAPE THE LIST OF TOPICS FROM GITHUB

- Use requests to download the page
- Use Beautiful Soup 4 to parse and extract information
- Convert the results to a pandas DataFrame

In [4]:
!pip install jovian --upgrade --quiet
!pip install beautifulsoup4 --upgrade --quiet
!pip install requests --upgrade --quiet
!pip install pandas --quiet

In [6]:
import jovian
import requests
from bs4 import BeautifulSoup
import os

In [16]:
def get_topics_page():
    topic_url = 'https://github.com/topics'
    response = requests.get(topic_url)
    # Stop early if the page could not be downloaded
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

In [17]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

In [18]:
def get_topic_descs(doc):
    desc_selector = 'f5 color-text-secondary mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

In [19]:
def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class': 'd-flex no-underline'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls
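
The string concatenation in `get_topic_urls` assumes every `href` is a relative path starting with `/`. A slightly more defensive option, sketched here with the standard library's `urljoin`, also copes with hrefs that are already absolute. `absolute_url` and the sample hrefs are hypothetical, not part of the notebook.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"

def absolute_url(href, base_url=BASE_URL):
    """Resolve an href against the site root; absolute hrefs pass through."""
    return urljoin(base_url, href)

# Hypothetical hrefs of the kind found on the topics page
print(absolute_url("/topics/3d"))
print(absolute_url("https://github.com/topics/ajax"))
```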

In [60]:
import pandas as pd

def scrape_topics(topics_url):
    # topics_url is a topics listing page, e.g. 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)

In [61]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc
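
If the intent is to try again when a download fails, one possible approach is a small retry wrapper around the page loader. This is a minimal sketch under that assumption; `fetch_with_retries`, `attempts`, and `delay_seconds` are hypothetical names, and the notebook itself simply raises on the first failure.

```python
import time

def fetch_with_retries(fetch, url, attempts=3, delay_seconds=1.0):
    """Call fetch(url) up to `attempts` times, pausing between failures."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose: the loaders raise Exception
            last_error = error
            if attempt < attempts:
                time.sleep(delay_seconds)
    # Every attempt failed, so surface the most recent error
    raise last_error
```

With this helper, a call such as `fetch_with_retries(get_topic_page, topic_url)` would return the parsed page, or re-raise the last error once all attempts are used up.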

In [43]:
def get_repo_info(h1_tag, star_tag):
    # Returns all the required info about a repository
    base_url = 'https://github.com'
    a_tags = h1_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    # parse_star_count turns the star label text into an integer count
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url
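
`get_repo_info` relies on a `parse_star_count` helper whose definition does not appear in this notebook. The sketch below is one plausible version, assuming star labels look like `69.7k` or `987`; the exact label format is an assumption inferred from the rounded counts in the CSV example.

```python
def parse_star_count(stars_str):
    """Convert a star label such as '69.7k' or '987' to an integer."""
    text = stars_str.strip().lower().replace(",", "")
    if text.endswith("k"):
        # round() guards against floating-point error in the multiplication
        return int(round(float(text[:-1]) * 1000))
    return int(text)

print(parse_star_count("69.7k"))
print(parse_star_count("987"))
```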

In [44]:
def get_topic_repos(topic_doc):
    # Get the h1 tags containing repo title, repo URL and username
    h1_selection_class = 'f3 color-text-secondary text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h1', {'class': h1_selection_class} )
    # Get star tags
    star_tags = topic_doc.find_all('a', { 'class': 'social-count float-none'})
    
    topic_repos_dict = { 'username': [], 'repo_name': [], 'stars': [],'repo_url': []}

    # Get repo info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)

In [45]:
def scrape_topic(topic_url,path):
    if os.path.exists(path):
        print('the file {} already exists. Skipping ...'.format(path))
        return 
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=False)

In [51]:
def scrape_topics_repos():
    print('Scraping list of topics')
    # Create the output directory once, before any CSV is written
    os.makedirs('data', exist_ok=True)
    # Pages 1 through 14 of the topics listing
    for i in range(1, 15):
        topics_url = 'https://github.com/topics?page={}'.format(i)
        topics_df = scrape_topics(topics_url)
        for _, row in topics_df.iterrows():
            print('Scraping top repositories for "{}"'.format(row['title']))
            scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))
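
The topic title is used directly as the CSV file name, so titles containing spaces or path separators end up in the path unchanged. If that becomes a problem, a small sanitizer along these lines could be applied first. `safe_filename` is a hypothetical helper, and the set of allowed characters is an arbitrary choice for this sketch.

```python
import re

def safe_filename(title):
    """Replace characters that are awkward in file names with underscores."""
    # Keep letters, digits, and a few harmless symbols; collapse the rest
    cleaned = re.sub(r"[^A-Za-z0-9.+#_-]+", "_", title.strip())
    return cleaned.strip("_") or "untitled"

print(safe_filename("Machine learning"))
print(safe_filename("ASP.NET"))
```

Note that switching to sanitized names changes the paths checked by `scrape_topic`, so files written under the original names would no longer be detected as existing.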

In [67]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
the file data/3D.csv already exists. Skipping ...
Scraping top repositories for "Ajax"
the file data/Ajax.csv already exists. Skipping ...
Scraping top repositories for "Algorithm"
the file data/Algorithm.csv already exists. Skipping ...
Scraping top repositories for "Amp"
the file data/Amp.csv already exists. Skipping ...
Scraping top repositories for "Android"
the file data/Android.csv already exists. Skipping ...
Scraping top repositories for "Angular"
the file data/Angular.csv already exists. Skipping ...
Scraping top repositories for "Ansible"
the file data/Ansible.csv already exists. Skipping ...
Scraping top repositories for "API"
the file data/API.csv already exists. Skipping ...
Scraping top repositories for "Arduino"
the file data/Arduino.csv already exists. Skipping ...
Scraping top repositories for "ASP.NET"
the file data/ASP.NET.csv already exists. Skipping ...
Scraping top repositories for "Atom"
the file data/Ato

Scraping top repositories for "Localization"
the file data/Localization.csv already exists. Skipping ...
Scraping top repositories for "Lua"
the file data/Lua.csv already exists. Skipping ...
Scraping top repositories for "Machine learning"
the file data/Machine learning.csv already exists. Skipping ...
Scraping top repositories for "macOS"
the file data/macOS.csv already exists. Skipping ...
Scraping top repositories for "Markdown"
the file data/Markdown.csv already exists. Skipping ...
Scraping top repositories for "Mastodon"
the file data/Mastodon.csv already exists. Skipping ...
Scraping top repositories for "Material design"
the file data/Material design.csv already exists. Skipping ...
Scraping top repositories for "MATLAB"
the file data/MATLAB.csv already exists. Skipping ...
Scraping top repositories for "Maven"
the file data/Maven.csv already exists. Skipping ...
Scraping top repositories for "Minecraft"
the file data/Minecraft.csv already exists. Skipping ...
Scraping top rep

Scraping top repositories for "WordPress"
Scraping top repositories for "Xamarin"
Scraping top repositories for "XML"


In [66]:
jovian.commit()

[jovian] Updating notebook "shreyash075/scraping-github-topics" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/shreyash075/scraping-github-topics


'https://jovian.ai/shreyash075/scraping-github-topics'