<a href="https://colab.research.google.com/github/parthpateljnv/webscrapingproject/blob/main/scraping_final_github_topics_repositories.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Scraping Top Repositories for Topics on GitHub





Here are the steps we'll follow:

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:

```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```

## Scrape the list of topics from GitHub

Steps:

- use requests to download the page
- use BS4 to parse and extract information
- convert to a Pandas dataframe



In [None]:
import requests
from bs4 import BeautifulSoup

def get_topics_page():
    # URL of the page that lists GitHub topics
    topics_url = 'https://github.com/topics'
    # Download the page
    response = requests.get(topics_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the HTML using Beautiful Soup
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

In [None]:
import pandas as pd
import os

Calling `get_topics_page` downloads the topics page and returns it as a parsed `BeautifulSoup` document:

In [None]:
doc = get_topics_page()



To get topic titles, we can pick `p` tags with the `class` ...

![](https://i.imgur.com/OnzIdyP.png)
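As a small offline illustration of this kind of class-based selection, the sketch below parses a made-up HTML snippet. The markup and the class names `title` and `big` are assumptions for the example, not GitHub's real page. It shows that a multi-class string passed to `find_all` is compared against the whole `class` attribute value, while a CSS selector passed to `select` matches the classes in any order.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example only
html = """
<div>
  <p class="title big">3D</p>
  <p class="big title">Ajax</p>
  <p class="note">Not a title</p>
</div>
"""
sample_doc = BeautifulSoup(html, 'html.parser')

# A multi-class string is compared against the full class attribute value,
# so only the tag whose classes appear in this exact order matches
exact = [tag.text for tag in sample_doc.find_all('p', {'class': 'title big'})]

# A CSS selector matches tags that carry both classes, in any order
any_order = [tag.text for tag in sample_doc.select('p.title.big')]

print(exact)
print(any_order)
```

This is one reason a scraper based on long class strings can silently return an empty list when the site reorders or renames its classes.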


In [None]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

`get_topic_titles` can be used to get the list of titles:

In [None]:
titles = get_topic_titles(doc)

In [None]:
len(titles)

30

In [None]:
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

Similarly, we define functions to get the topic descriptions and URLs.

In [None]:
def get_topic_descs(doc):
    desc_selector = 'f5 color-text-secondary mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs



A function to get the URL for each topic:

In [None]:
def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class': 'd-flex no-underline'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls


Putting this all together into a single function:



In [None]:
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    
    return pd.DataFrame(topics_dict)

## Get the top 25 repositories from a topic page



In [None]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc


In [None]:
doc = get_topic_page('https://github.com/topics/3d')

`h1_tag` contains the repo's username and repo name, and `star_tag` contains the star information.

In [None]:
base_url = "https://github.com"

In [None]:
#def parse_star_count()

In [None]:
def get_repo_info(h1_tag, star_tag):
    # returns all the required info about a repository
    a_tags = h1_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url =  base_url + a_tags[1]['href']
    stars = star_tag.text.strip()
    return username, repo_name, stars, repo_url
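`get_repo_info` returns the star count as the raw text shown on the page. The commented-out `parse_star_count` stub suggests converting that text to a number, which would match the integer `Stars` column in the CSV format at the top. Here is a minimal sketch, assuming the text uses a `k` suffix for thousands (for example `69.7k`); the exact format GitHub displays is an assumption here.

```python
def parse_star_count(stars_str):
    # Assumed input formats: '69.7k', '18.3K', '1,234' or '950'
    stars_str = stars_str.strip().replace(',', '')
    if stars_str[-1:].lower() == 'k':
        # round() guards against floating-point error in the multiplication
        return round(float(stars_str[:-1]) * 1000)
    return int(stars_str)

print(parse_star_count('69.7k'))
print(parse_star_count('950'))
```

With such a helper, `get_repo_info` could return `parse_star_count(star_tag.text)` instead of the stripped text.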

In [None]:
def get_topic_repos(topic_doc):
    # Get the h3 tags containing repo title, repo URL and username
    h1_selection_class = 'f3 color-text-secondary text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h1_selection_class} )
    # Get star tags
    star_tags = topic_doc.find_all('a', { 'class': 'social-count float-none'})
    
    topic_repos_dict = { 'username': [], 'repo_name': [], 'stars': [],'repo_url': []}
    #print(repo_tags)
    #print(star_tags)
    # Get repo info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    
    return pd.DataFrame(topic_repos_dict)

In [None]:
def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    #print(topic_df)
    topic_df.to_csv(path, index=False)

## Putting it all together

- We have a function to get the list of topics
- We have a function to create a CSV file for scraped repos from a topics page
- Let's create a function to put them together

In [None]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    #os.makedirs('data', exist_ok=True)
    for index, row in topics_df.iterrows():
       # print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], '{}.csv'.format(row['title']))
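
The commented-out `os.makedirs` line hints at collecting the CSVs in a `data` folder rather than the working directory. A possible sketch of that variant follows; the helper name `topic_csv_path` and the folder name are illustrative assumptions, not part of the original code.

```python
import os
import tempfile

def topic_csv_path(title, folder='data'):
    # Create the output folder if needed (no error if it already exists)
    os.makedirs(folder, exist_ok=True)
    # Build a path such as data/3D.csv from the topic title
    return os.path.join(folder, '{}.csv'.format(title))

# Demo in a temporary directory so nothing is left in the working folder
with tempfile.TemporaryDirectory() as tmp:
    demo_path = topic_csv_path('3D', folder=os.path.join(tmp, 'data'))
    folder_created = os.path.isdir(os.path.join(tmp, 'data'))
    print(demo_path, folder_created)
```

Inside the loop, the call would then read `scrape_topic(row['url'], topic_csv_path(row['title']))`.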

Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics

In [None]:
base_url = "https://github.com"

In [None]:
scrape_topics_repos()

Scraping list of topics
The file 3D.csv already exists. Skipping...
The file Ajax.csv already exists. Skipping...
The file Algorithm.csv already exists. Skipping...
The file Amp.csv already exists. Skipping...
The file Android.csv already exists. Skipping...
The file Angular.csv already exists. Skipping...
The file Ansible.csv already exists. Skipping...
The file API.csv already exists. Skipping...
The file Arduino.csv already exists. Skipping...
The file ASP.NET.csv already exists. Skipping...
The file Atom.csv already exists. Skipping...
The file Awesome Lists.csv already exists. Skipping...
The file Amazon Web Services.csv already exists. Skipping...
The file Azure.csv already exists. Skipping...
The file Babel.csv already exists. Skipping...
The file Bash.csv already exists. Skipping...
The file Bitcoin.csv already exists. Skipping...
The file Bootstrap.csv already exists. Skipping...
The file Bot.csv already exists. Skipping...
The file C.csv already exists. Skipping...


We can check that the CSVs were created properly.
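
One way to do that check is to read a file back with pandas and look at its columns and row count. The sketch below writes a tiny sample file first so that it runs on its own. The sample rows reuse the example from the introduction, the helper name `check_csv` is hypothetical, and the column names are the ones `get_topic_repos` puts in its DataFrame.

```python
import os
import tempfile
import pandas as pd

EXPECTED_COLUMNS = ['username', 'repo_name', 'stars', 'repo_url']

def check_csv(path):
    # Read the file back and confirm it has the expected columns and some rows
    df = pd.read_csv(path)
    return list(df.columns) == EXPECTED_COLUMNS and len(df) > 0

# Write a small sample file to a temporary folder, then check it
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, '3D.csv')
    sample = pd.DataFrame({
        'username': ['mrdoob', 'libgdx'],
        'repo_name': ['three.js', 'libgdx'],
        'stars': [69700, 18300],
        'repo_url': ['https://github.com/mrdoob/three.js',
                     'https://github.com/libgdx/libgdx'],
    })
    sample.to_csv(sample_path, index=False)
    sample_ok = check_csv(sample_path)
    print(sample_ok)
```

On the real output, `check_csv('3D.csv')` would be called on a file produced by `scrape_topics_repos`.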