<a href="https://colab.research.google.com/github/poojasharma021/scraping-github-topics-repositories/blob/main/scraping_gitub_topics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Scraping Top Repositories for Topics on GitHub

Here are the steps we'll follow:

*   We're going to scrape https://github.com/topics
*   We'll get a list of topics. For each topic, we'll get the topic title, topic page URL, and topic description
*   For each topic, we'll get the top 25 repositories from the topic page
*   For each repository, we'll grab the repo name, username, stars, and repo URL
*   For each topic, we'll create a CSV file

## Scrape the list of topics from GitHub

Steps:


*   use requests to download the page
*   use BS4 to parse and extract information
*   convert to a Pandas dataframe
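
The three steps above can be sketched on a small, made-up HTML snippet, so the parse-and-tabulate part runs without a network call. The markup, the class names (`topic-link`, `topic-title`, `topic-desc`), and the `parse_topics` helper are assumptions for illustration, not GitHub's real page structure; in the notebook the HTML would come from `requests.get(...).text`.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML that requests.get(url).text would return.
SAMPLE_HTML = """
<div>
  <a class="topic-link" href="/topics/3d">
    <p class="topic-title">3D</p>
    <p class="topic-desc"> 3D modeling uses software to build objects. </p>
  </a>
  <a class="topic-link" href="/topics/ajax">
    <p class="topic-title">Ajax</p>
    <p class="topic-desc"> Ajax is a technique for interactive web apps. </p>
  </a>
</div>
"""

def parse_topics(html):
    """Parse topic cards out of an HTML string and return a DataFrame."""
    doc = BeautifulSoup(html, "html.parser")
    rows = []
    # Read all three fields from the same link tag so each row stays aligned.
    for link in doc.find_all("a", class_="topic-link"):
        rows.append({
            "Title": link.find("p", class_="topic-title").text.strip(),
            "Description": link.find("p", class_="topic-desc").text.strip(),
            "Url": "https://github.com" + link["href"],
        })
    return pd.DataFrame(rows)

topics_df = parse_topics(SAMPLE_HTML)
print(topics_df)
```

Reading the title, description, and URL from one parent tag is one way to keep the columns aligned if a topic happens to lack one of the fields.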

Let's install the libraries, then write functions to download and parse the page.

In [None]:
!pip install requests --quiet

In [34]:
!pip install beautifulsoup4 --quiet

In [35]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [36]:
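def get_topic_titles(parsed_doc):
  # Each topic title sits in a <p> tag with this class string
  title_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
  topic_title_tags = parsed_doc.find_all('p', {'class': title_class})
  topic_titles = []
  for tag in topic_title_tags:
    # Strip whitespace because the title later becomes a file name
    topic_titles.append(tag.text.strip())
  return topic_titles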


In [37]:
def get_topic_desc(parsed_doc):
  desc_class = 'f5 color-fg-muted mb-0 mt-1'
  topic_desc_tags = parsed_doc.find_all('p', {'class': desc_class})
  topic_descriptions = []
  for tag in topic_desc_tags:
    topic_descriptions.append(tag.text.strip())
  return topic_descriptions



In [38]:
def get_url(parsed_doc):
  link_class = 'no-underline flex-1 d-flex flex-column'
  topic_link_tags = parsed_doc.find_all('a', {'class': link_class})
  topic_url = []
  for tag in topic_link_tags:
    topic_url.append("https://github.com" + tag['href'])
  return topic_url



In [39]:
def scrape_topics():
  # Download the topics listing page
  topics_url = 'https://github.com/topics'
  response = requests.get(topics_url)

  # Fail early if the page did not load
  if response.status_code != 200:
    raise Exception('Failed to load page {}'.format(topics_url))

  parsed_doc = BeautifulSoup(response.text, 'html.parser')

  # The three lists must have equal length for the DataFrame constructor
  topics_dict = {
    'Title': get_topic_titles(parsed_doc),
    'Descriptions': get_topic_desc(parsed_doc),
    'Url': get_url(parsed_doc)
  }

  return pd.DataFrame(topics_dict)


## Get the top repositories from a topic page

In [40]:
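def parse_star_count(stars_str):
  # Convert a star label such as '987' or '1.2k' to an integer
  stars_str = stars_str.strip()
  if stars_str[-1] == 'k':
    # round() avoids truncating a float product that lands just below a whole number
    return round(float(stars_str[:-1]) * 1000)
  return int(stars_str)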
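
A few sample inputs make the star-count conversion easier to follow. The sketch below is a standalone variant under a hypothetical name, `parse_count`; stripping thousands separators is an assumption about how labels might be formatted, not something the page is confirmed to do. It uses `round()` because a binary float product such as `32.3 * 1000` can fall a hair below the intended whole number, in which case `int()` would truncate it.

```python
def parse_count(text):
    """Convert a star label such as '987' or '32.3k' to an int (hypothetical helper)."""
    # Assumption: labels may carry surrounding whitespace or thousands separators.
    text = text.strip().lower().replace(",", "")
    if text.endswith("k"):
        # round() rather than int(), so a product that falls just below the
        # intended whole number is not truncated.
        return round(float(text[:-1]) * 1000)
    return int(text)

for label in ["987", " 1.2k ", "32.3k", "1,024"]:
    print(repr(label), "->", parse_count(label))
```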


In [41]:
def get_repo_info(h3_tag, star_tag):
  # The h3 tag holds two links: the owner first, then the repository
  a_tags = h3_tag.find_all('a')
  username = a_tags[0].text.strip()
  repo_name = a_tags[1].text.strip()
  repo_url = "https://github.com" + a_tags[1]['href']
  stars = parse_star_count(star_tag.text.strip())
  return username, repo_name, stars, repo_url
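
To show the markup shape this function expects, here is a sketch that runs the same extraction on one simplified repository card. The HTML, the `octocat/hello-world` repository, and the `1.5k` label are made-up sample data; the live page carries many more attributes and its class names may change.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup for a single repository card.
CARD_HTML = """
<article>
  <h3 class="f3 color-fg-muted text-normal lh-condensed">
    <a href="/octocat">octocat</a> /
    <a href="/octocat/hello-world">hello-world</a>
  </h3>
  <span id="repo-stars-counter-star">1.5k</span>
</article>
"""

card = BeautifulSoup(CARD_HTML, "html.parser")
h3_tag = card.find("h3")
star_tag = card.find("span", id="repo-stars-counter-star")

# The first link is the owner, the second is the repository itself.
owner_link, repo_link = h3_tag.find_all("a")
username = owner_link.text.strip()
repo_name = repo_link.text.strip()
repo_url = "https://github.com" + repo_link["href"]
star_label = star_tag.text.strip()

print(username, repo_name, repo_url, star_label)
```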


In [42]:
def get_topic_repos(topic_url):
  # Download the page
  response = requests.get(topic_url)
  # Check for a successful response
  if response.status_code != 200:
    raise Exception('Failed to load page {}'.format(topic_url))

  topic_doc = BeautifulSoup(response.text, 'html.parser')

  repo_class ='f3 color-fg-muted text-normal lh-condensed'
  repo_tags = topic_doc.find_all('h3', {'class': repo_class})

  star_tags = topic_doc.find_all('span', {'id': 'repo-stars-counter-star'})

  topic_repo_dict = {
    'username': [], 
    'repo_name': [],
    'stars': [],
    'repo_url': []
    }


  # Pair each repository heading with its star counter; zip stops at the shorter list
  for repo_tag, star_tag in zip(repo_tags, star_tags):
    username, repo_name, stars, repo_url = get_repo_info(repo_tag, star_tag)
    topic_repo_dict['username'].append(username)
    topic_repo_dict['repo_name'].append(repo_name)
    topic_repo_dict['stars'].append(stars)
    topic_repo_dict['repo_url'].append(repo_url)

  return pd.DataFrame(topic_repo_dict)




  

In [43]:
import os

def scrape_topic(topic_name, topic_url):
  # Skip topics that were already saved to disk
  fname = topic_name + '.csv'
  if os.path.exists(fname):
    print("File {} already exists.".format(fname))
    return
  topic_df = get_topic_repos(topic_url)
  # index=False keeps the row index out of the CSV
  topic_df.to_csv(fname, index=False)
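
The skip-if-the-file-exists check can be tried in isolation. This sketch swaps in the standard-library `csv` module for `DataFrame.to_csv` so it runs without pandas, and writes to a temporary directory; the `save_once` helper and the sample row are hypothetical.

```python
import csv
import os
import tempfile

def save_once(path, rows, fieldnames):
    """Write rows to path unless the file already exists; return True if written."""
    if os.path.exists(path):
        return False
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return True

# Made-up sample row with the same columns the notebook collects.
fields = ["username", "repo_name", "stars", "repo_url"]
rows = [{"username": "octocat", "repo_name": "hello-world", "stars": 1500,
         "repo_url": "https://github.com/octocat/hello-world"}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "3D.csv")
    first = save_once(path, rows, fields)   # file is new, so it is written
    second = save_once(path, rows, fields)  # file exists, so it is skipped
    print(first, second)
```

Skipping existing files means a rerun after a failure only fetches the topics that are still missing, at the cost of never refreshing a CSV that is already on disk.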



In [44]:
def scrape_topic_repos():
  print('Scraping List of Topics')
  topics_df = scrape_topics()

  for index, row in topics_df.iterrows():
    print('Scraping Top Repositories for {}'.format(row['Title']))
    scrape_topic(row['Title'], row['Url'])


Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics



In [None]:
scrape_topic_repos()