## Scraping Top Repositories for Topics on GitHub


### Problem Statement:

Web scraping is the process of using bots to extract content and data from a website.
It is used by a variety of digital businesses that rely on data harvesting.


GitHub is a Git repository hosting service that adds many features of its own. While Git is a command-line tool, GitHub provides a web-based graphical interface. It also provides access control and several collaboration features, such as wikis and basic task management tools for every project.


Tools Used: Python, requests, Beautiful Soup, Pandas


Here are the steps followed:

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL and topic description
- For each topic, we'll get the top repositories listed on the first page of the topic page (30 at the time of this run)
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic, we'll create a CSV file in the following format:

        username,repo_name,repo_url,stars
        mrdoob,three.js,https://github.com/mrdoob/three.js,69700
        libgdx,libgdx,https://github.com/libgdx/libgdx,18300

### Using Requests library to download the webpage

In [1]:
!pip install requests --upgrade --quiet

In [2]:
import requests
import pandas as pd
import os

In [3]:
topics_url="https://github.com/topics"

In [4]:
response=requests.get(topics_url)

In [5]:
response.status_code

200
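
Checking `status_code` by eye works in a notebook, but a script would usually turn a failed download into an error. The following is a minimal sketch, assuming a hypothetical helper named `page_text_or_raise` (it is not part of requests). It accepts any object that exposes `status_code` and `text`, which is what `requests.get` returns, and uses stand-in objects here so it runs without network access.

```python
from types import SimpleNamespace


def page_text_or_raise(response, url):
    """Return the body of a response, or raise if the status is not 200.

    `response` is any object with `status_code` and `text` attributes,
    such as the object returned by requests.get(url).
    """
    if response.status_code != 200:
        raise RuntimeError(
            "Failed to load page {} (status {})".format(url, response.status_code)
        )
    return response.text


# Stand-in responses (assumptions for illustration, not real downloads).
ok = SimpleNamespace(status_code=200, text="<html></html>")
missing = SimpleNamespace(status_code=404, text="Not Found")

print(page_text_or_raise(ok, "https://github.com/topics"))
try:
    page_text_or_raise(missing, "https://github.com/topics")
except RuntimeError as err:
    print(err)
```

In the notebook this could be called as `page_text_or_raise(response, topics_url)` right after the `requests.get` call.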

In [6]:
len(response.text)

165700

In [7]:
page_content=response.text

In [8]:
page_content[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-gTJi5qrypRHWpLXsMZQXoL53mXDuVqfZc7AfuiFXreLhf7Pk1RMvXJMWJsiS8dpkFDfq/7t6bFZK+3xS1Ak+Lg==" rel="stylesheet" href="https://github.githubassets.com/assets/light-813262e6aaf2a511d6a4b5ec319417a0.css" /><link crossorigin="anonymous" media="all" integrity="sha512-CMdm0es1Ti46ZuFcKKz+jobtyuFMFz3OIWxrFfOGbsHzri6ehzY0MqUHRn9C23aqIUH6HrnhiqjxF6Ec

In [9]:
with open('webpage.html', "w", encoding="utf-8") as f:
    f.write(page_content)
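
Once the page is saved to disk, later runs can read it back instead of downloading it again. This is a rough sketch of that idea. The helper name `load_or_download` and the `download` callback are assumptions made for illustration; a fake download function and a temporary directory keep the example self-contained.

```python
import os
import tempfile


def load_or_download(path, download):
    """Return the page text stored at `path`, calling `download()` only if needed."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    text = download()
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return text


calls = []


def fake_download():
    # Stand-in for a real request; records how often it is invoked.
    calls.append(1)
    return "<html>topics</html>"


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "webpage.html")
    first = load_or_download(cache_path, fake_download)   # downloads and saves
    second = load_or_download(cache_path, fake_download)  # reads the saved copy

print(len(calls))
```

With the real page, the callback could be something like `lambda: requests.get(topics_url).text`.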

### Using Beautiful Soup to parse info from the webpage
- Use requests to download the page
- Use BS4 to parse and extract information
- Convert to a Pandas DataFrame

In [10]:
%pip install beautifulsoup4 --upgrade --quiet

Note: you may need to restart the kernel to use updated packages.


In [11]:
from bs4 import BeautifulSoup

In [12]:
doc=BeautifulSoup(page_content,'html.parser')

In [13]:
type(doc)

bs4.BeautifulSoup

## Extracting info from the webpage
To get topic titles, we can pick `p` tags with the `class` ...

![](https://i.imgur.com/uk47YQz.png)

In [14]:
title_tags= doc.find_all('p',{'class':'f3 lh-condensed mb-0 mt-1 Link--primary'})
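
Passing the full class string to `find_all` matches only tags whose `class` attribute is exactly that string, so the lookup can break if GitHub reorders or adds classes. As a hedged alternative, the sketch below compares it with a CSS selector via `select`, which matches on individual classes. The HTML snippet and the variable `sample_doc` are made up for illustration and only mimic the structure of the real page.

```python
from bs4 import BeautifulSoup

# Illustrative HTML only; the second tag lists the same classes in another order.
html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="lh-condensed f3 mb-0 mt-1 Link--primary">Ajax</p>
<p class="f5 color-fg-muted mb-0 mt-1">A description</p>
"""
sample_doc = BeautifulSoup(html, "html.parser")

# Exact match on the whole class string: sensitive to class order.
exact = sample_doc.find_all("p", {"class": "f3 lh-condensed mb-0 mt-1 Link--primary"})

# CSS selector: matches any <p> that carries both classes, in any order.
by_selector = sample_doc.select("p.f3.lh-condensed")

exact_titles = [tag.text for tag in exact]
selector_titles = [tag.text for tag in by_selector]
print(exact_titles)
print(selector_titles)
```

Which classes are stable enough to select on is a judgment call that depends on the live page.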

In [15]:
title_tags

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Atom</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Awesome Lists</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amazon Web Services</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Azure</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Babel</p>,
 <p class="f3 lh-condensed m

### Following the same step for the description

In [16]:
desc_tags=doc.find_all('p',{'class':'f5 color-fg-muted mb-0 mt-1'})

In [17]:
desc_tags

[<p class="f5 color-fg-muted mb-0 mt-1">
               3D modeling is the process of virtually developing the surface and structure of a 3D object.
             </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
               Ajax is a technique for creating interactive web applications.
             </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
               Algorithms are self-contained sequences that carry out a variety of tasks.
             </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
               Amp is a non-blocking concurrency framework for PHP.
             </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
               Android is an operating system built by Google designed for mobile devices.
             </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
               Angular is an open source web application platform.
             </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
               Ansible is a simple and powerful automation engine.
             </p>,
 <p class="

In [18]:
topic_tag_link=doc.find_all('a',{'class':'d-flex no-underline'})

In [19]:
len(topic_tag_link)

30

In [20]:
topic_tag_link[0]['href']

'/topics/3d'

In [21]:
topic0_url='https://github.com'+topic_tag_link[0]['href']
print(topic0_url)

https://github.com/topics/3d
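
String concatenation works here because every `href` starts with a slash. A slightly more forgiving option, sketched below, is `urljoin` from the standard library, which also copes with a missing leading slash or an already absolute link. The helper name `absolute_url` is an assumption for illustration.

```python
from urllib.parse import urljoin

base_url = "https://github.com"


def absolute_url(href):
    """Resolve a link found on the page against the site's base URL."""
    return urljoin(base_url, href)


print(absolute_url("/topics/3d"))
print(absolute_url("topics/3d"))
print(absolute_url("https://github.com/topics/3d"))
```

All three calls resolve to `https://github.com/topics/3d`.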


In [22]:
topic_title=[]

for tag in title_tags:
    topic_title.append(tag.text)
print(topic_title)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [23]:
topic_desc=[]
for desc in desc_tags:
    topic_desc.append(desc.text.strip())
print(topic_desc[:5])

['3D modeling is the process of virtually developing the surface and structure of a 3D object.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency framework for PHP.', 'Android is an operating system built by Google designed for mobile devices.']


## Getting the URLs of all the topics

In [24]:
topic_url=[]
base_url='https://github.com'
for url in topic_tag_link:
    topic_url.append(base_url+url['href'])
print(topic_url)

['https://github.com/topics/3d', 'https://github.com/topics/ajax', 'https://github.com/topics/algorithm', 'https://github.com/topics/amphp', 'https://github.com/topics/android', 'https://github.com/topics/angular', 'https://github.com/topics/ansible', 'https://github.com/topics/api', 'https://github.com/topics/arduino', 'https://github.com/topics/aspnet', 'https://github.com/topics/atom', 'https://github.com/topics/awesome', 'https://github.com/topics/aws', 'https://github.com/topics/azure', 'https://github.com/topics/babel', 'https://github.com/topics/bash', 'https://github.com/topics/bitcoin', 'https://github.com/topics/bootstrap', 'https://github.com/topics/bot', 'https://github.com/topics/c', 'https://github.com/topics/chrome', 'https://github.com/topics/chrome-extension', 'https://github.com/topics/cli', 'https://github.com/topics/clojure', 'https://github.com/topics/code-quality', 'https://github.com/topics/code-review', 'https://github.com/topics/compiler', 'https://github.com/t

### Creating a dataframe and storing all the scraped info

In [25]:
topic_dict ={'title':topic_title ,'description':topic_desc , 'url': topic_url}

In [26]:
topic_df=pd.DataFrame(topic_dict)
topic_df

Unnamed: 0,title,description,url
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency framework fo...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet
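
The titles, descriptions and URLs come from three separate `find_all` calls, and `pd.DataFrame` expects every column to have the same length. If the page layout changes and one selector stops matching, the lists will drift apart. A small sketch of a guard that could run before building the DataFrame follows. The function name `build_topic_columns` is hypothetical.

```python
def build_topic_columns(titles, descriptions, urls):
    """Return a dict of equal-length columns, or raise with the observed lengths."""
    lengths = {
        "title": len(titles),
        "description": len(descriptions),
        "url": len(urls),
    }
    if len(set(lengths.values())) != 1:
        raise ValueError("Scraped lists differ in length: {}".format(lengths))
    return {
        "title": list(titles),
        "description": list(descriptions),
        "url": list(urls),
    }


columns = build_topic_columns(
    ["3D", "Ajax"],
    ["3D modeling is ...", "Ajax is a technique ..."],
    ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
)
print(columns["title"])
```

The returned dict can be passed to `pd.DataFrame` in the same way as `topic_dict`.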


### Saving DataFrame To A CSV File

In [27]:
topic_df.to_csv('Topics.csv',index=None)

### Getting Information From Each Topic Page

In [28]:
topic_page_url=topic_url[0]
topic_page_url

'https://github.com/topics/3d'

In [29]:
response=requests.get(topic_page_url)
response.status_code

200

In [30]:
len(response.text)

655725

In [31]:
topic_doc=BeautifulSoup(response.text, 'html.parser')

### For example, these are the top repositories in the topic 3D

![](https://i.imgur.com/Hi9wKd9.png)

In [32]:
repo_tags=topic_doc.find_all('h3',{'class':'f3 color-fg-muted text-normal lh-condensed'})

In [33]:
len(repo_tags)

30

In [34]:
repo_tags[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="text-bold wb-break-word" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d897521

In [35]:
a_tags=repo_tags[0].find_all('a')
a_tags

[<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>,
 <a class="text-bold wb-break-word" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-view-component="tr

In [36]:
a_tags[0].text.strip()

'mrdoob'

In [37]:
a_tags[1].text.strip()

'three.js'

In [38]:

repo_url=base_url+a_tags[1]['href']

In [39]:
repo_url

'https://github.com/mrdoob/three.js'

In [40]:
stars_tags=topic_doc.find_all('a',{'class':'social-count float-none'})

In [41]:
stars_tags[0].text.strip()

'75.8k'

In [42]:
def parse_star_count(stars_str):
    stars_str=stars_str.strip()
    if stars_str[-1]=='k':
        return int(float(stars_str[:-1])*1000)
    return int(stars_str)
        

In [43]:
parse_star_count(stars_tags[0].text)

75800
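
`parse_star_count` handles plain integers and a `k` suffix. The variant below is a sketch under two assumptions that the original does not confirm: that counts might also appear with an `m` suffix or with thousands separators. It uses `Decimal` so that the multiplication is exact rather than subject to floating-point rounding. The name `parse_count` is hypothetical.

```python
from decimal import Decimal

# Assumed suffixes; only "k" appears in the scraped page shown here.
_SUFFIXES = {"k": 1_000, "m": 1_000_000}


def parse_count(text):
    """Convert strings such as '75.8k' or '1,234' to an integer."""
    text = text.strip().lower().replace(",", "")
    if text and text[-1] in _SUFFIXES:
        return int(Decimal(text[:-1]) * _SUFFIXES[text[-1]])
    return int(text)


print(parse_count(" 75.8k "))
print(parse_count("1,234"))
```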

In [44]:
#returns all info about the repository
def get_repo_info(h3_tag,star_tag):
    a_tags=h3_tag.find_all('a')
    username=a_tags[0].text.strip()
    repo_name=a_tags[1].text.strip()
    repo_url=base_url+a_tags[1]['href']
    stars=parse_star_count(star_tag.text)
    return username,repo_name,repo_url,stars
    

In [45]:
test=get_repo_info(repo_tags[0],stars_tags[0])
test

('mrdoob', 'three.js', 'https://github.com/mrdoob/three.js', 75800)

In [46]:
type(test)

tuple

In [47]:
len(repo_tags)

30

In [48]:
topic_repo_dict ={'username':[],'repo_name':[],'repo_url':[],'stars':[]}


In [49]:
for i in range(len(repo_tags)):
    repo_info=get_repo_info(repo_tags[i],stars_tags[i])
    topic_repo_dict['username'].append(repo_info[0])
    topic_repo_dict['repo_name'].append(repo_info[1])
    topic_repo_dict['repo_url'].append(repo_info[2])
    topic_repo_dict['stars'].append(repo_info[3])
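
The dictionary built above keeps the repositories in page order. If only the highest-starred entries are wanted, the columns can be zipped into rows, sorted, and sliced. The following is a minimal sketch using plain Python. The function name `top_n_by_stars` is an assumption, and the sample values are taken from the 3D topic output in this notebook, deliberately listed out of order.

```python
def top_n_by_stars(repo_columns, n):
    """Return the same column layout, limited to the n highest-starred repos."""
    keys = ["username", "repo_name", "repo_url", "stars"]
    rows = zip(*(repo_columns[key] for key in keys))
    ranked = sorted(rows, key=lambda row: row[3], reverse=True)[:n]
    return {key: [row[i] for row in ranked] for i, key in enumerate(keys)}


sample = {
    "username": ["libgdx", "pmndrs", "mrdoob"],
    "repo_name": ["libgdx", "react-three-fiber", "three.js"],
    "repo_url": [
        "https://github.com/libgdx/libgdx",
        "https://github.com/pmndrs/react-three-fiber",
        "https://github.com/mrdoob/three.js",
    ],
    "stars": [19200, 15600, 75600],
}

top_two = top_n_by_stars(sample, 2)
print(top_two["repo_name"])
```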

## Driver Code

In [50]:
def get_topic_page(topic_page_url):
    #Downloading the required topic page
    response=requests.get(topic_page_url)
    #Checking response status
    if response.status_code !=200:
        raise Exception('Failed to load page {}'.format(topic_page_url))
    #Parsing using BeautifulSoup
    topic_doc=BeautifulSoup(response.text, 'html.parser')
    return topic_doc

#returns all info about the repository
def get_repo_info(h3_tag,star_tag):
    a_tags=h3_tag.find_all('a')
    username=a_tags[0].text.strip()
    repo_name=a_tags[1].text.strip()
    repo_url=base_url+a_tags[1]['href']
    stars=parse_star_count(star_tag.text)
    return username,repo_name,repo_url,stars

def get_topic_repos(topic_doc):
   
    #Getting h3 tags containing repo title,url and username
    repo_tags=topic_doc.find_all('h3',{'class':'f3 color-fg-muted text-normal lh-condensed'})
    #Getting a tags containing stars
    stars_tags=topic_doc.find_all('a',{'class':'social-count float-none'})
    
    
    topic_repo_dict ={'username':[],'repo_name':[],'repo_url':[],'stars':[]}

    #Getting repo info
    for i in range(len(repo_tags)):
        repo_info=get_repo_info(repo_tags[i],stars_tags[i])
        topic_repo_dict['username'].append(repo_info[0])
        topic_repo_dict['repo_name'].append(repo_info[1])
        topic_repo_dict['repo_url'].append(repo_info[2])
        topic_repo_dict['stars'].append(repo_info[3])
    return pd.DataFrame(topic_repo_dict)

def scrape_topic(topic_url,topic_name):
    fname=topic_name + '.csv'
    if os.path.exists(fname):
        print("The file {} already exists.Skipping....".format(fname))
        return
    topic_df=get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(fname,index=None)
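
`scrape_topic` uses the topic title directly as the file name. That works for the titles seen in this run, but a title containing a slash or other reserved character would not be a valid file name on every system. Below is a hedged sketch of a sanitizing helper. The name `topic_filename` and the set of allowed characters are assumptions, and `'CI/CD'` is a made-up title used only to show the replacement.

```python
import re


def topic_filename(topic_name):
    """Build a CSV file name, replacing characters that may be unsafe in paths."""
    cleaned = re.sub(r"[^A-Za-z0-9._+-]", "_", topic_name.strip())
    return (cleaned or "topic") + ".csv"


print(topic_filename("ASP.NET"))
print(topic_filename("Awesome Lists"))
print(topic_filename("CI/CD"))
```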

## Writing a single function to:


1) Get the list of topics from the GitHub topics page

2) Get details of the top repos for each topic

3) Create a CSV file for each topic

In [51]:
def get_topic_title(doc):
    title_tags= doc.find_all('p',{'class':'f3 lh-condensed mb-0 mt-1 Link--primary'})
    
    topic_title=[]
    for tag in title_tags:
        topic_title.append(tag.text)
    return topic_title 

def get_topic_desc(doc):
    desc_tags=doc.find_all('p',{'class':'f5 color-fg-muted mb-0 mt-1'})
    
    topic_desc=[]
    for desc in desc_tags:
        topic_desc.append(desc.text.strip())
    return topic_desc

def get_topic_url(doc):
    topic_tag_link=doc.find_all('a',{'class':'d-flex no-underline'})
    
    topic_url=[]
    base_url='https://github.com'
    for url in topic_tag_link:
        topic_url.append(base_url+url['href'])
    return topic_url

def scrape_topics():
    topic_page_url="https://github.com/topics"
    response=requests.get(topic_page_url)
    #Checking response status
    if response.status_code !=200:
        raise Exception('Failed to load page {}'.format(topic_page_url))
    #Parsing the downloaded page instead of relying on a global `doc`
    doc=BeautifulSoup(response.text,'html.parser')
    topic_dict={"title":get_topic_title(doc),"description":get_topic_desc(doc),"url":get_topic_url(doc)}
    return pd.DataFrame(topic_dict,index=None)
    

In [52]:
def scrape_topics_repos():
    print("Getting list of topics from GitHub")
    topic_df=scrape_topics()
    for index,row in topic_df.iterrows():
        print("Scraping top repositories for the topic '{}'".format(row['title']))
        scrape_topic(row['url'],row['title'])

In [53]:
scrape_topics_repos()

Getting list of topics from GitHub
Scraping top repositories for the topic '3D'
The file 3D.csv already exists.Skipping....
Scraping top repositories for the topic 'Ajax'
The file Ajax.csv already exists.Skipping....
Scraping top repositories for the topic 'Algorithm'
The file Algorithm.csv already exists.Skipping....
Scraping top repositories for the topic 'Amp'
The file Amp.csv already exists.Skipping....
Scraping top repositories for the topic 'Android'
The file Android.csv already exists.Skipping....
Scraping top repositories for the topic 'Angular'
The file Angular.csv already exists.Skipping....
Scraping top repositories for the topic 'Ansible'
The file Ansible.csv already exists.Skipping....
Scraping top repositories for the topic 'API'
The file API.csv already exists.Skipping....
Scraping top repositories for the topic 'Arduino'
The file Arduino.csv already exists.Skipping....
Scraping top repositories for the topic 'ASP.NET'
The file ASP.NET.csv already exists.Skipping....
Scra
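
The driver requests one page per topic in a tight loop. If a request fails transiently, the whole run stops with an exception. One possible mitigation, sketched here under the assumption that a short pause and a retry are acceptable, is a small wrapper around the fetch step. The names `fetch_with_retries`, `fetch` and `sleep` are hypothetical; the demo uses a fake fetch function and records the delays instead of actually sleeping.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with a growing pause when it raises."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as err:
            last_error = err
            if attempt < attempts:
                sleep(delay_seconds * attempt)
    raise last_error


# Demo: a fake fetch that fails twice, then succeeds.
outcomes = [RuntimeError("boom"), RuntimeError("boom"), "<html>ok</html>"]
sleeps = []


def flaky_fetch(url):
    outcome = outcomes.pop(0)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


result = fetch_with_retries(
    flaky_fetch, "https://github.com/topics/3d",
    attempts=3, delay_seconds=0.5, sleep=sleeps.append,
)
print(result, sleeps)
```

In the notebook, `fetch` could be `get_topic_page`, which already raises when the status code is not 200.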

In [54]:
pd.read_csv("3D.csv")

Unnamed: 0,username,repo_name,repo_url,stars
0,mrdoob,three.js,https://github.com/mrdoob/three.js,75600
1,libgdx,libgdx,https://github.com/libgdx/libgdx,19200
2,pmndrs,react-three-fiber,https://github.com/pmndrs/react-three-fiber,15600
3,BabylonJS,Babylon.js,https://github.com/BabylonJS/Babylon.js,15200
4,aframevr,aframe,https://github.com/aframevr/aframe,13200
5,ssloy,tinyrenderer,https://github.com/ssloy/tinyrenderer,11600
6,lettier,3d-game-shaders-for-beginners,https://github.com/lettier/3d-game-shaders-for...,11500
7,FreeCAD,FreeCAD,https://github.com/FreeCAD/FreeCAD,10100
8,metafizzy,zdog,https://github.com/metafizzy/zdog,8800
9,CesiumGS,cesium,https://github.com/CesiumGS/cesium,8000


## References and Conclusion

Summary of what we did

- I scraped the first 30 topics listed on https://github.com/topics.

- For each topic, I collected the username, repo name, star count and URL of its top repositories.

- I created a separate CSV file for each topic, listing the top repositories in that topic.

## Drawbacks
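- Only the first 30 repos of each topic were scraped; more can be fetched by appending "?page=" followed by the desired page number to the topic URL.
- The BeautifulSoup library only parses the HTML it is given and does not execute JavaScript, so this approach will run into issues on websites that load their content dynamically, for example behind a loading screen.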
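
To illustrate the first drawback, here is a small sketch that builds the URLs for several pages of one topic. It assumes, as the note above suggests, that topic pages accept a `page` query parameter; the helper name `topic_page_urls` is hypothetical.

```python
from urllib.parse import urlencode


def topic_page_urls(topic_url, pages):
    """Return the URLs for pages 1..pages of a topic."""
    return [
        "{}?{}".format(topic_url, urlencode({"page": number}))
        for number in range(1, pages + 1)
    ]


urls = topic_page_urls("https://github.com/topics/3d", 3)
for url in urls:
    print(url)
```

Each of these URLs could then be passed to `get_topic_page`, and the resulting DataFrames combined before saving.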