# Scraping the Topic-wise Top Repositories on GitHub

![Github Image](https://hoadm.net/wp-content/uploads/2021/09/Github-1200x630-1-696x365.jpeg)



*Author: [Lafir](https://www.linkedin.com/in/lafir)*

## About: 

In this project, the top repositories for each topic listed on GitHub's topics page are scraped using `requests` and `BeautifulSoup`.

### Project Outline:

- Scrape this page: https://github.com/topics
- Extract the list of topics into a dataframe and a CSV file; each row should contain the topic title, topic page URL, and topic description
- For each topic, create a dataframe and a CSV file of the top 25 repositories, scraped from the topic page
- Each repository record should contain the repo name, username, stars, and repo URL, in the format below:

```
username,repo_name,stars,repo_url
mrdoob,three.js,83700,https://github.com/mrdoob/three.js
libgdx,libgdx,20200,https://github.com/libgdx/libgdx
```
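As a minimal sketch of how records could be written in this format, the snippet below uses Python's standard `csv` module on the two sample rows shown above. The helper name `records_to_csv` is hypothetical and is not part of the notebook, which uses pandas for this step later on.

```python
import csv
import io

# The two sample rows from the target format, as (username, repo_name, stars, repo_url).
records = [
    ('mrdoob', 'three.js', 83700, 'https://github.com/mrdoob/three.js'),
    ('libgdx', 'libgdx', 20200, 'https://github.com/libgdx/libgdx'),
]

def records_to_csv(rows):
    """Serialize repository tuples to CSV text with the expected header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(['username', 'repo_name', 'stars', 'repo_url'])
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = records_to_csv(records)
print(csv_text)
```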


## Use Requests library to download webpages

In [1]:
!pip install requests --upgrade --quiet

In [2]:
import requests

In [3]:
github_topics_url = 'https://github.com/topics'

In [4]:
response = requests.get(github_topics_url)

In [5]:
response.status_code

200

In [6]:
page_contents = response.text
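The download steps above could be wrapped in one small function. This is only a sketch: the name `download_page`, the `get` parameter (which lets a stand-in be supplied when testing offline), and the 10-second timeout are assumptions, not part of the notebook.

```python
import requests

def download_page(url, get=requests.get, timeout=10):
    """Return the HTML text of `url`, raising if the server reports an error."""
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses,
    # which replaces a manual check of response.status_code.
    response.raise_for_status()
    return response.text
```

With network access, `page_contents = download_page('https://github.com/topics')` would play the role of cells 4 to 6.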

## Use Beautiful Soup to parse and extract information

In [7]:
!pip install beautifulsoup4 --upgrade --quiet

In [8]:
from bs4 import BeautifulSoup

In [9]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [10]:
type(doc)

bs4.BeautifulSoup

In [11]:
topic_title_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'

topic_title_tags = doc.find_all('p', {'class': topic_title_class})

In [12]:
len(topic_title_tags)

30

In [13]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [14]:
topic_desc_class = 'f5 color-fg-muted mb-0 mt-1'

topic_desc_tags = doc.find_all('p', {'class': topic_desc_class})

In [15]:
len(topic_desc_tags)

30

In [16]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [17]:
topic_url_class = 'no-underline flex-1 d-flex flex-column'

topic_url_tags = doc.find_all('a', {'class': topic_url_class})

In [18]:
len(topic_url_tags)

30

In [19]:
topic0_url = "https://github.com"+topic_url_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d
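The three lookups above return parallel lists, so a record for one topic is formed by taking the same position from each list. The sketch below shows one way to combine them with a length check. It runs on a hand-written HTML fragment that only imitates the class names seen in the notebook; the function name `extract_topics` is hypothetical, and the real page markup may differ or change.

```python
from bs4 import BeautifulSoup

# Hand-written fragment reusing the class strings from the cells above.
sample_html = """
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">link</a>
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
  3D modeling is the process of virtually developing the surface and structure of a 3D object.
</p>
"""

def extract_topics(doc, base_url='https://github.com'):
    """Pair up title, description and URL tags into one dict per topic."""
    title_tags = doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})
    desc_tags = doc.find_all('p', {'class': 'f5 color-fg-muted mb-0 mt-1'})
    url_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})
    # If the counts differ, positions no longer line up and rows would be wrong.
    if not (len(title_tags) == len(desc_tags) == len(url_tags)):
        raise ValueError('Tag lists differ in length; the page layout may have changed')
    return [
        {'title': t.text.strip(), 'description': d.text.strip(), 'url': base_url + u['href']}
        for t, d, u in zip(title_tags, desc_tags, url_tags)
    ]

topics = extract_topics(BeautifulSoup(sample_html, 'html.parser'))
print(topics)
```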


### Create a list of topic titles

In [20]:
topic_titles = []

for tag in topic_title_tags:
    topic_titles.append(tag.text)
    
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


### Create a list of topic descriptions

In [21]:
topic_descs = []

for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())
    
print(topic_descs[:4])

['3D modeling is the process of virtually developing the surface and structure of a 3D object.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.']


### Create a list of topic URLs

In [22]:
topic_urls = []

base_url = "https://github.com"

for tag in topic_url_tags:
    topic_urls.append(base_url + tag['href'])
    
print(topic_urls[:4])

['https://github.com/topics/3d', 'https://github.com/topics/ajax', 'https://github.com/topics/algorithm', 'https://github.com/topics/amphp']


In [23]:
# It's easy to convert a DataFrame into a CSV file

In [24]:
!pip install pandas --quiet

In [25]:
import pandas as pd

In [26]:
# It's easy to convert a dictionary of lists into a DataFrame
# (named topics_dict so that Python's built-in dict is not shadowed)

topics_dict = {'title': topic_titles,
               'description': topic_descs,
               'url': topic_urls
}

In [27]:
topics_df = pd.DataFrame(topics_dict)

In [28]:
topics_df

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


### Create a CSV file with the extracted information

In [29]:
topics_df.to_csv('topics.csv', index=None)
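To check what such a CSV looks like without touching the disk, the following sketch writes a two-row DataFrame to an in-memory buffer and reads it back. The sample rows are taken from the output above; `index=False` is the boolean form of the option for leaving out the row index.

```python
import io
import pandas as pd

# Two rows copied from the topics listed above.
topics = pd.DataFrame({
    'title': ['3D', 'Ajax'],
    'description': [
        '3D modeling is the process of virtually developing the surface and structure of a 3D object.',
        'Ajax is a technique for creating interactive web applications.',
    ],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
})

buffer = io.StringIO()
topics.to_csv(buffer, index=False)   # no index column is written
csv_text = buffer.getvalue()

# Reading the text back gives the same three columns.
restored = pd.read_csv(io.StringIO(csv_text))
print(csv_text)
```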

## Getting Information out of a Topic page

In [30]:
topic_page_url = topic_urls[0]

topic_page_url

'https://github.com/topics/3d'

In [31]:
response = requests.get(topic_page_url)

In [32]:
response.status_code

200

In [33]:
len(response.text)

647977

In [34]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [35]:
h3_class = 'f3 color-fg-muted text-normal lh-condensed'

repo_tags = topic_doc.find_all('h3', {'class': h3_class})

In [36]:
len(repo_tags)

30

In [37]:
a_tags = repo_tags[0].find_all('a')
a_tags

[<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>,
 <a class="text-bold wb-break-word" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data

In [38]:
a_tags[0].text.strip()

'mrdoob'

In [39]:
a_tags[1].text.strip()

'three.js'

In [40]:
a_tags[1]['href']

'/mrdoob/three.js'

In [41]:
repo_url = base_url + a_tags[1]['href']

In [42]:
span_id = 'repo-stars-counter-star'


star_tags = topic_doc.find_all('span', {'id': span_id})

In [43]:
len(star_tags)

30

In [44]:
star_tags[0].text.strip()

'83.7k'

In [45]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1])*1000)
    return int(stars_str)

In [46]:
parse_star_count('68.9k')

68900
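`parse_star_count` multiplies a float and then truncates with `int`, which in principle can be affected by binary floating-point rounding. A possible variant, sketched below, uses `decimal.Decimal` for exact arithmetic. The name `parse_star_count_exact`, the `m` suffix, and the comma handling are assumptions added for illustration; only the `k` suffix appears in the notebook output.

```python
from decimal import Decimal

# 'k' is seen in the scraped text; 'm' is a hypothetical extension.
SUFFIXES = {'k': 1_000, 'm': 1_000_000}

def parse_star_count_exact(stars_str):
    """Convert strings such as '83.7k' or '1,234' to an integer star count."""
    text = stars_str.strip().lower().replace(',', '')
    multiplier = SUFFIXES.get(text[-1:], 1)
    if multiplier != 1:
        text = text[:-1]              # drop the suffix character
    # Decimal keeps '83.7' exact, so the product is exactly 83700.0.
    return int(Decimal(text) * multiplier)

print(parse_star_count_exact('83.7k'))
```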

In [47]:
def get_repo_info(repo_tag, star_tag):
    #returns all the required info about a repository
    a_tags = repo_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [48]:
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 83700, 'https://github.com/mrdoob/three.js')
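The same extraction can be tried offline on a small fragment. The HTML below is hand-written to imitate one repository card, with most attributes from the real page omitted, so it is an assumption about the layout rather than a copy of it. `parse_star_count` and `get_repo_info` are repeated from the cells above so the block runs on its own.

```python
from bs4 import BeautifulSoup

base_url = 'https://github.com'

sample_html = """
<h3 class="f3 color-fg-muted text-normal lh-condensed">
  <a href="/mrdoob">
    mrdoob
  </a>
  /
  <a class="text-bold wb-break-word" href="/mrdoob/three.js">
    three.js
  </a>
</h3>
<span id="repo-stars-counter-star">83.7k</span>
"""

def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)

def get_repo_info(repo_tag, star_tag):
    # The first link is the owner, the second link is the repository.
    a_tags = repo_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

doc = BeautifulSoup(sample_html, 'html.parser')
repo_tags = doc.find_all('h3', {'class': 'f3 color-fg-muted text-normal lh-condensed'})
star_tags = doc.find_all('span', {'id': 'repo-stars-counter-star'})

# zip pairs each heading with its star counter and stops at the shorter list.
repos = [get_repo_info(r, s) for r, s in zip(repo_tags, star_tags)]
print(repos)
```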

## Final Code

In [76]:
import os

def get_topic_page(topic_url):
    # download the page
    response = requests.get(topic_url)
    #check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    #parse using beautiful soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

def get_repo_info(repo_tag, star_tag):
    #returns all the required info about a repository
    a_tags = repo_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url


def get_topic_repos(topic_doc):
    #get h3 tags containing reponame, username, repo url
    h3_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_class})
    span_id = 'repo-stars-counter-star'
    star_tags = topic_doc.find_all('span', {'id': span_id})
    
    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }

    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)

def scrape_each_topic_page(topic_url, fpath):
    if os.path.exists(fpath):
        print('The file {} already exists. Skipping...'.format(fpath))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(fpath, index=None)

In [50]:
url4 = topic_urls[4]

In [51]:
url4

'https://github.com/topics/android'

In [52]:
topic4_doc = get_topic_page(url4)

In [53]:
topic4_repos = get_topic_repos(topic4_doc)

In [54]:
topic4_repos

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | flutter | flutter | 143000 | https://github.com/flutter/flutter |
| 1 | justjavac | free-programming-books-zh_CN | 94400 | https://github.com/justjavac/free-programming-... |
| 2 | Genymobile | scrcpy | 67900 | https://github.com/Genymobile/scrcpy |
| 3 | Hack-with-Github | Awesome-Hacking | 53500 | https://github.com/Hack-with-Github/Awesome-Ha... |
| 4 | google | material-design-icons | 46200 | https://github.com/google/material-design-icons |
| 5 | wasabeef | awesome-android-ui | 43200 | https://github.com/wasabeef/awesome-android-ui |
| 6 | square | okhttp | 42500 | https://github.com/square/okhttp |
| 7 | Solido | awesome-flutter | 41700 | https://github.com/Solido/awesome-flutter |
| 8 | android | architecture-samples | 41200 | https://github.com/android/architecture-samples |
| 9 | square | retrofit | 40200 | https://github.com/square/retrofit |


In [75]:
get_topic_repos(get_topic_page(topic_urls[6]))

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | ansible | ansible | 53900 | https://github.com/ansible/ansible |
| 1 | bregman-arie | devops-exercises | 27900 | https://github.com/bregman-arie/devops-exercises |
| 2 | trailofbits | algo | 25800 | https://github.com/trailofbits/algo |
| 3 | StreisandEffect | streisand | 22800 | https://github.com/StreisandEffect/streisand |
| 4 | MichaelCade | 90DaysOfDevOps | 15600 | https://github.com/MichaelCade/90DaysOfDevOps |
| 5 | kubernetes-sigs | kubespray | 12600 | https://github.com/kubernetes-sigs/kubespray |
| 6 | ansible | awx | 11200 | https://github.com/ansible/awx |
| 7 | easzlab | kubeasz | 8300 | https://github.com/easzlab/kubeasz |
| 8 | geerlingguy | ansible-for-devops | 6000 | https://github.com/geerlingguy/ansible-for-devops |
| 9 | khuedoan | homelab | 5900 | https://github.com/khuedoan/homelab |




Write a single function to:

1. get the list of topics from the github topics page
2. get the list of top repos from the individual topic pages
3. For each topic, create a CSV of the top repos for the topic

In [74]:
def scrape_topics_list_page():
    """
    Scrape the entire list of topics, which spans 6 pages.
    Returns a dataframe with the topic name, topic description and topic URL.
    """
    all_page_contents = ''
    topics_list_page_url = 'https://github.com/topics?page='
    for i in range(1, 7):
        page_url = topics_list_page_url + str(i)
        response = requests.get(page_url)
        # check for a successful response
        if response.status_code != 200:
            raise Exception('Failed to load page {}'.format(page_url))
        all_page_contents += response.text
    # parse the combined HTML using Beautiful Soup
    doc = BeautifulSoup(all_page_contents, 'html.parser')
    topics_dict = {'title': get_topic_titles(doc),
                   'description': get_topic_descs(doc),
                   'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)

In [57]:
def get_topic_titles(doc):
    topic_title_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': topic_title_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

def get_topic_descs(doc):
    topic_desc_class = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': topic_desc_class})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

def get_topic_urls(doc):
    topic_url_class = 'no-underline flex-1 d-flex flex-column'
    topic_url_tags = doc.find_all('a', {'class': topic_url_class})
    topic_urls = []
    base_url = "https://github.com"
    for tag in topic_url_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls
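`scrape_topics_list_page` joins the HTML of all pages into one string before parsing. An alternative, sketched here under the assumption that each page is a complete HTML document, is to parse every page separately and extend one result list. The function name `collect_topic_titles` and the two hand-written pages are hypothetical stand-ins for the downloaded content.

```python
from bs4 import BeautifulSoup

def collect_topic_titles(pages):
    """Parse each HTML page on its own and gather topic titles in page order."""
    title_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    titles = []
    for html in pages:
        doc = BeautifulSoup(html, 'html.parser')
        titles.extend(tag.text.strip()
                      for tag in doc.find_all('p', {'class': title_class}))
    return titles

# Stand-ins for the text of https://github.com/topics?page=1 and ?page=2.
pages = [
    '<html><body><p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p></body></html>',
    '<html><body><p class="f3 lh-condensed mb-0 mt-1 Link--primary">Xamarin</p></body></html>',
]

titles = collect_topic_titles(pages)
print(titles)
```

The same loop shape would apply to descriptions and URLs; parsing per page also makes it easier to tell which page a missing tag came from.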

In [2]:
def scrape_topic_repos():
    """ Scrape top repositories of each topic.
    """
    print('Scraping Topics List')
    topics_df = scrape_topics_list_page()
    
    os.makedirs('data', exist_ok=True)  #folder name -> data
    
    for index, row in topics_df.iterrows():
        print('Scraping top repos for "{}"'.format(row['title']))
        scrape_each_topic_page(row['url'], 'data/{}.csv'.format(row['title']))

In [86]:
scrape_topic_repos()

Scraping Topics List
Scraping top repos for "3D"
The file data/3D.csv already exists. Skipping...
Scraping top repos for "Ajax"
The file data/Ajax.csv already exists. Skipping...
Scraping top repos for "Algorithm"
The file data/Algorithm.csv already exists. Skipping...
Scraping top repos for "Amp"
The file data/Amp.csv already exists. Skipping...
Scraping top repos for "Android"
The file data/Android.csv already exists. Skipping...
Scraping top repos for "Angular"
The file data/Angular.csv already exists. Skipping...
Scraping top repos for "Ansible"
The file data/Ansible.csv already exists. Skipping...
Scraping top repos for "API"
The file data/API.csv already exists. Skipping...
Scraping top repos for "Arduino"
The file data/Arduino.csv already exists. Skipping...
Scraping top repos for "ASP.NET"
The file data/ASP.NET.csv already exists. Skipping...
Scraping top repos for "Atom"
The file data/Atom.csv already exists. Skipping...
Scraping top repos for "Awesome Lists"
The file data/Awe

Scraping top repos for "Web app"
Scraping top repos for "Webpack"
Scraping top repos for "Windows"
Scraping top repos for "WordPlate"
Scraping top repos for "WordPress"
Scraping top repos for "Xamarin"
Scraping top repos for "XML"
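The CSV paths above are built directly from `row['title']`. If a title ever contained a character such as `/`, the path would point into a sub-folder that does not exist. One possible safeguard is sketched below; the helper `safe_filename` and the title `'CI/CD'` are hypothetical, and using the helper would change some file names (for example, spaces become hyphens) compared with the output shown above.

```python
import re

def safe_filename(title):
    """Turn a topic title into a CSV file name without path separators or spaces."""
    # Keep letters, digits and . _ + - ; collapse anything else into one hyphen.
    cleaned = re.sub(r'[^A-Za-z0-9._+-]+', '-', title.strip())
    return cleaned.strip('-') + '.csv'

for title in ['Awesome Lists', 'C++', 'ASP.NET', 'CI/CD']:
    print(title, '->', safe_filename(title))
```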


In [60]:
!pip install jovian --upgrade --quiet

In [61]:
import jovian

In [1]:
jovian.commit(project='github-top-repos-scraping-raw')


[jovian] Creating a new project "lafirm/github-top-repo-scraping-raw"
[jovian] Committed successfully! https://jovian.ai/lafirm/github-top-repo-scraping-raw


'https://jovian.ai/lafirm/github-top-repo-scraping-raw'