# Top Repositories for GitHub Topics 


## Pick a website and describe your objective
- ### Browse through different sites and pick one to scrape. Check the "Project ideas" section for inspiration.
- ### Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- ### Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "new" button above.

## Project Outline:
- ### We are going to scrape https://github.com/topics
- ### We will get a list of topics. For each topic, we will get topic title, topic page url and topic description.
- ### For each topic, we will get the top repositories listed on the first page of the topic page (20 per topic in the run recorded below).
- ### For each repository, we'll grab the repo name, repo username, stars and repo url.
- ### For each topic we'll create a CSV file in the following format:
```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,104000,https://github.com/mrdoob/three.js
react-three-fiber,pmndrs,28200,https://github.com/pmndrs/react-three-fiber
```
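
The target format above can be produced with Python's built-in `csv` module. This is a minimal sketch rather than the notebook's final code: the helper name `rows_to_csv` is hypothetical, and the two rows are the sample values from the outline.

```python
import csv
import io

def rows_to_csv(rows):
    """Render (repo_name, username, stars, repo_url) tuples as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    # Header row in the column order described in the outline
    writer.writerow(['Repo Name', 'Username', 'Stars', 'Repo URL'])
    writer.writerows(rows)
    return buffer.getvalue()

sample_rows = [
    ('three.js', 'mrdoob', 104000, 'https://github.com/mrdoob/three.js'),
    ('react-three-fiber', 'pmndrs', 28200, 'https://github.com/pmndrs/react-three-fiber'),
]
csv_text = rows_to_csv(sample_rows)
print(csv_text)
```

The rest of the notebook uses pandas for the same job; the `csv` module is shown here only to make the intended file layout concrete.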

## Use the requests library to download web pages

In [1]:
!pip install requests  --upgrade --quiet




In [2]:
import requests

In [3]:
topics_url = 'https://github.com/topics'

In [4]:
response = requests.get(topics_url)

In [5]:
response.status_code

200

In [6]:
len(response.text)

206148

In [12]:
page_contents = response.text

In [13]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html\n  lang="en"\n  \n  data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"\n  data-a11y-animated-images="system" data-a11y-link-underlines="true"\n  \n  >\n\n\n\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-7aa84bb7e11e.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-f65db3e8d171.css" /><link data-color-theme="dark_dimmed" cross
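
The download-and-check steps above can be wrapped in one helper. This is a sketch under assumptions: the name `fetch_html`, the injectable `get` parameter, and the 10-second timeout are illustrative choices, not part of the original notebook.

```python
def fetch_html(url, get=None, timeout=10):
    """Return the body of `url` as text, or raise if the status is not 200."""
    if get is None:
        import requests  # imported lazily so a stand-in `get` can be supplied
        get = requests.get
    response = get(url, timeout=timeout)
    if response.status_code != 200:
        raise RuntimeError(
            'Failed to load page {} (status {})'.format(url, response.status_code)
        )
    return response.text
```

With this in place, `page_contents = fetch_html(topics_url)` would replace the separate `requests.get` and `status_code` cells. Passing a timeout keeps a stalled connection from hanging the notebook indefinitely.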

## Use Beautiful Soup to parse and extract information 

In [14]:
!pip install beautifulsoup4 --upgrade --quiet  




In [15]:
from bs4 import BeautifulSoup

In [16]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [17]:
type(doc)

bs4.BeautifulSoup

In [18]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = doc.find_all('p', {'class': selection_class })

In [19]:
len(topic_title_tags)

30

In [20]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [21]:
desc_selector = 'f5 color-fg-muted mb-0 mt-1'
topic_desc_tags = doc.find_all('p',{'class': desc_selector}) 

In [22]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [23]:
topic_title_tag0 = topic_title_tags[0]

In [24]:
div_tag = topic_title_tag0.parent

In [25]:
topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})

In [26]:
len(topic_link_tags)

30

In [27]:
topic0_url = "https://github.com" + topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d


In [28]:
topic_title_tags[0].text

'3D'

In [29]:
topic_titles = [] 

for tag in topic_title_tags:
    topic_titles.append(tag.text)

print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command-line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'C++', 'Cryptocurrency', 'Crystal']


In [30]:
topic_descs = []

for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())

topic_descs[:5]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [31]:
topic_urls = [] 
base_url = 'https://github.com'

for tag in topic_link_tags:
   topic_urls.append(base_url + tag['href']) 
topic_urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compiler',
 'https://github.com/topics/co

In [32]:
!pip install pandas --quiet




In [33]:
import pandas as pd 

In [34]:
topics_dict = {
    'title': topic_titles,
    'description': topic_descs,
    'url': topic_urls
}

In [35]:
topics_df = pd.DataFrame(topics_dict)

In [36]:
topics_df

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source platform for buildin... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |
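
The three lists above are built from three independent `find_all` calls, so they only line up if every topic has a title, a description and a link. An alternative is to walk each link tag and read the title and description from inside it. The sketch below assumes the title and description paragraphs are nested inside the link tag; `SAMPLE_HTML` is made-up, simplified markup and `extract_topics` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Made-up sample markup (an assumption about the page layout, not a copy of it).
# The second topic deliberately has no description paragraph.
SAMPLE_HTML = """
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-fg-muted mb-0 mt-1">
    3D refers to three-dimensional graphics.
  </p>
</a>
<a class="no-underline flex-1 d-flex flex-column" href="/topics/ajax">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
</a>
"""

def extract_topics(doc, base_url='https://github.com'):
    """Return one dict per topic link, keeping title, description and URL together."""
    topics = []
    link_class = 'no-underline flex-1 d-flex flex-column'
    for link in doc.find_all('a', {'class': link_class}):
        title_tag = link.find('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})
        desc_tag = link.find('p', {'class': 'f5 color-fg-muted mb-0 mt-1'})
        topics.append({
            'title': title_tag.get_text(strip=True) if title_tag else None,
            'description': desc_tag.get_text(strip=True) if desc_tag else None,
            'url': base_url + link['href'],
        })
    return topics

sample_doc = BeautifulSoup(SAMPLE_HTML, 'html.parser')
sample_topics = extract_topics(sample_doc)
print(sample_topics)
```

A list of dicts like this can be passed straight to `pd.DataFrame`, and a missing description shows up as an empty cell in the right row instead of shifting every later description up by one.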


## Create CSV file(s) with the extracted information 

In [37]:
topics_df.to_csv('topics.csv')
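
Called without `index=False`, `to_csv` also writes the DataFrame index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. The sketch below shows the difference on a small stand-in DataFrame (`small_df` is illustrative, not the notebook's `topics_df`).

```python
import io
import pandas as pd

small_df = pd.DataFrame({
    'title': ['3D', 'Ajax'],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
})

# to_csv returns the CSV text when no path is given
with_index = small_df.to_csv()
without_index = small_df.to_csv(index=False)

# Reading the indexed version back adds an 'Unnamed: 0' column
round_trip = pd.read_csv(io.StringIO(with_index))

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
print(list(round_trip.columns))
```

If the index carries no information, `topics_df.to_csv('topics.csv', index=False)` keeps the file limited to the three scraped columns.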

## Getting information out of a topic page 

In [38]:
topic_page_url = topic_urls[0]

In [39]:
topic_page_url

'https://github.com/topics/3d'

In [40]:
response = requests.get(topic_page_url)

In [41]:
response.status_code

200

In [42]:
len(response.text)

520923

In [43]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [44]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})

In [45]:
len(repo_tags)

20

In [46]:
repo_tags[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="c72fbd5c69a8ee7c9c53a4e65de2b93c8fc7552dd793945819639bc165c0f0ba" data-turbo="false" data-view-component="true" href="/mrdoob">mrdoob</a>          /
          <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4a2667db3d63a1739c412e059e5da95afe419df83f70949b5d59dc3478f5c79a" data-turbo="false" data-view-component="true" href="/mrdoob/thre

In [47]:
a_tags = repo_tags[0].find_all('a')

In [48]:
a_tags

[<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="c72fbd5c69a8ee7c9c53a4e65de2b93c8fc7552dd793945819639bc165c0f0ba" data-turbo="false" data-view-component="true" href="/mrdoob">mrdoob</a>,
 <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4a2667db3d63a1739c412e059e5da95afe419df83f70949b5d59dc3478f5c79a" data-turbo="false" data-view-component="true" href="/mrdoob/three.js">three.js</a>]

In [49]:
a_tags[0].text.strip()

'mrdoob'

In [50]:
a_tags[1].text.strip()

'three.js'

In [51]:
a_tags[1]['href']

'/mrdoob/three.js'

In [52]:
base_url

'https://github.com'

In [53]:
base_url = 'https://github.com'
repo_url = base_url + a_tags[1]['href']
print(repo_url)

https://github.com/mrdoob/three.js


In [54]:
star_tags =  topic_doc.find_all('span', {'class': 'Counter js-social-count'})

In [55]:
len(star_tags)

20

In [56]:
star_tags[0].text

'104k'

In [57]:
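def parse_star_count(stars_str):
    # Convert a star count such as '104k' or '987' to an integer
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    # No 'k' suffix: the string is already a plain number
    return int(stars_str)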

In [58]:
parse_star_count(star_tags[0].text.strip())

104000
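
One subtlety in `parse_star_count`: `int()` truncates, and floating-point multiplication is not exact, so `int(float('64.1') * 1000)` evaluates to `64099` rather than `64100`. A possible variant that rounds instead of truncating is sketched below; the name `parse_count` and the handling of an `m` suffix and thousands separators are assumptions about inputs the page might produce, not something the notebook observed.

```python
def parse_count(text):
    """Convert strings such as '104k', '1.2m', '1,234' or '987' to an int."""
    text = text.strip().lower().replace(',', '')
    multipliers = {'k': 1_000, 'm': 1_000_000}
    if text and text[-1] in multipliers:
        # round() avoids off-by-one results caused by float truncation
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)

print(parse_count('104k'))
print(parse_count('64.1k'))
print(parse_count('987'))
```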

In [59]:
def get_repo_info(h3_tag, star_tag):
    # return all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url 

In [60]:
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 104000, 'https://github.com/mrdoob/three.js')

In [61]:
topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
}


for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])
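
The loop above indexes `repo_tags` and `star_tags` in parallel, which assumes both lists have the same length. A small guard makes that assumption explicit. This is only a sketch: `pair_tags` is a hypothetical helper, and plain strings stand in for the BeautifulSoup tags.

```python
def pair_tags(repo_tags, star_tags):
    """Pair each repository heading with its star counter, or fail loudly."""
    if len(repo_tags) != len(star_tags):
        raise ValueError(
            'Found {} repo tags but {} star tags'.format(len(repo_tags), len(star_tags))
        )
    return list(zip(repo_tags, star_tags))

# Plain strings stand in for the tags used in the notebook
pairs = pair_tags(['mrdoob/three.js', 'pmndrs/react-three-fiber'], ['104k', '28.2k'])
print(pairs)
```

In the notebook this would read `for h3_tag, star_tag in pair_tags(repo_tags, star_tags):`, with the body calling `get_repo_info(h3_tag, star_tag)`.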

In [62]:
topic_repos_df = pd.DataFrame(topic_repos_dict)

In [63]:
topic_repos_df

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 104000 | https://github.com/mrdoob/three.js |
| 1 | pmndrs | react-three-fiber | 28200 | https://github.com/pmndrs/react-three-fiber |
| 2 | libgdx | libgdx | 23800 | https://github.com/libgdx/libgdx |
| 3 | BabylonJS | Babylon.js | 23600 | https://github.com/BabylonJS/Babylon.js |
| 4 | FreeCAD | FreeCAD | 23100 | https://github.com/FreeCAD/FreeCAD |
| 5 | ssloy | tinyrenderer | 21300 | https://github.com/ssloy/tinyrenderer |
| 6 | lettier | 3d-game-shaders-for-beginners | 18400 | https://github.com/lettier/3d-game-shaders-for... |
| 7 | aframevr | aframe | 16900 | https://github.com/aframevr/aframe |
| 8 | blender | blender | 14200 | https://github.com/blender/blender |
| 9 | CesiumGS | cesium | 13300 | https://github.com/CesiumGS/cesium |


# Final code

In [64]:
import os 
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using BeautifulSoup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

def get_repo_info(h3_tag, star_tag):
    # return all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url 

def get_topic_repos(topic_doc):
    # Get the h3 tags containing repo title, repo URL and username
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})
    # Get star tags
    star_tags =  topic_doc.find_all('span', {'class': 'Counter js-social-count'})

    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }
    # Get repo info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)

def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print("The file {} already exists. skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=None)

In [65]:
url4 = topic_urls[4]

In [66]:
url4

'https://github.com/topics/android'

In [67]:
topic4_doc =  get_topic_page(url4)

In [68]:
topic4_repos = get_topic_repos(topic4_doc)

In [69]:
topic4_repos

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | flutter | flutter | 169000 | https://github.com/flutter/flutter |
| 1 | facebook | react-native | 121000 | https://github.com/facebook/react-native |
| 2 | Genymobile | scrcpy | 118000 | https://github.com/Genymobile/scrcpy |
| 3 | justjavac | free-programming-books-zh_CN | 113000 | https://github.com/justjavac/free-programming-... |
| 4 | Hack-with-Github | Awesome-Hacking | 89100 | https://github.com/Hack-with-Github/Awesome-Ha... |
| 5 | Solido | awesome-flutter | 54800 | https://github.com/Solido/awesome-flutter |
| 6 | tldr-pages | tldr | 53700 | https://github.com/tldr-pages/tldr |
| 7 | wasabeef | awesome-android-ui | 51600 | https://github.com/wasabeef/awesome-android-ui |
| 8 | google | material-design-icons | 51100 | https://github.com/google/material-design-icons |
| 9 | laurent22 | joplin | 47800 | https://github.com/laurent22/joplin |


In [70]:
topic_urls[6]

'https://github.com/topics/ansible'

In [71]:
# second method 
get_topic_repos(get_topic_page(topic_urls[6]))

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | bregman-arie | devops-exercises | 67900 | https://github.com/bregman-arie/devops-exercises |
| 1 | ansible | ansible | 64099 | https://github.com/ansible/ansible |
| 2 | trailofbits | algo | 29200 | https://github.com/trailofbits/algo |
| 3 | MichaelCade | 90DaysOfDevOps | 27700 | https://github.com/MichaelCade/90DaysOfDevOps |
| 4 | StreisandEffect | streisand | 23200 | https://github.com/StreisandEffect/streisand |
| 5 | kubernetes-sigs | kubespray | 16500 | https://github.com/kubernetes-sigs/kubespray |
| 6 | ansible | awx | 14300 | https://github.com/ansible/awx |
| 7 | semaphoreui | semaphore | 11400 | https://github.com/semaphoreui/semaphore |
| 8 | easzlab | kubeasz | 10700 | https://github.com/easzlab/kubeasz |
| 9 | netbootxyz | netboot.xyz | 9800 | https://github.com/netbootxyz/netboot.xyz |


In [72]:
get_topic_repos(get_topic_page(topic_urls[6])).to_csv('ansible.csv', index=None)

Write a single function that does the following:
1. Gets the list of topics from the topics page.
2. Gets the list of top repos from each individual topic page.
3. Creates a CSV file of the top repos for each topic.

In [76]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

def get_topic_descs(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls

def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the downloaded page here instead of relying on a global `doc`
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }

    return pd.DataFrame(topics_dict)
        

In [77]:
def scrape_topic_repos():
    print('scraping list of topic')
    topic_df = scrape_topics()
    os.makedirs('data', exist_ok=True)
    for index, row in topic_df.iterrows():
        print('scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.CSV'.format(row['title']))

In [78]:
scrape_topic_repos()

scraping list of topic
scraping top repositories for "3D"
scraping top repositories for "Ajax"
scraping top repositories for "Algorithm"
scraping top repositories for "Amp"
scraping top repositories for "Android"
scraping top repositories for "Angular"
scraping top repositories for "Ansible"
scraping top repositories for "API"
scraping top repositories for "Arduino"
scraping top repositories for "ASP.NET"
scraping top repositories for "Awesome Lists"
scraping top repositories for "Amazon Web Services"
scraping top repositories for "Azure"
scraping top repositories for "Babel"
scraping top repositories for "Bash"
scraping top repositories for "Bitcoin"
scraping top repositories for "Bootstrap"
scraping top repositories for "Bot"
scraping top repositories for "C"
scraping top repositories for "Chrome"
scraping top repositories for "Chrome extension"
scraping top repositories for "Command-line interface"
scraping top repositories for "Clojure"
scraping top repositories for "Code quality"
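
The file names above come from topic titles, some of which contain spaces or punctuation ("Command-line interface", "C++", "ASP.NET"). One alternative is to derive the name from the last segment of the topic URL. The sketch below assumes URLs of the form `https://github.com/topics/<slug>`; `topic_csv_path` is a hypothetical helper.

```python
from urllib.parse import urlparse

def topic_csv_path(topic_url, folder='data'):
    """Build a CSV path from the last path segment of a topic URL."""
    slug = urlparse(topic_url).path.rstrip('/').split('/')[-1]
    return '{}/{}.csv'.format(folder, slug)

print(topic_csv_path('https://github.com/topics/chrome-extension'))
print(topic_csv_path('https://github.com/topics/aspnet'))
```

Inside `scrape_topic_repos`, this would replace the `'data/{}.CSV'.format(row['title'])` expression with `topic_csv_path(row['url'])`.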


In [79]:
r = requests.get('https://raw.githubusercontent.com/github/explore/c700f6f5bb68a850405eef411cf878162ff34b59/topics/angular/angular.png')

In [80]:
r.status_code

200

In [81]:
with open('image.png', 'wb') as f:
    f.write(r.content)
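
When saving a downloaded file, the extension can be taken from the URL so that it matches the content being written. A minimal sketch, assuming the URL path ends in a file name; `extension_from_url` and the `.bin` fallback are illustrative choices.

```python
import os
from urllib.parse import urlparse

def extension_from_url(url, default='.bin'):
    """Return the file extension found in the URL path, or a default."""
    ext = os.path.splitext(urlparse(url).path)[1]
    return ext or default

image_url = ('https://raw.githubusercontent.com/github/explore/'
             'c700f6f5bb68a850405eef411cf878162ff34b59/topics/angular/angular.png')
print(extension_from_url(image_url))
```

The result can then be used to build the output name, for example `'image' + extension_from_url(image_url)`.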