# Top Repositories for GitHub Topics

### Pick a website and describe your objective

- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.

#### Project Outline:
- We are going to scrape https://github.com/topics
- We will get a list of topics. For each topic, we will get the topic title, topic page URL and topic description.
- For each topic, we will get the top repositories from the topic page (the first page lists 30 in the run recorded here).
- For each repository, we will grab the repo name, username, stars and repo URL.
- For each topic, we will create a CSV file in the following format:
```
Repo name, Username, Stars, Repo URL
```
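
As a minimal sketch of that output format, the snippet below renders rows with Python's standard `csv` module. The helper name `repos_to_csv` is hypothetical and not part of the notebook; the sample row reuses values that appear later in this notebook.

```python
import csv
import io

# Column order taken from the project outline above.
FIELDNAMES = ["Repo name", "Username", "Stars", "Repo URL"]

def repos_to_csv(rows):
    """Render (repo_name, username, stars, repo_url) tuples as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(FIELDNAMES)   # header line
    writer.writerows(rows)        # one line per repository
    return buffer.getvalue()

sample = [("three.js", "mrdoob", 69900, "https://github.com/mrdoob/three.js")]
csv_text = repos_to_csv(sample)
```

Using `csv.writer` rather than joining strings by hand means a value that itself contains a comma is quoted automatically.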

### Use the requests library to download web pages

- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.
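
The last step in the list above could look roughly like the following sketch. The function name `download_page` and its `get` parameter are assumptions made for illustration: `get` stands in for `requests.get` so that the helper can also be tried without network access.

```python
from pathlib import Path

def download_page(url, path, get=None):
    """Fetch `url` and save the response body to `path`.

    `get` is any callable with the same interface as `requests.get`.
    It defaults to `requests.get`; the parameter exists only so this
    sketch can be exercised offline (an assumption, not notebook code).
    """
    if get is None:
        import requests  # third-party; imported lazily for offline use
        get = requests.get
    response = get(url)
    # Anything other than 200 is treated as a failed download.
    if response.status_code != 200:
        raise RuntimeError(
            "Failed to load page {} (status {})".format(url, response.status_code)
        )
    Path(path).write_text(response.text, encoding="utf-8")
    return path
```

With `requests` installed, a call such as `download_page("https://github.com/topics", "topics.html")` would save the topics page next to the notebook.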

In [6]:
!pip install requests --upgrade --quiet

In [7]:
import requests

In [8]:
topics_url = "https://github.com/topics"

In [9]:
response = requests.get(topics_url)

In [10]:
response.status_code

200

In [11]:
len(response.text)

126380

In [12]:
page_contents = response.text
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" >\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-7KjiGvJiLLy6LJPGf3m67ejAdgQsgDdnxZYoaI6+Agd0ZxHKTCjoKZgaf3PgUjURCcVceAwySJJJWgitRskDiA==" rel="stylesheet" href="https://github.githubassets.com/assets/frameworks-eca8e21af2622cbcba2c93c67f79baed.css" />\n    <link crossorigin="anonymous" media="all" integrity="sha512-dDsAoT3mMaA8gyLZkshXL3vrnDAuIv4cNq2iN06+o44rOFIngYNNiTehUUzNuMoBXMaDg0MLhEaZNumoCiLJkw==" rel="stylesheet" href="https://github.githubassets.com/assets/behaviors-743b00a13de631a03c8322d992c8572f.css" />\n    \n    \n    \n    <link crossorigin="anonymous" media="all" integrity="sha512-Rzg

In [13]:
with open("webpage.html", 'w', encoding="utf-8") as f:
    f.write(page_contents)

### Use Beautiful Soup to parse and extract information

- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract information from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.

In [14]:
!pip install beautifulsoup4 --upgrade --quiet

In [15]:
from bs4 import BeautifulSoup

In [16]:
doc = BeautifulSoup(page_contents,'html.parser')

In [17]:
type(doc)

bs4.BeautifulSoup

In [18]:
p_tags = doc.find_all('p')

In [19]:
len(p_tags)

67

In [20]:
p_tags[:5]

[<p class="f4 color-text-secondary col-md-6 mx-auto">Browse popular topics on GitHub.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Emacs
       </p>,
 <p class="f5 color-text-secondary text-center mb-0 mt-1">Emacs is an extensible, customizable, free text editor and computing environment.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Spring Boot
       </p>,
 <p class="f5 color-text-secondary text-center mb-0 mt-1">Spring Boot is a coding and configuration model for Java applications.</p>]

In [21]:
selection_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
topic_title_tags = doc.find_all('p',{'class':selection_class})

In [22]:
len(topic_title_tags)

30

In [23]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [24]:
desc_selector = "f5 color-text-secondary mb-0 mt-1"
topic_desc_tags = doc.find_all('p',{'class':desc_selector})

In [25]:
len(topic_desc_tags)

30

In [26]:
topic_desc_tags[:5]

[<p class="f5 color-text-secondary mb-0 mt-1">
               3D modeling is the process of virtually developing the surface and structure of a 3D object.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Ajax is a technique for creating interactive web applications.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Algorithms are self-contained sequences that carry out a variety of tasks.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Amp is a non-blocking concurrency framework for PHP.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Android is an operating system built by Google designed for mobile devices.
             </p>]

In [27]:
topic_title_tag0 = topic_title_tags[0]

In [28]:
topic_title_tag0

<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>

In [29]:
topic_link_tags = doc.find_all('a',{"class":"d-flex no-underline"})

In [30]:
len(topic_link_tags)

30

In [31]:
topic_link_tags[0]['href']

'/topics/3d'

In [32]:
topic0_url = "https://github.com"+topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d


In [33]:
topic_titles = []
for tag in topic_title_tags:
    topic_titles.append(tag.text)
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [34]:
topic_descs = []
for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())
print(topic_descs)

['3D modeling is the process of virtually developing the surface and structure of a 3D object.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency framework for PHP.', 'Android is an operating system built by Google designed for mobile devices.', 'Angular is an open source web application platform.', 'Ansible is a simple and powerful automation engine.', 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.', 'Arduino is an open source hardware and software company and maker community.', 'ASP.NET is a web framework for building modern web apps and services.', 'Atom is a open source text editor built with web technologies.', 'An awesome list is a list of awesome things curated by the community.', 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.', 'Azure is a clo

In [35]:
topic_urls = []
base_url = "https://github.com"
for tag in topic_link_tags:
    topic_urls.append(base_url+tag['href'])
print(topic_urls)

['https://github.com/topics/3d', 'https://github.com/topics/ajax', 'https://github.com/topics/algorithm', 'https://github.com/topics/amphp', 'https://github.com/topics/android', 'https://github.com/topics/angular', 'https://github.com/topics/ansible', 'https://github.com/topics/api', 'https://github.com/topics/arduino', 'https://github.com/topics/aspnet', 'https://github.com/topics/atom', 'https://github.com/topics/awesome', 'https://github.com/topics/aws', 'https://github.com/topics/azure', 'https://github.com/topics/babel', 'https://github.com/topics/bash', 'https://github.com/topics/bitcoin', 'https://github.com/topics/bootstrap', 'https://github.com/topics/bot', 'https://github.com/topics/c', 'https://github.com/topics/chrome', 'https://github.com/topics/chrome-extension', 'https://github.com/topics/cli', 'https://github.com/topics/clojure', 'https://github.com/topics/code-quality', 'https://github.com/topics/code-review', 'https://github.com/topics/compiler', 'https://github.com/t

### Create CSV file(s) with the extracted information

- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.
- Verify the information in the CSV files by reading them back using Pandas.

In [36]:
!pip install pandas --upgrade --quiet

In [37]:
import pandas as pd

In [38]:
topics_dict = {'title':topic_titles,'descriptions':topic_descs,'url':topic_urls}

In [39]:
topics_df = pd.DataFrame(topics_dict)

In [40]:
topics_df

Unnamed: 0,title,descriptions,url
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency framework fo...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


In [41]:
topics_df.to_csv("topics.csv",index=None)

## Getting information out of a topic page

In [42]:
topic_page_url = topic_urls[0]

In [43]:
topic_page_url

'https://github.com/topics/3d'

In [44]:
response = requests.get(topic_page_url)

In [45]:
response.status_code

200

In [46]:
len(response.text)

580471

In [47]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [48]:
h1_selection_class = "f3 color-text-secondary text-normal lh-condensed"
repo_tags = topic_doc.find_all('h1',{"class":h1_selection_class})

In [49]:
len(repo_tags)

30

In [50]:
a_tags = repo_tags[0].find_all('a')

In [51]:
a_tags[0].text.strip()

'mrdoob'

In [52]:
a_tags[1].text.strip()

'three.js'

In [53]:
a_tags[1]['href']

'/mrdoob/three.js'

In [54]:
base_url = "https://github.com"
repo_url = base_url+a_tags[1]['href']
print(repo_url)

https://github.com/mrdoob/three.js


In [55]:
star_tags = topic_doc.find_all('a',{"class":"social-count float-none"})

In [56]:
len(star_tags)

30

In [57]:
star_tags[0].text.strip()

'69.9k'

In [58]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1])*1000)
    return int(stars_str)

In [59]:
parse_star_count(star_tags[0].text.strip())

69900
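
A slightly more defensive variant is sketched below. The name `parse_star_count_safe` is hypothetical, and support for an `m` suffix and for thousands separators is an assumption: only the `k` suffix appears in the output above.

```python
from decimal import Decimal

# Suffix multipliers; only "k" is seen in this notebook, "m" is an assumption.
SUFFIXES = {"k": 1_000, "m": 1_000_000}

def parse_star_count_safe(stars_str):
    """Convert strings such as '69.9k', '1,234' or '857' to an integer."""
    text = stars_str.strip().lower().replace(",", "")
    if not text:
        raise ValueError("empty star count")
    multiplier = SUFFIXES.get(text[-1])
    if multiplier is not None:
        # Decimal keeps '69.9' exact, so truncating to int cannot lose a unit
        # to binary floating-point rounding.
        return int(Decimal(text[:-1]) * multiplier)
    return int(text)
```

The empty-string check matters because indexing `text[-1]` on an empty string would otherwise raise an `IndexError` with a less helpful message.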

In [60]:
def get_repo_info(h1_tag,star_tag):
    # Return the username, repo name, star count and repo URL for one repository
    a_tags = h1_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url +a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [61]:
get_repo_info(repo_tags[0],star_tags[0])

('mrdoob', 'three.js', 69900, 'https://github.com/mrdoob/three.js')

In [62]:
topic_repos_dict = {'username':[], "repo_name":[], "stars":[], "repo_url":[]}
for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i],star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])

In [63]:
topic_repos_dict

{'username': ['mrdoob',
  'libgdx',
  'BabylonJS',
  'pmndrs',
  'aframevr',
  'ssloy',
  'FreeCAD',
  'metafizzy',
  'lettier',
  'CesiumGS',
  'a1studmuffin',
  'timzhang642',
  'spritejs',
  'tensorspace-team',
  'intel-isl',
  'jagenjo',
  'AaronJackson',
  'YadiraF',
  'openscad',
  'domlysz',
  'ssloy',
  'mosra',
  'cleardusk',
  'gfxfundamentals',
  'jasonlong',
  'google',
  'blender',
  'antvis',
  'pissang',
  'tinyobjloader'],
 'repo_name': ['three.js',
  'libgdx',
  'Babylon.js',
  'react-three-fiber',
  'aframe',
  'tinyrenderer',
  'FreeCAD',
  'zdog',
  '3d-game-shaders-for-beginners',
  'cesium',
  'SpaceshipGenerator',
  '3D-Machine-Learning',
  'spritejs',
  'tensorspace',
  'Open3D',
  'webglstudio.js',
  'vrn',
  'PRNet',
  'openscad',
  'BlenderGIS',
  'tinyraytracer',
  'magnum',
  '3DDFA',
  'webgl-fundamentals',
  'isometric-contributions',
  'model-viewer',
  'blender',
  'L7',
  'claygl',
  'tinyobjloader'],
 'stars': [69900,
  18300,
  13900,
  13000,
  1270

In [64]:
topic_repos_df = pd.DataFrame(topic_repos_dict)

In [65]:
topic_repos_df

Unnamed: 0,username,repo_name,stars,repo_url
0,mrdoob,three.js,69900,https://github.com/mrdoob/three.js
1,libgdx,libgdx,18300,https://github.com/libgdx/libgdx
2,BabylonJS,Babylon.js,13900,https://github.com/BabylonJS/Babylon.js
3,pmndrs,react-three-fiber,13000,https://github.com/pmndrs/react-three-fiber
4,aframevr,aframe,12700,https://github.com/aframevr/aframe
5,ssloy,tinyrenderer,10500,https://github.com/ssloy/tinyrenderer
6,FreeCAD,FreeCAD,9200,https://github.com/FreeCAD/FreeCAD
7,metafizzy,zdog,8400,https://github.com/metafizzy/zdog
8,lettier,3d-game-shaders-for-beginners,8400,https://github.com/lettier/3d-game-shaders-for...
9,CesiumGS,cesium,6900,https://github.com/CesiumGS/cesium


In [110]:
def get_topic_repos(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check successful response
    if response.status_code != 200:
        raise Exception("Failed to load page {}".format(topic_url))
    # Parse using BeautifulSoup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    # Get h1 tags containing repo title, repo url and username
    h1_selection_class = "f3 color-text-secondary text-normal lh-condensed"
    repo_tags = topic_doc.find_all('h1',{"class":h1_selection_class})
    # Get star tags
    star_tags = topic_doc.find_all('a',{"class":"social-count float-none"})
    topic_repos_dict = {'username':[], "repo_name":[], "stars":[], "repo_url":[]}
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i],star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    return pd.DataFrame(topic_repos_dict)

import os

def scrape_topic(topic_url,path):
    if os.path.exists(path):
        print('The file {} already exists. Skipping...'.format(path))
        return
    topic_df = get_topic_repos(topic_url)
    topic_df.to_csv(path , index = None)

In [67]:
get_topic_repos(topic_urls[6]).to_csv("ansible.csv",index = None)

In [68]:
get_topic_repos(topic_urls[6])

Unnamed: 0,username,repo_name,stars,repo_url
0,ansible,ansible,47800,https://github.com/ansible/ansible
1,StreisandEffect,streisand,22300,https://github.com/StreisandEffect/streisand
2,trailofbits,algo,20500,https://github.com/trailofbits/algo
3,kubernetes-sigs,kubespray,10400,https://github.com/kubernetes-sigs/kubespray
4,ansible,awx,9600,https://github.com/ansible/awx
5,bregman-arie,devops-exercises,8200,https://github.com/bregman-arie/devops-exercises
6,easzlab,kubeasz,6800,https://github.com/easzlab/kubeasz
7,geerlingguy,ansible-for-devops,4500,https://github.com/geerlingguy/ansible-for-devops
8,rundeck,rundeck,4200,https://github.com/rundeck/rundeck
9,ansible-semaphore,semaphore,4000,https://github.com/ansible-semaphore/semaphore


Write a single function to:
1. Get the list of topics from the topics page.
2. Get the list of top repos from the individual topic pages.
3. For each topic, create a CSV of the top repos for the topic.

In [84]:
def get_topic_titles(doc):
    selection_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
    topic_title_tags = doc.find_all('p',{'class':selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

def get_topic_decsc(doc):
    desc_selector = "f5 color-text-secondary mb-0 mt-1"
    topic_desc_tags = doc.find_all('p',{'class':desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a',{"class":"d-flex no-underline"})
    topic_urls = []
    base_url = "https://github.com"
    for tag in topic_link_tags:
        topic_urls.append(base_url+tag['href'])
    return topic_urls

def scrape_topics():
    topics_url = "https://github.com/topics"
    response = requests.get(topics_url)
    # Check successful response
    if response.status_code != 200:
        raise Exception("Failed to load page {}".format(topics_url))
    # Parse the page that was just downloaded instead of relying on the global `doc`
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title' : get_topic_titles(doc),
        'description' : get_topic_decsc(doc),
        'url' : get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)

In [85]:
scrape_topics()

Unnamed: 0,title,description,url
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency framework fo...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


In [None]:
import os


In [111]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    os.makedirs('data',exist_ok = True)
    for index,row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'],'data/{}.csv'.format(row['title']))
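
The file name above is built from the topic title, which can contain spaces (for example "Command line interface"). One possible alternative, sketched here under the assumption that the last segment of each topic URL is a usable slug, derives the path from the URL instead. The helper name `csv_path_for_topic` is hypothetical.

```python
import posixpath

def csv_path_for_topic(topic_url, folder="data"):
    """Build a CSV path from the last segment of a topic URL."""
    # 'https://github.com/topics/chrome-extension' -> 'chrome-extension'
    slug = topic_url.rstrip("/").rsplit("/", 1)[-1]
    if not slug:
        raise ValueError("cannot derive a file name from {!r}".format(topic_url))
    return posixpath.join(folder, slug + ".csv")
```

Inside the loop this would replace the formatted title, as in `scrape_topic(row['url'], csv_path_for_topic(row['url']))`.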

In [112]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"


Exception: Failed to load page https://github.com/topics/c
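
The message above does not include the status code, so the cause of the failure is not known from this output. If it was a transient error or rate limiting (an assumption), retrying with increasing delays might help. The sketch below shows one way to do that; `get_with_retries`, its parameters and the delay values are illustrative choices, not part of the notebook.

```python
import time

def get_with_retries(url, get, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call `get(url)` until it returns status 200, waiting longer after each failure.

    `get` is a callable such as `requests.get`; `sleep` is injectable so the
    backoff schedule can be checked without actually waiting.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        response = get(url)
        if response.status_code == 200:
            return response
        if attempt < max_attempts:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(
        "Failed to load page {} after {} attempts (last status {})".format(
            url, max_attempts, response.status_code
        )
    )
```

In `get_topic_repos`, the call `requests.get(topic_url)` could then become `get_with_retries(topic_url, requests.get)`. Including the status code in the error message also makes a failure like the one above easier to diagnose.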

In [113]:
!pip install jovian --upgrade --quiet

In [114]:
import jovian

In [115]:
# Execute this to save new versions of the notebook
jovian.commit()


[jovian] Attempting to save notebook..
[jovian] Updating notebook "kerkarpooja23/scraping-github-topics-repositories" on https://jovian.ai
[jovian] Uploading notebook..
[jovian] Uploading additional files...
[jovian] Committed successfully! https://jovian.ai/kerkarpooja23/scraping-github-topics-repositories


'https://jovian.ai/kerkarpooja23/scraping-github-topics-repositories'