# Top Repositories for GitHub Topics




### Project Outline:

- We're going to scrape https://github.com/topics

- We'll get a list of topics on the first page. For each topic, we'll get topic title, topic page URL and topic description.

- For each topic, we'll get the top repositories from the topic page.

- For each repository, we'll grab the repo name, username, stars and repo URL.

- For each topic we'll create a CSV file in the following format:
  Repo name, Username, Stars, RepoURL
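
As a rough sketch of that target format, the snippet below writes a header and one row in the same column order using Python's built-in `csv` module (the notebook itself uses pandas for this later). The `rows` list and the in-memory buffer are assumptions made for illustration; the sample values are taken from the 3D topic results that appear further down.

```python
import csv
import io

# Column order taken from the project outline.
fieldnames = ['Repo name', 'Username', 'Stars', 'RepoURL']

# Hypothetical sample row, for illustration only.
rows = [
    {'Repo name': 'three.js', 'Username': 'mrdoob', 'Stars': 93200,
     'RepoURL': 'https://github.com/mrdoob/three.js'},
]

buffer = io.StringIO()  # stands in for a real file such as '3D.csv'
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()      # first line: the column names
writer.writerows(rows)    # one line per repository

csv_text = buffer.getvalue()
print(csv_text)
```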
  


## Use the requests library to download web pages

In [1682]:
!pip install requests --upgrade --quiet

In [1683]:
import requests

In [1684]:
topics_url = 'https://github.com/topics'
response = requests.get(topics_url)
# Creates a response object in accordance with the HTTP request 

In [1685]:
response.status_code

200

Every request you make returns a response with a status code. The status code indicates whether the request was successful; codes in the 200-299 range mean success. To learn about the other codes, search for "HTTP status codes".
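
To make the idea of status codes more concrete, here is a small sketch that uses the standard library's `http.HTTPStatus` enum to turn a numeric code into a readable phrase. The helper name `describe_status` is made up for this example, and it raises `ValueError` for codes that `HTTPStatus` does not know.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable summary of an HTTP status code."""
    status = HTTPStatus(code)         # ValueError for unknown codes
    is_success = 200 <= code < 300    # 2xx codes indicate success
    outcome = 'success' if is_success else 'not a success'
    return '{} {} ({})'.format(code, status.phrase, outcome)

print(describe_status(200))
print(describe_status(404))
```

In the notebook, the same check could be applied to `response.status_code` before parsing the page.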

In [1686]:
len(response.text)

164601

- response.text contains the full HTML source of the webpage.
- The page source is 164601 characters long. We could print all of response.text, but it's better not to.
- Instead, we'll display only the first 1000 characters.

In [1687]:
page_contents = response.text
page_contents[:1000]
#Displaying the first 1000 characters of the webpage in HTML

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"  data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-8cafbcbd78f4.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-31dc14e38457.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media="all" rel="stylesheet" data-href="

In [1688]:
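# Specify the encoding explicitly so the write does not depend on the platform default
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)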

## Use Beautiful Soup to parse and extract information

In [1689]:
#Installing beautiful soup
!pip install beautifulsoup4 --upgrade --quiet

In [1690]:
from bs4 import BeautifulSoup

In [1691]:
# Parsing HTML doc using Beautiful Soup
doc = BeautifulSoup(page_contents, 'html.parser')

In [1692]:
type(doc)
#doc contains all text in a parsed format

bs4.BeautifulSoup

In [1693]:
title_selector = 'f3 lh-condensed mb-0 mt-1 Link--primary'

topic_title_tags = doc.find_all('p', {'class' : title_selector})   # 'p' is the tag name; the dict maps the 'class' attribute to the value it must equal

In [1694]:
len(topic_title_tags)

30

In [1695]:
topic_title_tags[:5]  #First 5 of the 30 p tags that hold the topic titles

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

### Getting topic descriptions

In [1696]:
desc_selector = 'f5 color-fg-muted mb-0 mt-1'
topic_desc_tags = doc.find_all('p',{'class' : desc_selector})

In [1697]:
len(topic_desc_tags)

30

In [1698]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

### Getting URL to the Topic Page

In [1699]:
topic_title_tag0 = topic_title_tags[0]

In [1700]:
div_Tag = topic_title_tag0.parent

In [1701]:
topic_link_tags = doc.find_all('a',{'class':'no-underline flex-grow-0'})

In [1702]:
len(topic_link_tags)

30

In [1703]:
topic_link_tags[0]['href']  #We need the href attribute of the a tag

'/topics/3d'

In [1704]:
topic0_url = "https://github.com" + topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d
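
String concatenation works here because every `href` starts with `/`. As an alternative sketch, `urllib.parse.urljoin` from the standard library builds the same URL and also leaves an already-absolute link unchanged. The `hrefs` list below is a stand-in for the values read from the link tags, not data fetched from the page.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'

# Stand-in for the href values read from the link tags.
hrefs = ['/topics/3d', '/topics/ajax']

# Resolve each relative href against the base URL.
joined_urls = [urljoin(base_url, href) for href in hrefs]
print(joined_urls)

# An absolute URL is returned as-is instead of being glued onto the base.
absolute = urljoin(base_url, 'https://example.com/page')
print(absolute)
```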


### Cleaning up the information

In [1705]:
topic_titles =[]

for tag in topic_title_tags:
    topic_titles.append(tag.text)

print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [1706]:
topic_descs = []

for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())

topic_descs[:5]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']
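
The description tags carry leading and trailing whitespace from the page's indentation, which is why `.strip()` is applied. A minimal sketch of the same clean-up on plain strings, written as a list comprehension instead of an append loop; `raw_descs` is hand-written sample input that imitates the padded tag text.

```python
# Hand-written sample input imitating the padded text inside the p tags.
raw_descs = [
    '\n          Ajax is a technique for creating interactive web applications.\n        ',
    '\n          Amp is a non-blocking concurrency library for PHP.\n        ',
]

# strip() removes whitespace (spaces and newlines) from both ends only.
clean_descs = [desc.strip() for desc in raw_descs]
print(clean_descs)
```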

In [1707]:
topic_urls = []
base_url = 'https://github.com'

for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])
    
topic_urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']

## Create CSV file(s) with the extracted information

In [1708]:
!pip install pandas --quiet

In [1709]:
import pandas as pd

In [1710]:
#Create dictionary

topics_dict = {'title':topic_titles,
                'description':topic_descs,
                 'url':topic_urls}

In [1711]:
topics_df = pd.DataFrame(topics_dict)

In [1712]:
topics_df[:5]

Unnamed: 0,title,description,url
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android


In [1713]:
topics_df.to_csv('topics.csv')

#  If we don't want the row numbers (the index) in the file: topics_df.to_csv('topics.csv', index=None)

### Getting information out of a topic page

In [1714]:
topic_page_url = topic_urls[0]


In [1715]:
topic_page_url

'https://github.com/topics/3d'

In [1716]:
response1 = requests.get(topic_page_url)

In [1717]:
response1.status_code

200

In [1718]:
len(response1.text)

474171

In [1719]:
topic_doc = BeautifulSoup(response1.text, 'html.parser')

In [1720]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags = topic_doc.find_all('h3',{'class': h3_selection_class})

#repo_tags consists of the username, name of the repo and link of the repo

In [1721]:
len(repo_tags)

20

In [1722]:
repo_tags[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-turbo="false" data-view-component="true" href="/mrdoob/three.js

In [1723]:
a_tags = repo_tags[0].find_all('a')
len(a_tags)

2

In [1724]:
a_tags[0].text.strip()  #Gives username

'mrdoob'

In [1725]:
a_tags[1].text.strip() #Gives name of repo

'three.js'

In [1726]:
base_url = "https://github.com"
base_url + a_tags[1]['href']

'https://github.com/mrdoob/three.js'

In [1727]:
star_selection_class = 'Counter js-social-count'
star_tags = topic_doc.find_all('span',{'class':star_selection_class})

In [1728]:
len(star_tags)

20

In [1729]:
star_tags[0].text.strip() #returns the number of stars as a string

'93.2k'

In [1730]:
#Function to convert a star count string such as '93.2k' to an int
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1]=='k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)

In [1731]:
parse_star_count(star_tags[0].text.strip())

93200
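
The function above handles the `k` suffix seen in this notebook's output. As a hedged variation, the sketch below also accepts an `m` suffix and thousands separators, and uses `Decimal` to avoid float rounding before truncating to an int. The `m` suffix and comma handling are assumptions, since neither appears in the data shown here, and `parse_star_count_safe` is a hypothetical name.

```python
from decimal import Decimal

# Assumed suffixes: only 'k' actually appears in this notebook's output.
MULTIPLIERS = {'k': 1_000, 'm': 1_000_000}

def parse_star_count_safe(stars_str):
    """Convert strings such as '93.2k' or '1,234' to an int."""
    text = stars_str.strip().lower().replace(',', '')
    if not text:
        raise ValueError('empty star count')
    suffix = text[-1]
    if suffix in MULTIPLIERS:
        # Decimal keeps '93.2' exact, so the product is exactly 93200.
        return int(Decimal(text[:-1]) * MULTIPLIERS[suffix])
    return int(text)

print(parse_star_count_safe('93.2k'))
```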

In [1732]:
#get_repo_info returns the username, repo name, star count and URL for one repository

def get_repo_info(h3_tag, star_tag):
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [1733]:
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 93200, 'https://github.com/mrdoob/three.js')

In [1734]:
topic_repos_dict = { 'username':[],
                     'repo_name':[],
                     'stars':[],
                     'repo_url':[]}

for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i],star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])
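
The index-based loop above relies on `repo_tags` and `star_tags` having the same length and order. A minimal sketch of making that assumption explicit, with plain lists standing in for the two tag lists; `pair_up` is a hypothetical helper, not part of the notebook.

```python
def pair_up(repo_items, star_items):
    """Pair each repo item with its star item, failing loudly on a mismatch."""
    if len(repo_items) != len(star_items):
        raise ValueError('got {} repo items but {} star items'.format(
            len(repo_items), len(star_items)))
    # zip walks both lists in step, so no index variable is needed.
    return list(zip(repo_items, star_items))

# Stand-in values taken from the 3D topic output.
pairs = pair_up(['three.js', 'react-three-fiber'], [93200, 23200])
print(pairs)
```

In the notebook, the same pattern would read `for repo_tag, star_tag in zip(repo_tags, star_tags):`.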

In [1735]:
topic_repos_dict

{'username': ['mrdoob',
  'pmndrs',
  'libgdx',
  'BabylonJS',
  'ssloy',
  'lettier',
  'aframevr',
  'FreeCAD',
  'CesiumGS',
  'metafizzy',
  'isl-org',
  'timzhang642',
  'blender',
  'a1studmuffin',
  'domlysz',
  'FyroxEngine',
  'google',
  'nerfstudio-project',
  'openscad',
  'spritejs'],
 'repo_name': ['three.js',
  'react-three-fiber',
  'libgdx',
  'Babylon.js',
  'tinyrenderer',
  '3d-game-shaders-for-beginners',
  'aframe',
  'FreeCAD',
  'cesium',
  'zdog',
  'Open3D',
  '3D-Machine-Learning',
  'blender',
  'SpaceshipGenerator',
  'BlenderGIS',
  'Fyrox',
  'model-viewer',
  'nerfstudio',
  'openscad',
  'spritejs'],
 'stars': [93200,
  23200,
  21700,
  21000,
  17400,
  15700,
  15500,
  14500,
  10700,
  9900,
  9100,
  9000,
  9000,
  7400,
  6500,
  6300,
  5800,
  5800,
  5700,
  5200],
 'repo_url': ['https://github.com/mrdoob/three.js',
  'https://github.com/pmndrs/react-three-fiber',
  'https://github.com/libgdx/libgdx',
  'https://github.com/BabylonJS/Babylon.j

In [1736]:
topic_repos_df = pd.DataFrame(topic_repos_dict)

In [1737]:
topic_repos_df[:5]

Unnamed: 0,username,repo_name,stars,repo_url
0,mrdoob,three.js,93200,https://github.com/mrdoob/three.js
1,pmndrs,react-three-fiber,23200,https://github.com/pmndrs/react-three-fiber
2,libgdx,libgdx,21700,https://github.com/libgdx/libgdx
3,BabylonJS,Babylon.js,21000,https://github.com/BabylonJS/Babylon.js
4,ssloy,tinyrenderer,17400,https://github.com/ssloy/tinyrenderer


## Final Code

In [1738]:
def get_topic_page(topic_url):
    
    #Download the page
    response = requests.get(topic_url)
    
    #Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}' .format(topic_url))
    
    # Parse using beautiful soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

def get_repo_info(h3_tag, star_tag):
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url



def get_topic_repos(topic_doc):
    # Get the h3 tags containing repo name, repo url and username
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3',{'class': h3_selection_class})
    
    #Get the star tags
    star_selection_class = 'Counter js-social-count'
    star_tags = topic_doc.find_all('span',{'class':star_selection_class})
    
    topic_repos_dict ={'username':[], 
                       'repo_name':[], 
                       'stars':[], 
                       'repo_url':[]}
    
    # Get repo info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i],star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    
    return pd.DataFrame(topic_repos_dict)

def scrape_topic(topic_url, topic_name):
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(topic_name + '.csv', index=None)
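
`scrape_topic` uses the topic title directly as the file name. Titles such as "Awesome Lists" contain spaces, and a title containing a character like `/` would not be a valid file name. One possible way to guard against that is sketched below; `topic_filename` is a hypothetical helper and the set of allowed characters is an assumption.

```python
import re

def topic_filename(topic_name):
    """Build a CSV file name from a topic title, replacing unsafe characters."""
    # Keep letters, digits, '.', '+', '_' and '-'; replace any other run with '_'.
    safe = re.sub(r'[^A-Za-z0-9.+_-]+', '_', topic_name.strip())
    return safe + '.csv'

for title in ['Awesome Lists', 'C++', 'ASP.NET', 'COVID-19']:
    print(topic_filename(title))
```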

In [1739]:
url4 = topic_urls[4]

In [1740]:
topic4_doc = get_topic_page(url4)

In [1741]:
topic4_repos = get_topic_repos(topic4_doc)

In [1742]:
topic4_repos.to_csv('android.csv',index=None)

### Write a single function to:
- Get the list of topics from the topics webpage
- Get the list of top repos from the individual topic pages
- For each topic, create a CSV file with information about its top repos

In [1743]:
def get_topic_titles(doc):
    title_selector = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class' : title_selector})
    
    topic_titles =[]
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

def get_topic_descs(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p',{'class' : desc_selector})
    
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a',{'class':'no-underline flex-grow-0'})
    
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls

def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the page just downloaded instead of relying on the global doc
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {'title': get_topic_titles(doc),
                   'description': get_topic_descs(doc),
                   'url': get_topic_urls(doc)}
    return pd.DataFrame(topics_dict)


In [1744]:
scrape_topics()

Unnamed: 0,title,description,url
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


In [1745]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'],row['title'])

In [1746]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositories for "Clojure"
Scrapin

In [1747]:
import jovian

In [1748]:
jovian.commit()

<IPython.core.display.Javascript object>

[jovian] Updating notebook "jyotiradityabarsain07/scraping-github-topic-repos-rough" on https://jovian.com
[jovian] Committed successfully! https://jovian.com/jyotiradityabarsain07/scraping-github-topic-repos-rough


'https://jovian.com/jyotiradityabarsain07/scraping-github-topic-repos-rough'