# Top Repositories for GitHub Topics

Pick a website and describe your objective
- Browse through different sites and pick on to scrape. Check the "Project - Ideas" section for inspiration.

Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.- 
Summarize your project idea and outline your strategy in a Juptyer notebook. Use the "New" button above.


### Project Outline:
- we're going to scrape https://github.com/topics
- we'll get a list of topics, For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page.
- For each repository we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:

  Repo Name,User Name,Stars,Repo URL
  
three.js,mrdoob,99100,https://github.com/mrdoob/three.js  
liggdx,libgdx,22800,https://github.com/libgdx/libgdx

### Use the requests library to download web pages

In [4]:
!pip install requests --upgrade --quiet

In [5]:
import requests

In [7]:
topics_url = 'https://github.com/topics'

In [8]:
response=requests.get(topics_url)

In [9]:
response.status_code

200

In [10]:
len(response.text)

186716

In [11]:
page_content=response.text
page_content[:1000]

'\n\n<!DOCTYPE html>\n<html\n  lang="en"\n  \n  data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"\n  data-a11y-animated-images="system" data-a11y-link-underlines="true"\n  >\n\n\n\n\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-f552bab6ce72.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-4589f64a2275.css" /><link data-color-theme="dark_dimmed" crossor

In [12]:
with open('webpage.html', 'w') as f:
    f.write(page_content)

UnicodeEncodeError: 'charmap' codec can't encode character '\u2011' in position 162995: character maps to <undefined>

### Use Beautiful Soup to parse and extract information

In [13]:
!pip install beautifulsoup4 --upgrade --quiet

In [14]:
from bs4 import BeautifulSoup
doc= BeautifulSoup(page_content,'html.parser')
doc


<!DOCTYPE html>

<html data-a11y-animated-images="system" data-a11y-link-underlines="true" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-f552bab6ce72.css" media="all" rel="stylesheet"><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-4589f64a2275.css" media="all" rel="stylesheet"><link crossorigin="anonymous" data-color-theme="dark_dimmed" data-href="https://github.githubassets.com/

In [15]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = doc.find_all("p",{'class': selection_class})
len(topic_title_tags)

30

In [16]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [17]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = doc.find_all("p",{'class': selection_class})
desc_selector= 'f5 color-fg-muted mb-0 mt-1'
topic_desc_tags = doc.find_all('p',{'class': desc_selector})

In [18]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [19]:
topic_title_tag0 = topic_title_tags[0]

In [20]:
topic_title_tag0

<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>

In [20]:
topic_title_tag0.parent.parent

<div class="py-4 border-bottom d-flex flex-justify-between">
<a class="no-underline flex-grow-0" href="/topics/3d">
<div class="color-bg-accent f4 color-fg-muted text-bold rounded mr-3 flex-shrink-0 text-center" style="width:64px; height:64px; line-height:64px;">
            #
          </div>
</a>
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
        </p>
</a>
<div class="flex-grow-0">
<div class="d-block" data-view-component="true">
<a aria-label="You must be signed in to star a repository" class="tooltipped tooltipped-s btn-sm btn" data-hydro-click='{"event_type":"authentication.click","payload":{"location_in_page":"star button","repository_id":null,"auth_type":"LOG_IN","originating_url":"https://github.com/topics","user_id":null}}' data-hydro-click-hmac="7

In [21]:
div_tag = topic_title_tag0.parent

div_tag.parent

<div class="py-4 border-bottom d-flex flex-justify-between">
<a class="no-underline flex-grow-0" href="/topics/3d">
<div class="color-bg-accent f4 color-fg-muted text-bold rounded mr-3 flex-shrink-0 text-center" style="width:64px; height:64px; line-height:64px;">
            #
          </div>
</a>
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
        </p>
</a>
<div class="flex-grow-0">
<div class="d-block" data-view-component="true">
<a aria-label="You must be signed in to star a repository" class="tooltipped tooltipped-s btn-sm btn" data-hydro-click='{"event_type":"authentication.click","payload":{"location_in_page":"star button","repository_id":null,"auth_type":"LOG_IN","originating_url":"https://github.com/topics","user_id":null}}' data-hydro-click-hmac="7

In [22]:
topic_link_tags = doc.find_all('a',{'class':'no-underline flex-1 d-flex flex-column'})

len(topic_link_tags)

30

In [None]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = doc.find_all("p",{'class': selection_class})
desc_selector= 'f5 color-fg-muted mb-0 mt-1'
topic_desc_tags = doc.find_all('p',{'class': desc_selector})
topic_link_tags = doc.find_all('a',{'class':'no-underline flex-1 d-flex flex-column'})

topic_title = []

for tag in topic_title_tags:
    topic_title.append(tag.text)

print(topic_title)

In [23]:
topic0_url = "https://github.com" + topic_link_tags[0]['href']

print(topic0_url)

https://github.com/topics/3d


In [24]:
topic_title_tags[0].text

'3D'

In [25]:
topic_title = []

for tag in topic_title_tags:
    topic_title.append(tag.text)

print(topic_title)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [26]:
topic_descs = []

for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())

topic_descs[:5]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [27]:
topic0_url = "https://github.com" + topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d


In [28]:
topic_urls= []
base_url= 'https://github.com'

for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])

topic_urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

In [29]:
!pip install pandas --quiet
import pandas as pd

In [30]:
topics_dict = {
    'Title': topic_title,
    'Description': topic_descs,
    'Url':topic_urls
}
topics_dict

{'Title': ['3D',
  'Ajax',
  'Algorithm',
  'Amp',
  'Android',
  'Angular',
  'Ansible',
  'API',
  'Arduino',
  'ASP.NET',
  'Atom',
  'Awesome Lists',
  'Amazon Web Services',
  'Azure',
  'Babel',
  'Bash',
  'Bitcoin',
  'Bootstrap',
  'Bot',
  'C',
  'Chrome',
  'Chrome extension',
  'Command line interface',
  'Clojure',
  'Code quality',
  'Code review',
  'Compiler',
  'Continuous integration',
  'COVID-19',
  'C++'],
 'Description': ['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
  'Ajax is a technique for creating interactive web applications.',
  'Algorithms are self-contained sequences that carry out a variety of tasks.',
  'Amp is a non-blocking concurrency library for PHP.',
  'Android is an operating system built by Google designed for mobile devices.',
  'Angular is an open source web application platform.',
  'Ansible is a simple and powerful automation engine.',
  'An API (Application Programming Interface) is a 

In [31]:
topics_df = pd.DataFrame(topics_dict)
topics_df

Unnamed: 0,Title,Description,Url
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


### Create CSV file(s) with the extracted information

In [32]:
topics_df.to_csv('topics.csv',index=None)

In [33]:
topic_page_url = topic_urls[0]

In [34]:
topic_page_url

'https://github.com/topics/3d'

In [35]:
response = requests.get(topic_page_url)
response.status_code

200

In [36]:
len(response.text)

498291

In [37]:
topic_doc = BeautifulSoup(response.text,'html.parser')

In [43]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags= topic_doc.find_all('h3',{'class':h3_selection_class})

len(repo_tags)

20

In [45]:
repo_tags[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="c72fbd5c69a8ee7c9c53a4e65de2b93c8fc7552dd793945819639bc165c0f0ba" data-turbo="false" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4a2667db3d63a1739c412e059e5da95afe419df83f70949b5d59dc3478f5c79a" data-turbo="false" data-view-component="true" href

In [46]:
a_tags = repo_tags[0].find_all('a')
a_tags[0].text.strip()

'mrdoob'

In [47]:
a_tags[1].text.strip()

'three.js'

In [48]:
base_url = 'https://github.com'
repo_url = base_url + a_tags[1]['href']
print(repo_url)

https://github.com/mrdoob/three.js


In [49]:
base_url

'https://github.com'

In [50]:
star_tags = topic_doc.find_all('span',{'class': 'Counter js-social-count'})
len(star_tags)

20

In [51]:
star_tags[0].text

'99.4k'

In [68]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
       return int(float(stars_str[:-1])* 1000)
    return int (stars_str)

In [69]:
parse_star_count(star_tags[0].text)

99400

In [54]:
stars_str = '992'

In [55]:
int(stars_str)

992

In [56]:
stars_str[-1]

'2'

In [57]:
int(float(stars_str[:-1])*1000)

99000

In [70]:
def get_repo_info(h3_tag, star_tag):
    #return all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tags[0].text)
    return username, repo_name, stars, repo_url


In [59]:
get_repo_info(repo_tags[0],star_tags[0])

('mrdoob', 'three.js', 99400, 'https://github.com/mrdoob/three.js')

In [71]:
topic_repos_dict = {
    'Username':[],
    'Repo_name':[],
    'Stars':[],
    'Repo_url':[]
}

for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i],star_tags[i])
    topic_repos_dict['Username'].append(repo_info[0])
    topic_repos_dict['Repo_name'].append(repo_info[1])
    topic_repos_dict['Stars'].append(repo_info[2])
    topic_repos_dict['Repo_url'].append(repo_info[3])

In [72]:
def get_topic_page(topic_url):
     # Download the page
    response = requests.get(topic_url)
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text,'html.parser')
    return topic_doc

def get_repo_info(h3_tag, star_tag):
    #return all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

def get_topic_repos(topic_doc):
    # Get the h3 tags containing repo title, repo URL and username
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags= topic_doc.find_all('h3',{'class':h3_selection_class})
    # Get star tags
    star_tags = topic_doc.find_all('span',{'class': 'Counter js-social-count'})
    
    topic_repos_dict = {
            'Username':[],
            'Repo_name':[],
            'Stars':[],
            'Repo_url':[]
        }
    # Get repo info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i],star_tags[i])
        topic_repos_dict['Username'].append(repo_info[0])
        topic_repos_dict['Repo_name'].append(repo_info[1])
        topic_repos_dict['Stars'].append(repo_info[2])
        topic_repos_dict['Repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)


In [73]:
url4 = topic_urls[4]
url4

'https://github.com/topics/android'

In [74]:
topic4_doc = get_topic_page(url4)

In [74]:
topic_repos_df = pd.DataFrame(topic_repos_dict)

topic_repos_df

Unnamed: 0,Username,Repo_name,Stars,Repo_url
0,mrdoob,three.js,99300,https://github.com/mrdoob/three.js
1,pmndrs,react-three-fiber,99300,https://github.com/pmndrs/react-three-fiber
2,libgdx,libgdx,99300,https://github.com/libgdx/libgdx
3,BabylonJS,Babylon.js,99300,https://github.com/BabylonJS/Babylon.js
4,ssloy,tinyrenderer,99300,https://github.com/ssloy/tinyrenderer
5,FreeCAD,FreeCAD,99300,https://github.com/FreeCAD/FreeCAD
6,lettier,3d-game-shaders-for-beginners,99300,https://github.com/lettier/3d-game-shaders-for...
7,aframevr,aframe,99300,https://github.com/aframevr/aframe
8,CesiumGS,cesium,99300,https://github.com/CesiumGS/cesium
9,blender,blender,99300,https://github.com/blender/blender


In [65]:
topic4_repos = get_topic_repos(topic4_doc)
topic4_repos

Unnamed: 0,Username,Repo_name,Stars,Repo_url
0,flutter,flutter,99300,https://github.com/flutter/flutter
1,facebook,react-native,99300,https://github.com/facebook/react-native
2,justjavac,free-programming-books-zh_CN,99300,https://github.com/justjavac/free-programming-...
3,Genymobile,scrcpy,99300,https://github.com/Genymobile/scrcpy
4,Hack-with-Github,Awesome-Hacking,99300,https://github.com/Hack-with-Github/Awesome-Ha...
5,Solido,awesome-flutter,99300,https://github.com/Solido/awesome-flutter
6,google,material-design-icons,99300,https://github.com/google/material-design-icons
7,wasabeef,awesome-android-ui,99300,https://github.com/wasabeef/awesome-android-ui
8,tldr-pages,tldr,99300,https://github.com/tldr-pages/tldr
9,square,okhttp,99300,https://github.com/square/okhttp


In [75]:
topic_urls[5]

'https://github.com/topics/angular'

In [78]:
get_topic_repos(get_topic_page(topic_urls[5])).to_csv('angular.csv', index=None)

In [76]:
range(len(repo_tags))

range(0, 20)

Write a single fuction to:

1.Get the list of topics from the topic page
2.Get the list of top repos format individual topic pages
3.For each topic, create a CSV of the repos for the topic

In [95]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

def get_topic_descs(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls

def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)


In [78]:
topics_url

'https://github.com/topics'

In [96]:
scrape_topics()

Unnamed: 0,title,description,url
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


### Document and share your work