<a href="https://colab.research.google.com/github/hitanshu5/WebScrapping/blob/main/Scraping_GitHub_Repo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Scraping Top Repositories for Topics on GitHub

## TODO:
### - Introduction to web scraping
### - Introduction to GitHub and the problem statement
### - Mention the tools you're using (Python, requests, BeautifulSoup, Pandas)

In [None]:
!pip install jovian --upgrade --quiet

In [None]:
import jovian

In [None]:
jovian.commit(project='Scraping-GitHub-Repo')

[jovian] Detected Colab notebook...
[jovian] jovian.commit() is no longer required on Google Colab. If you ran this notebook from Jovian, 
then just save this file in Colab using Ctrl+S/Cmd+S and it will be updated on Jovian. 
Also, you can also delete this cell, it's no longer necessary.




### * We're going to scrape https://github.com/topics
### * We'll get a list of topics. For each topic, we'll get the topic title, topic page URL and topic description
### * For each topic, we'll get the top repositories listed on the topic page (20 per page in the run below)
### * For each repository, we'll grab the repo name, username, stars and repo URL
### * For each topic, we'll create a CSV file in the following format:
### Repo Name,Username,Stars,Repo URL
### three.js,mrdoob,69700,https://github.com/mrdoob/three.js
### libgdx,libgdx,18300,https://github.com/libgdx/libgdx
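
The CSV layout described above can be sketched with Python's built-in `csv` module. This only illustrates the file format: the two rows are the sample rows from the list, and the `rows_to_csv` helper name is made up for this example (the notebook itself writes its files with pandas later on).

```python
import csv
import io

# Column order taken from the format described above
HEADER = ['Repo Name', 'Username', 'Stars', 'Repo URL']

def rows_to_csv(rows):
    """Return CSV text with the header followed by one line per repository."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()

# Sample rows copied from the format example (hypothetical helper input)
sample_rows = [
    ('three.js', 'mrdoob', 69700, 'https://github.com/mrdoob/three.js'),
    ('libgdx', 'libgdx', 18300, 'https://github.com/libgdx/libgdx'),
]

csv_text = rows_to_csv(sample_rows)
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch free of side effects; swapping `io.StringIO()` for an open file would produce the same text on disk.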










# Scrape the list of topics from GitHub

### - Use requests to download the page
### - Use BS4 to parse and extract information
### - Convert to Pandas DataFrame

In [None]:
!pip install requests==2.31.0 --upgrade --quiet

In [None]:
import requests

# Use the requests library to download the web page

In [None]:
topics_url = 'https://github.com/topics'

In [None]:
response = requests.get(topics_url)

### The server returns a different status code depending on how the request went
### Always check the status code to confirm that the page was downloaded successfully (200 means OK)
### Search for "HTTP response status codes" to learn more
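
As a rough illustration of what the status codes mean, the standard library's `http.HTTPStatus` table can turn a numeric code into its standard phrase. This is a sketch only: `describe_status` and `is_success` are hypothetical helper names, and treating every 2xx code as success is an assumption that goes slightly beyond the `== 200` check used in this notebook.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short label such as '200 OK' for a known HTTP status code."""
    try:
        return '{} {}'.format(code, HTTPStatus(code).phrase)
    except ValueError:
        # Codes that are missing from the standard library's table
        return '{} (unknown status)'.format(code)

def is_success(code):
    """Treat any 2xx code as a successful response (an assumption of this sketch)."""
    return 200 <= code < 300

for code in (200, 404, 500):
    print(describe_status(code), '| success:', is_success(code))
```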

In [None]:
response.status_code

200

In [None]:
len(response.text)

199768

In [None]:
page_content = response.text

In [None]:
page_content[:1000]

'\n\n<!DOCTYPE html>\n<html\n  lang="en"\n  \n  data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"\n  data-a11y-animated-images="system" data-a11y-link-underlines="true"\n  >\n\n\n\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-efd2f2257c96.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-6b1e37da2254.css" /><link data-color-theme="dark_dimmed" crossorig

In [None]:
with open('webpage.html', 'w', encoding='utf-8') as f:
  f.write(page_content)
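
Saving the HTML makes it possible to re-parse the page later without downloading it again. Below is a minimal sketch of that round trip, assuming UTF-8 text. The `save_page` and `load_page` helpers, the stand-in HTML string and the temporary directory are all illustrative and not part of the notebook.

```python
import tempfile
from pathlib import Path

def save_page(html, path):
    """Write the downloaded HTML to disk as UTF-8 text."""
    Path(path).write_text(html, encoding='utf-8')

def load_page(path):
    """Read previously saved HTML back into a string."""
    return Path(path).read_text(encoding='utf-8')

# A tiny stand-in for response.text
sample_html = '<!DOCTYPE html>\n<html lang="en"><head><meta charset="utf-8"></head></html>'

with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = Path(tmp_dir) / 'webpage.html'
    save_page(sample_html, saved_path)
    reloaded = load_page(saved_path)

print(reloaded == sample_html)
```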

### Beautiful Soup to parse and extract information

In [None]:
!pip install beautifulsoup4 --upgrade --quiet

In [None]:
from bs4 import BeautifulSoup

In [None]:
doc = BeautifulSoup(page_content, 'html.parser')

In [None]:
type(doc)

In [None]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
p_tags = doc.find_all('p',{'class': selection_class})

In [None]:
len(p_tags)

30

In [None]:
p_tags

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Awesome Lists</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amazon Web Services</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Azure</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Babel</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Bash</p>,
 <p class="f3 lh-condensed m

In [None]:
topic_title_class = 'f5 color-fg-muted mb-0 mt-1'
topic_title_tags = doc.find_all('p',{'class':topic_title_class})

In [None]:
len(topic_title_tags)

30

In [None]:
topic_title_tags[:3]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>]

In [None]:
topic_title_tags0 = topic_title_tags[0]

In [None]:
topic_title_tags0.parent

# .parent returns the tag that directly encloses this one (here, the <a> link that wraps the topic title and description)

<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
        </p>
</a>
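
The same navigation can be tried offline on a trimmed copy of the card shown above. This is a small sketch, assuming the class names stay exactly as they appear in this output; `card_html` and the other variable names are made up for the example.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the topic card shown above
card_html = '''
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
  3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
</p>
</a>
'''

card_doc = BeautifulSoup(card_html, 'html.parser')

# Find the description paragraph, then step up to the enclosing <a> tag
description_tag = card_doc.find('p', {'class': 'f5 color-fg-muted mb-0 mt-1'})
link_tag = description_tag.parent

# The title is the other <p> inside the same link
title_tag = link_tag.find('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})

print(link_tag.name, link_tag['href'], title_tag.text)
```

Starting from the enclosing `<a>` tag keeps the title, description and URL of one topic together, which is an alternative to collecting three separate lists and matching them by position.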

In [None]:
link_class = 'no-underline flex-1 d-flex flex-column'
topic_link_tag = doc.find_all('a',{'class':link_class})

In [None]:
len(topic_link_tag)

30

In [None]:
topic_link_tag[0]['href']

'/topics/3d'

In [None]:
topic0_url = 'https://github.com' + topic_link_tag[0]['href']
print(topic0_url)

https://github.com/topics/3d
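
String concatenation works here because every `href` on the page starts with `/`. As a slightly more defensive sketch, `urllib.parse.urljoin` from the standard library resolves relative links and leaves already-absolute ones untouched. `BASE_URL` and `absolute_url` are names chosen for this example.

```python
from urllib.parse import urljoin

BASE_URL = 'https://github.com'

def absolute_url(href):
    """Resolve an href from the page against the site root."""
    return urljoin(BASE_URL, href)

print(absolute_url('/topics/3d'))
print(absolute_url('https://github.com/topics/ajax'))
```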


In [None]:
p_tags[0].text

'3D'

In [None]:
topic_title_tags[0].text

'\n          3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.\n        '

In [None]:
topic_titles = []

for tags in p_tags:
  topic_titles.append(tags.text)

print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command-line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'C++', 'Cryptocurrency', 'Crystal']


In [None]:
topic_descriptions = []

for desc in topic_title_tags:
  topic_descriptions.append(desc.text.strip())

print(topic_descriptions)

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.', 'Android is an operating system built by Google designed for mobile devices.', 'Angular is an open source web application platform.', 'Ansible is a simple and powerful automation engine.', 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.', 'Arduino is an open source platform for building electronic devices.', 'ASP.NET is a web framework for building modern web apps and services.', 'An awesome list is a list of awesome things curated by the community.', 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.', 'Azure is a cloud computing service created by Microsoft.', 'Babel is a compiler for w

In [None]:
topic_urls = []
base = 'https://github.com'

for url in topic_link_tag:
  topic_urls.append(base + url['href'])

print(topic_urls)

['https://github.com/topics/3d', 'https://github.com/topics/ajax', 'https://github.com/topics/algorithm', 'https://github.com/topics/amphp', 'https://github.com/topics/android', 'https://github.com/topics/angular', 'https://github.com/topics/ansible', 'https://github.com/topics/api', 'https://github.com/topics/arduino', 'https://github.com/topics/aspnet', 'https://github.com/topics/awesome', 'https://github.com/topics/aws', 'https://github.com/topics/azure', 'https://github.com/topics/babel', 'https://github.com/topics/bash', 'https://github.com/topics/bitcoin', 'https://github.com/topics/bootstrap', 'https://github.com/topics/bot', 'https://github.com/topics/c', 'https://github.com/topics/chrome', 'https://github.com/topics/chrome-extension', 'https://github.com/topics/cli', 'https://github.com/topics/clojure', 'https://github.com/topics/code-quality', 'https://github.com/topics/code-review', 'https://github.com/topics/compiler', 'https://github.com/topics/continuous-integration', 'ht

### Now we combine the data into a DataFrame and save it as a CSV file


In [None]:
import pandas as pd

In [None]:
topics_dict = {
    'title': topic_titles,
    'description': topic_descriptions,
    'url': topic_urls
}

In [None]:
topics_df = pd.DataFrame(topics_dict)

In [None]:
topics_df.head()

Unnamed: 0,title,description,url
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
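
The titles, descriptions and URLs are matched purely by position, so it is worth confirming that the three lists have the same length before combining them. A small sketch of such a check follows; `combine_topic_fields` is a hypothetical helper, and the sample values are the first two topics shown above.

```python
def combine_topic_fields(titles, descriptions, urls):
    """Pair up the three lists by position, refusing mismatched lengths."""
    lengths = {len(titles), len(descriptions), len(urls)}
    if len(lengths) != 1:
        raise ValueError(
            'Expected equal-length lists, got {} titles, {} descriptions, {} urls'.format(
                len(titles), len(descriptions), len(urls)))
    return [
        {'title': title, 'description': description, 'url': url}
        for title, description, url in zip(titles, descriptions, urls)
    ]

rows = combine_topic_fields(
    ['3D', 'Ajax'],
    ['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
     'Ajax is a technique for creating interactive web applications.'],
    ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
)
print(rows[0]['title'], rows[0]['url'])
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)` to get the same three columns.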


In [None]:
topics_df.to_csv('topics.csv', index=False)

### Extract information from a topic page

In [None]:
topic_page_url = topic_urls[0]

In [None]:
topic_page_url

'https://github.com/topics/3d'

In [None]:
response = requests.get(topic_page_url)

In [None]:
response.status_code

200

In [None]:
len(response.text)

519346

In [None]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [None]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags = topic_doc.find_all('h3',{'class':h3_selection_class})

In [None]:
len(repo_tags)

20

In [None]:
repo_tags[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="c72fbd5c69a8ee7c9c53a4e65de2b93c8fc7552dd793945819639bc165c0f0ba" data-turbo="false" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4a2667db3d63a1739c412e059e5da95afe419df83f70949b5d59dc3478f5c79a" data-turbo="false" data-view-component="true" href

In [None]:
a_tags = repo_tags[0].find_all('a')

In [None]:
user = a_tags[0].text.strip()

In [None]:
user

'mrdoob'

In [None]:
repo = a_tags[1].text.strip()

In [None]:
base_url = 'https://github.com'
final_url = base_url + a_tags[1]['href']
final_url

'https://github.com/mrdoob/three.js'

In [None]:
star_tags = topic_doc.find_all('span',{'class':'Counter js-social-count'})
len(star_tags)

20

In [None]:
star_tags[0].text

'101k'

In [None]:
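def parse_star(stars_str):
  # Convert a star count such as '101k' or '987' to an integer
  stars_str = stars_str.strip()
  if stars_str[-1] == 'k':
    # A trailing 'k' means thousands, e.g. '7.5k' -> 7500
    return int(float(stars_str[:-1]) * 1000)
  return int(stars_str)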

In [None]:
parse_star(star_tags[0].text.strip())

101000
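
`parse_star` covers the two formats seen in this output. The variant below is a hedged sketch of a slightly more tolerant parser: it also strips thousands separators and uses `Decimal`, because multiplying a float by 1000 and truncating with `int()` can, for some inputs, land just below the intended integer. The name `parse_star_count` and the `'m'` suffix are assumptions; only `'k'` appears in the data scraped here.

```python
from decimal import Decimal

# Suffix multipliers; 'm' is an assumption, only 'k' appears in the notebook output
SUFFIX_MULTIPLIERS = {'k': 1000, 'm': 1000000}

def parse_star_count(text):
    """Convert strings like '101k', '7.5k', '1,234' or '987' to an int."""
    cleaned = text.strip().lower().replace(',', '')
    if not cleaned:
        raise ValueError('empty star count')
    multiplier = SUFFIX_MULTIPLIERS.get(cleaned[-1])
    if multiplier is not None:
        # Decimal avoids binary floating-point rounding before truncation
        return int(Decimal(cleaned[:-1]) * multiplier)
    return int(cleaned)

for raw in ('101k', '7.5k', '1,234', ' 987 '):
    print(repr(raw), '->', parse_star_count(raw))
```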

In [None]:
def get_repo_info(repo_tag, star_tag):
  a_tags = repo_tag.find_all('a')
  username = a_tags[0].text.strip()
  repo_name = a_tags[1].text.strip()
  repo_url = base_url + a_tags[1]['href']
  stars = parse_star(star_tag.text.strip())
  return username, repo_name, stars, repo_url

In [None]:
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 101000, 'https://github.com/mrdoob/three.js')

In [None]:
topic_repo_dict = {
    'username' : [],
    'repo_name' : [],
    'stars' : [],
    'repo_url' : []
}

for i in range(len(repo_tags)):
  repo_info = get_repo_info(repo_tags[i], star_tags[i])
  topic_repo_dict['username'].append(repo_info[0])
  topic_repo_dict['repo_name'].append(repo_info[1])
  topic_repo_dict['stars'].append(repo_info[2])
  topic_repo_dict['repo_url'].append(repo_info[3])

In [None]:
topic_repo_dict

{'username': ['mrdoob',
  'pmndrs',
  'libgdx',
  'BabylonJS',
  'ssloy',
  'FreeCAD',
  'lettier',
  'aframevr',
  'CesiumGS',
  'blender',
  'MonoGame',
  'mapbox',
  'isl-org',
  'metafizzy',
  'timzhang642',
  'nerfstudio-project',
  '4ian',
  'a1studmuffin',
  'FyroxEngine',
  'domlysz'],
 'repo_name': ['three.js',
  'react-three-fiber',
  'libgdx',
  'Babylon.js',
  'tinyrenderer',
  'FreeCAD',
  '3d-game-shaders-for-beginners',
  'aframe',
  'cesium',
  'blender',
  'MonoGame',
  'mapbox-gl-js',
  'Open3D',
  'zdog',
  '3D-Machine-Learning',
  'nerfstudio',
  'GDevelop',
  'SpaceshipGenerator',
  'Fyrox',
  'BlenderGIS'],
 'stars': [101000,
  26800,
  23000,
  22800,
  19900,
  18400,
  17500,
  16500,
  12500,
  12300,
  11100,
  10900,
  10900,
  10300,
  9600,
  9000,
  8000,
  7600,
  7500,
  7500],
 'repo_url': ['https://github.com/mrdoob/three.js',
  'https://github.com/pmndrs/react-three-fiber',
  'https://github.com/libgdx/libgdx',
  'https://github.com/BabylonJS/Babylon

In [None]:
topic_repo_df = pd.DataFrame(topic_repo_dict)

In [None]:
topic_repo_df

Unnamed: 0,username,repo_name,stars,repo_url
0,mrdoob,three.js,101000,https://github.com/mrdoob/three.js
1,pmndrs,react-three-fiber,26800,https://github.com/pmndrs/react-three-fiber
2,libgdx,libgdx,23000,https://github.com/libgdx/libgdx
3,BabylonJS,Babylon.js,22800,https://github.com/BabylonJS/Babylon.js
4,ssloy,tinyrenderer,19900,https://github.com/ssloy/tinyrenderer
5,FreeCAD,FreeCAD,18400,https://github.com/FreeCAD/FreeCAD
6,lettier,3d-game-shaders-for-beginners,17500,https://github.com/lettier/3d-game-shaders-for...
7,aframevr,aframe,16500,https://github.com/aframevr/aframe
8,CesiumGS,cesium,12500,https://github.com/CesiumGS/cesium
9,blender,blender,12300,https://github.com/blender/blender


In [None]:
import os

def get_topic_page(topic_url):
  # Download the page
  response = requests.get(topic_url)
  #check successful response
  if response.status_code != 200:
    raise Exception('Failed to load page {}'.format(topic_url))
  #Parse using Beautiful Soup
  topic_doc = BeautifulSoup(response.text, 'html.parser')
  return topic_doc

def get_repo_info(repo_tag, star_tag):
  # Returns all the required info about a repository
  a_tags = repo_tag.find_all('a')
  username = a_tags[0].text.strip()
  repo_name = a_tags[1].text.strip()
  repo_url = base_url + a_tags[1]['href']
  stars = parse_star(star_tag.text.strip())
  return username, repo_name, stars, repo_url

def get_topic_repos(topic_doc):
  #get the h3 tags containing repo title, repo URL and username
  h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
  repo_tags = topic_doc.find_all('h3',{'class':h3_selection_class})
  #get star tags
  star_tags = topic_doc.find_all('span',{'class':'Counter js-social-count'})

  topic_repo_dict = {
    'username' : [],
    'repo_name' : [],
    'stars' : [],
    'repo_url' : []
  }

  #get repo info
  for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repo_dict['username'].append(repo_info[0])
    topic_repo_dict['repo_name'].append(repo_info[1])
    topic_repo_dict['stars'].append(repo_info[2])
    topic_repo_dict['repo_url'].append(repo_info[3])

  return pd.DataFrame(topic_repo_dict)

def scrape_topic(topic_url,topic_name):
  fname = topic_name + '.csv'
  if os.path.exists(fname):
    print('The file {} already exists. Skipping...'.format(fname))
    return
  topic_df = get_topic_repos(get_topic_page(topic_url))
  topic_df.to_csv(fname, index=False)
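
`scrape_topic` uses the topic title directly as the file name. That works for the titles listed in this notebook, but a title containing a path separator such as `/` would break the path. Here is a hedged sketch of one way to normalise titles first; `topic_filename` is a hypothetical helper, and the set of allowed characters is an arbitrary choice.

```python
import re

def topic_filename(topic_name):
    """Build a CSV file name from a topic title, replacing risky characters."""
    # Keep letters, digits, dot, plus, hyphen and underscore; collapse the rest to '_'
    safe = re.sub(r'[^A-Za-z0-9.+_-]+', '_', topic_name).strip('_')
    if not safe:
        raise ValueError('Cannot build a file name from {!r}'.format(topic_name))
    return safe + '.csv'

for name in ('3D', 'ASP.NET', 'C++', 'Command-line interface'):
    print(name, '->', topic_filename(name))
```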

In [None]:
#topic_urls[4]

In [None]:
#get_topic_repos(get_topic_page(topic_urls[4]))

In [None]:
url4 = topic_urls[4]

In [None]:
url4

'https://github.com/topics/android'

In [None]:
topic4_doc = get_topic_page(url4)

In [None]:
topic4_repos = get_topic_repos(topic4_doc)

In [None]:
topic4_repos

Unnamed: 0,username,repo_name,stars,repo_url
0,flutter,flutter,164000,https://github.com/flutter/flutter
1,facebook,react-native,117000,https://github.com/facebook/react-native
2,justjavac,free-programming-books-zh_CN,111000,https://github.com/justjavac/free-programming-...
3,Genymobile,scrcpy,106000,https://github.com/Genymobile/scrcpy
4,Hack-with-Github,Awesome-Hacking,80600,https://github.com/Hack-with-Github/Awesome-Ha...
5,Solido,awesome-flutter,52400,https://github.com/Solido/awesome-flutter
6,google,material-design-icons,50200,https://github.com/google/material-design-icons
7,wasabeef,awesome-android-ui,49900,https://github.com/wasabeef/awesome-android-ui
8,tldr-pages,tldr,49500,https://github.com/tldr-pages/tldr
9,square,okhttp,45500,https://github.com/square/okhttp


In [None]:
def get_topic_titles(doc):
  selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
  p_tags = doc.find_all('p',{'class': selection_class})
  topic_titles = []
  for tags in p_tags:
    topic_titles.append(tags.text)
  return topic_titles

def get_topic_description(doc):
  topic_title_class = 'f5 color-fg-muted mb-0 mt-1'
  topic_title_tags = doc.find_all('p',{'class':topic_title_class})
  topic_descriptions = []
  for desc in topic_title_tags:
    topic_descriptions.append(desc.text.strip())
  return topic_descriptions

def get_topic_urls(doc):
  link_class = 'no-underline flex-1 d-flex flex-column'
  topic_link_tag = doc.find_all('a',{'class':link_class})
  topic_urls = []
  base = 'https://github.com'
  for url in topic_link_tag:
    topic_urls.append(base + url['href'])
  return topic_urls

def scrape_topics():
  topics_url = 'https://github.com/topics'
  response = requests.get(topics_url)
  if response.status_code != 200:
    raise Exception('Failed to load page {}'.format(topics_url))
  # Parse the freshly downloaded page instead of relying on the global `doc`
  doc = BeautifulSoup(response.text, 'html.parser')
  topic_dict = {
      'title' : get_topic_titles(doc),
      'description' : get_topic_description(doc),
      'url' : get_topic_urls(doc)
  }
  return pd.DataFrame(topic_dict)

In [None]:
scrape_topics()

Unnamed: 0,title,description,url
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


In [None]:
def scrape_topics_repos():
  print('Scraping list of topics')
  topic_df = scrape_topics()
  for index, row in topic_df.iterrows():
    print('Scraping top repositories for "{}"'.format(row['title']))
    scrape_topic(row['url'],row['title'])

In [None]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command-line interface"
Scraping top repositories for "Clojure"
Scraping top reposi

In [None]:
import jovian

In [None]:
jovian.commit()

