# Scraping Popular GitHub Collections Using Python



![](https://i.imgur.com/nbD2Aad.png)

In [1]:
!pip install jovian --upgrade --quiet

In [2]:
import jovian

In [1]:
# Execute this to save new versions of the notebook
jovian.commit(project="Scraping popular GitHub Collections Using Python")

<IPython.core.display.Javascript object>

[jovian] Creating a new project "mescanah/Scraping popular GitHub Collections Using Python"
[jovian] Committed successfully! https://jovian.ai/mescanah/scraping-popular-github-collections-using-python


'https://jovian.ai/mescanah/scraping-popular-github-collections-using-python'

## Outline and Objective of My Web Scraping Project

## Website to be scraped: https://github.com/collections

- I will scrape each collection's title, description, and page URL.

- For each collection, I will get the top 20 repositories on the collection page.

- For each repository, I will get the repo name, username, and repo URL.

- For each GitHub collection, I will create a CSV file in the following format:

Repo name,Username,Repo Url

rust,rust-lang,https://github.com/rust-lang/rust

hospitalrun-frontend,HospitalRun,https://github.com/HospitalRun/hospitalrun-frontend
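To make the target format concrete, here is a minimal sketch that produces a file body in that shape. It uses the standard-library `csv` module purely for illustration (the notebook itself writes its files with pandas), and the helper name `rows_to_csv` and the sample rows are assumptions, not part of the project code.

```python
import csv
import io

# Hypothetical sample rows in the column order: Repo name, Username, Repo Url.
rows = [
    ("rust", "rust-lang", "https://github.com/rust-lang/rust"),
    ("hospitalrun-frontend", "HospitalRun",
     "https://github.com/HospitalRun/hospitalrun-frontend"),
]


def rows_to_csv(rows):
    """Return the rows as CSV text, preceded by the header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Repo name", "Username", "Repo Url"])
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows)
print(csv_text)
```

Using a CSV writer rather than joining strings by hand means a value that happens to contain a comma is quoted automatically.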


## Use the requests library to download web pages

- I inspected the website's HTML source and identified the right URLs to download.

- I downloaded the web pages and saved them locally using the requests library.

- I created a function to automate downloading for different GitHub collections.
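The download-and-save steps above can be sketched as one small helper. This is only an illustration under stated assumptions: `download_once` and its `fetch` parameter are hypothetical names, and `fetch` stands for any callable that returns the page text for a URL (with requests, something like `lambda u: requests.get(u).text`).

```python
import os


def download_once(url, path, fetch):
    """Return the page text for `url`, saving it to `path` on first use.

    `fetch` is an assumed callable that takes a URL and returns the page
    text. If `path` already exists, the saved copy is read instead of
    fetching the page again.
    """
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return f.read()

    text = fetch(url)

    # Write as UTF-8 so characters such as curly quotes are saved the same
    # way on every platform.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return text
```

Keeping a local copy means the parsing code can be rerun many times while the page is requested only once.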

In [3]:
!pip install requests --upgrade --quiet

In [4]:
import requests

In [5]:
Collections_url = 'https://github.com/collections'

In [6]:
response = requests.get(Collections_url )

In [9]:
response.status_code

200

In [7]:
len(response.text)

95762

In [8]:
page_contents = response.text

In [10]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-719f1193e0c0.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-0c343b529849.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media="all" rel="stylesheet" data-href="htt

In [11]:
# Save as UTF-8 so non-ASCII characters in the page are written on any platform
with open('webpageGitHubcollection.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)

## Use Beautiful Soup to parse and extract information

- I parsed and explored the structure of the downloaded web pages using Beautiful Soup.

- I used the right properties and methods to extract the required information.

- I created functions to extract information from the page into lists and dictionaries.

In [12]:
!pip install beautifulsoup4 --upgrade --quiet

In [13]:
from bs4 import BeautifulSoup

In [14]:
doc = BeautifulSoup(page_contents, 'html.parser')

## Scraping the h2 tag for Title

![](https://i.imgur.com/oxbAB25.png)

![](https://i.imgur.com/3BgTowS.png)

In [15]:
Colect_title_tag = doc.find_all('h2', class_ = 'h3')

In [16]:
len(Colect_title_tag)

20

In [17]:
Colect_title_tag[:1]

[<h2 class="h3"><a data-ga-click="Explore, go to collection, text:How to choose (and contribute to) your first open source project" href="/collections/choosing-projects">How to choose (and contribute to) your first open source project</a></h2>]

In [17]:
Colect_title_tag[0]

<h2 class="h3"><a data-ga-click="Explore, go to collection, text:How to choose (and contribute to) your first open source project" href="/collections/choosing-projects">How to choose (and contribute to) your first open source project</a></h2>

In [18]:
Colect_title_tag[0].text

'How to choose (and contribute to) your first open source project'

## Creating a function to extract the collection titles

In [21]:
def get_collection_title():
    
    # Find every h2 tag with class "h3"; each one holds a collection title
    Colect_title_tag = doc.find_all('h2', class_ = 'h3')
    
    Collection_title = []
    
    for title in Colect_title_tag:
        
        Collection_title.append(title.get_text())
    
    return Collection_title

In [22]:
Collection_titl = get_collection_title()
Collection_titl

['How to choose (and contribute to) your first open source project',
 'Clean code linters',
 'Open journalism',
 'Design essentials',
 'Music',
 'Government apps',
 'DevOps tools',
 'Front-end JavaScript frameworks',
 'GitHub Browser Extensions',
 'GitHub Pages examples',
 'Hacking Minecraft',
 'JavaScript Game Engines',
 'Learn to Code',
 'Getting started with machine learning',
 'Made in Africa',
 'Net neutrality',
 'Open data',
 'Open source organizations',
 'Policies',
 'Software productivity tools']

## Scraping the div tag for the description

![](https://i.imgur.com/XD818Xh.png)

![](https://i.imgur.com/aVSLN8y.png)

In [23]:
Collect_description_tag = doc.find_all('div', class_ = 'col-10 col-md-11')

In [32]:
Collect_description_tag[:1]

[<div class="col-10 col-md-11">
 <h2 class="h3"><a data-ga-click="Explore, go to collection, text:How to choose (and contribute to) your first open source project" href="/collections/choosing-projects">How to choose (and contribute to) your first open source project</a></h2>
       New to open source? Here’s how to find projects that need help and start making impactful contributions.
     </div>]

In [24]:
Collect_description_tag[0]

<div class="col-10 col-md-11">
<h2 class="h3"><a data-ga-click="Explore, go to collection, text:How to choose (and contribute to) your first open source project" href="/collections/choosing-projects">How to choose (and contribute to) your first open source project</a></h2>
      New to open source? Here’s how to find projects that need help and start making impactful contributions.
    </div>

In [25]:
def get_description_tag():
    
    # Each collection sits in a div with these two classes
    Collect_description_tag = doc.find_all('div', class_ = 'col-10 col-md-11')
    
    Collection_descript = []
    
    for i in Collect_description_tag:
        
        # The text nodes are: whitespace, the title, then the description,
        # so the description is the item at index 2
        Collection_descript.append(i.find_all(text= True, recursive = True)[2].strip())
    
    return Collection_descript
    

In [26]:
Collect_description = get_description_tag()
Collect_description

['New to open source? Here’s how to find projects that need help and start making impactful contributions.',
 'Make sure your code matches your style guide with these essential code linters.',
 'See how publications and data-driven journalists use open source to power their newsroom and ensure information is reported fairly and accurately.',
 'This collection of design libraries are the best on the web, and will complete your toolset for designing stunning products.',
 'Drop the code bass with these musically themed repositories.',
 'Sites, apps, and tools built by governments across the world to make government work better, together. Read more at https://government.github.com',
 'These tools help you manage servers and deploy happier and more often with more confidence.',
 'While the number of ways to organize JavaScript is almost infinite, here are some tools that help you build single-page applications.',
 'Some useful and fun browser extensions to personalize your GitHub browser ex

## Scraping the a tag for the href link

![](https://i.imgur.com/gZQw685.png)

![](https://i.imgur.com/4ww4REd.png)

In [27]:
Colect_link_tag = doc.find_all('h2', {'class': 'h3'})

In [28]:
Colect_link_tag[1].find('a')

<a data-ga-click="Explore, go to collection, text:Clean code linters" href="/collections/clean-code-linters">Clean code linters</a>

In [29]:
Colect_link_tag[1].find('a')['href']

'/collections/clean-code-linters'

In [30]:
def get_link_tag():
    
    Colect_link_tag = doc.find_all('h2', {'class': 'h3'})
    
    base_url = 'https://github.com'
    
    collection_url = []

    for link in Colect_link_tag:
        
        # The href is site-relative, so prefix it with the base URL
        collection_url.append(base_url + link.find('a')['href'])

    return  collection_url
    
    

In [31]:
collect_url =  get_link_tag() 
collect_url   

['https://github.com/collections/choosing-projects',
 'https://github.com/collections/clean-code-linters',
 'https://github.com/collections/open-journalism',
 'https://github.com/collections/design-essentials',
 'https://github.com/collections/music',
 'https://github.com/collections/government',
 'https://github.com/collections/devops-tools',
 'https://github.com/collections/front-end-javascript-frameworks',
 'https://github.com/collections/github-browser-extensions',
 'https://github.com/collections/github-pages-examples',
 'https://github.com/collections/hacking-minecraft',
 'https://github.com/collections/javascript-game-engines',
 'https://github.com/collections/learn-to-code',
 'https://github.com/collections/machine-learning',
 'https://github.com/collections/made-in-africa',
 'https://github.com/collections/net-neutrality',
 'https://github.com/collections/open-data',
 'https://github.com/collections/open-source-organizations',
 'https://github.com/collections/policies',
 'http

## Putting the extracted lists into a pandas DataFrame

In [32]:
!pip install pandas --quiet

In [33]:
import pandas as pd

In [34]:
collection_dict = {'title': Collection_titl, 
                   
                   'description': Collect_description,
                   
                   'url': collect_url
                   
                    }

In [35]:
Collection_df = pd.DataFrame(collection_dict)

In [36]:
Collection_df 

| | title | description | url |
|---|---|---|---|
| 0 | How to choose (and contribute to) your first o... | New to open source? Here’s how to find project... | https://github.com/collections/choosing-projects |
| 1 | Clean code linters | Make sure your code matches your style guide w... | https://github.com/collections/clean-code-linters |
| 2 | Open journalism | See how publications and data-driven journalis... | https://github.com/collections/open-journalism |
| 3 | Design essentials | This collection of design libraries are the be... | https://github.com/collections/design-essentials |
| 4 | Music | Drop the code bass with these musically themed... | https://github.com/collections/music |
| 5 | Government apps | Sites, apps, and tools built by governments ac... | https://github.com/collections/government |
| 6 | DevOps tools | These tools help you manage servers and deploy... | https://github.com/collections/devops-tools |
| 7 | Front-end JavaScript frameworks | While the number of ways to organize JavaScrip... | https://github.com/collections/front-end-javas... |
| 8 | GitHub Browser Extensions | Some useful and fun browser extensions to pers... | https://github.com/collections/github-browser-... |
| 9 | GitHub Pages examples | Fine examples of projects using GitHub Pages (... | https://github.com/collections/github-pages-ex... |


In [37]:
Collection_df.to_csv('Collections of Popular Repository.csv', index = None)

## Getting Repository Information from Each GitHub Collection Page

![](https://i.imgur.com/kg2MIeJ.png)

In [38]:
collection_page_url = collect_url[0]

In [39]:
collection_page_url 

'https://github.com/collections/choosing-projects'

In [40]:
response = requests.get(collection_page_url)

In [78]:
response.status_code

200

In [38]:
len(response.text)

103507

In [41]:
Collection_doc = BeautifulSoup(response.text, 'html.parser')

## Scraping the repo username from the span tag

![](https://i.imgur.com/Ah9fOFg.png)

![](https://i.imgur.com/2XfnZ6B.png)

In [42]:
username_tag = Collection_doc.find_all('span', class_ = 'text-normal')

In [43]:
username_tag[:2]

[<span class="text-normal">rust-lang /</span>,
 <span class="text-normal">HospitalRun /</span>]

In [44]:
len(username_tag)

5

In [45]:
# Remove the slash first, then strip, so no trailing space is left behind
username_tag[0].text.replace('/', '').strip()

'rust-lang'

## Scraping the repo name from the h1 tag

![](https://i.imgur.com/LZJVpeZ.png)

![](https://i.imgur.com/RbXQbNy.png)

In [46]:
Repo_tag = Collection_doc.find_all('h1', class_ = 'h3')

In [47]:
Repo_tag[0]

<h1 class="h3 lh-condensed">
<a data-ga-click="Explore, go to repository, location: collection" href="/rust-lang/rust">
<svg aria-hidden="true" class="octicon octicon-repo Link--secondary v-align-middle mr-1" data-view-component="true" height="20" version="1.1" viewbox="0 0 16 16" width="20">
<path d="M2 2.5A2.5 2.5 0 014.5 0h8.75a.75.75 0 01.75.75v12.5a.75.75 0 01-.75.75h-2.5a.75.75 0 110-1.5h1.75v-2h-8a1 1 0 00-.714 1.7.75.75 0 01-1.072 1.05A2.495 2.495 0 012 11.5v-9zm10.5-1V9h-8c-.356 0-.694.074-1 .208V2.5a1 1 0 011-1h8zM5 12.25v3.25a.25.25 0 00.4.2l1.45-1.087a.25.25 0 01.3 0L8.6 15.7a.25.25 0 00.4-.2v-3.25a.25.25 0 00-.25-.25h-3.5a.25.25 0 00-.25.25z" fill-rule="evenodd"></path>
</svg>
<span class="text-normal">rust-lang /</span>
        rust
      </a>
</h1>

In [48]:
Repo_tag[0].find_all(text = True, recursive = True)[6].strip()

'rust'

In [49]:
for i in Repo_tag:
    try: 
        print(i.find_all(text = True, recursive = True)[6].strip())
    except IndexError:
        # This h1 tag has fewer text nodes, so it is not a repository entry
        print('')

rust
hospitalrun-frontend
brew

public-apis
serenity


## Scraping the repo URL from the a tag's href

![](https://i.imgur.com/OONdQSW.png)

![](https://i.imgur.com/jOmyV5O.png)

In [50]:
# Getting the link or url tag out of the Repo_tag
Repo_tag[0].find('a')['href']

'/rust-lang/rust'

In [51]:
base_url = 'https://github.com'
Repo_url = base_url + Repo_tag[0].find('a')['href']
Repo_url 

'https://github.com/rust-lang/rust'
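String concatenation works when every `href` is site-relative, but an `href` that is already a full URL would end up glued onto the base. As a hedged alternative, the standard library's `urllib.parse.urljoin` handles both cases. The video URL below is a made-up example.

```python
from urllib.parse import urljoin

base_url = "https://github.com"

# A site-relative href is resolved against the base URL.
repo_url = urljoin(base_url, "/rust-lang/rust")

# An href that is already absolute is returned unchanged instead of being
# appended to the base.
video_url = urljoin(base_url, "https://www.youtube.com/embed/example")

print(repo_url)
print(video_url)
```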

## Creating a function to get the username, repo name, and URL of a single repository

In [52]:
def get_repo_infos(username_tag, Repo_tag):
    #returns the necessary information about a repository
    
    # Remove the slash first, then strip, so no trailing space is left behind
    username = username_tag.text.replace('/', '').strip()
    try:
        Repo_name   = Repo_tag.find_all(text = True, recursive = True)[6].strip()
    
    except IndexError:
        # The tag has fewer text nodes than a repository entry
        Repo_name = ''
    
    Repo_link =  base_url + Repo_tag.find('a')['href']
    
    
    return username, Repo_name, Repo_link

In [53]:
get_repo_infos(username_tag[0], Repo_tag[0])

('rust-lang', 'rust', 'https://github.com/rust-lang/rust')

## Creating a pandas DataFrame of the repositories

In [54]:
def get_repo_dict():
  
    Collection_Repo_diction = {'username': [],'Repo_name': [],'Repo_link': []}



    for i in range(len(username_tag)):
    
        Collection_info = get_repo_infos(username_tag[i],Repo_tag[i])
    
        Collection_Repo_diction['username'].append(Collection_info[0])
    
        Collection_Repo_diction['Repo_name'].append(Collection_info[1])
    
        Collection_Repo_diction['Repo_link'].append(Collection_info[2])
        
        
    return   pd.DataFrame(Collection_Repo_diction)   

In [56]:
Collection_repos_df = get_repo_dict()
Collection_repos_df

| | username | Repo_name | Repo_link |
|---|---|---|---|
| 0 | rust-lang | rust | https://github.com/rust-lang/rust |
| 1 | HospitalRun | hospitalrun-frontend | https://github.com/HospitalRun/hospitalrun-fro... |
| 2 | Homebrew | brew | https://github.com/Homebrew/brew |
| 3 | public-apis | | https://github.comhttps://www.youtube.com/embe... |
| 4 | SerenityOS | public-apis | https://github.com/public-apis/public-apis |
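Rows 3 and 4 of this table are misaligned: the username `public-apis` is paired with a non-repository link, and `SerenityOS` is paired with the `public-apis` repository. This happens because `username_tag` has 5 items while `Repo_tag` yielded 6 (one with no repo name), and the two lists are paired by index. One way around this, sketched below under assumptions, is to derive all three fields from a single value, the `href` of each `h1` tag. The helper name `repo_row` and the sample hrefs are hypothetical.

```python
from urllib.parse import urlsplit


def repo_row(href, base_url="https://github.com"):
    """Return (username, repo_name, repo_link) for a repository href.

    Returns None when the href is not of the form /<username>/<repo>,
    for example an absolute link to an embedded video.
    """
    parts = urlsplit(href)
    if parts.netloc:
        # Absolute link to another site: not a repository entry.
        return None

    segments = [s for s in parts.path.split("/") if s]
    if len(segments) != 2:
        return None

    username, repo_name = segments
    return username, repo_name, f"{base_url}/{username}/{repo_name}"


# Hypothetical hrefs in the order they might appear on a collection page.
hrefs = [
    "/rust-lang/rust",
    "https://www.youtube.com/embed/example",
    "/public-apis/public-apis",
]

rows = [row for row in map(repo_row, hrefs) if row is not None]
print(rows)
```

Because each row comes from one `href`, a non-repository entry is simply skipped and cannot shift the rows that follow it.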


## Saving the Collection Repos to CSV

In [57]:
Collection_repos_df.to_csv('GitHub-Collections-Repos-Choosing-Projects.csv', index = None)

In [59]:
url3 = collect_url[3]


In [60]:
import os
def get_collection_page(collect_url):
    
    # Download the page
    response = requests.get(collect_url)
    
    # Check successful download of website or response
    
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(collect_url))
    # Parse using BeautifulSoup    
    Collection_doc = BeautifulSoup(response.text, 'html.parser')

    return Collection_doc


def get_collection_repos(Collection_doc):
    
    # Get the h1 tags containing the repo name and the repo link
    Repo_tag = Collection_doc.find_all('h1', class_ = 'h3')
    
    # Get Repo information
    Collection_Repo_dict = {'username': [],'Repo_name': [],'Repo_link': []}

    for tag in Repo_tag:
        
        # Take the username span from inside the same h1 tag, so the username,
        # repo name and link always belong to the same repository
        username_tag = tag.find('span', class_ = 'text-normal')
        
        # Skip h1 tags that are not repositories (they have no username span)
        if username_tag is None:
            continue
    
        Collection_info = get_repo_infos(username_tag, tag)
    
        Collection_Repo_dict['username'].append(Collection_info[0])
    
        Collection_Repo_dict['Repo_name'].append(Collection_info[1])
    
        Collection_Repo_dict['Repo_link'].append(Collection_info[2])
    
    
    return pd.DataFrame(Collection_Repo_dict)


def  scrape_collections(collect_url, path):
    
    if os.path.exists(path):
        
        print("The file {} already exists. Skipping.... ".format(path))
        return 
        
    Collection_df = get_collection_repos(get_collection_page(collect_url))
    
    Collection_df.to_csv(path, index = None)
    
    
    
def scrape_gitcollection_repos():
    
    print('scrapping list of github collections')
    Collection_df = scrape_gitcollections()
    
    os.makedirs('project1_data', exist_ok = True)
    
    for index, row in Collection_df.iterrows():
        
        print('Scraping top repository for  "{}"'.format(row['title']))
        
        scrape_collections(row['url'], 'project1_data/{}.csv'.format(row['title']))
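The file name above is built directly from the collection title. Titles on this page happen to work, but a title containing a character such as `/` or `?` would not make a valid file name. A minimal sketch of one way to guard against that, assuming a hypothetical helper named `safe_filename`:

```python
import re


def safe_filename(title):
    """Turn a collection title into a lowercase, hyphen-separated file name."""
    # Replace every run of characters other than a-z and 0-9 with one hyphen,
    # then drop any hyphen left at either end.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug + ".csv"


print(safe_filename("Front-end JavaScript frameworks"))
```

With this helper, the path passed to `scrape_collections` could be built as `'project1_data/' + safe_filename(row['title'])`.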

In [61]:
collect3_doc = get_collection_page(url3)

In [62]:
DesignEssential_collect3= get_collection_repos(collect3_doc)

In [63]:
DesignEssential_collect3

| | username | Repo_name | Repo_link |
|---|---|---|---|
| 0 | twbs | bootstrap | https://github.com/twbs/bootstrap |
| 1 | animate-css | animate.css | https://github.com/animate-css/animate.css |
| 2 | nathansmith | 960-Grid-System | https://github.com/nathansmith/960-Grid-System |
| 3 | necolas | normalize.css | https://github.com/necolas/normalize.css |
| 4 | ionic-team | ionicons | https://github.com/ionic-team/ionicons |
| 5 | designmodo | Flat-UI | https://github.com/designmodo/Flat-UI |
| 6 | h5bp | html5-boilerplate | https://github.com/h5bp/html5-boilerplate |
| 7 | foundation | foundation-sites | https://github.com/foundation/foundation-sites |
| 8 | Modernizr | Modernizr | https://github.com/Modernizr/Modernizr |
| 9 | twbs | ratchet | https://github.com/twbs/ratchet |


In [64]:
DesignEssential_collect3.to_csv('DesignEssentials_Repo.csv', index = None)

In [65]:
get_collection_repos(get_collection_page(collect_url[1])).to_csv('cleancodelinters_Repo.csv', index = None) 



## Write a single function to:

1. Get the list of collections from the GitHub Collections page

2. Get the list of top repos from each individual GitHub collection page

3. For each GitHub collection, create a CSV file of the top repos in the collection

## Functions to get the list of collections from the GitHub Collections page

In [66]:
#This is used to get the list of titles; doc is the parsed Collections page.
def get_Collection_titles(doc):
    
    Colect_title_tag = doc.find_all('h2', class_ = 'h3')
    
    Collection_titl = []
    
    for title in Colect_title_tag:
        
         Collection_titl.append(title.get_text())
    
    return Collection_titl 


#This is used to get the list of collection descriptions
def get_Collection_desc(doc):
    
    Collect_description_tag = doc.find_all('div', class_ = 'col-10 col-md-11')
    
    Collect_description = []
    for i in Collect_description_tag:
    
        Collect_description.append(i.find_all(text= True, recursive = True)[2].strip())
    
    return Collect_description 


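#This is used to get the list of collection page URLs
def get_Collection_url(doc):
    
    # Search the doc that was passed in rather than relying on a global variable
    Colect_link_tag = doc.find_all('h2', {'class': 'h3'})
    collect_url = []
    base_url = 'https://github.com'
    for link in Colect_link_tag:
        collect_url.append(base_url + link.find('a')['href'])
    return collect_url 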



#This is used to combine the functions above into a pandas DataFrame
def scrape_gitcollections():

    Collections_url = 'https://github.com/collections'
    response = requests.get(Collections_url)

    if response.status_code != 200:
        raise Exception('failed to load page {}'.format(Collections_url))    

    # Parse the page that was just downloaded instead of reusing the global doc
    doc = BeautifulSoup(response.text, 'html.parser')

    collection_dict = {
        
        'title': get_Collection_titles(doc),
        
        'description': get_Collection_desc(doc),
        
        'url': get_Collection_url(doc)
   
    }    
    return pd.DataFrame(collection_dict)

In [67]:
scrape_gitcollection_repos()

scrapping list of github collections
Scraping top repository for  "How to choose (and contribute to) your first open source project"
Scraping top repository for  "Clean code linters"
Scraping top repository for  "Open journalism"
Scraping top repository for  "Design essentials"
Scraping top repository for  "Music"
Scraping top repository for  "Government apps"
Scraping top repository for  "DevOps tools"
Scraping top repository for  "Front-end JavaScript frameworks"
Scraping top repository for  "GitHub Browser Extensions"
Scraping top repository for  "GitHub Pages examples"
Scraping top repository for  "Hacking Minecraft"
Scraping top repository for  "JavaScript Game Engines"
Scraping top repository for  "Learn to Code"
Scraping top repository for  "Getting started with machine learning"
Scraping top repository for  "Made in Africa"
Scraping top repository for  "Net neutrality"
Scraping top repository for  "Open data"
Scraping top repository for  "Open source organizations"
Scraping top

## Summary

i. The web pages were downloaded using requests.

ii. The HTML source code was parsed using Beautiful Soup 4.

iii. I extracted the GitHub collection titles, descriptions, and URLs from the Collections page.

iv. I compiled the extracted information into Python lists and dictionaries.

v. I also extracted and combined data from multiple pages and loaded it into pandas DataFrames.

vi. The extracted data was saved to CSV files.

vii. The CSV files are stored in the project1_data directory.

## Future Work

i. The repositories themselves could be scraped to get additional information about each individual collection.

ii. This data could also be analyzed, for example to get the total number of GitHub collections.



## References

i. https://datagy.io/python-requests/ (guide to the requests library)

ii. https://beautiful-soup-4.readthedocs.io/en/latest/ (Beautiful Soup documentation)

iii. https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html (pandas documentation)
