# Scraping Top Repositories for Topics on GitHub

### Introduction

- `Web Scraping` :- <b> Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.</b>


- `Problem Statement` :- <b> Scrape information about the top repositories for the different topics listed on the page " https://github.com/topics "</b>


- `Tools & Technologies` :- <b> Python, BeautifulSoup, Pandas, Requests</b>

### Outline of the project

- We are going to scrape the page at https://github.com/topics
- We'll get the list of topics. For each topic, we'll get the topic title, topic description, and topic page URL.
- For each topic, we'll get the top 30 repositories.
- For each repository, we'll get the repo name, username, stars, and URL.
- For each topic, we'll create a CSV file in the following format:

```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```

## Scrape list of topics from GitHub

<b>Steps</b>

- First, download the page using the `requests` library.
- Then parse the content of the downloaded page and extract the information using `bs4` (BeautifulSoup).
- Convert the data into a `Pandas` data frame and save it as a `CSV` file or another document format.

Let's write a function to download the page.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os

In [2]:
def topic_page_load(topics_url):
    # download the page and return the parsed HTML document
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception("Failed to load page {}".format(topics_url))
    page_contents = response.text
    doc = BeautifulSoup(page_contents, 'html.parser')
    return doc
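
Network requests can fail intermittently, so a small retry wrapper may help when many pages are downloaded. The following is only a minimal sketch and not part of the original notebook: `fetch_with_retry`, its parameters, and the `flaky_fetch` stand-in are hypothetical names, and `fetch` is assumed to be any callable (such as `topic_page_load`) that raises an exception on failure.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url), retrying on any exception (hypothetical helper).

    retries is assumed to be at least 1.
    """
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as error:
            last_error = error
            # wait a little longer after each failed attempt, except the last
            if attempt < retries - 1:
                time.sleep(delay * (attempt + 1))
    raise last_error


# stand-in for a real page loader: fails twice, then succeeds
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise Exception("Failed to load page {}".format(url))
    return "<html>ok</html>"


result = fetch_with_retry(flaky_fetch, "https://github.com/topics", delay=0)
```

With the notebook's own loader, the call would look like `fetch_with_retry(topic_page_load, topics_url)`.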

In [3]:
topics_url = "https://github.com/topics"
doc = topic_page_load(topics_url)
type(doc)

bs4.BeautifulSoup

Let's create some functions to extract information from the parsed document.

To get the topic titles, we can pick the `<p>` tags with `class` <b>"f3 lh-condensed mb-0 mt-1 Link--primary"</b>

![](https://i.imgur.com/P1i1Sfe.png)
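
Passing the whole class string to `find_all` matches the `class` attribute as an exact string, so the lookup depends on the classes appearing in that exact order. The sketch below uses a made-up HTML snippet (not the real GitHub page) to compare this with a CSS selector via `select`, which checks each class independently. Which classes are stable enough to select on is an assumption we would need to verify against the live page.

```python
from bs4 import BeautifulSoup

# made-up HTML: the same classes, listed in two different orders
html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="Link--primary f3 lh-condensed mb-0 mt-1">Ajax</p>
"""
sample_doc = BeautifulSoup(html, "html.parser")

# exact string match on the class attribute: the order of classes matters
exact = sample_doc.find_all(
    "p", {"class": "f3 lh-condensed mb-0 mt-1 Link--primary"}
)

# CSS selector: each class is checked on its own, so order does not matter
by_selector = sample_doc.select("p.f3.Link--primary")

exact_titles = [tag.text for tag in exact]
selector_titles = [tag.text for tag in by_selector]
```

Here `exact_titles` contains only the first paragraph, while `selector_titles` contains both.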

In [4]:
def topic_title(doc):
    # return the list of topic titles
    title_tags_class="f3 lh-condensed mb-0 mt-1 Link--primary"
    # find all 'p' tags that contain the topic titles
    topic_title_tags = doc.find_all('p',{'class':title_tags_class})
    # get the title text from each tag
    topic_title = []
    for tag in topic_title_tags:
        topic_title.append(tag.text)
        
    return topic_title

`topic_title` returns the list of topic titles.

In [5]:
title = topic_title(doc)
print(len(title))

30


In [6]:
title[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

In a similar way, we can create functions to get the descriptions and URLs.

In [7]:
def topic_desc(doc):
    # return the list of topic descriptions
    description_class = "f5 color-fg-muted mb-0 mt-1"
    # find all 'p' tags that contain the topic descriptions
    topic_desc_tags = doc.find_all('p',{'class':description_class})
    # get the description text from each tag
    topic_desc = []
    for tag in topic_desc_tags:
        topic_desc.append(tag.text.strip())
        
    return topic_desc

In [8]:
desc = topic_desc(doc)
print(len(desc))

30


In [9]:
desc[:2]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.']

In [10]:
def topic_url(doc):
    # return the list of topic URLs
    # find all 'a' tags that contain the topic URLs
    link_tags = doc.find_all('a',{'class':'no-underline flex-grow-0'})
    base_url = "https://github.com"
    # build the full URL from each tag's href
    topic_urls = []
    for link in link_tags:
        topic_urls.append(base_url+link['href'])
        
    return topic_urls
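
String concatenation works here because every `href` on the page is a relative path. As an alternative sketch, `urljoin` from the standard library builds the same URLs and also copes with an `href` that is already absolute. The sample `href` values below are illustrative only.

```python
from urllib.parse import urljoin

base_url = "https://github.com"

# a relative href, like the ones found on the topics page
relative = urljoin(base_url, "/topics/3d")

# an href that is already absolute is returned unchanged
absolute = urljoin(base_url, "https://github.com/topics/ajax")
```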

In [11]:
url = topic_url(doc)
print(len(url))

30


In [12]:
url[:3]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm']

Now we put it all together to get the topics data frame.

In [13]:
def scrape_topics():
    # return topics data frame
    topics_url = 'https://github.com/topics'
    
    doc = topic_page_load(topics_url)
    # get title
    title = topic_title(doc)
    # get description
    desc = topic_desc(doc)
    # get url
    url = topic_url(doc)
    # create topic dataframe
    dict_topic = {
              'title':title,
              'description':desc,
              'url':url}
    topic_df = pd.DataFrame(dict_topic)
    
    return topic_df
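
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which could happen if the page layout changes and one of the lookups finds fewer tags than the others. A minimal sketch of an explicit check with a clearer message follows. `check_equal_lengths` is a hypothetical helper that is not part of the notebook, and the sample values are made up.

```python
def check_equal_lengths(**columns):
    """Raise ValueError if the named lists differ in length (hypothetical helper)."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("Column lengths differ: {}".format(lengths))
    return lengths


# made-up sample data with matching lengths
ok = check_equal_lengths(
    title=["3D", "Ajax"], description=["a", "b"], url=["u1", "u2"]
)

# made-up sample data where one description is missing
try:
    check_equal_lengths(title=["3D", "Ajax"], description=["a"], url=["u1", "u2"])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Inside `scrape_topics`, such a check could be called as `check_equal_lengths(title=title, description=desc, url=url)` just before the data frame is built.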

In [14]:
topic_df = scrape_topics()
topic_df[:5]

Unnamed: 0,title,description,url
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android


Now we convert this data frame into a CSV file.

In [15]:
topic_df.to_csv("topics.csv", index=False)

Now that our CSV file is created, we can check it using pandas.

In [16]:
topic_csv = pd.read_csv("topics.csv")
topic_csv.head()

Unnamed: 0,title,description,url
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android


## Get Top 30 repositories for each topic page

- In this section, we will scrape repository data for each topic, the same way we scraped the topics.
- We will get the repository name, username, stars, and URL for each repository.

Load the topic page and parse it using BeautifulSoup.

In [17]:
def repo_page_load(topic_url):
    # return the parsed repositories page document
    response = requests.get(topic_url)
    # check successful response
    if response.status_code != 200:
        raise Exception("Failed to load page {}".format(topic_url))
    page_contents = response.text
    # parse page
    topic_doc = BeautifulSoup(page_contents, 'html.parser')
    return topic_doc

For now, we will use just the 3D topic to check whether the function works properly.

In [18]:
topic_url_3d = topic_csv['url'][0]
topic_doc = repo_page_load(topic_url_3d)
type(topic_doc)

bs4.BeautifulSoup

Create a function to get the tags that hold the repo information from the `h3` tags, in a similar way to the functions above.

In [19]:
def get_topic_repo_tags(topic_doc):
    # Get h3 tags
    h3_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class':h3_class})
    # Get star tags
    stars_tags = topic_doc.find_all('span', {'id':'repo-stars-counter-star'})
    return repo_tags, stars_tags

In [20]:
r_t,s_t = get_topic_repo_tags(topic_doc)
print(r_t[:1])
s_t[:1]

[<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="text-bold wb-break-word" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752

[<span aria-label="77589 users starred this repository" class="Counter js-social-count" data-pjax-replace="true" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-view-component="true" id="repo-stars-counter-star" title="77,589">77.6k</span>]

Now we will create a function to convert the star count into a number.

In [21]:
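def convert_star_number(star_str):
    # remove surrounding whitespace before inspecting the suffix
    star_str = star_str.strip()
    if star_str[-1] == 'k':
        # round() avoids off-by-one truncation from floating-point error
        return round(float(star_str[:-1])*1000)
    return int(star_str)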

In [22]:
convert_star_number('77.6k')

77600
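
The displayed text is an abbreviation, so the converted value is only approximate: the tag output above shows `77.6k` as the text while the `title` attribute holds `77,589`. The sketch below is a slightly more tolerant parser that accepts both forms. `parse_star_count` is a hypothetical name, and support for an `m` suffix is an assumption, since the notebook output only shows `k`.

```python
from decimal import Decimal

# multipliers for abbreviated suffixes; 'm' is an assumption, not seen in the output
SUFFIXES = {"k": 1_000, "m": 1_000_000}


def parse_star_count(star_str):
    """Convert strings like '77.6k' or '77,589' to an int (hypothetical helper)."""
    # normalise: trim whitespace, lower-case the suffix, drop thousands separators
    text = star_str.strip().lower().replace(",", "")
    if text and text[-1] in SUFFIXES:
        # Decimal keeps the arithmetic exact, unlike float multiplication
        return int(Decimal(text[:-1]) * SUFFIXES[text[-1]])
    return int(text)


abbreviated = parse_star_count("77.6k")
exact = parse_star_count("77,589")
```

Assuming the `title` attribute is present on every star tag, `parse_star_count(stars_tag['title'])` would return the exact count instead of the rounded one.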

Now we will create a function that extracts the data from the tags.

In [23]:
def get_repo_info(repo_tag,stars_tag):
    # return all repository information
    a_tags = repo_tag.find_all('a')
    repo_user=a_tags[0].text.strip()
    repo_name=a_tags[1].text.strip()
    base_url = "https://github.com"
    repo_url=base_url+a_tags[1]['href']
    stars = convert_star_number(stars_tag.text)
    return repo_user, repo_name, stars, repo_url

Let's look at the info for the first repo in the 3D topic.

In [24]:
get_repo_info(r_t[0],s_t[0])

('mrdoob', 'three.js', 77600, 'https://github.com/mrdoob/three.js')

Now we will create a function that builds a data frame from this information.

In [25]:
def get_topic_repo(topic_doc):
    topic_repos_dict = { 'username': [], 'repo_name': [], 'stars': [],'repo_url': []}
    
    repo_tags, stars_tags = get_topic_repo_tags(topic_doc)
    # Get repo info
    
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], stars_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    # return topic_repos_dict dataframe
    return pd.DataFrame(topic_repos_dict)

In [26]:
repo_df = get_topic_repo(topic_doc)
repo_df.head()

Unnamed: 0,username,repo_name,stars,repo_url
0,mrdoob,three.js,77600,https://github.com/mrdoob/three.js
1,libgdx,libgdx,19500,https://github.com/libgdx/libgdx
2,pmndrs,react-three-fiber,16300,https://github.com/pmndrs/react-three-fiber
3,BabylonJS,Babylon.js,15600,https://github.com/BabylonJS/Babylon.js
4,aframevr,aframe,13500,https://github.com/aframevr/aframe


## Now we put it all together to create CSV files for the repositories of different topics

In [27]:
def scrape_topic(topic_url,path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repo(repo_page_load(topic_url))
    topic_df.to_csv(path, index=False)

def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('topic', exist_ok=True)
    for index,row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'],'topic/{}.csv'.format(row['title']))
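
The topic title is used directly as the file name, which would fail for a title containing a path separator such as `/`. The sketch below shows one way to guard against that. `safe_filename` is a hypothetical helper, and the title `C/C++` is a made-up example rather than a title taken from the page.

```python
import re


def safe_filename(title):
    """Replace characters that are unsafe in file names (hypothetical helper)."""
    # path separators and characters that are reserved on common file systems
    cleaned = re.sub(r'[\\/:*?"<>|]+', "_", title).strip()
    # fall back to a fixed name if nothing is left
    return cleaned or "untitled"


names = [safe_filename(t) for t in ["Awesome Lists", "ASP.NET", "C/C++"]]
```

Titles without unsafe characters are left unchanged, so existing file names such as `topic/Awesome Lists.csv` would stay the same.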

In [29]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositories for "Clojure"
Scrapin

Now we will check whether the CSV file was created properly.

In [30]:
Ajax_csv = pd.read_csv("topic/Ajax.csv")
Ajax_csv.head()

Unnamed: 0,username,repo_name,stars,repo_url
0,metafizzy,infinite-scroll,7000,https://github.com/metafizzy/infinite-scroll
1,ljianshu,Blog,6900,https://github.com/ljianshu/Blog
2,developit,unfetch,5300,https://github.com/developit/unfetch
3,jquery-form,form,5100,https://github.com/jquery-form/form
4,olifolkerd,tabulator,4400,https://github.com/olifolkerd/tabulator


### References and Future Work

Summary of what we did

- First, we scraped information about the different topics on GitHub
- Then we scraped information about the top repositories for each topic

Useful references

- `BeautifulSoup Documentation` :- https://www.crummy.com/software/BeautifulSoup/bs4/doc/

- `GitHub Topics` :- https://github.com/topics

- `Requests Documentation` :- https://docs.python-requests.org/en/latest/

- `Pandas Documentation` :- https://pandas.pydata.org/pandas-docs/version/0.15/tutorials.html

Ideas for future work

- Create CSV files for more topics and repositories by looping through additional pages
- Add an extra `topic_name` column and merge all of the CSV files into a single file using pandas
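
The merge idea could be sketched as follows. This is only one possible approach under stated assumptions: `merge_topic_csvs` is a hypothetical function, each file name (without `.csv`) is assumed to be the topic title, and the demo rows are made up and written to a temporary folder instead of the real `topic` folder.

```python
from pathlib import Path
import tempfile

import pandas as pd


def merge_topic_csvs(folder):
    """Combine every CSV in folder, adding the file name as topic_name."""
    frames = []
    for csv_path in sorted(Path(folder).glob("*.csv")):
        frame = pd.read_csv(csv_path)
        # the file name without ".csv" is assumed to be the topic title
        frame.insert(0, "topic_name", csv_path.stem)
        frames.append(frame)
    # pd.concat needs at least one frame, so the folder must not be empty
    return pd.concat(frames, ignore_index=True)


# demo with made-up rows in a temporary folder
header = "username,repo_name,stars,repo_url\n"
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "3D.csv").write_text(
        header + "user_a,repo_a,10,https://example.com/a\n"
    )
    Path(folder, "Ajax.csv").write_text(
        header + "user_b,repo_b,20,https://example.com/b\n"
    )
    merged = merge_topic_csvs(folder)
```

For the files created above, the call would be `merge_topic_csvs("topic")`, and the result could be saved with `to_csv` under any file name we choose.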