# Scraping Top Repositories for Topics on GitHub

### Introduction

##### Intro to Web Scraping

- Web scraping is the process of using bots to extract content and data from a website. The extracted content can then be stored, analyzed, or replicated elsewhere. Many large companies run web scrapers for their own purposes while also trying to block bots on their own sites.

##### About GitHub

- GitHub (https://github.com/) is a code hosting platform for version control and collaboration. It lets you and others work together on projects from anywhere. It hosts millions of repositories on many different topics.

##### Problem Statement

- Starting from the GitHub topics page, I'm going to get the list of topics, find the top repositories for each topic, and save them to a `.csv` file per topic.

##### Tools

- Python  
- requests 
- Beautiful Soup 
- Pandas 

##### Here are the steps I'll follow:

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic we will get topic title, topic page URL and topic description
- For each topic, we'll get the top repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic, we'll create a CSV file in the following format:

```
Repo Name, Username, Stars, Repo URL
three.js, mrdoob, 79000, https://github.com/mrdoob/three.js
```   

## Scrape the list of topics from GitHub

- Use requests to download the page
- Use BS4 to parse and extract information
- convert to a Pandas dataframe


In [1]:
#import requests,  BeautifulSoup, os and pandas Libraries

import requests
from bs4 import BeautifulSoup
import os
import pandas as pd

In [2]:
# Write a function to download the page

def get_topics_page():
    
    topics_url = 'https://github.com/topics'
    #Download the page
    response = requests.get(topics_url)
    #check successful response
    if response.status_code != 200:
        raise Exception('failed to load page {}'.format(topics_url))
    # Parse using Beautiful Soup
    doc = BeautifulSoup(response.text, 'html.parser')
    
    return doc

In [3]:
doc = get_topics_page()

In [4]:
type(doc)

bs4.BeautifulSoup

##### Create some helper functions to parse information from the page

To get the topic titles, we can pick `p` tags with the `class` `f3 lh-condensed mb-0 mt-1 Link--primary`.

![](https://i.imgur.com/p2T7IkC.png)

In [5]:
# Topic title
def get_topic_titles(doc):
    
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text) 
    return topic_titles

`get_topic_titles` can be used to get the list of titles

In [6]:
titles = get_topic_titles(doc)

In [7]:
len(titles)

30

In [8]:
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

To get the topic description, we can pick `p` tags with the `class` `f5 color-fg-muted mb-0 mt-1`.

![](https://i.imgur.com/vRC97Rc.png)

In [9]:
# Topic description
def get_topic_descs(doc):
    
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class' : desc_selector})
    
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

To get the topic URL, we can pick `a` tags with the `class` `no-underline flex-1 d-flex flex-column`.

![](https://i.imgur.com/IQPsQ3J.png)

In [10]:
# Topic URL
def get_topic_urls(doc):
    
    topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})    

    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])    
    return topic_urls 
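
Concatenating `base_url + tag['href']` works as long as every `href` is a relative path starting with `/`. As a hedged alternative, the sketch below uses the standard library's `urllib.parse.urljoin`, which also copes with an `href` that is already absolute. The helper name `to_absolute_url` and the constant `BASE_URL` are illustrative and not part of the notebook.

```python
from urllib.parse import urljoin

# Assumed base, same value the notebook uses inside get_topic_urls
BASE_URL = 'https://github.com'


def to_absolute_url(href, base_url=BASE_URL):
    """Resolve a relative or absolute href against base_url (hypothetical helper)."""
    return urljoin(base_url, href)


# A relative href is prefixed with the base URL
relative_result = to_absolute_url('/topics/3d')
# An already absolute href is returned unchanged instead of being doubled
absolute_result = to_absolute_url('https://github.com/topics/3d')
print(relative_result)
print(absolute_result)
```

Inside `get_topic_urls`, the append line could then read `topic_urls.append(to_absolute_url(tag['href']))`.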

##### Put all these functions together into a single function

In [11]:
def scrape_topics():
    
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('failed to load page {}'.format(topics_url))
    
    doc = BeautifulSoup(response.text, 'html.parser')
        
    topic_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topic_dict)
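
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which can happen if one of the three selectors matches more or fewer tags than the others (for example after a change in GitHub's markup). The following is a minimal sketch of a check that reports which column is off before the DataFrame is built. The helper name `check_equal_lengths` and the sample values are made up for illustration.

```python
def check_equal_lengths(columns):
    """Return columns unchanged, or raise ValueError naming the lengths found.

    Hypothetical helper: `columns` maps a column name to a list of values.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('column lengths differ: {}'.format(lengths))
    return columns


# Placeholder data standing in for the output of the three helper functions
sample_columns = {
    'title': ['3D', 'Ajax'],
    'description': ['first description', 'second description'],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
}
checked = check_equal_lengths(sample_columns)
print(sorted(checked))
```

In `scrape_topics`, the dictionary could be passed through this check right before `pd.DataFrame(topic_dict)`.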

## Get the top repositories from the topic page



In [12]:
import os

Here is the function to get the individual topic page and download it

In [13]:
def get_topic_page(topic_url):
    
    #Download the page
    response = requests.get(topic_url)
    
    #check successful response
    if response.status_code != 200:
        raise Exception('failed to load page {}'.format(topic_url))
        
    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    
    return topic_doc
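
`get_topic_page` raises an exception on any non-200 response, so a single transient failure stops the whole run. One possible way to soften this, sketched below under the assumption that such failures are temporary, is to retry the call a few times with an increasing delay. The names `with_retries`, `attempts`, `delay_seconds` and `sleep` are hypothetical and not part of the notebook.

```python
import time


def with_retries(fetch, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out (illustrative sketch)."""
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as error:
            last_error = error
            if attempt < attempts:
                # Wait longer after each failure: delay, 2 * delay, 4 * delay, ...
                sleep(delay_seconds * 2 ** (attempt - 1))
    # Every attempt failed, so re-raise the last error seen
    raise last_error
```

With the notebook's function, a call could look like `with_retries(lambda: get_topic_page('https://github.com/topics/3d'))`.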

In [14]:
doc = get_topic_page('https://github.com/topics/3d')

In [15]:
type(doc)

bs4.BeautifulSoup

Here we pick the `a` tags to get the `username` and `repository name`.

![](https://i.imgur.com/cEaattd.png)
![](https://i.imgur.com/TKo0vxc.png)

In [16]:
def get_repo_info(h3_tag, star_tag):
    # Return all the required info about a repository
    base_url = 'https://github.com'
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    
    return username, repo_name, stars, repo_url
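
`get_repo_info` calls `parse_star_count`, which is not defined anywhere in this notebook. The sketch below is one plausible version, assuming the star counter text is either a plain number such as `512` or an abbreviated one such as `79k`. Support for commas and for an `m` suffix is an extra assumption added here.

```python
def parse_star_count(stars_str):
    """Convert star text such as '79k' or '1,234' to an int (assumed formats)."""
    text = stars_str.strip().lower().replace(',', '')
    multipliers = {'k': 1_000, 'm': 1_000_000}
    if text and text[-1] in multipliers:
        # round() guards against float artifacts such as 95300.00000000001
        return int(round(float(text[:-1]) * multipliers[text[-1]]))
    return int(text)


print(parse_star_count('79k'))
print(parse_star_count('1,234'))
```

This cell needs to run before `get_repo_info` is called, otherwise Python raises a `NameError`.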

Here I am using `h3` tags with the `class` `f3 color-fg-muted text-normal lh-condensed`, and for the star counts I am using `span` tags with the `class` `Counter js-social-count`.

![](https://i.imgur.com/bvTIxkJ.png)
![](https://i.imgur.com/FOAYBx0.png)

In [17]:
def get_topic_repos(topic_doc):
    
    # get h3 tags containing repo title, repo URL and username
    
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})
    
    # Get the star tags
    star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})
    
    topic_repos_dict = {
        'username' : [],
        'repo_name' : [],
        'stars' : [],
        'repo_url' : []
    }

    
    #Get the repo info
    
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])

        
    return pd.DataFrame(topic_repos_dict)

In [18]:
def scrape_topic(topic_url, path):    
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index = None)

## Putting it all together

- I have a function to get the list of topics
- I have a function to create a CSV file for scraped repos from a topics page
- Let's create a function to put them together

In [25]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('data', exist_ok=True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))
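
The file path is built directly from the topic title, which works for titles like `ASP.NET` but would break for a title containing a path separator. A minimal sketch of a safer path builder follows. The helper name `topic_csv_path` and the example title `CI/CD` are hypothetical and only used to show the effect.

```python
import os
import re


def topic_csv_path(title, folder='data'):
    """Build a CSV path from a topic title, replacing unsafe characters (sketch)."""
    # Keep letters, digits, dot, underscore, plus and hyphen; replace everything else
    safe_name = re.sub(r'[^A-Za-z0-9._+-]+', '_', title.strip())
    return os.path.join(folder, safe_name + '.csv')


print(topic_csv_path('ASP.NET'))
print(topic_csv_path('CI/CD'))
```

In `scrape_topics_repos`, the second argument to `scrape_topic` could then be `topic_csv_path(row['title'])`.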

Let's run it to scrape the top repos for the topics on the first page of https://github.com/topics

In [26]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
The file data/3D.csv already exists. Skipping...
Scraping top repositories for "Ajax"
The file data/Ajax.csv already exists. Skipping...
Scraping top repositories for "Algorithm"
The file data/Algorithm.csv already exists. Skipping...
Scraping top repositories for "Amp"
The file data/Amp.csv already exists. Skipping...
Scraping top repositories for "Android"
The file data/Android.csv already exists. Skipping...
Scraping top repositories for "Angular"
The file data/Angular.csv already exists. Skipping...
Scraping top repositories for "Ansible"
The file data/Ansible.csv already exists. Skipping...
Scraping top repositories for "API"
The file data/API.csv already exists. Skipping...
Scraping top repositories for "Arduino"
The file data/Arduino.csv already exists. Skipping...
Scraping top repositories for "ASP.NET"
The file data/ASP.NET.csv already exists. Skipping...
Scraping top repositories for "Atom"
The file data/Atom.csv alre

## Reference and Future Work


##### Summary

- So far I have scraped the GitHub topics page and created a `.csv` file for each topic.
- Each file lists the username, repository name, stars and URL of the topic's top repositories.

##### Reference

- Python Requests (https://docs.python-requests.org/en/latest/)
- Beautiful Soup (https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- For creating folder (https://www.geeksforgeeks.org/create-a-directory-in-python/)
- Downloading Jupyter notebook as PDF (https://towardsdatascience.com/jupyter-notebook-to-pdf-in-a-few-lines-3c48d68a7a63)
- Inserting Image (https://stackoverflow.com/questions/10628262/inserting-image-into-ipython-notebook-markdown)

##### Future Plan

- This version scrapes only the first page of topics; I will extend it to multiple pages.
- The CSV files can be used for data analysis, and I can add more fields per repository for better understanding and analysis.
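
As a starting point for the multi-page plan, the sketch below generates one URL per topics page. It assumes that the topics listing accepts a `page` query parameter, which should be verified against the live site before relying on it. The function name `topics_page_urls` is hypothetical.

```python
def topics_page_urls(page_count, topics_url='https://github.com/topics'):
    """Return the URLs of the first page_count topics pages.

    Assumption: the listing is paginated with a `page` query parameter.
    """
    return ['{}?page={}'.format(topics_url, page)
            for page in range(1, page_count + 1)]


for url in topics_page_urls(3):
    print(url)
```

Each URL could then be downloaded and parsed with the same helper functions used for the first page, and the resulting DataFrames combined.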