# Scraping Top Repositories for Topics on GitHub

**MISSIONS**

- Introduction to `web scraping`.
- Introduction to `github` and the problem statement.
- Mention the tools and libraries being used (Python, requests, Beautiful Soup, Pandas).

1. Know more about requests: https://requests.readthedocs.io/en/latest/
2. Know more about Beautiful Soup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
3. Know more about Pandas: https://pandas.pydata.org/docs/

**OUTLINES**

Steps followed:

* Scrape https://github.com/topics.
* Get a list of topics. For each topic, get the topic title, topic page URL and topic description.
* For each topic, get the top 20 repositories from the topic page.
* For each repository, get the repo name, username, stars and repo URL.
* For each topic, create a CSV file in the following format:


Repo Name,Username,Stars,Repo URL
three.js,mrdoob,92200,https://github.com/mrdoob/three.js
react-three-fiber,pmndrs,22700,https://github.com/pmndrs/react-three-fiber
libgdx,libgdx,21500,https://github.com/libgdx/libgdx
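As a rough sketch of how a file in this format can be produced with Pandas, the snippet below builds a small DataFrame from sample rows and renders it as CSV text. The variable names `rows`, `columns` and `repos_df` are made up for this illustration.

```python
import pandas as pd

# Sample rows in the format shown above.
rows = [
    ('three.js', 'mrdoob', 92200, 'https://github.com/mrdoob/three.js'),
    ('libgdx', 'libgdx', 21500, 'https://github.com/libgdx/libgdx'),
]
columns = ['Repo Name', 'Username', 'Stars', 'Repo URL']
repos_df = pd.DataFrame(rows, columns=columns)

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = repos_df.to_csv(index=False)
print(csv_text)
```

Passing `index=False` keeps the row index out of the file, so the first line is exactly the header row.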

## Scrape the list of topics from GitHub

* Use the `requests` library to download the page
* Use `bs4` to parse and extract information
* Convert to a Pandas dataframe

**Write a function to download the page**

In [88]:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import os


def get_topics_page():
    # Download the GitHub topics listing page
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    # Stop early if the request did not succeed
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))

    # Parse the HTML so the helper functions can search it
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc



In [89]:
doc = get_topics_page()

In [90]:
type(doc)

bs4.BeautifulSoup
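A slightly more defensive variant of the download step could look like the sketch below. The names `fetch_doc`, `session` and `timeout` are assumptions made for this example and are not part of the notebook. It adds a request timeout and relies on `raise_for_status`, which raises for 4xx and 5xx responses rather than for every non-200 status.

```python
import requests
from bs4 import BeautifulSoup


def fetch_doc(url, session=None, timeout=10):
    """Download `url` and return it parsed as a BeautifulSoup document.

    Hypothetical helper: `session` may be any object with a requests-style
    get() method, which makes the function easy to try without a network.
    """
    # Fall back to the requests module itself when no session is given.
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')
```

Calling `fetch_doc('https://github.com/topics')` would then play the same role as `get_topics_page()`.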

**Create some helper functions to parse information from the page**

To get topic titles, we pick `p` tags with the class `f3 lh-condensed mb-0 mt-1 Link--primary`.

![](https://i.imgur.com/PYd4XVw.png)

In [91]:
def get_topic_titles(doc):
    # Topic titles live in <p> tags with this exact class string
    topic_title_p_tags = doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})
    topic_titles = []
    for tag in topic_title_p_tags:
        topic_titles.append(tag.text)

    return topic_titles

`get_topic_titles` returns the list of topic titles.

In [92]:
titles = get_topic_titles(doc)

In [93]:
len(titles)

30

In [94]:
titles[:3]

['3D', 'Ajax', 'Algorithm']
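The class-based lookup in `get_topic_titles` can be tried offline on a small snippet. The HTML below is a made-up stand-in for the topics page, not GitHub's real markup. It shows that a class string containing spaces has to match the attribute value exactly, while a single class name matches any tag that carries it.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet that mimics two topic entries; not GitHub's real markup.
sample_html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">Sample description</p>
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
"""
sample_doc = BeautifulSoup(sample_html, 'html.parser')

# A class string with spaces must match the attribute value exactly.
exact = sample_doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})
# The same classes in a different order do not match.
reordered = sample_doc.find_all('p', {'class': 'Link--primary f3 lh-condensed mb-0 mt-1'})
# A single class name matches any tag that carries it.
single = sample_doc.find_all('p', {'class': 'Link--primary'})

print([tag.text for tag in exact])
print(len(reordered))
print([tag.text for tag in single])
```

This is why the scraper returns empty lists if GitHub changes or reorders the classes on these tags.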

**Similarly, we define functions for descriptions and URLs**

In [95]:

    
def get_topic_descs(doc):
    topic_desc_tags = doc.find_all('p', {'class': 'f5 color-fg-muted mb-0 mt-1'})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs
    


`get_topic_descs` returns the list of topic descriptions.

In [96]:
descs = get_topic_descs(doc)

In [97]:
len(descs)

30

In [98]:
descs[:2]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.']

In [99]:
def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-grow-0'})
    topic_urls = []
    base_url = 'https://github.com'

    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])

    return topic_urls
    
   

`get_topic_urls` returns the list of topic page URLs.

In [100]:
urls = get_topic_urls(doc)

In [101]:
len(urls)

30

In [102]:
urls[:4]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp']
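`get_topic_urls` builds each URL by string concatenation. As an alternative sketch, the standard library's `urljoin` does the same job and also copes with `href` values that are already absolute. The second call below uses a made-up absolute `href` to show this.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'

# A relative href is resolved against the base URL.
relative = urljoin(base_url, '/topics/3d')
# An href that is already absolute is returned unchanged.
absolute = urljoin(base_url, 'https://github.com/topics/ajax')

print(relative)
print(absolute)
```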

**Let's put this all together into a single function**

In [103]:
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))

    doc = BeautifulSoup(response.text, 'html.parser')
    # Each helper returns one column; the lists must have equal lengths
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }

    return pd.DataFrame(topics_dict)
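Building a DataFrame from a dictionary of lists only works when all lists have the same length. If one selector stops matching, the lengths diverge. The sketch below uses a hypothetical helper, `build_topics_frame`, to make that failure explicit. The sample descriptions are placeholders.

```python
import pandas as pd


def build_topics_frame(titles, descs, urls):
    """Hypothetical helper: combine three parallel lists into a DataFrame."""
    lengths = {'title': len(titles), 'description': len(descs), 'url': len(urls)}
    # If a selector stops matching, one list comes back shorter than the others.
    if len(set(lengths.values())) != 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return pd.DataFrame({'title': titles, 'description': descs, 'url': urls})


sample_topics_df = build_topics_frame(
    ['3D', 'Ajax'],
    ['Placeholder description for 3D.', 'Placeholder description for Ajax.'],
    ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
)
print(sample_topics_df.shape)
```

The error message then names the column that came back short, which points directly at the selector that needs updating.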

## Get the top 20 repositories from a topic page

**MISSION:**

- Get the top repositories for each individual topic

In [104]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)

    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))

    # Parse using BeautifulSoup
    topic_doc = BeautifulSoup(response.text, 'html.parser')

    return topic_doc

In [105]:
doc = get_topic_page('https://github.com/topics/3d')

In [1]:
# doc

In [120]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)
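`parse_star_count` truncates with `int`, so a float product that lands just below a whole number would lose one star. A possible variant, sketched here under the assumption that counts may also contain thousands separators, rounds instead. The name `parse_star_count_rounded` is made up for this example.

```python
def parse_star_count_rounded(stars_str):
    """Hypothetical variant of parse_star_count that rounds instead of truncating."""
    # Drop surrounding whitespace and any thousands separators such as "1,234".
    stars_str = stars_str.strip().replace(',', '')
    if stars_str[-1].lower() == 'k':
        # round() picks the nearest whole number rather than cutting off decimals.
        return round(float(stars_str[:-1]) * 1000)
    return int(stars_str)


print(parse_star_count_rounded('92.2k'))
print(parse_star_count_rounded(' 1,234 '))
print(parse_star_count_rounded('987'))
```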

In [121]:
base_url = 'https://github.com'

def get_repo_info(h3_tag, star_tag):
    # Returns all necessary info about a repository
    a_tags = h3_tag.find_all('a')
    # The first link points to the owner, the second to the repository
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [122]:
def get_topic_repos(topic_doc):
    # Get the h3 tags containing username, repo title and repo URL
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})
    # Get star tags
    star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})
    
    # Get repo info
    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])

    return pd.DataFrame(topic_repos_dict)
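To see what the repository parsing extracts without hitting the network, the same lookups can be run on a small snippet. The markup below is a heavily simplified, hypothetical stand-in for one repository entry. The real topic page carries many more attributes.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup for a single repository entry.
sample_html = """
<article>
  <h3 class="f3 color-fg-muted text-normal lh-condensed">
    <a href="/mrdoob">mrdoob</a> /
    <a href="/mrdoob/three.js">three.js</a>
  </h3>
  <span class="Counter js-social-count">92.2k</span>
</article>
"""
sample_doc = BeautifulSoup(sample_html, 'html.parser')
h3_tag = sample_doc.find('h3', {'class': 'f3 color-fg-muted text-normal lh-condensed'})
star_tag = sample_doc.find('span', {'class': 'Counter js-social-count'})

# The first link is the owner, the second link is the repository itself.
a_tags = h3_tag.find_all('a')
username = a_tags[0].text.strip()
repo_name = a_tags[1].text.strip()
repo_url = 'https://github.com' + a_tags[1]['href']
stars_text = star_tag.text.strip()

print(username, repo_name, repo_url, stars_text)
```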

In [123]:
def scrape_topic(topic_url, path):
    # Skip topics that were already scraped in a previous run
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=None)

## Putting it all together

- We have a function to get the list of all topics
- We have a function to create a CSV file of scraped repos from a topic page
- Let's create a function to put them together

In [124]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('data', exist_ok=True)
    
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))


Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics

In [125]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
The file data/3D.csv already exists. Skipping...
Scraping top repositories for "Ajax"
The file data/Ajax.csv already exists. Skipping...
Scraping top repositories for "Algorithm"
The file data/Algorithm.csv already exists. Skipping...
Scraping top repositories for "Amp"
The file data/Amp.csv already exists. Skipping...
Scraping top repositories for "Android"
The file data/Android.csv already exists. Skipping...
Scraping top repositories for "Angular"
The file data/Angular.csv already exists. Skipping...
Scraping top repositories for "Ansible"
The file data/Ansible.csv already exists. Skipping...
Scraping top repositories for "API"
The file data/API.csv already exists. Skipping...
Scraping top repositories for "Arduino"
The file data/Arduino.csv already exists. Skipping...
Scraping top repositories for "ASP.NET"
The file data/ASP.NET.csv already exists. Skipping...
Scraping top repositories for "Atom"
The file data/At
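The file names above come straight from the topic titles. A title containing a character such as `/` would produce an unusable path. One possible guard, sketched with a hypothetical helper `safe_filename`, replaces anything outside a conservative character set. The title `Objective-C / Swift` is a made-up example.

```python
import re


def safe_filename(title):
    """Hypothetical helper: turn a topic title into a file-system-friendly name."""
    # Replace every run of characters other than letters, digits, '.', '_' or '-'.
    cleaned = re.sub(r'[^A-Za-z0-9._-]+', '_', title.strip())
    # Fall back to a fixed name if nothing is left.
    return cleaned or 'untitled'


print(safe_filename('ASP.NET'))
print(safe_filename('Objective-C / Swift'))
```

Two different titles can still map to the same name under this scheme, so it is a starting point rather than a complete solution.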

**We can check that the CSVs were created properly**
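One way to do this check, as a minimal sketch: list the CSV files in the output folder and load each one back with Pandas. A temporary directory with one sample file stands in for the `data` folder here so the snippet runs anywhere. The sample row is illustrative.

```python
import os
import tempfile

import pandas as pd

# A temporary directory stands in for the `data` folder in this sketch.
with tempfile.TemporaryDirectory() as data_dir:
    sample = pd.DataFrame({
        'username': ['mrdoob'],
        'repo_name': ['three.js'],
        'stars': [92200],
        'repo_url': ['https://github.com/mrdoob/three.js'],
    })
    sample.to_csv(os.path.join(data_dir, '3D.csv'), index=None)

    # List the CSV files and load each one back to inspect shape and columns.
    csv_files = sorted(f for f in os.listdir(data_dir) if f.endswith('.csv'))
    loaded = {f: pd.read_csv(os.path.join(data_dir, f)) for f in csv_files}

for name, df in loaded.items():
    print(name, df.shape, list(df.columns))
```

For the real output, replacing `data_dir` with `'data'` and dropping the sample-writing step would run the same check on the scraped files.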

## References and Future Work


Links found useful:
- https://requests.readthedocs.io/en/latest/
- https://www.crummy.com/software/BeautifulSoup/bs4/doc/
- https://pandas.pydata.org/docs/
- https://jovian.ai/aakashns/python-web...
- https://jovian.ai/aakashns/python-web...