# Scraping Top Repositories for Topics on GitHub
## GitHub is a code hosting platform for version control and collaboration
### Scraping is done with Python, requests, Beautiful Soup, and pandas
### If the scraper stops working, GitHub has probably changed the tag classes

### Project outline:
- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL, and topic description
- For each topic, we'll get the top repositories (the first page of results, about 20) from the topic page
- For each repository, we'll grab the repo name, username, stars, and repo URL
- For each topic, we'll create a CSV file in the format:
```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,91000,https://github.com/mrdoob/three.js
react-three-fiber,pmndrs,22300,https://github.com/pmndrs/react-three-fiber
```

## Scrape the list of topics from GitHub

Import the required libraries.

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import os 

base_url = 'https://github.com/'

Each topics page lists about 20 topics, and there are 7 topics pages in total.

`get_topics_page` downloads the HTML of a https://github.com/topics page and returns it parsed as a `BeautifulSoup` document.

In [2]:
def get_topics_page(page=1):
    # Request the given page of the topics listing
    topics_url = 'https://github.com/topics/?page=' + str(page)
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the HTML so it can be searched with find_all
    doc = BeautifulSoup(response.text, 'html.parser')

    return doc
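
The query string is easy to mistype when it is assembled by hand. As a minimal sketch, the page URL can also be built with the standard library's `urlencode`. The helper name `build_topics_url` and the constant `TOPICS_URL` are hypothetical and not part of the notebook.

```python
from urllib.parse import urlencode

TOPICS_URL = 'https://github.com/topics'

def build_topics_url(page=1):
    # urlencode turns {'page': 2} into 'page=2', escaping values if needed
    return TOPICS_URL + '?' + urlencode({'page': page})

print(build_topics_url(2))  # https://github.com/topics?page=2
```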

In [3]:
doc = get_topics_page(1)
doc.find_all('p')[:5]

[<p class="f4 color-fg-muted col-md-6 mx-auto">Browse popular topics on GitHub.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         LaTeX
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">LaTeX is a document preparation system.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Python
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">Python is a dynamically typed programming language.</p>]

To get the topic titles, we pick the `p` tags whose `class` is 'f3 lh-condensed mb-0 mt-1 Link--primary'. The function takes the parsed topics page returned by `get_topics_page` as its argument.


In [4]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class':selection_class})
    topic_titles = []
    
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
        
    return topic_titles

In [5]:
doc = get_topics_page()
titles = get_topic_titles(doc)
titles[:20]

['3D',
 'Ajax',
 'Algorithm',
 'Amp',
 'Android',
 'Angular',
 'Ansible',
 'API',
 'Arduino',
 'ASP.NET',
 'Atom',
 'Awesome Lists',
 'Amazon Web Services',
 'Azure',
 'Babel',
 'Bash',
 'Bitcoin',
 'Bootstrap',
 'Bot',
 'C']
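
The title lookup depends on the exact class string. When Beautiful Soup is given a multi-class string, it compares it against the whole `class` attribute, so a reordered or extended class list no longer matches. The sketch below uses a small inline HTML sample (copied from the `p` tags printed earlier, not fetched from GitHub) to contrast that with a CSS selector, which matches the listed classes in any order. Which classes are stable enough to select on is an assumption.

```python
from bs4 import BeautifulSoup

html = '''
<p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
    Python
</p>
<p class="f5 color-fg-muted text-center mb-0 mt-1">Python is a dynamically typed programming language.</p>
'''
doc = BeautifulSoup(html, 'html.parser')

# Exact-string match: fails here because the attribute has extra classes
exact = doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})

# CSS selector: matches any <p> carrying all three listed classes
by_selector = doc.select('p.f3.lh-condensed.Link--primary')
titles = [tag.text.strip() for tag in by_selector]

print(len(exact), titles)
```

If the scraper suddenly returns empty lists, comparing the class string in the code with the current page source is a good first check.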

The functions for topic descriptions and URLs are similar.

In [6]:
def get_topic_descs(doc):
    selection_class = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class':selection_class})
    topic_descs = []
    
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    
    return topic_descs

In [7]:
doc = get_topics_page()
descs = get_topic_descs(doc)
descs[:5]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [8]:
def get_topic_urls(doc):
    selection_class = 'no-underline flex-1 d-flex flex-column'
    topic_link_tags = doc.find_all('a', {'class':selection_class})
    topic_urls = []
    
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])   
    
    return topic_urls

In [9]:
doc = get_topics_page()
urls = get_topic_urls(doc)
urls[:5]

['https://github.com//topics/3d',
 'https://github.com//topics/ajax',
 'https://github.com//topics/algorithm',
 'https://github.com//topics/amphp',
 'https://github.com//topics/android']
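
The URLs above contain a double slash because `base_url` ends with `/` and each `href` starts with `/`. As a sketch, the standard library's `urljoin` resolves the `href` against the base and avoids this. The helper name `absolute_url` is hypothetical.

```python
from urllib.parse import urljoin

base_url = 'https://github.com/'

def absolute_url(href):
    # urljoin resolves an href like '/topics/3d' against the base URL
    return urljoin(base_url, href)

print(absolute_url('/topics/3d'))  # https://github.com/topics/3d
```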

Let's put all of this together into a single function. The 7th page is the last one.

In [10]:
def scrape_topics(pages=1):
    topics = []
    # Scrape topics pages 1..pages and collect one DataFrame per page
    for i in range(1, pages + 1):
        topics_url = 'https://github.com/topics/?page=' + str(i)
        response = requests.get(topics_url)
        if response.status_code != 200:
            raise Exception('Failed to load page {}'.format(topics_url))
        doc = BeautifulSoup(response.text, 'html.parser')

        topics_dict = {
            'title': get_topic_titles(doc),
            'description': get_topic_descs(doc),
            'url': get_topic_urls(doc)
        }

        topics.append(pd.DataFrame(topics_dict))

    # Combine the per-page frames into one and renumber the rows
    return pd.concat(topics, ignore_index=True)

In [11]:
topics = scrape_topics(3)
topics

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com//topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com//topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com//topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com//topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com//topics/android |
| ... | ... | ... | ... |
| 85 | Kubernetes | Kubernetes is an open source system for automa... | https://github.com//topics/kubernetes |
| 86 | Laravel | The PHP Framework for Web Artisans. | https://github.com//topics/laravel |
| 87 | LaTeX | LaTeX is a document preparation system. | https://github.com//topics/latex |
| 88 | Library | A library is a collection of resources, often ... | https://github.com//topics/library |
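
Titles, descriptions, and URLs are scraped by three separate lookups. If one lookup matches fewer tags than the others, building a `DataFrame` from the dictionary raises a `ValueError` about array lengths. The sketch below checks the lengths first and reports which column is short. The helper name `check_equal_lengths` is hypothetical.

```python
def check_equal_lengths(columns):
    # columns maps a column name to the list scraped for it
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return columns

columns = {
    'title': ['3D', 'Ajax'],
    'description': ['3D refers to...', 'Ajax is a technique...'],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
}
checked = check_equal_lengths(columns)
```

Under this assumption, `scrape_topics` could pass `topics_dict` through such a check before calling `pd.DataFrame`.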


## Get top 20 repositories from a topic page
`get_topic_page` takes a topic URL as its argument (one of the URLs returned by `get_topic_urls`) and returns the parsed HTML of that topic page.

In [12]:
def get_topic_page(topic_url):
    response = requests.get(topic_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    
    return topic_doc

In [13]:
algorithm = get_topic_page('https://github.com/topics/algorithm')
algorithm.find_all('p')[:5]

[<p>Algorithms are detailed sets of guidelines created for a computer program to complete tasks efficiently and thoroughly.</p>,
 <p class="color-fg-muted mb-0">A complete computer science study plan to become a software engineer.</p>,
 <p class="color-fg-muted mb-0"><g-emoji alias="memo" class="g-emoji" fallback-src="https://github.githubassets.com/images/icons/emoji/unicode/1f4dd.png">📝</g-emoji> Algorithms and data structures implemented in JavaScript with explanations and links to further readings</p>,
 <p class="color-fg-muted mb-0"><g-emoji alias="books" class="g-emoji" fallback-src="https://github.githubassets.com/images/icons/emoji/unicode/1f4da.png">📚</g-emoji> 技术面试必备基础知识、Leetcode、计算机操作系统、计算机网络、系统设计</p>,
 <p class="color-fg-muted mb-0">All Algorithms implemented in Python</p>]

`parse_star_count` converts a star count string (for example, one ending in `k`) into an integer.

`get_repo_info` takes `h_tag`, the parent of the `a` tags that hold the username and the repository name, and `star_tag`, which holds the star count.

In [14]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)

def get_repo_info(h_tag, star_tag):
    a_tags = h_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    
    return username, repo_name, stars, repo_url
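
`parse_star_count` multiplies a float by 1000 and truncates the result with `int`. Binary floating point cannot represent most decimal fractions exactly, so truncation can in principle land one below the intended value. A minimal sketch of a variant that uses `Decimal` instead is shown here. The name `parse_star_count_exact` is hypothetical, and like the original it only handles the `k` suffix.

```python
from decimal import Decimal

def parse_star_count_exact(stars_str):
    # Hypothetical variant of parse_star_count that avoids float arithmetic
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        # Decimal('22.3') * 1000 is exactly Decimal('22300.0')
        return int(Decimal(stars_str[:-1]) * 1000)
    return int(stars_str)

print(parse_star_count_exact('22.3k'))  # 22300
print(parse_star_count_exact(' 512 '))  # 512
```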

`get_topic_repos` takes a parsed topic page and returns a DataFrame of the top repositories for that topic.

In [15]:
def get_topic_repos(topic_doc):
      
    #get tags
    selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3',{'class':selection_class})
    star_tags = topic_doc.find_all('span',{'class':'Counter js-social-count'})
    
    topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
    }
    
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    
    return pd.DataFrame(topic_repos_dict)

In [16]:
get_topic_repos(algorithm)

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | jwasham | coding-interview-university | 256000 | https://github.com//jwasham/coding-interview-u... |
| 1 | trekhleb | javascript-algorithms | 169000 | https://github.com//trekhleb/javascript-algori... |
| 2 | CyC2018 | CS-Notes | 164000 | https://github.com//CyC2018/CS-Notes |
| 3 | TheAlgorithms | Python | 158000 | https://github.com//TheAlgorithms/Python |
| 4 | yangshun | tech-interview-handbook | 90000 | https://github.com//yangshun/tech-interview-ha... |
| 5 | kdn251 | interviews | 59700 | https://github.com//kdn251/interviews |
| 6 | TheAlgorithms | Java | 51600 | https://github.com//TheAlgorithms/Java |
| 7 | azl397985856 | leetcode | 51100 | https://github.com//azl397985856/leetcode |
| 8 | algorithm-visualizer | algorithm-visualizer | 42700 | https://github.com//algorithm-visualizer/algor... |
| 9 | youngyangyang04 | leetcode-master | 38300 | https://github.com//youngyangyang04/leetcode-m... |
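
The project outline describes a CSV header of `Repo Name,Username,Stars,Repo URL`, while the DataFrame returned by `get_topic_repos` uses the columns `username`, `repo_name`, `stars`, `repo_url`. If the outlined format is wanted, one possible approach is to reorder and rename the columns before saving. This is a sketch using the two sample rows from the outline; the helper name `to_outline_format` is hypothetical.

```python
import pandas as pd

# Sample rows taken from the CSV example in the project outline
repos_df = pd.DataFrame({
    'username': ['mrdoob', 'pmndrs'],
    'repo_name': ['three.js', 'react-three-fiber'],
    'stars': [91000, 22300],
    'repo_url': ['https://github.com/mrdoob/three.js',
                 'https://github.com/pmndrs/react-three-fiber'],
})

def to_outline_format(df):
    # Reorder the columns, then give them the headers used in the outline
    ordered = df[['repo_name', 'username', 'stars', 'repo_url']]
    return ordered.rename(columns={
        'repo_name': 'Repo Name',
        'username': 'Username',
        'stars': 'Stars',
        'repo_url': 'Repo URL',
    })

# With no path argument, to_csv returns the CSV text as a string
csv_text = to_outline_format(repos_df).to_csv(index=False)
print(csv_text)
```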


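`scrape_topic` saves the top repositories of a topic as a CSV file at the given path (passed without the `.csv` extension) and skips topics whose file already exists.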

In [17]:
def scrape_topic(topic_url, path):
    fname = path + '.csv'
    # Skip topics that were already saved by an earlier run
    if os.path.exists(fname):
        print("The file {} already exists. Skipping...".format(fname))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    # index=False leaves the row numbers out of the CSV file
    topic_df.to_csv(fname, index=False)
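
Topic titles are used directly as file names here. A title that contained a path separator such as `/` would point into a subdirectory that does not exist. As a hedged sketch, a title could be cleaned up before it is used in the path. The helper `safe_filename` is hypothetical, and its rule (keep ASCII letters, digits, `.`, `_`, `-`) is only one possible choice.

```python
import re

def safe_filename(title):
    # Replace anything other than letters, digits, '.', '_' and '-' with '_'
    return re.sub(r'[^A-Za-z0-9._-]+', '_', title).strip('_')

print(safe_filename('Awesome Lists'))  # Awesome_Lists
print(safe_filename('ASP.NET'))        # ASP.NET
```

Note that this changes some file names (spaces become underscores), so files saved by earlier runs under the original titles would not be detected as existing.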
    

## Putting it all together

In [18]:
def scrape_topics_repos(pages=1):
    print('Scraping list of topics')
    # scrape_topics already loops over pages 1..pages, so call it once
    topics_df = scrape_topics(pages)
    os.makedirs('github-topics', exist_ok=True)

    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'github-topics/{}'.format(row['title']))
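
With this loop, a single failed request raises an exception and stops the whole run. One possible variation, sketched here under stated assumptions, records failures and continues, with a short pause between requests. The function `scrape_all`, its `scrape_one` callback, and the one-second default delay are all hypothetical choices, not part of the notebook.

```python
import time

def scrape_all(topics, scrape_one, delay_seconds=1.0):
    # topics: iterable of (title, url); scrape_one: callable(url, path)
    failed = []
    for title, url in topics:
        try:
            scrape_one(url, 'github-topics/{}'.format(title))
        except Exception as error:
            # Record the failure and keep going with the next topic
            failed.append((title, str(error)))
        # Pause between requests to avoid sending them in a burst
        time.sleep(delay_seconds)
    return failed
```

In the notebook this could be called as `scrape_all(zip(topics_df['title'], topics_df['url']), scrape_topic)`, and the returned list shows which topics need another attempt.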

Let's run it to scrape the top repos for all the topics on the first 2 pages of https://github.com/topics.

In [19]:
scrape_topics_repos(2)

Scraping list of topics
Scraping top repositories for "3D"
The file github-topics/3D.csv already exists. Skipping...
Scraping top repositories for "Ajax"
The file github-topics/Ajax.csv already exists. Skipping...
Scraping top repositories for "Algorithm"
The file github-topics/Algorithm.csv already exists. Skipping...
Scraping top repositories for "Amp"
The file github-topics/Amp.csv already exists. Skipping...
Scraping top repositories for "Android"
The file github-topics/Android.csv already exists. Skipping...
Scraping top repositories for "Angular"
The file github-topics/Angular.csv already exists. Skipping...
Scraping top repositories for "Ansible"
The file github-topics/Ansible.csv already exists. Skipping...
Scraping top repositories for "API"
The file github-topics/API.csv already exists. Skipping...
Scraping top repositories for "Arduino"
The file github-topics/Arduino.csv already exists. Skipping...
Scraping top repositories for "ASP.NET"
The file github-topics/ASP.NET.csv al