# Scraping Top Repositories for Topics on GitHub

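Web scraping is the process of extracting information from web pages programmatically. GitHub groups repositories under topics, and each topic has its own page listing repositories tagged with it. In this notebook we scrape the topics listed at https://github.com/topics and, for each topic, collect information about its top repositories.

The tools used are Python with the `requests`, Beautiful Soup, and Pandas libraries.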



Here are the steps we'll follow:

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:

```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```
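
As a rough illustration of this format, the same rows can be produced with Python's built-in `csv` module. This is only a sketch: it writes to an in-memory buffer rather than a file, and it is not how the notebook itself saves data (that is done later with Pandas).

```python
import csv
import io

# The two sample rows from the format shown above.
rows = [
    ("three.js", "mrdoob", 69700, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 18300, "https://github.com/libgdx/libgdx"),
]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Repo Name", "Username", "Stars", "Repo URL"])
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```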

## Scrape the list of topics from GitHub

We'll scrape the list of topics in three steps:

- Use `requests` to download the page
- Use Beautiful Soup (`bs4`) to parse the HTML and extract information
- Convert the results to a Pandas DataFrame
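
The parse-and-convert steps can be sketched on a small inline HTML snippet instead of a live page. The markup and the class names `topic-title` and `topic-desc` are made up for illustration and are not GitHub's real markup; the download step is replaced by a hard-coded string so the example runs offline.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Stand-in for response.text; this markup is hypothetical, not GitHub's.
html = """
<div><p class="topic-title">3D</p><p class="topic-desc"> 3D modeling. </p></div>
<div><p class="topic-title">Ajax</p><p class="topic-desc"> Async requests. </p></div>
"""

# Parse the HTML and pull out the text of the matching tags.
doc = BeautifulSoup(html, "html.parser")
titles = [p.text.strip() for p in doc.find_all("p", class_="topic-title")]
descriptions = [p.text.strip() for p in doc.find_all("p", class_="topic-desc")]

# Convert the parallel lists into a DataFrame, one row per topic.
topics_df = pd.DataFrame({"title": titles, "description": descriptions})
print(topics_df)
```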

Let's write a function to download the page.

In [47]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os

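def get_topic_page():
    # Download the page that lists the topics
    topics_url = "https://github.com/topics"
    response = requests.get(topics_url)
    # Fail early if the request did not succeed
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the HTML into a BeautifulSoup document
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc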

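`get_topic_page` downloads https://github.com/topics, checks that the response status is 200, and returns the page parsed as a `BeautifulSoup` document.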

In [48]:
doc = get_topic_page()

In [49]:
def get_topic_titles(doc):
    # Class string of the <p> tags that hold the topic titles
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    # Equivalent: doc.find_all('p', class_=selection_class)
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles


In [50]:
titles = get_topic_titles(doc)

In [51]:
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

`get_topic_titles` returns the list of topic titles found on the page.
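
One detail worth knowing when selecting by a string of several classes: Beautiful Soup compares such a string against the whole `class` attribute, so the order of the classes matters. The sketch below uses made-up markup and class names (`title`, `big`) to contrast this with a CSS selector, which ignores order.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same two classes, written in a different order.
html = """
<p class="title big">3D</p>
<p class="big title">Ajax</p>
"""
doc = BeautifulSoup(html, "html.parser")

# A multi-class string is compared against the whole class attribute,
# so only the tag whose classes appear in exactly this order matches.
exact = [p.text for p in doc.find_all("p", {"class": "title big"})]

# A CSS selector requires both classes but ignores their order.
by_css = [p.text for p in doc.select("p.title.big")]

print(exact)
print(by_css)
```

If the class order in the page's markup ever changes, an exact class string stops matching, while the CSS selector form keeps working.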

Similarly, we define functions for the descriptions and URLs.

In [52]:
def get_topic_desc(doc):
    # Class string of the <p> tags that hold the topic descriptions
    desc_selection = "f5 color-fg-muted mb-0 mt-1"
    topic_desc_tags = doc.find_all('p', {'class': desc_selection})
    topic_descriptions = []
    for desc in topic_desc_tags:
        # .strip() removes leading and trailing whitespace
        topic_descriptions.append(desc.text.strip())
    return topic_descriptions



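`get_topic_desc` returns the description of each topic with surrounding whitespace removed. `get_topic_urls`, defined next, builds the full URL of each topic page by prefixing the link's relative `href` with the base URL.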

In [53]:
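def get_topic_urls(doc):
    # <a> tags that link to the individual topic pages
    topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-grow-0'})
    topic_urls = []
    base_url = "https://github.com"
    for link in topic_link_tags:
        # The href is relative, so prefix it with the base URL
        topic_urls.append(base_url + link['href'])
    return topic_urls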
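
String concatenation works here because the `href` values are root-relative. As a hedged alternative, the standard library's `urljoin` handles both relative and already-absolute links; the two `href` values below are illustrative inputs.

```python
from urllib.parse import urljoin

base_url = "https://github.com"

# A root-relative href, as found in a link to a topic page.
relative_result = urljoin(base_url, "/topics/3d")
print(relative_result)

# An href that is already absolute is returned unchanged.
absolute_result = urljoin(base_url, "https://github.com/topics/ajax")
print(absolute_result)
```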


Let's put this all together into a single function.

In [54]:
def scrape_topics():
    topics_url = "https://github.com/topics"
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    # One column per field; the three lists must have equal length
    topic_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_desc(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topic_dict)
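
Building a DataFrame from a dictionary of lists only works when every list has the same length, which is why the three extraction functions must find the same number of tags. A small sketch with hand-written values shows both the normal case and the failure case.

```python
import pandas as pd

# Equal-length lists become columns; each index position becomes a row.
topics_df = pd.DataFrame({
    "title": ["3D", "Ajax"],
    "url": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
})
print(topics_df.shape)

# Lists of different lengths cannot be aligned into rows.
try:
    pd.DataFrame({"title": ["3D", "Ajax"], "url": ["https://github.com/topics/3d"]})
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)
```

If a selector starts missing some tags after a page redesign, this error is usually the first visible symptom.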


## Get the top 25 repositories from a topic page

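For each topic, we download the topic page, locate the `h3` tags that hold the repository owner and name, read the matching star counts, and collect the results in a DataFrame. We start by redefining `get_topic_page` so that it takes the URL of the page to download.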

In [55]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)

    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))

    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc


In [56]:
doc = get_topic_page('https://github.com/topics/3d')

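Each repository on a topic page is introduced by an `h3` tag containing two `a` tags: the first links to the owner (username) and the second to the repository itself. The star count is held in a separate `span` tag.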

In [62]:
def get_repo_info(h3_tag, star_tag):
    # Returns all the required info about a repository
    base_url = "https://github.com"
    a_tags = h3_tag.find_all('a')
    # The first link is the owner, the second is the repository
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    # parse_star_count is defined in a later cell (In [64])
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url
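
To make the extraction concrete, here is a sketch that applies the same steps to a single hand-written `h3` tag. The markup is a simplified assumption about what a repository entry looks like, not a copy of GitHub's HTML, and the star count is left out.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup for a single repository entry.
html = """
<h3>
  <a href="/mrdoob"> mrdoob </a> /
  <a href="/mrdoob/three.js"> three.js </a>
</h3>
"""
h3_tag = BeautifulSoup(html, "html.parser").find("h3")

# The first link holds the owner, the second the repository.
a_tags = h3_tag.find_all("a")
username = a_tags[0].text.strip()
repo_name = a_tags[1].text.strip()
repo_url = "https://github.com" + a_tags[1]["href"]

print(username, repo_name, repo_url)
```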

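`get_repo_info` returns a tuple of the username, repository name, star count, and repository URL for one repository. `get_topic_repos`, defined next, applies it to every repository on a topic page.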

In [58]:
def get_topic_repos(topic_doc):
    # Get the h3 tags containing the repo title, repo URL and username
    h3_selection_class = "f3 color-fg-muted text-normal lh-condensed"
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})

    # Get the star tags
    star_tags = topic_doc.find_all('span', {'class': "Counter js-social-count"})

    # Collect the repo info column by column
    topic_repo_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }
    for i in range(len(repo_tags)):
        # Pair each repo tag with the star tag at the same position
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repo_dict['username'].append(repo_info[0])
        topic_repo_dict['repo_name'].append(repo_info[1])
        topic_repo_dict['stars'].append(repo_info[2])
        topic_repo_dict['repo_url'].append(repo_info[3])

    return pd.DataFrame(topic_repo_dict)
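
The loop above pairs repository tags and star tags by index, so it raises an `IndexError` if fewer star tags than repository tags are found. As a hedged alternative, `zip` pairs items by position and stops at the shorter sequence. The sketch uses plain strings as stand-ins for parsed tags, and `extra-repo` is a made-up entry.

```python
# Plain strings stand in for the parsed tags; the values are illustrative.
repo_names = ["three.js", "libgdx", "extra-repo"]
star_counts = ["69.7k", "18.3k"]  # one fewer star tag than repos

# zip pairs items by position and stops at the shorter sequence.
paired = list(zip(repo_names, star_counts))
print(paired)

# Comparing lengths first makes a mismatch visible instead of silent.
lengths_match = len(repo_names) == len(star_counts)
print(lengths_match)
```

Because `zip` silently drops the unpaired item, checking the lengths first is the safer habit when the two lists come from separate selectors.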

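`get_topic_repos` returns a DataFrame with one row per repository and the columns `username`, `repo_name`, `stars`, and `repo_url`. `scrape_topic`, defined next, saves that DataFrame to a CSV file.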

In [59]:
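def scrape_topic(topic_url, path):
    # Skip topics that have already been scraped
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    # index=False leaves the DataFrame index out of the CSV
    topic_df.to_csv(path, index=False)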
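
The idea of skipping work when the output file already exists can be sketched in isolation. `write_once` is a hypothetical helper written for this example, and a temporary directory stands in for the `data` folder so that nothing is left on disk.

```python
import os
import tempfile

def write_once(path, text):
    """Write text to path unless the file already exists (hypothetical helper)."""
    if os.path.exists(path):
        return False  # skipped: nothing is overwritten
    with open(path, "w") as f:
        f.write(text)
    return True

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "3D.csv")
    first = write_once(csv_path, "first run\n")
    second = write_once(csv_path, "second run\n")
    with open(csv_path) as f:
        content = f.read()

print(first, second, repr(content))
```

The early `return` is what makes the skip effective: without it, the function would report the skip and then overwrite the file anyway.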

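`get_repo_info` relies on `parse_star_count` to convert the star count text into an integer, including counts written with a `k` suffix for thousands. It is defined below.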

In [64]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    # A trailing 'k' means the count is given in thousands
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)
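
As a usage sketch, the conversion can be checked on a few strings. The variant below is an assumption rather than the notebook's own function: it uses `decimal.Decimal` instead of `float`, because truncating a floating-point product with `int` can in principle land one below the intended value. The name `parse_star_count_exact` and the sample inputs are illustrative.

```python
from decimal import Decimal

def parse_star_count_exact(stars_str):
    """Convert strings like '69.7k' or '950' to an integer star count."""
    stars_str = stars_str.strip()
    if stars_str.endswith("k"):
        # Decimal arithmetic is exact for these short decimal strings.
        return int(Decimal(stars_str[:-1]) * 1000)
    return int(stars_str)

print(parse_star_count_exact("69.7k"))    # 69700
print(parse_star_count_exact(" 18.3k "))  # 18300
print(parse_star_count_exact("950"))      # 950
```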

## Putting it all together

- We have a function to get the list of topics
- We have a function to create a CSV file of the repos scraped from a topic page
- Let's create a function to put them together

In [60]:
def scrape_topics_repos():
    topic_df = scrape_topics()

    print('Scraping top topics from Github')

    # Create the output folder if it does not exist yet
    os.makedirs('data', exist_ok=True)
    for _, row in topic_df.iterrows():
        print(row['title'], row['url'])
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))
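
The CSV path is built directly from the topic title, so a title containing a path separator or other unusual characters would produce an awkward or invalid file name. A minimal sketch of one way to guard against that follows. `safe_filename` is a hypothetical helper, not part of the notebook, and the title `C/C++` is made up to show the effect.

```python
import re

def safe_filename(title):
    """Replace runs of characters outside letters, digits, '.', '_' and '-'."""
    return re.sub(r"[^A-Za-z0-9._-]+", "_", title).strip("_")

for title in ["ASP.NET", "Awesome Lists", "C/C++"]:
    print("data/{}.csv".format(safe_filename(title)))
```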

Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics.

In [65]:
scrape_topics_repos()

Scraping top topics from Github
3D https://github.com/topics/3d
Scraping top repositories for "3D"
Ajax https://github.com/topics/ajax
Scraping top repositories for "Ajax"
Algorithm https://github.com/topics/algorithm
Scraping top repositories for "Algorithm"
Amp https://github.com/topics/amphp
Scraping top repositories for "Amp"
Android https://github.com/topics/android
Scraping top repositories for "Android"
Angular https://github.com/topics/angular
Scraping top repositories for "Angular"
Ansible https://github.com/topics/ansible
Scraping top repositories for "Ansible"
API https://github.com/topics/api
Scraping top repositories for "API"
Arduino https://github.com/topics/arduino
Scraping top repositories for "Arduino"
ASP.NET https://github.com/topics/aspnet
Scraping top repositories for "ASP.NET"
Atom https://github.com/topics/atom
Scraping top repositories for "Atom"
Awesome Lists https://github.com/topics/awesome
Scraping top repositories for "Awesome Lists"
Amazon Web Services ht

We can check that the CSVs were created properly.

In [22]:
# Read and display a CSV using Pandas
pd.read_csv('data/3D.csv')
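
Reading a CSV back can also be tried without any files on disk. The sketch below loads the two sample rows from the format shown at the start of the notebook through an in-memory buffer, so the values are the illustrative ones from that sample rather than freshly scraped data.

```python
import io
import pandas as pd

# Sample rows copied from the CSV format shown at the top of the notebook.
csv_text = """Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
"""

# read_csv accepts a file path or any file-like object.
repos_df = pd.read_csv(io.StringIO(csv_text))

print(repos_df.shape)
print(repos_df["Stars"].max())
```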