# Scraping Top Repositories for Topics on GitHub

- Introduction about web scraping
- Introduction about GitHub and the problem statement
- Mention the tools you're using (Python, requests, Beautiful Soup, Pandas)


# Here are the steps we'll follow:

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top 30 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username and repo URL
- For each topic we'll create a CSV file in the following format:

```
username,repo_name,repo_url
mrdoob,three.js,https://github.com/mrdoob/three.js
libgdx,libgdx,https://github.com/libgdx/libgdx
```

In [1]:
!pip install requests
!pip install beautifulsoup4
from bs4 import BeautifulSoup
import pandas as pd
import requests
import os



# Scrape the list of topics from GitHub

We'll do it in three steps:

- use requests to download the page
- use BS4 to parse the page and extract information
- convert the result to a Pandas dataframe
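
The parse and convert steps above can be tried offline before touching the live site. The following is a minimal sketch, not the notebook's actual scraper: `SAMPLE_HTML`, its `title` and `desc` class names, and the `parse_sample` helper are all made up for illustration.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for a downloaded page, so the sketch runs without a network.
# The class names "title" and "desc" are invented; they are not GitHub's real classes.
SAMPLE_HTML = """
<div>
  <p class="title">3D</p>
  <p class="desc"> Sample description one </p>
</div>
<div>
  <p class="title">Ajax</p>
  <p class="desc"> Sample description two </p>
</div>
"""

def parse_sample(html):
    # Step 2: parse the HTML and pull out the text of the matching tags
    doc = BeautifulSoup(html, 'html.parser')
    titles = [tag.text.strip() for tag in doc.find_all('p', {'class': 'title'})]
    descs = [tag.text.strip() for tag in doc.find_all('p', {'class': 'desc'})]
    # Step 3: one column per extracted attribute
    return pd.DataFrame({'Title': titles, 'Description': descs})

sample_df = parse_sample(SAMPLE_HTML)
print(sample_df)
```

Working on a small inline snippet like this makes it easier to check the extraction logic separately from the download step.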

Let's write a function to download the page.

In [2]:
import requests
from bs4 import BeautifulSoup

def get_topics_page():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc
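
A download function like the one above can wait indefinitely if the server never answers, because `requests.get` has no default timeout. Here is a hedged sketch of a variant that sets a timeout and uses `raise_for_status()` instead of a manual status check. The `fetch_html` name and its `session` parameter are assumptions introduced for this example; they are not part of the notebook.

```python
import requests

def fetch_html(url, session=None, timeout=10):
    """Download `url` and return its HTML text, raising on HTTP errors."""
    # `session` is any object with a requests-style .get(); by default a real Session is used.
    session = session or requests.Session()
    response = session.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

Passing the session in as a parameter also makes the function easy to exercise with a fake object, without hitting the network.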

In [3]:
doc = get_topics_page()

Let's create some helper functions to parse information from the page.

`get_topic_titles` can be used to get the list of titles

To get topic titles, we can pick `p` tags with the `class` ...


In [4]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = [tag.text for tag in topic_title_tags]
    return topic_titles

In [5]:
titles = get_topic_titles(doc)

In [6]:
titles[:3]

['3D', 'Ajax', 'Algorithm']

In [7]:
len(titles)

30
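
Passing the full class string to `find_all`, as above, matches only tags whose `class` attribute is exactly that string, in that order. If the page's markup reorders or adds classes, the match silently returns nothing. The sketch below compares this with a CSS selector, which only needs the listed classes to be present. The `SAMPLE_HTML` snippet is invented for illustration, and picking `f3` and `lh-condensed` as the identifying classes is an assumption.

```python
from bs4 import BeautifulSoup

# Two hypothetical tags carrying the same classes in a different order
SAMPLE_HTML = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="mb-0 mt-1 f3 lh-condensed Link--primary">Ajax</p>
"""
sample_doc = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Exact string match on the class attribute: order-sensitive
exact = [tag.text for tag in sample_doc.find_all(
    'p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})]

# CSS selector: matches any <p> that has both classes, in any order
by_css = [tag.text for tag in sample_doc.select('p.f3.lh-condensed')]

print(exact)   # ['3D']
print(by_css)  # ['3D', 'Ajax']
```

The trade-off is that a looser selector may also pick up unrelated tags, so it is worth checking the number of matches, as done with `len(titles)` above.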

Similarly, `get_topic_desc` returns the list of topic descriptions.

In [8]:
def get_topic_desc(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_descriptions = [tag.text.strip() for tag in topic_desc_tags]
    return topic_descriptions

`get_topic_url` returns the list of topic page URLs.

In [9]:
def get_topic_url(doc):
    topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-grow-0'})
    # hrefs are assumed to be site-relative (starting with '/'), so the base has no trailing slash
    topic_urls = ['https://github.com' + tag['href'] for tag in topic_link_tags]
    return topic_urls
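
Building absolute URLs by string concatenation depends on getting the slashes right by hand. As an alternative, here is a small sketch using `urljoin` from the standard library. The `absolute_url` helper and the `BASE_URL` constant are hypothetical names introduced for this example.

```python
from urllib.parse import urljoin

BASE_URL = 'https://github.com'

def absolute_url(href, base=BASE_URL):
    # urljoin resolves relative hrefs against the base and leaves absolute URLs unchanged
    return urljoin(base, href)

print(absolute_url('/topics/3d'))                      # https://github.com/topics/3d
print(absolute_url('https://github.com/topics/ajax'))  # https://github.com/topics/ajax
```

The result is the same whether or not the base ends with a slash, which avoids URLs with a doubled `//` in the path.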

Let's put this all together into a single function

In [10]:
def scrape_topics():
    topic_url = 'https://github.com/topics'
    response = requests.get(topic_url)

    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))

    doc = BeautifulSoup(response.text, 'html.parser')

    topic_dict = {
        'Title': get_topic_titles(doc),
        'Description': get_topic_desc(doc),
        'Urls': get_topic_url(doc)
    }

    return pd.DataFrame(topic_dict)
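
Building a dataframe from separately scraped lists assumes that every list has the same length. If one selector misses a tag, pandas raises an error that does not say which column is short. The sketch below shows one possible guard; `build_topics_frame` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def build_topics_frame(titles, descriptions, urls):
    columns = {'Title': titles, 'Description': descriptions, 'Urls': urls}
    # Compare the column lengths up front so the error message names the mismatch
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return pd.DataFrame(columns)
```

It could be called as `build_topics_frame(get_topic_titles(doc), get_topic_desc(doc), get_topic_url(doc))` in place of constructing the dictionary inline.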


# Get the top repositories from a topic page

In [11]:
def get_info_repo(h3_tag):
    # Each h3 tag holds two links: the first is the user, the second is the repository
    base_url = 'https://github.com'
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    # hrefs are assumed to be site-relative (starting with '/'), so base_url has no trailing slash
    repo_url = base_url + a_tags[1]['href']
    return username, repo_name, repo_url

def get_topic_info(topic_url):
    # Download the page
    response = requests.get(topic_url)

    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))

    # Parse using BeautifulSoup
    topic_doc = BeautifulSoup(response.text, 'html.parser')

    # Get the h3 tags containing the username, repository name and URL
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})

    # Collect the details of each repository, one list per column
    topic_info_dict = {'username': [], 'repo_name': [], 'repo_url': []}

    for repo_tag in repo_tags:
        username, repo_name, repo_url = get_info_repo(repo_tag)
        topic_info_dict['username'].append(username)
        topic_info_dict['repo_name'].append(repo_name)
        topic_info_dict['repo_url'].append(repo_url)

    return pd.DataFrame(topic_info_dict)

In [12]:
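def scrape_topic(topic_url, path):
    # Skip topics that were already scraped in an earlier run
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_info(topic_url)
    # index=False leaves the dataframe's row index out of the CSV
    topic_df.to_csv(path, index=False)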

# Putting it all together
- We have a function to get the list of topics
- We have a function to create a CSV file for scraped repos from a topics page
- Create a function to put them together

In [13]:
def scrape_topics_repos():
    print('Scraping List of topics')
    topics_df = scrape_topics()
    os.makedirs('data', exist_ok=True)
    
    for index,row in topics_df.iterrows():
        print('Scraping top repository for "{}"'.format(row['Title']))
        scrape_topic(row['Urls'],'data/{}.csv'.format(row['Title']))
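
Using a topic title directly as a file name works for titles like `3D` or `Ajax`, but a title containing a path separator or other special characters could produce an invalid or unexpected path. One possible precaution is sketched below. `safe_filename` is a hypothetical helper, and the set of allowed characters is an assumption that can be adjusted.

```python
import re

def safe_filename(title):
    # Replace every run of characters outside letters, digits, '.', '_' and '-' with '_'
    cleaned = re.sub(r'[^A-Za-z0-9._-]+', '_', title).strip('_')
    # Fall back to a fixed name if nothing usable is left
    return cleaned or 'untitled'

print(safe_filename('Awesome Lists'))  # Awesome_Lists
print(safe_filename('ASP.NET'))        # ASP.NET
print(safe_filename('A/B testing'))    # A_B_testing
```

With this helper the path would become `'data/{}.csv'.format(safe_filename(row['Title']))`. Note that this changes the file names for titles that contain spaces.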

Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics.

In [14]:
scrape_topics_repos()

Scraping List of topics
Scraping top repository for "3D"
Scraping top repository for "Ajax"
Scraping top repository for "Algorithm"
Scraping top repository for "Amp"
Scraping top repository for "Android"
Scraping top repository for "Angular"
Scraping top repository for "Ansible"
Scraping top repository for "API"
Scraping top repository for "Arduino"
Scraping top repository for "ASP.NET"
Scraping top repository for "Atom"
Scraping top repository for "Awesome Lists"
Scraping top repository for "Amazon Web Services"
Scraping top repository for "Azure"
Scraping top repository for "Babel"
Scraping top repository for "Bash"
Scraping top repository for "Bitcoin"
Scraping top repository for "Bootstrap"
Scraping top repository for "Bot"
Scraping top repository for "C"
Scraping top repository for "Chrome"
Scraping top repository for "Chrome extension"
Scraping top repository for "Command line interface"
Scraping top repository for "Clojure"
Scraping top repository for "Code quality"
Scraping top