# Scraping GitHub Topics and Repositories

## Project Outline:
<ul><li> Website used for scraping: <a href="https://github.com/topics">https://github.com/topics</a></li>
<li> This is the "main topic page", which lists the topic titles. For each topic, scrape the topic title, topic page URL, and topic description.</li>
<li> Navigate to each of the linked topic pages. Each one lists the top 20 repositories for that topic.</li>
<li> For each repository, scrape the username, repo name, stars, and repo URL.</li>
<li> For each topic, create a CSV file.</li>
</ul>


## Scrape the list of topics from the "Main Topic Page"

<ul> <li> Use Requests library to download the page</li>
<li> Use BeautifulSoup to parse and extract information </li>
<li> Convert to a Pandas DataFrame</li>
</ul>


In [1]:
! pip install requests --upgrade --quiet

In [2]:
! pip install beautifulsoup4 --upgrade --quiet

In [32]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

#### First, write a function to download the page
The function returns a BeautifulSoup document containing the parsed web page that lists the topics on GitHub.

In [6]:
def get_topics_page():
    topics_url='https://github.com/topics'
    # Download the page
    response=requests.get(topics_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc=BeautifulSoup(response.text,'html.parser')
    return doc


In [21]:
doc=get_topics_page()
doc.find('p')

<p class="f4 color-fg-muted col-md-6 mx-auto">Browse popular topics on GitHub.</p>
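
The download step can be hardened a little. The sketch below is a variant under stated assumptions, not the notebook's own function: `download_page`, its `session` parameter, and the User-Agent string are hypothetical names introduced here. It sets a timeout so a stalled connection cannot hang the notebook, and it lets `raise_for_status()` report any 4xx or 5xx response.

```python
import requests
from bs4 import BeautifulSoup

def download_page(url, session=None, timeout=10):
    """Download `url` and return it parsed as a BeautifulSoup document.

    Hypothetical helper: `session` may be any object with a requests-style
    `get(url, headers=..., timeout=...)` method, which keeps it easy to test.
    """
    session = session or requests.Session()
    # Illustrative User-Agent value (an assumption, not something GitHub requires)
    headers = {'User-Agent': 'topics-scraper-example'}
    response = session.get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx status codes
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')
```

With the default session, `download_page('https://github.com/topics')` would play the same role as `get_topics_page()`.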

#### Functions to parse information from the page.
To get the topic titles, we pick the `p` tags with the class `f3 lh-condensed mb-0 mt-1 Link--primary`.
<div>
<img src="https://i.imgur.com/ndpXl4e.png" width="800" height="1000"/>
</div>

In [24]:
# to get the list of titles
def get_topic_titles(doc):
    topic_title_tags=doc.find_all('p',{'class':'f3 lh-condensed mb-0 mt-1 Link--primary'})
    topic_titles=[]
    for tag in topic_title_tags[:6]:
        topic_titles.append(tag.text)
    return topic_titles

In [25]:
titles=get_topic_titles(doc)

In [27]:
titles

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular']
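
Matching on the full class string is brittle: it stops matching if GitHub adds, removes, or reorders a class. As a hedged alternative, BeautifulSoup's `select` accepts CSS selectors, which match tags that carry the listed classes in any order. The HTML snippet and the `get_topic_titles_css` name below are made up for illustration; the live page markup may differ.

```python
from bs4 import BeautifulSoup

# Made-up snippet imitating topic cards; not copied from the live page
SAMPLE_HTML = """
<div><p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p></div>
<div><p class="Link--primary f3 lh-condensed extra-class">Ajax</p></div>
<div><p class="f5 color-fg-muted mb-0 mt-1">A description, not a title</p></div>
"""

def get_topic_titles_css(doc):
    """Return topic titles from <p> tags that carry both listed classes."""
    # p.f3.Link--primary matches regardless of class order or extra classes
    return [tag.get_text(strip=True) for tag in doc.select('p.f3.Link--primary')]

sample_doc = BeautifulSoup(SAMPLE_HTML, 'html.parser')
print(get_topic_titles_css(sample_doc))
```

Selecting on two classes rather than all five is a judgment call: fewer classes tolerate more markup changes but risk matching unrelated tags.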

#### Similarly, define functions for topic_descriptions and URLs

In [29]:
# function to get topic_description
def get_topic_desc(doc):
    topic_desc_tags=doc.find_all('p',{'class':'f5 color-fg-muted mb-0 mt-1'})
    topic_desc=[]
    for tag in topic_desc_tags[:6]:
        topic_desc.append(tag.text.strip())
    return topic_desc

In [59]:
#function to get topic URLs
base_url="https://github.com"
def get_topic_urls(doc):
    topic_url_tags=doc.find_all('a',{'class':'no-underline flex-1 d-flex flex-column'})
    topic_urls=[]
    for tag in topic_url_tags[:6]:
        topic_urls.append(base_url + tag['href'])
    return topic_urls
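
String concatenation works here because the `href` values on this page are root-relative paths such as `/topics/3d`. If a link were ever absolute, `base_url + tag['href']` would produce a malformed URL. A minimal sketch using the standard library's `urljoin`, which handles both cases (the `to_absolute` name is hypothetical):

```python
from urllib.parse import urljoin

base_url = "https://github.com"

def to_absolute(href):
    """Resolve a link against base_url; absolute links pass through unchanged."""
    return urljoin(base_url, href)

print(to_absolute('/topics/3d'))
print(to_absolute('https://github.com/topics/ajax'))
```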

#### Create a main function to call all these functions

In [60]:
def scrape_topics():
    # Download the main topics page here instead of relying on a global `doc`
    doc=get_topics_page()
    topics_dict={
        'title':get_topic_titles(doc),
        'description':get_topic_desc(doc),
        'url':get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)

In [61]:
topics_df=scrape_topics()
topics_df

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
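
`scrape_topics` builds the DataFrame from three lists collected independently, so the columns line up only if every topic yields exactly one title, one description, and one URL. `pd.DataFrame` raises a `ValueError` when the lists differ in length, but the message does not say which list came up short. A small sketch, assuming a hypothetical helper named `build_topic_records`, makes that check explicit before the DataFrame is built (the descriptions are placeholders):

```python
def build_topic_records(titles, descriptions, urls):
    """Combine three parallel lists into one list of per-topic dicts."""
    lengths = {'title': len(titles), 'description': len(descriptions), 'url': len(urls)}
    if len(set(lengths.values())) != 1:
        # Report each list's length so a selector that stopped matching is easy to spot
        raise ValueError('Mismatched topic fields: {}'.format(lengths))
    return [
        {'title': t, 'description': d, 'url': u}
        for t, d, u in zip(titles, descriptions, urls)
    ]

records = build_topic_records(
    ['3D', 'Ajax'],
    ['placeholder description 1', 'placeholder description 2'],
    ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
)
print(records[0]['url'])
```

The resulting list of dicts can be passed directly to `pd.DataFrame(records)`.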


## Navigate to each topic page and get the top 20 repositories 

In [62]:
def get_topic_page(topic_url):
    #download the page
    response=requests.get(topic_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    #parse using BeautifulSoup
    topic_doc=BeautifulSoup(response.text,'html.parser')
    return topic_doc
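
Because one request is made per topic, a single transient failure would otherwise abort the whole run. The following is a rough sketch of retrying with a pause between attempts. `fetch_with_retry` and its parameters are hypothetical, and `fetch` stands for any callable that raises on failure, such as `get_topic_page`.

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url), trying up to `retries` times with a pause in between."""
    if retries < 1:
        raise ValueError('retries must be at least 1')
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose: get_topic_page raises Exception
            last_error = error
            if attempt < retries:
                # Wait longer after each failed attempt (delay, 2*delay, ...)
                sleep(delay * attempt)
    # Every attempt failed: surface the most recent error
    raise last_error
```

A call such as `fetch_with_retry(get_topic_page, topic_url)` would then stand in for the direct call. The growing pause also keeps the scraper from hammering the site when it is already struggling.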
    

#### Get repository information

To get the repo info, we pick the `h3` tags with the class `f3 color-fg-muted text-normal lh-condensed`.
<div>
<img src="https://i.imgur.com/HGfFf6o.png" width="800" height="1000"/>
</div>

In [63]:
# The star count is scraped as a string, possibly with a 'k' suffix for thousands. Convert it to an int.
def conv_star_count (stars_str):
    stars_str=stars_str.strip()
    if(stars_str[-1]=='k'):
        return int(float(stars_str[:-1])*1000)
    return int(stars_str)
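
`conv_star_count` covers the `k` suffix. Two cases it does not handle are counts with thousands separators and an `m` suffix for millions; whether GitHub renders either on topic pages is an assumption here. Multiplying a `float` and truncating with `int()` can also, for some inputs, land one below the intended value, which exact decimal arithmetic avoids. A hedged sketch (the `parse_star_count` name is hypothetical):

```python
from decimal import Decimal

# Suffix multipliers; 'm' is an assumption, the original function only handles 'k'
_MULTIPLIERS = {'k': 1_000, 'm': 1_000_000}

def parse_star_count(stars_str):
    """Convert strings such as '86.9k' or '1,234' to an int."""
    text = stars_str.strip().lower().replace(',', '')
    if not text:
        raise ValueError('empty star count')
    suffix = text[-1]
    if suffix in _MULTIPLIERS:
        # Decimal represents '86.9' exactly, so the product has no rounding error
        return int(Decimal(text[:-1]) * _MULTIPLIERS[suffix])
    return int(text)

print(parse_star_count('86.9k'))
print(parse_star_count('1,234'))
```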

In [64]:
def get_repo_info(repo_tag,star_tag):
    # returns all the required info about the repository
    a_tags=repo_tag.find_all('a')
    username=a_tags[0].text.strip()
    repo_name=a_tags[1].text.strip()
    repo_url= base_url + a_tags[1]['href']
    stars = conv_star_count(star_tag.text.strip())
    return username, repo_name, stars,repo_url

In [65]:
def get_topic_repos(topic_doc):
    # Get the h3 tags containing repo title, repo URL and username
    repo_tags = topic_doc.find_all('h3',{'class':'f3 color-fg-muted text-normal lh-condensed'})

    # Get star tags
    star_tags=topic_doc.find_all('span',{'id':'repo-stars-counter-star'})

    topic_repos_dict = { 'username': [], 'repo_name': [], 'stars': [],'repo_url': []}

    # Get repo info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)

In [66]:
import os
def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=False)

#### Main function which calls other functions

In [67]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('data', exist_ok=True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))

In [68]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
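
The CSV path is built directly from the topic title. That works for the six titles above, but a title containing a path separator or another character that is invalid in file names could make `to_csv` fail or write to an unexpected location. One possible approach, sketched with a hypothetical `safe_filename` helper and made-up example titles, is to replace such characters before building the path:

```python
import re

def safe_filename(title):
    """Return a file-system-friendly version of a topic title."""
    # Keep letters, digits, and . _ + # - ; collapse every other run into '_'
    cleaned = re.sub(r'[^A-Za-z0-9._+#-]+', '_', title.strip())
    cleaned = cleaned.strip('_')
    # Fall back to a fixed name if nothing usable is left
    return cleaned or 'topic'

for title in ['3D', 'C++', 'CI/CD', 'Amazon Web Services']:
    print('data/{}.csv'.format(safe_filename(title)))
```

In `scrape_topics_repos`, the path could then be built as `'data/{}.csv'.format(safe_filename(row['title']))`.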


In [70]:
# Read and display the CSV file for the topic "3D" using pandas
topic_3d=pd.read_csv('data/3D.csv')
topic_3d

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 86900 | https://github.com/mrdoob/three.js |
| 1 | libgdx | libgdx | 20800 | https://github.com/libgdx/libgdx |
| 2 | pmndrs | react-three-fiber | 20300 | https://github.com/pmndrs/react-three-fiber |
| 3 | BabylonJS | Babylon.js | 18800 | https://github.com/BabylonJS/Babylon.js |
| 4 | ssloy | tinyrenderer | 15200 | https://github.com/ssloy/tinyrenderer |
| 5 | aframevr | aframe | 14700 | https://github.com/aframevr/aframe |
| 6 | lettier | 3d-game-shaders-for-beginners | 14000 | https://github.com/lettier/3d-game-shaders-for... |
| 7 | FreeCAD | FreeCAD | 12600 | https://github.com/FreeCAD/FreeCAD |
| 8 | CesiumGS | cesium | 9500 | https://github.com/CesiumGS/cesium |
| 9 | metafizzy | zdog | 9500 | https://github.com/metafizzy/zdog |
