# Scraping Top Repositories for Topics on GitHub

- Web scraping is a technique for extracting large amounts of data from web pages so that the extracted data can be used for further analysis.
- In this project I will scrape GitHub, a platform where people work together on projects from anywhere. I will scrape its list of topics and, for each topic, the most popular repositories together with their rating.
- Tools used in this project: Python, requests, Beautiful Soup, and pandas.



Here are the steps we'll follow:
- We're going to scrape https://github.com/topics
- We'll get a list of modules (the name the code uses for GitHub topics). For each module, we'll get the module page URL and the module description
- For each module, we'll get the top 25 repositories from the module page
- For each repository, we'll grab the repo name, username, rating and repo URL
- For each module, we'll create a CSV file in the following format:
``` 
username,repo_name,rating,repo_url
ljianshu,Blog,7600,https://github.com/ljianshu/Blog
metafizzy,infinite-scroll,7300,https://github.com/metafizzy/infinite-scroll
```

# Scrape the list of Modules from GitHub

- Use requests to download the page.
- Use Beautiful Soup (bs4) to parse the page and extract information.
- Convert the result to a pandas DataFrame.

Let's write a function to download the page.

In [1]:
import pandas as pd
import requests
import os
from bs4 import BeautifulSoup
def get_Module_page():
    Modules_url = 'https://github.com/topics'
    response = requests.get(Modules_url)
    if response.status_code != 200:
        # Modules_url is the variable defined above; topic_url did not exist
        raise Exception('Failed to load page {}'.format(Modules_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    
    return doc

In [2]:
doc = get_Module_page()

The returned `doc` is a BeautifulSoup object that can be searched for the specific tags and attributes you are looking for.
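
As a rough illustration of what searching a parsed document looks like, the sketch below parses a small hand-written HTML string instead of the live page. The markup and the names `sample_html` and `sample_doc` are made up for this example and are not GitHub's real HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page, so no network access is needed.
sample_html = """
<html>
  <head><title>Topics</title></head>
  <body>
    <p class="intro">Browse popular topics.</p>
    <a href="/topics/3d">3D</a>
    <a href="/topics/ajax">Ajax</a>
  </body>
</html>
"""

sample_doc = BeautifulSoup(sample_html, 'html.parser')

# Text inside the <title> tag
page_title = sample_doc.title.text
# find() returns only the first matching tag
first_paragraph = sample_doc.find('p').text
# find_all() returns every matching tag; attributes are read like dict keys
link_targets = [a['href'] for a in sample_doc.find_all('a')]

print(page_title, first_paragraph, link_targets)
```

The same `find` and `find_all` calls are what the helper functions in the following cells rely on.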

Let's create some helper functions to parse information from the page

To get topic titles, we can pick `p` tags with the `class` ...
![](https://i.imgur.com/CI7gVXU.png)

In [3]:
def get_Module_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    Module_title_tags = doc.find_all('p',{'class' : selection_class})
    Module_titles = []
    for tag in Module_title_tags:
        Module_titles.append(tag.text)
    return Module_titles

`get_Module_titles` can be used to get the list of titles.

In [4]:
titles = get_Module_titles(doc)

In [5]:
len(titles)

30

In [6]:
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

Similarly, we define functions for the descriptions and URLs.

In [7]:
def get_Module_descs(doc) :
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    Module_desc_tags = doc.find_all('p', {'class': desc_selector})
    Module_descs = []
    for tag in Module_desc_tags:
        Module_descs.append(tag.text.strip())
    return Module_descs

- Along with the modules, we also scrape their descriptions.
- To do this, we select a tag and class combination that identifies only the descriptions, so that no unwanted data is picked up.
- When choosing a tag or class from the HTML source, pick an identifier that is as specific as possible so that it does not match unrelated elements.
![](https://i.imgur.com/lduZeCY.png)



- For example, `p` tags are already used for the titles, but with a different class, so the descriptions are selected by their own class in the HTML source.
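
To make the point about distinct classes concrete, here is a minimal sketch run against hand-written markup. The HTML is a hypothetical simplification that only reuses the class strings from the functions above; it is not copied from GitHub.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two cards, each with a title <p> and a description <p>.
SAMPLE_HTML = """
<div>
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-fg-muted mb-0 mt-1">  Modelling in three dimensions.  </p>
</div>
<div>
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
  <p class="f5 color-fg-muted mb-0 mt-1">  Asynchronous web pages.  </p>
</div>
"""

sample_doc = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Both selections target <p> tags; only the class string tells them apart.
title_tags = sample_doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})
desc_tags = sample_doc.find_all('p', {'class': 'f5 color-fg-muted mb-0 mt-1'})

sample_titles = [tag.text for tag in title_tags]
sample_descs = [tag.text.strip() for tag in desc_tags]

print(sample_titles)
print(sample_descs)
```

Note that a multi-word class string like this is matched against the exact value of the `class` attribute, so the selection stops matching if the site adds, removes, or reorders a class.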

In [8]:
def get_Module_urls(doc):
    Module_link_tags = doc.find_all('a',{'class' : 'no-underline flex-1 d-flex flex-column'})
    Module_urls = []
    base_url = 'https://github.com'
    for tag in Module_link_tags:
        Module_urls.append(base_url + tag['href'])
    return Module_urls

- As before, a distinct `a` tag with its own `class` is used to find the links. Each `href` is prefixed with the base GitHub URL and appended to a list.

![](https://i.imgur.com/vk9iHwa.png)
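
String concatenation works here because the scraped `href` values are site-relative paths. As an alternative sketch, the standard library's `urllib.parse.urljoin` handles both relative and already-absolute links; the `hrefs` list below is made-up sample input.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'

# Hypothetical href values: one relative path, one already-absolute URL.
hrefs = ['/topics/3d', 'https://github.com/topics/ajax']

# urljoin resolves a relative path against the base and leaves absolute URLs unchanged.
full_urls = [urljoin(base_url, href) for href in hrefs]

print(full_urls)
```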

Let's put it all together into a single function

In [9]:
def scrape_Modules():
    Modules_url = 'https://github.com/topics'
    response = requests.get(Modules_url)
    if response.status_code != 200:
        # Modules_url is the variable defined above; topic_url did not exist
        raise Exception('Failed to load page {}'.format(Modules_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    Module_dict = {
        'title': get_Module_titles(doc),
        'descriptions' : get_Module_descs(doc),
        'url' : get_Module_urls(doc)
    }
    return pd.DataFrame(Module_dict)
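
Building a DataFrame from a dict of lists requires all lists to have the same length, so if one selector matches fewer tags than the others, pandas raises a `ValueError`. The sketch below shows one possible explicit check using plain Python. `build_rows` is a hypothetical helper, not part of the notebook.

```python
def build_rows(titles, descriptions, urls):
    """Pair up the three scraped lists, failing early if their lengths differ."""
    lengths = {
        'title': len(titles),
        'descriptions': len(descriptions),
        'url': len(urls),
    }
    if len(set(lengths.values())) != 1:
        # A mismatch usually means one selector no longer matches the page.
        raise ValueError('Scraped lists differ in length: {}'.format(lengths))
    return [
        {'title': t, 'descriptions': d, 'url': u}
        for t, d, u in zip(titles, descriptions, urls)
    ]


rows = build_rows(['3D'], ['Modelling in three dimensions.'], ['https://github.com/topics/3d'])
print(rows)
```

A list of dicts like `rows` can also be passed directly to `pd.DataFrame`.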

# Get the Top 25 Repositories from a Topic Page



In [10]:
def get_Module_page(Module_url):
    # Download the page
    response = requests.get(Module_url)
    # check successful response
    if response.status_code != 200:
        raise Exception('Failed to Load Page {}'.format(Module_url))
    # Parse using BeautifulSoup
    Module_doc = BeautifulSoup(response.text, 'html.parser')
    return Module_doc

- Here we take the module URL, request the page, and check the response status code to confirm that the page loaded successfully.
- Next, we use Beautiful Soup to parse the HTML into a document object that can be searched in the scraping steps that follow.

In [11]:
doc = get_Module_page('https://github.com/topics/3d')

In [12]:
def parse_rate_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k' :
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)

def get_repo_info(h3_tag, rate_tag):
    # returns all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    base_url = 'https://github.com'
    repo_url = base_url + a_tags[1]['href']
    rating = parse_rate_count(rate_tag.text.strip())
    return username, repo_name, rating, repo_url
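
As a worked example of the count parsing, a string such as `'7.6k'` becomes `7.6 * 1000 = 7600`, while a plain string such as `'953'` is converted directly. The sketch below is a hypothetical variant, `parse_count`, that rounds instead of truncating (to guard against floating-point results that land just below a whole number) and also accepts an upper-case `K` and thousands separators. Both of those inputs are assumptions, not something the scraped page is known to produce.

```python
def parse_count(text):
    """Hypothetical variant of parse_rate_count, for illustration only."""
    # Assumption: counts may contain thousands separators such as '1,234'.
    text = text.strip().replace(',', '')
    if text.lower().endswith('k'):
        # round() avoids truncating a product that lands just below a whole number
        return round(float(text[:-1]) * 1000)
    return int(text)


print(parse_count('7.6k'))
print(parse_count(' 953 '))
```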


- Here we use the `h3` tag as the identifier for each repository's information, such as the username, repository name and repository URL.
![](https://i.imgur.com/xiEmRjT.png)

- The `h3` tag contains both the username and the repository that belongs to it, which is why we use it for scraping. The rating is read from a separate tag, passed in as `rate_tag`.

In [13]:
def get_Module_repos(Module_doc):
    # Get the h3 tags containing  repo title, repo URL and username
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = Module_doc.find_all('h3', {'class' : h3_selection_class})
    # Get rating tags
    rate_tags = Module_doc.find_all('span', {'class' : 'Counter js-social-count'})
    
    Module_repos_dict = {
        'username' : [],
        'repo_name' : [],
        'rating' : [],
        'repo_url' : []
    }
    # Get repo info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], rate_tags[i])
        Module_repos_dict['username'].append(repo_info[0])
        Module_repos_dict['repo_name'].append(repo_info[1])
        Module_repos_dict['rating'].append(repo_info[2])
        Module_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(Module_repos_dict)

- After scraping the data, we create a CSV file in which the generated data is stored in a structured, tabular format.

![](https://i.imgur.com/PYp5i8Z.png)

In [14]:
def scrape_Module(Module_url, path) :
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    Module_df = get_Module_repos(get_Module_page(Module_url))
    Module_df.to_csv(path,index=None)

- Here we save the scraped data to a CSV file at the given path on disk. If the file already exists, the module is skipped.

![](https://i.imgur.com/vts9rMa.png)


- Here `index` is set to `None` so that pandas does not write the DataFrame's row index as an extra column in the CSV file.
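
To show what the row index would look like if it were kept, here is a small sketch that compares the CSV text with and without it. It uses the more common spelling `index=False` and the two sample rows from the format shown at the top; `sample_df` is a made-up name.

```python
import pandas as pd

sample_df = pd.DataFrame({
    'username': ['ljianshu', 'metafizzy'],
    'repo_name': ['Blog', 'infinite-scroll'],
})

# With no path argument, to_csv returns the CSV text instead of writing a file.
with_index = sample_df.to_csv().splitlines()
without_index = sample_df.to_csv(index=False).splitlines()

print(with_index[:2])     # header starts with an empty column for the index
print(without_index[:2])  # header contains only the real columns
```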

# Putting it all together
- We have a function to get the list of topics
- We have a function to create a CSV file of the scraped repos from a topic page
- Let's create a function to put them together.

In [15]:
def scrape_Module_repos():
    print('Scraping list of Modules from Github')
    Module_df = scrape_Modules()
    
    # Create folder here
    os.makedirs('data', exist_ok=True)
    
    for index, row in Module_df.iterrows():
        print('Scraping Top Repositories for "{}"'.format(row['title']))
        scrape_Module(row['url'], 'data/{}.csv'.format(row['title']))
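
The file name above is built directly from the topic title. That works for the titles shown in this notebook, but a title containing a path separator would break the path. One possible guard, sketched under that assumption, is a hypothetical `safe_filename` helper that replaces characters which are unsafe in file names on common systems.

```python
import re


def safe_filename(title):
    """Hypothetical helper: replace characters that are unsafe in file names."""
    # Path separators and characters reserved on Windows become underscores.
    return re.sub(r'[\\/:*?"<>|]+', '_', title.strip())


print(safe_filename('Chrome extension'))
print(safe_filename('A/B testing'))
```

It could be applied as `'data/{}.csv'.format(safe_filename(row['title']))`.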

Let's run it to scrape the top repositories for all the topics on the first page of https://github.com/topics

In [16]:
scrape_Module_repos()

Scraping list of Modules from Github
Scraping Top Repositories for "3D"
Scraping Top Repositories for "Ajax"
Scraping Top Repositories for "Algorithm"
Scraping Top Repositories for "Amp"
Scraping Top Repositories for "Android"
Scraping Top Repositories for "Angular"
Scraping Top Repositories for "Ansible"
Scraping Top Repositories for "API"
Scraping Top Repositories for "Arduino"
Scraping Top Repositories for "ASP.NET"
Scraping Top Repositories for "Atom"
Scraping Top Repositories for "Awesome Lists"
Scraping Top Repositories for "Amazon Web Services"
Scraping Top Repositories for "Azure"
Scraping Top Repositories for "Babel"
Scraping Top Repositories for "Bash"
Scraping Top Repositories for "Bitcoin"
Scraping Top Repositories for "Bootstrap"
Scraping Top Repositories for "Bot"
Scraping Top Repositories for "C"
Scraping Top Repositories for "Chrome"
Scraping Top Repositories for "Chrome extension"
Scraping Top Repositories for "Command line interface"
Scraping Top Repositories for "Clo

We can check that the CSVs were created properly
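
One way to do that check, as a minimal sketch using only the standard library, is to count the data rows in every CSV file in the output folder. `count_csv_rows` is a hypothetical helper, and the folder name `data` comes from `scrape_Module_repos` above.

```python
import csv
import os


def count_csv_rows(folder):
    """Return {file name: number of data rows} for every CSV file in folder."""
    counts = {}
    for name in sorted(os.listdir(folder)):
        if not name.endswith('.csv'):
            continue
        with open(os.path.join(folder, name), newline='', encoding='utf-8') as f:
            rows = list(csv.reader(f))
        # The first row is the header, so it is not counted as data.
        counts[name] = max(len(rows) - 1, 0)
    return counts


if os.path.isdir('data'):
    print(count_csv_rows('data'))
```

A file with a count of zero would suggest that the selectors matched nothing on that topic page.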

