# Scraping the Top Repositories for topics on GitHub

(Introduction):
- **Introduction to web scraping**

Web scraping is the process of automatically extracting data from websites. It involves writing code to fetch web pages, parse their contents, and extract the desired information. Web scraping enables you to retrieve data from various online sources quickly and efficiently.

Here's a brief overview of how web scraping works:

1. Fetching web pages: Web scraping begins by sending HTTP requests to the target website's server to retrieve the HTML content of web pages. This can be done using various programming libraries or frameworks, such as Requests or Scrapy in Python.

2. Parsing HTML: Once the HTML content is obtained, the next step is to parse it. Parsing involves analyzing the structure and elements of the HTML document to extract specific data. This is typically done using HTML parsing libraries like BeautifulSoup or lxml in Python.

3. Navigating and locating elements: With the parsed HTML, you can navigate through the document's elements, such as tags, classes, or IDs, to locate the data you want to extract. You can use CSS selectors or XPath expressions to specify the elements of interest.

4. Extracting data: Once the desired elements are located, you can extract the relevant data from them. This might involve retrieving text, attributes, or even URLs of images or links. The extracted data can be stored in variables, written to files, or processed further as needed.

5. Handling pagination and dynamic content: Many websites have multiple pages or load content dynamically through JavaScript. Web scraping may involve handling pagination by iterating through pages, or reaching dynamic content by scrolling or by calling the APIs that the page itself uses.

6. Data cleaning and processing: Extracted data often requires cleaning, formatting, and transformation to make it usable. This step involves removing unwanted characters, converting data types, performing calculations, or applying any necessary data manipulations.

7. Storing or utilizing the data: The final step involves storing the extracted data in a structured format like CSV, JSON, or a database. Alternatively, you can directly utilize the scraped data for analysis, visualization, or integration with other applications.
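The parse, locate, extract, clean, and store steps above can be sketched on a small inline document. This is only an illustration: the HTML snippet and the `title` class are made up, and the standard library's `html.parser` is used here in place of BeautifulSoup so that the example runs without any extra installs.

```python
import json
from html.parser import HTMLParser

# Made-up HTML standing in for a downloaded page (step 1 would fetch this).
SAMPLE_HTML = """
<html><body>
  <p class="title"> 3D </p>
  <p class="other">ignored</p>
  <p class="title">Ajax</p>
</body></html>
"""


class TitleParser(HTMLParser):
    """Collects the text inside every <p class="title"> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # Steps 2-3: walk the parsed tags and locate the ones we care about.
        if tag == "p" and dict(attrs).get("class") == "title":
            self.in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_title = False

    def handle_data(self, data):
        # Step 4: extract the text of the located element.
        if self.in_title:
            self.titles[-1] += data


def extract_titles(html):
    parser = TitleParser()
    parser.feed(html)
    # Step 6: clean the raw text by trimming whitespace.
    return [title.strip() for title in parser.titles]


titles = extract_titles(SAMPLE_HTML)
# Step 7: serialize the result to a structured format (JSON here).
print(json.dumps(titles))
```

With BeautifulSoup, the locate and extract steps shrink to a single `find_all` call, which is the approach the rest of this notebook takes.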

It's important to note that while web scraping can be a powerful tool for data collection, it's essential to comply with website terms of service, respect the website's robots.txt file (which specifies scraping rules), and not overload the target server with excessive requests.

Additionally, some websites may have specific restrictions or employ measures like CAPTCHAs or IP blocking to prevent or deter web scraping. It's crucial to be aware of and respect these limitations while scraping data from websites.
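One way to check robots.txt rules before scraping is Python's built-in `urllib.robotparser`. The sketch below parses a hypothetical set of rules supplied as text, so it needs no network access. The rules, the `my-scraper` user agent, and the `example.com` URLs are assumptions for illustration, not GitHub's actual policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, inlined so the example needs no network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/topics"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/data"))
```

For a live site, `RobotFileParser.set_url()` followed by `read()` downloads the real file instead of using inline text.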




- **Introduction to GitHub**

GitHub is a web-based platform for hosting Git repositories. It lets developers collaborate on projects, track changes to their codebase, and manage code repositories, and it provides features for code sharing, issue tracking, documentation, and project management. Here's an introduction to some key aspects of GitHub:

1. Version Control: GitHub is built upon Git, a distributed version control system. Version control enables developers to track changes to their codebase over time, collaborate with others, and easily revert to previous versions if needed. Git's decentralized nature allows multiple developers to work on the same project concurrently, merging their changes seamlessly.

2. Code Hosting and Collaboration: GitHub provides a platform for hosting Git repositories in the cloud. Developers can create repositories to store their code and share them with others. GitHub offers features like pull requests, which allow developers to propose changes, discuss them, and merge them into the main codebase. Collaboration on GitHub can happen within organizations, teams, or public open-source projects.

3. Issue Tracking: GitHub includes an issue tracking system that helps manage tasks, bugs, and feature requests. Users can create issues, assign them to specific individuals or teams, add labels and milestones, and track the progress of each issue. This feature facilitates project management, communication, and coordination among team members.

4. Documentation and Wikis: GitHub allows developers to create and maintain project documentation using built-in wikis or markdown files. This makes it easy to provide instructions, guidelines, and explanations for the codebase, enhancing collaboration and knowledge sharing within the project.

5. Pull Requests and Code Reviews: Pull requests (PRs) are a fundamental feature of GitHub. They enable developers to propose changes to a repository and request that they be merged into the main codebase. Pull requests often include code diffs, comments, and discussions, allowing team members to review and provide feedback on the proposed changes before merging them.

6. Integrations and Automation: GitHub supports integrations with various development tools and services. It offers a marketplace of applications and integrations that extend its functionality. Developers can automate workflows, perform continuous integration and deployment, and connect GitHub with tools like CI/CD systems, code quality analyzers, and project management platforms.

7. Community and Open Source: GitHub has a vibrant community of developers, and it serves as a hub for open-source projects. Many projects on GitHub are openly available for anyone to contribute to, fostering collaboration, knowledge sharing, and the advancement of technology.

GitHub provides an intuitive web interface, and it also offers a command-line interface (CLI) and can be integrated into development environments through various client applications and plugins. It has become an essential platform for developers to showcase their work, collaborate with others, and contribute to the open-source ecosystem.


- **Problem statement**

The task is to scrape GitHub's topics page and, for each topic, collect its top repositories using the Requests and BeautifulSoup libraries in Python. The goal is to obtain each topic's title, description, and URL by parsing the HTML of https://github.com/topics, and then to parse each topic page to obtain the name, owner username, star count, and URL of its top repositories. The code should handle the scraping process responsibly by complying with the website's terms of service and avoiding excessive requests that could overload the server.


- **Tools**
(Python, Requests, BeautifulSoup, pandas)


**Note** - 
Web scraping should be done responsibly and in compliance with the website's terms of service. Make sure to add appropriate delays between requests to avoid overwhelming the server and potentially violating any usage limits or restrictions imposed by GitHub.
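A delay between requests can be enforced with a small helper like the one below. This is a minimal sketch, not part of the original notebook: `Throttle` and `fetch_all` are hypothetical names, and the fetch function is passed in as a parameter so the example runs without network access.

```python
import time


class Throttle:
    """Ensures at least `min_interval` seconds pass between calls to wait()."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last_call = None

    def wait(self):
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


def fetch_all(urls, fetch, throttle):
    """Call fetch(url) for each URL, pausing between requests."""
    results = []
    for url in urls:
        throttle.wait()
        results.append(fetch(url))
    return results


# Example with a stand-in fetch function (no network access needed).
pages = fetch_all(
    ["/topics/3d", "/topics/ajax"],
    lambda url: "<html>" + url + "</html>",
    Throttle(0.1),
)
print(len(pages))
```

In a real run, the stand-in could be replaced with something like `lambda url: requests.get(url).text`, with an interval of a second or more.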

## Here are the steps we are following 

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:

Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
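To make the target format concrete, the sketch below renders the two sample rows above with Python's built-in `csv` module. The `rows_to_csv` helper is hypothetical and writes to an in-memory buffer. The notebook itself writes files with pandas later on.

```python
import csv
import io

# Rows taken from the sample above: (repo name, username, stars, repo URL).
rows = [
    ("three.js", "mrdoob", 69700, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 18300, "https://github.com/libgdx/libgdx"),
]


def rows_to_csv(rows):
    """Render repository rows as CSV text with the header shown above."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Repo Name", "Username", "Stars", "Repo URL"])
    writer.writerows(rows)
    return buffer.getvalue()


print(rows_to_csv(rows))
```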

# Scrape the list of topics from GitHub

- Use requests to download the page
- Use BS4 to parse and extract information
- Convert to a pandas DataFrame

Write a function to download the page

In [54]:
import requests 
import pandas as pd
from bs4 import BeautifulSoup

def get_topics_page():
    # Download https://github.com/topics and return it as a parsed BeautifulSoup document
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

In [55]:
doc = get_topics_page()

In [56]:
type(doc)

bs4.BeautifulSoup

In [57]:
doc.find('a')

<a class="px-2 py-4 color-bg-accent-emphasis color-fg-on-emphasis show-on-focus js-skip-to-content" href="#start-of-content">Skip to content</a>

### Let's create some helper functions to parse information from the page.


In [58]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles




### `get_topic_titles` can be used to get the list of titles

To get topic titles, we can pick `p` tags with the `class` ...

![](https://imgur.com/a/WeKTfYu)

In [59]:
titles = get_topic_titles(doc)

In [60]:
len(titles)

30

In [61]:
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

### Similarly, we define functions for descriptions and URLs

In [62]:
def get_topic_descs(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_desc = []
    for tag in topic_desc_tags:
        topic_desc.append(tag.text.strip())
    return topic_desc



In [85]:
base_url = 'https://github.com'

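The `href` of each topic link is treated as a path relative to the site root, so it is joined with `base_url` to form a full URL.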

In [86]:
def get_topic_urls(doc):
    topic_links_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})
    topic_urls = []
    for tag in topic_links_tags:
        # Prepend the global base_url to the link's href
        topic_urls.append(base_url + tag['href'])
    return topic_urls



Let's put it all together into a single function

In [87]:
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    # One column per helper; each helper returns a list with one entry per topic
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)

In [95]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) *1000)
    return int(stars_str)
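`parse_star_count` multiplies a float by 1000 and truncates with `int`, which in principle is exposed to binary floating-point rounding. As a hedged alternative, the sketch below uses `Decimal` for the multiplication. The name `parse_count`, the comma handling, and the `m` suffix are assumptions that go beyond the original function, which only handles `k`.

```python
from decimal import Decimal

# Multipliers for abbreviated counts; the 'm' suffix is an assumption,
# the original function only handles 'k'.
SUFFIXES = {"k": 1_000, "m": 1_000_000}


def parse_count(text):
    """Convert strings such as '69.7k' or '1,234' to an integer."""
    text = text.strip().lower().replace(",", "")
    if text and text[-1] in SUFFIXES:
        # Decimal avoids binary floating-point rounding in the multiplication.
        return int(Decimal(text[:-1]) * SUFFIXES[text[-1]])
    return int(text)


print(parse_count("69.7k"))
print(parse_count("950"))
```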

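## Get the top 25 repositories from the topic page

For each topic page, we download the HTML, locate the `h3` tags that hold the username and repository links, read the matching star counters, and collect the results into a DataFrame.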

In [96]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

In [97]:
doc = get_topic_page('https://github.com/topics/3d')

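Each repository on the topic page is listed in an `h3` tag that contains two anchors: one for the username and one for the repository.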

In [98]:
def get_repo_info(h3_tag, star_tag):
    # Return all the required information about a repo
    a_tags = h3_tag.find_all('a')
    # The first anchor is the owner, the second is the repository
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, repo_url, stars

In [99]:
def get_topic_repos(topic_doc):
    # Get h3 tags containing repo title, repo URL and username
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class })
    # Get star tags 
    star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})
    
    topic_repos_dict = {
    'usernames': [],
    'repo_name': [],
    'stars': [],
    'repo_url': [] 
    }
    
    # Get Repository Information
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['usernames'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[3])
        topic_repos_dict['repo_url'].append(repo_info[2])
        
    return pd.DataFrame(topic_repos_dict)

In [100]:
import os

def scrape_topic(topic_url, path):
    # Skip topics whose CSV file has already been written
    if os.path.exists(path):
        print('The file {} already exists. Skipping..'.format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    # `path` already ends in '.csv' (see the caller), so write to it directly
    topic_df.to_csv(path, index=None)

### Putting it all together 

- We have a function to get the list of topics
- We have a function to create a CSV file for scraped repos from a topic page
- Let's create a function to put them together

In [101]:
import os 
import pandas as pd
def scrape_topics_repos():
    print('Scraping list of topics  ')
    topics_df = scrape_topics()
    
    # create folder here 
    os.makedirs('data', exist_ok = True)
    for index, row in topics_df.iterrows():
        print('Scraping top Repositories for "{}" to'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))
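Topic titles are used directly as file names above, and some titles in the output (for example "Chrome extension") contain spaces. A title containing a path separator would break the path entirely. A hypothetical `safe_filename` helper, not part of the original notebook, could normalize titles first:

```python
import re


def safe_filename(title):
    """Turn a topic title into a filesystem-friendly file name stem."""
    # Replace every run of characters other than letters, digits, '.', '_' or '-'
    # with a single underscore, then trim underscores from both ends.
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", title.strip())
    return cleaned.strip("_") or "untitled"


print(safe_filename("Chrome extension"))
print(safe_filename("ASP.NET"))
```

Different titles can collapse to the same stem under this scheme, so it is a starting point rather than a complete solution.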

Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics

scrape_topics_repos()

We can check that CSVs are created properly 
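One possible check is to list the CSV files in the output folder and count the data rows in each. The sketch below uses only the standard library, and `summarize_csvs` is a hypothetical helper. `pd.read_csv` on a single file would work equally well for inspecting the contents.

```python
import csv
import os


def summarize_csvs(folder):
    """Return {file name: number of data rows} for every CSV file in folder."""
    summary = {}
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".csv"):
            continue
        with open(os.path.join(folder, name), newline="", encoding="utf-8") as f:
            rows = list(csv.reader(f))
        # The first row is the header, so it is not counted as data.
        summary[name] = max(len(rows) - 1, 0)
    return summary


# Only report if the 'data' folder created by the scraper is present.
if os.path.isdir("data"):
    print(summarize_csvs("data"))
```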

In [102]:
# Run the scraper; progress is printed for each topic
scrape_topics_repos()

Scraping list of topics  
Scraping top Repositories for "3D" to
Scraping top Repositories for "Ajax" to
Scraping top Repositories for "Algorithm" to
Scraping top Repositories for "Amp" to
Scraping top Repositories for "Android" to
Scraping top Repositories for "Angular" to
Scraping top Repositories for "Ansible" to
Scraping top Repositories for "API" to
Scraping top Repositories for "Arduino" to
Scraping top Repositories for "ASP.NET" to
Scraping top Repositories for "Atom" to
Scraping top Repositories for "Awesome Lists" to
Scraping top Repositories for "Amazon Web Services" to
Scraping top Repositories for "Azure" to
Scraping top Repositories for "Babel" to
Scraping top Repositories for "Bash" to
Scraping top Repositories for "Bitcoin" to
Scraping top Repositories for "Bootstrap" to
Scraping top Repositories for "Bot" to
Scraping top Repositories for "C" to
Scraping top Repositories for "Chrome" to
Scraping top Repositories for "Chrome extension" to
Scraping top Repositories for "Com

## References and future Work

Summary: 

