<a href="https://colab.research.google.com/github/nikhilsable17/RepoRadar/blob/main/gitScraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **RepoRadar:** Unveiling GitHub's Top Repositories

# Motivation behind working on this project: -

In light of the ever-expanding GitHub repository ecosystem and the inherent challenge of navigating and extracting valuable insights from this vast code repository platform, I found a compelling need to develop a systematic web scraping solution. Recognizing the lack of readily available tools to comprehensively capture and organize information about top repositories under different topics, I decided to embark on a project that would leverage Python libraries like Beautiful Soup, Pandas, and requests. The goal is to create an efficient and accessible means of exploring GitHub's diverse coding landscape, allowing users to uncover hidden gems and trending projects within specific topics. This decision was motivated by the desire to empower developers and researchers with a user-friendly tool for informed decision-making and exploration in the GitHub ecosystem.

# Steps I followed here: -
* We'll be scraping https://github.com/topics

* We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description

* For each topic, we'll get the top repositories in the topic from the topic page

* For each repository, we'll grab the repo name, username, stars and repo URL

* For each topic we'll create a .csv file in the following format:

  Repo Name,Username,Stars,Repo URL

  three.js,mrdoob,69700,https://github.com/mrdoob/three.js

  libgdx,libgdx,18300,https://github.com/libgdx/libgdx

In [None]:
pip install requests --upgrade --quiet

In [None]:
pip install beautifulsoup4 --upgrade --quiet

In [None]:
pip install pandas --quiet

# Some of my Helping Hands!

- **Requests: -** Utilize the requests library to effortlessly fetch data from the web, which makes our project a master communicator with online sources.

- **BeautifulSoup: -** Let beautifulsoup weave its magic as it parses HTML and XML documents, simplifying the extraction of valuable data from the web.

- **Regex (re): -** Employ regular expressions (re) to pinpoint and extract precisely the information we need, adding a touch of surgical precision to our data extraction.

- **Pandas: -** Embrace the data manipulation capabilities of pandas to organize, clean, and analyze the extracted data seamlessly. Transform messy data into structured datasets with ease.

- **os: -** Use the os library to manage the project's file system: it creates the output folder and checks whether a CSV file already exists before we write it.

In [None]:
import requests
from bs4 import BeautifulSoup as bs
import re
import pandas as pd
import os

# Scraping all the Topic data here: -
- Topic Titles: -
  * The code finds all the parent tags that hold a topic title in the parsed HTML page and stores them in a variable.
  
  * Iterating over that variable, we append each title to an initially empty list.

- Topic Descriptions & Topic URLs: -
  * We follow the same process for the remaining functions in this cell.

  * We use .strip() to remove any whitespace (or other specified characters) from the start and end of each string.
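
As a minimal sketch of the steps above, the snippet below runs the same `find_all` and `.strip()` pattern on a small hand-written HTML fragment. The fragment is an assumption: it only imitates the shape of one entry on the topics page using the class names from this notebook, and GitHub's real markup may differ or change over time.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like one entry on the topics page (not real GitHub output)
html = """
<div class="py-4 border-bottom d-flex flex-justify-between">
  <a href="/topics/3d">
    <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
    <p class="f5 color-fg-muted mb-0 mt-1">
      3D modeling uses software to create shapes.
    </p>
  </a>
</div>
"""
doc = BeautifulSoup(html, "html.parser")

# Passing the full class string matches tags whose class attribute is exactly that string
title_tags = doc.find_all("p", {"class": "f3 lh-condensed mb-0 mt-1 Link--primary"})
desc_tags = doc.find_all("p", {"class": "f5 color-fg-muted mb-0 mt-1"})
boxes = doc.find_all("div", {"class": "py-4 border-bottom d-flex flex-justify-between"})

titles = [tag.text.strip() for tag in title_tags]
descriptions = [tag.text.strip() for tag in desc_tags]  # strip() drops the surrounding newlines
urls = ["https://github.com" + box.a["href"] for box in boxes]

print(titles, descriptions, urls)
```

Each list has one element per matching tag, so the three lists line up by position as long as every topic entry contains all three pieces.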


In [None]:
# Function to scrape all the topic titles
def get_topic_titles(doc):
    try:
        # attrs must be a dict ({'class': ...}); each title sits in a <p> with this class string
        t_titles = doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})
        topic_titles = []
        for i in t_titles:
            topic_titles.append(i.text.strip())
        return topic_titles
    except Exception as e:
        print(e)

# Function to scrape all the topic descriptions
def get_topic_desc(doc):
    try:
        t_titles_desc = doc.find_all('p', {'class': 'f5 color-fg-muted mb-0 mt-1'})
        topic_title_desc_1 = []
        for i in t_titles_desc:
            topic_title_desc_1.append(i.text.strip())
        # Return only after the loop, so every description is collected
        return topic_title_desc_1
    except Exception as e:
        print(e)

# Function to scrape all the topic URLs
def get_topic_urls(doc):
    try:
        big_box_topics = doc.find_all('div', {'class': 'py-4 border-bottom d-flex flex-justify-between'})
        topic_urls = []
        for i in big_box_topics:
            # The first <a> tag in each box holds the relative link to the topic page
            topic_urls.append("https://github.com" + i.a['href'])
        return topic_urls
    except Exception as e:
        print(e)

# Function to scrape the topics page into a DataFrame of titles, descriptions and URLs
def scrape_topics():
    try:
        topics_url = 'https://github.com/topics'
        topic_page = requests.get(topics_url)
        if topic_page.status_code != 200:
            raise Exception('Error while Loading Topic Page: - {} (status code {})'.format(topics_url, topic_page.status_code))
        doc = bs(topic_page.text, 'html.parser')
        topics_dict = {'Titles': get_topic_titles(doc),
                    'Description' : get_topic_desc(doc),
                    'Urls': get_topic_urls(doc)}
        return pd.DataFrame(topics_dict)
    except Exception as e:
        print(e)


# Scraping the Repo Data: -

- Here we first call the get_topic_repos function, which stores all the parent tags (one per repository) in a variable and creates a dictionary to hold the scraped records.

- We then iterate over the parent tags; for each one, the function calls get_repo_data, which scrapes the required fields one by one.

- get_repo_data returns the scraped values, which are appended to the dictionary, and the dictionary is then used to create a DataFrame.
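
The dictionary-to-DataFrame step described above can be sketched as follows. The records are hard-coded sample values taken from the CSV example near the top of this notebook, and the column names are illustrative rather than the exact keys used in the code cell.

```python
import pandas as pd

# Sample records, one tuple per repository: (username, repo name, repo URL, stars)
scraped = [
    ("mrdoob", "three.js", "https://github.com/mrdoob/three.js", 69700),
    ("libgdx", "libgdx", "https://github.com/libgdx/libgdx", 18300),
]

# One list per column; every list must end up with the same length
repo_dict = {"Username": [], "Repo Name": [], "Stars": [], "Repo URL": []}

for username, repo_name, repo_url, stars in scraped:
    # Unpacking by name keeps each value in its matching column
    repo_dict["Username"].append(username)
    repo_dict["Repo Name"].append(repo_name)
    repo_dict["Stars"].append(stars)
    repo_dict["Repo URL"].append(repo_url)

repo_df = pd.DataFrame(repo_dict)
print(repo_df)
```

Unpacking the tuple into named variables, rather than indexing it with `[2]` and `[3]`, makes it harder to put the stars into the URL column by accident.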

In [None]:

# Function for scraping data from one repository card for a particular topic
def get_repo_data(repo, i):
    # repo is the list of <article> tags; i is the index of the card to read
    try:
        username = repo[i].div.find_all('a', {'class': 'Link'})[0].text.strip()
        repository_name = repo[i].div.find_all('a', {'class': 'Link text-bold wb-break-word'})[0].text.strip()
        star_repo = repo[i].find_all('span', {'class': 'Counter js-social-count'})[0].text
        repository_url = "https://github.com" + repo[i].find_all('a', {'class': 'Link'})[1]['href']
        stars = star_conversion(star_repo)
        return username, repository_name, repository_url, stars
    except Exception as e:
        print(e)

def get_topic_repos(topic_page_html): # parsed HTML returned by get_topic_page(topic_url)
    repo_big_box = topic_page_html.find_all("article", {"class" : "border rounded color-shadow-small color-bg-subtle my-4"})

    reposatory_dict = {"Username":[], "Reposatory_Name":[], "Stars":[], "Reposatory_URL's":[]}

    for i in range(len(repo_big_box)):
        repo_info = get_repo_data(repo_big_box, i) # pass the list of cards, not the whole page
        if repo_info is None: # get_repo_data printed an error; skip this card
            continue
        username, repository_name, repository_url, stars = repo_info
        reposatory_dict['Username'].append(username)
        reposatory_dict['Reposatory_Name'].append(repository_name)
        reposatory_dict['Stars'].append(stars)
        reposatory_dict["Reposatory_URL's"].append(repository_url)

    return pd.DataFrame(reposatory_dict)

# Turning each Web page into a parsed HTML corpus!

- Here the page URL (starting from the base link "https://github.com/topics") is passed to the requests library, where .get() sends a GET request to the web server (in simple words, we ask the server to send back the page at that URL).

- A successful response comes back with status code 200, which simply means we have successfully reached the requested page.

- If any other status code comes back, the function raises an exception so the failure does not go unnoticed.

- Then we parse the whole HTML corpus with **"Beautiful Soup"** and store the result in a variable.

- We have also made another function that converts star counts written as "...k" into plain integers, **e.g. 97k to 97000**, which makes the numbers easier for the end user to read and compare.
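
The "...k" conversion described above can be sketched as follows. `stars_to_int` is a hypothetical name used only for this illustration, and the sketch assumes the label is either a plain integer or a number followed by "k". It uses `round()` instead of `int()` so that a floating-point product landing a hair below a whole number is not truncated.

```python
def stars_to_int(star_text):
    """Convert a star label such as '97k' or '512' to an integer (illustrative helper)."""
    star_text = star_text.strip().lower()
    if star_text.endswith("k"):
        # round() avoids truncating a product that lands just below a whole number
        return round(float(star_text[:-1]) * 1000)
    return int(star_text)

# Labels may arrive with surrounding whitespace, so strip() runs first
for label in ["97k", "69.7k", " 18.3k\n", "512"]:
    print(label.strip(), "->", stars_to_int(label))
```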

In [None]:
# Function for getting & loading the page in html
def get_topic_page(topic_url):
    # Loading the topic page
    topic_page = requests.get(topic_url)

    # Checking that the topic page loaded successfully
    if topic_page.status_code != 200:
        raise Exception ('Error while Loading Topic Page: - {}'.format(topic_url))

    # Parsing the HTML using bs
    topic_page_html = bs(topic_page.text, 'html.parser')

    return topic_page_html


# Function for converting "...k" star counts to integer values
def star_conversion(star_str):
    star_str = star_str.strip()
    if star_str[-1] == 'k':
        return int(float(star_str[:-1]) * 1000)
    return int(star_str)



# We can call it the 2nd Base function: -
- First we check the file name (w.r.t. the file path); if a file with the same name already exists, the function skips this topic and the run continues with the rest.

- Otherwise, the scraping for one topic starts here.

- We call get_topic_page to load and parse the topic page.

- The parsed page is then passed to get_topic_repos to scrape the repos for that topic.

- Finally we end up with a CSV file named after the respective topic title.
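
The skip-if-the-file-exists check can be sketched on its own, without any scraping. `save_once` is a hypothetical helper for this illustration only, and it writes into a temporary folder so nothing is left behind.

```python
import os
import tempfile

def save_once(path, text):
    """Write text to path unless the file already exists. Returns True if it was written."""
    if os.path.exists(path):
        print("File {} Already Exists".format(path))
        return False
    with open(path, "w") as f:
        f.write(text)
    return True

with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, "3D.csv")
    first = save_once(csv_path, "Username,Stars\nmrdoob,69700\n")  # file is created
    second = save_once(csv_path, "this text is never written\n")   # skipped
    with open(csv_path) as f:
        saved_text = f.read()

print(first, second)
```

Checking before scraping means a re-run does not fetch pages whose CSV files are already on disk, at the cost of never refreshing an existing file.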

In [None]:
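def scrape_topic(topic_url, path): # path is the CSV file to write, e.g. "<folder>/<topic title>.csv"
    try:
        # Skip topics whose CSV file was already written
        if os.path.exists(path):
            print("File {} Already Exists".format(path))
            return

        # Load the topic page, scrape its repos and save them without the index column
        topic_df_1 = get_topic_repos(get_topic_page(topic_url))
        topic_df_1.to_csv(path,index = None)
    except Exception as e:
        print(e)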

# The Base Function (Main Function)

- We first scrape the list of topics, then call Base function 2 for each topic, which does all the repo scraping.

- Before the loop we create a folder named "gitHub_Scrape"; every scraped CSV file, one per topic title and holding that topic's top repositories with their stars, is stored there.
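
A minimal sketch of the folder and per-topic file naming described above, using a small hand-made topics table instead of a live scrape. The two rows are sample data, and the folder is created inside a temporary directory for this illustration.

```python
import os
import tempfile
import pandas as pd

# Sample topics table with the same column names scrape_topics() produces
topics_df = pd.DataFrame({
    "Titles": ["3D", "Ajax"],
    "Urls": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
})

with tempfile.TemporaryDirectory() as base:
    out_dir = os.path.join(base, "gitHub_Scrape")
    os.makedirs(out_dir, exist_ok=True)
    os.makedirs(out_dir, exist_ok=True)  # calling it again is harmless with exist_ok=True
    folder_created = os.path.isdir(out_dir)

    planned = []
    # iterrows() yields (index, row) pairs, so the index is unpacked and ignored
    for _, row in topics_df.iterrows():
        csv_path = os.path.join(out_dir, "{}.csv".format(row["Titles"]))
        planned.append((row["Urls"], os.path.basename(csv_path)))

print(planned)
```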

In [None]:
# Base function to start with
def scrape_topics_repos():
    topics_df = scrape_topics()
    if topics_df is None: # scrape_topics prints the error and returns None on failure
        return

    # Creating a Folder for all the Scraped data (CSV files)
    os.makedirs("gitHub_Scrape", exist_ok = True)

    # iterrows() yields (index, row) pairs, so the index is unpacked and ignored
    for _, row in topics_df.iterrows():
        print("Scraping top repositories for '{}'".format(row['Titles']))
        scrape_topic(row['Urls'], "gitHub_Scrape/{}.csv".format(row['Titles']))

# Call this Function

In [None]:
scrape_topics_repos()

# References & Future Work: -

* Summary: -

 * In this web scraping project, I utilized Python's Beautiful Soup, Pandas, requests, and os libraries to extract valuable information from GitHub's website. The primary goal was to navigate through the Topics and gather data from all the available topics.

 * The scraping process involved extracting data such as:
   * Topic Name
   
   * Topic Description

   * Topic URL

   * Repository Title

   * Name of the Author

   * Repository URL

   * Rating (Stars)

  This comprehensive approach allowed me to capture a holistic view of the GitHub repositories under each topic.

 * After organizing the data, I created separate .csv files for each topic, containing information about the top repositories and their star counts. I also made sure that no file is written twice under the same title / file name.

 * Additionally, a dedicated folder was created to store all the CSV files, ensuring a structured and easily accessible repository of information.

 * This project not only showcased proficiency in web scraping using Python but also demonstrated the ability to efficiently manage and present the extracted data. The resulting CSV files provide a quick reference to the top repositories in each topic, offering valuable insights for further analysis or exploration.

# References that I found useful

* https://pandas.pydata.org/

* https://pypi.org/project/beautifulsoup4/

* https://docs.python.org/3/library/os.html

* https://python.readthedocs.io/en/v2.7.2/library/os.html

* https://pypi.org/project/requests/

* https://docs.python.org/3/library/re.html

* https://stackoverflow.com/

# Ideas for Future Work: -
Due to time constraints, certain aspects were prioritized. With additional time and resources, there are several avenues to explore and enhance the project further. Here are some potential areas for improvement and expansion: -

* Dynamic Data Updates

* Extended Metadata Extraction

* Sentiment Analysis on Comments

* Machine Learning Models

* Cross-Platform Integration

* Automated Email Alerts

* Integration with Version Control Systems

* Security Analysis

These ideas provide a foundation for future work that can enhance the functionality, usability, and insights derived from this GitHub web scraping project.