# Scraping Top Repositories for GitHub Topics

## Index
1. Introduction to Web Scraping 
2. Definition of Problem Statement
3. Technologies Used
4. Solution

### 1. Introduction to Web Scraping

- Web scraping is a technique used to extract data from websites by parsing the HTML content of web pages. It allows users to collect large amounts of data efficiently, which can be used for various purposes such as analysis, research, and automation.
- This process typically involves sending a request to a website, retrieving the HTML content, and then extracting specific information using libraries like BeautifulSoup in Python.
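
As a minimal sketch of the extraction step, the snippet below parses a small hard-coded HTML string instead of a downloaded page. The markup and the `topic-title` class are made up for illustration and are not GitHub's real page structure.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page (not GitHub's real markup).
sample_html = """
<html><body>
  <p class="topic-title">3D</p>
  <p class="topic-title">Ajax</p>
  <p class="other">Not a title</p>
</body></html>
"""

# Parse the document with Python's built-in HTML parser backend.
soup = BeautifulSoup(sample_html, "html.parser")

# Collect the text of every <p> tag that carries the made-up class.
titles = [tag.text.strip() for tag in soup.find_all("p", class_="topic-title")]
print(titles)
```

Working on a fixed string like this makes it easy to try out selectors before sending any requests.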

### 2. Definition of Problem Statement

- The goal of this project is to scrape information about the top 30 topics listed on GitHub's topics page.
- Specifically, the task is to extract the title, description, and URL of each topic listed on the page.
- This information is then structured into Pandas DataFrames and saved as .csv files for further analysis or storage.

### 3. Technologies Used

- Python: The primary programming language used for scripting and automation.
- BeautifulSoup: A Python library used for parsing HTML and XML documents, allowing easy navigation and extraction of data.
- Requests: A Python library used to send HTTP requests to websites and retrieve HTML content.
- Pandas: A data manipulation and analysis library used to organize and structure the scraped data into a DataFrame for easier handling.

### 4. Solution

#### Pick a website and describe your objective

- Browse through different sites and pick one to scrape.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook.

Outline/Strategy:
- We're going to scrape `https://github.com/topics`.
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL, and topic description.
- For each topic, we'll get the top 25 repositories from the topic page.
- For each repository, we'll grab the repo name, username, stars, and repo URL.
- For each topic, we'll create a CSV file in the following format:

    ```
    Repository name,Username,Stars,Repo URL
    infinite-scroll,metafizzy,7400,https://github.com/metafizzy/infinite-scroll
    tabulator,olifolkerd,6600,https://github.com/olifolkerd/tabulator
    ```
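
One way to produce text in this format, sketched with the standard library's `csv` module rather than Pandas, is shown below. The rows are the two sample rows above, and the in-memory buffer stands in for a real file.

```python
import csv
import io

# The two sample rows from the format description above.
rows = [
    ("infinite-scroll", "metafizzy", 7400, "https://github.com/metafizzy/infinite-scroll"),
    ("tabulator", "olifolkerd", 6600, "https://github.com/olifolkerd/tabulator"),
]

# Write the header and rows into an in-memory text buffer.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Repository name", "Username", "Stars", "Repo URL"])
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

The DataFrame-based code later in this notebook names its columns `username`, `repo_name`, `stars`, and `repo_url`, so the header it writes differs from this sample.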

#### Use the requests library to download web pages

- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.
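
The last bullet could be approached along these lines. This is only a sketch: `fetch_with_requests`, `topic_page_url`, and `download_page` are hypothetical helper names, and the `/topics/<slug>` URL pattern is an assumption based on the topic links used later in this notebook.

```python
from pathlib import Path


def fetch_with_requests(url):
    """Download a page and return its HTML text, raising on HTTP errors."""
    import requests  # imported lazily so the other helpers work without it

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text


def topic_page_url(topic_slug, base_url="https://github.com"):
    """Build a topic page URL from its slug (assumed pattern: /topics/<slug>)."""
    return "{}/topics/{}".format(base_url, topic_slug)


def download_page(url, path, fetch=fetch_with_requests):
    """Fetch `url`, save the HTML to `path`, and return its length in characters."""
    html = fetch(url)
    Path(path).write_text(html, encoding="utf-8")
    return len(html)
```

For example, `download_page(topic_page_url("android"), "android.html")` would save one topic page locally. Passing a different `fetch` callable makes it possible to try the function without network access.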

In [None]:
!python -m pip install requests --upgrade --quiet

In [None]:
import requests

In [None]:
topics_url = "https://github.com/topics"

In [None]:
response = requests.get(topics_url)

In [None]:
response.status_code
# status codes in the 200-299 range mean the request succeeded
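
The success check can be written out explicitly, as in the small sketch below. `is_success` is a hypothetical helper; with Requests, `response.ok` and `response.raise_for_status()` cover similar ground.

```python
from http import HTTPStatus


def is_success(status_code):
    """Return True for any status code in the 2xx range."""
    return 200 <= status_code <= 299


# HTTPStatus members behave like integers, so they can be compared directly.
print(is_success(HTTPStatus.OK))         # status 200
print(is_success(HTTPStatus.NOT_FOUND))  # status 404
```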

In [None]:
page_contents = response.text

In [None]:
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)


#### Use Beautiful Soup to parse and extract information

- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract information from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.

In [None]:
!pip install beautifulsoup4 --upgrade --quiet

In [None]:
from bs4 import BeautifulSoup

In [None]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [None]:
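# Topic titles are <p> tags; the class string is taken from the page source
selection_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
topic_title_tags = doc.find_all("p", {"class": selection_class})

# Topic descriptions are <p> tags with their own class string
desc_selector = "f5 color-fg-muted mb-0 mt-1"
topic_desc_tags = doc.find_all("p", {"class": desc_selector})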

In [None]:
topic_titles = []
topic_descriptions = []
for tag in topic_title_tags:
    topic_titles.append(tag.text)
for tag in topic_desc_tags:
    topic_descriptions.append(tag.text.strip())

In [None]:
topic_link_tags = doc.find_all("a", {"class":"no-underline flex-grow-0"})

topic_urls = []
base_url = "https://github.com"
for tag in topic_link_tags:
    topic_urls.append(base_url+tag["href"])
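
String concatenation works here because the `href` values on the topics page are relative paths. As a hedged alternative, `urllib.parse.urljoin` from the standard library also copes with an `href` that is already absolute. The `example_hrefs` values below are made up for illustration.

```python
from urllib.parse import urljoin

site_root = "https://github.com"

# Hypothetical href values: one relative path and one absolute URL.
example_hrefs = ["/topics/3d", "https://github.com/topics/ajax"]

# urljoin resolves relative paths against the root and leaves absolute URLs alone.
example_urls = [urljoin(site_root, href) for href in example_hrefs]
print(example_urls)
```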



In [None]:
!pip install pandas --upgrade --quiet

In [None]:
import pandas as pd

In [None]:
topics_dict = {
    "title" : topic_titles,
    "description" : topic_descriptions,
    "url" : topic_urls
}
topics_df = pd.DataFrame(topics_dict)
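
Building a DataFrame from a dictionary of lists requires every list to have the same length, otherwise Pandas raises a `ValueError`. A mismatch usually means one of the selectors matched more or fewer tags than the others. The sketch below shows a check that could be run first. `check_equal_lengths` is a hypothetical helper and the sample data is made up.

```python
def check_equal_lengths(columns):
    """Return the shared column length, or raise ValueError if the lengths differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("Column lengths differ: {}".format(lengths))
    # An empty mapping has no columns, so report a length of 0.
    return next(iter(lengths.values()), 0)


# Hypothetical sample data shaped like topics_dict.
sample_columns = {
    "title": ["3D", "Ajax"],
    "description": ["About 3D", "About Ajax"],
    "url": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
}
row_count = check_equal_lengths(sample_columns)
print(row_count)
```

The error message lists the length of each column, which points directly at the selector that needs attention.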

#### Create CSV file(s) with the extracted information

- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.
- Verify the information in the CSV files by reading them back using Pandas.

In [None]:
topics_df.to_csv("topics.csv",index=None)
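
The read-back verification mentioned above can be sketched as a round trip. This version uses the standard library's `csv` module in place of Pandas so that it runs without extra installs; `pd.read_csv` would serve the same purpose. The records and the `write_and_verify` helper are hypothetical.

```python
import csv
import os
import tempfile

# Hypothetical records with the same columns as topics_df.
records = [
    {"title": "3D", "description": "Modeling in three dimensions", "url": "https://github.com/topics/3d"},
    {"title": "Ajax", "description": "Asynchronous requests, without reloads", "url": "https://github.com/topics/ajax"},
]


def write_and_verify(rows, path):
    """Write rows to a CSV file, read them back, and report whether they match."""
    fieldnames = list(rows[0].keys())
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    with open(path, newline="", encoding="utf-8") as f:
        read_back = list(csv.DictReader(f))
    return read_back == rows


# Use a temporary directory so no files are left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    round_trip_ok = write_and_verify(records, os.path.join(tmp_dir, "topics.csv"))
print(round_trip_ok)
```

The second description contains a comma on purpose: the writer quotes that field, so it survives the round trip intact.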

#### Getting Information out of a topic page

In [None]:
topic_page_url = topic_urls[0]

In [None]:
topic_page_url

In [None]:
response = requests.get(topic_page_url)

In [None]:
response.status_code

In [None]:
len(response.text)

In [None]:
topic_doc = BeautifulSoup(response.text,"html.parser")

In [None]:
h3_selection_class = "f3 color-fg-muted text-normal lh-condensed"
repo_tags = topic_doc.findAll("h3", {"class": h3_selection_class})

In [None]:
a_tags = repo_tags[0].findAll("a")
a_tags[0].text.strip()

In [None]:
a_tags[1].text.strip()

In [None]:
base_url = "https://github.com"
repo_url = base_url + a_tags[1]["href"]
repo_url

In [None]:
star_class = "Counter js-social-count"
star_tags = topic_doc.findAll("span", {"class":star_class})
len(star_tags)

In [None]:
star_tags[0].text.strip()

In [None]:
def parse_star_count(stars_str):
    # Convert a star count such as "7.4k" or "905" to an integer
    stars_str = stars_str.strip()
    if stars_str[-1] == "k":
        # A trailing "k" means thousands
        return int(float(stars_str[:-1]) * 1000)
    else:
        return int(stars_str)

parse_star_count(star_tags[0].text.strip())
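
A slightly more defensive variant is sketched below. It uses `Decimal` so the result does not depend on binary floating-point rounding, and it rejects empty input with a clear error. The `m` suffix and the comma handling are assumptions about formats that might appear, not something observed on the page. `parse_count` is a hypothetical name chosen so that it does not replace the function above.

```python
from decimal import Decimal

# Multipliers for count suffixes; "m" is an assumption for very large counts.
SUFFIXES = {"k": 1_000, "m": 1_000_000}


def parse_count(text):
    """Convert strings such as '7.4k', '1,234' or '905' to an int."""
    cleaned = text.strip().lower().replace(",", "")
    if not cleaned:
        raise ValueError("empty star count")
    multiplier = SUFFIXES.get(cleaned[-1])
    if multiplier is not None:
        # Decimal keeps '7.4' exact, so the multiplication is exact too.
        return int(Decimal(cleaned[:-1]) * multiplier)
    return int(cleaned)


print(parse_count("7.4k"))
print(parse_count("905"))
```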

In [None]:
def get_repo_info(h3_tag, star_tag):
    # returns all required info about repository
    a_tags = h3_tag.find_all("a")
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]["href"]
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url
    

In [None]:
get_repo_info(repo_tags[0],star_tags[0])

In [None]:
topic_repos_dictionary = {
    "username":[],
    "repo_name":[],
    "stars":[],
    "repo_url":[]
}

for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i],star_tags[i])
    topic_repos_dictionary["username"].append(repo_info[0])
    topic_repos_dictionary["repo_name"].append(repo_info[1])
    topic_repos_dictionary["stars"].append(repo_info[2])
    topic_repos_dictionary["repo_url"].append(repo_info[3])
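
The loop above turns per-repository tuples into per-column lists. The same reshaping can be sketched with `zip`, shown here on made-up tuples that stand in for the values returned by `get_repo_info`.

```python
# Hypothetical parsed results in the order (username, repo_name, stars, repo_url).
repo_rows = [
    ("metafizzy", "infinite-scroll", 7400, "https://github.com/metafizzy/infinite-scroll"),
    ("olifolkerd", "tabulator", 6600, "https://github.com/olifolkerd/tabulator"),
]
column_names = ["username", "repo_name", "stars", "repo_url"]

# zip(*repo_rows) transposes the rows into columns; pair each column with its name.
columns = {name: list(values) for name, values in zip(column_names, zip(*repo_rows))}
print(columns["stars"])
```

Note that an empty `repo_rows` list produces an empty dictionary here, so a page with no matches would need separate handling.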

### Final Code

In [None]:
import os
def get_topic_page(topic_url):
    #Download the page
    response = requests.get(topic_url)
    #Checking the status code
    if response.status_code != 200:
        raise Exception("Failed to load page {} ".format(topic_url))
    #parse using beautiful soup
    topic_doc = BeautifulSoup(response.text,"html.parser")
    return topic_doc

def get_repo_info(h3_tag, star_tag):
    # returns all required info about repository
    a_tags = h3_tag.find_all("a")
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]["href"]
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url
    

def get_topic_repos(topic_doc):
    # Get the h3 tags containing the repo title, URL, and username
    h3_selection_class = "f3 color-fg-muted text-normal lh-condensed"
    repo_tags = topic_doc.find_all("h3", {"class": h3_selection_class})
    # Get the tags holding the star counts
    star_class = "Counter js-social-count"
    star_tags = topic_doc.find_all("span", {"class": star_class})

    # Dictionary of columns describing the topic's repositories
    topic_repos_dictionary = {
        "username":[],
        "repo_name":[],
        "stars":[],
        "repo_url":[]
    }
    # Collect the info for each repository
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i],star_tags[i])
        topic_repos_dictionary["username"].append(repo_info[0])
        topic_repos_dictionary["repo_name"].append(repo_info[1])
        topic_repos_dictionary["stars"].append(repo_info[2])
        topic_repos_dictionary["repo_url"].append(repo_info[3])

    return pd.DataFrame(topic_repos_dictionary)

def scrape_topic(topic_url, topic_name):
    fname = topic_name + ".csv"
    if os.path.exists(fname):
        print("file {} already exists, skipping...".format(fname))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(fname,index=None)
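
`scrape_topic` uses the topic name directly as a file name. If a title ever contained a path separator or other awkward characters, writing the file could fail or land in an unexpected place. The sketch below shows one conservative way to derive a file name. `safe_filename` and its character rules are hypothetical choices, not part of the original code.

```python
import re


def safe_filename(topic_title, extension=".csv"):
    """Turn a topic title into a conservative file name (hypothetical rule set)."""
    # Replace each run of characters outside letters, digits, . _ + # - with "_".
    stem = re.sub(r"[^A-Za-z0-9._+#-]+", "_", topic_title.strip())
    # Fall back to a placeholder if nothing usable is left.
    stem = stem.strip("_") or "untitled"
    return stem + extension


print(safe_filename("C++"))
print(safe_filename("Amazon Web Services"))
```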
    

In [None]:
topic_repos_df = pd.DataFrame(topic_repos_dictionary)

In [None]:
get_topic_repos(get_topic_page(topic_urls[4])).to_csv("Android.csv",index=None)

##### Write a single function to:
1. Get the list of topics from the topics page of GitHub
2. Get the list of top repos from the individual topic pages
3. For each topic, create a CSV of the top repos for that topic

In [None]:
def get_topic_titles(doc):
    selection_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
    topic_title_tags = doc.findAll('p',{"class":selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

def get_topic_description(doc):
    desc_selector = "f5 color-fg-muted mb-0 mt-1"
    topic_desc_tags = doc.find_all("p",{"class" : desc_selector})
    topic_descriptions = []
    for tag in topic_desc_tags:
        topic_descriptions.append(tag.text.strip())
    return topic_descriptions

def get_topic_urls(doc):
    topic_link_tags = doc.find_all("a", {"class":"no-underline flex-grow-0"})
    topic_urls = []
    base_url = "https://github.com"
    for tag in topic_link_tags:
        topic_urls.append(base_url+tag["href"])
    return topic_urls

def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception("Failed to load page {} ".format(topics_url))
    # Parse the downloaded page so the helpers work on this response
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        "title": get_topic_titles(doc),
        "description": get_topic_description(doc),
        "url": get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)


In [None]:
def scrape_topics_repos():
    print("Scraping list of Topics...")
    topics_df = scrape_topics()

    os.makedirs("data",exist_ok=True)
    
    for index,row in topics_df.iterrows():
        print("Scraping top repositories for {}".format(row["title"]))
        # Save each topic's CSV inside the data/ directory
        scrape_topic(row["url"],"data/{}".format(row["title"]))
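
In the loop above, a single failing topic page raises an exception and stops the whole run. One possible variation, sketched below, records failures and carries on. `scrape_all` is a hypothetical helper that takes the per-topic function as a parameter, so it can be tried without network access.

```python
def scrape_all(topics, scrape_one):
    """Call scrape_one(url, title) for each topic and collect failures."""
    failures = []
    for title, url in topics:
        try:
            scrape_one(url, title)
        except Exception as error:  # broad on purpose: keep the batch going
            failures.append((title, str(error)))
    return failures
```

With the functions from this notebook, `topics` could be built from the `title` and `url` columns of `topics_df`, and `scrape_one` could be `scrape_topic`. The returned list shows which topics need a retry.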


    

### Closing Reference

- After working on this project, I’ve significantly enhanced my Python skills, particularly in web scraping and data extraction using libraries like BeautifulSoup and Requests.
- This experience has deepened my understanding of handling HTML content, parsing data, and effectively organizing the extracted information into usable formats.
- Lastly, I would like to extend my gratitude to [Jovian](http://www.youtube.com/@jovianhq) for their in-depth technical explanations.