<div style="text-align: right;">
    <h3>Nainil Maladkar (002780019)</h3>
    <h3>Project 1: Web Data Extraction</h3>
    <h3>INFO7390 - Advances in Data Science and Architecture</h3>
</div>


#### References

1. Web Scraping python notebook by Prof. Junwei Huang
2. Medium blog on Scraping: [A Beginner's Guide to Web Scraping in Python](https://blog.jovian.com/a-beginners-guide-to-web-scraping-in-python-8de50bdf211b)

---

                                              

# GitHub Featured Topics Scraper

### Project Description

The **GitHub Topic Scraper** is a Python web scraping project that aims to retrieve the top listed topics of discussion from GitHub topics page listed on `'https://github.com/topics'` and store the data in a `CSV` file. 

This project utilizes `Selenium Webdriver` and `BeautifulSoup`

### Features:

- **Web Scraping:** The project leverages Selenium to automate the process of navigating to the GitHub topics page, scrolling through the topics, and fetching the list of all topics information.

- **Data Extraction:** Beautiful Soup is used to parse and extract relevant data like `repo name` and `stars` and  and associated details from the web page.

- **CSV Export:** The scraped data is then organized and saved in a `.CSV` file, making it easy to analyze or share with others.

The project further tries to automate the process using functions based on selected github topic.

### Download Webpage with `Requests` Library

In [1]:
#!pip install requests --upgrade 

In [2]:
import requests

In [3]:
git_topics_url = 'https://github.com/topics'

In [4]:
response = requests.get(git_topics_url)

**Status code check for website to check if request is successful**

In [5]:
response.status_code

200

In [6]:
len(response.text)

165737

In [7]:
#check content of webpage
topics_page_contents = response.text
topics_page_contents [:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"  data-a11y-animated-images="system" data-a11y-link-underlines="true">\n\n\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-a09cef873428.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-5d486a4ede8e.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" med

In [8]:
#save the webpage to file `webpage.html`
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(topics_page_contents)

**Using `Selenium` to access load webpage**

In [9]:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import time

url = "https://github.com/topics"

TIMEOUT = 5

options = Options()
options.add_argument("--start-maximized")
options.add_argument("--disable-notifications")
options.add_argument("--disable-popup-blocking")
options.add_argument("--disable-infobars")
options.add_argument("--disable-extensions")
options.add_experimental_option("prefs", 
                                {"profile.default_content_setting_values.notifications": 2 
                                }) 

print(f"Retrieving web page URL '{url}'")
driver = webdriver.Chrome(options=options)
driver.get(url)

# Timeout needed for the web page to render (read more about it)
time.sleep(TIMEOUT)

# Scroll to the bottom of the page using JavaScript
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# Add a delay to allow the page to finish loading (adjust as needed)
time.sleep(TIMEOUT)

html = driver.page_source

# Close the browser
driver.quit()


Retrieving web page URL 'https://github.com/topics'


### Data Extraction and Parsing using `BeautifulSoup`

In [10]:
#!pip install beautifulsoup4 --upgrade

In [11]:
from bs4 import BeautifulSoup
import bs4
import pandas as pd
import time

In [12]:
git_topic_doc = BeautifulSoup(topics_page_contents, "html.parser")

In [13]:
type(git_topic_doc)

bs4.BeautifulSoup

In [14]:
# element name inside element class 'p'
topic_selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
#specify tag to find and key
git_topic_title_tags = git_topic_doc.find_all('p', {'class': topic_selection_class })

In [15]:
len(git_topic_title_tags)

30

In [16]:
#retrieve top 20 titles
git_topic_title_tags[:20]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Atom</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Awesome Lists</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amazon Web Services</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Azure</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Babel</p>,
 <p class="f3 lh-condensed m

### Get Topic title for repository
#### Get first title by index [0] and save to get `div` class tag

In [17]:
topic_title_tag0 = git_topic_title_tags[0]
topic_title_tag0

<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>

In [18]:
#div class store for tags
div_tag = topic_title_tag0.parent

In [19]:
div_tag.parent

<div class="py-4 border-bottom d-flex flex-justify-between">
<a class="no-underline flex-grow-0" href="/topics/3d">
<div class="color-bg-accent f4 color-fg-muted text-bold rounded mr-3 flex-shrink-0 text-center" style="width:64px; height:64px; line-height:64px;">
            #
          </div>
</a>
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
        </p>
</a>
<div class="flex-grow-0">
<div class="d-block" data-view-component="true">
<a aria-label="You must be signed in to star a repository" class="tooltipped tooltipped-s btn-sm btn" data-hydro-click='{"event_type":"authentication.click","payload":{"location_in_page":"star button","repository_id":null,"auth_type":"LOG_IN","originating_url":"https://github.com/topics","user_id":null}}' data-hydro-click-hmac="5

In [20]:
# <p class="f5 color-fg-muted mb-0 mt-1">
#           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
#         </p>

### Get Topic description for Title

In [21]:
git_topic_desc_tags = git_topic_doc.find_all('p' , {'class': 'f5 color-fg-muted mb-0 mt-1'})

In [22]:
len(git_topic_desc_tags)

30

In [23]:
git_topic_link_tags = git_topic_doc.find_all('a' , {'class': 'no-underline flex-1 d-flex flex-column'})

In [24]:
len(git_topic_link_tags)

30

In [25]:
git_topic_descs = []

for tag in git_topic_desc_tags:
     git_topic_descs.append(tag.text.strip())

print(git_topic_descs)


['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.', 'Android is an operating system built by Google designed for mobile devices.', 'Angular is an open source web application platform.', 'Ansible is a simple and powerful automation engine.', 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.', 'Arduino is an open source platform for building electronic devices.', 'ASP.NET is a web framework for building modern web apps and services.', 'Atom is a open source text editor built with web technologies.', 'An awesome list is a list of awesome things curated by the community.', 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.', 'Azure is a cloud co

In [26]:
# <a href="/topics/3d" class="no-underline flex-1 d-flex flex-column">
#         <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
#         <p class="f5 color-fg-muted mb-0 mt-1">
#           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
#         </p>
#       </a>

### Get Topic `URL` for repository
##### Get first title by index [0] and add `href` link to base URL to form list

In [27]:
git_topic_link_tags[0]['href']

'/topics/3d'

In [28]:
topic0_url = "https://github.com" + git_topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d


In [29]:
git_topic_title_tags[:50]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Atom</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Awesome Lists</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amazon Web Services</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Azure</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Babel</p>,
 <p class="f3 lh-condensed m

In [30]:
# Get Title names in a list and append any new additions

git_topic_titles = []

for tag in git_topic_title_tags:
    git_topic_titles.append(tag.text)

print(git_topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [31]:
git_topic_urls = []

base_url = 'https://github.com'
for tag in git_topic_link_tags:
    git_topic_urls.append(base_url + tag['href'])

#print(topic_urls)
git_topic_urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

#### Make a new disctionary to store details scraped from webpage

In [32]:
git_topics_dict = {
   'title' :  git_topic_titles,
    'URL'  : git_topic_urls,
    'Description' : git_topic_descs
    
}

In [33]:
print(git_topics_dict)

{'title': ['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++'], 'URL': ['https://github.com/topics/3d', 'https://github.com/topics/ajax', 'https://github.com/topics/algorithm', 'https://github.com/topics/amphp', 'https://github.com/topics/android', 'https://github.com/topics/angular', 'https://github.com/topics/ansible', 'https://github.com/topics/api', 'https://github.com/topics/arduino', 'https://github.com/topics/aspnet', 'https://github.com/topics/atom', 'https://github.com/topics/awesome', 'https://github.com/topics/aws', 'https://github.com/topics/azure', 'https://github.com/topics/babel', 'https://github.com/topics/bash', 'https://github.com/topics/bitcoin', 'https://github.c

### Generate a `dataframe` & save the contents in a `.csv` file 

In [34]:
git_topics_df = pd.DataFrame(git_topics_dict)

In [35]:
git_topics_df

Unnamed: 0,title,URL,Description
0,3D,https://github.com/topics/3d,3D refers to the use of three-dimensional grap...
1,Ajax,https://github.com/topics/ajax,Ajax is a technique for creating interactive w...
2,Algorithm,https://github.com/topics/algorithm,Algorithms are self-contained sequences that c...
3,Amp,https://github.com/topics/amphp,Amp is a non-blocking concurrency library for ...
4,Android,https://github.com/topics/android,Android is an operating system built by Google...
5,Angular,https://github.com/topics/angular,Angular is an open source web application plat...
6,Ansible,https://github.com/topics/ansible,Ansible is a simple and powerful automation en...
7,API,https://github.com/topics/api,An API (Application Programming Interface) is ...
8,Arduino,https://github.com/topics/arduino,Arduino is an open source platform for buildin...
9,ASP.NET,https://github.com/topics/aspnet,ASP.NET is a web framework for building modern...


In [36]:
#generate a csv based on dataframe
git_topics_df.to_csv('git_topics_info.csv',index= None)

## Retrieve data for specific repositories

#### Important section to select index of Topic as per view

In [37]:
git_topic_page_url = git_topic_urls[2]

In [38]:
git_topic_page_url

'https://github.com/topics/algorithm'

In [39]:
response = requests.get(git_topic_page_url)

In [40]:
response.status_code

200

In [41]:
len(response.text)

467894

In [42]:
repo_topic_doc = BeautifulSoup(response.text, 'html.parser')

In [43]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
git_repo_tags = repo_topic_doc.find_all('h3',{'class':h3_selection_class})

In [44]:
len(git_repo_tags)

20

In [45]:
git_repo_tags

[<h3 class="f3 color-fg-muted text-normal lh-condensed">
 <a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":3771963,"originating_url":"https://github.com/topics/algorithm","user_id":null}}' data-hydro-click-hmac="806006bf6ecca87f1d4103212d684b4f8f9d8f897c7349a1cd7e37852661db87" data-turbo="false" data-view-component="true" href="/jwasham">
             jwasham
 </a>          /
           <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":60493101,"originating_url":"https://github.com/topics/algorithm","user_id":null}}' data-hydro-click-hmac="00d35a03113917ed4a7133838f34531a079c7e92d024d3355218293bb04b5d81" data-turbo="false" data-v

In [46]:
#get data W.R.T each repository based on itsss `a` tag
a_tags = git_repo_tags[0].find_all('a')

In [47]:
a_tags[0].text.strip()

'jwasham'

In [48]:
a_tags[1].text.strip()

'coding-interview-university'

In [49]:
base_url

'https://github.com'

#### Get detials of repository author and add `base URL` value and `href`

In [50]:
a_tags[1]['href']
repo_url = base_url + a_tags[1]['href']
print(repo_url)


https://github.com/jwasham/coding-interview-university


## Get count of `stars` of a specific repository

In [51]:
repo_star_tags = repo_topic_doc.find_all('span',{'class': 'Counter js-social-count'})

len(repo_star_tags)

20

In [52]:
repo_star_tags[0].text.strip()

'268k'

## Define a function to convert `string` value to `int` and remove letter `'K'` 

In [53]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    
    try:
        if stars_str[-1] == 'k':
            return int(float(stars_str[:-1]) * 1000)
        return int(stars_str)
    except ValueError:
       
        return None 

# Example usage:
star_count = parse_star_count(repo_star_tags[0].text.strip())
if star_count is not None:
    print(star_count)
else:
    print("Error: Unable to parse star count")


268000


## Get Repository Information including `UserName` , `URL`, `Repository Name` and `Stars`

In [54]:
def get_repo_info(h3_tag, star_tag):
    #returns all information about repo
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [55]:
#repo_star_tags

In [56]:
get_repo_info(git_repo_tags[2],repo_star_tags[2])

('TheAlgorithms', 'Python', 170000, 'https://github.com/TheAlgorithms/Python')

## Get `Repositroy Name`, `User name`, `Repository URl`, `Stars`

In [57]:
git_topics_repo_dict = {
    'username':[],
    'repo_name':[], 
    'stars':[],
    'repo_url':[]
}

for i in range (len(git_repo_tags)):
    repo_info = get_repo_info(git_repo_tags[i],repo_star_tags[i])
    git_topics_repo_dict['username'].append(repo_info[0])
    git_topics_repo_dict['repo_name'].append(repo_info[1])
    git_topics_repo_dict['stars'].append(repo_info[2])
    git_topics_repo_dict['repo_url'].append(repo_info[3])

In [58]:
git_topics_repo_dict


{'username': ['jwasham',
  'trekhleb',
  'TheAlgorithms',
  'CyC2018',
  'yangshun',
  'kdn251',
  'TheAlgorithms',
  'azl397985856',
  'algorithm-visualizer',
  'youngyangyang04',
  'krahets',
  'halfrost',
  'huihut',
  'TheAlgorithms',
  'donnemartin',
  'crossoverJie',
  'TheAlgorithms',
  'keon',
  'mxgmn',
  'trekhleb'],
 'repo_name': ['coding-interview-university',
  'javascript-algorithms',
  'Python',
  'CS-Notes',
  'tech-interview-handbook',
  'interviews',
  'Java',
  'leetcode',
  'algorithm-visualizer',
  'leetcode-master',
  'hello-algo',
  'LeetCode-Go',
  'interview',
  'JavaScript',
  'interactive-coding-challenges',
  'JCSprout',
  'C-Plus-Plus',
  'algorithms',
  'WaveFunctionCollapse',
  'homemade-machine-learning'],
 'stars': [268000,
  176000,
  170000,
  167000,
  95600,
  60400,
  53800,
  52500,
  44500,
  42300,
  35600,
  30600,
  30500,
  28300,
  27900,
  27000,
  26100,
  23000,
  21800,
  21800],
 'repo_url': ['https://github.com/jwasham/coding-interview

In [59]:
topic_repos_df = pd.DataFrame(git_topics_repo_dict)

In [60]:
topic_repos_df

Unnamed: 0,username,repo_name,stars,repo_url
0,jwasham,coding-interview-university,268000,https://github.com/jwasham/coding-interview-un...
1,trekhleb,javascript-algorithms,176000,https://github.com/trekhleb/javascript-algorithms
2,TheAlgorithms,Python,170000,https://github.com/TheAlgorithms/Python
3,CyC2018,CS-Notes,167000,https://github.com/CyC2018/CS-Notes
4,yangshun,tech-interview-handbook,95600,https://github.com/yangshun/tech-interview-han...
5,kdn251,interviews,60400,https://github.com/kdn251/interviews
6,TheAlgorithms,Java,53800,https://github.com/TheAlgorithms/Java
7,azl397985856,leetcode,52500,https://github.com/azl397985856/leetcode
8,algorithm-visualizer,algorithm-visualizer,44500,https://github.com/algorithm-visualizer/algori...
9,youngyangyang04,leetcode-master,42300,https://github.com/youngyangyang04/leetcode-ma...


## Function to get topic page details and get repository information along `Stars` and `URl`

In [61]:
def get_topic_page(git_topic_page_url):
    # Download page
    response = requests.get(git_topic_page_url)
    
    # Check response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(git_topic_page_url))
    
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

def get_repo_info(h3_tag, repo_star_tag):
    # Returns all info about repo
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(repo_star_tag.text.strip())
    return username, repo_name, stars, repo_url

def get_topic_repos(topic_doc):
    # Get h3 tag with repo title, repo URL, and username
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})
    
    # Get star tag
    star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})
    
    git_topics_repo_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }

    for i in range(len(git_repo_tags)):
        repo_info = get_repo_info(repo_tags[i], repo_star_tags[i])
        git_topics_repo_dict['username'].append(repo_info[0])
        git_topics_repo_dict['repo_name'].append(repo_info[1])
        git_topics_repo_dict['stars'].append(repo_info[2])
        git_topics_repo_dict['repo_url'].append(repo_info[3])

    return pd.DataFrame(git_topics_repo_dict)


In [62]:
url_trial = git_topic_urls[2]
url_trial

'https://github.com/topics/algorithm'

In [63]:
topic_trial_doc = get_topic_page(url_trial)
#topic4_doc

In [64]:
topic_trial_repos = get_topic_repos(topic_trial_doc)

In [65]:
topic_trial_repos

Unnamed: 0,username,repo_name,stars,repo_url
0,jwasham,coding-interview-university,268000,https://github.com/jwasham/coding-interview-un...
1,trekhleb,javascript-algorithms,176000,https://github.com/trekhleb/javascript-algorithms
2,TheAlgorithms,Python,170000,https://github.com/TheAlgorithms/Python
3,CyC2018,CS-Notes,167000,https://github.com/CyC2018/CS-Notes
4,yangshun,tech-interview-handbook,95600,https://github.com/yangshun/tech-interview-han...
5,kdn251,interviews,60400,https://github.com/kdn251/interviews
6,TheAlgorithms,Java,53800,https://github.com/TheAlgorithms/Java
7,azl397985856,leetcode,52500,https://github.com/azl397985856/leetcode
8,algorithm-visualizer,algorithm-visualizer,44500,https://github.com/algorithm-visualizer/algori...
9,youngyangyang04,leetcode-master,42300,https://github.com/youngyangyang04/leetcode-ma...


In [66]:
#get_topic_repos(get_topic_page(git_topic_urls[2]))

In [67]:
#get_topic_repos(get_topic_page(topic_urls[4])).to_csv('android.csv',index = None)

In [68]:
#topic_urls[5]

## Function to mention Title number to generate Repository list `csv` file

In [69]:
import os

# Function to extract the topic name from the URL
def extract_topic_name(url):
    parts = url.split('/')
    return parts[-1]

# Insert Title name in form of Index
topic_url = git_topic_urls[3]

topic_name = extract_topic_name(topic_url)

csv_filename = f'{topic_name}.csv'

get_topic_repos(get_topic_page(topic_url)).to_csv(csv_filename, index=None)

# Generate DF from given data
Data_from_selected_title = get_topic_repos(get_topic_page(topic_url))
print(f"Scraping top repositories for \"{topic_name}\":")
Data_from_selected_title



Scraping top repositories for "amphp":


Unnamed: 0,username,repo_name,stars,repo_url
0,amphp,amp,268000,https://github.com/amphp/amp
1,danog,MadelineProto,176000,https://github.com/danog/MadelineProto
2,unreal4u,telegram-api,170000,https://github.com/unreal4u/telegram-api
3,amphp,parallel,167000,https://github.com/amphp/parallel
4,amphp,http-client,95600,https://github.com/amphp/http-client
5,amphp,byte-stream,60400,https://github.com/amphp/byte-stream
6,php-service-bus,service-bus,53800,https://github.com/php-service-bus/service-bus
7,amphp,mysql,52500,https://github.com/amphp/mysql
8,amphp,parallel-functions,44500,https://github.com/amphp/parallel-functions
9,amphp,process,42300,https://github.com/amphp/process
