# Top Repositories for GitHub Topics Web Scraping Project

## Pick a website and describe your objective
- Browse through different sites and pick one to scrape.
- Identify the information you'd like to scrape from the site.
- Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook.



### Outline:

- I'm going to scrape https://github.com/topics
- I'll get a list of topics. For each topic, I'll extract its title, URL, and description.
- For each topic, I'll also scrape the top 20 repositories, recording:
  1. Repository Name
  2. Username
  3. Stars
  4. URL
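
As a rough sketch of the output format, each topic could get its own CSV file with one row per repository. The column names (`username`, `repo_name`, `repo_url`, `stars`) and the helper `rows_to_csv` are assumptions made for illustration; the sample row uses values from the 3D topic page.

```python
import csv
import io

# Assumed column layout for one per-topic CSV file.
FIELDNAMES = ['username', 'repo_name', 'repo_url', 'stars']

def rows_to_csv(rows):
    """Serialize a list of row dictionaries to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

sample_rows = [
    {'username': 'mrdoob', 'repo_name': 'three.js',
     'repo_url': 'https://github.com/mrdoob/three.js', 'stars': '97.5k'},
]
csv_text = rows_to_csv(sample_rows)
print(csv_text)
```

Settling the columns up front makes it easier to check later that every scraped row has the same shape.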

## Use the requests library to download web pages
- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.
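
The download-and-save step could look roughly like the sketch below. The function names, the `pages` folder, and the 30-second timeout are hypothetical choices, not part of this notebook; `requests.get` and `raise_for_status` are standard calls in the requests library.

```python
import os

def topic_page_url(topic_slug, base_url='https://github.com'):
    """Build the URL of a topic page from its slug, e.g. '3d'."""
    return '{}/topics/{}'.format(base_url, topic_slug)

def save_page(html, file_path):
    """Write downloaded HTML to disk, creating the parent folder if needed."""
    folder = os.path.dirname(file_path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with open(file_path, 'w', encoding='utf-8') as f:
        f.write(html)
    return file_path

def download_topic_page(topic_slug, folder='pages'):
    """Download one topic page and save it as <folder>/<slug>.html."""
    import requests  # imported here so the helpers above work without it
    response = requests.get(topic_page_url(topic_slug), timeout=30)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return save_page(response.text, os.path.join(folder, topic_slug + '.html'))
```

Keeping a local copy means the parsing code can be re-run and debugged without requesting the same page again.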



In [5]:
!pip install requests --upgrade --quiet


[notice] A new release of pip is available: 23.2.1 -> 24.0
[notice] To update, run: python.exe -m pip install --upgrade pip


In [6]:
import requests

In [7]:
topics_url = 'https://github.com/topics'

In [8]:
response = requests.get(topics_url)

In [9]:
response.status_code

200

In [10]:
page_content = response.text

## Use Beautiful Soup to parse and extract information

- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract information from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.
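
One thing to keep in mind when choosing the "right methods": passing a multi-class string to `find_all` matches the `class` attribute as an exact string, so it stops matching if the site reorders its classes. A CSS selector via `select` matches regardless of order. The HTML below is a simplified, made-up stand-in for the topics page, not GitHub's actual markup.

```python
from bs4 import BeautifulSoup

# Two topic cards; the second lists the same classes in a different order.
html = '''
<div>
  <a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
    <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  </a>
  <a class="flex-1 no-underline d-flex flex-column" href="/topics/ajax">
    <p class="lh-condensed f3 mb-0 mt-1 Link--primary">Ajax</p>
  </a>
</div>
'''
doc = BeautifulSoup(html, 'html.parser')

# Exact string match on the class attribute: only the first <p> qualifies.
exact_matches = doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})

# CSS selector: any <p> that has both classes, in any order.
css_matches = doc.select('p.f3.lh-condensed')

print(len(exact_matches))
print([tag.text.strip() for tag in css_matches])
```

Selecting on one or two distinctive classes is usually less brittle than copying the full class string from the page source.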



In [11]:
!pip install beautifulsoup4 --upgrade --quiet




In [12]:
from bs4 import BeautifulSoup

In [13]:
doc = BeautifulSoup(page_content, 'html.parser')

In [14]:
topic_title_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = doc.find_all('p', {'class': topic_title_class})


In [15]:
topic_description_class = 'f5 color-fg-muted mb-0 mt-1'
topic_description_tags = doc.find_all('p', {'class': topic_description_class} )

In [16]:
topic_link_class = 'no-underline flex-1 d-flex flex-column'
topic_link_tags = doc.find_all('a', {'class': topic_link_class })
topic_link_tags[0]['href']

'/topics/3d'

In [17]:
topic0_url = "https://github.com" + topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d


In [18]:
topic_titles = []

for tag in topic_title_tags:
    topic_titles.append(tag.text)

topic_titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

In [19]:
topic_descriptions = []

for tag in topic_description_tags:
    topic_descriptions.append(tag.text.strip())

topic_descriptions[:5]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [20]:
topic_urls = []
base_url = 'https://github.com'

for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])

topic_urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']

In [21]:
!pip install pandas --upgrade --quiet




In [22]:
import pandas as pd

Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd


In [23]:
topic_dict = {
    'Title': topic_titles,
    'Description': topic_descriptions,
    'URL': topic_urls
}

In [24]:
topic_df = pd.DataFrame(topic_dict)

In [25]:
topic_df

|   | Title | Description | URL |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source platform for buildin... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


## Getting Information out of a Topic Page

In [26]:
topic_page_url = topic_urls[0]
response = requests.get(topic_page_url)

In [27]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [28]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})

In [29]:
repo_tags[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="c72fbd5c69a8ee7c9c53a4e65de2b93c8fc7552dd793945819639bc165c0f0ba" data-turbo="false" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4a2667db3d63a1739c412e059e5da95afe419df83f70949b5d59dc3478f5c79a" data-turbo="false" data-view-component="true" href

In [30]:
a_tags = repo_tags[0].find_all('a')

In [31]:
a_tags

[<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="c72fbd5c69a8ee7c9c53a4e65de2b93c8fc7552dd793945819639bc165c0f0ba" data-turbo="false" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>,
 <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4a2667db3d63a1739c412e059e5da95afe419df83f70949b5d59dc3478f5c79a" data-turbo="false" data-view-component="true" href="/mrdoob/three.js">
             three.js
 </a>]

In [32]:
a_tags[0].text.strip()

'mrdoob'

In [33]:
a_tags[1].text.strip()

'three.js'

In [34]:
a_tags[1]['href']

'/mrdoob/three.js'

In [35]:
repo_url = base_url + a_tags[1]['href']
repo_url

'https://github.com/mrdoob/three.js'

In [36]:
star_class = 'Counter js-social-count'
star_tags = topic_doc.find_all('span', {'class': star_class})
star_tags[0].text

'97.5k'
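
The star count comes back as display text such as `'97.5k'`, which sorts and compares as a string. If numeric values are needed, a small converter along these lines could be applied before building the DataFrame. The name `parse_star_count` is hypothetical, and the `'m'` suffix is an assumption that does not appear in the scraped output.

```python
def parse_star_count(stars_text):
    """Convert a display string such as '97.5k' or '1,234' to an integer."""
    text = stars_text.strip().lower().replace(',', '')
    # Suffix multipliers; 'm' is assumed, only 'k' appears in the scraped data.
    multipliers = {'k': 1_000, 'm': 1_000_000}
    if text and text[-1] in multipliers:
        # round() guards against floating-point results like 25300.000000000004
        return int(round(float(text[:-1]) * multipliers[text[-1]]))
    return int(text)

print(parse_star_count('97.5k'))  # 97500
print(parse_star_count('16k'))    # 16000
```

Note that the converted value is only as precise as the abbreviated text, so `'97.5k'` becomes 97500 rather than the exact count.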

In [37]:
def get_repo_info(h3_tag, star_tag):
    #returns all the required information about the repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repository_name = a_tags[1].text.strip()
    repository_url = base_url + a_tags[1]['href']
    stars = star_tag.text.strip()
    return username, repository_name, repository_url, stars
    

In [38]:
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 'https://github.com/mrdoob/three.js', '97.5k')

In [39]:
topic_repos_dict = {
    'username':[],
    'repo_name':[],
    'repo_url':[],
    'stars':[]
}

for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['repo_url'].append(repo_info[2])
    topic_repos_dict['stars'].append(repo_info[3])

In [40]:
topic_repo_df = pd.DataFrame(topic_repos_dict)

In [41]:
topic_repo_df

|   | username | repo_name | repo_url | stars |
|---|---|---|---|---|
| 0 | mrdoob | three.js | https://github.com/mrdoob/three.js | 97.5k |
| 1 | pmndrs | react-three-fiber | https://github.com/pmndrs/react-three-fiber | 25.3k |
| 2 | libgdx | libgdx | https://github.com/libgdx/libgdx | 22.5k |
| 3 | BabylonJS | Babylon.js | https://github.com/BabylonJS/Babylon.js | 22.1k |
| 4 | ssloy | tinyrenderer | https://github.com/ssloy/tinyrenderer | 18.9k |
| 5 | FreeCAD | FreeCAD | https://github.com/FreeCAD/FreeCAD | 16.8k |
| 6 | lettier | 3d-game-shaders-for-beginners | https://github.com/lettier/3d-game-shaders-for... | 16.7k |
| 7 | aframevr | aframe | https://github.com/aframevr/aframe | 16k |
| 8 | CesiumGS | cesium | https://github.com/CesiumGS/cesium | 11.6k |
| 9 | blender | blender | https://github.com/blender/blender | 10.9k |


In [69]:
import os

def get_topic_page(topic_url):
    # download the page
    response = requests.get(topic_url)
    # check response code
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    #parse using beautiful soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

def get_repo_info(h3_tag, star_tag):
    #returns all the required information about the repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repository_name = a_tags[1].text.strip()
    repository_url = base_url + a_tags[1]['href']
    stars = star_tag.text.strip()
    return username, repository_name, repository_url, stars

def get_topic_repos(topic_doc):
    #get the repository name tags
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})
    #get the star tags
    star_class = 'Counter js-social-count'
    star_tags = topic_doc.find_all('span', {'class': star_class})
    # get the repository info

    topic_repos_dict = {
        'username':[],
        'repo_name':[],
        'repo_url':[],
        'stars':[]
    }
    
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['repo_url'].append(repo_info[2])
        topic_repos_dict['stars'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)

def scrape_topic(topic_url, topic_name):
    file_name = topic_name + '.csv'
    if os.path.exists(file_name):
        print('The file {} already exists. Skipping..'.format(file_name))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    # index=False leaves the DataFrame's row index out of the CSV
    topic_df.to_csv(file_name, index=False)
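
`scrape_topic` uses the topic title directly as the file name. If a title ever contained a space or a path separator, the resulting path could be awkward or invalid. A minimal sketch of a safer file-name helper follows; `topic_csv_path` and the `data` folder are hypothetical, not part of the notebook.

```python
import os
import re

def topic_csv_path(topic_name, folder='data'):
    """Return a CSV path for a topic, replacing characters that are unsafe in file names."""
    # Keep letters, digits, underscore, '.', '+', '#', and '-'; replace the rest.
    safe_name = re.sub(r'[^\w.+#-]+', '_', topic_name.strip())
    return os.path.join(folder, safe_name + '.csv')

print(os.path.basename(topic_csv_path('ASP.NET')))      # ASP.NET.csv
print(os.path.basename(topic_csv_path('Objective C')))  # Objective_C.csv
```

If a folder is used, it would need to exist before `to_csv` is called, for example via `os.makedirs(folder, exist_ok=True)`.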

In [44]:
topic_urls[1]

'https://github.com/topics/ajax'

In [49]:
topic2_repos = get_topic_repos(get_topic_page(topic_urls[1]))

In [50]:
topic2_repos

|   | username | repo_name | repo_url | stars |
|---|---|---|---|---|
| 0 | ljianshu | Blog | https://github.com/ljianshu/Blog | 7.8k |
| 1 | metafizzy | infinite-scroll | https://github.com/metafizzy/infinite-scroll | 7.4k |
| 2 | olifolkerd | tabulator | https://github.com/olifolkerd/tabulator | 6.1k |
| 3 | developit | unfetch | https://github.com/developit/unfetch | 5.7k |
| 4 | jquery-form | form | https://github.com/jquery-form/form | 5.2k |
| 5 | Studio-42 | elFinder | https://github.com/Studio-42/elFinder | 4.5k |
| 6 | elbywan | wretch | https://github.com/elbywan/wretch | 4.4k |
| 7 | dwyl | learn-to-send-email-via-google-script-html-no-... | https://github.com/dwyl/learn-to-send-email-vi... | 3.1k |
| 8 | ded | reqwest | https://github.com/ded/reqwest | 2.9k |
| 9 | wendux | ajax-hook | https://github.com/wendux/ajax-hook | 2.5k |


Write a single function to:
1. Get the list of topics from the topics page
2. Get the list of top repos from individual pages
3. For each topic, create a CSV of the top repos of the topic
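
When looping over many topics, it can help to pause between requests and to keep going if a single topic fails. The sketch below illustrates that idea only; `scrape_all_topics`, its parameters, and the one-second default delay are assumptions, and the per-topic scraper is passed in as a callable so the loop can be tried without network access.

```python
import time

def scrape_all_topics(topics, scrape_one, delay_seconds=1.0):
    """Call scrape_one(url, title) for each (title, url) pair.

    Returns the titles that failed so they can be retried later.
    """
    failed = []
    for position, (title, url) in enumerate(topics):
        if position > 0:
            time.sleep(delay_seconds)  # pause between consecutive requests
        try:
            scrape_one(url, title)
        except Exception as error:
            # Report the failure and continue with the next topic.
            print('Skipping "{}": {}'.format(title, error))
            failed.append(title)
    return failed
```

With the notebook's own functions, a call might look like `scrape_all_topics(zip(topics_df['Title'], topics_df['URL']), scrape_topic)`, where `topics_df` is the DataFrame returned by `scrape_topics()`.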

In [52]:
def get_topic_titles(doc):
    topic_title_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': topic_title_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

def get_topic_description(doc):
    topic_description_class = 'f5 color-fg-muted mb-0 mt-1'
    topic_description_tags = doc.find_all('p', {'class': topic_description_class} )
    topic_descriptions = []
    for tag in topic_description_tags:
        topic_descriptions.append(tag.text.strip())
    return topic_descriptions

def get_topic_url(doc):
    topic_link_class = 'no-underline flex-1 d-flex flex-column'
    topic_link_tags = doc.find_all('a', {'class': topic_link_class })
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls
    
def scrape_topics():
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')

    topic_dict = {
        'Title': get_topic_titles(doc),
        'Description': get_topic_description(doc),
        'URL': get_topic_url(doc)
    }
    return pd.DataFrame(topic_dict)

In [53]:
scrape_topics()

|   | Title | Description | URL |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source platform for buildin... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


In [66]:
def scrape_topic_repos():
    topics_df = scrape_topics()
    # iterate over the freshly scraped topics, not the earlier global topic_df
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}" topic.'.format(row['Title']))
        scrape_topic(row['URL'], row['Title'])

In [70]:
scrape_topic_repos()

Scraping top repositories for "3D" topic.
The file 3D.csv already exists. Skipping..
Scraping top repositories for "Ajax" topic.
The file Ajax.csv already exists. Skipping..
Scraping top repositories for "Algorithm" topic.
The file Algorithm.csv already exists. Skipping..
Scraping top repositories for "Amp" topic.
The file Amp.csv already exists. Skipping..
Scraping top repositories for "Android" topic.
The file Android.csv already exists. Skipping..
Scraping top repositories for "Angular" topic.
The file Angular.csv already exists. Skipping..
Scraping top repositories for "Ansible" topic.
The file Ansible.csv already exists. Skipping..
Scraping top repositories for "API" topic.
The file API.csv already exists. Skipping..
Scraping top repositories for "Arduino" topic.
The file Arduino.csv already exists. Skipping..
Scraping top repositories for "ASP.NET" topic.
The file ASP.NET.csv already exists. Skipping..
Scraping top repositories for "Atom" topic.
The file Atom.csv alrea