# Top Repositories for GitHub Topics



### Pick a website and describe your objective

- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.


#### Project Outline

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:

```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```
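
To make the target format concrete, here is a minimal sketch that writes rows in this layout with Python's built-in `csv` module. The helper name `repos_to_csv` and the `sample_rows` list are made up for illustration; the notebook itself writes its CSV files through pandas later on.

```python
import csv
import io

# Column order taken from the sample format above.
FIELDNAMES = ["Repo Name", "Username", "Stars", "Repo URL"]

def repos_to_csv(rows):
    """Serialize a list of repo dicts to CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Sample rows copied from the format example above.
sample_rows = [
    {"Repo Name": "three.js", "Username": "mrdoob", "Stars": 69700,
     "Repo URL": "https://github.com/mrdoob/three.js"},
    {"Repo Name": "libgdx", "Username": "libgdx", "Stars": 18300,
     "Repo URL": "https://github.com/libgdx/libgdx"},
]

csv_text = repos_to_csv(sample_rows)
print(csv_text)
```

Using `csv.DictWriter` means a value that contains a comma would be quoted automatically, which plain string joining would not do.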

### Use the requests library to download web pages
The `requests` library downloads web pages over HTTP.

In [1]:
!pip install requests --upgrade --quiet  # --quiet hides the install output

In [2]:
import requests

In [3]:
topics_url = 'https://github.com/topics'  # the URL must include the http:// or https:// scheme

In [4]:
response = requests.get(topics_url)  # downloads the page and returns a Response object

In [5]:
response.status_code  # a status code of 200 means the request was successful
# Other HTTP status codes signal redirects (3xx), client errors (4xx) or server errors (5xx)

200
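
As a rough illustration of how status codes can be interpreted, the sketch below uses only the standard library. The helper names `is_success` and `describe_status` are hypothetical and not part of `requests`; with a real response object, `response.raise_for_status()` is another way to fail early on an error status.

```python
from http import HTTPStatus

def is_success(status_code):
    """Return True for any 2xx code, the range HTTP uses for success."""
    return 200 <= status_code < 300

def describe_status(status_code):
    """Return the standard reason phrase for a known status code."""
    try:
        return HTTPStatus(status_code).phrase
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not define
        return "Unknown status"

for code in (200, 404, 500):
    print(code, is_success(code), describe_status(code))
```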

In [6]:
len(response.text)

188953

In [7]:
page_contents = response.text

In [8]:
page_contents[:1000]  # view the first 1000 characters of the response

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-dkuYFW+ra8yYSt342e5pJEeslPSjMcrMvNxlYZMyM/X+/WJHDPvoCuGq3LFojI7B0dQWwZNRiPMnbi9IfUgTaA==" rel="stylesheet" href="https://github.githubassets.com/assets/light-764b98156fab6bcc984addf8d9ee6924.css" /><link crossorigin="anonymous" media="all" integrity="sha512-UrAu23+eyncWvaQFwsLbgSKtmLb2aH1bcT4hJnnRdkaPuY1eu9bumt33FyHHFDX8hskTUNWNkIsMCz7F

In [9]:
with open('webpage.html', 'w', encoding='utf-8') as f:  # 'w' opens the file for writing; the with block closes it automatically
    f.write(page_contents)
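
Web pages often contain non-ASCII characters, so it can help to name the encoding explicitly when saving and reloading a page. The sketch below is an illustration under that assumption; `save_page`, `load_page` and the sample HTML string are invented for the example, and it writes to a temporary directory rather than to `webpage.html`.

```python
import os
import tempfile

def save_page(contents, path):
    """Write page text to disk as UTF-8 and return the number of characters written."""
    with open(path, "w", encoding="utf-8") as f:
        return f.write(contents)

def load_page(path):
    """Read the page back using the same encoding."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

# Hypothetical page content with non-ASCII characters
sample_html = "<p>Café ☕</p>"

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "webpage.html")
    chars_written = save_page(sample_html, sample_path)
    round_trip = load_page(sample_path)

print(chars_written, round_trip == sample_html)
```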

### Use Beautiful Soup to parse and extract information

In [10]:
!pip install beautifulsoup4 --upgrade --quiet

Beautiful Soup is a Python library for pulling data out of HTML and XML files.

In [11]:
from bs4 import BeautifulSoup

In [12]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [13]:
p_tags = doc.find_all('p')  # find all the <p> tags

In [14]:
p_tags[:5]  # the first five <p> tags; searching by tag name alone is too broad for what we need

[<p class="f4 color-fg-muted col-md-6 mx-auto">Browse popular topics on GitHub.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Arduino
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">Arduino is an open source hardware and software company and maker community.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         COVID-19
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">The coronavirus disease 2019 (COVID-19) is an infectious disease caused by SARS-CoV-2.</p>]

In [15]:
# We can narrow the search with more specific queries.
# Here we match <p> tags by the class string used for topic titles.
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'

topic_title_tags = doc.find_all('p', {'class': selection_class})

In [16]:
len(topic_title_tags)

30

In [17]:
topic_title_tags[:5]  # this way we get the tags holding all the topic names

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]
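
One thing worth knowing about the class lookup used above: when the class value contains several names, Beautiful Soup compares it against the whole `class` attribute string, so an extra class or a different order stops it from matching. The sketch below demonstrates this on a made-up HTML snippet (`sample_html` is not GitHub's markup) and shows `select` with a CSS selector as a more tolerant alternative.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: two title-like tags with different class strings
sample_html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">Arduino</p>
<p class="f5 color-fg-muted mb-0 mt-1">A description, not a title.</p>
"""
sample_doc = BeautifulSoup(sample_html, "html.parser")

# A multi-class string is compared against the whole class attribute.
exact = sample_doc.find_all("p", {"class": "f3 lh-condensed mb-0 mt-1 Link--primary"})

# A CSS selector matches any tag that carries all the listed classes, in any order.
flexible = sample_doc.select("p.f3.lh-condensed")

exact_titles = [tag.text.strip() for tag in exact]
flexible_titles = [tag.text.strip() for tag in flexible]
print(exact_titles, flexible_titles)
```

If a selector that used to work starts returning an empty list, a changed class string on the page is a likely cause.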

In [19]:
desc_selector = 'f5 color-fg-muted mb-0 mt-1'
topic_desc_tags = doc.find_all('p', {'class': desc_selector})

In [20]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [21]:
topic_title_tag0 = topic_title_tags[0]

In [22]:
topic_title_tag0

<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>

### We need the link to the topic page
Looking closely, the title tag sits inside an `<a>` tag whose `href` attribute holds the link.
So the first step is to get that enclosing tag through the `.parent` attribute.

In [23]:
parent_tag = topic_title_tag0.parent

In [24]:
parent_tag  # the parent is the whole <a> tag, which wraps both the title and the description

<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D modeling is the process of virtually developing the surface and structure of a 3D object.
        </p>
</a>

In [25]:
# Getting the enclosing <a> tag for every topic

In [26]:
topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})

In [27]:
len(topic_link_tags)

30

In [28]:
topic_link_tags[0]

<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D modeling is the process of virtually developing the surface and structure of a 3D object.
        </p>
</a>

In [29]:
# Building the full URL

In [30]:
topic0_url = "https://github.com" + topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d
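
String concatenation works here because every `href` on the page starts with `/`. As a slightly more defensive sketch, `urllib.parse.urljoin` from the standard library also handles an `href` that is already absolute. The helper name `absolute_url` is made up for this example.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"

def absolute_url(href, base=BASE_URL):
    """Resolve a relative href against the base; absolute URLs pass through unchanged."""
    return urljoin(base, href)

print(absolute_url("/topics/3d"))
print(absolute_url("https://github.com/topics/ajax"))
```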


In [31]:
# Now we can loop over the tags and append the values we need to a new list.
# For example, below we use tag.text to get the text inside each tag.

In [32]:
topic_titles = []

for tag in topic_title_tags:
    topic_titles.append(tag.text)
    
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


Getting the descriptions into a list

In [33]:
topic_descs = []
# strip() removes the whitespace before and after the text
for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())
    
topic_descs[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

Getting the URLs into a list

In [34]:
topic_urls = []
base_url = 'https://github.com'

for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])
    
topic_urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

In [35]:
!pip install pandas --quiet

In [36]:
import pandas as pd

In [37]:
topics_dict = {
    'title': topic_titles,
    'description': topic_descs,
    'url': topic_urls
}
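
The three lists are matched up by position, so they need to have the same length; pandas raises a `ValueError` when building a DataFrame from lists of different lengths. A small check like the one sketched below (the helper name `check_equal_lengths` is hypothetical) can give a clearer message about which selector returned too few or too many tags.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the lists in a column dict differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("Column lengths differ: {}".format(lengths))
    return lengths

# Made-up sample data in the same shape as topics_dict
sample_columns = {
    "title": ["3D", "Ajax"],
    "description": ["About 3D", "About Ajax"],
    "url": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
}
print(check_equal_lengths(sample_columns))
```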

In [38]:
topics_df = pd.DataFrame(topics_dict)

In [39]:
topics_df

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


In [40]:
topics_df.to_csv('topics.csv', index=None)

## Getting information out of a topic page

In [42]:
topic_page_url = topic_urls[0]

In [43]:
topic_page_url

'https://github.com/topics/3d'

In [44]:
response = requests.get(topic_page_url)

In [45]:
response.status_code

200

In [46]:
len(response.text)

674916

In [47]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [56]:
# We want the username and the link to the repo, but the <a> tags do not reliably carry a class of their own.
# So we look at the parent <h3> tag first and then at its children.

<h3 class="f3 color-fg-muted text-normal lh-condensed">
          <a data-hydro-click="{&quot;event_type&quot;:&quot;explore.click&quot;,&quot;payload&quot;:{&quot;click_context&quot;:&quot;REPOSITORY_CARD&quot;,&quot;click_target&quot;:&quot;OWNER&quot;,&quot;click_visual_representation&quot;:&quot;REPOSITORY_OWNER_HEADING&quot;,&quot;actor_id&quot;:89184184,&quot;record_id&quot;:97088,&quot;originating_url&quot;:&quot;https://github.com/topics/3d&quot;,&quot;user_id&quot;:89184184}}" data-hydro-click-hmac="8666cbe46c3d963c9092624b5cf812bdd0322dff1e964854a968c8f0c13b7b4e" data-ga-click="Explore, go to repository owner, location:explore feed" href="/mrdoob" data-view-component="true">
            mrdoob
</a>          /
          <a data-hydro-click="{&quot;event_type&quot;:&quot;explore.click&quot;,&quot;payload&quot;:{&quot;click_context&quot;:&quot;REPOSITORY_CARD&quot;,&quot;click_target&quot;:&quot;REPOSITORY&quot;,&quot;click_visual_representation&quot;:&quot;REPOSITORY_NAME_HEADING&quot;,&quot;actor_id&quot;:89184184,&quot;record_id&quot;:576201,&quot;originating_url&quot;:&quot;https://github.com/topics/3d&quot;,&quot;user_id&quot;:89184184}}" data-hydro-click-hmac="593108e65f294d247449978fcc1e9db74b7bfb34ae241525564190cf680946ca" data-ga-click="Explore, go to repository, location:explore feed" href="/mrdoob/three.js" data-view-component="true" class="text-bold wb-break-word">
            three.js
</a>        </h3>

In [57]:
h3_username_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags = topic_doc.find_all('h3', {'class': h3_username_selection_class} )

In [60]:
repo_tags[1]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":509841,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="760dcd7b253cb1a27d9b1a8675e86db885295be4e0d8d9fa7397adf923075d36" data-view-component="true" href="/libgdx">
            libgdx
</a>          /
          <a class="text-bold wb-break-word" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":5373551,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="ff9d8fbd4b6a268

In [61]:
len(repo_tags)

30

In [62]:
a_tags = repo_tags[0].find_all('a')  # the <a> tags that hold the username and the repo name

In [63]:
a_tags[0].text.strip() #getting the username

'mrdoob'

In [64]:
a_tags[1].text.strip() #getting the repo name

'three.js'

In [66]:
base_url = 'https://github.com'
repo_url = base_url + a_tags[1]['href']
print(repo_url) #getting the repo url

https://github.com/mrdoob/three.js


In [76]:
star_tags = topic_doc.find_all('span', { 'class': 'Counter js-social-count'})  # getting the number of stars

In [77]:
len(star_tags)

30

In [78]:
star_tags[0].text.strip()

'79.3k'

In [80]:
def parse_star_count(stars_str):
    # Convert a star count string such as '79.3k' to an integer
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)

In [81]:
parse_star_count(star_tags[0].text.strip())

79300
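
The parser above covers the `k` suffix seen on this page. As a hedged sketch of a slightly more tolerant variant, the function below also accepts an `m` suffix and thousands separators. Whether GitHub ever renders counts that way is an assumption, not something confirmed in this notebook, and the name `parse_count` is made up. It uses `round` rather than `int` so that a product landing a hair below a whole number is not truncated.

```python
def parse_count(text):
    """Convert strings such as '79.3k', '1.2m' or '1,234' to an int."""
    # The 'm' suffix and comma handling are assumptions for illustration
    multipliers = {"k": 1_000, "m": 1_000_000}
    cleaned = text.strip().lower().replace(",", "")
    if not cleaned:
        raise ValueError("empty count string")
    suffix = cleaned[-1]
    if suffix in multipliers:
        return round(float(cleaned[:-1]) * multipliers[suffix])
    return int(cleaned)

print(parse_count("79.3k"), parse_count("1,234"), parse_count(" 950 "))
```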

In [82]:
# Finding the username, repo name, number of stars and the repo URL

In [83]:
def get_repo_info(h3_tag, star_tag):
    # Returns all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [84]:
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 79300, 'https://github.com/mrdoob/three.js')

In [87]:
topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
}


for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])

In [90]:
topics_repo_df = pd.DataFrame(topic_repos_dict)

In [91]:
topics_repo_df

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 79300 | https://github.com/mrdoob/three.js |
| 1 | libgdx | libgdx | 19700 | https://github.com/libgdx/libgdx |
| 2 | pmndrs | react-three-fiber | 16900 | https://github.com/pmndrs/react-three-fiber |
| 3 | BabylonJS | Babylon.js | 16000 | https://github.com/BabylonJS/Babylon.js |
| 4 | aframevr | aframe | 13800 | https://github.com/aframevr/aframe |
| 5 | lettier | 3d-game-shaders-for-beginners | 12200 | https://github.com/lettier/3d-game-shaders-for... |
| 6 | ssloy | tinyrenderer | 12100 | https://github.com/ssloy/tinyrenderer |
| 7 | FreeCAD | FreeCAD | 10700 | https://github.com/FreeCAD/FreeCAD |
| 8 | metafizzy | zdog | 9000 | https://github.com/metafizzy/zdog |
| 9 | CesiumGS | cesium | 8300 | https://github.com/CesiumGS/cesium |


In [92]:
topics_repo_df.to_csv('topics_repo.csv', index=None)

## Final Code

In [93]:
import os

def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

def get_repo_info(h3_tag, star_tag):
    # Returns all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

def get_topic_repos(topic_doc):
    # Get the h3 tags containing repo title, repo URL and username
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})
    # Get star tags (same selector as in the exploration cells above)
    star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})
    
    topic_repos_dict = { 'username': [], 'repo_name': [], 'stars': [],'repo_url': []}

    # Get repo info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)


def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=None)


Write a single function to:
1. Get the list of topics from the topics page
2. Get the list of top repos from the individual topic pages
3. For each topic, create a CSV of the top repos for the topic




In [94]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

def get_topic_descs(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls
    

def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the downloaded page so the helper functions have a document to search
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)


In [95]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('data', exist_ok=True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))
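
Topic titles are used directly as file names here, and titles such as "Command line interface" contain spaces. If a title ever contained a character that is awkward in a path (a slash, for example, which is a hypothetical case not seen in the list above), the write could fail or land in the wrong folder. A minimal sketch of a sanitizing helper, with the made-up name `topic_filename`:

```python
import re

def topic_filename(title, extension=".csv"):
    """Build a file name from a topic title (hypothetical helper).

    Runs of characters other than letters, digits, '.', '_', '+' and '-'
    are replaced with a single '-'.
    """
    slug = re.sub(r"[^A-Za-z0-9._+-]+", "-", title.strip()).strip("-")
    if not slug:
        raise ValueError("title has no usable characters: {!r}".format(title))
    return slug + extension

for title in ("Command line interface", "C++", "ASP.NET"):
    print(topic_filename(title))
```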

In [None]:
import jovian

In [None]:
jovian.commit()
