# Scraping the top repositories for GitHub topics

### Pick a website and describe your objective 
Jovian reference video: https://www.youtube.com/watch?v=RKsLLG-bzEY&t=5455s

- Browse through different sites and pick one to scrape.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook.

### Outline:

- Scrape https://github.com/topics.
- Gather the list of topics, with each topic's title, description, and URL.
- For each topic, get the top 25 repositories.
- For each topic, create a CSV file in the following format:

```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
```
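
As a rough illustration of this target format, the sketch below writes the header and the sample row above with Python's standard `csv` module. It writes to an in-memory buffer rather than a file, and the variable names are only illustrative; the notebook itself produces its files with pandas later on.

```python
import csv
import io

# Column order taken from the format shown above
header = ["Repo Name", "Username", "Stars", "Repo URL"]
# The sample row from the outline
rows = [["three.js", "mrdoob", 69700, "https://github.com/mrdoob/three.js"]]

buffer = io.StringIO()  # in-memory stand-in for an open CSV file
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Swapping the buffer for `open(path, 'w', newline='', encoding='utf-8')` would write the same text to disk.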

## Scrape the list of topics from GitHub

- Use requests to download the page.
- Use Beautiful Soup (bs4) to parse the page and extract information.
- Convert the results to a pandas DataFrame.

### Use the requests library to download web pages

In [1]:
import requests

In [2]:
topics_url = 'https://github.com/topics'

In [3]:
# download the topics page
response = requests.get(topics_url)

In [4]:
# check that the request succeeded (200 means OK)
response.status_code

200
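
The same check can be folded into a small helper so that later cells fail loudly instead of parsing an error page. This is only a sketch: `page_text_or_raise` is a hypothetical name, and the stand-in objects below exist purely so the example runs without network access. With a real `requests` response you would pass `response` directly.

```python
from types import SimpleNamespace


def page_text_or_raise(response, url):
    """Return the page HTML, or raise if the status code is not 200."""
    if response.status_code != 200:
        raise RuntimeError(
            "Failed to load page {} (status {})".format(url, response.status_code)
        )
    return response.text


# Stand-ins that mimic the two attributes used above (no network needed)
ok = SimpleNamespace(status_code=200, text="<html></html>")
missing = SimpleNamespace(status_code=404, text="Not Found")

html = page_text_or_raise(ok, "https://github.com/topics")

try:
    page_text_or_raise(missing, "https://github.com/topics")
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)

print(error_message)
```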

In [5]:
len(response.text)

154075

In [6]:
# preview the first 1000 characters of the page HTML
page_content = response.text
page_content[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-0946cdc16f15.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-3946c959759a.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media="all" rel="stylesheet" data-href="h

In [7]:
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_content)


### Use Beautiful Soup to parse and extract information

In [8]:
from bs4 import BeautifulSoup

In [9]:
doc = BeautifulSoup(page_content, 'html.parser')

In [10]:
# find the <p> tags that hold the topic titles (class copied from the page inspector)
selection_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
topic_title_tags = doc.find_all('p', {'class': selection_class})

In [11]:
len(topic_title_tags)

30

In [12]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [13]:
# find the <p> tags that hold the topic descriptions
desc_selector = "f5 color-fg-muted mb-0 mt-1"
topic_desc_tags = doc.find_all('p', {'class': desc_selector})


In [15]:
topic_desc_tags

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Angular is an open source web application platform.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ansible is a simple and powerful automation engine.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           An API (App

In [16]:
topic_title_tags0 = topic_title_tags[0]

In [17]:
topic_title_tags0.parent

<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
        </p>
</a>
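
Because each title and description sits inside the same `<a>` tag, one option is to loop over the link tags and read all three fields from each one, which keeps the lists aligned even if a topic lacks a description. The sketch below runs on a small inline HTML snippet copied from the output above rather than the live page; `sample_html`, `sample_doc`, and `topics` are illustrative names.

```python
from bs4 import BeautifulSoup

# Inline snippet copied from the parent tag shown above
sample_html = """
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
        </p>
</a>
"""

sample_doc = BeautifulSoup(sample_html, "html.parser")
base_url = "https://github.com"

topics = []
for link in sample_doc.find_all("a", {"class": "no-underline flex-1 d-flex flex-column"}):
    # Search within this link only, so title and description belong together
    title_tag = link.find("p", {"class": "f3 lh-condensed mb-0 mt-1 Link--primary"})
    desc_tag = link.find("p", {"class": "f5 color-fg-muted mb-0 mt-1"})
    topics.append({
        "title": title_tag.text.strip(),
        "description": desc_tag.text.strip() if desc_tag else "",
        "url": base_url + link["href"],
    })

print(topics)
```

A list of dictionaries like `topics` can be passed straight to `pd.DataFrame`.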

In [18]:
topic_link_tags = doc.find_all('a',{'class':"no-underline flex-1 d-flex flex-column"})

In [19]:
len(topic_link_tags)

30

In [20]:
topic0_url = "https://github.com" + topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d
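
String concatenation works here because every `href` starts with `/`. As a hedged alternative, `urllib.parse.urljoin` from the standard library resolves relative links against a base URL and leaves absolute links untouched. The `hrefs` list below is a hand-written sample, not scraped output.

```python
from urllib.parse import urljoin

base_url = "https://github.com"
# Sample href values: two relative paths as on the topics page, one absolute URL
hrefs = ["/topics/3d", "/topics/ajax", "https://github.com/topics/android"]

topic_urls_sample = [urljoin(base_url, href) for href in hrefs]
print(topic_urls_sample)
```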


In [21]:
topic_titles = []

for tag in topic_title_tags:
    topic_titles.append(tag.text)

print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [22]:
topic_descs = []

for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())
topic_descs[:5]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [23]:
topic_urls = []
base_url = "https://github.com"

for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])
    
topic_urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

In [24]:
import pandas as pd

In [25]:
topics_dict = {
    'title': topic_titles,
    'description': topic_descs,
    'url': topic_urls
}

In [26]:
topics_df = pd.DataFrame(topics_dict)

In [27]:
topics_df

Unnamed: 0,title,description,url
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


### Create CSV file(s) with the extracted information

In [28]:
topics_df.to_csv('topics.csv', index=False)

## Getting information out of a topic page

In [29]:
topic_page_url = topic_urls[0]

In [30]:
topic_page_url

'https://github.com/topics/3d'

In [31]:
response = requests.get(topic_page_url)

In [32]:
response.status_code

200

In [33]:
len(response.text)

461248

In [34]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [35]:
# topic_doc

In [36]:
h3_selection_class = "f3 color-fg-muted text-normal lh-condensed"
repo_tags = topic_doc.find_all('h3',{'class': h3_selection_class})

In [37]:
len(repo_tags)

20

In [38]:
a_tags = repo_tags[0].find_all("a")

In [39]:
a_tags[0].text.strip()

'mrdoob'

In [40]:
a_tags[1].text.strip()

'three.js'

In [41]:
repo_url = base_url + a_tags[1]['href']

In [42]:
print(repo_url)

https://github.com/mrdoob/three.js


In [43]:
star_tags = topic_doc.find_all('span',{'class': 'Counter js-social-count' })

In [44]:
len(star_tags)

20

In [45]:
star_tags[0].text.strip()

'91.2k'

In [46]:
def parse_star_count(stars_str):
    # convert a star count such as '91.2k' or '850' to an integer
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)

In [47]:
parse_star_count(star_tags[0].text.strip())

91200
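
`parse_star_count` assumes the only suffix is `k`. A slightly more defensive variant is sketched below, under the assumption that counts might also appear with an `m` suffix or with thousands separators (neither is observed in the output above). `parse_count` is a hypothetical name, and it uses `round` rather than `int` so that floating-point error cannot drop a star.

```python
def parse_count(count_str):
    """Convert strings such as '91.2k', '1.5m' or '1,234' to an integer."""
    multipliers = {"k": 1_000, "m": 1_000_000}
    cleaned = count_str.strip().lower().replace(",", "")
    suffix = cleaned[-1]
    if suffix in multipliers:
        # round() guards against results like 91199.99999999999
        return round(float(cleaned[:-1]) * multipliers[suffix])
    return int(cleaned)


print(parse_count("91.2k"), parse_count("850"), parse_count("1.5m"))
```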

In [48]:
def get_repo_info(h3_tag, star_tag):
    # return (username, repo_name, stars, repo_url) for one repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())

    return username, repo_name, stars, repo_url

In [49]:
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 91200, 'https://github.com/mrdoob/three.js')

In [50]:
topic_repos_dict = {
    'username' : [],
    'repo_name' : [],
    'stars': [],
    'repo_url': []
}

for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])
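
The loop above appends to four lists by position. One compact alternative, shown here as a sketch on two rows taken from outputs in this notebook, is to collect the tuples first and transpose them with `zip`. The `columns`, `rows`, and `topic_repos_sample` names are illustrative.

```python
columns = ("username", "repo_name", "stars", "repo_url")

# Tuples in the shape returned by get_repo_info
rows = [
    ("mrdoob", "three.js", 91200, "https://github.com/mrdoob/three.js"),
    ("flutter", "flutter", 153000, "https://github.com/flutter/flutter"),
]

# zip(*rows) turns a list of rows into one tuple per column
topic_repos_sample = {
    name: list(values) for name, values in zip(columns, zip(*rows))
}
print(topic_repos_sample["stars"])
```

Note that with an empty `rows` list this produces an empty dictionary rather than four empty lists, so the explicit loop may still be preferable when a page can return no repositories.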



## Final code
### Get top repositories from a topic page

In [51]:
import os

def get_topic_page(topic_url):
    # download the page
    response = requests.get(topic_url)
    # check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')

    return topic_doc

def get_repo_info(h3_tag, star_tag):
    # return (username, repo_name, stars, repo_url) for one repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())

    return username, repo_name, stars, repo_url


def get_topic_repos(topic_doc):
    
    # Get h3 tags with repo title, url and username
    h3_selection_class = "f3 color-fg-muted text-normal lh-condensed"
    repo_tags = topic_doc.find_all('h3',{'class': h3_selection_class})
    # get star tags
    star_tags = topic_doc.find_all('span',{'class': 'Counter js-social-count' })
    
    topic_repos_dict = {
        'username' : [],
        'repo_name' : [],
        'stars': [],
        'repo_url': []
    }
    
    # get repo info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    
    return pd.DataFrame(topic_repos_dict)


In [52]:
def scrape_topic(topic_url, path):
    # skip topics whose CSV file already exists
    if os.path.exists(path):
        print('The file {} already exists. Skipping..'.format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=False)

In [53]:
get_topic_repos(get_topic_page(topic_urls[6])).to_csv('ansible.csv', index=False)

Write a single function to:
1. Get the list of topics from the topics page.
2. Get the list of top repos from the individual topic pages.
3. For each topic, create a CSV of the top repos for the topic.

In [54]:
# to get the list of titles
def get_topic_titles(doc):
    # obtain class from inspecting the webpage
    selection_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles


def get_topic_descs(doc):
    desc_selector = "f5 color-fg-muted mb-0 mt-1"
    topic_desc_tags = doc.find_all('p', {'class':desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs
    
    
def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a',{'class':"no-underline flex-1 d-flex flex-column"})
    topic_urls = []
    base_url = "https://github.com"

    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])

    return topic_urls
    


In [55]:
# function to put the above functions together
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # parse the downloaded page instead of relying on the global `doc`
    doc = BeautifulSoup(response.text, 'html.parser')

    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }

    return pd.DataFrame(topics_dict)
   

### Putting everything together

- A function to get the list of topics
- A function to create a CSV for the scraped repos
- A new function to put them together

In [56]:
def scrape_topics_repos():
    print('Scraping list of topics ')
    topics_df = scrape_topics()
    # create a folder for csvs
    os.makedirs('data', exist_ok=True)
    
    for index,row in topics_df.iterrows():
        print('Scraping top repositories for {}'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))
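
Topic titles are used directly as file names here. That works for the titles listed earlier, but a title containing a path separator would break the path. A minimal sketch of one way to guard against this follows; `safe_filename` is a hypothetical helper, and the replacement rule (anything other than letters, digits, `.`, `+`, `-`, `_` becomes `_`) is an assumption for illustration.

```python
import re


def safe_filename(title):
    """Replace characters that are risky in file names with underscores."""
    cleaned = re.sub(r"[^A-Za-z0-9.+_-]+", "_", title.strip())
    return cleaned or "untitled"


# 'CI/CD' is a made-up title used only to show the separator case
for title in ["3D", "ASP.NET", "C++", "Command line interface", "CI/CD"]:
    print("data/{}.csv".format(safe_filename(title)))
```

Inside `scrape_topics_repos`, the path could then be built as `'data/{}.csv'.format(safe_filename(row['title']))`.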

In [57]:
scrape_topics_repos()

Scraping list of topics 
Scraping top repositories for 3D
The file data/3D.csv already exists. Skipping..
Scraping top repositories for Ajax
The file data/Ajax.csv already exists. Skipping..
Scraping top repositories for Algorithm
The file data/Algorithm.csv already exists. Skipping..
Scraping top repositories for Amp
The file data/Amp.csv already exists. Skipping..
Scraping top repositories for Android
The file data/Android.csv already exists. Skipping..
Scraping top repositories for Angular
The file data/Angular.csv already exists. Skipping..
Scraping top repositories for Ansible
The file data/Ansible.csv already exists. Skipping..
Scraping top repositories for API
The file data/API.csv already exists. Skipping..
Scraping top repositories for Arduino
The file data/Arduino.csv already exists. Skipping..
Scraping top repositories for ASP.NET
The file data/ASP.NET.csv already exists. Skipping..
Scraping top repositories for Atom
The file data/Atom.csv already exists. Skipping..
Scraping

In [58]:
# read and display a csv using pandas
pd.read_csv('data/Android.csv')

Unnamed: 0,username,repo_name,stars,repo_url
0,flutter,flutter,153000,https://github.com/flutter/flutter
1,facebook,react-native,109000,https://github.com/facebook/react-native
2,justjavac,free-programming-books-zh_CN,102000,https://github.com/justjavac/free-programming-...
3,Genymobile,scrcpy,82800,https://github.com/Genymobile/scrcpy
4,Hack-with-Github,Awesome-Hacking,64400,https://github.com/Hack-with-Github/Awesome-Ha...
5,google,material-design-icons,48000,https://github.com/google/material-design-icons
6,Solido,awesome-flutter,46600,https://github.com/Solido/awesome-flutter
7,wasabeef,awesome-android-ui,46200,https://github.com/wasabeef/awesome-android-ui
8,square,okhttp,43900,https://github.com/square/okhttp
9,android,architecture-samples,42600,https://github.com/android/architecture-samples
