# Pick a website and describe your objective

Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.

Outline:

- Scrape https://github.com/topics
- Get the list of topics. For each topic, get the title, topic URL and topic description
- For each topic, get the top 25 repositories
- For each repository, get the repo name, username, stars and repo URL
- For each topic, create a CSV file

# Use the requests library to download web pages

In [4]:
!pip3 install requests --upgrade --quiet

In [5]:
import requests

In [6]:
topics_url = 'https://github.com/topics'

In [7]:
response = requests.get(topics_url)

In [8]:
response.status_code

200

In [9]:
page_content = response.text

In [10]:
page_content[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-dkuYFW+ra8yYSt342e5pJEeslPSjMcrMvNxlYZMyM/X+/WJHDPvoCuGq3LFojI7B0dQWwZNRiPMnbi9IfUgTaA==" rel="stylesheet" href="https://github.githubassets.com/assets/light-764b98156fab6bcc984addf8d9ee6924.css" /><link crossorigin="anonymous" media="all" integrity="sha512-UrAu23+eyncWvaQFwsLbgSKtmLb2aH1bcT4hJnnRdkaPuY1eu9bumt33FyHHFDX8hskTUNWNkIsMCz7F

In [11]:
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_content)

# Use Beautiful Soup to parse and extract information

In [12]:
!pip3 install beautifulsoup4 --upgrade --quiet

In [13]:
from bs4 import BeautifulSoup

In [14]:
doc = BeautifulSoup(page_content, 'html.parser')

In [15]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = doc.find_all('p',{'class': selection_class})

In [16]:
len(topic_title_tags)

30

In [17]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [18]:
desc_class = 'f5 color-fg-muted mb-0 mt-1'
topic_desc_tags = doc.find_all('p',{'class': desc_class})

In [19]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [20]:
topic_link_class = 'no-underline flex-grow-0'
topic_link_tag = doc.find_all('a',{'class': topic_link_class})

In [21]:
topic_link_tag[0]['href']

'/topics/3d'

In [22]:
topic_url0 = 'https://github.com' + topic_link_tag[0]['href']
print(topic_url0)

https://github.com/topics/3d
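
Plain string concatenation works here because every `href` on the page starts with a slash. As a minimal sketch of a more defensive alternative, the standard library's `urllib.parse.urljoin` resolves a link against a base URL whether the link is relative or already absolute. The helper name `build_topic_url` is hypothetical and not part of the notebook above.

```python
from urllib.parse import urljoin

BASE_URL = 'https://github.com'

def build_topic_url(href, base_url=BASE_URL):
    """Resolve an href such as '/topics/3d' against the base URL.

    Hypothetical helper for illustration only.
    """
    return urljoin(base_url, href)

print(build_topic_url('/topics/3d'))
```

If a link is already absolute, `urljoin` returns it unchanged, so the same helper covers both cases.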


In [23]:
topic_titles =[]
for tag in topic_title_tags:
    topic_titles.append(tag.text)
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [24]:
topic_desc = []
for tag in topic_desc_tags:
    topic_desc.append(tag.text.strip())
print(topic_desc[:5])

['3D modeling is the process of virtually developing the surface and structure of a 3D object.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.', 'Android is an operating system built by Google designed for mobile devices.']


In [25]:
topic_link=[]
base_url = 'https://github.com'
for tag in topic_link_tag:
    topic_link.append(base_url + tag['href'])
print(topic_link[:5])

['https://github.com/topics/3d', 'https://github.com/topics/ajax', 'https://github.com/topics/algorithm', 'https://github.com/topics/amphp', 'https://github.com/topics/android']


In [26]:
!pip3 install pandas --quiet

In [27]:
import pandas as pd

In [28]:
topics_dict= {
    'title': topic_titles,
    'description': topic_desc,
    'urls': topic_link
}

In [29]:
topic_df = pd.DataFrame(topics_dict)
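
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which can happen if GitHub changes one of the CSS classes and a selector matches fewer tags than the others. The following is a small sketch of a check that could run before building the data frame. The function name `check_equal_lengths` is an assumption made for illustration, and the sample values are the first two topics shown above.

```python
def check_equal_lengths(columns):
    """Return the length of every column, or raise if they differ.

    Hypothetical helper: `columns` maps a column name to a list of values.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return lengths

sample = {
    'title': ['3D', 'Ajax'],
    'description': [
        '3D modeling is the process of virtually developing the surface and structure of a 3D object.',
        'Ajax is a technique for creating interactive web applications.',
    ],
    'urls': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
}
print(check_equal_lengths(sample))
```

The error message lists the length of each column, which makes it easier to see which selector stopped matching.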

In [30]:
topic_df

| | title | description | urls |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


In [31]:
topic_df.to_csv('topics.csv')

# Create CSV file(s) with the extracted information

# Getting information out of a page

In [32]:
topic_page_url = topic_link[0]

In [33]:
topic_page_url

'https://github.com/topics/3d'

In [34]:
response = requests.get(topic_page_url)

In [35]:
response.status_code

200

In [36]:
topic_doc =  BeautifulSoup(response.text, 'html.parser')

In [37]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})

In [38]:
a_tags = repo_tags[0].find_all('a')

In [39]:
a_tags[0].text.split()

['mrdoob']

In [40]:
a_tags[1].text.split()

['three.js']

In [41]:
repo_url = base_url + a_tags[1]['href']

In [42]:
repo_url

'https://github.com/mrdoob/three.js'

In [43]:
star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})

In [44]:
len(star_tags)

30

In [45]:
star_tags[0].text.strip()

'78.6k'

In [46]:
def parse_star_count(star_str):
    star_str = star_str.strip()
    if star_str[-1] == 'k':
       return int(float(star_str[:-1]) * 1000)
    return int(star_str)

In [47]:
parse_star_count(star_tags[0].text.strip())

78600
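
`parse_star_count` handles plain integers and the `k` suffix seen on this page. As a hedged sketch, the variant below also accepts an `m` suffix and thousands separators. Both are assumptions about formats that might appear and were not observed in the output above. It uses `round` rather than `int` so that floating-point error in the multiplication cannot truncate the result. The name `parse_count` is hypothetical.

```python
def parse_count(text):
    """Convert strings such as '78.6k', '1.2m' or '1,234' to an int.

    The 'm' suffix and comma separators are assumed formats,
    not ones confirmed on the scraped page.
    """
    text = text.strip().lower().replace(',', '')
    multipliers = {'k': 1_000, 'm': 1_000_000}
    if text and text[-1] in multipliers:
        # round() avoids truncating values like 78599.999... to 78599
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)

print(parse_count('78.6k'))
```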

In [59]:
def get_repo_info(h3_tag, star_tag):
    #Each h3 tag holds two links: the username and the repository
    a_tags = h3_tag.find_all('a')
    #Note: split() returns a list of words, e.g. ['mrdoob']
    username = a_tags[0].text.split()
    repo_name = a_tags[1].text.split()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [60]:
get_repo_info(repo_tags[0], star_tags[0])

(['mrdoob'], ['three.js'], 78600, 'https://github.com/mrdoob/three.js')
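
The username and repository name come back wrapped in lists because `str.split()` returns a list of words. If plain strings are preferred in the CSV, one option is to join the pieces back together, as in the sketch below. The helper `clean_text` and the sample string are hypothetical, since the exact whitespace inside GitHub's link text is assumed rather than copied from the page.

```python
def clean_text(raw):
    """Collapse runs of whitespace and return a single string.

    Hypothetical helper for illustration only.
    """
    return ' '.join(raw.split())

# Assumed shape of the link text: the name surrounded by whitespace
raw_username = '\n        mrdoob\n      '
print(raw_username.split())      # ['mrdoob']
print(clean_text(raw_username))  # mrdoob
```

`raw.strip()` would give the same result for a single word. Joining the split pieces also normalizes any whitespace inside the text.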

In [61]:
topic_repos_dict={
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
}

for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])
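
The loop above appends each element of the tuple by position. The same accumulation can be sketched with `zip`, which pairs column names with values so the indices do not have to be written out. The function `rows_to_columns` is a hypothetical helper. The first sample row is taken from the output above, and the second is a made-up placeholder.

```python
def rows_to_columns(rows, column_names):
    """Turn a list of row tuples into a dict of column lists.

    Hypothetical helper for illustration only.
    """
    columns = {name: [] for name in column_names}
    for row in rows:
        for name, value in zip(column_names, row):
            columns[name].append(value)
    return columns

column_names = ['username', 'repo_name', 'stars', 'repo_url']
rows = [
    ('mrdoob', 'three.js', 78600, 'https://github.com/mrdoob/three.js'),
    # Placeholder row, not real scraped data
    ('example-user', 'example-repo', 1200, 'https://github.com/example-user/example-repo'),
]
columns = rows_to_columns(rows, column_names)
print(columns['repo_name'])
```

In the notebook, the rows could come from a comprehension over `zip(repo_tags, star_tags)`, which also removes the need for `range(len(repo_tags))`.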

In [62]:
topic_repos_dict

{'username': [['mrdoob'],
  ['libgdx'],
  ['pmndrs'],
  ['BabylonJS'],
  ['aframevr'],
  ['ssloy'],
  ['lettier'],
  ['FreeCAD'],
  ['metafizzy'],
  ['CesiumGS'],
  ['timzhang642'],
  ['a1studmuffin'],
  ['isl-org'],
  ['domlysz'],
  ['spritejs'],
  ['blender'],
  ['tensorspace-team'],
  ['jagenjo'],
  ['openscad'],
  ['YadiraF'],
  ['AaronJackson'],
  ['ssloy'],
  ['google'],
  ['mosra'],
  ['gfxfundamentals'],
  ['FyroxEngine'],
  ['cleardusk'],
  ['jasonlong'],
  ['tengbao'],
  ['cnr-isti-vclab']],
 'repo_name': [['three.js'],
  ['libgdx'],
  ['react-three-fiber'],
  ['Babylon.js'],
  ['aframe'],
  ['tinyrenderer'],
  ['3d-game-shaders-for-beginners'],
  ['FreeCAD'],
  ['zdog'],
  ['cesium'],
  ['3D-Machine-Learning'],
  ['SpaceshipGenerator'],
  ['Open3D'],
  ['BlenderGIS'],
  ['spritejs'],
  ['blender'],
  ['tensorspace'],
  ['webglstudio.js'],
  ['openscad'],
  ['PRNet'],
  ['vrn'],
  ['tinyraytracer'],
  ['model-viewer'],
  ['magnum'],
  ['webgl-fundamentals'],
  ['Fyrox'],
  ['

In [98]:
import os

def get_topic_page(topic_url):
    #Download the page
    response = requests.get(topic_url)
    #Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    #Parse using Beautiful soup
    topic_doc =  BeautifulSoup(response.text, 'html.parser')
    return topic_doc

def get_repo_info(h3_tag, star_tag):
    #Each h3 tag holds two links: the username and the repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.split()
    repo_name = a_tags[1].text.split()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

def get_topic_repos(topic_doc):
    #Get the h3 tags containing repo title, repo URL and username
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})
    #Get the star count tags, one per repository
    star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})
    
    topic_repos_dict={
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
    }

    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)
    
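def scrape_topic(topic_url, path):
    #Skip topics that have already been scraped
    if os.path.exists(path):
        print("The file {} already exists.".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    #Write to the same path that was checked above (it already ends in .csv)
    topic_df.to_csv(path, index = None)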

In [99]:
get_topic_repos(get_topic_page(topic_link[5]))

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | [justjavac] | [free-programming-books-zh_CN] | 86800 | https://github.com/justjavac/free-programming-... |
| 1 | [angular] | [angular] | 79300 | https://github.com/angular/angular |
| 2 | [storybookjs] | [storybook] | 68500 | https://github.com/storybookjs/storybook |
| 3 | [leonardomso] | [33-js-concepts] | 46300 | https://github.com/leonardomso/33-js-concepts |
| 4 | [ionic-team] | [ionic-framework] | 46300 | https://github.com/ionic-team/ionic-framework |
| 5 | [prettier] | [prettier] | 41800 | https://github.com/prettier/prettier |
| 6 | [SheetJS] | [sheetjs] | 28900 | https://github.com/SheetJS/sheetjs |
| 7 | [angular] | [angular-cli] | 25200 | https://github.com/angular/angular-cli |
| 8 | [angular] | [components] | 22400 | https://github.com/angular/components |
| 9 | [NativeScript] | [NativeScript] | 20800 | https://github.com/NativeScript/NativeScript |


Write a single function to:
1. Get the list of topics from the topics page
2. Get the list of top repos from the individual topic pages
3. Create a CSV of the top repos for each topic

In [100]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p',{'class': selection_class})
    topic_titles =[]
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

def get_topic_desc(doc):
    desc_class = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p',{'class': desc_class})
    topic_desc = []
    for tag in topic_desc_tags:
        topic_desc.append(tag.text.strip())
    return topic_desc

def get_topic_urls(doc):
    topic_link_class = 'no-underline flex-grow-0'
    topic_link_tag = doc.find_all('a',{'class': topic_link_class})
    topic_link=[]
    base_url = 'https://github.com'
    for tag in topic_link_tag:
        topic_link.append(base_url + tag['href'])
    return topic_link
    
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    #Parse the downloaded page using Beautiful Soup
    doc = BeautifulSoup(response.text, 'html.parser')
    topic_dict ={
        'title': get_topic_titles(doc),
        'description':get_topic_desc(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topic_dict)

In [101]:
import os
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    os.makedirs('data', exist_ok = True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))
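
The topic title is used directly as the file name, so titles such as 'Command line interface' produce names containing spaces, and a title containing a slash would be read as a subdirectory. As a minimal sketch, a title could be reduced to a conservative set of characters before it is placed in the path. The helper name `title_to_filename` and the choice of allowed characters are assumptions, not part of the notebook above.

```python
import re

def title_to_filename(title):
    """Turn a topic title into a conservative file name (no extension).

    Hypothetical helper: keeps letters, digits, '.', '_', '+' and '-',
    and replaces every other run of characters with an underscore.
    """
    name = re.sub(r'[^A-Za-z0-9._+-]+', '_', title.strip())
    return name.strip('_') or 'untitled'

for title in ['C++', 'ASP.NET', 'Command line interface']:
    print('data/{}.csv'.format(title_to_filename(title)))
```

Two different titles could map to the same name under this scheme, so the existence check in `scrape_topic` would then skip the second one. For the titles listed earlier this does not occur.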

In [None]:
scrape_topics_repos()

# Document and share your work