### Pick a website and describe your objective

- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.

#### Project Outline

- We are going to scrape https://github.com/topics
- We will get a list of topics. For each topic, we'll get the topic title, topic page URL, and topic description.
- For each topic, we'll get the top 25 repositories from the topic page.
- For each repository, we'll grab the repo name, username, stars, and repo URL.
- For each topic, we will create a CSV file in the following format:

...

reponame,username,stars,repoURL
metafixxy,metafizzy,7000,https://github.com/metafizzy/infinite-scroll
ljianshu,ljianshu,7000,https://github.com/ljianshu/Blog

...
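The file format above can be sketched with Python's built-in `csv` module. The `write_repos_csv` helper is a hypothetical name introduced for illustration, and the sample row reuses values that appear later in this notebook. Writing to a file-like object keeps the sketch easy to try without touching the disk.

```python
import csv
import io

# Column order taken from the project outline.
FIELDNAMES = ["reponame", "username", "stars", "repoURL"]

def write_repos_csv(rows, fileobj):
    """Write a list of repository dicts to fileobj as CSV with a header row."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)

# Sample row for illustration only.
sample_rows = [
    {
        "reponame": "three.js",
        "username": "mrdoob",
        "stars": 79400,
        "repoURL": "https://github.com/mrdoob/three.js",
    },
]

buffer = io.StringIO()
write_repos_csv(sample_rows, buffer)
print(buffer.getvalue())
```

When writing to a real file instead of a buffer, open it with `newline=""` so the `csv` module controls the line endings.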

    
    

### Use the requests library to download web pages

- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.
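The last step in the list, a reusable download function, might look like the sketch below. The names `topic_url` and `download_topic_pages`, the `pages` folder, and the idea of passing the HTTP call in as a `fetch` argument are all assumptions made for illustration; they are not part of the original notebook.

```python
import os

BASE_TOPIC_URL = "https://github.com/topics"

def topic_url(slug):
    """Build the topic page URL for a slug such as '3d' or 'ajax'."""
    return f"{BASE_TOPIC_URL}/{slug}"

def download_topic_pages(slugs, fetch, out_dir="pages"):
    """Fetch each topic page and save it as <out_dir>/<slug>.html.

    `fetch` is any callable that takes a URL and returns the page HTML
    as a string, which keeps this helper independent of the HTTP library.
    """
    os.makedirs(out_dir, exist_ok=True)
    saved_paths = []
    for slug in slugs:
        html = fetch(topic_url(slug))
        path = os.path.join(out_dir, f"{slug}.html")
        with open(path, "w", encoding="utf-8") as f:
            f.write(html)
        saved_paths.append(path)
    return saved_paths
```

With the requests library, `fetch` could be as simple as `lambda url: requests.get(url).text`, ideally with a check of the status code before the text is used.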


In [1]:
!pip install requests --upgrade




In [2]:
import requests

In [3]:
topics_url = "https://github.com/topics"

In [4]:
response = requests.get(topics_url)


In [5]:
# HTTP status code of the response (200 means success)
response.status_code

200

In [6]:
len(response.text)

189281

In [7]:
page_contents = response.text

In [8]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-9M4GwJqBATCm6CMz2UrCo6uuX1/Wa8wUnm7N5BQhGHFch1oIX2y8dcpUXfnQVQ2HE2bD287O5YMXuc5jFAcU8w==" rel="stylesheet" href="https://github.githubassets.com/assets/light-f4ce06c09a810130a6e8.css" /><link crossorigin="anonymous" media="all" integrity="sha512-BEwN74xxmv+L2zArQGm+kGVvX3bGY85LF4umkZnZ6Zl6IciYbr1IGxIYxqnUCW0RDEN1kM7glvXkpmQsntPVYQ==" re

In [10]:
#writing page contents to a file
with open('webpage.html', 'w', encoding="utf-8") as f:
    f.write(page_contents)

### Use Beautiful Soup to parse and extract information

- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.


In [11]:
!pip install beautifulsoup4 --upgrade



In [12]:
from bs4 import BeautifulSoup

In [13]:
# doc is a BeautifulSoup object that holds the parsed HTML
doc = BeautifulSoup(page_contents,'html.parser')

In [14]:
type(doc)

bs4.BeautifulSoup

In [15]:
# With the doc object, we can query the HTML page for specific tags
topictitle_tags = doc.find_all('p')

In [16]:
len(topictitle_tags)

67

In [17]:
topictitle_tags[:5]

[<p class="f4 color-fg-muted col-md-6 mx-auto">Browse popular topics on GitHub.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         scikit-learn
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">scikit-learn is a Python module for machine learning.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         C
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">C is a general purpose programming language that first appeared in 1972.</p>]

In [18]:
selection_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
#p_tags = doc.find_all('p', {'class' :selection_class})
topictitle_tags = doc.find_all('p', class_=selection_class)

In [19]:
len(topictitle_tags)

30
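One detail worth knowing about the query above: when `class_` is given a string containing several classes, Beautiful Soup compares it against the class attribute exactly as written, so the order of the classes matters. A CSS selector passed to `select` checks each class independently. The sketch below shows the difference on a made-up two-line snippet; the HTML is an assumption for illustration, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the same classes written in two different orders.
html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f3 lh-condensed Link--primary mb-0 mt-1">Ajax</p>
"""
doc = BeautifulSoup(html, "html.parser")

# Exact-string match: only the tag whose class attribute is written this way.
exact = doc.find_all("p", class_="f3 lh-condensed mb-0 mt-1 Link--primary")

# CSS selector: every <p> that carries both classes, in any order.
by_selector = doc.select("p.f3.Link--primary")

print([p.text for p in exact])        # ['3D']
print([p.text for p in by_selector])  # ['3D', 'Ajax']
```

If the site reorders its classes, the selector form is less likely to silently return an empty list.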

In [29]:
topictitle_tags

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Atom</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Awesome Lists</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amazon Web Services</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Azure</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Babel</p>,
 <p class="f3 lh-condensed m

In [20]:
desc_selector = "f5 color-fg-muted mb-0 mt-1"
topic_desc_tags= doc.find_all('p', class_=desc_selector)

In [21]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [22]:
topic_title_tag0= topictitle_tags[0]
topic_title_tag0

<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>

In [23]:
topic_title_tag0.parent

<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D modeling is the process of virtually developing the surface and structure of a 3D object.
        </p>
</a>

In [38]:
topic_title_tag0.parent.parent

<div class="py-4 border-bottom d-flex flex-justify-between">
<a class="no-underline flex-grow-0" href="/topics/3d">
<div class="color-bg-accent f4 color-fg-muted text-bold rounded mr-3 flex-shrink-0 text-center" style="width:64px; height:64px; line-height:64px;">
            #
          </div>
</a>
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D modeling is the process of virtually developing the surface and structure of a 3D object.
        </p>
</a>
<div class="flex-grow-0">
<div class="d-block" data-view-component="true">
<a aria-label="You must be signed in to star a repository" class="tooltipped tooltipped-s btn-sm btn" data-hydro-click='{"event_type":"authentication.click","payload":{"location_in_page":"star button","repository_id":null,"auth_type":"LOG_IN","originating_url":"https://github.com/topics","user_id":null}}' data-hydro-click-hmac="5d69675

In [24]:
topic_link_tags = doc.find_all('a',{'class' :'no-underline flex-grow-0'})

In [25]:
len(topic_link_tags)

30

In [26]:
topic_link_tags[0]['href']

'/topics/3d'

In [27]:
topic0_url = 'https://github.com'+topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d
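String concatenation works here because every `href` on the page starts with `/`. As a slightly more defensive alternative, the standard library's `urljoin` resolves relative links and leaves absolute ones unchanged. The `absolute_url` helper name is an assumption for this sketch.

```python
from urllib.parse import urljoin

base_url = "https://github.com"

def absolute_url(href):
    """Resolve an href found on the page against the site root."""
    return urljoin(base_url, href)

print(absolute_url("/topics/3d"))                      # https://github.com/topics/3d
print(absolute_url("https://github.com/topics/ajax"))  # https://github.com/topics/ajax
```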


In [28]:
topictitle_tags[0].text

'3D'

In [29]:
topic_titles = []
for tag in topictitle_tags:
    topic_titles.append(tag.text)
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [30]:
topic_descriptions =[]
for desc in topic_desc_tags:
    topic_descriptions.append(desc.text.strip())  # strip() removes leading and trailing whitespace
print(topic_descriptions)

['3D modeling is the process of virtually developing the surface and structure of a 3D object.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.', 'Android is an operating system built by Google designed for mobile devices.', 'Angular is an open source web application platform.', 'Ansible is a simple and powerful automation engine.', 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.', 'Arduino is an open source hardware and software company and maker community.', 'ASP.NET is a web framework for building modern web apps and services.', 'Atom is a open source text editor built with web technologies.', 'An awesome list is a list of awesome things curated by the community.', 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.', 'Azure is a cloud

In [31]:
topic_urls = []
base_url = 'https://github.com'
for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])
    
print(topic_urls)
    

['https://github.com/topics/3d', 'https://github.com/topics/ajax', 'https://github.com/topics/algorithm', 'https://github.com/topics/amphp', 'https://github.com/topics/android', 'https://github.com/topics/angular', 'https://github.com/topics/ansible', 'https://github.com/topics/api', 'https://github.com/topics/arduino', 'https://github.com/topics/aspnet', 'https://github.com/topics/atom', 'https://github.com/topics/awesome', 'https://github.com/topics/aws', 'https://github.com/topics/azure', 'https://github.com/topics/babel', 'https://github.com/topics/bash', 'https://github.com/topics/bitcoin', 'https://github.com/topics/bootstrap', 'https://github.com/topics/bot', 'https://github.com/topics/c', 'https://github.com/topics/chrome', 'https://github.com/topics/chrome-extension', 'https://github.com/topics/cli', 'https://github.com/topics/clojure', 'https://github.com/topics/code-quality', 'https://github.com/topics/code-review', 'https://github.com/topics/compiler', 'https://github.com/t
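The three loops above rely on the title, description, and link lists staying the same length and order. A possible alternative, sketched here on a trimmed-down copy of the card markup shown earlier, reads all three fields from the same `<a>` tag. The `parse_topic_cards` name and the `a.no-underline.flex-column` selector are assumptions made for this sketch; the live page may use different classes.

```python
from bs4 import BeautifulSoup

base_url = "https://github.com"

# Trimmed-down card markup modeled on the output shown earlier.
html = """
<div class="py-4 border-bottom d-flex flex-justify-between">
  <a class="no-underline flex-grow-0" href="/topics/3d"><div>#</div></a>
  <a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
    <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
    <p class="f5 color-fg-muted mb-0 mt-1">
      3D modeling is the process of virtually developing the surface and structure of a 3D object.
    </p>
  </a>
</div>
"""
doc = BeautifulSoup(html, "html.parser")

def parse_topic_cards(doc):
    """Return one dict per topic card, reading every field from the same link."""
    topics = []
    for link in doc.select("a.no-underline.flex-column"):
        paragraphs = link.find_all("p")
        if len(paragraphs) < 2:
            continue  # skip cards that lack a title or a description
        topics.append({
            "title": paragraphs[0].text.strip(),
            "description": paragraphs[1].text.strip(),
            "url": base_url + link["href"],
        })
    return topics

print(parse_topic_cards(doc))
```

A list of dicts like this can be passed straight to `pd.DataFrame`, so a missing description on one card cannot shift the values of the cards after it.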

### Create CSV file(s) with the extracted information

- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.
- Verify the information in the CSV files by reading them back using Pandas.

In [32]:
!pip install pandas --upgrade --quiet


In [33]:
import pandas as pd

In [34]:
topics_dict= {'title':topic_titles,'description':topic_descriptions ,'url':topic_urls }

In [36]:
topic_df = pd.DataFrame(topics_dict)
topic_df

Unnamed: 0,title,description,url
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


In [37]:
topic_df.to_csv("topic.csv", index=False)  # index=False leaves out the row-number column
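To verify the file, it can be read back with Pandas and compared against the original DataFrame. The sketch below is self-contained: it rebuilds a two-row stand-in for `topic_df` from values shown above and writes to a temporary folder, both of which are choices made for illustration only.

```python
import os
import tempfile

import pandas as pd

# Two rows copied from the table above, standing in for topic_df.
topic_df = pd.DataFrame({
    "title": ["3D", "Ajax"],
    "description": [
        "3D modeling is the process of virtually developing the surface and structure of a 3D object.",
        "Ajax is a technique for creating interactive web applications.",
    ],
    "url": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
})

path = os.path.join(tempfile.mkdtemp(), "topic.csv")
topic_df.to_csv(path, index=False)

# Read the file back and check that nothing was lost or reordered.
loaded = pd.read_csv(path)
assert list(loaded.columns) == list(topic_df.columns)
assert len(loaded) == len(topic_df)
assert loaded["url"].tolist() == topic_df["url"].tolist()
```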


### Getting information out of a topic page

In [38]:
topic_page_url = topic_urls[0]
topic_page_url

'https://github.com/topics/3d'

In [39]:
response = requests.get(topic_page_url)

In [40]:
response.status_code

200

In [41]:
len(response.text)

675303

In [43]:
topic_doc = BeautifulSoup(response.text,'html.parser')


In [139]:
len(topic_doc)


5

In [45]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})


In [46]:
repo_tags

[<h3 class="f3 color-fg-muted text-normal lh-condensed">
 <a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>          /
           <a class="text-bold wb-break-word" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d8

In [47]:
len(repo_tags)

30

In [48]:
repo_tags[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="text-bold wb-break-word" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d897521

In [49]:
a_tags =repo_tags[0].find_all('a')
a_tags

[<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>,
 <a class="text-bold wb-break-word" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-view-component="tr

In [50]:
a_tags[0]

<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-view-component="true" href="/mrdoob">
            mrdoob
</a>

In [52]:
a_tags[0].text

'\n            mrdoob\n'

In [53]:
a_tags[0].text.strip()

'mrdoob'

In [55]:
a_tags[1].text.strip()

'three.js'

In [57]:
repo_url= base_url + a_tags[1]['href']
print(repo_url)

https://github.com/mrdoob/three.js


In [64]:
star = 'Counter js-social-count'
star_tags= topic_doc.find_all('span',{'class':star})

In [65]:
len(star_tags)

30

In [66]:
star_tags[0].text.strip()

'79.4k'

In [71]:
def star_parse_count(stars_str):
    stars_str=stars_str.strip()
    if stars_str[-1]=='k':
        return int(float(stars_str[:-1])*1000)
    return int(stars_str)
    

In [72]:
star_parse_count(star_tags[0].text.strip())

79400
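The parser above handles plain numbers and the `k` suffix. A slightly more tolerant variant is sketched below under the name `parse_star_count`. The `m` suffix and the thousands separator are assumptions about formats the page might use; neither was observed in this notebook.

```python
def parse_star_count(text):
    """Convert strings such as '79.4k', '1,234' or '987' to an int."""
    text = text.strip().lower().replace(",", "")
    multipliers = {"k": 1_000, "m": 1_000_000}
    if text and text[-1] in multipliers:
        # round() avoids losing one star if the float product lands just below a whole number
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)

print(parse_star_count("79.4k"))  # 79400
print(parse_star_count("987"))    # 987
```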

In [89]:
def get_repo_info(h3_tag, star_tag):
    """Return (username, reponame, repourl, stars) for one repository."""
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    reponame = a_tags[1].text.strip()
    repourl = base_url + a_tags[1]['href']
    stars = star_parse_count(star_tag.text.strip())
    return username, reponame, repourl, stars


In [90]:
get_repo_info(repo_tags[0], star_tags[0])


('mrdoob', 'three.js', 'https://github.com/mrdoob/three.js', 79400)
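A plain tuple like this has to be read by position, which makes it easy to mix up the URL and the star count later on. One option, sketched below, is a `namedtuple` so that each value can be accessed by name. The `RepoInfo` type is an assumption introduced for illustration; the values are the ones returned above.

```python
from collections import namedtuple

# Field order matches the tuple returned by get_repo_info.
RepoInfo = namedtuple("RepoInfo", ["username", "reponame", "repourl", "stars"])

info = RepoInfo("mrdoob", "three.js", "https://github.com/mrdoob/three.js", 79400)

# Build the column dict by field name instead of by index.
topic_repos = {field: [] for field in RepoInfo._fields}
for field, value in info._asdict().items():
    topic_repos[field].append(value)

print(info.stars)            # 79400
print(topic_repos["repourl"])  # ['https://github.com/mrdoob/three.js']
```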

In [94]:
topic_repos_dict ={
    'username':[],
    'reponame':[],
    'stars':[],
    'repourl':[]
    
}

In [96]:
for i in range(len(repo_tags)):
    # get_repo_info returns (username, reponame, repourl, stars)
    username, reponame, repourl, stars = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(username)
    topic_repos_dict['reponame'].append(reponame)
    topic_repos_dict['stars'].append(stars)
    topic_repos_dict['repourl'].append(repourl)


    

In [100]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

def get_repo_info(h3_tag, star_tag):
    # Return all the required info about one repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    reponame = a_tags[1].text.strip()
    repourl = base_url + a_tags[1]['href']
    stars = star_parse_count(star_tag.text.strip())
    return username, reponame, repourl, stars

def get_topic_repos(topic_doc):
    # Get h3 tags containing repo title, repo URL and username
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})
    # Get star tags
    star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})

    topic_repos_dict = {
        'username': [],
        'reponame': [],
        'stars': [],
        'repourl': []
    }
    # Get repo info; get_repo_info returns (username, reponame, repourl, stars)
    for i in range(len(repo_tags)):
        username, reponame, repourl, stars = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(username)
        topic_repos_dict['reponame'].append(reponame)
        topic_repos_dict['stars'].append(stars)
        topic_repos_dict['repourl'].append(repourl)
    return pd.DataFrame(topic_repos_dict)


In [104]:
url4 = topic_urls[4]

In [105]:
topic4_doc = get_topic_page(url4)

In [108]:
topic4_repos = get_topic_repos(topic4_doc)
topic4_repos

In [97]:
topic_repos_dict

{'username': ['mrdoob',
  'libgdx',
  'pmndrs',
  'BabylonJS',
  'aframevr',
  'lettier',
  'ssloy',
  'FreeCAD',
  'metafizzy',
  'CesiumGS',
  'timzhang642',
  'a1studmuffin',
  'isl-org',
  'domlysz',
  'blender',
  'spritejs',
  'openscad',
  'tensorspace-team',
  'jagenjo',
  'YadiraF',
  'AaronJackson',
  'ssloy',
  'google',
  'mosra',
  'gfxfundamentals',
  'FyroxEngine',
  'cleardusk',
  'tengbao',
  'jasonlong',
  'cnr-isti-vclab',
  'mrdoob',
  'libgdx',
  'pmndrs',
  'BabylonJS',
  'aframevr',
  'lettier',
  'ssloy',
  'FreeCAD',
  'metafizzy',
  'CesiumGS',
  'timzhang642',
  'a1studmuffin',
  'isl-org',
  'domlysz',
  'blender',
  'spritejs',
  'openscad',
  'tensorspace-team',
  'jagenjo',
  'YadiraF',
  'AaronJackson',
  'ssloy',
  'google',
  'mosra',
  'gfxfundamentals',
  'FyroxEngine',
  'cleardusk',
  'tengbao',
  'jasonlong',
  'cnr-isti-vclab'],
 'reponame': ['three.js',
  'libgdx',
  'react-three-fiber',
  'Babylon.js',
  'aframe',
  '3d-game-shaders-for-beginne

In [98]:
topic_repos_df = pd.DataFrame(topic_repos_dict)

In [99]:
topic_repos_df

Unnamed: 0,username,reponame,stars,repourl
0,mrdoob,three.js,79400,https://github.com/mrdoob/three.js
1,libgdx,libgdx,19700,https://github.com/libgdx/libgdx
2,pmndrs,react-three-fiber,16900,https://github.com/pmndrs/react-three-fiber
3,BabylonJS,Babylon.js,16000,https://github.com/BabylonJS/Babylon.js
4,aframevr,aframe,13800,https://github.com/aframevr/aframe
5,lettier,3d-game-shaders-for-beginners,12300,https://github.com/lettier/3d-game-shaders-for...
6,ssloy,tinyrenderer,12200,https://github.com/ssloy/tinyrenderer
7,FreeCAD,FreeCAD,10800,https://github.com/FreeCAD/FreeCAD
8,metafizzy,zdog,9000,https://github.com/metafizzy/zdog
9,CesiumGS,cesium,8300,https://github.com/CesiumGS/cesium


### Document and share your work

- Add proper headings and documentation in your Jupyter notebook.
- Publish your Jupyter notebook to your Jovian profile.
- (Optional) Write a blog post about your project and share it online.