# Pick a website and describe your objective

**-Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.**

**-Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.**

**-Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.**



### Here are the steps we'll follow:

-We're going to scrape https://github.com/topics

-We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description

-For each topic, we'll get the top 25 repositories in the topic from the topic page

-For each repository, we'll grab the repo name, username, stars and repo URL

-For each topic we'll create a CSV file in the following format:

Repo Name,Username,Stars,Repo URL

three.js,mrdoob,69700,https://github.com/mrdoob/three.js

libgdx,libgdx,18300,https://github.com/libgdx/libgdx
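
The target layout can be sketched with Python's standard `csv` module. This is only an illustration: `RepoRecord` and `records_to_csv` are hypothetical names that do not appear in the notebook, and the two rows are the sample rows shown above.

```python
import csv
import io
from dataclasses import dataclass, astuple

HEADER = ["Repo Name", "Username", "Stars", "Repo URL"]


@dataclass
class RepoRecord:
    # Hypothetical container; the field order matches the CSV header.
    repo_name: str
    username: str
    stars: int
    repo_url: str


def records_to_csv(records):
    """Return the records as CSV text, header row first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    for record in records:
        writer.writerow(astuple(record))
    return buffer.getvalue()


sample = [
    RepoRecord("three.js", "mrdoob", 69700, "https://github.com/mrdoob/three.js"),
    RepoRecord("libgdx", "libgdx", 18300, "https://github.com/libgdx/libgdx"),
]
print(records_to_csv(sample))
```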

In [1]:
!pip install requests





In [2]:
import requests

In [3]:
topics_url='https://github.com/topics'

In [4]:
response=requests.get(topics_url)

In [5]:
response.status_code  # 200 means the request was successful

200

In [6]:
len(response.text)

175598

In [7]:
page_contents=response.text

In [8]:
page_contents[:500]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" cro'

In [9]:
with open('webpage.html','w',encoding='UTF-8') as f:
    f.write(page_contents)

# Use the requests library to download web pages



-Inspect the website's HTML source and identify the right URLs to download.

-Download and save web pages locally using the requests library.

-Create a function to automate downloading for different topics/search queries.
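
The last step might look like the sketch below. `download_page` is a hypothetical helper, not something defined in this notebook. By default it calls `requests.get`; the optional `fetch` argument is an assumption added here so that another callable can be substituted, for example when testing without network access.

```python
def download_page(url, path, fetch=None):
    """Download `url`, save the HTML to `path`, and return the HTML text."""
    if fetch is None:
        # Imported lazily so that a custom `fetch` does not require requests.
        import requests
        fetch = requests.get
    response = fetch(url)
    if response.status_code != 200:
        raise RuntimeError(
            "Failed to load page {} (status {})".format(url, response.status_code)
        )
    with open(path, "w", encoding="utf-8") as f:
        f.write(response.text)
    return response.text
```

A call such as `download_page('https://github.com/topics', 'topics.html')` would then cover the download and the local save in one step.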


# Use Beautiful Soup to parse and extract information

-Parse and explore the structure of downloaded web pages using Beautiful Soup.

-Use the right properties and methods to extract the required information.

-Create functions to extract information from the page into lists and dictionaries.

-(Optional) Use a REST API to acquire additional information if required.
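
As a small illustration of the first three steps, the sketch below parses a hand-written HTML fragment and collects the results into a list of dictionaries. The fragment and its class names (`card`, `title`, `desc`) are made up for this example and are not GitHub's real markup. It uses CSS selectors through `select` and `select_one`, which match on individual classes rather than on the full class string.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration; real pages are far larger.
html = """
<div class="card"><a href="/topics/3d"><p class="title">3D</p>
  <p class="desc"> 3D modeling. </p></a></div>
<div class="card"><a href="/topics/ajax"><p class="title">Ajax</p>
  <p class="desc"> Ajax technique. </p></a></div>
"""

doc = BeautifulSoup(html, "html.parser")

topics = []
for card in doc.select("div.card"):
    topics.append({
        "title": card.select_one("p.title").text.strip(),
        "description": card.select_one("p.desc").text.strip(),
        "url": card.select_one("a")["href"],
    })

print(topics)
```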


In [10]:
!pip install beautifulsoup4





In [11]:
from bs4 import BeautifulSoup

In [12]:
parsed_doc=BeautifulSoup(page_contents,'html.parser')

In [13]:
type(parsed_doc)

bs4.BeautifulSoup

In [14]:
selection_class='f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags=parsed_doc.find_all('p',{'class':selection_class})

In [15]:
len(topic_title_tags)

30

In [16]:
topic_title_tags[0:10]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>]

In [17]:
sc2='f5 color-fg-muted mb-0 mt-1'
topic_description_tags=parsed_doc.find_all('p',{'class':sc2})

In [18]:
len(topic_description_tags)

30

In [19]:
topic_description_tags[:10]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Angular is an open source web application platform.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ansible is a simple and powerful automation engine.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           An API (Applicati

In [20]:
sc3='no-underline flex-grow-0'
topic_url_tags=parsed_doc.find_all('a',{'class':sc3})

In [21]:
len(topic_url_tags)

30

In [22]:
topic_url_tags[:10]

[<a class="no-underline flex-grow-0" href="/topics/3d">
 <div class="color-bg-accent f4 color-fg-muted text-bold rounded mr-3 flex-shrink-0 text-center" style="width:64px; height:64px; line-height:64px;">
             #
           </div>
 </a>,
 <a class="no-underline flex-grow-0" href="/topics/ajax">
 <img alt="ajax" class="rounded mr-3" height="64" src="https://raw.githubusercontent.com/github/explore/8be26d91eb231fec0b8856359979ac09f27173fd/topics/ajax/ajax.png" width="64"/>
 </a>,
 <a class="no-underline flex-grow-0" href="/topics/algorithm">
 <div class="color-bg-accent f4 color-fg-muted text-bold rounded mr-3 flex-shrink-0 text-center" style="width:64px; height:64px; line-height:64px;">
             #
           </div>
 </a>,
 <a class="no-underline flex-grow-0" href="/topics/amphp">
 <img alt="amphp" class="rounded mr-3" height="64" src="https://raw.githubusercontent.com/github/explore/99fe59c0f4fb5d6545311440b4ce89a0d82b0804/topics/amphp/amphp.png" width="64"/>
 </a>,
 <a class

In [23]:
topic_url_tags[0]['href']

'/topics/3d'
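
The `href` is site-relative, so it has to be combined with the base URL before it can be requested. Doing this by string concatenation makes it easy to drop or double a slash. One alternative, sketched here with the standard library's `urllib.parse.urljoin`, resolves the relative path against the base; `absolute_url` is a hypothetical helper name, not part of this notebook.

```python
from urllib.parse import urljoin

base_url = "https://github.com"


def absolute_url(href):
    """Resolve a site-relative href such as '/topics/3d' against base_url."""
    return urljoin(base_url, href)


print(absolute_url("/topics/3d"))
```

The result is the same whether or not the base URL ends with a slash.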

In [24]:
topic0Url='https://github.com' + topic_url_tags[0]['href']
print(topic0Url)

https://github.com/topics/3d


In [25]:
topics=[]
for topic_tag in topic_title_tags:
    topics.append(topic_tag.text)

In [26]:
type(topics)

list

In [27]:
topics_desc=[]
for desc_tag in topic_description_tags:
    topics_desc.append(desc_tag.text.strip())
topics_desc[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [28]:
topic_urls=[]
for tag in topic_url_tags:
    # each href already starts with '/', so the base URL has no trailing slash
    topic_urls.append('https://github.com'+tag['href'])
topic_urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']

In [29]:
import pandas as pd

In [30]:
dic_list={'Topic':topics,
         'Description':topics_desc,
         'URLs':topic_urls}

In [31]:
df=pd.DataFrame(dic_list)

In [32]:
df.head()

,Topic,Description,URLs
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android


# Create CSV file(s) with the extracted information


-Create functions for the end-to-end process of downloading, parsing, and saving CSVs.

-Execute the function with different inputs to create a dataset of CSV files.

-Verify the information in the CSV files by reading them back using Pandas.
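
These three steps can be sketched as a small driver. To stay dependency-free, the sketch uses the standard `csv` module in place of pandas, and `get_rows` is a stand-in for the download-and-parse step. The function names and the `data` output directory are assumptions made for illustration.

```python
import csv
import os

FIELDS = ["username", "repo_name", "stars", "repo_url"]


def save_topic_csv(topic_name, rows, out_dir="data"):
    """Write one topic's rows (a list of dicts) to <out_dir>/<topic_name>.csv."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, topic_name + ".csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return path


def read_topic_csv(path):
    """Read a saved file back as a list of dicts (values come back as strings)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def scrape_topics(topic_names, get_rows, out_dir="data"):
    """For each topic: collect rows, save a CSV, and verify the row count."""
    paths = []
    for topic in topic_names:
        rows = get_rows(topic)
        path = save_topic_csv(topic, rows, out_dir)
        if len(read_topic_csv(path)) != len(rows):
            raise ValueError("row count mismatch in " + path)
        paths.append(path)
    return paths
```

Reading each file back immediately after writing it catches truncated or malformed output before the next topic is processed.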

In [33]:
df.to_csv('topics.csv',index=None)

### Getting information out of a topic page

In [34]:
repos_url=df['URLs'][0]

In [35]:
response=requests.get(repos_url)

In [36]:
response.status_code

200

In [37]:
type(response)

requests.models.Response

In [38]:
topic_contents=response.text

In [39]:
len(topic_contents)

665641

In [40]:
topic_contents[:500]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" cro'

In [41]:
with open('webpage.html','w',encoding='UTF-8') as f:
    f.write(topic_contents)
    

In [42]:
parsed_cont=BeautifulSoup(topic_contents,'html.parser')

In [43]:
type(parsed_cont)

bs4.BeautifulSoup

In [44]:
rep_class='f3 color-fg-muted text-normal lh-condensed'
repo_title_tags=parsed_cont.find_all('h3',{'class':rep_class})

In [45]:
len(repo_title_tags)

30

In [46]:
a_tags=repo_title_tags[0].find_all('a')

In [47]:
a_tags[0].text

'\n            mrdoob\n'

In [48]:
a_tags[0].text.strip()

'mrdoob'

In [49]:
a_tags[1].text

'\n            three.js\n'

In [50]:
a_tags[1].text.strip()

'three.js'

In [51]:
base_url='https://github.com'

In [52]:
repo_title_url=base_url + a_tags[1]['href']
print(repo_title_url)

https://github.com/mrdoob/three.js


In [53]:
repo_title_url

'https://github.com/mrdoob/three.js'

In [54]:
star_class='tooltipped tooltipped-s btn-sm btn BtnGroup-item color-bg-default'
star_title_tags=parsed_cont.find_all('a',{'class':star_class})
# len(star_title_tags)

In [55]:
star_title_tags[0].text.strip()[6:]


'78.3k'

In [56]:

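def to_num(stars_str):
    # Convert a star count such as '78.3k' or '512' to an integer
    stars_str=stars_str.strip()
    if stars_str[-1]=='k':
        # round() avoids truncating a float result that lands just below a whole number
        return int(round(float(stars_str[:-1])*1000))
    else:
        return int(float(stars_str))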
    

In [57]:
to_num(star_title_tags[0].text.strip()[6:])

78300
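
A variant of `to_num` can avoid binary floating point altogether by using `decimal.Decimal`. The sketch below is one possible approach, not the notebook's method: `parse_count` is a hypothetical name, and support for an `m` suffix and for thousands separators is an assumption that goes beyond the `k` format seen on the page.

```python
from decimal import Decimal

# Multipliers for the suffixes this sketch understands.
_SUFFIXES = {"k": 1_000, "m": 1_000_000}


def parse_count(text):
    """Convert strings like '78.3k', '1,234' or '512' to an int."""
    text = text.strip().lower().replace(",", "")
    multiplier = 1
    if text and text[-1] in _SUFFIXES:
        multiplier = _SUFFIXES[text[-1]]
        text = text[:-1]
    # Decimal keeps '78.3' exact, so the product is a whole number.
    return int(Decimal(text) * multiplier)


print(parse_count("78.3k"))
```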

In [59]:
def get_repo_info(h_tags,star_tags):
    a_tags=h_tags.find_all('a')
    username=a_tags[0].text.strip()
    repo_name=a_tags[1].text.strip()
    repo_url=base_url + a_tags[1]['href']
    stars=to_num(star_tags.text.strip()[6:])
    return username,repo_name,stars,repo_url

In [60]:
get_repo_info(repo_title_tags[0],star_title_tags[0])

('mrdoob', 'three.js', 78300, 'https://github.com/mrdoob/three.js')

In [61]:
topic_repos_dict={
    'username':[],
    'repo_name':[],
    'stars':[],
    'repo_url':[]
}
for i in range(len(repo_title_tags)):
    repo_info=get_repo_info(repo_title_tags[i],star_title_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])
    

In [62]:
topic_df=pd.DataFrame(topic_repos_dict)

In [63]:
topic_df.head()

,username,repo_name,stars,repo_url
0,mrdoob,three.js,78300,https://github.com/mrdoob/three.js
1,libgdx,libgdx,19600,https://github.com/libgdx/libgdx
2,pmndrs,react-three-fiber,16500,https://github.com/pmndrs/react-three-fiber
3,BabylonJS,Babylon.js,15800,https://github.com/BabylonJS/Babylon.js
4,aframevr,aframe,13700,https://github.com/aframevr/aframe


In [105]:
def get_repo_info(h_tags,star_tags):
    a_tags=h_tags.find_all('a')
    username=a_tags[0].text.strip()
    repo_name=a_tags[1].text.strip()
    repo_url=base_url + a_tags[1]['href']
    stars=to_num(star_tags.text.strip()[6:])
    return username,repo_name,stars,repo_url


def get_topic_page(topic_url):
    # Download the topic page and fail early on a non-200 response
    response=requests.get(topic_url)
    if response.status_code !=200:
        raise Exception('Failed to load page {}'.format(topic_url))
    topic_contents_repo=response.text
    # Keep a local copy of the most recently downloaded page for inspection
    with open('webpage.html','w',encoding='UTF-8') as f:
        f.write(topic_contents_repo)
    topic_doc=BeautifulSoup(topic_contents_repo,'html.parser')
    return topic_doc



def get_topic_repos(topic_doc):
    # Search the document passed in as topic_doc
    rep_class='f3 color-fg-muted text-normal lh-condensed'
    repo_title_tags=topic_doc.find_all('h3',{'class':rep_class})
    
    star_class='tooltipped tooltipped-s btn-sm btn BtnGroup-item color-bg-default'
    star_title_tags=topic_doc.find_all('a',{'class':star_class})
    
    topic_repos_dict={
        'username':[],
        'repo_name':[],
        'stars':[],
        'repo_url':[]
    }
    for i in range(len(repo_title_tags)):
        repo_info=get_repo_info(repo_title_tags[i],star_title_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    
    topic_df=pd.DataFrame(topic_repos_dict)
    return topic_df
    

In [106]:
url0=df['URLs'][0]
url0

'https://github.com/topics/3d'

In [107]:
topic0_doc=get_topic_page(url0)
topic0_doc


<!DOCTYPE html>

<html data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-764b98156fab6bcc984addf8d9ee6924.css" integrity="sha512-dkuYFW+ra8yYSt342e5pJEeslPSjMcrMvNxlYZMyM/X+/WJHDPvoCuGq3LFojI7B0dQWwZNRiPMnbi9IfUgTaA==" media="all" rel="stylesheet"><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-52b02edb7f9eca7716bda405c2c2db81.css" integrity="sha512-UrAu23+eyncWvaQFwsLbgSKtmLb2aH1bcT4

In [108]:
topic0_repos=get_topic_repos(topic0_doc)

In [109]:
topic0_repos

,username,repo_name,stars,repo_url
0,mrdoob,three.js,78300,https://github.com/mrdoob/three.js
1,libgdx,libgdx,19600,https://github.com/libgdx/libgdx
2,pmndrs,react-three-fiber,16500,https://github.com/pmndrs/react-three-fiber
3,BabylonJS,Babylon.js,15800,https://github.com/BabylonJS/Babylon.js
4,aframevr,aframe,13700,https://github.com/aframevr/aframe
5,ssloy,tinyrenderer,12000,https://github.com/ssloy/tinyrenderer
6,lettier,3d-game-shaders-for-beginners,11900,https://github.com/lettier/3d-game-shaders-for...
7,FreeCAD,FreeCAD,10600,https://github.com/FreeCAD/FreeCAD
8,metafizzy,zdog,9000,https://github.com/metafizzy/zdog
9,CesiumGS,cesium,8200,https://github.com/CesiumGS/cesium


In [104]:
def get_topic_repos(topic_doc):
    # Search the document passed in as topic_doc
    rep_class='f3 color-fg-muted text-normal lh-condensed'
    repo_title_tags=topic_doc.find_all('h3',{'class':rep_class})
    
    star_class='tooltipped tooltipped-s btn-sm btn BtnGroup-item color-bg-default'
    star_title_tags=topic_doc.find_all('a',{'class':star_class})
    
    topic_repos_dict={
        'username':[],
        'repo_name':[],
        'stars':[],
        'repo_url':[]
    }
    
    for i in range(len(repo_title_tags)):
        repo_info=get_repo_info(repo_title_tags[i],star_title_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    
    topic_df=pd.DataFrame(topic_repos_dict)
    return topic_df
    
    