# The purpose of this project is to collect data from two web pages on GitHub. The first page lists GitHub topics. The second lists the repositories for one topic of our choice, the "COVID-19" topic.

In [1]:
## First we install the requests library.
## requests is a library for sending HTTP requests, which lets us download web pages and collect data from them.
!pip install requests



In [2]:
import requests

In [3]:
## requests.get() sends an HTTP GET request to the URL passed as its argument, here GitHub's topics page.
## The server's response is stored as a Response object in the variable 'r'.
r = requests.get("https://github.com/topics")

In [5]:
r.status_code
# The output is the HTTP status code of the response; 200 means the request succeeded.

200
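
The status check above can also be done in code instead of by eye. The sketch below uses only the standard library's `http.HTTPStatus`; the helper name `describe_status` is hypothetical and is not part of requests. It assumes that any 2xx code counts as success. (requests itself provides `r.raise_for_status()`, which raises an exception for 4xx and 5xx responses.)

```python
from http import HTTPStatus


def describe_status(code):
    """Return (is_success, phrase) for an HTTP status code.

    Hypothetical helper for illustration: 2xx codes count as success.
    """
    is_success = 200 <= code < 300
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not a registered HTTP status.
        phrase = "Unknown"
    return is_success, phrase


print(describe_status(200))  # (True, 'OK')
print(describe_status(404))  # (False, 'Not Found')
```

In the notebook this could be called as `describe_status(r.status_code)` before parsing the page.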

In [6]:
## Now we install Beautiful Soup.
## Beautiful Soup is a Python package for parsing HTML documents and extracting data from them.
!pip install beautifulsoup4



In [8]:
from bs4 import BeautifulSoup


In [11]:
## Before we can navigate the page and extract data, the HTML text in r.text must be parsed. This is done by passing it to the BeautifulSoup constructor together with the name of a parser.
soup = BeautifulSoup(r.text, 'html.parser')

In [12]:
## find_all() takes a tag name as a string, plus optional attribute filters, and returns a list of all matching elements.
## Here we search for all p tags whose class attribute equals the string stored in class1, because these tags contain the topic titles.
class1 ="f3 lh-condensed mb-0 mt-1 Link--primary"
topics_title = soup.find_all('p', {'class': class1})
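
How the class filter matches is easy to get wrong, so here is a small offline sketch. The HTML is a made-up sample (the third paragraph, with its classes reordered, is hypothetical) and `soup_demo` is an illustrative name. It compares three ways to filter by class: the full class string as used above, a single class, and a CSS selector.

```python
from bs4 import BeautifulSoup

# Made-up sample HTML; the last <p> is hypothetical and only reorders two classes.
html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">Sample description</p>
<p class="Link--primary f3">Reordered</p>
"""
soup_demo = BeautifulSoup(html, "html.parser")

# 1. Full class string: matches only when the attribute value is exactly this string.
exact = soup_demo.find_all("p", {"class": "f3 lh-condensed mb-0 mt-1 Link--primary"})

# 2. Single class: matches every <p> that has "f3" among its classes.
any_f3 = soup_demo.find_all("p", class_="f3")

# 3. CSS selector: matches every <p> that has both classes, in any order.
both = soup_demo.select("p.f3.Link--primary")

print([p.text for p in exact])   # ['3D']
print([p.text for p in any_f3])  # ['3D', 'Reordered']
print([p.text for p in both])    # ['3D', 'Reordered']
```

The full-string form is the strictest of the three, so it would stop matching if the page added, removed, or reordered a class.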

In [13]:
## The page lists 30 topics.
print(len(topics_title))
topics_title

30


[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Atom</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Awesome Lists</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amazon Web Services</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Azure</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Babel</p>,
 <p class="f3 lh-condensed m

In [14]:
## In the same way, we search for all p tags whose class attribute equals class2, because these tags contain the description of each topic.
class2 ="f5 color-fg-muted mb-0 mt-1"
topics_description = soup.find_all('p', {'class': class2})

In [15]:
print(len(topics_description))
topics_description

30


[<p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Angular is an open source web application platform.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ansible is a simple and powerful automation engine.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           An API (Applicati

In [16]:
## In the same way, we search for all a tags whose class attribute equals class3, because these tags contain the URL of each topic.
class3 ="no-underline flex-1 d-flex flex-column"
topics_url = soup.find_all('a', {'class': class3})

In [17]:
print(len(topics_url))
topics_url

30


[<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>
 </a>,
 <a class="no-underline flex-1 d-flex flex-column" href="/topics/ajax">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>
 </a>,
 <a class="no-underline flex-1 d-flex flex-column" href="/topics/algorithm">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>
 </a>,
 <a class="no-underline flex-1 d-flex flex-column" href="/topics/amphp">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>
 <p class="f
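
As the output above shows, each `a` tag wraps both the title paragraph and the description paragraph. One possible alternative, sketched below on a trimmed copy of that output, is to read all three fields from each card in a single pass, which keeps a title, its description, and its link together. The names `sample_html`, `demo`, and `rows` are illustrative, and the sketch assumes every card holds exactly two `p` tags in this order.

```python
from bs4 import BeautifulSoup

# Trimmed copy of two cards from the output above.
sample_html = """
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
   3D modeling is the process of virtually developing the surface and structure of a 3D object.
 </p>
</a>
<a class="no-underline flex-1 d-flex flex-column" href="/topics/ajax">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
   Ajax is a technique for creating interactive web applications.
 </p>
</a>
"""
demo = BeautifulSoup(sample_html, "html.parser")

rows = []
for card in demo.find_all("a", class_="no-underline"):
    # Assumption: the first <p> is the title and the second is the description.
    paragraphs = card.find_all("p")
    rows.append({
        "title": paragraphs[0].get_text(strip=True),
        "description": paragraphs[1].get_text(strip=True),
        "url": "https://github.com" + card["href"],
    })

for row in rows:
    print(row["title"], "->", row["url"])
```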

In [19]:
## We collect the text of every topic title in a list.
titles = []
for p in topics_title:
    titles.append(p.text)
    
print(titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [20]:
## strip() removes the whitespace and newlines that surround each description in the HTML.
descriptions = []
for p in topics_description:
    descriptions.append(p.text.strip())
    
print(descriptions)

['3D modeling is the process of virtually developing the surface and structure of a 3D object.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.', 'Android is an operating system built by Google designed for mobile devices.', 'Angular is an open source web application platform.', 'Ansible is a simple and powerful automation engine.', 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.', 'Arduino is an open source hardware and software company and maker community.', 'ASP.NET is a web framework for building modern web apps and services.', 'Atom is a open source text editor built with web technologies.', 'An awesome list is a list of awesome things curated by the community.', 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.', 'Azure is a cloud

In [21]:
## The href attributes are relative paths, so we prepend the site's base URL to each one.
urls = []
url = 'https://github.com'
for a in topics_url:
    urls.append(url+a['href'])
print(urls)

['https://github.com/topics/3d', 'https://github.com/topics/ajax', 'https://github.com/topics/algorithm', 'https://github.com/topics/amphp', 'https://github.com/topics/android', 'https://github.com/topics/angular', 'https://github.com/topics/ansible', 'https://github.com/topics/api', 'https://github.com/topics/arduino', 'https://github.com/topics/aspnet', 'https://github.com/topics/atom', 'https://github.com/topics/awesome', 'https://github.com/topics/aws', 'https://github.com/topics/azure', 'https://github.com/topics/babel', 'https://github.com/topics/bash', 'https://github.com/topics/bitcoin', 'https://github.com/topics/bootstrap', 'https://github.com/topics/bot', 'https://github.com/topics/c', 'https://github.com/topics/chrome', 'https://github.com/topics/chrome-extension', 'https://github.com/topics/cli', 'https://github.com/topics/clojure', 'https://github.com/topics/code-quality', 'https://github.com/topics/code-review', 'https://github.com/topics/compiler', 'https://github.com/t
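
String concatenation works here because every `href` on the page is a path starting with `/`. A slightly more defensive sketch uses `urllib.parse.urljoin` from the standard library, which also leaves an already absolute link unchanged. The list `hrefs` is sample data, and its last entry is a hypothetical absolute link added for illustration.

```python
from urllib.parse import urljoin

base_url = "https://github.com"

# Sample hrefs; the last one is a hypothetical absolute URL.
hrefs = ["/topics/3d", "/topics/ajax", "https://github.com/topics/api"]

full_urls = [urljoin(base_url, href) for href in hrefs]
print(full_urls)
# ['https://github.com/topics/3d', 'https://github.com/topics/ajax', 'https://github.com/topics/api']
```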

In [23]:
## We group all the information (title, description, url) in a dictionary of lists.
topics_dictionnary = {
    'title': titles,
    'description':descriptions,
    'url': urls
}
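
A dictionary of lists only turns into a table cleanly when every list has the same length; pandas raises a `ValueError` otherwise. The sketch below is an optional sanity check that reports the length of each column first. The helper name `check_equal_lengths` and the sample data are hypothetical.

```python
def check_equal_lengths(columns):
    """Return {column name: length}; raise ValueError if the lengths differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths


# Sample data shaped like the dictionary above.
demo_columns = {
    "title": ["3D", "Ajax"],
    "description": [
        "3D modeling is the process of virtually developing the surface and structure of a 3D object.",
        "Ajax is a technique for creating interactive web applications.",
    ],
    "url": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
}

print(check_equal_lengths(demo_columns))  # {'title': 2, 'description': 2, 'url': 2}
```

A mismatch usually means that one of the `find_all()` searches matched more or fewer elements than the others.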

In [24]:
import pandas as pd

In [26]:
## We construct a DataFrame from the dictionary "topics_dictionnary"; each key becomes a column.
df = pd.DataFrame(topics_dictionnary)

In [27]:
df

,title,description,url
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


In [28]:
## Finally, we export the scraped data to a CSV file; index=False leaves out the row index.
df.to_csv('Github_Topics.csv', index=False)
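
To see why the row index is left out of the export, the sketch below writes a small sample DataFrame to an in-memory buffer twice, once with the default index and once with `index=False`, and reads each result back. The sample data and the names `demo_df`, `with_index`, and `without_index` are illustrative.

```python
import io

import pandas as pd

demo_df = pd.DataFrame({
    "title": ["3D", "Ajax"],
    "url": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
})

# Default behaviour: the row index is written as an extra, unnamed first column.
with_index = io.StringIO()
demo_df.to_csv(with_index)
with_index.seek(0)
columns_with_index = list(pd.read_csv(with_index).columns)
print(columns_with_index)  # ['Unnamed: 0', 'title', 'url']

# index=False: only the data columns are written.
without_index = io.StringIO()
demo_df.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = list(pd.read_csv(without_index).columns)
print(columns_without_index)  # ['title', 'url']
```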

# Now we repeat the same steps to scrape the "COVID-19" topic page


In [29]:
## We set topic_url to the URL of the COVID-19 topic page, looked up by its title rather than by a hard-coded position.
topic_url = urls[titles.index('COVID-19')]

In [31]:
r = requests.get(topic_url)

In [32]:
r.status_code

200

In [33]:
soup = BeautifulSoup(r.text, 'html.parser')

In [35]:
## Each h3 tag whose class attribute equals class1 holds information about one repository, such as its owner and its name.
## We search for these tags with find_all().
class1 = "f3 color-fg-muted text-normal lh-condensed"
repo = soup.find_all('h3', {'class': class1})
## The COVID-19 topic page lists 20 repositories.
len(repo)

20

In [36]:
## First we extract information from the first repository.
## repo[0] is the first h3 tag in the list repo.
## a is the list of all "a" tags inside repo[0].
a = repo[0].find_all('a')
a

[<a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":60674295,"originating_url":"https://github.com/topics/covid-19","user_id":null}}' data-hydro-click-hmac="75c7c6681248794cfcc1d42d7a05f8127c46c04f18962eba0b8cc802aac2b49e" data-turbo="false" data-view-component="true" href="/CSSEGISandData">
             CSSEGISandData
 </a>,
 <a class="text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":238316428,"originating_url":"https://github.com/topics/covid-19","user_id":null}}' data-hydro-click-hmac="2af6b0a8e40f3fa3ad5989a551a8eab8f4670c83717f7db6508870462c0e4baa" data-turbo="false" data-view-component="true" href="/CSSEGISandData/COVID-19">
             COVID-19
 </a>

In [37]:
## a[0] is the first "a" tag in the list. Its text is the owner's name surrounded by whitespace; strip() returns a copy of that string with the leading and trailing whitespace removed, which gives the "username".
a[0].text.strip()

'CSSEGISandData'
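
To make the effect of `strip()` concrete, here is a small sketch on plain strings. `raw_text` imitates the padded text of the tag above, and the variable names are illustrative.

```python
# Text as it appears inside the tag: newlines and spaces around the name.
raw_text = "\n             CSSEGISandData\n "
clean_text = raw_text.strip()
print(repr(clean_text))  # 'CSSEGISandData'

# Only leading and trailing whitespace is removed; spaces inside the text are kept.
padded_title = "  Awesome Lists \n"
print(repr(padded_title.strip()))  # 'Awesome Lists'
```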

In [38]:
## In the same way, stripping the text of the second "a" tag gives the name of the first repository, the "repository_name".
a[1].text.strip()

'COVID-19'

In [39]:
## This cell builds the URL of the repository by prepending the base URL to the href of the second "a" tag.
url = 'https://github.com'
url + a[1]['href']


'https://github.com/CSSEGISandData/COVID-19'

In [40]:
## We search for all div tags whose class attribute equals class2, because these tags contain the repository descriptions.
class2 = "px-3 pt-3"
description = soup.find_all('div', {'class': class2})
len(description)

20

In [41]:
description[0].text.strip()

'Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE'

In [42]:
## Each span tag with the class 'Counter js-social-count' contains the star count of one repository.
## We search for these tags with find_all().
star = soup.find_all('span', { 'class': 'Counter js-social-count'})

In [43]:
## Stripping the text of the first span tag gives the star count of the first repository as a string, the "stars" value.
star[0].text.strip()

'29k'

In [44]:
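## This function converts a star count such as '29k' to an integer by multiplying the numeric part by 1000; counts without a 'k' suffix are converted directly.
def star_count(text):
    text = text.strip()
    if text[-1] == 'k':
        # round() avoids losing a unit when the product falls just below a whole number
        return round(float(text[:-1]) * 1000)
    return int(text)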

In [45]:
star_count(star[0].text.strip())

29000
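
The conversion above covers the 'k' suffix seen on this page. As a hedged sketch, a slightly more general variant could also accept an 'm' suffix, upper-case suffixes, and thousands separators. Whether GitHub ever displays counts in those forms is an assumption, and the name `parse_count` is hypothetical.

```python
def parse_count(text):
    """Convert strings such as '29k', '1.2M' or '1,234' to an integer.

    Assumption: 'k' means thousands and 'm' means millions.
    """
    text = text.strip().lower().replace(",", "")
    multipliers = {"k": 1_000, "m": 1_000_000}
    if text and text[-1] in multipliers:
        # round() guards against products that land just below a whole number.
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)


print(parse_count("29k"))    # 29000
print(parse_count("1,234"))  # 1234
```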

In [46]:
## This function gathers the five pieces of information collected above (owner, repository name, description, stars, URL). We use it to extract the same data for every repository.

def get_repository_info(h3_tag, description_tag, star_tag):
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repository_name = a_tags[1].text.strip()
    # url is the base URL 'https://github.com' defined earlier
    repository_url = url + a_tags[1]['href']
    description = description_tag.text.strip()
    stars = star_count(star_tag.text.strip())
    return username, repository_name, description, stars, repository_url

In [47]:
get_repository_info(repo[0],description[0], star[0])

('CSSEGISandData',
 'COVID-19',
 'Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE',
 29000,
 'https://github.com/CSSEGISandData/COVID-19')

In [48]:
## This dictionary will hold the information extracted from the COVID-19 topic page, one list per column.
repositories_dictionnary = {
    'username': [],
    'repository_name': [],
    'descriptions':[],
    'stars': [],
    'repository_url': []
}
## This loop fills the dictionary with the information for every repository, using the same steps as for the first one.

for i in range(len(repo)):
    repository_info = get_repository_info(repo[i],description[i], star[i])
    repositories_dictionnary['username'].append(repository_info[0])
    repositories_dictionnary['repository_name'].append(repository_info[1])
    repositories_dictionnary['descriptions'].append(repository_info[2])
    repositories_dictionnary['stars'].append(repository_info[3])
    repositories_dictionnary['repository_url'].append(repository_info[4])
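
The loop above reads each field by position (`repository_info[0]`, `repository_info[1]`, and so on), which is easy to mix up. One possible alternative, sketched below with the standard library's `typing.NamedTuple`, gives each field a name and builds the same column-oriented dictionary. The class name `RepositoryInfo` and the variables `scraped`, `records`, and `columns` are hypothetical; the two sample rows are copied from the outputs in this notebook.

```python
from typing import NamedTuple


class RepositoryInfo(NamedTuple):
    username: str
    repository_name: str
    description: str
    stars: int
    repository_url: str


# Two sample rows in the order returned by get_repository_info().
scraped = [
    ("CSSEGISandData", "COVID-19",
     "Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE",
     29000, "https://github.com/CSSEGISandData/COVID-19"),
    ("geohot", "corona", "Reverse engineering SARS-CoV-2",
     2300, "https://github.com/geohot/corona"),
]

records = [RepositoryInfo(*row) for row in scraped]

# Build one list per field, keyed by the field name.
columns = {
    field: [getattr(record, field) for record in records]
    for field in RepositoryInfo._fields
}

print(records[0].stars)     # 29000
print(columns["username"])  # ['CSSEGISandData', 'geohot']
```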

In [49]:
## We construct a DataFrame from the dictionary "repositories_dictionnary".
df1 = pd.DataFrame(repositories_dictionnary)

In [50]:
df1

,username,repository_name,descriptions,stars,repository_url
0,CSSEGISandData,COVID-19,"Novel Coronavirus (COVID-19) Cases, provided b...",29000,https://github.com/CSSEGISandData/COVID-19
1,covid19india,covid19india-react,Tracking the impact of COVID-19 in India,6900,https://github.com/covid19india/covid19india-r...
2,nytimes,covid-19-data,An ongoing repository of data on coronavirus c...,6900,https://github.com/nytimes/covid-19-data
3,tokyo-metropolitan-gov,covid19,東京都 新型コロナウイルス感染症対策サイト / Tokyo COVID-19 Task Fo...,6400,https://github.com/tokyo-metropolitan-gov/covid19
4,owid,covid-19-data,"Data on COVID-19 (coronavirus) cases, deaths, ...",5400,https://github.com/owid/covid-19-data
5,pcm-dpc,COVID-19,COVID-19 Italia - Monitoraggio situazione,3900,https://github.com/pcm-dpc/COVID-19
6,ieee8023,covid-chestxray-dataset,We are building an open database of COVID-19 c...,2800,https://github.com/ieee8023/covid-chestxray-da...
7,disease-sh,API,API for Current cases and more stuff about COV...,2400,https://github.com/disease-sh/API
8,geohot,corona,Reverse engineering SARS-CoV-2,2300,https://github.com/geohot/corona
9,WorldHealthOrganization,app,COVID-19 App,2100,https://github.com/WorldHealthOrganization/app


In [51]:
## Finally, we export the scraped data to a CSV file.
df1.to_csv('Covid19_Repositories.csv', index=False)