<b>QUESTION</b>: 
- Write a Python function that creates a CSV file (comma-separated values) containing details about the 25 top GitHub repositories for any given topic. 
- You can view the top repositories for the topic machine-learning on this page: https://github.com/topics/machine-learning.
-  The output CSV should contain these details: repository name, owner's username, no. of stars, repository URL.

In [74]:
import requests
from bs4 import BeautifulSoup
import json

In [2]:
topic_url = 'https://github.com/topics/machine-learning'
response = requests.get(topic_url)

In [3]:
# len(response.content)
type(response)

requests.models.Response

`requests.get` returns a response object containing the page contents, along with a status code indicating whether the request was successful.
- Learn more about HTTP status codes here: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status.

- If the request was successful, response.status_code is set to a value between 200 and 299.

In [4]:
response.status_code

200
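
The 200 to 299 range check described above can be written out explicitly. This is a minimal sketch: the helper name `is_success` is hypothetical and not part of `requests`, which offers the related `response.ok` property (true for any status code below 400).

```python
def is_success(status_code):
    """Return True when an HTTP status code is in the 2xx (success) range."""
    return 200 <= status_code <= 299

# Try a few representative status codes
for code in (200, 204, 301, 404, 500):
    print(code, is_success(code))
```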

In [5]:
page_contents = response.text
len(page_contents)

478455


The page contains over 470,000 characters! Let's view the first 1000 characters of the web page.



In [6]:
print(page_contents[:1000])



<!DOCTYPE html>
<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"  data-a11y-animated-images="system" data-a11y-link-underlines="true">


  <head>
    <meta charset="utf-8">
  <link rel="dns-prefetch" href="https://github.githubassets.com">
  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">
  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">
  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">
  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>
  <link rel="preconnect" href="https://avatars.githubusercontent.com">

  

  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-a09cef873428.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-5d486a4ede8e.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media="all" rel="styl

> Let's save the contents to a file with the .html extension.

In [7]:
with open('machine-learning-topics.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)

We can view the saved page by right-clicking the file ```machine-learning-topics.html``` and opening it in our default browser.

Next, let's read the contents of the file ```machine-learning-topics.html``` and create a BeautifulSoup object to parse the content.


In [8]:
with open('machine-learning-topics.html', 'r', encoding='utf-8') as f:
    html_source = f.read()
print(html_source)



<!DOCTYPE html>
<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"  data-a11y-animated-images="system" data-a11y-link-underlines="true">


  <head>
    <meta charset="utf-8">
  <link rel="dns-prefetch" href="https://github.githubassets.com">
  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">
  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">
  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">
  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>
  <link rel="preconnect" href="https://avatars.githubusercontent.com">

  

  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-a09cef873428.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-5d486a4ede8e.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media="all" rel="styl

In [9]:
doc = BeautifulSoup(html_source,'html.parser')

The doc object contains several properties and methods for extracting information from the HTML document. Let's look at a few examples below.

NOTE: You don't need to remember all (or any) of the properties/methods. You can look up the documentation of [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) or search online to find what you need when you need it.

In [10]:
doc.title

<title>machine-learning · GitHub Topics · GitHub</title>

In [11]:
doc.title.string

'machine-learning · GitHub Topics · GitHub'

In [12]:
doc.title.parent.name

'link'

In [13]:
body=doc.body
print(body.prettify())

<body class="logged-out env-production page-responsive" style="word-wrap: break-word;">
 <div class="logged-out env-production page-responsive" data-turbo-body="" style="word-wrap: break-word;">
  <div class="position-relative js-header-wrapper">
   <a class="px-2 py-4 color-bg-accent-emphasis color-fg-on-emphasis show-on-focus js-skip-to-content" href="#start-of-content">
    Skip to content
   </a>
   <span class="progress-pjax-loader Progress position-fixed width-full" data-view-component="true">
    <span class="Progress-item progress-pjax-loader-bar left-0 top-0 color-bg-accent-emphasis" data-view-component="true" style="width: 0%;">
    </span>
   </span>
   <script crossorigin="anonymous" defer="defer" src="https://github.githubassets.com/assets/vendors-node_modules_github_remote-form_dist_index_js-node_modules_delegated-events_dist_inde-94fd67-8311888324b2.js" type="application/javascript">
   </script>
   <script crossorigin="anonymous" defer="defer" src="https://github.github

In [14]:
# body.a.text
# Find all the link tags on the page. How many links does the page contain?
all_link_tags= doc.find_all('a')

In [15]:
type(all_link_tags )

bs4.element.ResultSet

In [16]:
first_link=all_link_tags[0]
first_link['href']

'#start-of-content'

In [17]:
first_link['class']

['px-2',
 'py-4',
 'color-bg-accent-emphasis',
 'color-fg-on-emphasis',
 'show-on-focus',
 'js-skip-to-content']

In [18]:
fith_link=all_link_tags[4]
fith_link

<a class="HeaderMenu-dropdown-link lh-condensed d-block no-underline position-relative py-2 Link--secondary d-flex flex-items-center pb-lg-3" data-analytics-event='{"category":"Header dropdown (logged out), Product","action":"click to go to Packages","label":"ref_cta:Packages;"}' href="/features/packages">
<svg aria-hidden="true" class="octicon octicon-package color-fg-subtle mr-3" data-view-component="true" height="24" version="1.1" viewbox="0 0 24 24" width="24">
<path d="M12.876.64V.639l8.25 4.763c.541.313.875.89.875 1.515v9.525a1.75 1.75 0 0 1-.875 1.516l-8.25 4.762a1.748 1.748 0 0 1-1.75 0l-8.25-4.763a1.75 1.75 0 0 1-.875-1.515V6.917c0-.625.334-1.202.875-1.515L11.126.64a1.748 1.748 0 0 1 1.75 0Zm-1 1.298L4.251 6.34l7.75 4.474 7.75-4.474-7.625-4.402a.248.248 0 0 0-.25 0Zm.875 19.123 7.625-4.402a.25.25 0 0 0 .125-.216V7.639l-7.75 4.474ZM3.501 7.64v8.803c0 .09.048.172.125.216l7.625 4.402v-8.947Z"></path>
</svg>
<div>
<div class="color-fg-default h4">Packages</div>
        Host and ma

In [19]:
fith_link['class']

['HeaderMenu-dropdown-link',
 'lh-condensed',
 'd-block',
 'no-underline',
 'position-relative',
 'py-2',
 'Link--secondary',
 'd-flex',
 'flex-items-center',
 'pb-lg-3']

##### Searching by Attribute Value

> <b>QUESTION:</b> Find the img tag(s) on the page with the alt attribute set to ``transformers``.

We can provide a dictionary of attributes as the second argument to ```find_all```

In [20]:
doc.find_all('img', {'alt': 'transformers'})

[<img alt="transformers" class="d-block width-full" loading="lazy" src="https://repository-images.githubusercontent.com/155220641/a16c4880-a501-11ea-9e8f-646cf611702e"/>]

In [21]:
doc.find_all('img', {'alt': 'julia'})

[<img alt="julia" class="d-block width-full" loading="lazy" src="https://repository-images.githubusercontent.com/1644196/ddfc1e00-6638-11e9-9b80-0fe7b9aedd72"/>]
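
To make the idea of matching tags by attribute value concrete without any third-party dependency, here is a rough sketch that uses the standard library's `html.parser` in place of BeautifulSoup. The class name `TagFinder` and the sample HTML are made up for illustration; this is not how `find_all` is implemented.

```python
from html.parser import HTMLParser

class TagFinder(HTMLParser):
    """Collect the attributes of every tag matching a name and an attribute filter."""

    def __init__(self, tag_name, required_attrs):
        super().__init__()
        self.tag_name = tag_name
        self.required_attrs = required_attrs
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag != self.tag_name:
            return
        attr_dict = dict(attrs)
        if all(attr_dict.get(k) == v for k, v in self.required_attrs.items()):
            self.matches.append(attr_dict)

# Made-up snippet for illustration; the real page is far larger
sample_html = '''
<div>
  <img alt="transformers" class="d-block width-full" src="/images/a.png"/>
  <img alt="julia" class="d-block width-full" src="/images/b.png"/>
</div>
'''

finder = TagFinder('img', {'alt': 'julia'})
finder.feed(sample_html)
print(finder.matches)
```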

In [22]:
matching_tags = doc.find_all(class_='HeaderMenu-link')
print(matching_tags)

[<button aria-expanded="false" class="HeaderMenu-link border-0 width-full width-lg-auto px-0 px-lg-2 py-3 py-lg-2 no-wrap d-flex flex-items-center flex-justify-between js-details-target" type="button">
        Product
        <svg aria-hidden="true" class="octicon octicon-chevron-down HeaderMenu-icon ml-1" data-view-component="true" height="16" opacity="0.5" version="1.1" viewbox="0 0 16 16" width="16">
<path d="M12.78 5.22a.749.749 0 0 1 0 1.06l-4.25 4.25a.749.749 0 0 1-1.06 0L3.22 6.28a.749.749 0 1 1 1.06-1.06L8 8.939l3.72-3.719a.749.749 0 0 1 1.06 0Z"></path>
</svg>
</button>, <button aria-expanded="false" class="HeaderMenu-link border-0 width-full width-lg-auto px-0 px-lg-2 py-3 py-lg-2 no-wrap d-flex flex-items-center flex-justify-between js-details-target" type="button">
        Solutions
        <svg aria-hidden="true" class="octicon octicon-chevron-down HeaderMenu-icon ml-1" data-view-component="true" height="16" opacity="0.5" version="1.1" viewbox="0 0 16 16" width="16">
<path

#### Parsing Information from Tags
> Once we have a list of tags matching some criteria, it's easy to extract information and convert it to a more convenient format.

<b>QUESTION:</b> Find the link text and URL of all the links within the page header on https://github.com/topics/machine-learning .

> We'll create a list of dictionaries containing the required information. We'll add the base URL https://github.com as a prefix because the href attribute only contains the relative path.

In [23]:
header_link_tags = doc.find_all('a', class_='HeaderMenu-link')
header_link_tags

[<a class="HeaderMenu-link no-underline px-0 px-lg-2 py-3 py-lg-2 d-block d-lg-inline-block" data-analytics-event='{"category":"Header menu top item (logged out)","action":"click to go to Pricing","label":"ref_cta:Pricing;"}' href="/pricing">Pricing</a>,
 <a class="HeaderMenu-link HeaderMenu-link--sign-in flex-shrink-0 no-underline d-block d-lg-inline-block border border-lg-0 rounded rounded-lg-0 p-2 p-lg-0" data-ga-click="(Logged out) Header, clicked Sign in, text:sign-in" data-hydro-click='{"event_type":"authentication.click","payload":{"location_in_page":"site header menu","repository_id":null,"auth_type":"SIGN_UP","originating_url":"https://github.com/topics/machine-learning","user_id":null}}' data-hydro-click-hmac="77fa0805c2ee5e083e5dbe5571a2e37d8661eca3a115806e97d54914b8332ea9" href="/login?return_to=https%3A%2F%2Fgithub.com%2Ftopics%2Fmachine-learning">
               Sign in
             </a>,
 <a class="HeaderMenu-link HeaderMenu-link--sign-up flex-shrink-0 d-none d-lg-inline

In [24]:
header_link_tags[0]['href']

'/pricing'

In [25]:
header_links = []
base_url = 'https://github.com'

for tag in header_link_tags:
    # print(tag.text.strip())
    # print(tag['href'])
    header_links.append({ 'title': tag.text.strip(), 
                         'url': base_url + tag['href']
                         })
    
print(header_links)

[{'title': 'Pricing', 'url': 'https://github.com/pricing'}, {'title': 'Sign in', 'url': 'https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Ftopics%2Fmachine-learning'}, {'title': 'Sign up', 'url': 'https://github.com/signup?ref_cta=Sign+up&ref_loc=header+logged+out&ref_page=%2Ftopics%2Fmachine-learning&source=header'}]
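
String concatenation works here because every `href` in the header is a relative path. As a hedged alternative, `urljoin` from the standard library resolves relative paths against a base and leaves absolute URLs untouched. The helper name `absolute_url` is hypothetical.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'

def absolute_url(href):
    """Resolve an href against the base URL; absolute hrefs pass through unchanged."""
    return urljoin(base_url, href)

print(absolute_url('/pricing'))
print(absolute_url('https://docs.github.com'))
```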


In [26]:
imag_tags=doc.find_all(class_='d-block width-full')

In [27]:
doc.find_all(class_='d-block width-full')

[<img alt="transformers" class="d-block width-full" loading="lazy" src="https://repository-images.githubusercontent.com/155220641/a16c4880-a501-11ea-9e8f-646cf611702e"/>,
 <img alt="netdata" class="d-block width-full" loading="lazy" src="https://repository-images.githubusercontent.com/10744183/8d08ea53-6359-45fe-bc4d-067cfe1673a1"/>,
 <img alt="ML-For-Beginners" class="d-block width-full" loading="lazy" src="https://repository-images.githubusercontent.com/343965132/549b1a80-c897-11eb-9436-918072d2e0f8"/>,
 <img alt="awesome-scalability" class="d-block width-full" loading="lazy" src="https://repository-images.githubusercontent.com/115478820/109a8e00-283a-11ea-8891-ad7215b06a4c"/>,
 <img alt="julia" class="d-block width-full" loading="lazy" src="https://repository-images.githubusercontent.com/1644196/ddfc1e00-6638-11e9-9b80-0fe7b9aedd72"/>,
 <img alt="yolov5" class="d-block width-full" loading="lazy" src="https://repository-images.githubusercontent.com/264818686/c9bae91d-ad2d-491c-876f-b

In [28]:
imag_tags[0]['alt']

'transformers'

In [29]:
images_bd = []
for tag in imag_tags:
    # The alt attribute holds the repository name; src holds the image URL
    images_bd.append(
        { 'repo_name': tag['alt'],
         'url': tag['src']
        }
    )

images_bd

[{'repo_name': 'transformers',
  'url': 'https://repository-images.githubusercontent.com/155220641/a16c4880-a501-11ea-9e8f-646cf611702e'},
 {'repo_name': 'netdata',
  'url': 'https://repository-images.githubusercontent.com/10744183/8d08ea53-6359-45fe-bc4d-067cfe1673a1'},
 {'repo_name': 'ML-For-Beginners',
  'url': 'https://repository-images.githubusercontent.com/343965132/549b1a80-c897-11eb-9436-918072d2e0f8'},
 {'repo_name': 'awesome-scalability',
  'url': 'https://repository-images.githubusercontent.com/115478820/109a8e00-283a-11ea-8891-ad7215b06a4c'},
 {'repo_name': 'julia',
  'url': 'https://repository-images.githubusercontent.com/1644196/ddfc1e00-6638-11e9-9b80-0fe7b9aedd72'},
 {'repo_name': 'yolov5',
  'url': 'https://repository-images.githubusercontent.com/264818686/c9bae91d-ad2d-491c-876f-b6948f1a7c66'},
 {'repo_name': 'annotated_deep_learning_paper_implementations',
  'url': 'https://repository-images.githubusercontent.com/290091948/ac5a4b00-3e4b-11eb-948f-8e1ff5bdcc63'}]

In [30]:
def get_topic_page(topic):
    # Construct the URL
    base_url = 'https://github.com/topics/'
    topic_repos_url = base_url + topic
    response = requests.get(topic_repos_url)
    if response.status_code != 200:
        print('Status code', response.status_code)
        raise Exception('Failed to fetch the webpage ' + topic_repos_url)
    # Construct a BeautifulSoup document, naming the parser explicitly
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc


In [31]:
doc = get_topic_page('machine-learning')

In [32]:
doc.title

<title>machine-learning · GitHub Topics · GitHub</title>

In [33]:
doc2 = get_topic_page('python')
 

In [34]:
doc2.title.text

'python · GitHub Topics · GitHub'

In [35]:
article_tags = doc.find_all('article', class_="border rounded color-shadow-small color-bg-subtle my-4")

In [36]:
article_tags

[<article class="border rounded color-shadow-small color-bg-subtle my-4">
 <div class="px-3">
 <div class="d-flex flex-justify-between flex-items-start flex-wrap gap-2 my-3">
 <div class="d-flex flex-1">
 <span style="margin-top:2px">
 <svg aria-hidden="true" class="octicon octicon-repo color-fg-muted mr-2" data-view-component="true" height="16" version="1.1" viewbox="0 0 16 16" width="16">
 <path d="M2 2.5A2.5 2.5 0 0 1 4.5 0h8.75a.75.75 0 0 1 .75.75v12.5a.75.75 0 0 1-.75.75h-2.5a.75.75 0 0 1 0-1.5h1.75v-2h-8a1 1 0 0 0-.714 1.7.75.75 0 1 1-1.072 1.05A2.495 2.495 0 0 1 2 11.5Zm10.5-1h-8a1 1 0 0 0-1 1v6.708A2.486 2.486 0 0 1 4.5 9h8ZM5 12.25a.25.25 0 0 1 .25-.25h3.5a.25.25 0 0 1 .25.25v3.25a.25.25 0 0 1-.4.2l-1.45-1.087a.249.249 0 0 0-.3 0L5.4 15.7a.25.25 0 0 1-.4-.2Z"></path>
 </svg>
 </span>
 <h3 class="f3 color-fg-muted text-normal lh-condensed">
 <a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click

In [37]:
len(article_tags)

20

In [38]:
article_tag=article_tags[1]

In [39]:
h3_tag = article_tag.find('h3')
print(h3_tag)

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":25720743,"originating_url":"https://github.com/topics/machine-learning","user_id":null}}' data-hydro-click-hmac="6aafea2eb40857d403ccc23ff92e4207f03d0238eff7faa9fd913e074b03c181" data-turbo="false" data-view-component="true" href="/huggingface">
            huggingface
</a>          /
          <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":155220641,"originating_url":"https://github.com/topics/machine-learning","user_id":null}}' data-hydro-click-hmac="544e8e49f42846b10cd63c619c2b8a5ae33f6c3643372221cfa87f8e26b42398" data-t

In [40]:
tag=doc.find_all(class_="f3 color-fg-muted text-normal lh-condensed")
tag[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":15658638,"originating_url":"https://github.com/topics/machine-learning","user_id":null}}' data-hydro-click-hmac="b00f41ff6d39cbe8b2ba57a2121412679410dc5657f69051b91b293a5e76cb9a" data-turbo="false" data-view-component="true" href="/tensorflow">
            tensorflow
</a>          /
          <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":45717250,"originating_url":"https://github.com/topics/machine-learning","user_id":null}}' data-hydro-click-hmac="5162b8742768d23c1f4e26d74e6474a980744831321eb8115df7c4ffc584d901" data-turb

In [41]:
a_tags = h3_tag.find_all('a', recursive=False)
a_tags

[<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":25720743,"originating_url":"https://github.com/topics/machine-learning","user_id":null}}' data-hydro-click-hmac="6aafea2eb40857d403ccc23ff92e4207f03d0238eff7faa9fd913e074b03c181" data-turbo="false" data-view-component="true" href="/huggingface">
             huggingface
 </a>,
 <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":155220641,"originating_url":"https://github.com/topics/machine-learning","user_id":null}}' data-hydro-click-hmac="544e8e49f42846b10cd63c619c2b8a5ae33f6c3643372221cfa87f8e26b42398" data-turbo="false" data-view-component="true" href="/huggingface/transformers"

In [42]:
username = a_tags[0].text.strip()
username


'huggingface'

In [43]:
repo_name = a_tags[1].text.strip()
repo_name

'transformers'

In [44]:
repo_path=a_tags[1]['href'].strip()
repo_path

'/huggingface/transformers'

In [45]:
base_url = 'https://github.com'
repo_url = base_url + repo_path 
repo_url

'https://github.com/huggingface/transformers'


Next, to get the number of stars, we notice that it is contained within a `span` tag which has the class `Counter js-social-count`.

In [46]:
a_star_tag = article_tag.find('span', class_='Counter js-social-count')
a_star_tag.text

'113k'

In [47]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    else:
        return int(stars_str)

In [48]:
star_count = parse_star_count(a_star_tag.text)
star_count

113000

In [49]:
parse_star_count('991')

991

In [50]:
print('Repository name:', repo_name)
print("Owner's username:", username)
print('Stars:', star_count)
print('Repository URL:', repo_url)

Repository name: transformers
Owner's username: huggingface
Stars: 113000
Repository URL: https://github.com/huggingface/transformers
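
`parse_star_count` handles plain integers and the `k` suffix. A slightly more defensive variant might also accept an `m` suffix and thousands separators. Whether GitHub ever renders counts in those forms on this page is an assumption, and the name `parse_count` is hypothetical.

```python
def parse_count(text):
    """Convert strings such as '991', '60.5k' or '1.2m' to an integer."""
    multipliers = {'k': 1_000, 'm': 1_000_000}
    cleaned = text.strip().lower().replace(',', '')
    suffix = cleaned[-1:]
    if suffix in multipliers:
        # round() guards against floating-point results just below a whole number
        return round(float(cleaned[:-1]) * multipliers[suffix])
    return int(cleaned)

for sample in ('991', '60.5k', ' 1.2m ', '1,234'):
    print(repr(sample), parse_count(sample))
```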


Let's extract the logic for parsing the required information from an article tag into a function.

> **QUESTION**: Write a function `parse_repository` that returns a dictionary containing the repository name, owner's username, number of stars, and repository URL by parsing a given `article` tag representing a repository.

In [51]:
def parse_repository(article_tag):
    # <a> tags containing username, repository name and URL
    a_tags = article_tag.h3.find_all('a')
    # Owner's username
    username = a_tags[0].text.strip()
    # Repository name
    repo_name = a_tags[1].text.strip()
    # Repository URL
    repo_url = base_url + a_tags[1]['href'].strip()
    # Star count
    stars_tag = article_tag.find('span', class_='Counter js-social-count')
    star_count=parse_star_count(stars_tag.text.strip())
    return{
        'repository_name':repo_name,
        'owner_username':username,
        'stars': star_count,
        "repository_url": repo_url
    }
    

We can now use the function to parse any `article` tag.

In [52]:
parse_repository(article_tags[0])

{'repository_name': 'tensorflow',
 'owner_username': 'tensorflow',
 'stars': 178000,
 'repository_url': 'https://github.com/tensorflow/tensorflow'}

In [53]:
parse_repository(article_tags[4])

{'repository_name': 'cs-video-courses',
 'owner_username': 'Developer-Y',
 'stars': 60500,
 'repository_url': 'https://github.com/Developer-Y/cs-video-courses'}

In [54]:
top_repositories = [parse_repository(tag) for tag in article_tags]

In [55]:
top_repositories

[{'repository_name': 'tensorflow',
  'owner_username': 'tensorflow',
  'stars': 178000,
  'repository_url': 'https://github.com/tensorflow/tensorflow'},
 {'repository_name': 'transformers',
  'owner_username': 'huggingface',
  'stars': 113000,
  'repository_url': 'https://github.com/huggingface/transformers'},
 {'repository_name': 'pytorch',
  'owner_username': 'pytorch',
  'stars': 71300,
  'repository_url': 'https://github.com/pytorch/pytorch'},
 {'repository_name': 'netdata',
  'owner_username': 'netdata',
  'stars': 65300,
  'repository_url': 'https://github.com/netdata/netdata'},
 {'repository_name': 'cs-video-courses',
  'owner_username': 'Developer-Y',
  'stars': 60500,
  'repository_url': 'https://github.com/Developer-Y/cs-video-courses'},
 {'repository_name': 'keras',
  'owner_username': 'keras-team',
  'stars': 59500,
  'repository_url': 'https://github.com/keras-team/keras'},
 {'repository_name': 'scikit-learn',
  'owner_username': 'scikit-learn',
  'stars': 56000,
  'reposi

In [56]:
len(top_repositories)

20



> **QUESTION**: Write a function that takes a `BeautifulSoup` object representing a topic page and returns a list of dictionaries containing information about the top repositories for the topic.

In [57]:
def get_top_repositories(doc):
    article_tags = doc.find_all('article',
                                 class_="border rounded color-shadow-small color-bg-subtle my-4")
    topic_repos = [parse_repository(tag) for tag in article_tags]
    return topic_repos

We can now use the functions we've defined to get the top repositories for any topic.

In [58]:
topic_page_ml = get_topic_page('machine-learning')
top_repos_ml = get_top_repositories(topic_page_ml)
top_repos_ml[:5]

[{'repository_name': 'tensorflow',
  'owner_username': 'tensorflow',
  'stars': 178000,
  'repository_url': 'https://github.com/tensorflow/tensorflow'},
 {'repository_name': 'transformers',
  'owner_username': 'huggingface',
  'stars': 113000,
  'repository_url': 'https://github.com/huggingface/transformers'},
 {'repository_name': 'pytorch',
  'owner_username': 'pytorch',
  'stars': 71300,
  'repository_url': 'https://github.com/pytorch/pytorch'},
 {'repository_name': 'netdata',
  'owner_username': 'netdata',
  'stars': 65300,
  'repository_url': 'https://github.com/netdata/netdata'},
 {'repository_name': 'cs-video-courses',
  'owner_username': 'Developer-Y',
  'stars': 60500,
  'repository_url': 'https://github.com/Developer-Y/cs-video-courses'}]

Here are the top repositories for the keyword `data-analysis`.

In [59]:
topic_page_da = get_topic_page('data-analysis')
top_repos_da = get_top_repositories(topic_page_da)
top_repos_da[:5]

[{'repository_name': 'scikit-learn',
  'owner_username': 'scikit-learn',
  'stars': 56000,
  'repository_url': 'https://github.com/scikit-learn/scikit-learn'},
 {'repository_name': 'superset',
  'owner_username': 'apache',
  'stars': 54400,
  'repository_url': 'https://github.com/apache/superset'},
 {'repository_name': 'pandas',
  'owner_username': 'pandas-dev',
  'stars': 39900,
  'repository_url': 'https://github.com/pandas-dev/pandas'},
 {'repository_name': 'metabase',
  'owner_username': 'metabase',
  'stars': 34400,
  'repository_url': 'https://github.com/metabase/metabase'},
 {'repository_name': 'streamlit',
  'owner_username': 'streamlit',
  'stars': 27600,
  'repository_url': 'https://github.com/streamlit/streamlit'}]

In [60]:
topic_page_react = get_topic_page('react')
top_repos_react = get_top_repositories(topic_page_react)
top_repos_react[:5]


[{'repository_name': 'freeCodeCamp',
  'owner_username': 'freeCodeCamp',
  'stars': 375000,
  'repository_url': 'https://github.com/freeCodeCamp/freeCodeCamp'},
 {'repository_name': 'react',
  'owner_username': 'facebook',
  'stars': 214000,
  'repository_url': 'https://github.com/facebook/react'},
 {'repository_name': 'next.js',
  'owner_username': 'vercel',
  'stars': 113000,
  'repository_url': 'https://github.com/vercel/next.js'},
 {'repository_name': 'react-native',
  'owner_username': 'facebook',
  'stars': 112000,
  'repository_url': 'https://github.com/facebook/react-native'},
 {'repository_name': 'free-programming-books-zh_CN',
  'owner_username': 'justjavac',
  'stars': 105000,
  'repository_url': 'https://github.com/justjavac/free-programming-books-zh_CN'}]

In [61]:
def write_csv(items, path):
    # Open the file in write mode
    with open(path,'w') as f:
        # Return if there's nothing to write
        if len(items) == 0:
            return
        # Write the headers in the first line
        headers =list(items[0].keys())
        f.write(','.join(headers) + '\n')

        # Write one item per line
        for item in items:
            values=[]
            for header in headers:
                values.append(str(item.get(header,"")))
            
            f.write(','.join(values) + '\n')
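
`write_csv` joins values with commas directly, so a value that itself contains a comma would be split across columns when the file is read back. One hedged alternative is Python's built-in `csv` module, which quotes such values automatically. The sketch below writes to an in-memory buffer; the helper name `to_csv_text` and the sample row are made up for illustration.

```python
import csv
import io

def to_csv_text(items):
    """Serialize a list of dictionaries to CSV text, quoting values where needed."""
    if not items:
        return ''
    buffer = io.StringIO()
    # Use the keys of the first item as the column headers
    writer = csv.DictWriter(buffer, fieldnames=list(items[0].keys()))
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()

# Made-up row whose name contains a comma
rows = [{'repository_name': 'demo, with comma', 'stars': 10}]
text = to_csv_text(rows)
print(text)
```

To write to disk instead, the same `DictWriter` can be given a file opened with `open(path, 'w', newline='')`.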

Let's write the data stored in `top_repos_ml` into a CSV file.

In [62]:
len(top_repos_ml)

20

In [63]:
write_csv(top_repos_ml, 'machine-learning.csv')

We can now read the file and inspect its contents. The contents of the file can also be inspected using the "File > Open" menu option within Jupyter.

In [64]:
with open('machine-learning.csv', 'r') as file:
    print(file.read())


repository_name,owner_username,stars,repository_url
tensorflow,tensorflow,178000,https://github.com/tensorflow/tensorflow
transformers,huggingface,113000,https://github.com/huggingface/transformers
pytorch,pytorch,71300,https://github.com/pytorch/pytorch
netdata,netdata,65300,https://github.com/netdata/netdata
cs-video-courses,Developer-Y,60500,https://github.com/Developer-Y/cs-video-courses
keras,keras-team,59500,https://github.com/keras-team/keras
scikit-learn,scikit-learn,56000,https://github.com/scikit-learn/scikit-learn
ML-For-Beginners,microsoft,53700,https://github.com/microsoft/ML-For-Beginners
tesseract,tesseract-ocr,53700,https://github.com/tesseract-ocr/tesseract
face_recognition,ageitgey,49500,https://github.com/ageitgey/face_recognition
d2l-zh,d2l-ai,49100,https://github.com/d2l-ai/d2l-zh
awesome-scalability,binhnguyennus,48600,https://github.com/binhnguyennus/awesome-scalability
faceswap,deepfakes,47200,https://github.com/deepfakes/faceswap
julia,JuliaLang,43300,https://g

Perfect! We've created a CSV containing the information about the top GitHub repositories for the topic `machine-learning`. We can now put together everything we've done so far to solve the original problem.

> **QUESTION**: Write a Python function that creates a CSV file (comma-separated values) containing details about the 25 top GitHub repositories for any given topic. The top repositories for the topic `machine-learning` can be found on this page: [https://github.com/topics/machine-learning](https://github.com/topics/machine-learning). The output CSV should contain these details: repository name, owner's username, no. of stars, repository URL. 


In [65]:
import pandas as pd

In [66]:
pd.read_csv('machine-learning.csv')

Unnamed: 0,repository_name,owner_username,stars,repository_url
0,tensorflow,tensorflow,178000,https://github.com/tensorflow/tensorflow
1,transformers,huggingface,113000,https://github.com/huggingface/transformers
2,pytorch,pytorch,71300,https://github.com/pytorch/pytorch
3,netdata,netdata,65300,https://github.com/netdata/netdata
4,cs-video-courses,Developer-Y,60500,https://github.com/Developer-Y/cs-video-courses
5,keras,keras-team,59500,https://github.com/keras-team/keras
6,scikit-learn,scikit-learn,56000,https://github.com/scikit-learn/scikit-learn
7,ML-For-Beginners,microsoft,53700,https://github.com/microsoft/ML-For-Beginners
8,tesseract,tesseract-ocr,53700,https://github.com/tesseract-ocr/tesseract
9,face_recognition,ageitgey,49500,https://github.com/ageitgey/face_recognition


In [67]:
def scrape_topic_repositories(topic, path=None):
    """Get the top repositories for a topic and write them to a CSV file"""
    if path is None:
        path = topic + '.csv'
    topic_page_doc = get_topic_page(topic)
    topic_repositories = get_top_repositories(topic_page_doc)
    write_csv(topic_repositories, path)
    print('Top repositories for topic "{}" written to file "{}"'.format(topic, path))
    return path
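
Each topic page yielded 20 repositories above, while the question asks for 25. One possible way to close the gap is to request further pages until enough results are collected. The sketch below assumes a caller-supplied `fetch_page(page_number)` function (hypothetical) that returns a list of repository dictionaries for one page. How further pages are actually requested, for example through a `page` query parameter on the topic URL, is an assumption to verify against the live site.

```python
def collect_top_n(fetch_page, n=25, max_pages=5):
    """Call fetch_page(1), fetch_page(2), ... until n items are collected."""
    collected = []
    for page_number in range(1, max_pages + 1):
        page_items = fetch_page(page_number)
        if not page_items:
            break  # no more results available
        collected.extend(page_items)
        if len(collected) >= n:
            break
    return collected[:n]

# Stand-in fetcher producing 20 made-up items per page, for illustration only
def fake_fetch_page(page_number):
    start = (page_number - 1) * 20
    return [{'repository_name': 'repo-{}'.format(i)} for i in range(start, start + 20)]

top_25 = collect_top_n(fake_fetch_page, n=25)
print(len(top_25))
```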

In [68]:
scrape_topic_repositories('data-analysis')

Top repositories for topic "data-analysis" written to file "data-analysis.csv"


'data-analysis.csv'

In [69]:
pd.read_csv('data-analysis.csv')

Unnamed: 0,repository_name,owner_username,stars,repository_url
0,scikit-learn,scikit-learn,56000,https://github.com/scikit-learn/scikit-learn
1,superset,apache,54400,https://github.com/apache/superset
2,pandas,pandas-dev,39900,https://github.com/pandas-dev/pandas
3,metabase,metabase,34400,https://github.com/metabase/metabase
4,streamlit,streamlit,27600,https://github.com/streamlit/streamlit
5,AI-Expert-Roadmap,AMAI-GmbH,26900,https://github.com/AMAI-GmbH/AI-Expert-Roadmap
6,CyberChef,gchq,22700,https://github.com/gchq/CyberChef
7,Data-Science-For-Beginners,microsoft,22600,https://github.com/microsoft/Data-Science-For-...
8,gradio,gradio-app,22300,https://github.com/gradio-app/gradio
9,goaccess,allinurl,16700,https://github.com/allinurl/goaccess


In [70]:
scrape_topic_repositories('javascript')


Top repositories for topic "javascript" written to file "javascript.csv"


'javascript.csv'

In [71]:
pd.read_csv('javascript.csv')

Unnamed: 0,repository_name,owner_username,stars,repository_url
0,freeCodeCamp,freeCodeCamp,375000,https://github.com/freeCodeCamp/freeCodeCamp
1,react,facebook,214000,https://github.com/facebook/react
2,vue,vuejs,205000,https://github.com/vuejs/vue
3,javascript-algorithms,trekhleb,176000,https://github.com/trekhleb/javascript-algorithms
4,You-Dont-Know-JS,getify,172000,https://github.com/getify/You-Dont-Know-JS
5,bootstrap,twbs,165000,https://github.com/twbs/bootstrap
6,javascript,airbnb,138000,https://github.com/airbnb/javascript
7,project-based-learning,practical-tutorials,117000,https://github.com/practical-tutorials/project...
8,30-seconds-of-code,30-seconds,116000,https://github.com/30-seconds/30-seconds-of-code
9,electron,electron,109000,https://github.com/electron/electron


In [72]:
scrape_topic_repositories('python')

Top repositories for topic "python" written to file "python.csv"


'python.csv'

In [73]:
pd.read_csv('python.csv')

Unnamed: 0,repository_name,owner_username,stars,repository_url
0,system-design-primer,donnemartin,231000,https://github.com/donnemartin/system-design-p...
1,awesome-python,vinta,183000,https://github.com/vinta/awesome-python
2,tensorflow,tensorflow,178000,https://github.com/tensorflow/tensorflow
3,Python,TheAlgorithms,170000,https://github.com/TheAlgorithms/Python
4,CS-Notes,CyC2018,167000,https://github.com/CyC2018/CS-Notes
5,AutoGPT,Significant-Gravitas,150000,https://github.com/Significant-Gravitas/AutoGPT
6,project-based-learning,practical-tutorials,117000,https://github.com/practical-tutorials/project...
7,30-seconds-of-code,30-seconds,116000,https://github.com/30-seconds/30-seconds-of-code
8,transformers,huggingface,113000,https://github.com/huggingface/transformers
9,free-programming-books-zh_CN,justjavac,105000,https://github.com/justjavac/free-programming-...


Unlike HTML, JSON is easy to work with in Python: fetch the contents of the URL and convert the result to a dictionary. URLs that return JSON in this way are often called **REST APIs** or REST API endpoints. Many websites offer well-documented REST APIs for accessing their data in JSON format:

* GitHub: https://docs.github.com/en/rest/reference/repos
* Facebook: https://developers.facebook.com/docs/groups-api/reference
* Twitter: https://developer.twitter.com/en/docs/twitter-api/v1/tweets/timelines/api-reference/get-statuses-user_timeline
* Reddit: https://www.reddit.com/dev/api/
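
As a minimal sketch of the conversion step, the snippet below parses a JSON string into a dictionary with the standard `json` module. The payload is a made-up, heavily trimmed stand-in for an API response body (the field names follow GitHub's repository endpoint, the values are illustrative), so no network access is needed.

```python
import json

# Hypothetical, trimmed response body; a real response has many more fields
sample_body = '{"name": "example-repo", "owner": {"login": "example-user"}, "stargazers_count": 42}'

# json.loads converts JSON text into Python objects (dict, list, str, int, ...)
repo_data = json.loads(sample_body)

print(type(repo_data).__name__)       # dict
print(repo_data['name'])
print(repo_data['owner']['login'])    # nested JSON objects become nested dictionaries
```

With `requests`, calling `response.json()` performs the same conversion on the response body.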

Using an API is the *officially supported* way of extracting information from a website. To use an API, you will often need to register as a developer on the platform and generate an API key, which you'll need to send with every request to authenticate yourself. 

Since GitHub offers a public API, we can fetch information about public repositories without registering or authenticating, subject to the rate limit that applies to unauthenticated requests.


> **QUESTION**: Write a function `get_repo_details` to find the following information about a repository: description, watcher count, fork count, open issues count, created at time and updated at time.


In [75]:
def get_repo_details(username, repo_name):
    """Fetch selected details about a repository from the GitHub REST API."""
    print('Fetching information for {}/{}'.format(username, repo_name))
    repo_details_url = "https://api.github.com/repos/" + username + "/" + repo_name
    response = requests.get(repo_details_url)
    # response.ok is False for 4xx/5xx status codes (e.g. an unknown repository)
    if not response.ok:
        print("Failed to fetch!")
        return {}
    # Parse the JSON body into a Python dictionary
    repo_data = json.loads(response.text)
    return {
        'description': repo_data['description'],
        'watchers': repo_data['watchers_count'],
        'open_issues': repo_data['open_issues_count'],
        'created_at': repo_data['created_at'],
        'updated_at': repo_data['updated_at']
    }

In [77]:
get_repo_details('tensorflow', 'tensorflow')

Fetching information for tensorflow/tensorflow


{'description': 'An Open Source Machine Learning Framework for Everyone',
 'watchers': 177920,
 'open_issues': 2052,
 'created_at': '2015-11-07T01:19:20Z',
 'updated_at': '2023-10-03T13:43:55Z'}
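
The question also asks for the fork count, which the GitHub repository endpoint exposes as `forks_count`. As a sketch, the hypothetical helper `extract_repo_details` below picks all six fields out of an already-parsed response dictionary; the sample payload is made up so that the example runs offline. Note that `watchers_count` mirrors the star count in this API, while `subscribers_count` holds the number of users watching the repository.

```python
def extract_repo_details(repo_data):
    """Select the requested fields from a parsed repository response.

    Uses dict.get so a missing field yields None instead of raising KeyError.
    """
    return {
        'description': repo_data.get('description'),
        'watchers': repo_data.get('watchers_count'),
        'forks': repo_data.get('forks_count'),
        'open_issues': repo_data.get('open_issues_count'),
        'created_at': repo_data.get('created_at'),
        'updated_at': repo_data.get('updated_at'),
    }

# Illustrative payload only; the values are not real data
sample = {
    'description': 'An example repository',
    'watchers_count': 42,
    'forks_count': 7,
    'open_issues_count': 3,
    'created_at': '2020-01-01T00:00:00Z',
    'updated_at': '2020-06-01T00:00:00Z',
}

details = extract_repo_details(sample)
print(details['forks'])
```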

> **QUESTION**: Augment the list of top repositories for a topic with the repository description, watcher count, fork count, open issues count, created at time and updated at time.


In [78]:
def add_repo_details(repos):
    return [dict(**get_repo_details(repo['owner_username'], repo['repository_name']), **repo) for repo in repos]
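
The merge inside `add_repo_details` can be tried without any network calls. The sketch below uses two made-up dictionaries in place of a scraped repository and an API result; `merge_details` is a hypothetical helper, not part of the notebook. It uses `{**a, **b}` unpacking, which gives the same result as `dict(**a, **b)` when the keys do not overlap, but lets the later dictionary win instead of raising `TypeError` on a duplicate key.

```python
def merge_details(details, repo):
    """Combine API details with scraped fields; scraped values win on key clashes."""
    return {**details, **repo}

# Illustrative values only
repo = {'repository_name': 'example-repo', 'owner_username': 'example-user', 'stars': 42}
details = {'description': 'An example', 'open_issues': 3}

merged = merge_details(details, repo)
print(list(merged))

# A failed lookup returns {}, so the scraped entry passes through unchanged
unchanged = merge_details({}, repo)
print(unchanged == repo)
```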

In [79]:
add_repo_details(top_repositories[:5])

Fetching information for tensorflow/tensorflow
Fetching information for huggingface/transformers
Fetching information for pytorch/pytorch
Fetching information for netdata/netdata
Fetching information for Developer-Y/cs-video-courses


[{'description': 'An Open Source Machine Learning Framework for Everyone',
  'watchers': 177920,
  'open_issues': 2052,
  'created_at': '2015-11-07T01:19:20Z',
  'updated_at': '2023-10-03T13:43:55Z',
  'repository_name': 'tensorflow',
  'owner_username': 'tensorflow',
  'stars': 178000,
  'repository_url': 'https://github.com/tensorflow/tensorflow'},
 {'description': '🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.',
  'watchers': 112709,
  'open_issues': 865,
  'created_at': '2018-10-29T13:56:00Z',
  'updated_at': '2023-10-03T13:42:41Z',
  'repository_name': 'transformers',
  'owner_username': 'huggingface',
  'stars': 113000,
  'repository_url': 'https://github.com/huggingface/transformers'},
 {'description': 'Tensors and Dynamic neural networks in Python with strong GPU acceleration',
  'watchers': 71269,
  'open_issues': 12809,
  'created_at': '2016-08-13T05:26:41Z',
  'updated_at': '2023-10-03T13:23:18Z',
  'repository_name': 'pytorch',
  'owner

You may be rate limited if you make more than 60 unauthenticated requests per hour. Authenticated requests get a much higher limit, so to avoid being blocked, authenticate with a GitHub OAuth token as described here: https://towardsdatascience.com/all-the-things-you-can-do-with-github-api-and-python-f01790fca131
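
GitHub reports the current quota in response headers such as `X-RateLimit-Remaining` and `X-RateLimit-Reset` (a Unix timestamp). As a rough sketch, the hypothetical helper `rate_limit_status` below reads them from a headers mapping such as `response.headers`; the sample values are made up, and the fallback defaults are an assumption for responses that lack these headers.

```python
def rate_limit_status(headers, now):
    """Return (requests remaining, seconds until the quota resets)."""
    # Header values arrive as strings, so convert them to integers
    remaining = int(headers.get('X-RateLimit-Remaining', 0))  # assumed default: 0
    reset_at = int(headers.get('X-RateLimit-Reset', now))     # assumed default: now
    return remaining, max(0, reset_at - int(now))

# Illustrative headers only
sample_headers = {
    'X-RateLimit-Limit': '60',
    'X-RateLimit-Remaining': '0',
    'X-RateLimit-Reset': '1700000600',
}

remaining, wait_seconds = rate_limit_status(sample_headers, now=1700000000)
print(remaining, wait_seconds)
```

With a live response, you could pass `response.headers` and `time.time()` and pause for `wait_seconds` once `remaining` reaches zero.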

Note: Never publish your GitHub API token, as it can be used to access your GitHub account. To enter your token without displaying it on the screen, use `getpass`.
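
The following is a sketch of how the token might be supplied, assuming one has already been created in your GitHub account settings. `read_token` and `build_auth_headers` are hypothetical helper names; `getpass` prompts without echoing the input, and the token is sent in the `Authorization` header.

```python
from getpass import getpass

def read_token():
    """Prompt for the token without echoing it to the screen."""
    return getpass('GitHub token: ')

def build_auth_headers(token):
    """Build request headers that authenticate a GitHub API call."""
    return {
        'Authorization': 'Bearer ' + token,
        'Accept': 'application/vnd.github+json',
    }

# Placeholder value for illustration; never hard-code a real token
headers = build_auth_headers('not-a-real-token')
print(sorted(headers))
```

In the notebook, this could be used as `requests.get(repo_details_url, headers=build_auth_headers(read_token()))`.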