# TOP REPOSITORIES BY TOPIC ON GITHUB

## Process I'm Following to Build a Python Web Scraping Project From Scratch

![](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcS3bNvHFcfFCkH9TXpLkjZNt5Fq_oDbKyRmqg&usqp=CAU)

Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating datasets for research and learning. Follow these steps to build a web scraping project from scratch using Python and its ecosystem of libraries:

1. **Pick a website and describe your objective**


- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.



2. **Use the requests library to download web pages**

-  Inspect the website's HTML source and identify the right URLs to download. 
- Download and save web pages locally using the `requests` library.
- Create a function to automate downloading for different topics/search queries.

3. **Use Beautiful Soup to parse and extract information**

- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.



4. **Create CSV file(s) with the extracted information**

 - Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
 - Execute the function with different inputs to create a dataset of CSV files.
 - Verify the information in the CSV files by reading them back using [Pandas](https://pandas.pydata.org).


5. **Document and share my work**

 - Add proper headings and documentation in your Jupyter notebook.

## Project Outlines
- We are going to scrape https://github.com/topics.
- Get a list of topics. For each topic, get the topic title, topic page URL and topic description.
- For each topic, get the top 20 repositories from the topic page.
- For each repository, get the repo name, username, stars and repo URL.
- For each topic, create a CSV file in the following format:

```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,92200,https://github.com/mrdoob/three.js
react-three-fiber,pmndrs,22700,https://github.com/pmndrs/react-three-fiber
libgdx,libgdx,21500,https://github.com/libgdx/libgdx
```
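
A file in this format can be produced with Python's built-in `csv` module. The following is a minimal sketch, assuming the rows have already been collected as tuples; the helper name `rows_to_csv_text` is hypothetical, and the sample values follow the example above.

```python
import csv
import io

HEADER = ['Repo Name', 'Username', 'Stars', 'Repo URL']

def rows_to_csv_text(rows):
    """Render (repo_name, username, stars, repo_url) tuples as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()

# Sample rows based on the format example above
sample_rows = [
    ('three.js', 'mrdoob', 92200, 'https://github.com/mrdoob/three.js'),
    ('libgdx', 'libgdx', 21500, 'https://github.com/libgdx/libgdx'),
]
csv_text = rows_to_csv_text(sample_rows)
print(csv_text)
```

The notebook itself writes CSV files through pandas; this sketch only illustrates the target layout without extra dependencies.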

### Use the `requests` library to download the web pages

In [56]:
!pip install requests --upgrade --quiet

In [57]:
import requests

In [58]:
topics_url = 'https://github.com/topics'

In [59]:
response_url = requests.get(topics_url)

In [60]:
response_url.status_code

200

A status code of **200** means the request was successful. To learn more about HTTP status codes, visit https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
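
To make status codes easier to read while exploring, the standard library's `http.HTTPStatus` can translate a numeric code into its reason phrase. This is a small sketch; the helper name `describe_status` is hypothetical, and treating only the 2xx range as success is an assumption that fits this project.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable label for an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes that are not registered in HTTPStatus
        phrase = 'Unknown'
    outcome = 'success' if 200 <= code < 300 else 'not a success'
    return '{} {} ({})'.format(code, phrase, outcome)

print(describe_status(200))
print(describe_status(404))
```

In the notebook this could be called as `describe_status(response_url.status_code)`. The `requests` library also offers `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses.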

In [61]:
len(response_url.text)

155352

In [62]:
topics_page_contents = response_url.text

In [63]:
topics_page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-0946cdc16f15.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-3946c959759a.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media="all" rel="stylesheet" data-href="h

In [64]:
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(topics_page_contents)
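
The download and save steps can be wrapped in one reusable function. Below is a sketch under a few assumptions: the names `fetch_with_requests` and `download_page` are hypothetical, the 30-second timeout is an arbitrary choice, and the fetcher is passed in as a parameter so the function can be tried offline with a stand-in.

```python
import tempfile
from pathlib import Path

def fetch_with_requests(url):
    """Download a page with requests and return its text."""
    import requests  # third-party; imported here so the rest runs without it
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # raises for 4xx/5xx responses
    return response.text

def download_page(url, path, fetch=fetch_with_requests):
    """Fetch url, save the HTML to path as UTF-8, and return its length."""
    html = fetch(url)
    Path(path).write_text(html, encoding='utf-8')
    return len(html)

# Offline demonstration with a stand-in fetcher (no network access needed)
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / 'webpage.html'
    written = download_page('https://github.com/topics', target,
                            fetch=lambda url: '<html></html>')
    saved = target.read_text(encoding='utf-8')
```

Calling `download_page(topics_url, 'webpage.html')` without the `fetch` argument would use `requests`, as in the cells above.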

In [65]:
pip install beautifulsoup4 --upgrade --quiet

Note: you may need to restart the kernel to use updated packages.


In [66]:
from bs4 import BeautifulSoup

In [67]:
doc = BeautifulSoup(topics_page_contents, 'html.parser')

In [68]:
type(doc)

bs4.BeautifulSoup

In [69]:
topic_title_p_tags = doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})

In [70]:
len(topic_title_p_tags)

30

In [71]:
topic_title_p_tags

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Atom</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Awesome Lists</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amazon Web Services</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Azure</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Babel</p>,
 <p class="f3 lh-condensed m

In [72]:
topic_titles = []

for tag in topic_title_p_tags:
    topic_titles.append(tag.text)

print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [73]:
topic_desc_tags = doc.find_all('p', {'class': 'f5 color-fg-muted mb-0 mt-1'})

In [74]:
len(topic_desc_tags)

30

In [75]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [76]:
topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-grow-0'})

In [77]:
topic_descs = []

for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())
    
topic_descs[:5]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [78]:
topic_urls = []
base_url = 'https://github.com'

for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])
    
topic_urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil
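
String concatenation works here because every `href` on the page is a path starting with `/`. As a hedged alternative, `urllib.parse.urljoin` from the standard library also copes with links that are already absolute. The helper name `absolute_url` is hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = 'https://github.com'

def absolute_url(href, base_url=BASE_URL):
    """Resolve a link from the page against the site's base URL."""
    return urljoin(base_url, href)

print(absolute_url('/topics/3d'))
print(absolute_url('https://github.com/topics/ajax'))
```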

In [79]:
!pip install pandas --quiet

In [80]:
import pandas as pd

In [81]:
topics_dict = {
    'title': topic_titles,
    'description': topic_descs,
    'url': topic_urls
}
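
A dictionary like this only converts cleanly to a table if every list has the same length, which is why the `len(...)` checks above all returned 30. A small sketch of that sanity check follows; the helper name `check_equal_lengths` is hypothetical and the sample values are shortened stand-ins.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the lists in a dict of columns differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    # Return the shared length (0 for an empty dict)
    return next(iter(lengths.values()), 0)

sample_columns = {
    'title': ['3D', 'Ajax'],
    'description': ['3D refers to ...', 'Ajax is a technique ...'],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
}
row_count = check_equal_lengths(sample_columns)
print(row_count)
```

If a CSS class on the page changes and one selector stops matching, a check like this fails early with a message naming the short column.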

In [82]:
topics_dict

{'title': ['3D',
  'Ajax',
  'Algorithm',
  'Amp',
  'Android',
  'Angular',
  'Ansible',
  'API',
  'Arduino',
  'ASP.NET',
  'Atom',
  'Awesome Lists',
  'Amazon Web Services',
  'Azure',
  'Babel',
  'Bash',
  'Bitcoin',
  'Bootstrap',
  'Bot',
  'C',
  'Chrome',
  'Chrome extension',
  'Command line interface',
  'Clojure',
  'Code quality',
  'Code review',
  'Compiler',
  'Continuous integration',
  'COVID-19',
  'C++'],
 'description': ['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
  'Ajax is a technique for creating interactive web applications.',
  'Algorithms are self-contained sequences that carry out a variety of tasks.',
  'Amp is a non-blocking concurrency library for PHP.',
  'Android is an operating system built by Google designed for mobile devices.',
  'Angular is an open source web application platform.',
  'Ansible is a simple and powerful automation engine.',
  'An API (Application Programming Interface) is a 

In [83]:
topics_df = pd.DataFrame(topics_dict)

In [84]:
topics_df

Unnamed: 0,title,description,url
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


### Create CSV files with the extracted information

In [85]:
topics_df.to_csv('topics.csv', index = None)

### Getting information out of a topic page

In [86]:
topic_page_url = topic_urls[0]

In [87]:
topic_page_url

'https://github.com/topics/3d'

In [88]:
response = requests.get(topic_page_url)

In [89]:
response.status_code

200

In [90]:
len(response.text)

463455

In [91]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [92]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})

In [93]:
len(repo_tags)

20

In [94]:
a_tags = repo_tags[0].find_all('a')

In [95]:
a_tags[0].text.strip()

'mrdoob'

In [96]:
a_tags[1].text.strip()

'three.js'

In [97]:
repo_url = base_url + a_tags[1]['href']
print(repo_url)

https://github.com/mrdoob/three.js


In [98]:
star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})

In [99]:
len(star_tags)

20

In [100]:
star_tags[6].text.strip()

'15.4k'

In [101]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)

In [102]:
parse_star_count(star_tags[6].text.strip())

15400
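
Because `int()` truncates, a float product that lands just below a whole number could in principle lose a star. The sketch below is one possible variant that uses `decimal.Decimal` for exact arithmetic. The name `parse_star_count_exact`, the `m` suffix, and the comma handling are assumptions; only the `k` suffix appears in the output above.

```python
from decimal import Decimal

# Assumed suffixes; only 'k' is confirmed by the page output above
MULTIPLIERS = {'k': 1000, 'm': 1000000}

def parse_star_count_exact(stars_str):
    """Convert strings such as '15.4k' or '980' to an integer star count."""
    text = stars_str.strip().lower().replace(',', '')
    multiplier = MULTIPLIERS.get(text[-1:], 1)
    if multiplier != 1:
        text = text[:-1]  # drop the suffix character
    return int(Decimal(text) * multiplier)

print(parse_star_count_exact('15.4k'))
print(parse_star_count_exact('980'))
```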

### Divide and conquer: using `parse_star_count(stars_str)` inside a helper function

In [103]:
def get_repo_info(h3_tag, star_tag):
    # Returns all necessary info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    # The second link points to the repository; the first points to the user
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [104]:
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 92300, 'https://github.com/mrdoob/three.js')

In [105]:
get_repo_info(repo_tags[1], star_tags[1])

('pmndrs', 'react-three-fiber', 22700, 'https://github.com/pmndrs/react-three-fiber')

In [106]:
topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
}


for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])

In [107]:
topic_repos_df = pd.DataFrame(topic_repos_dict)

In [108]:
topic_repos_df

Unnamed: 0,username,repo_name,stars,repo_url
0,mrdoob,three.js,92300,https://github.com/mrdoob/three.js
1,pmndrs,react-three-fiber,22700,https://github.com/pmndrs/react-three-fiber
2,libgdx,libgdx,21500,https://github.com/libgdx/libgdx
3,BabylonJS,Babylon.js,20700,https://github.com/BabylonJS/Babylon.js
4,ssloy,tinyrenderer,17000,https://github.com/ssloy/tinyrenderer
5,lettier,3d-game-shaders-for-beginners,15500,https://github.com/lettier/3d-game-shaders-for...
6,aframevr,aframe,15400,https://github.com/aframevr/aframe
7,FreeCAD,FreeCAD,14200,https://github.com/FreeCAD/FreeCAD
8,CesiumGS,cesium,10500,https://github.com/CesiumGS/cesium
9,metafizzy,zdog,9700,https://github.com/metafizzy/zdog


In [109]:
range(len(repo_tags))

range(0, 20)
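
The index-based loop over `range(len(repo_tags))` walks two parallel lists in step. The same pairing can be written with `zip`, which avoids manual indexing. This is a sketch with plain values standing in for the tags; the helper name `collect_columns` is hypothetical.

```python
def collect_columns(names, counts):
    """Pair two parallel lists item by item and build a dict of columns."""
    columns = {'repo_name': [], 'stars': []}
    for name, count in zip(names, counts):
        columns['repo_name'].append(name)
        columns['stars'].append(count)
    return columns

# Plain values standing in for the h3 tags and star tags
repo_names = ['three.js', 'react-three-fiber', 'libgdx']
star_counts = [92300, 22700, 21500]
columns = collect_columns(repo_names, star_counts)
print(columns)
```

In the notebook the loop header would become `for h3_tag, star_tag in zip(repo_tags, star_tags):`. Note that `zip` stops silently at the shorter list, so it is worth confirming first that both lists have the same length.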

## Final Code

In [146]:
import os
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)

    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using BeautifulSoup
    topic_doc = BeautifulSoup(response.text, 'html.parser')

    return topic_doc
    
def get_repo_info(h3_tag, star_tag):
    # Returns all necessary info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    # The second link points to the repository; the first points to the user
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

def get_topic_repos(topic_doc):
    # Get the h3 tags containing username, repo title and repo URL
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})
    # Get star tags
    star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})
    
    # Get repo info
    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])

    return pd.DataFrame(topic_repos_dict)

def scrape_topic(topic_url, path):
    # Skip topics whose CSV file has already been created
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index = None)


In [147]:
url4 = topic_urls[1]

url4

'https://github.com/topics/ajax'

In [148]:
topic4_doc = get_topic_page(url4)

In [149]:
topic4_repos = get_topic_repos(topic4_doc)

In [150]:
topic4_repos

Unnamed: 0,username,repo_name,stars,repo_url
0,ljianshu,Blog,7600,https://github.com/ljianshu/Blog
1,metafizzy,infinite-scroll,7300,https://github.com/metafizzy/infinite-scroll
2,developit,unfetch,5600,https://github.com/developit/unfetch
3,olifolkerd,tabulator,5500,https://github.com/olifolkerd/tabulator
4,jquery-form,form,5200,https://github.com/jquery-form/form
5,Studio-42,elFinder,4400,https://github.com/Studio-42/elFinder
6,elbywan,wretch,3900,https://github.com/elbywan/wretch
7,dwyl,learn-to-send-email-via-google-script-html-no-...,3000,https://github.com/dwyl/learn-to-send-email-vi...
8,ded,reqwest,2900,https://github.com/ded/reqwest
9,LeaVerou,bliss,2400,https://github.com/LeaVerou/bliss


In [151]:
# Do all four steps above in a single line of code

get_topic_repos(get_topic_page(topic_urls[1])).to_csv('ajax.csv', index = None)


## Now, write a single function to:

- Get the list of topics from the topics page
- Get the list of top repos from the individual topic pages
- For each topic, create a CSV of the top repos for the topic
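
The three steps above can be sketched as a small driver loop. This is an illustration under stated assumptions, not the notebook's final code: the names `csv_path_for`, `scrape_all` and `fake_scrape` are hypothetical, the per-topic scraper is passed in as a parameter so the loop can be tried offline, and the file name is taken from the last segment of the topic URL.

```python
import os
import tempfile

def csv_path_for(topic_url, out_dir):
    """Build a CSV path from the last segment of a topic URL."""
    slug = topic_url.rstrip('/').rsplit('/', 1)[-1]
    return os.path.join(out_dir, slug + '.csv')

def scrape_all(topic_urls, scrape_one, out_dir):
    """Call scrape_one(url, path) for each topic whose CSV does not exist yet."""
    os.makedirs(out_dir, exist_ok=True)
    written, skipped = [], []
    for topic_url in topic_urls:
        path = csv_path_for(topic_url, out_dir)
        if os.path.exists(path):
            skipped.append(path)
            continue
        scrape_one(topic_url, path)
        written.append(path)
    return written, skipped

# Offline demonstration: a stand-in scraper that only writes a header row
def fake_scrape(topic_url, path):
    with open(path, 'w', encoding='utf-8') as f:
        f.write('username,repo_name,stars,repo_url\n')

with tempfile.TemporaryDirectory() as out_dir:
    urls = ['https://github.com/topics/3d', 'https://github.com/topics/ajax']
    first_run = scrape_all(urls, fake_scrape, out_dir)
    second_run = scrape_all(urls, fake_scrape, out_dir)
```

On the second run both topics are skipped because their files already exist, which makes the scraper safe to re-run after an interruption.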

In [156]:
def get_topic_titles(doc):
    topic_title_p_tags = doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})
    topic_titles = []
    for tag in topic_title_p_tags:
        topic_titles.append(tag.text)
    return topic_titles
    
def get_topic_descs(doc):
    topic_desc_tags = doc.find_all('p', {'class': 'f5 color-fg-muted mb-0 mt-1'})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs
    
def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-grow-0'})
    topic_urls = []
    base_url = 'https://github.com'

    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])

    return topic_urls
    
    
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))

    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }

    return pd.DataFrame(topics_dict)

In [153]:
import os
help(os.makedirs)

Help on function makedirs in module os:

makedirs(name, mode=511, exist_ok=False)
    makedirs(name [, mode=0o777][, exist_ok=False])
    
    Super-mkdir; create a leaf directory and all intermediate ones.  Works like
    mkdir, except that any intermediate path segment (not just the rightmost)
    will be created if it does not exist. If the target directory already
    exists, raise an OSError if exist_ok is False. Otherwise no exception is
    raised.  This is recursive.
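
The effect of `exist_ok` described in the help text can be checked with a short experiment. This sketch works inside a temporary directory so it leaves nothing behind; the directory names are arbitrary.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, 'data', 'topics')
    os.makedirs(target, exist_ok=True)   # creates 'data' and 'data/topics'
    os.makedirs(target, exist_ok=True)   # second call does nothing
    created = os.path.isdir(target)
    try:
        os.makedirs(target)              # exist_ok defaults to False
        raised = False
    except FileExistsError:
        raised = True

print(created, raised)
```

Passing `exist_ok=True` is what lets the scraper be re-run without failing on an existing `data` folder.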



In [154]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('data', exist_ok=True)
    
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))


In [155]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositor