# Top Repositories for GitHub Topics


# @Author




# Pick a website and describe your objective

- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.

## Project Outline
- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL, and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- We'll save the list of topics to a CSV file in the following format:
```
Title,Description,URL
3D,3D modeling is the process of virtually developing the surface and structure of a 3D object.,https://github.com/topics/3d
Ajax,Ajax is a technique for creating interactive web applications.,https://github.com/topics/ajax
```

# Use the requests library to download web pages


In [14]:
!pip install requests --upgrade --quiet

In [15]:
import requests

In [16]:
topics_url = 'https://github.com/topics'

In [17]:
response = requests.get(topics_url)

In [18]:
response.status_code  # 200 means the request succeeded

200
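
A status code in the 200 range signals success; with any other code, `response.text` may hold an error page rather than the topics list. A minimal sketch of a guard, assuming a hypothetical helper name `page_text_or_raise` (not part of `requests`) and any object that exposes `status_code` and `text` the way the response above does:

```python
from types import SimpleNamespace


def page_text_or_raise(response, url):
    """Return the page HTML, or raise if the status code is not 2xx.

    Hypothetical helper for illustration; it only relies on the
    `status_code` and `text` attributes of the object passed in.
    """
    if not 200 <= response.status_code < 300:
        raise RuntimeError(f"Failed to load {url}: status {response.status_code}")
    return response.text


# Stand-in object so the sketch runs offline; in the notebook, pass the real `response`.
fake_response = SimpleNamespace(status_code=200, text="<html></html>")
html = page_text_or_raise(fake_response, "https://github.com/topics")
```

With a real `requests` response, the built-in `response.raise_for_status()` offers a similar check: it raises an `HTTPError` for 4xx and 5xx responses.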

In [19]:
len(response.text)

141647

In [20]:
page_contents = response.text

In [21]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-ksfTgQOOnE+FFXf+yNfVjKSlEckJAdufFIYGK7ZjRhWcZgzAGcmZqqArTgMLpu90FwthqcCX4ldDgKXbmVMeuQ==" rel="stylesheet" href="https://github.githubassets.com/assets/light-92c7d381038e.css" /><link crossorigin="anonymous" media="all" integrity="sha512-1KkMNn8M/al/dtzBLupRwkIOgnA9MWkm8oxS+solP87jByEvY/g4BmoxLihRogKcX

# Use Beautiful Soup to parse and extract information
- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract information from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.

In [24]:
!pip install beautifulsoup4 --upgrade --quiet

In [25]:
from bs4 import BeautifulSoup

In [26]:
doc = BeautifulSoup(page_contents,'html.parser')

In [28]:
type(doc)

bs4.BeautifulSoup

In [41]:
topic_title_p_tags=doc.find_all('p',{'class':'f3 lh-condensed mb-0 mt-1 Link--primary'})

In [42]:
topic_title_p_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [45]:
topic_des_p_tags=doc.find_all('p',{'class':'f5 color-fg-muted mb-0 mt-1'})

In [47]:
topic_des_p_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [85]:
topic_link_a_tags = doc.find_all('a',{'class':'no-underline flex-1 d-flex flex-column'})

In [86]:
len(topic_link_a_tags)

30

In [87]:
topic_link_a_tags

[<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>
 </a>,
 <a class="no-underline flex-1 d-flex flex-column" href="/topics/ajax">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>
 </a>,
 <a class="no-underline flex-1 d-flex flex-column" href="/topics/algorithm">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>
 </a>,
 <a class="no-underline flex-1 d-flex flex-column" href="/topics/amphp">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>
 <p class="f

In [88]:
topic_link_a_tags[0]['href']

'/topics/3d'

In [89]:
topic0_url = "https://github.com"+topic_link_a_tags[0]['href']

In [90]:
topic0_url

'https://github.com/topics/3d'
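
String concatenation works here because every `href` on the page is a path starting with `/`. As a hedged alternative (a swapped-in technique, not what the cells above use), the standard library's `urllib.parse.urljoin` resolves a relative `href` against the base URL and leaves an already absolute URL unchanged:

```python
from urllib.parse import urljoin

base_url = "https://github.com"

# A relative href is resolved against the base URL.
relative_result = urljoin(base_url, "/topics/3d")

# An href that is already absolute is returned as-is.
absolute_result = urljoin(base_url, "https://github.com/topics/ajax")

print(relative_result)
print(absolute_result)
```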

In [None]:
topic_title_p_tags[0].text

In [96]:
topic_titles = []

for tag in topic_title_p_tags:
    topic_titles.append(tag.text)

print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [99]:
topic_des = []

for tag in topic_des_p_tags:
    topic_des.append(tag.text.strip())
    
print(topic_des[:5])

['3D modeling is the process of virtually developing the surface and structure of a 3D object.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.', 'Android is an operating system built by Google designed for mobile devices.']


In [105]:
topic_urls =[]
base_url = "https://github.com"
for tag in topic_link_a_tags:
    topic_urls.append(base_url+tag['href'])
    
print(topic_urls[:5])

['https://github.com/topics/3d', 'https://github.com/topics/ajax', 'https://github.com/topics/algorithm', 'https://github.com/topics/amphp', 'https://github.com/topics/android']
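
The three loops above build the lists from three separate `find_all` calls, so the lists stay aligned only if every topic has a title, a description, and a link. A minimal sketch of a single function that walks each topic link once and reads the title and description from inside it; the name `parse_topics` and the empty-string fallback are assumptions, and the class strings are copied from the cells above (GitHub may change them):

```python
from bs4 import BeautifulSoup

BASE_URL = "https://github.com"

# Class strings copied from the earlier cells; they depend on GitHub's current markup.
LINK_CLASS = "no-underline flex-1 d-flex flex-column"
TITLE_CLASS = "f3 lh-condensed mb-0 mt-1 Link--primary"
DESC_CLASS = "f5 color-fg-muted mb-0 mt-1"


def parse_topics(html):
    """Hypothetical helper: return a dict of lists with one entry per topic link."""
    doc = BeautifulSoup(html, "html.parser")
    topics = {"title": [], "description": [], "url": []}
    for link in doc.find_all("a", {"class": LINK_CLASS}):
        # Look for the title and description inside this link only.
        title_tag = link.find("p", {"class": TITLE_CLASS})
        desc_tag = link.find("p", {"class": DESC_CLASS})
        # Fall back to an empty string (an assumption) when a tag is missing.
        topics["title"].append(title_tag.text.strip() if title_tag else "")
        topics["description"].append(desc_tag.text.strip() if desc_tag else "")
        topics["url"].append(BASE_URL + link["href"])
    return topics


# Small sample shaped like the output shown earlier, so the sketch runs offline.
sample_html = """
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
  3D modeling is the process of virtually developing the surface and structure of a 3D object.
</p>
</a>
"""
topics = parse_topics(sample_html)
print(topics)
```

In the notebook, `parse_topics(page_contents)` would return a dictionary in the shape that `pd.DataFrame` accepts.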


In [106]:
!pip install pandas --quiet

In [107]:
import pandas as pd

In [111]:
# Use a descriptive name; `dict` would shadow the Python built-in
topics_dict = {
    'title': topic_titles,
    'description': topic_des,
    'url': topic_urls
}
topics_df = pd.DataFrame(topics_dict)
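
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary differ in length, which can happen if one of the scraped lists is missing an entry. A small sketch of an explicit check that reports which column is short; `check_equal_lengths` is a hypothetical helper, and equal lengths alone do not prove the rows are correctly aligned:

```python
def check_equal_lengths(columns):
    """Hypothetical guard: raise if the column lists differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths


# Stand-in lists; in the notebook, pass topic_titles, topic_des and topic_urls.
lengths = check_equal_lengths({
    "title": ["3D", "Ajax"],
    "description": [
        "3D modeling is the process of virtually developing the surface and structure of a 3D object.",
        "Ajax is a technique for creating interactive web applications.",
    ],
    "url": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
})
print(lengths)
```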

In [112]:
topics_df

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


# Create CSV file(s) with the extracted information

In [114]:
topics_df.to_csv('topics.csv', index=False)  # index=False leaves out the row index column
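
Descriptions can contain commas, so the CSV writer has to quote those fields for the file to parse back correctly. `to_csv` uses minimal quoting by default, which handles this. The sketch below shows the same behaviour with the standard library's `csv` module on an in-memory buffer; the row is made-up sample data, not scraped output:

```python
import csv
import io

# Made-up sample row (an assumption for illustration) with a comma in the description.
rows = [
    {
        "title": "Example",
        "description": "A made-up description, with a comma",
        "url": "https://github.com/topics/example",
    },
]

# Write to an in-memory buffer instead of a file so nothing is left on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "description", "url"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original fields, comma included.
read_back = list(csv.DictReader(io.StringIO(csv_text)))
```

The description appears wrapped in double quotes in `csv_text`, and `read_back` holds the same values as `rows`.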