# Scraping GitHub topics


## Scrape the list of topics from GitHub

The approach:

- Use `requests` to download the page.
- Use Beautiful Soup (`bs4`) to parse the HTML and extract information.
- Convert the results to a pandas DataFrame.

Let's write a function to download the page.

#### Project Outline:
- We are going to scrape https://github.com/topics
- We will get a list of topics. For each topic, we'll get the topic title, URL, and description.
- For each topic, we will get the top 25 repositories.
- For each repository, we will collect the repo name, username, stars, and URL.
- For each topic, we will create a CSV file in this format:
```
 Repo Name,Username,Stars,Repo URL
 three.js,mrdoob,69700,https://github.com/mrdoob/three.js
```
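
As a minimal sketch of the target file layout, the snippet below writes rows in that column order using Python's standard `csv` module. The `write_repos_csv` helper and the in-memory buffer are illustrative assumptions; the notebook itself writes its files with pandas.

```python
import csv
import io

FIELDNAMES = ["Repo Name", "Username", "Stars", "Repo URL"]

def write_repos_csv(rows, fileobj):
    """Write repository rows (dicts keyed by FIELDNAMES) as CSV to fileobj."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)

# Example row taken from the format shown above.
rows = [{
    "Repo Name": "three.js",
    "Username": "mrdoob",
    "Stars": 69700,
    "Repo URL": "https://github.com/mrdoob/three.js",
}]

buffer = io.StringIO()
write_repos_csv(rows, buffer)
print(buffer.getvalue())
```

Passing an open file instead of the `io.StringIO` buffer would write the same content to disk.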

### Use the requests library to download web pages

In [2]:
# !pip install requests --upgrade --quiet

In [215]:
import requests
import os

In [4]:
topic_url = 'https://github.com/topics'


In [6]:
response = requests.get(topic_url)

In [9]:
response.status_code  # a status code in the 200-299 range indicates a successful response

200
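
The 200-299 rule noted in the comment can be captured in a small helper. This is only a sketch; `is_success` is a hypothetical name and not part of `requests`.

```python
def is_success(status_code):
    """Return True when an HTTP status code is in the 2xx (success) range."""
    return 200 <= status_code <= 299

print(is_success(200))  # True
print(is_success(404))  # False
```

In practice, calling `response.raise_for_status()` is a common alternative: it raises an exception for 4xx and 5xx responses.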

In [17]:
page_contents = response.text
# Save the downloaded HTML to a local file for inspection
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)

### Use Beautiful Soup to parse and extract information

In [18]:
!pip install beautifulsoup4 --upgrade --quiet

In [26]:
import bs4 as bs

In [27]:
doc = bs.BeautifulSoup(page_contents, "html.parser")

In [55]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tag = doc.find_all('p',{'class': selection_class })

In [56]:
topic_title_tag[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [57]:
desc_selector = 'f5 color-fg-muted mb-0 mt-1'
topic_desc_tag = doc.find_all('p',{'class':desc_selector})

In [58]:
topic_desc_tag[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [69]:
url_selector = 'no-underline flex-1 d-flex flex-column'
topic_link_tag = doc.find_all('a',{'class':url_selector})

In [71]:
topic_link_tag[:5]

[<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>
 </a>,
 <a class="no-underline flex-1 d-flex flex-column" href="/topics/ajax">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>
 </a>,
 <a class="no-underline flex-1 d-flex flex-column" href="/topics/algorithm">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>
 </a>,
 <a class="no-underline flex-1 d-flex flex-column" href="/topics/amphp">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>
 <p class="f

In [76]:
base_url = "https://github.com/"
print(base_url+topic_link_tag[0]['href'])

https://github.com//topics/3d
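
The printed URL contains a double slash because `base_url` ends with `/` and the `href` starts with `/`. One way to avoid this, sketched here with the standard library's `urllib.parse.urljoin`, is to let the URL joiner resolve the path. The `href` value is copied from the output above.

```python
from urllib.parse import urljoin

base_url = "https://github.com/"
href = "/topics/3d"  # value of topic_link_tag[0]['href'] shown above

# urljoin resolves the absolute path against the base, so no '//' appears
topic_url = urljoin(base_url, href)
print(topic_url)  # https://github.com/topics/3d
```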


In [91]:
print(topic_desc_tag[0])

<p class="f5 color-fg-muted mb-0 mt-1">
          3D modeling is the process of virtually developing the surface and structure of a 3D object.
        </p>


In [88]:
topic_title = []
for tag in topic_title_tag:
    topic_title.append(tag.text)
# print(topic_title)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [95]:
topic_desc = []
for desc in topic_desc_tag:
    topic_desc.append(desc.text.strip())
# print(topic_desc)

In [98]:
topic_urls = []
for url in  topic_link_tag:
    topic_urls.append(base_url+url['href'])
# print(topic_urls)

['https://github.com//topics/3d', 'https://github.com//topics/ajax', 'https://github.com//topics/algorithm', 'https://github.com//topics/amphp', 'https://github.com//topics/android', 'https://github.com//topics/angular', 'https://github.com//topics/ansible', 'https://github.com//topics/api', 'https://github.com//topics/arduino', 'https://github.com//topics/aspnet', 'https://github.com//topics/atom', 'https://github.com//topics/awesome', 'https://github.com//topics/aws', 'https://github.com//topics/azure', 'https://github.com//topics/babel', 'https://github.com//topics/bash', 'https://github.com//topics/bitcoin', 'https://github.com//topics/bootstrap', 'https://github.com//topics/bot', 'https://github.com//topics/c', 'https://github.com//topics/chrome', 'https://github.com//topics/chrome-extension', 'https://github.com//topics/cli', 'https://github.com//topics/clojure', 'https://github.com//topics/code-quality', 'https://github.com//topics/code-review', 'https://github.com//topics/compi

In [99]:
!pip install pandas --upgrade --quiet

In [100]:
import pandas as pd

In [102]:
topic_dict = {
    'title':topic_title,
    'description': topic_desc,
    'url': topic_urls
}

In [105]:
topic_df = pd.DataFrame(topic_dict)
topic_df

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com//topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com//topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com//topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com//topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com//topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com//topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com//topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com//topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com//topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com//topics/aspnet |


### Create CSV file(s) with the extracted information

In [106]:
topic_df.to_csv('topic.csv',index=None)

### Getting info from topic page

In [107]:
topic_page_url = topic_urls[0]

In [111]:
response = requests.get(topic_page_url)
# response.status_code

In [113]:
topic_doc = bs.BeautifulSoup(response.text,'html.parser')

In [174]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags = topic_doc.find_all('h3',{'class': h3_selection_class})
# len(repo_tags)

In [176]:
star_selector = 'Counter js-social-count'
star_tags = topic_doc.find_all('span',{'class': star_selector})

In [177]:

def parse_star_count(stars_count):
    # Convert star text such as '1.5k' or '950' to an integer
    stars_str = stars_count.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)

print(parse_star_count(star_tags[0].text))

83700
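
`parse_star_count` handles plain integers and a trailing `k`. A slightly more defensive sketch is shown below. The handling of thousands separators and an `m` suffix is an assumption about formats GitHub might display, not something observed in this notebook, and `parse_count` is a hypothetical name.

```python
def parse_count(text):
    """Parse strings such as '950', '1,234', '83.7k', or '1.2m' into an int."""
    cleaned = text.strip().lower().replace(",", "")
    multipliers = {"k": 1_000, "m": 1_000_000}
    if cleaned and cleaned[-1] in multipliers:
        # round() avoids off-by-one results from floating-point truncation
        return round(float(cleaned[:-1]) * multipliers[cleaned[-1]])
    return int(cleaned)

print(parse_count(" 83.7k "))  # 83700
print(parse_count("1,234"))    # 1234
```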


In [178]:
def get_repo_info(h3_tag, star_tag):
    # Return the username, repo name, star count, and URL for one repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    star_count = parse_star_count(star_tag.text)
    return username, repo_name, star_count, repo_url

In [185]:
topic_repos_dict = {
    'username':[],
    'repo_name':[],
    'stars': [],
    'repo_url':[]
}
for i in range (len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i],star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])
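
The loop above pairs `repo_tags[i]` with `star_tags[i]` by index, which assumes both lists have the same length and order. Below is a sketch of the same pairing with `zip`, using plain tuples in place of parsed tags. The sample values come from outputs in this notebook, and `build_repo_columns` is a hypothetical helper.

```python
def build_repo_columns(repos, stars):
    """Pair (username, repo_name) tuples with star counts into column lists."""
    if len(repos) != len(stars):
        raise ValueError("repos and stars must have the same length")
    columns = {"username": [], "repo_name": [], "stars": []}
    for (username, repo_name), star_count in zip(repos, stars):
        columns["username"].append(username)
        columns["repo_name"].append(repo_name)
        columns["stars"].append(star_count)
    return columns

columns = build_repo_columns([("libgdx", "libgdx"), ("atom", "atom")], [20200, 58300])
print(columns["stars"])  # [20200, 58300]
```

Checking the lengths up front turns a silent mismatch into an explicit error.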

In [179]:
get_repo_info(repo_tags[1],star_tags[1])

('libgdx', 'libgdx', 20200, 'https://github.com//libgdx/libgdx')

In [188]:
topic_repos = pd.DataFrame(topic_repos_dict)


In [229]:
def get_topic_page(topic_url):
    response = requests.get(topic_url)
    if response.status_code !=200:
        raise Exception('Failed to load the page {}'.format(topic_url))
    topic_doc = bs.BeautifulSoup(response.text,'html.parser')
    return topic_doc

def get_repo_info(h3_tag, star_tag):
    # Return the username, repo name, star count, and URL for one repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    star_count = parse_star_count(star_tag.text)
    return username, repo_name, star_count, repo_url

def get_topic_repos(topic_doc):
    # get h3 tags for repo name, url, etc.
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3',{'class': h3_selection_class})

    # get star tags
    star_selector = 'Counter js-social-count'
    star_tags = topic_doc.find_all('span',{'class': star_selector})

    # get repo info
    topic_repos_dict = {
        'username':[],
        'repo_name':[],
        'stars': [],
        'repo_url':[]
    }
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i],star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])

    # put all in dataframe
    topic_repos = pd.DataFrame(topic_repos_dict)
    return topic_repos

def scrape_topic(topic_url,path):

    if os.path.exists(path):
        print("File {} already exits.".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path,index=None)

In [193]:
get_topic_repos(get_topic_page(topic_urls[10]))

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | atom | atom | 58300 | https://github.com//atom/atom |
| 1 | themerdev | themer | 4800 | https://github.com//themerdev/themer |
| 2 | nteract | hydrogen | 3800 | https://github.com//nteract/hydrogen |
| 3 | miniflux | v2 | 3800 | https://github.com//miniflux/v2 |
| 4 | shd101wyy | markdown-preview-enhanced | 3700 | https://github.com//shd101wyy/markdown-preview... |
| 5 | atom | teletype | 2400 | https://github.com//atom/teletype |
| 6 | mmcdole | gofeed | 1900 | https://github.com//mmcdole/gofeed |
| 7 | mehcode | awesome-atom | 1900 | https://github.com//mehcode/awesome-atom |
| 8 | joefitzgerald | go-plus | 1500 | https://github.com//joefitzgerald/go-plus |
| 9 | Glavin001 | atom-beautify | 1500 | https://github.com//Glavin001/atom-beautify |


Write a single function to:
1. Get the list of topics from the topics page
2. Get list of top repos from the individual topic pages
3. For each topic create a csv of the top repos for the topic 

In [199]:
def get_topic_title(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tag = doc.find_all('p',{'class': selection_class })
    
    topic_title = []
    for tag in topic_title_tag:
        topic_title.append(tag.text)
    return topic_title

def get_topic_desc(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tag = doc.find_all('p',{'class':desc_selector})
    topic_desc = []
    for desc in topic_desc_tag:
        topic_desc.append(desc.text.strip())
    return topic_desc
    
    
def get_topic_url(doc):
    base_url = "https://github.com/"
    url_selector = 'no-underline flex-1 d-flex flex-column'
    topic_link_tag = doc.find_all('a',{'class':url_selector})
    topic_urls = []
    for url in topic_link_tag:
        topic_urls.append(base_url+url['href'])
    return topic_urls

def scrape_topics():
    topic_url = 'https://github.com/topics'
    response = requests.get(topic_url)
    if response.status_code != 200:
        raise Exception('Failed to load the page {}'.format(topic_url))

    # Parse the freshly downloaded page rather than a global variable
    doc = bs.BeautifulSoup(response.text, "html.parser")
    topics_dict = {
        'title': get_topic_title(doc),
        'description': get_topic_desc(doc),
        'url': get_topic_url(doc),
    }
    return pd.DataFrame(topics_dict)

In [200]:
scrape_topics()

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com//topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com//topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com//topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com//topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com//topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com//topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com//topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com//topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com//topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com//topics/aspnet |


In [227]:
def scrape_topics_repos():
    topics_df = scrape_topics()
    
    os.makedirs('data',exist_ok = True)
    
    for index,row in topics_df.iterrows():
        print("Scrapping topic {}.".format(row['title']))
        scrape_topic(row['url'],'data/{}.csv'.format(row['title']))
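
Topic titles are used directly as file names here, so a title containing a path separator would produce an unexpected path. A sketch of a sanitizing helper follows. `safe_filename` and the set of replaced characters are assumptions for illustration and are not part of the notebook.

```python
import re

def safe_filename(title):
    """Replace whitespace and characters that are unsafe in file names with '_'."""
    cleaned = re.sub(r'[<>:"/\\|?*\s]+', "_", title)
    return cleaned.strip("_")

print(safe_filename("Awesome Lists"))  # Awesome_Lists
print(safe_filename("C++"))            # C++
```

The result could then be passed to the path template, for example `'data/{}.csv'.format(safe_filename(row['title']))`.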
        

In [231]:
scrape_topics_repos()

Scrapping topic 3D.
File data/3D.csv already exits.
Scrapping topic Ajax.
File data/Ajax.csv already exits.
Scrapping topic Algorithm.
File data/Algorithm.csv already exits.
Scrapping topic Amp.
File data/Amp.csv already exits.
Scrapping topic Android.
File data/Android.csv already exits.
Scrapping topic Angular.
File data/Angular.csv already exits.
Scrapping topic Ansible.
File data/Ansible.csv already exits.
Scrapping topic API.
File data/API.csv already exits.
Scrapping topic Arduino.
File data/Arduino.csv already exits.
Scrapping topic ASP.NET.
File data/ASP.NET.csv already exits.
Scrapping topic Atom.
File data/Atom.csv already exits.
Scrapping topic Awesome Lists.
File data/Awesome Lists.csv already exits.
Scrapping topic Amazon Web Services.
File data/Amazon Web Services.csv already exits.
Scrapping topic Azure.
File data/Azure.csv already exits.
Scrapping topic Babel.
File data/Babel.csv already exits.
Scrapping topic Bash.
File data/Bash.csv already exits.
Scrapping topic Bitc

# Final Code 

In [236]:
import os
import requests
import bs4 as bs
import pandas as pd

Write a single function to:
1. Get the list of topics from the topics page
2. Get list of top repos from the individual topic pages
3. For each topic create a csv of the top repos for the topic 

In [237]:
def get_topic_title(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tag = doc.find_all('p',{'class': selection_class })
    
    topic_title = []
    for tag in topic_title_tag:
        topic_title.append(tag.text)
    return topic_title

def get_topic_desc(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tag = doc.find_all('p',{'class':desc_selector})
    topic_desc = []
    for desc in topic_desc_tag:
        topic_desc.append(desc.text.strip())
    return topic_desc
    
    
def get_topic_url(doc):
    base_url = "https://github.com"
    url_selector = 'no-underline flex-1 d-flex flex-column'
    topic_link_tag = doc.find_all('a',{'class':url_selector})        
    topic_urls = []
    for url in  topic_link_tag:
        topic_urls.append(base_url+url['href'])
    return topic_urls

def scrape_topics():
    topic_url = 'https://github.com/topics'
    response = requests.get(topic_url)
    if response.status_code != 200:
        raise Exception('Failed to load the page {}'.format(topic_url))

    # Parse the freshly downloaded page rather than a global variable
    doc = bs.BeautifulSoup(response.text, "html.parser")
    topics_dict = {
        'title': get_topic_title(doc),
        'description': get_topic_desc(doc),
        'url': get_topic_url(doc),
    }
    return pd.DataFrame(topics_dict)

In [240]:
x="https://github.com/topics?page={}".format(5)
topic_dict

{'title': ['3D',
  'Ajax',
  'Algorithm',
  'Amp',
  'Android',
  'Angular',
  'Ansible',
  'API',
  'Arduino',
  'ASP.NET',
  'Atom',
  'Awesome Lists',
  'Amazon Web Services',
  'Azure',
  'Babel',
  'Bash',
  'Bitcoin',
  'Bootstrap',
  'Bot',
  'C',
  'Chrome',
  'Chrome extension',
  'Command line interface',
  'Clojure',
  'Code quality',
  'Code review',
  'Compiler',
  'Continuous integration',
  'COVID-19',
  'C++'],
 'description': ['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
  'Ajax is a technique for creating interactive web applications.',
  'Algorithms are self-contained sequences that carry out a variety of tasks.',
  'Amp is a non-blocking concurrency library for PHP.',
  'Android is an operating system built by Google designed for mobile devices.',
  'Angular is an open source web application platform.',
  'Ansible is a simple and powerful automation engine.',
  'An API (Application Programming Interface) is a collec
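
The `?page={}` URL built in the cell above suggests that the topics listing is paginated. A minimal sketch of generating the URLs for several pages is shown here. `topics_page_urls` is a hypothetical helper, and how many pages exist is not established in this notebook.

```python
def topics_page_urls(num_pages):
    """Return the topics listing URLs for pages 1..num_pages."""
    template = "https://github.com/topics?page={}"
    return [template.format(page) for page in range(1, num_pages + 1)]

for url in topics_page_urls(3):
    print(url)
```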

In [233]:
def get_topic_page(topic_url):
    response = requests.get(topic_url)
    if response.status_code !=200:
        raise Exception('Failed to load the page {}'.format(topic_url))
    topic_doc = bs.BeautifulSoup(response.text,'html.parser')
    return topic_doc

def get_repo_info(h3_tag, star_tag):
    # Return the username, repo name, star count, and URL for one repository
    base_url = "https://github.com"
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    star_count = parse_star_count(star_tag.text)
    return username, repo_name, star_count, repo_url

def parse_star_count(stars_count):
    stars_str = stars_count.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1])*1000)
    return (int(stars_str))
        

def get_topic_repos(topic_doc):
    # get h3 tags for repo name, url, etc.
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3',{'class': h3_selection_class})

    # get star tags
    star_selector = 'Counter js-social-count'
    star_tags = topic_doc.find_all('span',{'class': star_selector})

    # get repo info
    topic_repos_dict = {
        'username':[],
        'repo_name':[],
        'stars': [],
        'repo_url':[]
    }
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i],star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])

    # put all in dataframe
    topic_repos = pd.DataFrame(topic_repos_dict)
    return topic_repos

def scrape_topic(topic_url,path):

    if os.path.exists(path):
        print("File {} already exists.".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path,index=None)

In [234]:
def scrape_topics_repos():
    topics_df = scrape_topics()

    os.makedirs('data',exist_ok = True)

    for index,row in topics_df.iterrows():
        print("Scraping topic {}.".format(row['title']))
        scrape_topic(row['url'],'data/{}.csv'.format(row['title']))


In [235]:
scrape_topics_repos()

Scraping topic 3D.
File data/3D.csv already exists.
Scraping topic Ajax.
File data/Ajax.csv already exists.
Scraping topic Algorithm.
File data/Algorithm.csv already exists.
Scraping topic Amp.
File data/Amp.csv already exists.
Scraping topic Android.
File data/Android.csv already exists.
Scraping topic Angular.
File data/Angular.csv already exists.
Scraping topic Ansible.
File data/Ansible.csv already exists.
Scraping topic API.
File data/API.csv already exists.
Scraping topic Arduino.
File data/Arduino.csv already exists.
Scraping topic ASP.NET.
File data/ASP.NET.csv already exists.
Scraping topic Atom.
File data/Atom.csv already exists.
Scraping topic Awesome Lists.
File data/Awesome Lists.csv already exists.
Scraping topic Amazon Web Services.
File data/Amazon Web Services.csv already exists.
Scraping topic Azure.
File data/Azure.csv already exists.
Scraping topic Babel.
File data/Babel.csv already exists.
Scraping topic Bash.
File data/Bash.csv already exists.
Scraping topic Bitc