## Pick a website and describe your objective


- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.

- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.

- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.

**Project Outline:**
- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic, we'll create a CSV file in the following format:

```
Repo Name, Username, Stars, Repo URL
three.js, mrdoob, 69700, https://github.com/mrdoob/three.js
libgdx, libgdx, 18300, https://github.com/libgdx/libgdx
```
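As a rough illustration of the target format, the sketch below writes two sample rows with Python's built-in `csv` module. The helper name `rows_to_csv` and the in-memory buffer are assumptions made for this example; the notebook itself writes its files with pandas later on.

```python
import csv
import io

# Sample rows copied from the example format above (illustrative data only).
rows = [
    ("three.js", "mrdoob", 69700, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 18300, "https://github.com/libgdx/libgdx"),
]

def rows_to_csv(rows):
    """Return CSV text with the header row from the project outline."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Repo Name", "Username", "Stars", "Repo URL"])
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = rows_to_csv(rows)
print(csv_text)
```

Note that `csv.writer` emits no space after each comma, so the output is slightly more compact than the hand-written sample.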

**1. Use the requests library to download web pages**

In [7]:
# If the requests library is not installed
#!pip install requests --upgrade --quiet

In [6]:
import requests

In [8]:
topics_url = 'https://github.com/topics'

In [10]:
response = requests.get(topics_url)
response.status_code                 # A status code in the 200-299 range means the request succeeded

200
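The 200-299 rule can be written as a small check that runs without any network access. The names `is_success` and `check_status` are hypothetical helpers for this sketch; with a real `requests` response, `response.raise_for_status()` is another option, which raises an error for 4xx and 5xx codes.

```python
def is_success(status_code):
    """Return True for HTTP status codes in the 2xx range."""
    return 200 <= status_code <= 299

def check_status(status_code, url):
    """Raise an error for non-2xx codes; return the code otherwise."""
    if not is_success(status_code):
        raise RuntimeError(
            "Failed to load page {} (status {})".format(url, status_code)
        )
    return status_code

print(is_success(200))
print(is_success(404))
```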

In [19]:
page_contents = response.text                   # HTML source of the page
len(page_contents)

In [20]:
# Save a static copy of https://github.com/topics
# encoding='utf-8' is passed explicitly: with the platform default codec ('charmap' in the
# original run) this write failed with UnicodeEncodeError on the character '\u21b5'.
with open('HTML code of page_contents.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)

**2. Use Beautiful Soup to parse and extract information**

In [21]:
#!pip install beautifulsoup4 --upgrade --quiet

In [22]:
from bs4 import BeautifulSoup

In [23]:
doc = BeautifulSoup(page_contents, 'html.parser')

*-- list all topics*

In [122]:
topic_selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = doc.find_all('p', {'class': topic_selection_class})
topic_titles = []
for tag in topic_title_tags:
    topic_titles.append(tag.text)
    
#topic_titles

*-- list all topic descriptions*

In [124]:
desc_selection_class = 'f5 color-fg-muted mb-0 mt-1'
topic_desc_tags = doc.find_all('p', {'class': desc_selection_class})
topic_descs = []
for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())
    
#topic_descs

*-- list of all topic links*

In [126]:
link_selection_class = 'no-underline flex-1 d-flex flex-column'
topic_link_tags = doc.find_all('a', {'class': link_selection_class})
topic_urls = []
base_url = 'https://github.com'
for i in range(len(topic_titles)):
    topic_urls.append(base_url + topic_link_tags[i]['href'])
    
#topic_urls
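String concatenation works here because every `href` starts with `/`. As a slightly more defensive alternative, the sketch below uses `urllib.parse.urljoin` from the standard library, which also leaves already-absolute links untouched. The helper name `build_topic_urls` and the sample `href` values are assumptions for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"

def build_topic_urls(hrefs, base_url=BASE_URL):
    """Resolve each (possibly relative) href against the base URL."""
    return [urljoin(base_url, href) for href in hrefs]

# Sample href values in the shape the topic links use.
sample_hrefs = ["/topics/3d", "/topics/ajax"]
urls = build_topic_urls(sample_hrefs)
print(urls)
```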

In [127]:
import pandas as pd

In [128]:
topics_dict = {
    'title': topic_titles,
    'description': topic_descs,
    'url': topic_urls
}

In [135]:
topics_df = pd.DataFrame(topics_dict)

In [299]:
topics_df.to_csv('topics.csv', index=None)

**3. Getting information out of a topic page**

In [137]:
topic_page_url = topic_urls[0]

In [138]:
topic_page_url

'https://github.com/topics/3d'

In [140]:
response = requests.get(topic_page_url)
response.status_code

200

In [141]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [146]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class} )

In [147]:
repo_tags[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="text-bold wb-break-word" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac

In [148]:
len(repo_tags)

30

In [152]:
a_tags = repo_tags[0].find_all('a')
a_tags

[<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>,
 <a class="text-bold wb-break-word" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data

In [150]:
a_tags[0].text.strip()

'mrdoob'

In [151]:
a_tags[1].text.strip()

'three.js'

In [155]:
base_url = 'https://github.com'
repo_url = base_url + a_tags[1]['href']
print(repo_url)

https://github.com/mrdoob/three.js


In [164]:
star_tags = topic_doc.find_all('span', { 'class': 'Counter js-social-count'})
star_tags[0].text

'83.5k'

### Create a function to parse star counts

In [199]:
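def parse_star_counts(star_str):
    star_str = star_str.strip()
    if star_str[-1] == 'k':
        # round() instead of int(): int() truncates, so a float result
        # just below a whole number would come out one too low
        return round(float(star_str[:-1]) * 1000)
    return int(star_str)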

parse_star_counts(star_tags[0].text)

83500
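A slightly more general variant is sketched below. It assumes counts might also appear with an `m` suffix or with thousands separators such as `1,234`; neither form shows up in the output above, so treat both as assumptions. The name `parse_count` is hypothetical.

```python
def parse_count(text):
    """Convert strings such as '83.5k', '1.2m' or '1,234' to an int."""
    multipliers = {"k": 1_000, "m": 1_000_000}  # 'm' is an assumed suffix
    cleaned = text.strip().lower().replace(",", "")
    if not cleaned:
        raise ValueError("empty count string")
    suffix = cleaned[-1]
    if suffix in multipliers:
        # round() avoids truncating results like 1199999.9999999998
        return round(float(cleaned[:-1]) * multipliers[suffix])
    return int(cleaned)

print(parse_count("83.5k"))
```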

### Create a function to extract repository information

In [209]:
def get_repo_info(h3_tag, star_tag):
    a_tag = h3_tag.find_all('a')        # all <a> tags inside h3_tag (an element of repo_tags)
    user_name = a_tag[0].text.strip()
    repo_name = a_tag[1].text.strip()
    stars     = parse_star_counts(star_tag.text)
    base_url  = 'https://github.com'
    repo_url  = base_url + a_tag[1]['href']
    return user_name, repo_name, stars, repo_url

get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 83500, 'https://github.com/mrdoob/three.js')

In [210]:
topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
}


for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])
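The index-based appends above can also be expressed by pairing column names with tuple values through `zip`. The sketch below uses plain tuples as stand-ins for what `get_repo_info` returns; the helper `to_columns` and the sample rows are assumptions for illustration.

```python
# Stand-in data: each tuple mimics the (username, repo_name, stars, repo_url)
# tuple returned by get_repo_info.
repo_infos = [
    ("mrdoob", "three.js", 83500, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 18300, "https://github.com/libgdx/libgdx"),
]

COLUMNS = ("username", "repo_name", "stars", "repo_url")

def to_columns(infos, columns=COLUMNS):
    """Turn a list of row tuples into a dict of column lists."""
    table = {name: [] for name in columns}
    for info in infos:
        for name, value in zip(columns, info):
            table[name].append(value)
    return table

table = to_columns(repo_infos)
print(table["repo_name"])
```

A dict in this shape can be handed to `pd.DataFrame` in the same way as `topic_repos_dict`.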

### Create functions to load a topic page and get its repository info

In [335]:
import os

def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

def get_repo_info(h3_tag, star_tag):
    a_tag = h3_tag.find_all('a')        # all <a> tags inside h3_tag (an element of repo_tags)
    user_name  = a_tag[0].text.strip()   
    repo_name = a_tag[1].text.strip()
    stars     = parse_star_counts(star_tag.text.strip())
    base_url  = 'https://github.com'
    repo_url  = base_url + a_tag[1]['href']
    return user_name, repo_name, stars, repo_url

def get_topic_repos(topic_doc):
    
    # Get h3_tag repo
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class} )
    
    # Get star_tag
    star_tags = topic_doc.find_all('span', { 'class': 'Counter js-social-count'})
    
    # Get repo info
    topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []}
    
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    return pd.DataFrame(topic_repos_dict)

def scrape_topic_content(topic_url, path):
    # Skip the download when the CSV file for this topic already exists
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    # Make sure the target directory (e.g. 'Data') exists before writing
    os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
    # Save the topic's top repositories and their details as a CSV file
    topic_content_df = get_topic_repos(get_topic_page(topic_url))
    topic_content_df.to_csv(path, index=None)
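The skip-if-exists pattern can be tried in isolation, without downloading anything. The sketch below writes into a temporary directory; `save_if_missing` and its `make_text` callback are hypothetical names standing in for the download-and-save step.

```python
import os
import tempfile

def save_if_missing(path, make_text):
    """Write make_text() to path unless the file already exists.

    Returns True if a file was written, False if it was skipped.
    """
    if os.path.exists(path):
        return False
    # Create the parent directory (for example 'Data') if needed.
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(make_text())
    return True

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "Data", "3D.csv")
    first = save_if_missing(target, lambda: "username,repo_name\n")
    second = save_if_missing(target, lambda: "never written\n")
    print(first, second)
```

Taking the content as a callback means the expensive work (here, the page download) only happens when the file is actually missing.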

In [336]:
topic_urls[5]

'https://github.com/topics/angular'

In [319]:
get_topic_repos(get_topic_page(topic_urls[5]))

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | justjavac | free-programming-books-zh_CN | 94200 | https://github.com/justjavac/free-programming-... |
| 1 | angular | angular | 82600 | https://github.com/angular/angular |
| 2 | storybookjs | storybook | 72600 | https://github.com/storybookjs/storybook |
| 3 | leonardomso | 33-js-concepts | 49900 | https://github.com/leonardomso/33-js-concepts |
| 4 | ionic-team | ionic-framework | 47600 | https://github.com/ionic-team/ionic-framework |
| 5 | prettier | prettier | 43300 | https://github.com/prettier/prettier |
| 6 | SheetJS | sheetjs | 30600 | https://github.com/SheetJS/sheetjs |
| 7 | angular | angular-cli | 25500 | https://github.com/angular/angular-cli |
| 8 | angular | components | 22800 | https://github.com/angular/components |
| 9 | NativeScript | NativeScript | 21400 | https://github.com/NativeScript/NativeScript |



### Write a single function to:
1. Get the list of topics from the topics page
2. Get the list of top repos from the individual topic pages
3. For each topic, create a CSV of the top repos for the topic
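One way to picture how these three steps fit together is a driver that receives each step as a function. The sketch below is an assumption about structure, not the notebook's implementation: `scrape_all` and the `fake_*` stubs are hypothetical, and the stubs return fixed data so the example runs offline.

```python
def scrape_all(list_topics, list_repos, save_rows):
    """Run the three steps with caller-supplied functions.

    list_topics()          -> list of (title, url) pairs
    list_repos(url)        -> list of row tuples for one topic
    save_rows(title, rows) -> stores the rows for one topic
    """
    saved = []
    for title, url in list_topics():
        rows = list_repos(url)
        save_rows(title, rows)
        saved.append(title)
    return saved

# Stub functions so the sketch runs without network access.
storage = {}

def fake_topics():
    return [("3D", "https://github.com/topics/3d")]

def fake_repos(url):
    return [("mrdoob", "three.js", 83500, "https://github.com/mrdoob/three.js")]

def fake_save(title, rows):
    storage[title] = rows

done = scrape_all(fake_topics, fake_repos, fake_save)
print(done)
```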



In [349]:
def get_topic_titles(doc):
    topic_selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': topic_selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

def get_topic_descs(doc):
    desc_selection_class = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selection_class})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

def get_topic_urls(doc):
    link_selection_class = 'no-underline flex-1 d-flex flex-column'
    topic_link_tags = doc.find_all('a', {'class': link_selection_class})
    topic_urls = []
    base_url = 'https://github.com'
    # Iterate over the link tags themselves instead of relying on a global list
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls

def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the downloaded page and extract each column from it
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)


In [355]:
def scrape_topic_repos():
    print("Scraping topics repository in csv")
    # Make sure the output folder exists before writing any CSV files
    os.makedirs('Data', exist_ok=True)
    topic_df = scrape_topics()
    for index, row in topic_df.iterrows():
        print('Scraping top repositories for '+ row['title'] + ' ' + row['url'])
        scrape_topic_content(row['url'], 'Data/{}.csv'.format(row['title']))


In [356]:
scrape_topic_repos()

Scraping topics repository in csv
Scraping top repositories for 3D https://github.com/topics/3d
The file Data/3D.csv already exists. Skipping...
Scraping top repositories for Ajax https://github.com/topics/ajax
The file Data/Ajax.csv already exists. Skipping...
Scraping top repositories for Algorithm https://github.com/topics/algorithm
The file Data/Algorithm.csv already exists. Skipping...
Scraping top repositories for Amp https://github.com/topics/amphp
The file Data/Amp.csv already exists. Skipping...
Scraping top repositories for Android https://github.com/topics/android
The file Data/Android.csv already exists. Skipping...
Scraping top repositories for Angular https://github.com/topics/angular
The file Data/Angular.csv already exists. Skipping...
Scraping top repositories for Ansible https://github.com/topics/ansible
The file Data/Ansible.csv already exists. Skipping...
Scraping top repositories for API https://github.com/topics/api
The file Data/API.csv already exists. Skipping..
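The file names above come straight from the topic titles. That works for titles such as `3D` or `API`, but a title containing a slash or other special characters could produce an invalid path. The sketch below shows one possible way to sanitize a title first; `safe_file_name` is a hypothetical helper, and `C/C++` is a made-up title used only to exercise it.

```python
import re

def safe_file_name(title, extension=".csv"):
    """Build a file name from a topic title, replacing risky characters."""
    # Keep ASCII letters, digits, dots, hyphens and underscores;
    # collapse every other run of characters into one underscore.
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", title.strip())
    cleaned = cleaned.strip("_") or "untitled"
    return cleaned + extension

print(safe_file_name("3D"))
print(safe_file_name("C/C++"))
```

One trade-off of this approach: two different titles can collapse to the same file name, so a real run would need a check for collisions.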