# Scraping the top 20 collections from the GitHub Collections page and their top repositories

## The project is divided into two tasks: 
##### 1. Create a .csv file listing the collections on the GitHub Collections page, in the format | Title | Description | URL |
##### 2. Create an individual .csv file for the repositories of each collection, in the following format: 
| username | repo_name | repo_desc | stars | forks | programming language | repo_url |


## 1. Create a .csv file listing the collections on the GitHub Collections page: 
> Website URL I am scraping:  https://github.com/collections

### STEP 1: Downloading the GitHub Collections page
* Use the `requests` library to download the page.
* Use BeautifulSoup (BS4) to parse the page so its contents can be extracted.

In [1]:
import requests 
from bs4 import BeautifulSoup

def get_collections_page():
    collections_url = 'https://github.com/collections'
    # Download the page and stop early on any non-200 response
    response = requests.get(collections_url)
    if response.status_code != 200:
        raise Exception('Failed to load {} page.'.format(collections_url))
    # Parse the HTML so that tags can be searched by name and class
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

In [2]:
doc = get_collections_page()

### STEP 2: Creating individual helper functions for fetching Collections' Titles, Descriptions and URLs

In [3]:
# Fetching Titles
def get_collection_titles(doc): 
    selection_class = 'h3'
    collection_titles_tag = doc.find_all('h2', {'class': selection_class})
    collection_titles = []
    for tag in collection_titles_tag:
        collection_titles.append(tag.text)
    return collection_titles

In [4]:
titles = get_collection_titles(doc)
len(titles)
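
The class lookup in `get_collection_titles` can be tried offline before running it against the live page. The sketch below parses a hand-written HTML snippet; the markup and the names `SAMPLE_HTML`, `sample_doc` and `sample_titles` are made up for illustration and are not GitHub's actual page structure. It shows that passing a single class name matches any tag whose `class` attribute contains that name.

```python
from bs4 import BeautifulSoup

# Hand-written markup for illustration only (an assumption, not GitHub's real HTML)
SAMPLE_HTML = """
<div>
  <h2 class="h3">Learn to Code</h2>
  <h2 class="h3 extra">DevOps tools</h2>
  <h2 class="other">Not a collection title</h2>
</div>
"""

sample_doc = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Same lookup as get_collection_titles: <h2> tags carrying the class 'h3'
sample_titles = [tag.text.strip() for tag in sample_doc.find_all('h2', {'class': 'h3'})]
print(sample_titles)
```

The third heading is skipped because its class list does not contain `h3`.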

In [5]:
# Fetching Descriptions
def get_collection_desc(doc):
    desc_selector = 'col-10 col-md-11'
    collections_desc_tag = doc.find_all('div', {'class': desc_selector})
    collection_desc = []
    for tag in collections_desc_tag:
        # The second line of each block's stripped text is taken as the description
        collection_desc.append(tag.text.strip().split('\n')[1].strip())
    return collection_desc

In [6]:
desc = get_collection_desc(doc)
len(desc)
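
`get_collection_desc` relies on the description being exactly the second line of each block's text, so it raises an `IndexError` if a block has only one line. As a hedged alternative, the sketch below picks the second non-empty line and falls back to a default. The helper name `second_nonempty_line` and the sample strings are hypothetical and not part of the original notebook.

```python
def second_nonempty_line(text, default='Not Provided'):
    """Return the second non-empty line of text, or default if there is none."""
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    return lines[1] if len(lines) > 1 else default

# Made-up sample text shaped like "title, blank line, description"
print(second_nonempty_line('Learn to Code\n\n   Resources to help people learn.  \n'))
print(second_nonempty_line('Only a title'))
```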

In [7]:
# Fetching URLs
def get_collection_urls(doc):
    urls = []
    # Look at anchors that have an href but no class attribute
    for tag in doc.find_all('a', {'href': True, 'class': None}):
        if tag['href'].startswith('/collections/'):
            urls.append('https://github.com' + tag['href'])
    return urls

In [8]:
collections_urls = get_collection_urls(doc)
len(collections_urls)
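
The `len(...)` checks above matter because the three lists become columns of one DataFrame, and pandas raises a `ValueError` when the columns of a dict differ in length. A minimal sketch of making that check explicit is shown below; `check_same_length` is a hypothetical helper and the sample values are made up.

```python
def check_same_length(**columns):
    """Return the length of each named list, raising ValueError if they differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return lengths

# Made-up values standing in for titles, desc and collections_urls
print(check_same_length(
    titles=['Learn to Code', 'DevOps tools'],
    desc=['First description', 'Second description'],
    urls=['https://github.com/collections/a', 'https://github.com/collections/b'],
))
```

Calling it with the real `titles`, `desc` and `collections_urls` before building the DataFrame would point at the selector that returned too many or too few items.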

### STEP 3: Putting everything together to scrape a final .csv file containing list of top 20 collections

In [9]:
import pandas as pd

In [10]:
# Scraping the collections .csv
def scrape_collections():
    # Reuse the downloader from STEP 1 instead of repeating the request logic
    doc = get_collections_page()
    collections_dict = {
        'Title': get_collection_titles(doc),
        'Description': get_collection_desc(doc),
        'URL': get_collection_urls(doc)
    }
    return pd.DataFrame(collections_dict)

In [11]:
collections = scrape_collections()
collections[:5]

In [12]:
# Saving the final .csv file (without the DataFrame index column)
collections.to_csv('collections.csv', index=False)

## END OF TASK 1

## 2. Create an individual .csv file for the repositories of each collection, in the following format: 
| username | repo_name | repo_desc | stars | forks | programming language | repo_url |

### STEP 1: Downloading each collection page using its URL

In [13]:
# The first step in the pipeline is to download an individual collection page
def get_collection_page(collection_url):
    response = requests.get(collection_url)
    if response.status_code != 200:
        raise Exception('Failed to load {} page'.format(collection_url))
    # Parse the collection page using BS4
    collection_doc = BeautifulSoup(response.text, 'html.parser')
    return collection_doc
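
Since `get_collection_page` is called once per collection, a single transient failure would stop the whole run. One possible way to soften that is a small retry wrapper, sketched below. `fetch_with_retry`, its parameters and the fake fetcher are hypothetical and not part of the original notebook; the fetch function is passed in so the sketch runs without network access.

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay_seconds=1.0):
    """Call fetch(url), retrying after an Exception up to `retries` attempts in total."""
    if retries < 1:
        raise ValueError('retries must be at least 1')
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # get_collection_page raises a plain Exception
            last_error = error
            if attempt < retries:
                time.sleep(delay_seconds)  # pause before the next attempt
    raise last_error

# Fake fetcher for illustration: fails twice, then succeeds
calls = []
def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise Exception('Failed to load {} page'.format(url))
    return 'parsed page for ' + url

result = fetch_with_retry(flaky_fetch, 'https://github.com/collections/example', delay_seconds=0)
print(result, len(calls))
```

In the notebook this could be used as `fetch_with_retry(get_collection_page, collection_url)`.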

## Rough Work

In [14]:
# topic_doc = get_collection_page('https://github.com/collections/clean-code-linters')
# desc_tag = topic_doc.findAll('div', {'class': 'color-text-secondary mb-2 ws-normal'})
# items = topic_doc.findAll("div", {'class': 'color-text-secondary mb-2 ws-normal'})
# # second_child = first_child.find('span', {'class': 'ml-0'})
# for desc in desc_tag:
#     print(desc.text.strip())
    
    

        
# results = list(zip(name,tag))
# df = pd.DataFrame(results)
            

### STEP 2: Creating a function for getting the following repository details:
username, repo_name, repo_desc, stars, forks, programming language, repo_url

In [15]:
# Getting Repo Info.
def get_repo_info(h1_tag, star_tag, fork_tag, lan_tag, desc_tag):
    a_tags = h1_tag.find_all('a')
    # The link text spans two lines: the owner plus a two-character separator, then the repo name
    name_parts = a_tags[0].text.strip().split('\n')
    username = name_parts[0][:-2]
    repo_name = name_parts[1].strip()
    repo_url = 'https://github.com' + a_tags[0]['href']
    
    # Language: find() returns None when the tag is missing, so .text raises AttributeError
    try:
        lan = lan_tag.find('span', {'class': 'ml-0'}).text.strip()
    except AttributeError:
        lan = 'Not Provided'
    
    # Description, with the same fallback
    try:
        desc = desc_tag.find('div', {'class': 'color-text-secondary mb-2 ws-normal'}).text.strip()
    except AttributeError:
        desc = 'Not Provided'
    
    # Stars and Forks tag filter
    stars = int(star_tag.text.strip())
    forks = int(fork_tag.text.strip())
    
    return username, repo_name, desc, stars, forks, lan, repo_url
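
`int()` only accepts plain digit strings, so the star and fork conversion fails if a count is rendered with a thousands separator or an abbreviation. Whether the collection pages do this is an assumption that has not been checked against the current GitHub markup. A hedged sketch of a more tolerant parser follows; `parse_count` is a hypothetical helper.

```python
def parse_count(text):
    """Convert strings such as '87', '1,234' or '1.2k' into an int."""
    cleaned = text.strip().lower().replace(',', '')
    multipliers = {'k': 1_000, 'm': 1_000_000}
    if cleaned and cleaned[-1] in multipliers:
        # round() guards against floating-point results such as 1199.999...
        return int(round(float(cleaned[:-1]) * multipliers[cleaned[-1]]))
    return int(cleaned)

for sample in [' 87 ', '1,234', '1.2k', '45.3k']:
    print(sample, parse_count(sample))
```

It could replace the two `int(...)` calls, for example `stars = parse_count(star_tag.text)`.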

### STEP 3: Creating the final function for scraping details of every collection's all repositories using get_repo_info() function.

In [16]:
import re

In [17]:
# Put everything together in the final function
def get_collection_repo(collection_doc):
    h1_selection_class = 'h3 lh-condensed'
    repo_tag = collection_doc.find_all('h1', {'class': h1_selection_class})
    star_tag = collection_doc.find_all(href = re.compile('stargazers'))
    fork_tag = collection_doc.find_all(href = re.compile('members'))
    lan_tag = collection_doc.find_all('div', {'class': 'd-flex f6'})
    desc_tag = collection_doc.find_all('article', {'class': 'height-full border color-border-secondary rounded-1 p-3 p-md-5 my-5'})
    
    collection_dict = {'username': [], 'repo_name': [], 'repo_desc': [], 'stars': [], 'forks': [], 'programming language': [],'repo_url': []}
    
    # The five tag lists are assumed to be parallel: entry i of each describes repository i
    for i in range(len(repo_tag)):
        repo_info = get_repo_info(repo_tag[i], star_tag[i], fork_tag[i], lan_tag[i], desc_tag[i])
        collection_dict['username'].append(repo_info[0])
        collection_dict['repo_name'].append(repo_info[1])
        collection_dict['repo_desc'].append(repo_info[2])
        collection_dict['stars'].append(repo_info[3])
        collection_dict['forks'].append(repo_info[4])
        collection_dict['programming language'].append(repo_info[5])
        collection_dict['repo_url'].append(repo_info[6])
        
    return pd.DataFrame(collection_dict)

In [18]:
# Scrape one collection's repositories into a .csv file, skipping files that already exist
import os

def scrape_collection(collection_url, path):
    if os.path.exists(path):
        print('The file {} already exists. Skipping...'.format(path))
        return
    collection_df = get_collection_repo(get_collection_page(collection_url))
    collection_df.to_csv(path, index=False)

### STEP 4: Putting everything together and creating a function that writes a .csv file for each collection's repositories

In [19]:
import os

In [20]:
# Scrape every collection and save its repositories to data/<Title>.csv

def scrape_collection_repos():
    print("Scraping List of Collections")
    collection_df = scrape_collections()
    
    os.makedirs('data', exist_ok = True)
    
    for index, row in collection_df.iterrows():
        print('Scraping top repositories for {}'.format(row['Title']))
        scrape_collection(row['URL'], 'data/{}.csv'.format(row['Title']))
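
The collection title is used directly as the file name here, so a title containing a character such as `/` would be treated as part of the path. A minimal sketch of cleaning the title first is given below; `safe_filename` is a hypothetical helper and the sample titles are made up.

```python
import re

def safe_filename(title, replacement='_'):
    """Replace runs of characters other than letters, digits, '_', '-' and spaces."""
    cleaned = re.sub(r'[^\w\- ]+', replacement, title.strip())
    # Trim leftover replacement characters and spaces at the ends
    return cleaned.strip(replacement + ' ') or 'untitled'

for sample in ['DevOps tools', 'CI/CD tools', 'What?!', '///']:
    print(repr(sample), '->', repr(safe_filename(sample)))
```

The path could then be built as `'data/{}.csv'.format(safe_filename(row['Title']))`.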

In [21]:
scrape_collection_repos()

In [22]:
## Check the .csvs
#pd.read_csv('./data/DevOps tools.csv')

## END OF TASK 2

## Summary:
* I learnt how to parse pages with BS4, write small helper functions, and combine them so that the code stays organized and easy to follow.
* I learnt to debug the problems that came up at each step on my own, and gained confidence in working my way towards a goal.
* I understood the significance of web scraping and enjoyed the creativity involved in automating a task.

TODO: 

* explain what you learnt, what you used, and what would be the future work of this project.

## Future Work:
* Instead of scraping only the top 20 collections, scrape the repository .csv files for all collections and package the code as a .py file.
* Build an automated scraping tool with a simple front end: a search box where users enter a query and get back the matching .csv files from the full list of collections available on GitHub, then deploy the project publicly.
* A further step could be to scrape GitHub's collections, topics and trending pages together and make an end-to-end project on complete libraries.
* Similar web scraping projects can be created using Scrapy and other tools.