# Scraping Topics and Their Top Repositories from GitHub

### Web Scraping
Web scraping is the process of collecting structured web data in an automated fashion. It’s also known as web data extraction. Some of the main use cases of web scraping include price monitoring, price intelligence, news monitoring, lead generation, and market research among many others.
In general, web data extraction is used by people and businesses who want to make use of publicly available web data to make smarter decisions.


### GitHub
GitHub is an Internet hosting service for software development and version control using Git. It provides Git's distributed version control plus access control, bug tracking, software feature requests, task management, continuous integration, and wikis for every project.


### Problem Statement

In this web scraping project, we scrape the topics listed on GitHub along with the top repositories for each topic. The project is written in Python. We use the `requests` library to download each web page and the `BeautifulSoup` library to parse the page and extract the information we need. We also use `pandas` to collect the results in a dataframe.

### Steps
* First, we scrape https://github.com/topics to get a list of topics.
* For each topic, we extract its title, description and topic page URL.
* Next, we get the top 25 repositories for each topic.
* For each repository, we extract the `repo_name`, `user_name`, `repo_url` and `stars`, and save the results in CSV format.
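The last step, saving repository records as CSV, can be sketched with Python's standard `csv` module. This is only an illustration and not the project's own code, which relies on `pandas`: the `repos_to_csv` helper and the sample record are hypothetical, and only the column names come from the list above.

```python
import csv
import io

# Column names taken from the step list above.
FIELDS = ["repo_name", "user_name", "repo_url", "stars"]


def repos_to_csv(repos):
    """Serialize a list of repository dicts to CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(repos)
    return buffer.getvalue()


# Made-up sample record, for illustration only.
sample = [{
    "repo_name": "example-repo",
    "user_name": "example-user",
    "repo_url": "https://github.com/example-user/example-repo",
    "stars": 1200,
}]
csv_text = repos_to_csv(sample)
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch free of side effects; in practice the same writer could target a file opened with `open(path, "w", newline="")`.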

### Scraping the list of topics 

* Using `requests`, we download the topics page.
* Using `BeautifulSoup`, we parse the page and extract each topic's title, URL and description.
* We convert the result to a dataframe.


In [3]:
import requests
from bs4 import BeautifulSoup

topic_url = 'https://github.com/topics'
response = requests.get(topic_url)
response.raise_for_status()  # stop early if the page did not download successfully
doc = BeautifulSoup(response.text, 'html.parser')

#### Getting topic titles

In [4]:
def get_topics(doc):
    """Return the topic titles found in the parsed topics page."""
    topic_titles = doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})
    topic_name = []
    for i in topic_titles:
        topic_name.append(i.text)
    return topic_name


In [5]:
get_topics(doc)[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']
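A note on the class lookup used in `get_topics`: when `find_all` is given a string containing several classes, BeautifulSoup compares it against the exact value of the `class` attribute, so the match depends on the classes appearing in that precise order. The sketch below, run on a made-up HTML snippet rather than the real GitHub page, contrasts this with a CSS selector passed to `select`, which matches regardless of class order. The snippet and the shortened class lists are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Made-up HTML: the second <p> carries the same classes in a different order.
html = """
<p class="f3 lh-condensed Link--primary">3D</p>
<p class="Link--primary lh-condensed f3">Ajax</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact-string match on the class attribute: only the first <p> qualifies.
exact = [p.text for p in soup.find_all("p", {"class": "f3 lh-condensed Link--primary"})]

# CSS selector: any <p> that has all three classes, in any order.
by_selector = [p.text for p in soup.select("p.f3.lh-condensed.Link--primary")]

print(exact)
print(by_selector)
```

Either approach still depends on GitHub's current markup, so the class names may need updating if the page layout changes.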

#### Getting topic description

In [6]:
def get_topic_desc(doc):
    """Return the topic descriptions, stripped of surrounding whitespace."""
    desc_topic = doc.find_all('p', {'class': 'f5 color-fg-muted mb-0 mt-1'})
    desc_info = []
    for i in desc_topic:
        desc_info.append(i.text.strip())
    return desc_info

In [7]:
get_topic_desc(doc)[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

#### Getting topic url

In [8]:
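def get_topic_url(doc):
    """Return the absolute URL of each topic page."""
    url_topic = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})
    url = []
    base_url = 'https://github.com'
    for i in url_topic:
        # The href values are site-relative (e.g. /topics/3d), so prepend the base URL.
        url.append(base_url + i['href'])
    return url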

In [9]:
get_topic_url(doc)[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']
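`get_topic_url` builds each link by concatenating the base URL with the `href` value, which works as long as every `href` is site-relative. As a hedged alternative, the standard library's `urljoin` handles both relative and already-absolute links. The `build_topic_url` helper below is a hypothetical name used only for this sketch.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"


def build_topic_url(href):
    """Resolve an href against the GitHub base URL (hypothetical helper)."""
    return urljoin(BASE_URL, href)


# A site-relative href is resolved against the base URL.
relative = build_topic_url("/topics/3d")
# An href that is already absolute is returned unchanged.
absolute = build_topic_url("https://github.com/topics/ajax")

print(relative)
print(absolute)
```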

#### Creating a dataframe
We now combine the topic titles, descriptions and URLs into a single dataframe.

In [10]:
import pandas as pd

def scrape_topics():
    """Build a dataframe of topics from the `doc` parsed earlier."""
    topics_dict = {
        'title': get_topics(doc),
        'Description': get_topic_desc(doc),
        'Url': get_topic_url(doc)
    }
    df = pd.DataFrame(topics_dict)
    return df

In [11]:
scrape_topics()

|   | title | Description | Url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |
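`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which could happen if, for example, a topic on the page had no description. A minimal sketch of an explicit check is shown below; the `build_topic_records` helper is hypothetical and not part of the original notebook, and it reuses the column names from `scrape_topics`.

```python
def build_topic_records(titles, descriptions, urls):
    """Pair up the three lists row by row, failing loudly on a length mismatch.

    Hypothetical helper: returns a list of dicts, one per topic.
    """
    if not (len(titles) == len(descriptions) == len(urls)):
        raise ValueError(
            f"length mismatch: {len(titles)} titles, "
            f"{len(descriptions)} descriptions, {len(urls)} urls"
        )
    return [
        {"title": t, "Description": d, "Url": u}
        for t, d, u in zip(titles, descriptions, urls)
    ]


records = build_topic_records(
    ["3D", "Ajax"],
    ["3D modeling ...", "Ajax is a technique ..."],
    ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
)
print(records[0])
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame`, and the error message makes it clear which list came up short.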


### Extracting the top repositories from a topic page

Now that we have the list of topic titles, descriptions and URLs, we can extract the top repositories for each topic.
The dataframe we create will have the following columns:
- `user_name`
- `repo_name`
- `stars`
- `repo_url`

First, we define a function that downloads and parses the page for a given topic.

In [12]:
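def get_topics_page(topic_url):
    """Download a topic page and return it as a parsed BeautifulSoup document."""
    response = requests.get(topic_url)
    response.raise_for_status()  # stop early if the page did not download successfully
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc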

In [14]:
#get_topics_page('https://github.com/topics/3d')
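
Once a topic page is parsed, the `stars` column will likely need some cleanup before it can be stored as a number. The sketch below assumes that star counts appear as text either in plain form (such as `987`) or abbreviated with a `k` suffix (such as `12.3k`). That display format is an assumption about the page, and `parse_star_count` is a hypothetical helper, not something defined earlier in this notebook.

```python
def parse_star_count(stars_text):
    """Convert star-count text such as '12.3k' or '987' to an int.

    Assumes a trailing 'k' means thousands (an assumption about the page format).
    """
    text = stars_text.strip().lower().replace(",", "")
    if text.endswith("k"):
        # round() guards against floating-point error before truncating to int.
        return int(round(float(text[:-1]) * 1000))
    return int(text)


print(parse_star_count("12.3k"))
print(parse_star_count("987"))
```

Storing the count as an integer makes it possible to sort or filter repositories by popularity later on.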