# Scraping Topics and Their Top Repositories from GitHub

### Web Scraping
Web scraping is the process of collecting structured web data in an automated fashion. It’s also known as web data extraction. Some of the main use cases of web scraping include price monitoring, price intelligence, news monitoring, lead generation, and market research among many others.
In general, web data extraction is used by people and businesses who want to make use of publicly available web data to make smarter decisions.


### GitHub
GitHub, Inc. is an Internet hosting service for software development and version control using Git. It provides the distributed version control of Git plus access control, bug tracking, software feature requests, task management, continuous integration, and wikis for every project.


### Problem Statement

In this web scraping project, we scrape data from GitHub: the topics listed on the site and the top repositories for each topic. We use the Python programming language. The `requests` library downloads each web page, and the `BeautifulSoup` library parses the page and extracts the information we need. We also use `pandas` to store the results in a DataFrame.

### Steps
* First, we scrape https://github.com/topics to get a list of topics.
* For each topic, we extract its title, description, and topic page URL.
* Next, we get the top 25 repositories for each topic.
* For each repository, we extract the `repo_name`, `user_name`, `repo_url`, and `stars`, and save the results in CSV format.
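
As a rough illustration of the last step, the record layout and CSV output could look like the sketch below. It uses the standard library's `csv` module rather than `pandas`, purely to keep the example dependency-free. The helper name `write_repos_csv` and the sample row are hypothetical placeholders, not real scraped data.

```python
import csv
import io

# Column names taken from the steps above.
FIELDS = ["repo_name", "user_name", "repo_url", "stars"]


def write_repos_csv(rows, out_file):
    """Write repository records (dicts keyed by FIELDS) as CSV to an open text file."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Placeholder record for illustration only; not real scraped data.
sample_rows = [
    {
        "repo_name": "example-repo",
        "user_name": "example-user",
        "repo_url": "https://github.com/example-user/example-repo",
        "stars": 1200,
    },
]

# Write to an in-memory buffer here; a real run would pass an open file instead.
buffer = io.StringIO()
write_repos_csv(sample_rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with `open(path, "w", newline="")`, which is the mode the `csv` module documentation recommends.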

### Scraping the list of topics 

* Using `requests`, we download the topics page.
* Using `BeautifulSoup`, we parse the page and extract each topic's title, URL, and description.
* We convert the results to a DataFrame.


In [1]:
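import requests
from bs4 import BeautifulSoup

# Download the topics page and fail early on an HTTP error status
topic_url = 'https://github.com/topics'
response = requests.get(topic_url)
response.raise_for_status()

# Parse the HTML so it can be searched
doc = BeautifulSoup(response.text, 'html.parser')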

#### Getting topic titles

In [3]:
def get_topics(doc):
    """Return the topic titles found on the parsed topics page."""
    # Each topic title sits in a <p> tag with exactly this class string
    topic_titles = doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})
    return [tag.text.strip() for tag in topic_titles]


In [6]:
get_topics(doc)[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']
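
One caveat about the lookup in `get_topics`: passing a multi-class string to `find_all` matches only tags whose `class` attribute is exactly that string, in that order. The sketch below shows the difference on a small made-up HTML snippet (`sample_html` is an assumption for illustration, not GitHub's real markup) and compares it with BeautifulSoup's `select` method, which takes a CSS selector and ignores class order.

```python
from bs4 import BeautifulSoup

# Made-up snippet: both tags carry the same classes, but in a different order.
sample_html = """
<div>
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="Link--primary f3 lh-condensed mb-0 mt-1">Ajax</p>
</div>
"""
sample_doc = BeautifulSoup(sample_html, "html.parser")

# Exact class-string match: only the tag whose attribute equals this string.
exact_match = [
    p.text
    for p in sample_doc.find_all(
        "p", {"class": "f3 lh-condensed mb-0 mt-1 Link--primary"}
    )
]

# CSS selector match: any <p> that has all the listed classes, in any order.
selector_match = [
    p.text for p in sample_doc.select("p.f3.lh-condensed.Link--primary")
]

print(exact_match)
print(selector_match)
```

If the page's class order or class list ever changes, the exact-string lookup can silently return an empty list, so a selector that names only the classes you rely on may be a more forgiving choice.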