## Scraping Top Repositories for Topics on GitHub

TODO:
- Introduction about web scraping
- Introduction about GitHub and the problem statement
- Mention the tools you're using (Python, requests, pandas, BeautifulSoup)

### Project Outline:

- We're going to scrape https://github.com/topics

- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL and topic description

- For each topic, we'll get the top 25 repositories in the topic from the topic page

- For each repository, we'll grab the repo name, username, stars, programming language and repo URL

- For each topic, we'll create a CSV file in the following format:
```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
```



## Scrape the list of topics from GitHub

- use requests to download the page
- use BS4 to parse and extract information
- convert to a pandas dataframe
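
The parse and convert steps above can be tried without touching the network. The sketch below is a minimal illustration, assuming a small made-up HTML string (`sample_html`) and made-up class names (`title`, `desc`) in place of the real GitHub page, whose markup differs.

```python
import bs4
import pandas as pd

# A tiny stand-in for the downloaded page. In the notebook this text would
# come from requests.get(...).text; the markup here is invented for illustration.
sample_html = """
<div>
  <a href="/topics/3d"><p class="title">3D</p><p class="desc"> 3D modeling </p></a>
  <a href="/topics/ajax"><p class="title">Ajax</p><p class="desc"> Async web requests </p></a>
</div>
"""

# Parse the HTML and extract the pieces we care about
doc = bs4.BeautifulSoup(sample_html, 'html.parser')
titles = [tag.text for tag in doc.find_all('p', {'class': 'title'})]
descriptions = [tag.text.strip() for tag in doc.find_all('p', {'class': 'desc'})]
urls = ['https://github.com' + tag['href'] for tag in doc.find_all('a')]

# Convert the parallel lists into a pandas DataFrame, one row per topic
sample_df = pd.DataFrame({'title': titles, 'description': descriptions, 'url': urls})
print(sample_df)
```

The functions in the rest of this notebook follow the same shape, with the real page and the real class names.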

Let's write a function to download the page

In [44]:
import requests
import bs4

def get_topics_page():
    # Download the GitHub topics page and return it as a parsed BeautifulSoup document
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)

    # Fail loudly if the page could not be loaded
    if response.status_code != 200:
        raise Exception(f'Failed to load page {topics_url}')

    doc = bs4.BeautifulSoup(response.text, 'html.parser')
    return doc

## Demo: fetching the page, parsing the HTML, and selecting tags to extract

In [45]:
doc=get_topics_page()

In [46]:
doc.find_all('a')[:1]

[<a class="px-2 py-4 color-bg-accent-emphasis color-fg-on-emphasis show-on-focus js-skip-to-content" href="#start-of-content">Skip to content</a>]

**Let's create some helper functions to parse information from the page.**

To get the topic titles, we can pick the `p` tags by their `class` attribute.
![](https://imgur.com/J8GopoO)

In [47]:
def get_topic_titles(doc):
    selection_class='f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags=doc.find_all('p',{'class':selection_class})
    topic_titles=[]
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

**`get_topic_titles` can be used to get the list of titles**

In [48]:
titles=get_topic_titles(doc)

In [49]:
titles[:3]

['3D', 'Ajax', 'Algorithm']

**Similarly, we define a function to get the topic descriptions**

In [50]:
def get_topic_descs(doc):
    
    selection_of_description='f5 color-fg-muted mb-0 mt-1'
    topic_description_tag=doc.find_all('p',{'class':selection_of_description})
    
    topic_description=[]
    for description in topic_description_tag:
        topic_description.append(description.text.strip())

    return topic_description

**Similarly, we define a function to get the topic URLs**

In [51]:
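def get_topic_urls(doc):
    # Assumption: the link tags are the title <p> tags used in get_topic_titles,
    # whose parent <a> tag carries the topic's href
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_link_tags = doc.find_all('p', {'class': selection_class})

    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag.parent['href'])
    return topic_urls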

**Finally, we combine these helpers into a function that scrapes the topics into a DataFrame**

In [52]:
import pandas as pd

def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)

    if response.status_code != 200:
        raise Exception(f'Failed to load page {topics_url}')

    doc = bs4.BeautifulSoup(response.text, 'html.parser')
    # Each key becomes a column; the three lists must have the same length
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)
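
`pd.DataFrame` needs every list in the dictionary to have the same length, so if one selector matches fewer tags than the others, the constructor raises a `ValueError`. The sketch below shows one way to fail early with a clearer message. `build_topics_frame` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def build_topics_frame(titles, descriptions, urls):
    # Hypothetical helper: check the list lengths before building the DataFrame
    lengths = {
        'title': len(titles),
        'description': len(descriptions),
        'url': len(urls),
    }
    if len(set(lengths.values())) != 1:
        # Report which column came up short instead of a generic pandas error
        raise ValueError(f'Column lengths differ: {lengths}')
    return pd.DataFrame({'title': titles, 'description': descriptions, 'url': urls})
```

A mismatch usually means one of the class strings no longer matches the page's markup.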


## Get the top 25 repositories from a topic page

In [53]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception(f'Failed to load page {topic_url}')

    # Parse using BeautifulSoup
    topic_doc = bs4.BeautifulSoup(response.text, 'html.parser')
    return topic_doc

In [54]:
doc=get_topic_page('https://github.com/topics/3d')

ConnectionError: HTTPSConnectionPool(host='github.com', port=443): Max retries exceeded with url: /topics/3d (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000002B475B66FD0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it'))
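
A `ConnectionError` like the one above means the request never reached GitHub. If the failure is transient, retrying with a backoff may help; it will not fix a blocked network or a misconfigured proxy. The sketch below is one possible setup using `requests.Session` with urllib3's `Retry`. `make_session` and `fetch_html` are hypothetical names, and the retry count, backoff, and timeout are arbitrary assumptions.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(total_retries=3, backoff_factor=1.0):
    # Retry failed connections and a few retryable status codes, waiting
    # longer between successive attempts
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff_factor,
        status_forcelist=[429, 500, 502, 503, 504],
    )
    session = requests.Session()
    session.mount('https://', HTTPAdapter(max_retries=retry))
    return session

def fetch_html(session, url, timeout=10):
    # The timeout keeps a stalled request from hanging the notebook
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text
```

With this in place, `fetch_html(make_session(), 'https://github.com/topics/3d')` would return the page text that `bs4.BeautifulSoup` can then parse.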

**Now let's extract a repository's details from its heading tag and star tag**

In [None]:
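def get_repo_info(h1_tag, star_tag):
    # Returns username, repo name, star count and URL for one repository.
    # Despite its name, h1_tag is the h3 heading tag selected in get_topic_repos.
    base_url = 'https://github.com'
    a_tags = h1_tag.find_all('a')

    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']

    # parse_star_count must be defined before this function is called
    stars = parse_star_count(star_tag.text.strip())

    return username, repo_name, stars, repo_url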
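
`get_repo_info` calls `parse_star_count`, which this notebook never defines. The version below is a minimal sketch, assuming the star text looks like `84k`, `69.7k`, or a plain number such as `950`. That format is an assumption based on the rounded values in the final output, not something the source states.

```python
def parse_star_count(stars_str):
    # Assumed input format: '84k', '69.7k', '1,234' or '950'
    stars_str = stars_str.strip().replace(',', '')
    if stars_str[-1:].lower() == 'k':
        # round() guards against floating-point error, e.g. 69.7 * 1000
        return int(round(float(stars_str[:-1]) * 1000))
    return int(stars_str)
```

For example, `parse_star_count('84k')` gives `84000`, and `parse_star_count('950')` gives `950`.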

**Collect all the fetched data into a DataFrame, ready to be saved as CSV**

In [None]:


def get_topic_repos(topic_doc):
    
    #get the h3 tags containing repo title, repo URL and username
    tag_selection='f3 color-fg-muted text-normal lh-condensed'
    repo_tags=topic_doc.find_all('h3',{'class':tag_selection})
    
    #get stars tag
    star_id='repo-stars-counter-star'
    star_tags=topic_doc.find_all('span',{'id':star_id})
    
    
    topics_repos_dict={
    'username':[],
    'repo_name':[],
    'stars':[],
    'repo_url':[]
}
    #get repo info
    for i in range(len(repo_tags)):
    
        repo_info=get_repo_info(repo_tags[i],star_tags[i])
        topics_repos_dict['username'].append(repo_info[0])
        topics_repos_dict['repo_name'].append(repo_info[1])
        topics_repos_dict['stars'].append(repo_info[2])
        topics_repos_dict['repo_url'].append(repo_info[3])
    
    return pd.DataFrame(topics_repos_dict)
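
The driver function later in this notebook calls `scrape_topic(url, path)`, which is not defined in any of the cells shown. The sketch below is one plausible shape for it, assuming it simply chains `get_topic_page` and `get_topic_repos` and writes the result to CSV. The optional `fetch_repos` parameter is a hypothetical addition so the function can be tried without a network connection.

```python
import os

def scrape_topic(topic_url, path, fetch_repos=None):
    if fetch_repos is None:
        # Default: use the notebook's own helpers defined above
        def fetch_repos(url):
            return get_topic_repos(get_topic_page(url))
    # fetch_repos is expected to return a pandas DataFrame
    topic_df = fetch_repos(topic_url)
    # index=False keeps the DataFrame's row numbers out of the CSV file
    topic_df.to_csv(path, index=False)
```

Called as `scrape_topic('https://github.com/topics/3d', 'Scrap-Data/3D.csv')`, it would download the topic page and write one CSV file for that topic.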

## Putting it all together

- We have a function to get the list of topics
- We have a function to create a CSV file for the scraped repos from a topic page
- Let's create a function to put them together

In [None]:
import os

def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()

    # Create the output folder if it does not exist yet
    os.makedirs('Scrap-Data', exist_ok=True)

    for index, row in topics_df.iterrows():
        print('Scraping top repositories for {}'.format(row['title']))
        scrape_topic(row['url'], 'Scrap-Data/{}.csv'.format(row['title']))

Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics

In [None]:
scrape_topics_repos()

We can check that the CSVs were created properly

In [59]:
#read and display a csv using Pandas
import pandas as pd

df = pd.read_csv('Scrap-Data/3D.csv')

print(df) 

            username                      repo_name  stars  \
0             mrdoob                       three.js  84000   
1             libgdx                         libgdx  20000   
2             pmndrs              react-three-fiber  19000   
3          BabylonJS                     Babylon.js  18000   
4              ssloy                   tinyrenderer  14000   
5           aframevr                         aframe  14000   
6            lettier  3d-game-shaders-for-beginners  13000   
7            FreeCAD                        FreeCAD  12000   
8          metafizzy                           zdog   9000   
9           CesiumGS                         cesium   9000   
10       timzhang642            3D-Machine-Learning   8000   
11           isl-org                         Open3D   7000   
12      a1studmuffin             SpaceshipGenerator   7000   
13           blender                        blender   6000   
14           domlysz                     BlenderGIS   5000   
15      