# Scraping Trending Python Developer Profiles from GitHub

## Project Outline

<ul>
<li>Website used for scraping: <a href="https://github.com/trending/developers/python?since=daily">https://github.com/trending/developers/python?since=daily</a> (the trending developers page)</li>
<li>From the trending page, get each developer's name, profile URL, popular repo, repo URL, and repo description.</li>
<li>Navigate to each profile page and scrape the following details:
<ul><li>Name</li>
<li>Bio</li>
<li>Company</li>
<li>Location</li>
<li>Total contributions this year</li></ul>
</li>
</ul>


## Scrape the list of Developers from the Trending page

<ul> <li> Use Requests library to download the page</li>
<li> Use BeautifulSoup to parse and extract information </li>
<li> Convert to a Pandas DataFrame</li>
</ul>


In [1]:
! pip install requests --upgrade --quiet

In [2]:
! pip install beautifulsoup4 --upgrade --quiet

In [3]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

#### First, write a function to download the page
The function returns a BeautifulSoup document: the parsed trending page, which lists the trending developers on GitHub.

In [4]:
def get_trending_page():
    trending_url = 'https://github.com/trending/developers/python?since=daily'
    # Download the page
    response = requests.get(trending_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(trending_url))
    # Parse the HTML with BeautifulSoup
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc
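
The status check can also be factored into a small helper so that the error message always names the URL and the status code. The sketch below is only an illustration: `PageLoadError` and `ensure_ok` are hypothetical names that do not appear in the notebook, and the helper works on a plain status code so it can be tried without a network request.

```python
class PageLoadError(Exception):
    """Raised when a page does not come back with HTTP status 200 (hypothetical helper class)."""


def ensure_ok(status_code, url):
    """Raise PageLoadError unless the request for `url` returned status 200."""
    if status_code != 200:
        # The f-string interpolates the values; the braces are not printed literally.
        raise PageLoadError(f"Failed to load page {url} (status {status_code})")


# A successful status passes silently.
ensure_ok(200, "https://github.com/trending/developers/python?since=daily")
```

Inside a download function this could be called as `ensure_ok(response.status_code, trending_url)` right after `requests.get`.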


#### To get the developer names, we get `h1` tags with class `h3 lh-condensed`
<div>
<img src="https://i.imgur.com/yfsAIP0.png" width="800" height="1000"/>
</div>

In [5]:
doc=get_trending_page()
len(doc.find_all('h1',{'class':'h3 lh-condensed'}))

24

#### Functions to parse information from the trending page

In [6]:
# to get the list of developer names
def get_names(doc):
    name_tags=doc.find_all('h1',{'class':'h3 lh-condensed'})
    dev_names=[]
    for tag in name_tags:
        dev_names.append(tag.text.strip())
    return dev_names

In [7]:
#checking the function
names=get_names(doc)
names[:5]

['Romain Beaumont',
 'Matthias Fey',
 'Charles Tapley Hoyt',
 'RangiLyu',
 'Xingyi Zhou']

In [8]:
len(names)

24

#### Similarly, define functions for profile_URL, repo_names, repo_descriptions and repo_URL

In [9]:
# Function to get the repo descriptions
def get_repo_desc(doc):
    repo_desc_tags = doc.find_all('div', {'class': 'f6 color-fg-muted mt-1'})
    repo_desc = []
    for tag in repo_desc_tags:
        repo_desc.append(tag.text.strip())
    return repo_desc

In [10]:
repo_description=get_repo_desc(doc)


In [11]:
len(repo_description)

23
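
The description list has 23 entries while the name list has 24, so the two cannot be paired row by row without risking misalignment. A small guard, sketched here with plain lists, can report which column is short before a table is built. `find_mismatched` is a hypothetical helper, not part of the notebook, and it assumes the longest column is the reference length.

```python
def find_mismatched(columns):
    """Return {column_name: length} for every column shorter or longer than the longest one."""
    lengths = {name: len(values) for name, values in columns.items()}
    expected = max(lengths.values(), default=0)
    return {name: n for name, n in lengths.items() if n != expected}


# Toy data standing in for the scraped lists: one description is missing.
columns = {
    "name": ["dev-a", "dev-b", "dev-c"],
    "repo_desc": ["first repo", "second repo"],
}
mismatched = find_mismatched(columns)
```

An empty result means every column has the same length; a non-empty one names the columns that need a closer look.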

In [12]:
# Function to get profile URLs
base_url = "https://github.com"
def get_profile_urls(doc):
    name_tags = doc.find_all('h1', {'class': 'h3 lh-condensed'})
    profile_urls = []
    for tag in name_tags:
        # The first link inside each heading points to the developer's profile
        profile_urls.append(base_url + tag.find('a')['href'])
    return profile_urls
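
String concatenation works here because the `href` values scraped from the page start with `/`. As an alternative sketch, the standard library's `urllib.parse.urljoin` builds the same absolute URL and also leaves an already absolute link unchanged. `absolute_url` is a hypothetical helper name.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"


def absolute_url(href, base=BASE_URL):
    """Turn a site-relative href such as '/rom1504' into an absolute URL."""
    return urljoin(base, href)


profile = absolute_url("/rom1504")
repo = absolute_url("https://github.com/rom1504/img2dataset")
```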

In [13]:
#checking the function

pro_urls=get_profile_urls(doc)
pro_urls[:5]


['https://github.com/rom1504',
 'https://github.com/rusty1s',
 'https://github.com/cthoyt',
 'https://github.com/RangiLyu',
 'https://github.com/xingyizhou']

In [14]:
len(pro_urls)

24

In [15]:
#Function to get Repo_URL
base_repo_url="https://github.com"
def get_repo_urls(doc):
    repo_url_tags=doc.find_all('a',{'class':'css-truncate css-truncate-target'})
    repo_urls=[]
    for tag in repo_url_tags:
        repo_urls.append(base_repo_url + tag['href'])
    return repo_urls

In [16]:
#checking the function
repo_url=get_repo_urls(doc)
repo_url[:5]

['https://github.com/rom1504/img2dataset',
 'https://github.com/rusty1s/pytorch_scatter',
 'https://github.com/cthoyt/opencheck-embed',
 'https://github.com/RangiLyu/nanodet',
 'https://github.com/xingyizhou/CenterNet']

In [17]:
len(repo_url)

24

In [18]:
# Function to get repo_name
def get_repo_name(doc):
    repo_name_tags=doc.find_all('a',{'class':'css-truncate css-truncate-target'})
    repo_name=[]
    for tag in repo_name_tags:
        repo_name.append(tag.text.strip())
    return repo_name

In [19]:
#checking the function 
repo_names=get_repo_name(doc)
repo_names[:5]

['img2dataset', 'pytorch_scatter', 'opencheck-embed', 'nanodet', 'CenterNet']

In [20]:
len(repo_names)

24

### Create a main function to call these functions

In [21]:
def scrape_trending():
    # Uses the `doc` downloaded earlier with get_trending_page()
    trending_dict = {
        'name': get_names(doc),
        'popular_repo_name': get_repo_name(doc),
        'repo_url': get_repo_urls(doc),
        'profile_url': get_profile_urls(doc)
    }
    return pd.DataFrame(trending_dict)

In [22]:
#calling the main function
trending_df=scrape_trending()

#The trending page details are stored in a Dataframe (trending_df)
trending_df

| | name | popular_repo_name | repo_url | profile_url |
|---|---|---|---|---|
| 0 | Romain Beaumont | img2dataset | https://github.com/rom1504/img2dataset | https://github.com/rom1504 |
| 1 | Matthias Fey | pytorch_scatter | https://github.com/rusty1s/pytorch_scatter | https://github.com/rusty1s |
| 2 | Charles Tapley Hoyt | opencheck-embed | https://github.com/cthoyt/opencheck-embed | https://github.com/cthoyt |
| 3 | RangiLyu | nanodet | https://github.com/RangiLyu/nanodet | https://github.com/RangiLyu |
| 4 | Xingyi Zhou | CenterNet | https://github.com/xingyizhou/CenterNet | https://github.com/xingyizhou |
| 5 | Costa Huang | cleanrl | https://github.com/vwxyzjn/cleanrl | https://github.com/vwxyzjn |
| 6 | Kentaro Wada | labelme | https://github.com/wkentaro/labelme | https://github.com/wkentaro |
| 7 | Maarten Grootendorst | BERTopic | https://github.com/MaartenGr/BERTopic | https://github.com/MaartenGr |
| 8 | Cyrille Rossant | awesome-math | https://github.com/rossant/awesome-math | https://github.com/rossant |
| 9 | Wey Gu | nebula-dgl | https://github.com/wey-gu/nebula-dgl | https://github.com/wey-gu |


## Navigate to each profile page and get profile details
Profile details:
<ul><li>Bio</li>
<li>Company</li>
<li>Location</li>
<li>Contributions in the last year</li>
</ul>


In [23]:
def get_profile_page(profile_url):
    # Download the page
    response = requests.get(profile_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(profile_url))
    # Parse using BeautifulSoup
    profile_doc = BeautifulSoup(response.text, 'html.parser')
    return profile_doc
    

In [24]:
#checking the function
profile_doc1=get_profile_page('https://github.com/rom1504')

## Get Profile information

In [25]:
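# Get the bio text
def get_bio(profile_doc):
    bio_tag = profile_doc.find('div', {'class': 'p-note user-profile-bio mb-3 js-user-profile-bio f4'})
    # Not every profile has a bio; return a placeholder instead of failing on None
    if bio_tag is None:
        return "-"
    return bio_tag.text.strip()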

In [26]:
#checking the function
bio_desc=get_bio(profile_doc1)
bio_desc

'Interested in machine learning (computer vision, natural language processing, deep learning), node.js (network, bots, web), and programming in general'

In [27]:
# Get company details
def get_company(profile_doc):
    comp_tag = profile_doc.find('a', {'class': 'user-mention notranslate'})
    if comp_tag is None:
        return "-"
    else:
        return comp_tag.text.strip()


In [28]:
#checking the function
comp=get_company(profile_doc1)
comp

'@google'

In [29]:
#get location details

def get_location(profile_doc):
    loc_tag = profile_doc.find('span', {'class': 'p-label'})
    if loc_tag is None:
        return "-"
    else:
        return loc_tag.text.strip()
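
The company and location getters share the same pattern: look up a tag, then return its stripped text or `"-"` when the tag is missing. That pattern could be captured once in a helper. This is a minimal sketch; `text_or_default` is a hypothetical name, and `SimpleNamespace` stands in for a BeautifulSoup tag only because both expose a `.text` attribute.

```python
from types import SimpleNamespace


def text_or_default(tag, default="-"):
    """Return the tag's stripped text, or `default` when the lookup found nothing."""
    if tag is None:
        return default
    return tag.text.strip()


# Stand-in for a tag found on a profile page (not a real BeautifulSoup object).
found = SimpleNamespace(text="  Paris\n")
location = text_or_default(found)
missing = text_or_default(None)
```

With such a helper, a getter reduces to one line, for example `return text_or_default(profile_doc.find('span', {'class': 'p-label'}))`.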
      

In [30]:
#check the function
loc=get_location(profile_doc1)
loc

'Paris'

In [31]:
import re
# Get the number of contributions
def get_contri(profile_doc):
    contri_tag = profile_doc.find('h2', {'class': 'f4 text-normal mb-2'})
    # Drop everything that is not a digit, including any thousands separator
    contribution = re.sub(r"\D", '', contri_tag.text.strip())
    return contribution
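
If the count is needed as a number rather than a string, the same idea can be written as a search for the first run of digits. The sketch below assumes the heading text looks like `"1,848 contributions in the last year"`. That wording is an assumption about the page, and `parse_contributions` is a hypothetical helper.

```python
import re


def parse_contributions(heading_text):
    """Return the first number in the text as an int, or None if there is no digit."""
    match = re.search(r"\d[\d,]*", heading_text)
    if match is None:
        return None
    # Remove thousands separators before converting.
    return int(match.group().replace(",", ""))


# Assumed shape of the heading text, including the surrounding whitespace.
count = parse_contributions("\n      1,848\n      contributions\n        in the last year\n")
```

Returning `None` for text without digits keeps a missing heading distinguishable from a genuine count of zero.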

In [32]:
#check the function
contri=get_contri(profile_doc1)
contri

'1848'

In [33]:
# Create a dictionary of empty lists, one per profile field
profiles_dict = {'bio': [], 'company': [], 'location': [], 'contribution': []}

# Append one profile's details to the dictionary and return everything collected so far as a DataFrame
def get_dict(profile_doc):
    profiles_dict['bio'].append(get_bio(profile_doc))
    profiles_dict['company'].append(get_company(profile_doc))
    profiles_dict['location'].append(get_location(profile_doc))
    profiles_dict['contribution'].append(get_contri(profile_doc))
    return pd.DataFrame(profiles_dict)

In [34]:
def create_df(profile_url):
    profile_df= get_dict(get_profile_page(profile_url))
    return profile_df

#### Main function to invoke other functions

In [35]:
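def scrape_profile():
    print('Scraping list of profiles')
    trending_df = scrape_trending()

    # Start from empty lists so that re-running this cell does not duplicate rows
    for values in profiles_dict.values():
        values.clear()

    # Visit each profile URL in trending_df, row by row
    profile_df = pd.DataFrame(profiles_dict)
    for index, row in trending_df.iterrows():
        print('Scraping profile for "{}"'.format(row['name']))
        profile_df = create_df(row['profile_url'])
    return profile_df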
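
The loop can also be written without a module-level dictionary by collecting one row per profile in a local list. The sketch below is an illustration under stated assumptions: `collect_profiles` and its `fetch_profile` and `delay_seconds` parameters are hypothetical, and `fake_fetch` is a stand-in that makes no network calls.

```python
import time


def collect_profiles(urls, fetch_profile, delay_seconds=0.0):
    """Call fetch_profile(url) for each URL and return the results as a list of rows."""
    rows = []
    for position, url in enumerate(urls):
        # Pause between requests, but not before the first one.
        if position > 0 and delay_seconds > 0:
            time.sleep(delay_seconds)
        rows.append(fetch_profile(url))
    return rows


def fake_fetch(url):
    """Stand-in for a real scraper; returns a row with placeholder values."""
    return {"profile_url": url, "bio": "-", "company": "-", "location": "-", "contribution": "0"}


rows = collect_profiles(["https://github.com/a", "https://github.com/b"], fake_fetch)
```

In the notebook, `fetch_profile` could wrap `get_profile_page` and the four getters, and `pd.DataFrame(rows)` would turn the list of dictionaries into a table. Because the list is local, calling the function twice does not accumulate duplicate rows.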

In [36]:
profile_df=scrape_profile()
print(profile_df)


Scraping list of profiles
Scraping profile for "Romain Beaumont"
Scraping profile for "Matthias Fey"
Scraping profile for "Charles Tapley Hoyt"
Scraping profile for "RangiLyu"
Scraping profile for "Xingyi Zhou"
Scraping profile for "Costa Huang"
Scraping profile for "Kentaro Wada"
Scraping profile for "Maarten Grootendorst"
Scraping profile for "Cyrille Rossant"
Scraping profile for "Wey Gu"
Scraping profile for "Bane Sullivan"
Scraping profile for "Ross Wightman"
Scraping profile for "Guido van Rossum"
Scraping profile for "Benjamin Peterson"
Scraping profile for "Xintao"
Scraping profile for "Michael Dawson-Haggerty"
Scraping profile for "Ben Frederickson"
Scraping profile for "Andreas Klöckner"
Scraping profile for "Florian Roth"
Scraping profile for "Łukasz Langa"
Scraping profile for "Erik Bernhardsson"
Scraping profile for "Sylvain Gugger"
Scraping profile for "Kyle Altendorf"
Scraping profile for "Thomas Kluyver"

In [37]:
#profile_df has all the scraped values of profile
profile_df

| | bio | company | location | contribution |
|---|---|---|---|---|
| 0 | Interested in machine learning (computer visio... | @google | Paris | 1848 |
| 1 | Creator of PyG (PyTorch Geometric) - Founding ... | - | Dortmund, Germany | 4497 |
| 2 | Bio/cheminformatician, open scientist, maintai... | @pybel | Bonn, Germany | 5320 |
| 3 | Deep Learning & Computer Vision | - | Shanghai | 891 |
| 4 | CS Ph.D. student at UT Austin. | - | Austin | 311 |
| 5 | Computer Science Ph.D student at Drexel Univer... | - | Philadelphia, PA | 1122 |
| 6 | Passionate about automation. Working on comput... | @mujin | Tokyo, Japan | 1523 |
| 7 | Data Scientist \| Psychologist | - | Netherlands, Tilburg | 66 |
| 8 | Neuroscience researcher and software engineer | @int-brain-lab | Paris | 922 |
| 9 | Developer Advocate @vesoft-inc | @vesoft-inc | Shanghai | 1608 |


#### Combining trending_df and profile_df into a single DataFrame

In [38]:
ml_profiles_df=pd.concat([trending_df,profile_df],axis=1)
ml_profiles_df

| | name | popular_repo_name | repo_url | profile_url | bio | company | location | contribution |
|---|---|---|---|---|---|---|---|---|
| 0 | Romain Beaumont | img2dataset | https://github.com/rom1504/img2dataset | https://github.com/rom1504 | Interested in machine learning (computer visio... | @google | Paris | 1848 |
| 1 | Matthias Fey | pytorch_scatter | https://github.com/rusty1s/pytorch_scatter | https://github.com/rusty1s | Creator of PyG (PyTorch Geometric) - Founding ... | - | Dortmund, Germany | 4497 |
| 2 | Charles Tapley Hoyt | opencheck-embed | https://github.com/cthoyt/opencheck-embed | https://github.com/cthoyt | Bio/cheminformatician, open scientist, maintai... | @pybel | Bonn, Germany | 5320 |
| 3 | RangiLyu | nanodet | https://github.com/RangiLyu/nanodet | https://github.com/RangiLyu | Deep Learning & Computer Vision | - | Shanghai | 891 |
| 4 | Xingyi Zhou | CenterNet | https://github.com/xingyizhou/CenterNet | https://github.com/xingyizhou | CS Ph.D. student at UT Austin. | - | Austin | 311 |
| 5 | Costa Huang | cleanrl | https://github.com/vwxyzjn/cleanrl | https://github.com/vwxyzjn | Computer Science Ph.D student at Drexel Univer... | - | Philadelphia, PA | 1122 |
| 6 | Kentaro Wada | labelme | https://github.com/wkentaro/labelme | https://github.com/wkentaro | Passionate about automation. Working on comput... | @mujin | Tokyo, Japan | 1523 |
| 7 | Maarten Grootendorst | BERTopic | https://github.com/MaartenGr/BERTopic | https://github.com/MaartenGr | Data Scientist \| Psychologist | - | Netherlands, Tilburg | 66 |
| 8 | Cyrille Rossant | awesome-math | https://github.com/rossant/awesome-math | https://github.com/rossant | Neuroscience researcher and software engineer | @int-brain-lab | Paris | 922 |
| 9 | Wey Gu | nebula-dgl | https://github.com/wey-gu/nebula-dgl | https://github.com/wey-gu | Developer Advocate @vesoft-inc | @vesoft-inc | Shanghai | 1608 |


#### Saving this DataFrame to a CSV file

In [39]:
ml_profiles_df.to_csv("G:/Machine_learning/github/web_scrapping_git/scrape_python_profiles/files/python_profiles.csv")
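
The absolute path above only exists on the author's machine. A portable variant might build the output path relative to a chosen base directory and create the folder if it is missing. This is a sketch under assumptions: `output_path` is a hypothetical helper, and the `files` subfolder simply mirrors the last folder of the path used above.

```python
from pathlib import Path


def output_path(base_dir, filename="python_profiles.csv"):
    """Return base_dir/files/filename, creating the 'files' folder when needed."""
    out_dir = Path(base_dir) / "files"
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / filename
```

The result can be passed straight to pandas, for example `ml_profiles_df.to_csv(output_path("."), index=False)`, where `index=False` keeps the row index out of the file.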