# Top Repositories for GitHub Topics

### Outline:

- we'll scrape https://github.com/topics
- get a list of topics and for each topic we'll get topic name, topic URL, and topic description 
- for each topic we will get the top repositories listed on its page (30 per topic in the run shown here)
- and for each repository we'll get repository name, repository owner (user name), number of stars and its URL
- create a CSV file for each topic in the following format:
 ```
 Repo_name,User_name,Stars,Repo_URL
 
 ```
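
The CSV layout from the outline can be sketched with Python's standard `csv` module. This is only an illustration of the target format, assuming an in-memory buffer and one hand-written sample row; the helper name `rows_to_csv` is hypothetical, and the notebook itself writes its files with pandas.

```python
import csv
import io

# Column order taken from the outline above.
HEADER = ["Repo_name", "User_name", "Stars", "Repo_URL"]

def rows_to_csv(rows):
    """Serialize (repo, user, stars, url) tuples to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()

# Illustrative sample row written by hand, not scraped data.
sample = [("three.js", "mrdoob", 74900, "https://github.com/mrdoob/three.js")]
csv_text = rows_to_csv(sample)
print(csv_text)
```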

### install and use the requests library to download web pages 

In [1]:
# upgrade requests if it is already installed; --quiet hides the output when the installation succeeds
!pip install requests --upgrade --quiet 

In [2]:
import requests

In [3]:
topics_url = 'https://github.com/topics'

In [4]:
response = requests.get(topics_url)

In [5]:
response.status_code  # 200 means the request succeeded

200
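
Checking the status code by eye works in a notebook, but a small helper makes the check explicit. The sketch below uses only the standard library and treats any 2xx code as success; `describe_status` is a hypothetical name, not part of requests. In real code you could also call `response.raise_for_status()`, which requests provides for the same purpose.

```python
from http import HTTPStatus

def describe_status(code):
    """Return (is_success, reason_phrase) for an HTTP status code."""
    is_success = 200 <= code < 300
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Code is not a registered HTTP status.
        phrase = "Unknown"
    return is_success, phrase

print(describe_status(200))
print(describe_status(404))
```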

In [6]:
len(response.text)  # response.text holds the HTML source of the web page

144927

In [7]:
page_content = response.text

In [8]:
page_content[:100]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="d'

### use Beautiful Soup to parse and extract the information 

In [9]:
!pip install beautifulsoup4 --upgrade --quiet

In [10]:
from bs4 import BeautifulSoup

In [11]:
doc = BeautifulSoup(page_content, 'html.parser')

In [12]:
# now we can search doc for what we need; open the web page and use the browser's Inspect tool to find the right tags
topic_title_tages = doc.find_all('p')
len(topic_title_tages)

67

In [13]:
# we need a more specific search, so we filter by the class of that p tag
title_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tages = doc.find_all('p', {'class': title_class})
topic_title_tages[:3]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>]

![](https://i.imgur.com/cL7MJyc.png)

In [14]:
# description tags
description_class = 'f5 color-text-secondary mb-0 mt-1'
topic_description_tages = doc.find_all('p', {'class': description_class})
topic_description_tages[:3]

[<p class="f5 color-text-secondary mb-0 mt-1">
               3D modeling is the process of virtually developing the surface and structure of a 3D object.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Ajax is a technique for creating interactive web applications.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Algorithms are self-contained sequences that carry out a variety of tasks.
             </p>]
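
To make clearer what the class-based search is doing, here is a rough standard-library stand-in that collects the text of every tag whose `class` attribute equals a given string. It is not how BeautifulSoup is implemented, only a sketch of the idea; `ClassTextCollector` and the sample HTML are made up for illustration, and nested tags of the same name are not handled.

```python
from html.parser import HTMLParser

class ClassTextCollector(HTMLParser):
    """Collect the stripped text of each <tag> whose class attribute equals class_name."""

    def __init__(self, tag, class_name):
        super().__init__()
        self.tag = tag
        self.class_name = class_name
        self.texts = []
        self._buffer = None  # text chunks gathered while inside a matching tag

    def handle_starttag(self, tag, attrs):
        if tag == self.tag and dict(attrs).get("class") == self.class_name:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.tag and self._buffer is not None:
            self.texts.append("".join(self._buffer).strip())
            self._buffer = None

# Hand-written sample that mimics the structure of the topics page.
sample_html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-text-secondary mb-0 mt-1">
    3D modeling is the process of virtually developing the surface and structure of a 3D object.
</p>
<p class="unrelated">ignored</p>
"""

collector = ClassTextCollector("p", "f5 color-text-secondary mb-0 mt-1")
collector.feed(sample_html)
print(collector.texts)
```

Because the match is on the full class string, any change GitHub makes to these class names will return an empty list, so it is worth checking the length of the result before using it.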

In [15]:
link_class = 'd-flex no-underline'
topic_link_tags = doc.find_all('a', {'class': link_class})
#topic_link_tags[0]

In [16]:
topic_link_tags[0]['href']

'/topics/3d'

In [17]:
'https://github.com'+topic_link_tags[0]['href']

'https://github.com/topics/3d'

In [18]:
print('https://github.com'+topic_link_tags[0]['href'])

https://github.com/topics/3d
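
String concatenation works here because every `href` starts with `/`. As a hedged alternative, `urllib.parse.urljoin` from the standard library also copes with links that are already absolute; `absolute_url` and `BASE_URL` are hypothetical names used only in this sketch.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"

def absolute_url(href):
    """Resolve a possibly relative href against the GitHub base URL."""
    return urljoin(BASE_URL, href)

print(absolute_url("/topics/3d"))
```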


In [19]:
# now we could get titles, descriptions, and links
# we have titles
print(topic_title_tages[0])
print(topic_title_tages[0].text) # .text gives the text inside the tag

<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
3D


In [20]:
# we have descriptions
print(topic_description_tages[0])
print("---------------------------------------------------------------------------------------------")
print(topic_description_tages[0].text)
print("---------------------------------------------------------------------------------------------")
print(topic_description_tages[0].text.strip())

<p class="f5 color-text-secondary mb-0 mt-1">
              3D modeling is the process of virtually developing the surface and structure of a 3D object.
            </p>
---------------------------------------------------------------------------------------------

              3D modeling is the process of virtually developing the surface and structure of a 3D object.
            
---------------------------------------------------------------------------------------------
3D modeling is the process of virtually developing the surface and structure of a 3D object.


In [21]:
# store the titles in the 'topics_titles' list
topics_titles = [x.text for x in topic_title_tages]
print(topics_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [22]:
# store the descriptions in the 'topics_descriptions' list
topics_descriptions = [x.text.strip() for x in topic_description_tages]
print(topics_descriptions[:5])

['3D modeling is the process of virtually developing the surface and structure of a 3D object.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency framework for PHP.', 'Android is an operating system built by Google designed for mobile devices.']


In [23]:
# store the URLs in the 'topics_urls' list
topics_urls = ['https://github.com'+x['href'] for x in topic_link_tags]
print(topics_urls[0])

https://github.com/topics/3d


In [24]:
# build a DataFrame from the data we gathered
import pandas as pd
# a dictionary of column name -> list makes it easy to build the dataframe
data_in_dict = {'titles':topics_titles, 'description':topics_descriptions, 'url':topics_urls}

topics_df = pd.DataFrame(data_in_dict)
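
The three lists come from three separate searches, so they only line up if each search found the same number of tags, and pandas rejects a dictionary whose lists have different lengths. A small check before building the dataframe gives a clearer error. This is a sketch with a hypothetical helper name, `build_topic_records`, and hand-written sample values.

```python
def build_topic_records(titles, descriptions, urls):
    """Pair up the three parallel lists, failing loudly if their lengths differ."""
    if not (len(titles) == len(descriptions) == len(urls)):
        raise ValueError(
            f"length mismatch: {len(titles)} titles, "
            f"{len(descriptions)} descriptions, {len(urls)} urls"
        )
    return [
        {"titles": t, "description": d, "url": u}
        for t, d, u in zip(titles, descriptions, urls)
    ]

# Hand-written sample values for illustration.
records = build_topic_records(
    ["3D", "Ajax"],
    ["3D modeling ...", "Ajax is a technique ..."],
    ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
)
print(records[0]["url"])
```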

In [25]:
topics_df

Unnamed: 0,titles,description,url
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency framework fo...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


### save the data in a csv file

In [26]:
topics_df.to_csv('topics.csv', index=False)  # index=False leaves the row index out of the file

### get the repositories' information

### first, get the repository information for the first topic, "3D"

In [27]:
print(topics_df['url'][0])

https://github.com/topics/3d


In [28]:
topic0_url = topics_df['url'][0]
topic0_response = requests.get(topic0_url)
topic0_response.status_code

200

In [29]:
topic0_doc = BeautifulSoup(topic0_response.text, 'html.parser')

In [30]:
class_0 =  "f3 color-text-secondary text-normal lh-condensed"
h3 = topic0_doc.find_all('h3', {'class': class_0})
len(h3)

30

In [31]:
username_and_reponame0 = h3[0].find_all('a')
len(username_and_reponame0)

2

In [32]:
print(h3[0].find_all('a')[0].text.strip())
print(h3[0].find_all('a')[1].text.strip())
print(h3[0].find_all('a')[1]['href'].strip())

mrdoob
three.js
/mrdoob/three.js


In [33]:
user_name0 = username_and_reponame0[0].text.strip()
user_name0

'mrdoob'

In [34]:
repo_name0 = username_and_reponame0[1].text.strip()
repo_name0

'three.js'

In [35]:
base_link = 'https://github.com'
repo_link = base_link + username_and_reponame0[1]['href']
print(repo_link)

https://github.com/mrdoob/three.js
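
The owner, the repository name, and the full link can all be recovered from the single `href` value. The sketch below assumes hrefs always have the form `/owner/repo`, as in the output above; `split_repo_href` is a hypothetical helper, not something used elsewhere in this notebook.

```python
def split_repo_href(href, base_link="https://github.com"):
    """Split '/owner/repo' into (owner, repo, full_url)."""
    parts = href.strip("/").split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"unexpected repository href: {href!r}")
    owner, repo = parts
    return owner, repo, f"{base_link}/{owner}/{repo}"

print(split_repo_href("/mrdoob/three.js"))
```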


In [37]:
class_star = "d-flex flex-items-start ml-3"
# search the 3D topic page (topic0_doc), not the topics list page stored in doc
repo_star_tag = topic0_doc.find_all('div', {'class': class_star})

In [38]:
# define a function that converts the star count text (e.g. '74.9k') to an integer
def to_int_stars(stars):
    if stars[-1] == 'k':
        # a trailing 'k' means thousands; round() guards against float truncation
        return int(round(float(stars[:-1]) * 1000))
    return int(float(stars))

In [39]:
# test the to_int_stars function 
url = topics_df['url'][0]
res = requests.get(url)
doc = BeautifulSoup(res.text, "html.parser")
class_star = "social-count float-none"
stare_tages = doc.find_all('a', {'class': class_star})
ss = [to_int_stars(s.text.strip()) for s in stare_tages]
ss[:5]

[74900, 19100, 15300, 15000, 13100]
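
The star counts above arrive as short strings such as `74.9k`. A slightly more defensive parser might look like the sketch below. Support for an `m` suffix and for thousands separators is an assumption about formats the page could use, not something observed in this run, and `parse_star_count` is a hypothetical name. `Decimal` is used so that values like `74.9k` convert exactly.

```python
from decimal import Decimal

# Assumed suffixes; only 'k' appears in the outputs shown in this notebook.
_SUFFIXES = {"k": 1_000, "m": 1_000_000}

def parse_star_count(text):
    """Convert strings like '74.9k', '1,234' or '512' to an integer."""
    cleaned = text.strip().lower().replace(",", "")
    if not cleaned:
        raise ValueError("empty star count")
    multiplier = _SUFFIXES.get(cleaned[-1], 1)
    if multiplier != 1:
        cleaned = cleaned[:-1]  # drop the suffix before converting
    return int(Decimal(cleaned) * multiplier)

print([parse_star_count(s) for s in ["74.9k", "19.1k", "512"]])
```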

In [40]:
# Now we can do the same for every topic
# define a function that gets the repositories' data for a topic
def get_data(topic_link):
    '''
    input ->  topic page URL
    output -> 4 lists (user names, repo names, repo links, stars),
              each holding one attribute for every repository of that topic
    '''
    topic_response = requests.get(topic_link)
    if topic_response.status_code != 200:
        # stop here instead of parsing an error page
        raise Exception('failed to download page: ' + topic_link)
    topic_doc = BeautifulSoup(topic_response.text, 'html.parser')
    user_repo_link_class =  "f3 color-text-secondary text-normal lh-condensed"
    # each h3 with this class contains the user name, the repo name and the repo link
    user_repo_link_tag = topic_doc.find_all('h3', {'class': user_repo_link_class})
    base_link = 'https://github.com'
    class_star = "social-count float-none"
    repo_star_tag = topic_doc.find_all('a', {'class': class_star})
    repos_number = len(user_repo_link_tag)
    # in every h3 the first <a> is the owner and the second <a> is the repository
    user_name = [user_repo_link_tag[i].find_all('a')[0].text.strip() for i in range(repos_number)]
    repo_name = [user_repo_link_tag[i].find_all('a')[1].text.strip() for i in range(repos_number)]
    repo_link = [base_link+user_repo_link_tag[i].find_all('a')[1]['href'] for i in range(repos_number)]
    stars = [to_int_stars(star.text.strip()) for star in repo_star_tag]
    return user_name, repo_name, repo_link, stars

In [41]:
# make a function that takes the data returned by get_data and builds a dataframe for
# that topic
def make_df(data):
    '''
    input -> the 4 lists "user_name, repo_name, repo_link, stars"
    returned by the get_data function
    output -> a dataframe that has the required attributes for the given topic
    '''
    data_to_dict = {
        "Username":data[0], # user_name 
        "Repository Name":data[1], # repo_name
        "Repository Link":data[2], # repo_link
        "# of stars": data[3] # stars
    }
    topic_df = pd.DataFrame(data_to_dict)
    return topic_df

In [42]:
import os
def save_toics_dfs(topics_df):
    # topics_df holds one row per topic with its title and URL
    if not os.path.exists('topics'):
        os.makedirs('topics')  # make the output directory
    for i in range(len(topics_df['url'])):
        # os.path.join builds a path that works on every operating system
        file_path = os.path.join('topics', topics_df['titles'][i] + '.csv')
        if not os.path.exists(file_path):
            print('processing ... '+topics_df['titles'][i])
            # fetch the repositories from this topic's own URL
            df = make_df(get_data(topics_df['url'][i]))
            df.to_csv(file_path, index=False)
        else:
            print('already exists ! ')
            continue 
    print('End of processing!')

In [43]:
save_toics_dfs(topics_df)

processing ... 3D
processing ... Ajax
processing ... Algorithm
processing ... Amp
processing ... Android
processing ... Angular
processing ... Ansible
processing ... API
processing ... Arduino
processing ... ASP.NET
processing ... Atom
processing ... Awesome Lists
processing ... Amazon Web Services
processing ... Azure
processing ... Babel
processing ... Bash
processing ... Bitcoin
processing ... Bootstrap
processing ... Bot
processing ... C
processing ... Chrome
processing ... Chrome extension
processing ... Command line interface
processing ... Clojure
processing ... Code quality
processing ... Code review
processing ... Compiler
processing ... Continuous integration
processing ... COVID-19
processing ... C++
End of processing!
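
Topic titles are used directly as file names, which works for the titles above but would fail for a title containing a character such as `/`. One possible way to build safer paths is sketched below. The set of replaced characters is an assumption based on characters commonly disallowed in file names, and `topic_csv_path` is a hypothetical helper.

```python
import re
from pathlib import Path

def topic_csv_path(title, out_dir="topics"):
    """Return the CSV path for a topic, replacing characters that are unsafe in file names."""
    safe_title = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return Path(out_dir) / f"{safe_title}.csv"

print(topic_csv_path("C++").as_posix())
print(topic_csv_path("CI/CD").as_posix())
```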


### Now we have 30 files (one per topic), each listing 30 repositories
### Time to download these files as one zip file; to do that we can install the nbzip library

#### just follow the steps at https://github.com/data-8/nbzip

In [44]:
!pip install nbzip --upgrade --quiet

In [46]:
!jupyter serverextension enable --py nbzip --sys-prefix
!jupyter nbextension install --py nbzip
!jupyter nbextension enable --py nbzip

Enabling: nbzip
- Writing config: C:\Installation_of_anaconda\etc\jupyter
    - Validating...
      nbzip  ok
Installing C:\Installation_of_anaconda\lib\site-packages\nbzip\static -> nbzip
Up to date: C:\ProgramData\jupyter\nbextensions\nbzip\tree.js
- Validating: ok

    To initialize this nbextension in the browser every time the notebook (or other app) loads:
    
          jupyter nbextension enable nbzip --py
    
Enabling tree extension nbzip/tree...
      - Validating: ok
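
If installing a notebook extension is not an option, the `topics` folder can also be zipped from Python with the standard library. This is an alternative to nbzip, not what the notebook above does; `zip_directory` is a hypothetical helper, and the demonstration runs on a throwaway temporary directory rather than the real output folder.

```python
import shutil
import tempfile
from pathlib import Path

def zip_directory(source_dir, archive_base):
    """Zip the contents of source_dir into archive_base + '.zip' and return the archive path."""
    return shutil.make_archive(str(archive_base), "zip", root_dir=str(source_dir))

# Demonstration on a temporary directory with one placeholder CSV file.
with tempfile.TemporaryDirectory() as tmp:
    topics_dir = Path(tmp) / "topics"
    topics_dir.mkdir()
    (topics_dir / "3D.csv").write_text("Username,Repository Name\n", encoding="utf-8")
    archive_path = zip_directory(topics_dir, Path(tmp) / "topics_export")
    archive_name = Path(archive_path).name
    archive_exists = Path(archive_path).exists()

print(archive_name, archive_exists)
```

For the real data, `zip_directory('topics', 'topics')` would write `topics.zip` next to the notebook.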
