# 'Topics on GitHub' Scraper

<img src = 'https://i.imgur.com/Uig9ymG.png' width = 80%>

<a href = 'https://colab.research.google.com/github/jishnukoliyadan/scrapper_github_topics//blob/master/scraping_github_topics.ipynb' target = '_blank'><img src = 'https://raw.githubusercontent.com/jishnukoliyadan/usefull_items/master/svgs/Colab_Run_In.svg' width = 20%></a>

<a href = 'https://github.com/jishnukoliyadan/scrapper_github_topics/blob/master/scraping_github_topics.ipynb' target = '_blank'><img src = 'https://raw.githubusercontent.com/jishnukoliyadan/usefull_items/master/svgs/GitHub_View_Source.svg' width = 20%></a>

<a href = 'https://www.kaggle.com/code/jishnukoliyadan/data-collection-tutorial' target = '_blank'><img src = 'https://raw.githubusercontent.com/jishnukoliyadan/usefull_items/master/svgs/Kaggle_View_On.svg' width = 20%></a>

<a href = 'https://nbviewer.org/github/jishnukoliyadan/scrapper_github_topics/blob/master/scraping_github_topics.ipynb' target = '_blank'><img src = 'https://raw.githubusercontent.com/jishnukoliyadan/usefull_items/master/svgs/NbViwer_View_In.svg' width = 20%></a>

## First things first

Let's check the [robots.txt](https://github.com/robots.txt) of [github.com](https://github.com/robots.txt).

<img src = 'https://i.imgur.com/aiqWqOK.png'>

GitHub provides APIs, but here we use web scraping instead, with a generous delay between each click or other activity on the website.
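The robots.txt check can also be done in code. Below is a minimal sketch using the standard library's `urllib.robotparser`. The rules in `SAMPLE_ROBOTS` are hypothetical and only illustrate the mechanics; they are not GitHub's actual robots.txt. The helper name `is_allowed` is also an assumption.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT GitHub's actual robots.txt.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /search
Allow: /topics
"""

def is_allowed(robots_text, url, user_agent='*'):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed(SAMPLE_ROBOTS, 'https://github.com/topics'))
print(is_allowed(SAMPLE_ROBOTS, 'https://github.com/search'))
```

To check the live file instead of a pasted sample, `RobotFileParser` also offers `set_url()` followed by `read()`, which downloads and parses the file in one step.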

# Importing libraries

In [1]:
import os
import re
import ast
import time
import pandas as pd
from tqdm import tqdm
from bs4 import BeautifulSoup as bs

from selenium import webdriver
from selenium.webdriver.common.by import By
from webdriver_manager.core.utils import ChromeType
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service as ChromiumService

# Generating All featured topics

## Plan of action

### 1. The button **Load more...**

- The image below shows the area we want to scrape.
- The [GitHub Topics](https://github.com/topics) page only shows **20** topics in one go.
- The first challenge is to extract all the available topics with the help of the **Load more...** `button`.

<img src = 'https://i.imgur.com/gzEh06O.png'>

- The HTML skeleton shows that **Load more...** is a `button` with `type="submit"`.
- To locate this `button` we use an **XPath** selector.
- The code for this is `driver.find_element(By.XPATH, "//button[@type = 'submit']").click()`.
- This finds the `button` and **clicks** it for us automatically.

<img src = 'https://i.imgur.com/P26nk9g.png'>
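The click-until-nothing-is-left idea can be sketched without a browser. In the sketch below, `click_until_gone` and its parameters are hypothetical names, and `find_button` stands for any callable that returns a clickable object or raises when the button no longer exists (Selenium's `find_element` raises `NoSuchElementException` in that case).

```python
import time

def click_until_gone(find_button, pause=2.0, max_clicks=50):
    """Click the button returned by find_button() until it can no longer be found.

    Returns the number of successful clicks. max_clicks is a safety cap
    so a page that never runs out of content cannot loop forever.
    """
    clicks = 0
    while clicks < max_clicks:
        try:
            button = find_button()
            button.click()
        except Exception:
            # The button is missing or not clickable: nothing more to load.
            break
        clicks += 1
        time.sleep(pause)  # polite delay between clicks
    return clicks
```

With a live Selenium session this might be called as `click_until_gone(lambda: driver.find_element(By.XPATH, "//button[@type = 'submit']"))`, assuming `driver` and `By` are set up as in the notebook.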

In [2]:
def call_webdriver():
    return webdriver.Chrome(service=ChromiumService(ChromeDriverManager(chrome_type=ChromeType.CHROMIUM).install()))

In [3]:
# https://pythonbasics.org/selenium-scroll-down/

def scroll_to_bottom():
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")

In [4]:
BASE_URL = 'https://github.com'

os.makedirs('data', exist_ok = True) # Creating directory to save scrapped data

if not os.path.isfile('data/topics.csv'):
    
    print('Opening selenium crawler')
    
    driver = call_webdriver()
    driver.get(BASE_URL + '/topics')
    driver.maximize_window()
    scroll_to_bottom()

    try:
        while True:
            time.sleep(2)
            driver.find_element(By.XPATH, "//button[@type = 'submit']").click()
            time.sleep(3)
            scroll_to_bottom()
    except Exception:
        # The loop ends once the 'Load more' button can no longer be found or clicked
        pass
    
    soup = bs(driver.page_source, 'lxml')
    driver.quit()

    # print(soup.prettify())
else:
    print("'topics.csv' already exists, not using crawler")

Opening selenium crawler


### 2. Scraping the data

- Let's look at the **GIF** before we break it down into pieces.

<img src = 'https://i.imgur.com/eyf9jsx.gif'>

### Let's Break Down

<img src = 'https://i.imgur.com/4rVt9Kn.gif'>

- This shows a `div` with the `class` **py-4 border-bottom d-flex flex-justify-between**, which holds all the data we need.
- Inside it there is an `a` tag with the `class` **no-underline flex-1 d-flex flex-column**; this is the exact area we need to scrape.

<img src = 'https://i.imgur.com/CKYE5nm.gif'>

- We are looking for these items:
    1. The link to the topic, shown as **Item-1** in the figure and held in the `a` tag.
    2. The topic name (**Item-2**), enclosed in a `p` tag.
    3. The description of the topic (**Item-3**), given in another `p` tag.
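The three items above can be pulled from a single topic card as in the sketch below. `SAMPLE_HTML` is a hand-written stand-in that only mimics the class names described above (the description text is a placeholder), and `parse_topic_cards` is a hypothetical helper name. The built-in `html.parser` backend is used so the snippet does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

BASE_URL = 'https://github.com'

# Hand-written stand-in for one topic card; the real page markup may differ.
SAMPLE_HTML = """
<div class="py-4 border-bottom d-flex flex-justify-between">
  <a href="/topics/3d" class="no-underline flex-1 d-flex flex-column">
    <p>3D</p>
    <p>Placeholder description.</p>
  </a>
</div>
"""

def parse_topic_cards(html):
    """Return one dict per topic card with its title, link and description."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for card in soup.find_all('div', {'class': 'py-4 border-bottom d-flex flex-justify-between'}):
        box = card.find('a', {'class': 'no-underline flex-1 d-flex flex-column'})
        paragraphs = box.select('p')
        rows.append({
            'title': paragraphs[0].text.strip(),    # Item-2
            'link': BASE_URL + box['href'],         # Item-1
            'descrip': paragraphs[1].text.strip(),  # Item-3
        })
    return rows

print(parse_topic_cards(SAMPLE_HTML))
```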
    
**Now let's extract the data**

In [5]:
if not os.path.isfile('data/topics.csv'):
    
    print("Data file doesn't exists, scrapping data")
    all_topics = soup.find_all('div', {'class' : 'py-4 border-bottom d-flex flex-justify-between'})

    links, titles, descs = [], [], []
    for topic in all_topics:

        topic_box = topic.find('a', {'class' : 'no-underline flex-1 d-flex flex-column'})
        links.append(BASE_URL + topic_box['href'])
        titles.append(topic_box.select('p')[0].text.strip())
        descs.append(topic_box.select('p')[1].text.strip())

    topic_df = pd.DataFrame({'title' : titles, 'link' : links, 'descrip' : descs})
    topic_df.to_csv('data/topics.csv', index = False)
    print(f'Number of topic titles collected : {len(topic_df)}\n')
else:
    print("Loading saved 'topics.csv'")
    topic_df = pd.read_csv('data/topics.csv')
    print(f'Number of topic titles collected : {len(topic_df)}\n')

topic_df.head()

Data file doesn't exists, scrapping data
Number of topic titles collected : 180



|   | title | link | descrip |
|---|---|---|---|
| 0 | 3D | https://github.com/topics/3d | 3D modeling is the process of virtually develo... |
| 1 | Ajax | https://github.com/topics/ajax | Ajax is a technique for creating interactive w... |
| 2 | Algorithm | https://github.com/topics/algorithm | Algorithms are self-contained sequences that c... |
| 3 | Amp | https://github.com/topics/amphp | Amp is a non-blocking concurrency library for ... |
| 4 | Android | https://github.com/topics/android | Android is an operating system built by Google... |


# Generating repository details

### 1. The button **Load more...**

- The image below shows the area we want to scrape.
- The page for any single topic under [GitHub Topics](https://github.com/topics) shows only **20** repositories in one go.
- The first challenge is to extract 120 records with the help of the **Load more...** `button`.

<img src = 'https://i.imgur.com/8gAxqu7.png'>

- The HTML skeleton shows that **Load more...** is a `button` with `type="submit"`.
- To locate this `button` we use an **XPath** selector.
- The code for this is `driver.find_element(By.XPATH, "//button[@type = 'submit']").click()`.
- This finds the `button` and **clicks** it for us automatically.

<img src = 'https://i.imgur.com/RSC5pkV.png'>
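The scraping cell further down clicks the button five times per topic. That number follows from the page size: 20 repositories are visible up front and each click adds another 20. A small sketch of the arithmetic, where `clicks_needed` is a hypothetical helper name:

```python
import math

def clicks_needed(target_rows, page_size=20):
    """Number of 'Load more' clicks required before target_rows items are visible."""
    if target_rows <= page_size:
        return 0  # the first page already shows enough
    # The first page is free; every further page costs one click.
    return math.ceil(target_rows / page_size) - 1

print(clicks_needed(120))  # 5
```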

### 2. Scraping the data

- Let's look at the **GIF** before we break it down into pieces.

<img src = 'https://i.imgur.com/gSvMQbx.gif'>

### Let's Break Down

<img src = 'https://i.imgur.com/8gAxqu7.gif'>

- This shows a `div` with the `class` **d-flex flex-justify-between flex-items-start flex-wrap gap-2 my-3**, which holds all the data we need.
<!-- - In this class we have another `class` named **no-underline flex-1 d-flex flex-column**, this is the exact area we need to scrape. -->

<img src = 'https://i.imgur.com/xo0k8rl.gif'>

- We are looking for these items:
    1. The GitHub user name (**Item-1**) and the repository name (**Item-2**), both held in `a` tags inside an `h3` tag.
    2. The star count, held in a `span` tag whose `class` is **Counter js-social-count**.
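The user name and the repository name also appear together in the repository link's `href`. The sketch below shows an alternative way to recover both from it, not the approach the notebook takes. It assumes the `href` has the form `/user/repo`, which matches the links in the output further down. `split_repo_href` is a hypothetical helper name.

```python
BASE_URL = 'https://github.com'

def split_repo_href(href):
    """Split an href such as '/mrdoob/three.js' into (user_name, repo_name, repo_link)."""
    # Assumes exactly the '/user/repo' shape; anything else raises ValueError.
    user_name, repo_name = href.strip().strip('/').split('/')
    repo_link = f'{BASE_URL}/{user_name}/{repo_name}'
    return user_name, repo_name, repo_link

print(split_repo_href('/mrdoob/three.js'))
```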
    
**Now let's extract the data**

In [6]:
if not os.path.isfile('data/repo_details.csv'):

    driver = call_webdriver()
    driver.maximize_window()

    topic_repo_df = pd.DataFrame({'topic' : [], 'user_name' : [], 'repo_name' : [], 'repo_link' : [], 'start_count' : []})
    not_120 = []
    
    for t_link in tqdm(topic_df.link):
        driver.get(t_link)
        try:
            count = 5
            while count:
                scroll_to_bottom()
                time.sleep(2)
                driver.find_element(By.XPATH, "//button[@type = 'submit']").click()
                time.sleep(2)
                scroll_to_bottom()
                time.sleep(2)
                count -= 1
        except Exception:
            # The button could not be clicked 5 times, so this topic may have fewer than 120 repos
            not_120.append(t_link)
        
        soup = bs(driver.page_source, 'lxml')
        all_div_box = soup.find_all('div', {'class' : 'd-flex flex-justify-between flex-items-start flex-wrap gap-2 my-3'})

        for div_box in all_div_box:
            topic = t_link.split("/")[-1]
            user_name = div_box.select('h3 a')[0].text.strip()
            repo_name = div_box.select('h3 a')[1].text.strip()
            repo_link = BASE_URL + div_box.select('h3 a')[1]['href'].strip()
            star_count = div_box.find('span', class_ = 'Counter js-social-count').text

            topic_repo_df.loc[len(topic_repo_df)] = [topic, user_name, repo_name, repo_link, star_count]

    driver.quit()
    topic_repo_df.to_csv('data/repo_details.csv', index = False)
    print(f'Number of repo details collected : {len(topic_repo_df)}\n')
    
else:
    print("Loading saved 'repo_details.csv'")
    topic_repo_df = pd.read_csv('data/repo_details.csv')
    print(f'Number of repo details collected : {len(topic_repo_df)}\n')

topic_repo_df.head()

100%|████████████████████████████████████████████████████████████| 180/180 [1:38:14<00:00, 32.75s/it]


Number of repo details collected : 21337



|   | topic | user_name | repo_name | repo_link | start_count |
|---|---|---|---|---|---|
| 0 | 3d | mrdoob | three.js | https://github.com/mrdoob/three.js | 87.1k |
| 1 | 3d | libgdx | libgdx | https://github.com/libgdx/libgdx | 20.8k |
| 2 | 3d | pmndrs | react-three-fiber | https://github.com/pmndrs/react-three-fiber | 20.5k |
| 3 | 3d | BabylonJS | Babylon.js | https://github.com/BabylonJS/Babylon.js | 18.8k |
| 4 | 3d | ssloy | tinyrenderer | https://github.com/ssloy/tinyrenderer | 15.3k |


In [7]:
try:
    not_120 = topic_repo_df.topic.value_counts()
    not_120 = pd.DataFrame(not_120[not_120 != 120]).reset_index().rename(columns  = {'topic' : 'Count', 'index' : 'Title'})
    print(f"Number of topices that doesn't have 120 records : {len(not_120)}\n")
    print(not_120)
except:
    pass

Number of topices that doesn't have 120 records : 5

                   Title  Count
0              mvvmcross     95
1                ratchet     89
2  dependency-management     76
3               spacevim     67
4              wordplate     10


# Cleaning repository data

In [8]:
topic_repo_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21337 entries, 0 to 21336
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   topic        21337 non-null  object
 1   user_name    21337 non-null  object
 2   repo_name    21337 non-null  object
 3   repo_link    21337 non-null  object
 4   start_count  21337 non-null  object
dtypes: object(5)
memory usage: 833.6+ KB


**`start_count`** holds numeric values, but it is stored as *`object`* because the count is written as **87.1k** instead of **87100**.

We also convert it to `int32` from the default `int64`.

Why `int32` and not `int64`**?** See https://jakevdp.github.io/PythonDataScienceHandbook/02.01-understanding-data-types.html#NumPy-Standard-Data-Types
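A rough back-of-the-envelope check of the `int32` choice: the values must fit in a signed 32-bit integer, and each value then takes 4 bytes instead of 8. The sketch below uses the row count reported by `info()` and the largest count visible in the preview rows. It only counts the raw values and ignores index and container overhead.

```python
ROWS = 21337            # rows reported by topic_repo_df.info()
INT32_MAX = 2**31 - 1   # largest value a signed 32-bit integer can hold
LARGEST_SAMPLE = 87100  # 87.1k, the largest count in the preview rows

def column_bytes(rows, bytes_per_value):
    """Raw storage for one numeric column, ignoring any overhead."""
    return rows * bytes_per_value

print(LARGEST_SAMPLE <= INT32_MAX)     # True
print(column_bytes(ROWS, 8))           # 170696 bytes as int64
print(column_bytes(ROWS, 4))           # 85348 bytes as int32
```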

In [9]:
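def stars_int(val):
    # Convert a star count such as '87.1k' to an integer (87100)
    val = str(val).strip()
    if re.search(r'[kK]', val):
        val = re.sub(r'[kK]', '', val)
        # round() guards against floating-point error before converting to int
        return round(ast.literal_eval(val) * 1000)
    return int(val)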
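As a cross-check on the conversion, here is a variant that avoids binary floating point altogether by using `decimal.Decimal`. It is a sketch under the assumption that counts are either plain integers or a number followed by a single `k` suffix; `parse_star_count` is a hypothetical name, not part of the notebook.

```python
from decimal import Decimal

def parse_star_count(text):
    """Convert strings such as '87.1k' or '95' to an integer star count."""
    text = text.strip()
    if text[-1:].lower() == 'k':
        # Decimal keeps '87.1' exact, so multiplying by 1000 gives exactly 87100
        return int(Decimal(text[:-1]) * 1000)
    return int(text)

for sample in ['87.1k', '20.8k', '95']:
    print(sample, parse_star_count(sample))
```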

In [10]:
topic_repo_df['start_count'] = topic_repo_df.start_count.apply(stars_int).astype('int32')

topic_repo_df.head()

|   | topic | user_name | repo_name | repo_link | start_count |
|---|---|---|---|---|---|
| 0 | 3d | mrdoob | three.js | https://github.com/mrdoob/three.js | 87100 |
| 1 | 3d | libgdx | libgdx | https://github.com/libgdx/libgdx | 20800 |
| 2 | 3d | pmndrs | react-three-fiber | https://github.com/pmndrs/react-three-fiber | 20500 |
| 3 | 3d | BabylonJS | Babylon.js | https://github.com/BabylonJS/Babylon.js | 18800 |
| 4 | 3d | ssloy | tinyrenderer | https://github.com/ssloy/tinyrenderer | 15300 |
