I want to get the challenge list from the site "CTF Learn".  
We first fetch the HTML content of the page using requests.

In [91]:
import requests

url = 'https://ctflearn.com/challenge/1/browse'

res = requests.get(url)

print(res.text)





<!doctype html>
<html lang="en">
<head>
    
        
            <script data-ad-client="ca-pub-5775379917527071" async src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
            <!-- Google Tag Manager -->
            <script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
            new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
            j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
            'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
            })(window,document,'script','dataLayer','GTM-KH229BX');</script>
            <!-- End Google Tag Manager -->
        
        <!-- Favicon -->
        <link rel="shortcut icon" type="image/png" href="/static/img/favicon.ico"/>
        <script src="https://kit.fontawesome.com/acbbb1978f.js" crossorigin="anonymous"></script>
        <link rel="preconnect" href="https://fonts.gstatic.com">
        <lin

From the content above and from the browser's developer tools, we can see that this page renders its content with JavaScript after it loads. We cannot get the content directly with requests or bs4, which are better suited to static web content.  
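
One quick way to confirm this is to check whether the raw HTML contains any element with the class we are looking for. The sketch below uses only the standard library. `ClassFinder` and `has_class` are hypothetical helper names, and the two sample strings are made up for illustration, not copied from the site.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Records whether any tag carries the target CSS class."""

    def __init__(self, target: str):
        super().__init__()
        self.target = target
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        for name, value in attrs:
            if name == "class" and value and self.target in value.split():
                self.found = True


def has_class(html: str, class_name: str) -> bool:
    finder = ClassFinder(class_name)
    finder.feed(html)
    return finder.found


# Made-up samples: a JavaScript shell page vs. a fully rendered page
shell_page = '<html><body><div id="app"></div><script src="/app.js"></script></body></html>'
rendered_page = '<html><body><div class="card challenge-card">Practice Flag</div></body></html>'

print(has_class(shell_page, "challenge-card"))     # False
print(has_class(rendered_page, "challenge-card"))  # True
```

Calling `has_class(res.text, "challenge-card")` on the response above is one way to check this page: if it returns `False`, the cards are not part of the static HTML.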

I'll use Selenium, a tool that simulates user interaction with a webpage, to get the dynamically loaded data.

To use Selenium, we first need to install the webdriver we plan to use and put it in our working directory. In this case I'll be using ChromeDriver.  
> https://googlechromelabs.github.io/chrome-for-testing/



We can observe that the challenges are displayed as cards on the website, and all of them have the "challenge-card" class.  

The first issue I encountered was that I could only find 6 challenges on the page.  
To show all the challenges I need to change the pageSize, which is set through a select element. The first part of my code deals with this issue.   
After selecting 'All' we refresh the page to let it take effect.

Next we split the text of each card, format it into the target fields, and store them in a dictionary (so that it can be converted to JSON).   
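
The parsing step can be isolated into a small function and tried on a sample string before running the browser. This is only a sketch: `parse_card` is a hypothetical name, and the sample text is an assumed layout (five lines, with the numbers at positions 0, 2 and 4 of the third line), not text copied from the site.

```python
import json


def parse_card(text: str) -> dict:
    """Turn the visible text of one challenge card into a dictionary."""
    lines = text.replace(" · ", ";").split("\n")
    stats = lines[2].split(" ")  # numbers sit at positions 0, 2 and 4
    return {
        "challenge_name": lines[0],
        "difficulty": lines[1],
        "points": int(stats[0]),
        "comments": int(stats[2]),
        "rating": float(stats[4]),
        "category": lines[3],
        "solves": int(lines[4].split(" ")[0]),
    }


# Hypothetical card text; the real wording on the site may differ
sample = (
    "Practice Flag\n"
    "Easy\n"
    "10 points 444 comments 4.00 rating\n"
    "Miscellaneous · intelagent\n"
    "62454 solves"
)
card = parse_card(sample)
print(json.dumps(card, indent=4))
```

Keeping the parsing separate from the browser code makes it easier to adjust the indices if the card layout changes.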

However, fetching each challenge from the challenge list may take too much time, which can lead to a StaleElementReferenceException being raised. This means that the element reference is no longer valid, for example because the element is no longer attached to the page.   

To solve this I added a short wait and re-fetch the element in every iteration.
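
An alternative to a fixed wait is to retry the lookup when it fails. The helper below is a generic sketch and is not part of this notebook's code: `retry` and `FakeStaleError` are hypothetical names, and the flaky function only stands in for an element lookup so the example runs without a browser.

```python
import time


def retry(action, retry_on, attempts=3, delay=0.3):
    """Call action() until it succeeds, retrying on the given exception type."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts, let the caller see the error
            time.sleep(delay)


# Stand-in for a flaky element lookup: fails twice, then succeeds
class FakeStaleError(Exception):
    pass


calls = {"count": 0}


def flaky_lookup():
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeStaleError("element went stale")
    return "card text"


result = retry(flaky_lookup, FakeStaleError, attempts=5, delay=0.01)
print(result, calls["count"])
```

With Selenium, one would pass `StaleElementReferenceException` as `retry_on` and an `action` that re-fetches the card and reads its `.text`.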

In [93]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import StaleElementReferenceException
import json
import pandas as pd
import time

driver = webdriver.Chrome()
driver.get(url)

try:
    # Wait up to 10 seconds for the challenges to be loaded, then continue to next step
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "challenge-card"))
    )

    # Let the page show all challenges
    element_show = Select(driver.find_element(By.ID, 'pageSelect'))
    element_show.select_by_visible_text('All')

    # refresh page to let the select take effect
    driver.refresh()

    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "challenge-card"))
    )


    # Get the challenges
    challenges_len = len(driver.find_elements(By.CLASS_NAME, 'challenge-card'))
    
    dataset = []

    # Format and store to dictionary
    for i in range(challenges_len):
        print(i, end=' ')

        # fetch in every iteration to lower chance of StaleElementReferenceException
        challenge = driver.find_elements(By.CLASS_NAME, 'challenge-card')[i]

        data = challenge.text.replace(' · ', ';').split('\n')

        challenge_dict = {
            'challenge_name' : data[0],
            'difficulty' : data[1],
            'points' : int(data[2].split(' ')[0]),
            'comments' : int(data[2].split(' ')[2]),
            'rating' : float(data[2].split(' ')[4]),
            'category' : data[3],
            'solves' : int(data[4].split(' ')[0]),
        }

        dataset.append(challenge_dict)

        # avoid StaleElementReferenceException
        time.sleep(0.3)

    print(json.dumps(dataset, indent=4))

finally:
    driver.quit()

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 [
    {
        "challenge_name": "Practice Flag",
        "difficulty": "Easy",
        "points": 10,
        "comments": 444,
        "rating": 4.0,
        "category": "Miscellaneous;intelagent",
        "solves": 62454
    },
    {
        "challenge_name": "Bas

Dump the data above into challenges.json

In [107]:
with open('challenges.json', 'w') as f:
    json.dump(dataset, f, indent=4)

Dump the data into challenges.csv

In [108]:
df = pd.DataFrame(dataset)
df.to_csv('challenges.csv', index=False)

csv_result = pd.read_csv('challenges.csv')
csv_result

     challenge_name      difficulty  points  comments  rating  category                         solves
0    Practice Flag       Easy        10      444       4.00    Miscellaneous;intelagent         62454
1    Basic Injection     Easy        30      736       4.60    Web;intelagent                   50471
2    Forensics 101       Easy        30      391       4.48    Forensics;intelagent             36288
3    Character Encoding  Easy        20      257       4.39    Cryptography;dknj11902           32430
4    Taking LS           Easy        10      344       3.87    Forensics;alexkato29             29564
...  ...                 ...         ...     ...       ...     ...                              ...
206  Dune                Medium      60      9         5.00    Reverse Engineering;kcbowhunter  8
207  Redirected          Hard        70      1         5.00    Binary;Rivit                     6
208  House               Hard        80      1         5.00    Binary;Rivit                     5
209  Slow bin            Hard        90      5         5.00    Binary;Rivit                     5
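
The category field still holds two values joined by the `;` that replaced ` · ` during parsing. If separate fields are wanted, they can be split afterwards. This is a minimal sketch: `split_category` is a hypothetical helper, and treating the second part as the challenge author is an assumption based on how the values look.

```python
def split_category(row: dict) -> dict:
    """Return a copy of row with 'category' split into category and author."""
    category, _, author = row["category"].partition(";")
    new_row = dict(row)
    new_row["category"] = category
    new_row["author"] = author  # assumed meaning of the second part
    return new_row


rows = [
    {"challenge_name": "Practice Flag", "category": "Miscellaneous;intelagent"},
    {"challenge_name": "Dune", "category": "Reverse Engineering;kcbowhunter"},
]
cleaned = [split_category(r) for r in rows]
print(cleaned[0]["category"], cleaned[0]["author"])
```

The same function could be applied to every entry of `dataset` before dumping it to JSON or CSV.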
