Useful Web Scraping "Hacks"
Use pricescrap.py to compile a CSV file with asset prices and upload it to OpenFinance.
- Install Python 3
- Install lxml
pip install lxml
- Install RoboBrowser
pip install robobrowser
- Edit your OpenFinance credentials in pricescrap.json (see "General Requirements" below).
- Add the assets that you want to scrape daily quotes for to the assets array.
  - The script offers support for the following data sources ...
  - ... but it can be easily extended to support other price providers
- Run the program
python pricescrap.py
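As a rough illustration of the CSV step, the sketch below writes a few quotes to CSV text with Python's standard csv module. The column names (date, asset, price), the sample values and the helper quotes_to_csv are hypothetical; the layout that pricescrap.py actually produces, and that OpenFinance expects, may differ.

```python
import csv
import io


def quotes_to_csv(quotes):
    """Serialize (date, asset, price) tuples to CSV text.

    The column layout is an assumption made for illustration only.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["date", "asset", "price"])
    for date, asset, price in quotes:
        # Format prices with two decimals so the file is uniform.
        writer.writerow([date, asset, f"{price:.2f}"])
    return buffer.getvalue()


# Hypothetical sample data, not real quotes.
sample = [("2020-01-02", "ASSET_A", 10.5), ("2020-01-02", "ASSET_B", 7.25)]
print(quotes_to_csv(sample))
```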
Use duolingoscrap.py to compile cheat sheets from Duolingo.
It demonstrates the use of Selenium and BeautifulSoup.
- Install the Chrome Browser
- Install Python 3
- Install BeautifulSoup
pip install bs4
- Install Selenium
pip install selenium
- Add the required ChromeDriver to your system
- Edit your Duolingo credentials in duolingoscrap.json (see "General Requirements" below).
- Duplicate the template duolingo/duo.html with the name of the language you want to download (e.g. duolingo/Russian.html).
  - Optionally you can edit the parts that correspond to the name of the language and the flag:
<div class="_1sdh6 ljpAk">
<div class="yZINH">
<!-- This section has the flag -->
<span class="_1eqxJ _3viv6 HCWXf _3PU7E _2XSZu"></span>
</div>
<div class="yZINH _1_vhy">
<!-- This section has the name -->
<h2>Russian</h2>
<div><span>Cheat Sheet</span></div>
</div>
</div>
- Edit the languages dictionary inside duolingoscrap.py to associate Duolingo's extension for the language (e.g. ru) with your recently created file (e.g. duolingo/Russian.html).
languages = {
"ru": "duolingo/Russian.html"
}
- You have two options: download only selected lessons (add them to the lessons array), or the whole language (leave the lessons array empty). In the second case note that:
  - It will only scan the active language. If you have several languages in Duolingo, you need to switch to the language that you want to download.
  - It will only download up to your last available lesson. It can't download lessons you can't access yet (but you can run the program again on a future date).
  - It keeps track of the lessons you already downloaded and will not overwrite or duplicate them.
- Run the program
python duolingoscrap.py
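The "keeps track of the lessons you already downloaded" behaviour can be pictured with the minimal sketch below. It is not taken from duolingoscrap.py: the tracking file name downloaded.json, the lesson names and the three helper functions are assumptions chosen for illustration.

```python
import json
import os
import tempfile


def load_done(path):
    """Return the set of lesson names recorded in the tracking file."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as fh:
        return set(json.load(fh))


def pending_lessons(available, done):
    """Keep the original order, dropping lessons that were already saved."""
    return [lesson for lesson in available if lesson not in done]


def mark_done(path, done, lesson):
    """Record a lesson as downloaded and persist the updated set."""
    done.add(lesson)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(sorted(done), fh)


# Hypothetical tracking file and lesson names.
tracking_file = os.path.join(tempfile.mkdtemp(), "downloaded.json")
done = load_done(tracking_file)
for lesson in pending_lessons(["Basics 1", "Basics 2"], done):
    mark_done(tracking_file, done, lesson)  # a real run would download first

# A later run only sees the lesson that became available since then.
print(pending_lessons(["Basics 1", "Basics 2", "Phrases"], load_done(tracking_file)))
```

Persisting the set after every lesson means an interrupted run can resume without repeating work.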
Use safariscrap.py to download books and video lessons from SafariBooks.
It combines Selenium with BrowserMobProxy to observe and manipulate web traffic (required for video download) and PdfReactor to convert HTML books into PDF.
- Follow all the instructions above (credentials in safariscrap.json)
- Install Java (required by BrowserMobProxy)
- Optionally, install PdfReactor.
  - If you use PdfReactor, switch the pdfReactor variable in safariscrap.py accordingly:
pdfReactor = None
# pdfReactor = PDFreactor("http://localhost:9423/service/rest")
- Search SafariBooks for what you want to download. You have two options:
  - Download a specific course like https://www.safaribooksonline.com/library/view/numpy-cookbook/9781849518925/.
    - The application will automatically detect whether it is a book or a video tutorial and proceed accordingly.
  - Download ALL courses from a given topic like https://www.safaribooksonline.com/topics/python
- List any combination of topics and courses in any order in the courses array.
  - Useful notes for topics with many courses:
    - If you just list one topic, you can further refine the page it starts downloading from using the topicNum variable.
    - By default the program will NOT overwrite courses downloaded previously. You can switch this with the overwrite variable.
courses = [
"https://www.safaribooksonline.com/library/view/python-data-structures/9781786467355/",
"https://www.safaribooksonline.com/topics/java"
]
overwrite = False
topicNum = 0
- Run the program
python safariscrap.py
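Since the courses array can mix topic and course URLs, one plausible way to tell them apart is by URL path. The sketch below assumes the two path patterns visible in the example URLs above (/topics/ and /library/view/); the classify helper is hypothetical, and safariscrap.py may decide this differently.

```python
from urllib.parse import urlparse


def classify(url):
    """Return "topic", "course", or "unknown" based on the URL path."""
    path = urlparse(url).path
    if path.startswith("/topics/"):
        return "topic"
    if path.startswith("/library/view/"):
        return "course"
    return "unknown"


courses = [
    "https://www.safaribooksonline.com/library/view/python-data-structures/9781786467355/",
    "https://www.safaribooksonline.com/topics/java",
]
for url in courses:
    print(classify(url), url)
```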
Use drumeoscrap.py to download songs, play-alongs and video lessons from Drumeo.
Very similar in requirements and functionality to Safari Books above, with two important additions:
- Reads and writes session cookies to disk (in JSON format), avoiding unnecessary logins.
- As well as whole video files, it can download *.ts video segments, assemble them and transform them into mpg (using FFmpeg).
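Both additions can be sketched with the standard library alone. The code below is an illustration under assumptions, not the code in drumeoscrap.py: all function names and file names are hypothetical, cookies are treated as plain dicts (the shape Selenium's driver.get_cookies() returns and driver.add_cookie() accepts), and the segments are joined by simple byte concatenation before FFmpeg is asked to convert the result.

```python
import json
import tempfile
from pathlib import Path


def save_cookies(cookies, path):
    """Write a list of cookie dicts to disk as JSON."""
    Path(path).write_text(json.dumps(cookies), encoding="utf-8")


def load_cookies(path):
    """Read cookies saved earlier; an empty list means a login is needed."""
    file = Path(path)
    if not file.exists():
        return []
    return json.loads(file.read_text(encoding="utf-8"))


def join_segments(segment_paths, joined_path):
    """Concatenate *.ts segments, in the given order, into one file."""
    with open(joined_path, "wb") as out:
        for segment in segment_paths:
            out.write(Path(segment).read_bytes())


def ffmpeg_command(joined_path, output_path):
    """Build (but do not run) an FFmpeg call that converts the joined file."""
    return ["ffmpeg", "-i", str(joined_path), str(output_path)]


# Demo in a throwaway directory with dummy data.
workdir = Path(tempfile.mkdtemp())
save_cookies([{"name": "session", "value": "abc"}], workdir / "cookies.json")
print(load_cookies(workdir / "cookies.json"))

(workdir / "0.ts").write_bytes(b"first")
(workdir / "1.ts").write_bytes(b"second")
join_segments([workdir / "0.ts", workdir / "1.ts"], workdir / "joined.ts")
print(ffmpeg_command(workdir / "joined.ts", workdir / "video.mpg"))
```

In a real run the command list would be passed to subprocess.run, and the loaded cookies would be added to the browser session before requesting protected pages.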
General Requirements
In all cases you need to create a credentials file named pricescrap.json, duolingoscrap.json, etc. with the following structure:
{
"username" : "<your_username>",
"password" : "<your_password>"
}
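A script can read such a file with Python's json module. The following is a minimal sketch; the load_credentials helper and its error handling are assumptions, not the code used by the scripts in this repository.

```python
import json
import os
import tempfile


def load_credentials(path):
    """Read username and password from a JSON credentials file."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    missing = {"username", "password"} - set(data)
    if missing:
        raise KeyError(f"missing keys in {path}: {sorted(missing)}")
    return data["username"], data["password"]


# Demo with a throwaway file that follows the structure shown above.
demo_path = os.path.join(tempfile.mkdtemp(), "pricescrap.json")
with open(demo_path, "w", encoding="utf-8") as fh:
    json.dump({"username": "<your_username>", "password": "<your_password>"}, fh)

username, password = load_credentials(demo_path)
print(username)
```

Keeping credentials in a separate JSON file keeps them out of the scripts themselves, so the file can be excluded from version control.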