Scrap Hacks

Useful Web Scraping "Hacks"

Prices

Use pricescrap.py to compile a CSV file with asset prices and upload it to OpenFinance.

Installation

Install Python 3
Install lxml: pip install lxml
Install RoboBrowser: pip install robobrowser
Edit your OpenFinance credentials in pricescrap.json (see "General Requirements" below).

Usage

Add the assets that you want to scrape daily quotes for to the assets array.
The script offers support for the following data sources ...
... but it can be easily extended to support other price providers
Run the program: python pricescrap.py
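
The CSV step can be sketched as follows. This is only an illustration: the function name write_quotes, the column names and the sample row are hypothetical, and the layout that pricescrap.py actually writes for OpenFinance may differ.

```python
import csv
import io


def write_quotes(quotes, handle):
    """Write (date, asset, price) rows to an open text handle as CSV."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["date", "asset", "price"])  # hypothetical header
    for date, asset, price in quotes:
        writer.writerow([date, asset, f"{price:.2f}"])


# Made-up sample row, written to an in-memory buffer instead of a file.
buffer = io.StringIO()
write_quotes([("2020-01-02", "EXAMPLE", 10.5)], buffer)
csv_text = buffer.getvalue()
```

Passing an open file handle instead of the buffer would produce a file such as pricescrap.csv.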

Duolingo

Use duolingoscrap.py to compile cheat sheets from Duolingo.

It demonstrates the use of Selenium and BeautifulSoup.

Installation

Install the Chrome Browser
Install Python 3
Install BeautifulSoup: pip install bs4
Install Selenium: pip install selenium
Add the required ChromeDriver to your system
Edit your Duolingo credentials in duolingoscrap.json (see "General Requirements" below).

Usage

Duplicate the template duolingo/duo.html with the name of the language you want to download (e.g. duolingo/Russian.html).
Optionally you can edit the parts that correspond to the name of the language and the flag:

<div class="_1sdh6 ljpAk">
    <div class="yZINH">
        <!-- This section has the flag -->
        <span class="_1eqxJ _3viv6 HCWXf _3PU7E _2XSZu"></span>
    </div>
    <div class="yZINH _1_vhy">
        <!-- This section has the name -->
        <h2>Russian</h2>
        <div><span>Cheat Sheet</span></div>
    </div>
</div>

Edit the languages dictionary inside duolingoscrap.py to associate Duolingo's extension for the language (e.g. ru) to your recently created file (e.g. duolingo/Russian.html).

languages = {
    "ru": "duolingo/Russian.html"
}

You have two options: download only selected lessons (add them to the lessons array), or the whole language (leave the lessons array empty). In the second case note that:
- It will only scan the active language. If you have several languages in Duolingo, you need to switch to the language that you want to download.
- It will only download up to your last available lesson. It can't download lessons you can't access yet (but you can run the program again on a future date).
- It keeps track of the lessons you already downloaded and will not overwrite or duplicate them.
Run the program: python duolingoscrap.py
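
The "will not overwrite or duplicate" behaviour described above could be implemented along these lines. This is a minimal sketch, assuming that finished lessons are recorded in a JSON file; the helper names, the tracking file and the lesson names are hypothetical, and duolingoscrap.py may track progress differently.

```python
import json
import tempfile
from pathlib import Path


def load_done(path):
    """Return the set of lesson names already recorded in the tracking file."""
    path = Path(path)
    if not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8")))


def download_pending(requested, path, fetch):
    """Call fetch() only for lessons not yet recorded, then record them."""
    done = load_done(path)
    fetched = []
    for lesson in requested:
        if lesson in done:
            continue  # already downloaded: do not overwrite or duplicate
        fetch(lesson)
        done.add(lesson)
        fetched.append(lesson)
    Path(path).write_text(json.dumps(sorted(done)), encoding="utf-8")
    return fetched


# Demo with a temporary tracking file and a no-op fetch function.
with tempfile.TemporaryDirectory() as tmp:
    tracker = Path(tmp) / "done.json"
    first_run = download_pending(["Basics 1", "Phrases"], tracker, lambda name: None)
    second_run = download_pending(["Basics 1", "Phrases", "Food"], tracker, lambda name: None)
```

On the second run only the lesson that was not recorded before is fetched, which is why running the program again at a later date is safe.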

Safari Books

Use safariscrap.py to download books and video lessons from SafariBooks.

It combines Selenium with BrowserMobProxy to observe and manipulate web traffic (required for video download) and PdfReactor to convert HTML books into PDF.

Installation

Follow all the installation instructions above (with your credentials in safariscrap.json).
Install Java (required by BrowserMobProxy).
Optionally, install PdfReactor.
- If you use PdfReactor, switch the pdfReactor variable in safariscrap.py accordingly:

pdfReactor = None
# pdfReactor = PDFreactor("http://localhost:9423/service/rest")

Usage

Search SafariBooks for what you want to download. You have two options:
- Download a specific course like https://www.safaribooksonline.com/library/view/numpy-cookbook/9781849518925/.
  - The application will automatically detect whether it is a book or a video tutorial and proceed accordingly.
- Download ALL courses from a given topic like https://www.safaribooksonline.com/topics/python
List any combination of topics and courses in any order in the courses array.
- Useful Notes for topics with many courses:
  - If you just list one topic, you can further refine the page it starts downloading from using the topicNum variable.
  - By default the program will NOT overwrite courses downloaded previously. You can switch this with the overwrite variable.

courses = [
    "https://www.safaribooksonline.com/library/view/python-data-structures/9781786467355/",
    "https://www.safaribooksonline.com/topics/java"
]

overwrite = False
topicNum = 0

Run the program: python safariscrap.py
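
Since the courses array mixes topic pages and individual courses, each entry has to be told apart before it is processed. The sketch below does this from the URL path alone, based on the two URL shapes shown above; the function name classify is hypothetical, and safariscrap.py may distinguish them in another way.

```python
from urllib.parse import urlparse


def classify(url):
    """Return 'topic', 'course' or 'unknown' based on the URL path."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if parts[:1] == ["topics"]:
        return "topic"
    if parts[:2] == ["library", "view"]:
        return "course"
    return "unknown"


courses = [
    "https://www.safaribooksonline.com/library/view/python-data-structures/9781786467355/",
    "https://www.safaribooksonline.com/topics/java",
]
kinds = [classify(url) for url in courses]
```

Note that this only separates topics from courses. Whether a course is a book or a video tutorial is detected by the application itself and cannot be read from the URL.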

Drumeo

Use drumeoscrap.py to download songs, play-alongs and video lessons from Drumeo.

Very similar in requirements and functionality to Safari Books above, with two important additions:

Reads and writes session cookies to disk (in JSON format), avoiding unnecessary logins.
In addition to whole video files, it can download *.ts video segments, assemble them and convert the result to mpg (using FFmpeg).
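
The cookie persistence can be sketched with the standard json module. The helper names and the sample cookie are hypothetical. With Selenium, the list would typically come from driver.get_cookies() and each entry would be restored with driver.add_cookie() after opening the site; how drumeoscrap.py does it exactly is not shown here.

```python
import json
import tempfile
from pathlib import Path


def save_cookies(cookies, path):
    """Write a list of cookie dictionaries to disk as JSON."""
    Path(path).write_text(json.dumps(cookies, indent=2), encoding="utf-8")


def load_cookies(path):
    """Return the saved cookies, or an empty list if none were saved yet."""
    path = Path(path)
    if not path.exists():
        return []  # no saved session: a fresh login is needed
    return json.loads(path.read_text(encoding="utf-8"))


# Demo with a temporary file and a made-up cookie.
with tempfile.TemporaryDirectory() as tmp:
    cookie_file = Path(tmp) / "cookies.json"
    missing = load_cookies(cookie_file)
    save_cookies([{"name": "session", "value": "abc123"}], cookie_file)
    restored = load_cookies(cookie_file)
```

An empty result signals that the program should log in with the credentials file, while a non-empty one lets it skip the login.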

General Requirements

In all cases you need to create a credentials file named pricescrap.json, duolingoscrap.json, etc., with the following structure:

{
  "username" : "<your_username>",
  "password" : "<your_password>"
}
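
A file with this structure can be read in a few lines. The sketch below parses the JSON text and checks that both keys are present; the function name parse_credentials is hypothetical and not taken from the scripts.

```python
import json


def parse_credentials(text):
    """Parse the credentials JSON and return (username, password)."""
    data = json.loads(text)
    missing = {"username", "password"} - data.keys()
    if missing:
        raise ValueError(f"missing keys in credentials file: {sorted(missing)}")
    return data["username"], data["password"]


sample = '{"username": "<your_username>", "password": "<your_password>"}'
username, password = parse_credentials(sample)
```

In practice the text would be read from the matching file, for example with open("pricescrap.json", encoding="utf-8").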

