# Souscheffin quickstart

Follow the instructions in [README.md](./README.md) for basic setup and installing the prerequisites.

To run this tutorial interactively, run `pip install jupyter` to install Jupyter Notebook, then start it with `jupyter notebook`. Jupyter lets you do exploratory Python analysis right from your browser:

In [1]:
x = 2
print(x+2)

4


## Getting started

Here are some notes and sample code to help you get started.

In [2]:
# Define some sample URLs to use in examples below
SAMPLE_PDF_URL = 'https://media.readthedocs.org/pdf/kolibri/develop/kolibri.pdf'
SAMPLE_HTML_URL = 'https://learningequality.org/'
SAMPLE_IMG_URL = SAMPLE_HTML_URL + 'static/img/kickstarter/Kolibri-tablet.png'
SAMPLE_JS_URL = SAMPLE_HTML_URL + 'static/js/less.js'

### Downloader

The script `utils/downloader.py` has a `read` function that can read from both
URLs and file paths. To use:

In [3]:
from utils.downloader import read

# read data from web
pdf_contents = read(SAMPLE_PDF_URL)
with open('./sampledoc.pdf', 'wb') as localfile:
    localfile.write(pdf_contents)

# read data from local file
local_file_content = read('./sampledoc.pdf')
assert pdf_contents == local_file_content, 'content differs!'

# read html web page contents
html_content = read(SAMPLE_HTML_URL)

# Load js before getting contents
html_dom_after_js_has_run = read(SAMPLE_HTML_URL, loadjs=True)



The `loadjs` option will run the JavaScript code on the webpage before reading
the contents of the page, which can be useful for scraping certain websites that
depend on JavaScript to build the page DOM tree.

If you need to use a custom session, you can also use the `session` option. This can
be useful for sites that require login information.
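
To make the URL-versus-file-path behaviour concrete, here is a minimal sketch of how such a dispatch could look using only the standard library. The function name `read_bytes` is hypothetical, and this is not the code in `utils/downloader.py`; it also ignores the `loadjs` and `session` options.

```python
import os
import tempfile
from urllib.parse import urlparse
from urllib.request import urlopen


def read_bytes(path):
    """Hypothetical helper: return the bytes found at a URL or a local file path."""
    scheme = urlparse(path).scheme
    if scheme in ("http", "https"):
        # Remote resource: fetch it over the network
        with urlopen(path) as response:
            return response.read()
    # Anything else is treated as a local file path
    with open(path, "rb") as localfile:
        return localfile.read()


# Demonstrate the local-file branch with a temporary file (no network needed)
with tempfile.TemporaryDirectory() as tmpdir:
    sample_path = os.path.join(tmpdir, "sample.txt")
    with open(sample_path, "wb") as f:
        f.write(b"hello")
    local_bytes = read_bytes(sample_path)

print(local_bytes)
```

Passing an `http://` or `https://` URL to the same function would take the network branch instead.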

For more examples, see [`examples/openstax_souschef.py`](./examples/openstax_souschef.py) (JSON) and [`examples/wikipedia_souschef.py`](./examples/wikipedia_souschef.py) (HTML).

## Using the DataWriter

The DataWriter (`utils.data_writer.DataWriter`) is a tool for creating channel
`.zip` files in a standardized format. This includes creating folders, files,
and `CSV` metadata files that will be used to create the channel on Kolibri Studio.



### Step 1: Open a DataWriter

The `DataWriter` class is meant to be used as a context manager.
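
A context manager runs its cleanup code when the `with` block exits, whether or not an error occurred inside it. The sketch below illustrates the pattern with a hypothetical `TinyZipWriter` class built on the standard-library `zipfile` module. It is a stand-in written for this tutorial and makes no claim about how `DataWriter` is implemented.

```python
import os
import tempfile
import zipfile


class TinyZipWriter:
    """Hypothetical stand-in for a writer class used as a context manager."""

    def __init__(self, write_to_path):
        self.write_to_path = write_to_path
        self.zf = None

    def __enter__(self):
        # Called at the start of the `with` block
        self.zf = zipfile.ZipFile(self.write_to_path, "w")
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Called when the block exits, whether or not an exception occurred
        self.zf.close()
        return False  # do not swallow exceptions

    def add_text(self, path, text):
        # Write a text entry at the given path inside the zip
        self.zf.writestr(path, text)


with tempfile.TemporaryDirectory() as tmpdir:
    zip_path = os.path.join(tmpdir, "demo.zip")
    with TinyZipWriter(zip_path) as writer:
        writer.add_text("Channel/readme.txt", "hello")
    # The archive has been closed by __exit__ and can be read back
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()

print(names)
```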


### Step 2: Create a Channel

Next, you will need to add a channel using `add_channel`. Channels need the following arguments:
  - `title` (str): Name of channel
  - `source_id` (str): Channel's unique id
  - `domain` (str): Who is providing the content
  - `language` (str): Language of channel
  - `description` (str): Description of the channel (optional)
  - `thumbnail` (str): Path in zipfile to find thumbnail (optional)


### Step 3: Add a Folder

In order to add subdirectories, you will need to use the `add_folder` method
from the DataWriter class. The method `add_folder` accepts the following arguments:
  - `path` (str): Path in zip file to find folder
  - `title` (str): Content's title
  - `source_id` (str): Content's original ID (optional)
  - `language` (str): Language of content (optional)
  - `description` (str): Description of the content (optional)
  - `thumbnail` (str): Path in zipfile to find thumbnail (optional)


### Step 4: Add a File

Finally, you will need to add files to the channel as learning resources.
This can be accomplished using the `add_file` method, which accepts these arguments:
  - `path` (str): Path in zip file to find folder
  - `title` (str): Content's title
  - `download_url` (str): URL or local path of file to download
  - `license` (str): Content's license (use `le_utils.constants.licenses`)
  - `license_description` (str): Description for content's license
  - `copyright_holder` (str): Who owns the license to this content?
  - `source_id` (str): Content's original ID (optional)
  - `description` (str): Description of the content (optional)
  - `author` (str): Author of content
  - `language` (str): Language of content (optional)
  - `thumbnail` (str): Path in zipfile to find thumbnail (optional)
  - `write_data` (boolean): Indicate whether to make a node (optional)


Putting this all together:

In [4]:
from utils.data_writer import DataWriter
from le_utils.constants import licenses

CHANNEL_NAME = "Quickstart Channel"
CHANNEL_SOURCE_ID = "quickstart-demo-channel"
CHANNEL_DOMAIN = "sampledomain.org"
CHANNEL_LANGUAGE = "en"
CHANNEL_DESCRIPTION = "What is this channel about?"

# Step 1
with DataWriter() as writer:
    # Step 2
    writer.add_channel(CHANNEL_NAME, CHANNEL_SOURCE_ID, CHANNEL_DOMAIN, CHANNEL_LANGUAGE, description=CHANNEL_DESCRIPTION)

    # Step 3
    TOPIC_NAME = "topic"
    writer.add_folder(CHANNEL_NAME + '/' + TOPIC_NAME, TOPIC_NAME)

    # Step 4
    PATH = CHANNEL_NAME + "/" + TOPIC_NAME + "/filename.pdf"
    writer.add_file(PATH, "Example PDF", "./sampledoc.pdf", license=licenses.CC_BY, copyright_holder="Somebody")


## Extra Tools

### PathBuilder

The `PathBuilder` class is a tool for tracking folder and file paths to write to
the zip file. To initialize a PathBuilder object, you need to specify a channel name:

In [5]:
from utils.path_builder import PathBuilder

CHANNEL_NAME = "Channel"
PATH = PathBuilder(channel_name=CHANNEL_NAME)

You can now build this path using `open_folder`, which will append another item to the path:

```
...
PATH.open_folder('Topic')         # str(PATH): 'Channel/Topic'
```

You can also set a path from the root directory:
```
...
PATH.open_folder('Topic')         # str(PATH): 'Channel/Topic'
PATH.set('Topic 2', 'Topic 3')    # str(PATH): 'Channel/Topic 2/Topic 3'
```


To go back one step in the path:
```
...
PATH.set('Topic 1', 'Topic 2')    # str(PATH): 'Channel/Topic 1/Topic 2'
PATH.go_to_parent_folder()        # str(PATH): 'Channel/Topic 1'
PATH.go_to_parent_folder()        # str(PATH): 'Channel'
PATH.go_to_parent_folder()        # str(PATH): 'Channel' (Can't go past root level)
```

To clear the path:
```
...
PATH.set('Topic 1', 'Topic 2')    # str(PATH): 'Channel/Topic 1/Topic 2'
PATH.reset()                      # str(PATH): 'Channel'
```
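
The behaviour described above amounts to a list of path segments whose first element is always the channel name. The following sketch reproduces that behaviour with a hypothetical `SimplePathBuilder` class. It is an illustration of the semantics only, not the code in `utils/path_builder.py`.

```python
class SimplePathBuilder:
    """Hypothetical sketch of a path tracker; not the real PathBuilder."""

    def __init__(self, channel_name):
        self.channel_name = channel_name
        self.path = [channel_name]  # the root is always the channel name

    def __str__(self):
        return "/".join(self.path)

    def open_folder(self, name):
        # Append one more segment to the current path
        self.path.append(name)

    def set(self, *folders):
        # Replace the current path with one that starts again from the root
        self.path = [self.channel_name, *folders]

    def go_to_parent_folder(self):
        # Drop the last segment, but never go past the root
        if len(self.path) > 1:
            self.path.pop()

    def reset(self):
        self.path = [self.channel_name]


path = SimplePathBuilder("Channel")
path.open_folder("Topic")
after_open = str(path)       # 'Channel/Topic'
path.set("Topic 1", "Topic 2")
after_set = str(path)        # 'Channel/Topic 1/Topic 2'
for _ in range(3):
    path.go_to_parent_folder()
after_parent = str(path)     # 'Channel'
```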

### HTMLWriter

The HTMLWriter is a tool for generating zip files to be uploaded to Kolibri Studio.

First, open an HTMLWriter context. To write the main file, use the `write_index_contents` method.

You can also add other files (images, stylesheets, etc.) using `write_file`, `write_contents` and `write_url`:

In [6]:
from utils.html import HTMLWriter

with HTMLWriter('./sampleHTML5ZipFile.zip') as zipper:
    
    # Example 1: add the main file (index.html)
    contents = "<html><head></head><body>Hello, World!</body></html>"
    zipper.write_index_contents(contents)
    
    # Example 2: add a stylesheet and a second html file
    css_path = zipper.write_contents("style.css", "body{padding:30px}", directory="styles")
    # css_path path is "styles/style.css"
    extra_head = "<link href='{}' rel='stylesheet'></link>".format(css_path)
    print('extra_head=', extra_head)
    contents2 = "<html><head>{extra_head}</head><body>Another page.</body></html>".format(extra_head=extra_head)
    zipper.write_contents("page2.html", contents2)

    # Example 3: Adding a local file
    img_data = read(SAMPLE_IMG_URL)
    with open('./sampleimg.png', 'wb') as localfile:
        localfile.write(img_data)
    img_path = zipper.write_file('./sampleimg.png', directory="img")
    img_tag = "<img src='{}'>".format(img_path)
    print('img_tag=', img_tag)  # Can be inserted as image

    # Example 4: Adding a file directly from the web
    script_path = zipper.write_url(SAMPLE_JS_URL, 'less.js', directory="src")
    script_tag = "<script src='{}' type='text/javascript'></script>".format(script_path)
    print('script_tag=', script_tag)  # Can be inserted into html

extra_head= <link href='styles/style.css' rel='stylesheet'></link>
img_tag= <img src='img/sampleimg.png'>
script_tag= <script src='src/less.js' type='text/javascript'></script>
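
To check what a zip file like this contains, the standard-library `zipfile` module can list and read its entries. The sketch below first builds a small stand-in archive with `zipfile` directly (an `index.html` plus `styles/style.css`, mirroring the paths shown above) so that it runs without the `utils` package. To inspect the real output, you could point `zipfile.ZipFile` at `./sampleHTML5ZipFile.zip` instead.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmpdir:
    zip_path = os.path.join(tmpdir, "demo_html5.zip")

    # Build a stand-in archive (assumption: same layout as the example above)
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("index.html", "<html><head></head><body>Hello, World!</body></html>")
        zf.writestr("styles/style.css", "body{padding:30px}")

    # Inspect the archive: list its entries and read the main page back
    with zipfile.ZipFile(zip_path) as zf:
        entries = sorted(zf.namelist())
        index_html = zf.read("index.html").decode("utf-8")

print(entries)
print(index_html)
```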



### HTML parsing using BeautifulSoup

BeautifulSoup is an HTML parsing library that lets you select various DOM elements
and extract their attributes and text contents. The sample code in this section gets
the text of the LE mission statement.

The most commonly used parts of the BeautifulSoup API are:
  - `.find(tag_name, <spec>)`: finds the next occurrence of the tag `tag_name` that
     has the attributes specified in `<spec>` (given as a dictionary). You can also use the
     shortcut options `id` and `class_` (note the extra underscore).
  - `.find_all(tag_name, <spec>)`: same as above, but returns a list of all matching
     elements. Use the optional keyword argument `recursive=False` to select only
     immediate child nodes (instead of including children of children, etc.).
  - `.next_sibling`: returns the next node at the same level of the tree (for badly formatted pages with no useful selectors).
  - `.get_text()`: extracts the text contents of the node. See also the helper method
    called `get_text`, which performs additional cleanup of newlines and spaces.
  - `.extract()`: removes an element from the DOM tree (useful for removing labels and other extra content).

For more info about BeautifulSoup, see [the docs](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).
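
The calls listed above can be tried offline on a small HTML string. The snippet below is a sketch on made-up markup (the tag names, classes, and text are invented for illustration), showing `find`, `next_sibling`, `find_all` with `recursive=False`, and `extract`.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration
html = (
    '<div id="content">'
    '<h3 class="title">Lesson 1</h3><p>Intro text.</p>'
    '<span class="label">draft</span>'
    '<ul><li>First</li><li>Second<ul><li>Nested</li></ul></li></ul>'
    '</div>'
)
page = BeautifulSoup(html, "html.parser")

# find: select by attribute dictionary or by the class_ shortcut
content = page.find("div", {"id": "content"})
title = content.find("h3", class_="title")
title_text = title.get_text()

# next_sibling: the node right after the <h3>, here the <p>.
# Whitespace between tags also counts as a sibling node in real pages.
intro_text = title.next_sibling.get_text()

# find_all: direct children only vs. all descendants
top_list = content.find("ul")
direct_items = top_list.find_all("li", recursive=False)
all_items = top_list.find_all("li")

# extract: remove the label from the tree
content.find("span", class_="label").extract()
label_after = content.find("span")

print(title_text, "|", intro_text, "|", len(direct_items), len(all_items), label_after)
```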





In [7]:
from bs4 import BeautifulSoup
from utils.downloader import read

html = read(SAMPLE_HTML_URL)
page = BeautifulSoup(html, 'html.parser')

main_div = page.find('div', {'id': 'body-content'})
mission_el = main_div.find('h3', class_='our-mission')
mission = mission_el.get_text().strip()
print(mission)



Learning Equality is committed to enabling every person in the world to realize their right to a quality education, by supporting the creation, adaptation and distribution of open educational resources, and creating supportive tools for innovative pedagogy.



To print a list of all the links on the page, use the following code:

In [8]:
links = page.find_all('a')
for link in links:
    print(link.get_text().strip(), '-->', link['href'])

 --> /
About --> #
Learning Equality --> /about/
Core Values --> /about/values/
Team --> /about/team/
Board of Directors --> /about/board/
Supporters and Friends --> /about/supporters/
Translators --> /about/translators/
Press --> /about/press/
Jobs --> /about/jobs/
Internships --> /about/internships/
Kolibri --> /kolibri/
KA Lite --> /ka-lite/
Blog --> /blog/
Forum --> https://community.learningequality.org/
 --> /indiegogo/
LEARN MORE --> /kolibri/
 --> /ka-lite/
LEARN MORE --> /ka-lite/
SEE THE MAP --> https://learningequality.org/ka-lite/map/
SHARE YOUR STORY --> https://learningequality.org/ka-lite/map/add/
READ MORE --> https://learningequality.org/media/FUNSEPA_Final_Evaluation_Report_27May2016.pdf
READ MORE --> media/Rapport-Etude-Cameroun_KL_ENG.pdf
READ MORE --> https://www.povertyactionlab.org/sites/default/files/publications/Hirshleifer_Incentives%20for%20Output_Jan2016.pdf
READ MORE --> http://www.vodafone.com/content/dam/vodafone/connected-education/vodafone_connected_edu