# Nantes Hackathon X5-GON preparatory questions

Link to the hackathon website with all the useful resources and more: 
https://www.x5gon.org/event/ai-hackathon/

## Cold start


### Download the catalogue

In [1]:
! rm x5gon_catelogue.tsv*
! mkdir datasets
! wget https://gitlab.univ-nantes.fr/x5gon/x5gon-hackathon-datasets/raw/master/datasets/x5gon_catelogue.tsv
! mv x5gon_catelogue.tsv datasets/catalogue.tsv

rm: cannot remove 'x5gon_catelogue.tsv*': No such file or directory
mkdir: cannot create directory ‘datasets’: File exists
--2020-02-24 14:49:29--  https://gitlab.univ-nantes.fr/x5gon/x5gon-hackathon-datasets/raw/master/datasets/x5gon_catelogue.tsv
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving gitlab.univ-nantes.fr (gitlab.univ-nantes.fr)... 193.52.101.66
Connecting to gitlab.univ-nantes.fr (gitlab.univ-nantes.fr)|193.52.101.66|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 63850145 (61M) [text/plain]
Saving to: ‘x5gon_catelogue.tsv’


2020-02-24 14:50:37 (929 KB/s) - ‘x5gon_catelogue.tsv’ saved [63850145/63850145]



In [2]:
! head -10 datasets/catalogue.tsv

id	title	type	language	keywords	concepts
59260	C7 - Computing with Space	en	pdf	[space, omicini, c7, omicini disi, disi univ, disi, univ bologna, andrea omicini, computing, andrea, univ, bologna, agents, spatial, agent, coordination, tuple, space space, mas, computer science]	['http://en.wikipedia.org/wiki/Bologna', 'http://en.wikipedia.org/wiki/Tuple', 'http://en.wikipedia.org/wiki/Computer_science', 'http://en.wikipedia.org/wiki/Democratic_Alignment_(2015)', 'http://en.wikipedia.org/wiki/Middleware', 'http://en.wikipedia.org/wiki/Distributed_computing', 'http://en.wikipedia.org/wiki/Geometry', 'http://en.wikipedia.org/wiki/Logic', 'http://en.wikipedia.org/wiki/Computer', 'http://en.wikipedia.org/wiki/Mathematics']
3904	Electromagnetic Fields, Forces, and Motion	en	pdf	[forces motion, fields forces, electromagnetic fields, zahn page, markus zahn, prof markus, zahn, bs, forces, markus, fields, lecture prof, electromagnetic, dl, dt, sin, prof, tan, motion, dt dt]	['http://en.wikipedia.o

### Some requirements

In [6]:
! pip install plotly --upgrade
! pip install ipywidgets --upgrade
! pip install cufflinks --upgrade
! pip install pandas --upgrade

Requirement already up-to-date: plotly in /home/mvidaldepalo/.local/lib/python3.8/site-packages (4.5.1)
Requirement already up-to-date: ipywidgets in /usr/lib/python3.8/site-packages (7.5.1)
Requirement already up-to-date: cufflinks in /home/mvidaldepalo/.local/lib/python3.8/site-packages (0.17.0)
Requirement already up-to-date: pandas in /home/mvidaldepalo/.local/lib/python3.8/site-packages (1.0.1)


In [7]:
! jupyter nbextension enable --py widgetsnbextension
! jupyter nbextension enable --py plotlywidget
! jupyter nbextension list

Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: [32mOK[0m
Enabling notebook extension plotlywidget/extension...
      - Validating: [32mOK[0m
Known nbextensions:
  config dir: /home/mvidaldepalo/.jupyter/nbconfig
    notebook section
      jupyter-js-widgets/extension [32m enabled [0m
      - Validating: [32mOK[0m
      plotlywidget/extension [32m enabled [0m
      - Validating: [32mOK[0m
  config dir: /etc/jupyter/nbconfig
    notebook section
      jupyter-matplotlib/extension [32m enabled [0m
      - Validating: [32mOK[0m
      jupyter-js-widgets/extension [32m enabled [0m
      - Validating: [32mOK[0m


Some useful imports for the following

In [12]:
from collections import Counter

import pandas as pd  # for easy and effective catalogue manipulation
import numpy as np  # for mathematic stuff

from ipywidgets import widgets  # for easy implement UI directly in notebook

import plotly.graph_objs as go
import plotly.io as pio
from plotly.offline import init_notebook_mode, iplot  # for beautifull plot

pio.renderers.default = "colab"

import cufflinks as cf  # to directly bind pandas and plotly

import requests  # for dealing with API
import json  # to deal with json inputs/outputs
import pprint  # for more friendly console formatting
import operator  # often faster than lambda expression

import sklearn.metrics.pairwise as skdist
import statistics

cf.go_offline()  # set plotly to offline mode

ModuleNotFoundError: No module named 'sklearn'

In [None]:
! pip list | grep cufflinks
! pip list | grep plotly

### Plotly notebook configuration

Cell configuration

This method pre-populates the outputframe with the configuration that Plotly expects and must be executed for every cell which is displaying a Plotly graph.

In [None]:
def enable_plotly_in_cell():
    import IPython

    display(
        IPython.core.display.HTML(
            """
        <script src="/static/components/requirejs/require.js"></script>
  """
        )
    )
    init_notebook_mode(connected=False)

Plotly Pre-execute Hook

If you wish to automatically load the required resources within each cell, you can add the enable_plotly_in_cell function to a Jupyter pre-execute hook and it will be automaticaly executed before any cell execution:

In [None]:
# get_ipython().events.register("pre_run_cell", enable_plotly_in_cell)

Because this pre-run hook causes additional javascript resources to be loaded in each cell output, we will disable it here:

In [None]:
get_ipython().events.unregister('pre_run_cell', enable_plotly_in_cell)

## About the catalogue
The catalogue is the entry point to the X5GON Content data. You should be able to access it easily. Compute:


1. How many different OER in total are there?
2. How many videos in French language are there?
3. Which are the most popular keywords?
4. Which are the most popular concepts?



Hints:

For 1c and 1d you will need to define “most popular” and also decide how many OER you return.

For a more complete overview of the pandas library visit the documentation at: https://pandas.pydata.org/index.html

In [None]:
list_parser = lambda x: x[1:-1].split(",")
catalogue = pd.read_csv(
    "datasets/catalogue.tsv",
    sep="\t",
    converters={"keywords": list_parser, "concepts": list_parser},
)
# This is added in case initial dataset hasn't the right columns names:
catalogue.columns = ["id", "title", "language", "type", "keywords", "concepts"]
catalogue.set_index("id", inplace=True)

In [None]:
catalogue.head(20)

In [None]:
enable_plotly_in_cell()
catalogue["language"].value_counts().iplot(
    kind="bar", yTitle="Number of resources", title="Type of resources in db"
)

In [None]:
catalogue["language"].value_counts()

How many different OER in total are there?

In [None]:
print(f"There are {catalogue.shape[0]} oers")

How many videos in French language are there?

In [None]:
videos_extensions = ["avi", "divx", "m4v", "mpeg", "rm", "mp4", "mov", "mpg"]
corresponding_resources = catalogue[
    (catalogue["language"] == "fr") & (catalogue["type"].isin(videos_extensions))
]
print(f"There are {corresponding_resources.shape[0]} which fit to these criteria")

Which are the most popular keywords?

The amount of data is rapidly too huge to be treated in notebook and with a personnal computer we will process by random sampling.

In [None]:
sample_size = 10000  # size of sample set
random_state = 42  # for repeteable result
keywords_popularity = Counter(
    sum(
        catalogue["keywords"]
        .sample(n=sample_size, random_state=random_state)
        .values.flat,
        [],
    )
)

In [None]:
enable_plotly_in_cell()
topNkeywords, topNkcount = zip(*keywords_popularity.most_common(100))
fig = go.Figure(
    [go.Bar(x=topNkeywords, y=topNkcount)],
    layout=go.Layout(
        title="Most popular keywords", yaxis=dict(title="Keywords occurence")
    ),
)
fig

What about the concepts?

In [None]:
sample_size = 1000  # size of sample set
random_state = 42  # for repetable result
concepts_popularity = Counter(
    sum(
        catalogue["concepts"]
        .sample(n=sample_size, random_state=random_state)
        .values.flat,
        [],
    )
)

In [None]:
enable_plotly_in_cell()
topNconcepts, topNccount = zip(*concepts_popularity.most_common(100))
fig = go.Figure(
    [go.Bar(x=topNconcepts, y=topNccount)],
    layout=go.Layout(
        title="Most popular concepts", yaxis=dict(title="Concept occurence")
    ),
)
fig

## Using the catalogue and the X5GON API

**API Documentation**: https://platform.x5gon.org/products/feed#api

The X5GON API allows to get hold of the content and the metadata for a given OER, provided you have its material_id. To find this material_id, you need the catalogue.

5. Find the contents of OER # 39642
6. Find the metadata of OER # 39642
7. Give 10 OERs for concept “Randomness” and for each its License
8. Give the 10 most recent OERs which have Random as keyword

Hints:

For question 6 you can find the production date in the metadata in the JSI API. When you don’t know where to look, try the metadata!


In [None]:
# The X5GON API is available at:
PLATFORM_URL = "https://platform.x5gon.org/api/v1"

5. Find the contents of OER # 39642

In [None]:
# initialise the endpoint
get_specific_materials_endpoint = "/oer_materials/{}/contents/"

# get the material id of the first material returned from previous example
test_material_id = 39642

# query for meta-information about this material, Note that there are no
# parameters for this endpoint
response = requests.get(
    PLATFORM_URL + get_specific_materials_endpoint.format(test_material_id)
)

# convert the json response to a Python dictionary object for further processing
r_json = response.json()
pprint.pprint(r_json["oer_contents"][0]["value"]["value"])

6. Find the metadata of OER # 48331

In [None]:
# initialise the endpoint
get_specific_materials_endpoint = "/oer_materials/{}"

# get the material id of the first material returned from previous example
test_material_id = 48331

# query for meta-information about this material, Note that there are no
# parameters for this endpoint
response = requests.get(
    PLATFORM_URL + get_specific_materials_endpoint.format(test_material_id)
)

# convert the json response to a Python dictionary object for further processing
r_json = response.json()
print(r_json["oer_materials"]["metadata"])

7. Give 10 OERs for concept “Randomness” and for each its License

In [None]:
corresponding_resources = catalogue[
    [any("randomness" in curl.lower() for curl in v) for v in catalogue["concepts"]]
]
print(corresponding_resources.shape)

In [None]:
corresponding_resources.head(1000)

In [None]:
corresponding_resources = corresponding_resources.head(10)
# initialise the endpoint
endpoint = "/oer_materials/{}"

providers = []
licenses = []
for material_id in corresponding_resources.index:
    response = requests.get(PLATFORM_URL + endpoint.format(material_id))
    try:
        r_json = response.json()
        providers.append(r_json["oer_materials"]["provider"]["provider_name"])
        licenses.append(r_json["oer_materials"]["license"])
    except json.JSONDecodeError:
        print(f"Material_id {material_id} error during the request {response}")
        licenses.append(None)
        providers.append(None)

corresponding_resources["license"] = pd.Series(
    licenses, index=corresponding_resources.index
)
corresponding_resources["provider"] = pd.Series(
    providers, index=corresponding_resources.index
)

In [None]:
corresponding_resources.head(100)

Like you can observe the field license is not always present.
Indeed, sometimes deal with real data means to deal with problems, be carefull with
None value for the future ;)

8. Give the 10 most recent OERs which have Random as keyword

In [None]:
corresponding_resources = catalogue[
    [any("random" == curl.lower() for curl in v) for v in catalogue["keywords"]]
]
# initialise the endpoint
endpoint = "/oer_materials/{}"
print(corresponding_resources.shape)

creation_date = []
providers = []
metadatas = []
for material_id in corresponding_resources.index:
    response = requests.get(PLATFORM_URL + endpoint.format(material_id))
    try:
        r_json = response.json()
        creation_date.append(r_json["oer_materials"]["creation_date"])
        providers.append(r_json["oer_materials"]["provider"]["provider_name"])
        metadatas.append(r_json["oer_materials"]["metadata"])
    except json.JSONDecodeError:
        print(
            f"Material_id {material_id} error during the request {response} content:{response.text}"
        )
        # Set to None creation_date for failling resources
        creation_date.append(None)
        providers.append(None)
        metadatas.append(None)

corresponding_resources["creation_date"] = pd.Series(
    creation_date, index=corresponding_resources.index
)
corresponding_resources["provider"] = pd.Series(
    providers, index=corresponding_resources.index
)
corresponding_resources["meta_data"] = pd.Series(
    metadatas, index=corresponding_resources.index
)
corresponding_resources.sort_values(by=["creation_date"], ascending=False)

## Using the catalogue and the LAM API

The Nantes API allows to get hold of the models. There are many models. We don’t introduce each of these in the exercises below.

9. What are the most important terms for OER # 44900? Does your finding correspond to what is given in the catalogue?
10. What are the most important concept for OER # 44900? Does your finding correspond to what is given in the catalogue?
11. Compute the distance between OER  # 44900 and OER  # 1234567. What distance have you used? What distances could you use? What representation by vectors have you chosen? Explain your choices?
12. Find the OER which concerns concept Machine Learning and is closest to OER #1234567.
13. Consider the 100 first OER which have Machine learning as a theme and build an average TF-IDF vector for this class. Then find the OER which is closest to this ideal OER
14. Consider the 100 first OER which have Machine learning as a theme and find the class centroid, ie the OER in the class for which $max_{x∈X} d(c,x)$ is minimal. Do you obtain the same OER as in question 3.5?
15. Given the set of the 100 first OER which have Machine learning as a theme, which is the simplest? Which is the most complex? Check the answers. Are you convinced

Hints.
9 Notice that the catalogue gives us 


9. What are the most important terms for OER # 44900? Does your finding correspond to what is given in the catalogue?


In [None]:
# The X5GON API is available at:
PLATFORM_LAM_URL = "http://wp3.x5gon.org/"
HEADERS = {
    "accept": "application/json",
    "Content-Type": "application/json",
}
rid = 44900

In [None]:
endpoint = "/distance/text2tfidf/fetch"
data = {"resource_ids": [rid], "tfidf_type": "SIMPLE"}
response = requests.post(
    PLATFORM_LAM_URL + endpoint, headers=HEADERS, data=json.dumps(data)
)
tfidf = response.json()
pprint.pprint(tfidf)

In [None]:
top1api = sorted(
    tfidf["output"][0]["value"]["value"].items(), key=operator.itemgetter(1)
)[-1]
print(
    f"Regarding the API: The more important terms of resource {rid} is '{top1api[0]}' with a tfidf of {top1api[1]}"
)
top1catalogue = catalogue["keywords"].loc[rid]
print(
    f"Regarding the catalogue: The more important terms of resource {rid} is '{top1catalogue[0]}'"
)

It is a match !!! Pratically, the keywords in the catalogue are the tfidf top ranked [1-2]-grams for the given resources. Nevertheless, the catalogue should be considered as an entry point for the api. The latest and more efficient representation would always be avalaible through the API. 


10. What are the most important concept for OER # 44900? Does your finding correspond to what is given in the catalogue?


In [None]:
endpoint = "/distance/wikifier/fetch"
data = {
    "resource_ids": [rid],
    "wikification_type": "SIMPLE",  # shouldn't be required fix in the API!!!
}
response = requests.post(
    PLATFORM_LAM_URL + endpoint, headers=HEADERS, data=json.dumps(data)
)
tfidf = response.json()
pprint.pprint(tfidf)

In [None]:
top1api = sorted(
    tfidf["output"][0]["value"]["concepts"], key=operator.itemgetter("pageRank")
)[-1]
print(
    f"Regarding the API: The more important terms of resource {rid} is '{top1api['url']}' with a pageRank of {top1api['pageRank']}"
)
top1catalogue = catalogue["concepts"].loc[rid]
print(
    f"Regarding the catalogue: The more important terms of resource {rid} is '{top1catalogue[0]}'"
)

11. Compute the distance between OER  # 44900 and OER  # 3098. What distance have you used? What distances could you use? What representation by vectors have you chosen? Explain your choices?

In [None]:
# Here we are computing the distance between the 2 resources basing on their corresponding doc2vec representations (we could use tfidf/wikifier)
rids = [44900, 3098]
endpoint = "/distance/doc2vec/fetch/"
data = {"resource_ids": rids}
response = requests.post(
    PLATFORM_LAM_URL + endpoint, headers=HEADERS, data=json.dumps(data)
)
rids_vectors = response.json()

#  from scipy.spatial import distance as scipydist
# distance = scipydist.cosine(rids_vectors['output'][0]['value'], rids_vectors['output'][1]['value'])
#  print(f"distance using scipy: {distance}")
distance = skdist.pairwise_distances(
    X=[rids_vectors["output"][0]["value"], rids_vectors["output"][1]["value"]],
    metric="cosine",
    n_jobs=-1,
)[0, 1]
print(f"distance using sklearn pairwise: {distance}")

**12**. Find the OER which concerns concept Machine Learning and is closest to OER #29302.

In [None]:
# Search in the catalogue the resources having the concept "Machine learning"
principal_oer = 29302
corresponding_resources = catalogue[
    [
        any("machine_learning" in curl.lower() for curl in v)
        for v in catalogue["concepts"]
    ]
]
endpoint = "/distance/doc2vec/fetch/"
data = {"resource_ids": [principal_oer]}
response = requests.post(
    PLATFORM_LAM_URL + endpoint, headers=HEADERS, data=json.dumps(data)
)
principal_oer_vector = response.json()["output"][0]["value"]

# Here to avoid putting a big load on the API: we prefere proceeding by batch (knowing that the API already treating by batch)
i = 0
batch_size = 100
rids = []
rids_vectors = []
rids_vectors_values = []
for res in corresponding_resources.index[:100]:
    rids.append(res)
    i += 1
    if i >= batch_size:
        data["resource_ids"] = rids
        response = requests.post(
            PLATFORM_LAM_URL + endpoint, headers=HEADERS, data=json.dumps(data)
        )
        rids_vectors.extend(response.json()["output"])
        i = 0
        # rids = []
if i < batch_size:
    data["resource_ids"] = rids
    response = requests.post(
        PLATFORM_LAM_URL + endpoint, headers=HEADERS, data=json.dumps(data)
    )

    rids_vectors.extend(response.json()["output"])
# Compute the distance % oer_principal for all the resources having the concept
for rv in rids_vectors:
    rv["distanceToOer"] = skdist.pairwise_distances(
        X=[principal_oer_vector, rv["value"]], metric="cosine", n_jobs=-1
    )[0, 1]
# Sort the resources % the computed distance
top_closest_resources = sorted(rids_vectors, key=lambda i: i["distanceToOer"])[:10]
print(
    catalogue.loc[list(map(operator.itemgetter("resource_id"), top_closest_resources))]
)
# Result
endpoint = "/oer_materials/{}"
oer_pr_metadata = requests.get(PLATFORM_URL + endpoint.format(principal_oer))
closest_oer = next(
    x["resource_id"] for x in top_closest_resources if x["resource_id"] != principal_oer
)
oer_cl_metadata = requests.get(PLATFORM_URL + endpoint.format(closest_oer))
print(f"Principal oer:{principal_oer}")
print(f"Principal oer meta-data:{oer_pr_metadata.json()['oer_materials']}")
print(f"The closest oer:{closest_oer}")
print(f"The closest oer meta-data:{oer_cl_metadata.json()['oer_materials']}")

13. Consider the 100 first OER which have Machine learning as a theme and build an average TF-IDF vector for this class. Then find the OER which is closest to this ideal OER


In [None]:
# Get the fisrt 100 resources having 'Machine learning'
top_closest_resources = sorted(rids_vectors, key=lambda i: i["distanceToOer"])[:100]
top_closest_resources_ids = [x["resource_id"] for x in top_closest_resources]
# Get their TFIDF vectors
endpoint = "/distance/text2tfidf/fetch"
data = {"resource_ids": top_closest_resources_ids, "tfidf_type": "SIMPLE"}
response = requests.post(
    PLATFORM_LAM_URL + endpoint, headers=HEADERS, data=json.dumps(data)
)
closest_tfidf_vectors = response.json()["output"]
# Compute the average TF-IDF vector
tf_idf_average = dict()
for tf in closest_tfidf_vectors:
    for ky, val in tf["value"]["value"].items():
        if ky in tf_idf_average.keys():
            tf_idf_average[ky].append(val)
        else:
            tf_idf_average[ky] = []
            tf_idf_average[ky].append(val)
tf_idf_average = dict((ky, statistics.mean(val)) for ky, val in tf_idf_average.items())
print(tf_idf_average)

In [None]:
# Infere the avarage tfidf vector using the lAM API(tfidf knn)
endpoint = "/distance/text2tfidf/knn/vector"
data = {"vector": tf_idf_average, "n_neighbors": 20}
response = requests.post(
    PLATFORM_LAM_URL + endpoint, headers=HEADERS, data=json.dumps(data)
)

In [None]:
catalogue.loc[response.json()["output"]["neighbors"]]

Some neighbors are not in the catalogue but you can retrieve it using the feed 
api :)

14. Consider the 100 first OER which have Machine learning as a theme and find the class centroid, ie the OER in the class for which $min_{x∈X} \sum d(c,x)$. Do you obtain the same OER as in question 3.5?


In [None]:
# Get The 100 concerned resources
corresponding_resources = catalogue[
    [
        any("machine_learning" in curl.lower() for curl in v)
        for v in catalogue["concepts"]
    ]
]
corresponding_resources = corresponding_resources[:100]

In [None]:
# Get their tfidf vectors
endpoint = "/distance/text2tfidf/fetch"
data = {"resource_ids": corresponding_resources.index.tolist(), "tfidf_type": "SIMPLE"}
response = requests.post(
    PLATFORM_LAM_URL + endpoint, headers=HEADERS, data=json.dumps(data)
)

vectors = [res["value"]["value"] for res in response.json()["output"]]
ind2concepts = list(set().union(*map(lambda x: set(x.keys()), vectors)))
concepts2ind = {v: k for k, v in enumerate(ind2concepts)}


def tfidftovect(x):
    mat = np.zeros((len(x), len(concepts2ind)))
    for i, tfidf in enumerate(x):
        for c, v in tfidf.items():
            mat[i][concepts2ind[c]] = v
    return mat


# Computing the representatif matrix for all vectors % all different keywords
mat = tfidftovect(vectors)

In [None]:
# Computing the inter-distance between the vectors
dist = skdist.pairwise_distances(X=mat, metric="cosine", n_jobs=-1)
# Retrieving the centroid resource id
centroid_id = corresponding_resources.index.tolist()[dist.sum(axis=0).argmin()]

In [None]:
# Retrieving the closest resource to the centroid resource
endpoint = "distance/text2tfidf/knn/res"
data = {"resource_id": centroid_id, "n_neighbors": 20}

response = requests.post(
    PLATFORM_LAM_URL + endpoint, headers=HEADERS, data=json.dumps(data)
)
catalogue.loc[response.json()["output"]["neighbors"]]

15. Given the set of the 100 first OER which have Machine learning as a theme, which is the simplest? Which is the most complex? Check the answers. Are you convinced

In [None]:
# Get difficulty scores for the concerned resources
endpoint = "/difficulty/tfidf2technicity/res"
data = {"resource_ids": corresponding_resources.index.tolist()}

response = requests.post(
    PLATFORM_LAM_URL + endpoint, headers=HEADERS, data=json.dumps(data)
)
catalogue.loc[
    list(
        map(
            operator.itemgetter("resource_id"),
            sorted(response.json()["output"], key=lambda i: i["value"], reverse=True),
        )
    )[:10]
]
# corresponding_resources.sort_values("/difficulty/tfidf2technicity/res", ascending=False).head(10)

## Using both and going further

16. Find a long resource, and for this resource extract its continuous wikifier vector. Then choose the 5 most important concepts and plot their evolution through time.
17. Let us suppose that a learning goal is described by 2 concepts and 3 keywords. Suggest a playlist of OER to be watched
18. Given a set of resources and a time constraint (2 hours), suggest a playlist consistent with the constraint.
19. Given a set of OER, which is the odd one out?

Hints:

17 This is a more complex operation. To find a long OER you will need to use the catalogue to build an iterator over Ids, then count the number of words in each transcription. Once this is done, the continuous wikifier vector can be obtained using the LAM API. You now have to decide what are the 5 most important concepts: one possible answer can be obtained through averaging the scores. Then plotting is a good idea to better visualize the results. 

18 Now things are becoming more difficult. Don’t go for the best solution. Just for one which returns a result. We know how to build a vector (TF-IDF, Word2Vec, Wikifier) for each OER. But can we build one from 2 concepts and 3 keywords?

19 Even harder! We first have to associate a consumption length to each OER. How do we do that? For a video, we could take the duration, but that would not take care of the fact that we may want to check some parts of the video various times. For a pdf, we have to decide how long it takes to read 100 words. Perhaps all users are different? Perhaps we can learn their consumption speed? And we can also see which OERs should come first (or last) through a careful use of the Next-OER service.


16. Find a long resource, and for this resource extract its continuous wikifier vector. Then choose the 5 most important concepts and plot their evolution through time.

In [None]:
enable_plotly_in_cell()
endpoint = "/temporal/continuouswikifier/fetch"
data = {"resource_ids": [29302], "wikification_type": "SIMPLE"}
response = requests.post(
    PLATFORM_LAM_URL + endpoint, headers=HEADERS, data=json.dumps(data)
).json()

keyconcepts = [
    Counter({cdict["title"]: cdict["pageRank"] for cdict in chunk["concepts"]})
    for chunk in response["output"][0]["value"]
]

top20, _ = zip(*sum(keyconcepts, Counter()).most_common(10))
x = [f"chunks{i}" for i in range(len(keyconcepts))]
y = {t: [count[t] for count in keyconcepts] for t in top20}
data = [go.Bar(x=x, y=y[cname], name=cname) for cname in top20]
layout = go.Layout(
    barmode="group",
    title=catalogue.loc[29302]["title"],
    xaxis=dict(title="Concepts", automargin=True),
)
fig = go.Figure(data=data, layout=layout)
fig.update_layout(barmode="stack")
iplot(fig)

## Now it's time to try by yourself !!!

On your own ideas, or in the use cases below


17. Let us suppose that a learning goal is described by 2 concepts and 3 keywords. Suggest a playlist of OER to be watched


In [None]:
learning_goal_concepts = [
    "https://en.wikipedia.org/wiki/Open_educational_resources",
    "https://en.wikipedia.org/wiki/Machine_learning",
]
learning_goal_keywords = ["Education", "Open", "Network"]

18. Given a set of resources and a time constraint (2 hours), suggest a playlist consistent with the constraint.

hint:
https://en.wikipedia.org/wiki/Words_per_minute

Audiobooks are recommended to be 150–160 words per minute, which is the range that people comfortably hear and vocalize words.
Slide presentations tend to be closer to 100–125 wpm for a comfortable pace, auctioneers can speak at about 250 wpm, and the fastest speaking policy debaters speak from 350 to over 500 words per minute. Internet speech calculators show that various things influence words per minute including nervousness.

John Moschitta, Jr., was listed in Guinness World Records, for a time, as the world's fastest speaker, being able to talk at 586 wpm. He has since been surpassed by Steve Woodmore, who achieved a rate of 637 wpm. 

The definition of each "word" is often standardized to be five characters or keystrokes long in English, including spaces and punctuation. 

In [None]:
setids = {
    111935,
    112048,
    113263,
    112748,
    111645,
    52912,
    15519,
    109653,
    112861,
    105938,
    43770,
    65548,
    114821,
    111773,
    114977,
    115356,
    113124,
    114785,
    105997,
    111072,
}


19. Given a set of OER, which is the odd one out? and why ?

In [None]:
setids = {
    65490,
    16602,
    16737,
    58512,
    72491,
    48968,
    64639,
    48716,
    16734,
    17501,
    65546,
    47129,
    17413,
    58280,
    49717,
    44982,
    43405,
    49178,
    49301,
}

## Build an interactive notebook using ipywidgets


Query API and dissect the json output 
is of course a real pleasure for all computer scientists. To do the job, console may be sufficient.

But everybody like the more friendly output. And in the hackathon context the challenge is to present what you've done to external people. A friendly output becomes so far essential. 

An efficient way to transform our ugly notebook in a beautifull interactive one is to use the ipywidgets library.

The following questions will give you some baselines for handling this technology.

**ipywidgets documention**: https://ipywidgets.readthedocs.io/en/latest/

20. Build your first dropdown ! The goal is to display a dropdown of resource titles and to display each time a resource is selected its corresponding concepts.

In [None]:
# Create the dropdown
tDropdown = widgets.Dropdown(
    options=catalogue["title"].head(100), index=0, description="Resources:"
)

In [None]:
# Create an output text area
outConcepts = widgets.Output()  # Text output for concepts plot

In [None]:
# Allow to clear the text area and to handle all the function print
@outConcepts.capture(clear_output=True)
def on_tDropdown_selection(index):
    pprint.pprint(catalogue.iloc[index["new"]].concepts)

In [None]:
# Bind the function with the selection events
tDropdown.observe(on_tDropdown_selection, "index")

In [None]:
# Finally display the whole
display(tDropdown, outConcepts)

21. Update a text area is funny but not really satisfaying. Could we do the same thing with a chart ? Plot the bar of concepts importance for the selected resource.
**This section still in devellopment, mainly due to compatibility problems between google collaboratory and FigureWidget, but still completely working if you execute the code on your local environment**

In [None]:
# Create the dropdown
enable_plotly_in_cell()
tDropdown2 = widgets.Dropdown(
    options=catalogue["title"].head(100), index=0, description="Resources:"
)
concepts = catalogue.iloc[0].concepts
print(type(concepts), concepts)
vals = np.random.rand(len(concepts))
print(type(vals), vals.tolist())


radar = go.FigureWidget(
    data=[go.Scatterpolar(theta=concepts, r=vals.tolist(), fill="toself")],
    layout=go.Layout(
        title="Most popular concepts", yaxis=dict(title="Concept occurence")
    ),
)

In [None]:
# Allow to clear the text area and to handle all the function print
def on_tDropdown_selection2(index):
    concepts = catalogue.iloc[index["new"]].concepts
    radar.data[0].r = np.random.rand(len(concepts)).tolist()
    radar.data[0].theta = concepts
    radar.data[0].name = tDropdown.value

In [None]:
# Bind the function with the selection events
tDropdown2.observe(on_tDropdown_selection2, "index")

In [None]:
# Finally display the whole
display(tDropdown2, radar)

## Need of technical support

Use the Slack channel for all your queries!!

Also, feel free to contact:
- walid.benromdhane@univ-nantes.fr
- victor.connes@univ-nantes.fr
- m.bulathwela@ucl.ac.uk