## About the dataset
The dataset presented here is drawn from the Kolibri Studio curricular alignment tool, in which users can create their own channel, then build out a topic tree that represents a curriculum taxonomy or other hierarchical structure, and finally organize content items into these topics, by uploading their own content and/or importing existing materials from the Kolibri Content Library of Open Educational Resources.

An example of a branch of a topic tree is: Secondary Education >> Ordinary Level >> Mathematics >> Further Learning >> Activities >> Trigonometry. The leaf topic in this branch might then contain (be correlated with) a content item such as a video entitled Polar Coordinates.

**topics.csv** - Contains a row for each topic in the dataset. These topics are organized into "channels", with each channel containing a single "topic tree" (which can be traversed through the "parent" reference). Note that the hidden dataset used for scoring contains additional topics not in the public version. You should only submit predictions for those topics listed in sample_submission.csv.
- id - A unique identifier for this topic.
- title - Title text for this topic.
- description - Description text (may be empty)
- channel - The channel (that is, topic tree) this topic is part of.
- category - Describes the origin of the topic.
  - source - Structure was given by original content creator (e.g. the topic tree as imported from Khan Academy). There are no topics in the test set with this category.
  - aligned - Structure is from a national curriculum or other target taxonomy, with content aligned from multiple sources.
  - supplemental - This is a channel that has to some extent been aligned, but without the same level of granularity or fidelity as an aligned channel.
- language - Language code for the topic. May not always match apparent language of its title or description, but will always match the language of any associated content items.
- parent - The id of the topic that contains this topic, if any. This field is empty if the topic is the root node for its channel.
- level - The depth of this topic within its topic tree. Level 0 means it is a root node (and hence its title is the title of the channel).
- has_content - Whether there are content items correlated with this topic. Most content is correlated with leaf topics, but some non-leaf topics also have content correlations.
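
The "parent" reference described above is enough to rebuild a branch such as the Trigonometry example. The following is a minimal sketch, not part of the original notebook: the helper `breadcrumb` and the toy topic ids are hypothetical, and it assumes an empty `parent` value marks the root node, as the field description states.

```python
def breadcrumb(topic_id, topics_by_id, sep=" >> "):
    """Walk parent links up to the root and return the titles, root first."""
    titles = []
    seen = set()  # guards against accidental cycles in malformed data
    current = topic_id
    while current and current not in seen:
        seen.add(current)
        row = topics_by_id[current]
        titles.append(row["title"])
        current = row["parent"]  # empty string at the root ends the loop
    return sep.join(reversed(titles))


# Hypothetical miniature topic tree keyed by topic id
toy_topics = {
    "t_root": {"title": "Secondary Education", "parent": ""},
    "t_a": {"title": "Ordinary Level", "parent": "t_root"},
    "t_b": {"title": "Mathematics", "parent": "t_a"},
    "t_c": {"title": "Trigonometry", "parent": "t_b"},
}

path = breadcrumb("t_c", toy_topics)
print(path)
```

With the real data, a similar lookup could be built from topics.csv, provided the `parent` column is loaded along with `id` and `title`.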



### Cosine-similarity approach

In this notebook, I use cosine similarity between topics to build a baseline model for content prediction. Each submission topic is matched to its most similar topics, and the content items already correlated with those topics become the prediction. No attempt was made to measure similarity between the content items themselves.  
This is my first notebook shared on Kaggle. Please upvote and comment if you find it useful.
**Note**: You may notice that I repeatedly delete objects that are no longer needed to free up memory. I ran into out-of-memory errors many times during submission.
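
As a reminder of what the similarity measure does, here is a small pure-Python sketch. It uses raw word counts rather than TF-IDF weights, which is a simplification, and the helper names and toy sentences are hypothetical. It also shows why a plain dot product is enough once vectors are L2-normalized, which is the default behaviour of `TfidfVectorizer` (`norm='l2'`).

```python
import math
from collections import Counter


def dot(a, b):
    """Dot product of two sparse vectors stored as term -> weight mappings."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())


def l2_normalize(vec):
    """Scale a sparse vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(dot(vec, vec))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


# Toy "tags" strings turned into bag-of-words count vectors
doc_a = Counter("graphs of exponential functions".split())
doc_b = Counter("inputs and outputs of functions".split())
doc_c = Counter("flow charts".split())

sim_ab = cosine(doc_a, doc_b)
# Same value from a plain dot product on unit-length vectors
sim_ab_normalized = dot(l2_normalize(doc_a), l2_normalize(doc_b))
sim_ac = cosine(doc_a, doc_c)  # no shared words, so similarity is zero

print(sim_ab, sim_ab_normalized, sim_ac)
```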

In [48]:
import gc
import json
import os

import nltk # natural language tool kit
import numpy as np
import pandas as pd
import psutil
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import linear_kernel

pd.set_option('display.max_columns',None)

In [50]:
sample_submission=pd.read_csv('/kaggle/input/learning-equality-curriculum-recommendations/sample_submission.csv',usecols=['topic_id']).fillna('')
topics =pd.read_csv('/kaggle/input/learning-equality-curriculum-recommendations/topics.csv',usecols=['id', 'title','channel','category' ,'description', 'level',
       'language']).fillna('')
cor=pd.read_csv('/kaggle/input/learning-equality-curriculum-recommendations/correlations.csv').fillna('')

In [51]:
stop_words_dict = {}

for dirname, _, filenames in os.walk('/kaggle/input/multi-languages-stopwords'):
    for filename in filenames:
        if filename.endswith('.json'):
            with open(os.path.join(dirname, filename), 'r') as f:
                stop_words_dict[filename.split('.')[0]] = json.load(f)

In [52]:
#Merge sample submission df with topics
submission=topics.merge(sample_submission['topic_id'],how='inner',left_on='id',right_on='topic_id')
submission.head()

|  | id | title | description | channel | category | level | language | topic_id |
|---|---|---|---|---|---|---|---|---|
| 0 | t_00004da3a1b2 | Откриването на резисторите | Изследване на материали, които предизвикват на... | 000cf7 | source | 4 | bg | t_00004da3a1b2 |
| 1 | t_00068291e9a4 | Entradas e saídas de uma função | Entenda um pouco mais sobre funções. | 8e286a | source | 4 | pt | t_00068291e9a4 |
| 2 | t_00069b63a70a | Transcripts |  | 6e3ba4 | source | 3 | en | t_00069b63a70a |
| 3 | t_0006d41a73a8 | Графики на експоненциални функции (Алгебра 2 н... | Научи повече за графиките на сложните показате... | 000cf7 | source | 4 | bg | t_0006d41a73a8 |
| 4 | t_4054df11a74e | Flow Charts: Logical Thinking? | This lesson is focused on flow charts. It supp... | 6e3ba4 | source | 2 | en | t_4054df11a74e |


## Filter the topics 
**Filter** the topics to the languages that appear in the submission set. We can expect matching content to be in the **same language** and, most likely, to originate from the same channels. The code below filters on language only; channel, category, and level are kept as columns so that they can be used for stricter filtering later.  

In [53]:
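# keep only topics in the submission languages; copy so that adding a column below does not trigger SettingWithCopyWarning
topics=topics.loc[topics['language'].isin(submission.language.unique()),:].copy()
# merge title and description (separated by a space so that words do not fuse) to create the tags for text analysis
topics['tags']=topics['title'].astype(str)+' '+topics['description'].astype(str)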
#drop title and description
topics=(topics
        .drop(['title','description'],axis=1)
        #group by language, channel,category,level, and id
        .groupby(['language','channel','category','level','id'])['tags']
        .apply(lambda x:' '.join(x).lower())
        .reset_index())

In [54]:
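# Define list of English stopwords, used as a fallback when no language-specific list was loaded
# (kept as a list because recent scikit-learn versions reject a set for stop_words)
stop_words_en = stopwords.words("english")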


sub=pd.DataFrame(columns=['id','sim_topics'])

for i in topics.language.unique():
    if i in stop_words_dict.keys():
        stop_words=stop_words_dict[i]
    else:
        stop_words=stop_words_en
    
    df=topics[topics.language==i][['id','tags']].reset_index(drop=True)
    df_sub=pd.DataFrame(submission[submission.language==i]['id'].reset_index(drop=True))
   # df_sub=submission[submission.language==i]['index','id'].reset_index(drop=True)
    # Define a TF-IDF vectorizer that drops the stop words selected above for this language
    vectorizer=TfidfVectorizer(stop_words=stop_words)
    # Construct the required TF-IDF matrix by fitting and transforming the data
    tfidf_matrix=vectorizer.fit_transform(df.tags)
    # Compute the cosine similarity matrix (TF-IDF rows are L2-normalized by default,
    # so the dot product from linear_kernel equals cosine similarity)
    cosine_sim=linear_kernel(tfidf_matrix,tfidf_matrix)
    
    # freeup some memory
    del vectorizer,tfidf_matrix
    gc.collect()
    
    # Create a function that recommends the 3 most similar topics
    def recommend(topics_id):
        topics_index=df[df.id==topics_id].index[0]
        distances=cosine_sim[topics_index]
        # Sort by similarity, highest first; position 0 is normally the topic itself, so skip it
        topics_list=sorted(list(enumerate(distances)),
                       reverse=True,
                      key=lambda x:x[1])[1:4]
        preds=[]
        # idx avoids shadowing the outer loop variable i
        for idx,_ in topics_list:
            preds.append(df.iloc[idx].id)
        return preds
        
    df_sub['sim_topics']=df_sub.id.apply(lambda x:recommend(x))


    sub=pd.concat([sub,df_sub])
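
The loop above builds a full topic-by-topic similarity matrix for each language, which is a likely source of the memory pressure mentioned earlier. One possible alternative, sketched here under assumptions and not taken from the original notebook, is to score only the rows that belong to submission topics. The helper `top_k_similar` and the toy strings are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel


def top_k_similar(tfidf_matrix, query_rows, k=3):
    """Return, for each query row index, the indices of its k most similar other rows."""
    # Shape (n_queries, n_docs) instead of (n_docs, n_docs)
    sims = linear_kernel(tfidf_matrix[query_rows], tfidf_matrix)
    # Exclude each query's own row explicitly rather than assuming it sorts first
    sims[np.arange(len(query_rows)), query_rows] = -1.0
    # Sort each row by descending similarity and keep the first k columns
    return np.argsort(-sims, axis=1)[:, :k]


# Toy stand-in for the per-language "tags" column
toy_tags = [
    "graphs of exponential functions",
    "exponential functions and their graphs",
    "flow charts logical thinking",
    "logical thinking with flow charts",
]
matrix = TfidfVectorizer().fit_transform(toy_tags)

# Pretend rows 0 and 2 are the submission topics
neighbours = top_k_similar(matrix, np.array([0, 2]), k=1)
print(neighbours)
```

For very large languages the query rows could additionally be processed in chunks, so that only a slice of the similarity matrix is held in memory at a time.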




In [55]:
submission=(sub.explode('sim_topics')
            .merge(cor,how='left',left_on='sim_topics',right_on='topic_id')[['id','content_ids']]
            .fillna('')
            .groupby('id')['content_ids'].apply (lambda x:list(set(x)))
            .reset_index()
            .rename(columns={'id':'topic_id'})
           )
submission['content_ids']=submission['content_ids'].apply(lambda x:' '.join(x))
submission.to_csv('submission.csv',index=False)
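
One detail worth checking in the final step: `set(x)` removes duplicate strings, not duplicate individual IDs, so the same content ID can appear more than once when two similar topics share content. A minimal sketch of merging at the ID level follows. It assumes each `content_ids` value is a space-separated string of IDs, as the final `' '.join` suggests; the helper `merge_content_ids` and the toy IDs are hypothetical.

```python
def merge_content_ids(id_strings):
    """Merge space-separated ID strings into one string of unique IDs, keeping first-seen order."""
    seen = {}  # dict preserves insertion order, so it doubles as an ordered set
    for ids in id_strings:
        for content_id in ids.split():  # split() also drops empty strings
            seen.setdefault(content_id, None)
    return " ".join(seen)


# Toy values standing in for the content_ids of three similar topics
merged = merge_content_ids(["c_aaa c_bbb", "", "c_bbb c_ccc"])
print(merged)
```

Such a helper could be passed to the `groupby('id')['content_ids'].apply(...)` step in place of the `list(set(x))` lambda, in which case the later `' '.join` would no longer be needed.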