In [1]:
# this notebook converts the CSV to ES mapping

In [2]:
# imports
from datetime import datetime
import hashlib
import json
import sys
import csv
import os
import pandas as pd
import re
import time
# note: gensim.summarization is only available in gensim < 4.0
from gensim.summarization.summarizer import summarize
from gensim.summarization import keywords

In [3]:
# some long text
# source: https://www.kaggle.com/c/stanford-covid-vaccine
text1 = '''
Winning the fight against the COVID-19 pandemic will require an effective vaccine that can be equitably and widely distributed. Building upon decades of research has allowed scientists to accelerate the search for a vaccine against COVID-19, but every day that goes by without a vaccine has enormous costs for the world nonetheless. We need new, fresh ideas from all corners of the world. Could online gaming and crowdsourcing help solve a worldwide pandemic? Pairing scientific and crowdsourced intelligence could help computational biochemists make measurable progress.
mRNA vaccines have taken the lead as the fastest vaccine candidates for COVID-19, but currently, they face key potential limitations. One of the biggest challenges right now is how to design super stable messenger RNA molecules (mRNA). Conventional vaccines (like your seasonal flu shots) are packaged in disposable syringes and shipped under refrigeration around the world, but that is not currently possible for mRNA vaccines.
Researchers have observed that RNA molecules have the tendency to spontaneously degrade. This is a serious limitation--a single cut can render the mRNA vaccine useless. Currently, little is known on the details of where in the backbone of a given RNA is most prone to being affected. Without this knowledge, current mRNA vaccines against COVID-19 must be prepared and shipped under intense refrigeration, and are unlikely to reach more than a tiny fraction of human beings on the planet unless they can be stabilized.
The Eterna community, led by Professor Rhiju Das, a computational biochemist at Stanford’s School of Medicine, brings together scientists and gamers to solve puzzles and invent medicine. Eterna is an online video game platform that challenges players to solve scientific problems such as mRNA design through puzzles. The solutions are synthesized and experimentally tested at Stanford by researchers to gain new insights about RNA molecules. The Eterna community has previously unlocked new scientific principles, made new diagnostics against deadly diseases, and engaged the world’s most potent intellectual resources for the betterment of the public. The Eterna community has advanced biotechnology through its contribution in over 20 publications, including advances in RNA biotechnology.
In this competition, we are looking to leverage the data science expertise of the Kaggle community to develop models and design rules for RNA degradation. Your model will predict likely degradation rates at each base of an RNA molecule, trained on a subset of an Eterna dataset comprising over 3000 RNA molecules (which span a panoply of sequences and structures) and their degradation rates at each position. We will then score your models on a second generation of RNA sequences that have just been devised by Eterna players for COVID-19 mRNA vaccines. These final test sequences are currently being synthesized and experimentally characterized at Stanford University in parallel to your modeling efforts -- Nature will score your models!
Improving the stability of mRNA vaccines was a problem that was being explored before the pandemic but was expected to take many years to solve. Now, we must solve this deep scientific challenge in months, if not weeks, to accelerate mRNA vaccine research and deliver a refrigerator-stable vaccine against SARS-CoV-2, the virus behind COVID-19. The problem we are trying to solve has eluded academic labs, industry R&D groups, and supercomputers, and so we are turning to you. To help, you can join the team of video game players, scientists, and developers at Eterna to unlock the key in our fight against this devastating pandemic. 
'''

# and a short one
text2 = 'The quick brown fox jumps over the lazy dog'

In [4]:
# function to count words
def word_count(text):
    if isinstance(text, str):
        s = text.split(' ')
        return len(s)
    else:
        return 0

print('words:', word_count(text1))
print('words:', word_count(text2))
print('words:', word_count(None))

words: 564
words: 9
words: 0
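
Because `word_count` splits on single spaces only, words separated by a newline are merged into one token and repeated spaces add empty tokens. A minimal sketch of a whitespace-agnostic variant follows; the name `word_count_ws` is hypothetical and not part of the original notebook, and its counts will differ slightly from the figures printed above.

```python
def word_count_ws(text):
    """Count whitespace-separated tokens; non-string input counts as 0."""
    if not isinstance(text, str):
        return 0
    # str.split() without an argument splits on any run of whitespace
    # (spaces, tabs, newlines) and never yields empty strings.
    return len(text.split())


print('words:', word_count_ws('The quick brown fox jumps over the lazy dog'))
print('words:', word_count_ws('two\nlines  here'))
print('words:', word_count_ws(None))
```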


In [5]:
# function to count sentences
def sentence_count(text):
    if isinstance(text, str):
        s = text.split('. ')
        return len(s)
    else:
        return 0

print('sentences:', sentence_count(text1))
print('sentences:', sentence_count(text2))
print('sentences:', sentence_count(None))

sentences: 20
sentences: 1
sentences: 0
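
Splitting on `'. '` misses sentences that end in `?` or `!` and sentences followed by a newline. The sketch below shows one possible regex-based alternative; `sentence_count_re` is a hypothetical name, and the heuristic still over-counts abbreviations such as "e.g.", so it is only an approximation.

```python
import re


def sentence_count_re(text):
    """Count sentences by splitting after '.', '!' or '?' followed by whitespace."""
    if not isinstance(text, str):
        return 0
    # The lookbehind keeps the punctuation attached to the sentence it ends.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return len([p for p in parts if p])


print('sentences:', sentence_count_re('Is it stable? Yes! It ships cold.\nDone.'))
print('sentences:', sentence_count_re('The quick brown fox jumps over the lazy dog'))
print('sentences:', sentence_count_re(None))
```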


In [6]:
# extractive summarization

In [7]:
# text summarization 100% -> n%
# note: despite the nltk_ prefix, this wraps gensim's summarize()
def nltk_ratio(text, ratio=0.25):
    return summarize(text, ratio=ratio)

sum_nltk_ratio = nltk_ratio(text1, ratio=0.25)
print('words:', word_count(sum_nltk_ratio))
print(sum_nltk_ratio)

words: 139
Eterna is an online video game platform that challenges players to solve scientific problems such as mRNA design through puzzles.
The solutions are synthesized and experimentally tested at Stanford by researchers to gain new insights about RNA molecules.
We will then score your models on a second generation of RNA sequences that have just been devised by Eterna players for COVID-19 mRNA vaccines.
Improving the stability of mRNA vaccines was a problem that was being explored before the pandemic but was expected to take many years to solve.
Now, we must solve this deep scientific challenge in months, if not weeks, to accelerate mRNA vaccine research and deliver a refrigerator-stable vaccine against SARS-CoV-2, the virus behind COVID-19.
To help, you can join the team of video game players, scientists, and developers at Eterna to unlock the key in our fight against this devastating pandemic.


In [8]:
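# text summarization 100% -> n words
# note: despite the nltk_ prefix, this wraps gensim's summarize();
# the word_count parameter shadows the word_count() helper inside this function only
def nltk_count(text, word_count=100):
    return summarize(text, word_count=word_count)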

sum_nltk_count = nltk_count(text1, word_count=100)
print('words:', word_count(sum_nltk_count))
print(sum_nltk_count)

words: 98
Eterna is an online video game platform that challenges players to solve scientific problems such as mRNA design through puzzles.
We will then score your models on a second generation of RNA sequences that have just been devised by Eterna players for COVID-19 mRNA vaccines.
Now, we must solve this deep scientific challenge in months, if not weeks, to accelerate mRNA vaccine research and deliver a refrigerator-stable vaccine against SARS-CoV-2, the virus behind COVID-19.
To help, you can join the team of video game players, scientists, and developers at Eterna to unlock the key in our fight against this devastating pandemic.


In [9]:
# abstractive summarization
# https://www.machinelearningplus.com/nlp/text-summarization-approaches-nlp-example/

In [10]:
# BART
# Importing the model
from transformers import BartForConditionalGeneration, BartTokenizer, BartConfig

In [11]:
'''
# Loading the model and tokenizer for bart-large-cnn
tokenizer=BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model=BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
#'''

In [12]:
'''
# Encoding the inputs and passing them to model.generate()
def bart(text):
    inputs = tokenizer.batch_encode_plus([text],return_tensors='pt')
    summary_ids = model.generate(inputs['input_ids'], early_stopping=True)

    # Decoding and printing the summary
    bart_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    
    return bart_summary

# long text
start = time.time()
sum_bart_l = bart(text1)
end = time.time()

print('### long text ###')
print('runtime:', end-start)
print('words:', word_count(sum_bart_l))
print('sentences:', sentence_count(sum_bart_l))
print(sum_bart_l)
print('')

# short text
print('### short text ###')
start = time.time()
sum_bart_s = bart(text2)
end = time.time()

print('runtime:', end-start)
print('words:', word_count(sum_bart_s))
print('sentences:', sentence_count(sum_bart_s))
print(sum_bart_s)
#'''

In [13]:
# T5
# https://towardsdatascience.com/summarize-reddit-comments-using-t5-bart-gpt-2-xlnet-models-a3e78a5ab944
from transformers import T5Tokenizer, T5ForConditionalGeneration
model = T5ForConditionalGeneration.from_pretrained('t5-base')
tokenizer = T5Tokenizer.from_pretrained('t5-base')

Some weights of the model checkpoint at t5-base were not used when initializing T5ForConditionalGeneration: ['decoder.block.0.layer.1.EncDecAttention.relative_attention_bias.weight']
- This IS expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [14]:
def t5(text):
    Preprocessed_text = "summarize: " + text
    tokens_input = tokenizer.encode(Preprocessed_text,return_tensors="pt", max_length=512, truncation=True)
    summary_ids = model.generate(tokens_input, min_length=100, max_length=180, length_penalty=4.0)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

'''
# long text
start = time.time()
sum_t5_l = t5(text1)
end = time.time()

print('### long text ###')
print('runtime:', end-start)
print('words:', word_count(sum_t5_l))
print('sentences:', sentence_count(sum_t5_l))
print(sum_t5_l)
print('')

# short text
print('### short text ###')
start = time.time()
sum_t5_s = t5(text2)
end = time.time()

print('runtime:', end-start)
print('words:', word_count(sum_t5_s))
print('sentences:', sentence_count(sum_t5_s))
print(sum_t5_s)
#'''

In [15]:
# industry categories

# https://www.census.gov/programs-surveys/aces/information/iccl.html
cat_sic = ['Agriculture','Forestry','Fishing','Mining','Construction','Manufacturing','Transportation','Communications','Electric','Gas','Sanitary','Wholesale Trade','Retail Trade','Finance','Insurance','Real Estate','Services','Public Administration']
# https://www.marketing91.com/19-types-of-business-industries/
cat_19 = ['Aerospace','Transport','Computer','Telecommunication','Agriculture','Construction','Education','Pharmaceutical','Food','Health care','Hospitality','Entertainment','News Media','Energy','Manufacturing','Music','Mining','Worldwide web','Electronics']
# https://simplicable.com/new/industries
cat_simple = ['Advertising','Agriculture','Communication','Construction','Creative','Education','Entertainment','Fashion','Finance','Health care','Information Technology','Manufacturing','Media','Retail','Research','Robotics','Space']

cat = ['Accommodation & Food','Accounting','Agriculture','Banking & Insurance','Biotechnological & Life Sciences','Construction & Engineering','Economics','Education & Research','Emergency & Relief','Finance','Government and Public Works','Healthcare','Justice, Law and Regulations','Manufacturing','Media & Publishing','Miscellaneous','Physics','Real Estate, Rental & Leasing','Utilities','Wholesale & Retail']
subcat = ['Failure','Food','Fraud','General','Genomics','Insurance and Risk','Judicial Applied','Life-sciences','Machine Learning','Maintenance','Management and Operations','Marketing','Material Science','Physical','Policy and Regulatory','Politics','Preventative and Reactive','Quality','Real Estate','Rental & Leasing','Restaurant','Retail','School','Sequencing','Social Policies','Student','Textual Analysis','Tools','Tourism','Trading & Investment','Transportation','Valuation','Water & Pollution','Wholesale']

In [16]:
# zero shot classification
# https://towardsdatascience.com/zero-shot-text-classification-with-hugging-face-7f533ba83cd6
from transformers import pipeline
classifier = pipeline("zero-shot-classification")

Some weights of the model checkpoint at facebook/bart-large-mnli were not used when initializing BartModel: ['model.encoder.version', 'model.decoder.version']
- This IS expected if you are initializing BartModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BartModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at facebook/bart-large-mnli were not used when initializing BartForSequenceClassification: ['model.encoder.version', 'model.decoder.version']
- This IS expected if you are initializing BartForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification m
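
The label lists defined earlier can be scored one after another in a loop instead of with one copied block per list. This is a minimal sketch under stated assumptions: `compare_label_sets` is a hypothetical helper, and `fake_classifier` is a stand-in that only mimics the `labels` and `scores` keys of the pipeline result, so no model is loaded and the scores are meaningless.

```python
import time


def fake_classifier(text, labels, multi_class=True):
    """Stand-in for the zero-shot pipeline: returns labels with descending dummy scores."""
    scores = [round(1.0 / (i + 1), 3) for i in range(len(labels))]
    return {'sequence': text, 'labels': list(labels), 'scores': scores}


def compare_label_sets(classify, text, label_sets, top_n=3):
    """Run classify once per label list and collect runtime plus the top_n results."""
    results = {}
    for name, labels in label_sets.items():
        start = time.time()
        res = classify(text, labels, multi_class=True)
        results[name] = {
            'runtime': round(time.time() - start, 3),
            'top': list(zip(res['labels'][:top_n], res['scores'][:top_n])),
        }
    return results


demo_sets = {
    'small': ['Healthcare', 'Finance'],
    'large': ['Healthcare', 'Finance', 'Education', 'Mining'],
}
for name, info in compare_label_sets(fake_classifier, 'some summary', demo_sets).items():
    print(name, info['top'])
```

Passing the real `classifier` object in place of `fake_classifier` should work as long as it accepts the same arguments as in the cells of this notebook.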

In [17]:
'''
# test classification with the nltk_count summary (100 words)

s = nltk_count(text1, word_count=100)

start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_sic = classifier(s, cat_sic, multi_class=True)
end = time.time()

print('runtime sic:', end-start)
#print(res_sic)
print(res_sic['labels'][0:3])
print(res_sic['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_c19 = classifier(s, cat_19, multi_class=True)
end = time.time()

print('runtime c19:', end-start)
#print(res_c19)
print(res_c19['labels'][0:3])
print(res_c19['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_simple = classifier(s, cat_simple, multi_class=True)
end = time.time()

print('runtime simple:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat)
res_simple = classifier(s, cat, multi_class=True)
end = time.time()

print('runtime category:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), subcat)
res_simple = classifier(s, subcat, multi_class=True)
end = time.time()

print('runtime subcategory:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])
#'''

In [18]:
'''
# test classification with t5
start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_sic = classifier(sum_t5_l, cat_sic, multi_class=True)
end = time.time()

print('runtime sic:', end-start)
#print(res_sic)
print(res_sic['labels'][0:3])
print(res_sic['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_c19 = classifier(sum_t5_l, cat_19, multi_class=True)
end = time.time()

print('runtime c19:', end-start)
#print(res_c19)
print(res_c19['labels'][0:3])
print(res_c19['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_simple = classifier(sum_t5_l, cat_simple, multi_class=True)
end = time.time()

print('runtime simple:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat)
res_simple = classifier(sum_t5_l, cat, multi_class=True)
end = time.time()

print('runtime category:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), subcat)
res_simple = classifier(sum_t5_l, subcat, multi_class=True)
end = time.time()

print('runtime subcategory:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])
#'''

In [19]:
'''
# test classification with bart

start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_sic = classifier(sum_bart_l, cat_sic, multi_class=True)
end = time.time()

print('runtime sic:', end-start)
#print(res_sic)
print(res_sic['labels'][0:3])
print(res_sic['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_c19 = classifier(sum_bart_l, cat_19, multi_class=True)
end = time.time()

print('runtime c19:', end-start)
#print(res_c19)
print(res_c19['labels'][0:3])
print(res_c19['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_simple = classifier(sum_bart_l, cat_simple, multi_class=True)
end = time.time()

print('runtime simple:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat)
res_simple = classifier(sum_bart_l, cat, multi_class=True)
end = time.time()

print('runtime category:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), subcat)
res_simple = classifier(sum_bart_l, subcat, multi_class=True)
end = time.time()

print('runtime subcategory:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])
#'''

In [20]:
# category function
def categorize(text, categories, first=True, treshold=0, runtime=False):
    """Zero-shot classify text against categories.

    first=True returns only the top label (None values if its score is below
    treshold); first=False returns every label with a score >= treshold.
    """
    start = time.time()
    # multi_class=True scores each label independently; newer transformers
    # releases name this argument multi_label
    res = classifier(text, categories, multi_class=True)
    dur = round(time.time() - start, 3)
    if first:
        if res['scores'][0] >= treshold:
            ret = {'category': res['labels'][0], 'score': res['scores'][0]}
        else:
            ret = {'category': None, 'score': None}
    else:
        ret = {label: score
               for label, score in zip(res['labels'], res['scores'])
               if score >= treshold}
    if runtime:
        ret['runtime'] = dur
    return ret
        

print(categorize(nltk_count(text1), cat, first=False, treshold=0.5))
print(categorize(nltk_count(text1), cat, first=True, treshold=0.9))
print(categorize(nltk_count(text1), cat, first=False, treshold=0.5, runtime=True))
print(categorize(nltk_count(text1), cat, first=True, treshold=0.9, runtime=True))

{'Biotechnological & Life Sciences': 0.8398016095161438, 'Healthcare': 0.7811025977134705, 'Education & Research': 0.7176370620727539, 'Utilities': 0.6823796629905701}
{'category': None, 'score': None}
{'Biotechnological & Life Sciences': 0.8398016095161438, 'Healthcare': 0.7811025977134705, 'Education & Research': 0.7176370620727539, 'Utilities': 0.6823796629905701, 'runtime': 16.704}
{'category': None, 'score': None, 'runtime': 16.704}
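
The thresholding step can be tried without loading a model by feeding it a hand-written result dictionary. In the sketch below, `filter_by_threshold` is a hypothetical helper that reproduces only the filtering part of `categorize`; the scores are rounded from the output above, and the low-scoring `Finance` entry is made up for illustration.

```python
def filter_by_threshold(res, threshold=0.0, first=True):
    """Apply the top-label or all-labels threshold filter to a pipeline-style result."""
    if first:
        if res['scores'] and res['scores'][0] >= threshold:
            return {'category': res['labels'][0], 'score': res['scores'][0]}
        return {'category': None, 'score': None}
    return {label: score
            for label, score in zip(res['labels'], res['scores'])
            if score >= threshold}


# Labels and scores sorted by descending score, as the pipeline returns them.
sample = {
    'labels': ['Biotechnological & Life Sciences', 'Healthcare',
               'Education & Research', 'Utilities', 'Finance'],
    'scores': [0.84, 0.78, 0.72, 0.68, 0.10],  # last entry is invented
}

print(filter_by_threshold(sample, threshold=0.5, first=False))
print(filter_by_threshold(sample, threshold=0.9, first=True))
print(filter_by_threshold(sample, threshold=0.8, first=True))
```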


In [21]:
'''
# measure error threshold
csv_in = '../data/database/db_04_analyzed_v02.csv'
csv_out = '../data/database/categorizer.csv'
df = pd.read_csv(csv_in, sep=';')
print(df.shape)

df_out = []
quit = 0
match_c = sim_c = match_sc = sim_sc = 0

start = time.time()
for index, row in df.iterrows():
    print('###')
    print(index, row['link'])
    c = row['industry']
    sc = row['type']
    d = row['description']
    item = {
        'link': row['link'],
        'category': c,
        'subcategory': sc,
    }
    
    print('category:', c)
    try:
        c_guess = categorize(d, cat, first=False, treshold=0.25)
    except:
        c_guess = {}
    print('guess:', c_guess)
    item['category_guess'] = json.dumps(c_guess)
    item['category_match'] = False
    item['category_similar'] = False
    if len(c_guess) > 0:
        keys = list(c_guess.keys())
        if c == keys[0]:
            print('MATCH')
            item['category_match'] = True
            match_c += 1
        elif c in keys:
            print('SIMILAR')
            item['category_similar'] = True
            sim_c += 1
    
    print('subcategory:', sc)
    try:
        sc_guess = categorize(d, subcat, first=False, treshold=0.25)
    except:
        sc_guess = {}
    print('guess:', sc_guess)
    item['subcategory_guess'] = json.dumps(sc_guess)
    item['subcategory_match'] = False
    item['subcategory_similar'] = False
    if len(sc_guess) > 0:
        keys = list(sc_guess.keys())
        if sc == keys[0]:
            print('MATCH')
            item['subcategory_match'] = True
            match_sc += 1
        elif sc in keys:
            print('SIMILAR')
            item['subcategory_similar'] = True
            sim_sc += 1
    
    df_out.append(item)
    
    if quit != 0 and index+1 >= quit:
        break
end = time.time()

df_out = pd.DataFrame(df_out)
df_out.to_csv(csv_out, sep=';', index=False)
print('done in', round(end-start, 3), 'sec')
print(index+1, match_c, sim_c, match_sc, sim_sc)
'''
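
The MATCH / SIMILAR bookkeeping in the loop above can be isolated into a small function and tallied with a counter. This is a sketch with hypothetical names (`judge_guess`, the sample rows); it relies on dictionaries preserving insertion order (Python 3.7+), so that the first key is the highest-scoring guess.

```python
from collections import Counter


def judge_guess(expected, guess):
    """Return 'match' if expected is the top guess, 'similar' if it appears at all, else 'miss'."""
    keys = list(guess.keys())
    if keys and expected == keys[0]:
        return 'match'
    if expected in keys:
        return 'similar'
    return 'miss'


# Made-up rows: (expected category, guessed categories ordered by score).
rows = [
    ('Healthcare', {'Healthcare': 0.9, 'Finance': 0.3}),
    ('Finance', {'Healthcare': 0.9, 'Finance': 0.3}),
    ('Physics', {}),
]
tally = Counter(judge_guess(expected, guess) for expected, guess in rows)
print(dict(tally))
```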

In [43]:
# language detection
# https://towardsdatascience.com/how-to-detect-and-translate-languages-for-nlp-project-dfd52af0c3b5
from langdetect import detect, detect_langs, DetectorFactory

language_codes = {'af': 'afrikaans', 'sq': 'albanian', 'am': 'amharic', 'ar': 'arabic', 'hy': 'armenian', 'az': 'azerbaijani', 'eu': 'basque', 'be': 'belarusian', 'bn': 'bengali', 'bs': 'bosnian', 'bg': 'bulgarian', 'ca': 'catalan', 'ceb': 'cebuano', 'ny': 'chichewa', 'zh-cn': 'chinese (simplified)', 'zh-tw': 'chinese (traditional)', 'co': 'corsican', 'hr': 'croatian', 'cs': 'czech', 'da': 'danish', 'nl': 'dutch', 'en': 'english', 'eo': 'esperanto', 'et': 'estonian', 'tl': 'filipino', 'fi': 'finnish', 'fr': 'french', 'fy': 'frisian', 'gl': 'galician', 'ka': 'georgian', 'de': 'german', 'el': 'greek', 'gu': 'gujarati', 'ht': 'haitian creole', 'ha': 'hausa', 'haw': 'hawaiian', 'iw': 'hebrew', 'hi': 'hindi', 'hmn': 'hmong', 'hu': 'hungarian', 'is': 'icelandic', 'ig': 'igbo', 'id': 'indonesian', 'ga': 'irish', 'it': 'italian', 'ja': 'japanese', 'jw': 'javanese', 'kn': 'kannada', 'kk': 'kazakh', 'km': 'khmer', 'ko': 'korean', 'ku': 'kurdish (kurmanji)', 'ky': 'kyrgyz', 'lo': 'lao', 'la': 'latin', 'lv': 'latvian', 'lt': 'lithuanian', 'lb': 'luxembourgish', 'mk': 'macedonian', 'mg': 'malagasy', 'ms': 'malay', 'ml': 'malayalam', 'mt': 'maltese', 'mi': 'maori', 'mr': 'marathi', 'mn': 'mongolian', 'my': 'myanmar (burmese)', 'ne': 'nepali', 'no': 'norwegian', 'ps': 'pashto', 'fa': 'persian', 'pl': 'polish', 'pt': 'portuguese', 'pa': 'punjabi', 'ro': 'romanian', 'ru': 'russian', 'sm': 'samoan', 'gd': 'scots gaelic', 'sr': 'serbian', 'st': 'sesotho', 'sn': 'shona', 'sd': 'sindhi', 'si': 'sinhala', 'sk': 'slovak', 'sl': 'slovenian', 'so': 'somali', 'es': 'spanish', 'su': 'sundanese', 'sw': 'swahili', 'sv': 'swedish', 'tg': 'tajik', 'ta': 'tamil', 'te': 'telugu', 'th': 'thai', 'tr': 'turkish', 'uk': 'ukrainian', 'ur': 'urdu', 'uz': 'uzbek', 'vi': 'vietnamese', 'cy': 'welsh', 'xh': 'xhosa', 'yi': 'yiddish', 'yo': 'yoruba', 'zu': 'zulu', 'fil': 'Filipino', 'he': 'Hebrew'}

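def lingo(text, simple=True):
    # fix the seed so that langdetect returns reproducible results
    DetectorFactory.seed = 0
    try:
        if simple:
            return detect(text)  # language code only, e.g. 'en'
        # str() of the best guess looks like 'sw:0.9999971210408874'
        code, prob = str(detect_langs(text)[0]).split(':')
        return {
            'code': code,
            'language': language_codes[code],
            'probability': prob,
        }
    except Exception:
        # covers None input, undetectable text and codes missing from language_codes
        return None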

sentence = "Tanzania ni nchi inayoongoza kwa utalii barani afrika"
sentence2 = "Heute schneit es."

print(lingo(sentence, simple=False))
print(lingo(sentence2))
print(lingo(text1))
print(lingo(text2))
print(lingo(None))

{'code': 'sw', 'language': 'swahili', 'probability': '0.9999971210408874'}
de
en
en
None


In [23]:
# helper functions

In [24]:
# function to rebuild a list from its string representation
# (needed when a list was stored in a CSV without JSON-encoding it first)
def str_to_list(s):
    s = s.replace("'", "").replace(' ,', ',').replace(
        '[', '').replace(']', '').split(',')
    s = [i.replace('"','').strip() for i in s if i]
    return s
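
Because `str_to_list` splits on every comma, an item that itself contains a comma is broken apart. One possible alternative, sketched here under the assumption that the stored strings are usually valid Python list literals, is to try `ast.literal_eval` first and fall back to plain splitting; `parse_list` is a hypothetical name.

```python
import ast


def parse_list(s):
    """Rebuild a list from its string form, preferring a safe literal parse."""
    try:
        value = ast.literal_eval(s)
    except (ValueError, SyntaxError):
        # Not a valid literal (e.g. unquoted items): fall back to comma splitting.
        return [part.strip() for part in s.strip('[]').split(',') if part.strip()]
    return list(value) if isinstance(value, (list, tuple)) else [value]


print(parse_list("['Face Detection', 'GAN, conditional']"))
print(parse_list('[CNN, GAN]'))
print(parse_list(''))
```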

In [25]:
# helper function to create the parent folder of a file path
def create_folder(path):
    folder = os.path.dirname(path)
    if folder and not os.path.exists(folder):
        # exist_ok=True guards against the race where another process
        # creates the folder between the check and makedirs
        os.makedirs(folder, exist_ok=True)
        print(folder + ' created')

In [26]:
# generic store data to file function
def store_data(data, file, mode='w', toJson=False):
    if toJson:
        data = json.dumps(data)
    with open(file, mode, encoding='utf-8') as fp:
        result = fp.write(data)
        return result
    
# generic load data from file function
def load_data(file, fromJson=False):
    if os.path.isfile(file):
        with open(file, 'r', encoding='utf-8', errors="ignore") as fp:
            data = fp.read()
            if fromJson:
                data = json.loads(data)
            return data
    else:
        return 'file not found'

# test text
#print(store_data('Hello', '../data/repositories/mlart/test.txt'))
#print(load_data('../data/repositories/mlart/test.txt'))

# test json
#print(store_data({'msg':'Hello World'}, '../data/repositories/mlart/test.json', toJson=True))
#print(load_data('../data/repositories/mlart/test.json', fromJson=True))

#store_data(result[0]['html'], '../data/repositories/kaggle/notebook.html')
#store_data(result[0]['iframe'], '../data/repositories/kaggle/kernel.html')
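
`load_data` returns the string `'file not found'` for a missing file, which a caller cannot tell apart from a file that really contains that text. The sketch below shows one way to make the missing-file case explicit; `load_json_or_default` is a hypothetical helper, and the temporary directory stands in for the notebook's `../data/...` paths.

```python
import json
import os
import tempfile


def load_json_or_default(path, default=None):
    """Return parsed JSON from path, or default when the file does not exist."""
    if not os.path.isfile(path):
        return default
    with open(path, 'r', encoding='utf-8') as fp:
        return json.load(fp)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'test.json')
    with open(path, 'w', encoding='utf-8') as fp:
        json.dump({'msg': 'Hello World'}, fp)
    print(load_json_or_default(path))
    print(load_json_or_default(os.path.join(tmp, 'missing.json')))
```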

In [27]:
# remove special characters
def clean_text(text):
    # Ref: https://gist.github.com/Alex-Just/e86110836f3f93fe7932290526529cd1#gistcomment-3208085
    # Ref: https://en.wikipedia.org/wiki/Unicode_block
    EMOJI_PATTERN = re.compile(
        "(["
        "\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F700-\U0001F77F"  # alchemical symbols
        "\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
        "\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
        "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
        "\U0001FA00-\U0001FA6F"  # Chess Symbols
        "\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
        "\U00002702-\U000027B0"  # Dingbats
        "])"
    )
    text = re.sub(EMOJI_PATTERN, '', text)
    
    # additional cleanup
    text = text.replace('•','').replace('\n',' ')
    
    return text
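
Removing emoji and bullets as above can leave runs of several spaces behind. The sketch below shows one possible follow-up step, assuming that collapsing whitespace is acceptable for the downstream word and sentence counts. `SYMBOLS` is a deliberately reduced, hypothetical pattern (emoticons plus the bullet character), not the full emoji pattern used in `clean_text`.

```python
import re

# Reduced pattern for illustration only: emoticons plus the bullet character.
SYMBOLS = re.compile('[\U0001F600-\U0001F64F\u2022]')


def clean_and_collapse(text):
    """Drop the symbols, then squeeze every whitespace run into one space."""
    text = SYMBOLS.sub('', text)
    return re.sub(r'\s+', ' ', text).strip()


sample = 'Hello \U0001F600 \u2022 world\nnext line'
print(clean_and_collapse(sample))
```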

In [28]:
tag_filter = {
    # '3D',
    '3D Ken Burns Effect': 'Ken Burns 3D',
    # '3D Photo Inpainting',
    'AI': None,
    # 'ANN',
    # 'Ableton Live',
    'Activations': None,
    'Aeriolod': None,
    # 'AlexNet',
    'Analytics  Competition': None,
    # 'Animating Landscape',
    # 'Anomaly Detection',
    'ArtBreeder': 'Artbreeder',
    # 'Artbreeder',
    # 'AttnGAN',
    # 'AutoML',
    # 'B3D',
    # 'BASNet',
    # 'Background Removal',
    # 'Bayesian',
    # 'BigGAN',
    # 'BodyPix',
    # 'Boltzmann Machine',
    # 'CAD',
    # 'CMA-ES',
    # 'CMT',
    # 'CNN',
    # 'CPPN',
    # 'CV',
    'Camera': None,
    # 'Chatbot',
    # 'Classification',
    'Classifier': 'Classification',
    # 'Clustering',
    # 'Colorization',
    'Contentless': None,
    # 'Corpus-based synthesis',
    # 'CycleGAN',
    # 'DCGAN',
    # 'DDSP',
    # 'DL',
    # 'DLIB',
    # 'DeOldify',
    # 'Decision Tree',
    'Deep Fakes': 'DeepFake',
    # 'Deep Painterly Harmonization',
    # 'DeepDream',
    # 'DeepFake',
    # 'DeepFlow',
    # 'DenseCap',
    'Depth Map': None,
    # 'Detectron',
    'Detectron2': 'Detectron',
    'Device': None,
    'Discriminator': None,
    'Document Summarization': 'Summarization',
    # 'ESR-GAN',
    # 'Edge Detection',
    # 'Expression Detection',
    'Face Alignment': 'Face Detection',
    # 'Face Detection',
    # 'Face Recognition',
    'Face Tracking': 'Face Detection',
    'Face markers': 'Face Detection',
    'Face recognition': 'Face Recognition',
    'Facial Detection': 'Face Detection',
    'Facial Recognition': 'Face Recognition',
    # 'FastPhotoStyle',
    # 'Feature Mixing',
    'Feature vectors': None,
    'Featured Code Competition': None,
    'Featured Simulation Competition': None,
    'Featured prediction Competition': None,
    # 'Federated Learning',
    # 'Few Shot Animation',
    'First Order Motion': None,
    # 'GAN',
    # 'GBM',
    # 'GPT',
    'GPT-2': 'GPT',
    # 'GRU',
    'Game of Life': None,
    # 'GauGAN',
    # 'Gaussian Mixture Model',
    # 'Genetic Algorithm',
    'Getting Started prediction Competition': None,
    # 'Gradient Ascent',
    # 'Gradient Boosting',
    # 'Gradient Smoothing',
    # 'Grannma Magnet',
    'Hardware': None,
    'Height map': None,
    'Heightfield': None,
    'Houdini': None,
    # 'Image Captioning',
    # 'Image Segmentation',
    # 'ImageJ',
    # 'ImageNet',
    # 'Inception',
    # 'Inpainting',
    'Interpolation': None,
    # 'IoT',
    'K-Means': 'K-means',
    # 'K-means',
    # 'KNN',
    # 'Ken Burns 3D',
    # 'Kolmogorov complexity',
    # 'LSTM',
    # 'Laplacian Pyramid',
    'Lenticular': 'Lenticular Printing',
    'Lenticular Print': 'Lenticular Printing',
    # 'Linear Regression',
    # 'Logistic Regression',
    'ML': None,
    # 'Machine Translation',
    'Machine translation': 'Machine Translation',
    'Magenta': None,
    # 'Markov Chain',
    'Markov Chains': 'Markov Chain',
    'Memory Mosaic': None,
    'Microphone': None,
    'Mixture Density Networks': 'MDN',
    'Multi-Domain Multi-Modality I2I translation': 'Image2Image',
    'Multi-Style Transfer': 'Style Transfer',
    # 'Music Transformer',
    # 'N-gram',
    # 'NER',
    # 'NLG',
    # 'NLP',
    # 'NLU',
    'NN': None,
    # 'NNS',
    # 'NSynth',
    'NSynth Super': 'NSynth',
    # 'Naive Bayes',
    'Nerual CA': 'Neural Cellular Automata',
    # 'Neural Cellular Automata',
    # 'Object Detection',
    # 'Occams razor',
    'Open Pose': 'OpenPose',
    # 'OpenCV',
    # 'OpenPose',
    # 'Optical Flow',
    'Optical flow': 'Optical Flow',
    'Perlin Noise': None,
    # 'Photogrammetry',
    'Photoshop': None,
    'Pix2Pix': 'Image2Image',
    'Pix2pix': 'Image2Image',
    'Pixel2style2pixel': 'Image2Image',
    'Playground Code Competition': None,
    'Playground prediction Competition': None,
    # 'Point Cloud',
    # 'PoseNet',
    # 'ProGAN',
    'Progressively Grown GAN': 'ProGAN',
    # 'Projective Non-negative Matrix Factorization',
    # 'Quantum Computer',
    # 'QuickDraw',
    # 'RL',
    # 'RNN',
    # 'Random Forest',
    # 'Raymarching',
    # 'ReLu',
    # 'Recommender',
    'Recruitment prediction Competition': None,
    'Rectifier': 'ReLu',
    # 'Regression',
    'Reinforcement Learning': 'RL',
    # 'ResNet',
    'Research Code Competition': None,
    'Research prediction Competition': None,
    'Resnet': 'ResNet',
    # 'SIFT',
    # 'SNGAN',
    # 'SOM',
    # 'Self-attention',
    'Semantic search': 'Semantic Search',
    # 'Sentiment Analysis',
    # 'Simplex Volume Maximization',
    # 'SinGAN',
    # 'SketchRNN',
    # 'Sparse Transformer',
    # 'Speech Recognition',
    # 'Speech to text',
    # 'Style Transfer',
    # 'StyleGAN',
    'StyleGAN2': 'StyleGAN',
    'StyleTransfer': 'Style Transfer',
    # 'Super Slo Mo',
    # 'Super-Resolution',
    'Super-resolution': 'Super-Resolution',
    'Superresolution': 'Super-Resolution',
    # 'Supervised Learning',
    'Support Vector Machines': 'SVM',
    'TensorFlow.js': None,
    'Tensorflow.js': None,
    'Text Classification': 'Classification',
    'Text To Speech': 'Text To Speech',
    'Text classification': 'Classification',
    'Text to Animation of Virtual Characters': 'Text to Animation',
    # 'Texture synthesis',
    # 'Transformer',
    # 'Translation',
    # 'U-Net',
    'U-net': 'U-Net',
    # 'UMAP',
    'Unsupervised learning': 'Unsupervised Learning',
    # 'VAE',
    # 'VGG',
    'VQ-VAE': 'VAE',
    # 'VR',
    'Video StyleTransfer': 'Style Transfer',
    # 'Voice Detection',
    'Voice detection': 'Voice Detection',
    # 'Watson Beat',
    # 'Wav2Lip',
    # 'WaveGAN',
    # 'Wavenet',
    'Weights': None,
    # 'Word2Vec',
    'advanced': None,
    'animals': None,
    'arts and entertainment': 'Arts and Entertainment',
    'astronomy': 'Astronomy',
    'audio data': None,
    'automobiles and vehicles': 'Automotive',
    'banking': 'Banking',
    'basketball': 'Sports',
    'bayesian statistics': None,
    'beginner': None,
    'bigquery': 'BigQuery',
    'binary classification': 'Classification',
    'biology': 'Biology',
    # 'cDCGAN',
    'california': None,
    'categorical data': None,
    'china': None,
    'classification': 'Classification',
    'clustering': 'Clustering',
    'cnn': 'CNN',
    'computer science': None,
    'computer vision': 'CV',
    'covid19': None,
    'cuml-UMAP': 'UMAP',
    'dailychallenge': None,
    'data analytics': None,
    'data cleaning': None,
    'data visualization': None,
    'decision tree': 'Decision Tree',
    'deep learning': 'DL',
    'deepflow': 'DeepFlow',
    'dimensionality reduction': 'Dimensionality Reduction',
    'diseases': 'Diseases',
    'e-commerce services': 'E-Commerce',
    'earth and nature': 'Earth and Nature',
    'employment': 'Employment',
    'ensembling': 'Ensembling',
    'environment': 'Environment',
    'exploratory data analysis': 'Exploratory Data Analysis',
    'feature engineering': 'Feature Engineering',
    'finance': 'Finance',
    'forestry': 'Forestry',
    'games': 'Games',
    'gan': 'GAN',
    'genetics': 'Genetics',
    'geospatial analysis': 'Geospatial Analysis',
    'gpu': None,
    'gradient boosting': 'Gradient Boosting',
    'health': 'Healthcare',
    'healthcare': 'Healthcare',
    'image data': None,
    'india': None,
    'intermediate': None,
    'jobs and career': None,
    'k-means': 'K-means',
    'keras': 'Keras',
    'languages': None,
    'learn': None,
    'lightgbm': 'Gradient Boosting',
    'linear regression': 'Linear Regression',
    'linguistics': None,
    'logistic regression': 'Logistic Regression',
    'lstm': 'LSTM',
    'medicine': 'Healthcare',
    'microcontroller': None,
    'model comparison': 'Model Comparison',
    'model explainability': 'Model Explainability',
    'multiclass classification': 'Classification',
    'multilabel classification': 'Classification',
    'naive bayes': 'Naive Bayes',
    'neural networks': None,
    'nlp': 'NLP',
    'ofxSelfOrganizingMap': 'Self-organizing map',
    'openFrameworks': None,
    'optimization': None,
    'outlier analysis': 'Outlier Analysis',
    'pca': None,
    'physics': 'Physics',
    'pix2code': 'Pix2Code',
    'pix2pix': 'Image2Image',
    'plants': 'Plants',
    'pollution': 'Pollution',
    'puzzles': None,
    'python': None,
    'pytorch': None,
    # 'rGMIR',
    'random forest': 'Random Forest',
    'recommender systems': 'Recommender',
    'regression': 'Regression',
    'reinforcement learning': 'RL',
    'research': None,
    'rnn': 'RNN',
    'robotics': 'Robotics',
    'sampling': None,
    'signal processing': None,
    'simulations': None,
    'spaCy': 'NLP',
    'sports': 'Sports',
    'survey analysis': None,
    'svm': 'SVM',
    # 't-SNE',
    'tabular data': None,
    'tensorflow': None,
    'text data': None,
    'text mining': None,
    'time series analysis': 'Time Series Analysis',
    'tpu': None,
    'transfer learning': 'Transfer Learning',
    'utility script': None,
    'video games': 'Games',
    'xgboost': 'XGBoost',
}

def tag_equalizer(tags):
    tags = [tag_filter.get(x, x) for x in tags]
    tags = list(filter(None, tags))
    return tags

print(tag_equalizer(['tpu', 'rnn']))

['RNN']
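
Because several raw tags map to the same canonical tag (for example `'Pix2Pix'` and `'pix2pix'` both become `'Image2Image'`), the equalized list can contain duplicates. A minimal sketch of an order-preserving de-duplication follows; `tag_equalizer_unique` and `tag_filter_excerpt` are hypothetical names introduced for this illustration only.

```python
# Small excerpt of the mapping above, just enough for the demonstration.
tag_filter_excerpt = {
    'Pix2Pix': 'Image2Image',
    'pix2pix': 'Image2Image',
    'tpu': None,
    'rnn': 'RNN',
}


def tag_equalizer_unique(tags, mapping):
    """Map tags to canonical names, drop removed ones, keep first occurrences."""
    mapped = [mapping.get(tag, tag) for tag in tags]
    # dict.fromkeys keeps insertion order, so the first occurrence wins.
    return list(dict.fromkeys(tag for tag in mapped if tag))


unique_tags = tag_equalizer_unique(
    ['Pix2Pix', 'pix2pix', 'tpu', 'rnn', 'GAN'], tag_filter_excerpt)
print(unique_tags)
```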


In [52]:
# mapper to convert CSV to the mapping of Elasticsearch index
def mapper(row, style, extra={}):
    '''
    mapper to adapt a CSV row to the db schema

    "title"
    "summarization"
    "words"
    "sum_words"
    "link"
    "source"
    "category"
    "category_score"
    "subcategory"
    "subcategory_score"
    "tags"
    "kind"
    "ml_libs"
    "host"
    "license"
    "programming_language"
    "ml_score"
    "learn_score"
    "explore_score"
    "compete_score"
    "engagement_score"
    "date_project"
    "date_scraped"
    '''

    # kaggle competition mapping
    if style == 'kaggle_competition':
        ret = {
            'title': row['title'],
            'description': row['subtitle'] + row['description'],
            'link': row['link'],
            # 'category': '',
            # 'category_score': 0,
            # 'subcategory': '',
            # 'subcategory_score': 0,
            'tags': list(set(str_to_list(row['tags']) + str_to_list(row['type']))),
            'kind': ['Project', '(Competition)', '(Dataset)'],
            # 'ml_libs': str_to_list(row['ml_libs']),
            'host': 'www.kaggle.com',
            # 'license': row['license'],
            # 'programming_language': row['type'],
            # 'ml_score': 0,
            'learn_score': 1,
            'explore_score': 0,
            'compete_score': 0,
            'engagement_score': row['teams_score'],
            'date_project': datetime.strptime(row['date_closed'], "%Y-%m-%d %H:%M:%S") if 'date_closed' in row else '',
            # 'date_scraped': datetime.strptime(row['scraped_at'], "%Y-%m-%d %H:%M:%S"),
            # 'ml_terms': row['ml_terms'],
            # 'score_raw': json.dumps({'views': row['views'], 'votes': row['votes'], 'score_private': row['score_private'], 'score_public': row['score_public']}),
        }
    
    # kaggle dataset mapping
    if style == 'kaggle_dataset':
        ret = {
            'title': row['title'],
            'description': row['subtitle'] + row['description'],
            'link': row['link'],
            # 'category': '',
            # 'category_score': 0,
            # 'subcategory': '',
            # 'subcategory_score': 0,
            'tags': list(set(str_to_list(row['tags']) + str_to_list(row['type']))),
            'kind': ['Project', '(Dataset)'],
            # 'ml_libs': str_to_list(row['ml_libs']),
            'host': 'www.kaggle.com',
            # 'license': row['license'],
            # 'programming_language': row['type'],
            # 'ml_score': 0,
            'learn_score': 1,
            'explore_score': 0,
            'compete_score': 0,
            'engagement_score': row['teams_score'],
            'date_project': datetime.strptime(row['date_closed'], "%Y-%m-%d %H:%M:%S") if 'date_closed' in row else '',
            # 'date_scraped': datetime.strptime(row['scraped_at'], "%Y-%m-%d %H:%M:%S"),
            # 'ml_terms': row['ml_terms'],
            # 'score_raw': json.dumps({'views': row['views'], 'votes': row['votes'], 'score_private': row['score_private'], 'score_public': row['score_public']}),
        }
    
    # kaggle notebook mapping
    if style == 'kaggle_notebook':
        ret = {
            'title': row['title'],
            'description': row['description'],
            'link': row['link'],
            # 'category': '',
            # 'category_score': 0,
            # 'subcategory': '',
            # 'subcategory_score': 0,
            # notebooks have no separate 'type' tags, so de-duplicating row['tags'] is enough
            'tags': list(set(str_to_list(row['tags']))),
            'kind': ['Project', '(Notebook)'],
            'ml_libs': str_to_list(row['ml_libs']),
            'host': 'www.kaggle.com',
            'license': row['license'],
            'programming_language': row['type'],
            'ml_score': row['ml_detected'],
            'learn_score': 1,
            'explore_score': 0,
            'compete_score': 0,
            'engagement_score': row['score_views'] if 'score_views' in row else None,
            'date_project': datetime.strptime(row['date'], "%Y-%m-%d %H:%M:%S") if row['date'] != '' else None,
            'date_scraped': datetime.strptime(row['scraped_at'], "%Y-%m-%d %H:%M:%S") if row['scraped_at'] != '' else None,
            # 'ml_terms': row['ml_terms'],
            # 'score_raw': json.dumps({'views': row['views'], 'votes': row['votes'], 'score_private': row['score_private'], 'score_public': row['score_public']}),
        }

    # github mapping
    if style == 'github':
        title = row['name'] if row['name'] != '' else row['title']
        title = title.replace('-',' ').replace('_',' ').strip()
        cat_score = 1 if row['industry'] != '' else 0
        subcat_score = 1 if row['type'] != '' else 0
        #tags = row['ml_tags'] if len(row['ml_tags']) > 0 else ''
        ret = {
            'title': title,
            'description': row['description2'],
            'link': row['link'],
            'category': row['industry'],
            'category_score': cat_score,
            'subcategory': row['type'],
            'subcategory_score': subcat_score,
            'tags': str_to_list(row['ml_tags']),
            'kind': 'Project',
            'ml_libs': str_to_list(row['ml_libs']),
            'host': 'www.github.com',
            'license': row['license'],
            'programming_language': row['language_primary'],
            'ml_score': row['ml_detected'],
            'engagement_score': row['stars_score'],
            'date_project': datetime.strptime(row['pushed_at'], "%Y-%m-%d %H:%M:%S"),
            'date_scraped': datetime.strptime(row['scraped_at'], "%Y-%m-%d %H:%M:%S"),
            # 'ml_terms': row['keywords'],
            # 'score_raw': json.dumps({'stars': row['stars'], 'contributors': row['contributors']}),
        }

    # mlart mapping
    if style == 'mlart':
        title = row['Title'] if row['Title'] != '' else row['title']
        cat_score = 1 if row['Theme'] != '' else 0
        subcat_score = 1 if row['Medium'] != '' else 0
        ret = {
            'title': title,
            'description': row['subtitle'],
            'link': row['url'],
            'category': 'Miscellaneous',
            'category_score': cat_score,
            'subcategory': 'Art',
            'subcategory_score': subcat_score,
            'tags': str_to_list(row['Theme']) + str_to_list(row['Medium']) + str_to_list(row['Technology']),
            'kind': 'Showcase',
            # 'ml_libs': [],
            'host': 'mlart.co',
            # 'license': '',
            # 'programming_language': '',
            # 'ml_score': row['ml_detected'],
            'learn_score': 0,
            'explore_score': 1,
            'compete_score': 0,
            # 'engagement_score': 0,
            'date_project': datetime.strptime(row['Date'], "%Y-%m-%d"),
            'date_scraped': datetime.strptime(row['scraped_at'], "%Y-%m-%d %H:%M:%S"),
            # 'score_raw': json.dumps({'days_since_featured': row['Days Since Featured']}),
        }

    # thecleverprogrammer
    if style == 'tcp':
        ret = {
            'title': row['title'],
            'description': row['description'],
            'link': row['link'],
            # 'category': '',
            # 'category_score': 0,
            # 'subcategory': '',
            # 'subcategory_score': 0,
            'tags': str_to_list(row['ml_tags']),
            'kind': 'Project',
            'ml_libs': str_to_list(row['ml_libs']),
            'host': 'thecleverprogrammer.com',
            # 'license': '',
            'programming_language': 'Python',
            'ml_score': row['ml_score'],
            'learn_score': 1,
            'explore_score': 0,
            'compete_score': 0,
            # 'engagement_score': 0,
            'date_project': datetime.strptime(row['date'], "%Y-%m-%d %H:%M:%S"),
            'date_scraped': datetime.strptime('2020-12-20', "%Y-%m-%d"),
            # 'score_raw': json.dumps({'days_since_featured': row['Days Since Featured']}),
        }
    
    # zalando / bcgdv / medium
    if style == 'manual':
        ret = {
            'title': row['title'],
            # 'description': row['description'] if row['description'] != '' else row['text'],
            'link': row['link'],
            # 'category': '',
            # 'category_score': 0,
            # 'subcategory': '',
            # 'subcategory_score': 0,
            'tags': str_to_list(row['tags']),
            # 'kind': 'Project',
            # 'ml_libs': str_to_list(row['ml_libs']),
            # 'host': 'thecleverprogrammer.com',
            # 'license': '',
            # 'programming_language': 'Python',
            # 'ml_score': row['ml_score'],
            # 'engagement_score': 0,
            # 'date_project': datetime.strptime(row['date'], "%d.%m.%Y"),
            # 'date_scraped': datetime.strptime(row['date_scraped'], "%d.%m.%Y"),
            # 'score_raw': json.dumps({'days_since_featured': row['Days Since Featured']}),
        }
        if ret['title'] == '' and 'company' in row:
            ret['title'] = row['company']
            
        if 'description' in row and row['description'] != '':
            ret['description'] = row['description']
        else:
            ret['description'] = row['text']
            
        if 'source' in row:
            ret['source'] = row['source']
            
        if 'category' in row:
            ret['category'] = row['category']
            if 'category_score' in row:
                ret['category_score'] = row['category_score']
            elif ret['category'] != '':
                ret['category_score'] = 1
        else:
            ret['category'] = ''
            
        if 'subcategory' in row:
            ret['subcategory']= row['subcategory']
            if 'subcategory_score' in row:
                ret['subcategory_score'] = row['subcategory_score']
            elif ret['subcategory'] != '':
                ret['subcategory_score'] = 1
        else:
            ret['subcategory'] = ''
            
        if 'date' in row and row['date'] != '':
            # try the full date first, then fall back to a bare year
            try:
                ret['date_project'] = datetime.strptime(row['date'], "%d.%m.%Y")
            except ValueError:
                try:
                    ret['date_project'] = datetime.strptime(row['date'], "%Y")
                except ValueError:
                    pass
                
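        # store the parsed scrape date on the mapped item, not on the input row
        if 'date_scraped' in row and row['date_scraped'] != '':
            ret['date_scraped'] = datetime.strptime(row['date_scraped'], "%d.%m.%Y")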
        
        if 'ml_score' in row:
            ret['ml_score'] = row['ml_score']
            
        if 'learn_score' in row:
            ret['learn_score'] = row['learn_score']
            
        if 'explore_score' in row:
            ret['explore_score'] = row['explore_score']
            
        if 'compete_score' in row:
            ret['compete_score'] = row['compete_score']
        
    attach = {**extra}
    if 'tags' in attach:
        ret['tags'].extend(attach['tags'])
        attach.pop('tags')
    ret.update(attach)
    return ret
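
The mapper parses dates with several different `strptime` formats, and some branches fail on empty or unexpected values. A small helper could centralize that logic. The sketch below is an assumption about how such a helper might look: `parse_date` and `DATE_FORMATS` are hypothetical names and are not part of the notebook.

```python
from datetime import datetime

# Formats that appear in the mapper above, tried in this order.
DATE_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%d.%m.%Y", "%Y")


def parse_date(value, formats=DATE_FORMATS):
    """Return the first successful parse, or None for empty/unknown input."""
    if not value:
        return None
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    return None


print(parse_date('17.01.2021'))
print(parse_date('2020'))
print(parse_date(''))
```

Returning `None` instead of raising keeps one malformed row from aborting a whole CSV run; whether that trade-off is wanted depends on how strictly the input should be validated.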

In [30]:
# test gpu usage
import torch
torch.cuda.is_available()

True

In [31]:
# summarization loop

In [57]:
# loop to transform data row-wise
def transform_loop(csv_in, csv_format, subfolder, quit=0, overwrite=False, inplace=True, printItem=False, extra={}):
    
    with open(csv_in, encoding='utf-8') as csvfile:
        
        # let's store converted csv to temp-folder for analysis
        csv_out = '../data/database/csv/'
        json_out = '../data/database/json/'
        json_out_item = '../data/database/json/'+subfolder
        create_folder(json_out_item)
        df = pd.DataFrame()

        # readCSV = csv.reader(csvfile, delimiter=';')
        readCSV = csv.DictReader(csvfile, delimiter=';')
        # next(readCSV, None)  # skip the headers
        
        i = j = 0
        out = []
        
        for row in readCSV:
            row = mapper(row, csv_format, extra=extra)
            if printItem == True:
                print(json.dumps(row, indent=3, sort_keys=True, default=str))
            
            # check if file already exists
            link = row['link']
            md5 = hashlib.md5(link.encode("utf-8")).hexdigest()
            
            json_fp = json_out_item + md5 + '.json'
            
            old = {}
            # process the item if its file is new, or if existing files may be overwritten
            if overwrite or not os.path.isfile(json_fp):
                if os.path.isfile(json_fp):
                    old = load_data(json_fp, fromJson=True)
                
                print(i, row['link'])
                item_start = time.time()

                # clean title & description
                row['title'] = clean_text(row['title'])
                text = row['description'] = clean_text(row['description'])
                words = row['words'] = word_count(text)
                sentences = row['sentences'] = sentence_count(text)

                # create summarization
                if words > 200 and sentences > 1 and (not 'sum_nltk' in old or not 'sum_t5' in old):
                    print('summarize')
                    
                    # nltk
                    if not 'sum_nltk' in old:
                        start = time.time()
                        row['sum_nltk'] = nltk_count(text, word_count=200)
                        end = time.time()
                        dur = round(end-start,3)

                        row['sum_nltk_words'] = word_count(row['sum_nltk'])
                        row['sum_nltk_runtime'] = dur
                        print('done (nltk)', dur, 'sec')
                    
                    # t5
                    if not 'sum_t5' in old:
                        start = time.time()
                        row['sum_t5'] = t5(text)
                        end = time.time()
                        dur = round(end-start,3)

                        row['sum_t5_words'] = word_count(row['sum_t5'])
                        row['sum_t5_runtime'] = dur
                        print('done (t5)', dur, 'sec')
                
                # detect language
                if not 'language_code' in old:
                    s = row['description'] if 'description' in row and row['description'] != '' else row['title']
                    lang = lingo(s, simple=False)
                    if lang != None:
                        row['language_code'] = lang['code']
                        row['language'] = lang['language']
                        row['language_score'] = lang['probability']
                    else:
                        row['language_code'] = None
                        row['language'] = None
                        row['language_score'] = None

                # equalizer
                if 'programming_language' in row and row['programming_language'] == 'Python notebook':
                    row['programming_language'] = 'Jupyter Notebook'
                    
                if 'license' in row:
                    if row['license'] == 'Apache 2.0':
                        row['license'] = 'Apache-2.0'
                    if row['license'] == 'Learn more about GitHub Sponsors':
                        row['license'] = None
                    if row['license'] == 'Unlicense':
                        row['license'] = None
                        
                row['tags'] = tag_equalizer(row['tags'])
                

                # convert datetime to string
                if 'date_project' in row:
                    row['date_project'] = str(row['date_project'])
                if 'date_scraped' in row:
                    row['date_scraped'] = str(row['date_scraped'])
                    
                # runtime
                item_end = time.time()
                item_dur = round(item_end-item_start, 3)
                row['runtime'] = item_dur

                #df = df.append(row, ignore_index=True)

                # json encode
                #out.append(row)
                
                if overwrite == True and inplace==True:
                    row = {**old, **row}
                    drop = ['score']
                    for key in drop:
                        if key in row:
                            row.pop(key)
                    # restore category, subcategory and runtime
                    # (.get avoids a KeyError for styles that never set these keys)
                    for key in ['category', 'category_score', 'subcategory', 'subcategory_score', 'runtime']:
                        if row.get(key) == '' and key in old:
                            row[key] = old[key]
                            
                #print(row)
                #sys.exit()
                
                if row != old:
                    store_data(row, json_fp, toJson=True)
                    print('stored:', json_fp)
                j += 1

            #print(i, row['link'])
            i += 1

            # keep count of # rows processed
            if i % 100 == 0:
                print(i)

            if quit != 0 and i >= quit:
                break

        # store parsed csv
        #fp = csv_in.split('/')[-1]
        #df.to_csv(csv_out + fp, sep=';', index=False)
        #path = json_out + fp
        #path = path.replace('.csv', '.json')
        #store_data(out, path, toJson=True)
        
        print('DONE parsed', i, 'items')
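
Two ideas in the loop above are easy to miss: the output file name is derived from the MD5 hash of the link, and with `overwrite` and `inplace` the new row is laid over the previously stored item while blank fields fall back to the old values. The sketch below isolates both steps. `json_path_for` and `merge_item` are hypothetical helper names, and the example link and values are made up for illustration.

```python
import hashlib


def json_path_for(link, folder):
    """Derive a stable file name from the link, as the loop above does."""
    md5 = hashlib.md5(link.encode('utf-8')).hexdigest()
    return folder + md5 + '.json'


def merge_item(old, new, keep_if_blank=('category', 'subcategory')):
    """Overlay new on old, but keep old values where new ones are blank."""
    merged = {**old, **new}
    for key in keep_if_blank:
        if merged.get(key) == '' and key in old:
            merged[key] = old[key]
    return merged


path = json_path_for('https://example.com/item', '../data/database/json/za/')
old = {'category': 'Retail', 'sum_t5': 'cached summary'}
new = {'category': '', 'title': 'Example'}
merged = merge_item(old, new)
print(path)
print(merged)
```

Hashing the link means the same item always lands in the same file, which is what makes the "already exists" check and the cached summaries possible.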

In [59]:
# run the loop

#transform = ['ka_c', 'ka_cn', 'ka_d', 'ka_dn', 'ma', 'gh', 'tcp', 'bc']
# full list (each key once); overridden by the shorter list below for this run
transform = ['ka_c', 'ka_cn', 'ma', 'gh', 'tcp', 'bc', 'me_ft', 'bcg_fo', 'bcg_ha',
             'za_bl', 'za_jo', 'za_pr', 'za_pu']
transform = ['za_bl', 'za_jo', 'za_pr', 'za_pu']

datasets = {
    # kaggle competitions
    'ka_c': {
        'csv_in': '../data/database/kaggle_competitions_correlated_01.csv',
        'csv_format': 'kaggle_competition',
    },
    # kaggle competitions notebooks
    'ka_cn': {
        'csv_in': '../data/database/kaggle_competitions_01_original.csv',
        'csv_format': 'kaggle_notebook',
    },
    # kaggle datasets
    'ka_d': {
        'csv_in': '../data/database/kaggle_datasets_correlated_01.csv',
        'csv_format': 'kaggle_dataset',
    },
    # kaggle datasets notebooks
    'ka_dn': {
        'csv_in': '../data/database/kaggle_datasets_01_original.csv',
        'csv_format': 'kaggle_notebook',
    },
    # mlart
    'ma': {
        'csv_in': '../data/database/mlart_01_original.csv',
        'csv_format':'mlart',
        'extra': {
            'learn_score': 0,
            'explore_score': 1,
            'compete_score': 0,
        },
    },
    # github
    'gh': {
        'csv_in': '../data/database/db_04_analyzed_v02.csv',
        'csv_format': 'github',
        'extra': {
            'learn_score': 1,
            'explore_score': 0,
            'compete_score': 0.25,
        },
    },
    # thecleverprogrammer
    'tcp': {
        'csv_in': '../data/database/thecleverprogrammer_01_original.csv',
        'csv_format': 'tcp',
        'extra': {
            'learn_score': 1,
            'explore_score': 0,
            'compete_score': 0,
        },
    },
    # blobcity
    'bc': {
        'csv_in': '../data/database/blobcity_02_analyzed.csv',
        'csv_format': 'github',
        'extra': {
            'learn_score': 1,
            'explore_score': 0,
            'compete_score': 0,
        },
    },
    # medium_fintech
    'me_ft': {
        'csv_in': '../data/database/medium_fintech_01.csv',
        'csv_format': 'manual',
        'extra': {
            'host': 'medium.com',
            'kind': 'Article',
            'learn_score': 0,
            'explore_score': 0,
            'compete_score': 1,
        },
        'out': 'me'
    },
    # bcgdv founded
    'bcg_fo': {
        'csv_in': '../data/database/bcgdv_founded_01.csv',
        'csv_format': 'manual',
        'extra': {
            'host': 'bcgdv.com',
            'kind': 'Article',
            'learn_score': 0,
            'explore_score': 0,
            'compete_score': 1,
        },
        'out': 'bcg'
    },
    # bcgdv hackathon (key and file name keep the original spelling)
    'bcg_ha': {
        'csv_in': '../data/database/bcgdv_hackaton_01.csv',
        'csv_format': 'manual',
        'extra': {
            'host': 'bcgdv.com',
            'kind': ['Article', 'Project'],
            'learn_score': 1,
            'explore_score': 0,
            'compete_score': 1,
        },
        'out': 'bcg'
    },
    # zalando blog
    'za_bl': {
        'csv_in': '../data/database/zalando_blog_01.csv',
        'csv_format': 'manual',
        'extra': {
            'host': 'zalando.com',
            'kind': 'Article'
        },
        'out': 'za'
    },
    # zalando jobs
    'za_jo': {
        'csv_in': '../data/database/zalando_jobs_01.csv',
        'csv_format': 'manual',
        'extra': {
            'host': 'zalando.com',
            'kind': 'Article',
            'tags': ['Fashion'],
            'learn_score': 0,
            'explore_score': 0,
            'compete_score': 1,
        },
        'out': 'za'
    },
    # zalando research projects
    'za_pr': {
        'csv_in': '../data/database/zalando_projects_01.csv',
        'csv_format': 'manual',
        'extra': {
            'host': 'zalando.com',
            'kind': 'Article',
        },
        'out': 'za'
    },
    # zalando research publications
    'za_pu': {
        'csv_in': '../data/database/zalando_publications_04.csv',
        'csv_format': 'manual',
        'extra': {
            'host': 'zalando.com',
            'kind': 'Article',
            'date_scraped': datetime.strptime('17.01.2021', "%d.%m.%Y"),
            'tags': ['Fashion'],
            'learn_score': 0.5,
            'explore_score': 0,
            'compete_score': 0.75,
        },
        'out': 'za'
    },
}

    
for key in transform:
    print(key)
    item = datasets[key]
    extra = item['extra'] if 'extra' in item else {}
    out = key+'/' if not 'out' in item else item['out']+'/'
    printItem=False
    transform_loop(item['csv_in'], item['csv_format'], out, overwrite=True, extra=extra, printItem=printItem)

za_bl
0 https://github.com/flairNLP/flair
stored: ../data/database/json/za/0d4e98dc5312c9b4ec96665d5af364b0.json
1 https://engineering.zalando.com/posts/2017/03/deep-learning-in-production-for-predicting-consumer-behavior.html
stored: ../data/database/json/za/5cbb244aa92009613710b223c31b5f3a.json
2 https://engineering.zalando.com/posts/2018/09/shop-look-deep-learning.html
stored: ../data/database/json/za/fbc34098cc6faa78a18deb9141f2e930.json
DONE parsed 3 items
za_jo
0 https://jobs.zalando.com/en/jobs/2419780-senior-applied-scientist-w-m-d-advice-and-inspiration/?gh_src=gk03hq
stored: ../data/database/json/za/0641bcf216a7a29247bd3fda748a903a.json
1 https://jobs.zalando.com/en/jobs/2523598--senior-research-scientist-builder-platform-and-ai/?gh_src=gk03hq
stored: ../data/database/json/za/4aceb49e319eb060f76bf601951cfc34.json
2 https://jobs.zalando.com/en/jobs/2261169-senior-python-backend-engineer-competitive-analytics-engineering/?gh_src=gk03hq
stored: ../data/database/json/za/39f7a137f

In [None]:
# zero-shot categorization is computationally intensive,
# so let's keep it out of the loop and process it separately
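
The classification cell further down reads `res['category']`, `res['score']` and `res['runtime']` from every `categorize(...)` call. The real `categorize` is defined elsewhere in this notebook; the block below is only a minimal sketch of a helper with that return shape. The names `categorize_sketch` and `stub_classifier` are hypothetical, and the sketch assumes the classifier returns `labels` and `scores` sorted by descending score (the shape a Hugging Face zero-shot pipeline produces).

```python
import time


def categorize_sketch(text, labels, classifier):
    """Hypothetical stand-in for the notebook's categorize().

    `classifier` is any callable taking (text, labels) and returning a dict
    with 'labels' and 'scores', both sorted by descending score.
    """
    start = time.time()
    result = classifier(text, labels)
    return {
        'category': result['labels'][0],   # best-scoring label
        'score': result['scores'][0],      # its score
        'runtime': round(time.time() - start, 3),
    }


def stub_classifier(text, labels):
    """Toy scorer for demonstration only: counts label occurrences in the text."""
    counts = [(text.lower().count(label.lower()), label) for label in labels]
    total = sum(count for count, _ in counts) or 1
    ranked = sorted(counts, key=lambda pair: pair[0], reverse=True)
    return {
        'sequence': text,
        'labels': [label for _, label in ranked],
        'scores': [count / total for count, _ in ranked],
    }


res = categorize_sketch('A fashion dataset for fashion research',
                        ['health', 'fashion'], stub_classifier)
print(res['category'], res['score'])
```

With a real model, `classifier` could be a thin wrapper around a zero-shot pipeline, for example `lambda t, l: zs(t, candidate_labels=l)` where `zs` is an already loaded pipeline object. Passing the classifier in as an argument keeps the slow model load out of the per-row call.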

In [None]:
print(cat)
print(subcat)

In [None]:
# classification

folder = '../data/database/json/'
subfolder = os.listdir(folder)
#print(subfolder)

#transform = ['ka_c', 'ka_cn', 'ka_d', 'ka_dn', 'ma', 'gh', 'tcp', 'bc']
transform = ['ka_c', 'ka_cn', 'ma', 'gh', 'tcp', 'bc']
#transform = ['ma']

recreate_category = False
save = True
categorzie_t5 = False
categorize_nltk = True
categorize_fallback = True

quit = 0
i = j = 0
for item in subfolder:
    print('folder', item)
    fp = os.path.join(folder, item)
    if os.path.isdir(fp) and item in transform:
        print('###')
        print(item)
        files = os.listdir(fp)
        print('files in folder:', len(files))
        for file in files:
            row = load_data(os.path.join(folder, item, file), fromJson=True)
            #print(row)
            
            print('row:', i, 'item:', j, 'link:', row['link'], 'file:', file)
            
            # zero shot categorization
            if row.get('category', '') == '' or recreate_category:
                print('categorize')
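                start = time.time()
                j += 1

                # reset per-row results so that values from a previous row cannot
                # leak in (and no NameError is raised) when every branch is skipped
                c = sc = ''
                c_score = sc_score = 0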

                # create category and subcategory from t5
                if 'sum_t5' in row and row['sum_t5'] != '' and categorzie_t5 == True:
                    s = row['sum_t5']
                    res = categorize(s, cat)
                    #row['t5_category_raw'] = res
                    c = row['t5_category'] = res['category']
                    c_score = row['t5_category_score'] = res['score']
                    row['t5_category_runtime'] = res['runtime']
                    print('t5 category', res['runtime'], 'sec')

                    res = categorize(s, subcat)
                    #row['t5_subcategory_raw'] = res
                    sc = row['t5_subcategory'] = res['category']
                    sc_score = row['t5_subcategory_score'] = res['score']
                    row['t5_subcategory_runtime'] = res['runtime']
                    print('t5 subcategory', res['runtime'], 'sec')
                else:
                    print('t5 skipped')

                # create category and subcategory from nltk
                if 'sum_nltk' in row and row['sum_nltk'] != '' and categorize_nltk == True:
                    s = row['sum_nltk']
                    res = categorize(s, cat)
                    #print(res)
                    #row['nltk_category_raw'] = res
                    c = row['nltk_category'] = res['category']
                    c_score = row['nltk_category_score'] = res['score']
                    row['nltk_category_runtime'] = res['runtime']
                    print('nltk category', res['runtime'], 'sec')

                    res = categorize(s, subcat)
                    #print(res)
                    #row['nltk_subcategory_raw'] = res
                    sc = row['nltk_subcategory'] = res['category']
                    sc_score = row['nltk_subcategory_score'] = res['score']
                    row['nltk_subcategory_runtime'] = res['runtime']
                    print('nltk subcategory', res['runtime'], 'sec')
                else:
                    print('nltk skipped')

                # create category and subcategory from title or description if not already done
                if categorize_fallback == True and not 't5_category' in row and not 'nltk_category' in row:
                    if len(row['description']) > 0:
                        s = row['description']
                        res = categorize(s, cat)
                        #row['description_category_raw'] = res
                        c = row['description_category'] = res['category']
                        c_score = row['description_category_score'] = res['score']
                        row['description_category_runtime'] = res['runtime']
                        print('description category', res['runtime'], 'sec')

                        res = categorize(s, subcat)
                        #row['description_subcategory_raw'] = res
                        sc = row['description_subcategory'] = res['category']
                        sc_score = row['description_subcategory_score'] = res['score']
                        row['description_subcategory_runtime'] = res['runtime']
                        print('description subcategory', res['runtime'], 'sec')
                    else:
                        s = row['title']
                        if s != '':
                            res = categorize(s, cat)
                            #row['title_category_raw'] = res
                            c = row['title_category'] = res['category']
                            c_score = row['title_category_score'] = res['score']
                            row['title_category_runtime'] = res['runtime']
                            print('title category', res['runtime'], 'sec')

                            res = categorize(s, subcat)
                            #row['title_subcategory_raw'] = res
                            sc = row['title_subcategory'] = res['category']
                            sc_score = row['title_subcategory_score'] = res['score']
                            row['title_subcategory_runtime'] = res['runtime']
                            print('title subcategory', res['runtime'], 'sec')
                        else:
                            print('nothing found to categorize')
                            c = sc = ''
                            c_score = sc_score = 0
                            j -= 1

                row['category'] = c
                row['category_score'] = c_score
                row['subcategory'] = sc
                row['subcategory_score'] = sc_score

                end = time.time()
                dur = round(end-start, 3)
                row['runtime_cat'] = dur
                
                fp = os.path.join(folder, item, file)
                if save == True:
                    store_data(row, fp, toJson=True)
                else:
                    print('NOT SAVED')
                    print(row)
            
            i += 1
            
            if i%100 == 0:
                print(i)
            
            if quit != 0 and i >= quit:
                break
    if quit != 0 and i >= quit:
        break
            
print('DONE parsed', i, 'items')
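
In the loop above, each branch overwrites `c` and `sc`, so the final `category` comes from nltk when it ran, otherwise from t5, and from the description or title only when neither summary was categorized. The sketch below makes that precedence explicit on an already processed row. The function `pick_category` and the extra `source` key are hypothetical and not part of the notebook.

```python
def pick_category(row, sources=('nltk', 't5', 'description', 'title')):
    """Return the final category fields from the first source present in `row`.

    The default order mirrors the loop above: nltk wins over t5, and the
    description/title fallbacks are only used when neither is available.
    """
    for source in sources:
        key = source + '_category'
        if row.get(key):
            return {
                'category': row[key],
                'category_score': row.get(key + '_score', 0),
                'subcategory': row.get(source + '_subcategory', ''),
                'subcategory_score': row.get(source + '_subcategory_score', 0),
                'source': source,
            }
    # nothing was categorized for this row
    return {'category': '', 'category_score': 0,
            'subcategory': '', 'subcategory_score': 0, 'source': None}


example_row = {
    't5_category': 'Health', 't5_category_score': 0.61,
    'nltk_category': 'Science', 'nltk_category_score': 0.83,
    'nltk_subcategory': 'Biology', 'nltk_subcategory_score': 0.4,
}
picked = pick_category(example_row)
print(picked['category'], picked['source'])
```

Reordering `sources` changes which method takes priority without touching the rest of the code.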