In [1]:
# this notebook converts the CSV to the Elasticsearch (ES) index mapping

In [2]:
# imports
from datetime import datetime
import hashlib
import json
import sys
import csv
import os
import pandas as pd
import re
import time
# note: gensim.summarization was removed in gensim 4.0, so these two imports need gensim < 4.0
from gensim.summarization.summarizer import summarize
from gensim.summarization import keywords
import unicodedata
import nltk
from nltk.corpus import stopwords

In [3]:
# some long text
# source: https://www.kaggle.com/c/stanford-covid-vaccine
text1 = '''
Winning the fight against the COVID-19 pandemic will require an effective vaccine that can be equitably and widely distributed. Building upon decades of research has allowed scientists to accelerate the search for a vaccine against COVID-19, but every day that goes by without a vaccine has enormous costs for the world nonetheless. We need new, fresh ideas from all corners of the world. Could online gaming and crowdsourcing help solve a worldwide pandemic? Pairing scientific and crowdsourced intelligence could help computational biochemists make measurable progress.
mRNA vaccines have taken the lead as the fastest vaccine candidates for COVID-19, but currently, they face key potential limitations. One of the biggest challenges right now is how to design super stable messenger RNA molecules (mRNA). Conventional vaccines (like your seasonal flu shots) are packaged in disposable syringes and shipped under refrigeration around the world, but that is not currently possible for mRNA vaccines.
Researchers have observed that RNA molecules have the tendency to spontaneously degrade. This is a serious limitation--a single cut can render the mRNA vaccine useless. Currently, little is known on the details of where in the backbone of a given RNA is most prone to being affected. Without this knowledge, current mRNA vaccines against COVID-19 must be prepared and shipped under intense refrigeration, and are unlikely to reach more than a tiny fraction of human beings on the planet unless they can be stabilized.
The Eterna community, led by Professor Rhiju Das, a computational biochemist at Stanford’s School of Medicine, brings together scientists and gamers to solve puzzles and invent medicine. Eterna is an online video game platform that challenges players to solve scientific problems such as mRNA design through puzzles. The solutions are synthesized and experimentally tested at Stanford by researchers to gain new insights about RNA molecules. The Eterna community has previously unlocked new scientific principles, made new diagnostics against deadly diseases, and engaged the world’s most potent intellectual resources for the betterment of the public. The Eterna community has advanced biotechnology through its contribution in over 20 publications, including advances in RNA biotechnology.
In this competition, we are looking to leverage the data science expertise of the Kaggle community to develop models and design rules for RNA degradation. Your model will predict likely degradation rates at each base of an RNA molecule, trained on a subset of an Eterna dataset comprising over 3000 RNA molecules (which span a panoply of sequences and structures) and their degradation rates at each position. We will then score your models on a second generation of RNA sequences that have just been devised by Eterna players for COVID-19 mRNA vaccines. These final test sequences are currently being synthesized and experimentally characterized at Stanford University in parallel to your modeling efforts -- Nature will score your models!
Improving the stability of mRNA vaccines was a problem that was being explored before the pandemic but was expected to take many years to solve. Now, we must solve this deep scientific challenge in months, if not weeks, to accelerate mRNA vaccine research and deliver a refrigerator-stable vaccine against SARS-CoV-2, the virus behind COVID-19. The problem we are trying to solve has eluded academic labs, industry R&D groups, and supercomputers, and so we are turning to you. To help, you can join the team of video game players, scientists, and developers at Eterna to unlock the key in our fight against this devastating pandemic. 
'''

# and a short one
text2 = 'The quick brown fox jumps over the lazy dog'

In [4]:
# function to count words
def word_count(text):
    if isinstance(text, str):
        s = text.split(' ')
        return len(s)
    else:
        return 0

print('words:', word_count(text1))
print('words:', word_count(text2))
print('words:', word_count(None))

words: 564
words: 9
words: 0
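The counter above splits on single spaces only, so newlines do not separate words and repeated spaces produce empty tokens. A minimal sketch of a whitespace-aware variant, assuming that behaviour is wanted; the name `word_count_ws` is hypothetical and its counts will differ from the outputs shown above:

```python
def word_count_ws(text):
    """Count whitespace-separated tokens; non-string input counts as 0."""
    if not isinstance(text, str):
        return 0
    # str.split() without arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and drops empty tokens.
    return len(text.split())


print(word_count_ws('The quick  brown\nfox'))
print(word_count_ws(None))
```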


In [5]:
# function to count sentences
def sentence_count(text):
    if isinstance(text, str):
        s = text.split('. ')
        return len(s)
    else:
        return 0

print('sentences:', sentence_count(text1))
print('sentences:', sentence_count(text2))
print('sentences:', sentence_count(None))

sentences: 20
sentences: 1
sentences: 0
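Splitting on `'. '` ignores sentences that end in `?` or `!` and sentences separated by a newline. The following is a rough sketch of a regex-based alternative, not the counter used for the numbers above; `sentence_count_re` is a hypothetical name, and abbreviations such as "e.g. " would still be counted as sentence ends:

```python
import re


def sentence_count_re(text):
    """Count sentences by splitting after ., ! or ? followed by whitespace."""
    if not isinstance(text, str):
        return 0
    # The lookbehind keeps the punctuation attached to the sentence.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    # Drop empty pieces so an empty string counts as 0 sentences.
    return len([p for p in parts if p])


print(sentence_count_re('Is it stable? Not yet. We keep trying!'))
```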


In [6]:
# extractive summarization

In [7]:
# extractive summarization down to a ratio of the original text (uses gensim's summarize, despite the nltk_ prefix)
def nltk_ratio(text, ratio=0.25):
    return summarize(text, ratio=ratio)

sum_nltk_ratio = nltk_ratio(text1, ratio=0.25)
print('words:', word_count(sum_nltk_ratio))
print(sum_nltk_ratio)

words: 139
Eterna is an online video game platform that challenges players to solve scientific problems such as mRNA design through puzzles.
The solutions are synthesized and experimentally tested at Stanford by researchers to gain new insights about RNA molecules.
We will then score your models on a second generation of RNA sequences that have just been devised by Eterna players for COVID-19 mRNA vaccines.
Improving the stability of mRNA vaccines was a problem that was being explored before the pandemic but was expected to take many years to solve.
Now, we must solve this deep scientific challenge in months, if not weeks, to accelerate mRNA vaccine research and deliver a refrigerator-stable vaccine against SARS-CoV-2, the virus behind COVID-19.
To help, you can join the team of video game players, scientists, and developers at Eterna to unlock the key in our fight against this devastating pandemic.


In [8]:
# extractive summarization down to roughly n words (uses gensim's summarize, despite the nltk_ prefix)
def nltk_count(text, word_count=100):
    return summarize(text, word_count=word_count)

sum_nltk_count = nltk_count(text1, word_count=100)
print('words:', word_count(sum_nltk_count))
print(sum_nltk_count)

words: 98
Eterna is an online video game platform that challenges players to solve scientific problems such as mRNA design through puzzles.
We will then score your models on a second generation of RNA sequences that have just been devised by Eterna players for COVID-19 mRNA vaccines.
Now, we must solve this deep scientific challenge in months, if not weeks, to accelerate mRNA vaccine research and deliver a refrigerator-stable vaccine against SARS-CoV-2, the virus behind COVID-19.
To help, you can join the team of video game players, scientists, and developers at Eterna to unlock the key in our fight against this devastating pandemic.
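The two cells above rely on `gensim.summarization`, which is not available in gensim 4.x. As a dependency-free illustration of extractive summarization, here is a sketch of a different and much simpler technique: scoring sentences by average word frequency. It is not the TextRank-based method gensim implements, and `freq_summarize` is a hypothetical helper:

```python
import re
from collections import Counter


def freq_summarize(text, n_sentences=2):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    freq = Counter(re.findall(r'[a-z0-9]+', text.lower()))

    def score(sentence):
        tokens = re.findall(r'[a-z0-9]+', sentence.lower())
        # Average frequency, so long sentences are not favoured by length alone.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])  # restore document order
    return ' '.join(sentences[i] for i in keep)


sample = ('RNA vaccines need RNA stability. The weather is nice. '
          'RNA design improves RNA vaccines.')
print(freq_summarize(sample, n_sentences=2))
```

Because every sentence is scored against the whole text, off-topic sentences with rare words tend to drop out first.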


In [9]:
# abstractive summarization
# https://www.machinelearningplus.com/nlp/text-summarization-approaches-nlp-example/

In [10]:
# BART
# Importing the model
from transformers import BartForConditionalGeneration, BartTokenizer, BartConfig

In [11]:
'''
# Loading the model and tokenizer for bart-large-cnn
tokenizer=BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model=BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
#'''

In [12]:
'''
# Encoding the inputs and passing them to model.generate()
def bart(text):
    inputs = tokenizer.batch_encode_plus([text],return_tensors='pt')
    summary_ids = model.generate(inputs['input_ids'], early_stopping=True)

    # Decoding and printing the summary
    bart_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    
    return bart_summary

# long text
start = time.time()
sum_bart_l = bart(text1)
end = time.time()

print('### long text ###')
print('runtime:', end-start)
print('words:', word_count(sum_bart_l))
print('sentences:', sentence_count(sum_bart_l))
print(sum_bart_l)
print('')

# short text
print('### short text ###')
start = time.time()
sum_bart_s = bart(text2)
end = time.time()

print('runtime:', end-start)
print('words:', word_count(sum_bart_s))
print('sentences:', sentence_count(sum_bart_s))
print(sum_bart_s)
#'''

In [13]:
# T5
# https://towardsdatascience.com/summarize-reddit-comments-using-t5-bart-gpt-2-xlnet-models-a3e78a5ab944
from transformers import T5Tokenizer, T5ForConditionalGeneration
model = T5ForConditionalGeneration.from_pretrained('t5-base')
tokenizer = T5Tokenizer.from_pretrained('t5-base')

Some weights of the model checkpoint at t5-base were not used when initializing T5ForConditionalGeneration: ['decoder.block.0.layer.1.EncDecAttention.relative_attention_bias.weight']
- This IS expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [14]:
def t5(text):
    Preprocessed_text = "summarize: " + text
    tokens_input = tokenizer.encode(Preprocessed_text,return_tensors="pt", max_length=512, truncation=True)
    summary_ids = model.generate(tokens_input, min_length=100, max_length=180, length_penalty=4.0)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

'''
# long text
start = time.time()
sum_t5_l = t5(text1)
end = time.time()

print('### long text ###')
print('runtime:', end-start)
print('words:', word_count(sum_t5_l))
print('sentences:', sentence_count(sum_t5_l))
print(sum_t5_l)
print('')

# short text
print('### short text ###')
start = time.time()
sum_t5_s = t5(text2)
end = time.time()

print('runtime:', end-start)
print('words:', word_count(sum_t5_s))
print('sentences:', sentence_count(sum_t5_s))
print(sum_t5_s)
#'''

In [15]:
# industry categories

# https://www.census.gov/programs-surveys/aces/information/iccl.html
cat_sic = ['Agriculture','Forestry','Fishing','Mining','Construction','Manufacturing','Transportation','Communications','Electric','Gas','Sanitary','Wholesale Trade','Retail Trade','Finance','Insurance','Real Estate','Services','Public Administration']
# https://www.marketing91.com/19-types-of-business-industries/
cat_19 = ['Aerospace','Transport','Computer','Telecommunication','Agriculture','Construction','Education','Pharmaceutical','Food','Health care','Hospitality','Entertainment','News Media','Energy','Manufacturing','Music','Mining','Worldwide web','Electronics']
# https://simplicable.com/new/industries
cat_simple = ['Advertising','Agriculture','Communication','Construction','Creative','Education','Entertainment','Fashion','Finance','Health care','Information Technology','Manufacturing','Media','Retail','Research','Robotics','Space']

cat = ['Accommodation & Food','Accounting','Agriculture','Banking & Insurance','Biotechnological & Life Sciences','Construction & Engineering','Economics','Education & Research','Emergency & Relief','Finance','Government and Public Works','Healthcare','Justice, Law and Regulations','Manufacturing','Media & Publishing','Miscellaneous','Physics','Real Estate, Rental & Leasing','Utilities','Wholesale & Retail']
subcat = ['Failure','Food','Fraud','General','Genomics','Insurance and Risk','Judicial Applied','Life-sciences','Machine Learning','Maintenance','Management and Operations','Marketing','Material Science','Physical','Policy and Regulatory','Politics','Preventative and Reactive','Quality','Real Estate','Rental & Leasing','Restaurant','Retail','School','Sequencing','Social Policies','Student','Textual Analysis','Tools','Tourism','Trading & Investment','Transportation','Valuation','Water & Pollution','Wholesale']

In [16]:
# zero shot classification
# https://towardsdatascience.com/zero-shot-text-classification-with-hugging-face-7f533ba83cd6
from transformers import pipeline
classifier = pipeline("zero-shot-classification")

Some weights of the model checkpoint at facebook/bart-large-mnli were not used when initializing BartModel: ['model.encoder.version', 'model.decoder.version']
- This IS expected if you are initializing BartModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BartModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at facebook/bart-large-mnli were not used when initializing BartForSequenceClassification: ['model.encoder.version', 'model.decoder.version']
- This IS expected if you are initializing BartForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification m

In [17]:
'''
# test classification with the extractive summary (100 words)

s = nltk_count(text1, word_count=100)

start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_sic = classifier(s, cat_sic, multi_class=True)
end = time.time()

print('runtime sic:', end-start)
#print(res_sic)
print(res_sic['labels'][0:3])
print(res_sic['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_c19 = classifier(s, cat_19, multi_class=True)
end = time.time()

print('runtime c19:', end-start)
#print(res_c19)
print(res_c19['labels'][0:3])
print(res_c19['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_simple = classifier(s, cat_simple, multi_class=True)
end = time.time()

print('runtime simple:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat)
res_simple = classifier(s, cat, multi_class=True)
end = time.time()

print('runtime category:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), subcat)
res_simple = classifier(s, subcat, multi_class=True)
end = time.time()

print('runtime subcategory:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])
#'''

In [18]:
'''
# test classification with t5
start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_sic = classifier(sum_t5_l, cat_sic, multi_class=True)
end = time.time()

print('runtime sic:', end-start)
#print(res_sic)
print(res_sic['labels'][0:3])
print(res_sic['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_c19 = classifier(sum_t5_l, cat_19, multi_class=True)
end = time.time()

print('runtime c19:', end-start)
#print(res_c19)
print(res_c19['labels'][0:3])
print(res_c19['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_simple = classifier(sum_t5_l, cat_simple, multi_class=True)
end = time.time()

print('runtime simple:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat)
res_simple = classifier(sum_t5_l, cat, multi_class=True)
end = time.time()

print('runtime category:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), subcat)
res_simple = classifier(sum_t5_l, subcat, multi_class=True)
end = time.time()

print('runtime subcategory:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])
#'''

In [19]:
'''
# test classification with bart

start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_sic = classifier(sum_bart_l, cat_sic, multi_class=True)
end = time.time()

print('runtime sic:', end-start)
#print(res_sic)
print(res_sic['labels'][0:3])
print(res_sic['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_c19 = classifier(sum_bart_l, cat_19, multi_class=True)
end = time.time()

print('runtime c19:', end-start)
#print(res_c19)
print(res_c19['labels'][0:3])
print(res_c19['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_simple = classifier(sum_bart_l, cat_simple, multi_class=True)
end = time.time()

print('runtime simple:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat)
res_simple = classifier(sum_bart_l, cat, multi_class=True)
end = time.time()

print('runtime category:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), subcat)
res_simple = classifier(sum_bart_l, subcat, multi_class=True)
end = time.time()

print('runtime subcategory:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])
#'''

In [20]:
# category function
def categorize(text, categories, first=True, treshold=0, runtime=False):
    # first=True:  return only the top label (None if its score is below treshold)
    # first=False: return every label whose score is >= treshold
    start = time.time()
    # note: newer transformers releases rename multi_class to multi_label
    res = classifier(text, categories, multi_class=True)
    #print(res)
    end = time.time()
    dur = round(end-start, 3)
    if first:
        ret = {
            'category': res['labels'][0],
            'score': res['scores'][0],
        } if res['scores'][0] >= treshold else {
            'category': None,
            'score': None,
        }
    else:
        ret = {label: score for label, score in zip(res['labels'], res['scores']) if score >= treshold}

    # optionally report how long the classifier call took (seconds)
    if runtime:
        ret['runtime'] = dur
    return ret


print(categorize(nltk_count(text1), cat, first=False, treshold=0.5))
print(categorize(nltk_count(text1), cat, first=True, treshold=0.9))
print(categorize(nltk_count(text1), cat, first=False, treshold=0.5, runtime=True))
print(categorize(nltk_count(text1), cat, first=True, treshold=0.9, runtime=True))

{'Biotechnological & Life Sciences': 0.8398016095161438, 'Healthcare': 0.7811025977134705, 'Education & Research': 0.7176370620727539, 'Utilities': 0.6823796629905701}
{'category': None, 'score': None}
{'Biotechnological & Life Sciences': 0.8398016095161438, 'Healthcare': 0.7811025977134705, 'Education & Research': 0.7176370620727539, 'Utilities': 0.6823796629905701, 'runtime': 17.529}
{'category': None, 'score': None, 'runtime': 20.315}
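The selection logic of `categorize` can be tried without loading the model. The sketch below applies the same top-label and threshold rules to a hand-written result dict; `pick_labels` and `fake_res` are hypothetical, and the scores are made up for illustration (sorted by descending score, as the pipeline output above appears to be):

```python
def pick_labels(res, first=True, threshold=0.0):
    """Apply top-label / threshold selection to a dict with 'labels' and 'scores'."""
    pairs = list(zip(res['labels'], res['scores']))
    if first:
        label, score = pairs[0]
        if score >= threshold:
            return {'category': label, 'score': score}
        return {'category': None, 'score': None}
    # Keep every label that reaches the threshold.
    return {label: score for label, score in pairs if score >= threshold}


# Made-up classifier output for illustration only.
fake_res = {
    'labels': ['Healthcare', 'Finance', 'Physics'],
    'scores': [0.78, 0.40, 0.05],
}
print(pick_labels(fake_res, first=False, threshold=0.25))
print(pick_labels(fake_res, first=True, threshold=0.9))
```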


In [21]:
'''
# measure error threshold
csv_in = '../data/database/db_04_analyzed_v02.csv'
csv_out = '../data/database/categorizer.csv'
df = pd.read_csv(csv_in, sep=';')
print(df.shape)

df_out = []
quit = 0
match_c = sim_c = match_sc = sim_sc = 0

start = time.time()
for index, row in df.iterrows():
    print('###')
    print(index, row['link'])
    c = row['industry']
    sc = row['type']
    d = row['description']
    item = {
        'link': row['link'],
        'category': c,
        'subcategory': sc,
    }
    
    print('category:', c)
    try:
        c_guess = categorize(d, cat, first=False, treshold=0.25)
    except:
        c_guess = {}
    print('guess:', c_guess)
    item['category_guess'] = json.dumps(c_guess)
    item['category_match'] = False
    item['category_similar'] = False
    if len(c_guess) > 0:
        keys = list(c_guess.keys())
        if c == keys[0]:
            print('MATCH')
            item['category_match'] = True
            match_c += 1
        elif c in keys:
            print('SIMILAR')
            item['category_similar'] = True
            sim_c += 1
    
    print('subcategory:', sc)
    try:
        sc_guess = categorize(d, subcat, first=False, treshold=0.25)
    except:
        sc_guess = {}
    print('guess:', sc_guess)
    item['subcategory_guess'] = json.dumps(sc_guess)
    item['subcategory_match'] = False
    item['subcategory_similar'] = False
    if len(sc_guess) > 0:
        keys = list(sc_guess.keys())
        if sc == keys[0]:
            print('MATCH')
            item['subcategory_match'] = True
            match_sc += 1
        elif sc in keys:
            print('SIMILAR')
            item['subcategory_similar'] = True
            sim_sc += 1
    
    df_out.append(item)
    
    if quit != 0 and index+1 >= quit:
        break
end = time.time()

df_out = pd.DataFrame(df_out)
df_out.to_csv(csv_out, sep=';', index=False)
print('done in', round(end-start, 3), 'sec')
print(index+1, match_c, sim_c, match_sc, sim_sc)
'''

In [22]:
# language detection
# https://towardsdatascience.com/how-to-detect-and-translate-languages-for-nlp-project-dfd52af0c3b5
from langdetect import detect, detect_langs, DetectorFactory

language_codes = {'af': 'afrikaans', 'sq': 'albanian', 'am': 'amharic', 'ar': 'arabic', 'hy': 'armenian', 'az': 'azerbaijani', 'eu': 'basque', 'be': 'belarusian', 'bn': 'bengali', 'bs': 'bosnian', 'bg': 'bulgarian', 'ca': 'catalan', 'ceb': 'cebuano', 'ny': 'chichewa', 'zh-cn': 'chinese (simplified)', 'zh-tw': 'chinese (traditional)', 'co': 'corsican', 'hr': 'croatian', 'cs': 'czech', 'da': 'danish', 'nl': 'dutch', 'en': 'english', 'eo': 'esperanto', 'et': 'estonian', 'tl': 'filipino', 'fi': 'finnish', 'fr': 'french', 'fy': 'frisian', 'gl': 'galician', 'ka': 'georgian', 'de': 'german', 'el': 'greek', 'gu': 'gujarati', 'ht': 'haitian creole', 'ha': 'hausa', 'haw': 'hawaiian', 'iw': 'hebrew', 'hi': 'hindi', 'hmn': 'hmong', 'hu': 'hungarian', 'is': 'icelandic', 'ig': 'igbo', 'id': 'indonesian', 'ga': 'irish', 'it': 'italian', 'ja': 'japanese', 'jw': 'javanese', 'kn': 'kannada', 'kk': 'kazakh', 'km': 'khmer', 'ko': 'korean', 'ku': 'kurdish (kurmanji)', 'ky': 'kyrgyz', 'lo': 'lao', 'la': 'latin', 'lv': 'latvian', 'lt': 'lithuanian', 'lb': 'luxembourgish', 'mk': 'macedonian', 'mg': 'malagasy', 'ms': 'malay', 'ml': 'malayalam', 'mt': 'maltese', 'mi': 'maori', 'mr': 'marathi', 'mn': 'mongolian', 'my': 'myanmar (burmese)', 'ne': 'nepali', 'no': 'norwegian', 'ps': 'pashto', 'fa': 'persian', 'pl': 'polish', 'pt': 'portuguese', 'pa': 'punjabi', 'ro': 'romanian', 'ru': 'russian', 'sm': 'samoan', 'gd': 'scots gaelic', 'sr': 'serbian', 'st': 'sesotho', 'sn': 'shona', 'sd': 'sindhi', 'si': 'sinhala', 'sk': 'slovak', 'sl': 'slovenian', 'so': 'somali', 'es': 'spanish', 'su': 'sundanese', 'sw': 'swahili', 'sv': 'swedish', 'tg': 'tajik', 'ta': 'tamil', 'te': 'telugu', 'th': 'thai', 'tr': 'turkish', 'uk': 'ukrainian', 'ur': 'urdu', 'uz': 'uzbek', 'vi': 'vietnamese', 'cy': 'welsh', 'xh': 'xhosa', 'yi': 'yiddish', 'yo': 'yoruba', 'zu': 'zulu', 'fil': 'Filipino', 'he': 'Hebrew'}

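def lingo(text, simple=True):
    # a fixed seed makes langdetect deterministic
    DetectorFactory.seed = 0
    try:
        if simple:
            return detect(text) #language_codes[detect(text)]
        # str() of the first candidate has the form 'code:probability'
        best = str(detect_langs(text)[0]).split(':')
        return {
            'code': best[0],
            'language': language_codes[best[0]],
            'probability': best[1],
        }
    except Exception:
        # covers None input, undetectable text and codes missing from language_codes
        return None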

sentence = "Tanzania ni nchi inayoongoza kwa utalii barani afrika"
sentence2 = "Heute schneit es."

print(lingo(sentence, simple=False))
print(lingo(sentence2))
print(lingo(text1))
print(lingo(text2))
print(lingo(None))

{'code': 'sw', 'language': 'swahili', 'probability': '0.9999971210408874'}
de
en
en
None
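In the non-simple mode above, the probability is returned as a string and a code missing from `language_codes` makes the whole call return `None`. A small sketch of a more forgiving parser for the `'code:probability'` form, assuming that string format; `parse_lang_result` and the tiny `names` dict are hypothetical:

```python
def parse_lang_result(item, names):
    """Parse a 'code:probability' string into a dict with a float probability."""
    code, _, prob = str(item).partition(':')
    return {
        'code': code,
        # .get() yields None for unknown codes instead of raising KeyError
        'language': names.get(code),
        'probability': float(prob),
    }


names = {'sw': 'swahili', 'de': 'german'}
print(parse_lang_result('sw:0.9999971210408874', names))
print(parse_lang_result('xx:0.5', names))
```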


In [23]:
# helper functions

In [24]:
# function to rebuild a list from its string representation
# needed when a list was stored in the CSV without JSON-encoding it first
def str_to_list(s):
    s = s.replace("'", "").replace(' ,', ',').replace(
        '[', '').replace(']', '').split(',')
    s = [i.replace('"','').strip() for i in s if i]
    return s
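`str_to_list` splits on every comma, so an item that itself contains a comma is broken apart. When the stored string is a valid Python list literal, `ast.literal_eval` from the standard library can rebuild it safely. This is a sketch of that alternative, not a drop-in replacement; `parse_list_repr` is a hypothetical name:

```python
import ast


def parse_list_repr(s):
    """Rebuild a list from a string such as "['a', 'b']"; return [] on failure."""
    try:
        value = ast.literal_eval(s)
    except (ValueError, SyntaxError):
        return []
    if not isinstance(value, (list, tuple)):
        return []
    return [str(v).strip() for v in value]


print(parse_list_repr("['deep learning', 'nlp, text']"))
print(parse_list_repr('not a list'))
```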

In [25]:
# helper function to create the parent folder of a path
import errno

def create_folder(path):
    if not os.path.exists(os.path.dirname(path)):
        try:
            os.makedirs(os.path.dirname(path))
            print(path + ' created')
        except OSError as exc: # Guard against race condition
            if exc.errno != errno.EEXIST:
                raise
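On Python 3, `os.makedirs(..., exist_ok=True)` covers the "already exists" case without a manual `errno` check. A minimal sketch under that assumption; `ensure_parent_dir` is a hypothetical name and the example writes only inside a temporary directory:

```python
import os
import tempfile


def ensure_parent_dir(path):
    """Create the parent directory of path if needed and return it."""
    parent = os.path.dirname(path)
    if parent:  # an empty parent means the current directory
        os.makedirs(parent, exist_ok=True)
    return parent


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'a', 'b', 'file.txt')
    ensure_parent_dir(target)
    ensure_parent_dir(target)  # calling it again is harmless
    print(os.path.isdir(os.path.dirname(target)))
```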

In [26]:
# generic store data to file function
def store_data(data, file, mode='w', toJson=False):
    if toJson:
        data = json.dumps(data)
    with open(file, mode, encoding='utf-8') as fp:
        result = fp.write(data)
        return result
    
# generic load data from file function
def load_data(file, fromJson=False):
    if os.path.isfile(file):
        with open(file, 'r', encoding='utf-8', errors="ignore") as fp:
            data = fp.read()
            if fromJson:
                data = json.loads(data)
            return data
    else:
        return 'file not found'

# test text
#print(store_data('Hello', '../data/repositories/mlart/test.txt'))
#print(load_data('../data/repositories/mlart/test.txt'))

# test json
#print(store_data({'msg':'Hello World'}, '../data/repositories/mlart/test.json', toJson=True))
#print(load_data('../data/repositories/mlart/test.json', fromJson=True))

#store_data(result[0]['html'], '../data/repositories/kaggle/notebook.html')
#store_data(result[0]['iframe'], '../data/repositories/kaggle/kernel.html')
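`load_data` returns the string `'file not found'` for a missing file, which a caller cannot tell apart from a file that contains exactly that text. One possible alternative is to return an explicit default instead. The sketch below assumes JSON content; `load_json_or_default` is a hypothetical helper:

```python
import json
import os
import tempfile


def load_json_or_default(path, default=None):
    """Load JSON from path, or return default when the file does not exist."""
    if not os.path.isfile(path):
        return default
    with open(path, 'r', encoding='utf-8') as fp:
        return json.load(fp)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'test.json')
    print(load_json_or_default(path, default={}))  # file is missing
    with open(path, 'w', encoding='utf-8') as fp:
        json.dump({'msg': 'Hello World'}, fp)
    print(load_json_or_default(path))
```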

In [27]:
# remove special characters
def clean_text(text):
    # Ref: https://gist.github.com/Alex-Just/e86110836f3f93fe7932290526529cd1#gistcomment-3208085
    # Ref: https://en.wikipedia.org/wiki/Unicode_block
    EMOJI_PATTERN = re.compile(
        "(["
        "\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F700-\U0001F77F"  # alchemical symbols
        "\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
        "\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
        "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
        "\U0001FA00-\U0001FA6F"  # Chess Symbols
        "\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
        "\U00002702-\U000027B0"  # Dingbats
        "])"
    )
    text = re.sub(EMOJI_PATTERN, '', text)
    
    # additional cleanup
    text = text.replace('•','').replace('\n',' ')
    
    return text
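
A short illustration of what the cleanup does to a string. This is a sketch with a reduced character class (two of the Unicode blocks listed above), and `strip_sample` is a hypothetical stand-in for clean_text. Note that replacing newlines with spaces can leave runs of spaces behind; collapsing them is shown as an optional extra step that clean_text itself does not perform.

```python
import re

# Reduced character class for illustration; clean_text covers more Unicode blocks.
EMOJI_SAMPLE = re.compile("[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F]")

def strip_sample(text):
    # remove emoji, then bullets, then turn newlines into spaces
    text = EMOJI_SAMPLE.sub('', text)
    return text.replace('•', '').replace('\n', ' ')

raw = "Great model \U0001F600\n• fast"
cleaned = strip_sample(raw)

# optional: collapse the leftover whitespace runs into single spaces
collapsed = ' '.join(cleaned.split())
print(repr(cleaned), repr(collapsed))
```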

In [28]:
tag_filter_v01 = load_data('../data/patterns/tag_filter_v01.json', fromJson=True)
# drop empty mappings, then turn the 'null' marker into None so the tag gets filtered out
tag_filter_v01 = {k:v for k, v in tag_filter_v01.items() if v != ''}
tag_filter_v01 = {k:v if v != 'null' else None for k, v in tag_filter_v01.items()}
#print(json.dumps(tag_filter_v01, indent=2))

tag_filter_v02 = load_data('../data/patterns/tag_filter_v02.json', fromJson=True)
tag_filter_v02 = {k:v for k, v in tag_filter_v02.items() if v != ''}
tag_filter_v02 = {k:v if v != 'null' else None for k, v in tag_filter_v02.items()}
#print(json.dumps(tag_filter_v02, indent=2))

def tag_equalizer(tags, pattern=tag_filter_v01):
    tags = [pattern.get(x, x) for x in tags]
    tags = list(filter(None, tags))
    return tags

print(tag_equalizer(['tpu', 'rnn']))

['RNN']
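
The output above drops 'tpu' and renames 'rnn'. The sketch below reproduces the three possible outcomes for a tag (renamed, removed, passed through) with a small in-memory pattern. The contents of `raw_pattern` are made up for illustration and are not taken from the real tag_filter_v01.json; `equalize` is a hypothetical copy of tag_equalizer without the default argument.

```python
# Made-up pattern: 'null' marks a tag to remove, '' marks "no mapping defined".
raw_pattern = {'rnn': 'RNN', 'tpu': 'null', 'cnn': ''}

# same two preparation steps as in the cell above
pattern = {k: v for k, v in raw_pattern.items() if v != ''}
pattern = {k: (None if v == 'null' else v) for k, v in pattern.items()}

def equalize(tags, pattern):
    # unknown tags fall back to themselves; None entries are filtered out
    mapped = [pattern.get(tag, tag) for tag in tags]
    return [tag for tag in mapped if tag]

result = equalize(['tpu', 'rnn', 'cnn', 'gan'], pattern)
print(result)
```

Here 'tpu' is removed, 'rnn' becomes 'RNN', and both 'cnn' (empty mapping dropped during preparation) and 'gan' (never in the pattern) pass through unchanged.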


In [29]:
# lemmatizer
ADDITIONAL_STOPWORDS = []

def lemmatizer(text):
    """
    Clean up the text and return its lemmatized tokens.

    The text is normalized to lowercase ASCII, punctuation is stripped
    with a basic regex, and every word that is not a stop word is lemmatized.
    """
    wnl = nltk.stem.WordNetLemmatizer()
    # a set gives fast membership tests and avoids shadowing the imported `stopwords`
    stop_words = set(nltk.corpus.stopwords.words('english') + ADDITIONAL_STOPWORDS)
    text = (unicodedata.normalize('NFKD', text)
            .encode('ascii', 'ignore')
            .decode('utf-8', 'ignore')
            .lower())
    words = re.sub(r'[^\w\s]', '', text).split()
    return [wnl.lemmatize(word) for word in words if word not in stop_words]
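
The normalization step inside the lemmatizer can be tried on its own, without the NLTK corpora. The sketch below covers only the Unicode folding and punctuation stripping; stop-word removal and lemmatization are left out on purpose, and `normalize_tokens` is a hypothetical name.

```python
import re
import unicodedata

def normalize_tokens(text):
    # NFKD splits accented letters into base letter + combining mark;
    # encoding to ASCII with 'ignore' then drops the marks and any other non-ASCII
    ascii_text = (unicodedata.normalize('NFKD', text)
                  .encode('ascii', 'ignore')
                  .decode('ascii')
                  .lower())
    # strip punctuation, then split on whitespace
    return re.sub(r'[^\w\s]', '', ascii_text).split()

tokens = normalize_tokens("Stanford\u2019s Caf\u00e9, na\u00efve!")
print(tokens)
```

The typographic apostrophe has no ASCII decomposition, so it is dropped entirely and "Stanford’s" becomes "stanfords".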

In [30]:
# mapper to convert CSV to the mapping of Elasticsearch index
def mapper(row, style, extra={}):
    '''
    mapper to adapt a CSV row to the db schema

    "title"
    "summarization"
    "words"
    "sum_words"
    "link"
    "source"
    "category"
    "category_score"
    "subcategory"
    "subcategory_score"
    "tags"
    "kind"
    "ml_libs"
    "host"
    "license"
    "programming_language"
    "ml_score"
    "learn_score"
    "explore_score"
    "compete_score"
    "engagement_score"
    "date_project"
    "date_scraped"
    '''

    # kaggle competition mapping
    if style == 'kaggle_competition':
        ret = {
            'title': row['title'],
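            # join with a space so the last word of the subtitle does not fuse with the description
            'description': (row['subtitle'] + ' ' + row['description']).strip(),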
            'link': row['link'],
            # 'category': '',
            # 'category_score': 0,
            # 'subcategory': '',
            # 'subcategory_score': 0,
            'tags': list(set(str_to_list(row['tags']) + str_to_list(row['type']))),
            'kind': ['Project', '(Competition)', '(Dataset)'],
            # 'ml_libs': str_to_list(row['ml_libs']),
            'host': 'kaggle.com',
            # 'license': row['license'],
            # 'programming_language': row['type'],
            # 'ml_score': 0,
            'learn_score': 1,
            'explore_score': 0,
            'compete_score': 0,
            'engagement_score': row['teams_score'],
            'date_project': datetime.strptime(row['date_closed'], "%Y-%m-%d %H:%M:%S") if 'date_closed' in row else '',
            # 'date_scraped': datetime.strptime(row['scraped_at'], "%Y-%m-%d %H:%M:%S"),
            # 'ml_terms': row['ml_terms'],
            # 'score_raw': json.dumps({'views': row['views'], 'votes': row['votes'], 'score_private': row['score_private'], 'score_public': row['score_public']}),
        }
    
    # kaggle dataset mapping
    if style == 'kaggle_dataset':
        ret = {
            'title': row['title'],
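            # join with a space so the last word of the subtitle does not fuse with the description
            'description': (row['subtitle'] + ' ' + row['description']).strip(),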
            'link': row['link'],
            # 'category': '',
            # 'category_score': 0,
            # 'subcategory': '',
            # 'subcategory_score': 0,
            'tags': list(set(str_to_list(row['tags']) + str_to_list(row['type']))),
            'kind': ['Project', '(Dataset)'],
            # 'ml_libs': str_to_list(row['ml_libs']),
            'host': 'kaggle.com',
            # 'license': row['license'],
            # 'programming_language': row['type'],
            # 'ml_score': 0,
            'learn_score': 1,
            'explore_score': 0,
            'compete_score': 0,
            'engagement_score': row['teams_score'],
            'date_project': datetime.strptime(row['date_closed'], "%Y-%m-%d %H:%M:%S") if 'date_closed' in row else '',
            # 'date_scraped': datetime.strptime(row['scraped_at'], "%Y-%m-%d %H:%M:%S"),
            # 'ml_terms': row['ml_terms'],
            # 'score_raw': json.dumps({'views': row['views'], 'votes': row['votes'], 'score_private': row['score_private'], 'score_public': row['score_public']}),
        }
    
    # kaggle notebook mapping
    if style == 'kaggle_notebook':
        ret = {
            'title': row['title'],
            'description': row['description'],
            'link': row['link'],
            # 'category': '',
            # 'category_score': 0,
            # 'subcategory': '',
            # 'subcategory_score': 0,
            'tags': list(set(str_to_list(row['tags']))),
            'kind': ['Project', '(Notebook)'],
            'ml_libs': str_to_list(row['ml_libs']),
            'host': 'kaggle.com',
            'license': row['license'],
            'programming_language': row['type'],
            'ml_score': row['ml_detected'],
            'learn_score': 1,
            'explore_score': 0,
            'compete_score': 0,
            'engagement_score': row['score_views'] if 'score_views' in row else None,
            'date_project': datetime.strptime(row['date'], "%Y-%m-%d %H:%M:%S") if row['date'] != '' else None,
            'date_scraped': datetime.strptime(row['scraped_at'], "%Y-%m-%d %H:%M:%S") if row['scraped_at'] != '' else None,
            # 'ml_terms': row['ml_terms'],
            # 'score_raw': json.dumps({'views': row['views'], 'votes': row['votes'], 'score_private': row['score_private'], 'score_public': row['score_public']}),
        }

    # github mapping
    if style == 'github':
        title = row['name'] if row['name'] != '' else row['title']
        title = title.replace('-',' ').replace('_',' ').strip()
        cat_score = 1 if row['industry'] != '' else 0
        subcat_score = 1 if row['type'] != '' else 0
        #tags = row['ml_tags'] if len(row['ml_tags']) > 0 else ''
        ret = {
            'title': title,
            'description': row['description2'],
            'link': row['link'],
            'category': row['industry'],
            'category_score': cat_score,
            'subcategory': row['type'],
            'subcategory_score': subcat_score,
            'tags': str_to_list(row['ml_tags']),
            'kind': 'Project',
            'ml_libs': str_to_list(row['ml_libs']),
            'host': 'github.com',
            'license': row['license'],
            'programming_language': row['language_primary'],
            'ml_score': row['ml_detected'],
            'engagement_score': row['stars_score'],
            'date_project': datetime.strptime(row['pushed_at'], "%Y-%m-%d %H:%M:%S"),
            'date_scraped': datetime.strptime(row['scraped_at'], "%Y-%m-%d %H:%M:%S"),
            # 'ml_terms': row['keywords'],
            # 'score_raw': json.dumps({'stars': row['stars'], 'contributors': row['contributors']}),
        }

    # mlart mapping
    if style == 'mlart':
        title = row['Title'] if row['Title'] != '' else row['title']
        cat_score = 1 if row['Theme'] != '' else 0
        subcat_score = 1 if row['Medium'] != '' else 0
        ret = {
            'title': title,
            'description': row['subtitle'],
            'link': row['url'],
            'category': 'Miscellaneous',
            'category_score': cat_score,
            'subcategory': 'Art',
            'subcategory_score': subcat_score,
            'tags': str_to_list(row['Theme']) + str_to_list(row['Medium']) + str_to_list(row['Technology']),
            'kind': 'Showcase',
            # 'ml_libs': [],
            'host': 'mlart.co',
            # 'license': '',
            # 'programming_language': '',
            # 'ml_score': row['ml_detected'],
            'learn_score': 0,
            'explore_score': 1,
            'compete_score': 0,
            # 'engagement_score': 0,
            'date_project': datetime.strptime(row['Date'], "%Y-%m-%d"),
            'date_scraped': datetime.strptime(row['scraped_at'], "%Y-%m-%d %H:%M:%S"),
            # 'score_raw': json.dumps({'days_since_featured': row['Days Since Featured']}),
        }

    # thecleverprogrammer
    if style == 'tcp':
        ret = {
            'title': row['title'],
            'description': row['description'],
            'link': row['link'],
            # 'category': '',
            # 'category_score': 0,
            # 'subcategory': '',
            # 'subcategory_score': 0,
            'tags': str_to_list(row['ml_tags']),
            'kind': 'Project',
            'ml_libs': str_to_list(row['ml_libs']),
            'host': 'thecleverprogrammer.com',
            # 'license': '',
            'programming_language': 'Python',
            'ml_score': row['ml_score'],
            'learn_score': 1,
            'explore_score': 0,
            'compete_score': 0,
            # 'engagement_score': 0,
            'date_project': datetime.strptime(row['date'], "%Y-%m-%d %H:%M:%S"),
            'date_scraped': datetime.strptime('2020-12-20', "%Y-%m-%d"),
            # 'score_raw': json.dumps({'days_since_featured': row['Days Since Featured']}),
        }
    
    # zalando / bcgdv / medium
    if style == 'manual':
        ret = {
            'title': row['title'],
            # 'description': row['description'] if row['description'] != '' else row['text'],
            'link': row['link'],
            # 'category': '',
            # 'category_score': 0,
            # 'subcategory': '',
            # 'subcategory_score': 0,
            'tags': str_to_list(row['tags']),
            # 'kind': 'Project',
            # 'ml_libs': str_to_list(row['ml_libs']),
            # 'host': 'thecleverprogrammer.com',
            # 'license': '',
            # 'programming_language': 'Python',
            # 'ml_score': row['ml_score'],
            # 'engagement_score': 0,
            # 'date_project': datetime.strptime(row['date'], "%d.%m.%Y"),
            # 'date_scraped': datetime.strptime(row['date_scraped'], "%d.%m.%Y"),
            # 'score_raw': json.dumps({'days_since_featured': row['Days Since Featured']}),
        }
        if ret['title'] == '' and 'company' in row:
            ret['title'] = row['company']
            
        if 'description' in row and row['description'] != '':
            ret['description'] = row['description']
        else:
            ret['description'] = row['text']
            
        if 'source' in row:
            ret['source'] = row['source']
            
        if 'category' in row:
            ret['category'] = row['category']
            if 'category_score' in row:
                ret['category_score'] = row['category_score']
            elif ret['category'] != '':
                ret['category_score'] = 1
        else:
            ret['category'] = ''
            
        if 'subcategory' in row:
            ret['subcategory']= row['subcategory']
            if 'subcategory_score' in row:
                ret['subcategory_score'] = row['subcategory_score']
            elif ret['subcategory'] != '':
                ret['subcategory_score'] = 1
        else:
            ret['subcategory'] = ''
            
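        if 'date' in row and row['date'] != '':
            # try the full date first, then fall back to a bare year
            try:
                ret['date_project'] = datetime.strptime(row['date'], "%d.%m.%Y")
            except (ValueError, TypeError):
                try:
                    ret['date_project'] = datetime.strptime(row['date'], "%Y")
                except (ValueError, TypeError):
                    pass
                
        if 'date_scraped' in row and row['date_scraped'] != '':
            # store the parsed date on the result, not on the input row
            ret['date_scraped'] = datetime.strptime(row['date_scraped'], "%d.%m.%Y")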
        
        if 'ml_score' in row:
            ret['ml_score'] = row['ml_score']
            
        if 'learn_score' in row:
            ret['learn_score'] = row['learn_score']
            
        if 'explore_score' in row:
            ret['explore_score'] = row['explore_score']
            
        if 'compete_score' in row:
            ret['compete_score'] = row['compete_score']
        
    attach = {**extra}
    if 'tags' in attach:
        ret['tags'].extend(attach['tags'])
        attach.pop('tags')
    ret.update(attach)
    return ret
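
The last lines of mapper treat 'tags' differently from every other key in `extra`: tags are appended, everything else overwrites. A minimal sketch of that merge in isolation, where `merge_extra` is a hypothetical name and the record values are invented:

```python
def merge_extra(ret, extra):
    attach = dict(extra)  # shallow copy so the caller's dict keeps its 'tags' key
    if 'tags' in attach:
        # tags from `extra` are appended to the mapped tags
        ret['tags'].extend(attach.pop('tags'))
    # every remaining key overwrites the mapped value
    ret.update(attach)
    return ret

extra = {'tags': ['Fashion'], 'host': 'zalando.com', 'learn_score': 0.5}
record = merge_extra({'title': 'Example', 'tags': ['nlp'], 'learn_score': 1}, extra)
print(record)
```

Because only a copy is modified, the same `extra` dict can be passed for every row of a dataset.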

In [31]:
# test gpu usage
import torch
torch.cuda.is_available()

True

In [32]:
# summarization loop

In [33]:
# loop to transform data row-wise
def transform_loop(csv_in, csv_format, subfolder, quit=0, overwrite=False, inplace=True, printItem=False, extra={}):
    
    with open(csv_in, encoding='utf-8') as csvfile:
        
        # let's store converted csv to temp-folder for analysis
        csv_out = '../data/database/csv/'
        json_out = '../data/database/json/'
        json_out_item = '../data/database/json/'+subfolder
        create_folder(json_out_item)
        df = pd.DataFrame()

        # readCSV = csv.reader(csvfile, delimiter=';')
        readCSV = csv.DictReader(csvfile, delimiter=';')
        # next(readCSV, None)  # skip the headers
        
        i = j = 0
        out = []
        
        for row in readCSV:
            row = mapper(row, csv_format, extra=extra)
            if printItem == True:
                print(json.dumps(row, indent=3, sort_keys=True, default=str))
            
            # check if file already exists
            link = row['link']
            md5 = hashlib.md5(link.encode("utf-8")).hexdigest()
            
            json_fp = json_out_item + md5 + '.json'
            
            old = {}
            # process new items always, existing items only when overwrite is set
            if overwrite or not os.path.isfile(json_fp):
                if os.path.isfile(json_fp):
                    old = load_data(json_fp, fromJson=True)
                
                print(i, row['link'])
                item_start = time.time()

                # clean title & description
                row['title'] = clean_text(row['title'])
                text = row['description'] = clean_text(row['description'])
                words = row['words'] = word_count(text)
                sentences = row['sentences'] = sentence_count(text)

                # create summarization
                if words > 200 and sentences > 1 and (not 'sum_nltk' in old or not 'sum_t5' in old):
                    print('summarize')
                    
                    # nltk
                    if not 'sum_nltk' in old:
                        start = time.time()
                        row['sum_nltk'] = nltk_count(text, word_count=200)
                        end = time.time()
                        dur = round(end-start,3)

                        row['sum_nltk_words'] = word_count(row['sum_nltk'])
                        row['sum_nltk_runtime'] = dur
                        print('done (nltk)', dur, 'sec')
                    
                    # t5
                    if not 'sum_t5' in old:
                        start = time.time()
                        row['sum_t5'] = t5(text)
                        end = time.time()
                        dur = round(end-start,3)

                        row['sum_t5_words'] = word_count(row['sum_t5'])
                        row['sum_t5_runtime'] = dur
                        print('done (t5)', dur, 'sec')
                
                # detect language
                if not 'language_code' in old:
                    s = row['description'] if 'description' in row and row['description'] != '' else row['title']
                    lang = lingo(s, simple=False)
                    if lang != None:
                        row['language_code'] = lang['code']
                        row['language'] = lang['language']
                        row['language_score'] = lang['probability']
                    else:
                        row['language_code'] = None
                        row['language'] = None
                        row['language_score'] = None

                # equalizer
                if 'programming_language' in row and row['programming_language'] == 'Python notebook':
                    row['programming_language'] = 'Jupyter Notebook'
                    
                if 'license' in row:
                    if row['license'] == 'Apache 2.0':
                        row['license'] = 'Apache-2.0'
                    if row['license'] == 'Learn more about GitHub Sponsors':
                        row['license'] = None
                    if row['license'] == 'Unlicense':
                        row['license'] = None
                        
                row['tags'] = tag_equalizer(row['tags'], pattern=tag_filter_v01)
                row['tags_descriptive'] = tag_equalizer(row['tags'], pattern=tag_filter_v02)
                
                # lemmatizer
                row['description_lemmatized'] = ' '.join(lemmatizer(row['description']))
                if 'sum_nltk' in row:
                    row['sum_nltk_lemmatized'] = ' '.join(lemmatizer(row['sum_nltk']))
                if 'summarization' in row:
                    row['summarization_lemmatized'] = ' '.join(lemmatizer(row['summarization']))
                

                # convert datetime to string
                if 'date_project' in row:
                    row['date_project'] = str(row['date_project'])
                if 'date_scraped' in row:
                    row['date_scraped'] = str(row['date_scraped'])
                    
                # runtime
                item_end = time.time()
                item_dur = round(item_end-item_start, 3)
                row['runtime'] = item_dur

                #df = df.append(row, ignore_index=True)

                # json encode
                #out.append(row)
                
                if overwrite == True and inplace==True:
                    row = {**old, **row}
                    drop = ['score']
                    for key in drop:
                        if key in row:
                            row.pop(key)
                    # restore category, subcategory and runtime from the old record;
                    # .get() avoids a KeyError when a mapping does not set the field
                    for key in ('category', 'category_score', 'subcategory', 'subcategory_score', 'runtime'):
                        if row.get(key) == '' and key in old:
                            row[key] = old[key]
                            
                #print(row)
                #sys.exit()
                
                if row != old:
                    store_data(row, json_fp, toJson=True)
                    print('stored:', json_fp)
                j += 1

            #print(i, row['link'])
            i += 1

            # keep count of # rows processed
            if i % 100 == 0:
                print(i)

            if quit != 0 and i >= quit:
                break

        # store parsed csv
        #fp = csv_in.split('/')[-1]
        #df.to_csv(csv_out + fp, sep=';', index=False)
        #path = json_out + fp
        #path = path.replace('.csv', '.json')
        #store_data(out, path, toJson=True)
        
        print('DONE parsed', i, 'items')
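
The loop derives each output file name from the MD5 hash of the item's link and decides from `overwrite` whether to process the row. A sketch of just those two decisions follows; `json_path_for` and `should_process` are hypothetical helper names, and the folder is only an example path.

```python
import hashlib
import os

def json_path_for(link, folder):
    # the same link always hashes to the same file name, so reruns hit the same file
    md5 = hashlib.md5(link.encode('utf-8')).hexdigest()
    return os.path.join(folder, md5 + '.json')

def should_process(path, overwrite):
    # new items are always processed; existing ones only when overwrite is set
    return overwrite or not os.path.isfile(path)

path = json_path_for('https://www.kaggle.com/c/zillow-prize-1', '../data/database/json/ka_c/')
print(path, should_process(path, overwrite=False))
```

Hashing the link keeps file names free of characters that are awkward in paths and makes the check for an existing record a single `os.path.isfile` call.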

In [34]:
# run the loop

#transform = ['ka_c', 'ka_cn', 'ka_d', 'ka_dn', 'ma', 'gh', 'tcp', 'bc']
transform = ['ka_c', 'ka_cn', 'ma', 'gh', 'tcp', 'bc', 'me_ft', 'bcg_fo', 'bcg_ha',
             'za_bl', 'za_jo', 'za_pr', 'za_pu']
#transform = ['za_bl', 'za_jo', 'za_pr', 'za_pu']

datasets = {
    # kaggle competitions
    'ka_c': {
        'csv_in': '../data/database/kaggle_competitions_correlated_01.csv',
        'csv_format': 'kaggle_competition',
    },
    # kaggle competitions notebooks
    'ka_cn': {
        'csv_in': '../data/database/kaggle_competitions_01_original.csv',
        'csv_format': 'kaggle_notebook',
    },
    # kaggle datasets
    'ka_d': {
        'csv_in': '../data/database/kaggle_datasets_correlated_01.csv',
        'csv_format': 'kaggle_dataset',
    },
    # kaggle datasets notebooks
    'ka_dn': {
        'csv_in': '../data/database/kaggle_datasets_01_original.csv',
        'csv_format': 'kaggle_notebook',
    },
    # mlart
    'ma': {
        'csv_in': '../data/database/mlart_01_original.csv',
        'csv_format':'mlart',
        'extra': {
            'learn_score': 0,
            'explore_score': 1,
            'compete_score': 0,
        },
    },
    # github
    'gh': {
        'csv_in': '../data/database/db_04_analyzed_v02.csv',
        'csv_format': 'github',
        'extra': {
            'learn_score': 1,
            'explore_score': 0,
            'compete_score': 0.25,
        },
    },
    # thecleverprogrammer
    'tcp': {
        'csv_in': '../data/database/thecleverprogrammer_01_original.csv',
        'csv_format': 'tcp',
        'extra': {
            'learn_score': 1,
            'explore_score': 0,
            'compete_score': 0,
        },
    },
    # blobcity
    'bc': {
        'csv_in': '../data/database/blobcity_02_analyzed.csv',
        'csv_format': 'github',
        'extra': {
            'learn_score': 1,
            'explore_score': 0,
            'compete_score': 0,
        },
    },
    # medium_fintech
    'me_ft': {
        'csv_in': '../data/database/medium_fintech_01.csv',
        'csv_format': 'manual',
        'extra': {
            'host': 'medium.com',
            'kind': 'Article',
            'learn_score': 0,
            'explore_score': 0,
            'compete_score': 1,
        },
        'out': 'me'
    },
    # bcgdv founded
    'bcg_fo': {
        'csv_in': '../data/database/bcgdv_founded_01.csv',
        'csv_format': 'manual',
        'extra': {
            'host': 'bcgdv.com',
            'kind': 'Article',
            'learn_score': 0,
            'explore_score': 0,
            'compete_score': 1,
        },
        'out': 'bcg'
    },
    # bcgdv hackathon
    'bcg_ha': {
        'csv_in': '../data/database/bcgdv_hackaton_01.csv',
        'csv_format': 'manual',
        'extra': {
            'host': 'bcgdv.com',
            'kind': ['Article', 'Project'],
            'learn_score': 1,
            'explore_score': 0,
            'compete_score': 1,
        },
        'out': 'bcg'
    },
    # zalando blog
    'za_bl': {
        'csv_in': '../data/database/zalando_blog_01.csv',
        'csv_format': 'manual',
        'extra': {
            'host': 'zalando.com',
            'kind': 'Article'
        },
        'out': 'za'
    },
    # zalando jobs
    'za_jo': {
        'csv_in': '../data/database/zalando_jobs_01.csv',
        'csv_format': 'manual',
        'extra': {
            'host': 'zalando.com',
            'kind': 'Article',
            'tags': ['Fashion'],
            'learn_score': 0,
            'explore_score': 0,
            'compete_score': 1,
        },
        'out': 'za'
    },
    # zalando research projects
    'za_pr': {
        'csv_in': '../data/database/zalando_projects_01.csv',
        'csv_format': 'manual',
        'extra': {
            'host': 'zalando.com',
            'kind': 'Article',
        },
        'out': 'za'
    },
    # zalando research publications
    'za_pu': {
        'csv_in': '../data/database/zalando_publications_04.csv',
        'csv_format': 'manual',
        'extra': {
            'host': 'zalando.com',
            'kind': 'Article',
            'date_scraped': datetime.strptime('17.01.2021', "%d.%m.%Y"),
            'tags': ['Fashion'],
            'learn_score': 0.5,
            'explore_score': 0,
            'compete_score': 0.75,
        },
        'out': 'za'
    },
}

    
for key in transform:
    print(key)
    item = datasets[key]
    extra = item['extra'] if 'extra' in item else {}
    out = key+'/' if not 'out' in item else item['out']+'/'
    printItem=False
    transform_loop(item['csv_in'], item['csv_format'], out, overwrite=True, extra=extra, printItem=printItem)

ka_c
0 https://www.kaggle.com/c/20-newsgroups-ciphertext-challenge
stored: ../data/database/json/ka_c/07554a25ba5010fc437c588c02637782.json
1 https://www.kaggle.com/c/3d-object-detection-for-autonomous-vehicles
stored: ../data/database/json/ka_c/998dc92361462a193b1eee10472bc19e.json
2 https://www.kaggle.com/c/abstraction-and-reasoning-challenge
stored: ../data/database/json/ka_c/f1b8d78bbe5c3441abb14e95f265c987.json
3 https://www.kaggle.com/c/aerial-cactus-identification
stored: ../data/database/json/ka_c/10172da4e494109a2ed53419081a1096.json
4 https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings
stored: ../data/database/json/ka_c/c4455229a9996e08978b53afd46e7fea.json
5 https://www.kaggle.com/c/airbus-ship-detection
stored: ../data/database/json/ka_c/a09ba09d429792c7deec1b662bb71827.json
6 https://www.kaggle.com/c/alaska2-image-steganalysis
stored: ../data/database/json/ka_c/647c8d2facd6bc052331da47849e6ed2.json
7 https://www.kaggle.com/c/allstate-claims-severity
stored: ../dat

76 https://www.kaggle.com/c/ga-customer-revenue-prediction
stored: ../data/database/json/ka_c/b1dfd577eabf949737a5a393a70707ac.json
77 https://www.kaggle.com/c/gendered-pronoun-resolution
stored: ../data/database/json/ka_c/abf975b7565f8f7bb033e4a509cea9e1.json
78 https://www.kaggle.com/c/generative-dog-images
stored: ../data/database/json/ka_c/1c5dace5423aa08b39987729aa985e04.json
79 https://www.kaggle.com/c/ghouls-goblins-and-ghosts-boo
stored: ../data/database/json/ka_c/115abf1f213bada684ea551c6f75694a.json
80 https://www.kaggle.com/c/GiveMeSomeCredit
stored: ../data/database/json/ka_c/be27717c8aae7beda42415fa0701ffe9.json
81 https://www.kaggle.com/c/global-wheat-detection
stored: ../data/database/json/ka_c/3271d02e06eb02f506239b3fad4e248f.json
82 https://www.kaggle.com/c/google-ai-open-images-object-detection-track
stored: ../data/database/json/ka_c/867a3cdef825dd423df0b41a525903bc.json
83 https://www.kaggle.com/c/google-cloud-ncaa-march-madness-2020-division-1-mens-tournament
store

stored: ../data/database/json/ka_c/31312d0d90d087603678d0764e14cc9c.json
145 https://www.kaggle.com/c/noaa-fisheries-steller-sea-lion-population-count
stored: ../data/database/json/ka_c/86af4e9c220f67dc6478e3a3294d4214.json
146 https://www.kaggle.com/c/nomad2018-predict-transparent-conductors
stored: ../data/database/json/ka_c/1b1ca4fa190a94d326ac61fc1a18917c.json
147 https://www.kaggle.com/c/nyc-taxi-trip-duration
stored: ../data/database/json/ka_c/b3ff3d6c7d16b03a33255a58007d8378.json
148 https://www.kaggle.com/c/open-images-2019-instance-segmentation
stored: ../data/database/json/ka_c/821f99c0f2e2c85697424dba75fad282.json
149 https://www.kaggle.com/c/open-images-2019-object-detection
stored: ../data/database/json/ka_c/7b87883c949e13f77224c385f96e0ba8.json
150 https://www.kaggle.com/c/open-images-instance-segmentation-rvc-2020
stored: ../data/database/json/ka_c/62583876bc0eb7e0d34a6d2d1018172d.json
151 https://www.kaggle.com/c/open-images-object-detection-rvc-2020
stored: ../data/dat

213 https://www.kaggle.com/c/youtube8m
stored: ../data/database/json/ka_c/a3ec6dad357abe43e32862e96289d503.json
214 https://www.kaggle.com/c/youtube8m-2018
stored: ../data/database/json/ka_c/d2223669e130b56dd427a4d77f9b6745.json
215 https://www.kaggle.com/c/youtube8m-2019
stored: ../data/database/json/ka_c/3883732f2bdf12cda1d61467fbba0dc9.json
216 https://www.kaggle.com/c/zillow-prize-1
stored: ../data/database/json/ka_c/13b3e0534ccfa5df9f731564b27a1a8d.json
DONE parsed 217 items
ka_cn
0 https://www.kaggle.com/a45632/classification-tfidf-svm-2-0
stored: ../data/database/json/ka_cn/afd02527b27d3abf4a51ce91c208664b.json
1 https://www.kaggle.com/amansohane/level-3-with-partial-deciphering-0-94-level-3
stored: ../data/database/json/ka_cn/28dadc18b17605c155a83106abdf8fd7.json
2 https://www.kaggle.com/ashishpatel26/attension-layer-basic-for-nlp
stored: ../data/database/json/ka_cn/351c786b42f280918e40d780ef020571.json
3 https://www.kaggle.com/ashishpatel26/beginner-to-intermediate-nlp-tutoria

stored: ../data/database/json/ka_cn/64c7c2bb311e7343388c6eaecc487a7f.json
63 https://www.kaggle.com/kmader/baseline-u-net-model-part-1
stored: ../data/database/json/ka_cn/108434d32b230cbfceabc78774511a90.json
64 https://www.kaggle.com/kmader/from-trained-u-net-to-submission-part-2
stored: ../data/database/json/ka_cn/c9cd9ca10aeb7e4eb7bdcf3775c2d9b0.json
65 https://www.kaggle.com/kmader/transfer-learning-for-boat-or-no-boat
stored: ../data/database/json/ka_cn/432b7a4700bc16a1e184d198d3af01b7.json
66 https://www.kaggle.com/kotarojp/first-step-for-submission-u-net-tta
stored: ../data/database/json/ka_cn/adf7e0d843ea5cc9305c5bb76dfc174f.json
67 https://www.kaggle.com/leighplt/pytorch-tutorial-dataset-data-preparetion-stage
stored: ../data/database/json/ka_cn/619c10efd745c6cb689bf0269dca277b.json
68 https://www.kaggle.com/voglinio/from-masks-to-bounding-boxes
stored: ../data/database/json/ka_cn/1179260ceec48b6136bb3102458323dd.json
69 https://www.kaggle.com/windsurfer/baseline-u-net-on-pyto

stored: ../data/database/json/ka_cn/14183cf04d211bc7e10838afc9a430d6.json
121 https://www.kaggle.com/corochann/optuna-tutorial-for-hyperparameter-optimization
stored: ../data/database/json/ka_cn/ea322e1d0b6916fc179f7b7e28ac23ca.json
122 https://www.kaggle.com/gunesevitan/ashrae-ucf-spider-and-eda-full-test-labels
stored: ../data/database/json/ka_cn/e5ccccbde415556023f77c718b77ee96.json
123 https://www.kaggle.com/hmendonca/starter-eda-and-feature-selection-ashrae3
stored: ../data/database/json/ka_cn/7924031564c163c9bb59f5a7f667d574.json
124 https://www.kaggle.com/kimtaegwan/what-s-your-cv-method
stored: ../data/database/json/ka_cn/bc677778d62b93ad305ed836eeb770ee.json
125 https://www.kaggle.com/nroman/eda-for-ashrae
stored: ../data/database/json/ka_cn/5b0790b0964ffe111e6a673b4fbf646b.json
126 https://www.kaggle.com/nz0722/aligned-timestamp-lgbm-by-meter-type
stored: ../data/database/json/ka_cn/2d92d61508332fc8adbf145f92987ef9.json
127 https://www.kaggle.com/purist1024/ashrae-simple-data

stored: ../data/database/json/ka_cn/ffd77f48f8f08534c665691640c4afbb.json
186 https://www.kaggle.com/parulpandey/eda-and-audio-processing-with-python
stored: ../data/database/json/ka_cn/cd4bb2297966a6c01f69636aca93c46d.json
187 https://www.kaggle.com/pavansanagapati/birds-sounds-eda-spotify-urban-sound-eda
stored: ../data/database/json/ka_cn/a3b8ba1350e48d210f926287a5d89190.json
188 https://www.kaggle.com/rohanrao/birdcall-eda-chirp-hoot-and-flutter
stored: ../data/database/json/ka_cn/7216279f7a3114e6ade049aa0dae61f8.json
189 https://www.kaggle.com/rohitsingh9990/eda-visualizations-simple-baseline
stored: ../data/database/json/ka_cn/84a2f01aab40430fbc5e011d3892702b.json
190 https://www.kaggle.com/shahules/bird-watch-complete-eda-fe
stored: ../data/database/json/ka_cn/e140380d8f79ecb786ce5a4cf16e2f6c.json
191 https://www.kaggle.com/shonenkov/competition-metrics
stored: ../data/database/json/ka_cn/fada469385583b58303609581a781c4a.json
192 https://www.kaggle.com/adamamoussasamake/eda-ml-m

stored: ../data/database/json/ka_cn/ea8be522e182742f1c036c317e926baa.json
253 https://www.kaggle.com/vikassingh1996/handling-categorical-variables-encoding-modeling
stored: ../data/database/json/ka_cn/e1271852b4fc53a7a16b388631f20f20.json
254 https://www.kaggle.com/drcapa/categorical-feature-engineering-2-xgb
stored: ../data/database/json/ka_cn/05d2766d81b1fef98ff6756c888eb2dd.json
255 https://www.kaggle.com/frankmollard/finding-the-optimal-weight-between-three-models
stored: ../data/database/json/ka_cn/af2c9307d85198fb282fc80e438475b9.json
256 https://www.kaggle.com/lucamassaron/categorical-feature-encoding-with-tensorflow
stored: ../data/database/json/ka_cn/69955ed5edf8cf1fc352457ab3941794.json
257 https://www.kaggle.com/ogrellier/libffm-model
stored: ../data/database/json/ka_cn/7cb4f51010ddf17981321c64c74bbe48.json
258 https://www.kaggle.com/pavelvpster/cat-in-dat-2-embeddings-target-keras
stored: ../data/database/json/ka_cn/6700baeea13a70fbea6bd1a1beea9722.json
259 https://www.kagg

stored: ../data/database/json/ka_cn/9fc6a8cabc316b3f326f39dc8aa01f75.json
314 https://www.kaggle.com/ruchibahl18/starting-of-an-end-game
stored: ../data/database/json/ka_cn/a3c7f3d8b96d2f636f4c96ac7dce077c.json
315 https://www.kaggle.com/ebouteillon/top-10-with-first-genetic-algorithm-on-gpu
stored: ../data/database/json/ka_cn/306cf46e43f305d6136f8af9826cbf7c.json
316 https://www.kaggle.com/hasnainajmal281/iterative-cnn-approach
stored: ../data/database/json/ka_cn/a3b7d327fcf237237565575842c9f21e.json
317 https://www.kaggle.com/jamesmcguigan/game-of-life-hashmap-solver
stored: ../data/database/json/ka_cn/851d8d1f13fc3f77d51a371636d5196d.json
318 https://www.kaggle.com/jamesmcguigan/game-of-life-z3-constraint-satisfaction
stored: ../data/database/json/ka_cn/4e10f7c32c7779832f0b2294d8809365.json
319 https://www.kaggle.com/jpmiller/demo-cython-generator-and-keras-cnn
stored: ../data/database/json/ka_cn/8887c69e564eb6e1b19aa0c0a251c406.json
320 https://www.kaggle.com/li325040229/the-game-o

stored: ../data/database/json/ka_cn/36755e94ead2df0dfe9798ff19754c19.json
373 https://www.kaggle.com/saga21/covid-global-forecast-sir-model-ml-regressions
stored: ../data/database/json/ka_cn/6d9a490afa34cc3ddc90724dc0c7630d.json
374 https://www.kaggle.com/super13579/covid-19-global-forecast-seir-visualize
stored: ../data/database/json/ka_cn/3ff2bf6283201c4986674bd133e2df71.json
375 https://www.kaggle.com/dkjung/covid-19-eda-s-korea-forecasting-global
stored: ../data/database/json/ka_cn/85ec47888e806fbbe828e46d178f9431.json
376 https://www.kaggle.com/granjithkumar/covid19-model-using-ensemble-learning-with-95-acc
stored: ../data/database/json/ka_cn/5967c86c7a4c21a447955a1383d20f7e.json
377 https://www.kaggle.com/kdnishanth/covid-19-forcasting
stored: ../data/database/json/ka_cn/3593104b56b4ecc7b15e834ff9f5a7b5.json
378 https://www.kaggle.com/nischaydnk/covid19-week5-visuals-randomforestregressor
stored: ../data/database/json/ka_cn/218df1ea78d8c8546b8ef0af765b9311.json
379 https://www.ka

stored: ../data/database/json/ka_cn/b618c020f6d16f6a1c7a5c7119e5f1f7.json
432 https://www.kaggle.com/takuok/keras-generator-starter-lb-0-326
stored: ../data/database/json/ka_cn/0a0a85f86e44f20e8fdf554036a1b0d9.json
433 https://www.kaggle.com/vijaybj/basic-u-net-using-tensorflow
stored: ../data/database/json/ka_cn/a1cc56fb869c399caa4a1e418188eceb.json
434 https://www.kaggle.com/vijaybj/yolo-for-detection-of-bounding-boxes-tensorflow
stored: ../data/database/json/ka_cn/938ba943887e2858d1fc2e714e82bea8.json
435 https://www.kaggle.com/voglinio/external-h-e-data-with-mask-annotations
stored: ../data/database/json/ka_cn/1539577fce856ba1be3e46382507ec9c.json
436 https://www.kaggle.com/wcukierski/example-metric-implementation
stored: ../data/database/json/ka_cn/57421a0affe70c4269a5a20c5c3c6bdf.json
437 https://www.kaggle.com/abhinand05/catboost-a-deeper-dive
stored: ../data/database/json/ka_cn/08ba07b0022a25f786c8f18633262011.json
438 https://www.kaggle.com/artgor/oop-approach-to-fe-and-models

stored: ../data/database/json/ka_cn/6305a95e04790411f518d3a1a239036f.json
489 https://www.kaggle.com/hmendonca/proper-clustering-with-facenet-embeddings-eda
stored: ../data/database/json/ka_cn/54d03a378b94d98f705d0738e0525c8a.json
490 https://www.kaggle.com/humananalog/binary-image-classifier-training-demo
stored: ../data/database/json/ka_cn/093735f6cd881b045dd1a66da089879d.json
491 https://www.kaggle.com/humananalog/inference-demo
stored: ../data/database/json/ka_cn/f8e043c04526dc5445cf0c96a43dc4e8.json
492 https://www.kaggle.com/khoongweihao/gcloud-ensembling-learning-learning-rates
stored: ../data/database/json/ka_cn/602869a4215b709cabe3bf744c07a34a.json
493 https://www.kaggle.com/marcovasquez/basic-eda-face-detection-split-video-and-roi
stored: ../data/database/json/ka_cn/304b1d7988f689191cc76d00863d4580.json
494 https://www.kaggle.com/robikscube/faceforensics-baseline-dlib-no-internet
stored: ../data/database/json/ka_cn/94c7cb6c9b6946231241c99141c88355.json
495 https://www.kaggle.

550 https://www.kaggle.com/subhamoybhaduri/cnn-cat-and-dog-classification
stored: ../data/database/json/ka_cn/4995f823565e095690669ca986827659.json
551 https://www.kaggle.com/uysimty/keras-cnn-dog-or-cat-classification
stored: ../data/database/json/ka_cn/16e15a2fba1bb4dc2779a924e2ee4efc.json
552 https://www.kaggle.com/abhiksark/introduction-to-transfer-learning-cats-dogs
stored: ../data/database/json/ka_cn/753e9e61ffa46b0377f8dac9f52920af.json
553 https://www.kaggle.com/anshulrai/using-fastai-in-kaggle-kernel-updated
stored: ../data/database/json/ka_cn/ba03467cd08a9da3823e25d3e1dafc6c.json
554 https://www.kaggle.com/gauss256/image-preprocessing-exploration-2
stored: ../data/database/json/ka_cn/9a83fe9f8a56777fa4c5c38cf97db79e.json
555 https://www.kaggle.com/gpreda/cats-or-dogs-using-cnn-with-transfer-learning
stored: ../data/database/json/ka_cn/e1b07f8b20219dda445848a75abb6d7f.json
556 https://www.kaggle.com/hortonhearsafoo/fast-ai-lesson-1
stored: ../data/database/json/ka_cn/34256b82d

613 https://www.kaggle.com/deepakdeepu8978/methodology-for-average-historical-emissions
stored: ../data/database/json/ka_cn/955c9363503d13ec715227761a179542.json
614 https://www.kaggle.com/gpoulain/eda-ef-with-n2o-time-series-earth-engine
stored: ../data/database/json/ka_cn/11c01f883455d912046e2266cbe2a874.json
615 https://www.kaggle.com/katemelianova/ds4g-spatial-panel-data-modeling
stored: ../data/database/json/ka_cn/a3b4444ed18e8ebfd4afabaa7b296475.json
616 https://www.kaggle.com/paultimothymooney/explore-image-metadata-s5p-gfs-gldas
stored: ../data/database/json/ka_cn/b253d605593e78605491d55686d64185.json
617 https://www.kaggle.com/paultimothymooney/overview-of-the-eie-analytics-challenge
stored: ../data/database/json/ka_cn/6c0617aee8daa30f4a8eda0a85cb85f9.json
618 https://www.kaggle.com/ragnar123/exploratory-data-analysis-and-factor-model-idea
stored: ../data/database/json/ka_cn/a70f0f8d20431ae2228b62beb0c3f58d.json
619 https://www.kaggle.com/raviyadav2398/ds4g-emission-factor
sto

676 https://www.kaggle.com/dhananjay3/pytorch-xla-for-tpu-with-multiprocessing
stored: ../data/database/json/ka_cn/42a758cdac2a5eb7959bd20274b44670.json
677 https://www.kaggle.com/mgornergoogle/custom-training-loop-with-100-flowers-on-tpu
stored: ../data/database/json/ka_cn/833e6ab327186c742d4bd1f5a38fc4ca.json
678 https://www.kaggle.com/mmmarchetti/flowers-on-tpu-ii
stored: ../data/database/json/ka_cn/24246a2f3c704f72e561df257af27a48.json
679 https://www.kaggle.com/msheriey/flowers-on-tpu-ensemble-lr-schedule
stored: ../data/database/json/ka_cn/b265f59303fc63f4963cb120a0b9c692.json
680 https://www.kaggle.com/philculliton/a-simple-tf-2-1-notebook
stored: ../data/database/json/ka_cn/b0d0b661897102f408acf16e73721fcf.json
681 https://www.kaggle.com/ratan123/densenet201-flower-classification-with-tpus
stored: ../data/database/json/ka_cn/06d347eb97db2dcdfbd27d5578ac4410.json
682 https://www.kaggle.com/romanweilguny/tpu-flowers-first-love
stored: ../data/database/json/ka_cn/3fac0e779c5848503

stored: ../data/database/json/ka_cn/f3deafb4ece76ab8cba321886c3307ce.json
736 https://www.kaggle.com/artgor/eda-on-basic-data-and-lgb-in-progress
stored: ../data/database/json/ka_cn/bafa470cdd7729ae1874734002058fb9.json
737 https://www.kaggle.com/erikbruin/google-analytics-eda-lightgbm-screenshots
stored: ../data/database/json/ka_cn/ee44bfa03c155aff6de345b33b28b62d.json
738 https://www.kaggle.com/fabiendaniel/lgbm-starter
stored: ../data/database/json/ka_cn/738983e985009145e64dbba5ad8fdaf3.json
739 https://www.kaggle.com/julian3833/2-quick-study-lgbm-xgb-and-catboost-lb-1-66
stored: ../data/database/json/ka_cn/a154ac1c8dcd66516443e3cf7bc0c30e.json
740 https://www.kaggle.com/kabure/exploring-the-consumer-patterns-ml-pipeline
stored: ../data/database/json/ka_cn/93ce895318ccdd4c94800d3fc576bcb5.json
741 https://www.kaggle.com/ogrellier/i-have-seen-the-future
stored: ../data/database/json/ka_cn/033157b8675e4b43a2ddbf1f6c578bb0.json
742 https://www.kaggle.com/ogrellier/teach-lightgbm-to-sum

stored: ../data/database/json/ka_cn/676faf08181db9d685c09ea986eb210f.json
797 https://www.kaggle.com/pestipeti/pytorch-starter-fasterrcnn-inference
stored: ../data/database/json/ka_cn/3d404c7e97398b3ee8b286350736e548.json
798 https://www.kaggle.com/pestipeti/pytorch-starter-fasterrcnn-train
stored: ../data/database/json/ka_cn/ff409de646f1b15254274018dd9828dc.json
799 https://www.kaggle.com/shonenkov/bayesian-optimization-wbf-efficientdet
stored: ../data/database/json/ka_cn/34929977f6a36080ddec352e602d0857.json
800
800 https://www.kaggle.com/shonenkov/inference-efficientdet
stored: ../data/database/json/ka_cn/7225cd9b553083d44f87b67d7f9f7cbc.json
801 https://www.kaggle.com/shonenkov/oof-evaluation-mixup-efficientdet
stored: ../data/database/json/ka_cn/4dff772238d8599aac0b4cdad3d6aa2c.json
802 https://www.kaggle.com/shonenkov/training-efficientdet
stored: ../data/database/json/ka_cn/8f64cb77b3b690dbc166f7be2b8dffe7.json
803 https://www.kaggle.com/shonenkov/wbf-approach-for-ensemble
store

stored: ../data/database/json/ka_cn/ba7aefe1028d7c67e88fc7e9f52110dc.json
854 https://www.kaggle.com/phoenix9032/pytorch-bert-plain
stored: ../data/database/json/ka_cn/be6491ad58a2924a049d67ae8924ad31.json
855 https://www.kaggle.com/ratthachat/quest-cv-analysis-on-different-splitting-methods
stored: ../data/database/json/ka_cn/fe0e3a21fb78c26a172ff816a327454c.json
856 https://www.kaggle.com/abbysobh/classifying-client-type-using-client-names
stored: ../data/database/json/ka_cn/4bd1f1a198912429e602c7bd453e2495.json
857 https://www.kaggle.com/anokas/exploratory-data-analysis
stored: ../data/database/json/ka_cn/902ad7430373502ef5b4d46b11c1b76e.json
858 https://www.kaggle.com/vykhand/exploring-products
stored: ../data/database/json/ka_cn/0d2c9d60c528555c7396b83c5a1deb3f.json
859 https://www.kaggle.com/ajeffries/obsolete-halite-v0-starter-notebook
stored: ../data/database/json/ka_cn/d49a86146cfc6657afe4e5b4a1d914d1.json
860 https://www.kaggle.com/basu369victor/designing-game-ai-with-reinfor

stored: ../data/database/json/ka_cn/bda84e819f5bab4486e76b70b8434415.json
925 https://www.kaggle.com/nikitpatel/best-tutorial-for-beginner
stored: ../data/database/json/ka_cn/78667457a7e27694aad96f1c2819c753.json
926 https://www.kaggle.com/rejpalcz/best-loss-function-for-f1-score-metric
stored: ../data/database/json/ka_cn/70d0f29ad11a75e992564a9cefd176ea.json
927 https://www.kaggle.com/rejpalcz/cnn-128x128x4-keras-from-scratch-lb-0-328
stored: ../data/database/json/ka_cn/c89060a3c97cf276d692ba6c49be50b0.json
928 https://www.kaggle.com/rejpalcz/gapnet-pl-lb-0-385
stored: ../data/database/json/ka_cn/4a93b2fd2c22f0e597a758635dfb682a.json
929 https://www.kaggle.com/therealpythonman/get-350k-additional-hpa-images
stored: ../data/database/json/ka_cn/1361eef2ba5ae2f54e17e47d5cc5e145.json
930 https://www.kaggle.com/zhugds/resnet34-with-rgby-fast-ai-fork
stored: ../data/database/json/ka_cn/ec9436860975f7933d802c90f426052a.json
931 https://www.kaggle.com/artgor/pytorch-whale-identifier
stored: .

stored: ../data/database/json/ka_cn/303a3af278f4f0dd3b5d0d3fa41b6541.json
988 https://www.kaggle.com/ateplyuk/efficientnet-keras-s75-b200-e20
stored: ../data/database/json/ka_cn/a5ca15dafa0c3bfb96539f0ab9c38acb.json
989 https://www.kaggle.com/ateplyuk/keras-starter
stored: ../data/database/json/ka_cn/06f829a36664e2f22424c0f8e4ebcc17.json
990 https://www.kaggle.com/backaggle/imet-fastai-starter-focal-and-fbeta-loss
stored: ../data/database/json/ka_cn/4d99adf7d463f7cd717bd8e4c66a5b0b.json
991 https://www.kaggle.com/chewzy/eda-weird-images-with-new-updates
stored: ../data/database/json/ka_cn/4af000148d620e0888add54d02868f89.json
992 https://www.kaggle.com/dimitreoliveira/imet-collection-2019-eda-keras
stored: ../data/database/json/ka_cn/796684ce3b6fbe4172a4c19a6f5d3f74.json
993 https://www.kaggle.com/h4211819/leaderboard-analysis
stored: ../data/database/json/ka_cn/2a980c1ce00fc2627dffb6313d8854c3.json
994 https://www.kaggle.com/hidehisaarai1213/imet-pytorch-starter
stored: ../data/databa

stored: ../data/database/json/ka_cn/18ba934ac7446763b9c10bf8c432d76b.json
1055 https://www.kaggle.com/scottykwok/making-sense-out-of-some-difficult-samples
stored: ../data/database/json/ka_cn/999d2b69da264d7ef913798ab53e6822.json
1056 https://www.kaggle.com/vfdev5/data-exploration-1
stored: ../data/database/json/ka_cn/93eda97b0e735e0fbea3fc02b1006a28.json
1057 https://www.kaggle.com/vfdev5/type-1-clustering
stored: ../data/database/json/ka_cn/7ca257db62ab256de2d5938e6dab494a.json
1058 https://www.kaggle.com/zahaviguy/cervix-image-segmentation
stored: ../data/database/json/ka_cn/2afe9bc1c33cc19660db2961a821167f.json
1059 https://www.kaggle.com/amlacorp/keras-starter-fork
stored: ../data/database/json/ka_cn/34bb21ffd83f46a74db0010cc8f4552d.json
1060 https://www.kaggle.com/ardiya/tensorflow-vgg-pretrained
stored: ../data/database/json/ka_cn/95af2f56a412162ccdf52cd54465fd5d.json
1061 https://www.kaggle.com/chmaxx/finetune-vgg16-0-97-with-minimal-effort
stored: ../data/database/json/ka_cn/7

stored: ../data/database/json/ka_cn/9023fcb6af52240f7ccb3246dbb7d816.json
1120 https://www.kaggle.com/ekhtiar/unintended-eda-with-tutorial-notes
stored: ../data/database/json/ka_cn/9270b601fc08bc13db214ddeb9318ebd.json
1121 https://www.kaggle.com/nholloway/the-effect-of-word-embeddings-on-bias
stored: ../data/database/json/ka_cn/788c0d4b2ce93df1e8dd4830785209b9.json
1122 https://www.kaggle.com/nz0722/simple-eda-text-preprocessing-jigsaw
stored: ../data/database/json/ka_cn/967051990d0019a7e239c3aa4ae2484e.json
1123 https://www.kaggle.com/taindow/simple-cudnngru-python-keras
stored: ../data/database/json/ka_cn/eadc3b66b6b7dfcc3c904c6afa8bd99d.json
1124 https://www.kaggle.com/timon88/bert-lstm-simple-blender-0-93844-lb
stored: ../data/database/json/ka_cn/42bb320c56995ea34b6ea729ce7e26bf.json
1125 https://www.kaggle.com/andresionek/how-to-create-award-winning-data-visualizations
stored: ../data/database/json/ka_cn/50ecf4b4bc4a26deff100d8bd368bcf6.json
1126 https://www.kaggle.com/andresione

stored: ../data/database/json/ka_cn/b843d31440bc8b8e2013dfd2cc2f523e.json
1178 https://www.kaggle.com/selfishgene/psychology-of-a-professional-athlete
stored: ../data/database/json/ka_cn/ceec8559d207676f8b65314525e72b58.json
1179 https://www.kaggle.com/basu369victor/kuzushiji-recognition-just-like-digit-recognition
stored: ../data/database/json/ka_cn/6291ea2f9d8df53fcead8caa2d12f162.json
1180 https://www.kaggle.com/christianwallenwein/fastest-way-to-crop-all-images
stored: ../data/database/json/ka_cn/b38d5e35d049096d786ae7c982b44008.json
1181 https://www.kaggle.com/christianwallenwein/visualization-useful-functions-small-eda
stored: ../data/database/json/ka_cn/bd30aae77e1cca38a3dccbc3342de3d5.json
1182 https://www.kaggle.com/frlemarchand/keras-maskrcnn-kuzushiji-recognition
stored: ../data/database/json/ka_cn/654d1ac59f3ffcf88fb79eac706493f4.json
1183 https://www.kaggle.com/frlemarchand/kuzushiji-page-generator
stored: ../data/database/json/ka_cn/ec84f8320174ebc91e73a6d854540d58.json
1

1244 https://www.kaggle.com/abhmul/keras-convnet-lb-0-0052-w-visualization
stored: ../data/database/json/ka_cn/b33b60e7293f31de360792cbce6e2375.json
1245 https://www.kaggle.com/alexanderlazarev/simple-keras-1d-cnn-features-split
stored: ../data/database/json/ka_cn/e5611a799ba9a78dda302758fcd67fb1.json
1246 https://www.kaggle.com/asparago/3-basic-classifiers-and-features-correlation
stored: ../data/database/json/ka_cn/e58313b417c6447e9b185fb2a980069f.json
1247 https://www.kaggle.com/group16/shapelets
stored: ../data/database/json/ka_cn/f1aa6d9dfb59f25c62bfec4ecc9ef80c.json
1248 https://www.kaggle.com/jeffd23/10-classifier-showdown-in-scikit-learn
stored: ../data/database/json/ka_cn/9fa01392d50f6027c638de41759b23ee.json
1249 https://www.kaggle.com/lorinc/feature-extraction-from-images
stored: ../data/database/json/ka_cn/eb5d4a32076c28f74dfe12d3677b7f41.json
1250 https://www.kaggle.com/lorinc/feature-extraction-v4
stored: ../data/database/json/ka_cn/3517d4caf8b2a1485cdaf140de29d460.json
1

stored: ../data/database/json/ka_cn/e35aeafd6d0145f88d6c9c1fde5a5938.json
1311 https://www.kaggle.com/holoong9291/eda-for-m5-2-en
stored: ../data/database/json/ka_cn/11428ee855c2f159fda28d3d69ec8274.json
1312 https://www.kaggle.com/kneroma/fast-wsp-loss-implementation-5s
stored: ../data/database/json/ka_cn/94df08e2dbe0af51716422f2a146d401.json
1313 https://www.kaggle.com/mpware/quantile-regression-cv3-tf
stored: ../data/database/json/ka_cn/2bed1f41ea2afc31512f02bbfb2151e6.json
1314 https://www.kaggle.com/steverab/m5-forecast-compet-uncert-gluonts-template
stored: ../data/database/json/ka_cn/7b0dc31e7633dfcda01beb3efea80518.json
1315 https://www.kaggle.com/timib1203/time-series-clustering-for-forecasting-preparation
stored: ../data/database/json/ka_cn/6fb1a00783942d9f88ef06d874837b60.json
1316 https://www.kaggle.com/ulrich07/do-not-write-off-rnn-cnn
stored: ../data/database/json/ka_cn/a8620381cdeb4f9411aa71337f24c9fe.json
1317 https://www.kaggle.com/ulrich07/quantile-regression-with-ker

stored: ../data/database/json/ka_cn/55c29a69ec96168455ccfc901709362f.json
1377 https://www.kaggle.com/artgor/movie-review-sentiment-analysis-eda-and-models
stored: ../data/database/json/ka_cn/5d7cc4fb1791f6df5da9b5a263b91798.json
1378 https://www.kaggle.com/ashishpatel26/movie-review-analysis
stored: ../data/database/json/ka_cn/0c06a4cf06880445a1380f222f7435ec.json
1379 https://www.kaggle.com/codename007/sentiment-analysis-baseline-model
stored: ../data/database/json/ka_cn/4cd7235548586473782f527f8683e094.json
1380 https://www.kaggle.com/divsinha/sentiment-analysis-countvectorizer-tf-idf
stored: ../data/database/json/ka_cn/3104baf2fb125aa69d13d22413bab124.json
1381 https://www.kaggle.com/hamishdickson/cnn-for-sentence-classification-by-yoon-kim
stored: ../data/database/json/ka_cn/293d93d284f5d5b2fde5664e09a18125.json
1382 https://www.kaggle.com/jeremiahharmsen/tensorflow-hub-for-text-classification
stored: ../data/database/json/ka_cn/8cc81b03aff41827a4c745baa45aa740.json
1383 https://w

stored: ../data/database/json/ka_cn/46e79c9162db6c0a8ce87310a8759a0f.json
1433 https://www.kaggle.com/gaborfodor/summary-budapest-pythons
stored: ../data/database/json/ka_cn/f28dcb22bf6de4334fa4632ff55d9bc3.json
1434 https://www.kaggle.com/garlsham/nfl-punt-rule-changes-interactive-plots
stored: ../data/database/json/ka_cn/0b43fea9e3bec3a250ca314be114e40d.json
1435 https://www.kaggle.com/hallayang/nfl-punt-analytics-proposal
stored: ../data/database/json/ka_cn/6ef116c0789c91b9958d5815dae6796e.json
1436 https://www.kaggle.com/jdemeo/nfl-concussion-video-analysis
stored: ../data/database/json/ka_cn/63ac6b27b8db9eea5d8995c54469005d.json
1437 https://www.kaggle.com/returnofsputnik/variables-that-are-correlated-with-concussions
stored: ../data/database/json/ka_cn/291f958da85327ef52bf0142480a3093.json
1438 https://www.kaggle.com/robikscube/analyzing-nfl-punt-data-starter-notebook-and-eda
stored: ../data/database/json/ka_cn/1f80a9ee0c7889e0b1ccb7cf86ea2041.json
1439 https://www.kaggle.com/rob

stored: ../data/database/json/ka_cn/48fa6c9d242fa3a8658ed85300a971e5.json
1494 https://www.kaggle.com/priteshshrivastava/imageai-resnet50-2018-kernel
stored: ../data/database/json/ka_cn/e86158365ed0ab4c99d4bb88261acbb5.json
1495 https://www.kaggle.com/xhlulu/intro-to-tf-hub-for-object-detection
stored: ../data/database/json/ka_cn/080e825314e8f5b83506d204204977e0.json
1496 https://www.kaggle.com/yw6916/how-to-build-yolo-v3
stored: ../data/database/json/ka_cn/be8e374e29349cd9e02c0207379aeeae.json
1497 https://www.kaggle.com/yw6916/my-yolo-v3
stored: ../data/database/json/ka_cn/cc8b6f99ccbb11c038be002752581385.json
1498 https://www.kaggle.com/danish788/beat-the-lb
stored: ../data/database/json/ka_cn/521152f5928b09af17d24f4d8e024cef.json
1499 https://www.kaggle.com/hli2020/googld-ai-visual-relationship-data-exploration
stored: ../data/database/json/ka_cn/0ec7114b934fccce8cf31a69b1d9431a.json
1500
1500 https://www.kaggle.com/seriousran/googld-ai-visual-relationship-data-exploration
stored: 

stored: ../data/database/json/ka_cn/7dbc8b75d62b944a5f86657d03e047f2.json
1562 https://www.kaggle.com/phunghieu/a-quick-simple-eda
stored: ../data/database/json/ka_cn/ffc74b0cfd70b57ca1810396258b8c86.json
1563 https://www.kaggle.com/robikscube/autonomous-driving-introduction-data-review
stored: ../data/database/json/ka_cn/c4720728e28b1adf1c709c25233fc929.json
1564 https://www.kaggle.com/subinium/3d-interactive-car-with-plotly
stored: ../data/database/json/ka_cn/66d92ca52a66ad2d37c8f1c0ef491558.json
1565 https://www.kaggle.com/zstusnoopy/visualize-the-location-and-3d-bounding-box-of-car
stored: ../data/database/json/ka_cn/e00983bc5295d6bd30a4ff7765aee8b5.json
1566 https://www.kaggle.com/anokas/data-exploration-analysis
stored: ../data/database/json/ka_cn/6bc8bf96c3bd95a8ed16d51e0caa91b2.json
1567 https://www.kaggle.com/fppkaggle/making-tifs-look-normal-using-spectral-fork
stored: ../data/database/json/ka_cn/bfe5d03d7addb58887052d98fe323713.json
1568 https://www.kaggle.com/hortonhearsafo

stored: ../data/database/json/ka_cn/8ba8f63282f8abaf08c806627c557ee3.json
1627 https://www.kaggle.com/haqishen/panda-inference-w-36-tiles-256
stored: ../data/database/json/ka_cn/d036e2b930de8c17224ae953430e8ebe.json
1628 https://www.kaggle.com/haqishen/train-efficientnet-b0-w-36-tiles-256-lb0-87
stored: ../data/database/json/ka_cn/65875702522f9ed95ad1c1c4eba1ccd7.json
1629 https://www.kaggle.com/iafoss/panda-16x128x128-tiles
stored: ../data/database/json/ka_cn/a12f18ee7bade8ec72f95e512abebbb5.json
1630 https://www.kaggle.com/iafoss/panda-concat-tile-pooling-starter-0-79-lb
stored: ../data/database/json/ka_cn/cfc9b6e6bbb19f0ac8a4b584e129ad4a.json
1631 https://www.kaggle.com/iafoss/panda-concat-tile-pooling-starter-inference
stored: ../data/database/json/ka_cn/a4f0a925bf05517eb4b0beda3b5a4ac7.json
1632 https://www.kaggle.com/iamleonie/panda-eda-visualizations-suspicious-data
stored: ../data/database/json/ka_cn/d1b083deaf9ff357aa53a8d6c5ba10fb.json
1633 https://www.kaggle.com/reighns/unde

1689 https://www.kaggle.com/sudalairajkumar/keras-starter-script-with-word-embeddings
stored: ../data/database/json/ka_cn/9565347352879522539b6a385489b073.json
1690 https://www.kaggle.com/sudalairajkumar/simple-leaky-exploration-notebook-quora
stored: ../data/database/json/ka_cn/b414227fb5a646336f7d7e2c560b5e75.json
1691 https://www.kaggle.com/alvations/basic-nlp-with-nltk
stored: ../data/database/json/ka_cn/5a6575f4fd86bff7740dfd9f47e01c1a.json
1692 https://www.kaggle.com/koushikdeb/intro-to-terminologies-of-nlp
stored: ../data/database/json/ka_cn/9046b1e5339a755dbfa0ab5ea788909c.json
1693 https://www.kaggle.com/tanjinprity/ml-project-cat-and-tan
stored: ../data/database/json/ka_cn/9ae873fce0f72b743f1b6188faaaa85c.json
1694 https://www.kaggle.com/ynue21/random-act-of-pizza
stored: ../data/database/json/ka_cn/e35dd26eaca484a18e0e5622a846edab.json
1695 https://www.kaggle.com/zahoorahmad/basic-nlp-with-nltk-fba294
stored: ../data/database/json/ka_cn/2b1e180b035e6f3543629a6216f797f4.json


stored: ../data/database/json/ka_cn/62b981cf424bee0bfc30d18788712d1e.json
1750 https://www.kaggle.com/jhoward/creating-a-metadata-dataframe-fastai
stored: ../data/database/json/ka_cn/f5309b463d627099c7022f28dbc070d7.json
1751 https://www.kaggle.com/jhoward/don-t-see-like-a-radiologist-fastai
stored: ../data/database/json/ka_cn/ed0c4d5dc865e560123dd045043fe4a7.json
1752 https://www.kaggle.com/jhoward/from-prototyping-to-submission-fastai
stored: ../data/database/json/ka_cn/ea6c1efb0978de9121ec89edd20fe9f9.json
1753 https://www.kaggle.com/jhoward/some-dicom-gotchas-to-be-aware-of-fastai
stored: ../data/database/json/ka_cn/18adda438f4808bc32769a692940f95c.json
1754 https://www.kaggle.com/marcovasquez/basic-eda-data-visualization
stored: ../data/database/json/ka_cn/d22d42e608528b31498e0aab3e5ef09b.json
1755 https://www.kaggle.com/mobassir/keras-efficientnetb4-for-intracranial-hemorrhage
stored: ../data/database/json/ka_cn/ec2f0cf1ed92d0c24d661bbc279e6967.json
1756 https://www.kaggle.com/om

stored: ../data/database/json/ka_cn/917ab34c000fb07a7d708c710aabe655.json
1811 https://www.kaggle.com/mathormad/knowledge-distillation-with-nn-rankgauss
stored: ../data/database/json/ka_cn/6210d1347b80d08ebc223bae8e71f0c5.json
1812 https://www.kaggle.com/mjbahmani/santander-ml-explainability
stored: ../data/database/json/ka_cn/007f83cb1b16866f8369b14531d1a428.json
1813 https://www.kaggle.com/roydatascience/eda-pca-simple-lgbm-on-kfold-technique
stored: ../data/database/json/ka_cn/b67d32c6a1c2e66fb1053f16ac2fb8da.json
1814 https://www.kaggle.com/vinhnguyen/gpu-acceleration-for-lightgbm
stored: ../data/database/json/ka_cn/2b7379c6689c013f03e87e20608bbb34.json
1815 https://www.kaggle.com/apryor6/detailed-cleaning-visualization-python
stored: ../data/database/json/ka_cn/88ffe77f90703b2dc37e9103d5b7eb10.json
1816 https://www.kaggle.com/katerynad/know-your-data
stored: ../data/database/json/ka_cn/ccb78ea4e25f7d322a244c5d9f3cb817.json
1817 https://www.kaggle.com/marc000/feature-importance-v1


1872 https://www.kaggle.com/kostyaatarik/shame-on-me
stored: ../data/database/json/ka_cn/bbad141b6e51632dc1546384d99ee3af.json
1873 https://www.kaggle.com/seshadrikolluri/understanding-the-problem-and-some-sample-paths
stored: ../data/database/json/ka_cn/d9cfdd2947064e234de2feeea4030838.json
1874 https://www.kaggle.com/ajaygoswami/kernel6b62964a77
stored: ../data/database/json/ka_cn/990bd9795a10dffd27e19bb484b050ed.json
1875 https://www.kaggle.com/davidmezzetti/trec-covid-search-index
stored: ../data/database/json/ka_cn/83afc07a5dec12cb4fc48e662c27866b.json
1876 https://www.kaggle.com/davidmezzetti/trec-covid-submission
stored: ../data/database/json/ka_cn/f4429ee71e1c98f06bdf9e7a93bf8f0d.json
1877 https://www.kaggle.com/mpwolke/covid-19-ace
stored: ../data/database/json/ka_cn/1a0b2738e266d958e7ff3ca1ef1f05eb.json
1878 https://www.kaggle.com/mpwolke/covid-19-biomarkers
stored: ../data/database/json/ka_cn/1db18004b796566fe9576f1bf19479db.json
1879 https://www.kaggle.com/mpwolke/covid-19-

stored: ../data/database/json/ka_cn/9449b83d7b44f06fd1ead48e3e7423fb.json
1933 https://www.kaggle.com/jannesklaas/lb-0-63-xgboost-baseline
stored: ../data/database/json/ka_cn/611e8d4a59728d775f3c5d410cc3674b.json
1934 https://www.kaggle.com/jsaguiar/baseline-with-news
stored: ../data/database/json/ka_cn/1595c5e935b7cfbbfa77b6d46d5b42e8.json
1935 https://www.kaggle.com/marketneutral/eda-what-does-mktres-mean
stored: ../data/database/json/ka_cn/a1bed3df1d495f503e1b7450850cb5c9.json
1936 https://www.kaggle.com/marketneutral/the-fallacy-of-encoding-assetcode
stored: ../data/database/json/ka_cn/b2d966b3759feac320718b0b3deffa11.json
1937 https://www.kaggle.com/pestipeti/simple-eda-two-sigma
stored: ../data/database/json/ka_cn/7896f58335b1da17fea1c37da1f20594.json
1938 https://www.kaggle.com/bguberfain/elastic-transform-for-data-augmentation
stored: ../data/database/json/ka_cn/50ecdfaf6d9241344f7fbf4a6206254a.json
1939 https://www.kaggle.com/gbatchkala/urss-2019-project-review
stored: ../data

1993 https://www.kaggle.com/ymlai87416/web-traffic-time-series-forecast-with-4-model
stored: ../data/database/json/ka_cn/9d3a1800596a849d0b715fa2c071b645.json
1994 https://www.kaggle.com/zoupet/predictive-analysis-with-different-approaches
stored: ../data/database/json/ka_cn/d69004716dfd6a200f12ed963059a00e.json
1995 https://www.kaggle.com/andersy005/getting-started
stored: ../data/database/json/ka_cn/3cd0a88e9d0be01071a6a5e6eb331f57.json
1996 https://www.kaggle.com/anezka/cnn-with-keras-for-humpback-whale-id
stored: ../data/database/json/ka_cn/389e0ad2a84d832ae244dd269170e1d3.json
1997 https://www.kaggle.com/ashishpatel26/whale-prediction-using-resnet-50
stored: ../data/database/json/ka_cn/0cd938f48a9f18fbf94132cff8cf318d.json
1998 https://www.kaggle.com/bibuml/beating-cvxtz-very-good-code-0-38-to-0-42
stored: ../data/database/json/ka_cn/f4394152c1d80586412805fd262d9497.json
1999 https://www.kaggle.com/lextoumbourou/humpback-whale-id-data-and-aug-exploration
stored: ../data/database/j

stored: ../data/database/json/ka_cn/668aebfbd478531de08d3abcb91a23fb.json
2061 https://www.kaggle.com/jameschien/newbie-easy-way-for-loading-video-frame-tfrecord
stored: ../data/database/json/ka_cn/daa5369ba0b352cdead0c5e0e6e89d9a.json
2062 https://www.kaggle.com/juliaelliott/starter-kernel-yt8m-2018-sample-data
stored: ../data/database/json/ka_cn/db782ee639f64374b93705b656fff74e.json
2063 https://www.kaggle.com/machineheart/the-tools-of-import-data
stored: ../data/database/json/ka_cn/711b6ba0bed44005057769c1d1954448.json
2064 https://www.kaggle.com/mihaskalic/training-video-level-classifier-with-keras
stored: ../data/database/json/ka_cn/4c8301c4f955d5c98bd1cb2433e087ef.json
2065 https://www.kaggle.com/ozlemy/download-training-videos-for-windows
stored: ../data/database/json/ka_cn/670537bd1f749db3ace6eeb0a5a40414.json
2066 https://www.kaggle.com/plumeria/8m-video-understanding-challenge
stored: ../data/database/json/ka_cn/cb3fee6bc24d4e9c2c243a466d62d2be.json
2067 https://www.kaggle.co

stored: ../data/database/json/ma/a3e8717900607a2cd126cacd5e8eb019.json
30 https://mlart.co/item/a-stylegan-trained-on-clonal-colonies_-large-concentrations-of-genetically-identical-organisms_-imitating-the-natural-clonal-colony-phenomenon-in-aspen-trees
31 https://mlart.co/item/a-stylegan-trained-on-a-dataset-synthesised-with-analog-and-digital-materials
stored: ../data/database/json/ma/6c9f2203a540f2705e4c919af2ff17fd.json
32 https://mlart.co/item/deepdream-applied-to-bob-ross_s-video-and-sound-synthesized-with-wavenet
33 https://mlart.co/item/gan-trained-on-gap-clothes-with-super-resolution
34 https://mlart.co/item/stylegan-generated-mexican-and-guatemalan-textiles
35 https://mlart.co/item/automatic-analysis-and-visualization-of-motion-using-a-combination-of-projective-non-negative-matrix-factorization_-and-simplex-volume-maximization
36 https://mlart.co/item/custom-gan-generated-image-with-self-attention_-trained-on-personal-photos-from-finland
stored: ../data/database/json/ma/644f9

stored: ../data/database/json/ma/b1c0260a510e7d3f65f7eb8149ab86e5.json
98 https://mlart.co/item/use-singan-to-train-a-gan-with-a-single-image_-frymire_-the-classic-compression-image
99 https://mlart.co/item/repair-images-using-neural-cellular-automata_-a-16-channel-recurrent-cnn-with-a-fixed-sobel-kernel_-a-dense-layer_-and-masking
100
100 https://mlart.co/item/gan-trained-on-defra_s-lidar-data-of-london_-then-visualize-the-height-maps-in-aeriolod
101 https://mlart.co/item/clustering-of-static-adaptive-correspondences-for-deformable-object-tracking-_cmt_-with-optical-flow-enhanced-style-transfer_-and-a-loop-generating-algorithm-based-on-optical-flow
102 https://mlart.co/item/use-detectron2-to-identify-people-in-storror_s-parkour-video-and-apply-optical-flow-based-style-transfer
stored: ../data/database/json/ma/b8135d8816b0c8268d2c7e0168cfe3fd.json
103 https://mlart.co/item/apply-ken-burns-3d-until-it-breaks_-exposing-the-simulation-software-the-model-was-created-with
104 https://mlart.

stored: ../data/database/json/ma/a404eb4afba263fcd10b39584d355838.json
154 https://mlart.co/item/use-posenet-to-extract-a-pose-and-generate-stick-figure-choreographies
155 https://mlart.co/item/generate-a-poem-from-a-word-and-project-it-on-your-face
156 https://mlart.co/item/generate-a-vintage-chinese-opera-by-extracting-poses-and-augmenting-them-with-a-pix2pix-model
stored: ../data/database/json/ma/a082eee6936b8b30597656140eb43503.json
157 https://mlart.co/item/zooming-inside-of-an-artwork-and-applying-style-transfer-to-transfer-to-another-artwork
158 https://mlart.co/item/gan-collage-interpolation-of-three-dogs
159 https://mlart.co/item/rnn-generative-piano-music-inspired-by-satie
stored: ../data/database/json/ma/74565caa8fceb5f408956ced700c898d.json
160 https://mlart.co/item/rnn-generated-classical-music
stored: ../data/database/json/ma/8b0bbb07e1e4a3813d04dad5d072eba2.json
161 https://mlart.co/item/use-wavegan-to-generate-sound-based-on-touch
stored: ../data/database/json/ma/0c8d1c

226 https://mlart.co/item/train-a-model-to-map-a-gesture-to-a-sound
227 https://mlart.co/item/gan-generative-images-based-on-dali
228 https://mlart.co/item/use-a-turing-test-chatbot-to-guide-an-improvisational-theatre-act
229 https://mlart.co/item/generate-pictures-of-flowers-with-a-gan
230 https://mlart.co/item/use-gan-paintings-and-object-detection-to-remove-objects-from-a-video
231 https://mlart.co/item/gan-generated-images-from-personal-photos-from-finland
stored: ../data/database/json/ma/837228e08bacac5d93db37e1f2117aeb.json
232 https://mlart.co/item/use-cyclegan-to-translate-street-images-to-artworks-from-wikiart
stored: ../data/database/json/ma/0271e9f59c859ffbf9746673756551f6.json
233 https://mlart.co/item/combine-a-collage-of-gan-interpolations-to-create-motives_-then-projected-on-a-building
stored: ../data/database/json/ma/53064a1d05e202546a9dd50c0b2291af.json
234 https://mlart.co/item/generate-chairs-with-a-gan-and-then-produce-the-generated-chairs
235 https://mlart.co/item/

stored: ../data/database/json/ma/7c26c833cf6c4423b654552c27067079.json
293 https://mlart.co/item/paint-12-shapes_-algorithmically-combine-the-forms_-use-the-inception-classifier-to-select-artworks_-and-then-paint-the-final-selection
stored: ../data/database/json/ma/c0db8fb8bfc83119d4e1750dcf86fb7d.json
294 https://mlart.co/item/generate-shader-like-animations-with-cppn-using-microphone-prompts
stored: ../data/database/json/ma/b6cd2e8404af72ca62bc451e1dbe8aba.json
295 https://mlart.co/item/a-voiceover-via-a-cnn-and-rnn-for-walt-disney_s-nature_s-half-acre-_1951_-trained-on-text-from-female-protagonists-in-romance-novels
296 https://mlart.co/item/a-gan-trained-on-maltese-photography-and-as-well-as-the-gan-generated-images-themselves
297 https://mlart.co/item/progressively-grown-gan-trained-on-wikiarts
stored: ../data/database/json/ma/f1bb4b886d370251ff6b7cecf8213056.json
298 https://mlart.co/item/a-gan-trained-on-a-personal-dataset-of-100k-individual-petals-and-interpolated-in-a-collage


stored: ../data/database/json/ma/43bbcd31c332eb5b36743e6a1cc307a5.json
362 https://mlart.co/item/deepdream-inspired-by-classical-western-art-history
stored: ../data/database/json/ma/154a5c80b177855c3b1e19aafd8ffe05.json
363 https://mlart.co/item/a-conditional-gan-applied-on-satellite-images-from-openstreetmap
stored: ../data/database/json/ma/b7d3609a089aa90cd329b68747233d97.json
364 https://mlart.co/item/flora-and-fauna-themed-deepdream-projection-on-kaoru-okomura
stored: ../data/database/json/ma/128d3baf486e880330b59c43f9d960ae.json
365 https://mlart.co/item/deepdream-of-himalayas-created-with-lasercut-wood
366 https://mlart.co/item/pick-random-words-from-word2vec-and-look-at-their-relationships-to-the-words-_man_-and-_woman
stored: ../data/database/json/ma/8fcf62a92847cde2226c3fc18df6da54.json
367 https://mlart.co/item/select-a-region-on-a-map-and-extract-cnn-features-and-use-nns-to-find-similar-regions
stored: ../data/database/json/ma/253043db08cd2bcebacedf05da2a04b1.json
368 https:

stored: ../data/database/json/gh/a8c3f49389d1742cfc4dcf5f2b0442b9.json
30 https://github.com/andrey-lukyanov/Risk-Management
stored: ../data/database/json/gh/45e12468e0dfe13420e2402b567c9602.json
31 https://github.com/anki1909/Recruit-Restaurant-Visitor-Forecasting
stored: ../data/database/json/gh/96a609da924b8f21b7ae909ef2a1aa95.json
32 https://github.com/ankitkariryaa/ambulanceSiteLocation
stored: ../data/database/json/gh/4716c5589ede7deb3e073baccb6b061c.json
33 https://github.com/Ankushr785/Food-amenities-demand-prediction
stored: ../data/database/json/gh/bff081de7bb03e9397e8f1e46780fa29.json
34 https://github.com/anshu3769/FirmEmbeddings
stored: ../data/database/json/gh/9099dee87b8011e8792d7dbabf98b9cc.json
35 https://github.com/apbecker/Systemic_Risk/
stored: ../data/database/json/gh/e29a8dcfac036da6e124c307af08694c.json
36 https://github.com/apoorv-goel/Bank-Note-Authentication-Using-DNN-Tensorflow-Classifier-and-RandomForest
stored: ../data/database/json/gh/24b83e7f6fdd275bd406b

stored: ../data/database/json/gh/fe8a3424655ecac315c4d01db9422762.json
104 https://github.com/dariusmehri/Topic-Modeling-and-Analysis-of-Building-Related-Injuries
stored: ../data/database/json/gh/ab406e6957148dad973c8b4624f0e47d.json
105 https://github.com/darshankaarki/ml-coa-charging
stored: ../data/database/json/gh/9abe1afa16ac67826ae1db955ec51255.json
106 https://github.com/Data4Democracy/crash-model
stored: ../data/database/json/gh/1918bd04a682a7f38cae03d61b39f31c.json
107 https://github.com/datacamp/course-resources-ml-with-experts-budgets/
stored: ../data/database/json/gh/ec6af87d9d9cce1d6a9c5d960a1dd6a8.json
108 https://github.com/datadesk/california-ccscore-analysis
stored: ../data/database/json/gh/1e4ae0366c20da3c57b545ad144c52d5.json
109 https://github.com/datadesk/california-electricity-capacity-analysis
stored: ../data/database/json/gh/8fc55605286a8d329b1c8f47c29f0e8a.json
110 https://github.com/datadesk/lapd-crime-classification-analysis
stored: ../data/database/json/gh/9

stored: ../data/database/json/gh/70da68989dea25d24dd89dda6fd56911.json
185 https://github.com/hbutsuak95/Quality-Optimization-of-Steel
stored: ../data/database/json/gh/9be47146e9da2a758c4eadb0f436b5b1.json
186 https://github.com/hep-lbdl/adversarial-jets
stored: ../data/database/json/gh/0271c270eac02a13cd05bcb52a6a023d.json
187 https://github.com/hep-lbdl/CaloGAN
stored: ../data/database/json/gh/96cc22b31d725200b3fd11c71d20e708.json
188 https://github.com/higgsfield/interaction_network_pytorch
stored: ../data/database/json/gh/7b4a5efeda68d39706153ee12736a303.json
189 https://github.com/HIPS/neural-fingerprint
stored: ../data/database/json/gh/f0bfff2e70adc9883ed7d43ed57d6197.json
190 https://github.com/HitarthiShah/Radiation-Data-Analysis
stored: ../data/database/json/gh/f76919fb297b0258b7fca7abbe94c81a.json
191 https://github.com/hkacmaz/Bankin_Recovery/
stored: ../data/database/json/gh/3dfebd643c1158f1af19a4ac3c7805cf.json
192 https://github.com/hockeyjudson/Legal-Entity-Detection/
st

stored: ../data/database/json/gh/db924fffaf24cece85219aad57b656a6.json
255 https://github.com/longtng/frauddetectionproject/
stored: ../data/database/json/gh/f9c0fec4b144e27f5a162f25df240aa4.json
256 https://github.com/luqmanhakim/research-on-sp-wholesale/
stored: ../data/database/json/gh/74ac28f9138fa9dd0746e6d744062bd7.json
257 https://github.com/m-hoff/maintsim
stored: ../data/database/json/gh/5526105d52935a8c9e414a0f26e3f3d2.json
258 https://github.com/manuvarkey/Gestimator
stored: ../data/database/json/gh/f32919805cca40a9f44d92973f181896.json
259 https://github.com/marcotav/hotels
stored: ../data/database/json/gh/8064c7ad0c00f89cd649e7da2d831a42.json
260 https://github.com/MarcusOlivecrona/REINVENT
stored: ../data/database/json/gh/0a5b5046caad059e27a74b068ccbc0a4.json
261 https://github.com/materialsproject/emmet
stored: ../data/database/json/gh/e2dd304c0947bbce94b27d643f3943ba.json
262 https://github.com/materialsproject/pymatgen/
stored: ../data/database/json/gh/72dee01cd9825c28

stored: ../data/database/json/gh/5f7b888bbe458eac97b423838d1b11db.json
335 https://github.com/rawillis98/alpaca
stored: ../data/database/json/gh/4b80cbd3255587a39412c53e52f8b763.json
336 https://github.com/raymond180/FINRA_TRACE
stored: ../data/database/json/gh/5d07e2f058f5524bd8e083c22975823e.json
337 https://github.com/rdbraatz/data-driven-prediction-of-battery-cycle-life-before-capacity-degradation
stored: ../data/database/json/gh/760ba35bf142be26212283c4c9ba1aee.json
338 https://github.com/RealRadOne/Gyani-The-Loan-Eligibility-Predictor
stored: ../data/database/json/gh/4959521f4c6de5797925d428ddcb6ade.json
339 https://github.com/richardddli/state_electricity_rates
stored: ../data/database/json/gh/f4b790a37699eed32f1b17e90dfd455d.json
340 https://github.com/ricket-sjtu/bioinformatics
stored: ../data/database/json/gh/1565f1e06a75d8a124686ea67cd6864f.json
341 https://github.com/ritchie46/anaStruct
stored: ../data/database/json/gh/9a31ce95ad65d0d574b59c22f2c09cbd.json
342 https://githu

403 https://github.com/tslindner/Effects-of-Cannabis-Legalization
stored: ../data/database/json/gh/f4107d4412e47193d9db73dcb0ddc90c.json
404 https://github.com/tstreamDOTh/Instacart-Market-Basket-Analysis
stored: ../data/database/json/gh/2e0f5e5af0d662e2e70f4c00627728a8.json
405 https://github.com/tullyvelte/SchoolPerformanceDataAnalysis
stored: ../data/database/json/gh/cd8a4104acd57df2e26158694877981a.json
406 https://github.com/txytju/air-quality-prediction
stored: ../data/database/json/gh/e90d6192d82bbe550bcf609de85cab0b.json
407 https://github.com/ual/rental-listings
stored: ../data/database/json/gh/e930f264721e345336f35b3f7a1e31cd.json
408 https://github.com/uci-cbcl/D-GEX
stored: ../data/database/json/gh/dc32b8d98d0319470b69585c66aa8045.json
409 https://github.com/un-modelling/Electricity_Consumption_Surveys
stored: ../data/database/json/gh/7dbe2db7b864d6ccfa6e7f8c5600a1a3.json
410 https://github.com/usnistgov/modelmeth
stored: ../data/database/json/gh/e02ce276c9c1eb09b819d9dc05c

stored: ../data/database/json/tcp/4c65a297a4a5f7086b4090e905548145.json
32 https://thecleverprogrammer.com/2020/06/16/dog-and-cat-classification-using-convolutional-neural-networks-cnn/
33 https://thecleverprogrammer.com/2020/06/25/image-processing-with-machine-learning-and-python/
stored: ../data/database/json/tcp/990fb57280a364f6b56a8f6263fc0251.json
34 https://thecleverprogrammer.com/2020/06/29/skin-cancer-classification-with-machine-learning/
stored: ../data/database/json/tcp/3f8ac795c0c28f11cfb4f446db824d6e.json
35 https://thecleverprogrammer.com/2020/07/01/time-series-analysis-and-forecasting-with-python/
36 https://thecleverprogrammer.com/2020/07/02/logistic-regression-in-machine-learning-with-python/
stored: ../data/database/json/tcp/f903b33cca37601208324def8a9406e2.json
37 https://thecleverprogrammer.com/2020/07/05/naive-bayes-classification-in-machine-learning/
stored: ../data/database/json/tcp/601dbb6d6244853da7cc582195df7fe5.json
38 https://thecleverprogrammer.com/2020/07/0

stored: ../data/database/json/tcp/f94c26fe978775268a68a1724b225d1e.json
99 https://thecleverprogrammer.com/2020/09/01/image-segmentation-with-python/
stored: ../data/database/json/tcp/6a33cdd82f9da577b51394a8fd718097.json
100
100 https://thecleverprogrammer.com/2020/09/01/spacy-in-machine-learning/
stored: ../data/database/json/tcp/21eac8baad7eb58e157cfc2048aa2573.json
101 https://thecleverprogrammer.com/2020/09/02/predict-fuel-efficiency-with-machine-learning/
stored: ../data/database/json/tcp/7215a2080c92af03d7c90e7d0805512d.json
102 https://thecleverprogrammer.com/2020/09/03/abc-analysis-with-machine-learning/
stored: ../data/database/json/tcp/ffabe081e4a8243e5c3c8f1a3b576917.json
103 https://thecleverprogrammer.com/2020/09/04/overfitting-and-underfitting-in-machine-learning/
stored: ../data/database/json/tcp/75576d937fd9247f088d43aacdb46dda.json
104 https://thecleverprogrammer.com/2020/09/04/xgboost-in-machine-learning/
stored: ../data/database/json/tcp/3bf6026fc28d103e12378a8f424a

stored: ../data/database/json/tcp/ed9d9c7516139807442af673ff8162aa.json
159 https://thecleverprogrammer.com/2020/11/21/employee-attrition-prediction-with-python/
stored: ../data/database/json/tcp/26249f0f8218ec37b4ddcf3ee8a3d93b.json
160 https://thecleverprogrammer.com/2020/11/21/how-much-training-data-is-required-for-machine-learning/
stored: ../data/database/json/tcp/6cff9f523ce550dd6fefe9de0631f4e0.json
161 https://thecleverprogrammer.com/2020/11/23/machine-learning-process/
stored: ../data/database/json/tcp/5002962346a6d7b26871e4ffa12cf770.json
162 https://thecleverprogrammer.com/2020/11/24/flower-recognition-with-python/
stored: ../data/database/json/tcp/e382985bc0002b23730a0802a31fbc55.json
163 https://thecleverprogrammer.com/2020/11/25/gender-classification-with-python/
stored: ../data/database/json/tcp/9d5e2863a7bb8d379f2c15ffa351db9f.json
164 https://thecleverprogrammer.com/2020/11/27/chessboard-with-python/
stored: ../data/database/json/tcp/a54690819f1725af366669709f1a54e7.js

60 https://github.com/california-civic-data-coalition/first-python-notebook
stored: ../data/database/json/bc/38e78254b99d0f10c363ca9913d27cf1.json
61 https://github.com/Calysto/conx-notebooks
stored: ../data/database/json/bc/413e86f90300431bd8eb272789db8942.json
62 https://github.com/cameroncruz/notebooks
stored: ../data/database/json/bc/6cb33060c744ef1ae61c0d75337c9453.json
63 https://github.com/cantaro86/Financial-Models-Numerical-Methods
stored: ../data/database/json/bc/78137375deb5479ad281598841e872f5.json
64 https://github.com/carlosfab/data_science
stored: ../data/database/json/bc/dcc9efb7b64de31c59475a13a87f68fd.json
65 https://github.com/ccniuj/python_data_science_and_machine_learning_bootcamp
stored: ../data/database/json/bc/2a6550c44ccde189d06e0989c9074271.json
66 https://github.com/cedrickchee/data-science-notebooks
stored: ../data/database/json/bc/3bbad31242c735ab2fefe4ae891e9f03.json
67 https://github.com/cellardoor42/whatever
stored: ../data/database/json/bc/087f9794d794d

134 https://github.com/edbullen/nltk
stored: ../data/database/json/bc/fb09892c08b4a376f92c119ad01959ce.json
135 https://github.com/ehmatthes/intro_programming
stored: ../data/database/json/bc/d55a7c9879fb90400b603660c676013a.json
136 https://github.com/Einsteinish/Artificial-Neural-Networks-with-Jupyter
stored: ../data/database/json/bc/2ee5222362e5e949457be96d87d64d05.json
137 https://github.com/Einsteinish/bogotobogo-Machine-Learning
stored: ../data/database/json/bc/6832809d192c2ccb63bc6f7f8aec4265.json
138 https://github.com/elegant-scipy/notebooks
stored: ../data/database/json/bc/671d63082b080ee8b4d5a722baadd3b8.json
139 https://github.com/elyra-ai/elyra
stored: ../data/database/json/bc/61b9764d6a34b9cbf083abe5dd535960.json
140 https://github.com/Emergent-Behaviors-in-Biology/mlreview_notebooks
stored: ../data/database/json/bc/13fd991034b8cff501ccec6c932d6b0e.json
141 https://github.com/emmettgb/Emmetts-DS-NoteBooks
stored: ../data/database/json/bc/d87367d23cf9bf477ebe446c10c403e7.j

stored: ../data/database/json/bc/ed293a5cb4ac7e24a4eeca0af913ebff.json
211 https://github.com/InsightDataLabs/ipython-notebooks
stored: ../data/database/json/bc/00b2db9f2524a95318f16eed6e605dd0.json
212 https://github.com/InsightLab/data-science-cookbook
stored: ../data/database/json/bc/e372d609354745da1087db7a46d8fac7.json
213 https://github.com/InsightSoftwareConsortium/SimpleITK-Notebooks
stored: ../data/database/json/bc/8e07512583080008bbfbbad7595123f5.json
214 https://github.com/ioos/notebooks_demos
stored: ../data/database/json/bc/cd12278e88ce610d4639ab92165af18f.json
215 https://github.com/ipython/ipython
stored: ../data/database/json/bc/bef332df44ddb698944bb60bb8dc2e16.json
216 https://github.com/ipython-contrib/jupyter_contrib_nbextensions
stored: ../data/database/json/bc/2c8a9200d76331f52b7ee03e829c0bfe.json
217 https://github.com/jakemdrew/DataMiningNotebooks
stored: ../data/database/json/bc/1cd09e34eb2d31ab2c4c3d197ed19511.json
218 https://github.com/jakevdp/PythonDataScien

278 https://github.com/kwinkunks/notebooks
stored: ../data/database/json/bc/69cea0bb3a054632617a1ed23a05e481.json
279 https://github.com/kylemcdonald/AudioNotebooks
stored: ../data/database/json/bc/10f4fa59b94c3bd42aff5421d1d0a710.json
280 https://github.com/Lasagne/Recipes
stored: ../data/database/json/bc/f18afb194b81650ab7594e266bdaacc6.json
281 https://github.com/lebinh/vietnamese-accent-model
stored: ../data/database/json/bc/6be71812ac07e64b229b7e13bd80dd91.json
282 https://github.com/ledmaster/notebooks_tutoriais
stored: ../data/database/json/bc/733b401a95f63056d62b248b2a49fc8d.json
283 https://github.com/leondgarse/Atom_notebook
stored: ../data/database/json/bc/9636169b1ffbe5adce8232535747863d.json
284 https://github.com/leonvanbokhorst/NoteBooks-Statistics-and-MachineLearning
stored: ../data/database/json/bc/3096309d452967085ff3baa45b1e325c.json
285 https://github.com/leriomaggio/python-in-a-notebook
stored: ../data/database/json/bc/a0dd7a555509af354659916e46918abd.json
286 http

stored: ../data/database/json/bc/e212ff821c3fcc80865d79b1e09b8810.json
350 https://github.com/noaodatalab/notebooks-latest
stored: ../data/database/json/bc/b8baa0d6f33fa28accc7b3c75bfd141d.json
351 https://github.com/nteract/commuter
stored: ../data/database/json/bc/c6adc14099cefbab1ce5f78e476277b0.json
352 https://github.com/nteract/papermill
stored: ../data/database/json/bc/d2fd52e4338c1536c54b0e4906b39b14.json
353 https://github.com/nteract/scrapbook
stored: ../data/database/json/bc/d86b8e01f22a556d06bfd12008babde8.json
354 https://github.com/NumEconCopenhagen/ConsumptionSavingNotebooks
stored: ../data/database/json/bc/79d150910d9153096cf1754b1240b8f7.json
355 https://github.com/nyusterndatabootcamp/notebooks
stored: ../data/database/json/bc/6d9bdc68c3a16d8bc7ecfea33a2c880c.json
356 https://github.com/ocampor/notebooks
stored: ../data/database/json/bc/a8623cccc5ea7e3ae4476660784fab49.json
357 https://github.com/ocontreras309/ML_Notebooks
stored: ../data/database/json/bc/151c8aca9ed3

419 https://github.com/r9y9/Colaboratory
stored: ../data/database/json/bc/7b9488753ab83fad8bc225bd0c370af2.json
420 https://github.com/rambasnet/Python-Notebooks
stored: ../data/database/json/bc/98460bf2008d11bed2eb4a6512acfecb.json
421 https://github.com/rapidsai-community/notebooks-contrib
stored: ../data/database/json/bc/c391b338149a3a2a4ed09a6e814a8ed2.json
422 https://github.com/ravichaubey/Kaggle-Notebooks
stored: ../data/database/json/bc/b6273f9e0ba8136a5df3c7c979efb6ca.json
423 https://github.com/rayheberer/LambdaSchoolDataScience
stored: ../data/database/json/bc/b3c9194acb73a42e67ca89dbc2d2de3c.json
424 https://github.com/rdcolema/nc-fish-classification
stored: ../data/database/json/bc/dcda4c903f777e6d1be2bbcb6bfcee7c.json
425 https://github.com/rdipietro/jupyter-notebooks
stored: ../data/database/json/bc/7f88717e689063927d5ab5ce556a23df.json
426 https://github.com/renelikestacos/Google-Earth-Engine-Python-Examples
stored: ../data/database/json/bc/4272d800879e60d018df09349f503

stored: ../data/database/json/bc/10aa66213bda02c3bbdd014a4becf51f.json
500
500 https://github.com/TwistedHardware/mltutorial
stored: ../data/database/json/bc/9fc7eb4ef91c4d11cc8fbe8f162aa71a.json
501 https://github.com/twosigma/beakerx
stored: ../data/database/json/bc/6303e340f5c1f15347d62b53e1aab7f1.json
502 https://github.com/ubdsgroup/mlcourse-ub-notebooks
stored: ../data/database/json/bc/be02c5ff2d8a97e2866c7efc14e268c7.json
503 https://github.com/uber/h3-py-notebooks
stored: ../data/database/json/bc/d45c767b86dbbb006aa92123e29f8a7a.json
504 https://github.com/uberwach/leveling-up-jupyter
stored: ../data/database/json/bc/6c7a524722f26f5a2ca4e5ce3ab1fa87.json
505 https://github.com/UCIDataScienceInitiative/PredictiveModeling_withPython
stored: ../data/database/json/bc/ef29bf72967bacee968c1e5680ff4036.json
506 https://github.com/udacity/CVND_Exercises
stored: ../data/database/json/bc/10f51f05a598ebd81fea3c37e535c3fa.json
507 https://github.com/udacity/CVND_Localization_Exercises
stor

stored: ../data/database/json/bcg/63773b657b907a8eec3f85dce02f2534.json
DONE parsed 17 items
za_bl
0 https://github.com/flairNLP/flair
stored: ../data/database/json/za/0d4e98dc5312c9b4ec96665d5af364b0.json
1 https://engineering.zalando.com/posts/2017/03/deep-learning-in-production-for-predicting-consumer-behavior.html
stored: ../data/database/json/za/5cbb244aa92009613710b223c31b5f3a.json
2 https://engineering.zalando.com/posts/2018/09/shop-look-deep-learning.html
stored: ../data/database/json/za/fbc34098cc6faa78a18deb9141f2e930.json
DONE parsed 3 items
za_jo
0 https://jobs.zalando.com/en/jobs/2419780-senior-applied-scientist-w-m-d-advice-and-inspiration/?gh_src=gk03hq
1 https://jobs.zalando.com/en/jobs/2523598--senior-research-scientist-builder-platform-and-ai/?gh_src=gk03hq
2 https://jobs.zalando.com/en/jobs/2261169-senior-python-backend-engineer-competitive-analytics-engineering/?gh_src=gk03hq
3 https://jobs.zalando.com/en/jobs/2243058-senior-data-scientist-customer-analytics/?gh_src

In [None]:
# zero-shot categorization is computationally intensive,
# so let's keep it out of the loop and process it separately
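
The classification loop further down reads `res['category']`, `res['score']`, and `res['runtime']` from every `categorize` call. The function itself is defined elsewhere in this notebook, so the following is only a sketch of that return contract. `categorize_sketch`, `score_labels`, and `overlap_scorer` are hypothetical names, and the toy word-overlap scorer is a stand-in for the zero-shot model, not the model used here.

```python
import time

def categorize_sketch(text, labels, score_labels):
    """Return the best label for `text` plus its score and the elapsed time.

    `score_labels` is a hypothetical callable (text, labels) -> list of floats
    that stands in for the real zero-shot model.
    """
    start = time.time()
    scores = score_labels(text, labels)
    # index of the highest score; ties resolve to the first label
    best = max(range(len(labels)), key=lambda k: scores[k])
    return {
        'category': labels[best],
        'score': scores[best],
        'runtime': round(time.time() - start, 3),
    }

# Toy scorer for illustration only: fraction of label words found in the text.
def overlap_scorer(text, labels):
    words = set(text.lower().split())
    return [len(words & set(l.lower().split())) / len(l.split()) for l in labels]

demo = categorize_sketch('image classification with a cnn',
                         ['computer vision', 'image classification'],
                         overlap_scorer)
print(demo['category'], demo['score'])
```

Passing the scorer in as an argument keeps the expensive model out of the bookkeeping code, so the loop can be dry-run with a cheap stub before the real model is loaded.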

In [None]:
print(cat)
print(subcat)

In [None]:
# classification

folder = '../data/database/json/'
subfolder = os.listdir(folder)
#print(subfolder)

#transform = ['ka_c', 'ka_cn', 'ka_d', 'ka_dn', 'ma', 'gh', 'tcp', 'bc']
transform = ['ka_c', 'ka_cn', 'ma', 'gh', 'tcp', 'bc']
#transform = ['ma']

recreate_category = False
save = True
categorzie_t5 = False
categorize_nltk = True
categorize_fallback = True

quit = 0
i = j = 0
for item in subfolder:
    print('folder', item)
    fp = os.path.join(folder, item)
    if os.path.isdir(fp) and item in transform:
        print('###')
        print(item)
        files = os.listdir(fp)
        print('files in folder:', len(files))
        for file in files:
            row = load_data(os.path.join(folder, item, file), fromJson=True)
            #print(row)
            
            print('row:', i, 'item:', j, 'link:', row['link'], 'file:', file)
            
            # zero shot categorization
            if 'category' not in row or row['category'] == '' or recreate_category:
                print('categorize')
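                start = time.time()
                j += 1
                # reset the results so values from a previous row are never reused
                # (and so they are defined even when every branch below is skipped)
                c = sc = ''
                c_score = sc_score = 0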

                # create category and subcategory from t5
                if 'sum_t5' in row and row['sum_t5'] != '' and categorzie_t5 == True:
                    s = row['sum_t5']
                    res = categorize(s, cat)
                    #row['t5_category_raw'] = res
                    c = row['t5_category'] = res['category']
                    c_score = row['t5_category_score'] = res['score']
                    row['t5_category_runtime'] = res['runtime']
                    print('t5 category', res['runtime'], 'sec')

                    res = categorize(s, subcat)
                    #row['t5_subcategory_raw'] = res
                    sc = row['t5_subcategory'] = res['category']
                    sc_score = row['t5_subcategory_score'] = res['score']
                    row['t5_subcategory_runtime'] = res['runtime']
                    print('t5 subcategory', res['runtime'], 'sec')
                else:
                    print('t5 skipped')

                # create category and subcategory from nltk
                if 'sum_nltk' in row and row['sum_nltk'] != '' and categorize_nltk == True:
                    s = row['sum_nltk']
                    res = categorize(s, cat)
                    #print(res)
                    #row['nltk_category_raw'] = res
                    c = row['nltk_category'] = res['category']
                    c_score = row['nltk_category_score'] = res['score']
                    row['nltk_category_runtime'] = res['runtime']
                    print('nltk category', res['runtime'], 'sec')

                    res = categorize(s, subcat)
                    #print(res)
                    #row['nltk_subcategory_raw'] = res
                    sc = row['nltk_subcategory'] = res['category']
                    sc_score = row['nltk_subcategory_score'] = res['score']
                    row['nltk_subcategory_runtime'] = res['runtime']
                    print('nltk subcategory', res['runtime'], 'sec')
                else:
                    print('nltk skipped')

                # create category and subcategory from title or description if not already done
                if categorize_fallback and 't5_category' not in row and 'nltk_category' not in row:
                    # .get() avoids a KeyError when a row has no description field
                    if row.get('description'):
                        s = row['description']
                        res = categorize(s, cat)
                        #row['description_category_raw'] = res
                        c = row['description_category'] = res['category']
                        c_score = row['description_category_score'] = res['score']
                        row['description_category_runtime'] = res['runtime']
                        print('description category', res['runtime'], 'sec')

                        res = categorize(s, subcat)
                        #row['description_subcategory_raw'] = res
                        sc = row['description_subcategory'] = res['category']
                        sc_score = row['description_subcategory_score'] = res['score']
                        row['description_subcategory_runtime'] = res['runtime']
                        print('description subcategory', res['runtime'], 'sec')
                    else:
                        s = row['title']
                        if s != '':
                            res = categorize(s, cat)
                            #row['title_category_raw'] = res
                            c = row['title_category'] = res['category']
                            c_score = row['title_category_score'] = res['score']
                            row['title_category_runtime'] = res['runtime']
                            print('title category', res['runtime'], 'sec')

                            res = categorize(s, subcat)
                            #row['title_subcategory_raw'] = res
                            sc = row['title_subcategory'] = res['category']
                            sc_score = row['title_subcategory_score'] = res['score']
                            row['title_subcategory_runtime'] = res['runtime']
                            print('title subcategory', res['runtime'], 'sec')
                        else:
                            print('nothing found to categorize')
                            c = sc = ''
                            c_score = sc_score = 0
                            j -= 1

                row['category'] = c
                row['category_score'] = c_score
                row['subcategory'] = sc
                row['subcategory_score'] = sc_score

                end = time.time()
                dur = round(end-start, 3)
                row['runtime_cat'] = dur
                
                fp = os.path.join(folder, item, file)
                if save == True:
                    store_data(row, fp, toJson=True)
                else:
                    print('NOT SAVED')
                    print(row)
            
            i += 1
            
            if i%100 == 0:
                print(i)
            
            # stop early once `quit` rows have been processed (0 = no limit)
            if quit != 0 and i >= quit:
                break
    if quit != 0 and i >= quit:
        break
            
print('DONE parsed', i, 'items')
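
The t5, nltk, description, and title branches in the loop above all repeat the same six assignments with a different field prefix. A minimal sketch of how that pattern could be factored out follows. `apply_categorization` and the stub `first_label` are hypothetical names introduced for illustration, and `categorize_fn` is assumed to return the same `category` / `score` / `runtime` dict that the loop already reads.

```python
def apply_categorization(row, prefix, text, categorize_fn, cat, subcat):
    """Write <prefix>_category* and <prefix>_subcategory* fields into `row`.

    Returns (category, category_score, subcategory, subcategory_score).
    """
    picked = []
    for kind, labels in (('category', cat), ('subcategory', subcat)):
        res = categorize_fn(text, labels)
        row[f'{prefix}_{kind}'] = res['category']
        row[f'{prefix}_{kind}_score'] = res['score']
        row[f'{prefix}_{kind}_runtime'] = res['runtime']
        picked.extend([res['category'], res['score']])
    return tuple(picked)

# Stub classifier for illustration: always picks the first label.
def first_label(text, labels):
    return {'category': labels[0], 'score': 0.5, 'runtime': 0.0}

row = {'sum_nltk': 'some summary'}
c, c_score, sc, sc_score = apply_categorization(
    row, 'nltk', row['sum_nltk'], first_label, ['Vision', 'NLP'], ['GAN', 'RNN'])
print(row)
```

With a helper like this, each branch of the loop would reduce to a single call with its own prefix (`'t5'`, `'nltk'`, `'description'`, or `'title'`).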