In [1]:
# this notebook converts the CSV to ES mapping

In [2]:
# imports
from datetime import datetime
import hashlib
import json
import sys
import csv
import os
import pandas as pd
import re
import time
# note: gensim.summarization was removed in gensim 4.0, so this notebook needs gensim < 4
from gensim.summarization.summarizer import summarize
from gensim.summarization import keywords

In [3]:
# some long text
# source: https://www.kaggle.com/c/stanford-covid-vaccine
text1 = '''
Winning the fight against the COVID-19 pandemic will require an effective vaccine that can be equitably and widely distributed. Building upon decades of research has allowed scientists to accelerate the search for a vaccine against COVID-19, but every day that goes by without a vaccine has enormous costs for the world nonetheless. We need new, fresh ideas from all corners of the world. Could online gaming and crowdsourcing help solve a worldwide pandemic? Pairing scientific and crowdsourced intelligence could help computational biochemists make measurable progress.
mRNA vaccines have taken the lead as the fastest vaccine candidates for COVID-19, but currently, they face key potential limitations. One of the biggest challenges right now is how to design super stable messenger RNA molecules (mRNA). Conventional vaccines (like your seasonal flu shots) are packaged in disposable syringes and shipped under refrigeration around the world, but that is not currently possible for mRNA vaccines.
Researchers have observed that RNA molecules have the tendency to spontaneously degrade. This is a serious limitation--a single cut can render the mRNA vaccine useless. Currently, little is known on the details of where in the backbone of a given RNA is most prone to being affected. Without this knowledge, current mRNA vaccines against COVID-19 must be prepared and shipped under intense refrigeration, and are unlikely to reach more than a tiny fraction of human beings on the planet unless they can be stabilized.
The Eterna community, led by Professor Rhiju Das, a computational biochemist at Stanford’s School of Medicine, brings together scientists and gamers to solve puzzles and invent medicine. Eterna is an online video game platform that challenges players to solve scientific problems such as mRNA design through puzzles. The solutions are synthesized and experimentally tested at Stanford by researchers to gain new insights about RNA molecules. The Eterna community has previously unlocked new scientific principles, made new diagnostics against deadly diseases, and engaged the world’s most potent intellectual resources for the betterment of the public. The Eterna community has advanced biotechnology through its contribution in over 20 publications, including advances in RNA biotechnology.
In this competition, we are looking to leverage the data science expertise of the Kaggle community to develop models and design rules for RNA degradation. Your model will predict likely degradation rates at each base of an RNA molecule, trained on a subset of an Eterna dataset comprising over 3000 RNA molecules (which span a panoply of sequences and structures) and their degradation rates at each position. We will then score your models on a second generation of RNA sequences that have just been devised by Eterna players for COVID-19 mRNA vaccines. These final test sequences are currently being synthesized and experimentally characterized at Stanford University in parallel to your modeling efforts -- Nature will score your models!
Improving the stability of mRNA vaccines was a problem that was being explored before the pandemic but was expected to take many years to solve. Now, we must solve this deep scientific challenge in months, if not weeks, to accelerate mRNA vaccine research and deliver a refrigerator-stable vaccine against SARS-CoV-2, the virus behind COVID-19. The problem we are trying to solve has eluded academic labs, industry R&D groups, and supercomputers, and so we are turning to you. To help, you can join the team of video game players, scientists, and developers at Eterna to unlock the key in our fight against this devastating pandemic. 
'''

# and a short one
text2 = 'The quick brown fox jumps over the lazy dog'

In [4]:
# function to count words
def word_count(text):
    if isinstance(text, str):
        s = text.split(' ')
        return len(s)
    else:
        return 0

print('words:', word_count(text1))
print('words:', word_count(text2))
print('words:', word_count(None))

words: 564
words: 9
words: 0
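
Note that `text.split(' ')` splits on single spaces only, so newlines do not separate words and consecutive spaces produce empty tokens. A minimal sketch of a whitespace-robust variant follows, assuming that behaviour is wanted. The name `word_count_ws` is hypothetical and not part of the notebook, and its counts will generally differ from the figures printed above.

```python
def word_count_ws(text):
    """Count whitespace-separated tokens; non-strings count as 0."""
    if not isinstance(text, str):
        return 0
    # str.split() without an argument splits on any run of whitespace
    # (spaces, tabs, newlines) and drops empty strings.
    return len(text.split())

print('words:', word_count_ws('The quick brown fox jumps over the lazy dog'))
print('words:', word_count_ws('two  spaces\nand a newline'))
print('words:', word_count_ws(None))
```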


In [5]:
# function to count sentences
def sentence_count(text):
    if isinstance(text, str):
        s = text.split('. ')
        return len(s)
    else:
        return 0

print('sentences:', sentence_count(text1))
print('sentences:', sentence_count(text2))
print('sentences:', sentence_count(None))

sentences: 20
sentences: 1
sentences: 0
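
Splitting on `'. '` misses sentences that end in `?` or `!` and sentences followed by a newline instead of a space. The sketch below is one possible heuristic using a regular expression; `sentence_count_re` is a hypothetical name, and abbreviations such as "Dr." are still counted as sentence ends.

```python
import re

# Whitespace that directly follows a ., ! or ? marks a sentence boundary.
_SENTENCE_END = re.compile(r'(?<=[.!?])\s+')

def sentence_count_re(text):
    """Heuristic sentence count; non-strings and empty strings count as 0."""
    if not isinstance(text, str):
        return 0
    parts = [p for p in _SENTENCE_END.split(text.strip()) if p]
    return len(parts)

print('sentences:', sentence_count_re('Is it stable? Yes! It is.'))
print('sentences:', sentence_count_re('The quick brown fox jumps over the lazy dog'))
print('sentences:', sentence_count_re(None))
```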


In [6]:
# extractive summarization

In [7]:
# text summarization 100% -> n% (gensim TextRank; despite the nltk_ prefix, NLTK is not used)
def nltk_ratio(text, ratio=0.25):
    return summarize(text, ratio=ratio)

sum_nltk_ratio = nltk_ratio(text1, ratio=0.25)
print('words:', word_count(sum_nltk_ratio))
print(sum_nltk_ratio)

words: 139
Eterna is an online video game platform that challenges players to solve scientific problems such as mRNA design through puzzles.
The solutions are synthesized and experimentally tested at Stanford by researchers to gain new insights about RNA molecules.
We will then score your models on a second generation of RNA sequences that have just been devised by Eterna players for COVID-19 mRNA vaccines.
Improving the stability of mRNA vaccines was a problem that was being explored before the pandemic but was expected to take many years to solve.
Now, we must solve this deep scientific challenge in months, if not weeks, to accelerate mRNA vaccine research and deliver a refrigerator-stable vaccine against SARS-CoV-2, the virus behind COVID-19.
To help, you can join the team of video game players, scientists, and developers at Eterna to unlock the key in our fight against this devastating pandemic.


In [8]:
# text summarization 100% -> n words
def nltk_count(text, word_count=100):
    return summarize(text, word_count=word_count)

sum_nltk_count = nltk_count(text1, word_count=100)
print('words:', word_count(sum_nltk_count))
print(sum_nltk_count)

words: 98
Eterna is an online video game platform that challenges players to solve scientific problems such as mRNA design through puzzles.
We will then score your models on a second generation of RNA sequences that have just been devised by Eterna players for COVID-19 mRNA vaccines.
Now, we must solve this deep scientific challenge in months, if not weeks, to accelerate mRNA vaccine research and deliver a refrigerator-stable vaccine against SARS-CoV-2, the virus behind COVID-19.
To help, you can join the team of video game players, scientists, and developers at Eterna to unlock the key in our fight against this devastating pandemic.


In [9]:
# abstractive summarization
# https://www.machinelearningplus.com/nlp/text-summarization-approaches-nlp-example/

In [10]:
# BART
# Importing the model
from transformers import BartForConditionalGeneration, BartTokenizer, BartConfig

In [11]:
'''
# Loading the model and tokenizer for bart-large-cnn
tokenizer=BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model=BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
#'''

In [12]:
'''
# Encoding the inputs and passing them to model.generate()
def bart(text):
    inputs = tokenizer.batch_encode_plus([text],return_tensors='pt')
    summary_ids = model.generate(inputs['input_ids'], early_stopping=True)

    # Decoding and printing the summary
    bart_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    
    return bart_summary

# long text
start = time.time()
sum_bart_l = bart(text1)
end = time.time()

print('### long text ###')
print('runtime:', end-start)
print('words:', word_count(sum_bart_l))
print('sentences:', sentence_count(sum_bart_l))
print(sum_bart_l)
print('')

# short text
print('### short text ###')
start = time.time()
sum_bart_s = bart(text2)
end = time.time()

print('runtime:', end-start)
print('words:', word_count(sum_bart_s))
print('sentences:', sentence_count(sum_bart_s))
print(sum_bart_s)
#'''

In [13]:
# T5
# https://towardsdatascience.com/summarize-reddit-comments-using-t5-bart-gpt-2-xlnet-models-a3e78a5ab944
from transformers import T5Tokenizer, T5ForConditionalGeneration
model = T5ForConditionalGeneration.from_pretrained('t5-base')
tokenizer = T5Tokenizer.from_pretrained('t5-base')

Some weights of the model checkpoint at t5-base were not used when initializing T5ForConditionalGeneration: ['decoder.block.0.layer.1.EncDecAttention.relative_attention_bias.weight']
- This IS expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [14]:
def t5(text):
    Preprocessed_text = "summarize: " + text
    tokens_input = tokenizer.encode(Preprocessed_text,return_tensors="pt", max_length=512, truncation=True)
    summary_ids = model.generate(tokens_input, min_length=100, max_length=180, length_penalty=4.0)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

'''
# long text
start = time.time()
sum_t5_l = t5(text1)
end = time.time()

print('### long text ###')
print('runtime:', end-start)
print('words:', word_count(sum_t5_l))
print('sentences:', sentence_count(sum_t5_l))
print(sum_t5_l)
print('')

# short text
print('### short text ###')
start = time.time()
sum_t5_s = t5(text2)
end = time.time()

print('runtime:', end-start)
print('words:', word_count(sum_t5_s))
print('sentences:', sentence_count(sum_t5_s))
print(sum_t5_s)
#'''

In [15]:
# industry categories

# https://www.census.gov/programs-surveys/aces/information/iccl.html
cat_sic = ['Agriculture','Forestry','Fishing','Mining','Construction','Manufacturing','Transportation','Communications','Electric','Gas','Sanitary','Wholesale Trade','Retail Trade','Finance','Insurance','Real Estate','Services','Public Administration']
# https://www.marketing91.com/19-types-of-business-industries/
cat_19 = ['Aerospace','Transport','Computer','Telecommunication','Agriculture','Construction','Education','Pharmaceutical','Food','Health care','Hospitality','Entertainment','News Media','Energy','Manufacturing','Music','Mining','Worldwide web','Electronics']
# https://simplicable.com/new/industries
cat_simple = ['Advertising','Agriculture','Communication','Construction','Creative','Education','Entertainment','Fashion','Finance','Health care','Information Technology','Manufacturing','Media','Retail','Research','Robotics','Space']

cat = ['Accommodation & Food','Accounting','Agriculture','Banking & Insurance','Biotechnological & Life Sciences','Construction & Engineering','Economics','Education & Research','Emergency & Relief','Finance','Government and Public Works','Healthcare','Justice, Law and Regulations','Manufacturing','Media & Publishing','Miscellaneous','Physics','Real Estate, Rental & Leasing','Utilities','Wholesale & Retail']
subcat = ['Failure','Food','Fraud','General','Genomics','Insurance and Risk','Judicial Applied','Life-sciences','Machine Learning','Maintenance','Management and Operations','Marketing','Material Science','Physical','Policy and Regulatory','Politics','Preventative and Reactive','Quality','Real Estate','Rental & Leasing','Restaurant','Retail','School','Sequencing','Social Policies','Student','Textual Analysis','Tools','Tourism','Trading & Investment','Transportation','Valuation','Water & Pollution','Wholesale']

In [16]:
# zero shot classification
# https://towardsdatascience.com/zero-shot-text-classification-with-hugging-face-7f533ba83cd6
from transformers import pipeline
classifier = pipeline("zero-shot-classification")

Some weights of the model checkpoint at facebook/bart-large-mnli were not used when initializing BartModel: ['model.encoder.version', 'model.decoder.version']
- This IS expected if you are initializing BartModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BartModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at facebook/bart-large-mnli were not used when initializing BartForSequenceClassification: ['model.encoder.version', 'model.decoder.version']
- This IS expected if you are initializing BartForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification m

In [17]:
'''
# test classification with nltk (100 words)

s = nltk_count(text1, word_count=100)

start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_sic = classifier(s, cat_sic, multi_class=True)
end = time.time()

print('runtime sic:', end-start)
#print(res_sic)
print(res_sic['labels'][0:3])
print(res_sic['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_c19 = classifier(s, cat_19, multi_class=True)
end = time.time()

print('runtime c19:', end-start)
#print(res_c19)
print(res_c19['labels'][0:3])
print(res_c19['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_simple = classifier(s, cat_simple, multi_class=True)
end = time.time()

print('runtime simple:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat)
res_simple = classifier(s, cat, multi_class=True)
end = time.time()

print('runtime category:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), subcat)
res_simple = classifier(s, subcat, multi_class=True)
end = time.time()

print('runtime subcategory:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])
#'''

In [18]:
'''
# test classification with t5
start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_sic = classifier(sum_t5_l, cat_sic, multi_class=True)
end = time.time()

print('runtime sic:', end-start)
#print(res_sic)
print(res_sic['labels'][0:3])
print(res_sic['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_c19 = classifier(sum_t5_l, cat_19, multi_class=True)
end = time.time()

print('runtime c19:', end-start)
#print(res_c19)
print(res_c19['labels'][0:3])
print(res_c19['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_simple = classifier(sum_t5_l, cat_simple, multi_class=True)
end = time.time()

print('runtime simple:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat)
res_simple = classifier(sum_t5_l, cat, multi_class=True)
end = time.time()

print('runtime category:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), subcat)
res_simple = classifier(sum_t5_l, subcat, multi_class=True)
end = time.time()

print('runtime subcategory:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])
#'''

In [19]:
'''
# test classification with bart

start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_sic = classifier(sum_bart_l, cat_sic, multi_class=True)
end = time.time()

print('runtime sic:', end-start)
#print(res_sic)
print(res_sic['labels'][0:3])
print(res_sic['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_c19 = classifier(sum_bart_l, cat_19, multi_class=True)
end = time.time()

print('runtime c19:', end-start)
#print(res_c19)
print(res_c19['labels'][0:3])
print(res_c19['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_simple = classifier(sum_bart_l, cat_simple, multi_class=True)
end = time.time()

print('runtime simple:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat)
res_simple = classifier(sum_bart_l, cat, multi_class=True)
end = time.time()

print('runtime category:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), subcat)
res_simple = classifier(sum_bart_l, subcat, multi_class=True)
end = time.time()

print('runtime subcategory:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])
#'''

In [20]:
# category function
def categorize(text, categories):
    start = time.time()
    # note: newer transformers releases name this argument multi_label
    res = classifier(text, categories, multi_class=True)
    end = time.time()
    dur = round(end-start, 3)
    return {
        'category': res['labels'][0],
        'score': res['scores'][0],
        'runtime': dur,
    }

#print(categorize(nltk_count(text1), cat))
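
`categorize` keeps only the best label, while the commented-out tests above inspect the top three. A small sketch of a helper that returns several candidates is shown below. It assumes the classifier result is a dict with parallel `labels` and `scores` lists sorted by descending score, which is how the notebook indexes it. The names `top_k` and `min_score` are hypothetical, and the sample dict is made up for illustration rather than real classifier output.

```python
def top_k(res, k=3, min_score=0.0):
    """Return up to k (label, score) pairs whose score is at least min_score."""
    pairs = zip(res['labels'], res['scores'])
    return [(label, score) for label, score in pairs if score >= min_score][:k]

# Made-up scores for illustration only, not produced by the model.
fake_res = {
    'labels': ['Healthcare', 'Education & Research', 'Physics', 'Finance'],
    'scores': [0.91, 0.62, 0.20, 0.05],
}

print(top_k(fake_res, k=2))
print(top_k(fake_res, k=3, min_score=0.5))
```

A score threshold like this could be used to leave `category` empty when no label is convincing, instead of always storing the first one.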

In [21]:
# helper functions

In [22]:
# function to rebuild a list from its string representation
# (needed when a list was written to CSV without JSON-encoding it first)
def str_to_list(s):
    s = s.replace("'", "").replace(' ,', ',').replace(
        '[', '').replace(']', '').split(',')
    s = [i.strip() for i in s if i]
    return s
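
`str_to_list` removes every single quote and splits on every comma, so an item that itself contains an apostrophe or a comma is altered. The sketch below is one possible alternative, assuming the CSV cells hold the Python `repr` of a list; `parse_list_cell` is a hypothetical name and not part of the notebook.

```python
import ast

def parse_list_cell(s):
    """Parse a cell such as "['a', 'b']" into a list of non-empty strings."""
    try:
        value = ast.literal_eval(s)
    except (ValueError, SyntaxError):
        # Not a valid Python literal: fall back to a plain comma split.
        value = s.strip('[]').split(',')
    if not isinstance(value, (list, tuple, set)):
        value = [value]
    return [str(v).strip() for v in value if str(v).strip()]

print(parse_list_cell("['nlp', 'deep learning']"))
print(parse_list_cell("[\"it's\", 'a,b']"))
print(parse_list_cell('a, b'))
```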

In [23]:
# helper function to create the parent folder of a path
def create_folder(path):
    folder = os.path.dirname(path)
    if folder and not os.path.exists(folder):
        # exist_ok guards against a race where the folder appears in between
        os.makedirs(folder, exist_ok=True)
        print(folder + ' created')

In [24]:
# generic store data to file function
def store_data(data, file, mode='w', toJson=False):
    if toJson:
        data = json.dumps(data)
    with open(file, mode, encoding='utf-8') as fp:
        result = fp.write(data)
        return result
    
# generic load data from file function
def load_data(file, fromJson=False):
    if os.path.isfile(file):
        with open(file, 'r', encoding='utf-8', errors="ignore") as fp:
            data = fp.read()
            if fromJson:
                data = json.loads(data)
            return data
    else:
        return 'file not found'

# test text
#print(store_data('Hello', '../data/repositories/mlart/test.txt'))
#print(load_data('../data/repositories/mlart/test.txt'))

# test json
#print(store_data({'msg':'Hello World'}, '../data/repositories/mlart/test.json', toJson=True))
#print(load_data('../data/repositories/mlart/test.json', fromJson=True))

#store_data(result[0]['html'], '../data/repositories/kaggle/notebook.html')
#store_data(result[0]['iframe'], '../data/repositories/kaggle/kernel.html')

In [25]:
# remove special characters
def clean_text(text):
    # Ref: https://gist.github.com/Alex-Just/e86110836f3f93fe7932290526529cd1#gistcomment-3208085
    # Ref: https://en.wikipedia.org/wiki/Unicode_block
    EMOJI_PATTERN = re.compile(
        "(["
        "\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F700-\U0001F77F"  # alchemical symbols
        "\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
        "\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
        "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
        "\U0001FA00-\U0001FA6F"  # Chess Symbols
        "\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
        "\U00002702-\U000027B0"  # Dingbats
        "])"
    )
    text = re.sub(EMOJI_PATTERN, '', text)
    
    # additional cleanup
    text = text.replace('•','').replace('\n',' ')
    
    return text
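
`clean_text` replaces newlines with spaces and deletes bullets, which can leave runs of consecutive spaces. Because `word_count` splits on single spaces, such runs inflate the word count. A minimal sketch of a follow-up step that collapses them is given here; `normalize_ws` is a hypothetical helper, not part of the notebook.

```python
import re

# Compiled once at import time instead of on every call.
_WS = re.compile(r'\s+')

def normalize_ws(text):
    """Collapse any run of whitespace into a single space and trim the ends."""
    return _WS.sub(' ', text).strip()

print(normalize_ws('mRNA  vaccines \n degrade '))
```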

In [26]:
# mapper to convert CSV to the mapping of Elasticsearch index
def mapper(row, style):
    '''
    mapper to adapt csv to db-schema

    title, title_vector, description, description_vector,
    link, category, category_score, subcategory, subcategory_score, 
    tags, kind, ml_libs, host, license, language, score,
    date_project, date_scraped
    '''

    # kaggle notebook mapping
    if style == 'kaggle_notebook':
        return {
            'title': row['title'],
            'description': row['description'],
            'link': row['link'],
            # 'category': '',
            # 'category_score': 0,
            # 'subcategory': '',
            # 'subcategory_score': 0,
            'tags': list(set(str_to_list(row['tags']))),  # de-duplicated tags
            'kind': ['project', 'notebook'],
            'ml_libs': str_to_list(row['ml_libs']),
            'host': 'www.kaggle.com',
            'license': row['license'],
            'language': row['type'],
            'score': row['score_views'] if 'score_views' in row else None,
            'date_project': datetime.strptime(row['date'], "%Y-%m-%d %H:%M:%S") if row['date'] != '' else None,
            'date_scraped': datetime.strptime(row['scraped_at'], "%Y-%m-%d %H:%M:%S") if row['scraped_at'] != '' else None,
            # 'ml_terms': row['ml_terms'],
            # 'score_raw': json.dumps({'views': row['views'], 'votes': row['votes'], 'score_private': row['score_private'], 'score_public': row['score_public']}),
        }
    
    # kaggle competition mapping
    if style == 'kaggle_competition':
        return {
            'title': row['title'],
            'description': row['subtitle'] + row['description'],
            'link': row['link'],
            # 'category': '',
            # 'category_score': 0,
            # 'subcategory': '',
            # 'subcategory_score': 0,
            'tags': list(set(str_to_list(row['tags']) + str_to_list(row['type']))),
            'kind': ['project', 'competition', 'dataset'],
            # 'ml_libs': str_to_list(row['ml_libs']),
            'host': 'www.kaggle.com',
            # 'license': row['license'],
            # 'language': row['type'],
            'score': row['teams_score'],
            'date_project': datetime.strptime(row['date_closed'], "%Y-%m-%d %H:%M:%S") if 'date_closed' in row else '',
            # 'date_scraped': datetime.strptime(row['scraped_at'], "%Y-%m-%d %H:%M:%S"),
            # 'ml_terms': row['ml_terms'],
            # 'score_raw': json.dumps({'views': row['views'], 'votes': row['votes'], 'score_private': row['score_private'], 'score_public': row['score_public']}),
        }
    
    # kaggle dataset mapping
    if style == 'kaggle_dataset':
        return {
            'title': row['title'],
            'description': row['subtitle'] + row['description'],
            'link': row['link'],
            # 'category': '',
            # 'category_score': 0,
            # 'subcategory': '',
            # 'subcategory_score': 0,
            'tags': list(set(str_to_list(row['tags']) + str_to_list(row['type']))),
            'kind': ['project', 'dataset'],
            # 'ml_libs': str_to_list(row['ml_libs']),
            'host': 'www.kaggle.com',
            # 'license': row['license'],
            # 'language': row['type'],
            'score': row['teams_score'],
            'date_project': datetime.strptime(row['date_closed'], "%Y-%m-%d %H:%M:%S") if 'date_closed' in row else '',
            # 'date_scraped': datetime.strptime(row['scraped_at'], "%Y-%m-%d %H:%M:%S"),
            # 'ml_terms': row['ml_terms'],
            # 'score_raw': json.dumps({'views': row['views'], 'votes': row['votes'], 'score_private': row['score_private'], 'score_public': row['score_public']}),
        }

    # github mapping
    if style == 'github':
        cat_score = 1 if row['industry'] != '' else 0
        subcat_score = 1 if row['type'] != '' else 0
        #tags = row['ml_tags'] if len(row['ml_tags']) > 0 else ''
        return {
            'title': row['name'],
            'description': row['description2'],
            'link': row['link'],
            'category': row['industry'],
            'category_score': cat_score,
            'subcategory': row['type'],
            'subcategory_score': subcat_score,
            'tags': str_to_list(row['ml_tags']),
            'kind': 'Project',
            'ml_libs': str_to_list(row['ml_libs']),
            'host': 'www.github.com',
            'license': row['license'],
            'language': row['language_primary'],
            'score': row['stars_score'],
            'date_project': datetime.strptime(row['pushed_at'], "%Y-%m-%d %H:%M:%S"),
            'date_scraped': datetime.strptime(row['scraped_at'], "%Y-%m-%d %H:%M:%S"),
            # 'ml_terms': row['keywords'],
            # 'score_raw': json.dumps({'stars': row['stars'], 'contributors': row['contributors']}),
        }

    # mlart mapping
    if style == 'mlart':
        title = row['Title'] if row['Title'] != '' else row['title']
        cat_score = 1 if row['Theme'] != '' else 0
        subcat_score = 1 if row['Medium'] != '' else 0
        return {
            'title': title,
            'description': row['subtitle'],
            'link': row['url'],
            'category': str_to_list(row['Theme']),
            'category_score': cat_score,
            'subcategory': str_to_list(row['Medium']),
            'subcategory_score': subcat_score,
            'tags': str_to_list(row['Technology']),
            'kind': 'Showcase',
            # 'ml_libs': [],
            'host': 'mlart.co',
            # 'license': '',
            # 'language': '',
            # 'score': 0,
            'date_project': datetime.strptime(row['Date'], "%Y-%m-%d"),
            'date_scraped': datetime.strptime(row['scraped_at'], "%Y-%m-%d %H:%M:%S"),
            # 'score_raw': json.dumps({'days_since_featured': row['Days Since Featured']}),
        }

    # thecleverprogrammer
    if style == 'tcp':
        return {
            'title': row['title'],
            'description': row['text'],
            'link': row['link'],
            # 'category': '',
            # 'category_score': 0,
            # 'subcategory': '',
            # 'subcategory_score': 0,
            'tags': str_to_list(row['ml_tags']),
            'kind': 'Project',
            'ml_libs': str_to_list(row['ml_libs']),
            'host': 'thecleverprogrammer.com',
            # 'license': '',
            'language': 'Python',
            # 'score': 0,
            'date_project': datetime.strptime(row['date'], "%Y-%m-%d %H:%M:%S"),
            'date_scraped': datetime.strptime('2020-12-20', "%Y-%m-%d"),
            # 'score_raw': json.dumps({'days_since_featured': row['Days Since Featured']}),
        }

    return None
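
Because the mapper falls through to `return None` for an unknown style, a mistyped `csv_format` only surfaces later, when the loop indexes `row['link']`. The following is a minimal sketch of an early guard. `safe_map` and `demo_mapper` are hypothetical names used for illustration only; they are not part of this notebook.

```python
def safe_map(mapper_fn, row, style):
    """Call mapper_fn and fail early if no mapping exists for the style.

    Hypothetical helper: it wraps any mapper that returns None for
    unknown styles, as the mapper above does.
    """
    mapped = mapper_fn(row, style)
    if mapped is None:
        raise ValueError(f"no mapping defined for csv_format {style!r}")
    return mapped


# Stand-in for the notebook's mapper, reduced to one style for illustration.
def demo_mapper(row, style):
    if style == 'tcp':
        return {'title': row['title'], 'link': row['link'], 'kind': 'Project'}
    return None


demo_row = {'title': 'Demo', 'link': 'https://example.com/demo'}
mapped_demo = safe_map(demo_mapper, demo_row, 'tcp')
```

Raising at the mapping step keeps the error message close to its cause instead of producing a `TypeError` on the first field access.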

In [27]:
# test gpu usage
import torch
torch.cuda.is_available()

True

In [28]:
# summarization loop

In [31]:
# loop to transform data row-wise
def transform_loop(csv_in, csv_format, subfolder, quit=0, overwrite=False):
    
    with open(csv_in, encoding='utf-8') as csvfile:
        
        # let's store converted csv to temp-folder for analysis
        csv_out = '../data/database/csv/'
        json_out = '../data/database/json/'
        json_out_item = '../data/database/json/'+subfolder
        create_folder(json_out_item)
        df = pd.DataFrame()

        # readCSV = csv.reader(csvfile, delimiter=';')
        readCSV = csv.DictReader(csvfile, delimiter=';')
        # next(readCSV, None)  # skip the headers
        
        i = j = 0
        out = []
        
        for row in readCSV:
            # print(row)
            row = mapper(row, csv_format)
            
            # check if file already exists
            link = row['link']
            md5 = hashlib.md5(link.encode("utf-8")).hexdigest()
            
            json_fp = json_out_item + md5 + '.json'
            if overwrite or not os.path.isfile(json_fp):
                
                print(i, row['link'])
                item_start = time.time()

                # clean title & description
                row['title'] = clean_text(row['title'])
                text = row['description'] = clean_text(row['description'])
                words = row['words'] = word_count(text)
                sentences = row['sentences'] = sentence_count(text)

                # create summarization
                if words > 200 and sentences > 1:
                    print('summarize')

                    # nltk
                    #try:
                    if True:
                        start = time.time()
                        row['sum_nltk'] = nltk_count(text, word_count=200)
                        end = time.time()
                        dur = round(end-start,3)

                        row['sum_nltk_words'] = word_count(row['sum_nltk'])
                        row['sum_nltk_runtime'] = dur
                        print('done (nltk)', dur, 'sec')
                    #except:
                    #    print('summarization nltk failed')

                    # t5
                    #try:
                    if True:
                        start = time.time()
                        row['sum_t5'] = t5(text)
                        end = time.time()
                        dur = round(end-start,3)

                        row['sum_t5_words'] = word_count(row['sum_t5'])
                        row['sum_t5_runtime'] = dur
                        print('done (t5)', dur, 'sec')
                    #except:
                    #    print('summarization t5 failed')

                # zero shot categorization
                '''
                if not 'category' in row:
                    print('categorize')
                    
                    # create category and subcategory from t5
                    if 'sum_t5' in row and False:
                        s = row['sum_t5']
                        res = categorize(s, cat)
                        row['t5_category'] = res['category']
                        row['t5_category_score'] = res['score']
                        row['t5_category_runtime'] = res['runtime']
                        print('t5 category', res['runtime'], 'sec')
                        
                        res = categorize(s, subcat)
                        row['t5_subcategory'] = res['category']
                        row['t5_subcategory_score'] = res['score']
                        row['t5_subcategory_runtime'] = res['runtime']
                        print('t5 subcategory', res['runtime'], 'sec')
                    else:
                        print('t5 skipped')
                       
                    # create category and subcategory from nltk
                    if 'sum_nltk' in row:
                        s = row['sum_nltk']
                        res = categorize(s, cat)
                        row['nltk_category'] = res['category']
                        row['nltk_category_score'] = res['score']
                        row['nltk_category_runtime'] = res['runtime']
                        print('nltk category', res['runtime'], 'sec')
                        
                        res = categorize(s, subcat)
                        row['nltk_subcategory'] = res['category']
                        row['nltk_subcategory_score'] = res['score']
                        row['nltk_subcategory_runtime'] = res['runtime']
                        print('nltk subcategory', res['runtime'], 'sec')
                    else:
                        print('nltk skipped')
                        
                    # create category and subcategory from title or description if not already done
                    if not 't5_category' in row and not 'nltk_category' in row:
                        if len(row['description']) > 0:
                            s = row['description']
                            res = categorize(s, cat)
                            row['description_category'] = res['category']
                            row['description_category_score'] = res['score']
                            row['description_category_runtime'] = res['runtime']
                            print('description category', res['runtime'], 'sec')

                            res = categorize(s, subcat)
                            row['description_subcategory'] = res['category']
                            row['description_subcategory_score'] = res['score']
                            row['description_subcategory_runtime'] = res['runtime']
                            print('description subcategory', res['runtime'], 'sec')
                        else:
                            s = row['title']
                            res = categorize(s, cat)
                            row['title_category'] = res['category']
                            row['title_category_score'] = res['score']
                            row['title_category_runtime'] = res['runtime']
                            print('title category', res['runtime'], 'sec')

                            res = categorize(s, subcat)
                            row['title_subcategory'] = res['category']
                            row['title_subcategory_score'] = res['score']
                            row['title_subcategory_runtime'] = res['runtime']
                            print('title subcategory', res['runtime'], 'sec')
                    
                    
                    # zero shot categorization - first approach
                    
                    s = row['title']
                    if len(row['description']) > 0:
                        s = row['description']
                    if 'sum_nltk' in row:
                        s = row['sum_nltk']
                    if 'sum_t5' in row:
                        s = row['sum_t5']

                    #try:
                    if True:
                        res = categorize(s, cat_simple)
                        # {'category': 'Entertainment', 'score': 0.3688451945781708, 'runtime': 14.973}
                        row['category'] = res['category']
                        row['category_score'] = res['score']
                        row['category_runtime'] = res['runtime']
                        print('done', res['runtime'], 'sec')
                    #except:
                    #    print('categorization failed')
                    '''

                # convert datetime to string
                if 'date_project' in row:
                    row['date_project'] = str(row['date_project'])
                if 'date_scraped' in row:
                    row['date_scraped'] = str(row['date_scraped'])
                    
                # runtime
                item_end = time.time()
                item_dur = round(item_end-item_start, 3)
                row['runtime'] = item_dur

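                # DataFrame.append was removed in pandas 2.0; concat works in old and new versions
                df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)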

                # json encode
                #out.append(row)
                
                store_data(row, json_fp, toJson=True)
                print(json_fp)
                j += 1

            #print(i, row['link'])
            i += 1

            # keep count of # rows processed
            if i % 100 == 0:
                print(i)

            if quit != 0 and i >= quit:
                break

        # store parsed csv
        #fp = csv_in.split('/')[-1]
        #df.to_csv(csv_out + fp, sep=';', index=False)
        #path = json_out + fp
        #path = path.replace('.csv', '.json')
        #store_data(out, path, toJson=True)
        
        print('DONE parsed', i, 'items')
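
The loop above names each JSON file after the MD5 hex digest of the item's link and skips items whose file already exists, which makes an interrupted run resumable. A minimal sketch of that idea in isolation follows. `json_path_for` and `store_once` are hypothetical helpers, and plain `json.dump` stands in for the notebook's `store_data`, whose implementation is not shown here.

```python
import hashlib
import json
import os
import tempfile


def json_path_for(link, folder):
    """Derive a stable file path from the item's link via its MD5 hex digest."""
    md5 = hashlib.md5(link.encode('utf-8')).hexdigest()
    return os.path.join(folder, md5 + '.json')


def store_once(row, folder, overwrite=False):
    """Write row as JSON unless a file for the same link already exists.

    Returns True if the file was written, False if it was skipped.
    """
    path = json_path_for(row['link'], folder)
    if os.path.isfile(path) and not overwrite:
        return False
    with open(path, 'w', encoding='utf-8') as fh:
        json.dump(row, fh)  # assumption: stands in for store_data(..., toJson=True)
    return True


with tempfile.TemporaryDirectory() as tmp:
    demo = {'link': 'https://example.com/item', 'title': 'Demo'}
    first = store_once(demo, tmp)                    # new link: written
    second = store_once(demo, tmp)                   # same link: skipped
    forced = store_once(demo, tmp, overwrite=True)   # overwrite requested: written
```

MD5 is used here only as a filename hash, not for security, so collisions between distinct links are the only practical concern.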

In [32]:
# run the loop

transform = ['ka_c', 'ka_cn', 'ka_d', 'ka_dn', 'ma', 'gh', 'tcp', 'bc']
#transform = ['ka_c', 'ka_cn', 'ka_dn']
#transform = ['ka_d', 'ma', 'gh', 'tcp', 'bc']
transform = ['bc']

datasets = {
    # kaggle competitions
    'ka_c': {
        'csv_in': '../data/database/kaggle_competitions_correlated_01.csv',
        'csv_format': 'kaggle_competition',
    },
    # kaggle competitions notebooks
    'ka_cn': {
        'csv_in': '../data/database/kaggle_competitions_01_original.csv',
        'csv_format': 'kaggle_notebook',
    },
    # kaggle datasets
    'ka_d': {
        'csv_in': '../data/database/kaggle_datasets_correlated_01.csv',
        'csv_format': 'kaggle_dataset',
    },
    # kaggle datasets notebooks
    'ka_dn': {
        'csv_in': '../data/database/kaggle_datasets_01_original.csv',
        'csv_format': 'kaggle_notebook',
    },
    # mlart
    'ma': {
        'csv_in': '../data/database/mlart_01_original.csv',
        'csv_format':'mlart',
    },
    # github
    'gh': {
        'csv_in': '../data/database/db_04_analyzed_v02.csv',
        'csv_format': 'github',
    },
    # thecleverprogrammer
    'tcp': {
        'csv_in': '../data/database/thecleverprogrammer_01_original.csv',
        'csv_format': 'tcp',
    },
    # blobcity
    'bc': {
        'csv_in': '../data/database/blobcity_02_analyzed.csv',
        'csv_format': 'github',
    },
}

    
for key in transform:
    print(key)
    item = datasets[key]
    transform_loop(item['csv_in'], item['csv_format'], key+'/', overwrite=False)

bc
100
200
300
400
500
DONE parsed 541 items


In [33]:
# zero-shot categorization is computationally intensive,
# so let's keep it out of the loop and process it separately
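
The `categorize` helper used below is defined elsewhere in this notebook. As a rough sketch of what such a wrapper might look like, the code below turns a classifier result into the `{'category', 'score', 'runtime'}` shape used in the following cells. `categorize_with`, `load_hf_classifier` and `stub_classifier` are hypothetical names, and the Hugging Face `zero-shot-classification` pipeline is an assumption about the backend, not something this notebook confirms.

```python
import time


def categorize_with(classifier, text, labels):
    """Run a zero-shot classifier and return its top label.

    `classifier` is any callable returning {'labels': [...], 'scores': [...]}
    sorted by descending score (assumed here).
    """
    start = time.time()
    res = classifier(text, labels)
    runtime = round(time.time() - start, 3)
    return {
        'category': res['labels'][0],
        'score': res['scores'][0],
        'runtime': runtime,
    }


def load_hf_classifier():
    """Assumption: a Hugging Face zero-shot pipeline as the real backend."""
    from transformers import pipeline  # imported lazily; downloads a model on first use
    return pipeline('zero-shot-classification')


# Stub classifier so the sketch runs without downloading a model:
# a label scores 1.0 if it occurs in the text, otherwise 0.0.
def stub_classifier(text, labels):
    scores = [1.0 if label.lower() in text.lower() else 0.0 for label in labels]
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return {
        'labels': [label for label, _ in ranked],
        'scores': [score for _, score in ranked],
    }


demo_res = categorize_with(stub_classifier, 'A healthcare dataset', ['Finance', 'Healthcare'])
```

Passing the classifier in as an argument keeps the timing and result shaping testable without a GPU.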

In [40]:
print(cat)
print(subcat)

['Accommodation & Food', 'Accounting', 'Agriculture', 'Banking & Insurance', 'Biotechnological & Life Sciences', 'Construction & Engineering', 'Economics', 'Education & Research', 'Emergency & Relief', 'Finance', 'Government and Public Works', 'Healthcare', 'Justice, Law and Regulations', 'Manufacturing', 'Media & Publishing', 'Miscellaneous', 'Physics', 'Real Estate, Rental & Leasing', 'Utilities', 'Wholesale & Retail']
['Failure', 'Food', 'Fraud', 'General', 'Genomics', 'Insurance and Risk', 'Judicial Applied', 'Life-sciences', 'Machine Learning', 'Maintenance', 'Management and Operations', 'Marketing', 'Material Science', 'Physical', 'Policy and Regulatory', 'Politics', 'Preventative and Reactive', 'Quality', 'Real Estate', 'Rental & Leasing', 'Restaurant', 'Retail', 'School', 'Sequencing', 'Social Policies', 'Student', 'Textual Analysis', 'Tools', 'Tourism', 'Trading & Investment', 'Transportation', 'Valuation', 'Water & Pollution', 'Wholesale']


In [48]:
# paths
folder = '../data/database/json/'
subfolder = os.listdir(folder)
#print(subfolder)

transform = ['ka_c', 'ka_cn', 'ka_d', 'ka_dn', 'ma', 'gh', 'tcp', 'bc']
#transform = ['ka_c', 'ka_cn', 'ka_dn']
#transform = ['ka_d', 'ma', 'gh', 'tcp', 'bc']
transform = ['ma', 'gh', 'bc', 'tcp']

recreate_category = False
save = True
categorzie_t5 = False
categorize_nltk = True
categorize_fallback = True

quit = 0
i = 0
for item in subfolder:
    print('folder', item)
    fp = os.path.join(folder, item)
    if os.path.isdir(fp) and item in transform:
        print('###')
        print(item)
        files = os.listdir(fp)
        print('files in folder:', len(files))
        for file in files:
            row = load_data(os.path.join(folder, item, file), fromJson=True)
            #print(row)
            
            print(i, row['link'], file)
            
            # zero shot categorization
            if not 'category' in row or row.get('category') == '' or recreate_category == True:
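                print('categorize')
                start = time.time()

                # defaults, so results from a previously processed file are never reused
                c = sc = ''
                c_score = sc_score = 0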

                # create category and subcategory from t5
                if 'sum_t5' in row and row['sum_t5'] != '' and categorzie_t5 == True:
                    s = row['sum_t5']
                    res = categorize(s, cat)
                    #row['t5_category_raw'] = res
                    c = row['t5_category'] = res['category']
                    c_score = row['t5_category_score'] = res['score']
                    row['t5_category_runtime'] = res['runtime']
                    print('t5 category', res['runtime'], 'sec')

                    res = categorize(s, subcat)
                    #row['t5_subcategory_raw'] = res
                    sc = row['t5_subcategory'] = res['category']
                    sc_score = row['t5_subcategory_score'] = res['score']
                    row['t5_subcategory_runtime'] = res['runtime']
                    print('t5 subcategory', res['runtime'], 'sec')
                else:
                    print('t5 skipped')

                # create category and subcategory from nltk
                if 'sum_nltk' in row and row['sum_nltk'] != '' and categorize_nltk == True:
                    s = row['sum_nltk']
                    res = categorize(s, cat)
                    #print(res)
                    #row['nltk_category_raw'] = res
                    c = row['nltk_category'] = res['category']
                    c_score = row['nltk_category_score'] = res['score']
                    row['nltk_category_runtime'] = res['runtime']
                    print('nltk category', res['runtime'], 'sec')

                    res = categorize(s, subcat)
                    #print(res)
                    #row['nltk_subcategory_raw'] = res
                    sc = row['nltk_subcategory'] = res['category']
                    sc_score = row['nltk_subcategory_score'] = res['score']
                    row['nltk_subcategory_runtime'] = res['runtime']
                    print('nltk subcategory', res['runtime'], 'sec')
                else:
                    print('nltk skipped')

                # create category and subcategory from title or description if not already done
                if categorize_fallback == True and not 't5_category' in row and not 'nltk_category' in row:
                    if len(row['description']) > 0:
                        s = row['description']
                        res = categorize(s, cat)
                        #row['description_category_raw'] = res
                        c = row['description_category'] = res['category']
                        c_score = row['description_category_score'] = res['score']
                        row['description_category_runtime'] = res['runtime']
                        print('description category', res['runtime'], 'sec')

                        res = categorize(s, subcat)
                        #row['description_subcategory_raw'] = res
                        sc = row['description_subcategory'] = res['category']
                        sc_score = row['description_subcategory_score'] = res['score']
                        row['description_subcategory_runtime'] = res['runtime']
                        print('description subcategory', res['runtime'], 'sec')
                    else:
                        s = row['title']
                        if s != '':
                            res = categorize(s, cat)
                            #row['title_category_raw'] = res
                            c = row['title_category'] = res['category']
                            c_score = row['title_category_score'] = res['score']
                            row['title_category_runtime'] = res['runtime']
                            print('title category', res['runtime'], 'sec')

                            res = categorize(s, subcat)
                            #row['title_subcategory_raw'] = res
                            sc = row['title_subcategory'] = res['category']
                            sc_score = row['title_subcategory_score'] = res['score']
                            row['title_subcategory_runtime'] = res['runtime']
                            print('title subcategory', res['runtime'], 'sec')
                        else:
                            print('nothing found to categorize')
                            c = sc = ''
                            c_score = sc_score = 0

                row['category'] = c
                row['category_score'] = c_score
                row['subcategory'] = sc
                row['subcategory_score'] = sc_score

                end = time.time()
                dur = round(end-start, 3)
                row['runtime_cat'] = dur
                
                fp = os.path.join(folder, item, file)
                if save == True:
                    store_data(row, fp, toJson=True)
                else:
                    print('NOT SAVED')
                    print(row)
            
            i += 1
            
            if i%100 == 0:
                print(i)
            
            if quit != 0 and i >= quit:
                break
    if quit != 0 and i >= quit:
        break
            
print('DONE parsed', i, 'items')

folder bc
###
bc
files in folder: 541
0 https://github.com/HazyResearch/cs145-notebooks-2016 003eb1b987f35ed6d88b1c7930e6057f.json
1 https://github.com/marsggbo/deeplearning.ai_JupyterNotebooks 004a0f102a03862a74da782e6a796f80.json
categorize
t5 skipped
nltk skipped
nothing found to categorize
2 https://github.com/davidbp/lxmls-notebooks 004a915abfded4001d49385a09f77f19.json
categorize
t5 skipped
nltk skipped
nothing found to categorize
3 https://github.com/jamesdbrock/learn-you-a-haskell-notebook 0065cbc76fd0832c7b3fc8ef5baed767.json
4 https://github.com/ageron/tf2_course 00708ba1c3a32ce45749ebcb633e8c5a.json
5 https://github.com/InsightDataLabs/ipython-notebooks 00b2db9f2524a95318f16eed6e605dd0.json
categorize
t5 skipped
nltk skipped
nothing found to categorize
6 https://github.com/trnkatomas/Keras_2_examples 00d505608175f11ab66be6afd4d55899.json
7 https://github.com/soniclavier/bigdata-notebook 013b72a8ee9c19f8fc8835bd16e009e9.json
categorize
t5 skipped
nltk skipped
nothing found to

151 https://github.com/OTRF/notebooks-forge 4dfea1bae4708d4317f7e570aa7322c0.json
152 https://github.com/uwdata/visualization-curriculum 4e4a4fa94d836e0555b197ae6a5743da.json
153 https://github.com/IBM/nodejs-in-notebooks 4e686c249c8efbd71dbfdb18386e4338.json
154 https://github.com/velocyto-team/velocyto-notebooks 502b31c0cd0f584eda981914ef8360c7.json
categorize
t5 skipped
nltk skipped
nothing found to categorize
155 https://github.com/kaleko/CourseraML 5067d72f4fca5b73c8c5a5c3f1edee8e.json
156 https://github.com/timsainb/python_spectrograms_and_inversion 506ef9f98987c54b0d315037a5a988ea.json
157 https://github.com/GeostatisticsLessons/GeostatisticsLessonsNotebooks 518989b27d43d3ae7b293493115c2982.json
158 https://github.com/bhattbhavesh91/time_series_notebooks 519d691a4f7d10334b0adcdd19cd8d71.json
159 https://github.com/fastai/numerical-linear-algebra 5244b18585faa8bfdf0108efab812847.json
160 https://github.com/sgugger/Deep-Learning 5249d5514c30f56e43b3c30fcb37f642.json
161 https://gi

320 https://github.com/chmp/ipytest 9c85aefb2a557f02f80a3c1f92a4e984.json
321 https://github.com/krasserm/machine-learning-notebooks 9cc0633c1a4b086ec76a748021ff90fb.json
322 https://github.com/SheffieldML/notebook 9d90bdcd166e78cc59bbf44af78ad750.json
categorize
t5 skipped
nltk skipped
nothing found to categorize
323 https://github.com/takluyver/cite2c 9dc4cd914940bd494bee23149fd9fe6d.json
324 https://github.com/COGS108/Tutorials 9dd879256892edbeec3f26eb84cce9da.json
325 https://github.com/omarsar/pytorch_notebooks 9e7adb1ca737e00d58420020a401ca13.json
326 https://github.com/computationalmodelling/nbval 9f1530ccd9beca4c33b18129c3d2f819.json
327 https://github.com/TwistedHardware/mltutorial 9fc7eb4ef91c4d11cc8fbe8f162aa71a.json
328 https://github.com/QuantEcon/lecture-python-programming.notebooks 9fee70c5175d2e75893dab228a92ed3e.json
329 https://github.com/michhar/python-jupyter-notebooks a0d807b27b9eb7c20d62247f88004ced.json
330 https://github.com/leriomaggio/python-in-a-notebook a0dd

410 https://github.com/qiskit-community/qiskit-community-tutorials c51d79517d412295584b79418c072f87.json
411 https://github.com/jldowns/google_earth_engine_notebook c5e2b9dedf375e494cd12b31452295eb.json
412 https://github.com/likejazz/jupyter-notebooks c6823f9122c858d5cb0782525639339c.json
413 https://github.com/nteract/commuter c6adc14099cefbab1ce5f78e476277b0.json
414 https://github.com/donnemartin/data-science-ipython-notebooks c969629e43ee99737def68ac85b9a016.json
415 https://github.com/mrm8488/shared_colab_notebooks ca2cbbe2ac81de36872a316f66905f75.json
416 https://github.com/krishnan-r/sparkmonitor ca7f972cb0795ca69f14e3a945fa7749.json
417 https://github.com/scoutbee/pytorch-nlp-notebooks cacc7ab360f20627d3048c0919d88158.json
418 https://github.com/sergiogama/notebook cae0db40e47ad3123b410df6881288a2.json
419 https://github.com/digitalearthafrica/deafrica-sandbox-notebooks cb587259f30c89ba5aa571851f840274.json
420 https://github.com/jupyter-widgets/ipywidgets cbc2a8c6d598b561071b

505 https://github.com/tritemio/jupyter_notebook_beginner_guide ef95781d0a3a5a952c9913c5c8d405a9.json
506 https://github.com/peterroelants/notebooks f056283a4fc257e4e8bdf0aadaa06cb7.json
507 https://github.com/ogrisel/notebooks f0a517b4b3ef04713c0bf519f3d935b7.json
categorize
t5 skipped
nltk skipped
nothing found to categorize
508 https://github.com/yhilpisch/py4fi2nd f0c69ced8e766d8ea9cb3be589a562b8.json
509 https://github.com/Lasagne/Recipes f18afb194b81650ab7594e266bdaacc6.json
510 https://github.com/liviu-/notebooks f25b6bc097c0bfadc6b4555707fc9cc5.json
511 https://github.com/microsoft/gather f285063cc46847417b944e8e6d00516c.json
512 https://github.com/deep-learning-indaba/indaba-2018 f2e381c9bdaa5e7e03f10e89fed7bd64.json
513 https://github.com/codenode/codenode f32e96cb654d3f23e5da1e843212d2c2.json
514 https://github.com/seranus/faceswap-notebooks f3485f6c0924c8d88a75b251b0d2e2da.json
515 https://github.com/mli/d2l-1day-notebooks-zh f41c136f651ca17485d8d268567a539a.json
516 https:

606 https://github.com/Chris-Manna/charity_recommender 243c3b4f2e6cb8c5d7c50813e9352bf9.json
607 https://github.com/apoorv-goel/Bank-Note-Authentication-Using-DNN-Tensorflow-Classifier-and-RandomForest 24b83e7f6fdd275bd406b8916652d79e.json
608 https://github.com/han-yan-ds/Kaggle-Bosch 253f49faf83d92efa13442362e44944c.json
609 https://github.com/byukan/Marketing-Data-Science 255813e45371254b24f9bd2dbfaa6f6f.json
610 https://github.com/arcadynovosyolov/finance 2869a19e58028c397946bedcd40b1c88.json
611 https://github.com/sarachmax/MarketCrashes_Prediction/ 2ae41828b90a71e3f473c6fc6351a7b2.json
612 https://github.com/surajmall/Agriculture-Assistant/ 2bd19bc3894984635e1e3b8d798c38ba.json
613 https://github.com/Shomona/Bank-Failure-Prediction/ 2d0d6061ae79f8984e3a1909cc61bec4.json
614 https://github.com/IBM/iot-predictive-analytics 2db429c71f7394a0b73ef74198cb65dc.json
615 https://github.com/tstreamDOTh/Instacart-Market-Basket-Analysis 2e0f5e5af0d662e2e70f4c00627728a8.json
616 https://githu

790 https://github.com/talmo/leap 9699371c66c058d7ec7fadad74006130.json
791 https://github.com/anki1909/Recruit-Restaurant-Visitor-Forecasting 96a609da924b8f21b7ae909ef2a1aa95.json
792 https://github.com/hep-lbdl/CaloGAN 96cc22b31d725200b3fd11c71d20e708.json
793 https://github.com/datadesk/lapd-crime-classification-analysis 96ed0b93ca121ec2cba66d031301aab5.json
794 https://github.com/IBM-DSE/CyberShop-Analytics 980de799311518374c7a9bcfa58cae5a.json
795 https://github.com/datakind/datadive-gates92y-proj3-form990 9967a9915c38d2f4bded9ed04b7c1653.json
796 https://github.com/ritchie46/anaStruct 9a31ce95ad65d0d574b59c22f2c09cbd.json
797 https://github.com/everAspiring/RegressionAnalysis 9a61954ef61f6f45079eb31b43df00a5.json
798 https://github.com/darshankaarki/ml-coa-charging 9abe1afa16ac67826ae1db955ec51255.json
799 https://github.com/hbutsuak95/Quality-Optimization-of-Steel 9be47146e9da2a758c4eadb0f436b5b1.json
800
800 https://github.com/AccelAI/AI-Law-Minicourse/ 9cc570fef0c3eb39f4c55899

985 https://mlart.co/item/storify-and-auction-gan-generated-paintings-based-on-robbie-barrat_s-model 05cae1b74451a879545ef6707a6ef964.json
986 https://mlart.co/item/gan-interpolation 063c87fcc5916e19010929ebe0d0c2b2.json
987 https://mlart.co/item/create-an-artwork-with-the-shortest-algorithm-while-at-the-same-time-captures-the-essence-of-an-object_-using-kolmogorov-complexity 078191433dfbf1dd7f8ceb5fb21a8845.json
988 https://mlart.co/item/gan-trained-on-personal-pictures-of-flowers 078e9dd5b6a765310787e7bf65ee2e1d.json
989 https://mlart.co/item/a-cnn-model-in-a-device-that-collects-data-and-is-auto-trained-to-make-an-action 08170eb53c1558a8e79b9d8a7c546b40.json
990 https://mlart.co/item/apply-patches-from-artworks-to-a-painting-and-guide-it-with-a-gan-discriminator 086c6dd67301762e7d79ca6306d4446c.json
991 https://mlart.co/item/fast-3d-style-transfer 097ac2b9ff478af1d98358122f1c6fce.json
992 https://mlart.co/item/generate-a-face-with-stylegan-ffhq_-and-use-first-order-motion-to-animate

1173 https://mlart.co/item/deepdream-applied-to-bob-ross_s-video-and-sound-synthesized-with-wavenet 825ff05424a1612c1137fb9c894db4c6.json
1174 https://mlart.co/item/gan-generated-images-from-personal-photos-from-finland 837228e08bacac5d93db37e1f2117aeb.json
1175 https://mlart.co/item/run-an-image-through-a-pre-trained-cnn_-and-then-upscale-the-image-backward_-for-each-layer_-adjust-the-image-according-to-the-difference-between-the-activations-and-the-image 851df107db754592f5260c53b0aabd37.json
1176 https://mlart.co/item/measure-the-distance-between-cnn-features-to-pair-fonts 8611a4a5dc840606d00b72722f1ae1d5.json
1177 https://mlart.co/item/a-chrome-plugin-that-uses-3d-photo-inpainting-to-turn-images-into-3d-animations 8667e26495644b40173e519e7b51b449.json
1178 https://mlart.co/item/a-collaborative-fiction-with-gpt-3_-video_-personas_-and-voices-are-all-generated 87509dc9b3cfc786448760d6c3e9d0c7.json
1179 https://mlart.co/item/gpt-2-trained-on-svg-data_-emojis-from-twemoji_-and-symbols-f

1356 https://mlart.co/item/stylegan-trained-on-floor-plans-and-stacked-on-top-of-each-other fbb6e2f05ce8a9ea9c2cf1c4453472a3.json
1357 https://mlart.co/item/gan-interpolation-inspired-by-francis-bacon fccaff6d0d99cc7995fd0fc56e2417d3.json
1358 https://mlart.co/item/sngan-trained-on-doodles-and-upscaled-and-colorized-with-cyclegan fd848011bfd1f9c42ff31a4617bb1bec.json
1359 https://mlart.co/item/generate-a-pose_-use-a-pix2pix-model-to-turn-the-pose-into-an-artwork-and-then-paint-it-live fdc078a572b92aaa9b2dd8d8734a6532.json
1360 https://mlart.co/item/turn-a-drawing-into-a-cat-in-a-web-browser fea5ae867692ceaf53eb588c33893695.json
1361 https://mlart.co/item/overfit-stylegan-on-250-nonsensical-images-and-then-apply-styletransfer ff4c99efc66ad5bace54b6b9c651ce2a.json
1362 https://mlart.co/item/gan-generated-roman-statues-with-generative-music-and-descriptions ff857d7d6e5fb1d2a14f7d7c20788714.json
folder tcp
###
tcp
files in folder: 172
1363 https://thecleverprogrammer.com/2020/10/23/barcode

KeyboardInterrupt: 
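
In the categorization cell above, the nltk summary is evaluated after the t5 summary, so its result ends up in `category` when both are enabled. Description and title are only used when neither summary produced a category. The sketch below restates that order as a single function. `pick_source` is a hypothetical helper, and it simplifies one detail: the cell checks for existing `t5_category` / `nltk_category` keys, while this sketch only looks at the text fields.

```python
def pick_source(row, use_t5=False, use_nltk=True, use_fallback=True):
    """Return (source_name, text) for the field that would set the final category."""
    # The nltk branch runs last among the summaries, so it wins when present.
    if use_nltk and row.get('sum_nltk'):
        return 'nltk', row['sum_nltk']
    if use_t5 and row.get('sum_t5'):
        return 't5', row['sum_t5']
    if use_fallback:
        # Fallback: prefer the description, then the title.
        if row.get('description'):
            return 'description', row['description']
        if row.get('title'):
            return 'title', row['title']
    # Corresponds to the 'nothing found to categorize' message.
    return None, ''


both = pick_source({'sum_nltk': 'a', 'sum_t5': 'b'}, use_t5=True)
title_only = pick_source({'description': '', 'title': 'T'})
nothing = pick_source({'description': '', 'title': ''})
```

Items that return `(None, '')` match the "nothing found to categorize" lines in the log above: they have no summary, description or title to classify.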