In [1]:
# this notebook converts the CSV to ES mapping

In [2]:
# imports
from datetime import datetime
import hashlib
import json
import sys
import csv
import os
import pandas as pd
import re
import time
# note: gensim.summarization was removed in gensim 4.0, so these imports need gensim < 4
from gensim.summarization.summarizer import summarize
from gensim.summarization import keywords

In [3]:
# some long text
# source: https://www.kaggle.com/c/stanford-covid-vaccine
text1 = '''
Winning the fight against the COVID-19 pandemic will require an effective vaccine that can be equitably and widely distributed. Building upon decades of research has allowed scientists to accelerate the search for a vaccine against COVID-19, but every day that goes by without a vaccine has enormous costs for the world nonetheless. We need new, fresh ideas from all corners of the world. Could online gaming and crowdsourcing help solve a worldwide pandemic? Pairing scientific and crowdsourced intelligence could help computational biochemists make measurable progress.
mRNA vaccines have taken the lead as the fastest vaccine candidates for COVID-19, but currently, they face key potential limitations. One of the biggest challenges right now is how to design super stable messenger RNA molecules (mRNA). Conventional vaccines (like your seasonal flu shots) are packaged in disposable syringes and shipped under refrigeration around the world, but that is not currently possible for mRNA vaccines.
Researchers have observed that RNA molecules have the tendency to spontaneously degrade. This is a serious limitation--a single cut can render the mRNA vaccine useless. Currently, little is known on the details of where in the backbone of a given RNA is most prone to being affected. Without this knowledge, current mRNA vaccines against COVID-19 must be prepared and shipped under intense refrigeration, and are unlikely to reach more than a tiny fraction of human beings on the planet unless they can be stabilized.
The Eterna community, led by Professor Rhiju Das, a computational biochemist at Stanford’s School of Medicine, brings together scientists and gamers to solve puzzles and invent medicine. Eterna is an online video game platform that challenges players to solve scientific problems such as mRNA design through puzzles. The solutions are synthesized and experimentally tested at Stanford by researchers to gain new insights about RNA molecules. The Eterna community has previously unlocked new scientific principles, made new diagnostics against deadly diseases, and engaged the world’s most potent intellectual resources for the betterment of the public. The Eterna community has advanced biotechnology through its contribution in over 20 publications, including advances in RNA biotechnology.
In this competition, we are looking to leverage the data science expertise of the Kaggle community to develop models and design rules for RNA degradation. Your model will predict likely degradation rates at each base of an RNA molecule, trained on a subset of an Eterna dataset comprising over 3000 RNA molecules (which span a panoply of sequences and structures) and their degradation rates at each position. We will then score your models on a second generation of RNA sequences that have just been devised by Eterna players for COVID-19 mRNA vaccines. These final test sequences are currently being synthesized and experimentally characterized at Stanford University in parallel to your modeling efforts -- Nature will score your models!
Improving the stability of mRNA vaccines was a problem that was being explored before the pandemic but was expected to take many years to solve. Now, we must solve this deep scientific challenge in months, if not weeks, to accelerate mRNA vaccine research and deliver a refrigerator-stable vaccine against SARS-CoV-2, the virus behind COVID-19. The problem we are trying to solve has eluded academic labs, industry R&D groups, and supercomputers, and so we are turning to you. To help, you can join the team of video game players, scientists, and developers at Eterna to unlock the key in our fight against this devastating pandemic. 
'''

# and a short one
text2 = 'The quick brown fox jumps over the lazy dog'

In [4]:
# function to count words
def word_count(text):
    if isinstance(text, str):
        s = text.split(' ')
        return len(s)
    else:
        return 0

print('words:', word_count(text1))
print('words:', word_count(text2))
print('words:', word_count(None))

words: 564
words: 9
words: 0
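`word_count` splits on single spaces only, so words on either side of a line break are counted as one token and consecutive spaces produce empty tokens. A minimal sketch of a whitespace-aware variant, assuming that any run of whitespace should separate words; `word_count_ws` is a hypothetical name, not part of the notebook, and its counts will differ from the ones printed above:

```python
def word_count_ws(text):
    """Count whitespace-separated tokens; non-strings count as 0."""
    if not isinstance(text, str):
        return 0
    # str.split() without an argument splits on any run of whitespace
    # (spaces, tabs, newlines) and never yields empty tokens.
    return len(text.split())


print('words:', word_count_ws('The quick brown fox jumps over the lazy dog'))
print('words:', word_count_ws('line one\nline  two'))
print('words:', word_count_ws(None))
```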


In [5]:
# function to count sentences
def sentence_count(text):
    if isinstance(text, str):
        s = text.split('. ')
        return len(s)
    else:
        return 0

print('sentences:', sentence_count(text1))
print('sentences:', sentence_count(text2))
print('sentences:', sentence_count(None))

sentences: 20
sentences: 1
sentences: 0
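`sentence_count` only recognises `'. '` as a boundary, so sentences ending in `?` or `!`, or followed by a line break, are merged. A rough sketch of a regex-based alternative, assuming `.`, `!` and `?` followed by whitespace or end of text mark a sentence end; `sentence_count_re` is a hypothetical name, and abbreviations such as "e.g." will still be overcounted:

```python
import re

# Sentence boundary: one or more of . ! ? followed by whitespace or end of text.
_SENTENCE_END = re.compile(r'[.!?]+(?:\s+|$)')


def sentence_count_re(text):
    """Count non-empty segments between sentence boundaries."""
    if not isinstance(text, str):
        return 0
    parts = [p for p in _SENTENCE_END.split(text) if p.strip()]
    return len(parts)


print('sentences:', sentence_count_re('Could gaming help? Yes. It might!'))
print('sentences:', sentence_count_re('The quick brown fox jumps over the lazy dog'))
print('sentences:', sentence_count_re(None))
```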


In [6]:
# extractive summarization

In [7]:
# text summarization 100% -> n% (gensim TextRank; the nltk_ prefix is only a name)
def nltk_ratio(text, ratio=0.25):
    return summarize(text, ratio=ratio)

sum_nltk_ratio = nltk_ratio(text1, ratio=0.25)
print('words:', word_count(sum_nltk_ratio))
print(sum_nltk_ratio)

words: 139
Eterna is an online video game platform that challenges players to solve scientific problems such as mRNA design through puzzles.
The solutions are synthesized and experimentally tested at Stanford by researchers to gain new insights about RNA molecules.
We will then score your models on a second generation of RNA sequences that have just been devised by Eterna players for COVID-19 mRNA vaccines.
Improving the stability of mRNA vaccines was a problem that was being explored before the pandemic but was expected to take many years to solve.
Now, we must solve this deep scientific challenge in months, if not weeks, to accelerate mRNA vaccine research and deliver a refrigerator-stable vaccine against SARS-CoV-2, the virus behind COVID-19.
To help, you can join the team of video game players, scientists, and developers at Eterna to unlock the key in our fight against this devastating pandemic.


In [8]:
# text summarization 100% -> n words (gensim TextRank; the nltk_ prefix is only a name)
def nltk_count(text, word_count=100):
    return summarize(text, word_count=word_count)

sum_nltk_count = nltk_count(text1, word_count=100)
print('words:', word_count(sum_nltk_count))
print(sum_nltk_count)

words: 98
Eterna is an online video game platform that challenges players to solve scientific problems such as mRNA design through puzzles.
We will then score your models on a second generation of RNA sequences that have just been devised by Eterna players for COVID-19 mRNA vaccines.
Now, we must solve this deep scientific challenge in months, if not weeks, to accelerate mRNA vaccine research and deliver a refrigerator-stable vaccine against SARS-CoV-2, the virus behind COVID-19.
To help, you can join the team of video game players, scientists, and developers at Eterna to unlock the key in our fight against this devastating pandemic.


In [9]:
# abstractive summarization
# https://www.machinelearningplus.com/nlp/text-summarization-approaches-nlp-example/

In [10]:
# BART
# Importing the model
from transformers import BartForConditionalGeneration, BartTokenizer, BartConfig

In [11]:
'''
# Loading the model and tokenizer for bart-large-cnn
tokenizer=BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model=BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
#'''

In [12]:
'''
# Encoding the inputs and passing them to model.generate()
def bart(text):
    inputs = tokenizer.batch_encode_plus([text],return_tensors='pt')
    summary_ids = model.generate(inputs['input_ids'], early_stopping=True)

    # Decoding and printing the summary
    bart_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    
    return bart_summary

# long text
start = time.time()
sum_bart_l = bart(text1)
end = time.time()

print('### long text ###')
print('runtime:', end-start)
print('words:', word_count(sum_bart_l))
print('sentences:', sentence_count(sum_bart_l))
print(sum_bart_l)
print('')

# short text
print('### short text ###')
start = time.time()
sum_bart_s = bart(text2)
end = time.time()

print('runtime:', end-start)
print('words:', word_count(sum_bart_s))
print('sentences:', sentence_count(sum_bart_s))
print(sum_bart_s)
#'''

In [13]:
# T5
# https://towardsdatascience.com/summarize-reddit-comments-using-t5-bart-gpt-2-xlnet-models-a3e78a5ab944
from transformers import T5Tokenizer, T5ForConditionalGeneration
model = T5ForConditionalGeneration.from_pretrained('t5-base')
tokenizer = T5Tokenizer.from_pretrained('t5-base')

Some weights of the model checkpoint at t5-base were not used when initializing T5ForConditionalGeneration: ['decoder.block.0.layer.1.EncDecAttention.relative_attention_bias.weight']
- This IS expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [14]:
def t5(text):
    Preprocessed_text = "summarize: " + text
    tokens_input = tokenizer.encode(Preprocessed_text,return_tensors="pt", max_length=512, truncation=True)
    summary_ids = model.generate(tokens_input, min_length=100, max_length=180, length_penalty=4.0)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

'''
# long text
start = time.time()
sum_t5_l = t5(text1)
end = time.time()

print('### long text ###')
print('runtime:', end-start)
print('words:', word_count(sum_t5_l))
print('sentences:', sentence_count(sum_t5_l))
print(sum_t5_l)
print('')

# short text
print('### short text ###')
start = time.time()
sum_t5_s = t5(text2)
end = time.time()

print('runtime:', end-start)
print('words:', word_count(sum_t5_s))
print('sentences:', sentence_count(sum_t5_s))
print(sum_t5_s)
#'''

In [15]:
# industry categories

# https://www.census.gov/programs-surveys/aces/information/iccl.html
cat_sic = ['Agriculture','Forestry','Fishing','Mining','Construction','Manufacturing','Transportation','Communications','Electric','Gas','Sanitary','Wholesale Trade','Retail Trade','Finance','Insurance','Real Estate','Services','Public Administration']
# https://www.marketing91.com/19-types-of-business-industries/
cat_19 = ['Aerospace','Transport','Computer','Telecommunication','Agriculture','Construction','Education','Pharmaceutical','Food','Health care','Hospitality','Entertainment','News Media','Energy','Manufacturing','Music','Mining','Worldwide web','Electronics']
# https://simplicable.com/new/industries
cat_simple = ['Advertising','Agriculture','Communication','Construction','Creative','Education','Entertainment','Fashion','Finance','Health care','Information Technology','Manufacturing','Media','Retail','Research','Robotics','Space']

cat = ['Accommodation & Food','Accounting','Agriculture','Banking & Insurance','Biotechnological & Life Sciences','Construction & Engineering','Economics','Education & Research','Emergency & Relief','Finance','Government and Public Works','Healthcare','Justice, Law and Regulations','Manufacturing','Media & Publishing','Miscellaneous','Physics','Real Estate, Rental & Leasing','Utilities','Wholesale & Retail']
subcat = ['Failure','Food','Fraud','General','Genomics','Insurance and Risk','Judicial Applied','Life-sciences','Machine Learning','Maintenance','Management and Operations','Marketing','Material Science','Physical','Policy and Regulatory','Politics','Preventative and Reactive','Quality','Real Estate','Rental & Leasing','Restaurant','Retail','School','Sequencing','Social Policies','Student','Textual Analysis','Tools','Tourism','Trading & Investment','Transportation','Valuation','Water & Pollution','Wholesale']

In [16]:
# zero shot classification
# https://towardsdatascience.com/zero-shot-text-classification-with-hugging-face-7f533ba83cd6
from transformers import pipeline
classifier = pipeline("zero-shot-classification")

Some weights of the model checkpoint at facebook/bart-large-mnli were not used when initializing BartModel: ['model.encoder.version', 'model.decoder.version']
- This IS expected if you are initializing BartModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BartModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at facebook/bart-large-mnli were not used when initializing BartForSequenceClassification: ['model.encoder.version', 'model.decoder.version']
- This IS expected if you are initializing BartForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification m

In [17]:
'''
# test classification with the gensim summary (100 words)

s = nltk_count(text1, word_count=100)

start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_sic = classifier(s, cat_sic, multi_class=True)
end = time.time()

print('runtime sic:', end-start)
#print(res_sic)
print(res_sic['labels'][0:3])
print(res_sic['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_c19 = classifier(s, cat_19, multi_class=True)
end = time.time()

print('runtime c19:', end-start)
#print(res_c19)
print(res_c19['labels'][0:3])
print(res_c19['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_simple = classifier(s, cat_simple, multi_class=True)
end = time.time()

print('runtime simple:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat)
res_simple = classifier(s, cat, multi_class=True)
end = time.time()

print('runtime category:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), subcat)
res_simple = classifier(s, subcat, multi_class=True)
end = time.time()

print('runtime subcategory:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])
#'''

In [18]:
'''
# test classification with t5
start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_sic = classifier(sum_t5_l, cat_sic, multi_class=True)
end = time.time()

print('runtime sic:', end-start)
#print(res_sic)
print(res_sic['labels'][0:3])
print(res_sic['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_c19 = classifier(sum_t5_l, cat_19, multi_class=True)
end = time.time()

print('runtime c19:', end-start)
#print(res_c19)
print(res_c19['labels'][0:3])
print(res_c19['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_simple = classifier(sum_t5_l, cat_simple, multi_class=True)
end = time.time()

print('runtime simple:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat)
res_simple = classifier(sum_t5_l, cat, multi_class=True)
end = time.time()

print('runtime category:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), subcat)
res_simple = classifier(sum_t5_l, subcat, multi_class=True)
end = time.time()

print('runtime subcategory:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])
#'''

In [19]:
'''
# test classification with bart

start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_sic = classifier(sum_bart_l, cat_sic, multi_class=True)
end = time.time()

print('runtime sic:', end-start)
#print(res_sic)
print(res_sic['labels'][0:3])
print(res_sic['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_c19 = classifier(sum_bart_l, cat_19, multi_class=True)
end = time.time()

print('runtime c19:', end-start)
#print(res_c19)
print(res_c19['labels'][0:3])
print(res_c19['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat_19)
res_simple = classifier(sum_bart_l, cat_simple, multi_class=True)
end = time.time()

print('runtime simple:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), cat)
res_simple = classifier(sum_bart_l, cat, multi_class=True)
end = time.time()

print('runtime category:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])


start = time.time()
#res = classifier(nltk_count(text1, word_count=200), subcat)
res_simple = classifier(sum_bart_l, subcat, multi_class=True)
end = time.time()

print('runtime subcategory:', end-start)
#print(res_simple)
print(res_simple['labels'][0:3])
print(res_simple['scores'][0:3])
#'''

In [20]:
# category function
def categorize(text, categories):
    start = time.time()
    res = classifier(text, categories, multi_class=True)
    end = time.time()
    dur = round(end-start, 3)
    return {
        'category': res['labels'][0],
        'score': res['scores'][0],
        'runtime': dur,
    }

#print(categorize(nltk_count(text1), cat))
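`categorize` keeps only the best label. A small sketch of a variant that returns the top few labels with their scores, assuming the classifier returns a dict with `labels` and `scores` sorted by descending score (the shape used in the cells above). `categorize_top` and `fake_classify` are hypothetical names; the stand-in classifier exists only so the sketch runs without downloading a model:

```python
import time


def categorize_top(classify, text, categories, top_k=3):
    """Return the top_k (label, score) pairs plus the runtime of the call.

    `classify` is any callable taking (text, labels) and returning a dict
    with 'labels' and 'scores' sorted by descending score.
    """
    start = time.perf_counter()
    res = classify(text, categories)
    runtime = round(time.perf_counter() - start, 3)
    ranked = list(zip(res['labels'], res['scores']))[:top_k]
    return {'top': ranked, 'runtime': runtime}


# Stand-in classifier for illustration only: it ranks labels by length.
def fake_classify(text, labels):
    ordered = sorted(labels, key=len, reverse=True)
    total = sum(len(label) for label in ordered)
    return {'labels': ordered,
            'scores': [len(label) / total for label in ordered]}


result = categorize_top(fake_classify, 'some text',
                        ['Health', 'Education', 'Art'], top_k=2)
print(result['top'])
```

With the real pipeline, the callable could be something like `lambda t, c: classifier(t, c, multi_class=True)`. Newer transformers releases renamed `multi_class` to `multi_label`, so the keyword depends on the installed version.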

In [21]:
# helper functions

In [22]:
# function to rebuild a list from its string representation
# (needed when a list was written to CSV without JSON-encoding it first)
def str_to_list(s):
    s = s.replace("'", "").replace(' ,', ',').replace(
        '[', '').replace(']', '').split(',')
    s = [i.strip() for i in s if i]
    return s
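`str_to_list` removes every apostrophe and splits on every comma, so items such as `"don't"` or `"a, b"` are altered. A sketch of an alternative, assuming the CSV fields hold Python list literals like `"['a', 'b']"`: try `ast.literal_eval` first and fall back to a plain comma split. `parse_list_field` is a hypothetical name, not part of the notebook:

```python
import ast


def parse_list_field(s):
    """Parse a stringified Python list such as "['a', 'b']".

    Falls back to a comma split when the field is not a valid list literal.
    """
    try:
        value = ast.literal_eval(s)
        if isinstance(value, (list, tuple)):
            return [str(v).strip() for v in value]
    except (ValueError, SyntaxError):
        pass
    # Fallback: strip surrounding brackets and split on commas.
    return [part.strip() for part in s.strip('[]').split(',') if part.strip()]


print(parse_list_field("['gpu', 'deep learning']"))
print(parse_list_field('["a, b", "c"]'))
print(parse_list_field('a, b'))
```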

In [23]:
# helper function to create the parent folder of a path
import errno

def create_folder(path):
    if not os.path.exists(os.path.dirname(path)):
        try:
            os.makedirs(os.path.dirname(path))
            print(path + ' created')
        except OSError as exc: # Guard against race condition
            if exc.errno != errno.EEXIST:
                raise
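On Python 3 the exists-check and the `EEXIST` guard can be replaced by `os.makedirs(..., exist_ok=True)`. A minimal sketch under that assumption; `ensure_parent_dir` is a hypothetical name and the temporary directory is used only for the demonstration:

```python
import os
import tempfile


def ensure_parent_dir(path):
    """Create the parent directory of `path` if it is missing."""
    parent = os.path.dirname(path)
    if parent:
        # exist_ok=True makes the call safe when the folder already exists,
        # including when another process created it in the meantime.
        os.makedirs(parent, exist_ok=True)
    return parent


# Usage with a throwaway directory
base = tempfile.mkdtemp()
target = os.path.join(base, 'json', 'kaggle', 'item.json')
created = ensure_parent_dir(target)
print(os.path.isdir(created))
```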

In [24]:
# generic store data to file function
def store_data(data, file, mode='w', toJson=False):
    if toJson:
        data = json.dumps(data)
    with open(file, mode, encoding='utf-8') as fp:
        result = fp.write(data)
        return result
    
# generic load data from file function
def load_data(file, fromJson=False):
    if os.path.isfile(file):
        with open(file, 'r', encoding='utf-8', errors="ignore") as fp:
            data = fp.read()
            if fromJson:
                data = json.loads(data)
            return data
    else:
        return 'file not found'

# test text
#print(store_data('Hello', '../data/repositories/mlart/test.txt'))
#print(load_data('../data/repositories/mlart/test.txt'))

# test json
#print(store_data({'msg':'Hello World'}, '../data/repositories/mlart/test.json', toJson=True))
#print(load_data('../data/repositories/mlart/test.json', fromJson=True))

#store_data(result[0]['html'], '../data/repositories/kaggle/notebook.html')
#store_data(result[0]['iframe'], '../data/repositories/kaggle/kernel.html')
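`load_data` returns the string `'file not found'` for a missing file, which a caller cannot tell apart from a file that really contains that text. A sketch of a JSON loader that returns `None` instead, assuming callers can check for `None`; `load_json_or_none` is a hypothetical name and the paths are temporary files created for the example:

```python
import json
import os
import tempfile


def load_json_or_none(file):
    """Return the parsed JSON content of `file`, or None if it is missing."""
    if not os.path.isfile(file):
        return None
    with open(file, 'r', encoding='utf-8') as fp:
        return json.load(fp)


# Round trip through a throwaway directory
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'test.json')
with open(path, 'w', encoding='utf-8') as fp:
    json.dump({'msg': 'Hello World'}, fp)

loaded = load_json_or_none(path)
missing = load_json_or_none(os.path.join(tmp_dir, 'absent.json'))
print(loaded, missing)
```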

In [25]:
# remove special characters
def clean_text(text):
    # Ref: https://gist.github.com/Alex-Just/e86110836f3f93fe7932290526529cd1#gistcomment-3208085
    # Ref: https://en.wikipedia.org/wiki/Unicode_block
    EMOJI_PATTERN = re.compile(
        "(["
        "\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F700-\U0001F77F"  # alchemical symbols
        "\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
        "\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
        "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
        "\U0001FA00-\U0001FA6F"  # Chess Symbols
        "\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
        "\U00002702-\U000027B0"  # Dingbats
        "])"
    )
    text = re.sub(EMOJI_PATTERN, '', text)
    
    # additional cleanup
    text = text.replace('•','').replace('\n',' ')
    
    return text
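Removing a symbol or replacing a newline with a space can leave double spaces behind, which also inflates the space-based word count. A small sketch that collapses whitespace after the removal, using a reduced pattern (emoticons and bullets only) purely for illustration; `strip_symbols` and `SAMPLE_PATTERN` are hypothetical names:

```python
import re

# Reduced pattern for illustration: emoticons block and the bullet character.
SAMPLE_PATTERN = re.compile('[\U0001F600-\U0001F64F\u2022]')


def strip_symbols(text):
    """Remove the sample symbols, then collapse all whitespace runs."""
    text = SAMPLE_PATTERN.sub('', text)
    # Newlines and repeated spaces become a single space.
    return re.sub(r'\s+', ' ', text).strip()


cleaned = strip_symbols('Great \U0001F600 model\n\u2022 fast\n\u2022 small')
print(cleaned)
```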

In [53]:
# mapper to convert CSV to the mapping of Elasticsearch index
def mapper(row, style):
    '''
    mapper to adapt a CSV row to the db schema

    title, title_vector, description, description_vector,
    link, category, category_score, subcategory, subcategory_score, 
    tags, kind, ml_libs, host, license, language, score,
    date_project, date_scraped
    '''

    # kaggle notebook mapping
    if style == 'kaggle_notebook':
        return {
            'title': row['title'],
            'description': row['description'],
            'link': row['link'],
            # 'category': '',
            # 'category_score': 0,
            # 'subcategory': '',
            # 'subcategory_score': 0,
            'tags': list(set(str_to_list(row['tags']))),  # de-duplicated tags
            'kind': ['project', 'notebook'],
            'ml_libs': str_to_list(row['ml_libs']),
            'host': 'www.kaggle.com',
            'license': row['license'],
            'language': row['type'],
            'score': row['score_views'] if 'score_views' in row else None,
            'date_project': datetime.strptime(row['date'], "%Y-%m-%d %H:%M:%S") if row['date'] != '' else None,
            'date_scraped': datetime.strptime(row['scraped_at'], "%Y-%m-%d %H:%M:%S") if row['scraped_at'] != '' else None,
            # 'ml_terms': row['ml_terms'],
            # 'score_raw': json.dumps({'views': row['views'], 'votes': row['votes'], 'score_private': row['score_private'], 'score_public': row['score_public']}),
        }
    
    # kaggle competition mapping
    if style == 'kaggle_competition':
        return {
            'title': row['title'],
            'description': row['subtitle'] + row['description'],
            'link': row['link'],
            # 'category': '',
            # 'category_score': 0,
            # 'subcategory': '',
            # 'subcategory_score': 0,
            'tags': list(set(str_to_list(row['tags']) + str_to_list(row['type']))),
            'kind': ['project', 'competition', 'dataset'],
            # 'ml_libs': str_to_list(row['ml_libs']),
            'host': 'www.kaggle.com',
            # 'license': row['license'],
            # 'language': row['type'],
            'score': row['teams_score'],
            'date_project': datetime.strptime(row['date_closed'], "%Y-%m-%d %H:%M:%S") if row.get('date_closed') else '',  # also guards against an empty value
            # 'date_scraped': datetime.strptime(row['scraped_at'], "%Y-%m-%d %H:%M:%S"),
            # 'ml_terms': row['ml_terms'],
            # 'score_raw': json.dumps({'views': row['views'], 'votes': row['votes'], 'score_private': row['score_private'], 'score_public': row['score_public']}),
        }
    
    # kaggle dataset mapping
    if style == 'kaggle_dataset':
        return {
            'title': row['title'],
            'description': row['subtitle'] + row['description'],
            'link': row['link'],
            # 'category': '',
            # 'category_score': 0,
            # 'subcategory': '',
            # 'subcategory_score': 0,
            'tags': list(set(str_to_list(row['tags']) + str_to_list(row['type']))),
            'kind': ['project', 'dataset'],
            # 'ml_libs': str_to_list(row['ml_libs']),
            'host': 'www.kaggle.com',
            # 'license': row['license'],
            # 'language': row['type'],
            'score': row['teams_score'],
            'date_project': datetime.strptime(row['date_closed'], "%Y-%m-%d %H:%M:%S") if row.get('date_closed') else '',  # also guards against an empty value
            # 'date_scraped': datetime.strptime(row['scraped_at'], "%Y-%m-%d %H:%M:%S"),
            # 'ml_terms': row['ml_terms'],
            # 'score_raw': json.dumps({'views': row['views'], 'votes': row['votes'], 'score_private': row['score_private'], 'score_public': row['score_public']}),
        }

    # github mapping
    if style == 'github':
        cat_score = 1 if row['industry'] != '' else 0
        subcat_score = 1 if row['type'] != '' else 0
        #tags = row['ml_tags'] if len(row['ml_tags']) > 0 else ''
        return {
            'title': row['name'],
            'description': row['description2'],
            'link': row['link'],
            'category': row['industry'],
            'category_score': cat_score,
            'subcategory': row['type'],
            'subcategory_score': subcat_score,
            'tags': str_to_list(row['ml_tags']),
            'kind': 'Project',
            'ml_libs': str_to_list(row['ml_libs']),
            'host': 'www.github.com',
            'license': row['license'],
            'language': row['language_primary'],
            'score': row['stars_score'],
            'date_project': datetime.strptime(row['pushed_at'], "%Y-%m-%d %H:%M:%S"),
            'date_scraped': datetime.strptime(row['scraped_at'], "%Y-%m-%d %H:%M:%S"),
            # 'ml_terms': row['keywords'],
            # 'score_raw': json.dumps({'stars': row['stars'], 'contributors': row['contributors']}),
        }

    # mlart mapping
    if style == 'mlart':
        title = row['Title'] if row['Title'] != '' else row['title']
        cat_score = 1 if row['Theme'] != '' else 0
        subcat_score = 1 if row['Medium'] != '' else 0
        return {
            'title': title,
            'description': row['subtitle'],
            'link': row['url'],
            'category': str_to_list(row['Theme']),
            'category_score': cat_score,
            'subcategory': str_to_list(row['Medium']),
            'subcategory_score': subcat_score,
            'tags': str_to_list(row['Technology']),
            'kind': 'Showcase',
            # 'ml_libs': [],
            'host': 'mlart.co',
            # 'license': '',
            # 'language': '',
            # 'score': 0,
            'date_project': datetime.strptime(row['Date'], "%Y-%m-%d"),
            'date_scraped': datetime.strptime(row['scraped_at'], "%Y-%m-%d %H:%M:%S"),
            # 'score_raw': json.dumps({'days_since_featured': row['Days Since Featured']}),
        }

    # thecleverprogrammer
    if style == 'tcp':
        return {
            'title': row['title'],
            'description': row['description'],
            'link': row['link'],
            # 'category': '',
            # 'category_score': 0,
            # 'subcategory': '',
            # 'subcategory_score': 0,
            'tags': str_to_list(row['ml_tags']),
            'kind': 'Project',
            'ml_libs': str_to_list(row['ml_libs']),
            'host': 'thecleverprogrammer.com',
            # 'license': '',
            'language': 'Python',
            # 'score': 0,
            'date_project': datetime.strptime(row['date'], "%Y-%m-%d %H:%M:%S"),
            'date_scraped': datetime.strptime('2020-12-20', "%Y-%m-%d"),
            # 'score_raw': json.dumps({'days_since_featured': row['Days Since Featured']}),
        }

    return None
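
The mappings above call `datetime.strptime` directly, which raises a `ValueError` on an empty or malformed date cell and would stop the whole run. A minimal sketch of a tolerant parser follows; the name `parse_date` is hypothetical and not part of this notebook, and returning `None` for bad input is an assumption about the desired behavior.

```python
from datetime import datetime


def parse_date(value, fmt="%Y-%m-%d"):
    """Parse a date string, returning None instead of raising on bad input.

    Hypothetical helper: `parse_date` is not defined elsewhere in the notebook.
    """
    try:
        return datetime.strptime(value, fmt)
    except (TypeError, ValueError):
        # TypeError: value is not a string (e.g. None)
        # ValueError: empty string or a string that does not match fmt
        return None


parsed = parse_date("2020-12-20")
missing = parse_date("")
```

If such a helper were used in `mapper`, the later `str(row['date_project'])` conversion would turn `None` into the string `'None'`, so that step would need a matching check.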

In [27]:
# test gpu usage
import torch
torch.cuda.is_available()

True

In [28]:
# summarization loop

In [54]:
# loop to transform data row-wise
def transform_loop(csv_in, csv_format, subfolder, quit=0, overwrite=False):
    
    with open(csv_in, encoding='utf-8') as csvfile:
        
        # let's store converted csv to temp-folder for analysis
        csv_out = '../data/database/csv/'
        json_out = '../data/database/json/'
        json_out_item = '../data/database/json/'+subfolder
        create_folder(json_out_item)
        df = pd.DataFrame()

        # readCSV = csv.reader(csvfile, delimiter=';')
        readCSV = csv.DictReader(csvfile, delimiter=';')
        # next(readCSV, None)  # skip the headers
        
        i = j = 0
        out = []
        
        for row in readCSV:
            # print(row)
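            row = mapper(row, csv_format)
            if row is None:
                # mapper() falls through to None for an unknown csv_format
                raise ValueError('unknown csv_format: ' + str(csv_format))
            
            # check if file already exists
            link = row['link']
            md5 = hashlib.md5(link.encode("utf-8")).hexdigest()
            
            json_fp = json_out_item + md5 + '.json'
            if not os.path.isfile(json_fp) or overwrite: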
                
                print(i, row['link'])
                item_start = time.time()

                # clean title & description
                row['title'] = clean_text(row['title'])
                text = row['description'] = clean_text(row['description'])
                words = row['words'] = word_count(text)
                sentences = row['sentences'] = sentence_count(text)

                # create summarization
                if words > 200 and sentences > 1:
                    print('summarize')

                    # nltk
                    #try:
                    if True:
                        start = time.time()
                        row['sum_nltk'] = nltk_count(text, word_count=200)
                        end = time.time()
                        dur = round(end-start,3)

                        row['sum_nltk_words'] = word_count(row['sum_nltk'])
                        row['sum_nltk_runtime'] = dur
                        print('done (nltk)', dur, 'sec')
                    #except:
                    #    print('summarization nltk failed')

                    # t5
                    #try:
                    if True:
                        start = time.time()
                        row['sum_t5'] = t5(text)
                        end = time.time()
                        dur = round(end-start,3)

                        row['sum_t5_words'] = word_count(row['sum_t5'])
                        row['sum_t5_runtime'] = dur
                        print('done (t5)', dur, 'sec')
                    #except:
                    #    print('summarization t5 failed')

                # zero shot categorization
                '''
                if not 'category' in row:
                    print('categorize')
                    
                    # create category and subcategory from t5
                    if 'sum_t5' in row and False:
                        s = row['sum_t5']
                        res = categorize(s, cat)
                        row['t5_category'] = res['category']
                        row['t5_category_score'] = res['score']
                        row['t5_category_runtime'] = res['runtime']
                        print('t5 category', res['runtime'], 'sec')
                        
                        res = categorize(s, subcat)
                        row['t5_subcategory'] = res['category']
                        row['t5_subcategory_score'] = res['score']
                        row['t5_subcategory_runtime'] = res['runtime']
                        print('t5 subcategory', res['runtime'], 'sec')
                    else:
                        print('t5 skipped')
                       
                    # create category and subcategory from nltk
                    if 'sum_nltk' in row:
                        s = row['sum_nltk']
                        res = categorize(s, cat)
                        row['nltk_category'] = res['category']
                        row['nltk_category_score'] = res['score']
                        row['nltk_category_runtime'] = res['runtime']
                        print('nltk category', res['runtime'], 'sec')
                        
                        res = categorize(s, subcat)
                        row['nltk_subcategory'] = res['category']
                        row['nltk_subcategory_score'] = res['score']
                        row['nltk_subcategory_runtime'] = res['runtime']
                        print('nltk subcategory', res['runtime'], 'sec')
                    else:
                        print('nltk skipped')
                        
                    # create category and subcategory from title or description if not already done
                    if not 't5_category' in row and not 'nltk_category' in row:
                        if len(row['description']) > 0:
                            s = row['description']
                            res = categorize(s, cat)
                            row['description_category'] = res['category']
                            row['description_category_score'] = res['score']
                            row['description_category_runtime'] = res['runtime']
                            print('description category', res['runtime'], 'sec')

                            res = categorize(s, subcat)
                            row['description_subcategory'] = res['category']
                            row['description_subcategory_score'] = res['score']
                            row['description_subcategory_runtime'] = res['runtime']
                            print('description subcategory', res['runtime'], 'sec')
                        else:
                            s = row['title']
                            res = categorize(s, cat)
                            row['title_category'] = res['category']
                            row['title_category_score'] = res['score']
                            row['title_category_runtime'] = res['runtime']
                            print('title category', res['runtime'], 'sec')

                            res = categorize(s, subcat)
                            row['title_subcategory'] = res['category']
                            row['title_subcategory_score'] = res['score']
                            row['title_subcategory_runtime'] = res['runtime']
                            print('title subcategory', res['runtime'], 'sec')
                    
                    
                    # zero shot categorization - first approach
                    
                    s = row['title']
                    if len(row['description']) > 0:
                        s = row['description']
                    if 'sum_nltk' in row:
                        s = row['sum_nltk']
                    if 'sum_t5' in row:
                        s = row['sum_t5']

                    #try:
                    if True:
                        res = categorize(s, cat_simple)
                        # {'category': 'Entertainment', 'score': 0.3688451945781708, 'runtime': 14.973}
                        row['category'] = res['category']
                        row['category_score'] = res['score']
                        row['category_runtime'] = res['runtime']
                        print('done', res['runtime'], 'sec')
                    #except:
                    #    print('categorization failed')
                    '''

                # convert datetime to string
                if 'date_project' in row:
                    row['date_project'] = str(row['date_project'])
                if 'date_scraped' in row:
                    row['date_scraped'] = str(row['date_scraped'])
                    
                # runtime
                item_end = time.time()
                item_dur = round(item_end-item_start, 3)
                row['runtime'] = item_dur

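                # DataFrame.append was removed in pandas 2.0; concat works in old and new versions
                df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)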

                # json encode
                #out.append(row)
                
                store_data(row, json_fp, toJson=True)
                print(json_fp)
                j += 1

            #print(i, row['link'])
            i += 1

            # keep count of # rows processed
            if i % 100 == 0:
                print(i)

            if quit != 0 and i >= quit:
                break

        # store parsed csv
        #fp = csv_in.split('/')[-1]
        #df.to_csv(csv_out + fp, sep=';', index=False)
        #path = json_out + fp
        #path = path.replace('.csv', '.json')
        #store_data(out, path, toJson=True)
        
        print('DONE parsed', i, 'items')
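
The loop above skips rows whose output file already exists, using the MD5 hash of the link as the file name. The sketch below isolates that check so it can be tried on its own. The names `cache_path` and `needs_processing` are hypothetical and introduced only for illustration; the notebook builds the path inline by string concatenation.

```python
import hashlib
import os


def cache_path(link, folder):
    """Return the JSON path for a link: <folder>/<md5 of link>.json."""
    md5 = hashlib.md5(link.encode("utf-8")).hexdigest()
    return os.path.join(folder, md5 + ".json")


def needs_processing(link, folder, overwrite=False):
    """True if no JSON file is stored for this link yet, or overwrite is requested."""
    return overwrite or not os.path.isfile(cache_path(link, folder))
```

Because the file name depends only on the link, re-running the loop with `overwrite=False` resumes where an interrupted run stopped, and the same link arriving from two CSV files in the same subfolder maps to one file.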

In [56]:
# run the loop

transform = ['ka_c', 'ka_cn', 'ka_d', 'ka_dn', 'ma', 'gh', 'tcp', 'bc']
#transform = ['ka_c', 'ka_cn', 'ka_dn']
#transform = ['ka_d', 'ma', 'gh', 'tcp', 'bc']
transform = ['tcp']

datasets = {
    # kaggle competitions
    'ka_c': {
        'csv_in': '../data/database/kaggle_competitions_correlated_01.csv',
        'csv_format': 'kaggle_competition',
    },
    # kaggle competitions notebooks
    'ka_cn': {
        'csv_in': '../data/database/kaggle_competitions_01_original.csv',
        'csv_format': 'kaggle_notebook',
    },
    # kaggle datasets
    'ka_d': {
        'csv_in': '../data/database/kaggle_datasets_correlated_01.csv',
        'csv_format': 'kaggle_dataset',
    },
    # kaggle datasets notebooks
    'ka_dn': {
        'csv_in': '../data/database/kaggle_datasets_01_original.csv',
        'csv_format': 'kaggle_notebook',
    },
    # mlart
    'ma': {
        'csv_in': '../data/database/mlart_01_original.csv',
        'csv_format':'mlart',
    },
    # github
    'gh': {
        'csv_in': '../data/database/db_04_analyzed_v02.csv',
        'csv_format': 'github',
    },
    # thecleverprogrammer
    'tcp': {
        'csv_in': '../data/database/thecleverprogrammer_01_original.csv',
        'csv_format': 'tcp',
    },
    # blobcity
    'bc': {
        'csv_in': '../data/database/blobcity_02_analyzed.csv',
        'csv_format': 'github',
    },
}

    
for key in transform:
    print(key)
    item = datasets[key]
    transform_loop(item['csv_in'], item['csv_format'], key+'/', overwrite=False)

tcp
100
DONE parsed 172 items


In [33]:
# zero-shot categorization is computationally intensive,
# so let's keep it out of the loop and process it separately
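
The `categorize` function used below is defined elsewhere in the notebook; from the commented example in the transform loop it returns a dict with `category`, `score` and `runtime`. The following is only a sketch of a wrapper with that return shape, not the notebook's implementation. It assumes a classifier callable that returns `labels` and `scores` sorted best-first, which is the output format of a Hugging Face zero-shot-classification pipeline. The names `categorize_sketch` and `stub_classifier` are hypothetical.

```python
import time


def categorize_sketch(text, labels, classifier):
    """Classify text against candidate labels and time the call.

    classifier is any callable (text, labels) -> {'labels': [...], 'scores': [...]}
    with both lists sorted best-first (assumed interface).
    """
    start = time.time()
    res = classifier(text, labels)
    runtime = round(time.time() - start, 3)
    return {
        'category': res['labels'][0],
        'score': res['scores'][0],
        'runtime': runtime,
    }


def stub_classifier(text, labels):
    """Hypothetical stand-in for a model: scores 1.0 if the label occurs in the text."""
    scores = [1.0 if label.lower() in text.lower() else 0.0 for label in labels]
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return {
        'labels': [label for label, _ in ranked],
        'scores': [score for _, score in ranked],
    }


res = categorize_sketch('A finance dashboard', ['Healthcare', 'Finance'], stub_classifier)
```

The stub allows a dry run of the surrounding loop without loading a model. Since each call to a real model can take several seconds per label list, keeping this step in its own pass means the cheaper summarization results are already stored if the run is interrupted.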

In [40]:
print(cat)
print(subcat)

['Accommodation & Food', 'Accounting', 'Agriculture', 'Banking & Insurance', 'Biotechnological & Life Sciences', 'Construction & Engineering', 'Economics', 'Education & Research', 'Emergency & Relief', 'Finance', 'Government and Public Works', 'Healthcare', 'Justice, Law and Regulations', 'Manufacturing', 'Media & Publishing', 'Miscellaneous', 'Physics', 'Real Estate, Rental & Leasing', 'Utilities', 'Wholesale & Retail']
['Failure', 'Food', 'Fraud', 'General', 'Genomics', 'Insurance and Risk', 'Judicial Applied', 'Life-sciences', 'Machine Learning', 'Maintenance', 'Management and Operations', 'Marketing', 'Material Science', 'Physical', 'Policy and Regulatory', 'Politics', 'Preventative and Reactive', 'Quality', 'Real Estate', 'Rental & Leasing', 'Restaurant', 'Retail', 'School', 'Sequencing', 'Social Policies', 'Student', 'Textual Analysis', 'Tools', 'Tourism', 'Trading & Investment', 'Transportation', 'Valuation', 'Water & Pollution', 'Wholesale']


In [57]:
# paths
folder = '../data/database/json/'
subfolder = os.listdir(folder)
#print(subfolder)

transform = ['ka_c', 'ka_cn', 'ka_d', 'ka_dn', 'ma', 'gh', 'tcp', 'bc']
#transform = ['ka_c', 'ka_cn', 'ka_dn']
#transform = ['ka_d', 'ma', 'gh', 'tcp', 'bc']
transform = ['ma', 'gh', 'bc', 'tcp']
transform = ['tcp']

recreate_category = False
save = True
categorize_t5 = False
categorize_nltk = True
categorize_fallback = True

quit = 0
i = j = 0
for item in subfolder:
    print('folder', item)
    fp = os.path.join(folder, item)
    if os.path.isdir(fp) and item in transform:
        print('###')
        print(item)
        files = os.listdir(fp)
        print('files in folder:', len(files))
        for file in files:
            row = load_data(os.path.join(folder, item, file), fromJson=True)
            #print(row)

            print('row:', i, 'item:', j, 'link:', row['link'], 'file:', file)

            # zero shot categorization
            if not 'category' in row or row.get('category') == '' or recreate_category == True:
                print('categorize')
                start = time.time()
                j += 1

                # defaults, so a row where every branch is skipped does not
                # inherit the previous row's values (or raise a NameError)
                c = sc = ''
                c_score = sc_score = 0

                # create category and subcategory from t5
                if 'sum_t5' in row and row['sum_t5'] != '' and categorize_t5 == True:
                    s = row['sum_t5']
                    res = categorize(s, cat)
                    #row['t5_category_raw'] = res
                    c = row['t5_category'] = res['category']
                    c_score = row['t5_category_score'] = res['score']
                    row['t5_category_runtime'] = res['runtime']
                    print('t5 category', res['runtime'], 'sec')

                    res = categorize(s, subcat)
                    #row['t5_subcategory_raw'] = res
                    sc = row['t5_subcategory'] = res['category']
                    sc_score = row['t5_subcategory_score'] = res['score']
                    row['t5_subcategory_runtime'] = res['runtime']
                    print('t5 subcategory', res['runtime'], 'sec')
                else:
                    print('t5 skipped')

                # create category and subcategory from nltk
                if 'sum_nltk' in row and row['sum_nltk'] != '' and categorize_nltk == True:
                    s = row['sum_nltk']
                    res = categorize(s, cat)
                    #print(res)
                    #row['nltk_category_raw'] = res
                    c = row['nltk_category'] = res['category']
                    c_score = row['nltk_category_score'] = res['score']
                    row['nltk_category_runtime'] = res['runtime']
                    print('nltk category', res['runtime'], 'sec')

                    res = categorize(s, subcat)
                    #print(res)
                    #row['nltk_subcategory_raw'] = res
                    sc = row['nltk_subcategory'] = res['category']
                    sc_score = row['nltk_subcategory_score'] = res['score']
                    row['nltk_subcategory_runtime'] = res['runtime']
                    print('nltk subcategory', res['runtime'], 'sec')
                else:
                    print('nltk skipped')

                # create category and subcategory from title or description if not already done
                if categorize_fallback == True and not 't5_category' in row and not 'nltk_category' in row:
                    if len(row['description']) > 0:
                        s = row['description']
                        res = categorize(s, cat)
                        #row['description_category_raw'] = res
                        c = row['description_category'] = res['category']
                        c_score = row['description_category_score'] = res['score']
                        row['description_category_runtime'] = res['runtime']
                        print('description category', res['runtime'], 'sec')

                        res = categorize(s, subcat)
                        #row['description_subcategory_raw'] = res
                        sc = row['description_subcategory'] = res['category']
                        sc_score = row['description_subcategory_score'] = res['score']
                        row['description_subcategory_runtime'] = res['runtime']
                        print('description subcategory', res['runtime'], 'sec')
                    else:
                        s = row['title']
                        if s != '':
                            res = categorize(s, cat)
                            #row['title_category_raw'] = res
                            c = row['title_category'] = res['category']
                            c_score = row['title_category_score'] = res['score']
                            row['title_category_runtime'] = res['runtime']
                            print('title category', res['runtime'], 'sec')

                            res = categorize(s, subcat)
                            #row['title_subcategory_raw'] = res
                            sc = row['title_subcategory'] = res['category']
                            sc_score = row['title_subcategory_score'] = res['score']
                            row['title_subcategory_runtime'] = res['runtime']
                            print('title subcategory', res['runtime'], 'sec')
                        else:
                            print('nothing found to categorize')
                            c = sc = ''
                            c_score = sc_score = 0
                            j -= 1

                row['category'] = c
                row['category_score'] = c_score
                row['subcategory'] = sc
                row['subcategory_score'] = sc_score

                end = time.time()
                dur = round(end-start, 3)
                row['runtime_cat'] = dur
                
                fp = os.path.join(folder, item, file)
                if save == True:
                    store_data(row, fp, toJson=True)
                else:
                    print('NOT SAVED')
                    print(row)
            
            i += 1
            
            if i%100 == 0:
                print(i)
            
            if quit != 0 and i >= quit:
                break
    if quit != 0 and i >= quit:
        break
            
print('DONE parsed', i, 'items')

folder bc
###
bc
files in folder: 541
row: 0 item: 0 link: https://github.com/HazyResearch/cs145-notebooks-2016 file: 003eb1b987f35ed6d88b1c7930e6057f.json
row: 1 item: 0 link: https://github.com/marsggbo/deeplearning.ai_JupyterNotebooks file: 004a0f102a03862a74da782e6a796f80.json
categorize
t5 skipped
nltk skipped
nothing found to categorize
row: 2 item: 0 link: https://github.com/davidbp/lxmls-notebooks file: 004a915abfded4001d49385a09f77f19.json
categorize
t5 skipped
nltk skipped
nothing found to categorize
row: 3 item: 0 link: https://github.com/jamesdbrock/learn-you-a-haskell-notebook file: 0065cbc76fd0832c7b3fc8ef5baed767.json
row: 4 item: 0 link: https://github.com/ageron/tf2_course file: 00708ba1c3a32ce45749ebcb633e8c5a.json
row: 5 item: 0 link: https://github.com/InsightDataLabs/ipython-notebooks file: 00b2db9f2524a95318f16eed6e605dd0.json
categorize
t5 skipped
nltk skipped
nothing found to categorize
row: 6 item: 0 link: https://github.com/trnkatomas/Keras_2_examples file: 00

row: 89 item: 0 link: https://github.com/pierrelux/notebooks file: 2bd2b9246e8afc17df686280401bde4f.json
row: 90 item: 0 link: https://github.com/vschaik/Conjugate-Gradient file: 2be5402f61f59d52920d60124bc27e2b.json
row: 91 item: 0 link: https://github.com/ipython-contrib/jupyter_contrib_nbextensions file: 2c8a9200d76331f52b7ee03e829c0bfe.json
row: 92 item: 0 link: https://github.com/aws-samples/amazon-forecast-samples file: 2d635a7fdc381e82e79a486faea065a0.json
row: 93 item: 0 link: https://github.com/thomas-haslwanter/statsintro_python file: 2e93fdabb5353fe6b5053fbe08eb9682.json
row: 94 item: 0 link: https://github.com/Einsteinish/Artificial-Neural-Networks-with-Jupyter file: 2ee5222362e5e949457be96d87d64d05.json
row: 95 item: 0 link: https://github.com/leonvanbokhorst/NoteBooks-Statistics-and-MachineLearning file: 3096309d452967085ff3baa45b1e325c.json
row: 96 item: 0 link: https://github.com/pmservice/ai-openscale-tutorials file: 32206235ac9b94eda534897db0fc70ab.json
row: 97 item: 

row: 183 item: 0 link: https://github.com/elyra-ai/elyra file: 61b9764d6a34b9cbf083abe5dd535960.json
row: 184 item: 0 link: https://github.com/dmonn/dcgan-oreilly file: 61e67e926ac854750dbf06e79dbfd119.json
row: 185 item: 0 link: https://github.com/tiagoantao/biopython-notebook file: 6256e6cea9d18c129797dbceedb89074.json
row: 186 item: 0 link: https://github.com/twosigma/beakerx file: 6303e340f5c1f15347d62b53e1aab7f1.json
categorize
t5 skipped
nltk skipped
nothing found to categorize
row: 187 item: 0 link: https://github.com/hadrienj/Preprocessing-for-deep-learning file: 63d5b229d3e2c621dd5c0fe24a9c8ba3.json
row: 188 item: 0 link: https://github.com/tirthajyoti/Machine-Learning-with-Python file: 640e7bcf4a2f04b94508b857b7024240.json
row: 189 item: 0 link: https://github.com/fastai/nbdev file: 644373424f76155bd06391ca65987779.json
row: 190 item: 0 link: https://github.com/dannguyen/python-notebooks-data-wrangling file: 6463113fd86e4281bc02ed1b1bbaee94.json
row: 191 item: 0 link: https:/

row: 279 item: 0 link: https://github.com/ydixon/yolo_v3 file: 88dcabc2d705e07225d3f09f2acee4bc.json
row: 280 item: 0 link: https://github.com/AiswaryaSrinivas/DataScienceWithPython file: 898f7251a69aa7a436e7aaf8a528fea5.json
row: 281 item: 0 link: https://github.com/Morisset/Python-lectures-Notebooks file: 8a456dfb4d985e4b1c36425a08af05f6.json
row: 282 item: 0 link: https://github.com/parasharshah/automl-handson file: 8aa7623bc6500ade4a6702b32b1ee902.json
row: 283 item: 0 link: https://github.com/probcomp/notebook file: 8bb9631f7bbebe9c7cf6f02aeeb2b8a9.json
row: 284 item: 0 link: https://github.com/Naereen/notebooks file: 8c0df78183d1579f44aec994cf157912.json
row: 285 item: 0 link: https://github.com/giswqs/Learning-Python file: 8c4758e2b1011fde90e00a13b8d38f86.json
row: 286 item: 0 link: https://github.com/Mjrovai/Python4DS file: 8c881ca96e46e6230c8221b70774b883.json
row: 287 item: 0 link: https://github.com/yhilpisch/dawp file: 8d9be9e160a22b40c70a5d8379d7f289.json
row: 288 item: 0 

row: 376 item: 0 link: https://github.com/PYFTS/notebooks file: b426d295770cdc6eed4ec4aa33f91c18.json
row: 377 item: 0 link: https://github.com/araffin/rl-tutorial-jnrr19 file: b4410e56f75c0b6981733c8b469b520a.json
row: 378 item: 0 link: https://github.com/jrjohansson/scientific-python-lectures file: b4a2531fd1781fc61fed09a5c7bdc365.json
row: 379 item: 0 link: https://github.com/ageron/julia_notebooks file: b4a254b391c1aae26a546dc9d9fbc0ae.json
row: 380 item: 0 link: https://github.com/ianmcloughlin/jupyter-teaching-notebooks file: b5d106d2e546ceaa66e76952ae7e991b.json
categorize
t5 skipped
nltk skipped
nothing found to categorize
row: 381 item: 0 link: https://github.com/ravichaubey/Kaggle-Notebooks file: b6273f9e0ba8136a5df3c7c979efb6ca.json
row: 382 item: 0 link: https://github.com/altair-viz/altair-tutorial file: b687b6823787f9362862316a0c269d53.json
row: 383 item: 0 link: https://github.com/alexcfleming/Python_DL_Working_Notebooks file: b6fd94a64508d354158d2048b00afca7.json
row: 3

row: 480 item: 0 link: https://github.com/stitchfix/Algorithms-Notebooks file: e263bab442096112fd7a6cbe92c95255.json
row: 481 item: 0 link: https://github.com/lmarti/evolutionary-computation-course file: e2a797e8fb7c61fa91874502d6ea6ea0.json
row: 482 item: 0 link: https://github.com/spark-notebook/spark-notebook file: e2d93ee4fd7f4b30b5cedadb1d2213a4.json
row: 483 item: 0 link: https://github.com/jmportilla/Udemy---Machine-Learning file: e3357afdcdd0d9f4b32d32b6eadc7e02.json
row: 484 item: 0 link: https://github.com/InsightLab/data-science-cookbook file: e372d609354745da1087db7a46d8fac7.json
row: 485 item: 0 link: https://github.com/krishnamrith12/NotebooksNLP file: e4556cda44e8e41a35d811e67a3b43c7.json
row: 486 item: 0 link: https://github.com/d2l-ai/d2l-en file: e49346a9d9bb4ca5bdc370e1da0ea491.json
row: 487 item: 0 link: https://github.com/PlatformStories/notebooks file: e4e40dee51890e7748b86d1f3e7a6772.json
row: 488 item: 0 link: https://github.com/Azure-Samples/cosmos-notebooks fi

row: 582 item: 0 link: https://github.com/Data4Democracy/crash-model file: 1918bd04a682a7f38cae03d61b39f31c.json
row: 583 item: 0 link: https://github.com/borisbanushev/stockpredictionai/ file: 1937bce4d96726e909f06bc0fea420d8.json
row: 584 item: 0 link: https://github.com/zischwartz/workerfatalities file: 1951ebbca315e85545d0e4e80b2d85bd.json
row: 585 item: 0 link: https://github.com/dchannah/fraudhacker file: 1add6748daa71e1f63344d63c75d9b19.json
row: 586 item: 0 link: https://github.com/Brett777/Predict-Risk file: 1b554588280edc7c23f0432236034dd2.json
row: 587 item: 0 link: https://github.com/mitmedialab/Evolutron file: 1d0ce92e867528164c0a193fe91b692b.json
row: 588 item: 0 link: https://github.com/Murgio/Food-Recipe-CNN file: 1da3e14643ee421ba18d0e2feaa31c2b.json
row: 589 item: 0 link: https://github.com/google/deepvariant file: 1dd7057aff7405ee48a0726f8fce225c.json
row: 590 item: 0 link: https://github.com/gablg1/ORGAN file: 1e3e4e2f40856231b7c4b61af2793a35.json
row: 591 item: 0 l

row: 676 item: 0 link: https://github.com/mesgarpour/T-CARER file: 4f313c6420324cd024542aaf808497ce.json
row: 677 item: 0 link: https://github.com/dariusmehri/Social-Network-Analysis-to-Expose-Corruption file: 4fdedd696217195bc77a26dcec65676c.json
row: 678 item: 0 link: https://github.com/williamadams1/natural-gas-consumption-forecasting file: 502f0efd3b0655ce4af0e6311104a557.json
row: 679 item: 0 link: https://github.com/MengchuanFu/Suspecious-Apps-Detection file: 51695ebbeeb18fb7ff1e5a7e4b37dc8a.json
row: 680 item: 0 link: https://github.com/adrianakopf/NJPublicSchools file: 51d407e075b40b74f1a44d0de0f386a6.json
row: 681 item: 0 link: https://github.com/CharlesHoffmanCPA/charleshoffmanCPA.github.io file: 520e0438eec5bf9d073136141df93b3a.json
row: 682 item: 0 link: https://github.com/pzivich/Python-for-Epidemiologists file: 5259b9d5282601e0cdb0e2bdd2486cd4.json
row: 683 item: 0 link: https://github.com/DLColumbia/DL_forFinance file: 5344722521a1625cb892812296c8c472.json
row: 684 item:

row: 782 item: 0 link: https://github.com/anshu3769/FirmEmbeddings file: 9099dee87b8011e8792d7dbabf98b9cc.json
row: 783 item: 0 link: https://github.com/SeanMcOwen/FinanceAndPython.com-BasicFinance file: 90eed86a383d801c685a6912682fff57.json
row: 784 item: 0 link: https://github.com/dariusmehri/Geo-Spatial-Risk-Analysis-of-Inspectors file: 937fb8260299e129987734cd13f30ba2.json
row: 785 item: 0 link: https://github.com/nus-usp/room-allocation file: 938d4e6bcf878775bb1f46dd585a758b.json
row: 786 item: 0 link: https://github.com/bradleyrobinson/School-Performance file: 93b0f41a38827c9bc159e4dedd8c983e.json
row: 787 item: 0 link: https://github.com/SiddheshAcharekar/Liveright file: 93d434bff74868a768d6465e838a6c48.json
row: 788 item: 0 link: https://github.com/akarazeev/LegalTech file: 93f7fd480c4b77ff0aa9b1d6ba2fe59c.json
row: 789 item: 0 link: https://github.com/davidmasse/US-supreme-court-prediction file: 9517ced35400a34581aea80e629f8e85.json
row: 790 item: 0 link: https://github.com/ta

row: 892 item: 0 link: https://github.com/A7med01/Deep-learning-for-Animal-Identification file: d29ca9c49a3e4d320b763f60c90dd007.json
row: 893 item: 0 link: https://github.com/t-davidson/hate-speech-and-offensive-language file: d337fe0e869caec1fcd254047c96f59e.json
row: 894 item: 0 link: https://github.com/hallba/WritingSimulators file: d494a8417cf0f9056000fb073c394183.json
row: 895 item: 0 link: https://github.com/xiaofei6677/TourismFlickrMiner file: d4a4bf5ea463d9eac1098ca36a2d9a38.json
row: 896 item: 0 link: https://github.com/mohan-mj/Manufacturing-Line-I4.0 file: d4ee62f4b30f80f805dff011189dd625.json
row: 897 item: 0 link: https://github.com/ehsanasgari/Deep-Proteomics file: d4fa78b1df6657e1d07fe270b58bdd64.json
row: 898 item: 0 link: https://github.com/bzjin/menus file: d555c964d10b526eac5d6461d78c83c4.json
row: 899 item: 0 link: https://github.com/krpiyush5/Amazon-Fine-Food-Review file: d55f9a40f67ca0b67ae0e727d2096379.json
900
row: 900 item: 0 link: https://github.com/toningega

row: 1000 item: 0 link: https://mlart.co/item/colorize-bandw-picture-of-detroit-with-a-pix2pix-u-net file: 103959987650b777c9d6d78b48e1f980.json
row: 1001 item: 0 link: https://mlart.co/item/visualizing-sound-by-interpolating-a-gan-trained-on-paintings file: 10426bc90d1efeb9c578b4c84aebbcc5.json
row: 1002 item: 0 link: https://mlart.co/item/a-user-is-challenged-to-draw-an-object-and-a-cnn-model-tells-if-you-are-correct file: 1046193e4f20e290d0416dd3feb5537f.json
row: 1003 item: 0 link: https://mlart.co/item/an-lstm-trained-on-350k-haikus_-the-generated-haiku-is-then-translated-to-an-image-with-attngan-and-then-painted file: 10aee40876bdfc35581704a8ddfb00e6.json
row: 1004 item: 0 link: https://mlart.co/item/a-conditional-gan-trained-on-topographical-data-and-satellite-imagery-of-earth-to-terraform-mars-topographic-data file: 1118d8d21b365c240d5efc8ecc767d67.json
row: 1005 item: 0 link: https://mlart.co/item/translate-a-live-face-to-face-trackers-and-turn-them-into-a-another-person-with-

row: 1114 item: 0 link: https://mlart.co/item/classify-voice-with-an-rnn-and-match-it-with-doodles file: 59dda5234b6e8d4d34fb604922d32078.json
row: 1115 item: 0 link: https://mlart.co/item/ai-doodles-in-mr_doob_s-multi-user-drawing-tool file: 5d065c81a22c44090535841c2cc9a366.json
row: 1116 item: 0 link: https://mlart.co/item/a-generative-audio-mashup-tool-using-memory-mosaic file: 5df87e5f04d27914f317854c99e2b4b1.json
row: 1117 item: 0 link: https://mlart.co/item/t-sne-exploration-in-vr file: 5e8b149926068620e68683301ab31842.json
row: 1118 item: 0 link: https://mlart.co/item/visualize-how-variables-in-a-music-vae-are-updated-on-a-2d-surface file: 5eeebe196c3fd1c682e0b77c5d44aa52.json
row: 1119 item: 0 link: https://mlart.co/item/an-album-with-songs-generated-with-magenta_s-musicvae-model-and-lyrics-made-by-an-lstm-trained-by-ross-goodwin file: 60a3cb9f81fdac5332f8d520c3532dc8.json
row: 1120 item: 0 link: https://mlart.co/item/gan-generated-images-of-edo-woodblock-art file: 60de298b4df0

row: 1229 item: 0 link: https://mlart.co/item/train-gan-on-maps-and-project-on-naked-bodies file: ab142601aae06c8ab08b98862820c83f.json
row: 1230 item: 0 link: https://mlart.co/item/transition-between-artworks-by-continuously-applying-style-transfer file: ab1d8df6565ca858de1c1e8b09aab2fb.json
row: 1231 item: 0 link: https://mlart.co/item/generative-vector-doodles-with-an-rnn file: ac3c7ed4cc6ec86569833c2e591e7fe0.json
row: 1232 item: 0 link: https://mlart.co/item/use-gaugan-with-sliders-and-segment-animations_-adjusting-the-landscape-to-raise-awareness-of-climate-change file: acbdbe1709326dbe2fd05c21181cc3f4.json
row: 1233 item: 0 link: https://mlart.co/item/use-k-means-to-segment-and-colorize-an-image_-inspired-by-warhol file: accdc5f647938a8e10f7e8b137ed57fc.json
row: 1234 item: 0 link: https://mlart.co/item/apply-slow-motion-and-fluid-footage-and-optical-flow-based-style-transfer-on-paradis_s-recto-verso file: ad5fe65a4d845c0e632ae4703008e394.json
row: 1235 item: 0 link: https://mla

row: 1349 item: 0 link: https://mlart.co/item/organise-personal-art-collection-with-t-sne_-compose-music-according-to-each-clusters_-and-detect-each-clusters-with-a-camera-to-play-corresponding-music file: f8d3068d6d1ec55bf1cab6c7b42bf48b.json
row: 1350 item: 0 link: https://mlart.co/item/a-next-frame-prediction-using-pix2pix-feeding-the-newly-generated-image-repeatedly file: f98a22969a917f54a310e63adb3b19b8.json
row: 1351 item: 0 link: https://mlart.co/item/a-gpt-model-trained-on-_30k-font-bios-to-generate-descriptions-of-speculative-fonts file: fa465b4098dca8e5975fa4c431ec3dc8.json
row: 1352 item: 0 link: https://mlart.co/item/generates-a-3d-t-sne-map-to-visualize-high-dimensional-data file: fa802015f07bd2c54b19da7ad9b5169a.json
row: 1353 item: 0 link: https://mlart.co/item/create-a-text-model_-potentially-using-markov-chains_-based-on-the-debate-between-michel-foucault-and-noam-chomsky-and-performed-as-a-theatre-piece file: fac28946aae90f2624677b35fa6ebbd9.json
row: 1354 item: 0 lin

KeyboardInterrupt: 
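
The categorization cell above chooses its input text in priority order: the T5 summary, then the NLTK summary, then the description, and finally the title. The sketch below condenses that order into one function. It is a simplified variant, not the notebook's logic: it returns a single source, whereas the cell can run both the T5 and the NLTK branch and keeps the result of the later one. The name `pick_text` and its flags are hypothetical.

```python
def pick_text(row, use_t5=False, use_nltk=True, use_fallback=True):
    """Return (source, text) to categorize, or (None, '') if nothing is usable."""
    if use_t5 and row.get('sum_t5'):
        return 't5', row['sum_t5']
    if use_nltk and row.get('sum_nltk'):
        return 'nltk', row['sum_nltk']
    if use_fallback:
        # fall back to the raw description, then to the title
        if row.get('description'):
            return 'description', row['description']
        if row.get('title'):
            return 'title', row['title']
    return None, ''


source, text = pick_text({'sum_nltk': 'short summary', 'description': 'long text', 'title': 't'})
```

With a single selection step, the four near-identical branches for writing `<source>_category` and `<source>_subcategory` could share one code path keyed on `source`.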