# NewsBERT on Colab

This notebook follows an excellent [Streamlit tutorial](https://youtu.be/x0NdZkaciws) to run NewsBERT on Colab.

To run this code you will need to set up an [ngrok](https://ngrok.com/) account.

In [None]:
!pip install streamlit pyngrok
!pip install git+https://github.com/lambdaofgod/pytorch_hackathon/

In [None]:
!pip install torch==1.5.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.5.1+cu101
  Downloading https://download.pytorch.org/whl/cu101/torch-1.5.1%2Bcu101-cp36-cp36m-linux_x86_64.whl (704.4MB)
     |████████████████████████████████| 704.4MB 24kB/s 
ERROR: torchvision 0.7.0+cu101 has requirement torch==1.6.0, but you'll have torch 1.5.1+cu101 which is incompatible.
Installing collected packages: torch
  Found existing installation: torch 1.5.1
    Uninstalling torch-1.5.1:
      Successfully uninstalled torch-1.5.1
Successfully installed torch-1.5.1+cu101


In [None]:
import torch
torch.cuda.is_available()

True

In [None]:
%%writefile app.py
import streamlit as st
import pandas as pd
import numpy as np
import tqdm
import os
from operator import itemgetter

import torch
from pytorch_hackathon import rss_feeds, zero_shot_learning, haystack_search
import seaborn as sns

st.title('Zero-shot RSS feed article classifier')

cm = sns.light_palette("green", as_cmap=True)
topic_strings = list(pd.read_table('https://raw.githubusercontent.com/lambdaofgod/pytorch_hackathon/master/data/topics.txt', header=None).iloc[:,0].values)
rss_feed_urls = list(pd.read_table('https://raw.githubusercontent.com/lambdaofgod/pytorch_hackathon/master/data/feeds.txt', header=None).iloc[:,0].values)
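# NOTE: this replaces the list loaded from feeds.txt above with the feed list shipped in the package
rss_feed_urls = rss_feeds.rss_feed_urls.copy()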


model_device = st.selectbox("Model device", ["cpu", "cuda"], index=int(torch.cuda.is_available()))


@st.cache(allow_output_mutation=True)
def get_feed_df():
    with st.spinner('Retrieving articles from feeds...'):
        return rss_feeds.get_feed_df(rss_feed_urls)


feed_df = get_feed_df()


@st.cache(allow_output_mutation=True)
def setup_searcher(feed_df, use_gpu, model_name="deepset/sentence_bert"):
    with st.spinner('No precomputed topics found, running zero-shot learning...'):
        searcher = haystack_search.Searcher(model_name, 'text', use_gpu=use_gpu)
        searcher.add_texts(feed_df)
    return searcher 


# setup_searcher is cached, so embeddings are recomputed only when feed_df or the chosen device changes
searcher = setup_searcher(feed_df, use_gpu=model_device == 'cuda')


@st.cache
def get_retrieved_df(topic_strings):
    results = [
        result 
        for topic in topic_strings
        for result in searcher.retriever.retrieve(
            "text is about {}".format(topic)
        )
    ]
    return searcher.get_topic_score_df(
        results,
        topic_strings
    ).drop_duplicates(subset='title')
    

selected_df = get_retrieved_df(topic_strings).reset_index(drop=True)
selected_df['text'] = selected_df['text'].apply(lambda s: s[:1000])
topics = st.multiselect('Choose topics', topic_strings, default=[topic_strings[0]])
sort_by = st.selectbox("Sort by", topics)
display_df = selected_df[selected_df[topics].min(axis=1) > 0.5].sort_values(sort_by, ascending=False)

st.markdown('## Articles on {}'.format(', '.join(topics)))

# display_df is already filtered by the 0.5 threshold above, so it can be rendered directly
st.table(display_df.style.background_gradient(cmap=cm))

Overwriting app.py
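
The filter at the end of `app.py` keeps an article only when its lowest score across the selected topics is above 0.5, then sorts by one chosen topic. Below is a minimal pure-Python sketch of that rule. The topic names and scores are made up for illustration, and `filter_and_sort` is a hypothetical helper, not part of `pytorch_hackathon`.

```python
from typing import Dict, List, Union

Row = Dict[str, Union[str, float]]


def filter_and_sort(rows: List[Row], topics: List[str], sort_by: str,
                    threshold: float = 0.5) -> List[Row]:
    """Keep rows whose minimum score over `topics` exceeds `threshold`,
    then sort them by `sort_by` in descending order."""
    if not topics:
        # With no topics selected there is nothing to filter on.
        return []
    kept = [row for row in rows if min(row[t] for t in topics) > threshold]
    return sorted(kept, key=lambda row: row[sort_by], reverse=True)


# Illustrative scores only; real values come from the zero-shot model.
rows = [
    {"title": "A", "sports": 0.9, "politics": 0.7},
    {"title": "B", "sports": 0.8, "politics": 0.2},
    {"title": "C", "sports": 0.6, "politics": 0.95},
]

result = filter_and_sort(rows, ["sports", "politics"], sort_by="politics")
print([row["title"] for row in result])  # ['C', 'A']
```

Using the minimum means that selecting more topics narrows the results: an article has to score well on every selected topic, not just one of them.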


## Checking if GPU is available

Running this on a GPU is much faster, mainly because the article embeddings are computed much more quickly.

In [None]:
import torch
torch.cuda.is_available()

True
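
The result of this check also decides the default entry of the "Model device" select box in `app.py`, which passes `index=int(torch.cuda.is_available())` for the options `["cpu", "cuda"]`. The sketch below shows that index trick in isolation. It takes the availability flag as a plain boolean so it runs without PyTorch, and `default_device` is a hypothetical helper name.

```python
DEVICES = ["cpu", "cuda"]


def default_device(cuda_available: bool) -> str:
    """Pick the default device name.

    int(False) == 0 selects "cpu" and int(True) == 1 selects "cuda",
    which is the same indexing app.py uses for its select box.
    """
    return DEVICES[int(cuda_available)]


# In the notebook you would pass torch.cuda.is_available() instead of a literal.
print(default_device(True))   # cuda
print(default_device(False))  # cpu
```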

In [None]:
from pyngrok import ngrok

## ngrok token

Paste your ngrok authtoken into the form field below. You can find it [here](https://dashboard.ngrok.com/auth/your-authtoken) once you have an ngrok account.

In [None]:
#@title ngrok token

token = '' #@param {type:"string"}
!ngrok authtoken $token

Authtoken saved to configuration file: /root/.ngrok2/ngrok.yml
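
If you would rather not leave the token in a saved notebook, one option is to fall back to an environment variable when the form field is empty. This is only a sketch: `resolve_token` is a hypothetical helper, and the variable name `NGROK_AUTHTOKEN` is an assumption of this example, not something the notebook relies on.

```python
import os
from typing import Mapping, Optional


def resolve_token(form_value: str,
                  env: Optional[Mapping[str, str]] = None,
                  env_var: str = "NGROK_AUTHTOKEN") -> str:
    """Return the form value if set, otherwise the environment variable.

    `env_var` is an assumed name chosen for this sketch.
    """
    env = os.environ if env is None else env
    token = form_value.strip() or env.get(env_var, "").strip()
    if not token:
        raise ValueError("No ngrok authtoken: fill in the form or set " + env_var)
    return token


# Placeholder values, not real tokens.
print(resolve_token("  token-from-form  ", env={}))               # token-from-form
print(resolve_token("", env={"NGROK_AUTHTOKEN": "token-from-env"}))  # token-from-env
```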


In [None]:
!streamlit run app.py --server.port 8503 &>log&

In [None]:
!cat log


  You can now view your Streamlit app in your browser.

  Network URL: http://172.28.0.2:8503
  External URL: http://34.87.86.86:8503

2020-08-16 18:02:32.817699: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
  import pandas.util.testing as tm
  0%|          | 0/16 [00:00<?, ?it/s]  6%|▋         | 1/16 [00:03<00:55,  3.70s/it] 12%|█▎        | 2/16 [00:04<00:39,  2.81s/it] 19%|█▉        | 3/16 [00:04<00:27,  2.13s/it] 25%|██▌       | 4/16 [00:05<00:21,  1.80s/it] 31%|███▏      | 5/16 [00:06<00:15,  1.37s/it] 38%|███▊      | 6/16 [00:06<00:10,  1.05s/it] 44%|████▍     | 7/16 [00:07<00:07,  1.20it/s] 50%|█████     | 8/16 [00:07<00:05,  1.40it/s] 56%|█████▋    | 9/16 [00:08<00:05,  1.32it/s] 62%|██████▎   | 10/16 [00:09<00:04,  1.28it/s] 69%|██████▉   | 11/16 [00:10<00:04,  1.18it/s] 75%|███████▌  | 12/16 [00:10<00:03,  1.24it/s] 81%|████████▏ | 13/16 [00:11<00:02,  1.17it/s] 88%|████████▊ | 14/16 [00:1
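
The port in this log is the one ngrok has to tunnel to, and it can differ between runs if Streamlit is not started on a fixed port. As a minimal sketch, the port can be read from the log text with a regular expression. `find_streamlit_port` is a hypothetical helper, and the sample text is copied from the output above.

```python
import re
from typing import Optional


def find_streamlit_port(log_text: str) -> Optional[int]:
    """Return the port from the 'Network URL' line, or None if it is absent."""
    match = re.search(r"Network URL:\s*http://[^\s:/]+:(\d+)", log_text)
    return int(match.group(1)) if match else None


sample_log = """
  You can now view your Streamlit app in your browser.

  Network URL: http://172.28.0.2:8503
  External URL: http://34.87.86.86:8503
"""

print(find_streamlit_port(sample_log))  # 8503
```

In the notebook, the text could be read with `open('log').read()` once Streamlit has had a few seconds to start; a `None` result then means the URL lines have not been written yet.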

In [None]:
# the port must match the one Streamlit reports in the log above
publ_url = ngrok.connect(port='8503')

In [None]:
publ_url

'http://635eda040288.ngrok.io'