# NewsBERT on Colab

This notebook follows an excellent [streamlit tutorial](https://youtu.be/x0NdZkaciws) to run NewsBERT on Colab.

To run this code you will need to set up an [ngrok](https://ngrok.com/) account.

In [None]:
!pip install streamlit pyngrok
!pip install git+https://github.com/lambdaofgod/pytorch_hackathon/

In [None]:
!pip install torch==1.5.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html

In [None]:
import torch
torch.cuda.is_available()

True

In [None]:
%%writefile app.py
import streamlit as st
import pandas as pd
import numpy as np
import tqdm
import os
from operator import itemgetter

import torch
from pytorch_hackathon import rss_feeds, zero_shot_learning, haystack_search
import seaborn as sns

st.title('Zero-shot RSS feed article classifier')

cm = sns.light_palette("green", as_cmap=True)
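# Topic names and RSS feed URLs are read from the first column of text files in the repository.
topic_strings = list(pd.read_table('https://raw.githubusercontent.com/lambdaofgod/pytorch_hackathon/master/data/topics.txt', header=None).iloc[:,0].values)
rss_feed_urls = list(pd.read_table('https://raw.githubusercontent.com/lambdaofgod/pytorch_hackathon/master/data/feeds.txt', header=None).iloc[:,0].values)
# NOTE: this replaces the URLs loaded from feeds.txt above with the feed list bundled in the package.
rss_feed_urls = rss_feeds.rss_feed_urls.copy()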


model_device = st.selectbox("Model device", ["cpu", "cuda"], index=int(torch.cuda.is_available()))


@st.cache(allow_output_mutation=True)
def get_feed_df():
    with st.spinner('Retrieving articles from feeds...'):
        return rss_feeds.get_feed_df(rss_feed_urls)


feed_df = get_feed_df()


@st.cache(allow_output_mutation=True)
def setup_searcher(feed_df, use_gpu, model_name="deepset/sentence_bert"):
    with st.spinner('No precomputed topics found, running zero-shot learning...'):
        searcher = haystack_search.Searcher(model_name, 'text', use_gpu=use_gpu)
        searcher.add_texts(feed_df)
    return searcher 


# setup_searcher is cached, so streamlit does not recompute embeddings on every rerun
# as long as feed_df and use_gpu are unchanged
searcher = setup_searcher(feed_df, use_gpu=model_device == 'cuda')


@st.cache
def get_retrieved_df(topic_strings):
    results = [
        result 
        for topic in topic_strings
        for result in searcher.retriever.retrieve(
            "text is about {}".format(topic)
        )
    ]
    return searcher.get_topic_score_df(
        results,
        topic_strings
    ).drop_duplicates(subset='title')
    

selected_df = get_retrieved_df(topic_strings).reset_index(drop=True)
selected_df['text'] = selected_df['text'].apply(lambda s: s[:1000])
topics = st.multiselect('Choose topics', topic_strings, default=[topic_strings[0]])
sort_by = st.selectbox("Sort by", topics)
display_df = selected_df[selected_df[topics].min(axis=1) > 0.5].sort_values(sort_by, ascending=False)

st.markdown('## Articles on {}'.format(', '.join(topics)))

# display_df is already filtered by the 0.5 score threshold above
st.table(display_df.style.background_gradient(cmap=cm))

Overwriting app.py


In [None]:
## Checking if a GPU is available

## Running this on a GPU speeds up computing the article embeddings considerably.

In [None]:
import torch
torch.cuda.is_available()

In [None]:
from pyngrok import ngrok

# ngrok token

You will need to paste your ngrok authtoken. You can find it [here](https://dashboard.ngrok.com/auth/your-authtoken) provided you have an ngrok account.

In [None]:
#@title ngrok token

token = '' #@param {type:"string"}
!ngrok authtoken $token

Authtoken saved to configuration file: /root/.ngrok2/ngrok.yml
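
The token cell above ships with an empty `token`, so it is easy to run it without pasting anything. A minimal sketch of a guard that fails early with a readable message is shown below. The helper name `resolve_ngrok_token` and the fallback to a `NGROK_AUTHTOKEN` environment variable are assumptions made for illustration; they are not part of the original notebook.

```python
import os


def resolve_ngrok_token(token: str, env_var: str = "NGROK_AUTHTOKEN") -> str:
    """Return a non-empty authtoken or raise a descriptive error.

    Hypothetical helper: the value typed into the form field wins; if it is
    empty, the environment variable named by `env_var` is used instead.
    """
    candidate = (token or "").strip()
    if not candidate:
        candidate = os.environ.get(env_var, "").strip()
    if not candidate:
        raise ValueError(
            "No ngrok authtoken found: paste it into the form field "
            "or set the {} environment variable.".format(env_var)
        )
    return candidate
```

In the notebook, the checked value could be assigned back to `token` before the `!ngrok authtoken $token` line runs.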


In [None]:
!streamlit run app.py&>log&

In [None]:
pub_port = !cat log | grep "Network URL:" | awk -F":" '{print $4}'
pub_port = int(pub_port[0])
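
The shell pipeline above takes the port from the `Network URL:` line of the streamlit log, and it fails with an `IndexError` if that line has not been written yet. Below is a minimal pure-Python sketch of the same step that also waits for the line to appear. It assumes the log contains a line such as `Network URL: http://172.28.0.2:8501`; the function names, the polling interval, and the timeout are hypothetical choices, not part of the original notebook.

```python
import re
import time
from pathlib import Path
from typing import Optional

# Matches e.g. "  Network URL: http://172.28.0.2:8501" and captures the port.
_NETWORK_URL = re.compile(r"Network URL:\s*https?://[^\s:/]+:(\d+)")


def parse_streamlit_port(log_text: str) -> Optional[int]:
    """Return the port from the 'Network URL' line, or None if it is absent."""
    match = _NETWORK_URL.search(log_text)
    return int(match.group(1)) if match else None


def wait_for_streamlit_port(log_path: str = "log", timeout: float = 30.0) -> int:
    """Poll the log file until the port shows up or the timeout expires."""
    deadline = time.monotonic() + timeout
    path = Path(log_path)
    while True:
        if path.exists():
            port = parse_streamlit_port(path.read_text(errors="replace"))
            if port is not None:
                return port
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "No 'Network URL' line found in {!r}".format(log_path)
            )
        time.sleep(0.5)
```

With these helpers, `pub_port = wait_for_streamlit_port("log")` could stand in for the two lines of the cell above.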


In [None]:
publ_url = ngrok.connect(pub_port)

In [None]:
publ_url

If the application fails to start, try closing all existing ngrok tunnels and go back to this [cell](https://colab.research.google.com/drive/1oJagYsBBfGugVdo5fuY4Fw5hZP6uUQo7#scrollTo=RKeo80BSU5l6&line=1&uniqifier=1).

In [None]:
#!pkill streamlit
tunnels = ngrok.get_tunnels()
for tunnel in tunnels:
  ngrok.disconnect(tunnel.public_url)