61 volume pipeline #65
Merged
Changes from 35 commits
66 commits
0c6e024
adding script to create emergence table
08d6f64
adding pipeline to build table for emergence and alignment data
6940e00
adding polars to requirements
e08f13a
moving functions to app data utils
c42fb3b
removing unused imports
557def1
removing nesta_ds_utils upload as it doesn't accept polars
75901f6
adding getter for volume table
6815c17
getting started
beingkk 8b8f030
testing out novelty measures on dummy data
beingkk 4a99d78
factoring out novelty utils
beingkk 9a9b5cf
cleaning up
beingkk 93c9ebc
pipeline script
beingkk de19a53
pipeline script
beingkk 3a7fe0f
fixing gitignore
beingkk 4eaee1c
faster version of utils and pipeline
beingkk 76e9223
fixing conflict in requirements
58addaf
fixed typing, docstrings; used parquet; saving to s3; added getters
beingkk 9f90dbb
fixed docstring
beingkk 3ce7303
more informative function name
beingkk 38272b8
fixing conflict in requirements
cc7bef8
adding chart placeholders
c77baee
separating emergence from novelty and disruption, and making filters …
bb96c61
Fixed title font color
4b4e8eb
adding filters for network
180ad06
changing logo to include igl
2ef439c
changing plot placeholders based on miro brainstorming session
22c4f24
fixing conflict in requirements
b2d2a7a
adding emergence to horizon scanner
5ff21f0
fixing alignment chart
0c0216d
adding topic names
3c22766
cleaning and commenting code
6a2aef5
get rid of print statement
eae5eee
stylistic changes from standup
518053c
add ygrid to align topics
0518853
change mark size
91a1fd1
adding docstring for pipeline and adding topic names/additional colum…
aef6c6c
separating app utils from app data utils and moving convert to pandas
5a1c3db
fixing overlapping requirements
1882d83
added interactive images in Home
ampudia19 27e244f
added hover cursor
ampudia19 8e3a2fc
fixing data types
20022de
moving link functions to utils
5bd019a
fixing table creation and saving to s3
4dcec03
animate home
ampudia19 dd3339e
rebasing
d24fa07
fixed typing, docstrings; used parquet; saving to s3; added getters
beingkk 7db07c1
rebasing
5546047
pipeline script for topic level novelty
beingkk 8cde929
adapting novelty_utils for patents
beingkk a64ff1c
pipeline for calculating patent novelty scores
beingkk 9f45778
pipeline for calculating topic level novelty with patents
beingkk a1e7665
responding to PR review
beingkk 7f5730f
fixing dosctrings and adding upload_to_s3 flag for pipelines
beingkk 6215e92
rebasing
53b9f38
rebasing
f18ccd8
rebasing
6138ff8
functions for css style
ampudia19 ab20cf3
add gifs
ampudia19 aa76f5a
syntax of converting column to list for filter options
599b672
fixing bug in emergence table to replace nulls with 0
f30d4eb
removing unneeded requirement
d4ad4f3
adding column names in docstring in getter
a5a9d7b
moved image directory to init
c69fe8c
fixing conflicts with dev
2818bbf
refactor func outside package
ampudia19 624bb8b
added docstring
ampudia19
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,52 @@ | ||
import streamlit as st | ||
from PIL import Image | ||
import altair as alt | ||
from nesta_ds_utils.viz.altair import formatting | ||
from dap_aria_mapping import PROJECT_DIR | ||
formatting.setup_theme() | ||
|
||
PAGE_TITLE = "Innovation Explorer" | ||
|
||
IMAGE_DIR = f"{PROJECT_DIR}/dap_aria_mapping/analysis/app/images" | ||
|
||
|
||
#icon to be used as the favicon on the browser tab | ||
nesta_fav = Image.open(f"{IMAGE_DIR}/favicon.ico") | ||
|
||
# sets page configuration with favicon and title | ||
st.set_page_config( | ||
page_title=PAGE_TITLE, | ||
layout="wide", | ||
page_icon=nesta_fav | ||
) | ||
|
||
st.title("Welcome to the Innovation Explorer!") | ||
|
||
home_tab, data_tab, methods_tab = st.tabs(["Home", "About the Datasets", "Methodology"]) | ||
|
||
with home_tab: | ||
hs, cm = st.columns(2) | ||
with hs: | ||
st.image(Image.open(f"{IMAGE_DIR}/hs_homepage.png")) | ||
|
||
with cm: | ||
st.image(Image.open(f"{IMAGE_DIR}/cm_homepage.png")) | ||
|
||
|
||
|
||
with data_tab: | ||
st.markdown("In this app we leverage open source data provided by [Google Patents](https://console.cloud.google.com/marketplace/product/google_patents_public_datasets/google-patents-public-data?pli=1) and [OpenAlex](https://docs.openalex.org/) to assess the landscape of innovation in the UK.") | |
st.markdown("ADD MORE DATA DOCUMENTATION") | ||
|
||
with methods_tab: | ||
st.markdown("ADD INFORMATION ABOUT OUR METHODOLOGY") | ||
|
||
#adds the nesta x aria logo at the bottom of each tab, 3 lines below the contents | ||
st.markdown("") | ||
st.markdown("") | ||
st.markdown("") | ||
|
||
white_space, logo, white_space = st.columns([1.5,1,1.5]) | ||
with logo: | ||
st.image(Image.open(f"{IMAGE_DIR}/igl_nesta_aria_logo.png")) | ||
|
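Both pages in this diff build `IMAGE_DIR` with the same f-string and then append file names to it. A minimal sketch of one way to centralise that, assuming `PROJECT_DIR` is a string or path-like object; the helper name `image_path` is hypothetical and not part of the PR:

```python
from pathlib import Path


def image_path(project_dir, filename: str) -> Path:
    """Build the path to an app image below the project root.

    The sub-directory mirrors the IMAGE_DIR string used in the diff;
    `image_path` itself is an illustrative helper, not code from the PR.
    """
    image_dir = Path(project_dir) / "dap_aria_mapping" / "analysis" / "app" / "images"
    return image_dir / filename


# Example with a made-up project root
favicon = image_path("/tmp/project", "favicon.ico")
```

A page could then pass the result to `Image.open`, which accepts both strings and `Path` objects.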
Binary file not shown.
Binary file not shown.
Binary file not shown.
307 changes: 307 additions & 0 deletions
dap_aria_mapping/analysis/app/pages/1_Horizon_Scanner.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,307 @@ | ||
import streamlit as st | ||
from PIL import Image | ||
import altair as alt | ||
from nesta_ds_utils.viz.altair import formatting | ||
from dap_aria_mapping import PROJECT_DIR | ||
from dap_aria_mapping.getters.app_tables.horizon_scanner import volume_per_year | ||
from dap_aria_mapping.getters.taxonomies import get_topic_names | ||
import polars as pl | ||
import pandas as pd | ||
from typing import List, Tuple | ||
|
||
formatting.setup_theme() | ||
|
||
PAGE_TITLE = "Horizon Scanner" | ||
|
||
IMAGE_DIR = f"{PROJECT_DIR}/dap_aria_mapping/analysis/app/images" | ||
emily-bicks marked this conversation as resolved.
|
||
|
||
#icon to be used as the favicon on the browser tab | ||
icon = Image.open(f"{IMAGE_DIR}/hs_icon.ico") | ||
|
||
# sets page configuration with favicon and title | ||
st.set_page_config( | ||
page_title=PAGE_TITLE, | ||
layout="wide", | ||
page_icon=icon | ||
) | ||
|
||
@st.cache_data | ||
emily-bicks marked this conversation as resolved.
|
||
def load_overview_data() -> Tuple[pl.DataFrame, pl.DataFrame, List[str]]: | ||
"""loads in the volume per year chart and does initial formatting that is not impacted by filters | ||
caches results so the data is not loaded each time a filter is run | ||
|
||
Returns: | ||
pl.DataFrame: total patents/publications per domain/area/topic per year, with names | ||
pl.DataFrame: same as above, but patent/publication counts are melted to long form | ||
List: unique domain names in dataset | ||
""" | ||
volume_data = volume_per_year() | ||
|
||
#add total document count as count of patents and publications combined | ||
volume_data = volume_data.with_columns( | ||
(pl.col('patent_count') + pl.col('publication_count')).alias('total_docs'), | ||
pl.col('year').round(0) | ||
) | ||
emily-bicks marked this conversation as resolved.
|
||
|
||
#add chatgpt names for domain, area, topics | ||
domain_names = pl.DataFrame( | ||
pd.DataFrame.from_dict( | ||
get_topic_names("cooccur", "chatgpt", 1, n_top = 35), | ||
orient= "index").rename_axis("domain").reset_index().rename( | ||
columns = {"name": "domain_name"})[["domain", "domain_name"]]) | ||
area_names = pl.DataFrame( | ||
pd.DataFrame.from_dict( | ||
get_topic_names("cooccur", "chatgpt", 2, n_top = 35), | ||
orient= "index").rename_axis("area").reset_index().rename( | ||
columns = {"name": "area_name"})[["area", "area_name"]]) | ||
topic_names = pl.DataFrame( | ||
pd.DataFrame.from_dict( | ||
get_topic_names("cooccur", "chatgpt", 3, n_top = 35), | ||
orient= "index").rename_axis("topic").reset_index().rename( | ||
columns = {"name": "topic_name"})[["topic", "topic_name"]]) | ||
emily-bicks marked this conversation as resolved.
|
||
|
||
volume_data = volume_data.join(domain_names, on="domain", how="left") | ||
volume_data = volume_data.join(area_names, on="area", how="left") | ||
volume_data = volume_data.join(topic_names, on="topic", how="left") | ||
|
||
#generate a list of the unique domain names to use as the filter | ||
unique_domains = list(list(volume_data.select(pl.col("domain_name").unique()))[0]) | ||
emily-bicks marked this conversation as resolved.
|
||
unique_domains.insert(0,"All") | ||
|
||
#reformat the patent/publication counts to long form for the alignment chart | ||
alignment_data = volume_data.melt( | ||
id_vars = ["year", "topic", "topic_name","area", "area_name","domain", "domain_name"], | ||
value_vars = ["publication_count", "patent_count"]) | ||
alignment_data.columns = ["year", "topic", "topic_name", "area", "area_name", "domain", "domain_name","doc_type", "count"] | ||
emily-bicks marked this conversation as resolved.
|
||
|
||
return volume_data, alignment_data, unique_domains | ||
|
||
@st.cache_data | ||
def filter_by_domain(domain: str, _volume_data: pl.DataFrame, _alignment_data: pl.DataFrame) -> Tuple[pl.DataFrame, pl.DataFrame, List[str]]: | ||
"""filters volume data, alignment data, and filter options based on a Domain selection | ||
|
||
Args: | ||
domain (str): domain selected by the filter | ||
_volume_data (pl.DataFrame): volume data for emergence chart | ||
_alignment_data (pl.DataFrame): alignment data for alignment chart | ||
|
||
Returns: | ||
Tuple[pl.DataFrame, pl.DataFrame, List[str]]: updated dataframes filtered by a domain, and a list of unique areas to populate area filter | ||
emily-bicks marked this conversation as resolved.
|
||
""" | ||
volume_data = _volume_data.filter(pl.col("domain_name")==domain) | ||
alignment_data = _alignment_data.filter(pl.col("domain_name")==domain) | ||
unique_areas = list(list(volume_data.select(pl.col("area_name").unique()))[0]) | ||
emily-bicks marked this conversation as resolved.
|
||
return volume_data, alignment_data, unique_areas | ||
|
||
@st.cache_data | ||
def filter_by_area(area:str, _volume_data: pl.DataFrame, _alignment_data: pl.DataFrame) -> Tuple[pl.DataFrame, pl.DataFrame, List[str]]: | ||
"""filters volume data, alignment data, and filter options based on an area selection | ||
|
||
Args: | ||
area (str): area selected by the filter | |
_volume_data (pl.DataFrame): volume data for emergence chart | ||
_alignment_data (pl.DataFrame): alignment data for alignment chart | ||
|
||
Returns: | ||
Tuple[pl.DataFrame, pl.DataFrame, List[str]]: updated dataframes filtered by an area, and a list of unique topics to populate topic filter | ||
""" | ||
volume_data = _volume_data.filter(pl.col("area_name")==area) | ||
alignment_data = _alignment_data.filter(pl.col("area_name")==area) | ||
unique_topics = list(list(volume_data.select(pl.col("topic_name").unique()))[0]) | ||
return volume_data, alignment_data, unique_topics | ||
|
||
def group_emergence_by_level(_volume_data: pl.DataFrame, level: str, y_col: str) -> pl.DataFrame: | ||
"""groups the data for the emergence chart by the level specified by the filters | ||
|
||
Args: | ||
_volume_data (pl.DataFrame): data for backend of emergence chart | ||
level (str): level to view, specified by domain/area filters | ||
y_col (str): patents, publications, or all documents (specified by filter) | ||
|
||
Returns: | ||
pl.DataFrame: grouped emergence data for chart | ||
""" | ||
q = (_volume_data.lazy().with_columns( | ||
pl.col(level).cast(str) | ||
).groupby( | ||
[level, "{}_name".format(level),"year"] | ||
).agg( | ||
[pl.sum(y_col)] | ||
).filter(pl.any(pl.col("year").is_not_null()))) | ||
return q.collect() | ||
|
||
def group_alignment_by_level(_alignment_data: pl.DataFrame, level: str) -> pl.DataFrame: | ||
"""groups the data for the alignment chart by the level specified by the filters. | ||
Also calculates the fraction of total documents per type to visualise in the chart. | ||
|
||
Args: | ||
_alignment_data (pl.DataFrame): data for backend of alignment chart | ||
level (str): level to view, specified by domain/area filters | ||
|
||
Returns: | ||
pl.DataFrame: grouped alignment data for chart | ||
""" | ||
total_pubs = _alignment_data.filter(pl.col("doc_type")=="publication_count").select(pl.sum("count")) | ||
total_patents = _alignment_data.filter(pl.col("doc_type")=="patent_count").select(pl.sum("count")) | ||
q = (_alignment_data.lazy().with_columns( | ||
pl.col(level).cast(str) | ||
).groupby(["doc_type", level, "{}_name".format(level)] | ||
).agg( | ||
[pl.sum("count").alias("total")] | ||
).with_columns( | ||
pl.when(pl.col("doc_type") == "publication_count") | ||
.then(pl.col("total")/total_pubs) | ||
.when(pl.col("doc_type") == "patent_count") | ||
.then(pl.col("total")/total_patents) | ||
.alias("doc_fraction") | ||
).with_columns( | ||
pl.when(pl.col("doc_type") == "publication_count") | ||
.then("Publications") | ||
.when(pl.col("doc_type") == "patent_count") | ||
.then("Patents") | ||
.alias("doc_name_clean") | ||
) | ||
.with_columns( | ||
(pl.col("doc_fraction")*100).alias("doc_percentage")) | ||
) | ||
return q.collect() | ||
|
||
def convert_to_pandas(_df: pl.DataFrame) -> pd.DataFrame: | ||
"""converts polars dataframe to pandas dataframe | ||
note: this is needed as altair doesn't accept polars dataframes, but the conversion is quick so I still think it's | |
worthwhile to use polars for the filtering | |
|
||
Args: | ||
_df (pl.DataFrame): polars dataframe | ||
|
||
Returns: | ||
pd.DataFrame: pandas dataframe | ||
""" | ||
return _df.to_pandas() | ||
emily-bicks marked this conversation as resolved.
|
||
|
||
header1, header2 = st.columns([1,10]) | ||
with header1: | ||
st.image(icon) | ||
with header2: | ||
st.markdown(f'<h1 style="color:#0000FF;font-size:72px;">{"Horizon Scanner"}</h1>', unsafe_allow_html=True) | ||
|
||
st.markdown(f'<h1 style="color:#0000FF;font-size:16px;">{"<em>Explore patterns and trends in research domains across the UK</em>"}</h1>', unsafe_allow_html=True) | |
|
||
#load in volume data | ||
volume_data, alignment_data, unique_domains = load_overview_data() | ||
|
||
with st.sidebar: | ||
# filter for domains comes from unique domain names | ||
domain = st.selectbox(label = "Select a Domain", options = unique_domains) | ||
area = "All" | ||
topic = "All" | ||
level_considered = "domain" | ||
# if a domain is selected in the filter, then filter the data | ||
if domain != "All": | ||
volume_data, alignment_data, unique_areas = filter_by_domain(domain, volume_data, alignment_data) | ||
unique_areas.insert(0, "All") | ||
#if a domain is selected, the plots that are being visualised are by area (i.e. level 2) | ||
level_considered = "area" | ||
|
||
#if a domain is selected, allow user to filter by area | ||
area = st.selectbox(label = "Select an Area", options = unique_areas) | ||
if area != "All": | ||
#if an area is selected, filter data to the area and present at topic level | ||
volume_data, alignment_data, unique_topics = filter_by_area(area, volume_data, alignment_data) | ||
level_considered = "topic" | ||
|
||
|
||
overview_tab, disruption_tab, novelty_tab, overlaps_tab = st.tabs(["Overview", "Disruption", "Novelty","Overlaps"]) | ||
|
||
with overview_tab: | ||
|
||
st.subheader("Growth Over Time") | ||
st.markdown("View trends in volume of content over time to detect emerging or stagnant areas of innovation") | ||
show_only = st.selectbox(label = "Show Emergence In:", options = ["All Documents", "Publications", "Patents"]) | ||
if show_only == "Publications": | ||
y_col = "publication_count" | ||
elif show_only == "Patents": | ||
y_col = "patent_count" | ||
else: | ||
y_col = "total_docs" | ||
emily-bicks marked this conversation as resolved.
|
||
|
||
emergence_data = convert_to_pandas(group_emergence_by_level(volume_data, level_considered, y_col)) | ||
|
||
volume_chart = alt.Chart(emergence_data).mark_line(point=True).encode( | ||
alt.X("year:N"), | ||
alt.Y("{}:Q".format(y_col), title = "Total Documents"), | ||
color = alt.Color("{}_name:N".format(level_considered), | ||
legend = alt.Legend(labelFontSize = 10, title = None, labelLimit = 0, symbolSize = 20) | ||
), | ||
tooltip=[ | ||
alt.Tooltip("year:N", title = "Year"), | ||
alt.Tooltip("{}:Q".format(y_col),title = "Total Documents"), | ||
alt.Tooltip("{}_name:N".format(level_considered), title = "{}".format(level_considered))] | ||
|
||
).interactive().properties(width=1100, height = 500) | ||
st.altair_chart(volume_chart) | ||
|
||
st.subheader("Alignment in Research and Industry") | ||
st.markdown("A high publication count combined with a low patent count indicates that there is significantly more activity in academia than in industry on this topic (or vice versa).") | |
filtered_alignment_data = convert_to_pandas(group_alignment_by_level(alignment_data, level_considered)) | ||
alignment_chart = alt.Chart(filtered_alignment_data).transform_filter( | ||
alt.datum.doc_fraction > 0 | ||
).mark_point(size = 60).encode( | ||
alt.X("doc_fraction:Q", | ||
title = "Percent of Documents of the Given Type", | ||
scale=alt.Scale(type="log"), | ||
axis = alt.Axis(tickSize=0, format = "%", grid = False)), | ||
alt.Y("{}_name:N".format(level_considered), | ||
axis = alt.Axis(labelLimit = 0, title = None, grid = True) | ||
), | ||
tooltip=[ | ||
alt.Tooltip("doc_name_clean:N", title = "Document Type"), | ||
alt.Tooltip("doc_percentage:Q", format = ".2f", title = "Percent of Docs (%)"), | ||
alt.Tooltip("{}_name:N".format(level_considered), title = "{}".format(level_considered))], | ||
color = alt.Color("doc_name_clean:N", legend=alt.Legend( | ||
direction='horizontal', | ||
legendX=10, | ||
legendY=-80, | ||
orient = 'none', | ||
titleAnchor='middle', | ||
title = None)) | ||
).interactive().properties(width = 1100) | ||
|
||
|
||
st.altair_chart(alignment_chart) | ||
|
||
with disruption_tab: | ||
disruption_trends, disruption_drilldown = st.columns(2) | ||
|
||
with disruption_trends: | ||
st.subheader("Trends in Disruption") | ||
st.markdown("This could show if certain domains/areas have been recently disrupted or have lacked disruption") | ||
|
||
with disruption_drilldown: | ||
st.subheader("Drill Down in Disruption") | ||
st.markdown("This would allow a user to select a topic and see the distribution of disruptiveness of papers within that topic") | ||
|
||
with novelty_tab: | ||
st.subheader("Trends in Novelty") | ||
st.markdown("This could show trends in novelty of research produced by certain domains/areas") | ||
|
||
with overlaps_tab: | ||
heatmap, overlap_drilldown = st.columns(2) | ||
with heatmap: | ||
st.subheader("Heatmap of Overlaps") | ||
st.markdown("This would be used to show which areas have a large amount of research that spans multiple topics (i.e. a lot of research combining ML with Neuroscience)") | ||
with overlap_drilldown: | ||
st.subheader("Trends in Overlaps") | ||
st.markdown("This would allow a user to select interesting overlaps and view the trends in growth over time") | ||
|
||
|
||
|
||
|
||
#adds the nesta x aria logo at the bottom of each tab, 3 lines below the contents | ||
st.markdown("") | ||
st.markdown("") | ||
st.markdown("") | ||
|
||
white_space, logo, white_space = st.columns([1.5,1,1.5]) | ||
with logo: | ||
st.image(Image.open(f"{IMAGE_DIR}/igl_nesta_aria_logo.png")) |
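The arithmetic behind `group_alignment_by_level` may be easier to follow on a tiny example. The sketch below is plain Python rather than polars, and the rows and the function name `doc_fractions` are made up for illustration. It reproduces the idea in the diff: sum counts per group and document type, then divide by the total for that document type.

```python
from collections import defaultdict

# Made-up long-form rows: (level name, doc_type, count)
rows = [
    ("Domain A", "publication_count", 30),
    ("Domain A", "patent_count", 5),
    ("Domain B", "publication_count", 10),
    ("Domain B", "patent_count", 15),
]


def doc_fractions(rows):
    """Return {(name, doc_type): share of that doc_type's overall total}."""
    type_totals = defaultdict(int)   # total count per document type
    group_totals = defaultdict(int)  # total count per (name, doc_type)
    for name, doc_type, count in rows:
        type_totals[doc_type] += count
        group_totals[(name, doc_type)] += count
    # Skip document types with a zero total to avoid dividing by zero
    return {
        key: total / type_totals[key[1]]
        for key, total in group_totals.items()
        if type_totals[key[1]] > 0
    }


fractions = doc_fractions(rows)
```

Here Domain A holds 30 of 40 publications but only 5 of 20 patents, which is the kind of gap the alignment chart is meant to surface. The chart in the diff drops zero fractions with `transform_filter` before plotting, since its x axis uses a log scale.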
Thanks @emily-bicks, really good to see this. I don't have comments really, as I'm new to streamlit - so perhaps I'll take this as what good looks like for now, and then will have more feedback as I get more familiar.
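For readers who, like the reviewer, are new to Streamlit, the sidebar logic in `1_Horizon_Scanner.py` reduces to two small decisions that can be read without any Streamlit calls: which taxonomy level the charts show, and which count column the emergence chart plots. A minimal sketch, where `resolve_level`, `resolve_y_col` and `Y_COLUMNS` are hypothetical names rather than code from the PR:

```python
def resolve_level(domain: str = "All", area: str = "All") -> str:
    """Return the taxonomy level to plot, given the sidebar selections.

    No domain selected  -> compare domains.
    Domain but no area  -> compare areas within that domain.
    Domain and area     -> compare topics within that area.
    """
    if domain == "All":
        return "domain"
    if area == "All":
        return "area"
    return "topic"


# Maps the "Show Emergence In:" options to the count columns used in the diff
Y_COLUMNS = {
    "All Documents": "total_docs",
    "Publications": "publication_count",
    "Patents": "patent_count",
}


def resolve_y_col(show_only: str) -> str:
    """Return the count column for a selection, defaulting to all documents."""
    return Y_COLUMNS.get(show_only, "total_docs")
```

Keeping these decisions in pure functions is one possible design choice: they can be unit-tested without starting the app, while the `st.selectbox` calls only supply their inputs.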