## Measuring Impact
This exploratory analysis measures the impact of researchers and keywords within a repository. 

Inspiration for this analysis was found here: https://api-lab.dimensions.ai/cookbooks/2-publications/Journal-Profile-2-Researchers-Impact-Metrics.html

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import plotly.express as px  # plotly>=4.8.1
import plotly.graph_objects as go


In [None]:
keywords = pd.read_csv("data/keywords.csv")
authors = pd.read_csv("data/authors.csv")

### Keywords
___________________________________
First, we'll look at the impact keywords have on repositories.

Let's start with the total number of keyword appearances in repositories over the years. The keywords chemical sciences, biological sciences, and biomedical and clinical sciences all appear frequently in repositories. The total number of appearances ranges from 140 to 240. Hover over a point in the plot to see a tooltip with more information.

In [None]:
# Sort by total appearances; the y axis lists keywords in row order.
keywords_pubyrs = keywords.sort_values(by="tot_appearances")
# TODO: look into altering figure size (e.g. width=1600)
fig = px.scatter(
    keywords_pubyrs,
    x="publicationYear",
    y="keywords",
    hover_name="keywords",
    hover_data=["author(s)", "publisher", "title", "keywords"],
    color="tot_appearances",
    labels={
        "publicationYear": "Publication Year",
        "tot_appearances": "Total Keyword Appearances",
    },
    height=800,
    title="Repository Impact: Total Keyword Appearances Over The Years",
)
fig.update_xaxes(tickangle=45)
# Hide the y-axis tick labels; keyword names are available in the hover tooltip.
fig.update_yaxes(showticklabels=False)

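The plots in this section rely on the precomputed `tot_appearances` and `tot_publishers` columns. How those columns were built is not shown here, so the following is only a minimal sketch under assumed definitions: total appearances as the number of rows carrying a keyword, and total publishers as the number of distinct repositories using it. The toy rows and repository names are invented for illustration.

```python
import pandas as pd

# Hypothetical toy rows standing in for data/keywords.csv (values are invented).
toy = pd.DataFrame({
    "keywords": ["chemical sciences", "chemical sciences", "biological sciences",
                 "economics", "chemical sciences"],
    "publisher": ["Repo A", "Repo B", "Repo A", "Repo C", "Repo A"],
    "publicationYear": [2018, 2019, 2019, 2020, 2020],
})

# Assumed definition: total appearances = number of rows carrying the keyword.
toy["tot_appearances"] = toy.groupby("keywords")["keywords"].transform("count")

# Assumed definition: total publishers = number of distinct repositories per keyword.
toy["tot_publishers"] = toy.groupby("keywords")["publisher"].transform("nunique")

# One row per keyword, most frequent first.
summary = (
    toy.drop_duplicates("keywords")[["keywords", "tot_appearances", "tot_publishers"]]
    .sort_values("tot_appearances", ascending=False)
    .reset_index(drop=True)
)
print(summary)
```

Using `transform` keeps one value per original row, which matches the shape the scatter plots expect.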


In [None]:
# Sort by the number of repositories (publishers) that use each keyword.
keywords_pubs = keywords.sort_values(by="tot_publishers")
fig_pub_appearance = px.scatter(
    keywords_pubs,
    x="publicationYear",
    y="keywords",
    hover_name="keywords",
    hover_data=["publicationYear", "author(s)", "publisher", "title", "keywords"],
    color="tot_publishers",
    labels={
        "publicationYear": "Publication Year",
        "tot_publishers": "Total Publisher Appearances",
    },
    height=800,
    title="Repository Impact: Number of Repositories that use the Keyword",
)
# Hide the y-axis tick labels; keyword names are available in the hover tooltip.
fig_pub_appearance.update_yaxes(showticklabels=False)


The repositories below are split into six different topics, and the keywords are grouped into those categories. Here, scientific appears to be the largest topic.

In [None]:
keywords_pubs = keywords.sort_values(by="keywords")

# Color each point by repository topic to show how keywords group into topics.
fig_pubs_keywords = px.scatter(
    keywords_pubs,
    x="publicationYear",
    y="keywords",
    hover_name="keywords",
    hover_data=["publisher", "publicationYear", "author(s)", "tot_publishers",
                "title", "publisher_categories", "keywords"],
    color="publisher_categories",
    labels={
        "publisher": "Repository",
        "publisher_categories": "Repository Topics",
        "publicationYear": "Publication Year",
        "tot_publishers": "Total Publisher Appearances",
    },
    height=600,
    title="Repository Impact: Keywords/Topics Published Per Repository",
)
# Hide the y-axis tick labels; keyword names are available in the hover tooltip.
fig_pubs_keywords.update_yaxes(showticklabels=False)


Below is a faceted version of the repository topics plot that shows keyword appearances for each repository topic. The same keywords have higher appearances overall, but categories like social sciences and math/physics/computer science feature different keywords, such as physical sciences (n = 8), economics (n = 6), and human society (n = 4). Although chemical sciences had the highest number of appearances in total, the biomedical sciences keyword (n = 13) was found in every topic category except social sciences.

In [None]:
keywords_pubs = keywords.sort_values(by="tot_publishers")
# One panel (facet column) per repository topic.
fig_key_repository = px.scatter(
    keywords_pubs,
    x="publicationYear",
    y="keywords",
    hover_name="keywords",
    hover_data=["publicationYear", "author(s)", "publisher",
                "publisher_categories", "title", "keywords"],
    color="tot_publishers",
    facet_col="publisher_categories",
    labels={
        "publisher_categories": "",
        "publisher": "Repository",
        "publicationYear": "Publication Year",
        "tot_publishers": "Total Publisher Appearances",
    },
    height=800,
    title="Repository Impact: Number of Repositories that use the Keyword",
)
# Hide the y-axis tick labels; keyword names are available in the hover tooltip.
fig_key_repository.update_yaxes(showticklabels=False)

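Per-topic counts like the ones quoted for the faceted plot can also be read from a table instead of from tooltips. The sketch below is one possible way to do that with `pd.crosstab`, assuming each row of the data is one keyword appearance. The toy rows and topic names are invented and do not reproduce the numbers reported above.

```python
import pandas as pd

# Hypothetical toy rows; keywords, topics, and counts are invented for illustration.
toy = pd.DataFrame({
    "keywords": ["biomedical sciences", "biomedical sciences", "economics",
                 "physical sciences", "biomedical sciences", "economics"],
    "publisher_categories": ["scientific", "general", "social sciences",
                             "math/physics/computer science", "scientific",
                             "social sciences"],
})

# Rows are keywords, columns are repository topics, cells are appearance counts.
table = pd.crosstab(toy["keywords"], toy["publisher_categories"])

# For each keyword, list the topics in which it never appears.
missing_topics = {kw: row[row == 0].index.tolist() for kw, row in table.iterrows()}

print(table)
print(missing_topics["biomedical sciences"])
```

The `missing_topics` lookup is a quick check for statements of the form "found in all but one topic category".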

### Authors
___________________________________
Now we'll look at the impact authors have on repositories.

In [None]:
authors = pd.read_csv("data/authors.csv")

In [None]:
# Sort by total appearances; the y axis lists authors in row order.
author_appearance = authors.sort_values(by="tot_appearances")

fig_author = px.scatter(
    author_appearance,
    x="publicationYear",
    y="author(s)",
    hover_name="author(s)",
    hover_data=["tot_appearances", "publicationYear", "author(s)",
                "publisher", "title", "keywords"],
    color="tot_appearances",
    labels={
        "author(s)": "Author(s)",
        "publicationYear": "Publication Year",
        "tot_appearances": "Total Author Appearances",
    },
    height=600,
    title="Researcher Impact: Authors Published Over the Years",
)
# Hide the y-axis tick labels; author names are available in the hover tooltip.
fig_author.update_yaxes(showticklabels=False)


The plot below shows the total number of repositories each author has published in. Most authors have published in only one repository, a few have published in two, and only a handful have published in five different repositories.

In [None]:
# Sort by the number of repositories each author has published in.
author_pub = authors.sort_values(by="tot_publishers")

fig_auth_pub = px.scatter(
    author_pub,
    x="publicationYear",
    y="author(s)",
    hover_name="author(s)",
    hover_data=["tot_appearances", "publicationYear", "author(s)",
                "publisher", "title", "keywords"],
    color="tot_publishers",
    labels={
        "author(s)": "Author(s)",
        "publicationYear": "Publication Year",
        "tot_publishers": "Total Appearances in a Repository",
        "tot_appearances": "Total Appearances Overall",
    },
    height=600,
    title="Researcher Impact: Authors Published in Repositories Over the Years",
)
# Hide the y-axis tick labels; author names are available in the hover tooltip.
fig_auth_pub.update_yaxes(showticklabels=False)

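The claim that most authors publish in a single repository can be checked numerically as well as visually. The following is a rough sketch, assuming one row per author and record with a `publisher` column naming the repository. The author and repository names are placeholders.

```python
import pandas as pd

# Hypothetical toy rows standing in for data/authors.csv (names are placeholders).
toy = pd.DataFrame({
    "author(s)": ["A. Author", "A. Author", "B. Author",
                  "C. Author", "C. Author", "C. Author"],
    "publisher": ["Repo A", "Repo B", "Repo A", "Repo A", "Repo A", "Repo C"],
})

# Number of distinct repositories each author has published in.
repos_per_author = toy.groupby("author(s)")["publisher"].nunique()

# How many authors fall into each "number of repositories" bucket.
distribution = repos_per_author.value_counts().sort_index()
print(distribution)
```

Here the index of `distribution` is the number of repositories and the value is the number of authors with that count.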
In [None]:
author_pubs = authors.sort_values(by="tot_publishers")
# One panel (facet column) per repository topic.
fig_auth = px.scatter(
    author_pubs,
    x="publicationYear",
    y="author(s)",
    hover_name="author(s)",
    hover_data=["publicationYear", "author(s)", "publisher",
                "publisher_categories", "title", "keywords"],
    color="tot_publishers",
    facet_col="publisher_categories",
    labels={
        "author(s)": "Author(s)",
        "publisher_categories": "",
        "publisher": "Repository",
        "publicationYear": "Publication Year",
        "tot_publishers": "Total Appearances in a Repository",
        "tot_appearances": "Total Appearances Overall",
        "title": "Title",
    },
    height=800,
    title="Repository Impact: Number of Repositories where an Author Appears",
)
fig_auth.update_xaxes(tickangle=45, nticks=8)
# Hide the y-axis tick labels; author names are available in the hover tooltip.
fig_auth.update_yaxes(showticklabels=False)

# TODO: add a note on how to read the graph, i.e. the y axis does not represent all the authors
# TODO: convert tot_publishers to a string data type
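One possible way to handle the note about converting `tot_publishers` to a string is sketched below. Plotly Express treats a non-numeric `color` column as categorical, so a string copy of the column should produce a discrete legend instead of a continuous color bar. The column name `tot_publishers_label` is a hypothetical choice, and the toy values are invented.

```python
import pandas as pd

# Hypothetical toy rows; author names and counts are invented.
toy = pd.DataFrame({
    "author(s)": ["A. Author", "B. Author", "C. Author"],
    "tot_publishers": [2, 1, 5],
})

# Keep the integer column for sorting and add a string copy for coloring.
toy["tot_publishers_label"] = toy["tot_publishers"].astype(str)

# Build the legend order numerically, since plain string sorting would put "10" before "2".
legend_order = [str(n) for n in sorted(toy["tot_publishers"].unique())]
print(legend_order)
```

The string column could then be passed to `px.scatter` as `color="tot_publishers_label"`, with `category_orders={"tot_publishers_label": legend_order}` to keep the legend in numeric order.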