# Subject Mapping Extractions for Springer Nature

This notebook allows to select an FOR code (subject area) and a time span to produce CSV reports for the following areas: Data Competitors, Data Growth, Data Top Target Institutions and Data Top Countries. The reports are based on the [spreadsheet](https://drive.google.com/open?id=1La_GWhzw1ReFL3biYLIL_K7qMUWxXX_B) used by the SN team programmatically, as documented in the [Data Request Specs](https://docs.google.com/document/d/1haAj3SpraxzNLLOcF7_TBfrQ7chFWyRp70SueFeCPDg/edit). 

## Left To Do

* check that second year is always > than first year
* ~~rerun from top~~
* ~~merge all into single cell~~
* ~~clarify and/or for multiple subjects and do more testing~~
* ~~see if we need to download data programmatically, or AJ can do it himself after I provide more info about the left hand Files menu~~
* ~~remove indexes from CSV exports~~
* ~~consider adding subject name first in file name (easier to sort)~~


## 1. Log into the Dimensions API

This step needs to be run each time the notebooks gets re-loaded, as it sets up the API connection with the Dimensions back end. 

In [1]:
username = "dsl.demo.1@dimensions.ai"  #@param {type: "string"}
password = "1.Demo.Dsl"  #@param {type: "string"}
endpoint = "https://app.dimensions.ai"  #@param {type: "string"}

#
#
#
!pip install dimcli -U --quiet 
import dimcli
import pandas as pd
import time
import os
from datetime import datetime
dimcli.login(username, password, endpoint)
dsl = dimcli.Dsl()
#
#
#

[?25l[K     |██▉                             | 10kB 19.7MB/s eta 0:00:01[K     |█████▊                          | 20kB 5.8MB/s eta 0:00:01[K     |████████▌                       | 30kB 8.0MB/s eta 0:00:01[K     |███████████▍                    | 40kB 5.5MB/s eta 0:00:01[K     |██████████████▏                 | 51kB 6.6MB/s eta 0:00:01[K     |█████████████████               | 61kB 7.8MB/s eta 0:00:01[K     |███████████████████▉            | 71kB 8.9MB/s eta 0:00:01[K     |██████████████████████▊         | 81kB 9.9MB/s eta 0:00:01[K     |█████████████████████████▌      | 92kB 11.0MB/s eta 0:00:01[K     |████████████████████████████▍   | 102kB 9.1MB/s eta 0:00:01[K     |███████████████████████████████▏| 112kB 9.1MB/s eta 0:00:01[K     |████████████████████████████████| 122kB 9.1MB/s 
[?25hDimCli v0.6.1.2 - Succesfully connected to <https://app.dimensions.ai> (method: manual login)


## Select Parameters 

Pick a year range and one or more subjects from the [FOR categories](https://sn-insights.dimensions.ai/browse/publication/for?and_facet_open_access_status_analytics). NOTE the value in the 'connector' field determines if both subjects must be present in the results (default = no).

In [13]:

###################
#
# 1. SUBJECTS SELECTIONS AND SETTINGS 
#
###################



# tip 
# the subjects dropdown can be generated via 
# `str(["%s" % s for s in  sorted(dimcli.G.categories('category_for')) if len(s.split()[0]) > 2])`
#

#@markdown Time frame
start_year = 2015  #@param {type: "slider", min: 1980, max: 2020}
end_year = 2020  #@param {type: "slider", min: 1980, max: 2020}
start_year_growth = 2010 #@param {type: "integer" }
#@markdown ---
#@markdown Subjects
subject1 = "1006 Computer Hardware"  #@param ['0101 Pure Mathematics', '0102 Applied Mathematics', '0103 Numerical and Computational Mathematics', '0104 Statistics', '0105 Mathematical Physics', '0201 Astronomical and Space Sciences', '0202 Atomic, Molecular, Nuclear, Particle and Plasma Physics', '0203 Classical Physics', '0204 Condensed Matter Physics', '0205 Optical Physics', '0206 Quantum Physics', '0299 Other Physical Sciences', '0301 Analytical Chemistry', '0302 Inorganic Chemistry', '0303 Macromolecular and Materials Chemistry', '0304 Medicinal and Biomolecular Chemistry', '0305 Organic Chemistry', '0306 Physical Chemistry (incl. Structural)', '0307 Theoretical and Computational Chemistry', '0399 Other Chemical Sciences', '0401 Atmospheric Sciences', '0402 Geochemistry', '0403 Geology', '0404 Geophysics', '0405 Oceanography', '0406 Physical Geography and Environmental Geoscience', '0499 Other Earth Sciences', '0501 Ecological Applications', '0502 Environmental Science and Management', '0503 Soil Sciences', '0599 Other Environmental Sciences', '0601 Biochemistry and Cell Biology', '0602 Ecology', '0603 Evolutionary Biology', '0604 Genetics', '0605 Microbiology', '0606 Physiology', '0607 Plant Biology', '0608 Zoology', '0699 Other Biological Sciences', '0701 Agriculture, Land and Farm Management', '0702 Animal Production', '0703 Crop and Pasture Production', '0704 Fisheries Sciences', '0705 Forestry Sciences', '0706 Horticultural Production', '0707 Veterinary Sciences', '0799 Other Agricultural and Veterinary Sciences', '0801 Artificial Intelligence and Image Processing', '0802 Computation Theory and Mathematics', '0803 Computer Software', '0804 Data Format', '0805 Distributed Computing', '0806 Information Systems', '0807 Library and Information Studies', '0899 Other Information and Computing Sciences', '0901 Aerospace Engineering', '0902 Automotive Engineering', '0903 Biomedical Engineering', '0904 Chemical Engineering', '0905 Civil Engineering', '0906 Electrical and Electronic Engineering', '0907 Environmental Engineering', '0908 Food Sciences', '0909 Geomatic Engineering', '0910 Manufacturing Engineering', '0911 Maritime Engineering', '0912 Materials Engineering', '0913 Mechanical Engineering', '0914 Resources Engineering and Extractive Metallurgy', '0915 Interdisciplinary Engineering', '0999 Other Engineering', '1001 Agricultural Biotechnology', '1002 Environmental Biotechnology', '1003 Industrial Biotechnology', '1004 Medical Biotechnology', '1005 Communications Technologies', '1006 Computer Hardware', '1007 Nanotechnology', '1099 Other Technology', '1101 Medical Biochemistry and Metabolomics', '1102 Cardiorespiratory Medicine and Haematology', '1103 Clinical Sciences', '1104 Complementary and Alternative Medicine', '1105 Dentistry', '1106 Human Movement and Sports Science', '1107 Immunology', '1108 Medical Microbiology', '1109 Neurosciences', '1110 Nursing', '1111 Nutrition and Dietetics', '1112 Oncology and Carcinogenesis', '1113 Ophthalmology and Optometry', '1114 Paediatrics and Reproductive Medicine', '1115 Pharmacology and Pharmaceutical Sciences', '1116 Medical Physiology', '1117 Public Health and Health Services', '1199 Other Medical and Health Sciences', '1201 Architecture', '1202 Building', '1203 Design Practice and Management', '1205 Urban and Regional Planning', '1299 Other Built Environment and Design', '1301 Education Systems', '1302 Curriculum and Pedagogy', '1303 Specialist Studies In Education', '1399 Other Education', '1401 Economic Theory', '1402 Applied Economics', '1403 Econometrics', '1499 Other Economics', '1501 Accounting, Auditing and Accountability', '1502 Banking, Finance and Investment', '1503 Business and Management', '1504 Commercial Services', '1505 Marketing', '1506 Tourism', '1507 Transportation and Freight Services', '1601 Anthropology', '1602 Criminology', '1603 Demography', '1604 Human Geography', '1605 Policy and Administration', '1606 Political Science', '1607 Social Work', '1608 Sociology', '1699 Other Studies In Human Society', '1701 Psychology', '1702 Cognitive Sciences', '1799 Other Psychology and Cognitive Sciences', '1801 Law', '1899 Other Law and Legal Studies', '1901 Art Theory and Criticism', '1902 Film, Television and Digital Media', '1903 Journalism and Professional Writing', '1904 Performing Arts and Creative Writing', '1905 Visual Arts and Crafts', '1999 Other Studies In Creative Arts and Writing', '2001 Communication and Media Studies', '2002 Cultural Studies', '2003 Language Studies', '2004 Linguistics', '2005 Literary Studies', '2099 Other Language, Communication and Culture', '2101 Archaeology', '2102 Curatorial and Related Studies', '2103 Historical Studies', '2199 Other History and Archaeology', '2201 Applied Ethics', '2202 History and Philosophy of Specific Fields', '2203 Philosophy', '2204 Religion and Religious Studies', '2299 Other Philosophy and Religious Studies']
subject2 = "0604 Genetics"  #@param ['None', '0101 Pure Mathematics', '0102 Applied Mathematics', '0103 Numerical and Computational Mathematics', '0104 Statistics', '0105 Mathematical Physics', '0201 Astronomical and Space Sciences', '0202 Atomic, Molecular, Nuclear, Particle and Plasma Physics', '0203 Classical Physics', '0204 Condensed Matter Physics', '0205 Optical Physics', '0206 Quantum Physics', '0299 Other Physical Sciences', '0301 Analytical Chemistry', '0302 Inorganic Chemistry', '0303 Macromolecular and Materials Chemistry', '0304 Medicinal and Biomolecular Chemistry', '0305 Organic Chemistry', '0306 Physical Chemistry (incl. Structural)', '0307 Theoretical and Computational Chemistry', '0399 Other Chemical Sciences', '0401 Atmospheric Sciences', '0402 Geochemistry', '0403 Geology', '0404 Geophysics', '0405 Oceanography', '0406 Physical Geography and Environmental Geoscience', '0499 Other Earth Sciences', '0501 Ecological Applications', '0502 Environmental Science and Management', '0503 Soil Sciences', '0599 Other Environmental Sciences', '0601 Biochemistry and Cell Biology', '0602 Ecology', '0603 Evolutionary Biology', '0604 Genetics', '0605 Microbiology', '0606 Physiology', '0607 Plant Biology', '0608 Zoology', '0699 Other Biological Sciences', '0701 Agriculture, Land and Farm Management', '0702 Animal Production', '0703 Crop and Pasture Production', '0704 Fisheries Sciences', '0705 Forestry Sciences', '0706 Horticultural Production', '0707 Veterinary Sciences', '0799 Other Agricultural and Veterinary Sciences', '0801 Artificial Intelligence and Image Processing', '0802 Computation Theory and Mathematics', '0803 Computer Software', '0804 Data Format', '0805 Distributed Computing', '0806 Information Systems', '0807 Library and Information Studies', '0899 Other Information and Computing Sciences', '0901 Aerospace Engineering', '0902 Automotive Engineering', '0903 Biomedical Engineering', '0904 Chemical Engineering', '0905 Civil Engineering', '0906 Electrical and Electronic Engineering', '0907 Environmental Engineering', '0908 Food Sciences', '0909 Geomatic Engineering', '0910 Manufacturing Engineering', '0911 Maritime Engineering', '0912 Materials Engineering', '0913 Mechanical Engineering', '0914 Resources Engineering and Extractive Metallurgy', '0915 Interdisciplinary Engineering', '0999 Other Engineering', '1001 Agricultural Biotechnology', '1002 Environmental Biotechnology', '1003 Industrial Biotechnology', '1004 Medical Biotechnology', '1005 Communications Technologies', '1006 Computer Hardware', '1007 Nanotechnology', '1099 Other Technology', '1101 Medical Biochemistry and Metabolomics', '1102 Cardiorespiratory Medicine and Haematology', '1103 Clinical Sciences', '1104 Complementary and Alternative Medicine', '1105 Dentistry', '1106 Human Movement and Sports Science', '1107 Immunology', '1108 Medical Microbiology', '1109 Neurosciences', '1110 Nursing', '1111 Nutrition and Dietetics', '1112 Oncology and Carcinogenesis', '1113 Ophthalmology and Optometry', '1114 Paediatrics and Reproductive Medicine', '1115 Pharmacology and Pharmaceutical Sciences', '1116 Medical Physiology', '1117 Public Health and Health Services', '1199 Other Medical and Health Sciences', '1201 Architecture', '1202 Building', '1203 Design Practice and Management', '1205 Urban and Regional Planning', '1299 Other Built Environment and Design', '1301 Education Systems', '1302 Curriculum and Pedagogy', '1303 Specialist Studies In Education', '1399 Other Education', '1401 Economic Theory', '1402 Applied Economics', '1403 Econometrics', '1499 Other Economics', '1501 Accounting, Auditing and Accountability', '1502 Banking, Finance and Investment', '1503 Business and Management', '1504 Commercial Services', '1505 Marketing', '1506 Tourism', '1507 Transportation and Freight Services', '1601 Anthropology', '1602 Criminology', '1603 Demography', '1604 Human Geography', '1605 Policy and Administration', '1606 Political Science', '1607 Social Work', '1608 Sociology', '1699 Other Studies In Human Society', '1701 Psychology', '1702 Cognitive Sciences', '1799 Other Psychology and Cognitive Sciences', '1801 Law', '1899 Other Law and Legal Studies', '1901 Art Theory and Criticism', '1902 Film, Television and Digital Media', '1903 Journalism and Professional Writing', '1904 Performing Arts and Creative Writing', '1905 Visual Arts and Crafts', '1999 Other Studies In Creative Arts and Writing', '2001 Communication and Media Studies', '2002 Cultural Studies', '2003 Language Studies', '2004 Linguistics', '2005 Literary Studies', '2099 Other Language, Communication and Culture', '2101 Archaeology', '2102 Curatorial and Related Studies', '2103 Historical Studies', '2199 Other History and Archaeology', '2201 Applied Ethics', '2202 History and Philosophy of Specific Fields', '2203 Philosophy', '2204 Religion and Religious Studies', '2299 Other Philosophy and Religious Studies']
subject3 = "0202 Atomic, Molecular, Nuclear, Particle and Plasma Physics"  #@param ['None', '0101 Pure Mathematics', '0102 Applied Mathematics', '0103 Numerical and Computational Mathematics', '0104 Statistics', '0105 Mathematical Physics', '0201 Astronomical and Space Sciences', '0202 Atomic, Molecular, Nuclear, Particle and Plasma Physics', '0203 Classical Physics', '0204 Condensed Matter Physics', '0205 Optical Physics', '0206 Quantum Physics', '0299 Other Physical Sciences', '0301 Analytical Chemistry', '0302 Inorganic Chemistry', '0303 Macromolecular and Materials Chemistry', '0304 Medicinal and Biomolecular Chemistry', '0305 Organic Chemistry', '0306 Physical Chemistry (incl. Structural)', '0307 Theoretical and Computational Chemistry', '0399 Other Chemical Sciences', '0401 Atmospheric Sciences', '0402 Geochemistry', '0403 Geology', '0404 Geophysics', '0405 Oceanography', '0406 Physical Geography and Environmental Geoscience', '0499 Other Earth Sciences', '0501 Ecological Applications', '0502 Environmental Science and Management', '0503 Soil Sciences', '0599 Other Environmental Sciences', '0601 Biochemistry and Cell Biology', '0602 Ecology', '0603 Evolutionary Biology', '0604 Genetics', '0605 Microbiology', '0606 Physiology', '0607 Plant Biology', '0608 Zoology', '0699 Other Biological Sciences', '0701 Agriculture, Land and Farm Management', '0702 Animal Production', '0703 Crop and Pasture Production', '0704 Fisheries Sciences', '0705 Forestry Sciences', '0706 Horticultural Production', '0707 Veterinary Sciences', '0799 Other Agricultural and Veterinary Sciences', '0801 Artificial Intelligence and Image Processing', '0802 Computation Theory and Mathematics', '0803 Computer Software', '0804 Data Format', '0805 Distributed Computing', '0806 Information Systems', '0807 Library and Information Studies', '0899 Other Information and Computing Sciences', '0901 Aerospace Engineering', '0902 Automotive Engineering', '0903 Biomedical Engineering', '0904 Chemical Engineering', '0905 Civil Engineering', '0906 Electrical and Electronic Engineering', '0907 Environmental Engineering', '0908 Food Sciences', '0909 Geomatic Engineering', '0910 Manufacturing Engineering', '0911 Maritime Engineering', '0912 Materials Engineering', '0913 Mechanical Engineering', '0914 Resources Engineering and Extractive Metallurgy', '0915 Interdisciplinary Engineering', '0999 Other Engineering', '1001 Agricultural Biotechnology', '1002 Environmental Biotechnology', '1003 Industrial Biotechnology', '1004 Medical Biotechnology', '1005 Communications Technologies', '1006 Computer Hardware', '1007 Nanotechnology', '1099 Other Technology', '1101 Medical Biochemistry and Metabolomics', '1102 Cardiorespiratory Medicine and Haematology', '1103 Clinical Sciences', '1104 Complementary and Alternative Medicine', '1105 Dentistry', '1106 Human Movement and Sports Science', '1107 Immunology', '1108 Medical Microbiology', '1109 Neurosciences', '1110 Nursing', '1111 Nutrition and Dietetics', '1112 Oncology and Carcinogenesis', '1113 Ophthalmology and Optometry', '1114 Paediatrics and Reproductive Medicine', '1115 Pharmacology and Pharmaceutical Sciences', '1116 Medical Physiology', '1117 Public Health and Health Services', '1199 Other Medical and Health Sciences', '1201 Architecture', '1202 Building', '1203 Design Practice and Management', '1205 Urban and Regional Planning', '1299 Other Built Environment and Design', '1301 Education Systems', '1302 Curriculum and Pedagogy', '1303 Specialist Studies In Education', '1399 Other Education', '1401 Economic Theory', '1402 Applied Economics', '1403 Econometrics', '1499 Other Economics', '1501 Accounting, Auditing and Accountability', '1502 Banking, Finance and Investment', '1503 Business and Management', '1504 Commercial Services', '1505 Marketing', '1506 Tourism', '1507 Transportation and Freight Services', '1601 Anthropology', '1602 Criminology', '1603 Demography', '1604 Human Geography', '1605 Policy and Administration', '1606 Political Science', '1607 Social Work', '1608 Sociology', '1699 Other Studies In Human Society', '1701 Psychology', '1702 Cognitive Sciences', '1799 Other Psychology and Cognitive Sciences', '1801 Law', '1899 Other Law and Legal Studies', '1901 Art Theory and Criticism', '1902 Film, Television and Digital Media', '1903 Journalism and Professional Writing', '1904 Performing Arts and Creative Writing', '1905 Visual Arts and Crafts', '1999 Other Studies In Creative Arts and Writing', '2001 Communication and Media Studies', '2002 Cultural Studies', '2003 Language Studies', '2004 Linguistics', '2005 Literary Studies', '2099 Other Language, Communication and Culture', '2101 Archaeology', '2102 Curatorial and Related Studies', '2103 Historical Studies', '2199 Other History and Archaeology', '2201 Applied Ethics', '2202 History and Philosophy of Specific Fields', '2203 Philosophy', '2204 Religion and Religious Studies', '2299 Other Philosophy and Religious Studies']
subject4 = "0304 Medicinal and Biomolecular Chemistry"  #@param ['None', '0101 Pure Mathematics', '0102 Applied Mathematics', '0103 Numerical and Computational Mathematics', '0104 Statistics', '0105 Mathematical Physics', '0201 Astronomical and Space Sciences', '0202 Atomic, Molecular, Nuclear, Particle and Plasma Physics', '0203 Classical Physics', '0204 Condensed Matter Physics', '0205 Optical Physics', '0206 Quantum Physics', '0299 Other Physical Sciences', '0301 Analytical Chemistry', '0302 Inorganic Chemistry', '0303 Macromolecular and Materials Chemistry', '0304 Medicinal and Biomolecular Chemistry', '0305 Organic Chemistry', '0306 Physical Chemistry (incl. Structural)', '0307 Theoretical and Computational Chemistry', '0399 Other Chemical Sciences', '0401 Atmospheric Sciences', '0402 Geochemistry', '0403 Geology', '0404 Geophysics', '0405 Oceanography', '0406 Physical Geography and Environmental Geoscience', '0499 Other Earth Sciences', '0501 Ecological Applications', '0502 Environmental Science and Management', '0503 Soil Sciences', '0599 Other Environmental Sciences', '0601 Biochemistry and Cell Biology', '0602 Ecology', '0603 Evolutionary Biology', '0604 Genetics', '0605 Microbiology', '0606 Physiology', '0607 Plant Biology', '0608 Zoology', '0699 Other Biological Sciences', '0701 Agriculture, Land and Farm Management', '0702 Animal Production', '0703 Crop and Pasture Production', '0704 Fisheries Sciences', '0705 Forestry Sciences', '0706 Horticultural Production', '0707 Veterinary Sciences', '0799 Other Agricultural and Veterinary Sciences', '0801 Artificial Intelligence and Image Processing', '0802 Computation Theory and Mathematics', '0803 Computer Software', '0804 Data Format', '0805 Distributed Computing', '0806 Information Systems', '0807 Library and Information Studies', '0899 Other Information and Computing Sciences', '0901 Aerospace Engineering', '0902 Automotive Engineering', '0903 Biomedical Engineering', '0904 Chemical Engineering', '0905 Civil Engineering', '0906 Electrical and Electronic Engineering', '0907 Environmental Engineering', '0908 Food Sciences', '0909 Geomatic Engineering', '0910 Manufacturing Engineering', '0911 Maritime Engineering', '0912 Materials Engineering', '0913 Mechanical Engineering', '0914 Resources Engineering and Extractive Metallurgy', '0915 Interdisciplinary Engineering', '0999 Other Engineering', '1001 Agricultural Biotechnology', '1002 Environmental Biotechnology', '1003 Industrial Biotechnology', '1004 Medical Biotechnology', '1005 Communications Technologies', '1006 Computer Hardware', '1007 Nanotechnology', '1099 Other Technology', '1101 Medical Biochemistry and Metabolomics', '1102 Cardiorespiratory Medicine and Haematology', '1103 Clinical Sciences', '1104 Complementary and Alternative Medicine', '1105 Dentistry', '1106 Human Movement and Sports Science', '1107 Immunology', '1108 Medical Microbiology', '1109 Neurosciences', '1110 Nursing', '1111 Nutrition and Dietetics', '1112 Oncology and Carcinogenesis', '1113 Ophthalmology and Optometry', '1114 Paediatrics and Reproductive Medicine', '1115 Pharmacology and Pharmaceutical Sciences', '1116 Medical Physiology', '1117 Public Health and Health Services', '1199 Other Medical and Health Sciences', '1201 Architecture', '1202 Building', '1203 Design Practice and Management', '1205 Urban and Regional Planning', '1299 Other Built Environment and Design', '1301 Education Systems', '1302 Curriculum and Pedagogy', '1303 Specialist Studies In Education', '1399 Other Education', '1401 Economic Theory', '1402 Applied Economics', '1403 Econometrics', '1499 Other Economics', '1501 Accounting, Auditing and Accountability', '1502 Banking, Finance and Investment', '1503 Business and Management', '1504 Commercial Services', '1505 Marketing', '1506 Tourism', '1507 Transportation and Freight Services', '1601 Anthropology', '1602 Criminology', '1603 Demography', '1604 Human Geography', '1605 Policy and Administration', '1606 Political Science', '1607 Social Work', '1608 Sociology', '1699 Other Studies In Human Society', '1701 Psychology', '1702 Cognitive Sciences', '1799 Other Psychology and Cognitive Sciences', '1801 Law', '1899 Other Law and Legal Studies', '1901 Art Theory and Criticism', '1902 Film, Television and Digital Media', '1903 Journalism and Professional Writing', '1904 Performing Arts and Creative Writing', '1905 Visual Arts and Crafts', '1999 Other Studies In Creative Arts and Writing', '2001 Communication and Media Studies', '2002 Cultural Studies', '2003 Language Studies', '2004 Linguistics', '2005 Literary Studies', '2099 Other Language, Communication and Culture', '2101 Archaeology', '2102 Curatorial and Related Studies', '2103 Historical Studies', '2199 Other History and Archaeology', '2201 Applied Ethics', '2202 History and Philosophy of Specific Fields', '2203 Philosophy', '2204 Religion and Religious Studies', '2299 Other Philosophy and Religious Studies']
connector = "or"  #@param ['or', 'and']
#
#
#
#
nowfolder = datetime.now().strftime("%Y-%m-%d %H.%M.%S")
output_folder_name = nowfolder + "-" + subject1.replace(" ", "_")

def save_locally(header, dfdata, startyear_override=None):
  "save dataframe in local folder"
  global subject1, subject2, start_year, end_year, output_folder_name
  if startyear_override:
    syear = startyear_override
  else:
    syear = start_year
  if not os.path.exists(output_folder_name):
    os.mkdir(output_folder_name)
  if subject2:
    subject2 = "_" + subject2.replace(" ", "_")
  else:
    subject2 = ""
  export = header + "_" + subject1.replace(" ", "_") + subject2 + "_" + str(syear) + "_" + str(end_year) + ".csv"
  dfdata.to_csv(output_folder_name + "/" + export)
  print("\n..saved file '" + export)

#
#

if end_year < start_year :
  end_year = start_year

#
#
subjects = [s for s in [subject1, subject2, subject3, subject4] if s!= "None"]

cat_query = connector.join(f""" category_for.name="{x}" """ for x in subjects)
cat_query_grants = connector.join(f""" FOR.name="{x}" """ for x in subjects)


# print/save parameters summary
README = f"""*****\n*****\nSelection:\nSubject1  : {subject1}\nSubject2  : {subject2} \
      \nSubject3  : {subject3}\nSubject4  : {subject4} \
      \nConnector : {connector}\nYears     : {start_year} - {end_year}\n*****\n*****\n"""

if not os.path.exists(output_folder_name):
  os.mkdir(output_folder_name)
with open(output_folder_name + "/README.txt", "w") as f:
  f.write(README)

print(README)

*****
*****
Selection:
Subject1  : 1006 Computer Hardware
Subject2  : 0604 Genetics       
Subject3  : 0202 Atomic, Molecular, Nuclear, Particle and Plasma Physics
Subject4  : 0304 Medicinal and Biomolecular Chemistry       
Connector : or
Years     : 2015 - 2020
*****
*****



## 1. Data Competitors: Run extraction


In [14]:
#@title
print("\nData Competitors\n...extracting data..\n")
#
#
#
#
q1 = """search publications where year in [{}:{}] and {} return publisher aggregate count, rcr_avg, fcr_gavg, altmetric_median limit 1000"""
df = dsl.query(q1.format(start_year, end_year, cat_query)).as_dataframe()
if len(df) == 0:
  print("WARNING: no data found for the parameters selected!")
else:
  # print(q1.format(start_year, end_year, cat_query))
  df.rename(columns={"id": "name", "count": "pubs"}, inplace=True)
  #df
  q2 = """search publications where altmetric > 0 and year in [{}:{}] and {} return publisher aggregate count limit 1000"""
  df2 = dsl.query(q2.format(start_year, end_year, cat_query), verbose=False).as_dataframe()
  df2.rename(columns={"id": "name", "count": "pubs_altmetric"}, inplace=True)
  #df2
  df3 = pd.merge(df, df2, how='left', on="name")
  df3 = df3[['name', 'pubs', 'fcr_gavg', 'rcr_avg', 'altmetric_median', 'pubs_altmetric']]
  df3 = df3.fillna(0) # fill empty values
  df3['pubs_altmetric'] = df3['pubs_altmetric'].astype('int') # make col int (from float)
  df3['pubs_altmetric_prc'] = ((df3['pubs_altmetric'] * 100) / df3['pubs']).round(2) # add % representation
  #

  save_locally("DataCompetitor", df3)
  df3


Data Competitors
...extracting data..

Returned Publisher: 1000

..saved file 'DataCompetitor_1006_Computer_Hardware_0604_Genetics_2015_2020.csv


## 2. Data Growth

In [15]:
#@title
print("\nData Growth\n...extracting data..\n")
#
#
#
#
years = [x for x in range(start_year_growth, end_year+1)]
row_labels = ['Total Publications', 'Closed', 'All OA', 'Green, Submitted','Pure Gold',  'Bronze', 'Hybrid', 'Green, Published', 'Green, Accepted' ]
df = pd.DataFrame(columns=years, index=row_labels)
#
#
#
# TIP query payload has this shape
# {'_stats': {'total_count': 13920},
#  'open_access_categories': [{'count': 8075, 'id': 'closed', 'name': 'Closed'},
#   {'count': 5845, 'id': 'oa_all', 'name': 'All OA'},
#   {'count': 2060, 'id': 'gold_bronze', 'name': 'Bronze'},
#   {'count': 2046, 'id': 'gold_pure', 'name': 'Pure Gold'},
#   {'count': 680, 'id': 'gold_hybrid', 'name': 'Hybrid'},
#   {'count': 467, 'id': 'green_acc', 'name': 'Green, Accepted'},
#   {'count': 448, 'id': 'green_pub', 'name': 'Green, Published'},
#   {'count': 144, 'id': 'green_sub', 'name': 'Green, Submitted'}]}
#
#
#
for y in years:
  q1 = """search publications where year = {} and {} return open_access_categories[name]"""
  data = dsl.query(q1.format(y, cat_query))
  if data.count_total == 0:
    print("WARNING: no data found for the parameters selected!")
    break
  df.at['Total Publications', y] = data.count_total
  #
  for l in row_labels:
    for cat in data.open_access_categories:
      if cat['name'] == l:
        df.at[l, y] = cat['count']

  time.sleep(1)
df = df.fillna(0) # fill empty values

#
save_locally("DataGrowth", df, start_year_growth)
#
df



Data Growth
...extracting data..

Returned Open_access_categories: 8
Returned Open_access_categories: 8
Returned Open_access_categories: 8
Returned Open_access_categories: 8
Returned Open_access_categories: 8
Returned Open_access_categories: 8
Returned Open_access_categories: 8
Returned Open_access_categories: 8
Returned Open_access_categories: 8
Returned Open_access_categories: 8
Returned Open_access_categories: 8

..saved file 'DataGrowth_1006_Computer_Hardware__0604_Genetics_2010_2020.csv


Unnamed: 0,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
Total Publications,3506551,3506565,3506581,3506708,3506640,3506692,3506690,3506725,3506663,3506762,3505532
Closed,2023271,2023289,2023293,2023424,2023384,2023389,2023407,2023410,2023378,2023570,2022453
All OA,1483280,1483276,1483288,1483284,1483256,1483303,1483283,1483315,1483285,1483192,1483079
"Green, Submitted",402965,402962,402961,402959,402961,402952,402941,402932,402935,402923,402872
Pure Gold,259386,259384,259411,259391,259382,259400,259390,259403,259413,259397,259367
Bronze,583732,583749,583728,583740,583715,583735,583737,583738,583695,583698,583678
Hybrid,109577,109573,109578,109581,109578,109589,109590,109609,109617,109575,109570
"Green, Published",83262,83252,83255,83255,83257,83268,83260,83266,83262,83243,83240
"Green, Accepted",44358,44356,44355,44358,44363,44359,44365,44367,44363,44356,44352


## 3. Data Top Target Institutions

In [16]:
# PS version updated on 2019-11-29 so to include pubs with altmetric too
# this means adding another extraction query specifically for pubs altmetric > 0

#@title
print("\nData Top Target Institutions\n...extracting data..\n")
#
#
#
# two queries, an extra one only for altmetric publications count
q1 = """search publications where year in [{}:{}] and {} return funders aggregate count, citations_total, rcr_avg, fcr_gavg, altmetric_median limit 1000"""
q1B = """search publications where altmetric > 0 and year in [{}:{}] and {} return funders[name] aggregate count limit 1000"""
# calc df
df = dsl.query(q1.format(start_year, end_year, cat_query)).as_dataframe()
if len(df) == 0:
  print("WARNING: no data found for the parameters selected!")
else:
  # continue
  dfB = dsl.query(q1B.format(start_year, end_year, cat_query)).as_dataframe()
  # rename
  df.rename(columns={"country_name": "country", "count": "pubs"}, inplace=True)
  dfB.rename(columns={"count": "pubs_altmetric"}, inplace=True)
  # merge using 'id' 
  dfmerged = pd.merge(df, dfB[['id', 'pubs_altmetric']], how='left', on="id")
  # rename, sort and calc altmetric percentage
  dfmerged = dfmerged.fillna(0) # fill empty values
  dfmerged['pubs_altmetric'] = dfmerged['pubs_altmetric'].astype('int') # make col int (from float)
  dfmerged['pubs_altmetric_prc'] = ((dfmerged['pubs_altmetric'] * 100) / dfmerged['pubs']).round(2) # add % representation
  dfmerged = dfmerged[['id', 'name', 'country', 'pubs', 'pubs_altmetric', 'pubs_altmetric_prc', 'citations_total', 'rcr_avg', 'fcr_gavg', 'altmetric_median', ]].copy()
  dfmerged['type'] = "funders"
  dfmerged


  #
  #
  q2 = """search publications where year in [{}:{}] and {} return research_orgs aggregate count, citations_total, rcr_avg, fcr_gavg, altmetric_median limit 1000"""
  q2B = """search publications where altmetric > 0 and year in [{}:{}] and {} return research_orgs[name] aggregate count limit 1000"""
  df2 = dsl.query(q2.format(start_year, end_year, cat_query)).as_dataframe()
  df2B = dsl.query(q2B.format(start_year, end_year, cat_query)).as_dataframe()
  df2.rename(columns={"country_name": "country", "count": "pubs"}, inplace=True)
  df2B.rename(columns={"count": "pubs_altmetric"}, inplace=True)
  # merge using 'name' 
  df2merged = pd.merge(df2, df2B[['id', 'pubs_altmetric']], how='left', on="id")
  # rename, sort and calc altmetric percentage
  df2merged = df2merged.fillna(0) # fill empty values
  df2merged['pubs_altmetric'] = df2merged['pubs_altmetric'].astype('int') # make col int (from float)
  df2merged['pubs_altmetric_prc'] = ((df2merged['pubs_altmetric'] * 100) / df2merged['pubs']).round(2) # add % representation
  df2merged = df2merged[['id', 'name', 'country', 'pubs', 'pubs_altmetric', 'pubs_altmetric_prc', 'citations_total', 'rcr_avg', 'fcr_gavg', 'altmetric_median', ]].copy()
  df2merged['type'] = "research_organizations"
  df2merged

  #
  # append second query to first
  #
  dfmerged = dfmerged.append(df2merged, ignore_index=True)
  #
  #
  #
  q3 = """search publications where open_access_categories.id="oa_all" and year in [{}:{}] and {} return research_orgs aggregate count, citations_total, rcr_avg, fcr_gavg, altmetric_median limit 1000"""
  q3B = """search publications where altmetric > 0 and open_access_categories.id="oa_all" and year in [{}:{}] and {} return research_orgs[name] aggregate count limit 1000"""
  df3 = dsl.query(q3.format(start_year, end_year, cat_query)).as_dataframe()
  df3B = dsl.query(q3B.format(start_year, end_year, cat_query)).as_dataframe()
  df3.rename(columns={"country_name": "country", "count": "pubs"}, inplace=True)
  df3B.rename(columns={"count": "pubs_altmetric"}, inplace=True)
  # merge using 'name' 
  df3merged = pd.merge(df3, df3B[['id', 'pubs_altmetric']], how='left', on="id")
  # rename, sort and calc altmetric percentage
  df3merged = df3merged.fillna(0) # fill empty values
  df3merged['pubs_altmetric'] = df3merged['pubs_altmetric'].astype('int') # make col int (from float)
  df3merged['pubs_altmetric_prc'] = ((df3merged['pubs_altmetric'] * 100) / df3merged['pubs']).round(2) # add % representation
  df3merged = df3merged[['id', 'name', 'country', 'pubs', 'pubs_altmetric', 'pubs_altmetric_prc', 'citations_total', 'rcr_avg', 'fcr_gavg', 'altmetric_median', ]].copy()
  df3merged['type'] = "research_organizations_all_oa"
  df3merged

  #
  # append thid query to first
  #
  dfmerged = dfmerged.append(df3merged, ignore_index=True)
  # #

  #
  save_locally("DataTopTargetInstitutions", dfmerged)
  #
  dfmerged


Data Top Target Institutions
...extracting data..

Returned Funders: 1000
Returned Funders: 1000
Returned Research_orgs: 1000
Returned Research_orgs: 1000
Returned Research_orgs: 1000
Returned Research_orgs: 1000

..saved file 'DataTopTargetInstitutions_1006_Computer_Hardware___0604_Genetics_2015_2020.csv


## 4. Data Top Countries 

In [17]:
#@title
print("\nData Top Countries\n...extracting data..\n")
#
#
#
#
q1 = """search publications where year in [{}:{}] and {} return research_org_countries aggregate citations_total limit 1000"""
df = dsl.query(q1.format(start_year, end_year, cat_query)).as_dataframe()
if len(df) == 0:
  print("WARNING: no data found for the parameters selected!")
else:
  df.rename(columns={"count": "pubs", "citations_total": "citations"}, inplace=True)
  df = df[['name', 'pubs', 'citations', ]]
  #df
  #
  #
  # search grants now
  #
  q2 = """search grants where start_year in [{}:{}] and {} return research_org_countries aggregate funding limit 1000"""
  df2 = dsl.query(q2.format(start_year, end_year, cat_query_grants), verbose=True).as_dataframe()
  df2.rename(columns={"count": "grants"}, inplace=True)
  #df2
  if not df2.empty:
    df3 = pd.merge(df, df2[['name', 'grants', 'funding']], how='left', on="name")
  else:
    df3 = df2
  df3 = df3.fillna(0) # fill empty values
  #
  #
  save_locally("DataTopCountries", df3)
  #
  #
  df3


Data Top Countries
...extracting data..

Returned Research_org_countries: 242
Returned Research_org_countries: 140
Field 'FOR' is deprecated in favor of category_for. Please refer to https://docs.dimensions.ai/dsl/releasenotes.html for more details

..saved file 'DataTopCountries_1006_Computer_Hardware____0604_Genetics_2015_2020.csv


## Downloading the data

In [19]:
import zipfile
import os 
from google.colab import files

def zipdir(path, ziph):
    # ziph is zipfile handle
    for root, dirs, files in os.walk(path):
        for file in files:
            ziph.write(os.path.join(root, file))

zipf = zipfile.ZipFile(output_folder_name + '.zip', 'w', zipfile.ZIP_DEFLATED)
zipdir(output_folder_name + '/', zipf)
zipf.close()

try:
  files.download(output_folder_name + '.zip')
  print("\n===\nDone.")
except:
  print("ERROR - Google Colab couldn't download - please try again...")


===
Done.
