# Language statistics processing 

This is a Jupyter Notebook file showing the Python code used for processing the texts and running textstat analysis. The work done in this Notebook includes:
* Creating a single combined text for each source from the multiple websites' HTML. 
    * e.g. A single "ACRL" text was created out of the three files that contained the HTML from three different websites from ACRL explaining open access. 
* Counting the number of links inside of the files by counting instances of "href". 
* Cleaning the text files of HTML so that only the language is analyzed (to avoid counting HTML code in word counts, for example). 
* Using the textstat package to count words and sentences that appear in each *combined text file*. These measures can be used to calculate the average number of words per sentence. 
    * textstat was found from [this website] (https://pypi.org/project/textstat/) and imported from their github repository. 
* Running a variety of text complexity measures available in textstat. 
    * For more information on the complexity measures available in textstat, see [their website] (https://pypi.org/project/textstat/). 

In [1]:
import os
os.chdir('texts')

## Creating the texts

The next several cells show the combining of multiple text files into one single text per source. 

In [2]:
acrl1 = open('acrl_primary.txt', encoding="utf-8").read()
acrl2 = open('acrl_secondary.txt', encoding="utf-8").read()
acrl3 = open('acrl_tertiary.txt', encoding="utf-8").read()
acrl = acrl1+acrl2+acrl3

In [3]:
bmj1 = open('bmj_primary.txt', encoding="utf-8").read()
bmj2 = open('bmj_secondary.txt', encoding="utf-8").read()
bmj3 = open('bmj_tertiary.txt', encoding="utf-8").read()
bmj = bmj1+bmj2+bmj3

In [4]:
brill1 = open('brill_primary.txt', encoding="utf-8").read()
brill2 = open('brill_secondary.txt', encoding="utf-8").read()
brill3 = open('brill_tertiary.txt', encoding="utf-8").read()
brill = brill1+brill2+brill3

In [5]:
cambridge1 = open('cambridge_primary.txt', encoding="utf-8").read()
cambridge2 = open('cambridge_secondary.txt', encoding="utf-8").read()
cambridge3 = open('cambridge_tertiary.txt', encoding="utf-8").read()
cambridge = cambridge1+cambridge2+cambridge3

In [6]:
cornell1 = open('cornell_primary.txt', encoding="utf-8").read()
cornell2 = open('cornell_secondary.txt', encoding="utf-8").read()
cornell3 = open('cornell_tertiary.txt', encoding="utf-8").read()
cornell = cornell1+cornell2+cornell3

In [7]:
degruyter1 = open('degruyter_primary.txt', encoding="utf-8").read()
degruyter2 = open('degruyter_secondary.txt', encoding="utf-8").read()
degruyter3 = open('degruyter_tertiary.txt', encoding="utf-8").read()
degruyter = degruyter1+degruyter2+degruyter3

In [8]:
elsevier1 = open('elsevier_primary.txt', encoding="utf-8").read()
elsevier2 = open('elsevier_secondary.txt', encoding="utf-8").read()
elsevier3 = open('elsevier_tertiary.txt', encoding="utf-8").read()
elsevier = elsevier1+elsevier2+elsevier3

In [9]:
harvard1 = open('harvard_primary.txt', encoding="utf-8").read()
harvard2 = open('harvard_secondary.txt', encoding="utf-8").read()
harvard3 = open('harvard_tertiary.txt', encoding="utf-8").read()
harvard = harvard1+harvard2+harvard3

In [10]:
iop1 = open('iop_primary.txt', encoding="utf-8").read()
iop2 = open('iop_secondary.txt', encoding="utf-8").read()
iop3 = open('iop_tertiary.txt', encoding="utf-8").read()
iop = iop1+iop2+iop3

In [11]:
ku1 = open('ku_primary.txt', encoding="utf-8").read()
ku2 = open('ku_secondary.txt', encoding="utf-8").read()
ku3 = open('ku_tertiary.txt', encoding="utf-8").read()
ku = ku1+ku2+ku3

In [12]:
mit1 = open('mit_primary.txt', encoding="utf-8").read()
mit2 = open('mit_secondary.txt', encoding="utf-8").read()
mit3 = open('mit_tertiary.txt', encoding="utf-8").read()
mit = mit1+mit2+mit3

In [13]:
nature1 = open('nature_primary.txt', encoding="utf-8").read()
nature2 = open('nature_secondary.txt', encoding="utf-8").read()
nature3 = open('nature_tertiary.txt', encoding="utf-8").read()
nature = nature1+nature2+nature3

In [14]:
oawg1 = open('oawg_primary.txt', encoding="utf-8").read()
oawg2 = open('oawg_secondary.txt', encoding="utf-8").read()
oawg3 = open('oawg_tertiary.txt', encoding="utf-8").read()
oawg = oawg1+oawg2+oawg3

In [15]:
plos1 = open('plos_primary.txt', encoding="utf-8").read()
plos2 = open('plos_secondary.txt', encoding="utf-8").read()
plos3 = open('plos_tertiary.txt', encoding="utf-8").read()
plos = plos1+plos2+plos3

In [16]:
oxford1 = open('oxford_primary.txt', encoding="utf-8").read()
oxford2 = open('oxford_secondary.txt', encoding="utf-8").read()
oxford3 = open('oxford_tertiary.txt', encoding="utf-8").read()
oxford = oxford1+oxford2+oxford3

In [17]:
oasis1 = open('oasis_primary.txt', encoding="utf-8").read()
oasis2 = open('oasis_secondary.txt', encoding="utf-8").read()
oasis3 = open('oasis_tertiary.txt', encoding="utf-8").read()
oasis = oasis1+oasis2+oasis3

In [18]:
RTRC1 = open('RTRC_primary.txt', encoding="utf-8").read()
RTRC2 = open('RTRC_secondary.txt', encoding="utf-8").read()
RTRC3 = open('RTRC_tertiary.txt', encoding="utf-8").read()
rtrc = RTRC1+RTRC2+RTRC3

In [19]:
sherpa1 = open('sherpa_primary.txt', encoding="utf-8").read()
sherpa2 = open('sherpa_secondary.txt', encoding="utf-8").read()
sherpa3 = open('sherpa_tertiary.txt', encoding="utf-8").read()
sherpa = sherpa1+sherpa2+sherpa3

In [20]:
sage1 = open('sage_primary.txt', encoding="utf-8").read()
sage2 = open('sage_secondary.txt', encoding="utf-8").read()
sage3 = open('sage_tertiary.txt', encoding="utf-8").read()
sage = sage1+sage2+sage3

In [21]:
sparc1 = open('sparc_primary.txt', encoding="utf-8").read()
sparc2 = open('sparc_secondary.txt', encoding="utf-8").read()
sparc3 = open('sparc_tertiary.txt', encoding="utf-8").read()
sparc = sparc1+sparc2+sparc3

In [22]:
springer1 = open('springer_primary.txt', encoding="utf-8").read()
springer2 = open('springer_secondary.txt', encoding="utf-8").read()
springer3 = open('springer_tertiary.txt', encoding="utf-8").read()
springer = springer1+springer2+springer3

In [23]:
tf1 = open('tf_primary.txt', encoding="utf-8").read()
tf2 = open('tf_secondary.txt', encoding="utf-8").read()
tf3 = open('tf_tertiary.txt', encoding="utf-8").read()
tf = tf1+tf2+tf3

In [24]:
wiley1 = open('wiley_primary.txt', encoding="utf-8").read()
wiley2 = open('wiley_secondary.txt', encoding="utf-8").read()
wiley3 = open('wiley_tertiary.txt', encoding="utf-8").read()
wiley = wiley1+wiley2+wiley3

In [25]:
boai1 = open('boai_primary.txt', encoding="utf-8").read()
boai2 = open('boai_secondary.txt', encoding="utf-8").read()
boai = boai1+boai2

In [26]:
hindawi1 = open('hindawi_primary.txt', encoding="utf-8").read()
hindawi2 = open('hindawi_tertiary.txt', encoding="utf-8").read()
hindawi = hindawi1+hindawi2

In [27]:
proquest1 = open('proquest_primary.txt', encoding="utf-8").read()
proquest2 = open('proquest_secondary.txt', encoding="utf-8").read()
proquest = proquest1+proquest2

In [28]:
creativecommons = open('creativecommons_primary.txt', encoding="utf-8").read()

In [29]:
suber = open('suber_primary.txt', encoding="utf-8").read()

In [30]:
wikipedia = open('wikipedia_primary.txt', encoding="utf-8").read()

# Making the "publist" loop
In this section, we make a loop for all of the publications by putting them into one list. This will make our lives easier in the next section by not having to execute the same operation 30 times. 

In [31]:
publist = [("ACRL", acrl), ("BOAI", boai), ("BRILL", brill), ("ELSEVIER", elsevier), ("SPRINGER", springer), ("SAGE", sage), ("IOP", iop), ("CAMBRIDGE", cambridge), ("PROQUEST", proquest), ("TF", tf), ("OXFORD", oxford), ("OASIS", oasis), ("NATURE", nature), ("WILEY", wiley), ("DEGRUYTER", degruyter), ("BMJ", bmj), ("HINDAWI", hindawi), ("SPARC", sparc), ("HARVARD", harvard), ("RTRC", rtrc), ("SUBER", suber), ("CORNELL", cornell), ("KU", ku), ("WIKIPEDIA", wikipedia), ("CREATIVECOMMONS", creativecommons), ("OAWG", oawg), ("MIT", mit), ("SHERPA", sherpa), ("PLOS", plos)]

## Counting number of links

In the next section, we are counting the number of links by counting the number of "href"s that appear in the HTML code. These numbers were added to our analysis spreadsheet. 

In [69]:
for (pubname, pub) in publist: 
    print(pubname+",", pub.count("href"))

93
199
51
91
64
56
44
49
2
211
148
76
69
26
122
27
7
24
331
188
150
93
42
416
10
38
153
74
50


## Cleaning the texts

In this next section, we are cleaning all of the texts of HTML code by removing all text that appears between brackets. You can see previews of each text after it has been cleaned of code (this has been limited to only the first 500 characters for viewing convenience). This was used to verify that we weren't deleting anything unintentionally. 

In [33]:
import re
re.findall(r'<[^<]+>', acrl)

['<!-- BEGIN: Guide Content -->',
 '<div id="s-lg-guide-main" class="container s-lib-main s-lib-side-borders">',
 '<div class="row s-lg-row">',
 '<div id="s-lg-col-126" class="col-md-12">',
 '<div class="s-lg-col-boxes">',
 '</div>',
 '</div>',
 '</div>',
 '<div class="row s-lg-row">',
 '<div id="s-lg-col-1" class="col-md-4">',
 '<div class="s-lg-col-boxes">',
 '<div id="s-lg-box-wrapper-13748311" class="s-lg-box-wrapper-13748311">',
 '<div id="s-lg-box-11660560-container" class="s-lib-box-container">',
 '<div id="s-lg-box-11660560" class="s-lib-box s-lib-box-std">',
 '<h2 class="s-lib-box-title">',
 '</h2>',
 '<div id="s-lg-box-collapse-11660560" >',
 '<div class="s-lib-box-content">',
 '<div class="">',
 '<ul id="s-lg-link-list-28046256" class="s-lg-link-list s-lg-link-list-2">',
 '<li class="">',
 '<div id="s-lg-content-25317439" class="">',
 '<span>',
 '<a href="https://cyber.harvard.edu/~psuber/wiki/Writings_on_open_access" target="_blank"  onclick="return springSpace.springTrack.

In [34]:
len(re.findall(r'<[^<]+>', acrl))

1169

In [35]:
acrl.count('<')

1169

In [36]:
acrl_cleaned=re.sub(r'<[^<]+>', '', acrl)
print(acrl_cleaned[:500])



        
				
    
        
        
    
				

				
    
        
							
					
						
							Further Reading on Open Access
                                
							
								
									
                        
                        
			Complete list of Peter Suber writings 
				Peter Suber's complete bibliography, with most recent writings listed first.
			
                        
                        
			Transforming Scholarly Publishing Through Open Access: A Bibliography - Charles Ba


In [37]:
elsevier_cleaned=re.sub(r'<[^<]+>', '', elsevier)
print(elsevier_cleaned[:500])

﻿Your Guide to 
Publishing Open Access 
with Elsevier

2 Elsevier 
What is open access? 
The term open access was first used in 2001 when the Open Society Institute established 
what is known as the Budapest Open Access Initiative (BOAI). Their goal was to create a set of 
recommendations, which were designed to provide the public with unrestricted, free access to 
scholarly research. Since then, the term open access has been defined by different groups in 
different ways. 
In general, open acc


In [38]:
springer_cleaned=re.sub(r'<[^<]+>', '', springer)
print(springer_cleaned[:500])

﻿Open Access








SpringerOpen








Open Choice








Agreements












For Dutch authors








For UK authors








For Austrian authors








For MPG authors








For Swedish authors








For Finnish authors
















BioMed Central








Open access funding








Authors’ rights












Self-archiving policy








FAQ








Funder compliance
















Open access track record








Contact us

















Open access funding
Find out abou


In [39]:
sage_cleaned=re.sub(r'<[^<]+>', '', sage)
print(sage_cleaned[:500])

﻿ 
    What is Open Access Publishing?
A paper published via an open access (OA) route means that that research literature is free-to-view by anyone in the world via the internet and to reuse with attribution under a Creative Commons licence or equivalent.
	There are three distinct types of open access all of which are available at SAGE:
Pure ‘Gold’ Open Access Publishing: Articles are peer reviewed, selected and formally published and then made available with no subscription pay-walls. The full


In [40]:
iop_cleaned=re.sub(r'<[^<]+>', '', iop)
print(iop_cleaned[:500])

﻿• Post your pre-print into a repository with no restrictions. 
• Choose to publish on a gold open access basis in more 
than 40 of IOP’s wholly owned and partner journals. 
• Post the final published version of your article into a 
repository immediately when you choose the gold open 
access option. 
• Post your accepted manuscript into an institutional or 
subject repository after a 12-month embargo*, regardless 
of which journal you publish in (with reuse restrictions) 
– also known as green 


In [41]:
cambridge_cleaned=re.sub(r'<[^<]+>', '', cambridge)
print(cambridge_cleaned[:500])

﻿Open Access (OA) makes scholarly research permanently available online to view without restriction. OA can also allow content to be published in a way that allows readers to redistribute, re-use and adapt the content in new works. We support two types of OA:
Gold Open Access is an alternative to subscriptions and other access payments. Content is published under a Creative Commons licence that allows free access and redistribution and, in many cases, allows re-use in new or derivative works. Ty


In [42]:
proquest_cleaned=re.sub(r'<[^<]+>', '', proquest)
print(proquest_cleaned[:500])

﻿

    
				
			
												Open Access is a term used to describe content that a reader can access free of charge. With the ProQuest Open Access Publishing PLUS option, graduate students can significantly increase the reach of their research.
What are the benefits of Open Access Publishing PLUS? Open Access Publishing PLUS guarantees the widest possible exposure of your graduate research. It can also help ensure that the officially published version of your dissertation or thesis is the most w


In [43]:
tf_cleaned=re.sub(r'<[^<]+>', '', tf)
print(tf_cleaned[:500])

﻿Publishing Open Access 
with Taylor & Francis | the basics 
What is Open Access? 
Open Access (OA) means you can publish your 
research so it is free to access online as soon 
as it is published, meaning anyone can read 
(and cite) your work. 
Publishing OA also means published research can 
generally be re-used by third parties with few, or 
no, restrictions. 
Why publish OA? 
Choosing to publish OA has many benefits: 
. It can increase the discoverability of your research. 
This increased vis


In [44]:
oxford_cleaned=re.sub(r'<[^<]+>', '', oxford)
print(oxford_cleaned[:500])

﻿  
            
                    
                        
                            
                                Open Access navigation
                            
                        
                    
                
        
Frequently asked questions: Oxford Open
Find answers to frequently asked questions regarding open access policies, charges, and funder policies at OUP.

What is open access? 


What is Oxford Open? 


How will readers know which articles are available 


In [45]:
oasis_cleaned=re.sub(r'<[^<]+>', '', oasis)
print(oasis_cleaned[:500])

﻿
					Open Access: What is it and why should we have it?			
						
				
		
				
				
		
					




	
				
													General																
					








Open Access provides the means to maximise the visibility, and thus the uptake and use, of research outputs. Open Access is the immediate, online, free availability of research outputs without the severe restrictions on use commonly imposed by publisher copyright agreements. It is definitely not vanity publishing or self-publishing, nor abou


In [46]:
nature_cleaned=re.sub(r'<[^<]+>', '', nature)
print(nature_cleaned[:500])

﻿What is open access?Open access (OA) refers to free, unrestricted online access to research outputs such as journal articles and books. OA content is open to all, with no access fees.
There are two main routes to making research outputs openly accessible. Find out more here.
&nbsp;
Quick LinksWhat is open access?Benefits for authorsNature Research open accessPalgrave Macmillan open accessInstitutional supportPartner publishingContact usIf you would like more information about publishing open ac


In [47]:
brill_cleaned=re.sub(r'<[^<]+>', '', brill)
print(brill_cleaned[:500])

﻿Frequently Asked Questions - Open Access
	
		
		
		
		    
				
		    
		
			
			
				 
What is Open Access?
 Why should I publish in Open Access?
 What about the economics of Open Access?
 What are the differences between Green and Gold Open Access?
 Does Open Access have implications for the quality of the content?
 What is Brill’s policy when it comes to pricing subscription journals with a significant amount of Open Access content?
 Can my article, journal issue or book be made available in


In [48]:
wiley_cleaned=re.sub(r'<[^<]+>', '', wiley)
print(wiley_cleaned[:500])

﻿Open access options for your article

Open access articles are freely available to read, download and share. For more information about Wiley's open access options, watch our video and read the gold and green overviews below.









Gold Open Access
Green Open Access



What is it?The author pays an Article Publication Charge and the article is immediately freely available online for all to read, download, and share
What is it?The author self-archives a version of the subscription article in 


In [49]:
degruyter_cleaned=re.sub(r'<[^<]+>', '', degruyter)
print(degruyter_cleaned[:500])

﻿ 
 
De Gruyter is the largest independent academic publisher of open access books, and more than 1000 open access books are available on degruyter.com.
Authors can also publish open access across the entire journal portfolio by choosing to publish in De Gruyter’s fully open access journals or hybrid open access in any of the subscription journals.
All open access research is immediately available for free to read, download and share. Open access allows faster publication time, increased visibil


In [50]:
bmj_cleaned=re.sub(r'<[^<]+>', '', bmj)
print(bmj_cleaned[:500])

﻿
Open Access at BMJ
Solutions for Authors, Institutions and Societies.

							
						 
					
				 
				
			 
			

				 
			 
			
				
			 
				
				
					
					
				
				
				
				
				
Making research free at the point of use is critically important to advancing medical research and enabling healthcare professionals to make better decisions. We offer authors, institutions and funders the option to publish open access research across our journals, including our flagship journal, The BMJ.

			 


In [51]:
hindawi_cleaned=re.sub(r'<[^<]+>', '', hindawi)
print(hindawi_cleaned[:500])

﻿What is Open Access Publishing?"By 'open access' to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself."The Budapest Open Access Initiative »Hi


In [52]:
sparc_cleaned=re.sub(r'<[^<]+>', '', sparc)
print(sparc_cleaned[:500])

                        
                            Open Access
                            
                            Open Access is the free, immediate, online availability of research articles coupled with the rights to use these articles fully in the digital environment. Open Access ensures that anyone can access and use these results—to turn ideas into industries and breakthroughs into better lives.                        
                    
                    

                      


In [53]:
harvard_cleaned=re.sub(r'<[^<]+>', '', harvard)
print(harvard_cleaned[:500])

Open Access Policies   In 2008, Harvard's Faculty of Arts &amp; Sciences voted unanimously to give the Harvard a nonexclusive, irrevocable right to distribute their scholarly articles for any non-commercial purpose. In the years since, the remaining eight Harvard schools voted similar open-access (OA) policies; as of September 2017, four research centers have joined their number. In the words of OSC Director Peter Suber, author of Open Access, "The basic idea of OA is simple: Make research liter


In [54]:
rtrc_cleaned=re.sub(r'<[^<]+>', '', rtrc)
print(rtrc_cleaned[:500])


Luckily for students, doctors, patients, and everyone else who relies on academic journals, there is a proven alternative to costly subscription-based based journals. &#160;Using the Internet, research can be distributed to a wider audience at a very low marginal cost – the difference between what it costs to distribute an article to one person or to one million people is very small. &#160;Instead of locking information behind price barriers, research can reach anyone who needs it, regardless o


In [55]:
suber_cleaned=re.sub(r'<[^<]+>', '', suber)
print(suber_cleaned[:500])

Open Access OverviewFocusing on open access to peer-reviewed research articles and their preprints 





This is an introduction to open access (OA) for those who are new to the concept. I hope it's short enough to read, long enough to be useful, and organized to let you skip around and dive into detail only where you want detail. It doesn't cover every nuance or answer every objection. But for those who read it, it should cover enough territory to prevent the misunderstandings that delayed prog


In [56]:
boai_cleaned=re.sub(r'<[^<]+>', '', boai)
print(boai_cleaned[:500])

Prologue:

The Budapest Open Access Initiative after 10 years

Ten

years ago the Budapest Open Access Initiative launched a worldwide campaign for

open access (OA) to all new peer-reviewed research. It didn’t invent the idea

of OA. On the contrary, it deliberately drew together existing projects to

explore how they might “work together to achieve broader, deeper, and faster

success.” But the BOAI was the first initiative to use the term “open access”

for this purpose, the first to articula


In [57]:
cornell_cleaned=re.sub(r'<[^<]+>', '', cornell)
print(cornell_cleaned[:500])


Defining Open Access











Open access (OA) refers to freely available, digital, online information.&nbsp;Open access scholarly literature is free of charge and often carries less restrictive copyright and licensing barriers than traditionally published works, for both the users and the authors.&nbsp;



While OA is&nbsp;a newer form of scholarly publishing, many OA journals comply with well-established peer-review processes and maintain high publishing standards. For more information,&nbs


In [58]:
ku_cleaned=re.sub(r'<[^<]+>', '', ku)
print(ku_cleaned[:500])

What is Open Access?

Open Access is an international movement that has the goal of making peer-reviewed published scholarship available free of charge to the public and to the global scholarly community.

For additional reading about open access in general we recommend Wikipedia's well-referenced Open Access article.



    

    
        News


        
  
  
      
            
          Open Access Week 2018 Workshops    
              
          KU Libraries and the Shulenburger Office of S


In [59]:
wikipedia_cleaned=re.sub(r'<[^<]+>', '', wikipedia)
print(wikipedia_cleaned[:500])

Open access (OA) refers to research outputs which are distributed online and free of cost or other barriers,&#91;1&#93; and possibly with the addition of a Creative Commons license to promote reuse.&#91;1&#93; Open access can be applied to all forms of published research output, including peer-reviewed and non peer-reviewed academic journal articles, conference papers, theses,&#91;2&#93; book chapters,&#91;1&#93; and monographs.&#91;3&#93;
Academic articles (as historically seen in paper-based a


In [60]:
creativecommons_cleaned=re.sub(r'<[^<]+>', '', creativecommons)
print(creativecommons_cleaned[:500])

Open Access 

Open access literature is digital, online, free of charge, and free of most copyright and licensing restrictions.

There&#8217;s an incredible amount of scientific research conducted at universities and institutions around the world. Historically, the findings of this research have been published in scholarly journals. However, access to this research is typically restricted&#8211;granted only to those who are granted permission via their university affiliation, or by purchasing ac


In [61]:
oawg_cleaned=re.sub(r'<[^<]+>', '', oawg)
print(oawg_cleaned[:500])



			
				Definition of Budapest compliant open access

				

					

						&nbsp;

Budapest: Image from Wikipedia, by Christian Mehlführer
&nbsp;
The Budapest Declaration by the Budapest Open Access Initiative (BOAI) was published in 2002 and marked the beginning of the Open Access movement. The Declaration takes a strong stand on the role of Open Access to information:
“An old tradition and a new technology have converged to make possible an unprecedented public good. The old tradition is the w


In [62]:
mit_cleaned=re.sub(r'<[^<]+>', '', mit)
print(mit_cleaned[:500])


	
	
		Open access FAQ
What is open access?

Open access as discussed in relation to this policy refers to free availability of journal articles on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful, noncommercial purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to 


In [63]:
sherpa_cleaned=re.sub(r'<[^<]+>', '', sherpa)
print(sherpa_cleaned[:500])

Authors and Open Access
  What is Open Access?&nbsp; |&nbsp; 
    Funders' Grant Rules  |&nbsp; Journals' 
    Copyright Rules  |&nbsp; Which Repository? |&nbsp; 
    Assistance
  What is Open Access?
  Open Access is . . .
    If an article is &quot;Open Access&quot; it means that it can be freely accessed 
    by anyone in the world using an internet connection. This means that the potential 
    readership of Open Access articles is far, far greater than that for articles 
    where the full-


In [64]:
plos_cleaned=re.sub(r'<[^<]+>', '', plos)
print(plos_cleaned[:500])

﻿Benefits of Open Access Journals 


Open Access stands for unrestricted access and unrestricted reuse. 

 

Accelerated Discovery 

With Open Access, researchers can read and build on the findings of others without 
restriction. 

 

Public Enrichment 

Much scientific and medical research is paid for with public funds. Open Access allows 
taxpayers to see the results of their investment. 

 

Improved Education 

Open Access means that teachers and their students have access to the latest rese


## Counting words and sentences

In this next section we first create a new loop of all publisher names with cleaned files (to save effort), then import the textstat package, and count the number of words and sentences in each text.

In [65]:
publist_cleaned = [("ACRL", acrl_cleaned), ("BOAI", boai_cleaned), ("BRILL", brill_cleaned), ("ELSEVIER", elsevier_cleaned), ("SPRINGER", springer_cleaned), ("SAGE", sage_cleaned), ("IOP", iop_cleaned), ("CAMBRIDGE", cambridge_cleaned), ("PROQUEST", proquest_cleaned), ("TF", tf_cleaned), ("OXFORD", oxford_cleaned), ("OASIS", oasis), ("NATURE", nature_cleaned), ("WILEY", wiley_cleaned), ("DEGRUYTER", degruyter_cleaned), ("BMJ", bmj_cleaned), ("HINDAWI", hindawi_cleaned), ("SPARC", sparc_cleaned), ("HARVARD", harvard_cleaned), ("RTRC", rtrc_cleaned), ("SUBER", suber_cleaned), ("CORNELL", cornell_cleaned), ("KU", ku_cleaned), ("WIKIPEDIA", wikipedia_cleaned), ("CREATIVECOMMONS", creativecommons_cleaned), ("OAWG", oawg_cleaned), ("MIT", mit_cleaned), ("SHERPA", sherpa_cleaned), ("PLOS", plos_cleaned)]

In [66]:
import textstat

In [67]:
for (pubname, pub) in publist_cleaned:
    print(pubname+",", textstat.lexicon_count(pub, removepunct=True),",", textstat.sentence_count(pub))

ACRL, 3085 , 52
BOAI, 8889 , 255
BRILL, 2219 , 94
ELSEVIER, 6547 , 111
SPRINGER, 4147 , 86
SAGE, 2921 , 122
IOP, 2228 , 49
CAMBRIDGE, 1586 , 59
PROQUEST, 631 , 15
TF, 9641 , 375
OXFORD, 2872 , 81
OASIS, 2690 , 63
NATURE, 2229 , 56
WILEY, 1031 , 20
DEGRUYTER, 734 , 16
BMJ, 1794 , 44
HINDAWI, 818 , 28
SPARC, 1945 , 61
HARVARD, 8202 , 311
RTRC, 7110 , 94
SUBER, 6029 , 194
CORNELL, 2216 , 39
KU, 1828 , 50
WIKIPEDIA, 5205 , 101
CREATIVECOMMONS, 544 , 18
OAWG, 1953 , 83
MIT, 5761 , 174
SHERPA, 3841 , 90
PLOS, 1788 , 40


## Complexity Tests

In the next section, we run three analyses of complexity of the texts: the SMOG test, the Flesch Reading Ease test, and a "combined" test that looks at all of the complexity tests available in textstat. 
These use the same publist_cleaned loop from the previous section, and the new complexity counts are added to the word and sentence counts from the previous section for easy CSV creation. 

In [68]:
for (pubname, pub) in publist_cleaned:
    print(pubname+",", textstat.lexicon_count(pub, removepunct=True),",", textstat.sentence_count(pub), ",", textstat.smog_index(pub), ",", textstat.flesch_reading_ease(pub), ",", textstat.text_standard(pub, float_output=True))

ACRL, 3085 , 52 , 24.6 , -115.62 , 25.0
BOAI, 8889 , 255 , 17.1 , 36.05 , 17.0
BRILL, 2219 , 94 , 14.1 , 22.14 , 16.0
ELSEVIER, 6547 , 111 , 20.7 , -5.34 , 13.0
SPRINGER, 4147 , 86 , 20.1 , 14.09 , 23.0
SAGE, 2921 , 122 , 14.2 , 47.22 , 13.0
IOP, 2228 , 49 , 18.6 , 25.29 , 21.0
CAMBRIDGE, 1586 , 59 , 14.2 , 44.17 , 14.0
PROQUEST, 631 , 15 , 20.2 , 3.36 , 20.0
TF, 9641 , 375 , 14.4 , 45.39 , 14.0
OXFORD, 2872 , 81 , 17.1 , 10.06 , 18.0
OASIS, 2690 , 63 , 19.5 , -14.18 , 20.0
NATURE, 2229 , 56 , 18.2 , 22.62 , 18.0
WILEY, 1031 , 20 , 19.9 , -14.75 , 20.0
DEGRUYTER, 734 , 16 , 22.0 , -212.0 , 22.0
BMJ, 1794 , 44 , 18.2 , 21.6 , 13.0
HINDAWI, 818 , 28 , 17.8 , 24.92 , 17.0
SPARC, 1945 , 61 , 16.0 , -3.21 , 16.0
HARVARD, 8202 , 311 , 14.8 , 44.68 , 14.0
RTRC, 7110 , 94 , 25.4 , -22.19 , 17.0
SUBER, 6029 , 194 , 16.2 , 39.91 , 16.0
CORNELL, 2216 , 39 , 24.0 , -11.57 , 24.0
KU, 1828 , 50 , 16.4 , 17.41 , 14.0
WIKIPEDIA, 5205 , 101 , 21.9 , 2.28 , 24.0
CREATIVECOMMONS, 544 , 18 , 17.3 , 23.9 ,

# The End 

This is the end of the Jupyter notebook illustration of our code. All results that you see here appear in the analysis_numbers.csv file in our GitHub repository. 