# REF Impact Cleaner

This code iterates through the REF Impact case study text via the REF API and gathers it into one document, formatted for Iramuteq analysis (completed separately).

Other versions of this code include or exclude UCL, or consider only one specific institution. This version processes *all* REF Impact text.

Import general info about the Units of Assessment (UoAs) we'll iterate through (the UoAs are our iterables here):

In [1]:
import pandas as pd

# Fetch the list of Units of Assessment (ID, Panel, Subject) from the REF API
UoA = pd.read_json("http://impact.ref.ac.uk/casestudiesapi/REFAPI.svc/ListUnitsOfAssessment")

print(UoA)

    ID       Panel                                            Subject
0    1  A                                           Clinical Medicine
1    2  A             Public Health, Health Services and Primary Care
2    3  A           Allied Health Professions, Dentistry, Nursing ...
3    4  A                     Psychology, Psychiatry and Neuroscience
4    5  A                                         Biological Sciences
5    6  A                    Agriculture, Veterinary and Food Science
6    7  B                    Earth Systems and Environmental Sciences
7    8  B                                                   Chemistry
8    9  B                                                     Physics
9   10  B                                       Mathematical Sciences
10  11  B                            Computer Science and Informatics
11  12  B           Aeronautical, Mechanical, Chemical and Manufac...
12  13  B           Electrical and Electronic Engineering, Metallu...
13  14  B           

## Specify folder for data output:

In [37]:
# Alternative output folders, kept for reference (only the last assignment takes effect):
#outString = 'code/dataout3/'
#outString = 'code/UCLONLY/'
outString = 'dataout/dataout_inc_UCL/'

We need the following info (all asterisked) for each case study:

- Filename/title: **UKPRN**-**UoA**-**ID**
- ID_**ID**
- Inst_**UKPRN**
- UoA_**n**
- School (*only relevant for UCL-only analysis*)
- panel_**P**

followed by the text itself:

- Impact Summary
- Impact Details
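
As a rough illustration of how these fields could be laid out, the sketch below assembles a header line using the same tags that `makeOutputString` writes (`*title_`, `*UKPRN_`, `*uoa_`, `*ID_`, `*panel_`). The helper name `iramuteq_header` is hypothetical and not part of this notebook. The example values (UKPRN 10007784, case study 20066, UoA 1, panel A) are read off the UCL sample output shown further down.

```python
def iramuteq_header(ukprn, uoa, case_id, panel):
    # Hypothetical helper: one '****' marker followed by the starred variables
    parts = [
        '****',
        '*title_' + str(case_id),
        '*UKPRN_' + str(ukprn),
        '*uoa_' + str(uoa),
        '*ID_' + str(case_id),
        '*panel_' + str(panel),
    ]
    return ' '.join(parts)

header = iramuteq_header(10007784, 1, 20066, 'A')
print(header)
```

The cleaned Impact Summary and Impact Details text would then follow on the next line.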

In [3]:
def removeEscapes(stringo):
    # Collapse every run of whitespace (including \r and \n) into one space.
    # Splitting on whitespace directly avoids gluing the last word of one
    # line onto the first word of the next.
    return ' '.join(stringo.split())
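
To see what this normalisation does to raw API text, here is a small self-contained check. It uses `' '.join(text.split())`, the same expression `removeEscapes` finishes with. The helper name `normalise_whitespace` is hypothetical, and the sample string is made up to imitate the `\r\n` plus indentation visible in the API output.

```python
def normalise_whitespace(text):
    # str.split() with no arguments splits on any run of whitespace,
    # so joining with single spaces removes \r, \n and repeated blanks.
    return ' '.join(text.split())

raw = "\r\n    Lower Urinary Tract Symptoms\r\n    in men  "
print(normalise_whitespace(raw))
```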

In [4]:
def saveFile(sout, tiz):
    # Write sout to <outString><tiz>.txt and echo the name as a progress marker
    filename = str(tiz) + '.txt'
    with open(outString + filename, "w") as text_file:
        text_file.write(sout)
    print(tiz)

Concatenating the relevant bits:

In [5]:
def makeOutputString(Case):
    # Build the Iramuteq record for one case study as a list of string pieces.
    # Relies on the globals UoA (first cell) and UoA_number (parent loop).
    s = []
    ID = Case['CaseStudyId']
    UKPRN = Case['Institutions'][0]['UKPRN']
    # UoA IDs run 1..n in row order, so row label UoA_number-1 is the matching row
    panel = UoA.loc[UoA_number - 1, 'Panel'].strip()

    # Longer alternative title, currently unused:
    #title = 'inst_'+str(UKPRN)+'-u_'+str(UoA_number)+'-case_'+str(ID)
    title = str(ID)

    s.append('****')
    s.append('*title_')
    s.append(title)  # s[2] is also used as the filename in formatUoA
    s.append('*UKPRN_' + str(UKPRN))
    s.append('*uoa_' + str(UoA_number))
    s.append('*ID_' + str(ID))
    s.append('*panel_' + str(panel))

    s.append('\r\n')
    ImpactSummary = str(removeEscapes(Case['ImpactSummary']))
    ImpactSummary = allClean(ImpactSummary)

    ImpactDetails = str(removeEscapes(Case['ImpactDetails']))
    ImpactDetails = allClean(ImpactDetails)

    # Concatenate these two elements for the output; they form the base corpus of our text analysis
    s.append(ImpactSummary)
    s.append(ImpactDetails)
    print('')
    return s

Set up some string cleaning functions:

In [16]:
def cleanString(sin, charact, out):
    # Replace every occurrence of charact in sin with out
    return sin.replace(charact, out)

### Remove all the following characters (which confuse Iramuteq):

In [17]:
stopwords = ["*", '(', ')', '\'', '\"', '[', ']', '{', '}']

In [18]:
def specificClean(sin):
    # Strip each character listed in stopwords
    for charry in stopwords:
        sin = cleanString(sin, charry, "")
    return sin

### Replace currency terms with words:

In [28]:
def replaceMoney(sin):
    # Swap currency symbols for words so they survive as tokens
    sout = cleanString(sin, '£', 'pounds')
    sout = cleanString(sout, '$', 'dollars')
    sout = cleanString(sout, '€', 'euros')
    return sout

### Remove urls

In [29]:
import re
# Raw string so that \( and \) reach the regex engine without escape warnings
pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
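
As a quick illustration of what this pattern removes, the sketch below applies the same regular expression to two made-up sentences. The name `url_pattern` and the sample strings are only for this example. Note that the match runs until the first character outside the allowed set (such as a space), so punctuation directly after a URL is removed along with it.

```python
import re

# Same expression as `pattern` above, written as a raw string
url_pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')

sample = "see http://www.google.com for details"
found = url_pattern.findall(sample)      # all groups are non-capturing, so full matches are returned
stripped = url_pattern.sub('', sample)   # the spaces on either side of the URL remain

# A comma directly after the URL is swallowed by the match
trailing = url_pattern.sub('', "visit http://example.org/a, then")

print(found)
print(repr(stripped))
print(repr(trailing))
```

The doubled space left behind is harmless here if the text is later split on whitespace.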

### Test all of these:

In [30]:
testString = "*[testo] {£ € $} (Hi it is me) http://www.google.com"
print(testString)

testString = specificClean(testString)
testString = replaceMoney(testString)
testString = testString.lower()
testString = pattern.sub('', testString)

print(testString)

*[testo] {£ € $} (Hi it is me) http://www.google.com
testo pounds euros dollars hi it is me 


### Wrap this all into the allClean function

- remove "stopwords" - brackets, etc
- wordify currency
- remove URLs
- put everything into lower case

`allClean` does the actual string cleaning, and is called in `makeOutputString` above.

In [31]:
def allClean(sin):
    # Strip awkward characters, wordify currency, lower-case, then drop URLs
    sout = specificClean(sin)
    sout = replaceMoney(sout)
    sout = sout.lower()
    sout = pattern.sub('', sout)
    return sout

In [32]:
testString = "*[Jazz] {£ € $} (Hi it is me) http://www.google.com"
stroo = allClean(testString)
print(testString)
print(stroo)

*[Jazz] {£ € $} (Hi it is me) http://www.google.com
jazz pounds euros dollars hi it is me 


Takes a list of strings and concatenates them into one string:

In [33]:
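def chainString(s):
    # Add a trailing space after each piece, except after the line break
    # and after '*title_' (so the title value attaches directly to its tag)
    sout = ''
    for line in s:
        sout += line
        if line != '\r\n' and line != '*title_':
            sout += ' '
    return sout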
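
To make the spacing rule concrete, the sketch below runs the same logic on a shortened, made-up list of pieces. The function name `chain_pieces` and the placeholder body text are only for this example.

```python
def chain_pieces(pieces):
    # Same spacing rule as chainString: no space after the line break
    # or after the '*title_' tag, one space after everything else.
    out = ''
    for piece in pieces:
        out += piece
        if piece != '\r\n' and piece != '*title_':
            out += ' '
    return out

pieces = ['****', '*title_', '20066', '*uoa_1', '\r\n', 'summary text', 'details text']
record = chain_pieces(pieces)
print(repr(record))
```

The title value sits directly against its tag, and the body text starts on a fresh line after the header.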

Test selecting only UCL (UKPRN 10007784):

In [34]:
jazz =  pd.read_json("http://impact.ref.ac.uk/casestudiesapi/REFAPI.svc/SearchCaseStudies?UKPRN=10007784")
jazz.head()

Unnamed: 0,CaseStudyId,Continent,Country,Funders,ImpactDetails,ImpactSummary,ImpactType,Institution,Institutions,Panel,PlaceName,References,ResearchSubjectAreas,Sources,Title,UKLocation,UKRegion,UOA,UnderpinningResearch
0,20066,"[{u'GeoNamesId': u'6255148', u'Name': u'Europe'}]","[{u'GeoNamesId': u'2635167', u'Name': u'United...",[],"\r\n As a result of our work, which provide...",\r\n Lower Urinary Tract Symptoms (LUTS) in...,Health,\r\n University College London\r\n,"[{u'PeerGroup': u'A', u'Region': u'London', u'...",A,[],"\r\n \n[1] Brown CT, Van Der Meulen J, Mund...","[{u'Level1': u'11', u'Level2': u'17', u'Subjec...",\r\n [a] Lower urinary tract symptoms in me...,\r\n Self-management intervention for men w...,[],[],Clinical Medicine,\r\n In about 2003 Emberton observed that m...
1,20951,"[{u'GeoNamesId': u'6255148', u'Name': u'Europe'}]","[{u'GeoNamesId': u'2635167', u'Name': u'United...",[],\r\n Our ongoing programme of basic science...,\r\n The research described below has made ...,Health,\r\n University College London\r\n,"[{u'PeerGroup': u'A', u'Region': u'London', u'...",A,[],"\r\n \n[1] Abraham DJ, Vancheeswaran R, Das...","[{u'Level1': u'11', u'Level2': u'2', u'Subject...","\r\n [a] Barst RJ, Gibbs JS, Ghofrani HA, H...",\r\n Targeting endothelin in systemic scler...,[],[],Clinical Medicine,\r\n Research conducted by the Centre for R...
2,21477,"[{u'GeoNamesId': u'6255149', u'Name': u'North ...","[{u'GeoNamesId': u'6252001', u'Name': u'United...",[Biotechnology and Biological Sciences Researc...,\n Domainex Ltd was incorporated as a priva...,\n Combinatorial Domain Hunting (CDH) techn...,Technological,\n University College London/Birkbeck\n,"[{u'PeerGroup': u'B', u'Region': u'London', u'...",A,[],"\n \n[1] Prodromou C, Savva R, Driscoll PC....","[{u'Level1': u'6', u'Level2': u'1', u'Subject'...",\n [a] http://www.domainex.co.uk/investors....,\n Combinatorial protein domain hunting to ...,[],[],Biological Sciences,\n Drug discovery programmes rely on the av...
3,21737,"[{u'GeoNamesId': u'6255151', u'Name': u'Oceani...","[{u'GeoNamesId': u'2186224', u'Name': u'New Ze...","[Wellcome Trust, Biotechnology and Biological ...","\n Most acute wounds heal without issue, bu...",\n Professor David Becker and colleagues at...,Technological,\n University College London\n,"[{u'PeerGroup': u'B', u'Region': u'London', u'...",A,"[{u'GeoNamesId': u'5332921', u'Name': u'Califo...","\n \n[1] Qiu C, Coutinho P, Frank S, Franke...","[{u'Level1': u'6', u'Level2': u'1', u'Subject'...",\n [a] http://www.codatherapeutics.com/inde...,\n Healing chronic wounds with Nexagon\n,[],[],Biological Sciences,\n In 1994 David Becker obtained a Royal So...
4,21738,"[{u'GeoNamesId': u'6255148', u'Name': u'Europe'}]","[{u'GeoNamesId': u'2635167', u'Name': u'United...","[Wellcome Trust, Medical Research Council]",\n New diagnostic tests\n The impact of ...,\n Research at UCL into the genetics of neu...,Health,\n University College London\n,"[{u'PeerGroup': u'B', u'Region': u'London', u'...",A,[],\n These publications include those first r...,"[{u'Level1': u'11', u'Level2': u'9', u'Subject...",\n [a] A full list is given on the NCL webs...,\n Improving the diagnosis and understandin...,[],[],Biological Sciences,"\n NCL is a rare, progressive, inherited ne..."


## Main Function

This function takes a specific UoA and draws all case studies from that UoA, appending each one to a large text string called "Monobrow". Once every UoA has been processed, Monobrow is saved as one large text file holding all Impact text, formatted for Iramuteq.

In this function, you can see some commented-out code that limits the stored text to a specific UKPRN (or, with the test inverted, excludes that UKPRN), allowing us to focus on a particular institution, or ignore one.
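
One way to express that include/exclude choice without editing commented code is a small predicate, sketched below. The helper `keep_case` and the two records are hypothetical. The sketch assumes, as the commented-out test does, that `UKPRN` compares equal to the integer 10007784 for UCL.

```python
UCL_UKPRN = 10007784  # UKPRN used in the UCL test query in this notebook

def keep_case(case, only_ukprn=None, exclude_ukprn=None):
    # Hypothetical helper: decide whether a case study should be stored,
    # based on the UKPRN of its first listed institution.
    ukprn = case['Institutions'][0]['UKPRN']
    if only_ukprn is not None and ukprn != only_ukprn:
        return False
    if exclude_ukprn is not None and ukprn == exclude_ukprn:
        return False
    return True

# Made-up records with the same nesting as the API rows
cases = [
    {'CaseStudyId': 1, 'Institutions': [{'UKPRN': 10007784}]},
    {'CaseStudyId': 2, 'Institutions': [{'UKPRN': 12345678}]},
]
only_ucl = [c['CaseStudyId'] for c in cases if keep_case(c, only_ukprn=UCL_UKPRN)]
without_ucl = [c['CaseStudyId'] for c in cases if keep_case(c, exclude_ukprn=UCL_UKPRN)]
print(only_ucl)
print(without_ucl)
```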

The parent loop contains an "if i<200" check, which can be tightened when testing (e.g. to "i<5") to stop the code iterating through the entire database. Remember to reset it to 200 when you want to run the whole dataset (NB there are not 200 UoAs; the number just needs to be greater than the number of UoAs).
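
A possible alternative to editing the threshold by hand is to cap the number of items with `itertools.islice`, as in the sketch below. The helper `limited` and the placeholder ID range are assumptions for illustration only. Unlike the "i<200" check, which compares the ID value, this caps the count of items processed.

```python
from itertools import islice

def limited(ids, max_items=None):
    # Hypothetical helper: yield at most max_items ids; None means no limit.
    return islice(ids, max_items)

uoa_ids = range(1, 37)  # placeholder IDs standing in for UoA.ID
first_five = list(limited(uoa_ids, 5))
everything = list(limited(uoa_ids))
print(first_five)
print(len(everything))
```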

This is the "monobrow" function - take everything into one file, also saves each case separately:

In [35]:
def formatUoA(call):
    global monobrow
    # This is the bit that calls the data:
    UoA_i = pd.read_json(call)
    for i in UoA_i.index:
        # Safety feature to prevent it running through all case studies - set to a big number to take the brakes off
        if i < 10000:
            print('case ' + str(i) + " of " + str(len(UoA_i)) + '\n')
            Case = UoA_i.loc[i]

            s = makeOutputString(Case)
            sout = chainString(s).lower()
            monobrow += sout
            monobrow += '\n'
            # s[2] is the title (the case study ID), used as the filename
            saveFile(sout, s[2])

            #ONLY UCL
            #if  Case['Institutions'][0]['UKPRN']==10007784:
             #   s = makeOutputString(Case)
              #  sout = chainString(s).lower()
               # monobrow+=sout
               # monobrow+='\n'
                #saveFile(sout, s[2]);
            #else:
             #   sout='';
             

## Parent Loop

Note that you need to execute the first cell of this worksheet to get the list of UoA IDs, and I'm using UoA_number as an explicit global variable so I don't have to pass it to the child functions.
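
For comparison, the sketch below shows what passing the UoA number explicitly might look like. This is not how the notebook works. The function names `search_url` and `describe` are hypothetical, and the example values (UoA 1, case study 1855) come from the run log at the end of this notebook.

```python
BASE = 'http://impact.ref.ac.uk/casestudiesapi/REFAPI.svc/SearchCaseStudies'

def search_url(uoa_number):
    # Same URL the parent loop builds by string concatenation
    return BASE + '?UoA=' + str(uoa_number)

def describe(case_id, uoa_number):
    # The UoA number arrives as an argument instead of being read from a global
    return '*uoa_' + str(uoa_number) + ' *ID_' + str(case_id)

url = search_url(1)
tags = describe(1855, 1)
print(url)
print(tags)
```

The trade-off is a longer argument list in exchange for functions that can be tested on their own.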

In [38]:
monobrow = ''

for i in UoA.ID:
    # Safety feature to prevent it blitzing HEFCE - set to a big number to take the brakes off
    if i < 200:
        # Global variable so I don't have to push it up and down the chain
        UoA_number = i
        print('\n' + 'UoA_' + str(UoA_number))
        API_UoA = 'http://impact.ref.ac.uk/casestudiesapi/REFAPI.svc/SearchCaseStudies?UoA='
        API_UoA += str(i)
        print(API_UoA)
        formatUoA(API_UoA)

# Write the combined corpus once all UoAs have been processed
saveFile(monobrow, 'monobrow_inc_UCL')


UoA_1
http://impact.ref.ac.uk/casestudiesapi/REFAPI.svc/SearchCaseStudies?UoA=1
case 0 of 383


1855
case 1 of 383


1856
case 2 of 383


2582
case 3 of 383


2613
case 4 of 383


2703
case 5 of 383


2728
case 6 of 383


2729
case 7 of 383


2933
case 8 of 383


2999
case 9 of 383


3416
case 10 of 383


3417
case 11 of 383


3418
case 12 of 383


3419
case 13 of 383


3673
case 14 of 383


3674
case 15 of 383


3788
case 16 of 383


3789
case 17 of 383


3790
case 18 of 383


3859
case 19 of 383


3860
case 20 of 383


3862
case 21 of 383


3863
case 22 of 383


3864
case 23 of 383


3896
case 24 of 383


3897
case 25 of 383


3898
case 26 of 383


3901
case 27 of 383


3904
case 28 of 383


3906
case 29 of 383


3907
case 30 of 383


3908
case 31 of 383


3910
case 32 of 383


4862
case 33 of 383


4864
case 34 of 383


4865
case 35 of 383


4866
case 36 of 383


4867
case 37 of 383


4868
case 38 of 383


4869
case 39 of 383


4870
case 40 of 383


6025
case 41 of 383


6026
case 