## Thinkful 1.3.5
### Notes on Manipulating Strings

#### .is____ methods on strings

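#### str.isdigit(), str.isdecimal(), and str.isnumeric()

These three are very close to the same function, and each accepts a superset of the one before it: "decimal" requires EVERY character in the string to be one that can form base-10 numbers (0-9 and their equivalents in other scripts), "digit" is like decimal but also accepts digit-like characters such as superscript digits, and "numeric" additionally accepts characters that carry a numeric value, such as fractions.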

 -  str.isdigit()
     - Return true if all characters in the string are digits and there is at least one character, false otherwise. Digits include decimal characters and digits that need special handling, such as the compatibility superscript digits. This covers digits which cannot be used to form numbers in base 10, like the Kharosthi numbers. Formally, a digit is a character that has the property value Numeric_Type=Digit or Numeric_Type=Decimal.
     
 -  str.isdecimal()
     - Return true if all characters in the string are decimal characters and there is at least one character, false otherwise. Decimal characters are those that can be used to form numbers in base 10, e.g. U+0660, ARABIC-INDIC DIGIT ZERO. Formally a decimal character is a character in the Unicode General Category “Nd”.
     
 -  str.isnumeric()
     - Return true if all characters in the string are numeric characters, and there is at least one character, false otherwise. Numeric characters include digit characters, and all characters that have the Unicode numeric value property, e.g. U+2155, VULGAR FRACTION ONE FIFTH. Formally, numeric characters are those with the property value Numeric_Type=Digit, Numeric_Type=Decimal or Numeric_Type=Numeric.
     

 -  str.isalpha()
     - Return true if all characters in the string are alphabetic and there is at least one character, false otherwise. Alphabetic characters are those characters defined in the Unicode character database as “Letter”, i.e., those with general category property being one of “Lm”, “Lt”, “Lu”, “Ll”, or “Lo”. Note that this is different from the “Alphabetic” property defined in the Unicode Standard.
     
 -  str.isspace()
     - Return true if there are only whitespace characters in the string and there is at least one character, false otherwise. A character is whitespace if, in the Unicode character database, either its general category is “Zs” (“Separator, space”) or its bidirectional class is one of “WS”, “B”, or “S”.
     
 -  str.isascii()
     - Return true if the string is empty or all characters in the string are ASCII, false otherwise. ASCII characters have code points in the range U+0000-U+007F.
     
 -  str.islower()/isupper()
     - Return true if all _cased_ characters in the string are lowercase (uppercase) and there is at least one cased character, false otherwise.
     
 -  str.isidentifier()
     - Return true if the string is a valid identifier according to the language definition, section Identifiers and keywords.

 -  str.istitle()
      - Return true if the string is a titlecased string and there is at least one character, for example uppercase characters may only follow uncased characters and lowercase characters only cased ones. Return false otherwise.
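
To see how the three numeric tests nest, here is a small sketch that runs them on a few sample characters. The samples are my own choice for illustration (two of them are the code points mentioned in the descriptions above); they are not from the course material.

```python
# Sample strings picked for illustration only
samples = {
    'ascii digit': '5',
    'arabic-indic zero': '\u0660',          # U+0660, a decimal character
    'superscript two': '\u00b2',            # a digit, but not a decimal
    'vulgar fraction one fifth': '\u2155',  # numeric only
    'empty string': '',
}

# Each entry becomes (isdecimal, isdigit, isnumeric)
results = {
    label: (s.isdecimal(), s.isdigit(), s.isnumeric())
    for label, s in samples.items()
}

for label, (dec, dig, num) in results.items():
    print(f'{label:28} isdecimal={dec!s:5} isdigit={dig!s:5} isnumeric={num}')
```

The empty string fails all three, because each method requires at least one character.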

In [1]:
import pandas as pd
import numpy as np

In [2]:
money = pd.Series([400, 111, '$20', 57, 'Lots'])

### Filtering a list versus Applying a filter TO a list.

In the following example the third and fifth entries fail the test ".isdigit()": the third because of the leading $, and the fifth because all of its characters are letters.

NOTE: filter() returns an iterator, so the result needs to be passed to list() before it can be printed or indexed.

In [3]:
print(list(filter(lambda x: str(x).isdigit(), money)))

[400, 111, 57]
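
One consequence of filter() returning an iterator is that it can only be consumed once. A minimal sketch using a plain list in place of the Series (the variable names here are mine):

```python
money = [400, 111, '$20', 57, 'Lots']

# filter() is lazy: nothing is tested until the iterator is consumed
digits_only = filter(lambda x: str(x).isdigit(), money)

first_pass = list(digits_only)   # consumes the iterator
second_pass = list(digits_only)  # nothing left to consume

print(first_pass)
print(second_pass)
```

So if the filtered values are needed more than once, it is safer to store the list than the filter object.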


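In this case the lambda function is applied to each element of the Series by "apply", which behaves much like the standard "map" function (a Series also has its own .map() method). The str(x) call converts each element to a string first, so the numbers are handled as _strings_ too.

Inside the lambda, filter(str.isdigit, ...) tests each character one at a time and keeps only the digits, stripping letters and symbols (including any decimal point). ''.join() then glues the surviving digits back into a single string.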

In [4]:
print(money.apply(lambda x: ''.join(list(filter(str.isdigit, str(x))))))

0    400
1    111
2     20
3     57
4       
dtype: object


Note that we are still getting a value from the fifth string, but it's been reduced to the string ''

I think this can get us in trouble easily.

In [5]:
myMoney = pd.Series(['$400', 11.51, '$20.50', 'Heinz 57', 'Value2You'])


In [6]:
print(myMoney.apply(lambda x: ''.join(list(filter(str.isdigit, str(x))))))

0     400
1    1151
2    2050
3      57
4       2
dtype: object


__Thought Question__: how can we KEEP the decimal point? Or the "$"?
_https://stackoverflow.com/questions/320929/currency-formatting-in-python_
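
One possible answer to the decimal part of the question, as a sketch rather than the course's solution: use a regular expression that grabs the first run of digits with an optional decimal part. The helper name first_number is hypothetical, and I'm assuming that only the first number in each entry matters.

```python
import re

def first_number(value):
    """Return the first number found in value as a float, or None if there is none."""
    match = re.search(r'\d+(?:\.\d+)?', str(value))
    return float(match.group()) if match else None

my_money = ['$400', 11.51, '$20.50', 'Heinz 57', 'Value2You', 'Lots']
amounts = [first_number(v) for v in my_money]
print(amounts)
```

On a Series the same helper could be passed to .apply(). Note that 'Heinz 57' and 'Value2You' still produce numbers, so this keeps the decimals but does not decide whether an entry is really money.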

### Splitting strings apart

The .split() string method takes a separator (the character or substring to split on) and returns a list of the pieces of the string it's called on, using the separator as the delimiter between pieces. Conveniently, Pandas gives us its own version of this built-in method, Series.str.split(), which you can use directly on Series objects without needing .apply().

In [7]:
words = pd.Series([
    'MollyMalone$molmal@gmail.com',
    'JeffreyJones$jefjo@hotmail.com',
    'DeadParrot$fjords@gmail.com'
])

This appears to be names separated from emails by a \$ sign.  Thus we'll want to split the strings on \$.

Let's explore what the "expand=True" parameter does by seeing what happens without it. Without it we get a SERIES of lists, not a DATAFRAME.

In [8]:
word_split = words.str.split('$')

In [9]:
print(word_split)

0      [MollyMalone, molmal@gmail.com]
1    [JeffreyJones, jefjo@hotmail.com]
2       [DeadParrot, fjords@gmail.com]
dtype: object


In [10]:
names = word_split[0]

In [11]:
print(names)


['MollyMalone', 'molmal@gmail.com']


The problem, of course, is that indexing a SERIES with [0] returns the element in the first row (here, the whole list for Molly Malone), while indexing a DATAFRAME with [0] returns the column labelled 0.
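
If we want to stay with the Series of lists, the .str accessor can also index into each list. A short sketch on the same sample data (the variable names parts, first_row, and name_col are mine):

```python
import pandas as pd

words = pd.Series([
    'MollyMalone$molmal@gmail.com',
    'JeffreyJones$jefjo@hotmail.com',
    'DeadParrot$fjords@gmail.com'
])

parts = words.str.split('$')   # Series whose elements are lists

first_row = parts.loc[0]       # one row: the list for the first person
name_col = parts.str[0]        # element 0 of EVERY list: all the names
email_col = parts.str[1]       # element 1 of every list: all the emails

print(first_row)
print(name_col.tolist())
```

This gives the same columns as expand=True, one at a time.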

In [12]:
word_split = words.str.split('$', expand=True)

In [13]:
print(word_split)

              0                  1
0   MollyMalone   molmal@gmail.com
1  JeffreyJones  jefjo@hotmail.com
2    DeadParrot   fjords@gmail.com


In [14]:
names = word_split[0]

In [15]:
print(names)


0     MollyMalone
1    JeffreyJones
2      DeadParrot
Name: 0, dtype: object


In [16]:
# problem with splitting on something we don't want to be "disappeared"
print(names.str.split('[A-Z]', expand=True))

  0       1      2
0      olly  alone
1    effrey   ones
2       ead  arrot


Here the names split just fine on the capital letters, but because "split" drops the text it splits on, we lose those letters (and column 0 is empty, since nothing comes before the first capital).  Bummer.
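
One way to split on capitals without losing them is a zero-width lookahead, which matches the position just before a capital letter instead of the letter itself. This is a sketch of that idea using the standard re module (it relies on Python 3.7+, where re.split accepts empty matches); the helper name split_keep_caps is hypothetical.

```python
import re

def split_keep_caps(name):
    """Split before each capital letter, keeping the capital with its piece."""
    # (?=[A-Z]) matches the empty position in front of a capital letter.
    # The split at position 0 yields an empty first piece, so drop empties.
    return [piece for piece in re.split(r'(?=[A-Z])', name) if piece]

pieces = [split_keep_caps(n) for n in ['MollyMalone', 'JeffreyJones', 'DeadParrot']]
print(pieces)
```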

In [17]:
# .findall() method is part of the re package and searches for a 
#  given pattern and returns each instance as an item in a list

import re
FirstLast = names.apply(lambda x: re.findall('[A-Z][a-z]*', x))

QUESTION:  I was wondering why the curriculum made TWO calls of names.apply:  one for the first names and one for the last.  I thought one could make a single call to FirstLast

In [18]:
print(FirstLast)

0     [Molly, Malone]
1    [Jeffrey, Jones]
2      [Dead, Parrot]
Name: 0, dtype: object


This is a Series of lists.  I think there are two ways of making it a DataFrame. (1) Turn each of those lists into a Series:

In [19]:
FirstLastDF1 = FirstLast.apply(lambda x: pd.Series(x))

In [20]:
print(FirstLastDF1)

         0       1
0    Molly  Malone
1  Jeffrey   Jones
2     Dead  Parrot


Or (2), use the method pd.DataFrame.from_records().  This method accepts a list of lists (one inner list per row), so convert the Series to a list first.

In [21]:
FirstLastDF2 = pd.DataFrame.from_records(list(FirstLast))

In [22]:
print(FirstLastDF2)

         0       1
0    Molly  Malone
1  Jeffrey   Jones
2     Dead  Parrot


Challenge: What if some names had a middle name as well?

In [23]:
myNames = pd.Series([
    'MollyMalone',
    'JeffreyGJones',
    'DeadExParrot'
])

In [35]:
mysplitnames = myNames.apply(lambda x: re.findall('[A-Z][a-z]*',x))

In [36]:
print(mysplitnames)

0        [Molly, Malone]
1    [Jeffrey, G, Jones]
2     [Dead, Ex, Parrot]
dtype: object


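This splits just fine, but now some of the names have length 3 and others length 2, and in order to become a DataFrame they must have a common length.  One could insert "None" for the middle name for those with only two names.

Note that Series.apply() does not itself change the Series; it RETURNS a new Series made of the function's return values.  What changes things here is list.insert(), which modifies each list in place and returns None.  So for names that already have a middle name the lambda returns the list unchanged, and for two-part names the returned value is None while the list stored in mysplitnames is modified as a side effect.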

In [37]:
mysplitnames.apply(lambda x: x if (len(x)==3) else x.insert(1,None))


0                   None
1    [Jeffrey, G, Jones]
2     [Dead, Ex, Parrot]
dtype: object

As we can see, the lists stored in mysplitnames have been changed in place, even though the Series returned above shows None in row 0.

In [38]:
print(mysplitnames)

0    [Molly, None, Malone]
1      [Jeffrey, G, Jones]
2       [Dead, Ex, Parrot]
dtype: object
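
Relying on a side effect inside apply is easy to get wrong, so here is a sketch of an alternative that returns a new padded list and leaves the original alone. The helper name pad_middle and the column labels 'first', 'middle', 'last' are my own, and I'm assuming every name has either two or three parts.

```python
import pandas as pd

def pad_middle(parts):
    """Return a new [first, middle, last] list without touching the input list."""
    if len(parts) == 2:
        return [parts[0], None, parts[1]]
    return list(parts)

split_names = pd.Series([
    ['Molly', 'Malone'],
    ['Jeffrey', 'G', 'Jones'],
    ['Dead', 'Ex', 'Parrot'],
])

padded = split_names.apply(pad_middle)   # split_names itself is unchanged

# Every list now has length 3, so the rows line up as DataFrame columns
name_df = pd.DataFrame(padded.tolist(), columns=['first', 'middle', 'last'])
print(name_df)
```

With the names in a DataFrame, name_df['first'] is a real column of first names rather than one person's row.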


In [39]:
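# Caution: indexing the Series by label selects a ROW, so each variable below
# holds one person's [first, middle, last] list, not a column of names.
firstname = mysplitnames[0]
middlename = mysplitnames[1]
lastname = mysplitnames[2]
print(firstname, '\n')
print(middlename, '\n')
print(lastname)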

['Molly', None, 'Malone'] 

['Jeffrey', 'G', 'Jones'] 

['Dead', 'Ex', 'Parrot']


## Changing the content of strings
#### Replace using Series.str.replace()

In [40]:
emails = word_split[1]

In [41]:
print(emails,'\n')
print(emails.str.replace('@', '_at_'), '\n')

0     molmal@gmail.com
1    jefjo@hotmail.com
2     fjords@gmail.com
Name: 1, dtype: object 

0     molmal_at_gmail.com
1    jefjo_at_hotmail.com
2     fjords_at_gmail.com
Name: 1, dtype: object 



#### Changing case to lower or upper

In [42]:
print(names.str.lower(),'\n')
print(names.str.upper(),'\n')
print(names.str.capitalize(),'\n')

0     mollymalone
1    jeffreyjones
2      deadparrot
Name: 0, dtype: object 

0     MOLLYMALONE
1    JEFFREYJONES
2      DEADPARROT
Name: 0, dtype: object 

0     Mollymalone
1    Jeffreyjones
2      Deadparrot
Name: 0, dtype: object 



## Problem set
My file would not read until I opened it with LibreOffice and then saved it again as a .csv file.  For some reason that got it to load.  The "index_col=False" argument tells read_csv not to use the first column as the index.

In [43]:
W_C = pd.read_csv('/home/john/Desktop/Thinkful/WC.csv', index_col=False)

In [44]:
W_C.head()

Unnamed: 0,PMID/PMCID,Publisher,Journal title,Article title,COST (�) charged to Wellcome (inc VAT when charged),Unnamed: 5
0,,CUP,Psychological Medicine,Reduced parahippocampal cortical thickness in ...,�0.00,
1,PMC3679557,ACS,Biomacromolecules,Structural characterization of a Model Gram-ne...,�2381.04,
2,23043264 PMC3506128,ACS,J Med Chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",�642.56,
3,23438330 PMC3646402,ACS,J Med Chem,Orvinols with mixed kappa/mu opioid receptor a...,�669.64,
4,23438216 PMC3601604,ACS,J Org Chem,Regioselective opening of myo-inositol orthoes...,�685.88,


These column headings suck.  Let's change them.

In [45]:
W_C.columns = ['PMID','Publisher','Journal','Title','Cost','Oops']

We seem to have a few entries where the sixth column ("Oops", originally "Unnamed: 5") is being used.  Let's find them.

In [46]:
W_C.loc[W_C['Oops'].notnull(),['PMID','Publisher','Journal','Title','Cost','Oops']]

Unnamed: 0,PMID,Publisher,Journal,Title,Cost,Oops
403,,Company of Biologists,Journal of Cell Science,Expression of OA1 limits the fusion of a subse...,a mechanism likely involved in the initial bi...,�2400.00
830,3708126,Elsevier,American Journal of Preventive Medicine,Sickle Cell Disease in Africa,a neglected cause of early childhood mortality,�1834.77
837,3778404,Elsevier,Social Science & Medicine,Managing misaligned paternity findings in rese...,consulting communities to inform policy,�1834.77
1278,PMC3744396,PLOS,PLOS Computational Biology,Spike triggered hormone secretion in vasopress...,a model investigation of mechanism and hetero...,�1429.13
1697,PMC3347798,Society for Genermal Microbiology,Journal of General Virology,The natural history of early hepatitis C virus...,lessons from a global outbreak in human immun...,�1750.00
1854,,The Endrocrine Society,Journal od Clinical Endocrinology,Corticotropin releasing hormone interacts with...,-1 to Regulate Prostaglandin H Synthase- 2 Exp...,�3602.41
1975,PMC3739940,Wiley,Movement Disorders,Somatic alpha-synuclein mutations in Parkinson...,s disease: hypothesis and preliminary data,�1005.00
1990,,Wiley,Muscle and Nerve,HETEROGENEITY OF QUADRICEPS MUSCLE PHENOTYPE I...,IMPLICATIONS FOR STRATIFIED MEDICINE?,�2371.54


These errors seem to have been caused by the article Title being split between Title and Cost with the cost being moved to a 6th column. The next line will concatenate the two data columns back into the Title column where it belongs, then we move the Cost data to the 5th column.

In [47]:
W_C.loc[W_C['Oops'].notnull(),'Title']=W_C['Title']+W_C['Cost']

In [48]:
W_C.loc[W_C['Oops'].notnull(),'Cost']=W_C['Oops']

In [49]:
W_C.loc[W_C['Oops'].notnull(),['PMID','Publisher','Journal','Title','Cost','Oops']]

Unnamed: 0,PMID,Publisher,Journal,Title,Cost,Oops
403,,Company of Biologists,Journal of Cell Science,Expression of OA1 limits the fusion of a subse...,�2400.00,�2400.00
830,3708126,Elsevier,American Journal of Preventive Medicine,Sickle Cell Disease in Africa a neglected caus...,�1834.77,�1834.77
837,3778404,Elsevier,Social Science & Medicine,Managing misaligned paternity findings in rese...,�1834.77,�1834.77
1278,PMC3744396,PLOS,PLOS Computational Biology,Spike triggered hormone secretion in vasopress...,�1429.13,�1429.13
1697,PMC3347798,Society for Genermal Microbiology,Journal of General Virology,The natural history of early hepatitis C virus...,�1750.00,�1750.00
1854,,The Endrocrine Society,Journal od Clinical Endocrinology,Corticotropin releasing hormone interacts with...,�3602.41,�3602.41
1975,PMC3739940,Wiley,Movement Disorders,Somatic alpha-synuclein mutations in Parkinson...,�1005.00,�1005.00
1990,,Wiley,Muscle and Nerve,HETEROGENEITY OF QUADRICEPS MUSCLE PHENOTYPE I...,�2371.54,�2371.54


We can now drop the unnecessary column.

In [50]:
W_C = W_C.drop(columns=['Oops'])

#### Cleaning up Unicode issues
Some of the columns have characters which aren't part of standard ASCII. We need to excise them.

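However we need to be careful, as some of the entries in the Cost field are listed in UK pounds.  Now, the issue is that by running the file through LibreOffice, the £ symbol has been turned into �.  Whether OTHER symbols were also turned into � I don't know.  I do know that searching for � and for \$ returns the same rows, but that comparison is misleading: str.contains() treats its pattern as a regular expression by default, and "$" is the end-of-string anchor, so it matches every non-null entry.  Either way, since I can't tell which symbols were replaced, it's clear we can't use the file after it's run through LibreOffice.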

In [51]:
W_C.loc[W_C['Cost'].str.contains("£")]

Unnamed: 0,PMID,Publisher,Journal,Title,Cost


In [56]:
W_C.loc[W_C['Cost'].str.contains("�")].head()

Unnamed: 0,PMID,Publisher,Journal,Title,Cost
0,,CUP,Psychological Medicine,Reduced parahippocampal cortical thickness in ...,�0.00
1,PMC3679557,ACS,Biomacromolecules,Structural characterization of a Model Gram-ne...,�2381.04
2,23043264 PMC3506128,ACS,J Med Chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",�642.56
3,23438330 PMC3646402,ACS,J Med Chem,Orvinols with mixed kappa/mu opioid receptor a...,�669.64
4,23438216 PMC3601604,ACS,J Org Chem,Regioselective opening of myo-inositol orthoes...,�685.88


In [55]:
W_C.loc[W_C['Cost'].str.contains("$")].head()

Unnamed: 0,PMID,Publisher,Journal,Title,Cost
0,,CUP,Psychological Medicine,Reduced parahippocampal cortical thickness in ...,�0.00
1,PMC3679557,ACS,Biomacromolecules,Structural characterization of a Model Gram-ne...,�2381.04
2,23043264 PMC3506128,ACS,J Med Chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",�642.56
3,23438330 PMC3646402,ACS,J Med Chem,Orvinols with mixed kappa/mu opioid receptor a...,�669.64
4,23438216 PMC3601604,ACS,J Org Chem,Regioselective opening of myo-inositol orthoes...,�685.88
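
The search for "$" above returns every row because str.contains() uses regular expressions by default, and "$" on its own means "end of string". A small sketch with made-up cost strings shows the difference between the regex search and a literal one:

```python
import pandas as pd

# Made-up sample values, not rows from the Wellcome file
costs = pd.Series(['£0.00', '$9.99', '12.00'])

as_regex = costs.str.contains('$')                 # '$' = end-of-string anchor
as_literal = costs.str.contains('$', regex=False)  # look for the character itself
escaped = costs.str.contains(r'\$')                # same idea, escaping the anchor

print(as_regex.tolist())
print(as_literal.tolist())
print(escaped.tolist())
```

Only the literal (or escaped) search picks out the dollar entry.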


#### Second attempt: try reading in the file straight-away.
Continued to get issues with reading.  After exploring Stack Overflow I tried, in succession, encoding='utf-8', encoding='utf-16', encoding='latin1', encoding='iso-8859-1', encoding='cp1252'.  The read below uses encoding='latin1'.


In [92]:
WELL_C = pd.read_csv('/home/john/Downloads/WELLCOME_APCspend2013_forThinkful.csv', encoding='latin1',index_col=False)

In [93]:
WELL_C.head()

Unnamed: 0,PMID/PMCID,Publisher,Journal title,Article title,COST (£) charged to Wellcome (inc VAT when charged)
0,,CUP,Psychological Medicine,Reduced parahippocampal cortical thickness in ...,£0.00
1,PMC3679557,ACS,Biomacromolecules,Structural characterization of a Model Gram-ne...,£2381.04
2,23043264 PMC3506128,ACS,J Med Chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",£642.56
3,23438330 PMC3646402,ACS,J Med Chem,Orvinols with mixed kappa/mu opioid receptor a...,£669.64
4,23438216 PMC3601604,ACS,J Org Chem,Regioselective opening of myo-inositol orthoes...,£685.88


This is a lovely surprise: by reading the file with the right encoding, the article titles stayed where they were supposed to stay.
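
My best guess at where the � came from, as a sketch: assuming the file was saved in a single-byte encoding such as Latin-1 (which the successful latin1 read suggests), the £ is stored as the single byte 0xA3. That byte is not valid UTF-8 on its own, so a UTF-8 decode either fails or substitutes the replacement character U+FFFD, which displays as �.

```python
raw = '£2381.04'.encode('latin1')   # the pound sign becomes the byte 0xA3

as_latin1 = raw.decode('latin1')                       # round-trips cleanly
as_utf8_lossy = raw.decode('utf-8', errors='replace')  # 0xA3 becomes U+FFFD

try:
    raw.decode('utf-8')             # strict decoding refuses the byte
    strict_utf8_failed = False
except UnicodeDecodeError:
    strict_utf8_failed = True

print(raw)
print(as_latin1)
print(as_utf8_lossy)
```

I can't confirm this is exactly what LibreOffice did on re-save, but it matches the symptoms.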

In [94]:
WELL_C.columns = ['PMID','Publisher','Journal','Title','Cost']

In [95]:
ratioUKtoUSD = 1.31

In [96]:
WELL_C['Cost'] = WELL_C.loc[WELL_C['Cost'].str.contains("£"),'Cost'].str.replace('£','').astype(float)*ratioUKtoUSD

In [97]:
WELL_C.head()

Unnamed: 0,PMID,Publisher,Journal,Title,Cost
0,,CUP,Psychological Medicine,Reduced parahippocampal cortical thickness in ...,0.0
1,PMC3679557,ACS,Biomacromolecules,Structural characterization of a Model Gram-ne...,3119.1624
2,23043264 PMC3506128,ACS,J Med Chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",841.7536
3,23438330 PMC3646402,ACS,J Med Chem,Orvinols with mixed kappa/mu opioid receptor a...,877.2284
4,23438216 PMC3601604,ACS,J Org Chem,Regioselective opening of myo-inositol orthoes...,898.5028


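Next, we want to remove the \$ sign from any Cost entries priced in dollars.  Two cautions about the state of the column at this point.  First, the assignment above aligns on the index, so rows that did not contain "£" are now NaN rather than leftover strings; any \$ entries have already been lost.  Second, in the line below, '$' is treated by str.contains() as a regular-expression anchor (it matches every row), and Series.replace('$','') replaces whole values rather than substrings, so the line leaves the column unchanged.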

In [100]:
WELL_C['Cost'] = WELL_C.loc[WELL_C['Cost'].astype(str).str.contains('$'),'Cost'].replace('$','')

Whew.  Let's make sure all values in the column are floats.

In [101]:
WELL_C['Cost']=WELL_C['Cost'].astype(float)
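
Here is a sketch of how the conversion could keep non-pound rows instead of turning them into NaN. It uses made-up cost strings, reuses the 1.31 rate from above, and assumes (my assumption, not something the data confirms) that entries starting with "$" are already in US dollars.

```python
import pandas as pd

GBP_TO_USD = 1.31   # same rate as ratioUKtoUSD above

# Made-up sample values, not rows from the Wellcome file
frame = pd.DataFrame({'Cost': ['£100.00', '$50.00', '£2.50']})

# Pitfall: assigning only the "£" rows back realigns on the index,
# so the rows that were filtered out become NaN.
is_gbp = frame['Cost'].str.startswith('£')
gbp_only = frame.loc[is_gbp, 'Cost'].str.replace('£', '', regex=False).astype(float)
pitfall = frame.copy()
pitfall['Cost'] = gbp_only * GBP_TO_USD

# Alternative: strip either symbol from every row, then scale only the pound rows
amount = frame['Cost'].str.replace(r'[£$]', '', regex=True).astype(float)
usd = amount.where(~is_gbp, amount * GBP_TO_USD)

print(pitfall['Cost'].tolist())
print(usd.tolist())
```

Inside the character class [£$] the dollar sign is a literal character, so no escaping is needed there.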

### Grouping by Journal Name
We have to call reset_index() after the aggregation: groupby makes 'Journal' the index of the result, and reset_index() turns it back into an ordinary column with a fresh integer index.

In [102]:
byJournal = WELL_C.groupby(['Journal']).agg('count').reset_index()


In [103]:
byJournal.sort_values('Title',ascending=False).head(5)

Unnamed: 0,Journal,PMID,Publisher,Title,Cost
792,PLoS One,91,92,92,92
791,PLoS ONE,62,62,62,62
527,Journal of Biological Chemistry,47,48,48,48
753,Nucleic Acids Research,20,21,21,21
835,Proceedings of the National Academy of Sciences,19,19,19,19
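
To see what reset_index() is doing after the aggregation, here is a tiny sketch on made-up data. Before the reset the journal names are the index; afterwards they are a regular column again.

```python
import pandas as pd

# Made-up sample rows, not from the Wellcome file
df = pd.DataFrame({
    'Journal': ['PLoS One', 'PLoS ONE', 'Blood', 'PLoS One'],
    'Cost': [1.0, 2.0, 3.0, 4.0],
})

grouped = df.groupby(['Journal']).agg('count')  # 'Journal' becomes the index
flat = grouped.reset_index()                    # 'Journal' is a column again

print(grouped.index.name)
print(list(flat.columns))
```

The sample also shows why 'PLoS One' and 'PLoS ONE' end up as separate groups: groupby compares the strings exactly.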


Obviously our top two Journals are the same thing.  Let's shift everything to lower case and get them to combine.

In [111]:
WELL_C['Journal']= WELL_C['Journal'].str.lower()

In [112]:
byJournal = WELL_C.groupby(['Journal']).agg('count').reset_index()

In [113]:
byJournal.sort_values('Title',ascending=False).head()


Unnamed: 0,Journal,PMID,Publisher,Title,Cost
772,plos one,188,190,190,190
510,journal of biological chemistry,52,53,53,53
700,neuroimage,28,29,29,29
766,plos genetics,23,24,24,24
773,plos pathogens,24,24,24,18


Note at this point that there are several Journal names containing "plos" (plos one, plos genetics, plos pathogens, ...).  These are distinct journals from the same publisher, but for this exercise I'm going to merge them under a single name.

NOTE:  Somewhere in this data, the Journal column seemingly isn't a string. I should check for why not.

In [114]:
WELL_C.loc[WELL_C['Journal'].astype(str).str.contains("plos "),'Journal']='plos'

In [115]:
byJournal = WELL_C.groupby(['Journal']).agg('count').reset_index()

In [116]:
byJournal.sort_values('Title',ascending=False).head()

Unnamed: 0,Journal,PMID,Publisher,Title,Cost
760,plos,276,289,289,283
510,journal of biological chemistry,52,53,53,53
700,neuroimage,28,29,29,29
725,nucleic acids research,22,23,23,23
775,proceedings of the national academy of sciences,20,20,20,20


Let's go back to finding the rows whose 'Journal' isn't a string.

In [117]:
WELL_C.loc[WELL_C['Journal'].isnull()]

Unnamed: 0,PMID,Publisher,Journal,Title,Cost
986,,MacMillan,,Fungal Disease in Britain and the United State...,17292.0


In [118]:
WELL_C.loc[WELL_C['Journal'].isnull(),'Journal']='Anonymous Fungal Disease Journal'

There are several Journals containing the phrase "national academy".  Let's gather those together.

In [119]:
WELL_C.loc[WELL_C['Journal'].str.contains("national academy"),'Journal']="PNAS"

In [120]:
byJournal = WELL_C.groupby(['Journal']).agg('count').reset_index()

In [121]:
byJournal.sort_values('Title',ascending=False).head(5)

Unnamed: 0,Journal,PMID,Publisher,Title,Cost
761,plos,276,289,289,283
512,journal of biological chemistry,52,53,53,53
1,PNAS,32,32,32,32
701,neuroimage,28,29,29,29
726,nucleic acids research,22,23,23,23


In [129]:
topFive = byJournal.sort_values('Title',ascending=False)['Journal'].head(5)

At this point I'm going to assume that these will still be the top 5 even if other entries are merged.  So at this point I'm going to perform the statistical calculations asked for in the assignment. 

Obviously 999999 is being used as a missing-value marker: 999999 × 1.31 = 1309998.69, which is the value at the top of the sorted costs.

In [122]:
WELL_C['Cost'].sort_values( ascending=False).head()

1563    1309998.69
669     1309998.69
560     1309998.69
1565    1309998.69
1675    1309998.69
Name: Cost, dtype: float64

In [123]:
WELL_C['Cost']=WELL_C['Cost'].apply(lambda x: np.nan if x >100000 else x)

In [124]:
WELL_C['Cost'].sort_values( ascending=False).head()

986     17292.0
1619     7860.0
800      7545.6
648      6288.0
552      6288.0
Name: Cost, dtype: float64
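
A quick check of the arithmetic behind the marker, plus an alternative to the lambda: Series.mask() replaces values with NaN wherever a condition is true. This is a sketch on made-up values; the 100000 cutoff is the same one used above.

```python
import pandas as pd

RATE = 1.31
SENTINEL = 999999   # the apparent "no value" marker in the original pounds

converted_sentinel = round(SENTINEL * RATE, 2)

# Made-up sample column containing one converted sentinel
cost = pd.Series([3119.16, SENTINEL * RATE, 841.75])
cleaned = cost.mask(cost > 100000)   # NaN wherever the condition holds

print(converted_sentinel)
print(cleaned.tolist())
```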

In [130]:
topFive.apply(lambda x: WELL_C.loc[WELL_C['Journal']==x,'Cost'].mean())

761    1454.369909
512    1864.900885
1      1127.834675
701    2901.870441
726    1531.674783
Name: Journal, dtype: float64

In [134]:
topFive.apply(lambda x: WELL_C.loc[WELL_C['Journal']==x,'Cost'].std())

761    459.520635
512    539.660212
1      651.342389
701    349.316670
726    595.171477
Name: Journal, dtype: float64

In [132]:
topFive.apply(lambda x: WELL_C.loc[WELL_C['Journal']==x,'Cost'].median())

761    1332.09315
512    1704.49340
1       998.65230
701    3047.62330
726    1116.12000
Name: Journal, dtype: float64
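
The three separate apply calls could probably be collapsed into one groupby. A sketch on made-up data, with the journal list standing in for topFive:

```python
import pandas as pd

# Made-up sample rows, not from the Wellcome file
df = pd.DataFrame({
    'Journal': ['plos', 'plos', 'neuroimage', 'neuroimage', 'other'],
    'Cost': [1000.0, 2000.0, 3000.0, None, 500.0],
})
top = ['plos', 'neuroimage']   # stand-in for the topFive journal names

# Keep only the top journals, then compute all three statistics at once
stats = (
    df[df['Journal'].isin(top)]
    .groupby('Journal')['Cost']
    .agg(['mean', 'std', 'median'])
)
print(stats)
```

NaN costs are skipped by mean, std, and median, which is the same behaviour the per-journal lambdas rely on.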

## Challenge Bonus
Find the open access prices paid by subject area.
The question here is how to define a subject area.  One easy method would be to "define" a subject area with a Series of terms: if ANY of those strings appears in the title of an article, then the article is within that subject area.
 
As an example: A subject area might be defined by ['infection', 'infectious', 'disease', 'disease', 'illness']

Second example: ['diabetes', 'diabetic', 'insulin', 'Basal', 'glucose level', 'DKA', 'Glucagon']


In [145]:
InfectiousDisease = pd.Series(['infection', 'infectious', 'disease', 'disease', 'illness'])

In [156]:
Diabetes = pd.Series(['diabetes', 'diabetic', 'insulin', 'Basal', 'glucose level', 'DKA', 'Glucagon'])

In [157]:
InfectiousDisease.apply(lambda x: WELL_C['Title'].str.contains(x))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2117,2118,2119,2120,2121,2122,2123,2124,2125,2126
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,True,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [158]:
WELL_C.loc[InfectiousDisease.apply(lambda x: WELL_C['Title'].str.contains(x)).any()]

Unnamed: 0,PMID,Publisher,Journal,Title,Cost
102,,American Society of Haematology,blood,SAP gene transfer restores cellular and humora...,1650.9406
176,22738332 PMC3381227,BioMed Central,biomed central,Long-term impact of systemic bacterial infecti...,1768.5000
186,PMC3616814,BioMed Central,bmc medicine,Associations between selected immune-mediated ...,2271.5400
224,PMCID:\n PMC3726453\n\n,BioMed Central,respiratory research,Neutrophil adhesion molecules in experimental ...,2491.6200
225,PMC3190389,BioMed Central,virology journal,Label-free quantitative proteomics reveals reg...,1627.0200
226,3681581,BioMed Central,bmc genomics,Transcriptional adaptation of pneumococci and ...,797.1612
243,PMC3716626,BioMed Central Ltd,veterinary research,Understanding foot-and-mouth disease virus tra...,13.0083
257,In Process,BMC,bmc medicine,HIV-associated tuberculosis: relationship betw...,2102.9430
280,,BMJ,bmj,Completeness and diagnostic validity of record...,4716.0000
289,PMC3724198,BMJ,frontline gastroenterology,Measurement of faecal calprotectin and lactofe...,943.2000


In [160]:
WELL_C.loc[Diabetes.apply(lambda x: WELL_C['Title'].str.contains(x)).any()]

Unnamed: 0,PMID,Publisher,Journal,Title,Cost
185,3570299,BioMed Central,bmc medical genetics,Maternal and offspring fasting glucose and typ...,2012.16
262,PMC3735389,BMC,bmc public health,Association between legume intake and self-rep...,2012.16
283,PMC3657677,BMJ,bmj open,Development of an economic evaluation of diagn...,2122.2
369,PMC3458429,Cambridge University Press,public health nutrition,Prevalence and risk factors for self-reported ...,2664.54
831,PMC3685808,Elsevier,cell metabolism,Improved insulin sensitivity despite increased...,5350.7081
852,PMC3599069\n\n,European Society of Endocrinolog,european journal of endocrinology,Clinical and Molecular Characterisation of 300...,1886.4
1068,PMID: 23229735 PMC3734734,Nature Publishing Group,international journal of obesity,Catch-up growth following intra-uterine growth...,3930.0
1341,PMC3457933\n\n,Public Library of Science,biology,Independent Regulation of Basal Neurotransmitt...,2406.3521
1353,,Public Library of Science,plos,Sleep-wake sensitive mechanisms of adenosine r...,1325.458
1503,PMC3547960\n,Public Library of Science,plos,Association study of 25 type 2 diabetes relate...,1102.5615
