## ASN Kidney Week 2018 Abstract Supplement PDF to Pandas or CSV
####     by Jung Hoon Son, MD @ pulseData

------

**pulseData** will be presenting two posters at ASN Kidney Week 2018! Visit us at:

> [Using Machine Learning to Predict Optimal Renal Replacement Therapy Starts in Patients with Advanced Renal Function Loss](https://www.asn-online.org/education/kidneyweek/2018/program-abstract.aspx?controlId=3022574 "Using Machine Learning to Predict Optimal Renal Replacement Therapy Starts in Patients with Advanced Renal Function Loss")
> **Dialysis: Vascular Access - II**
> October 27, 2018 | Location: Exhibit Hall, San Diego Convention Center
> Abstract Time: 10:00 AM - 12:00 PM

> [A Machine Learning Approach to Identifying Patients at Risk of Developing Incident CKD](https://www.asn-online.org/education/kidneyweek/2018/program-abstract.aspx?controlId=3013628 "A Machine Learning Approach to Identifying Patients at Risk of Developing Incident CKD")
> **CKD: Epidemiology, Risk Factors, Prevention - II**
> October 26, 2018 | Location: Exhibit Hall, San Diego Convention Center
> Abstract Time: 10:00 AM - 12:00 PM


------


**Motivation**: Search on ASN's Kidney Week website and the Android mobile app was a bit sluggish, so I decided to parse the Abstract Supplement PDF using:
* `tika`, the Python wrapper for Apache Tika (`pip install tika`)
* `re`, from the standard library

I also made a simple text search function, `get_matching_abstracts()`.

In [71]:
import re
import pandas as pd
from tika import parser

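raw = parser.from_file('KW18Abstracts.pdf')
# Character offsets specific to this PDF; presumably they trim the text
# before the first abstract and after the last one
raw_text = raw['content'][3580:12120360]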


def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]


# split by regex for abstract names
s = r'((TH|FR|SA)?-?(OR|PO|PUB)\d{3,4})\s'
split_abstracts = re.split(s, raw_text)

# Because the pattern has three capturing groups, re.split returns
# 4 items per abstract:
#   poster_id
#   day
#   poster_type
#   poster_content
# split_abstracts[0] is the text before the first ID, so it is skipped.

abstract_list = list(chunks(split_abstracts[1:], 4))
abstract_df = pd.DataFrame(abstract_list,
                           columns=['poster_id', 'day', 'poster_type', 'poster_content'])
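
To see why the split produces groups of four, here is a minimal sketch on an invented string (the sample text and the `parts`/`records` names are made up for illustration; only the pattern comes from the cell above). `re.split` inserts every captured group into its result, and a group that did not take part in the match comes back as `None`.

```python
import re

# Same pattern as in the cell above; the sample text is invented.
pattern = r'((TH|FR|SA)?-?(OR|PO|PUB)\d{3,4})\s'
sample = ("front matter TH-OR001 First abstract text. "
          "FR-PO189 Second abstract text. PUB001 Third.")

parts = re.split(pattern, sample)

# parts[0] is the text before the first ID, so grouping starts at index 1.
records = [parts[i:i + 4] for i in range(1, len(parts), 4)]

for record in records:
    print(record)
```

Note that an ID without a day prefix (such as `PUB001` here) yields `None` in the `day` position, so that column should be expected to contain missing values.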

In [74]:
def section_split(text):
    """Split one abstract into a dictionary of section name -> text."""
    sections = r'(Background|Methods|Results|Conclusions|Funding):\s'
    sectioned_list = re.split(sections, text)

    # The text before the first section header holds the session name
    # (index 1) and the title plus author list (index 3 onward)
    header_lines = sectioned_list[0].rstrip().split('\n')
    header = {'Session': header_lines[1],
              'TitleAuthor': " ".join(header_lines[3:])}

    content_dict = {chunk[0]: chunk[1].rstrip() for chunk in chunks(sectioned_list[1:], 2)}
    # Returns dictionary of section : text
    return {**header, **content_dict}

# Each poster_content cell now holds a dictionary of sections; the steps
# below expand it into columns
abstract_df['poster_content'] = abstract_df.poster_content.apply(section_split)

### Step 1) Turn abstract sections into a dictionary

In [75]:
pd.set_option('max_colwidth', 200)
abstract_df.head(5)

Unnamed: 0,poster_id,day,poster_type,poster_content
0,TH-OR001,TH,OR,"{'Methods': 'Injecting CC into the left renal artery of C57BL/6N, Mlkl-/- and CypD-/- mice, transcutaneous measurement of GFR after CC injection 24h, quantify renal tubular and endothelial cell ..."
1,TH-OR002,TH,OR,"{'Methods': 'The human proximal tubular cell line, HK-2, was treated with 10 μM of cisplatin and the renal cortex of C57BL/6 mice, injected with 25 mg/kg of cisplatin for 72 h, was analyzed. Th..."
2,TH-OR003,TH,OR,"{'Methods': 'Mice were subjected to: i) 30 minutes of bilateral renal ischemia with 48 hours of reperfusion to examine renal I/R injury, and ii) 30 minutes of unilateral renal ischemia with 2 w..."
3,TH-OR004,TH,OR,"{'Methods': 'COUP-TFII mRNA was measured by quantitative real-time PCR (qRT- PCR) and protein by immunostaining at different times after unilateral ischemia reperfusion (IRI). In vitro, knockdown..."
4,TH-OR005,TH,OR,"{'Methods': 'Wild-type C57BL/6 mice, ROR-deficient stagger [ROR (sg/sg)] mice and their wild-type (WT) littermates were used for in vivo studies. Renal I/R injury model was induced by bilateral ..."
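
The section-splitting idea in `section_split` can be tried in isolation. The sketch below uses an invented abstract whose layout is an assumption (real abstracts are longer and come with a session line, which is not modeled here); only the section pattern is taken from the function above.

```python
import re

# Same section pattern as in section_split; the sample text is invented.
section_pattern = r'(Background|Methods|Results|Conclusions|Funding):\s'
sample = ("Some Title\nA. Author\n"
          "Background: Why we did it.\n"
          "Methods: What we did.\n"
          "Results: What we found.\n")

parts = re.split(section_pattern, sample)

# parts[0] is everything before the first section header.
header = parts[0]
# The remaining items alternate: section name, section body, ...
sections = {name: body.rstrip() for name, body in zip(parts[1::2], parts[2::2])}

print(sections)
```

One caveat of this approach: any occurrence of, say, `Results: ` inside the body of another section would also be treated as a header.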


### Step 2) Explode the dictionary into individual rows

In [76]:
# Many would prefer the wide form

pd.set_option('max_colwidth', 50)
wide_abstract_df = pd.concat([abstract_df.drop(['poster_content'], axis=1),
                              abstract_df['poster_content'].apply(pd.Series)], axis=1)
wide_abstract_df.head(6)

Unnamed: 0,poster_id,day,poster_type,Background,Conclusions,Funding,Methods,Results,Session,TitleAuthor
0,TH-OR001,TH,OR,Cholesterol crystal (CC) embolism may be an un...,CC embolism activates numerous cell types to r...,Other U.S. Government Support,Injecting CC into the left renal artery of C57...,CC injection to mice caused a sudden drop in g...,AKI: New Players and New Mechanisms,Extracellular DNA Drives Cholesterol Crystal E...
1,TH-OR002,TH,OR,Cisplatin is an anti-neoplastic drug that indu...,mtDNA leakage to the cytosol induces tubular i...,,"The human proximal tubular cell line, HK-2, wa...","In cisplatin-treated HK-2 or kidney cortex, ST...",AKI: New Players and New Mechanisms,Mitochondrial DNA Leakage Causes Inflammation ...
2,TH-OR003,TH,OR,"DNA methylation, catalyzed by DNA methyltransf...",These results indicate that DNMT1 and DNMT3a-d...,"NIDDK Support, Veterans Affairs Support",Mice were subjected to: i) 30 minutes of bilat...,DNMT1 and DNMT3a were markedly increased in re...,AKI: New Players and New Mechanisms,Knockout of DNA Methyltransferases in Proximal...
3,TH-OR004,TH,OR,Pericytes are essential to maintain capillary ...,Down regulation of COUP-TFII in AKI is an earl...,NIDDK Support\n\n\n\nAKI: New Players and New ...,COUP-TFII mRNA was measured by quantitative re...,COUP-TFII mRNA and protein expression initiall...,AKI: New Players and New Mechanisms,Orphan Nuclear Receptor COUP-TFII Regulates Pe...
4,TH-OR005,TH,OR,Emerging evidence indicates that retinoid-rela...,"In summary, our study provides experimental ev...",Government Support - Non-U.S.\n\nRORα deficien...,"Wild-type C57BL/6 mice, ROR-deficient stagger ...",We found that RORα was significantly down-regu...,AKI: New Players and New Mechanisms,The Orphan Nuclear Receptor RORα Is a Potentia...
5,TH-OR006,TH,OR,Energy depletion in renal tubular cells is a h...,NUPR1 protects renal tubular cells from energy...,"NIDDK Support, Private Foundation Support",Stable renal proximal tubular cell lines were ...,NUPR1 overexpression increased the ATP/ADP rat...,AKI: New Players and New Mechanisms,Stress Response Gene NUPR1 Protects Renal Tubu...
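
As an alternative sketch for the same expansion, a column of dictionaries can be passed to the `pd.DataFrame` constructor as a list, which is often quicker than `.apply(pd.Series)` on large frames. The toy frame and the names `toy`, `expanded`, and `wide` are invented for illustration.

```python
import pandas as pd

# Invented toy data: one dictionary of sections per poster.
toy = pd.DataFrame({
    'poster_id': ['X-001', 'X-002'],
    'poster_content': [
        {'Background': 'b1', 'Methods': 'm1'},
        {'Background': 'b2', 'Funding': 'f2'},
    ],
})

# Build one column per dictionary key; keys missing from a row become NaN.
expanded = pd.DataFrame(toy['poster_content'].tolist(), index=toy.index)
wide = pd.concat([toy.drop(columns='poster_content'), expanded], axis=1)

print(wide)
```

Sections that an abstract does not have end up as missing values, which matches the empty `Funding` cell seen in the output above.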


### Step 3) If you prefer the long/tidy format (I used to be an R user)

In [77]:
# I like the melted form better
# Allows for easier searching
pd.set_option('max_colwidth', 200)
abstract_df = pd.melt(wide_abstract_df,
                      id_vars=['poster_id', 'day', 'poster_type'],
                      var_name="section",
                      value_name='text',
                      value_vars=['TitleAuthor', 'Session', 'Background', 'Methods',
                                  'Conclusions', 'Results', 'Funding']).reset_index(drop=True)
abstract_df.head(10)

Unnamed: 0,poster_id,day,poster_type,section,text
0,TH-OR001,TH,OR,TitleAuthor,"Extracellular DNA Drives Cholesterol Crystal Embolism-Related Tissue Injury Chongxu Shi,1 Shrikant R. Mulay,1 Barbara M. Klinkhammer,2 Helen Liapis,3 Peter Boor,2 Hans J. Anders.1 1Medizinische ..."
1,TH-OR002,TH,OR,TitleAuthor,"Mitochondrial DNA Leakage Causes Inflammation via the cGAS-STING Axis in Cisplatin-Mediated Tubular Damage Hiroshi Maekawa,1,2 Tzu-Ming Jao,3 Reiko Inoue,1 Hiroshi Nishi,1 Tsuyoshi Inoue,1,2 Mas..."
2,TH-OR003,TH,OR,TitleAuthor,"Knockout of DNA Methyltransferases in Proximal Tubules Preserves Klotho and Improves Kidney Repair after Ischemia/Reperfusion Injury Chunyuan Guo,1 Qingqing Wei,1 Man J. Livingston,2 Zheng Dong.3..."
3,TH-OR004,TH,OR,TitleAuthor,"Orphan Nuclear Receptor COUP-TFII Regulates Pericyte Detachment in AKI Li Li, Pierre Galichon, Takaharu Ichimura, Joseph V. Bonventre. Brigham and Women’s Hospital/Harvard Medical School, Boston..."
4,TH-OR005,TH,OR,TitleAuthor,"The Orphan Nuclear Receptor RORα Is a Potential Endogenous Protector in Renal Ischemia/Reperfusion Injury Jieru Cai. Zhongshan Hospital, Fudan University, Shanghai, China."
5,TH-OR006,TH,OR,TitleAuthor,"Stress Response Gene NUPR1 Protects Renal Tubular Cells from Proliferation-Induced Energy Exhaustion Pierre Galichon, Li Li, M. Todd Valerius, Joseph V. Bonventre. Brigham and Women’s Hospital/H..."
6,TH-OR007,TH,OR,TitleAuthor,"C-Type Lectin Mincle Accelerates Renal Ischemia-Reperfusion Injury Miyako Tanaka,1,2 Marie Saka-Tanaka,1,3 Naotake Tsuboi,3 Shoichi Maruyama,3 Takayoshi Suganami.1,2 1Department of Molecular Medi..."
7,TH-OR008,TH,OR,TitleAuthor,"Novel Liposomal Nanocarriers of Preassembled Glycocalyx Restore Renal Microcirculation in Sepsis Michael S. Goligorsky, Dong Sun. New York Medical College, Valhalla, NY."
8,TH-OR009,TH,OR,TitleAuthor,Human Recombinant Alkaline Phosphatase (recAP) Protection from Kidney Ischemia-Reperfusion Injury (IRI) Is Mediated by Dephosphory- lation of ATP to Adenosine and Activation of Adenosine A2a Rece...
9,TH-OR010,TH,OR,TitleAuthor,"Inactivation of Mtorc1 in Proximal Tubular Epithelial Cells Reduces KIM-1-Mediated Kidney Fibrosis and Inflammation in Mice Wenqing Yin,1 Craig R. Brooks,2 M. Todd Valerius,3 Joseph V. Bonventre...."
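
A toy version of the melt makes the row ordering easier to see: `pd.melt` stacks the value columns one after another, which is why the first rows above are all `TitleAuthor`. The frame and the names `toy_wide`, `toy_long`, and `non_empty` are invented for illustration.

```python
import pandas as pd

# Invented toy data in wide form; one abstract has no Methods text.
toy_wide = pd.DataFrame({
    'poster_id': ['X-001', 'X-002'],
    'Background': ['b1', 'b2'],
    'Methods': ['m1', None],
})

toy_long = pd.melt(toy_wide,
                   id_vars=['poster_id'],
                   value_vars=['Background', 'Methods'],
                   var_name='section',
                   value_name='text')

# Optionally drop sections that an abstract does not have.
non_empty = toy_long.dropna(subset=['text']).reset_index(drop=True)

print(toy_long)
print(non_empty)
```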


### Basic search function for text searching

In [78]:
######################################################################
# One helper function
######################################################################

def get_matching_abstracts(search_str, section=None, return_match_sections_only=False):
    if section:
        results = abstract_df[(abstract_df['section'] == section) &
                              (abstract_df.text.str.contains(search_str, flags=re.IGNORECASE))]
    else:
        results = abstract_df[~(abstract_df.section.isna()) &
                              (abstract_df.text.str.contains(search_str, flags=re.IGNORECASE))]

    matched_poster_ids = sorted(list(set(results.poster_id.tolist())))

    if not return_match_sections_only:
        results = abstract_df[abstract_df.poster_id.isin(matched_poster_ids)]
    print('{} has {} abstracts: {}'.format(search_str, len(matched_poster_ids), matched_poster_ids))
    return results.sort_values('poster_id').reset_index(drop=True)
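
One thing to keep in mind: `str.contains` treats `search_str` as a regular expression by default, so a query containing characters such as `(` or `+` may fail or match unexpectedly. The sketch below shows one possible way to search literally, using `re.escape` and `na=False` so that missing text simply counts as no match. The `literal_mask` helper and the toy frame are hypothetical and not part of the notebook.

```python
import re
import pandas as pd

# Invented toy data in the same long format as abstract_df.
toy = pd.DataFrame({
    'poster_id': ['X-001', 'X-001', 'X-002'],
    'section': ['TitleAuthor', 'Funding', 'TitleAuthor'],
    'text': ['Machine learning for CKD (pilot)', None, 'Dialysis outcomes'],
})


def literal_mask(df, search_str):
    """Case-insensitive literal match; rows with missing text do not match."""
    return df['text'].str.contains(re.escape(search_str),
                                   flags=re.IGNORECASE, na=False)


mask = literal_mask(toy, 'ckd (pilot)')
hits = sorted(toy.loc[mask, 'poster_id'].unique())
print(hits)
```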


In [79]:
######################################################################
# Default search: matches any section and returns all sections of
#   each matching abstract
#   'section' restricts the match to that section
#   'return_match_sections_only=True' returns only the matching rows
#
# --------------------------------------------------------------------
# Possible Sections
#   'TitleAuthor', 'Session', 'Background', 'Methods',
#   'Conclusions', 'Results', 'Funding'
#
######################################################################

# Note it prints out poster IDs as list
pd.set_option('max_colwidth', 1000)

get_matching_abstracts('CureGN').head(5)

CureGN has 7 abstracts: ['FR-OR078', 'SA-PO396', 'SA-PO397', 'SA-PO398', 'SA-PO400', 'TH-PO1015', 'TH-PO1016']


Unnamed: 0,poster_id,day,poster_type,section,text
0,FR-OR078,FR,OR,TitleAuthor,"Confidence in Women’s Health: An International Survey of Nephrologists Monica L. Reynolds,1 Elizabeth M. Hendren,4 Jarcy Zee,3 Laura H. Mariani,2 Michelle A. Hladunewich.4 1Nephrology, University of North Carolina, Chapel Hill, NC; 2University of Michigan, Ann Arbor, MI; 3Arbor Research Collaborative for Health, Ann Arbor, MI; 4University of Toronto, Toronto, ON, Canada."
1,FR-OR078,FR,OR,Funding,NIDDK Support\n\nNephrologists’ Confidence Managing Women’s Health
2,FR-OR078,FR,OR,Results,"Of the 154 respondents, 58% were from the US, 53% were women, and \nthe median age was between 41-45. The majority (77%) identified their practice setting \nas academic. 55% of the respondents had fellowship training in women’s health, which \nwas similar across country of training (p=0.325). Nephrologists from both countries lacked \nconfidence across a spectrum of issues (Figure 1). Most provided contraception (64%) and \npre-conception (68%) counseling to less than one woman per month though counseling \noccurred significantly more frequently in the US. In their career, 91% have cared for less \nthan five pregnant women on dialysis. Only 12% had access to interdisciplinary clinics. \nFinally, 89% felt that interdisciplinary guidelines and/or continuing education seminars \nwould improve knowledge."
3,FR-OR078,FR,OR,Conclusions,"As women with chronic kidney disease experience adverse maternal \noutcomes and remain at risk for disease progression postpartum, we must do better to \nbolster provider knowledge and comfort level. Further research is warranted to identify \nbarriers to counseling about women’s health issues, identify best mechanisms to enhance \nphysician confidence, and facilitate formation of interdisciplinary clinics. Interdisciplinary \nguidelines and case based materials may be a starting point."
4,FR-OR078,FR,OR,Session,"Glomerular Diseases: Clinical, Outcomes, and Trials"


In [80]:
pd.set_option('max_colwidth', 1000)

get_matching_abstracts('Machine Learning', section='TitleAuthor', return_match_sections_only=True).head(8)

Machine Learning has 8 abstracts: ['FR-PO1112', 'FR-PO189', 'FR-PO886', 'SA-PO392', 'SA-PO953', 'TH-OR117', 'TH-PO203', 'TH-PO408']


Unnamed: 0,poster_id,day,poster_type,section,text
0,FR-PO1112,FR,PO,TitleAuthor,"Insights from Machine Learning Analysis of Collated Lupus Nephritis Trials Liliana M. Gomez Mendez,2,1 Jian Dai,1 Jason Kutarnia,3 Scott Howser,7 Jay Schuren,7 Dana Mcclintock,5 Matthew Cascino,6 Marco Prunotto.4 1Genentech, South San Francisco, CA; 2University of California San Francisco, San Francisco, United States Minor Outlying Islands; 3Mitre, Bedford, MA; 4F. Hoffmann-La Roche Ltd., Basel, Switzerland; 5Genentech, Inc, South San Francisco, CA; 6Genentech, Inc., South San Francisco, CA; 7DataRobot, Boston, MA."
1,FR-PO189,FR,PO,TitleAuthor,"A Machine Learning Approach to Identifying Patients at Risk of Developing Incident CKD Tia Y. Yu,1 Lauren A. Wiener,1 Xiaoyan Wang,1 Ollie Fielding,1 Jung Hoon Son,1 Praveen Kumar Potukuchi,2 Csaba P. Kovesdy.2 1pulseData, New York, NY; 2University of Tennessee Health Science Center, Memphis, TN."
2,FR-PO886,FR,PO,TitleAuthor,"The Association Between Location of Eplet Mismatches, De Novo Donor Specific Antibodies, and Acute Rejection in Simultaneous Pancreas-Kid- ney Transplant Recipients Using Novel Machine Learning Methods Ankit Sharma,1,2 Craig Coorey,1 Anne T. Taverniti,1 Brian J. Nankivell,3 Jeremy R. Chapman,3 Jonathan C. Craig,1 Wai H. Lim,5 Jean Yang,4 Germaine Wong.1,2 1Centre for Kidney Research, Westmead, NSW, Australia; 2School of Public Health, University of Sydney, Sydney, NSW, Australia; 3Westmead Hospital, WESTMEAD, NSW, Australia; 4University of Sydney, Sydney, NSW, Australia; 5Sir Charles Gairdner Hospital, Nedlands, WA, Australia."
3,SA-PO392,SA,PO,TitleAuthor,"Studying Rare Disease Using an Electronic Health Record (EHR) and Machine Learning Based Approach: The Kaiser Permanente Southern California (KPSC) Membranous Nephropathy (MN) Cohort Amy Z. Sun,1 Yu-Hsiang Shu,4 Teresa N. Harrison,4 Michelle M. O’Shaughnessy,2 John J. Sim.3 1Department of Internal Medicine, Kaiser Permanente Los Angeles Medical Center, Los Angeles, CA; 2Division of Nephrology, Stanford University, Palo Alto, CA; 3Division of Nephrology and Hypertension, Kaiser Permanente Los Angeles Medical Center, Los Angeles, CA; 4Kaiser Permanente, Pasadena, CA."
4,SA-PO953,SA,PO,TitleAuthor,"Using Machine Learning to Predict Optimal Renal Replacement Therapy Starts in Patients with Advanced Renal Function Loss Ollie Fielding,1 Chris Kipers,1 Jung Hoon Son,1 Edward Lee,1 Daniel Levine,2 Thomas Parker,2 Barry H. Smith,2 Jeffrey I. Silberzweig.2 1pulseData, New York, NY; 2The Rogosin Institute, New York, NY."
5,TH-OR117,TH,OR,TitleAuthor,"Recognizing Sepsis: A High-Throughput Non-Invasive Assessment Using Machine Learning and Urinary MicroRNAs Ferdous Kadri,1,2 Sabyasachi Bandyopadhyay,6,4 Lasith Adhikari,1,2 Tezcan Ozrazgat-baslanti,1,2 Laura Sautina,1,3 Maria C. Lopez,5,3 Henry V. Baker,5,3 Mark S. Segal,1,3 Parisa Rashidi,4,2 Azra Bihorac.1,2 PRISMA-P 1Division of Nephrology, Hypertension, and Renal Transplantation, Department of Medicine, University of Florida, Gainesville, FL; 2Precision and Intelligent Systems in Medicine (PrismaP), University of Florida, Gainesville, FL; 3Sepsis and Critical Illness Research Center, University of Florida, Gainesville, FL; 4J. Crayton Pruitt Family Department of Biomedical Engineering, University of Florida, Gainesville, FL; 5Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, FL; 6Department of Surgery, University of Florida, Gainesville, FL."
6,TH-PO203,TH,PO,TitleAuthor,"Using Machine Learning to Help Predict Elevated Serum Phosphate Levels in Patients with ESRD Andrew Long,1 Tommy C. Blanchard,1 Joanna Willetts,1 Sheetal Chaudhuri,1 Michael R. O’Connell,1 Kathleen Belmonte,2 Marissa A. Lee,1 Len A. Usvyat,1 Terry L. Ketchersid,1 Franklin W. Maddux.1 1Fresenius Medical Care North America, Waltham, MA; 2Fresenius Kidney Care, Waltham, MA."
7,TH-PO408,TH,PO,TitleAuthor,"A Machine Learning Model to Predict Patient Risk of Peritonitis Episodes Tommy C. Blanchard, Joanna Willetts, Michael R. O’Connell, Sheetal Chaudhuri, Len A. Usvyat, Brian C. Ellison, Judith Moran, Melissa Herman, Susan M. Dunphy, Franklin W. Maddux. Fresenius Medical Care North America, Waltham, MA."


In [81]:
######################################################################
# KidneyWeek Abstract Supplement PDF to Pandas
#     by Jung Hoon Son @ pulseData
######################################################################
#
# Visit our posters at pulseData!
#
pd.set_option('max_colwidth', 1000)
get_matching_abstracts('pulseData', 'TitleAuthor', return_match_sections_only=True)

pulseData has 2 abstracts: ['FR-PO189', 'SA-PO953']


Unnamed: 0,poster_id,day,poster_type,section,text
0,FR-PO189,FR,PO,TitleAuthor,"A Machine Learning Approach to Identifying Patients at Risk of Developing Incident CKD Tia Y. Yu,1 Lauren A. Wiener,1 Xiaoyan Wang,1 Ollie Fielding,1 Jung Hoon Son,1 Praveen Kumar Potukuchi,2 Csaba P. Kovesdy.2 1pulseData, New York, NY; 2University of Tennessee Health Science Center, Memphis, TN."
1,SA-PO953,SA,PO,TitleAuthor,"Using Machine Learning to Predict Optimal Renal Replacement Therapy Starts in Patients with Advanced Renal Function Loss Ollie Fielding,1 Chris Kipers,1 Jung Hoon Son,1 Edward Lee,1 Daniel Levine,2 Thomas Parker,2 Barry H. Smith,2 Jeffrey I. Silberzweig.2 1pulseData, New York, NY; 2The Rogosin Institute, New York, NY."


### Saving to CSV
`abstract_df.to_csv('kidney_week_2018_abstracts.csv')`
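
By default `to_csv` also writes the row index as an unnamed first column, which `pd.read_csv` then loads back as `Unnamed: 0`. If that column is not wanted, passing `index=False` is one option. The sketch below round-trips an invented one-row frame through an in-memory buffer rather than a file; it also shows that text containing newlines survives because the field is quoted.

```python
import io
import pandas as pd

# Invented toy row with a newline inside the text field.
toy = pd.DataFrame({
    'poster_id': ['X-001'],
    'section': ['Methods'],
    'text': ['line one\nline two'],
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # omit the index column
buffer.seek(0)

restored = pd.read_csv(buffer)
print(restored.columns.tolist())
```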