# Part 0 : Introduction
_____

In this tutorial we will use pandas to analyze a CSV file and prepare it for training a seq2seq model.

With pandas we can filter and reshape the data to suit our needs. To do this, we first need to understand how the data is structured.

The dataset in this example consists of radiology reports, each containing the findings and the resulting diagnosis.

# Part 1: Pandas Basics
_____

Below you can see how to import the CSV data into your Python code.

In [66]:
import pandas as pd

In [67]:
dataFrame=pd.read_csv("ReportsDATASET.csv")

Let's look at a summary of the data. describe() is more useful for numeric data, but here it still shows that there are 1984 text values, 1982 of which are unique.

In [68]:
dataFrame.describe()

Unnamed: 0,Text
count,1984
unique,1982
top,"\nSIGNATURE\nXXXX\n\nRADIOLOGY REPORT\nchest pain CHEST 2V FRONTAL/LATERAL XXXX, XXXX XXXX XXXX Comparison: XXXX, XXXX. The heart and lungs have XXXX XXXX in the interval. Both lungs are clear and expanded. Heart and mediastinum normal. Transcribed by - PSCB Transcription Date - XXXX\n\nIMPRESSION\nNo active disease. DICTATED BY : Dr. XXXX XXXX XXXX XXXX XXXX ELECTRONICALLY SIGNED XXXX. XXXX XXXX XXXX XXXX XXXX TRANSCRIBED XXXX XXXX\n\n"
freq,2
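
The gap between `count` (1984) and `unique` (1982) suggests that a couple of reports appear more than once. A minimal sketch of how such duplicates could be listed, using a small made-up frame in place of the real CSV (the `toy` frame and its contents are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in for the real dataset; the texts are invented for illustration.
toy = pd.DataFrame({"Text": ["report a", "report b", "report a", "report c"]})

# describe() on a text column reports count, unique, top and freq.
summary = toy["Text"].describe()

# keep=False marks every copy of a duplicated value, not only the later ones.
duplicates = toy[toy["Text"].duplicated(keep=False)]

print(summary)
print(duplicates)
```

On the real data the same call would be made on `dataFrame["Text"]`.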


The shape of the DataFrame gives the number of rows and columns.

In [69]:
dataFrame.shape

(1984, 1)

The head() method shows the first rows.

In [70]:
dataFrame.head() 

Unnamed: 0,Text
0,\nChest PA-Lat XR\n\nImaging Study\nXray Chest PA and Lateral\nExam: 2 views of the chest XXXX/XXXX.\n \nComparison: None.\n \nIndication: Positive TB test\n \nFindings:\nThe cardiac silhouette and mediastinum size are within normal limits.\nThere is no pulmonary edema. There is no focal consolidation. There\nare no XXXX of a pleural effusion. There is no evidence of\npneumothorax.\n \nImpression:\nNormal chest x-XXXX. \nThis examination and reported findings have been reviewed and\nconfirmed by the undersigned.\n\n
1,"EXAM(S): Chest, 2 views, frontal and lateral\n\nDATE\nXXXX\n\nCOMPARISON\nNone.\n\nINDICATION\nPreop bariatric surgery.\n\nFINDINGS\nBorderline cardiomegaly. Midline sternotomy XXXX. Enlarged pulmonary arteries. Clear lungs. Inferior XXXX XXXX XXXX.\n\nIMPRESSION\nNo acute pulmonary findings. \n XXXX XXXX for the opportunity to care for your patient. If XXXX have any questions regarding this report, please XXXX the radiologist, Dr. XXXX XXXX, at XXXX.\n"
2,"\nExam\nXray Chest PA and Lateral\n\nDate\nXXXX\n\nHistory\nrib pain after a XXXX, XXXX XXXX steps this XXXX. Pain to R back, R elbow and R rib XXXX, no previous heart or lung hx, non-XXXX, no hx ca\n\nImpression\nNo displaced rib fractures, pneumothorax, or pleural effusion identified. Well-expanded and clear lungs. Mediastinal contour within normal limits. No acute cardiopulmonary abnormality identified.\n"
3,"\nRADIOLOGY REPORT\n\nExamination\nPA and lateral views of the chest XXXX, XXXX at XXXX hours History: XXXX-year-old XXXX with XXXX. Comparison: None available Findings: There are diffuse bilateral interstitial and alveolar opacities consistent with chronic obstructive lung disease and bullous emphysema. There are irregular opacities in the left lung apex, that could represent a cavitary lesion in the left lung apex.There are streaky opacities in the right upper lobe, XXXX scarring. The cardiomediastinal silhouette is normal in size and contour. There is no pneumothorax or large pleural effusion. Transcribed by - PSC Transcription Date - XXXX\n\nIMPRESSION\n1. Bullous emphysema and interstitial fibrosis. 2. Probably scarring in the left apex, although difficult to exclude a cavitary lesion. 3. Opacities in the bilateral upper lobes could represent scarring, however the absence of comparison exam, recommend short interval followup radiograph or CT thorax to document resolution.\n\nSIGNATURE\nXXXX\n\n"
4,"\nChest PA-Lat XR\n\nImaging Study\nXray Chest PA and Lateral\nEXAMINATION: CHEST ( FRONTAL AND LATERAL): XXXX, XXXX XXXX PM \n \nCLINICAL INDICATION: Chest and nasal congestion.\n \nCOMPARISXXXX/XXXX.\n \nFINDINGS:\nThe cardiomediastinal silhouette and pulmonary vasculature are within\nnormal limits. There is no pneumothorax or pleural effusion. There\nare no focal areas of consolidation. Cholecystectomy clips are\npresent. Small T-spine osteophytes. There is biapical pleural\nthickening, unchanged from prior. Mildly hyperexpanded lungs.\n \nIMPRESSION:\nNo acute cardiopulmonary abnormality.\n\n"


Let's remove the limit on the displayed column width.

In [22]:
pd.set_option('display.max_colwidth', None)
dataFrame.head()
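
`pd.set_option` changes the setting for the rest of the session. If the full width is only wanted for a single display, `pd.option_context` restores the previous value afterwards. A small sketch of that behaviour (the variable names are chosen here for illustration):

```python
import pandas as pd

# Remember the current setting so we can compare later.
before = pd.get_option("display.max_colwidth")

# Inside the block the column width is unlimited (None).
with pd.option_context("display.max_colwidth", None):
    inside = pd.get_option("display.max_colwidth")

# After the block the previous value is back in place.
after = pd.get_option("display.max_colwidth")
print(before, inside, after)
```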

These rows are quite long, but they contain two very important keywords: Findings and Impression.

Findings describes what was observed, and Impression gives the diagnosis based on those observations.

Let's search for these keywords and see how many rows contain them.

In [72]:
# List your keywords
keywords = ['IMPRESSION', 'FINDINGS']

# dataFrame["Text"] selects the column
# str.contains checks whether the keywords exist
# '|'.join(keywords) joins the keywords with "or", so a row counts if either one exists
# case=False makes the match case-insensitive
# na=False makes NaN values count as not containing the keywords

mask = dataFrame['Text'].str.contains('|'.join(keywords), case=False, na=False)

# Count the number of rows containing the keywords
count = mask.sum()

print(f'Number of total rows {dataFrame.shape[0]} ')
print(f'Number of rows containing the keywords : {count}')

Number of total rows 1984 
Number of rows containing the keywords : 1982


In [73]:
print(keywords)

['IMPRESSION', 'FINDINGS']
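
To make the behaviour of the mask concrete, here is a small sketch on made-up rows (the `toy` frame is an assumption, not part of the dataset). It also passes the keywords through `re.escape`, which is unnecessary for plain words but guards against keywords that contain regex metacharacters:

```python
import re
import pandas as pd

# Invented rows: both keywords, one keyword, no keyword, and a missing value.
toy = pd.DataFrame({"Text": [
    "Findings: clear lungs. Impression: normal.",
    "IMPRESSION\nNo active disease.",
    "Imaging Study\nXray Chest PA and Lateral",
    None,
]})

keywords = ["IMPRESSION", "FINDINGS"]

# Escape each keyword, then join them with "|" (regex "or").
regex = "|".join(re.escape(k) for k in keywords)

# case=False ignores case; na=False treats missing values as "no match".
toy_mask = toy["Text"].str.contains(regex, case=False, na=False)

print(toy_mask.tolist())
print(int(toy_mask.sum()))
```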


All rows except 2 contain a keyword. Let's see what those rows are. We already created a mask, so let's invert it and apply it.

In [74]:
# Invert the mask to get rows that do not contain the keywords
inverse_mask = ~mask

# Filter the DataFrame using the inverted mask
rows_without_keywords = dataFrame[inverse_mask]

print(rows_without_keywords)

                                                                   Text
15    \nChest PA-Lat XR\n\nImaging Study\nXray Chest PA and Lateral\n\n
1501    \nChest PA-Lat XR\n\nImaging Study\nXR Chest PA and Lateral\n\n


It looks like those two rows do not contain any important data.   

Let's see whether there are rows with only one of our keywords.


In [75]:
# Mask for rows containing the word "impression"
mask_keyword1 = dataFrame['Text'].str.contains(keywords[0], case=False, na=False)

# Mask for rows containing the word "findings"
mask_keyword2 = dataFrame['Text'].str.contains(keywords[1], case=False, na=False)



In [76]:
# Apply the mask that looks for keyword1 (impression)
key1df=dataFrame[mask_keyword1]

# Invert the mask that looks for keyword2 (findings)
nokey2=~mask_keyword2

print(key1df[nokey2].count())
# Applying "no findings" to the impression rows leaves 319 rows
key1df[nokey2].head()

Text    319
dtype: int64



Unnamed: 0,Text
2,"\nExam\nXray Chest PA and Lateral\n\nDate\nXXXX\n\nHistory\nrib pain after a XXXX, XXXX XXXX steps this XXXX. Pain to R back, R elbow and R rib XXXX, no previous heart or lung hx, non-XXXX, no hx ca\n\nImpression\nNo displaced rib fractures, pneumothorax, or pleural effusion identified. Well-expanded and clear lungs. Mediastinal contour within normal limits. No acute cardiopulmonary abnormality identified.\n"
26,"\nEXAM\nPA and LAT view CHEST XXXX, XXXX XXXX PM\n\nIndication\nChronic XXXX XXXX\n\nComparisons\nXXXX\n\nDiscussion\nLungs are overall hyperexpanded with flattening of the diaphragms. No focal consolidation. No pleural effusions or pneumothoraces. Heart and mediastinum of normal size and contour. Degenerative changes in the thoracic spine.\n\nImpression\nHyperexpanded but clear lungs.\n"
28,"EXAM(S): Chest, 2 views, frontal and lateral\n\nDATE\nXXXX\n\nCOMPARISON\nXXXX, XXXX\n\nINDICATION\nXXXX, hypoxia.\n\nIMPRESSION\nBorderline heart size. Elevated left diaphragm. Clear right lung. Tracheostomy tube tip above the carina. Extensive airspace disease in the left base. No large effusion or pneumothorax. \n XXXX XXXX for the opportunity to care for your patient. If XXXX have any questions regarding this report, please XXXX the radiologist, Dr. XXXX XXXX, at XXXX.\n"
29,"\nRADIOLOGY REPORT\n\nExam\nChest x-XXXX XXXX and lateral, XXXX Indication: XXXX-year-old male with chest pain. Comparison: None Discussion: Lungs are clear without focal consolidation, effusion, or pneumothorax. Normal heart size. Negative for pneumoperitoneum. Bony thorax and soft tissue grossly unremarkable Transcribed by - PSC Transcription Date - XXXX\n\nIMPRESSION\nNegative acute cardiopulmonary abnormality.\n\nSIGNATURE\nXXXX\n\n"
30,"\nChest PA-Lat XR\n\nImaging Study\nXray Chest PA and Lateral\nExam: Xray Chest PA and Lateral \n \nDate: XXXX, XXXX XXXX PM \n \nHistory: XXXX DYSPNEA \n \nImpression:\n \nComparison XXXX, XXXX.\n \nSuggestion of slightly more prominent interstitial markings, which\nmay represent some bronchitic/bronchiolitis changes. No suspicious\nnodules, pneumonia, effusions, or CHF. Stable mediastinal contour.\n\n"
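
Indexing `key1df` with a mask that was built on the full `dataFrame` works, but pandas has to realign the mask to the smaller frame and emits a reindexing warning. A sketch of an equivalent selection that combines both conditions into a single mask on the original frame (the toy data and variable names are assumptions for illustration):

```python
import pandas as pd

# Invented rows covering the four possible keyword combinations.
toy = pd.DataFrame({"Text": [
    "Findings: clear. Impression: normal.",
    "Discussion: clear. Impression: normal.",
    "Findings: clear.",
    "Imaging Study",
]})

has_impression = toy["Text"].str.contains("impression", case=False, na=False)
has_findings = toy["Text"].str.contains("findings", case=False, na=False)

# Combine the conditions with & (and) and ~ (not) before indexing,
# so the mask always has the same index as the frame it filters.
only_impression = toy[has_impression & ~has_findings]
only_findings = toy[has_findings & ~has_impression]

print(only_impression)
print(only_findings)
```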


As can be seen, "Discussion" is sometimes used instead of "Findings".
Let's see how many of the rows that do not contain "findings" contain "discussion".

In [77]:
mask_discus = dataFrame['Text'].str.contains("Discussion", case=False, na=False)
print(key1df[nokey2].count())
only_impressions=key1df[nokey2]
print(only_impressions[mask_discus].count())

Text    319
dtype: int64
Text    46
dtype: int64



Now let's filter for the rows that contain findings but not impression.

In [78]:
# Apply the mask that looks for keyword2 (findings)
key2df=dataFrame[mask_keyword2]

# Invert the mask that looks for keyword1 (impression)
nokey1=~mask_keyword1

print(key2df[nokey1].count())
# Applying "no impression" to the findings rows leaves a single row
key2df[nokey1].head(10)

Text    1
dtype: int64



Unnamed: 0,Text
809,"\nChest PA-Lat XR\n\nImaging Study\nXray Chest PA and Lateral\nExamination: PA lateral views of the chest dated XXXX.\n \nComparison: None\n \nHistory: XXXX-year-old female, tobacco use, preop.\n \nFindings: There are no focal areas of consolidation. No pleural\neffusions. No pneumothorax. Heart size within normal limits.\nCalcified granulomas. Degenerative changes thoracic spine.\n \nFindings:\nNo acute cardiopulmonary abnormality. \nThis examination and reported findings have been reviewed and\nconfirmed by the undersigned.\n\n"


So this report has no "Impression" header, but it has two "Findings" headers, and the second one should have been named "Impression".

key1df holds the rows with impression  
key2df holds the rows with findings

In [79]:
print(key1df.count())
print(key2df.count())

Text    1981
dtype: int64
Text    1663
dtype: int64


In [80]:
print(key1df[mask_keyword2].count())


Text    1662
dtype: int64


We have 1662 rows that include both keywords, which is good for us.
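
The separate counts above (both keywords, only one, or neither) can also be obtained in one step by cross-tabulating the two masks. A minimal sketch with `pd.crosstab` on invented rows; the `toy` frame and the label names are assumptions for illustration:

```python
import pandas as pd

# Invented rows: two with both keywords, one with each single keyword, one with none.
toy = pd.DataFrame({"Text": [
    "findings ... impression ...",
    "findings ... impression ...",
    "impression only",
    "findings only",
    "neither keyword",
]})

has_impression = toy["Text"].str.contains("impression", case=False, na=False)
has_findings = toy["Text"].str.contains("findings", case=False, na=False)

# Rows of the table: impression False/True; columns: findings False/True.
table = pd.crosstab(has_impression, has_findings,
                    rownames=["impression"], colnames=["findings"])
print(table)
```

Each cell of the table is the number of rows with that combination, so the bottom-right cell corresponds to the rows that contain both keywords.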

# Part 2 : Cleaning Data
-----------

In this part we start making invasive changes to the data.  

Don't forget that all our changes are made on the dataFrame variable, not on the .csv file. 

So if we want to keep the results, we need to save them. 

Let's keep only letters, commas, full stops and whitespace. A regular expression will help us here.   

a-z is for lower case  
A-Z is for upper case  
. is for full stop  
, is for comma  
\s is for whitespace (spaces, tabs and newlines)


In [81]:
# Define a regular expression pattern for the characters we want to remove:
# it matches anything that is not a letter, full stop, comma, or whitespace
pattern = r'[^a-zA-Z.,\s]'
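
To see what this pattern keeps and removes, here is a short sketch that applies it to a single made-up string with Python's `re` module (the sample text is an assumption, loosely modelled on the reports above):

```python
import re

pattern = r'[^a-zA-Z.,\s]'

# Invented sample containing a colon, a digit, a slash and a hyphen.
sample = "Exam: 2 views of the chest XXXX/XXXX.\nImpression:\nNormal chest x-XXXX."

# Every character matched by the pattern is replaced with an empty string.
cleaned = re.sub(pattern, '', sample)

# repr() makes the preserved newlines visible as \n.
print(repr(cleaned))
```

Note that removing a character does not remove the spaces around it, so "Exam: 2 views" ends up with two consecutive spaces.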


This mostly gets rid of colons, digits and hyphens. We keep \s because it includes the newline character, which helps us detect line breaks later.

In [82]:

# Apply the pattern to each element in the DataFrame and replace every matching (unwanted) character with an empty string
df = dataFrame.replace(to_replace=pattern, value='', regex=True)

# Save the cleaned DataFrame back to a CSV file
df.to_csv('cleaned_file.csv', index=False)


Now we will convert everything to lower case.

In [83]:

# Let's read our cleaned file
df = pd.read_csv('cleaned_file.csv')

# Convert the text column to lowercase
# (.str.lower() avoids DataFrame.applymap, which is deprecated in recent pandas versions)
df['Text'] = df['Text'].str.lower()


# Save the cleaned DataFrame back to a CSV file
df.to_csv('cleaned_file.csv', index=False)

df.head() # to see



Unnamed: 0,Text
0,\nchest palat xr\n\nimaging study\nxray chest pa and lateral\nexam views of the chest xxxxxxxx.\n \ncomparison none.\n \nindication positive tb test\n \nfindings\nthe cardiac silhouette and mediastinum size are within normal limits.\nthere is no pulmonary edema. there is no focal consolidation. there\nare no xxxx of a pleural effusion. there is no evidence of\npneumothorax.\n \nimpression\nnormal chest xxxxx. \nthis examination and reported findings have been reviewed and\nconfirmed by the undersigned.\n\n
1,"exams chest, views, frontal and lateral\n\ndate\nxxxx\n\ncomparison\nnone.\n\nindication\npreop bariatric surgery.\n\nfindings\nborderline cardiomegaly. midline sternotomy xxxx. enlarged pulmonary arteries. clear lungs. inferior xxxx xxxx xxxx.\n\nimpression\nno acute pulmonary findings. \n xxxx xxxx for the opportunity to care for your patient. if xxxx have any questions regarding this report, please xxxx the radiologist, dr. xxxx xxxx, at xxxx.\n"
2,"\nexam\nxray chest pa and lateral\n\ndate\nxxxx\n\nhistory\nrib pain after a xxxx, xxxx xxxx steps this xxxx. pain to r back, r elbow and r rib xxxx, no previous heart or lung hx, nonxxxx, no hx ca\n\nimpression\nno displaced rib fractures, pneumothorax, or pleural effusion identified. wellexpanded and clear lungs. mediastinal contour within normal limits. no acute cardiopulmonary abnormality identified.\n"
3,"\nradiology report\n\nexamination\npa and lateral views of the chest xxxx, xxxx at xxxx hours history xxxxyearold xxxx with xxxx. comparison none available findings there are diffuse bilateral interstitial and alveolar opacities consistent with chronic obstructive lung disease and bullous emphysema. there are irregular opacities in the left lung apex, that could represent a cavitary lesion in the left lung apex.there are streaky opacities in the right upper lobe, xxxx scarring. the cardiomediastinal silhouette is normal in size and contour. there is no pneumothorax or large pleural effusion. transcribed by psc transcription date xxxx\n\nimpression\n. bullous emphysema and interstitial fibrosis. . probably scarring in the left apex, although difficult to exclude a cavitary lesion. . opacities in the bilateral upper lobes could represent scarring, however the absence of comparison exam, recommend short interval followup radiograph or ct thorax to document resolution.\n\nsignature\nxxxx\n\n"
4,"\nchest palat xr\n\nimaging study\nxray chest pa and lateral\nexamination chest frontal and lateral xxxx, xxxx xxxx pm \n \nclinical indication chest and nasal congestion.\n \ncomparisxxxxxxxx.\n \nfindings\nthe cardiomediastinal silhouette and pulmonary vasculature are within\nnormal limits. there is no pneumothorax or pleural effusion. there\nare no focal areas of consolidation. cholecystectomy clips are\npresent. small tspine osteophytes. there is biapical pleural\nthickening, unchanged from prior. mildly hyperexpanded lungs.\n \nimpression\nno acute cardiopulmonary abnormality.\n\n"


Since this data is anonymized with "xxxx" placeholders, it is a good idea to remove them. If you train an AI model with such tokens, it might produce unwanted effects.

In [84]:
df = pd.read_csv('cleaned_file.csv')

df = df.replace(to_replace="xxxx", value='', regex=True)



df.to_csv('cleaned_file.csv', index=False)

df.head() # to see

Unnamed: 0,Text
0,\nchest palat xr\n\nimaging study\nxray chest pa and lateral\nexam views of the chest .\n \ncomparison none.\n \nindication positive tb test\n \nfindings\nthe cardiac silhouette and mediastinum size are within normal limits.\nthere is no pulmonary edema. there is no focal consolidation. there\nare no of a pleural effusion. there is no evidence of\npneumothorax.\n \nimpression\nnormal chest x. \nthis examination and reported findings have been reviewed and\nconfirmed by the undersigned.\n\n
1,"exams chest, views, frontal and lateral\n\ndate\n\n\ncomparison\nnone.\n\nindication\npreop bariatric surgery.\n\nfindings\nborderline cardiomegaly. midline sternotomy . enlarged pulmonary arteries. clear lungs. inferior .\n\nimpression\nno acute pulmonary findings. \n for the opportunity to care for your patient. if have any questions regarding this report, please the radiologist, dr. , at .\n"
2,"\nexam\nxray chest pa and lateral\n\ndate\n\n\nhistory\nrib pain after a , steps this . pain to r back, r elbow and r rib , no previous heart or lung hx, non, no hx ca\n\nimpression\nno displaced rib fractures, pneumothorax, or pleural effusion identified. wellexpanded and clear lungs. mediastinal contour within normal limits. no acute cardiopulmonary abnormality identified.\n"
3,"\nradiology report\n\nexamination\npa and lateral views of the chest , at hours history yearold with . comparison none available findings there are diffuse bilateral interstitial and alveolar opacities consistent with chronic obstructive lung disease and bullous emphysema. there are irregular opacities in the left lung apex, that could represent a cavitary lesion in the left lung apex.there are streaky opacities in the right upper lobe, scarring. the cardiomediastinal silhouette is normal in size and contour. there is no pneumothorax or large pleural effusion. transcribed by psc transcription date \n\nimpression\n. bullous emphysema and interstitial fibrosis. . probably scarring in the left apex, although difficult to exclude a cavitary lesion. . opacities in the bilateral upper lobes could represent scarring, however the absence of comparison exam, recommend short interval followup radiograph or ct thorax to document resolution.\n\nsignature\n\n\n"
4,"\nchest palat xr\n\nimaging study\nxray chest pa and lateral\nexamination chest frontal and lateral , pm \n \nclinical indication chest and nasal congestion.\n \ncomparis.\n \nfindings\nthe cardiomediastinal silhouette and pulmonary vasculature are within\nnormal limits. there is no pneumothorax or pleural effusion. there\nare no focal areas of consolidation. cholecystectomy clips are\npresent. small tspine osteophytes. there is biapical pleural\nthickening, unchanged from prior. mildly hyperexpanded lungs.\n \nimpression\nno acute cardiopulmonary abnormality.\n\n"
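
Removing the placeholder leaves gaps such as doubled spaces or a space in front of a full stop. One possible follow-up, which is not part of the original notebook, is to collapse runs of spaces while leaving newlines untouched, since the newlines are needed later to find headers. A hedged sketch on a made-up string:

```python
import re

# Invented sample modelled on the cleaned, lower-cased reports.
sample = "midline sternotomy xxxx. inferior xxxx xxxx xxxx.\n\nimpression\nno acute findings."

# Remove the anonymization placeholder.
no_placeholder = sample.replace("xxxx", "")

# Collapse runs of spaces/tabs only; "\n" is deliberately not matched.
collapsed = re.sub(r"[ \t]+", " ", no_placeholder)

# Drop the single space left in front of a full stop or comma.
tidy = re.sub(r" ([.,])", r"\1", collapsed)

print(repr(tidy))
```

On the DataFrame the same substitutions could be applied with `df["Text"].str.replace(..., regex=True)`.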


# Part 3: Choosing where to split
_____________________________

All our data is in one column, but we need one input (the text) and one output (the summary).  

So let's look at how we can split it.

In [29]:
df.head()

Unnamed: 0,Text
0,\nchest palat xr\n\nimaging study\nxray chest pa and lateral\nexam views of the chest xxxxxxxx.\n \ncomparison none.\n \nindication positive tb test\n \nfindings\nthe cardiac silhouette and mediastinum size are within normal limits.\nthere is no pulmonary edema. there is no focal consolidation. there\nare no xxxx of a pleural effusion. there is no evidence of\npneumothorax.\n \nimpression\nnormal chest xxxxx. \nthis examination and reported findings have been reviewed and\nconfirmed by the undersigned.\n\n
1,"exams chest, views, frontal and lateral\n\ndate\nxxxx\n\ncomparison\nnone.\n\nindication\npreop bariatric surgery.\n\nfindings\nborderline cardiomegaly. midline sternotomy xxxx. enlarged pulmonary arteries. clear lungs. inferior xxxx xxxx xxxx.\n\nimpression\nno acute pulmonary findings. \n xxxx xxxx for the opportunity to care for your patient. if xxxx have any questions regarding this report, please xxxx the radiologist, dr. xxxx xxxx, at xxxx.\n"
2,"\nexam\nxray chest pa and lateral\n\ndate\nxxxx\n\nhistory\nrib pain after a xxxx, xxxx xxxx steps this xxxx. pain to r back, r elbow and r rib xxxx, no previous heart or lung hx, nonxxxx, no hx ca\n\nimpression\nno displaced rib fractures, pneumothorax, or pleural effusion identified. wellexpanded and clear lungs. mediastinal contour within normal limits. no acute cardiopulmonary abnormality identified.\n"
3,"\nradiology report\n\nexamination\npa and lateral views of the chest xxxx, xxxx at xxxx hours history xxxxyearold xxxx with xxxx. comparison none available findings there are diffuse bilateral interstitial and alveolar opacities consistent with chronic obstructive lung disease and bullous emphysema. there are irregular opacities in the left lung apex, that could represent a cavitary lesion in the left lung apex.there are streaky opacities in the right upper lobe, xxxx scarring. the cardiomediastinal silhouette is normal in size and contour. there is no pneumothorax or large pleural effusion. transcribed by psc transcription date xxxx\n\nimpression\n. bullous emphysema and interstitial fibrosis. . probably scarring in the left apex, although difficult to exclude a cavitary lesion. . opacities in the bilateral upper lobes could represent scarring, however the absence of comparison exam, recommend short interval followup radiograph or ct thorax to document resolution.\n\nsignature\nxxxx\n\n"
4,"\nchest palat xr\n\nimaging study\nxray chest pa and lateral\nexamination chest frontal and lateral xxxx, xxxx xxxx pm \n \nclinical indication chest and nasal congestion.\n \ncomparisxxxxxxxx.\n \nfindings\nthe cardiomediastinal silhouette and pulmonary vasculature are within\nnormal limits. there is no pneumothorax or pleural effusion. there\nare no focal areas of consolidation. cholecystectomy clips are\npresent. small tspine osteophytes. there is biapical pleural\nthickening, unchanged from prior. mildly hyperexpanded lungs.\n \nimpression\nno acute cardiopulmonary abnormality.\n\n"


The structure of each report is as follows: 

header  
inner text  

header  
inner text  

which appears in the raw text as \nheader\ninnertext\n

So we can use \n as a separator. 
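
As a rough sketch of where this is heading, the function below pulls the text under a given header out of one cleaned report. It assumes that the header sits alone on its own line and that the section ends at the next blank line or at the end of the text. The function name `section_after_header` and the sample report are hypothetical, and real reports deviate from this layout, as the following cells show:

```python
import re
from typing import Optional

def section_after_header(text: str, header: str) -> Optional[str]:
    """Return the text following a line that contains only `header`, or None."""
    # Header alone on a line, then lazily capture up to a blank line or the end.
    regex = rf"(?:^|\n)[ \t]*{re.escape(header)}[ \t]*\n(.*?)(?=\n[ \t]*\n|\Z)"
    match = re.search(regex, text, flags=re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None

# Invented report in the "\nheader\ninner text\n" layout.
report = ("\nexam\nxray chest\n\nfindings\nclear lungs.\nnormal heart size.\n\n"
          "impression\nno acute findings.\n")

print(section_after_header(report, "findings"))
print(section_after_header(report, "impression"))
print(section_after_header(report, "discussion"))
```

Returning `None` for a missing header makes it easy to count afterwards how many reports lack a given section.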

Let's look for impression\n and findings\n to see when those words are used as headers.

In [26]:
mask_inw = df['Text'].str.contains("impression\\n", case=False, na=False)
mask_fnw = df['Text'].str.contains("findings\\n", case=False, na=False)


In [27]:
df[mask_fnw].count()

Text    443
dtype: int64

In [28]:
df[mask_inw].count()

Text    1549
dtype: int64

In [30]:
df[mask_fnw].head()

Unnamed: 0,Text
0,\nchest palat xr\n\nimaging study\nxray chest pa and lateral\nexam views of the chest xxxxxxxx.\n \ncomparison none.\n \nindication positive tb test\n \nfindings\nthe cardiac silhouette and mediastinum size are within normal limits.\nthere is no pulmonary edema. there is no focal consolidation. there\nare no xxxx of a pleural effusion. there is no evidence of\npneumothorax.\n \nimpression\nnormal chest xxxxx. \nthis examination and reported findings have been reviewed and\nconfirmed by the undersigned.\n\n
1,"exams chest, views, frontal and lateral\n\ndate\nxxxx\n\ncomparison\nnone.\n\nindication\npreop bariatric surgery.\n\nfindings\nborderline cardiomegaly. midline sternotomy xxxx. enlarged pulmonary arteries. clear lungs. inferior xxxx xxxx xxxx.\n\nimpression\nno acute pulmonary findings. \n xxxx xxxx for the opportunity to care for your patient. if xxxx have any questions regarding this report, please xxxx the radiologist, dr. xxxx xxxx, at xxxx.\n"
4,"\nchest palat xr\n\nimaging study\nxray chest pa and lateral\nexamination chest frontal and lateral xxxx, xxxx xxxx pm \n \nclinical indication chest and nasal congestion.\n \ncomparisxxxxxxxx.\n \nfindings\nthe cardiomediastinal silhouette and pulmonary vasculature are within\nnormal limits. there is no pneumothorax or pleural effusion. there\nare no focal areas of consolidation. cholecystectomy clips are\npresent. small tspine osteophytes. there is biapical pleural\nthickening, unchanged from prior. mildly hyperexpanded lungs.\n \nimpression\nno acute cardiopulmonary abnormality.\n\n"
13,"\nchest palat xr\n\nimaging study\nxray chest pa and lateral\npa and lateral chest xxxx\n \nindication xxxxyearold female, chest pain\n \ncomparisxxxxxxxx\n \nimpression no acute findings\n \nfindings heart size within normal limits, stable mediastinal and\nhilar contours. mild hyperinflation appears similar to prior. no\nfocal alveolar consolidation, no definite pleural effusion seen.\nscattered chronic appearing irregular interstitial markings, no\ntypical findings of pulmonary edema.\n\n"
17,"\nchest palat xr\n\nimaging study\nxray chest pa and lateral\npa and lateral chest xxxx\n \nindication xxxxyearold male, pain\n \ncomparison none\n \nimpression no acute cardiopulmonary findings\n \nfindings heart size within normal limits. no focal alveolar\nconsolidation, no definite pleural effusion seen. no typical\nfindings of pulmonary edema. no pneumothorax.\n\n"


It turns out that most occurrences of "findings" are inside the text, and the word is a header in relatively few rows.

When we search for "findings\n", rows 2 and 3 do not show up.
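
Note that `"findings\n"` also matches any line that merely ends with the word, such as `impression no acute findings` in row 13 above. A stricter check, sketched here on invented rows, anchors the word to a whole line with `re.MULTILINE`; the toy frame and the variable names are assumptions:

```python
import re
import pandas as pd

# Invented rows: a real header, an inline use, and a line that only ends with the word.
toy = pd.DataFrame({"Text": [
    "\nfindings\nclear lungs.\n\nimpression\nnormal.\n",
    "comparison none findings the lungs are clear.\n\nimpression\nnormal.\n",
    "\nimpression no acute findings\n \nfindings heart size normal.\n",
]})

# Loose check: the word directly followed by a newline.
loose = toy["Text"].str.contains("findings\n", na=False)

# Strict check: the word is the only content of its line.
strict = toy["Text"].str.contains(r"^[ \t]*findings[ \t]*$",
                                  flags=re.MULTILINE, na=False)

print(loose.tolist())
print(strict.tolist())
```

The third row passes the loose check but not the strict one, because "findings" is only the last word of the impression line there.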

In [31]:
df.head()


Unnamed: 0,Text
0,\nchest palat xr\n\nimaging study\nxray chest pa and lateral\nexam views of the chest xxxxxxxx.\n \ncomparison none.\n \nindication positive tb test\n \nfindings\nthe cardiac silhouette and mediastinum size are within normal limits.\nthere is no pulmonary edema. there is no focal consolidation. there\nare no xxxx of a pleural effusion. there is no evidence of\npneumothorax.\n \nimpression\nnormal chest xxxxx. \nthis examination and reported findings have been reviewed and\nconfirmed by the undersigned.\n\n
1,"exams chest, views, frontal and lateral\n\ndate\nxxxx\n\ncomparison\nnone.\n\nindication\npreop bariatric surgery.\n\nfindings\nborderline cardiomegaly. midline sternotomy xxxx. enlarged pulmonary arteries. clear lungs. inferior xxxx xxxx xxxx.\n\nimpression\nno acute pulmonary findings. \n xxxx xxxx for the opportunity to care for your patient. if xxxx have any questions regarding this report, please xxxx the radiologist, dr. xxxx xxxx, at xxxx.\n"
2,"\nexam\nxray chest pa and lateral\n\ndate\nxxxx\n\nhistory\nrib pain after a xxxx, xxxx xxxx steps this xxxx. pain to r back, r elbow and r rib xxxx, no previous heart or lung hx, nonxxxx, no hx ca\n\nimpression\nno displaced rib fractures, pneumothorax, or pleural effusion identified. wellexpanded and clear lungs. mediastinal contour within normal limits. no acute cardiopulmonary abnormality identified.\n"
3,"\nradiology report\n\nexamination\npa and lateral views of the chest xxxx, xxxx at xxxx hours history xxxxyearold xxxx with xxxx. comparison none available findings there are diffuse bilateral interstitial and alveolar opacities consistent with chronic obstructive lung disease and bullous emphysema. there are irregular opacities in the left lung apex, that could represent a cavitary lesion in the left lung apex.there are streaky opacities in the right upper lobe, xxxx scarring. the cardiomediastinal silhouette is normal in size and contour. there is no pneumothorax or large pleural effusion. transcribed by psc transcription date xxxx\n\nimpression\n. bullous emphysema and interstitial fibrosis. . probably scarring in the left apex, although difficult to exclude a cavitary lesion. . opacities in the bilateral upper lobes could represent scarring, however the absence of comparison exam, recommend short interval followup radiograph or ct thorax to document resolution.\n\nsignature\nxxxx\n\n"
4,"\nchest palat xr\n\nimaging study\nxray chest pa and lateral\nexamination chest frontal and lateral xxxx, xxxx xxxx pm \n \nclinical indication chest and nasal congestion.\n \ncomparisxxxxxxxx.\n \nfindings\nthe cardiomediastinal silhouette and pulmonary vasculature are within\nnormal limits. there is no pneumothorax or pleural effusion. there\nare no focal areas of consolidation. cholecystectomy clips are\npresent. small tspine osteophytes. there is biapical pleural\nthickening, unchanged from prior. mildly hyperexpanded lungs.\n \nimpression\nno acute cardiopulmonary abnormality.\n\n"


In [32]:
mask_ex=df["Text"].str.contains("examination\\n", case=False, na=False)
df[mask_ex].count()

Text    285
dtype: int64

It turns out that another important header is examination. At first we assumed there were only two headers. But once we used special characters such as the newline (\n) to check our understanding of the formatting, we saw that of the 1663 rows where "findings" occurs, only 443 use it as a header. We also found another header, examination, which is used as a header in 285 rows, a considerable amount of data. So let's see whether the examination section works as we expect.
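
Rather than guessing header names one at a time, candidate headers can be surfaced by counting short lines that stand alone. This is a sketch under the assumption that a header is a line of at most three words with no full stop; the threshold, the helper name `candidate_headers` and the sample reports are all hypothetical:

```python
from collections import Counter

def candidate_headers(texts, max_words=3):
    """Count short lines without a full stop; these are likely section headers."""
    counts = Counter()
    for text in texts:
        for line in text.split("\n"):
            line = line.strip()
            # Skip empty lines, sentences, and anything longer than max_words.
            if line and "." not in line and len(line.split()) <= max_words:
                counts[line] += 1
    return counts

# Invented reports in the cleaned, lower-cased format.
reports = [
    "\nradiology report\n\nexamination\npa and lateral views of the chest.\n\n"
    "impression\nno acute abnormality.\n\nsignature\n\n",
    "\nexamination\nchest views.\n\nfindings\nclear lungs bilaterally.\n\n"
    "impression\nno acute disease.\n",
]

headers = candidate_headers(reports)
print(headers.most_common())
```

On the real data the function could be called with `df["Text"].dropna()`, and the most common entries would show which headers are worth handling.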

In [34]:
df[mask_ex].head(10)

Unnamed: 0,Text
3,"\nradiology report\n\nexamination\npa and lateral views of the chest xxxx, xxxx at xxxx hours history xxxxyearold xxxx with xxxx. comparison none available findings there are diffuse bilateral interstitial and alveolar opacities consistent with chronic obstructive lung disease and bullous emphysema. there are irregular opacities in the left lung apex, that could represent a cavitary lesion in the left lung apex.there are streaky opacities in the right upper lobe, xxxx scarring. the cardiomediastinal silhouette is normal in size and contour. there is no pneumothorax or large pleural effusion. transcribed by psc transcription date xxxx\n\nimpression\n. bullous emphysema and interstitial fibrosis. . probably scarring in the left apex, although difficult to exclude a cavitary lesion. . opacities in the bilateral upper lobes could represent scarring, however the absence of comparison exam, recommend short interval followup radiograph or ct thorax to document resolution.\n\nsignature\nxxxx\n\n"
16,"\nradiology report\n\nexamination\npa and lateral views of the chest dated xxxx. comparison xxxx films of the chest dated xxxx. history xxxxyearold female, chest pain. findings no focal areas of consolidation. no suspicious pulmonary opacities. heart size within normal limits. no pleural effusions. no evidence of pneumothorax. osseous structures intact. transcribed by psc transcription date xxxx\n\nimpression\nno acute cardiopulmonary abnormality.\n\nsignature\nxxxx\n\n"
21,"\nradiology report\n\nexamination\npa and lateral views of the chest xxxx, xxxx at xxxx hours history xxxxyearold woman with xxxx for weeks. comparison none available findings the lungs are clear, and without focal air space opacity. the cardiomediastinal silhouette is normal in size and contour, and stable. there is no pneumothorax large pleural effusion. transcribed by psc transcription date xxxx\n\nimpression\nno acute cardiopulmonary abnormality.\n\nsignature\nxxxx\n\n"
34,"\nradiology report\n\nexamination\npa and lateral chest radiographs xxxx at xxxx hours. history xxxxyearold female with breast mass and smoking history. comparison pa and lateral chest redressed xxxx findings the heart size and cardiomediastinal silhouette are normal. there is hyperexpansion of the lungs with flattening of the hemidiaphragms. there is no focal airspace opacity, pleural effusion, or pneumothorax. there multilevel degenerative changes of thoracic spine. transcribed by psc transcription date xxxx\n\nimpression\nemphysema, however no acute cardiopulmonary finding.\n\nsignature\nxxxx\n\n"
35,"\nradiology report\n\nexamination\npa and lateral chest xxxxx dated xxxx, xxxx at xxxx p.m.. indication xxxxyearold female with chest pain, rule out pneumonia.. comparison twoview chest radiograph dated xxxx, xxxx.. findings the lungs are clear bilaterally. specifically, no evidence of focal consolidation, pneumothorax, or pleural effusion.. cardio mediastinal silhouette is unremarkable. visualized osseous structures of the thorax are without acute abnormality. transcribed by psc transcription date xxxx\n\nimpression\nno acute cardiopulmonary abnormality..\n\nsignature\nxxxx\n\n"
48,"\nexamination\npa and lateral chest radiographs dated xxxx at xxxx hours.\n\ncomparison\nnone.\n\nhistory\nxxxxyearold with osteoarthritis of the hip scheduled for total hip replacement. preoperative evaluation.\n\nfindings\nthe heart, pulmonary xxxx and mediastinum are within normal limits. there is no pleural effusion or pneumothorax. there is no focal air space opacity to suggest a pneumonia. there are degenerative changes of the thoracic spine. there is a calcified granuloma identified in the right suprahilar region. the aorta is mildly tortuous and ectatic. there is asymmetric right apical smooth pleural thickening. there are severe degenerative changes of the xxxx.\n\nimpression\nno acute cardiopulmonary disease.\n"
53,"\nradiology report\n\nexamination\npa and lateral chest xxxxx dated xxxx, xxxx at xxxx p.m.. indication xxxxyearold woman, prior to enbrel therapy.. comparison none. findings the lungs are clear bilaterally. specifically, no evidence of focal consolidation, pneumothorax, or pleural effusion.. minimal right basilar subsegmental atelectasis noted. cardio mediastinal silhouette is unremarkable. tortuosity of the thoracic aorta noted. scattered calcified granulomas are seen without evidence of active granulomatoustuberculous process. visualized osseous structures of the thorax are without acute abnormality. transcribed by pscb transcription date xxxx\n\nimpression\nno acute cardiopulmonary abnormality.. xxxx xxxx, m.d. xxxx xxxx xxxx electronically xxxx xxxx, m.d. xxxx xxxx xxxx transcribed xxxx xxxx xxxx radres xxxx\n\nsignature\nxxxx\n\n"
61,\nchest palat xr\n\nimaging study\nxray chest pa and lateral\nexamination\nfrontal and lateral views of the chest dated xxxx \n \ncomparisxxxxxxxx\n \nhistory chest pain\n \nfindings status post xxxx sternotomy and cabg. heart size is\nnormal. coronary vascular stent. the lungs are clear. there are no\nfocal air space consolidations. no pleural effusions or\npneumothoraces. the hilar and mediastinal contours are stable.\ncalcified mediastinal lymph xxxx. normal pulmonary vascularity. \ndegenerative changes of the spine.\n \nimpression no acute abnormality. \nthis examination and reported findings have been reviewed and\nconfirmed by the undersigned.\n\n
64,"\nexamination\nchest views dated xxxx, xxxx.\n\nhistory and indication\nchest pain.\n\ncomparison\nxxxx.\n\nfindings\nthe xxxx examination consists of frontal and lateral radiographs of the chest. the cardiomediastinal contours are within normal limits. pulmonary vascularity is within normal limits. no focal consolidation, pleural effusion, or pneumothorax identified. deformity of the right clavicle related to remote xxxx is again seen. visualized upper abdomen grossly unremarkable.\n\nimpression\nno evidence of acute cardiopulmonary process.\n"
74,"\nradiology report\n\nexamination\npa and lateral views of the chest xxxx, xxxx at xxxx hours history xxxxyearold xxxx with chest pain. comparison xxxx, xxxx findings the heart size is stable. the aorta is ectatic and atherosclerotic but stable. xxxx sternotomy xxxx are again noted. the scarring in the left lower lobe is again noted and unchanged from prior exam. there are mild bilateral prominent lung interstitial opacities consistent with emphysematous disease. the calcified granulomas are stable. transcribed by psc transcription date xxxx\n\nimpression\n. changes of emphysema and left lower lobe scarring, both stable. . unchanged degenerative and atherosclerotic changes of the thoracic aorta.\n\nsignature\nxxxx\n\n"


In [35]:
exdf = df[mask_ex]  # keep only the rows matched by the "examination\n" mask
exdf[mask_fnw.loc[exdf.index]].head()  # among those rows, look for ones that also contain "findings\n"



Unnamed: 0,Text
48,"\nexamination\npa and lateral chest radiographs dated xxxx at xxxx hours.\n\ncomparison\nnone.\n\nhistory\nxxxxyearold with osteoarthritis of the hip scheduled for total hip replacement. preoperative evaluation.\n\nfindings\nthe heart, pulmonary xxxx and mediastinum are within normal limits. there is no pleural effusion or pneumothorax. there is no focal air space opacity to suggest a pneumonia. there are degenerative changes of the thoracic spine. there is a calcified granuloma identified in the right suprahilar region. the aorta is mildly tortuous and ectatic. there is asymmetric right apical smooth pleural thickening. there are severe degenerative changes of the xxxx.\n\nimpression\nno acute cardiopulmonary disease.\n"
64,"\nexamination\nchest views dated xxxx, xxxx.\n\nhistory and indication\nchest pain.\n\ncomparison\nxxxx.\n\nfindings\nthe xxxx examination consists of frontal and lateral radiographs of the chest. the cardiomediastinal contours are within normal limits. pulmonary vascularity is within normal limits. no focal consolidation, pleural effusion, or pneumothorax identified. deformity of the right clavicle related to remote xxxx is again seen. visualized upper abdomen grossly unremarkable.\n\nimpression\nno evidence of acute cardiopulmonary process.\n"
81,"\nexamination\nxray chest pa and lateral\n\nexamination date\nxxxx\n\ncomparison\nchest xxxxx xxxx\n\nrelevant clinical information\npain in thoracic spine pain started in leg area two weeks ago now having severe pain in upper xxxx back area rt side. hf\n\nfindings\nno airspace disease, effusion or noncalcified nodule. normal heart size and mediastinum. left axillary surgical clips unchanged visualized xxxx of the chest xxxx are within normal limits.\n\nimpression\nno acute cardiopulmonary abnormality. \n if xxxx have questions regarding this report, please xxxx xxxx on xxxx or xxxx xxxx. \nthis examination and reported findings have been reviewed and confirmed by the undersigned.\n"
133,"\nexamination\nchest views xxxx, xxxx\n\nclinical history\nchest pain\n\ncomparison\nnone\n\nfindings\nthe lungs are grossly clear without focal pneumonic consolidation, large effusion or pneumothorax. heart size is within normal limits.\n\nimpression\nclear lungs\n"
150,"\nexamination\nchest radiograph, frontal and lateral views\n\ncomparison\nxxxx\n\nfindings\ncardiomediastinal silhouette is normal. pulmonary vasculature and xxxx are normal. no consolidation, pneumothorax or large pleural effusion. osseous structures and soft tissues are normal.\n\nimpression\nno acute cardiopulmonary disease.\n"


It turns out the examination header is not as reliable as we thought. Some rows contain both an examination header and a findings header, and not every examination row contains all the sections. The dataset is not well formed, and the formatting is not consistent across rows. The best approach is to use only the data we can rely on.
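To put numbers on this observation, the two header masks can be combined and counted. This is only a minimal sketch: the `toy` DataFrame and its three rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in rows, not real dataset entries
toy = pd.DataFrame({"Text": [
    "\nexamination\nchest views\n\nfindings\nclear lungs\n\nimpression\nnormal\n",
    "\nradiology report\n\nexamination\npa and lateral findings the lungs are clear\n\nimpression\nnormal\n",
    "\nimaging study\nxray chest\n\nimpression\nno acute abnormality\n",
]})

# Literal (non-regex) search for each header followed by a newline
has_exam = toy["Text"].str.contains("examination\n", regex=False)
has_findings = toy["Text"].str.contains("findings\n", regex=False)

# Combine the masks with & and ~ to count each situation
both = int((has_exam & has_findings).sum())
exam_only = int((has_exam & ~has_findings).sum())
print("examination and findings headers:", both)
print("examination header only:", exam_only)
```

Running the same counts on the real DataFrame shows how many rows actually follow the header layout we hoped for.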

# Step 4: Actually splitting the data
____________

Let's create a new CSV file that keeps only the rows containing our keywords. Note that we are now looking for rows that contain both keywords.

In [87]:
df = pd.read_csv('cleaned_file.csv')

# Keep only the rows that contain both section headers
mask1 = df["Text"].str.contains("findings\\n", case=False, na=False)
mask2 = df["Text"].str.contains("impression\\n", case=False, na=False)
filtered_df = df[mask1 & mask2]  # combine the masks so their index matches df

# Save the filtered DataFrame to a new CSV file
filtered_df.to_csv('keywordFiltered.csv', index=False)




In [None]:
filtered_df.head()

In [88]:
filtered_df.count()

Text    409
dtype: int64

Now we are down to 409 rows. Let's check whether our headers always occur in the same order.

In [13]:
import pandas as pd
df = pd.read_csv('keywordFiltered.csv')

# Define your keywords (case insensitive)
keyword1 = 'Findings'.lower()
keyword2 = 'Impression'.lower()

# Initialize a list to store the type values
type_values = []

# Iterate over the rows
for index, row in df.iterrows():
    text = row['Text'].lower()  # Convert text to lowercase for case insensitivity
    index1 = text.find(keyword1)  # Find the index of keyword1
    index2 = text.find(keyword2)  # Find the index of keyword2
    if index1 != -1 and index2 != -1:  # Both keywords are found
        if index1 < index2:  # Keyword1 occurs first
            type_values.append(1)
        else:  # Keyword2 occurs first
            type_values.append(2)
    elif index1 != -1:  # Only keyword1 is found
        type_values.append(1)
    elif index2 != -1:  # Only keyword2 is found
        type_values.append(2)
    else:  # Neither keyword is found
        type_values.append(0)

# Add the type values to the DataFrame
df['type'] = type_values

# Save the DataFrame back to a CSV file
df.to_csv('typed.csv', index=False)
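
The same ordering check can also be written without an explicit loop. The sketch below assumes both headers are present in every row (which holds after the keyword filter) and uses a made-up two-row `toy` DataFrame rather than the real file.

```python
import pandas as pd

# Hypothetical rows: one with findings first, one with impression first
toy = pd.DataFrame({"Text": [
    "findings\nclear lungs\n\nimpression\nnormal chest\n",
    "impression\nnormal chest\n\nfindings\nclear lungs\n",
]})

lowered = toy["Text"].str.lower()
pos_findings = lowered.str.find("findings")      # position of the first "findings"
pos_impression = lowered.str.find("impression")  # position of the first "impression"

# 1 = findings comes first, 2 = impression comes first
toy["type"] = (pos_impression < pos_findings).map({False: 1, True: 2})
print(toy["type"].tolist())
```

Unlike the loop above, this version does not handle rows where a header is missing, so it is only a shortcut for already filtered data.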

In [19]:
# Count the rows where impression comes before findings (type 2)
count_type_2 = df['type'].value_counts().get(2, 0)

print(count_type_2)

23


In [23]:
rows_with_type_2 = df[df['type'] == 2]

print(rows_with_type_2)

    Text  \
14

Some rows have findings first and some have impression first, so we add new columns and put each section's text into the matching column.

In [32]:
import csv
import re

def extract_sections(report):
    # Define regex patterns for findings and impression sections
    findings_pattern = re.compile(r'findings\s*(.*?)\s*(impression|$)', re.DOTALL | re.IGNORECASE)
    impression_pattern = re.compile(r'impression\s*(.*?)\s*$', re.DOTALL | re.IGNORECASE)

    # Search for findings section
    findings_match = findings_pattern.search(report)
    findings = findings_match.group(1).strip() if findings_match else ""

    # Search for impression section
    impression_match = impression_pattern.search(report)
    impression = impression_match.group(1).strip() if impression_match else ""

    return findings, impression

# Read the original CSV file
with open('typed.csv', mode='r', newline='', encoding='utf-8') as infile:
    reader = csv.DictReader(infile)
    reports = [row['Text'] for row in reader]

# Process the reports to extract findings and impressions
processed_reports = []
for report in reports:
    findings, impression = extract_sections(report)
    processed_reports.append({'report': report, 'findings': findings, 'impression': impression})

# Write the new CSV file with findings and impressions
with open('processed_reports.csv', mode='w', newline='', encoding='utf-8') as outfile:
    fieldnames = ['report', 'findings', 'impression']
    writer = csv.DictWriter(outfile, fieldnames=fieldnames)
    
    writer.writeheader()
    for processed_report in processed_reports:
        writer.writerow(processed_report)



Some reports put the actual findings under the impression header and write only "see impression" under findings. We cannot separate those rows automatically, so we have to drop them. The dataset is badly written, so getting it into a usable state takes real effort.


We need to remove the rows where the findings column contains only the word "see".
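
To see why the findings column ends up as just "see" for these rows, here is a small sketch that applies the findings pattern from extract_sections to a made-up report. The `sample` string is hypothetical, not a row from the dataset.

```python
import re

# Same findings pattern as used in extract_sections
findings_pattern = re.compile(r'findings\s*(.*?)\s*(impression|$)', re.DOTALL | re.IGNORECASE)

# Hypothetical report written in the "see impression" style
sample = "\nfindings\nsee impression\n\nimpression\nno acute cardiopulmonary abnormality.\n"

match = findings_pattern.search(sample)
findings = match.group(1).strip() if match else ""
print(repr(findings))  # 'see'
```

The lazy group stops at the first occurrence of "impression", which here is the word inside "see impression", so only "see" is captured.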

In [33]:
# Simple filter: keep only the rows whose findings are not just "see"
df = pd.read_csv("processed_reports.csv")
df_filtered = df[df['findings'] != 'see']
df_filtered.to_csv("processed_reports.csv", index=False)  # overwrite the file that the next step reads

In [34]:
df_filtered.count()

report        398
findings      398
impression    398
dtype: int64

We are down to about 400 rows. That is fine, since we can augment the data before training, but first let's remove the leading spaces and dots in the impression column.

lstrip ("left strip") removes characters from the start of a string, so we can use it to get rid of the leading part.
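
One detail worth knowing: the argument to lstrip is treated as a set of characters, not as a fixed prefix, so a single call handles any leading mix of spaces and dots. The sketch below uses plain Python strings with made-up sample values; pandas applies the same behaviour per element through `Series.str.lstrip`.

```python
# Hypothetical impression strings with different leading junk
samples = [". stable heart size.", " . clear lungs.", ".. no acute disease."]

# " ." means "strip any leading spaces or dots", in any order and any number
cleaned = [s.lstrip(" .") for s in samples]
print(cleaned)
```

Trailing dots are left alone, so sentence-ending periods survive.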

In [43]:
df = pd.read_csv("processed_reports.csv")
# lstrip treats ' .' as a set of characters, so one call removes
# any leading mix of spaces and dots
df['impression'] = df['impression'].str.lstrip(' .')

df.to_csv("processed_reports.csv", index=False)

. interval improvement in consolidative left base opacity.
multifocal scattered bibasilar patchy and  pulmonary opacities
again noted, most consistent with atelectasisinfiltrate.
. stable enlarged cardiomediastinal silhouette. stable pulmonary
vascular congestion. 
this examination and reported findings have been reviewed and
confirmed by the undersigned.