# Part 0: Introduction
_____

In this tutorial we will use pandas to analyze a CSV file and prepare it for training a seq2seq model.

With pandas we can organize the data to suit our needs. To do this, we first need to understand how the data is structured.

The dataset in this example consists of radiology reports and their diagnosis results.


# Part 1: Pandas Basics
_____

Below you can see how to load data stored in a CSV file into your Python code.

In [1]:
import pandas as pd

In [2]:
dataFrame = pd.read_csv("ReportsDATASET.csv")

Let's look at a summary of the data. describe() is much more useful for numeric data, but it still shows that there are 1984 text values, 1982 of which are unique.

In [3]:
dataFrame.describe()

Unnamed: 0,Text
count,1984
unique,1982
top,\nSIGNATURE\nXXXX\n\nRADIOLOGY REPORT\nchest p...
freq,2
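
Since count is larger than unique, some reports must appear more than once. A minimal sketch of how such repeats could be located is shown here on a made-up four-row frame; the name toy and its strings are assumptions for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy stand-in for the real dataset (these strings are made up for illustration).
toy = pd.DataFrame({"Text": ["report A", "report B", "report A", "report C"]})

# describe() on a text column reports count, unique, top and freq.
summary = toy["Text"].describe()
print(summary)

# keep=False marks every member of a duplicated group, not only the later repeats.
duplicates = toy[toy["Text"].duplicated(keep=False)]
print(duplicates)
```

Applied to the real data, the same duplicated(keep=False) call on dataFrame["Text"] should list the repeated reports.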


Looking at the shape of the DataFrame, we can see the number of rows and columns.

In [5]:
dataFrame.shape

(1984, 1)

With the head() method you can see the first rows.

In [6]:
dataFrame.head() 

Unnamed: 0,Text
0,\nChest PA-Lat XR\n\nImaging Study\nXray Chest...
1,"EXAM(S): Chest, 2 views, frontal and lateral\n..."
2,\nExam\nXray Chest PA and Lateral\n\nDate\nXXX...
3,\nRADIOLOGY REPORT\n\nExamination\nPA and late...
4,\nChest PA-Lat XR\n\nImaging Study\nXray Chest...


Let's remove the limit on the displayed column width.

In [7]:
pd.set_option('display.max_colwidth', None)
dataFrame.head()

Unnamed: 0,Text
0,\nChest PA-Lat XR\n\nImaging Study\nXray Chest PA and Lateral\nExam: 2 views of the chest XXXX/XXXX.\n \nComparison: None.\n \nIndication: Positive TB test\n \nFindings:\nThe cardiac silhouette and mediastinum size are within normal limits.\nThere is no pulmonary edema. There is no focal consolidation. There\nare no XXXX of a pleural effusion. There is no evidence of\npneumothorax.\n \nImpression:\nNormal chest x-XXXX. \nThis examination and reported findings have been reviewed and\nconfirmed by the undersigned.\n\n
1,"EXAM(S): Chest, 2 views, frontal and lateral\n\nDATE\nXXXX\n\nCOMPARISON\nNone.\n\nINDICATION\nPreop bariatric surgery.\n\nFINDINGS\nBorderline cardiomegaly. Midline sternotomy XXXX. Enlarged pulmonary arteries. Clear lungs. Inferior XXXX XXXX XXXX.\n\nIMPRESSION\nNo acute pulmonary findings. \n XXXX XXXX for the opportunity to care for your patient. If XXXX have any questions regarding this report, please XXXX the radiologist, Dr. XXXX XXXX, at XXXX.\n"
2,"\nExam\nXray Chest PA and Lateral\n\nDate\nXXXX\n\nHistory\nrib pain after a XXXX, XXXX XXXX steps this XXXX. Pain to R back, R elbow and R rib XXXX, no previous heart or lung hx, non-XXXX, no hx ca\n\nImpression\nNo displaced rib fractures, pneumothorax, or pleural effusion identified. Well-expanded and clear lungs. Mediastinal contour within normal limits. No acute cardiopulmonary abnormality identified.\n"
3,"\nRADIOLOGY REPORT\n\nExamination\nPA and lateral views of the chest XXXX, XXXX at XXXX hours History: XXXX-year-old XXXX with XXXX. Comparison: None available Findings: There are diffuse bilateral interstitial and alveolar opacities consistent with chronic obstructive lung disease and bullous emphysema. There are irregular opacities in the left lung apex, that could represent a cavitary lesion in the left lung apex.There are streaky opacities in the right upper lobe, XXXX scarring. The cardiomediastinal silhouette is normal in size and contour. There is no pneumothorax or large pleural effusion. Transcribed by - PSC Transcription Date - XXXX\n\nIMPRESSION\n1. Bullous emphysema and interstitial fibrosis. 2. Probably scarring in the left apex, although difficult to exclude a cavitary lesion. 3. Opacities in the bilateral upper lobes could represent scarring, however the absence of comparison exam, recommend short interval followup radiograph or CT thorax to document resolution.\n\nSIGNATURE\nXXXX\n\n"
4,"\nChest PA-Lat XR\n\nImaging Study\nXray Chest PA and Lateral\nEXAMINATION: CHEST ( FRONTAL AND LATERAL): XXXX, XXXX XXXX PM \n \nCLINICAL INDICATION: Chest and nasal congestion.\n \nCOMPARISXXXX/XXXX.\n \nFINDINGS:\nThe cardiomediastinal silhouette and pulmonary vasculature are within\nnormal limits. There is no pneumothorax or pleural effusion. There\nare no focal areas of consolidation. Cholecystectomy clips are\npresent. Small T-spine osteophytes. There is biapical pleural\nthickening, unchanged from prior. Mildly hyperexpanded lungs.\n \nIMPRESSION:\nNo acute cardiopulmonary abnormality.\n\n"
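
pd.set_option changes the display width for the rest of the session. If the full text is only needed for a single inspection, pd.option_context can apply the setting temporarily. The sketch below assumes pandas' default column width and uses a made-up frame named toy.

```python
import pandas as pd

# One made-up, very long cell (illustrative only).
toy = pd.DataFrame({"Text": ["x" * 200]})

default_width = pd.get_option("display.max_colwidth")

# Inside the with-block the column width is unlimited.
with pd.option_context("display.max_colwidth", None):
    full_repr = repr(toy)

# After the block the previous setting is restored automatically.
truncated_repr = repr(toy)

print(len(full_repr), len(truncated_repr))
```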


These are quite long texts, but they contain two very important keywords: Findings and Impression.

Findings describes what was observed, and Impression gives the diagnosis based on those observations.

Let's search for these keywords and see how many rows contain them.

In [11]:
# List your keywords
keywords = ['IMPRESSION', 'FINDINGS']

# dataFrame['Text'] selects the Text column
# str.contains checks whether the pattern occurs in each row
# '|'.join(keywords) joins the keywords with a regex OR, so a row counts if at least one of them occurs
# case=False makes the match case-insensitive
# na=False makes NaN values count as not containing the keywords

mask = dataFrame['Text'].str.contains('|'.join(keywords), case=False, na=False)

# Count the number of rows containing the keywords
count = mask.sum()

print(f'Number of total rows {dataFrame.shape[0]} ')
print(f'Number of rows containing the keywords : {count}')

Number of total rows 1984 
Number of rows containing the keywords : 1982
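
Because str.contains treats the joined string as a regular expression by default, a keyword containing characters such as ( or . would change the pattern. The two keywords used here are plain words, so this is not a problem in this tutorial. A hedged sketch of a more defensive variant, using re.escape on a made-up frame named toy, could look like this:

```python
import re
import pandas as pd

# Made-up rows, including a missing value, for illustration only.
toy = pd.DataFrame(
    {"Text": ["FINDINGS: clear lungs", "Impression\nNormal", "Date\nXXXX", None]}
)
keywords = ["IMPRESSION", "FINDINGS"]

# re.escape keeps any regex metacharacters in a keyword literal.
pattern = "|".join(re.escape(k) for k in keywords)

# Same call as in the tutorial, with the escaped pattern.
mask = toy["Text"].str.contains(pattern, case=False, na=False)
print(mask.tolist())
```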


That is every row except 2. Let's see what those rows are. We already created a mask, so let's invert it and apply it.

In [12]:
# Invert the mask to get rows that do not contain the keywords
inverse_mask = ~mask

# Filter the DataFrame using the inverted mask
rows_without_keywords = dataFrame[inverse_mask]

print(rows_without_keywords)

                                                                   Text
15    \nChest PA-Lat XR\n\nImaging Study\nXray Chest PA and Lateral\n\n
1501    \nChest PA-Lat XR\n\nImaging Study\nXR Chest PA and Lateral\n\n


It looks like those two rows do not contain any important data.
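
If rows like these are judged not useful for training, one possible way to drop them is to index the DataFrame with the original (non-inverted) mask. This is only a sketch on a made-up three-row frame; toy and cleaned are hypothetical names.

```python
import pandas as pd

# Made-up rows: the middle one has neither keyword, like rows 15 and 1501.
toy = pd.DataFrame(
    {
        "Text": [
            "Findings:\nClear lungs.\nImpression:\nNormal.",
            "Imaging Study\nXray Chest PA and Lateral",
            "IMPRESSION\nNo acute abnormality.",
        ]
    }
)

mask = toy["Text"].str.contains("IMPRESSION|FINDINGS", case=False, na=False)

# Keep only the matching rows; reset_index renumbers them from 0.
cleaned = toy[mask].reset_index(drop=True)
print(cleaned.shape)
```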

Let's see whether there are rows with only one of our keywords.


In [18]:
# Create a mask for the word "impression"
mask_keyword1 = dataFrame['Text'].str.contains(keywords[0], case=False, na=False)

# Create a mask for the word "findings"
mask_keyword2 = dataFrame['Text'].str.contains(keywords[1], case=False, na=False)

# Filter rows that contain only keyword1 (impression) and not keyword2
# by combining the impression mask with the inverted findings mask
only_keyword1 = mask_keyword1 & ~mask_keyword2

# Filter rows that contain only keyword2 and not keyword1
only_keyword2 = mask_keyword2 & ~mask_keyword1

print(dataFrame[only_keyword1].head())


                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               Text
2                                                                       \nExam\nXray Chest PA and Lateral\n\nDate\nXXXX\n\nHistory\nrib pain after a XXXX, XXXX XXXX steps this XXXX. Pain to R back, R elbow and R rib XXXX, no previous heart or lung hx, non-XXXX, no hx ca\n\nImpression\nNo displaced rib fractures, pneumothorax, or pleural effusion identified. Well-expanded and clear lungs. Mediastinal contour within normal limits. No acute cardiopulmonary abnormality identified.\n
26                              