# Part 0 : Introduction
_____

In this tutorial we will use pandas to analyze a CSV file and prepare it for training a seq2seq model.

With pandas we can filter and arrange the data to suit our needs. To do this, we first need to understand how the data is structured.

The dataset in this example consists of radiology reports together with their diagnostic results.


# Part 1: Pandas Basics
_____

Below you can see how to load data stored in a CSV file into your Python code.

In [12]:
import pandas as pd

In [13]:
dataFrame=pd.read_csv("ReportsDATASET.csv")


Let's look at some details of the data. describe() is much more useful for numeric data, but here it still shows that there are 1984 text values, 1982 of them unique.

In [None]:
dataFrame.describe()

Unnamed: 0,Text
count,1984
unique,1982
top,\nSIGNATURE\nXXXX\n\nRADIOLOGY REPORT\nchest p...
freq,2
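
To make the describe() output easier to read, here is a minimal sketch on a made-up four-row frame (the `toy` data is an assumption, not taken from ReportsDATASET.csv). For a text column, describe() reports the number of non-null values, the number of distinct values, the most frequent value, and how often that value occurs.

```python
import pandas as pd

# Hypothetical toy data: four reports, one of them duplicated
toy = pd.DataFrame(
    {"Text": ["normal chest", "clear lungs", "normal chest", "cardiomegaly"]}
)

summary = toy.describe()
print(summary)

# Individual statistics can be read with .loc
n_rows = summary.loc["count", "Text"]      # non-null values
n_unique = summary.loc["unique", "Text"]   # distinct values
most_common = summary.loc["top", "Text"]   # most frequent value
top_freq = summary.loc["freq", "Text"]     # how often it occurs
```

When count and unique differ, as they do in our dataset, at least one report appears more than once.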


By looking at the shape of the DataFrame we can see the number of rows and columns.

In [None]:
dataFrame.shape

(1984, 1)

With the head() method you can see the first rows.

In [None]:
dataFrame.head() 

Unnamed: 0,Text
0,\nChest PA-Lat XR\n\nImaging Study\nXray Chest...
1,"EXAM(S): Chest, 2 views, frontal and lateral\n..."
2,\nExam\nXray Chest PA and Lateral\n\nDate\nXXX...
3,\nRADIOLOGY REPORT\n\nExamination\nPA and late...
4,\nChest PA-Lat XR\n\nImaging Study\nXray Chest...


Let's remove the limit on the displayed column width.

In [None]:
pd.set_option('display.max_colwidth', None)
dataFrame.head()

Unnamed: 0,Text
0,\nChest PA-Lat XR\n\nImaging Study\nXray Chest PA and Lateral\nExam: 2 views of the chest XXXX/XXXX.\n \nComparison: None.\n \nIndication: Positive TB test\n \nFindings:\nThe cardiac silhouette and mediastinum size are within normal limits.\nThere is no pulmonary edema. There is no focal consolidation. There\nare no XXXX of a pleural effusion. There is no evidence of\npneumothorax.\n \nImpression:\nNormal chest x-XXXX. \nThis examination and reported findings have been reviewed and\nconfirmed by the undersigned.\n\n
1,"EXAM(S): Chest, 2 views, frontal and lateral\n\nDATE\nXXXX\n\nCOMPARISON\nNone.\n\nINDICATION\nPreop bariatric surgery.\n\nFINDINGS\nBorderline cardiomegaly. Midline sternotomy XXXX. Enlarged pulmonary arteries. Clear lungs. Inferior XXXX XXXX XXXX.\n\nIMPRESSION\nNo acute pulmonary findings. \n XXXX XXXX for the opportunity to care for your patient. If XXXX have any questions regarding this report, please XXXX the radiologist, Dr. XXXX XXXX, at XXXX.\n"
2,"\nExam\nXray Chest PA and Lateral\n\nDate\nXXXX\n\nHistory\nrib pain after a XXXX, XXXX XXXX steps this XXXX. Pain to R back, R elbow and R rib XXXX, no previous heart or lung hx, non-XXXX, no hx ca\n\nImpression\nNo displaced rib fractures, pneumothorax, or pleural effusion identified. Well-expanded and clear lungs. Mediastinal contour within normal limits. No acute cardiopulmonary abnormality identified.\n"
3,"\nRADIOLOGY REPORT\n\nExamination\nPA and lateral views of the chest XXXX, XXXX at XXXX hours History: XXXX-year-old XXXX with XXXX. Comparison: None available Findings: There are diffuse bilateral interstitial and alveolar opacities consistent with chronic obstructive lung disease and bullous emphysema. There are irregular opacities in the left lung apex, that could represent a cavitary lesion in the left lung apex.There are streaky opacities in the right upper lobe, XXXX scarring. The cardiomediastinal silhouette is normal in size and contour. There is no pneumothorax or large pleural effusion. Transcribed by - PSC Transcription Date - XXXX\n\nIMPRESSION\n1. Bullous emphysema and interstitial fibrosis. 2. Probably scarring in the left apex, although difficult to exclude a cavitary lesion. 3. Opacities in the bilateral upper lobes could represent scarring, however the absence of comparison exam, recommend short interval followup radiograph or CT thorax to document resolution.\n\nSIGNATURE\nXXXX\n\n"
4,"\nChest PA-Lat XR\n\nImaging Study\nXray Chest PA and Lateral\nEXAMINATION: CHEST ( FRONTAL AND LATERAL): XXXX, XXXX XXXX PM \n \nCLINICAL INDICATION: Chest and nasal congestion.\n \nCOMPARISXXXX/XXXX.\n \nFINDINGS:\nThe cardiomediastinal silhouette and pulmonary vasculature are within\nnormal limits. There is no pneumothorax or pleural effusion. There\nare no focal areas of consolidation. Cholecystectomy clips are\npresent. Small T-spine osteophytes. There is biapical pleural\nthickening, unchanged from prior. Mildly hyperexpanded lungs.\n \nIMPRESSION:\nNo acute cardiopulmonary abnormality.\n\n"
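
pd.set_option changes the display width for the rest of the session. If that is broader than you want, one alternative is pd.option_context, which applies the setting only inside a with block. A minimal sketch, using a made-up one-row frame:

```python
import pandas as pd

# Hypothetical toy data with one long text value
toy = pd.DataFrame({"Text": ["FINDINGS " + "clear lungs " * 20]})

width_before = pd.get_option("display.max_colwidth")

# Inside the block the column width limit is lifted
with pd.option_context("display.max_colwidth", None):
    width_inside = pd.get_option("display.max_colwidth")
    print(toy)

# Outside the block the previous setting is back in effect
width_after = pd.get_option("display.max_colwidth")
print(toy)
```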


These are quite long rows, but they contain two very important keywords: Findings and Impression.

Findings tells us what was observed, and Impression gives the diagnosis based on those observations.

Let's search for the keywords and see how many rows contain them.

In [None]:
# List your keywords
keywords = ['IMPRESSION\n', 'FINDINGS\n']

# dataFrame['Text'] selects the column
# str.contains checks whether the keywords exist
# '|'.join(keywords) joins the keywords with a regex OR, so a row counts if it contains either one
# case=False makes the search case-insensitive
# na=False counts NaN values as not containing the keywords

mask = dataFrame['Text'].str.contains('|'.join(keywords), case=False, na=False)

# Count the number of rows containing the keywords
count = mask.sum()

print(f'Number of total rows {dataFrame.shape[0]} ')
print(f'Number of rows containing the keywords : {count}')

Number of total rows 1984 
Number of rows containing the keywords : 1982
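
Here is the same counting idea on a made-up three-row frame, as a sketch rather than the exact code of this tutorial. One assumption is added: each keyword is passed through re.escape before joining, which guards against keywords that contain regex metacharacters such as `.` or `(`.

```python
import re
import pandas as pd

# Hypothetical toy reports; the third has neither section heading
toy = pd.DataFrame({"Text": [
    "Exam\nFINDINGS\nClear lungs.\nIMPRESSION\nNormal.",
    "Exam\nImpression\nNo acute abnormality.",
    "Chest PA-Lat XR\nImaging Study\n",
]})

keywords = ["IMPRESSION\n", "FINDINGS\n"]

# Escape each keyword, then join with | so that any one of them matches
pattern = "|".join(re.escape(k) for k in keywords)

mask = toy["Text"].str.contains(pattern, case=False, na=False)
print(f"{mask.sum()} of {len(toy)} rows contain a keyword")
```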


All rows except 2 contain a keyword. Let's see what those rows are. We already created a mask, so let's invert it and apply it.

In [None]:
# Invert the mask to get rows that do not contain the keywords
inverse_mask = ~mask

# Filter the DataFrame using the inverted mask
rows_without_keywords = dataFrame[inverse_mask]

print(rows_without_keywords)

                                                                   Text
15    \nChest PA-Lat XR\n\nImaging Study\nXray Chest PA and Lateral\n\n
1501    \nChest PA-Lat XR\n\nImaging Study\nXR Chest PA and Lateral\n\n


It looks like those two rows do not contain any important data.

Let's see whether there are rows with only one of our keywords.


In [None]:
# Create a mask for the keyword impression
mask_keyword1 = dataFrame['Text'].str.contains(keywords[0], case=False, na=False)

# Create a mask for the keyword findings
mask_keyword2 = dataFrame['Text'].str.contains(keywords[1], case=False, na=False)



In [None]:
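# Apply the mask that looks for keyword 1 (impression)
key1df = dataFrame[mask_keyword1]

# Invert the mask that looks for keyword 2 (findings)
nokey2 = ~mask_keyword2

# nokey2 is indexed like dataFrame, so pandas realigns it to key1df
# and warns that the boolean Series key will be reindexed
print(key1df[nokey2].count())
# Applying the no-findings mask to the impression DataFrame leaves 319 rows
key1df[nokey2].head()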

Text    319
dtype: int64




Unnamed: 0,Text
2,"\nExam\nXray Chest PA and Lateral\n\nDate\nXXXX\n\nHistory\nrib pain after a XXXX, XXXX XXXX steps this XXXX. Pain to R back, R elbow and R rib XXXX, no previous heart or lung hx, non-XXXX, no hx ca\n\nImpression\nNo displaced rib fractures, pneumothorax, or pleural effusion identified. Well-expanded and clear lungs. Mediastinal contour within normal limits. No acute cardiopulmonary abnormality identified.\n"
26,"\nEXAM\nPA and LAT view CHEST XXXX, XXXX XXXX PM\n\nIndication\nChronic XXXX XXXX\n\nComparisons\nXXXX\n\nDiscussion\nLungs are overall hyperexpanded with flattening of the diaphragms. No focal consolidation. No pleural effusions or pneumothoraces. Heart and mediastinum of normal size and contour. Degenerative changes in the thoracic spine.\n\nImpression\nHyperexpanded but clear lungs.\n"
28,"EXAM(S): Chest, 2 views, frontal and lateral\n\nDATE\nXXXX\n\nCOMPARISON\nXXXX, XXXX\n\nINDICATION\nXXXX, hypoxia.\n\nIMPRESSION\nBorderline heart size. Elevated left diaphragm. Clear right lung. Tracheostomy tube tip above the carina. Extensive airspace disease in the left base. No large effusion or pneumothorax. \n XXXX XXXX for the opportunity to care for your patient. If XXXX have any questions regarding this report, please XXXX the radiologist, Dr. XXXX XXXX, at XXXX.\n"
29,"\nRADIOLOGY REPORT\n\nExam\nChest x-XXXX XXXX and lateral, XXXX Indication: XXXX-year-old male with chest pain. Comparison: None Discussion: Lungs are clear without focal consolidation, effusion, or pneumothorax. Normal heart size. Negative for pneumoperitoneum. Bony thorax and soft tissue grossly unremarkable Transcribed by - PSC Transcription Date - XXXX\n\nIMPRESSION\nNegative acute cardiopulmonary abnormality.\n\nSIGNATURE\nXXXX\n\n"
30,"\nChest PA-Lat XR\n\nImaging Study\nXray Chest PA and Lateral\nExam: Xray Chest PA and Lateral \n \nDate: XXXX, XXXX XXXX PM \n \nHistory: XXXX DYSPNEA \n \nImpression:\n \nComparison XXXX, XXXX.\n \nSuggestion of slightly more prominent interstitial markings, which\nmay represent some bronchitic/bronchiolitis changes. No suspicious\nnodules, pneumonia, effusions, or CHF. Stable mediastinal contour.\n\n"


As can be seen, Discussion is sometimes used instead of Findings.
Let's see how many of the rows that do not contain Findings contain Discussion.

In [None]:
mask_discus = dataFrame['Text'].str.contains("Discussion", case=False, na=False)
print(key1df[nokey2].count())
only_impressions=key1df[nokey2]
print(only_impressions[mask_discus].count())

Text    319
dtype: int64
Text    46
dtype: int64
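
Filtering an already filtered frame with a mask that was built on the full frame makes pandas realign the mask, which can trigger a UserWarning about the boolean Series key being reindexed. A sketch of one alternative, on made-up data: combine the masks with `&` and `~` and index the original frame once. The names `has_impression`, `has_findings`, and `has_discussion` are illustrative.

```python
import pandas as pd

# Hypothetical toy reports covering the cases discussed above
toy = pd.DataFrame({"Text": [
    "FINDINGS\nClear lungs.\nIMPRESSION\nNormal.",
    "Discussion\nClear lungs.\nImpression\nNormal.",
    "IMPRESSION\nNo acute abnormality.",
    "Imaging Study\n",
]})

has_impression = toy["Text"].str.contains("IMPRESSION\n", case=False, na=False)
has_findings = toy["Text"].str.contains("FINDINGS\n", case=False, na=False)
has_discussion = toy["Text"].str.contains("Discussion", case=False, na=False)

# All masks share toy's index, so no realignment is needed
impression_only = toy[has_impression & ~has_findings]
print(len(impression_only))

# Of those rows, how many use Discussion instead of Findings?
n_discussion = (has_impression & ~has_findings & has_discussion).sum()
print(n_discussion)
```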




Now let's filter for the rows that contain Findings but not Impression.

In [11]:
# Apply the mask that looks for keyword 2 (findings)
key2df = dataFrame[mask_keyword2]

# Invert the mask that looks for keyword 1 (impression)
nokey1 = ~mask_keyword1

print(key2df[nokey1].count())
# Applying the no-impression mask to the findings DataFrame leaves one result
key2df[nokey1].head(10)


So here we have a report with no Impression section, and its first finding should be labeled as the impression instead.

key1df holds the rows with Impression.  
key2df holds the rows with Findings.

In [8]:
print(key1df.count())
print(key2df.count())


In [5]:
print(key1df[mask_keyword2].count())


We have 1662 rows that include both keywords, which is good for us.
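
The separate counts above can also be collected in a single table. A minimal sketch on made-up data, using pd.crosstab on the two keyword masks; the toy reports and the yes/no labels are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy reports: two with both sections, one with only
# impression, one with only findings, one with neither
toy = pd.DataFrame({"Text": [
    "FINDINGS\nClear lungs.\nIMPRESSION\nNormal.",
    "FINDINGS\nCardiomegaly.\nIMPRESSION\nBorderline heart size.",
    "IMPRESSION\nNo acute abnormality.",
    "FINDINGS\nClear lungs.",
    "Imaging Study\n",
]})

has_impression = toy["Text"].str.contains("IMPRESSION\n", case=False, na=False)
has_findings = toy["Text"].str.contains("FINDINGS\n", case=False, na=False)

# Map the booleans to readable labels before cross-tabulating
labels = {True: "yes", False: "no"}
table = pd.crosstab(
    has_impression.map(labels),
    has_findings.map(labels),
    rownames=["impression"],
    colnames=["findings"],
)
print(table)

# Rows that contain both keywords
both = table.loc["yes", "yes"]
```

Each cell of the table is one combination: both keywords, only one of them, or neither.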

# Part 2 : Cleaning Data
-----------

In this part we will start making invasive changes to the data.  

Don't forget that all our changes are made to the dataFrame variable and not to the .csv file. 

So if we want to keep the results, we need to save them. 

Let's keep only letters, commas, full stops, and whitespace. A regex will help us here.   

a-z is for lowercase letters  
A-Z is for uppercase letters  
. is for the full stop (dot)  
, is for the comma  
\s is for whitespace (spaces, tabs, and newlines)


In [None]:
# Define a regular expression pattern that matches the characters we want to remove:
# anything that is not a letter, full stop, comma, or whitespace
pattern = r'[^a-zA-Z.,\s]'


In [None]:

# Apply the pattern to each string in the DataFrame and replace every matching character with an empty string
df = dataFrame.replace(to_replace=pattern, value='', regex=True)

# Save the cleaned DataFrame back to a CSV file
df.to_csv('cleaned_file.csv', index=False)
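
To preview what the pattern removes before applying it to the whole DataFrame, it can be tried on a single string with Python's re module. A minimal sketch; the sample string is made up for illustration.

```python
import re

# Same pattern as above: anything that is not a letter, '.', ',' or whitespace
pattern = r'[^a-zA-Z.,\s]'

# Hypothetical sample in the style of the reports
sample = "Exam: 2 views, PA-Lat (XXXX/XXXX).\nNormal chest x-XXXX."

cleaned = re.sub(pattern, '', sample)
print(repr(cleaned))
```

Note that digits and hyphens disappear without leaving a separator, so PA-Lat becomes PALat and removing the 2 leaves two consecutive spaces. Newlines survive because \s matches them.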


Now we will convert everything to lowercase.

In [None]:

# Let's read our cleaned file
df = pd.read_csv('cleaned_file.csv')

# Convert the Text column, the only column in this dataset, to lowercase.
# .str.lower() leaves missing values untouched.
df['Text'] = df['Text'].str.lower()

# Save the cleaned DataFrame back to a CSV file
df.to_csv('cleaned_file.csv', index=False)

df.head()  # inspect the result




Unnamed: 0,Text
0,\nchest palat xr\n\nimaging study\nxray chest pa and lateral\nexam views of the chest xxxxxxxx.\n \ncomparison none.\n \nindication positive tb test\n \nfindings\nthe cardiac silhouette and mediastinum size are within normal limits.\nthere is no pulmonary edema. there is no focal consolidation. there\nare no xxxx of a pleural effusion. there is no evidence of\npneumothorax.\n \nimpression\nnormal chest xxxxx. \nthis examination and reported findings have been reviewed and\nconfirmed by the undersigned.\n\n
1,"exams chest, views, frontal and lateral\n\ndate\nxxxx\n\ncomparison\nnone.\n\nindication\npreop bariatric surgery.\n\nfindings\nborderline cardiomegaly. midline sternotomy xxxx. enlarged pulmonary arteries. clear lungs. inferior xxxx xxxx xxxx.\n\nimpression\nno acute pulmonary findings. \n xxxx xxxx for the opportunity to care for your patient. if xxxx have any questions regarding this report, please xxxx the radiologist, dr. xxxx xxxx, at xxxx.\n"
2,"\nexam\nxray chest pa and lateral\n\ndate\nxxxx\n\nhistory\nrib pain after a xxxx, xxxx xxxx steps this xxxx. pain to r back, r elbow and r rib xxxx, no previous heart or lung hx, nonxxxx, no hx ca\n\nimpression\nno displaced rib fractures, pneumothorax, or pleural effusion identified. wellexpanded and clear lungs. mediastinal contour within normal limits. no acute cardiopulmonary abnormality identified.\n"
3,"\nradiology report\n\nexamination\npa and lateral views of the chest xxxx, xxxx at xxxx hours history xxxxyearold xxxx with xxxx. comparison none available findings there are diffuse bilateral interstitial and alveolar opacities consistent with chronic obstructive lung disease and bullous emphysema. there are irregular opacities in the left lung apex, that could represent a cavitary lesion in the left lung apex.there are streaky opacities in the right upper lobe, xxxx scarring. the cardiomediastinal silhouette is normal in size and contour. there is no pneumothorax or large pleural effusion. transcribed by psc transcription date xxxx\n\nimpression\n. bullous emphysema and interstitial fibrosis. . probably scarring in the left apex, although difficult to exclude a cavitary lesion. . opacities in the bilateral upper lobes could represent scarring, however the absence of comparison exam, recommend short interval followup radiograph or ct thorax to document resolution.\n\nsignature\nxxxx\n\n"
4,"\nchest palat xr\n\nimaging study\nxray chest pa and lateral\nexamination chest frontal and lateral xxxx, xxxx xxxx pm \n \nclinical indication chest and nasal congestion.\n \ncomparisxxxxxxxx.\n \nfindings\nthe cardiomediastinal silhouette and pulmonary vasculature are within\nnormal limits. there is no pneumothorax or pleural effusion. there\nare no focal areas of consolidation. cholecystectomy clips are\npresent. small tspine osteophytes. there is biapical pleural\nthickening, unchanged from prior. mildly hyperexpanded lungs.\n \nimpression\nno acute cardiopulmonary abnormality.\n\n"
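
The two cleaning steps can also be chained in memory, which makes the intermediate CSV round trip optional. A sketch under the assumption that only the Text column needs cleaning; the toy data is made up.

```python
import pandas as pd

pattern = r'[^a-zA-Z.,\s]'

# Hypothetical toy reports, including one missing value
toy = pd.DataFrame(
    {"Text": ["Exam: 2 views, PA-Lat.", None, "IMPRESSION\nNormal chest x-XXXX."]}
)

cleaned = toy.copy()
cleaned["Text"] = (
    toy["Text"]
    .str.replace(pattern, "", regex=True)  # drop unwanted characters
    .str.lower()                           # then normalize the case
)
print(cleaned)
```

The original frame stays unchanged, and the missing value stays missing instead of raising an error.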


Now the data is in better shape for training, and we can start looking for further changes to make.