# Viewing Dataset File

---

## Intentions

- To extract a predetermined subset of the overall dataset, allowing a brief overview and analysis.
- After extracting, to analyse the text using spaCy or Hugging Face Transformers (HFT).

In [1]:
# imports
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt

In [2]:
# read each CSV file into its own DataFrame

# helper that limits the number of rows read
def read_first_n_rows(data, n):
    """Return the first n rows of the CSV file at path `data`."""
    return pd.read_csv(data, nrows=n)

rows = 1000
discharge_detail_data = read_first_n_rows("discharge_detail.csv", rows)

In [3]:
discharge_detail_data.head()

Unnamed: 0,note_id,subject_id,field_name,field_value,field_ordinal
0,10000032-DS-21,10000032,author,___,1
1,10000032-DS-22,10000032,author,___,1
2,10000032-DS-23,10000032,author,___,1
3,10000032-DS-24,10000032,author,___,1
4,10000084-DS-17,10000084,author,___,1


In [4]:
# read in discharge.csv 
discharge_data = read_first_n_rows("discharge.csv", rows)


In [5]:
discharge_data.head()

Unnamed: 0,note_id,subject_id,hadm_id,note_type,note_seq,charttime,storetime,text
0,10000032-DS-21,10000032,22595853,DS,21,2180-05-07 00:00:00,2180-05-09 15:26:00,\nName: ___ Unit No: _...
1,10000032-DS-22,10000032,22841357,DS,22,2180-06-27 00:00:00,2180-07-01 10:15:00,\nName: ___ Unit No: _...
2,10000032-DS-23,10000032,29079034,DS,23,2180-07-25 00:00:00,2180-07-25 21:42:00,\nName: ___ Unit No: _...
3,10000032-DS-24,10000032,25742920,DS,24,2180-08-07 00:00:00,2180-08-10 05:43:00,\nName: ___ Unit No: _...
4,10000084-DS-17,10000084,23052089,DS,17,2160-11-25 00:00:00,2160-11-25 15:09:00,\nName: ___ Unit No: __...


In [6]:
# retrieve the column names as a plain list
discharge_data_column_names = list(discharge_data.columns)
discharge_data_column_names

['note_id',
 'subject_id',
 'hadm_id',
 'note_type',
 'note_seq',
 'charttime',
 'storetime',
 'text']

In [7]:
radiology_detail_data = read_first_n_rows("radiology_detail.csv", rows)
radiology_detail_data.head()

Unnamed: 0,note_id,subject_id,field_name,field_value,field_ordinal
0,10000032-RR-14,10000032,exam_code,C11,1
1,10000032-RR-14,10000032,exam_name,CHEST (PA & LAT),1
2,10000032-RR-15,10000032,exam_code,U314,1
3,10000032-RR-15,10000032,exam_code,U644,3
4,10000032-RR-15,10000032,exam_code,W82,2


In [8]:
radiology_data = read_first_n_rows("radiology.csv", rows)
radiology_data.head()

Unnamed: 0,note_id,subject_id,hadm_id,note_type,note_seq,charttime,storetime,text
0,10000032-RR-14,10000032,22595853.0,RR,14,2180-05-06 21:19:00,2180-05-06 23:32:00,EXAMINATION: CHEST (PA AND LAT)\n\nINDICATION...
1,10000032-RR-15,10000032,22595853.0,RR,15,2180-05-06 23:00:00,2180-05-06 23:26:00,EXAMINATION: LIVER OR GALLBLADDER US (SINGLE ...
2,10000032-RR-16,10000032,22595853.0,RR,16,2180-05-07 09:55:00,2180-05-07 11:15:00,"INDICATION: ___ HCV cirrhosis c/b ascites, hi..."
3,10000032-RR-18,10000032,,RR,18,2180-06-03 12:46:00,2180-06-03 14:01:00,EXAMINATION: Ultrasound-guided paracentesis.\...
4,10000032-RR-20,10000032,,RR,20,2180-07-08 13:18:00,2180-07-08 14:15:00,EXAMINATION: Paracentesis\n\nINDICATION: ___...


# Next Step - HFT Text Summarisation

---

The next step is to evaluate whether the HFT library can summarise the text to the desired level.

## Install the HFT Library

### Installing Transformers and Importing Dependencies


In [9]:
%pip install transformers
%pip install tensorflow

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


### Load Summarisation Pipeline

In [17]:
from transformers import pipeline

summarizer = pipeline("summarization")

No model was supplied, defaulted to t5-small and revision d769bba (https://huggingface.co/t5-small).
Using a pipeline without specifying a model name and revision in production is not recommended.


All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.
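
The warning above recommends naming the model and revision explicitly. A minimal sketch of doing so, reusing the model name and revision reported in that log, could look like the following. The helper name `build_summarizer` and the two constants are hypothetical names introduced here for illustration; `transformers` is imported inside the function so the constants can be inspected without loading the library.

```python
# Model name and revision copied from the pipeline log above.
MODEL_NAME = "t5-small"
MODEL_REVISION = "d769bba"


def build_summarizer(model_name=MODEL_NAME, revision=MODEL_REVISION):
    """Create a summarisation pipeline pinned to a named model and revision.

    Hypothetical helper: pinning both values keeps results reproducible
    if the library's default model changes later.
    """
    # Imported lazily so that defining this helper does not load transformers.
    from transformers import pipeline

    return pipeline("summarization", model=model_name, revision=revision)


# Usage (downloads the model on first call):
# summarizer = build_summarizer()
```

Any other summarisation checkpoint could be passed through `model_name` in the same way.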



### Summarise Text


In [20]:
text = """
‘Have some shame’: Boxing world reacts to Mike Tyson news
Boxing has been stunned by the announcement Mike Tyson will fight one of the biggest names in the sport and not everyone is happy.

2 min read
March 8, 2024 - 6:51AM
4 comments
Boxing: Jake Paul took down Ryan Bourland with a first round TKO in their bout in... more
Up Next
Spying probe finds comms equipment on Chinese cranes at…
Cancel
More From Boxing
Former heavyweight champ’s major call on Aussie young gun
Former heavyweight champ’s major call on Aussie young gun
Meet the undefeated Aussie who could be our next UFC champion
Meet the undefeated Aussie who could be our next UFC champion
‘I will take it straight away’: Hall chases Gallen, SBW, rematches
‘I will take it straight away’: Hall chases Gallen, SBW, rematches
Former heavyweight champion Mike Tyson will return to the ring to face Youtuber-turned-boxer Jake Paul in an exhibition bout screened on Netflix, organisers announced on Friday (AEDT).

Tyson, 57, and Paul will face off at the AT&T Stadium in Arlington, Texas, the home of the NFL’s Dallas Cowboys, on July 21, Netflix and Most Valuable Promotions (MVP) announced.

Early reports claimed the fight would be free to Netflix subscribers.

“Given the names involved, that it’s not on PPV and is on Netflix, which is everywhere globally it may be the most watched boxing event ever,” tweeted boxing reporter Dan Rafael.

Mike Tyson and Jake Paul square off. Picture: Netflix
Mike Tyson and Jake Paul square off. Picture: Netflix
Tyson said in a statement he was looking forward to fighting an opponent who is 30 years his junior, insisting that he had been impressed by Paul’s performances in his fledgling boxing career.

“He’s grown significantly as a boxer over the years, so it will be a lot of fun to see what the will and ambition of a ‘kid’ can do with the experience and aptitude of a GOAT,” Tyson said.

Paul fought on the undercard of Tyson’s last outing, an eight-round exhibition against former middleweight king Roy Jones Jr. in Los Angeles in 2020.

Mike Tyson was in top shape when he last fought against Roy Jones Jr in 2020. (Photo by Joe Scarnici / GETTY)
Mike Tyson was in top shape when he last fought against Roy Jones Jr in 2020. (Photo by Joe Scarnici / GETTY)
“I started him on his boxing journey on the undercard of my fight with Roy Jones and now I plan to finish him,” Tyson quipped.

The fight is the latest in what has become a popular trend in recent years, pitting internet celebrities against each other or against recognised boxers, who have already retired or are well past their prime.

Jake Paul’s brother Logan Paul played a key role in pioneering the trend, even fighting against boxing icon Floyd Mayweather in 2021.

Jake Paul was far too good for Ryan Bourland during their cruiserweight fight on March 2. (Photo by Al Bello/Getty Images)
Jake Paul was far too good for Ryan Bourland during their cruiserweight fight on March 2. (Photo by Al Bello/Getty Images)
Jake Paul however has become a bigger draw, building a 9-1 record with six knockouts.

He scored back-to-back knockouts against professional boxers in his last two fights with victories over Ryan Bourland and Andre August.

Tyson, meanwhile, was regarded as one of the most ferocious heavyweight boxers in history.

He reigned as undisputed champion between 1987 and 1990 and won his first belt at the age of 20 years, four months and 22 days to become the youngest heavyweight champion in history.

NEWS.COM.AU00:20
Mike Tyson's insane power at 53-years-old
UP NEXT







Mike Tyson lets the world know that he's still got incredible power at the age of 53.
“He’s the greatest heavyweight of all time…the most vicious KO artist ever,” Paul tweeted after the bout was announced.

“But I’m younger, I’m faster and I’m going to be working my ass off to get stronger. A member of my team sent me this video that Mike’s coach put up two weeks ago and asked me if I’m sure that I want to do this ... yes, yes I do. Heavyweight.”

“My sights are set on becoming a world champion, and now I have a chance to prove myself against the greatest heavyweight champion of the world, the baddest man on the planet and the most dangerous boxer of all time. Time to put Iron Mike to sleep,” Paul added.

But it wasn’t long before other fighters started criticising the match-up.

More Coverage

Horner accuser suspended by Red Bull

Boxing upset throws Tszyu plan into disarray
“You’re fighting someone who was born in 1966,” tweeted Dillon Danis, who fought and lost to Logan Paul. “Have some shame.”

“You should be ashamed of yourself,” UFC great Michael Bisping tweeted to Paul. “And the biggest joke is you don’t even slightly realise why.”
"""

In [25]:
summarizer(text, max_length=400, min_length=120, do_sample=False)

[{'summary_text': 'boxing has been stunned by the announcement that Mike Tyson will fight one of the biggest names in the sport . the fight is the latest in what has become a popular trend in recent years, pitting internet celebrities against each other or recognised boxers . he won his first belt at the age of 20 years, four months and 22 days to become the youngest heavyweight champion in history . his brother Logan Paul played a key role in pioneering the trend, even fighting against boxing icon Floyd Mayweather in 2021 . "......... ... '}]
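
The article above is long, and summarisation models accept only a limited number of input tokens, so a long input may be truncated before the model reads all of it. One common workaround is to split the text into smaller pieces and summarise each piece separately. The sketch below assumes that a plain word count is an acceptable stand-in for the model's token count, which is only approximately true. The function name `chunk_words` and the default of 300 words are assumptions made for illustration, not values taken from the library.

```python
from typing import List


def chunk_words(text: str, max_words: int = 300) -> List[str]:
    """Split text into consecutive chunks of at most max_words words.

    Words are separated on whitespace, so this only approximates the
    tokenizer's own token count.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


# Usage with the pipeline loaded earlier (not executed here):
# partial_summaries = [
#     summarizer(chunk, do_sample=False)[0]["summary_text"]
#     for chunk in chunk_words(text)
# ]
```

The partial summaries can then be joined, or summarised once more, depending on how short the final result needs to be.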

# Summarising Patient Medical History ID - 10000032

---

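Using the BART model, we will summarise the medical history of patient 10000032. Note that the pipeline loaded above defaulted to t5-small, so a BART checkpoint has to be passed explicitly through the `model` argument.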

Steps:
- Collect all the medical notes for this specific patient.
- Pass this text into the HFT model.
- Analyse the output.
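
The steps above can be sketched as one small function. This is an illustration under stated assumptions rather than the notebook's final implementation: `summarise_notes` is a hypothetical name, and the summariser is passed in as a plain callable so the joining logic can be tried without loading a model. `first_line` is a stand-in summariser used only for demonstration.

```python
from typing import Callable, Iterable


def summarise_notes(
    notes: Iterable[str],
    summarise: Callable[[str], str],
    separator: str = "\n\n",
) -> str:
    """Join one patient's notes and pass the combined text to a summariser."""
    # Skip empty or whitespace-only notes before joining.
    combined = separator.join(note.strip() for note in notes if note and note.strip())
    if not combined:
        return ""
    return summarise(combined)


def first_line(text: str) -> str:
    """Stand-in summariser for demonstration: returns the first line only."""
    return text.splitlines()[0]


# With the Hugging Face pipeline, the callable could be (not executed here):
# hf_summarise = lambda t: summarizer(t, do_sample=False)[0]["summary_text"]
# summary = summarise_notes(patient_info, hf_summarise)
```

Keeping the model behind a callable also makes it easy to swap t5-small for another checkpoint when comparing outputs.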

In [31]:
patient_id = 10000032

# keep only the subject_id and text columns of the radiology notes
df = radiology_data[["subject_id", "text"]]
df

Unnamed: 0,subject_id,text
0,10000032,EXAMINATION: CHEST (PA AND LAT)\n\nINDICATION...
1,10000032,EXAMINATION: LIVER OR GALLBLADDER US (SINGLE ...
2,10000032,"INDICATION: ___ HCV cirrhosis c/b ascites, hi..."
3,10000032,EXAMINATION: Ultrasound-guided paracentesis.\...
4,10000032,EXAMINATION: Paracentesis\n\nINDICATION: ___...
...,...,...
995,10003299,EXAM: MRI brain. MRA head. MRA neck.\n\nCLI...
996,10003299,INDICATION: Evaluation of patient who stepped...
997,10003299,INDICATION: Bilateral hard tissue per requisi...
998,10003299,HISTORY: Screening.\n\nDIGITAL SCREENING MAMM...


In [40]:
df.shape

(1000, 2)

In [49]:
patient_info = df.loc[df["subject_id"] == patient_id, "text"]
patient_info

0     EXAMINATION:  CHEST (PA AND LAT)\n\nINDICATION...
1     EXAMINATION:  LIVER OR GALLBLADDER US (SINGLE ...
2     INDICATION:  ___ HCV cirrhosis c/b ascites, hi...
3     EXAMINATION:  Ultrasound-guided paracentesis.\...
4     EXAMINATION:  Paracentesis\n\nINDICATION:  ___...
5     EXAMINATION:  ULTRASOUND INTERVENTIONAL PROCED...
6     EXAMINATION:  LIVER OR GALLBLADDER US (SINGLE ...
7     EXAMINATION:  CHEST (PA AND LAT)\n\nINDICATION...
8     EXAMINATION:  Ultrasound-guided paracentesis.\...
9     EXAMINATION:  Ultrasound-guided paracentesis.\...
10    EXAMINATION:  ULTRASOUND PARACENTESIS\n\nINDIC...
11    EXAMINATION:  Ultrasound-guided paracentesis.\...
12    EXAMINATION:  PARACENTESIS\n\nINDICATION:  ___...
13    EXAMINATION:  THERAPEUTIC PARACENTESIS\n\nINDI...
14    EXAMINATION:  PARACENTESIS\n\nINDICATION:  ___...
15    INDICATION:  ___ year old woman with cirrhosis...
16    EXAMINATION:  PARACENTESIS\n\nINDICATION:  ___...
17    EXAMINATION:  CT HEAD W/O CONTRAST\n\nINDI

In [62]:
# unique subject IDs as plain Python ints, sorted ascending
subject_ids = sorted(df["subject_id"].unique().tolist())
subject_ids

[10000032,
 10000084,
 10000102,
 10000108,
 10000117,
 10000248,
 10000285,
 10000473,
 10000560,
 10000594,
 10000635,
 10000650,
 10000719,
 10000764,
 10000826,
 10000891,
 10000898,
 10000904,
 10000935,
 10000951,
 10000980,
 10001016,
 10001038,
 10001122,
 10001176,
 10001186,
 10001217,
 10001319,
 10001336,
 10001338,
 10001401,
 10001492,
 10001523,
 10001574,
 10001663,
 10001667,
 10001725,
 10001823,
 10001851,
 10001860,
 10001877,
 10001884,
 10001919,
 10002012,
 10002013,
 10002131,
 10002147,
 10002155,
 10002157,
 10002167,
 10002177,
 10002221,
 10002348,
 10002428,
 10002430,
 10002443,
 10002495,
 10002523,
 10002528,
 10002545,
 10002557,
 10002559,
 10002661,
 10002662,
 10002751,
 10002755,
 10002760,
 10002769,
 10002800,
 10002804,
 10002807,
 10002852,
 10002859,
 10002869,
 10002870,
 10002920,
 10002930,
 10002976,
 10003019,
 10003046,
 10003052,
 10003137,
 10003199,
 10003203,
 10003255,
 10003299]

In [63]:
m = len(subject_ids)
m

86

In [64]:
patient_list = {subject_id: "" for subject_id in subject_ids}
patient_list

{10000032: '',
 10000084: '',
 10000102: '',
 10000108: '',
 10000117: '',
 10000248: '',
 10000285: '',
 10000473: '',
 10000560: '',
 10000594: '',
 10000635: '',
 10000650: '',
 10000719: '',
 10000764: '',
 10000826: '',
 10000891: '',
 10000898: '',
 10000904: '',
 10000935: '',
 10000951: '',
 10000980: '',
 10001016: '',
 10001038: '',
 10001122: '',
 10001176: '',
 10001186: '',
 10001217: '',
 10001319: '',
 10001336: '',
 10001338: '',
 10001401: '',
 10001492: '',
 10001523: '',
 10001574: '',
 10001663: '',
 10001667: '',
 10001725: '',
 10001823: '',
 10001851: '',
 10001860: '',
 10001877: '',
 10001884: '',
 10001919: '',
 10002012: '',
 10002013: '',
 10002131: '',
 10002147: '',
 10002155: '',
 10002157: '',
 10002167: '',
 10002177: '',
 10002221: '',
 10002348: '',
 10002428: '',
 10002430: '',
 10002443: '',
 10002495: '',
 10002523: '',
 10002528: '',
 10002545: '',
 10002557: '',
 10002559: '',
 10002661: '',
 10002662: '',
 10002751: '',
 10002755: '',
 10002760:

In [71]:
# for each row in df 
#   concatenate text part to patient_list for each patient


print(patient_id)

KeyError: 'subject_id'
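
The cell above only outlines the concatenation in comments. A minimal sketch of that step follows, assuming the rows are supplied as `(subject_id, text)` pairs. The function name `concatenate_notes` and the blank-line separator are assumptions made here for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def concatenate_notes(
    rows: Iterable[Tuple[int, str]],
    separator: str = "\n\n",
) -> Dict[int, str]:
    """Concatenate the text of every row belonging to the same subject_id."""
    collected: Dict[int, List[str]] = {}
    for subject_id, text in rows:
        # Group texts per patient in the order they appear.
        collected.setdefault(subject_id, []).append(text)
    return {
        subject_id: separator.join(texts)
        for subject_id, texts in collected.items()
    }


# Usage with the DataFrame from this notebook (not executed here):
# patient_list = concatenate_notes(zip(df["subject_id"], df["text"]))
```

Iterating over `zip(df["subject_id"], df["text"])` avoids looking up columns on each row individually. A pandas-native equivalent would be a `groupby` on `subject_id` followed by joining the `text` values.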