# Viewing Dataset File

---

## Intentions

- Extract a predetermined subset of the overall dataset to allow a brief overview and analysis.
- After extraction, analyse the text using spaCy or Hugging Face Transformers (HFT).

In [30]:
# imports
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt

In [31]:
# Read each CSV file into its own DataFrame, limited to the first n rows.

def read_first_n_rows(data, n):
    """Return the first n rows of the CSV file at path `data` as a DataFrame."""
    return pd.read_csv(data, nrows=n)

rows = 1000
discharge_detail_data = read_first_n_rows("discharge_detail.csv", rows)

In [32]:
discharge_detail_data.head()

|   | note_id | subject_id | field_name | field_value | field_ordinal |
|---|---|---|---|---|---|
| 0 | 10000032-DS-21 | 10000032 | author | ___ | 1 |
| 1 | 10000032-DS-22 | 10000032 | author | ___ | 1 |
| 2 | 10000032-DS-23 | 10000032 | author | ___ | 1 |
| 3 | 10000032-DS-24 | 10000032 | author | ___ | 1 |
| 4 | 10000084-DS-17 | 10000084 | author | ___ | 1 |


In [33]:
# read in discharge.csv 
discharge_data = read_first_n_rows("discharge.csv", rows)


In [34]:
discharge_data.head()

|   | note_id | subject_id | hadm_id | note_type | note_seq | charttime | storetime | text |
|---|---|---|---|---|---|---|---|---|
| 0 | 10000032-DS-21 | 10000032 | 22595853 | DS | 21 | 2180-05-07 00:00:00 | 2180-05-09 15:26:00 | \nName: ___ Unit No: _... |
| 1 | 10000032-DS-22 | 10000032 | 22841357 | DS | 22 | 2180-06-27 00:00:00 | 2180-07-01 10:15:00 | \nName: ___ Unit No: _... |
| 2 | 10000032-DS-23 | 10000032 | 29079034 | DS | 23 | 2180-07-25 00:00:00 | 2180-07-25 21:42:00 | \nName: ___ Unit No: _... |
| 3 | 10000032-DS-24 | 10000032 | 25742920 | DS | 24 | 2180-08-07 00:00:00 | 2180-08-10 05:43:00 | \nName: ___ Unit No: _... |
| 4 | 10000084-DS-17 | 10000084 | 23052089 | DS | 17 | 2160-11-25 00:00:00 | 2160-11-25 15:09:00 | \nName: ___ Unit No: __... |


In [35]:
# Retrieve the column names as a plain Python list
discharge_data_column_names = list(discharge_data.columns)
discharge_data_column_names

['note_id',
 'subject_id',
 'hadm_id',
 'note_type',
 'note_seq',
 'charttime',
 'storetime',
 'text']

In [36]:
radiology_detail_data = read_first_n_rows("radiology_detail.csv", rows)
radiology_detail_data.head()

|   | note_id | subject_id | field_name | field_value | field_ordinal |
|---|---|---|---|---|---|
| 0 | 10000032-RR-14 | 10000032 | exam_code | C11 | 1 |
| 1 | 10000032-RR-14 | 10000032 | exam_name | CHEST (PA & LAT) | 1 |
| 2 | 10000032-RR-15 | 10000032 | exam_code | U314 | 1 |
| 3 | 10000032-RR-15 | 10000032 | exam_code | U644 | 3 |
| 4 | 10000032-RR-15 | 10000032 | exam_code | W82 | 2 |


In [37]:
radiology_data = read_first_n_rows("radiology.csv", rows)
radiology_data.head()

|   | note_id | subject_id | hadm_id | note_type | note_seq | charttime | storetime | text |
|---|---|---|---|---|---|---|---|---|
| 0 | 10000032-RR-14 | 10000032 | 22595853.0 | RR | 14 | 2180-05-06 21:19:00 | 2180-05-06 23:32:00 | EXAMINATION: CHEST (PA AND LAT)\n\nINDICATION... |
| 1 | 10000032-RR-15 | 10000032 | 22595853.0 | RR | 15 | 2180-05-06 23:00:00 | 2180-05-06 23:26:00 | EXAMINATION: LIVER OR GALLBLADDER US (SINGLE ... |
| 2 | 10000032-RR-16 | 10000032 | 22595853.0 | RR | 16 | 2180-05-07 09:55:00 | 2180-05-07 11:15:00 | INDICATION: ___ HCV cirrhosis c/b ascites, hi... |
| 3 | 10000032-RR-18 | 10000032 |  | RR | 18 | 2180-06-03 12:46:00 | 2180-06-03 14:01:00 | EXAMINATION: Ultrasound-guided paracentesis.\... |
| 4 | 10000032-RR-20 | 10000032 |  | RR | 20 | 2180-07-08 13:18:00 | 2180-07-08 14:15:00 | EXAMINATION: Paracentesis\n\nINDICATION: ___... |
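
In the radiology output above, `hadm_id` prints as `22595853.0` because some rows have no admission ID: a NumPy integer column cannot hold missing values, so pandas falls back to `float64`. One possible way to keep the IDs as integers is pandas' nullable `Int64` dtype. The sketch below is an illustration only, not part of the original notebook. It uses a small in-memory stand-in (two rows copied from the output above, trimmed to three columns) rather than the real `radiology.csv`.

```python
import io

import pandas as pd

# In-memory stand-in for radiology.csv: two rows copied from the output
# above, trimmed to three columns. The second row has no hadm_id.
sample_csv = io.StringIO(
    "note_id,subject_id,hadm_id\n"
    "10000032-RR-14,10000032,22595853\n"
    "10000032-RR-18,10000032,\n"
)

# "Int64" (capital I) is pandas' nullable integer dtype: missing values
# become <NA> instead of forcing the whole column to float64.
sample = pd.read_csv(sample_csv, dtype={"hadm_id": "Int64"})

print(sample.dtypes)
print(sample)
```

If this approach suits the analysis, the same `dtype` argument could be passed through to `pd.read_csv` inside `read_first_n_rows`; that change is an assumption about how the helper might be extended.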


# Next Step - HFT Text Summarisation

---

The next step is to evaluate whether the HFT library can summarise the note text to the desired level.

## Install the HFT Library

### Installing Transformers and Importing Dependencies


In [38]:
%pip install transformers

Note: you may need to restart the kernel to use updated packages.


In [39]:
from transformers import pipeline

In [41]:
summariser = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


RuntimeError: At least one of TensorFlow 2.0 or PyTorch should be installed. To install TensorFlow 2.0, read the instructions at https://www.tensorflow.org/install/ To install PyTorch, read the instructions at https://pytorch.org/.
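
The `RuntimeError` above means that `transformers` was installed without a deep-learning backend: the pipeline needs either PyTorch or TensorFlow to be present. A minimal check, using only the standard library, is sketched below. The helper name `available_backends` is made up for this illustration.

```python
import importlib.util


def available_backends():
    """Return the names of the installed backends that transformers can use."""
    candidates = {"torch": "PyTorch", "tensorflow": "TensorFlow"}
    return [
        label
        for module, label in candidates.items()
        # find_spec returns None when the top-level module is not installed
        if importlib.util.find_spec(module) is not None
    ]


backends = available_backends()
if backends:
    print("Available backends:", ", ".join(backends))
else:
    print("No backend found; install one, for example with `%pip install torch`.")
```

If no backend is found, running `%pip install torch` in a notebook cell and then restarting the kernel should allow `pipeline("summarization")` to load.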

### Load Summarisation Pipeline
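
The warning shown earlier advises against relying on the default model. A hedged sketch of loading the pipeline with the model name and revision pinned explicitly follows. Both values are taken from that warning message, and `load_summariser` is a hypothetical helper name. It assumes a backend such as PyTorch has been installed.

```python
# Model name and revision as reported in the pipeline's warning message.
MODEL_NAME = "sshleifer/distilbart-cnn-12-6"
MODEL_REVISION = "a4f8f3e"


def load_summariser(model_name=MODEL_NAME, revision=MODEL_REVISION):
    """Build a summarisation pipeline with an explicitly pinned model."""
    # Imported lazily so this cell can be defined before the packages
    # (transformers plus PyTorch or TensorFlow) are installed.
    from transformers import pipeline

    return pipeline("summarization", model=model_name, revision=revision)
```

Calling `summariser = load_summariser()` downloads the model on first use, so it may take some time.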


### Summarise Text
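
A possible shape for the summarisation step is sketched below, under stated assumptions. `summarise_texts` is a hypothetical helper, and the `max_length` and `min_length` values are illustrative rather than tuned. `truncation=True` is passed because clinical notes are often longer than the model's input limit. A stand-in summariser is used so the sketch runs without downloading a model. It mimics the pipeline's return shape (a list containing a dict with a `summary_text` key).

```python
def summarise_texts(texts, summariser, max_length=130, min_length=30):
    """Summarise each text and return the summaries as a list of strings."""
    summaries = []
    for text in texts:
        result = summariser(
            text,
            max_length=max_length,  # illustrative upper bound on summary tokens
            min_length=min_length,  # illustrative lower bound on summary tokens
            do_sample=False,        # deterministic decoding
            truncation=True,        # cut inputs that exceed the model limit
        )
        # The pipeline returns a list with one dict per input text.
        summaries.append(result[0]["summary_text"].strip())
    return summaries


def fake_summariser(text, **kwargs):
    """Stand-in with the pipeline's call shape; it only truncates the text."""
    return [{"summary_text": text[:40]}]


print(summarise_texts(["EXAMINATION: CHEST (PA AND LAT)"], fake_summariser))
```

With a working pipeline, the stand-in would be replaced by the real one, for example `summarise_texts(discharge_data["text"].head(3).tolist(), summariser)`.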
