## **Data Loaders**

In [1]:
mydata='/content/subtitle.txt'

### 1. Data Loading with Python's Built-in Methods (File I/O)

In [2]:
file_path=mydata
with open(file_path,"r",encoding="utf-8") as file:
  data=file.read()

print("Data Loaded : ",data[:100])

Data Loaded :  Welcome to Tokenization. After watching this video, you'll be able to describe
the tokenization proc
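
Reading the whole file with `file.read()` works well for a small subtitle file. For larger files, a generator that yields one line at a time keeps memory use flat. The sketch below is only an illustration: `iter_lines` and the temporary sample file are hypothetical stand-ins, not part of the notebook's data.

```python
import tempfile
from pathlib import Path
from typing import Iterator


def iter_lines(path: str, encoding: str = "utf-8") -> Iterator[str]:
    """Yield non-empty, stripped lines one at a time instead of reading the whole file."""
    with open(path, "r", encoding=encoding) as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield line


# Demo on a small temporary file (a stand-in for the subtitle file).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text(
        "Welcome to Tokenization.\n\nAfter watching this video\n",
        encoding="utf-8",
    )
    lines = list(iter_lines(str(sample)))

print(lines)
```

With the real file, `iter_lines(mydata)` could be consumed in a `for` loop without ever holding the full text in memory.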


### 2. Data Loading with Hugging Face Datasets


In [4]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl

In [6]:
from datasets import load_dataset

dataset=load_dataset('imdb')

train_data=dataset['train']
test_data=dataset['test']

train_data=train_data.to_pandas()
train_data.head()

|   | text | label |
|---|------|-------|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 |
| 2 | If only to avoid making this type of film in t... | 0 |
| 3 | This film was probably inspired by Godard's Ma... | 0 |
| 4 | Oh, brother...after hearing about this ridicul... | 0 |


### 3. Data Loading with TensorFlow


In [8]:
import tensorflow as tf
# Load data from a txt file
text_path=mydata
text_dataset=tf.data.TextLineDataset(text_path)
#preview data

for line in text_dataset.take(4):
  print(line.numpy().decode("utf-8"))

Welcome to Tokenization. After watching this video, you'll be able to describe
the tokenization process. You'll also be able to explain tokenization methods in
the use of tokenizers. Imagine you're
developing an AI model for sentiment analysis


In [None]:
# Load structured data (e.g., CSV)
csv_path = "data.csv"
csv_dataset = tf.data.experimental.make_csv_dataset(
    csv_path,
    batch_size=32,
    num_epochs=1
)
for batch in csv_dataset.take(1):
    print(batch)

### 4. Data Loading with PyTorch

In [12]:
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    def __init__(self, file_path):
        with open(file_path, "r", encoding="utf-8") as file:
            self.data = file.readlines()

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx].strip()

# Initialize Dataset and DataLoader
file_path = mydata
dataset = TextDataset(file_path)
data_loader = DataLoader(dataset, batch_size=16, shuffle=True)

# Preview data
for batch in data_loader:
    print(batch)
    break


['iterator called my iterator by applying the yield tokens', "assigns unique IDs. Here's an example of a tokenizer utilizing the word", 'tokens by appending BOS at the beginning', 'the tokenized lines with pad tokens to ensure that all sentences', 'to tokenize sentences. The tokenizer is applied to the text to get the', 'to a list of tokens. The result is a list of indices. Consider an example of applying the vocab function to', "model's overall vocabulary. Different tokenizers", "appear in the text. It's an iterative process, gradually narrowing", 'into individual characters. An advantage is that the', 'tokenization. The tokens help the model', 'example of using torchtext to tokenize sentences. A synthetic data set has been', 'each token in the vocabulary a unique index', 'sentence and indices takes an iterator as input and applies the vocab function to', 'about tokenization. Tokenization breaks a sentence into smaller pieces or tokens. Tokenizers such as NLTK', "in the original text.
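
The batch above is a shuffled list of 16 lines. As a rough mental model of what `batch_size` and `shuffle` do, the plain-Python sketch below groups items into fixed-size batches after an optional shuffle. It is not PyTorch's implementation, and `batch_iter` is a hypothetical helper introduced only for illustration.

```python
import random


def batch_iter(items, batch_size, shuffle=False, seed=None):
    """Yield lists of up to `batch_size` items, optionally in shuffled order."""
    indices = list(range(len(items)))
    if shuffle:
        # A seeded Random instance keeps the shuffle reproducible.
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [items[i] for i in indices[start:start + batch_size]]


lines = [f"line {i}" for i in range(5)]
batches = list(batch_iter(lines, batch_size=2))
print(batches)

shuffled = list(batch_iter(lines, batch_size=2, shuffle=True, seed=0))
print(shuffled)
```

The last batch is shorter when the number of items is not a multiple of `batch_size`, which matches `DataLoader`'s default of keeping the final partial batch.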

### 5. Data Loading with SpaCy


In [13]:
import spacy
# load spacy model
nlp=spacy.load("en_core_web_sm")

#process raw text
file_path=mydata
with open(file_path,'r',encoding='utf-8') as file:
  text=file.read()

doc=nlp(text)
print("Tokens: ",[token.text for token in doc[:20] ])

Tokens:  ['Welcome', 'to', 'Tokenization', '.', 'After', 'watching', 'this', 'video', ',', 'you', "'ll", 'be', 'able', 'to', 'describe', '\n', 'the', 'tokenization', 'process', '.']


### 6. Data Loading with Gensim


In [14]:
from gensim.corpora.textcorpus import TextCorpus

# Define a simple corpus class
class MyCorpus(TextCorpus):
  def get_texts(self):
    with open(self.input,"r",encoding='utf-8') as f:
      for line in f:
        yield line.split()

# Initialize corpus
corpus=MyCorpus(mydata)

for doc in corpus.get_texts():
  print(doc)
  break


['Welcome', 'to', 'Tokenization.', 'After', 'watching', 'this', 'video,', "you'll", 'be', 'able', 'to', 'describe']


### 7. Data Loading with Hugging Face Tokenizer


In [18]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Example text
text_data = ["This is the first sentence.", "Here is another example."]

# Tokenize the example sentences. Pass `text_data`, not `mydata`:
# `mydata` is a path string, so tokenizer(mydata) would tokenize the
# literal text "/content/subtitle.txt" rather than any sentences.
tokenized = tokenizer(text_data, padding=True, truncation=True, return_tensors="pt")
print("Tokenized Output:", tokenized)


### 8. Data Loading with FastAI


In [26]:
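from fastai.data.transforms import get_text_files

# get_text_files expects a directory. `mydata` points to a single file,
# so this call finds nothing and prints an empty list. Pass the folder
# that contains the .txt files to list them.
files = get_text_files(mydata)
print(files)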


[]


In [None]:
from fastai.text.all import TextDataLoaders

# Path to data
path = "data/"
data_loaders = TextDataLoaders.from_folder(path)

# View batches
for batch in data_loaders.train:
    print(batch)
    break
# To run this, the data must be organized in a folder hierarchy like:
# data/
# ├── train/
# │   ├── class1/
# │   │   ├── file1.txt
# │   │   ├── file2.txt
# │   ├── class2/
# │   │   ├── file3.txt
# │   │   ├── file4.txt
# ├── valid/
# │   ├── class1/
# │   │   ├── file5.txt
# │   ├── class2/
# │   │   ├── file6.txt
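
The folder layout above can be generated from in-memory text with `pathlib`. The sketch below is a minimal illustration under assumptions: `toy_corpus` and `write_folder_dataset` are hypothetical names, the documents are placeholder strings, and the files are written to a temporary directory rather than a real `data/` folder.

```python
import tempfile
from pathlib import Path

# Hypothetical toy corpus: split -> class label -> list of documents.
toy_corpus = {
    "train": {
        "class1": ["first doc", "second doc"],
        "class2": ["third doc", "fourth doc"],
    },
    "valid": {
        "class1": ["fifth doc"],
        "class2": ["sixth doc"],
    },
}


def write_folder_dataset(root: Path, corpus: dict) -> list:
    """Write each document to root/<split>/<label>/<n>.txt and return the created paths."""
    created = []
    for split, classes in corpus.items():
        for label, docs in classes.items():
            folder = root / split / label
            folder.mkdir(parents=True, exist_ok=True)
            for i, text in enumerate(docs):
                file_path = folder / f"{i}.txt"
                file_path.write_text(text, encoding="utf-8")
                created.append(file_path)
    return created


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "data"
    paths = write_folder_dataset(root, toy_corpus)
    relative = sorted(p.relative_to(root).as_posix() for p in paths)

print(relative)
```

Pointing `TextDataLoaders.from_folder` at a folder built this way should then pick up the `train` and `valid` subfolders, assuming the default folder names are used.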


### 9. Data Loading with JSON Files


In [33]:
import json

data_path = '/content/food_items.json'

try:
    with open(data_path, 'r', encoding='utf-8') as file:
        data = json.load(file)

    # Check the data structure
    if isinstance(data, list) and "text" in data[0]:
        text_data = [item["text"] for item in data]
        print("Sample Text data: ", text_data[:3])
    else:
        print("Unexpected JSON structure:", data)

except json.JSONDecodeError as e:
    print("JSON decoding error:", e)
except IndexError:
    print("Index out of range. Ensure your data has enough items.")
except KeyError:
    print("Key 'text' not found in some items. Check your JSON structure.")
except Exception as e:
    print("An error occurred:", e)


Unexpected JSON structure: [{'category': 'Smoothies', 'name': 'Ashunti`Way Smoothie', 'description': 'Fruit n greens, mango bananas, tropical fruit blend, dragon fruit mix, mango, bananas, pineapples, apples, and spinach. Special green with strawberry bananas juice blend . Our fruity tasty smoothies are blended to perfection.', 'price': '5.49 USD'}, {'category': 'Smoothies', 'name': 'Jimmy Jam Smoothie', 'description': 'Berries n kale, strawberries, bananas, blueberries kale, tropical fruit blend, and dragon fruit. Our fruity tasty smoothies are blended to perfection.', 'price': '5.49 USD'}, {'category': 'Smoothies', 'name': 'Tejay Impact Smoothie', 'description': 'Tropical fruit blend, dragon fruit mix, mango, bananas, pineapples, apples, and spinach. Special blue juice blend smoothies.', 'price': '5.49 USD'}, {'category': 'Smoothies', 'name': 'Dayton 500 Smoothie', 'description': 'Tropical fruit blend, dragon fruit mix, mango, bananas, pineapples, apples. Special green juice blend. O
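
The output shows that the records carry `category`, `name`, `description`, and `price` keys but no `text` key. One way to cope with files whose field names vary is to try a list of candidate keys and use the first one present in every record. The sketch below assumes that approach; `extract_field`, its candidate list, and the two abridged records are illustrative only.

```python
import json

# Two records abridged from the output above (descriptions omitted for brevity).
raw = json.dumps([
    {"category": "Smoothies", "name": "Jimmy Jam Smoothie", "price": "5.49 USD"},
    {"category": "Smoothies", "name": "Dayton 500 Smoothie", "price": "5.49 USD"},
])


def extract_field(records, candidates=("text", "description", "name")):
    """Return (key, values) for the first candidate key present in every record."""
    if not isinstance(records, list) or not records:
        raise ValueError("expected a non-empty list of objects")
    for key in candidates:
        if all(isinstance(item, dict) and key in item for item in records):
            return key, [item[key] for item in records]
    raise KeyError(f"none of {candidates} found in every record")


key, values = extract_field(json.loads(raw))
print(key, values)
```

Because the abridged records have neither `text` nor `description`, the helper falls through to `name`. On the full file it would stop at `description`, provided every record has that key.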

In [31]:
import json

data_path='/content/food_items.json'
with open(data_path,'r',encoding='utf-8') as file:
  data=json.load(file)

# Access a specific field. The records in this file have the keys
# 'category', 'name', 'description', and 'price' (no 'text' key),
# so read 'description' here.
text_data=[item["description"] for item in data]
print("Sample Text data: ",text_data[3])