![File handling in Python](https://www.mytechmint.com/wp-content/uploads/2021/10/file-handling-in-python-mytechmint-720x340.jpg)

# **Basic Concepts**




In Python, files can be handled using the `open()` function, which opens a file and returns a file object. The syntax is:

```python
file = open("filename", "mode")
```

- `filename`: the name (and path, if necessary) of the file to open.
- `mode`: specifies how the file will be opened.

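The bare `open()` call leaves closing the file to the caller. The sketch below contrasts an explicit `close()` with a `with` block, which closes the file automatically. It is a minimal illustration only; the file name `demo.txt` and the temporary directory are assumptions made for the example.

```python
import os
import tempfile

# Hypothetical file location used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Manual style: the caller is responsible for closing the file
file = open(path, "w", encoding="utf-8")
try:
    file.write("first line\n")
finally:
    file.close()  # Runs even if write() raises

# Context-manager style: the file is closed when the block exits
with open(path, "r", encoding="utf-8") as file:
    content = file.read()

print(content)      # The text written above
print(file.closed)  # True once the with block has finished
```

The remaining examples in this notebook use the `with` form for that reason.
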
### Common File Modes

Let's look at the modes you'll use in file handling with Python:

1. `"r"` (Read) - Opens a file for reading (default mode).
2. `"rb"` (Read Binary) - Opens a file for reading in binary mode.
3. `"w"` (Write) - Opens a file for writing. Creates a new file if it doesn't exist or truncates the file if it exists.
4. `"wb"` (Write Binary) - Opens a file for writing in binary mode.
5. `"a"` (Append) - Opens a file for appending. Data is added to the end of the file if it exists, or a new file is created if it doesn't.
6. `"r+"` (Read and Write) - Opens a file for both reading and writing. The file pointer is placed at the beginning of the file.
7. `"w+"` (Write and Read) - Opens a file for both writing and reading. Creates a new file if it doesn’t exist or truncates the file if it does.
8. `"a+"` (Append and Read) - Opens a file for both appending and reading. Creates a new file if it doesn't exist; writes always go to the end of the file.

### 1. Read Mode ("r")

In [None]:
# Create a sample file
with open("/content/sample.txt", "w") as file:
    file.write("Hello, this is a sample text file.\nWelcome to file handling in Python!")

# Reading a file in "r" mode
try:
    with open("/content/sample.txt", "r") as file:
        content = file.read()
        print("Content of the file in 'r' mode:")
        print(content)
except FileNotFoundError:
    print("File not found!")

Content of the file in 'r' mode:
Hello, this is a sample text file.
Welcome to file handling in Python!


### 2. Read Binary Mode ("rb")

In [None]:
# Reading a file in "rb" mode
try:
    with open("/content/sample.txt", "rb") as file:
        content = file.read()
        print("Content of the file in 'rb' mode:")
        print(content)
except FileNotFoundError:
    print("File not found!")


Content of the file in 'rb' mode:
b'Hello, this is a sample text file.\nWelcome to file handling in Python!'


### 3. Write Mode ("w")

In [None]:
# Writing to a file in "w" mode (overwrites if file exists)
with open("/content/sample_write.txt", "w") as file:
    file.write("This is a new file, created using 'w' mode.")

# Checking content
with open("/content/sample_write.txt", "r") as file:
    print("Content of the file after 'w' mode write:")
    print(file.read())


Content of the file after 'w' mode write:
This is a new file, created using 'w' mode.


### 4. Write Binary Mode ("wb")

In [None]:
# Writing to a file in binary mode "wb"
data = "Binary data example".encode("utf-8")  # Encoding string to bytes
with open("/content/sample_binary.bin", "wb") as file:
    file.write(data)

# Reading the binary content back
with open("/content/sample_binary.bin", "rb") as file:
    content = file.read()
    print("Binary content of the file written with 'wb' mode:")
    print(content)

Binary content of the file written with 'wb' mode:
b'Binary data example'


### 5. Append Mode ("a")

In [None]:
# Appending to an existing file with "a" mode
with open("/content/sample_write.txt", "a") as file:
    file.write("\nThis line is added using append mode.")

# Reading content to confirm append
with open("/content/sample_write.txt", "r") as file:
    print("Content after appending in 'a' mode:")
    print(file.read())

Content after appending in 'a' mode:
This is a new file, created using 'w' mode.
This line is added using append mode.


### 6. Read and Write Mode ("r+")

In [None]:
# Read and write in "r+" mode
with open("/content/sample_write.txt", "r+") as file:
    content = file.read()
    print("Original Content in 'r+' mode:", content)
    file.write("\nAdding this line with 'r+' mode.")

# Verify content
with open("/content/sample_write.txt", "r") as file:
    print("Content after 'r+' mode operation:")
    print(file.read())

Original Content in 'r+' mode: This is a new file, created using 'w' mode.
This line is added using append mode.
Content after 'r+' mode operation:
This is a new file, created using 'w' mode.
This line is added using append mode.
Adding this line with 'r+' mode.

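In the cell above, the new line ends up at the end of the file only because `read()` had already moved the file pointer there. The sketch below, which uses a hypothetical temporary file, shows what happens when you write in `"r+"` mode without reading first: the write starts at position 0 and overwrites existing characters in place.

```python
import os
import tempfile

# Hypothetical file used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "pointer_demo.txt")

with open(path, "w", encoding="utf-8") as file:
    file.write("Hello, world!")

# Writing immediately in "r+" mode starts at the beginning of the file
with open(path, "r+", encoding="utf-8") as file:
    file.write("HELLO")

with open(path, "r", encoding="utf-8") as file:
    overwritten = file.read()

print(overwritten)  # HELLO, world!
```

The rest of the original text is untouched because `"r+"` does not truncate the file.
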

### 7. Write and Read Mode ("w+")

In [None]:
# Write and read in "w+" mode
with open("/content/sample_w_plus.txt", "w+") as file:
    file.write("This file is created with 'w+' mode.")
    file.seek(0)  # Move pointer to beginning of file to read
    print("Content written and read in 'w+' mode:")
    print(file.read())

Content written and read in 'w+' mode:
This file is created with 'w+' mode.


### 8. Append and Read Mode ("a+")

In [None]:
# Append and read in "a+" mode
with open("/content/sample_write.txt", "a+") as file:
    file.write("\nAppended with 'a+' mode.")
    file.seek(0)  # Move pointer to beginning of file to read all content
    print("Content after 'a+' mode operation:")
    print(file.read())

Content after 'a+' mode operation:
This is a new file, created using 'w' mode.
This line is added using append mode.
Adding this line with 'r+' mode.
Appended with 'a+' mode.


### Summary Table

| Mode | Description |
| --- | --- |
| "r" | Read-only, file must exist |
| "rb" | Read-only in binary mode, file must exist |
| "w" | Write-only, creates/truncates |
| "wb" | Write-only in binary mode, creates/truncates |
| "a" | Append-only, creates if not exists |
| "r+" | Read and write, file must exist |
| "w+" | Write and read, creates/truncates |
| "a+" | Append and read, creates if not exists |

# **Advanced File Handling Concepts in AI and NLP**

**Common File Types in AI and NLP**

1. **Text Files (`.txt`)**: Commonly used for storing raw text data.
2. **CSV Files (`.csv`)**: Used for structured datasets in tabular format.
3. **JSON Files (`.json`)**: Popular for hierarchical data structures, used in many NLP datasets.
4. **Binary Files (e.g., `.pkl` for Pickle)**: Useful for saving preprocessed data, models, and embeddings efficiently.

### **1. Reading Large Text Files Line-by-Line (Streaming Data)**

For massive text corpora, line-by-line reading avoids loading the entire file into memory.

In [None]:
import os

# Function to check for the file and create a dummy if not found
def process_large_text_file(file_path):
    # Check if file exists
    if not os.path.exists(file_path):
        # Create a dummy file if it doesn't exist
        with open(file_path, "w", encoding="utf-8") as file:
            file.write("This is a sample line in the dummy file.\nAnother sample line.\n")
        print(f"Dummy file created at: {file_path}")

    # Processing large text file line-by-line
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            # Simulate tokenizing the line
            tokens = line.strip().split()
            # Process tokens (e.g., store or use in model pipeline)
            print(tokens[:5])  # Displaying first 5 tokens as an example

# Test the function
process_large_text_file("/content/large_text_corpus.txt")

['This', 'is', 'a', 'sample', 'line']
['Another', 'sample', 'line.']


This approach is memory efficient and ideal for tokenizing or embedding large datasets.

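The same streaming idea can be wrapped in a generator so that downstream steps never hold more than one line in memory. The sketch below counts token frequencies this way; `iter_tokens` and the tiny corpus file are hypothetical names introduced for illustration, and whitespace splitting stands in for a real tokenizer.

```python
import os
import tempfile
from collections import Counter

def iter_tokens(file_path):
    """Yield whitespace-separated tokens, reading one line at a time."""
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            yield from line.split()

# Build a tiny stand-in corpus so the example is self-contained
corpus_path = os.path.join(tempfile.mkdtemp(), "tiny_corpus.txt")
with open(corpus_path, "w", encoding="utf-8") as file:
    file.write("the cat sat\nthe dog sat\n")

# Counter consumes the generator lazily; the full token list is never built
token_counts = Counter(iter_tokens(corpus_path))
print(token_counts.most_common(2))  # [('the', 2), ('sat', 2)]
```
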
### **2. Working with JSON Files for Dataset Structures**

Many NLP datasets, including some distributed through Hugging Face's `datasets` library, are stored in JSON Lines format, where each line is a separate JSON object representing one document or data entry.

In [None]:
import os
import json

# Function to check for the file and create a dummy JSON file if not found
def load_json_dataset(file_path):
    # Check if file exists
    if not os.path.exists(file_path):
        # Create a dummy JSON file with sample data if it doesn't exist
        sample_data = [
            {"text": "This is a sample entry 1", "label": "example"},
            {"text": "This is a sample entry 2", "label": "example"}
        ]
        with open(file_path, "w", encoding="utf-8") as file:
            for entry in sample_data:
                json.dump(entry, file)
                file.write("\n")
        print(f"Dummy JSON file created at: {file_path}")

    # Reading JSON file with NLP data
    with open(file_path, "r", encoding="utf-8") as file:
        data = [json.loads(line) for line in file]
    return data

# Test the function
dataset = load_json_dataset("/content/nlp_dataset.json")
print("First entry:", dataset[0])

First entry: {'text': 'This is a sample entry 1', 'label': 'example'}

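For files too large to load as a list, the records can be yielded one at a time instead. The following is a minimal sketch under that assumption; `iter_jsonl` and the sample file are hypothetical, and blank lines are skipped so that a stray empty line does not raise a `json.JSONDecodeError`.

```python
import json
import os
import tempfile

def iter_jsonl(file_path):
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            line = line.strip()
            if line:  # Skip blank lines instead of failing on them
                yield json.loads(line)

# Small sample file with a blank line in the middle
jsonl_path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(jsonl_path, "w", encoding="utf-8") as file:
    file.write('{"text": "first", "label": "a"}\n\n{"text": "second", "label": "b"}\n')

records = list(iter_jsonl(jsonl_path))
print(len(records))  # 2
```
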

### **3. Writing Model Outputs and Metadata in JSON Format**

After processing text data, you may need to save outputs or metadata. Writing to JSON allows flexible storage of structured data.

In [None]:
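import json

# Hypothetical model outputs and run metadata, used only for illustration
results = {
    "metadata": {"model": "example-model", "num_examples": 2},
    "predictions": [
        {"text": "This is a sample entry 1", "label": "example"},
        {"text": "This is a sample entry 2", "label": "example"},
    ],
}

# Write the results as a single, human-readable JSON document
def save_json(data, file_path):
    with open(file_path, "w", encoding="utf-8") as file:
        json.dump(data, file, indent=2, ensure_ascii=False)

save_json(results, "/content/model_outputs.json")

# Read the file back to confirm the round trip
with open("/content/model_outputs.json", "r", encoding="utf-8") as file:
    print("Saved metadata:", json.load(file)["metadata"])

Saved metadata: {'model': 'example-model', 'num_examples': 2}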


### **4. Working with CSV Files for Structured Tabular Data**

CSV files are ideal for tabular data often found in NLP tasks, like classification labels or sentence pairs.

In [None]:
import os
import csv

# Function to check for the file and create a dummy CSV if not found
def load_csv_dataset(file_path):
    # Check if file exists
    if not os.path.exists(file_path):
        # Create a dummy CSV file with sample data if it doesn't exist
        with open(file_path, mode="w", encoding="utf-8", newline='') as file:
            writer = csv.DictWriter(file, fieldnames=["text", "label"])
            writer.writeheader()  # Write the header
            writer.writerow({"text": "This is a sample text 1", "label": "example"})
            writer.writerow({"text": "This is a sample text 2", "label": "example"})
        print(f"Dummy CSV file created at: {file_path}")

    # Load the CSV data (newline="" lets the csv module handle line endings)
    with open(file_path, mode="r", encoding="utf-8", newline="") as file:
        reader = csv.DictReader(file)
        data = list(reader)
    return data

# Test the function
csv_data = load_csv_dataset("/content/classification_data.csv")
print("First row:", csv_data[0])

First row: {'text': 'This is a sample text 1', 'label': 'example'}

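Note that `csv.DictReader` returns every value as a string, so numeric columns need an explicit conversion. The sketch below assumes a hypothetical `score` column to show the idea; the file name and column names are not part of the dataset above.

```python
import csv
import os
import tempfile

# Hypothetical CSV with a numeric column, created for this illustration
csv_path = os.path.join(tempfile.mkdtemp(), "scores.csv")
with open(csv_path, "w", encoding="utf-8", newline="") as file:
    writer = csv.DictWriter(file, fieldnames=["text", "score"])
    writer.writeheader()
    writer.writerow({"text": "good movie", "score": 0.9})
    writer.writerow({"text": "bad movie", "score": 0.1})

# Convert the score from str to float while reading
with open(csv_path, "r", encoding="utf-8", newline="") as file:
    rows = [
        {"text": row["text"], "score": float(row["score"])}
        for row in csv.DictReader(file)
    ]

print(rows[0])  # {'text': 'good movie', 'score': 0.9}
```
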

### **5. Handling Binary Files for Preprocessed Data and Embeddings**

Binary files are frequently used to save preprocessed data (like tokenized inputs) and embeddings. The `pickle` library is useful here for serialization. Only unpickle files you trust, because `pickle.load` can execute arbitrary code while loading.

In [None]:
import os
import pickle

# Saving and loading preprocessed embeddings
def save_embeddings(data, file_path):
    with open(file_path, "wb") as file:
        pickle.dump(data, file)

def load_embeddings(file_path):
    # Check if file exists
    if not os.path.exists(file_path):
        # Create a dummy pickle file with sample embeddings if it doesn't exist
        dummy_data = {"example_word": [0.0, 0.1, 0.2]}
        with open(file_path, "wb") as file:
            pickle.dump(dummy_data, file)
        print(f"Dummy pickle file created at: {file_path}")

    # Proceed with loading the embeddings
    with open(file_path, "rb") as file:
        data = pickle.load(file)
    return data

# Example usage
sample_embeddings = {"word": [0.1, 0.2, 0.3]}
save_embeddings(sample_embeddings, "/content/embeddings.pkl")
loaded_embeddings = load_embeddings("/content/embeddings.pkl")
print("Loaded embeddings:", loaded_embeddings)

Loaded embeddings: {'word': [0.1, 0.2, 0.3]}


### **6. File Management for Model Checkpoints and Artifacts**

For larger AI workflows, managing checkpoints and output files is critical. This can be automated with the `os` and `shutil` libraries.

In [None]:
import os
import shutil
import pickle

# Saving a model checkpoint
def save_checkpoint(checkpoint, directory="checkpoints", filename="model.ckpt"):
    # Ensure the directory exists (no error if it is already there)
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    # Save the checkpoint
    with open(path, "wb") as file:
        pickle.dump(checkpoint, file)

# Loading the checkpoint
def load_checkpoint(directory="checkpoints", filename="model.ckpt"):
    path = os.path.join(directory, filename)
    # Check if the checkpoint file exists
    if not os.path.exists(path):
        # Create a dummy checkpoint file with sample data if it doesn't exist
        dummy_checkpoint = {"epoch": 0, "accuracy": 0.0}
        save_checkpoint(dummy_checkpoint, directory, filename)
        print(f"Dummy checkpoint created at: {path}")

    # Load the checkpoint
    with open(path, "rb") as file:
        checkpoint = pickle.load(file)
    return checkpoint

# Example checkpoint data
checkpoint_data = {"epoch": 10, "accuracy": 0.85}
save_checkpoint(checkpoint_data)
loaded_checkpoint = load_checkpoint()
print("Loaded checkpoint:", loaded_checkpoint)

Loaded checkpoint: {'epoch': 10, 'accuracy': 0.85}

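The cell above relies on `os` alone. As a complementary sketch, `shutil` can copy and clean up checkpoint files. The directory layout and the `best_model.ckpt` name below are assumptions chosen for illustration, not a convention of any particular framework.

```python
import os
import pickle
import shutil
import tempfile

# Hypothetical working directory for this illustration
work_dir = tempfile.mkdtemp()
checkpoint_dir = os.path.join(work_dir, "checkpoints")
os.makedirs(checkpoint_dir, exist_ok=True)

# Save the latest checkpoint
latest_path = os.path.join(checkpoint_dir, "model.ckpt")
with open(latest_path, "wb") as file:
    pickle.dump({"epoch": 10, "accuracy": 0.85}, file)

# Keep a separate copy of the best checkpoint so later saves do not overwrite it
best_path = os.path.join(checkpoint_dir, "best_model.ckpt")
shutil.copy2(latest_path, best_path)

saved_files = sorted(os.listdir(checkpoint_dir))
print(saved_files)  # ['best_model.ckpt', 'model.ckpt']

# Remove the whole working directory once the artifacts are no longer needed
shutil.rmtree(work_dir)
```
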

### Summary of Advanced File Handling Techniques

| Mode/Technique | Description |
| --- | --- |
| **Line-by-line reading** | Efficient reading for large corpora without memory overload. |
| **JSON handling** | Structured data format commonly used in NLP datasets. |
| **CSV handling** | Useful for tabular data like classification labels or token metadata. |
| **Binary (Pickle)** | Fast serialization for saving model weights, embeddings, and checkpoints. |
| **File management** | Using `os` and `shutil` for organizing checkpoints and outputs in larger AI workflows. |

In [1]:
pip install yfinance pandas numpy



In [2]:
import yfinance as yf
import pandas as pd
import numpy as np

# Parameters for the moving average strategy
SHORT_WINDOW = 40  # Short moving average period
LONG_WINDOW = 100  # Long moving average period
INITIAL_CAPITAL = 10000  # Starting capital in dollars

# Fetch stock data
def fetch_data(ticker, start_date, end_date):
    data = yf.download(ticker, start=start_date, end=end_date)
    # Some yfinance versions return MultiIndex columns even for one ticker;
    # flatten them so data['Close'] is a Series rather than a DataFrame
    if isinstance(data.columns, pd.MultiIndex):
        data.columns = data.columns.get_level_values(0)
    data['Short_MA'] = data['Close'].rolling(window=SHORT_WINDOW, min_periods=1).mean()
    data['Long_MA'] = data['Close'].rolling(window=LONG_WINDOW, min_periods=1).mean()
    return data

# Generate trading signals
def generate_signals(data):
    data['Signal'] = 0.0
    # Assign through .loc to avoid the chained-assignment warning
    data.loc[data.index[SHORT_WINDOW:], 'Signal'] = np.where(
        data['Short_MA'].iloc[SHORT_WINDOW:] > data['Long_MA'].iloc[SHORT_WINDOW:], 1.0, 0.0
    )
    data['Position'] = data['Signal'].diff()
    return data

# Simulate trading with the strategy
def simulate_trading(data):
    cash = INITIAL_CAPITAL
    holdings = 0

    for index, row in data.iterrows():
        if row['Position'] == 1:  # Buy signal
            holdings = cash / row['Close']
            cash = 0
            print(f"Buy {row['Close']} on {index}")
        elif row['Position'] == -1:  # Sell signal
            cash = holdings * row['Close']
            holdings = 0
            print(f"Sell {row['Close']} on {index}")

    # Value any remaining shares at the last closing price
    final_value = cash + (holdings * data['Close'].iloc[-1])
    print(f"Final Portfolio Value: ${final_value:.2f}")

# Main execution
if __name__ == "__main__":
    ticker = "AAPL"  # Apple stock for example
    start_date = "2020-01-01"
    end_date = "2023-01-01"

    data = fetch_data(ticker, start_date, end_date)
    data = generate_signals(data)
    simulate_trading(data)

[*********************100%***********************]  1 of 1 completed