<a href="https://colab.research.google.com/github/meekmarcelin/chartbot/blob/main/chartbot_marcel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

CHARTBOT


Install Necessary Packages
These commands install the packages used for data handling (datasets), model training (torch, transformers, accelerate), and text processing (nltk, beautifulsoup4), along with flask and requests.

In [None]:
!pip install datasets
!pip install torch transformers flask beautifulsoup4 requests nltk
!pip install transformers[torch]
!pip install accelerate -U




Mount Google Drive
Mount Google Drive to access files stored in your Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Import Libraries
Import necessary libraries for data handling, text preprocessing, and model training.

In [None]:
import pandas as pd
import re
from transformers import BertTokenizer, BertForMaskedLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from torch.utils.data import Dataset


Install Additional Packages
Install more packages including streamlit for creating the web application.

In [None]:
!pip install datasets torch transformers accelerate streamlit




Re-import Libraries (for Streamlit)
Re-import the libraries, now including streamlit and accelerate, for building and deploying the web application.


In [None]:
import pandas as pd
import re
import torch
from transformers import BertTokenizer, BertForMaskedLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from torch.utils.data import Dataset
from accelerate import Accelerator
import streamlit as st


Sample Data and Preprocessing
Define sample data and clean the text by lowercasing it and removing every character that is not a letter, digit, or whitespace. This also deletes hyphens, so "AI-powered" becomes "aipowered".

In [None]:
# Sample data and preprocessing
data = {
    "questions": [
        "What role can AI play in monitoring and predicting crop diseases?",
        "How can AI-powered monitoring systems help in early detection and prevention of crop pests and diseases?",
        "Express the significance of geospatial technologies in monitoring and predicting crop diseases and pests.",
        "Improving Crop Health: Understanding the Interaction Mechanisms Between Crops and Their Pathogens"
    ],
    "answers": [
        "AI can analyze large datasets from various sources to predict potential outbreaks, monitor crop health, and suggest timely interventions to prevent widespread disease.",
        "AI-powered systems use sensors and machine learning algorithms to detect early signs of pests and diseases, enabling farmers to take preventive measures before the issues become severe.",
        "Geospatial technologies like GIS and remote sensing help in monitoring crop health and predicting disease outbreaks by analyzing spatial data and environmental conditions.",
        "Understanding the mechanisms between crops and pathogens can lead to the development of resistant crop varieties and more effective disease management strategies."
    ]
}

df = pd.DataFrame(data)

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text

df['questions'] = df['questions'].apply(clean_text)
df['answers'] = df['answers'].apply(clean_text)
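
To make the effect of the cleaning step concrete, the sketch below runs clean_text on one of the sample questions. The second function, clean_text_keep_boundaries, is a hypothetical variant that is not part of the notebook: it replaces punctuation with a space so that hyphenated words stay separate.

```python
import re

def clean_text(text):
    # Same cleaning as in the notebook: lowercase, then drop punctuation
    text = text.lower()
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

def clean_text_keep_boundaries(text):
    # Hypothetical variant: turn punctuation into spaces, then collapse whitespace
    text = re.sub(r'[^a-z0-9\s]', ' ', text.lower())
    return re.sub(r'\s+', ' ', text).strip()

sample = "How can AI-powered monitoring systems help?"
original_result = clean_text(sample)
variant_result = clean_text_keep_boundaries(sample)
print(original_result)  # how can aipowered monitoring systems help
print(variant_result)   # how can ai powered monitoring systems help
```

Which form is preferable depends on the tokenizer; the variant is only meant to show the difference.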


Load Tokenizer and Model
Load the pre-trained BERT tokenizer and model from Hugging Face.

In [None]:
# Define the BERT model name
model_name = "bert-base-uncased"

# Load tokenizer and model
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

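Define TextDataset
Create a custom dataset class that tokenizes each question and answer, padding or truncating both to 512 tokens. The question supplies input_ids and attention_mask, and the answer supplies labels.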

In [None]:
# Define TextDataset
class TextDataset(Dataset):
    def __init__(self, tokenizer, input_texts, output_texts):
        self.tokenizer = tokenizer
        self.input_texts = input_texts
        self.output_texts = output_texts

    def __len__(self):
        return len(self.input_texts)

    def __getitem__(self, idx):
        input_encoding = self.tokenizer(self.input_texts[idx], truncation=True, padding="max_length", max_length=512)
        output_encoding = self.tokenizer(self.output_texts[idx], truncation=True, padding="max_length", max_length=512)
        return {
            'input_ids': input_encoding['input_ids'],
            'attention_mask': input_encoding['attention_mask'],
            'labels': output_encoding['input_ids']
        }

# Convert DataFrame to TextDataset
input_texts = df['questions'].tolist()
output_texts = df['answers'].tolist()
text_dataset = TextDataset(tokenizer, input_texts, output_texts)
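
The tokenizer arguments truncation=True, padding="max_length", and max_length=512 give every example the same length. The sketch below imitates that behaviour on plain lists of made-up token ids. The helper pad_or_truncate and the pad id 0 are assumptions for illustration, and the sketch ignores special tokens such as [CLS] and [SEP], which the real tokenizer handles.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    # Hypothetical helper mirroring truncation=True, padding="max_length"
    ids = list(token_ids)[:max_length]      # cut sequences that are too long
    attention_mask = [1] * len(ids)         # 1 marks a real token
    padding = max_length - len(ids)         # number of pad positions needed
    return {
        "input_ids": ids + [pad_id] * padding,
        "attention_mask": attention_mask + [0] * padding,  # 0 marks padding
    }

short = pad_or_truncate([5, 6, 7, 8], max_length=8)
long = pad_or_truncate(range(1, 13), max_length=8)
print(short)
print(long)
```

With only short questions and answers, most of the 512 positions are padding, so a smaller max_length would likely reduce memory use and training time.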


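Initialize DataCollator
Initialize a data collator for masked language modeling. For each batch it selects each non-special input token with probability 0.15 and builds the labels from the original input_ids, replacing any labels field supplied by the dataset. Because TextDataset already pads every example to 512 tokens, the collator does no further padding.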

In [None]:
# Initialize DataCollator
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
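
The masking step can be sketched in plain Python. This is a simplified illustration of the standard BERT rule (of the selected tokens, 80% become [MASK], 10% become a random token, 10% stay unchanged), not the library's implementation. The function mask_tokens and the ids used in the example are assumptions chosen to resemble bert-base-uncased.

```python
import random

IGNORE_INDEX = -100  # label value that the loss function skips

def mask_tokens(input_ids, special_ids, mask_id, vocab_size,
                mlm_probability=0.15, seed=0):
    # Hypothetical helper: returns (masked_input_ids, labels)
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, tok in enumerate(input_ids):
        # Never mask special tokens; select other tokens with mlm_probability
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok                  # the model must recover the original
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id          # replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # replace with a random token
        # otherwise keep the original token in place
    return masked, labels

# Assumed ids: 0 = padding, 101 = start, 102 = separator, 103 = mask
ids = [101] + list(range(1000, 1020)) + [102] + [0] * 4
masked, labels = mask_tokens(ids, special_ids={0, 101, 102},
                             mask_id=103, vocab_size=30522)
print(masked)
print(labels)
```

Note that the labels come from the inputs themselves, so in this setup the model learns to fill in words of the questions.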


Define TrainingArguments and Trainer
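Configure training arguments for mixed precision training and define the trainer to handle the training process. This cell repeats the data, model, dataset, and collator setup from the earlier cells so that it can run on its own. Setting fp16=True expects a GPU runtime.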

In [None]:
from transformers import BertTokenizer, BertForMaskedLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from torch.utils.data import Dataset
import pandas as pd
import re

# Sample data and preprocessing
data = {
    "questions": [
        "What role can AI play in monitoring and predicting crop diseases?",
        "How can AI-powered monitoring systems help in early detection and prevention of crop pests and diseases?",
        "Express the significance of geospatial technologies in monitoring and predicting crop diseases and pests.",
        "Improving Crop Health: Understanding the Interaction Mechanisms Between Crops and Their Pathogens"
    ],
    "answers": [
        "AI can analyze large datasets from various sources to predict potential outbreaks, monitor crop health, and suggest timely interventions to prevent widespread disease.",
        "AI-powered systems use sensors and machine learning algorithms to detect early signs of pests and diseases, enabling farmers to take preventive measures before the issues become severe.",
        "Geospatial technologies like GIS and remote sensing help in monitoring crop health and predicting disease outbreaks by analyzing spatial data and environmental conditions.",
        "Understanding the mechanisms between crops and pathogens can lead to the development of resistant crop varieties and more effective disease management strategies."
    ]
}

df = pd.DataFrame(data)

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text

df['questions'] = df['questions'].apply(clean_text)
df['answers'] = df['answers'].apply(clean_text)

# Define the BERT model name
model_name = "bert-base-uncased"

# Load tokenizer and model
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)

# Define TextDataset
class TextDataset(Dataset):
    def __init__(self, tokenizer, input_texts, output_texts):
        self.tokenizer = tokenizer
        self.input_texts = input_texts
        self.output_texts = output_texts

    def __len__(self):
        return len(self.input_texts)

    def __getitem__(self, idx):
        input_encoding = self.tokenizer(self.input_texts[idx], truncation=True, padding="max_length", max_length=512)
        output_encoding = self.tokenizer(self.output_texts[idx], truncation=True, padding="max_length", max_length=512)
        return {
            'input_ids': input_encoding['input_ids'],
            'attention_mask': input_encoding['attention_mask'],
            'labels': output_encoding['input_ids']
        }

# Convert DataFrame to TextDataset
input_texts = df['questions'].tolist()
output_texts = df['answers'].tolist()
text_dataset = TextDataset(tokenizer, input_texts, output_texts)

# Initialize DataCollator
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

# Adjusting TrainingArguments for mixed precision training
training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=10,
    per_device_train_batch_size=1,  # Keep small if GPU memory is limited
    gradient_accumulation_steps=8,  # Accumulate gradients over several batches to emulate a larger batch without extra memory
    save_steps=10_000,
    save_total_limit=2,
    fp16=True,  # Enable mixed precision training
)

# Define Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=text_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Start training
trainer.train()


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Step,Training Loss


TrainOutput(global_step=10, training_loss=2.619709014892578, metrics={'train_runtime': 317.335, 'train_samples_per_second': 0.126, 'train_steps_per_second': 0.032, 'total_flos': 10528192512000.0, 'train_loss': 2.619709014892578, 'epoch': 10.0})
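
The figures in the TrainOutput above can be reproduced from the training arguments. The sketch below assumes a single device and assumes that at least one optimizer step is taken per epoch even when an epoch has fewer batches than gradient_accumulation_steps. Both assumptions are consistent with the reported global_step=10.

```python
import math

num_samples = 4                    # rows in the sample DataFrame
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
num_train_epochs = 10
train_runtime = 317.335            # seconds, taken from the TrainOutput

batches_per_epoch = math.ceil(num_samples / per_device_train_batch_size)
# Assumption: never fewer than one optimizer step per epoch
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_steps = updates_per_epoch * num_train_epochs

samples_per_second = num_samples * num_train_epochs / train_runtime
steps_per_second = total_steps / train_runtime
print(total_steps, round(samples_per_second, 3), round(steps_per_second, 3))
# 10 0.126 0.032
```

With only four batches per epoch, each optimizer step here accumulates four examples rather than eight.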

Save the Model and Tokenizer
Save the trained model and tokenizer to the specified directory.

In [None]:
# Save the model and tokenizer
model.save_pretrained("./results")
tokenizer.save_pretrained("./results")


('./results/tokenizer_config.json',
 './results/special_tokens_map.json',
 './results/vocab.txt',
 './results/added_tokens.json')
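
Before loading the model elsewhere, it can help to confirm that the directory contains the files you expect. The helper below is a hypothetical convenience, not part of the notebook, and the default file list is an assumption based on the tokenizer output above plus the model's config.json.

```python
from pathlib import Path

def missing_files(directory, expected=("config.json", "vocab.txt",
                                       "tokenizer_config.json",
                                       "special_tokens_map.json")):
    # Return the expected file names that are not present in the directory
    folder = Path(directory)
    return [name for name in expected if not (folder / name).is_file()]

missing = missing_files("./results")
print(missing)  # an empty list means every expected file was found
```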

Install Streamlit
Install Streamlit, a Python framework for building web apps for data science.

In [None]:
!pip install streamlit





Install ngrok
Install the ngrok Python package alongside Streamlit. The pyngrok wrapper that opens the tunnel is installed and upgraded in a later cell.

In [None]:
!pip install streamlit ngrok


Collecting ngrok
  Downloading ngrok-1.3.0-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.9 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.9/2.9 MB[0m [31m11.5 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: ngrok
Successfully installed ngrok-1.3.0


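Write the Streamlit App
Write the Streamlit application to a file named app.py. The app loads the trained model and tokenizer from ./results and provides a text box for entering a question. Because the model is a masked language model, the text it displays is the most likely token at each position of the question, which tends to echo the question rather than produce a free-form answer.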

In [None]:
# Write the Streamlit app to a file
with open('app.py', 'w') as f:
    f.write('''
import streamlit as st
from transformers import BertTokenizer, BertForMaskedLM
import torch

# Load the trained model and tokenizer
model = BertForMaskedLM.from_pretrained("./results")
tokenizer = BertTokenizer.from_pretrained("./results")

st.title("AI-Powered Crop Monitoring Q&A")

st.write("Enter a question related to crop monitoring and AI:")

question = st.text_input("Question")

if st.button("Get Answer"):
    inputs = tokenizer(question, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Decode the answer
    predicted_token_ids = torch.argmax(outputs.logits, dim=-1)
    answer = tokenizer.decode(predicted_token_ids[0], skip_special_tokens=True)

    st.write("Answer:")
    st.write(answer)
''')
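
The decoding step in the app takes the argmax of the logits at every position. The toy sketch below shows the same idea with a four-word vocabulary and hand-written scores. The vocabulary, the scores, and the helper argmax_decode are all made up for illustration.

```python
def argmax_decode(logits, vocab):
    # Pick the highest-scoring vocabulary entry at every position
    return [vocab[max(range(len(row)), key=row.__getitem__)] for row in logits]

vocab = ["crop", "disease", "ai", "monitor"]
logits = [
    [0.1, 0.2, 2.5, 0.0],  # position 0: "ai" scores highest
    [0.3, 0.1, 0.2, 1.9],  # position 1: "monitor" scores highest
    [3.0, 0.4, 0.1, 0.2],  # position 2: "crop" scores highest
]
decoded = argmax_decode(logits, vocab)
print(" ".join(decoded))  # ai monitor crop
```

The output always has exactly one token per input position, so its length is tied to the length of the question.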


Start Streamlit App with Ngrok
Use ngrok to create a tunnel to localhost and start the Streamlit app, making it accessible via a public URL.

In [None]:
!pip install --upgrade pyngrok

import subprocess
from pyngrok import ngrok

# Authenticate ngrok
ngrok.set_auth_token("YOUR_NGROK_AUTH_TOKEN")  # Replace with your own token; do not commit a real token

# Start ngrok tunnel, explicitly specifying the port using the 'addr' parameter
tunnel = ngrok.connect(addr='8501')
public_url = tunnel.public_url
print(f"Streamlit app is running on: {public_url}")

# Start the Streamlit app
streamlit_process = subprocess.Popen(['streamlit', 'run', 'app.py'])

Streamlit app is running on: https://bb86-35-245-14-76.ngrok-free.app
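
One way to keep the ngrok token out of the notebook is to read it from an environment variable. The sketch below is a suggestion rather than part of the original notebook. The helper get_ngrok_token and the variable name NGROK_AUTH_TOKEN are assumptions.

```python
import os

def get_ngrok_token(env=os.environ, var_name="NGROK_AUTH_TOKEN"):
    # Hypothetical helper: fetch the token or fail with a clear message
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {var_name} environment variable before starting the tunnel"
        )
    return token

# Example with an explicit mapping instead of the real environment
token = get_ngrok_token({"NGROK_AUTH_TOKEN": "example-token"})
print(len(token))
```

In the notebook, the result could then be passed to ngrok.set_auth_token(get_ngrok_token()).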
