
# Step 1: Install and Import Dependencies
## Ensure all necessary libraries are installed and imported for the project.
### This code installs all the required Python packages using pip and imports them into the script. Libraries like streamlit are used for the UI, sentence-transformers for embeddings, and pinecone-client for vector database operations.

In [39]:
# Install required libraries (if not already installed)
!pip install streamlit
!pip install sentence-transformers
!pip install pinecone-client
!pip install nltk
!pip install gspread
!pip install google-auth
# uuid ships with the Python standard library, so it needs no pip install

# Import libraries
import streamlit as st
from sentence_transformers import SentenceTransformer
import pinecone
import nltk
import uuid
import json
import datetime
import gspread
from google.oauth2.service_account import Credentials
from scipy.spatial.distance import cosine



# Step 2: Prepare the Knowledge Base

## 2.1 Load and Chunk the Documents
### Summary: Load the custom prompt and archetype playbooks, then chunk them into smaller pieces for efficient embedding and retrieval.
### Explanation: This code processes each document, chunks it, and prepares a list of chunks with unique IDs and metadata, including the archetype name and chunk identifiers.


In [40]:
import os
from nltk.tokenize import sent_tokenize

import nltk # Make sure to import the nltk module

nltk.download('punkt')  # Download NLTK data files if not already present

def load_documents(folder_path):
    documents = {}
    for filename in os.listdir(folder_path):
        if filename.endswith('.txt'):
            archetype_name = filename.replace('.txt', '')
            with open(os.path.join(folder_path, filename), 'r') as file:
                documents[archetype_name] = file.read()
    return documents

def chunk_document(text, max_words=300):
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = ''
    word_count = 0

    for sentence in sentences:
        words_in_sentence = len(sentence.split())
        if word_count + words_in_sentence <= max_words:
            current_chunk += ' ' + sentence
            word_count += words_in_sentence
        else:
            if current_chunk:  # skip the empty chunk produced when the first sentence exceeds max_words
                chunks.append(current_chunk.strip())
            current_chunk = sentence
            word_count = words_in_sentence
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
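
To see how the word budget shapes the output, here is a minimal, dependency-free sketch of the same greedy strategy. The regex splitter `split_sentences` and the helper `chunk_by_words` are illustrative stand-ins (they are not part of this notebook) for `sent_tokenize` and `chunk_document`, and the sample text is made up.

```python
import re

def split_sentences(text):
    # Naive splitter on ., ! or ? followed by whitespace (stand-in for sent_tokenize)
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def chunk_by_words(text, max_words=300):
    # Greedily pack whole sentences into chunks of at most max_words words
    chunks, current, count = [], [], 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(' '.join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(' '.join(current))
    return chunks

sample = "The Hero seeks mastery. The Sage seeks truth. The Jester seeks joy."
demo_chunks = chunk_by_words(sample, max_words=8)
print(demo_chunks)
```

With an 8-word budget, the first two 4-word sentences share a chunk and the third starts a new one. A sentence longer than the budget still becomes its own chunk rather than being split.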


# 2.2 Prepare Chunks with Metadata
Summary: Associate each chunk with relevant metadata for efficient retrieval.

In [41]:
def prepare_chunks(documents):
    all_chunks = []
    for archetype_name, text in documents.items():
        chunks = chunk_document(text)
        for idx, chunk in enumerate(chunks):
            chunk_data = {
                'id': f"{archetype_name}_{idx}",
                'text': chunk,
                'metadata': {
                    'archetype': archetype_name,
                    'chunk_id': idx,
                    'total_chunks': len(chunks)
                }
            }
            all_chunks.append(chunk_data)
    return all_chunks

# Example usage
folder_path = '/content/knowledge_base'
documents = load_documents(folder_path)
all_chunks = prepare_chunks(documents)


# Step 3: Set Up Pinecone Indexing
### Summary: Initialize Pinecone and create an index to store embeddings.
### Explanation: This code initializes Pinecone with your API key, checks if the specified index exists, and creates it if necessary. The dimension parameter should match the embedding size of the model used.

# 3.1 Access Environment Variables

### Summary: Retrieve API keys and other sensitive information from Colab’s Secrets store.

### Explanation: This code reads the OpenAI and Pinecone credentials with userdata.get(), which returns values saved through Colab’s Secrets panel. Avoid hardcoding sensitive information directly in the script to maintain security.

In [14]:
# Step 3: Initialize and Configure Pinecone (Corrected)

# Install the latest Pinecone client
!pip install --upgrade pinecone-client

import os
from pinecone import Pinecone, ServerlessSpec
from google.colab import userdata

# Access secrets set via Colab's UI
openai_api_key = userdata.get('OPENAI_API_KEY')      # Use userdata.get() to access secrets
pinecone_api_key = userdata.get('PINECONE_API_KEY')  # Use userdata.get() to access secrets
pinecone_env = userdata.get('PINECONE_ENVIRONMENT')  # Use userdata.get() to access secrets

# Verify that the API keys are set
if all([openai_api_key, pinecone_api_key, pinecone_env]):
    print("All API keys are set successfully!")
else:
    print("Error: One or more API keys are missing.")


All API keys are set successfully!
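
If the same code also has to run outside Colab, one option is a small lookup helper that falls back to OS environment variables. This is only a sketch: `get_secret` and `missing_secrets` are hypothetical helper names, and the fallback behaviour is an assumption about how you might want to deploy the app.

```python
import os

def get_secret(name):
    """Return a secret from Colab's store when available, else from the environment."""
    try:
        from google.colab import userdata  # importable only inside Colab
    except ImportError:
        return os.environ.get(name)
    return userdata.get(name)

def missing_secrets(names):
    """List the secret names that resolve to an empty or missing value."""
    return [name for name in names if not get_secret(name)]

os.environ['DEMO_API_KEY'] = 'example-value'  # demo value, not a real key
print(missing_secrets(['DEMO_API_KEY', 'DEMO_ABSENT_KEY']))
```

Reporting which names are missing makes the failure easier to act on than a single "one or more keys are missing" message.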


# 3.2 Initialize Pinecone Client and Create Index

### Summary: Initialize the Pinecone client with the latest API method and create an index if it doesn’t already exist.
### Explanation:

	•	Initialization: Uses the Pinecone class from the current pinecone-client package to initialize the client with your API key and environment.
	•	Index Creation: Checks if the specified index (knowledge-base) exists. If not, it creates a serverless index with a dimension of 384, which matches the embedding size of the all-MiniLM-L6-v2 model.
	•	Connection: Establishes a connection to the Pinecone index for subsequent operations like upserting and querying vectors.

***Note: feed this updated code cell back to GPT to update the coding information available to the model.***

In [42]:
# Initialize Pinecone with your API key and environment
pc = Pinecone(
    api_key=pinecone_api_key,    # Your Pinecone API key
    environment=pinecone_env      # Your Pinecone environment (e.g., 'us-east1-gcp')
)

# Check existing indexes
existing_indexes = pc.list_indexes().names()  # Correctly call the 'names' method
print(f"Existing Pinecone indexes: {existing_indexes}")

# Define your index name
index_name = 'knowledge-base'

# Create a new index if it doesn't exist
if index_name not in existing_indexes:
    pc.create_index(
        name=index_name,
        dimension=384,  # embedding size of the all-MiniLM-L6-v2 model
        metric='cosine',
        spec=ServerlessSpec(
            cloud='aws',        # Choose your cloud provider ('aws', 'gcp', etc.)
            region='us-east-1'   # Choose the appropriate region
        )
    )
    print(f"Created Pinecone index: {index_name}")
else:
    print(f"Pinecone index '{index_name}' already exists.")

# Connect to the index
index = pc.Index(index_name)

Existing Pinecone indexes: ['knowledge-base', 'archetype-playbook-index', 'archetype-playbooks', 'archetype-index', 'archetype-playbooks-index']
Pinecone index 'knowledge-base' already exists.


# Step 4: Develop the Embedding and Retrieval Module

# 4.1 Initialize the Embedding Model

### Summary: Load the all-MiniLM-L6-v2 model for generating text embeddings.

In [43]:
!pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# Load the embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')



# Explanation:
This line loads the pre-trained Sentence Transformer model into memory for generating embeddings. The all-MiniLM-L6-v2 model produces 384-dimensional vectors and offers a good balance between embedding quality and speed, making it suitable for real-time applications.



---

# 4.2 Embed and Index the Chunks

Summary: Generate embeddings for each chunk and upsert them into the Pinecone index.

In [46]:
# Load the embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

def index_chunks(chunks):
    """
    Generate embeddings for each chunk and upsert them into Pinecone.

    Args:
        chunks (list): List of chunk dictionaries with 'id', 'text', and 'metadata'.
    """
    # Prepare a list of tuples for upsert
    upsert_data = []
    for chunk in chunks:
        embedding = model.encode(chunk['text']).tolist()
        upsert_data.append((chunk['id'], embedding, chunk['metadata']))

    # Upsert all chunks in bulk for efficiency
    # Check if upsert_data is not empty before attempting to upsert
    if upsert_data:
        index.upsert(vectors=upsert_data)

# Index the chunks
index_chunks(all_chunks)



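### Explanation:
The index_chunks function encodes each chunk’s text with the embedding model, converts the result to a plain list, and pairs it with the chunk’s ID and metadata as an (id, values, metadata) tuple. All tuples are then sent to the Pinecone index in a single upsert call, which is skipped when there are no chunks.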
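
For a larger knowledge base, a single upsert request may become too big, so uploads are commonly split into batches. The sketch below shows only the batching step. `batched` is a hypothetical helper, the batch size of 100 is an assumption rather than a documented limit, and the vectors are dummy data.

```python
def batched(items, batch_size=100):
    """Yield consecutive slices of items holding at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Dummy (id, values, metadata) tuples in the same shape index_chunks builds
vectors = [(f"chunk_{i}", [0.0, 0.0, 0.0], {"chunk_id": i}) for i in range(250)]
batch_sizes = [len(batch) for batch in batched(vectors, batch_size=100)]
print(batch_sizes)
```

Inside index_chunks, each batch would be passed to `index.upsert(vectors=batch)` in turn.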


---





# Step 5: Implement the Conversation Manager
### Summary: Track each client session and guide the client through the interview questions.

# 5.1 Generate Session Tokens
### Summary: Create unique session tokens to manage individual client sessions.

In [47]:
def generate_session_token():
    """
    Generate a unique session token using UUID4.

    Returns:
        str: A unique session token.
    """
    return str(uuid.uuid4())

### Explanation:
This function generates a unique identifier for each client session using UUID4, ensuring that each interview is uniquely tracked and attributed.
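
As a quick illustration of what such a token looks like, the snippet below generates one and parses it back with the standard library. The variable names are illustrative only.

```python
import uuid

token = str(uuid.uuid4())
parsed = uuid.UUID(token)  # raises ValueError if the string is not a valid UUID

# A UUID4 string has 36 characters: 32 hex digits plus 4 hyphens
print(token, len(token), parsed.version)
```

Parsing the token back is a cheap way to validate session tokens read from a spreadsheet or URL.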


---



# 5.2 Define the Conversation Flow

### Summary: Manage the flow of the interview, including questions and response handling.

In [48]:
# Define the list of questions and follow-ups
questions = [
    {
        'question': "What’s your primary goal in interacting with customers?",
        'follow_up': "Can you give a specific example?"
    },
    {
        'question': "How would you describe your ideal brand voice?",
        'follow_up': "Does it vary by platform or audience?"
    },
    # Add other questions as needed
]

def interview_flow():
    """
    Manage the interview flow by presenting questions and capturing responses.
    """
    if 'session_token' not in st.session_state:
        st.session_state['session_token'] = generate_session_token()

    st.write("Hi there! I'm excited to learn more about your brand. 😊")

    # Rebuild the list on every run: Streamlit re-executes the script on each
    # interaction, so appending to session state would duplicate earlier answers.
    responses = []
    for idx, q in enumerate(questions):
        with st.expander(f"Question {idx + 1}"):
            response = st.text_input(q['question'], key=f"q_{idx}")
            follow_up = st.text_input(q['follow_up'], key=f"q_{idx}_follow")
            if response and follow_up:
                responses.append({
                    'question': q['question'],
                    'follow_up': q['follow_up'],
                    'response': {
                        'answer': response,
                        'example': follow_up
                    }
                })
    st.session_state['responses'] = responses

### Explanation:

	•	Questions Definition: A list of dictionaries containing each question and its corresponding follow-up question.
	•	interview_flow Function: Initializes session variables if they don’t exist and iterates through each question, capturing user responses. Uses Streamlit’s expander for better UI organization, allowing users to focus on one question at a time.
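
Because Streamlit reruns the whole script on every interaction, it helps to keep exactly one stored entry per question. One way to do that, sketched here with a plain dict standing in for `st.session_state`, is to key answers by question index. `record_answer` is a hypothetical helper and the sample answers are made up.

```python
def record_answer(state, idx, question, follow_up, answer, example):
    """Store a completed answer under its question index, replacing any earlier value."""
    if answer and example:
        state.setdefault('responses', {})[idx] = {
            'question': question,
            'follow_up': follow_up,
            'response': {'answer': answer, 'example': example},
        }
    return state

state = {}  # stand-in for st.session_state
for _ in range(3):  # simulate three reruns of the script
    record_answer(state, 0, "Primary goal?", "An example?", "We educate.", "Weekly tips.")
record_answer(state, 1, "Brand voice?", "Does it vary?", "", "")  # incomplete, ignored

ordered = [state['responses'][i] for i in sorted(state['responses'])]
print(len(ordered))
```

Sorting the keys restores question order when the responses are handed to the classifier.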


---



# Step 6: Build the Classification Engine

Summary: Calculate similarity scores and determine primary and secondary archetypes based on client responses.

In [49]:
def get_archetype_embedding(archetype_name):
    """
    Retrieve and average embeddings for a given archetype.

    Args:
        archetype_name (str): The name of the archetype.

    Returns:
        list: Averaged embedding vector for the archetype.
    """
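    # Query Pinecone for all chunks related to the archetype. query() requires a
    # vector, so a placeholder zero vector is passed (an assumption); only the filter matters here.
    query_result = index.query(vector=[0.0] * 384, filter={'archetype': archetype_name}, top_k=100, include_values=True)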

    embeddings = [match['values'] for match in query_result['matches']]
    if not embeddings:
        return [0.0] * 384  # Return a zero vector if no embeddings found

    # Calculate the average embedding
    archetype_embedding = [sum(col) / len(col) for col in zip(*embeddings)]
    return archetype_embedding

def classify_archetypes(responses):
    """
    Classify the client into primary and secondary archetypes based on responses.

    Args:
        responses (list): List of response dictionaries.

    Returns:
        tuple: Primary archetype and secondary archetype (if any).
    """
    # Initialize a dictionary to hold cumulative similarity scores
    archetype_scores = {archetype: 0 for archetype in documents.keys()}

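    # Fetch each archetype's average embedding once instead of once per response
    archetype_embeddings = {a: get_archetype_embedding(a) for a in archetype_scores}

    for response in responses:
        response_text = response['response']['answer'] + ' ' + response['response']['example']
        response_embedding = model.encode(response_text)

        for archetype, archetype_embedding in archetype_embeddings.items():
            if not any(archetype_embedding):
                continue  # zero vector (no chunks found): cosine similarity is undefined
            score = 1 - cosine(response_embedding, archetype_embedding)
            archetype_scores[archetype] += score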

    # Sort archetypes based on cumulative scores
    sorted_archetypes = sorted(archetype_scores.items(), key=lambda x: x[1], reverse=True)

    primary_archetype = sorted_archetypes[0][0]
    # Define threshold for secondary archetype (e.g., within 10% of primary score)
    threshold = 0.9 * sorted_archetypes[0][1]

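    # Guard against a knowledge base that contains only one archetype
    has_runner_up = len(sorted_archetypes) > 1 and sorted_archetypes[1][1] > threshold
    secondary_archetype = sorted_archetypes[1][0] if has_runner_up else None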

    return primary_archetype, secondary_archetype

### Explanation:

	•	get_archetype_embedding Function: Retrieves all embeddings associated with a specific archetype from Pinecone and calculates their average to form a representative embedding vector for the archetype. If no embeddings are found, it returns a zero vector as a placeholder. Cosine similarity against a zero vector is undefined, so an archetype without indexed chunks cannot be scored meaningfully.
	•	classify_archetypes Function:
	•	Iterates through each client response, generates its embedding, and calculates the cosine similarity with each archetype’s average embedding.
	•	Aggregates the similarity scores for each archetype across all responses.
	•	Determines the primary archetype as the one with the highest cumulative score.
	•	Identifies a secondary archetype if its score exceeds 90% of the primary archetype’s score (that is, it falls within 10% of the primary).
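
The scoring logic can be checked without Pinecone or the embedding model. The sketch below uses made-up 3-dimensional vectors in place of real 384-dimensional embeddings. `mean_vector`, `cosine_similarity`, and `rank_archetypes` are illustrative names, and treating a zero vector as similarity 0 is an assumption of this sketch.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    return [sum(col) / len(col) for col in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def rank_archetypes(response_vectors, archetype_vectors, ratio=0.9):
    """Sum similarities per archetype, then pick the primary and an optional secondary."""
    scores = {
        name: sum(cosine_similarity(r, vec) for r in response_vectors)
        for name, vec in archetype_vectors.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    primary = ranked[0][0]
    secondary = None
    if len(ranked) > 1 and ranked[1][1] > ratio * ranked[0][1]:
        secondary = ranked[1][0]
    return primary, secondary, scores

# Toy data: "hero" is the average of two chunk vectors, the others are single vectors
archetypes = {
    "hero": mean_vector([[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]),
    "sage": [0.0, 1.0, 0.0],
    "jester": [0.0, 0.0, 1.0],
}
responses = [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]]
primary, secondary, scores = rank_archetypes(responses, archetypes)
print(primary, secondary)
```

Here both responses point almost entirely along the "hero" direction, so no other archetype comes close to the 90% threshold and the secondary slot stays empty.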


---



# Step 7: Design the User Interface

# 7.1 Customize the Streamlit App

Summary: Apply custom theming and add the app’s title and first UI components.

In [51]:
# Apply custom CSS for branding
def local_css(file_name):
    with open(file_name) as f:
        st.markdown(f'<style>{f.read()}</style>', unsafe_allow_html=True)

local_css("/content/UI/styles.css")

# Streamlit Components
st.title("Tedesco AI Automation Solutions")
st.button("Learn More")


2024-09-25 00:48:15.518 
  command:

    streamlit run /usr/local/lib/python3.10/dist-packages/colab_kernel_launcher.py [ARGUMENTS]


False

### Explanation:

	•	Page Content: Sets the app title with st.title and adds a “Learn More” button with st.button. This cell does not call st.set_page_config, so the page icon and layout keep Streamlit’s defaults.
	•	Custom CSS: Applies custom styles from a styles.css file to match your website’s branding. Ensure that the file exists at the path passed to local_css (/content/UI/styles.css here) and contains the necessary CSS rules for colors, fonts, logos, and layout.


---



# 7.2 Create the Chat Interface

Summary: Build the chat-like interface for the interview.

In [52]:
def chat_interface():
    """
    Render the chat interface for the interview.
    """
    st.title("Brand Archetype Interview")
    interview_flow()

### Explanation:
This function sets the title of the Streamlit app and invokes the interview_flow function to manage the conversation with the client.


---



# 7.3 Display Archetype Summaries

Summary: Present the client’s primary and secondary archetypes with brief summaries.

In [53]:
def get_archetype_summary(archetype_name):
    """
    Retrieve a summary for a given archetype.

    Args:
        archetype_name (str): The name of the archetype.

    Returns:
        str: A brief summary of the archetype.
    """
    # For simplicity, return the first 200 characters from the archetype document
    return documents[archetype_name][:200] + "..."

def display_archetype_summary(primary, secondary):
    """
    Display the primary and secondary archetype summaries to the client.

    Args:
        primary (str): Primary archetype.
        secondary (str or None): Secondary archetype, if any.
    """
    st.header("Interview Results")

    st.subheader(f"Primary Archetype: {primary}")
    st.write(get_archetype_summary(primary))

    if secondary:
        st.subheader(f"Secondary Archetype: {secondary}")
        st.write(get_archetype_summary(secondary))

### Explanation:

	•	get_archetype_summary Function: Retrieves a brief summary of the specified archetype. Here, it simply returns the first 200 characters of the archetype’s playbook text. You can enhance this by providing more structured summaries.
	•	display_archetype_summary Function: Displays the primary and, if applicable, secondary archetypes along with their summaries to the client.
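
A small step up from slicing at a fixed character is to cut at a word boundary. The sketch below uses the standard library’s `textwrap.shorten` for that. `summarize` is a hypothetical helper and the sample playbook text is made up. Note that `shorten` also collapses runs of whitespace, which may or may not be what you want.

```python
import textwrap

def summarize(text, width=200):
    """Return text cut at a word boundary so the result fits within width characters."""
    return textwrap.shorten(text, width=width, placeholder="...")

playbook = "The Hero archetype stands for courage and mastery in the face of adversity."
summary = summarize(playbook, width=40)
print(summary)
```

Unlike `documents[archetype_name][:200]`, this never ends in the middle of a word, and text already shorter than the limit is returned without a trailing ellipsis.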


---



# Step 8: Implement Data Storage and Export


# 8.1 Create Transcript Structure

Summary: Structure the conversation data into a JSON format.

In [54]:
def create_transcript(responses, primary, secondary):
    """
    Create a structured transcript of the interview.

    Args:
        responses (list): List of response dictionaries.
        primary (str): Primary archetype.
        secondary (str or None): Secondary archetype.

    Returns:
        dict: A structured transcript dictionary.
    """
    transcript = {
        "session_token": st.session_state['session_token'],
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),  # timezone-aware UTC timestamp
        "interview": responses,
        "archetypes": {
            "primary": primary,
            "secondary": secondary
        }
    }
    return transcript

### Explanation:
This function compiles all the interview data into a JSON-serializable dictionary, including the session token, timestamp, list of questions and responses, and the identified archetypes.
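
To confirm the structure survives serialization, the sketch below builds a transcript without Streamlit and round-trips it through `json`. `build_transcript` is an illustrative stand-in that takes the session token as an argument instead of reading `st.session_state`, and the sample values are made up.

```python
import datetime
import json

def build_transcript(session_token, responses, primary, secondary=None):
    """Assemble the transcript dictionary from plain arguments."""
    return {
        "session_token": session_token,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "interview": responses,
        "archetypes": {"primary": primary, "secondary": secondary},
    }

demo_responses = [{
    "question": "What’s your primary goal in interacting with customers?",
    "follow_up": "Can you give a specific example?",
    "response": {"answer": "We educate.", "example": "Weekly tips."},
}]
demo = build_transcript("demo-token", demo_responses, "Hero")
encoded = json.dumps(demo, indent=2)
decoded = json.loads(encoded)
print(decoded["archetypes"])
```

A missing secondary archetype is stored as `None`, which `json.dumps` writes as `null`.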


---



# 8.2 Export Transcript to Google Sheets

Summary: Parse the JSON transcript and export it to Google Sheets in a consistent format.

In [55]:
def export_to_sheets(transcript):
    """
    Export the transcript to Google Sheets.

    Args:
        transcript (dict): The structured transcript dictionary.
    """
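    # Define the scopes: Sheets for writing rows, Drive because client.open() looks the sheet up by title
    scope = [
        'https://www.googleapis.com/auth/spreadsheets',
        'https://www.googleapis.com/auth/drive'
    ]

    # Load Google Sheets credentials from environment variable
    GOOGLE_SHEETS_CREDENTIALS = os.getenv('GOOGLE_SHEETS_CREDENTIALS')  # JSON credentials as a string
    if not GOOGLE_SHEETS_CREDENTIALS:
        raise RuntimeError("GOOGLE_SHEETS_CREDENTIALS is not set")

    # Write the credentials to a temporary file
    with open('credentials.json', 'w') as f:
        f.write(GOOGLE_SHEETS_CREDENTIALS)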

    # Authenticate and create the client
    creds = Credentials.from_service_account_file('credentials.json', scopes=scope)
    client = gspread.authorize(creds)

    # Open the Google Sheet (replace 'Transcripts' with your sheet name)
    sheet = client.open('Transcripts').sheet1

    # Prepare data for insertion
    row = [
        transcript['session_token'],
        transcript['timestamp']
    ]
    for entry in transcript['interview']:
        row.extend([
            entry['question'],
            entry['response']['answer'],
            entry['follow_up'],
            entry['response']['example']
        ])
    row.extend([
        transcript['archetypes']['primary'],
        transcript['archetypes']['secondary'] if transcript['archetypes']['secondary'] else ''
    ])

    # Insert the row into the sheet
    sheet.append_row(row)

### Explanation:

	•	Google Sheets Authentication:
	•	Retrieves Google Sheets credentials from an environment variable (GOOGLE_SHEETS_CREDENTIALS). Ensure that this variable contains the JSON credentials as a string.
	•	Writes the credentials to a temporary credentials.json file to authenticate with the Google Sheets API using gspread.
	•	Exporting Data:
	•	Opens the specified Google Sheet (Transcripts).
	•	Prepares the data by flattening the JSON structure into a single row, including session token, timestamp, questions, responses, follow-ups, examples, and identified archetypes.
	•	Appends the row to the Google Sheet.
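
The flattening step can be tested separately from the Google Sheets call. In the sketch below, `transcript_to_row` is a hypothetical helper, and padding unanswered questions with empty cells is an added assumption (not something export_to_sheets does) so that the archetype columns always land in the same position.

```python
def transcript_to_row(transcript, expected_questions):
    """Flatten a transcript into one spreadsheet row with a fixed number of columns."""
    row = [transcript['session_token'], transcript['timestamp']]
    for entry in transcript['interview']:
        row.extend([
            entry['question'],
            entry['response']['answer'],
            entry['follow_up'],
            entry['response']['example'],
        ])
    # Pad skipped questions: four empty cells per missing entry
    missing = expected_questions - len(transcript['interview'])
    row.extend([''] * (4 * missing))
    archetypes = transcript['archetypes']
    row.extend([archetypes['primary'], archetypes['secondary'] or ''])
    return row

sample_transcript = {
    'session_token': 'demo-token',
    'timestamp': '2024-09-25T00:00:00+00:00',
    'interview': [{
        'question': 'Primary goal?',
        'follow_up': 'An example?',
        'response': {'answer': 'We educate.', 'example': 'Weekly tips.'},
    }],
    'archetypes': {'primary': 'Hero', 'secondary': None},
}
row = transcript_to_row(sample_transcript, expected_questions=2)
print(len(row), row[-2:])
```

With two expected questions, every row has 2 + 4 × 2 + 2 = 12 cells, regardless of how many were answered.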


---



# Step 9: Integrate All Components
Summary: Combine all modules into the main application script.

In [56]:
def main():
    """
    Orchestrate the overall application flow.
    """
    chat_interface()

    if st.button("Submit", key='submit'):
        if 'responses' in st.session_state and st.session_state['responses']:
            primary, secondary = classify_archetypes(st.session_state['responses'])
            display_archetype_summary(primary, secondary)
            transcript = create_transcript(st.session_state['responses'], primary, secondary)
            export_to_sheets(transcript)
            st.success("Your archetypes have been identified and the transcript has been saved!")
        else:
            st.warning("Please answer all questions before submitting.")

if __name__ == "__main__":
    main()

2024-09-25 00:48:51.072 Session state does not function when running a script without `streamlit run`


### Explanation:

	•	main Function:
	•	Initiates the chat interface where the client answers the interview questions.
	•	Upon clicking the “Submit” button:
	•	Checks if responses exist in the session state.
	•	Performs archetype classification based on the responses.
	•	Displays the identified archetypes with summaries.
	•	Creates a structured transcript and exports it to Google Sheets.
	•	Provides feedback to the user about the successful completion of the process.


---



# 10.1 Testing the Application
Summary: Run the application locally to test all functionalities.

In [68]:
!pip install pyngrok

Collecting pyngrok
  Downloading pyngrok-7.2.0-py3-none-any.whl.metadata (7.4 kB)
Downloading pyngrok-7.2.0-py3-none-any.whl (22 kB)
Installing collected packages: pyngrok
Successfully installed pyngrok-7.2.0


In [71]:
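from pyngrok import ngrok
from google.colab import userdata
# ngrok needs an authtoken; 'NGROK_AUTHTOKEN' is an assumed Colab secret name
ngrok.set_auth_token(userdata.get('NGROK_AUTHTOKEN'))
public_url = ngrok.connect(8501)
print(public_url)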



ERROR:pyngrok.process.ngrok:t=2024-09-25T01:17:56+0000 lvl=eror msg="failed to reconnect session" obj=tunnels.session err="authentication failed: Usage of ngrok requires a verified account and authtoken.\n\nSign up for an account: https://dashboard.ngrok.com/signup\nInstall your authtoken: https://dashboard.ngrok.com/get-started/your-authtoken\r\n\r\nERR_NGROK_4018\r\n"


PyngrokNgrokError: The ngrok process errored on start: authentication failed: Usage of ngrok requires a verified account and authtoken.\n\nSign up for an account: https://dashboard.ngrok.com/signup\nInstall your authtoken: https://dashboard.ngrok.com/get-started/your-authtoken\r\n\r\nERR_NGROK_4018\r\n.

In [67]:
!streamlit run /usr/local/lib/python3.10/dist-packages/colab_kernel_launcher.py


Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://172.28.0.12:8501
  External URL: http://35.196.24.72:8501

  Stopping...
  Stopping...
