#Creating Data with Generative AI
*Authors: Yash Gopalji Pankhania, Jeel Kanzaria, Megha Patel*

**Abstract:**

In this notebook, we use OpenAI's Generative Pre-trained Transformer (GPT) model to create a personalized chatbot. The chatbot is prompted with user-provided details (age, gender, weight, and height) together with the user's health goals, and returns a comprehensive diet and fitness plan. The app uses AI to act as a personalized trainer and help users achieve their health goals, drawing on the knowledge the model acquired from web-scale training data. The popularity of transformer models, notably seen in OpenAI's ChatGPT and the image generator DALL-E, inspired this work. Our goal is to explore the impact of transformers on generative AI and their relevance to data science. We examine the reasons behind the success of transformers, support the discussion with code examples that demonstrate their potential for data generation, and present the generated results with data summaries, combining theoretical insights with practical application.




### We have publicly hosted our health and fitness assistant on Streamlit: https://fitness-gpt.streamlit.app/

![Health](https://github.com/jeelkan/Exploratory-Data-Analysis/assets/122841430/7323e108-d263-4377-aff8-cf6a9c0169d3)

#Theoretical Foundations of Generative AI



**Introduction to generative AI and its applications.**

Generative AI refers to models or algorithms that create (generate) brand-new output, such as text, images, videos, code, or data. The models produce new content by drawing on patterns in the vast amount of data they were trained on. Generative AI differs from other forms of AI in that its primary purpose is to create new content, whereas other forms of AI analyze existing data or perform specific tasks.

Generative AI has become popular recently due to the development of transformer models. The transformer was introduced in the seminal paper "Attention is All You Need" by Vaswani et al. in 2017, and it has since led to a significant leap in AI capabilities, outperforming earlier deep learning approaches on many language tasks and enabling rapid progress in AI research. Because the architecture parallelizes well, it makes effective use of growing computational resources, allowing AI models to scale and tackle more complex problems than before.

**The relevance of data generation in various data science tasks.**

In data science, the availability of large and diverse datasets is often a bottleneck. Generative AI addresses this issue by enabling the creation of synthetic data, reducing the reliance on limited or biased datasets. This is especially valuable when dealing with sensitive or scarce data, such as medical records or rare events.

**Theoretical underpinnings of the chosen generative AI method.**

1. **Attention Mechanism:**

  The attention mechanism is fundamental to transformers. Traditional sequence-to-sequence models, like recurrent neural networks (RNNs), struggle with capturing long-range dependencies in sequential data. The attention mechanism allows transformers to focus on different parts of the input sequence when making predictions. This self-attention mechanism enables the model to weigh the importance of different elements dynamically.
2. **Self-Attention and Multi-Head Attention:**

  Self-attention allows each element in the sequence to consider other elements' information, capturing complex relationships within the data. Multi-head attention extends this concept by using multiple sets of attention weights, allowing the model to learn different aspects of the relationships simultaneously. This makes transformers highly effective at understanding contextual information across sequences.
3. **Positional Encoding:**

  Transformers do not inherently understand the order of elements in a sequence, because self-attention on its own is insensitive to the order of its inputs. To introduce positional information, positional encodings are added to the input embeddings. These encodings provide the model with information about the positions of elements in the sequence, allowing it to consider the order of the data.
4. **Encoder-Decoder Architecture:**

  The original transformer architecture is composed of an encoder stack and a decoder stack. The encoder processes the input sequence, while the decoder generates the output sequence. This separation allows transformers to handle various sequence-to-sequence tasks effectively. GPT models, including the one used in this notebook, use only the decoder stack.
5. **Layer Normalization and Residual Connections:**

  Each sub-layer in the transformer contains layer normalization and a residual connection. Layer normalization helps stabilize training by normalizing the inputs, and residual connections aid in the flow of information through the network. These components contribute to the stability and efficiency of training transformers.
6. **Position-wise Feedforward Networks:**

  Transformers include position-wise feedforward networks in each layer. These networks consist of fully connected layers and provide the model with the ability to capture non-linear relationships in the data. The inclusion of feedforward networks contributes to the expressive power of transformers.
7. **Scaled Dot-Product Attention:**

  The attention mechanism in transformers uses scaled dot-product attention. The dot product of the query and key vectors is divided by the square root of the key dimension, which prevents the dot products from growing too large and makes the learning process more stable.
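
To make the attention items above concrete, here is a minimal sketch of scaled dot-product attention in plain Python, i.e. softmax(Q·Kᵀ / sqrt(d_k))·V. It is an illustration only: the function names and the toy vectors are our own assumptions, and real transformers use learned projection matrices, multiple heads, and optimized tensor libraries.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for lists of equal-length vectors.

    Each query is compared with every key; the dot products are divided
    by sqrt(d_k) before the softmax, then used to average the values.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time
        out = [sum(wj * v[i] for wj, v in zip(w, values)) for i in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy self-attention (hypothetical data): the same three 2-d vectors
# act as queries, keys, and values
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens)
```

Each row of `weights` sums to 1 and shows how strongly one token attends to every other token; the first token attends more to itself and to the third token than to the second, because their dot products are larger.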




**How generative AI contributes to solving data-related problems.**

  Generative AI contributes to data science by offering a solution to the problem of limited or biased datasets. By learning the underlying patterns and distributions from a training dataset, generative AI models can then generate new samples that share similar characteristics. This, in turn, enhances the diversity and representativeness of the available data, leading to improved model generalization and performance.


#Introduction to Data Generation


###Context


Data generation using generative AI involves creating synthetic data samples that resemble real-world data. This process is particularly valuable when faced with challenges such as limited, biased, or sensitive datasets. Generative AI models, such as transformers, can learn underlying patterns from existing data and generate new, diverse samples that reflect those patterns.

![Doctor](https://github.com/jeelkan/Exploratory-Data-Analysis/assets/122841430/7e05b4a3-d89f-4274-bf99-b18c639c5303)



###Significance

- **Significance of Generative AI in Data Generation**: Overcomes challenges of traditional datasets.
- **Solution for Limited Data Scenarios**: Ideal for situations where acquiring extensive, diverse, or unbiased datasets is difficult.
- **Enhancement of Dataset Diversity**: Creates synthetic data that mirrors real-world patterns, improving diversity and representation.
- **Training Robust Machine Learning Models**: Essential for models to generalize effectively across various scenarios.
- **Addressing Data Scarcity**: Supplements existing datasets for better model training and testing.
- **Privacy Preservation in Sensitive Fields**: Generates synthetic data that maintains statistical properties while protecting sensitive information.
- **Data Augmentation Capability**: Creates variations of existing data to train more resilient models.
- **Beyond Traditional Approaches**: Offers innovative solutions to complex data-related challenges in modern applications.



![Roadmap](https://github.com/jeelkan/Exploratory-Data-Analysis/assets/122841430/d14af4c5-bd93-4f2e-a511-88e96230fa68)



###Principles of Data Generation



1. **Learning Underlying Patterns:**

  Generative AI models, based on principles such as neural networks, learn the underlying patterns and distributions from the input data during training. This knowledge is then used to generate new samples.
2. **Transfer Learning:**

  Pre-trained generative AI models, such as transformers, leverage transfer learning. They are first trained on large, diverse datasets and can then be fine-tuned for a specific domain or data generation task using a much smaller dataset.
3. **Contextual Understanding:**

  Models like transformers excel in capturing contextual relationships within the data. This contextual understanding is crucial for generating coherent and contextually relevant synthetic data.
4. **Balancing Fidelity and Diversity:**

  A key challenge in data generation is striking the right balance between fidelity to the original data and introducing diversity. Generative AI models aim to generate samples that are both realistic and diverse to improve the robustness of downstream applications.
5. **Evaluation and Validation:**

  The quality of generated data is evaluated using metrics specific to the task at hand. Validation ensures that the synthetic data aligns with the intended characteristics and is suitable for the desired application.

### Transformers

- **Popularity in Generative AI**: Transformers are widely used in natural language processing tasks.
- **Versatile Applications**: Not just for data generation, but also for text generation, language translation, summarization, etc.
- **Focus on General Characteristics**: Understanding transformers in the context of data generation tasks.
- **Origins of Transformers**: Introduced in "Attention is All You Need" by Vaswani et al. (2017).
- **Difference from RNNs**: Unlike Recurrent Neural Networks, transformers process sequences using self-attention, not sequentially.
- **Efficient Data Processing**: Self-attention mechanism enables capturing long-range dependencies in data.

### Purpose of Data Generation


1. **Text Generation:**

  Transformers excel at generating coherent and contextually relevant text. This makes them valuable for tasks such as automatic content creation, creative writing, and text completion.
2. **Language Translation:**

  Transformers have been widely used for language translation tasks. They can generate translations with high fluency and accuracy, leveraging their ability to understand and capture contextual dependencies.
3. **Summarization:**

  Transformers can be applied to generate concise and informative summaries of longer texts. They can identify essential information and create coherent summaries.
4. **Data Augmentation:**

  In data science, transformers can be used for data augmentation by generating additional examples for training datasets. This helps improve model robustness and generalization.

#Analyzing the Generated Data


###Data Characteristics

The generated data from OpenAI's ChatGPT model consists of human-like text responses. The nature of the data is diverse, encompassing a wide range of topics, writing styles, and contextual information. The data is derived from a pre-training process where the model is exposed to vast amounts of publicly available text data from the internet, allowing it to learn grammar, facts, reasoning abilities, and some level of world knowledge. The properties of the data include variability in language use, context sensitivity, and the ability to generate coherent and contextually relevant responses to a wide array of user inputs.

###Application Areas

The generated data from ChatGPT can find applications in various domains, including:

* Content Creation: ChatGPT can assist in generating creative writing, stories, articles, and other forms of content.

* Conversational Interfaces: The data is well-suited for building chatbots and virtual assistants, providing natural and contextually relevant responses to user queries.

* Educational Tools: The generated content can be applied in educational settings to provide explanations, answer questions, and aid in learning across different subjects.

* Programming Assistance: ChatGPT can assist developers by providing code snippets, explanations, and guidance in programming-related queries.

* Idea Generation: It can be used to brainstorm and generate ideas for various tasks, projects, or creative endeavors.

* Language Translation: While not explicitly trained for translation, ChatGPT can offer assistance in generating translated sentences or phrases.



![Application](https://github.com/jeelkan/Exploratory-Data-Analysis/assets/122841430/a7323f9d-f854-403c-87d8-d56a7e87d3cf)


###Analytical Insights

The potential insights derived from analyzing the generated data are varied and depend on the context of use. Some potential analytical insights include:

* Language Understanding:

  Analysis of the generated text can provide insights into the model's understanding of language, including grammar, syntax, and semantics.

* Contextual Adaptability:

  Studying how well the model maintains context in conversations offers insights into its contextual adaptability and the ability to generate coherent responses.

* Bias and Fairness:

  Analyzing the content for biases and fairness considerations can provide insights into potential ethical considerations and areas for improvement in the training data.

* Performance Metrics:

  Using metrics like response coherence, relevance, and informativeness can offer insights into the model's overall performance in specific applications.

* Limitations and Challenges:

  By examining cases where the model struggles or provides inaccurate information, insights can be gained into its limitations and areas where additional fine-tuning or improvements may be necessary.

#Engaging with Generative AI for Data Generation



###Insights into ChatGPT's Data Generation Process

Now that we have given a brief introduction to OpenAI's models, let's have ChatGPT showcase its capabilities itself. The response below is ChatGPT's own description of how it was trained, reproduced as generated:

Certainly! The data generation process for training ChatGPT involves two main steps: pretraining and fine-tuning.

1. **Pretraining:**
   - In the pretraining phase, the model is trained on a large corpus of publicly available text from the internet. This corpus is diverse and covers a wide range of topics, styles, and domains.
   - The model learns to predict the next word in a sentence based on the context provided by the preceding words. This process allows the model to capture grammar, syntax, facts, and reasoning abilities.
   - It's important to note that during pretraining, the model doesn't have specific information about the documents in its training set, and it doesn't have awareness of individual documents or sources.

2. **Fine-tuning:**
   - After pretraining, the model is fine-tuned on a more narrow dataset that is generated with human reviewers following guidelines provided by OpenAI.
   - The fine-tuning process involves iterative feedback between the human reviewers and the model. The reviewers review and rate model outputs for a range of example inputs. The model then generalizes from this feedback to respond to a wider array of user inputs.
   - The fine-tuning process is designed to make the model more controlled and align it with human values. OpenAI provides guidelines to reviewers to ensure the model's behavior is consistent with certain ethical standards.

It's important to note that the training data is anonymized and carefully stripped of personally identifiable information to respect privacy. The goal of the process is to create a language model that can generate coherent and contextually relevant responses while being mindful of ethical considerations and user safety. OpenAI continues to iterate on its models and training processes to improve performance and address potential biases or issues.

###Data Generation Scenarios using ChatGPT

Now that we have a better understanding of ChatGPT's architecture and data generation process, let's delve into some examples of data generation using this technique.

1. Content Creation:

    Prompt: "Write a short story about a robot and its creator."

    Response: "In a small workshop, Dr. Smith toiled day and night, creating a sentient robot named Luna. As Luna's artificial intelligence developed, a unique bond formed between them..."

2. Programming Help:

    Prompt: "How do I write a Python program to calculate the factorial of a number?"

    Response: "To calculate the factorial in Python, you can use a loop or recursion. Here's an example using a loop:"
3. Language Translation:

    Prompt: "Translate the following English sentence to French: 'Hello, how are you?'"

    Response: "'Bonjour, comment ça va ?'"
4. Learning New Topics:

    Prompt: "Explain the concept of black holes in astrophysics."

    Response: "Black holes are regions in space where gravitational forces are so strong that nothing, not even light, can escape. They are formed from the remnants of massive stars..."
5. Creative Writing Assistance:

    Prompt: "Help me come up with a catchy slogan for a new coffee shop."

    Response: "Brewing Joy, One Cup at a Time!"
6. Idea Generation:

    Prompt: "Generate ideas for a science fair project related to renewable energy."

    Response:
    "1. Solar-powered phone charger with recycled materials. 2. Wind turbine efficiency study. 3. Bioenergy from organic waste."
7. Casual Conversation:

    User: "Tell me a joke!"

    Response: "Why did the computer keep its drink on the windowsill? Because it wanted a byte with a view!"
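
The Programming Help response above stops before its code example. The snippet below is our own minimal sketch of a loop-based factorial, not ChatGPT's actual output; the function name and the error handling are assumptions made for illustration.

```python
def factorial(n):
    """Return n! for a non-negative integer n using a simple loop."""
    if n < 0:
        raise ValueError("factorial is undefined for negative numbers")
    result = 1
    # Multiply 2 * 3 * ... * n; the loop body is skipped for n = 0 and n = 1
    for i in range(2, n + 1):
        result *= i
    return result

print(factorial(5))  # prints 120
```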




###Validation of the Quality and Diversity of Generated Data

The quality of the responses depends on the clarity and specificity of the prompts. The model's responses are based on patterns learned during training and may not always reflect the most up-to-date or accurate information, so it is important to validate the generated data. The following approaches can be used:

* Human Evaluation:
  * *Expert Review*: Have domain experts or knowledgeable individuals review the generated content. They can assess the accuracy, relevance, and overall quality of the information.
  * *Crowdsourced Evaluation*: Gather feedback from a diverse group of human evaluators to obtain opinions on the quality, clarity, and appropriateness of generated responses (e.g. Amazon Mechanical Turk).

* Diversity Metrics:
  * *N-gram Analysis*: Examine the diversity of n-grams (sequences of adjacent words) in the generated text. Higher diversity indicates a broader range of language use.
  * *Topic Coverage*: Check if the generated content covers a diverse set of topics. This can be done by analyzing the distribution of topics or keywords in the generated data.

* Contextual Relevance:
  * *Prompt Variability*: Test the model's ability to handle a variety of prompts. A good model should be able to provide relevant and coherent responses across different types of inputs.
  * *Contextual Consistency*: Evaluate how well the model maintains context within a conversation. Responses should logically follow from the preceding dialogue or prompt.

* Avoidance of Biases:
  * *Bias Assessment*: Check for biases in the generated content, such as gender bias, cultural bias, or other forms of unfair or inappropriate behavior. This can be done manually or using automated tools designed to identify biases.
* User Feedback:
  * *Collect User Feedback*: Gather feedback from actual users who interact with the model. This can provide insights into how well the model meets user expectations and whether it produces useful and satisfactory responses.

* Performance Benchmarks:
  * *Task-Specific Metrics*: If the model is designed for a specific task (e.g., translation, summarization), use task-specific metrics to evaluate its performance. For instance, in translation tasks, BLEU scores can be used to measure translation quality.

* Adversarial Testing:
  * *Adversarial Input*: Test the model's robustness by providing challenging or adversarial inputs. This helps identify potential weaknesses and areas for improvement.
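
The n-gram analysis listed under Diversity Metrics can be approximated with a distinct-n score: the share of n-grams in a set of generated texts that are unique. The sketch below assumes simple lowercase whitespace tokenization, and the two sample lists are hypothetical; it is a rough diagnostic, not a standard library implementation.

```python
def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams across a list of texts."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    # Guard against texts that are too short to contain any n-gram
    return len(unique) / total if total else 0.0

# Hypothetical generated responses
repetitive = ["eat more protein", "eat more protein"]
varied = ["eat more protein", "drink enough water daily"]

print(distinct_n(repetitive))  # prints 0.5
print(distinct_n(varied))      # prints 1.0
```

A score close to 1 means the responses rarely repeat the same phrasing, while a low score suggests the model keeps producing near-identical text.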


#Crafting Generative Data



###Our Data Generation Task

- **Chatbot for Personalized Wellness Plans**: Developing a chatbot using OpenAI's ChatGPT to create diet and exercise plans based on user biometrics and goals, hosted on Streamlit.
- **Holistic Health Management Approach**: Aims to integrate technology into personal health management seamlessly.
- **Tailored Plans for Diverse Goals**: Plans designed for weight loss, muscle gain, or overall well-being, considering lifestyle factors.
- **Adaptive and Iterative Feedback Incorporation**: Chatbot will adjust recommendations based on user feedback and progress.

- **Generative AI in Healthcare**: Multiple applications in healthcare for content and behavior generation.
    - **Medical Image Generation**: Synthetic MRI or CT images; histopathology images for algorithm training.
    - **Drug Discovery**: Generating molecular structures for drug candidate exploration.
    - **Clinical Data Augmentation**: Creating synthetic patient data to enhance datasets while ensuring privacy.
    - **NLP in Healthcare**: Automated clinical documentation generation to reduce administrative tasks.
    - **Chatbots for Patient Interaction**: Providing information, mental health support, and patient engagement.
    - **Personalized Medicine**: Generating tailored treatment plans based on individual data.
    - **Simulation for Medical Training**: Creating virtual patients for risk-free practice and training.
    - **Genomic Data Analysis**: Generating synthetic genomic data for rare disease study and genetic variation analysis.
    - **Disease Prediction and Early Diagnosis**: Developing risk prediction models for early intervention.
    - **Healthcare Robotics**: Simulating human-like behavior in healthcare robots for patient interaction.

![use case health](https://github.com/jeelkan/Exploratory-Data-Analysis/assets/122841430/b2b5f4ee-53e6-43c5-95ad-b90cfcd39149)




###Expected Format of Our Data

The generated data takes two forms.

For the diet plan, we specifically asked for the following format:
Breakfast, Mid-Morning Snack, Lunch, Evening Snack, Dinner.

For the exercise plan, we expect workouts broken down by day of the week. This plan is more flexible: it depends on the user's goal and is expected to vary each time it is run.
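
Because the model's output is free text, it can be useful to check that a generated diet plan actually contains the five requested sections. The following is a minimal sketch under our own assumptions: `missing_meals` is a hypothetical helper, the check is a plain case-insensitive substring match, and the sample plan is made up for illustration.

```python
EXPECTED_MEALS = ["Breakfast", "Mid-Morning Snack", "Lunch", "Evening Snack", "Dinner"]

def missing_meals(plan_text):
    """Return the expected meal names that do not appear in the plan text."""
    lowered = plan_text.lower()
    return [meal for meal in EXPECTED_MEALS if meal.lower() not in lowered]

# Hypothetical, incomplete model output
sample_plan = """## Breakfast
Oatmeal with berries
## Lunch
Grilled chicken salad
## Dinner
Baked salmon with vegetables"""

print(missing_meals(sample_plan))  # prints ['Mid-Morning Snack', 'Evening Snack']
```

An empty list means every requested section is present; a non-empty list could be used to trigger a retry of the request.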




In our Streamlit app, we added constraints to the inputs so that requests stay within a sensible range. For example, the app should not produce a diet plan for a 5-year-old, nor accept a weight of 50 lbs (about 23 kg). Accordingly, the following constraints are enforced in the web app:
* age: number from 18 - 100
* gender: choices (Male or Female)
* height: number in cm from 100 - 250
* weight: number in kg from 30 - 200
* goal: choices (Gain Muscle, Lose Weight, or Maintain Weight; the diet plan form offers only the first two)
* activity: choices (Sedentary, Light, Moderate, or Active; diet plan form only)
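
In the app these limits are enforced by the Streamlit widgets themselves. The same rules can also be expressed as a plain function, which is handy when inputs arrive from somewhere other than the UI. This is a sketch only: the field names (`height_cm`, `weight_kg`, and so on) and the `validate_inputs` helper are our own assumptions, not part of the app.

```python
# Numeric ranges and allowed choices, copied from the list above
LIMITS = {"age": (18, 100), "height_cm": (100, 250), "weight_kg": (30, 200)}
CHOICES = {
    "gender": {"Male", "Female"},
    "goal": {"Gain Muscle", "Lose Weight", "Maintain Weight"},
    "activity": {"Sedentary", "Light", "Moderate", "Active"},
}

def validate_inputs(inputs):
    """Return a list of error messages; an empty list means the inputs are valid."""
    errors = []
    for field, (low, high) in LIMITS.items():
        value = inputs.get(field)
        if not isinstance(value, (int, float)) or not low <= value <= high:
            errors.append(f"{field} must be a number from {low} to {high}")
    for field, allowed in CHOICES.items():
        if inputs.get(field) not in allowed:
            errors.append(f"{field} must be one of {sorted(allowed)}")
    return errors

# A 5-year-old weighing about 50 lbs (22.7 kg) violates two constraints
child = {"age": 5, "height_cm": 110, "weight_kg": 22.7,
         "gender": "Male", "goal": "Lose Weight", "activity": "Light"}
print(len(validate_inputs(child)))  # prints 2
```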


We will provide illustrative examples in the next section.

#Demonstrating Data Generation

###Our Code

Our code is a Streamlit app for a Health and Fitness Assistant that utilizes OpenAI's GPT-3.5 Turbo model to generate personalized diet plans and workout plans based on user input.

PLEASE NOTE: Streamlit apps are currently not supported on Google Colab. You will get an error if the code is run there. We recommend running it on your own machine with `streamlit run`.

Let's break down the code:

In [None]:
!pip install streamlit

In [None]:
!pip install openai

In [None]:
#Importing the required packages
import streamlit as st
import openai
import os
import toml

In [None]:
# Initialize session state for conversation history
if 'conversation' not in st.session_state:
    st.session_state.conversation = []



In [None]:
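# Fetching the OpenAI API key: read it from the OPENAI_API_KEY environment variable,
# falling back to a placeholder that you must replace with your own key
openai.api_key = os.getenv("OPENAI_API_KEY", "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")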

In [None]:
# Streamlit app layout
st.title('Health and Fitness Assistant')

# Dropdown for selecting between Diet Plan and Workout Plan
with st.sidebar:
    st.header("Choose Your Plan")
    plan_type = st.selectbox('Select a Plan', ['Diet Plan and Recipes', 'Workout Plan'])

    if plan_type == 'Diet Plan and Recipes':
        # Input fields for Diet Plan
        st.header("Input Your Details for Diet Plan")
        age = st.number_input('Age', min_value=18, max_value=100)
        gender = st.selectbox('Gender', ['Male', 'Female'])
        height = st.number_input('Height in cm', min_value=100, max_value=250)
        weight = st.number_input('Weight in kg', min_value=30, max_value=200)
        goal = st.selectbox('Goal', ['Gain Muscle', 'Lose Weight'])
        activity = st.selectbox('Activity Level', ['Sedentary', 'Light', 'Moderate', 'Active'])
        submit_button_diet = st.button('Generate Diet Plan and Recipes')

    elif plan_type == 'Workout Plan':
        # Input fields for Workout Plan
        st.header("Input Your Details for Workout Plan")
        age_w = st.number_input('Age', min_value=18, max_value=100, key='workout_age')
        gender_w = st.selectbox('Gender', ['Male', 'Female'], key='workout_gender')
        height_w = st.number_input('Height in cm', min_value=100, max_value=250, key='workout_height')
        weight_w = st.number_input('Weight in kg', min_value=30, max_value=200, key='workout_weight')
        weight_goal = st.selectbox('Weight Goal', ['Gain Muscle', 'Lose Weight', 'Maintain Weight'], key='workout_goal')
        submit_button_workout = st.button('Generate Workout Plan')



In [None]:
# Handling the output generation outside the sidebar block
# Note: this uses the openai>=1.0 interface (openai.chat.completions.create);
# the older openai.ChatCompletion.create call was removed in that release.
initial_prompt = "Act like a ChatGPT. You know everything."


def ask_chatgpt(question):
    # Send one system + user exchange to GPT-3.5 Turbo and return the reply text
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": initial_prompt},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content


if plan_type == 'Diet Plan and Recipes' and submit_button_diet:
    st.session_state.user_text = ""
    # Prefer the key from the environment; otherwise keep the one set earlier
    openai.api_key = os.getenv("OPENAI_API_KEY", openai.api_key)
    # Start a fresh history for every request
    st.session_state.conversation = []

    question1 = f"Generate a diet plan for a {age}-year-old {gender}, height {height} cm, weight {weight} kg, goal: {goal}, activity level: {activity}. The diet plan should be in the following format: Breakfast, Mid-Morning Snack, Lunch, Evening Snack, Dinner. Generate it in proper markdown format."
    response1 = ask_chatgpt(question1)

    # The generated diet plan is fed back in to request matching recipes
    question2 = f"Generate simple recipes for the dishes present in the following diet plan: {response1}. The recipes should be in the following format: Dish name, Ingredients, Instructions. Generate it in proper markdown format."
    response2 = ask_chatgpt(question2)

    st.session_state.conversation.extend([response1, response2])

    st.write("Diet Plan:\n\n" + st.session_state.conversation[0])
    st.write("Recipes:\n\n" + st.session_state.conversation[1])

elif plan_type == 'Workout Plan' and submit_button_workout:
    st.session_state.user_text = ""
    # Prefer the key from the environment; otherwise keep the one set earlier
    openai.api_key = os.getenv("OPENAI_API_KEY", openai.api_key)
    # Start a fresh history for every request
    st.session_state.conversation = []

    question = f"Generate a workout plan for a {age_w}-year-old {gender_w}, height {height_w} cm, weight {weight_w} kg, weight goal: {weight_goal}. The workout plan should be tailored to the individual's needs. Generate it in proper markdown format."
    response = ask_chatgpt(question)

    st.session_state.conversation.append(response)

    st.write("Workout Plan:\n\n" + st.session_state.conversation[0])

1. Streamlit Setup:
  * Import the required libraries: streamlit, openai, os, and toml.
  * Initialize Streamlit app layout with st.title and set up the sidebar for plan selection.
2. Session State Initialization:
  * Initialize session state using Streamlit's st.session_state to store the conversation history.
3. Environment Variables:
  * Read the OpenAI API key from the OPENAI_API_KEY environment variable with os.getenv, which keeps the key out of the source code.
4. Sidebar Plan Selection:
  * Allow the user to choose between a "Diet Plan and Recipes" or a "Workout Plan" using a dropdown in the sidebar.
  * Collect user input for the selected plan type (diet or workout).
5. User Input Fields:
  * Inside the sidebar block, capture user details for the selected plan type using Streamlit input fields (e.g., age, gender, height, weight, goal, activity level for diet; age, gender, height, weight, weight goal for workout).
6. Generate Plan Button:
  * Include a button to trigger the generation of the selected plan (diet or workout).
7. Output Handling:
  * Outside the sidebar block, handle the output generation based on the selected plan type and the submission of the corresponding button.
  * For "Diet Plan and Recipes," use the OpenAI ChatGPT model to generate a diet plan and associated recipes. Display the generated content in proper markdown format.
  * For "Workout Plan," use the model to generate a personalized workout plan. Display the generated workout plan in proper markdown format.
8. Conversation History:
  * Use Streamlit's session state to hold the generated responses for display. Note that each button press resets this history and sends fresh, independent requests, so earlier exchanges are not passed back to the model.
9. OpenAI API Calls:
  * Utilize the OpenAI GPT-3.5 Turbo model for generating responses based on user prompts. The questions are formatted according to the plan type and user input.
10. Display Results:
  * Display the generated diet plan, recipes, or workout plan in the Streamlit app.
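
The prompt-formatting step in the list above can be sketched as follows. This is a minimal illustration, not the app's actual code: the helper names `build_prompt` and `build_messages`, the wording of the prompt, and the system message are all assumptions made for this example. Only the two plan-type labels and the role/content message structure come from the app described above.

```python
def build_prompt(plan_type, details):
    """Format user details into one prompt string (hypothetical helper)."""
    if plan_type not in ("Diet Plan and Recipes", "Workout Plan"):
        raise ValueError(f"Unknown plan type: {plan_type}")
    # Render each detail as "key: value", in the order the user supplied them
    profile = ", ".join(f"{key}: {value}" for key, value in details.items())
    if plan_type == "Diet Plan and Recipes":
        task = "Create a diet plan with recipes"
    else:
        task = "Create a weekly workout plan"
    return f"{task} for a person with these details: {profile}. Respond in markdown."


def build_messages(prompt):
    """Wrap the prompt in the role/content message list used by chat models."""
    return [
        # The system message wording is an assumption for illustration
        {"role": "system", "content": "You are a helpful health and fitness assistant."},
        {"role": "user", "content": prompt},
    ]


example = build_prompt(
    "Workout Plan",
    {"age": 24, "gender": "female", "height_cm": 165, "weight_kg": 54, "goal": "gain muscle"},
)
print(example)
```

Keeping the prompt construction in a plain function, separate from the Streamlit widgets and the API call, makes it easy to inspect exactly what is sent to the model for a given set of inputs.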

#### **Note:** Ensure that you have the necessary environment variables, including the OpenAI API key, to run this code successfully. Also, keep in mind OpenAI's usage policies and guidelines when making API requests.
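
A quick way to fail early when the key is missing is sketched below. The helper `get_api_key` is hypothetical, and the variable name `OPENAI_API_KEY` is an assumption (it is the conventional name, but your `.env` file may use a different one).

```python
import os


def get_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise a clear error (hypothetical helper)."""
    # Accept any mapping so the check can be exercised without touching os.environ
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; add it to your .env file or shell environment.")
    return key


# Usage with an explicit mapping; in the app, call get_api_key() after loading the .env file
print(get_api_key({"OPENAI_API_KEY": "sk-example"}))
```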

Utilizing the ChatGPT API for crafting a health and fitness app has demonstrated the versatility and responsiveness achievable through generative AI. The personalized diet plans and workout routines generated by ChatGPT exemplify the potential of AI in promoting individual well-being.

###Generated Data

In the following example, we entered an 18-year-old man's details to obtain a diet plan. The input section is on the left and the generated diet plan is on the right. The plan is broken down into meals, each with an example of what to eat, followed by simple reminders that encourage healthy habits.

![dietscreenshot](https://github.com/Draconian10/AI_Health_Assistant/assets/32498703/86d2999b-aa05-4e87-80bd-f5f948c3c6ad)

In the next example, we entered the details of a 24-year-old woman in a similar format. On the right-hand side is the generated workout plan, broken down by day of the week and followed by similar advice on healthy habits.

![fitnessscreenshot](https://github.com/Draconian10/AI_Health_Assistant/assets/32498703/b79147ee-b4d8-4d1e-b39b-7f0fbdf918c7)


#Evaluation and Justification

###Assessing the Effectiveness

The outputs generated by the ChatGPT model are well structured and clearly written. The model provides detailed example diet and workout plans, along with reminders about maintaining a healthy lifestyle.





###Validation of the Data using ROUGE



ROUGE, which stands for Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics for automatically evaluating machine-generated text, particularly in text summarization and machine translation. ROUGE measures the quality of a generated text by comparing it with one or more human-written reference texts.
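
To make the n-gram comparison concrete, here is a minimal hand-rolled sketch of ROUGE-N. It assumes simple lowercase whitespace tokenization and no stemming, so its numbers will not match the `rouge-score` package exactly. The function names and the two example sentences are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """Return (precision, recall, F1) for n-gram overlap (simplified sketch)."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    # Clipped overlap: each n-gram counts at most as often as it occurs in both texts
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return precision, recall, 2 * precision * recall / (precision + recall)


reference = "squats three sets of ten reps"
candidate = "squats four sets of ten reps"
print("ROUGE-1:", rouge_n(reference, candidate, n=1))
print("ROUGE-2:", rouge_n(reference, candidate, n=2))
```

In this toy pair, five of six unigrams match but only three of five bigrams do, which shows why ROUGE-2 is usually lower than ROUGE-1 for the same texts.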

In [None]:
# Install the rouge package
!pip install rouge-score

# Import necessary libraries
from rouge_score import rouge_scorer

# Example of reference and generated summaries
reference_summary = """# Workout Plan for Muscle Gain

## Personal Information:
- **Name:** [Individual's Name]
- **Age:** 24
- **Gender:** Female
- **Height:** 165 cm
- **Weight:** 54 kg
- **Weight Goal:** Gain Muscle

## Workout Schedule:

### Day 1: Upper Body
1. **Warm-up (10 minutes):**
   - Jumping jacks, arm circles, light cardio.

2. **Strength Training:**
   - Bench Press: 3 sets x 10 reps
   - Bent-over Rows: 3 sets x 12 reps
   - Overhead Shoulder Press: 3 sets x 10 reps

3. **Isolation Exercises:**
   - Bicep Curls: 3 sets x 12 reps
   - Tricep Dips: 3 sets x 15 reps

4. **Core Work:**
   - Plank: 3 sets x 30 seconds
   - Russian Twists: 3 sets x 15 reps each side

### Day 2: Lower Body
1. **Warm-up (10 minutes):**
   - Jumping jacks, leg swings, light cardio.

2. **Strength Training:**
   - Squats: 4 sets x 12 reps
   - Deadlifts: 3 sets x 10 reps
   - Lunges: 3 sets x 15 reps each leg

3. **Isolation Exercises:**
   - Leg Press: 3 sets x 12 reps
   - Leg Curls: 3 sets x 15 reps

4. **Core Work:**
   - Leg Raises: 3 sets x 15 reps
   - Bicycle Crunches: 3 sets x 20 reps

### Day 3: Rest or Active Recovery
- Light activities such as walking, yoga, or stretching.

### Day 4: Full Body
1. **Warm-up (10 minutes):**
   - Jumping jacks, dynamic stretches, light cardio.

2. **Compound Movements:**
   - Deadlifts: 3 sets x 10 reps
   - Pull-ups or Lat Pulldowns: 3 sets x 10 reps
   - Push-ups: 3 sets x 15 reps

3. **Isolation Exercises:**
   - Hammer Curls: 3 sets x 12 reps
   - Tricep Kickbacks: 3 sets x 15 reps

4. **Core Work:**
   - Plank Variations: Side plank, high plank, 3 sets x 30 seconds each.

### Day 5: Cardio and Active Recovery
- Cardiovascular exercise of choice (running, cycling, swimming) for 30-40 minutes.

### Day 6: Rest or Active Recovery
- Focus on restorative activities to allow muscles to recover.

### Day 7: Rest
- Complete rest day.

## Notes:
- **Progression:** Increase weights gradually to challenge muscles.
- **Nutrition:** Consume a balanced diet with a focus on protein for muscle repair.
- **Hydration:** Stay well-hydrated throughout the day.
- **Sleep:** Aim for 7-8 hours of quality sleep for optimal recovery.

Remember to listen to your body, and if you have any medical conditions or concerns, consult with a healthcare professional or fitness expert before starting this workout plan.
"""
generated_summary = """Certainly! Here's a sample workout plan tailored for a 24-year-old female with a height of 165 cm, weight of 54 kg, and a goal to gain muscle. This is a general plan, and it's always advisable to consult with a fitness professional or healthcare provider before starting a new workout program.

```markdown
# Workout Plan for a 24-year-old Female (Goal: Gain Muscle)

## Personal Information
- **Age:** 24
- **Gender:** Female
- **Height:** 165 cm
- **Weight:** 54 kg
- **Weight Goal:** Gain Muscle

## Overview
This workout plan focuses on a combination of strength training and cardiovascular exercises to help achieve the goal of gaining muscle. The plan is designed for four workout days per week with one day of active rest or light activity.

### Day 1: Full Body Strength
- **Warm-up:** 10 minutes of light cardio (jogging, jumping jacks)
- **Strength Training:**
  - Squats: 3 sets x 8-10 reps
  - Bench Press: 3 sets x 8-10 reps
  - Bent Over Rows: 3 sets x 10 reps
  - Plank: 3 sets x 30 seconds
- **Cardio:** 15 minutes of moderate-intensity cardio (running, cycling)

### Day 2: Active Rest or Light Activity
- **Active Rest:** Yoga, walking, or light stretching

### Day 3: Upper Body Strength
- **Warm-up:** 10 minutes of light cardio
- **Strength Training:**
  - Overhead Press: 3 sets x 8-10 reps
  - Pull-Ups or Lat Pulldowns: 3 sets x 8-10 reps
  - Tricep Dips: 3 sets x 12 reps
  - Plank: 3 sets x 30 seconds
- **Cardio:** 15 minutes of moderate-intensity cardio

### Day 4: Active Rest or Light Activity
- **Active Rest:** Yoga, walking, or light stretching

### Day 5: Lower Body Strength
- **Warm-up:** 10 minutes of light cardio
- **Strength Training:**
  - Deadlifts: 3 sets x 8-10 reps
  - Lunges: 3 sets x 12 reps (each leg)
  - Leg Press: 3 sets x 10 reps
  - Plank: 3 sets x 30 seconds
- **Cardio:** 15 minutes of moderate-intensity cardio

### Day 6: Cardiovascular Workout
- **Cardio:** 30-45 minutes of your favorite cardiovascular exercise (running, cycling, swimming)

### Day 7: Rest
- **Rest day:** Allow your body to recover.

## Important Notes
- Ensure proper form during strength training exercises to prevent injury.
- Gradually increase weights and intensity as your strength improves.
- Stay hydrated and maintain a balanced diet to support muscle growth.
- Listen to your body, and if you experience pain beyond normal muscle soreness, consult with a fitness professional or healthcare provider.

Remember, consistency is key to seeing results. Adjust the intensity and duration based on your fitness level and how your body responds. If in doubt, seek guidance from a fitness professional.
```

Feel free to modify the plan based on personal preferences, fitness level, and any specific considerations or restrictions.
"""

# Initialize the ROUGE scorer; 'rougeLsum' is the summary-level ROUGE-L variant (LCS computed per newline-separated line)
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeLsum'], use_stemmer=True)

# Calculate ROUGE scores
rouge_scores = scorer.score(reference_summary, generated_summary)

# Print the ROUGE scores
print("ROUGE-1 F1 Score:", rouge_scores['rouge1'].fmeasure)
print("ROUGE-2 F1 Score:", rouge_scores['rouge2'].fmeasure)
print("ROUGE-L F1 Score:", rouge_scores['rougeLsum'].fmeasure)

ROUGE-1 F1 Score: 0.5739130434782609
ROUGE-2 F1 Score: 0.32876712328767127
ROUGE-L F1 Score: 0.5540372670807454


Interpreting these scores involves understanding the trade-off between precision and recall. A higher F1 score indicates a better balance between precision (how much of the generated text also appears in the reference) and recall (how much of the reference is covered by the generated text). These scores are useful for evaluating text generation models, especially in tasks such as summarization, where capturing the essence of the reference text is essential. However, ROUGE scores may not fully capture the quality, coherence, or fluency of the generated text. Let's interpret the scores:
1. ROUGE-1 F1 Score: 0.5739130434782609
    
    This score is based on the overlap of unigrams (single words) between the generated text and the reference text. The F1 score of 0.57 is the harmonic mean of unigram precision (the share of words in the generated text that also appear in the reference) and unigram recall (the share of words in the reference that also appear in the generated text). A higher ROUGE-1 score means the two texts share more of their vocabulary.
2. ROUGE-2 F1 Score: 0.32876712328767127
    
    This score is based on the overlap of bigrams (two consecutive words) between the generated text and the reference text. The F1 score of 0.33 is the harmonic mean of bigram precision and bigram recall. ROUGE-2 is stricter than ROUGE-1 because matching words must also appear side by side in the same order, so it is typically lower.
3. ROUGE-L F1 Score: 0.5540372670807454

    The ROUGE-L score is based on the longest common subsequence (LCS) between the generated text and the reference text. The F1 score of 0.55 is the harmonic mean of the LCS length divided by the length of the generated text (precision) and the LCS length divided by the length of the reference (recall). ROUGE-L is more flexible than ROUGE-2: matched words must appear in the same order but need not be consecutive. Note that the code above requests `rougeLsum`, the summary-level variant, which computes the LCS line by line.
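
The longest-common-subsequence idea behind ROUGE-L can be sketched in a few lines. This is a simplified sentence-level version that assumes lowercase whitespace tokenization. It is not the `rougeLsum` computation from the `rouge-score` package, and the function names and example sentences are made up for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the match diagonally, or carry over the best result so far
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """Return (precision, recall, F1) based on LCS length (simplified sketch)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return precision, recall, 2 * precision * recall / (precision + recall)


reference = "warm up then squats then plank"
candidate = "warm up squats and plank"
print(rouge_l(reference, candidate))
```

Here the longest common subsequence is "warm up squats plank" (four words), even though those words are not adjacent in either sentence. This is the flexibility that distinguishes ROUGE-L from ROUGE-2.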

These ROUGE scores indicate moderate overlap between the reference and generated outputs, with F1 values between roughly 0.33 and 0.57. The gap is largely explained by differences in the plans themselves: the two texts assign different exercises to different days. This variability is expected behavior for ChatGPT and is one of its strengths. Reading the generated and reference outputs side by side, we see that their format and advice are similar and appear sound. We therefore consider the generated data acceptable for this use case.

###Potential Applications of the Generated Data in Data Science tasks
Using the ChatGPT model through the OpenAI API, we generated a comprehensive plan for diet and fitness. This plan draws on information the model learned from the internet, and it is different every time we run the code. One issue that may arise is that, because the model is trained on data found on the internet, it may be susceptible to recommending fad diets that are unhealthy or dangerous. Users need to be cautious and critically evaluate the information provided. The model can generate coherent and contextually relevant responses over extended lengths of text. However, coherence may vary, and long passages may lose focus or drift from the initial prompt.

This generated data can help people who are working towards a better lifestyle and do not know where to start. By providing meal ideas based on their weight goal, we can mimic the advice a personal trainer would give, at no additional cost to the user.

From the generated data, some potential insights we gain are:
1. Personalized Recommendations:

  Tailored diet and exercise plans based on individual characteristics such as age, gender, weight, height, and health goals.
2. Nutritional Guidance:

  Information on the nutritional content of recommended foods and details about macronutrients (carbohydrates, proteins, fats), micronutrients, and calorie intake.
3. Fitness Routine Optimization:

  Optimized workout routines that align with the user's fitness level, preferences, and goals, taking into account factors like strength training, cardio, and flexibility exercises.
4. Educational Content:

  Informational insights to enhance the user's understanding of nutrition, exercise physiology, and the science behind the recommended plans.

By harnessing the potential of ChatGPT, we have successfully developed a Health and Fitness Assistant, capable of providing personalized diet and workout plans tailored to individual user specifications. The utilization of this cutting-edge technology not only showcases the advancements in artificial intelligence but also highlights the practical implications in promoting health and well-being.

#Conclusion

In conclusion, this paper has explored the transformative potential of generative artificial intelligence, particularly the transformer architecture, through the development of a Health and Fitness Assistant using OpenAI's ChatGPT API. This innovation exemplifies the versatility and adaptability of generative models in creating personalized solutions, in this case, for health and fitness. The application provides tailored diet and workout plans, showcasing how AI advancements can be practically applied to enhance health and well-being. Overall, this work highlights the significant role of generative AI in shaping the future of personal health management, marking a step towards an era where AI is a key contributor to our holistic well-being.

# References

1. Scikit-learn official documentation
2. Towards Data Science
3. ELI5 official documentation
4. https://www.zdnet.com/article/what-is-generative-ai-and-why-is-it-so-popular-heres-everything-you-need-to-know/
5. https://www.linkedin.com/pulse/rise-transformers-why-sudden-jump-ai-capabilities-steve-wilson/

The algorithms were taken directly from the **scikit-learn official documentation**. Visualization was adapted from the Machine Learning with scikit-learn Quick Start Guide and **Towards Data Science** (How do you check the quality of your regression model in Python?). The remaining code was written independently. The feature importance reference is taken from the **ELI5 official documentation**.

# License
Copyright [2023]

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.