In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/llm-prompt-recovery/sample_submission.csv
/kaggle/input/llm-prompt-recovery/train.csv
/kaggle/input/llm-prompt-recovery/test.csv


In [2]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [3]:
# Load the dataset
train_df = pd.read_csv("/kaggle/input/llm-prompt-recovery/train.csv")
test_df = pd.read_csv("/kaggle/input/llm-prompt-recovery/test.csv")
sample_submission = pd.read_csv("/kaggle/input/llm-prompt-recovery/sample_submission.csv")

In [4]:
# Preprocess the data
X_train = train_df['original_text']
y_train = train_df['rewrite_prompt']
X_test = test_df['original_text']

In [5]:
# Vectorize the text data
vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1, 3))
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

In [6]:
unique_prompts = train_df['rewrite_prompt'].unique()
print(unique_prompts)

['Convert this into a sea shanty: """The competition dataset comprises text passages that have been rewritten by the Gemma LLM according to some rewrite_prompt instruction. The goal of the competition is to determine what prompt was used to rewrite each original text.  Please note that this is a Code Competition. When your submission is scored, this example test data will be replaced with the full test set. Expect roughly 2,000 original texts in the test set."""']
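The output above shows a single distinct prompt, which matters for the classification approach used later: a classifier can only ever return labels it has seen. A minimal sketch of a sanity check follows, using only the standard library. The helper name `summarize_labels` and the `prompts` list are hypothetical stand-ins for `train_df['rewrite_prompt'].tolist()`.

```python
from collections import Counter


def summarize_labels(labels):
    """Return (number of distinct labels, most common label, its count)."""
    counts = Counter(labels)
    n_classes = len(counts)
    top_label, top_count = counts.most_common(1)[0]
    return n_classes, top_label, top_count


# Hypothetical stand-in for train_df['rewrite_prompt'].tolist()
prompts = ["Convert this into a sea shanty"]

n_classes, top_label, top_count = summarize_labels(prompts)
if n_classes < 2:
    # With one class, every prediction is necessarily that same prompt.
    print(f"Only {n_classes} distinct prompt; every prediction will be {top_label!r}")
```

Running a check like this before training makes it obvious when the training set is too small to learn anything beyond a constant prediction.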


In [7]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Load the training data
train_data = pd.read_csv("/kaggle/input/llm-prompt-recovery/train.csv")

# Load the test data
test_data = pd.read_csv("/kaggle/input/llm-prompt-recovery/test.csv")

# Preprocess the data
X_train = train_data['original_text']
y_train = train_data['rewrite_prompt']
X_test = test_data['original_text']

# Vectorize the text data
vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1, 3))
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train a Random Forest classifier
classifier = RandomForestClassifier(n_estimators=100, random_state=42)
classifier.fit(X_train_vec, y_train)

# Predict on the test data
test_preds = classifier.predict(X_test_vec)

# Create submission file
submission = test_data[['id']].copy()
submission['rewrite_prompt'] = test_preds

# Save submission file to the output directory
submission.to_csv("/kaggle/working/submission.csv", index=False)

In [8]:
# Display the first few rows of the training dataset
print("Training Dataset:")
print(train_data.head())

# Display the first few rows of the test dataset
print("\nTest Dataset:")
print(test_data.head())

Training Dataset:
   id                                      original_text  \
0  -1  The competition dataset comprises text passage...   

                                      rewrite_prompt  \
0  Convert this into a sea shanty: """The competi...   

                                      rewritten_text  
0  Here is your shanty: (Verse 1) The text is rew...  

Test Dataset:
   id                                      original_text  \
0  -1  The competition dataset comprises text passage...   

                                      rewritten_text  
0  Here is your shanty: (Verse 1) The text is rew...  


In [9]:
# Display the headers of the training dataset
print("Training Dataset Headers:")
print(train_data.columns)

# Display the headers of the test dataset
print("\nTest Dataset Headers:")
print(test_data.columns)

Training Dataset Headers:
Index(['id', 'original_text', 'rewrite_prompt', 'rewritten_text'], dtype='object')

Test Dataset Headers:
Index(['id', 'original_text', 'rewritten_text'], dtype='object')
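Both files include `rewritten_text`, which the cells above do not use. One possible way to let a vectorizer see both passages is to join them into a single string per row. The sketch below is an assumption about how that could be done, not part of the notebook: the helper `combine_texts`, the `[SEP]` separator, and the sample rows are all made up for illustration.

```python
def combine_texts(original, rewritten, sep=" [SEP] "):
    """Join the original and rewritten passages into one feature string.

    None values are treated as empty strings so missing cells do not raise.
    """
    return f"{original or ''}{sep}{rewritten or ''}"


# Hypothetical rows shaped like test.csv (id, original_text, rewritten_text)
rows = [
    {"id": 0, "original_text": "The meeting is at noon.",
     "rewritten_text": "Arr, we gather at midday!"},
    {"id": 1, "original_text": "Please send the report.",
     "rewritten_text": None},
]

features = [combine_texts(r["original_text"], r["rewritten_text"]) for r in rows]
print(features[0])
```

With pandas, the same idea could be written as `test_data['original_text'].fillna('') + ' [SEP] ' + test_data['rewritten_text'].fillna('')`, and the resulting series passed to the vectorizer in place of `original_text`.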



1. **Project Overview**: The goal of the project is to recover the prompt used to rewrite a given text. In other words, you're given original texts that have been rewritten by a language model (LLM) according to some rewrite prompt instruction. Your task is to determine what prompt was used to rewrite each original text.

2. **Dataset**: The dataset provided consists of two main files: `train.csv` and `test.csv`. The `train.csv` file contains original texts along with their rewritten versions and the corresponding rewrite prompts (columns `id`, `original_text`, `rewrite_prompt`, `rewritten_text`). The `test.csv` file contains the original and rewritten texts (`id`, `original_text`, `rewritten_text`) but no `rewrite_prompt`, which is the column you need to predict.

3. **Algorithm**: This notebook treats prompt recovery as supervised text classification: each distinct `rewrite_prompt` string is a class label, and a model predicts that label from features of the text. A classifier can only return prompts it saw during training, so the approach depends on the training set covering the prompts that appear at test time. Commonly used algorithms for text classification include:
   - Logistic Regression
   - Random Forest
   - Support Vector Machines (SVM)
   - Gradient Boosting
   - Neural Networks (e.g., LSTM, BERT)

4. **Approach**:
   - **Data Preprocessing**: The texts are preprocessed, which may include tokenization, stopword removal, and vectorization (e.g., TF-IDF or word embeddings). The notebook above applies only TF-IDF to `original_text`, using 1- to 3-grams capped at 10,000 features.
   - **Feature Engineering**: Features are extracted from the preprocessed texts to represent them numerically.
   - **Model Training**: A classification model (e.g., Random Forest, Logistic Regression) is trained using the preprocessed data, with the rewrite prompts as the target variable.
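   - **Model Evaluation**: The trained model is evaluated on a held-out validation set using appropriate metrics (e.g., accuracy, F1-score). The notebook above imports `train_test_split` and `accuracy_score` but does not yet call them, so this step is still to be added.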
   - **Prediction**: The trained model is then used to predict the rewrite prompts for the original texts in the test dataset.
   - **Submission**: The predicted rewrite prompts are submitted as the final output for evaluation.

5. **Evaluation Metric**: The competition defines the official scoring metric, so check its evaluation rules before optimizing. For local validation of a classifier, common metrics include accuracy, precision, recall, and F1-score.

6. **Iterative Process**: You might iterate on the steps mentioned above, trying different algorithms, feature engineering techniques, and hyperparameters to improve the model's performance.

7. **Submission**: Once you're satisfied with the performance of your model on the validation set, you submit your predictions for the test dataset. The competition organizers evaluate your predictions against the ground truth and provide a score based on the specified evaluation metric.

In summary, this project involves using supervised machine learning techniques to classify rewrite prompts based on original texts, and various algorithms such as Random Forest or Logistic Regression can be used for this task. The choice of algorithm and the success of the project depend on factors such as data quality, feature engineering, and model selection.
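The preprocessing, training, and prediction steps summarized above can be sketched as a single scikit-learn pipeline. This is a minimal illustration under stated assumptions, not the notebook's exact code: the texts and the two prompts are invented toy data, and `LogisticRegression` is swapped in for the Random Forest used earlier because it is one of the listed alternatives.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: two made-up prompts with three example texts each.
texts = [
    "Yo ho ho, the sails be full and the rum be gone",
    "Heave away, me hearties, the harbor lies ahead",
    "Haul the anchor, lads, and sing to the sea",
    "Dear colleagues, please find the quarterly report attached",
    "Kindly confirm your attendance at the scheduled meeting",
    "We regret to inform you that the request was declined",
]
prompts = [
    "Convert this into a sea shanty",
    "Convert this into a sea shanty",
    "Convert this into a sea shanty",
    "Rewrite this as a formal email",
    "Rewrite this as a formal email",
    "Rewrite this as a formal email",
]

# Same TF-IDF settings as the notebook; the classifier choice is an assumption.
model = make_pipeline(
    TfidfVectorizer(max_features=10000, ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, prompts)

new_texts = [
    "Sing, me hearties, as the sails fill",
    "Please find the report attached for the meeting",
]
preds = model.predict(new_texts)
for text, pred in zip(new_texts, preds):
    print(f"{text!r} -> {pred!r}")
```

Wrapping the vectorizer and classifier in one pipeline keeps the fitted vocabulary and the model together, so the same object handles both training and test text without a separate `transform` call.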

### Introduction:
The project focuses on recovering the prompts used to rewrite given texts. It is built around a machine learning competition (the `llm-prompt-recovery` dataset on Kaggle) aimed at exploring effective prompt engineering for large language models (LLMs). The dataset comprises original texts paired with rewritten versions generated by Gemma, a family of open LLMs developed by Google. Participants are challenged to determine the prompt used for each rewrite.

### Methodology:
1. **Data Preparation**: The training data consists of original texts and their rewritten versions, paired with rewrite prompts; the test data has the same text columns without the prompts. The original and rewritten texts are preprocessed for further analysis.

2. **Feature Engineering**: Textual features are extracted from the preprocessed data. Techniques like TF-IDF vectorization or word embeddings may be employed to represent the text data numerically.

3. **Model Selection**: Supervised learning algorithms such as Logistic Regression, Random Forest, or Gradient Boosting are considered for prompt recovery. Models are trained on the preprocessed dataset, with the rewrite prompts as the target variable.

4. **Model Evaluation**: The trained models are evaluated using appropriate metrics like accuracy or F1-score. Cross-validation techniques may be employed to ensure robust performance assessment.

5. **Hyperparameter Tuning**: Hyperparameters of the selected models are fine-tuned to optimize performance on the validation set.

6. **Prediction and Submission**: The best-performing model is used to predict rewrite prompts for the test dataset. The predictions are then submitted for evaluation against the ground truth.
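To make the evaluation step concrete, here is a small dependency-free sketch of accuracy and macro-averaged F1 for string labels. The label names and predictions are invented for illustration; in practice `sklearn.metrics.accuracy_score` and `sklearn.metrics.f1_score` compute the same quantities.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical validation labels and predictions
y_true = ["shanty", "shanty", "formal", "formal"]
y_pred = ["shanty", "formal", "formal", "formal"]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

Macro F1 weights every prompt class equally, which is useful when some prompts are much rarer than others and plain accuracy would hide poor performance on them.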

### Results Discussion:
The performance of the models is analyzed based on evaluation metrics specified by the competition organizers. The results are discussed in terms of the effectiveness of various algorithms, feature engineering techniques, and prompt recovery strategies. Insights into the challenges encountered and successful methodologies employed during the competition are highlighted.

### Conclusion:
The project concludes with a summary of key findings and lessons learned. It discusses the implications of the results for NLP workflows and points to further research avenues in prompt engineering for LLMs. The competition serves as a platform for exploring innovative approaches to text rewriting tasks and contributes to advancing the field of natural language processing.

### Consultation:
For further inquiries or consultation regarding the project methodology, results, or implications, feel free to reach out to the project team or experts in the field of natural language processing. Collaboration opportunities, research partnerships, or discussions on related topics can also be explored to deepen the understanding of the findings and foster knowledge exchange within the community.