import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import LSTM, Bidirectional, Dense, Dropout, Embedding
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Assumes train.csv and test.csv are in the working directory and contain the columns read below
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

X_train = train_data["sequence"].values
y_train = train_data["target"].values
X_test = test_data["sequence"].values

# Tokenization and padding
tokenizer = tf.keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(X_train)  # Fit on training data

X_train_encoded = tokenizer.texts_to_sequences(X_train)
X_test_encoded = tokenizer.texts_to_sequences(X_test)

max_length = 500  # Fixed sequence length after padding; tuning this value changed the prediction score more than any other setting
X_train_padded = pad_sequences(X_train_encoded, maxlen=max_length, padding='post')
X_test_padded = pad_sequences(X_test_encoded, maxlen=max_length, padding='post')

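def create_model():
    """Build and compile the stacked LSTM regression model."""
    model = Sequential([
        # +1 because index 0 is reserved for padding
        Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=64, input_length=max_length),
        # return_sequences=True so the next LSTM receives the full sequence
        Bidirectional(LSTM(64, return_sequences=True)),
        Dropout(0.5),
        LSTM(32),
        Dropout(0.5),
        Dense(64, activation='relu'),
        Dense(1)  # Single linear output for regression
    ])
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model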

# Train and predict with the Keras model directly, without a scikit-learn wrapper or grid search
model = create_model()

# Fit the model
model.fit(X_train_padded, y_train, epochs=20, batch_size=32, validation_split=0.2, verbose=1)

# Generating the predictions
test_predictions = model.predict(X_test_padded)

# Creating submission DataFrame
submission_df = pd.DataFrame({'id': test_data['id'], 'target': test_predictions.flatten()})
submission_df.to_csv('prediction.csv', index=False)





Overview
This project trains a Long Short-Term Memory (LSTM) regression model that predicts a numeric target from a character sequence. The model is built with TensorFlow's Keras API: sequences are tokenized at the character level, padded to a fixed length, and passed through stacked LSTM layers.

Requirements
Python 3.8+
pandas
numpy
scikit-learn
TensorFlow 2.x

Installation
To set up the necessary environment:

Install Python 3.8 or newer.
Install the required Python packages using pip:
pip install pandas numpy scikit-learn tensorflow

Dataset
Place train.csv and test.csv in the same directory as the script. Both must be CSV files with a header row, where:

train.csv contains the columns sequence and target.
test.csv contains the columns id and sequence.
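
A quick way to catch a malformed input file before training is to check the CSV header first. The sketch below uses only the standard library. The helper name missing_columns is hypothetical, and the required column sets are inferred from what the script reads (sequence and target for training, id and sequence for test), not from a formal specification.

```python
import csv
import io

# Columns the script reads from each file (inferred from the code, not a formal spec).
REQUIRED_COLUMNS = {
    "train.csv": {"sequence", "target"},
    "test.csv": {"id", "sequence"},
}


def missing_columns(csv_file, required):
    """Return the required column names that are absent from the CSV header row."""
    reader = csv.reader(csv_file)
    header = next(reader, [])  # An empty file yields an empty header
    return set(required) - {name.strip() for name in header}


# Example: an in-memory file standing in for a test.csv that lacks the id column.
sample = io.StringIO("sequence\nACGT\n")
print(missing_columns(sample, REQUIRED_COLUMNS["test.csv"]))  # {'id'}
```

To check a real file, pass an open handle instead, for example open("train.csv", newline=""). An empty result means every required column is present.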
Usage
Run the script using the following command:

python lstm_sequence_prediction.py
This trains the model and writes the predictions to prediction.csv, with the columns id and target, in the same directory.

Files
lstm_sequence_prediction.py: Main Python script for the LSTM model.
train.csv: Training data file.
test.csv: Test data file for which predictions will be made.

Model Details
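Model Architecture: A 64-dimensional embedding layer, a bidirectional LSTM (64 units per direction), a second LSTM (32 units), a 64-unit ReLU dense layer, and a single linear output unit, with 50% dropout after each LSTM layer.
Training: 20 epochs with the Adam optimizer, mean squared error loss, and a batch size of 32, holding out 20% of the training data for validation.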
Output: Predictions are saved in prediction.csv.
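
The preprocessing that feeds the embedding layer can be sketched in plain Python to make each step visible. This is a simplified stand-in for the Keras Tokenizer and pad_sequences calls in the script, not their actual implementation, and the function names are hypothetical. It assumes the Keras defaults of lowercasing the text and of dropping the leading tokens of over-long sequences, which is worth confirming against the installed version.

```python
from collections import Counter


def build_char_index(texts):
    """Map each character to an integer id, most frequent first, starting at 1."""
    counts = Counter(ch for text in texts for ch in text.lower())
    # Id 0 is left unused so it can serve as the padding value.
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}


def encode(text, char_index):
    """Convert a string to a list of ids, skipping characters not seen in training."""
    return [char_index[ch] for ch in text.lower() if ch in char_index]


def pad_post(ids, maxlen):
    """Right-pad with zeros to maxlen (a positive int); if too long, keep the last maxlen ids."""
    ids = ids[-maxlen:]
    return ids + [0] * (maxlen - len(ids))


char_index = build_char_index(["AAB", "AC"])
print(char_index)                    # {'a': 1, 'b': 2, 'c': 3}
print(len(char_index) + 1)           # 4, the embedding input_dim (ids 0..3)
print(pad_post(encode("ABCD", char_index), 5))  # [1, 2, 3, 0, 0]
print(pad_post([1, 2, 3], 2))        # [2, 3]
```

The last line shows the consequence of the assumed truncation default: with only padding='post' set, a sequence longer than max_length would keep its final characters and lose its first ones.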
