### Alien Language Model Training Data Preparation

#### Introduction

This notebook prepares the training data for our alien language model. The data consists of English sentences paired with their alien translations. We load the CSV, inspect it for missing values and duplicates, augment it with additional sentence pairs, tokenize and pad the text for model training, and save the result.

#### Loading the Data

In [3]:
import pandas as pd

# Load the training data from CSV file
file_path = "../../data/raw/alien_language_training_data.csv"
df = pd.read_csv(file_path)

# Display the first few rows of the dataframe
df.head()

,English Sentence,Alien Translation
0,"Hello, how are you?","vorp, zirg morp minz? 😂"
1,I'm from Earth.,minz gxorm 6538-ccz. 🌌
2,What's your name?,zulr morp minz? 😊
3,I like this place.,minz zurk zulr gxorm. 🙏
4,Do you need help?,morp minz zulr zurk? 😲


#### Data Inspection

In [4]:
# Check for any missing values
print("Missing values in each column:")
print(df.isnull().sum())

# Check the data types
print("Data types of each column:")
print(df.dtypes)

# Check for duplicates
print(f"Number of duplicate rows: {df.duplicated().sum()}")

Missing values in each column:
English Sentence     0
Alien Translation    0
dtype: int64
Data types of each column:
English Sentence     object
Alien Translation    object
dtype: object
Number of duplicate rows: 0


#### Data Augmentation

In [7]:
# Additional training pairs
additional_data = {
    "English Sentence": [
        "What's the weather like?",
        "Do you like music?",
        "I am tired.",
        "This is beautiful.",
        "Can you help me?",
        "Where are we?",
        "I am lost.",
        "That's great!",
        "What do you think?",
        "Tell me a joke.",
        "I love this!",
        "Please don't go.",
        "Are you coming?",
        "Let's play a game.",
        "I can't believe it!",
        "It's raining.",
        "How old are you?",
        "Do you have any pets?",
        "This is incredible!",
        "Where is the bathroom?"
    ],
    "Alien Translation": [
        "zulr minz zurk? 🌌",
        "morp minz zurk? 😊",
        "minz zorp. 😢",
        "zulr morp zurk. 😊",
        "morp minz zulr? 🙏",
        "zulr morp minz? 🌌",
        "minz zorp zulr. 😢",
        "zulr morp vorp! 😂",
        "morp minz zulr? 😊",
        "vorp, zulr morp! 😂",
        "minz zulr vorp! 😊",
        "vorp minz zorp. 😢",
        "morp minz zulr? 😊",
        "zulr morp vorp. 😊",
        "minz zulr zorp! 😲",
        "zulr minz zurk. 🌌",
        "zulr morp minz? 😊",
        "morp minz zurk? 😊",
        "zulr morp vorp! 😲",
        "zulr minz morp? 🌌"
    ]
}

# Convert additional data to DataFrame and append to existing DataFrame
additional_df = pd.DataFrame(additional_data)
df = pd.concat([df, additional_df], ignore_index=True)

# Remove any duplicates after augmentation
df.drop_duplicates(inplace=True)

# Display the updated DataFrame
df.tail()

,English Sentence,Alien Translation
35,It's raining.,zulr minz zurk. 🌌
36,How old are you?,zulr morp minz? 😊
37,Do you have any pets?,morp minz zurk? 😊
38,This is incredible!,zulr morp vorp! 😲
39,Where is the bathroom?,zulr minz morp? 🌌
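
Because the augmentation cell builds the DataFrame from two parallel lists, a quick check before calling `pd.DataFrame(additional_data)` can catch a missing or blank entry early (pandas raises a `ValueError` when the lists differ in length). The helper below is a hypothetical sketch using only the standard library; the name `check_parallel_pairs` and the sample dictionary are illustrative and not part of the notebook.

```python
def check_parallel_pairs(data):
    """Return a list of problems found in a dict of parallel lists (hypothetical helper)."""
    problems = []
    # All columns must have the same number of entries to line up row by row.
    lengths = {column: len(values) for column, values in data.items()}
    if len(set(lengths.values())) > 1:
        problems.append(f"column lengths differ: {lengths}")
    # Every entry should be a non-blank string.
    for column, values in data.items():
        for position, value in enumerate(values):
            if not isinstance(value, str) or not value.strip():
                problems.append(f"{column}[{position}] is empty or not a string")
    return problems


sample = {
    "English Sentence": ["I am tired.", "I am lost."],
    "Alien Translation": ["minz zorp. 😢"],  # one translation left out on purpose
}
print(check_parallel_pairs(sample))
```

An empty list means the pairs line up and the dictionary can be passed to `pd.DataFrame` safely.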


#### Data Preprocessing

In [8]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

# Tokenize English sentences
df['English Tokenized'] = df['English Sentence'].apply(lambda x: word_tokenize(x.lower()))

# Tokenize Alien translations
df['Alien Tokenized'] = df['Alien Translation'].apply(lambda x: word_tokenize(x.lower()))

# Display the tokenized columns
df.head()

# Padding sequences to uniform length (if needed for model training)
from tensorflow.keras.preprocessing.sequence import pad_sequences

# pad_sequences expects integers, so map each token to an integer ID (0 is reserved for padding)
def build_vocab(token_lists):
    tokens = sorted({tok for toks in token_lists for tok in toks})
    return {tok: i for i, tok in enumerate(tokens, start=1)}

english_vocab = build_vocab(df['English Tokenized'])
alien_vocab = build_vocab(df['Alien Tokenized'])

# Determine max length for padding
max_length = max(df['English Tokenized'].apply(len).max(), df['Alien Tokenized'].apply(len).max())

# Padding function: encode tokens as integer IDs, then pad with zeros
def pad_sequence(seq, vocab, maxlen):
    ids = [vocab[tok] for tok in seq]
    return pad_sequences([ids], maxlen=maxlen, padding='post', truncating='post')[0]

df['English Padded'] = df['English Tokenized'].apply(lambda x: pad_sequence(x, english_vocab, max_length))
df['Alien Padded'] = df['Alien Tokenized'].apply(lambda x: pad_sequence(x, alien_vocab, max_length))

# Display the padded sequences
df.head()

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ckand\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.

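
`pad_sequences` operates on sequences of integers, so lists of string tokens such as `['hello', ',', 'how', ...]` have to be converted to integer IDs before they can be padded. The sketch below shows that encode-then-pad step in plain Python, without TensorFlow. The names `make_token_ids` and `to_padded_ids`, and the choice of 0 as the padding ID, are assumptions for illustration rather than part of the notebook.

```python
def make_token_ids(token_lists):
    """Assign each distinct token an integer ID, starting at 1 (0 is kept for padding)."""
    token_ids = {}
    for tokens in token_lists:
        for token in tokens:
            if token not in token_ids:
                token_ids[token] = len(token_ids) + 1
    return token_ids


def to_padded_ids(tokens, token_ids, maxlen, pad_id=0):
    """Encode tokens as IDs, truncate at the end, then pad at the end with pad_id."""
    ids = [token_ids[token] for token in tokens][:maxlen]
    return ids + [pad_id] * (maxlen - len(ids))


tokenized = [
    ["hello", ",", "how", "are", "you", "?"],
    ["do", "you", "need", "help", "?"],
]
token_ids = make_token_ids(tokenized)
longest = max(len(tokens) for tokens in tokenized)
padded = [to_padded_ids(tokens, token_ids, longest) for tokens in tokenized]
print(padded)
```

Truncating and padding at the end mirrors the `padding='post', truncating='post'` arguments used in the cell above. Keeping the token-to-ID mapping around is what makes it possible to decode model output back into tokens later.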

#### Save Augmented Data

In [None]:
# Save the augmented and preprocessed data to a new CSV file
augmented_file_path = "path_to_save_augmented_csv/alien_language_augmented_data.csv"
df.to_csv(augmented_file_path, index=False)

# Save data in JSON format for flexibility
json_file_path = "path_to_save_json/alien_language_augmented_data.json"
df.to_json(json_file_path, orient='records', lines=True)

# Display message
print(f"Augmented data saved to {augmented_file_path} and {json_file_path}")
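
Once the frame contains list-valued columns such as the tokenized sentences, the two output formats behave differently on reload: CSV stores each list as its string representation, while JSON Lines keeps it as a list. The sketch below illustrates this on a one-row toy frame written to a temporary directory; the file names and the sample row are placeholders, not the notebook's data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# One-row toy frame with a list-valued column, like 'English Tokenized'
toy = pd.DataFrame({
    "English Sentence": ["I am tired."],
    "English Tokenized": [["i", "am", "tired", "."]],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "toy.csv"
    json_path = Path(tmp) / "toy.json"

    # Same calls as the save cell, pointed at temporary files
    toy.to_csv(csv_path, index=False)
    toy.to_json(json_path, orient="records", lines=True)

    from_csv = pd.read_csv(csv_path)
    from_json = pd.read_json(json_path, lines=True)

csv_value = from_csv["English Tokenized"].iloc[0]
json_value = from_json["English Tokenized"].iloc[0]
print(type(csv_value).__name__, csv_value)
print(type(json_value).__name__, json_value)
```

If the tokenized or padded columns are needed downstream, reloading from the JSON Lines file avoids having to parse the stringified lists from the CSV.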