# Next Word Prediction Exploration

This notebook explores the next word prediction dataset. It loads the data, plots the most frequent words, and then tokenizes and pads the text in preparation for modeling.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
data = pd.read_csv('../data/dataset.csv')  # Adjust the path as necessary

# Display the first few rows of the dataset
data.head()

In [None]:
# Visualize the distribution of word frequencies
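# Lowercase so 'The' and 'the' are counted together, split on whitespace,
# then explode to one row per word (avoids building a wide intermediate DataFrame)
word_counts = data['text'].dropna().str.lower().str.split().explode().value_counts()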

plt.figure(figsize=(12, 6))
sns.barplot(x=word_counts.index[:20], y=word_counts.values[:20])
plt.title('Top 20 Most Frequent Words')
plt.xlabel('Words')
plt.ylabel('Frequency')
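plt.xticks(rotation=45, ha='right')  # right-align so rotated labels sit under their bars
plt.tight_layout()  # keep the rotated labels from being clipped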
plt.show()

## Tokenization and Sequence Padding

In this section, we tokenize the text into sequences of integer word indices and pad those sequences to a common length so they can be fed to the model.
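
To make the two steps concrete, here is a minimal pure-Python sketch of what tokenization and pre-padding do. It assumes simple lowercase whitespace splitting, and the helpers `build_word_index`, `texts_to_sequences`, and `pad_pre`, along with the toy sentences, are hypothetical stand-ins for illustration rather than the Keras implementation.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer id, most frequent first, starting at 1.

    Index 0 is left unused so it can serve as the padding value.
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index):
    """Replace each known word with its integer id; unknown words are skipped."""
    return [
        [word_index[word] for word in text.lower().split() if word in word_index]
        for text in texts
    ]


def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) so every sequence has exactly maxlen ids."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]  # keep the most recent tokens if the sequence is too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


# Toy data (assumed for illustration only)
toy_texts = ["the cat sat", "the cat sat on the mat"]

word_index = build_word_index(toy_texts)
sequences = texts_to_sequences(toy_texts, word_index)
maxlen = max(len(seq) for seq in sequences)
padded = pad_pre(sequences, maxlen)
vocab_size = len(word_index) + 1  # +1 for the reserved padding index 0

print(word_index)
print(padded)
print(vocab_size)
```

On the toy sentences, the shorter sequence becomes `[0, 0, 0, 1, 2, 3]`. Padding at the front keeps the real words at the end of each row, which tends to be convenient for next word prediction because the most recent context sits in the final positions.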

In [None]:
# Legacy Keras preprocessing utilities; availability depends on the installed Keras version
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

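# Tokenize the text (the Tokenizer lowercases and strips punctuation by default)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data['text'])
# Word indices start at 1; add 1 to account for index 0, which is reserved for padding
total_words = len(tokenizer.word_index) + 1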

# Convert text to sequences
input_sequences = tokenizer.texts_to_sequences(data['text'])

# Pad sequences at the front ('pre') with zeros so each one matches the longest sequence
max_sequence_length = max(len(seq) for seq in input_sequences)
padded_sequences = pad_sequences(input_sequences, maxlen=max_sequence_length, padding='pre')

## Conclusion

This notebook took a first look at the dataset: it plotted the most frequent words and converted the text into padded integer sequences ready for modeling. Further analysis and model training follow in subsequent notebooks.