# Introduction
## About this Project

This project is based on a dataset derived from a gaming community that operates within a simulated medical emergency service in a science fiction universe. The gameplay involves scenarios where a player character, referred to as the "client", experiences incapacitation or isolation. In response, a coordinated team of players, termed the "responding team", is mobilized to provide assistance. Communication between the responding team and the client is facilitated through a text-based chat system established for each incident.

The objective of this analysis is to identify recurring patterns and commonalities in the chat communications. This exploration assists the community's leadership in determining the feasibility of automating certain repetitive messages, thereby streamlining operations and enhancing response efficiency.

Although the dataset comes from a video game, the analytical techniques and insights from this project are broadly applicable across industries. The core challenge addressed in this analysis - identifying and automating repetitive communication patterns - is prevalent in numerous settings where service representatives interact with clients.

The methodology employed here can be adapted to enhance customer service efficiency, reduce response times, and improve client satisfaction in various sectors, including healthcare, customer support, and tech support.

## About the Dataset

The dataset used in this analysis is strictly for internal use because it contains in-game usernames. To respect privacy and uphold ethical standards in data handling, the dataset itself will not be shared publicly.

In any outputs or examples provided within this project documentation or related presentations, identifiable information has been anonymized or replaced with placeholders such as `[Organization]`, `[PlayerName]`, `[Location]`, and similar terms. The findings and methodologies are shared openly for educational and demonstrative purposes while ensuring the privacy and anonymity of the individuals involved.

### The Dataset's Description

- The dataset has 6 columns and 16&nbsp;509 rows.
- Data coverage:
  - Start date: June 18, 2023
  - End date: March 21, 2024

The dataset's columns listed below are the result of normalizing data from `json` format and renaming the columns for clarity. Detailed description for each column:

| Column name            | Description                                                 | Type     |
|------------------------|-------------------------------------------------------------|----------|
| ID                     | Unique identifier for each message within the database      | Text     |
| contents               | Text content of the chat message sent between players       | Text     |
| created                | Date and time when the message was recorded in the system   | Datetime |
| emergency ID           | Identifier linking the message to a specific emergency case | Text     |
| sender ID              | Unique identifier of the player who sent the message        | Text     |
| message sent timestamp | Date and time when the message was sent by the player       | Datetime |

## The Tools I Use

- **Python:** Programming language for data manipulation and analysis.
- **Pandas:** Python library for efficient data cleaning and transformation.
- **NLTK (Natural Language Toolkit):** Python toolkit for natural language processing and text analysis.
- **Scikit-learn:** Machine learning library for building predictive models.
- **Matplotlib:** Visualization tool for creating plots and graphs.
- **Jupyter Notebook:** Interactive environment for documenting data analysis.

# 1. Setting up the Environment

The following command installs the Python libraries required for this project:

```
pip install pandas matplotlib nltk scikit-learn ipython
```

# 2. Loading and Preprocessing the Data
## 2.1. Importing Libraries

In [None]:
# Essential for data manipulation and analysis.
import pandas as pd

# Specific function from pandas for flattening JSON objects into a flat table.
from pandas import json_normalize

# Visualization tool for creating plots and graphs.
import matplotlib.pyplot as plt

# For parsing JSON data.
import json

# Toolkit for natural language processing and text analysis.
import nltk

# List of common words to filter out from text.
from nltk.corpus import stopwords

# Function to split text into individual words (tokens).
from nltk.tokenize import word_tokenize

# Transform texts into a suitable format for analysis.
from sklearn.feature_extraction.text import TfidfVectorizer

# Unsupervised machine learning algorithm for clustering.
from sklearn.cluster import KMeans

# Measures how similar an object is to its own cluster compared
# to other clusters.
from sklearn.metrics import silhouette_score

# For topic modeling.
from sklearn.decomposition import NMF

# Provides regular expression matching operations.
import re

# For interacting with the operating system.
import os

# For richer output formatting in Jupyter Notebooks.
from IPython.display import display, HTML

# Downloading the necessary NLTK datasets.
nltk.download('punkt')  # Tokenizer for breaking text into individual words.
nltk.download('stopwords')  # Common words to filter out from text.

pd.set_option('display.max_rows', None)  # To display all DataFrame rows.
pd.set_option('display.max_columns', None)  # To display all DataFrame columns.

# Setting max column width to 1000 characters.
pd.options.display.max_colwidth = 1000

## 2.2. Loading the Dataset

In [None]:
# Initializing an empty list to store each parsed JSON object.
data = []

# Initializing a string to collect pieces of JSON spread across multiple lines.
partial_json = ""

# Opening and reading the dataset file.
with open('chatMessage.json', 'r', encoding='utf-8') as f:
    for line in f:
        try:
            # Attempting to parse the line with any previously collected
            # JSON fragments.
            parsed_json = json.loads(partial_json + line)

            # If parsing is successful, appending the JSON object to the list
            # and resetting partial_json.
            data.append(parsed_json)

            # Clearing the partial_json to start fresh for the next object.
            partial_json = ""
        except json.JSONDecodeError:
            # If JSON decoding fails due to incomplete JSON fragments,
            # appending the line to partial_json to complete the JSON object.
            partial_json += line

# Loading the collected JSON objects into a DataFrame.
df = pd.DataFrame(data)

# Displaying the first few rows of the DataFrame to ensure data is loaded
# properly.
df.head()

The parsed JSON objects turn out to have a nested structure.
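To illustrate how the loading loop copes with objects that span several lines, here is a minimal sketch that runs the same accumulate-until-parsable logic on an in-memory sample. The sample records and the helper name `parse_concatenated_json` are hypothetical and not taken from the dataset.

```python
import io
import json


def parse_concatenated_json(lines):
    """Parse JSON objects that may each span one or more lines."""
    objects = []
    partial = ""
    for line in lines:
        try:
            # Try to parse everything collected so far plus the new line.
            objects.append(json.loads(partial + line))
            partial = ""
        except json.JSONDecodeError:
            # The object is still incomplete: keep accumulating lines.
            partial += line
    return objects


# Hypothetical sample: one single-line object, one spread over three lines.
sample = io.StringIO(
    '{"Item": {"id": {"S": "a1"}}}\n'
    '{"Item":\n'
    '  {"id": {"S": "b2"}}\n'
    '}\n'
)
records = parse_concatenated_json(sample)
print(len(records))
```

This approach assumes that every object starts on a fresh line. If one line held the end of one object and the start of the next, it would never parse and would keep accumulating.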

In [None]:
# Normalizing nested JSON data from the "Item" column to create a flat
# table structure.
df_normalized = json_normalize(df['Item'])

# Combining the normalized data back with the original DataFrame,
# and dropping the original "Item" column which contained nested JSON.
chat = pd.concat([df.drop('Item', axis=1), df_normalized], axis=1)

# Displaying the first five rows of the processed DataFrame to verify
# that it loaded and normalized correctly.
chat.head()

The JSON normalization process converts nested structures into a flat table by using the path to each element in the nested JSON as the column name. The suffix of each column name indicates the data type recorded in the original JSON object: `.S` (as in `id.S`, `contents.S`, `created.S`) stands for a string type, and `.N` (as in `messageSentTimestamp.N`) denotes a numerical type.
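The naming rule can be seen on a small example. The sketch below flattens two hypothetical records that mimic the nested layout described above; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical records mimicking the nested layout of the "Item" column.
records = [
    {"id": {"S": "a1"}, "contents": {"S": "hello"},
     "messageSentTimestamp": {"N": "1711230277"}},
    {"id": {"S": "b2"}, "contents": {"S": "on our way"},
     "messageSentTimestamp": {"N": "1711230300"}},
]

# Each nested path becomes a dotted column name such as "contents.S".
flat = pd.json_normalize(records)
print(sorted(flat.columns))
```

Note that the numeric field arrives as text (`"1711230277"`), which is why a type conversion is needed later.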

## 2.3. Renaming Columns

To improve the readability and usability of our dataset within the analysis, I will remove the `.S` and `.N` suffixes from each column name.

In [None]:
# Renaming columns to more descriptive and simpler names.
chat = chat.rename(
    columns={
        'id.S': 'ID',
        'contents.S': 'contents',
        'created.S': 'created',
        'emergencyId.S': 'emergency ID',
        'senderId.S': 'sender ID',
        'messageSentTimestamp.N': 'message sent timestamp'
    }
)

## 2.4. Changing Column Types

We begin by examining the current data types of the columns using the `info()` method.

In [None]:
chat.info()

From the output of `chat.info()`, we can observe the structure of our DataFrame: it contains 6 columns and 16&nbsp;509 rows.

This output also reveals that all columns are currently recognized as `object` type, which is a generic type for storing data in pandas.

The `created` and `message sent timestamp` columns are currently formatted as objects but represent date-time information. The `created` column appears in ISO 8601 format (example value: `2024-03-23T21:44:36.788Z`), while `message sent timestamp` uses a Unix timestamp format (example value: `1711230277`). I will convert both to pandas datetime objects for consistency.

In [None]:
# Converting "created" from ISO 8601 format string to datetime.
chat['created'] = pd.to_datetime(chat['created'])

# Ensuring the "created" datetime column is timezone-unaware (naive).
chat['created'] = chat['created'].dt.tz_localize(None)

# Flooring the "created" datetime to the nearest second to remove any
# smaller time units.
chat['created'] = chat['created'].dt.floor('s')

# Converting the "message sent timestamp" from Unix timestamp to a proper
# datetime format.

# First, ensuring the column is treated as an integer for accurate
# datetime conversion.
chat['message sent timestamp'] = chat['message sent timestamp'].astype(int)

# Converting the integer timestamps in "message sent timestamp" to datetime
# using Unix epoch time (seconds since 1970-01-01).
chat['message sent timestamp'] = pd.to_datetime(
    chat['message sent timestamp'], unit='s'
)
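As a quick check of both conversions, the sketch below applies the same steps to the two example values quoted above. The Series names are illustrative only.

```python
import pandas as pd

# Example values quoted in the text above.
created = pd.Series(["2024-03-23T21:44:36.788Z"])
sent = pd.Series(["1711230277"])

# ISO 8601 string -> timezone-aware datetime -> naive datetime,
# floored to whole seconds.
created_dt = pd.to_datetime(created).dt.tz_localize(None).dt.floor("s")

# Unix seconds stored as text -> integer -> datetime.
sent_dt = pd.to_datetime(sent.astype("int64"), unit="s")

print(created_dt.iloc[0])
print(sent_dt.iloc[0])
```

For these two values, the converted timestamps differ by one second.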

From our earlier call to `chat.info()`, we identified two missing values in the `contents` column. We will remove these rows and then convert the `contents` column to a string type to ensure compatibility with text processing functions in pandas.

In [None]:
# Removing rows with missing values in the "contents" column.
chat = chat.dropna(subset=['contents'])

# Converting "contents" to string type.
chat['contents'] = chat['contents'].astype(str)

# Displaying the DataFrame info again to show the changes.
chat.info()

After these conversions, the date-time and text columns have data types suited to the operations applied to them later.

The columns containing ID information, such as `ID`, `emergency ID`, and `sender ID`, will not be used in the analyses planned for this project, and therefore will be left in their original format.

## 2.5. Removing System-generated Messages

Now, let's sort our DataFrame by `created` and take a look at the first registered chat messages.

In [None]:
chat = chat.sort_values(by='created')
chat.head()

The first messages seem to be system-generated, and I'm going to remove them.

From now on, for the purpose of this project, I'm going to focus on the `contents` and `created` columns.

In [None]:
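# Retaining only the "contents" and "created" columns.
# ".copy()" makes the result an independent DataFrame, so the column
# assignment below does not trigger a SettingWithCopyWarning.
chat = chat[['contents', 'created']].copy()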

# Removing the system-generated messages.

# Stripping leading and trailing whitespaces in the "contents" column first.
chat['contents'] = chat['contents'].str.strip()

# Filtering out rows containing the automated message.
chat = chat[
    ~chat['contents'].str.contains(
        "This emergency was submitted via the __\\*\\*Client Portal\\*\\*__",
        regex=True,
        na=False
    )
]

chat.head()

Now we see a different kind of system-generated message in the first rows. Let's remove these as well.

In [None]:
# Filtering out rows that start with "## Emergency details from Client".
chat = chat[
    ~chat['contents'].str.startswith(
        "## Emergency details from Client", na=False
    )
]

chat.head(20)

These messages are human-written.

Let's see how many messages we have in our DataFrame now.

In [None]:
len(chat)
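The two filters in this section can be reproduced on a toy Series. The sketch below is one possible alternative, not the code used in the analysis: it matches the first marker as a literal substring (`regex=False`), which avoids escaping the asterisks. The sample messages are hypothetical; only the marker phrases come from the cells above.

```python
import pandas as pd

# Hypothetical messages; only the marker phrases come from the text above.
messages = pd.Series([
    "This emergency was submitted via the __**Client Portal**__",
    "## Emergency details from Client\nLocation: [Location]",
    "Friend request sent",
])

portal_marker = "This emergency was submitted via the __**Client Portal**__"

# regex=False treats the marker as a literal substring, so the asterisks
# need no escaping.
is_portal = messages.str.contains(portal_marker, regex=False, na=False)
is_details = messages.str.startswith(
    "## Emergency details from Client", na=False
)

# Keeping only rows that match neither system-generated pattern.
human_written = messages[~(is_portal | is_details)]
print(human_written.tolist())
```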

## 2.6. Exploratory Attempt to Filter Non-English Messages

Not all messages in the dataset are in English. To address this, I attempted to filter out the non-English ones using the `langdetect` and `langid` libraries.

`langdetect` identified 2&nbsp;742 messages as non-English, while `langid` detected 166 high-confidence messages in other languages. However, in both cases, most of the labeled messages were actually written in English. Consequently, I decided to discard the idea of filtering by language for our dataset.

Here is the code used for these attempts:

```
pip install langdetect langid
```

In [None]:
from langdetect import detect, LangDetectException
import langid

# Function to detect the language using langdetect.
def detect_language_langdetect(text):
    try:
        return detect(text)
    except LangDetectException:
        return "unknown"

# Function to detect the language using langid and return both language code
# and confidence.
def detect_language_langid(text):
    lang, confidence = langid.classify(text)
    return lang, confidence

# Applying langdetect.
chat['language_langdetect'] = chat['contents'].apply(detect_language_langdetect)

# Applying langid.
chat[['language_langid', 'langid_confidence']] = chat['contents'].apply(
    lambda x: pd.Series(detect_language_langid(x))
)

# Filtering messages detected as non-English by langdetect.
non_english_langdetect = chat[chat['language_langdetect'] != 'en']

# Filtering messages detected as non-English with high confidence by langid.
high_confidence_threshold = 0.9
high_confidence_non_english_langid = chat[
    (chat['language_langid'] != 'en')
    & (chat['langid_confidence'] >= high_confidence_threshold)
]

# Displaying the count of non-English messages detected.
print(f"langdetect detected {len(non_english_langdetect)} "
      "non-English messages.")
print(f"langid detected {len(high_confidence_non_english_langid)} "
      "high-confidence non-English messages.")

# Displaying the top 10 detected non-English messages.
top_10_non_english_langdetect = non_english_langdetect.head(10)
display(top_10_non_english_langdetect[['contents', 'language_langdetect']])
top_10_non_english_langid = high_confidence_non_english_langid.head(10)
display(
    top_10_non_english_langid[
        ['contents', 'language_langid', 'langid_confidence']
    ]
)

# Cleaning up.
chat.drop(
    columns=['language_langdetect', 'language_langid', 'langid_confidence'],
    inplace=True
)

## 2.7. Text Preprocessing

To prepare the chat messages for analysis, I perform several preprocessing steps on the text data. These steps include removing punctuation, converting all text to lowercase, and eliminating common stop words (like "the", "is", etc.) that are not useful for identifying key themes.

Additionally, tokenization is part of this process. It breaks the text into individual words or phrases, allowing for more granular analysis.

In [None]:
# Building the stop-word set once instead of on every function call.
STOP_WORDS = set(stopwords.words('english'))

# Function for text preprocessing.
def preprocess(text):
    # Converting to lowercase.
    text = text.lower()
    # Tokenizing text.
    tokens = word_tokenize(text)
    # Removing stopwords.
    tokens = [word for word in tokens if word not in STOP_WORDS]
    # Removing non-alphabetic tokens (punctuation, numbers).
    tokens = [word for word in tokens if word.isalpha()]
    return " ".join(tokens)

# Applying preprocessing to the "contents" column.
chat['processed_contents'] = chat['contents'].apply(preprocess)

# Displaying the first 20 rows of the DataFrame to verify the preprocessing.
chat.head(20)
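To show the effect of these steps without downloading any NLTK data, here is a simplified stand-in for `preprocess`. It swaps `word_tokenize` for a regular expression and NLTK's English stop-word list for a tiny hand-written set, so its output can differ from the real function (for example on contractions). The name `preprocess_sketch` and the stop-word set are hypothetical.

```python
import re

# Hypothetical, deliberately tiny stop-word set (NLTK's list is much longer).
STOP_WORDS_SKETCH = {"the", "is", "a", "to", "you", "for", "me", "when", "are"}


def preprocess_sketch(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS_SKETCH)


print(preprocess_sketch("Let me know when you are READY for the invites!"))
# prints: let know ready invites
```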

# 3. Vectorization

In this step, we transform the text data into a numerical format suitable for machine learning algorithms. We achieve this using TF-IDF (Term Frequency-Inverse Document Frequency), which helps in representing the importance of words in the context of the entire dataset.

In [None]:
# Initializing the TF-IDF Vectorizer.
vectorizer = TfidfVectorizer()

# Fitting the vectorizer to the processed text data and transforming the text
# into numerical format.

# "X" will be a sparse matrix where each row represents a text message
# and each column represents a term.
X = vectorizer.fit_transform(chat['processed_contents'])
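To make the weighting concrete, the sketch below computes TF-IDF by hand for a three-message toy corpus. It uses the formula that scikit-learn's `TfidfVectorizer` applies with its default settings: a smoothed inverse document frequency, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each row. The corpus and the helper name `tfidf_row` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical, already preprocessed messages.
corpus = ["friend request sent", "party invite sent", "thank sent"]
docs = [doc.split() for doc in corpus]
n_docs = len(docs)

# Document frequency: in how many messages each term occurs.
df = Counter(term for doc in docs for term in set(doc))


def tfidf_row(doc):
    """Return L2-normalized TF-IDF weights for one tokenized message."""
    counts = Counter(doc)
    raw = {
        term: tf * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, tf in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}


weights = tfidf_row(docs[0])
print({term: round(w, 3) for term, w in weights.items()})
```

Because "sent" occurs in every message, it receives a lower weight than "friend" or "request", which occur in only one.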

# 4. Initial Analysis
## 4.1. Clustering

In this section, I will use K-means clustering to group similar messages. This technique can help identify common themes or frequently discussed topics.

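The advantage of clustering is that it can group messages that share most of their vocabulary even if they are not worded identically. Because the messages are represented as TF-IDF vectors, similarity here means overlapping words rather than shared meaning.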

I begin by retrieving the total number of CPU cores available on my PC to determine the computational resources available for parallel processing.

In [None]:
print(os.cpu_count())

In [None]:
# Setting the maximum number of CPU cores to be used by joblib to 8.
# Joblib is a library for parallel computing in Python, used by scikit-learn
# to speed up computations.
os.environ['LOKY_MAX_CPU_COUNT'] = '8'

# Starting with the number of clusters equal to 5.
k = 5

# Initializing the KMeans clustering algorithm with "k" clusters.
kmeans = KMeans(n_clusters=k, random_state=0)

# Fitting the KMeans model to the data and predicting cluster assignments.
clusters = kmeans.fit_predict(X)

# Attaching the predicted cluster labels back to the original DataFrame.
chat['cluster'] = clusters
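For intuition about what K-means does with TF-IDF vectors, the sketch below clusters four hypothetical messages that fall into two vocabulary groups. The messages and the `toy_` names are made up for illustration, and `n_init=10` is set explicitly here so that the result does not depend on the library version's default.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical preprocessed messages forming two vocabulary groups.
toy_messages = [
    "friend request sent",
    "sent friend request please accept",
    "sending party invite",
    "party invite sending now",
]

toy_X = TfidfVectorizer().fit_transform(toy_messages)
toy_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(toy_X)

# Messages whose wording overlaps should receive the same label.
for label, message in zip(toy_labels, toy_messages):
    print(label, message)
```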

Next, I will display the texts within each cluster to be able to identify common themes.

To add perspective, I will calculate the size of each cluster and its percentage relative to the entire pool of messages.

In [None]:
# Calculating cluster sizes and percentages.
cluster_counts = chat['cluster'].value_counts()
total_counts = len(chat)
cluster_percentages = ((cluster_counts / total_counts) * 100).round(2)

# Sorting clusters by size.
sorted_cluster_indices = cluster_counts.sort_values(ascending=False).index

# Displaying clusters, sorted by cluster size.
for i in sorted_cluster_indices:
    # Sampling texts from each cluster.
    sample = chat[chat['cluster'] == i]['contents'].sample(7).to_frame()
    
    # Getting size and percentage of the current cluster.
    cluster_size = cluster_counts.loc[i]
    cluster_percentage = cluster_percentages.loc[i]
    
    # Formatting the header with size and percentage, and displaying
    # the sample texts.
    display(HTML(f"<h3>Cluster {i}: Size = {cluster_size}, "
                 f"Percentage = {cluster_percentage}%</h3>"))
    display(HTML(sample.to_html(escape=False)))

We can see that one cluster, comprising almost 4% of the data, consists of another type of automated message. We will use this discovery later to remove those messages.

## 4.2. Topic Modeling with NMF

Non-negative Matrix Factorization (NMF) is a topic modeling technique that can be used to discover the hidden thematic structure in large archives of text. Unlike K-means, NMF is a soft clustering method, meaning that each document can be associated with multiple topics, each with a certain weight.

Let’s apply NMF to our dataset and explore its effectiveness.

In [None]:
# Using TF-IDF Vectorizer for NMF.
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=1000)
tfidf = tfidf_vectorizer.fit_transform(chat['processed_contents'])

# Applying NMF to the TF-IDF features.
nmf_model = NMF(n_components=5, random_state=1, init='nndsvd').fit(tfidf)

# Retrieving the feature names (words) from the TF-IDF vectorizer.
feature_names = tfidf_vectorizer.get_feature_names_out()

# Displaying topics and their key words.
for topic_idx, topic in enumerate(nmf_model.components_):
    print(f"Topic #{topic_idx}:")
    # Displaying top 10 words for each topic.
    print(" ".join([feature_names[i] for i in topic.argsort()[:-11:-1]]))
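The "soft clustering" property can be inspected through the document-topic matrix that NMF returns. The sketch below fits a two-topic model on six hypothetical messages and prints each message's weight for both topics. The data and the `toy_` names are illustrative only.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical preprocessed messages.
toy_messages = [
    "friend request sent",
    "friend request sent please accept",
    "sending party invite",
    "party invite sent",
    "thank stay safe",
    "thank choosing services stay safe",
]

toy_tfidf = TfidfVectorizer().fit_transform(toy_messages)
toy_nmf = NMF(n_components=2, random_state=1, init="nndsvd")

# Rows = messages, columns = topics; each entry is a non-negative weight.
doc_topic = toy_nmf.fit_transform(toy_tfidf)

for message, row in zip(toy_messages, doc_topic):
    print([round(float(w), 2) for w in row], message)
```

A message can have non-zero weight on several topics at once, whereas K-means assigns each message to exactly one cluster.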

## 4.3. Evaluation of the First Results

### Evaluation of Clustering

- **Largest Cluster (82%)**: This cluster primarily consists of miscellaneous messages that do not fit into other clusters.

- **Second Largest Cluster (9%)**: Predominantly centered around the words "send/sent/sending," this cluster mostly involves discussions about friend requests or party invites, both essential for the community's rescue services.

- **Third Largest Cluster (4%)**: This cluster reveals another type of pre-written message appearing in slightly different forms.

- **Second Smallest Cluster (3%)**: Messages in this cluster frequently use the word "ready", typically asking if a person is prepared to receive an invite.

- **Smallest Cluster (2%)**: This cluster is formed around the word "sorry", used in various contexts.

### Evaluation of Topic Modeling

- **Topic #0**: Reflects the theme of the pre-written message discovered during clustering: "Alert received, please stand by, the Team Lead assigned".

- **Topic #1**: Centers on sending a friend request and a party invite, and asking the recipient to press the left bracket.

- **Topic #2**: Related to Topic #1, with phrases like "Let me know when you're ready to receive the invites".

- **Topic #3**: Appears to be a generic confirmation message: "Copy, thanks, alright".

- **Topic #4**: Resembles a goodbye message: "Thank you for choosing [Organization] services, stay safe".

### Summary and Next Steps

While both approaches provide valuable insights, I found the clustering results easier to interpret. I will continue to work with clustering, adjusting the number of clusters to extract further insights.

# 5. Fine-Tuning Clustering

## 5.1. Removing Identified Pre-written Messages

From the analysis of the clusters, it became apparent that there is an existing pre-written type of message:

> Thank you for choosing [Organization], your alert has been received.  A [Organization] team will be deployed to assist you shortly, please stand by to accept a friend request and a party invite from our Team Lead. (To accept the invitations, You should follow the Team Leader Instructions)  * *The Team Lead assigned, will inform you, when they are sending their invites.*  After you joined the party, please stand by to answer a few follow-up questions, so that we can provide a better service.

I will remove the rows containing this message before re-running the clustering algorithm. This adjustment will help focus on more variable and unique interactions, potentially revealing deeper insights.

In [None]:
# Defining a regular expression pattern that accounts for possible variations:
# ".*" allows any text between the two fixed phrases.
# "[Organization]" stands in for the anonymized organization name; its brackets
# are escaped so that the placeholder is not read as a character class.
pattern = (
    r"thank you for choosing \[Organization\], your alert has been received\..*"
    r"please stand by to accept a friend request and a party invite "
    r"from our team lead\."
)

# Using the regex pattern to filter out rows.
# The "flags=re.I" parameter makes the match case insensitive.
chat = chat[
    ~chat['contents'].str.contains(
        pattern, na=False, regex=True, flags=re.I
    )
]

# Checking how many messages remain in our DataFrame.
len(chat)
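A note on patterns that contain a bracketed placeholder such as `[Organization]`: in regular-expression syntax, square brackets denote a character class, so an unescaped placeholder matches a single character rather than the literal text. The sketch below demonstrates the difference on a shortened, hypothetical version of the message.

```python
import re

message = "Thank you for choosing [Organization], your alert has been received."

unescaped = r"thank you for choosing [Organization], your alert"
escaped = r"thank you for choosing \[Organization\], your alert"

# Unescaped: "[Organization]" is a character class that matches ONE character
# out of its letters, so the literal bracketed text is not matched.
print(bool(re.search(unescaped, message, flags=re.I)))  # prints: False

# Escaped: the brackets are literal, so the placeholder text matches.
print(bool(re.search(escaped, message, flags=re.I)))  # prints: True
```

When the real organization name is substituted back in, `re.escape` can be applied to it in case the name contains characters with a special meaning in regular expressions.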

## 5.2. Determining the Optimal Number of Clusters

In this section, I will focus on finding the optimal number of clusters for our dataset. I will use two commonly applied methods to achieve this: the Elbow Method and the Silhouette Score Method.

### Elbow Method
The Elbow Method helps to determine the optimal number of clusters by plotting the sum of squared distances from each point to its assigned cluster center (within-cluster sum of squares) against the number of clusters. The point at which the rate of decrease sharply slows down, forming an "elbow", suggests an optimal cluster count.

In [None]:
# Initializing an empty list to store the Within-Cluster Sum of Squares
# (WCSS) values.
wcss = []

# Looping through cluster counts from 1 to 49 ("range(1, 50)" excludes 50).
for i in range(1, 50):
    # Initializing the KMeans clustering algorithm with "i" clusters.
    kmeans = KMeans(
        n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0
    )
    # Fitting the KMeans model to the data. "X" is the TF-IDF matrix from
    # Section 3, so it still includes the messages removed in Section 5.1.
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)  # "inertia_" is the WCSS for the given model.

# Plotting the results to visualize the Elbow Method.
plt.plot(range(1, 50), wcss)
plt.title("The Elbow Method")
plt.xlabel("Number of clusters")
plt.ylabel("WCSS")
plt.show()
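To show what an "elbow" looks like when the data does have clear groups, the sketch below computes the within-cluster sum of squares by hand for four hypothetical one-dimensional points. The partitions are hand-picked rather than found by K-means, and the helper name `wcss_1d` is illustrative.

```python
def wcss_1d(clusters):
    """Sum of squared distances of 1-D points to their cluster mean."""
    total = 0.0
    for points in clusters:
        center = sum(points) / len(points)
        total += sum((p - center) ** 2 for p in points)
    return total


# Hypothetical 1-D data with two obvious groups: {0, 1} and {10, 11}.
wcss_k1 = wcss_1d([[0, 1, 10, 11]])
wcss_k2 = wcss_1d([[0, 1], [10, 11]])
wcss_k3 = wcss_1d([[0], [1], [10, 11]])

print(wcss_k1, wcss_k2, wcss_k3)  # prints: 101.0 1.0 0.5
```

The drop from one to two clusters is large, while a third cluster adds almost nothing. That sharp change in slope at two clusters is the elbow.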

### Silhouette Score Method

The Silhouette Score Method evaluates the quality of clusters by measuring how similar each point is to its own cluster compared to other clusters. Scores range from -1 to 1, with higher scores indicating better-defined clusters. By plotting the Silhouette Score against different numbers of clusters, we can identify the cluster count that maximizes this score, indicating the optimal clustering solution.

In [None]:
# Initializing an empty list to store the silhouette scores.
silhouette_scores = []

# Defining the range of cluster numbers to evaluate, starting from 2
# because the silhouette score cannot be calculated for a single cluster.
K_range = range(2, 50)

# Looping through the range of cluster numbers.
for k in K_range:
    # Initializing the KMeans clustering algorithm with "k" clusters.
    kmeans = KMeans(n_clusters=k, random_state=10)
    # Fitting the KMeans model to the data and predicting cluster labels.
    cluster_labels = kmeans.fit_predict(X)
    # Calculating the average silhouette score for the current number of clusters.
    silhouette_avg = silhouette_score(X, cluster_labels)
    # Appending the silhouette score to the list.
    silhouette_scores.append(silhouette_avg)

# Plotting the silhouette scores to visualize the results.
plt.plot(K_range, silhouette_scores)
plt.title("Silhouette Score Method")
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Score")
plt.show()
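The score for a single point is `(b - a) / max(a, b)`, where `a` is its mean distance to the other points in its own cluster and `b` is its mean distance to the points of the nearest other cluster. The sketch below computes the average score by hand for hypothetical one-dimensional data split into two clusters. The helper names are illustrative, and each cluster is assumed to hold at least two points.

```python
def mean_distance(point, others):
    """Average absolute distance from a 1-D point to a list of points."""
    return sum(abs(point - o) for o in others) / len(others)


def silhouette_two_clusters(clusters):
    """Mean silhouette score for 1-D points split into exactly two clusters."""
    scores = []
    for idx, own in enumerate(clusters):
        other = clusters[1 - idx]
        for i, point in enumerate(own):
            rest = own[:i] + own[i + 1:]
            a = mean_distance(point, rest)   # cohesion within own cluster
            b = mean_distance(point, other)  # separation from other cluster
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


well_separated = silhouette_two_clusters([[0, 1], [10, 11]])
badly_assigned = silhouette_two_clusters([[0, 10], [1, 11]])

print(round(well_separated, 3), round(badly_assigned, 3))
```

The well-separated split scores close to 1, while the deliberately bad split scores below 0.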

### Analyzing the Results

The Elbow Method did not show a clear "elbow", even with an extended range of clusters.

Additionally, the highest Silhouette Score obtained was 0.1, which is generally considered low and indicates poorly separated clusters.

These findings suggest that the data might not naturally separate into distinct groups very well, or there could be a high degree of overlap in the structure of the data points (chat messages).

### Next Steps

After iterative testing, where I ran the clustering algorithm several times with different numbers of clusters, I found that setting the number of clusters to **14** is optimal. I focused only on clusters containing more than 50 messages. A smaller number of clusters does not reveal more nuanced, yet still valuable themes, while a larger number of clusters starts "overfitting", grouping messages by frequently used but very abstract themes, such as the verbs "see" and "get".

# 6. Final Clustering Execution

## 6.1. Running the Model

In [None]:
# Re-vectorizing the "processed_contents" column after filtering out some rows
# in the earlier steps.
X = vectorizer.fit_transform(chat['processed_contents'])

# The chosen optimal number of clusters.
k = 14

# Initializing the KMeans clustering algorithm with "k" clusters.
kmeans = KMeans(n_clusters=k, random_state=0)

# Fitting the KMeans model to the data and predicting cluster assignments.
clusters = kmeans.fit_predict(X)

# Rewriting the existing "cluster" column with new clusters.
chat['cluster'] = clusters

# Calculating cluster sizes and percentages.
cluster_counts = chat['cluster'].value_counts()
total_counts = len(chat)
cluster_percentages = ((cluster_counts / total_counts) * 100).round(2)

# Sorting clusters by size.
sorted_cluster_indices = cluster_counts.sort_values(ascending=False).index

# Displaying clusters, sorted by cluster size.
for i in sorted_cluster_indices:
    # Getting size and percentage of the current cluster.
    cluster_size = cluster_counts.loc[i]
    cluster_percentage = cluster_percentages.loc[i]
    
    # Only displaying clusters with more than 50 messages.
    if cluster_size > 50:
        # Sampling texts from each cluster.
        sample = chat[chat['cluster'] == i]['contents'].sample(7).to_frame()
        
        # Formatting the header with size and percentage, and displaying
        # the sample texts.
        display(HTML(f"<h3>Cluster {i}: Size = {cluster_size}, "
                     f"Percentage = {cluster_percentage}%</h3>"))
        display(HTML(sample.to_html(escape=False)))

## 6.2. Cluster Analysis

After filtering out system-generated and pre-written messages, the analysis revealed that a significant portion of the remaining chat messages - **66%** - couldn't be clustered in a meaningful way. However, several prominent themes emerged from the clustered messages:

- **778 messages** contained "Friend request sent".
- **667 messages** mentioned that "Team is on their way now".
- **573 messages** included "Sending party invite".
- **410 messages** stated "Let me know when you are ready for the invites".
- **355 messages** said "Thank you", with an additional **265 messages** saying "Thanks".
- **279 messages** said "Sorry".
- **207 messages** said "Copy that", with an additional **78 messages** saying "Roger that".
- **138 messages** mentioned "beacon".
- **104 messages** said "Accepted".

# 7. Project Conclusion and Recommendations

The most frequently used and lengthy messages that are written manually are related to sending friend requests and party invites. These steps occur at the very beginning of the alert process and are mandatory. The analysis indicates that automating these messages would significantly reduce the manual effort required by the team.

- Automating the message **"Friend Request sent. Please press the Left Bracket key `[`"** could streamline up to **6%** of the messages currently being written manually.

- Automating **"Sending party invite. Please be in first-person to accept"** could streamline up to **4.5%**.

- Automating **"Let me know when you are ready to receive the invitations"** could streamline up to **3.2%**.

Together, automating these three messages could streamline up to **13.7%** of the manually written messages.

Additionally, there are several short, frequently repeated messages that could also be considered for automation:

- **"Thank you"** - 2.8%
- **"Sorry"** - 2.2%
- **"Thanks"** - 2.1%
- **"Copy that"** - 1.6%
- **"Roger that"** - 0.6%

Assuming approximately 75% of these shorter messages are written by the team (and 25% by the client), automating them could streamline up to an additional **7%** of the messages. Combined with the invitation-related messages, this could result in automating up to **20.7%** of the messages.
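The totals above follow from simple arithmetic on the quoted percentages, as the sketch below shows. The variable names are illustrative, and the 75% team share is the assumption stated above.

```python
# Percentages quoted in the text above.
invitation_shares = [6.0, 4.5, 3.2]
short_shares = [2.8, 2.2, 2.1, 1.6, 0.6]
team_fraction = 0.75  # assumed share of short messages written by the team

invitation_total = sum(invitation_shares)
short_total = sum(short_shares) * team_fraction
combined = invitation_total + short_total

print(round(invitation_total, 1), round(short_total, 1), round(combined, 1))
# prints: 13.7 7.0 20.7
```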

Depending on the available options for automation, prioritizing the invitation-related messages would provide the most significant benefit. However, if feasible, automating the shorter messages could further enhance communication efficiency.