### Keyword and Phrase Extraction using TF-IDF
Extract keywords and phrases from the text using TF-IDF.

In [1]:
# Imports for data manipulation and analysis
import pandas as pd

# Imports from scikit-learn for TF-IDF and clustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

### Read the Processed DataFrame
Load the processed DataFrame from its pickle file.

In [2]:
# Load the DataFrame
df = pd.read_pickle('C:\\Users\\ted59\\Knapp069-Practicum-1-Project\\Processed Data\\processed_document_data.pkl')

### Apply TF-IDF
Convert the collection of documents to a matrix of TF-IDF features. Setting max_df=0.85 drops terms that appear in more than 85% of the documents, and min_df=2 drops terms that appear in fewer than two documents.

In [3]:
# Apply TF-IDF to the processed text

tfidf_vectorizer = TfidfVectorizer(max_df=0.85, min_df=2, stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(df['processed_content'])

# Extract feature names (key terms)
feature_names = tfidf_vectorizer.get_feature_names_out()

### Clustering for Intent Discovery
Cluster the documents by their TF-IDF vectors so that similar documents fall into the same group. Each cluster represents a potential intent.

In [4]:
# Number of clusters (intents) - adjust based on your analysis
num_clusters = 10

# Apply K-means clustering
km = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)  # fixed seed so cluster assignments are reproducible
km.fit(tfidf_matrix)
clusters = km.labels_.tolist()

# Add cluster labels to DataFrame
df['cluster'] = clusters
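
The value of `num_clusters` is a judgment call. One common heuristic, sketched here on a hypothetical toy corpus, is to compare silhouette scores across several candidate values of k. The corpus, the candidate range, and the `random_state` are all assumptions for illustration; the notebook itself simply fixes the number at 10.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical toy corpus with three loosely separated topics
docs = [
    "log sourc collect", "log sourc agent", "collect log agent",
    "schedul report export", "report schedul email", "export report email",
    "system monitor alarm", "monitor alarm rule", "system alarm rule",
]
X = TfidfVectorizer().fit_transform(docs)

# Fit K-means for each candidate k and record the mean silhouette score
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Higher silhouette suggests tighter, better-separated clusters
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("best k by silhouette:", best_k)
```

The silhouette score is only a guide. A manual review of the cluster contents, as done later in this notebook, remains the more reliable check on whether the clusters correspond to meaningful intents.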

### Generating Training Examples from Clusters
Generate training examples for each identified intent.

In [5]:
# Group by cluster to get training examples for each intent
training_examples = df.groupby('cluster')['processed_content'].apply(list)

### Save Clustered Data
Save this data for further analysis or manual review.

In [6]:
# Save this data for further analysis or manual review as Pickle file.
training_examples.to_pickle('C:\\Users\\ted59\\Knapp069-Practicum-1-Project\\Processed Data\\clustered_data.pkl')

### Read and Print the Clustered Data File Contents

In [7]:
# Read the contents of the clustered_data Pickle file
file_path = 'C:\\Users\\ted59\\Knapp069-Practicum-1-Project\\Processed Data\\clustered_data.pkl'
df = pd.read_pickle(file_path)

In [8]:
print(df.head())
print('\n')
print(df.tail())
print('\n')
# info() prints its summary directly and returns None, so this also prints "None"
print(df.info())

# Iterate over the Series and print to a file
with open('C:\\Users\\ted59\\Knapp069-Practicum-1-Project\\Processed Data\\clustered_data.txt', 'w', encoding='utf-8') as file:
    for cluster_index, content in df.items():
        file.write(f"Cluster {cluster_index} contents:\n")
        file.write('\n'.join(content) + "\n\n")

cluster
0    [list client consol skip main content show nav...
1    [system monitor skip main content show navig g...
2    [log collect web consol skip main content show...
3    [messag process engin rule builder skip main c...
4    [applic manag skip main content show navig go ...
Name: processed_content, dtype: object


cluster
5    [inclus librari skip main content show navig g...
6    [logrhythm siem skip main content show navig g...
7    [modifi singl log sourc skip main content show...
8    [creat schedul report skip main content show n...
9    [ga releas note januari skip main content show...
Name: processed_content, dtype: object


<class 'pandas.core.series.Series'>
Index: 10 entries, 0 to 9
Series name: processed_content
Non-Null Count  Dtype 
--------------  ----- 
10 non-null     object
dtypes: object(1)
memory usage: 160.0+ bytes
None
