# Feature Extraction

This Jupyter notebook implements a product feature extraction process that transforms raw text data into structured features for further analysis. The process involves several key steps:

## Detailed Process

### 1. Filter Extreme Words:
The code filters extreme words from the tokenized corpus, removing tokens that are too rare or too common. It creates a Gensim `Dictionary` object from the tokenized corpus and calls `filter_extremes(no_below=2, no_above=0.5)`, which drops tokens that appear in fewer than 2 documents or in more than 50% of all documents. This step reduces noise in the subsequent analysis.
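
As a rough illustration of what this filter does, the sketch below re-implements the document-frequency rule in plain Python on a made-up four-document corpus. The helper name `filter_by_doc_freq` and the toy data are assumptions for illustration only; the notebook itself relies on Gensim's `Dictionary.filter_extremes`, which also caps the vocabulary size through its `keep_n` parameter.

```python
from collections import Counter

def filter_by_doc_freq(tokenized_docs, no_below=2, no_above=0.5):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents (hypothetical helper)."""
    # Count each token once per document to get its document frequency
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    max_docs = no_above * len(tokenized_docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

# Made-up corpus: "the" is too common, "buffet", "train", "parade" are too rare
toy_corpus = [
    ["the", "hotel", "staff"],
    ["the", "hotel", "buffet"],
    ["the", "train", "staff"],
    ["the", "parade"],
]
kept = filter_by_doc_freq(toy_corpus)
print(sorted(kept))  # ['hotel', 'staff']
```

Only `hotel` and `staff` survive: each appears in exactly two of the four documents, which satisfies both thresholds.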
### 2. Topic Modeling:
The code performs topic modeling using the Latent Dirichlet Allocation (LDA) model. It generates a bag-of-words representation of the corpus and builds an LDA model whose number of topics is set to 0.5% of the number of sentences. The LDA model assigns a topic distribution to each document in the corpus, allowing for the discovery of latent topics.
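
To make the bag-of-words representation concrete, here is a minimal plain-Python sketch of the conversion. The `to_bow` helper and the two-word vocabulary are hypothetical; in the notebook this work is done by Gensim's `Dictionary.doc2bow`, which likewise returns `(token_id, count)` pairs and skips tokens that are not in the dictionary.

```python
from collections import Counter

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, ignoring out-of-vocabulary tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

token2id = {"hotel": 0, "staff": 1}  # hypothetical vocabulary after filtering
bow = to_bow(["the", "hotel", "staff", "hotel"], token2id)
print(bow)  # [(0, 2), (1, 1)]
```

The word `the` is dropped because it is not in the vocabulary, and `hotel` is counted twice.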
### 3. Extract Keywords from Topics:
The code extracts the highest-probability word for each topic in the LDA model. This word serves as the most representative keyword for the topic and summarizes its main theme.
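
The extraction works by parsing the formatted topic strings that `print_topics` returns, where each term has the form `probability*"word"` and terms are joined by `+`. The sketch below shows that parsing on a made-up topic string; the helper name `top_word` and the sample values are assumptions for illustration.

```python
def top_word(topic_string):
    """Parse the first 'prob*"word"' term of a formatted topic string."""
    first_term = topic_string.split("+")[0]
    prob, word = first_term.split("*")
    return word.replace('"', "").strip(), float(prob)

sample = '0.303*"hotel" + 0.101*"room" + 0.054*"bed"'  # made-up topic string
word, prob = top_word(sample)
print(word, prob)  # hotel 0.303
```

An alternative that avoids string parsing is `lda_model.show_topic(topic_id, topn=1)`, which returns `(word, probability)` tuples directly and does not round the probability to three decimals.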
### 4. POS Tagging to Keep Only Nouns:
The code performs Part-of-Speech (POS) tagging on the highest-probability words and keeps only the words tagged as nouns. This helps ensure that the extracted keywords are suitable to present as product features.
### 5. Removing Unrelated Words:
The code manually filters out unrelated keywords from the resulting DataFrame, removing words that are not meaningful or do not contribute to the topics of interest. This step improves the relevance and clarity of the extracted keywords.
### 6. Calculate Coherence Score:
The code calculates the coherence score of the LDA model using the `c_v` coherence measure. This measure evaluates the overall coherence of the generated topics; a higher score indicates more coherent and meaningful topics.


Together, these steps transform the raw sentences into a short, interpretable list of product feature keywords that can be used in subsequent analysis and modeling.

In [1]:
import pandas as pd
import numpy as np
import nltk

In [2]:
%store -r disney disney_sentences

In [3]:
#Filter extreme words
from gensim.corpora import Dictionary

tokenized_corpus = [nltk.word_tokenize(doc) for doc in disney_sentences]
dictionary = Dictionary(tokenized_corpus)
dictionary.filter_extremes(no_below=2, no_above=0.5)

In [4]:
from gensim.models import LdaModel

# Create a bag of words corpus
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_corpus]

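# Build the LDA model; the number of topics is 0.5% of the sentence count
# (cast to int because LdaModel expects an integer num_topics)
num_topics = int(len(disney_sentences) * 0.005)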
lda_model = LdaModel(corpus=bow_corpus,
                     id2word=dictionary,
                     num_topics=num_topics,
                     random_state=42,
                     update_every=1,
                     chunksize=100,
                     passes=10,
                     alpha='auto',
                     per_word_topics=True)

# Iterate over each topic and print the word with the highest probability
for idx, topic in lda_model.print_topics(-1):
    words = topic.split('+')
    highest_prob_word = words[0].split('*')[1].replace('"', '').strip()
    probability = float(words[0].split('*')[0])
    print(f"Topic {idx}: {highest_prob_word} ({probability})")

In [85]:
from gensim.models import CoherenceModel

# Calculate coherence score using the c_v coherence measure for the evaluation
coherence_model = CoherenceModel(model=lda_model, texts=tokenized_corpus, dictionary=dictionary, coherence='c_v')
coherence_score = coherence_model.get_coherence()

print("Coherence Score:", coherence_score)

Coherence Score: 0.5133559229545281


In [108]:
# Download the required NLTK resources for POS tagging
nltk.download('averaged_perceptron_tagger')

# Collect the top word of each topic, keeping only nouns for the DataFrame
beforetagging = []
rows = []
for idx, topic in lda_model.print_topics(-1):
    words = topic.split('+')
    highest_prob_word = words[0].split('*')[1].replace('"', '').strip()
    probability = float(words[0].split('*')[0])
    beforetagging.append(highest_prob_word)

    # Perform POS tagging to identify the word category
    pos_tags = nltk.pos_tag([highest_prob_word])
    pos_tag = pos_tags[0][1]

    # Check if the word is a noun (singular or plural)
    if pos_tag.startswith('NN'):
        rows.append({'Word': highest_prob_word, 'Probability': probability})

# Build the DataFrame in one step (DataFrame.append was removed in pandas 2.0)
title_df = pd.DataFrame(rows, columns=['Word', 'Probability'])

# Remove duplicated rows if any from the DataFrame
title_df.drop_duplicates(inplace=True)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/ijeonghyeon/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [110]:
# Get the number of features before the POS tagging
beforetagging = pd.DataFrame(beforetagging, columns=["Keyword"])
len(beforetagging)

213

In [111]:
# Get the number of features after the POS tagging and before the manual removal
len(title_df)

80

In [95]:
# Manually filter irrelevant keywords from title_df
disney_title_df = title_df.drop(title_df[title_df["Word"].isin(["year","paris","child","kid",
                                                         "son","mean","daughter","fun",
                                                        "disney","land","sure",
                                                         "day","look","weekend","recommend","visit","place",
                                                         "dream","brea","weather","enjoy","u","worth","thing",
                                                         "bit","way","use","decide","hong","start","stop",
                                                         "course","want","everything","review","feel","beautiful",
                                                         "choice","family","week","tell","need","trip","watch",
                                                         "disneyland","walt","favorite","experience","star","age"
                                                        ])].index)

In [97]:
# Get the final number of features
len(disney_title_df)

37

In [103]:
disney_title_df

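|    | Word    | Probability |
|----|---------|-------------|
| 0  | hotel   | 0.303       |
| 2  | lunch   | 0.533       |
| 14 | snack   | 0.165       |
| 15 | train   | 0.316       |
| 18 | clean   | 0.277       |
| 20 | buffet  | 0.141       |
| 26 | car     | 0.412       |
| 27 | service | 0.679       |
| 31 | member  | 0.199       |
| 37 | staff   | 0.634       |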


In [92]:
# Store variables for further analysis
%store disney_title_df

Stored 'disney_title_df' (DataFrame)
