<a href="https://colab.research.google.com/github/mmilannaik/BigOCheatSheet/blob/master/NLP_2_Real_Estate_Inquiry_Transcripts_intent_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Configurations

# 🏡 Real Estate Inquiry Intent Classifier (2018–2019 ML stack)
This notebook simulates real estate inquiry classification using a repurposed banking complaints dataset.

**Why 2018–2019 matters:**
This notebook deliberately uses the tech stack and NLP methodology that were popular in 2018–2019:
- TF-IDF
- Logistic Regression (or SVM)
- Weak supervision via keyword matching

Intent classes:
- Booking/Visit Inquiry
- Price Negotiation
- Loan Support
- Complaint/Escalation

In [1]:
# Step 1: Install dependencies (as per 2018–19 best practice)
!pip install pandas scikit-learn matplotlib seaborn -q

In [2]:
# 1. Install the Kaggle CLI
!pip install kaggle --quiet

# 2. Upload your Kaggle API token
#    • On Kaggle: Account → Create New API Token → download kaggle.json
#    • In Colab:
from google.colab import files
files.upload()   # select your kaggle.json

# 3. Configure the CLI
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle (1).json


In [3]:
# Download the dataset zip from Kaggle and extract it
!kaggle datasets download -d adhamelkomy/bank-customer-complaint-analysis
!unzip bank-customer-complaint-analysis.zip

Dataset URL: https://www.kaggle.com/datasets/adhamelkomy/bank-customer-complaint-analysis
License(s): CC0-1.0
Downloading bank-customer-complaint-analysis.zip to /content
  0% 0.00/20.0M [00:00<?, ?B/s]
100% 20.0M/20.0M [00:00<00:00, 292MB/s]
Archive:  bank-customer-complaint-analysis.zip
  inflating: Bank Customer Complaint Analysis for Efficient Dispute Resolution.ipynb  
  inflating: complaints.csv          
  inflating: complaints_report_20240226_183305.txt  
  inflating: final_dataframe (1).csv  


In [4]:
# Step 2: Load the data (downloaded from Kaggle above)
import pandas as pd

df = pd.read_csv('complaints.csv')  # extracted from the zip in the previous cell


In [5]:
df.rename(columns={'Unnamed: 0': 'Complaint ID'}, inplace=True)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162421 entries, 0 to 162420
Data columns (total 3 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   Complaint ID  162421 non-null  int64 
 1   product       162421 non-null  object
 2   narrative     162411 non-null  object
dtypes: int64(1), object(2)
memory usage: 3.7+ MB


In [7]:

df = df[['Complaint ID', 'product', 'narrative']].dropna()
df.columns = ['id', 'product', 'narrative']
df.head()

Unnamed: 0,id,product,narrative
0,0,credit_card,purchase order day shipping amount receive pro...
1,1,credit_card,forwarded message date tue subject please inve...
2,2,retail_banking,forwarded message cc sent friday pdt subject f...
3,3,credit_reporting,payment history missing credit report speciali...
4,4,credit_reporting,payment history missing credit report made mis...


# Baseline Model

In [8]:
# Step 3: Keyword-based label simulation (used pre-BERT)
def label_intent(text):
    text = str(text).lower()
    if any(kw in text for kw in ['appointment', 'visit', 'site', 'call me']):
        return 'Booking/Visit Inquiry'
    elif any(kw in text for kw in ['price', 'rate', 'cost', 'quotation']):
        return 'Price Negotiation'
    elif any(kw in text for kw in ['loan', 'emi', 'mortgage', 'finance']):
        return 'Loan Support'
    else:
        return 'Complaint/Escalation'

df['intent'] = df['narrative'].apply(label_intent)
df['intent'].value_counts()

intent,count
Complaint/Escalation,87987
Price Negotiation,36661
Loan Support,25442
Booking/Visit Inquiry,12321
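
The keyword check above uses plain substring containment, so `'rate'` also fires on "separate" and `'emi'` on "premium". The following is a minimal sketch of a word-boundary variant, assuming the same keyword lists; `INTENT_KEYWORDS`, `INTENT_PATTERNS` and `label_intent_strict` are hypothetical names, and applying it to the dataset would change the counts shown above.

```python
import re

# Same keyword lists and priority order as label_intent above.
# All names in this block are illustrative, not part of the notebook.
INTENT_KEYWORDS = [
    ("Booking/Visit Inquiry", ["appointment", "visit", "site", "call me"]),
    ("Price Negotiation", ["price", "rate", "cost", "quotation"]),
    ("Loan Support", ["loan", "emi", "mortgage", "finance"]),
]

# One compiled pattern per intent; \b anchors each keyword at word boundaries.
INTENT_PATTERNS = [
    (intent, re.compile(r"\b(?:" + "|".join(map(re.escape, kws)) + r")\b"))
    for intent, kws in INTENT_KEYWORDS
]

def label_intent_strict(text):
    """Return the first intent whose keyword appears as a whole word."""
    text = str(text).lower()
    for intent, pattern in INTENT_PATTERNS:
        if pattern.search(text):
            return intent
    return "Complaint/Escalation"

print(label_intent_strict("please keep the accounts separate"))
print(label_intent_strict("what is the interest rate"))
```

Whole-word matching trades one error for another: it no longer matches inside unrelated words, but it also misses inflected forms such as "loans" or "rates" unless they are added to the lists.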


In [9]:
df.head(10)

Unnamed: 0,id,product,narrative,intent
0,0,credit_card,purchase order day shipping amount receive pro...,Complaint/Escalation
1,1,credit_card,forwarded message date tue subject please inve...,Price Negotiation
2,2,retail_banking,forwarded message cc sent friday pdt subject f...,Price Negotiation
3,3,credit_reporting,payment history missing credit report speciali...,Loan Support
4,4,credit_reporting,payment history missing credit report made mis...,Loan Support
5,5,credit_reporting,payment history missing credit report made mis...,Loan Support
6,6,credit_reporting,va date complaint experian credit bureau invol...,Booking/Visit Inquiry
7,7,credit_reporting,account reported abbreviated name full name se...,Price Negotiation
8,8,credit_reporting,account reported abbreviated name full name se...,Price Negotiation
9,9,credit_reporting,usdoexxxx account reported abbreviated name fu...,Price Negotiation


In [10]:
# Step 4: TF-IDF + Logistic Regression (baseline model from 2018–19 stack)
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(
    df['narrative'], df['intent'], test_size=0.2, random_state=42)

tfidf = TfidfVectorizer(max_features=5000)
X_train_vec = tfidf.fit_transform(X_train)
X_test_vec = tfidf.transform(X_test)

clf = LogisticRegression(max_iter=200)
clf.fit(X_train_vec, y_train)
y_pred = clf.predict(X_test_vec)
print(classification_report(y_test, y_pred))

                       precision    recall  f1-score   support

Booking/Visit Inquiry       0.98      0.82      0.89      2458
 Complaint/Escalation       0.94      1.00      0.97     17626
         Loan Support       0.92      0.90      0.91      5069
    Price Negotiation       0.96      0.89      0.92      7330

             accuracy                           0.94     32483
            macro avg       0.95      0.90      0.92     32483
         weighted avg       0.95      0.94      0.94     32483
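
The split above is not stratified, and the vectorizer and classifier are fitted as separate steps. As a hedged alternative, the sketch below bundles both into a scikit-learn `Pipeline` and passes `stratify` to `train_test_split` so each class keeps its share in the test set. The toy texts and labels are invented stand-ins for `df['narrative']` and `df['intent']`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Invented toy corpus standing in for the real narratives and intents.
texts = [
    "need a loan for the mortgage", "loan emi is too high",
    "mortgage finance question", "finance my loan please",
    "what is the price", "the cost and price are wrong",
    "rate quotation requested", "price rate too high",
]
labels = ["Loan Support"] * 4 + ["Price Negotiation"] * 4

# stratify=labels keeps the class proportions equal in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels)

# The pipeline fits TF-IDF on the training texts only, then the classifier.
model = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("clf", LogisticRegression(max_iter=200)),
])
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(list(zip(X_test, predictions)))
```

A single pipeline object is also easier to persist or cross-validate than a separate vectorizer and classifier.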



### 📈 Outcome & Notes
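- This approach reflects a realistic baseline from the 2018–2019 era.
- Because the labels come from keyword rules applied to the same text the model sees, the scores above mostly measure how well the classifier recovers those rules, not how well it would handle real inquiries.
- Use it as a benchmark before upgrading to transformer-based models.
- Possible extensions: save the metrics, add SHAP or word-importance maps, or move to BERT in a future notebook.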

# Preprocessed Model

In [11]:
# Step 3: Preprocess text
import re
import nltk
nltk.download('punkt_tab')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    text = str(text).lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    tokens = word_tokenize(text)
    tokens = [lemmatizer.lemmatize(w) for w in tokens if w not in stop_words]
    return ' '.join(tokens)

df['cleaned'] = df['narrative'].apply(clean_text)
df[['narrative', 'cleaned']].head()

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Unnamed: 0,narrative,cleaned
0,purchase order day shipping amount receive pro...,purchase order day shipping amount receive pro...
1,forwarded message date tue subject please inve...,forwarded message date tue subject please inve...
2,forwarded message cc sent friday pdt subject f...,forwarded message cc sent friday pdt subject f...
3,payment history missing credit report speciali...,payment history missing credit report speciali...
4,payment history missing credit report made mis...,payment history missing credit report made mis...
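
The first five rows above look identical before and after cleaning, which suggests the narratives in this dataset were already normalized. A quick way to quantify that is to measure the share of rows the cleaning step leaves untouched. The sketch below does this on an invented toy frame; `fraction_unchanged` is a hypothetical helper.

```python
import pandas as pd

def fraction_unchanged(frame, raw_col, cleaned_col):
    """Share of rows whose cleaned text equals the raw text exactly."""
    return (frame[raw_col] == frame[cleaned_col]).mean()

# Invented rows: the second one is the only one the cleaning would change.
toy = pd.DataFrame({
    "narrative": ["payment history missing",
                  "Called the BANK twice!",
                  "credit report error"],
    "cleaned": ["payment history missing",
                "called bank twice",
                "credit report error"],
})
print(fraction_unchanged(toy, "narrative", "cleaned"))
```

On the real data the call would be `fraction_unchanged(df, 'narrative', 'cleaned')`. A value close to 1 would explain why the preprocessed model scores almost the same as the baseline.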


In [12]:
# Step 4: Label intents
def label_intent(text):
    text = str(text).lower()
    if any(kw in text for kw in ['appointment', 'visit', 'site', 'call me']):
        return 'Booking/Visit Inquiry'
    elif any(kw in text for kw in ['price', 'rate', 'cost', 'quotation']):
        return 'Price Negotiation'
    elif any(kw in text for kw in ['loan', 'emi', 'mortgage', 'finance']):
        return 'Loan Support'
    else:
        return 'Complaint/Escalation'

df['intent'] = df['cleaned'].apply(label_intent)
df['intent'].value_counts()

intent,count
Complaint/Escalation,87987
Price Negotiation,36661
Loan Support,25442
Booking/Visit Inquiry,12321


In [13]:
# Step 5: TF-IDF + Logistic Regression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(
    df['cleaned'], df['intent'], test_size=0.2, random_state=42)

tfidf = TfidfVectorizer(max_features=5000)
X_train_vec = tfidf.fit_transform(X_train)
X_test_vec = tfidf.transform(X_test)

clf = LogisticRegression(max_iter=200)
clf.fit(X_train_vec, y_train)
y_pred = clf.predict(X_test_vec)
print(classification_report(y_test, y_pred))

                       precision    recall  f1-score   support

Booking/Visit Inquiry       0.98      0.82      0.89      2458
 Complaint/Escalation       0.94      1.00      0.97     17626
         Loan Support       0.92      0.90      0.91      5069
    Price Negotiation       0.96      0.88      0.92      7330

             accuracy                           0.94     32483
            macro avg       0.95      0.90      0.92     32483
         weighted avg       0.95      0.94      0.94     32483



# Multi-Intent Model

In [14]:
from sklearn.preprocessing import MultiLabelBinarizer

def multilabel_intent(text):
    """Return every intent whose keywords appear in the text."""
    text = str(text).lower()
    tags = []
    if any(kw in text for kw in ['appointment', 'visit', 'site', 'call me']):
        # Note: lowercase 'b' here, unlike 'Booking/Visit Inquiry' in label_intent
        tags.append('booking/Visit Inquiry')
    if any(kw in text for kw in ['price', 'rate', 'cost', 'quotation']):
        tags.append('Price Negotiation')
    if any(kw in text for kw in ['loan', 'emi', 'mortgage', 'finance']):
        tags.append('Loan Support')
    if not tags:
        # No keyword matched: fall back to the default class
        tags.append('Complaint/Escalation')
    return tags

df['multi_intent'] = df['cleaned'].apply(multilabel_intent)

In [15]:
df['multi_intent'].head()

Unnamed: 0,multi_intent
0,[Complaint/Escalation]
1,[Price Negotiation]
2,"[Price Negotiation, Loan Support]"
3,[Loan Support]
4,[Loan Support]


In [16]:
df['multi_intent'].value_counts()

multi_intent,count
[Complaint/Escalation],87987
[Loan Support],25442
[Price Negotiation],24356
"[Price Negotiation, Loan Support]",12305
[booking/Visit Inquiry],5509
"[booking/Visit Inquiry, Loan Support]",2663
"[booking/Visit Inquiry, Price Negotiation]",2288
"[booking/Visit Inquiry, Price Negotiation, Loan Support]",1861


In [None]:
df.shape

(162411, 6)
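
Before training, the tag lists in `multi_intent` have to become a binary indicator matrix with one column per intent. The sketch below shows what `MultiLabelBinarizer` does with three invented rows that use the same label strings as `multilabel_intent`. The column order comes from sorting the label strings, so the lowercase `booking/Visit Inquiry` lands last.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Three invented rows using the same label strings as multilabel_intent.
tag_lists = [
    ["Complaint/Escalation"],
    ["Price Negotiation", "Loan Support"],
    ["booking/Visit Inquiry", "Loan Support"],
]

mlb_demo = MultiLabelBinarizer()
Y_demo = mlb_demo.fit_transform(tag_lists)

# Columns follow the sorted label strings; uppercase sorts before lowercase.
print(list(mlb_demo.classes_))
# ['Complaint/Escalation', 'Loan Support', 'Price Negotiation', 'booking/Visit Inquiry']
print(Y_demo)
# [[1 0 0 0]
#  [0 1 1 0]
#  [0 1 0 1]]
```

Each row may contain several ones, which is what distinguishes this multi-label target from the single `intent` column used earlier.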

In [17]:
# Step 7: Multi-label model training
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import classification_report, hamming_loss

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(df['multi_intent'])

X_train, X_test, Y_train, Y_test = train_test_split(df['cleaned'], Y, test_size=0.2, random_state=42)
X_train_vec = tfidf.fit_transform(X_train)
X_test_vec = tfidf.transform(X_test)

multi_clf = OneVsRestClassifier(LogisticRegression(max_iter=200))
multi_clf.fit(X_train_vec, Y_train)
Y_pred = multi_clf.predict(X_test_vec)

In [18]:
# Step 8: Evaluate multi-label classifier
from sklearn.metrics import f1_score

print("Micro F1 Score:", f1_score(Y_test, Y_pred, average='micro'))
print("Macro F1 Score:", f1_score(Y_test, Y_pred, average='macro'))
print("Hamming Loss:", hamming_loss(Y_test, Y_pred))

Micro F1 Score: 0.9507767693078679
Macro F1 Score: 0.9273564615220262
Hamming Loss: 0.027214235138380075
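
To make the three numbers above easier to read, here is a small worked example on an invented indicator matrix. Hamming loss is the fraction of individual label slots that are wrong. Micro F1 pools all label decisions together, while macro F1 averages the per-label scores, so a single missed rare label weighs more heavily in the macro score.

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss

# Invented ground truth: 3 samples x 4 labels.
Y_true_demo = np.array([[1, 0, 0, 0],
                        [0, 1, 1, 0],
                        [0, 1, 0, 1]])
# Invented predictions: the second sample misses label 2.
Y_pred_demo = np.array([[1, 0, 0, 0],
                        [0, 1, 0, 0],
                        [0, 1, 0, 1]])

# Hamming loss by hand: 1 wrong slot out of 12.
manual_hamming = (Y_true_demo != Y_pred_demo).mean()
print(manual_hamming, hamming_loss(Y_true_demo, Y_pred_demo))

# Micro: 4 true positives, 0 false positives, 1 false negative -> 8/9.
micro = f1_score(Y_true_demo, Y_pred_demo, average="micro")
# Macro: per-label F1 of 1, 1, 0, 1 -> 0.75.
macro = f1_score(Y_true_demo, Y_pred_demo, average="macro", zero_division=0)
print(micro, macro)
```

The gap between micro and macro F1 in the notebook's own results points the same way: the less frequent labels are recalled less well than the dominant one.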


In [19]:
print(classification_report(Y_test, Y_pred))

              precision    recall  f1-score   support

           0       0.95      0.99      0.97     17626
           1       1.00      0.92      0.96      8468
           2       0.99      0.86      0.92      8150
           3       1.00      0.76      0.86      2458

   micro avg       0.97      0.93      0.95     36702
   macro avg       0.98      0.88      0.93     36702
weighted avg       0.97      0.93      0.95     36702
 samples avg       0.95      0.94      0.95     36702
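
Rows 0 to 3 in the report above follow the column order of `mlb.classes_`. Since `MultiLabelBinarizer` sorts its labels, that order should be Complaint/Escalation, Loan Support, Price Negotiation, booking/Visit Inquiry, which is consistent with the support counts. A hedged sketch of how to print the names directly, using invented data and the real `target_names` and `zero_division` parameters of `classification_report`:

```python
from sklearn.metrics import classification_report
from sklearn.preprocessing import MultiLabelBinarizer

# Invented tag lists using the same label strings as multilabel_intent.
mlb_report = MultiLabelBinarizer()
Y_true_report = mlb_report.fit_transform([
    ["Complaint/Escalation"],
    ["Price Negotiation", "Loan Support"],
    ["booking/Visit Inquiry", "Loan Support"],
    ["Price Negotiation"],
])

# Pretend the model missed "Price Negotiation" (column 2) on the second row.
Y_pred_report = Y_true_report.copy()
Y_pred_report[1, 2] = 0

# target_names replaces the 0..3 row labels with the intent names;
# zero_division=0 silences the undefined-metric warning for empty predictions.
report = classification_report(
    Y_true_report, Y_pred_report,
    target_names=list(mlb_report.classes_), zero_division=0)
print(report)
```

In the notebook itself the equivalent call would be `classification_report(Y_test, Y_pred, target_names=mlb.classes_, zero_division=0)`.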



