## Narrative Similarity Analysis on CFPB Complaints Using GloVe Embeddings

### This script processes consumer complaint narratives from the Consumer Financial Protection Bureau (CFPB) and flags potential duplicates. It loads GloVe word embeddings, normalizes each narrative, and converts it into a vector by averaging the embeddings of its words. Narratives from the same state and ZIP code that have similar lengths, were submitted within a few days of each other, and lie close together in embedding space are marked as potential duplicates. The processed dataset, with the IDs of the matching complaints attached to each row, is then saved for further analysis. Flagging these near-duplicates reduces redundancy and gives a cleaner dataset for later investigations or reports.

In [None]:
# Standard Libraries
import re
from datetime import datetime

# Third-party Libraries for Data Manipulation and Analysis
import numpy as np
import pandas as pd

# Natural Language Processing Libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, RegexpTokenizer

# Visualization and Display
from IPython.display import display, HTML
import matplotlib.pyplot as plt

# Machine Learning and Embeddings
from scipy import spatial
from sklearn.manifold import TSNE

# Miscellaneous
import string

# Setting IPython display options for better visualization
display(HTML("<style>.container { width:100% !important; }</style>"))
pd.set_option('display.max_colwidth', None)

#### 1. Loading Embeddings: The 50-dimensional GloVe (Global Vectors for Word Representation) embeddings (for example, glove.6B.50d.txt) are loaded into a dictionary (embeddings_dict_6B_50D) that maps each word to its vector.

In [None]:
# Loading GloVe word embeddings into a dictionary
glove_txt_file = "glove.6B.50d.txt"  # placeholder: set this to the path of your GloVe file
embeddings_dict_6B_50D = {}
with open(glove_txt_file, 'r', encoding="utf-8") as f:
    for line in f:
        values = line.split()
        # The last 50 fields are the vector; everything before them is the token
        word = ' '.join(values[:-50]).lower().strip()
        vector = np.asarray(values[-50:], "float32")
        embeddings_dict_6B_50D[word] = vector
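
To see the parsing step in isolation, the sketch below applies the same split-and-slice logic to two made-up lines with 3-dimensional vectors instead of 50. The sample text and the names `sample_glove`, `dim`, and `toy_embeddings` are illustrative assumptions, not part of the GloVe distribution.

```python
import io
import numpy as np

# Two made-up lines in GloVe text format: a token followed by its vector components
sample_glove = io.StringIO("the 0.1 0.2 0.3\nbank -0.5 0.0 0.5\n")
dim = 3  # toy dimensionality; the script above uses 50

toy_embeddings = {}
for line in sample_glove:
    values = line.split()
    # The last `dim` fields are the vector; everything before them is the token
    word = ' '.join(values[:-dim]).lower().strip()
    toy_embeddings[word] = np.asarray(values[-dim:], "float32")

print(toy_embeddings["bank"])
```

Slicing from the end of the line, rather than taking the first field as the word, keeps the vector intact even if a token itself contains whitespace.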


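#### 2. Text Vectorization: The vectorize_text function converts a text into a single 50-dimensional vector by averaging the embeddings of its whitespace-separated words. Words without an embedding are skipped, and a text with no known words maps to the zero vector.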

In [None]:
# Convert text into its vector representation (mean of its word embeddings)
def vectorize_text(text):
    # Look up each whitespace-separated word; skip words without an embedding
    vectors = [embeddings_dict_6B_50D[word] for word in str(text).split() if word in embeddings_dict_6B_50D]
    if vectors:
        vectorized = np.mean(vectors, axis=0)
    else:
        vectorized = np.zeros(50)  # no known words: return a zero vector
    return vectorized
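
The averaging behaviour can be checked on a toy vocabulary. The following is a minimal sketch, assuming hypothetical 2-dimensional embeddings; `toy_embeddings` and `toy_vectorize` are illustrative names that mirror vectorize_text but take the dictionary as an argument.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings, for illustration only
toy_embeddings = {
    "late": np.array([1.0, 0.0]),
    "fee": np.array([0.0, 1.0]),
}

def toy_vectorize(text, embeddings, dim=2):
    # Average the vectors of the words that have an embedding
    vectors = [embeddings[w] for w in str(text).split() if w in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

known = toy_vectorize("late fee charged", toy_embeddings)  # "charged" is skipped
reordered = toy_vectorize("fee late", toy_embeddings)      # same words, new order
unknown = toy_vectorize("xyzzy", toy_embeddings)           # no known words
print(known, reordered, unknown)
```

Because the mean ignores word order and unknown words, different texts can map to the same vector, so a small distance between two vectors is a signal of potential duplication rather than proof.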

#### 3. Text Normalization: The text_normalizer function cleans and pre-processes text. It tokenizes the text (keeping apostrophes and hyphens inside words, which also drops standalone punctuation), deletes runs of three or more repeated characters, converts tokens to lowercase, and removes tokens that consist only of digits.

In [None]:
# Normalize and clean the given text
def text_normalizer(text):
    if text:
        # Tokenize, keeping apostrophes and hyphens inside words; standalone punctuation is dropped
        tokenizer = RegexpTokenizer(r'\b\w[\w\'-]*\w\b|\w')
        words = tokenizer.tokenize(text)

        # Delete runs of three or more repeated characters (e.g. "XXXX", "sooo")
        words = [re.sub(r'(\w)\1{2,}', '', word) for word in words]

        # Convert to lowercase and strip surrounding whitespace
        words = [word.lower().strip() for word in words]

        # Replace tokens that are only digits with empty strings
        words = ['' if word.isdigit() else word for word in words]

        # Join the remaining non-empty tokens into a single string
        text = ' '.join([word for word in words if word])
    return text
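
The effect of each normalization step is easiest to see on a concrete string. The sketch below is a standard-library approximation: it assumes that `re.findall` with the same pattern yields the same tokens as the tokenizer used above, and `normalize_sketch` and the sample sentence are made up for illustration.

```python
import re

TOKEN_PATTERN = r"\b\w[\w'-]*\w\b|\w"

def normalize_sketch(text):
    # re.findall stands in for the tokenizer (an assumption of this sketch)
    words = re.findall(TOKEN_PATTERN, text)
    # Delete runs of three or more identical characters
    words = [re.sub(r"(\w)\1{2,}", "", w) for w in words]
    words = [w.lower().strip() for w in words]
    # Drop empty tokens and tokens made only of digits
    return " ".join(w for w in words if w and not w.isdigit())

print(normalize_sketch("I can't believe it!!! Sooooo bad, account 12345 XXXX"))
# prints: i can't believe it s bad account
```

Note that the repeated-character rule removes only the run, not the whole token, so "Sooooo" becomes "s" while a token made entirely of one repeated character, such as "XXXX", disappears.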

#### 4. Data Preprocessing: The CFPB dataset is loaded into a dataframe and the following transformations are applied:
* Removing rows with NaN values in the 'Consumer complaint narrative' column.
* Converting the 'Date received' column into datetime format.
* Computing the length of each narrative.
* Calculating the number of days since the complaint was received.
* Applying the text normalization function to the 'Consumer complaint narrative' column.
* Vectorizing the first 500 characters of each normalized narrative.

In [None]:
# Loading the dataset
complaint_file = "complaints.csv"  # placeholder: set this to the path of your CFPB complaints file
cfpb_df = pd.read_csv(complaint_file)

# Data preprocessing
print("Rows before dropping missing narratives: ", len(cfpb_df))
cfpb_df.dropna(subset=['Consumer complaint narrative'], inplace=True)
cfpb_df['Date received'] = pd.to_datetime(cfpb_df['Date received'])
cfpb_df['narr_len'] = cfpb_df['Consumer complaint narrative'].apply(lambda x:len(str(x)))
# Whole days between today and the date the complaint was received
cfpb_df['days_to_today'] = (pd.Timestamp.now().normalize() - cfpb_df['Date received'].dt.normalize()).dt.days
cfpb_df['narr_len'] = cfpb_df['narr_len'].astype(int)
cfpb_df['days_to_today'] = cfpb_df['days_to_today'].astype(int)
cfpb_df['clean_narr'] = cfpb_df['Consumer complaint narrative'].apply(text_normalizer)
cfpb_df['narr_head_vec'] = cfpb_df['clean_narr'].apply(lambda x: vectorize_text(x[:500]))
print("Rows after dropping missing narratives: ", len(cfpb_df))
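
The first few preprocessing steps can be tried on a tiny hand-made frame before running them on the full file. This is a minimal sketch, assuming a hypothetical three-row sample; `toy_df`, `reference_date`, and `days_to_ref` are illustrative names, and a fixed reference date replaces the current date so the result is reproducible.

```python
import pandas as pd

# Hypothetical three-row sample using the same column names as the CFPB export
toy_df = pd.DataFrame({
    "Date received": ["2023-01-10", "2023-01-12", "2023-01-15"],
    "Consumer complaint narrative": ["Late fee charged twice", None, "Card was closed"],
})

# Drop rows without a narrative, parse dates, and measure narrative length
toy_df = toy_df.dropna(subset=["Consumer complaint narrative"]).copy()
toy_df["Date received"] = pd.to_datetime(toy_df["Date received"])
toy_df["narr_len"] = toy_df["Consumer complaint narrative"].str.len()

# A fixed reference date keeps the example reproducible; the script uses the current date
reference_date = pd.Timestamp("2023-01-20")
toy_df["days_to_ref"] = (reference_date - toy_df["Date received"]).dt.days

print(toy_df[["narr_len", "days_to_ref"]])
```

Only the relative difference in days between two complaints matters for the duplicate search, so the choice of reference date does not change which narratives are compared.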

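#### 5. Identifying Duplicate Narratives: The find_duplicate_narr function identifies potential duplicates by comparing the vector representations of the narratives. Within each group of complaints sharing a state and ZIP code, a narrative is compared only with narratives whose length is within 20% of its own and that were received within 5 days of it. Candidates whose Euclidean distance to the narrative is below a threshold (0.25 here) are marked as duplicates. Every narrative matches itself at distance 0, so each list of duplicate IDs contains at least the complaint's own ID.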

In [None]:
# Preparing a smaller version of the dataframe for processing
small_cfpb_df = cfpb_df[['State', 'ZIP code','Complaint ID','narr_len', 'days_to_today','narr_head_vec']].copy()
small_cfpb_df[['State', 'ZIP code']] = small_cfpb_df[['State', 'ZIP code']].fillna('')

# Function to identify duplicate narratives by checking vector similarity
def find_duplicate_narr(df):
    small_df = df[['Complaint ID','narr_len', 'days_to_today', 'narr_head_vec']].copy()
    def find_dupi_in_small_df(row_narr_len, row_to_day, row_narr_head_vec, small_df):
        # Candidates: narrative length within 20% and received within 5 days of the current row
        tmp_df = small_df.query("narr_len <= @row_narr_len*1.2 & narr_len >= @row_narr_len*0.8 & days_to_today <= @row_to_day+5 & days_to_today >= @row_to_day-5").copy()
        # Euclidean distance between each candidate's vector and the current row's vector
        tmp_df['euclidean_dist'] = tmp_df['narr_head_vec'].apply(lambda x: np.linalg.norm(x - row_narr_head_vec))
        # Keep candidates closer than the 0.25 threshold (this includes the row itself)
        dupli_df = tmp_df[tmp_df['euclidean_dist']<0.25]
        dupli_id_list = sorted(dupli_df['Complaint ID'].to_list())
        return dupli_id_list
    # For every row, collect the sorted IDs of all complaints flagged as its duplicates
    df['dupi_id'] = small_df.apply(lambda row: find_dupi_in_small_df(row['narr_len'], int(row['days_to_today']), row['narr_head_vec'], small_df),axis=1)
    return df

# Applying the duplicate finder function to the dataframe
small_cfpb_df = small_cfpb_df.groupby(['State', 'ZIP code'], group_keys=False).apply(func=find_duplicate_narr)
small_cfpb_df['dupi_len'] = small_cfpb_df['dupi_id'].apply(lambda x: len(x))
small_cfpb_df['dupi_id'] = small_cfpb_df['dupi_id'].apply(lambda x: ";".join([str(y) for y in x]))
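
The matching rule can also be expressed directly with NumPy arrays, which makes the three conditions easy to inspect on a handful of rows. This is a sketch under stated assumptions, not the function used above: `near_duplicates`, its parameter names, and the three sample complaints are made up, and the defaults simply mirror the 20% length tolerance, 5-day window, and 0.25 distance threshold.

```python
import numpy as np

def near_duplicates(ids, lengths, days, vecs, len_tol=0.2, day_window=5, max_dist=0.25):
    ids, lengths, days = np.asarray(ids), np.asarray(lengths), np.asarray(days)
    vecs = np.asarray(vecs, dtype=float)
    result = {}
    for i in range(len(ids)):
        # Candidate filter: similar length and close submission date
        close_len = (lengths >= lengths[i] * (1 - len_tol)) & (lengths <= lengths[i] * (1 + len_tol))
        close_day = np.abs(days - days[i]) <= day_window
        # Euclidean distance from row i to every row
        dist = np.linalg.norm(vecs - vecs[i], axis=1)
        mask = close_len & close_day & (dist < max_dist)
        result[int(ids[i])] = sorted(int(x) for x in ids[mask])
    return result

# Made-up complaints: 101 and 102 are close in vector space, 103 is far away
matches = near_duplicates(
    ids=[101, 102, 103],
    lengths=[100, 110, 105],
    days=[30, 32, 31],
    vecs=[[0.0, 0.0], [0.1, 0.1], [1.0, 1.0]],
)
print(matches)
# prints: {101: [101, 102], 102: [101, 102], 103: [103]}
```

Complaint 103 passes the length and date filters but is only matched with itself, because its distance to the other two vectors is well above the threshold.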


#### 6. Merging and Saving Data: The duplicate IDs (dupi_id) and their counts (dupi_len) are merged back into the original dataframe on Complaint ID, the vector column is dropped, and the result is saved as a CSV file.

In [None]:
# Merging identified duplicates back to the original dataframe
merged_df = cfpb_df.merge(small_cfpb_df[['Complaint ID', 'dupi_id', 'dupi_len']], on='Complaint ID', how='left').drop(['narr_head_vec'], axis=1)

# Saving the processed dataset with marked duplicates
save_file_name = "where you want to store"  # placeholder: set this to the output CSV path
merged_df.to_csv(save_file_name, index=False)
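
Once the file is written, the semicolon-joined dupi_id strings need to be split again before the matches can be used. The sketch below reads a made-up three-row CSV in the same shape as the saved output; `saved_csv`, `result_df`, `dupi_ids`, and `has_duplicate` are illustrative names.

```python
import io
import pandas as pd

# Made-up output in the same shape as the saved file
saved_csv = io.StringIO(
    "Complaint ID,dupi_id,dupi_len\n"
    "101,101;102,2\n"
    "102,101;102,2\n"
    "103,103,1\n"
)

# Read dupi_id as text; a column holding only single IDs would otherwise be parsed as integers
result_df = pd.read_csv(saved_csv, dtype={"dupi_id": str})
result_df["dupi_ids"] = result_df["dupi_id"].str.split(";")

# A row always matches itself, so only dupi_len > 1 signals another complaint
has_duplicate = result_df[result_df["dupi_len"] > 1]
print(has_duplicate["Complaint ID"].tolist())
# prints: [101, 102]
```

Filtering on dupi_len > 1 rather than dupi_len > 0 is the important detail here, since every complaint appears in its own list of matches.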