<p align="center" width="100%">
    <img width="40%" src="customer_support_icon.JPG"> 
</p>

A retail company wants to improve its customer service using speech recognition and natural language processing (NLP). As the machine learning engineer on this initiative, you are tasked with converting customer support audio calls into text and exploring methods for extracting insights from the resulting transcripts.

The solution is built on three open-source packages, `SpeechRecognition`, `Pydub`, and `spaCy`, with NLTK's VADER analyzer used for sentiment scoring. Your objectives are:
  - Transcribe a sample customer audio call, stored at `sample_customer_call.wav`, using open-source speech recognition.
  - Analyze sentiment, identify the most common named entities, and find the customer calls most similar to a given query, using a subset of the company's pre-transcribed call data stored at `customer_call_transcriptions.csv`.

The notebook below works through these objectives in order: transcription and audio statistics, sentiment analysis, named entity recognition, and similarity search.

In [38]:
!pip install SpeechRecognition
!pip install pydub
!pip install spacy
!python3 -m spacy download en_core_web_sm

Defaulting to user installation because normal site-packages is not writeable
Collecting en-core-web-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [39]:
# Import required libraries
import pandas as pd

import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

import speech_recognition as sr
from pydub import AudioSegment

import spacy

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/repl/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [40]:
# 1 # Implement speech recognition and calculate audio statistics

def transcribe_audio(filename):
    """Transcribe an audio file with Google's Web Speech API (requires internet access)."""
    # Set up a recognizer instance
    recognizer = sr.Recognizer()

    # Read the whole audio file into an AudioData object
    with sr.AudioFile(filename) as source:
        audio_data = recognizer.record(source)

    # Return the transcribed text
    return recognizer.recognize_google(audio_data)

# Transcribe AudioData to text
transcribed_text = transcribe_audio("sample_customer_call.wav")
print(transcribed_text)

hello I'm experiencing an issue with your product I'd like to speak to someone about a replacement


In [41]:
# Import audio file
wav_file = AudioSegment.from_file(file="sample_customer_call.wav")

# Read the frame rate (samples per second) and the number of channels
frame_rate = wav_file.frame_rate
print(f"Frame Rate: {frame_rate}")
number_channels = wav_file.channels
print(f"Channels: {number_channels}")

Frame Rate: 44100
Channels: 1
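The same statistics can be cross-checked without `pydub`. The following is a minimal sketch using Python's standard `wave` module, which only reads uncompressed PCM WAV files. The helper names `wav_stats` and `make_silent_wav` are hypothetical, and the silent in-memory file is a stand-in for `sample_customer_call.wav` so that the example runs on its own.

```python
import io
import wave


def wav_stats(wav_source):
    """Return frame rate, channel count, and duration (seconds) of a PCM WAV.

    `wav_source` may be a file path (str) or a binary file-like object.
    """
    with wave.open(wav_source, "rb") as wav:
        frame_rate = wav.getframerate()
        channels = wav.getnchannels()
        n_frames = wav.getnframes()
    return {
        "frame_rate": frame_rate,
        "channels": channels,
        "duration_s": n_frames / frame_rate,
    }


def make_silent_wav(seconds=2, frame_rate=44100, channels=1):
    """Build an in-memory 16-bit PCM WAV of silence (hypothetical stand-in file)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(frame_rate)
        # One 16-bit zero sample per channel per frame
        wav.writeframes(b"\x00\x00" * frame_rate * seconds * channels)
    buffer.seek(0)
    return buffer


stats = wav_stats(make_silent_wav())
print(stats)
```

Passing the path `"sample_customer_call.wav"` to `wav_stats` instead of the in-memory buffer should report the same frame rate and channel count as the `pydub` cell, assuming the file is plain PCM WAV.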


In [42]:
# 2 # Perform sentiment analysis

# Load the CSV file
df = pd.read_csv("customer_call_transcriptions.csv")

# Initialize the sentiment analyzer
sid = SentimentIntensityAnalyzer()

def classify_sentiment(text):
    """Map VADER's compound score to a label using the commonly used ±0.05 cutoffs."""
    compound_score = sid.polarity_scores(text)['compound']
    if compound_score >= 0.05:
        return "positive"
    if compound_score <= -0.05:
        return "negative"
    return "neutral"

# Predict a sentiment label for every transcription
df["prediction"] = df["text"].apply(classify_sentiment)

# Print the first few rows of the DataFrame
print(df.head())

# Count the calls labeled positive that were also predicted positive (true positives)
true_positive = len(df[(df['sentiment_label'] == "positive") & (df['prediction'] == "positive")])
print(true_positive)

                                                text sentiment_label prediction
0  how's it going Arthur I just placed an order w...        negative   negative
1  yeah hello I'm just wondering if I can speak t...         neutral   positive
2  hey I receive my order but it's the wrong size...        negative   negative
3  hi David I just placed an order online and I w...         neutral    neutral
4  hey I bought something from your website the o...        negative    neutral
2
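A single true-positive count gives only a partial picture of how well the predictions match the labels. The sketch below shows one way to compute accuracy and a simple confusion tally in plain Python. The `pairs` list is copied from the five rows printed above only, so its numbers do not describe the full dataset, and the helper names are hypothetical.

```python
from collections import Counter

# (true label, predicted label) pairs copied from the five printed rows only
pairs = [
    ("negative", "negative"),
    ("neutral", "positive"),
    ("negative", "negative"),
    ("neutral", "neutral"),
    ("negative", "neutral"),
]


def accuracy(pairs):
    """Fraction of rows where the prediction equals the true label."""
    return sum(true == pred for true, pred in pairs) / len(pairs)


def confusion_counts(pairs):
    """Count how often each (true, predicted) combination occurs."""
    return Counter(pairs)


def true_positives(pairs, label):
    """Count rows where both the true and predicted label equal `label`."""
    return sum(1 for true, pred in pairs if true == label and pred == label)


print(f"Accuracy on the five rows: {accuracy(pairs):.2f}")
for (true, pred), count in sorted(confusion_counts(pairs).items()):
    print(f"true={true:<8} predicted={pred:<8} count={count}")
```

In the notebook, the same pairs could be built for every row with `list(zip(df['sentiment_label'], df['prediction']))`.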


In [43]:
# 3 # Run named entity recognition

from collections import Counter

nlp = spacy.load("en_core_web_sm")
all_entities = []

# Iterate over each transcription
for text in df['text']:
    # Process the text with spaCy to extract named entities
    doc = nlp(text)
    # Add entities to the list
    all_entities.extend([ent.text for ent in doc.ents])

# Count the frequency of each named entity
entity_counts = Counter(all_entities)

# Find the most common named entity and its count
most_freq_ent, most_freq_count = entity_counts.most_common(1)[0]
print(f"The most frequently mentioned entity is: {most_freq_ent} with {most_freq_count} mentions.")

The most frequently mentioned entity is: yesterday with 15 mentions.
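The top entity here is a time expression, which spaCy typically tags with a label such as `DATE`. If names of people, products, or organizations are of more interest, the counts can be restricted by entity label. The sketch below illustrates the idea with plain Python. The `entities` list is hypothetical illustrative data, not output from the dataset, and the choice of excluded labels is an assumption.

```python
from collections import Counter

# Hypothetical (text, label) pairs for illustration only. In the notebook they
# would come from [(ent.text, ent.label_) for ent in doc.ents] per transcription.
entities = [
    ("yesterday", "DATE"),
    ("Arthur", "PERSON"),
    ("yesterday", "DATE"),
    ("David", "PERSON"),
    ("Arthur", "PERSON"),
    ("two", "CARDINAL"),
    ("yesterday", "DATE"),
]

# Assumption: temporal and numeric entity labels are not of interest here
EXCLUDED_LABELS = {"DATE", "TIME", "CARDINAL", "ORDINAL"}


def most_common_entity(entities, excluded_labels=frozenset()):
    """Return (text, count) for the most frequent entity outside the excluded labels."""
    counts = Counter(
        text for text, label in entities if label not in excluded_labels
    )
    return counts.most_common(1)[0] if counts else None


print(most_common_entity(entities))                   # no filtering
print(most_common_entity(entities, EXCLUDED_LABELS))  # temporal/numeric labels dropped
```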


In [44]:
# 4 # Find most similar texts

# Create a list of Doc containers, one per transcription
documents = [nlp(t) for t in df['text']]
# Create a Doc container for the query
query = "wrong package delivery"
query_document = nlp(query)

# Note: en_core_web_sm ships without static word vectors, so spaCy warns that
# Doc.similarity() may be less reliable here; en_core_web_md and en_core_web_lg
# include vectors.
# Score each transcription against the query and store the scores in the DataFrame
df['similarity'] = [query_document.similarity(doc) for doc in documents]

# Find the most similar document
most_similar_doc = df.loc[df['similarity'].idxmax()]

most_similar_text = most_similar_doc['text']
# Print the most similar document and its similarity score
print(f"Most similar document: {most_similar_text}")
print(f"Similarity score: {most_similar_doc['similarity']}")

Most similar document: wrong package delivered
Similarity score: 0.5110670022005256
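By default, spaCy's `Doc.similarity` is documented as a cosine similarity between document vectors. The sketch below shows that computation and the "pick the best match" step in plain Python, as an illustration of the idea and not spaCy's own implementation. The three-dimensional vectors are toy values chosen for the example, and the helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_vector, document_vectors):
    """Return (index, score) of the document vector closest to the query."""
    scores = [cosine_similarity(query_vector, v) for v in document_vectors]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores[best_index]


# Toy vectors standing in for doc.vector values
query_vector = [1.0, 0.0, 1.0]
document_vectors = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query, different length
    [1.0, 1.0, 0.0],  # partially aligned with the query
]

best_index, best_score = most_similar(query_vector, document_vectors)
print(f"Best match: document {best_index} (score {best_score:.3f})")
```

Because cosine similarity ignores vector length, the second toy document scores highest even though its magnitude differs from the query's.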
