### Dataset Exploration

In this section, we will analyze relevant datasets to identify the most suitable ones for the **CallConnect** app. After reviewing available options, we have shortlisted the following datasets for detailed analysis:

---

### 1. Question-Answer Dataset  
This dataset consists of question-answer pairs extracted from Wikipedia articles. It is ideal for training systems to generate accurate and concise answers to user queries, making it a valuable resource for building knowledge-driven chatbots.  

- **Link:** [Question-Answer Dataset](https://www.cs.cmu.edu/~ark/QA-data/?ref=hackernoon.com)

---

### 2. Ubuntu Dialogue Corpus  
This dataset comprises over **26 million conversational turns** from two-person dialogues, derived from real-world conversations. It is well-suited for modeling multi-turn dialogues and enhancing contextual understanding in chatbot systems.  
**Code to Download the Dataset:**

```python
import kagglehub

# Download the latest version
path = kagglehub.dataset_download("rtatman/ubuntu-dialogue-corpus")

print("Path to dataset files:", path)
```

- **Link:** [Ubuntu Dialogue Corpus on Kaggle](https://www.kaggle.com/datasets/rtatman/ubuntu-dialogue-corpus)

---

### 3. Customer Support on Twitter  
This dataset includes over **3 million tweets and replies** exchanged between prominent brands and their customers on Twitter. It is particularly valuable for training chatbots to handle customer service interactions and provide prompt, accurate responses.  
**Code to Download the Dataset:**

```python
import kagglehub

# Download the latest version
path = kagglehub.dataset_download("thoughtvector/customer-support-on-twitter")

print("Path to dataset files:", path)
```

- **Link:** [Customer Support on Twitter](https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter)
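
Once `kagglehub.dataset_download` returns a local path, it can help to check which files actually arrived before writing any loading code. The helper below is a minimal sketch, not part of `kagglehub`: the function name `list_dataset_files` and the default `*.csv` pattern are assumptions made for illustration.

```python
from pathlib import Path


def list_dataset_files(root, pattern="*.csv"):
    """Return (relative_path, size_in_bytes) for every matching file under root.

    Hypothetical helper: `root` is expected to be the directory returned by
    kagglehub.dataset_download, but any local directory works.
    """
    root = Path(root)
    return sorted(
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in root.rglob(pattern)
        if p.is_file()
    )
```

Calling `list_dataset_files(path)` with the `path` variable from either download snippet above would list the CSV files in that dataset along with their sizes.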

---

### 4. Switchboard Dialog Dataset  
The **Switchboard Dialog Dataset** is a collection of over **2,400 human-to-human telephone conversations**. Each call pairs two speakers who talk about an assigned topic, so the corpus captures natural spoken dialogue rather than scripted exchanges. This makes it especially useful for training chatbots that need to follow the flow of real phone conversations.  

- **Documentation:** [Switchboard Dialog Dataset Docs](https://convokit.cornell.edu/documentation/switchboard.html)  
- **GitHub Repository:** [Switchboard Dialog Dataset](https://github.com/cgpotts/swda)

---

### Additional Resources
For further exploration of chatbot datasets:  
- **Hackernoon Article:** [Top 15 Chatbot Datasets for NLP Projects](https://hackernoon.com/top-15-chatbot-datasets-for-nlp-projects-8k2f3zqc)  
- **Dev Article:** [10 Useful Chatbot Datasets for NLP Projects](https://dev.to/devashishmamgain/10-useful-chatbot-datasets-for-nlp-projects-1fj2)  

---

### Selected Dataset for the **CallConnect** App  
After careful consideration, we have decided to use the **Switchboard Dialog Dataset** as the primary dataset for our project. Its real-world telephone conversations align well with the goals of the **CallConnect** app. The other datasets (the **Ubuntu Dialogue Corpus**, **Customer Support on Twitter**, and the **Question-Answer Dataset**) will be used to complement specific features of the app.

### Reading the Switchboard Dialog Act Corpus (SwDA)

We are using the **Switchboard Dialog Act Corpus (SwDA)** for the **CallConnect** app. SwDA is the subset of Switchboard that has been annotated with speech act tags: **1,155 five-minute telephone conversations** between two participants. In these conversations, callers question receivers on provided topics such as child care, recycling, and news media. Below is an overview of the dataset's structure and metadata, and of how we plan to use this data in our app.

---

#### Dataset Highlights:
- **Conversations:** 1,155 telephone conversations
- **Speakers:** 440 participants
- **Utterances:** 221,616 in the original corpus; 122,646 in the processed version
- **Topics:** The conversations revolve around prompts like "child care" or "news media," which can be useful in analyzing various conversational aspects.

#### Processed Dataset:
To make this dataset ready for experimentation and modeling, we’re working with a processed version of it. The key modifications include:
- **Disfluencies Removed:** Using regex to clean interruptions and non-verbal sounds that aren’t useful for our analysis.
- **Backchannels Removed:** Utterances with fewer than 5 tokens (such as “uh” or “mm-hmm”) are excluded.
- **Merged Turns:** Successive turns by the same speaker, interrupted by backchannels, are merged to make the conversation flow clearer.
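
The three modifications above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the pipeline that produced the processed corpus: the regex patterns (angle-bracket sounds such as `<laughter>` and the fillers "uh"/"um") and the names `clean_text` and `preprocess` are hypothetical choices. Only the 5-token threshold comes from the description above.

```python
import re

# Assumed patterns, chosen for illustration only.
NONVERBAL = re.compile(r"<[^>]*>")  # e.g. "<laughter>"
FILLER = re.compile(r"\b(?:uh|um)\b[,.]?\s*", re.IGNORECASE)
MIN_TOKENS = 5  # utterances shorter than this are treated as backchannels


def clean_text(text):
    """Strip non-verbal markers and simple fillers, then normalize spaces."""
    text = NONVERBAL.sub(" ", text)
    text = FILLER.sub("", text)
    return " ".join(text.split())


def preprocess(turns):
    """Clean, filter, and merge a list of (speaker, text) turns."""
    merged = []
    for speaker, text in turns:
        cleaned = clean_text(text)
        if len(cleaned.split()) < MIN_TOKENS:
            continue  # drop backchannels such as "mm-hmm"
        if merged and merged[-1][0] == speaker:
            # Same speaker as the previous kept turn: merge the two turns.
            merged[-1] = (speaker, merged[-1][1] + " " + cleaned)
        else:
            merged.append((speaker, cleaned))
    return merged
```

Because backchannels are dropped before the merge check, two turns by the same speaker that were separated only by a short "mm-hmm" end up as a single turn.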

---

#### Dataset Structure

##### 1. **Speaker-Level Metadata**
Each speaker has metadata that helps us understand the conversation dynamics:
- **ID:** Unique identifier for each speaker.
- **Sex:** The gender of the speaker (`MALE` or `FEMALE`).
- **Education:** The speaker’s education level (e.g., 0: Less than high school, 1: Less than college, 2: College, 3: More than college, 9: Unknown).
- **Birth Year:** The year the speaker was born.
- **Dialect Area:** Regional dialect classification (e.g., `SOUTHERN`, `NYC`, `WESTERN`). This will help us tailor our app’s recognition systems.
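
As a small sketch of how these fields might be represented in code, the dataclass below decodes the education codes listed above. The class name `Speaker`, its field names, and the example values in the usage note are assumptions for illustration; they do not mirror any particular library's schema.

```python
from dataclasses import dataclass

# Education codes as described in the speaker metadata above.
EDUCATION_LEVELS = {
    0: "Less than high school",
    1: "Less than college",
    2: "College",
    3: "More than college",
    9: "Unknown",
}


@dataclass(frozen=True)
class Speaker:
    """Hypothetical container for speaker-level metadata."""

    speaker_id: str
    sex: str
    education: int
    birth_year: int
    dialect_area: str

    @property
    def education_label(self):
        # Codes outside the documented set fall back to "Unknown".
        return EDUCATION_LEVELS.get(self.education, "Unknown")
```

For example, `Speaker("s1", "FEMALE", 2, 1960, "WESTERN").education_label` evaluates to `"College"` (all field values here are made up).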

##### 2. **Utterance-Level Metadata**
Each utterance corresponds to a turn in the conversation. We can use this metadata to structure our app’s responses and interactions:
- **ID:** Formatted as `<conversation_id>-<utterance_position>` (e.g., `4325-0`).
- **Speaker:** The person speaking.
- **Text:** The content of the utterance.
- **Tag:** Speech act tags (DAMSL) that classify the type of speech act (e.g., question, request).
- **Alpha Text:** A version with only alphabetic tokens (for processed data).
- **Reply To:** ID of the utterance this replies to (important for understanding conversation flow).
- **Next ID:** ID of the next utterance replying to this one.
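
The ID format and the **Reply To** field are enough to reconstruct the thread that leads up to any utterance. The sketch below assumes utterances are held in a plain dictionary keyed by ID, with a `reply_to` entry that is `None` for the first utterance; the function names and this storage layout are illustrative assumptions.

```python
def parse_utterance_id(utt_id):
    """Split an ID such as '4325-0' into (conversation_id, position)."""
    conversation_id, position = utt_id.rsplit("-", 1)
    return conversation_id, int(position)


def reply_chain(utterances, utt_id):
    """Return the IDs from the start of the thread up to utt_id.

    `utterances` maps an utterance ID to a dict with a 'reply_to' key
    (assumed layout). The `seen` set guards against accidental cycles.
    """
    chain = []
    seen = set()
    current = utt_id
    while current is not None and current in utterances and current not in seen:
        seen.add(current)
        chain.append(current)
        current = utterances[current]["reply_to"]
    return chain[::-1]
```

Walking the chain backwards and then reversing it keeps the function simple, since each utterance only stores a pointer to the one it replies to.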

##### 3. **Conversation-Level Metadata**
This information helps us track the entire conversation:
- **Filename:** The original file name in the SwDA dataset.
- **Topic Description:** A brief description of the conversation’s topic.
- **Prompt:** A detailed description of the conversation prompt, guiding the flow of conversation.
- **Length:** The duration of the conversation in minutes.
- **From Caller / To Caller:** The identifiers for the participants (A and B).

---

### Combining CSV Files for Model Training
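The raw SwDA data is split across many CSV files. Each conversation's utterances are stored in their own `.utt.csv` file (for example, `sw00utt/sw_0001_4325.utt.csv`), and the conversation- and speaker-level metadata is kept separately. To train our models, we'll need to consolidate these files into a single format that we can use effectively.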

#### Steps to Process:
1. **Understand the Files:** We’ll begin by reviewing each file's content (such as the utterances, metadata, and speaker information).
2. **Merge the Data:** We’ll combine the files on shared keys, such as the conversation number (`conversation_no` in the raw utterance files), so that all data about a conversation is in one place.
3. **Clean the Data:** We’ll remove any unnecessary or noisy data (e.g., disfluencies, backchannels) to ensure our models only get the most relevant information.
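
The merge step can be sketched with pandas as follows. This is a minimal sketch under assumptions: it presumes the utterance files end in `.utt.csv` (as in the sample path used in this notebook) and that a metadata table shares a `conversation_no` column with them. The function names `load_utterances` and `attach_metadata` are hypothetical.

```python
from pathlib import Path

import pandas as pd


def load_utterances(root):
    """Concatenate every *.utt.csv file under root into one DataFrame."""
    files = sorted(Path(root).rglob("*.utt.csv"))
    if not files:
        raise FileNotFoundError(f"No .utt.csv files found under {root}")
    frames = [pd.read_csv(f) for f in files]
    return pd.concat(frames, ignore_index=True)


def attach_metadata(utterances, metadata, key="conversation_no"):
    """Left-join conversation-level metadata onto the utterance rows.

    validate="many_to_one" makes pandas raise an error if the metadata
    table contains duplicate keys, which would silently duplicate rows.
    """
    return utterances.merge(metadata, on=key, how="left", validate="many_to_one")
```

A left join keeps every utterance even when its conversation has no metadata row, so missing metadata shows up as `NaN` values rather than as dropped utterances.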

---

### Documentation and References:
- **SwDA Dataset Documentation:** [Switchboard Dialog Dataset Docs](https://convokit.cornell.edu/documentation/switchboard.html)  
- **Original Paper:** Stolcke et al., "Dialogue Act Modeling for Automatic Tagging and Recognition of Conversational Speech," *Computational Linguistics* 26(3), 2000.  

By merging this data into a unified format, we can build more accurate models for understanding conversations in the **CallConnect** app, making it better at handling natural dialogue.

```python
# Importing required libraries
import pandas as pd

# Reading one file from dataset to understand its structure
file_sample = pd.read_csv(r"data\raw\swda\sw00utt\sw_0001_4325.utt.csv")

print("Sample File is:\n")
print(file_sample)
```

```
Sample File is:

                swda_filename ptb_basename  conversation_no  transcript_index  \
0    sw00utt/sw_0001_4325.utt     4/sw4325             4325                 0
1    sw00utt/sw_0001_4325.utt     4/sw4325             4325                 1
2    sw00utt/sw_0001_4325.utt     4/sw4325             4325                 2
3    sw00utt/sw_0001_4325.utt     4/sw4325             4325                 3
4    sw00utt/sw_0001_4325.utt     4/sw4325             4325                 4
..                        ...          ...              ...               ...
154  sw00utt/sw_0001_4325.utt     4/sw4325             4325               154
155  sw00utt/sw_0001_4325.utt     4/sw4325             4325               155
156  sw00utt/sw_0001_4325.utt     4/sw4325             4325               156
157  sw00utt/sw_0001_4325.utt     4/sw4325             4325               157
158  sw00utt/sw_0001_4325.utt     4/sw4325             4325               158

    act_ta
```

### Columns of the dataset

```python
print("Columns are:\n")

print(file_sample.columns)
```

```
Columns are:

Index(['swda_filename', 'ptb_basename', 'conversation_no', 'transcript_index',
       'act_tag', 'caller', 'utterance_index', 'subutterance_index', 'text',
       'pos', 'trees', 'ptb_treenumbers'],
      dtype='object')
```
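
With the column names known, a quick per-conversation summary is a useful sanity check before any merging. The sketch below relies only on the `caller` and `act_tag` columns listed above; the function name `summarize_conversation` and the choice of statistics are assumptions for illustration.

```python
import pandas as pd


def summarize_conversation(df):
    """Summarize one conversation's DataFrame (one row per utterance)."""
    return {
        "n_utterances": len(df),
        # How many rows each side of the call contributed.
        "utterances_per_caller": df["caller"].value_counts().to_dict(),
        # The three most frequent dialog act tags in this conversation.
        "top_act_tags": df["act_tag"].value_counts().head(3).index.tolist(),
    }
```

Running `summarize_conversation(file_sample)` on the file loaded earlier would show how the 159 rows are split between callers A and B and which dialog acts dominate.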
