# 🧠 Project 07: Incident Classification & SLA Risk Detection (NLP)
### **Study Notebook: Guided Exploration & Mentorship Version**

---

## 📌 What this notebook is

This is the **study-version notebook**, where the full reasoning process is documented:

- exploratory steps  
- trial and error  
- debugging  
- alternative modeling paths  
- detailed explanations  
- mentor questions & reflections  
- incremental improvements  

It captures the *real learning journey* behind the project, not only the final result.

---

## 🎯 Project Summary

The goal of this project is to build and compare NLP pipelines to automatically classify **logistics incident reports** that impact SLA performance.  
We will evaluate classical ML models (TF-IDF + classifiers) alongside modern embedding-based approaches.

Key outcomes:

- Structured NLP preprocessing workflow  
- Incident classification models  
- Operational insights relevant for SLA management  
- Comparison between modeling strategies  

---

## 🗂️ Notebook Structure (study version)

1. **PPI: Preliminary Problem Identification**  
2. **Data Loading & Initial Inspection**  
3. **Text Exploration & Cleaning Strategy**  
4. **NLP Preprocessing Pipeline**  
5. **TF-IDF Vectorization + Classical Models**  
6. **Embedding-Based Classification**  
7. **Model Comparison (Metrics + Interpretability)**  
8. **Operational Insights**  
9. **Next Steps toward final-version notebook**

---

## 🧭 Notes

This notebook is intentionally verbose.  
Every decision is explained.  
Every mistake stays visible until corrected.  
Every question raised here will shape the final executive version.

If you're reading this as part of the portfolio:  
👉 *The final clean deliverable is located in* `notebooks/final-version.ipynb`.

---

Let's start the analysis.


# 🧠 Mentorship Block: Guided Reasoning Before Analysis

This section contains conceptual and strategic questions designed to:
- strengthen my reasoning before writing code,  
- clarify expectations about the data,  
- refine my understanding of NLP in operational environments,  
- anticipate challenges,  
- and prepare me for the analytical phase.

These answers will influence:
- my PPI refinement,  
- the modeling strategy,  
- the experiment plan,  
- and the final-version notebook.

---

## **1) What operational scenarios do I imagine when reading an incident report?**

*(Describe a few hypothetical incident messages and what they imply operationally.)*

During the review of the incident report, the following operational scenarios came to mind:

**Route Obstruction:**  
Ongoing road construction led to two streets along the planned delivery route being blocked for an indefinite period, requiring immediate rerouting.

**System / Connectivity Failure:**  
An internet connectivity issue prevented the GPS application from updating in real time, causing inaccurate tracking information and operational delays.

**Logistics Capacity Constraint:**  
A shipment remained at the distribution center for three consecutive days due to the lack of an available vehicle assigned to that route.

**Vehicle Breakdown:**  
A delivery vehicle experienced a mechanical failure due to insufficient preventive maintenance, rendering it non-operational and interrupting the delivery sequence.

---

## **2) Which incident types do I believe are the most critical for SLA? Why?**

**The most critical incident types for SLA performance, in order of severity, are:**

1. **Vehicle breakdowns (especially due to lack of maintenance)**  
   These incidents immediately interrupt the delivery chain and can halt an entire route.  
   A breakdown typically creates cascading delays, affects multiple orders, and has no quick workaround unless contingency vehicles or repair teams are readily available.  
   This makes it one of the highest-impact events for SLA compliance.

2. **Shipments stuck at the distribution center for extended periods**  
   When a package remains in the hub for several days, the SLA is already at risk before the delivery even begins.  
   Regardless of the root cause, prolonged stagnation indicates a structural failure in capacity planning or resource allocation, making it an operational red flag.

3. **Internet or system connectivity failures**  
   These incidents disrupt real-time tracking, routing updates, and communication with drivers.  
   Although typically resolvable, they degrade visibility and decision-making, increasing the likelihood of SLA breaches due to delayed responses.

4. **Road construction or route blockages**  
   These events require rerouting and may add significant travel time, but they are generally predictable and can often be mitigated with advance planning or updated routing strategies.  
   As a result, their SLA impact is lower compared to system failures or vehicle breakdowns.

---

## **3) If I had to manually classify 300 incident messages per day, what patterns would I look for?**

If I were manually classifying a large volume of incident messages, I would look for specific keywords and linguistic patterns that indicate the nature and urgency of the issue. These terms act as strong semantic signals for operational categorization.

Examples include:

- **“vehicle breakdown”, “mechanical failure”, “engine issue”**  
  → Indicates a vehicle-related incident with high SLA risk.

- **“delivery delay”, “running late”, “behind schedule”**  
  → Suggests time-related issues affecting route performance.

- **“unable to deliver”, “customer not found”, “address issue”**  
  → Points to delivery exceptions that may require reattempt or rerouting.

- **“road blocked”, “construction”, “route closed”, “obstruction”**  
  → Signals external route disruptions that impact travel time.

- **“vehicle maintenance”, “needs repair”, “inspection required”**  
  → Suggests structural issues that may affect fleet availability.

- **“shipment stuck”, “order held”, “item not dispatched”**  
  → Indicates stagnation at the distribution center, a strong SLA warning.

- **“no tracking”, “system not updating”, “GPS error”**  
  ‚Üí Implies system or connectivity issues that reduce visibility and response speed.

These keyword patterns help identify the probable incident type quickly, even before examining the full context.
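
The manual triage described above can be sketched as a simple keyword lookup. This is only an illustrative baseline: the category names and the `classify_by_keywords` helper are hypothetical, and the keyword lists are copied from the examples in this section rather than derived from data.

```python
# Hypothetical keyword rules built from the examples above.
# Category names are illustrative, not the final label set.
KEYWORD_RULES = {
    "vehicle_breakdown": ["vehicle breakdown", "mechanical failure", "engine issue"],
    "delivery_delay": ["delivery delay", "running late", "behind schedule"],
    "delivery_exception": ["unable to deliver", "customer not found", "address issue"],
    "route_obstruction": ["road blocked", "construction", "route closed", "obstruction"],
    "maintenance": ["vehicle maintenance", "needs repair", "inspection required"],
    "hub_stagnation": ["shipment stuck", "order held", "item not dispatched"],
    "system_failure": ["no tracking", "system not updating", "gps error"],
}


def classify_by_keywords(message: str) -> str:
    """Return the first category whose keyword appears in the message."""
    text = message.lower()
    for category, keywords in KEYWORD_RULES.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "unknown"


print(classify_by_keywords("Engine issue on truck 12, mechanical failure confirmed"))
print(classify_by_keywords("Road blocked near the hub, rerouting"))
print(classify_by_keywords("vihacles nreakdaw"))  # a typo defeats exact matching
```

A lookup like this is fast and easy to explain, but it returns `unknown` as soon as the wording drifts from the listed phrases, which is the gap the later NLP models are meant to close.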

---

## **4) What kinds of text noise do I expect?**

Given the likely context of the message sender (the operational situation, the sense of urgency, and the pressure to keep the operation running), I expect to encounter several types of text noise in the incident reports.

These include:

### **Typos and misspellings**
Often caused by haste, mobile typing, or voice-to-text errors:
- “vihacles nreakdaw”
- “fail car”
- “ingene poblem”

### **Incomplete or truncated messages**
Typically written under time pressure or connectivity issues:
- “cannot deliv...”
- “intercep...”
- “delay...”

### **Role-specific operational vocabulary**
Messages may contain terms, abbreviations, or jargon commonly used by drivers, dispatchers, or hub operators, which may not follow standard language conventions.

### **Slang and functional expressions**
Informal or shorthand expressions used to communicate quickly within the operation, often relying on shared contextual knowledge rather than grammatical accuracy.

These characteristics reflect real-world operational communication and reinforce the need for robust NLP preprocessing rather than rigid keyword rules.
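
As a rough illustration of what such preprocessing might look like, the sketch below normalizes a message and then maps misspelled tokens to the closest word in a small vocabulary using the standard library's `difflib`. The `VOCABULARY` list, the helper names, and the `0.6` cutoff are all assumptions made for the example; a real vocabulary would be built from the dataset.

```python
import difflib
import re

# Hypothetical domain vocabulary; in practice it would be built from the data.
VOCABULARY = ["vehicle", "breakdown", "engine", "problem", "delivery", "delay"]


def normalize(message: str) -> str:
    """Lowercase, drop a trailing ellipsis, and collapse repeated whitespace."""
    text = message.lower().strip()
    text = re.sub(r"\.{2,}$", "", text)  # "cannot deliv..." -> "cannot deliv"
    return re.sub(r"\s+", " ", text).strip()


def correct_token(token: str, cutoff: float = 0.6) -> str:
    """Map a token to its closest vocabulary word, or keep it unchanged."""
    matches = difflib.get_close_matches(token, VOCABULARY, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def clean(message: str) -> str:
    """Normalize the message, then apply fuzzy correction token by token."""
    return " ".join(correct_token(tok) for tok in normalize(message).split())


print(clean("vihacles nreakdaw"))
print(normalize("  Cannot   deliv...  "))
```

Fuzzy correction can also introduce errors (a short token may be pulled toward the wrong word), so the cutoff would need tuning against real messages before relying on it.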

---

## **5) What makes NLP necessary here instead of simple keyword rules?**

NLP is essential in this scenario because it can account for multiple factors that influence how operational messages are written and interpreted.

By processing natural language, NLP models are able to:
- capture linguistic patterns and behavioral signals present in human communication,
- use context to resolve ambiguity when a message could be read in more than one way,
- identify the underlying intent behind incident messages rather than relying solely on isolated keywords.

This contextual understanding should reduce category overlap and misclassification.
By evaluating language patterns together with operational context, an NLP model is expected to interpret incidents more accurately and reliably than rigid, rule-based approaches.

---

## **6) How do I think class imbalance will appear in this dataset?**

I expect imbalance between frequent, routine incidents and rare but critical events. The classes that matter most for SLA may well be the ones with the fewest examples, which makes them the hardest for a model to learn.

This matters because proper incident classification is what enables action plans aligned with both the urgency and the frequency of each event.

By classifying incidents accurately, the operation can:
- define response priorities based on risk to SLA,
- allocate resources more efficiently,
- distinguish between recurring operational issues and rare but critical events,
- and ensure that high-impact incidents receive immediate attention.

Without structured classification, all incidents tend to be treated similarly, which leads to delayed responses, inefficient prioritization, and increased operational risk.
Therefore, classification is not only a technical task but a key decision-support mechanism for operational management, and rare classes must not be drowned out by frequent ones.
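
To make the class-imbalance question concrete, the sketch below counts labels and derives inverse-frequency class weights. The label counts are invented purely for illustration. The weighting formula, `n_samples / (n_classes * class_count)`, is the same one scikit-learn applies for `class_weight="balanced"`.

```python
from collections import Counter

# Hypothetical label sample; real counts would come from the dataset.
labels = (
    ["route_obstruction"] * 50
    + ["system_failure"] * 30
    + ["hub_stagnation"] * 15
    + ["vehicle_breakdown"] * 5
)

counts = Counter(labels)
n_samples, n_classes = len(labels), len(counts)

# "Balanced" weighting: rare classes receive proportionally larger weights.
weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

for label, count in counts.most_common():
    share = count / n_samples
    print(f"{label:18s} share={share:.0%} weight={weights[label]:.2f}")
```

In this toy distribution the rarest class gets ten times the weight of the most frequent one, which is one way to keep rare but critical incidents from being ignored during training.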

---

## **7) Do I expect short or long descriptions? Why?**

Both short and long incident descriptions are expected, depending on the severity and complexity of the problem.

For example, mechanical issues often require longer descriptions so that the resolution team has full context regarding the failure, symptoms, and constraints. This additional detail helps accelerate diagnosis and reduce downtime.

In contrast, route obstructions or external events such as road closures can usually be described briefly, as a short message is often sufficient to communicate the issue and trigger a rerouting decision.

---

## **8) Which classification errors are the most dangerous for SLA?**

One of the most harmful classification errors is treating incidents of different complexity as if they were equivalent.

For instance:
- grouping all mechanical failures under the same category ignores differences in severity and recovery time;
- failing to distinguish between a complete road closure and a temporary traffic congestion can lead to incorrect prioritization.

Such misclassifications may delay critical responses, allocate resources inefficiently, and ultimately increase the risk of SLA breaches.

---

## **9) When comparing TF-IDF vs Embeddings, what differences do I expect?**

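I expect the two representations to differ mainly in how they handle wording:

- **TF-IDF** weights individual terms. It should work well when categories have distinctive keywords, and its weights are easy to inspect, but it treats differently worded descriptions of the same incident as unrelated.
- **Embeddings** map text to dense vectors in which similar meanings sit close together. From them I expect a more accurate evaluation of context, better interpretation of operational scenarios, and clearer differentiation between incident severity levels, at the cost of interpretability.

If this richer contextual understanding holds up on the data, the classification results should become more reliable and actionable, directly supporting better decision-making and faster operational responses.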
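
The limitation of term-based representations can be shown with a small hand-rolled TF-IDF. This is a simplified sketch, not the vectorizer used later in the project: the three messages are invented, tokenization is a plain whitespace split, and the IDF uses the common smoothed form `log((1 + n) / (1 + df)) + 1`.

```python
import math
from collections import Counter

# Hypothetical messages: the first two describe the same kind of incident
# with no words in common; the third reuses the wording of the first.
docs = [
    "truck engine failure on route 7",
    "vehicle broke down and cannot move",
    "truck engine failure near the hub",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(docs)
doc_freq = Counter(tok for toks in tokenized for tok in set(toks))


def tfidf(tokens):
    """Term frequency times smoothed inverse document frequency."""
    tf = Counter(tokens)
    return {
        t: tf[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1) for t in tf
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vectors = [tfidf(toks) for toks in tokenized]
print("same meaning, different words:", cosine(vectors[0], vectors[1]))
print("shared wording:", cosine(vectors[0], vectors[2]))
```

The first pair scores exactly zero because the vectors share no terms, even though both messages describe a breakdown. An embedding model is expected to place such paraphrases closer together, which is the difference this comparison is meant to measure.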

---

### ✔️ Notes  
These questions shape the analytical mindset required for NLP-based operational modeling.  
They help establish intuition before we work with real data.

---


# 🧩 PPI: Preliminary Problem Identification  
### *Refined understanding after guided mentorship*

This section consolidates my refined understanding of the problem after the mentorship phase.  
The goal is to clearly define the nature of the business challenge, the data, and the analytical approach before any modeling decisions are made.

---

## **1) What is the problem asking?**

The problem asks for the development of an NLP-based system capable of automatically classifying operational incident reports in logistics operations.

The objective is not only to categorize incidents, but to support faster and more accurate operational decision-making, especially in scenarios that pose a risk to Service Level Agreements (SLA).

By structuring and interpreting textual incident descriptions, the model should help prioritize responses, allocate resources efficiently, and reduce operational delays.

---

## **2) What type of data do I expect to receive?**

I expect primarily unstructured textual data containing incident descriptions written by different operational actors such as drivers, dispatchers, and hub operators.

The data is likely to include:
- short and long descriptions,
- informal language,
- typos and abbreviations,
- role-specific operational vocabulary,
- incomplete or truncated messages caused by urgency or connectivity issues.

Additional metadata (such as timestamps or categories) may exist, but the core challenge lies in understanding the text itself.

---

## **3) Is this a supervised or unsupervised problem? Why?**

This is a **supervised learning problem**.

The goal is to map incident descriptions to predefined categories or severity levels based on historical examples.  
Each message is expected to have an associated label that represents the incident type, urgency, or operational impact, which enables model training and evaluation.

---

## **4) Why is NLP appropriate for this scenario?**

NLP is appropriate because operational incident reports contain contextual meaning that cannot be reliably captured using simple rules or keyword matching.

Through NLP techniques, it is possible to:
- interpret context rather than isolated words,
- reduce ambiguity and overlapping meanings,
- identify the intent behind messages,
- and handle linguistic variability caused by human communication under pressure.

This makes NLP far more effective than rigid, rule-based approaches for real operational environments.

---

## **5) What are the main challenges I anticipate?**

Key challenges include:
- ambiguous language and overlapping incident categories,
- noisy text with typos and informal expressions,
- class imbalance between frequent and rare but critical incidents,
- short messages that lack explicit context,
- and the need to balance interpretability with model performance.

---

## **6) What approaches do I believe are worth testing?**

Before exploring the dataset, I expect to evaluate and compare:

**Classical NLP approaches:**
- TF-IDF vectorization combined with Logistic Regression
- TF-IDF combined with tree-based models

**Modern approaches:**
- Embedding-based representations with similarity or classification layers
- Comparative analysis between classical and embedding-based methods

The comparison will focus not only on performance metrics, but also on interpretability and operational applicability.
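
A minimal sketch of the first classical baseline is shown below, assuming scikit-learn is installed. The six messages and their labels are invented for illustration, and the use of character n-grams (`analyzer="char_wb"`) and `class_weight="balanced"` are assumptions of this example rather than decisions already made in the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages and labels, for illustration only.
texts = [
    "engine failure, truck stopped on highway",
    "mechanical failure, vehicle will not start",
    "road blocked by construction, rerouting",
    "route closed after accident",
    "gps error, tracking not updating",
    "system not updating, no tracking signal",
]
labels = [
    "vehicle_breakdown", "vehicle_breakdown",
    "route_obstruction", "route_obstruction",
    "system_failure", "system_failure",
]

# Character n-grams tend to tolerate typos better than whole-word tokens.
baseline = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), lowercase=True),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
baseline.fit(texts, labels)

# Predict on a misspelled message the model has not seen.
print(baseline.predict(["truck engin failure"]))
```

Wrapping the vectorizer and classifier in one pipeline keeps the vocabulary fitted on training data only, which matters once a proper train/test split is introduced.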

---

## **7) Which evaluation metrics matter most for the business?**

Beyond overall accuracy, metrics such as:
- Recall per class,
- F1-score,
- and confusion matrices

are critical, especially for high-impact incident categories.  
Misclassifying severe incidents as low priority poses a much greater SLA risk than the opposite error.
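
A small worked example shows why accuracy alone can hide the errors that matter. The labels and predictions below are invented: the model gets every obstruction right but misses half of the breakdowns.

```python
from collections import Counter

# Hypothetical true and predicted labels for eight incidents.
y_true = ["breakdown", "breakdown", "breakdown", "breakdown",
          "obstruction", "obstruction", "obstruction", "obstruction"]
y_pred = ["breakdown", "obstruction", "obstruction", "breakdown",
          "obstruction", "obstruction", "obstruction", "obstruction"]

# Overall accuracy: share of incidents labelled correctly.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall per class: of the incidents that truly belong to a class,
# how many did the model recover?
support = Counter(y_true)
hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
recall = {label: hits[label] / support[label] for label in support}

print("accuracy:", accuracy)
print("recall per class:", recall)
```

Here accuracy is 0.75, yet recall for `breakdown` is only 0.5, meaning half of the highest-impact incidents would be routed as something less urgent.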

---

## **8) What insights do I expect to extract from this project?**

I expect to gain insights into:
- recurring operational failure patterns,
- incident types with the highest SLA risk,
- how language structure correlates with urgency,
- and which modeling approach provides the best balance between accuracy, robustness, and explainability.

These insights should directly support operational monitoring and faster decision-making.

---

### ✔️ Notes  
This PPI represents my consolidated understanding after mentorship.  
All assumptions will be validated or challenged during exploratory analysis and modeling.

---


# 🏷️ Problem Framing & Label Definition  
### *Defining what we are predicting before touching the data*

Before loading the dataset, it is critical to clearly define the classification problem and the target labels.  
Well-defined labels reduce ambiguity, improve model performance, and ensure that the results are meaningful for operational decision-making.

---

## **1) What exactly is the prediction task?**

The task is a **multiclass text classification problem**.

Given a textual incident description, the model must predict:
- the **type of operational incident**,  
- and implicitly support **SLA risk prioritization**.

Each input consists of a free-text message describing an incident.  
Each output corresponds to a predefined incident category.

---

## **2) Why multiclass classification (and not binary)?**

A binary setup (e.g., *critical vs. non-critical*) would oversimplify the operational reality.

Different incident types:
- require different response teams,
- have different resolution times,
- and pose different levels of SLA risk.

Multiclass classification allows the model to preserve this operational nuance and support more precise action plans.

---

## **3) Proposed incident categories (labels)**

Based on operational reasoning and mentorship insights, the initial label set may include:

- **Vehicle Breakdown / Mechanical Failure**  
  Incidents related to vehicle malfunction, maintenance issues, or breakdowns.

- **Route Obstruction / External Disruption**  
  Road closures, construction, accidents, or environmental factors affecting routes.

- **System / Connectivity Failure**  
  GPS issues, tracking failures, system downtime, or internet connectivity problems.

- **Shipment Delay / Stagnation at Hub**  
  Orders stuck at distribution centers due to capacity or planning constraints.

- **Other / Miscellaneous**  
  Incidents that do not clearly fit into the main categories.

⚠️ These labels are **initial hypotheses** and may be refined after data exploration.

---

## **4) What does each label represent operationally?**

Each label corresponds to a distinct operational response:

- **Mechanical failures** → Fleet maintenance or contingency vehicles  
- **Route obstructions** → Rerouting and dispatch coordination  
- **System failures** → IT or monitoring intervention  
- **Shipment stagnation** → Capacity planning and resource reallocation  

This mapping reinforces that classification supports *action*, not just categorization.
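
The label-to-response mapping above could be encoded as a simple lookup once a model produces predictions. The label identifiers, the `route_incident` helper, and the `"Manual review"` fallback for unmapped labels are hypothetical choices made for this sketch.

```python
# Label identifiers are illustrative; the responses come from the list above.
RESPONSE_BY_LABEL = {
    "vehicle_breakdown": "Fleet maintenance or contingency vehicles",
    "route_obstruction": "Rerouting and dispatch coordination",
    "system_failure": "IT or monitoring intervention",
    "hub_stagnation": "Capacity planning and resource reallocation",
}


def route_incident(predicted_label: str) -> str:
    """Return the response for a predicted label, defaulting to manual review."""
    return RESPONSE_BY_LABEL.get(predicted_label, "Manual review")


print(route_incident("system_failure"))
print(route_incident("other"))  # unmapped labels fall back to manual review
```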

---

## **5) Expected label challenges**

Potential challenges include:
- overlapping language between categories,
- incidents that combine multiple issues,
- imbalance between frequent and rare but critical events,
- ambiguous or incomplete descriptions.

These challenges will directly influence preprocessing choices and model evaluation.

---

## **6) Success criteria for labeling**

Labeling will be considered successful if:
- categories are operationally meaningful,
- misclassifications do not hide critical incidents,
- results are interpretable by non-technical stakeholders,
- and the model supports faster SLA-aware decisions.

---

### ✔️ Notes  
This framing defines the backbone of the project.  
All modeling, metrics, and comparisons will be aligned with these labels.

The label structure may be revisited after exploratory analysis if the data reveals new patterns.


# 📦 Dataset Selection & Assumptions  
### *Understanding the data source before loading it*

Before loading the dataset, it is important to document the rationale behind its selection and the assumptions made about its structure and quality.  
This step helps contextualize the analysis and sets realistic expectations for the modeling phase.

---

## **1) Why this dataset?**

The selected dataset represents incident reports written in natural language within a logistics or operational context.

It was chosen because it:
- contains free-text descriptions similar to real operational messages,
- reflects the type of language used by drivers, hubs, and monitoring teams,
- aligns with the business objective of classifying incidents that impact SLA,
- allows experimentation with both classical NLP and embedding-based approaches.

Even as a sample or synthetic dataset, it simulates real-world challenges commonly found in operational text data.

---

## **2) What do I assume about the dataset structure?**

Before inspection, I assume the dataset includes:
- a textual field describing the incident (unstructured text),
- a categorical label representing the incident type,
- potentially additional metadata such as timestamps or IDs.

The text field is expected to be the primary input for the NLP pipeline.

---

## **3) What assumptions am I making about the text data?**

Based on operational reasoning, I assume that:
- messages vary significantly in length,
- descriptions may be informal or incomplete,
- spelling errors and abbreviations are common,
- similar incidents may be described using different wording,
- multiple incident types may share overlapping vocabulary.

These assumptions will guide preprocessing decisions such as normalization, tokenization, and vectorization.

---

## **4) What assumptions am I making about the labels?**

At this stage, I assume that:
- labels represent meaningful operational categories,
- each message is associated with a single primary label,
- some classes may be underrepresented,
- labeling may contain noise or borderline cases.

These assumptions will be validated during exploratory analysis.

---

## **5) Known limitations at this stage**

Before inspecting the data, I acknowledge that:
- the dataset may not cover all real-world incident variations,
- class distribution may be imbalanced,
- labels may oversimplify complex operational scenarios.

These limitations will be explicitly considered when evaluating model performance.

---

### ✔️ Notes  
All assumptions listed here are hypotheses.  
They exist to guide the analysis, not to constrain it.

Once the dataset is loaded and explored, these assumptions will be revisited, validated, or adjusted accordingly.


# 📥 Data Loading & Initial Inspection  
### *First contact with the dataset*

In this step, the goal is to load the dataset and perform a high-level inspection to understand its structure, size, and basic characteristics.

At this stage, we intentionally avoid transformations or preprocessing.
The focus is on answering fundamental questions such as:
- How many records are available?
- Which columns exist?
- What is the primary text field?
- Are there obvious missing values?
- Do labels appear consistent with our assumptions?

This initial inspection will validate (or challenge) the assumptions defined in the previous section.
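
The questions above can be answered with a few pandas calls. The sketch below runs on a tiny hypothetical frame so that it is self-contained; the column names `incident_text` and `label` are assumptions and would need to be replaced with whatever the loaded dataset actually provides.

```python
import pandas as pd

# Hypothetical toy frame; "incident_text" and "label" are assumed column names.
df_demo = pd.DataFrame({
    "incident_text": ["engine failure on route 7", None, "road blocked"],
    "label": ["vehicle_breakdown", "system_failure", "route_obstruction"],
})


def inspect(frame: pd.DataFrame, text_col: str, label_col: str) -> dict:
    """Collect the basic facts needed before any preprocessing."""
    return {
        "n_rows": len(frame),
        "columns": list(frame.columns),
        "missing_text": int(frame[text_col].isna().sum()),
        "label_counts": frame[label_col].value_counts().to_dict(),
    }


summary = inspect(df_demo, "incident_text", "label")
print(summary)
```

Checking that the expected text and label columns exist at this point would surface a mismatch between the assumptions and the loaded data before any feature code depends on them.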


In [2]:
# Install required libraries (run once)
# "datasets" is included because a later cell imports load_dataset from it
!pip install pandas numpy scikit-learn matplotlib seaborn datasets


Collecting seaborn
  Downloading seaborn-0.13.2-py3-none-any.whl.metadata (5.4 kB)
Downloading seaborn-0.13.2-py3-none-any.whl (294 kB)
Installing collected packages: seaborn
Successfully installed seaborn-0.13.2


In [4]:
from datasets import load_dataset
ds = load_dataset("25b3nk/github-issues")
ds


  from .autonotebook import tqdm as notebook_tqdm
Generating train split: 100%|██████████| 7314/7314 [00:00<00:00, 55310.30 examples/s]


DatasetDict({
    train: Dataset({
        features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'sub_issues_summary', 'active_lock_reason', 'body', 'closed_by', 'reactions', 'timeline_url', 'performed_via_github_app', 'state_reason', 'draft', 'pull_request', 'is_pull_request'],
        num_rows: 7314
    })
})

In [5]:
import pandas as pd
df = pd.DataFrame(ds["train"])
df.head()


Unnamed: 0,url,repository_url,labels_url,comments_url,events_url,html_url,id,node_id,number,title,...,active_lock_reason,body,closed_by,reactions,timeline_url,performed_via_github_app,state_reason,draft,pull_request,is_pull_request
0,https://api.github.com/repos/huggingface/datas...,https://api.github.com/repos/huggingface/datasets,https://api.github.com/repos/huggingface/datas...,https://api.github.com/repos/huggingface/datas...,https://api.github.com/repos/huggingface/datas...,https://github.com/huggingface/datasets/issues...,2802957388,I_kwDODunzps6nEbxM,7378,Allow pushing config version to hub,...,,"### Feature request\n\nCurrently, when dataset...",,"{'+1': 0, '-1': 0, 'confused': 0, 'eyes': 0, '...",https://api.github.com/repos/huggingface/datas...,,,,,False
1,https://api.github.com/repos/huggingface/datas...,https://api.github.com/repos/huggingface/datasets,https://api.github.com/repos/huggingface/datas...,https://api.github.com/repos/huggingface/datas...,https://api.github.com/repos/huggingface/datas...,https://github.com/huggingface/datasets/issues...,2802723285,I_kwDODunzps6nDinV,7377,Support for sparse arrays with the Arrow Spars...,...,,### Feature request\n\nAI in biology is becomi...,,"{'+1': 0, '-1': 0, 'confused': 0, 'eyes': 0, '...",https://api.github.com/repos/huggingface/datas...,,,,,False
2,https://api.github.com/repos/huggingface/datas...,https://api.github.com/repos/huggingface/datasets,https://api.github.com/repos/huggingface/datas...,https://api.github.com/repos/huggingface/datas...,https://api.github.com/repos/huggingface/datas...,https://github.com/huggingface/datasets/pull/7376,2802621104,PR_kwDODunzps6IiO9j,7376,[docs] uv install,...,,Proposes adding uv to installation docs (see S...,,"{'+1': 0, '-1': 0, 'confused': 0, 'eyes': 0, '...",https://api.github.com/repos/huggingface/datas...,,,0.0,{'diff_url': 'https://github.com/huggingface/d...,True
3,https://api.github.com/repos/huggingface/datas...,https://api.github.com/repos/huggingface/datasets,https://api.github.com/repos/huggingface/datas...,https://api.github.com/repos/huggingface/datas...,https://api.github.com/repos/huggingface/datas...,https://github.com/huggingface/datasets/issues...,2800609218,I_kwDODunzps6m7efC,7375,vllmÊâπÈáèÊé®ÁêÜÊä•Èîô,...,,### Describe the bug\n\n![Image](https://githu...,,"{'+1': 0, '-1': 0, 'confused': 0, 'eyes': 0, '...",https://api.github.com/repos/huggingface/datas...,,,,,False
4,https://api.github.com/repos/huggingface/datas...,https://api.github.com/repos/huggingface/datasets,https://api.github.com/repos/huggingface/datas...,https://api.github.com/repos/huggingface/datas...,https://api.github.com/repos/huggingface/datas...,https://github.com/huggingface/datasets/pull/7374,2793442320,PR_kwDODunzps6IC66n,7374,Remove .h5 from imagefolder extensions,...,,"the format is not relevant for imagefolder, an...",{'avatar_url': 'https://avatars.githubusercont...,"{'+1': 0, '-1': 0, 'confused': 0, 'eyes': 0, '...",https://api.github.com/repos/huggingface/datas...,,,0.0,{'diff_url': 'https://github.com/huggingface/d...,True


In [6]:
from pathlib import Path

p = Path("../data/raw/incidents_sample.csv")
print("Exists:", p.exists())
print("Size (bytes):", p.stat().st_size if p.exists() else None)

# Optional: preview first 3 lines
if p.exists():
    print("\n--- head(3) ---")
    with p.open("r", encoding="utf-8", errors="ignore") as f:
        for _ in range(3):
            print(f.readline().rstrip())


Exists: True
Size (bytes): 0

--- head(3) ---





In [7]:
# Basic inspection
df.head()


Unnamed: 0,url,repository_url,labels_url,comments_url,events_url,html_url,id,node_id,number,title,...,active_lock_reason,body,closed_by,reactions,timeline_url,performed_via_github_app,state_reason,draft,pull_request,is_pull_request
0,https://api.github.com/repos/huggingface/datas...,https://api.github.com/repos/huggingface/datasets,https://api.github.com/repos/huggingface/datas...,https://api.github.com/repos/huggingface/datas...,https://api.github.com/repos/huggingface/datas...,https://github.com/huggingface/datasets/issues...,2802957388,I_kwDODunzps6nEbxM,7378,Allow pushing config version to hub,...,,"### Feature request\n\nCurrently, when dataset...",,"{'+1': 0, '-1': 0, 'confused': 0, 'eyes': 0, '...",https://api.github.com/repos/huggingface/datas...,,,,,False
1,https://api.github.com/repos/huggingface/datas...,https://api.github.com/repos/huggingface/datasets,https://api.github.com/repos/huggingface/datas...,https://api.github.com/repos/huggingface/datas...,https://api.github.com/repos/huggingface/datas...,https://github.com/huggingface/datasets/issues...,2802723285,I_kwDODunzps6nDinV,7377,Support for sparse arrays with the Arrow Spars...,...,,### Feature request\n\nAI in biology is becomi...,,"{'+1': 0, '-1': 0, 'confused': 0, 'eyes': 0, '...",https://api.github.com/repos/huggingface/datas...,,,,,False
2,https://api.github.com/repos/huggingface/datas...,https://api.github.com/repos/huggingface/datasets,https://api.github.com/repos/huggingface/datas...,https://api.github.com/repos/huggingface/datas...,https://api.github.com/repos/huggingface/datas...,https://github.com/huggingface/datasets/pull/7376,2802621104,PR_kwDODunzps6IiO9j,7376,[docs] uv install,...,,Proposes adding uv to installation docs (see S...,,"{'+1': 0, '-1': 0, 'confused': 0, 'eyes': 0, '...",https://api.github.com/repos/huggingface/datas...,,,0.0,{'diff_url': 'https://github.com/huggingface/d...,True
3,https://api.github.com/repos/huggingface/datas...,https://api.github.com/repos/huggingface/datasets,https://api.github.com/repos/huggingface/datas...,https://api.github.com/repos/huggingface/datas...,https://api.github.com/repos/huggingface/datas...,https://github.com/huggingface/datasets/issues...,2800609218,I_kwDODunzps6m7efC,7375,vllmÊâπÈáèÊé®ÁêÜÊä•Èîô,...,,### Describe the bug\n\n![Image](https://githu...,,"{'+1': 0, '-1': 0, 'confused': 0, 'eyes': 0, '...",https://api.github.com/repos/huggingface/datas...,,,,,False
4,https://api.github.com/repos/huggingface/datas...,https://api.github.com/repos/huggingface/datasets,https://api.github.com/repos/huggingface/datas...,https://api.github.com/repos/huggingface/datas...,https://api.github.com/repos/huggingface/datas...,https://github.com/huggingface/datasets/pull/7374,2793442320,PR_kwDODunzps6IC66n,7374,Remove .h5 from imagefolder extensions,...,,"the format is not relevant for imagefolder, an...",{'avatar_url': 'https://avatars.githubusercont...,"{'+1': 0, '-1': 0, 'confused': 0, 'eyes': 0, '...",https://api.github.com/repos/huggingface/datas...,,,0.0,{'diff_url': 'https://github.com/huggingface/d...,True


In [8]:
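# Character and word count
# The loaded GitHub issues data has no "incident_text" column (the first attempt
# raised KeyError: 'incident_text'), so build it from the "title" and "body" fields.
df["incident_text"] = df["title"].fillna("") + " " + df["body"].fillna("")
df["char_length"] = df["incident_text"].str.len()
df["word_count"] = df["incident_text"].str.split().str.len()

df[["char_length", "word_count"]].describe()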