# **Task 1 - Data Tagging**

The dataset consists of the following:

● Free-text data (Columns: Complaint, Cause, Correction) that needs to be tagged.

● Taxonomy Sheet: A reference list with predefined categories for Root Cause,
Symptom_Condition, Symptom_Component, Fix_Condition, and Fix_Component.

**Our goal is to tag the data by applying logical reasoning and aligning it with the categories provided in the taxonomy.**

In [1]:
from google.colab import files
files.upload()

Saving DA - Task 1..xlsx to DA - Task 1..xlsx



In [2]:
import pandas as pd
import numpy as np

# Load the dataset

df = pd.read_excel("/content/DA - Task 1..xlsx")
data = df.copy()

In [3]:
data.columns

Index(['Primary Key', 'Order Date', 'Product Category', 'Complaint', 'Cause',
       'Correction', 'Root Cause', 'Symptom Condition 1',
       'Symptom Component 1', 'Symptom Condition 2', 'Symptom Component 2',
       'Symptom Condition 3', 'Symptom Component 3', 'Fix Condition 1',
       'Fix Component 1', 'Fix Condition 2', 'Fix Component 2',
       'Fix Condition 3', 'Fix Component 3'],
      dtype='object')

In [4]:
# Drop the existing tag columns so they can be regenerated
data.drop(columns=['Root Cause', 'Symptom Condition 1',
       'Symptom Component 1', 'Symptom Condition 2', 'Symptom Component 2',
       'Symptom Condition 3', 'Symptom Component 3', 'Fix Condition 1',
       'Fix Component 1', 'Fix Condition 2', 'Fix Component 2',
       'Fix Condition 3', 'Fix Component 3'], inplace=True)

In [5]:
data.head()

| | Primary Key | Order Date | Product Category | Complaint | Cause | Correction |
|---|---|---|---|---|---|---|
| 0 | SO0026296-1 | 2023-03-08 | SPRAYS | VISIBLY NOTICE fasteners under cab on P clips ... | Not tighten at factory. | GO THROUGH AND RE-TIGHTEN ALL P CLIPS, NUTS, A... |
| 1 | SO0026385-1 | 2023-03-08 | SPRAYS | Fuel door will not stay open | GAS STRUT NOT INSTALLED OR ANYWHERE ON MACHINE | FOUND GAS STRUT NOT INSTALLED OR ANYWHERE ON M... |
| 2 | SO0026385-11 | 2023-03-08 | SPRAYS | Compressor pressure line, braided steel, crushed | Compressor pressure line, braided steel, crush... | DRAIN AIR FROM SYSTEM.REMOVE ASSOCIATED P CLIP... |
| 3 | SO0028352-1 | 2023-03-08 | SPRAYS | Oil running from bottom of machine | OIL RETURN UNDER MACHINE SWIVEL FITTING LEFT L... | OIL RETURN UNDER MACHINE SWIVEL FITTING LEFT L... |
| 4 | SO0028770-1 | 2023-03-08 | SPRAYS | MISSING VECTOR & INTRIP UNLOCKS. | MISSING VECTOR & INTRIP UNLOCKS WERE NOT INSTA... | INSTALLED MISSING UNLOCKS RAN AND TESTED. |


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Primary Key       20 non-null     object        
 1   Order Date        20 non-null     datetime64[ns]
 2   Product Category  20 non-null     object        
 3   Complaint         20 non-null     object        
 4   Cause             20 non-null     object        
 5   Correction        20 non-null     object        
dtypes: datetime64[ns](1), object(5)
memory usage: 1.1+ KB


* Missing or malformed dates are handled by errors='coerce', which converts invalid date entries to NaT.
* Missing complaints, causes, or corrections are filled with default placeholders.
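
As a minimal sketch of these two cleaning rules, the snippet below runs them on a tiny made-up frame. The sample rows are hypothetical and are not taken from the dataset; only `pd.to_datetime(..., errors='coerce')` and `fillna` are the calls the notebook itself relies on.

```python
import pandas as pd

# Hypothetical sample rows, not taken from the dataset
sample = pd.DataFrame({
    "Order Date": ["2023-03-08", "not a date", None],
    "Complaint": ["Fuel door will not stay open", None, "Oil leak"],
})

# Invalid or missing dates become NaT instead of raising an error
sample["Order Date"] = pd.to_datetime(sample["Order Date"], errors="coerce")

# Missing text gets a placeholder, then everything is upper-cased
sample["Complaint"] = sample["Complaint"].fillna("N/A").str.upper()

print(sample)
```

Filling before calling `.str.upper()` matters: the string accessor leaves missing values as missing, so the placeholder has to be in place first.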

To tag the dataset with Root Cause, Symptom Condition, Symptom Component, Fix Condition, and Fix Component values, the approach is as follows:

* Parsing the complaint: Extract the key terms of the complaint to determine the symptom condition, symptom component, cause, and fix condition.
* Edge cases: Check for missing or ambiguous causes and complaints and handle them by assigning None or a fallback value.
* Data output: Tag each row and add the resulting tag columns to the dataset.

In [7]:
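# Coerce invalid dates to NaT; fill missing text with a placeholder before upper-casing
data['Order Date'] = pd.to_datetime(data['Order Date'], errors='coerce')
data['Complaint'] = data['Complaint'].fillna('N/A').str.upper()
data['Cause'] = data['Cause'].fillna('N/A').str.upper()
data['Correction'] = data['Correction'].fillna('N/A').str.upper()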

To create a tag_data function that automatically assigns appropriate tags to the dataset based on the provided taxonomy, we will apply the following steps:

* Match complaints, causes, and corrections to the taxonomy categories.
* Assign the correct tags under "Root Cause", "Symptom Condition", "Symptom Component", "Fix Condition", and "Fix Component".
* Handle edge cases like missing or unclear data (e.g., 'Not Mentioned', 'N/A').

We will use the apply() function to go through each row of the DataFrame and match the values to the taxonomy.
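
The row-wise pattern can be sketched on a toy frame as follows. The `tag_row` function and its two output columns are hypothetical stand-ins for illustration only; the point is that returning a `pd.Series` from a function applied with `axis=1` yields one new column per key.

```python
import pandas as pd

toy = pd.DataFrame({"Complaint": ["OIL LEAK AT COUPLER", "HARNESS BROKE"]})

def tag_row(row):
    """Hypothetical stand-in for tag_data: derive two columns from one row."""
    words = row["Complaint"].split()
    return pd.Series({"First Word": words[0], "Last Word": words[-1]})

# axis=1 passes each row to the function; the returned Series become columns
tags = toy.apply(tag_row, axis=1)

# Attach the new columns next to the original ones
result = pd.concat([toy, tags], axis=1)
print(result)
```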

In [8]:
# Lists of taxonomy data
root_causes = [
    "Not Tightened", "Not Installed", "Not Mentioned", "Loosened", "Not Included",
    "Out of Fitting", "Blown", "Poor Material", "Leaking", "Failed Sending", "No Oring",
    "Not Tighten", "Out of Range", "Lubricant Drip Down",  "Internal Issue",
    "Screwed in a Thread", "Faulty"
]

symptom_conditions = [
    "Loose", "Won't stay open", "Crushed", "Oil Running", "Missing", "Oil Dripping",
    "Oil Leak", "Broke", "Leak", "Open", "Hydraulic Leak", "Fold Uneven", "Getting Fault Code",
    "Not Working", "Error Codes", "Product Leak", "Does not Light"
]

symptom_components = [
    "Cab P Clip", "Fuel Door", "Compressor Pressure Line", "Not Mentioned", "Vector",
    "Coupler", "Mount SVM Sign", "Harness", "Rinse Tank", "Fuel Sender", "Boom", "Auto Boom",
    "Condenser", "Left-Air Duct", "Bulkhead Connector", "Braided Steel", "Intrip Unlocks", "Compressor Line"
]

fix_conditions = [
    "Retightened", "Installed", "Replaced", "Topped Off", "Not Mentioned", "Cleaned Out",
    "Reseted", "Tightened", "Repaired", "Left Air Duct", "Oring", "Threads"
]

fix_components = [
    "Cab P Clip", "Gas Strut", "Braided Steel", "O-Ring", "Vector", "Coupler", "Brackets",
    "Hydraulic", "Not Mentioned", "Sensor", "NCV Harness", "Tube", "Compressor Line",
    "Bulkhead Connector", "SVM Sign", "Pipe Fitting", "ELB"
]

* Root Cause: This could be set directly if it's available or determined from the cause and complaint.
* Symptom Condition: Identified based on keywords from the complaint (e.g., "oil leak", "broken harness").
* Symptom Component: Components related to the condition, like "oil return", "compressor line".
* Fix Condition: Identifies what needs to be done to fix the issue, like "re-tighten" or "install".
* Fix Component: Identifies the components being worked on, like "P clips", "gas strut".
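
One thing to keep in mind with keyword matching is that some taxonomy terms contain others ("Leak" inside "Hydraulic Leak", "Not Tighten" inside "Not Tightened"), so a plain substring test can return both. The sketch below shows the effect and one possible way to keep only the most specific term. `drop_contained` is a hypothetical helper and not part of the notebook.

```python
def find_matches(text, terms):
    """Case-insensitive substring match against a list of taxonomy terms."""
    return [t for t in terms if t.lower() in text.lower()]

def drop_contained(matches):
    """Hypothetical helper: drop any match that is a substring of a longer match."""
    return [
        m for m in matches
        if not any(m != other and m.lower() in other.lower() for other in matches)
    ]

conditions = ["Oil Leak", "Leak", "Hydraulic Leak", "Loose"]
raw = find_matches("HYDRAULIC LEAK AT BOOM", conditions)
specific = drop_contained(raw)

print(raw)       # both the short and the long term match
print(specific)  # only the longer, more specific term remains
```

Whether the shorter term should be dropped is a tagging decision; the taxonomy may intend both to be recorded.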

In [9]:

# Function to extract taxonomy from complaint, cause, and correction
def tag_data(row):
    # Extract complaint, cause, and correction from the row
    complaint = row['Complaint']
    cause = row['Cause']
    correction = row['Correction']

    # Return every taxonomy term that appears in the text (case-insensitive substring match)
    def find_matches(text, word_list):
        return [word for word in word_list if word.lower() in text.lower()]

    # Find matches in complaint
    root_cause_found = find_matches(complaint, root_causes)
    symptom_conditions_found = find_matches(complaint, symptom_conditions)
    symptom_components_found = find_matches(complaint, symptom_components)

    # Find matches in cause
    root_cause_found += find_matches(cause, root_causes)
    symptom_conditions_found += find_matches(cause, symptom_conditions)
    symptom_components_found += find_matches(cause, symptom_components)

    # Find matches in correction
    fix_conditions_found = find_matches(correction, fix_conditions)
    fix_components_found = find_matches(correction, fix_components)

    # Make sure the lists have at least 3 entries for symptoms and fixes, if not, fill with "N/A"
    symptom_conditions_found += ['N/A'] * (3 - len(symptom_conditions_found))
    symptom_components_found += ['N/A'] * (3 - len(symptom_components_found))
    fix_conditions_found += ['N/A'] * (3 - len(fix_conditions_found))
    fix_components_found += ['N/A'] * (3 - len(fix_components_found))


    # Only keep the first 3 occurrences from the lists
    symptom_conditions_found = symptom_conditions_found[:3]
    symptom_components_found = symptom_components_found[:3]
    fix_conditions_found = fix_conditions_found[:3]
    fix_components_found = fix_components_found[:3]

    # Return the tagged data as a pandas Series
    return pd.Series({
        'Root Cause': ', '.join(root_cause_found),
        'Symptom Condition 1': symptom_conditions_found[0],
        'Symptom Component 1': symptom_components_found[0],
        'Symptom Condition 2': symptom_conditions_found[1],
        'Symptom Component 2': symptom_components_found[1],
        'Symptom Condition 3': symptom_conditions_found[2],
        'Symptom Component 3': symptom_components_found[2],
        'Fix Condition 1': fix_conditions_found[0],
        'Fix Component 1': fix_components_found[0],
        'Fix Condition 2': fix_conditions_found[1],
        'Fix Component 2': fix_components_found[1],
        'Fix Condition 3': fix_conditions_found[2],
        'Fix Component 3': fix_components_found[2]
    })
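
Because matches from the complaint and the cause are concatenated, the same term can fill two of the three slots when both fields mention it. A possible refinement, sketched here with a hypothetical helper named `first_three`, is to de-duplicate while preserving order before padding and trimming.

```python
def first_three(matches, filler="N/A"):
    """Hypothetical helper: de-duplicate in order, then pad or trim to 3 items."""
    unique = list(dict.fromkeys(matches))  # dicts preserve insertion order
    return (unique + [filler] * 3)[:3]

print(first_three(["Crushed", "Crushed"]))
print(first_three(["Loose", "Leak", "Open", "Missing"]))
```

This would replace the separate padding and slicing steps for each list, at the cost of changing which values land in slots 2 and 3 compared with the function above.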


In [10]:
tagged_dataset = data.apply(tag_data, axis=1)

# Concatenate the tagged dataset with the original dataframe
df_tagged = pd.concat([data, tagged_dataset], axis=1)

df_tagged.head()

| | Primary Key | Order Date | Product Category | Complaint | Cause | Correction | Root Cause | Symptom Condition 1 | Symptom Component 1 | Symptom Condition 2 | Symptom Component 2 | Symptom Condition 3 | Symptom Component 3 | Fix Condition 1 | Fix Component 1 | Fix Condition 2 | Fix Component 2 | Fix Condition 3 | Fix Component 3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SO0026296-1 | 2023-03-08 | SPRAYS | VISIBLY NOTICE FASTENERS UNDER CAB ON P CLIPS ... | NOT TIGHTEN AT FACTORY. | GO THROUGH AND RE-TIGHTEN ALL P CLIPS, NUTS, A... | Not Tighten | Loose | | | | | | | Bulkhead Connector | | | | |
| 1 | SO0026385-1 | 2023-03-08 | SPRAYS | FUEL DOOR WILL NOT STAY OPEN | GAS STRUT NOT INSTALLED OR ANYWHERE ON MACHINE | FOUND GAS STRUT NOT INSTALLED OR ANYWHERE ON M... | Not Installed | Open | Fuel Door | | | | | Installed | Gas Strut | | | | |
| 2 | SO0026385-11 | 2023-03-08 | SPRAYS | COMPRESSOR PRESSURE LINE, BRAIDED STEEL, CRUSHED | COMPRESSOR PRESSURE LINE, BRAIDED STEEL, CRUSH... | DRAIN AIR FROM SYSTEM.REMOVE ASSOCIATED P CLIP... | | Crushed | Compressor Pressure Line | Crushed | Braided Steel | | Compressor Pressure Line | | Braided Steel | | Compressor Line | | |
| 3 | SO0028352-1 | 2023-03-08 | SPRAYS | OIL RUNNING FROM BOTTOM OF MACHINE | OIL RETURN UNDER MACHINE SWIVEL FITTING LEFT L... | OIL RETURN UNDER MACHINE SWIVEL FITTING LEFT L... | | Oil Running | | Loose | | | | | O-Ring | | Hydraulic | | |
| 4 | SO0028770-1 | 2023-03-08 | SPRAYS | MISSING VECTOR & INTRIP UNLOCKS. | MISSING VECTOR & INTRIP UNLOCKS WERE NOT INSTA... | INSTALLED MISSING UNLOCKS RAN AND TESTED. | Not Installed | Missing | Vector | Missing | Intrip Unlocks | | Vector | Installed | | | | | |


In [11]:
df_tagged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Primary Key          20 non-null     object        
 1   Order Date           20 non-null     datetime64[ns]
 2   Product Category     20 non-null     object        
 3   Complaint            20 non-null     object        
 4   Cause                20 non-null     object        
 5   Correction           20 non-null     object        
 6   Root Cause           20 non-null     object        
 7   Symptom Condition 1  20 non-null     object        
 8   Symptom Component 1  20 non-null     object        
 9   Symptom Condition 2  20 non-null     object        
 10  Symptom Component 2  20 non-null     object        
 11  Symptom Condition 3  20 non-null     object        
 12  Symptom Component 3  20 non-null     object        
 13  Fix Condition 1      20 non-null     

In [12]:
# Save the tagged dataset to a new CSV file
df_tagged.to_csv("tagged_dataset.csv", index=False)

print("Tagging complete! Tagged dataset saved as 'tagged_dataset.csv'")

Tagging complete! Tagged dataset saved as 'tagged_dataset.csv'
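
One caveat when the saved file is read back: `pd.read_csv` treats the string "N/A" as a missing value by default, so the placeholder would come back as NaN. The sketch below illustrates this with an in-memory CSV; the two rows and their keys are made up for the example.

```python
import io
import pandas as pd

# Made-up two-row CSV standing in for tagged_dataset.csv
csv_text = "Primary Key,Fix Condition 1\nSO-1,N/A\nSO-2,Installed\n"

# Default parsing converts "N/A" to NaN
default = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False keeps the literal placeholder string
literal = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default["Fix Condition 1"].isna().sum())
print(literal["Fix Condition 1"].tolist())
```

If downstream consumers need to distinguish "no tag found" from a genuinely empty cell, reading with `keep_default_na=False` or choosing a different placeholder may be worth considering.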


By **Mrudula A P**