## Step 1: Load Dataset
Load the job description data.<br>
The `pd.read_csv()` function reads the CSV file into a pandas DataFrame.<br>
`.iloc[0]` selects the first row by integer position.

In [51]:
import pandas as pd
import re
import string

df = pd.read_csv('../data/raw/job_title_des.csv')

sample_desc = df['Job Description'].iloc[0]
print("Original Description:")
print("="*80)
print(sample_desc)
print("="*80)

Original Description:
We are looking for hire experts flutter developer. So you are eligible this post then apply your resume.
Job Types: Full-time, Part-time
Salary: ₹20,000.00 - ₹40,000.00 per month
Benefits:
Flexible schedule
Food allowance
Schedule:
Day shift
Supplemental Pay:
Joining bonus
Overtime pay
Experience:
total work: 1 year (Preferred)
Housing rent subsidy:
Yes
Industry:
Software Development
Work Remotely:
Temporarily due to COVID-19
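
As a minimal, self-contained sketch of the same two calls, the snippet below reads a tiny CSV from an in-memory string instead of the real file. The two rows are invented for illustration and are not taken from the dataset.

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for the real file.
csv_text = (
    "Job Title,Job Description\n"
    "Flutter Developer,Build mobile apps\n"
    "Django Developer,Build web APIs\n"
)

demo_df = pd.read_csv(io.StringIO(csv_text))

# .iloc selects by integer position, regardless of the index labels.
first_desc = demo_df["Job Description"].iloc[0]
print(first_desc)
```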


## Step 2: Convert to Lowercase
The `lower()` method converts all characters to lowercase.<br>
Variants such as 'Developer', 'developer', and 'DEVELOPER' become the same word.<br>
This reduces the vocabulary size and improves matching for recommendations.


In [52]:
lowercase_desc = sample_desc.lower()
print("Lowercase Description:")
print("="*80)
print(lowercase_desc)
print("="*80)

Lowercase Description:
we are looking for hire experts flutter developer. so you are eligible this post then apply your resume.
job types: full-time, part-time
salary: ₹20,000.00 - ₹40,000.00 per month
benefits:
flexible schedule
food allowance
schedule:
day shift
supplemental pay:
joining bonus
overtime pay
experience:
total work: 1 year (preferred)
housing rent subsidy:
yes
industry:
software development
work remotely:
temporarily due to covid-19
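
The effect on vocabulary size can be checked on a toy list. This is only an illustrative sketch; the three spellings are the ones named in the text above.

```python
variants = ["Developer", "developer", "DEVELOPER"]

# Before lowercasing, each spelling counts as a distinct token.
unique_before = set(variants)

# After lowercasing, all three collapse into one token.
unique_after = {word.lower() for word in variants}

print(len(unique_before), len(unique_after))
```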


## Step 3: Remove Special Characters
The `re.sub()` function finds a pattern in the text and replaces it.<br>
Here the pattern `r'[^a-zA-Z0-9\s]'` matches any character that is not a letter, digit, or whitespace, and each match is replaced with `''`.<br>
Special characters such as `@` rarely add meaning to a sentence.

In [53]:
clean_special_char_desc = re.sub(r'[^a-zA-Z0-9\s]', '', sample_desc)
print("After Removing Special Characters:")
print("="*80)
print(clean_special_char_desc)
print("="*80)

After Removing Special Characters:
We are looking for hire experts flutter developer So you are eligible this post then apply your resume
Job Types Fulltime Parttime
Salary 2000000  4000000 per month
Benefits
Flexible schedule
Food allowance
Schedule
Day shift
Supplemental Pay
Joining bonus
Overtime pay
Experience
total work 1 year Preferred
Housing rent subsidy
Yes
Industry
Software Development
Work Remotely
Temporarily due to COVID19
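
One side effect visible in the output is that hyphenated words are fused ("Full-time" becomes "Fulltime"). The sketch below contrasts the pattern used in this step with a variant that substitutes a space instead. The variant is an assumption for illustration, not what the notebook does.

```python
import re

line = "Job Types: Full-time, Part-time"

# Pattern from this step: delete anything that is not a letter, digit, or whitespace.
deleted = re.sub(r"[^a-zA-Z0-9\s]", "", line)

# Variant (an assumption, not the notebook's method): substitute a space
# so hyphenated words stay separate tokens, then collapse repeated spaces.
spaced = re.sub(r"[^a-zA-Z0-9\s]", " ", line)
spaced = re.sub(r"\s+", " ", spaced).strip()

print(deleted)
print(spaced)
```

Which behavior is preferable depends on whether "fulltime" or "full time" matches better downstream.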


## Step 4: Remove Extra Whitespace
The pattern `\s+` matches one or more whitespace characters (spaces, tabs, newlines).<br>
`strip()` removes leading and trailing whitespace from the text.<br>
Text often contains multiple spaces, tabs, and newlines that need to be standardized into a single space.



In [54]:
cleaned_whitespace_desc = re.sub(r'\s+', ' ', sample_desc).strip()
print("After Cleaning Extra Whitespace:")
print("="*80)
print(cleaned_whitespace_desc)
print("="*80)

After Cleaning Extra Whitespace:
We are looking for hire experts flutter developer. So you are eligible this post then apply your resume. Job Types: Full-time, Part-time Salary: ₹20,000.00 - ₹40,000.00 per month Benefits: Flexible schedule Food allowance Schedule: Day shift Supplemental Pay: Joining bonus Overtime pay Experience: total work: 1 year (Preferred) Housing rent subsidy: Yes Industry: Software Development Work Remotely: Temporarily due to COVID-19
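
The `+` quantifier is what makes the collapse work. A small sketch on an invented string shows the difference between `\s+` and `\s`:

```python
import re

messy = "  Benefits:\n\tFlexible   schedule \n"

# \s+ treats each run of whitespace as one match, so the run becomes one space.
collapsed = re.sub(r"\s+", " ", messy).strip()

# \s without + replaces every whitespace character individually,
# so runs of whitespace stay the same length.
per_char = re.sub(r"\s", " ", messy).strip()

print(repr(collapsed))
print(repr(per_char))
```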


## Step 5: Remove Numbers
The pattern `\d+` matches one or more digits.<br>
Numbers are removed to reduce noise. Separators inside a number, such as `,` and `.`, are not digits and are left behind.


In [55]:
clean_numbers_desc = re.sub(r'\d+', '', sample_desc)
print("After Removing Numbers:")
print("="*80)
print(clean_numbers_desc)
print("="*80)

After Removing Numbers:
We are looking for hire experts flutter developer. So you are eligible this post then apply your resume.
Job Types: Full-time, Part-time
Salary: ₹,. - ₹,. per month
Benefits:
Flexible schedule
Food allowance
Schedule:
Day shift
Supplemental Pay:
Joining bonus
Overtime pay
Experience:
total work:  year (Preferred)
Housing rent subsidy:
Yes
Industry:
Software Development
Work Remotely:
Temporarily due to COVID-
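
The leftover `₹,. - ₹,.` in the salary line comes from the separators between digit groups. The sketch below compares `\d+` with a broader pattern that also consumes those separators. The broader pattern is an assumption for illustration, not part of the notebook.

```python
import re

salary = "Salary: ₹20,000.00 - ₹40,000.00 per month"

# Pattern from this step: only the digits are removed.
digits_only = re.sub(r"\d+", "", salary)

# Variant (an assumption, not the notebook's method): also consume
# comma or period separators that sit between groups of digits.
whole_numbers = re.sub(r"\d+(?:[.,]\d+)*", "", salary)

print(digits_only)
print(whole_numbers)
```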


## Step 6: Remove Punctuation
`string.punctuation` contains all ASCII punctuation characters.<br>
Punctuation does not add semantic meaning.

In [56]:
clean_punctuation_desc = sample_desc.translate(str.maketrans('', '', string.punctuation))
print("After Removing Punctuation:")
print("="*80)
print(clean_punctuation_desc)
print("="*80)

After Removing Punctuation:
We are looking for hire experts flutter developer So you are eligible this post then apply your resume
Job Types Fulltime Parttime
Salary ₹2000000  ₹4000000 per month
Benefits
Flexible schedule
Food allowance
Schedule
Day shift
Supplemental Pay
Joining bonus
Overtime pay
Experience
total work 1 year Preferred
Housing rent subsidy
Yes
Industry
Software Development
Work Remotely
Temporarily due to COVID19
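
Because `string.punctuation` covers only ASCII characters, symbols such as `₹` pass through, as the output above shows. A short sketch on an invented string:

```python
import string

text = "Salary: ₹20,000.00 (Preferred)"

# Translation table that deletes every character in string.punctuation.
table = str.maketrans("", "", string.punctuation)
no_punct = text.translate(table)

# The rupee sign is not ASCII punctuation, so it survives.
print(no_punct)
print("₹" in string.punctuation)
```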


## Step 7: Create a Cleaning Function
We combine all cleaning steps into a single reusable function.


In [57]:
def clean_text(text):
    """
    Clean text by applying multiple preprocessing steps.

    Args:
        text (str): Raw text input

    Returns:
        str: Cleaned text
    """
    # Create temporary placeholders for skills with special chars
    skill_patterns = {
        'C#': 'CSHARP',
        'C++': 'CPLUSPLUS',
        'F#': 'FSHARP',
        '.NET': 'DOTNET',
        'Node.js': 'NODEJS',
        'ASP.NET': 'ASPNET'
    }
    
    # Replace skills with placeholders
    for skill, placeholder in skill_patterns.items():
        text = text.replace(skill, placeholder)

    # Convert to lowercase
    text = text.lower()

    # Remove special characters (keep only letters, numbers, spaces)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)

    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text)

    # Remove numbers
    text = re.sub(r'\d+', '', text)

    # Remove leading/trailing spaces
    text = text.strip()

    # Restore original skill names
    for skill, placeholder in skill_patterns.items():
        text = text.replace(placeholder.lower(), skill.lower())

    return text

# Apply to sample description
cleaned_sample = clean_text(sample_desc)

print("Cleaned Job Description:")
print("=" * 80)
print(cleaned_sample)
print("=" * 80)

print(f"\nOriginal length: {len(sample_desc)} characters")
print(f"Cleaned length: {len(cleaned_sample)} characters")
print(f"Reduced by: {len(sample_desc) - len(cleaned_sample)} characters")

Cleaned Job Description:
we are looking for hire experts flutter developer so you are eligible this post then apply your resume job types fulltime parttime salary   per month benefits flexible schedule food allowance schedule day shift supplemental pay joining bonus overtime pay experience total work  year preferred housing rent subsidy yes industry software development work remotely temporarily due to covid

Original length: 429 characters
Cleaned length: 386 characters
Reduced by: 43 characters
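
The cleaned output above still contains runs of spaces (for example after "salary"), because numbers are removed after whitespace has already been collapsed. The skill replacement is also case-sensitive, so a lowercase "c#" in the raw text would lose its `#`. The sketch below is a hypothetical variant that lowercases first and collapses whitespace last. The name `clean_text_v2`, the `SKILLS` list, and the placeholder scheme are assumptions for illustration, not the notebook's function.

```python
import re

# Skill names whose punctuation should survive cleaning.
# Longer names come first so "asp.net" is matched before ".net".
SKILLS = ["asp.net", "node.js", ".net", "c++", "c#", "f#"]


def clean_text_v2(text):
    """Hypothetical variant of clean_text; not the notebook's function."""
    text = text.lower()

    # Swap each skill for a letters-only placeholder (digits would be
    # removed by the number step below).
    placeholders = {}
    for i, skill in enumerate(SKILLS):
        token = "skillplaceholder" + chr(ord("a") + i)
        placeholders[token] = skill
        text = text.replace(skill, f" {token} ")

    text = re.sub(r"[^a-z0-9\s]", "", text)   # special characters
    text = re.sub(r"\d+", "", text)           # numbers
    text = re.sub(r"\s+", " ", text).strip()  # whitespace, done last

    for token, skill in placeholders.items():
        text = text.replace(token, skill)
    return text


demo = clean_text_v2("Salary: ₹20,000.00 per month. Skills: C#, ASP.NET, Node.js")
print(demo)
```

Running the notebook with a variant like this would produce different lengths from the ones reported above.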


## Step 8: Apply Cleaning to All Job Descriptions

The `apply()` method applies a function to each element of a DataFrame column.<br>
Here the result overwrites the existing `Job Description` column, so the original text is no longer available in `df` afterwards.<br>
This processes all 2,277 job descriptions at once.

In [58]:
# Apply cleaning function to all job descriptions
df['Job Description'] = df['Job Description'].apply(clean_text)

# Preview the cleaned text (the column was overwritten above, so the original is no longer in df)
print("Cleaned Descriptions (first 3 jobs)")
print("=" * 80)
for i in range(3):
    print(f"\nJob {i+1}: {df['Job Title'].iloc[i]}")
    print(f"Cleaned: {df['Job Description'].iloc[i][:100]}...")
    print("-" * 80)


Cleaned Descriptions (first 3 jobs)

Job 1: Flutter Developer
Cleaned: we are looking for hire experts flutter developer so you are eligible this post then apply your resu...
--------------------------------------------------------------------------------

Job 2: Django Developer
Cleaned: pythondjango developerlead job codepdj  strong python experience in api development restrpc experien...
--------------------------------------------------------------------------------

Job 3: Machine Learning
Cleaned: data scientist contractor bangalore in responsibilities we are looking for a capable data scientist ...
--------------------------------------------------------------------------------
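
If the original text should stay available for a side-by-side comparison, one option is to write the result to a separate column such as `cleaned_description`. A minimal sketch on an invented two-row DataFrame, with `simple_clean` as a simplified, hypothetical stand-in for `clean_text`:

```python
import re

import pandas as pd


def simple_clean(text):
    # Simplified stand-in for clean_text: lowercase, drop special
    # characters, collapse whitespace.
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


demo_df = pd.DataFrame(
    {
        "Job Title": ["Flutter Developer", "Django Developer"],
        "Job Description": ["Build  MOBILE apps!", "Build web APIs."],
    }
)

# Writing to a new column leaves the raw text untouched.
demo_df["cleaned_description"] = demo_df["Job Description"].apply(simple_clean)

for _, row in demo_df.iterrows():
    print(f"Original: {row['Job Description']}")
    print(f"Cleaned:  {row['cleaned_description']}")
```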


In [59]:
df.head()

| | Unnamed: 0 | Job Title | Job Description |
|---|---|---|---|
| 0 | 0 | Flutter Developer | we are looking for hire experts flutter develo... |
| 1 | 1 | Django Developer | pythondjango developerlead job codepdj strong... |
| 2 | 2 | Machine Learning | data scientist contractor bangalore in respons... |
| 3 | 3 | iOS Developer | job description strong framework outside of io... |
| 4 | 4 | Full Stack Developer | job responsibility full stack engineer react r... |


## Step 9: Save Cleaned Data
The `to_csv()` method saves the DataFrame to a CSV file.<br>
We use `index=False` to avoid saving the row index as a column.<br>
The cleaned data is saved to the `data/processed/` folder for future use.

In [60]:
df.to_csv('../data/processed/job_title_des_cleaned.csv', index=False)

print("Cleaned data saved to: data/processed/job_title_des_cleaned.csv")
print(f"Total jobs: {len(df)}")
print(f"Columns: {df.columns.tolist()}")

Cleaned data saved to: data/processed/job_title_des_cleaned.csv
Total jobs: 2277
Columns: ['Unnamed: 0', 'Job Title', 'Job Description']
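
The effect of `index=False` can be seen with an in-memory round trip. A column named `Unnamed: 0`, like the one listed above, is what pandas produces when it reads back a file that was written with the index included. Whether that is how the column got into this dataset is an assumption, not something the notebook confirms.

```python
import io

import pandas as pd

demo_df = pd.DataFrame({"Job Title": ["Flutter Developer", "Django Developer"]})

# Default index=True writes the row index as an extra, unnamed first column.
with_index = io.StringIO()
demo_df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the real columns.
without_index = io.StringIO()
demo_df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```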
