## Step 1 : Load Cleaned Data
Load the preprocessed data from `data/processed/cleaned_job_descriptions.csv`.<br>
The `pd.read_csv()` function reads the CSV file into pandas DataFrame.<br>
The data is already cleaned with skill pattern protection.

In [79]:
import pandas as pd

# Load the cleaned job listings data
df = pd.read_csv('../data/processed/job_title_des_cleaned.csv')

print(f"Total job listings: {len(df)}")
print(f"Coloumns: {df.columns.tolist()}")

for i in range(2):
    print(f"\nJob {i+1}: {df['Job Title'].iloc[i]}")
    print(f"Description: {df['Job Description'].iloc[i][:100]}")

Total job listings: 2277
Coloumns: ['Unnamed: 0', 'Job Title', 'Job Description']

Job 1: Flutter Developer
Description: we are looking for hire experts flutter developer so you are eligible this post then apply your resu

Job 2: Django Developer
Description: pythondjango developerlead job codepdj  strong python experience in api development restrpc experien


## Stap 2: Test Skill Persevation
Verify that skills with special characters are preserved correctly.<br>
Check if C#, C++, .NET, and other skills are intact in the cleaned data.


In [80]:
skill_patterns = ['c#', 'c++', 'F#','.net', 'node.js', 'asp.net']
for skill in skill_patterns:
    count = df['Job Description'].str.lower().str.contains(skill, case=False, na=False).sum()
    print(f"{skill}: Found in {count} job descriptions")

c#: Found in 99 job descriptions
c++: Found in 2277 job descriptions
F#: Found in 1 job descriptions
.net: Found in 775 job descriptions
node.js: Found in 340 job descriptions
asp.net: Found in 48 job descriptions


## Step 3: Initialize TF-IDF Vectorizer with Custom Tokenizer

The `TfidfVectorizer` converts text documents to TF-IDF feature vectors.<br>
`max_features` keeps only the top N most important words.<br>
`min_df` ignores words that appear in too few documents.<br>
`max_df` ignores words that appear in too many documents.<br>
`stop_words` removes common English words like "the", "is", "and".<br>
The `tokenizer` function preserves special characters like #, +, and . for skill names.

In [None]:
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Custom tokenizer that preserves special characters for skills
def skill_aware_tokenizer(text):
    """
    Tokenize text while preserving special characters (#, +, .) for skill names.
    Splits on spaces but keeps skill patterns like 'c#', 'c++', '.net' intact.
    """
    # Split on whitespace
    tokens = text.split()
    return tokens

# Initialize TF-IDF Vectorizer with custom tokenizer
tfidf = TfidfVectorizer(
    tokenizer=skill_aware_tokenizer,  # Use custom tokenizer to preserve special chars
    max_features=1000,                # Limit to top 1000 features
    min_df=2,                         # Word must appear in at least 2 documents
    max_df=0.8,                       # Word must not appear in more than 80% of documents
    stop_words='english',             # Remove common English stop words
    token_pattern=None                # Disable default token pattern to use custom tokenizer
)    

# Fit and transform cleaned job descriptions
tfidf_matrix = tfidf.fit_transform(df['Job Description'])

print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")
print(f"Number of job descriptions: {tfidf_matrix.shape[0]}")
print(f"Features (words): {tfidf_matrix.shape[1]}")

## Step 4 : View Vocabulary
The `get_feature_names_out()` method returns the vocabulary (list of words).<br>
Check if skills with special characters are present in the vocabulary.<br>
With custom tokenizer, skills like 'c#', 'c++', '.net' should now be preserved.

In [None]:
feature_names = tfidf.get_feature_names_out()  # Display first 20 feature names
print("Sample feature names:", feature_names[:20])
print(f"="*80)
# Check for preserved skills (now checking for actual skill names in data)
print("Checking for preserved skills in vocabulary:")
skills_to_check = ['c#', 'c++', 'net', 'node.js', 'asp.net', 'python', 'django', 'react', '.net']
for skill in skills_to_check:
    if skill in feature_names:
        # Find index of skill
        idx = list(feature_names).index(skill)
        print(f"✓ '{skill}': Found (index {idx})")
    else:
        print(f"✗ '{skill}': Not found")
print(f"="*80)
# Search for skills with special chars in vocabulary
print("\nSearching for features containing special characters:")
special_char_features = [f for f in feature_names if any(char in f for char in ['#', '+', '.'])]
if special_char_features:
    print(f"Found {len(special_char_features)} features with special characters:")
    for feature in special_char_features[:20]:  # Show first 20
        print(f"  - {feature}")
else:
    print("No features with special characters found.")
print(f"="*80)
# View first 20 words
print("\nFirst 20 words in vocabulary:")
print(feature_names[:20])


## Step 5: Extract Skills from a Document

Extract the top words (potential skills) from a single document.<br>
TF-IDF scores indicate how important each word is for that document.<br>
Filter words against a predefined skill list to get actual skills.


In [83]:
# Define skill list (lowercase)
skill_list = [
    # Programming languages
    'python', 'java', 'javascript', 'c#', 'c++', 'csharp', 'cplusplus',
    'ruby', 'php', 'swift', 'kotlin', 'go', 'rust', 'r', 'typescript',

    # Frameworks
    'django', 'flask', 'react', 'angular', 'vue', 'spring', 'express',
    'nodejs', 'aspnet', 'aspnetcore', 'aspnetmvc',

    # Data science / ML
    'tensorflow', 'pytorch', 'keras', 'scikit', 'pandas', 'numpy',
    'spark', 'hadoop', 'sklearn',

    # Databases
    'sql', 'mysql', 'postgresql', 'mongodb', 'redis', 'sqlite',

    # Tools / Platforms
    'docker', 'kubernetes', 'aws', 'azure', 'gcp', 'git', 'jenkins',
    'linux', 'ubuntu', 'windows',

    # Mobile
    'android', 'ios', 'flutter', 'reactnative',

    # Microsoft technologies
    'net', 'dotnet', 'dotnetcore', 'dotnetframework',

    # Web
    'html', 'css', 'json', 'xml', 'rest', 'graphql', 'api'
]

# Function to extract skills from a document
def extract_skills(doc_index, tfidf_matrix, feature_names, skill_list, threshold=0.05):
    """
    Extract skills with TF-IDF score above threshold.

    Args:
        doc_index: Index of the document
        tfidf_matrix: TF-IDF matrix
        feature_names: List of vocabulary words
        skill_list: List of known skills
        threshold: Minimum TF-IDF score to consider

    Returns:
        List of (skill, score) tuples
    """
    # Get TF-IDF scores for this document
    doc_tfidf = tfidf_matrix[doc_index].toarray()[0]

    # Create dictionary: word (feature_name) -> score for this document
    word_scores = dict(zip(feature_names, doc_tfidf))

    # Filter for skills above threshold
    skills = [(word, score) for word, score in word_scores.items()
                if  word in skill_list and score > threshold]

    # Sort by score (highest first)
    skills.sort(key=lambda x: x[1], reverse=True)

    return skills

# Extract skills for first 5 documents
for i in range(5):
    skills = extract_skills(i, tfidf_matrix, feature_names, skill_list)
    print(f"\nJob {i+1}: {df['Job Title'].iloc[i]}")
    print("Skills extracted:")
    for skill, score in skills:
        print(f"  - {skill}: {score:.4f}")
    print("=" * 80)


Job 1: Flutter Developer
Skills extracted:
  - flutter: 0.2140

Job 2: Django Developer
Skills extracted:
  - api: 0.3392
  - json: 0.1975
  - linux: 0.1814
  - sql: 0.1538
  - python: 0.1426

Job 3: Machine Learning
Skills extracted:
  - python: 0.1075
  - java: 0.0545

Job 4: iOS Developer
Skills extracted:
  - ios: 0.4664

Job 5: Full Stack Developer
Skills extracted:
  - react: 0.2398


## Step 6: Extract Skills for All Documents

Apply skill extraction to all documents in the dataset.<br>
The `apply()` method processes each document.<br>
Save extracted skills as a new column.

In [84]:
# Function to extract only skill names from a document
def extract_skills_list(doc_index):
    skills_with_scores= extract_skills(doc_index, tfidf_matrix, feature_names, skill_list)
    return [skill for skill, score in skills_with_scores]

# Apply to all documents
df['extracted_skills'] = [extract_skills_list(i) for i in range(len(df))]

# Display results
print("Skills extracted for all jobs:")
print(f"Total jobs: {len(df)}")
print("\nSample results:")
for i in range(5):
    skills = df['extracted_skills'].iloc[i]
    print(f"Job {i+1} ({df['Job Title'].iloc[i]}): {skills}")

Skills extracted for all jobs:
Total jobs: 2277

Sample results:
Job 1 (Flutter Developer): ['flutter']
Job 2 (Django Developer): ['api', 'json', 'linux', 'sql', 'python']
Job 3 (Machine Learning): ['python', 'java']
Job 4 (iOS Developer): ['ios']
Job 5 (Full Stack Developer): ['react']


## Step 7: Analyze Extracted Skills

Analyze the distribution of skills across all job descriptions.<br>
Count how many jobs require each skill.<br>
Identify the most in-demand skills.

In [85]:
from collections import Counter

# Flatten list of all extracted skills
all_skills = []

for skills in df['extracted_skills']:
    all_skills.extend(skills)

# Count skill frequencies
skill_counts = Counter(all_skills)

# Display top 20 most common skills
print("Top 20 most common skills:")
for skill, count in skill_counts.most_common(20):
    percentage = (count / len(df)) * 100
    print(f"{skill}: {count} jobs ({percentage:.1f}%)")

Top 20 most common skills:
javascript: 569 jobs (25.0%)
html: 463 jobs (20.3%)
css: 458 jobs (20.1%)
java: 356 jobs (15.6%)
python: 344 jobs (15.1%)
git: 343 jobs (15.1%)
php: 338 jobs (14.8%)
mysql: 325 jobs (14.3%)
aws: 309 jobs (13.6%)
sql: 299 jobs (13.1%)
rest: 279 jobs (12.3%)
api: 269 jobs (11.8%)
ios: 256 jobs (11.2%)
angular: 234 jobs (10.3%)
linux: 215 jobs (9.4%)
react: 206 jobs (9.0%)
json: 175 jobs (7.7%)
docker: 161 jobs (7.1%)
android: 160 jobs (7.0%)
mongodb: 157 jobs (6.9%)


## Step 8: Save Extracted Skills

Save the DataFrame with extracted skills to a CSV file.<br>
The `to_csv()` function saves the DataFrame to `data/processed/`.<br>
Use `index=False` to avoid saving the row index.

In [86]:
# Save with extracted skills
df.to_csv('../data/processed/job_title_des_with_skills.csv', index=False)

print("Data saved to: data/processed/job_title_des_with_skills.csv")
print(f"Total jobs: {len(df)}")
print(f"Columns: {df.columns.tolist()}")

Data saved to: data/processed/job_title_des_with_skills.csv
Total jobs: 2277
Columns: ['Unnamed: 0', 'Job Title', 'Job Description', 'extracted_skills']
