***About Author***

 
Author: Maham Noor

Contact's info: mahamnoor575@gmail.com

***About Data***

Data: resume_data

***Task***

Resume Matching Model

#### **Objective:**  
The goal of this project is to build a **machine learning model** that predicts the compatibility of a candidate’s resume with a given job description. The model analyzes structured and unstructured text data, such as required skills, related skills, and career objectives, to generate a **matching score** between resumes and job postings.

#### **Steps Involved:**  
1. **Data Preprocessing:**  
   - Handling missing values and data inconsistencies.  
   - Applying TF-IDF vectorization for text-based features.  
   - Encoding categorical variables and normalizing numerical features.  

2. **Feature Engineering:**  
   - Extracting meaningful features from resumes and job descriptions.  
   - Applying NLP techniques for better text representation.  

3. **Model Training & Evaluation:**  
   - Splitting data into training and testing sets.  
   - Training a machine learning model (e.g., Random Forest, XGBoost).  
   - Evaluating model performance using metrics such as **MAE, R² score, and RMSE**.  

4. **Predictions & Final Analysis:**  
   - Generating predictions for resume-job match scores.  
   - Interpreting results and identifying areas for improvement. 

Importing Libraries

In [242]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from dateutil import parser
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import MultiLabelBinarizer, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer



***Reading the data***

In [243]:
df = pd.read_csv(r"C:\Users\pc\Documents\resume_data.csv")

*First 10 rows of data*

In [282]:
df.head(10)

Unnamed: 0,educational_institution_name,passing_years,educational_results,result_types,professional_company_names,start_dates,end_dates,positions,locations,responsibilities,...,related_skils_in_job_tfidf_1418,related_skils_in_job_tfidf_1419,related_skils_in_job_tfidf_1420,related_skils_in_job_tfidf_1421,related_skils_in_job_tfidf_1422,related_skils_in_job_tfidf_1423,related_skils_in_job_tfidf_1424,related_skils_in_job_tfidf_1425,related_skils_in_job_tfidf_1426,related_skils_in_job_tfidf_1427
0,['The Amity School of Engineering & Technology...,2019.0,0.043237,29,['Coca-COla'],NaT,NaT,['Big Data Analyst'],['N/A'],Technical Support\nTroubleshooting\nCollaborat...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"['Delhi University - Hansraj College', 'Delhi ...",2015.0,0.043237,17,['BIB Consultancy'],NaT,NaT,['Business Analyst'],['N/A'],Machine Learning Leadership\nCross-Functional ...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"['Birla Institute of Technology (BIT), Ranchi']",2018.0,0.043237,18,['Axis Bank Limited'],NaT,NaT,['Software Developer (Machine Learning Enginee...,['N/A'],"Trade Marketing Executive\nBrand Visibility, S...",...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"['Martinez Adult Education, Business Training ...",2008.0,0.043237,29,"['Company Name ï¼ City , State', 'Company Name...",NaT,NaT,"['Accountant', 'Accounts Receivable Clerk', 'M...","['City, State', 'City, State', 'City, State', ...",Apparel Sourcing\nQuality Garment Sourcing\nRe...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,['Kent State University'],2018.0,0.044802,29,"['Company Name', 'Company Name', 'Company Name...",NaT,NaT,"['Staff Accountant', 'Senior Accountant', 'Tax...","['City, State', 'City, State', 'City, State', ...",iOS Lifecycle\nRequirement Analysis\nNative Fr...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"['Glen Oaks High School', 'Glen Oaks High Scho...",2018.0,0.043237,28,"['N/A', 'Company Name', 'Company Name']",NaT,NaT,"['Engineering Systems Installer', 'IT Technici...","['City, State', 'City, State', 'N/A']",Machine Learning Design\nData Analysis\nModel ...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,['DJR College and University'],2020.0,0.043237,18,['Remiro Amio'],NaT,NaT,['Intern'],['N/A'],iOS Lifecycle\nRequirement Analysis\nNative Fr...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,['POLYTECHNIC UNIVERSITY OF PUERTO RICO'],2009.0,0.025743,8,"['Company Name', 'Company Name', 'Company Name']",NaT,NaT,"['Engineering Technician', 'Instrument Technic...","['City, State', 'City, State', 'City, State']",iOS Lifecycle\nRequirement Analysis\nNative Fr...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,['Nagpur University'],2019.0,0.043237,18,['AMZ Loans and Mortgages ERC Analytics'],NaT,NaT,['Associate Analyst'],['N/A'],Machinery Maintenance\nTroubleshooting\nReport...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,['Dr. Virendra Swaroop Institute of Computer S...,2016.0,0.043237,28,['Daffodil Software Pvt Ltd'],NaT,NaT,['Software Developer'],['N/A'],Apparel Sourcing\nQuality Garment Sourcing\nRe...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [245]:
df.columns

Index(['address', 'career_objective', 'skills', 'educational_institution_name',
       'degree_names', 'passing_years', 'educational_results', 'result_types',
       'major_field_of_studies', 'professional_company_names', 'company_urls',
       'start_dates', 'end_dates', 'related_skils_in_job', 'positions',
       'locations', 'responsibilities', 'extra_curricular_activity_types',
       'extra_curricular_organization_names',
       'extra_curricular_organization_links', 'role_positions', 'languages',
       'proficiency_levels', 'certification_providers', 'certification_skills',
       'online_links', 'issue_dates', 'expiry_dates', '﻿job_position_name',
       'educationaL_requirements', 'experiencere_requirement',
       'age_requirement', 'responsibilities.1', 'skills_required',
       'matched_score'],
      dtype='object')

*Drop the unneccessary columns*

In [246]:
df = df.drop(columns = ['address','company_urls','online_links','extra_curricular_organization_links','responsibilities.1','issue_dates','expiry_dates','role_positions'])

In [247]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9544 entries, 0 to 9543
Data columns (total 27 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   career_objective                     4740 non-null   object 
 1   skills                               9488 non-null   object 
 2   educational_institution_name         9460 non-null   object 
 3   degree_names                         9460 non-null   object 
 4   passing_years                        9460 non-null   object 
 5   educational_results                  9460 non-null   object 
 6   result_types                         9460 non-null   object 
 7   major_field_of_studies               9460 non-null   object 
 8   professional_company_names           9460 non-null   object 
 9   start_dates                          9460 non-null   object 
 10  end_dates                            9460 non-null   object 
 11  related_skils_in_job          

***Missing values in dataset***

In [248]:
df.isnull().sum()

career_objective                       4804
skills                                   56
educational_institution_name             84
degree_names                             84
passing_years                            84
educational_results                      84
result_types                             84
major_field_of_studies                   84
professional_company_names               84
start_dates                              84
end_dates                                84
related_skils_in_job                     84
positions                                84
locations                                84
responsibilities                          0
extra_curricular_activity_types        6118
extra_curricular_organization_names    6118
languages                              8844
proficiency_levels                     8844
certification_providers                7536
certification_skills                   7536
﻿job_position_name                        0
educationaL_requirements        

In [249]:
df['has_certifications'] = df['certification_skills'].notnull().astype(int)
df['has_language'] = df['languages'].notnull().astype(int)

drop the unuseful columns

In [250]:
df.drop(columns= ['extra_curricular_activity_types','extra_curricular_organization_names','certification_providers','certification_skills','languages','proficiency_levels' 
                  
], inplace = True)

In [251]:
print(df.isnull().sum())

career_objective                4804
skills                            56
educational_institution_name      84
degree_names                      84
passing_years                     84
educational_results               84
result_types                      84
major_field_of_studies            84
professional_company_names        84
start_dates                       84
end_dates                         84
related_skils_in_job              84
positions                         84
locations                         84
responsibilities                   0
﻿job_position_name                 0
educationaL_requirements           0
experiencere_requirement        1364
age_requirement                 4087
skills_required                 1701
matched_score                      0
has_certifications                 0
has_language                       0
dtype: int64


***Dealing with the missing values.***

as experience column is in string we have to convert it into float.

In [252]:
 
import re

# Improved Function
def extract_experience(exp):
    if pd.isna(exp) or not isinstance(exp, str):  # Handle NaN and non-string cases
        return np.nan
    match = re.search(r'(\d+)', exp)  # Find first number in the string
    return float(match.group(1)) if match else np.nan

# Apply the function
df['experiencere_requirement'] = df['experiencere_requirement'].apply(extract_experience)
df['age_requirement'] = df['age_requirement'].apply(extract_experience)

# Fill missing values
df['experiencere_requirement'].fillna(df['experiencere_requirement'].median(), inplace=True)
df['age_requirement'].fillna(df['age_requirement'].median(), inplace=True)


Converting the missing value with median.

In [253]:

df['skills_count'] = df['skills'].apply(lambda x: len(str(x).split(',')) if pd.notna(x) else 0)

# Create a binary column 'has_degree' where 1 = degree present, 0 = no degree
df['has_degree'] = df['degree_names'].notnull().astype(int)

# Drop the 'skills' and 'degree_names' columns
df.drop(columns=['skills', 'degree_names'], inplace=True)


In [254]:
df.isnull().sum()

career_objective                4804
educational_institution_name      84
passing_years                     84
educational_results               84
result_types                      84
major_field_of_studies            84
professional_company_names        84
start_dates                       84
end_dates                         84
related_skils_in_job              84
positions                         84
locations                         84
responsibilities                   0
﻿job_position_name                 0
educationaL_requirements           0
experiencere_requirement           0
age_requirement                    0
skills_required                 1701
matched_score                      0
has_certifications                 0
has_language                       0
skills_count                       0
has_degree                         0
dtype: int64

There is some problem in the job position name column so we have to fix that first.

In [255]:
df.columns = df.columns.str.replace(r'\ufeff', '', regex=True)
df.columns = df.columns.str.strip()
df['job_position_name']


0                                Senior Software Engineer
1                          Machine Learning (ML) Engineer
2       Executive/ Senior Executive- Trade Marketing, ...
3                          Business Development Executive
4                                     Senior iOS Engineer
                              ...                        
9539                                        Data Engineer
9540                         Executive/ Sr. Executive -IT
9541                                      Executive - VAT
9542               Asst. Manager/ Manger (Administrative)
9543                                       Civil Engineer
Name: job_position_name, Length: 9544, dtype: object

Encoding of the missing values.

In [256]:
encoder = LabelEncoder()
df['job_position_encoded'] = encoder.fit_transform(df['job_position_name'])
df['major_field_encoded'] = encoder.fit_transform(df['major_field_of_studies'].astype(str))
df.drop(columns= ['job_position_name','major_field_of_studies'],inplace = True)

In [257]:
cat_cols = ['educational_institution_name','professional_company_names','positions','locations']
df[cat_cols] = df[cat_cols].fillna('unknown')

In [258]:
import ast

def extract_year(value):
    try:
        if isinstance(value, str) and value.startswith("["):
            value = ast.literal_eval(value)
            return int(value[0])
        return int(value)
    except:
        return None

df['passing_years'] = df['passing_years'].apply(extract_year)


In [259]:
import re
import ast

def extract_result(value):
    try:
        if isinstance(value, str) and value.startswith("["):
            value = ast.literal_eval(value)  # Convert list-like string to actual list
            value = value[0]  # Extract first element
        
        if isinstance(value, str):
            match = re.search(r"\d+(\.\d+)?", value)  # Extract integer or decimal number
            if match:
                num = float(match.group())  # Convert matched value to float
                return num / 100 if "%" in value else num  # Normalize percentage values
        
        return float(value)  # Directly convert numerical values to float
    except:
        return None  # Return None if an error occurs

# Apply the function to the column
df['educational_results'] = df['educational_results'].apply(extract_result)


Fill the missing values.

In [260]:
df['passing_years'].fillna(df['passing_years'].median(), inplace = True)
df['educational_results'].fillna(df['educational_results'].median(),inplace = True)
df['result_types'].fillna('unknown', inplace= True)
df['start_dates'].fillna('unknown', inplace= True)
df['end_dates'].fillna('unknown', inplace= True)
df[['related_skils_in_job', 'skills_required']] = df[['related_skils_in_job', 'skills_required']].fillna('None')


In [261]:
df['experiencere_requirement'].fillna(df['experiencere_requirement'].median(), inplace = True)
df['age_requirement'].fillna(df['age_requirement'].median(), inplace = True)

In [262]:
df.isnull().sum()

career_objective                4804
educational_institution_name       0
passing_years                      0
educational_results                0
result_types                       0
professional_company_names         0
start_dates                        0
end_dates                          0
related_skils_in_job               0
positions                          0
locations                          0
responsibilities                   0
educationaL_requirements           0
experiencere_requirement           0
age_requirement                    0
skills_required                    0
matched_score                      0
has_certifications                 0
has_language                       0
skills_count                       0
has_degree                         0
job_position_encoded               0
major_field_encoded                0
dtype: int64

In [263]:
df['has_career_objective'] = df['career_objective'].notnull().astype(int)

In [264]:
import re
def clean_text(text):
    if pd.isna(text):
        return ""
    text = text.lower()
    text = re.sub(r'[a-zA-Z0-9\s]','',text )
df['career_objective_cleaned'] = df['career_objective'].apply(clean_text)  

In [265]:
df.drop(columns = ['career_objective'],inplace= True)

Now check the missing values

In [266]:
df.isnull().sum()

educational_institution_name       0
passing_years                      0
educational_results                0
result_types                       0
professional_company_names         0
start_dates                        0
end_dates                          0
related_skils_in_job               0
positions                          0
locations                          0
responsibilities                   0
educationaL_requirements           0
experiencere_requirement           0
age_requirement                    0
skills_required                    0
matched_score                      0
has_certifications                 0
has_language                       0
skills_count                       0
has_degree                         0
job_position_encoded               0
major_field_encoded                0
has_career_objective               0
career_objective_cleaned        4740
dtype: int64

In [267]:
df.columns

Index(['educational_institution_name', 'passing_years', 'educational_results',
       'result_types', 'professional_company_names', 'start_dates',
       'end_dates', 'related_skils_in_job', 'positions', 'locations',
       'responsibilities', 'educationaL_requirements',
       'experiencere_requirement', 'age_requirement', 'skills_required',
       'matched_score', 'has_certifications', 'has_language', 'skills_count',
       'has_degree', 'job_position_encoded', 'major_field_encoded',
       'has_career_objective', 'career_objective_cleaned'],
      dtype='object')

In [268]:
df['matched_score']

0       0.850000
1       0.750000
2       0.416667
3       0.760000
4       0.650000
          ...   
9539    0.683333
9540    0.650000
9541    0.650000
9542    0.650000
9543    0.650000
Name: matched_score, Length: 9544, dtype: float64

***Label Encoding***


In [269]:
categorical_cols = ["educationaL_requirements", "result_types", "job_position_encoded"]

label_encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col].astype(str))  # Convert to string to handle NaNs
    label_encoders[col] = le  # Store encoders for later us


In [270]:
df['skills_required']

0                                                    None
1                                                    None
2       Brand Promotion\nCampaign Management\nField Su...
3       Fast typing skill\nIELTSInternet browsing & on...
4       iOS\niOS App Developer\niOS Application Develo...
                              ...                        
9539    Azure\nBig Data\nData Analytics\nETL Tools\nPo...
9540                                                 None
9541                                          VAT and Tax
9542    •Administration\n•Health Safety and Environmen...
9543    AutoCAD\nETABS\nMicrosoft Office Suite\nMS Pro...
Name: skills_required, Length: 9544, dtype: object

*Convert Dates to Experience Years.*

In [271]:
import pandas as pd
import datetime

# Convert start and end dates to datetime format
df["start_dates"] = pd.to_datetime(df["start_dates"], errors="coerce")
df["end_dates"] = pd.to_datetime(df["end_dates"], errors="coerce")

# Convert passing years to numeric format
df["passing_years"] = pd.to_numeric(df["passing_years"], errors="coerce")

# Get current year
current_year = datetime.datetime.now().year

# Compute experience years
df["experience_years"] = df.apply(
    lambda row: (row["end_dates"].year - row["start_dates"].year) 
                if pd.notna(row["end_dates"]) and pd.notna(row["start_dates"]) 
                else 0, 
    axis=1
)

# Compute years since graduation
df["years_since_graduation"] = df["passing_years"].apply(lambda x: current_year - x if pd.notna(x) else 0)

  df["start_dates"] = pd.to_datetime(df["start_dates"], errors="coerce")
  df["end_dates"] = pd.to_datetime(df["end_dates"], errors="coerce")


*Normalize Numerical Features*

In [272]:
numerical_cols = ["skills_count", "educational_results", "experience_years", "years_since_graduation"]

scaler = MinMaxScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])

df.head()


Unnamed: 0,educational_institution_name,passing_years,educational_results,result_types,professional_company_names,start_dates,end_dates,related_skils_in_job,positions,locations,...,has_certifications,has_language,skills_count,has_degree,job_position_encoded,major_field_encoded,has_career_objective,career_objective_cleaned,experience_years,years_since_graduation
0,['The Amity School of Engineering & Technology...,2019.0,0.043237,29,['Coca-COla'],NaT,NaT,[['Big Data']],['Big Data Analyst'],['N/A'],...,0,0,0.12069,1,16,116,1,,0.0,0.12766
1,"['Delhi University - Hansraj College', 'Delhi ...",2015.0,0.043237,17,['BIB Consultancy'],NaT,NaT,"[['Data Analysis', 'Business Analysis', 'Machi...",['Business Analyst'],['N/A'],...,0,0,0.057471,1,7,159,1,,0.0,0.212766
2,"['Birla Institute of Technology (BIT), Ranchi']",2018.0,0.043237,18,['Axis Bank Limited'],NaT,NaT,"[['Unified Payment Interface', 'Risk Predictio...",['Software Developer (Machine Learning Enginee...,['N/A'],...,0,0,0.08046,1,27,119,0,,0.0,0.148936
3,"['Martinez Adult Education, Business Training ...",2008.0,0.043237,29,"['Company Name ï¼ City , State', 'Company Name...",NaT,NaT,"[['accounts receivables', 'banking', 'G/L Acco...","['Accountant', 'Accounts Receivable Clerk', 'M...","['City, State', 'City, State', 'City, State', ...",...,0,0,0.206897,1,12,53,1,,0.0,0.361702
4,['Kent State University'],2018.0,0.044802,29,"['Company Name', 'Company Name', 'Company Name...",NaT,NaT,"[['collections', 'accounts receivable', 'finan...","['Staff Accountant', 'Senior Accountant', 'Tax...","['City, State', 'City, State', 'City, State', ...",...,1,0,0.183908,1,17,18,1,,0.0,0.148936


In [273]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np

text_cols = ["skills_required", "related_skils_in_job", "career_objective_cleaned"]
tfidf_vectorizer = TfidfVectorizer(min_df=1)  # Ensure it works even with fewer words

for col in text_cols:
    # Convert NaN values, lists, and numbers to strings
    df[col] = df[col].fillna("").astype(str)
    df[col] = df[col].apply(lambda x: " ".join(x) if isinstance(x, list) else str(x))
    
    # Skip column if empty
    if df[col].str.strip().replace("", np.nan).dropna().empty:
        print(f"Skipping TF-IDF for {col} because it's empty.")
        continue

    # Apply TF-IDF
    tfidf_matrix = tfidf_vectorizer.fit_transform(df[col])

    # Convert to DataFrame
    tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), 
                            columns=[f"{col}_tfidf_{i}" for i in range(tfidf_matrix.shape[1])])

    # Merge TF-IDF features into df
    df = pd.concat([df, tfidf_df], axis=1)

    # Drop original text column
    df.drop(columns=[col], inplace=True)

df.head()


Skipping TF-IDF for career_objective_cleaned because it's empty.


Unnamed: 0,educational_institution_name,passing_years,educational_results,result_types,professional_company_names,start_dates,end_dates,positions,locations,responsibilities,...,related_skils_in_job_tfidf_1418,related_skils_in_job_tfidf_1419,related_skils_in_job_tfidf_1420,related_skils_in_job_tfidf_1421,related_skils_in_job_tfidf_1422,related_skils_in_job_tfidf_1423,related_skils_in_job_tfidf_1424,related_skils_in_job_tfidf_1425,related_skils_in_job_tfidf_1426,related_skils_in_job_tfidf_1427
0,['The Amity School of Engineering & Technology...,2019.0,0.043237,29,['Coca-COla'],NaT,NaT,['Big Data Analyst'],['N/A'],Technical Support\nTroubleshooting\nCollaborat...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"['Delhi University - Hansraj College', 'Delhi ...",2015.0,0.043237,17,['BIB Consultancy'],NaT,NaT,['Business Analyst'],['N/A'],Machine Learning Leadership\nCross-Functional ...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"['Birla Institute of Technology (BIT), Ranchi']",2018.0,0.043237,18,['Axis Bank Limited'],NaT,NaT,['Software Developer (Machine Learning Enginee...,['N/A'],"Trade Marketing Executive\nBrand Visibility, S...",...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"['Martinez Adult Education, Business Training ...",2008.0,0.043237,29,"['Company Name ï¼ City , State', 'Company Name...",NaT,NaT,"['Accountant', 'Accounts Receivable Clerk', 'M...","['City, State', 'City, State', 'City, State', ...",Apparel Sourcing\nQuality Garment Sourcing\nRe...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,['Kent State University'],2018.0,0.044802,29,"['Company Name', 'Company Name', 'Company Name...",NaT,NaT,"['Staff Accountant', 'Senior Accountant', 'Tax...","['City, State', 'City, State', 'City, State', ...",iOS Lifecycle\nRequirement Analysis\nNative Fr...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [274]:
df.info()  # Check data types and missing values
df.head()  # Display processed data


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9544 entries, 0 to 9543
Columns: 1610 entries, educational_institution_name to related_skils_in_job_tfidf_1427
dtypes: datetime64[ns](2), float64(1594), int32(8), object(6)
memory usage: 116.9+ MB


Unnamed: 0,educational_institution_name,passing_years,educational_results,result_types,professional_company_names,start_dates,end_dates,positions,locations,responsibilities,...,related_skils_in_job_tfidf_1418,related_skils_in_job_tfidf_1419,related_skils_in_job_tfidf_1420,related_skils_in_job_tfidf_1421,related_skils_in_job_tfidf_1422,related_skils_in_job_tfidf_1423,related_skils_in_job_tfidf_1424,related_skils_in_job_tfidf_1425,related_skils_in_job_tfidf_1426,related_skils_in_job_tfidf_1427
0,['The Amity School of Engineering & Technology...,2019.0,0.043237,29,['Coca-COla'],NaT,NaT,['Big Data Analyst'],['N/A'],Technical Support\nTroubleshooting\nCollaborat...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"['Delhi University - Hansraj College', 'Delhi ...",2015.0,0.043237,17,['BIB Consultancy'],NaT,NaT,['Business Analyst'],['N/A'],Machine Learning Leadership\nCross-Functional ...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"['Birla Institute of Technology (BIT), Ranchi']",2018.0,0.043237,18,['Axis Bank Limited'],NaT,NaT,['Software Developer (Machine Learning Enginee...,['N/A'],"Trade Marketing Executive\nBrand Visibility, S...",...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"['Martinez Adult Education, Business Training ...",2008.0,0.043237,29,"['Company Name ï¼ City , State', 'Company Name...",NaT,NaT,"['Accountant', 'Accounts Receivable Clerk', 'M...","['City, State', 'City, State', 'City, State', ...",Apparel Sourcing\nQuality Garment Sourcing\nRe...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,['Kent State University'],2018.0,0.044802,29,"['Company Name', 'Company Name', 'Company Name...",NaT,NaT,"['Staff Accountant', 'Senior Accountant', 'Tax...","['City, State', 'City, State', 'City, State', ...",iOS Lifecycle\nRequirement Analysis\nNative Fr...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


 ***Prepare Data for Training.***

In [277]:
from sklearn.model_selection import train_test_split

# Define Features (X) and Target (y)
X = df.drop(columns=['matched_score'])  # Remove target column
y = df['matched_score']  # Define target variable

# Split into Training and Testing Data (80% Train, 20% Test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("✅ Data Split Done!")
print(f"Training Data: {X_train.shape}, Test Data: {X_test.shape}")


✅ Data Split Done!
Training Data: (7635, 1609), Test Data: (1909, 1609)


In [278]:
# Drop all categorical columns to avoid errors
X_train = X_train.select_dtypes(include=[np.number])
X_test = X_test.select_dtypes(include=[np.number])

# Fill missing values with 0
X_train.fillna(0, inplace=True)
X_test.fillna(0, inplace=True)

print("✅ Data Cleaning Done!")


✅ Data Cleaning Done!


***Training the Model***

In [279]:
from sklearn.ensemble import RandomForestRegressor

# Train the Model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("✅ Model Training Done!")


✅ Model Training Done!


***Model Evaluation***

In [280]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Make Predictions
y_pred = model.predict(X_test)

# Evaluate Performance
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"📊 Model Evaluation:")
print(f"🔹 Mean Absolute Error (MAE): {mae}")
print(f"🔹 R² Score: {r2}")
print(f"🔹 Root Mean Squared Error (RMSE): {rmse}")


📊 Model Evaluation:
🔹 Mean Absolute Error (MAE): 0.07798420636356555
🔹 R² Score: 0.5959211683141794
🔹 Root Mean Squared Error (RMSE): 0.10570786087709855


The trained model was evaluated using various performance metrics, and the results indicate moderate accuracy in predicting the target variable. The key evaluation metrics are:

🔹 Mean Absolute Error (MAE): 0.0779 (Lower is better, indicating minimal prediction error)

🔹 R² Score: 0.5959 (Shows the model explains ~59.6% of the variance in the target variable)

🔹 Root Mean Squared Error (RMSE): 0.1057 (Indicates the average deviation of predictions from actual values)



***Conclusion:***
This model now can use for:

Help recruiters shortlist candidates efficiently.

✅ Assist job seekers in finding roles that best match their skills.

✅ Improve hiring decisions by incorporating AI-based insights.