# Resume Category Prediction
## Project done by: Jaitashri Poddar
## Project Description
- The model created in this project leverages a collection of resume examples to predict the category of a given resume based on predefined labels.
- The training data for this project has been sourced from [Kaggle](https://www.kaggle.com/datasets/snehaanbhawal/resume-dataset/data).
- After building the models, the best performing model has been deployed as a web application that allows users to upload multiple resumes.
- The app processes these resumes and provides an output list with predicted categories for each resume. Additionally, the app features a pie chart that visualizes the distribution of different categories among the uploaded resumes, offering an intuitive overview of the classification results.
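
The category distribution behind such a pie chart can be derived directly from the list of predictions. The following is a minimal sketch, assuming the app holds one predicted label per uploaded resume; the `predicted` list and the `category_shares` helper are hypothetical and are not taken from the deployed app.

```python
from collections import Counter

def category_shares(predicted_labels):
    """Return {category: share of resumes} for a list of predicted labels."""
    counts = Counter(predicted_labels)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}

# Hypothetical predictions for five uploaded resumes
predicted = ["HR", "CHEF", "HR", "FINANCE", "HR"]
shares = category_shares(predicted)
for category, share in sorted(shares.items()):
    print(f"{category}: {share:.0%}")
```

The resulting values and keys could then be passed to a plotting call such as `plt.pie(list(shares.values()), labels=list(shares.keys()))`.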
## Data Definition 
The [training data](https://www.kaggle.com/datasets/snehaanbhawal/resume-dataset/data) contains 2400+ resumes in both string and PDF format.
The PDFs are stored in the data folder, grouped into one sub-folder per category label; each file name is the resume's ID as defined in the CSV.

**Inside the CSV:**

- **ID**: Unique identifier and file name for the respective pdf.
- **Resume_str** : Contains the resume text only in string format.
- **Resume_html** : Contains the resume data in HTML format, as captured during web scraping.
- **Category** : Category of the job the resume was used to apply for.

**Present categories are:** <br>
HR, Designer, Information-Technology, Teacher, Advocate, Business-Development, Healthcare, Fitness, Agriculture, BPO, Sales, Consultant, Digital-Media, Automobile, Chef, Finance, Apparel, Engineering, Accountant, Construction, Public-Relations, Banking, Arts, Aviation.

# 1. Importing Libraries and Setting Options

## 1.1 Importing Necessary Libraries

In [1]:
# suppress display of warnings
import warnings
warnings.filterwarnings("ignore")

# 'Pandas' is used for data manipulation and analysis
import pandas as pd 

# 'Numpy' is used for mathematical operations on large, multi-dimensional arrays and matrices
import numpy as np

# 'Matplotlib' is a data visualization library for 2D and 3D plots, built on numpy
import matplotlib.pyplot as plt

# 'Seaborn' is based on matplotlib; used for plotting statistical graphics
import seaborn as sns

# WordCloud is used for generating word clouds from text data.
# STOPWORDS is a predefined set of common words that are typically excluded from word clouds.
from wordcloud import WordCloud, STOPWORDS


# Importing necessary modules from the Natural Language Toolkit (nltk) library for text processing
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer  # Importing Porter Stemmer and WordNet Lemmatizer for stemming and lemmatization
from nltk.corpus import stopwords  # Importing stopwords for filtering out common words that may not be useful in text analysis
from nltk.tokenize import word_tokenize, sent_tokenize  # Importing tokenizers for splitting text into words and sentences


# Import the gensim library for topic modeling and document similarity
import gensim


# Import the simple_preprocess function from gensim.utils for basic preprocessing
# This function tokenizes text, converts to lowercase, and removes punctuation
from gensim.utils import simple_preprocess

# Import the STOPWORDS set from gensim.parsing.preprocessing for removing common stopwords
# These are words that are usually removed in natural language processing tasks as they do not carry significant meaning (e.g., "and", "the", "is")
# Imported under an alias so that it does not overwrite the STOPWORDS set imported from wordcloud above
from gensim.parsing.preprocessing import STOPWORDS as GENSIM_STOPWORDS


# Import the re module for regular expression operations
import re






In [2]:
# Importing necessary libraries and modules for machine learning tasks

# Text feature extraction
from sklearn.feature_extraction.text import CountVectorizer

# Handling missing data
from sklearn.impute import SimpleImputer

# Evaluation metrics for classification
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Data preprocessing
from sklearn.preprocessing import StandardScaler

# Model selection and evaluation
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
from sklearn.metrics import cohen_kappa_score

# Creating machine learning pipelines
from sklearn.pipeline import Pipeline

# Importing various machine learning algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier, ExtraTreesClassifier

# Importing necessary libraries for TF-IDF text vectorization and model persistence
# (train_test_split is already imported above)
from sklearn.feature_extraction.text import TfidfVectorizer
import joblib


## 1.2 Setting Figure Size

In [3]:
# set the plot size using 'rcParams'
# once the plot size is set using 'rcParams', it sets the size of all the forthcoming plots in the file
# pass width and height in inches to 'figure.figsize' 
plt.rcParams['figure.figsize'] = [15,8]

## 1.3 Setting Options

In [4]:
# display all columns of the dataframe
pd.options.display.max_columns = None

# display all rows of the dataframe
pd.options.display.max_rows = None

# suppress scientific (exponential) notation when printing NumPy arrays
np.set_printoptions(suppress=True)


# 2. Loading Data

In [5]:
#Let's name the dataset as df
df = pd.read_csv('Resume.csv')

In [6]:
df.head()

| | ID | Resume_str | Resume_html | Category |
|---|---|---|---|---|
| 0 | 16852973 | HR ADMINISTRATOR/MARKETING ASSOCIATE\... | `<div class="fontsize fontface vmargins hmargin...` | HR |
| 1 | 22323967 | HR SPECIALIST, US HR OPERATIONS ... | `<div class="fontsize fontface vmargins hmargin...` | HR |
| 2 | 33176873 | HR DIRECTOR Summary Over 2... | `<div class="fontsize fontface vmargins hmargin...` | HR |
| 3 | 27018550 | HR SPECIALIST Summary Dedica... | `<div class="fontsize fontface vmargins hmargin...` | HR |
| 4 | 17812897 | HR MANAGER Skill Highlights ... | `<div class="fontsize fontface vmargins hmargin...` | HR |


In [7]:
df.shape

(2484, 4)

So the dataset has 2484 rows and 4 columns (features).

# 3. Data Cleaning and Analysis

In [8]:
# Let us drop ID and Resume_html columns, as these columns are not necessary to build the model
df.drop(columns = ['ID', 'Resume_html'], inplace = True)

In [9]:
df.head()

| | Resume_str | Category |
|---|---|---|
| 0 | HR ADMINISTRATOR/MARKETING ASSOCIATE\... | HR |
| 1 | HR SPECIALIST, US HR OPERATIONS ... | HR |
| 2 | HR DIRECTOR Summary Over 2... | HR |
| 3 | HR SPECIALIST Summary Dedica... | HR |
| 4 | HR MANAGER Skill Highlights ... | HR |


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2484 entries, 0 to 2483
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Resume_str  2484 non-null   object
 1   Category    2484 non-null   object
dtypes: object(2)
memory usage: 38.9+ KB


In [11]:
# 'dtypes' gives the data type for each column
data_type_and_unique = pd.concat([df.dtypes, df.nunique()], axis = 1, keys = ['Data Types', 'Number of unique Values'])
data_type_and_unique

| | Data Types | Number of unique Values |
|---|---|---|
| Resume_str | object | 2482 |
| Category | object | 24 |


In [12]:
df['Category'].value_counts()

Category
INFORMATION-TECHNOLOGY    120
BUSINESS-DEVELOPMENT      120
FINANCE                   118
ADVOCATE                  118
ACCOUNTANT                118
ENGINEERING               118
CHEF                      118
AVIATION                  117
FITNESS                   117
SALES                     116
BANKING                   115
HEALTHCARE                115
CONSULTANT                115
CONSTRUCTION              112
PUBLIC-RELATIONS          111
HR                        110
DESIGNER                  107
ARTS                      103
TEACHER                   102
APPAREL                    97
DIGITAL-MEDIA              96
AGRICULTURE                63
AUTOMOBILE                 36
BPO                        22
Name: count, dtype: int64

In [13]:
# create list of all categories
categories = np.sort(df['Category'].unique())
categories

array(['ACCOUNTANT', 'ADVOCATE', 'AGRICULTURE', 'APPAREL', 'ARTS',
       'AUTOMOBILE', 'AVIATION', 'BANKING', 'BPO', 'BUSINESS-DEVELOPMENT',
       'CHEF', 'CONSTRUCTION', 'CONSULTANT', 'DESIGNER', 'DIGITAL-MEDIA',
       'ENGINEERING', 'FINANCE', 'FITNESS', 'HEALTHCARE', 'HR',
       'INFORMATION-TECHNOLOGY', 'PUBLIC-RELATIONS', 'SALES', 'TEACHER'],
      dtype=object)

# 4. Data Preprocessing and Cleaning

In [14]:
# Initialize the Porter Stemmer for reducing words to their root forms
stemmer = nltk.stem.porter.PorterStemmer()

In [15]:
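# Build the English stop word set once; looking the list up again for every token is very slow
english_stopwords = set(nltk.corpus.stopwords.words('english'))

def preprocess(txt):
    txt = txt.lower() # convert all characters to lower case
    txt = re.sub('[^a-zA-Z]', ' ', txt) # replace every non-English character with a space
    txt = nltk.tokenize.word_tokenize(txt) # split the text into word tokens
    txt = [w for w in txt if w not in english_stopwords] # remove all stop words
    txt = [stemmer.stem(w) for w in txt] # reduce each word to its stem
    return ' '.join(txt)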
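
To make the effect of the cleaning steps concrete, here is a simplified, dependency-free sketch. It is not the function used in this notebook: it assumes a tiny hand-picked stop word set (`TINY_STOPWORDS`) in place of NLTK's English list, splits on whitespace instead of using `word_tokenize`, and skips stemming. The `clean_text` name is hypothetical.

```python
import re

# Illustrative stand-in for NLTK's English stop word list (assumption: only a few words)
TINY_STOPWORDS = {"a", "an", "and", "the", "of", "in", "with", "for"}

def clean_text(txt):
    """Lowercase, replace non-letters with spaces, and drop stop words (no stemming)."""
    txt = txt.lower()
    txt = re.sub(r"[^a-z]", " ", txt)
    tokens = [w for w in txt.split() if w not in TINY_STOPWORDS]
    return " ".join(tokens)

cleaned = clean_text("HR Manager with 10+ years of experience in recruiting & payroll.")
print(cleaned)  # hr manager years experience recruiting payroll
```

The stemming step in `preprocess()` would shorten these tokens further, which is why the 'Resume' column contains forms such as "administr" and "summari".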

In [16]:
# Passing 'Resume_str' column through preprocess() function, and storing output in a new column, named 'Resume'
df['Resume'] = df['Resume_str'].apply(preprocess)

In [17]:
#Dropping 'Resume_str' column
df.drop(columns = ['Resume_str'], inplace = True)
df.head()

| | Category | Resume |
|---|---|---|
| 0 | HR | hr administr market associ hr administr summar... |
| 1 | HR | hr specialist us hr oper summari versatil medi... |
| 2 | HR | hr director summari year experi recruit plu ye... |
| 3 | HR | hr specialist summari dedic driven dynam year ... |
| 4 | HR | hr manag skill highlight hr skill hr depart st... |


In [18]:
# create list of all categories
categories = np.sort(df['Category'].unique())
categories

array(['ACCOUNTANT', 'ADVOCATE', 'AGRICULTURE', 'APPAREL', 'ARTS',
       'AUTOMOBILE', 'AVIATION', 'BANKING', 'BPO', 'BUSINESS-DEVELOPMENT',
       'CHEF', 'CONSTRUCTION', 'CONSULTANT', 'DESIGNER', 'DIGITAL-MEDIA',
       'ENGINEERING', 'FINANCE', 'FITNESS', 'HEALTHCARE', 'HR',
       'INFORMATION-TECHNOLOGY', 'PUBLIC-RELATIONS', 'SALES', 'TEACHER'],
      dtype=object)

In [19]:
# create new df for corpus and category
df_categories = [df[df['Category'] == category].loc[:, ['Resume', 'Category']] for category in categories]
df_categories[10]

| | Resume | Category |
|---|---|---|
| 1357 | chef career focu nurs student recent obtain cn... | CHEF |
| 1358 | chef summari custom orient fast food worker de... | CHEF |
| 1359 | chef career overview dedic custom servic repre... | CHEF |
| 1360 | chef summari experienc cater chef skill prepar... | CHEF |
| 1361 | rm roxann mejia summari motiv chef compet keep... | CHEF |
| 1362 | chef execut profil accomplish person chef comm... | CHEF |
| 1363 | chef summari highli organ effici fast pace mul... | CHEF |
| 1364 | chef summari experienc cater chef skill prepar... | CHEF |
| 1365 | chef summari qualiti focus effici cook adept p... | CHEF |
| 1366 | chef credenti nation registri food safeti prof... | CHEF |


In [20]:
# Download the 'punkt' tokenizer models and the 'stopwords' corpus from NLTK
# (preprocess() above depends on both, so on a fresh environment run this cell before calling it)
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /Users/jpoddar/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jpoddar/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [21]:
#defining stopwords
stop_words = stopwords.words('english')

In [22]:
#function which removes stop words and words of 3 or fewer characters
def remove_stop_words (text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3 and token not in stop_words:
            result.append(token)
            
    return result

In [23]:
df.head(5)

| | Category | Resume |
|---|---|---|
| 0 | HR | hr administr market associ hr administr summar... |
| 1 | HR | hr specialist us hr oper summari versatil medi... |
| 2 | HR | hr director summari year experi recruit plu ye... |
| 3 | HR | hr specialist summari dedic driven dynam year ... |
| 4 | HR | hr manag skill highlight hr skill hr depart st... |


In [24]:
df.groupby("Category").size()

Category
ACCOUNTANT                118
ADVOCATE                  118
AGRICULTURE                63
APPAREL                    97
ARTS                      103
AUTOMOBILE                 36
AVIATION                  117
BANKING                   115
BPO                        22
BUSINESS-DEVELOPMENT      120
CHEF                      118
CONSTRUCTION              112
CONSULTANT                115
DESIGNER                  107
DIGITAL-MEDIA              96
ENGINEERING               118
FINANCE                   118
FITNESS                   117
HEALTHCARE                115
HR                        110
INFORMATION-TECHNOLOGY    120
PUBLIC-RELATIONS          111
SALES                     116
TEACHER                   102
dtype: int64

In [25]:
df["Category"].value_counts()

Category
INFORMATION-TECHNOLOGY    120
BUSINESS-DEVELOPMENT      120
FINANCE                   118
ADVOCATE                  118
ACCOUNTANT                118
ENGINEERING               118
CHEF                      118
AVIATION                  117
FITNESS                   117
SALES                     116
BANKING                   115
HEALTHCARE                115
CONSULTANT                115
CONSTRUCTION              112
PUBLIC-RELATIONS          111
HR                        110
DESIGNER                  107
ARTS                      103
TEACHER                   102
APPAREL                    97
DIGITAL-MEDIA              96
AGRICULTURE                63
AUTOMOBILE                 36
BPO                        22
Name: count, dtype: int64

In [26]:
# Initialize TfidfVectorizer for converting text documents to TF-IDF features with a maximum of 5000 features
vectorizer = TfidfVectorizer(max_features=5000)
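
To illustrate what a TF-IDF weight is, the sketch below computes the vectors by hand for three toy documents. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 normalisation that scikit-learn documents as the `TfidfVectorizer` defaults; it ignores the library's tokenisation rules and the `max_features` cut-off, and the `tfidf_vectors` helper is hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Compute L2-normalised TF-IDF vectors for whitespace-tokenised documents."""
    tokenised = [doc.split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

docs = ["chef cook kitchen", "chef menu kitchen", "hr recruit payroll"]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

In the first document, "cook" appears in only one document and therefore receives a higher weight than "chef" and "kitchen", which appear in two.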

In [27]:
X = vectorizer.fit_transform(df['Resume'])
y = df['Category']

In [28]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
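
In the cells above, the vectorizer is fitted on the full corpus before the split, so the test resumes influence the vocabulary and document frequencies. The split is also not stratified, although the categories range from 22 to 120 resumes. One possible alternative is sketched below on a toy corpus (the texts and labels are hypothetical): split the raw text first with `stratify`, then let a `Pipeline` fit the vectorizer on the training part only. This is not the procedure that produced the results reported in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-in for df['Resume'] and df['Category'] (hypothetical data)
texts = [
    "chef cook kitchen menu", "chef bake pastry kitchen",
    "cook grill kitchen food", "chef cater food menu",
    "hr recruit payroll staff", "hr hire onboard staff",
    "recruit interview hire payroll", "hr benefit payroll policy",
]
labels = ["CHEF"] * 4 + ["HR"] * 4

# stratify keeps the class proportions the same in both parts of the split
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits the vectorizer on the training texts only
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("clf", MultinomialNB()),
])
pipeline.fit(train_texts, train_labels)
predictions = pipeline.predict(test_texts)
print(list(zip(test_labels, predictions)))
```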

# 5. Defining Generalized Functions for Measuring Performance Measures

## 5.1 Creating Generalized Function to Calculate Metrics for Test Set

In [29]:
#Create a generalized function to calculate the METRICS for the test set.

from sklearn.metrics import classification_report
# create a generalized function to calculate the metrics values for test set
def get_test_report(model):
    
    # predict on the test set with the given model and return the performance measures
    test_pred = model.predict(X_test)
    return classification_report(y_test, test_pred)

## 5.2 Creating Generalized Function to Tabulate Performance Metrics

In [30]:
# Create a generalized function to create a dataframe containing the scores for the models.
# create an empty dataframe to store the scores for various classification algorithms
score_card = pd.DataFrame(columns=['Model', 'Precision Score', 'Recall Score', 'Accuracy Score',
                                   'Kappa Score', 'f1-score'])

# append the result table for all performance scores
# performance measures considered for comparison are 'Precision', 'Recall', 'Accuracy', 'Kappa Score', and 'f1-score'
# note: the function reads the global 'y_pred', so call it right after predicting with the model being scored
# compile the required information in a user defined function 
from sklearn import metrics
def update_score_card(model_name, model_caption):
    
    # assign 'score_card' as global variable
    global score_card

    # collect the scores of the current predictions in a one-row dataframe
    new_row = pd.DataFrame([{'Model': model_caption,
                             'Precision Score': metrics.precision_score(y_test, y_pred, average='micro'),
                             'Recall Score': metrics.recall_score(y_test, y_pred, average='micro'),
                             'Accuracy Score': metrics.accuracy_score(y_test, y_pred),
                             'Kappa Score': cohen_kappa_score(y_test, y_pred),
                             'f1-score': metrics.f1_score(y_test, y_pred, average='micro')}])

    # add the row with the public 'pd.concat' instead of the private 'DataFrame._append'
    # 'ignore_index = True' renumbers the rows from 0
    score_card = pd.concat([score_card, new_row], ignore_index = True)
    return(score_card)
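
With `average='micro'` on a single-label multiclass problem, precision, recall and F1 all reduce to accuracy, because every wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class). This is why those four columns of the score card carry the same value. A small dependency-free check, using hypothetical labels and a hypothetical `micro_scores` helper:

```python
def micro_scores(y_true, y_pred):
    """Return (accuracy, micro precision, micro recall, micro F1) for single-label data."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    # Pool true positives, false positives and false negatives over all classes
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label and truth != label:
                fp += 1
            elif pred != label and truth == label:
                fn += 1
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

truth = ["HR", "HR", "CHEF", "CHEF", "BPO", "BPO"]
preds = ["HR", "CHEF", "CHEF", "CHEF", "HR", "BPO"]
print(micro_scores(truth, preds))
```

If per-class differences should show up in the summary, `average='macro'` is an alternative that weights every category equally regardless of its size.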

# 6. Model Building

## 6.1 Random Forest Classifier

In [31]:
# Train the model
model = RandomForestClassifier()
model.fit(X_train, y_train)

In [32]:
# Making predictions using the trained model on the test data transformed with TfidfVectorizer
y_pred = model.predict(X_test)

In [33]:
# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

Accuracy: 0.64


In [34]:
# Generate a classification report
print('Classification Report:')
print(classification_report(y_test, y_pred))

Classification Report:
                        precision    recall  f1-score   support

            ACCOUNTANT       0.72      0.90      0.80        29
              ADVOCATE       0.88      0.70      0.78        30
           AGRICULTURE       0.00      0.00      0.00         8
               APPAREL       0.77      0.50      0.61        20
                  ARTS       0.00      0.00      0.00        18
            AUTOMOBILE       0.00      0.00      0.00         6
              AVIATION       0.72      0.86      0.78        21
               BANKING       0.71      0.65      0.68        23
                   BPO       0.00      0.00      0.00         2
  BUSINESS-DEVELOPMENT       0.65      0.48      0.55        27
                  CHEF       0.69      0.75      0.72        24
          CONSTRUCTION       0.91      0.88      0.90        34
            CONSULTANT       1.00      0.20      0.33        20
              DESIGNER       0.61      0.89      0.72        19
         DIGITAL

In [35]:
# Tabulating the results
update_score_card(model, 'Random Forest Classifier')

| | Model | Precision Score | Recall Score | Accuracy Score | Kappa Score | f1-score |
|---|---|---|---|---|---|---|
| 0 | Random Forest Classifier | 0.637827 | 0.637827 | 0.637827 | 0.619675 | 0.637827 |
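
The Kappa Score corrects the observed agreement `p_o` (the accuracy) for the agreement `p_e` expected by chance from the label frequencies: `kappa = (p_o - p_e) / (1 - p_e)`. With 24 categories the chance agreement is small, which is consistent with kappa sitting only slightly below accuracy here. A minimal sketch of the unweighted formula, using hypothetical labels and a hypothetical `cohen_kappa` helper:

```python
from collections import Counter

def cohen_kappa(y_true, y_pred):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    # Observed agreement: fraction of matching labels
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of the marginal label frequencies, summed over labels
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[label] * pred_counts[label] for label in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)

truth = ["HR", "HR", "CHEF", "CHEF", "BPO", "BPO"]
preds = ["HR", "CHEF", "CHEF", "CHEF", "HR", "BPO"]
kappa = cohen_kappa(truth, preds)
print(round(kappa, 3))  # 0.5
```

Here `p_o = 4/6` and `p_e = 12/36`, giving `(2/3 - 1/3) / (2/3) = 0.5`.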


## 6.2 Naive Bayes Classifier 

In [36]:
# Train the model
model = MultinomialNB()
model.fit(X_train, y_train)

In [37]:
# Making predictions using the trained model on the test data transformed with TfidfVectorizer
y_pred = model.predict(X_test)

In [38]:
# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

Accuracy: 0.55


In [39]:
# Generate a classification report
print('Classification Report:')
print(classification_report(y_test, y_pred))

Classification Report:
                        precision    recall  f1-score   support

            ACCOUNTANT       0.75      0.83      0.79        29
              ADVOCATE       0.61      0.37      0.46        30
           AGRICULTURE       1.00      0.12      0.22         8
               APPAREL       0.86      0.30      0.44        20
                  ARTS       0.25      0.06      0.09        18
            AUTOMOBILE       0.00      0.00      0.00         6
              AVIATION       0.67      0.76      0.71        21
               BANKING       0.81      0.57      0.67        23
                   BPO       0.00      0.00      0.00         2
  BUSINESS-DEVELOPMENT       0.46      0.63      0.53        27
                  CHEF       0.81      0.71      0.76        24
          CONSTRUCTION       0.91      0.59      0.71        34
            CONSULTANT       0.50      0.05      0.09        20
              DESIGNER       0.78      0.74      0.76        19
         DIGITAL

In [40]:
# Tabulating the results
update_score_card(model, 'Naive Bayes Classifier')

| | Model | Precision Score | Recall Score | Accuracy Score | Kappa Score | f1-score |
|---|---|---|---|---|---|---|
| 0 | Random Forest Classifier | 0.637827 | 0.637827 | 0.637827 | 0.619675 | 0.637827 |
| 1 | Naive Bayes Classifier | 0.549296 | 0.549296 | 0.549296 | 0.526864 | 0.549296 |


## 6.3 Extra Trees Classifier

In [41]:
# Train the model
model = ExtraTreesClassifier(n_estimators=500, criterion='gini')
model.fit(X_train, y_train)

In [42]:
# Making predictions using the trained model on the test data transformed with TfidfVectorizer
y_pred = model.predict(X_test)

In [43]:
# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

Accuracy: 0.62


In [44]:
# Generate a classification report
print('Classification Report:')
print(classification_report(y_test, y_pred))

Classification Report:
                        precision    recall  f1-score   support

            ACCOUNTANT       0.69      0.86      0.77        29
              ADVOCATE       0.83      0.67      0.74        30
           AGRICULTURE       0.25      0.12      0.17         8
               APPAREL       0.62      0.40      0.48        20
                  ARTS       0.00      0.00      0.00        18
            AUTOMOBILE       0.00      0.00      0.00         6
              AVIATION       0.71      0.81      0.76        21
               BANKING       0.71      0.65      0.68        23
                   BPO       0.00      0.00      0.00         2
  BUSINESS-DEVELOPMENT       0.62      0.56      0.59        27
                  CHEF       0.78      0.75      0.77        24
          CONSTRUCTION       0.85      0.82      0.84        34
            CONSULTANT       0.50      0.10      0.17        20
              DESIGNER       0.65      0.79      0.71        19
         DIGITAL

In [45]:
# Tabulating the results
update_score_card(model, 'Extra Trees Classifier')

| | Model | Precision Score | Recall Score | Accuracy Score | Kappa Score | f1-score |
|---|---|---|---|---|---|---|
| 0 | Random Forest Classifier | 0.637827 | 0.637827 | 0.637827 | 0.619675 | 0.637827 |
| 1 | Naive Bayes Classifier | 0.549296 | 0.549296 | 0.549296 | 0.526864 | 0.549296 |
| 2 | Extra Trees Classifier | 0.617706 | 0.617706 | 0.617706 | 0.598493 | 0.617706 |


# 7. Conclusion

In [46]:
score_card

| | Model | Precision Score | Recall Score | Accuracy Score | Kappa Score | f1-score |
|---|---|---|---|---|---|---|
| 0 | Random Forest Classifier | 0.637827 | 0.637827 | 0.637827 | 0.619675 | 0.637827 |
| 1 | Naive Bayes Classifier | 0.549296 | 0.549296 | 0.549296 | 0.526864 | 0.549296 |
| 2 | Extra Trees Classifier | 0.617706 | 0.617706 | 0.617706 | 0.598493 | 0.617706 |


# 8. Dumping Best Model

#### We see that the Random Forest Classifier, with default hyperparameters, achieves the best performance (accuracy 0.64, versus 0.62 for Extra Trees and 0.55 for Naive Bayes). Hence we dump this model for further use.

In [48]:
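# Save the model and vectorizer
# NOTE: at this point 'model' holds the Extra Trees classifier, the last model fitted;
# re-fit the Random Forest (or keep it under its own variable name) before dumping
#joblib.dump(model, 'resume_classifier.pkl')
#joblib.dump(vectorizer, 'vectorizer.pkl')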
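
A minimal sketch of the save-and-reload round trip, assuming a toy corpus (the texts and labels are hypothetical) and a temporary directory in place of the project folder. The Random Forest is kept under its own name, `rf_model`, so that it cannot be confused with a model fitted later.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the preprocessed resumes (hypothetical data)
texts = ["chef cook kitchen menu", "chef bake pastry kitchen",
         "hr recruit payroll staff", "hr hire onboard staff"]
labels = ["CHEF", "CHEF", "HR", "HR"]

toy_vectorizer = TfidfVectorizer(max_features=5000)
rf_model = RandomForestClassifier(random_state=42)  # kept under its own name
rf_model.fit(toy_vectorizer.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "resume_classifier.pkl")
    vectorizer_path = os.path.join(tmp_dir, "vectorizer.pkl")
    joblib.dump(rf_model, model_path)
    joblib.dump(toy_vectorizer, vectorizer_path)

    # Reload both artifacts, as a separate application would
    loaded_model = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vectorizer_path)

# New text must be transformed with the same fitted vectorizer before predicting
new_resume = ["chef kitchen pastry menu"]
print(loaded_model.predict(loaded_vectorizer.transform(new_resume)))
```

In the real application, new resume text would also need to pass through the same `preprocess()` step before vectorization, since the vectorizer was fitted on stemmed text.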