# <b>Final Project - Job Application Helper (Resume Screening / Recommending Job / Comparing Job Description and Resume)</b>

#### Importing the necessary libraries

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#### Importing the dataset into a dataframe and displaying its first 5 rows

In [3]:
df1 = pd.read_csv('data.csv')
df1.head()

Unnamed: 0,Resume,Label
0,Database Administrator Database Administrator ...,Database_Administrator
1,Database Administrator Database Administrator ...,Database_Administrator
2,Oracle Database Administrator Oracle Database ...,Database_Administrator
3,Amazon Redshift Administrator and ETL Develope...,Database_Administrator
4,Scrum Master Scrum Master Scrum Master Richmon...,Database_Administrator


In [4]:
# Rename the column
df1.rename(columns={'Label': 'Category'}, inplace=True)
df1.head()


Unnamed: 0,Resume,Category
0,Database Administrator Database Administrator ...,Database_Administrator
1,Database Administrator Database Administrator ...,Database_Administrator
2,Oracle Database Administrator Oracle Database ...,Database_Administrator
3,Amazon Redshift Administrator and ETL Develope...,Database_Administrator
4,Scrum Master Scrum Master Scrum Master Richmon...,Database_Administrator


# <b>Data Exploration</b>

- Checking whether the dataset has any null values

In [5]:
df1.isnull().sum()


Resume        2
Category    748
dtype: int64

In [6]:
df1.dropna(inplace=True)
df1.isnull().sum()

Resume      0
Category    0
dtype: int64

- Finding basic information about the dataset
    - It has 29033 rows and 2 columns
    - Both columns contain data of the `object` dtype, i.e. strings

In [7]:
print(df1.shape)
print(df1.info())

(29033, 2)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 29033 entries, 0 to 29782
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Resume    29033 non-null  object
 1   Category  29033 non-null  object
dtypes: object(2)
memory usage: 680.5+ KB
None


- Looking at the Unique Categories

In [8]:
unique_categories = df1['Category'].unique()
print(unique_categories)

['Database_Administrator' 'Database_Administrator,Software_Developer'
 'Systems_Administrator,Database_Administrator'
 'Project_manager,Database_Administrator'
 'Database_Administrator,Project_manager'
 'Database_Administrator,Security_Analyst'
 'Software_Developer,Database_Administrator'
 'Web_Developer,Software_Developer,Database_Administrator'
 'Security_Analyst,Database_Administrator'
 'Database_Administrator,Web_Developer,Software_Developer'
 'Database_Administrator,Systems_Administrator'
 'Database_Administrator,Project_manager,Network_Administrator'
 'Database_Administrator,Network_Administrator'
 'Database_Administrator,Network_Administrator,Project_manager'
 'Database_Administrator,Systems_Administrator,Software_Developer'
 'Database_Administrator,Network_Administrator,Software_Developer'
 'Software_Developer,Systems_Administrator,Security_Analyst'
 'Systems_Administrator,Database_Administrator,Project_manager'
 'Systems_Administrator,Software_Developer,Database_Administrator'

- Exploring Categories - getting value counts of each category. 

In [9]:
df1['Category'].value_counts()

Systems_Administrator                                                                            2349
Project_manager                                                                                  2339
Database_Administrator                                                                           2225
Software_Developer                                                                               1991
Web_Developer,Software_Developer                                                                 1896
                                                                                                 ... 
Systems_Administrator,Security_Analyst,Project_manager,Web_Developer,Software_Developer             1
Security_Analyst,Network_Administrator,Software_Developer,Systems_Administrator                     1
Network_Administrator,Web_Developer,Software_Developer,Systems_Administrator,Security_Analyst       1
Security_Analyst,Systems_Administrator,Project_manager,Software_Developer         
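Because each `Category` value is a comma-separated combination of roles, the counts above treat every distinct combination as its own class. A minimal sketch of counting individual roles instead, using a few label strings copied from the outputs above (the `labels` list and the `count_roles` helper are illustrative, not part of the notebook):

```python
from collections import Counter

# Illustrative sample of Category strings taken from the outputs above.
labels = [
    "Database_Administrator",
    "Web_Developer,Software_Developer",
    "Software_Developer,Database_Administrator",
    "Software_Developer",
]

def count_roles(label_strings):
    """Count how often each individual role appears across multi-role labels."""
    counts = Counter()
    for label in label_strings:
        counts.update(role.strip() for role in label.split(","))
    return counts

role_counts = count_roles(labels)
print(role_counts.most_common())
```

On the full dataframe, the same idea could be expressed as `df1['Category'].str.split(',').explode().value_counts()`.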

In [10]:
temp_df = df1.copy()  # independent copy; plain assignment would only alias df1

# <b>Data Processing</b> 

In [11]:
# Preprocessing libraries
import re
from sklearn.preprocessing import LabelEncoder
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer 
import string
from sklearn.feature_extraction.text import TfidfVectorizer

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\harsh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\harsh\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Cleaning Data:
1. Removing URLs, hashtags, mentions, special characters, and punctuation

2. Tokenizing the cleaned text

3. Removing stop words

4. Performing lemmatization on the final text

- The resumeKeywords function removes URLs, hashtags, mentions, special characters, non-ASCII characters, multiple spaces, and stop words from the input text while also performing tokenization, lowercasing, and lemmatization to provide cleaned and processed text as output.

- <b>Tokenization:</b> Tokenization is the process of breaking a text into individual words or tokens. In this step, the text is split into its constituent words, which makes it easier to analyze and process. For example, the sentence "I love coding" would be tokenized into three tokens: "I," "love," and "coding."

- <b>Stop words:</b> Stop words are common words that are typically removed from text during natural language processing to reduce noise in the data. Examples of common stop words in English include "the," "and," "in," "is," "of," "it," "to," and many others. Removing them reduces the dimensionality of the data and focuses the analysis on more meaningful words.

- <b>Lemmatization:</b> Lemmatization is the process of reducing words to their dictionary base form (lemma), so that different inflections of a word are treated as the same term. For example, "running" and "ran" are both lemmatized to "run" when they are tagged as verbs. Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a part-of-speech tag is passed, so the function below mainly normalizes plurals (for example "years" becomes "year").
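The tokenization and stop-word steps described above can be sketched with the standard library alone. This is only an illustration: the `STOP_WORDS` set is a tiny hand-picked list (the notebook uses NLTK's full English list), and the helper names are made up for this example.

```python
import re

# Tiny hand-picked stop-word list for illustration only; the notebook uses
# NLTK's much larger English list.
STOP_WORDS = {"i", "the", "and", "in", "is", "of", "it", "to"}

def tokenize(text):
    """Split text into lowercase word tokens (runs of letters, digits, underscores)."""
    return [token.lower() for token in re.findall(r"\w+", text)]

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in STOP_WORDS]

tokens = tokenize("I love coding in Python")
filtered = remove_stop_words(tokens)
print(tokens)
print(filtered)
```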

In [12]:

def resumeKeywords(txt):
    """Clean a resume string and return its lowercase, lemmatized keywords joined by spaces."""
    # Raw strings keep the regex escapes (\S, \s, \w) from being read as string escapes
    cleanText = re.sub(r'http\S+\s', ' ', txt) # Removing URLs
    cleanText = re.sub(r'#\S+\s', ' ', cleanText) # Removing hashtags
    cleanText = re.sub(r'@\S+', '  ', cleanText)  # Removing mentions
    cleanText = re.sub('[%s]' % re.escape(string.punctuation), ' ', cleanText) # Removing punctuation
    cleanText = re.sub(r'[^\x00-\x7f]', ' ', cleanText) # Removing non-ASCII characters
    cleanText = re.sub(r'\s+', ' ', cleanText) # Replace multiple spaces with a single space
    cleanText = cleanText.strip() # Removing leading and trailing whitespace

    #------------Tokenizing Cleaned Text--------------------------------------------------------
    tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(cleanText)
    # Lowercase every token
    words = [word.lower() for word in tokens]
    #--------------------------------------------------------------------------------------------

    #-------------Removing Stop Words------------------------------------------------------------
    # A set gives fast membership tests; the name avoids shadowing the imported stopwords module
    stop_words = set(nltk.corpus.stopwords.words('english'))
    words_new = [word for word in words if word not in stop_words]
    #--------------------------------------------------------------------------------------------
    #-----------Performing Lemmatization---------------------------------------------------------
    wn = WordNetLemmatizer()
    lemm_text = [wn.lemmatize(word) for word in words_new]
    #--------------------------------------------------------------------------------------------
    #----------Converting List into String-------------------------------------------------------
    processed_text = ' '.join(lemm_text)

    return processed_text

- Testing the custom function above on a sample resume snippet

In [13]:
resumeKeywords(" https://www.github.com Software Engineer with 2 years of experience in Data Structures and Algorithms Agile Scrum, SDLC, C++, Java MVC, JavaScript, Web Development, Python " +
               "Data Science, Machine Learning / AI, and Mainframe technologies Programming Languages: 	C/C++, Java, Python, SQL, JCL, Cobol, DB2" + 
               "Frameworks: %#####	Java Spring, Spring boot, React, Angular, NodeJs" +
               "Tools: 	GIT, Visual Studio Code, Sublime, Spyder, Jupyter Notebook, Bluezone, Netbeans, Jira, Confluence, Kanban, CI/CD (Jenkins, GitLab, Azure-Devops), AWS," + 
               "Data-Bricks Libraries: 	NumPy, Pandas, Matplotlib, nltk, Scikit learn, TensorFlow, Keras Other: 	Problem-Solving, Quick Learner, Time-Management")

'software engineer 2 year experience data structure algorithm agile scrum sdlc c java mvc javascript web development python data science machine learning ai mainframe technology programming language c c java python sql jcl cobol db2frameworks java spring spring boot react angular nodejstools git visual studio code sublime spyder jupyter notebook bluezone netbeans jira confluence kanban ci cd jenkins gitlab azure devops aws data brick library numpy panda matplotlib nltk scikit learn tensorflow kera problem solving quick learner time management'
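The fused tokens `db2frameworks` and `nodejstools` in the output above come from the test input rather than from the cleaning function: the string literals are joined with `+`, and some of them have no space at the boundary. A small sketch reproducing the effect with plain string operations (the literals are shortened from the test input, and `words` is a hypothetical helper that only mimics the tokenization step):

```python
import re

# Two adjacent literals, shortened from the test input, joined without a space.
joined = "SQL, JCL, Cobol, DB2" + "Frameworks: Java Spring"
# The same literals with an explicit separating space.
spaced = "SQL, JCL, Cobol, DB2" + " " + "Frameworks: Java Spring"

def words(text):
    """Lowercase word tokens, mirroring the tokenization step of the notebook."""
    return [w.lower() for w in re.findall(r"\w+", text)]

joined_words = words(joined)
spaced_words = words(spaced)
print(joined_words)
print(spaced_words)
```

Ending each literal with a space (or joining the parts with `' '.join(...)`) avoids the fused tokens.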

#### Applying the custom function above to the data and storing the result in a new column "Processed_Resume"

In [14]:
df1['Processed_Resume'] = df1['Resume'].apply(resumeKeywords)
df1.head()

Unnamed: 0,Resume,Category,Processed_Resume
0,Database Administrator Database Administrator ...,Database_Administrator,database administrator database administrator ...
1,Database Administrator Database Administrator ...,Database_Administrator,database administrator database administrator ...
2,Oracle Database Administrator Oracle Database ...,Database_Administrator,oracle database administrator oracle database ...
3,Amazon Redshift Administrator and ETL Develope...,Database_Administrator,amazon redshift administrator etl developer bu...
4,Scrum Master Scrum Master Scrum Master Richmon...,Database_Administrator,scrum master scrum master scrum master richmon...


#### Generating WordCloud from the cleaned text

In [15]:
!pip install wordcloud
from wordcloud import WordCloud



In [16]:
import plotly.express as px
import plotly.graph_objects as go
# Join the cleaned text into a single string
text = ' '.join(df1['Processed_Resume'])

# Create a word cloud
wordcloud = WordCloud(background_color='white',
                      width=1000,
                      height=800,
                      max_words=500,
                      colormap='viridis'
                      ).generate(text)

# Convert word cloud to an image
wordcloud_image = wordcloud.to_image()

# Display the word cloud using Plotly as an image
fig = px.imshow(wordcloud_image)
fig.update_layout(
    title='Word Cloud of Cleaned Text',
    xaxis_showticklabels=False,
    yaxis_showticklabels=False,
    plot_bgcolor='white'
)
fig.show()

#### Encoding the Category column and plotting it

In [17]:
# Label encoding our Category
label = LabelEncoder()
df1['Encoded_Category'] = label.fit_transform(df1['Category'])
df1.head()

Unnamed: 0,Resume,Category,Processed_Resume,Encoded_Category
0,Database Administrator Database Administrator ...,Database_Administrator,database administrator database administrator ...,0
1,Database Administrator Database Administrator ...,Database_Administrator,database administrator database administrator ...,0
2,Oracle Database Administrator Oracle Database ...,Database_Administrator,oracle database administrator oracle database ...,0
3,Amazon Redshift Administrator and ETL Develope...,Database_Administrator,amazon redshift administrator etl developer bu...,0
4,Scrum Master Scrum Master Scrum Master Richmon...,Database_Administrator,scrum master scrum master scrum master richmon...,0


In [18]:
df1['Category'].sample(15)

422              Software_Developer,Database_Administrator
26930                     Web_Developer,Software_Developer
29679    Web_Developer,Software_Developer,Front_End_Dev...
1469                Database_Administrator,Project_manager
3309     Web_Developer,Software_Developer,Front_End_Dev...
29507                     Web_Developer,Software_Developer
15489                                Network_Administrator
11949                                     Security_Analyst
3651     Software_Developer,Front_End_Developer,Web_Dev...
22274                                   Software_Developer
11962                                     Security_Analyst
15714    Project_manager,Systems_Administrator,Network_...
9963                                      Security_Analyst
15106                                Network_Administrator
26411                                Systems_Administrator
Name: Category, dtype: object

In [19]:
# Create the mapping
category_mapping = df1.groupby('Encoded_Category')['Category'].first().to_dict()

# Print the created mapping
print("Created Category Mapping:")
print(category_mapping)

Created Category Mapping:
{0: 'Database_Administrator', 1: 'Database_Administrator,Java_Developer,Python_Developer,Software_Developer', 2: 'Database_Administrator,Java_Developer,Software_Developer', 3: 'Database_Administrator,Java_Developer,Software_Developer,Project_manager', 4: 'Database_Administrator,Java_Developer,Software_Developer,Project_manager,Web_Developer,Python_Developer', 5: 'Database_Administrator,Java_Developer,Software_Developer,Systems_Administrator,Web_Developer', 6: 'Database_Administrator,Java_Developer,Software_Developer,Web_Developer', 7: 'Database_Administrator,Java_Developer,Web_Developer,Software_Developer', 8: 'Database_Administrator,Network_Administrator', 9: 'Database_Administrator,Network_Administrator,Project_manager', 10: 'Database_Administrator,Network_Administrator,Software_Developer', 11: 'Database_Administrator,Network_Administrator,Software_Developer,Systems_Administrator,Security_Analyst', 12: 'Database_Administrator,Network_Administrator,Systems_Ad
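`LabelEncoder` assigns integer IDs in sorted order of the distinct labels, which is why `Database_Administrator` maps to 0 above. A pure-Python sketch of the same idea follows; the helper name `fit_label_mapping` and the sample labels are made up for illustration, and with scikit-learn the equivalent ordering is available from `label.classes_`.

```python
def fit_label_mapping(labels):
    """Return (label -> id, id -> label) dicts with ids assigned in sorted label order."""
    classes = sorted(set(labels))
    to_id = {name: idx for idx, name in enumerate(classes)}
    to_name = dict(enumerate(classes))
    return to_id, to_name

# Illustrative labels; the real column has many more combinations.
sample = [
    "Software_Developer",
    "Database_Administrator",
    "Database_Administrator,Network_Administrator",
    "Software_Developer",
]
to_id, to_name = fit_label_mapping(sample)
print(to_id)
print(to_name[0])
```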

### <b>Vectorization (using TfidfVectorizer)</b>

In [25]:
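from sklearn.feature_extraction.text import TfidfVectorizer
# max_features caps the vocabulary at the 29033 most frequent terms
tfidf = TfidfVectorizer(stop_words='english',max_features=29033)

# Note: the vectorizer is fitted on all resumes, i.e. before the train/test split below
tfidf.fit(df1['Processed_Resume'])
requiredText  = tfidf.transform(df1['Processed_Resume'])
print(requiredText)  # sparse matrix: (row, column) followed by the TF-IDF weight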

  (0, 28875)	0.008492481678927806
  (0, 28566)	0.010959879477079213
  (0, 28552)	0.042823865421950914
  (0, 28406)	0.029121651665376534
  (0, 27628)	0.04673440083594293
  (0, 27624)	0.016610927102728663
  (0, 27552)	0.02851590882204456
  (0, 27479)	0.01371202010586949
  (0, 27458)	0.07438488582750953
  (0, 27425)	0.014855684625616134
  (0, 27331)	0.031342657322105553
  (0, 27330)	0.019808407797841168
  (0, 27324)	0.03445715581602325
  (0, 27298)	0.019571895985584887
  (0, 27281)	0.022675644904771693
  (0, 27194)	0.02463186992150839
  (0, 27177)	0.03928771720975392
  (0, 27148)	0.05812821891275683
  (0, 27133)	0.025192591868946644
  (0, 27127)	0.01079288431716829
  (0, 27124)	0.011581061647108348
  (0, 27042)	0.05796497556242737
  (0, 26951)	0.009209959865093906
  (0, 26556)	0.07290209869277707
  (0, 26552)	0.029322079394940703
  :	:
  (29032, 2749)	0.028799858642959017
  (29032, 2644)	0.07828316211918149
  (29032, 2555)	0.03949935471169789
  (29032, 2492)	0.07177863573455685
  (29032, 
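Each printed entry `(row, column)  value` is the weight of one vocabulary term in one resume. A rough pure-Python sketch of how such weights arise, assuming the scikit-learn defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalization per document). Tokenization is simplified to whitespace splitting here, and `tfidf_rows` is a hypothetical helper, not the library's implementation.

```python
import math
from collections import Counter

def tfidf_rows(documents):
    """Compute L2-normalized TF-IDF weights per document as {term: weight} dicts."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        raw = {t: count * idf[t] for t, count in Counter(tokens).items()}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

docs = ["database administrator oracle", "software developer java", "database developer"]
rows = tfidf_rows(docs)
print(rows[0])
```

Terms that occur in fewer documents (such as `oracle` here) receive a larger weight than terms shared by several documents (such as `database`).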

# <b>Model Creation</b>

### <b>Splitting into Train and Test (using vectorized text & encoded category)</b>

In [26]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(requiredText, df1['Encoded_Category'], test_size=0.35, random_state=42)
print(X_train.shape)

print(X_test.shape)

(18871, 29033)
(10162, 29033)
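The row counts above follow from `test_size=0.35`. A small arithmetic check, assuming `train_test_split` rounds the test share up and gives the remainder to training (which is consistent with the printed shapes):

```python
import math

n_samples = 29033   # rows left after dropna(), from df1.shape above
test_size = 0.35

n_test = math.ceil(test_size * n_samples)   # test share is rounded up
n_train = n_samples - n_test                # remainder goes to training
print(n_train, n_test)
```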


### <b>Training the baseline models and printing their accuracy</b>
- KNeighbors Classifier
- Multinomial NB

In [27]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

mnb = MultinomialNB()
mnb.fit(X_train,y_train)
y_pred1 = mnb.predict(X_test)
print('Accuracy: ', accuracy_score(y_test,y_pred1)*100, ' % \n')
print(classification_report(y_test, y_pred1))

Accuracy:  42.20625861050974  % 

              precision    recall  f1-score   support

           0       0.75      0.69      0.72       785
           5       0.00      0.00      0.00         1
           6       0.00      0.00      0.00         2
           7       0.00      0.00      0.00         2
           8       0.00      0.00      0.00         9
           9       0.00      0.00      0.00         3
          12       0.00      0.00      0.00         1
          13       0.00      0.00      0.00         1
          14       0.00      0.00      0.00        14
          15       0.00      0.00      0.00         1
          18       0.00      0.00      0.00         1
          22       0.00      0.00      0.00         2
          23       0.00      0.00      0.00         1
          24       0.00      0.00      0.00         5
          26       0.00      0.00      0.00        39
          27       0.00      0.00      0.00         2
          29       0.00      0.00      0.00    


Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
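Many of the encoded classes are role combinations with only a handful of samples, and some differ only in the order of their roles (IDs 6 and 7 in the mapping above contain the same four roles). Exact-match accuracy counts any such mismatch as a full error. As an alternative way to look at the predictions (not the method used in this notebook), labels can be compared as order-independent role sets; the helper names below are hypothetical.

```python
def to_role_set(label):
    """Turn 'A,B' into an order-independent set of roles."""
    return frozenset(role.strip() for role in label.split(",") if role.strip())

def jaccard(true_label, predicted_label):
    """Overlap between two role sets: |intersection| / |union| (1.0 = identical sets)."""
    true_roles, pred_roles = to_role_set(true_label), to_role_set(predicted_label)
    if not true_roles and not pred_roles:
        return 1.0
    return len(true_roles & pred_roles) / len(true_roles | pred_roles)

# Categories 6 and 7 from the mapping above: same roles, different order.
a = "Database_Administrator,Java_Developer,Software_Developer,Web_Developer"
b = "Database_Administrator,Java_Developer,Web_Developer,Software_Developer"
same_roles = to_role_set(a) == to_role_set(b)
partial = jaccard("Web_Developer,Software_Developer", "Software_Developer")
print(same_roles)
print(partial)
```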





In [121]:
knc = OneVsRestClassifier(KNeighborsClassifier())
knc.fit(X_train,y_train)
y_pred2 = knc.predict(X_test)
print('Accuracy: ', accuracy_score(y_test,y_pred2)*100, ' %')
print(classification_report(y_test, y_pred2))

Accuracy:  10.400881664141066  %
              precision    recall  f1-score   support

           0       0.25      0.03      0.05       551
           6       0.00      0.00      0.00         2
           8       0.00      0.00      0.00         5
           9       0.00      0.00      0.00         3
          13       0.00      0.00      0.00         1
          14       0.00      0.00      0.00        14
          15       0.00      0.00      0.00         1
          18       0.00      0.00      0.00         1
          22       0.00      0.00      0.00         1
          23       0.00      0.00      0.00         1
          24       0.00      0.00      0.00         4
          26       0.00      0.00      0.00        26
          27       0.00      0.00      0.00         2
          29       0.00      0.00      0.00         1
          33       0.00      0.00      0.00         3
          37       0.00      0.00      0.00         1
          38       0.00      0.00      0.00     


Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.


Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.





### <b>Training the advanced model and printing its accuracy</b>
- RNN

In [148]:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, SpatialDropout1D
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Vocabulary size: only the most frequent words (token ids below max_words) are kept
max_words = 100
# Every resume is padded or truncated to this many token ids
max_len = 100

# Tokenize and pad sequences
tokenizer = Tokenizer(num_words=max_words, split=' ')
tokenizer.fit_on_texts(df1['Processed_Resume'])
X = tokenizer.texts_to_sequences(df1['Processed_Resume'])
X = pad_sequences(X, maxlen=max_len)

# Split data into train and test sets (this overwrites the TF-IDF split used by the baseline models)
X_train, X_test, y_train, y_test = train_test_split(X, df1['Encoded_Category'], test_size=0.25, random_state=42)

# Build the RNN (LSTM) model
model = Sequential()
model.add(Embedding(input_dim=max_words, output_dim=128, input_length=max_len))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(100))
model.add(Dense(32, activation='relu'))
model.add(Dense(len(label.classes_), activation='softmax'))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model; the test split doubles as validation data here
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=150, batch_size=64)

# Evaluate the RNN model: take the most probable class for each test resume
y_pred_rnn_prob = model.predict(X_test)
y_pred3 = y_pred_rnn_prob.argmax(axis=-1)
accuracy_rnn = accuracy_score(y_test, y_pred3)
print('RNN Model Accuracy: {:.2f}%'.format(accuracy_rnn * 100))
print(classification_report(y_test, y_pred3))
# Accuracy is approx 61%


Epoch 1/150
...
Epoch 69/150

ResourceExhaustedError: Graph execution error:

Detected at node gradients/transpose_grad/transpose defined at (most recent call last):
<stack traces unavailable>
OOM when allocating tensor with shape[64,100,128] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
	 [[{{node gradients/transpose_grad/transpose}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

	 [[PartitionedCall]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
 [Op:__inference_train_function_442200]
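In the RNN cell above, `num_words=100` means the Keras `Tokenizer` keeps only the most frequent words (token ids below 100) and silently drops the rest, and `pad_sequences(..., maxlen=100)` keeps, by default, the last 100 token ids of each resume and left-pads shorter ones with zeros. The pure-Python sketch below imitates that behaviour on toy data; the helper names are invented for illustration and this is not the Keras implementation.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1 (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_padded_sequence(text, word_index, num_words, maxlen):
    """Drop words ranked at or beyond num_words, keep the last maxlen ids, left-pad with 0."""
    ids = [word_index[w] for w in text.split()
           if w in word_index and word_index[w] < num_words]
    ids = ids[-maxlen:]
    return [0] * (maxlen - len(ids)) + ids

texts = ["java java python sql", "java python oracle"]
index = build_word_index(texts)
sequence = to_padded_sequence("java oracle python", index, num_words=3, maxlen=5)
print(sequence)
```

With such a small vocabulary most resume-specific terms are discarded before the model sees them, which is worth keeping in mind when comparing the RNN with the TF-IDF baselines.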

# <b>Prediction & Recommendation System</b>

#### <b>Saving the created models</b>

In [130]:
import pickle

# Save the fitted vectorizer, tokenizer and models; 'with' closes each file after writing
for obj, path in [(tfidf, 'tfidf.pkl'), (mnb, 'mnb.pkl'),
                  (tokenizer, 'rnn_tokenizer.pkl'), (model, 'rnn.pkl')]:
    with open(path, 'wb') as f:
        pickle.dump(obj, f)


In [131]:
resume_1 = """I am a data scientist specializing in machine
learning, deep learning, and computer vision. With
a strong background in mathematics, statistics,
and programming, I am passionate about
uncovering hidden patterns and insights in data.
I have extensive experience in developing
predictive models, implementing deep learning
algorithms, and designing computer vision
systems. My technical skills include proficiency in
Python, Sklearn, TensorFlow, and PyTorch.
What sets me apart is my ability to effectively
communicate complex concepts to diverse
audiences. I excel in translating technical insights
into actionable recommendations that drive
informed decision-making.
If you're looking for a dedicated and versatile data
scientist to collaborate on impactful projects, I am
eager to contribute my expertise. Let's harness the
power of data together to unlock new possibilities
and shape a better future.
Contact & Sources
Email: 611noorsaeed@gmail.com
Phone: 03442826192
Github: https://github.com/611noorsaeed
Linkdin: https://www.linkedin.com/in/noor-saeed654a23263/
Blogs: https://medium.com/@611noorsaeed
Youtube: Artificial Intelligence
ABOUT ME
WORK EXPERIENCE
SKILLES
NOOR SAEED
LANGUAGES
English
Urdu
Hindi
I am a versatile data scientist with expertise in a wide
range of projects, including machine learning,
recommendation systems, deep learning, and computer
vision. Throughout my career, I have successfully
developed and deployed various machine learning models
to solve complex problems and drive data-driven
decision-making
Machine Learnine
Deep Learning
Computer Vision
Recommendation Systems
Data Visualization
Programming Languages (Python, SQL)
Data Preprocessing and Feature Engineering
Model Evaluation and Deployment
Statistical Analysis
Communication and Collaboration
"""

In [149]:
resume_1 = """
John A. Smith
(555) 555-5555
john.smith@example.com
1 Main Street, Seattle, WA 98133

Profile
A Senior Software Engineer with six years of professional experience, specializing in fullstack development, MySQL, Oracle, and Python. A proven track record of managing large scale software engineering projects to support cloud deployments and integrations.

Professional Experience
Senior Software Engineer, Microsoft, Los Angeles, CA
August 2019-Current

Manage a software engineering team of 15+ personnel to build innovative web applications using Agile-Waterfall methodologies, oversee all aspects of full-stack development, and identify opportunities to enhance the user experience
Identify creative solutions and workflow optimizations to improve deployment timelines and reduce project roadblocks during development lifecycles
Serve as the Microsoft Azure SME for the software engineering department and resolve escalated software issues from junior team members
Software Engineer, Uber, Los Angeles, CA
June 2017-August 2019

Coordinated with a team of 30+ software engineers to re-engineer the system into a PHP-based 3-tier application for a global rideshare company with 300M users
Supported projects to improve geo transit data application tools for drivers and users, which contributed to a 20% increase in user satisfaction
Certifications
MCPS: Microsoft Certified Professional
LPIC-3 Senior Level Linux Certification
Code Camp Trainer
Oracle Certified Professional – Java SE Programmer
Microsoft Certified Solutions Developer
Google Certified Professional Cloud Architect
Key Skills
Application Development
Java/Python/C++/Ruby/Perl/PHP/React/Angular
Full-stack developer
MySQL/Oracle/RedHat/AIX
Analysis and visualization of data structures
Education
Master of Business Administration, Information Systems
California State University, Long Beach, CA, September 2018 – July 2020

Bachelor of Computer Science, Software Engineering Major
University of California, Los Angeles, CA, August 2015 – July 2018, 4.0 GPA


"""

#### <b> Predicting Resume-1 by KNeighborsClassifier model</b>

In [150]:
"""
import pickle

# Load the trained KNearest classifier model
clf = pickle.load(open('knc.pkl', 'rb'))

# Clean the input resume
cleaned_resume = resumeKeywords(resume_1)

# Transform the cleaned resume using the trained TfidfVectorizer
input_features = tfidf.transform([cleaned_resume])

# Make the prediction using the loaded classifier
prediction_id = clf.predict(input_features)[0]

category_name = category_mapping.get(prediction_id, "Unknown")

print("Predicted Category:", category_name)
print(prediction_id)
"""



#### <b> Predicting Resume-1 by Multinomial Naive Bayes model</b>

In [133]:
import pickle

# Load the trained  Multinomial Naive Bayes model
with open('mnb.pkl', 'rb') as f:
    mnb = pickle.load(f)

# Clean the input resume
cleaned_resume = resumeKeywords(resume_1)

# Transform the cleaned resume using the trained TfidfVectorizer
input_features = tfidf.transform([cleaned_resume])

# Make the prediction using the loaded classifier
prediction_id = mnb.predict(input_features)[0]

category_name = category_mapping.get(prediction_id, "Unknown")

print("Predicted Category:", category_name)
print(prediction_id)

Predicted Category: Python_Developer,Software_Developer
205


#### <b> Predicting Resume-1 by RNN model</b>

In [39]:
import pickle
from keras.preprocessing.sequence import pad_sequences

# Load the trained RNN model and the tokenizer that was fitted with it
with open('rnn.pkl', 'rb') as f:
    rnn = pickle.load(f)
with open('rnn_tokenizer.pkl', 'rb') as f:
    rnn_tokenizer = pickle.load(f)

cleaned_resume = resumeKeywords(resume_1)  # Clean the input resume

# Tokenize the cleaned input resume (the vocabulary limit is stored inside the loaded tokenizer)
input_sequence = rnn_tokenizer.texts_to_sequences([cleaned_resume])
# Pad to the same length that was used during training
input_sequence = pad_sequences(input_sequence, maxlen=100)

# Make the prediction using the loaded RNN model
predicted_probabilities = rnn.predict(input_sequence)

prediction_id = predicted_probabilities.argmax(axis=-1)
# Map the category ID to the category name using category_mapping
category_name = category_mapping.get(prediction_id[0], "Unknown")

print("Predicted Category:", category_name)
print("Predicted Category ID:", prediction_id[0])


Predicted Category: Python_Developer,Software_Developer
Predicted Category ID: 205
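`argmax` keeps only the single most probable class. Since the project also aims to recommend jobs, one option is to rank several classes by predicted probability. The sketch below uses toy values in place of one row of `rnn.predict(...)` and of `category_mapping`; `top_k_categories` is a hypothetical helper, not part of the notebook.

```python
def top_k_categories(probabilities, id_to_name, k=3):
    """Return the k most probable (category name, probability) pairs, best first."""
    ranked = sorted(enumerate(probabilities), key=lambda pair: pair[1], reverse=True)
    return [(id_to_name.get(idx, "Unknown"), prob) for idx, prob in ranked[:k]]

# Toy values standing in for one row of predicted probabilities and for category_mapping.
probs = [0.05, 0.60, 0.10, 0.25]
names = {0: "Database_Administrator", 1: "Software_Developer",
         2: "Security_Analyst", 3: "Web_Developer"}
recommendations = top_k_categories(probs, names, k=2)
print(recommendations)
```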


In [41]:
job_description_1="""
About the job
Join a top employer and advance your career. Aplin has partnered with an Edmonton-based company to hire a Data Scientist. 

In this exciting role, you will serve as the catalyst for data-driven decision-making, guiding our client toward unparalleled success in lead conversion, customer loyalty, predictive innovation, and inventory optimization. 

Responsibilities:
Dive deep into datasets to unveil the untapped potential in enhancing lead conversion rates.
Develop cutting-edge algorithms that predict customer preferences.
Pioneer loyalty programs that exert a magnetic influence, guaranteeing a continuous stream of returning customers.
Optimize inventory with precision, ensuring the right products are available at the right time.
Qualifications:
Bachelor's degree in Computer Science, Statistics, Applied Math, or related fields; a Master's or PhD will give your credentials an extra boost.
Proficiency in SQL, Python, R, or any other data manipulation language.
Hands-on experience in machine learning, predictive analytics, and various statistical modeling techniques.
Exceptional attention to detail, capable of identifying outliers in datasets with precision.
A passion for storytelling through data, recognizing the importance of context in making impactful decisions.
Outstanding collaboration skills to navigate seamlessly within teams, ensuring smooth progress on your journey.

"""

In [151]:
job_description_1="""
My technical skills include proficiency in
Python, Sklearn, TensorFlow, and PyTorch.
What sets me apart is my ability to effectively
communicate complex concepts to diverse
audiences. I excel in translating technical insights
into actionable recommendations that drive
informed decision-making.

ABOUT ME
WORK EXPERIENCE
SKILLES
NOOR SAEED
LANGUAGES
English
Urdu
Hindi
I am a versatile data scientist with expertise in a wide
range of projects, including machine learning,
recommendation systems, deep learning, and computer
vision. Throughout my career, I have successfully
developed and deployed various machine learning models
to solve complex problems and drive data-driven
decision-making
Machine Learnine
Deep Learning
Computer Vision
Recommendation Systems
Data Visualization
Programming Languages (Python, SQL)
Data Preprocessing and Feature Engineering
Model Evaluation and Deployment
Statistical Analysis
Communication and Collaboration
"""

In [153]:
job_description_1 = """
LMI is seeking a Software Developer or computer science graduate with 7+ years of proven experience in computer vision who has the desire and skill set to design machine vision sensors. You will work in a multi-disciplinary, multi-platform, engineering team (software, electrical, mechanical/optical) that develops new sensor products and supporting infrastructure (manufacturing and test equipment). The ideal candidate will have a passion for leading-edge technology, extensive experience developing production-ready software, strong critical thinking and problem-solving skills, and can work well autonomously yet still communicate effectively with a close-knit group of about 10 engineers. Previous leadership or project management experience is an asset as there is an opportunity to lead a team in this role.

This Senior Software Developer will work in the R&D team and report to the Software Engineering Manager.

Design and develop 3D acquisition algorithms for our sensors to produce 3D data from images
Develop components of our calibration and acquisition pipeline
Characterize and validate prototype sensor performance and integrate final designs with customers
Investigate solutions for challenging acquisition problems. Investigate improvements to our algorithms to enhance the performance of our sensors
Design and develop manufacturing software tools required to build the sensors and control key component performance (e.g. software tasks for focusing and aligning cameras/lasers/projectors, quantifying and adjusting sensor sensitivity, etc.)
Lead technical investigations and produce reports and documentation for senior management
Demonstrate leadership and ownership. Drive projects to completion, participate in frequent peer design and code reviews, and use your expertise to oversee and mentor others in the team
Proactively contribute to and implement continuous improvement initiatives
What do you need to be successful?

Degree / Diploma in Computer Science, Electrical/Computer Engineering or equivalent
5+ years work experience in a disciplined software development environment producing deliverable code
Solid knowledge of C/C++ and C# programming using Microsoft Visual Studio
Expertise in 3D metrology or computer vision (object detection, image restoration, scene reconstruction, signal processing, etc., but excluding machine learning) is required
Experience independently planning and completing complex projects/deliverables in a reliable time frame
Proficient with commonly used scripting languages like Python
Excellent understanding of object-oriented programming
Excellent understanding of commonly used data structures and algorithms (lists, trees, sorting, binning, etc.)
Excellent understanding of math and statistics
Excellent written and verbal communication
Solid understanding of memory management, threading/synchronization, networking
Previous scrum master experience or experience overseeing a small team is an asset
Experience developing for a manufacturing automation environment is an asset
Salary Range: $106,000 - $132,500 - $151,000

Expected Salary: Our typical hiring range will be +/- 10% of the midpoint listed above. Factors influencing this decision include qualifications and market conditions for the role.
"""

# Keyword extraction and similarity score for the provided Job Description and Resume

In [155]:
from sklearn.metrics.pairwise import cosine_similarity
cleaned_resume = resumeKeywords(resume_1)
cleaned_job_description = resumeKeywords(job_description_1)
vectors_1  = tfidf.transform([cleaned_resume])
vectors_2 = tfidf.transform([cleaned_job_description])
similarity_score = cosine_similarity(vectors_1,vectors_2)
print("Similarity Score:", similarity_score)


Similarity Score: [[0.09759586]]
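
A low score like the one above generally means the two texts share little weighted vocabulary. As a rough illustration of how the score reacts to word overlap, here is a minimal sketch that uses raw word counts instead of the notebook's fitted tfidf vectorizer. The helper `cosine_from_counts` and the toy texts are assumptions made up for this example.

```python
import math
from collections import Counter

def cosine_from_counts(text_a, text_b):
    """Cosine similarity between the raw word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)

# Made-up texts for illustration only.
toy_resume = "python sql machine learning"
toy_job_close = "python machine learning engineer"
toy_job_far = "c++ computer vision sensors"

close_score = cosine_from_counts(toy_resume, toy_job_close)
far_score = cosine_from_counts(toy_resume, toy_job_far)
print(close_score, far_score)
```

Three of the four words overlap in the first pair and none in the second, so the first score is much higher. TF-IDF additionally down-weights words that are common across the corpus.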


# <b>Building Job Category Recommendation System for this dataset</b>

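- Fitting a TF-IDF vectorizer (limited to 500 features) on the first 5000 processed resumes, then inspecting the resulting vectors and their shape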

In [97]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_1 = TfidfVectorizer(stop_words='english',max_features=500)

tfidf_1.fit(df1['Processed_Resume'][0:5000])
requiredText  = tfidf_1.transform(df1['Processed_Resume'][0:5000])
print(requiredText)

  (0, 498)	0.010773127752308145
  (0, 494)	0.014234209928247962
  (0, 492)	0.055329790174256765
  (0, 490)	0.0409807231433724
  (0, 484)	0.03363044368209837
  (0, 483)	0.017487792771952695
  (0, 482)	0.022139673369651686
  (0, 481)	0.04037912812370037
  (0, 474)	0.0711276063454855
  (0, 473)	0.03270432159567591
  (0, 472)	0.01384249645508328
  (0, 471)	0.015057734151000694
  (0, 470)	0.06685324629538808
  (0, 467)	0.011389784481159864
  (0, 462)	0.06702730593407363
  (0, 461)	0.03256345102558259
  (0, 454)	0.012481706190787135
  (0, 452)	0.02774544922477802
  (0, 451)	0.027584743754645386
  (0, 450)	0.0562538237067085
  (0, 447)	0.011873663791710475
  (0, 446)	0.05118585576836502
  (0, 445)	0.030884587632048016
  (0, 444)	0.031858121134203304
  (0, 441)	0.04549040682945522
  :	:
  (4999, 131)	0.05448517586981033
  (4999, 125)	0.17259855017737538
  (4999, 123)	0.16948100154173215
  (4999, 121)	0.054423954532528955
  (4999, 116)	0.038714404389223356
  (4999, 99)	0.04121682315203566
  (49

In [98]:
vectors = requiredText.toarray()
print(vectors.shape)
print(vectors[0])

(5000, 500)
[0.         0.         0.         0.         0.         0.
 0.0207141  0.1320715  0.         0.         0.10704297 0.
 0.         0.         0.07958486 0.01418155 0.06415366 0.
 0.         0.02748513 0.         0.         0.         0.03536323
 0.         0.         0.         0.         0.         0.01250148
 0.         0.05818717 0.06924674 0.         0.         0.
 0.         0.         0.0213643  0.02935969 0.01864874 0.
 0.         0.         0.         0.         0.         0.02117785
 0.         0.01562921 0.         0.         0.         0.
 0.         0.02162935 0.         0.         0.         0.
 0.         0.         0.         0.11119527 0.         0.
 0.         0.013595   0.24370672 0.         0.         0.
 0.01807841 0.         0.         0.         0.         0.01843001
 0.         0.02340617 0.         0.         0.06317263 0.
 0.         0.         0.06527333 0.0584796  0.         0.
 0.         0.02128648 0.         0.02096614 0.         0.0609107
 0.  

In [99]:
vectors

array([[0.        , 0.        , 0.        , ..., 0.        , 0.01077313,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.03999819,
        0.        ],
       [0.        , 0.21841025, 0.        , ..., 0.        , 0.08235419,
        0.        ],
       ...,
       [0.14427286, 0.        , 0.        , ..., 0.03851859, 0.11795584,
        0.0899475 ],
       [0.03637417, 0.        , 0.        , ..., 0.13595865, 0.02973911,
        0.03401644],
       [0.        , 0.        , 0.        , ..., 0.17342256, 0.02950409,
        0.        ]])

- Gives the total number of terms in the fitted vocabulary (capped at 500 by max_features)

In [101]:
len(tfidf_1.get_feature_names_out())

500

- Gives the list of terms in the vocabulary. Note that get_feature_names_out() returns them in alphabetical order, not from most frequent to least frequent

In [102]:
print(tfidf_1.get_feature_names_out())

['10' '10g' '11' '11g' '12c' '2000' '2003' '2005' '2006' '2007' '2008'
 '2009' '2010' '2011' '2012' '2013' '2014' '2015' '2016' '2017' '2018'
 '2019' '9i' 'ability' 'access' 'account' 'active' 'activity' 'ad'
 'additional' 'addm' 'administration' 'administrator' 'adobe' 'agent'
 'agile' 'aix' 'ajax' 'alert' 'analysis' 'analyst' 'analytics' 'angular'
 'angularjs' 'apache' 'api' 'app' 'application' 'applied' 'april'
 'architect' 'architecture' 'area' 'asm' 'asp' 'assist' 'assistant'
 'assisted' 'associate' 'august' 'authorized' 'automated' 'automation'
 'availability' 'awr' 'aws' 'azure' 'bachelor' 'backup' 'bank' 'based'
 'basic' 'best' 'bootstrap' 'browser' 'bug' 'build' 'building' 'built'
 'business' 'ca' 'campaign' 'capacity' 'case' 'center' 'certification'
 'change' 'check' 'class' 'client' 'cloning' 'closely' 'cloud' 'cluster'
 'cm' 'code' 'coding' 'college' 'com' 'communication' 'company' 'complete'
 'complex' 'compliance' 'component' 'computer' 'concept' 'configuration'
 'configu

- Looking up the vocabulary term stored at certain indexes

In [103]:
print(tfidf_1.get_feature_names_out()[40])
print(tfidf_1.get_feature_names_out()[20])
print(tfidf_1.get_feature_names_out()[11])
print(tfidf_1.get_feature_names_out()[22])
print(tfidf_1.get_feature_names_out()[33])
print(tfidf_1.get_feature_names_out()[44])

analyst
2018
2009
9i
adobe
apache
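
Because the feature names come back in alphabetical order, ranking terms by importance needs an extra step. One possible approach, sketched here on a made-up 3x4 weight matrix rather than the notebook's vectors, is to sum each column and sort by that total. The names `toy_feature_names` and `toy_weights` are hypothetical.

```python
import numpy as np

# Made-up vocabulary (alphabetical, as a vectorizer would return it) and weights.
toy_feature_names = np.array(["apache", "java", "oracle", "python"])
toy_weights = np.array([
    [0.0, 0.2, 0.9, 0.1],   # document 0
    [0.1, 0.0, 0.7, 0.3],   # document 1
    [0.0, 0.5, 0.0, 0.6],   # document 2
])

# Total weight of each term across all documents (one value per column).
total_weight = toy_weights.sum(axis=0)

# Negating the totals makes argsort return column positions in descending order.
ranked_terms = toy_feature_names[np.argsort(-total_weight)].tolist()
print(ranked_terms)
```

Applied to the notebook, the same idea would use the columns of the vectors array together with tfidf_1.get_feature_names_out().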


### Similarity Score (Cosine similarity method)
- To recommend similar job categories, we need to identify the vectors nearest to the vector of a particular job category.
- Two common ways to measure how near two vectors are: (1) Euclidean distance and (2) cosine similarity.
- Here, I use cosine similarity rather than Euclidean distance because Euclidean distance tends to be less informative for high-dimensional, sparse vectors such as TF-IDF.
- Similarity moves in the opposite direction to distance: the closer two vectors are, the higher their similarity. For TF-IDF vectors, which have no negative entries, the cosine similarity lies in the range [0, 1].
- A score of 1 means the similarity is high, and a score of 0 means the two vectors have no terms in common.
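
To make the difference between the two measures concrete, here is a small sketch on made-up three-dimensional vectors; the vectors and the helper `cosine` are illustrative assumptions, not part of the notebook. Cosine similarity only looks at the direction of the vectors, while Euclidean distance also reacts to their length.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

short_doc = np.array([1.0, 2.0, 0.0])
long_doc = 10 * short_doc               # same term proportions, ten times larger
other_doc = np.array([0.0, 0.0, 3.0])   # no terms in common with short_doc

same_direction = cosine(short_doc, long_doc)
no_overlap = cosine(short_doc, other_doc)
euclidean_gap = float(np.linalg.norm(short_doc - long_doc))

print(same_direction, no_overlap, euclidean_gap)
```

The first two documents point in the same direction, so their cosine similarity is (up to floating-point rounding) 1 even though the Euclidean distance between them is large.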

In [104]:
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(vectors)
# Builds a matrix whose entry [i][j] is the cosine similarity between vector i and vector j
print(similarity)

[[1.         0.55356146 0.36181173 ... 0.12277021 0.16406078 0.16211031]
 [0.55356146 1.         0.28550621 ... 0.14020167 0.17869191 0.20984865]
 [0.36181173 0.28550621 1.         ... 0.09355866 0.12097149 0.06100921]
 ...
 [0.12277021 0.14020167 0.09355866 ... 1.         0.41071336 0.44344625]
 [0.16406078 0.17869191 0.12097149 ... 0.41071336 1.         0.40473632]
 [0.16211031 0.20984865 0.06100921 ... 0.44344625 0.40473632 1.        ]]


In [142]:
print(similarity.shape)
# Every row is compared with every other row, which is why the shape of the similarity matrix is (5000, 5000)
print(similarity[0])
# Gives the similarity scores of the first row with every row. The first value is 1 because it is the
# similarity score of the first row with itself.
print(similarity[1])

(5000, 5000)
[1.         0.55356146 0.36181173 ... 0.12277021 0.16406078 0.16211031]


In [108]:
# Keep only the needed columns for the first 5000 rows; copy() makes df2 independent of df1
df2 = df1[['Category','Resume','Processed_Resume','Encoded_Category']][0:5000].copy()
df2.head()

Unnamed: 0,Category,Resume,Processed_Resume,Encoded_Category
0,Database_Administrator,Database Administrator Database Administrator ...,database administrator database administrator ...,0
1,Database_Administrator,Database Administrator Database Administrator ...,database administrator database administrator ...,0
2,Database_Administrator,Oracle Database Administrator Oracle Database ...,oracle database administrator oracle database ...,0
3,Database_Administrator,Amazon Redshift Administrator and ETL Develope...,amazon redshift administrator etl developer bu...,0
4,Database_Administrator,Scrum Master Scrum Master Scrum Master Richmon...,scrum master scrum master scrum master richmon...,0


In [109]:
df2['Encoded_Category'].value_counts()

0      2212
309     815
509     509
329     330
38      149
       ... 
25        1
39        1
307       1
250       1
316       1
Name: Encoded_Category, Length: 189, dtype: int64

### Creating a function that recommends the top 5 job categories based on the predicted job category

1) Given the predicted job category ID, find the index of the first row with that category in my dataset
2) Use that index to fetch the list of similarity scores between that row and every other row
3) Use the enumerate function to keep each score paired with its index while sorting the scores in descending order, then take the top 5 job categories

1) Getting the index of the job category in my dataset from the given predicted job category ID

In [143]:
print(df2[df2['Encoded_Category'] == 0].index[0])
print(df2[df2['Encoded_Category'] == 38].index[0])
print(df2[df2['Encoded_Category'] == 26].index[0])

0
88
17


2) Using the fetched index to get the list of similarity scores for that job

3) Since I need the top 5 similar job categories, the scores have to be sorted in descending order.
- However, sorting the scores alone loses their index positions. To solve this, I use the enumerate function, which pairs each value with its index.
- A lambda function is passed as the sort key so that the pairs are sorted by similarity score rather than by index.
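
The sorting step described above can be sketched on a short made-up score list. The second half shows `np.argsort` as one possible alternative that returns the positions directly; `toy_scores` is a hypothetical array, not a row of the real similarity matrix.

```python
import numpy as np

# Made-up similarity scores of one row against five rows (position 0 is the row itself).
toy_scores = np.array([1.0, 0.55, 0.36, 0.81, 0.62])

# enumerate pairs each score with its position, so sorting by score keeps the position.
pairs = sorted(enumerate(toy_scores), key=lambda pair: pair[1], reverse=True)
top_three_positions = [position for position, _ in pairs[:3]]

# np.argsort returns positions in ascending score order; negating gives descending order.
top_three_argsort = np.argsort(-toy_scores)[:3].tolist()

print(top_three_positions, top_three_argsort)
```

Both approaches give the same positions here; the enumerate version matches the notebook, while argsort avoids building a list of tuples for large rows.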


In [147]:
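# 2.------------------------------- 
# similarity is indexed by row position, so the rows for categories 38 and 26 are 88 and 17 (from step 1);
# similarity[38] and similarity[26] below are the rows at positions 38 and 26, not those categories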
print(similarity[0])
print(similarity[38]) 
print(similarity[26]) 
# 3.------------------------------- 
print("-------------------------------------------------------------------------------------------------------------------------------------")
print(list(enumerate(similarity[0]))) 
print(list(enumerate(similarity[38]))) 
print(list(enumerate(similarity[26]))) 
print("-------------------------------------------------------------------------------------------------------------------------------------")
print(sorted(list(enumerate(similarity[0])),reverse = True, key=lambda x:x[1])) 
print(sorted(list(enumerate(similarity[38])),reverse = True, key=lambda x:x[1])) 
print(sorted(list(enumerate(similarity[26])),reverse = True, key=lambda x:x[1])) 

[1.         0.55356146 0.36181173 ... 0.12277021 0.16406078 0.16211031]
[0.48251885 0.32406671 0.66838276 ... 0.07239426 0.16359409 0.10166563]
[0.60267236 0.46790577 0.6564012  ... 0.12149165 0.19345431 0.12962868]
-------------------------------------------------------------------------------------------------------------------------------------
[(0, 1.0000000000000002), (1, 0.5535614620715451), (2, 0.36181173095186436), (3, 0.506076792233686), (4, 0.35426820386290453), (5, 0.5000667537861863), (6, 0.5082744443926042), (7, 0.620850076480139), (8, 0.6459173171996209), (9, 0.5680283736695639), (10, 0.4982842588118908), (11, 0.5774084628484435), (12, 0.5283007289051933), (13, 0.5344640328239794), (14, 0.800997545862586), (15, 0.7941591457625234), (16, 0.760524580956016), (17, 0.811711448538113), (18, 0.4680224211204071), (19, 0.48186554931294634), (20, 0.4700146967418357), (21, 0.5947003057368505), (22, 0.3133393744212687), (23, 0.5511020285622374), (24, 0.5008716166607481), (25, 0.7245

- Final Custom function

In [116]:
def recommend(category):
    # Position of the first row with this category ID. A position (not an index label) is used
    # because the rows of similarity and df2.iloc are both positional
    index = np.flatnonzero(df2['Encoded_Category'].to_numpy() == category)[0]

    # Similarity scores of that row against every row, sorted in descending order
    distances = sorted(enumerate(similarity[index]), reverse=True, key=lambda x: x[1])
    # print(distances) # For testing purpose

    unique_set = set()  # Categories that have already been printed

    # Scan the 100 most similar rows and print the first 5 distinct categories
    for i in distances[0:100]:
        if len(unique_set) >= 5:
            break
        job = df2.iloc[i[0]].Category
        score = "{:.2f}".format(i[1])
        # Skip repeated categories and scores that round to 1.00 (the row itself or near-duplicates)
        if job not in unique_set and score != "1.00":
            unique_set.add(job)
            print(f"{job}, {score}") # Print each unique value and its corresponding similarity score

In [117]:
print(category_mapping.get(0))
recommend(0)

Database_Administrator
Database_Administrator, 0.91
Database_Administrator,Project_manager, 0.86
Database_Administrator,Systems_Administrator, 0.84
Database_Administrator,Software_Developer, 0.81
Database_Administrator,Web_Developer,Software_Developer, 0.80


In [118]:
print(category_mapping.get(38))
recommend(38)

Database_Administrator,Systems_Administrator
Project_manager,Database_Administrator, 0.51
Database_Administrator, 0.50
Network_Administrator, 0.49
Systems_Administrator, 0.48
Software_Developer,Systems_Administrator,Project_manager,Java_Developer,Database_Administrator,Web_Developer, 0.48


In [139]:
print(category_mapping.get(250))
recommend(250)

Security_Analyst,Network_Administrator,Database_Administrator,Systems_Administrator
Database_Administrator, 0.78
Database_Administrator,Systems_Administrator, 0.73
Database_Administrator,Project_manager, 0.73
Database_Administrator,Software_Developer, 0.66
Database_Administrator,Web_Developer,Software_Developer, 0.61
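
For reuse outside the notebook, it can help to have the recommendation logic return its results instead of printing them. The sketch below is one possible variant on a made-up 4x4 similarity matrix; `recommend_top`, `toy_similarity` and `toy_categories` are hypothetical names. Unlike recommend above, it skips the query row by position rather than by filtering out scores that round to 1.00.

```python
import numpy as np

def recommend_top(query_position, similarity_matrix, categories, top_n=5):
    """Return up to top_n (category, score) pairs most similar to one row."""
    # Positions of all rows, ordered from most to least similar to the query row.
    order = np.argsort(-similarity_matrix[query_position])
    seen = set()
    results = []
    for position in order:
        if position == query_position:
            continue  # skip the query row itself
        category = categories[position]
        if category in seen:
            continue  # keep only the best score for each category
        seen.add(category)
        results.append((category, float(similarity_matrix[query_position, position])))
        if len(results) == top_n:
            break
    return results

# Made-up data for illustration only.
toy_categories = ["DBA", "DBA", "Web_Developer", "Network_Administrator"]
toy_similarity = np.array([
    [1.0, 0.9, 0.2, 0.4],
    [0.9, 1.0, 0.3, 0.5],
    [0.2, 0.3, 1.0, 0.1],
    [0.4, 0.5, 0.1, 1.0],
])

toy_recommendations = recommend_top(0, toy_similarity, toy_categories, top_n=2)
print(toy_recommendations)
```

Returning a list makes it easy to test the function or feed its output into a later step, at the cost of a separate print call when used interactively.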
