<h2> Basic Set up </h2>

Set up environment for the recommendation system
<br>
 -- Download and install python. <br>
 -- Download and install an IDE, like jupyter notebook, pycharm, etc. Here I will be using Google colab. </br>
 -- Start with the project :)

We will need several libraries while progressing with the project. You can install all those using the pip install. For example, pip install pandas, pip install nltk, etc. <br>
Google colab environment may have several pre-installed libraries. If you cannot find one, pip install it :)

<h2> What is Recommendation System? </h2>
A recommendation system is a program that helps a user discover a product or any content based on the user's interests or the user's interactions. <br/>
Example - Ecommerce websites, social medias, OTT platforms, etc. uses recommendation system. <br/>
Recommendation systems takes in user data and uses that data for making predictions on new products or contents which are not seen by the user but probably liked by the user. This will keep the user engaged, as they will get more of what they like. <br><br>
There are mainly two types of recommendation system - <br>
<ol>
<li> Content based Recommendation system - uses the specification of the product, that is the details of the product, and recommends new products based on it. For example, if you read the book of Origin by Dan Brown and tell about it to your friend, your friend might suggest you to read the book of Inferno, by Dan Brown. This is because the genre of both the books shares similarity; also the author is the same.</li>
<li> Collaborative filtering - This recommendation is based on the user's ratings on the products in past. It is got nothing to do with the product themselves, unlike content based. For example, a guest came to your house and gave you a chocolate. Your sister is almost like you in choices, so there is a high probability that she will like chocolates too. So the guest gives her one as well. This is collaborative.
</ol>
<br><br>
<h3> How content based recommendation system works? </h3>
Suppose you have purchased a t-shirt from Amazon. That t-shirt must be having certain decriptions. Amazon will start recommending you the tshirts which shares almost similar description, pricing, etc. Content based system works well when descriptive data is available for the products. Thus, descriptive data has to be there, which is kind of a disadvantage.
<br><br>
<h3> How collaborative filtering works? </h3>
Suppose you picked up black tshirt, blue jeans. There is another person A who picked up white shirt, brown jeans, nike shoes. There is another person B who picked up black shirt, blue jeans,  puma shoes. It seems like the person B and you have almost similar in taste based on the previous pickings. So the recommender will recommend you puma shoes, as your interests overlap with that of person B. This is what collaborative is, based on user interactions.
<br><br>
Advantage of collaborative filtering is you do not need to have the knowledge of the product or content at all. It solely depends on user review data. Disadvantage is that you cannot make recommendation if you dont have any user review. Also, it tends to show popular products more.


<h2> What are vectors ? </h2>
Understanding the concept of vectors is important. We need to deal with large arrays of data in ML. These arrays are sometimes called vectors for single column of data and is called matrices for larger arrays. Let's see how to work with vectors - <br>
Before that, I would like to brief on SIMD concept. Modern computers have the ability to take a list of numbers and apply mathematical operations on all the numbers parallely. This is known as single instruction, multiple data or SIMD. CPU can load chunk of data into memory and perform mathematical operations at once parallely.

In [1]:
dataset_on_rates = [4,3,4,5,2,3,4,1,5,3,2,1,3,5,4,4,2,1,2,5,2]
new_rating = []
for i, value in enumerate(dataset_on_rates):
  print('Updating rate at {}'.format(i), end = '   ')
  new_rating.insert(i, (value * 4))

print()
print(new_rating)

Updating rate at 0   Updating rate at 1   Updating rate at 2   Updating rate at 3   Updating rate at 4   Updating rate at 5   Updating rate at 6   Updating rate at 7   Updating rate at 8   Updating rate at 9   Updating rate at 10   Updating rate at 11   Updating rate at 12   Updating rate at 13   Updating rate at 14   Updating rate at 15   Updating rate at 16   Updating rate at 17   Updating rate at 18   Updating rate at 19   Updating rate at 20   
[16, 12, 16, 20, 8, 12, 16, 4, 20, 12, 8, 4, 12, 20, 16, 16, 8, 4, 8, 20, 8]


While using ML, similar mathematical operations are used over an entire array. Here, we are converting all the ratings from the scale of 5 to the scale of 20 by multiplying by 4.<br>
Looping over the list using a for loop and doing the multiplication operation is inefficient. So instead of working in this way, we can make use of SIMD, that is we can load the full array on CPU and perform the multiplication operation at once. This creates a huge difference in speed when working with large dataset. <br><br>
For the above reason, it is efficient to work with array libraries that can process data in parallel. This array library is the NumPy. Numpy creates array in the memory in a very efficient way and automatically parallelizes common operation on array. Thus numpy automatically takes the advantage of SIMD feature of CPU. Let's see how the code will be -

In [4]:
import numpy as np

dataset_on_rates = np.array([4,3,4,5,2,3,4,1,5,3,2,1,3,5,4,4,2,1,2,5,2])
new_ratings = dataset_on_rates * 4
print(new_ratings)

[16 12 16 20  8 12 16  4 20 12  8  4 12 20 16 16  8  4  8 20  8]


Most of the operations in an array can be done in parallel following this way. This is known as vectorizing the code. We are replacing iterative codes with vector operations that can be executed in parallel.

<h3> About our project </h3>
 This project is about job recommendation and is content based. We will go through the job description provided in the dataset and will look into the job title based on that job description. The model will learn the job description for the particular job titles. Then we will pass a job title to recommend other suitable jobs; it will show us certain job titles based on its content based learnings.

**Mounting drive**

In [8]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**Importing libraries**

In [7]:
import pandas as pd
import numpy as np

**Loading dataset**

In [41]:
df = pd.read_csv('/content/drive/MyDrive/Python Codes/Global company/naukri_jobs.csv')
df.head()

Unnamed: 0,company,education,experience,industry,jobdescription,jobid,joblocation_address,jobtitle,numberofpositions,payrate,postdate,site_name,skills,uniq_id
0,MM Media Pvt Ltd,UG: B.Tech/B.E. - Any Specialization PG:Any Po...,0 - 1 yrs,Media / Entertainment / Internet,Job Description Send me Jobs like this Quali...,210516002263,Chennai,Walkin Data Entry Operator (night Shift),,"1,50,000 - 2,25,000 P.A",2016-05-21 19:30:00 +0000,,ITES,43b19632647068535437c774b6ca6cf8
1,find live infotech,UG: B.Tech/B.E. - Any Specialization PG:MBA/PG...,0 - 0 yrs,Advertising / PR / MR / Event Management,Job Description Send me Jobs like this Quali...,210516002391,Chennai,Work Based Onhome Based Part Time.,60.0,"1,50,000 - 2,50,000 P.A. 20000",2016-05-21 19:30:00 +0000,,Marketing,d4c72325e57f89f364812b5ed5a795f0
2,Softtech Career Infosystem Pvt. Ltd,UG: Any Graduate - Any Specialization PG:Any P...,4 - 8 yrs,IT-Software / Software Services,Job Description Send me Jobs like this - as ...,101016900534,Bengaluru,Pl/sql Developer - SQL,,Not Disclosed by Recruiter,2016-10-13 16:20:55 +0000,,IT Software - Application Programming,c47df6f4cfdf5b46f1fd713ba61b9eba
3,Onboard HRServices LLP,UG: Any Graduate - Any Specialization PG:CA Do...,11 - 15 yrs,Banking / Financial Services / Broking,Job Description Send me Jobs like this - Inv...,81016900536,"Mumbai, Bengaluru, Kolkata, Chennai, Coimbator...",Manager/ad/partner - Indirect Tax - CA,,Not Disclosed by Recruiter,2016-10-13 16:20:55 +0000,,Accounts,115d28f140f694dd1cc61c53d03c66ae
4,Spire Technologies and Solutions Pvt. Ltd.,UG: B.Tech/B.E. - Any Specialization PG:Any Po...,6 - 8 yrs,IT-Software / Software Services,Job Description Send me Jobs like this Pleas...,120916002122,Bengaluru,JAVA Technical Lead (6-8 yrs) -,4.0,Not Disclosed by Recruiter,2016-10-13 16:20:55 +0000,,IT Software - Application Programming,a12553fc03bc7bcced8b1bb8963f97b4


**Exploring and Cleaning dataset** <br>
We will drop the columns that we won't need. For the job recommendation system, we need the job description field mandatorily based on which we will recommend.

In [42]:
df.drop(['jobid', 'numberofpositions', 'payrate', 'postdate', 'site_name', 'uniq_id'], axis = 1, inplace = True)

In [43]:
df.shape

(22000, 8)

In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22000 entries, 0 to 21999
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   company              21996 non-null  object
 1   education            20004 non-null  object
 2   experience           21996 non-null  object
 3   industry             21995 non-null  object
 4   jobdescription       21996 non-null  object
 5   joblocation_address  21499 non-null  object
 6   jobtitle             22000 non-null  object
 7   skills               21472 non-null  object
dtypes: object(8)
memory usage: 1.3+ MB


In [45]:
df.isna().sum()

company                   4
education              1996
experience                4
industry                  5
jobdescription            4
joblocation_address     501
jobtitle                  0
skills                  528
dtype: int64

There are four companies with null values. Let look into those -

In [46]:
df[df.company.isna()]

Unnamed: 0,company,education,experience,industry,jobdescription,joblocation_address,jobtitle,skills
3768,,,,,,,1-50 of 680 Service Desk Jobs in Chennai,
4026,,,,,,,1-50 of 658 .net Developer Jobs in Chennai,
4389,,,,,,,1-50 of 574 Asp.net Jobs in Chennai,
4841,,,,,,,1-50 of 507 Risk Analyst Jobs in Chennai,


In [47]:
#all are null, so we can drop these rows

df.dropna(subset = ['company'], axis = 0, inplace = True)

Then we have some values from education field as NaN. We will replace that NaN with 'Not Given'

In [48]:
df[df['education'].isna()].head()

Unnamed: 0,company,education,experience,industry,jobdescription,joblocation_address,jobtitle,skills
6,Kinesis Management Consultant Pvt. Ltd,,1 - 3 yrs,IT-Software / Software Services,Job Description Send me Jobs like this exper...,"Delhi NCR, Mumbai, Bengaluru, Kochi, Greater N...",PHP Developer,IT Software - Application Programming
16,SMACera Technologies Consulting and Services P...,,1 - 2 yrs,Recruitment / Staffing,Job Description Send me Jobs like this 1. Hi...,Bengaluru,Data Entry Executive,Executive Assistant
17,Janak Vidya Consultancy Pvt. Ltd.,,2 - 4 yrs,IT-Software / Software Services,Job Description Send me Jobs like this (2-3 ...,Bengaluru,PHP Developer_banashankari II Stage,IT Software - Other
31,"iPlanner, Inc. hiring for Product based Company",,5 - 10 yrs,IT-Software / Software Services,Job Description Send me Jobs like this Greet...,Bengaluru,Good Opportunity for Oracle DBA - Bangalore Lo...,IT Software - Application Programming
38,Inspiration Manpower Consultancy Pvt. Ltd.,,0 - 5 yrs,BPO / Call Centre / ITES,Job Description Send me Jobs like this Greet...,"Bengaluru, Chennai",Banking / Mortgage / Finance / Sales / Bpo / C...,ITES


In [49]:
df['education'].fillna('Not Given', inplace = True)

Similarly, let's replace the job location NaN values and skills NaN values as Not Given.

In [50]:
df['joblocation_address'].fillna('Not Given', inplace = True)
df['skills'].fillna('Not Given', inplace = True)

A single value for industry is NaN. Let's do the same

In [51]:
df['industry'].fillna('Not Given', inplace = True)

Now, let's check for duplicates

In [52]:
df[df.company.duplicated()]

Unnamed: 0,company,education,experience,industry,jobdescription,joblocation_address,jobtitle,skills
14,Accenture,UG: Any Graduate - Any Specialization PG:MBA/P...,1 - 5 yrs,IT-Software / Software Services,Job Description Send me Jobs like this Respo...,Bengaluru,Revenue Assurance,Accounts
20,Accenture,UG: Any Graduate - Any Specialization PG:Any P...,1 - 5 yrs,IT-Software / Software Services,Job Description Send me Jobs like this Deliv...,Bengaluru,Analytics,ITES
23,OKDA Solutions,UG: Any Graduate - Any Specialization,6 - 10 yrs,IT-Software / Software Services,Job Description Send me Jobs like this Job D...,Bengaluru,Sr iOS Developer - Objective C / Cocoa,IT Software - Application Programming
33,Accenture,UG: Any Graduate - Any Specialization PG:Any P...,1 - 5 yrs,IT-Software / Software Services,Job Description Send me Jobs like this Posit...,Bengaluru,Call Quality,ITES
36,Strivex Consulting Pvt Ltd,"UG: B.Tech/B.E. - Computers, Electronics/Telec...",4 - 9 yrs,IT-Software / Software Services,Job Description Send me Jobs like this Notic...,Bengaluru,WLAN Device Driver Development Engineer - Linux,IT Software - Network Administration
...,...,...,...,...,...,...,...,...
21994,Contactx Resource Management,UG: Any Graduate - Any Specialization PG:Any P...,10 - 14 yrs,Industrial Products / Heavy Machinery,Job Description Send me Jobs like this To br...,"Bengaluru, Chennai",Sales Manager for a Leading Manufacturing Orga...,Sales
21995,Morgan Stanley Advantage Services Pvt. Ltd.,UG: Any Graduate - Any Specialization,9 - 13 yrs,Banking / Financial Services / Broking,Job Description Send me Jobs like this Greet...,Bengaluru,Quality Assurance - VP with Morgan Stanley Ban...,IT Software - QA & Testing
21996,Careernet Technologies Pvt Ltd hiring for Client,UG: B.Tech/B.E. - Any Specialization PG:M.Tech...,3 - 5 yrs,IT-Software / Software Services,Job Description Send me Jobs like this Looki...,"Bengaluru, Gurgaon",Java Backend Developers for a Product Company,IT Software - Application Programming
21998,Confidential,UG: B.Tech/B.E. - Any Specialization PG:MCA - ...,7 - 12 yrs,IT-Software / Software Services,Job Description Send me Jobs like this We ar...,"Delhi NCR, Bengaluru",Sr UI Developer/ Technical Lead - Html/ CSS/ J...,IT Software - Application Programming


There are 13527 rows having duplicate values for company names. We will keep the duplicates, as for every company, job location, job title, and skills vary. Let's find the number of duplicates for each company -

In [53]:
df.company.value_counts().rename_axis('company').reset_index(name='count')

Unnamed: 0,company,count
0,Indian Institute of Technology Bombay,403
1,Confidential,393
2,National Institute of Industrial Engineering,185
3,Oracle India Pvt. Ltd.,151
4,JPMorgan Chase,135
...,...,...
8464,Bello Jewels Pvt Ltd,1
8465,ITG Telematics Pvt Ltd,1
8466,Genuine Management Services Pvt. Limited hirin...,1
8467,AJAX Consulting hiring for a large Japanese MNC,1


Now I will add the job description field with the job title and skills.

I will remove the space from the job title so that it is vectorized as one job.

In [54]:
df['job title'] = df['jobtitle'].apply(lambda x:"".join(x.split()))

In [55]:
df['jobdescription'] = df['jobdescription'] + df['jobtitle'] + df['skills']

Now we will clean up the job description field, keeping only words and numbers in it.

In [56]:
def remove_noise(obj):
  word_num = list([val for val in obj.split() if val.isalpha() or val.isnumeric()])
  new_string = " ".join(word_num)
  return (new_string)

In [57]:
df['jobdescription'] = df['jobdescription'].apply(remove_noise)

Now, I will turn all the words to lower case and then we will view one of the job description to look into the change.

In [58]:
df['jobdescription'] = df['jobdescription'].apply(lambda x:x.lower())

In [59]:
df.jobdescription[2]

'job description send me jobs like this as a developer in providing application design guidance and utilizing a thorough understanding of applicable tools and existing analyzes highly complex business designs and writes technical specifications to design or redesign complex computer platforms and provides coding direction to less experienced staff or develops highly complex original acts as an expert technical resource for simulation and analysis verifies program logic by overseeing the preparation of test testing and debugging of oversees overall systems testing and the migration of platforms and applications to develops new departmental technical procedures and user leads allocates and manages resources and manages the work of less experienced assures security and compliance requirements are met for supported area and oversees creation of or updates to and testing of the business continuation years application development and implementation additional job looking for who has experien

Now we will do stemming.

In [66]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [67]:
def stemming(text):
  stemmed = []
  for i in text.split():
    stemmed.append(ps.stem(i))
  
  new_string = " ".join(stemmed)
  return new_string

In [68]:
df['jobdescription'] = df['jobdescription'].apply(stemming)

<h3> Start of Vectorization </h3>

Now we will perform text vectorization. Here we will find the similarity on the basis of job description, which means we need to calculate the similarity score of different job descriptions. But job description is all written in text, thus we need to perform vectorization on the text to calculate the similarity scores. <br><br>
We will convert the job description into vectors. This is called text vectorization. Since we are having 21996 records, which means 21996 different job descriptions, vectorization will create 21996 vectors. So, the closest vector to our selected vector will be chosen for recommendation. <br>
We will use the method of bag of words for text vectorization. <br><br>
In the process of bag of words, all the words from all the job descriptions are combined. Then from that huge pool of words, we will fetch N number of words with highest frequency. Then we will take every job descriptions and will tally with the occurance of highest number of words in each. At the end, we will get a table with 21996 number of rows and N number of columns, that defines the occurance of every taken words for every movies. Now these movies has got converted to vectors in this space. The closest vectors are taken for prediction.<br><br>
All the above procedure of text vectorization is done by the Text vectorizer from scikit learn. Removing the stop words, as it doesn't place any meaning to sentence. Vectorization is applied to the rest of the words.

In [69]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 10000, stop_words = 'english')

In [70]:
word_vector = cv.fit_transform(df['jobdescription']).toarray()

In [71]:
word_vector.shape

(21996, 10000)

Hence, the text vectorization is done. Now we will calculate the distance between each of the vectors; this will be the angular distance in between the vectors (cosine angle). The vectors with the lowest distance are similar.

In [72]:
from sklearn.metrics.pairwise import cosine_similarity

similarity = cosine_similarity(word_vector)

Now, for recommendation I will pass on title. I will get the index of the title and then will get the similarity scores related to that index. Then I will sort all the similarity scores in descending order and will fetch the first 10 out of it as the most similar jobs.

In [167]:
def recommend(job):
  job_pos = df[df['jobtitle'] == job].index[0]
  distances = similarity[job_pos]
  job_list = sorted(list(enumerate(distances)), reverse = True, key = lambda x:x[1])[0:10]
  Index, Title, Company_Offering, Industry = [], [], [], []
  count = 1
  for j in job_list:
    Index.append(count)
    Title.append(df.iloc[j[0]].jobtitle)
    Company_Offering.append(df.iloc[j[0]].company)
    Industry.append(df.iloc[j[0]].industry)
    count = count + 1
  return Index, Title, Company_Offering, Industry

In [166]:
job = input('Enter job title - ')

Title, Company_Offering, Industry = recommend(job)
job_frame = pd.DataFrame({
    'Title' : Title,
    'Company Offering' : Company_Offering,
    'Industry' : Industry
  })
print(job_frame)

Enter job title - developer


TypeError: ignored

Thus, our job recommendation is complete :)