#### InternBuddy Data Science Candidate Shortlisting

InternBuddy is a platform for people who have prepared for Data Science jobs and are looking for an entry into the industry. Job seekers post their profiles on the InternBuddy web portal; the system scores each profile against its shortlisting criteria and forwards the shortlisted profiles to recruiters.
This lets recruiters review a limited number of profiles instead of reading every profile submitted.

Here I have prepared a model that helps the system decide which profiles should be considered for shortlisting and which should be ignored.

Let's start by importing the required libraries and the dataset for analysis.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from warnings import filterwarnings
import datetime as dt

# Show all columns, full cell contents and up to 10000 rows when displaying dataframes
pd.set_option('display.max_columns',10000)
pd.set_option('display.max_colwidth',10000)
pd.set_option('display.max_rows',10000)

Import the dataset using the pandas library.
As the data is in an Excel file, let's use pandas.read_excel to load it.

In [2]:
ibdf=pd.read_excel('Data_Science_2020_v2.xlsx')

Let's have a look at the dataframe to get an idea of its contents.

In [3]:
ibdf.head()

Unnamed: 0,Application_ID,Current City,Python (out of 3),R Programming (out of 3),Data Science (out of 3),Other skills,Institute,Degree,Stream,Current Year Of Graduation,Performance_PG,Performance_UG,Performance_12,Performance_10
0,DS0001,Bangalore,1,0,3,"Machine Learning, Arduino, C Programming, CSS, Data Analytics, Data Structures, Deep Learning, HTML, Natural Language Processing (NLP), MATLAB, Python",Global Academy of Technology,Bachelor of Engineering (B.E),Electrical and Electronics Engineering,2019,,7.73/10,,
1,DS0002,Mumbai,2,1,2,"AutoCAD, MS-Office, Machine Learning, Microsoft Azure, MySQL, Python, SolidWorks, Deep Learning, R Programming","Aegis School Of Business, Data Science, Cyber Security And Telecommunication",,PGP,2020,,68.00/100,,
2,DS0003,Mumbai,2,0,0,"C++ Programming, Data Structures, Image Processing, Python","VJTI, Mumbai",Bachelor of Technology (B.Tech),Information Systems,2018,,8.85/10,91.40/91.40,9.40/9.40
3,DS0004,Dhanbad,2,0,2,"Algorithms, C++ Programming, Data Structures, Natural Language Processing (NLP), CSS, Computer Vision, Deep Learning, HTML, Java, Machine Learning, MySQL, Python, Android",IIT (ISM) Dhanbad,Integrated M.Tech,Mathematics and Computing,2021,,8.40/10,91.80/91.80,10.00/10.00
4,DS0005,Bangalore,2,0,0,"MS-Word, Python, SQL, MS-Excel",Vvce,Bachelor of Engineering (B.E),Electronics and Communication,2018,,,,


As you can see, there are many records, and reading each row to judge the quality of the data would be difficult.
So let's perform some data quality checks and see what needs to be done before proceeding with the analysis.

In [4]:
# This command will show the number of rows and the columns in the dataframe
ibdf.shape

(611, 14)

In [5]:
# We can list down all the column names that are there in the dataframe
ibdf.columns

Index(['Application_ID', 'Current City', 'Python (out of 3)',
       'R Programming (out of 3)', 'Data Science (out of 3)', 'Other skills',
       'Institute', 'Degree', 'Stream', 'Current Year Of Graduation',
       'Performance_PG', 'Performance_UG', 'Performance_12', 'Performance_10'],
      dtype='object')

#### - The info() method gives us a basic overview:
    1. How many rows are there?
    2. How many columns are there?
    3. What are the datatypes of the different columns?
    4. Are there any null values?
    5. How much memory is consumed by the dataframe?
#### So let's run info()

In [6]:
ibdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 611 entries, 0 to 610
Data columns (total 14 columns):
Application_ID                611 non-null object
Current City                  611 non-null object
Python (out of 3)             611 non-null int64
R Programming (out of 3)      611 non-null int64
Data Science (out of 3)       611 non-null int64
Other skills                  601 non-null object
Institute                     611 non-null object
Degree                        575 non-null object
Stream                        580 non-null object
Current Year Of Graduation    611 non-null int64
Performance_PG                128 non-null object
Performance_UG                533 non-null object
Performance_12                363 non-null object
Performance_10                339 non-null object
dtypes: int64(4), object(10)
memory usage: 67.0+ KB
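
The five questions answered by info() can also be checked one at a time. The sketch below does so on a small, made-up dataframe (the `toy` frame and its values are assumptions for illustration, not the real dataset).

```python
import pandas as pd

# Hypothetical three-row frame standing in for the real dataset
toy = pd.DataFrame({
    "Application_ID": ["DS0001", "DS0002", "DS0003"],
    "Python (out of 3)": [1, 2, 2],
    "Degree": ["Bachelor of Engineering (B.E)", None, "Bachelor of Technology (B.Tech)"],
})

n_rows, n_cols = toy.shape                          # rows and columns
non_null = toy.notnull().sum()                      # non-null count per column
dtypes = toy.dtypes                                 # datatype of each column
memory_bytes = toy.memory_usage(deep=True).sum()    # total memory in bytes

print(n_rows, n_cols)
print(non_null)
print(dtypes)
print(memory_bytes)
```

A non-null count lower than the row count, as for `Degree` here, is the signal that a column contains missing values.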


#### describe() gives us descriptive statistics about the dataset such as counts, unique values, most frequent value, mean, standard deviation, minimum and quartile values.

In [7]:
ibdf.describe(include='all')

Unnamed: 0,Application_ID,Current City,Python (out of 3),R Programming (out of 3),Data Science (out of 3),Other skills,Institute,Degree,Stream,Current Year Of Graduation,Performance_PG,Performance_UG,Performance_12,Performance_10
count,611,611,611.0,611.0,611.0,601,611,575,580,611.0,128,533,363,339
unique,611,70,,,,513,439,42,120,,83,269,204,155
top,DS0264,Bangalore,,,,"Data Analytics, Machine Learning, Python",Great Lakes Institute of Management,Bachelor of Technology (B.Tech),Computer Science,,8.00/10,70.00/100,86.00/86.00,10.00/10.00
freq,1,334,,,,5,21,168,72,,6,15,8,24
mean,,,1.788871,0.582651,0.87234,,,,,2018.487725,,,,
std,,,0.780673,0.918459,1.014681,,,,,2.573082,,,,
min,,,0.0,0.0,0.0,,,,,2004.0,,,,
25%,,,1.0,0.0,0.0,,,,,2018.0,,,,
50%,,,2.0,0.0,0.0,,,,,2019.0,,,,
75%,,,2.0,1.0,2.0,,,,,2020.0,,,,


In [8]:
# Checking the datatypes of the dataframe
ibdf.dtypes

Application_ID                object
Current City                  object
Python (out of 3)              int64
R Programming (out of 3)       int64
Data Science (out of 3)        int64
Other skills                  object
Institute                     object
Degree                        object
Stream                        object
Current Year Of Graduation     int64
Performance_PG                object
Performance_UG                object
Performance_12                object
Performance_10                object
dtype: object

- Missing values can distort the analysis, so let's count the missing values in each column of the dataframe and express them as a percentage of the rows.

In [9]:
(ibdf.isnull().sum()*100)/ibdf.shape[0]

Application_ID                 0.000000
Current City                   0.000000
Python (out of 3)              0.000000
R Programming (out of 3)       0.000000
Data Science (out of 3)        0.000000
Other skills                   1.636661
Institute                      0.000000
Degree                         5.891980
Stream                         5.073650
Current Year Of Graduation     0.000000
Performance_PG                79.050736
Performance_UG                12.765957
Performance_12                40.589198
Performance_10                44.517185
dtype: float64
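
As a small illustration of the calculation above, the sketch below computes the missing percentage in two equivalent ways on a made-up frame (the `toy` data is an assumption, not taken from the dataset).

```python
import pandas as pd

# Hypothetical four-row frame with some missing entries
toy = pd.DataFrame({
    "Degree": ["BE", None, "MSc", "BE"],
    "Performance_PG": [None, None, None, "8.00/10"],
    "Current City": ["Pune", "Delhi", "Mumbai", "Pune"],
})

# isnull() yields booleans, so their mean is the fraction of missing values
missing_pct = toy.isnull().mean() * 100

# The same quantity written as in the notebook cell
missing_pct_alt = toy.isnull().sum() * 100 / toy.shape[0]

# Columns ordered from most to least missing
print(missing_pct.sort_values(ascending=False))
```

Sorting the result makes it easier to spot which columns are candidates for dropping rather than imputing.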

In [10]:
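# Apparent intent: drop records with neither a PG nor a UG score for candidates graduating in 2020 or earlier.
# Note: missing scores are loaded as NaN, not the string 'NA', so this mask matches no rows (the next cell still reports 611 rows); .isnull() would be needed to match them.
na_mask = (ibdf['Performance_PG']=='NA') & (ibdf['Performance_UG']=='NA') & (ibdf['Current Year Of Graduation']<=2020)
ibdf.drop(index=ibdf.loc[na_mask].index,inplace=True)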
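
A comparison against the string 'NA' does not match values that pandas has loaded as missing (NaN). The sketch below contrasts the two kinds of mask on a made-up frame; the `toy` data and the variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 has neither a PG nor a UG score
toy = pd.DataFrame({
    "Performance_PG": [np.nan, "8.00/10", np.nan],
    "Performance_UG": [np.nan, "70.00/100", "7.73/10"],
    "Current Year Of Graduation": [2019, 2020, 2018],
})

# String comparison: NaN is never equal to 'NA', so nothing matches
string_mask = (toy["Performance_PG"] == "NA") & (toy["Performance_UG"] == "NA")

# isnull() detects the missing values
null_mask = (
    toy["Performance_PG"].isnull()
    & toy["Performance_UG"].isnull()
    & (toy["Current Year Of Graduation"] <= 2020)
)

print(string_mask.sum())   # rows matched by the string comparison
print(null_mask.sum())     # rows matched by the null check

# Keep every row that is not flagged by the null mask
cleaned = toy.loc[~null_mask]
```

Whether such rows should actually be removed depends on the shortlisting criteria; the sketch only shows how to select them.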

- From the analysis so far, the columns Application_ID, Performance_10, Performance_12, Performance_PG and Performance_UG are not needed for shortlisting the candidates, as the shortlisting criteria document does not use them.
- Let's drop these features and check the shape of the dataframe.

In [11]:
Application_ID=ibdf.pop('Application_ID')
ibdf.pop('Performance_10')
ibdf.pop('Performance_12')
ibdf.pop('Performance_PG')
ibdf.pop('Performance_UG')
print(ibdf.shape)
ibdf.head()

(611, 9)


Unnamed: 0,Current City,Python (out of 3),R Programming (out of 3),Data Science (out of 3),Other skills,Institute,Degree,Stream,Current Year Of Graduation
0,Bangalore,1,0,3,"Machine Learning, Arduino, C Programming, CSS, Data Analytics, Data Structures, Deep Learning, HTML, Natural Language Processing (NLP), MATLAB, Python",Global Academy of Technology,Bachelor of Engineering (B.E),Electrical and Electronics Engineering,2019
1,Mumbai,2,1,2,"AutoCAD, MS-Office, Machine Learning, Microsoft Azure, MySQL, Python, SolidWorks, Deep Learning, R Programming","Aegis School Of Business, Data Science, Cyber Security And Telecommunication",,PGP,2020
2,Mumbai,2,0,0,"C++ Programming, Data Structures, Image Processing, Python","VJTI, Mumbai",Bachelor of Technology (B.Tech),Information Systems,2018
3,Dhanbad,2,0,2,"Algorithms, C++ Programming, Data Structures, Natural Language Processing (NLP), CSS, Computer Vision, Deep Learning, HTML, Java, Machine Learning, MySQL, Python, Android",IIT (ISM) Dhanbad,Integrated M.Tech,Mathematics and Computing,2021
4,Bangalore,2,0,0,"MS-Word, Python, SQL, MS-Excel",Vvce,Bachelor of Engineering (B.E),Electronics and Communication,2018


- The shortlisting criteria clearly state that only candidates with a graduation year of 2020 or earlier are considered for shortlisting.
We therefore have to drop the records with a graduation year greater than 2020.

In [12]:
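# Dropping records having graduation year greater than 2020
# Note: drop() returns a new dataframe; without inplace=True or assignment back to ibdf, the rows remain in ibdf.
ibdf.drop(ibdf.loc[(ibdf['Current Year Of Graduation']>2020)].index)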

Unnamed: 0,Current City,Python (out of 3),R Programming (out of 3),Data Science (out of 3),Other skills,Institute,Degree,Stream,Current Year Of Graduation
0,Bangalore,1,0,3,"Machine Learning, Arduino, C Programming, CSS, Data Analytics, Data Structures, Deep Learning, HTML, Natural Language Processing (NLP), MATLAB, Python",Global Academy of Technology,Bachelor of Engineering (B.E),Electrical and Electronics Engineering,2019
1,Mumbai,2,1,2,"AutoCAD, MS-Office, Machine Learning, Microsoft Azure, MySQL, Python, SolidWorks, Deep Learning, R Programming","Aegis School Of Business, Data Science, Cyber Security And Telecommunication",,PGP,2020
2,Mumbai,2,0,0,"C++ Programming, Data Structures, Image Processing, Python","VJTI, Mumbai",Bachelor of Technology (B.Tech),Information Systems,2018
4,Bangalore,2,0,0,"MS-Word, Python, SQL, MS-Excel",Vvce,Bachelor of Engineering (B.E),Electronics and Communication,2018
7,Bangalore,2,2,0,"Natural Language Processing (NLP), Python, R Programming, SQL","Government Engineering College, Bhuj",B.Tech (Hons.),Mining,2016
9,Bangalore,0,0,0,"CSS, HTML, JavaScript, PostgreSQL","MS Ramaiah College Of Arts, Science And Commerce, BANGALORE",Master of Business Administration_(MBA),Marketing,2019
11,Bangalore,1,2,2,"MS-Excel, MS-Office, SPSS, Data Science, Machine Learning, R Programming, SQL, Python","Devi Ahilya Vishwavidyalaya,Indore",Master of Science (M.Sc),Statistics,2019
12,Bangalore,0,0,0,"C Programming, Eclipse (IDE), Java, Database Management System (DBMS), HTML, Linux, SQL",The Oxford College of Engineering,Bachelor of Engineering (B.E),Computer Science & Engineering,2018
13,Bangalore,0,0,0,"AutoCAD, MS-Office, Catia, SolidWorks","SJB Institute of Technology, Bangalore",Bachelor of Engineering (B.E),Mechanical Engineering,2018
14,Chennai,3,0,3,"Machine Learning, Python, MySQL, Statistical Modeling, Tableau",Great Lakes Institute of Management,Post Graduate Programme (PGP),Data Science Engineering,2020
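
Because drop() returns a new dataframe, the filtered result has to be stored to take effect. A minimal sketch on made-up data (the `toy` frame and the names `preview` and `eligible` are assumptions):

```python
import pandas as pd

# Hypothetical frame with one candidate graduating after 2020
toy = pd.DataFrame({
    "Current City": ["Bangalore", "Dhanbad", "Mumbai"],
    "Current Year Of Graduation": [2019, 2021, 2020],
})

# drop() without assignment only produces a filtered copy; toy is unchanged
preview = toy.drop(toy.loc[toy["Current Year Of Graduation"] > 2020].index)

# A boolean filter assigned to a name keeps the result for later steps
eligible = toy.loc[toy["Current Year Of Graduation"] <= 2020].copy()

print(len(toy), len(preview), len(eligible))
```

The boolean filter expresses the rule "2020 or earlier" directly, which can be easier to read than dropping by index.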


In [13]:
# Let's check the percentage of null values after dropping the unneeded columns.
(ibdf.isnull().sum()*100)/ibdf.shape[0]

Current City                  0.000000
Python (out of 3)             0.000000
R Programming (out of 3)      0.000000
Data Science (out of 3)       0.000000
Other skills                  1.636661
Institute                     0.000000
Degree                        5.891980
Stream                        5.073650
Current Year Of Graduation    0.000000
dtype: float64

- As you can see, the columns with the most missing values have been removed. We are now left with less than 6% missing values in any column.
Let's treat the remaining missing records.

- #### Let's find the value counts of the features with datatype 'object'.

In [14]:
# Checking the value counts of the features having the object data type. 
objectdtypes=list(ibdf.select_dtypes(include='object').columns)
for i in objectdtypes:
    print('Value counts of {0} '.format(i))
    print(ibdf[i].value_counts())
    print('================================')

Value counts of Current City 
Bangalore            334
Hyderabad             52
Pune                  29
Chennai               23
Banglore              17
Delhi                 17
Mumbai                14
Kolkata                7
Thrissur               7
Noida                  6
Kozhikode              5
Visakhapatnam          5
Indore                 4
Nellore                4
Jaipur                 4
Nagpur                 4
Varanasi               4
Bhubaneswar            4
Jalandhar              4
Ahmedabad              3
Kharagpur              3
Lucknow                3
Kochi                  3
Chandigarh             2
Ranchi                 2
Agartala               2
Greater Noida          2
Dharwad                2
Coimbatore             2
Dehradun               2
Vellore                2
Tirupati               1
Ajmer                  1
Bokaro Steel City      1
Wardha                 1
Salem                  1
Raebareli              1
New Delhi              1
Ernakulam           

In [15]:
# Counting the unique values of the features having the object data type.
# Features with more than two unique values are stored in dummy_vars and will be used to create dummy features.
dummy_vars=[]
for i in objectdtypes:
    if len(ibdf[i].value_counts())>2:
        print('Value counts of {0} : '.format(i),len(ibdf[i].value_counts()))
        print('================================')
        dummy_vars.append(i)

Value counts of Current City :  70
Value counts of Other skills :  513
Value counts of Institute :  439
Value counts of Degree :  42
Value counts of Stream :  120
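
The number of distinct values per column can also be read with nunique(), which gives the same figure as len(value_counts()). A sketch on made-up data (the `toy` frame is an assumption; `select_dtypes(exclude="number")` is used here in place of the notebook's `include='object'` purely for the example):

```python
import pandas as pd

# Hypothetical frame with two text columns and one numeric column
toy = pd.DataFrame({
    "Current City": ["Bangalore", "Banglore", "Pune", "Bangalore"],
    "Degree": ["BE", "BE", None, "MSc"],
    "Python (out of 3)": [1, 2, 2, 3],
})

# Pick the non-numeric columns
cat_cols = toy.select_dtypes(exclude="number").columns

# Distinct non-missing values per column
cardinality = toy[cat_cols].nunique()

# Columns with more than two distinct values, as in the notebook's dummy_vars
dummy_vars = [c for c in cat_cols if cardinality[c] > 2]

print(cardinality)
print(dummy_vars)
```

Columns with hundreds of distinct values, such as Other skills or Institute in the output above, would produce very wide dummy matrices, so they usually need grouping first.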


In [16]:
ibdf['Current City'].replace(to_replace='Banglore',value='Bangalore',inplace=True)

In [17]:
# Let's re-check the null percentages before dropping the rows with missing values.
(ibdf.isnull().sum()*100)/ibdf.shape[0]

Current City                  0.000000
Python (out of 3)             0.000000
R Programming (out of 3)      0.000000
Data Science (out of 3)       0.000000
Other skills                  1.636661
Institute                     0.000000
Degree                        5.891980
Stream                        5.073650
Current Year Of Graduation    0.000000
dtype: float64

In [18]:
ibdf.dropna(inplace=True)

In [19]:
(ibdf.isnull().sum()*100)/ibdf.shape[0]

Current City                  0.0
Python (out of 3)             0.0
R Programming (out of 3)      0.0
Data Science (out of 3)       0.0
Other skills                  0.0
Institute                     0.0
Degree                        0.0
Stream                        0.0
Current Year Of Graduation    0.0
dtype: float64
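
dropna() removes every row that has a missing value in any column, so it is worth checking how many rows it costs. A sketch on made-up data (the `toy` frame and variable names are assumptions):

```python
import pandas as pd

# Hypothetical frame: rows 1 and 2 each miss one value
toy = pd.DataFrame({
    "Other skills": ["Python, SQL", None, "MS-Excel", "Java"],
    "Degree": ["BE", "MSc", None, "MBA"],
    "Stream": ["Computer Science", "Statistics", "Marketing", "Finance"],
})

rows_before = len(toy)
complete = toy.dropna()                       # keep rows with no missing value
rows_dropped = rows_before - len(complete)

# subset= restricts the check to chosen columns
degree_only = toy.dropna(subset=["Degree"])

print(rows_dropped, len(degree_only))
```

Using `subset=` is one way to keep rows whose only gap is in a column that the shortlisting does not rely on.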

In [20]:
ibdf['Degree'].value_counts()

Bachelor of Technology (B.Tech)                              158
Bachelor of Engineering (B.E)                                133
Master of Technology (M.Tech)                                 37
Master of Science (M.Sc)                                      32
B.Tech (Hons.)                                                27
Master of Computer Applications (MCA)                         25
PG Diploma in Data Science                                    16
Post Graduate Programme (PGP)                                 16
Bachelor of Science (B.Sc)                                    16
MBA                                                           13
Integrated B.Tech                                              4
Integrated M.Sc.                                               4
Bachelor of Computer Applications (BCA)                        4
Bachelor of Commerce (B.Com)                                   4
Post Graduate Diploma                                          4
Master of Science (M.Sc) 

In [21]:
# Merge the spelling variants of the stream into the existing 'Computer Science' label
ibdf['Stream'].replace(['computer science','Computer  Science','cs','Computer Science  Engineering','Computer Science & Engineering (CSE)','Computer Science & Engineering','Computer Science &amp;engineering','Computer Science AndEngineering'],"Computer Science",inplace=True)
ibdf['Degree'].replace(['B.Tech (Hons.)', 'Bachelor of Technology (B.Tech)', 'Bachelor of Engineering (B.E)', 'Integrated B.Tech', 'Bachelor of Engineering (B.E) (Hons.)', 'Bachelor of Engineering (B.E) (Hons.)', 'Integrated B.Tech & M.Tech'], 'BE/BTech',inplace=True)
ibdf['Degree'].replace(['Integrated B.S. & M.S.','Master of Science (M.S.)', 'Master of Science (M.Sc)','Master of Technology (M.Tech)', 'Master of Science (M.Sc) ', 'Master of Engineering (M.E)', 'Integrated B.Sc. & M.Sc.', 'Integrated B.Tech', 'Master of Science (M.Sc) (Hons.)', 'Integrated M.Tech', 'Integrated B.Tech & M.Tech', 'Integrated M.Sc.'],'Msc/MTech',inplace=True)
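
The grouping of degree labels can also be written as a single dictionary, where each raw label maps to exactly one group. This is a sketch under assumptions: the `degree_groups` mapping below covers only a few labels and is not the notebook's full list.

```python
import pandas as pd

# A few raw labels, including one with a trailing space as seen in the value counts
degrees = pd.Series([
    "Bachelor of Technology (B.Tech)",
    "Bachelor of Engineering (B.E)",
    "Master of Science (M.Sc) ",
    "Integrated M.Tech",
    "MBA",
])

# Hypothetical, partial mapping from raw label to group
degree_groups = {
    "Bachelor of Technology (B.Tech)": "BE/BTech",
    "Bachelor of Engineering (B.E)": "BE/BTech",
    "Master of Science (M.Sc)": "Msc/MTech",
    "Integrated M.Tech": "Msc/MTech",
}

# Strip stray whitespace first so one key covers both spellings
grouped = degrees.str.strip().replace(degree_groups)

print(grouped.value_counts())
```

Since dictionary keys are unique, a label cannot end up in two groups, which avoids the ambiguity of listing the same label in two separate replace calls.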

In [22]:
ibdf['Degree'].value_counts()

BE/BTech                                                     326
Msc/MTech                                                     86
Master of Computer Applications (MCA)                         25
Post Graduate Programme (PGP)                                 16
PG Diploma in Data Science                                    16
Bachelor of Science (B.Sc)                                    16
MBA                                                           13
Bachelor of Commerce (B.Com)                                   4
Post Graduate Diploma                                          4
Bachelor of Computer Applications (BCA)                        4
Post Graduate Diploma in Management (P.G.D.M.)                 3
Master of Arts (M.A.)                                          3
Master of Business Administration_(MBA)                        3
Master of Statistics (M.Stat)                                  2
Bachelor of Computer Science (B.C.S.)                          2
Bachelor of Science (B.Sc

In [23]:
# # Visualising the data
# plt.figure(figsize=(7,6))

# #plotting heat map to find the correlations between different features
# sns.heatmap(ibdf[top_corr_features].corr(),annot=True,cmap="RdYlGn")
# plt.show()

In [24]:
# plt.figure(figsize=(7,4))
# sns.pairplot(data=ibdf.select_dtypes(exclude='object'))
# plt.show()

In [25]:
credits=['Machine Learning','Deep Learning','NLP','Statistical Data Analysis','AWS','SQL/NoSQL','Excel']
other_skills=list(ibdf['Other skills'].str.split(','))

# ibdf['Other skills'] = ibdf['Other skills'].str.replace(" Statistical Modeling","Statistical Data Analysis") 
# regex=False so that the parentheses in the pattern are matched literally
ibdf['Other skills'] = ibdf['Other skills'].str.replace(" Amazon Web Services (AWS)","AWS",regex=False)
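
How the credits list is used later is not shown in this part of the notebook. The sketch below assumes one plausible use, counting how many credited skills each candidate lists, on made-up skill strings with a shortened credits list.

```python
import pandas as pd

# Hypothetical skill strings in the same comma-separated format
skills = pd.Series([
    "Machine Learning, Python, Amazon Web Services (AWS)",
    "MS-Excel, SQL",
])

# Literal replacement: with regex=False the parentheses are not a regex group
skills = skills.str.replace("Amazon Web Services (AWS)", "AWS", regex=False)

# Split on commas and strip the whitespace around each skill
skill_lists = skills.str.split(",").apply(lambda items: [s.strip() for s in items])

# Shortened credits list (assumption: a subset of the notebook's list)
credits = ["Machine Learning", "Deep Learning", "AWS"]

# Number of credited skills per candidate
matches = skill_lists.apply(lambda items: sum(s in credits for s in items))

print(skill_lists.tolist())
print(matches.tolist())
```

Stripping whitespace after the split matters because the raw strings separate skills with a comma followed by a space, so an exact comparison against the credits list would otherwise miss every skill but the first.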

In [26]:
ibdf['Other skills'].head(100)

0                                                                                                                                              Machine Learning, Arduino, C Programming, CSS, Data Analytics, Data Structures, Deep Learning, HTML, Natural Language Processing (NLP), MATLAB, Python
2                                                                                                                                                                                                                                          C++ Programming, Data Structures, Image Processing, Python
3                                                                                                                          Algorithms, C++ Programming, Data Structures, Natural Language Processing (NLP), CSS, Computer Vision, Deep Learning, HTML, Java, Machine Learning, MySQL, Python, Android
4                                                                                                                     