# Digipplus Assignment
### This notebook comprises a detailed analysis of the provided dataset of ML intern applicants

Assumptions:
- Number of open positions: 1
- The skills 'Data Science' & 'AWS' are mandatory
- Features 'Degree', 'Stream', 'Performance_10' & 'Performance_12' are unnecessary criteria for identifying the best intern
- No clustering algorithm is needed, since the complexity of the problem is not beyond understanding. An extra 'Score' feature can be added to the dataset instead

In [1]:
# Importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
pd.set_option('display.max_columns', None)

# Storing the dataset into a dataframe
df = pd.read_csv('Applications_for_Machine_Learning_internship_edited.xlsx - Sheet1.csv')
df.head(5)

Unnamed: 0,Name,Python (out of 3),Machine Learning (out of 3),Natural Language Processing (NLP) (out of 3),Deep Learning (out of 3),Other skills,"Are you available for 3 months, starting immediately, for a full-time work from home internship?",Degree,Stream,Current Year Of Graduation,Performance_PG,Performance_UG,Performance_12,Performance_10
0,,1,0,0,1,"MS-Excel, MS-Word, Deep Learning, MySQL, Pytho...","Yes, I am available for 3 months starting imme...",Bachelor of Vocation (B.Voc.),Software Engineering,2021,,6.50/7,,
1,,2,0,0,0,"Git, GitHub, Linux, Adobe After Effects, Adobe...","Yes, I am available for 3 months starting imme...",B.Tech,Computer Science & Engineering,2024,,8.90/10,,
2,,2,2,0,0,"Amazon Web Services (AWS), Docker, Hadoop, MS-...","Yes, I am available for 3 months starting imme...",Master of Science (M.S.),Data Science And Analytics,2022,,,,
3,,3,2,2,0,"Adobe XD, BIG DATA ANALYTICS, Canva, Data Anal...","Yes, I am available for 3 months starting imme...",Bachelor of Engineering (B.E),,2024,,,85.60/85.60,10.00/10.00
4,,2,2,0,0,"C++ Programming, Data Science, Machine Learnin...","Yes, I am available for 3 months starting imme...",B.Tech,Computer Science,2023,,8.10/10,93.40/93.40,10.00/10.00


In [2]:
# Accessing the column names/features
cols = df.columns
cols

Index(['Name', 'Python (out of 3)', 'Machine Learning (out of 3)',
       'Natural Language Processing (NLP) (out of 3)',
       'Deep Learning (out of 3)', 'Other skills',
       'Are you available for 3 months, starting immediately, for a full-time work from home internship? ',
       'Degree', 'Stream', 'Current Year Of Graduation', 'Performance_PG',
       'Performance_UG', 'Performance_12', 'Performance_10'],
      dtype='object')

In [3]:
len(cols)

14

In [4]:
df.shape

(1136, 14)

In [5]:
# No. of Post Graduates applying for the intern position
na_perfPG = df['Performance_PG'].notna().sum()
na_perfPG

184

In [6]:
# The column 'Name' has NaN values. Since there's no primary key in our dataset, we can rename 'Name' to 'ID' and use it as the primary key
df = df.rename(columns={'Name' : 'ID'})

# 'Performance_12' & 'Performance_10' have no importance in identifying the best intern, so we can simply drop them from the dataset
df = df.drop(['Performance_12', 'Performance_10'], axis = 1)
df.head()

Unnamed: 0,ID,Python (out of 3),Machine Learning (out of 3),Natural Language Processing (NLP) (out of 3),Deep Learning (out of 3),Other skills,"Are you available for 3 months, starting immediately, for a full-time work from home internship?",Degree,Stream,Current Year Of Graduation,Performance_PG,Performance_UG
0,,1,0,0,1,"MS-Excel, MS-Word, Deep Learning, MySQL, Pytho...","Yes, I am available for 3 months starting imme...",Bachelor of Vocation (B.Voc.),Software Engineering,2021,,6.50/7
1,,2,0,0,0,"Git, GitHub, Linux, Adobe After Effects, Adobe...","Yes, I am available for 3 months starting imme...",B.Tech,Computer Science & Engineering,2024,,8.90/10
2,,2,2,0,0,"Amazon Web Services (AWS), Docker, Hadoop, MS-...","Yes, I am available for 3 months starting imme...",Master of Science (M.S.),Data Science And Analytics,2022,,
3,,3,2,2,0,"Adobe XD, BIG DATA ANALYTICS, Canva, Data Anal...","Yes, I am available for 3 months starting imme...",Bachelor of Engineering (B.E),,2024,,
4,,2,2,0,0,"C++ Programming, Data Science, Machine Learnin...","Yes, I am available for 3 months starting imme...",B.Tech,Computer Science,2023,,8.10/10


In [7]:
# Transforming 'ID' to primary index
i = 0
for i in range(df.shape[0]):
    df['ID'][i] = i

df['ID'] = pd.to_numeric(df['ID'], downcast='integer')

df = df.set_index('ID')
df.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['ID'][i] = i


ID,Python (out of 3),Machine Learning (out of 3),Natural Language Processing (NLP) (out of 3),Deep Learning (out of 3),Other skills,"Are you available for 3 months, starting immediately, for a full-time work from home internship?",Degree,Stream,Current Year Of Graduation,Performance_PG,Performance_UG
0,1,0,0,1,"MS-Excel, MS-Word, Deep Learning, MySQL, Pytho...","Yes, I am available for 3 months starting imme...",Bachelor of Vocation (B.Voc.),Software Engineering,2021,,6.50/7
1,2,0,0,0,"Git, GitHub, Linux, Adobe After Effects, Adobe...","Yes, I am available for 3 months starting imme...",B.Tech,Computer Science & Engineering,2024,,8.90/10
2,2,2,0,0,"Amazon Web Services (AWS), Docker, Hadoop, MS-...","Yes, I am available for 3 months starting imme...",Master of Science (M.S.),Data Science And Analytics,2022,,
3,3,2,2,0,"Adobe XD, BIG DATA ANALYTICS, Canva, Data Anal...","Yes, I am available for 3 months starting imme...",Bachelor of Engineering (B.E),,2024,,
4,2,2,0,0,"C++ Programming, Data Science, Machine Learnin...","Yes, I am available for 3 months starting imme...",B.Tech,Computer Science,2023,,8.10/10
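
The `SettingWithCopyWarning` shown above comes from the chained assignment `df['ID'][i] = i`. A minimal sketch of one way to avoid it is to assign the whole column in a single step. The frame `toy` below is a hypothetical stand-in, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for the applicant data; 'ID' starts out empty.
toy = pd.DataFrame({"ID": [None, None, None], "Python (out of 3)": [1, 2, 3]})

# Assign the whole column at once instead of writing it cell by cell.
toy["ID"] = range(len(toy))

# Use the new sequential ID as the index, as the notebook does.
toy = toy.set_index("ID")
print(toy)
```

Because the column is written in one assignment, pandas never has to guess whether a view or a copy is being modified.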


In [8]:
# Replacing the NaN values in 'Performance_PG' & 'Performance_UG' with zero

df[['Performance_PG', 'Performance_UG']] = df[['Performance_PG', 'Performance_UG']].replace(np.nan, 0)

df.head()

ID,Python (out of 3),Machine Learning (out of 3),Natural Language Processing (NLP) (out of 3),Deep Learning (out of 3),Other skills,"Are you available for 3 months, starting immediately, for a full-time work from home internship?",Degree,Stream,Current Year Of Graduation,Performance_PG,Performance_UG
0,1,0,0,1,"MS-Excel, MS-Word, Deep Learning, MySQL, Pytho...","Yes, I am available for 3 months starting imme...",Bachelor of Vocation (B.Voc.),Software Engineering,2021,0,6.50/7
1,2,0,0,0,"Git, GitHub, Linux, Adobe After Effects, Adobe...","Yes, I am available for 3 months starting imme...",B.Tech,Computer Science & Engineering,2024,0,8.90/10
2,2,2,0,0,"Amazon Web Services (AWS), Docker, Hadoop, MS-...","Yes, I am available for 3 months starting imme...",Master of Science (M.S.),Data Science And Analytics,2022,0,0
3,3,2,2,0,"Adobe XD, BIG DATA ANALYTICS, Canva, Data Anal...","Yes, I am available for 3 months starting imme...",Bachelor of Engineering (B.E),,2024,0,0
4,2,2,0,0,"C++ Programming, Data Science, Machine Learnin...","Yes, I am available for 3 months starting imme...",B.Tech,Computer Science,2023,0,8.10/10


In [9]:
# df.reset_index(drop=True)
val = df.at[4, 'Performance_UG']
type(val)

str

In [10]:
val2 = df.at[4, 'Performance_PG']
type(val2)

int

### The values in 'Performance_UG' & 'Performance_PG' are of type 'str' and entered in the form 'Num/Denom', e.g. '7/10'.
### The code block below splits each 'Performance_UG' value into its numerator and denominator.
### Then, we rescale every value to the range 0-10

In [11]:
for i, x in enumerate(df['Performance_UG']):
    if isinstance(x, str):
        df.at[i, 'Performance_UG'] = df.at[i, 'Performance_UG'].split("/")
        df.at[i, 'Performance_UG'] = (float(df.at[i, 'Performance_UG'][0])/float(df.at[i, 'Performance_UG'][1]))*10
df.head()

ID,Python (out of 3),Machine Learning (out of 3),Natural Language Processing (NLP) (out of 3),Deep Learning (out of 3),Other skills,"Are you available for 3 months, starting immediately, for a full-time work from home internship?",Degree,Stream,Current Year Of Graduation,Performance_PG,Performance_UG
0,1,0,0,1,"MS-Excel, MS-Word, Deep Learning, MySQL, Pytho...","Yes, I am available for 3 months starting imme...",Bachelor of Vocation (B.Voc.),Software Engineering,2021,0,9.285714
1,2,0,0,0,"Git, GitHub, Linux, Adobe After Effects, Adobe...","Yes, I am available for 3 months starting imme...",B.Tech,Computer Science & Engineering,2024,0,8.9
2,2,2,0,0,"Amazon Web Services (AWS), Docker, Hadoop, MS-...","Yes, I am available for 3 months starting imme...",Master of Science (M.S.),Data Science And Analytics,2022,0,0.0
3,3,2,2,0,"Adobe XD, BIG DATA ANALYTICS, Canva, Data Anal...","Yes, I am available for 3 months starting imme...",Bachelor of Engineering (B.E),,2024,0,0.0
4,2,2,0,0,"C++ Programming, Data Science, Machine Learnin...","Yes, I am available for 3 months starting imme...",B.Tech,Computer Science,2023,0,8.1
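
The cell above converts only 'Performance_UG', so 'Performance_PG' still holds 'Num/Denom' strings. A minimal sketch of a reusable conversion that could be applied to both columns follows. The helper `fraction_to_ten` and the frame `toy` are hypothetical names introduced for illustration, and the sketch assumes the denominator is never zero.

```python
import pandas as pd

def fraction_to_ten(value):
    """Convert a 'num/denom' string to a 0-10 scale; other values become floats."""
    if isinstance(value, str) and "/" in value:
        num, denom = value.split("/")
        return float(num) / float(denom) * 10
    return float(value)

# Hypothetical stand-in rows using values that appear in the tables above.
toy = pd.DataFrame({
    "Performance_PG": [0, "7.42/10"],
    "Performance_UG": ["6.50/7", 0],
})

# Apply the same conversion to both performance columns.
for col in ["Performance_PG", "Performance_UG"]:
    toy[col] = toy[col].map(fraction_to_ten)

print(toy)
```

Mapping a function over the column avoids indexing by position, so it keeps working after rows have been dropped.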


### Handling features with missing values

In [12]:
# Features with missing values
missing_values_features = [feature for feature in df.columns if df[feature].isnull().sum() > 0]
missing_values_features

['Other skills', 'Degree', 'Stream']

In [13]:
for x in missing_values_features:
    print("{} Missing values found in feature {}".format(df[x].isnull().sum(), x))

66 Missing values found in feature Other skills
43 Missing values found in feature Degree
170 Missing values found in feature Stream


In [14]:
# Dropping rows with missing values

for x in missing_values_features:
    df = df.dropna(subset=[x])

In [15]:
for x in missing_values_features:
    print("{} Missing values found in feature {}".format(df[x].isnull().sum(), x))

0 Missing values found in feature Other skills
0 Missing values found in feature Degree
0 Missing values found in feature Stream


In [16]:
df.shape

(896, 11)
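
The counting and dropping steps above can also be done without explicit loops. The following is a minimal sketch on a hypothetical toy frame (`toy`, `cols_with_gaps`, `missing_counts` and `cleaned` are illustrative names, not from the notebook).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with one missing value in each column.
toy = pd.DataFrame({
    "Other skills": ["AWS", np.nan, "SQL", "Git"],
    "Degree": ["B.Tech", "B.Sc", np.nan, "M.S."],
    "Stream": ["CSE", "Physics", "Maths", np.nan],
})

cols_with_gaps = ["Other skills", "Degree", "Stream"]

# Missing values per column, computed in one call.
missing_counts = toy[cols_with_gaps].isnull().sum()

# Drop every row that has a missing value in any of the listed columns.
cleaned = toy.dropna(subset=cols_with_gaps)

print(missing_counts)
print(cleaned.shape)
```

Note that the surviving rows keep their original index labels, so the index is no longer contiguous after this step.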

### Finding out if the required skills are present in 'Other skills'. If not, mark that row and drop it

In [17]:
# type(df['Other skills'][1])

In [18]:
#

#Converting the string at df['Other skills'] to a list of skills using split()

# for index, row in df.iterrows():
#     value = row['Other skills']
#     if isinstance(value, str):
#         df.at[index, 'Other skills'] = value.split(", ")
        
# df.head(3)

In [19]:
#Assumption : The 'Other skills' feature of an applicant must have 'Data Science' & 'Amazon Web Services (AWS)'
# req_skills = ['Data Science', 'Amazon Web Services (AWS)']

# for i, skills in enumerate(df['Other skills']):
#     if all(skill in skills for skill in req_skills):
#         df.at[i, 'Other skills'] = 1
#     else :
#         df.at[i, 'Other skills'] = 0

# df['Other skills']

In [20]:
# Considering only the rows where df['Other skills'] = 1
# df = df.loc[df['Other skills'] != 0]
# df.shape

In [21]:
# Assumption : The 'Other skills' feature of an applicant must have 'Data Science' & 'Amazon Web Services (AWS)'
import re

# Pattern to be matched (Regular Expression)
pattern = r"(?=.*\bdata science\b)(?=.*\bamazon web server \(aws\)\b|.*\baws\b)"

# If the required skills are found, replace the string at that index by 1; otherwise mark it as NaN so the row can be dropped
for i, skill in enumerate(df['Other skills']):
    match = re.search(pattern, skill, re.IGNORECASE)
    if match:
        df.at[i, 'Other skills'] = 1
    else:
        df.at[i, 'Other skills'] = np.nan
df = df.dropna(subset=['Other skills'])
df.head()

ID,Python (out of 3),Machine Learning (out of 3),Natural Language Processing (NLP) (out of 3),Deep Learning (out of 3),Other skills,"Are you available for 3 months, starting immediately, for a full-time work from home internship?",Degree,Stream,Current Year Of Graduation,Performance_PG,Performance_UG
1,2.0,0.0,0.0,0.0,1,"Yes, I am available for 3 months starting imme...",B.Tech,Computer Science & Engineering,2024.0,0,8.9
30,1.0,2.0,2.0,2.0,1,"Yes, I am available for 3 months starting imme...",Master of Technology (M.Tech),Statistics Computing( DATA Science ),2022.0,0,5.9
59,0.0,0.0,0.0,0.0,1,"Yes, I am available for 3 months starting imme...",Bachelor of Science (B.Sc),Physics,2022.0,0,7.28
85,2.0,1.0,0.0,0.0,1,"Yes, I am available for 3 months starting imme...",B.Tech,Electronics and Communication,2023.0,7.42/10,0.0
104,2.0,2.0,0.0,2.0,1,"Yes, I am available for 3 months starting imme...",B.Tech,Computer Science & Engineering,2023.0,0,9.2


In [22]:
# Considering only those rows where df['Other skills'] != 0, i.e., only those applicants who possess the required
# skills
df = df.loc[df['Other skills'] != 0]
df.shape

(218, 11)
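
One caveat about the filtering cell above: `enumerate` yields positions 0, 1, 2, ... whereas `df.at` expects index labels, and after rows have been dropped the two may no longer coincide. A minimal sketch of a label-safe alternative is a boolean mask built with `Series.str.contains`. The frame `toy` and the variable names are hypothetical, and matching the short form "AWS" is an assumption about how applicants wrote the skill.

```python
import pandas as pd

# Hypothetical stand-in with non-contiguous index labels, as after dropna().
toy = pd.DataFrame(
    {"Other skills": [
        "Git, GitHub, Linux",
        "Data Science, Amazon Web Services (AWS), SQL",
        "AWS, Docker",
        "Python, data science, AWS",
    ]},
    index=[1, 30, 59, 85],
)

skills = toy["Other skills"]

# One case-insensitive test per mandatory skill; missing entries count as False.
has_ds = skills.str.contains(r"\bdata science\b", case=False, regex=True, na=False)
has_aws = skills.str.contains(r"\baws\b", case=False, regex=True, na=False)

# Keep only applicants who list both skills.
qualified = toy[has_ds & has_aws]
print(qualified)
```

The mask is aligned on index labels, so it selects the intended rows regardless of how many rows were removed earlier.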

In [23]:
# # Assumption : The 'Stream' feature of an applicant must have 'Data Science and Analytics' or 'CSE'

# # Pattern to be matched (Regular Expression)
# pattern = r"(?=.*\bcomputer science\b)|(?=.*\bcomputer science \& engineering\b)|(?=.*\bcse\b)"

# # If the required skills are found, replace the string at that index by 1, else replace by 0
# for i, stream in enumerate(df['Stream']):
#     match = re.search(pattern, str(stream), re.IGNORECASE)
#     if match:
#         pass
#     else:
#         df.at[i, 'Stream'] = np.nan
        
# df = df.dropna(subset=['Stream'])
# df.head()
# m = df.at[1, 'Python (out of 3)']
# print(type(m))

### We will now do feature engineering by adding an extra feature `Score` to our dataset, which calculates the score of the
### applicant taking into consideration the important features

- Custom devised formula : `5*Python + 4*ML + 2*DL + 2*NLP + Performance in UG`

In [24]:


# Convert columns to numeric data type
numeric_cols = ['Python (out of 3)', 'Machine Learning (out of 3)', 'Deep Learning (out of 3)',
                'Natural Language Processing (NLP) (out of 3)']
df[numeric_cols] = df[numeric_cols].astype(float)

# Fill any remaining missing values with a default value (0)
df = df.fillna(0)
# Calculate the 'Score' column ('Performance_PG' is not part of the formula)
df['Score'] = 5*df['Python (out of 3)'] + 4*df['Machine Learning (out of 3)'] + 2*df['Deep Learning (out of 3)'] + 2*df['Natural Language Processing (NLP) (out of 3)'] + df['Performance_UG'] 

df.head()

ID,Python (out of 3),Machine Learning (out of 3),Natural Language Processing (NLP) (out of 3),Deep Learning (out of 3),Other skills,"Are you available for 3 months, starting immediately, for a full-time work from home internship?",Degree,Stream,Current Year Of Graduation,Performance_PG,Performance_UG,Score
1,2.0,0.0,0.0,0.0,1,"Yes, I am available for 3 months starting imme...",B.Tech,Computer Science & Engineering,2024.0,0,8.9,18.9
30,1.0,2.0,2.0,2.0,1,"Yes, I am available for 3 months starting imme...",Master of Technology (M.Tech),Statistics Computing( DATA Science ),2022.0,0,5.9,26.9
59,0.0,0.0,0.0,0.0,1,"Yes, I am available for 3 months starting imme...",Bachelor of Science (B.Sc),Physics,2022.0,0,7.28,7.28
85,2.0,1.0,0.0,0.0,1,"Yes, I am available for 3 months starting imme...",B.Tech,Electronics and Communication,2023.0,7.42/10,0.0,14.0
104,2.0,2.0,0.0,2.0,1,"Yes, I am available for 3 months starting imme...",B.Tech,Computer Science & Engineering,2023.0,0,9.2,31.2
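
As a check on the formula, the scores in the table above can be recomputed by hand. The sketch below uses a hypothetical helper `score` (not part of the notebook) with values copied from rows 30 and 104, and also computes the highest score the formula allows when every skill is rated 3 and 'Performance_UG' is 10.

```python
def score(python, ml, dl, nlp, performance_ug):
    """Score formula from the notebook: 5*Python + 4*ML + 2*DL + 2*NLP + UG performance."""
    return 5 * python + 4 * ml + 2 * dl + 2 * nlp + performance_ug

# Values copied from the table above (applicants 30 and 104).
score_30 = score(python=1, ml=2, dl=2, nlp=2, performance_ug=5.9)
score_104 = score(python=2, ml=2, dl=2, nlp=0, performance_ug=9.2)

# Upper bound: all four skills rated 3 out of 3 and a UG performance of 10.
max_score = score(python=3, ml=3, dl=3, nlp=3, performance_ug=10)

print(round(score_30, 2), round(score_104, 2), max_score)
```

Under this formula Python carries the largest weight, so a one-point difference in the Python rating outweighs a one-point difference in any other skill.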


### Finding the Best Intern Candidate

In [25]:
max_score_id = df['Score'].idxmax()

print("Applicant {} is the best candidate for ML Intern at Digipplus with max score of {}".format(max_score_id, df.loc[max_score_id, 'Score']))

Applicant 1079 is the best candidate for ML Intern at Digipplus with max score of 49.0
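
`idxmax()` returns only the first label that attains the maximum, so a tie would go unnoticed. A minimal sketch of how ties could be surfaced with `DataFrame.nlargest` follows. The frame `toy` is hypothetical, and the applicant with label 2000 is invented purely to illustrate a tie.

```python
import pandas as pd

# Hypothetical scores; label 2000 is an invented applicant tied for first place.
toy = pd.DataFrame({"Score": [31.2, 49.0, 26.9, 49.0]}, index=[104, 1079, 30, 2000])

# idxmax() reports only the first applicant with the top score.
first_best = toy["Score"].idxmax()

# keep="all" returns every row tied for the top score.
top = toy.nlargest(1, "Score", keep="all")

print(first_best)
print(top)
```

If more than one row comes back, a tie-breaking rule (for example 'Performance_PG') would be needed to pick a single candidate.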
