In [46]:
import pandas as pd
import numpy as np

We start by loading the data and then check whether the DataFrame needs any cleaning before we use it.


In [47]:


def loadData(file_path):
    """
    Load data from a CSV file.
    """
    try:
        data = pd.read_csv(file_path)
        return data
    except FileNotFoundError:
        print(f"File {file_path} not found.")
        return None
    
    
df = loadData("Data/UpdatedResumeDataSet.csv")

print(df.head())
print(df.columns)
print(df.info())
print(df.dtypes)

       Category                                             Resume
0  Data Science  Skills * Programming Languages: Python (pandas...
1  Data Science  Education Details \r\nMay 2013 to May 2017 B.E...
2  Data Science  Areas of Interest Deep Learning, Control Syste...
3  Data Science  Skills â¢ R â¢ Python â¢ SAP HANA â¢ Table...
4  Data Science  Education Details \r\n MCA   YMCAUST,  Faridab...
Index(['Category', 'Resume'], dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 962 entries, 0 to 961
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  962 non-null    object
 1   Resume    962 non-null    object
dtypes: object(2)
memory usage: 15.2+ KB
None
Category    object
Resume      object
dtype: object


The DataFrame has two columns, both with the object dtype:
- Category
- Resume

The Resume column holds the text of the resumes we would like to train our models on.
The Category column holds the category assigned to each resume. Since every resume has exactly one category, we will later use this column as the target for multi-class classification, so we can classify incoming resumes.
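As a rough sketch of how the Category column could become a numeric target later on, the snippet below uses `pd.factorize` on a small made-up frame. The `toy` frame and the `target` column name are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd

# Hypothetical toy frame with the same two columns as the real dataset.
toy = pd.DataFrame({
    "Category": ["Data Science", "HR", "Data Science", "Testing"],
    "Resume": ["python pandas", "recruiting", "deep learning", "manual testing"],
})

# factorize assigns one integer code per distinct category,
# numbered in order of first appearance.
codes, labels = pd.factorize(toy["Category"])
toy["target"] = codes  # "target" is an illustrative column name

print(toy[["Category", "target"]])
print(list(labels))
```

Keeping `labels` around makes it possible to map predicted codes back to category names.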

In [48]:
df.isnull().sum()

Category    0
Resume      0
dtype: int64

We have no null values to fix, which is good.

In [49]:
# Let's check the number of unique values in the Category column

print(df['Category'].nunique())
df['Category'].unique()

25


array(['Data Science', 'HR', 'Advocate', 'Arts', 'Web Designing',
       'Mechanical Engineer', 'Sales', 'Health and fitness',
       'Civil Engineer', 'Java Developer', 'Business Analyst',
       'SAP Developer', 'Automation Testing', 'Electrical Engineering',
       'Operations Manager', 'Python Developer', 'DevOps Engineer',
       'Network Security Engineer', 'PMO', 'Database', 'Hadoop',
       'ETL Developer', 'DotNet Developer', 'Blockchain', 'Testing'],
      dtype=object)

So we have 25 different categories, which is handy for a first screening of resumes.
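Knowing the number of categories says nothing about how balanced they are. A minimal sketch of how the class distribution could be inspected with `value_counts`, shown on a made-up frame rather than the real data:

```python
import pandas as pd

# Hypothetical toy data; the real notebook would use df["Category"].
toy = pd.DataFrame({"Category": ["HR", "HR", "Testing", "HR", "Testing", "Arts"]})

# Absolute number of rows per category, largest first.
counts = toy["Category"].value_counts()

# The same counts expressed as a share of all rows.
shares = toy["Category"].value_counts(normalize=True).round(2)

summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)
```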

In [50]:
# Sample of duplicated resumes (only the last expression in a cell is displayed)
df[df['Resume'].duplicated(keep=False)].sort_values(by='Resume').head(10)

# Distribution of text lengths among the duplicated resumes
df[df['Resume'].duplicated()]['Resume'].str.len().value_counts().head()

# Check how many resumes are empty strings
df[df['Resume'].str.strip() == ''].shape

(0, 2)
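The cell above uses both `duplicated(keep=False)` and the default `duplicated()`. A small sketch on a made-up Series (not the resume data) of how the two differ:

```python
import pandas as pd

# Hypothetical values: "a" appears three times, "b" and "c" once each.
s = pd.Series(["a", "b", "a", "c", "a"])

# Default keep='first': flags every occurrence after the first one.
later_copies = s.duplicated()

# keep=False: flags every member of a duplicated group, including the first.
all_copies = s.duplicated(keep=False)

print(later_copies.tolist())
print(all_copies.tolist())
```

The default form counts how many rows `drop_duplicates` would remove, while `keep=False` is the one to use when inspecting whole groups of identical rows.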

In [51]:
df['Resume'].value_counts().head(5)


Resume
Technical Skills Web Technologies: Angular JS, HTML5, CSS3, SASS, Bootstrap, Jquery, Javascript. Software: Brackets, Visual Studio, Photoshop, Visual Studio Code Education Details \r\nJanuary 2015 B.E CSE Nagpur, Maharashtra G.H.Raisoni College of Engineering\r\nOctober 2009  Photography Competition Click Nagpur, Maharashtra Maharashtra State Board\r\n    College Magazine OCEAN\r\nWeb Designer \r\n\r\nWeb Designer - Trust Systems and Software\r\nSkill Details \r\nPHOTOSHOP- Exprience - 28 months\r\nBOOTSTRAP- Exprience - 6 months\r\nHTML5- Exprience - 6 months\r\nJAVASCRIPT- Exprience - 6 months\r\nCSS3- Exprience - Less than 1 year months\r\nAngular 4- Exprience - Less than 1 year monthsCompany Details \r\ncompany - Trust Systems and Software\r\ndescription - Projects worked on:\r\n1. TrustBank-CBS\r\nProject Description: TrustBank-CBS is a core banking solution by Trust Systems.\r\nRoles and Responsibility:\r\nâ Renovated complete UI to make it more modern, user-friendly, ma

In [52]:
df[df['Resume'].duplicated(keep=False)].sort_values(by='Resume').head(10)['Resume'].values


array(["* Excellent grasping power in learning new concepts and technology. * Highly motivated team player with strong work ethics, committed to hard work. * Ability to work and co-ordinate in a team effectively. * Enthusiastic self-starter and team player. * Quick and independent learner.Education Details \r\nJanuary 2014 Bachelor of Technology Information Technology branch  BPUT University\r\nJanuary 2010 Diploma Engineering Brahmapur, Orissa U.C.P Engineering School\r\nSoftware Testing & Automation Engineer \r\n\r\nSoftware Testing & Automation Engineer - Tech Mahindra\r\nSkill Details \r\nCompany Details \r\ncompany - Tech Mahindra\r\ndescription - India\r\nDuration       Oct 2017- Till Date\r\n\r\nProject Description\r\nBT Group plc (trading as BT and formerly British Telecom) is a British multinational telecommunications holding company with head offices in London, United Kingdom. I worked for Air Logistics Program under the banner of British Telecom. This project handles all the

In [54]:
dup_resumes = df[df['Resume'].duplicated(keep=False)]

dup_resumes.value_counts()


Category            Resume

In [55]:
df_cleaned = df.drop_duplicates(subset='Resume', keep='first')

print(df_cleaned)
print(df_cleaned.count())
print(df_cleaned['Resume'].nunique())

         Category                                             Resume
0    Data Science  Skills * Programming Languages: Python (pandas...
1    Data Science  Education Details \r\nMay 2013 to May 2017 B.E...
2    Data Science  Areas of Interest Deep Learning, Control Syste...
3    Data Science  Skills â¢ R â¢ Python â¢ SAP HANA â¢ Table...
4    Data Science  Education Details \r\n MCA   YMCAUST,  Faridab...
..            ...                                                ...
894       Testing  Computer Skills: â¢ Proficient in MS office (...
895       Testing  â Willingness to accept the challenges. â ...
896       Testing  PERSONAL SKILLS â¢ Quick learner, â¢ Eagerne...
897       Testing  COMPUTER SKILLS & SOFTWARE KNOWLEDGE MS-Power ...
898       Testing  Skill Set OS Windows XP/7/8/8.1/10 Database MY...

[166 rows x 2 columns]
Category    166
Resume      166
dtype: int64
166


### We will try to find a different dataset. This one has only 166 unique resumes among its 962 rows, which is too few for training.
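Before switching datasets, it can help to quantify the duplication and check whether any duplicated text carries more than one label. A minimal sketch on a made-up frame with the same column names; the values are invented for illustration:

```python
import pandas as pd

# Hypothetical toy data with the same column names as the real dataset.
toy = pd.DataFrame({
    "Category": ["HR", "HR", "HR", "Testing", "Testing", "Arts"],
    "Resume": ["r1", "r1", "r2", "r3", "r3", "r1"],
})

n_rows = len(toy)
n_unique = toy["Resume"].nunique()
dup_share = 1 - n_unique / n_rows  # fraction of rows that are repeats

# Distinct resume texts available per category.
per_category = toy.groupby("Category")["Resume"].nunique()

# Resume texts that appear under more than one category (conflicting labels).
labels_per_resume = toy.groupby("Resume")["Category"].nunique()
conflicting = labels_per_resume[labels_per_resume > 1].index.tolist()

print(f"{n_unique} unique of {n_rows} rows ({dup_share:.0%} repeats)")
print(per_category)
print(conflicting)
```

Applied to the real data, the same arithmetic gives 1 - 166/962, so roughly 83% of the rows are repeats.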