# Resumes/Response dataset 

Sheet_2.csv contains 125 resumes, in the resume_text column. Resumes were queried from Indeed.com with keyword 'data scientist', location 'Vermont'. If a resume is 'not flagged', the applicant can submit a modified resume version at a later date. If it is 'flagged', the applicant is invited to interview.

The task here is to classify new resumes/responses as flagged or not flagged.

There are two sets of data here - resumes and responses. Split the data into a train set and a test set to test the accuracy of your classifier. Bonus points for using the same classifier for both problems.
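
A minimal sketch of the split-and-score step, assuming scikit-learn is installed. The toy texts and labels are made up for illustration and are not taken from Sheet_2.csv. The choice of TF-IDF features with logistic regression is an assumption, not necessarily the model used later in this notebook. Wrapping the vectorizer and the classifier in one pipeline means the same builder can be refit on the responses data, which is what the bonus asks for.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the resume_text and class columns.
texts = [
    "python machine learning statistics modelling",
    "deep learning python data pipelines",
    "statistics regression python analysis",
    "machine learning research data science",
    "customer service retail cashier",
    "warehouse forklift operator shipping",
    "restaurant server customer service",
    "retail sales associate inventory",
]
labels = ["flagged"] * 4 + ["not_flagged"] * 4

# Stratify so that both classes appear in the train set and the test set.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

def build_classifier():
    """Return a fresh, unfitted text classifier (vectorizer + model)."""
    return make_pipeline(TfidfVectorizer(), LogisticRegression())

clf = build_classifier()
clf.fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```

With only 125 resumes, a single split gives a noisy accuracy estimate, so the number printed here should be read as a rough indication only.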

Diving Deep!!!

The pipeline (this is a classification challenge):
1. Load data
2. Feature extraction of the data (This helps us to understand the dataset)
3. Preprocessing 
4. Tokenization
5. Stemming & Lemmatization
6. Normalization
7. Modelling
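
The middle steps of the list above can be sketched as a chain of small functions. This is an illustrative outline using only the standard library: the function names, the tiny stop-word list, and the sample text are all assumptions, and stemming/lemmatization (step 5) is left out.

```python
import string

# Hypothetical, deliberately tiny stop-word list; NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to", "with"}

def preprocess(text):
    """Step 3: replace carriage returns, lower-case, and strip punctuation."""
    text = text.replace("\r", " ").lower()
    return text.translate(str.maketrans("", "", string.punctuation))

def tokenize(text):
    """Step 4: split on any run of whitespace."""
    return text.split()

def normalize(tokens):
    """Step 6 (simplified): drop stop words and purely numeric tokens."""
    return [t for t in tokens if t not in STOP_WORDS and not t.isdigit()]

def clean(text):
    """Chain the steps; stemming/lemmatization (step 5) is not part of this sketch."""
    return normalize(tokenize(preprocess(text)))

# Made-up sample using the same "\r" line breaks seen in resume_text.
sample = "\rData Analyst - Acme Corp\rWorked with the SQL team in 2019"
print(clean(sample))
```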




In [1]:
import warnings
warnings.filterwarnings("ignore")                     #Ignoring unnecessary warnings

import numpy as np                                  #for large and multi-dimensional arrays
import pandas as pd                                 #for data manipulation and analysis
import nltk                                         #Natural language processing tool-kit

from nltk.corpus import stopwords                   #Stopwords corpus
from nltk.stem import PorterStemmer                 # Stemmer

from sklearn.feature_extraction.text import CountVectorizer          #For Bag of words
from sklearn.feature_extraction.text import TfidfVectorizer          #For TF-IDF
#from gensim.models import Word2Vec                                   #For Word2Vec

In [10]:
#Load the dataset and show the first 5 rows
resume = pd.read_csv('D:\\R-Projects\\Sheet_2.csv', encoding='latin-1')
resume.head()

Unnamed: 0,resume_id,class,resume_text
0,resume_1,not_flagged,\rCustomer Service Supervisor/Tier - Isabella ...
1,resume_2,not_flagged,\rEngineer / Scientist - IBM Microelectronics ...
2,resume_3,not_flagged,\rLTS Software Engineer Computational Lithogra...
3,resume_4,not_flagged,TUTOR\rWilliston VT - Email me on Indeed: ind...
4,resume_5,flagged,\rIndependent Consultant - Self-employed\rBurl...


# 1.0 Feature extraction Phase

From the output above, we can see that our dataset contains punctuation, capitalised words, stop words, and carriage returns (\r).

These are the issues to deal with during preprocessing.

In [3]:
#How many rows and columns do we have, and what are the columns called?
print('Dataset size:',resume.shape)
print('Columns:',resume.columns)

Dataset size: (125, 3)
Columns: Index(['resume_id', 'class', 'resume_text'], dtype='object')


In [4]:
#Let's check the word count of each resume
#This can help us discover which resumes are shorter: the flagged or the not-flagged ones
#Note: split(" ") splits on single spaces only, so words separated by \r are counted as one

resume['word_count'] = resume['resume_text'].apply(lambda x: len(str(x).split(" ")))
resume[['resume_text','word_count']].head()

Unnamed: 0,resume_text,word_count
0,\rCustomer Service Supervisor/Tier - Isabella ...,723
1,\rEngineer / Scientist - IBM Microelectronics ...,367
2,\rLTS Software Engineer Computational Lithogra...,803
3,TUTOR\rWilliston VT - Email me on Indeed: ind...,603
4,\rIndependent Consultant - Self-employed\rBurl...,543
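
The word counts above depend on how the text is split. A small sketch, using a made-up sample string, shows how splitting on a single space differs from splitting on any whitespace when the text contains "\r" line breaks:

```python
# Made-up sample text using the same "\r" line breaks seen in resume_text.
sample = "Data Scientist\rMontpelier VT"

# split(" ") breaks only on single space characters, so
# "Scientist\rMontpelier" stays together as one token.
by_space = sample.split(" ")

# split() with no argument breaks on any run of whitespace, including "\r".
by_whitespace = sample.split()

print(len(by_space), by_space)
print(len(by_whitespace), by_whitespace)
```

Either choice can work as a rough length feature, as long as the same rule is applied to every resume.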


In [5]:
#Let's also get the number of characters in each resume
#This can help us discover which resumes have fewer characters: the flagged or the not-flagged ones

resume['char_count'] = resume['resume_text'].str.len() # Character count includes spaces
resume[['resume_text','char_count']].head()

Unnamed: 0,resume_text,char_count
0,\rCustomer Service Supervisor/Tier - Isabella ...,5505
1,\rEngineer / Scientist - IBM Microelectronics ...,3129
2,\rLTS Software Engineer Computational Lithogra...,5779
3,TUTOR\rWilliston VT - Email me on Indeed: ind...,4405
4,\rIndependent Consultant - Self-employed\rBurl...,4160
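
To answer the question of which class has the shorter resumes, the length features can be averaged per class. The sketch below assumes pandas is available and uses a hypothetical four-row frame with the notebook's column names; the texts and the resulting numbers are made up and say nothing about the real data.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as the notebook's DataFrame.
toy = pd.DataFrame({
    "class": ["flagged", "not_flagged", "flagged", "not_flagged"],
    "resume_text": [
        "python statistics machine learning",
        "customer service",
        "data science research and modelling",
        "retail sales associate",
    ],
})

# Length features: whitespace-separated words and raw characters.
toy["word_count"] = toy["resume_text"].str.split().str.len()
toy["char_count"] = toy["resume_text"].str.len()

# Average length per class; a large gap would suggest length is a useful feature.
summary = toy.groupby("class")[["word_count", "char_count"]].mean()
print(summary)
```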


In [6]:
#Now we want to count the stop words in each resume
#First, we load the NLTK English stop-word list
from nltk.corpus import stopwords
stop = stopwords.words('english')
#Then we count the tokens that appear in the stop-word list (the match is case-sensitive)
resume['StopWords Count'] = resume['resume_text'].apply(lambda text: len([word for word in text.split() if word in stop]))
resume[['resume_text','StopWords Count']].head()

Unnamed: 0,resume_text,StopWords Count
0,\rCustomer Service Supervisor/Tier - Isabella ...,152
1,\rEngineer / Scientist - IBM Microelectronics ...,62
2,\rLTS Software Engineer Computational Lithogra...,182
3,TUTOR\rWilliston VT - Email me on Indeed: ind...,153
4,\rIndependent Consultant - Self-employed\rBurl...,137


From the output above, we discover another issue to deal with: removing stop words.
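
One detail worth checking before removing stop words is letter case. The sketch below uses a hypothetical four-word stop list and a made-up sentence to show that a case-sensitive membership test, like the one in the cell above, misses capitalised stop words such as "The".

```python
# Hypothetical mini stop-word list in lower case, standing in for the NLTK list.
stop = {"the", "and", "of", "in"}

text = "The analyst and the team"

# Case-sensitive test: "The" is not in the list, so it is not counted.
case_sensitive = sum(1 for word in text.split() if word in stop)

# Lower-casing each token first catches capitalised stop words too.
case_insensitive = sum(1 for word in text.split() if word.lower() in stop)

# Removing stop words with the case-insensitive test.
filtered = [word for word in text.split() if word.lower() not in stop]

print(case_sensitive, case_insensitive, filtered)
```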

In [7]:
#Count the tokens that start with '#' (hashtags) in each resume

resume['hashtags'] = resume['resume_text'].apply(lambda text: len([word for word in text.split() if word.startswith('#')]))
resume[['resume_text','hashtags']].head()


Unnamed: 0,resume_text,hashtags
0,\rCustomer Service Supervisor/Tier - Isabella ...,0
1,\rEngineer / Scientist - IBM Microelectronics ...,0
2,\rLTS Software Engineer Computational Lithogra...,0
3,TUTOR\rWilliston VT - Email me on Indeed: ind...,0
4,\rIndependent Consultant - Self-employed\rBurl...,0


We have found that there are no hashtags in the rows shown above, but we can write a more complete function to search for all types of special characters in our dataset.
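
One possible shape for such a function, sketched with the standard library only: it counts every character that appears in string.punctuation. The function name and the sample string are hypothetical.

```python
import string
from collections import Counter

def count_special_characters(text):
    """Count each character of `text` that appears in string.punctuation."""
    return Counter(ch for ch in text if ch in string.punctuation)

# Made-up sample string.
sample = "C++ / Python (5+ yrs), e-mail: jane@example.com #hiring"
counts = count_special_characters(sample)

print(counts)
print("Total special characters:", sum(counts.values()))
```

Applied to the resume_text column, the per-resume totals could serve as another feature alongside the word and character counts.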


In [11]:
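#This is the function to remove noise (URLs, @mentions, punctuation, stop words) and lemmatize tokens
import re, string
from nltk.tag import pos_tag                        #Part-of-speech tagger
from nltk.stem.wordnet import WordNetLemmatizer     #Lemmatizer

def remove_noise(tweet_tokens, stop_words = ()):

    lemmatizer = WordNetLemmatizer()                #Create the lemmatizer once, not once per token
    cleaned_tokens = []

    for token, tag in pos_tag(tweet_tokens):
        #Strip URLs and @mentions (raw strings avoid invalid escape sequences)
        token = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'
                       r'(?:%[0-9a-fA-F][0-9a-fA-F]))+','', token)
        token = re.sub(r"(@[A-Za-z0-9_]+)","", token)

        #Map the part-of-speech tag to a WordNet part of speech
        if tag.startswith("NN"):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'

        token = lemmatizer.lemmatize(token, pos)

        #Keep non-empty tokens that are neither punctuation nor stop words
        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())
    return cleaned_tokens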
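
To see what the two regular expressions in remove_noise do on their own, here is a small standalone sketch that needs no NLTK data. The helper name strip_links_and_mentions and the sample tokens are hypothetical. The patterns are the same ones used in the function, written as raw strings.

```python
import re

# Same two patterns as in remove_noise, written as raw strings.
URL_PATTERN = re.compile(
    r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|"
    r"(?:%[0-9a-fA-F][0-9a-fA-F]))+"
)
MENTION_PATTERN = re.compile(r"(@[A-Za-z0-9_]+)")

def strip_links_and_mentions(token):
    """Remove URLs and @handles from a single token."""
    token = URL_PATTERN.sub("", token)
    return MENTION_PATTERN.sub("", token)

# Made-up tokens: a label, a URL, a handle, and an ordinary word.
tokens = ["Portfolio:", "https://example.com/cv", "@jane_doe", "Python"]
cleaned = [strip_links_and_mentions(t) for t in tokens]
print(cleaned)
```

Tokens that consist only of a URL or a handle become empty strings, which is why remove_noise checks len(token) > 0 before keeping a token.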


In [16]:
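#Let's strip punctuation using string.punctuation
#Create a function that removes all punctuation from a string
def remove_punctuation(sentence: str) -> str:
    return sentence.translate(str.maketrans('', '', string.punctuation))

#Apply the function
#Note: iterating over a DataFrame yields its column names, so this line only cleans
#the column names (see the output); to clean the resumes themselves, use
#resume['resume_text'].apply(remove_punctuation) instead
[remove_punctuation(sentence) for sentence in resume]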

['resumeid', 'class', 'resumetext']
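
The output above contains column names because iterating over a DataFrame yields its column labels, not its rows. A short sketch on a hypothetical two-row frame (made-up texts, assuming pandas is available) contrasts that with applying the function to the text column:

```python
import string
import pandas as pd

def remove_punctuation(sentence: str) -> str:
    return sentence.translate(str.maketrans('', '', string.punctuation))

# Hypothetical two-row frame with the notebook's column names.
toy = pd.DataFrame({
    "resume_id": ["resume_a", "resume_b"],
    "class": ["flagged", "not_flagged"],
    "resume_text": ["Ph.D., data-science!", "Sales (retail) & support."],
})

# Iterating over a DataFrame yields column labels, so only the labels get cleaned.
column_labels = [remove_punctuation(label) for label in toy]

# Applying the function to the text column cleans every resume instead.
toy["clean_text"] = toy["resume_text"].apply(remove_punctuation)

print(column_labels)
print(toy["clean_text"].tolist())
```

Note that deleting punctuation outright joins hyphenated words ("data-science" becomes "datascience"). Replacing punctuation with a space is an alternative if those words should stay separate.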