As a Data Analyst at a leading global HR consultancy, your mission is to delve into an extensive database of resumes to identify suitable candidates for tech-focused roles. This task involves using regular expressions to extract key data points and applying data preprocessing techniques to organize this information effectively.

### Dataset Summary

`resumes.csv`

| Column      | Data Type | Description                                                  |
|-------------|-----------|--------------------------------------------------------------|
| `ID`        | float     | Unique identifier for each resume.                           |
| `Resume_str`| object    | Full text of the resume, rich with details for analysis.     |
| `Category`  | object    | Job category of the resume, indicating the field of expertise. |


### Apply your regular expression and data preprocessing skills to organize the resume data efficiently.

### Generate a new DataFrame named `candidates_df` containing the following columns:
+ **id**: Candidate ID.
+ **job_title**: Most recent job title.
+ **tech_skills**: List of technical skills that may include Python, SQL, R, and Excel.
+ **education**: Highest educational degree such as PhD, Master, or Bachelor.
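
The two keyword-based columns can be sketched on a made-up resume line before touching the real data. This is only an illustration under stated assumptions: the degree ranking, the pattern spellings, and the helper names `highest_degree` and `find_skills` are hypothetical and do not come from the dataset.

```python
import re

# Assumed ranking, highest degree first. The apostrophe is optional so that
# "Master's" and "Masters" are both recognized.
DEGREE_PATTERNS = [
    ("PhD", r"\b(?:ph\.?d|doctorate)\b"),
    ("Master", r"\bmaster'?s?\b"),
    ("Bachelor", r"\bbachelor'?s?\b"),
]

# Skills named in the task. "R" stays case-sensitive so that it only matches
# the standalone capital letter.
SKILL_PATTERNS = {
    "Python": r"(?i)\bpython\b",
    "SQL": r"(?i)\bsql\b",
    "R": r"\bR\b",
    "Excel": r"(?i)\bexcel\b",
}


def highest_degree(text):
    """Return the first (highest-ranked) degree mentioned in text, or None."""
    for degree, pattern in DEGREE_PATTERNS:
        if re.search(pattern, text, re.IGNORECASE):
            return degree
    return None


def find_skills(text):
    """Return the listed skills mentioned in text, in dictionary order."""
    return [name for name, pattern in SKILL_PATTERNS.items()
            if re.search(pattern, text)]


sample = "Master's in Statistics, Bachelor of Science. Tools: Excel, SQL, R and python."
print(highest_degree(sample))  # Master
print(find_skills(sample))     # ['Python', 'SQL', 'R', 'Excel']
```

Keyword matching of this kind is a heuristic: "excel" used as a verb or "Scrum Master" as a job title would also match, so results on real resumes are worth spot-checking.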


In [47]:
import pandas as pd
import re

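# Load the resume dataset from a CSV file into a DataFrame
resumes = pd.read_csv('resumes.csv')


def clean_text(text):
    """Keep only letters and single spaces; return '' for missing text."""
    if not isinstance(text, str):
        return ''
    # Drop digits, punctuation, and symbols (this also removes apostrophes)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Collapse runs of whitespace, including newlines, into single spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text


resumes['Resume_str'] = resumes['Resume_str'].apply(clean_text)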
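
To see what a letters-only filter of this kind does to a resume, the sketch below applies the same two substitutions to a made-up line. The helper name `strip_non_letters` and the sample text are hypothetical.

```python
import re


def strip_non_letters(text):
    """Hypothetical helper: keep only ASCII letters and single spaces."""
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()


raw = "SENIOR ANALYST\n\n2015-2019: built ETL jobs in C++ and SQL (T-SQL)."
print(strip_non_letters(raw))
# SENIOR ANALYST built ETL jobs in C and SQL TSQL
```

Dates disappear, `C++` shrinks to `C`, and `T-SQL` fuses into `TSQL`, so any pattern applied after cleaning has to be written against this reduced text.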

In [None]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk import pos_tag, ne_chunk
import os

# Get the current working directory
current_directory = os.getcwd()
nltk.data.path.append(current_directory)

# Download necessary NLTK resources
nltk.download('punkt', download_dir=current_directory)
nltk.download('averaged_perceptron_tagger', download_dir=current_directory)
nltk.download('maxent_ne_chunker', download_dir=current_directory)
nltk.download('words', download_dir=current_directory)

In [None]:
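# Leading run of capitals at the start of the resume, taken as the most recent job title
job_title_pattern = r"^([A-Z][A-Z\s.,\-]*)\b"

# Degrees ordered from highest to lowest; the apostrophe is optional because
# clean_text strips it ("Master's" becomes "Masters")
degree_patterns = [
    ('PhD', r"\b(?:ph\.?d|doctorate)\b"),
    ('Master', r"\bmaster'?s?\b"),
    ('Bachelor', r"\bbachelor'?s?\b"),
]

# Skills named in the task; "R" is matched case-sensitively as a standalone capital
skill_patterns = {
    'Python': r"(?i)\bpython\b",
    'SQL': r"(?i)\bsql\b",
    'R': r"\bR\b",
    'Excel': r"(?i)\bexcel\b",
}

job_title = []
education = []
tech_skills = []

for resume_text in resumes['Resume_str']:
    # Job title: leading uppercase run, or None if the resume does not start with one
    title_match = re.match(job_title_pattern, resume_text)
    job_title.append(title_match.group(1).strip() if title_match else None)

    # Education: the first (highest) degree whose pattern occurs anywhere in the text
    education.append(next(
        (degree for degree, pattern in degree_patterns
         if re.search(pattern, resume_text, re.IGNORECASE)),
        None,
    ))

    # Tech skills: every listed skill mentioned in the resume
    tech_skills.append([skill for skill, pattern in skill_patterns.items()
                        if re.search(pattern, resume_text)])

candidates_df = pd.DataFrame({
    'id': resumes['ID'],
    'job_title': job_title,
    'tech_skills': tech_skills,
    'education': education,
})
candidates_df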
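
Once a table with these columns exists, shortlisting candidates for tech-focused roles comes down to a boolean filter. The sketch below uses a small made-up DataFrame in the same shape as `candidates_df`. The rows are invented and the rule (at least one listed skill and a recognized degree) is an assumed criterion, not one stated in the task.

```python
import pandas as pd

# Made-up rows in the shape of candidates_df; not taken from resumes.csv
demo_df = pd.DataFrame({
    "id": [1, 2, 3],
    "job_title": ["DATA ANALYST", "HR MANAGER", "SOFTWARE ENGINEER"],
    "tech_skills": [["Python", "SQL"], [], ["Python"]],
    "education": ["Master", "Bachelor", None],
})

# True where the skill list has at least one entry
has_skills = demo_df["tech_skills"].apply(len) > 0

# Assumed shortlist rule: some listed skill and a recognized degree
shortlist = demo_df[has_skills & demo_df["education"].notna()]
print(shortlist["id"].tolist())  # [1]
```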