### Python Notebook for Data Preprocessing:

This notebook performs **data preprocessing** for a resume generation AI model. It processes raw input data to make it suitable for machine learning.

#### **Steps in the Notebook:**
1. **Import Required Libraries**  
   - Uses `pandas` for data handling, `numpy` for numerical operations, and `sklearn` for preprocessing.

2. **Define `preprocess_data` Function**  
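   - Imputes missing numerical values with the column mean (if `handle_missing` is enabled).  
   - Standardizes numerical features (Years of Experience, Salary) to zero mean and unit variance using `StandardScaler` (if `normalize` is enabled).  
   - One-hot encodes categorical features (Job Title, Education Level, Skills, Location) with `pd.get_dummies(drop_first=True)`; each distinct Skills string is treated as a single category.  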
   - Returns a cleaned and structured DataFrame ready for model training.

3. **Define `clean_text` Function**  
   - Removes special characters and extra spaces from resume text fields.

4. **Create Sample Data**  
   - Generates a small dataset containing job-related information (Job Title, Skills, Experience, etc.).

5. **Apply Preprocessing to the Sample Data**  
   - Calls `preprocess_data()` to clean, normalize, and encode the data.  
   - Displays the final structured DataFrame.
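
The `clean_text` function from step 3 is not shown in the cells below. A minimal sketch of what such a helper could look like, assuming "special characters" means anything other than ASCII letters, digits, and whitespace (the regular expressions and the sample string are illustrative assumptions, not the notebook's actual implementation):

```python
import re


def clean_text(text):
    """Hypothetical sketch: strip special characters and collapse extra spaces."""
    # Replace every character that is not a letter, digit, or whitespace with a space
    text = re.sub(r"[^A-Za-z0-9\s]", " ", str(text))
    # Collapse runs of whitespace into a single space and trim both ends
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("  Senior   Data-Scientist!! (Python/SQL)  "))
# prints: Senior Data Scientist Python SQL
```

A pattern this strict would also turn a skill such as `C++` into `C`, so the allowed character set may need to be widened for real resume text.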




| Job Title         | Years of Experience | Education Level | Skills                       | Salary  | Location      |
|------------------|--------------------|----------------|-----------------------------|---------|--------------|
| Software Engineer | 3                  | Masters        | Python, SQL                 | 70000   | New York     |
| Data Scientist    | 5                  | PhD            | Python, Machine Learning    | 120000  | San Francisco|
| Web Developer    | 2                  | Bachelors      | JavaScript, HTML, CSS       | 60000   | Chicago      |


In [11]:
def preprocess_data(data, normalize=True, handle_missing=True):
    """
    Clean and preprocess input data for the resume generation AI model.

    Parameters
    ----------
    data : pandas.DataFrame
        Raw input data containing resume information such as job title, skills, education, etc.
    normalize : bool, default=True
        Whether to standardize numerical features (e.g., years of experience, salary).
    handle_missing : bool, default=True
        Whether to impute missing numerical values with the column mean.
        
    Returns
    -------
    pandas.DataFrame
        Preprocessed data ready for input into the AI model.
    """
    # Work on a copy so the caller's DataFrame is not modified in place
    data = data.copy()

    # 1. Handle Missing Values
    if handle_missing:
        # Impute only numerical columns; categorical columns are left unchanged
        numerical_columns = data.select_dtypes(include=[np.number]).columns
        imputer = SimpleImputer(strategy='mean')
        data[numerical_columns] = imputer.fit_transform(data[numerical_columns])

    # 2. Normalize Numerical Features
    if normalize:
        numerical_columns = data.select_dtypes(include=[np.number]).columns
        scaler = StandardScaler()
        data[numerical_columns] = scaler.fit_transform(data[numerical_columns])

    # 3. Encode Categorical Features (Job Title, Education Level, etc.)
    data = pd.get_dummies(data, drop_first=True)

    return data
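
The sample data used in this notebook has no missing values, so the imputation step is never exercised. The following sketch shows what mean imputation does on its own; the toy DataFrame with `NaN` entries is a made-up illustration, not data from the notebook:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value in each numerical column
df = pd.DataFrame({
    "Years of Experience": [3.0, np.nan, 2.0],
    "Salary": [70000.0, 120000.0, np.nan],
})

# Each NaN is replaced by the mean of the non-missing values in its column
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```

Here the missing Years of Experience becomes 2.5 (the mean of 3 and 2) and the missing Salary becomes 95000 (the mean of 70000 and 120000).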


In [12]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# Sample Data (replace this with any data you want)
sample_data = {
    'Job Title': ['Software Engineer', 'Data Scientist', 'Web Developer'],
    'Years of Experience': [3, 5, 2],
    'Education Level': ['Masters', 'PhD', 'Bachelors'],
    'Skills': ['Python, SQL', 'Python, Machine Learning', 'JavaScript, HTML, CSS'],
    'Salary': [70000, 120000, 60000],
    'Location': ['New York', 'San Francisco', 'Chicago']
}

# Convert sample data to a DataFrame
raw_data = pd.DataFrame(sample_data)

# Preprocess data
processed_data = preprocess_data(raw_data)

# Display the preprocessed data
print(processed_data)


   Years of Experience    Salary  Job Title_Software Engineer  \
0            -0.267261 -0.508001                         True   
1             1.336306  1.397001                        False   
2            -1.069045 -0.889001                        False   

   Job Title_Web Developer  Education Level_Masters  Education Level_PhD  \
0                    False                     True                False   
1                    False                    False                 True   
2                     True                    False                False   

   Skills_Python, Machine Learning  Skills_Python, SQL  Location_New York  \
0                            False                True               True   
1                             True               False              False   
2                            False               False              False   

   Location_San Francisco  
0                   False  
1                    True  
2                   False  
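
The output above contains columns such as `Skills_Python, SQL`, because `pd.get_dummies` treats each whole Skills string as one category. If one indicator per individual skill is preferred, a possible alternative (an assumption about the intended behavior, not something the notebook does) is `Series.str.get_dummies`, which splits on a separator first. The `Skill_` prefix is an arbitrary naming choice:

```python
import pandas as pd

skills = pd.Series(
    ["Python, SQL", "Python, Machine Learning", "JavaScript, HTML, CSS"],
    name="Skills",
)

# Split each string on ", " and create one indicator column per individual skill
skill_flags = skills.str.get_dummies(sep=", ").add_prefix("Skill_")
print(skill_flags)
```

With this encoding, the first two rows share a `Skill_Python` indicator instead of falling into two unrelated categories.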


### Explanation of the Table:

- **Numerical values** (Years of Experience, Salary) have been **normalized** using StandardScaler.
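- **Categorical values** (Job Title, Education Level, Skills, Location) have been **one-hot encoded** into binary columns. Because `drop_first=True`, the first category of each feature in sorted order (Data Scientist, Bachelors, "JavaScript, HTML, CSS", Chicago) gets no column of its own; a row in which all of a feature's columns are 0 belongs to that dropped category.
- In the table below, **1 = True** (the feature applies to this row) and **0 = False** (the feature does not apply), matching the `True`/`False` values in the printed output.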
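
The standardized values can be checked by hand. A short sketch, using only NumPy, that reproduces the Years of Experience column from the raw values 3, 5, and 2:

```python
import numpy as np

years = np.array([3, 5, 2], dtype=float)

# StandardScaler divides by the population standard deviation (ddof=0)
z_scores = (years - years.mean()) / years.std(ddof=0)
print(np.round(z_scores, 6))
```

The result matches the -0.267261, 1.336306, and -1.069045 shown in the output. Using the sample standard deviation instead (`ddof=1`, the default of pandas' `Series.std`) would give different numbers.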

#### **Outcome:**
- The processed dataset contains only standardized numerical columns and binary indicator columns, so it can be passed directly to a machine learning model as input for the resume generation task.

|   Years of Experience |   Salary  | Job Title_Software Engineer | Job Title_Web Developer | Education Level_Masters | Education Level_PhD | Skills_Python, Machine Learning | Skills_Python, SQL | Location_New York | Location_San Francisco |
|----------------------:|----------:|----------------------------:|------------------------:|------------------------:|--------------------:|--------------------------------:|------------------:|-----------------:|--------------------:|
|            -0.267261 | -0.508001 |                           1 |                      0 |                      1 |                  0 |                              0 |                 1 |                1 |                  0 |
|             1.336306 |  1.397001 |                           0 |                      0 |                      0 |                  1 |                              1 |                 0 |                0 |                  1 |
|            -1.069045 | -0.889001 |                           0 |                      1 |                      0 |                  0 |                              0 |                 0 |                0 |                  0 |
