# 🔹 Synthetic Job Listings for Testing

Below are **synthetic, clearly-labeled example job descriptions** you can use for testing.  
All examples are **safe** (no real contact/payment instructions). Each includes **red flags** a detector should catch.

---

<details>
<summary>⚠️ Fraudulent - Example 1 (obvious)</summary>

**Label:** 🔴 Fraudulent  
**Title:** 💸 Earn $5,000/week - Remote Data Entry (No Experience)  

**Description:**  
We are hiring immediately for remote data entry positions. Work from home with flexible hours - **no interview required**. Fill in simple forms and get paid weekly via instant transfer. Training takes 10 minutes. Applicants selected on a **first-come, first-served** basis. Apply now and start today!  

**Requirements:** None. Must have a computer and internet.  
**Company:** Global Opportunity Solutions  

**🔍 Red flags:** unrealistic pay for trivial work, “no interview,” immediate hire, vague company name, urgent language.  

</details>

<details>
<summary>⚠️ Fraudulent - Example 2 (obvious)</summary>

**Label:** 🔴 Fraudulent  
**Title:** 💼 Entry-Level Account Manager - Remote  

**Description:**  
Join our team as an account manager. **Guaranteed high commissions and weekly payouts**. No experience needed - onboarding is instant. Handle confidential transactions and help customers move funds. Excellent opportunity for quick earnings.  

Send application to recruitment team to be fast-tracked.  

**Company:** International Financial Services  

**🔍 Red flags:** vague duties promising quick earnings, mention of handling transactions (without company details), high-pressure language.  

</details>

<details>
<summary>⚠️ Partly Fraudulent / Ambiguous - Example 3</summary>

**Label:** 🟠 Partly Fraudulent  
**Title:** 📈 Sales Representative (Remote/Contract)  

**Description:**  
Small startup seeks remote sales reps. Base pay + commission. Must be confident in closing deals remotely. Brief **paid test task** may be requested. Training provided.  

Company website exists but with limited history. Some applicants reported **long onboarding delays**.  

**Requirements:** 1+ year sales experience preferred.  

**🔍 Red flags (mixed):** legitimate-sounding role but weak company presence, mention of paid test or onboarding delays - could be legit or poorly-run.  

</details>

<details>
<summary>⚠️ Partly Fraudulent / Ambiguous - Example 4</summary>

**Label:** 🟠 Partly Fraudulent  
**Title:** 🎧 Customer Support Agent - Flexible Hours  

**Description:**  
Help customers over chat/email. Competitive hourly rate. Short trial period pays **small stipend**. Interviews mostly via messaging. Overseas e-commerce partner - hiring fast.  

Some job details are vague; contact email uses a generic domain.  

**🔍 Red flags (mixed):** generic contact, vague details, messaging-only interview - suspicious but not definitely fraudulent.  

</details>

<details>
<summary>✅ Authentic - Example 5 (legitimate)</summary>

**Label:** 🟢 Authentic  
**Title:** 👨‍💻 Junior Software Engineer - Backend  

**Company:** BrightLeaf Technologies (www.brightleaftech.com) - established SaaS, founded 2016, ~80 employees  
**Location:** Remote (US timezone overlap preferred) or Boston, MA office  

**Role:**  
- Backend services in Python & Django  
- Collaborate with product/QA teams  
- Code reviews & sprint ceremonies  

**Requirements:**  
- Bachelor’s in CS or equivalent  
- 1–2 years backend experience  
- REST APIs & SQL knowledge  

**Perks:** salary, benefits, 401(k), PTO, transparent interview (phone → technical → culture). Timeline: ~3 weeks  

**✅ Legit signals:** clear company info, realistic duties, transparent interview process & benefits  

</details>

<details>
<summary>✅ Authentic - Example 6 (legit, startup)</summary>

**Label:** 🟢 Authentic  
**Title:** 📢 Marketing Associate (Content & Social)  

**Company:** GreenLoop Media - boutique agency. Portfolio: linkedin.com/company/greenloop-media  

**Responsibilities:**  
- Write blog posts, social captions, campaign briefs  
- Coordinate with designers & clients  
- Track KPIs & weekly reporting  

**Requirements:**  
- 2 years agency experience  
- Strong writing samples (portfolio link)  
- Google Analytics proficiency  

**Recruiting steps:** portfolio review → 30-min call → short paid assignment → final interview. Salary range listed.  

**✅ Legit signals:** portfolio requirement, clear steps, salary range, verifiable company profile  

</details>
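
The red flags listed in the examples above can be turned into a quick rule-based sanity check, which is handy for confirming that a synthetic listing really contains the signals it is labeled with. The following is a minimal sketch, not part of the trained detector: the `RED_FLAG_PATTERNS` table and the `find_red_flags` helper are hypothetical names, and the phrase list is only an assumption drawn from the wording of Examples 1, 2, and 4.

```python
import re

# Hypothetical phrase list, drawn from the red flags named in the examples above.
RED_FLAG_PATTERNS = {
    "no interview": r"\bno interview\b",
    "no experience": r"\bno experience\b",
    "urgency": r"\b(apply now|start today|hiring immediately|hiring fast)\b",
    "guaranteed earnings": r"\bguaranteed\b",
    "moving funds": r"\b(move funds|confidential transactions)\b",
}

def find_red_flags(text):
    """Return the names of the red-flag patterns found in the text, sorted."""
    lowered = text.lower()
    return sorted(
        name for name, pattern in RED_FLAG_PATTERNS.items()
        if re.search(pattern, lowered)
    )

# Excerpts from Example 1 (fraudulent) and Example 5 (authentic).
example_1 = ("We are hiring immediately for remote data entry positions. "
             "No interview required. Apply now and start today!")
example_5 = ("Backend services in Python & Django. Collaborate with product/QA teams. "
             "Code reviews & sprint ceremonies.")

print(find_red_flags(example_1))  # ['no interview', 'urgency']
print(find_red_flags(example_5))  # []
```

A keyword list like this is brittle and easy to evade, so it is best treated as a check on the test data rather than as a detector in its own right.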


In [1]:
# Step 1: Install dependencies
!pip install gradio scikit-learn pandas numpy xgboost requests



In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report
import gradio as gr
import requests
from zipfile import ZipFile
from io import BytesIO
from collections import Counter
import re # For text cleaning

# --- Step 1: Data Loading ---
url = "https://github.com/MadhulikaSharma95/FakeJobPosting_Assessment/raw/refs/heads/main/fake_job_postings.csv.zip"
response = requests.get(url, timeout=60)
response.raise_for_status()  # Fail early if the download did not succeed
zip_file_bytes = BytesIO(response.content)

with ZipFile(zip_file_bytes, 'r') as zf:
    zip_contents = zf.namelist()
    csv_file_name = None
    for name in zip_contents:
        if name.endswith('.csv') and not name.startswith('__MACOSX/'):
            csv_file_name = name
            break
    if csv_file_name:
        with zf.open(csv_file_name) as csv_file:
            data = pd.read_csv(csv_file)
    else:
        raise ValueError("The expected CSV file was not found in the ZIP archive.")

# --- Step 2: Enhanced Data Preprocessing and Feature Engineering ---

# Fill NaN values in key text columns with an empty string for combination
text_cols = ['title', 'company_profile', 'description', 'requirements', 'benefits']
for col in text_cols:
    data[col] = data[col].fillna('')

# Feature Engineering: Combine all relevant text fields into one 'combined_text' column
data['combined_text'] = (
    data['title'] + ' ' +
    data['company_profile'] + ' ' +
    data['description'] + ' ' +
    data['requirements'] + ' ' +
    data['benefits']
)

# Optional: Simple text cleaning function (removes HTML tags and non-alphanumeric)
def clean_text(text):
    text = re.sub('<[^>]*>', '', text) # Remove HTML tags
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text) # Keep only alphanumeric and spaces
    return text.lower()

data['combined_text'] = data['combined_text'].apply(clean_text)

# Define features (X) and target (y)
X = data['combined_text']
y = data['fraudulent']

# Get class distribution for imbalance handling
class_counts = Counter(y)
n_0 = class_counts[0]
n_1 = class_counts[1]
# Calculate the scale_pos_weight for XGBoost: (Count of Majority Class) / (Count of Minority Class)
scale_pos_weight_value = n_0 / n_1
print(f"Class Distribution: Real Jobs={n_0}, Fake Jobs={n_1}")
print(f"Calculated scale_pos_weight for XGBoost: {scale_pos_weight_value:.2f}")

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Note: Used stratify=y to ensure train/test sets have similar class ratios

# Text vectorization
# Unigrams and bigrams, with the vocabulary capped at the 10,000 most frequent terms
vectorizer = TfidfVectorizer(stop_words='english', max_features=10000, ngram_range=(1, 2))
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)


# --- Step 3: Train Models with Imbalance Handling ---

models = {
    # Added class_weight='balanced' to handle imbalance
    "Logistic Regression (Balanced)": LogisticRegression(max_iter=500, class_weight='balanced', solver='liblinear', random_state=42),

    # Added class_weight='balanced' to handle imbalance
    "Random Forest (Balanced)": RandomForestClassifier(n_estimators=200, random_state=42, class_weight='balanced', n_jobs=-1),

    # Gradient Boosting does not have class_weight, so it's kept simple for comparison
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),

    # Added scale_pos_weight to handle imbalance in XGBoost
    "XGBoost (Scaled)": XGBClassifier(
        #use_label_encoder=False,
        eval_metric='logloss',
        random_state=42,
        scale_pos_weight=scale_pos_weight_value, # Key change for imbalance
        n_estimators=200 # Increased estimators
    )
}

# Fit models and evaluate
best_model_name = ""
best_f1_score = -1

for name, model in models.items():
    model.fit(X_train_vec, y_train)
    preds = model.predict(X_test_vec)
    report = classification_report(y_test, preds, output_dict=True)

    # F1-score for the fraudulent class (label 1) is the most important metric
    f1_score_fraud = report['1']['f1-score']

    print(f"\n=== {name} ===")
    print(f"Accuracy: {accuracy_score(y_test, preds):.4f}")
    print(classification_report(y_test, preds))

    if f1_score_fraud > best_f1_score:
        best_f1_score = f1_score_fraud
        best_model_name = name

print(f"\n🏆 The model with the highest Fraudulent (Class 1) F1-Score is: {best_model_name}")

# --- Step 4: Build Gradio web app using the best model ---
# The best-performing model is preselected in the interface; users can switch to any other
best_model = models[best_model_name]

def predict_fraud(job_description, algorithm):
    # Apply the same cleaning function as used during training
    cleaned_desc = clean_text(job_description)
    vect = vectorizer.transform([cleaned_desc])
    model = models[algorithm] # Allow user to select any model for the interface
    pred = model.predict(vect)[0]
    prob = model.predict_proba(vect)[0]

    if pred == 1:
        # Confidence is for the positive class (1: Fraudulent)
        return f"⚠️ FRAUDULENT POSTING DETECTED (Confidence: {prob[1]:.4f})"
    else:
        # Confidence is for the negative class (0: Authentic)
        return f"✅ Authentic Job Posting (Confidence: {prob[0]:.4f})"

iface = gr.Interface(
    fn=predict_fraud,
    inputs=[
        gr.Textbox(label="Paste Job Description", lines=10, placeholder="Paste job title, company profile, description, requirements, and benefits here..."),
        gr.Dropdown(label="Select Algorithm", choices=list(models.keys()), value=best_model_name)
    ],
    outputs="text",
    title="🕵️ Enhanced Fraud Job Post Detector",
    description="Detect whether a job posting is fraudulent using machine learning. The default model is the one with the highest F1-score on fraudulent postings.",
    theme='dark'
)



Class Distribution: Real Jobs=17014, Fake Jobs=866
Calculated scale_pos_weight for XGBoost: 19.65

=== Logistic Regression (Balanced) ===
Accuracy: 0.9746
              precision    recall  f1-score   support

           0       0.99      0.98      0.99      3403
           1       0.68      0.88      0.77       173

    accuracy                           0.97      3576
   macro avg       0.84      0.93      0.88      3576
weighted avg       0.98      0.97      0.98      3576


=== Random Forest (Balanced) ===
Accuracy: 0.9804
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      3403
           1       1.00      0.60      0.75       173

    accuracy                           0.98      3576
   macro avg       0.99      0.80      0.87      3576
weighted avg       0.98      0.98      0.98      3576


=== Gradient Boosting ===
Accuracy: 0.9771
              precision    recall  f1-score   support

           0       0.98      1.00      0.

Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)



=== XGBoost (Scaled) ===
Accuracy: 0.9857
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      3403
           1       0.92      0.77      0.84       173

    accuracy                           0.99      3576
   macro avg       0.96      0.88      0.92      3576
weighted avg       0.99      0.99      0.99      3576


🏆 The model with the highest Fraudulent (Class 1) F1-Score is: XGBoost (Scaled)
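
The figures in the output above can be cross-checked with a few lines of arithmetic. The sketch below recomputes the fraudulent-class F1-score from the reported (rounded) precision and recall, the `scale_pos_weight` from the class counts, and the accuracy of a trivial baseline that labels every test posting as real. The `f1` helper is a hypothetical name introduced here; the Gradient Boosting report is truncated in the output, so it is left out.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Fraudulent-class precision and recall copied from the reports above (2 decimals).
reported = {
    "Logistic Regression (Balanced)": (0.68, 0.88),
    "Random Forest (Balanced)": (1.00, 0.60),
    "XGBoost (Scaled)": (0.92, 0.77),
}
for name, (p, r) in reported.items():
    print(f"{name}: F1 = {f1(p, r):.2f}")

# Class counts from the "Class Distribution" line.
n_real, n_fake = 17014, 866
print(f"scale_pos_weight = {n_real / n_fake:.2f}")

# Test-set support for each class, from the classification reports.
test_real, test_fake = 3403, 173
baseline = test_real / (test_real + test_fake)
print(f"all-real baseline accuracy = {baseline:.4f}")
```

The recomputed F1 values (0.77, 0.75, 0.84) agree with the reports, and the baseline reaches about 0.9516 accuracy without detecting a single fake posting. That is why the notebook ranks models by fraudulent-class F1 rather than by accuracy.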


In [2]:
iface.launch(share=True)

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://69ce1ba94adfd6b734.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
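
Text pasted into the app passes through `clean_text` before it is vectorized, so it helps to see what that step does to a listing. The sketch below copies the function from the training cell and runs it on two short strings; the sample strings are assumptions adapted from the synthetic examples at the top of the notebook.

```python
import re

def clean_text(text):
    # Same logic as the notebook's cleaning step.
    text = re.sub('<[^>]*>', '', text)           # Remove HTML tags
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)   # Keep only alphanumerics and whitespace
    return text.lower()

sample_pay = "<b>Earn $5,000/week!</b> No interview required."
sample_slash = "Help customers over chat/email."

print(clean_text(sample_pay))    # earn 5000week no interview required
print(clean_text(sample_slash))  # help customers over chatemail
```

Because punctuation is deleted rather than replaced with a space, tokens joined by a slash or comma are fused (`5000week`, `chatemail`). Whether that helps or hurts the TF-IDF features is not evaluated in this notebook.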


