# Task 1: Insight Extraction Response
To answer a natural query like “Show me the top fashion creators for my brand campaign this month,” I would follow a structured data + NLP hybrid approach.

First, I would preprocess the structured data by filtering rows where the Category field is "Fashion". This gives us a focused subset of relevant creators.

Next, to rank creators effectively, I’d define a composite score based on:

- Engagement Rate: Measures content impact
- Posting Frequency: Reflects recent activity
- Follower Count: Represents potential reach

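Each creator would be assigned a final score using a weighted formula such as:

Score = (0.5 * Normalized Engagement Rate) + (0.3 * Normalized Posting Frequency) + (0.2 * Normalized Follower Count)

I would normalize all three metrics to a common 0 to 1 range (e.g., with min-max scaling) so that large raw values such as follower counts do not dominate the score.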
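
One common way to do this normalization is min-max scaling, which is also what `MinMaxScaler` applies in the notebook code further down. The symbols `x_min` and `x_max` (the smallest and largest value of a metric among the filtered creators) are notation introduced here for illustration:

```latex
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}, \qquad 0 \le x' \le 1
```

Under this scaling, the creator with the lowest value of a metric gets 0 and the one with the highest gets 1. If every creator has the same value, the denominator is zero and the column carries no ranking information (`MinMaxScaler` maps such a column to 0).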

For query understanding, I’d extract the key intents:

- "Top" → implies ranking
- "Fashion" → content category filter
- "This month" → favor recent or frequent posters

This intent parsing can be rule-based (e.g., using spaCy/regex) or handled using a small LLM prompt via OpenAI or Cohere if budget allows. Given the structure of this use-case, I would start with traditional NLP + scoring logic, which is faster, cost-effective, and interpretable.
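
A minimal sketch of the rule-based option, using only regular expressions. The `QueryIntent` fields, the `parse_query` name, and the `KNOWN_CATEGORIES` vocabulary are assumptions made for illustration; in practice the categories would come from the dataset's `Category` column.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical category vocabulary; in practice this would be derived from
# the distinct values of the dataset's Category column.
KNOWN_CATEGORIES = ("fashion", "wellness", "tech", "food")


@dataclass
class QueryIntent:
    rank: bool                # True when the query asks for "top"/"best" results
    category: Optional[str]   # matched category, or None if none was found
    top_n: Optional[int]      # explicit count such as "top 5", else None
    recency: Optional[str]    # "month" or "week" when a time window is mentioned


def parse_query(query: str) -> QueryIntent:
    """Extract ranking, category, count, and recency cues with simple rules."""
    text = query.lower()

    # "top", "top 5", "best 10" -> ranking intent, optionally with a count.
    rank_match = re.search(r"\b(?:top|best)\b(?:\s+(\d+))?", text)
    top_n = int(rank_match.group(1)) if rank_match and rank_match.group(1) else None

    # First known category that appears as a whole word.
    category = next(
        (c for c in KNOWN_CATEGORIES if re.search(rf"\b{re.escape(c)}\b", text)),
        None,
    )

    # "this month" / "this week" -> recency window.
    recency_match = re.search(r"\bthis\s+(month|week)\b", text)
    recency = recency_match.group(1) if recency_match else None

    return QueryIntent(rank_match is not None, category, top_n, recency)


intent = parse_query("Show me the top fashion creators for my brand campaign this month")
print(intent)
```

Rules like these are easy to audit and extend, but they only catch phrasings that were anticipated; an LLM prompt would be the fallback for queries that match none of the patterns.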

For production, this logic can later be embedded in a chatbot layer that detects user intent and triggers backend ranking based on filters and scores.

In [None]:
from google.colab import files
uploaded = files.upload()

Saving Mock_Creator_Engagement_Data.csv to Mock_Creator_Engagement_Data (2).csv


In [None]:
import pandas as pd
df = pd.read_csv('Mock_Creator_Engagement_Data.csv')
df.head()

Unnamed: 0,Name,Category,Average Likes/Post,Average Comments/Post,Engagement Rate (%),Posting Frequency,Follower Count,Past Brand Collaborations
0,Creator_1,Tech,3416,179,7.02,6 posts/week,77591,"Nykaa, Mamaearth"
1,Creator_2,Wellness,1591,299,5.24,3 posts/week,79450,"Minimalist, Plum"
2,Creator_3,Food,3015,151,1.5,5 posts/week,82160,"Amazon, Flipkart"
3,Creator_4,Wellness,3918,31,6.44,7 posts/week,61587,"Nike, Puma"
4,Creator_5,Tech,4584,201,7.96,6 posts/week,61613,"Minimalist, Plum"


In [None]:
from sklearn.preprocessing import MinMaxScaler

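# Filter for 'Fashion' category (copy so the original df is left untouched)
fashion_creators = df[df['Category'] == 'Fashion'].copy()

# Extract the numeric part of 'Posting Frequency' (e.g., "6 posts/week" -> 6.0)
fashion_creators['Posting Frequency'] = fashion_creators['Posting Frequency'].astype(str).str.extract(r'(\d+)', expand=False).astype(float)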

# Now normalize the numeric columns
scaler = MinMaxScaler()
fashion_creators[['Engagement Rate (%)', 'Posting Frequency', 'Follower Count']] = scaler.fit_transform(
    fashion_creators[['Engagement Rate (%)', 'Posting Frequency', 'Follower Count']]
)
# Compute weighted score
fashion_creators['Score'] = (
    0.5 * fashion_creators['Engagement Rate (%)'] +
    0.3 * fashion_creators['Posting Frequency'] +
    0.2 * fashion_creators['Follower Count']
)
# Sort and display top 5
top_fashion_creators = fashion_creators.sort_values(by='Score', ascending=False)
top_fashion_creators[['Name', 'Category', 'Engagement Rate (%)', 'Posting Frequency', 'Follower Count', 'Score']].head()

Unnamed: 0,Name,Category,Engagement Rate (%),Posting Frequency,Follower Count,Score
14,Creator_15,Fashion,1.0,0.0,0.0,0.5
16,Creator_17,Fashion,0.429907,0.0,1.0,0.414953
11,Creator_12,Fashion,0.0,0.0,0.345663,0.069133
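
As a quick check of the weighted formula against the output above, the second-ranked row (Creator_17) works out as follows:

```latex
\text{Score} = 0.5 \times 0.429907 + 0.3 \times 0.0 + 0.2 \times 1.0 \approx 0.21495 + 0 + 0.2 = 0.41495
```

Posting Frequency is 0.0 in all three rows. Min-max scaling returns 0 for a column whose values are all identical, so this indicates that the three fashion creators share the same posting frequency in the mock data, not that they are inactive. In that situation the 0.3 weight contributes nothing to the ranking.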


## Pseudocode: Fashion Creator Ranking Logic

1. Load the CSV data into a dataframe.
2. Filter creators where Category = "Fashion".
3. Preprocess 'Posting Frequency':
   - Extract numeric part from text (e.g., "4 posts/week" → 4).
4. Normalize numeric columns:
   - Engagement Rate (%)
   - Posting Frequency
   - Follower Count
5. Calculate a weighted score for each creator:
   Score = (0.5 * Engagement Rate) +
           (0.3 * Posting Frequency) +
           (0.2 * Follower Count)
6. Sort creators in descending order by score.
7. Return top N creators with the highest scores.
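
The steps above could be wrapped in a reusable function along these lines. This is a sketch under assumptions: the names `rank_creators`, `min_max`, and `WEIGHTS` are illustrative, the toy rows are made up and not taken from the mock CSV, and plain pandas arithmetic stands in for `MinMaxScaler`. Unlike the notebook cell, it keeps the raw metric columns and only adds `Score`.

```python
import pandas as pd

# Weights mirror the formula in the write-up; keys match the CSV column names.
WEIGHTS = {
    "Engagement Rate (%)": 0.5,
    "Posting Frequency": 0.3,
    "Follower Count": 0.2,
}


def min_max(series: pd.Series) -> pd.Series:
    """Scale a numeric column to [0, 1]; a constant column maps to all zeros."""
    span = series.max() - series.min()
    if span == 0:
        return pd.Series(0.0, index=series.index)
    return (series - series.min()) / span


def rank_creators(df: pd.DataFrame, category: str, top_n: int = 5) -> pd.DataFrame:
    """Filter by category, normalize the three metrics, return the top_n by score."""
    subset = df[df["Category"].str.lower() == category.lower()].copy()

    # "4 posts/week" -> 4.0
    subset["Posting Frequency"] = (
        subset["Posting Frequency"]
        .astype(str)
        .str.extract(r"(\d+)", expand=False)
        .astype(float)
    )

    # Weighted sum of the normalized metrics.
    subset["Score"] = sum(
        weight * min_max(subset[col]) for col, weight in WEIGHTS.items()
    )
    return subset.sort_values("Score", ascending=False).head(top_n)


# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Category": ["Fashion", "Fashion", "Tech", "Fashion"],
    "Engagement Rate (%)": [2.0, 6.0, 9.0, 3.0],
    "Posting Frequency": ["3 posts/week", "5 posts/week", "7 posts/week", "7 posts/week"],
    "Follower Count": [10000, 20000, 90000, 30000],
})
top = rank_creators(toy, "fashion", top_n=2)
print(top[["Name", "Score"]])
```

Normalizing after the category filter means scores are relative to other creators in the same category, which matches the order of steps 2 and 4 above.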

# Task 2: Model Thinking – Brand Category Classifier

To classify brand signups into categories like fashion, wellness, or tech using their brand name and Instagram bio, I would build a lightweight NLP classification pipeline using open-source models.

First, I’d collect a labeled dataset of brand names, bios, and category tags. Preprocessing would involve lowercasing, removing special characters, and possibly lemmatization using spaCy or NLTK. Then I’d embed the text using a **SentenceTransformers** model (e.g., `paraphrase-MiniLM-L6-v2`), which handles tokenization internally, is optimized for sentence-level understanding, and is well suited to short text like bios.
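
A minimal sketch of the light cleaning step, assuming English-language bios. The function name `clean_bio`, the choice to join the brand name and bio into one string, and the sample brand are illustrative assumptions; lemmatization is left out because it would require a spaCy or NLTK model.

```python
import re
import unicodedata


def clean_bio(name: str, bio: str) -> str:
    """Combine brand name and bio into one normalized string for embedding."""
    text = f"{name}. {bio}"
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"https?://\S+", " ", text)        # drop URLs
    text = re.sub(r"[^a-z0-9\s.,&'-]", " ", text)    # drop emojis and other symbols
    return re.sub(r"\s+", " ", text).strip()         # collapse whitespace


# Made-up brand for illustration.
cleaned = clean_bio("GlowCo", "Organic skincare ✨ for glowing health! https://glow.example")
print(cleaned)
```

The character filter keeps only ASCII letters and digits, so it would strip bios written in other scripts; that case would need a different rule.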

For classification, I’d train a **Logistic Regression** or **Random Forest** model on the sentence embeddings. This setup is fast, interpretable, and effective for early-stage MVPs.

Alternatively, for an API-based solution, I could use a hosted LLM service, for example prompting an **OpenAI** model to choose a category or using **Cohere’s Embed + Classify** stack, to simplify model deployment.

To evaluate performance, I’d use **accuracy, precision, recall, and F1-score** on a validation set. I’d monitor misclassified examples to iteratively improve the model via active learning or feedback loops.
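
A sketch of how those metrics could be computed with scikit-learn. Two things here are assumptions rather than part of the plan above: stratified k-fold cross-validation is used in place of a single validation split (it tends to give steadier estimates on small datasets), and the feature vectors are random stand-ins for sentence embeddings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-ins for sentence embeddings (illustration only):
# three classes, 20 samples each, 16-dimensional vectors with shifted means.
rng = np.random.default_rng(0)
classes = ["fashion", "tech", "wellness"]
X = np.vstack(
    [rng.normal(loc=i, scale=1.0, size=(20, 16)) for i in range(len(classes))]
)
y = np.repeat(classes, 20)

# Stratified folds keep the class mix similar in every split; each sample is
# predicted exactly once by a model that did not see it during training.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
y_pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=cv)

# Per-class precision, recall, and F1, plus overall accuracy.
report = classification_report(y, y_pred, output_dict=True)
print(classification_report(y, y_pred))
```

With real data, `X` would be the output of `model.encode(...)` and `y` the labeled categories.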

To improve accuracy over time:
- Collect misclassified bios and manually label them
- Retrain the model periodically with updated data
- Fine-tune embeddings if the domain vocabulary is highly niche

This pipeline ensures fast prototyping and scalable classification as the brand dataset grows.
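
One way the feedback loop could pick bios for manual labeling is to queue the predictions the model is least sure about. The sketch below assumes class probabilities in the shape returned by a scikit-learn classifier's `predict_proba`; the function name `select_for_review`, the 0.6 threshold, and the sample values are made up for illustration.

```python
def select_for_review(probabilities, texts, threshold=0.6):
    """Return (text, confidence) pairs whose top class probability is below threshold.

    Results are sorted least confident first, so reviewers see the most
    ambiguous bios at the top of the queue.
    """
    scored = [(text, float(max(row))) for text, row in zip(texts, probabilities)]
    uncertain = [pair for pair in scored if pair[1] < threshold]
    return sorted(uncertain, key=lambda pair: pair[1])


# Made-up probabilities, one row per bio, one column per class.
probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.55, 0.30, 0.15],
]
bios = ["Designer ethnic wear", "Eco gadgets and apparel", "Yoga mats and trackers"]

queue = select_for_review(probs, bios)
print(queue)
```

Bios labeled from this queue would then be added to the training set before the periodic retraining step.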

In [None]:
#Task 2 – Demo Classifier
#1. Sample mock brand bios
brand_bios = [
    "We make organic skincare for glowing health",       # wellness
    "Premium headphones and gadgets for tech lovers",    # tech
    "Designer ethnic wear for modern women",             # fashion
    "Supplements and fitness gear for athletes",         # wellness
    "Sustainable laptop bags for professionals",         # tech
    "Natural beauty products from the hills",            # wellness
    "Gaming keyboards and streaming mics",               # tech
    "Streetwear for the fashion-forward youth",          # fashion
    "Home workout kits and fitness apps",                # wellness
    "Latest smartwatches and electronics",               # tech
    "Handmade sarees and Indian bridal wear",            # fashion
    "Nutrition coaching and wellness retreats",          # wellness
]
labels = [
    "wellness", "tech", "fashion", "wellness", "tech",
    "wellness", "tech", "fashion", "wellness", "tech",
    "fashion", "wellness"
]

In [None]:
#2. Converting to embedding using SentenceTransformer
!pip install -q sentence-transformers
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

embeddings = model.encode(brand_bios)

In [None]:
#Training Classifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(embeddings, labels, test_size=0.2, random_state=42)

clf = LogisticRegression()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

     fashion       1.00      1.00      1.00         1
        tech       1.00      1.00      1.00         1
    wellness       1.00      1.00      1.00         1

    accuracy                           1.00         3
   macro avg       1.00      1.00      1.00         3
weighted avg       1.00      1.00      1.00         3



**Note:** The classifier scores perfectly here because the test split contains only three bios, one per class, so the result is not statistically meaningful. The demo shows the end-to-end NLP pipeline (embedding → training → evaluation); real-world performance would vary.

- For a production-grade classifier, a larger, diverse, and balanced dataset would be required. The current model is intended to show proof-of-concept, not final deployment accuracy.
- With more labeled Instagram bios from real brand signups, I would explore fine-tuning SentenceTransformer models or using zero-shot classification via LLM APIs for scalable tagging.


# ▶ NLP Engineer Assignment – Nurdd
**Candidate:** Purnima Nahata

-Future NLP Engineer at Nurdd

**Date:** 25/07/2025

## Summary

This notebook includes:

- Task 1: Insight extraction for top fashion creators
  → Includes explanation, ranking logic, code, and pseudocode
- Task 2: NLP model thinking for classifying brand bios
  → Includes pipeline design, model choice, and evaluation approach
- Bonus Task: Lightweight demo to classify brand bios using Sentence-BERT and logistic regression
  → Shows end-to-end text classification with awareness of data limitations

The solutions are modular and designed with production use in mind; the demo classifier is a proof of concept rather than a deployment-ready model.

