### High-Level Goals
1. **currently**: train a topic model on the opportunity ads and visualize the ads in terms of these topics
2. test a simple classification approach (logistic regression or SVMs) for classifying opportunities; treat this as the baseline
3. fine-tune an LLM on opportunity ads; does it perform significantly better than the baseline?

In [22]:
# classification tutorial: https://towardsdatascience.com/text-classification-with-nlp-tf-idf-vs-word2vec-vs-bert-41ff868d1794

In [42]:
import pandas as pd
import numpy as np
import os

from gensim.corpora.dictionary import Dictionary
from gensim.parsing.preprocessing import preprocess_string
from gensim import models, utils

In [2]:
coded = pd.read_csv('../data/text_data_annotated.csv')    # 32,084 rows in total
opportunity_ads = coded[coded['opportunity'] == 1]    # 2,274 rows labeled as opportunity

### Topic Modeling

In [3]:
opportunity_texts = opportunity_ads['text'].dropna().tolist()
# tokenize texts
texts_tokenized = [preprocess_string(t) for t in opportunity_texts]
# tokenizing manually
# texts_tokenized = [list(utils.tokenize(t)) for t in opportunity_texts]

# build dictionary of tokens
dict_opportunity = Dictionary(texts_tokenized)
# build BOW mapping of all texts based on dictionary
opportunity_texts_processed = [dict_opportunity.doc2bow(t) for t in texts_tokenized]

In [4]:
lda = models.ldamodel.LdaModel(opportunity_texts_processed, num_topics=7)

In [11]:
for i in range(lda.num_topics):
    print(f'=== topic {i} ===')
    # get_topic_terms returns (word_id, probability) pairs for the top terms
    terms = lda.get_topic_terms(i)
    for wid, prob in terms:
        print(dict_opportunity.get(wid), wid, prob)

    print('\n')

=== topic 0 ===
com 90 0.0249541
class 213 0.012157648
job 437 0.011724807
time 187 0.011613171
join 142 0.010429956
free 38 0.008994422
learn 54 0.00894235
todai 61 0.008489231
career 193 0.008040525
start 112 0.0074167084


=== topic 1 ===
com 90 0.024189373
earn 49 0.02183597
shasta 4557 0.011172712
work 19 0.010717492
want 177 0.009511888
home 676 0.009353182
free 38 0.0090760095
grant 67 0.00901813
join 142 0.00899255
card 639 0.0088354


=== topic 2 ===
com 90 0.021314884
earn 49 0.017034918
free 38 0.01460819
start 112 0.014286659
facebook 216 0.014200148
form 217 0.013351128
amazon 544 0.012220311
work 19 0.010913717
cours 92 0.009985787
onlin 41 0.009201662


=== topic 3 ===
onlin 41 0.027828455
scholarship 28 0.025577221
edu 37 0.019441608
degre 36 0.018807022
career 193 0.018388389
learn 54 0.01428539
cours 92 0.012127982
com 90 0.010891082
class 213 0.009719211
busi 64 0.008813099


=== topic 4 ===
com 90 0.030129235
job 437 0.018829772
join 142 0.016670017
career 193 0.015

In [21]:
# I suspect topic #1 contains some lower-quality opportunities.
# Print every ad whose only reported topic is topic 1
# (get_document_topics omits topics below the model's minimum probability).

for doc_idx, bow in enumerate(opportunity_texts_processed):
    topics = lda.get_document_topics(bow)
    topic_ids = [topic_id for (topic_id, _) in topics]
    if topic_ids == [1]:
        print(opportunity_texts[doc_idx])
        print(topics)
        print('\n')

We are hiring for multiple positions and shifts! Check out all of our open jobs and apply: roguefitness.com/jobs
Maximum Wage: $55/Hour with both Adders*
Minimum Wage: $23/Hour with both Adders
[(1, 0.9570161)]


g2.com
G2 wants to hear about your experiences with software at work! https://g2.co/3cYLh4j
Compare the best business software and services based on user ratings and social data. Reviews for CRM, ERP, HR, CAD, PDM and Marketing software.
$10 Amazon gift card for your opinion!
[(1, 0.97132945)]


g2.com
G2 wants to hear about your experiences with software at work! https://g2.co/3xxReyJ
$10 Starbucks gift card for your opinion!
Compare the best business software and services based on user ratings and social data. Reviews for CRM, ERP, HR, CAD, PDM and Marketing software.
[(1, 0.9713335)]


g2.com
Limit 1 reward per user. Gift cards will be sent to your email within 3 business days pending review approval!
$10 Starbucks gift card for your opinion!
G2 wants to hear about your exp

arcadia.edu
Arcadia University’s hybrid Doctor of Physical Therapy (DPT) program prepares aspiring physical therapists to become the next generation of innovative, patient-centered practitioners.
Hybrid DPT from Arcadia: A Top-25 PT School
Earn your hybrid DPT from Arcadia in 25 months. Participate in 32 weeks of clinical experience. GRE req’d.
[(1, 0.97538906)]


Earn Your Hybrid DPT from Arcadia. GRE req’d.
Learn from Arcadia’s expert DPT faculty in a hybrid, online program. Complete in 25 months.
arcadia.edu
Arcadia University’s hybrid Doctor of Physical Therapy (DPT) program prepares aspiring physical therapists to become the next generation of innovative, patient-centered practitioners.
[(1, 0.97609586)]


wp.wpi.edu/touchtomorrow/register
Don’t miss it!
Free STEM festival for all ages
Join us April 2 for a day full of free STEM activities.
[(1, 0.9492364)]


pathrise.com
Get A New Job Faster & Earn More
We are entering the most talent friendly hiring economy in years. Work with t
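
The filter above only keeps ads where topic 1 is the single topic reported, which is strict: `get_document_topics` drops topics below the model's `minimum_probability` (0.01 by default in gensim), so an ad that is 90% topic 1 and 10% something else is skipped. A small sketch of that check next to a looser "dominant topic" check, written against the list-of-pairs format shown in the output. The helper names and the `mixed` distribution are hypothetical.

```python
def only_topic(doc_topics, topic_id):
    """True if topic_id is the single topic reported for the document."""
    return [tid for tid, _ in doc_topics] == [topic_id]


def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])


# The first distribution is copied from the output above; the second is made up.
single = [(1, 0.9570161)]
mixed = [(1, 0.55), (3, 0.43)]

print(only_topic(single, 1), only_topic(mixed, 1))
print(dominant_topic(mixed))
```

Filtering on `dominant_topic(...)[0] == 1` would also pull in mixed ads, which may or may not be what is wanted when eyeballing topic quality.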

### Simplistic Classification

In [76]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

vectorizer = TfidfVectorizer()

In [72]:
nas = coded['text'].isna()
all_texts = coded[~nas]['text'].tolist()
all_texts_tfidf = vectorizer.fit_transform(all_texts)
opportunity_labels = np.array(coded[~nas]['opportunity'])

# split into train and test (no separate validation set yet)
X_train, X_test, y_train, y_test = train_test_split(all_texts_tfidf, opportunity_labels, test_size=0.33)
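
One thing to note about the cells above: the vectorizer is fit on all texts before the split, so the IDF weights have already seen the test texts. The effect is probably small, but a sketch of a leak-free variant is below. It splits the raw texts first and wraps TF-IDF and the classifier in a pipeline. The toy texts and labels are hypothetical, and `stratify` and `random_state=0` are assumptions, not settings used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the ad texts and 0/1 opportunity labels.
toy_texts = [
    "apply now for open jobs", "new scholarship for online degree",
    "buy shoes on sale", "best pizza in town", "watch the new movie",
    "free career training class", "cheap flights this weekend",
    "order groceries online", "join our hiring event", "stream music free",
    "earn your degree online", "limited time phone deal",
]
toy_labels = [1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0]

# Split the raw texts first, keeping the class ratio similar in both parts.
txt_train, txt_test, lab_train, lab_test = train_test_split(
    toy_texts, toy_labels, test_size=0.33, stratify=toy_labels, random_state=0
)

# The pipeline fits TF-IDF on the training texts only, then the classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(txt_train, lab_train)
toy_preds = clf.predict(txt_test)
print(len(txt_train), len(txt_test))
```

Stratifying matters more here than usual because only about 7% of the annotated ads are opportunities.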

In [73]:
# fit logistic regression
logit_model = LogisticRegression()
logit_model.fit(X_train, y_train)
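
Goal 2 also mentions SVMs, and `SVC` is imported but never used in this section. A minimal sketch of what the SVM side of the baseline could look like is below. Note that it swaps in `LinearSVC` rather than the imported `SVC`, since a linear SVM is a common choice for sparse TF-IDF features. The toy texts and labels are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the ad texts and 0/1 opportunity labels.
toy_texts = [
    "apply now for open jobs", "new scholarship for online degree",
    "buy shoes on sale", "best pizza in town",
    "free career training class", "cheap flights this weekend",
    "join our hiring event", "stream music free",
]
toy_labels = [1, 1, 0, 0, 1, 0, 1, 0]

toy_vectorizer = TfidfVectorizer()
X_toy = toy_vectorizer.fit_transform(toy_texts)

# Linear SVM on the same kind of TF-IDF features as the logistic regression.
svm_model = LinearSVC()
svm_model.fit(X_toy, toy_labels)

# Classify an unseen (made-up) ad using the already-fitted vectorizer.
new_ad = toy_vectorizer.transform(["open jobs, apply today"])
print(svm_model.predict(new_ad))
```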

In [77]:
# test classification
test_preds = logit_model.predict(X_test)
print('Accuracy:', accuracy_score(y_test, test_preds))
print('Precision:', precision_score(y_test, test_preds))
print('Recall:', recall_score(y_test, test_preds))

print('\nConfusion Matrix:\n', confusion_matrix(y_test, test_preds, normalize='true'))

Accuracy: 0.9545324835917436
Precision: 0.8809523809523809
Recall: 0.36894586894586895

Confusion Matrix:
 [[0.99643258 0.00356742]
 [0.63105413 0.36894587]]
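
The numbers above are enough to check how much of the accuracy comes from class imbalance. Accuracy is a weighted average of the two per-class recalls (the diagonal of the row-normalized confusion matrix), so the share of negatives in the test set can be backed out, and that share is exactly the accuracy of a model that always predicts 0. A short worked sketch, with the three inputs copied from the printed output and the variable names being my own:

```python
# Values copied from the printed output above.
accuracy = 0.9545324835917436
recall_pos = 0.36894587   # share of true opportunity ads that were found
recall_neg = 0.99643258   # share of non-opportunity ads correctly rejected

# accuracy = neg_share * recall_neg + (1 - neg_share) * recall_pos
# Solve for neg_share, the fraction of test examples with label 0.
neg_share = (accuracy - recall_pos) / (recall_neg - recall_pos)

print(f"share of negatives in the test set: {neg_share:.3f}")
print(f"accuracy of always predicting 0:    {neg_share:.3f}")
print(f"gain of the model over that:        {accuracy - neg_share:.3f}")
```

So always answering "no" would already score roughly 0.93, and the logistic regression adds only about two points of accuracy on top of that.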


In [62]:
# TODO: unpack accuracy; is it high because of the class imbalance?
# The model could mostly be predicting "no".

# Yes: recall on the opportunity class is only ~0.37, so most opportunity ads are missed.

319
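
Two common ways to trade some precision for recall are reweighting the classes during training and lowering the decision threshold. The sketch below compares both on synthetic data with roughly 7% positives. The data comes from `make_classification`, not from the ads, and the 0.2 threshold is an arbitrary assumption, so the printed numbers say nothing about how this would work on the real texts.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the ad data: roughly 7% positives.
X, y = make_classification(
    n_samples=3000, n_features=20, weights=[0.93], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0
)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# Option 1: reweight classes inversely to their frequency during training.
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

# Option 2: keep the plain model but lower the decision threshold from 0.5.
proba_pos = plain.predict_proba(X_te)[:, 1]
preds = {
    "plain, threshold 0.5": plain.predict(X_te),
    "plain, threshold 0.2": (proba_pos >= 0.2).astype(int),
    "class_weight='balanced'": balanced.predict(X_te),
}

recalls = {}
for name, pred in preds.items():
    recalls[name] = recall_score(y_te, pred)
    precision = precision_score(y_te, pred, zero_division=0)
    print(f"{name}: precision={precision:.2f} recall={recalls[name]:.2f}")
```

Lowering the threshold can only add positive predictions, so recall never drops; whether the precision cost is acceptable depends on how the classifier will be used.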