We want to do some dimensionality reduction, so let's install UMAP.

In [1]:
!pip install --quiet umap-learn

Now let's load the data and do a little feature engineering. We drop Affected_by_Covid because it always has the same value, and we introduce dummy variables for the economic sector.

In [2]:
import pandas as pd

COLUMNS = ['Increased_Work_Hours', 'Work_From_Home', 'Hours_Worked_Per_Day', 'Meetings_Per_Day', 'Productivity_Change', 'Health_Issue', 
           'Job_Security', 'Childcare_Responsibilities', 'Commuting_Changes','Technology_Adaptation', 'Salary_Changes', 
           'Team_Collaboration_Challenges',  'Sector_Education', 'Sector_Healthcare', 'Sector_IT', 'Sector_Retail']
COVID = '/kaggle/input/impact-of-covid-19-on-working-professionals/synthetic_covid_impact_on_work.csv'
TARGET = 'Stress_Level'

df = pd.read_csv(filepath_or_buffer=COVID)
df = pd.get_dummies(data=df, columns=['Sector']).drop(columns=['Affected_by_Covid'])
df.head()

Unnamed: 0,Increased_Work_Hours,Work_From_Home,Hours_Worked_Per_Day,Meetings_Per_Day,Productivity_Change,Stress_Level,Health_Issue,Job_Security,Childcare_Responsibilities,Commuting_Changes,Technology_Adaptation,Salary_Changes,Team_Collaboration_Challenges,Sector_Education,Sector_Healthcare,Sector_IT,Sector_Retail
0,1,1,6.392394,2.684594,1,Low,0,0,1,1,1,0,1,False,False,False,True
1,1,1,9.171984,3.339225,1,Low,0,1,0,1,1,0,1,False,False,True,False
2,1,0,10.612561,2.218333,0,Medium,0,0,0,0,0,0,0,False,False,False,True
3,1,1,5.546169,5.150566,0,Medium,0,0,0,1,0,0,0,True,False,False,False
4,0,1,11.424615,3.121126,1,Medium,0,1,1,1,0,1,1,True,False,False,False
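
The two feature-engineering steps above can be sketched on a toy frame. The helper constant_columns and the toy values are illustrative assumptions and not part of the original notebook; only the column names come from the data.

```python
import pandas as pd

def constant_columns(frame: pd.DataFrame) -> list:
    """Return the names of columns that hold a single distinct value."""
    return [name for name in frame.columns if frame[name].nunique(dropna=False) <= 1]

# Toy stand-in for the real CSV (values are made up for illustration).
toy = pd.DataFrame({
    'Sector': ['IT', 'Retail', 'IT', 'Education'],
    'Affected_by_Covid': [1, 1, 1, 1],
    'Work_From_Home': [1, 0, 1, 1],
})

# Find the constant column instead of hard-coding it, then one-hot encode Sector.
to_drop = constant_columns(toy)
encoded = pd.get_dummies(data=toy, columns=['Sector']).drop(columns=to_drop)
print(to_drop)
print(list(encoded.columns))
```

Running a check like this on the raw frame before the drop would confirm that Affected_by_Covid is the only constant column.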


Is our target variable balanced?

In [3]:
df[TARGET].value_counts().to_dict()

{'Medium': 4956, 'High': 3036, 'Low': 2008}

It isn't balanced, but those counts sit very close to round numbers: 5,000, 3,000 and 2,000. That's interesting.
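
To put numbers on that observation, here is a small sketch that turns the counts printed above into class shares. The counts are copied from the output; the variable names are just for illustration.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({'Medium': 4956, 'High': 3036, 'Low': 2008})

total = int(counts.sum())
shares = (counts / total).round(4)  # fraction of all rows in each class
print(total)
print(shares.to_dict())
```

On the real frame, df[TARGET].value_counts(normalize=True) should give the same shares directly.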

Hmm. The hours and meetings columns look as if they were drawn from a random number generator, or else rescaled in some way we don't know about: nobody logs 6.392394 hours worked in a day. Let's graph them and see.

In [4]:
from plotly import express

express.scatter(data_frame=df, x='Hours_Worked_Per_Day', y='Meetings_Per_Day', color=TARGET, facet_col=TARGET)

That data looks Gaussian. Let's graph it as histograms, which make the shape of each distribution easier to judge.

In [5]:
express.histogram(data_frame=df, x='Hours_Worked_Per_Day', color=TARGET, facet_col=TARGET).show()
express.histogram(data_frame=df, x='Meetings_Per_Day', color=TARGET, facet_col=TARGET).show()
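
A numeric check can complement the histograms. The sketch below uses a stand-in sample drawn from a normal distribution, since the real CSV is not available here; the location, scale and sample size are arbitrary assumptions. For Gaussian data both statistics should land near zero.

```python
import numpy as np
import pandas as pd

# Stand-in sample: the location and scale are arbitrary assumptions,
# not estimates taken from the dataset.
rng = np.random.default_rng(seed=2024)
sample = pd.Series(rng.normal(loc=8.0, scale=2.0, size=10_000))

skewness = sample.skew()         # near 0 for a symmetric distribution
excess_kurtosis = sample.kurt()  # near 0 for Gaussian tails (Fisher definition)
print('skew: {:.3f}, excess kurtosis: {:.3f}'.format(skewness, excess_kurtosis))
```

On the real data the same two calls could be applied to df['Hours_Worked_Per_Day'] and df['Meetings_Per_Day'].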

That looks very Gaussian indeed. Let's run all of our numerical data through UMAP and see what falls out.

In [6]:
import arrow
from umap import UMAP

time_start = arrow.now()
umap = UMAP(random_state=2024, verbose=False, n_jobs=1, low_memory=False, n_epochs=201)
df[['x', 'y']] = umap.fit_transform(X=df[COLUMNS])
print('done with UMAP in {}'.format(arrow.now() - time_start))

done with UMAP in 0:00:37.492987


In [7]:
from plotly import express

express.scatter(data_frame=df, x='x', y='y', color='Stress_Level', hover_data=['Sector_Education', 'Sector_Healthcare', 'Sector_IT', 'Sector_Retail'])

Weirdly, we get four near-perfect clusters: each one is tight and clearly separated from the others. What distinguishes them? Hovering over the points shows that it is the economic sector. So apart from the sector, all of this data is distributed the same way, which is to say randomly, and that is a strong hint that we have synthetic data.
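
Rather than reading the sector off the hover text, the four dummy columns can be collapsed back into a single label. This is a minimal sketch on toy rows; the names SECTOR_COLUMNS and sector are assumptions made for the example.

```python
import pandas as pd

SECTOR_COLUMNS = ['Sector_Education', 'Sector_Healthcare', 'Sector_IT', 'Sector_Retail']

# Toy rows shaped like the dummy columns in the notebook's data frame.
toy = pd.DataFrame(
    data=[[False, False, False, True],
          [False, False, True, False],
          [True, False, False, False]],
    columns=SECTOR_COLUMNS,
)

# idxmax picks the column holding the single True in each row;
# the prefix is then stripped to leave a plain sector name.
sector = (
    toy[SECTOR_COLUMNS]
    .astype(int)
    .idxmax(axis=1)
    .str.replace('Sector_', '', regex=False)
)
print(sector.tolist())
```

Stored on the real frame as a column (say, a hypothetical df['Sector']), this label could be passed as color= to express.scatter so the embedding is colored by sector directly.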

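If the features really are random, a classifier should find little in them to predict the stress level. Let's fit a logistic regression and see.

In [8]: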
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df[COLUMNS], df[TARGET], test_size=0.2, random_state=2024, stratify=df[TARGET])

logreg = LogisticRegression(max_iter=10000, tol=1e-4).fit(X_train, y_train)
y_pred = logreg.predict(X=X_test)  # predict once and reuse for every metric
print('model fit in {} iterations'.format(logreg.n_iter_[0]))
print('accuracy: {:5.4f}'.format(accuracy_score(y_true=y_test, y_pred=y_pred)))
print('f1: {:5.4f}'.format(f1_score(average='weighted', y_true=y_test, y_pred=y_pred)))
print(classification_report(zero_division=0.0, y_true=y_test, y_pred=y_pred))

model fit in 92 iterations
accuracy: 0.4955
f1: 0.3283
              precision    recall  f1-score   support

        High       0.00      0.00      0.00       607
         Low       0.00      0.00      0.00       402
      Medium       0.50      1.00      0.66       991

    accuracy                           0.50      2000
   macro avg       0.17      0.33      0.22      2000
weighted avg       0.25      0.50      0.33      2000



Interestingly, our model just picks the largest class every time: recall is 1.00 for Medium and 0.00 for High and Low.
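
As a sanity check, the scores of a rule that always predicts the largest class can be worked out from the support column alone. The sketch below is plain arithmetic on the numbers in the report, not a new model.

```python
# Test-set class sizes, copied from the support column of the report above.
support = {'High': 607, 'Low': 402, 'Medium': 991}
total = sum(support.values())
majority = max(support, key=support.get)

# Always predicting the majority class is right exactly when the row is in that class.
accuracy = support[majority] / total

# Only the majority class has a nonzero F1: its recall is 1 and its precision equals the accuracy.
precision, recall = accuracy, 1.0
f1_majority = 2 * precision * recall / (precision + recall)
weighted_f1 = f1_majority * support[majority] / total

print('baseline accuracy: {:5.4f}'.format(accuracy))
print('baseline weighted f1: {:5.4f}'.format(weighted_f1))
```

Both figures match the logistic regression's 0.4955 and 0.3283, so the model adds nothing over the majority-class baseline. scikit-learn's DummyClassifier(strategy='most_frequent') provides the same baseline as a fitted estimator.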