Before diving into model evaluation, I want to give a recap of what I've done to wrangle and engineer the dataset. The important thing to note is that I have tried two different feature engineering approaches. I will first explain the data scraping step and then I will explain the data processing for the two different feature engineering approaches following this data scraping/ initial collection step.

Data scraping step: prior to writing any code, I made a spreadsheet where I went through job postings on LinkedIn and recorded the url for the job postings in one column and the rating I gave each one (0-3 based on how much I would want that job) in another. This spreadsheet has 107 rows. Then I wrote code to iterate over the urls in order to generate a text file for each one containing valuable information I was able to scrape from the job posting and concatenate together using the requests library and beautifulsoup. This information was the body text of the job posting and some other metadata like name of position, company, locale, industries, etc. included at the top of the text file.

Feature Engineering Approach 1 (FANCY SCHMANCY):
1. I wrote code to iterate over the generated text files and pipe this text into a prompt engineered to have ChatGPT output summaries for each text file in the form of a JSON file with specified categories as keys and lists of phrases as values.
2. Once the JSON files were generated I could iterate over these to create a pandas dataframe with columns corresponding to the JSON dictionary keys. Thus each row of the resultant dataframe would essentially summarize the job posting. Remember, I had to reincorporate the labels back in with the data so 'rating' is also a column in this dataframe.
3. Then, I wrote special logic to treat the data in certain columns categorically and create one-hot encodings for data in those columns, while most of the others I performed clustering on phrase embedding maps produced by aggregating all the data in those columns and passing it through a sentence transformer, and then counted frequencies of cluster labels in new cluster_count columns.
4. Unused columns like the original columns containing phrases get dropped, and the dataframe gets passed through a supervised learning algorithm. I demonstrate and discuss results of that process herein, evaluating performance of different model architectures with different hyperparameter choices relevant to this feature extraction phase.

Feature Engineering Approach 2 (RUDIMENTARY TFIDF):
1. I essentially just started out with a dataframe with one column, which contained the text read from the text files containing data scraped for each job posting.
2. I performed TFIDF vectorization on this one column, generating many more columns, and then passed the resultant dataframe (again reincorporating the 'rating' labels into another column and using that column as the target) to the same supervised learning algorithms that I tried for the first feature engineering approach, taking into consideration different hyperparameter choices for the TFIDF vectorization. I discuss results of that process and compare the results of the two feature engineering approaches.    

The following code is used to create a dataframe from all the JSON summary files saved to my Google Drive.

In [10]:
json_files_path = "drive/MyDrive/LI-Jobs-JSON/"
links_csv_path = "drive/MyDrive/successfulLinksGDRIVE.csv"

import pandas as pd
links_dataframe = pd.read_csv(links_csv_path, header=None, names=['url', 'rating'])
#links_dataframe.head()
#print(links_dataframe.loc[0, 'rating'])

# let's snakecase our column names to avoid having spaces in them
# also we add a rating column at the end of the list to store the target variable
df_columns = ["employment_type", "job_function", "description_of_product/service", "industries",
              "position_name", "broader_role_name", "company", #"location", "salary/compensation_range",
              "responsibilities", "goals/objectives", "name_of_department/team", "required_qualifications",
              "preferred_qualifications", "benefits", "work_arrangement", "city", "state", "country",
              "min_salary", "max_salary", "rating"]
# Initialize DataFrame with column names
df = pd.DataFrame(columns=df_columns)

import os
import json
locale_json_files_path = "drive/MyDrive/jobLocations/"
salary_json_files_path = "drive/MyDrive/jobSalaries/"

for i in range(1, 108):
  row_num = df.last_valid_index()
  print(row_num)
  if row_num == None:
    row_num = 0
  else:
    row_num = row_num + 1
  assert i == row_num + 1
  json_file_path = os.path.join(json_files_path, f'row{i}.json')
  # Read JSON file
  with open(json_file_path, 'r') as json_file:
    json_data = json.load(json_file)
  if 'fields' in json_data.keys():
    #print(json_data)
    field_names = json_data['fields']
    for index, value in enumerate(field_names):
      info = json_data['info'][index]
      # Check if the value exists as a column name (ignoring case)
      column_name = value.replace(" ", "_").lower()
      if column_name in df.columns:
        # Add the info to the corresponding column and row
        df.at[row_num, column_name] = info
  else:
    for key, value in json_data.items():
      # Check if the key exists as a column name (ignoring case)
      column_name = key.replace(" ", "_").lower()
      if column_name == 'emploment_type':
        column_name = 'employment_type'
      if column_name in df.columns:
        # Add the value to the corresponding column and row
        df.at[row_num, column_name] = value

  locale_json_file_path = os.path.join(locale_json_files_path, f'row{i}.json')
  salary_json_file_path = os.path.join(salary_json_files_path, f'row{i}.json')
  # Read locale JSON file
  with open(locale_json_file_path, 'r') as json_file:
    json_data = json.load(json_file)
  try:
    city = json_data["city"]
    state = json_data["state"]
    country = json_data["country"]
    df.at[row_num, 'city'] = city
    df.at[row_num, 'state'] = state
    df.at[row_num, 'country'] = country
  except:
    df.at[row_num, 'city'] = "N/A"
    df.at[row_num, 'state'] = "N/A"
    df.at[row_num, 'country'] = "N/A"
  # Read salary JSON file
  with open(salary_json_file_path, 'r') as json_file:
    json_data = json.load(json_file)
  min_salary = json_data["salary_min"]
  max_salary = json_data["salary_max"]
  df.at[row_num, 'min_salary'] = min_salary
  df.at[row_num, 'max_salary'] = max_salary
  rating = links_dataframe.loc[row_num, 'rating']
  df.at[row_num, 'rating'] = rating

import numpy as np
# Replace 'N/A' with NaN in the whole DataFrame
df.replace('N/A', np.nan, inplace=True)

None
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105


The following code runs the fancy feature engineering process (approach 1) on the data before training and evaluating the performance of a Linear Regression model.

In [None]:
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
import numpy as np
from collections import Counter
import pandas as pd


common_corpus_columns = ['job_function', 'description_of_product/service', 'industries', 'position_name', 'broader_role_name',
                         'responsibilities', 'goals/objectives', 'required_qualifications', 'preferred_qualifications', 'benefits']

singular_corpus_columns = ['company', 'name_of_department/team', 'city']

work_arrangement_columns = ['employment_type', 'work_arrangement']

categorical_label_columns = ['state', 'country']

tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-large")
model = AutoModel.from_pretrained("thenlper/gte-large")

def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

def feature_engineering_step(train_df, test_df):
  phrases = []
  row_labels = []
  for col in common_corpus_columns:
    new_col_data = []
    for index, value in train_df[col].items():
      if type(value) == list:
        # Tokenize the input texts
        for phrase in value:
          phrases.append(phrase)
          row_labels.append(index)
      else:
        phrases.append(str(value))
        row_labels.append(index)
        value = [str(value)]
      batch_dict = tokenizer(value, max_length=512, padding=True, truncation=True, return_tensors='pt')
      outputs = model(**batch_dict)
      embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

      # (Optionally) normalize embeddings
      embeddings = F.normalize(embeddings, p=2, dim=1)

      new_col_data.append(embeddings.detach().numpy())

    train_df[f'{col}_vectors'] = new_col_data

  vector_cols = [f'{col}_vectors' for col in common_corpus_columns]
  phrase_vecs_list = []
  for col in vector_cols:
    for i in train_df[col]:
      for j in i:
        phrase_vecs_list.append(j)
  kmeans = KMeans(n_clusters=400, random_state=57)
  kmeans.fit(phrase_vecs_list)
  centroids = kmeans.cluster_centers_

  for i in range(400):
    column_name = f'common_cluster{i}_counts'  # Generate column name
    train_df[column_name] = 0  # Fill the column with zeros
    test_df[column_name] = 0

  cluster_labels = kmeans.labels_
  # Associate phrases with cluster labels
  data = {'phrase': phrases, 'row_label': row_labels, 'cluster_label': cluster_labels}
  common_clusters_df = pd.DataFrame(data)

  for i in range(row_labels[-1] + 1):
    clusters_in_row = common_clusters_df.loc[common_clusters_df['row_label'] == i, 'cluster_label'].tolist()
    # Generate value counts
    counts = Counter(clusters_in_row)
    for cluster in counts.keys():
      train_df.at[i, f'common_cluster{cluster}_counts'] = counts[cluster]

  test_phrases = []
  test_row_labels = []
  test_cluster_labels = []
  for col in common_corpus_columns:
    for index, value in test_df[col].items():
      if type(value) == list:
        # Tokenize the input texts
        for phrase in value:
          test_phrases.append(phrase)
          test_row_labels.append(index)
      else:
        test_phrases.append(str(value))
        test_row_labels.append(index)
        value = [str(value)]
      batch_dict = tokenizer(value, max_length=512, padding=True, truncation=True, return_tensors='pt')
      outputs = model(**batch_dict)
      embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
      for emb in embeddings:
        distances = np.linalg.norm(centroids - emb.detach().numpy(), axis=1)
        # Find the index of the centroid with the minimum distance
        closest_centroid_index = np.argmin(distances)
        test_cluster_labels.append(closest_centroid_index)

  # Associate phrases with cluster labels
  data = {'phrase': test_phrases, 'row_label': test_row_labels, 'cluster_label': test_cluster_labels}
  test_clusters_df = pd.DataFrame(data)

  for i in range(test_row_labels[-1] + 1):
    clusters_in_row = test_clusters_df.loc[test_clusters_df['row_label'] == i, 'cluster_label'].tolist()
    # Generate value counts
    counts = Counter(clusters_in_row)
    for cluster in counts.keys():
      test_df.at[i, f'common_cluster{cluster}_counts'] = counts[cluster]

  n_clusters_scale_factor = {'city': .625, 'company': .94, 'name_of_department/team': .25}
  singular_corpus_clusters_df_dict = {}
  for col in singular_corpus_columns:
    phrases = []
    test_phrases = []
    row_labels = []
    test_row_labels = []
    test_cluster_labels = []
    new_col_data = []
    for index, value in train_df[col].items():
      if type(value) == list:
        # Tokenize the input texts
        for phrase in value:
          phrases.append(phrase)
          row_labels.append(index)
      else:
        phrases.append(str(value))
        row_labels.append(index)
        value = [str(value)]
      batch_dict = tokenizer(value, max_length=512, padding=True, truncation=True, return_tensors='pt')
      outputs = model(**batch_dict)
      embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

      # (Optionally) normalize embeddings
      embeddings = F.normalize(embeddings, p=2, dim=1)

      new_col_data.append(embeddings.detach().numpy())

    train_df[f'{col}_vectors'] = new_col_data
    vector_col = f'{col}_vectors'
    phrase_vecs_list = []
    for i in train_df[vector_col]:
      for j in i:
        phrase_vecs_list.append(j)
    n_clusters = int(train_df.shape[0]*n_clusters_scale_factor[col])
    kmeans = KMeans(n_clusters=n_clusters, random_state=57)
    kmeans.fit(phrase_vecs_list)
    centroids = kmeans.cluster_centers_

    for i in range(n_clusters):
      column_name = f'{col}_cluster{i}_counts'  # Generate column name
      train_df[column_name] = 0  # Fill the column with zeros
      test_df[column_name] = 0

    cluster_labels = kmeans.labels_
    # Associate phrases with cluster labels
    data = {'phrase': phrases, 'row_label': row_labels, 'cluster_label': cluster_labels}
    clusters_df = pd.DataFrame(data)
    singular_corpus_clusters_df_dict[col] = clusters_df

    for i in range(row_labels[-1] + 1):
      clusters_in_row = clusters_df.loc[clusters_df['row_label'] == i, 'cluster_label'].tolist()
      # Generate value counts
      counts = Counter(clusters_in_row)
      for cluster in counts.keys():
        train_df.at[i, f'{col}_cluster{cluster}_counts'] = counts[cluster]

    for index, value in test_df[col].items():
      if type(value) == list:
        # Tokenize the input texts
        for phrase in value:
          test_phrases.append(phrase)
          test_row_labels.append(index)
      else:
        test_phrases.append(str(value))
        test_row_labels.append(index)
        value = [str(value)]
      batch_dict = tokenizer(value, max_length=512, padding=True, truncation=True, return_tensors='pt')
      outputs = model(**batch_dict)
      embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

      # (Optionally) normalize embeddings
      embeddings = F.normalize(embeddings, p=2, dim=1)
      for emb in embeddings:
        distances = np.linalg.norm(centroids - emb.detach().numpy(), axis=1)
        # Find the index of the centroid with the minimum distance
        closest_centroid_index = np.argmin(distances)
        test_cluster_labels.append(closest_centroid_index)

    # Associate phrases with cluster labels
    data = {'phrase': test_phrases, 'row_label': test_row_labels, 'cluster_label': test_cluster_labels}
    test_clusters_df = pd.DataFrame(data)

    for i in range(test_row_labels[-1] + 1):
      clusters_in_row = test_clusters_df.loc[test_clusters_df['row_label'] == i, 'cluster_label'].tolist()
      # Generate value counts
      counts = Counter(clusters_in_row)
      for cluster in counts.keys():
        test_df.at[i, f'{col}_cluster{cluster}_counts'] = counts[cluster]


  phrases = []
  test_phrases = []
  row_labels = []
  test_row_labels = []
  test_cluster_labels = []
  for col in work_arrangement_columns:
    new_col_data = []
    for index, value in train_df[col].items():
      if type(value) == list:
        # Tokenize the input texts
        for phrase in value:
          phrases.append(phrase)
          row_labels.append(index)
      else:
        phrases.append(str(value))
        row_labels.append(index)
        value = [str(value)]
      batch_dict = tokenizer(value, max_length=512, padding=True, truncation=True, return_tensors='pt')
      outputs = model(**batch_dict)
      embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

      # (Optionally) normalize embeddings
      embeddings = F.normalize(embeddings, p=2, dim=1)

      new_col_data.append(embeddings.detach().numpy())

    train_df[f'{col}_vectors'] = new_col_data

  vector_cols = [f'{col}_vectors' for col in work_arrangement_columns]
  phrase_vecs_list = []
  for col in vector_cols:
    for i in train_df[col]:
      for j in i:
        phrase_vecs_list.append(j)
  kmeans = KMeans(n_clusters=10, random_state=57)
  kmeans.fit(phrase_vecs_list)
  centroids = kmeans.cluster_centers_

  for i in range(10):
    column_name = f'work_arrangement_cluster{i}_counts'  # Generate column name
    train_df[column_name] = 0  # Fill the column with zeros
    test_df[column_name] = 0

  cluster_labels = kmeans.labels_
  # Associate phrases with cluster labels
  data = {'phrase': phrases, 'row_label': row_labels, 'cluster_label': cluster_labels}
  work_arrangement_clusters_df = pd.DataFrame(data)

  for i in range(row_labels[-1] + 1):
    clusters_in_row = work_arrangement_clusters_df.loc[work_arrangement_clusters_df['row_label'] == i, 'cluster_label'].tolist()
    # Generate value counts
    counts = Counter(clusters_in_row)
    for cluster in counts.keys():
      train_df.at[i, f'work_arrangement_cluster{cluster}_counts'] = counts[cluster]

  for col in work_arrangement_columns:
    for index, value in test_df[col].items():
      if type(value) == list:
        # Tokenize the input texts
        for phrase in value:
          test_phrases.append(phrase)
          test_row_labels.append(index)
      else:
        test_phrases.append(str(value))
        test_row_labels.append(index)
        value = [str(value)]
      batch_dict = tokenizer(value, max_length=512, padding=True, truncation=True, return_tensors='pt')
      outputs = model(**batch_dict)
      embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

      # (Optionally) normalize embeddings
      embeddings = F.normalize(embeddings, p=2, dim=1)
      for emb in embeddings:
        distances = np.linalg.norm(centroids - emb.detach().numpy(), axis=1)
        # Find the index of the centroid with the minimum distance
        closest_centroid_index = np.argmin(distances)
        test_cluster_labels.append(closest_centroid_index)

  # Associate phrases with cluster labels
  data = {'phrase': test_phrases, 'row_label': test_row_labels, 'cluster_label': test_cluster_labels}
  test_clusters_df = pd.DataFrame(data)

  print(test_row_labels)
  print(test_phrases)
  print(test_cluster_labels)
  for i in range(test_row_labels[-1] + 1):
    clusters_in_row = test_clusters_df.loc[test_clusters_df['row_label'] == i, 'cluster_label'].tolist()
    # Generate value counts
    counts = Counter(clusters_in_row)
    for cluster in counts.keys():
      test_df.at[i, f'work_arrangement_cluster{cluster}_counts'] = counts[cluster]

  return train_df, test_df


df_copy = df.copy()

for col in categorical_label_columns:
  # Perform one-hot encoding
  one_hot_encoded_df = pd.get_dummies(df_copy[col])
  # Concatenate the original dataframe with the one-hot encoded dataframe
  df_copy = pd.concat([df_copy, one_hot_encoded_df], axis=1)

y = df_copy['rating']  # Target variable
X = df_copy.drop(columns=['rating'])  # Features

# Assuming X and y are your feature matrix and target variable respectively
# Split the data into training and testing sets
train_df, test_df, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=507)

train_df, test_df = feature_engineering_step(train_df, test_df)

print('got through feature engineering step')

columns_to_drop = common_corpus_columns + singular_corpus_columns + work_arrangement_columns + categorical_label_columns
embedded_data_columns = common_corpus_columns + singular_corpus_columns + work_arrangement_columns
embedded_data_columns = [f'{col}_vectors' for col in embedded_data_columns]

train_columns_to_drop = columns_to_drop + embedded_data_columns
test_columns_to_drop = columns_to_drop

# Make a copy of the dataframe with specified columns dropped
train_df = train_df.drop(columns=train_columns_to_drop)
test_df = test_df.drop(columns=test_columns_to_drop)
train_df = train_df.drop(columns=[''])
test_df = test_df.drop(columns=[''])

# Iterate over columns and fill NaN values with the mode of each column
for col in train_df.columns:
  mode_val = train_df[col].mode()[0]
  train_df[col].fillna(mode_val, inplace=True)  # Fill NaN values with the mode
  if col in test_df.columns:
    test_df[col].fillna(mode_val, inplace=True)

from sklearn.metrics import mean_absolute_error, accuracy_score
from sklearn.linear_model import LinearRegression
regression_model = LinearRegression()

# Assuming X_train, X_test, y_train, y_test are your training and testing data
# Train a regression model instead of SVM classifier
regression_model.fit(train_df, y_train)

# Make predictions on the testing data
y_pred = regression_model.predict(test_df)

# Calculate the mean absolute error
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)

# Round the predictions
y_pred_rounded = [round(pred) for pred in y_pred]

# Calculate accuracy using the rounded predictions and actual target values
accuracy = accuracy_score(y_test, y_pred_rounded)

print("accuracy is", accuracy)

  train_df[column_name] = 0  # Fill the column with zeros
  train_df[column_name] = 0  # Fill the column with zeros
  train_df[column_name] = 0  # Fill the column with zeros
  train_df[column_name] = 0  # Fill the column with zeros
  train_df[column_name] = 0  # Fill the column with zeros
  train_df[column_name] = 0  # Fill the column with zeros
  train_df[column_name] = 0  # Fill the column with zeros
  train_df[column_name] = 0  # Fill the column with zeros
  train_df[column_name] = 0  # Fill the column with zeros
  train_df[column_name] = 0  # Fill the column with zeros
  train_df[column_name] = 0  # Fill the column with zeros
  test_df[column_name] = 0
  train_df[column_name] = 0  # Fill the column with zeros
  test_df[column_name] = 0
  train_df[column_name] = 0  # Fill the column with zeros
  test_df[column_name] = 0
  train_df[column_name] = 0  # Fill the column with zeros
  test_df[column_name] = 0
  train_df[column_name] = 0  # Fill the column with zeros
  test_df[column_name]

[42, 27, 71, 64, 58, 69, 28, 37, 102, 46, 31, 75, 51, 93, 65, 6, 34, 38, 17, 76, 96, 81, 42, 27, 71, 64, 58, 69, 28, 37, 102, 46, 31, 75, 51, 93, 65, 6, 34, 38, 17, 76, 96, 81]
['Full-time', 'Full-time', 'Full-time', 'Full-time', 'Full-time', 'Full-time', 'Full-time', 'Full-time', 'Full-time', 'Full-time', 'Full-time', 'Full-time', 'Full-time', 'Full-time', 'Full-time', 'Full-time', 'Part-time', 'Full-time', 'Full-time', 'Full-time', 'Full-time', 'Full-time', 'On-site', 'N/A', 'REMOTE', 'remote', 'N/A', 'N/A', 'On-site', 'N/A', 'N/A', 'N/A', 'Onsite', 'Weekly hybrid onsite component', 'N/A', 'N/A', 'Remote', 'On-site', 'On-site', 'N/A', 'N/A', 'Onsite', 'N/A', 'Local remote']
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 6, 3, 5, 5, 3, 3, 6, 3, 3, 3, 6, 2, 3, 3, 5, 6, 6, 3, 3, 6, 3, 5]
got through feature engineering step
Mean Absolute Error: 1713990052.7007072
accuracy is 0.045454545454545456


In [None]:
y_pred

array([ 1.95319117e+00, -1.64982558e+09,  2.06118124e+00, -4.07070856e+08,
        1.95319117e+00,  1.65079003e-01,  2.84838882e+00,  1.00002599e+00,
       -5.11490388e+10,  5.05109598e+09, -2.01720169e+10,  2.06118124e+00,
        1.33193870e+00,  2.08928929e+00,  1.69212827e+00,  2.06118124e+00,
        4.48297080e+09, -1.80860825e+10,  1.90734220e+00,  2.06118124e+00,
       -1.89942615e+10, -1.81146828e+10])

We see for this particular train-test split, the linear regression model makes absurd predictions. Maybe we need to perform normalization on the data before passing it to the learning algorithm.

In [None]:
from sklearn.preprocessing import MinMaxScaler

regression_model = LinearRegression()

min_max_scaler = MinMaxScaler()
X_train_normalized_minmax = min_max_scaler.fit_transform(train_df)
X_test_normalized_minmax = min_max_scaler.transform(test_df)

# Assuming X_train, X_test, y_train, y_test are your training and testing data
# Train a regression model instead of SVM classifier
regression_model.fit(X_train_normalized_minmax, y_train)

# Make predictions on the testing data
y_pred = regression_model.predict(X_test_normalized_minmax)

# Calculate the mean absolute error
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)

# Round the predictions
y_pred_rounded = [round(pred) for pred in y_pred]

# Calculate accuracy using the rounded predictions and actual target values
accuracy = accuracy_score(y_test, y_pred_rounded)

print("accuracy is", accuracy)

Mean Absolute Error: 1222531901985.8655
accuracy is 0.22727272727272727


Well, that didn't help.

In [None]:
for column in train_df.columns:
  print(f"Column: {column}")
  # Print value counts for the column
  print(train_df[column].value_counts())
  print()  # Print an empty line for readability

Column: min_salary
min_salary
100.000    63
150.000     3
192.000     1
137.500     1
216.000     1
101.000     1
198.200     1
115.000     1
170.000     1
58.300      1
126.000     1
107.000     1
98.144      1
117.000     1
245.000     1
129.000     1
62.400      1
90.000      1
122.000     1
120.000     1
130.000     1
Name: count, dtype: int64

Column: max_salary
max_salary
200.000    64
170.000     2
130.000     2
220.000     2
175.000     1
216.646     1
132.000     1
93.600      1
385.000     1
173.000     1
162.000     1
189.000     1
133.000     1
288.000     1
223.600     1
297.300     1
122.000     1
414.000     1
150.000     1
Name: count, dtype: int64

Column: AR
AR
False    84
True      1
Name: count, dtype: int64

Column: AZ
AZ
False    84
True      1
Name: count, dtype: int64

Column: CA
CA
False    72
True     13
Name: count, dtype: int64

Column: CO
CO
False    84
True      1
Name: count, dtype: int64

Column: FL
FL
False    84
True      1
Name: count, dtype: int64

C

In [None]:
train_df.head()

Unnamed: 0,min_salary,max_salary,AR,AZ,CA,CO,FL,GA,IA,IL,...,work_arrangement_cluster0_counts,work_arrangement_cluster1_counts,work_arrangement_cluster2_counts,work_arrangement_cluster3_counts,work_arrangement_cluster4_counts,work_arrangement_cluster5_counts,work_arrangement_cluster6_counts,work_arrangement_cluster7_counts,work_arrangement_cluster8_counts,work_arrangement_cluster9_counts
16,100.0,200.0,False,False,False,False,False,False,False,True,...,0,1,1,0,0,0,0,0,0,0
70,100.0,200.0,False,False,False,False,False,False,False,False,...,0,0,0,0,0,0,0,0,0,0
15,100.0,200.0,False,False,False,False,False,False,False,False,...,0,1,0,1,0,0,0,0,0,0
58,117.0,173.0,False,False,False,False,False,False,False,False,...,0,0,0,0,0,0,0,0,0,0
13,100.0,200.0,False,False,False,False,False,False,False,False,...,0,1,0,0,0,0,1,0,0,0


TT Split RS 42: accuracy: .591
mean absolute error: .614

TT Split RS 95: Mean Absolute Error: 6277592985.598255
accuracy is 0.22727272727272727

TT Split RS 13: Mean Absolute Error: 11821743704.50261
accuracy is 0.18181818181818182

TT Split RS 21: Mean Absolute Error: 15975157405.285627
accuracy is 0.2727272727272727

TT Split RS 507: Mean Absolute Error: 1713990052.7007072
accuracy is 0.045454545454545456

TT Split RS 42: We see that using a Linear Regression model where we round our predictions, using the "fancy" (maybe just dumb) feature engineering approach the way I did here yields an accuracy 59% and a MAE of .614. What do the results of Linear Regression predictions look like with the less fancy feature engineering approach (TFIDF)? We will look at that after we experiment with different models after having just done FE approach 1 and the chosen train-test split.

Other train-test splits: we get absurd predictions from the linear regression model as shown above by the massive error values and horrible accuracy scores. We need to figure out why this is happening and try to get reasonable accuracy out of Linear Regression, because clearly something is wrong. We tried normalizing our data with MinMaxScaler as shown in one of the earlier code cells but we got horrendous results after doing that as well. Could dropping some columns help us? We saw earlier (when we were doing FE approach 1 but including phrases from the test set before doing the clustering) that what appeared to be a relatively well-performing linear regression model started showing results similar to what we are seeing now. Could either adding or dropping some columns get us the outcome we desire? To drop columns, we could drop cluster count columns or other columns with uninteresting distributions (which we tried earlier and saw linear regression start predicting wildly), and to add columns, we could for example increase our number of common corpus clusters from 400 to 500 and see what happens. Lastly, all of this has reminded me that I should probably try a LOGISTIC REGRESSION model next.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

le = LabelEncoder()
y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)

# Create a Logistic Regression model
# By default, LogisticRegression uses OvR for multi-class classification
model = LogisticRegression(max_iter=10000)  # Increase max_iter for convergence if needed

# Fit the model to the training data
model.fit(train_df, y_train)

# Make predictions on the testing data
y_pred = model.predict(test_df)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Calculate the mean absolute error
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)

Accuracy: 0.5454545454545454
Mean Absolute Error: 0.45454545454545453


TT Split RS 42: Accuracy: 0.5909090909090909
Mean Absolute Error: 0.45454545454545453

TT Split RS 95: Accuracy: 0.4090909090909091
Mean Absolute Error: 0.7727272727272727

TT Split RS 13: Accuracy: 0.6363636363636364
Mean Absolute Error: 0.4090909090909091

TT Split RS 21: Accuracy: 0.5454545454545454
Mean Absolute Error: 0.5454545454545454

TT Split RS 507: Accuracy: 0.5454545454545454
Mean Absolute Error: 0.45454545454545453

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Initialize the decision tree classifier
clf = DecisionTreeClassifier()

# Train the decision tree classifier on the training data
clf.fit(train_df, y_train)

# Test the classifier on the testing data
y_pred = clf.predict(test_df)

# Evaluate the performance of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Calculate the mean absolute error
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)

Accuracy: 0.2727272727272727
Mean Absolute Error: 1.1818181818181819


TT Split RS 42: Accuracy: 0.54545
Mean Absolute Error: .5

TT Split RS 95: Accuracy: 0.2727272727272727
Mean Absolute Error: 1.1363636363636365

TT Split RS 13: Accuracy: 0.5
Mean Absolute Error: 0.5454545454545454

TT Split RS 21: Accuracy: 0.5
Mean Absolute Error: 0.5909090909090909

TT Split RS 507: Accuracy: 0.45454545454545453
Mean Absolute Error: 0.7272727272727273

Compared to the Decision Tree with default hyperparameter values, Linear Regression had better accuracy (rounded predictions) but worse error on unrounded predictions for the train test split with random seed = 42.

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_classifier = RandomForestClassifier(max_depth=None, max_features='auto', min_samples_leaf=1, min_samples_split=2, n_estimators=300)

rf_classifier.fit(train_df, y_train)

# Test the classifier on the testing data
y_pred = rf_classifier.predict(test_df)

# Evaluate the performance of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)

  warn(


Accuracy: 0.5454545454545454
Mean Absolute Error: 0.5454545454545454


TT Split RS 42: Accuracy: 0.591
Mean Absolute Error: 0.454545

TT Split RS 95: Accuracy: 0.45454545454545453
Mean Absolute Error: 0.9090909090909091

TT Split RS 13: Accuracy: 0.6363636363636364
Mean Absolute Error: 0.4090909090909091

TT Split RS 21: Accuracy: 0.5454545454545454
Mean Absolute Error: 0.5

TT Split RS 507: Accuracy: 0.5454545454545454
Mean Absolute Error: 0.5454545454545454

We now try a HistGradientBoostingClassifier with our dialed in FE approach 1, just bc we are curious. I tried this earlier on our data prior to filling missing values with the mode of their column, because it actually worked even with the missing values, which Decision trees, random forest, and linear regression algorithms do not. When we trained the HistGradientBoostingClassifier on our data that had missing values, we got horrible accuracy at 32% and a pretty high MAE of .86.

In [None]:
from sklearn.ensemble import HistGradientBoostingClassifier

# Initialize the HistGradientBoostingClassifier
clf = HistGradientBoostingClassifier()

# Train the classifier on the training data
clf.fit(train_df, y_train)

# Test the classifier on the testing data
y_pred = clf.predict(test_df)

# Evaluate the performance of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)

Accuracy: 0.3181818181818182
Mean Absolute Error: 0.9545454545454546


TT Split RS 42: Accuracy: 0.5
Mean Absolute Error: 0.6818181818181818

TT Split RS 95: Accuracy: 0.5
Mean Absolute Error: 0.6818181818181818

TT Split RS 13: Accuracy: 0.5454545454545454
Mean Absolute Error: 0.5

TT Split RS 21: Accuracy: 0.45454545454545453
Mean Absolute Error: 0.6363636363636364

TT Split RS 507: Accuracy: 0.3181818181818182
Mean Absolute Error: 0.9545454545454546

The algorithm is definitely doing better now, but not as well as any of the others I've tried.

We now try XGBoost with our data. We tried it before with our earlier invalid version of FE Approach 1 and got accuracy of 45%, MAE of .72. Second worst to HistGradientBoostingClassifier. How much better does it do now?

In [None]:
!pip install xgboost



In [None]:
import xgboost as xgb

# Initialize the XGBoost classifier
xgb_classifier = xgb.XGBClassifier(objective='binary:logistic', random_state=42)

# Train the classifier on the training data
xgb_classifier.fit(train_df, y_train)

# Test the classifier on the testing data
y_pred = xgb_classifier.predict(test_df)

# Evaluate the performance of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)

Accuracy: 0.5454545454545454
Mean Absolute Error: 0.5454545454545454


TT Split RS 42: Accuracy: 0.6363636363636364
Mean Absolute Error: 0.5

TT Split RS 95: Accuracy: 0.4090909090909091
Mean Absolute Error: 0.8636363636363636

TT Split RS 13: Accuracy: 0.6818181818181818
Mean Absolute Error: 0.36363636363636365

TT Split RS 21: Accuracy: 0.5909090909090909
Mean Absolute Error: 0.45454545454545453

TT Split RS 507: Accuracy: 0.5454545454545454
Mean Absolute Error: 0.5454545454545454

Nice! So we got decent accuracy at nearly 64% and an MAE of .5 with XGBoost (first train-test split). Let's try some other parameters to pass to the XGBClassifier object.

In [None]:
# Initialize the XGBoost classifier
xgb_classifier = xgb.XGBClassifier(objective='multi:softmax', num_class=4, random_state=42)

# Train the classifier on the training data
xgb_classifier.fit(train_df, y_train)

# Test the classifier on the testing data
y_pred = xgb_classifier.predict(test_df)

# Evaluate the performance of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)

Accuracy: 0.4090909090909091
Mean Absolute Error: 0.8636363636363636


OK, that didn't change our results at all. We want to perform grid searches on the hyperparameter spaces for the Decision Tree, Random Forest, and XGBoost algorithms.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

# Define the parameter grid for hyperparameter tuning
param_grid = {
    'max_depth': [None, 10, 20, 30],  # Maximum depth of the tree
    'min_samples_split': [2, 5, 10],   # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1, 2, 4],     # Minimum number of samples required to be at a leaf node
    'max_features': ['auto', 'sqrt', 'log2']  # Number of features to consider when looking for the best split
}

# Initialize the Decision Tree classifier
dt_classifier = DecisionTreeClassifier(random_state=42)

# Perform grid search with cross-validation
grid_search = GridSearchCV(estimator=dt_classifier, param_grid=param_grid, cv=5)
grid_search.fit(train_df, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

# Test the classifier on the testing data using the best hyperparameters
best_dt_classifier = grid_search.best_estimator_
y_pred = best_dt_classifier.predict(test_df)

# Evaluate the performance of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)



Best Hyperparameters: {'max_depth': 20, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 2}
Accuracy: 0.5454545454545454
Mean Absolute Error: 0.5


Even after finding the best estimator, the decision tree algorithm gives the same accuracy and MAE against the test set. We can probably rule out a decision tree as being the best model for our motives. Let's see if we can squeeze out anything better than 59% accuracy and .45 MAE from the Random Forest Classifier using a grid search.

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Define the parameter grid for hyperparameter tuning
param_grid = {
    'n_estimators': [100, 200, 300],        # Number of trees in the forest
    'max_depth': [None, 10, 20],             # Maximum depth of the trees
    'min_samples_split': [2, 5, 10],         # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1, 2, 4],           # Minimum number of samples required to be at a leaf node
    'max_features': ['auto', 'sqrt', 'log2'] # Number of features to consider when looking for the best split
}

rf_classifier = RandomForestClassifier(random_state=42)

# Perform grid search with cross-validation
grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, cv=5)
grid_search.fit(train_df, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

# Test the classifier on the testing data using the best hyperparameters
best_rf_classifier = grid_search.best_estimator_
y_pred = best_rf_classifier.predict(test_df)

# Evaluate the performance of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)

  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(


Best Hyperparameters: {'max_depth': None, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 300}
Accuracy: 0.5909090909090909
Mean Absolute Error: 0.45454545454545453


Just like the decision tree, hyperparameter grid search didn't find us an estimator with better accuracy than we've already seen. Now we will perform a grid search for XGBoost.

In [None]:
# Define the parameter grid
param_grid = {
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5],
    'min_child_weight': [1, 5, 10],
    'subsample': [0.5, 1.0],
    'colsample_bytree': [0.5, 1.0],
    'reg_lambda': [1, 10, 100],
    'reg_alpha': [0, 1, 10]
}

# Initialize XGBoost classifier
xgb_classifier = xgb.XGBClassifier(random_state=42)

# Perform grid search with cross-validation
grid_search = GridSearchCV(estimator=xgb_classifier, param_grid=param_grid, cv=5)
grid_search.fit(train_df, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

# Test the classifier on the testing data using the best hyperparameters
best_rf_classifier = grid_search.best_estimator_
y_pred = best_rf_classifier.predict(test_df)

# Evaluate the performance of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)

Best Hyperparameters: {'colsample_bytree': 1.0, 'learning_rate': 0.1, 'max_depth': 3, 'min_child_weight': 1, 'n_estimators': 200, 'reg_alpha': 0, 'reg_lambda': 1, 'subsample': 1.0}
Accuracy: 0.5909090909090909
Mean Absolute Error: 0.5


Even though XGBoost had an accuracy of almost 64% against our test set from our chosen seed for the train-test split, cross validation calculated the best estimator as having an accuracy of only 59%.

I also tried a Support Vector Machine algorithm:

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(train_df)
X_test_scaled = scaler.transform(test_df)

# Step 4: Initialize and Train the SVM Classifier
svm_classifier = SVC(kernel='rbf', C=1.0, gamma='scale')  # Example hyperparameters
svm_classifier.fit(X_train_scaled, y_train)

# Step 5: Evaluate the Classifier
accuracy = svm_classifier.score(X_test_scaled, y_test)
print("Accuracy:", accuracy)

Accuracy: 0.5


TT Split RS 42: Accuracy: .54545

TT Split RS 95: Accuracy: 0.5

TT Split RS 13: Accuracy: .590909

TT Split RS 21: Accuracy: 0.45454545454545453

TT Split RS 507: Accuracy: 0.45454545454545453

So same accuracy as before when we did the feature engineering incorrectly. Note that there isn't an easy or built-in way to calculate mean error for the particular implementation of SVC that we used.

We also tried KNeighborsClassifier and KNeighborsRegressor algorithms.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor

# Initialize the KNN classifier
knn_classifier = KNeighborsClassifier(n_neighbors=5)  # You can adjust the number of neighbors as needed

# Train the classifier
knn_classifier.fit(train_df, y_train)

# Make predictions on the testing data
y_pred = knn_classifier.predict(test_df)

# Evaluate the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

mae = mean_absolute_error(y_test, y_pred)
print("Mean absolute Error:", mae)

# Initialize the KNN regressor
knn_regressor = KNeighborsRegressor(n_neighbors=5)  # You can adjust the number of neighbors as needed

# Train the regressor
knn_regressor.fit(train_df, y_train)

# Make predictions on the testing data
y_pred = knn_regressor.predict(test_df)

# Round the predictions
y_pred_rounded = [round(pred) for pred in y_pred]

# Calculate accuracy using the rounded predictions and actual target values
accuracy = accuracy_score(y_test, y_pred_rounded)
print("Accuracy:", accuracy)

# Evaluate the mean squared error of the regressor
mae = mean_absolute_error(y_test, y_pred)
print("Mean absolute Error:", mae)

Accuracy: 0.4090909090909091
Mean absolute Error: 0.9090909090909091
Accuracy: 0.36363636363636365
Mean absolute Error: 0.7090909090909091


Interestingly, the KNN regressor did better on accuracy after its predictions are rounded but the classifier did better on error for TT SPLIT RS 42.

KNNC
KNNR
TT Split RS 42: Accuracy: .54545
Mean Absolute Error: .5
Accuracy: .591
Mean Absolute Error: .518

TT Split RS 95: Accuracy: 0.45454545454545453
Mean absolute Error: 0.8181818181818182
Accuracy: 0.45454545454545453
Mean absolute Error: 0.8181818181818182

TT Split RS 13: Accuracy: 0.5
Mean absolute Error: 0.6818181818181818
Accuracy: 0.5
Mean absolute Error: 0.5727272727272726

TT Split RS 21: Accuracy: 0.5
Mean absolute Error: 0.5454545454545454
Accuracy: 0.45454545454545453
Mean absolute Error: 0.6727272727272728

TT Split RS 507: Accuracy: 0.4090909090909091
Mean absolute Error: 0.9090909090909091
Accuracy: 0.36363636363636365
Mean absolute Error: 0.7090909090909091

Lastly, we tried a Gaussian Naive Bayes algorithm.

In [None]:
from sklearn.naive_bayes import GaussianNB

# Step 3: Initialize and Train the GNB Classifier
gnb_classifier = GaussianNB()
gnb_classifier.fit(train_df, y_train)  # Convert sparse matrix to dense array for GNB

# Step 4: Evaluate the Classifier
y_pred = gnb_classifier.predict(test_df)  # Convert sparse matrix to dense array for prediction
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

mae = mean_absolute_error(y_test, y_pred)
print("Mean absolute error:", mae)

Accuracy: 0.4090909090909091
Mean absolute error: 0.7727272727272727


TT Split RS 42:
Accuracy: 0.36363636363636365
Mean absolute error: .7727

TT Split RS 95: Accuracy: 0.22727272727272727
Mean absolute error: 1.1818181818181819

TT Split RS 13:
Accuracy: 0.36363636363636365
Mean absolute error: 1.0

TT Split RS 21:
Accuracy: 0.18181818181818182
Mean absolute error: 1.2727272727272727

TT Split RS 507:
Accuracy: 0.4090909090909091
Mean absolute error: 0.7727272727272727

Wow, GaussianNB does quite awfully.

In [None]:
links_csv_path = "drive/MyDrive/successfulLinksGDRIVE.csv"

import pandas as pd
links_dataframe = pd.read_csv(links_csv_path, header=None, names=['url', 'rating'])
#links_dataframe.head()
#print(links_dataframe.loc[0, 'rating'])

# let's snakecase our column names to avoid having spaces in them
# also we add a rating column at the end of the list to store the target variable
df_columns = ["posting_text", "rating"]
# Initialize DataFrame with column names
df = pd.DataFrame(columns=df_columns)

import os
import json

for i in range(1, 108):
  row_num = df.last_valid_index()
  print(row_num)
  if row_num == None:
    row_num = 0
  else:
    row_num = row_num + 1
  assert i == row_num + 1
  text_file_path = f'drive/MyDrive/text_files2/row{i}.txt'
  with open(text_file_path, 'r') as file:
    # Read the entire content of the file into a string
    file_contents = file.read()
  df.at[row_num, 'posting_text'] = file_contents
  rating = links_dataframe.loc[row_num, 'rating']
  df.at[row_num, 'rating'] = rating

import numpy as np
# Replace 'N/A' with NaN in the whole DataFrame
df.replace('N/A', np.nan, inplace=True)

None
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Assuming your dataframe is named 'df' and the text column is named 'job_posting_text'

# Step 1: Split your data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['posting_text'], df['rating'], test_size=0.2, random_state=507)

# Step 2: Initialize and fit TF-IDF vectorizer on the training data only
tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # You can adjust max_features as needed
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

# Step 3: Transform the test data using the fitted vectorizer
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Now you have X_train_tfidf and X_test_tfidf ready to be used with your model for training and evaluation.

Basically, we are going to try a bunch of different models and see what the performance looks like for this much simpler feature engineering approach that only analyzes the text from the job postings using TFIDF, starting with linear regression.

In [None]:
regression_model = LinearRegression()

# Assuming X_train, X_test, y_train, y_test are your training and testing data
# Train a regression model instead of SVM classifier
regression_model.fit(X_train_tfidf, y_train)

# Make predictions on the testing data
y_pred = regression_model.predict(X_test_tfidf)

# Calculate the mean absolute error
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)

# Round the predictions
y_pred_rounded = [round(pred) for pred in y_pred]

# Calculate accuracy using the rounded predictions and actual target values
accuracy = accuracy_score(y_test.tolist(), y_pred_rounded)

print("accuracy is", accuracy)

Mean Absolute Error: 0.5222678485765134
accuracy is 0.5


So using Linear Regression again except with our data that we feature engineered with approach #2, our predictions give an accuracy of 55% (when rounded) and a mean absolute error of .48. It is interesting that the accuracy of this model is worse but the MAE is lower compared to feature engineering approach #1, for train-test split random seed 42.

TT Split RS 42: Mean Absolute Error: 0.48080695505940163
accuracy is 0.5454545454545454

TT Split RS 95: Mean Absolute Error: 0.5589456133684491
accuracy is 0.45454545454545453

TT Split RS 13: Mean Absolute Error: 0.4983688602870732
accuracy is 0.5

TT Split RS 21: Mean Absolute Error: 0.45019354415526897
accuracy is 0.5909090909090909

TT Split RS 507: Mean Absolute Error: 0.5222678485765134
accuracy is 0.5

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

le = LabelEncoder()
y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)

# Create a Logistic Regression model
# By default, LogisticRegression uses OvR for multi-class classification
model = LogisticRegression(max_iter=1000)  # Increase max_iter for convergence if needed

# Fit the model to the training data
model.fit(X_train_tfidf, y_train)

# Make predictions on the testing data
y_pred = model.predict(X_test_tfidf)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Calculate the mean absolute error
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)

Accuracy: 0.45454545454545453
Mean Absolute Error: 0.6363636363636364


TT Split RS 42: Accuracy: 0.5909090909090909
Mean Absolute Error: 0.45454545454545453

TT Split RS 95: Accuracy: 0.5909090909090909
Mean Absolute Error: 0.5909090909090909

TT Split RS 13: Accuracy: 0.5909090909090909
Mean Absolute Error: 0.45454545454545453

TT Split RS 21: Accuracy: 0.5
Mean Absolute Error: 0.5454545454545454

TT Split RS 507: Accuracy: 0.45454545454545453
Mean Absolute Error: 0.6363636363636364

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_train = le.fit_transform(y_train)

# Initialize the Random Forest classifier
rf_classifier = RandomForestClassifier(max_depth=None, max_features='auto', min_samples_leaf=1, min_samples_split=2, n_estimators=300)

# Train the classifier on the training data
rf_classifier.fit(X_train_tfidf, y_train)

# Test the classifier on the testing data
y_pred = rf_classifier.predict(X_test_tfidf)

# Evaluate the performance of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

mae = mean_absolute_error(y_test, y_pred)
print("mae:", mae)
print(y_pred)

  warn(


ValueError: Classification metrics can't handle a mix of unknown and binary targets

We are getting an error, likely because we need to one-hot encode the target classes. Maybe RandomForestRegressor will work straight away.

Anyway, I added in some code to label encode the rating data to evaluate an RFC model together with FE approach 2 which is purely TFIDF based.

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Initialize the Random Forest classifier
rf_classifier = RandomForestRegressor(max_depth=None, max_features='auto', min_samples_leaf=1, min_samples_split=2, n_estimators=300)

# Train the classifier on the training data
rf_classifier.fit(X_train_tfidf, y_train)

# Test the classifier on the testing data
y_pred = rf_classifier.predict(X_test_tfidf)

mae = mean_absolute_error(y_test, y_pred)
print("mae:", mae)
print(y_pred)

# Round the predictions
y_pred_rounded = [round(pred) for pred in y_pred]

# Calculate accuracy using the rounded predictions and actual target values
accuracy = accuracy_score(y_test.tolist(), y_pred_rounded)

print("accuracy is", accuracy)

  warn(


mae: 0.618030303030303
[1.55       0.7        1.59666667 1.74       2.59333333 2.1
 1.9        1.17       2.24666667 2.09333333 1.49       2.22666667
 1.91333333 1.21       1.94       2.22       1.72       2.49333333
 1.74333333 2.42       1.61666667 2.14666667]
accuracy is 0.45454545454545453


TT Split RS 42: Accuracy: .5
Mean Absolute Error: .5024242424242424

TT Split RS 95: Accuracy: .5454545454545454
Mean Absolute Error: .5603030303030303

TT Split RS 13: Accuracy: .5454545454545454
Mean Absolute Error: .5413636363636363

TT Split RS 21: Accuracy: .5
Mean Absolute Error: .521060606060606

TT Split RS 507: Accuracy: .45454545454545453
Mean Absolute Error: .618030303030303


We see that the RandomForestRegressor model with FE approach 2 gives us an accuracy of rounded predictions of 50% and MAE of .502, which is worse than the RandomForestClassifier combined with FE approach 1.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_absolute_error, accuracy_score

le = LabelEncoder()
y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)

# Initialize the Random Forest classifier
rf_classifier = RandomForestClassifier(max_depth=None, max_features='auto', min_samples_leaf=1, min_samples_split=2, n_estimators=300)

# Train the classifier on the training data
rf_classifier.fit(X_train_tfidf, y_train)

# Test the classifier on the testing data
y_pred = rf_classifier.predict(X_test_tfidf)

print(y_pred)
print(y_test)

# Evaluate the performance of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

mae = mean_absolute_error(y_test, y_pred)
print("mae:", mae)
print(y_pred)

  warn(


[2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
[1 0 3 3 3 2 1 2 2 3 2 3 2 1 2 2 0 3 2 2 3 2]
Accuracy: 0.5
mae: 0.5909090909090909
[2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]


Amazingly, the random forest classifier gives us the exact same accuracy on the chosen test set (RS 42) whether we performed FE Approach 1 or 2.

TT Split RS 42: Accuracy: 0.5909090909090909
mae: 0.45454545454545453

TT Split RS 95: Accuracy: 0.5454545454545454
mae: 0.6363636363636364

TT Split RS 13: Accuracy: 0.5909090909090909
mae: 0.4090909090909091

TT Split RS 21: Accuracy: 0.5
mae: 0.5454545454545454

TT Split RS 507: Accuracy: 0.5
mae: 0.5909090909090909

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Initialize the decision tree classifier
clf = DecisionTreeClassifier()

# Train the decision tree classifier on the training data
clf.fit(X_train_tfidf, y_train)

# Test the classifier on the testing data
y_pred = clf.predict(X_test_tfidf)

# Evaluate the performance of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Calculate the mean absolute error
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)

Accuracy: 0.6363636363636364
Mean Absolute Error: 0.5


Interestingly, the decision tree does worse with feature engineering approach 2 on RS 42.

TT Split RS 42: Accuracy: 0.5
Mean Absolute Error: 0.6363636363636364

TT Split RS 95: Accuracy: 0.45454545454545453
Mean Absolute Error: 0.6363636363636364

TT Split RS 13: Accuracy: 0.3181818181818182
Mean Absolute Error: 0.8636363636363636

TT Split RS 21: Accuracy: 0.4090909090909091
Mean Absolute Error: 0.6818181818181818

TT Split RS 507: Accuracy: 0.6363636363636364
Mean Absolute Error: 0.5

In [None]:
import xgboost as xgb
# Initialize the XGBoost classifier
xgb_classifier = xgb.XGBClassifier(objective='binary:logistic', random_state=42)

# Train the classifier on the training data
xgb_classifier.fit(X_train_tfidf, y_train)

# Test the classifier on the testing data
y_pred = xgb_classifier.predict(X_test_tfidf)

# Evaluate the performance of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)

Accuracy: 0.4090909090909091
Mean Absolute Error: 0.6363636363636364


Also interestingly, the MAE is the same for XGBoost algorithm but the accuracy is slightly lower.

TT Split RS 42: Accuracy: 0.5454545454545454
Mean Absolute Error: 0.5

TT Split RS 95: Accuracy: 0.5909090909090909
Mean Absolute Error: 0.45454545454545453

TT Split RS 13: Accuracy: 0.36363636363636365
Mean Absolute Error: 0.7272727272727273

TT Split RS 21: Accuracy: 0.5909090909090909
Mean Absolute Error: 0.45454545454545453

TT Split RS 507: Accuracy: 0.4090909090909091
Mean Absolute Error: 0.6363636363636364

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Step 4: Initialize and Train the SVM Classifier
svm_classifier = SVC(kernel='rbf', C=1.0, gamma='scale')  # Example hyperparameters
svm_classifier.fit(X_train_tfidf, y_train)

# Step 5: Evaluate the Classifier
accuracy = svm_classifier.score(X_test_tfidf, y_test)
print("Accuracy:", accuracy)

Accuracy: 0.45454545454545453


With this feature engineering approach, we got the exact same prediction accuracy for the support vector machine algorithm.

TT Split RS 42: Accuracy: 0.5454545454545454

TT Split RS 95: Accuracy: 0.5454545454545454

TT Split RS 13: Accuracy: 0.5909090909090909

TT Split RS 21: Accuracy: 0.5

TT Split RS 507: Accuracy: 0.45454545454545453

In [None]:
from sklearn.ensemble import HistGradientBoostingClassifier

# Initialize the HistGradientBoostingClassifier
clf = HistGradientBoostingClassifier()

# Train the classifier on the training data
clf.fit(X_train_tfidf.toarray(), y_train)

# Test the classifier on the testing data
y_pred = clf.predict(X_test_tfidf.toarray())

# Evaluate the performance of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)

Accuracy: 0.5
Mean Absolute Error: 0.5909090909090909


TT Split RS 42: Accuracy: 0.7272727272727273
Mean Absolute Error: 0.3181818181818182

TT Split RS 95: Accuracy: 0.5909090909090909
Mean Absolute Error: 0.4090909090909091

TT Split RS 13: Accuracy: 0.45454545454545453
Mean Absolute Error: 0.5909090909090909

TT Split RS 21: Accuracy: 0.6363636363636364
Mean Absolute Error: 0.36363636363636365

TT Split RS 507: Accuracy: 0.5
Mean Absolute Error: 0.5909090909090909

In [None]:
print(clf.get_params())

{'categorical_features': None, 'class_weight': None, 'early_stopping': 'auto', 'interaction_cst': None, 'l2_regularization': 0.0, 'learning_rate': 0.1, 'loss': 'log_loss', 'max_bins': 255, 'max_depth': None, 'max_iter': 100, 'max_leaf_nodes': 31, 'min_samples_leaf': 20, 'monotonic_cst': None, 'n_iter_no_change': 10, 'random_state': None, 'scoring': 'loss', 'tol': 1e-07, 'validation_fraction': 0.1, 'verbose': 0, 'warm_start': False}


When I saw this MY JAWS DROPPED! HOLY MOTHER OF PEARL! I didn't think I would get such great results combining HistGradientBoostingClassifier with feature engineering approach 2! I literally just did it for shits and giggles and to say I tried it. Wow. I'm going to have to examine these results more closely because this is by far the best performance I have gotten out of any of the models. LOOK AT THAT MEAN ABSOLUTE ERROR!

In [None]:
print(y_test)
print(y_pred)

[2 2 1 2 2 0 2 2 3 3 2 2 1 2 3 2 1 3 2 1 2 1]
[2 2 2 2 2 0 2 2 2 3 2 2 2 2 3 2 1 2 2 2 2 3]


This is rather amazing.

I am going to perform a grid search cross-validation to A) be more confident in believing what I'm seeing and B) see if we can get even better results from hyperparameter tuning. Because this can take forever, I'm only going to try two values per parameter in the search.

In [None]:
from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid
param_grid = {
    'learning_rate': [0.1, 0.01],
    'max_iter': [100, 300],
    'max_depth': [3, 7],
    'min_samples_leaf': [1, 4],
    # Add more hyperparameters as needed
}

# Instantiate HistGradientBoostingClassifier
clf = HistGradientBoostingClassifier()

# Instantiate GridSearchCV
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Perform grid search
grid_search.fit(X_train_tfidf.toarray(), y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

# Test the classifier on the testing data using the best hyperparameters
best_classifier = grid_search.best_estimator_
y_pred = best_classifier.predict(X_test_tfidf.toarray())

# Evaluate the performance of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)

Best Hyperparameters: {'learning_rate': 0.01, 'max_depth': 7, 'max_iter': 300, 'min_samples_leaf': 4}
Accuracy: 0.45454545454545453
Mean Absolute Error: 0.6818181818181818


Unfortunately, we got very poor results from doing this grid search. We should change our param grid to contain the point used by the default invocation of HistGradientBoostingClassifier which gave us the values of accuracy and MAE that initially got us excited. That point has coordinates:
'learning_rate': .1
'max_iter': 100
'max_depth': None
'min_samples_leaf': 20

In [None]:
from sklearn.naive_bayes import GaussianNB

# Step 3: Initialize and Train the GNB Classifier
gnb_classifier = GaussianNB()
gnb_classifier.fit(X_train_tfidf.toarray(), y_train)  # Convert sparse matrix to dense array for GNB

# Step 4: Evaluate the Classifier
y_pred = gnb_classifier.predict(X_test_tfidf.toarray())  # Convert sparse matrix to dense array for prediction
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

mae = mean_absolute_error(y_test, y_pred)
print("Mean absolute error:", mae)

Accuracy: 0.5909090909090909
Mean absolute error: 0.5


TT Split RS 42: Accuracy: 0.5
Mean absolute error: 0.5454545454545454

TT Split RS 95: Accuracy: 0.5909090909090909
Mean absolute error: 0.6363636363636364

TT Split RS 13: Accuracy: 0.5
Mean absolute error: 0.5909090909090909

TT Split RS 21: Accuracy: 0.5454545454545454
Mean absolute error: 0.5

TT Split RS 507: Accuracy: 0.5909090909090909
Mean absolute error: 0.5

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)

# Initialize the KNN classifier
knn_classifier = KNeighborsClassifier(n_neighbors=5)  # You can adjust the number of neighbors as needed

# Train the classifier
knn_classifier.fit(X_train_tfidf, y_train)

# Make predictions on the testing data
y_pred = knn_classifier.predict(X_test_tfidf)

# Evaluate the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

mae = mean_absolute_error(y_test, y_pred)
print("Mean absolute Error:", mae)

# Initialize the KNN regressor
knn_regressor = KNeighborsRegressor(n_neighbors=5)  # You can adjust the number of neighbors as needed

# Train the regressor
knn_regressor.fit(train_df, y_train)

# Make predictions on the testing data
y_pred = knn_regressor.predict(test_df)

# Round the predictions
y_pred_rounded = [round(pred) for pred in y_pred]

# Calculate accuracy using the rounded predictions and actual target values
accuracy = accuracy_score(y_test, y_pred_rounded)
print("Accuracy:", accuracy)

# Evaluate the mean squared error of the regressor
mae = mean_absolute_error(y_test, y_pred)
print("Mean absolute Error:", mae)

Accuracy: 0.4090909090909091
Mean absolute Error: 0.6363636363636364
Accuracy: 0.36363636363636365
Mean absolute Error: 0.7090909090909091


TT Split RS 42: Accuracy: 0.5909090909090909
Mean absolute Error: 0.5
Accuracy: 0.4090909090909091
Mean absolute Error: 0.7272727272727273

TT Split RS 95: Accuracy: 0.5909090909090909
Mean absolute Error: 0.5909090909090909
Accuracy: 0.45454545454545453
Mean absolute Error: 0.7727272727272727

TT Split RS 13: Accuracy: 0.6363636363636364
Mean absolute Error: 0.36363636363636365
Accuracy: 0.22727272727272727
Mean absolute Error: 0.8181818181818182

TT Split RS 21: Accuracy: 0.5454545454545454
Mean absolute Error: 0.5
Accuracy: 0.5454545454545454
Mean absolute Error: 0.6454545454545455

TT Split RS 507: Accuracy: 0.4090909090909091
Mean absolute Error: 0.6363636363636364
Accuracy: 0.36363636363636365
Mean absolute Error: 0.7090909090909091

To give a little bit more context and info, when I first started training and evaluating models against my data, I tried a handful of different types of models together with FE approach 1, and found that Linear Regression gave me the best performance at 68% accuracy and about .45 MAE. However, I was including phrases from the test set in the embedding maps before identifying and counting clusters, meaning language from the test set was used to train the model, which is a big no-no aka cheating in a supervised learning problem. While I was getting better performance from LR using the invalid FE approach, I was getting the same performance in terms of accuracy from tree-based models including Random Forest Classifier that I did when doing FE approach 1 correctly.

I have formed a hypothesis which is partially based on the observation that linear regression had noticeably better accuracy when we did FE approach 1 but with phrases from our test group influencing our clustering, which was not valid. We got an accuracy of 68% (rounded predictions) and an MAE of roughly .45 using 500 clusters for our common corpus. However, when we removed the salary columns from the data having gone through FE approach 1 because we wanted to compare results to FE approach 2 based only on representations of our text to see if TFIDF or phrase embedding clustering would prove better, linear regression drastically failed and made wild predictions which were nowhere in the ballpark of 0-3 (they could be orders of magnitude larger or smaller). Linear regression completely exploded without the salary columns acting as some kind of glue for the algorithm to work meaningfully well, when we did FE Approach 1. However, with FE approach 2, LR did reasonably well at 59% although the MAE was .61. So I am thinking, what if we combined feature engineering approaches 1 and 2, where we perform the TFIDF vectorization but we also include salary columns in the X dataframe, and see how linear regression does? We can also see if random forest performs differently, although I think we observed that removing the salary columns didn't change anything or had negligible effects on the predictions of random forest, though I say that somewhat unconfidently.

So what I need to do now is use a different name for the dataframe I make with FE approach 2 and then combine that dataframe with just the salary columns from the FE approach 1 dataframe.

In [None]:
df.head()

Unnamed: 0,employment_type,job_function,description_of_product/service,industries,position_name,broader_role_name,company,responsibilities,goals/objectives,name_of_department/team,required_qualifications,preferred_qualifications,benefits,work_arrangement,city,state,country,min_salary,max_salary,rating
0,[Full-time],"[Engineering, Information Technology]","[design-led software development, end-to-end d...","[Business Consulting, Services]",[Python Software Engineer (Robotics/Mechatroni...,[N/A],[Fresh Consulting],"[integrate software/hardware components, devel...",[manage delivery of high-quality work],[N/A],"[0-1+ years experience, Python skills, program...","[clear communication, outside the box thinking...","[100% Medical, PTO, Holiday Pay, 401K Plan]",[N/A],Redmond,WA,USA,62.4,93.6,1
1,[Full-time],"[Design, Art/Creative, Information Technology]",[Surgical Robotics Systems],"[Biotechnology Research, Pharmaceutical Manufa...",[Robotics Engineer],[N/A],[Barrington James],"[Contribute to cutting-edge robotic systems, C...",,,"[Bachelor's, Master's, or Ph.D. in Robotics, M...",[N/A],[N/A],[N/A],New York,,United States,,,2
2,[Full-time],"[Information Technology, Consulting, Engineering]","[Amazon Robotics builds high-performance, real...","[Software Development, IT Services, IT Consult...",[SDE - Amazon Robotics],[N/A],[Amazon],"[Help with initial robotic deployments, Plan r...","[Build high-performance robotic systems, Inven...",[N/A],[3+ years professional software development ex...,[3+ years full software development life cycle...,[N/A],[On-site],North Reading,MA,USA,,,2
3,[Full-time],"[Engineering, Information Technology]",[Credit scoring models],"[Financial Services, Capital Markets, IT Servi...",[Senior Software Engineer],[Software Engineer],[VantageScore®],"[Application Development, Collaboration, Mento...",[Building public APIs],[N/A],"[Bachelor's degree, Master's degree, Computer ...","[Quantitative applications, Fintech experience...","[401(K) match, Flexible Time Off, 12 Paid Holi...",[Hybrid],San Francisco,CA,USA,150.0,200.0,3
4,[Full-time],"[Engineering, Quality Assurance, Information T...",[Large-scale distributed software applications...,"[IT Services, IT Consulting]",[Software Engineer in Test],[N/A],[Optomi],[Build and maintain automated test infrastruct...,[N/A],[N/A],"[5+ years of test automation experience, Profi...",[N/A],[N/A],[On-site],Dallas-Fort Worth,TX,USA,,,1


Cool, so we don't actually have to run code that is tied up with all the phrase embedding and clustering stuff which takes a long time. We can get it straight from df. So replace 'df' with 'df2' in the code that generates that dataframe with 'posting_text' and 'rating', from FE approach 2.

In [11]:
links_csv_path = "drive/MyDrive/successfulLinksGDRIVE.csv"

import pandas as pd
links_dataframe = pd.read_csv(links_csv_path, header=None, names=['url', 'rating'])
#links_dataframe.head()
#print(links_dataframe.loc[0, 'rating'])

# let's snakecase our column names to avoid having spaces in them
# also we add a rating column at the end of the list to store the target variable
df_columns = ["posting_text", "rating"]
# Initialize DataFrame with column names
df2 = pd.DataFrame(columns=df_columns)

import os
import json

for i in range(1, 108):
  row_num = df2.last_valid_index()
  print(row_num)
  if row_num == None:
    row_num = 0
  else:
    row_num = row_num + 1
  assert i == row_num + 1
  text_file_path = f'drive/MyDrive/text_files2/row{i}.txt'
  with open(text_file_path, 'r') as file:
    # Read the entire content of the file into a string
    file_contents = file.read()
  df2.at[row_num, 'posting_text'] = file_contents
  rating = links_dataframe.loc[row_num, 'rating']
  df2.at[row_num, 'rating'] = rating

import numpy as np
# Replace 'N/A' with NaN in the whole DataFrame
df2.replace('N/A', np.nan, inplace=True)

None
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105


In [12]:
salary_df = df[['min_salary', 'max_salary']]
# Iterate over columns and fill NaN values with the mode of each column
for col in salary_df.columns:
  mode_val = salary_df[col].mode()[0]
  salary_df[col].fillna(mode_val, inplace=True)  # Fill NaN values with the mode

# Concatenate the DataFrames
concatenated_df = pd.concat([df2, salary_df], axis=1)

print(concatenated_df.isnull().sum())

posting_text    0
rating          0
min_salary      0
max_salary      0
dtype: int64


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  salary_df[col].fillna(mode_val, inplace=True)  # Fill NaN values with the mode


In [54]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Assuming your dataframe is named 'df' and the text column is named 'job_posting_text'

# Step 1: Split your data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(concatenated_df[['posting_text', 'min_salary', 'max_salary']], concatenated_df['rating'], test_size=0.2, random_state=507)

# Step 2: Initialize and fit TF-IDF vectorizer on the training data only
tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # You can adjust max_features as needed
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train['posting_text'])
# Convert TF-IDF matrix to DataFrame
X_train_tfidf_df = pd.DataFrame(X_train_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out(), index=X_train.index)

# Step 3: Transform the test data using the fitted vectorizer
X_test_tfidf = tfidf_vectorizer.transform(X_test['posting_text'])
# Convert TF-IDF matrix to DataFrame
X_test_tfidf_df = pd.DataFrame(X_test_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out(), index=X_test.index)

new_df = X_train.drop(columns='posting_text')
new_test_df = X_test.drop(columns='posting_text')
X_train = pd.concat([new_df, X_train_tfidf_df], axis=1)
X_test = pd.concat([new_test_df, X_test_tfidf_df], axis=1)


# Now you have X_train and X_test ready to be used with your model for training and evaluation.

In [None]:
X_train

Unnamed: 0,min_salary,max_salary,00,000,000remote,004,00applications,00salary,00summary,00the,...,youhave,your,yourself,youtube,yr,yrsmandatory,yummy,zero,zoho,zone
67,100.0,200.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.008027,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0
26,100.0,200.0,0.069401,0.071417,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.075790,0.0,0.0,0.000000,0.0,0.0
22,100.0,200.0,0.033130,0.068185,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.036180,0.0,0.0,0.000000,0.0,0.0
31,100.0,200.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0
56,100.0,200.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70,,,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.079688,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0
76,,,0.036686,0.075505,0.0,0.0,0.0,0.0,0.0,0.0,...,0.098938,0.016891,0.0,0.0,0.040064,0.0,0.0,0.000000,0.0,0.0
77,,,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0
78,,,0.039157,0.080590,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.054086,0.0,0.0,0.042762,0.0,0.0,0.000000,0.0,0.0


In [55]:
from sklearn.metrics import mean_absolute_error, accuracy_score
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

regression_model = LinearRegression()

# Assuming X_train, X_test, y_train, y_test are your training and testing data
# Train a regression model instead of SVM classifier
regression_model.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = regression_model.predict(X_test)

# Calculate the mean absolute error
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)

# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

# Round the predictions
y_pred_rounded = [round(pred) for pred in y_pred]

# Calculate accuracy using the rounded predictions and actual target values
accuracy = accuracy_score(y_test.tolist(), y_pred_rounded)

print("accuracy is", accuracy)

Mean Absolute Error: 0.4719712687542771
Mean Squared Error: 0.43326203180801814
accuracy is 0.6363636363636364


TT Split RS 42: Mean Absolute Error: 0.47993065078666375
Mean Squared Error: 0.3460654666605357
accuracy is 0.6363636363636364

TT Split RS 95: Mean Absolute Error: 0.5449295696118838
Mean Squared error: 0.5375021723477539
accuracy is 0.6363636363636364

TT Split RS 13: Mean Absolute Error: 0.4485097307444198
Mean Squared Error: 0.3790559124675783
accuracy is 0.6818181818181818

TT Split RS 21: Mean Absolute Error: 0.4227856724462283
Mean Squared Error: 0.3112876514654064
accuracy is 0.7272727272727273

TT Split RS 507: Mean Absolute Error: 0.4719712687542771
Mean Squared Error: 0.43326203180801814
accuracy is 0.6363636363636364

Yes!!! After bumping into some road blocks, I finally got it working, and it's the best accuracy I have gotten using a legitimate approach. Accuracy (rounded predictions) of 64%, MAE of .48. If Linear Regression got better results, what about the Logisic Regression and Random Forest Classifier algorithms?

In [56]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

le = LabelEncoder()
y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)

# Create a Logistic Regression model
# By default, LogisticRegression uses OvR for multi-class classification
model = LogisticRegression(max_iter=10000)  # Increase max_iter for convergence if needed

# Fit the model to the training data
model.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Calculate the mean absolute error
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)

# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

Accuracy: 0.5
Mean Absolute Error: 0.5909090909090909
Mean Squared Error: 0.7727272727272727


TT Split RS 42: Accuracy: 0.5909090909090909
Mean Absolute Error: 0.5
Mean Squared Error: 0.6818181818181818

TT Split RS 95: Accuracy: 0.45454545454545453
Mean Absolute Error: 0.7272727272727273
Mean Squared Error: 1.0909090909090908

TT Split RS 13: Accuracy: 0.6363636363636364
Mean Absolute Error: 0.4090909090909091
Mean Squared Error: 0.5

TT Split RS 21: Accuracy: 0.4090909090909091
Mean Absolute Error: 0.6818181818181818
Mean Squared Error: 0.8636363636363636

TT Split RS 507: Accuracy: 0.5
Mean Absolute Error: 0.5909090909090909
Mean Squared Error: 0.7727272727272727

In [57]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)

# Initialize the Random Forest classifier
rf_classifier = RandomForestClassifier(max_depth=None, max_features='auto', min_samples_leaf=1, min_samples_split=2, n_estimators=300)

# Train the classifier on the training data
rf_classifier.fit(X_train, y_train)

# Test the classifier on the testing data
y_pred = rf_classifier.predict(X_test)


# Evaluate the performance of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

mae = mean_absolute_error(y_test, y_pred)
print("mae:", mae)

# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

  warn(


Accuracy: 0.5
mae: 0.5909090909090909
Mean Squared Error: 0.7727272727272727


TT Split RS 42: Accuracy: 0.5909090909090909
mae: 0.45454545454545453
Mean Squared Error: 0.5454545454545454

TT Split RS 95: Accuracy: 0.5454545454545454
mae: 0.5909090909090909
Mean Squared Error: 0.9090909090909091

TT Split RS 13: Accuracy: 0.5909090909090909
mae: 0.4090909090909091
Mean Squared Error: 0.4090909090909091

TT Split RS 21: Accuracy: 0.5
mae: 0.5454545454545454
Mean Squared Error: 0.6363636363636364

TT Split RS 507: Accuracy: 0.45454545454545453
mae: 0.6363636363636364
Mean Squared Error: 0.7727272727272727

In fact we see that this change in feature engineering had no impact on the RFC's accuracy or MAE whatsoever. Maybe we should try a different seed for our train-test split?

How does a HistGradientBoostingClassifier model do with this hybrid feature engineering approach (which we will from here on refer to as FE approach 3)?

In [58]:
from sklearn.ensemble import HistGradientBoostingClassifier

le = LabelEncoder()
y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)

# Initialize the HistGradientBoostingClassifier
clf = HistGradientBoostingClassifier()

# Train the classifier on the training data
clf.fit(X_train, y_train)

# Test the classifier on the testing data
y_pred = clf.predict(X_test)

# Evaluate the performance of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)

# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

Accuracy: 0.5
Mean Absolute Error: 0.5909090909090909
Mean Squared Error: 0.7727272727272727


TT Split RS 42: Accuracy .727 MAE .318 MSE 0.4090909090909091

TT Split RS 95: Accuracy .591, MAE .409 MSE 0.4090909090909091

TT Split RS 13: Accuracy: 0.45454545454545453
Mean Absolute Error: 0.5909090909090909
Mean Squared Error: 0.6818181818181818

TT Split RS 21: Accuracy: 0.6363636363636364
Mean Absolute Error: 0.36363636363636365
Mean Squared Error: 0.36363636363636365

TT Split RS 507: Accuracy: 0.5
Mean Absolute Error: 0.5909090909090909
Mean Squared Error: 0.7727272727272727

We are seeing a good amount of variance with this model.

First try with train-test-split random seed 42: Accuracy .727 MAE .318; We see that this adding of the salary columns into the dataframe that would be passed into the algorithm from FE approach 2 (which we are calling FE approach 3) didn't result in any changes to accuracy or MAE of the HGBC model on this particular train-test split.
Second train-test-split RS 95: Accuracy .591, MAE .409


In [59]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Initialize the decision tree classifier
clf = DecisionTreeClassifier()

# Train the decision tree classifier on the training data
clf.fit(X_train, y_train)

# Test the classifier on the testing data
y_pred = clf.predict(X_test)

# Evaluate the performance of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Calculate the mean absolute error
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)

# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

Accuracy: 0.45454545454545453
Mean Absolute Error: 0.7272727272727273
Mean Squared Error: 1.0909090909090908


TT Split RS 42: Accuracy: 0.3181818181818182
Mean Absolute Error: 0.8181818181818182
Mean Squared Error: 1.1363636363636365

TT Split RS 95: Accuracy: 0.3181818181818182
Mean Absolute Error: 0.7272727272727273
Mean Squared Error: 1.0909090909090908

TT Split RS 13: Accuracy: 0.36363636363636365
Mean Absolute Error: 0.6818181818181818
Mean Squared Error: 0.8636363636363636

TT Split RS 21: Accuracy: 0.5
Mean Absolute Error: 0.5454545454545454
Mean Squared Error: 0.8636363636363636

TT Split RS 507: Accuracy: 0.45454545454545453
Mean Absolute Error: 0.6363636363636364
Mean Squared Error: 1.0909090909090908

In [60]:
import xgboost as xgb

# Initialize the XGBoost classifier
xgb_classifier = xgb.XGBClassifier(objective='binary:logistic', random_state=42)

# Train the classifier on the training data
xgb_classifier.fit(X_train, y_train)

# Test the classifier on the testing data
y_pred = xgb_classifier.predict(X_test)

# Evaluate the performance of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)

# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

Accuracy: 0.5
Mean Absolute Error: 0.5909090909090909
Mean Squared Error: 0.7727272727272727


TT Split RS 42: Accuracy: 0.5909090909090909
Mean Absolute Error: 0.4090909090909091
Mean Squared Error: 0.4090909090909091

TT Split RS 95: Accuracy: 0.6818181818181818
Mean Absolute Error: 0.3181818181818182
Mean Squared Error: 0.3181818181818182

TT Split RS 13: Accuracy: 0.36363636363636365
Mean Absolute Error: 0.7272727272727273
Mean Squared Error: 0.9090909090909091

TT Split RS 21: Accuracy: 0.6363636363636364
Mean Absolute Error: 0.36363636363636365
Mean Squared Error: 0.36363636363636365

TT Split RS 507: Accuracy: 0.5
Mean Absolute Error: 0.5909090909090909
Mean Squared Error: 0.772727272727272

Interesting. With this feature engineering approach 3, we don't get fantastic accuracy using XGBoost but our MAE is lower than we've seen it for this algorithm.

Well, it looks like doing other train-test splits that this algorithm has a HIGH degree of variance with this data because for some splits our accuracy and error are good and for others they are awful.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 4: Initialize and Train the SVM Classifier
svm_classifier = SVC(kernel='rbf', C=1.0, gamma='scale')  # Example hyperparameters
svm_classifier.fit(X_train_scaled, y_train)

# Step 5: Evaluate the Classifier
accuracy = svm_classifier.score(X_test_scaled, y_test)
print("Accuracy:", accuracy)

Accuracy: 0.45454545454545453


TT Split RS 42: Accuracy: 0.5454545454545454
TT Split RS 95: Accuracy: .5
TT Split RS 13: Accuracy: 0.5909090909090909
TT Split RS 21: Accuracy: 0.45454545454545453
TT Split RS 507: Accuracy: 0.45454545454545453

We notice nothing changed for SVC on RS 42 with change in feature engineering.

In [61]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor

# Initialize the KNN classifier
knn_classifier = KNeighborsClassifier(n_neighbors=5)  # You can adjust the number of neighbors as needed

# Train the classifier
knn_classifier.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = knn_classifier.predict(X_test)

# Evaluate the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

mae = mean_absolute_error(y_test, y_pred)
print("Mean absolute Error:", mae)

# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

# Initialize the KNN regressor
knn_regressor = KNeighborsRegressor(n_neighbors=5)  # You can adjust the number of neighbors as needed

# Train the regressor
knn_regressor.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = knn_regressor.predict(X_test)

# Round the predictions
y_pred_rounded = [round(pred) for pred in y_pred]

# Calculate accuracy using the rounded predictions and actual target values
accuracy = accuracy_score(y_test, y_pred_rounded)
print("Accuracy:", accuracy)

# Evaluate the mean squared error of the regressor
mae = mean_absolute_error(y_test, y_pred)
print("Mean absolute Error:", mae)

# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

Accuracy: 0.3181818181818182
Mean absolute Error: 0.7272727272727273
Mean Squared Error: 0.8181818181818182
Accuracy: 0.36363636363636365
Mean absolute Error: 0.6272727272727271
Mean Squared Error: 0.58


Interesting, KNN Regressor did better, but KNN classifier did worse, with FE approach 3. It kinda makes sense, since we have in general observed regression models improving with FE approach 3 but not necessarily other kinds of models.
Order:
KNNC
KNNR
TT Split RS 42:
Accuracy: 0.5
Mean absolute Error: 0.5909090909090909
Mean Squared Error: 0.8636363636363636
Accuracy: 0.6363636363636364
Mean absolute Error: 0.48181818181818187
Mean Squared Error: 0.4563636363636364

TT Split RS 95:
Accuracy: 0.45454545454545453
Mean absolute Error: 0.7272727272727273
Mean Squared Error: 1.1818181818181819
Accuracy: 0.45454545454545453
Mean absolute Error: 0.7
Mean Squared Error: 0.7363636363636364

TT Split RS 13:
Accuracy: 0.6363636363636364
Mean absolute Error: 0.36363636363636365
Mean Squared Error: 0.36363636363636365
Accuracy: 0.5909090909090909
Mean absolute Error: 0.4636363636363638
Mean Squared Error: 0.38727272727272727

TT Split RS 21:
Accuracy: 0.4090909090909091
Mean absolute Error: 0.6363636363636364
Mean Squared Error: 0.7272727272727273
Accuracy: 0.4090909090909091
Mean absolute Error: 0.5818181818181819
Mean Squared Error: 0.48363636363636364

TT Split RS 507:
Accuracy: 0.3181818181818182
Mean absolute Error: 0.7272727272727273
Mean Squared Error: 0.8181818181818182
Accuracy: 0.36363636363636365
Mean absolute Error: 0.6272727272727271
Mean Squared Error: 0.58

Now, we want to get more accurate measurements of the accuracy of each of our models. We will start with the top performing models and go down the list. We will update model_leaderboard.md https://github.com/liamtabrams/LIRecommend/blob/main/discussion/model_leaderboard.md in the Github repo with our findings.

Initially, I tried to use built-in scikit learn APIs for doing K-Fold cross validation on my data, but since I either do TFIDF on my training set and transform my test set based on that process (FE 2) or calculate phrase embeddings from the training set and count cluster occurences, I would have to bake my feature engineering into the cross validation using an SKLearn pipeline. That will be nontrivial. Although I will probably opt to do this at some point, for now I will opt to change the seed setting when I call train_test_split and reevaluate models. I can do this myself 5 times, and use that as a rudimentary cross validation to get some accuracy score averages. Then I might look at mean squared error and AUC-ROC values for different models.

We should try to refine and improve our model evaluation techniques. So far we have done a rather preliminary analysis using one train-test split and only looking at accuracy and mean absolute error for the different models we try. We should also be looking at AUC scores for the ROC for each class in the target variable. Furthermore, instead of focusing on mean absolute error, I think trying to minimize mean squared error is more important than mean absolute error. We see from the predictions of the RFC above that when the label value was 0, the algorithm predicted a 2. Even if other predictions are spot on, this larger difference between prediction and reality should be penalized more greatly, and mean squared error would take care of that.

Before we embark any further, let's have a quick discussion about random models and baseline accuracy.

A random model would give an accuracy of 25% and a mean absolute error of 1.25. We feel like we are moving in the right direction but I am still relunctant to say so when I consider the class imbalance in the dataset, which has 50/107 2s, or 46.7% of the dataset is 2s. This means that if a model were to predict just 2 no matter what, it would have achieved a 50/107 = 46.7% accuracy, which is better than how some of the models we trained did! What would the MAE be, based on the distribution of values in the dataset? It would be 70/107 or .654, which is not much worse than how most of the models have been doing, and better than some. What should reasonable values for accuracy and mean absolute error (or better yet mean squared error) be then, for a model that has potential value? Obviously we need to strive for at least better than 46.7% accuracy and less than .654 MAE! So we have found several models that do better than that, like linear regression, random forest classifiers and regressors, XGBoost, KNN classifiers and regressors, and even decision trees and support vector machines. HistGradientBoostingClassifier did about the same as our majority class picking model and GaussianNaiveBayes did horrendously, doing about in between as well as a purely random model and our majority class model. We see below that this original dataset of 107 job postings is not the most balanced, as nearly half of the data points are 2, about a quarter are 3, about 15% are 1 and 10% are 0.

In [None]:
ratings_value_counts = df_copy['rating'].value_counts()

print(ratings_value_counts)

rating
2    50
3    26
1    18
0    13
Name: count, dtype: int64


We will now look at the accuracy of our attempt at reproducing the ratings from the original set of job postings about 2 months later.

In [7]:
import pandas as pd

og_ratings_path = "drive/MyDrive/successfulLinksGDRIVE.csv"
reproduce_ratings_path = "drive/MyDrive/ReproduceRatings.csv"

og_ratings = pd.read_csv(og_ratings_path, header=None, names=['url', 'rating'])['rating'].tolist()
reproduced_ratings = pd.read_csv(reproduce_ratings_path, header=None, names=['rating'])['rating'].tolist()

assert len(og_ratings) == len(reproduced_ratings)

num_correct = 0
total_absolute_error = 0
total_squared_error = 0
for i in range(len(og_ratings)):
  if og_ratings[i] == reproduced_ratings[i]:
    num_correct += 1
  else:
    total_absolute_error += abs(og_ratings[i] - reproduced_ratings[i])
    total_squared_error += (og_ratings[i] - reproduced_ratings[i])**2


accuracy = num_correct/len(og_ratings)
mean_absolute_error = total_absolute_error/len(og_ratings)
mean_squared_error = total_squared_error/len(og_ratings)
print(f"accuracy is {accuracy}")
print(f"mean absolute error is {mean_absolute_error}")
print(f"mean squared error is {mean_squared_error}")


accuracy is 0.6261682242990654
mean absolute error is 0.37383177570093457
mean squared error is 0.37383177570093457


We see that, just from 2-month drift in my job preferences as well as inherit subjectivity in this exercise of rating job postings, I myself was only able to predict the rating I gave to the job posting 2 months ago, using the scraped text files (which admittedly may have problematic information loss from the original job posting), at a 62.6% accuracy rate with a mean absolute error of .374. We also see that our mean squared error is the same as our mean absolute error indicating that none of our misses were off by more than 1 rating point. So obviously this would be a very good model, to have myself modeling my job preferences from 2 months ago! This also means that if I have a linear regression model that produced predictions at 66.4% accuracy, that's pretty good. The error of the linear regression model with FE approach 3 was .474 so noticeably larger than the error of me reproducing the original ratings, but that .474 number was close to the lowest error we saw for any of the models we did cross validation on.     

It's looking like I will choose to move forward with combining linear regression with feature engineering approach 3. However, we never looked at mean squared error, so we will do cross validation to get average MSE values for our top performing models. Naturally we will start at the top of our leaderboard with linear regression and FE approach 3, rerunning code cells from above and updating our tables with our results for MSE.

So, looking at MSE has only strengthened the conclusion that the linear regression model with FE approach 3 is the clear-cut best choice. Not only is the MSE of the linear regression less than its MAE, but the MSE of the linear regression model is only slightly larger than RR (reproducing ratings) after 2 months. For a compact summary of all results found in this notebook, see https://github.com/liamtabrams/LIRecommend/blob/main/discussion/model_leaderboard.md.  