We import some common modules used in machine learning for loading, modifying, and visualizing data.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sb
import dask.dataframe as dd

We use pandas to load the encoded_df_blanks_as_na.csv file into our features dataframe. features_df contains the independent variables used to train our machine learning model. These are the characteristics of the data that our model will analyze to make predictions.

In [2]:
'''
- A one-hot encoded matrix of features, with 11599 person names (nam_id_XXXX)
    and 4784 place names (geo_id_XXXX) as columns.
- The first column contains unique text_id corresponding to individual texts.
'''
features_df = pd.read_csv('encoded_df_blanks_as_na.csv')

We use pandas to load the 20240103_texts_with_dates.csv file into our labels dataframe, and then preview it. labels_df contains the dependent variables (target values) our model is trying to predict. These are the outputs corresponding to observations in features_df. y1 and y2 will serve as the ground truth our model learns to predict.

In [3]:
'''
- tex_id: the text id (named text_id in the features dataset)
- y1: The earliest possible date the text was written.
- y2: The latest possible date the text was written.
- If y1 and y2 are equal, the date of writing is known with certainty.
'''

labels_df = pd.read_csv('20240103_texts_with_dates.csv')

labels_df

Unnamed: 0,tex_id,geotex_id,written,found,geo_id,language_text,material_text,y1,y2,remark,Unnamed: 10,Unnamed: 11
0,12042,8388,1,1,1008,Greek,papyrus,117.0,118.0,,,- y1 = earliest possible year of writing
1,12054,8391,1,1,1008,Greek,papyrus,119.0,119.0,,,- y2 = latest possible year of writing
2,12063,8393,1,1,1008,Greek,papyrus,96.0,98.0,,,"[if y1 = y2, then we are certain of the date]"
3,12064,8394,1,1,1008,Greek,papyrus,131.0,131.0,,,
4,17239,9507,0,1,1008,Greek,papyrus,108.0,108.0,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
29245,967240,1021239,1,0,332,Greek,papyrus,-263.0,-229.0,,,
29246,5910,5438,1,1,1344,Greek,papyrus,-148.0,-148.0,,,
29247,3506,4066,1,1,1344,Demotic / Greek,papyrus,-150.0,-150.0,,,
29248,7491,6778,0,1,720,Demotic / Greek,papyrus,-236.0,-236.0,new,,


**One Hot Encoding**

Looking at the features file, it is one-hot encoded. One-hot encoding is a method used to represent categorical data as binary values. It is commonly used to convert non-numerical data into a format that algorithms can process effectively as well as avoiding implicit ordinal relationships. It works by identifying all unique categories and creating binary columns for all of them. We assign 1 to the column corresponding to the presence of that data and 0 for all other related columns.

This file was not fully one-hot encoded at first. It had a 1 for all values, but a NULL for the absence of features. All NULL values should be filled with a 0 to represent the absence of that feature and clean the data more.
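As a minimal sketch of the idea, the toy example below one-hot encodes a small categorical column with pandas and then fills the blanks of a partially encoded table with 0. The column names (`place`, `nam_id_1`, `geo_id_7`) and values are made up for illustration and are not taken from our dataset.

```python
import pandas as pd

# Toy data: one categorical column (hypothetical values)
toy = pd.DataFrame({"text_id": [1, 2, 3], "place": ["Thebes", "Memphis", "Thebes"]})

# One binary column per unique category: 1 where present, 0 elsewhere
encoded = pd.get_dummies(toy, columns=["place"], dtype=float)
print(encoded)

# A partially encoded table like ours: 1 marks presence, NaN marks absence
partial = pd.DataFrame({
    "text_id": [1, 2],
    "nam_id_1": [1.0, None],
    "geo_id_7": [None, 1.0],
})

# Filling NaN with 0 completes the encoding
filled = partial.fillna(0)
print(filled)
```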

After filling all NaN values with 0, we preview five randomly sampled rows and the first 10 columns of features_df. We then set text_id as the index of features_df because, although it is a column, it is not actually a feature but rather a unique identifier for each row.

Lastly, we print information about the shape of the dataframe.

In [4]:
# Printing information about the features dataset
features_df = features_df.fillna(0)

print(features_df.sample(5).iloc[:, :10])

features_df.set_index('text_id', inplace=True)


print(f'Number of rows: {features_df.shape[0]}')
print(f'Number of columns: {features_df.shape[1]}')
print(labels_df['y1'].min())
print(labels_df['y2'].max())


       text_id  nam_id_1057.0  nam_id_19683.0  nam_id_356.0  nam_id_904.0  \
29976   851611            0.0             0.0           0.0           0.0   
29141   220342            0.0             0.0           0.0           0.0   
9013     15900            0.0             0.0           0.0           0.0   
27334    91732            0.0             0.0           0.0           0.0   
29792   765547            0.0             0.0           0.0           0.0   

       nam_id_2227.0  nam_id_726.0  nam_id_761.0  nam_id_731.0  nam_id_1246.0  
29976            0.0           0.0           0.0           0.0            0.0  
29141            0.0           0.0           0.0           0.0            0.0  
9013             0.0           0.0           0.0           0.0            0.0  
27334            0.0           0.0           0.0           0.0            0.0  
29792            0.0           0.0           0.0           0.0            0.0  
Number of rows: 30324
Number of columns: 16383


**Data Preprocessing**

Data preprocessing is the stage in machine learning where raw data is cleaned and prepared for analysis or model training. First, we cleaned the data by filling NULLs with 0s to complete the one-hot encoding. Now, we remove duplicate rows from features_df and labels_df so that the dataset is clean, consistent, and free from redundancy. Duplicates give repeated observations extra weight and can bias the model, which would hurt its performance.
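The toy example below is a small sketch of how pandas detects duplicates. One detail worth knowing: `duplicated()` and `drop_duplicates()` compare column values only and ignore the index, so two rows with different index ids but identical feature values count as duplicates. The frame and its values are invented for illustration.

```python
import pandas as pd

# Toy features indexed by a hypothetical text_id
toy = pd.DataFrame(
    {"nam_id_1": [1.0, 1.0, 0.0], "geo_id_7": [0.0, 0.0, 1.0]},
    index=pd.Index([101, 102, 103], name="text_id"),
)

# Rows 101 and 102 share the same values, so 102 is flagged
dup_mask = toy.duplicated()
print(dup_mask.sum())

# Dropping duplicates keeps the first occurrence of each distinct row
deduped = toy.drop_duplicates()
print(deduped.index.tolist())
```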

<font color='red'>Remove outliers?</font>

In [5]:
# Check for duplicates in the features dataframe
print(f"Number of rows before removing duplicates: {features_df.shape[0]}")
print(f"Number of duplicate rows: {features_df.duplicated().sum()}")

# Remove duplicate rows (text_id is the index, so rows are compared on feature values only)
features_df = features_df.drop_duplicates()

# Verify duplicates are removed
print(f"Number of rows after removing duplicates: {features_df.shape[0]}")

Number of rows before removing duplicates: 30324
Number of duplicate rows: 1478
Number of rows after removing duplicates: 28846


In [6]:
# Check for duplicates in the labels dataframe
print(f"Number of rows before removing duplicates: {labels_df.shape[0]}")
print(f"Number of duplicate rows: {labels_df.duplicated().sum()}")

# Remove duplicate rows
labels_df = labels_df.drop_duplicates()

# Verify duplicates are removed
print(f"Number of rows after removing duplicates: {labels_df.shape[0]}")

Number of rows before removing duplicates: 29250
Number of duplicate rows: 2
Number of rows after removing duplicates: 29248


We are further preparing the data by removing features with low variance. Variance refers to how much the values in a feature vary. Features with very little variation don't provide value for prediction because there aren't enough differences in the data for the model to learn any patterns. By removing these low-variance features, we simplify the model, reduce overfitting, and improve computational efficiency.

The threshold is a hyperparameter, so there is no single correct value. We tuned it by trying several thresholds and keeping the one that resulted in the lowest MAE. A threshold of 0.001 produced our best result.

fit_transform calculates the variance of each feature in features_df and removes features whose variance is below our specified threshold.

Printing the original and reduced shapes shows that we reduced our dimensions by 15,038 (from 16,383 to 1,345 columns). Feature reduction had diminishing returns: past a certain point, keeping more features worked better. Originally we reduced to 143 columns, but 1,345 performed better.
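To get a feel for what a threshold of 0.001 means on one-hot data, recall that a binary column with a share p of ones has variance p(1 - p). The sketch below computes this with NumPy on made-up columns and applies the same "keep if the variance is above the threshold" rule. It illustrates the idea only and is not scikit-learn's implementation.

```python
import numpy as np

n_rows = 1000
threshold = 0.001

def binary_column(n_ones, n_rows):
    """Return a 0/1 column with exactly n_ones ones."""
    col = np.zeros(n_rows)
    col[:n_ones] = 1.0
    return col

# Three hypothetical features: present in 1, 2, and 500 of 1000 rows
X_toy = np.column_stack([
    binary_column(1, n_rows),
    binary_column(2, n_rows),
    binary_column(500, n_rows),
])

# Population variance of each column, equal to p * (1 - p) for binary data
variances = X_toy.var(axis=0)

# Keep only the columns whose variance exceeds the threshold
keep = variances > threshold
X_kept = X_toy[:, keep]

print(variances)
print(keep)
```

On a dataset of roughly 29,000 rows, this means a threshold of 0.001 drops names that appear in about 0.1% of the texts or fewer, which is around 29 texts.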

In [7]:
from sklearn.feature_selection import VarianceThreshold

# Remove low-variance features
selector = VarianceThreshold(threshold=0.001)  # Adjust threshold as needed
features_reduced_df = selector.fit_transform(features_df)

print(f"Original shape: {features_df.shape}")
print(f"Reduced shape: {features_reduced_df.shape}")

Original shape: (28846, 16383)
Reduced shape: (28846, 1345)


We decided to use Singular Value Decomposition (SVD) for dimensionality reduction because it works well on high-dimensional datasets, and on sparse matrices in particular. Our data was an extremely sparse matrix. SVD reduces the number of features while retaining the most important information. We kept 1000 components because that value performed well after testing various settings of this hyperparameter. The explained variance ratio shows that even after reducing the original feature set by over 15,000 columns, the 1000 components still capture 95% of the variance of the 1,345 features passed to SVD.
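The sketch below shows the mechanics on a small random sparse binary matrix using plain NumPy: the data is projected onto the top k right singular vectors, and each component's explained variance ratio is its variance divided by the total variance of the input. The matrix, the seed, and k are arbitrary choices for illustration, and this uses a full decomposition, whereas TruncatedSVD uses a randomized solver by default.

```python
import numpy as np

# Toy sparse binary matrix: 200 rows, 50 features, about 5% ones
rng = np.random.default_rng(0)
X_toy = (rng.random((200, 50)) < 0.05).astype(float)

# Full SVD: X = U @ diag(s) @ Vt, singular values sorted in descending order
U, s, Vt = np.linalg.svd(X_toy, full_matrices=False)

# Keep the top k components by projecting onto the first k right singular vectors
k = 10
X_reduced = X_toy @ Vt[:k].T

# Explained variance ratio: per-component variance over total input variance
ratio = X_reduced.var(axis=0) / X_toy.var(axis=0).sum()

print(X_reduced.shape)
print(ratio.sum())
```

Because the data is not centered before the decomposition, the per-component ratios are not guaranteed to be strictly decreasing.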

<font color='red'>EXPLAIN SVD MORE</font>

**WHY SVD INSTEAD OF PCA?**

We chose SVD over PCA mainly because of the sparsity and size of our data. Both techniques are linear projections, so neither models nonlinear relationships by itself. We did observe that our nonlinear models far outperformed our linear models, but that finding guides the choice of model rather than the choice between SVD and PCA. TruncatedSVD works directly on the data without centering it, which suits sparse data like ours. PCA centers the data first (and, in its classical formulation, computes a covariance matrix), which turns a sparse matrix into a dense one and would take longer to process on a dataset of this size.
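A small sketch of the sparsity argument: centering each column, which PCA does before decomposing, turns the zeros of a sparse matrix into nonzero values. The toy matrix and seed are arbitrary.

```python
import numpy as np

# Toy sparse binary matrix: about 2% of the entries are ones
rng = np.random.default_rng(0)
X_toy = (rng.random((1000, 20)) < 0.02).astype(float)

# Share of exact zeros before centering
zero_share_before = np.mean(X_toy == 0)

# Subtract each column's mean, as PCA does
X_centered = X_toy - X_toy.mean(axis=0)

# Share of exact zeros after centering
zero_share_after = np.mean(X_centered == 0)

print(zero_share_before, zero_share_after)
```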

In [8]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=1000, random_state=42)
# svd = TruncatedSVD(n_components=2000, random_state=42)
features_svd = svd.fit_transform(features_reduced_df)

print(f"Original shape: {features_reduced_df.shape}")
print(f"Reduced shape: {features_svd.shape}")

print(svd.explained_variance_ratio_)
print(sum(svd.explained_variance_ratio_))  # Total variance retained

Original shape: (28846, 1345)
Reduced shape: (28846, 1000)
[0.02962784 0.02500479 0.01815715 0.01565095 0.01279698 0.01168515
 0.01004947 0.00987946 0.00931344 0.00879833 0.00840652 0.00839601
 0.00809594 0.00763554 0.00718988 0.00665971 0.00635726 0.00601058
 0.00589252 0.00575561 0.00555587 0.00538438 0.00514929 0.00505432
 0.00497139 0.00482803 0.00466713 0.00462055 0.00449435 0.00450015
 0.00440983 0.00434999 0.00419037 0.00417054 0.00398282 0.00395789
 0.00388713 0.00385748 0.00381505 0.00375992 0.00364783 0.00358395
 0.00353121 0.00346126 0.00344036 0.00339893 0.00334313 0.00329448
 0.00327808 0.00323764 0.00320789 0.00318611 0.00315127 0.00310233
 0.00307426 0.00303943 0.00302351 0.00295952 0.00290862 0.00287503
 0.00286709 0.00282286 0.00280273 0.00276819 0.0027324  0.00270148
 0.00267772 0.00264733 0.00259831 0.00260706 0.00255261 0.00252425
 0.00250912 0.00249147 0.00248418 0.00243584 0.00241878 0.00241629
 0.0023637  0.00236463 0.00233179 0.00229635 0.00228166 0.00225857
 0.

The following cell converts our features_svd array back into a dataframe features_svd_df.

In [9]:
# features_svd is the NumPy array returned by TruncatedSVD
# features_df's index holds the text_id values we need to reattach
features_df.reset_index(inplace=True)
features_svd_df = pd.DataFrame(features_svd, columns=[f'feature_{i}' for i in range(features_svd.shape[1])])
features_svd_df['text_id'] = features_df['text_id'].values


This is one of the most important cells because it prepares our training set. Our original labels and features files have around 29k and 30k rows respectively. Not every row in the features dataset has a corresponding row in the labels dataset and vice versa. To train our model, we need to give it data for which we already know the answers: the ground truths. These are the values we compare our predictions against to gauge the performance of our models. Although we removed duplicates earlier, we check again here so the data is thoroughly prepared before we train our model.

After removing duplicates and data that didn't appear in both the label and features datasets, we were left with 8565 datapoints with corresponding labels. This is the data we use to train our model.
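The toy sketch below mirrors the alignment logic on invented data: intersect the id sets, filter both tables, then drop repeated ids on the labels side. It also shows how a labels table can contain no fully identical rows and still repeat an id, which is why the labels count only matches the features count after deduplicating on the id column. Note that `keep='first'` simply keeps whichever row comes first for each id.

```python
import pandas as pd

# Hypothetical features and labels with partially overlapping ids
features_toy = pd.DataFrame({"text_id": [1, 2, 3], "feature_0": [0.1, 0.2, 0.3]})
labels_toy = pd.DataFrame({"tex_id": [2, 3, 3, 4], "y1": [100.0, 50.0, 60.0, 10.0]})

# Ids present in both tables
common_ids = set(features_toy["text_id"]) & set(labels_toy["tex_id"])

# Keep only rows whose id appears in both tables
features_common = features_toy[features_toy["text_id"].isin(common_ids)]
labels_common = labels_toy[labels_toy["tex_id"].isin(common_ids)]
print(len(features_common), len(labels_common))

# Id 3 appears twice in the labels with different values; keep the first row
labels_unique = labels_common.drop_duplicates(subset="tex_id", keep="first")
print(len(labels_unique))
```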

In [10]:
# Get text_ids in each dataset
features_text_ids = set(features_svd_df['text_id'])
labels_text_ids = set(labels_df['tex_id'])

# Find unmatched text_ids
missing_in_labels = features_text_ids - labels_text_ids
missing_in_features = labels_text_ids - features_text_ids

print(f"Text IDs missing in labels: {len(missing_in_labels)}")
print(f"Text IDs missing in features: {len(missing_in_features)}")

# Keep only common text_ids
common_text_ids = features_text_ids & labels_text_ids

# Filter features and labels datasets
features_common_df = features_svd_df[features_svd_df['text_id'].isin(common_text_ids)]
labels_common_df = labels_df[labels_df['tex_id'].isin(common_text_ids)]

# Check the updated shapes
print(f"Updated features shape: {features_common_df.shape}")
print(f"Updated labels shape: {labels_common_df.shape}")

# Check for duplicated rows
duplicate_rows = features_common_df.duplicated()

# Check for duplicated text_ids
duplicate_text_ids = features_common_df['text_id'].duplicated()

print(f"Total duplicate rows: {duplicate_rows.sum()}")
print(f"Total duplicate text_ids: {duplicate_text_ids.sum()}")

# Check for duplicate rows
duplicate_label_rows = labels_common_df.duplicated()

# Check for duplicate text_ids
duplicate_label_text_ids = labels_common_df['tex_id'].duplicated()

print(f"Total duplicate rows: {duplicate_label_rows.sum()}")
print(f"Total duplicate text_ids: {duplicate_label_text_ids.sum()}")

# Drop duplicate text_ids, keeping the first occurrence
features = features_common_df.drop_duplicates(subset='text_id', keep='first')

# Drop duplicate text_ids, keeping the first occurrence
labels = labels_common_df.drop_duplicates(subset='tex_id', keep='first')


# Ensure alignment of text_ids between features and labels
common_text_ids = set(features['text_id']) & set(labels['tex_id'])

# Filter again if necessary
features_filtered_df = features[features['text_id'].isin(common_text_ids)]
labels_filtered_df = labels[labels['tex_id'].isin(common_text_ids)]

print(f"Aligned features shape: {features_filtered_df.shape}")
print(f"Aligned labels shape: {labels_filtered_df.shape}")


Text IDs missing in labels: 20281
Text IDs missing in features: 12381
Updated features shape: (8565, 1001)
Updated labels shape: (13013, 12)
Total duplicate rows: 0
Total duplicate text_ids: 0
Total duplicate rows: 0
Total duplicate text_ids: 4448
Aligned features shape: (8565, 1001)
Aligned labels shape: (8565, 12)


We set the index of our dataframes to text_id since the ids are not features but rather unique identifiers of each datapoint.

In [11]:
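# Reassign instead of using inplace=True, since these frames are filtered slices of other frames
features_filtered_df = features_filtered_df.set_index('text_id')

# Rename tex_id so both dataframes share the same index name
labels_filtered_df = labels_filtered_df.rename(columns={'tex_id': 'text_id'}).set_index('text_id')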

Since our task involves predicting the year a text was written, retaining the other ground truth labels was unnecessary. Removing them avoids confusion and ensures our process is focused solely on the data relevant to our task.

<font color="red"> Should we convert the other labels into features before dimension reduction?</font>

In [12]:
labels_final_df = labels_filtered_df[['y1','y2']]
labels_final_df

text_id,y1,y2
12042,117.0,118.0
12054,119.0,119.0
12063,96.0,98.0
12064,131.0,131.0
17239,108.0,108.0
...,...,...
703104,-250.0,-175.0
703317,-263.0,-229.0
5910,-148.0,-148.0
3506,-150.0,-150.0


Although this step is redundant, we wanted to ensure all of our datapoints were merged with a corresponding label via an inner join. This could be skipped altogether since we are separating the y label from the X features when training our model, but having it as one dataframe in the beginning aligned with our approaches throughout the course.
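As a small sketch of what the inner join does, the toy example below merges two invented frames on their shared index. Only ids present on both sides survive, so the merge performs the same filtering as an explicit intersection. The `validate` argument is an optional safeguard added here for illustration; it raises an error if an id is repeated on either side.

```python
import pandas as pd

# Hypothetical features and labels, both indexed by text_id
features_toy = pd.DataFrame(
    {"feature_0": [0.1, 0.2, 0.3]},
    index=pd.Index([1, 2, 3], name="text_id"),
)
labels_toy = pd.DataFrame(
    {"y1": [100.0, 50.0], "y2": [100.0, 60.0]},
    index=pd.Index([2, 4], name="text_id"),
)

# Inner join on the index: only text_id 2 exists in both frames
merged_toy = features_toy.merge(
    labels_toy, left_index=True, right_index=True, how="inner", validate="one_to_one"
)
print(merged_toy)
```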

<font color="red">Can we replace the cell above where we filter for records in both features_df and labels_df by only using this method?</font>

In [14]:
# Merge on text_id
merged_df = features_filtered_df.merge(labels_final_df, on='text_id', how='inner')

merged_df.shape

(8565, 1002)

<h1>Classification or Regression?</h1>

Here we count how many rows have y1 = y2 and how many have y1 != y2, and display the results.

Since most rows have an exactly known year (5,239 of 8,565), we believe regression was the better approach because predicting an exact year (a continuous variable) aligns with the strengths of regression.

If we had more cases where y1 != y2, there would be more uncertainty about the exact date of our texts, which would justify a classification model that simplifies the task by binning the ranges into predefined categories (100-199AD, 200-299AD).

Based on the context of our training data, we believed regression was the better approach because the majority of our ground truths were a single year rather than a range of years.
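For reference, the binning that a classification approach would need could look like the sketch below. The century-wide bins and the example years are hypothetical, and this is not something we ran on our data.

```python
import numpy as np

# Hypothetical target years (negative values are BC)
years = np.array([117.0, 250.5, -148.0])

# Start of the 100-year bin each year falls into
bin_start = np.floor(years / 100) * 100

# Last whole year of that bin
bin_end = bin_start + 99

for year, start, end in zip(years, bin_start, bin_end):
    print(f"{year} -> bin {int(start)} to {int(end)}")
```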

<b>We chose Regression</b>

<font color="red"> Should we create classification models anyway for comparison?</font>

<font color="red">Should we combine classification and regression models into one model?
ex. classification followed by regression
ex. regression followed by classification
ex. multi-output neural network</font>

In [15]:
equal_rows = merged_df[merged_df['y1'] == merged_df['y2']].shape[0]
unequal_rows = merged_df[merged_df['y1'] != merged_df['y2']].shape[0]

print(f"Number of rows where y1 = y2: {equal_rows}")
print(f"Number of rows where y1 != y2: {unequal_rows}")

Number of rows where y1 = y2: 5239
Number of rows where y1 != y2: 3326


Since we chose regression, we created a target column that handles both cases: when y1 equals y2 (a single year) and when y1 differs from y2 (a range of years).

- If y1 and y2 are equal, y_target is simply y1.
- If they differ, y_target is the midpoint of y1 and y2. The midpoint serves as a reasonable estimate for a range when an exact year isn't available.
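A quick check on toy values shows that the midpoint formula already covers both cases, because the midpoint of two equal years is that year. The sketch below compares a row-wise version with a vectorized one; the example years are invented.

```python
import pandas as pd

# Toy labels: one range, one exact year, one wide range
toy = pd.DataFrame({"y1": [117.0, 96.0, 500.0], "y2": [118.0, 96.0, 599.0]})

# Row-wise version with an explicit equality check
row_wise = toy.apply(
    lambda row: row["y1"] if row["y1"] == row["y2"] else (row["y1"] + row["y2"]) / 2,
    axis=1,
)

# Vectorized version: the midpoint alone gives the same values
vectorized = (toy["y1"] + toy["y2"]) / 2

print(vectorized.tolist())
```

The vectorized form avoids a Python-level loop over rows, which matters more as the number of rows grows.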

In [16]:
merged_df['y_target'] = merged_df.apply(lambda row: row['y1'] if row['y1'] == row['y2'] else (row['y1'] + row['y2']) / 2, axis=1)

merged_df

text_id,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,...,feature_993,feature_994,feature_995,feature_996,feature_997,feature_998,feature_999,y1,y2,y_target
1,0.404936,-0.332399,0.569584,0.244467,0.118471,-0.012269,0.153942,0.019276,0.003331,0.233105,...,0.006794,0.075361,-0.012450,0.011055,0.047048,-0.014151,-0.024547,-124.0,-124.0,-124.0
2,1.194844,-0.455850,0.790932,-0.084557,0.912922,0.938616,0.366342,-0.087693,-0.983342,-0.075471,...,0.029750,-0.032232,0.014291,-0.030107,-0.024161,-0.011760,0.020541,-112.0,-112.0,-112.0
3,0.669130,-0.418529,0.354021,0.003984,0.230967,0.093175,-0.131145,-0.388602,0.684473,0.321883,...,0.022786,-0.009252,0.038818,-0.002101,0.028129,-0.016734,0.019266,-109.0,-109.0,-109.0
4,1.226901,-0.607306,0.716642,-0.143475,0.932176,0.681385,0.182460,0.103260,-0.469498,-0.267133,...,-0.035404,-0.097845,0.001688,0.080584,0.043998,0.040462,0.092145,-108.0,-108.0,-108.0
5,0.403593,-0.306048,0.475889,0.177825,0.139036,-0.021976,0.124563,0.050058,-0.034562,0.217894,...,0.002199,-0.002162,0.007168,0.001697,-0.003452,-0.001445,0.011756,-106.0,-106.0,-106.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
981643,0.008250,0.016243,0.018408,0.012553,-0.044002,0.041111,0.018783,0.001615,0.011903,0.007014,...,-0.002129,-0.002368,0.003550,0.002539,0.004470,-0.000576,0.000890,548.0,548.0,548.0
981644,0.222981,0.390343,0.176930,-0.095436,-0.083316,0.120702,0.464449,0.153280,0.487788,-0.111622,...,-0.000316,0.002716,0.007476,0.001311,0.000043,0.009732,-0.003175,553.0,553.0,553.0
981646,0.019691,0.037447,0.039414,0.022553,-0.083950,0.074489,0.025522,0.007548,0.021971,0.009494,...,0.000879,0.002465,0.000985,0.000319,0.000345,-0.001285,0.004310,549.0,549.0,549.0
981648,0.072397,0.109943,0.118935,0.083189,-0.261060,0.207830,0.096346,0.025396,0.049978,0.062749,...,-0.005298,0.002257,0.007938,-0.000986,0.003599,0.002233,0.001810,500.0,599.0,549.5


After creating our new ground truth y_target, we no longer needed the labels y1 and y2.

In [17]:
merged_df = merged_df.drop(columns=['y1','y2'])
merged_df

text_id,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,...,feature_991,feature_992,feature_993,feature_994,feature_995,feature_996,feature_997,feature_998,feature_999,y_target
1,0.404936,-0.332399,0.569584,0.244467,0.118471,-0.012269,0.153942,0.019276,0.003331,0.233105,...,-0.036339,0.028035,0.006794,0.075361,-0.012450,0.011055,0.047048,-0.014151,-0.024547,-124.0
2,1.194844,-0.455850,0.790932,-0.084557,0.912922,0.938616,0.366342,-0.087693,-0.983342,-0.075471,...,0.025149,-0.023396,0.029750,-0.032232,0.014291,-0.030107,-0.024161,-0.011760,0.020541,-112.0
3,0.669130,-0.418529,0.354021,0.003984,0.230967,0.093175,-0.131145,-0.388602,0.684473,0.321883,...,0.015310,0.003670,0.022786,-0.009252,0.038818,-0.002101,0.028129,-0.016734,0.019266,-109.0
4,1.226901,-0.607306,0.716642,-0.143475,0.932176,0.681385,0.182460,0.103260,-0.469498,-0.267133,...,0.063221,0.044466,-0.035404,-0.097845,0.001688,0.080584,0.043998,0.040462,0.092145,-108.0
5,0.403593,-0.306048,0.475889,0.177825,0.139036,-0.021976,0.124563,0.050058,-0.034562,0.217894,...,0.005871,-0.008616,0.002199,-0.002162,0.007168,0.001697,-0.003452,-0.001445,0.011756,-106.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
981643,0.008250,0.016243,0.018408,0.012553,-0.044002,0.041111,0.018783,0.001615,0.011903,0.007014,...,0.005598,-0.002627,-0.002129,-0.002368,0.003550,0.002539,0.004470,-0.000576,0.000890,548.0
981644,0.222981,0.390343,0.176930,-0.095436,-0.083316,0.120702,0.464449,0.153280,0.487788,-0.111622,...,-0.002949,-0.002140,-0.000316,0.002716,0.007476,0.001311,0.000043,0.009732,-0.003175,553.0
981646,0.019691,0.037447,0.039414,0.022553,-0.083950,0.074489,0.025522,0.007548,0.021971,0.009494,...,0.001661,0.002340,0.000879,0.002465,0.000985,0.000319,0.000345,-0.001285,0.004310,549.0
981648,0.072397,0.109943,0.118935,0.083189,-0.261060,0.207830,0.096346,0.025396,0.049978,0.062749,...,-0.001269,0.001269,-0.005298,0.002257,0.007938,-0.000986,0.003599,0.002233,0.001810,549.5


<h1>TRAINING & COMPARING OUR MODELS</h1>

This code splits our dataset into training and testing subsets to prepare it for our machine learning algorithms.

X contains all of our features
y contains our label

We then split the data into training and testing subsets, using 20% of the data for testing and 80% for training. We set random_state=42 for reproducibility.

The purpose of splitting our data is to evaluate how well the model generalizes to unseen data. Our training sets are used to train the model while the testing set is used afterward to assess the model's performance on data it hasn't seen during training.

We did not scale our data because the original data was one-hot encoded and then reduced using variance thresholding and SVD. Variance thresholding only removes features with low variance and doesn't change the scale of the remaining ones. SVD produces components that are linear combinations of our original features. The component directions have unit length, but the projected columns are not standardized and carry different variances. This does not affect tree-based models, which are insensitive to feature scale, although regularized linear models such as Lasso and Ridge can be sensitive to it.
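One way to check how SVD output is scaled is to look at the per-column spread of the projected data. The NumPy sketch below does this on an arbitrary toy matrix: the component directions have unit length, while the projected columns can have different standard deviations.

```python
import numpy as np

# Toy sparse binary matrix
rng = np.random.default_rng(0)
X_toy = (rng.random((300, 40)) < 0.05).astype(float)

# Project onto the top 10 right singular vectors
U, s, Vt = np.linalg.svd(X_toy, full_matrices=False)
X_reduced = X_toy @ Vt[:10].T

# Length of each component direction (each is 1)
direction_norms = np.linalg.norm(Vt[:10], axis=1)

# Spread of each projected column
column_stds = X_reduced.std(axis=0)

print(direction_norms)
print(column_stds)
```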

In [286]:
from sklearn.model_selection import train_test_split

# Split data into features (X) and target (y)
X = merged_df.drop(columns=['y_target'])
y = merged_df['y_target']

# Train-test split (80-20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


<h1>LINEAR OR NONLINEAR RELATIONSHIPS?</h1>
After training all of our models, it is apparent that the relationships in our data may be nonlinear.

Our Lasso and Ridge Regression linear models performed significantly worse. These models assume that the relationship between the features and target is linear.

Our RandomForest and XGBoost nonlinear ensemble models performed far better. These models capture nonlinear relationships in the data. They work by building many decision trees, each of which repeatedly splits the data on feature values, and combining the predictions of those trees.
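A toy illustration of why a tree-style model can beat a linear one: on a step-shaped target, a straight line cannot follow the jump, while a single split (a decision stump, the building block of the trees in these ensembles) fits it exactly. The data here is synthetic and unrelated to our texts.

```python
import numpy as np

# Synthetic step-shaped target: 0 left of zero, 100 right of zero
x = np.linspace(-1, 1, 200)
y = np.where(x < 0, 0.0, 100.0)

# Linear model: least-squares straight line
slope, intercept = np.polyfit(x, y, deg=1)
mae_linear = np.mean(np.abs(y - (slope * x + intercept)))

# Decision stump: one split at x = 0, predicting the mean on each side
left_mean = y[x < 0].mean()
right_mean = y[x >= 0].mean()
stump_pred = np.where(x < 0, left_mean, right_mean)
mae_stump = np.mean(np.abs(y - stump_pred))

print(f"Linear MAE: {mae_linear:.2f}")
print(f"Stump MAE: {mae_stump:.2f}")
```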

<h1>LASSO</h1>


In [157]:
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_absolute_error

# Lasso with cross-validation
lasso = LassoCV(cv=5, random_state=42)
lasso.fit(X_train, y_train)

# Evaluate performance
print("Lasso - Best alpha:", lasso.alpha_)
print("Train MAE:", mean_absolute_error(y_train, lasso.predict(X_train)))
print("Test MAE:", mean_absolute_error(y_test, lasso.predict(X_test)))

lasso_features = pd.Series(lasso.coef_, index=X.columns)
print("Selected features by Lasso:")
print(lasso_features[lasso_features != 0])



Lasso - Best alpha: 0.11463516894989338
Train MAE: 142.75912648399776
Test MAE: 144.72150059681678
Selected features by Lasso:
feature_0       36.478300
feature_1      285.691928
feature_2       18.183640
feature_3       65.560417
feature_4     -126.042950
                  ...    
feature_104    -25.799100
feature_105    -13.182620
feature_106    -19.839579
feature_108     37.442107
feature_109    -29.204855
Length: 100, dtype: float64


<h1>RIDGE</h1>

In [158]:
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_absolute_error

# Ridge with cross-validation
ridge = RidgeCV(alphas=[0.1, 1.0, 10.0], cv=5)
ridge.fit(X_train, y_train)

# Evaluate performance
print("Ridge - Best alpha:", ridge.alpha_)
print("Train MAE:", mean_absolute_error(y_train, ridge.predict(X_train)))
print("Test MAE:", mean_absolute_error(y_test, ridge.predict(X_test)))

ridge_features = pd.Series(ridge.coef_, index=X.columns)
print("Ridge coefficients:")
print(ridge_features.sort_values(ascending=False).head(10))  # Top 10



Ridge - Best alpha: 10.0
Train MAE: 142.69147866233317
Test MAE: 144.62340916417358
Ridge coefficients:
feature_1      284.001682
feature_5      231.785481
feature_12     119.447728
feature_22     111.392607
feature_65      98.502349
feature_79      70.167565
feature_3       66.364979
feature_85      66.199689
feature_77      54.787702
feature_100     53.352236
dtype: float64


<h1>RANDOM FOREST</h1>

In [287]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Train Random Forest
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Evaluate performance
print("Random Forest - Train MAE:", mean_absolute_error(y_train, rf_model.predict(X_train)))
print("Random Forest - Test MAE:", mean_absolute_error(y_test, rf_model.predict(X_test)))




Random Forest - Train MAE: 30.740977718639943
Random Forest - Test MAE: 78.56731082965455


<h1>OPTIMIZING RANDOM FOREST</h1>

In [159]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

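# Note: without a `scoring` argument, GridSearchCV ranks candidates by the
# regressor's default score (R-squared), not by MAE
grid_search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=5)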
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best Random Forest Test MAE:", mean_absolute_error(y_test, grid_search.best_estimator_.predict(X_test)))


Best Parameters: {'max_depth': 20, 'min_samples_split': 5, 'n_estimators': 300}
Best Random Forest Test MAE: 92.32550495382549


In [264]:
# Train Random Forest
rf_model = RandomForestRegressor(max_depth=20, min_samples_split=5, n_estimators=300, random_state=42)
rf_model.fit(X_train, y_train)

# Evaluate performance
print("Random Forest - Train MAE:", mean_absolute_error(y_train, rf_model.predict(X_train)))
print("Random Forest - Test MAE:", mean_absolute_error(y_test, rf_model.predict(X_test)))

Random Forest - Train MAE: 59.81336336248054
Random Forest - Test MAE: 92.32550495382549


<h1>XGBOOST (EXTREME GRADIENT BOOSTING)</h1>

In [252]:
import xgboost as xgb
from sklearn.metrics import mean_absolute_error

# Initialize XGBRegressor
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.7, learning_rate=0.031,
                          max_depth=7, alpha=25, n_estimators=350)

# Train the model
xg_reg.fit(X_train, y_train)

# Predict on the test set
y_pred = xg_reg.predict(X_test)

# Calculate MAE
mae = mean_absolute_error(y_test, y_pred)
print(f"XGBoost MAE: {mae}")


XGBoost MAE: 89.78133348274217


In [219]:
from sklearn.metrics import mean_absolute_error, r2_score

# Calculate MAE and R-squared
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"XGBoost MAE: {mae}")
print(f"XGBoost R-squared: {r2}")


XGBoost MAE: 89.78133348274217
XGBoost R-squared: 0.7394333114829847


In [265]:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumes merged_df has already been built in the cells above.

# Step 1: Standardize the features
# Exclude the target variable from the feature matrix
X = merged_df.drop(columns=['y_target'])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 2: Apply PCA
# A float n_components keeps just enough components to reach that share of explained variance.
pca = PCA(n_components=0.85)  # Keep enough components to explain 85% of the variance
X_pca = pca.fit_transform(X_scaled)

# Step 3: Convert the PCA results to a DataFrame, reusing the text_id index so rows stay aligned
pca_df = pd.DataFrame(X_pca, index=merged_df.index, columns=[f'PC{i+1}' for i in range(X_pca.shape[1])])

# Step 4: Join the PCA results back with the target variable (aligned on text_id)
merged_df_pca = pd.concat([pca_df, merged_df['y_target']], axis=1)

# Print the explained variance ratio (optional)
print("Explained variance ratio by each component:", pca.explained_variance_ratio_)

# Optionally, print the new dataframe with PCA components
print(merged_df_pca.head())


Explained variance ratio by each component: [0.02009968 0.01721329 0.01706886 0.01563002 0.01416983 0.01400168
 0.01363093 0.01350333 0.01342391 0.01325897 0.01304782 0.01291276
 0.01264846 0.01242928 0.01235066 0.01214336 0.01195931 0.01183328
 0.01158328 0.01138039 0.01135438 0.01128614 0.01111041 0.01109109
 0.01099495 0.01084793 0.01083958 0.01074549 0.0106687  0.01060011
 0.01043792 0.01039694 0.01035413 0.01031585 0.01006652 0.01002971
 0.00998617 0.00991306 0.00984947 0.00973248 0.00970367 0.00956474
 0.00952874 0.00945024 0.00941634 0.00928925 0.00926897 0.00921159
 0.00913221 0.00906736 0.00897531 0.00892098 0.00890046 0.00882999
 0.00881589 0.0087819  0.00862541 0.00861903 0.0085754  0.00854557
 0.0084766  0.00844001 0.0083891  0.00834661 0.00826329 0.00819992
 0.00811499 0.00807654 0.00804795 0.00801258 0.00794356 0.00783884
 0.00778971 0.00777591 0.00765647 0.00761074 0.00754618 0.00748518
 0.00738814 0.00727699 0.0071861  0.00712556 0.00709235 0.00704214]
        PC1      

In [266]:
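# Attach the target to the PCA features; set_axis reuses merged_df's text_id index so rows stay aligned
pca_df = pd.concat([pca_df.set_axis(merged_df.index, axis=0), merged_df['y_target']], axis=1)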

In [267]:
# Split data into features (X) and target (y)
# Note: this still uses the SVD features in merged_df, not the PCA output above
X = merged_df.drop(columns=['y_target'])
y = merged_df['y_target']

# Train-test split (80-20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [268]:
# Initialize XGBRegressor
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.7, learning_rate=0.031,
                          max_depth=7, alpha=25, n_estimators=350)

# Train the model
xg_reg.fit(X_train, y_train)

# Predict on the test set
y_pred = xg_reg.predict(X_test)

# Calculate MAE
mae = mean_absolute_error(y_test, y_pred)
print(f"XGBoost MAE: {mae}")

XGBoost MAE: 89.78133348274217


In [289]:
# Initialize XGBRegressor
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.7, learning_rate=0.031,
                          max_depth=7, alpha=25, n_estimators=350)

# Train the model
xg_reg.fit(X_train, y_train)

# Predict on the test set
y_pred = xg_reg.predict(X_test)

# Calculate MAE
mae = mean_absolute_error(y_test, y_pred)
print(f"XGBoost MAE: {mae}")

# Calculate R-squared (MAE was already printed above)
r2 = r2_score(y_test, y_pred)
print(f"XGBoost R-squared: {r2}")

XGBoost MAE: 71.1634089561213
XGBoost R-squared: 0.8406828639172944
XGBoost R-squared: 0.8406828639172944


In [None]:
TRY USING LIGHTGBM GRADIENT BOOSTING

TRY SUPPORT VECTOR MACHINE