In [11]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

# Load the California housing dataset
california_housing = fetch_california_housing()

# Create a DataFrame
X = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
y = california_housing.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize XGBoost regressor with parallelization
xgb_reg = XGBRegressor(learning_rate=0.1, n_estimators=500, max_depth=5, objective='reg:squarederror', n_jobs=-1)

# Train the model
xgb_reg.fit(X_train, y_train)

# Make predictions
y_pred = xgb_reg.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

Mean Squared Error: 0.20745356114708538


In [14]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

# Load the California housing dataset
california_housing = fetch_california_housing()

# Create a DataFrame
X = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
y = california_housing.target

# Selecting specific columns
selected_features = ['Longitude', 'MedInc']
X_selected = X[selected_features]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=42)

# Initialize XGBoost regressor with parallelization
xgb_reg = XGBRegressor(learning_rate=0.1, n_estimators=500, max_depth=5, objective='reg:squarederror', n_jobs=-1)

# Train the model
xgb_reg.fit(X_train, y_train)

# Make predictions
y_pred = xgb_reg.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)


Mean Squared Error: 0.49716071275105733


In [8]:
# Importing necessary libraries
from scipy.stats import pearsonr

# Calculate the absolute Pearson correlation between each feature and the target.
# Taking abs() ranks features by strength only; the sign (direction) is discarded.
correlations = {}
for feature in X.columns:
    correlation, _ = pearsonr(X[feature], y)
    correlations[feature] = abs(correlation)

# Sort the features by their correlation with the target variable
sorted_correlations = sorted(correlations.items(), key=lambda x: x[1], reverse=True)

# Select the top features based on correlation coefficient
top_features = [feature for feature, correlation in sorted_correlations[:5]]  # Selecting top 5 features

print("Top features based on Pearson correlation coefficient:")
for feature, correlation in sorted_correlations[:5]:
    print(f"{feature}: {correlation}")

Top features based on Pearson correlation coefficient:
MedInc: 0.6880752079585469
AveRooms: 0.1519482897414577
Latitude: 0.1441602768746582
HouseAge: 0.10562341249320949
AveBedrms: 0.04670051296948675


In [9]:
from sklearn.feature_selection import mutual_info_regression

# Calculate mutual information between features and target variable.
# The estimator adds small random noise to continuous features, so scores vary
# slightly between runs unless random_state is set.
mi_scores = mutual_info_regression(X, y)

# Map each feature name to its mutual information score
mi_scores_dict = dict(zip(X.columns, mi_scores))

# Sort the features by their mutual information scores
sorted_mi_scores = sorted(mi_scores_dict.items(), key=lambda x: x[1], reverse=True)

# Select the top features based on mutual information
top_features = [feature for feature, mi_score in sorted_mi_scores[:5]]  # Selecting top 5 features

print("Top features based on Mutual Information scores:")
for feature, mi_score in sorted_mi_scores[:5]:
    print(f"{feature}: {mi_score}")

Top features based on Mutual Information scores:
Longitude: 0.3995484574076329
MedInc: 0.3875033850264291
Latitude: 0.3684838710710805
AveRooms: 0.10333844972600925
AveOccup: 0.07217356375695427


In [None]:
Yes, mutual information (MI) and Pearson correlation coefficient can give different rankings for feature importance, and this is because they capture different aspects of the relationship between features and the target variable.

Mutual Information (MI):

Mutual information measures the amount of information obtained about one variable through the other variable. In the context of feature selection, it quantifies the amount of information that a feature provides about the target variable.
MI can capture both linear and non-linear relationships between variables.
It makes no assumption about the form of the relationship or the distribution of the data, which suits it to a wide range of data types. It has to be estimated from samples, though (scikit-learn's mutual_info_regression uses a nearest-neighbor estimator), so the scores carry some estimation noise.

Pearson Correlation Coefficient:

Pearson correlation coefficient measures the linear correlation between two variables. It ranges from -1 to 1, where:
1 indicates a perfect positive linear relationship,
-1 indicates a perfect negative linear relationship, and
0 indicates no linear relationship.
It only captures linear relationships between variables. A strong non-linear relationship can still produce a coefficient near 0.
The coefficient itself does not require normally distributed variables; normality matters for the usual significance test (the p-value). It is, however, sensitive to outliers.

In the case of the California housing dataset:

In the results above, Longitude has the highest MI score (about 0.40) but does not appear in the Pearson top five, which suggests its relationship with the target is mostly non-linear.
MedInc dominates the Pearson ranking (about 0.69), consistent with a largely linear relationship with the target; Pearson gives little weight to features whose influence is non-linear.

Therefore, the rankings obtained from MI and Pearson correlation can differ, as they each emphasize different aspects of the relationship between features and the target variable.
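
A minimal sketch of how the two measures can disagree, using synthetic data rather than the housing data. The variable names and the quadratic relationship are assumptions chosen for illustration: the target depends on the feature only through its square, so there is no linear trend for Pearson to pick up.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import mutual_info_regression

# Synthetic data (an assumption for illustration, not the housing data):
# y depends on x only through x**2, a purely non-linear relationship.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=2000)
y_quadratic = x**2 + rng.normal(scale=0.01, size=x.size)

# Pearson measures the linear trend only, which is absent here.
pearson_r, _ = pearsonr(x, y_quadratic)

# mutual_info_regression expects a 2-D feature matrix, hence the reshape.
mi = mutual_info_regression(x.reshape(-1, 1), y_quadratic, random_state=0)[0]

print(f"|Pearson r| = {abs(pearson_r):.3f}")
print(f"MI          = {mi:.3f}")
```

The absolute correlation should come out close to zero while the MI score is clearly positive, which is the same pattern that can push a feature up the MI ranking and down the Pearson ranking.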

In [None]:

RidgeCV is ridge regression (a linear model) with built-in cross-validation over the regularization
strength alpha. By construction it captures linear relationships between the features and the target,
but not non-linear ones. The ridge penalty shrinks the coefficients to reduce overfitting;
it does not introduce non-linear transformations of the features, so any such terms have to be
added to the inputs explicitly.
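
The point can be illustrated with a small sketch on synthetic data (an assumption, not the housing data). A plain RidgeCV is fitted to a quadratic target, then the same model is fitted after the features are expanded with PolynomialFeatures. The names `X_demo`, `y_demo`, `r2_linear`, and `r2_poly` are made up for this example.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic quadratic target (illustrative assumption).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1.0, 1.0, size=(500, 1))
y_demo = X_demo[:, 0] ** 2 + rng.normal(scale=0.01, size=500)

# Plain RidgeCV: only a straight line in x is available to the model.
linear_model = RidgeCV().fit(X_demo, y_demo)
r2_linear = linear_model.score(X_demo, y_demo)

# Same model after adding x**2 as an explicit input feature.
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    RidgeCV(),
).fit(X_demo, y_demo)
r2_poly = poly_model.score(X_demo, y_demo)

print(f"R^2 without polynomial features: {r2_linear:.3f}")
print(f"R^2 with polynomial features:    {r2_poly:.3f}")
```

The model stays linear in its inputs in both cases; the second fit succeeds only because the non-linear term was supplied as a feature.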

In [None]:
To analyze the consistency of feature rankings obtained from Mutual Information (MI) across different subsets of your data (in this case, different years), you can use Kendall's Tau correlation coefficient. Kendall's Tau measures the agreement between two rankings of the same items, from -1 (one ranking is the reverse of the other) to 1 (identical order). Here's how you can approach it:

Compute the Kendall's Tau correlation coefficient between the rankings obtained from MI for each pair of years.
Average the Kendall's Tau coefficients to obtain an overall measure of consistency across the years.
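
The two steps above can be sketched as follows. This is a rough illustration under stated assumptions: the data is synthetic, the three "subsets" are arbitrary chunks standing in for years, and `mean_pairwise_tau` is a hypothetical helper, not a library function.

```python
from itertools import combinations

import numpy as np
from scipy.stats import kendalltau
from sklearn.feature_selection import mutual_info_regression


def mean_pairwise_tau(score_vectors):
    """Average Kendall's Tau over every pair of per-feature score vectors.

    Each vector must list the features in the same order.
    """
    taus = []
    for a, b in combinations(score_vectors, 2):
        tau, _ = kendalltau(a, b)
        taus.append(tau)
    return float(np.mean(taus))


# Synthetic data: two informative features and two noise features.
rng = np.random.default_rng(0)
n_samples = 900
X_syn = rng.normal(size=(n_samples, 4))
y_syn = 3.0 * X_syn[:, 0] + np.sin(2.0 * X_syn[:, 1]) + rng.normal(scale=0.1, size=n_samples)

# Stand-in for "years": three equally sized chunks of rows.
chunks = np.array_split(np.arange(n_samples), 3)

# One MI score vector per subset, always in the same feature order.
mi_by_subset = [
    mutual_info_regression(X_syn[idx], y_syn[idx], random_state=0)
    for idx in chunks
]

consistency = mean_pairwise_tau(mi_by_subset)
print(f"Mean pairwise Kendall's Tau: {consistency:.3f}")
```

A value near 1 indicates that the subsets order the features almost identically; values near 0 indicate little agreement.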

In [None]:
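When computing Kendall's Tau correlation coefficient to compare rankings obtained from Mutual Information (MI) scores across different subsets of your data, you should pass one value per feature, with both arrays listing the features in the same order. The rank positions derived from the MI scores are the natural input; the raw scores yield the same coefficient, because Kendall's Tau depends only on ordering. What you should not pass is two lists of feature names, each sorted by its own scores, since those are no longer aligned feature by feature.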

Here's why:

Kendall's Tau compares rankings: Kendall's Tau is designed to compare rankings between two lists of items. It measures the similarity in the order of items between the two lists, regardless of the magnitude of the values.

Rankings preserve the relative importance: When you pass rankings instead of scores, you're focusing on the relative importance of features within each subset of data. This is what matters when assessing the consistency of feature importance across different subsets.
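
As a small check of the point that only the ordering matters, the sketch below computes Kendall's Tau once from rank positions and once from the underlying scores. The MI values are hypothetical numbers made up for illustration; both arrays follow the same feature order.

```python
import numpy as np
from scipy.stats import kendalltau, rankdata

# Feature order shared by both arrays (alignment is what matters).
features = ["MedInc", "Longitude", "Latitude", "AveRooms"]

# Hypothetical MI scores for two subsets of the data (illustrative values).
scores_a = np.array([0.39, 0.40, 0.37, 0.10])
scores_b = np.array([0.41, 0.35, 0.38, 0.12])

# Convert scores to rank positions, where rank 1 is the highest score.
ranks_a = rankdata(-scores_a)
ranks_b = rankdata(-scores_b)

tau_ranks, _ = kendalltau(ranks_a, ranks_b)
tau_scores, _ = kendalltau(scores_a, scores_b)

print(f"Tau from ranks:  {tau_ranks:.3f}")
print(f"Tau from scores: {tau_scores:.3f}")
```

Both calls should print the same coefficient, since converting scores to ranks preserves the pairwise ordering that Kendall's Tau counts.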

Therefore, you should pass the rankings of features based on their MI scores to 