<a href="https://www.kaggle.com/code/lovishsaini25/airbnb-analysis-visualization-and-prediction?scriptVersionId=144007197" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# **Breakdown of the notebook**:

**1: Data Loading and Exploration**
- Load the dataset into a pandas DataFrame.
- Explore the dataset to understand its structure and content.
- Check for missing values and handle them if necessary.
- Clean and preprocess the data as needed.

**2: Data Visualization and Exploration**
- Perform data visualization to gain insights into the dataset.
- Create visualizations such as histograms, scatter plots, and box plots to understand the distribution of variables and relationships between them.
- Use libraries like Matplotlib and Seaborn for data visualization.

**3: Data Preprocessing**
- Select relevant features (columns) for the analysis.
- Encode categorical variables if needed (e.g., one-hot encoding).
- Split the data into training and test sets for model evaluation.

**4: Model Selection and Training**
- Choose machine learning models for the analysis (e.g., Decision Tree, Random Forest, XGBoost).
- Train the selected models on the training data.

**5: Model Evaluation**
- Evaluate model performance using appropriate metrics (e.g., RMSE, R2 score).
- Check for overfitting by comparing performance on training and test data.
- Tune hyperparameters to improve model performance.

**6: Interpretation and Insights**
- Interpret model results and evaluate the importance of features.
- Use insights from the analysis to make informed decisions or draw conclusions.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

df = pd.read_csv('/kaggle/input/us-airbnb-open-data/AB_US_2023.csv')
# data_2020 = pd.read_csv('/kaggle/input/us-airbnb-open-data/AB_US_2020.csv')

In [None]:
df.head(5)

In [None]:
df.shape

In [None]:
df.columns

<table style="border-collapse: collapse; width: 70%; border: 2px solid black;">
  <tr>
    <th style="border: 1px solid black; padding: 8px;">Column</th>
    <th style="border: 1px solid black; padding: 8px;">Description</th>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">id</td>
    <td style="border: 1px solid black; padding: 8px;">Airbnb's unique identifier for the listing</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">name</td>
    <td style="border: 1px solid black; padding: 8px;">Name of the listing</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">host_id</td>
    <td style="border: 1px solid black; padding: 8px;">Airbnb host's unique identifier</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">host_name</td>
    <td style="border: 1px solid black; padding: 8px;">Name of the host</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">neighbourhood_group</td>
    <td style="border: 1px solid black; padding: 8px;">The neighbourhood group as geocoded using the latitude and longitude against neighborhoods as defined by open or public digital shapefiles.</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">neighbourhood</td>
    <td style="border: 1px solid black; padding: 8px;">The neighbourhood as geocoded using the latitude and longitude against neighborhoods as defined by open or public digital shapefiles.</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">latitude</td>
    <td style="border: 1px solid black; padding: 8px;">Latitude in the World Geodetic System (WGS84) projection</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">longitude</td>
    <td style="border: 1px solid black; padding: 8px;">Longitude in the World Geodetic System (WGS84) projection</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">room_type</td>
    <td style="border: 1px solid black; padding: 8px;">Type of room (e.g., entire home, private room, shared room)</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">price</td>
    <td style="border: 1px solid black; padding: 8px;">Daily price in local currency (Note: $ sign may be used regardless of locale)</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">minimum_nights</td>
    <td style="border: 1px solid black; padding: 8px;">Minimum number of nights required for the listing (calendar rules may vary)</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">number_of_reviews</td>
    <td style="border: 1px solid black; padding: 8px;">Total number of reviews the listing has received</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">last_review</td>
    <td style="border: 1px solid black; padding: 8px;">Date of the last/newest review</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">calculated_host_listings_count</td>
    <td style="border: 1px solid black; padding: 8px;">Number of listings the host has in the current scrape, in the city/region geography</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">availability_365</td>
    <td style="border: 1px solid black; padding: 8px;">Availability of the listing x days in the future as determined by the calendar. Note: A listing may be unavailable because it has been booked by a guest or blocked by the host.</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">number_of_reviews_ltm</td>
    <td style="border: 1px solid black; padding: 8px;">Number of reviews the listing has received in the last 12 months</td>
  </tr>
</table>


In [None]:
df.info()

In [None]:
df.describe()

1. **Counts:**
   - The dataset contains 232,147 records. Most numeric columns report this full count, meaning they have no missing values; `reviews_per_month` is an exception (see Potential issues).

2. **Hosts and Listings:**
   - The range of `host_id` values is quite extensive, ranging from a minimum of 23 to a maximum of 506,938,400.
   - The dataset includes listings across various geographic locations, with `latitude` values ranging from approximately 25.96 to 47.73 and `longitude` values ranging from approximately -123.09 to -70.99.

3. **Pricing:**
   - The `price` column has a wide range of values, with a minimum of 0 and a maximum of 100,000. The mean price is approximately *USD* 259.47, but the standard deviation is quite high (approximately *USD* 1024.65), indicating a significant variability in listing prices.

4. **Minimum Nights and Availability:**
   - The `minimum_nights` column ranges from a minimum of 1 night to a maximum of 1,250 nights, with a mean of approximately 13.50 nights.
   - The `availability_365` column ranges from 0 to 365 days, indicating the availability of listings throughout the year.

5. **Reviews:**
   - The `number_of_reviews` column has a wide range, with a minimum of 0 and a maximum of 3,091 reviews. The mean number of reviews is approximately 40.92, but the standard deviation is relatively high (approximately 80.65), suggesting variations in the number of reviews.
   - The `reviews_per_month` column ranges from a minimum of 0.01 to a maximum of 101.42, with a mean of approximately 1.64. This column provides information about the average number of reviews per month for each listing.

6. **Host Listings:**
   - The `calculated_host_listings_count` column shows the number of listings each host has in the current scrape. The range is from 1 to 1,003 listings, with a mean of approximately 29.88 listings per host.

7. **Reviews Within the Last 12 Months:**
   - The `number_of_reviews_ltm` column represents the number of reviews each listing has received in the last 12 months. It ranges from 0 to 1,314 reviews, with a mean of approximately 11.69 reviews.

**Potential issues**

1. **Outliers:**
   - The presence of outliers in columns like `price`, `minimum_nights`, `number_of_reviews`, `reviews_per_month`, `calculated_host_listings_count`, and `number_of_reviews_ltm` is evident from the large standard deviations and the significant differences between the 75th percentile and the maximum values. Let's investigate and decide whether to remove these outliers.

2. **Zero Minimum Nights:**
   - A `minimum_nights` value of 0 would not make sense for a minimum stay requirement. We will check whether any such records exist and, if so, decide whether they are valid or need to be cleaned (a 0 might simply mean that no minimum is set).

3. **Missing Data:**
   - The column `reviews_per_month` has missing data, as indicated by the count being less than the total number of records.

4. **Geographic Outliers:**
   - The latitude and longitude columns may contain data points that could potentially be outliers.
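
The outlier concern above can be made concrete with the interquartile-range (IQR) rule. The following is a minimal sketch on made-up numbers, not the notebook's data; the helper name `iqr_outlier_share` and the multiplier `k=1.5` are assumptions chosen for illustration.

```python
import pandas as pd

def iqr_outlier_share(series: pd.Series, k: float = 1.5) -> float:
    """Return the fraction of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    outside = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return float(outside.mean())

# Hypothetical prices: nine ordinary listings and one extreme value
toy_prices = pd.Series([80, 95, 100, 110, 120, 130, 150, 160, 200, 100_000])
share = iqr_outlier_share(toy_prices)
print(f"Share of values flagged as outliers: {share:.0%}")
```

A large gap between the 75th percentile and the maximum, as seen in `price`, usually shows up as a small but non-zero share here.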

In [None]:
# Compute the sum of missing values for each column
missing_sum = df.isnull().sum()

# Compute the percentage of missing values for each column
missing_percentage = (df.isnull().sum() / len(df)) * 100

# Create a DataFrame to display the results
missing_info = pd.DataFrame({'Missing Values': missing_sum, 'Percentage Missing': missing_percentage})

# Sort the DataFrame by the number of missing values in descending order
missing_info.sort_values(by='Missing Values', ascending=False)

**Data Cleaning Strategy**
1. **Deletion:**
    - **name/host_name:** The number of missing values in `name` and `host_name` is very small, so removing those rows won't significantly impact our analysis. These columns contain text data (names), and handling missing text data can be more complex.
        - We could impute missing names with a placeholder value like "Unknown" or "Anonymous", but we prefer to remove these rows since they are a very small percentage of the total data and names are not critical for our analysis.
    - **neighbourhood_group:** `neighbourhood_group` is not a critical feature for our analysis and has a large number of missing values, so we may choose to remove this column from the dataset.

2. **Imputation:** Missing values can be imputed using various techniques depending on the data type. For numeric columns (e.g., price, minimum_nights), we can fill missing values with the mean, median, or a specific value that makes sense in the context. For categorical columns (e.g., room_type), we can use the mode (most frequent category).
    - **last_review and reviews_per_month:** We can impute missing values in these columns with a specific value or date that indicates missing data. Alternatively, we can use statistical methods to estimate missing values based on the distribution of the existing data.
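
The imputation options listed above can be sketched as follows. The small DataFrame `toy` is hypothetical, and filling `reviews_per_month` with 0 is an assumption (a listing without reviews has zero reviews per month), not necessarily the choice made in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset with gaps in a numeric, a categorical, and a review column
toy = pd.DataFrame({
    "price": [100.0, np.nan, 300.0, 200.0],
    "room_type": ["Private room", "Entire home/apt", None, "Entire home/apt"],
    "reviews_per_month": [1.5, np.nan, 0.2, np.nan],
})

filled = toy.copy()
# Numeric column: the median is robust to extreme prices
filled["price"] = filled["price"].fillna(filled["price"].median())
# Categorical column: the most frequent category (mode)
filled["room_type"] = filled["room_type"].fillna(filled["room_type"].mode().iloc[0])
# Review rate: a sentinel value that encodes "no reviews" (assumption)
filled["reviews_per_month"] = filled["reviews_per_month"].fillna(0)
print(filled)
```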

In [None]:
# Drop rows with null values in 'host_name' and 'name' columns
df.dropna(subset=['host_name', 'name'], inplace=True)

In [None]:
df.isnull().sum()

In [None]:
# This cell was time-consuming and compute-intensive, hence it is commented out. The code below is used as a replacement
# import folium

# # Create a map of listings
# m = folium.Map(location=[df['latitude'].mean(), df['longitude'].mean()], zoom_start=12)
# for index, row in df.iterrows():
#     if index % 25000 == 0:
#         print(index)
#     folium.Marker([row['latitude'], row['longitude']], tooltip=row['name']).add_to(m)
# m

In [None]:
# from folium.plugins import MarkerCluster

# m = folium.Map(location=[df['latitude'].mean(), df['longitude'].mean()], zoom_start=12)
# marker_cluster = MarkerCluster().add_to(m)

# for index, row in df.iterrows():
#     folium.Marker([row['latitude'], row['longitude']], tooltip=row['name']).add_to(marker_cluster)

# # Save the map (folium's Map.save writes an interactive HTML file, not a PNG image)
# m.save('airbnb_listings_map.html')

**Cluster Markers**: Folium provides marker clustering functionality, which groups nearby markers into clusters when the map is zoomed out. This can improve performance and make the map more user-friendly. You can use MarkerCluster from Folium to achieve this.

In [None]:
# We wanted to use folium to confirm that all data points lie within the USA, but it took too long to execute.
# Hence, we use a bounding-box check as an alternative.
# We can see that 100% of the data records lie within the USA bounding box.

def is_within_usa(latitude, longitude):
    # Approximate bounding box of the contiguous United States
    usa_bounding_box = {
        'min_lat': 24.396308,
        'max_lat': 49.384358,
        'min_lon': -125.000000,
        'max_lon': -66.934570
    }
    return (usa_bounding_box['min_lat'] <= latitude <= usa_bounding_box['max_lat'] and
            usa_bounding_box['min_lon'] <= longitude <= usa_bounding_box['max_lon'])

# Boolean mask: True for rows whose coordinates fall inside the bounding box
within_usa = df.apply(lambda row: is_within_usa(row['latitude'], row['longitude']), axis=1)
# The mean of a boolean mask is the share of True values (len() would always give 100%)
print(within_usa.mean() * 100)
df = df[within_usa]

In [None]:
from scipy import stats

# Identify price outliers using the Z-score (rows are only counted here, not removed)
z_scores = np.abs(stats.zscore(df['price']))
print(f'Percentage of data retained: {len(df[(z_scores < 3)])/len(df)* 100}%')

In [None]:
# Create a feature for the number of days since the last review
from datetime import datetime
df['last_review'] = pd.to_datetime(df['last_review'])
df['days_since_last_review'] = (datetime.now() - df['last_review']).dt.days
# df = df.drop(columns=['last_review'])
# Listings without any review have NaT here; mark them with -1
df['days_since_last_review'] = df['days_since_last_review'].fillna(-1).astype(int)

# Price itself isn't a good feature
df['price_per_night'] = df['price'] / df['minimum_nights']

# reviews_ratio might be useful too, to see how eagerly people review the listing
# (NaN when a listing has no reviews at all, since 0 / 0 is undefined)
df['reviews_ratio'] = df['number_of_reviews_ltm'] / df['number_of_reviews']

# Binning 'availability_365' into 4 bins (include_lowest=True keeps listings with 0 available days in 'low')
df['availability'] = pd.cut(df['availability_365'], bins=[0, 90, 180, 270, 365], labels=['low', 'medium', 'high', 'very high'], include_lowest=True)
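
A note on bin edges: by default `pd.cut` builds right-closed intervals such as `(0, 90]`, so a value equal to the lowest edge falls outside every bin unless `include_lowest=True` is passed. The sketch below illustrates this on a few hypothetical availability values (the series `days` is made up for the example).

```python
import pandas as pd

bins = [0, 90, 180, 270, 365]
labels = ["low", "medium", "high", "very high"]
days = pd.Series([0, 45, 90, 91, 365])  # hypothetical availability values

# Default: intervals are (0, 90], (90, 180], ... so 0 is not covered
default_cut = pd.cut(days, bins=bins, labels=labels)
# include_lowest=True closes the first interval on the left as well
inclusive_cut = pd.cut(days, bins=bins, labels=labels, include_lowest=True)

print(default_cut.tolist())
print(inclusive_cut.tolist())
```

Since `availability_365` ranges from 0 to 365, this matters for every listing with zero available days.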

In [None]:
# Create a feature representing the interaction between 'price' and 'reviews_ratio'
df['price_reviews_ratio'] = df['price'] * df['reviews_ratio']

In [None]:
import numpy as np
percentiles = [25, 50, 75, 90, 95]

# Add finer steps from 99.0 to 99.9, plus the maximum (100th percentile)
percentiles.extend([i/10 for i in range(990, 1000, 1)] + [100])

# Calculate the percentiles of the 'price_per_night' column
price_percentiles = np.percentile(df['price_per_night'], percentiles)

# Print the calculated percentiles
for percentile, value in zip(percentiles, price_percentiles):
    print(f'{percentile}th Percentile: ${value:.2f}')

# A price per night of about $7k seems too high, so I am dropping it

In [None]:
df = df[df['price_per_night'] < 6900]

In [None]:
percentiles = [25, 50, 75, 90, 95]  

percentiles.extend([i/10 for i in range(990, 1000, 1)] + [100])

# Calculate the percentiles of the 'price' column
price_percentiles = np.percentile(df['price'], percentiles)

# Print the calculated percentiles
for percentile, value in zip(percentiles, price_percentiles):
    print(f'{percentile}th Percentile: ${value:.2f}')

# Still the price seems to be at the higher end
df = df[df['price'] < 10000]

In [None]:
percentiles = [25, 50, 75, 90, 95]  

percentiles.extend([i/10 for i in range(990, 1000, 1)] + [100])

# Calculate the percentiles of the 'minimum_nights' column
nights_percentiles = np.percentile(df['minimum_nights'], percentiles)

# Print the calculated percentiles
for percentile, value in zip(percentiles, nights_percentiles):
    print(f'{percentile}th Percentile: {value:.2f}')

# It is strange to see listings that must be booked for at least an entire year.
# 180 days seems fine, but 365 days looks strange to me. However, we will keep these records:
# it is possible that a site is leased to a paying guest for an entire year.

In [None]:
# Set display options for pandas DataFrames
pd.set_option('display.max_columns', None)  # Display all columns
pd.set_option('display.max_colwidth', None)  # Display full column width (no truncation)

df[df['minimum_nights'] == 365].head(5)

# There is a substantial number of such records, hence it might be incorrect to remove them.
# Such data can even be useful to analyze

In [None]:
df[df['minimum_nights'] == 0].head(5)

# We flagged a possible minimum stay of zero nights earlier, but this filter returns no such rows.
# Hence, we don't have any faulty zero minimum nights

In [None]:
df[df['minimum_nights'] > 365].head(5)

# It is strange, but let's keep it as it is

We have addressed all the potential issues. Let's describe the data again.

In [None]:
df.describe()

# -1 in days_since_last_review is because the last review for those listings is NaT

In [None]:
print(df.columns)

df = df.drop(columns=['price_reviews_ratio', 'price_per_night'])

In [None]:
from sklearn.preprocessing import StandardScaler

numeric_columns = ['minimum_nights', 'number_of_reviews', 'reviews_per_month',
                   'calculated_host_listings_count',
                   'number_of_reviews_ltm', 'days_since_last_review', 'reviews_ratio']

# Create a StandardScaler instance
scaler = StandardScaler()

# Apply the standardization to the selected numeric columns
df[numeric_columns] = scaler.fit_transform(df[numeric_columns])



To predict the `price` column, we'll need to select features that are likely to have a significant impact on the price of a listing. Intuitively, we think the columns below can be helpful for predicting `price`:

1. **room_type**: The type of room (e.g., entire home, private room, shared room) can significantly affect the price.

2. **minimum_nights**: The minimum number of nights required for the listing can influence pricing.

3. **number_of_reviews**: The number of reviews can reflect the popularity of the listing and may correlate with price.

4. **reviews_per_month**: The rate of new reviews per month can indicate the ongoing appeal of the listing.

5. **calculated_host_listings_count**: The number of listings the host has can impact pricing.

6. **availability_365**: The availability of the listing throughout the year may affect price.

7. **number_of_reviews_ltm**: The number of reviews in the last 12 months can provide recent feedback on the listing.

8. **city**: The location (city) of the listing can have a significant impact on the price.

9. **neighbourhood** and **neighbourhood_group**: The neighborhood and neighborhood group can also be used for location-based analysis.
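
Categorical candidates such as `room_type` must be encoded before tree or boosting models in scikit-learn can use them. A minimal one-hot encoding sketch, on a hypothetical four-row frame (`toy` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical listings with one categorical and one numeric column
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Private room"],
    "minimum_nights": [2, 1, 1, 30],
})

# drop_first=True drops the first category, which is then implied when all dummies are 0
encoded = pd.get_dummies(toy, columns=["room_type"], drop_first=True)
print(encoded.columns.tolist())
```

High-cardinality columns such as `neighbourhood` would produce many dummy columns this way, so a different encoding may suit them better.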

In [None]:
# Calculate the correlation matrix
numeric_columns.append('price')
correlation_matrix = df[numeric_columns].corr()

# Select features with strong correlations to 'price'
strong_correlations = correlation_matrix['price'].abs() > 0.2

# Get the names of selected features
selected_features = correlation_matrix.index[strong_correlations].tolist()

# Print the selected features
print("Selected Features:")
print(selected_features)

#Get Correlation between different variables
plt.figure(figsize=(18,12))
sns.heatmap(correlation_matrix*100, annot=True)

In [None]:
# df = pd.get_dummies(df, columns=['room_type', 'neighbourhood_group'], drop_first=True)

In [None]:
# Interaction features - Feature engineering
df['reviews_times_availability'] = df['number_of_reviews'] * df['availability_365']
df['host_experience'] = df['calculated_host_listings_count'] * df['reviews_per_month']

# Listings with availability_365 == 0 would cause division by zero (inf), so treat 0 as missing in the ratios
availability_nonzero = df['availability_365'].replace(0, np.nan)
# Assumption: a site with more reviews is popular, and a site with high availability is less popular
df['popularity'] = df['number_of_reviews'] / availability_nonzero
df['booking_flexibility'] = df['minimum_nights'] / availability_nonzero

# Note: this index is derived from the target ('price'), so using it as a model feature would leak the target
neighborhood_avg_price = df.groupby('neighbourhood')['price'].transform('mean')
df['neighborhood_price_index'] = df['price'] / neighborhood_avg_price

# Log-transform skewed features to reduce the influence of large values
skewed_features = ['number_of_reviews', 'reviews_per_month']
for feature in skewed_features:
    df[feature] = np.log1p(df[feature])
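
To illustrate why `np.log1p` is a common choice for right-skewed counts, the sketch below compares skewness before and after the transform. It uses synthetic log-normal data (an assumption standing in for raw, non-negative review counts), not the notebook's columns.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed, non-negative values standing in for raw review counts
rng = np.random.default_rng(0)
counts = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5_000))

skew_before = counts.skew()
skew_after = np.log1p(counts).skew()  # log1p(x) = log(1 + x), defined at x = 0

print(f"Skewness before: {skew_before:.2f}")
print(f"Skewness after:  {skew_after:.2f}")
```

`log1p` is only defined for values greater than -1, so it is best applied to raw counts rather than to columns that may contain values at or below -1.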

In [None]:
numeric_columns.extend(['reviews_times_availability',
                        'host_experience',
                        'popularity',
                        'booking_flexibility',
                        'neighborhood_price_index']
                      )

# Calculate the correlation matrix
correlation_matrix = df[numeric_columns].corr()

# Select features with strong correlations to 'price'
strong_correlations = correlation_matrix['price'].abs() >= 0.025

# Get the names of selected features
selected_features = correlation_matrix.index[strong_correlations].tolist()

# Print the selected features
print("Selected Features:")
print(selected_features)

#Get Correlation between different variables
plt.figure(figsize=(18,12))
sns.heatmap(correlation_matrix*100, annot=True)

In [None]:
df[selected_features].describe()

In [None]:
# Removing outliers: keep only listings priced between the 25th and 75th percentiles (the middle 50%)
lower_bound = .25
upper_bound = .75
df = df[df['price'].between(df['price'].quantile(lower_bound), df['price'].quantile(upper_bound))]
df = df[df['number_of_reviews'] > 0]
df = df[df['calculated_host_listings_count'] < 10]
df = df[df['number_of_reviews'] < 200]
df = df[df['minimum_nights'] < 10]
df = df[df['reviews_per_month'] < 5]

In [None]:
df[selected_features].describe()

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.metrics import mean_squared_error

# Select the features and target variable
selected_features = [
    'minimum_nights',
    'number_of_reviews',
    'number_of_reviews_ltm',
    'days_since_last_review',
    'reviews_times_availability'
]
X = df[selected_features]
y = df['price']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Decision Tree Regressor
dt_model = DecisionTreeRegressor(random_state=42)
dt_param_grid = {
    'max_depth': [None, 5, 10, 20],  # Max depth of the tree
    'min_samples_split': [2, 5, 10],  # Minimum samples required to split a node
}
dt_grid_search = GridSearchCV(dt_model, dt_param_grid, cv=5, scoring='neg_mean_squared_error')
dt_grid_search.fit(X_train, y_train)
best_dt_model = dt_grid_search.best_estimator_
# Cross-validate on the training split only, so the held-out test set stays unseen
dt_mse = -cross_val_score(best_dt_model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
dt_rmse = np.sqrt(dt_mse.mean())

In [None]:
# # Random Forest Regressor
# rf_model = RandomForestRegressor(random_state=42)
# rf_param_grid = {
#     'n_estimators': [100, 200],  # Number of trees in the forest
#     'max_depth': [None, 5, 10]  # Max depth of individual trees
# }
# rf_grid_search = GridSearchCV(rf_model, rf_param_grid, cv=5, scoring='neg_mean_squared_error')
# rf_grid_search.fit(X_train, y_train)
# best_rf_model = rf_grid_search.best_estimator_
# rf_mse = -cross_val_score(best_rf_model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
# rf_rmse = np.sqrt(rf_mse.mean())

In [None]:
# # XGBoost Regressor
# xgb_model = xgb.XGBRegressor(random_state=42)
# xgb_param_grid = {
#     'learning_rate': [0.01, 0.1, 0.2],  # Step size shrinking
#     'max_depth': [3, 4, 5],  # Maximum depth of the tree
#     'n_estimators': [100, 200, 300],  # Number of boosting rounds
# }
# xgb_grid_search = GridSearchCV(xgb_model, xgb_param_grid, cv=5, scoring='neg_mean_squared_error')
# xgb_grid_search.fit(X_train, y_train)
# best_xgb_model = xgb_grid_search.best_estimator_
# xgb_mse = -cross_val_score(best_xgb_model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
# xgb_rmse = np.sqrt(xgb_mse.mean())

In [None]:
# # Print model performance
# print("Decision Tree RMSE:", dt_rmse)
# print("Random Forest RMSE:", rf_rmse)
# print("XGBoost RMSE:", xgb_rmse)

In [None]:
# Make predictions using the Decision Tree model
dt_predictions = best_dt_model.predict(X_test)

# Make predictions using the Random Forest model
# rf_predictions = best_rf_model.predict(X_test)

# Make predictions using the XGBoost model
# xgb_predictions = best_xgb_model.predict(X_test)

# Evaluate model performance using RMSE
from sklearn.metrics import mean_squared_error

dt_rmse = np.sqrt(mean_squared_error(y_test, dt_predictions))
# rf_rmse = np.sqrt(mean_squared_error(y_test, rf_predictions))
# xgb_rmse = np.sqrt(mean_squared_error(y_test, xgb_predictions))

# Print RMSE for each model
print("Decision Tree RMSE on Test Data:", dt_rmse)
# print("Random Forest RMSE on Test Data:", rf_rmse)
# print("XGBoost RMSE on Test Data:", xgb_rmse)


In [None]:
# Calculate RMSE on training data
dt_train_predictions = best_dt_model.predict(X_train)
# rf_train_predictions = best_rf_model.predict(X_train)
# xgb_train_predictions = best_xgb_model.predict(X_train)

dt_train_rmse = np.sqrt(mean_squared_error(y_train, dt_train_predictions))
# rf_train_rmse = np.sqrt(mean_squared_error(y_train, rf_train_predictions))
# xgb_train_rmse = np.sqrt(mean_squared_error(y_train, xgb_train_predictions))

# Calculate RMSE on test data
dt_test_predictions = best_dt_model.predict(X_test)
# rf_test_predictions = best_rf_model.predict(X_test)
# xgb_test_predictions = best_xgb_model.predict(X_test)

dt_test_rmse = np.sqrt(mean_squared_error(y_test, dt_test_predictions))
# rf_test_rmse = np.sqrt(mean_squared_error(y_test, rf_test_predictions))
# xgb_test_rmse = np.sqrt(mean_squared_error(y_test, xgb_test_predictions))

# Print RMSE for training and test data
print("Decision Tree RMSE on Training Data:", dt_train_rmse)
print("Decision Tree RMSE on Test Data:", dt_test_rmse)
# print("\nRandom Forest RMSE on Training Data:", rf_train_rmse)
# print("Random Forest RMSE on Test Data:", rf_test_rmse)
# print("\nXGBoost RMSE on Training Data:", xgb_train_rmse)
# print("XGBoost RMSE on Test Data:", xgb_test_rmse)

If RMSE on training data is significantly lower than RMSE on test data, it indicates that the model is overfitting. In this case, there is no such gap, so we are good.
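
The train-versus-test comparison can be illustrated on synthetic data. In this sketch an unconstrained decision tree memorizes the training set, while a depth-limited tree shows a much smaller gap. The data, the names `X_demo` and `y_demo`, and `max_depth=3` are assumptions for the example, not the notebook's actual setup.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: the target depends on the first feature plus noise
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(500, 3))
y_demo = 50 + 10 * X_demo[:, 0] + rng.normal(0, 15, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

def rmse_pair(model):
    """Fit the model and return (train RMSE, test RMSE)."""
    model.fit(X_tr, y_tr)
    train_rmse = np.sqrt(mean_squared_error(y_tr, model.predict(X_tr)))
    test_rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    return train_rmse, test_rmse

deep_tree = DecisionTreeRegressor(random_state=42)                 # no depth limit
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=42)

deep_train, deep_test = rmse_pair(deep_tree)
shallow_train, shallow_test = rmse_pair(shallow_tree)
shallow_r2 = r2_score(y_te, shallow_tree.predict(X_te))

print(f"Deep tree    train RMSE: {deep_train:.2f}  test RMSE: {deep_test:.2f}")
print(f"Shallow tree train RMSE: {shallow_train:.2f}  test RMSE: {shallow_test:.2f}")
print(f"Shallow tree test R2: {shallow_r2:.2f}")
```

A training RMSE near zero combined with a much larger test RMSE is the overfitting pattern; a small gap, as with the shallow tree, is what we hope to see.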