In [24]:

# Flight Price Prediction:
# We want to predict the price of a flight from features such as departure airport, destination, airline, number of stops, and flight duration. This is a regression task: features like "from_airport_code", "dest_airport_code", "stops", "airline_name", and "duration" are used to predict the "price" column.


In [45]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

data = pd.read_csv("../data/processedData/filled_na.csv")

In [46]:
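# Standardize the numeric 'duration' feature (zero mean, unit variance).
# Note: the scaler is fitted on the full dataset, before the train/test split,
# so test-set statistics leak into training; fitting on the training split only avoids this.
scaler = StandardScaler()
data['duration'] = scaler.fit_transform(data['duration'].values.reshape(-1, 1)).ravel()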

In [47]:
# X is the DataFrame without the price column
X = data.drop(['price'], axis=1)

# y is the price column
y = data['price']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [48]:
# Create a k-NN regression model
k = 12  
knn = KNeighborsRegressor(n_neighbors=k)

In [49]:
# Train the k-NN model
knn.fit(X_train, y_train)

In [50]:
# Make price predictions
y_pred = knn.predict(X_test)

In [51]:
from sklearn.metrics import mean_squared_error, r2_score
# Evaluate the model's performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)   
print("Mean Squared Error:", mse)
print("R-squared:", r2)

Mean Squared Error: 2641953.75396107
R-squared: 0.38759156633477154


## Model Evaluation

Results for the k-NN model:

- **Mean Squared Error (MSE):** ~2.64 million
- **R-squared (R²):** ~0.39

- **Mean Squared Error (MSE):** This is the average of the squared differences between my model's predictions and the actual values. Lower MSE values are better.
  
- *My model's MSE of around 2.64M is in squared price units. Its square root (the RMSE) is roughly 1,625 price units, so the typical prediction error is quite high.*

- **R-squared (R²):** R-squared measures how well my model explains the variance in the dependent variable (in this case, flight prices) based on the independent variables (features). It is at most 1, and it can drop below 0 on test data when the model does worse than always predicting the mean price. A higher R² is better.
  
- *My R² of around 0.39 means that my model explains only about 39% of the variance in the flight prices. This suggests that my model's features account for less than half of the price variation, and there's a lot of unexplained variance.*

In summary, my model's MSE is relatively high, indicating significant prediction errors, and the R² is fairly low, indicating that my features leave about 61% of the price variation unexplained. Trying different models or exploring different features is the best move in this case.
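
To make the two metrics concrete, here is a minimal sketch that computes them by hand. The toy prices and the helper names `mse` and `r_squared` are made up for illustration and are not part of the notebook; the last lines only take the square root of the k-NN MSE printed above.

```python
import math

def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """1 - SS_res / SS_tot; negative when worse than predicting the mean."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical toy prices, not taken from the flight dataset
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 320.0, 380.0]

toy_mse = mse(actual, predicted)
toy_rmse = math.sqrt(toy_mse)
toy_r2 = r_squared(actual, predicted)
print("Toy MSE:", toy_mse)
print("Toy RMSE:", toy_rmse)
print("Toy R-squared:", toy_r2)

# RMSE for the k-NN MSE printed above, in the same units as the price column
knn_rmse = math.sqrt(2641953.75396107)
print("k-NN RMSE:", round(knn_rmse))
```

Because the RMSE is in the same units as the price, it is usually easier to judge than the raw MSE.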


In [52]:
# Using Plotly, plot the actual vs. predicted price
import plotly.express as px
fig = px.scatter(x=y_test, y=y_pred, 
                 labels={'x':'Actual Price', 'y':'Predicted Price'},
                 trendline="ols"
)
fig.update_layout(title='Actual vs Predicted Price', xaxis_title="Actual Price", yaxis_title="Predicted Price")
fig.show()

## Decision Tree

In [53]:
# Create the decision tree regressor
dt_model = DecisionTreeRegressor(random_state=42)

# Train the model
dt_model.fit(X_train, y_train)

In [54]:
y_pred = dt_model.predict(X_test)

In [55]:
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

Mean Squared Error: 3495356.1605061926
R-squared: 0.18977174064900504


In [56]:
import plotly.express as px
fig = px.scatter(x=y_test, y=y_pred, 
                 labels={'x':'Actual Price', 'y':'Predicted Price'},
                 trendline="ols"
)
fig.update_layout(title='Actual vs Predicted Price', xaxis_title="Actual Price", yaxis_title="Predicted Price")
fig.show()

## Linear regression

In [57]:
lr_model = LinearRegression()

# Train the model
lr_model.fit(X_train, y_train)

# Make predictions
y_pred = lr_model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

Mean Squared Error: 3317022.855685577
R-squared: 0.23110964056939343


In [58]:
import plotly.express as px
fig = px.scatter(x=y_test, y=y_pred, 
                 labels={'x':'Actual Price', 'y':'Predicted Price'},
                 trendline="ols"
)
fig.update_layout(title='Actual vs Predicted Price', xaxis_title="Actual Price", yaxis_title="Predicted Price")
fig.show()

## Support Vector Regression

In [59]:
from sklearn.svm import SVR

categorical_col = ['from_airport_code', 'dest_airport_code', 'aircraft_type', 'airline_name', 'stops']
# Encode categorical columns with one-hot encoding. By default each dummy column
# is prefixed with its source column name, which keeps the new column names unique.
X_encoded = pd.get_dummies(X, columns=categorical_col)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)

batch_size = 1000
scaler = StandardScaler()

for i in range(0, len(X_train), batch_size):
    X_batch = X_train[i:i+batch_size]
    scaler.partial_fit(X_batch)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

svr_model = SVR()

# Train the model
svr_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = svr_model.predict(X_test_scaled)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

MemoryError: Unable to allocate 188. MiB for an array with shape (3944, 50000) and data type bool
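
The shape in the error message (3944 columns by 50000 rows of booleans) suggests that the dense one-hot matrix is what exhausts memory. One possible way around it, sketched below under assumptions, is to let scikit-learn's `OneHotEncoder` do the encoding, since it returns a sparse matrix by default, and to train the SVR on a subsample. The toy data, the subsample size, and the names `toy`, `toy_price`, and `svr_pipeline` are hypothetical; only the column names mirror the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVR

# Hypothetical toy data: column names mirror the notebook, values are made up
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "from_airport_code": rng.choice(["ALG", "CDG", "JFK"], size=n),
    "dest_airport_code": rng.choice(["LHR", "DXB", "SYD"], size=n),
    "airline_name": rng.choice(["A", "B", "C", "D"], size=n),
    "stops": rng.integers(0, 3, size=n),
    "duration": rng.uniform(60, 900, size=n),
})
toy_price = 100 + 2.0 * toy["duration"] + 150 * toy["stops"] + rng.normal(0, 20, size=n)

cat_cols = ["from_airport_code", "dest_airport_code", "airline_name", "stops"]
num_cols = ["duration"]

# OneHotEncoder produces a sparse matrix by default; ColumnTransformer keeps the
# combined output sparse when it is sparse enough (see its sparse_threshold parameter)
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ("num", StandardScaler(), num_cols),
])
svr_pipeline = Pipeline([("prep", preprocess), ("svr", SVR())])

# Train on a random subsample (size chosen arbitrarily for this sketch)
train_idx = rng.choice(n, size=150, replace=False)
test_idx = np.setdiff1d(np.arange(n), train_idx)

# The pipeline fits the encoder and scaler on the training rows only
svr_pipeline.fit(toy.iloc[train_idx], toy_price.iloc[train_idx])
toy_pred = svr_pipeline.predict(toy.iloc[test_idx])
print(toy_pred.shape)
```

Wrapping the preprocessing in a pipeline also means the scaler never sees the test rows, so no separate `partial_fit` loop is needed in this variant.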

In [43]:
import plotly.express as px

fig = px.scatter(x=y_test, y=y_pred,
                 labels={'x': 'Actual Price', 'y': 'Predicted Price'},
                 trendline="ols"
                 )
fig.update_layout(title='Actual vs Predicted Price', xaxis_title="Actual Price", yaxis_title="Predicted Price")
fig.show()

ValueError: All arguments should have the same length. The length of argument `y` is 25000, whereas the length of  previously-processed arguments ['x'] is 20000

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Create the Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Make predictions
y_pred = rf_model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

In [None]:
fig = px.scatter(x=y_test, y=y_pred,
                 labels={'x': 'Actual Price', 'y': 'Predicted Price'},
                 trendline="ols"
                 )
fig.update_layout(title='Actual vs Predicted Price', xaxis_title="Actual Price", yaxis_title="Predicted Price")
fig.show()

## Gradient Boosting

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

# Create the Gradient Boosting model
gb_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)

# Train the model
gb_model.fit(X_train, y_train)

# Make predictions
y_pred = gb_model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

In [None]:
fig = px.scatter(x=y_test, y=y_pred,
                 labels={'x': 'Actual Price', 'y': 'Predicted Price'},
                 trendline="ols"
                 )
fig.update_layout(title='Actual vs Predicted Price', xaxis_title="Actual Price", yaxis_title="Predicted Price")
fig.show()

## Neural Network

In [None]:
import keras
from keras import layers

model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    layers.Dense(32, activation='relu'),
    layers.Dense(1)  # Output layer for regression
])

model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(X_train_scaled, y_train, epochs=50, batch_size=32, validation_split=0.2, verbose=1)

In [None]:
y_pred = model.predict(X_test_scaled)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

In [None]:
y_pred_reshaped = y_pred.reshape(-1)
fig = px.scatter(x=y_test, y=y_pred_reshaped,
                 labels={'x': 'Actual Price', 'y': 'Predicted Price'},
                 trendline="ols"
                 )
fig.update_layout(title='Actual vs Predicted Price', xaxis_title="Actual Price", yaxis_title="Predicted Price")
fig.show()

## XGBoost

XGBoost is an efficient and scalable implementation of gradient boosting. It is widely used in competitions and can perform well in a variety of scenarios.

## Elastic Net Regression

Elastic Net is a linear regression model with both L1 and L2 regularization. It combines the strengths of Lasso and Ridge regression and can handle correlated features.
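
As a rough illustration of the Elastic Net idea, the sketch below fits scikit-learn's `ElasticNet` on synthetic data with two strongly correlated features. The data, the variable names, and the values `alpha=0.1` and `l1_ratio=0.5` are assumptions picked for the example, not tuned settings for the flight dataset.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): x2 is almost a copy of x1, x3 is irrelevant
rng = np.random.default_rng(42)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)
x3 = rng.normal(size=n)
X_toy = np.column_stack([x1, x2, x3])
y_toy = 3.0 * x1 + 3.0 * x2 + rng.normal(scale=0.5, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# alpha sets the overall penalty strength; l1_ratio mixes L1 (Lasso) and L2 (Ridge)
en_model = ElasticNet(alpha=0.1, l1_ratio=0.5)
en_model.fit(X_tr, y_tr)

en_pred = en_model.predict(X_te)
en_mse = mean_squared_error(y_te, en_pred)
en_r2 = r2_score(y_te, en_pred)

print("Coefficients:", en_model.coef_)
print(f"Mean Squared Error: {en_mse}")
print(f"R-squared: {en_r2}")
```

On the flight data, the same model could be dropped into the existing train, predict, and evaluate cells, ideally with `alpha` and `l1_ratio` chosen by cross-validation.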
