
# Flight Price Prediction
We want to predict the price of a flight from features such as departure airport, destination airport, airline, number of stops, and flight duration. This is a regression task: features like "from_airport_code", "dest_airport_code", "stops", "airline_name", and "duration" are used to predict the "price" column.
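
Categorical features such as airport codes and airline names generally need a numeric encoding before scikit-learn regressors can use them. The following is a minimal sketch of one-hot encoding with `pd.get_dummies`, assuming a small made-up frame: the rows, airline names, and prices are invented for illustration and are not taken from the dataset, and the processed CSV used in this notebook may already be encoded.

```python
import pandas as pd

# Hypothetical sample rows; every value here is invented for illustration.
sample = pd.DataFrame({
    "from_airport_code": ["ALG", "ALG", "CDG"],
    "dest_airport_code": ["CDG", "JFK", "JFK"],
    "airline_name": ["AirA", "AirB", "AirA"],
    "stops": [0, 1, 1],
    "duration": [135, 620, 510],
    "price": [210.0, 890.0, 640.0],
})

categorical_cols = ["from_airport_code", "dest_airport_code", "airline_name"]

# One indicator column per category value; numeric columns pass through unchanged.
encoded = pd.get_dummies(sample, columns=categorical_cols)

# Separate the features from the target, as the notebook does later.
X_sample = encoded.drop(columns=["price"])
y_sample = encoded["price"]
print(X_sample.columns.tolist())
```

Each categorical column is replaced by one column per distinct value, prefixed with the original column name, so the feature matrix grows with the number of distinct airports and airlines.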


In [33]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

data = pd.read_csv("../data/processed-flight-data.csv")

In [34]:
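# Standardize the numeric `duration` feature to zero mean and unit variance.
# Note: the scaler is fitted on the full dataset, before the train/test split,
# so test-set statistics influence the scaling.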
scaler = StandardScaler()
data['duration'] = scaler.fit_transform(data['duration'].values.reshape(-1, 1))
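
Fitting a scaler on the full dataset before splitting lets information from the test rows leak into the training features. One common way to avoid this is to put the scaler and the model into a scikit-learn pipeline, so the scaler is fitted on the training rows only. The sketch below assumes synthetic stand-in data (`X_demo`, `y_demo` are hypothetical names), not the flight dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): two numeric features and a noisy target.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[300.0, 1.0], scale=[120.0, 0.8], size=(200, 2))
y_demo = 2.0 * X_demo[:, 0] + 150.0 * X_demo[:, 1] + rng.normal(scale=50.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# The pipeline fits the scaler on the training rows only, then reuses
# those statistics when transforming the test rows at prediction time.
pipe = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
pipe.fit(X_tr, y_tr)

fitted_scaler = pipe.named_steps["standardscaler"]
print("Scaler mean (training rows only):", fitted_scaler.mean_)
print("Test R-squared:", pipe.score(X_te, y_te))
```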

In [35]:
# X is every column except the target
X = data.drop(['price'], axis=1)

# y is the target: the price column
y = data['price']

# Split the dataset into training and testing sets (default: 75% train, 25% test)
X_train, X_test, y_train, y_test = train_test_split(X, y)




In [36]:
model_results = []

In [37]:
# Create a k-NN regression model
k = 5
knn = KNeighborsRegressor(n_neighbors=k)
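
The value `k = 5` above is a fixed choice. As a hedged sketch of an alternative, `k` can be selected by cross-validation with `GridSearchCV`. The data below is synthetic, and the candidate list `candidate_k` is an arbitrary assumption, not something tuned for the flight dataset.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumption), not the flight dataset.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo[:, 0] * 100.0 + rng.normal(scale=20.0, size=300)

candidate_k = [1, 3, 5, 9, 15]  # arbitrary illustrative grid
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": candidate_k},
    scoring="neg_mean_squared_error",  # scikit-learn maximizes scores, hence the negation
    cv=5,
)
search.fit(X_demo, y_demo)

best_k = search.best_params_["n_neighbors"]
print("Best k:", best_k, "CV MSE:", -search.best_score_)
```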

In [38]:
# Train the k-NN model
knn.fit(X_train, y_train)

In [39]:
# Make price predictions
y_pred = knn.predict(X_test)

In [40]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Evaluate the model's performance
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R-squared:", r2)

model_results.append(pd.DataFrame({ 'ModelName':'KNeighbour','Mean Squared Error': [mse], 'R-squared': [r2], 'Mean Absolute Error': [mae]}))

Mean Squared Error: 3141537.1785050505
R-squared: 0.19760120412515114


## Model Evaluation

Results for the k-NN model:

- **Mean Squared Error (MSE):** ~3.1 million
- **R-squared (R²):** ~0.20


- **Mean Squared Error (MSE):** This is the average of the squared differences between my model's predictions and the actual values. Lower MSE values are better.
  
- *My model's MSE of around 3.1M is quite high. Its square root (the RMSE) is roughly 1,770, which puts the typical prediction error on the order of 1,770 price units. A lower MSE would have indicated a better model.*

- **R-squared (R²):** This measures the proportion of the variance in the dependent variable (in this case, flight prices) that is explained by the independent variables (features). A perfect model scores 1 and a model that always predicts the mean price scores 0. On held-out data R² can be negative when the model does worse than that baseline. A higher R² is better.
  
- *My R² of around 0.2 means that my model explains only about 20% of the variance in the flight prices. This suggests that my model's features do not account for much of the price variation, and there's a lot of unexplained variance.*

In summary, my model's MSE is relatively high, indicating significant prediction errors, and the R² is low, indicating that my features don't explain much of the price variation. Trying different models or exploring different features is a reasonable next step.
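
To make the definitions above concrete, here is a small worked sketch that computes MSE, RMSE, MAE, and R² directly from their formulas with NumPy. The four prices are made up for illustration and the `*_demo` names are hypothetical, chosen so they do not overwrite the notebook's own `mse` and `r2` variables.

```python
import numpy as np

# Made-up prices for illustration only (not taken from the dataset).
y_true_demo = np.array([200.0, 450.0, 900.0, 1500.0])
y_hat_demo = np.array([300.0, 400.0, 700.0, 1200.0])

errors = y_true_demo - y_hat_demo
mse_demo = np.mean(errors ** 2)        # average squared error
rmse_demo = np.sqrt(mse_demo)          # back in the same units as the price
mae_demo = np.mean(np.abs(errors))     # average absolute error

ss_res = np.sum(errors ** 2)                                 # residual sum of squares
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)     # total sum of squares
r2_demo = 1.0 - ss_res / ss_tot        # share of variance explained

print(f"MSE={mse_demo:.1f} RMSE={rmse_demo:.1f} MAE={mae_demo:.1f} R2={r2_demo:.3f}")
```

Because R² is defined as one minus the ratio of the residual to the total sum of squares, it drops below zero whenever the residual sum of squares exceeds the total sum of squares, that is, when the predictions are worse than always guessing the mean.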


In [41]:
# Using Plotly, plot the actual vs. predicted price
import plotly.express as px
fig = px.scatter(x=y_test, y=y_pred, 
                 labels={'x':'Actual Price', 'y':'Predicted Price'},
                 trendline="ols"
)
fig.update_layout(title='Actual vs Predicted Price', xaxis_title="Actual Price", yaxis_title="Predicted Price")
fig.show()

## Decision Tree

In [42]:
# Create the decision tree regressor
dt_model = DecisionTreeRegressor(random_state=42)

# Train the model
dt_model.fit(X_train, y_train)

In [43]:
y_pred = dt_model.predict(X_test)

In [44]:
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')
model_results.append(pd.DataFrame({ 'ModelName':'Decision Tree','Mean Squared Error': [mse], 'R-squared': [r2], 'Mean Absolute Error': [mae]}))

Mean Squared Error: 4469936.5703703705
R-squared: -0.14169322783851657
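
A negative R² on the test set means the tree predicts worse than always guessing the mean price. One common cause is that an unconstrained tree memorizes the training data. Whether that is what happens here is an assumption, since the training score was not printed. The sketch below uses synthetic data to compare training and test R² for a fully grown tree and a depth-limited one; the depth of 3 is an arbitrary illustrative value.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): a weak signal buried in noise.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0.0, 10.0, size=(400, 3))
y_demo = 50.0 * X_demo[:, 0] + rng.normal(scale=300.0, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

scores = {}
for depth in (None, 3):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_tr, y_tr)
    # Store (train R2, test R2) for each setting.
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train R2={scores[depth][0]:.3f}, test R2={scores[depth][1]:.3f}")
```

A large gap between the training and test scores of the unconstrained tree is the usual sign of overfitting. Printing `dt_model.score(X_train, y_train)` in the notebook would show whether the same gap appears on the flight data.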


In [45]:
import plotly.express as px
fig = px.scatter(x=y_test, y=y_pred, 
                 labels={'x':'Actual Price', 'y':'Predicted Price'},
                 trendline="ols"
)
fig.update_layout(title='Actual vs Predicted Price', xaxis_title="Actual Price", yaxis_title="Predicted Price")
fig.show()

## Linear Regression

In [46]:
lr_model = LinearRegression()

# Train the model
lr_model.fit(X_train, y_train)

# Make predictions
y_pred = lr_model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')
model_results.append(pd.DataFrame({ 'ModelName':'LinearRegression','Mean Squared Error': [mse], 'R-squared': [r2], 'Mean Absolute Error': [mae]}))

Mean Squared Error: 3072988.85148108
R-squared: 0.2151095422214213


In [47]:
import plotly.express as px
fig = px.scatter(x=y_test, y=y_pred, 
                 labels={'x':'Actual Price', 'y':'Predicted Price'},
                 trendline="ols"
)
fig.update_layout(title='Actual vs Predicted Price', xaxis_title="Actual Price", yaxis_title="Predicted Price")
fig.show()

## Support Vector Regression

In [57]:
from sklearn.svm import SVR

# Optional: one-hot encode the categorical columns before fitting.
# categorical_col = ['from_airport_code', 'dest_airport_code', 'aircraft_type', 'airline_name', 'stops']
# X_encoded = pd.get_dummies(X, columns=categorical_col)

# Split the data into training and testing sets
# X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)

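# Fit the scaler incrementally, one batch of rows at a time, to limit memory use.
# Only the training rows are used, so no test-set statistics leak into the scaling.
batch_size = 1000
scaler = StandardScaler()

for i in range(0, len(X_train), batch_size):
    X_batch = X_train[i:i+batch_size]
    scaler.partial_fit(X_batch)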

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

svr_model = SVR()

# Train the model
svr_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = svr_model.predict(X_test_scaled)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

In [49]:
import plotly.express as px

fig = px.scatter(x=y_test, y=y_pred,
                 labels={'x': 'Actual Price', 'y': 'Predicted Price'},
                 trendline="ols"
                 )
fig.update_layout(title='Actual vs Predicted Price', xaxis_title="Actual Price", yaxis_title="Predicted Price")
fig.show()

## Random Forest

In [50]:
from sklearn.ensemble import RandomForestRegressor

# Create the Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Make predictions
y_pred = rf_model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

Mean Squared Error: 3248081.143057665
R-squared: 0.2512593860308643


In [51]:
fig = px.scatter(x=y_test, y=y_pred,
                 labels={'x': 'Actual Price', 'y': 'Predicted Price'},
                 trendline="ols"
                 )
fig.update_layout(title='Actual vs Predicted Price', xaxis_title="Actual Price", yaxis_title="Predicted Price")
fig.show()

## Gradient Boosting

In [52]:
from sklearn.ensemble import GradientBoostingRegressor

# Create the Gradient Boosting model
gb_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)

# Train the model
gb_model.fit(X_train, y_train)

# Make predictions
y_pred = gb_model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

Mean Squared Error: 3295261.6009214376
R-squared: 0.24038344930626288


In [53]:
fig = px.scatter(x=y_test, y=y_pred,
                 labels={'x': 'Actual Price', 'y': 'Predicted Price'},
                 trendline="ols"
                 )
fig.update_layout(title='Actual vs Predicted Price', xaxis_title="Actual Price", yaxis_title="Predicted Price")
fig.show()

## Neural Network

In [54]:
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    layers.Dense(32, activation='relu'),
    layers.Dense(1)  # Output layer for regression
])

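model.compile(optimizer='adam', loss='mean_squared_error')

# Train on the standardized features computed in the Support Vector Regression cell;
# 20% of the training rows are held out for validation during training.
model.fit(X_train_scaled, y_train, epochs=50, batch_size=32, validation_split=0.2, verbose=1)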

Epoch 1/50
...
Epoch 50/50


<keras.src.callbacks.History at 0x23280c1aed0>

In [55]:
y_pred = model.predict(X_test_scaled)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

Mean Squared Error: 7282722.0486700395
R-squared: -0.6787972768914781


In [56]:
y_pred_reshaped = y_pred.reshape(-1)
fig = px.scatter(x=y_test, y=y_pred_reshaped,
                 labels={'x': 'Actual Price', 'y': 'Predicted Price'},
                 trendline="ols"
                 )
fig.update_layout(title='Actual vs Predicted Price', xaxis_title="Actual Price", yaxis_title="Predicted Price")
fig.show()