In [438]:

# Flight Price Prediction
# Goal: predict the price of a flight from features such as departure airport, destination airport, number of stops, airline, and flight duration. This is a regression task: the columns "from_airport_code", "dest_airport_code", "stops", "airline_name", and "duration" serve as features, and "price" is the target.


In [439]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler


data = pd.read_csv("../data/PreProcessedFlightData.csv")


In [440]:
# Preprocess categorical features using one-hot encoding
categorical_columns = ['from_airport_code', 'from_country', 'dest_airport_code', 'dest_country', 'aircraft_type', 'airline_number', 'airline_name']
df = pd.get_dummies(data, columns=categorical_columns)
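# Additionally label-encode airline_name into a single integer column.
# Note: airline_name is already one-hot encoded above, so it ends up encoded twice,
# and the integer codes impose an arbitrary ordering that k-NN treats as a distance.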
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
df['airline_name'] = labelencoder.fit_transform(data['airline_name'])


In [441]:
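# Standardize the duration feature (zero mean, unit variance).
# Note: the scaler is fitted on the full dataset, and the other columns are left unscaled.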
scaler = StandardScaler()
df['duration'] = scaler.fit_transform(df['duration'].values.reshape(-1, 1))


In [442]:


# X is df without the price column (this keeps the "Unnamed: 0" index column as a feature)
X = df.drop(['price'], axis=1)

# y is the price column
y = df['price']
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y)
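
Because `train_test_split` shuffles the rows randomly, each run of the notebook produces a different split and therefore different scores. The following is a minimal sketch, on synthetic data standing in for the flight table, of making the split repeatable with `random_state`. The names `X_demo` and `y_demo`, the seed 42, and the explicit `test_size=0.25` are illustrative assumptions, not values taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and price column (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# Fixing random_state makes the shuffle, and hence the split, repeatable
split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# Both calls return the same training rows
print(np.array_equal(split_a[0], split_b[0]))
```

With a fixed seed, the metrics printed later in the notebook would stay the same from one run to the next, which makes written results easier to keep in sync with the output.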

In [443]:
df

Unnamed: 0,duration,stops,price,from_airport_code_ADD,from_airport_code_AEP,from_airport_code_ALG,from_airport_code_ATH,from_airport_code_BOG,from_airport_code_BOM,from_airport_code_BRU,...,airline_name_[XiamenAir| EVA Air],airline_name_[XiamenAir| Hainan| Air Europa],airline_name_[XiamenAir| KLM],airline_name_[XiamenAir| Malaysia Airlines| Singapore Airlines],airline_name_[XiamenAir| Shenzhen],airline_name_[XiamenAir| Singapore Airlines],airline_name_[XiamenAir| Virgin Australia],airline_name_[flydubai| Emirates],airline_name_[jetSMART],airline_name
0,0.394351,1,347.0,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,239
1,0.151317,2,1838.0,False,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,197
2,-0.946020,1,366.0,False,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,268
3,0.526915,2,2940.0,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,386
4,-1.557288,1,519.0,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,815
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5339,1.219194,1,3573.0,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,1175
5340,2.036673,2,1888.0,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,600
5341,-0.607245,1,3892.0,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,0
5342,-1.078584,2,334.0,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,22


In [444]:
# Create a k-NN regression model
k = 5  # Number of neighbors averaged for each prediction
knn = KNeighborsRegressor(n_neighbors=k)
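
The value k = 5 is picked by hand here. One common alternative is to choose k by cross-validation. The sketch below does this with `GridSearchCV` on synthetic data; the candidate list, the 5-fold setting, and the names `X_demo`, `y_demo`, and `candidate_ks` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data (assumption): the target depends on the first feature plus noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=1.0, size=200)

# Candidate neighbor counts to compare (arbitrary choice)
candidate_ks = [1, 3, 5, 9, 15]

# 5-fold cross-validation, scored by negative MSE (higher is better)
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": candidate_ks},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_demo, y_demo)

print("Best k:", search.best_params_["n_neighbors"])
```

On the real data, the search would be fitted on `X_train` and `y_train` only, so that the test set stays untouched until the final evaluation.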

In [445]:

# Train the k-NN model
knn.fit(X_train, y_train)

In [446]:
# Make price predictions
y_pred = knn.predict(X_test)

In [447]:
from sklearn.metrics import mean_squared_error, r2_score
# Evaluate the model's performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R-squared:", r2)

Mean Squared Error: 2388214.7862874246
R-squared: 0.24410596707136878
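
MSE is expressed in squared price units, which makes it hard to read directly. Taking the square root gives the root mean squared error (RMSE), which is in the same units as the price itself. A minimal sketch, using the MSE value printed above:

```python
import math

# MSE as printed by the evaluation cell above
mse = 2388214.7862874246

# RMSE is the square root of MSE, in the same units as price
rmse = math.sqrt(mse)
print(f"RMSE: {rmse:.2f}")
```

The result is roughly 1,545 price units, a typical size for a prediction error in this run.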


## Model Evaluation

Results for the model, taken from the output above (they change between runs because the train/test split is not seeded):

- **Mean Squared Error (MSE):** 2,388,214.79
- **R-squared (R²):** 0.2441


- **Mean Squared Error (MSE):** This is the average of the squared differences between my model's predictions and the actual prices. Lower values are better.

- *My MSE of 2,388,214.79 is in squared price units. It corresponds to a root mean squared error of about 1,545 price units, which is large given that several of the sample fares shown above are under 600.*

- **R-squared (R²):** R-squared measures the share of the variance in the dependent variable (here, flight prices) that my model explains from the independent variables (features). A value of 1 is a perfect fit, 0 is no better than always predicting the mean price, and on test data the value can even be negative.

- *My R² of 0.2441 means that my model explains only about 24.41% of the variance in flight prices. Most of the price variation is left unexplained by the model as configured.*

In summary, my model's MSE is high, indicating large prediction errors, and its R² is low, indicating that the model as built captures little of the price variation. Trying different models or exploring different features is the sensible next step.
