# Airline Analysis

In this project, you'll imagine that you work for a travel agency and need to know the ins and outs of airline prices for your clients. You want to make sure that you can find the best deal for your client and help them to understand how airline prices change based on different factors.

You decide to look into your favorite airline. The data include:
- `miles`: miles traveled through the flight
- `passengers`: number of passengers on the flight
- `delay`: take-off delay in minutes
- `inflight_meal`: is there a meal included in the flight?
- `inflight_entertainment`: are there free entertainment systems for each seat?
- `inflight_wifi`: is there complimentary wifi on the flight?
- `day_of_week`: day of the week of the flight
- `weekend`: did this flight take place on a weekend?
- `coach_price`: the average price paid for a coach ticket
- `firstclass_price`: the average price paid for first-class seats
- `hours`: how many hours the flight took
- `redeye`: was this flight a redeye (overnight)?

In this project, you'll explore a dataset for the first time and get to know each of these features. Keep in mind that there's no one right way to address each of these questions. The goal is simply to explore and get to know the data using whatever methods come to mind.

You will be working in this file. Note that there is the file **Airline Analysis_Solution.ipynb** that contains the solution code for this project. We highly recommend that you complete the project on your own without checking the solution, but feel free to take a look if you get stuck or if you want to compare answers when you're done.

In order to get the plots to appear correctly in the notebook, you'll need to show and then clear each plot before creating the next one using the following code:

```py
plt.show() # Show the plot
plt.clf() # Clear the plot
```

Clearing the plot will not erase it from view; it just creates a new space for the next graphic.

## Univariate Analysis

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels
import matplotlib.pyplot as plt
import math

# Read in Data
flight = pd.read_csv("flight.csv")
flight.head(5)

1. What do coach ticket prices look like? What are the high and low values? What would be considered the average? Does $500 seem like a good price for a coach ticket?

In [None]:
flight.describe()

In [None]:
sns.boxplot(x='coach_price', data=flight)

With the information above, we conclude that $500 is much higher than the average price, even higher than the third quartile.
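One way to back this reading with numbers is to compute the quartiles directly and check what share of tickets cost $500 or less. The sketch below uses a small made-up table in place of `flight.csv`; the prices and the helper name `share_at_or_below` are illustrative assumptions, not values or functions from the project.

```py
import pandas as pd

# Toy stand-in for the real data; these prices are invented for illustration.
toy = pd.DataFrame(
    {"coach_price": [250, 300, 350, 380, 400, 420, 450, 500, 520, 600]}
)


def share_at_or_below(prices: pd.Series, price: float) -> float:
    """Return the fraction of tickets that cost `price` or less."""
    return float((prices <= price).mean())


# The quartiles are the values a boxplot draws as the box edges and median line.
quartiles = toy["coach_price"].quantile([0.25, 0.5, 0.75])
print(quartiles)
print(share_at_or_below(toy["coach_price"], 500))
```

On the real data, the same two calls would be applied to `flight["coach_price"]`.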

2. Now visualize the coach ticket prices for flights that are 8 hours long. What are the high, low, and average prices for 8-hour-long flights? Does a $500 ticket seem more reasonable than before?

In [None]:
flight_8h = flight[flight.hours==8]

sns.boxplot(x="coach_price", data=flight_8h)


Now we see that $500 is much closer to the third quartile, although still high.

3. How are flight delay times distributed? Let's say there is a short amount of time between two connecting flights, and a flight delay would put the client at risk of missing their connecting flight. You want to better understand how often there are large delays so you can correctly set up connecting flights. What kinds of delays are typical?

In [None]:
sns.boxplot(x="delay", data=flight)

In [None]:
# Split at 1000 minutes; >= keeps flights delayed exactly 1000 minutes in the high group
flight_hidelay = flight[flight.delay >= 1000]
flight_lodelay = flight[flight.delay < 1000]


plt.figure(figsize=(20,8))

plt.subplot(131)
sns.boxplot(x="delay", data=flight_hidelay)
plt.title("High-delayed flights")

plt.subplot(132)
sns.boxplot(x="delay", data=flight_lodelay)
plt.title("Low-delayed flights")

plt.subplot(133)
sns.boxplot(x="delay", data=flight)
plt.title("All flights")

plt.show()
plt.clf()

In [None]:
# Split at 1000 minutes; >= keeps flights delayed exactly 1000 minutes in the high group
flight_hidelay = flight[flight.delay >= 1000]
flight_lodelay = flight[flight.delay < 1000]

plt.figure(figsize=(20,8))

plt.subplot(121)
plt.hist(flight_hidelay.delay, bins = 20)
plt.title("High-delayed flights")

plt.subplot(122)
plt.hist(flight_lodelay.delay, bins = 20)
plt.title("Low-delayed flights")

# plt.subplot(133)
# plt.hist(flight.delay, bins = 200)
# plt.title("All flights")

plt.show()
plt.clf()

Comparing the two histograms, we conclude that the most common delay is about 10 minutes. Another way to see this is with `value_counts`:

In [None]:
# value_counts() already sorts by count in descending order
flight.delay.value_counts().head(30)

From this we conclude that the most common delays are between 8 and 12 minutes.
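To put a number on how often large delays occur, one option is to compute the share of flights above a chosen threshold alongside the most frequent value. The sketch below runs on invented delay values, and the 60-minute threshold is an arbitrary assumption chosen for illustration.

```py
import pandas as pd

# Invented delays in minutes; the real values come from flight.csv.
delays = pd.Series([8, 9, 10, 10, 10, 11, 12, 15, 1450, 1500], name="delay")

# Most frequent delay value(s); mode() returns a Series because ties are possible.
most_common = delays.mode()

# Share of flights delayed by more than the threshold (60 minutes is an assumed cutoff).
threshold = 60
share_large = float((delays > threshold).mean())

print(most_common.tolist(), share_large)
```

When planning connections, the threshold would be set to the layover time the client can tolerate.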

## Bivariate Analysis

4. Create a visualization that shows the relationship between coach and first-class prices. What is the relationship between these two prices? Do flights with higher coach prices always have higher first-class prices as well?

In [None]:

sns.scatterplot(x="coach_price", y="firstclass_price", data=flight, alpha=.4)



Although higher coach prices tend to go with higher first-class prices, the relationship is not strict: there are flights with the same first-class price but different coach prices.
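Both halves of that statement can be checked numerically: a correlation coefficient summarizes the overall tendency, and a group count exposes first-class prices that pair with more than one coach price. This is a minimal sketch on invented prices, not the project's data.

```py
import pandas as pd

# Invented prices for illustration only.
toy = pd.DataFrame({
    "coach_price": [300, 350, 400, 450, 500],
    "firstclass_price": [1300, 1500, 1400, 1600, 1600],
})

# Pearson correlation: a value close to 1 means the two prices tend to rise together.
r = toy["coach_price"].corr(toy["firstclass_price"])

# Number of distinct coach prices observed for each first-class price.
coach_values_per_fc = toy.groupby("firstclass_price")["coach_price"].nunique()

print(round(r, 2))
print(coach_values_per_fc[coach_values_per_fc > 1])
```

A positive correlation below 1, together with any group count above 1, matches the pattern described above.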

5. What is the relationship between coach prices and inflight features (inflight meal, inflight entertainment, and inflight WiFi)? Which features are associated with the highest increase in price?

In [None]:
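# Columns are selected by position, so this slice depends on the column order in flight.csv
features = flight.columns.to_list()[3:-3]
for feature in features:
    # One boxplot of coach price per feature, split by the feature's categories
    sns.boxplot(x="coach_price", y=feature, data=flight)
    plt.show()
    plt.clf()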

Both inflight entertainment and inflight WiFi are associated with a large increase in coach price, and the two increases are relatively similar. Going beyond the question, the largest increase comes from whether or not the flight takes place on a weekend.
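The size of each increase can also be read off as a difference in medians rather than judged by eye from the boxplots. The sketch below assumes the feature columns hold the strings `"Yes"` and `"No"`, which is an assumption about the encoding; the prices are invented.

```py
import pandas as pd

# Invented rows; the "Yes"/"No" encoding is assumed for illustration.
toy = pd.DataFrame({
    "inflight_meal": ["No", "Yes", "No", "Yes"],
    "inflight_wifi": ["No", "No", "Yes", "Yes"],
    "coach_price": [300, 310, 380, 400],
})

# For each feature: median price with the feature minus median price without it.
increase = {}
for feature in ["inflight_meal", "inflight_wifi"]:
    medians = toy.groupby(feature)["coach_price"].median()
    increase[feature] = medians["Yes"] - medians["No"]

print(increase)
```

Medians are used here because they are less sensitive to outliers than means; either choice would be reasonable.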

6. How does the number of passengers change in relation to the length of flights?

In [None]:
flight.columns

In [None]:
sns.lineplot(x="hours", y="passengers", data=flight)

The number of passengers is more or less constant with respect to flight length. The deviation is indeed greater for longer flights, but it is limited to about ±1 passenger at most.
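By default, `sns.lineplot` draws the mean of `passengers` at each `hours` value with a confidence band around it, so the same quantities can be tabulated with a group-by. The following sketch uses invented passenger counts to show the idea.

```py
import pandas as pd

# Invented passenger counts for two flight lengths.
toy = pd.DataFrame({
    "hours": [1, 1, 1, 5, 5, 5],
    "passengers": [206, 207, 208, 204, 207, 210],
})

# Mean and sample standard deviation of passengers for each flight length.
summary = toy.groupby("hours")["passengers"].agg(["mean", "std"])
print(summary)
```

Note that the standard deviation describes the spread of individual flights, while the band in the line plot describes uncertainty about the mean, so the two numbers are related but not interchangeable.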

## Multivariate Analysis

7. Visualize the relationship between coach and first-class prices on weekends compared to weekdays.

In [None]:

sns.scatterplot(x="coach_price", y="firstclass_price", data=flight, alpha=.4, hue="weekend")


It is quite clear that weekend flights are more expensive.

8. How do coach prices differ for redeyes and non-redeyes on each day of the week?

In [None]:
sns.boxplot(x="day_of_week", y="coach_price", hue="redeye", data=flight)


The variation between redeye and non-redeye prices seems to be the same on each day of the week, although prices on weekends are clearly higher.
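A pivot table gives a compact numeric view of the same comparison, with one row per day and one column per redeye status. This is a sketch on invented prices, and the `"Yes"`/`"No"` values for `redeye` are an assumed encoding.

```py
import pandas as pd

# Invented rows; day names and the "Yes"/"No" encoding are assumptions.
toy = pd.DataFrame({
    "day_of_week": ["Friday", "Friday", "Saturday", "Saturday"],
    "redeye": ["No", "Yes", "No", "Yes"],
    "coach_price": [380, 360, 450, 430],
})

# Rows are days, columns are redeye status, cells are median coach prices.
table = toy.pivot_table(
    index="day_of_week", columns="redeye", values="coach_price", aggfunc="median"
)

# Gap between non-redeye and redeye median prices on each day.
table["gap"] = table["No"] - table["Yes"]
print(table)
```

A roughly constant `gap` column alongside higher weekend medians would correspond to the reading of the boxplot given above.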