# IMPROVE CUSTOMER EXPERIENCE FOR EMIRATES AIRLINE TO INCREASE PROFITS



Customer experience is the sum of all the interactions a customer has with a business and its products or services; in short, it is the customer's perception of a company. In the airline industry, an airline has to stay on top of its game to win repeat customers. Emirates has an average rating of about 6/10, and that rating has barely changed since 2010. Improving the customer experience should raise it.

However, changing things would mean restructuring the business strategy. The aim of this project is to provide a model that Emirates can use to improve its customer experience without changing too much.


### Dataset 
The data was scraped from the Skytrax website (airlinequality.com) by quankiquanki: https://github.com/quankiquanki/skytrax-reviews-dataset. Skytrax is probably the best site for customer reviews in the airline industry; it collects reviews and ratings for airlines, lounges, seats and airports. Special thanks to reddit user jo698 for recommending this.

The dataset used here has a total of 691 observations (Emirates only) with 20 columns containing information about the reviewers and their reviews. The rating for each attribute ranges from 1 to 5, while the overall rating ranges from 1 to 10.

### Approach
* First, I build a model that predicts the overall rating from each feature's rating. To do that, I fit an OLS linear model, a Ridge linear model (for parameter regularization) and a random forest.
* Then, I select the best of these models and identify the features that matter most for the airline.
* Finally, using these features with the best model, I create a table that serves as a relative reference when planning targets for attribute ratings. With this table, the selected model and a cost-benefit analysis, Emirates can build a strategy for where to invest its resources to improve customer satisfaction and profits.


### Result
* Random forest performs best, with the lowest MSE of 1.64 on the test set. The most important factors that may affect the overall rating are, in order, value-for-money rating, cabin staff, seat comfort, food and beverages, and inflight entertainment. All of these are positively correlated with the overall rating.

### Recommendations 
* Based on its corporate strategy and a cost-benefit analysis, Emirates may use the reference table and the random forest model to plan improvement targets for selected services.
* While pricing is closely tied to business strategy and positioning, Emirates might find it easier to improve its cabin staff and food and beverages, which still have a large effect on the overall rating.

### Improvements I can make
Run topic and sentiment analysis on travellers' reviews of Emirates to see which topics are discussed most, and raise alerts on negative reviews so that action can be taken in time.


## Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import math
import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
import statsmodels.api as sm

In [None]:
df=pd.read_csv('airline.csv')

In [None]:
df_emi=df[df.airline_name=='emirates'].copy() #copy so later column assignments do not act on a view of df
df_emi.head(2)

In [None]:
df_emi.describe()

In [None]:
df_emi.info()

* These descriptive tables show that the data has missing values as well as out-of-range values: in some features the rating given is 0, which is invalid because ratings run from 1 to 5. Those zeros are therefore treated as missing values too.
* The overall rating, recommended and value_money rating columns are nearly complete, while the ratings for ground service and wifi connectivity are mostly missing. A likely explanation is that Emirates does not offer these services on certain routes.
* Data points without an overall rating will be removed, because the overall rating is the quantity the models predict.
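The check described above can be sketched as follows. This is a minimal illustration under stated assumptions: the small `sample` frame is made-up data, and only its column names are taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names mirror the real dataset.
sample = pd.DataFrame({
    'overall_rating': [8.0, 6.0, np.nan, 9.0],
    'cabin_staff_rating': [5.0, 0.0, 3.0, np.nan],
    'wifi_connectivity_rating': [np.nan, np.nan, 0.0, np.nan],
})

# Count NaN entries and invalid zero ratings for every column.
quality = pd.DataFrame({
    'missing': sample.isnull().sum(),
    'zeros': (sample == 0).sum(),
})

# Share of rows in which the column carries no usable rating.
quality['share_unusable'] = (quality['missing'] + quality['zeros']) / len(sample)
print(quality)
```

Running the same two counts on `df_emi` would show which rating columns are too sparse to interpret on their own.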

## Cleaning

In the data cleaning step, I do the following:
* Only choose data since 2010
* Drop data without overall rating
* Drop info that is not needed for modelling: airline name, link, content, aircraft type, route, title, author, author country, date and recommended
* Encode missing values as 0
* Get dummies for traveller type and cabin flown 

In [None]:
df_emi.date=pd.to_datetime(df_emi.date) #change date object into datetime format
df_emi=df_emi[df_emi.date>='2010-01-01'] #choose recent dates only
df_emi.shape

In [None]:
#drop data points without overall rating
df_clean=df_emi[df_emi['overall_rating'].notnull()].copy() #copy so the in-place drops below act on an independent frame

In [None]:
#drop some attributes that will not be used in modelling
df_clean.drop(['airline_name','link','title','author','author_country','date','content','aircraft','route','recommended'],axis=1, inplace=True)


In [None]:
df_clean.head()

In [None]:
#deal with missing values: encode them as 0
#fillna returns a new frame here; with inplace=True it returns None, which would overwrite df_clean
df_clean=df_clean.fillna({'ground_service_rating':0, 'wifi_connectivity_rating':0,'seat_comfort_rating':0,'cabin_staff_rating':0,'food_beverages_rating':0,'inflight_entertainment_rating':0, 'value_money_rating':0})


In [None]:
#one-hot encode cabin flown and traveller type as dummy columns
df_cabin_flown= pd.get_dummies(df_clean['cabin_flown'])
df_clean=pd.concat([df_clean, df_cabin_flown], axis=1)
df_clean.drop(['cabin_flown'], axis=1, inplace=True)

df_type_traveller= pd.get_dummies(df_clean['type_traveller'])
df_clean=pd.concat([df_clean, df_type_traveller], axis=1)
df_clean.drop(['type_traveller'], axis=1, inplace=True)

In [None]:
df_clean.isnull().values.any() #check if there are any missing values in dataframe

In [None]:
df_clean.head()

In [None]:
df_clean.shape

## Descriptive analysis

In [None]:
correlations=df_clean.corr()
names=df_clean.columns
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(len(names)) #one tick per column instead of a hard-coded 16
ax.set_xticks(ticks)
plt.xticks(rotation=90)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.show()

*The correlation matrix shows strong correlations between the cabin staff, food beverages and value money ratings. Hence, the OLS coefficient estimates may be unreliable because of multicollinearity.*
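One common way to quantify multicollinearity is the variance inflation factor (VIF), which equals the diagonal of the inverse correlation matrix. The sketch below is not part of the original analysis; `vif_table` is a hypothetical helper and the data is synthetic, built so that two features are strongly correlated.

```python
import numpy as np
import pandas as pd

def vif_table(features: pd.DataFrame) -> pd.Series:
    """Variance inflation factors: the diagonal of the inverse correlation matrix."""
    corr = features.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=features.columns, name='VIF').sort_values(ascending=False)

# Synthetic ratings (assumption): 'cabin_staff' and 'food' share a common component.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy = pd.DataFrame({
    'cabin_staff': base + 0.3 * rng.normal(size=500),
    'food': base + 0.3 * rng.normal(size=500),
    'wifi': rng.normal(size=500),
})

print(vif_table(toy))
```

A VIF close to 1 means a feature is nearly uncorrelated with the others, while larger values point to inflated coefficient variance. On `df_clean`, the full sets of dummy columns may make the correlation matrix singular, so one dummy per group would probably need to be dropped first.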

In [None]:
df_emi['year']=df_emi.date.dt.year
df_emi.groupby('year')[['overall_rating']].mean().plot()
plt.title('Overall rating over years')
plt.yticks(np.arange(0,10,1))
plt.legend().remove()
plt.show()

*The overall rating rose only slightly, from around 5.7 to 6.0 on a scale of 10, over the 2010-2015 period, which suggests there has not been much improvement in the rating for Emirates.*

In [None]:
df_emi.groupby('cabin_flown')[['overall_rating']].mean().plot(kind='bar')
plt.title('Overall rating over each cabin flown type')
plt.legend().remove()
plt.show()

In [None]:
df_emi.groupby('cabin_flown').size()

*The rating is similar for economy and business class (around 6.5), while first class passengers seem to enjoy the Emirates service more (nearly 8/10). There is only 1 data point for premium economy class, so we cannot infer much from it.*

In [None]:
df_emi.groupby('type_traveller')[['overall_rating']].mean().plot(kind='bar')
plt.title('Overall rating over traveller type')
plt.yticks(np.arange(0,10,1))
plt.legend().remove()
plt.show()

*Business and couple travellers rate Emirates services noticeably higher, while family and solo travellers seem rather dissatisfied (scores below 6).*

In [None]:
# Test Ho: the mean overall rating is equal for business/couple and family/solo travellers
# dropna() keeps len() equal to the number of ratings actually given
bi_cou=df_emi[(df_emi.type_traveller=='Business')|(df_emi.type_traveller=='Couple Leisure')].overall_rating.dropna()
fa_solo=df_emi[(df_emi.type_traveller=='FamilyLeisure')|(df_emi.type_traveller=='Solo Leisure')].overall_rating.dropna()
stats.ttest_ind_from_stats(bi_cou.mean(), bi_cou.std(),len(bi_cou),fa_solo.mean(), fa_solo.std(),len(fa_solo), equal_var=False)

*The p-value is below 0.01, so we can reject the null hypothesis. We can then segment further to see what accounts for the difference.*
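With `equal_var=False`, the test above is Welch's t-test. The sketch below spells out the statistic and its degrees of freedom on made-up ratings; `welch_t` is a hypothetical helper, and the two groups are not taken from the dataset.

```python
import numpy as np
from scipy import stats

def welch_t(a, b):
    """Welch's t statistic, Welch-Satterthwaite degrees of freedom and two-sided p-value."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    a, b = a[~np.isnan(a)], b[~np.isnan(b)]   # NaNs must not count as observations
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    p = 2 * stats.t.sf(abs(t), dof)
    return t, dof, p

# Made-up ratings for two traveller groups (assumption for illustration).
group_a = [8, 7, 9, 6, 8, np.nan, 7]
group_b = [5, 6, 4, 7, 5, 6]

t, dof, p = welch_t(group_a, group_b)
print(t, dof, p)
```

The NaN filter matters: if missing ratings were counted in the sample size, the standard error would be understated and the test would look more significant than it is.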

In [None]:
df_emi.groupby(['type_traveller']).size()

In [None]:
#df_emi.groupby('type_traveller')[['inflight_entertainment_rating','food_beverages_rating','cabin_staff_rating','seat_comfort_rating','ground_service_rating','wifi_connectivity_rating','value_money_rating']].mean().plot(kind='bar')
df_emi.groupby('type_traveller')[['cabin_staff_rating','value_money_rating']].mean().plot(kind='bar')
plt.title('Feature ratings over traveller type')
plt.legend()
plt.show()

In [None]:
# Test Ho: the mean cabin staff rating is equal for business/couple and family/solo travellers
# dropna() keeps len() equal to the number of ratings actually given
bi_cou=df_emi[(df_emi.type_traveller=='Business')|(df_emi.type_traveller=='Couple Leisure')].cabin_staff_rating.dropna()
fa_solo=df_emi[(df_emi.type_traveller=='FamilyLeisure')|(df_emi.type_traveller=='Solo Leisure')].cabin_staff_rating.dropna()
stats.ttest_ind_from_stats(bi_cou.mean(), bi_cou.std(),len(bi_cou),fa_solo.mean(), fa_solo.std(),len(fa_solo), equal_var=False)

*Cabin staff and value-for-money ratings are significantly lower for family and solo travellers than for business or couple travellers.*

In [None]:
df_emi.groupby('recommended')[['overall_rating']].mean().plot(kind='bar')
plt.title('Correlation between overall rating and being recommended')
plt.yticks(np.arange(0,10,1))
plt.legend().remove()
plt.show()

*This shows a strong positive association between the overall rating and being recommended. Hence, only one of the two (the overall rating) is kept for the analysis.*

In [None]:
df_emi.groupby('food_beverages_rating')[['overall_rating']].mean().plot(kind='bar')
plt.title('Correlation between overall rating and food beverage rating')
plt.yticks(np.arange(0,10,1))
plt.legend().remove()
plt.show()

*This relationship also makes sense: the higher the food and beverages score, the higher the overall score.*

In [None]:
df_emi.groupby('inflight_entertainment_rating')[['overall_rating']].mean().plot(kind='bar')
plt.title('Correlation between overall rating and inflight entertainment')
plt.yticks(np.arange(0,10,1))
plt.legend().remove()
plt.show()

## Modelling

To find the most important features that may affect the overall rating, the following steps are conducted:
* Models: OLS, Ridge regression, random forest
* Split the dataset into train/test sets (0.75/0.25), then run grid search with 5-fold cross-validation
* Metric: mean squared error
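The evaluation protocol listed above can be sketched on synthetic data as follows. The arrays `X_toy` and `y_toy` are assumptions standing in for the real ratings; only the scikit-learn calls mirror the ones used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in for the ratings data (assumption, not the real dataset).
rng = np.random.default_rng(1)
X_toy = rng.integers(1, 6, size=(200, 3)).astype(float)
y_toy = 1.0 + X_toy @ np.array([0.9, 0.5, 0.3]) + rng.normal(scale=0.5, size=200)

# 75/25 split, which is also the default of train_test_split.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=1)

# 5-fold cross-validation on the training part only.
cv = KFold(n_splits=5, shuffle=True, random_state=1)
neg_mse = cross_val_score(LinearRegression(), X_tr, y_tr, cv=cv,
                          scoring='neg_mean_squared_error')
cv_mse = -neg_mse.mean()  # scikit-learn negates MSE so that higher scores are better

# Final check on the held-out test set.
test_mse = mean_squared_error(y_te, LinearRegression().fit(X_tr, y_tr).predict(X_te))
print('CV MSE:', cv_mse, 'test MSE:', test_mse)
```

Keeping the test set out of the cross-validation loop means the test MSE remains an honest estimate after hyperparameters have been tuned.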

In [None]:
y=df_clean.overall_rating
X=df_clean.drop(['overall_rating'], axis=1)

In [None]:
from sklearn.model_selection import train_test_split
Xlr, Xtestlr, ylr, ytestlr = train_test_split(X,y,random_state=1)
from sklearn.metrics import mean_squared_error

### Linear regression 

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn import model_selection
from sklearn.model_selection import KFold #sklearn.cross_validation was removed in scikit-learn 0.20
ln=LinearRegression()
model = ln.fit(Xlr,ylr)

#5-fold cross-validation on the training set (random_state requires shuffle=True)
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
results = model_selection.cross_val_score(ln, Xlr, ylr, cv=kfold, scoring='neg_mean_squared_error')
print('cross-validated mean squared error on train set:',(results.mean())*(-1))
print('mean squared error on test set:', mean_squared_error(ytestlr,ln.predict(Xtestlr)))

In [None]:
predictions = model.predict(Xlr)
predictions

### Ridge regression 

In [None]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
model = Ridge()
model.fit(Xlr, ylr)
param_grid = {'alpha':[0.1,0.5,0.7,1.0,5.0,7.0, 10.0,12,15,20,25,70,80,100]} #alpha=0 is actually OLS regression
grid = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error') #5 folds, as stated above
grid.fit(Xlr, ylr)
best_params = grid.best_params_
model = grid.best_estimator_
score = grid.best_score_

print(str(best_params))
print('mean squared error on train set',abs(score))
print('mean squared error on test set:', mean_squared_error(ytestlr,model.predict(Xtestlr)))

*The best Ridge model is obtained with alpha=70.*

In [None]:
pd.DataFrame(list(zip(Xlr.columns, model.coef_)), columns =['Features', 'Estimated Coefficients']).sort_values('Estimated Coefficients', ascending=False)

*So the ratings for value money, cabin staff, food beverages, seat comfort and inflight entertainment have the most impact on the overall rating.* 
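To illustrate the remark that alpha=0 reduces Ridge to OLS, and how larger alpha values shrink the coefficients, here is a minimal closed-form sketch. `ridge_coefficients` is a hypothetical helper and the data is synthetic; it centres the data so that the intercept is left unpenalized.

```python
import numpy as np

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution on centred data: (X'X + alpha*I)^-1 X'y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n_features = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)

# Synthetic, deliberately correlated features (assumption for illustration).
rng = np.random.default_rng(2)
shared = rng.normal(size=300)
X_toy = np.column_stack([shared + 0.2 * rng.normal(size=300),
                         shared + 0.2 * rng.normal(size=300)])
y_toy = X_toy @ np.array([1.0, 1.0]) + rng.normal(size=300)

# alpha=0 gives the OLS solution; larger alpha pulls the coefficients towards zero.
for alpha in [0.0, 1.0, 70.0]:
    print(alpha, ridge_coefficients(X_toy, y_toy, alpha))
```

Because the penalty shrinks coefficients, the sizes of Ridge coefficients are best read as a ranking of features rather than as unbiased effect estimates.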

### Random forest

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
forest=RandomForestRegressor(random_state=1)
param_grid = {"max_features"   : ["sqrt","log2", None], #None uses all features; "auto" was removed in recent scikit-learn
            "n_estimators"     : [200, 500, 1000, 2000],
           "max_depth"         : [2, 10, 50, 100],
           "min_samples_split" : [2,  5, 10, 20, 50],
            "min_samples_leaf" : [1,5,10,20,50]}
#scikit-learn maximises scores, so MSE is requested in its negated form
grid_search = GridSearchCV(forest, param_grid, n_jobs=-1, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(Xlr, ylr)
print(grid_search.best_params_)
print('best mean squared error:',grid_search.best_score_*(-1))

In [None]:
#pass the best parameters by name instead of relying on the order of the dict values
forest1=RandomForestRegressor(random_state=1, **grid_search.best_params_).fit(Xlr,ylr)
importances = forest1.feature_importances_
names=list(Xlr.columns.values)
features = []
indices = np.argsort(importances)[::-1]

for f in range(len(importances)):
    print("%d. feature %d (%f), %s" % (f + 1, indices[f], importances[indices[f]], names[indices[f]]))
    features.append(indices[f])
    # Print only first 10 most important variables
    if len(features) >= 10:
        break
featurenames = [names[feature] for feature in features]

In [None]:
std = np.std([tree.feature_importances_ for tree in forest1.estimators_],axis=0)

# Plot the feature importances of the forest
print('Feature ranking:')
fig=plt.figure()
plt.title("Feature importances")
plt.bar(range(10), importances[indices[0:10]],
       color="r", align="center")
plt.xticks(range(len(features)), featurenames,rotation=60)
plt.show()
fig.savefig('feature important.png')

In [None]:
print('mean squared error on train set:', grid_search.best_score_*(-1))
print('mean squared error on test set:', mean_squared_error(ytestlr,forest1.predict(Xtestlr)))

#### Result analysis

*With the smallest MSE on the test set, we see that Random forest model performs the best among these 3 models. Hence, we will use Random forest for building our model.*

**Model in use**

*In the following, I group the ratings for each feature, recording 1-3 as low and 4-5 as high (missing values, encoded as 0, also fall into the low group). Then, I take the median of the predicted overall rating over each combination of these groups.*

In [None]:
data_predict=Xtestlr.copy()
data_predict['Value money']=(data_predict.value_money_rating<=3)*1
data_predict['Cabin']=(data_predict.cabin_staff_rating<=3)*1
data_predict['Seat comfort']=(data_predict.seat_comfort_rating<=3)*1
data_predict['Food beverages']=(data_predict.food_beverages_rating<=3)*1
data_predict['Inflight entertainment']=(data_predict.inflight_entertainment_rating<=3)*1
data_predict['Overall rating']=forest1.predict(Xtestlr)

In [None]:
table=pd.pivot_table(data=data_predict,columns=None, index=['Value money','Cabin','Seat comfort','Food beverages','Inflight entertainment'],values='Overall rating', aggfunc='median')
table1=pd.DataFrame(table)
table1.rename(index={0:'high',1:'low'}, inplace=True)
table1

**The above table can serve as a relative reference for setting targets in rating.** 

*One approach: after setting a goal for the overall rating, Emirates can use this table to see which attribute levels go with that goal. It can then plug different sets of specific values for each feature rating into the random forest model to get the expected overall rating. Finally, Emirates might use these results in a cost-benefit analysis to set goals for the ratings of these features.*
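The "plug in different sets of numbers" step could look roughly like this. It is a sketch under stated assumptions: the training data is synthetic, `toy_forest` stands in for the tuned `forest1`, and the three scenarios are hypothetical targets rather than recommendations derived from the data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic training data standing in for the cleaned ratings (assumption).
rng = np.random.default_rng(3)
cols = ['value_money_rating', 'cabin_staff_rating', 'food_beverages_rating']
X_toy = pd.DataFrame(rng.integers(1, 6, size=(300, 3)), columns=cols)
y_toy = (2 * X_toy.mean(axis=1) + rng.normal(scale=0.5, size=300)).clip(1, 10)

toy_forest = RandomForestRegressor(n_estimators=100, random_state=1).fit(X_toy, y_toy)

# Hypothetical target scenarios: each row is one set of attribute ratings.
scenarios = pd.DataFrame(
    [[3, 3, 3],   # status quo
     [3, 5, 3],   # invest in cabin staff only
     [3, 5, 5]],  # invest in cabin staff and food
    columns=cols,
    index=['baseline', 'better staff', 'better staff + food'],
)

# Expected overall rating for each scenario according to the fitted forest.
scenarios['expected_overall'] = toy_forest.predict(scenarios[cols])
print(scenarios)
```

With the real model, the scenario frame would need every column of `Xlr`, in the same order, including the dummy columns for cabin and traveller type.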
