# Ecommerce Customers Project #

------------------------------

This project is about an ecommerce company based in New York City that sells clothing online and also offers in-store style and clothing advice sessions. Customers come into the store, meet with a personal stylist, then go home and order the clothes they want through either the mobile app or the website.

The company is trying to decide whether to focus their efforts on their mobile app experience or their website.

## Imports ##

In [1]:
%matplotlib notebook

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Get the data ##

We'll work with the Ecommerce Customers csv file from the company. It has customer info, such as Email, Address, and their color Avatar. It also has numerical value columns:

* Avg. Session Length: Average length of in-store style advice sessions.
* Time on App: Average time spent on App in minutes
* Time on Website: Average time spent on Website in minutes
* Length of Membership: How many years the customer has been a member.

In [3]:
df = pd.read_csv("Ecommerce Customers")

**Checking the head of the customers DataFrame and its info().**

In [5]:
df.head()

| | Email | Address | Avatar | Avg. Session Length | Time on App | Time on Website | Length of Membership | Yearly Amount Spent |
|---|---|---|---|---|---|---|---|---|
| 0 | mstephenson@fernandez.com | 835 Frank Tunnel\nWrightmouth, MI 82180-9605 | Violet | 34.497268 | 12.655651 | 39.577668 | 4.082621 | 587.951054 |
| 1 | hduke@hotmail.com | 4547 Archer Common\nDiazchester, CA 06566-8576 | DarkGreen | 31.926272 | 11.109461 | 37.268959 | 2.664034 | 392.204933 |
| 2 | pallen@yahoo.com | 24645 Valerie Unions Suite 582\nCobbborough, D... | Bisque | 33.000915 | 11.330278 | 37.110597 | 4.104543 | 487.547505 |
| 3 | riverarebecca@gmail.com | 1414 David Throughway\nPort Jason, OH 22070-1220 | SaddleBrown | 34.305557 | 13.717514 | 36.721283 | 3.120179 | 581.852344 |
| 4 | mstephens@davidson-herman.com | 14023 Rodriguez Passage\nPort Jacobville, PR 3... | MediumAquaMarine | 33.330673 | 12.795189 | 37.536653 | 4.446308 | 599.406092 |


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Email                 500 non-null    object 
 1   Address               500 non-null    object 
 2   Avatar                500 non-null    object 
 3   Avg. Session Length   500 non-null    float64
 4   Time on App           500 non-null    float64
 5   Time on Website       500 non-null    float64
 6   Length of Membership  500 non-null    float64
 7   Yearly Amount Spent   500 non-null    float64
dtypes: float64(5), object(3)
memory usage: 31.4+ KB


## Exploratory Data Analysis ##

Let's explore the data!

**Using seaborn to create a jointplot to compare the Time on Website and Yearly Amount Spent columns.**

In [28]:
sns.jointplot(x='Time on Website', y='Yearly Amount Spent', data=df, color='blue')

<seaborn.axisgrid.JointGrid at 0x19bbfdd2400>

**Doing the same but with the Time on App column instead.**

In [29]:
sns.jointplot(x='Time on App', y='Yearly Amount Spent', data=df, color='red')

<seaborn.axisgrid.JointGrid at 0x19bc3221760>

**Using jointplot to create a 2D hex bin plot comparing Time on App and Length of Membership.**

In [30]:
sns.jointplot(x='Time on App', y='Length of Membership', data=df, kind='hex', color='blue')

<seaborn.axisgrid.JointGrid at 0x19bc34c9e20>

**Creating a linear model plot (using seaborn's lmplot) of Yearly Amount Spent vs. Length of Membership.**

In [31]:
sns.lmplot(x='Length of Membership', y='Yearly Amount Spent', data=df)

<seaborn.axisgrid.FacetGrid at 0x19bc42ac220>
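
The visual comparisons above can be backed up with a quick numeric check. The following is a minimal sketch that ranks features by their Pearson correlation with the target. It runs on a synthetic stand-in frame (the name `demo` and all values are made up for illustration, not taken from the company file); with the real data, `df` would take the place of `demo` after selecting its numeric columns.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the customer data (NOT the real file). Column names
# mirror the notebook; the values and the relationship below are assumptions.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "Time on App": rng.normal(12, 1, n),
    "Time on Website": rng.normal(37, 1, n),
    "Length of Membership": rng.normal(3.5, 1, n),
})
demo["Yearly Amount Spent"] = (
    40 * demo["Time on App"]
    + 60 * demo["Length of Membership"]
    + rng.normal(0, 10, n)
)

# Pearson correlation of each feature with the target, largest first.
target_corr = (
    demo.corr()["Yearly Amount Spent"]
    .drop("Yearly Amount Spent")
    .sort_values(ascending=False)
)
print(target_corr)
```

A correlation ranking like this only summarizes pairwise linear relationships, so it complements the plots rather than replacing them.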

## Training and Testing Data ##

**Now let's split the data into training and testing sets. Set a variable X equal to the numerical features of the customers and a variable y equal to the "Yearly Amount Spent" column.**

In [32]:
X = df[['Avg. Session Length', 'Time on App', 'Time on Website', 'Length of Membership']]
y = df['Yearly Amount Spent']

In [33]:
from sklearn.model_selection import train_test_split

In [34]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
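
To make the effect of `test_size=0.3` and `random_state=101` concrete, here is a small sketch on placeholder arrays with the same number of rows as the dataset (500). The arrays `X_demo` and `y_demo` are hypothetical and contain no real customer data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the notebook's X and y (500 rows, 4 features).
X_demo = np.arange(500 * 4).reshape(500, 4)
y_demo = np.arange(500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)
print(X_tr.shape, X_te.shape)  # 70% of rows for training, 30% for testing

# Re-running with the same random_state reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)
print(np.array_equal(X_te, X_te2))
```

Fixing `random_state` keeps the reported metrics reproducible between runs; a different seed would give a different split and slightly different scores.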

## Training the Model ##

Now it's time to train our model on our training data!

**Importing LinearRegression from sklearn.linear_model**

In [35]:
from sklearn.linear_model import LinearRegression

**Creating an instance of a LinearRegression() model named lm.**

In [36]:
lm = LinearRegression()

**Fitting lm on the training data.**

In [37]:
lm.fit(X_train, y_train)

LinearRegression()

**Printing out the coefficients of the model**

In [38]:
print(lm.coef_)

[25.98154972 38.59015875  0.19040528 61.27909654]
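
The coefficients are only part of the fitted model: a prediction is the dot product of the features with `coef_` plus the `intercept_`. The sketch below illustrates this on synthetic data (the arrays and the `true_coef` values are assumptions chosen for illustration, not the notebook's data).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression problem with four features (values are made up).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
true_coef = np.array([26.0, 38.6, 0.2, 61.3])
y_demo = X_demo @ true_coef + 100.0 + rng.normal(scale=1.0, size=200)

model = LinearRegression().fit(X_demo, y_demo)

# A linear model's prediction is X @ coef_ + intercept_.
manual = X_demo @ model.coef_ + model.intercept_
print(np.allclose(manual, model.predict(X_demo)))
```

In the notebook, `lm.intercept_` could be printed alongside `lm.coef_` in the same way.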


## Predicting Test Data ##

Now that we have fit our model, let's evaluate its performance by predicting off the test values!

**Using lm.predict() to predict off the X_test set of the data.**

In [39]:
prediction = lm.predict(X_test)

**Creating a scatterplot of the real test values versus the predicted values.**

In [40]:
# Pass x and y by keyword; recent seaborn releases no longer accept them positionally
sns.scatterplot(x=y_test, y=prediction, color='blue')
plt.xlabel("Y test (True Values)")
plt.ylabel("Predicted values")



Text(0, 0.5, 'Predicted values')

## Evaluating the Model ##

Let's evaluate our model's performance on the test set by calculating its prediction errors.

**Calculating the Mean Absolute Error, Mean Squared Error, and the Root Mean Squared Error.**

In [41]:
from sklearn import metrics

In [42]:
print('MAE:', metrics.mean_absolute_error(y_test, prediction))
print('MSE:', metrics.mean_squared_error(y_test, prediction))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, prediction)))

MAE: 7.228148653430832
MSE: 79.81305165097456
RMSE: 8.93381506697864
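
For reference, these metrics are simple functions of the residuals, and R^2 can be computed from the same quantities. The sketch below works through the formulas on a tiny hand-made example (the `y_true` and `y_pred` arrays are hypothetical) and checks them against `sklearn.metrics`.

```python
import numpy as np
from sklearn import metrics

# Tiny hypothetical example, not taken from the customer data.
y_true = np.array([500.0, 450.0, 610.0, 390.0, 540.0])
y_pred = np.array([492.0, 461.0, 600.0, 402.0, 535.0])
resid = y_true - y_pred

mae = np.mean(np.abs(resid))   # mean absolute error
mse = np.mean(resid ** 2)      # mean squared error
rmse = np.sqrt(mse)            # root mean squared error, in the target's units
# R^2: one minus residual sum of squares over total sum of squares
r2 = 1 - np.sum(resid ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(np.isclose(mae, metrics.mean_absolute_error(y_true, y_pred)))
print(np.isclose(mse, metrics.mean_squared_error(y_true, y_pred)))
print(np.isclose(r2, metrics.r2_score(y_true, y_pred)))
```

In the notebook itself, `metrics.r2_score(y_test, prediction)` would report R^2 for the fitted model.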


## Residuals ##


**Plot a histogram of the residuals and make sure it looks normally distributed.**

In [43]:
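# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the closest equivalent
sns.histplot(y_test - prediction, bins=50, color='green', kde=True, stat='density')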



<AxesSubplot:xlabel='Yearly Amount Spent', ylabel='Density'>
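
A histogram is a visual check; a few summary statistics can complement it. The sketch below uses synthetic residuals (the `demo_residuals` series is an assumption standing in for `y_test - prediction`) and reports the mean, skewness, and excess kurtosis, all of which should be close to zero for roughly normal, unbiased residuals.

```python
import numpy as np
import pandas as pd

# Synthetic residuals drawn from a normal distribution (stand-in values only).
rng = np.random.default_rng(0)
demo_residuals = pd.Series(rng.normal(loc=0.0, scale=9.0, size=150))

summary = {
    "mean": demo_residuals.mean(),             # near 0 if predictions are unbiased
    "skew": demo_residuals.skew(),             # near 0 if the distribution is symmetric
    "excess_kurtosis": demo_residuals.kurt(),  # near 0 if tails resemble a normal's
}
print(summary)
```

These numbers are rough indicators rather than a formal normality test.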

## Conclusion ##

We still want to answer the original question: do we focus our efforts on mobile app or website development? Or maybe that doesn't really matter, and Length of Membership is what is really important. Let's see if we can interpret the coefficients to get an idea.

**Building a DataFrame of the coefficients.**

In [36]:
coeff_df = pd.DataFrame(lm.coef_,X.columns,columns=['Coefficient'])
coeff_df

| | Coefficient |
|---|---|
| Avg. Session Length | 25.98155 |
| Time on App | 38.590159 |
| Time on Website | 0.190405 |
| Length of Membership | 61.279097 |


## How can you interpret these coefficients? ##

**Interpreting the coefficients:**

* Holding all other features fixed, a 1-unit increase in Avg. Session Length is associated with an increase of 25.98 dollars in Yearly Amount Spent.

* Holding all other features fixed, a 1-unit increase in Time on App is associated with an increase of 38.59 dollars in Yearly Amount Spent.

* Holding all other features fixed, a 1-unit increase in Time on Website is associated with an increase of 0.19 dollars in Yearly Amount Spent.

* Holding all other features fixed, a 1-unit increase in Length of Membership is associated with an increase of 61.28 dollars in Yearly Amount Spent.
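
The interpretation above is plain arithmetic on the fitted coefficients. The following sketch takes the coefficient values from the table and computes the change in predicted yearly spending for a given change in one feature; the helper name `spend_change` is hypothetical and introduced only for this illustration.

```python
# Coefficients copied from the fitted model's output in this notebook.
coefficients = {
    "Avg. Session Length": 25.98155,
    "Time on App": 38.590159,
    "Time on Website": 0.190405,
    "Length of Membership": 61.279097,
}

def spend_change(feature, delta):
    """Change in predicted Yearly Amount Spent when `feature` changes by
    `delta` units and all other features are held fixed."""
    return coefficients[feature] * delta

app_gain = spend_change("Time on App", 1.0)
web_gain = spend_change("Time on Website", 1.0)

print(f"One more unit on the app:     {app_gain:.2f} dollars")
print(f"One more unit on the website: {web_gain:.2f} dollars")
print(f"App-to-website ratio:         {app_gain / web_gain:.1f}")
```

Note that the features are measured in different units (minutes for app and website time, years for membership), so a "1-unit" change does not mean the same thing for every row of the table.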

Do you think the company should focus more on their mobile app or on their website?


According to the coefficients above, the largest increase, approximately $61.28, is observed when "Length of Membership" increases by one unit.


Based on these coefficients, there are three ways to increase the yearly amount spent by customers. We can improve the app experience so that users spend more time in the app, which already shows a strong association with spending (38.59 dollars per unit). We can develop the website so that it becomes as effective as the app, since its coefficient is currently close to zero (0.19). Or we can focus on customer relationships so that people remain members for longer.