# Relax Data Science Challenge Report

This report is written for the Relax Data Science Challenge as part of Springboard's Data Science Career Track.

**Goal**: The goal of this project is to identify which features are most important in predicting future user adoption. In this context, we define an *adopted user* as a user who has logged into the product on three separate days in at least one seven-day period.

**Dataset**: There are 2 datasets used for this report:
- takehome_user.csv - A table with data on 12,000 users who signed up for the product in the last two years.
- takehome_users_engagement.csv - A table that contains a row for each day that a user has logged into the product.

**Clients**: The client will be the company Relax.

**Methodology**: There are 5 main steps in this process:

1) Importing the data and merging the 2 tables for better processing.

2) Data Wrangling - Cleaning the data and feature engineering.

3) Data Exploration - Using data visualization to describe and learn trends that may lead to insightful conclusions about the data.

4) Machine Learning - Building a predictive model for future user adoption.

5) Conclusion - Summarizing the findings and providing recommendations for the client based on the data.
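As an illustration of step 1, the merge could be sketched as below. This is a minimal sketch, not the report's actual code: the key columns `object_id` and `user_id`, the `visited` column, and the tiny in-memory tables are assumptions standing in for the two CSV files.

```python
import pandas as pd

# Small stand-in tables (hypothetical values); the real tables would be
# loaded from the two CSV files with pd.read_csv.
users = pd.DataFrame({
    "object_id": [1, 2, 3],  # assumed user key in the users table
    "creation_source": ["ORG_INVITE", "GUEST_INVITE", "SIGNUP"],
})
engagement = pd.DataFrame({
    "time_stamp": ["2014-04-22 03:53:30", "2013-11-15 03:45:04", "2013-11-29 03:45:04"],
    "user_id": [1, 2, 2],    # assumed user key in the engagement table
    "visited": [1, 1, 1],
})

# A left join keeps users who never logged in; their engagement
# columns are filled with NaN.
merged = users.merge(engagement, how="left", left_on="object_id", right_on="user_id")
print(merged)
```

A left join is used here so that users with no recorded logins are not silently dropped from the analysis.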

### Data Wrangling:

- Converted the creation_time and time_stamp columns into datetime format so that we can apply various transformation techniques.

- Filled in data for columns with missing entries. 

- Created a new email_domain column that contained the email domain of each user based on their provided email address.

- Created a new adopted column that indicated whether or not a user was adopted.
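The wrangling steps above can be sketched roughly as follows. The helper names `extract_email_domain` and `is_adopted`, the `user_id` column, and the sample rows are assumptions made for illustration; the report does not show its own implementation. The adoption check follows the definition given in the goal: logins on three separate days within at least one seven-day period.

```python
import pandas as pd

def extract_email_domain(emails: pd.Series) -> pd.Series:
    """Return the lower-cased part of each address after the last '@'."""
    return emails.str.split("@").str[-1].str.lower()

def is_adopted(timestamps: pd.Series, window_days: int = 7, min_days: int = 3) -> bool:
    """True if logins fall on `min_days` distinct days inside some `window_days`-day span."""
    days = (
        pd.to_datetime(timestamps)
        .dt.normalize()          # keep only the calendar day
        .drop_duplicates()
        .sort_values()
        .reset_index(drop=True)
    )
    if len(days) < min_days:
        return False
    # Distance between each login day and the one (min_days - 1) positions later.
    span = days.shift(-(min_days - 1)) - days
    return bool((span < pd.Timedelta(days=window_days)).any())

# Hypothetical engagement rows for two users.
engagement = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "time_stamp": ["2014-01-01 08:00", "2014-01-03 09:00", "2014-01-06 10:00",
                   "2014-01-01 08:00", "2014-01-01 20:00", "2014-02-01 09:00"],
})
engagement["time_stamp"] = pd.to_datetime(engagement["time_stamp"])

# One boolean per user, indexed by user_id.
adopted = engagement.groupby("user_id")["time_stamp"].apply(is_adopted)
domains = extract_email_domain(pd.Series(["ann@gmail.com", "bob@Yahoo.com"]))
```

In this sample, user 1 logs in on three distinct days within six days and counts as adopted, while user 2 has only two distinct login days and does not.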

### Data Exploration:

1) Number of Adopted Users

Fewer than 1% of all users were adopted.

![s](https://github.com/nysportsfan/SpringBoard/blob/master/Relax_Data_Science_Challenge/Images/adopted.png?raw=true)

2) Adopted Users by Creation Source

Users were more likely to be adopted if they created their account via an invite (Guest invite or Organization invite) than if they signed up using Google or created the account for personal projects.

![s](https://github.com/nysportsfan/SpringBoard/blob/master/Relax_Data_Science_Challenge/Images/adopted_creation.png?raw=true)

3) Adopted Users by Marketing Email Drip Status

Users were more likely to be adopted if they enabled their marketing email drip. 

![s](https://github.com/nysportsfan/SpringBoard/blob/master/Relax_Data_Science_Challenge/Images/adopted_emaildrip.png?raw=true)

4) Adopted Users by Email Domain

Users were more likely to be adopted if their email domain was gmail.com, yahoo.com, gustr.com, or hotmail.com compared to jourrapide.com or cuvox.de.

![s](https://github.com/nysportsfan/SpringBoard/blob/master/Relax_Data_Science_Challenge/Images/adopted_emaildomain.png?raw=true)
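The comparisons in this section come down to an adoption rate per category. A minimal sketch of that calculation is shown below; the helper `adoption_rate_by` and the sample rows are hypothetical, and the `adopted` column is assumed to be coded as 0/1.

```python
import pandas as pd

def adoption_rate_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Share of adopted users within each category of `column`, highest first."""
    return df.groupby(column)["adopted"].mean().sort_values(ascending=False)

# Hypothetical sample; the real table would hold one row per user.
users = pd.DataFrame({
    "creation_source": ["ORG_INVITE", "ORG_INVITE", "SIGNUP", "SIGNUP"],
    "adopted": [1, 0, 0, 0],
})
rates = adoption_rate_by(users, "creation_source")
print(rates)
```

The same helper could be applied to `enabled_for_marketing_drip` or `email_domain` to reproduce the other comparisons.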

### Machine Learning:

We chose a **Random Forest Classifier** because several of the important features are categorical (opted_in_to_mailing_list, enabled_for_marketing_drip, creation_source). Compared with linear classifiers such as Logistic Regression, it is also better suited to this data, since many of the features are unbalanced and the classes are not linearly separable.

After creating dummy variables for the categorical data, we built the feature set from all columns except the target column adopted and the columns name, email, creation_time and last_session_creation_time. To put the features on a common scale, we included StandardScaler as a preprocessing step in our pipeline.
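The feature preparation and pipeline described above might look roughly like this. It is a sketch under assumptions: the list of categorical columns, the helper names, the hyperparameters, and the sample rows are illustrative and not taken from the report.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumed column groupings; adjust to the actual table.
CATEGORICAL = ["creation_source", "email_domain"]
EXCLUDED = ["adopted", "name", "email", "creation_time", "last_session_creation_time"]

def make_feature_matrix(df: pd.DataFrame) -> pd.DataFrame:
    """One-hot encode the categorical columns, then drop the excluded columns."""
    present = [c for c in CATEGORICAL if c in df.columns]
    encoded = pd.get_dummies(df, columns=present, dtype=float)
    return encoded.drop(columns=[c for c in EXCLUDED if c in encoded.columns])

def make_model(random_state: int = 0) -> Pipeline:
    """StandardScaler followed by a random forest (hyperparameters are placeholders)."""
    return Pipeline([
        ("scale", StandardScaler()),
        ("forest", RandomForestClassifier(n_estimators=100, random_state=random_state)),
    ])

# Hypothetical rows to show the shapes involved.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy", "Di"],
    "email": ["ann@gmail.com", "bob@yahoo.com", "cy@gmail.com", "di@cuvox.de"],
    "creation_source": ["ORG_INVITE", "SIGNUP", "GUEST_INVITE", "SIGNUP"],
    "email_domain": ["gmail.com", "yahoo.com", "gmail.com", "cuvox.de"],
    "opted_in_to_mailing_list": [1, 0, 1, 0],
    "enabled_for_marketing_drip": [0, 0, 1, 0],
    "adopted": [1, 0, 1, 0],
})
X = make_feature_matrix(df)
model = make_model().fit(X, df["adopted"])
```

Tree-based models do not strictly need scaled inputs, so the scaler mainly keeps the pipeline reusable if another classifier is swapped in.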

The model was fit on the training set and then evaluated on the test set. The results were very good, with a prediction accuracy of 98.92%.

![s](https://github.com/nysportsfan/SpringBoard/blob/master/Relax_Data_Science_Challenge/Images/relax_df.png?raw=true)
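For reference, the fit-and-evaluate step can be sketched as below. The data here is synthetic (generated with `make_classification`), so the numbers it produces say nothing about the Relax results; the split ratio and hyperparameters are also assumptions. Because the adopted class is rare, the sketch also reports a majority-class baseline, which gives the accuracy figure some context.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real feature matrix and target.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# Stratify so the rare class appears in both splits at a similar rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

model_acc = accuracy_score(y_test, model.predict(X_test))
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
print(f"model accuracy:    {model_acc:.4f}")
print(f"baseline accuracy: {baseline_acc:.4f}")
```

Comparing the two numbers shows how much of the accuracy comes from the model rather than from the class imbalance alone.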

### Conclusion:

Three of the five most important features were derived from the email domain variable, which shows the significance of the email domain in predicting a user's adoption status. Another important feature was whether or not the creation source was an organization invitation. The mailing list status of a user was also a strong indicator of whether they would be adopted.
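A ranking like the one discussed above can be read from a fitted forest's `feature_importances_` attribute. The sketch below uses a hypothetical helper `top_features` and synthetic data in which only one column carries signal; it illustrates the mechanism rather than reproducing the report's ranking.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def top_features(forest: RandomForestClassifier, feature_names, n: int = 5) -> pd.Series:
    """Return the n largest impurity-based importances, labelled by feature name."""
    importances = pd.Series(forest.feature_importances_, index=list(feature_names))
    return importances.sort_values(ascending=False).head(n)

# Synthetic data: the target depends only on the "signal" column.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise_a": rng.normal(size=200),
    "noise_b": rng.normal(size=200),
})
y = (X["signal"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
top = top_features(forest, X.columns, n=2)
print(top)
```

If the forest sits inside a scikit-learn `Pipeline`, the fitted estimator would first need to be pulled out through `named_steps` before reading its importances.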

Recommendations:

- Strengthen the marketing email drip campaign by incentivizing users to enable it. Examples include giveaways or sweepstakes every few months to entice users to keep receiving emails.

- Improve product features for organizations to use. A platform that caters to teamwork can make the product more attractive for various organizations. 