##  A Collection of Data Science Take-Home Challenges 
[A Collection of Data Science Take-Home Challenges](https://datamasked.com/).


### How would you improve engagement in FB?
Most case studies start with a vague goal, so you first need to:

   1. define a concrete **metric**
        - something measurable that relates to the company's mission
   2. pick relevant **variables**
        - user attributes (sex, age, country, number of friends, etc.) and browsing/online behavior (device, acquisition channel such as ads or SEO, session time, etc.)
   3. pick a model to predict the metric from the variables you selected
        - explain the reasoning behind your choice (e.g. a random forest, because tree-based models tend to work well (high accuracy) in high dimensions, with categorical variables, and with outliers)
   4. analysis/conclusion
        - pick **one bad** and **one good** segment

Note: a data scientist's job is to suggest actions based on data (i.e. start with the data, then make actionable suggestions).
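
Step 4 can be sketched with a simple group-by. This is only an illustration under assumptions: the data is synthetic, and the column names (`country`, `device`, `sessions_per_week`) are hypothetical stand-ins for whatever metric and variables were chosen in steps 1 and 2.

```python
import pandas as pd

# Synthetic user-level data; all column names and values are made up for illustration.
users = pd.DataFrame({
    "country": ["US", "US", "BR", "BR", "IN", "IN", "US", "BR"],
    "device": ["mobile", "desktop", "mobile", "mobile",
               "desktop", "mobile", "mobile", "desktop"],
    "sessions_per_week": [12, 5, 3, 2, 6, 9, 10, 4],
})

# Average the engagement metric per segment and keep the segment size,
# since a very small segment is rarely worth acting on.
segments = (
    users.groupby(["country", "device"])["sessions_per_week"]
    .agg(mean_sessions="mean", n_users="count")
    .sort_values("mean_sessions", ascending=False)
)

best_segment = segments.index[0]    # highest average engagement
worst_segment = segments.index[-1]  # lowest average engagement
print(segments)
print("good:", best_segment, "bad:", worst_segment)
```

In a real analysis the good segment suggests what to replicate and the bad segment suggests where to investigate, which is what turns the table into an actionable recommendation.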

### Approach
1. Check the quality of the data.
2. Don't overspend time optimizing the model; instead, explain why you picked that model and how you would improve or optimize it if you were given more time.
3. Focus on how the business would benefit from your analysis.



-----

# Pramp practice interviews

### January 25, 2018

### 1\. Pramp Engagement

You’re a data scientist at Pramp and the product team wants to improve user engagement. What metrics would you choose and how would you tackle this?


### Answer:

#### Objective: improve user engagement

Some examples of metrics that Pramp might want to investigate:

- number of monthly active users
- number of sessions per user each month
- number of users who churn after three months (retention)


If we are interested in the number of active users, we first need to define what makes a user **active**.

For example, using these two criteria:
- attended 4 or more sessions each month
- no-show score is less than 90%

1\. Then, we would grab 6 to 12 months of data and engineer **active** and **non-active** labels for all users.
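
A minimal sketch of that labeling step, assuming one row per user per month. The column names (`sessions_attended`, `no_show_score`) and the values are hypothetical; only the two thresholds come from the criteria above.

```python
import pandas as pd

# Synthetic monthly per-user stats; column names and values are hypothetical.
monthly = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "sessions_attended": [5, 4, 2, 6],
    "no_show_score": [0.10, 0.95, 0.20, 0.50],
})

MIN_SESSIONS = 4          # "attended 4 or more sessions each month"
MAX_NO_SHOW_SCORE = 0.90  # "no-show score is less than 90%"

# A user is labeled active only when both criteria hold.
monthly["active"] = (
    (monthly["sessions_attended"] >= MIN_SESSIONS)
    & (monthly["no_show_score"] < MAX_NO_SHOW_SCORE)
)
print(monthly)
```

Keeping the thresholds as named constants makes it easy to test how sensitive the active-user count is to the definition.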


2a\. Train and build a model to predict future incoming traffic data.

2b\. Model selection: use a random forest, then look at the feature importances to find out which factors drive the predictions.

3\. Assume the top three features in the feature importance chart are: no-show rate, ratings of the user, and ratings of the interviewers.

A recommended next step might be to set up a team to investigate whether users had a bad experience in their first few sessions, then see whether stricter rules could reduce the no-show rate.
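
Steps 2b and 3 could look roughly like the sketch below, using scikit-learn's `RandomForestClassifier`. Everything about the data is an assumption: the features are randomly generated, their names are hypothetical, and the label is constructed so that `no_show_rate` dominates, purely to show how the importance ranking is read.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500

# Synthetic features; the names are hypothetical.
X = pd.DataFrame({
    "no_show_rate": rng.uniform(0, 1, n),
    "user_rating": rng.uniform(1, 5, n),
    "interviewer_rating": rng.uniform(1, 5, n),
    "noise": rng.normal(size=n),
})
# Synthetic "active" label that depends only on no_show_rate,
# so that feature should come out on top.
y = (X["no_show_rate"] < 0.4).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Impurity-based importances, one per column, summing to 1.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor high-cardinality features, so checking the ranking with `sklearn.inspection.permutation_importance` on held-out data is a reasonable follow-up before recommending any action.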






 


----

### One of the take-home challenge samples from Galvanize's interview preparation repository

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure, ylim, legend, boxplot, setp, axes
%matplotlib inline

## Context

#### There are 2 tasks:

1. Collect metrics of interest.
2. Offer insights for how we could improve CPM.

The first task requires specific metrics collection: find the conversion rate and CPM per campaign within each application. Include all of the code you need to transform and calculate the data.

The second task is more of an open-ended question and involves writing about your methods and reasoning: given the data that was collected in the first task, what are some metrics we can calculate to give us insights into how to improve CPM? For this second question, if you don't have enough data or would like to have additional data, please specify the format of the data (the columns in each file) that you would like to have and describe your transformations to acquire the information that you need.



## Background

**CPM** - cost per 1000 impressions

Formula: CPM = (total cost of campaign * 1000) / total number of impressions
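
As a quick worked example of the formula, here is a small helper. The function name `cpm` and the input numbers are made up for illustration.

```python
def cpm(total_cost: float, impressions: int) -> float:
    """Cost per 1000 impressions: (total cost * 1000) / impressions."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return total_cost * 1000 / impressions


# Hypothetical campaign: a cost of 50 spread over 20,000 impressions.
example_cpm = cpm(50, 20_000)
print(example_cpm)
```

For a fixed cost, CPM falls as impressions grow, which is why a campaign with very few impressions shows a very large CPM.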

Since the question in this exercise is loosely defined and missing a lot of information, I will make some assumptions:

1. the number of *offers* will be treated as "impressions" and the number of *engagements* as "clicks"
2. the campaign cost is missing, so I will set every campaign's cost to 100 (the value used in the CPM calculation in Task 1)


**Campaign** - a specific, defined series of activities used in marketing a new or changed product or service


***

## Task 1 - Calculate conversion rate and CPM

In [2]:
engagement = pd.read_csv('https://raw.githubusercontent.com/gSchool/dsi-interview-prep/master/interview_questions/takehomes/takehome1/example_engagements.csv?token=AfcppxqXZaAL5zIr5EjjiNTdiAyin4Odks5cFD4EwA%3D%3D')

In [3]:
offers = pd.read_csv('https://raw.githubusercontent.com/gSchool/dsi-interview-prep/master/interview_questions/takehomes/takehome1/example_offers.csv?token=AfcppzaFmL-AsJKiJoUDUkB7F-lb2FqEks5cFD4pwA%3D%3D')

#### Simple EDA

In [4]:
engagement.head(5)

| | Unnamed: 0 | revenue | reward_id | campaign_id | application_id |
|---|---|---|---|---|---|
| 0 | 2014-07-26 00:00:29.257095 | 0.499 | 53d2ef9d-361c-c0d1-9015-6525c28c8564 | 18 | 3 |
| 1 | 2014-07-26 00:00:30.468959 | 0.149 | 53d2ef9e-72f3-84bf-a243-78ae58d1626f | 4 | 0 |
| 2 | 2014-07-26 00:00:43.396503 | 0.149 | 53d2efab-91fb-ec54-3435-40a502e34e83 | 4 | 3 |
| 3 | 2014-07-26 00:01:01.234404 | 0.149 | 53d2efbd-8f91-db89-12d3-c373bcde9c30 | 4 | 3 |
| 4 | 2014-07-26 00:01:15.100982 | 0.149 | 53d2efcb-3e74-a234-f986-938765766950 | 4 | 0 |


In [5]:
engagement.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2441 entries, 0 to 2440
Data columns (total 5 columns):
Unnamed: 0        2441 non-null object
revenue           2441 non-null float64
reward_id         2441 non-null object
campaign_id       2441 non-null int64
application_id    2441 non-null int64
dtypes: float64(1), int64(2), object(2)
memory usage: 95.4+ KB


In [6]:
set(engagement.revenue);

In [7]:
offers.head()

| | Unnamed: 0 | reward_id | application_id | campaign_id |
|---|---|---|---|---|
| 0 | 2014-07-26 00:00:02.995009 | 53d2ef83-0008-50fd-80b6-022bd353332d | 0 | 0 |
| 1 | 2014-07-26 00:00:03.114537 | 53d2ef83-1860-7515-2f58-bc73db3b6ce8 | 1 | 1 |
| 2 | 2014-07-26 00:00:03.738329 | 53d2ef83-dc59-4efc-8e6d-1840b994e96d | 0 | 2 |
| 3 | 2014-07-26 00:00:04.333408 | 53d2ef84-ef12-f2f9-799f-d549f4acf691 | 1 | 0 |
| 4 | 2014-07-26 00:00:05.023120 | 53d2ef85-a900-e839-b0e5-4d07d619fa58 | 0 | 0 |


### Conversion

#### Outer join two tables by application id, campaign id, and reward id

In [8]:
df = offers.merge(engagement, how = 'outer', on = ['application_id', 'campaign_id', 'reward_id'])
df.head()

| | Unnamed: 0_x | reward_id | application_id | campaign_id | Unnamed: 0_y | revenue |
|---|---|---|---|---|---|---|
| 0 | 2014-07-26 00:00:02.995009 | 53d2ef83-0008-50fd-80b6-022bd353332d | 0 | 0 | NaN | NaN |
| 1 | 2014-07-26 00:00:03.114537 | 53d2ef83-1860-7515-2f58-bc73db3b6ce8 | 1 | 1 | NaN | NaN |
| 2 | 2014-07-26 00:00:03.738329 | 53d2ef83-dc59-4efc-8e6d-1840b994e96d | 0 | 2 | NaN | NaN |
| 3 | 2014-07-26 00:00:04.333408 | 53d2ef84-ef12-f2f9-799f-d549f4acf691 | 1 | 0 | NaN | NaN |
| 4 | 2014-07-26 00:00:05.023120 | 53d2ef85-a900-e839-b0e5-4d07d619fa58 | 0 | 0 | NaN | NaN |


In [9]:
df = df[['reward_id', 'application_id', 'campaign_id', 'revenue']]

In [10]:
# number of impression - total count
imp_count = df.groupby(['application_id', 'campaign_id'])['reward_id'].count()

# number of clicks - revenue is not null
click_count = df[df['revenue'].notnull()].groupby(['application_id', 'campaign_id'])['reward_id'].count()

In [11]:
# change to percentage
conversion_rate_serie = (click_count/imp_count) * 100

In [12]:
conversion_rate_serie = conversion_rate_serie.fillna(0)

In [14]:
conversion_df = conversion_rate_serie.to_frame()

#### Conversion rate table

In [15]:
conversion_df.head()

| application_id | campaign_id | reward_id |
|---|---|---|
| 0 | 0 | 1.532567 |
| 0 | 2 | 0.37037 |
| 0 | 4 | 5.514706 |
| 0 | 5 | 2.424242 |
| 0 | 7 | 0.471945 |
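
The same calculation can be checked on toy data. The sketch below is a variant, not the notebook's exact method: it uses a left join with `indicator=True` instead of the outer join, so that only offers count as impressions, and it assumes each `reward_id` appears at most once in the engagement table. All values are made up.

```python
import pandas as pd

# Toy stand-ins for the offers and engagement tables (values are made up).
offers = pd.DataFrame({
    "reward_id": ["a", "b", "c", "d", "e"],
    "application_id": [0, 0, 0, 1, 1],
    "campaign_id": [0, 0, 0, 2, 2],
})
engagement = pd.DataFrame({
    "reward_id": ["b", "d"],
    "application_id": [0, 1],
    "campaign_id": [0, 2],
    "revenue": [0.149, 0.499],
})

keys = ["application_id", "campaign_id", "reward_id"]
# The left join keeps every offer; _merge == "both" marks offers that were clicked.
merged = offers.merge(engagement, how="left", on=keys, indicator=True)
merged["clicked"] = merged["_merge"] == "both"

# The mean of a boolean column is the share of True values, i.e. clicks / impressions.
conversion_pct = (
    merged.groupby(["application_id", "campaign_id"])["clicked"].mean() * 100
)
print(conversion_pct)
```

Because every group has at least one offer, this version never produces a missing rate, so the `fillna(0)` step is not needed here.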


### CPM

Since I assumed a cost of 100 for every campaign, the total cost in the CPM formula is always 100.

In [16]:
#CPM

CPM_serie = (100 * 1000)/imp_count;

In [17]:
CPM_df = CPM_serie.to_frame()

In [18]:
CPM_df.head()

| application_id | campaign_id | reward_id |
|---|---|---|
| 0 | 0 | 383.141762 |
| 0 | 2 | 370.37037 |
| 0 | 4 | 18.382353 |
| 0 | 5 | 606.060606 |
| 0 | 7 | 52.438385 |


## Task 2 - Analysis

After calculating the conversion rate and CPM above, look at which campaigns are the top 10 performers. We can then build graphs to visualize the differences.
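
One way to pull out the top performers is sketched below. The metric values are toy numbers and the column names `conversion_pct` and `cpm` are hypothetical; in the notebook they would come from `conversion_df` and `CPM_df`. The sketch also assumes that a lower CPM is better, which holds from the advertiser's side but should be confirmed for the business in question.

```python
import pandas as pd

# Toy per-campaign metrics (values are made up for illustration).
index = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 2), (0, 4), (1, 1)],
    names=["application_id", "campaign_id"],
)
metrics = pd.DataFrame(
    {"conversion_pct": [1.0, 0.5, 6.0, 2.0], "cpm": [400.0, 350.0, 20.0, 600.0]},
    index=index,
)

top_n = 10
# Higher conversion rate is better.
top_by_conversion = metrics.nlargest(top_n, "conversion_pct")
# Assumption: lower cost per 1000 impressions is better.
top_by_cpm = metrics.nsmallest(top_n, "cpm")

print(top_by_conversion)
print(top_by_cpm)
```

Calling `top_by_conversion["conversion_pct"].plot.bar()` would then give a quick visual comparison of the leading campaigns.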