## Final Exam: Hands-On Portion


**Instructions**
This should take 1 to 2 hours. Do your best! We'll be grading not on accuracy, but on your comprehension of the problem, the methods you use, and your approach. Good luck!

### **The Challenge**

The executive team is interested in a user class called **"Adopted Users"** for its cloud project collaboration platform. 

What's an "Adopted User"?
- The team has defined this as a user who **logs in on at least 3 separate days within a seven-day period**.

- This means Chris is an adopted user if he logs in on Monday, Wednesday, and Friday. He's also an adopted user if he logs in on Tuesday, Friday, and then the following Monday. Thus the problem deals with "rolling" periods.
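
The two Chris examples can be checked with a few lines of plain Python. This is only a sketch of one reading of the rule: it assumes "seven-day period" means seven consecutive calendar days inclusive, and the helper name `is_adopted` and the dates are made up for illustration.

```python
from datetime import date, timedelta

def is_adopted(login_dates, min_days=3, window_days=7):
    """True if some run of `window_days` consecutive calendar days
    contains logins on at least `min_days` distinct days."""
    days = sorted(set(login_dates))  # distinct login days, oldest first
    for i in range(len(days) - min_days + 1):
        # days[i] .. days[i + min_days - 1] are min_days distinct days; they
        # share a window if the first and last are fewer than window_days apart.
        if days[i + min_days - 1] - days[i] < timedelta(days=window_days):
            return True
    return False

monday = date(2014, 1, 6)  # an arbitrary Monday, chosen for illustration
mon_wed_fri = [monday + timedelta(days=d) for d in (0, 2, 4)]
tue_fri_next_mon = [monday + timedelta(days=d) for d in (1, 4, 7)]
spread_out = [monday + timedelta(days=d) for d in (0, 4, 8)]

print(is_adopted(mon_wed_fri))       # True
print(is_adopted(tue_fri_next_mon))  # True
print(is_adopted(spread_out))        # False
```

The third case shows why the window has to roll: three logins exist, but no single seven-day span covers all of them.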

**A. Use time-series data in the table "user_engagement.csv" to tag these users**.

After identifying these users, the team would like to find out: 

**B. "What factors drive users to become 'Adopted Users'?"**

This means a mixture of exploratory data analysis, feature engineering, modeling, cross-validation and parameter tuning, and finally running feature importance. Feel free to use your creativity to explore and think through other potential solutions.



## Read Essentials

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

raw_users = pd.read_csv('./users.csv', encoding='latin-1')
raw_user_engagement = pd.read_csv('./user_engagement.csv')

**Data Dictionary:**

A. user_engagement.csv
- time_stamp: mm/dd/yyyy hh:mm:ss
- user_id
- visited: 1, indicating that they visited 

B. users.csv
- name: the user's name
- object_id: the user's id
- email: email address
- creation_source: How was the account created?
    - personal_projects: invited to join another user's personal workspace
    - guest_invite: invited to an organization as a guest (limited permissions)
    - org_invite: invited to an organization (as a full member)
    - signup: signed up via the website
    - signup_google_auth: signed up using Google authentication (using a Google email account for their login id)
- creation_time: when they created their account
- last_session_creation_time: unix timestamp of last login
- opted_in_to_mailing_list: whether they have opted into receiving marketing emails
- enabled_for_marketing_drip: whether they are on the regular marketing email drip
- org_id: the organization (group of users) they belong to
- invited_by_user_id: which user invited them to join (if applicable)


## 1. Explore 
1. Use `describe` and `head` to explore your two datasets
2. Create a few bar plots to explore your data
- [Some ideas](https://machinelearningmastery.com/quick-and-dirty-data-analysis-with-pandas/)
3. How many unique user_ids are there in each table? in both?

List some cleansing steps and note anything you might have to do in order to get your data set into a feature frame and a target frame.

Helpful functions: 
- isin, groupby, plot.bar()
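
As a rough sketch of how those helper functions fit together, the snippet below uses two tiny made-up frames in place of `users.csv` and `user_engagement.csv` (the values are invented; only the column names come from the data dictionary).

```python
import pandas as pd

# Toy stand-ins for the two CSV files; the values are made up.
users = pd.DataFrame({
    "object_id": [1, 2, 3, 4],
    "creation_source": ["signup", "org_invite", "org_invite", "guest_invite"],
})
engagement = pd.DataFrame({
    "user_id": [1, 1, 2, 5],
    "visited": [1, 1, 1, 1],
})

print(users.describe(include="all"))
print(engagement.head())

# Unique ids in each table, and ids present in both.
user_ids = set(users["object_id"])
engaged_ids = set(engagement["user_id"])
n_users, n_engaged = len(user_ids), len(engaged_ids)
n_both = len(user_ids & engaged_ids)
print(n_users, n_engaged, n_both)  # 4 3 2

# isin builds a boolean mask: which users have at least one login row?
has_logged_in = users["object_id"].isin(engagement["user_id"])

# groupby + size gives the counts behind a bar plot; in the notebook,
# source_counts.plot.bar() would draw them.
source_counts = users.groupby("creation_source").size()
print(source_counts)
```

Ids that appear in only one table (user 5 here) are worth noting as a cleansing step before any merge.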

## 2.  Create your Target Variable
1. You want to use a combination of steps to create a rolling 7-day window. This rolling window of 7 days should count the number of log-ins.
2. (Optional) Run any additional exploratory data analysis (EDA) you wish:

    EDA is used, on the one hand, to answer questions, test business assumptions, and generate hypotheses for further analysis. On the other hand, you can also use it to prepare the data for modeling. What the two have in common is that both rely on a good knowledge of your data, either to get the answers you need or to develop an intuition for interpreting the results of future modeling.


A few helpful methods in pandas
- [to_period](https://stackoverflow.com/questions/23840797/convert-a-column-of-timestamps-into-periods-in-pandas)
- [groupby](https://www.google.com/search?q=groupby&oq=groupby&aqs=chrome..69i57j0l2j69i60l3.1487j1j4&sourceid=chrome&ie=UTF-8)
- [resample](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.resample.html)
- [rolling](http://pandas.pydata.org/pandas-docs/stable/computation.html#window-functions)

Inputs:
- user_engagement

Output: 
- Call your target dataframe "target"


In [67]:
user_engagement = raw_user_engagement.copy()  # copy so the raw frame stays untouched

# target = 
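
One possible way to build `target` is sketched below on a toy frame; it is not the only valid approach. It assumes the window is seven calendar days ending on each login day, that several logins on one day count once, and that `time_stamp` parses with `pd.to_datetime` (the toy values use an ISO-style format, which may differ from the real file). The `day` and `adopted` column names are made up for illustration.

```python
import pandas as pd

# Toy stand-in for user_engagement.csv; the values are made up.
user_engagement = pd.DataFrame({
    "time_stamp": [
        "2014-01-06 09:00:00", "2014-01-06 17:30:00",  # user 1, twice on one day
        "2014-01-08 10:00:00", "2014-01-10 11:00:00",  # user 1
        "2014-01-07 08:00:00", "2014-01-14 08:00:00",  # user 2, weekly
        "2014-01-21 08:00:00",
    ],
    "user_id": [1, 1, 1, 1, 2, 2, 2],
    "visited": [1, 1, 1, 1, 1, 1, 1],
})

# One row per user per calendar day, sorted so the date index is monotonic.
daily = (
    user_engagement
    .assign(day=pd.to_datetime(user_engagement["time_stamp"]).dt.normalize())
    .drop_duplicates(["user_id", "day"])
    .sort_values(["day", "user_id"])
    .set_index("day")
)

# For each login day, count that user's login days in the 7 days ending on it.
days_in_window = daily.groupby("user_id")["visited"].rolling("7D").sum()

# A user is tagged as adopted if any window reaches 3 login days.
target = (
    days_in_window.groupby(level="user_id").max().ge(3)
    .rename("adopted")
    .reset_index()
)
print(target)
```

Users with no rows in the engagement table never appear in this `target`, so they need an explicit decision (for example, treating them as not adopted) when joining to the user table.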

## 3. Data Cleanse  + Feature Engineering:
- Are there any steps you need to take to cleanse the data some more?
- Do you need to apply any masks?
- Do you need to take care of missing data? 
- Anything you need to drop? Things that don't make sense to include as a feature? .drop(['col1','col2'], axis = 1)
- Can you think of any features to create?
- Do you need to one-hot encode? pd.get_dummies or OneHotEncoder from sklearn

Other useful functions:
- pd.concat([df1,df2], axis = 1) -- this will bind two dataframes column wise

Outputs: 
- y, your target 
- X, a dataframe for features

In [9]:
users = raw_users.copy()  # copy so the raw frame stays untouched


#X = 
#y = 
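
The sketch below shows one way the questions above might translate into code, using a toy `users` frame and a toy `target` (all values invented). The choice of features, the `was_invited` flag, and treating users without engagement rows as not adopted are assumptions for illustration, not the expected answer.

```python
import pandas as pd

# Toy stand-ins; the values are made up.
users = pd.DataFrame({
    "object_id": [1, 2, 3],
    "name": ["A", "B", "C"],
    "creation_source": ["signup", "org_invite", "guest_invite"],
    "last_session_creation_time": [1388620800.0, None, 1389052800.0],
    "opted_in_to_mailing_list": [1, 0, 0],
    "enabled_for_marketing_drip": [0, 0, 1],
    "invited_by_user_id": [None, 3.0, 1.0],
})
target = pd.DataFrame({"user_id": [1, 3], "adopted": [True, False]})

features = users.assign(
    # Missing inviter becomes a feature instead of a dropped row.
    was_invited=users["invited_by_user_id"].notna().astype(int),
    # Unix seconds to datetime; missing values become NaT.
    last_session=pd.to_datetime(users["last_session_creation_time"], unit="s"),
)

# Left join keeps every user; users absent from target get NaN for adopted.
labeled = features.merge(target, how="left", left_on="object_id", right_on="user_id")
y = labeled["adopted"].eq(True).astype(int)  # NaN is treated as not adopted

# Keep only columns that make sense as features, then one-hot encode.
keep = ["creation_source", "opted_in_to_mailing_list",
        "enabled_for_marketing_drip", "was_invited"]
X = pd.get_dummies(labeled[keep], columns=["creation_source"], dtype=int)
print(X.head())
print(y.tolist())  # [1, 0, 0]
```

Identifier-like columns such as `name` and `object_id` are left out here because they carry no general signal. Anything derived from the last login time is also left out of `X`, since it is closely tied to the target.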

### 4. Train Test Split
- use "train_test_split" from sklearn.model_selection.
- use random_state = 100 
- test_size = .15

In [68]:
from sklearn.model_selection import train_test_split

# X_train, X_test, y_train, y_test = 
# X_train.shape,y_train.shape, X_test.shape, y_test.shape,
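
A minimal sketch of the split with the settings listed above, run on made-up data. The `stratify=y` argument is an optional addition that is not required by the instructions; it keeps the share of adopted users similar in both splits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up feature frame and imbalanced target, for illustration only.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.integers(0, 2, size=(100, 3)), columns=["f1", "f2", "f3"])
y = pd.Series([1] * 20 + [0] * 80, name="adopted")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=100, stratify=y
)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
# (85, 3) (85,) (15, 3) (15,)
```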

### 5. Modeling
- Choose 1 or 2 classifiers (or regressors, depending on what your target variable is)
- Use some cross-validation and parameter optimization techniques ([hints here](https://github.com/hackoregon/civicu-machine-learning/blob/master/lessons/13-Hyperparameter-Optimization/Class%2013%20-%20Part%201%2C%20Hyper%20Parameter%20Tuning.ipynb)):
    - cross-validation
    - manual search
    - random search or grid search
- Choose a scoring method (depends on whether this is a classification or regression task)



Hint:
- If your grid search doesn't work, try passing X_train.values and y_train.values
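
As one illustration of cross-validated grid search, the sketch below tunes a random forest on synthetic data from `make_classification`. The classifier, the parameter grid, and the `f1` scoring choice are all assumptions made for the example; other models and metrics are equally valid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced data standing in for the real X and y.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3,
    weights=[0.8, 0.2], random_state=100,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=100
)

# Small illustrative grid; every combination is scored with 5-fold CV.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=100),
    param_grid=param_grid,
    scoring="f1",
    cv=5,
)
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`RandomizedSearchCV` follows the same pattern and can be cheaper when the grid is large.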

### 6. Test Performance on holdout data (X_test, y_test)
1. Which performance metric would you use for this?
2. Produce a confusion matrix and a classification report

In [71]:
from sklearn.metrics import classification_report, confusion_matrix


#y_pred_in =  # your predictions
#y_test_in =  # your true ys
#print(classification_report(y_pred=y_pred_in, y_true=y_test_in))
#print(confusion_matrix(y_pred=y_pred_in, y_true=y_test_in))
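
To show how the two reports read, here is a small example with hand-written labels (the values are made up and do not come from any model). In scikit-learn's confusion matrix, rows are true classes and columns are predicted classes.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hand-written labels for illustration: 1 = adopted, 0 = not adopted.
y_test_in = [0, 0, 0, 0, 1, 1, 0, 1]
y_pred_in = [0, 0, 1, 0, 1, 0, 0, 1]

cm = confusion_matrix(y_true=y_test_in, y_pred=y_pred_in)
print(cm)
# [[4 1]
#  [1 2]]

print(classification_report(y_true=y_test_in, y_pred=y_pred_in))
```

Here 4 non-adopted users are classified correctly, 1 is a false positive, 1 adopted user is missed, and 2 are found. If adopted users turn out to be a small minority, precision, recall, or F1 on the adopted class will likely say more than plain accuracy.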

### 7. Feature Importance
- Goal: Create a table with two columns: Variable and Importance. Sort by Importance. 
- Use either [permutation Feature Importance (may take a while)](https://github.com/hackoregon/civicu-machine-learning/blob/master/lessons/13-Hyperparameter-Optimization/Class%2013%20-%20Part%202%2C%20Permutation%20Feature%20Importance.ipynb) or [RandomForest's Feature Importance method](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)

In [70]:
importance = pd.DataFrame(
    {'Variable': [], 
     'Importance':[]},
    columns = ["Variable", "Importance"])\
    .sort_values('Importance', ascending = False)
importance
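
The table could be filled in along these lines. This sketch uses a random forest's impurity-based `feature_importances_` on synthetic data; the feature names `f0` to `f3` and the model settings are invented for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the real feature frame and target.
X_arr, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0,
    random_state=100,
)
X = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

model = RandomForestClassifier(n_estimators=50, random_state=100).fit(X, y)

# Pair each column name with its importance, highest first.
importance = (
    pd.DataFrame({"Variable": X.columns,
                  "Importance": model.feature_importances_})
    .sort_values("Importance", ascending=False)
    .reset_index(drop=True)
)
print(importance)
```

Impurity-based importances can favor high-cardinality features, which is one reason to compare them against permutation importance on the holdout set.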



### 8. Story Telling
- What are your next steps now? 
- What do you want to tell leadership?
- How confident are you with this information?
- Are there additional steps you'd like to perform or data sets you'd like to use?

Write 1 to 2 paragraphs or a few bullet points discussing the above.