# La Croix Data Challenge - Graham Edge

## Analysis Outline

The path from opening up this exploding data file to producing actionable insights was exciting, but complex. Here is a guide to structure the story.

  1. [Exploratory Data Analysis](#chapter-1)
      1. [Data Inspection](#chapter-1a)
      2. [Dimensionality Reduction](#chapter-1b)
  2. [Feature Engineering](#chapter-2)
      1. [Correlations Between Variables](#chapter-2a)
      2. [Explained Variances in PCA](#chapter-2b)
  3. [Model Development](#chapter-3)
      1. [Fitting Logistic Regression](#chapter-3a)
      2. [Training and Test Accuracy](#chapter-3b)
      3. [Accuracy vs. Precision: Confusion Matrix](#chapter-3c)
  4. [Model Iteration](#chapter-4)
      1. [Support Vector Machines](#chapter-4a)
      2. [Deep Belief Network](#chapter-4b)
      3. [Logistic Regression with Polynomial Feature Expansion](#chapter-4c)
  5. [Conclusions and Insights](#chapter-5)
      1. [Best Model](#chapter-5a)
      2. [Insights](#chapter-5b)
      3. [Next Steps](#chapter-5c)

___

## 1. Exploring the Data <a id="chapter-1"></a>
Here is a little intro preamble for what will happen in this section.

### Loading and Inspecting the Data<a id="chapter-1a"></a>
Here we are going to inspect the data by examining things like:
- the head of the dataframe
- the column names and the data type for each column
- whether there are missing values in any columns

In [6]:
#import pandas as pd

#dataframe = pd.read_csv('datafile.csv')

#X = dataframe[features]
#Y = dataframe[target]
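
The three inspection steps listed above can be sketched on a small stand-in dataframe. The column names and values here are invented for illustration and are not the challenge data; only the pandas calls are meant to carry over.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the challenge data: the column names and values
# are assumptions made up for this sketch.
df = pd.DataFrame({
    "gender": [0, 1, 1, 0, 1],
    "age": [34, 27, np.nan, 45, 52],
    "twitter_followers": [120, 4500, 87, np.nan, 310],
})

# The head of the dataframe
print(df.head())

# The column names and the data type of each column
print(df.dtypes)

# The number of missing values in each column
missing_counts = df.isnull().sum()
print(missing_counts)
```

Columns that contain missing values show up with a nonzero count in the last printout, which is usually the quickest way to decide where imputation or row removal is needed.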

#### Here is a table describing the data we have

| Variable | Description |
| -- |:-- |
| `X.iloc[:, 0]` | This is a 0 or 1 feature to encode gender |
| `X.iloc[:, 1]` | This is the age feature |
| `X.iloc[:, 2]` | *This is the number of followers on Twitter* |

Here I am summarizing some thoughts about the variable table above.

### Dimensionality Reduction<a id="chapter-1b"></a>
Now we will see if there is a way to reduce the size of the data and eliminate strongly correlated features. We will try things like:
- PCA
- ?
- ?

Notice that the PCA doesn't actually do anything to the data when we execute the command `pca_machine.fit(X)` below - this only calculates the optimal feature combinations. We will need to use a command like `X_pca = pca_machine.transform(X)` later to produce the new features.

In [1]:
#from sklearn.decomposition import PCA

#pca_machine = PCA(n_components=5)
#pca_machine.fit(X)
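
To make the difference between fitting and transforming concrete, here is a minimal sketch on synthetic data. The array `X_demo` and the choice of 8 input features are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (an assumption for this sketch): 100 samples, 8 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 8))
X_before = X_demo.copy()

pca_demo = PCA(n_components=5)

# fit() only learns the principal directions; the input array is left as it was
pca_demo.fit(X_demo)
unchanged = np.array_equal(X_demo, X_before)

# transform() projects the data onto the learned components
X_demo_pca = pca_demo.transform(X_demo)

print(unchanged)          # whether the input is identical before and after fit()
print(X_demo.shape)       # original shape
print(X_demo_pca.shape)   # shape after projection onto 5 components
```

If both steps are wanted at once, `fit_transform` does the same thing in a single call.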

#### Wrap-Up
Now that we have taken a close look at the data, it's time to try some feature engineering to produce more meaningful features.

## 2. Feature Engineering <a id="chapter-2"></a>
Now I am discussing some things about feature engineering and how it is going to blow this data problem wide open.

### Correlations Between Variables<a id="chapter-2a"></a>
We will want to combine columns that are strongly correlated with one another. We will detect correlated features by examining things like:
- Pearson Correlation Matrix
- ?

In [None]:
#some code to plot the correlation matrix
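
As a rough sketch of the correlation step, the snippet below computes a Pearson correlation matrix and lists the strongly correlated column pairs. The dataframe is synthetic, the column names are hypothetical, and the 0.8 threshold is an arbitrary choice for illustration; plotting the matrix as a heatmap would be one more step on top of `corr`.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for this sketch); "years_employed" is built
# from "age" so that the two columns are strongly correlated by construction.
rng = np.random.default_rng(0)
age = rng.normal(40, 10, size=200)
df_demo = pd.DataFrame({
    "age": age,
    "years_employed": age - 20 + rng.normal(0, 2, size=200),
    "twitter_followers": rng.normal(500, 100, size=200),
})

# Pearson correlation matrix between all pairs of columns
corr = df_demo.corr(method="pearson")
print(corr.round(2))

# Collect each pair of distinct columns whose |r| exceeds the threshold
threshold = 0.8  # arbitrary cutoff chosen for illustration
strong_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(strong_pairs)
```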

### Explained Variances in PCA<a id="chapter-2b"></a>
To see what kinds of feature combinations would be more meaningful for describing the data, we will use Principal Component Analysis (PCA). Specifically, we want to know how many principal components are needed to capture most of the variance in the data, so we will examine the explained variance contributed by each additional component.

In [None]:
#pca_machine.fit(X)
#print(pca_machine.explained_variance_)
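
One way to turn the explained variances into a component count is to look at the cumulative explained variance ratio. The sketch below uses synthetic data built from three hidden factors, and the 95% target is an assumption chosen for illustration, not a recommendation for this dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (an assumption for this sketch): 10 observed features that are
# mostly driven by 3 hidden factors, plus a small amount of noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(300, 10))

# Fit PCA with all components kept so that the ratios sum to 1
pca_full = PCA().fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print(cumulative.round(3))

# Smallest number of components whose cumulative ratio reaches the target
target = 0.95  # illustrative threshold
n_needed = int(np.searchsorted(cumulative, target) + 1)
print(n_needed)
```

Plotting `cumulative` against the component index gives the familiar "elbow" curve, which is often easier to read than the raw numbers.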

## 3. Model Development <a id="chapter-3"></a>
Now that I have thoroughly inspected the data, I'm going to try out some models to make the prediction that we are looking for. I will start with a simple model to learn something from the data, before possibly proceeding to more complicated models or features if needed.

<div class="alert alert-block alert-success">
First we have to split the data into training and testing subsets.
</div>

In [5]:
#from sklearn.model_selection import train_test_split

#X_train, X_test, Y_train, Y_test = train_test_split(
#                X, Y, test_size=0.33, random_state=42)

### Logistic Regression<a id="chapter-3a"></a>
We always try a logistic regression, since it is simple, fast, and interpretable!

In [None]:
#from sklearn.linear_model import LogisticRegression

#logreg_model = LogisticRegression(C=1e5)
#logreg_model.fit(X_train, Y_train)

### Model Accuracy<a id="chapter-3b"></a>
Now that we have trained a model on a subset of the data, we need to check its accuracy on some withheld 'test data' in order to get an idea of how this model will generalize to data it has never seen before.

In [7]:
#code to calculate the accuracy on the training and test

#Y_pred = logreg_model.predict(X_test)
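
A minimal sketch of the train-versus-test comparison is shown below. It uses a synthetic classification problem from `make_classification` as a stand-in for the real data, so the variable names ending in `_demo` and the resulting numbers are illustrative assumptions only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the challenge data (an assumption for this sketch)
X_demo, Y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42)

X_train_demo, X_test_demo, Y_train_demo, Y_test_demo = train_test_split(
    X_demo, Y_demo, test_size=0.33, random_state=42)

# Fit on the training subset only
demo_model = LogisticRegression(max_iter=1000)
demo_model.fit(X_train_demo, Y_train_demo)

# Accuracy on data the model has seen versus data it has not
train_accuracy = accuracy_score(Y_train_demo, demo_model.predict(X_train_demo))
test_accuracy = accuracy_score(Y_test_demo, demo_model.predict(X_test_demo))

print(f"training accuracy: {train_accuracy:.3f}")
print(f"test accuracy:     {test_accuracy:.3f}")
```

A training accuracy that sits far above the test accuracy is the usual sign of overfitting; two similar values suggest the model generalizes about as well as it fits.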

### Confusion Matrix<a id="chapter-3c"></a>
Accuracy of the model only tells part of the story. Here we are going to look at the False Positives, True Positives, False Negatives, and True Negatives visually with a confusion matrix.

In [8]:
#from sklearn.metrics import confusion_matrix

#confusion_matrix(Y_test, Y_pred)

*We just plotted the confusion matrix, but I'm not going to let it stand there on its own. Let's talk a bit about False Positives, False Negatives, Precision, Accuracy, Recall, and all that stuff.*
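
To connect those terms to the matrix, here is a small worked example. The labels are hand-written for illustration (they are not model output from this notebook), and the formulas are the standard definitions of accuracy, precision, and recall.

```python
from sklearn.metrics import confusion_matrix

# Hand-written labels for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # fraction of all predictions that are correct
precision = tp / (tp + fp)                  # of the predicted positives, how many are real
recall = tp / (tp + fn)                     # of the real positives, how many were found

print(tn, fp, fn, tp)
print(accuracy, precision, recall)
```

Which of these matters most depends on the relative cost of a False Positive versus a False Negative in the business problem at hand.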

## 4. Model Iteration <a id="chapter-4"></a>
That logistic regression model was good, but we also saw that it seemed to be making (insert error here) types of errors. Now we will try out a few more complex models that should address this issue because of (insert cleverness here).

### Support Vector Machines Model<a id="chapter-4a"></a>
Support Vector Machine (SVM) models are good at fitting nonlinear relationships between features, because they can implement a 'kernel trick' to transform the data into a higher-dimensional space in which some previously non-linear data become linearly separable.  Let's try this widely used model out on this dataset:

In [2]:
#from sklearn.svm import SVC

#svm_model = SVC()
#svm_model.fit(X_train, Y_train)
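
The effect of the kernel can be seen on a toy problem. The sketch below assumes a synthetic "two concentric circles" dataset from `make_circles`, which no straight line can separate, and compares a linear kernel against an RBF kernel; it says nothing yet about how either would do on the challenge data.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, non-linearly-separable data (an assumption for this sketch)
X_circ, Y_circ = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_circ, Y_circ, test_size=0.33, random_state=42)

# A linear kernel can only draw a straight decision boundary
linear_acc = SVC(kernel="linear").fit(X_tr, Y_tr).score(X_te, Y_te)

# The RBF kernel implicitly maps the points into a higher-dimensional space
rbf_acc = SVC(kernel="rbf").fit(X_tr, Y_tr).score(X_te, Y_te)

print(f"linear kernel test accuracy: {linear_acc:.3f}")
print(f"RBF kernel test accuracy:    {rbf_acc:.3f}")
```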

### Deep Belief Network<a id="chapter-4b"></a>
<div class="alert alert-block alert-danger">
You might iterate through to more and more complex models as you build the data story. ~~Probably~~ Definitely nobody should be using deep learning in a data challenge though!</div>

In [3]:
#from keras import NO DONT DO IT

### Logistic Regression with Feature Expansion<a id="chapter-4c"></a>
At this point we have tried some complex models, but we are thinking that maybe we should revisit some of the feature engineering from the beginning of the notebook. Let's try the simple logistic regression model again, but we will try generating new polynomial combinations of the original features to see if this improves the model accuracy.

This will produce a lot of features to deal with, so we should perhaps try to prune down the features later once we find out the important ones. 

In [4]:
#from sklearn.preprocessing import PolynomialFeatures
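
To show what the expansion produces and how quickly the feature count grows, here is a small sketch. The two-feature sample and the choice of degree 2 are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with two features, a = 2 and b = 3 (values chosen for illustration)
X_small = np.array([[2.0, 3.0]])

# Degree-2 expansion without the constant bias column
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_small)

# The output columns are ordered a, b, a^2, a*b, b^2
print(X_poly)

# Number of features after a degree-2 expansion of 10 original features
n_expanded = PolynomialFeatures(degree=2, include_bias=False).fit(
    np.zeros((1, 10))).n_output_features_
print(n_expanded)
```

The expanded matrix can be passed to `LogisticRegression` in place of the original features; with this many columns, some regularization or feature selection is likely to be worth trying.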

## 5. Conclusions <a id="chapter-5"></a>
Now that we have examined the data, engineered features, and tried several models, it is time to ~~sleep~~ summarize what we have learned.

### Best Model<a id="chapter-5a"></a>
We tried several models, some complicated and some simple. My recommendation is that the (MODEL) is the best choice for this problem for the following reasons:
- simplicity?
- overall accuracy?
- best False Positive / False Negative rate?
- fastest to train?
- most interpretable?

### Insights<a id="chapter-5b"></a>
In the course of exploring and analyzing this data, some other points also became clear:
- there are too many features, you should collect less data
- there are not enough features, you should really get more data
- this data is messy, I suggest that you collect it differently
- there are a lot of missing values, you should try to fill in this data

### Next Steps and Recommendations<a id="chapter-5c"></a>
There are myriad ways to move forward with the results of this analysis, but here are a few that I think would provide the highest return-on-investment:
1. Getting more data will help these machine learning models to generalize better
2. Combining this data with (ADDITIONAL EXTERNAL DATA SOURCE) would bring in (INTERESTING NEW DIMENSION) and should allow you to (AMAZING NEW BUSINESS THING)

___

<img width=200 src='http://www.grahamedge.com/static/Graham_Edge_Square.jpg'>
<div style="text-align: center;" markdown="1">**Hire me**</div>