# Data Challenge Notebook

This notebook collects reusable Python functions for solving data challenges, 10 tips for succeeding at them, and guidance on developing metrics for analytics projects.

Most data challenges evaluate your coding level and your ability to inspect data (e.g. identifying paradoxes and errors), perform data wrangling, choose appropriate visualizations, apply statistics/ML, and answer questions concisely and effectively.

This notebook includes functions for high-level processes that a data challenge may require, and it is meant to be customized for your particular challenge. The sections include:

- Initial Data Analysis
- Data Wrangling
- Exploratory Data Analysis
- Statistical Analysis [Optional]
- Machine Learning [Optional]

#### Tips & Tricks:
1. Create a notebook template of custom functions for analysis, visualization and modeling BEFORE starting the challenge!
2. Research the company/role. (Many companies use their own data for challenges.)
3. Develop a plan of action before coding, and ask for clarification on the prompt and data if necessary.
4. Keep track of time. (Some challenges allot only a few hours!)
5. Use functional programming whenever possible for conciseness. (Show all your work!)
6. Use intuitive dataframe names, not "df".
7. State all assumptions! (Prompts are purposely vague; you will have to make decisions based on the information provided.)
8. Explain all values/visualizations/errors/results as they pertain to the prompt provided. (Think storytelling.)
9. Outline next steps and actionable insights derived from the analyses.
10. Add comments and use markdown to explain functions (input/output) and results.

#### Developing metrics for analysis
Consider what 'health' and success mean to the particular business and what information would provide useful insights.
- What is their business model?
- Who is their target audience?
- What would they want to optimize?

A good metric should be:
- Comparative
- Understandable
- Expressed as a ratio or rate
- Behavior-changing (it changes the way you act)
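
As a rough illustration of a comparative, rate-based metric, the sketch below computes a weekly conversion rate and its week-over-week change. The `weekly_events` table, its column names and its numbers are made up for this example and do not come from any real challenge.

```python
import pandas as pd

# Hypothetical weekly counts; column names and values are illustrative only
weekly_events = pd.DataFrame({
    "week": ["2023-01-02", "2023-01-09", "2023-01-16"],
    "visitors": [1000, 1200, 1100],
    "purchases": [50, 72, 55],
})

# Ratio/rate: purchases per visitor stays comparable across weeks of different size
weekly_events["conversion_rate"] = weekly_events["purchases"] / weekly_events["visitors"]

# Comparative: relative change against the previous week (first week has no baseline)
weekly_events["rate_change"] = weekly_events["conversion_rate"].pct_change()

print(weekly_events)
```

A raw count such as `purchases` alone would rise with traffic even if the product got worse, which is why the rate is usually the more actionable number.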
    
Types of metrics include:
- Qualitative/Quantitative
- Vanity/Actionable
- Exploratory/Reporting
- Leading/Lagging
- Correlated/Causal

## [Company Name]

### Data Challenge
### Data & Features

## Table of Contents
- Initial Data Analysis
- Data Wrangling
- Exploratory Data Analysis
- Statistical Analysis
- Machine Learning

In [6]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import pandas_profiling as pp  # newer releases are published as ydata_profiling
import etsy_py
from scipy.stats import shapiro, skew, kurtosis

### Initial Data Analysis

In [None]:
def initial_analysis(df):
    """
    Given a dataframe, prints a simple report of initial data analysis
    Params:
        - df
    Prints:
        - Shape of dataframe (records, columns)
        - Columns and data types
    """
    print('Report of Initial Data Analysis:\n')
    print(f'Shape of dataframe: {df.shape}')
    print(f'Features and Data Types: \n {df.dtypes}')

In [None]:
def percent_missing(df):
    """
    Given a dataframe it calculates the percentage of missing records per column
    Params:
        - df
    Returns:
        - Dictionary of column name and percentage of missing records
    """
    col=list(df.columns)
    perc=[round(df[c].isna().mean()*100,2) for c in col]
    miss_dict=dict(zip(col,perc))
    return miss_dict
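
One possible variation, shown here as a sketch and not as a replacement for the function above: returning a sorted pandas Series makes the worst columns appear first. The helper name `percent_missing_series` and the `toy_orders` data are hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd

def percent_missing_series(df):
    """Return the percentage of missing records per column, largest first."""
    return df.isna().mean().mul(100).round(2).sort_values(ascending=False)

# Toy data for illustration only
toy_orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "price": [9.99, np.nan, 4.50, np.nan],
    "coupon": [None, None, None, "SAVE10"],
})

missing_report = percent_missing_series(toy_orders)
print(missing_report)
```

Sorting is a convenience for wide datasets; the dictionary version above carries the same numbers.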

In [None]:
def normality_test(df, col_list, alpha=0.05):
    """
    Given a dataframe, runs the Shapiro-Wilk test on each numerical column in col_list
    Ho = Distribution is Gaussian
    Ha = Distribution is not Gaussian
    Params:
        - df
        - col_list: list of numerical column names to test
        - alpha: significance level (default 0.05)
    Returns:
        - Dictionary of columns where Ho is rejected (not Gaussian) and their W statistic
    Also plots each distribution and prints (skewness, kurtosis) and the p-value
    """
    norm_dict = {}
    # Determine if each sample of numerical feature is gaussian
    for n in col_list:
        values = df[n].dropna()  # drop missing values before testing
        stat, p = shapiro(values)
        sns.histplot(values, kde=True)  # distplot is deprecated in recent seaborn
        plt.show()
        print(n, (skew(values), kurtosis(values)), f'p-value: {p}')

        if p <= alpha:  # Reject Ho -- Distribution is not normal
            norm_dict[n] = stat
    # Dictionary of numerical features not gaussian and W-Statistic
    return norm_dict

In [None]:
# Outliers
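
One common starting point for this cell is the interquartile range (IQR) rule. The sketch below is an assumption about what could go here, not part of the original template; the helper name `iqr_outliers`, the multiplier `k=1.5` and the `toy_sales` data are illustrative.

```python
import pandas as pd

def iqr_outliers(df, col, k=1.5):
    """Return rows of df whose value in col lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Rows with missing values compare as False and are therefore not flagged
    return df[(df[col] < lower) | (df[col] > upper)]

# Toy data for illustration only
toy_sales = pd.DataFrame({"amount": [10, 12, 11, 13, 12, 95]})
outlier_rows = iqr_outliers(toy_sales, "amount")
print(outlier_rows)
```

Whether flagged rows are errors or genuine extreme values depends on the prompt, so state the assumption you make before dropping or capping them.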

### Data Wrangling

In [None]:
# Impute missing values
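
A minimal sketch of simple imputation, assuming median for numeric columns and the most frequent value for everything else is acceptable for the challenge at hand. The helper name `impute_simple` and the `toy_users` data are hypothetical.

```python
import numpy as np
import pandas as pd

def impute_simple(df):
    """Return a copy with numeric NaNs filled by the median and other NaNs by the mode."""
    filled = df.copy()
    for col in filled.columns:
        if filled[col].isna().all():
            continue  # no observed values to impute from; leave the column as-is
        if pd.api.types.is_numeric_dtype(filled[col]):
            filled[col] = filled[col].fillna(filled[col].median())
        else:
            filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
    return filled

# Toy data for illustration only
toy_users = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 40.0],
    "plan": ["free", "pro", None, "free"],
})
imputed_users = impute_simple(toy_users)
print(imputed_users)
```

The original dataframe is left untouched so that the effect of imputation can be compared against the raw data.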

In [None]:
# Feature Engineering

In [None]:
# Data Formatting

### Exploratory Data Analysis

In [None]:
pp.ProfileReport(df)

In [None]:
# Seasonality

### Statistical Analysis

In [None]:
# Hypothesis Testing
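
As one possible way to fill this cell, the sketch below compares the means of two groups with Welch's t-test from `scipy.stats`. The `control` and `treatment` samples are synthetic, and the choice of test is an assumption; the right test depends on the data and the question in the prompt.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic samples for illustration only (seeded for reproducibility)
rng = np.random.default_rng(0)
control = rng.normal(loc=10.0, scale=1.0, size=200)
treatment = rng.normal(loc=12.0, scale=1.0, size=200)

alpha = 0.05
# Welch's t-test: compares two means without assuming equal variances
# Ho = the group means are equal, Ha = the group means differ
t_stat, p_value = ttest_ind(control, treatment, equal_var=False)
reject_null = p_value <= alpha

print(f"t = {t_stat:.2f}, p = {p_value:.3g}, reject Ho: {reject_null}")
```

State the hypotheses and the significance level before running the test, and report the effect size alongside the p-value when explaining the result.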

In [None]:
# Anomaly Detection

### Machine Learning

In [None]:
# Data preprocessing

In [None]:
# PCA

#### Classification

In [None]:
# Instantiate classifier

In [None]:
# Hyperparameter Tuning

In [None]:
# Evaluation

#### Regression

In [None]:
# Instantiate regressor

In [None]:
# Hyperparameter Tuning

In [None]:
# Evaluation