### Statistical Foundations 

**What are the two parameters that specify the Normal Distribution?**

**Assume that $y_i \sim N(\mu_i, \sigma_i)$, where $\mu_i$ and $\sigma_i$ are unknown, and that you can only access quantiles of $y_i$. How would you estimate $\mu_i$ and $\sigma_i$?**

*HINT: What is the minimum number of quantiles required to estimate $\mu$ and $\sigma$, and thus to learn a distribution over $y_i$?*
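One possible line of attack, sketched under the assumption that the quantile levels are known: every normal quantile satisfies $q_p = \mu + \sigma z_p$, where $z_p$ is the standard normal quantile, so two distinct quantiles give two linear equations in two unknowns. The function name `normal_params_from_quantiles` and the example numbers are illustrative, not part of the question.

```python
from statistics import NormalDist


def normal_params_from_quantiles(p1, q1, p2, q2):
    """Recover (mu, sigma) of a normal from two quantiles at distinct levels p1 != p2."""
    # Standard normal quantiles for the two probability levels
    z1 = NormalDist().inv_cdf(p1)
    z2 = NormalDist().inv_cdf(p2)
    # Solve q1 = mu + sigma * z1 and q2 = mu + sigma * z2
    sigma = (q2 - q1) / (z2 - z1)
    mu = q1 - sigma * z1
    return mu, sigma


# Illustrative check against a normal with known (hypothetical) parameters
reference = NormalDist(mu=3.0, sigma=2.0)
mu_hat, sigma_hat = normal_params_from_quantiles(
    0.1, reference.inv_cdf(0.1), 0.9, reference.inv_cdf(0.9)
)
```

With more than two quantiles, or with noisy quantile estimates, the same linear relation could instead be fit by least squares.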

**Given feature vectors and your choice of any function approximator, how would you learn a distribution over your predictions?**

**Assume you have $N$ observations of $x_i$ and a prior that $x_i \sim N(\mu_p, \sigma_p)$. How would you combine your observations and the prior to estimate the parameters of $x_i$'s distribution? Suppose an analyst proposed the probabilistic program below, which places hierarchical priors on those parameters. How would you evaluate it?**
```
model {
  // prior distributions
  mu ~ normal(mu_prior_loc, mu_prior_scale);
  sigma ~ normal(sigma_prior_loc, sigma_prior_scale);

  // likelihood
  errors ~ normal(mu, sigma);
}
```
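One piece of such an evaluation could be a prior predictive check: draw parameters from the priors, simulate data, and see whether the draws look plausible. The sketch below does this in plain Python with hypothetical hyperparameter values. It also counts how often an unconstrained normal prior on `sigma` produces a non-positive scale. The model block above does not show how `sigma` is declared, so whether it is bounded below at zero (which would truncate the prior) is an open question to raise with the analyst.

```python
import random


def prior_predictive(mu_loc, mu_scale, sigma_loc, sigma_scale, n_draws=20000, seed=0):
    """Simulate from the priors; return (fraction of invalid sigma draws, simulated errors)."""
    rng = random.Random(seed)
    n_invalid = 0
    simulated = []
    for _ in range(n_draws):
        mu = rng.gauss(mu_loc, mu_scale)
        sigma = rng.gauss(sigma_loc, sigma_scale)
        if sigma <= 0:
            # An unconstrained normal prior can propose a non-positive scale
            n_invalid += 1
            continue
        simulated.append(rng.gauss(mu, sigma))
    return n_invalid / n_draws, simulated


# Hypothetical hyperparameters, chosen only for illustration
frac_invalid, sims = prior_predictive(0.0, 1.0, 1.0, 1.0)
```

A non-trivial `frac_invalid` suggests the prior on `sigma` needs a positivity constraint or a different family, and the spread of `sims` can be compared against what is plausible for the observed errors.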

### Machine Learning Foundations & System Design 

**How do you diagnose and fix overfitting in traditional supervised learning systems?**

**How do you diagnose and fix overfitting in settings of extreme non-stationarity?**

**You have a massively imbalanced dataset. How do you evaluate the results of your classifier in this setting? If the model is not learning the minority labels, what are some things you can do to help it?**
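As a small illustration of why the choice of metric matters here, the sketch below computes accuracy, precision, recall, and F1 for a hypothetical classifier that always predicts the majority class. The data and the helper name `binary_metrics` are made up for the example.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = minority class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical data: 95 negatives, 5 positives, and a model that always predicts 0
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
metrics = binary_metrics(y_true, y_pred)
```

Accuracy looks strong in this example while recall on the minority class is zero, which is the failure mode the question is probing.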


**What is your process and framework for fixing bugs in machine learning systems?**

**Can you describe the hairiest ML bug you had to fix to get a probabilistic system to give proper outputs?**

**You have access to raw transaction level datasets of some of the largest organizations in the world. How would you use that transaction level information to build _interpretable_ forecasts for organizations' cash inflows/outflows?**

### Data Munging: Parallelizing Data Frame Operations 


Often when using pandas for data pre-processing, you encounter transformations that cannot be vectorized. In many such cases, these operations are trivially parallelizable across either the row or column axis. In this exercise, we present one such case of the many we have encountered internally. Your job is to parallelize this transformation. If you cannot achieve a speedup, explain why Dask or the method of your choice can't provide one.

    import pandas as pd

    df = pd.read_csv('dataframe.csv')

Given `dataframe.csv`, for each `id` and `date` (i.e., each row), calculate the median of all of the `value` entries associated with that `id` before the `date` in that row. In unoptimized pandas, this looks something like:

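```
# Assumes the `date` column can be parsed as datetimes
df['date'] = pd.to_datetime(df['date'])

def get_median_value_for_id_before_given_date(ID, date: pd.Timestamp, cache: pd.DataFrame) -> float:
    # All rows for this id with a strictly earlier date
    group = cache[(cache['id'] == ID) & (cache['date'] < date)]
    # NaN when the id has no earlier rows
    return group['value'].median()

# Naive baseline: one full scan of the frame per row
df['value_summary_stats'] = df.apply(
    lambda x: get_median_value_for_id_before_given_date(x['id'], x['date'], df),
    axis=1
)
```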
 
How would you speed up this unoptimized code? Feel free to write code!
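One possible direction, offered as a sketch rather than a reference solution: within a single `id`, process rows in date order and keep the values seen so far in a sorted list, so each row's median is read off directly instead of rescanning the whole frame. The helper name `medians_before` and the sample data are hypothetical. Rows that share a date are handled together so that "before" stays strict, matching the `<` comparison in the baseline.

```python
import math
from bisect import insort
from itertools import groupby


def medians_before(dates, values):
    """For each row, the median of values with a strictly earlier date (NaN if none)."""
    order = sorted(range(len(dates)), key=lambda i: dates[i])
    seen = []  # sorted values from strictly earlier dates
    result = [math.nan] * len(dates)
    for _, same_date in groupby(order, key=lambda i: dates[i]):
        idxs = list(same_date)
        if seen:
            n = len(seen)
            mid = n // 2
            med = seen[mid] if n % 2 else (seen[mid - 1] + seen[mid]) / 2
            for i in idxs:
                result[i] = med
        # Only now make this date's values visible to later dates
        for i in idxs:
            insort(seen, values[i])
    return result


# Hypothetical rows for a single id (ISO date strings sort chronologically)
dates = ["2021-01-03", "2021-01-01", "2021-01-02", "2021-01-02", "2021-01-04"]
values = [10, 1, 5, 7, 100]
out = medians_before(dates, values)
```

Because each `id` is independent, this function could be applied per group (for example via `df.groupby('id')`), and the groups could then be distributed across workers with Dask or multiprocessing. Whether the parallel step pays off depends on group sizes and scheduling overhead, which is worth measuring rather than assuming.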
