## 120 Data Science Interview Questions
---

1. Analyze this dataset and give me a model that can predict this response variable.

2. What could be some issues if the distribution of the test data is significantly different than the distribution of the training data?

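    - The model learns patterns that no longer hold at test time, so its predictions become unreliable and performance estimated on the training distribution is overly optimistic.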
    
    **Dataset Shift/Data Fracture**- when training and test distributions are different
    
    Types of Dataset Shift
        * Covariate Shift- shift in the independent variables
        * Prior Probability Shift- shift in the target variable
        * Concept Shift- shift in the relationship between the independent and target variable
    
    ![Covariate Shift](http://laoblogger.com/images/covariate-shift-clipart-6.jpg)
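
One rough way to spot covariate shift is to compare a feature's distribution in the training and test sets. The sketch below uses a standardized mean difference, which only catches shifts in the mean. The function name `standardized_mean_difference` and the `train_age`/`test_age` values are hypothetical, chosen purely for illustration.

```python
import statistics


def standardized_mean_difference(train_values, test_values):
    """Return |mean(train) - mean(test)| divided by the pooled standard deviation."""
    mean_train = statistics.mean(train_values)
    mean_test = statistics.mean(test_values)
    var_train = statistics.variance(train_values)
    var_test = statistics.variance(test_values)
    pooled_std = ((var_train + var_test) / 2) ** 0.5
    if pooled_std == 0:
        # Both samples are constant: either identical or infinitely far apart.
        return 0.0 if mean_train == mean_test else float("inf")
    return abs(mean_train - mean_test) / pooled_std


# Hypothetical feature values: the test set is centered far from the training set.
train_age = [22, 25, 27, 30, 31, 35, 38, 40]
test_age = [48, 52, 55, 57, 60, 63, 66, 70]

shift_score = standardized_mean_difference(train_age, test_age)
print(f"standardized mean difference: {shift_score:.2f}")
```

A score near zero suggests the feature is distributed similarly in both sets. What counts as "too large" depends on the problem and is left as a judgment call here.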

3. What are some ways I can make my model more robust to outliers?

    - Use a model resistant to outliers. Tree-based models are generally not as affected by outliers, while regression-based models are.
    - Use a more robust error metric. Minimizing the sum of absolute errors instead of the sum of squared errors reduces the influence of outliers.
    - Ensemble Methods
    
    Data Changes
        - Remove the outliers
        - Transform the data (log, etc...)
        - Winsorize- replace a specified share of the most extreme values with less extreme ones, for example by capping everything above the 95th percentile at the 95th-percentile value, so that outliers end up close to the other values in the set
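
Winsorizing can be sketched as clamping values that fall outside chosen percentile bounds. This is a minimal illustration, not a reference implementation: the `winsorize` function, its default bounds, and the rounding-based percentile lookup are all assumptions, and other percentile conventions would give slightly different cutoffs.

```python
def winsorize(values, lower_pct=0.05, upper_pct=0.95):
    """Clamp values outside the given percentile bounds to the bounds themselves."""
    if not values:
        return []
    ordered = sorted(values)
    last = len(ordered) - 1
    # Rounding-based percentile lookup (one of several possible conventions).
    low = ordered[int(round(lower_pct * last))]
    high = ordered[int(round(upper_pct * last))]
    return [min(max(v, low), high) for v in values]


# Hypothetical data with one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
clipped = winsorize(data, lower_pct=0.1, upper_pct=0.9)
print(clipped)
```

Unlike removing outliers, this keeps the number of observations unchanged, which can matter when rows carry other useful features.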
    
4. What are some differences you would expect in a model that minimizes squared error, versus a model that minimizes absolute error? In which cases would each error metric be appropriate?

    MAE is more robust to outliers because each error contributes in proportion to its size. MSE squares the errors, so it is more useful when large errors have disproportionately severe consequences compared with small ones.
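
The difference shows up even for the simplest possible model, a single constant prediction: the constant that minimizes squared error is the mean, while the constant that minimizes absolute error is the median. The sketch below uses made-up `targets` with one outlier to show how far the outlier drags each one.

```python
import statistics


def mse(values, prediction):
    """Mean squared error of a constant prediction."""
    return sum((v - prediction) ** 2 for v in values) / len(values)


def mae(values, prediction):
    """Mean absolute error of a constant prediction."""
    return sum(abs(v - prediction) for v in values) / len(values)


# Hypothetical targets with one extreme outlier.
targets = [10, 11, 12, 13, 100]

mean_pred = statistics.mean(targets)      # best constant under squared error
median_pred = statistics.median(targets)  # best constant under absolute error

print(f"mean={mean_pred}, median={median_pred}")
print(f"MSE at mean={mse(targets, mean_pred):.1f}, at median={mse(targets, median_pred):.1f}")
print(f"MAE at mean={mae(targets, mean_pred):.1f}, at median={mae(targets, median_pred):.1f}")
```

The mean is pulled well above four of the five points, while the median stays with the bulk of the data.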
    
5. What error metric would you use to evaluate how good a binary classifier is? What if the classes are imbalanced? What if there are more than 2 groups?
    - Simple accuracy (TP + TN) / (TP + TN + FP + FN)  
    
    When the dataset is imbalanced, accuracy is misleading (a classifier that always predicts the majority class can still score highly), so use a different performance measure.
    - Sensitivity/Recall- measures the ability of a test to detect the condition when the condition is present: TP / (TP + FN)
    - Specificity- measures the ability of a test to not detect the condition when the condition is absent: TN / (TN + FP)
    - Precision- the fraction of positive predictions that are correct: TP / (TP + FP)
    - Error Rate- 1 - Accuracy, i.e. (FP + FN) / (TP + TN + FP + FN)
    - F1- $\frac{2}{(\frac{1}{Precision}) + (\frac{1}{Recall})}$
    - ROC
    - AUC
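
The count-based metrics above can be computed directly from the four confusion-matrix counts. This is a small sketch with a hypothetical helper name, `classification_metrics`, and made-up counts. It assumes every denominator is non-zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute common binary-classification metrics from confusion-matrix counts.

    Assumes each denominator is non-zero (at least one actual positive,
    one actual negative, and one positive prediction).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": recall,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical imbalanced problem: 10 actual positives, 990 actual negatives.
metrics = classification_metrics(tp=2, tn=985, fp=5, fn=8)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy here is very high even though the classifier finds only 2 of the 10 positives, which is why recall, precision, and F1 are more informative on imbalanced data.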

## Statistics
---

1. What is the Central Limit Theorem and why is it important?
    
    If we repeatedly draw samples of a sufficiently large size from a population with finite variance (n = 30 is a common rule of thumb), the distribution of the sample means will be approximately normal, regardless of the shape of the original population's distribution. This is important because, even when we do not know the distribution our data comes from, we can treat the sampling distribution of the mean as approximately normal, which is what justifies many confidence intervals and hypothesis tests. More info [here](https://www.thoughtco.com/importance-of-the-central-limit-theorem-3126556).  
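
A quick simulation can make this concrete. The sketch below draws many samples of size 30 from an exponential distribution (strongly skewed, with mean 1 and standard deviation 1) and looks at the spread of the sample means. The constants `SAMPLE_SIZE` and `NUM_SAMPLES` and the seed are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 30
NUM_SAMPLES = 5000

# Exponential(1) is right-skewed, with mean 1 and standard deviation 1.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_SAMPLES)
]

mean_of_means = statistics.fmean(sample_means)
std_of_means = statistics.stdev(sample_means)
expected_std = 1.0 / SAMPLE_SIZE ** 0.5  # sigma / sqrt(n)

print(f"mean of sample means: {mean_of_means:.3f} (population mean is 1)")
print(f"std of sample means:  {std_of_means:.3f} (theory: {expected_std:.3f})")
```

The sample means cluster around the population mean with a spread close to sigma / sqrt(n), even though the individual draws are far from normal.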
    
2. What is sampling? How many sampling methods do you know?
    
    Sampling is the selection of a subset of observations from a population to estimate characteristics of the whole population.
    
    * Sampling Methods
    
        * Simple Random Sampling (SRS)- each observation of the population has an equal chance of being selected
        * Stratified Sampling- divide the population into homogeneous groups (strata), then draw a probability sample from each group
        * Cluster Sampling- divide the population into naturally occurring groups (clusters), then select an SRS of clusters
        * Systematic Sampling
        * Multistage Sampling
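
The first two methods can be sketched with the standard library. The helper names `simple_random_sample` and `stratified_sample`, and the urban/rural population, are hypothetical. The stratified version assumes proportional allocation, which is only one of several ways to size each stratum's sample.

```python
import random


def simple_random_sample(population, k, rng):
    """Draw k items without replacement; every item is equally likely."""
    return rng.sample(population, k)


def stratified_sample(population, key, fraction, rng):
    """Draw the same fraction from each stratum (proportional allocation)."""
    strata = {}
    for item in population:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample


rng = random.Random(42)
# Hypothetical population: 80 urban and 20 rural respondents.
population = [("urban", i) for i in range(80)] + [("rural", i) for i in range(20)]

srs = simple_random_sample(population, 10, rng)
stratified = stratified_sample(population, key=lambda item: item[0], fraction=0.1, rng=rng)

print(sum(1 for group, _ in srs if group == "rural"), "rural in the SRS")
print(sum(1 for group, _ in stratified if group == "rural"), "rural in the stratified sample")
```

The SRS may over- or under-represent the rural group by chance, while the stratified sample always contains it in proportion.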
        
3. What is the difference between Type I and Type II error?

    Type I Error- "False Positive": detects the condition when the condition is absent  
    Type II Error- "False Negative": does not detect the condition when the condition is present
    
4. What is Linear Regression? What do the terms P-value, coefficient, and $R^2$ mean?

    - Linear regression is a supervised learning method for modeling and predicting a continuous numeric target as a linear combination of the input features.
    - Pros: straightforward to understand, and can be updated easily with new data using stochastic gradient descent (SGD).
    - Cons: performs poorly when the relationships are non-linear.
    
Regularization- technique for penalizing large coefficients in order to avoid overfitting
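
To illustrate the shrinking effect, here is a minimal sketch of L2 (ridge) regularization for a one-feature regression without an intercept. Minimizing `sum((y - w*x)^2) + penalty * w^2` gives the closed form `w = sum(x*y) / (sum(x*x) + penalty)`. The function name `ridge_slope` and the data are made up for the example.

```python
def ridge_slope(xs, ys, penalty):
    """Slope w minimizing sum((y - w*x)^2) + penalty * w^2 (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)


# Hypothetical data lying roughly on the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

penalties = (0.0, 1.0, 10.0, 100.0)
slopes = [ridge_slope(xs, ys, p) for p in penalties]
for p, w in zip(penalties, slopes):
    print(f"penalty={p:>5}: slope={w:.3f}")
```

With a penalty of zero this is ordinary least squares. As the penalty grows, the coefficient is pulled toward zero.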

## Random
---
No Free Lunch Theorem- no one algorithm works best for every problem

- True Positive (TP): detects the condition when the condition is present
- True Negative (TN): does not detect the condition when the condition is absent
- False Positive (FP): detects the condition when the condition is absent
- False Negative (FN): does not detect the condition when the condition is present



## Programming (CTCI)
---
### Chapter 1- Arrays and Strings

- Hash Tables: data structure that maps keys to values for highly efficient lookup
    - The hash table has an underlying array and a hash function that maps the key to an integer (which indicates the index in the array)
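    - To handle collisions, we store a linked list of entries at each index of the underlying array (chaining)
    - Worst-case lookup runtime- O(N), when many keys collide in the same bucket
    - Average- O(1)
    - We can generally assume a good implementation keeps collisions to a minimum, so lookup is O(1)
    - Could also implement with a balanced binary search tree. This guarantees an O(log N) lookup time and potentially uses less space, since no large array needs to be allocated up front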
    
    
- ArrayList & Resizable Arrays
    - In some languages, arrays (lists) are automatically resizable. The list grows as you append items.
    - O(1) access by index; O(N) search; O(N) insertion or deletion at an arbitrary position (appending at the end is amortized O(1))
    
- StringBuffer/StringBuilder
    - Creates a resizable array of all the strings, copying them back to a string only when necessary, instead of creating a new copy of the string on every concatenation (which takes O(x$N^2$) time to concatenate N strings of length x)
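
The StringBuilder idea can be sketched in Python by collecting pieces in a list and joining them once at the end. The class below is an illustrative assumption rather than a standard library type, and its method names (`append`, `to_string`) are chosen to mirror the Java API loosely.

```python
class StringBuilder:
    """Collect string pieces in a list and join them once at the end."""

    def __init__(self):
        self._parts = []

    def append(self, text):
        self._parts.append(text)  # amortized O(1); no string copy yet
        return self               # allow chained calls

    def to_string(self):
        return "".join(self._parts)  # single pass over all pieces


words = ["data", "science", "interview"]
builder = StringBuilder()
for word in words:
    builder.append(word).append(" ")
result = builder.to_string().strip()
print(result)
```

In everyday Python, calling `"".join(parts)` on a plain list achieves the same thing without a wrapper class.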
    
TODO- implement HashTable, ArrayList, StringBuilder
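
As a possible starting point for the hash table part of the TODO, here is a minimal sketch with separate chaining. It makes a few simplifying assumptions: each bucket is a Python list of `(key, value)` pairs rather than a linked list, the table never resizes, and the method names (`put`, `get`, `remove`) are arbitrary.

```python
class HashTable:
    """Hash table with separate chaining; each bucket is a list of (key, value) pairs."""

    def __init__(self, capacity=8):
        self._buckets = [[] for _ in range(capacity)]
        self._size = 0

    def _bucket_for(self, key):
        # The hash function maps the key to an index in the underlying array.
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket_for(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))
        self._size += 1

    def get(self, key, default=None):
        for existing_key, value in self._bucket_for(key):
            if existing_key == key:
                return value
        return default

    def remove(self, key):
        bucket = self._bucket_for(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                del bucket[i]
                self._size -= 1
                return True
        return False

    def __len__(self):
        return self._size


table = HashTable(capacity=2)  # tiny capacity forces collisions
for key, value in [("a", 1), ("b", 2), ("c", 3), ("a", 10)]:
    table.put(key, value)
print(table.get("a"), table.get("b"), table.get("missing"), len(table))
```

Because the capacity is fixed, chains grow with the number of keys and lookups degrade toward O(N). A fuller implementation would resize the array once the load factor passes some threshold.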
   
   
### Chapter 2- Linked Lists
Data Structure that represents a sequence of nodes

Singly Linked List- each node stores a value and points to the next node in the linked list
![Singly LL](https://www.geeksforgeeks.org/wp-content/uploads/gq/2013/03/Linkedlist.png)
Doubly Linked List- gives each node pointers to both the next node and the previous node
![Doubly LL](https://www.geeksforgeeks.org/wp-content/uploads/gq/2014/03/DLL1.png)

Unlike an array, a linked list does not provide constant time access to a particular "index" within the list. If you want to find the Kth element, you have to iterate through K elements.

Benefit: add/remove items from the beginning of the list in constant time

Creating a Singly Linked List

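```python
class Node:
    """A single node holding a value and a pointer to the next node."""

    def __init__(self, data):
        self.data = data
        self.next = None


class LinkedList:
    def __init__(self):
        self.head = None
        self.tail = None

    def add(self, data):
        """Append a node to the end of the list in O(1) time."""
        new_node = Node(data)
        if self.head is None:
            self.head = new_node
        else:
            self.tail.next = new_node
        self.tail = new_node

    def remove(self, index):
        """Remove the node at the given zero-based index in O(index) time."""
        prev = None
        node = self.head
        i = 0

        # Walk to the node at `index`, remembering the node before it.
        while node is not None and i < index:
            prev = node
            node = node.next
            i += 1
        if node is None or index < 0:
            raise IndexError("linked list index out of range")
        if prev is None:
            self.head = node.next
        else:
            prev.next = node.next
        # Keep the tail pointer valid when the last node is removed.
        if node is self.tail:
            self.tail = prev
```
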
## References 
---

Shan, Carl, et al. *120 Data Science Interview Questions*.

McDowell, Gayle Laakmann. *Cracking the Coding Interview: 189 Programming Interview Questions and Solutions*. CareerCup, 2015.