# Interview Practice for Data Analyst Job 

August 2017, by Jude Moon

## Practice Overview
Employers use interviews to judge your readiness and fit for the job, which includes hearing about your skills and interest in the role. The interview is not a test or exam, but a conversation between you and the employer. Build your own strategies to be prepared come interview day. This project is one of many ways for you to practice!

### 1. Describe a data project you worked on recently.

One of the most challenging projects in the Data Analyst Nanodegree Program was a Machine Learning (ML) project on detecting fraud from the Enron email and financial datasets. Exploring and cleaning the datasets was easy for me, since I already had experience in data exploration from other Nanodegree projects and academic research. The challenge was figuring out where to start among the many algorithms available. The task was to build a person of interest (POI) identifier from labeled data using Python. For supervised classification, I found 6 common algorithms in the scikit-learn library, and I engineered 9 feature lists from the original feature list. Pairing every feature list with every classifier meant 60 combinations to test. Instead of scripting procedures to run all combinations, I stepped back and researched the pros, cons, and proper uses of each algorithm. I then prioritized the top 3 algorithms and feature lists and documented the rationale behind the choice. I wrote procedures and loops that used a pipeline and grid search to select features and optimize hyperparameters. After about 20 trials, I had a combination that gave good enough performance scores. It could have been an exhausting trial-and-error process that took much longer, but by spending some time understanding the algorithms first, I finished within two days.

From this, I learned how to approach and plan model building when many algorithms and features are available. I think the knowledge and skills I gained from the ML course and project are transferable. They broaden my ability to build problem-solving and decision-making models for patient-related information and experience issues, as well as for clinic and hospital risks. Examples include models that use features recorded from past cases to identify faulty information in electronic health records (EHR) and ineffective treatments. If we construct such models successfully, they can be implemented in the EHR system to flag errors in real time. I would expect these models to improve patient service quality and reduce hospital risks.


----

### 2. You are given a ten piece box of chocolate truffles. You know based on the label that six of the pieces have an orange cream filling and four of the pieces have a coconut filling. If you were to eat four pieces in a row, what is the probability that the first two pieces you eat have an orange cream filling and the last two have a coconut filling?

* Probability that the 1st and 2nd pieces are orange: $\frac{6}{10} \times \frac{5}{9}$
* Probability that the 3rd and 4th pieces are coconut, given the first two were orange: $\frac{4}{8} \times \frac{3}{7}$
* Probability that the 1st and 2nd pieces are orange and the 3rd and 4th pieces are coconut: $\frac{6}{10} \times \frac{5}{9} \times \frac{4}{8} \times \frac{3}{7} = \frac{1}{14} \approx 0.0714$
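
The product above can be checked with exact fractions. The following is a minimal sketch; the helper name `sequence_probability` and the `'O'`/`'C'` encoding are illustrative choices, not part of the original question.

```python
from fractions import Fraction

# Box contents from the question: 6 orange cream and 4 coconut pieces.
ORANGE, COCONUT = 6, 4

def sequence_probability(sequence, orange=ORANGE, coconut=COCONUT):
    """Exact probability of eating the given flavor sequence without replacement.

    `sequence` is a string of 'O' (orange) and 'C' (coconut), e.g. 'OOCC'.
    """
    remaining = {"O": orange, "C": coconut}
    probability = Fraction(1)
    for flavor in sequence:
        total = remaining["O"] + remaining["C"]
        # Chance of picking this flavor from what is left in the box.
        probability *= Fraction(remaining[flavor], total)
        remaining[flavor] -= 1
    return probability

p_oocc = sequence_probability("OOCC")
print(p_oocc, float(p_oocc))  # 1/14, about 0.0714
```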


### Follow up question: If you were given an identical box of chocolates and again eat four pieces in a row, what is the probability that exactly two contain coconut filling?

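* The $2 \times 2 \times 2 \times 2 = 16$ orange/coconut sequences are not equally likely, because the pieces are eaten without replacement, so the answer is not simply a ratio of sequence counts.
* Number of orderings with two coconut and two orange pieces: $\binom{4}{2} = 6$
* Probability of any one such ordering (the same factors as in the first answer, in a different order): $\frac{6 \times 5 \times 4 \times 3}{10 \times 9 \times 8 \times 7} = \frac{1}{14}$
* Probability of exactly two coconut pieces: $6 \times \frac{1}{14} = \frac{3}{7} \approx 0.4286$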
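
The follow-up can be cross-checked by brute force. The sketch below assumes that every ordered draw of four distinct pieces from the box is equally likely, enumerates all of them, and counts the draws with exactly two coconut pieces. It then compares the result with the hypergeometric formula $\binom{4}{2}\binom{6}{2} / \binom{10}{4}$. The variable names are illustrative.

```python
from fractions import Fraction
from itertools import permutations
from math import comb

pieces = ["O"] * 6 + ["C"] * 4  # 6 orange cream, 4 coconut

# Every ordered way to eat 4 distinct pieces out of the 10 in the box.
draws = list(permutations(range(len(pieces)), 4))

# Keep the draws in which exactly two of the four pieces are coconut.
favorable = sum(1 for draw in draws if sum(pieces[i] == "C" for i in draw) == 2)
p_enumerated = Fraction(favorable, len(draws))

# Hypergeometric formula: choose 2 of 4 coconut and 2 of 6 orange, out of all 4-piece subsets.
p_formula = Fraction(comb(4, 2) * comb(6, 2), comb(10, 4))

print(p_enumerated, p_formula, float(p_formula))  # 3/7 both ways, about 0.4286
```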

----

### 3. Given the table users:


<center>Table "users"</center>


| Column   | Type      |
|----------|-----------|
| id       | integer   |
| username | character |
| email    | character |
| city     | character |
| state    | character |
| zip      | integer   |
| active   | boolean   |

### Construct a query to find the top 5 states with the highest number of active users. Include the number for each state in the query result. Example result:

| state      | num_active_users |
|------------|------------------|
| New Mexico | 502              |
| Alabama    | 495              |
| California | 300              |
| Maine      | 201              |
| Texas      | 189              |


    SELECT state, COUNT(*) AS num_active_users
    FROM users
    WHERE active
    GROUP BY state
    ORDER BY num_active_users DESC
    LIMIT 5;
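
One way to try a query like this without a database server is Python's built-in `sqlite3` module. The sketch below builds an in-memory `users` table with a few hypothetical rows (the sample data is invented for illustration) and counts active rows with `WHERE active` rather than summing the boolean column, since some databases, such as PostgreSQL, do not define `SUM` for booleans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        username TEXT, email TEXT, city TEXT, state TEXT,
        zip INTEGER, active BOOLEAN
    )
""")

# Hypothetical sample rows as (state, active); SQLite stores booleans as 0/1.
sample = [
    ("Texas", 1), ("Texas", 1), ("Texas", 0),
    ("Maine", 1), ("Maine", 0),
    ("Alabama", 0),
]
conn.executemany("INSERT INTO users (state, active) VALUES (?, ?)", sample)

query = """
    SELECT state, COUNT(*) AS num_active_users
    FROM users
    WHERE active
    GROUP BY state
    ORDER BY num_active_users DESC
    LIMIT 5
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('Texas', 2), ('Maine', 1)]
```

States with no active users (Alabama in this toy data) do not appear in the result, because the `WHERE` clause removes their rows before grouping.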


----

### 4. Define a function first_unique that takes a string as input and returns the first non-repeated (unique) character in the input string. If there are no unique characters return None. Note: Your code should be in Python.

```
def first_unique(string):
 # Your code here
 return unique_char

> first_unique('aabbcdd123')
> c

> first_unique('a')
> a

> first_unique('112233')
> None

```

    from collections import defaultdict

    def first_unique(word):
        # Count how many times each character appears.
        counts = defaultdict(int)
        for c in word:
            counts[c] += 1

        # Scan the input in its original order so the first unique character wins,
        # regardless of how the dictionary orders its keys.
        for c in word:
            if counts[c] == 1:
                return c

        return None  # no character appears exactly once

Checking the examples from the prompt:

    print(first_unique('aabbcdd123'))  # c
    print(first_unique('a'))           # a
    print(first_unique('112233'))      # None
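
A more compact variant is possible with `collections.Counter`. This is a sketch of an alternative, not a replacement for the answer above; the name `first_unique_counter` is hypothetical. It returns the `None` object (not the string `'None'`) when no character is unique, as the prompt asks.

```python
from collections import Counter

def first_unique_counter(text):
    """Return the first character that occurs exactly once in `text`, else None."""
    counts = Counter(text)  # character -> number of occurrences
    # Walk the input in order and stop at the first character seen exactly once.
    return next((ch for ch in text if counts[ch] == 1), None)

print(first_unique_counter('aabbcdd123'))  # c
print(first_unique_counter('112233'))      # None
```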

----

### 5. What are underfitting and overfitting in the context of Machine Learning? How might you balance them?

Both underfitting and overfitting lead to poor generalization to new data. An underfit model also performs poorly on the training data, while an overfit model performs well on the training data. Underfitting is easy to detect from the evaluation metrics, and the usual response is to move on and try other algorithms. Overfitting can be controlled by adjusting parameters that limit how much detail and noise the model learns from the training data.
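
The contrast can be illustrated on synthetic data. The sketch below is an assumption-laden toy, not part of the original answer: it uses a hand-written k-nearest-neighbors regressor on noisy sine data, where `k` acts as the complexity parameter. With `k=1` the model reproduces every training target exactly (training error of zero, the overfitting extreme), and with `k=len(train)` it predicts the overall mean everywhere (the underfitting extreme).

```python
import math
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mse(train, points, k):
    """Mean squared error of the k-NN model built from `train`, scored on `points`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in points) / len(points)

rng = random.Random(0)  # fixed seed so the run is repeatable
# Synthetic data (an assumption for illustration): a sine curve plus Gaussian noise.
data = [(i / 60, math.sin(2 * math.pi * i / 60) + rng.gauss(0, 0.3)) for i in range(60)]
rng.shuffle(data)
train, test = data[:40], data[40:]

# Small k follows the noise, large k ignores the signal; compare train vs. test error.
for k in (1, 5, len(train)):
    print(f"k={k:2d}  train MSE={mse(train, train, k):.3f}  test MSE={mse(train, test, k):.3f}")
```

Comparing the training and test columns for each `k` is the diagnostic: a large gap between them points toward overfitting, while high error in both points toward underfitting.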

To balance performance on the training data against performance on unseen data, we can use validation. The simplest approach is to hold back a subset of the training data from training and tuning. Once the algorithm has been selected and tuned, evaluating it on the held-back subset gives an estimate of its performance on unseen data. When the dataset is small, cross-validation methods such as KFold and ShuffleSplit evaluate the algorithm on multiple different train/test splits and report the average of the evaluation metrics.
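
The idea behind k-fold splitting can be sketched in plain Python. This is an illustration of the technique under simplifying assumptions (no shuffling, a trivial "predict the training mean" model, toy targets); it is not scikit-learn's implementation, and the function names are hypothetical.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for each fold, in order, without shuffling."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)  # spread the remainder over the first folds
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop

def cross_validated_mse(targets, n_folds):
    """Average validation MSE of a baseline that always predicts the training mean."""
    fold_scores = []
    for train, validation in k_fold_indices(len(targets), n_folds):
        prediction = sum(targets[i] for i in train) / len(train)
        errors = [(targets[i] - prediction) ** 2 for i in validation]
        fold_scores.append(sum(errors) / len(errors))
    return sum(fold_scores) / len(fold_scores)

targets = [1, 2, 3, 4, 5, 6]  # hypothetical toy data
print(list(k_fold_indices(len(targets), 3)))
print(cross_validated_mse(targets, 3))  # 6.25
```

Every sample lands in exactly one validation fold, so each data point is used for both training and evaluation, which is what makes the approach attractive for small datasets.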


----

### 6. If you were to start your data analyst position today, what would be your goals a year from now?

I would like to work on managing health care databases and data systems in the HealthTech Hub community in Austin, TX. My goal is to fully understand health-related database systems and metrics and to provide high-quality reporting that improves service administration and patient experience.

One-year Goals:

1. Understanding current database designs and metrics and becoming proficient in managing them
2. Developing and implementing databases, data collection systems, and data metrics and reporting standards
3. Building models and conducting analyses on patient experience to support the 3-year strategic agenda goals
