# DATA SCIENCE INTERVIEW MATERIALS

Taken from this [reddit link](https://www.reddit.com/r/cscareerquestions/comments/1jov24/heres_how_to_prepare_for_tech_interviews/).

__NOTE:__ Some of this information overlaps with the Unit 5 Notes; it is repeated here for completeness.

## Table of Contents

1. __Data Structures__: Description, example derivation code, big-O values for delete, insert, lookup, etc. 
    - List/Array (index)
    - Linked List (no index; each value is defined in relation to its adjacent ('linked') values)
        - Singly linked vs. Doubly-linked
    - Trees 
        - Types
            - Basic tree
            - Binary tree
            - Binary Search tree
            - Red-Black tree 
            - Cartesian tree
            - B-tree
            - Splay tree
            - AVL tree
            - KD tree
        - depth-first vs. breadth-first
    - Heap
    - Hash Table (__VERY IMPORTANT!__ Need to know different collision mitigation mechanisms, amortized constant-time)
    - Trie (originally pronounced "tree"; often pronounced "try" to distinguish it from "tree")
    - Linked Hash Map
        - Two kinds of interview questions: (1) knowing which one to use to code, and (2) comparison questions, such as "why would you use X over Y in situation Z?"
        - [Data Cheat Sheet for all Big-O](https://www.bigocheatsheet.com/)
        <br><br>
2. __Algorithms__: Description, example derivation code, big-O values for _each_ kind of algorithm for _each_ data structure.
    - Array Sorting (non-comparison vs. comparison sorting)
        - Again, [Data Cheat Sheet for Big-O of Array Sorting ONLY](https://www.bigocheatsheet.com/)
    - Tree (in-order, pre-order, post-order, level-order)
    - Prefix tree (trie) searches
    - Graph searches: Dijkstra, A*, breadth-first search vs. depth-first search
    <br><br>
3. __Bits vs. Bytes__
    - "Bitshifting"
    - Big and Little Endian
    - "Write a method to determine whether the bit-wise representation of an integer is a palindrome"
<br><br>
4. __How the Internet Works__
    - Sockets
    - TCP/IP
    - HTTP
    - Networking Layers and their respective responsibilities
    <br><br>
5. __Databases & SQL__
    - See "SQL" Notes - will not be in this noteset
<br><br>
6. __Basics of Testing__
    - What is TDD?
    <br><br>
7. __Linux__
    - Shell Scripting Basics<br>
8. __Common questions__
    - Fizzbuzz
    - Reversing a string
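
Item 3 above quotes a classic bit-manipulation prompt. One possible answer is sketched below; it assumes non-negative integers and ignores leading zeros, and the function name `is_bit_palindrome` is illustrative rather than taken from the original notes:

```python
def is_bit_palindrome(n):
    """Return True if the binary digits of n read the same in both directions."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    original = n
    mirrored = 0
    while n > 0:
        mirrored = (mirrored << 1) | (n & 1)  # append the lowest bit of n
        n >>= 1                               # drop that bit from n
    return mirrored == original


print(is_bit_palindrome(9))   # 9 is 0b1001
print(is_bit_palindrome(10))  # 10 is 0b1010
```

The loop only uses shifts and masks, so it also doubles as practice for the "bitshifting" bullet.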

***
# QUESTIONS TO ANSWER

1. What is "data mining"?
2. What is "operations research"?
3. How does hybrid boosting reduce bias and variance?

# Sample Coding Questions

## 1. "Degree" of an Array

The ___degree___ of an array is the "maximum frequency" of any one of the elements within the array.  

__Question:__ Output the _length_ of the smallest contiguous subarray that has the same degree as the full array.

__Example:__ For the array \[1, 2, 2, 3, 1, 4, 2\], the degree is 3, because the element 2 appears three times and no element appears more often. The shortest contiguous subarray that still contains all three 2s is \[2, 2, 3, 1, 4, 2\]. Dropping an element from either end leaves \[2, 3, 1, 4, 2\] or \[2, 2, 3, 1, 4\], each of which has a degree of only 2.

Therefore, the shortest length of a subarray of the array with a degree of 3 is ___6___.
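
The degree on its own can be computed in one line with `collections.Counter`. This is a small sketch for checking the definition, not part of the interview answer; the helper name `array_degree` is an assumption:

```python
from collections import Counter


def array_degree(nums):
    """Return the highest frequency of any single value (0 for an empty list)."""
    counts = Counter(nums)
    return max(counts.values(), default=0)


print(array_degree([1, 2, 2, 3, 1, 4, 2]))  # 3
```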

In [7]:
# Answer Code

# Using toy sample array

N = [1, 4, 4, 3, 1]

'''
How the enumeration loop below works:

(POSITION)
The first time a value k appears in the array N,
record its index i in first_seen[k].

(DEGREE COUNT)
Separately,
count[k] tracks how many times k has appeared so far;
every time k comes up again, its count is increased by 1.

(DECIDING WHICH VALUE SETS THE DEGREE)

If count[k] is higher than the current degree (which starts at 0),
the degree is updated to count[k]
AND
min_length is reset to (current position) - (position where k was first seen) + 1.

If count[k] only ties the current degree, min_length is updated only when
k's span is shorter than the running min_length.

Once the loop has run to the end, min_length is the length of the smallest
contiguous subarray that has the same degree as the full array N.
'''

def shortest_subarray(N):
    # set up the running minimum length, the running degree, and the two lookup dictionaries
    min_length = 0
    degree = 0
    count = {}
    first_seen = {}

    # enumeration loop - see the note above for an explanation
    for i, k in enumerate(N):
        if k not in first_seen:
            first_seen[k] = i

        count[k] = count.get(k, 0) + 1

        if count[k] > degree:
            degree = count[k]
            min_length = (i - first_seen[k]) + 1
    
        elif count[k] == degree:
            min_length = min(min_length, (i - first_seen[k]) + 1)

    return min_length

In [8]:
shortest_subarray(N) # reflects answer length "2" for shortest subarray [4, 4]

2

***
## 2. Reverse Max Array Challenge

__Question:__ For a list "A" of size "n", return the greatest positive integer that is missing from the list, looking only at integers between the smallest and largest positive values in the list. If no integer in that range is missing, the answer is max(list) + 1. If there are no positive integers, the answer is 1.

For list \[1, 2, 4, 6, 1, 7\], 
output is: 
5

For list \[1, 2, 3\],
output is:
4

For list \[-1, -100, -55\],
output is:
1

For list \[-1, 0, 10000, 100000.12\],
output is:
100001

In [9]:
# Solution

def solution(A):
    # idea: build the full integer range between the smallest and largest
    # positive values, then remove every value that is present in A
    # note: range() needs integers, so this assumes A contains only ints
    
    B = [item for item in A if item > 0] # to focus on only positive integers
    if not B:
        return 1
    minpoint = min(B)
    maxpoint = max(B) + 1 # +1 because range() excludes its upper bound
    wholerange = range(minpoint, maxpoint)

    operativeset = set(wholerange) - set(B) # integers in the range that are missing from A
    if not operativeset:
        return maxpoint

    return max(operativeset)
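
A variant that avoids building the full range in memory is sketched below. It assumes the list contains only integers, and the name `greatest_missing` is hypothetical; it walks down from the largest value and stops at the first gap:

```python
def greatest_missing(values):
    """Greatest missing integer between the smallest and largest positive values."""
    present = {v for v in values if v > 0}
    if not present:
        return 1
    low, high = min(present), max(present)
    # scan downward so the first gap found is the greatest one
    for candidate in range(high - 1, low, -1):
        if candidate not in present:
            return candidate
    return high + 1  # no gaps: the range is "full"


print(greatest_missing([1, 2, 4, 6, 1, 7]))  # 5
print(greatest_missing([1, 2, 3]))           # 4
print(greatest_missing([-1, -100, -55]))     # 1
```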

***
## 3. Reversing a String

__Question:__ How do you reverse the characters of a string s?

In [34]:
# 1. Answer via loop

string = "Python"                        # initial string
reversedString = ""
index = len(string)                      # calculate length of string and save in index

while index > 0: 
    reversedString += string[index - 1]  # append the character at string[index - 1]
    index = index - 1                    # decrement index
    
print(reversedString)                    # reversed string output

#################

# 2. Answer via built-in list method

string = '......'
## splitting into individual characters
string = list(string)
## built-in in-place reversal
string.reverse()
## putting it back together
string = ''.join(string)
    
#############

# 3. Answer via slicing - BETTER!!!!!:

string = '.......'

reversed_string = string[::-1] 
# "With a step of -1, start at the last character and walk back to the first"

nohtyP


__Note:__ The same slicing trick reverses the digits of a non-negative integer too! The only additional step is converting between data types:
```python
number = 123456789
reversed_number = int(str(number)[::-1])
```
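
The one-liner above raises `ValueError` for negative numbers, because the minus sign ends up at the end of the reversed string. A minimal sketch that keeps the sign, with the hypothetical helper name `reverse_digits`:

```python
def reverse_digits(number):
    """Reverse the decimal digits of an integer, keeping its sign."""
    sign = -1 if number < 0 else 1
    return sign * int(str(abs(number))[::-1])


print(reverse_digits(123456789))  # 987654321
print(reverse_digits(-120))       # -21 (the leading zero is dropped by int())
```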

***
## 4. Fizzbuzz

__Question:__ For a run of consecutive integers of length _n_, print each integer in order. 

If an integer is divisible by 3, replace the integer with the string "fizz". If the integer is divisible by 5, replace the integer with the string "buzz". If the integer is divisible by both 3 and 5, replace the integer with the string "fizzbuzz".

In [17]:
for i in range(50):
    if i % 3 == 0 and i % 5 == 0:
        print("fizzbuzz")
    elif i % 3 == 0:
        print("fizz")
    elif i % 5 == 0:
        print("buzz")
    else:
        print(i)

fizzbuzz
1
2
fizz
4
buzz
fizz
7
8
fizz
buzz
11
fizz
13
14
fizzbuzz
16
17
fizz
19
buzz
fizz
22
23
fizz
buzz
26
fizz
28
29
fizzbuzz
31
32
fizz
34
buzz
fizz
37
38
fizz
buzz
41
fizz
43
44
fizzbuzz
46
47
fizz
49
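
Interviewers sometimes ask for a version that is easier to test or extend. One possible sketch builds each word by concatenation and returns a list instead of printing; the function name `fizzbuzz` and the choice to start at 0 (to match the output above) are assumptions:

```python
def fizzbuzz(n):
    """Return the fizzbuzz sequence for 0..n-1 as a list of strings."""
    result = []
    for i in range(n):
        word = ""
        if i % 3 == 0:
            word += "fizz"
        if i % 5 == 0:
            word += "buzz"
        result.append(word or str(i))  # fall back to the number itself
    return result


print(fizzbuzz(6))  # ['fizzbuzz', '1', '2', 'fizz', '4', 'buzz']
```

Because "fizzbuzz" falls out of the two independent checks, no separate divisible-by-both branch is needed.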


***
## 5. Reading Input from System

__Question:__ How do you read lines from standard input (stdin) into list variables?

In [None]:
# Answer: Three ways, all very similar

import sys

list_of_inputs = sys.stdin.readlines() # DON'T RUN THIS CELL!!!!!!!!!!!!!!

# each input line will be a string within the list, convert to list or integer as necessary:

# say there were three input lines, with the third being a 'list' string of 2, 2, 4, 5, 7...

main_list_input = [int(elem) for elem in list_of_inputs[2].split()] # index 2 = third line

#####################################

# Alternative when the number of input lines is known: call input() once per line

N = []

for i in range(2): # 2 = number of inputs (known value)
    N.append(list(map(int, input().rstrip().split())))
    
#############################

# EASIEST KNOWN WAY! iter() with a sentinel has a good secondary use: reading input line by line until the first empty line
# This turns every line of stdin into its own list, all combined into one input_matrix list comprehension

input_matrix = [[float(x) for x in line.split()] for line in iter(input, '')]

print(input_matrix) 

# less succinct version of the above:

input_matrix = []
while True:
    line = sys.stdin.readline().strip()
    if not line:
        break
    input_matrix.append(list(map(float, line.split()))) # list() so each row is a list, not a lazy map object
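
To try the parsing logic without blocking on real stdin, the same idea can be pointed at any file-like object. The sketch below uses `io.StringIO` as a stand-in for stdin; `read_matrix` is a hypothetical helper, and unlike the loop above it skips blank lines instead of stopping at the first one:

```python
import io


def read_matrix(stream):
    """Parse whitespace-separated numbers, one row per non-empty line."""
    return [[float(x) for x in line.split()] for line in stream if line.strip()]


sample = io.StringIO("1 2 3\n4.5 6\n")
print(read_matrix(sample))  # [[1.0, 2.0, 3.0], [4.5, 6.0]]
```

Passing `sys.stdin` in place of `sample` reads from real standard input in the same way.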

***
## 6. Calculating Mode

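NumPy does not offer a mode function alongside its mean and median, because the mode is more complicated: a dataset can have several modes. (SciPy, pandas, and the standard library's `statistics` module each provide one.) What kind of function can be written to compute the mode by hand?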

In [41]:
# Sample DataFrame
import pandas as pd

df = pd.DataFrame()
df['age'] = [28, 42, 24, 27, 35, 24, 54, 25, 37, 37]

#################################

# ZERO-TH ANSWER: The fastest but incomplete way

import statistics

statistics.mode(df['age'])
# Bad with multiple modes: raises StatisticsError before Python 3.8,
# and from 3.8 on returns only the first mode encountered

#################################

# FIRST ANSWER: The cheap and easy way (the 'best' way????)

from collections import Counter

Counter(df['age']).most_common() # (value, count) pairs, sorted from most to least common

#################################

# SECOND ANSWER: The more manual counting way

def compute_mode(numbers):
    counts = {}
    maxcount = 0
    for number in numbers:
        if number not in counts:
            counts[number] = 1 # placing count for new number into dictionary
        else:
            counts[number] += 1
            
        if counts[number] > maxcount:
            maxcount = counts[number] # keeping tally of what is maxcount loop by loop
            
    for number, count in counts.items(): # loop done in case there are multiple modes
        if count == maxcount:
            print(number, count)

compute_mode(df['age'])

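
On Python 3.8 or newer, the standard library also covers the multiple-mode case directly through `statistics.multimode`. A short sketch using the same ages as a plain list (so it runs without pandas):

```python
import statistics

ages = [28, 42, 24, 27, 35, 24, 54, 25, 37, 37]

# multimode returns every value tied for the highest count,
# in the order each was first encountered
modes = statistics.multimode(ages)
print(modes)  # [24, 37]
```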

Hadoop Ecosystem Setup

https://www.ironsidegroup.com/2015/12/01/hadoop-ecosystkey-components/