# Interview Query Questions

## Complete Addresses

You’re given two dataframes. One contains information about addresses and the other contains relationships between various cities and states. Write a function to create a single dataframe with complete addresses in the format of street, city, state, zipcode.

In [3]:
import pandas as pd

addresses = ["4860 Sunset Boulevard, San Francisco, 94105", "3055 Paradise Lane, Salt Lake City, 84103", 
             "682 Main Street, Detroit, 48204", "9001 Cascade Road, Kansas City, 64102", "5853 Leon Street, Tampa, 33605"]

df1 = pd.DataFrame(addresses, columns=["Addresses"])

city = {"Salt Lake City": "Utah", "Kansas City": "Missouri", "Detroit": "Michigan", "Tampa": "Florida", "San Francisco": "California"}

df2 = pd.DataFrame(city.items(), columns=["City", "State"])

In [36]:
def add_postcode(df1, df2):
    # Split each address into street, city, and zip code
    df1_split = pd.DataFrame([i.split(", ") for i in df1['Addresses'].values], columns=["Address", "City", "PostCode"])
    # Look up the state for each city
    df_joint = pd.merge(df1_split, df2, on=["City"])
    # Reassemble in the order street, city, state, zipcode
    res = pd.DataFrame([', '.join(i) for i in df_joint[["Address", "City", "State", "PostCode"]].values], columns=['Addresses'])
    return res

add_postcode(df1, df2)

Unnamed: 0,Addresses
0,"4860 Sunset Boulevard, San Francisco, Californ..."
1,"3055 Paradise Lane, Salt Lake City, Utah, 84103"
2,"682 Main Street, Detroit, Michigan, 48204"
3,"9001 Cascade Road, Kansas City, Missouri, 64102"
4,"5853 Leon Street, Tampa, Florida, 33605"


## Max Width

Given an array of words and a maxWidth parameter, format the text such that each line has exactly maxWidth characters. Pad extra spaces ' ' when necessary so that each line has exactly maxWidth characters.

Extra spaces between words should be distributed as evenly as possible. If the number of spaces on a line does not divide evenly between words, the empty slots on the left are assigned more spaces than the slots on the right.

In [37]:
words = ["This", "is", "an", "example", "of", "text", "justification."]
maxWidth = 16

In [38]:
def fullJustify(words, maxWidth):
    res, cur = [], []
    num_of_letters = 0

    for w in words:
        # Checking if existing words + new words are greater than max width
        if num_of_letters + len(w) + len(cur) > maxWidth:
            # Implementing round robin logic
            for i in range(maxWidth - num_of_letters):
                cur[i%(len(cur)-1 or 1)] += ' '
            res.append(''.join(cur))
            cur, num_of_letters = [], 0
        cur += [w]
        num_of_letters += len(w)
    return res + [' '.join(cur).ljust(maxWidth)]

fullJustify(words, maxWidth)

['This    is    an', 'example  of text', 'justification.  ']

## Customer Analysis

You're given a dataframe containing sales data from a grocery store chain with columns for customer ID, gender, and date of sale.

Create a new dataset with summary-level information on each customer's purchases, including the columns:

    customer_id
    gender
    most_recent_sale
    order_count

most_recent_sale should display the date of the customer's most recent purchase. order_count should display the total number of purchases that the customer has made.

In [5]:
customers = {"customer_id" : [5156, 2982, 1011, 3854, 2982], 
             "Gender" : ["m", "f", "m", "f", "f"], "Date of Sale" : ["2021-01-04", "2021-02-15", "2021-03-01", "2021-03-21", "2021-04-12"]}

customer_df = pd.DataFrame(customers)
customer_df

Unnamed: 0,customer_id,Gender,Date of Sale
0,5156,m,2021-01-04
1,2982,f,2021-02-15
2,1011,m,2021-03-01
3,3854,f,2021-03-21
4,2982,f,2021-04-12


In [26]:
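# One sub-dataframe per customer, in order of first appearance
dfs_by_id = [customer_df[customer_df['customer_id'] == i] for i in customer_df['customer_id'].unique()]

df_list = []
for i in dfs_by_id:
    # Sort by date so the last row is the most recent sale even if the input is unordered
    ids, gen, date = list(i.sort_values('Date of Sale').iloc[-1, :].values)
    count = len(i)
    df_list.append([ids, gen, date, count])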
    
most_recent_sale = pd.DataFrame(df_list, columns=['customer_id', 'gender', 'most_recent_sale', 'order_count'])
most_recent_sale.sort_values('customer_id')

Unnamed: 0,customer_id,gender,most_recent_sale,order_count
2,1011,m,2021-03-01,1
1,2982,f,2021-04-12,2
3,3854,f,2021-03-21,1
0,5156,m,2021-01-04,1


## Merge Sorted Lists

Given two sorted lists, write a function to merge them into one sorted list.

What's the time complexity?

In [11]:
def merge_list(list1, list2):
    list1.extend(list2)
    list1.sort()
    return list1

print(merge_list([0,2,6,7,9], [1,3,4,8]))

[0, 1, 2, 3, 4, 6, 7, 8, 9]
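
On the complexity question: concatenating the lists and calling a general comparison sort costs O((n + m) log(n + m)), and it also mutates `list1`. Because both inputs are already sorted, a two-pointer merge can do the job in O(n + m). The following is a minimal sketch of that idea; the name `merge_two_pointer` is an illustrative choice and not part of the original notebook.

```python
def merge_two_pointer(list1, list2):
    """Merge two already-sorted lists into a new sorted list in O(n + m)."""
    merged = []
    i = j = 0
    # Repeatedly take the smaller of the two front elements
    while i < len(list1) and j < len(list2):
        if list1[i] <= list2[j]:
            merged.append(list1[i])
            i += 1
        else:
            merged.append(list2[j])
            j += 1
    # At most one of these slices is non-empty
    merged.extend(list1[i:])
    merged.extend(list2[j:])
    return merged


print(merge_two_pointer([0, 2, 6, 7, 9], [1, 3, 4, 8]))
```

Unlike the version above, this sketch leaves both input lists unchanged.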


## Over 100 Dollars

You're given two dataframes: transactions and products.

The transactions dataframe contains transaction ids, product ids, and the total amount of each product sold.

The products dataframe contains product ids and prices.

Write a function to return a dataframe containing every transaction with a total value of over $100. Include the total value of the transaction as a new column in the dataframe.

In [12]:
import pandas as pd

transactions = {"transaction_id" : [1, 2, 3, 4, 5], "product_id" : [101, 102, 103, 104, 105], "amount" : [3, 5, 8, 3, 2]}

products = {"product_id" : [101, 102, 103, 104, 105], "price" : [20.00, 21.00, 15.00, 16.00, 52.00]}

df_transactions = pd.DataFrame(transactions)

df_products = pd.DataFrame(products)

In [14]:
df_transactions

Unnamed: 0,transaction_id,product_id,amount
0,1,101,3
1,2,102,5
2,3,103,8
3,4,104,3
4,5,105,2


In [15]:
df_products

Unnamed: 0,product_id,price
0,101,20.0
1,102,21.0
2,103,15.0
3,104,16.0
4,105,52.0


In [17]:
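# Look up each transaction's price by product_id rather than relying on matching row order
price_by_product = df_products.set_index('product_id')['price']
df_transactions['total_value'] = df_transactions['amount'] * df_transactions['product_id'].map(price_by_product)
# "Over $100" is a strict inequality
df_transactions[df_transactions['total_value'] > 100]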

Unnamed: 0,transaction_id,product_id,amount,total_value
1,2,102,5,105.0
2,3,103,8,120.0
4,5,105,2,104.0


## Good Grades and Favorite Colors

You’re given a dataframe of students named df_students:

| name | age | favorite_color | grade |
|------|-----|----------------|-------|
| Tim Voss | 19 | red | 91 |
| Nicole Johnson | 20 | yellow | 95 |
| Elsa Williams | 21 | green | 82 |
| John James | 20 | blue | 75 |
| Catherine Jones | 23 | green | 93 |

Write a function named grades_colors to select only the rows where the student’s favorite color is green or red and their grade is above 90.

In [19]:
import pandas as pd

students = {"name" : ["Tim Voss", "Nicole Johnson", "Elsa Williams", "John James", "Catherine Jones"], "age" : [19, 20, 21, 20, 23], 
            "favorite_color" : ["red", "yellow", "green", "blue", "green"], "grade" : [91, 95, 82, 75, 93]}

students_df = pd.DataFrame(students)
students_df

Unnamed: 0,name,age,favorite_color,grade
0,Tim Voss,19,red,91
1,Nicole Johnson,20,yellow,95
2,Elsa Williams,21,green,82
3,John James,20,blue,75
4,Catherine Jones,23,green,93


In [37]:
def grades_colors(df):
    # isin() avoids the precedence trap where `&` binds tighter than `|`
    return df[(df['grade'] > 90) & df['favorite_color'].isin(['green', 'red'])].sort_values('grade')

grades_colors(students_df)

Unnamed: 0,name,age,favorite_color,grade
0,Tim Voss,19,red,91
4,Catherine Jones,23,green,93


## Weekly Aggregation

Given a list of timestamps in sequential order, return a list of lists grouped by week (7 days) using the first timestamp as the starting point.

        ts = [
            '2019-01-01', 
            '2019-01-02',
            '2019-01-08', 
            '2019-02-01', 
            '2019-02-02',
            '2019-02-05',
        ]

        def weekly_aggregation(ts) -> [
            ['2019-01-01', '2019-01-02'], 
            ['2019-01-08'], 
            ['2019-02-01', '2019-02-02'],
            ['2019-02-05'],
        ]

In [17]:
import datetime

ts = [
    '2019-01-01', 
    '2019-01-02',
    '2019-01-08', 
    '2019-02-01', 
    '2019-02-02',
    '2019-02-05',
]

def weekly_aggregation(ts):
    res = {}
    # Weeks are 7-day windows counted from the first timestamp
    start = datetime.datetime.strptime(ts[0], '%Y-%m-%d')
    for t in ts:
        week_num = (datetime.datetime.strptime(t, '%Y-%m-%d') - start).days // 7
        if week_num in res:
            res[week_num].append(t)
        else:
            res[week_num] = [t]
    return list(res.values())

weekly_aggregation(ts)

[['2019-01-01', '2019-01-02'],
 ['2019-01-08'],
 ['2019-02-01', '2019-02-02'],
 ['2019-02-05']]

## Replace words with stems

In data science, there exists the concept of stemming, which is the heuristic of chopping off the end of a word to clean and bucket it into an easier feature set. 

Given a dictionary consisting of many roots and a sentence, write a function replace_words to stem all the words in the sentence with the root forming it. If a word has many roots that can form it, replace it with the root with the shortest length.

Example:

Input:

    roots = ["cat", "bat", "rat"]
    sentence = "the cattle was rattled by the battery"
    
Output:

    "the cat was rat by the bat"

In [28]:
roots = ["cat", "bat", "rat"]
sentence = "the cattle was rattled by the battery"

In [30]:
# easy solution assuming roots all have a length of 3
res = []
for i in sentence.split(' '):
    if i[:3] in roots:
        res.append(i[:3])
    else:
        res.append(i)
print(' '.join(res))


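# improved solution: handles roots of any length and picks the shortest matching root
res = []
for w in sentence.split(' '):
    matches = [r for r in roots if w.startswith(r)]
    res.append(min(matches, key=len) if matches else w)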
print(' '.join(res))

the cat was rat by the bat
the cat was rat by the bat


## Acquisition Threshold

Capital One has two levels of customer acquisition strategies for customers that are opening credit cards.

For high spending customers, Capital One will give clients a one time bonus of 800 dollars. For everyone else, they give a 100 dollar bonus.

Write a function in Python that takes a list of client spends as floats and figures out the threshold to divide the high spending vs low spending customers.

In [5]:
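def bonus_threshold(spends_list, pct=0.20):
    # Assumes the top `pct` fraction of spenders counts as high spending
    ordered = sorted(spends_list, reverse=True)  # sorted copy; the caller's list is left unchanged
    limit = int(len(ordered) * pct)
    return ordered[limit]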

bonus_threshold([23, 14, 75, 23, 9, 705, 709])

705

## Scrambled Tickets

Consider a trip from one city to another that may contain many layovers. Given the list of flights out of order, each with a starting city and end city, write a function plan_trip to reconstruct the path of the trip so the trip tickets are in order.

Example:

For a trip from Bombay to Beijing:

In [1]:
flights = [['Chennai', 'Bangalore'], ['Bombay', 'Delhi'], ['Goa', 'Chennai'], ['Delhi', 'Goa'], ['Bangalore', 'Beijing']]
output = [['Bombay', 'Delhi'], ['Delhi', 'Goa'], ['Goa', 'Chennai'], ['Chennai', 'Bangalore'], ['Bangalore', 'Beijing'],]

In [2]:
def builder(sorted_list, unsorted_items):
    still_unsorted = []
    for item in unsorted_items:
        if sorted_list[0][0] == item[1]:
            sorted_list.insert(0,item)
        elif sorted_list[len(sorted_list)-1][1] == item[0]:
            sorted_list.append(item)
        else:
            still_unsorted.append(item)
                
    return sorted_list, still_unsorted

route = [flights[0]]
remaining = flights[1:len(flights)]
# print(route, remaining)
while remaining:
    route, remaining = builder(route,remaining)
#     print(route, remaining)
print(route)

[['Bombay', 'Delhi'], ['Delhi', 'Goa'], ['Goa', 'Chennai'], ['Chennai', 'Bangalore'], ['Bangalore', 'Beijing']]
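
The loop above rescans the remaining tickets on every pass. As an alternative, here is a minimal sketch of the `plan_trip` function the question asks for, using a dictionary lookup so each ticket is visited once. It assumes the tickets form a single unbroken path and that no city is departed from more than once; those assumptions are mine, not stated in the original notebook.

```python
def plan_trip(flights):
    """Order [start, end] tickets into one continuous route."""
    # Map each departure city to its destination
    next_city = {src: dst for src, dst in flights}
    destinations = set(next_city.values())
    # The trip starts at the only city that is never a destination
    current = next(src for src in next_city if src not in destinations)
    route = []
    while current in next_city:
        route.append([current, next_city[current]])
        current = next_city[current]
    return route


flights = [['Chennai', 'Bangalore'], ['Bombay', 'Delhi'], ['Goa', 'Chennai'],
           ['Delhi', 'Goa'], ['Bangalore', 'Beijing']]
print(plan_trip(flights))
```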


## Computing the Correlation

You are given the scores of N students in three different subjects: Maths, Physics, and Chemistry.
All three are graded on a scale of 0 to 100.
Your task is to compute the Pearson product-moment correlation coefficient between the scores of each pair of subjects.
This data is based on the records of the CBSE K-12 Exam, a national school leaving exam in India, for the year 2013.

Pearson product-moment correlation coefficient:
This is a measure of linear correlation between data series:
<br>
![](https://github.com/jayshah19949596/CodingInterviews/raw/master/QuantumBlack%20Machine%20Learning%20Software%20Engineer%202019/Images/Correlation.JPG)
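
For reference, the computational form of the coefficient that the `corr_maker` function in the code cell below evaluates can be written as follows, where $x_i$ and $y_i$ are the two score series and $n$ is the number of students (the symbol names are my own choice):

```latex
r = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}
         {\sqrt{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\;
          \sqrt{n \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}}
```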


Input Format:
The first row contains an integer N. This is followed by N rows containing three space-separated integers M, P, C corresponding to a candidate's scores in Maths, Physics, and Chemistry.
Each row contains the scores obtained in these three subjects by one student.

Input Constraints:

1 <= N <= 500000
0 <= M, P, C <= 100

Output Format:
The output should contain three lines, with correlation coefficients computed and rounded off to exactly 2 decimal places.
The first line should contain the correlation coefficient between Maths and Physics scores.
The second line should contain the correlation coefficient between Physics and Chemistry scores.
The third line should contain the correlation coefficient between Chemistry and Maths scores.

In [7]:
class_scores = ['73\t72\t76', '48\t67\t76', '95\t92\t95', '95\t95\t96', '33\t59\t79', '47\t58\t74', '98\t95\t97',
				'91\t94\t97', '95\t84\t90', '93\t83\t90', '70\t70\t78', '85\t79\t91', '33\t67\t76', '47\t73\t90',
				'95\t87\t95', '84\t86\t95', '43\t63\t75', '95\t92\t100', '54\t80\t87', '72\t76\t90']

def corr_maker(f, s):
    num = len(f)*sum([i*j for i, j in zip(f, s)])-sum(f)*sum(s)
    dem = ((len(f)*sum([i**2 for i in f])-sum(f)**2)**(1/2))*((len(f)*sum([i**2 for i in s])-sum(s)**2)**(1/2))
    return round(num/dem, 2)
    
m, p, c = [], [], []
for i in class_scores:
    math, physics, chem = i.split('\t')
    m.append(int(math))
    p.append(int(physics))
    c.append(int(chem))
print(corr_maker(m, p))
print(corr_maker(p, c))
print(corr_maker(c, m))

0.89
0.92
0.81


## Distinct Pairs

In this challenge, you will be given an array of integers and a target value. 
Determine the number of distinct pairs of elements in the array that sum to the target value.
Two pairs (a, b) and (c, d) are considered distinct if and only if the values in sorted order do not match, i.e., (1, 9) and (9, 1) are indistinct but (1, 9) and (9, 2) are distinct.

For instance, given the array [1, 2, 3, 6, 7, 8, 9, 1] and a target value of 10, the pairs (1, 9), (2, 8), (3, 7), (8, 2), (9, 1), and (1, 9) all sum to 10, but there are only three distinct pairs: (1, 9), (2, 8), and (3, 7).

Function Description:
Complete the function numberOfPairs. The function must return an integer, the total number of distinct pairs of elements in the array that sum to the target value.

numberOfPairs has following parameters:
a[a[0], ..., a[n-1]]: an array of integers to select the pairs from
k: target integer value to sum to

Constraints:  
- 1 <= n <= 500000
- 0 <= a[i] <= 1000000000
- 0 <= k <= 500000

		Sample Input 0:
		6
		1
		3
		46
		1
		3
		9
		47
		
		Sample Output 0:
		1
		
		Explanation 0:
		a = [1, 3, 46, 1, 3, 9], k = 47
		There are 4 pairs of unique elements where a[i]+a[j] = k
		1. (a[0] = 1, a[2] = 46)
		2. (a[2] = 46, a[0] = 1)
		3. (a[2] = 46, a[3] = 1)
		4. (a[3] = 1, a[2] = 46) 

In [9]:
inp = [1, 3, 46, 1, 3, 9]
k = 47

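def numberOfPairs(a, k):
    seen, pairs = set(), set()
    for x in a:
        # x completes a pair if its complement appeared anywhere earlier in the array
        if k - x in seen:
            # Store the pair in sorted order so (1, 46) and (46, 1) count once
            pairs.add((min(x, k - x), max(x, k - x)))
        seen.add(x)
    return len(pairs)

print(numberOfPairs(inp, k))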

1


## Simple queries

Given two arrays of positive integers, for each element in the second array, find the total number of elements in the first array which are less than or equal to that element.
Store the values determined in an array.
For example, if the first array is `[1, 2, 3]` and the second array is `[2, 4]`, then there are 2 elements in the first array less than or equal to 2.
There are 3 elements in the first array which are less than or equal to 4.
We can store these answers in an array, answer = [2, 3].

Function Description:
Complete the function `counts`. 
The function must return an array of m positive integers, one for each maxes[i] representing the total number of elements nums[j] satisfying nums[j]<=maxes[i], where 0 <= j < n and 0 <= i < m, in given order

counts has the following parameters:
nums[nums[0], ..., nums[n-1]]: first array of positive integers
maxes[maxes[0], ..., maxes[m-1]]: second array of positive integers


Constraints:  
- 2 <= n, m <= 100000
- 1 <= nums[j] <= 1000000000, where 0 <= j < n
- 1 <= maxes[i] <= 1000000000, where 0 <= i < m.

		Sample Input 0:
		4
		1
		4
		2
		4
		2
		3
		5
		
		Sample Output 0:
		2
		4
			
		Explanation 0:
		We are given n = 4, nums = [1, 4, 2, 4], m = 2, and maxes = [3, 5].
		1. For maxes[0] = 3, we have 2 elements in nums (nums[0] = 1, and nums[2] = 2) that are <= maxes[0].
		2. For maxes[1] = 5, we have 4 elements in nums (nums[0] = 1, nums[1] = 4, nums[2] = 2, and nums[3] = 4) that are <= maxes[1].
		Thus the function returns the array [2, 4] as the answer.

In [31]:
def counts(inp, match):
    out = []
    for i in match:
        count = 0
        for j in inp:
            if i>=j:
                count+=1
        out.append(count)
    return out

arr = [1, 2, 4, 4, 7]
maxes = [3, 5]
res = counts(arr, maxes)
print("numbers =", arr, "maxes =", maxes, "answer =", res)

arr = [2, 10, 5, 4, 8]
maxes = [3, 1, 7, 8]
res = counts(arr, maxes)
print("numbers =", arr, "maxes =", maxes, "answer =", res)

arr = [1, 4, 2, 4]
maxes = [3, 5]
res = counts(arr, maxes)
print("numbers =", arr, "maxes =", maxes, "answer =", res)

numbers = [1, 2, 4, 4, 7] maxes = [3, 5] answer = [2, 4]
numbers = [2, 10, 5, 4, 8] maxes = [3, 1, 7, 8] answer = [1, 0, 3, 4]
numbers = [1, 4, 2, 4] maxes = [3, 5] answer = [2, 4]
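
The nested loops above take O(n * m) time, which can be slow at the upper end of the stated constraints (n and m up to 100000). One common alternative is to sort the first array once and binary-search it for each query, giving O((n + m) log n). The sketch below uses the standard library's `bisect_right`; the function name `counts_sorted` is illustrative and not part of the original notebook.

```python
from bisect import bisect_right


def counts_sorted(nums, maxes):
    """For each value in maxes, count the elements of nums that are <= it."""
    ordered = sorted(nums)
    # bisect_right returns the insertion point after any equal elements,
    # which is exactly the number of elements <= the query value
    return [bisect_right(ordered, m) for m in maxes]


print(counts_sorted([1, 4, 2, 4], [3, 5]))
```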


## Compute Deviation

Write a function compute_deviation that takes in a list of dictionaries with a key and list of integers and returns a dictionary with the standard deviation of each list.

Note that this should be done without using NumPy's built-in functions.

Example:

    input = [
        {
            'key': 'list1',
            'values': [4,5,2,3,4,5,2,3],
        },
        {
            'key': 'list2',
            'values': [1,1,34,12,40,3,9,7],
        }
    ]

output -> {'list1': 1.12, 'list2': 14.19}

In [7]:
def compute_deviation(data):
    res = {}
    for item in data:
        vals = list(item['values'])
        m = sum(vals)/len(vals)
        # Population standard deviation: divide by n rather than n - 1
        sq_dev = [(v - m)**2 for v in vals]
        res[item['key']] = round((sum(sq_dev)/len(vals))**(1/2), 2)

    return res

input_ex = [
    {
        'key': 'list1',
        'values': [4,5,2,3,4,5,2,3],
    },
    {
        'key': 'list2',
        'values': [1,1,34,12,40,3,9,7],
    }
]

compute_deviation(input_ex)

{'list1': 1.12, 'list2': 14.19}

## Impute Median

You’re given a dataframe df_cheeses containing a list of the price of various cheeses from California. The dataframe has missing values in the price column.

Write a function cheese_median to impute the median price of the selected California cheeses in place of the missing values. You may assume at least one cheese is not missing its price.

In [34]:
import pandas as pd

cheeses = {"Name": ["Bohemian Goat", "Central Coast Bleu", "Cowgirl Mozzarella", "Cypress Grove Cheddar", "Oakdale Colby"], "Price" : [15.00, None, 30.00, None, 45.00]}
# cheeses = {"Name": ["Bohemian Goat", "Central Coast Bleu", "Cowgirl Mozzarella", "Cypress Grove Cheddar", "Oakdale Colby", "Oakdale Colby"], "Price" : [15.00, None, 30.00, None, 45.00, 30.00]}

df_cheeses = pd.DataFrame(cheeses)
df_cheeses.head()

Unnamed: 0,Name,Price
0,Bohemian Goat,15.0
1,Central Coast Bleu,
2,Cowgirl Mozzarella,30.0
3,Cypress Grove Cheddar,
4,Oakdale Colby,45.0


In [35]:
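def cheese_median(price):
    res = list(price[~price['Price'].isna()]['Price'])
    res.sort()
    if len(res)%2 == 0:
        # Average the two middle values; true division keeps fractional medians
        res = (res[len(res)//2-1]+res[len(res)//2])/2
    else:
        res = res[len(res)//2]
    # Fill only the Price column so other columns are never touched
    return price.fillna({'Price': res})

cheese_median(df_cheeses)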

Unnamed: 0,Name,Price
0,Bohemian Goat,15.0
1,Central Coast Bleu,30.0
2,Cowgirl Mozzarella,30.0
3,Cypress Grove Cheddar,30.0
4,Oakdale Colby,45.0


## Business Days

Given two dates, write a program to find the number of business days that exist between the date range.

Example:

Input

date1 = 2021-01-31
date2 = 2021-02-18

In [4]:
date1 = '2021-01-31'
date2 = '2021-02-18'
print(pd.bdate_range(date1, date2).shape[0])

14
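
If pandas is not available, the same count can be obtained with the standard library alone. This is a minimal sketch that, like `pd.bdate_range` with its defaults, treats Monday to Friday as business days, includes both endpoints, and ignores public holidays. The function name `business_days` is an illustrative choice.

```python
from datetime import date, timedelta


def business_days(start, end):
    """Count weekdays (Mon-Fri) from start to end inclusive; dates are ISO strings."""
    current, last = date.fromisoformat(start), date.fromisoformat(end)
    total = 0
    while current <= last:
        # weekday() returns 0 for Monday through 6 for Sunday
        if current.weekday() < 5:
            total += 1
        current += timedelta(days=1)
    return total


print(business_days('2021-01-31', '2021-02-18'))
```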


## Term Frequency

Say you are given a text document in the form of a string with the following sentences:

Input

        document = "I have a nice car with a nice tires"

Output

        {
        "I":0.11,
        "have":0.11,
        "a":0.22,
        "nice":0.22,
        "car": 0.11,
        "with":0.11,
        "tires":0.11
        }

Write a program in Python to determine the TF (term frequency) values for each term of this document.

Note: round the term frequency to 2 decimal points.

In [9]:
document = "I have a nice car with a nice tires"

def term_frequency(document):
    text = document.split(" ")
    counts = {}
    for i in text:
        counts[i] = counts.get(i, 0) + 1
    # Round once, after dividing, so rounding error does not accumulate
    return {term: round(n / len(text), 2) for term, n in counts.items()}

term_frequency(document)

{'I': 0.11,
 'have': 0.11,
 'a': 0.22,
 'nice': 0.22,
 'car': 0.11,
 'with': 0.11,
 'tires': 0.11}

## Recurring Character

Given a string, write a function recurring_char to find its first recurring character. Return None if there is no recurring character.

Treat upper and lower case letters as distinct characters.

You may assume the input string includes no spaces.

Example:

        input = "interviewquery"
        output = "i"

        input = "interv"
        output = None

In [18]:
inp = "interviewquery"
inp = "interv"

def recurring_char(inp):
    res = {}
    for i in inp:
        if i not in res:
            res[i] = 1
        else:
            return i
    return None

recurring_char(inp)

## Find Bigrams

Write a function find_bigrams to take a string and return a list of all bigrams.

Example:

    sentence = """
    Have free hours and love children? 
    Drive kids to school, soccer practice 
    and other activities.
    """
    def find_bigrams(sentence) ->

     [('have', 'free'),
     ('free', 'hours'),
     ('hours', 'and'),
     ('and', 'love'),
     ('love', 'children?'),
     ('children?', 'drive'),
     ('drive', 'kids'),
     ('kids', 'to'),
     ('to', 'school,'),
     ('school,', 'soccer'),
     ('soccer', 'practice'),
     ('practice', 'and'),
     ('and', 'other'),
     ('other', 'activities.')]

In [22]:
def find_bigrams(sentence):
    # split() with no argument splits on any whitespace, so multi-line input also works
    string = sentence.lower().split()
    res = []
    for i in range(0, len(string)-1):
        res.append((string[i], string[i+1]))
    return res

sentence = """Have free hours and love children? Drive kids to school, soccer practice and other activities."""

find_bigrams(sentence)

[('have', 'free'),
 ('free', 'hours'),
 ('hours', 'and'),
 ('and', 'love'),
 ('love', 'children?'),
 ('children?', 'drive'),
 ('drive', 'kids'),
 ('kids', 'to'),
 ('to', 'school,'),
 ('school,', 'soccer'),
 ('soccer', 'practice'),
 ('practice', 'and'),
 ('and', 'other'),
 ('other', 'activities.')]

### Bucket Test Scores

Let’s say you’re given a dataframe of standardized test scores from high schoolers from grades 9 to 12 called df_grades.

Given the dataset, write a function in pandas called bucket_test_scores to return the cumulative percentage of students that received scores within the buckets of <50, <75, <90, <100.

Example:

Input:

    print(df_grades)
    user_id	grade	test score
    1	10	85
    2	10	60
    3	11	90
    4	10	30
    5	11	99
Output:

    def bucket_test_scores(df_grades) ->
    grade	test score	percentage
    10	<50	33%
    10	<75	66%
    10	<90	100%
    10	<100	100%
    11	<50	0%
    11	<75	0%
    11	<90	50%
    11	<100	100%

In [12]:
import pandas as pd

grades = {"user_id": [1, 2, 3, 4, 5], "grade" : [10, 10, 11, 10, 11], "test_score" : [85, 60, 90, 30, 99]}

df_grades = pd.DataFrame(grades)
df_grades.head()

Unnamed: 0,user_id,grade,test_score
0,1,10,85
1,2,10,60
2,3,11,90
3,4,10,30
4,5,11,99


In [4]:
def bucket_test_scores(df_grades):
    res = []
    for i in df_grades['grade'].unique():
        scores_by_grade = df_grades[df_grades['grade'] == i]
        dem = len(scores_by_grade)
        for threshold in [50, 75, 90, 100]:
            # The expected output counts a score of 90 in the "<90" bucket, hence <=
            num = len(scores_by_grade[scores_by_grade['test_score'] <= threshold])
            label = '<' + str(threshold)
            bucket_score = str(100*num//dem) + '%'
            res.append([i, label, bucket_score])
            
    return pd.DataFrame(res, columns= ['grade', 'test_score', 'percentage'])

bucket_test_scores(df_grades)

Unnamed: 0,grade,test_score,percentage
0,10,<50,33%
1,10,<75,66%
2,10,<90,100%
3,10,<100,100%
4,11,<50,0%
5,11,<75,0%
6,11,<90,50%
7,11,<100,100%


In [26]:
def bucket_test_scores(df_grades):
    bins = [0, 50, 75, 90, 100]
    labels=['<50', '<75', '<90', '<100']
    # labels must be passed by keyword: the third positional argument of pd.cut is `right`
    buckets = pd.cut(df_grades['test_score'], bins, labels=labels)
    # observed=False keeps empty buckets so every grade reports all four rows
    group_size = df_grades.groupby(['grade', buckets], observed=False).size()
    percentage = (100 * group_size.groupby('grade').cumsum() // df_grades.groupby('grade').size()).astype(str) + '%'
    return percentage.rename('percentage').reset_index()

bucket_test_scores(df_grades)

Unnamed: 0,grade,test_score,percentage
0,10,<50,33%
1,10,<75,66%
2,10,<90,100%
3,10,<100,100%
4,11,<50,0%
5,11,<75,0%
6,11,<90,50%
7,11,<100,100%


### Stratified Split

Let’s say you work as a medical researcher.

You are given a dataframe of patient data containing the age of the patient and two columns, smoking and cancer, indicating if the patient is a smoker or has cancer, respectively.

Write a function, stratified_split, that splits the dataframe into train and test sets while preserving the approximate ratios for the values in a specified column (given by a col parameter).

Note: Do not use scikit-learn.

Example:

Input:

    print(df)
    ...
       age smoking cancer
    0   25     yes    yes
    1   32      no     no
    2   10     yes     no
    3   40     yes     no
    4   75      no     no
    5   80     yes     no
    6   60     yes     no
    7   60      no    yes
    8   40     yes    yes
    9   80     yes     no
Output:

    def stratified_split(df, train_ratio=0.7, col='cancer') -> print(X_train)
    ...
       age smoking cancer
    8   40     yes    yes
    6   60     yes     no
    7   60      no    yes
    4   75      no     no
    9   80     yes     no
    1   32      no     no
    2   10     yes     no
    -----------------------
    print(X_test)
    ...
       age smoking cancer
    0   25     yes    yes
    5   80     yes     no
    3   40     yes     no

In [28]:
import pandas as pd

data = {"age": [25, 32, 10, 40, 75, 80, 60, 60, 40, 80],
          "smoking" : ['yes', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'no', 'yes', 'yes'],
          "cancer" : ['yes', 'no', 'no', 'no', 'no', 'no', 'no', 'yes', 'yes', 'no']}

df = pd.DataFrame(data)
df.head()

Unnamed: 0,age,smoking,cancer
0,25,yes,yes
1,32,no,no
2,10,yes,no
3,40,yes,no
4,75,no,no


In [30]:
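def stratified_split(df, train_ratio=0.7, col='cancer'):
    # Partial solution: only computes how many 'no' rows belong in the training set;
    # it does not yet build the train and test dataframes
    var = df[col].value_counts()
    return round(var['no'] * train_ratio)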

stratified_split(df)

5
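
One way to complete the split is to sample `train_ratio` of the rows within each value of `col` and put everything else in the test set. The following is a rough sketch under that assumption. The `seed` parameter and the choice to return a `(train, test)` tuple are my own additions, and the rows selected will differ from the example output above because the sampling is random.

```python
import pandas as pd


def stratified_split(df, train_ratio=0.7, col='cancer', seed=0):
    """Split df into (train, test), keeping the class ratios of `col` roughly equal."""
    train_parts = []
    for _, group in df.groupby(col):
        # Take the same fraction from every class
        n_train = round(len(group) * train_ratio)
        train_parts.append(group.sample(n=n_train, random_state=seed))
    train = pd.concat(train_parts)
    # Every row not sampled into train goes to test
    test = df.drop(train.index)
    return train, test


data = {"age": [25, 32, 10, 40, 75, 80, 60, 60, 40, 80],
        "smoking": ['yes', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'no', 'yes', 'yes'],
        "cancer": ['yes', 'no', 'no', 'no', 'no', 'no', 'no', 'yes', 'yes', 'no']}
X_train, X_test = stratified_split(pd.DataFrame(data))
print(X_train)
print(X_test)
```

With this data the training set receives 5 of the 7 `no` rows and 2 of the 3 `yes` rows, which matches the 7/3 split in the example.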

### Rain on Rainy Days

You’re given a dataframe df_rain containing rainfall data. The dataframe has two columns: day of the week and rainfall in inches.

Write a function median_rainfall to find the median amount of rainfall for the days on which it rained.

Note: You may assume it rained on at least one of the days.

Input:

    import pandas as pd

    rainfall = {"Day" : ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"], "Inches" : [0, 1.2, 0, 0.8, 1]}

    df_rain = pd.DataFrame(rainfall)
Output:

    def median_rainfall(df_rain) -> 1

In [31]:
import pandas as pd

rainfall = {"Day" : ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"], "Inches" : [0, 1.2, 0, 0.8, 1]}

df_rain = pd.DataFrame(rainfall)
df_rain.head()

Unnamed: 0,Day,Inches
0,Monday,0.0
1,Tuesday,1.2
2,Wednesday,0.0
3,Thursday,0.8
4,Friday,1.0


In [42]:
def median_rainfall(df_rain):
    l = df_rain[df_rain['Inches'] > 0.0].sort_values('Inches').reset_index()
    if len(l)%2 == 1:
        return l['Inches'][len(l)//2]
    else:
        return (l['Inches'][len(l)//2 -1] + l['Inches'][len(l)//2])/2
    
median_rainfall(df_rain)

1.0

In [43]:
def median_rainfall(df_rain):
    return df_rain[df_rain['Inches'] > 0]['Inches'].median() 

median_rainfall(df_rain)

1.0

### Rectangle Overlap

You are given two rectangles a and b each defined by four ordered pairs denoting their corners on the x, y plane. Write a function rectangle_overlap to determine whether or not they overlap. Return True if so, and False otherwise.

Note: If the two rectangles border one another, or share a corner like two diagonally adjacent positions on a chessboard, they are said to overlap.

Note: The lists of ordered pairs are in no particular order. The first entry in list a could be the top left corner while the first in list b is the bottom right.

Example:

Input:

    a = [(-3,5), (-3,2),(0,5),(0,2)]
    b = [(-1,4), (3,4), (3,1), (-1,1)]
Output:

    def rectangle_overlap(a, b) -> True

This is because point (0,2) lies inside rectangle b and point (-1,4) lies inside rectangle a.

In [44]:
def rectangle_overlap(a, b):
    a_x, a_y = [i[0] for i in a], [i[1] for i in a]
    min_ax, max_ax, min_ay, max_ay = min(a_x), max(a_x), min(a_y), max(a_y)

    b_x, b_y = [i[0] for i in b], [i[1] for i in b]
    min_bx, max_bx, min_by, max_by = min(b_x), max(b_x), min(b_y), max(b_y)

    # The rectangles overlap when their x-ranges and y-ranges both intersect (touching counts)
    return (min_ax <= max_bx and max_ax >= min_bx) and \
           (min_ay <= max_by and max_ay >= min_by)

a = [(-3,5), (-3,2),(0,5),(0,2)]
b = [(-1,4), (3,4), (3,1), (-1,1)]
rectangle_overlap(a, b)

True

### Nightly Job

Every night between 7 pm and midnight, two computing jobs from two different sources are randomly started with each one lasting an hour.

Unfortunately, when the jobs simultaneously run, they cause a failure in some of the company’s other nightly jobs, resulting in downtime for the company that costs $1000. 

The CEO, who has enough time today to hear one word, needs a single number representing the annual (365 days) cost of this problem.

Note: Write a function to simulate this problem and output an estimated cost 

Bonus - How would you solve this using probability?

Example:

Input:

    n = 4
Output:

    simulate_overlap(n) -> 0.4

In [75]:
import numpy as np

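def simulate_overlap(n):
    # Start times in minutes after 7 pm, drawn uniformly over the 300-minute window
    source_1 = np.random.randint(0, 300, size = n)
    source_2 = np.random.randint(0, 300, size = n)
    # Two one-hour jobs overlap when they start within 60 minutes of each other
    overlap = np.mean(np.abs(source_1 - source_2) <= 60)
    # Estimated annual cost: overlap fraction x $1000 per failure x 365 nights
    return overlap * 1000 * 365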

simulate_overlap(4)

182500.0
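
For the bonus question, the overlap probability has a closed form if we assume, as the simulation does, that both start times are independent and uniform over the 300-minute window (here treated as continuous rather than whole minutes). The two jobs miss each other only when their start times differ by more than 60 minutes, which corresponds to two triangles in the unit square with total area (240/300)^2, so the overlap probability is 1 - 0.8^2 = 0.36. The sketch below computes this and checks it against a standard-library simulation; the function names are illustrative.

```python
import random


def overlap_probability(window=300, duration=60):
    """P(|X - Y| <= duration) for independent uniform X, Y on [0, window]."""
    return 1 - ((window - duration) / window) ** 2


def simulate_probability(n, window=300, duration=60, seed=0):
    """Monte Carlo estimate of the same probability, for comparison."""
    rng = random.Random(seed)
    hits = sum(
        abs(rng.uniform(0, window) - rng.uniform(0, window)) <= duration
        for _ in range(n)
    )
    return hits / n


p = overlap_probability()
annual_cost = p * 1000 * 365  # $1000 per failure, 365 nights
print(round(p, 2), round(annual_cost))
print(simulate_probability(100_000))
```

Under these assumptions the expected annual cost is about $131,400. A run with only `n = 4` nights, as in the cell above, is too small to get close to that figure.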

### Greatest Common Denominator

This question was asked by: Microsoft
Given a list of integers, write a function gcd to find the greatest common denominator between them.

Example:

Input:

    int_list = [8, 16, 24]
Output:

    def gcd(int_list) -> 8

In [106]:
def gcd(int_list):
    # Euclid's algorithm on a pair, folded across the whole list.
    # Checking only list elements as divisors would miss cases like [12, 18] -> 6.
    res = 0
    for i in int_list:
        a, b = res, abs(i)
        while b:
            a, b = b, a % b
        res = a
    return res


int_list = [8, 16, 24]
gcd(int_list)

8

### Alphabet Sum

Given a list of strings of letters from a to z, create a function, sum_alphabet, that returns a list of the alphabet sum of each word in the list.

The alphabet sum is the sum of the ordinal position of each of the string’s letters in the standard English alphabet ordering. So, the letter a will have a value of 1, z will have a value of 26, and so on.

As an example of the alphabet sum of a string, the string sport will have an alphabet sum of 19 + 16 + 15 + 18 + 20 = 88.

In [1]:
words = ["sport" , "good" , "bad"]
def sum_alphabet(words):
    arr=[]
    for w in words:
        cnt=0
        for ch in w:
            cnt+=ord(ch)-ord('a')+1
        arr.append(cnt)
    return arr

sum_alphabet(words)

[88, 41, 7]

### Expected Tests

Suppose there are one million users and we want to expose 1000 users per day to a test. The same user can be selected twice for the test.

1. What’s the expected value of how long someone will have to wait before they receive the test? <br>
The wait is not uniformly distributed. Because the same user can be selected more than once, each day is an independent trial with a success probability of roughly 1000/1M = 1/1000 = 0.001, so the wait follows a geometric distribution. Its expected value is 1/p = 1000 days.
2. What is the likelihood they get selected after the first day? Is that closer to 0 or 1? <br>
That is the probability of not being selected on the first day: 1 - 0.001 = 0.999, which is much closer to 1.
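
The value p = 0.001 is an approximation. If the 1000 daily draws are made independently with replacement (my reading of "the same user can be selected twice"), the exact daily probability for one user is 1 - (1 - 1/1,000,000)^1000. The short check below, a sketch under that assumption, shows that the difference is negligible.

```python
users = 1_000_000
draws_per_day = 1000

# Probability that a given user is picked at least once in a day's draws
p_day = 1 - (1 - 1 / users) ** draws_per_day

# Mean of a geometric distribution with success probability p_day
expected_wait = 1 / p_day

# Probability of not being selected on the first day
p_after_first_day = 1 - p_day

print(p_day, expected_wait, p_after_first_day)
```

The exact daily probability is slightly below 0.001, so the expected wait is slightly above 1000 days.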

## Easy

Given 2 arrays write a function to get the intersection of the 2. For example, if A = [1,2,3,4,5] and B = [0,1,3,7] then you should return [1,3]

In [1]:
A = [1,2,3,4,5]
B = [0,1,3,7]

def intersect(a, b):
    set_a = set(a)
    set_b = set(b)
    if len(set_a) > len(set_b):
        return [i for i in set_b if i in set_a]
    else:
        return [i for i in set_a if i in set_b]
    
intersect(A, B)

[1, 3]

Given an integer array, return the maximum product of any three numbers in the array. For example, for A = [1,3,4,5]  you should return 60, while for B = [-2,-4,5,3] you should return 40.

In [21]:
A = [1,3,4,5]
B = [-2,-4,5,3]

def maxtreeprod(a):
    s_a = sorted(a)
    # Either the two most negative numbers times the largest, or the three largest
    return max(s_a[-1]*s_a[0]*s_a[1], s_a[-1]*s_a[-2]*s_a[-3])

maxtreeprod(A)
maxtreeprod(B)

40

Given a list of coordinates, write a function to find the k closest points (measured by the Euclidean distance) to the origin. For example, if k=3, and the points are: [[2, -1], [3,2], [4,1], [-1, -1], [-2,2]], then return [[-1,-1],[2,-1],[-2,2]]

In [56]:
inp = [[2, -1], [3,2], [4,1], [-1, -1], [-2,2]]
k = 3

def get_dist(x, y):
    return x**2 + y**2 

def k_closest(k, inp):
    res = {}
    for i in range(0, len(inp)):
            res[tuple(inp[i])] = get_dist(inp[i][0], inp[i][1])
    
    indx = sorted(res.items(), key=lambda x: x[1])[:k]
    out = []
    for i in range(0, k):
        out.append(list(indx[i][0]))
    
    return out

k_closest(k, inp)

[[-1, -1], [2, -1], [-2, 2]]

In [51]:
# K-closest to each other

inp = [[2, -1], [3,2], [4,1], [-1, -1], [-2,2]]
k = 3

def get_dist(x, y):
    return ((x[0]-y[0])**2+(x[1]-y[1])**2)**(1/2) 

def k_closest(k, inp):
    res = {}
    for i in range(0, len(inp)):
        for j in range(i+1, len(inp)):
            res[tuple([i,j])] = get_dist(inp[i], inp[j])
    
    indx = sorted(res.items(), key=lambda x: x[1])[:k]
    out = []
    for i in range(0, k):
        out.append([inp[indx[i][0][0]], inp[indx[i][0][1]]])
    
    return out

k_closest(k, inp)

[[[3, 2], [4, 1]], [[2, -1], [4, 1]], [[2, -1], [-1, -1]]]