# Case Studies

<h3>Case Study 1</h3>

<h4>Citadel: Retailer Revenue Estimation</h4>

<b>Say you wanted to estimate the physical store sales for a publicly traded retail chain in the US. You have access to third-party foot-traffic data for each store that was derived from anonymized GPS data from 10 million phones. How could you use this foot-traffic dataset to predict the chain's in-store revenue?</b>

Before starting, I'd like to clarify a few things:

- What's the use case for predicting the in-store revenue? Why do stakeholders want this number? To make investment decisions?

- Are we making revenue predictions for the entire chain, and not for each store?

- What level of granularity is needed for the revenue prediction? Quarterly?


<b>We are trying to use the revenue estimate to decide whether to buy or sell (short) stocks of certain big box retailers before their earnings come out. We have robust e-commerce data, but want to proxy the entire chain's sales from physical retail with the foot-traffic dataset.</b>

Can you tell me more about the data source?


<b>The number of visits detected at a store, reported at a daily granularity. Each row is a date-storeID pair, and the column is an integer count of detected visits that day.</b>

The high-level approach would be to run a model to predict revenue based on some aggregated measure of foot traffic, at the same level of granularity as the company's reported revenues (quarterly). We want to normalize this foot traffic to account for the fact that the data comes from a sample. Then we can look at a simple regression model that relates revenue to the normalized foot traffic.


<b>Say the foot-traffic data only goes back 12 quarters. How would you avoid overfitting on such little data?</b>

- We can pick simpler models like linear regression and apply regularization.

- We could report a 95% confidence interval to help account for uncertainty.

- We can look to similar retailers, pooling their data to enlarge the training set.
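
As a rough illustration of the first point, here is a minimal sketch of ridge (L2-regularized) regression with a single feature, written in plain Python. The function name `fit_ridge_1d`, the `penalty` parameter, and the toy numbers are all hypothetical; the point is only that a positive penalty shrinks the fitted slope toward zero, which can help when there are just a dozen quarterly observations.

```python
def fit_ridge_1d(xs, ys, penalty):
    """Fit y = intercept + slope * x with an L2 penalty on the slope only."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Centered sums of squares and cross-products.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # penalty = 0 recovers ordinary least squares; larger values shrink the slope.
    slope = sxy / (sxx + penalty)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data: normalized foot traffic vs. quarterly revenue.
traffic = [1.0, 2.0, 3.0]
revenue = [2.0, 4.0, 6.0]
print(fit_ridge_1d(traffic, revenue, penalty=0.0))
print(fit_ridge_1d(traffic, revenue, penalty=2.0))
```

In practice the penalty would be chosen by cross-validation, for example leave-one-quarter-out given how few observations there are.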


<b>How would you use the foot-traffic data to determine the true number of visits to a store?</b>

Because the foot-traffic data would come from a constantly changing panel of mobile phones, we'd incorporate the panel size into the equation to account for panel size changes. As such, the number of true visits would be:

$\text{true visits} \approx \dfrac{\text{sample foot traffic for store}}{\text{sample population size}} \times \text{U.S. population}$

Or, use region instead of the entire country.
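
The scaling above can be sketched in a few lines of Python. The function name `estimate_true_visits` and the example numbers are hypothetical, chosen only to show the arithmetic; swapping in a regional panel size and regional population gives the regional variant.

```python
def estimate_true_visits(sample_visits, panel_size, population):
    """Scale visits observed in the panel up to the full population."""
    if panel_size <= 0:
        raise ValueError("panel_size must be positive")
    return sample_visits / panel_size * population


# Hypothetical numbers: 120 panel visits on one day, a 10-million-device
# panel, and a population of 330 million.
print(estimate_true_visits(120, 10_000_000, 330_000_000))
```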


<b>What is sampling bias? In what ways do you think this dataset may be biased due to the way it was sampled?</b>

Sampling bias is when data is sampled from some segments of the population at disproportionate rates. Location data may be biased because not everyone has a smartphone. Of those that do, not everyone is as likely to have apps with a location component, and not everyone is likely to turn their location permissions on.


<b>How would you check for sampling bias within the panel?</b>

We can check whether the given dataset is a representative sample by comparing it to the distribution of the underlying U.S. population. We can assume that the location of a smartphone during night hours indicates a user's home location. We can then compare the home locations to census data to check for geographic bias.

From the home location and census data, we can understand what types of neighborhoods these devices live in. From this, we can approximate average income and race.


<b>A regression on different subsets of data finds a wildly different intercept among subsets. What could the issue be?</b>

The data might be heterogeneous, i.e., the distributions underlying the various subsets are drastically different. This may be due to improper collection or sampling techniques, in which case it is advisable to partition the data into meaningful subsets and fit a separate model to each. Alternatively, use nonparametric models like trees, which can better deal with heterogeneity.


<b>What are some limitations of using a third party dataset of foot-traffic derived from mobile phones to model the retailer's store sales revenue?</b>

- We don't likely have granular information or transparency into what apps are sending location data, how the apps are changing, or how users are churning and shifting.

- Foot traffic is modeled by a geofence. The selling company could make changes in their visit attribution algorithm.

- Traffic does not necessarily imply a purchase.

To address these, we could look to alternative datasets, like receipts data, credit card data, and point-of-sale datasets.

<h3>Case Study 2</h3>

<h4>Amazon: Show Recommendations</h4>

<b>Amazon: Assume you are designing a system whose purpose is to recommend shows on Amazon Prime Video. What data and techniques would you use?</b>

Underlying most recommender systems is a collaborative filtering approach: recommend shows that were watched or liked by other users who are similar to the user at hand.

<b>What kind of data would you use for collaborative filtering?</b>

Matrices are typically used to represent consumer preferences with respect to programming. The rows could represent users and the columns movies. Each cell contains either a rating that measures the $i^{th}$ person's attitude toward the $j^{th}$ movie, or a null value.

<b>How would you calculate whether two users are similar?</b>

We could use cosine similarity between the two users' rating vectors; normalizing the scores first (per user or across the dataset) is recommended.
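
A minimal sketch of this idea in plain Python follows. The helper names and the toy ratings are hypothetical. Ratings are mean-centered per user first, and items a user has not rated are treated as zero after centering, i.e., as that user's average, which is one common simplification rather than the only option.

```python
import math


def mean_center(ratings):
    """Subtract a user's average rating so generous and harsh raters are comparable."""
    mean = sum(ratings.values()) / len(ratings)
    return {item: score - mean for item, score in ratings.items()}


def cosine_similarity(a, b):
    """Cosine similarity of two sparse rating vectors stored as dicts."""
    shared = a.keys() & b.keys()
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a user with no variation in ratings carries no signal
    return dot / (norm_a * norm_b)


# Hypothetical ratings on a 1-5 scale.
alice = mean_center({"show_a": 5, "show_b": 3, "show_c": 1})
bob = mean_center({"show_a": 5, "show_b": 4, "show_c": 3})
carol = mean_center({"show_a": 1, "show_b": 3, "show_c": 5})

print(cosine_similarity(alice, bob))    # same ordering of tastes: close to 1
print(cosine_similarity(alice, carol))  # opposite tastes: close to -1
```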

<b>How would you address the cold-start problem?</b>

We can do the following:

1. For new shows, don't include them in the recommendations, but have a separate panel for 'new to Amazon'.

2. For new shows, compute measures of show similarity using the features of shows (genre, actors, language, content, etc.)

3. For new users, use factors outside of the given matrix (demographics, etc.)

4. For new users, simply recommend what's popular or trending


<b>Besides collaborative filtering, what techniques can you look at?</b>

Content-based filtering.


<b>Say you have a new model you want to test. You decide on total watch time as the top-line metric. You run an A/B test and it shows a significant increase in watch time with a p-value of $0.04$. Do you ship the new model?</b>

Before deciding, it would be good practice to make sure the experiment lasted two weeks or more, as you do not want to stop too early, and do want to take into account day-of-week effects.

Additionally, it is good to consult with business stakeholders, and to see what happened to other metrics being tracked. You could look at precision and recall: what is the rate of recommended shows that a user watched (precision), and of the shows that users watched, how many were recommended (recall)?

Just because the result is statistically significant does not mean that it is practically significant, i.e., that the findings are large enough to make an impact.
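
The precision and recall definitions mentioned above can be made concrete with a short sketch. The function name and the show IDs are hypothetical; in a real experiment these would be computed per user and then aggregated for each test arm.

```python
def precision_recall(recommended, watched):
    """Precision: share of recommended shows that were watched.
    Recall: share of watched shows that had been recommended."""
    hits = len(recommended & watched)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(watched) if watched else 0.0
    return precision, recall


# Hypothetical example: 4 shows recommended, 3 watched, 2 in common.
recommended = {"a", "b", "c", "d"}
watched = {"b", "d", "e"}
print(precision_recall(recommended, watched))
```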

<b>What are other model deployment or product considerations you might have?</b>

- How computationally intensive is it to train?

- How often should we retrain the model to account for the shifting desires of the user base, as well as new media being added?

- Should we favor Amazon originals? How much?

- Should we update the recommendations when the user has not interacted with the initial recommendations for a certain time? What events should trigger a refresh?

<h3>Case Study 3</h3>

<h4>AirBnB: Listing Revenue Model</h4>

<b>AirBnB: Assume you are building a model to predict the yearly revenue of new properties being listed. What property features would you use in making such predictions?</b>

Clarify the purpose - e.g., are we going to be showing new listers their potential revenue, or are we using this internally to decide on what markets to expand in?


<b>Assume it's to show potential customers an expectation on how much yearly revenue they can bring in. We need to make our predictions personalized based on a user's input data. What features would you use to model the expected yearly revenue?</b>

We could use:

- Property Details: number of bedrooms and bathrooms, square feet, whether there is a kitchen

- Pricing Details: owner's nightly rate, occupancy limit, minimum number of nights required

- Location: zip code, distance to nearby attractions

- Local Market: prices and occupancy of nearby listings


<b>What data preprocessing and cleaning would be necessary to handle duplicate or inconsistent values?</b>

- We can disregard duplicate records if it is confirmed that they are indeed duplicates.

- For inconsistent values, we can perform basic checks to ensure that the data within each affected column is uniform in type (e.g., to catch data from another field that ended up in the wrong column).

- We want to normalize numerical data to improve interpretability of features, and encode categorical variables.


<b>What if many features are sparse?</b>

Some models, like random forests, deal better with sparsity than models like regression. You could look into machine learning techniques like PCA/SVD for dimension reduction. In deep learning, autoencoders can be used to learn a smaller relevant feature space.


<b>What model would you want to use?</b>

It's probably best to start simple with regression, which makes sense because we expect an increasing, roughly linear trend between features like square footage and the number of bedrooms/bathrooms and revenue.


<b>What about using a neural network?</b>

Neural networks tend to overfit unless there is a lot of data. If we have a large amount of training data, do not need the model to be very interpretable, and believe linear regression will underfit, then we can try using neural networks, and perhaps use regularization to prevent overfitting (i.e., to better address the high variance).


<b>Are there any other models you would like to try?</b>

A random forest or boosting model would probably be better in terms of adding complexity vs. a linear regression, with less overfitting than a neural network.


<b>Say your model endpoint needs to support a high number of queries per second (QPS). How would this affect your model choice?</b>

If the model complexity is high, more compute resources are needed per prediction; however, high-complexity models may be more accurate. Therefore, we can assess the trade-off between resource consumption and model accuracy, and then deploy the most accurate model that still meets the QPS requirement.


<b>Say you used linear regression and found that square footage was crucial to predicting revenue. However, it is missing for 10% of listings. How would you deal with the missing values?</b>

Missing data can be eliminated by discarding records or columns, but this is not optimal for a very important feature. Values can be imputed using mean, median, or mode, but this may over-simplify. In general, the best approach is to build a model to predict the missing features given the other features.
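
As a rough sketch of the model-based approach, the snippet below fits a one-variable least-squares line on listings where square footage is known and uses it to fill in the gaps. Using the number of bedrooms as the only predictor, along with the field names `bedrooms` and `sqft`, is an assumption made for brevity; a real imputation model would draw on more features.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def impute_square_feet(listings):
    """Return copies of the listings with missing sqft predicted from bedrooms."""
    known = [row for row in listings if row["sqft"] is not None]
    slope, intercept = fit_line(
        [row["bedrooms"] for row in known],
        [row["sqft"] for row in known],
    )
    filled = []
    for row in listings:
        sqft = row["sqft"]
        if sqft is None:
            sqft = intercept + slope * row["bedrooms"]
        filled.append({**row, "sqft": sqft})
    return filled


# Hypothetical listings; the last one is missing its square footage.
listings = [
    {"bedrooms": 1, "sqft": 500.0},
    {"bedrooms": 2, "sqft": 900.0},
    {"bedrooms": 3, "sqft": 1300.0},
    {"bedrooms": 4, "sqft": None},
]
print(impute_square_feet(listings)[-1])
```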


<b>Do you have any outside-the-box approaches to sourcing the missing square footage data?</b>

- We can use push notifications on the app.

- Get end users to provide missing information.

- Use third-party datasets, like parcel data or county records. However, this may not help if only a subsection of the property is being rented out.


<b>What action would you take if there were too many features to look at thoroughly?</b>

The first step would be some feature selection, to filter out variables with very low variance, little relationship to the target, or high redundancy with other features. This can be done by looking at correlations with the target and at variance inflation factors (VIFs), which flag multicollinearity. Alternatively, we can apply specific feature selection methods such as recursive feature elimination (RFE).

Dimensionality reduction methods like PCA can be used to combine features that are highly correlated. In doing so, we create new variables that are uncorrelated with one another and represent latent features underlying the original variables.

<h3>Case Study 4</h3>

<h4>Walmart: Optimal Product Pricing</h4>

<b>How would you build an algorithm to price products sold physically at Walmart stores?</b>

Clarify the scope of the problem. How many products are we trying to price? Are we determining prices for a particular Walmart store, or all of them?


<b>We are trying to build an algorithm which can price all the Walmart products stocked physically at all stores in North America. We have about $5,000$ stores and $100,000$ products per store.</b>

To further clarify - how often do we need to determine prices?


<b>What advantages do you see in updating often vs. less often?</b>

With the retail industry so competitive, and some items being perishable, it could be appropriate to change prices very frequently, such as daily. However, there is a human cost in terms of labor, and potentially limitations on computational complexity. It may also affect brand perception if prices change too frequently. Customers may postpone purchases until something goes on sale, knowing that they are likely to encounter a lower price soon.


<b>Assume that the singular goal of this algorithm is to maximize profit. How could you construct a simple supply and demand curve for each item to determine an optimal price point?</b>

I'll assume the demand curve relationship for every product is approximately linear, so plotting quantity sold on the x-axis and price on the y-axis should display a linear relationship with a negative slope. In this way, we can use the demand curve to identify the optimal price point in terms of profit, (price - cost) × units. The coefficients of the regression models will reflect price elasticity.
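
To make this concrete, here is a minimal sketch under the linear-demand assumption. It regresses units sold on price (units = intercept + slope × price, with a negative slope) and then maximizes (price - cost) × units, which for a linear curve has the closed form price* = (slope × cost - intercept) / (2 × slope). The function names and the price history are hypothetical.

```python
def fit_linear_demand(prices, quantities):
    """Least-squares fit of quantity = intercept + slope * price."""
    n = len(prices)
    mean_p = sum(prices) / n
    mean_q = sum(quantities) / n
    spp = sum((p - mean_p) ** 2 for p in prices)
    spq = sum((p - mean_p) * (q - mean_q) for p, q in zip(prices, quantities))
    slope = spq / spp
    return mean_q - slope * mean_p, slope


def optimal_price(intercept, slope, unit_cost):
    """Price maximizing (price - unit_cost) * (intercept + slope * price)."""
    if slope >= 0:
        raise ValueError("demand must fall as price rises (negative slope)")
    # Setting the derivative of the profit function to zero gives this price.
    return (slope * unit_cost - intercept) / (2 * slope)


# Hypothetical price history for one item with a unit cost of 1.0.
prices = [2.0, 3.0, 4.0, 5.0]
units = [80.0, 60.0, 40.0, 20.0]
intercept, slope = fit_linear_demand(prices, units)
best = optimal_price(intercept, slope, unit_cost=1.0)
print(best, (best - 1.0) * (intercept + slope * best))
```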


<b>How would you double-check elasticities?</b>

Determining whether similar products have similar elasticities would provide possible validation of our findings.


<b>What are some limitations of using a linear regression model for building demand curves?</b>

There are a few areas where pricing alone may not capture demand perfectly. Cannibalization is not taken into account in a simple linear regression analysis, yet discounts on some products can affect the sales of others. Also, variable pricing can be delivered through different mechanisms, such as an ad or a physical coupon. With regular flyer specials, people may stock up during discount periods.


<b>Assume you want to use more than a simple supply and demand curve with linear regression. What other kinds of data can be relevant to train a black-box pricing algorithm?</b>

- Item Details: cost, shelf placement, category, size
- Competitor Pricing
- Inventory Constraints
- Historical Prices


<b>How would you test your black-box algorithm?</b>

We can run an A/B test as follows. First, take two categories of items that should be roughly comparable in terms of price elasticity, seasonality, and revenue, avoiding categories that tend to be in the same basket because of interaction effects.

For the control group, we price using the status quo, and for the other, we use our algorithm. At the end, we monitor for lift in core metrics like revenue, profit, and sell-through rates. We should wait a decent amount of time, such as a few months.

<h3>Case Study 5</h3>

<h4>Accenture: Hotel Review Analytics</h4>

<b>Assume you want to help a major hotel chain analyze what people say about their brand on websites like Facebook, Twitter, and Reddit. How would you go about doing it?</b>

To clarify the data: what kind of content are we looking at? Reviews, comments, threads, a combination? Do we need to go out and collect the data, or is it already prepared?


<b>Assume we're talking about public online text and reviews posted on the biggest social media site, and that it has already been processed, cleaned, and stored in a database. Why might analyzing the data be helpful for the hotel chain?</b>

Looking at reviews isn't enough in terms of social listening. Posts on social media (positive or negative) have the potential to influence potential customers. The volume of opinions posted online magnifies the scope of influence.


<b>What is the strategic value of doing such an analysis?</b>

It can serve as another metric to monitor the brand's perception. Understanding the themes behind low-sentiment posts can lead to improvements in the product and services.


<b>How can the hotel make the results of the sentiment analysis actionable?</b>

Focus on identifying dissatisfied customers, and making sure they feel understood and accounted for. One way is to proactively reach out to posters of negative posts, potentially offering refunds. Another is to investigate the causes that motivated negative posts. Alerts could help to address negative posts in near real-time.


<b>How would you perform sentiment analysis?</b>

It would be worth trying existing APIs or packages first, like NLTK and TextBlob, or attempting transfer learning with pretrained models from Hugging Face, etc.


<b>What preprocessing would you do with the data?</b>

- Fixing text encoding
- Stripping away HTML
- Removing stopwords
- Stemming or Lemmatization
- Vectorization


<b>What are some ways of turning the review text into numerical features that can be used for ML models?</b>

The general process is text vectorization, and possible methods are bag-of-words, n-grams, and TF-IDF.
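
As an illustration, here is a small TF-IDF sketch in plain Python. It uses whitespace tokenization and the unsmoothed variant idf = log(N / document frequency); libraries typically apply smoothing and normalization, so their numbers will differ. The function name and sample reviews are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return (vocabulary, one TF-IDF vector per document)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({token for doc in tokenized for token in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(token for doc in tokenized for token in set(doc))
    idf = {token: math.log(n_docs / doc_freq[token]) for token in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        length = len(doc) or 1  # avoid dividing by zero on empty documents
        vectors.append([counts[token] / length * idf[token] for token in vocab])
    return vocab, vectors


# Hypothetical reviews: "hotel" appears everywhere, so its weight is zero.
vocab, vectors = tf_idf(["clean hotel", "dirty hotel"])
print(vocab)
print(vectors)
```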


<b>How would you then run a model using those text vectors?</b>

A variety of classification methods can be used, such as logistic regression, random forests, discriminant analysis, kernel-based methods such as SVMs, and neural networks. The effectiveness can be evaluated through a confusion matrix and computation of metrics such as precision and recall.


<b>Why might categorizing the reviews into different topics be useful to the business?</b>

Feedback can be more effectively routed to the sub-departments of the hotelier.


<b>How would you go about grouping the reviews into categories?</b>

We can do some basic clustering on the content based on vector representations, using an algorithm such as k-means. Alternatively, we can model the text-related data over an underlying set of topics using Latent Dirichlet Allocation (LDA). LDA assumes each post to be a distribution of topics and that each topic is a distribution of words. We should be able to characterize posts and group them into appropriate buckets, i.e., as expressing various sentiments regarding the brand.

<h3>Case Study 6</h3>

<h4>Facebook: People You May Know</h4>

<b>Suppose you are to build out Facebook's friend recommendation product, a.k.a. the People You May Know feature. How would you go about doing so?</b>

Clarify the goal of the product: is it to maximize friend count for existing users? Drive engagement of the product (and if so, short or long-term)?

<b>Let's say we want to increase the number of meaningful connections found. How would you go about building the PYMK feature for this goal?</b>

Two approaches are possible. The first involves recommendations based on uploaded contact information. A signup flow typically asks a user to import their phone contacts or email address book, and then recommends that they friend people within their contacts who are on Facebook. The second approach involves leveraging a user's current social graph, recommending friends of current friends. Most likely, we would want a blend of both approaches.

<b>How would you go about leveraging a user's social graph for the PYMK feature?</b>

We can rank the potential friends for any given user based on the social graph. For example, you can make a candidate list of all second- and third-degree connections. Then, rank this list based on the likelihood that the user and candidate become friends.
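
A minimal sketch of candidate generation follows, restricted to second-degree connections for brevity and using the count of mutual friends as a stand-in for the likelihood score. The function name, the adjacency-set representation, and the toy graph are hypothetical.

```python
from collections import Counter


def rank_candidates(graph, user):
    """Rank friends-of-friends by number of mutual friends, highest first."""
    friends = graph.get(user, set())
    mutual_counts = Counter()
    for friend in friends:
        for candidate in graph.get(friend, set()):
            # Skip the user themselves and anyone who is already a friend.
            if candidate != user and candidate not in friends:
                mutual_counts[candidate] += 1
    # Break ties alphabetically so the output is deterministic.
    return sorted(mutual_counts.items(), key=lambda pair: (-pair[1], pair[0]))


# Hypothetical friendship graph stored as adjacency sets.
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan", "eve"},
    "cat": {"ann", "dan"},
    "dan": {"bob", "cat"},
    "eve": {"bob"},
}
print(rank_candidates(graph, "ann"))
```

In a fuller system the mutual-friend count would be one input feature to a learned ranking model rather than the final score.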

<b>What are some features you would use to measure the potential for two people being friends?</b>

We can start by looking at how many mutual connections two users have. Other signals include:

- Profile Similarities
- In-App Activity (e.g., commenting on the same post)
- Ecosystem Signals (e.g., usage of other apps)
- Off-App Signals (e.g., mobile GPS location)

<b>Can you give examples of specific methods you would use to rank potential friends for a given user?</b>

We could use classification algorithms like logistic regression or naive Bayes to predict, for a given user, the likelihood of friendship with another user. The target variable is whether they will become friends or not, and the features are as mentioned above.

<b>What about unsupervised techniques for ranking potential friends?</b>

Unsupervised techniques such as k-means or PCA can identify similar users who are not yet friends with that user.

<b>Does the model setup you described pose any potential problems for new users?</b>

One major problem would be assessing new users who choose not to upload email and contact information.

<b>Why do you think new users are important, from Facebook's perspective?</b>

Facebook's main value to a user is realized only after that user has added a sufficient number of friends.

<b>What are some product ideas you have to help new users make more friendships?</b>

- Boost New Users: boost the odds that new users appear in existing users' PYMKs, thereby increasing a new user's inbound friend requests.

- Friend Chaining: if a new user accepts all their inbound friend requests, instead of leaving that section empty, show PYMK.

- Get More PYMK Signal: existing users ignoring or removing new users suggested in their PYMK recommendations could be a source of training data.

- Increase PYMK Units: show more PYMK news feed units in their first two weeks unless they hit a certain number of friends made.

- Use Gamification: add a progress bar to push new users to make a certain number of friends. This can incorporate other steps like uploading a picture, completing a profile, etc.

<h3>Case Study 7</h3>

<h4>Stripe: Loan Approval Modeling</h4>

<b>Assume you are working on a loan approval model for small businesses. What metrics would you use to evaluate the model?</b>

To clarify: are we approving businesses for their requested loan amount, or recommending an upper limit that we can offer?

<b>Let's say the businesses apply to us with a fixed loan amount in mind.</b>

To further clarify: is this a binary situation, where a loan is either repaid or not, or are there partial repayments and debt collection to factor in?

<b>We will assume loans are either paid in full or defaulted on. What metrics would you use to evaluate this loan model?</b>

Our loan model can produce a probability score for whether a particular application will default. Given some threshold, each application can be classified according to whether it is likely to default. To evaluate the model, we will look at precision and recall, and the corresponding precision-recall curve, but not accuracy, since this is a highly imbalanced problem.

<b>What do FPs and FNs mean in this context?</b>

An FP is when the model predicts default when, in fact, the loan would have been repaid. An FN is when the model predicts that the application will not default when it does.

<b>Should FPs and FNs be weighted equally? Why or why not?</b>

No. An FP, where no loan was issued when it should have been, is the lesser monetary issue: assuming a 10% rate of interest (and that it's purely profit), Stripe will lose 10% of the principal on each FP. An FN, where a loan is issued and then defaults, can cost up to the entire principal, so it is roughly ten times as costly. We can use this ratio to evaluate our classifier by trying various thresholds and weighting the precision and recall read off the precision-recall curve accordingly.
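
One way to sketch the threshold search is to attach a monetary cost to each error type and pick the threshold with the lowest total. The 10% cost of an FP comes from the interest assumption above; treating an FN as a loss of the full principal is an additional assumption made here, as are the function names and the toy data.

```python
def total_cost(scores, defaulted, principals, threshold,
               fp_cost_rate=0.10, fn_cost_rate=1.0):
    """Monetary cost of classification errors at a given decision threshold."""
    cost = 0.0
    for score, did_default, principal in zip(scores, defaulted, principals):
        predicted_default = score >= threshold
        if predicted_default and not did_default:
            cost += fp_cost_rate * principal  # declined a good loan: lost interest
        elif not predicted_default and did_default:
            cost += fn_cost_rate * principal  # approved a loan that defaulted
    return cost


def best_threshold(scores, defaulted, principals, candidates):
    """Candidate threshold with the lowest total cost."""
    return min(candidates,
               key=lambda t: total_cost(scores, defaulted, principals, t))


# Hypothetical default scores, outcomes, and loan sizes.
scores = [0.1, 0.4, 0.6, 0.9]
defaulted = [False, False, True, True]
principals = [1000.0, 1000.0, 1000.0, 1000.0]
for t in (0.3, 0.5, 0.7):
    print(t, total_cost(scores, defaulted, principals, t))
print(best_threshold(scores, defaulted, principals, [0.3, 0.5, 0.7]))
```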

<b>Would relying on this type of model produce any edge cases, especially under scenarios having increased uncertainty?</b>

Due to regulations, there may be constraints on how much money can be loaned. Human review might be beneficial for borderline cases. Implementation of a multilevel system could further reduce the numbers of FPs and FNs. The cost of manual review should be factored into the total cost calculation.

<b>What are some features you would recommend incorporating into the model described above?</b>

Model construction would involve two feature dimensions. The first is at the loan applicant level and includes the applicant's demographics (assuming their use is legal and ethical) as well as financial health metrics. The second set of features is at the loan application level and includes answers to questions such as:

- How complete and reasonable were the answers provided on the application?
- How much money is the business looking to borrow?
- What is the purpose of the loan?
- How much is being put up as collateral?

<b>What other methods would you recommend using to improve the model?</b>

- One way is to utilize reject inference, which is based on the idea that not accounting for rejected applications introduces bias into the generating model. Models of loan defaults are generally trained only using data from previously granted loans, introducing sampling bias. Reject inference involves using another model that has been trained on application-rejected data.

- We could integrate anomaly detection with the model, to flag odd cases in real-time. However, there will likely be several anomalies on any given dimension, due to the curse of dimensionality.

<h3>Case Study 8</h3>

<h4>Instagram: Ranking for Instagram Explore</h4>

<b>How would you provide content recommendations for Instagram Explore? This is a surface where you can see a feed of customized photos and videos, often from accounts you don't follow.</b>

I'm curious about SLAs for our system, given that there are probably billions of pieces of content. Also, how real-time should it be? 1 million concurrent feed refreshes per minute?

<b>Let's go with your assumptions. What would be your high-level approach for providing recommendations?</b>

We can focus on identifying accounts that have content a user would find interesting, rather than working with data at the media content level, i.e., take a collaborative filtering approach at the account level.

<b>What features would you use for candidate retrieval?</b>

We can come up with an embedding per account by treating the account IDs that a user interacts with analogously to words in a sentence. If an individual interacts with a sequence of accounts in the same session, those accounts are likely to be topically similar.

Alternatively, we could build a matrix that stores the interactions between users and other Instagram accounts, and a factorized version of this matrix can be used to explore account similarity.

<b>How does the model utilize these features in order to come up with the candidates?</b>

We can first calculate a distance metric between two accounts, then use KNN to find similar accounts for any account in the embedding and train a classifier to predict a set of accounts' topics based on that embedding.

<b>What models would you use for the ranking step?</b>

At Instagram scale, training a neural network seems appropriate. Rather than treat this as a binary problem, we could treat it as a multi-class classification problem. For each post, we can predict the probability that it is liked, commented on, shared, hidden, reported, etc. These probabilities can be weighted to come up with a final score. The weights for the actions can be defined based on a statistical analysis that links each action to some top-level KPI, like user engagement.
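
The weighting step can be sketched as a simple weighted sum over predicted action probabilities. The action names, the weights, and the probabilities below are all hypothetical placeholders; in practice the weights would come from the analysis tying each action to the top-level KPI.

```python
def ranking_score(action_probs, action_weights):
    """Weighted sum of predicted action probabilities for one post."""
    return sum(action_weights.get(action, 0.0) * prob
               for action, prob in action_probs.items())


def rank_posts(posts, action_weights):
    """Post IDs ordered from highest to lowest score."""
    return sorted(posts,
                  key=lambda post_id: ranking_score(posts[post_id], action_weights),
                  reverse=True)


# Hypothetical weights: positive actions add to the score, negative ones subtract.
weights = {"like": 1.0, "comment": 2.0, "share": 3.0, "hide": -5.0, "report": -10.0}
posts = {
    "a": {"like": 0.5, "comment": 0.1, "hide": 0.02},
    "b": {"like": 0.6, "report": 0.05},
}
print(rank_posts(posts, weights))
```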

<b>Should this model be deployed in batch or online?</b>

Online, since the predictions have to be real-time and reflect users' most recent activity. For online serving, feature engineering needs to be optimized for low latency.

<b>What are some challenges and what rollout strategy would you use at inference time?</b>

We can choose between the following:

1. A single deployment, where all users see the changes directly. This is the simplest of the four, but costly if there are mistakes.

2. Controlled deployment, where a smaller subset of users see the new model and the majority see the previous model. This can be complex to implement.

3. Silent deployment, where both are deployed but users do not see the predictions of the new model. This doesn't allow for seeing how users react to the model.

4. Flighting deployment, where you can run online A/B tests. This is the best fit for this particular use case.

<b>How would you assess model performance over time?</b>

Since stale models cannot capture changes in user behaviors or understand new trends, we can 1) monitor performance over time on a frequent basis, and 2) look at KL divergence, a measure of how much two distributions differ, across key behavioral distributions used by the models.

<b>Can you describe what KL divergence is?</b>

KL divergence measures how much one probability distribution differs from another; it is zero when the two are identical and grows as they diverge. We can use it to compare the distributions of the input features over time. If the feature distributions change significantly, we should retrain our models.
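
For discrete (for example, binned) feature distributions, the computation can be sketched as below. The function name, the `eps` floor used to avoid dividing by zero, and the example histograms are assumptions for illustration. Note that KL divergence is not symmetric, so the order of the two distributions matters.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)


# Hypothetical binned feature distributions: training time vs. today.
training = [0.5, 0.5]
today = [0.9, 0.1]
print(kl_divergence(training, training))  # identical distributions give 0.0
print(kl_divergence(training, today))
print(kl_divergence(today, training))     # differs from the line above
```

A retraining alert could then fire when the divergence for a key feature exceeds a threshold chosen from historical variation.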

<b>Say you run an A/B test on your improved model vs. a baseline model and did not see any correlation between the performance of the model and specific business metrics of interest (engagement). Why might this be?</b>

There are several possible angles, stemming from various parts of the experiment. 

- 'Over-optimization' leads to saturation of metric improvements from further model improvement. The A/B test metric might be ranking relevance, but since the content is already relevant enough, it's not strongly correlated with further engagement.

- Model improvement past a certain point may have a negative effect on user experience. For example, the recommendations have become too niche or specific.

- People may feel their privacy has been violated, and feel 'creeped out' by very accurate or specific recommendations.