## Checklist
*Fill this table appropriately as you progress in your tasks:*


|**Section**|**Completion**|
|-|-|
|**Section 1**| **Completed** |
|  Q 1 | Completed |
|  Q 2 | Completed |
|  Q 3 | Completed |
|**Section 2**| **Completed** |
|  Q 1 | Completed |
|  Q 2 | Completed |
|  Q 3 | Completed |
|  Q 4 | Completed |
|**Section 3**| **[Chosen Topic Name]** |

---
# Introduction

Welcome! As a budding Data Scientist, you've been entrusted with the exciting opportunity to dive into the realms of Data Science and unravel valuable insights that will steer our organization towards success. Your efforts have the power to shape decision-making processes and position you as a key player in our team.

We're committed to providing everyone with an equal chance to showcase their talents. This guided assignment is designed not only to evaluate your fit within our dynamic team but also to help you grasp the type of impactful work you'll be contributing to as an intern.  We're confident that, with your hard work, you'll not only meet but exceed expectations. Let's get started! &#x1F4AA;

This is a practical exercise that will test your analytical and programming skills as well as your understanding of the various components of the analytics life cycle. **You are required to share a Jupyter (IPython) notebook (`.ipynb` format) and a presentation document (`.pptx` format), uploaded to a folder in Google Drive and shared as a "Google Drive link" with viewer access for everyone.**

**Note:** You will not be able to edit this file directly, so make a copy of it on your local machine or in Google Colab before answering the different sections of this assessment.

The final notebook shouldn't contain the questions. It should only have appropriate headings for each section/sub-section, and the answers should be numbered to match the questions.

---
# Instructions
Please read carefully:
- **Submit one Google Drive link with all the answers. The submitted Google Colab notebook and PPT should each be named in `<your_full_name>_<date_of_submission>` format.**
- **Your code, comments, and output should be present in the Colab notebook. Please make sure that all code, output, and text are organized and readable in the submitted Google Colab notebook.**
- You may not consult with any other person regarding the test. You are allowed to use internet searches, books, or notes you have on hand.
- **The test has 2 sections, both of which are mandatory.** Read the questions carefully and answer accordingly. **Code should be commented properly.**
- The **3rd section** contains resources on some advanced topics for you to go through. You can choose one of these topics and prepare to have a discussion on it during the interview. Please mention your selection in the checklist.
- In case of doubt, please make thoughtful assumptions.

**Start your Google Colab notebook with a checklist mentioning the parts you were/were not able to complete.** The table to fill is given at the top. Ideally, all sections must be marked "Completed".


---
# Section 1 - Funnel Analysis

Analysing data and getting actionable insights is one of the very basic but key tasks of any data professional. For the purpose of this assessment, you have been provided with the data. The data for this section can be accessed from [Assignment Data Excel Sheet](https://docs.google.com/spreadsheets/d/1olG6BF2l6vxBLenqbhpgbc4cISQy7siV/edit?usp=sharing&ouid=104708378877685927883&rtpof=true&sd=true) (click on the hyperlink and download the dataset).

**NOTE:** Download `AssignmentData.xlsx` to your current working directory. Don't make any changes to the data using Excel. All data manipulation must be done within this notebook, and your code must run on the original data file provided to you.

The Microsoft Excel file shared with you has 2 sheets:
1. `WorkerFunnel` sheet
2. `ABTest` sheet

`WorkerFunnel` sheet has the details of a garment manufacturing process and the productivity of the employees at the organisation. This data allows you to understand the productivity of the workers over a span of 70 days. The different columns represent the following:

| Column Name| Description|
|-|-|
| Date| Date in MM-DD-YYYY format|
| Quarter| A portion of the month. A month was divided into four or five quarters|
| Department| Associated department with the instance|
| Targeted Productivity| Targeted productivity set for each team for each day|
| Overtime| Represents the amount of overtime by each team in minutes|
| No. of Workers| Number of workers in each team|
| Actual Productivity| The actual productivity delivered by the workers, expressed as a fraction between 0 and 1|



Import data from the `WorkerFunnel` sheet of the `AssignmentData.xlsx` file into a dataframe named `funnel` and perform exploratory analysis.
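As a starting point for the import and a first look at data quality, here is a minimal sketch. The helper names `load_sheet` and `audit` are hypothetical, and the column names and department labels in the synthetic sample are assumptions based on the table above; the real workbook may label them differently. Reading `.xlsx` files with pandas typically also requires the `openpyxl` package.

```python
import pandas as pd


def load_sheet(path: str = "AssignmentData.xlsx",
               sheet_name: str = "WorkerFunnel") -> pd.DataFrame:
    """Read one sheet of the workbook into a DataFrame (hypothetical helper)."""
    return pd.read_excel(path, sheet_name=sheet_name)


def audit(df: pd.DataFrame) -> dict:
    """Summarise shape, missing cells per column, and fully duplicated rows."""
    return {
        "shape": df.shape,
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Synthetic stand-in so the sketch runs without the workbook.
# Department labels and values are made up for illustration only.
sample = pd.DataFrame({
    "Department": ["Dept A", "Dept A", "Dept B", "Dept B"],
    "Overtime": [960.0, 960.0, None, 1440.0],
    "Actual Productivity": [0.80, 0.80, 0.75, 0.90],
})
report = audit(sample)
print(report)

# With the real file in the working directory:
# funnel = load_sheet()
# print(audit(funnel))
```

The audit only counts problems. How to treat them (drop, impute, or keep) is a judgment call that the first question asks you to justify.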


1. Identify and appropriately handle the missing/blank and duplicate values in the dataset, and explain the logic behind your strategy in a short paragraph.

2. Create a new column called `Target Achieved` (categorical: "**Yes**" if Actual Productivity is greater than Targeted Productivity, and "**No**" otherwise). Then complete the tasks below using the columns Department, Targeted Productivity, Overtime, No. of Workers, and Quarter, together with the start and end dates of the observations in the dataset, and summarize the results. <br><br>

    a) Create grouped bar graphs showing how often the target was achieved ("Yes" and "No") over the given date range, at a quarterly time interval, for both categories. The graphs should have appropriate labels, titles, and any other elements that make them readable. Also provide a brief interpretation of the graphs.
    <br><br>
    b) Forecast the Actual Productivity for the next four quarters using the algorithms listed below. Display the forecast values and plot the results as line graphs: <br>
      (i) ARIMA <br>
      (ii) Rolling Averages <br><br>
    
    c) Create a summary comparing the models, listing each evaluation criterion and its value for each model. You may add more evaluation criteria, which will improve your chances of selection, but the following are mandatory: <br>
    (i) Mean Absolute Percentage Error (MAPE)<br>
    (ii) Mean Squared Error (MSE)

3. You are currently a Data Scientist hired on contract by the organisation. Your performance in this project and your ability to generate data-backed insights and strategies will decide whether the management approves the official establishment of a Data Science team, with you as the team lead. To present your findings to the management, create a short but detailed PowerPoint presentation that answers the following questions: <br><br>
    a) The organisation currently spends Rs. 8.4 lakh per quarter, with each department getting half of the amount. Analyze and compare the value brought in by each department in the various quarters. Please include appropriate visualizations to make the comparison easier for the management to understand. <br><br>(*Department Value* = *Actual Productivity* / *Department Quarterly Spend*). Normalize the computed value to lie in the range of 0-1 for easier interpretation. <br><br>
    b) The organisational budget remains Rs. 8.4 lakh per quarter. Suggest an allocation strategy to divide the resources between departments and share your reasoning (i.e., would you recommend a higher allocation to the better-performing department or to the worse-performing one?).
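The `Target Achieved` flag, the rolling-average forecast, and the two mandatory metrics from question 2 can be sketched as follows. This is one possible reading, not a reference solution: the function names are hypothetical, the column names are assumed to match the table above, and the recursive scheme (feeding each forecast back in as history) is just one common way to extend a rolling average beyond the observed data. ARIMA is not shown here.

```python
import numpy as np
import pandas as pd


def add_target_achieved(df: pd.DataFrame) -> pd.DataFrame:
    """Flag rows where actual productivity is strictly above the target."""
    out = df.copy()
    out["Target Achieved"] = np.where(
        out["Actual Productivity"] > out["Targeted Productivity"], "Yes", "No"
    )
    return out


def rolling_average_forecast(series: pd.Series, window: int, steps: int) -> list:
    """Forecast `steps` values. Each forecast is the mean of the last `window`
    values, with earlier forecasts appended to the history."""
    history = list(series)
    forecasts = []
    for _ in range(steps):
        next_value = float(np.mean(history[-window:]))
        forecasts.append(next_value)
        history.append(next_value)
    return forecasts


def mape(actual, predicted) -> float:
    """Mean Absolute Percentage Error in percent (actual values must be non-zero)."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)


def mse(actual, predicted) -> float:
    """Mean Squared Error."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean((actual - predicted) ** 2))


# Synthetic values for illustration only
toy = pd.DataFrame({
    "Targeted Productivity": [0.80, 0.70, 0.75],
    "Actual Productivity": [0.85, 0.70, 0.60],
})
flagged = add_target_achieved(toy)
forecast = rolling_average_forecast(pd.Series([0.6, 0.8, 0.7, 0.9]), window=2, steps=2)
print(flagged["Target Achieved"].tolist(), forecast)
```

To compare models, a typical approach is to hold out the last few observations, forecast them with each model, and pass the held-out actuals and the forecasts to `mape` and `mse`.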

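The *Department Value* formula in question 3a can be sketched as below. Several choices here are assumptions rather than part of the brief: productivity is aggregated as the mean per quarter and department, normalization uses min-max scaling, and the function name, quarter labels, and department labels are hypothetical. The spend figure follows from the brief (Rs. 8.4 lakh = Rs. 840,000, split equally between two departments).

```python
import pandas as pd

QUARTERLY_BUDGET_RS = 840_000                      # Rs. 8.4 lakh, from the brief
SPEND_PER_DEPARTMENT_RS = QUARTERLY_BUDGET_RS / 2  # each department gets half


def department_value(df: pd.DataFrame) -> pd.DataFrame:
    """Mean productivity per (Quarter, Department) divided by spend,
    then min-max scaled to the 0-1 range."""
    grouped = df.groupby(
        ["Quarter", "Department"], as_index=False
    )["Actual Productivity"].mean()
    grouped["Department Value"] = (
        grouped["Actual Productivity"] / SPEND_PER_DEPARTMENT_RS
    )
    low = grouped["Department Value"].min()
    span = grouped["Department Value"].max() - low
    # Guard against a zero range (all values equal), which would divide by zero
    grouped["Normalized Value"] = (
        0.0 if span == 0 else (grouped["Department Value"] - low) / span
    )
    return grouped


# Synthetic values for illustration only
toy = pd.DataFrame({
    "Quarter": ["Q1", "Q1", "Q1", "Q2", "Q2"],
    "Department": ["Dept A", "Dept A", "Dept B", "Dept A", "Dept B"],
    "Actual Productivity": [0.6, 0.8, 0.9, 0.5, 0.7],
})
values = department_value(toy)
print(values)
```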
---
# Section 2 - A/B Testing

`ABTest` sheet contains the data from an experiment conducted to optimize the website design by splitting the traffic to the website to the control and experiment groups. The control group was shown a "sign up" button in red and the treatment group was shown it in green.

A/B testing references:
 - [A/B Testing Basics](https://towardsdatascience.com/a-b-testing-the-basics-86d6d98525c9) - Medium Article
 - [A/B Testing](https://vwo.com/ab-testing/) - VWO Article

Import data from the `ABTest` sheet of the `AssignmentData.xlsx` file into a dataframe named `abtest` and perform exploratory analysis.

1. Create a time-series visualization with Date (on the x-axis) and Total Number of Clicks (on the y-axis), with a separate trendline for each device type, in order to find which device performed best in terms of total number of clicks.

2. Assume MDE = 3%, confidence level = 95% (α = 0.05), and statistical power (1-β) = 80%. What is the sample size required for the test? (Use this article to supplement your understanding - [Sample Size in A/B Testing](https://guessthetest.com/calculating-sample-size-in-a-b-testing-everything-you-need-to-know/)). Do we have sufficient sample size to conclude the test?

3. Write a function that accepts the following inputs to test your hypothesis at the chosen level of statistical significance:
    - Control Group Visitors
    - Control Group Conversions
    - Treatment Group Visitors
    - Treatment Group Conversions
    - Confidence Level (three options: 90, 95, 99).

  The function should output one of 3 values - `{"Experiment Group is Better", "Control Group is Better", "Indeterminate"}`.<br> Use the function to perform the A/B test on the given data, and provide your findings and an interpretation of the results.<br><br>
4. Create a simple Streamlit app (you can follow this [tutorial](https://youtu.be/sogNluduBQQ?si=wA5a2wVh4bqeAtmi)) using the function you created that performs the hypothesis test by taking in the above mentioned inputs from the user. Finally, host this app on Streamlit Community Cloud using this [tutorial](https://blog.streamlit.io/host-your-streamlit-app-for-free/).
    
    **Note**: You get bonus points for a neater and more presentable app.
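For question 2, one common way to estimate the required sample size is the normal-approximation formula for comparing two proportions. The sketch below assumes a two-sided test and treats the MDE as an absolute lift in conversion rate; if you read the 3% as a relative lift, pass `baseline_rate * 0.03` instead. The function name is hypothetical, and the baseline rate of 10% is a placeholder that should be replaced by the control group's conversion rate from the data.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(baseline_rate: float, mde: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Visitors needed in EACH group for a two-sided two-proportion z-test.

    `mde` is an absolute lift, e.g. 0.03 means three percentage points.
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde
    pooled = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    numerator = (
        z_alpha * sqrt(2 * pooled * (1 - pooled))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Placeholder baseline of 10%; estimate the real one from the control group
n_required = sample_size_per_group(baseline_rate=0.10, mde=0.03)
print(n_required)
```

Comparing `n_required` with the number of visitors actually observed in each group indicates whether the test can be concluded.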

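For question 3, a minimal sketch of such a function is shown below, assuming a two-sided two-proportion z-test with a pooled standard error. Other tests (for example a chi-squared test) would also be defensible. The function name `ab_test` and its parameter names are hypothetical; only the three return strings come from the brief.

```python
from math import sqrt
from statistics import NormalDist


def ab_test(control_visitors: int, control_conversions: int,
            treatment_visitors: int, treatment_conversions: int,
            confidence_level: int = 95) -> str:
    """Two-sided two-proportion z-test with a pooled standard error."""
    if confidence_level not in (90, 95, 99):
        raise ValueError("confidence_level must be 90, 95 or 99")
    alpha = 1 - confidence_level / 100

    p_control = control_conversions / control_visitors
    p_treatment = treatment_conversions / treatment_visitors
    pooled = (control_conversions + treatment_conversions) / (
        control_visitors + treatment_visitors
    )
    std_error = sqrt(
        pooled * (1 - pooled) * (1 / control_visitors + 1 / treatment_visitors)
    )
    if std_error == 0:
        # No variation at all (e.g. zero conversions in both groups)
        return "Indeterminate"

    z = (p_treatment - p_control) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    if p_value >= alpha:
        return "Indeterminate"
    return "Experiment Group is Better" if z > 0 else "Control Group is Better"


# Made-up counts for illustration only
print(ab_test(1000, 100, 1000, 150, confidence_level=95))
```

The same function can back the Streamlit app in question 4, with the five arguments collected from user inputs.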

---
# Section 3 - Advanced Learning


You can pick one of the following topics to learn more about using the provided resources. The topic you choose will be discussed in depth during the interview process.

  - Multi-Armed Bandit (MAB)<br>
[A brief overview of the Multi-Armed Bandit in Reinforcement Learning](https://medium.com/analytics-vidhya/a-brief-overview-of-the-multi-armed-bandit-in-reinforcement-learning-d086853dc90a)<br>
[Solving the Multi-Armed Bandit Problem](https://towardsdatascience.com/solving-the-multi-armed-bandit-problem-b72de40db97c)<br>
[What is the Multi-Armed Bandit Problem?](https://www.optimizely.com/optimization-glossary/multi-armed-bandit/)<br><br>
  - Controlled-experiment Using Pre-Experiment Data (CUPED)<br>
[Understanding CUPED](https://matteocourthoud.github.io/post/cuped/) <br>
[Understanding how CUPED in GrowthBook Reduces Experiment Runtimes at the Los Angeles Times](https://medium.com/growth-book/understanding-how-cuped-in-growthbook-reduces-experiment-runtimes-at-the-los-angeles-times-79ba7c288d71)<br>
[How Booking.com increases the power of online experiments with CUPED](https://booking.ai/how-booking-com-increases-the-power-of-online-experiments-with-cuped-995d186fff1d)
<br><br>
  - Causal Inference<br>
[Causal Inference as a Blind Spot of Data Scientists](https://dzidas.com/ml/2023/10/15/blind-spot-ds/)<br>
[Causal inference (Part 1 of 3): Understanding the fundamentals](https://medium.com/data-science-at-microsoft/causal-inference-part-1-of-3-understanding-the-fundamentals-816f4723e54a)