<img src="mv_top_design.jpg" width="1250" align="center">

Welcome to this interactive Python Jupyter Notebook! This assessment gauges your knowledge of Python programming and your understanding of predictive models, including classification, regression and time-series forecasting. As you work through the notebook, you will be asked to apply technical coding skills, demonstrate your proficiency in handling different types of data, and showcase your ability to build and refine predictive algorithms. Keep in mind that while understanding the concepts is crucial, implementation plays an equally important role in mastering Python and data science techniques.

Good luck and happy coding! 

# Part 1: Regression and Classification with Wine Data

You have recently been hired by a famous wine merchant who wants you to use your machine learning skills to help grow his business. For this exercise you will use the [UCI wine quality dataset](https://www.sciencedirect.com/science/article/abs/pii/S0167923609001377) (Cortez et al., 2009) to build a regression model and a classification model.

Let's take a look at the data:

In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

First, import the `wine_quality.csv` dataset.

## Brief

In this exercise you are going to build a regression model to predict the quality of a bottle of wine, and then a classification model to predict its colour. 


We have left this assignment fairly open: you can research and use more advanced models and techniques if you want, or use simpler models if you are more comfortable with them. Either way, you are expected to evaluate the models you build.

## 1) Explore the data

Never neglect your EDA! 

 - Calculate summary statistics for your data
 - Check the distribution - are all fields normally distributed? Does this matter?
 - Check for Null values
 - Check for outliers, how can you visualise this? What do you think you should do with them?
 - Check correlations and produce a heatmap to demonstrate them. Which fields are most correlated with quality? Are there any examples of multicollinearity?
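
The checks listed above can be sketched roughly as follows. This is only an illustration on synthetic data: the column names `alcohol` and `volatile_acidity` and all the numbers are assumptions standing in for whatever `wine_quality.csv` actually contains, and the 1.5 × IQR rule is just one common outlier convention.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the wine data; column names and values are assumptions.
rng = np.random.default_rng(0)
n = 200
wine = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.2, n),
    "volatile_acidity": rng.normal(0.35, 0.15, n),
})
wine["quality"] = (
    5.5
    + 0.4 * (wine["alcohol"] - 10.5)
    - 2.0 * (wine["volatile_acidity"] - 0.35)
    + rng.normal(0, 0.5, n)
).round()

summary = wine.describe()          # count, mean, std and quartiles per column
null_counts = wine.isnull().sum()  # number of missing values per column
skewness = wine.skew()             # rough indicator of departure from symmetry

# Count outliers per column using the 1.5 * IQR rule
q1, q3 = wine.quantile(0.25), wine.quantile(0.75)
iqr = q3 - q1
outlier_mask = (wine < q1 - 1.5 * iqr) | (wine > q3 + 1.5 * iqr)
outlier_counts = outlier_mask.sum()

# Correlation matrix, and fields ranked by absolute correlation with the target
corr = wine.corr()
ranked = corr["quality"].drop("quality").abs().sort_values(ascending=False)
print(ranked)
```

Passing `corr` to `sns.heatmap(corr, annot=True)` draws the heatmap, and `sns.boxplot(data=wine)` is one way to visualise the outliers.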

## 2) Regression

Your first task is to build a model that predicts the quality of a type of wine. You will need to build a linear regression model, but you can use as many features as you like; any you include must be justified from your EDA. Once you have created your model, evaluate it by reporting at least the r-squared.

Required:

 - Select and justify features to be included in your model
 - Use the train_test_split function to create a training and testing set
 - Build a regression model to predict quality from your chosen features
 - Report the r-squared and at least one other performance metric
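
A minimal sketch of the required steps, run on synthetic data so that it executes on its own. The feature names, the coefficients used to generate `y`, and the choice of mean absolute error as the second metric are all assumptions for illustration, not properties of the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the chosen features and the quality target (assumed names).
rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.2, n),
    "volatile_acidity": rng.normal(0.35, 0.15, n),
})
y = (
    5.5
    + 0.4 * (X["alcohol"] - 10.5)
    - 2.0 * (X["volatile_acidity"] - 0.35)
    + rng.normal(0, 0.5, n)
)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training set only, then predict on unseen rows
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"R^2: {r2:.3f}, MAE: {mae:.3f}")
```

Reporting both metrics on the test set, rather than the training set, gives a fairer picture of how the model generalises.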
 
Optional: 

 - Use cross validation to evaluate the performance of your model 
 - Use StandardScaler to standardise your features and compare the model's performance
 - Plot the model coefficients against each other. Which variable is the most important when predicting quality?
 - Plot the set of predicted values against the target variable- what does this show?
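
One possible way to approach the cross-validation and standardisation items is sketched below, again on synthetic data with assumed feature names. It also illustrates a point worth knowing: for ordinary least squares with an intercept, standardising the features does not change the fit, it only puts the coefficients on a comparable scale.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; feature names and effect sizes are assumptions.
rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.2, n),
    "volatile_acidity": rng.normal(0.35, 0.15, n),
})
y = (
    5.5
    + 0.4 * (X["alcohol"] - 10.5)
    - 2.0 * (X["volatile_acidity"] - 0.35)
    + rng.normal(0, 0.5, n)
)

# 5-fold cross-validated R^2 for the plain linear model
cv_scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(f"CV R^2: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# Fit once on raw features and once on standardised features
X_std = StandardScaler().fit_transform(X)
raw_model = LinearRegression().fit(X, y)
std_model = LinearRegression().fit(X_std, y)

# Standardised coefficients, ordered by absolute size
coefs = pd.Series(std_model.coef_, index=X.columns)
coefs = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(coefs)
```

Calling `coefs.plot(kind="barh")` gives a quick coefficient plot, and `plt.scatter(y, raw_model.predict(X))` shows predicted against actual values.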
 
Extend:

 - Optimise the model by using regularisation
 - Research cross validation estimators (e.g. ElasticNetCV)
 - Research sklearn Pipeline and use it in building your model
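
As a starting point for the extension items, the sketch below chains `StandardScaler` and `ElasticNetCV` in a `Pipeline`. The data is synthetic, the feature names (including the deliberately uninformative `random_noise` column) are hypothetical, and the `l1_ratio` grid is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; "random_noise" carries no signal by construction.
rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.2, n),
    "volatile_acidity": rng.normal(0.35, 0.15, n),
    "random_noise": rng.normal(0, 1, n),
})
y = (
    5.5
    + 0.4 * (X["alcohol"] - 10.5)
    - 2.0 * (X["volatile_acidity"] - 0.35)
    + rng.normal(0, 0.5, n)
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling lives inside the pipeline, so it is fitted on the training data only
l1_grid = [0.1, 0.5, 0.9, 1.0]
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("enet", ElasticNetCV(l1_ratio=l1_grid, cv=5, random_state=0)),
])
pipe.fit(X_train, y_train)

enet = pipe.named_steps["enet"]
best_alpha = enet.alpha_        # regularisation strength picked by internal CV
best_l1_ratio = enet.l1_ratio_  # L1/L2 mix picked by internal CV
test_r2 = pipe.score(X_test, y_test)
print(f"alpha={best_alpha:.4f}, l1_ratio={best_l1_ratio}, test R^2={test_r2:.3f}")
```

Keeping the scaler inside the pipeline avoids leaking information from the test set into the preprocessing step.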

### Business question on Regression 
Q: Can you briefly describe a scenario in your current organization where a regression model could be utilised effectively? What data would you acquire and how would you categorize them? What potential challenges might you encounter, and how could your solutions affect decision-making within the organization?

Double Click here to give us an answer!

A:

## 3) Classification

Now that you have built a model to predict the quality of a wine, you have been asked to predict whether it is red or white based on its features.

You should use logistic regression, but you can research and use other classification methods if you would like. 

Required:

 - Use train_test_split to create a training and testing set
 - Build a classification model to predict 'Colour' using your chosen features
 - Report the accuracy and baseline accuracy
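
The required steps could look something like the sketch below. The data is synthetic: the two feature names and their distributions are made up, and only the `Colour` column name comes from the brief. Baseline accuracy is taken here to mean the accuracy of always predicting the majority class.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: roughly a quarter of the wines are red (assumed proportion).
rng = np.random.default_rng(7)
n = 600
is_red = rng.random(n) < 0.25
wine = pd.DataFrame({
    "total_sulfur_dioxide": np.where(
        is_red, rng.normal(46, 30, n), rng.normal(138, 42, n)
    ),
    "volatile_acidity": np.where(
        is_red, rng.normal(0.53, 0.18, n), rng.normal(0.28, 0.10, n)
    ),
    "Colour": np.where(is_red, "red", "white"),
})

X = wine[["total_sulfur_dioxide", "volatile_acidity"]]
y = wine["Colour"]

# stratify keeps the red/white ratio similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y
)

clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)

# Baseline: always predict the most frequent class seen in training
baseline_clf = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline = baseline_clf.score(X_test, y_test)

print(f"Accuracy: {accuracy:.3f}, baseline: {baseline:.3f}")
```

An accuracy figure only means something relative to the baseline: with imbalanced classes, a model that always answers "white" can already look fairly accurate.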
 
Optional:
 
  - Produce a confusion matrix to show how effective your model is
  - Calculate the precision and recall to explain how effective your model is for predicting
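
To make the metrics concrete, here is a small hand-made example. The labels below are invented rather than model output, so each number can be checked by hand; "red" is treated as the positive class, which is an arbitrary choice.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels: 3 red and 6 white wines, with 4 wines predicted as red
y_true = ["red", "red", "red", "white", "white", "white", "white", "white", "white"]
y_pred = ["red", "red", "white", "red", "red", "white", "white", "white", "white"]

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(y_true, y_pred, labels=["red", "white"])
print(cm)
# [[2 1]
#  [2 4]]

# Precision: of the wines predicted red, how many really are red? 2 / (2 + 2)
precision = precision_score(y_true, y_pred, pos_label="red")

# Recall: of the wines that really are red, how many were found? 2 / (2 + 1)
recall = recall_score(y_true, y_pred, pos_label="red")

print(f"Precision: {precision:.2f}, recall: {recall:.2f}")
```

With real predictions, replace `y_true` and `y_pred` with the test labels and the output of the model's `predict` method.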
 
Extend:
 
  - Research and use other classification models (a Decision Tree, SVC, etc) and compare them to your logistic regression model


### Business question on Classification 
Q: Can you identify a situation or problem in your organization that could potentially benefit from the application of a classification model? What kind of data would you need and how would you plan to collect them? What outcomes would you expect, and how might those predictions guide the future strategies of your organization?

Double Click here to give us an answer!

A:

# Part 2: Time Series Forecasting Model with Walmart Sales Data

You are a data analyst at Walmart. Your manager wants you to analyze Walmart's weekly sales data over a two-year period from 2010 to 2012. However, they would like you to focus on analysing one single store. 

First, load the `train.csv` dataset.

The data include:

* `Store`: The store number.
* `Dept`: The department number.
* `Date`: The week.
* `Weekly_Sales`: Sales for the given department in the given store.
* `IsHoliday`: Whether the week is a special holiday week.

### 1) Preprocess the data using Pandas.

Since this is going to be a time series analysis, what are the things that we need to be aware of and check before modelling?

 - Convert the `Date` column to a `datetime` object.
 - Set `Date` as the index of the DataFrame
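
A minimal sketch of these two steps, using a tiny hand-made frame in place of `train.csv`. The column names follow the data description above, but the rows and values are invented for illustration.

```python
import pandas as pd

# Tiny stand-in for train.csv; the values are made up.
raw = pd.DataFrame({
    "Store": [1, 2, 1],
    "Dept": [1, 1, 2],
    "Date": ["2010-02-12", "2010-02-05", "2010-02-05"],
    "Weekly_Sales": [46039.49, 35034.06, 50605.27],
    "IsHoliday": [True, False, False],
})

# Parse the date strings into datetime64 values
raw["Date"] = pd.to_datetime(raw["Date"])

# Use the dates as the index and keep rows in chronological order
sales = raw.set_index("Date").sort_index()

# Checks worth running before any time series modelling
print(sales.index.dtype)                    # should be datetime64[ns]
print(sales.index.is_monotonic_increasing)  # rows sorted by date?
print(sales.isnull().sum())                 # any missing values?
```

When reading the real file, `pd.read_csv("train.csv", parse_dates=["Date"], index_col="Date")` performs the parsing and indexing in one call.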

### 2) Filter the dataset and compute the rolling mean of `Weekly_Sales`

Your manager would like you to focus on Store 1 sales. 

 - Filter the DataFrame to Store 1 sales 
 - Aggregate to compute the total weekly sales
 - Store this in a new DataFrame
 - Plot the rolling mean for `Weekly_Sales`. What general trends do you observe?
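
The filtering and aggregation steps can be sketched as below. The frame is generated synthetically with two stores and two departments; the sales figures are random and the 4-week window is an arbitrary choice, so try several window sizes on the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 104 weekly dates (Fridays) for 2 stores x 2 departments.
rng = np.random.default_rng(1)
dates = pd.date_range("2010-02-05", periods=104, freq="W-FRI")
rows = [
    {"Store": s, "Dept": d, "Date": dt, "Weekly_Sales": float(rng.normal(20000, 3000))}
    for s in (1, 2)
    for d in (1, 2)
    for dt in dates
]
sales = pd.DataFrame(rows).set_index("Date")

# Keep only Store 1
store1 = sales[sales["Store"] == 1]

# Sum across departments to get one total per week, stored in a new DataFrame
store1_weekly = store1.groupby(level="Date")[["Weekly_Sales"]].sum()

# 4-week rolling mean; the first 3 entries are NaN because the window is incomplete
rolling = store1_weekly["Weekly_Sales"].rolling(window=4).mean()
print(rolling.tail())
```

Calling `rolling.plot()` draws the smoothed series; a longer window such as 13 or 52 weeks smooths out more of the seasonal variation.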

### 3) Create an autocorrelation plot and decomposition plot 

 - Compute the autocorrelations of `Weekly_Sales` at lags `1`, `13`, and `52` (one week, one quarter, and one year)
 - Create a decomposition plot for the Store 1 sales data. 
 - Based on the analyses above, what can we deduce about this time series?
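
The autocorrelations can be computed with pandas alone, as in the sketch below. The series is synthetic, built with a deliberate 52-week cycle, so the numbers it prints say nothing about the real Store 1 data; the name `store1_sales` is hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic weekly series with a yearly (52-week) cycle plus noise.
rng = np.random.default_rng(3)
dates = pd.date_range("2010-02-05", periods=156, freq="W-FRI")
t = np.arange(len(dates))
store1_sales = pd.Series(
    1.5e6 + 1e5 * np.sin(2 * np.pi * t / 52) + rng.normal(0, 2e4, len(dates)),
    index=dates,
    name="Weekly_Sales",
)

# Correlation between the series and a copy of itself shifted by `lag` weeks
acf = {lag: store1_sales.autocorr(lag=lag) for lag in (1, 13, 52)}
for lag, value in acf.items():
    print(f"lag {lag:>2}: {value:.3f}")
```

A high lag-52 value points to yearly seasonality, while a high lag-1 value means adjacent weeks resemble each other. For the decomposition plot, `seasonal_decompose(store1_sales, model="additive", period=52).plot()` from `statsmodels.tsa.seasonal` is one option; it needs at least two full cycles (104 weekly observations) to run.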

### 4) Choose a time series forecasting model to predict the next 6 months of weekly sales for Store 1

 - This can be a simple naive forecast, Holt's linear model, an autoregressive model, or an ARIMA model.
 - Explain the rationale behind choosing your model and demonstrate analysis results in a graph. 
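
As one illustration of the simplest option, the sketch below compares a naive forecast with a seasonal naive forecast on a hold-out set, then projects 26 weeks ahead. It assumes a 52-week seasonal cycle and uses a synthetic series, so it shows the mechanics only and is not a recommendation for the real data.

```python
import numpy as np
import pandas as pd

# Synthetic weekly series with a yearly cycle; stands in for Store 1 total sales.
rng = np.random.default_rng(3)
dates = pd.date_range("2010-02-05", periods=156, freq="W-FRI")
t = np.arange(len(dates))
series = pd.Series(
    1.5e6 + 1e5 * np.sin(2 * np.pi * t / 52) + rng.normal(0, 2e4, len(dates)),
    index=dates,
)

horizon = 26  # roughly six months of weekly data
season = 52   # assumed seasonal period in weeks

# Hold out the last 26 weeks to evaluate the forecasts
train, test = series.iloc[:-horizon], series.iloc[-horizon:]

# Naive forecast: repeat the last observed value
naive = pd.Series(train.iloc[-1], index=test.index)

# Seasonal naive forecast: reuse the value from the same week one year earlier
seasonal_naive = pd.Series(
    train.iloc[-season:-season + horizon].to_numpy(), index=test.index
)

naive_mae = (test - naive).abs().mean()
seasonal_mae = (test - seasonal_naive).abs().mean()
print(f"Naive MAE: {naive_mae:,.0f}, seasonal naive MAE: {seasonal_mae:,.0f}")

# Forecast the 26 weeks after the end of the data using the full series
future_index = pd.date_range(
    series.index[-1] + pd.Timedelta(weeks=1), periods=horizon, freq="W-FRI"
)
future = pd.Series(
    series.iloc[-season:-season + horizon].to_numpy(), index=future_index
)
```

Plotting `series` and `future` on the same axes shows the forecast in context. Whatever model you choose, evaluating it on held-out weeks in this way gives you evidence to support your rationale.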

### Business question 
Q: Can you identify a scenario from your organization where this time series forecasting could be effectively applied? How would you gather and prepare the necessary data for this task? How do you think these predictive insights could enhance decision-making processes in your organization?

Double Click here to give us an answer!

A: