Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE". (After you have done that, you can delete the 'raise NotImplementedError()' line, and then run your code to check that it works).

Also, enter your NAME in the next cell.


In [1]:
NAME = ""

---

# ICT706 SouthBank 2020 Semester 1 Task 2

This assignment will be done completely inside this Jupyter notebook.

### Background
A medium-size company has given you one year of data about the online purchases that their customers have made.  They want you to analyse the data using statistical and machine learning techniques and produce:
* a prediction algorithm for predicting how much money each customer is likely to spend in a year;
* a classification algorithm for predicting which customers will be 'big spenders';
* some recommendations on what marketing strategy they should use to attract more 'big spender' customers.

### Instructions
Follow all the instructions in this notebook to complete these tasks.  Note that some cells contain 'assert' statements - these will automatically mark your work so that you can check that you have done the preceding steps correctly.  (If they give errors, go back and correct your previous work until those errors are fixed.  Once those 'assert' cells execute without errors, you know that you have achieved the marks for that step.)

When you have finished, this notebook is the only file that you will need to submit to Blackboard.

Note: If you want some space to try out some Python code of your own, feel free to add extra cells into this notebook.  Before you submit your notebook, make sure that those extra cells either execute without error or have been deleted.

### Overview
You have five sections to complete in this Notebook (total = 100 marks):
* Part A: Load and Clean Data (20 points)
* Part B: Data Exploration (30 points)
* Part C: Predicting Spending Levels (20 points)
* Part D: Predicting Big Spenders (20 points)
* Part E: Business Recommendations (10 points)

In [3]:
# add all your imports here.
# YOUR CODE HERE
raise NotImplementedError()

---
# Part A: Load and Clean Data (20 points)

Save your CSV data file into the same folder as this notebook.

Write Python code to load your dataset into a Pandas DataFrame called 'sales'.
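
As a rough illustration of the loading step, the sketch below reads CSV text with `pandas.read_csv`. The in-memory text, the name `demo`, and its three columns are made up for this example; in the notebook you would pass the path of your own data file instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for a real CSV file (columns are illustrative only).
csv_text = "CustNum,Name,Spend\n1,Ann,$120.50\n2,Bob, $80\n"

# read_csv accepts a file path or any file-like object.
demo = pd.read_csv(io.StringIO(csv_text))
print(demo.shape)   # (rows, columns)
print(demo.dtypes)  # the Spend column is read as text because of the '$'
```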

In [1]:
# YOUR CODE HERE
raise NotImplementedError()

After you have loaded the data correctly, you should have 10,000 rows. 
Run the following cells and tests to check that you have done this correctly.

In [2]:
sales.head()

In [None]:
"""Check that 'sales' has the right shape and number of rows (5 points)."""
assert len(sales.columns) == 10
assert sales.columns[0] == "CustNum"
assert sales.shape == (10000, 10)

## Cleaning the Data

Some of the columns are strings, with dollar signs.  We need to convert them to numbers (float) so that we can do calculations on them.  The next cell shows what will go wrong if we try doing calculations *before* converting them to floats!

In [None]:
s2 = sales["Spend"] * 4
s2.head()
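
The same effect can be seen on a single value in plain Python. This is a small sketch with a made-up value: multiplying a string repeats its text, while multiplying a float does arithmetic.

```python
spend_text = "$25"            # made-up value, stored as text
repeated = spend_text * 4     # string repetition, not arithmetic
numeric = float("25") * 4     # arithmetic on a real number

print(repeated)
print(numeric)
```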

In [None]:
# Complete the following remove_dollar function 
# so that it removes any dollar signs and spaces
# and then returns the string as a number (float).
def remove_dollar(s):
    """Removes dollar signs and spaces from s.
    Returns it as a float.
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
"""Check that remove_dollar() removes dollars and spaces properly (5 points)."""
assert remove_dollar("12") == 12.0
assert remove_dollar("$123") == 123.0
assert remove_dollar("  $1234") == 1234.0
assert remove_dollar(" $42.3 ") == 42.3

## Clean up the Spend columns

Apply your remove_dollar function to the "Spend" column (every row), and put the cleaned-up float values into a new column of your 'sales' DataFrame called **"SpendValue"**.

Then do the same for the "LastSpend" column and put the float values into a new column called **"LastSpendValue"**.
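
The general pattern of applying a function to every row of a column can be sketched on an unrelated toy column. The helper `strip_percent`, the "Discount" column, and the `demo` DataFrame are hypothetical and are not part of your dataset.

```python
import pandas as pd

def strip_percent(text):
    """Hypothetical helper: turn a string like ' 12.5% ' into the float 12.5."""
    return float(text.replace("%", "").strip())

demo = pd.DataFrame({"Discount": ["10%", " 12.5% ", "0%"]})

# Series.apply calls the function once per row and returns a new Series,
# which is stored here as a new column.
demo["DiscountValue"] = demo["Discount"].apply(strip_percent)
print(demo.dtypes)
```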

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
sales.dtypes

In [None]:
# check the new SpendValue columns (5 points)
# (Index.contains() was removed in pandas 1.0, so use the 'in' operator)
assert "SpendValue" in sales.columns
assert "LastSpendValue" in sales.columns
# check that they are floats
assert sales["SpendValue"].dtype == "float64"
assert sales["LastSpendValue"].dtype == "float64"
# check that SpendValue is positive and LastSpendValue is not negative.
assert (sales["SpendValue"] > 0.0).all()
assert (sales["LastSpendValue"] >= 0.0).all()

## Make Sex and State numeric

To use the Sex and State columns as input features for the machine learning algorithms in Scikit-Learn they must be numeric.

Use the **LabelEncoder** object from the sklearn.preprocessing package to convert the 'Sex' column into an integer column called **"SexValue"**.  

Also convert the "State" column into an integer column called **"StateValue"**. 
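
For reference, here is a minimal sketch of how `LabelEncoder` behaves, using a made-up "Colour" column rather than your data. It sorts the distinct labels and maps them to the integers 0 to n-1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column for illustration only.
demo = pd.DataFrame({"Colour": ["red", "blue", "green", "blue"]})

encoder = LabelEncoder()
# fit_transform learns the sorted set of labels and returns their integer codes.
demo["ColourValue"] = encoder.fit_transform(demo["Colour"])

print(list(encoder.classes_))  # label at position i is encoded as integer i
print(demo)
```

The `classes_` attribute is handy later on, when you want to translate the integer codes back into readable labels.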

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# see if Sex has been mapped to ints properly?
cols = ["Name", "Sex", "SexValue"]
sales[cols].head()

In [None]:
# see if State has been mapped to ints properly?
cols = ["Name", "State", "StateValue"]
sales[cols].head(10)

In [None]:
# test the new SexValue and StateValue columns (5 points)
# (Index.contains() was removed in pandas 1.0, so use the 'in' operator)
assert "SexValue" in sales.columns
assert "StateValue" in sales.columns
# check that they are integer
assert str(sales["SexValue"].dtype).startswith("int")   # "int32" or "int64"
assert str(sales["StateValue"].dtype).startswith("int") # "int32" or "int64"
# check the largest encoded value in each column.
assert sales["SexValue"].max() == 1    # 0 and 1 only
assert sales["StateValue"].max() == 7  # codes 0 to 7, i.e. 8 distinct State values

In [None]:
# Finally, let us view just the numeric columns.
numcols = ["CustNum", "SexValue", "Age", "StateValue",
           "Income", "Clicks", "Purchases", "SpendValue"]
sales[numcols].head()

---

# Part B: Data Exploration (30 points)

In this section, you will explore the data statistically and visually, to get a feel for what kinds of data you have, and how much people are spending on your web site.

## B.1 Data Inspection

Start by using the Pandas **describe()** function to analyse all the numeric columns of your 'sales' DataFrame.  Spend some time looking at this and making sure that you understand the average (mean) and range (min and max) of each column.
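
As a small sketch of what `describe()` returns, the toy DataFrame below (made-up numbers) shows how to pick out the mean, min and max rows of the summary table.

```python
import pandas as pd

# Made-up values, for illustration only.
demo = pd.DataFrame({
    "Purchases": [1, 2, 3, 4],
    "SpendValue": [10.0, 20.0, 30.0, 40.0],
})

summary = demo.describe()  # count, mean, std, min, quartiles, max per column
print(summary.loc[["mean", "min", "max"]])
```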

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Data Inspection Questions

In the next cell, write your observations about the "SpendValue" and "Purchases" columns.  For each column, say what the average value is and discuss what that means in terms of your sales to an average person.  Also discuss the min and max values.  

Based on the "SpendValue" column, explain how much your "big spenders" (the top 25% of your clients) are spending each year.  This will be a range of values, such as from 1000 to 2000 dollars.
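
One possible way to find such a range is the 75th percentile, which also appears as the "75%" row of `describe()`. The sketch below uses a made-up Series, and the names `cutoff` and `top` are illustrative only.

```python
import pandas as pd

# Made-up spending values, for illustration only.
spend = pd.Series([100.0, 200.0, 300.0, 400.0, 500.0])

cutoff = spend.quantile(0.75)   # value at the 75th percentile
top = spend[spend >= cutoff]    # rows at or above the cutoff

print(cutoff, top.min(), top.max())
```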

Your discussion must all be in the next cell.  

Add three level-2 headings in that cell to break your discussion into topics: "Purchases column", "SpendValue column", and "Big Spenders".

### Answer:
YOUR ANSWER HERE

## B.2 Differences between States

We want to know where most of our customers live and whether customers from certain areas spend more or less than average.  Write some Pandas code to calculate and display the total **number of customers** in each Australian state (NSW, QLD, VIC, etc.) and their average **SpendValue**.  

Hint: you could do this by *grouping* your 'sales' table, or by *looping* through all the states, or several other ways.
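
The grouping approach mentioned in the hint can be sketched on a toy table as follows. The five rows are made up; `agg` computes both the row count and the mean for each group in one step.

```python
import pandas as pd

# Made-up rows, for illustration only.
demo = pd.DataFrame({
    "State": ["NSW", "QLD", "NSW", "QLD", "NSW"],
    "SpendValue": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# One row per state, with the number of customers and their average spend.
by_state = demo.groupby("State")["SpendValue"].agg(["count", "mean"])
print(by_state)
```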

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Graphical Comparison of States

Now *graph* your results, so that you can see them visually.

NOTE: since the states in Australia have very different populations, you should also calculate and graph the number of customers *relative* to the population of each state (you can use Google to find populations of each state).

So you should show at least the following three graphs:
* the absolute number of customers in each state;
* the number of customers in each state as a percentage of the population of that state;
* the average SpendValue of customers in each state (dollars/customer).
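
The per-population calculation for the second graph can be sketched like this. The region names, customer counts and populations are placeholders rather than real figures, so substitute your own counts and the populations you look up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder numbers only, not real customer or population data.
customers = pd.Series({"RegionA": 300, "RegionB": 150})
population = pd.Series({"RegionA": 6_000_000, "RegionB": 1_000_000})

# pandas aligns the two Series on their index labels before dividing.
percent = customers / population * 100

fig, ax = plt.subplots()
ax.bar(percent.index, percent.values)
ax.set_ylabel("Customers (% of population)")
ax.set_title("Customers relative to population")
```

In this toy case RegionA has more customers in absolute terms, but RegionB has more relative to its population, which is the kind of difference the second graph is meant to reveal.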

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Question:
Discuss these graphs and explain your conclusions.

For example, are there *significant* differences in the average spend in different states?  Are our customers spread evenly across Australia, or concentrated in particular areas? 

Write your answer in the next cell, and give reasons for your conclusions.

### Answer:

YOUR ANSWER HERE

---

# Part C: Predicting Spending Levels (20 points)

Using the LinearRegression function from the Scikit-Learn library (**sklearn**), build a machine learning model for predicting the expected **SpendValue** for a customer.  

Measure the performance of your model using 10-fold cross-validation with a test set size of 20% and print various measures of how accurate your predictions are.
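
One way to read "10 folds with a 20% test set" is ten random 80/20 splits, which scikit-learn provides as `ShuffleSplit` (a strict `KFold` with 10 folds would hold out 10% each time instead). The sketch below takes that reading and uses synthetic data, not your 'sales' table.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic data for illustration: 3 features with a known linear relationship.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Ten random splits, each holding out 20% of the rows for testing.
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

r2 = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
# scikit-learn reports errors as negative scores, so flip the sign.
mae = -cross_val_score(LinearRegression(), X, y, cv=cv,
                       scoring="neg_mean_absolute_error")

print("mean R^2:", r2.mean())
print("mean absolute error:", mae.mean())
```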

In [None]:
sales.head()

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Analysis of Results

Print out the linear regression coefficients for all the input features, so that you can see which ones are more significant and which ones are unimportant.  

Hint 1: Since the scale of the input features is so different (0-1 for sex, 0-160000 for income, etc.), multiply each linear regression coefficient by the average value of the corresponding column, to see how many dollars that column contributes to the total predicted spend.
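
Hint 1 can be sketched on a toy table as follows. The columns "Flag" and "Income" and the target formula are invented so that the true coefficients are known in advance (50 and 0.01).

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented features on very different scales, for illustration only.
demo = pd.DataFrame({
    "Flag": [0, 1, 0, 1, 0, 1],
    "Income": [20000.0, 40000.0, 60000.0, 80000.0, 100000.0, 120000.0],
})
target = 50.0 * demo["Flag"] + 0.01 * demo["Income"]

model = LinearRegression().fit(demo, target)

# Coefficient times column mean = average dollar contribution of that column.
contribution = pd.Series(model.coef_ * demo.mean().values, index=demo.columns)
print(contribution)
```

Although the raw coefficient for "Income" is tiny compared with the one for "Flag", its average contribution is much larger, which is why the raw coefficients alone can be misleading.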

Hint 2: Could you graph the predicted and actual SpendValue values of the test data, to see visually how good the linear regression results are?

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Discussion:

Discuss your conclusions about this linear regression model (in the next cell).  Which input features are most significant?

### Answer:
YOUR ANSWER HERE

---
# Part D: Predicting Big Spenders (20 points)

In this section we want to build some machine learning models to predict if a new customer is likely to be a big spender or not.  This will be a binary outcome (yes or no), so we can use machine learning *classification* algorithms.

Remember that our definition of 'Big-Spender' is a client whose annual spending level (**SpendValue**) is in the top 25% of our clients.  So the exact dollar cutoff for big spenders will be different for each student, as each of you is working for a different company and using a different dataset.

Choose two classification algorithms.  Use each one to build and then evaluate a 'big-spender' prediction model.
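
The overall workflow can be sketched on synthetic data as below. Logistic regression and a decision tree are used purely as examples of two classifiers, and the features and spending values are randomly generated rather than taken from your dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic features and spending values, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
spend = 3.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=400)

# Label the top 25% of spenders as 1 and everyone else as 0.
y = (spend >= np.quantile(spend, 0.75)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

classifiers = [
    ("logistic", LogisticRegression()),
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
]
scores = {}
for name, clf in classifiers:
    clf.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, clf.predict(X_test))

print(scores)
```

Because only 25% of customers are big spenders, a model that always predicts "no" already scores about 75% accuracy, so it is worth comparing your classifiers against that baseline and looking at measures such as precision and recall as well.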

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Discussion:

Discuss your conclusions about your two classification models (in the next cell).

Which classification algorithm gives the more accurate results? 

How accurate are the results from your best classifier?

### Answer:
YOUR ANSWER HERE

---
# Part E: Business Recommendations (10 points)

The company you are doing this analysis for wants some recommendations from you about how to find new customers who are likely to be big spenders.  Should they focus their advertising on a particular gender?  Or on people in a given state, such as Victoria or NSW?  Or on demographic groups with high or medium income levels?  Or use other strategies?  What recommendations will you give them?  

Write about 100 words describing your conclusions from your analysis, and your recommendations for the best strategy for attracting new big-spender customers.

## Recommendations:
YOUR ANSWER HERE