## Project 1

### Marissa Bradley

**CSC 19900 - Introduction to Data Science**

**Due Monday, Oct. 23**

The online retail company Yasuni has decided to start selling widgets. To help them better place targeted ads for widgets, they would like to build a Naive Bayes classifier that tries to predict whether or not a particular customer will purchase a widget based on that customer's sex and age group.

To gather data to train the model, they conducted a marketing study. They collected information on several customers who were presented with ads for widgets. For each customer, they recorded the following information:

* **Sex** - The customer's sex, recorded as 'M' for male and 'F' for female.
* **AgeGroup** - The customer's age group, recorded as either 'A', 'B', or 'C'. The age groups are defined as follows:
    - Group A consists of customers 25 years old or younger.
    - Group B consists of customers between 26 and 50 years old.
    - Group C consists of customers older than 50.
* **Purchase** - Records whether or not the customer purchased a widget. A purchase is indicated with a value of 1, whereas a 0 indicates that the customer did not buy a widget.

The results of the study are contained in the file [marketing.csv](https://lindenwood.instructure.com/courses/24551/files/folder/Projects?preview=11742947). Please download this file into the same folder that contains this notebook. Then run the cell below, which loads the pandas package and reads the data into a data frame called `df`.  

In [1]:
import pandas as pd
df = pd.read_csv('marketing.csv')

The code below will display the first five rows of the data frame `df`. 

In [2]:
df.head()

Unnamed: 0,Purchase,Sex,AgeGroup
0,0,M,A
1,1,M,B
2,1,F,B
3,0,F,A
4,1,M,B
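
The `Unnamed: 0` column in the output above is the name pandas gives to a CSV column whose header is empty, which commonly happens when a file was saved together with its row index. Assuming that is the case for `marketing.csv` (the file itself is not shown here), the sketch below uses a small in-memory CSV with made-up contents to show how `index_col=0` turns that column into the index instead. The names `csv_text`, `raw`, and `indexed` are illustrative.

```python
import io
import pandas as pd

# Hypothetical CSV text with an empty first header, mimicking a saved index
csv_text = ",Purchase,Sex,AgeGroup\n0,0,M,A\n1,1,M,B\n2,1,F,B\n"

# Default read: the unnamed first column becomes 'Unnamed: 0'
raw = pd.read_csv(io.StringIO(csv_text))

# Reading with index_col=0 uses that column as the row index
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))      # ['Unnamed: 0', 'Purchase', 'Sex', 'AgeGroup']
print(list(indexed.columns))  # ['Purchase', 'Sex', 'AgeGroup']
```

Either way of loading leaves the `Purchase`, `Sex`, and `AgeGroup` columns unchanged, so the rest of the notebook works with both.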


The following cell creates three `series` objects called `pur`, `sex`, and `age`. Each of these `series` represents one of the three columns in the data frame `df`.

In [3]:
pur = df['Purchase']
sex = df['Sex']
age = df['AgeGroup']

Before building our classifier, let's start with a couple of warmup problems. Assume that `mySeries` is a series, and `X` is a variable that contains a possible value of `mySeries`. We can count the number of elements in `mySeries` that are equal to `X` by using the following code:

    sum( mySeries == X )
    
Find the number of customers in the study who fall into age group `B`. 

In [4]:
sum(age == 'B')

76
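
The counting idiom used above works because comparing a series to a value produces a boolean series, and summing treats each `True` as 1. Here is a minimal sketch on a made-up series; the name `toy_age` and its values are illustrative and are not taken from `marketing.csv`.

```python
import pandas as pd

# Hypothetical toy data, not from marketing.csv
toy_age = pd.Series(['A', 'B', 'B', 'C', 'B'])

# Elementwise comparison gives a boolean series:
# False, True, True, False, True
mask = toy_age == 'B'

# True counts as 1 when summed, so this counts the matches
count_b = int(mask.sum())

print(count_b)  # 3
```

The method call `mask.sum()` gives the same count as the built-in `sum(mask)`.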

Now, assume that we have two series, `seriesA` and `seriesB`, and assume that the entries of the two series are paired so that for each `i`, `seriesA[i]` and `seriesB[i]` record information for the same individual. We can count the number of entries for which `seriesA` is equal to `X` and `seriesB` is simultaneously equal to `Y` as follows:

    sum( (seriesA == X) & (seriesB == Y))
    
Find the number of customers in the study who are female and who made a purchase.

In [5]:
sum( (sex == 'F') & (pur == 1))

50
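
To see how the two conditions combine, here is a small sketch on made-up paired series (`toy_sex` and `toy_pur` are hypothetical and not taken from the study). The `&` operator works element by element, and each comparison needs its own parentheses because `&` binds more tightly than `==`.

```python
import pandas as pd

# Hypothetical paired data: row i describes the same person in both series
toy_sex = pd.Series(['F', 'M', 'F', 'F', 'M'])
toy_pur = pd.Series([1, 1, 0, 1, 0])

# True only where both conditions hold for the same row
both = (toy_sex == 'F') & (toy_pur == 1)

n_both = int(both.sum())
print(n_both)  # 2
```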

Our first task in building our classifier is to write a function that calculates conditional probabilities. In the cell below, create a function called `condProb()`. The function should take four arguments: `seriesX`, `X`, `seriesY`, and `Y`. 

The function should return the following conditional probability: $P(\text{seriesX} = X \mid \text{seriesY} = Y)$.

That is, the return value should be equal to the probability that `seriesX` is equal to `X` given that `seriesY` is equal to `Y`. 

In [6]:
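def condProb(seriesX, X, seriesY, Y):
    # Number of paired entries where seriesX == X and seriesY == Y
    numXY = sum((seriesX == X) & (seriesY == Y))
    # Number of entries where the condition seriesY == Y holds
    numY = sum(seriesY == Y)

    # Estimate of P(seriesX = X | seriesY = Y); assumes numY > 0
    return numXY / numY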

Calculate the conditional probability that a customer is female, given that they made a purchase. 

In [7]:
condProb(sex, 'F', pur, 1)

0.41666666666666669
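
As a cross-check on the idea of a conditional probability, the sketch below computes the same kind of quantity in two ways on made-up data: as a ratio of counts, and as the share of matches among only the rows that satisfy the condition. The helper name `cond_prob` and the toy series are illustrative assumptions, not part of the assignment.

```python
import pandas as pd

def cond_prob(series_x, x, series_y, y):
    """Estimate P(series_x == x | series_y == y) from paired series."""
    given = series_y == y
    return ((series_x == x) & given).sum() / given.sum()

# Hypothetical paired data, not from marketing.csv
toy_sex = pd.Series(['F', 'M', 'F', 'F', 'M'])
toy_pur = pd.Series([1, 1, 0, 1, 0])

# Ratio of counts: 2 female purchasers out of 3 purchasers
p = cond_prob(toy_sex, 'F', toy_pur, 1)

# Equivalent view: keep only the purchasers, then take the share who are 'F'
p_alt = (toy_sex[toy_pur == 1] == 'F').mean()

print(round(p, 4), round(p_alt, 4))  # 0.6667 0.6667
```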

We are now ready to construct our Naive Bayes classifier. In the cell below, write a function called `predPurchase` that takes two arguments: `givenSex` and `givenAge`. The function should calculate the following two scores:

* `score0` measures the likelihood that a person with the supplied sex and age group will NOT buy a widget.
* `score1` measures the likelihood that a person with the supplied sex and age group will buy a widget.

Based on these scores, the function should make a prediction regarding whether or not the customer will make a purchase. It should return either a 0 or a 1, as follows:

* The function should return a value of 0 if it predicts that the customer will NOT buy a widget.
* The function should return a value of 1 if it predicts that the customer will buy a widget.

In [8]:
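def predPurchase(givenSex, givenAge):
    # Each score is proportional to P(sex | class) * P(age | class) * P(class);
    # the class count stands in for the prior and scales both scores equally.
    score0 = condProb(sex, givenSex, pur, 0) * condProb(age, givenAge, pur, 0) * sum(pur == 0)
    score1 = condProb(sex, givenSex, pur, 1) * condProb(age, givenAge, pur, 1) * sum(pur == 1)
    if score0 > score1:
        return 0
    else:
        return 1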
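
The two scores in the function above can be compared directly for the following reason. Under the naive assumption that sex and age group are independent given the class, the probability of class $c$ given sex $s$ and age group $a$ satisfies the relation below. The symbols $N_c$ (the number of customers whose `Purchase` value is $c$) and $N$ (the total number of customers) are notation introduced here for illustration.

$$P(c \mid s, a) \propto P(s \mid c)\, P(a \mid c)\, P(c), \qquad P(c) = \frac{N_c}{N}$$

The function multiplies by the count $N_c$ rather than by $P(c)$, so

$$\text{score}_c = P(s \mid c)\, P(a \mid c)\, N_c = N \cdot P(s \mid c)\, P(a \mid c)\, P(c)$$

Both scores are scaled by the same factor $N$, so comparing `score0` with `score1` gives the same prediction as comparing the two class probabilities.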

Print the outcome of the classifier (either 0 or 1) for each of the six possible groups of people:

1. sex = 'M', age = 'A'
2. sex = 'M', age = 'B'
3. sex = 'M', age = 'C'
4. sex = 'F', age = 'A'
5. sex = 'F', age = 'B'
6. sex = 'F', age = 'C'


In [9]:
print(predPurchase('M', 'A'))
print(predPurchase('M', 'B'))
print(predPurchase('M', 'C'))
print(predPurchase('F', 'A'))
print(predPurchase('F', 'B'))
print(predPurchase('F', 'C'))

0
1
1
0
0
1


We will now see how well the classifier performs on our original data set. Create the following four variables: `tPos`, `fPos`, `tNeg`, and `fNeg`. The variables are intended to refer to the following quantities:

* `tPos` is the number of **true positives**. It records the number of people in our dataset that the classifier predicted would buy a widget, and that actually did so. 
* `fPos` is the number of **false positives**. It records the number of people in our dataset that the classifier predicted would buy a widget, but that did NOT actually buy one. 
* `tNeg` is the number of **true negatives**. It records the number of people in our dataset that the classifier predicted would NOT buy a widget, and that did NOT buy one. 
* `fNeg` is the number of **false negatives**. It records the number of people in our dataset that the classifier predicted would NOT buy a widget, but that did actually buy one.

Write a loop that considers each customer in the survey. For each customer, use the classifier to make a prediction for that customer, storing the prediction (either 0 or 1) in a variable called `pred`. Then determine which of the four categories described above the person belongs in, and increment the relevant count. 

After the loop is finished executing, print the following four lines, with the bracketed values replaced as appropriate.

    The number of true positives is [tPos].
    The number of false positives is [fPos].
    The number of true negatives is [tNeg].
    The number of false negatives is [fNeg].

In [10]:
tPos = 0
fPos = 0
tNeg = 0
fNeg = 0

for i in range(len(sex)):
    pred = predPurchase(sex[i], age[i])
    if pred == 0 and pur[i] == 0:
        tNeg += 1
    elif pred == 1 and pur[i] == 1:
        tPos += 1
    elif pred == 1 and pur[i] == 0:
        fPos += 1
    else:
        fNeg += 1
print("The number of true positives is " + str(tPos) + ".")
print("The number of false positives is " + str(fPos) + ".")
print("The number of true negatives is " + str(tNeg) + ".")
print("The number of false negatives is " + str(fNeg) + ".")

The number of true positives is 96.
The number of false positives is 16.
The number of true negatives is 64.
The number of false negatives is 24.
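
The four counts can also be obtained without an explicit loop by comparing a series of predictions with a series of actual outcomes. The sketch below uses made-up values; `toy_pred` and `toy_actual` are hypothetical and do not come from the study. It is an alternative to the loop above, not a replacement for what the assignment asks.

```python
import pandas as pd

# Hypothetical predictions and actual outcomes for six customers
toy_pred = pd.Series([1, 1, 0, 0, 1, 0])
toy_actual = pd.Series([1, 0, 0, 1, 1, 0])

# Each count is the number of rows matching one prediction/outcome pair
t_pos = int(((toy_pred == 1) & (toy_actual == 1)).sum())
f_pos = int(((toy_pred == 1) & (toy_actual == 0)).sum())
t_neg = int(((toy_pred == 0) & (toy_actual == 0)).sum())
f_neg = int(((toy_pred == 0) & (toy_actual == 1)).sum())

print(t_pos, f_pos, t_neg, f_neg)  # 2 1 2 1

# Sanity check: every row falls into exactly one of the four categories
assert t_pos + f_pos + t_neg + f_neg == len(toy_pred)
```

The same sanity check applies to the results above: 96 + 16 + 64 + 24 = 200, so every customer is counted exactly once.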


We can measure the effectiveness of a classifier by calculating its **accuracy** and its **precision**. 

* The **accuracy** of a model is defined as the number of correct classifications divided by the total number of observations. That is to say that: $acc = \frac{tPos + tNeg}{tPos + tNeg + fPos + fNeg}$
* The **precision** of a model is defined as the number of true positives divided by the total number of times that it predicted a positive result. That is to say that: $prec = \frac{tPos}{tPos + fPos}$

So, the accuracy of the model estimates the probability that the model will make the correct classification. The precision of the model estimates the probability that a customer the model predicts will buy a widget actually buys one.

Calculate the variables `acc` and `prec` and then print the following two sentences.

    The accuracy of our model is [acc, rounded to four decimal places].
    The precision of our model is [prec, rounded to four decimal places].

In [11]:
acc = (tPos + tNeg) / (tPos + tNeg + fPos + fNeg)
prec = tPos / (tPos + fPos)
print("The accuracy of our model is " + str(round(acc, 4)) + ".")
print("The precision of our model is " + str(round(prec, 4)) + ".")

The accuracy of our model is 0.8.
The precision of our model is 0.8571.
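
The two formulas can be checked by hand against the counts reported earlier (96 true positives, 16 false positives, 64 true negatives, 24 false negatives). The sketch below wraps them in a small helper; the function name `accuracy_and_precision` is an illustrative choice and not part of the assignment.

```python
def accuracy_and_precision(t_pos, f_pos, t_neg, f_neg):
    """Return (accuracy, precision) from the four confusion counts."""
    total = t_pos + f_pos + t_neg + f_neg
    accuracy = (t_pos + t_neg) / total   # share of all predictions that are correct
    precision = t_pos / (t_pos + f_pos)  # share of positive predictions that are correct
    return accuracy, precision

# Counts taken from the output printed earlier in this notebook
check_acc, check_prec = accuracy_and_precision(96, 16, 64, 24)

print(round(check_acc, 4))   # 0.8     (160 / 200)
print(round(check_prec, 4))  # 0.8571  (96 / 112)
```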
