<img src="blackfriday.png" width=700 height=500>

# Predicting The Amount Of Purchase On Black Friday
##### Minh Truong, Yunhan Bai, David Hook 
### -----------------------------------------------------------------------------------------------------------------------------------------------------------------
## Introduction
For millions of people, Black Friday is the time to do some serious Christmas shopping, even before the last of the Thanksgiving leftovers are gone! Black Friday is the Friday after Thanksgiving and one of the major shopping days of the year in the United States, falling anywhere between November 23 and 29. While it is not recognized as an official U.S. holiday, many employees have the day off, except those working in retail.
<br><br>
In this tutorial, our goal is .....


# Getting Started with the Data
<br>We make use of Python 3 along with a few imported libraries: <a href="http://pandas.pydata.org/pandas-docs/stable/">pandas</a>, <a href="http://www.numpy.org/">numpy</a>, <a href="https://matplotlib.org/tutorials/index.html">matplotlib</a>, <a href="https://scikit-learn.org/stable/">scikit-learn</a>, <a href="https://seaborn.pydata.org/">seaborn</a>, and more.

In [24]:
# Necessary libraries and imports to complete this tutorial
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import f
import seaborn as sns
from sklearn import model_selection
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
import statsmodels.formula.api as smf
import warnings
warnings.filterwarnings('ignore')

# Reading the data

In [28]:
blackfriday = pd.read_csv("blackfriday.csv")
# Add Mean_age (one representative age per age bin) for graphing in later sections.
# The keys must match the Age labels exactly; a stray quote in a key would map that bin to NaN.
blackfriday['Mean_age'] = blackfriday['Age'].map({'0-17': 8.5, '18-25': 21.5, '26-35': 30.5, '36-45': 40.5, '46-50': 48, '51-55': 53, '55+': 67.5})
blackfriday.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase,Mean_age
0,1000001,P00069042,F,0-17,10,A,2,0,3,,,8370,8.5
1,1000001,P00248942,F,0-17,10,A,2,0,1,6.0,14.0,15200,8.5
2,1000001,P00087842,F,0-17,10,A,2,0,12,,,1422,8.5
3,1000001,P00085442,F,0-17,10,A,2,0,12,14.0,,1057,8.5
4,1000002,P00285442,M,55+,16,C,4+,0,8,,,7969,67.5


In the above dataset, there are 13 columns in total, counting the Mean_age column we just added. We have: <br>
Column 1: User_ID <br>
Column 2: Product_ID  <br>
Column 3: Gender (Male and Female) <br>
Column 4: Age (Age in bins) <br>
Column 5: Occupation <br>
Column 6: City_Category (A, B, C) <br>
Column 7: Stay_In_Current_City_Years (Number of years the user has stayed in the current city) <br>
Column 8: Marital_Status (0 for single, 1 for married) <br>
Column 9: Product_Category_1 <br>
Column 10: Product_Category_2 <br>
Column 11: Product_Category_3 <br>
Column 12: Purchase (Purchase amount in Dollars) <br>
Column 13: Mean_age (mean of each age bin)<br>
<br>
A product can belong to many different categories.<br>

# Tidying the data

<b> In tidy data: </b>
<br> 1. Each variable forms a column.
<br> 2. Each observation forms a row.
<br> 3. Each type of observational unit forms a table.

Tidying matters here because grouping, plotting, and regression all operate on whole columns, so each variable needs its own column and each observation its own row.

Handling missing data. <br>
The missing values are in the product category columns. We replace them with 0 to indicate that the product has no additional category.
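
Before filling, it can help to confirm where the missing values are. The sketch below counts missing entries per column with `isna().sum()`. It uses only the five rows displayed by `blackfriday.head()` above, reduced to a few columns, rather than the full file; the name `sample` is chosen for illustration.

```python
import numpy as np
import pandas as pd

# The five rows shown by blackfriday.head() above, reduced to the relevant columns.
sample = pd.DataFrame({
    "Product_ID": ["P00069042", "P00248942", "P00087842", "P00085442", "P00285442"],
    "Product_Category_1": [3, 1, 12, 12, 8],
    "Product_Category_2": [np.nan, 6.0, np.nan, 14.0, np.nan],
    "Product_Category_3": [np.nan, 14.0, np.nan, np.nan, np.nan],
    "Purchase": [8370, 15200, 1422, 1057, 7969],
})

# Count the missing values in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

On the full dataset, the same call on `blackfriday` would show whether any column other than the product categories needs attention.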

In [3]:
blackfriday = blackfriday.fillna(0)
blackfriday.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,1000001,P00069042,F,0-17,10,A,2,0,3,0.0,0.0,8370
1,1000001,P00248942,F,0-17,10,A,2,0,1,6.0,14.0,15200
2,1000001,P00087842,F,0-17,10,A,2,0,12,0.0,0.0,1422
3,1000001,P00085442,F,0-17,10,A,2,0,12,14.0,0.0,1057
4,1000002,P00285442,M,55+,16,C,4+,0,8,0.0,0.0,7969


Another step in tidying data is to remove any variable that is encoded in the column names. <br>For instance, Product_Category_1, Product_Category_2, and Product_Category_3 store the category level in the column name, so they should be melted into one column.
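
A minimal sketch of that melt, assuming the 0 placeholders should be dropped afterwards. It uses two rows from the filled table above; the column names `Category_Level` and `Category` are hypothetical and not part of the dataset.

```python
import pandas as pd

# Two rows from the filled table above, reduced to the relevant columns.
sample = pd.DataFrame({
    "User_ID": [1000001, 1000001],
    "Product_ID": ["P00069042", "P00248942"],
    "Product_Category_1": [3, 1],
    "Product_Category_2": [0.0, 6.0],
    "Product_Category_3": [0.0, 14.0],
    "Purchase": [8370, 15200],
})

# Melt the three category columns into one (Category_Level, Category) pair per row.
# "Category_Level" and "Category" are hypothetical names chosen for this sketch.
melted = sample.melt(
    id_vars=["User_ID", "Product_ID", "Purchase"],
    value_vars=["Product_Category_1", "Product_Category_2", "Product_Category_3"],
    var_name="Category_Level",
    value_name="Category",
)

# Drop the 0 placeholders, which mean "no category at this level".
melted = melted[melted["Category"] != 0].reset_index(drop=True)
print(melted)
```

Note that after melting, a Purchase value repeats once for every category its product belongs to, so purchase totals are safer to compute on the unmelted table.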

# Exploratory Data Analysis

Compute the frequency of a product that is purchased within a group
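
One possible way to compute this is `groupby(...).size()`. The sketch below assumes the "group" is the age bin, which the text does not specify, and uses made-up rows with placeholder product IDs. The name `freq_purchase` matches the table mentioned later in this section; the `Frequency` column and `freq_by_age` are hypothetical names.

```python
import pandas as pd

# Illustrative rows only (made-up values, not taken from blackfriday.csv).
toy = pd.DataFrame({
    "Product_ID": ["P1", "P1", "P2", "P1", "P2"],
    "Age": ["0-17", "0-17", "0-17", "55+", "55+"],
    "Purchase": [100, 150, 200, 120, 180],
})

# Number of times each product was purchased overall.
freq_purchase = toy.groupby("Product_ID").size().reset_index(name="Frequency")

# Number of times each product was purchased within each age bin.
freq_by_age = toy.groupby(["Age", "Product_ID"]).size().reset_index(name="Frequency")

print(freq_purchase)
print(freq_by_age)
```

Swapping `"Age"` for another column such as `"Gender"` or `"City_Category"` would give the frequency within that group instead.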

Compute the total purchase amount for each product

In [5]:
total_purchase_of_a_product = blackfriday.groupby('Product_ID').agg({'Purchase': 'sum'})
total_purchase_of_a_product.reset_index(inplace=True)
total_purchase_of_a_product.head()

Unnamed: 0,Product_ID,Purchase
0,P00000142,12592163
1,P00000242,3914901
2,P00000342,1261383
3,P00000442,441173
4,P00000542,791219


Compute the total purchase amount for each user

In [6]:
total_purchase_of_an_user = blackfriday.groupby('User_ID').agg({'Purchase': 'sum'})
total_purchase_of_an_user.reset_index(inplace=True)
total_purchase_of_an_user.head()

Unnamed: 0,User_ID,Purchase
0,1000001,333481
1,1000002,810353
2,1000003,341635
3,1000004,205987
4,1000005,821001


Merge the two tables: freq_purchase and total_purchase_of_a_product
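
A sketch of this merge, assuming both tables share a `Product_ID` column and that `freq_purchase` holds one purchase count per product. The small frames below are made-up stand-ins for the real tables, and `product_summary` and `Frequency` are hypothetical names.

```python
import pandas as pd

# Made-up stand-ins for the two tables (not real values from the dataset).
freq_purchase = pd.DataFrame({"Product_ID": ["P1", "P2"], "Frequency": [3, 2]})
total_purchase_of_a_product = pd.DataFrame({"Product_ID": ["P1", "P2"], "Purchase": [370, 380]})

# Join on Product_ID so each product has its count and its total side by side.
product_summary = freq_purchase.merge(total_purchase_of_a_product, on="Product_ID", how="inner")
print(product_summary)
```

An inner join keeps only products present in both tables; since both come from the same source table, no product should be lost here.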

### Compute the mean purchase for each product
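
The mean can be taken directly with `groupby(...).mean()`, which is equivalent to dividing each product's total by its purchase count. The rows below are made up for illustration, and `mean_purchase_of_a_product` and `Mean_Purchase` are hypothetical names.

```python
import pandas as pd

# Illustrative rows only (made-up values, not taken from blackfriday.csv).
toy = pd.DataFrame({
    "Product_ID": ["P1", "P1", "P2", "P1", "P2"],
    "Purchase": [100, 150, 200, 120, 180],
})

# Mean purchase amount per product.
mean_purchase_of_a_product = (
    toy.groupby("Product_ID")["Purchase"].mean().reset_index(name="Mean_Purchase")
)
print(mean_purchase_of_a_product)
```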

# Graphing age against the amount of purchase
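
One way such a graph could look is a bar chart of the mean purchase per age bin, using the Mean_age column created when reading the data. This is only a sketch: it plots the five rows displayed by `blackfriday.head()` earlier rather than the full dataset, and the choice of a bar chart and the name `mean_purchase_by_age` are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# The five rows shown by blackfriday.head(), reduced to the relevant columns.
sample = pd.DataFrame({
    "Age": ["0-17", "0-17", "0-17", "0-17", "55+"],
    "Mean_age": [8.5, 8.5, 8.5, 8.5, 67.5],
    "Purchase": [8370, 15200, 1422, 1057, 7969],
})

# Average purchase amount for each age bin, keyed by the bin's Mean_age.
mean_purchase_by_age = sample.groupby("Mean_age")["Purchase"].mean()

# One bar per age bin.
fig, ax = plt.subplots()
ax.bar(mean_purchase_by_age.index.astype(str), mean_purchase_by_age.values)
ax.set_xlabel("Mean age of bin")
ax.set_ylabel("Mean purchase amount")
ax.set_title("Mean purchase by age bin (5 sample rows)")
fig.tight_layout()
# In a notebook the figure renders inline; in a script, call plt.show().
```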

# Graphing city_category against the amount of purchase

# Graphing product_category against the amount of purchase

# Graphing age against the amount of category

# Linear Regression 
Occupation and Age
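
A minimal sketch of what such a fit might look like with scikit-learn's `linear_model.LinearRegression`, assuming the target is Purchase and the predictors are Occupation and Mean_age. The rows are made up for illustration, and treating the Occupation code as a number is a simplification.

```python
import pandas as pd
from sklearn import linear_model

# Made-up rows for illustration only (not taken from blackfriday.csv).
toy = pd.DataFrame({
    "Occupation": [10, 16, 7, 4, 12, 1],
    "Mean_age": [8.5, 67.5, 30.5, 21.5, 40.5, 53.0],
    "Purchase": [8370, 7969, 9000, 6500, 11000, 7200],
})

X = toy[["Occupation", "Mean_age"]]  # predictors (assumed)
y = toy["Purchase"]                  # target (assumed)

# Ordinary least squares fit with an intercept.
model = linear_model.LinearRegression()
model.fit(X, y)

predictions = model.predict(X)
r_squared = model.score(X, y)  # R^2 on the training rows
print(model.coef_, model.intercept_, r_squared)
```

Because Occupation is a categorical code rather than a quantity, one-hot encoding it (for example with `pd.get_dummies`) may be more appropriate than the numeric treatment used here.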

# Logistic Regression
Male vs Female 
Single vs Married
