# Project 3 -- Liquor Sales + Linear Regression

Scenario 1: State tax board

You are a data scientist in residence at the Iowa State tax board. The Iowa State legislature is considering changes in the liquor tax rates and wants a report of current liquor sales by county and projections for the rest of the year.


________________________________________________

Start Report

Date: 6.30.2015

Report By: 
Kelly Ann Gracia, Resident Data Scientist,
Iowa State Tax Board

Report Addressed to: 
Iowa State Legislature

___________________________________________________

Deliverables: 

    -Report of Current Liquor Sales by County.

        Including projections for the rest of year 2016.
       
    -Report to help inform discussion of changes to the liquor tax rates.

____________________________________________________________________________

##### y value = total sales (in dollars)
(continuous variable)

##### x values = dates of sales (transaction dates), price per bottle
(continuous variables)

___________________________________________________________________________________

Report Outline:

    1) Data Cleaning Round 1
    2) Visualize data to spot outliers
    3) Remove outliers if appropriate (Data Cleaning Round 2)
    4) Visualize data to choose features for model
    5) Create linear regression model
    6) Train-test-split data (using cross_val, K-fold = 4)
    7) Get performance score of model.
    8) Create linear regression model with L1 (Lasso) regularization (to shrink coefficients and limit overfitting)
    9) Train-test-split data (using cross_val, K-fold = 4)
    10) Get performance score of model.
    11) Compare performance scores and choose best model. Provide rationale.
    12) Use chosen model to predict sales for rest of 2016.
    13) Evaluate predictions.
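
Steps 5 to 11 of the outline can be sketched as follows. This is a minimal illustration on synthetic data, not the report's model: the feature values, the coefficients, the noise level, and `alpha=1.0` are all assumptions chosen for the demo, and `cross_val_score` is imported from `sklearn.model_selection`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): two features loosely playing the role
# of "days from year start" and "price per bottle".
rng = np.random.RandomState(0)
X = np.column_stack([rng.uniform(0, 365, 200), rng.uniform(5, 50, 200)])
y = 0.5 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(0, 10, 200)

candidates = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=1.0),
}

# 4-fold cross-validated R^2 for each candidate model.
mean_r2 = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=4, scoring="r2")
    mean_r2[name] = scores.mean()

# Pick the candidate with the highest mean cross-validated R^2.
best = max(mean_r2, key=mean_r2.get)
```

Comparing the mean score across folds, rather than a single train/test split, makes the choice between the two models less dependent on one lucky or unlucky partition.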


In [2]:
#Importing necessary libraries

%matplotlib inline
import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.tools.plotting import scatter_matrix

In [3]:
#first load data and clean data
idf = pd.read_csv('/Users/kristensu/downloads/Iowa_Liquor_sales_sample_10pct.csv')
print idf.info()
idf.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 270955 entries, 0 to 270954
Data columns (total 18 columns):
Date                     270955 non-null object
Store Number             270955 non-null int64
City                     270955 non-null object
Zip Code                 270955 non-null object
County Number            269878 non-null float64
County                   269878 non-null object
Category                 270887 non-null float64
Category Name            270323 non-null object
Vendor Number            270955 non-null int64
Item Number              270955 non-null int64
Item Description         270955 non-null object
Bottle Volume (ml)       270955 non-null int64
State Bottle Cost        270955 non-null object
State Bottle Retail      270955 non-null object
Bottles Sold             270955 non-null int64
Sale (Dollars)           270955 non-null object
Volume Sold (Liters)     270955 non-null float64
Volume Sold (Gallons)    270955 non-null float64
dtypes: float64(4), int64(

Unnamed: 0,Date,Store Number,City,Zip Code,County Number,County,Category,Category Name,Vendor Number,Item Number,Item Description,Bottle Volume (ml),State Bottle Cost,State Bottle Retail,Bottles Sold,Sale (Dollars),Volume Sold (Liters),Volume Sold (Gallons)
0,11/04/2015,3717,SUMNER,50674,9.0,Bremer,1051100.0,APRICOT BRANDIES,55,54436,Mr. Boston Apricot Brandy,750,$4.50,$6.75,12,$81.00,9.0,2.38
1,03/02/2016,2614,DAVENPORT,52807,82.0,Scott,1011100.0,BLENDED WHISKIES,395,27605,Tin Cup,750,$13.75,$20.63,2,$41.26,1.5,0.4
2,02/11/2016,2106,CEDAR FALLS,50613,7.0,Black Hawk,1011200.0,STRAIGHT BOURBON WHISKIES,65,19067,Jim Beam,1000,$12.59,$18.89,24,$453.36,24.0,6.34
3,02/03/2016,2501,AMES,50010,85.0,Story,1071100.0,AMERICAN COCKTAILS,395,59154,1800 Ultimate Margarita,1750,$9.50,$14.25,6,$85.50,10.5,2.77
4,08/18/2015,3654,BELMOND,50421,99.0,Wright,1031080.0,VODKA 80 PROOF,297,35918,Five O'clock Vodka,1750,$7.20,$10.80,12,$129.60,21.0,5.55


In [4]:
# Need to check for NaNs and their counts if any...
print "# Rows, Columns: " +str(idf.shape)
print
print idf.isnull().sum()
print
# Looking at total NaNs per feature in dataset, and assuming a worst case scenario
# (no overlap in NAs).
# NOTE: the counts and the row total hard-coded below do not match the per-column NaN
# counts reported for this 10% sample (1077 + 68 + 632 NAs in 270955 rows); they appear
# to come from a larger version of the dataset. The missing share is below 1% either way.
print "total missing rows = county NAs + category NAs + category name NAs = "
NA_rows = 10913 + 779 + 6109
print NA_rows
print
print "Percent missing (excluding county number column total NAs):"
percent_missing = (NA_rows/ 2709552.0)*100
print percent_missing
print
# Because total NAs are negligible (<1%), I will drop the rows that contain them.
idf = idf.dropna()
print idf.isnull().sum()
print
print "# Rows, Columns: " +str(idf.shape)

# Rows, Columns: (270955, 18)

Date                        0
Store Number                0
City                        0
Zip Code                    0
County Number            1077
County                   1077
Category                   68
Category Name             632
Vendor Number               0
Item Number                 0
Item Description            0
Bottle Volume (ml)          0
State Bottle Cost           0
State Bottle Retail         0
Bottles Sold                0
Sale (Dollars)              0
Volume Sold (Liters)        0
Volume Sold (Gallons)       0
dtype: int64

total missing rows = county NAs + category NAs + category name NAs = 
17801

Percent missing (excluding county number column total NAs):
0.656972075088

Date                     0
Store Number             0
City                     0
Zip Code                 0
County Number            0
County                   0
Category                 0
Category Name            0
Vendor Number            0
Item Number        
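
The worst-case estimate above adds the per-column NaN counts, which can overstate the number of affected rows when one row is missing several fields. A minimal sketch of the exact count, on a toy frame that is purely an assumption for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame (assumption) in which the two county columns are missing together.
df = pd.DataFrame({
    "County Number": [9.0, np.nan, 7.0, np.nan],
    "County": ["Bremer", None, "Black Hawk", None],
    "Category Name": ["VODKA 80 PROOF", "BLENDED WHISKIES", None, "BLENDED WHISKIES"],
})

na_per_column = df.isnull().sum()            # NaN count for each column
worst_case_rows = int(na_per_column.sum())   # assumes no overlap between columns
rows_with_any_na = int(df.isnull().any(axis=1).sum())  # rows dropna() would remove

percent_missing = 100.0 * rows_with_any_na / len(df)
```

Here the worst-case sum is 5 even though the frame has only 4 rows, while `isnull().any(axis=1)` counts the 3 rows that `dropna()` would actually drop.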

In [5]:
#Remove spaces and parentheses from column names

# str.strip() returns a new Index, so it has to be assigned back to take effect
idf.columns = idf.columns.str.strip()
idf.rename(columns=lambda x: x.replace(' ', '_').replace('(', '').replace(')', ''), inplace=True)

In [6]:
# Remove "$" characters from prices and convert values to floats.

money_cols = ['State_Bottle_Cost', 'State_Bottle_Retail', 'Sale_Dollars']
# lstrip('$') removes the literal leading "$" without any regex interpretation
idf[money_cols] = idf[money_cols].apply(lambda x: pd.to_numeric(x.str.lstrip('$')))
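
A small sketch of the dollar-sign cleanup, using price strings taken from the sample rows shown earlier. In a regular expression `$` is the end-of-string anchor, and depending on the pandas version `str.replace` may interpret its pattern as a regex, so this sketch strips the character with `str.lstrip` instead. The frame name `prices` is hypothetical.

```python
import pandas as pd

# Price strings copied from the first two sample rows of the dataset.
prices = pd.DataFrame({
    "State_Bottle_Cost": ["$4.50", "$13.75"],
    "Sale_Dollars": ["$81.00", "$41.26"],
})

# lstrip removes the literal leading "$"; no regular expression is involved.
cleaned = prices.apply(lambda col: pd.to_numeric(col.str.lstrip("$")))
```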

In [7]:
# Need to change date to datetime

idf["Date"] = pd.to_datetime(idf["Date"], format="%m/%d/%Y")

In [8]:
#-need to change zipcode to int

# ----received error for code below. Error message revealed some incorrectly entered zipcodes
#     (e.g., 712-2)
# ===============>> idf['Zip_Code'] = idf['Zip_Code'].astype(int) 

# print "Before removal of invalid zipcodes: "
# print idf["Zip_Code"].nunique()
# print idf["Zip_Code"].nunique(dropna=False)
# print

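# coercing zipcodes to numeric, and np.nan if error is raised
idf["Zip_Code"] = pd.to_numeric(idf["Zip_Code"], errors='coerce')

# Dropping rows with invalid zipcodes. Also, I want zip_code values to be dtype 'int', not 'float'.
# Both results are assigned back; otherwise idf itself would be left unchanged.
idf = idf.dropna(subset=['Zip_Code'])
idf['Zip_Code'] = idf['Zip_Code'].astype(int)
idf['Zip_Code']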
# print "After removal of invalid zipcodes: "
# print
# print idf["Zip_Code"].nunique()
# print idf["Zip_Code"].nunique(dropna=False)
# print "# Rows, Columns: " +str(idf.shape)

0         50674
1         52807
2         50613
3         50010
4         50421
5         52402
6         52501
7         50428
8         50035
9         52332
10        50265
11        52577
12        52806
13        52656
14        52241
15        50674
16        50703
17        52577
18        50208
19        52807
20        52402
21        52342
22        51250
23        50401
24        52402
25        51351
26        52246
27        51501
28        50111
29        52245
          ...  
270925    50702
270926    50009
270927    50320
270928    52240
270929    52778
270930    52402
270931    50595
270932    50208
270933    50010
270934    51104
270935    50310
270936    52404
270937    50801
270938    52405
270939    52233
270940    52544
270941    52253
270942    52240
270943    50010
270944    50311
270945    50310
270946    50322
270947    50111
270948    52245
270949    52405
270950    50316
270951    51445
270952    50702
270953    52655
270954    50322
Name: Zip_Code, dtype: i
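
To make the coercion step concrete, here is a minimal sketch on three zip code strings, including the malformed `712-2` mentioned in the cell above. The Series `zips` is a hypothetical stand-in for the real column.

```python
import pandas as pd

zips = pd.Series(["50674", "52807", "712-2"], name="Zip_Code")

numeric = pd.to_numeric(zips, errors="coerce")  # "712-2" cannot be parsed and becomes NaN
valid = numeric.dropna().astype(int)            # drop the NaN, then cast to int
```

The cast to `int` only works after the NaN is gone, because a column that still holds NaN has to stay `float`.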

In [None]:
yr15_start = datetime.datetime(2015, 1, 1)

# Whole days elapsed since 1 Jan 2015, computed for the entire column at once
# (no Python loop needed); used later as a model feature.
idf['Days_from_yr_start'] = (idf['Date'] - yr15_start).dt.days

In [None]:
### Now I will divide my datasets:

## All sales, all variables, 2015 - 2016
## ====>>idf

## All sales 2015
idf_2015 = idf[(idf['Date'].dt.year == 2015)]

## All sales 2015_Q1
yr15_Q1_end = datetime.date(2015, 3, 31)
idf_2015_Q1 = idf[(idf['Date'].dt.date <= yr15_Q1_end)]

## All sales 2016_Q1
idf_2016_Q1 = idf[(idf['Date'].dt.year == 2016)]


#Total sales by county up to now
total_sales_county = idf.groupby('County')['Sale_Dollars'].sum().reset_index()

total_sales_county.sort_values('Sale_Dollars')
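
Per-county totals can be computed with either `pivot_table` or `groupby`; the sketch below checks that the two agree on a toy frame. The rows are illustrative only (the second Scott sale is made up), and the frame name `sales` is hypothetical.

```python
import numpy as np
import pandas as pd

sales = pd.DataFrame({
    "County": ["Scott", "Story", "Scott", "Bremer"],
    "Sale_Dollars": [41.26, 85.50, 100.00, 81.00],
})

# Route 1: pivot table with one row per county.
by_pivot = pd.pivot_table(sales, values="Sale_Dollars", index="County", aggfunc="sum")
# Route 2: plain groupby, which yields the same totals in the same county order.
by_group = sales.groupby("County")["Sale_Dollars"].sum()

same_totals = np.allclose(np.ravel(by_pivot.values), by_group.values)
ranked = by_group.sort_values(ascending=False)  # largest county first
```

Since both routes produce the same numbers, chaining a `groupby` after a `pivot_table` on the same key adds nothing.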

In [None]:
##Checking for seasonality

#2015
#Looking at total sales amount for each transaction date in 2015
total_2015_sales_date = idf_2015.groupby('Date')['Sale_Dollars'].sum().reset_index()

#time series graph to check for seasonality
total_2015_sales_date.plot(x='Date', y='Sale_Dollars',label='total_2015_sales') 

#Looking at total sales amount per county in 2015
total_2015_sales_county = idf_2015.groupby('County')['Sale_Dollars'].sum().reset_index()

print total_2015_sales_county.sort_values('Sale_Dollars')

#bar graph of total sales per county
total_2015_sales_county.plot(x='County', y='Sale_Dollars',label='total_2015_sales', kind='bar')

#Looking at total sales amount per category in 2015
total_2015_sales_category = idf_2015.groupby('Category_Name')['Sale_Dollars'].sum().reset_index()

print total_2015_sales_category.sort_values('Sale_Dollars')

#bar graph of total sales per category
total_2015_sales_category.plot(x='Category_Name', y='Sale_Dollars',label='total_2015_sales', kind='bar')

In [None]:
#Checking growth of sales 2015

# Index by date so the cumulative total is plotted against time
cumula_2015_sales = total_2015_sales_date.set_index('Date')['Sale_Dollars'].cumsum()
cumula_2015_sales.plot(label='cumula_2015_sales')

In [None]:
##Checking for seasonality

#2015, Q1
#Looking at total sales amount for each transaction date in 2015, QUARTER 1
total_2015Q1_sales_date = idf_2015_Q1.groupby('Date')['Sale_Dollars'].sum().reset_index()

#time series graph to check for seasonality
total_2015Q1_sales_date.plot(x='Date', y='Sale_Dollars',label='total_2015_Q1_sales') 

#Looking at total sales amount per county in 2015, QUARTER 1
total_2015Q1_sales_county = idf_2015_Q1.groupby('County')['Sale_Dollars'].sum().reset_index()

print total_2015Q1_sales_county.sort_values('Sale_Dollars')

#bar graph of total sales per county
total_2015Q1_sales_county.plot(x='County', y='Sale_Dollars',label='total_2015_Q1_sales', kind='bar')

#Looking at total sales amount per category in 2015, QUARTER 1
total_2015Q1_sales_category = idf_2015_Q1.groupby('Category_Name')['Sale_Dollars'].sum().reset_index()

print total_2015Q1_sales_category.sort_values('Sale_Dollars')

#bar graph of total sales per category
total_2015Q1_sales_category.plot(x='Category_Name', y='Sale_Dollars',label='total_2015_Q1_sales', kind='bar')



In [None]:
#Checking growth of sales 2015, Q1

# Index by date so the cumulative total is plotted against time
cumula_2015_Q1_sales = total_2015Q1_sales_date.set_index('Date')['Sale_Dollars'].cumsum()
cumula_2015_Q1_sales.plot(label='cumula_2015_Q1_sales')

In [None]:
##Checking for seasonality

#2016, Q1
#Looking at total sales amount for each transaction date in 2016, QUARTER 1
total_2016Q1_sales_date = idf_2016_Q1.groupby('Date')['Sale_Dollars'].sum().reset_index()

#time series graph to check for seasonality
total_2016Q1_sales_date.plot(x='Date', y='Sale_Dollars',label='total_2016_Q1_sales') 

#Looking at total sales amount per county in 2016, QUARTER 1
total_2016Q1_sales_county = idf_2016_Q1.groupby('County')['Sale_Dollars'].sum().reset_index()

print total_2016Q1_sales_county.sort_values('Sale_Dollars')

#bar graph of total sales per county
total_2016Q1_sales_county.plot(x='County', y='Sale_Dollars',label='total_2016_Q1_sales', kind='bar')

#Looking at total sales amount per category in 2016, QUARTER 1
total_2016Q1_sales_category = idf_2016_Q1.groupby('Category_Name')['Sale_Dollars'].sum().reset_index()

print total_2016Q1_sales_category.sort_values('Sale_Dollars')

#bar graph of total sales per category
total_2016Q1_sales_category.plot(x='Category_Name', y='Sale_Dollars',label='total_2016_Q1_sales', kind='bar')




In [None]:
#Checking growth of sales 2016, Q1

# Index by date so the cumulative total is plotted against time
cumula_2016_Q1_sales = total_2016Q1_sales_date.set_index('Date')['Sale_Dollars'].cumsum()
cumula_2016_Q1_sales.plot(label='cumula_2016_Q1_sales')

In [None]:
#Looking at a histogram of the average sale amount per county (2015 and 2016 combined)

def my_pivot(df, index, values, aggfunc, plot=False):
    piv = pd.pivot_table(df, index=index, values=values, aggfunc=aggfunc)
    #piv.sort_values(by = ['week'], inplace=True)
    #print piv
    # the flag is named 'plot' (not 'plt') so it does not shadow matplotlib.pyplot
    if plot:
        piv.plot(title='Average Current Liquor Sales by County', kind='hist', figsize=(16,8), bins=40)
    return piv

my_pivot(idf, index=["County"], values=['Sale_Dollars'], aggfunc=np.mean, plot=True)

In [None]:
import pandas as pd
from sklearn import linear_model

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
import statsmodels.api as sm


from IPython.core.display import Image, HTML

import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
#features for model: county, category, date

lm = linear_model.LinearRegression()

In [None]:


X1 = idf_2015[['Days_from_yr_start', 'State_Bottle_Retail']]
Y1 = idf_2015['Sale_Dollars']

model_1_reg = lm.fit(X1, Y1)
predictions_model_1 = model_1_reg.predict(X1)

#Plot the model
plt.scatter(predictions_model_1, Y1, s=30, c='r', marker='+', zorder=10)
plt.xlabel("Predicted Values")
plt.ylabel("Actual Sales")
plt.axis([2500, 6000, 2500, 6000])
plt.show()
print "MSE:", mean_squared_error(Y1, predictions_model_1)
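
The MSE printed above is the mean of the squared residuals, so its unit is squared dollars. A minimal sketch of what `mean_squared_error` and `r2_score` compute, checked against a by-hand calculation; the "predicted" values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

actual = np.array([81.00, 41.26, 453.36, 85.50])   # sale amounts from the sample rows
predicted = np.array([90.0, 50.0, 400.0, 100.0])   # hypothetical model output

residuals = actual - predicted
mse_by_hand = np.mean(residuals ** 2)
# R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
r2_by_hand = 1 - np.sum(residuals ** 2) / np.sum((actual - actual.mean()) ** 2)

mse_sklearn = mean_squared_error(actual, predicted)
r2_sklearn = r2_score(actual, predicted)
```

Because the errors are squared, a few very large transactions can dominate the MSE, which is one reason to look at R^2 and the scatter plot alongside it.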

In [None]:
# NOTE (assumption): X, y and model are not defined anywhere earlier in this notebook; they
# are taken here to mean the 2015 features, target, and fitted regression from the cell above.
X, y, model = X1, Y1, model_1_reg

lr_r2 =  r2_score(y_true=y, y_pred=model.predict(X))
lr_r2

In [None]:
print len(model.coef_)
model.coef_

In [None]:
print abs(model.coef_).mean()
print model.coef_.max()


In [None]:
lasso = linear_model.Lasso(alpha=1)
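
What the L1 penalty does can be seen on a small synthetic example: coefficients are shrunk toward zero, and those of uninformative features are typically set exactly to zero. The data below is an assumption for the demo (only the first of three features drives the target); `alpha=1.0` mirrors the value used above.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 3))
# Only the first feature influences the target (assumption for the demo).
y_demo = 4.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

ols_coef = LinearRegression().fit(X_demo, y_demo).coef_
lasso_coef = Lasso(alpha=1.0).fit(X_demo, y_demo).coef_
```

Comparing `ols_coef` with `lasso_coef` shows the first coefficient pulled below its least-squares value and the other two removed. Note that this is a penalty on coefficient size; it does not by itself make the fit robust to outlying observations.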



In [None]:
lasso_model = lasso.fit(X, y)

# Unregularized baseline, kept for comparison in later cells
model = lm.fit(X, y)

# Predictions from the L1-regularized model
lasso_predictions = lasso_model.predict(X)

# Plot the model
plt.scatter(lasso_predictions, y, s=30, c='r', marker='+', zorder=10)
plt.xlabel("Predicted Values")
plt.ylabel("Actual Sales")
plt.show()
print "MSE:", mean_squared_error(y, lasso_predictions)

lasso_r2 =  r2_score(y_true=y, y_pred=lasso_predictions)
lasso_r2

In [None]:
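# Coefficients of the Lasso model (any that were shrunk to exactly zero show up here)
print len(lasso_model.coef_)
lasso_model.coef_

In [None]:
print abs(lasso_model.coef_).mean()
print lasso_model.coef_.max()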


In [None]:
cvp_model_preds = cross_val_predict(model, X, y, cv=4)
cvp_model_preds.shape

In [None]:
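from sklearn.model_selection import train_test_split

# NOTE: county_2015_avg is not built anywhere in this notebook; it is assumed to be a
# per-county table of 2015 averages with numeric feature columns plus 'Sale_Dollars'.
dy = county_2015_avg['Sale_Dollars']
# Drop the target from the features so the model cannot simply read off the answer
dX = county_2015_avg.drop('Sale_Dollars', axis=1)

X_train, X_test, y_train, y_test = \
train_test_split(dX, dy, test_size=0.25)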

print "       X Shape  Y Shape"
print "Train", X_train.shape, y_train.shape
print "Test ", X_test.shape, y_test.shape
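
If the feature matrix passed to `train_test_split` still contains the target column, the held-out score is meaningless, because the model can copy the answer. A minimal sketch of the effect on synthetic data; the frame, the column `feature`, and the helper `holdout_r2` are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
frame = pd.DataFrame({
    "feature": rng.normal(size=100),
    "Sale_Dollars": rng.normal(size=100),  # target, unrelated to the feature
})
target = frame["Sale_Dollars"]

def holdout_r2(features):
    """Fit on 75% of the rows and return R^2 on the remaining 25%."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, target, test_size=0.25, random_state=0)
    return LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)

leaky_r2 = holdout_r2(frame)                                # target left among the features
honest_r2 = holdout_r2(frame.drop("Sale_Dollars", axis=1))  # target removed
```

The leaky split scores essentially perfectly even though the only real feature carries no information, while the honest split shows that nothing can be predicted.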

In [None]:
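# Fresh estimator, so that earlier fits made with lm are not overwritten
model2 = linear_model.LinearRegression().fit(X_train, y_train)
predictions = model2.predict(X_test)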

## The line / model
plt.scatter(y_test, predictions)
plt.xlabel("True Values")
plt.ylabel("Predictions")

print "Score:", model2.score(X_test, y_test) 

In [None]:
# Perform 4-fold cross validation
# (the features exclude the target column, which must not be used to predict itself)
cv_X = county_2015_avg.drop('Sale_Dollars', axis=1)
scores = cross_val_score(model2, cv_X, dy, cv=4)
print "Cross-validated scores:", scores

# Make cross validated predictions
predictions = cross_val_predict(model2, cv_X, dy, cv=4)
plt.scatter(dy, predictions)
r2_s = r2_score(dy, predictions)
print "Cross-Predicted R^2:", r2_s
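
`cross_val_score` and `cross_val_predict` report related but different things: the first returns one R^2 per fold, the second returns one out-of-fold prediction per row, from which a single pooled R^2 can be computed. A minimal sketch on synthetic data (the features and coefficients are assumptions for the demo):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_predict, cross_val_score

rng = np.random.RandomState(1)
X_cv = rng.uniform(0, 10, size=(120, 2))
y_cv = 2.0 * X_cv[:, 0] - 1.0 * X_cv[:, 1] + rng.normal(scale=1.0, size=120)

estimator = LinearRegression()
fold_scores = cross_val_score(estimator, X_cv, y_cv, cv=4)     # one R^2 per fold
pooled_preds = cross_val_predict(estimator, X_cv, y_cv, cv=4)  # one prediction per row
pooled_r2 = r2_score(y_cv, pooled_preds)                       # single R^2 over all rows
```

The pooled value is generally close to, but not identical with, the mean of the per-fold scores, so it helps to state which of the two a report quotes.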

In [None]:

# grps=idf_2015.groupby(['County'])
# for county, grp in grps:
#     sm.OLS(y=grp.loc[:, 'MEAN'], x=grp.loc[:, ['Date', 'Bottles_Sold']])

# model = sm.OLS(Y,X)
# results = model.fit()
# results.params


# model2 = sm.OLS(Y,X)




#     grps=df.groupby(['FID'])
# for fid, grp in grps:
#     sm.OLS(y=grp.loc[:, 'MEAN'], x=grp.loc[:, ['Accum_Prcp', 'Accum_HDD']])
    
#based on scatterplot matrices, I will make a model for counties.
# __________

#total_2015_sales = pd.pivot_table(idf_2015, values= 'Sale_Dollars', index=['Date', 'County'], aggfunc=np.sum)
#total_2015_sales = total_2015_sales.reset_index()
#total_2015_sales = total_2015_sales.groupby('Date')['Sale_Dollars'].agg(np.sum)
#total_2015_sales = total_2015_sales.reset_index()
# X1 = total_2015_sales['Date'] 
# Y1 = total_2015_sales['Sale_Dollars']



            
            # # model_1['Date'].dtype
# X = []

# for date in model_1['Date']:
#     X.append(date)
#     #X.append(date.toordinal())
#     for timestamp in X:
#         actual_date = datetime.datetime.strptime(str(timestamp), "%Y-%m-%d %H:%M:%S")
#         model_1['Days_from_year_start'] = (actual_date - yr15_start)
    #date = datetime.date.fromtimestamp(timestamp)
#     model_1['Days_from_year_start'] = (ordinal_date - yr15_start)
# print model_1
#     #X.append(date.toordinal())
# print model_1['Date'].dtype

# yr15_start = datetime.date(2015, 1, 1)
# >>> diff = someday - today
# >>> diff.days

# X1 = model_1['Date'].reshape(221,1)
# Y1 = model_1['Sale_Dollars']

# for date in X1:
#     #X.append(date)
#     X.append(date.toordinal())
#     for timestamp in X:
# #         actual_date = datetime.datetime.strptime(int(timestamp), "%Y-%m-%d %H:%M:%S")
# #         model_1['Days_from_year_start'] = (actual_date - yr15_start)
# #         print repr(actual_date)
# #     #date = datetime.date.fromtimestamp(timestamp)
# # #     model

# __________

#          X1 = model_1['Days_from_yr_start']
# .reshape(221,1)

#          Y1 = model_1['Sale_Dollars']
    # X = pd.DataFrame(X, )
# X['County'] = idf_2015[["Dates",'County']]
# X.columns = ['Date', 'County']

# y = county_2015_avg["Sale_Dollars"]
# print X1.shape
# print Y1.shape
#          model_1_reg = lm.fit(X1, Y1)

# predictions_lr1 = model_1_reg.predict(X1)

# Plot the model
#plt.scatter(predictions_lr1, Y1, s=30, c='r', marker='+', zorder=10)
#plt.scatter(predictions_lr1, Y1, s=30, c='r', marker='+', zorder=10)
# plt.xlabel("Predicted Values")
# plt.ylabel("Actual Price")
# plt.show()
# print "MSE:", mean_squared_error(y, predictions)



# model_111 = model_1['Days_from_yr_start'] + model_1['Sale_Dollars']

# merge(left, right, how='inner', on=None, left_on=None, right_on=None,
#       left_index=False, right_index=False, sort=True,
#       suffixes=('_x', '_y'), copy=True, indicator=False)

# result = pd.merge(model_1['Days_from_yr_start'], model_1['Sale_Dollars'], how='left', on=['key1', 'key2'])

# print result
    # #print model_1
    # X1_vals = (model_1['Days_from_yr_start']/np.timedelta64(1, "D")).astype(int).reshape(7502,)
    # Y1_vals = model_1['Sale_Dollars'].reshape(7502,)

    # X1 = pd.Series(X1_vals)
    # Y1 = pd.Series(Y1_vals)

    # # print X1.shape
    # # print Y1.shape
    # # print X1
    # # print Y1