The objective of this exercise is to fit a linear regression model to Black Friday purchase data in order to predict future purchases.
It involves the following steps:
1) Loading the data
2) Cleaning the data
3) Fitting the model
4) Predicting future values and computing the model's score

The first step is to import the required packages and load the data.

In [1]:
import pandas as pd
import numpy as np
train=pd.read_csv('black-friday-train.csv')

The next step is to inspect the column names and the type of data stored in each column.

In [2]:
train.head(5)

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,1000001,P00069042,F,0-17,10,A,2,0,3,,,8370
1,1000001,P00248942,F,0-17,10,A,2,0,1,6.0,14.0,15200
2,1000001,P00087842,F,0-17,10,A,2,0,12,,,1422
3,1000001,P00085442,F,0-17,10,A,2,0,12,14.0,,1057
4,1000002,P00285442,M,55+,16,C,4+,0,8,,,7969


In [3]:
train.columns

Index(['User_ID', 'Product_ID', 'Gender', 'Age', 'Occupation', 'City_Category',
       'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category_1',
       'Product_Category_2', 'Product_Category_3', 'Purchase'],
      dtype='object')

Next, we count the missing values in each column.

In [4]:
train.isna().sum()

User_ID                            0
Product_ID                         0
Gender                             0
Age                                0
Occupation                         0
City_Category                      0
Stay_In_Current_City_Years         0
Marital_Status                     0
Product_Category_1                 0
Product_Category_2            173638
Product_Category_3            383247
Purchase                           0
dtype: int64

We have to handle the missing values in the two columns found above, Product_Category_2 and Product_Category_3. We replace them with 0, which is not an existing category code, so that a missing category is treated as a level of its own, distinct from the real product categories.

In [5]:
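# assign the result back; inplace=True on a selected column triggers warnings in newer pandas
train['Product_Category_3']=train['Product_Category_3'].fillna(0)
train['Product_Category_2']=train['Product_Category_2'].fillna(0)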
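
The same fill can be sketched on a small made-up frame. The `toy` DataFrame and its values are assumptions for illustration only. This version fills both columns in one assignment and then confirms that no missing values remain:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the two sparse product columns.
toy = pd.DataFrame({
    "Product_Category_2": [6.0, np.nan, 14.0],
    "Product_Category_3": [14.0, np.nan, np.nan],
})

# Fill both columns at once; 0 is not one of the category codes,
# so it acts as a separate "no category" level.
cols = ["Product_Category_2", "Product_Category_3"]
toy[cols] = toy[cols].fillna(0)

# Total number of missing cells left in the frame.
missing_after = int(toy.isna().sum().sum())
print(missing_after)
```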

The commands above take care of the missing values. Next we handle the categorical variables. String-valued columns are first passed through a label encoder, which maps the categories of a column to the integers 0 to n-1. Categories that have no natural ranking are then passed through OneHotEncoder, so that each level of an unranked feature is treated independently.
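
To make the difference between the two encodings concrete, here is a minimal sketch on a made-up array of city labels (the `cities` array is an assumption, not part of the dataset). It shows the integer codes from a label encoder next to the 0/1 indicator columns from a one-hot encoder:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical sample of an unranked categorical column.
cities = np.array(["A", "C", "B", "A"])

# LabelEncoder assigns integers in sorted order of the labels: A->0, B->1, C->2.
codes = LabelEncoder().fit_transform(cities)

# OneHotEncoder gives every category its own 0/1 column, so no order is implied.
# The encoder expects a 2-D input; toarray() turns the sparse result into a dense array.
onehot = OneHotEncoder().fit_transform(cities.reshape(-1, 1)).toarray()

print(codes)
print(onehot)
```

With integer codes alone, a linear model would treat city C (code 2) as "twice" city B (code 1), which is why unranked features get indicator columns instead.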

In [6]:
#we use the following code snippet to know the unique values in each column
for col in train.columns:
    print(col)
    print(train.loc[:,col].unique())

User_ID
[1000001 1000002 1000003 ... 1004113 1005391 1001529]
Product_ID
['P00069042' 'P00248942' 'P00087842' ... 'P00370293' 'P00371644'
 'P00370853']
Gender
['F' 'M']
Age
['0-17' '55+' '26-35' '46-50' '51-55' '36-45' '18-25']
Occupation
[10 16 15  7 20  9  1 12 17  0  3  4 11  8 19  2 18  5 14 13  6]
City_Category
['A' 'C' 'B']
Stay_In_Current_City_Years
['2' '4+' '3' '1' '0']
Marital_Status
[0 1]
Product_Category_1
[ 3  1 12  8  5  4  2  6 14 11 13 15  7 16 18 10 17  9 20 19]
Product_Category_2
[ 0.  6. 14.  2.  8. 15. 16. 11.  5.  3.  4. 12.  9. 10. 17. 13.  7. 18.]
Product_Category_3
[ 0. 14. 17.  5.  4. 16. 15.  8.  9. 13.  6. 12.  3. 18. 11. 10.]
Purchase
[ 8370 15200  1422 ...   135   123   613]


Now we apply the label encoder to the columns listed in 'labelcol', converting them from strings to integers.

In [7]:
from sklearn.preprocessing import LabelEncoder
lbe=LabelEncoder()
# each column is listed once; fit_transform refits the encoder for every column
labelcol=['Age','City_Category','Stay_In_Current_City_Years','Gender']
for col in labelcol:
    train[col]=lbe.fit_transform(train[col])
print('the encoded matrix is')
print(train.head(10))

the encoded matrix is
   User_ID Product_ID  Gender  Age  Occupation  City_Category  \
0  1000001  P00069042       0    0          10              0   
1  1000001  P00248942       0    0          10              0   
2  1000001  P00087842       0    0          10              0   
3  1000001  P00085442       0    0          10              0   
4  1000002  P00285442       1    6          16              2   
5  1000003  P00193542       1    2          15              0   
6  1000004  P00184942       1    4           7              1   
7  1000004  P00346142       1    4           7              1   
8  1000004   P0097242       1    4           7              1   
9  1000005  P00274942       1    2          20              0   

   Stay_In_Current_City_Years  Marital_Status  Product_Category_1  \
0                           2               0                   3   
1                           2               0                   1   
2                           2               0          
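
To see which integer each label received, the fitted encoder's `classes_` attribute can be inspected. The sketch below refits a separate encoder on the Age labels printed earlier (the `ages` list is copied from that output; the `mapping` name is only for illustration):

```python
from sklearn.preprocessing import LabelEncoder

# Age labels as they appear in the unique-values output above.
ages = ['0-17', '55+', '26-35', '46-50', '51-55', '36-45', '18-25']

enc = LabelEncoder().fit(ages)

# classes_ holds the labels in sorted order; the position of a label
# in that array is the integer code it is mapped to.
mapping = {label: code for code, label in enumerate(enc.classes_)}
print(mapping)
```

Because the Age labels happen to sort in the same order as the age bands, the resulting codes keep the natural ranking, with '55+' mapped to 6 as in the encoded table above.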

In [8]:
for col in train.columns:
    print(col)
    print(train.loc[:,col].unique())

User_ID
[1000001 1000002 1000003 ... 1004113 1005391 1001529]
Product_ID
['P00069042' 'P00248942' 'P00087842' ... 'P00370293' 'P00371644'
 'P00370853']
Gender
[0 1]
Age
[0 6 2 4 5 3 1]
Occupation
[10 16 15  7 20  9  1 12 17  0  3  4 11  8 19  2 18  5 14 13  6]
City_Category
[0 2 1]
Stay_In_Current_City_Years
[2 4 3 1 0]
Marital_Status
[0 1]
Product_Category_1
[ 3  1 12  8  5  4  2  6 14 11 13 15  7 16 18 10 17  9 20 19]
Product_Category_2
[ 0.  6. 14.  2.  8. 15. 16. 11.  5.  3.  4. 12.  9. 10. 17. 13.  7. 18.]
Product_Category_3
[ 0. 14. 17.  5.  4. 16. 15.  8.  9. 13.  6. 12.  3. 18. 11. 10.]
Purchase
[ 8370 15200  1422 ...   135   123   613]


Next we split the features into X1 and X2. X1 holds the unranked categorical variables, which need to be one-hot encoded. X2 holds the remaining features, which are used as they are.
Please note that the features User_ID and Product_ID have been dropped because my computer kept throwing a memory error when they were included. Please add them back if your system can manage it.
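
One possible way around the memory problem, offered only as a sketch and not as what this notebook does, is to keep the one-hot output as a sparse matrix instead of a dense array. The example below uses randomly generated IDs as a hypothetical stand-in for a high-cardinality column such as User_ID:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for a high-cardinality ID column (not the real data).
rng = np.random.RandomState(0)
ids = rng.randint(0, 5000, size=(10000, 1))

# By default OneHotEncoder returns a SciPy sparse matrix,
# which stores only the non-zero entries.
sparse_X = OneHotEncoder().fit_transform(ids)

n_rows, n_cols = sparse_X.shape
print(n_rows, n_cols, sparse_X.nnz)
```

Each row has exactly one non-zero entry, so the stored size grows with the number of rows rather than with rows times categories.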

In [9]:
col=['Product_Category_3','Product_Category_2','Marital_Status','Product_Category_1',
     'City_Category','Gender']
col1=['Age','Occupation','Stay_In_Current_City_Years']
X1=train[col]
X2=train[col1]
Y=train['Purchase']#dependent variable
print('X1 matrix')
print(X1.head(10))

X1 matrix
   Product_Category_3  Product_Category_2  Marital_Status  Product_Category_1  \
0                 0.0                 0.0               0                   3   
1                14.0                 6.0               0                   1   
2                 0.0                 0.0               0                  12   
3                 0.0                14.0               0                  12   
4                 0.0                 0.0               0                   8   
5                 0.0                 2.0               0                   1   
6                17.0                 8.0               1                   1   
7                 0.0                15.0               1                   1   
8                 0.0                16.0               1                   1   
9                 0.0                 0.0               1                   8   

   City_Category  Gender  
0              0       0  
1              0       0  
2              0 

Now we one-hot encode X1. Please note that the output of this operation is a NumPy array, so we convert it to a DataFrame because we need to join it with X2 down the line.

In [11]:
from sklearn.preprocessing import OneHotEncoder
# note: scikit-learn 1.2 renamed this argument to sparse_output (sparse was removed in 1.4)
one=OneHotEncoder(sparse=False)
X1=one.fit_transform(X1)
X1=pd.DataFrame(X1)
print('new encoded X1')
print(X1.head(10))

new encoded X1
    0    1    2    3    4    5    6    7    8    9   ...   51   52   53   54  \
0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0   
1  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0   
2  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0   
3  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0   
4  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0   
5  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0   
6  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0   
7  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0   
8  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0   
9  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0   

    55   56   57   58   59   60  
0  0.0  1.0  0.0  0.0  1.0  0.0  
1  0.0  1.0  0.0  0.0  1.0  0.0  
2 
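
The encoded frame above has anonymous integer column names. If readable names are wanted, one alternative (a swapped-in technique, not the method used in this notebook) is `pd.get_dummies`, sketched here on a made-up frame with two of the integer-coded columns:

```python
import pandas as pd

# Hypothetical toy frame with two integer-coded categorical columns.
toy = pd.DataFrame({"City_Category": [0, 2, 1], "Gender": [0, 1, 1]})

# columns= forces these integer columns to be treated as categorical;
# each indicator column is named <original column>_<category value>.
dummies = pd.get_dummies(toy, columns=["City_Category", "Gender"])
print(list(dummies.columns))
```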

The join function combines DataFrames column-wise, matching rows by their index. We use it to prepare the final feature matrix used for training the model, i.e. X.

In [12]:
X=X2.join(X1)
print('Final feature matrix')
print(X.head(5))

Final feature matrix
   Age  Occupation  Stay_In_Current_City_Years    0    1    2    3    4    5  \
0    0          10                           2  1.0  0.0  0.0  0.0  0.0  0.0   
1    0          10                           2  0.0  0.0  0.0  0.0  0.0  0.0   
2    0          10                           2  1.0  0.0  0.0  0.0  0.0  0.0   
3    0          10                           2  1.0  0.0  0.0  0.0  0.0  0.0   
4    6          16                           4  1.0  0.0  0.0  0.0  0.0  0.0   

     6  ...   51   52   53   54   55   56   57   58   59   60  
0  0.0  ...  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0  
1  0.0  ...  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0  
2  0.0  ...  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0  
3  0.0  ...  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0  
4  0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  1.0  

[5 rows x 64 columns]
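
The join works here because X2 and the new X1 both carry the default 0..n-1 index. A minimal sketch with made-up frames (`left`, `right` and their values are assumptions) shows what happens when the indexes do not line up:

```python
import pandas as pd

# Hypothetical frames: the right one is missing the row with index 1.
left = pd.DataFrame({"Age": [0, 6, 2]}, index=[0, 1, 2])
right = pd.DataFrame({"onehot_0": [1.0, 0.0]}, index=[0, 2])

# join is a left join on the index by default; rows without a match get NaN.
joined = left.join(right)
print(joined)
```

If rows had been filtered out of the original frame before encoding, resetting the index first would avoid such gaps.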


Now we train the linear model on the feature matrix and the dependent variable

In [13]:
from sklearn.linear_model import LinearRegression
lr=LinearRegression()
lr.fit(X,Y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

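Finally, we find out the score of the fitted model. For LinearRegression, score returns the coefficient of determination (R²) rather than a classification accuracy, and here it is computed on the same data the model was trained on.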

In [14]:
lr.score(X,Y)

0.6472497961009319
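
The score above is measured on the training data. To estimate how well the model predicts purchases it has not seen, part of the data can be held out. The sketch below shows the pattern on synthetic data; `X_demo`, `y_demo` and the coefficients are made up for illustration and are not the Black Friday features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data: a linear signal plus a little noise.
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 3)
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.randn(200)

# Hold out 25% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Fit on the training part only, then predict the held-out part.
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# R² on held-out data; this equals model.score(X_test, y_test).
test_r2 = r2_score(y_test, y_pred)
print(test_r2)
```

Applied to this notebook, the same pattern would replace X_demo and y_demo with X and Y.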