<a href="https://colab.research.google.com/github/jesse-venson/Machine-learning/blob/main/ML_assign2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler,MinMaxScaler, OneHotEncoder, KBinsDiscretizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import jaccard_score, pairwise_distances

from scipy.spatial.distance import cosine
import os

Part I: Feature Selection, Cleaning, and Preprocessing to Construct an Input from the Data Source

(a) Examine the values of each attribute and select only the attributes that would help predict future bike buyers; these form the input for the data mining algorithms. Remove all unnecessary attributes. (Select features by analysis only.)

In [2]:
import sys
!{sys.executable} -m pip install kagglehub
import kagglehub
# Download Latest Version
path = kagglehub.dataset_download("jahias/microsoft-adventure-works-cycles-customer-data")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/jahias/microsoft-adventure-works-cycles-customer-data?dataset_version_number=1...


100%|██████████| 939k/939k [00:00<00:00, 61.7MB/s]

Extracting files...
Path to dataset files: /root/.cache/kagglehub/datasets/jahias/microsoft-adventure-works-cycles-customer-data/versions/1





In [3]:
data_customers = pd.read_csv(os.path.join(path,'AWCustomers.csv'))
data_sales = pd.read_csv(os.path.join(path,'AWSales.csv'))

In [4]:
print(data_sales.columns)
print(data_customers.columns)

# Drop CustomerID from the sales table so the combined frame does not end up with two CustomerID columns
data_sales.drop(['CustomerID'],axis = 1, inplace=True)

Index(['CustomerID', 'BikeBuyer', 'AvgMonthSpend'], dtype='object')
Index(['CustomerID', 'Title', 'FirstName', 'MiddleName', 'LastName', 'Suffix',
       'AddressLine1', 'AddressLine2', 'City', 'StateProvinceName',
       'CountryRegionName', 'PostalCode', 'PhoneNumber', 'BirthDate',
       'Education', 'Occupation', 'Gender', 'MaritalStatus', 'HomeOwnerFlag',
       'NumberCarsOwned', 'NumberChildrenAtHome', 'TotalChildren',
       'YearlyIncome', 'LastUpdated'],
      dtype='object')


In [5]:
# Combine the two tables column-wise; rows are aligned by index position, not by CustomerID
df = pd.concat([data_customers, data_sales], axis=1)
df.columns

Index(['CustomerID', 'Title', 'FirstName', 'MiddleName', 'LastName', 'Suffix',
       'AddressLine1', 'AddressLine2', 'City', 'StateProvinceName',
       'CountryRegionName', 'PostalCode', 'PhoneNumber', 'BirthDate',
       'Education', 'Occupation', 'Gender', 'MaritalStatus', 'HomeOwnerFlag',
       'NumberCarsOwned', 'NumberChildrenAtHome', 'TotalChildren',
       'YearlyIncome', 'LastUpdated', 'BikeBuyer', 'AvgMonthSpend'],
      dtype='object')
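Because `pd.concat(..., axis=1)` pairs rows by position, it only gives the intended result when both files list customers in the same order and with the same number of rows. As a hedged alternative, the sketch below joins two toy frames on `CustomerID` instead. The frames and their values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-ins for AWCustomers.csv and AWSales.csv (values are made up).
customers = pd.DataFrame({"CustomerID": [1, 2, 3], "Gender": ["M", "F", "F"]})
sales = pd.DataFrame({"CustomerID": [3, 1], "BikeBuyer": [1, 0]})

# Key-based join: each sales row is attached to the customer with the same ID,
# regardless of row order. Customers without a sales row get NaN.
merged = customers.merge(sales, on="CustomerID", how="left")
print(merged)
```

With a key-based join, a missing sales record shows up as NaN on the right customer rather than shifting every later row.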

In [6]:
# Split the columns into discrete (int64) and continuous (float64) by dtype

var_discrete = []    # integer-valued columns
var_continuous = []  # float-valued columns

for c in df.columns:
    if df[c].dtype == 'float64':
        var_continuous.append(c)
    elif df[c].dtype == 'int64':
        var_discrete.append(c)

print("Discrete Variables : ", var_discrete)
print("Continuous Variables : ", var_continuous)

Discrete Variables :  ['CustomerID', 'HomeOwnerFlag', 'NumberCarsOwned', 'NumberChildrenAtHome', 'TotalChildren', 'YearlyIncome']
Continuous Variables :  ['BikeBuyer', 'AvgMonthSpend']


In [7]:
# Classification of Datatypes
Nominal = [
    'CustomerID',
    'FirstName',
    'LastName',
    'AddressLine1',
    'City',
    'StateProvinceName',
    'CountryRegionName',
    'PostalCode',
    'Gender',
    'MaritalStatus',
    'HomeOwnerFlag',
    'BikeBuyer'
]
Ordinal = []
Ratio = [
    'NumberCarsOwned',
    'NumberChildrenAtHome',
    'TotalChildren',
    'YearlyIncome',
    'AvgMonthSpend'
]
Interval = ['BirthDate']
print("Nominal: ", Nominal)
print("Ordinal: ", Ordinal)
print("Ratio: ", Ratio)
print("Interval: ", Interval)

Nominal:  ['CustomerID', 'FirstName', 'LastName', 'AddressLine1', 'City', 'StateProvinceName', 'CountryRegionName', 'PostalCode', 'Gender', 'MaritalStatus', 'HomeOwnerFlag', 'BikeBuyer']
Ordinal:  []
Ratio:  ['NumberCarsOwned', 'NumberChildrenAtHome', 'TotalChildren', 'YearlyIncome', 'AvgMonthSpend']
Interval:  ['BirthDate']
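Part I asks for the unnecessary attributes to be removed. One way this could look is sketched below on a toy frame. The choice of which columns count as identifier-like (`id_like`) is an assumption made for illustration, not a selection taken from the analysis above, and the values are made up.

```python
import pandas as pd

# Toy frame with a few of the AWCustomers columns (values are made up).
toy = pd.DataFrame({
    "CustomerID": [11000, 11001],
    "FirstName": ["Jon", "Eugene"],
    "PhoneNumber": ["555-0100", "555-0101"],
    "Gender": ["M", "M"],
    "YearlyIncome": [90000, 60000],
    "BikeBuyer": [0, 1],
})

# Hypothetical choice: identifiers and contact details carry no predictive signal.
id_like = ["CustomerID", "FirstName", "PhoneNumber"]
selected = toy.drop(columns=id_like)
print(selected.columns.tolist())
```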


Part II: Data Preprocessing and Transformation

Depending on the data type of each attribute, transform each object from your preprocessed data. Use all the data rows (approximately 18,000) with the selected features as input for all the tasks below; do not perform the tasks on the smaller data set obtained from random sampling.
(a) Handling null values
(b) Normalization
(c) Discretization (binning) of continuous attributes, or of discrete attributes with too many values
(d) Standardization/Normalization
(e) Binarization (one-hot encoding)

Handling Null Values

In [8]:
numerical_features = ['YearlyIncome', 'AvgMonthSpend', 'NumberCarsOwned', 'NumberChildrenAtHome', 'TotalChildren']
categorical_features = ['Gender', 'MaritalStatus', 'HomeOwnerFlag', 'BikeBuyer']

# Fill missing numeric values with the column mean and
# missing categorical values with the most frequent category
numerical_imputer = SimpleImputer(strategy='mean')
categorical_imputer = SimpleImputer(strategy='most_frequent')

df[numerical_features] = numerical_imputer.fit_transform(df[numerical_features])
df[categorical_features] = categorical_imputer.fit_transform(df[categorical_features])

Normalization of Values

In [None]:
# Rescale each numeric feature to the range [0, 1]
scaler = MinMaxScaler()
df[numerical_features] = scaler.fit_transform(df[numerical_features])

Standardization

In [9]:
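# Rescale each numeric feature to zero mean and unit variance;
# this overwrites the values currently stored in these columns
standard_scaler = StandardScaler()
df[numerical_features] = standard_scaler.fit_transform(df[numerical_features])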
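As a quick check of what the two scalers compute, the sketch below reproduces them by hand on a made-up column: min-max scaling is (x - min) / (max - min), and standardization is (x - mean) / std with the population standard deviation. It also shows that standardizing an already min-max scaled column gives the same z-scores as standardizing the raw values, since min-max scaling is a positive linear transformation.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[10.0], [20.0], [40.0]])  # made-up values

# Min-max normalization: (x - min) / (max - min)
minmax = MinMaxScaler().fit_transform(x)
manual_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / std, using the population std (ddof=0)
zscore = StandardScaler().fit_transform(x)
manual_zscore = (x - x.mean()) / x.std()

# Standardizing the min-max scaled column yields the same z-scores
zscore_after_minmax = StandardScaler().fit_transform(minmax)

print(np.allclose(minmax, manual_minmax))
print(np.allclose(zscore, manual_zscore))
print(np.allclose(zscore, zscore_after_minmax))
```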

Binning/Discretization of continuous attributes, or of discrete attributes with too many values

In [10]:
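# Split YearlyIncome into 3 equal-width bins labelled 0, 1, 2
# (the column was standardized above, so the bin edges are in z-score units)
binning_transformer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
df['YearlyIncome_Binned'] = binning_transformer.fit_transform(df[['YearlyIncome']]).ravel()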
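The `strategy` argument decides where the bin edges fall. The sketch below compares equal-width (`'uniform'`) and equal-frequency (`'quantile'`) binning on a small, deliberately skewed column; the values are made up and only illustrate how a single large value affects the two strategies.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Made-up, skewed values: five small incomes and one very large one
income = np.array([[10.0], [12.0], [14.0], [16.0], [18.0], [100.0]])

uniform = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
quantile = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")

# Equal-width bins: edges are spread evenly between min and max
uniform_bins = uniform.fit_transform(income).ravel()
# Equal-frequency bins: edges are chosen so each bin holds a similar number of rows
quantile_bins = quantile.fit_transform(income).ravel()

print("uniform :", uniform_bins)
print("quantile:", quantile_bins)
```

With equal-width bins the outlier stretches the range, so most rows land in the first bin; equal-frequency bins spread the rows more evenly.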

One Hot Encoding

In [11]:
encoder = OneHotEncoder(sparse_output=False)
categorical_features = ['Gender', 'MaritalStatus', 'HomeOwnerFlag', 'BikeBuyer']
encoded_categorical_features = encoder.fit_transform(df[categorical_features])
encoded_df = pd.DataFrame(encoded_categorical_features, columns=encoder.get_feature_names_out(categorical_features))
df = pd.concat([df, encoded_df], axis=1).drop(categorical_features, axis=1)

Part III: Calculating Proximity / Correlation Analysis of Two Features

Make sure numeric attributes are transformed to the same scale, nominal attributes are binarized (one-hot encoded), and discretized numeric attributes are standardized. Apply the appropriate similarity measure to nominal (one-hot encoded)/binary attributes and to numeric attributes, respectively.
(a) Calculate the similarity in Simple Matching, Jaccard Similarity, and Cosine Similarity between the following two objects of your transformed input data.
(b) Calculate the correlation between the two features Commute Distance and Yearly Income.
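Task (a) compares two objects, that is, two rows of the transformed data. The sketch below shows the three measures on two made-up customers: simple matching and Jaccard on the binary (one-hot) part, and cosine on the scaled numeric part. The vectors `a_bin`, `b_bin`, `a_num`, and `b_num` are hypothetical and are not rows from the dataset.

```python
import numpy as np
from sklearn.metrics import jaccard_score
from sklearn.metrics.pairwise import cosine_similarity

# Two made-up customers: binary (one-hot) part and scaled numeric part
a_bin = np.array([1, 0, 1, 0, 1])
b_bin = np.array([1, 0, 0, 1, 1])
a_num = np.array([0.5, -1.2, 0.3])
b_num = np.array([0.4, -0.9, 1.1])

# Simple matching coefficient: share of positions where both agree (1-1 or 0-0)
smc = np.mean(a_bin == b_bin)

# Jaccard: 1-1 matches divided by positions where at least one vector is 1
jac = jaccard_score(a_bin, b_bin)

# Cosine similarity: dot product divided by the product of the vector norms
cos = cosine_similarity(a_num.reshape(1, -1), b_num.reshape(1, -1))[0, 0]

print(f"SMC: {smc:.3f}  Jaccard: {jac:.3f}  Cosine: {cos:.3f}")
```

Simple matching counts shared zeros as agreement while Jaccard ignores them, which is why Jaccard is usually preferred for sparse one-hot vectors.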

In [12]:
def jaccard(list1, list2):
    # Jaccard similarity of the sets of distinct values: |A ∩ B| / |A ∪ B|
    set1, set2 = set(list1), set(list2)
    union = set1 | set2
    if not union:
        return 0.0
    return len(set1 & set2) / len(union)

# Compare the distinct values of the two columns
jaccard(df['YearlyIncome'], df['AvgMonthSpend'])

0.0

In [13]:
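# NumberChildrenAtHome was standardized in Part II, so "> 0" flags above-average values,
# not "at least one child at home"
df['NumberChildrenAtHome_Binary'] = (df['NumberChildrenAtHome'] > 0).astype(int)
# Compute Jaccard similarity between 'NumberChildrenAtHome_Binary' and each one-hot 'BikeBuyer' column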
for bike_buyer_col in ['BikeBuyer_0.0', 'BikeBuyer_1.0']:
    jaccard_sim = jaccard_score(df['NumberChildrenAtHome_Binary'], df[bike_buyer_col])
    print(f"Jaccard Similarity between NumberChildrenAtHome_Binary and {bike_buyer_col}: {jaccard_sim}")

Jaccard Similarity between NumberChildrenAtHome_Binary and BikeBuyer_0.0: 0.2138122536725188
Jaccard Similarity between NumberChildrenAtHome_Binary and BikeBuyer_1.0: 0.23450479233226837


In [14]:
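# Pearson correlation between YearlyIncome and AvgMonthSpend
# (no commute-distance column appears in the loaded data)
df['YearlyIncome'].corr(df['AvgMonthSpend'])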

np.float64(0.012200386558915567)
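`Series.corr` computes the Pearson correlation coefficient by default. As a small worked example on made-up values, the sketch below reproduces it from the definition: the sum of products of deviations divided by the square root of the product of the summed squared deviations.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 1.0, 4.0, 3.0, 5.0])

# pandas result (Pearson is the default method)
r_pandas = x.corr(y)

# Manual computation from the definition
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(r_pandas, r_manual)
```

Pearson correlation is unchanged by shifting or positively rescaling either variable, so standardized and raw columns give the same coefficient.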