<a href="https://colab.research.google.com/github/manansharma2711/UML-501-ML/blob/main/MLassign2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Part-1

Based on Feature Selection, Cleaning, and Preprocessing to Construct an Input from Data Source

(a) Examine the values of each attribute and select only the attributes that would help predict future bike buyers, to create your input for data mining algorithms. Remove all unnecessary attributes. (Select features by analysis only.)

(b) Create a new Data Frame with the selected attributes only.

(c) Determine the data value type (Discrete or Continuous; then Nominal, Ordinal, Interval, or Ratio) of each attribute in your selection to identify the preprocessing tasks needed to create input for your data mining.
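One possible way to support the type analysis in (c) is to look at each column's pandas dtype together with its number of distinct values. The sketch below is only an illustration: `toy` is a small made-up frame (not the AdventureWorks data) and `summarize_columns` is a hypothetical helper name, not part of the assignment.

```python
import pandas as pd

# Hypothetical toy data: column names mirror the assignment, values are made up
toy = pd.DataFrame({
    "Gender": ["M", "F", "M", "F"],
    "Education": ["Bachelors", "Partial College", "High School", "Bachelors"],
    "NumberCarsOwned": [3, 2, 0, 1],
    "YearlyIncome": [81916.0, 81076.0, 86387.0, 61481.0],
})

def summarize_columns(frame):
    """Return the dtype and distinct-value count of every column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),   # storage type as a string
        "n_unique": frame.nunique(),         # number of distinct non-null values
    })

summary = summarize_columns(toy)
print(summary)
```

A text column with two distinct values suggests a binary nominal attribute, while a float column with many distinct values suggests a continuous one; whether a categorical attribute is nominal or ordinal still has to be decided by reading its values.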

In [1]:
import pandas as pd
from datetime import datetime
from sklearn.preprocessing import MaxAbsScaler
import kagglehub
import os

# Download latest version
path = kagglehub.dataset_download("jahias/microsoft-adventure-works-cycles-customer-data")

# Construct the full paths to the CSV files
customers_file_path = os.path.join(path, 'AWCustomers.csv')
sales_file_path = os.path.join(path, 'AWSales.csv')

# Read the CSV files using the full paths
df = pd.read_csv(customers_file_path)
sales = pd.read_csv(sales_file_path)

merged_df=pd.merge(df,sales,on='CustomerID')
# print(merged_df.head())
## Parse BirthDate into pandas datetime values; entries that cannot be parsed become NaT
merged_df['BirthDate']=pd.to_datetime(merged_df['BirthDate'],errors='coerce')
## Approximate age: current year minus birth year (ignores whether the birthday has passed this year)
merged_df['Age']=datetime.now().year-merged_df['BirthDate'].dt.year
new_columns=[
    'Gender',
    'MaritalStatus',
    'Age',
    'Education',
    'HomeOwnerFlag',
    'NumberCarsOwned',
    'NumberChildrenAtHome',
    'TotalChildren',
    'YearlyIncome',
    'BikeBuyer'
]

## Take an explicit copy so later column assignments do not trigger SettingWithCopyWarning
new_df=merged_df[new_columns].copy()
print(new_df.head())

Using Colab cache for faster access to the 'microsoft-adventure-works-cycles-customer-data' dataset.
  Gender MaritalStatus  Age        Education  HomeOwnerFlag  NumberCarsOwned  \
0      M             M   38        Bachelors              1                3   
1      M             M   53  Partial College              1                2   
2      F             S   40        Bachelors              0                3   
3      M             M   48  Partial College              1                2   
4      M             S   50  Partial College              1                1   

   NumberChildrenAtHome  TotalChildren  YearlyIncome  BikeBuyer  
0                     0              1         81916          1  
1                     1              2         81076          1  
2                     0              0         86387          1  
3                     1              2         61481          1  
4                     0              0         51804          1  


Datatype

Gender - Categorical, Nominal (Binary) - contains only 2 values, Male or Female

Marital Status - Categorical, Nominal (Binary) - contains only 2 values, Married or Unmarried

Age - Numerical, Ratio - has a true zero point

Education - Categorical, Ordinal

Home Owner Flag - Categorical, Nominal (Binary) - either 0 or 1

Number of Cars Owned - Numerical, Discrete

Number of Children at Home - Numerical, Discrete

Total Children - Numerical, Discrete

Yearly Income - Numerical, Ratio

Bike Buyer - Categorical, Nominal (Binary)

Part-2

Data Preprocessing and Transformation

Depending on the data type of each attribute, transform each object from your preprocessed data. Use all the data rows (approximately 18,000 rows) with the selected features as input for all the tasks below; do not perform the tasks on the smaller data set obtained from random sampling.

(a) Handling Null values

(b) Normalization

(c) Discretization (Binning) on Continuous attributes or Categorical Attributes with too many different values

(d) Standardization/Normalization

(e) Binarization (One Hot Encoding)
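As a rough illustration of how tasks (b) to (e) differ, the sketch below applies each one to a tiny made-up frame using plain pandas arithmetic. The frame `toy`, its values, and the age bin edges are assumptions chosen for readability; they are not taken from the data set.

```python
import pandas as pd

# Hypothetical toy data, not the AdventureWorks rows
toy = pd.DataFrame({
    "Age": [25, 40, 55, 70],
    "YearlyIncome": [20000, 40000, 60000, 80000],
    "Gender": ["M", "F", "F", "M"],
})

# (b) Max-abs normalization: divide by the largest absolute value
income_norm = toy["YearlyIncome"] / toy["YearlyIncome"].abs().max()

# (d) Standardization (z-score) with the population standard deviation (ddof=0)
income_std = (toy["YearlyIncome"] - toy["YearlyIncome"].mean()) / toy["YearlyIncome"].std(ddof=0)

# (c) Discretization of raw ages; the bin edges are illustrative assumptions
age_group = pd.cut(toy["Age"], bins=[0, 35, 60, 120],
                   labels=["Young", "Middle-Aged", "Senior"])

# (e) One-hot encoding of a nominal attribute
gender_onehot = pd.get_dummies(toy["Gender"], prefix="Gender")

print(income_norm.tolist())
print(income_std.round(3).tolist())
print(age_group.tolist())
print(gender_onehot)
```

Normalization keeps the shape of the distribution but rescales it to a fixed range, whereas standardization centres it at zero with unit variance. Binning raw ages rather than scaled ones keeps the bin edges interpretable in years.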

In [2]:
## Dropping null values
print(new_df.isnull().sum())
new_df=new_df.dropna()

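## Normalisation: MaxAbsScaler divides each column by its maximum absolute value,
## so for non-negative columns the result lies in [0, 1]
cols=['Age','YearlyIncome','TotalChildren']
scaler=MaxAbsScaler()
scaled=scaler.fit_transform(new_df[cols])
new_df[cols]=pd.DataFrame(scaled,columns=cols,index=new_df.index)

## Discretization: Age is already scaled at this point, so the bin edges below
## are fractions of the maximum age, not ages in years
new_df['AgeGroup']=pd.cut(
    new_df['Age'],
    bins=[0,0.3,0.6,1],
    labels=['Young','Middle-Aged','Senior']
)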

# print(new_df['AgeGroup'].value_counts())

##Standardisation
from sklearn.preprocessing import StandardScaler
col=['NumberCarsOwned','NumberChildrenAtHome']
scaler2=StandardScaler()
new_df[col]=scaler2.fit_transform(new_df[col])

##Binarization
final_df=pd.get_dummies(new_df,columns=['Gender','MaritalStatus','Education','AgeGroup'])
print(final_df.head())

Gender                  0
MaritalStatus           0
Age                     0
Education               0
HomeOwnerFlag           0
NumberCarsOwned         0
NumberChildrenAtHome    0
TotalChildren           0
YearlyIncome            0
BikeBuyer               0
dtype: int64
        Age  HomeOwnerFlag  NumberCarsOwned  NumberChildrenAtHome  \
0  0.400000              1         1.892524             -0.594371   
1  0.557895              1         0.798389              1.163279   
2  0.421053              0         1.892524             -0.594371   
3  0.505263              1         0.798389              1.163279   
4  0.526316              1        -0.295746             -0.594371   

   TotalChildren  YearlyIncome  BikeBuyer  Gender_F  Gender_M  \
0       0.333333      0.588837          1     False      True   
1       0.666667      0.582798          1     False      True   
2       0.000000      0.620975          1      True     False   
3       0.666667      0.441944          1     False 

Part-3

In [3]:
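from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
## Compare the first two rows of the encoded data as numeric vectors
obj1=final_df.iloc[0]
obj2=final_df.iloc[1]
A=obj1.values.astype(float)
B=obj2.values.astype(float)

## Simple Matching Coefficient: fraction of positions where the two rows hold equal values
## (computed over all columns here, including the scaled numeric ones)
match=np.sum(A==B)
size=len(A)
smc=match/size
print(smc)

## Jaccard: positions where both rows are 1, divided by positions where at least one is 1
match=np.sum((A==1) & (B==1))
denominator=np.sum((A==1) | (B==1))
jaccard=match/denominator
print(jaccard)

## Cosine similarity between the two full row vectors
cosine=cosine_similarity([A],[B])
print(cosine)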

0.631578947368421
0.7142857142857143
[[0.67485062]]
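For reference, the three measures can be checked by hand on short binary vectors. The sketch below is a minimal illustration with made-up vectors `x` and `y`; the function names are hypothetical, and returning 0.0 for Jaccard when neither vector contains a 1 is an assumed convention.

```python
import numpy as np

def smc(a, b):
    """Simple Matching Coefficient: share of positions where both vectors agree (1-1 or 0-0)."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    return float(np.mean(a == b))

def jaccard(a, b):
    """Jaccard coefficient: 1-1 matches divided by positions where at least one vector is 1."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    either = np.sum(a | b)
    # Assumed convention: return 0.0 when neither vector has any 1s
    return float(np.sum(a & b) / either) if either else 0.0

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up binary vectors for illustration
x = [1, 0, 0, 1, 1, 0]
y = [1, 1, 0, 0, 1, 0]

print(smc(x, y))      # 4 agreeing positions out of 6
print(jaccard(x, y))  # 2 shared ones out of 4 positions with any one
print(cosine(x, y))   # dot product 2 over norm product 3
```

SMC and Jaccard are defined for binary attributes, so one option is to apply them only to the one-hot and flag columns and reserve cosine similarity for the full mixed vector.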
