
Katyayani Jha | 3P12 | 102216040

In [1]:
import pandas as pd
from google.colab import files

uploaded = files.upload()

df = pd.read_csv('AWCustomers.csv')

Saving AWCustomers.csv to AWCustomers.csv


Part I: Based on Feature Selection, Cleaning, and Preprocessing to Construct an Input from Data Source

(a) Examine the values of each attribute

In [2]:
print("DataFrame columns and types:")
print(df.dtypes)

print("\nSample of the DataFrame:")
print(df.head())

DataFrame columns and types:
CustomerID               int64
Title                   object
FirstName               object
MiddleName              object
LastName                object
Suffix                  object
AddressLine1            object
AddressLine2            object
City                    object
StateProvinceName       object
CountryRegionName       object
PostalCode              object
PhoneNumber             object
BirthDate               object
Education               object
Occupation              object
Gender                  object
MaritalStatus           object
HomeOwnerFlag            int64
NumberCarsOwned          int64
NumberChildrenAtHome     int64
TotalChildren            int64
YearlyIncome             int64
LastUpdated             object
dtype: object

Sample of the DataFrame:
   CustomerID Title FirstName MiddleName  LastName Suffix  \
0       21173   NaN      Chad          C      Yuan    NaN   
1       13249   NaN      Ryan        NaN     Perry    NaN   
2   
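
Besides dtypes and the first rows, it often helps to check how many values are missing and how many distinct values each attribute takes before selecting features. The sketch below does this on a small stand-in table; the `toy` frame and its values are made up for illustration and are not taken from AWCustomers.csv.

```python
import pandas as pd

# Toy stand-in for the customer table; the values are hypothetical.
toy = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Title": [None, "Mr.", None, None],
    "NumberCarsOwned": [3, 2, 3, 1],
    "YearlyIncome": [81916, 81076, 86387, 51804],
})

# One row per column: dtype, number of missing values, number of distinct values.
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "missing": toy.isna().sum(),
    "unique": toy.nunique(),
})
print(summary)
```

A column that is mostly missing (like `Title` here) or unique per row (like `CustomerID`) is usually a weak candidate for a predictive feature.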

(b) Selected attributes

In [3]:
selected_attributes = ['CustomerID', 'NumberCarsOwned', 'YearlyIncome']
df_selected = df[selected_attributes]

print("\nNew DataFrame with selected attributes:")
print(df_selected.head())


New DataFrame with selected attributes:
   CustomerID  NumberCarsOwned  YearlyIncome
0       21173                3         81916
1       13249                2         81076
2       29350                3         86387
3       13503                2         61481
4       22803                1         51804


(c) Determining the data value type of each attribute

In [5]:
data_types = {
    'CustomerID': 'Discrete, Nominal',  # Typically a unique identifier, not useful for prediction
    'NumberCarsOwned': 'Discrete, Ratio',  # Count of cars, non-negative integers, meaningful zero
    'YearlyIncome': 'Continuous, Ratio'   # Monetary value, can be fractional, meaningful zero
}

print("Data types and preprocessing requirements:")
for attribute, dtype in data_types.items():
    print(f"{attribute}: {dtype}")

Data types and preprocessing requirements:
CustomerID: Discrete, Nominal
NumberCarsOwned: Discrete, Ratio
YearlyIncome: Continuous, Ratio


Part II: Data Preprocessing and Transformation

In [6]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import Binarizer

(a) Handling NULL values

In [7]:
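# Replace missing values in each selected column with that column's mean.
# Note: fit_transform returns floats, so integer columns such as CustomerID become float.
imputer = SimpleImputer(strategy='mean')
df_selected = pd.DataFrame(imputer.fit_transform(df_selected), columns=selected_attributes)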
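
To see what mean imputation does, here is a minimal sketch on a toy frame with deliberately missing entries. The `toy` values are hypothetical; only `SimpleImputer(strategy="mean")` mirrors the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value per column.
toy = pd.DataFrame({
    "NumberCarsOwned": [1.0, 3.0, np.nan],
    "YearlyIncome": [50000.0, np.nan, 70000.0],
})

# Each NaN is replaced by the mean of the observed values in its column.
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```

Mean imputation suits numeric measurements; for an identifier such as CustomerID a mean has no meaning, so it is only harmless there when the column has no missing values.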

(b) Normalization

In [8]:
scaler = MinMaxScaler()
df_selected['YearlyIncome'] = scaler.fit_transform(df_selected[['YearlyIncome']])
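
Min-max normalization maps each value to (x - min) / (max - min), so the smallest value becomes 0 and the largest becomes 1. The sketch below checks that formula against `MinMaxScaler` on a few hypothetical incomes (the numbers are illustrative only).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical incomes as a single-feature column.
incomes = np.array([[20000.0], [50000.0], [100000.0]])

# Manual min-max formula.
manual = (incomes - incomes.min()) / (incomes.max() - incomes.min())

# Same transformation via scikit-learn.
scaled = MinMaxScaler().fit_transform(incomes)

print(manual.ravel())
print(scaled.ravel())
```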

(c) Discretization (binning)

In [9]:
binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
df_selected['YearlyIncome_binned'] = binner.fit_transform(df_selected[['YearlyIncome']])
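
With `strategy='uniform'`, the range between the minimum and maximum is split into `n_bins` intervals of equal width, and each value is replaced by the index of its interval. A minimal sketch on hypothetical values already scaled to [0, 1], compared against a manual `np.digitize` over the same edges:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical values already scaled to [0, 1].
values = np.array([[0.0], [0.1], [0.25], [0.5], [0.9], [1.0]])

binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
binned = binner.fit_transform(values).ravel()

# Manual equivalent: compare each value against the four interior bin edges.
edges = np.linspace(values.min(), values.max(), 6)
manual = np.digitize(values.ravel(), edges[1:-1])

print(binned)
print(manual)
```

Equal-width bins are easy to interpret but can be very unevenly populated when the data are skewed; `strategy='quantile'` is the usual alternative in that case.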



(d) Standardization / Normalization

In [10]:
standard_scaler = StandardScaler()
df_selected['YearlyIncome_standardized'] = standard_scaler.fit_transform(df_selected[['YearlyIncome']])
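
Standardization computes z = (x - mean) / std, using the population standard deviation. Because the cell above standardizes the column after it has already been min-max scaled, it is worth noting that the z-scores do not depend on that earlier step: min-max scaling is a linear rescaling, and z-scores are unchanged by it. The sketch below illustrates both points on hypothetical incomes.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical incomes as a single-feature column.
incomes = np.array([[40000.0], [50000.0], [60000.0]])

# Manual z-score; np.std uses the population standard deviation (ddof=0) by default.
manual = (incomes - incomes.mean()) / incomes.std()

# StandardScaler on the raw values and on the min-max scaled values.
z_raw = StandardScaler().fit_transform(incomes)
z_after_minmax = StandardScaler().fit_transform(MinMaxScaler().fit_transform(incomes))

print(z_raw.ravel())
print(z_after_minmax.ravel())
```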

(e) Binarization (One Hot Encoding)

In [11]:
# scikit-learn >= 1.2 uses sparse_output; the older sparse argument was removed in 1.4.
encoder = OneHotEncoder(sparse_output=False)
number_cars_encoded = encoder.fit_transform(df_selected[['NumberCarsOwned']])
df_encoded = pd.DataFrame(number_cars_encoded, columns=[f'NumberCarsOwned_{int(i)}' for i in encoder.categories_[0]])
df_selected = pd.concat([df_selected, df_encoded], axis=1).drop('NumberCarsOwned', axis=1)
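
As an alternative sketch, `pd.get_dummies` produces the same kind of indicator columns without depending on the scikit-learn version. The `toy` frame below is hypothetical and uses integer counts, which is an assumption; in the notebook the column is float after imputation.

```python
import pandas as pd

# Hypothetical car counts.
toy = pd.DataFrame({"NumberCarsOwned": [3, 2, 3, 1]})

# One indicator column per distinct value; exactly one of them is 1.0 in each row.
dummies = pd.get_dummies(toy["NumberCarsOwned"], prefix="NumberCarsOwned", dtype=float)
print(dummies)
```

Note that one-hot encoding treats the car count as a nominal category and discards its ordering, so it is worth keeping the original numeric column if later steps need it.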



In [12]:
print("\nData after preprocessing:")
print(df_selected.head())


Data after preprocessing:
   CustomerID  YearlyIncome  YearlyIncome_binned  YearlyIncome_standardized  \
0     21173.0      0.496842                  2.0                   0.298555   
1     13249.0      0.489453                  2.0                   0.271180   
2     29350.0      0.536172                  2.0                   0.444261   
3     13503.0      0.317083                  1.0                  -0.367401   
4     22803.0      0.231958                  1.0                  -0.682765   

   NumberCarsOwned_0  NumberCarsOwned_1  NumberCarsOwned_2  NumberCarsOwned_3  \
0                0.0                0.0                0.0                1.0   
1                0.0                0.0                1.0                0.0   
2                0.0                0.0                0.0                1.0   
3                0.0                0.0                1.0                0.0   
4                0.0                1.0                0.0                0.0   

   NumberCa

Part III: Calculating Proximity / Correlation Analysis of two features

In [13]:
# Selecting two objects (rows) for similarity comparison
object1 = df_selected.iloc[0].values.reshape(1, -1)
object2 = df_selected.iloc[1].values.reshape(1, -1)

(a) Calculate the Simple Matching, Jaccard, and Cosine similarity between the following two objects of your transformed input data.

In [16]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import jaccard_score

# Simple Matching Similarity
def simple_matching_similarity(obj1, obj2):
    return np.mean(obj1 == obj2)

# Jaccard Similarity (for binary attributes)
def jaccard_similarity(obj1, obj2):
    return jaccard_score(obj1, obj2, average='binary')

# Cosine Similarity
def cosine_similarity_measure(obj1, obj2):
    return cosine_similarity(obj1, obj2)[0][0]

In [17]:
simple_matching_sim = simple_matching_similarity(object1, object2)
print(f"Simple Matching Similarity: {simple_matching_sim}")

# Converting objects to binary for Jaccard similarity (for simplicity, using only one hot encoded part)
object1_binary = object1[:, -len(encoder.categories_[0]):]  # Assuming last columns are binary
object2_binary = object2[:, -len(encoder.categories_[0]):]
jaccard_sim = jaccard_similarity(object1_binary[0], object2_binary[0])
print(f"Jaccard Similarity: {jaccard_sim}")

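# Note: both vectors still contain CustomerID, whose magnitude dominates the dot product,
# so the cosine similarity comes out very close to 1 regardless of the other attributes.
cosine_sim = cosine_similarity_measure(object1, object2)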
print(f"Cosine Similarity: {cosine_sim}")

Simple Matching Similarity: 0.5
Jaccard Similarity: 0.0
Cosine Similarity: 0.9999999943293292
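
The three measures are easiest to compare on plain binary vectors. The sketch below computes them by hand from the match counts; the vectors `a` and `b` are hypothetical and not rows of the dataset.

```python
import numpy as np

# Two hypothetical binary objects.
a = np.array([1, 0, 0, 1, 1, 0])
b = np.array([1, 1, 0, 0, 1, 0])

f11 = np.sum((a == 1) & (b == 1))  # both attributes are 1
f00 = np.sum((a == 0) & (b == 0))  # both attributes are 0
f10 = np.sum((a == 1) & (b == 0))
f01 = np.sum((a == 0) & (b == 1))

# Simple matching counts 0-0 agreements; Jaccard ignores them.
smc = (f11 + f00) / (f11 + f00 + f10 + f01)
jaccard = f11 / (f11 + f10 + f01)

# Cosine similarity: dot product divided by the product of the vector lengths.
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(smc, jaccard, cosine)
```

Jaccard is the usual choice for sparse one-hot data, where 0-0 matches are common and carry little information; simple matching would overstate similarity there.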


(b) Calculate Correlation between two features NumberCarsOwned and YearlyIncome

In [23]:
# NumberCarsOwned itself was dropped during one-hot encoding, so this correlates the
# "owns exactly 3 cars" indicator with YearlyIncome, not the raw car count.
if 'NumberCarsOwned_3' in df_selected.columns and 'YearlyIncome' in df_selected.columns:
    correlation = df_selected[['NumberCarsOwned_3', 'YearlyIncome']].corr().iloc[0, 1]
    print(f"Correlation between NumberCarsOwned_3 and YearlyIncome: {correlation}")
else:
    print("Required columns for correlation calculation are missing.")

Correlation between NumberCarsOwned_3 and YearlyIncome: 0.30689334857282835
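
To correlate the actual car count with income, the numeric NumberCarsOwned column is needed rather than a single indicator; in this notebook it is still available in the original `df`. The sketch below shows the Pearson calculation on hypothetical data, both through pandas and from the formula, so the numbers here say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical car counts and incomes.
toy = pd.DataFrame({
    "NumberCarsOwned": [0, 1, 2, 3, 4],
    "YearlyIncome": [30000, 45000, 50000, 80000, 95000],
})

# Pearson correlation via pandas.
r_pandas = toy["NumberCarsOwned"].corr(toy["YearlyIncome"])

# Same quantity from the definition: covariance over the product of spreads.
x = toy["NumberCarsOwned"].to_numpy(dtype=float)
y = toy["YearlyIncome"].to_numpy(dtype=float)
dx, dy = x - x.mean(), y - y.mean()
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

print(r_pandas, r_manual)
```

Applied to the notebook's data, the equivalent call would be `df['NumberCarsOwned'].corr(df['YearlyIncome'])`, since Pearson correlation is unaffected by the min-max scaling applied to YearlyIncome.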
