# Feature Selection using Chi-Square Test

## Steps for Feature Selection Using Chi-Square Test
* Prepare the Data: Ensure that both the independent variables (features) and the target variable are categorical; continuous features need to be binned into categories first.
* Convert Categories to Numbers: Use an encoding technique such as Label Encoding or One-Hot Encoding so that every feature is a non-negative number.
* Compute Chi-Square Scores: Calculate the Chi-Square statistic for each feature relative to the target variable.
* Select Top Features: Choose the features with the highest Chi-Square scores, since a higher score indicates stronger evidence that the feature and the target variable are not independent.
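
To make the "Compute Chi-Square Scores" step concrete, here is a minimal sketch of the classic Pearson Chi-Square statistic, computed by hand from the joint counts of one categorical feature and the target. The function name `chi_square_statistic` and the toy `gender`/`bought` data are invented for illustration and are not part of the dataset used below. Note that scikit-learn's `chi2` scorer computes a related statistic that treats feature values as non-negative counts, so its scores will generally differ from this contingency-table version.

```python
from collections import Counter


def chi_square_statistic(feature, target):
    """Pearson Chi-Square statistic for two equal-length categorical sequences."""
    n = len(feature)
    observed = Counter(zip(feature, target))  # joint counts per (feature, target) pair
    feature_totals = Counter(feature)         # row totals
    target_totals = Counter(target)           # column totals

    stat = 0.0
    for f_value, f_count in feature_totals.items():
        for t_value, t_count in target_totals.items():
            # Count we would expect in this cell if feature and target were independent
            expected = f_count * t_count / n
            diff = observed.get((f_value, t_value), 0) - expected
            stat += diff * diff / expected
    return stat


# Toy data, invented for illustration only
gender = ["M", "M", "F", "F", "M", "F", "M", "F"]
bought = ["yes", "yes", "no", "no", "yes", "no", "no", "yes"]

print(chi_square_statistic(gender, bought))  # prints 2.0
```

A statistic of 0 means the observed counts match what independence would predict exactly; larger values mean the feature and the target deviate further from independence.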

# Real-World Example: Customer Purchase Prediction

## Step 1: Loading and Preparing the Dataset

In [1]:
# %pip install pandas scikit-learn
import pandas as pd
import sklearn

df = pd.read_csv('https://raw.githubusercontent.com/itsluckysharma01/Datasets/refs/heads/main/customer_purchase_behavior.csv')

print(df.head())

   Customer_ID  Age  Annual_Income  Purchase_Frequency  Avg_Purchase_Value  \
0            1   56          81228                   9          325.439657   
1            2   69          68984                  17          140.221673   
2            3   46          60774                  17          303.138007   
3            4   32          22568                  12          489.868572   
4            5   60          82592                   7          253.636233   

  Category_Preference  
0         Electronics  
1               Books  
2               Books  
3               Books  
4           Groceries  


## Step 2: Data Summary

In [2]:
print(df.info())

print(df.describe(include='all'))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Customer_ID          100 non-null    int64  
 1   Age                  100 non-null    int64  
 2   Annual_Income        100 non-null    int64  
 3   Purchase_Frequency   100 non-null    int64  
 4   Avg_Purchase_Value   100 non-null    float64
 5   Category_Preference  100 non-null    object 
dtypes: float64(1), int64(4), object(1)
memory usage: 4.8+ KB
None
        Customer_ID         Age  Annual_Income  Purchase_Frequency  \
count    100.000000  100.000000     100.000000          100.000000   
unique          NaN         NaN            NaN                 NaN   
top             NaN         NaN            NaN                 NaN   
freq            NaN         NaN            NaN                 NaN   
mean      50.500000   43.350000   69474.690000            9.170000   
std       

## Step 3: Data Cleaning

In [3]:
print(df.isnull().sum())

df = df.dropna()

print(df.isnull().sum())

Customer_ID            0
Age                    0
Annual_Income          0
Purchase_Frequency     0
Avg_Purchase_Value     0
Category_Preference    0
dtype: int64
Customer_ID            0
Age                    0
Annual_Income          0
Purchase_Frequency     0
Avg_Purchase_Value     0
Category_Preference    0
dtype: int64


## Step 4: Feature Encoding

In [8]:
df.head()

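   Customer_ID  Age  Annual_Income  Purchase_Frequency  Avg_Purchase_Value Category_Preference
0            1   34             59                   8          325.439657         Electronics
1            2   46             47                  14          140.221673               Books
2            3   25             40                  14          303.138007               Books
3            4   11              1                  11          489.868572               Books
4            5   38             62                   6          253.636233           Groceries

Note: the `Age`, `Annual_Income`, and `Purchase_Frequency` values shown here differ from the raw values in Step 1. They appear to be label-encoded already, which suggests this cell was executed after the encoding cell that follows.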


In [9]:
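from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# These columns are numeric, so LabelEncoder maps each distinct value to its
# rank among the sorted unique values (0, 1, 2, ...), not to a true category.
categorical_features = ['Age', 'Annual_Income', 'Purchase_Frequency']
for feature in categorical_features:
    df[feature] = le.fit_transform(df[feature])

# Target variable: the integer-encoded Category_Preference column.
df['Purchase'] = le.fit_transform(df['Category_Preference'])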
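
Label-encoding a numeric column such as `Age` only replaces each distinct value with its rank, so the column still behaves like a continuous variable. Since the first step of the procedure calls for categorical inputs, one possible alternative is to bin the numeric column into a few groups before encoding. The sketch below uses `pd.cut` on made-up values; the bin edges, the labels, and the `Age_Group` column names are hypothetical choices, not something the original notebook does.

```python
import pandas as pd

# Toy values, invented for illustration; not taken from the dataset
toy = pd.DataFrame({"Age": [18, 25, 34, 46, 52, 60, 69, 23]})

# Hypothetical bin edges and labels; pick ones that suit your own data.
# With the default right=True the bins are (0, 30], (30, 50], (50, 120].
toy["Age_Group"] = pd.cut(
    toy["Age"],
    bins=[0, 30, 50, 120],
    labels=["young", "middle", "senior"],
)

# Integer codes (0, 1, 2) that preserve the order of the groups
toy["Age_Group_Code"] = toy["Age_Group"].cat.codes

print(toy)
```

The resulting `Age_Group_Code` column is a small set of non-negative integers, which is the form of input the Chi-Square scorer expects.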

## Step 5: Applying Chi-Square Test

In [10]:
from sklearn.feature_selection import chi2, SelectKBest

# Feature matrix (encoded columns) and encoded target
X = df[categorical_features]
y = df['Purchase']

# Keep the 2 features with the highest Chi-Square scores
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

# One score per column of X, in column order
feature_scores = selector.scores_
selected_features = X.columns[selector.get_support()]

print("Feature Scores:", feature_scores)
print("Selected Features:", selected_features)

Feature Scores: [12.1389507  37.93712349  3.00329345]
Selected Features: Index(['Age', 'Annual_Income'], dtype='object')


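The test assigns higher scores to features that are more strongly associated with the target variable. Here `Annual_Income` (37.94) and `Age` (12.14) score highest, so `SelectKBest` with `k=2` keeps them and drops `Purchase_Frequency` (3.00). These are the features most related to the target used in this example, `Purchase`, which is the encoded `Category_Preference` column.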
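
Scores alone do not say whether an association is statistically meaningful, so it can help to look at the p-values as well. The sketch below calls scikit-learn's `chi2` function directly, which returns both the scores and the p-values. The arrays `X_toy` and `y_toy` and the names `feature_a` and `feature_b` are made up for illustration and are unrelated to the customer dataset.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Toy data, invented for illustration: 6 samples, 2 non-negative integer features
X_toy = np.array([[0, 1], [0, 1], [0, 0], [2, 1], [2, 0], [2, 1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# chi2 returns one score and one p-value per feature column
scores, p_values = chi2(X_toy, y_toy)

for name, score, p in zip(["feature_a", "feature_b"], scores, p_values):
    print(f"{name}: score={score:.3f}, p-value={p:.3f}")
```

In this toy data `feature_a` varies with the class while `feature_b` does not, so `feature_a` gets the higher score and the smaller p-value. After fitting a `SelectKBest` selector with `score_func=chi2`, the same p-values are available through its `pvalues_` attribute.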