# 1. Naive Bayes



To compute the conditional probability distributions, I paired the target class **`Construction type`** with each of the other features and applied the following formula:

$$
P(\text{Feature} = v \mid \text{Class} = c) = \frac{\text{Count of records with Feature = } v \text{ and Class = } c}{\text{Total number of records with Class = } c}
$$


By taking the count of each feature per class, I was able to get the probabilities of each feature given the class.
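
As a quick sanity check on the formula, the estimates for a fixed class $c$ sum to one, because every record of class $c$ takes exactly one value of the feature:

$$
\sum_{v} P(\text{Feature} = v \mid \text{Class} = c) = \frac{\sum_{v} \text{Count of records with Feature = } v \text{ and Class = } c}{\text{Total number of records with Class = } c} = 1
$$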


For example, to compute the CPD for **`# Bedrooms`**, I first count how many times each bedroom value (e.g., 2, 3, 4, 5) appears within each class (e.g., Apartment, House, Condo). There are 7 Apartments in the training set, and among them, 3 have 3 bedrooms, 2 have 4 bedrooms, 1 has 2 bedrooms, and 1 has 5 bedrooms. The conditional probabilities for Apartments would be:  

$$
P(3 \mid \text{Apartment}) = \frac{3}{7}, \quad
P(4 \mid \text{Apartment}) = \frac{2}{7}, \quad
P(2 \mid \text{Apartment}) = \frac{1}{7}, \quad
P(5 \mid \text{Apartment}) = \frac{1}{7}
$$
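
The Apartment numbers above can be checked with a small sketch. The toy DataFrame is an assumption built only from the seven Apartment records described in the text (it is not the real training file), and `pd.crosstab` with `normalize="index"` is just one possible way to get row-normalized counts, not the function used later in this notebook.

```python
from fractions import Fraction

import pandas as pd

# Hypothetical toy data: only the 7 Apartment records from the example above.
toy = pd.DataFrame({
    "Construction type": ["Apartment"] * 7,
    "# Bedrooms": [3, 3, 3, 4, 4, 2, 5],
})

# Rows are classes, columns are feature values; each row is divided by its total.
cpd = pd.crosstab(toy["Construction type"], toy["# Bedrooms"], normalize="index")
print(cpd)

# The same numbers as exact fractions, matching the hand calculation.
value_counts = toy["# Bedrooms"].value_counts()
exact = {int(v): Fraction(int(n), len(toy)) for v, n in value_counts.items()}
print(exact[3], exact[4], exact[2], exact[5])
```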



**`Local Price`** is a continuous feature, so it must first be discretized into bins (e.g., Low, Mid, High). For instance, with three quantile-based bins, `(3.89–5.06]`, `(5.06–5.90]`, and `(5.90–16.42]`, I count how many prices in each bin occur within each class. As an illustration (the actual counts appear in the output further down), suppose 4 of the 7 Apartments fall in the first bin, 2 in the second, and 1 in the third:

$$
P(\text{Low} \mid \text{Apartment}) = \frac{4}{7}, \quad
P(\text{Mid} \mid \text{Apartment}) = \frac{2}{7}, \quad
P(\text{High} \mid \text{Apartment}) = \frac{1}{7}
$$
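
The binning step can be sketched with `pd.qcut`, which is also what the notebook code uses. The prices below are copied from the `Bin` printouts and the test rows in the notebook output. Reusing the training edges on new data through `retbins=True` and `pd.cut` is an assumption about how one might bin the test set, not something the notebook does.

```python
import numpy as np
import pandas as pd

# The 20 training prices, copied from the "Bin" printouts in the notebook output.
train_prices = pd.Series([
    3.891, 4.5429, 4.5573, 4.9176, 5.0208, 5.05, 5.0597, 5.3003, 5.6039, 5.6039,
    5.8282, 5.898, 5.9592, 6.2712, 6.6969, 7.7841, 8.2464, 9.0384, 14.4598, 16.4202,
])
labels = ["Low", "Medium", "High"]

# retbins=True also returns the quantile edges that qcut computed.
train_bins, edges = pd.qcut(train_prices, q=3, labels=labels, retbins=True)
print(edges)
bin_counts = {label: int(n) for label, n in train_bins.value_counts().items()}
print(bin_counts)

# Reuse the training edges for unseen prices. Widening the outer edges keeps
# values outside the training range from becoming NaN.
edges = edges.copy()
edges[0], edges[-1] = -np.inf, np.inf
test_prices = [6.0931, 8.3607, 8.14, 9.1416, 12.0]  # from the test output
test_bins = pd.cut(test_prices, bins=edges, labels=labels)
print(list(test_bins))
```

The edges `pd.qcut` returns are interpolated quantiles, so they need not coincide with hand-picked thresholds. Under these edges the first test price (6.0931) lands in Medium, whereas the hand-written `bin_price` function in the classification cell labels it high, so it is worth checking that the two agree.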


The same process is applied to every feature in the dataset to construct the full conditional probability table. The following code automates this and builds a final dictionary of the probability distributions for all features, printing each feature's distribution along the way.





In [3]:
import pandas as pd


train_file = "/content/train.csv"
test_file = "/content/test.csv"

train = pd.read_csv(train_file)
test = pd.read_csv(test_file)

def get_feature_types(df):
    discrete_features = []
    continuous_features = []

    for column in df.columns:
        if df[column].dtype == 'object':
            discrete_features.append(column)
        elif df[column].nunique() < 10:
            discrete_features.append(column)
        else:
            continuous_features.append(column)

    return discrete_features, continuous_features

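# Get discrete and continuous features.
# Note: this heuristic also picks up the target ('Construction type', as discrete)
# and the identifier ('House ID', as continuous); neither is a predictive feature.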
discrete_features, continuous_features = get_feature_types(train)

print("Discrete Features:", discrete_features)
print("Continuous Features:", continuous_features)

Discrete Features: ['Bathrooms', '# Garages', '# Rooms', '# Bedrooms', 'Construction type']
Continuous Features: ['House ID', 'Local Price', 'Land Area', 'Living area', 'Age of home']


In [20]:
# Conditional Probability Distributions

# Discrete Features
# Calculate the conditional probability distribution P(feature | class)
def get_discrete_probability(class_column, feature_column):
    # Count records for each (class, feature value) pair; rows are classes
    distribution_counts = train.groupby([class_column, feature_column]).size().unstack(fill_value=0)

    # Divide each row by its class total so that every row sums to 1
    conditional_probabilities = distribution_counts.div(distribution_counts.sum(axis=1), axis=0)

    # View the table of distribution counts
    print("Distribution Counts:")
    print(distribution_counts)

    # View the conditional probability distribution
    print("\nConditional Probability Distribution:")
    print(conditional_probabilities)

get_discrete_probability("Construction type", "# Bedrooms")


def get_continous_probability(class_column, column):

    # Print the sorted values split into equal-count groups, for inspection only.
    # The actual bin assignment below is done by pd.qcut, whose quantile edges
    # may differ slightly from these groups.
    def get_bin(column, num_bins=3):
        column_values_sorted = train[column].sort_values(ascending=True)
        bin_size = len(column_values_sorted) // num_bins
        bins = [column_values_sorted.iloc[i * bin_size:(i + 1) * bin_size].tolist() for i in range(num_bins)]

        # Put any leftover values (length not divisible by num_bins) in the last bin
        if len(column_values_sorted) % num_bins != 0:
            bins[-1].extend(column_values_sorted.iloc[num_bins * bin_size:].tolist())
        for i, values in enumerate(bins, start=1):
            print(f"Bin {i}: {values}")
        # Return the (min, max) range of each bin
        return [(min(values), max(values)) for values in bins]

    num_bins = 3
    bin_ranges = get_bin(column, num_bins=num_bins)
    # Assign each record to a quantile bin with pd.qcut
    bin_labels = ["Low", "Medium", "High"]
    train[f"{column} bin"] = pd.qcut(train[column], q=num_bins, labels=bin_labels)

    # Count records per (class, bin), then normalize each class row to sum to 1
    distribution_counts = train.groupby([class_column, f"{column} bin"], observed=True).size().unstack(fill_value=0)
    conditional_probabilities = distribution_counts.div(distribution_counts.sum(axis=1), axis=0)

    # View the table of distribution counts
    print("Distribution Counts:")
    print(distribution_counts)

    # View the conditional probability distribution
    print("\nConditional Probability Distribution:")
    print(conditional_probabilities)

get_continous_probability("Construction type", "Local Price")


Distribution Counts:
# Bedrooms         2  3  4  5
Construction type            
Apartment          1  3  2  1
Condo              0  5  0  1
House              1  5  1  0

Conditional Probability Distribution:
# Bedrooms                2         3         4         5
Construction type                                        
Apartment          0.142857  0.428571  0.285714  0.142857
Condo              0.000000  0.833333  0.000000  0.166667
House              0.142857  0.714286  0.142857  0.000000
Bin 1: [3.891, 4.5429, 4.5573, 4.9176, 5.0208, 5.05]
Bin 2: [5.0597, 5.3003, 5.6039, 5.6039, 5.8282, 5.898]
Bin 3: [5.9592, 6.2712, 6.6969, 7.7841, 8.2464, 9.0384, 14.4598, 16.4202]
Distribution Counts:
Local Price bin    Low  Medium  High
Construction type                   
Apartment            4       0     3
Condo                2       2     2
House                1       4     2

Conditional Probability Distribution:
Local Price bin         Low    Medium      High
Construction type        

In [21]:
# Conditional probabilities for discrete features
# Probability of each feature given 'Construction type'

for column in discrete_features:
  if column == "Construction type":
    continue  # the target is not a feature of itself
  print(f"Conditional Probability Distribution for {column}:")
  get_discrete_probability("Construction type", column)
  print("___________________________________________________")
  print("\n")


Conditional Probability Distribution for Bathrooms:
Distribution Counts:
Bathrooms          1.0  1.5  2.5
Construction type               
Apartment            5    1    1
Condo                4    1    1
House                6    1    0

Conditional Probability Distribution:
Bathrooms               1.0       1.5       2.5
Construction type                              
Apartment          0.714286  0.142857  0.142857
Condo              0.666667  0.166667  0.166667
House              0.857143  0.142857  0.000000
___________________________________________________


Conditional Probability Distribution for # Garages:
Distribution Counts:
# Garages          0.0  1.0  1.5  2.0
Construction type                    
Apartment            1    3    1    2
Condo                0    4    0    2
House                2    2    1    2

Conditional Probability Distribution:
# Garages               0.0       1.0       1.5       2.0
Construction type                                        
Apartment  

In [22]:
# Conditional probabilities for continuous features
# Probability of each feature given 'Construction type'

for column in continuous_features:
  print(f"Conditional Probability Distribution for {column}:")
  get_continous_probability("Construction type", column)
  print("___________________________________________________")
  print("\n")

Conditional Probability Distribution for House ID:
Bin 1: [1, 2, 3, 4, 5, 6]
Bin 2: [7, 8, 9, 10, 11, 12]
Bin 3: [13, 14, 15, 16, 17, 18, 19, 20]
Distribution Counts:
House ID bin       Low  Medium  High
Construction type                   
Apartment            3       1     3
Condo                3       1     2
House                1       4     2

Conditional Probability Distribution:
House ID bin            Low    Medium      High
Construction type                              
Apartment          0.428571  0.142857  0.428571
Condo              0.500000  0.166667  0.333333
House              0.142857  0.571429  0.285714
___________________________________________________


Conditional Probability Distribution for Local Price:
Bin 1: [3.891, 4.5429, 4.5573, 4.9176, 5.0208, 5.05]
Bin 2: [5.0597, 5.3003, 5.6039, 5.6039, 5.8282, 5.898]
Bin 3: [5.9592, 6.2712, 6.6969, 7.7841, 8.2464, 9.0384, 14.4598, 16.4202]
Distribution Counts:
Local Price bin    Low  Medium  High
Construction type    

In [37]:
# Function to calculate and return discrete probabilities P(feature | class) as a dictionary
def get_discrete_probability_dict(class_column, feature_column):
    # Rows are classes, columns are feature values
    distribution_counts = train.groupby([class_column, feature_column]).size().unstack(fill_value=0)
    # Normalize each class row so that it sums to 1
    conditional_probabilities = distribution_counts.div(distribution_counts.sum(axis=1), axis=0)
    # Nested dict: {class: {feature value: probability}}
    probabilities_dict = conditional_probabilities.to_dict(orient="index")

    print(f"P_{feature_column} = {probabilities_dict}")
    return probabilities_dict

# Function to calculate and return continuous probabilities as a dictionary
def get_continous_probability_dict(class_column, column, num_bins=3):
    # The three labels below assume num_bins == 3
    bin_labels = ["low", "mid", "high"]
    train[f"{column} bin"] = pd.qcut(train[column], q=num_bins, labels=bin_labels)

    distribution_counts = train.groupby([class_column, f"{column} bin"], observed=True).size().unstack(fill_value=0)

    conditional_probabilities = distribution_counts.div(distribution_counts.sum(axis=1), axis=0)

    probabilities_dict = conditional_probabilities.to_dict(orient="index")

    print(f"P_{column} = {probabilities_dict}")
    return probabilities_dict


discrete_probabilities = {}
continuous_probabilities = {}

for column in discrete_features:
    if column == "Construction type":
        continue  # the target is not a feature of itself
    print(f"Processing discrete feature: {column}")
    discrete_probabilities[column] = get_discrete_probability_dict("Construction type", column)


for column in continuous_features:
    print(f"Processing continuous feature: {column}")
    continuous_probabilities[column] = get_continous_probability_dict("Construction type", column)

# Printing the final dictionaries
print("\nDiscrete Probabilities:")
print(discrete_probabilities)

print("\nContinuous Probabilities:")
print(continuous_probabilities)

Processing discrete feature: Bathrooms
P_Bathrooms = {'Apartment': {1.0: 0.7142857142857143, 1.5: 0.14285714285714285, 2.5: 0.14285714285714285}, 'Condo': {1.0: 0.6666666666666666, 1.5: 0.16666666666666666, 2.5: 0.16666666666666666}, 'House': {1.0: 0.8571428571428571, 1.5: 0.14285714285714285, 2.5: 0.0}}
Processing discrete feature: # Garages
P_# Garages = {'Apartment': {0.0: 0.14285714285714285, 1.0: 0.42857142857142855, 1.5: 0.14285714285714285, 2.0: 0.2857142857142857}, 'Condo': {0.0: 0.0, 1.0: 0.6666666666666666, 1.5: 0.0, 2.0: 0.3333333333333333}, 'House': {0.0: 0.2857142857142857, 1.0: 0.2857142857142857, 1.5: 0.14285714285714285, 2.0: 0.2857142857142857}}
Processing discrete feature: # Rooms
P_# Rooms = {'Apartment': {5: 0.14285714285714285, 6: 0.2857142857142857, 7: 0.2857142857142857, 8: 0.14285714285714285, 9: 0.14285714285714285, 10: 0.0}, 'Condo': {5: 0.0, 6: 0.6666666666666666, 7: 0.16666666666666666, 8: 0.0, 9: 0.0, 10: 0.16666666666666666}, 'House': {5: 0.142857142857142

In [39]:
# Apply MAP for Naive Bayes Classification


P_Bathrooms = {'Apartment': {1.0: 0.7142857142857143, 1.5: 0.14285714285714285, 2.5: 0.14285714285714285}, 'Condo': {1.0: 0.6666666666666666, 1.5: 0.16666666666666666, 2.5: 0.16666666666666666}, 'House': {1.0: 0.8571428571428571, 1.5: 0.14285714285714285, 2.5: 0.0}}

P_Garages = {'Apartment': {0.0: 0.14285714285714285, 1.0: 0.42857142857142855, 1.5: 0.14285714285714285, 2.0: 0.2857142857142857}, 'Condo': {0.0: 0.0, 1.0: 0.6666666666666666, 1.5: 0.0, 2.0: 0.3333333333333333}, 'House': {0.0: 0.2857142857142857, 1.0: 0.2857142857142857, 1.5: 0.14285714285714285, 2.0: 0.2857142857142857}}

P_Rooms = {'Apartment': {5: 0.14285714285714285, 6: 0.2857142857142857, 7: 0.2857142857142857, 8: 0.14285714285714285, 9: 0.14285714285714285, 10: 0.0}, 'Condo': {5: 0.0, 6: 0.6666666666666666, 7: 0.16666666666666666, 8: 0.0, 9: 0.0, 10: 0.16666666666666666}, 'House': {5: 0.14285714285714285, 6: 0.5714285714285714, 7: 0.2857142857142857, 8: 0.0, 9: 0.0, 10: 0.0}}

P_Bedrooms = {'Apartment': {2: 0.14285714285714285, 3: 0.42857142857142855, 4: 0.2857142857142857, 5: 0.14285714285714285}, 'Condo': {2: 0.0, 3: 0.8333333333333334, 4: 0.0, 5: 0.16666666666666666}, 'House': {2: 0.14285714285714285, 3: 0.7142857142857143, 4: 0.14285714285714285, 5: 0.0}}

P_Local_Price = {'Apartment': {'low': 0.5714285714285714, 'mid': 0.0, 'high': 0.42857142857142855}, 'Condo': {'low': 0.3333333333333333, 'mid': 0.3333333333333333, 'high': 0.3333333333333333}, 'House': {'low': 0.14285714285714285, 'mid': 0.5714285714285714, 'high': 0.2857142857142857}}

P_Land_Area = {'Apartment': {'low': 0.42857142857142855, 'mid': 0.2857142857142857, 'high': 0.2857142857142857}, 'Condo': {'low': 0.3333333333333333, 'mid': 0.3333333333333333, 'high': 0.3333333333333333}, 'House': {'low': 0.2857142857142857, 'mid': 0.2857142857142857, 'high': 0.42857142857142855}}

P_Living_area = {'Apartment': {'low': 0.42857142857142855, 'mid': 0.2857142857142857, 'high': 0.2857142857142857}, 'Condo': {'low': 0.5, 'mid': 0.3333333333333333, 'high': 0.16666666666666666}, 'House': {'low': 0.14285714285714285, 'mid': 0.42857142857142855, 'high': 0.42857142857142855}}

P_Age_of_home = {'Apartment': {'low': 0.2857142857142857, 'mid': 0.2857142857142857, 'high': 0.42857142857142855}, 'Condo': {'low': 0.3333333333333333, 'mid': 0.3333333333333333, 'high': 0.3333333333333333}, 'House': {'low': 0.8571428571428571, 'mid': 0.0, 'high': 0.14285714285714285}}

priors = {"Apartment": 7 / 20, "House": 7 / 20, "Condo": 6 / 20}

results = []

def bin_price(price):
    if price <= 5.06:
        return 'low'
    elif price <= 6:
        return 'mid'
    else:
        return 'high'

def bin_land_area(area):
    if area <= 4.5:
        return 'low'
    elif area <= 6.5:
        return 'mid'
    else:
        return 'high'

def bin_living_area(area):
    if area <= 1.122:
        return 'low'
    elif area <= 1.491:
        return 'mid'
    else:
        return 'high'

def bin_age_of_home(age):
    if age <= 30:
        return 'low'
    elif age <= 41:
        return 'mid'
    else:
        return 'high'

for _, row in test.iterrows():
    scores = {}
    price_bin = bin_price(row["Local Price"])
    land_area_bin = bin_land_area(row["Land Area"])
    living_area_bin = bin_living_area(row["Living area"])
    age_of_home_bin = bin_age_of_home(row["Age of home"])

    for cls in priors:
        # Unnormalized posterior: prior times the product of per-feature likelihoods.
        # The 1e-2 default only covers values missing from a table; a stored
        # probability of 0.0 still drives the whole score to zero.
        scores[cls] = (
            priors[cls] *
            P_Bathrooms[cls].get(row["Bathrooms"], 1e-2) *
            P_Garages[cls].get(row["# Garages"], 1e-2) *
            P_Rooms[cls].get(row["# Rooms"], 1e-2) *
            P_Bedrooms[cls].get(row["# Bedrooms"], 1e-2) *
            P_Local_Price[cls].get(price_bin, 1e-2) *
            P_Land_Area[cls].get(land_area_bin, 1e-2) *
            P_Living_area[cls].get(living_area_bin, 1e-2) *
            P_Age_of_home[cls].get(age_of_home_bin, 1e-2)
        )
    # MAP decision: pick the class with the largest score
    predicted = max(scores, key=scores.get)
    results.append(
        {
            "Local Price": f"{price_bin}: {row['Local Price']}",
            "Bathrooms": row["Bathrooms"],
            "Land Area": f"{land_area_bin}: {row['Land Area']}",
            "Living area": f"{living_area_bin}: {row['Living area']}",
            "# Garages": row['# Garages'],
            "# Rooms": row["# Rooms"],
            "# Bedrooms": row["# Bedrooms"],
            "Age of home": f"{age_of_home_bin}: {row['Age of home']}",
            "P(House)": (scores["House"]),
            "P(Apartment)": scores["Apartment"],
            "P(Condo)": scores["Condo"],
            "Actual Construction Type": row["Construction type"],
            "Predicted Class": predicted

        }
    )
# Create a DataFrame from the results
results_df = pd.DataFrame(results)
print(results_df)


    Local Price  Bathrooms     Land Area  Living area  # Garages  # Rooms  \
0  high: 6.0931        1.5  high: 6.7265  high: 1.652        1.0        6   
1  high: 8.3607        1.5    high: 9.15  high: 1.777        2.0        8   
2    high: 8.14        1.0     high: 8.0  high: 1.504        2.0        7   
3  high: 9.1416        1.5  high: 7.3262  high: 1.831        1.5        8   
4    high: 12.0        1.5      mid: 5.0     mid: 1.2        2.0        6   

   # Bedrooms Age of home  P(House)  P(Apartment)  P(Condo)  \
0           3    high: 44  0.000044      0.000039  0.000114   
1           4    high: 48  0.000000      0.000009  0.000000   
2           3      low: 3  0.000787      0.000087  0.000057   
3           4     mid: 31  0.000000      0.000003  0.000000   
4           3     low: 30  0.000175      0.000017  0.000114   

  Actual Construction Type Predicted Class  
0                Apartment           Condo  
1                    House       Apartment  
2                    Ho
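
The scores above show that a single zero estimate removes a class entirely: rows 1 and 3 score 0 for both House and Condo. The `.get(..., 1e-2)` default only applies to values missing from a dictionary, not to stored probabilities of 0.0. A common remedy is Laplace (add-one) smoothing. The sketch below is not what the notebook does; it applies smoothing to the `# Bedrooms` counts from the earlier table, and `alpha` and `smoothed_cpd` are names introduced here for illustration.

```python
import math

# Counts of "# Bedrooms" per class, copied from the Distribution Counts table.
bedroom_counts = {
    "Apartment": {2: 1, 3: 3, 4: 2, 5: 1},
    "Condo": {2: 0, 3: 5, 4: 0, 5: 1},
    "House": {2: 1, 3: 5, 4: 1, 5: 0},
}

def smoothed_cpd(counts, alpha=1.0):
    """Laplace smoothing: (count + alpha) / (class total + alpha * number of values)."""
    cpd = {}
    for cls, value_counts in counts.items():
        total = sum(value_counts.values()) + alpha * len(value_counts)
        cpd[cls] = {v: (n + alpha) / total for v, n in value_counts.items()}
    return cpd

smoothed = smoothed_cpd(bedroom_counts)
print(smoothed["Condo"][4])  # 1 / 10 instead of 0

# Priors from the class sizes. Log scores avoid underflow when many factors
# are multiplied, and the argmax is unchanged because log is monotonic.
class_sizes = {cls: sum(vc.values()) for cls, vc in bedroom_counts.items()}
n_records = sum(class_sizes.values())
log_scores = {
    cls: math.log(class_sizes[cls] / n_records) + math.log(smoothed[cls][4])
    for cls in smoothed
}
best_class = max(log_scores, key=log_scores.get)
print(best_class)
```

With smoothing, every class keeps a nonzero score, so the remaining features still influence the decision instead of being cancelled by one unseen value.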

# 2. Decision Tree

In [32]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

file_path = "/content/Asssignment4_Data.xlsx"
train_df = pd.read_excel(file_path, sheet_name="Train")
test_df = pd.read_excel(file_path, sheet_name="Test")

X_train = train_df.drop(columns=["House ID", "Construction type"])
y_train = train_df["Construction type"]

X_test = test_df.drop(columns=["House ID", "Construction type"])
y_test = test_df["Construction type"]

label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)


clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train_encoded)


train_acc_default = accuracy_score(y_train_encoded, clf.predict(X_train))
test_acc_default = accuracy_score(y_test_encoded, clf.predict(X_test))

# Output the decision tree rules
tree_rules = export_text(clf, feature_names=list(X_train.columns))
tree_rules.splitlines()




['|--- Age of home <= 36.00',
 '|   |--- Local Price <= 8.41',
 '|   |   |--- Local Price <= 7.24',
 '|   |   |   |--- # Garages <= 1.75',
 '|   |   |   |   |--- class: 2',
 '|   |   |   |--- # Garages >  1.75',
 '|   |   |   |   |--- Living area <= 1.17',
 '|   |   |   |   |   |--- class: 1',
 '|   |   |   |   |--- Living area >  1.17',
 '|   |   |   |   |   |--- class: 2',
 '|   |   |--- Local Price >  7.24',
 '|   |   |   |--- class: 1',
 '|   |--- Local Price >  8.41',
 '|   |   |--- class: 0',
 '|--- Age of home >  36.00',
 '|   |--- Local Price <= 4.55',
 '|   |   |--- class: 1',
 '|   |--- Local Price >  4.55',
 '|   |   |--- Land Area <= 5.50',
 '|   |   |   |--- Age of home <= 58.00',
 '|   |   |   |   |--- class: 0',
 '|   |   |   |--- Age of home >  58.00',
 '|   |   |   |   |--- class: 2',
 '|   |   |--- Land Area >  5.50',
 '|   |   |   |--- class: 1']
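
The leaves in the rules above show encoded class indices rather than names. `LabelEncoder` assigns indices in sorted order of the labels, so, assuming the only labels are the three construction types seen in the outputs, the mapping can be recovered as in this small sketch:

```python
from sklearn.preprocessing import LabelEncoder

# Assumed label set: the three construction types that appear in the outputs.
encoder = LabelEncoder()
encoder.fit(["House", "Apartment", "Condo", "House"])

# classes_ is sorted, and the position of each label is its encoded index.
mapping = {index: str(label) for index, label in enumerate(encoder.classes_)}
print(mapping)
```

Under that assumption, `class: 0` in the rules reads as Apartment, `class: 1` as Condo, and `class: 2` as House.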

In [33]:
# 1. Accuracy for training and test set
print(f"Training Accuracy: {train_acc_default}")
print(f"Test Accuracy: {test_acc_default}")


Training Accuracy: 1.0
Test Accuracy: 0.4


In [34]:
# 2. What is the effect of restricting the maximum depth of the tree?
# Try different depths and find the best value.

depth_results = {}
for depth in range(1, 11):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train, y_train_encoded)
    acc = accuracy_score(y_test_encoded, clf.predict(X_test))
    depth_results[depth] = acc
    print(f"Depth: {depth}, Accuracy: {acc}")

best_depth = max(depth_results, key=depth_results.get)
print(f"Best Depth: {best_depth}")
print(f"Test Accuracy for Best Depth: {depth_results[best_depth]}")

Depth: 1, Accuracy: 0.4
Depth: 2, Accuracy: 0.8
Depth: 3, Accuracy: 0.4
Depth: 4, Accuracy: 0.4
Depth: 5, Accuracy: 0.4
Depth: 6, Accuracy: 0.4
Depth: 7, Accuracy: 0.4
Depth: 8, Accuracy: 0.4
Depth: 9, Accuracy: 0.4
Depth: 10, Accuracy: 0.4
Best Depth: 2
Test Accuracy for Best Depth: 0.8


Restricting the maximum depth limits how many successive splits the tree can make. This keeps it from learning very specific rules, which may lower its accuracy on the training data but often improves test accuracy by reducing overfitting. For example, a shallow tree might only use Age of home and Local Price, ignoring more detailed splits such as # Garages or Living area. Here, a maximum depth of 2 gave the best test accuracy (0.8), while every other depth from 1 to 10 gave 0.4.
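
Because the best depth above is chosen using test accuracy, the reported 0.8 is likely an optimistic estimate. One alternative, sketched here under the assumption that cross-validation on the training data is acceptable for this task, is to choose the depth with `cross_val_score` and evaluate on the test set only once at the end. The sketch uses scikit-learn's bundled iris data as a stand-in, not the housing data, and the names `cv_results` and `cv_best_depth` are introduced for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): iris, split into train and held-out test sets.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Mean 5-fold cross-validated accuracy on the training set for each depth.
cv_results = {}
for depth in range(1, 11):
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    cv_results[depth] = cross_val_score(model, X_tr, y_tr, cv=5).mean()

# Pick the depth on training data only, then score the test set once.
cv_best_depth = max(cv_results, key=cv_results.get)
final_model = DecisionTreeClassifier(max_depth=cv_best_depth, random_state=42)
final_model.fit(X_tr, y_tr)
test_accuracy = final_model.score(X_te, y_te)
print(cv_best_depth, round(test_accuracy, 3))
```

With only 20 training records, a smaller number of folds or leave-one-out validation may be a better fit than the 5 folds used in this sketch.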

### 3. Why does restricting the depth have such a strong effect on the classifier?



Restricting the depth of a decision tree strongly affects performance because it controls how many times the tree may split, and therefore how closely it can memorize the training data.

If the depth is too large, the tree becomes very complex and may memorize noise. This causes overfitting: high training accuracy but poor generalization to test data.

If the depth is too small, the tree is too simple to capture the patterns in the data. This causes underfitting: it misses important splits and makes poor predictions.

In this case, depth 2 performed best (test accuracy 0.8), while every other depth scored 0.4. Depth 1 is too shallow, and the trees deeper than depth 2 most likely overfit, which is easy with such a small dataset. The unrestricted tree shows the same pattern: it reaches a training accuracy of 1.0 but only 0.4 on the test set. With only five test rows in the results above, each record shifts accuracy by 0.2, so the gap between 0.8 and 0.4 should be read with some caution.
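
Whether a given `max_depth` actually constrains the tree can be checked with `get_depth()`. The sketch below, on the bundled iris data as a stand-in for the housing set, illustrates that a limit above the depth of the unrestricted tree leaves the fitted tree unchanged. That would explain why the larger depths in the loop above all give the same accuracy, since the printed rules of the unrestricted tree are only five levels deep.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)  # stand-in data, not the housing set

full_tree = DecisionTreeClassifier(random_state=42).fit(X, y)
full_depth = full_tree.get_depth()

# A cap above the natural depth changes nothing.
loose_tree = DecisionTreeClassifier(max_depth=full_depth + 3, random_state=42).fit(X, y)
# A cap below it forces a shallower, simpler tree.
capped_tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

print(full_depth, loose_tree.get_depth(), capped_tree.get_depth())
same_rules = export_text(full_tree) == export_text(loose_tree)
print(same_rules)
```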


In [40]:
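# 4. Inference on custom test point
# Note: `clf` is the last tree fitted in the depth loop above (max_depth=10),
# not the depth-2 tree that scored best on the test set.
sample = pd.DataFrame([{
    "Local Price": 9.0384,
    "Bathrooms": 1,
    "Land Area": 7.8,
    "Living area": 1.5,
    "# Garages": 1.5,
    "# Rooms": 7,
    "# Bedrooms": 3,
    "Age of home": 23
}])
# Match the column order used during training
sample = sample[X_train.columns]

# Decode the predicted integer back to its class name
predicted_class = label_encoder.inverse_transform(clf.predict(sample))[0]
print("Predicted Construction Type: ", predicted_class)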

Predicted Construction Type:  Apartment
