### **Linear and Logistic Regression**

In the context of AI and machine learning, both linear and logistic regression are essential techniques. They are used to train predictive models from historical data, which can then make predictions or classifications on new data.

Logistic regression models the relationship between one dependent variable and one or more independent variables. The main difference from linear regression lies in the dependent variable: linear regression predicts a continuous value, while logistic regression predicts a binary outcome (yes or no). It is therefore typically used for classification, that is, deciding whether an observation belongs to a category or not.

In the case of predicting house prices, we can consider different independent variables:

- Number of rooms
- Size (floor area)
- Number of floors

The dependent variable is then whether the house is expensive or not, which makes it binary.

In this case, logistic regression predicts whether the house price is high or not. The model is fitted on a set of known houses (more data generally gives a better fit) to learn how the independent variables relate to the outcome, and it then answers the same question for new houses.
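
As a rough illustration of how the model turns the independent variables into a yes/no answer, the sketch below applies the logistic (sigmoid) function to a weighted sum of size and rooms. The intercept, the weights, and the example house are hypothetical values chosen only to show the mechanics; they are not fitted to any data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters, chosen only for illustration (not fitted values)
intercept = -4.0
weight_size = 0.002
weight_rooms = 0.5

# One hypothetical house
example_size = 1500
example_rooms = 3

# Linear part of the model: this is the log-odds of the house being expensive
log_odds = intercept + weight_size * example_size + weight_rooms * example_rooms

# The sigmoid converts the log-odds into a probability
probability = sigmoid(log_odds)

# A probability of at least 0.5 is read as class 1 (expensive)
predicted_label = int(probability >= 0.5)

print(f"log-odds: {log_odds:.2f}, probability: {probability:.3f}, label: {predicted_label}")
```

Fitting a logistic regression amounts to finding the intercept and weights that make these probabilities agree as closely as possible with the known labels.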

The following exercise shows how dependent and independent variables work, how the data is split into training and test sets, and how to read the model's results and the accuracy of its predictions.



In [None]:
!pip3 install numpy pandas matplotlib scikit-learn

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

print("All libraries imported successfully.")

In [None]:
# Generate synthetic data
np.random.seed(0)
n = 100

# Size: n random integers from 500 to 3499 square feet (np.random.randint excludes the upper bound)
size = np.random.randint(500, 3500, n)

# Rooms: n random integers from 1 to 4 (the upper bound, 5, is excluded)
rooms = np.random.randint(1, 5, n)

# Price: 50,000 base, plus extras for size and rooms
# size * 50: each additional square foot adds 50 currency units to the house's value
# rooms * 10000: each additional room adds 10,000 currency units
# np.random.randn(n) * 10000: random noise (standard deviation 10,000) that simulates market variation
price = 50000 + (size * 50) + (rooms * 10000) + (np.random.randn(n) * 10000)

# Binary classification: 1 if price is greater than 200000, 0 otherwise
is_expensive = (price > 200000).astype(int)

# Create DataFrame
data = pd.DataFrame({
    'Size (square feet)': size,
    'Rooms': rooms,
    'Price': price,
    'Is expensive': is_expensive
})

# Save to CSV
data.to_csv('house_classification.csv', index=False)
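
To see how the price formula and the 200,000 threshold interact, the sketch below evaluates only the deterministic part of the formula (without the random noise term) for two hypothetical houses. The helper name `base_price` and the example sizes are assumptions made for illustration.

```python
def base_price(size, rooms):
    """Deterministic part of the synthetic price: base plus size and room premiums."""
    return 50000 + size * 50 + rooms * 10000

# Two hypothetical houses (values chosen for illustration)
small_house = base_price(size=2000, rooms=3)   # 50000 + 100000 + 30000
large_house = base_price(size=3000, rooms=2)   # 50000 + 150000 + 20000

for name, price in [("small_house", small_house), ("large_house", large_house)]:
    label = int(price > 200000)  # same threshold used for 'Is expensive'
    print(name, price, label)
```

In the generated data the noise term is added on top of this value, so houses whose deterministic price is close to 200,000 can land on either side of the threshold.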

In [None]:
# Load data
data = pd.read_csv('house_classification.csv')

# Show the first rows
print(data.head())

# Descriptive statistics
# count: non-null values
# mean: average (sum of the values of each column divided by number of rows)
# std: standard deviation
# min: minimum value per column
# 25%: first quartile (25% of the values in each column are at or below it)
# 50%: median (50% of the values are at or below it)
# 75%: third quartile (75% of the values are at or below it)
# max: maximum value per column
print(data.describe())

In [None]:
# Scatter plot: size vs. whether the house is expensive
plt.scatter(data['Size (square feet)'], data['Is expensive'])
plt.xlabel('Size (square feet)')
plt.ylabel('Is expensive')
plt.title('Size vs. Is Expensive')
plt.show()

# Scatter plot: number of rooms vs. whether the house is expensive
plt.scatter(data['Rooms'], data['Is expensive'])
plt.xlabel('Rooms')
plt.ylabel('Is expensive')
plt.title('Rooms vs. Is Expensive')
plt.show()

# The graphs above show which houses are expensive (1) or not (0) as a function of each independent variable (size and number of rooms, respectively)

In [None]:
# train_test_split: used to split a dataset into two subsets, one for training and one for testing

# independent variables
X = data[['Size (square feet)', 'Rooms']]

# Dependent (target) variable; for binary logistic regression it takes the values 0 or 1
y = data['Is expensive']

# Split into training and test sets
# X: the variables used for prediction
# y: the variable we want to predict
# test_size=0.2: 20% of the data goes to the test set and 80% to the training set
# random_state=0: fixes the shuffle so the split is reproducible
# The training set is used to fit the model, that is, to learn the relationship between X and y (more training data generally helps)
# The test set is held out to compare predictions against the true values and measure how accurate they are
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
# Create the model
model = LogisticRegression()

# Train the model
model.fit(X_train, y_train)

# Model coefficients
# Intercept: the log-odds of a house being expensive when size and rooms are both 0
# (it becomes a probability only after applying the logistic function).
# Coefficients: each one is the change in the log-odds for a one-unit increase in that variable,
# holding the other fixed; exp(coefficient) is the factor by which the odds are multiplied.
# The values quoted here are from the original run; yours may differ slightly depending on library versions.
# Coefficient for Size (0.0117012): each additional square foot multiplies the odds of the house
# being classified as expensive by exp(0.0117012) ≈ 1.0118, an increase of about 1.18% in the odds.
# Coefficient for Rooms (1.30764309): each additional room multiplies the odds by
# exp(1.30764309) ≈ 3.697, an increase of about 269.7% in the odds.
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)
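
The coefficients are easier to read after exponentiating them, because `exp(coefficient)` is the factor by which the odds (not the probability) of the "expensive" class are multiplied per unit increase. The sketch below uses the two coefficient values quoted in the comments above; the variable names are illustrative.

```python
import math

# Coefficient values quoted in the comments above (from the original run)
coef_size = 0.0117012
coef_rooms = 1.30764309

# exp(coefficient) is the odds ratio: the factor applied to the odds of
# "expensive" when the variable increases by one unit, the other held fixed
odds_ratio_size = math.exp(coef_size)
odds_ratio_rooms = math.exp(coef_rooms)

print(f"Odds ratio per additional square foot: {odds_ratio_size:.4f}")
print(f"Odds ratio per additional room: {odds_ratio_rooms:.3f}")

# Percentage change in the odds for one extra unit
print(f"Change in odds per square foot: {(odds_ratio_size - 1) * 100:.2f}%")
print(f"Change in odds per room: {(odds_ratio_rooms - 1) * 100:.1f}%")
```

With the fitted model from this notebook, the same values should be obtainable directly as `np.exp(model.coef_)`.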

In [None]:
# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
# Confusion matrix: compares the model's predictions with the true (real) values of the test data.
# Rows are the true classes and columns are the predicted classes:
# [[TN FP]
#  [FN TP]]
# Result from the original run:
# [[12  0]
#  [ 2  6]]
# True Positives (TP): 6 (number of times the model correctly predicted that a house is expensive).
# True Negatives (TN): 12 (number of times the model correctly predicted that a house is not expensive).
# False Positives (FP): 0 (number of times the model incorrectly predicted that a house is expensive when it is not).
# False Negatives (FN): 2 (number of times the model incorrectly predicted that a house is not expensive when it actually is).

# Classification Report:
#    class   precision  recall   f1-score   support
#      0       0.86      1.00      0.92        12
#      1       1.00      0.75      0.86         8

# Class 0 (not expensive):
# Precision: 0.86 (86% of the predictions that a house is not expensive were correct).
# Recall: 1.00 (the model correctly identified 100% of the houses that are not expensive).
# F1-score: 0.92 (the harmonic mean of precision and recall).

# Class 1 (expensive):
# Precision: 1.00 (100% of the predictions that a house is expensive were correct).
# Recall: 0.75 (the model correctly identified 75% of the houses that are expensive).
# F1-score: 0.86 (the harmonic mean of precision and recall).

conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)

print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", class_report)
print("Accuracy:", accuracy)
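
The figures in the classification report can be reproduced by hand from the four confusion-matrix counts. The sketch below does this for class 1 (expensive) using the counts quoted in the comments above; it is only an arithmetic check, not a replacement for `classification_report`.

```python
# Counts quoted in the comments above (from the original run)
tn, fp, fn, tp = 12, 0, 2, 6

# Precision: of the houses predicted as expensive, how many really are
precision = tp / (tp + fp)

# Recall: of the houses that really are expensive, how many were found
recall = tp / (tp + fn)

# F1-score: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Accuracy: share of all predictions that were correct
accuracy_by_hand = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy_by_hand:.2f}")
```

The F1-score of 0.86 sits below the arithmetic mean of precision and recall (0.875) because the harmonic mean is pulled toward the lower of the two values.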

In [None]:
# Visualize the confusion matrix
# The plot below draws the confusion matrix as a color-coded grid, so the cells where
# the predictions concentrate stand out at a glance. The layout is:
# [TN  FP]
# [FN  TP]
# Darker cells hold higher counts. With the results above (TN = 12, FP = 0, FN = 2, TP = 6),
# most of the weight falls on the diagonal, which is where correct predictions land.
plt.matshow(conf_matrix, cmap=plt.cm.Blues)
plt.title('Confusion Matrix')
plt.colorbar()
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()