### **Linear and Logistic Regression**

In the context of AI and machine learning, both linear and logistic regression are essential techniques. They are used to train predictive models from historical data, which can then make predictions or classifications on new data.

### **Linear Regression**

Linear regression models the relationship between one dependent variable and one or more independent variables. It is most commonly used to predict values or estimate trends.

In the case of predicting house prices, we can consider different independent variables:

- Number of rooms
- Square meters
- Number of floors

The dependent variable is the price. In practice the price also depends on factors not listed here, such as current market conditions or location.

Linear regression suits this case. Given enough example houses (generally, the more the better), it fits a linear relationship between these variables and the price, and then uses that relationship to estimate the price of houses it has not seen.
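
As a sketch of what that linear relationship looks like, using only size and rooms as independent variables, the model can be written as below. The symbols $\beta_0$, $\beta_1$, $\beta_2$ and $\varepsilon$ are notation introduced here for illustration:

$$
\text{price} = \beta_0 + \beta_1 \cdot \text{size} + \beta_2 \cdot \text{rooms} + \varepsilon
$$

Here $\beta_0$ is the intercept, $\beta_1$ and $\beta_2$ are the coefficients the model learns from the data, and $\varepsilon$ is the error term that collects everything the chosen variables do not explain.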

The following exercise shows how the dependent and independent variables are defined, how the data is split into training and test sets, and how to inspect the fitted model and the accuracy of its predictions.

In [None]:
!pip3 install numpy pandas matplotlib scikit-learn

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

print("All libraries imported successfully.")

In [None]:
# Generating data
np.random.seed(0)
n = 100

# Size: n = 100 random integers from 500 to 3499 m2 (randint's upper bound is exclusive)
size = np.random.randint(500, 3500, n)

# Rooms: n random integers from 1 to 5 (the upper bound, 6, is exclusive)
rooms = np.random.randint(1, 6, n)

# Price = base price + size contribution + rooms contribution + noise
# Base price: 50,000 currency units
# Each m2 adds 50 units; each room adds 10,000 units
# Noise: n Gaussian values with standard deviation 10,000, simulating market variation
price = 50000 + (size * 50) + (rooms * 10000) + (np.random.randn(n) * 10000)

# Create columns (headers) with their values
data = pd.DataFrame({
    'Size (square meters)': size,
    'Rooms': rooms,
    'Price': price
})

# Round all values to 0 decimals
data = data.round(0)

# Save to CSV
data.to_csv('house_prices.csv', index=False)


In [None]:
# Load data
data = pd.read_csv('house_prices.csv')

# Display the first rows
print(data.head())

# Descriptive statistics
# count: number of non-null values
# mean: average (sum of the values in each column divided by the number of rows)
# std: standard deviation
# min: minimum value per column
# 25%: first quartile (25% of the values fall at or below it)
# 50%: median (50% of the values fall at or below it)
# 75%: third quartile (75% of the values fall at or below it)
# max: maximum value per column
print(data.describe())

In [None]:
# Relationship between size and price
plt.scatter(data['Size (square meters)'], data['Price'])
plt.xlabel('Size (square meters)')
plt.ylabel('Price')
plt.title('Relationship between Size and Price')
plt.show()

# Relationship between rooms and price
plt.scatter(data['Rooms'], data['Price'])
plt.xlabel('Rooms')
plt.ylabel('Price')
plt.title('Relationship between Rooms and Price')
plt.show()

# The graphs above show how house price relates to size and to the number of rooms, respectively
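
The visual impression from scatter plots like these can be backed with a number. The sketch below computes the Pearson correlation of each feature with the price. It regenerates synthetic data with the same price formula as the exercise so that it runs on its own; the `rng` generator, the 1 to 5 room range and the `df` name are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Assumption: synthetic data following the same formula as the exercise,
# regenerated here so the sketch is self-contained.
rng = np.random.default_rng(0)
n = 100
size = rng.integers(500, 3500, n)   # upper bound is exclusive
rooms = rng.integers(1, 6, n)       # 1 to 5 rooms
price = 50000 + size * 50 + rooms * 10000 + rng.normal(0, 10000, n)

df = pd.DataFrame({"Size (square meters)": size, "Rooms": rooms, "Price": price})

# Pearson correlation of each feature with the price (values range from -1 to 1)
correlations = df.corr()["Price"].drop("Price")
print(correlations)
```

A value close to 1 suggests a strong positive linear relationship. With this formula, size should correlate with price more strongly than the number of rooms does.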

In [None]:
# train_test_split: used to split a dataset into two subsets, one for training and one for testing

# independent variables
X = data[['Size (square meters)', 'Rooms']]
# dependent variable or target, what we want to predict
y = data['Price']

# Split into training and testing sets
# X: variables used for prediction
# y: variable we want to predict
# test_size=0.2: 20% of the data goes to the test set and 80% to the training set
# random_state=0: fixes the shuffle so the split is reproducible
# The training set is used to fit the model, i.e. to learn the relationship between the features and the price
# The test set is held out to compare predictions with actual values and measure how accurate they are
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
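
To make the effect of `test_size` and `random_state` concrete, here is a minimal sketch on toy arrays. The toy data is an assumption made for illustration and is not the house dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 100 rows, 2 features.
# Row i of X is [2*i, 2*i + 1] and its target is i.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 80% of the rows go to training, 20% to testing
print(X_train.shape, X_test.shape)  # (80, 2) (20, 2)

# Calling the function again with the same random_state gives the same split
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.2, random_state=0)
print(np.array_equal(X_test, X_test_again))  # True
```

The rows are shuffled before splitting, but each feature row stays paired with its own target value.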

In [None]:
# Create the model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

# Model parameters
# Intercept: the predicted price when all independent variables are 0
# Coefficients: how much the predicted price changes for a one-unit increase in each independent variable
#   (size, rooms, in the column order of X) while the other is held constant;
#   a positive coefficient raises the price, a negative one lowers it
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)
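
One way to read the intercept and coefficients is to rebuild a prediction by hand. The sketch below fits a model on synthetic data generated with the same price formula as the exercise, then checks that `intercept_ + features @ coef_` matches `predict`. The `rng` generator, the 1 to 5 room range and the example house (1200 m2, 3 rooms) are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumption: synthetic data following the same formula as the exercise
rng = np.random.default_rng(0)
n = 100
size = rng.integers(500, 3500, n)
rooms = rng.integers(1, 6, n)
price = 50000 + size * 50 + rooms * 10000 + rng.normal(0, 10000, n)

X = np.column_stack([size, rooms])
model = LinearRegression().fit(X, price)

# Hypothetical new house: 1200 m2, 3 rooms
new_house = np.array([[1200, 3]])

# A prediction is the intercept plus each feature times its coefficient
manual = model.intercept_ + new_house @ model.coef_
from_model = model.predict(new_house)

print("Manual prediction:", manual)
print("model.predict:    ", from_model)
```

Because the data was generated with 50 units per m2 and 10,000 units per room, the fitted coefficients should land near those values, with some deviation caused by the noise term.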

In [None]:
# Predict on the test set
# Use the model trained in the previous step to predict prices for the held-out test features (X_test)
# The predict method returns one prediction per row; the result is stored in y_pred
y_pred = model.predict(X_test)

# Evaluate the model
# The MSE measures the mean of the squared errors, that is, the average squared difference between the actual and predicted values.
# A lower MSE indicates that the model is making predictions closer to the actual values.
mse = mean_squared_error(y_test, y_pred)

# R^2 tells us what proportion of the variability in the variable we are trying to predict (house price)
# is explained by the variables used to make the prediction (number of rooms and size).

# An R^2 of 1 means the model predicts the observed values perfectly; the closer to 1, the better the fit
# An R^2 of 0 means the model explains none of the variability (no better than always predicting the mean);
# on test data it can even be negative when the model does worse than that
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error (MSE):", mse)
print("Coefficient of Determination (R^2):", r2)
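
Both metrics can be computed by hand, which helps show what they measure. The sketch below uses four hypothetical actual and predicted prices, chosen only for illustration, and compares the manual results with scikit-learn's functions. It also computes the RMSE (square root of the MSE), an extra shown here because it is expressed in the same units as the price.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted prices, chosen only for illustration
y_true = np.array([200000.0, 150000.0, 320000.0, 275000.0])
y_hat = np.array([210000.0, 140000.0, 300000.0, 280000.0])

# MSE: mean of the squared differences between actual and predicted values
mse_manual = np.mean((y_true - y_hat) ** 2)

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# RMSE: square root of the MSE, in the same units as the price
rmse = np.sqrt(mse_manual)

print("MSE  manual vs sklearn:", mse_manual, mean_squared_error(y_true, y_hat))
print("R^2  manual vs sklearn:", r2_manual, r2_score(y_true, y_hat))
print("RMSE:", rmse)
```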

In [None]:
plt.scatter(y_test, y_pred, alpha=0.5)
plt.xlabel('Actual Values')
plt.ylabel('Predictions')
plt.title('Actual Values vs Predictions')

# Overfitting: occurs when a model is too complex and fits the training data too closely, noise included,
#              so it generalizes poorly. It can happen when the model has too many variables (parameters) or features.
#              Symptoms of overfitting include:
#                - Low error on the training data.
#                - High error on the test data.

# Underfitting: occurs when a model is too simple to capture the underlying structure in the data,
#               so it fits neither the training data nor the test data well.
#               It can happen when the model has too few parameters or misses the complexity of the problem.
#               Symptoms of underfitting include:
#               - High error on the training data.
#               - High error on the test data.

# Add a perfect-prediction reference line (y = x): the closer the points lie to it, the better the predictions
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red')

# Show the plot
plt.show()
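
The symptoms of overfitting and underfitting listed in the comments can be checked by scoring the model on both subsets. The sketch below is one possible check, not a definitive diagnostic. It regenerates synthetic data with the same price formula as the exercise (the `rng` generator and the 1 to 5 room range are assumptions) and compares R^2 on the training and test sets.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Assumption: synthetic data following the same formula as the exercise
rng = np.random.default_rng(0)
n = 100
size = rng.integers(500, 3500, n)
rooms = rng.integers(1, 6, n)
price = 50000 + size * 50 + rooms * 10000 + rng.normal(0, 10000, n)

X = np.column_stack([size, rooms])
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# For regressors, score() returns R^2
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)

print("Train R^2:", train_r2)
print("Test R^2: ", test_r2)

# Reading the two numbers:
# - high train score, much lower test score -> possible overfitting
# - low score on both                       -> possible underfitting
# - similar, high scores on both            -> the model generalizes reasonably
```

Since this data is linear by construction, a linear model should score well on both subsets. A large gap between the two scores would be the signal to investigate.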