<h1 align="center">Introduction to Machine Learning - 25737-2</h1>
<h4 align="center">Dr. R. Amiri</h4>
<h4 align="center">Sharif University of Technology, Spring 2024</h4>


**<font color='red'>Plagiarism is strongly prohibited!</font>**


**Student Name**: Parsa Norouzinezhad

**Student ID**: 400102182





# Logistic Regression

**Task:** Implement your own Logistic Regression model, and test it on the given dataset of Logistic_question.csv!

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("Logistic_question.csv")
X = data.drop('Target', axis=1)
y = data['Target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

class LogisticRegression:
    def __init__(self, learning_rate=0.01, num_iterations=1000):
        self.learning_rate = learning_rate
        self.num_iterations = num_iterations

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        self.weights = np.zeros(X.shape[1])
        self.bias = 0

        for _ in range(self.num_iterations):
            linear_model = np.dot(X, self.weights) + self.bias
            y_predicted = self.sigmoid(linear_model)

            dw = (1 / len(X)) * np.dot(X.T, (y_predicted - y))
            db = (1 / len(X)) * np.sum(y_predicted - y)

            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

    def predict(self, X):
        linear_model = np.dot(X, self.weights) + self.bias
        y_predicted = self.sigmoid(linear_model)
        return np.round(y_predicted)




**Task:** Test your model on the given dataset. You must split your data into train and test, with a 0.2 split, then normalize your data using X_train data. Finally, report 4 different evaluation metrics of the model on the test set. (You might want to first make the Target column binary!)

In [3]:
model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = np.mean(y_pred == y_test)
print("Accuracy:", accuracy)

Accuracy: 0.0


**Question:** What is each of the evaluation metrics you used? For each one, mention situations in which it conveys more information about model performance in specific tasks.

**Your answer:**
Accuracy measures the proportion of correct predictions out of all predictions made by the model; it is a good metric for balanced data. Precision measures the proportion of true positives among all positive predictions, which is especially helpful when the cost of false positives is high. Recall (sensitivity) measures the proportion of true positives among all actual positive instances in the dataset; it is important when the cost of false negatives is high. The F1 score is the harmonic mean of precision and recall, so it balances the two and is low whenever either of them is low.
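
The four definitions above can be sketched directly from confusion-matrix counts. This is a minimal illustration on made-up labels, not the assignment data; the helper name `binary_metrics` is hypothetical. It shows why accuracy alone can mislead on imbalanced data: a classifier that always predicts the negative class scores high accuracy but zero recall.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up imbalanced example: 95 negatives, 5 positives,
# and a classifier that predicts "negative" for everything.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here accuracy is high only because negatives dominate, while precision, recall and F1 are all zero.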


**Task:** Now test the built-in function of Python for Logistic Regression, and report all the same metrics used before.

In [10]:
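# Metric functions used below
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_train_binary = np.where(y_train > 0, 1, 0)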
y_test_binary = np.where(y_test > 0, 1, 0)

# Train the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train_binary)

# Make predictions
y_pred = model.predict(X_test)

# Evaluation Metrics
accuracy = accuracy_score(y_test_binary, y_pred)
precision = precision_score(y_test_binary, y_pred)
recall = recall_score(y_test_binary, y_pred)
f1 = f1_score(y_test_binary, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 Score: 1.0
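
Perfect scores on every metric can also arise when the binarization puts every sample into the same class. The sketch below uses a synthetic, strictly positive stand-in for `Target` (an assumption; the real column is not shown here) to compare the `> 0` threshold with a median threshold. The variable names are illustrative only.

```python
import numpy as np

# Hypothetical stand-in for the Target column: strictly positive values.
rng = np.random.default_rng(0)
target = rng.uniform(0.3, 1.0, size=400)

# Threshold at 0: every positive value maps to class 1.
labels_gt_zero = np.where(target > 0, 1, 0)

# Threshold at the median: roughly half the samples land in each class.
labels_median = np.where(target > np.median(target), 1, 0)

print("classes with '> 0' threshold:", np.unique(labels_gt_zero))
print("class counts with median threshold:", np.bincount(labels_median))
```

If the real `Target` behaves like this stand-in, a median (or other domain-motivated) threshold would give a two-class problem on which the metrics are informative.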


**Question:** Compare your function with the built-in function in terms of performance and parameters. Briefly explain what the parameters of the built-in function are and how they affect the model's performance.

**Your answer:** 
Compared with the custom logistic regression function, the built-in function is optimized and well tested, especially for handling large datasets. The custom function is lightweight and suitable for learning purposes, while the built-in function offers more advanced parameters that significantly affect the model's performance. These parameters include `penalty`, which controls the type of regularization (L1 or L2); `C`, the inverse of the regularization strength, where smaller values mean stronger regularization and help prevent overfitting; and `max_iter`, the maximum number of iterations for optimization, which influences convergence and training time. The custom function, on the other hand, exposes only the learning rate and the number of iterations, which affect the speed and convergence of the optimization algorithm (gradient descent). Overall, the built-in function's comprehensive parameterization provides more control and flexibility, which is crucial for achieving good model performance across diverse datasets and problem domains.
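
The effect of `C` can be illustrated with a small sketch, assuming the built-in function is scikit-learn's `LogisticRegression` (imported under the alias `SkLogisticRegression` so it does not shadow the custom class). The data is synthetic and the values of `C` are arbitrary; the default L2 penalty is used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression as SkLogisticRegression

# Synthetic binary problem (illustrative only, not the assignment data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
y = (X @ true_w + rng.normal(scale=0.5, size=200) > 0).astype(int)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    # Smaller C means stronger (default L2) regularization.
    clf = SkLogisticRegression(C=C, max_iter=1000)
    clf.fit(X, y)
    coef_norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C}: ||w|| = {coef_norms[C]:.3f}, "
          f"train accuracy = {clf.score(X, y):.3f}")
```

Stronger regularization (small `C`) shrinks the weight vector, which is the mechanism by which it limits overfitting.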

# Multinomial Logistic Regression

**Task:** Implement your own Multinomial Logistic Regression model. Your model must be able to handle any number of labels!

In [16]:


class MultinomialLogisticRegression:
    def __init__(self, learning_rate=0.01, num_iterations=1000):
        self.learning_rate = learning_rate
        self.num_iterations = num_iterations
        self.weights = None
        self.bias = None
        self.num_classes = None

    def softmax(self, z):
        exp_scores = np.exp(z - np.max(z, axis=1, keepdims=True))
        return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.num_classes = len(np.unique(y))
        self.weights = np.zeros((n_features, self.num_classes))
        self.bias = np.zeros(self.num_classes)
        y = y.astype(int)
        one_hot_y = np.eye(self.num_classes)[y]
        for _ in range(self.num_iterations):
            linear_model = np.dot(X, self.weights) + self.bias
            y_predicted = self.softmax(linear_model)
            dw = (1 / n_samples) * np.dot(X.T, (y_predicted - one_hot_y))
            db = (1 / n_samples) * np.sum(y_predicted - one_hot_y, axis=0)
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

    def predict(self, X):
        linear_model = np.dot(X, self.weights) + self.bias
        y_predicted = self.softmax(linear_model)
        return np.argmax(y_predicted, axis=1)


scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = MultinomialLogisticRegression(learning_rate=0.1, num_iterations=1000)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
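
The update rule in `fit` corresponds to gradient descent on the average cross-entropy loss. As a short derivation sketch, with notation introduced here for illustration ($X$ the $n \times d$ data matrix, $Y$ the one-hot label matrix, $P$ the softmax output, $\eta$ the learning rate):

```latex
P = \operatorname{softmax}(XW + \mathbf{1}b^{\top}), \qquad
\mathcal{L}(W, b) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} Y_{ik} \log P_{ik}

\frac{\partial \mathcal{L}}{\partial W} = \frac{1}{n} X^{\top} (P - Y), \qquad
\frac{\partial \mathcal{L}}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} (P_{i} - Y_{i})

W \leftarrow W - \eta \, \frac{\partial \mathcal{L}}{\partial W}, \qquad
b \leftarrow b - \eta \, \frac{\partial \mathcal{L}}{\partial b}
```

Here $P_i$ and $Y_i$ denote the $i$-th rows of $P$ and $Y$; the two gradients match `dw` and `db` in the code.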


**Task:** Test your model on the given dataset. Do the same as the previous part, but here you might want to first make the Target column quantized into $i$ levels. Change $i$ from 2 to 10.
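
Quantizing a continuous column into $i$ levels of roughly equal size can be sketched with quantile edges. The helper `quantile_bins` below is hypothetical and uses only NumPy; it is meant to mirror the idea behind `pd.qcut(..., labels=False)`, under the assumption of right-inclusive bins.

```python
import numpy as np

def quantile_bins(values, n_levels):
    """Assign each value an integer level in 0..n_levels-1 by quantiles."""
    # Interior quantile edges, e.g. the 1/3 and 2/3 quantiles for 3 levels.
    probs = np.linspace(0, 1, n_levels + 1)[1:-1]
    edges = np.quantile(values, probs)
    # right=True puts a value equal to an edge into the lower bin.
    return np.digitize(values, edges, right=True)

values = np.arange(1, 13, dtype=float)  # toy data: 1.0, 2.0, ..., 12.0
levels = quantile_bins(values, 3)
print(levels)
print(np.bincount(levels))  # number of samples per level
```

With evenly spread toy values the three levels are equally populated; on real data with ties the counts can differ slightly.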

In [17]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

data = pd.read_csv("Logistic_question.csv")

for i in range(2, 11):
    quantized_target = pd.qcut(data['Target'], q=i, labels=False)
    print(f"Quantization with {i} levels:")
    print(quantized_target.value_counts())
    print()
    X = data.iloc[:, :-1].values
    y = quantized_target.values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    model = MultinomialLogisticRegression(learning_rate=0.1, num_iterations=1000)
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')

    print("Accuracy:", accuracy)
    print("Precision:", precision)
    print("Recall:", recall)
    print("F1 Score:", f1)
    print("-" * 50)


Quantization with 2 levels:
Target
0    209
1    191
Name: count, dtype: int64

Accuracy: 0.9375
Precision: 0.937891604010025
Recall: 0.9375
F1 Score: 0.9375490196078431
--------------------------------------------------
Quantization with 3 levels:
Target
1    136
0    136
2    128
Name: count, dtype: int64

Accuracy: 0.85
Precision: 0.8466666666666667
Recall: 0.85
F1 Score: 0.8476997578692493
--------------------------------------------------
Quantization with 4 levels:
Target
0    113
3     98
1     96
2     93
Name: count, dtype: int64

Accuracy: 0.7125
Precision: 0.7374851778656126
Recall: 0.7125
F1 Score: 0.7209412537537538
--------------------------------------------------
Quantization with 5 levels:
Target
3    85
1    84
0    81
4    75
2    75
Name: count, dtype: int64

Accuracy: 0.6625
Precision: 0.700703046953047
Recall: 0.6625
F1 Score: 0.6679573777329196
--------------------------------------------------
Quantization with 6 levels:
Target
2    73
0    69
1    67
5    66
3 

**Question:** Report for which $i$ your model performs best. Describe and analyze the results! You could use visualizations or any other method!

**Your answer:**

# Going a little further!

First, we download the Adult income dataset from Kaggle! To do this, create an account on the Kaggle website and create an API token. A file named kaggle.json will be downloaded to your device. Then use the following code:

In [23]:
from google.colab import files
files.upload()  # Use this to select the kaggle.json file from your computer
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

ModuleNotFoundError: No module named 'google'

Then use this code to automatically download the dataset into Colab.

In [21]:
!kaggle datasets download -d wenruliu/adult-income-dataset
!unzip /content/adult-income-dataset.zip

'kaggle' is not recognized as an internal or external command,
operable program or batch file.
'unzip' is not recognized as an internal or external command,
operable program or batch file.


**Task:** Determine the number of null entries!

In [None]:
# Your code goes here!
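
A minimal sketch of counting null entries with pandas follows. It runs on a tiny made-up frame, and it assumes that missing values may be encoded as the string `"?"` as well as real nulls; that encoding is an assumption about the dataset, not something stated above. Filling with the column mode is shown as just one possible handling strategy.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the downloaded dataset.
df = pd.DataFrame({
    "age": [25, 38, 28, 44],
    "workclass": ["Private", "?", "Local-gov", "Private"],
    "occupation": ["?", "Farming-fishing", "Protective-serv", None],
})

# Treat the "?" placeholder (assumed encoding) as a real missing value.
df = df.replace("?", np.nan)

# Number of null entries per column, and in total.
null_counts = df.isnull().sum()
print(null_counts)
print("total nulls:", int(null_counts.sum()))

# One possible strategy: fill each column with its most frequent value.
df_filled = df.fillna(df.mode().iloc[0])
print(df_filled)
```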


**Question:** Many widely used datasets contain a lot of null entries. Propose 5 methods by which one could deal with this problem. Briefly explain how you would decide which one to use in this problem.

**Your answer:**

**Task:** Handle null entries using your best method.

In [None]:
# Your code goes here!


**Task:** Convert categorical features to numerical values. Split the dataset with an 80-20 ratio. Normalize all the data using X_train. Use the built-in Logistic Regression function and GridSearchCV to train your model, and report the parameters, train accuracy, and test accuracy of the best model.

In [None]:
# Your code goes here!
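
One way this pipeline could look is sketched below on synthetic numeric data, so the categorical-encoding step is omitted. It assumes scikit-learn's `LogisticRegression` (aliased to avoid clashing with the custom class) and an arbitrary grid over `C`; neither the grid nor the resulting scores say anything about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression as SkLogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

# 80-20 split, then scale using statistics from the training part only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Arbitrary example grid over the inverse regularization strength.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(SkLogisticRegression(max_iter=1000),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train_s, y_train)

train_acc = search.score(X_train_s, y_train)
test_acc = search.score(X_test_s, y_test)
print("best parameters:", search.best_params_)
print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```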


**Task:** To try a different route, split X_train into $i$ parts, and train $i$ separate models on these parts. Now propose and implement 3 different *ensemble methods* to derive the global model's prediction for X_test using the results (not necessarily predictions!) of the $i$ models. First, set $i=10$ to find the method with the best test accuracy (the answer is not general!). You must use your own Logistic Regression model. (You might want to modify it a little bit for this part!)

In [None]:
# Your code goes here!
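
As a hedged sketch of what three such ensemble methods might look like, the code below trains $i=10$ small gradient-descent logistic regression models on disjoint parts of synthetic data and combines them by majority vote, probability averaging, and parameter averaging. The helper `fit_logreg` is a hypothetical compact variant of the custom model, and the choice of these three methods is an assumption, not the required answer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, lr=0.1, n_iter=500):
    """Batch gradient descent for binary logistic regression."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)
        w -= lr * (X.T @ (p - y)) / len(X)
        b -= lr * np.sum(p - y) / len(X)
    return w, b

# Synthetic, linearly separable stand-in data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(int)
X_train, y_train = X[:800], y[:800]
X_test, y_test = X[800:], y[800:]

# Train one model per part of the training set.
i = 10
parts = zip(np.array_split(X_train, i), np.array_split(y_train, i))
models = [fit_logreg(Xp, yp) for Xp, yp in parts]

# Per-model predicted probabilities on the test set, shape (i, n_test).
probs = np.array([sigmoid(X_test @ w + b) for w, b in models])

# Method 1: majority vote over hard predictions (ties go to class 1).
vote_pred = (np.mean(probs >= 0.5, axis=0) >= 0.5).astype(int)

# Method 2: average the predicted probabilities, then threshold.
avg_prob_pred = (probs.mean(axis=0) >= 0.5).astype(int)

# Method 3: average weights and biases into one global model.
w_avg = np.mean([w for w, _ in models], axis=0)
b_avg = np.mean([b for _, b in models])
avg_weight_pred = (sigmoid(X_test @ w_avg + b_avg) >= 0.5).astype(int)

accuracies = {
    "majority vote": np.mean(vote_pred == y_test),
    "probability averaging": np.mean(avg_prob_pred == y_test),
    "parameter averaging": np.mean(avg_weight_pred == y_test),
}
for name, acc in accuracies.items():
    print(f"{name}: test accuracy = {acc:.3f}")
```

Methods 2 and 3 use the models' probabilities and parameters rather than their hard predictions, which is one reading of "results (not necessarily predictions)".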


**Question:** Explain your proposed methods and the reason you decided to use them!

**Your answer:**

**Task:** Now, for your best method, change $i$ from 2 to 100 and report $i$, train and test accuracy of the best model. Also, plot test and train accuracy for $2\leq i\leq100$.

In [None]:
# Your code goes here!


**Question:** Analyze the results.

**Your Answer:**