# Programming Assignment #4


## 1. Linear Regression using scikit-learn

The diamonds dataset contains the price, cut, color, and other characteristics of a sample of nearly 54,000 diamonds. This data can be used to predict the price of a diamond based on its characteristics. Use sklearn's LinearRegression() function to predict the price of a diamond from the diamond's carat and table values.

- Import needed packages for regression.
- Initialize and fit a multiple linear regression model.
- Get the estimated intercept weight.
- Get the estimated weights of the carat and table features.
- Predict the price of a diamond with the user-input carat and table values.

Ex: If the input is:

- 0.5
- 60

the output should be:

- Intercept is 1961.992
- Weights for carat and table features are [7820.038  -74.301]
- Predicted price is [1413.97]

In [30]:
import pandas as pd
from sklearn.linear_model import LinearRegression

# Silence warnings
import warnings
warnings.filterwarnings('ignore')

# Input feature values for a sample instance
carat = float(input("Enter carat value: "))
table = float(input("Enter table value: "))

# Load the diamonds dataset from a CSV file 
diamonds = pd.read_csv('diamonds.csv')

# Define input (X) and output (y) features
X = diamonds[['carat', 'table']]
y = diamonds['price']

# Initialize a multiple linear regression model
model = LinearRegression()

# Fit the model
model.fit(X, y)

# Get estimated intercept weight
intercept = model.intercept_
print('Intercept is', round(intercept, 3))

# Get estimated weights for carat and table features
coefficients = model.coef_
print('Weights for carat and table features are', coefficients)

# Predict the price based on user-input carat and table values
predicted_price = model.predict([[carat, table]])
print('Predicted price is', predicted_price)


Enter carat value:  0.2
Enter table value:  35


Intercept is 1961.992
Weights for carat and table features are [7820.03788357  -74.30074671]
Predicted price is [925.47377487]
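The prediction can be checked by hand, because a linear regression model computes the intercept plus a weighted sum of the features. The sketch below hard-codes the weights printed in the run above (the intercept is only known to three decimals, so the result agrees to about two decimals); `predict_price` is a helper name introduced here for illustration.

```python
# Weights printed by the fitted model in the run above.
intercept = 1961.992                      # rounded to 3 decimals in the output
weights = [7820.03788357, -74.30074671]   # carat, table

def predict_price(carat, table):
    """Apply the linear model: intercept + w_carat * carat + w_table * table."""
    return intercept + weights[0] * carat + weights[1] * table

# Same inputs as the run above (carat = 0.2, table = 35).
manual_price = predict_price(0.2, 35)
print(round(manual_price, 2))  # 925.47
```

The same function with the assignment's example input (0.5, 60) gives roughly 1413.97, which matches the expected output.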



## 2. Logistic Regression using scikit-learn

The **nbaallelo_log** file contains data on 126314 NBA games from 1947 to 2015. The dataset includes the features **pts, elo_i, win_equiv, and game_result**. Using the csv file **nbaallelo_log.csv** and scikit-learn's **LogisticRegression()** function, construct a logistic regression model to classify whether a team will win or lose a game based on the team's elo_i score.

- Create a binary feature win for **game_result** with 0 for L and 1 for W
- Use the **LogisticRegression()** function to construct a logistic regression model with **win** as the target and **elo_i** as the predictor
- Print the weights and intercept of the fitted model
- Find the proportion of instances correctly classified
  
Note: Use **ravel()** from **numpy** to flatten the second argument of **LogisticRegression.fit()** into a 1-D array.

Ex: If the program uses the file **nbaallelo_small.csv**, which contains 100 instances, the output is:

- w1: [[0.01585017]]
- w0: [-20.5926668]
- 0.62

In [31]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the dataset
nba_data_small = pd.read_csv('nbaallelo_log.csv')  

# Convert the 'game_result' column into a binary target variable 'win' (0 for Loss 'L', 1 for Win 'W')
nba_data_small['win'] = nba_data_small['game_result'].apply(lambda x: 1 if x == 'W' else 0)

# Define the feature (elo_i) and the target (win)
X = nba_data_small[['elo_i']]  # elo_i is the predictor
y = nba_data_small['win']      # win is the binary target

# Split the dataset into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Logistic Regression model with a large number of iterations to ensure convergence
model = LogisticRegression(max_iter=1000)

# Fit the model to the training data
# (Series.ravel() is deprecated in recent pandas, so flatten the NumPy array instead)
model.fit(X_train, y_train.to_numpy().ravel())

# Print the model's weights (coefficients) and intercept
print('Weights (w1):', model.coef_)
print('Intercept (w0):', model.intercept_)

# Predict the class labels on the test set
y_pred = model.predict(X_test)

# Calculate the accuracy of the model on the test set
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.4f}')


Weights (w1): [[0.00437846]]
Intercept (w0): [-6.55026145]
Accuracy: 0.5968
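To interpret the fitted weights, recall that logistic regression passes the linear score w0 + w1 * elo_i through the sigmoid function to get a win probability, and predicts a win when that probability exceeds 0.5. The sketch below plugs in the weights printed above; `win_probability` and `boundary` are names introduced here for illustration.

```python
import math

w1 = 0.00437846    # weight for elo_i, from the output above
w0 = -6.55026145   # intercept, from the output above

def win_probability(elo_i):
    """Sigmoid of the linear score w0 + w1 * elo_i."""
    z = w0 + w1 * elo_i
    return 1.0 / (1.0 + math.exp(-z))

# The predicted class flips where the linear score is zero.
boundary = -w0 / w1
print(round(boundary, 1))
print(win_probability(1300), win_probability(1700))
```

With these weights the decision boundary sits at an elo_i of about 1496: teams rated below it are classified as losing, teams above it as winning.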


## 3. Support Vector Classifier using scikit-learn

The heart dataset contains 13 health-related attributes from 303 patients and one attribute denoting whether or not the patient has heart disease. Using the file heart.csv and scikit-learn's LinearSVC() function, fit a support vector classifier to predict whether a patient has heart disease based on other health attributes.

- Import the correct packages and functions.
- Split the data into 75% training data and 25% testing data. Set random_state=123.
- Initialize and fit a support vector classifier with C=0.2, a maximum of 500 iterations, and random_state=123.
- Print the model weights.

Ex: If the program input is heart_small.csv, which contains 100 instances, the output is:

0.6

w0: [0.013]
w1 and w2: [[ 0.361 -0.087]]

In [32]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Load the dataset 
heart_data_small = pd.read_csv('heart.csv')  

# Split the data into features (X) and target (y)
# Assuming the target variable is in the last column and other columns are features
X = heart_data_small.iloc[:, :-1]  # All columns except the last one
y = heart_data_small.iloc[:, -1]   # Last column as the target variable (whether the patient has heart disease)

# Split the dataset into 75% training data and 25% testing data, with random_state=123 for consistency
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123)

# Initialize the LinearSVC model with C=0.2, max_iter=500, and random_state=123
model = LinearSVC(C=0.2, max_iter=500, random_state=123)

# Fit the model to the training data
model.fit(X_train, y_train)

# Print the intercept and model weights (coefficients)
print(f'Intercept (w0): {model.intercept_}')
print(f'Weights (w1, w2, ...): {model.coef_}')

# Predict the class labels on the test set
y_pred = model.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.4f}')


Intercept (w0): [0.20903585]
Weights (w1, w2, ...): [[-1.29422534e-03 -4.12408582e-01  3.21110486e-01 -4.16999495e-03
  -1.08946319e-04  5.48225740e-02  2.41503288e-01  1.04181225e-02
  -2.33050514e-01 -2.34613654e-01  3.60722327e-02 -2.59271314e-01
  -3.33669551e-01]]
Accuracy: 0.7763
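A linear support vector classifier predicts from the sign of w0 + w · x. As a minimal sketch, the code below uses the two-feature weights shown in the expected output for heart_small.csv; the feature vectors are made up for illustration, and the class labels are assumed to be 0 and 1.

```python
# Weights taken from the expected output for heart_small.csv.
w0 = 0.013
w = [0.361, -0.087]

def decision_value(x):
    """Signed score of x relative to the separating hyperplane."""
    return w0 + sum(wi * xi for wi, xi in zip(w, x))

def predict_label(x):
    """Positive score maps to class 1, otherwise class 0 (assumed labels)."""
    return 1 if decision_value(x) > 0 else 0

# Hypothetical feature vectors, not rows from heart.csv.
print(predict_label([1.0, 2.0]))    # score is about 0.2, so class 1
print(predict_label([-1.0, 0.5]))   # score is about -0.39, so class 0
```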


## 4. k-Nearest Neighbors using scikit-learn 
The dataset SDSS contains 17 observational features and one class feature for 10000 deep sky objects observed by the Sloan Digital Sky Survey. Use sklearn's KNeighborsClassifier() function to perform kNN classification to classify each object by the object's redshift and u-g color.

- Import the necessary modules for kNN classification
- Create dataframe X with features redshift and u_g
- Create dataframe y with feature class
- Initialize a kNN model with k=3
- Fit the model using the training data
- Find the predicted classes for the test data
- Calculate the accuracy score using the test data

Ex: If the feature u is used rather than u_g, the output is:
- Accuracy score is 0.979

In [19]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the dataset
sdss_data = pd.read_csv('SDSS.csv')

# Create dataframe X with features 'redshift' and 'u'
X = sdss_data[['redshift', 'u']]

# Create dataframe y with the class label
y = sdss_data['class']

# Split the dataset into 75% training data and 25% testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Initialize a k-NN model with k=3
knn_model = KNeighborsClassifier(n_neighbors=3)

# Fit the model using the training data
knn_model.fit(X_train, y_train)

# Predict the class labels for the test data
y_pred = knn_model.predict(X_test)

# Calculate the accuracy score
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.3f}')

Accuracy: 0.978
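What `KNeighborsClassifier(n_neighbors=3)` does at prediction time can be sketched in plain Python: find the three closest training points and take a majority vote. The points and the labels "STAR" and "QSO" below are placeholders made up for illustration, not rows from SDSS.csv, and the sketch assumes plain Euclidean distance.

```python
import math
from collections import Counter

# Hypothetical (redshift, u) training points with placeholder labels.
train_points = [(0.0, 1.0), (0.1, 1.2), (0.05, 0.9),
                (1.5, 0.2), (1.6, 0.1), (1.4, 0.3)]
train_labels = ["STAR", "STAR", "STAR", "QSO", "QSO", "QSO"]

def knn_predict(query, k=3):
    """Return the majority label among the k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(query, pair[0]),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

print(knn_predict((0.02, 1.0)))
print(knn_predict((1.5, 0.2)))
```

Because the vote depends on raw distances, features on very different scales can dominate each other, which is one reason feature scaling is often considered before kNN.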


## 5. Naive Bayes using scikit-learn 

The file SDSS contains 17 observational features and one class feature for 10000 deep sky objects observed by the Sloan Digital Sky Survey. Use sklearn's GaussianNB() function to perform Gaussian naive Bayes classification to classify each object by the object's redshift and u-g color.

- Import the necessary modules for Gaussian naive Bayes classification
- Create dataframe X with features redshift and u_g
- Create dataframe y with feature class
- Initialize a Gaussian naive Bayes model with the default parameters
- Fit the model
- Calculate the accuracy score

Note: Use ravel() from numpy to flatten the second argument of GaussianNB.fit() into a 1-D array.

Ex: If the feature u is used rather than u_g, the output is:

- Accuracy score is 0.987

In [20]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load the dataset
sdss_data = pd.read_csv('SDSS.csv')

# Create dataframe X with features 'redshift' and 'u'
X = sdss_data[['redshift', 'u']]

# Create dataframe y with the class label
y = sdss_data['class']

# Split the dataset into 75% training data and 25% testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Initialize the Gaussian Naive Bayes model
gnb_model = GaussianNB()

# Fit the model using the training data
# (Series.ravel() is deprecated in recent pandas, so flatten the NumPy array instead)
gnb_model.fit(X_train, y_train.to_numpy().ravel())

# Predict the class labels for the test data
y_pred = gnb_model.predict(X_test)

# Calculate the accuracy score, rounded to 3 decimal places
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.3f}')

Accuracy: 0.987
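Gaussian naive Bayes models each feature within each class as an independent normal distribution and picks the class with the highest prior times likelihood. The following is a simplified sketch of that idea on made-up data with placeholder classes "A" and "B"; the small constant added to each variance is an assumption standing in for variance smoothing, and all function names are hypothetical.

```python
import math
from statistics import mean, pvariance

# Hypothetical (redshift, u) samples per class, not rows from SDSS.csv.
samples = {
    "A": [(0.0, 18.0), (0.1, 18.5), (0.05, 19.0)],
    "B": [(1.5, 19.5), (1.7, 20.0), (1.6, 20.5)],
}

def fit(samples):
    """Estimate a prior plus per-feature mean and variance for each class."""
    total = sum(len(rows) for rows in samples.values())
    model = {}
    for label, rows in samples.items():
        columns = list(zip(*rows))
        model[label] = {
            "prior": len(rows) / total,
            "means": [mean(c) for c in columns],
            "vars": [pvariance(c) + 1e-9 for c in columns],  # avoid zero variance
        }
    return model

def log_gaussian(x, mu, var):
    """Log density of a normal distribution with mean mu and variance var."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def predict(model, x):
    """Pick the class with the largest log prior plus summed log likelihoods."""
    def score(label):
        p = model[label]
        return math.log(p["prior"]) + sum(
            log_gaussian(xi, m, v) for xi, m, v in zip(x, p["means"], p["vars"])
        )
    return max(model, key=score)

model = fit(samples)
print(predict(model, (0.05, 18.4)))
print(predict(model, (1.6, 20.1)))
```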


## 6. Ensemble methods using scikit-learn 

### 6.1. Bagging using scikit-learn
The msleep_clean dataset contains information on sleep habits for 47 mammals. Features include length of REM sleep, time spent awake, brain weight, and body weight.

- Create a dataframe X containing the features awake, brainwt, and bodywt, in that order.
- Create a dataframe y containing sleep_rem.
- Initialize and fit a bagging regressor with 30 base estimators, a random state of 10, and oob_score=True.

Ex: If 10 base estimators are used, the output should be:

0.2322

[3.26   2.92   1.0333 2.3333 0.8    1.325  2.56   2.2667 0.8    2.38
 3.     0.5333 3.175  2.9667 0.7    0.65   1.825  2.2667 2.     1.
 0.6    1.1667 1.5    3.1    2.     1.9    4.15   1.3    0.75   1.2
 2.025  1.45   3.0286 2.72   0.5    2.0333 1.12   2.     2.65   1.65
 2.6667 2.3    1.45   0.58   2.625  1.6    0.74   1.3   ]

In [23]:
import numpy as np
import pandas as pd
from sklearn.ensemble import BaggingRegressor

# Load the dataset (make sure to replace with the correct path to 'msleep_clean.csv')
df = pd.read_csv('msleep_clean.csv')

# Create a dataframe X containing the features awake, brainwt, and bodywt, in that order
X = df[['awake', 'brainwt', 'bodywt']]

# Create a dataframe y containing sleep_rem
y = df['sleep_rem']

# Initialize and fit Bagging Regressor with 10 base estimators, random state of 10, and oob_score=True
sleepModel = BaggingRegressor(n_estimators=10, random_state=10, oob_score=True)
sleepModel.fit(X, y)

# Print the out-of-bag score (R^2 for a regressor, not accuracy), rounded to 4 decimal places
print(np.round(sleepModel.oob_score_, 4))

# Calculate and print predictions from out-of-bag estimate (rounded to 4 decimal places)
print(np.round(sleepModel.oob_prediction_, 4))

0.2322
[3.26   2.92   1.0333 2.3333 0.8    1.325  2.56   2.2667 0.8    2.38
 3.     0.5333 3.175  2.9667 0.7    0.65   1.825  2.2667 2.     1.
 0.6    1.1667 1.5    3.1    2.     1.9    4.15   1.3    0.75   1.2
 2.025  1.45   3.0286 2.72   0.5    2.0333 1.12   2.     2.65   1.65
 2.6667 2.3    1.45   0.58   2.625  1.6    0.74   1.3   ]
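The out-of-bag estimate relies on bootstrap sampling: each base estimator is trained on rows drawn with replacement, and the rows it never saw can be used to evaluate it. The sketch below only illustrates that sampling step with the standard library; the sizes are arbitrary, and it does not reproduce scikit-learn's internal random draws.

```python
import random

rng = random.Random(10)   # fixed seed so the sketch is repeatable
n_samples = 8             # hypothetical dataset size
n_estimators = 3          # hypothetical number of base estimators

bags = []
for _ in range(n_estimators):
    # Draw n_samples row indices with replacement (the bootstrap sample).
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    # Rows never drawn are "out of bag" for this estimator.
    oob = sorted(set(range(n_samples)) - set(in_bag))
    bags.append((in_bag, oob))
    print("in-bag:", sorted(in_bag), "out-of-bag:", oob)
```

Each row's out-of-bag prediction averages only the estimators for which that row was out of bag, which is why the score can be computed without a separate test set.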


### 6.2. Random forests using scikit-learn
The mpg_clean.csv dataset contains information on miles per gallon (mpg) and engine size for cars sold from 1970 through 1982. Dataframe X contains the input features mpg, cylinders, displacement, horsepower, weight, acceleration, and model_year. Dataframe y contains the output feature origin.

- Initialize and fit a random forest classifier with a user-input number of decision trees (estimator), a user-input number of features considered at each split (max_features), and a random state of 123.
- Calculate the prediction accuracy for the model.
- Read the documentation for the permutation_importance function from scikit-learn's inspection module.
- Calculate the permutation importance using the default parameters and a random state of 123.

Ex: When the input is

5

3

the output is:

0.9796


|   | Feature      | Permutation Importance |
|---|--------------|------------------------|
| 2 | displacement | 0.453571               |
| 0 | mpg          | 0.160204               |
| 4 | weight       | 0.133673               |
| 3 | horsepower   | 0.107653               |
| 5 | acceleration | 0.057143               |
| 6 | model_year   | 0.051531               |
| 1 | cylinders    | 0.012245               |




In [27]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.inspection import permutation_importance
from sklearn.preprocessing import LabelEncoder

# Load the dataset
df = pd.read_csv('mpg_clean.csv')  # Make sure the dataset is in the same directory as your script

# Convert the target variable 'origin' to numeric values using LabelEncoder
le = LabelEncoder()
df['origin'] = le.fit_transform(df['origin'])

# Create dataframe X with input features (mpg, cylinders, displacement, horsepower, weight, acceleration, model_year)
X = df[['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model_year']]

# Create dataframe y with the output feature 'origin'
y = df['origin']

# Get user input for the number of decision trees (estimators) and number of features at each split (max_features)
n_estimators = int(input("Enter number of decision trees (e.g., 5): "))
max_features = int(input("Enter number of features considered at each split (e.g., 3): "))

# Split the dataset into training and testing data (75% training, 25% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123)

# Initialize and fit the Random Forest classifier with user-specified parameters
rf_model = RandomForestClassifier(n_estimators=n_estimators, max_features=max_features, random_state=123)
rf_model.fit(X_train, y_train)

# Predict the class labels on the test set
y_pred = rf_model.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.4f}')

# Calculate the permutation importance using default parameters and random_state=123
perm_importance = permutation_importance(rf_model, X_test, y_test, random_state=123)

# Display the feature importance
feature_names = X.columns
print("\n| Feature               | Permutation Importance |")
print("|-----------------------|------------------------|")
for i in perm_importance.importances_mean.argsort()[::-1]:
    # Pad the value to 22 characters so each row lines up with the header
    print(f"| {feature_names[i]:<21} | {perm_importance.importances_mean[i]:<22.6f} |")

Enter number of decision trees (e.g., 5):  5
Enter number of features considered at each split (e.g., 3):  3


Accuracy: 0.7857

| Feature               | Permutation Importance |
|-----------------------|------------------------|
| displacement          | 0.320408               |
| weight                | 0.051020               |
| horsepower            | 0.028571               |
| model_year            | 0.022449               |
| mpg                   | 0.010204               |
| acceleration          | -0.000000              |
| cylinders             | -0.012245              |
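Permutation importance measures how much a model's score drops when one feature column is shuffled, which breaks that feature's link to the target. The sketch below implements the idea with a hand-written toy model and made-up data; `model_predict` and `permutation_importance_sketch` are hypothetical names, and this is not scikit-learn's implementation.

```python
import random

rng = random.Random(123)

def model_predict(row):
    """Toy 'fitted model': uses only feature 0 and ignores feature 1."""
    return 1 if row[0] > 0.5 else 0

# Made-up data whose labels depend only on feature 0.
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [model_predict(row) for row in X]

def accuracy(rows, labels):
    return sum(model_predict(r) == t for r, t in zip(rows, labels)) / len(labels)

def permutation_importance_sketch(rows, labels, n_repeats=5):
    """Mean drop in accuracy after shuffling each column n_repeats times."""
    baseline = accuracy(rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
            drops.append(baseline - accuracy(permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

importances = permutation_importance_sketch(X, y)
print(importances)
```

The ignored feature gets an importance of exactly zero here. With a real model, a shuffled column can score slightly better than the baseline by chance, which is how small negative values such as the one for cylinders in the output above arise.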


### 6.3. Boosting using scikit-learn
The mpg.csv dataset contains information on miles per gallon (mpg) and engine size for cars sold from 1970 through 1982.

- Create a dataframe X containing the input features cylinders, weight, and mpg.
- Create a dataframe y containing the output feature origin.
- Initialize and fit an adaptive boosting classifier with a user-input learning rate lr and a random state of 123.
- Initialize and fit a gradient boosting classifier with a user-input learning rate lr and a random state of 123.
- Calculate the prediction accuracy for each model.

Ex: If the user-input learning rate is 0.6, the output is:

0.7688

0.995

In [29]:
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the dataset (ensure 'mpg.csv' is in the correct path)
mpg = pd.read_csv('mpg.csv')

# Create a dataframe X containing the input features cylinders, weight, and mpg
X = mpg[['cylinders', 'weight', 'mpg']]

# Create a dataframe y containing the output feature 'origin'
y = mpg['origin']

# Get user-input learning rate
lr = float(input("Enter learning rate (e.g., 0.6): "))

# Split the dataset into 75% training and 25% testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123)

# Initialize and fit an adaptive boosting classifier (AdaBoost) with the user-input learning rate and random_state=123
adaBoostModel = AdaBoostClassifier(learning_rate=lr, random_state=123)
adaBoostModel.fit(X_train, y_train)

# Initialize and fit a gradient boosting classifier with the user-input learning rate and random_state=123
gradientBoostModel = GradientBoostingClassifier(learning_rate=lr, random_state=123)
gradientBoostModel.fit(X_train, y_train)

# Calculate the prediction accuracy for the AdaBoost classifier
adaBoostScore = accuracy_score(y_test, adaBoostModel.predict(X_test))
print(f'Accuracy of AdaBoost: {round(adaBoostScore, 4)}')

# Calculate the prediction accuracy for the Gradient Boosting classifier
gradientBoostScore = accuracy_score(y_test, gradientBoostModel.predict(X_test))
print(f'Accuracy of Gradient Boosting: {round(gradientBoostScore, 4)}')

Enter learning rate (e.g., 0.6):  0.6


Accuracy of AdaBoost: 0.62
Accuracy of Gradient Boosting: 0.68
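The learning rate scales how much each boosting stage contributes to the running prediction. As a deliberately simplified illustration (a squared-error regression toy rather than the classifiers used above), the sketch below lets every stage fit the mean of the current residuals and starts from a prediction of zero; `boosted_constant_fit` is a hypothetical helper, not a scikit-learn function.

```python
def boosted_constant_fit(targets, learning_rate, n_stages):
    """Repeatedly fit the mean residual and add a shrunken copy of it."""
    prediction = 0.0
    history = []
    for _ in range(n_stages):
        residuals = [t - prediction for t in targets]
        stage_fit = sum(residuals) / len(residuals)   # weak learner: a constant
        prediction += learning_rate * stage_fit       # shrink the stage's step
        history.append(prediction)
    return history

history = boosted_constant_fit([8.0, 10.0, 12.0], learning_rate=0.6, n_stages=3)
print([round(p, 2) for p in history])  # [6.0, 8.4, 9.36]
```

With a learning rate of 0.6, each stage closes 60% of the remaining gap to the target mean of 10. A smaller rate takes more stages to converge but makes each individual stage less influential.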
