## Assignment #3: Cross-Validation

## Name: KINZA NISAR
## Roll#: 22i-2872


---------------------

## Project Overview: 
This project explores the effectiveness of cross-validation (CV) in machine learning, to understand whether it helps in minimizing prediction errors. Cross-validation is a resampling method for estimating how well a model performs on data it was not trained on: the data is split into K folds, the model is fitted on K-1 of them and scored on the remaining one, and the scores are averaged over all K rotations. It is commonly used to detect overfitting in a predictive model, particularly when the amount of data is limited. In this project, we apply linear regression to a breast cancer dataset, evaluate the model's performance using several error metrics, and then use cross-validation to assess its impact on those metrics.
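The fold-by-fold procedure described above can be sketched by hand. The following is a minimal illustration using only NumPy and an ordinary least-squares fit; the function name `k_fold_mse` and the synthetic data are assumptions made for this example and are not part of the assignment's dataset or code.

```python
import numpy as np

def k_fold_mse(X, y, k=5, seed=0):
    """Average held-out MSE of an ordinary least-squares fit over k folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))      # shuffle once, then split
    folds = np.array_split(indices, k)     # k nearly equal index groups
    fold_mse = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Add an intercept column and fit on the k-1 training folds
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # Score only on the fold that was held out
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        residuals = y[test_idx] - A_test @ coef
        fold_mse.append(np.mean(residuals ** 2))
    return float(np.mean(fold_mse))

# Synthetic data for illustration only (not the breast cancer CSV)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
cv_mse_demo = k_fold_mse(X_demo, y_demo, k=5)
print(cv_mse_demo)
```

Every observation is used for scoring exactly once, and never by a model that saw it during fitting, which is what makes the averaged error an estimate of performance on unseen data.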

Using linear regression, we predict the likelihood of breast cancer from the features described in the dataset overview below. The model's fit is first quantified using Mean Squared Error (MSE), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE) on the training data. We then apply cross-validation to check whether the model's performance is consistent across different subsets of the data. Comparing the error metrics before and after applying cross-validation shows whether CV helps in reducing the prediction error, and whether the model is robust enough to generalize to new data.

## Dataset Overview:
The dataset is a compilation of observations and measurements from breast cancer cases. Each row represents one case, described by characteristics of the cell nuclei present in a digital image of a fine needle aspirate (FNA) of a breast mass. The file used here contains five predictors: mean radius, mean texture, mean perimeter, mean area, and mean smoothness. The target variable, diagnosis, is a binary label (0 or 1) indicating the presence or absence of breast cancer.

## Import necessary libraries and load the dataset

In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import cross_val_score
from math import sqrt

# Load the dataset
df = pd.read_csv('Breast_cancer_data.csv')  # Load the breast cancer dataset

# Display the first few rows of the dataframe
df.head()

   mean_radius  mean_texture  mean_perimeter  mean_area  mean_smoothness  diagnosis
0        17.99         10.38          122.80     1001.0          0.11840          0
1        20.57         17.77          132.90     1326.0          0.08474          0
2        19.69         21.25          130.00     1203.0          0.10960          0
3        11.42         20.38           77.58      386.1          0.14250          0
4        20.29         14.34          135.10     1297.0          0.10030          0


## Handling Missing Values

In [5]:
# Count the missing values in each column
missing_values = df.isnull().sum()
# Fill any missing values in numeric columns with that column's mean
df = df.fillna(df.mean(numeric_only=True))


## Data Splitting

In [6]:
# Split the data into features and target variable
X = df.drop('diagnosis', axis=1)  # Features
y = df['diagnosis']                # Target variable


In [16]:
# Split the dataset into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(X_train, X_test, y_train, y_test)

     mean_radius  mean_texture  mean_perimeter  mean_area  mean_smoothness
149       13.740         17.91           88.12      585.0          0.07944
124       13.370         16.39           86.10      553.5          0.07115
421       14.690         13.98           98.22      656.1          0.10310
195       12.910         16.33           82.53      516.4          0.07941
545       13.620         23.23           87.19      573.2          0.09246
..           ...           ...             ...        ...              ...
71         8.888         14.64           58.79      244.0          0.09783
106       11.640         18.33           75.17      412.5          0.11420
270       14.290         16.82           90.30      632.6          0.06429
435       13.980         19.62           91.12      599.5          0.10600
102       12.180         20.52           77.22      458.7          0.08013

[398 rows x 5 columns]      mean_radius  mean_texture  mean_perimeter  mean_area  mean_smoothness
2


## Linear Regression Testing

In [15]:

# Initialize the Linear Regression model
lr = LinearRegression()
# Fit the model on the training data
lr.fit(X_train, y_train)
# Predict the target variable for the testing set
y_pred = lr.predict(X_test)
print(y_pred)

[ 0.73108432  0.24998958  0.34341551  0.67907371  0.92144917 -0.48496903
 -0.18514305  0.38133394  0.62198264  0.93558694  0.65181806  0.45922572
  0.65345093  0.15013175  1.00255465 -0.27657866  0.77876228  0.97933693
  1.38994846 -0.0708862   0.63537696  0.77334564 -0.27680044  1.13171496
  0.91720421  0.56804974  0.90567253  0.85740733  0.74576894  0.08459591
  0.8326972   1.00513101  0.98739088  0.75597472  0.97088645  0.9043856
  0.50324922  1.14507516  0.25941456  0.50403886  0.93965092  0.32458969
  0.79617807  0.91163794  0.66829346  0.91667661  0.89140571  1.04927446
  0.77805706  0.92128477  0.29723532 -0.05357702  0.51687832  0.52198484
  0.72756316  0.91209916  0.97220931 -0.52054619  0.54180927  1.01906026
  0.71648447 -0.0076777  -0.0869619   0.64143067  0.92872953  0.78063041
  0.13359099 -0.58619263  0.86011185  0.68322307  0.33073552  0.45775039
  0.81575771  0.29044024  1.32067658  0.67369022  0.63286345  0.52750621
  1.21256948  0.78943065  0.3023707   1.12717804  0.

## Evaluation Metrics on Training Data


In [14]:
# Calculate MSE, MAE, and RMSE on the training dataset
y_train_pred = lr.predict(X_train)  # predict once and reuse for each metric
mse = mean_squared_error(y_train, y_train_pred)
mae = mean_absolute_error(y_train, y_train_pred)
rmse = sqrt(mse)
print(mse, mae, rmse)

0.08433777691869977 0.23466260384148993 0.29040967084224273
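To make the three metrics concrete, they can be written out by hand. This is a small sketch in plain Python on toy values chosen purely for illustration; the helper name `regression_errors` is hypothetical, and the notebook itself relies on the scikit-learn functions shown above.

```python
from math import sqrt

def regression_errors(y_true, y_pred):
    """Return (MSE, MAE, RMSE) for two equal-length sequences."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e ** 2 for e in errors) / len(errors)   # mean of squared errors
    mae = sum(abs(e) for e in errors) / len(errors)   # mean of absolute errors
    return mse, mae, sqrt(mse)                        # RMSE is the root of MSE

# Toy labels and predictions, for illustration only
y_true_demo = [0, 1, 1, 0]
y_pred_demo = [0.5, 0.5, 1.0, 0.0]
mse_demo, mae_demo, rmse_demo = regression_errors(y_true_demo, y_pred_demo)
print(mse_demo, mae_demo, rmse_demo)
```

Because squaring weights large errors more heavily than small ones, RMSE is never smaller than MAE on the same predictions.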



## Cross-Validation with Varying K values

In [13]:

# Perform 5-fold cross-validation (K = 5). With no scoring argument,
# cross_val_score returns each fold's R^2 score for a regressor.
cv_scores = cross_val_score(lr, X_train, y_train, cv=5)
print(cv_scores)

[0.63691324 0.59130479 0.6736495  0.60304344 0.56079336]
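The cell above uses a single value, K = 5, and the scores it prints are R^2 values rather than errors. One way to compare several values of K on an error scale is sketched below. It runs on synthetic stand-in data, which is an assumption made so the example is self-contained; in the notebook, `X_train` and `y_train` would take the place of `X_demo` and `y_demo`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data with a binary target (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
signal = X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)
y_demo = (signal > 0).astype(float)

mse_by_k = {}
for k in (3, 5, 10):
    # Shuffle before splitting so folds do not depend on row order
    folds = KFold(n_splits=k, shuffle=True, random_state=42)
    scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                             scoring='neg_mean_squared_error', cv=folds)
    mse_by_k[k] = -scores.mean()   # flip the sign back to a positive MSE
    print(k, mse_by_k[k])
```

Larger K means each model is fitted on more data but scored on smaller folds, so the individual fold scores tend to vary more.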



## Re-evaluation Metrics on Training Data

In [17]:

# Estimate MSE, MAE, and RMSE with 5-fold cross-validation on the training set.
# scikit-learn's error scorers are negated, so the sign is flipped back.
mse_cv = -cross_val_score(lr, X_train, y_train, scoring='neg_mean_squared_error', cv=5).mean()
mae_cv = -cross_val_score(lr, X_train, y_train, scoring='neg_mean_absolute_error', cv=5).mean()
rmse_cv = sqrt(mse_cv)
print(mse_cv, mae_cv, rmse_cv)

0.0895557172956861 0.2426183336959684 0.2992586127343474
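The two `cross_val_score` calls above refit the model once per metric. As an alternative sketch, scikit-learn's `cross_validate` accepts several scorers and evaluates them in one pass over the folds. The synthetic data and the `_demo` variable names are assumptions for illustration.

```python
import numpy as np
from math import sqrt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (illustration only)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(150, 4))
y_demo = X_demo @ np.array([0.3, -0.2, 0.1, 0.0]) + rng.normal(scale=0.2, size=150)

results = cross_validate(
    LinearRegression(), X_demo, y_demo, cv=5,
    scoring=('neg_mean_squared_error', 'neg_mean_absolute_error'),
)
# Each scorer's per-fold values are stored under the key 'test_<scorer name>'
mse_cv_demo = -results['test_neg_mean_squared_error'].mean()
mae_cv_demo = -results['test_neg_mean_absolute_error'].mean()
rmse_cv_demo = sqrt(mse_cv_demo)   # root of the mean fold MSE, as in the cell above
print(mse_cv_demo, mae_cv_demo, rmse_cv_demo)
```

Taking the square root of the averaged MSE, as done here and in the notebook, is not the same as averaging per-fold RMSE values; the two usually differ slightly.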


## Compare the error metrics before and after cross-validation


In [12]:
print('MSE (without CV):', mse, '\nMSE (with CV):', mse_cv)
print('MAE (without CV):', mae, '\nMAE (with CV):', mae_cv)
print('RMSE (without CV):', rmse, '\nRMSE (with CV):', rmse_cv)


MSE (without CV): 0.08433777691869977 
MSE (with CV): 0.0895557172956861
MAE (without CV): 0.23466260384148993 
MAE (with CV): 0.2426183336959684
RMSE (without CV): 0.29040967084224273 
RMSE (with CV): 0.2992586127343474


## Interpretation
The results from the linear regression analysis on the breast cancer dataset show slightly higher error metrics under 5-fold cross-validation (CV) than when the model is evaluated on the same training data it was fitted on. Here's a detailed interpretation of each metric:

### MSE (Mean Squared Error): 
The MSE without CV is approximately 0.0843, and with CV it is approximately 0.0896. MSE is the average squared difference between the predicted and actual values, so a lower MSE indicates a closer fit to the data. The higher MSE with CV shows that the average error rises slightly when each prediction is made for a fold the model was not fitted on. The model itself has not become worse; the CV figure is a less optimistic estimate of its error.

### MAE (Mean Absolute Error): 
The MAE without CV is around 0.2347, and with CV it is around 0.2426. MAE measures the average magnitude of the errors in a set of predictions, without considering their direction: it is the mean of the absolute differences between predicted and actual values. The higher MAE with CV indicates that predictions for held-out folds are, on average, slightly further from the actual values than predictions for the training data.

### RMSE (Root Mean Squared Error): 
The RMSE without CV is approximately 0.2904, and with CV it is approximately 0.2993. RMSE is the square root of the mean squared difference between predicted and actual values, which puts the error back on the same scale as the target. The higher RMSE with CV means that the typical size of the residuals (errors) is somewhat larger on held-out folds than on the training data.

The increase in error metrics with cross-validation reflects the expected gap between training error and held-out error. Without CV, the model is evaluated on the same data it was trained on, which leads to an overly optimistic assessment of its performance. Cross-validation, by contrast, scores each fold with a model that never saw it, providing a more realistic evaluation of predictive power. Because the gap here is small (MSE rises from 0.0843 to 0.0896, about 6%), it points to only mild overfitting.

In conclusion, while the non-CV metrics look better because they are lower, the CV metrics provide a more honest assessment of the model's performance on unseen data. Cross-validation does not itself reduce the model's error; it measures that error more reliably. The slight increase in errors with CV is expected, and its small size suggests that the model is not overly specialized to the training data and has a reasonable level of generalizability.