Name: Abhishek Gupta

Class: TE-CSE-DS

Batch: B

UID: 2021700027

1. Dimensionality Reduction:

In many machine learning problems, datasets may contain a large number of features or dimensions. High-dimensional data can lead to several issues such as increased computational complexity, overfitting, and difficulty in visualization.
Dimensionality reduction techniques aim to reduce the number of features in the dataset while preserving most of the important information.

2. Principal Component Analysis (PCA):

PCA is a popular technique for dimensionality reduction.
PCA works by transforming the original features into a new set of orthogonal features called principal components.
These principal components are linear combinations of the original features and are ordered by the amount of variance they capture in the data.
The first principal component captures the most variance, the second principal component captures the second most variance, and so on.
By retaining only a subset of the principal components that capture most of the variance, PCA effectively reduces the dimensionality of the data.
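The ordering of components by variance can be illustrated with a minimal NumPy sketch. It assumes synthetic data (three features, two of them strongly correlated) rather than the dataset used later, and it computes the components by eigen-decomposing the covariance matrix; all variable names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 200 samples, 3 features,
# where the second feature is almost a multiple of the first.
n = 200
a = rng.normal(size=n)
X = np.column_stack([a, 2.0 * a + 0.1 * rng.normal(size=n), rng.normal(size=n)])

# Center the data, then eigen-decompose the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order

# Sort descending so the first component captures the most variance.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained_ratio = eigvals / eigvals.sum()
cumulative = np.cumsum(explained_ratio)

# Keep only the first two principal components.
Z = Xc @ eigvecs[:, :2]

print("explained variance ratio:", np.round(explained_ratio, 3))
print("reduced shape:", Z.shape)
```

The variance of each column of `Z` equals the corresponding eigenvalue, which is why dropping the trailing components discards the least variance.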

3. Singular Value Decomposition (SVD):

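PCA can be implemented using Singular Value Decomposition (SVD), a matrix factorization technique.
SVD decomposes a matrix X into three matrices U, Σ, and V such that X = UΣVᵀ, where the columns of U and V are orthonormal and Σ is diagonal with non-negative singular values.
In the context of PCA, the SVD is applied to the mean-centered data matrix X: the columns of V are the principal directions, and the squared singular values, divided by n − 1 for n samples, give the variance captured by each component.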
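The link between SVD and PCA can be checked numerically. The sketch below is an illustration on random correlated data (an assumption, not the notebook's dataset): it centers the matrix, takes its SVD with `np.linalg.svd`, and compares the resulting variances with the eigenvalues of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(1)

# Random data with correlated features (illustrative only).
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))

# PCA operates on mean-centered data.
Xc = X - X.mean(axis=0)

# Thin SVD: Xc = U @ diag(S) @ Vt, singular values in descending order.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

components = Vt                                  # rows are principal directions
explained_variance = S**2 / (Xc.shape[0] - 1)    # variance along each direction

# Coordinates of every sample in the principal-component basis.
scores = Xc @ Vt.T

# Cross-check against the eigenvalues of the covariance matrix.
cov_eigvals = np.sort(np.linalg.eigvalsh(np.cov(Xc, rowvar=False)))[::-1]

print("variances from SVD:       ", np.round(explained_variance, 4))
print("covariance eigenvalues:   ", np.round(cov_eigvals, 4))
```

Because `scores` also equals `U * S`, the projection can be read directly off the decomposition without forming the covariance matrix.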

#Import Necessary Libraries
pandas: for data manipulation and analysis.
train_test_split: for splitting the dataset into training and testing sets.
StandardScaler: for standardizing the features.
PCA: for performing Principal Component Analysis.
LogisticRegression: for building the logistic regression model.
accuracy_score: for evaluating the accuracy of the model.

In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

#Load the Dataset
We read the dataset from the CSV file '/content/data.csv' into a DataFrame called data and display five random rows with data.sample(5) to inspect it.

In [9]:
data = pd.read_csv('/content/data.csv')
data.sample(5)

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
563,926125,M,20.92,25.09,143.0,1347.0,0.1099,0.2236,0.3174,0.1474,...,29.41,179.1,1819.0,0.1407,0.4186,0.6599,0.2542,0.2929,0.09873,
348,898690,B,11.47,16.03,73.02,402.7,0.09076,0.05886,0.02587,0.02322,...,20.79,79.67,475.8,0.1531,0.112,0.09823,0.06548,0.2851,0.08763,
425,907367,B,10.03,21.28,63.19,307.3,0.08117,0.03912,0.00247,0.005159,...,28.94,69.92,376.3,0.1126,0.07094,0.01235,0.02579,0.2349,0.08061,
473,9113846,B,12.27,29.97,77.42,465.4,0.07699,0.03398,0.0,0.0,...,38.05,85.08,558.9,0.09422,0.05213,0.0,0.0,0.2409,0.06743,
307,89346,B,9.0,14.4,56.36,246.3,0.07005,0.03116,0.003681,0.003472,...,20.07,60.9,285.5,0.09861,0.05232,0.01472,0.01389,0.2991,0.07804,


In [10]:
data.sample(5)

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
133,867387,B,15.71,13.93,102.0,761.7,0.09462,0.09462,0.07135,0.05933,...,19.25,114.3,922.8,0.1223,0.1949,0.1709,0.1374,0.2723,0.07071,
514,91594602,M,15.05,19.07,97.26,701.9,0.09215,0.08597,0.07486,0.04335,...,28.06,113.8,967.0,0.1246,0.2101,0.2866,0.112,0.2282,0.06954,
416,905978,B,9.405,21.7,59.6,271.2,0.1044,0.06159,0.02047,0.01257,...,31.24,68.73,359.4,0.1526,0.1193,0.06141,0.0377,0.2872,0.08304,
364,9010877,B,13.4,16.95,85.48,552.4,0.07937,0.05696,0.02181,0.01473,...,21.7,93.76,663.5,0.1213,0.1676,0.1364,0.06987,0.2741,0.07582,
445,9110720,B,11.99,24.89,77.61,441.3,0.103,0.09218,0.05441,0.04274,...,30.36,84.48,513.9,0.1311,0.1822,0.1609,0.1202,0.2599,0.08251,


#Preprocess the Data
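We drop the columns 'id' (an identifier) and 'Unnamed: 32' (empty in the rows shown above) because they carry no predictive information. The 'diagnosis' column is the target variable, so it is removed from the feature matrix X and assigned to y.
We split the dataset into training and testing sets, with 80% for training and 20% for testing. Fixing random_state makes the split reproducible.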

In [11]:
# Drop unnecessary columns (like 'id' and 'Unnamed: 32') and separate features and target variable
X = data.drop(['id', 'diagnosis', 'Unnamed: 32'], axis=1)
y = data['diagnosis']

In [12]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
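As a small illustration of what `test_size` and `random_state` do, the sketch below splits a toy array (hypothetical data, 50 rows) twice with the same seed and checks the resulting sizes. It assumes nothing about the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): 50 samples, 2 features, alternating labels.
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.array([0, 1] * 25)

# test_size=0.2 holds out 20% of the rows for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Repeating the call with the same random_state gives the same split.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print("train shape:", X_tr.shape)
print("test shape:", X_te.shape)
print("same split on repeat:", np.array_equal(X_te, X_te2))
```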


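#Standardize the Features

We standardize the features using StandardScaler() so that each feature has zero mean and unit variance. This matters for PCA, which is sensitive to the scale of the features. The scaler is fitted on the training set only and then applied to the test set, so no test-set statistics leak into training.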

In [13]:
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
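To make the effect of this step concrete, here is a minimal sketch on hypothetical data with two features on very different scales. It checks that the scaled training features have mean 0 and standard deviation 1, and that the test set is transformed with the training statistics. The `_demo` names are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Hypothetical features on very different scales (illustrative only).
X_train_demo = rng.normal(loc=[10.0, 500.0], scale=[2.0, 150.0], size=(80, 2))
X_test_demo = rng.normal(loc=[10.0, 500.0], scale=[2.0, 150.0], size=(20, 2))

scaler_demo = StandardScaler()
X_train_demo_scaled = scaler_demo.fit_transform(X_train_demo)  # learns mean and std
X_test_demo_scaled = scaler_demo.transform(X_test_demo)        # reuses them

train_mean = X_train_demo_scaled.mean(axis=0)
train_std = X_train_demo_scaled.std(axis=0)

# The test set is scaled with the training mean and standard deviation.
manual_test = (X_test_demo - X_train_demo.mean(axis=0)) / X_train_demo.std(axis=0)

print("train mean after scaling:", np.round(train_mean, 6))
print("train std after scaling:", np.round(train_std, 6))
```

The scaled test set will generally not have exactly zero mean or unit variance, which is expected because its own statistics are never used.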


In [14]:
# Train a classification model without PCA
model_without_pca = LogisticRegression()
model_without_pca.fit(X_train_scaled, y_train)

In [15]:
# Predict on the testing set without PCA
y_pred_without_pca = model_without_pca.predict(X_test_scaled)

In [17]:
# Evaluate accuracy without PCA
accuracy_without_pca = accuracy_score(y_test, y_pred_without_pca)
print("Accuracy without PCA:", accuracy_without_pca)

Accuracy without PCA: 0.9736842105263158


In [18]:
# Perform PCA
pca = PCA(n_components=10)  # You can adjust the number of components as needed
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
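One way to choose the number of components, instead of fixing it at 10, is to look at the cumulative explained variance. The sketch below is an illustration under an assumption: it uses scikit-learn's bundled `load_breast_cancer` data as a stand-in for the CSV file, and fits on the whole stand-in set for simplicity. In the notebook the same idea would apply to `X_train_scaled`. The 95% threshold is an arbitrary example value.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data (an assumption): scikit-learn's bundled breast cancer dataset.
X_demo, _ = load_breast_cancer(return_X_y=True)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Fit PCA with all components and accumulate the explained variance ratios.
pca_full = PCA().fit(X_demo_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds 95%.
k_95 = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

# Shortcut: a float n_components asks PCA to pick that number itself.
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X_demo_scaled)

print("components needed for 95% variance:", k_95)
print("components chosen by PCA(n_components=0.95):", pca_95.n_components_)
```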

In [19]:
# Train a classification model with PCA
model_with_pca = LogisticRegression()
model_with_pca.fit(X_train_pca, y_train)

In [20]:
# Predict on the testing set with PCA
y_pred_with_pca = model_with_pca.predict(X_test_pca)

In [22]:
# Evaluate accuracy with PCA
accuracy_with_pca = accuracy_score(y_test, y_pred_with_pca)
print("Accuracy with PCA:", accuracy_with_pca)

Accuracy with PCA: 0.9824561403508771


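Comparing the two accuracies shows whether dimensionality reduction improves or degrades the performance of the model. Here logistic regression reaches an accuracy of about 0.974 on all standardized features and about 0.982 on the first 10 principal components. The gap is under one percentage point and comes from a single train/test split, so the result indicates that PCA preserved the information needed for classification rather than proving a real improvement.
PCA can help reduce overfitting, mitigate the curse of dimensionality, and improve computational efficiency, especially when the original dataset has a large number of features relative to the number of samples.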
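A comparison based on one split can be made more robust with cross-validation. The following is a sketch, not the notebook's method: it assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for the CSV file and wraps scaling, PCA, and the classifier in a pipeline so that each fold fits the scaler and PCA on its own training portion only. The `max_iter=1000` setting is an illustrative choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (an assumption): scikit-learn's bundled breast cancer dataset.
X_cv, y_cv = load_breast_cancer(return_X_y=True)

# Pipelines keep preprocessing inside each cross-validation fold.
pipe_without_pca = make_pipeline(
    StandardScaler(), LogisticRegression(max_iter=1000)
)
pipe_with_pca = make_pipeline(
    StandardScaler(), PCA(n_components=10), LogisticRegression(max_iter=1000)
)

# 5-fold cross-validated accuracy for each pipeline.
scores_without = cross_val_score(pipe_without_pca, X_cv, y_cv, cv=5)
scores_with = cross_val_score(pipe_with_pca, X_cv, y_cv, cv=5)

print("without PCA: mean %.4f, std %.4f" % (scores_without.mean(), scores_without.std()))
print("with PCA:    mean %.4f, std %.4f" % (scores_with.mean(), scores_with.std()))
```

If the difference between the two means is smaller than the spread across folds, the two models are best treated as performing equally well.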