Principal Component Analysis
a. Apply PCA on CC dataset.
b. Apply the k-means algorithm to the PCA result and report whether the silhouette score has improved.
c. Perform Scaling+PCA+K-Means and report performance.


In [2]:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
import numpy as np

def load_data(file_path):
    # Load the dataset from the path supplied by the caller
    cc_data = pd.read_csv(file_path)
    return cc_data

def preprocess_data(cc_data):
    # Drop the 'CUST_ID' column
    cc_data = cc_data.drop(columns=['CUST_ID'])

    # Handle missing values by filling them with the mean of the column
    imputer = SimpleImputer(strategy='mean')
    cc_data_imputed = imputer.fit_transform(cc_data)

    return cc_data_imputed

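def apply_pca(data, variance_threshold=0.95):
    # Fit a full PCA first to see how much variance each component explains
    pca = PCA()
    pca.fit(data)
    cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
    # Smallest number of components whose cumulative variance reaches the threshold
    num_components = np.argmax(cumulative_variance >= variance_threshold) + 1
    # Refit with only that many components and project the data
    pca = PCA(n_components=num_components)
    data_pca_reduced = pca.fit_transform(data)

    return data_pca_reduced, num_components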

def kmeans_clustering(data, n_clusters=3):
    # Cluster the data and score the result with the mean silhouette coefficient
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    labels = kmeans.fit_predict(data)
    silhouette_avg = silhouette_score(data, labels)

    return silhouette_avg

def main(file_path):
    cc_data = load_data(file_path)
    cc_data_imputed = preprocess_data(cc_data)

    # Apply PCA
    cc_data_pca_reduced, num_components = apply_pca(cc_data_imputed)
    silhouette_avg_pca = kmeans_clustering(cc_data_pca_reduced)

    # Perform Scaling + PCA + K-means
    scaler = StandardScaler()
    cc_data_scaled = scaler.fit_transform(cc_data_imputed)
    cc_data_scaled_pca, _ = apply_pca(cc_data_scaled, variance_threshold=0.95)
    silhouette_avg_scaled_pca = kmeans_clustering(cc_data_scaled_pca)

    print(f"Silhouette Score (PCA): {silhouette_avg_pca}")
    print(f"Silhouette Score (Scaling + PCA): {silhouette_avg_scaled_pca}")

if __name__ == "__main__":
    file_path = '/mnt/data/CC GENERAL.csv'  # Adjust the file path as needed
    main(file_path)




Silhouette Score (PCA): 0.4774953485130042
Silhouette Score (Scaling + PCA): 0.25421032809181465
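
Part (b) asks whether the silhouette score improved, which needs a baseline score on the un-reduced data; the cell above reports only the two PCA variants. A minimal sketch of the three-way comparison follows. It runs on synthetic `make_blobs` data as a stand-in for the CC dataset; the data, the helper name `silhouette_for`, and the explicit `n_init=10` are assumptions made for illustration, so the numbers will not match the ones above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def silhouette_for(data, n_clusters=3):
    """Cluster `data` with k-means and return the mean silhouette score."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    labels = kmeans.fit_predict(data)
    return silhouette_score(data, labels)

# Synthetic stand-in for the CC data (assumption): 3 blobs in 10 dimensions,
# with one feature inflated so that scaling makes a visible difference.
X, _ = make_blobs(n_samples=300, n_features=10, centers=3, random_state=0)
X[:, 0] *= 100.0

# A float n_components keeps enough components to explain that share of variance.
X_pca = PCA(n_components=0.95).fit_transform(X)
X_scaled_pca = PCA(n_components=0.95).fit_transform(StandardScaler().fit_transform(X))

scores = {
    "raw": silhouette_for(X),
    "PCA": silhouette_for(X_pca),
    "scaling + PCA": silhouette_for(X_scaled_pca),
}
for name, score in scores.items():
    print(f"Silhouette Score ({name}): {score:.4f}")
```

Each score is computed in a different feature space, so a higher value after PCA or scaling indicates tighter clusters in that space rather than a strictly better segmentation of the original data.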


Use pd_speech_features.csv
a. Perform Scaling
b. Apply PCA (k=3)
c. Use SVM to report performance

In [1]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Load the dataset
file_path = 'pd_speech_features.csv'
df = pd.read_csv(file_path)

# Separate features and target
X = df.drop(columns=['class'])
y = df['class']

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

# Train and evaluate an SVM model using cross-validation
svm = SVC(kernel='linear', random_state=42)
scores = cross_val_score(svm, X_pca, y, cv=5)

# Report the performance
mean_score = scores.mean()
std_score = scores.std()

print(f"Mean Accuracy: {mean_score:.4f}")
print(f"Standard Deviation: {std_score:.4f}")

Mean Accuracy: 0.7751
Standard Deviation: 0.0199
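
In the cell above, the scaler and PCA are fitted on the full dataset before cross-validation, so every validation fold influences the preprocessing. One common way to avoid this is to wrap the steps in a scikit-learn pipeline, which refits them on the training portion of each fold. A minimal sketch, using synthetic `make_classification` data as a stand-in for `pd_speech_features.csv` (the data and its dimensions are assumptions, so the scores are not comparable to those above):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in (assumption): 300 samples, 20 features, binary target.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=42
)

# Scaling -> PCA (k=3) -> linear SVM, refitted inside each CV fold.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=3),
    SVC(kernel="linear", random_state=42),
)
scores = cross_val_score(model, X, y, cv=5)

print(f"Mean Accuracy: {scores.mean():.4f}")
print(f"Standard Deviation: {scores.std():.4f}")
```

With the real data, `X` and `y` would be the feature frame and the `class` column from the cell above, passed to the pipeline unscaled.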


Apply Linear Discriminant Analysis (LDA) on the Iris.csv dataset to reduce the dimensionality of the data to k=2.

In [3]:
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Load the dataset
iris_data = pd.read_csv('Iris.csv')

# Separate features and labels
X = iris_data.iloc[:, 1:-1]  # Features (excluding the first column and the last column)
y = iris_data.iloc[:, -1]    # Labels (last column)

# Apply LDA
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

# Create a DataFrame with LDA results
df_lda = pd.DataFrame(data=X_lda, columns=['LDA1', 'LDA2'])
df_lda['Class'] = y

# Save the LDA results to a CSV file
df_lda.to_csv('Iris_LDA_Result.csv', index=False)

# Display the first few rows of the LDA result
print(df_lda.head())

       LDA1      LDA2        Class
0  8.084953 -0.328454  Iris-setosa
1  7.147163  0.755473  Iris-setosa
2  7.511378  0.238078  Iris-setosa
3  6.837676  0.642885  Iris-setosa
4  8.157814 -0.540639  Iris-setosa


Briefly identify the difference between PCA and LDA


Both PCA and LDA are dimensionality reduction techniques, but they differ in goal and method:

Principal Component Analysis (PCA):
- Transforms the data into a new set of uncorrelated variables called principal components, ordered so that each captures as much of the remaining variance as possible.
- Unsupervised technique: class labels are not used.
- Maximizes the variance captured in the data without considering class separation.

Linear Discriminant Analysis (LDA):
- Reduces the dimensionality of the data while maximizing the separation between classes, projecting the data onto a lower-dimensional space with good class separability.
- Supervised technique: uses the class labels to find the most discriminative directions.
- Maximizes between-class scatter relative to within-class scatter, so it yields at most (number of classes - 1) components.
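
The contrast shows up directly in the scikit-learn API: PCA is fitted on the features alone, while LDA also needs the labels. A small sketch, assuming scikit-learn's bundled `load_iris` copy of the Iris data rather than the `Iris.csv` file used above:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA is unsupervised: it only sees the features.
X_pca = PCA(n_components=2).fit_transform(X)

# LDA is supervised: it needs the class labels as well.
# With 3 classes it can produce at most 3 - 1 = 2 components.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)
```

Both projections have the same shape, but the axes mean different things: the PCA axes follow the directions of largest overall variance, whereas the LDA axes follow the directions that best separate the three species.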