<a href="https://www.kaggle.com/code/hossamahmedsalah/ica-pca-k-means-msp?scriptVersionId=143240799" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<div style="padding: 35px;color:white;margin:10;font-size:200%;text-align:center;display:fill;border-radius:10px;overflow:hidden;background-image: url(https://github.com/hossamAhmedSalah/Machine_Learning_MSP/blob/main/Assets/247.jpg?raw=true?)">
<b>
<span style='color:skyblue'>MSP Machine Learning workshop 2023 </span>
</b>
<div>
<span style='color:Salmon'>Unsupervised Learning</span>

</div>

</div>

<br>


# You can view other sessions via 
[GitHub - hossamAhmedSalah/Machine_Learning_MSP: MSP 23 workshop of machine learning](https://github.com/hossamAhmedSalah/Machine_Learning_MSP/tree/main)
![MSP Logo](https://github.com/hossamAhmedSalah/Machine_Learning_MSP/blob/main/Assets/image-removebg-preview.png?raw=true)

<a id="key"></a>
# Table of Contents
1) [ICA](#1)
2) [Mixed Music dataset🎵](#2)
   - 2.1 [How to deal with Audio🔉](#21)
3) [Visualizing Audio Files🎵](#3)
   - 3.1 [Listen to the file 👂](#31)
   - 3.2 [Reading 2 other music files](#32)
4) [Summary🎶](#4)
5) [Creating the dataset from the music files](#5)
6) [Applying the ICA](#6)
   - 6.1 [Results of the ICA](#6.1)
7) [PCA](#7)
8) [MNIST dataset 0️⃣9️⃣](#8)
9) [KMeans](#9)

# <span style="color : skyblue" id="1">ICA</span>
> Independent component analysis

>Independent Component Analysis (ICA) is a computational technique used in signal processing and data analysis to separate a multivariate signal into additive, statistically independent components. It is a method for blind source separation, which means it can separate a set of mixed signals into their original source signals without knowing the characteristics of the sources or the mixing process.

ICA has applications in various fields, including:

1. **Audio Source Separation**: Separating different sound sources (e.g., instruments or voices) from a mixed audio recording.
2. **Image Processing**: Extracting independent features or sources from mixed images.
3. **Neuroscience**: Analyzing brain signals (e.g., EEG or fMRI data) to identify independent neural sources.
4. **Financial Data Analysis**: Separating underlying factors or market trends from financial time series data.
5. **Speech Recognition**: Separating different speakers' voices in a recorded conversation.
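
To make the idea of blind source separation concrete, here is a minimal sketch on synthetic data. The two sources, the noise level, and the mixing matrix `A` are all made up for illustration; they are not part of the workshop dataset. The sketch mixes two independent signals and asks scikit-learn's `FastICA` to recover them.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)

# Two hypothetical, statistically independent sources
s1 = np.sin(2 * t)               # sinusoid
s2 = np.sign(np.sin(3 * t))      # square wave
S = np.c_[s1, s2]
S += 0.05 * rng.standard_normal(S.shape)  # a little sensor noise

# Hypothetical mixing matrix: each "microphone" hears both sources
A = np.array([[1.0, 0.5],
              [0.5, 2.0]])
X = S @ A.T                      # observed mixtures, shape (2000, 2)

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)     # estimated sources, shape (2000, 2)

# ICA recovers sources only up to order, sign, and scale, so compare
# with absolute correlations instead of raw values.
corr = np.abs(np.corrcoef(S.T, S_est.T)[:2, 2:])
best_match = corr.max(axis=1)
print(best_match)
```

Because the order, sign, and scale of the recovered components are arbitrary, the sketch checks the result with correlations rather than by comparing values directly.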

# <span style="color:skyblue" id="2">Mixed Music dataset🎵</span>
We have three WAVE files, each of which is a mix of several sound sources.

## <h2 id="21">How to deal with Audio🔉</h2>
Dealing with audio files in Python involves various tasks such as reading, writing, processing, and analyzing audio data. To work with audio files in Python, you can use libraries like `librosa`, `soundfile`, and `pydub`. 

You can process audio data using libraries like `numpy` . For example, you can apply filters, perform Fourier transforms, or manipulate audio signals.

```python
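import numpy as np

# Placeholder samples standing in for real audio (floats in [-1, 1])
audio_data = np.array([0.1, -0.2, 0.3])

# Example: apply a simple gain (volume) change
audio_data *= 2.0  # Increase volume by a factor of 2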

```
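
As an example of the Fourier transforms mentioned above, the following sketch finds the dominant frequency of a signal with `numpy.fft`. The 8000 Hz sample rate and the 440 Hz test tone are assumptions chosen for the demo, not values from the workshop files.

```python
import numpy as np

fs = 8000                                  # assumed sample rate in Hz
t = np.arange(fs) / fs                     # one second of sample times
tone = 0.5 * np.sin(2 * np.pi * 440 * t)   # synthetic 440 Hz tone

# Magnitude spectrum of a real-valued signal and the matching frequency axis
spectrum = np.abs(np.fft.rfft(tone))
freqs = np.fft.rfftfreq(len(tone), d=1 / fs)

peak_hz = freqs[np.argmax(spectrum)]
print(peak_hz)
```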
    
- **Audio Playback**:
    
    To play audio directly from Python, you can use the `pydub` library. Install it first:
    
    ```bash
    pip install pydub
    
    ```
    
    Then, you can play audio like this:
    
    ```python
    from pydub import AudioSegment
    from pydub.playback import play
    
    audio = AudioSegment.from_file("your_audio.wav")
    play(audio)
    
    ```
    

In [None]:
from pydub import AudioSegment
import IPython
import numpy as np
import wave

In [None]:
audio1 = wave.open('/kaggle/input/icamusical/ICA mix 1.wav', 'r')

In [None]:
audio1.getparams()

1. **Mono (1 Channel)**:
    - Mono audio contains a single audio channel.
    - It is often used for voice recordings, podcasts, and some types of music.
2. **Stereo (2 Channels)**:
    - Stereo audio consists of two audio channels: a left channel and a right channel.
    - Stereo is used for most music recordings, as it allows for the creation of a spatial or directional audio experience.
3. **Multichannel (More than 2 Channels)**:
    - Multichannel audio can have more than two channels, such as 5.1 (five primary channels plus a subwoofer channel) or 7.1 (seven primary channels plus a subwoofer channel).
    - It is commonly used in home theater systems, surround sound setups, and advanced audio production.
>https://youtu.be/48ExvlterwY?si=52cWXKOgUgMGbaGb 📺 this video can be helpful
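
In a PCM WAV file with more than one channel, the samples of each frame are stored interleaved (left, right, left, right, and so on). The sketch below uses a small made-up array to show how such data could be split into channels with a reshape. The files in this notebook are mono, so this step is not needed for them.

```python
import numpy as np

# Hypothetical interleaved stereo frames: L0, R0, L1, R1, ...
interleaved = np.array([10, -10, 20, -20, 30, -30], dtype=np.int16)

n_channels = 2
frames = interleaved.reshape(-1, n_channels)   # one row per frame
left, right = frames[:, 0], frames[:, 1]

# A simple mono downmix: average the channels in a wider dtype to avoid overflow
mono = frames.astype(np.int32).mean(axis=1).astype(np.int16)
print(left, right, mono)
```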

1. **`nchannels`**: This parameter specifies the number of audio channels in the audio file. In this case, it's set to **`1`**, indicating that the audio is monaural (single channel).
2. **`sampwidth`**: **`sampwidth`** specifies the sample width or the number of bytes used to represent each audio sample. A value of **`2`** typically represents 16-bit audio, which is a common format for CD-quality audio.
3. **`framerate`**: **`framerate`** indicates the number of samples (audio data points) recorded per second. In this case, it's set to **`44100`**, which is the standard sample rate for audio CDs and is often used for high-quality audio recordings.
4. **`nframes`**: **`nframes`** represents the total number of audio frames in the audio file. In this case, it's set to **`264515`**, which means the audio file contains 264,515 audio frames (for mono audio, one frame is one sample).
5. **`comptype`**: This parameter specifies the compression type used for the audio file. In this case, it's **`'NONE'`**, indicating that the audio is not compressed. Uncompressed audio is common for high-quality audio recordings.
6. **`compname`**: **`compname`** provides the name of the compression scheme used, and in this case, it's **`'not compressed'`**, which corresponds to the fact that the audio is not compressed.

>So this file has only one channel, which means it is mono sound.
>It has a frame rate of 44100, which means each second of sound is represented by 44100 integers (integers because the file is in the common 16-bit PCM format).
>The file has a total of 264515 frames, so its length in seconds is: `264515/44100 = 5.998072562358277`
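
The same duration calculation can be sketched end to end with the standard-library `wave` module. The example writes a short silent file to a temporary directory (the file name and the half-second length are assumptions made for the demo) and then reads its parameters back.

```python
import os
import tempfile
import wave

import numpy as np

fs = 44100
samples = np.zeros(fs // 2, dtype=np.int16)   # half a second of silence

# Write a small mono, 16-bit WAV file to a temporary location
path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(path, "wb") as w:
    w.setnchannels(1)      # mono
    w.setsampwidth(2)      # 2 bytes per sample = 16-bit
    w.setframerate(fs)
    w.writeframes(samples.tobytes())

# Read the parameters back and derive the duration
with wave.open(path, "rb") as r:
    params = r.getparams()

duration_s = params.nframes / params.framerate
print(params.nchannels, params.sampwidth, params.framerate, duration_s)
```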

In [None]:
# extract the frames of the wave file, which will be a part of the dataset we'll run ICA against
'''
audio1.readframes(-1) reads all the audio frames from the audio1 object. 
The -1 argument means to read all frames until the end of the file.
 This line retrieves the audio data in its raw binary form.  
'''
signal_1_raw = audio1.readframes(-1)
'''
np.frombuffer(signal_1_raw, dtype=np.int16) creates a NumPy array signal_1 from the binary data signal_1_raw and specifies the data type as np.int16, 
which corresponds to 16-bit signed integers. 
'''
signal_1 = np.frombuffer(signal_1_raw, dtype=np.int16)
# signal_1 will contain the audio data as a NumPy array of 16-bit integers, which can be easily manipulated, analyzed, or visualized 

In [None]:
len(signal_1)

In [None]:
signal_1[:100]

# <h1 style="color:skyblue" id="3">Visualizing Audio Files🎵</h1>

In [None]:
import matplotlib.pyplot as plt
'''
audio1.getframerate() retrieves the sample rate (in Hz) of the audio file represented by the audio1 object and assigns it to the variable fs.
The sample rate represents the number of audio samples recorded per second.  
'''
fs = audio1.getframerate()
'''
len(signal_1) calculates the length of the signal_1 NumPy array, which corresponds to the number of audio samples in the audio file.
np.linspace(0, len(signal_1)/fs, num=len(signal_1)) creates an array timing containing evenly spaced time values.
 These time values correspond to the time axis of the audio waveform plot. The range of time values is determined by the length of the audio file and the sample rate 
'''
timing = np.linspace(0, len(signal_1)/fs, num=len(signal_1))

plt.figure(figsize=(12,2))
plt.title('Audio 1')
plt.plot(timing,signal_1, c="salmon")
plt.ylim(-35000, 35000)
plt.show()

- **`np.linspace(start, stop, num)`**: This NumPy function generates an array of evenly spaced values over a specified range.
    - **`start`**: The starting value of the sequence.
    - **`stop`**: The end value of the sequence.
    - **`num`**: The number of evenly spaced values to generate.

In the code above:

- **`start`** is set to **`0`**, indicating that you want the time values to start from 0 seconds.
- **`stop`** is calculated as **`len(signal_1)/fs`**, where:
    - **`len(signal_1)`** is the length of the **`signal_1`** array, which corresponds to the number of audio samples.
    - **`fs`** is the sample rate of the audio, which represents the number of samples per second.
    - **`len(signal_1)/fs`** calculates the duration of the audio in seconds. It's the total number of audio samples divided by the sample rate, resulting in the duration of the audio recording in seconds.
- **`num`** is set to **`len(signal_1)`**, which specifies that you want to generate as many time values as there are audio samples in the **`signal_1`** array. Each time value corresponds to a specific audio sample.
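
One subtlety worth knowing: by default `np.linspace` includes the `stop` value, so the last time stamp lands at `len(signal)/fs` rather than at `(len(signal) - 1)/fs`. For plotting a long recording the difference is negligible, but the sketch below (with a deliberately tiny, made-up sample rate) shows how `endpoint=False` gives time stamps that match `k / fs` exactly.

```python
import numpy as np

fs = 4   # tiny assumed sample rate so the values are easy to read
n = 8    # number of samples

t_default = np.linspace(0, n / fs, num=n)                  # endpoint included
t_exact = np.linspace(0, n / fs, num=n, endpoint=False)    # sample k at k / fs
t_arange = np.arange(n) / fs                               # equivalent formulation

print(t_default[-1], t_exact[-1])
```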

## <h2 style="color:skyblue" id ="31">Listen to the file 👂</h2>

In [None]:
IPython.display.Audio('/kaggle/input/icamusical/ICA mix 1.wav') # 🎹 piano

## <h2 style="color:skyblue" id ="32">Reading 2 other music files</h2>

In [None]:
# audio2
audio2 = wave.open('/kaggle/input/icamusical/ICA mix 2.wav', 'r')

# Extract the frames
signal_2_raw = audio2.readframes(-1)
signal_2 = np.frombuffer(signal_2_raw, dtype=np.int16)

# plotting 🔉 
plt.figure(figsize=(12,2))
plt.plot(timing, signal_2, c="skyblue")
plt.title("Audio 2")
plt.ylim(-35000, 35000)


In [None]:
max(signal_2), min(signal_2)

In [None]:
# listening 👂 
IPython.display.Audio('/kaggle/input/icamusical/ICA mix 2.wav') 

In [None]:
# audio 3
audio3 = wave.open('/kaggle/input/icamusical/ICA mix 3.wav', 'r')

# Extract the frames
signal_3_raw = audio3.readframes(-1)
signal_3 = np.frombuffer(signal_3_raw, dtype=np.int16)

# plotting 
plt.figure(figsize=(12,2))
plt.plot(timing, signal_3, c="lightgreen")
plt.title("Audio 3")
plt.ylim(-35000, 35000)


In [None]:
# Listening 👂 
IPython.display.Audio('/kaggle/input/icamusical/ICA mix 3.wav')

## <h2 style="color:skyblue" id="4">Summary🎶</h2>

In [None]:
ax = [None for _ in range(3)]
plt.figure(figsize=(20, 10))


ax[0] = plt.subplot2grid((3,1), (0,0))
ax[0].plot(timing, signal_1, c="salmon")
ax[0].set_title("Audio 1")

ax[1] = plt.subplot2grid((3,1), (1,0))
ax[1].plot(timing, signal_2, c="skyblue")
ax[1].set_title("Audio 2")

ax[2] = plt.subplot2grid((3,1), (2,0))
ax[2].plot(timing, signal_3, c="lightgreen")
ax[2].set_title("Audio 3")

# setting the ylim 
ylim = (-35000, 35000)
for i in range(3):
    ax[i].set_ylim(ylim)

In [None]:
from IPython.display import Audio, display

# List of audio file paths
audio_files = ['/kaggle/input/icamusical/ICA mix 1.wav', '/kaggle/input/icamusical/ICA mix 2.wav', '/kaggle/input/icamusical/ICA mix 3.wav']

# Display audio files vertically
for audio_file in audio_files:
    print(audio_file)
    display(Audio(audio_file))

# <h1 style="color:skyblue" id="5">Creating the dataset from the music files</h1>

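Example 1: stacking 1-D arrays gives one row per array.
```python
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
np.vstack((a, b))
# array([[1, 2, 3],
#        [4, 5, 6]])
```
Example 2: stacking column vectors appends the rows of the second below the first.
```python
a = np.array([[1], [2], [3]])
b = np.array([[4], [5], [6]])
np.vstack((a, b))
# array([[1],
#        [2],
#        [3],
#        [4],
#        [5],
#        [6]])
```
Reference: https://numpy.org/doc/stable/reference/generated/numpy.vstack.html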
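
`np.vstack` requires all 1-D inputs to have the same length. The three recordings in this notebook appear to share one length, but as a hedged sketch for the general case, the signals could be truncated to the shortest one before stacking. The arrays below are made up, and truncation is only one possible choice.

```python
import numpy as np

# Hypothetical signals of slightly different lengths
a = np.array([1, 2, 3, 4, 5])
b = np.array([6, 7, 8, 9])
c = np.array([10, 11, 12, 13, 14, 15])

# Truncate to the shortest signal, stack as rows, then transpose so that
# rows are time samples and columns are signals: (n_samples, n_signals)
n = min(len(a), len(b), len(c))
X_demo = np.vstack((a[:n], b[:n], c[:n])).T
print(X_demo.shape)
```

The `(n_samples, n_features)` layout is what scikit-learn estimators such as `FastICA` expect.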

![Alt text](image.png)



In [None]:

# Combine the audio signals into a 2D NumPy array (X)
X = np.vstack((signal_1, signal_2, signal_3))

# Transpose the array so that each column corresponds to a signal and each row to a time sample
X = X.T

# <h1 style="color:skyblue" id="6">Applying the ICA</h1>

In [None]:
from sklearn.decomposition import FastICA

# Create an ICA model
ica = FastICA(n_components=3, random_state=0)
#ica = FastICA(n_components=3, random_state=0, whiten='unit-variance') # to silence the warning


# Fit the model to the mixed signals
sources = ica.fit_transform(X)

In [None]:
# sources now contains the result of FastICA, which we hope are the original signals. Its shape is:
sources.shape

## <h2 style="color:skyblue" id="6.1">Results of the ICA</h2>

In [None]:
# Let's split into separate signals and look at them
source_1 = sources[:, 0]
source_2 = sources[:, 1]
source_3 = sources[:, 2]

In [None]:
# plotting the results
ax = [None for _ in range(3)]
plt.figure(figsize=(20, 10))

# Independent Component #1
ax[0] = plt.subplot2grid((3,1), (0,0))
ax[0].plot(source_1, c="salmon")
ax[0].set_title("Source 1")

# Independent Component #2
ax[1] = plt.subplot2grid((3,1), (1,0))
ax[1].plot(source_2, c="skyblue")
ax[1].set_title("Source 2")

# Independent Component #3
ax[2] = plt.subplot2grid((3,1), (2,0))
ax[2].plot(source_3, c="lightgreen")
ax[2].set_title("Source 3")

# setting the ylim 
ylim = -0.010, 0.010
for i in range(3):
    ax[i].set_ylim(ylim)

In [None]:
sources

In [None]:
# Convert the separated audio signals to 16-bit signed integers
result_source_1_int = (source_1 * 32767 * 100).astype(np.int16)
result_source_2_int = (source_2 * 32767 * 100).astype(np.int16)
result_source_3_int = (source_3 * 32767 * 100).astype(np.int16)
'''
It scales the signal values by multiplying them by 32767*100. This scaling is done to map the float values 
to the range of a 16-bit signed integer (-32768 to 32767). Additionally, 
it multiplies by 100 to increase the volume of the signals.  🔊 🆙 
'''

In [None]:
from scipy.io import wavfile
# Writing wave files
wavfile.write("source1.wav", fs, result_source_1_int)
wavfile.write("source2.wav", fs, result_source_2_int)
wavfile.write("source3.wav", fs, result_source_3_int)

In [None]:
# List of audio file paths
audio_files = ['source1.wav', 'source2.wav', 'source3.wav']

# Display audio files vertically
for audio_file in audio_files:
    print(audio_file)
    display(Audio(audio_file))

>ICA becomes particularly valuable when you have multiple mixed signals (e.g., multiple microphones recording audio in a room with multiple sound sources) and you want to separate those mixed signals into their original independent sources (e.g., separating different speakers' voices from a microphone array).
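
A note on the conversion to 16-bit integers used earlier: the fixed factor `32767 * 100` works when the ICA outputs happen to stay within about ±0.01, but the scale of ICA components is arbitrary. A more defensive alternative, sketched here with a hypothetical helper `to_int16` (not part of the original notebook), is to normalize each source by its own peak before converting.

```python
import numpy as np

def to_int16(signal, peak=0.9):
    """Scale a float signal so its largest absolute value maps to `peak` of int16 full scale."""
    signal = np.asarray(signal, dtype=np.float64)
    max_abs = np.max(np.abs(signal))
    if max_abs == 0:
        # An all-zero signal stays silent
        return np.zeros(signal.shape, dtype=np.int16)
    scaled = signal / max_abs * peak * 32767
    return np.round(scaled).astype(np.int16)

demo = to_int16([0.0, 0.005, -0.01])
print(demo)
```

With this approach the output cannot overflow the int16 range, whatever scale the separated sources come out at.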

# <h1 style="color: skyblue" id="7">PCA</h1>

# <h1 style="color: skyblue" id="8">MNIST dataset 0️⃣9️⃣</h1>

In [None]:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
import matplotlib.image as mpimg
from PIL import Image
import seaborn as sns

In [None]:
# train 
train = pd.read_csv('/kaggle/input/digit-recognizer/train.csv')
# test 
test = pd.read_csv('/kaggle/input/digit-recognizer/test.csv')
# train shape and test shape
f'train shape {train.shape}', f'test shape {test.shape}'

In [None]:
train.columns

In [None]:
train.head()

In [None]:
train.sample(1)

In [None]:
train.describe()

In [None]:
test.describe()

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()

In [None]:
# splitting train into X, y
X = train.drop(columns=['label'])
y = train['label']

## Functions

In [None]:
def show_images(n, dataset=X,  MAX_IMGS=300):
    num_cols = 10
    if n % num_cols == 0 and n <= MAX_IMGS:
        images = dataset.iloc[:n].values.reshape(-1, 28, 28)
        num_rows = n // num_cols
        fig, ax = plt.subplots(num_rows, num_cols, figsize=(num_cols, num_rows))
        for i in range(num_rows):
            for j in range(num_cols):
                ax[i, j].imshow(images[i * num_cols + j], cmap='gray')
                ax[i, j].axis('off')
        plt.show()
    else:
        print('Invalid number of images')

In [None]:
show_images(100)

In [None]:
def show_digits(digit, dataset=X):
    if digit in range(10):
        digit_indices = np.where(y == digit)[0]
        
        for i in range(50):  # Display the first 50 images of the digit
            plt.subplot(5, 10, i + 1)
            imdata = dataset.iloc[digit_indices[i]].values.reshape(28, 28)
            plt.imshow(imdata, cmap='gray')
            plt.xticks([])
            plt.yticks([])

In [None]:
show_digits(9)

In [None]:
show_digits(6)

In [None]:
show_digits(5)

In [None]:
import joblib
def fit_forest(X, y, save = (False, 'model_digitREC'), plot =True):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=33)
    clf = RandomForestClassifier(n_estimators=130, max_depth=None)
    clf.fit(X_train, y_train)
    # predictions
    y_pred = clf.predict(X_test)
    # scoring
    mat = confusion_matrix(y_test, y_pred)
    if plot:
       plt.figure(figsize=(8,8), dpi=170)
       sns.heatmap(mat, annot=True, linewidths=0.5, cmap='Blues',fmt='d')
       plt.show()
    else:
       print(mat)
    acc = accuracy_score(y_test, y_pred)
    print(acc)
    if save[0]:
        joblib.dump(clf, f'{save[1]}.joblib')
    return acc

In [None]:
fit_forest(X,y)

In [None]:
def do_pca(n_component, dataset):
    X = StandardScaler().fit_transform(dataset)
    pca = PCA(n_components=n_component)
    x_pca = pca.fit_transform(X)
    return pca, x_pca

In [None]:
pca, X_pca = do_pca(2, X)

In [None]:
pca

In [None]:
X_pca

In [None]:
f"X shape {X.shape}", f"X_PCA shape {X_pca.shape}"

In [None]:
fit_forest(X_pca,y)

In [None]:
def plot_components(X, y):
   
    x_min, x_max = np.min(X, 0), np.max(X, 0)
    X = (X - x_min) / (x_max - x_min)
    plt.figure(figsize=(10, 6))
    for i in range(X.shape[0]):
        plt.text(X[i, 0], X[i, 1], str(y[i]), color=plt.cm.Set1(y[i]), fontdict={'size': 15})

    plt.xticks([]), plt.yticks([]), plt.ylim([-0.1,1.1]), plt.xlim([-0.1,1.1])

In [None]:
plot_components(X_pca[:100], y[:100])

In [None]:
accls = [] # accuracy list
def tryPCA(start, end, step, clear = False):
    if clear:
        accls.clear()
    for reduced_featured in range(start, end, step):
        print(f"number of features {reduced_featured}")
        pca, x_pca = do_pca(reduced_featured, X)
        accls.append(fit_forest(x_pca, y, plot=False))

In [None]:
X.shape

In [None]:
# Try progressively more PCA components and see how few features still give good accuracy.
# The original X has 784 features and gave about 96% accuracy;
# with only 18 components we could reach about 91%,
# and with 38 components about 94%.
# TODO: try other ranges and see if you can do better 🫣💫
tryPCA(3, 40, 5)
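
Besides trying component counts one by one, PCA itself reports how much variance each component explains. The sketch below is an illustration only: it uses scikit-learn's small bundled 8x8 digits dataset (`load_digits`) as a stand-in for the Kaggle MNIST files, and the 95% variance target is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Small bundled digits dataset: 64 pixel features per image
digits = load_digits()
X_small = StandardScaler().fit_transform(digits.data)

# A float n_components asks PCA to keep enough components
# to explain at least that fraction of the variance
pca_95 = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca_95.fit_transform(X_small)

cumulative = np.cumsum(pca_95.explained_variance_ratio_)
print(pca_95.n_components_, cumulative[-1])
```

The resulting number of components can then serve as a starting point for a search like `tryPCA`.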

# <h1 style="color:skyblue" id="9">KMeans</h1>

In [None]:
import pandas as pd
customers = pd.read_csv('/kaggle/input/customer-segmentation-tutorial-in-python/Mall_Customers.csv')

In [None]:
customers.head()

In [None]:
customers.info()

In [None]:
customers.drop(columns='CustomerID').describe()

In [None]:
customers.describe(include='O')

In [None]:
import seaborn as sns
sns.pairplot(customers.drop(columns='CustomerID'), hue='Gender');

In [None]:
customers.drop(columns='CustomerID', inplace=True)

In [None]:
# Encoding
customers['Gender'] = customers['Gender'].astype('category').cat.codes
customers['Gender'].value_counts()

In [None]:
customers.columns

In [None]:
numCols = customers.select_dtypes(include='int').columns
numCols

In [None]:
# Scaling 
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler

In [None]:
dm = customers.copy()  # copy so that scaling does not modify the original dataframe
s = MinMaxScaler()
for col in numCols:
    dm[col] = s.fit_transform(dm[col].to_numpy().reshape(-1,1))
dm

In [None]:
dr = customers.copy()  # copy so that scaling does not modify the original dataframe
s = RobustScaler()
for col in numCols:
    dr[col] = s.fit_transform(dr[col].to_numpy().reshape(-1,1))
dr

In [None]:
ds = customers.copy()  # copy so that scaling does not modify the original dataframe
s = StandardScaler()
for col in numCols:
    ds[col] = s.fit_transform(ds[col].to_numpy().reshape(-1,1))
ds
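
A plain assignment such as `scaled = raw` only creates a second name for the same dataframe, so scaling through one name changes the other as well. The sketch below, using a small made-up dataframe, shows the difference between an alias and a copy, and scales several columns in one call.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data
raw = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "Income": [15.0, 60.0, 105.0]})

alias = raw           # same object under a second name
scaled = raw.copy()   # independent copy

# Scale both columns at once on the copy; raw stays untouched
cols = ["Age", "Income"]
scaled[cols] = StandardScaler().fit_transform(scaled[cols])

print(alias is raw, raw["Age"].tolist(), scaled["Age"].round(3).tolist())
```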

In [None]:
sns.pairplot(data=ds)

In [None]:
from sklearn.cluster import KMeans

In [None]:
import plotly.express as px
import plotly.graph_objs as go
from plotly.subplots import make_subplots

In [None]:
def plot_3d_clusters(df, feature_cols, num_clusters):
    # extract the feature columns from the dataframe
    X = df[feature_cols]

    # perform K-means clustering with the specified number of clusters
    kmeans = KMeans(n_clusters=num_clusters, random_state=42).fit(X)
    # keep the labels in a local variable so the input dataframe is not modified
    # (a stored 'cluster' column would leak into later clustering runs on df)
    labels = kmeans.labels_

    # create 3D scatter plot of the clustered data
    fig = go.Figure(data=[go.Scatter3d(
        x=df[feature_cols[0]],
        y=df[feature_cols[1]],
        z=df[feature_cols[2]],
        mode='markers',
        marker=dict(
            color=labels,
            size=10,
            opacity=0.8
        )
    )])

    fig.update_layout(scene=dict(
        xaxis_title=feature_cols[0],
        yaxis_title=feature_cols[1],
        zaxis_title=feature_cols[2]
    ), title=f'K-means Clustering {num_clusters} Clusters 3D Scatter Plot')

    return fig

In [None]:
def createKmeanModels(data, max_k, min_k=1):
    # fit one model for every k from min_k up to and including max_k
    kmodels = [KMeans(n_clusters=k).fit(data) for k in range(min_k, max_k + 1)]
    return kmodels

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
for i in range(2, 9):
    fig = plot_3d_clusters(ds, ['Age', 'Annual Income (k$)','Spending Score (1-100)'], num_clusters=i)
    fig.show()

In [None]:
def elbow_plot(data, max_k):
   
    models = createKmeanModels(data, max_k)
    distortions = [model.inertia_ for model in models]

    fig = go.Figure()
    fig.add_trace(go.Scatter(x=list(range(1, len(distortions) + 1)), y=distortions, mode='lines+markers'))
    fig.update_layout(title='Elbow plot for KMeans clustering',
                      xaxis_title='Number of clusters',
                      yaxis_title='Distortion')

    return fig

In [None]:
elbow_plot(ds, 9)
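
The elbow can be hard to read by eye. As a complementary check (a different technique from the elbow plot above, shown here only as a sketch), the silhouette score can be computed for each candidate `k`. The example uses synthetic, well-separated blobs from `make_blobs` with made-up centers rather than the customer data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four hypothetical, well-separated cluster centers
centers = [(-10, -10), (-10, 10), (10, -10), (10, 10)]
X_blobs, _ = make_blobs(n_samples=400, centers=centers,
                        cluster_std=1.0, random_state=0)

# Silhouette score for each k (it needs at least 2 clusters)
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_blobs)
    scores[k] = silhouette_score(X_blobs, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

On real data such as the mall customers, the scores are usually closer together, so the silhouette score is best read alongside the elbow plot rather than instead of it.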