## Machine Learning (ML) Basics Tutorial

### Example 1: Basic Classification with Iris Dataset

- **Objective:** Classify iris flowers into three species using sepal and petal measurements.
- **Dataset:** The Iris dataset includes four features and a target for three species of Iris flowers:
  - **Features (X):**
    - Sepal Length (cm)
    - Sepal Width (cm)
    - Petal Length (cm)
    - Petal Width (cm)
  - **Target (y):**
    - Species of Iris (0 for Setosa, 1 for Versicolour, 2 for Virginica)
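
To make the layout of the features and target concrete, the following minimal sketch loads the dataset and prints its shape and label names. It uses only `load_iris` from scikit-learn; the `np.bincount` call is added here just to show how many samples each class has.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Feature matrix: one row per flower, one column per measurement
print(iris.data.shape)            # (150, 4)
print(iris.feature_names)

# Target vector: integer class labels; target_names maps them to species
print(iris.target_names)          # ['setosa' 'versicolor' 'virginica']
print(np.bincount(iris.target))   # [50 50 50]
```

The three classes are balanced, so plain accuracy is a reasonable first metric for this dataset.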

In [None]:
# Install the required packages in the current Jupyter kernel
import sys
!{sys.executable} -m pip install pandas numpy matplotlib seaborn scikit-learn scipy

In [None]:
# Importing libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

- **sns.pairplot(...)**: Draws a grid of scatter plots, one for each pair of features, to show how the features relate to each other.
- **diag_kind="kde"**: Optional parameter that draws a kernel density estimate (KDE) on the diagonal, showing each feature's distribution as a smooth curve.

In [None]:
# Visualizing the data
sns.pairplot(pd.DataFrame(X, columns=iris.feature_names), diag_kind="kde")
plt.show()

In [None]:
# Split the data: 70% for training, 30% for testing (random_state makes the split reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

- **solver='saga':** The **'solver'** parameter selects the optimization algorithm used to fit the logistic regression model. **'saga'** is a stochastic gradient method that scikit-learn recommends mainly for large datasets; it converges fastest when the features are on similar scales.
- **max_iter=2000:** This parameter sets the maximum number of iterations the solver may run before it stops, which gives it enough room to converge.
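
Because `saga` is sensitive to feature scale, one common variation is to standardize the features before fitting. The sketch below is an illustration under assumptions, not part of the original example: the pipeline, `stratify=y`, and `random_state=0` are choices made here for reproducibility.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Standardize the features, then fit logistic regression with the saga solver
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="saga", max_iter=2000),
)
clf.fit(X_train, y_train)

# n_iter_ reports how many iterations the solver actually used
iterations = clf.named_steps["logisticregression"].n_iter_
test_accuracy = clf.score(X_test, y_test)
print("Iterations used:", iterations)
print("Test accuracy:", test_accuracy)
```

If the reported iteration count is well below `max_iter`, the solver converged on its own and the limit was not the stopping reason.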

In [None]:
# Create the model, using 'saga' instead of the default 'lbfgs' solver
model = LogisticRegression(solver='saga', max_iter=2000)
# Train the logistic regression model on the labeled training data
model.fit(X_train, y_train)

In [None]:
# Predict and evaluate
predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))

- **confusion_matrix(...)**: This function counts how often each actual class was predicted as each class. Rows are actual classes and columns are predicted classes; `sns.heatmap` then draws the matrix as a heatmap.
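
To see how the matrix is laid out, here is a small sketch with hand-written labels (the `y_true` and `y_pred` values are made up for illustration). Entry `cm[i, j]` counts samples whose actual class is `i` and predicted class is `j`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 2, 2]   # actual classes
y_pred = [0, 1, 1, 1, 2, 0]   # predicted classes

cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[1 1 0]
#  [0 2 0]
#  [1 0 1]]

# Correct predictions lie on the diagonal, so accuracy = trace / total
accuracy = np.trace(cm) / cm.sum()
print(accuracy)
```

Off-diagonal entries show which classes are confused with each other, which a single accuracy number hides.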

In [None]:
# Plotting the Confusion Matrix
cm = confusion_matrix(y_test, predictions)
sns.heatmap(cm, annot=True, fmt="d")
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

### Example 2: Sentiment Analysis with IMDb Reviews

- **Objective:** Determine whether movie reviews are positive or negative using text data.
- **Dataset:** The IMDB dataset in TensorFlow's Keras includes:
  - **Reviews (Text Data):**
    - Every review is a sequence of word indices that correspond to the words used in the review text.
    - The words are indexed by their overall frequency in the dataset. For example, the integer "3" represents the 3rd most frequent word.
  - **Labels (Binary):**
    - Every review is categorized as either 0, representing a negative sentiment, or 1, representing a positive sentiment.
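
The idea of frequency-based word indices can be sketched on a toy corpus. This is an illustration only, not the Keras implementation: the three `reviews` are invented, and the real loader additionally reserves a few low indices for special tokens, which this sketch ignores.

```python
from collections import Counter

reviews = [
    "great movie great cast",
    "bad movie",
    "great fun",
]

# Count how often each word occurs across all reviews
counts = Counter(word for review in reviews for word in review.split())

# Rank 1 goes to the most frequent word, rank 2 to the next, and so on
word_index = {
    word: rank
    for rank, (word, _) in enumerate(counts.most_common(), start=1)
}

# Encode each review as a list of integer ranks
encoded = [[word_index[w] for w in review.split()] for review in reviews]
print(word_index)
print(encoded)
```

Here "great" receives index 1 and "movie" index 2 because they are the two most frequent words; each review becomes a variable-length list of integers, which is the form the IMDB data arrives in.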

In [None]:
# Install the required package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install tensorflow

In [None]:
# Importing libraries
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.datasets import imdb
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

- **imdb.load_data(...):** Loads the dataset as training and test splits with their respective labels, keeping only the 10,000 most frequent words.

In [None]:
# Load dataset
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

- **Padding in natural language processing (NLP) and sequence-based tasks serves several purposes:**
  - **Fixed Input Length:** Padding ensures that all input sequences have the same length, which makes it possible to process them in batches efficiently.
  - **Batch Processing:** By standardizing sequence lengths in a batch, padding supports parallel processing and makes better use of the GPU.
  - **Model Compatibility:** It enables compatibility with neural networks, such as RNNs, LSTMs, and CNNs, that need inputs of a fixed length within a batch.
  - **Limits Data Loss:** Padding shorter sequences up to a common length avoids cutting every sequence down to the shortest one. Sequences longer than the chosen length are still truncated.
  - **Facilitates Embeddings:** In deep learning, fixed-length sequences make it straightforward to build embedding tensors with consistent dimensions for text tokens.


- **tf.keras.preprocessing.sequence.pad_sequences:** Both the training and test sets are converted to sequences of length 256. Shorter reviews are padded with zeros at the end (`padding='post'`), and longer reviews are truncated to 256 tokens.
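
The effect of post-padding can be sketched in plain NumPy. The helper `pad_post` below is hypothetical and written only for illustration; it mimics `pad_sequences` with `padding='post'` and the default truncation from the front, assuming `maxlen` is positive.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Hypothetical helper: pad at the end, truncate from the front."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int64)
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]             # keep at most the last maxlen tokens
        out[row, :len(kept)] = kept      # left-align; padding stays at the end
    return out

batch = pad_post([[5, 8], [3, 1, 4, 1, 5, 9]], maxlen=4)
print(batch)
# [[5 8 0 0]
#  [4 1 5 9]]
```

The short sequence gains two zeros at the end, while the long one loses its first two tokens, so every row ends up with exactly four entries.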

In [None]:
# Data preprocessing
train_data = tf.keras.preprocessing.sequence.pad_sequences(train_data, value=0, padding='post', maxlen=256)
test_data = tf.keras.preprocessing.sequence.pad_sequences(test_data, value=0, padding='post', maxlen=256)

- **The model is constructed with:**
  - An embedding layer (input vocabulary: 10,000, embedding size: 16).
  - A global average pooling layer to reduce the sequence dimension.
  - A 16-unit dense layer with ReLU activation.
  - A single-unit dense layer with sigmoid activation for binary classification.
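
To make the tensor shapes at each layer concrete, here is a rough NumPy sketch of a forward pass through the same four layers. The random weights stand in for trained parameters, and the variable names (`embedding`, `w1`, `w2`, and so on) are chosen here for illustration; this is not how Keras implements the layers internally.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, seq_len, batch_size = 10000, 16, 256, 2

# Randomly initialized weights stand in for trained parameters
embedding = rng.normal(size=(vocab_size, embed_dim))
w1, b1 = rng.normal(size=(embed_dim, 16)), np.zeros(16)
w2, b2 = rng.normal(size=(16, 1)), np.zeros(1)

# A batch of padded reviews: integer word indices
tokens = rng.integers(0, vocab_size, size=(batch_size, seq_len))

x = embedding[tokens]                 # (2, 256, 16): one vector per token
pooled = x.mean(axis=1)               # (2, 16): average over the sequence
hidden = np.maximum(pooled @ w1 + b1, 0.0)        # (2, 16): dense + ReLU
p = 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))     # (2, 1): sigmoid output

print(x.shape, pooled.shape, hidden.shape, p.shape)
```

The pooling step is what removes the sequence dimension, so the dense layers see one fixed-size vector per review regardless of its original length.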

- For training, the model is compiled with the **'adam'** optimizer, **'binary_crossentropy'** as the loss function, and **'accuracy'** as the tracked performance metric.
- The **ReLU (Rectified Linear Unit)** activation function is used in the hidden layer. It outputs the input value if it is positive, and zero otherwise. This introduces non-linearity to the model and helps mitigate vanishing gradients.
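
As a small worked illustration of ReLU and the binary cross-entropy loss, the sketch below computes both in NumPy. The function names and the clipping constant `eps` are assumptions made here; Keras has its own implementations.

```python
import numpy as np

def relu(x):
    # Pass positive values through, replace negative values with zero
    return np.maximum(x, 0.0)

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    # Clip probabilities away from 0 and 1 so the logarithm stays finite
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return -np.mean(
        y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)
    )

print(relu(np.array([-2.0, 0.0, 3.0])))   # [0. 0. 3.]

y_true = np.array([1.0, 0.0])
confident_right = binary_crossentropy(y_true, np.array([0.9, 0.1]))
confident_wrong = binary_crossentropy(y_true, np.array([0.1, 0.9]))
print(confident_right, confident_wrong)
```

The loss is small when the predicted probabilities agree with the labels and grows sharply when the model is confidently wrong, which is the signal the optimizer uses to update the weights.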

In [None]:
# Build the model
model = Sequential([
  Embedding(10000, 16),
  GlobalAveragePooling1D(),
  Dense(16, activation='relu'),
  Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [None]:
# Train the model
history = model.fit(train_data, train_labels, epochs=30, batch_size=512, validation_data=(test_data, test_labels), verbose=1)

In [None]:
# Visualizing the training history
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)

In [None]:
# Plot training & validation accuracy values
plt.plot(epochs, acc, 'bo', label='Training accuracy')
plt.plot(epochs, val_acc, 'b', label='Validation accuracy')
plt.title('Training and validation accuracy')
plt.legend()
plt.show()

In [None]:
# Plot training & validation loss values
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

In [None]:
# Making predictions
predictions = model.predict(test_data)

- The **Receiver Operating Characteristic (ROC)** curve is built from the predicted probabilities. For each decision threshold, it plots the True Positive Rate (TPR) against the False Positive Rate (FPR). The Area Under the Curve (AUC) summarizes the curve in a single number, where 1.0 is a perfect classifier and 0.5 is no better than chance.
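
The two rates can be computed by hand on a tiny example. This sketch uses made-up labels and scores, and the helper `tpr_fpr` is hypothetical. It also computes the AUC through its ranking interpretation (the fraction of positive/negative pairs where the positive scores higher), which assumes there are no tied scores.

```python
import numpy as np

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

def tpr_fpr(y_true, scores, threshold):
    """Hypothetical helper: rates when predicting positive at score >= threshold."""
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    tpr = tp / np.sum(y_true == 1)    # share of positives that were caught
    fpr = fp / np.sum(y_true == 0)    # share of negatives flagged by mistake
    return float(tpr), float(fpr)

# Lowering the threshold moves along the ROC curve
for t in (0.8, 0.5, 0.35, 0.1):
    print(t, tpr_fpr(y_true, scores, t))

# AUC as the fraction of (positive, negative) pairs ranked correctly
pos, neg = scores[y_true == 1], scores[y_true == 0]
auc_value = float(np.mean(pos[:, None] > neg[None, :]))
print(auc_value)   # 0.75
```

Three of the four positive/negative pairs are ordered correctly, which gives an AUC of 0.75 for this toy data.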

In [None]:
# Calculate the ROC curve from the predicted probabilities (flattened from shape (n, 1) to 1-D)
fpr, tpr, _ = roc_curve(test_labels, predictions.ravel())
roc_auc = auc(fpr, tpr)

In [None]:
# Plot ROC curve
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

### Example 3: Customer Segmentation with Mall Customer Data

- **Objective:** Segment customers based on their spending patterns and characteristics.
- **Dataset:** Mall Customer Segmentation Data (available on Kaggle), which contains the following fields:
  - **CustomerID:** A unique identifier for each customer.
  - **Gender:** The gender of the customer (e.g., Male, Female).
  - **Age:** The age of the customer.
  - **Annual Income (k$):** The customer's annual income in thousands of dollars.
  - **Spending Score (1-100):** A score assigned to the customer based on their spending behavior and purchasing data. This score is on a scale from 1 to 100.

In [None]:
# Importing libraries
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

In [None]:
# Load dataset (replace with actual dataset path)
df = pd.read_csv('Mall_Customers.csv')

# One-hot encoding for the 'Gender' column
df = pd.get_dummies(df, columns=['Gender'])

In [None]:
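# Select the income and spending score columns for clustering.
# Selecting by name avoids picking the wrong columns after one-hot encoding reorders them.
X = df[['Annual Income (k$)', 'Spending Score (1-100)']].values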

In [None]:
# Apply KMeans clustering (fixed n_init and random_state make the result reproducible)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
y_kmeans = kmeans.fit_predict(X)

In [None]:
# Plot the clusters
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, cmap='rainbow')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Customer Segments')
plt.show()
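
The choice of `n_clusters=5` can be sanity-checked with the elbow heuristic: fit KMeans for several values of k and look at where the inertia (within-cluster sum of squared distances) stops dropping quickly. The sketch below is an assumption-laden illustration that uses synthetic blobs from `make_blobs` in place of the mall data, since the CSV file may not be at hand; `X_demo` and the range of k are choices made here.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the (income, spending score) columns
X_demo, _ = make_blobs(n_samples=200, centers=5, cluster_std=1.0, random_state=0)

ks = list(range(1, 9))
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    inertias.append(km.inertia_)   # within-cluster sum of squared distances

for k, value in zip(ks, inertias):
    print(k, round(value, 1))
```

Plotting `inertias` against `ks` and looking for the bend in the curve gives a rough guide to a suitable number of clusters; on the real data, the same loop can be run with `X` in place of `X_demo`.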