# Pandas/Sklearn "Hello world" - Iris dataset

**Part 1:**
- Load the iris dataset from sklearn 
- Create a pandas dataframe containing the dataset 
- Try to plot two of the features using a plotly scatter plot 
(just as an example, `sepal length` on the `x` axis and `sepal width` on the `y` axis)
- Are the classes linearly separable?
- Now try to use the sklearn `PCA` to reduce the number of features from 4 to 2
- Are the classes linearly separable in the new feature space? 
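
The separability question can also be checked numerically rather than by eye. The following is a minimal sketch, not part of the original exercise: it fits one linear boundary per class (class versus the rest) with `LinearSVC` and reports the training accuracy. The helper name `one_vs_rest_training_accuracy` and the value `C=1000.0` are assumptions chosen for illustration. A score of 1.0 means the fitted line separates that class from the others. A lower score suggests overlap, although it only shows that this particular fit did not separate the class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC


def one_vs_rest_training_accuracy(features, labels):
    """Fit one linear boundary per class (class vs. rest) and return training accuracy."""
    scores = {}
    for class_id in np.unique(labels):
        is_class = (labels == class_id).astype(int)
        # A large C penalizes misclassified training points heavily (assumed value).
        model = LinearSVC(C=1000.0, max_iter=100_000)
        model.fit(features, is_class)
        scores[int(class_id)] = float(model.score(features, is_class))
    return scores


iris = load_iris()
sepal_features = iris.data[:, :2]  # sepal length, sepal width
pca_features = PCA(n_components=2).fit_transform(iris.data)

sepal_scores = one_vs_rest_training_accuracy(sepal_features, iris.target)
pca_scores = one_vs_rest_training_accuracy(pca_features, iris.target)
print('sepal features:', sepal_scores)
print('PCA features:  ', pca_scores)
```

The keys of the returned dictionaries are the integer class labels, so they can be mapped back to names with `iris.target_names`.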

**Part 2:**
- Now split the dataset into training and test sets using `train_test_split`
- Try to train a classifier of your choice on your training dataset  
(just as an example, `sklearn.ensemble.RandomForestClassifier`)
- Measure the performance of your classifier on test data

In [None]:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, ConfusionMatrixDisplay, confusion_matrix
dataset = load_iris(return_X_y=False, as_frame=False)

# remove comment to print the dataset description
# print(dataset['DESCR'])

In [None]:
dataset['target_names'][[0, 1, 2, 2, 2, 0, 1]]

In [None]:
import pandas as pd

features_names = dataset['feature_names']
print(features_names)

dataframe_dict = {name: dataset['data'][:, i] for i, name in enumerate(features_names)}
dataframe_dict['labels'] = dataset['target']
dataframe_dict['labels_name'] = dataset['target_names'][dataset['target']]
iris_dataframe = pd.DataFrame(dataframe_dict)
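
As an alternative to building the dictionary by hand, `load_iris` can return a pandas dataframe directly when called with `as_frame=True`. The sketch below assumes that route. The variable name `iris_frame` is illustrative, and the label column is called `target` here rather than `labels` as in the cell above.

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)

# iris.frame holds the four feature columns plus an integer 'target' column.
iris_frame = iris.frame.copy()

# Look up the species name for every row, as done for 'labels_name' above.
iris_frame['labels_name'] = iris.target_names[iris_frame['target'].to_numpy()]
print(iris_frame.head())
```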

In [None]:
import plotly.express as px

px.scatter(iris_dataframe, x='sepal length (cm)', y='sepal width (cm)', color='labels_name')

In [None]:
px.scatter_3d(iris_dataframe, x='petal length (cm)', y='sepal length (cm)', z='sepal width (cm)', color='labels_name')

In [None]:
pca = PCA(2)
pca.fit(dataset['data'])

new_features = pca.transform(dataset['data'])  # fit and transform can be done in one step with pca.fit_transform
iris_dataframe['pca1'] = new_features[:, 0]
iris_dataframe['pca2'] = new_features[:, 1]
iris_dataframe

In [None]:
px.scatter(iris_dataframe, x='pca1', y='pca2', color='labels_name')

In [None]:
import numpy as np
x = np.linspace(1, 10, 10)
print(x)
x_train, x_test = train_test_split(x)
x_train2, x_test2  = train_test_split(x)
x_train, x_test

In [None]:
x_train2, x_test2
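
The two splits above differ because `train_test_split` shuffles the data with a new random state on every call. When a reproducible split is needed, the `random_state` parameter can be fixed. A minimal sketch, assuming the seed value `0` (any integer would do):

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.linspace(1, 10, 10)

# Same seed, same shuffle, identical splits.
train_a, test_a = train_test_split(x, random_state=0)
train_b, test_b = train_test_split(x, random_state=0)

print(np.array_equal(train_a, train_b), np.array_equal(test_a, test_b))
```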

In [None]:
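# split the raw features, the labels and the PCA features in one call,
# so that all three share the same shuffle and their rows stay aligned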
train_x, test_x, train_labels, test_labels, train_pca_x, test_pca_x = train_test_split(dataset['data'],
                                                                                       dataset['target'],
                                                                                       new_features)

In [None]:
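# random forest on the four original features
classifier = RandomForestClassifier()
classifier.fit(train_x, train_labels)
predictions = classifier.predict(test_x)

# remove comment to print precision, recall and F1 score per class
# print(classification_report(test_labels, predictions))


# random forest on the two PCA features
classifier = RandomForestClassifier()
classifier.fit(train_pca_x, train_labels)
predictions = classifier.predict(test_pca_x)

# remove comment to print precision, recall and F1 score per class
# print(classification_report(test_labels, predictions))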
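
Besides `classification_report`, the `confusion_matrix` imported at the top gives a compact view of which classes get mixed up. The sketch below is self-contained and makes a few assumptions that are not in the cells above: a stratified split, a fixed `random_state=0`, and the raw four features only.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
train_x, test_x, train_labels, test_labels = train_test_split(
    iris.data, iris.target, stratify=iris.target, random_state=0)

classifier = RandomForestClassifier(random_state=0)
classifier.fit(train_x, train_labels)
predictions = classifier.predict(test_x)

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(test_labels, predictions)
accuracy = accuracy_score(test_labels, predictions)
print(matrix)
print('accuracy:', accuracy)
```

The diagonal counts the correctly classified test samples, so the accuracy equals the diagonal sum divided by the total number of test samples.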

In [None]:
from sklearn import tree

In [None]:
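# remove comments to plot one tree of the forest trained on the PCA features
# import matplotlib.pyplot as plt
# plt.figure(figsize=(20, 20))
# _ = tree.plot_tree(classifier.estimators_[1], feature_names=['pca1', 'pca2'], filled=True)
# plt.show()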

In [None]:
pca.explained_variance_ratio_, pca.singular_values_
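
The two attributes printed above are related. `explained_variance_ratio_` gives the fraction of the total variance captured by each component, and the singular values carry the same information on a different scale: the explained variance equals the squared singular values divided by `n_samples - 1`. A small self-contained sketch of both facts:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

data = load_iris().data
pca = PCA(n_components=2).fit(data)

# Fraction of the total variance kept by the two components together.
kept = pca.explained_variance_ratio_.sum()
print(f'variance kept by 2 components: {kept:.3f}')

# Squared singular values divided by (n_samples - 1) give the explained variance.
variance_from_singular_values = pca.singular_values_ ** 2 / (data.shape[0] - 1)
print(np.allclose(variance_from_singular_values, pca.explained_variance_))
```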

# Visualize very large feature spaces with t-SNE

- Load the MNIST dataset
- Take a small subset of the dataset to experiment with 
(maybe 1000 images, but you can play with it)
- Try to visualize the first two PCA components
- Are the classes well separated?
- If not, try to use t-SNE instead
- Better? :)
- Fit `PCA(784)` and plot `pca.explained_variance_ratio_` and `pca.singular_values_`

**Part 2:**
- If you have time left, try to use `KMeans` clustering to get a clean separation between classes.
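
The steps above can be tried offline first on a smaller stand-in. The sketch below is an assumption, not the exercise itself: it uses `load_digits` (8x8 digit images bundled with sklearn, 64 features) instead of MNIST so that nothing has to be downloaded, and a hypothetical subset of 500 images to keep t-SNE fast.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

digits = load_digits()
subset_size = 500  # illustrative value, chosen only to keep the run short
images = digits.data[:subset_size]
digit_labels = digits.target[:subset_size]

# Linear projection onto the two directions of largest variance.
pca_embedding = PCA(n_components=2).fit_transform(images)

# Non-linear embedding that tries to keep neighbouring images close together.
tsne_embedding = TSNE(n_components=2, init='pca', random_state=0).fit_transform(images)

print(pca_embedding.shape, tsne_embedding.shape)
```

Either embedding can then be plotted with `px.scatter(x=..., y=..., color=digit_labels.astype(str))` to compare how well the digit classes separate.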

In [None]:
from sklearn.datasets import fetch_openml
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

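# as_frame=True makes the result a (DataFrame, Series) pair, which the cells below rely on
mnist = fetch_openml('mnist_784', return_X_y=True, as_frame=True)  # takes a while because it has to download the data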

In [None]:
mnist[0]

In [None]:
image_sample = mnist[0].iloc[2].to_numpy()
plt.figure(figsize=(5, 5))
plt.imshow(image_sample.reshape(28, 28), cmap='gray')
plt.show()

In [None]:
dataset_size = 5000
dataset, labels = mnist[0].to_numpy()[:dataset_size], mnist[1].to_numpy().astype('int64')[:dataset_size]

print(f'Dataset is of shape: {dataset.shape}')

In [None]:
from sklearn.decomposition import PCA
pca = PCA(784)
new_features = pca.fit_transform(dataset)

In [None]:
px.line(y=pca.explained_variance_ratio_)

In [None]:
# one random RGB color per digit class, looked up for every sample through its label
random_colors_array = np.random.rand(10, 3)
plt.scatter(new_features[:, 0], new_features[:, 1], c=random_colors_array[labels])

In [None]:
from matplotlib.colors import ListedColormap
random_colors_array = np.random.rand(10, 3)
color_map = ListedColormap(random_colors_array)

plt.scatter(new_features[:, 0], new_features[:, 1], c=labels, cmap=color_map)
plt.colorbar()

In [None]:
# one random scalar per digit class; matplotlib maps the scalars through its default colormap
random_colors_array = np.random.rand(10)
plt.scatter(new_features[:, 0], new_features[:, 1], c=random_colors_array[labels])

In [None]:
px.scatter(x=new_features[:, 0], y=new_features[:, 1], color=labels.astype('str'))

In [None]:
tsne = TSNE(2, learning_rate='auto')
new_features = tsne.fit_transform(dataset)  # replaces the PCA features stored in new_features

In [None]:
px.scatter(x=new_features[:, 0], y=new_features[:, 1], color=labels.astype(str))

In [None]:
# cluster the raw 784-dimensional pixel vectors
kmeans = KMeans(10)
predictions = kmeans.fit_predict(dataset)

# cluster the 2-dimensional t-SNE embedding
kmeans = KMeans(10)
new_predictions = kmeans.fit_predict(new_features)

In [None]:
px.scatter(x=new_features[:, 0], 
           y=new_features[:, 1],
           color=new_predictions.astype('str'))

In [None]:
from sklearn.metrics import rand_score

In [None]:
print('KMeans on full dimensionality', rand_score(labels, predictions))
print('KMeans on 2 dim TSNE         ', rand_score(labels, new_predictions))
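
`rand_score` is a convenient choice here because cluster IDs are arbitrary: KMeans cluster 3 need not correspond to the digit 3, and the Rand index only compares which pairs of samples are grouped together. It is not corrected for chance, though, and `adjusted_rand_score` from the same module is. A toy sketch with made-up labels illustrates both points:

```python
from sklearn.metrics import adjusted_rand_score, rand_score

true_labels = [0, 0, 1, 1, 2, 2]
same_grouping_renamed = [2, 2, 0, 0, 1, 1]  # same clusters, different IDs
one_big_cluster = [0, 0, 0, 0, 0, 0]        # everything lumped together

renamed_score = rand_score(true_labels, same_grouping_renamed)
lumped_score = rand_score(true_labels, one_big_cluster)
lumped_adjusted = adjusted_rand_score(true_labels, one_big_cluster)

print(renamed_score)    # 1.0, renaming the clusters does not matter
print(lumped_score)     # above zero even though the clustering is uninformative
print(lumped_adjusted)  # the chance-corrected score
```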