# **Data Science and Business Analytics (GRIP May'21)**
## **Task 2 : Prediction using Unsupervised Machine Learning**
### **Author : Jeet Sahoo**
#### Objective: From the given ‘Iris’ dataset, predict the optimum number of clusters and represent it visually

## **Technical Stack : Scikit-learn, NumPy, SciPy, Pandas, Matplotlib, Seaborn**
In this K-means clustering task, I predict the optimum number of clusters in the given ‘Iris’ dataset and represent the clusters visually.

#### Importing Required Libraries

In [None]:
# Importing the required libraries
from sklearn import datasets
import matplotlib.pyplot as plt # used for data visualization
import pandas as pd # used for data analysis
import numpy as np # used for mathematical operations
import seaborn as sns # used for data visualization
from sklearn.cluster import KMeans
import matplotlib.patches as mpatches
import sklearn.metrics as sm
from mpl_toolkits.mplot3d import Axes3D
from scipy.cluster.hierarchy import linkage,dendrogram
from sklearn.cluster import DBSCAN 
from sklearn.decomposition import PCA 

#### Exploring and Understanding Data

In [None]:
# Loading data from file
link='Iris.csv'
iris_df = pd.read_csv(link)
print("Data load successful")

iris_df.head(20) #To see first 20 rows of data

In [None]:
# Understanding the data
iris_df.describe() #Data Description

In [None]:
iris_df.info() #Info of Dataset

In [None]:
iris_df.values #Values of Dataset

In [None]:
iris_df.columns #Columns of Dataset

In [None]:
iris_df.shape #To find the shape of data

In [None]:
iris_df.isnull().sum() #Checking for null values in Dataset

In [None]:
iris_df.duplicated().sum() #Checking for duplicate entries in Dataset

#### PreProcessing Data

In [None]:
iris_df['Species'].unique() #Three types of species

In [None]:
iris_df['Species'].value_counts() #Counting total values for each unique species in dataset

#### Visualizing Data

In [None]:
#Using Seaborn
plt.rcParams['figure.dpi'] = 1000
fig=plt.figure(figsize=(8,4))
sns.set_style("whitegrid")
iris=sns.load_dataset('iris') # seaborn's own copy of the Iris data (lower-case column names)
ax=sns.stripplot(x='species',y='sepal_length',data=iris,size=7) #Creating a strip plot to see the relationship among variables visually

plt.title('Iris Dataset',fontsize=16)
plt.xlabel('Species',fontsize=16)
plt.ylabel('Sepal Length',fontsize=16)
plt.show()

In [None]:
#Boxplot
fig= plt.figure(figsize=(8,4))
sns.boxplot(x='species',y='sepal_width',data=iris)
plt.title('Iris Dataset', fontsize=16)
plt.xlabel('Species',fontsize=16)
plt.ylabel('Sepal Width',fontsize=16)
plt.show()

In [None]:
#Plotting Boxplot 

fig=plt.figure(figsize=(8,4))
sns.boxplot( x= 'species', y= 'petal_width', data = iris)
plt.title('Iris Dataset',fontsize=16)
plt.xlabel('Species',fontsize=16)
plt.ylabel('Petal Width', fontsize =16)
plt.show()

In [None]:
#Plotting Boxplot 
plt.rcParams['figure.dpi'] = 300
fig=plt.figure(figsize=(8,4))
sns.boxplot( x= 'species', y= 'petal_length', data = iris)
plt.title('Iris Dataset',fontsize=16)
plt.xlabel('Species',fontsize=16)
plt.ylabel('Petal length', fontsize =16)
plt.show()

#### Training the Algorithm and Finding the Optimum Number of Clusters

In [None]:
# Feature matrix: the numeric measurement columns only
# Assumption: if Iris.csv has an 'Id' column it is a row index, not a feature, so it is dropped
x=iris_df.select_dtypes(include='number').drop(columns=['Id'],errors='ignore').values
wcss=[] # List for saving the values of Within-Cluster sum of squares
for i in range(1,11): # trying k = 1 to 10 clusters
    kmeans=KMeans(n_clusters=i,init="k-means++",random_state=0)
    kmeans.fit(x) # Fitting the K-means model to the data
    wcss.append(kmeans.inertia_) # inertia_ is the within-cluster sum of squares
wcss
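
The value stored in `kmeans.inertia_` is the sum of squared distances from each sample to its closest cluster centre. As a minimal sketch of that calculation, written in plain Python so each step is visible, the cell below computes it by hand. The points, labels, and centroids are made-up toy values, not Iris results, and `within_cluster_ss` is a hypothetical helper name.

```python
# Toy data: made-up 2-D points, cluster labels, and centroids (not Iris values)
points = [(1.0, 1.0), (1.0, 3.0), (5.0, 5.0), (7.0, 5.0)]
labels = [0, 0, 1, 1]
centroids = [(1.0, 2.0), (6.0, 5.0)]


def within_cluster_ss(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centre = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, centre))
    return total


toy_wcss = within_cluster_ss(points, labels, centroids)
print(toy_wcss)  # every toy point is exactly 1 unit from its centroid
```

Adding clusters can only keep this total the same or lower it, which is why the elbow method looks for the point where the decrease levels off rather than for the minimum.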

#### Visualizing Data

In [None]:
plt.figure(figsize=(8,5))
plt.plot(range(1,11),wcss,'ro-')
plt.xlabel("Numbers of Clusters")
plt.ylabel("Within-Cluster sum of squares")
plt.title("Elbow Method") # plotting the elbow graph
plt.show()
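
Reading the elbow off the plot is a visual judgement. One rough numerical aid, sketched here under stated assumptions, is to score each candidate k by the discrete second difference of the WCSS curve and take the largest. `elbow_by_second_difference` is a hypothetical helper, and the curve values are made up for illustration, not taken from the Iris run.

```python
def elbow_by_second_difference(wcss_values):
    """Return the k (1-based) where the WCSS curve bends most sharply.

    Each interior k is scored by wcss(k-1) - 2*wcss(k) + wcss(k+1);
    the k with the largest score is returned.
    """
    if len(wcss_values) < 3:
        raise ValueError("need at least three WCSS values")
    best_k, best_score = None, float("-inf")
    for i in range(1, len(wcss_values) - 1):
        score = wcss_values[i - 1] - 2 * wcss_values[i] + wcss_values[i + 1]
        if score > best_score:
            best_k, best_score = i + 1, score  # list index i holds k = i + 1
    return best_k


toy_curve = [100, 60, 20, 15, 12, 10]  # made-up WCSS values for k = 1..6
suggested_k = elbow_by_second_difference(toy_curve)
print(suggested_k)
```

This heuristic is sensitive to a very large first drop and can disagree with the visual reading, so it is best treated as a cross-check on the plot, not a replacement for it.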

#### Predicting Data

In [None]:
kmeans=KMeans(n_clusters=3,init='k-means++',random_state=0)
y_kmeans=kmeans.fit_predict(x) # Fitting the model and assigning each sample to a cluster
y_kmeans

In [None]:
kmeans.cluster_centers_ # Centroids of the clusters formed
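
Once the centroids are known, a sample is assigned to the cluster whose centroid is closest, which is the rule `KMeans.predict` applies. A minimal sketch of that rule, using made-up 2-D centroids (not the fitted Iris values) and a hypothetical helper `nearest_centroid`:

```python
import math


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))


# Made-up centroids for illustration only
toy_centroids = [(5.0, 1.5), (6.0, 4.5), (7.0, 6.0)]

first = nearest_centroid((5.2, 1.4), toy_centroids)
second = nearest_centroid((6.8, 5.9), toy_centroids)
print(first, second)
```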

#### Visualizing the Clusters

In [None]:
data = pd.DataFrame(x, columns=['SL','SW','PL','PW']) # features with short column names
data['y_kmeans'] = y_kmeans # cluster assigned to each sample

plt.figure(figsize=(7,7))
# Plotting columns 0 and 2 of x, the same feature space as the centroids
# Cluster indices are arbitrary, so the species names in the legend are an assumed mapping
plt.scatter(x[y_kmeans==0,0],x[y_kmeans==0,2],s=75,c='red',label='Iris-virginica')
plt.scatter(x[y_kmeans==1,0],x[y_kmeans==1,2],s=75,c='blue',label='Iris-setosa')
plt.scatter(x[y_kmeans==2,0],x[y_kmeans==2,2],s=75,c='green',label='Iris-versicolor')
plt.scatter(kmeans.cluster_centers_[:,0],kmeans.cluster_centers_[:,2],s=100,c='black',label='Centroids') # Plotting the centroids of the clusters
plt.legend()
plt.xlabel('Sepal Length in cm')
plt.ylabel('Petal Length in cm')
plt.title('K-Means Clustering')
plt.show()
data.head()
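
K-means numbers its clusters arbitrarily, so which cluster corresponds to which species has to be derived from the data. One simple way to check, sketched below, is to map each cluster to its most common true label and measure how many samples agree with that mapping. `majority_label_agreement` is a hypothetical helper, and the inputs are made-up toy values, not results from this notebook.

```python
from collections import Counter, defaultdict


def majority_label_agreement(cluster_ids, true_labels):
    """Map each cluster to its most common true label and score the agreement."""
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, true_labels):
        members[cluster].append(label)
    # Most frequent true label inside each cluster
    mapping = {c: Counter(lbls).most_common(1)[0][0] for c, lbls in members.items()}
    matches = sum(mapping[c] == l for c, l in zip(cluster_ids, true_labels))
    return mapping, matches / len(true_labels)


# Made-up cluster assignments and labels for illustration only
toy_clusters = [1, 1, 1, 0, 0, 2, 2, 2]
toy_species = ["setosa", "setosa", "setosa", "versicolor", "versicolor",
               "virginica", "virginica", "versicolor"]

mapping, agreement = majority_label_agreement(toy_clusters, toy_species)
print(mapping, agreement)
```

In this notebook the same helper could presumably be called as `majority_label_agreement(y_kmeans, iris_df['Species'])`, assuming the rows of `x` are in the same order as `iris_df`.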

In [None]:
kmeans.cluster_centers_[:,0] #Co-ordinates of the centroids on the x-axis

In [None]:
kmeans.cluster_centers_[:,2]  #Co-ordinates of the centroids on the y-axis (column 2, as used in the plot)

### Conclusion

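I successfully carried out the Prediction using Unsupervised Machine Learning task: the elbow method was used to choose the number of clusters, K-means was then fitted with 3 clusters, and the resulting clusters and their centroids were represented visually.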

#### Thank You