# PyCon2019: Hello World of Machine Learning using Scikit-learn


## [10] - K Means Clustering With Real World Data and Pandas

<br/><br/>

_Let's use the K-Means clustering algorithm with real-world data_

<br/><br/>

___We'll be using the Titanic data from Kaggle, located at___ https://www.kaggle.com/c/titanic/data

<br/><br/>

___It has two CSV files, "train.csv" and "test.csv", containing training data and test data. We'll be using "train.csv"___

<br/><br/>

### Getting the data ready for K-Means algorithm using Pandas

_We'll use Pandas to prune the Titanic data so that the K-Means algorithm can split it into two clusters, which we hope will correspond to Survived / Not Survived_

<br/>


In [None]:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

<br/><br/>

___Let's create a Pandas DataFrame from the train.csv file___

<br/><br/>

In [None]:
titanic_df = pd.read_csv('train.csv')

In [None]:
titanic_df.shape

<br/><br/>

_To prune the data, we first need to identify which columns our ML algorithm requires. In the Titanic data, we can safely conclude that we need the_ __Age__ _column_

<br/><br/>

_Let's first find out whether any_ ___"Age"___ _values are missing_

<br/><br/><br/>

In [None]:
titanic_df.head(5)

In [None]:
titanic_df.loc[titanic_df['Age'].isnull()]

<br/><br/>

__How many of them are missing?__
<br/><br/>

In [None]:
titanic_df['Age'].isnull().sum()

<br/><br/><br/>

___Well, that's a large share of the records, so deleting them is not a good idea.___

<br/><br/>

___So, let's fill those values with the median "Age" value___

<br/><br/>

In [None]:
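# Assign the result back: calling fillna(..., inplace=True) on a selected column
# is chained assignment, which recent pandas versions warn about or ignore.
# median() skips NaN values by default.
titanic_df["Age"] = titanic_df["Age"].fillna(titanic_df["Age"].median())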

<br/>

___We can check that no missing values remain___

<br/><br/>

In [None]:
titanic_df['Age'].isnull().sum()
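
_The median fill above can be tried in isolation on a tiny frame. This is only a sketch: `toy_df` and its values are made up for illustration and are not taken from "train.csv"._

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the "Age" column of train.csv.
toy_df = pd.DataFrame({"Age": [22.0, np.nan, 38.0, 26.0, np.nan]})

# median() ignores NaN by default, so it is computed from 22, 26 and 38.
age_median = toy_df["Age"].median()

# Assign the filled column back instead of relying on inplace=True.
toy_df["Age"] = toy_df["Age"].fillna(age_median)

# Count the missing values that remain after the fill.
missing_after = int(toy_df["Age"].isnull().sum())
print(age_median, missing_after)
```

_The median is a reasonable default here because it is less sensitive to a few very old or very young passengers than the mean would be._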

<br/><br/>

___Now get rid of the columns that we assume will not impact the survival of passengers___

<br/><br/>

___They are___

- ___Name___
- ___Ticket___
- ___Fare___
- ___Cabin___
- ___Embarked___


<br/><br/>

___Let's get rid of them in our Pandas DataFrame___

In [None]:
titanic_df = titanic_df.drop(['Name', 'Ticket', 'Fare', 'Cabin', 'Embarked'], axis = 1)

In [None]:
titanic_df.head(10)

<br/><br/>

#### Converting Non-Numeric Data into Numeric

___Of the remaining data, only the "Sex" field has non-numeric values. We can convert it into multiple fields of numeric values using .get_dummies(...)___

<br/><br/>



In [None]:
titanic_df = pd.get_dummies(titanic_df, columns=['Sex'])

In [None]:
titanic_df.head(10)
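
_As a rough illustration of what `.get_dummies(...)` does, here is a sketch on a made-up three-row frame (`toy_df` is hypothetical, not the Titanic data). Each distinct value of "Sex" becomes its own indicator column._

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one text column.
toy_df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
})

# Replace "Sex" with one indicator column per distinct value.
encoded = pd.get_dummies(toy_df, columns=["Sex"])

# The original "Sex" column is gone; "Sex_female" and "Sex_male" take its place.
new_columns = list(encoded.columns)
print(new_columns)
print(encoded)
```

_Depending on the pandas version, the indicator columns hold either 0/1 integers or True/False values; both work as numeric input for K-Means._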

<br/><br/>

___Since we want K-Means to find the survival clusters on its own, let's drop the "Survived" column from the input___

<br/><br/>

In [None]:
X_df = titanic_df.drop(['Survived'],axis=1)

In [None]:
X_df.head(10)

<br/><br/>

___Until now we've passed "numpy" arrays as input to the "Algo.fit(...)" function. Here, however, we can pass the Pandas DataFrame directly___

<br/><br/>

In [None]:
kmc = KMeans(n_clusters=2)

In [None]:
kmc.fit(X_df)
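
_To see that a DataFrame really can go straight into `.fit(...)`, here is a minimal sketch on made-up 2-D points. `toy_df`, `toy_kmc`, and the `n_init` / `random_state` settings are assumptions chosen for a repeatable illustration, not part of the notebook above._

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy data: three points near (1, 1) and three near (9, 9).
toy_df = pd.DataFrame({
    "x": [1.0, 1.2, 0.8, 9.0, 9.3, 8.9],
    "y": [1.0, 0.9, 1.1, 9.0, 9.2, 8.8],
})

# random_state makes the run repeatable; n_init is the number of restarts.
toy_kmc = KMeans(n_clusters=2, n_init=10, random_state=0)

# The DataFrame is passed directly, without converting it to a numpy array.
toy_kmc.fit(toy_df)

labels = toy_kmc.labels_            # one cluster id (0 or 1) per row
centers = toy_kmc.cluster_centers_  # one centre per cluster
print(labels)
print(centers)
```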

<br/><br/>

___Let's check the predictions and compare them against the original values___

<br/><br/>

In [None]:
predict = kmc.predict(X_df)

In [None]:
predict

In [None]:
X_df.shape, predict.shape

In [None]:
correct = 0
for index in range(X_df.shape[0]):
    if predict[index] == titanic_df.Survived[index]:
        correct += 1
    
correct / predict.shape[0]
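
_One caveat about the comparison above: K-Means numbers its clusters arbitrarily, so cluster 1 does not have to mean "Survived". A possible way to account for that with two clusters is to score both labelings and keep the better one. The helper `cluster_accuracy` and the toy labels below are hypothetical, shown only as a sketch._

```python
import numpy as np

def cluster_accuracy(cluster_labels, true_labels):
    """Best match between two-cluster ids (0/1) and true 0/1 labels."""
    cluster_labels = np.asarray(cluster_labels)
    true_labels = np.asarray(true_labels)
    # Fraction of rows where the cluster id equals the true label as-is.
    direct = np.mean(cluster_labels == true_labels)
    # Same fraction after swapping cluster ids 0 and 1.
    flipped = np.mean((1 - cluster_labels) == true_labels)
    return max(direct, flipped)

# Made-up labels: the clusters are mostly right, but numbered the other way round.
toy_clusters = [1, 1, 0, 0, 1]
toy_truth = [0, 0, 1, 1, 1]

score = cluster_accuracy(toy_clusters, toy_truth)
print(score)
```

_With the real data this would be called as `cluster_accuracy(predict, titanic_df.Survived)`. A score close to 0.5 then means the clusters carry little information about survival in either labeling._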

<br/><br/>

___So far we get ~50% accuracy. Let's see if scaling the values using MinMaxScaler() improves the situation___

<br/><br/>

In [None]:
from sklearn.preprocessing import MinMaxScaler

mm_scalar = MinMaxScaler()

In [None]:
X_scalar = mm_scalar.fit_transform(X_df)

In [None]:
X_scalar

In [None]:
type(X_scalar)
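
_For intuition about what the scaler does: `MinMaxScaler` with default settings maps each column to the range 0 to 1 using (x - min) / (max - min). The sketch below checks that against a manual computation on made-up numbers (`toy_X` is hypothetical, loosely shaped like an id column next to an age column)._

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: a wide-ranging id-like column and a narrow age-like column.
toy_X = np.array([
    [1.0, 22.0],
    [446.0, 38.0],
    [891.0, 80.0],
])

# Scale each column independently to the [0, 1] range.
scaled = MinMaxScaler().fit_transform(toy_X)

# The same result computed by hand, column by column.
col_min = toy_X.min(axis=0)
col_max = toy_X.max(axis=0)
manual = (toy_X - col_min) / (col_max - col_min)

print(scaled)
print(np.allclose(scaled, manual))
```

_Scaling matters for K-Means because it relies on distances: without it, a column with a large numeric range dominates the distance and the smaller columns barely influence the clusters._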

In [None]:
kmc = KMeans(n_clusters=2)
kmc.fit(X_scalar)

In [None]:
predict = kmc.predict(X_scalar)

In [None]:
correct = 0
for index in range(X_df.shape[0]):
    if predict[index] == titanic_df.Survived[index]:
        correct += 1
    
correct / predict.shape[0]

<br><br><br>

#### Exercise :

___Convert the Pandas DataFrame into "numpy" arrays, use them with KMeans, and compare the results___
<br><br><br>

In [None]:
# axis must be passed by keyword; the positional form was removed in pandas 2.0.
X = np.array(titanic_df.drop(['Survived'], axis=1).astype(float))

In [None]:
X[0:10,:]

<br/><br/><br/>

In [None]:
# Write Code

<br/><br/>

___Using MinMaxScaler()___

<br/><br/>

In [None]:
# Write Code