# PyCon2019: Hello World of Machine Learning using Scikit-learn


## [10] - K Means Clustering With Real World Data and Pandas

<br/><br/>

_Let's use the K-Means clustering algorithm with real-world data_

<br/><br/>

___We'll be using the Titanic data from Kaggle, located at___ https://www.kaggle.com/c/titanic/data

<br/><br/>

___It has two CSV files: "train.csv" (training data) and "test.csv" (test data). We'll be using "train.csv"___

<br/><br/>

### Getting the data ready for K-Means algorithm using Pandas

_We'll use Pandas to prune the Titanic data so that the K-Means algorithm can split it into two clusters: Survived / Not Survived_

<br/>


In [1]:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

<br/><br/>

___Let's create a Pandas DataFrame from the train.csv file___

<br/><br/>

In [2]:
titanic_df = pd.read_csv('train.csv')

In [3]:
titanic_df.shape

(891, 12)

<br/><br/>

_To prune the data, we first need to identify which columns our ML algorithm requires. In the Titanic data, we can safely conclude that we need the_ __Age__ _column_

<br/><br/>

_Let's first find out if there are some missing_ ___"AGE"___ _values_

<br/><br/><br/>

In [4]:
titanic_df.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [5]:
titanic_df.loc[titanic_df['Age'].isnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
17,18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0000,,S
19,20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.2250,,C
26,27,0,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.2250,,C
28,29,1,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,,0,0,330959,7.8792,,Q
29,30,0,3,"Todoroff, Mr. Lalio",male,,0,0,349216,7.8958,,S
31,32,1,1,"Spencer, Mrs. William Augustus (Marie Eugenie)",female,,1,0,PC 17569,146.5208,B78,C
32,33,1,3,"Glynn, Miss. Mary Agatha",female,,0,0,335677,7.7500,,Q
36,37,1,3,"Mamee, Mr. Hanna",male,,0,0,2677,7.2292,,C
42,43,0,3,"Kraeff, Mr. Theodor",male,,0,0,349253,7.8958,,C


<br/><br/>

__How many of them are missing?__
<br/><br/>

In [6]:
titanic_df['Age'].isnull().sum()

177

<br/><br/><br/>

___Well, that's a huge number and deleting those records is not a good idea.___

<br/><br/>

___So, let's fill those values with a median "Age" value___

<br/><br/>

In [7]:
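# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to the DataFrame
titanic_df["Age"] = titanic_df["Age"].fillna(titanic_df["Age"].median(skipna=True))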

<br/>

___We can check that it actually happened___

<br/><br/>

In [8]:
titanic_df['Age'].isnull().sum()

0
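To make the median fill easier to follow, here is a minimal sketch on a small made-up Series (the values are assumptions for illustration, not taken from `train.csv`). It shows that `median()` skips missing values by default and that `fillna(...)` returns a new Series rather than modifying the original.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the "Age" column; these values are made up for illustration.
ages = pd.Series([22.0, np.nan, 38.0, 26.0, np.nan, 35.0], name="Age")

# median() ignores NaN by default (skipna=True)
median_age = ages.median()

# fillna() returns a new Series; `ages` itself is left unchanged
filled = ages.fillna(median_age)

print("median used for filling:", median_age)
print("missing before:", ages.isnull().sum())
print("missing after:", filled.isnull().sum())
```

Assigning the result back to the column, as in the cell above, is what makes the change visible in the DataFrame.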

<br/><br/>

___Now get rid of the columns that we assume will not affect the survival of passengers___

<br/><br/>

___They are___

- ___Name___
- ___Ticket___
- ___Fare___
- ___Cabin___
- ___Embarked___


<br/><br/>

___Let's get rid of them in our Pandas DataFrame___

In [9]:
titanic_df = titanic_df.drop(['Name', 'Ticket', 'Fare', 'Cabin', 'Embarked'], axis = 1)

In [10]:
titanic_df.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch
0,1,0,3,male,22.0,1,0
1,2,1,1,female,38.0,1,0
2,3,1,3,female,26.0,0,0
3,4,1,1,female,35.0,1,0
4,5,0,3,male,35.0,0,0
5,6,0,3,male,28.0,0,0
6,7,0,1,male,54.0,0,0
7,8,0,3,male,2.0,3,1
8,9,1,3,female,27.0,0,2
9,10,1,2,female,14.0,1,0


<br/><br/>

#### Converting Non-Numeric Data into Numeric

___Of the remaining columns, only "Sex" holds non-numeric values. We can convert it into multiple numeric columns using .get_dummies(...)___

<br/><br/>



In [11]:
titanic_df = pd.get_dummies(titanic_df, columns=['Sex'])

In [12]:
titanic_df.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Sex_female,Sex_male
0,1,0,3,22.0,1,0,0,1
1,2,1,1,38.0,1,0,1,0
2,3,1,3,26.0,0,0,1,0
3,4,1,1,35.0,1,0,1,0
4,5,0,3,35.0,0,0,0,1
5,6,0,3,28.0,0,0,0,1
6,7,0,1,54.0,0,0,0,1
7,8,0,3,2.0,3,1,0,1
8,9,1,3,27.0,0,2,1,0
9,10,1,2,14.0,1,0,1,0
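As a small illustration of what `get_dummies` does, the sketch below one-hot encodes a made-up three-row frame (the rows are assumptions, not Titanic records). The `dtype=int` argument is included because recent pandas versions return boolean columns by default, whereas the output above shows 0/1 integers.

```python
import pandas as pd

# Made-up rows for illustration only
toy_df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
})

# Each distinct value of "Sex" becomes its own 0/1 column;
# the original "Sex" column is dropped from the result.
encoded = pd.get_dummies(toy_df, columns=["Sex"], dtype=int)

print(encoded)
```

Exactly one of `Sex_female` / `Sex_male` is 1 in every row, so the two columns carry the same information.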


<br/><br/>

___Since we want the clusters to emerge without the algorithm seeing who survived, let's drop the "Survived" column from the input___

<br/><br/>

In [13]:
X_df = titanic_df.drop(['Survived'],axis=1)

In [14]:
X_df.head(10)

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Sex_female,Sex_male
0,1,3,22.0,1,0,0,1
1,2,1,38.0,1,0,1,0
2,3,3,26.0,0,0,1,0
3,4,1,35.0,1,0,1,0
4,5,3,35.0,0,0,0,1
5,6,3,28.0,0,0,0,1
6,7,1,54.0,0,0,0,1
7,8,3,2.0,3,1,0,1
8,9,3,27.0,0,2,1,0
9,10,2,14.0,1,0,1,0


<br/><br/>

___Until now we've passed "numpy" arrays as input to the "Algo.fit(...)" function. Here, however, we can pass the Pandas DataFrame directly___

<br/><br/>

In [15]:
kmc = KMeans(n_clusters=2)

In [16]:
# note how we can pass a pandas DataFrame directly to the scikit-learn model's fit()
kmc.fit(X_df)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
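The following is a minimal sketch, on made-up and well-separated toy data, of the point that a DataFrame and the equivalent numpy array can both be passed to `fit`. The column values, `n_init=10` and `random_state=0` are assumptions chosen so that the two runs are comparable; they are not settings used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Made-up toy data with two obvious groups; not Titanic records
toy_df = pd.DataFrame({
    "Age": [20.0, 22.0, 21.0, 60.0, 62.0, 61.0],
    "Pclass": [3, 3, 3, 1, 1, 1],
})

# Same seed for both fits so that the results can be compared
km_df = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_df)
km_np = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_df.to_numpy())

# The DataFrame is converted to an array internally, so the labels agree
print(np.array_equal(km_df.labels_, km_np.labels_))
print(km_df.labels_)
```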

<br/><br/>

___Let's check the predictions and compare it against the original values___

<br/><br/>

In [17]:
predict = kmc.predict(X_df)

In [18]:
predict

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [19]:
X_df.shape, predict.shape

((891, 7), (891,))

In [20]:
correct = 0
for index in range(X_df.shape[0]):
    if predict[index] == titanic_df.Survived[index]:
        correct += 1
    
correct / predict.shape[0]

0.49158249158249157

<br/><br/>

___So far we get ~50% accuracy. Let's see if scaling the values using MinMaxScaler() improves the situation___

<br/><br/>

In [21]:
from sklearn.preprocessing import MinMaxScaler

mm_scalar = MinMaxScaler()

In [22]:
X_scalar = mm_scalar.fit_transform(X_df)


In [23]:
X_scalar

array([[0.        , 1.        , 0.27117366, ..., 0.        , 0.        ,
        1.        ],
       [0.0011236 , 0.        , 0.4722292 , ..., 0.        , 1.        ,
        0.        ],
       [0.00224719, 1.        , 0.32143755, ..., 0.        , 1.        ,
        0.        ],
       ...,
       [0.99775281, 1.        , 0.34656949, ..., 0.33333333, 1.        ,
        0.        ],
       [0.9988764 , 0.        , 0.32143755, ..., 0.        , 0.        ,
        1.        ],
       [1.        , 1.        , 0.39683338, ..., 0.        , 0.        ,
        1.        ]])

In [24]:
type(X_scalar)

numpy.ndarray

In [25]:
kmc = KMeans(n_clusters=2)
kmc.fit(X_scalar)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [26]:
predict = kmc.predict(X_scalar)

In [27]:
correct = 0
for index in range(X_df.shape[0]):
    if predict[index] == titanic_df.Survived[index]:
        correct += 1
    
correct / predict.shape[0]

0.2132435465768799
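One caveat when reading these scores: K-Means cluster ids are arbitrary, so cluster `0` is not guaranteed to mean "not survived". With two clusters and 0/1 ground truth, a score of 0.2132 under one naming is the same grouping as 1 - 0.2132, roughly 0.79, under the swapped naming. The sketch below uses a hypothetical helper, `aligned_accuracy`, that reports the better of the two namings; the toy arrays are made up.

```python
import numpy as np

def aligned_accuracy(labels, truth):
    """Accuracy of a 2-cluster labelling, allowing for swapped cluster ids.

    Assumes both inputs contain only 0s and 1s.
    """
    labels = np.asarray(labels)
    truth = np.asarray(truth)
    acc = float(np.mean(labels == truth))
    # Swapping the two cluster ids turns an accuracy of acc into 1 - acc
    return max(acc, 1.0 - acc)

# Made-up example: same grouping as the truth except the last item, ids swapped
truth = np.array([0, 1, 1, 0, 1])
labels = np.array([1, 0, 0, 1, 1])

raw = float(np.mean(labels == truth))
print("raw accuracy:", raw)
print("aligned accuracy:", aligned_accuracy(labels, truth))
```

In this notebook it could be called as `aligned_accuracy(predict, titanic_df['Survived'])`.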

<br><br><br>

#### Exercise :

___Convert the Pandas DataFrame into "numpy" arrays, use them with KMeans, and check that the results match___
<br><br><br>

In [28]:
# axis must be passed by keyword; the positional form was removed in pandas 2.0
X = np.array(titanic_df.drop(['Survived'], axis=1).astype(float))

In [29]:
X[0:10,:]

array([[ 1.,  3., 22.,  1.,  0.,  0.,  1.],
       [ 2.,  1., 38.,  1.,  0.,  1.,  0.],
       [ 3.,  3., 26.,  0.,  0.,  1.,  0.],
       [ 4.,  1., 35.,  1.,  0.,  1.,  0.],
       [ 5.,  3., 35.,  0.,  0.,  0.,  1.],
       [ 6.,  3., 28.,  0.,  0.,  0.,  1.],
       [ 7.,  1., 54.,  0.,  0.,  0.,  1.],
       [ 8.,  3.,  2.,  3.,  1.,  0.,  1.],
       [ 9.,  3., 27.,  0.,  2.,  1.,  0.],
       [10.,  2., 14.,  1.,  0.,  1.,  0.]])

<br/><br/><br/>

In [30]:
# Write Code

<br/><br/>

___Using MinMaxScaler()___

<br/><br/>

In [31]:
# Write Code