# ML Preparing data

### How to make data useful

A quick tip before we start:

On Windows, piping `pip3 list` through `findstr` shows the installed version of any package. For example:

pip3 list | findstr numpy

numpy                         1.17.4
numpydoc                      0.9.2

To look for more than one package at a time, pass several search terms in quotes:

pip3 list | findstr "scikit numpy"

numpy                         1.17.4
numpydoc                      0.9.2
scikit-learn                  0.22.1
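The same check can be done from inside Python, which also works on systems without `findstr`. The following is a minimal sketch using the standard library's `importlib.metadata` (Python 3.8+); the helper name `installed_version` is hypothetical and only used for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Print the version of each package used in this notebook (None if not installed)
for name in ("numpy", "pandas", "scikit-learn"):
    print(f"{name:<15}{installed_version(name)}")
```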

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.read_csv('patientData.csv')

In [2]:
dataset.shape

(10, 4)

In [3]:
dataset.head()

   Gender   Age  Albumin Liver Disease
0  Female  62.0      3.3           Yes
1  Female  65.0      3.2           Yes
2  Female  45.0      3.1            No
3    Male  40.0      4.1            No
4    Male  50.0      NaN           Yes


In [4]:
print(dataset)

   Gender   Age  Albumin Liver Disease
0  Female  62.0      3.3           Yes
1  Female  65.0      3.2           Yes
2  Female  45.0      3.1            No
3    Male  40.0      4.1            No
4    Male  50.0      NaN           Yes
5  Female  55.0      3.3            No
6    Male   NaN      3.6           Yes
7    Male  67.0      3.8           Yes
8    Male  75.0      3.0            No
9  Female  64.0      3.4           Yes
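
Before deciding how to treat the `NaN` entries, it can help to count them. The sketch below rebuilds the table above by hand (instead of reading `patientData.csv`, which is assumed to hold the same values) and uses `DataFrame.isna()` to locate the gaps.

```python
import numpy as np
import pandas as pd

# The ten rows printed above, typed in by hand for a self-contained example
dataset = pd.DataFrame({
    "Gender": ["Female", "Female", "Female", "Male", "Male",
               "Female", "Male", "Male", "Male", "Female"],
    "Age": [62, 65, 45, 40, 50, 55, np.nan, 67, 75, 64],
    "Albumin": [3.3, 3.2, 3.1, 4.1, np.nan, 3.3, 3.6, 3.8, 3.0, 3.4],
    "Liver Disease": ["Yes", "Yes", "No", "No", "Yes",
                      "No", "Yes", "Yes", "No", "Yes"],
})

# Number of missing values in each column
missing_per_column = dataset.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = dataset[dataset.isna().any(axis=1)]
print(rows_with_missing)
```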


In [5]:
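# Features: every column except the last one
X = dataset.iloc[:,:-1].values
# Target: the last column (Liver Disease, at position 3)
Y = dataset.iloc[:,3].values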

In [6]:
print(Y)

['Yes' 'Yes' 'No' 'No' 'Yes' 'No' 'Yes' 'Yes' 'No' 'Yes']
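
Positional indexing with `iloc` breaks silently if the column order changes. As an alternative sketch, the features and target can be selected by column name; the three-row frame below is a shortened copy of the data, and the variable `target_column` is introduced here only for illustration.

```python
import pandas as pd

# First three rows of the patient table, typed in by hand
dataset = pd.DataFrame({
    "Gender": ["Female", "Female", "Female"],
    "Age": [62.0, 65.0, 45.0],
    "Albumin": [3.3, 3.2, 3.1],
    "Liver Disease": ["Yes", "Yes", "No"],
})

target_column = "Liver Disease"

# Features: everything except the target column
X = dataset.drop(columns=[target_column]).values
# Target: the named column only
Y = dataset[target_column].values

print(X.shape)
print(Y)
```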


#### Now we need to handle the missing data
We will use `SimpleImputer` from the scikit-learn library to replace each missing value with the mean of its column.

In [7]:
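from sklearn.impute import SimpleImputer
# Replace every NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Columns 1 and 2 (Age, Albumin) are the numeric columns with gaps
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])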

In [8]:
X

array([['Female', 62.0, 3.3],
       ['Female', 65.0, 3.2],
       ['Female', 45.0, 3.1],
       ['Male', 40.0, 4.1],
       ['Male', 50.0, 3.422222222222222],
       ['Female', 55.0, 3.3],
       ['Male', 58.111111111111114, 3.6],
       ['Male', 67.0, 3.8],
       ['Male', 75.0, 3.0],
       ['Female', 64.0, 3.4]], dtype=object)
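
The two filled-in values (58.11 for Age and 3.42 for Albumin) are simply the means of the known entries in each column. As a rough cross-check that does not need scikit-learn, the same numbers can be reproduced with `np.nanmean`; the arrays below are copied by hand from the table.

```python
import numpy as np

age = np.array([62, 65, 45, 40, 50, 55, np.nan, 67, 75, 64], dtype=float)
albumin = np.array([3.3, 3.2, 3.1, 4.1, np.nan, 3.3, 3.6, 3.8, 3.0, 3.4])

# Mean of each column, ignoring the missing entries
age_mean = np.nanmean(age)
albumin_mean = np.nanmean(albumin)

# Fill the gaps with the column mean, leaving known values untouched
age_filled = np.where(np.isnan(age), age_mean, age)
albumin_filled = np.where(np.isnan(albumin), albumin_mean, albumin)

print(age_filled[6])      # the imputed Age, about 58.11
print(albumin_filled[4])  # the imputed Albumin, about 3.42
```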

#### Encode categorical data

In our dataset, there are two categorical columns:

- Gender
- Liver Disease

So we need to encode these two columns.

In [9]:
# Encode categorical data.
# Gender is dummy (one-hot) encoded because most machine learning algorithms
# need numeric input and cannot work with strings such as 'Male' or 'Female'.

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Convert the Gender strings to integers (Female -> 0, Male -> 1)
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
# One-hot encode column 0 and pass the remaining columns through unchanged
gndr = ColumnTransformer([("Gender", OneHotEncoder(), [0])], remainder='passthrough')
X = gndr.fit_transform(X)
# Encode the target labels as integers (No -> 0, Yes -> 1)
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)

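The `LabelEncoder` step on `X` converts the Gender strings to integers before the `OneHotEncoder` runs. That step is only needed on old scikit-learn releases: from version 0.20 onward, `OneHotEncoder` accepts string categories directly, so you can skip the `LabelEncoder` and use the `OneHotEncoder` on its own.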
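
As a small illustration of one-hot encoding applied straight to strings, the sketch below encodes a hand-made Gender column without any `LabelEncoder`. It assumes a scikit-learn version in which `OneHotEncoder` supports string input (0.20 or later) and converts the sparse result to a dense array with `.toarray()`.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A single categorical column, shaped (n_samples, 1) as the encoder expects
gender = np.array([["Female"], ["Female"], ["Male"], ["Male"]], dtype=object)

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; make it dense for printing
encoded = encoder.fit_transform(gender).toarray()

# Categories are sorted, so column 0 is Female and column 1 is Male
print(encoder.categories_)
print(encoded)
```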


In [10]:
X

array([[1.0, 0.0, 62.0, 3.3],
       [1.0, 0.0, 65.0, 3.2],
       [1.0, 0.0, 45.0, 3.1],
       [0.0, 1.0, 40.0, 4.1],
       [0.0, 1.0, 50.0, 3.422222222222222],
       [1.0, 0.0, 55.0, 3.3],
       [0.0, 1.0, 58.111111111111114, 3.6],
       [0.0, 1.0, 67.0, 3.8],
       [0.0, 1.0, 75.0, 3.0],
       [1.0, 0.0, 64.0, 3.4]], dtype=object)

In [11]:
Y

array([1, 1, 0, 0, 1, 0, 1, 1, 0, 1])
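
To see which integer stands for which label, a fitted `LabelEncoder` exposes the sorted labels in `classes_` (the position in that array is the integer code) and can map codes back with `inverse_transform`. The short list of labels below is made up from the first five rows of the target column.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["Yes", "Yes", "No", "No", "Yes"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted alphabetically: index 0 is 'No', index 1 is 'Yes'
print(encoder.classes_)
print(encoded)

# Map the integer codes back to the original strings
decoded = encoder.inverse_transform(encoded)
print(decoded)
```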

### Split the dataset into Training Set and Test Set.

A common choice is to use 70% of the data for training and 30% for testing. In this example we use 80% for training and 20% for testing, which leaves 8 rows in the training set and 2 rows in the test set.

In [13]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
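
The sketch below shows what `test_size` and `random_state` do, using placeholder arrays rather than the patient data: with 10 samples and `test_size=0.2`, 2 samples go to the test set, and repeating the call with the same `random_state` gives the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 samples with 2 features each, and 10 targets
X = np.arange(20).reshape(10, 2)
Y = np.arange(10)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)

# The same seed reproduces the same split
_, _, _, Y_test_again = train_test_split(X, Y, test_size=0.2, random_state=0)
print(np.array_equal(Y_test, Y_test_again))
```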

### Feature Scaling

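Many machine learning algorithms rely on the Euclidean distance between data points, so a feature with a large numeric range can dominate a feature with a small one. Here the Age column (40 to 75) and the Albumin column (3.0 to 4.1) have very different ranges, so we bring them onto a comparable scale. This is called feature scaling. `StandardScaler` standardizes each column to zero mean and unit variance. Note that it is applied to every column of `X_train` and `X_test`, including the one-hot encoded Gender columns. The scaler is fitted on the training set only and then reused to transform the test set.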

In [14]:
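from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
# Learn the mean and standard deviation from the training set and scale it
X_train = sc_X.fit_transform(X_train)
# Scale the test set with the statistics learned from the training set
X_test = sc_X.transform(X_test)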

In [15]:
X_train

array([[-1.        ,  1.        , -0.89871681, -0.32793138],
       [ 1.        , -1.        ,  0.74838599, -0.40624335],
       [ 1.        , -1.        ,  0.86603619, -1.11105109],
       [-1.        ,  1.        ,  0.05555704,  0.29856439],
       [-1.        ,  1.        ,  1.10133659,  1.00337213],
       [-1.        ,  1.        , -2.07521881,  2.06058373],
       [ 1.        , -1.        ,  0.51308559, -0.75864722],
       [ 1.        , -1.        , -0.31046581, -0.75864722]])

In [16]:
X_test

array([[ 1.        , -1.        , -1.48696781, -1.46345495],
       [-1.        ,  1.        ,  2.04253819, -1.81585882]])
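
The standardization applied above is z = (x - mean) / std, where the mean and standard deviation come from the training set only. As a rough check by hand, the sketch below applies that formula to the Age column, assuming the two test rows are the patients aged 45 and 75 (which is what the scaled `X_test` output suggests). It uses the population standard deviation (`ddof=0`), the same convention as `StandardScaler`.

```python
import numpy as np

# Age values of the eight training rows (one of them is the imputed mean)
train_age = np.array([62.0, 65.0, 40.0, 50.0, 55.0, 58.111111111111114, 67.0, 64.0])
# Age values of the two held-out test rows
test_age = np.array([45.0, 75.0])

# Statistics are computed on the training data only
mean = train_age.mean()
std = train_age.std()  # population standard deviation (ddof=0)

train_scaled = (train_age - mean) / std
test_scaled = (test_age - mean) / std

print(train_scaled.mean(), train_scaled.std())  # approximately 0 and 1
print(test_scaled)  # approximately [-1.487, 2.043], matching column 3 of X_test
```

Because the test values are scaled with the training statistics, they are not forced to have zero mean themselves, which is why both scaled test ages can lie far from 0.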