# <p style="padding:10px;background-color:#254441;margin:0;color:#e9c46a;font-family:newtimeroman;font-size:150%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:500">Practice Preprocessing Machine Learning</p>


Data preprocessing covers the steps needed to transform or encode raw data so that a machine learning algorithm can parse it easily.

A model can only make accurate and precise predictions if the algorithm can readily interpret the features in the data.




<img src="https://assets-global.website-files.com/5d7b77b063a9066d83e1209c/627d122b8fdb884d672952bf_61f7bfab94334458028eec7d_data-preprocessing-cover.png" width=30% />

<img src="https://assets-global.website-files.com/5d7b77b063a9066d83e1209c/613749410d056eb67ec4b11f_model-building.png" width=30% />

Most real-world datasets used for machine learning are prone to missing, inconsistent, and noisy values because they come from heterogeneous sources.

Applying data mining algorithms to such noisy data does not give quality results, because the algorithms fail to identify patterns effectively. Data preprocessing is therefore important for improving overall data quality.

* Duplicate or missing values may give an incorrect view of the overall statistics of data.
* Outliers and inconsistent data points often tend to disturb the model’s overall learning, leading to false predictions.

Quality decisions must be based on quality data. Data preprocessing is how we obtain that quality data; without it, the result is a <font color='red'>Garbage In, Garbage Out</font> scenario.

<a id='top'></a>
<div class="list-group" id="list-tab" role="tablist">
<p style="padding:10px;background-color:#254441;margin:0;color:#e9c46a;font-family:newtimeroman;font-size:130%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:500">Table Of Contents</p>   
    

    
|No  | Contents |No  | Contents  |
|:---| :---     |:---| :----     |
|1   | [<font color="#254441"> Importing Libraries</font>](#1)                |5   | [<font color="#254441"> Fillna (Solution 2)</font>](#5)
|2   | [<font color="#254441"> Importing Dataset</font>](#2)                  |6   | [<font color="#254441"> Scikit-learn (Solution 3)</font>](#6)    
|3   | [<font color="#254441"> Missing Values</font>](#3)                     |7   | [<font color="#254441"> Encoding the Independent Variable</font>](#7)     
|4   | [<font color="#254441"> Dropna (Solution 1)</font>](#4)                |8  | [<font color="#254441"> Encoding the Dependent Variable</font>](#8)      


<a id="1"></a>
# <p style="padding:10px;background-color:#254441;margin:0;color:#e9c46a;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:500">Importing Libraries</p>

In [2]:
%pip install matplotlib pandas numpy seaborn scikit-learn -q

Note: you may need to restart the kernel to use updated packages.


In [2]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

<a id="2"></a>
# <p style="padding:10px;background-color:#254441;margin:0;color:#e9c46a;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:500">Importing Dataset</p>

__Now we have a small dataset that we want to examine__

In [16]:
df = pd.read_csv('./Datasets/Data.csv')

In [17]:
df

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [18]:
df.columns

Index(['Country', 'Age', 'Salary', 'Purchased'], dtype='object')

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Country    11 non-null     object 
 1   Age        11 non-null     float64
 2   Salary     11 non-null     float64
 3   Purchased  12 non-null     object 
dtypes: float64(2), object(2)
memory usage: 516.0+ bytes


In [20]:
df.shape

(12, 4)

In [21]:
df.describe()

Unnamed: 0,Age,Salary
count,11.0,11.0
mean,38.454545,64363.636364
std,6.919012,11047.829898
min,27.0,48000.0
25%,36.0,56000.0
50%,37.0,67000.0
75%,42.0,69500.0
max,50.0,83000.0


<a id="3"></a>
# <p style="padding:10px;background-color:#254441;margin:0;color:#e9c46a;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:500">Check Duplicate Values</p>

In [22]:
df.duplicated().sum()

np.int64(1)

In [28]:
df.drop_duplicates(inplace=True)
df.duplicated().sum()

np.int64(0)
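
As a small illustration of the duplicate check above, the sketch below uses a hypothetical three-row frame (`toy` is not part of the dataset). It shows how `keep=False` flags every copy of a repeated row, and how `ignore_index=True` renumbers the rows after dropping so that no gaps are left in the index.

```python
import pandas as pd

# Hypothetical toy frame: the last row repeats the first one.
toy = pd.DataFrame({
    "Country": ["France", "Spain", "France"],
    "Age": [44.0, 27.0, 44.0],
})

# keep=False marks every member of a duplicate group, not only the later repeats.
all_copies = toy[toy.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence by default;
# ignore_index=True relabels the remaining rows 0..n-1.
deduped = toy.drop_duplicates(ignore_index=True)

print(all_copies)
print(deduped)
```

A contiguous index matters later if the frame is concatenated column-wise with another frame that uses a fresh 0..n-1 index.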

In [27]:
df

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


<a id="3"></a>
# <p style="padding:10px;background-color:#254441;margin:0;color:#e9c46a;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:500">Missing Values</p>

One of the major problems in the data cleaning and exploratory data analysis phase is handling missing values. A missing value is a data value that is not stored for a variable in an observation. The problem is common in almost all research, and it can have a significant effect on the conclusions drawn from the data.

### Sources of Missing Values

Before we dive into code, it’s important to understand the sources of missing data. Here are some typical reasons why data is missing:

* A user forgot to fill in a field.
* Data was lost while being transferred manually from a legacy database.
* There was a programming error.
* Users chose not to fill out a field because of their beliefs about how the results would be used or interpreted.

In [23]:
df.isnull().sum()

Country      1
Age          1
Salary       1
Purchased    0
dtype: int64

<a id="4"></a>
## <p style="padding:10px;background-color:#254441;margin:0;color:#e9c46a;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:500">Dropna (Solution 1)</p>

### The dropna() function removes rows (or columns) that contain missing values.

In [12]:
df_dropna = df.copy()

In [13]:
print('Before:', df_dropna.shape)

df_dropna.dropna(inplace=True)

print('After:', df_dropna.shape)

Before: (11, 4)
After: (8, 4)
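
Dropping every row with any missing value can discard a lot of data. The sketch below, on a hypothetical four-row frame (`toy` and its values are made up for illustration), shows a few `dropna()` parameters that narrow what gets removed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in each of three rows.
toy = pd.DataFrame({
    "Country": ["France", "Spain", None, "Germany"],
    "Age": [44.0, np.nan, 37.0, 40.0],
    "Salary": [72000.0, 52000.0, 67000.0, np.nan],
    "Purchased": ["No", "No", "Yes", "Yes"],
})

any_missing = toy.dropna()             # drop rows with at least one missing value
only_age = toy.dropna(subset=["Age"])  # drop rows only when Age is missing
at_least_3 = toy.dropna(thresh=3)      # keep rows with 3 or more non-missing values
complete_cols = toy.dropna(axis=1)     # drop columns that contain missing values

print(any_missing.shape, only_age.shape, at_least_3.shape, complete_cols.shape)
```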


In [14]:
df_dropna.isnull().value_counts()

Country  Age    Salary  Purchased
False    False  False   False        8
Name: count, dtype: int64

In [55]:
df_dropna

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
5,France,35.0,58000.0,Yes
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


<a id="5"></a>
## <p style="padding:10px;background-color:#254441;margin:0;color:#e9c46a;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:500">Fillna (Solution 2)</p>

The fillna() method replaces missing (NaN) values with a specified value.

By default, fillna() returns a new DataFrame. If the <font color='red'>inplace</font> parameter is set to True, it modifies the original DataFrame instead.
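
The difference between the two modes can be seen in a minimal sketch. The frame `toy` is a hypothetical one-column example, not the dataset used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a single missing age.
toy = pd.DataFrame({"Age": [44.0, np.nan, 30.0]})

# Default behaviour: a new DataFrame is returned and the original is untouched.
filled = toy.fillna(0)
missing_in_original = int(toy["Age"].isna().sum())

# A dict maps column names to fill values, here the column mean.
mean_filled = toy.fillna({"Age": toy["Age"].mean()})

# inplace=True modifies toy itself.
toy.fillna(0, inplace=True)
missing_after_inplace = int(toy["Age"].isna().sum())

print(missing_in_original, missing_after_inplace)
```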

In [63]:
dfFill = df.copy()
dfFillZero = df.copy()
dfFill

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [57]:
# dfFill.fillna(dfFill.mean(), inplace=True) # Error Could not convert ['FranceSpainGermanySpainGermanyFranceSpainFranceGermanyFrance' 'NoYesNoNoYesYesNoYesNoYes'] to numeric

# Fill NaN with mean in numeric columns only
numeric_columns = dfFill.select_dtypes(include=['number']).columns

# dfFill[numeric_columns] = dfFill[numeric_columns].fillna(dfFill[numeric_columns].mean())
# or 
dfFill.fillna(dfFill[numeric_columns].mean(), inplace=True)

print (dfFill.isnull().sum())
dfFill

Country      1
Age          0
Salary       0
Purchased    0
dtype: int64


Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,64100.0,Yes
5,France,35.0,58000.0,Yes
6,Spain,38.6,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [66]:
# Fill NaN with 0 in numeric columns only
numeric_columns = df.select_dtypes(include=['number']).columns

dfFillZero[numeric_columns] = dfFillZero[numeric_columns].fillna(0)

print(dfFillZero.isnull().sum())
dfFillZero

Country      1
Age          0
Salary       0
Purchased    0
dtype: int64


Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,0.0,Yes
5,France,35.0,58000.0,Yes
6,Spain,0.0,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


<a id="6"></a>
## <p style="padding:10px;background-color:#254441;margin:0;color:#e9c46a;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:500">Scikit-learn (Solution 3)</p>

The <font color='red'>**scikit-learn library**</font> provides the SimpleImputer pre-processing class that can be used to replace missing values.

It is a flexible class that lets you specify which value counts as missing (it can be something other than NaN) and the strategy used to replace it (<font color='red'>**such as mean, median, most frequent, or a constant** </font>). SimpleImputer accepts a NumPy array or a DataFrame and, by default, returns a NumPy array.
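
As a minimal sketch of how the class behaves, the example below fits a mean imputer on a hypothetical two-column frame (`toy` is made up for illustration). It assumes scikit-learn's default output configuration, in which `fit_transform` returns a NumPy array, and it reads the learned per-column fill values from the `statistics_` attribute.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one missing value per column.
toy = pd.DataFrame({
    "Age": [44.0, np.nan, 30.0],
    "Salary": [72000.0, 48000.0, np.nan],
})

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = imputer.fit_transform(toy)  # NumPy array under the default configuration

# statistics_ holds the value learned for each column during fit.
print(imputer.statistics_)

# Wrap the array back into a DataFrame to keep the column names and index.
toy_imputed = pd.DataFrame(imputed, columns=toy.columns, index=toy.index)
print(toy_imputed)
```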

In [69]:
X = df[['Country', 'Age', 'Salary']].values
y = df['Purchased'].values
print(X)
print(y)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]
 [nan 37.0 67000.0]]
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes' 'Yes']


### <p style="padding:10px;background-color:#254441;margin:0;color:#e9c46a;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:500">Mean</p>

In [76]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])  # all rows, columns 1 and 2 (Age and Salary) only
# Fitting on columns 0 to 2 instead of 1 to 2 fails, because the
# mean strategy cannot be used with non-numeric data:
# could not convert string to float: 'France'
X[:, 1:3] = imputer.transform(X[:, 1:3])

In [71]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 64100.0]
 ['France' 35.0 58000.0]
 ['Spain' 38.6 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]
 [nan 37.0 67000.0]]


### <p style="padding:10px;background-color:#254441;margin:0;color:#e9c46a;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:500">Median</p>

In [77]:
X = df[['Country', 'Age', 'Salary']].values
y = df['Purchased'].values
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
imputer.fit(X[:, 1:3])  # all rows, columns 1 and 2 (Age and Salary) only
# Fitting on columns 0 to 2 instead of 1 to 2 fails, because the
# median strategy cannot be used with non-numeric data:
# could not convert string to float: 'France'
X[:, 1:3] = imputer.transform(X[:, 1:3])
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 64000.0]
 ['France' 35.0 58000.0]
 ['Spain' 37.5 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]
 [nan 37.0 67000.0]]


### <p style="padding:10px;background-color:#254441;margin:0;color:#e9c46a;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:500">Most_frequent</p>

In [95]:
X = df[['Country', 'Age', 'Salary']].values
y = df['Purchased'].values
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imputer.fit(X[:, 0:3]) # All rows of column 0 to 2
# Mode strategy works for both numeric and object type data
X[:, 0:3] = imputer.transform(X[:, 0:3])
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 67000.0]
 ['France' 35.0 58000.0]
 ['Spain' 37.0 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]
 ['France' 37.0 67000.0]]


### <p style="padding:10px;background-color:#254441;margin:0;color:#e9c46a;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:500">Constant</p>

In [96]:
# X = df[['Country', 'Age', 'Salary']].values
# y = df['Purchased'].values
# imputer = SimpleImputer(missing_values=np.nan, strategy='constant')
# imputer.fit(X[:, 0:3])
# X[:, 0:3] = imputer.transform(X[:, 0:3])
print(X)
print(y)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 67000.0]
 ['France' 35.0 58000.0]
 ['Spain' 37.0 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]
 ['France' 37.0 67000.0]]
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes' 'Yes']
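
The cell above leaves the constant strategy commented out, so `X` is unchanged. For reference, here is a minimal sketch of that strategy on hypothetical arrays; the `fill_value` choices (`"Unknown"` and `0.0`) are assumptions made for illustration, not values taken from this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with one missing entry.
countries = np.array([["France"], ["Spain"], [np.nan]], dtype=object)
text_imputer = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value="Unknown")
countries_filled = text_imputer.fit_transform(countries)

# Hypothetical numeric column with one missing entry.
ages = np.array([[44.0], [np.nan], [30.0]])
num_imputer = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value=0.0)
ages_filled = num_imputer.fit_transform(ages)

print(countries_filled)
print(ages_filled)
```

Using a separate imputer per column type keeps the fill value meaningful for each kind of data.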


<a id="7"></a>
# <p style="padding:10px;background-color:#254441;margin:0;color:#e9c46a;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:500">Encoding the Independent Variable</p>

One-hot encoding converts a categorical variable into a set of binary columns, one per category, so that machine learning and deep learning algorithms that expect numeric input can use it. This can in turn improve the prediction and classification accuracy of a model.

The image below shows what we want to achieve by implementing One-Hot Encoding.

<img src="https://miro.medium.com/max/1400/1*dWvkew37QCveEekRdTirsw.png" />

### <p style="padding:10px;background-color:#254441;margin:0;color:#e9c46a;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:500">Pandas (get_dummies)</p>

In [84]:
pd.get_dummies(df, columns=['Country'])

Unnamed: 0,Age,Salary,Purchased,Country_France,Country_Germany,Country_Spain
0,44.0,72000.0,No,True,False,False
1,27.0,48000.0,Yes,False,False,True
2,30.0,54000.0,No,False,True,False
3,38.0,61000.0,No,False,False,True
4,40.0,,Yes,False,True,False
5,35.0,58000.0,Yes,True,False,False
6,,52000.0,No,False,False,True
7,48.0,79000.0,Yes,True,False,False
8,50.0,83000.0,No,False,True,False
9,37.0,67000.0,Yes,True,False,False


In [85]:
pd.get_dummies(df)

Unnamed: 0,Age,Salary,Country_France,Country_Germany,Country_Spain,Purchased_No,Purchased_Yes
0,44.0,72000.0,True,False,False,True,False
1,27.0,48000.0,False,False,True,False,True
2,30.0,54000.0,False,True,False,True,False
3,38.0,61000.0,False,False,True,True,False
4,40.0,,False,True,False,False,True
5,35.0,58000.0,True,False,False,False,True
6,,52000.0,False,False,True,True,False
7,48.0,79000.0,True,False,False,False,True
8,50.0,83000.0,False,True,False,True,False
9,37.0,67000.0,True,False,False,False,True


<a id="8"></a>
# <p style="padding:10px;background-color:#254441;margin:0;color:#e9c46a;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:500">Encoding the Dependent Variable</p>

In [98]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

In [100]:
print(y)

[0 1 0 0 1 1 0 1 0 1 1]
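
To see which label each integer stands for, the fitted encoder exposes `classes_`, and `inverse_transform()` maps the integers back. A minimal sketch on a hypothetical label array:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels in the same Yes/No form as the Purchased column.
labels = np.array(["No", "Yes", "No", "Yes"])

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ is sorted; the position of a label in it is its integer code.
print(le.classes_)

# inverse_transform recovers the original labels from the codes.
decoded = le.inverse_transform(encoded)
print(encoded, decoded)
```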


### My Code

#### Load the Data

In [2]:
df = pd.read_csv('./Datasets/Data.csv')
df

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Country    10 non-null     object 
 1   Age        10 non-null     float64
 2   Salary     10 non-null     float64
 3   Purchased  11 non-null     object 
dtypes: float64(2), object(2)
memory usage: 484.0+ bytes


In [7]:
df.isna().sum()

Country      1
Age          1
Salary       1
Purchased    0
dtype: int64

In [12]:
print("Missing Values:")
for col in df.columns:
    missing = df[col].isna().sum()
    percent = missing / df.shape[0] * 100
    print("%s: %.2f%% (%d)" % (col,percent,missing))

Missing Values:
Country: 9.09% (1)
Age: 9.09% (1)
Salary: 9.09% (1)
Purchased: 0.00% (0)
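
The same summary can be computed without an explicit loop, since the mean of a boolean mask is the fraction of `True` values. A minimal sketch on a hypothetical frame (`toy` and `summary` are illustrative names):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one of four ages is missing.
toy = pd.DataFrame({
    "Age": [44.0, np.nan, 30.0, 38.0],
    "Purchased": ["No", "Yes", "No", "No"],
})

# isna() gives a boolean mask; sum() counts and mean() gives the fraction missing.
summary = pd.DataFrame({
    "missing": toy.isna().sum(),
    "percent": toy.isna().mean() * 100,
})
print(summary)
```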


#### Null value fill

In [14]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

# 1. Mean Imputation for numeric columns
mean_imputer = SimpleImputer(strategy='mean')
df[['Salary']] = mean_imputer.fit_transform(df[['Salary']])

# 2. Median Imputation for numeric columns
median_imputer = SimpleImputer(strategy='median')
df[['Age']] = median_imputer.fit_transform(df[['Age']])

# 3. Mode Imputation for categorical columns
mode_imputer = SimpleImputer(strategy='most_frequent')
df[['Country']] = mode_imputer.fit_transform(df[['Country']])

df


Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,64100.0,Yes
5,France,35.0,58000.0,Yes
6,Spain,37.5,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


#### Feature Scaling

In [29]:
# MinMaxScaler performs min-max normalization: it rescales each feature to the [0, 1] range
from sklearn.preprocessing import MinMaxScaler

mm = MinMaxScaler()  # create a scaler instance

numeric_columns = df.select_dtypes(include=['number']).columns.tolist()
# print(numeric_columns)
df[numeric_columns] = mm.fit_transform(df[numeric_columns])  # fit the scaler and transform the numeric columns
df

Unnamed: 0,Country,Age,Salary,Purchased
0,France,0.73913,0.685714,No
1,Spain,0.0,0.0,Yes
2,Germany,0.130435,0.171429,No
3,Spain,0.478261,0.371429,No
4,Germany,0.565217,0.46,Yes
5,France,0.347826,0.285714,Yes
6,Spain,0.456522,0.114286,No
7,France,0.913043,0.885714,Yes
8,Germany,1.0,1.0,No
9,France,0.434783,0.542857,Yes
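
With its default `feature_range=(0, 1)`, MinMaxScaler maps each value to $x' = (x - x_{min}) / (x_{max} - x_{min})$, computed per column. The sketch below checks this by hand on a hypothetical column of three ages.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature array of ages.
ages = np.array([[27.0], [38.0], [50.0]])

# Manual min-max scaling, column by column.
col_min = ages.min(axis=0)
col_max = ages.max(axis=0)
manual = (ages - col_min) / (col_max - col_min)

# The same transformation through scikit-learn.
scaled = MinMaxScaler().fit_transform(ages)

print(manual.ravel())
print(np.allclose(manual, scaled))
```

Because the minimum and maximum define the scale, a single extreme value compresses all other values into a narrow part of the range.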


#### One Hot Encoding

In [30]:
#one hot encoding using OneHotEncoder of Scikit-Learn

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

#Print the dataframe:
print(f"Employee data : \n{df}")

#Extract categorical columns from the dataframe
#Here we extract the columns with object datatype as they are the categorical columns
# categorical_columns = df.select_dtypes(include=['object']).columns.tolist()
# print(categorical_columns)
categorical_columns = ['Country']
#Initialize OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)

# Apply one-hot encoding to the categorical columns
one_hot_encoded = encoder.fit_transform(df[categorical_columns])

#Create a DataFrame with the one-hot encoded columns
#We use get_feature_names_out() to get the column names for the encoded data
one_hot_df = pd.DataFrame(one_hot_encoded, columns=encoder.get_feature_names_out(categorical_columns))

# Concatenate the one-hot encoded dataframe with the original dataframe
df_encoded = pd.concat([df, one_hot_df], axis=1)

# Drop the original categorical columns
df_encoded = df_encoded.drop(categorical_columns, axis=1)

# Display the resulting dataframe
print(f"Encoded Employee data : \n{df_encoded}")

Employee data : 
    Country       Age    Salary Purchased
0    France  0.739130  0.685714        No
1     Spain  0.000000  0.000000       Yes
2   Germany  0.130435  0.171429        No
3     Spain  0.478261  0.371429        No
4   Germany  0.565217  0.460000       Yes
5    France  0.347826  0.285714       Yes
6     Spain  0.456522  0.114286        No
7    France  0.913043  0.885714       Yes
8   Germany  1.000000  1.000000        No
9    France  0.434783  0.542857       Yes
10   France  0.434783  0.542857       Yes
Encoded Employee data : 
         Age    Salary Purchased  Country_France  Country_Germany  \
0   0.739130  0.685714        No             1.0              0.0   
1   0.000000  0.000000       Yes             0.0              0.0   
2   0.130435  0.171429        No             0.0              1.0   
3   0.478261  0.371429        No             0.0              0.0   
4   0.565217  0.460000       Yes             0.0              1.0   
5   0.347826  0.285714       Yes        

#### Converting to X and y for training

In [31]:
x = df_encoded[['Country_France', 'Country_Germany', 'Country_Spain', 'Age', 'Salary']].values
y = df_encoded['Purchased'].values

In [32]:
x

array([[1.        , 0.        , 0.        , 0.73913043, 0.68571429],
       [0.        , 0.        , 1.        , 0.        , 0.        ],
       [0.        , 1.        , 0.        , 0.13043478, 0.17142857],
       [0.        , 0.        , 1.        , 0.47826087, 0.37142857],
       [0.        , 1.        , 0.        , 0.56521739, 0.46      ],
       [1.        , 0.        , 0.        , 0.34782609, 0.28571429],
       [0.        , 0.        , 1.        , 0.45652174, 0.11428571],
       [1.        , 0.        , 0.        , 0.91304348, 0.88571429],
       [0.        , 1.        , 0.        , 1.        , 1.        ],
       [1.        , 0.        , 0.        , 0.43478261, 0.54285714],
       [1.        , 0.        , 0.        , 0.43478261, 0.54285714]])

In [35]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

In [36]:
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1])