In [2]:
'''
Building good training sets: data preprocessing.
The quality of the data and the amount of useful information that it contains are key factors that determine how well
a machine learning algorithm can learn. Therefore, it is absolutely critical that we examine and preprocess a dataset
before we feed it to a learning algorithm.

We will discuss the following:
    => Removing and imputing missing values from the dataset.
    => Getting categorical data into shape for machine learning algorithms.
    => Selecting relevant features for the model construction.

'''


In [3]:
'''
Dealing with missing data

It is not uncommon in real-world applications for our samples to be missing one or more values for various reasons. There
could have been an error in the data collection process, certain measurements may not be applicable, or particular fields could
simply have been left blank in a survey, for example. We typically see missing values as blank spaces in our data table
or as placeholder strings such as NaN, which stands for "not a number", or NULL.
'''


In [4]:
'''
Identifying missing values in tabular data
'''
# Imports
#! pipenv install pandas


In [5]:

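import pandas as pd
from io import StringIO

# Keep the CSV text flush left: leading spaces would become part of the
# column names (for example '    A' instead of 'A').
csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,
'''
df = pd.read_csv(StringIO(csv_data))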

'''
Using the preceding code, we read CSV-formatted data into a pandas DataFrame via the read_csv function and noticed
that the two missing cells were replaced by NaN. The StringIO function in the preceding code example was simply
used for the purpose of illustration. It allows us to read the string assigned to csv_data into a pandas DataFrame
as if it were a regular CSV file on our hard drive.

For a larger DataFrame, it can be tedious to look for missing values manually; in this case, we can use the isnull method
to return a DataFrame with Boolean values that indicate whether a cell contains a numeric value (False) or if data is
missing (True).
'''
df

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,,8.0
2,10.0,11.0,12.0,


In [6]:
# Using the sum method, we can then return the number of missing values per column as follows:
df.isnull().sum()

A    0
B    0
C    1
D    1
dtype: int64
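
The per-column count above can be complemented by the fraction of missing cells per column and the count per row. The following is a minimal sketch that rebuilds the same toy table so it runs on its own; the names csv_text, df_demo, missing_fraction, and missing_per_row are introduced here for illustration only.

```python
from io import StringIO

import pandas as pd

# Same toy data as above, rebuilt here so the sketch is self-contained.
csv_text = "A,B,C,D\n1.0,2.0,3.0,4.0\n5.0,6.0,,8.0\n10.0,11.0,12.0,\n"
df_demo = pd.read_csv(StringIO(csv_text))

# Fraction of missing cells per column (mean of the Boolean mask).
missing_fraction = df_demo.isnull().mean()

# Number of missing cells per row (sum across the columns).
missing_per_row = df_demo.isnull().sum(axis=1)

print(missing_fraction)
print(missing_per_row)
```

Looking at fractions rather than raw counts can make it easier to decide whether a column or row is worth keeping.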

In [7]:
#================
# Dropping rows with missing values
df.dropna(axis=0)
#=======================

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0


In [8]:
'''
Although scikit-learn was developed for working with NumPy arrays, it can sometimes be more convenient to preprocess data using
pandas' DataFrame. We can always access the underlying NumPy array of a DataFrame via the values attribute before we feed it
into a scikit-learn estimator.
'''

In [9]:
'''
Eliminating samples or features with missing values

One of the easiest ways to deal with missing data is to simply remove the corresponding features (columns) or samples (rows)
from the dataset entirely. Rows with missing values can be easily dropped via the dropna method:


'''

df.dropna(axis=0)

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0


In [10]:
'''Similarly, we can drop columns that have at least one NaN in any row by setting the axis argument to 1:'''

df.dropna(axis=1)

Unnamed: 0,A,B
0,1.0,2.0
1,5.0,6.0
2,10.0,11.0


In [11]:
'''The dropna method supports several additional parameters that can come in handy.
Only drop rows where all columns are NaN
(returns the whole DataFrame here since we don't have a row where all values are NaN):'''

df.dropna(how='all')

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,,8.0
2,10.0,11.0,12.0,


In [12]:
'''Drop rows that have fewer than 4 non-NaN values'''
df.dropna(thresh=4)

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0


In [13]:
'''Only drop rows where NaN appears in specific columns (here: 'C')'''
df.dropna(subset=['C'])


Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
2,10.0,11.0,12.0,


In [14]:
'''Although the removal of missing data seems to be a convenient approach, it also comes with certain disadvantages;
for example, we may end up removing too many samples, which will make a reliable analysis impossible. Or, if we remove
too many feature columns, we run the risk of losing valuable information that our classifier needs to discriminate
between classes.'''


In [15]:
'''Imputing missing values'''

'''Often, the removal of samples or dropping of entire feature columns is simply not feasible, because we might
lose too much valuable data. In this case, we can use different interpolation techniques to estimate the missing values
from the other training samples in our dataset. One of the most common interpolation techniques is mean imputation, where
we simply replace the missing value with the mean value of the entire feature column. A convenient way to achieve this is
by using the SimpleImputer class from scikit-learn (called Imputer in older releases).'''

#! pipenv install scikit-learn



In [16]:

# SimpleImputer (sklearn.impute) replaces the older sklearn.preprocessing.Imputer class.
from sklearn.impute import SimpleImputer
import numpy as np

# Missing entries are marked with np.nan (the older API used missing_values='NaN').
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imr = imr.fit(df.values)
imputed_data = imr.transform(df.values)
imputed_data

array([[ 1. ,  2. ,  3. ,  4. ],
       [ 5. ,  6. ,  7.5,  8. ],
       [10. , 11. , 12. ,  6. ]])
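
To see which fill values the imputer actually learned, and how the median strategy compares, a small sketch along these lines may help. It rebuilds the same 3x4 matrix by hand; the names data, mean_imputer, median_imputer, and filled are illustrative and not part of the notebook above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same toy matrix as above, with two missing entries.
data = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, np.nan, 8.0],
    [10.0, 11.0, 12.0, np.nan],
])

mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean').fit(data)
median_imputer = SimpleImputer(missing_values=np.nan, strategy='median').fit(data)

# statistics_ holds the per-column fill value learned during fit.
print(mean_imputer.statistics_)
print(median_imputer.statistics_)

# transform replaces each NaN with the fill value of its column.
filled = mean_imputer.transform(data)
print(filled)
```

With only two observed values in columns C and D, the mean and the median coincide there; on larger, skewed columns the two strategies would typically differ.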

In [17]:
#! pipenv install numpy

In [18]:
'''Here we have replaced each NaN value with the corresponding mean, which is separately calculated for each column.
Other options for the strategy parameter are median or most_frequent.'''


In [19]:
'''Understanding the scikit-learn estimator API'''

'''The SimpleImputer class belongs to the so-called transformer classes in scikit-learn, which are used for data transformation.
The two essential methods of those estimators are fit and transform. The fit method is used to learn the parameters from
the training data, and the transform method uses those parameters to transform the data. Any data array that is to be
transformed needs to have the same number of features as the data array that was used to fit the model.
The classifiers that we used in the previous chapters belong to the so-called estimators in scikit-learn, with an API that
is conceptually very similar to the transformer class. Estimators have a predict method but can also have a transform method.
As you may recall, we also used the fit method to learn the parameters of a model when we trained those estimators for classification.
However, in supervised learning tasks, we additionally provide the class labels for fitting the model, which can then be used
to make predictions about new data samples via the predict method.'''
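
The fit/transform split described above can be sketched as follows. The two small arrays are hypothetical stand-ins for training and test data, and all variable names are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training and test arrays with the same number of features (2).
X_train_demo = np.array([[1.0, 10.0], [3.0, np.nan], [5.0, 30.0]])
X_test_demo = np.array([[np.nan, 40.0], [7.0, np.nan]])

imputer = SimpleImputer(strategy='mean')
imputer.fit(X_train_demo)                        # learn the column means from the training data
X_test_filled = imputer.transform(X_test_demo)   # reuse those means on new data
print(X_test_filled)

# An array with a different number of features is rejected by transform.
try:
    imputer.transform(np.array([[1.0, 2.0, 3.0]]))
    feature_mismatch_detected = False
except ValueError:
    feature_mismatch_detected = True
print(feature_mismatch_detected)
```

Note that the test data is filled with the training means, not with its own means; this keeps information from the test set out of the preprocessing step.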


In [20]:
'''Handling Categorical Data'''

'''
Nominal and ordinal features

Ordinal features can be understood as categorical values that can be sorted or ordered. In contrast, nominal features
don't imply any order. For example, for a T-shirt, the size is an ordinal feature (XL > L > M) and the color is a nominal feature.
'''

In [21]:
'''Creating an example dataset'''
import pandas as pd
df_shirt=pd.DataFrame([
    ['green','M',10.1,'class1'],
    ['red','L',13.5,'class2'],
    ['blue','XL',15.3,'class1']
])

df_shirt.columns=['color','size','price','classlabel']
df_shirt

Unnamed: 0,color,size,price,classlabel
0,green,M,10.1,class1
1,red,L,13.5,class2
2,blue,XL,15.3,class1


In [22]:
'''
Mapping ordinal features
To make sure that the learning algorithm interprets the ordinal features correctly, we need to convert the categorical string
values into integers. Unfortunately, there is no convenient function that can automatically derive the correct order of the labels
for our size feature, so we have to define the mapping manually.
'''


In [23]:
size_mapping={
    'XL':3,
    'L':2,
    'M':1
}

df_shirt['size']=df_shirt['size'].map(size_mapping)

df_shirt

Unnamed: 0,color,size,price,classlabel
0,green,1,10.1,class1
1,red,2,13.5,class2
2,blue,3,15.3,class1


In [24]:
'''If we want to transform the integer values back to the original string representation at a later stage, we can
simply define a reverse-mapping dictionary inv_size_mapping={v:k for k, v in size_mapping.items()} that can then be used
via the pandas map method on the transformed feature column, similar to the size_mapping dictionary that we used
previously.'''

inv_size_mapping={v:k for k, v in size_mapping.items()}
df_shirt['size'].map(inv_size_mapping)

0     M
1     L
2    XL
Name: size, dtype: object
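
As an alternative to a hand-written dictionary, the order can also be declared once through an ordered pandas Categorical. This is only a sketch of that option, not the approach used in this notebook; sizes, size_order, ordered_sizes, and size_codes are illustrative names.

```python
import pandas as pd

sizes = pd.Series(['M', 'L', 'XL'])

# Declare the order explicitly; pandas cannot infer it from the strings.
size_order = ['M', 'L', 'XL']
ordered_sizes = pd.Categorical(sizes, categories=size_order, ordered=True)

# codes are 0-based positions in size_order; add 1 to match size_mapping above.
size_codes = ordered_sizes.codes + 1
print(size_codes)
```

Either way, the order still has to be supplied by us; the Categorical just keeps it attached to the column.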

In [25]:
'''Encoding class labels'''

'''
Many machine learning libraries require that class labels are encoded as integer values. Although most estimators for classification
in scikit-learn convert class labels to integers internally, it is considered good practice to provide class labels as integer
arrays to avoid technical glitches. To encode the class labels, we can use an approach similar to the mapping of ordinal features
discussed previously. We need to remember that class labels are not ordinal, and it doesn't matter which integer number we
assign to a particular string label. Thus, we can simply enumerate the class labels, starting at 0.

'''

import numpy as np

class_mapping={label:idx for idx, label in enumerate(np.unique(df_shirt['classlabel']))}

class_mapping

{'class1': 0, 'class2': 1}

In [26]:
'''Next, we can use the mapping dictionary to transform the class labels into integers:'''
df_shirt['classlabel']=df_shirt['classlabel'].map(class_mapping)
df_shirt


Unnamed: 0,color,size,price,classlabel
0,green,1,10.1,0
1,red,2,13.5,1
2,blue,3,15.3,0


In [27]:
'''We can reverse the key-value pairs in the mapping dictionary as follows to map the converted class labels back to
the original string representation:'''

inv_class_mapping={v:k for k, v in class_mapping.items()}
df_shirt['classlabel']=df_shirt['classlabel'].map(inv_class_mapping)
df_shirt

Unnamed: 0,color,size,price,classlabel
0,green,1,10.1,class1
1,red,2,13.5,class2
2,blue,3,15.3,class1


In [28]:
'''Alternatively, there is a convenient LabelEncoder class directly implemented in scikit-learn:'''
from sklearn.preprocessing import LabelEncoder
class_le=LabelEncoder()
y=class_le.fit_transform(df_shirt['classlabel'].values)
y

array([0, 1, 0])

In [29]:
'''Note that the fit_transform method is just a shortcut for calling fit and transform separately, and we can use the
inverse_transform method to transform the integer class labels back into their original string representation:'''

class_le.inverse_transform(y)

array(['class1', 'class2', 'class1'], dtype=object)
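
To check which integer was assigned to which label, the fitted encoder can be inspected through its classes_ attribute. A minimal, self-contained sketch (the names labels, encoder, encoded, and decoded are illustrative):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(['class1', 'class2', 'class1'])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ lists the original labels; position i corresponds to integer i.
print(encoder.classes_)
print(encoded)

# Round trip back to the string labels.
decoded = encoder.inverse_transform(encoded)
print(decoded)
```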

In [30]:
df

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,,8.0
2,10.0,11.0,12.0,


In [31]:
'''Performing one-hot Encoding on nominal features'''

'''
Since scikit-learn estimators for classification treat class labels as categorical data that does not imply any order
(nominal), we used the convenient LabelEncoder to encode the string labels into integers. It may appear
that we could use a similar approach to transform the nominal color column of our dataset, as follows:
'''

X=df_shirt[['color','size','price']].values
color_le=LabelEncoder()

X[:,0]=color_le.fit_transform(X[:,0])
X

'''
After executing the preceding code, the first column of the NumPy array X now holds the new color values, which are
encoded as follows:
blue=0
green=1
red=2

'''


In [32]:
X

array([[1, 1, 10.1],
       [2, 2, 13.5],
       [0, 3, 15.3]], dtype=object)

In [33]:
'''If we stop at this point and feed the array to our classifier, we will make one of the most common mistakes
in dealing with categorical data. Although the color values don't come in any particular order, a learning algorithm
will now assume that green is larger than blue, and red is larger than green. Although this assumption is incorrect,
the algorithm could still produce useful results. However, those results would not be optimal.
A common workaround for this problem is to use a technique called one-hot encoding. The idea behind this approach is to create
a new dummy feature for each unique value in the nominal feature column. Here, we would convert the color feature into
three new features: blue, green, and red. Binary values can then be used to indicate the particular color of a sample.'''

from sklearn.preprocessing import OneHotEncoder
ohe=OneHotEncoder()
# Encode only the color column (index 0); passing all of X would also
# one-hot encode the size and price columns.
ohe.fit_transform(X[:, [0]]).toarray()

array([[0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])
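
A possibly more convenient route to the same dummy features is the get_dummies function in pandas, which expands only the columns we name. The sketch below rebuilds a small shirt table so it runs on its own; shirts, dummies, and reduced are illustrative names.

```python
import pandas as pd

shirts = pd.DataFrame({
    'color': ['green', 'red', 'blue'],
    'size': [1, 2, 3],
    'price': [10.1, 13.5, 15.3],
})

# Only the listed column is expanded; size and price pass through unchanged.
dummies = pd.get_dummies(shirts, columns=['color'], dtype=int)
print(dummies)

# drop_first=True removes one redundant dummy column (here color_blue),
# since it can be inferred from the remaining ones.
reduced = pd.get_dummies(shirts, columns=['color'], dtype=int, drop_first=True)
print(list(reduced.columns))
```

Dropping one dummy column is a common way to reduce the correlation among the new features, at no loss of information.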

In [34]:
X

array([[1, 1, 10.1],
       [2, 2, 13.5],
       [0, 3, 15.3]], dtype=object)

In [35]:
'''Skipping one hot encoding for now'''



In [36]:
'''Partitioning a dataset into separate training and test sets'''

'''
The samples in the Wine dataset belong to one of three different classes, 1, 2, and 3, which refer to three different types of
grapes grown in the same region in Italy but derived from different wine cultivars.

A convenient way to randomly partition this dataset into separate test and training datasets is to use the train_test_split
function from scikit-learn's model_selection submodule.
'''

df_wine=pd.read_csv('wine.data',header=None)

In [37]:
df_wine

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,1,14.23,1.71,2.43,15.6,127,2.80,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.20,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.40,1050
2,1,13.16,2.36,2.67,18.6,101,2.80,3.24,0.30,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.50,16.8,113,3.85,3.49,0.24,2.18,7.80,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.80,2.69,0.39,1.82,4.32,1.04,2.93,735
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
173,3,13.71,5.65,2.45,20.5,95,1.68,0.61,0.52,1.06,7.70,0.64,1.74,740
174,3,13.40,3.91,2.48,23.0,102,1.80,0.75,0.43,1.41,7.30,0.70,1.56,750
175,3,13.27,4.28,2.26,20.0,120,1.59,0.69,0.43,1.35,10.20,0.59,1.56,835
176,3,13.17,2.59,2.37,20.0,120,1.65,0.68,0.53,1.46,9.30,0.60,1.62,840


In [38]:
df_wine.columns=['Class label','Alcohol',
                 'Malic acid','Ash',
                 'Alcalinity of ash','Magnesium',
                 'Total phenols','Flavanoids',
                 'Nonflavanoids phenols',
                 'Proanthocyanins','Color intensity',
                 'Hue','OD280/OD315 od diluted wines',
                 'Proline'
                 ]
print('Class labels',np.unique(df_wine['Class label']))

Class labels [1 2 3]


In [39]:
df_wine.head()

Unnamed: 0,Class label,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoids phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 od diluted wines,Proline
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735


In [40]:
from sklearn.model_selection import train_test_split
X_wine=df_wine.iloc[:, 1: ].values
y_wine=df_wine.iloc[:, 0].values

a=X_wine.shape
b=y_wine.shape

a,b

((178, 13), (178,))

In [41]:
# train_test_split returns the splits in the order X_train, X_test, y_train, y_test.
X_wine_train,X_wine_test,y_wine_train,y_wine_test=train_test_split(X_wine,y_wine,test_size=0.3,random_state=0,stratify=y_wine)


In [42]:
'''
First, we assigned the NumPy array representation of the feature columns 1-13 to the variable X_wine and the class
labels from the first column to the variable y_wine. Then, we used the train_test_split function to randomly split X_wine and
y_wine into separate training and test datasets. By setting test_size=0.3, we assigned 30% of the wine samples to
X_wine_test and y_wine_test, and the remaining 70% to X_wine_train and y_wine_train, respectively. Providing the class
label array y_wine as an argument to stratify ensures that both training and test datasets have the same class proportions as the
original dataset.
'''
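
The effect of stratify can be checked by comparing class proportions before and after the split. The sketch below is self-contained and therefore makes one assumption: it uses the copy of the Wine data bundled with scikit-learn (load_wine) in place of the local wine.data file, so the class labels are 0, 1, 2 rather than 1, 2, 3. All variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled Wine data stands in for wine.data.
# Its class labels are 0, 1, 2 rather than 1, 2, 3.
X_demo, y_demo = load_wine(return_X_y=True)

# Note the order of the returned arrays: X_train, X_test, y_train, y_test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

print(X_tr.shape, X_te.shape)

# Class proportions in the full data, the training set, and the test set.
for name, labels in [('all', y_demo), ('train', y_tr), ('test', y_te)]:
    print(name, np.bincount(labels) / len(labels))
```

The three rows of proportions should be nearly identical; without stratify, they can drift apart on a dataset this small.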




In [43]:
'''Bringing features onto the same scale'''

'''
Feature scaling is a crucial step in our preprocessing pipeline that can easily be forgotten. Decision trees and random
forests are two of the very few machine learning algorithms where we don't need to worry about feature scaling. Those
algorithms are scale invariant. However, the majority of machine learning and optimization algorithms behave much better
if features are on the same scale. There are two common approaches to bring different features onto the same scale:
normalization and standardization. Normalization refers to the rescaling of the features to a range of [0, 1], which is a
special case of min-max scaling.

Although normalization via min-max scaling is a commonly used technique that is useful when we need values in a bounded interval,
standardization can be more practical for many machine learning algorithms, especially for optimization algorithms such as
gradient descent. The reason is that many linear models, such as logistic regression and SVM, initialize the weights to 0 or small
random values close to 0. Using standardization, we center the feature columns at mean 0 with standard deviation 1 so that the
feature columns have the same mean and variance as a standard normal distribution, which makes it easier to learn the weights.
Standardization does not, however, change the shape of the distribution. Furthermore, standardization
maintains useful information about outliers and makes the algorithm less sensitive to them, in contrast to min-max scaling, which
scales the data to a limited range of values.
'''

from sklearn.preprocessing import MinMaxScaler
mms=MinMaxScaler()
# Fit the scaler on the training data only, then reuse the learned minimum and maximum on the test data.
# Both arrays must be 2D feature arrays. A "ValueError: Expected 2D array, got 1D array instead" at this
# point indicates that the return values of train_test_split were unpacked in the wrong order
# (the expected order is X_train, X_test, y_train, y_test).
X_wine_train_norm=mms.fit_transform(X_wine_train)
X_wine_test_norm=mms.transform(X_wine_test)

In [93]:
from sklearn.preprocessing import StandardScaler
stdsc=StandardScaler()
X_wine_train_std=stdsc.fit_transform(X_wine_train)
# Use transform (not fit) on the test data so that it is scaled with the training mean and standard deviation.
X_wine_test_std=stdsc.transform(X_wine_test)
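
The two rescaling approaches described in this section can be written out by hand and compared with the scikit-learn scalers. The following is a minimal sketch on a small made-up array; ex and the other variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

ex = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])

# Min-max scaling: (x - min) / (max - min)
normalized = (ex - ex.min()) / (ex.max() - ex.min())

# Standardization: (x - mean) / std, using the population standard deviation
standardized = (ex - ex.mean()) / ex.std()

# The scikit-learn scalers expect a 2D array of shape (n_samples, n_features).
ex_2d = ex.reshape(-1, 1)
sk_normalized = MinMaxScaler().fit_transform(ex_2d).ravel()
sk_standardized = StandardScaler().fit_transform(ex_2d).ravel()

print(np.allclose(normalized, sk_normalized))
print(np.allclose(standardized, sk_standardized))
```

The reshape(-1, 1) call is the same fix that scikit-learn suggests when a 1D array is passed to a scaler by mistake.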