## Exploration/Visualisation/Cleaning

Do you suspect your data is “dirty” (has a few meaningless input patterns and/or noisy outputs or wrong class labels)? If yes, detect the outlier examples using the top ranking variables obtained in step 5 as representation; check and/or discard them.

Automating tasks such as plotting all your variables against the target variable being predicted as well as computing summary statistics can save lots of time.


- View dimensions: `print(dataset.shape)`
- Peek at the data
- Stats summary
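A minimal sketch of these three first-look checks, assuming the data is held in a pandas DataFrame. The toy `dataset` below is hypothetical and only stands in for your own data.

```python
import pandas as pd

# Hypothetical toy frame standing in for `dataset`
dataset = pd.DataFrame({
    "age": [22, 35, 58, 41],
    "income": [28000.0, 52000.0, 61000.0, None],
    "label": ["no", "yes", "yes", "no"],
})

shape = dataset.shape            # (rows, columns)
head = dataset.head(3)           # peek at the first rows
summary = dataset.describe()     # count, mean, std, quartiles for numeric columns
dtypes = dataset.dtypes          # spot incorrect datatypes
missing = dataset.isna().sum()   # missing values per column

print(shape)
print(head)
print(summary)
print(dtypes)
print(missing)
```

The same calls also cover two items from the checklist that follows: `dtypes` exposes incorrect datatypes and `isna().sum()` counts missing data.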

- We don’t know which algorithms would be good on this problem or what configurations to use. We get an idea from the plots that some of the classes are partially linearly separable in some dimensions, so we are expecting generally good results.


Missing data
Incorrect datatypes
Imbalanced



### Class Distribution (Classification Only)

On classification problems you need to know how balanced the class values are.

Highly imbalanced problems (a lot more observations for one class than another) are common and may need special handling in the data preparation stage of your project.

You can quickly get an idea of the distribution of the class attribute in Pandas.

```
# Class Distribution
import pandas
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
class_counts = data.groupby('class').size()
print(class_counts)
```

Skew, distribution, correlation

Transform Data
The final step is to transform the processed data. The specific algorithm you are working with and your knowledge of the problem domain will influence this step, and you will very likely have to revisit different transformations of your preprocessed data as you work on your problem.
Three common data transformations are scaling, attribute decomposition and attribute aggregation. This step is also referred to as feature engineering.
	•	Scaling: The preprocessed data may contain attributes with a mixture of scales for various quantities such as dollars, kilograms and sales volume. Many machine learning methods prefer attributes to have the same scale, such as between 0 and 1 for the smallest and largest value of a given feature. Consider any feature scaling you may need to perform.
	•	Decomposition: There may be features that represent a complex concept and that are more useful to a machine learning method when split into their constituent parts. An example is a date that has day and time components, which in turn could be split out further. Perhaps only the hour of day is relevant to the problem being solved. Consider what feature decompositions you can perform.
	•	Aggregation: There may be features that can be aggregated into a single feature that is more meaningful to the problem you are trying to solve. For example, there may be a data instance for each time a customer logged into a system; these could be aggregated into a count of logins, allowing the additional instances to be discarded. Consider what type of feature aggregations you could perform.
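The three transformations above can be sketched in a few lines. This is an illustration only: the `logins` frame, its column names and the choice of `MinMaxScaler` are assumptions made for the example, not part of the original notes.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical login log: one row per login event
logins = pd.DataFrame({
    "customer": ["a", "a", "b", "a", "b"],
    "timestamp": pd.to_datetime([
        "2024-01-05 09:15", "2024-01-05 18:40", "2024-01-06 09:05",
        "2024-01-07 23:10", "2024-01-08 12:30",
    ]),
    "spend_dollars": [10.0, 250.0, 40.0, 5.0, 120.0],
})

# Scaling: squeeze a dollar amount into the range [0, 1]
scaler = MinMaxScaler()
logins["spend_scaled"] = scaler.fit_transform(logins[["spend_dollars"]]).ravel()

# Decomposition: keep only the hour of day from the full timestamp
logins["hour"] = logins["timestamp"].dt.hour

# Aggregation: collapse the per-login rows into one count per customer
login_counts = logins.groupby("customer").size().rename("n_logins")

print(logins)
print(login_counts)
```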


- Histograms to summarize the distribution of individual data attributes.
- Pairwise histograms to plot attributes against each other and highlight relationships and outliers.
- Dimensionality reduction methods for creating lower dimensional plots and models of the data.
- Clustering to expose natural groupings in the data.

Real data can have inconsistencies, missing values and various other forms of corruption. If it was scraped from a difficult data source, it may require trimming and cleaning up. Even clean data may require post-processing to make it uniform and consistent.



## Variable Data Types


### Categorical
Common challenges with categorical variables:

- A categorical variable has too many levels. This pulls down the performance of the model. For example, a categorical variable “zip code” would have numerous levels.
- A categorical variable has levels which rarely occur. Many of these levels have minimal chance of making a real impact on model fit. For example, a variable ‘disease’ might have some levels which rarely occur.
- One level almost always occurs, i.e. most of the observations in the data set share a single level. Such variables fail to make a positive impact on model performance because of their very low variation.
- If the categorical variable is masked, it becomes a laborious task to decipher its meaning. Such situations are commonly found in data science competitions.
- You can’t fit categorical variables into a regression equation in their raw form. They must be treated.
- Most algorithms (or ML libraries) produce better results with numerical variables. In Python, the “sklearn” library requires features in numerical arrays. For example, fitting a random forest with sklearn on the Titanic data set, using only the two features sex and pclass as independent variables, returns an error because the feature “sex” is categorical and has not been converted to numerical form.

A common challenge with nominal categorical variables is that label encoding may decrease the performance of a model. For example, suppose we have two features, “age” (range: 0-80) and “city” (81 different levels). When we apply a label encoder to the ‘city’ variable, it represents each city with a numeric value from 0 to 80. The ‘city’ variable now looks like the ‘age’ variable, since both have similar data points, and a model may read the arbitrary city codes as ordered quantities, which is certainly not the right approach.

Convert to number: As discussed above, some ML libraries do not take categorical variables as input, so we convert them into numerical variables. Below are methods to convert a categorical (string) input to numerical form:
Label Encoder: It transforms non-numerical labels (nominal categorical variables) into numerical labels. The numerical labels are always between 0 and n_classes-1.



    from sklearn.preprocessing import LabelEncoder
    # `train` is assumed to be a pandas DataFrame with a 'sex' column
    number = LabelEncoder()
    train['sex'] = number.fit_transform(train['sex'].astype('str'))

Convert numeric bins to number: Let’s say a continuous variable is only available in the data set as bins. For example, a variable “Age” might be given as bins (0-17, 17-25, 26-35 …). We can convert these bins into definite numbers using the following methods:

- Use a label encoder for the conversion. However, the numerical bins will then be treated the same as the levels of a non-numeric feature, so this doesn’t provide any additional information.
- Create a new feature using the mean or mode (most relevant value) of each age bucket. This gives each level a meaningful numeric weight.
- Create two new features, one for the lower bound of age and another for the upper bound. This method retains more information about the numerical bins than the previous two.
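A small sketch of the second and third methods, assuming the bins arrive as strings of the form "low-high". The `age_bin` column and its values are hypothetical.

```python
import pandas as pd

# Hypothetical binned age column
ages = pd.DataFrame({"age_bin": ["0-17", "18-25", "26-35", "18-25"]})

# Split "low-high" into two integer columns
bounds = ages["age_bin"].str.split("-", expand=True).astype(int)
ages["age_low"] = bounds[0]    # lower bound of the bin
ages["age_high"] = bounds[1]   # upper bound of the bin

# A single representative value per bin (here the midpoint)
ages["age_mid"] = (ages["age_low"] + ages["age_high"]) / 2

print(ages)
```

The midpoint is only one possible representative value; the mean or mode of the raw ages within each bucket would be preferable when the raw values are available.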



Combine levels: To avoid redundant levels in a categorical variable and to deal with rare levels, we can simply combine different levels. There are various methods of combining levels. Here are the commonly used ones:
Using business logic: This is one of the most effective methods of combining levels. It makes sense to combine similar levels into groups based on domain or business experience. For example, we can combine the levels of a variable “zip code” at state or district level. This reduces the number of levels and can also improve model performance.

Using frequency or response rate: Combining levels based on business logic is effective, but we may not always have the domain knowledge. Imagine you are given a data set from the Aerospace Department, US Govt. How would you apply business logic here? In such cases, we combine levels by considering the frequency distribution or the response rate.
To combine levels using their frequency, we first look at the frequency distribution of each level and combine levels whose frequency is less than 5% of the total observations (5% is a common choice, but you can change it based on the distribution). This is an effective method to deal with rare levels.
We can also combine levels by considering the response rate of each level, grouping levels with a similar response rate together.
Finally, you can look at both frequency and response rate. You first combine levels based on response rate, then merge rare levels into the relevant group.
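As an illustration of the frequency rule, here is a minimal sketch that folds every level below the 5% threshold into a single "Other" level. The `cities` data and the "Other" label are made up for the example.

```python
import pandas as pd

# Hypothetical categorical column with a few rare levels (100 rows in total)
cities = pd.Series(["NY"] * 50 + ["LA"] * 40 + ["SF"] * 6 + ["Boise"] * 3 + ["Reno"] * 1)

# Relative frequency of each level
freq = cities.value_counts(normalize=True)

# Levels that occur in fewer than 5% of observations
rare = freq[freq < 0.05].index

# Replace rare levels with one combined level, keep the rest unchanged
combined = cities.where(~cities.isin(rare), "Other")

print(combined.value_counts())
```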

Dummy Coding: Dummy coding is a commonly used method for converting a categorical input variable into numerical variables. A ‘dummy’, as the name suggests, is a stand-in variable which represents one level of a categorical variable. Presence of a level is represented by 1 and absence by 0. One dummy variable is created for every level present.

For example, a variable “sex” with the levels male and female is one-hot encoded into two 0/1 columns, one per level.
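A minimal sketch of dummy coding with `pandas.get_dummies`, using a made-up two-column frame loosely modelled on the Titanic features mentioned earlier.

```python
import pandas as pd

# Hypothetical sample with one categorical and one numeric feature
df = pd.DataFrame({"sex": ["male", "female", "female", "male"],
                   "pclass": [3, 1, 2, 3]})

# One 0/1 column per level of `sex`
dummies = pd.get_dummies(df, columns=["sex"], dtype=int)

# n-1 coding: drop the first level to avoid a redundant column
reduced = pd.get_dummies(df, columns=["sex"], dtype=int, drop_first=True)

print(dummies)
print(reduced)
```

Dropping one level (`drop_first=True`) is usually preferred for linear models, where a full set of dummies is perfectly collinear with the intercept.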



Let’s come straight to the point on this one: there are only 2 types of variables you see, Continuous and Discrete. Further, discrete variables can be divided into Nominal (categorical) and Ordinal.

### Continuous

Binning refers to dividing the values of a continuous variable into groups. It is done to discover patterns in continuous variables which are difficult to analyze otherwise. Bins are also easy to analyze and interpret. However, binning leads to loss of information and loss of power: once the bins are created, the information is compressed into groups, which later affects the final model. Hence, it is advisable to create small bins initially.

This helps to minimize the loss of information and produces better results. However, I’ve encountered cases where small bins don’t prove to be helpful. In such cases, you must decide on the bin size according to your hypothesis. We should also consider the distribution of the data before deciding the bin size.
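A short sketch of two common ways to bin with pandas: fixed edges chosen from a hypothesis (`pd.cut`) and equal-frequency bins driven by the distribution (`pd.qcut`). The ages, edges and labels are hypothetical.

```python
import pandas as pd

ages = pd.Series([3, 17, 22, 29, 41, 58, 67, 80])

# Fixed-width style bins with hand-picked edges (right edge inclusive)
fixed = pd.cut(ages, bins=[0, 17, 35, 60, 100],
               labels=["child", "young", "middle", "senior"])

# Equal-frequency bins: each quartile holds about the same number of rows
quantile = pd.qcut(ages, q=4, labels=False)

print(fixed.tolist())
print(quantile.tolist())
```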

Normalization, in simpler words, is a process of comparing variables on a ‘neutral’ or ‘standard’ scale. It helps to obtain the same range of values across variables. Normally distributed data is easy to read and interpret: 99.7% of the observations lie within 3 standard deviations of the mean. After standardization, the mean is zero and the standard deviation is one. Normalization is commonly used with distance-based algorithms such as k-means clustering.

A commonly used normalization method is z-scores.
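The z-score of a value is its distance from the column mean measured in standard deviations, z = (x - mean) / std. A minimal sketch on a made-up array, computed by hand and with scikit-learn's `StandardScaler` (both use the population standard deviation):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 700.0]])

# Manual z-scores, column by column
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same transformation via scikit-learn
scaled = StandardScaler().fit_transform(X)

print(manual)
print(scaled.mean(axis=0), scaled.std(axis=0))
```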

Transformation is required when we encounter highly skewed data. It is suggested not to work on skewed data in its raw form, because the skew reduces the impact of low frequency values which could be equally significant. At times, skewness is caused by the presence of outliers, so we need to be careful while using this approach. The techniques to deal with outliers are explained in the next sections.

Business logic adds precision to the output of a model. Data alone can’t suggest the patterns that an understanding of the business can. Hence, in companies, data scientists often prefer to spend time with clients to understand their business and market. This not only helps them make informed decisions, but also enables them to think outside the data. Once you start thinking this way, you are no longer confined within the data.

Data are prone to outliers. An outlier is an abnormal value which stands apart from the rest of the data points. It can happen for various reasons. The most common reasons are problems in the data collection method. Sometimes respondents deliberately provide incorrect answers, and sometimes the values are actually real. Then how do we decide? You can use any of these methods:

- Create a box plot. You’ll get Q1, Q2 and Q3. Data points above Q3 + 1.5 x IQR or below Q1 - 1.5 x IQR are considered outliers, where IQR is the interquartile range, IQR = Q3 - Q1.
- Considering the scope of the analysis, you can remove the top 1% and bottom 1% of values. However, this results in a loss of information, so you must check the impact of these values on the dependent variable.
- Treating outliers is a tricky situation, one where you need to combine business understanding with understanding of the data. For example, if you are dealing with the age of people and you see a value age = 200 (in years), the error most likely happened because the data was collected incorrectly, or because the person entered their age in months. Depending on what you think is likely, you would either remove the value (in the first case) or replace it by 200/12 years.
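The box-plot rule can be applied directly without drawing the plot. A minimal sketch on made-up ages that include the age = 200 case:

```python
import pandas as pd

ages = pd.Series([21, 25, 27, 30, 31, 33, 36, 40, 200])

# Quartiles and interquartile range
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 x IQR beyond the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the fences are flagged, not automatically deleted
outliers = ages[(ages < lower) | (ages > upper)]

print(lower, upper)
print(outliers.tolist())
```

Whether a flagged point is then removed, corrected (for example 200 months converted to years) or kept is the business decision described above.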

Sometimes a data set has too many variables, maybe 100, 200 or even more. In such cases, it is impractical to build a model on all variables, because 1) it would be time consuming, 2) the data might contain a lot of noise, and 3) many variables will carry similar information.

To avoid this situation we use PCA, a.k.a. Principal Component Analysis. It finds a few ‘principal’ components, each a linear combination of the original variables, which explain a significant amount of the variation in the predictors. Note that PCA does not look at the dependent variable. Using this technique, a large number of variables is reduced to a few significant components. This helps to reduce noise and redundancy and enables quicker computations.

In PCA, components are labelled PC1 or Comp 1, PC2 or Comp 2, and so on. PC1 has the highest variance, followed by PC2, PC3 and so on. A common rule of thumb, when the variables are standardized, is to select components with eigenvalues greater than 1. In R’s PCA output each component is reported with a ‘Standard Deviation’; the eigenvalue is the square of that standard deviation.
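The original notes point to an R example that is not included here. As a stand-in, this is a Python sketch with scikit-learn on synthetic data; the data, the seed and the eigenvalue-greater-than-1 cut-off are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Six observed variables that are noisy copies of two underlying signals
base = rng.normal(size=(200, 2))
X = np.hstack([
    base[:, [0]] + 0.1 * rng.normal(size=(200, 3)),
    base[:, [1]] + 0.1 * rng.normal(size=(200, 3)),
])

# Standardize first so that each variable contributes unit variance
X_std = StandardScaler().fit_transform(X)

# Fit all components and inspect their eigenvalues (variances)
pca = PCA().fit(X_std)
eigenvalues = pca.explained_variance_

# Keep the components whose eigenvalue exceeds 1
n_keep = int((eigenvalues > 1).sum())
X_reduced = PCA(n_components=n_keep).fit_transform(X_std)

print(eigenvalues.round(3))
print(n_keep, X_reduced.shape)
```

Here `explained_variance_` holds the eigenvalues directly, so no squaring of standard deviations is needed as it would be with the R output.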

Create New Variables:

Consider a variable that stores a date and time stamp. You can probably figure out the possible new variables yourself, but if not, no problem. We can easily break the stamp into different variables, namely:

- Date
- Month
- Year
- Time
- Day of month
- Day of week
- Day of year

These are the possibilities. You aren’t required to create all the listed variables in every situation. Create only those variables which fit your hypothesis. Every variable will have an impact (high or low) on the dependent variable. You can check it using a correlation matrix.
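With pandas, each of the listed parts is available through the `.dt` accessor. A minimal sketch on two made-up timestamps:

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime(["2024-01-15 08:30", "2024-07-04 21:05"]))

parts = pd.DataFrame({
    "year": stamps.dt.year,
    "month": stamps.dt.month,
    "day_of_month": stamps.dt.day,
    "day_of_week": stamps.dt.dayofweek,   # Monday = 0, Sunday = 6
    "day_of_year": stamps.dt.dayofyear,
    "hour": stamps.dt.hour,
})

print(parts)
```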



Create Bins:

Once you have extracted new variables, you can create bins. For example, from a ‘Months’ variable you can easily create bins to obtain ‘quarter’ and ‘half-yearly’ variables. From ‘Days’, you can create bins to obtain ‘weekdays’. Similarly, you’ll have to explore these variables. Try and repeat. Who knows, you might find a variable of the highest importance.

Convert Date to Numbers:

You can also convert dates to numbers and use them as numerical variables. This allows you to analyze dates using statistical techniques such as correlation, which would be difficult otherwise. On the basis of their relationship with the dependent variable, you can then bin them and capture another important trend in the data.
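A brief sketch of both ideas, date bins and date-to-number conversion, on two made-up dates. Counting days since the earliest date is just one possible numeric encoding.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2024-01-15", "2024-07-06"]))

# Bins derived from the month
quarter = dates.dt.quarter
half = (dates.dt.month > 6).astype(int) + 1   # 1 = first half, 2 = second half

# Bin derived from the day of week (Monday = 0 ... Sunday = 6)
is_weekday = dates.dt.dayofweek < 5

# Date as a number: days elapsed since the earliest date in the column
days_since_start = (dates - dates.min()).dt.days

print(quarter.tolist(), half.tolist(), is_weekday.tolist(), days_since_start.tolist())
```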



	•	Bagging: Known more formally as Bootstrap Aggregation, this is where the same algorithm gets different perspectives on the problem by being trained on different subsets of the training data.
	•	Boosting: Models are trained in sequence on the same training data, with each new model giving more weight to the examples that the previous models got wrong.
	•	Blending: Known more formally as Stacked Generalization or Stacking, this is where the predictions of a variety of models are taken as input to a new model that learns how to combine them into an overall prediction.

How to Convert Categorical Data to Numerical Data?
This involves two steps:
	1.	Integer Encoding
	2.	One-Hot Encoding
1. Integer Encoding
As a first step, each unique category value is assigned an integer value.
For example, “red” is 1, “green” is 2, and “blue” is 3.
This is called a label encoding or an integer encoding and is easily reversible.
For some variables, this may be enough.
The integer values have a natural ordered relationship between each other and machine learning algorithms may be able to understand and harness this relationship.
For example, an ordinal variable such as a finishing “place” (first, second, third) is a case where a label encoding would be sufficient.
2. One-Hot Encoding
For categorical variables where no such ordinal relationship exists, the integer encoding is not enough.
In fact, using this encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results (predictions halfway between categories).
In this case, a one-hot encoding can be applied to the integer representation. This is where the integer encoded variable is removed and a new binary variable is added for each unique integer value.


A one hot encoding is a representation of categorical variables as binary vectors.

    from numpy import array
    from numpy import argmax
    from sklearn.preprocessing import LabelEncoder
    from sklearn.preprocessing import OneHotEncoder
    # define example
    data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']
    values = array(data)
    print(values)
    # integer encode: each category string becomes an integer code
    label_encoder = LabelEncoder()
    integer_encoded = label_encoder.fit_transform(values)
    print(integer_encoded)
    # binary encode: one 0/1 column per integer code
    # (`sparse_output` needs scikit-learn >= 1.2; older versions call it `sparse`)
    onehot_encoder = OneHotEncoder(sparse_output=False)
    integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
    onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
    print(onehot_encoded)
    # invert first example: the position of the 1 gives back the integer code
    inverted = label_encoder.inverse_transform([argmax(onehot_encoded[0, :])])
    print(inverted)





Missing values

Missing values arise during data extraction or collection. They can be missing at random, missing because of unobserved predictors, or missing in a way that depends on the missing value itself (e.g. people with higher incomes are less likely to respond).

Treatment options:

- Deletion
- Mean/median/mode imputation
- Prediction model, e.g. KNN
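The three treatment options can be sketched with pandas and scikit-learn. The toy frame and the choice of `n_neighbors=2` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical frame with one missing value per column
df = pd.DataFrame({"age": [25, 30, np.nan, 40],
                   "income": [30.0, 40.0, 50.0, np.nan]})

# Deletion: drop every row that has a missing value
dropped = df.dropna()

# Median imputation, column by column
median_filled = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(df), columns=df.columns)

# Model-based imputation: average over the 2 nearest rows
knn_filled = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns)

print(dropped)
print(median_filled)
print(knn_filled)
```

Deletion halves this tiny data set, which illustrates why imputation is often preferred when missing values are spread across many rows.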

Outlier Detection and Treatment
Outliers can be of two types: univariate and multivariate. Above, we discussed an example of a univariate outlier. These outliers can be found by looking at the distribution of a single variable. Multivariate outliers are outliers in an n-dimensional space. In order to find them, you have to look at distributions in multiple dimensions.

By cause, outliers fall into two groups:

	1.	Artificial (error) / non-natural
	2.	Natural
How to detect Outliers?
The most commonly used method to detect outliers is visualization. We use various visualization methods, like box plots, histograms and scatter plots. Some analysts also use various rules of thumb to detect outliers. Some of them are:
	•	Any value below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR
	•	Capping methods: any value outside the range of the 5th and 95th percentiles can be considered an outlier
	•	Data points three or more standard deviations away from the mean are considered outliers
	•	Outlier detection is merely a special case of the examination of data for influential data points, and it also depends on business understanding
	•	Bivariate and multivariate outliers are typically measured using an index of influence, leverage or distance. Popular indices such as Mahalanobis’ distance and Cook’s D are frequently used to detect outliers.
	•	In SAS, we can use PROC UNIVARIATE and PROC SGPLOT. To identify outliers and influential observations, we also look at statistical measures like STUDENT, COOKD, RSTUDENT and others.
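To show why multivariate checks matter, here is a sketch that compares the three-standard-deviation rule with Mahalanobis distance on synthetic data. The data, the seed and the 99.9% chi-square cut-off are assumptions for illustration.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(42)

# Two strongly correlated variables, plus one point that breaks the correlation
X = rng.multivariate_normal(mean=[0, 0], cov=[[1, 0.9], [0.9, 1]], size=300)
X = np.vstack([X, [2.0, -2.0]])

# Univariate rule: more than 3 standard deviations from the mean in any column
z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
univariate_flags = (z > 3).any(axis=1)

# Multivariate rule: squared Mahalanobis distance from the centre
diff = X - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

# Compare against a chi-square quantile with one degree of freedom per variable
threshold = chi2.ppf(0.999, df=X.shape[1])
multivariate_flags = d2 > threshold

print(univariate_flags[-1], multivariate_flags[-1])
```

The added point is unremarkable on each axis taken alone, but its combination of values runs against the correlation, which only the distance-based check takes into account.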


Most of the ways to deal with outliers are similar to the methods for missing values: deleting observations, transforming them, binning them, treating them as a separate group, imputing values and other statistical methods. Here, we will discuss the common techniques used to deal with outliers:
Deleting observations: We delete outlier values if they are due to a data entry error or data processing error, or if the outlier observations are very small in number. We can also use trimming at both ends to remove outliers.
Transforming and binning values: Transforming variables can also eliminate outliers. The natural log of a value reduces the variation caused by extreme values. Binning is also a form of variable transformation. Decision tree algorithms deal with outliers well because they bin variables. We can also assign weights to different observations.

Imputing: As with missing values, we can also impute outliers, using mean, median or mode imputation. Before imputing values, we should analyse whether the outlier is natural or artificial. If it is artificial, we can go ahead with imputation. We can also use a statistical model to predict the values of outlier observations and then impute them with the predicted values.
Treat separately: If there is a significant number of outliers, we should treat them separately in the statistical model. One approach is to treat the outliers and the remaining observations as two different groups, build an individual model for each, and then combine the output.


What is the process of Feature Engineering?
You perform feature engineering once you have completed the first 5 steps in data exploration: Variable Identification, Univariate Analysis, Bivariate Analysis, Missing Values Imputation and Outliers Treatment. Feature engineering itself can be divided into 2 steps:

	•	Variable transformation.
	•	Variable / Feature creation.


In data modelling, transformation refers to the replacement of a variable by a function of that variable. For instance, replacing a variable x by its square root, cube root or logarithm is a transformation. In other words, transformation is a process that changes the distribution of a variable or its relationship with others.

Below are the situations where variable transformation is a requisite:
	•	When we want to change the scale of a variable or standardize its values for better understanding. This transformation is a must if you have data on different scales, but it does not change the shape of the variable’s distribution.
	•	When we can transform complex non-linear relationships into linear relationships. A linear relationship between variables is easier to comprehend than a non-linear or curved one, and transformation helps us convert the latter into the former. A scatter plot can be used to find the relationship between two continuous variables. These transformations also improve the prediction. Log transformation is one of the commonly used transformation techniques in these situations.
	•	When we prefer a symmetric distribution over a skewed one. A symmetric distribution is easier to interpret and to draw inferences from, and some modeling techniques require normally distributed variables. So, whenever we have a skewed distribution, we can use transformations which reduce skewness. For a right skewed distribution, we take the square root, cube root or logarithm of the variable; for a left skewed one, we take the square, cube or exponential of the variable.
	•	Variable transformation is also done from an implementation point of view (human involvement). Let’s understand this more clearly. In one of my projects on employee performance, I found that age has a direct correlation with the performance of the employee, i.e. the higher the age, the better the performance. From an implementation standpoint, launching an age-based programme might present an implementation challenge. However, categorizing the sales agents into three age group buckets of <30 years, 30-45 years and >45 years, and then formulating a different strategy for each group, is a judicious approach. This categorization technique is known as binning of variables.


There are various methods used to transform variables. As discussed, some of them include square root, cube root, logarithm, binning, reciprocal and many others. Let’s look at these methods in detail by highlighting their pros and cons.
	•	Logarithm: The log of a variable is a common transformation used to change the shape of the variable’s distribution on a distribution plot. It is generally used for reducing right skewness. However, it can’t be applied to zero or negative values.
	•	Square / Cube root: The square and cube roots of a variable have a noticeable effect on its distribution, though not as strong as the logarithmic transformation. The cube root has its own advantage: it can be applied to negative values and zero. The square root can be applied to positive values and zero.
	•	Binning: It is used to categorize variables. It is performed on original values, percentiles or frequencies. The choice of categorization technique is based on business understanding. For example, we can categorize income into three categories, namely High, Average and Low. We can also perform co-variate binning, which depends on the values of more than one variable.
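A small sketch comparing how strongly each transformation reduces right skew. The `income` values are synthetic draws from a log-normal distribution, chosen only because that shape is clearly right skewed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed, strictly positive variable
income = pd.Series(rng.lognormal(mean=10, sigma=1, size=1000))

skew_raw = income.skew()
skew_sqrt = np.sqrt(income).skew()   # valid for zero and positive values
skew_cbrt = np.cbrt(income).skew()   # also valid for negative values
skew_log = np.log(income).skew()     # valid for strictly positive values only

print(round(skew_raw, 2), round(skew_sqrt, 2), round(skew_cbrt, 2), round(skew_log, 2))
```

When zeros are present, `np.log1p` (the log of 1 + x) is a common substitute for the plain logarithm.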


Feature / variable creation is a process of generating new variables / features based on existing variable(s). For example, say we have date (dd-mm-yy) as an input variable in a data set. We can generate new variables like day, month, year, week and weekday that may have a better relationship with the target variable. This step is used to highlight a hidden relationship in a variable:

	•	Creating derived variables: This refers to creating new variables from existing variable(s) using a set of functions or different methods. Let’s look at it through the “Titanic – Kaggle competition”. In this data set, the variable age has missing values. To predict the missing values, we used the salutation (Master, Mr, Miss, Mrs) from the name as a new variable. How do we decide which variable to create? Honestly, this depends on the analyst’s business understanding, curiosity and hypotheses about the problem. Methods such as taking the log of variables, binning variables and other variable transformations can also be used to create new variables.
	•	Creating dummy variables: One of the most common applications of dummy variables is to convert a categorical variable into numerical variables. Dummy variables are also called indicator variables. They make it possible to use a categorical variable as a predictor in statistical models. A dummy variable takes the values 0 and 1. Let’s take a variable ‘gender’. We can produce two variables, namely “Var_Male” with values 1 (male) and 0 (not male) and “Var_Female” with values 1 (female) and 0 (not female). For a categorical variable with n classes, we can create n or n-1 dummy variables.





Univariate

Continuous Variables:- In the case of continuous variables, we need to understand the central tendency and spread of the variable. Central tendency is measured with the mean, median and mode; spread with the range, quartiles, IQR, variance and standard deviation. Histograms and box plots are the usual visualizations.

Categorical Variables:- For categorical variables, we use a frequency table to understand the distribution of each category. We can also read it as the percentage of values under each category. It can be measured using two metrics, Count and Count%, against each category. A bar chart can be used as the visualization.
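A minimal univariate sketch with pandas on a made-up frame: summary statistics for a continuous column, and Count and Count% for a categorical one.

```python
import pandas as pd

df = pd.DataFrame({"age": [22, 35, 58, 41, 35],
                   "sex": ["m", "f", "f", "m", "f"]})

# Continuous: central tendency and spread
stats = df["age"].describe()        # count, mean, std, min, quartiles, max
median = df["age"].median()
iqr = df["age"].quantile(0.75) - df["age"].quantile(0.25)

# Categorical: Count and Count% per category
counts = df["sex"].value_counts()
pct = df["sex"].value_counts(normalize=True) * 100

print(stats)
print(median, iqr)
print(counts)
print(pct)
```

If matplotlib is installed, `df["age"].plot.hist()` and `counts.plot.bar()` produce the corresponding histogram and bar chart.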

Bi-variate Analysis

Continuous & Continuous: While doing bi-variate analysis between two continuous variables, we should look at a scatter plot. It is a nifty way to find out the relationship between two variables. The pattern of the scatter plot indicates the relationship between the variables, which can be linear or non-linear.

Categorical & Categorical: To find the relationship between two categorical variables, we can use the following methods:
	•	Two-way table: We can start analyzing the relationship by creating a two-way table of count and count%. The rows represent the categories of one variable and the columns represent the categories of the other variable. We show the count or count% of observations in each combination of row and column categories.
	•	Stacked Column Chart: This method is more of a visual form of the two-way table.
	•	Chi-Square Test: This test is used to derive the statistical significance of the relationship between the variables. It tests whether the evidence in the sample is strong enough to generalize the relationship to a larger population. Chi-square is based on the difference between the expected and observed frequencies in one or more categories of the two-way table. It returns the probability (p-value) of the computed chi-square statistic for the given degrees of freedom.
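A sketch of the two-way table and chi-square test using pandas and SciPy. The `sex` and `bought` columns are invented for the example.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: 40 men and 40 women, with different purchase rates
df = pd.DataFrame({
    "sex": ["m"] * 40 + ["f"] * 40,
    "bought": ["yes"] * 30 + ["no"] * 10 + ["yes"] * 10 + ["no"] * 30,
})

# Two-way table of counts, and row-wise percentages
table = pd.crosstab(df["sex"], df["bought"])
table_pct = pd.crosstab(df["sex"], df["bought"], normalize="index") * 100

# Chi-square test of independence on the count table
chi2_stat, p_value, dof, expected = chi2_contingency(table)

print(table)
print(table_pct)
print(chi2_stat, p_value, dof)
```

A small p-value suggests the two variables are not independent; `expected` holds the frequencies that independence would imply.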


Categorical & Continuous: While exploring the relation between categorical and continuous variables, we can draw box plots for each level of the categorical variable. The plots by themselves do not establish statistical significance. To assess statistical significance we can perform a Z-test, T-test or ANOVA.
	•	Z-Test / T-Test:- Either test assesses whether the means of two groups are statistically different from each other.
	•	ANOVA:- It assesses whether the averages of more than two groups are statistically different.

Example: Suppose we want to test the effect of five different exercises. For this, we recruit 20 men and assign one type of exercise to each group of 4 men (5 groups). Their weights are recorded after a few weeks. We need to find out whether the effect of these exercises on them is significantly different or not. This can be done by comparing the weights of the 5 groups of 4 men each.
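The exercise example can be run as a one-way ANOVA with SciPy. The weights below are invented purely to make the sketch executable.

```python
from scipy.stats import f_oneway, ttest_ind

# Hypothetical weights (kg) for 5 exercise groups of 4 men each
groups = [
    [80, 82, 79, 81],
    [75, 74, 76, 77],
    [85, 86, 84, 88],
    [80, 79, 81, 80],
    [70, 72, 71, 69],
]

# ANOVA: do the group means differ overall?
f_stat, p_anova = f_oneway(*groups)

# T-test: compare just the first two groups
t_stat, p_t = ttest_ind(groups[0], groups[1])

print(f_stat, p_anova)
print(t_stat, p_t)
```

ANOVA only says that at least one group mean differs; pairwise comparisons such as the T-test above are needed to find which ones, ideally with a correction for multiple testing.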
So far, we have covered the first three stages of data exploration: Variable Identification, Uni-variate Analysis and Bi-variate Analysis. We also looked at various statistical and visual methods to identify the relationship between variables.

Now, we will look at the methods of missing values treatment. More importantly, we will also look at why missing values occur in our data and why treating them is necessary.



