### In this notebook, I show data pre-processing on a small dataset

In [1]:
# importing the modules
import pandas as pd
import numpy as np

In [2]:
# reading the dataset
df = pd.read_csv('Data.csv')
df

,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


The dataset records each person's country, age, and salary, and whether that person bought the product or not.

## Data Pre-processing

In [3]:
# To extract the independent variables (features), we use the pandas iloc[] indexer.
# It selects the required rows and columns from the dataset by integer position.
X = df.iloc[:, :-1].values  # every column except the last one
y = df.iloc[:, -1].values   # only the last column; or y = df['Purchased'].values
# or y = df.iloc[:, 3].values
# or X = df.drop(['Purchased'], axis=1).values
# or X = df[[
#             'Country',
#             'Age',
#             'Salary'
# ]].values

.values, used here, returns a NumPy representation of the DataFrame.
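
To make this concrete, here is a minimal sketch that rebuilds the first three rows of the table above by hand and converts them to arrays. The names `sample`, `features` and `target` are made up for this example. It uses `to_numpy()`, which the pandas documentation recommends over `.values` and which gives the same array here. Because the columns mix text and numbers, the feature array ends up with dtype object.

```python
import pandas as pd

# Stand-in for Data.csv: the first three rows from the table above, typed in by hand
sample = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

features = sample.iloc[:, :-1].to_numpy()  # all columns except the last
target = sample.iloc[:, -1].to_numpy()     # only the last column

print(type(features), features.shape, features.dtype)
print(target)
```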

In [4]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [5]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


In [6]:
df.isnull().sum() # finding null values

Country      0
Age          1
Salary       1
Purchased    0
dtype: int64

In [7]:
df.describe()

,Age,Salary
count,9.0,9.0
mean,38.777778,63777.777778
std,7.693793,12265.579662
min,27.0,48000.0
25%,35.0,54000.0
50%,38.0,61000.0
75%,44.0,72000.0
max,50.0,83000.0


### Handling Missing Values

*There are mainly two ways to handle missing data, which are:*

    1. By deleting the particular row
    2. By calculating the mean: we calculate the mean of the column that contains the missing value and put it in place of the missing value. This strategy is useful for features that hold numeric data such as age, salary, year, etc. Here, we will use this approach.

df.dropna() --> removing rows or columns from the dataset is often not the best option, as it can result in significant information loss. If you have 300K data points, then removing 2–3 rows won't affect your dataset much, but if you only have 100 data points and 20 of them have NaN values for a particular field, then you can't simply drop those rows.
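
As a rough check of how much this dataset would lose, the sketch below types in the ten rows shown at the top of the notebook (instead of reading Data.csv) and counts the rows that survive `dropna()`. The names `data`, `dropped`, `rows_lost` and `share_lost` are just for illustration.

```python
import numpy as np
import pandas as pd

# The ten rows shown at the top of the notebook, typed in by hand
data = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

dropped = data.dropna()               # removes every row with at least one NaN
rows_lost = len(data) - len(dropped)
share_lost = rows_lost / len(data)

print(f"rows before: {len(data)}, rows after: {len(dropped)}")
print(f"share of rows lost: {share_lost:.0%}")
```

Two of the ten rows go away, even though each of them is missing only a single value.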

Example: suppose we are collecting data from a survey that has an optional field which, let's say, 20% of people left blank. The remaining 80% of the answers in that field are still useful, so rather than dropping those rows we need to somehow substitute the missing 20% of values. We can do this with the help of imputation.

Ref : https://towardsdatascience.com/introduction-to-data-preprocessing-in-machine-learning-a9fa83a5dc9d
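
A minimal sketch of mean imputation using only pandas, assuming the Age and Salary values shown at the top of the notebook (typed in by hand here). The names `numeric`, `missing_share`, `column_means` and `filled` are made up for this example, and this is only one possible way to do it, separate from the scikit-learn approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Only the two numeric columns of the dataset, typed in by hand
numeric = pd.DataFrame({
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
})

missing_share = numeric.isna().mean()   # fraction of missing values per column
column_means = numeric.mean()           # NaN values are skipped by default
filled = numeric.fillna(column_means)   # each NaN gets the mean of its own column

print(missing_share)
print(filled.loc[[4, 6]])               # the two rows that had a missing value
```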

In [9]:
from sklearn.impute import SimpleImputer
# filling each missing value with the mean of its column (average age, average salary)
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')  # missing_values: which placeholder counts as missing; strategy: how to fill it
# SimpleImputer always works column by column and does not accept an axis argument
# Fitting the imputer object to the numeric independent variables in X
imputer.fit(X[:, 1:3])  # the input should hold only numerical values (Age and Salary), not text or strings
# replacing the Age and Salary columns with the updated values instead of changing the whole X
X[:, 1:3] = imputer.transform(X[:, 1:3])

In [11]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
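
One way to double-check the filled-in numbers is to look at the `statistics_` attribute of a fitted SimpleImputer, which stores the value used for each column. The sketch below assumes the Age and Salary values from the table above, typed in by hand; `numeric_part`, `check_imputer` and `imputed` are names made up for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns of the dataset, typed in by hand
numeric_part = np.array([
    [44, 72000], [27, 48000], [30, 54000], [38, 61000], [40, np.nan],
    [35, 58000], [np.nan, 52000], [48, 79000], [50, 83000], [37, 67000],
], dtype=float)

check_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = check_imputer.fit_transform(numeric_part)  # fit and transform in one step

print(check_imputer.statistics_)  # one fill value per column: mean Age, mean Salary
print(imputed[4], imputed[6])     # the two rows that had a missing value
```

The fill values should match the mean row of df.describe() shown earlier.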


Remove a column from dataset:
https://www.google.com/search?q=how+to+remove+a+column+from+dataset+in+data+science&rlz=1C1CHBF_enIN965IN965&oq=how+to+remove+a+column+from+dataset+in+data+science&aqs=chrome..69i57.19402j0j1&sourceid=chrome&ie=UTF-8

Remove a row from dataset:
https://www.google.com/search?q=how+to+remove+a+row+from+a+dataset+in+data+science&rlz=1C1CHBF_enIN965IN965&ei=xG_DYcLhFovN-QbLi5WIAg&ved=0ahUKEwjC07iBhvj0AhWLZt4KHctFBSEQ4dUDCA4&uact=5&oq=how+to+remove+a+row+from+a+dataset+in+data+science&gs_lcp=Cgdnd3Mtd2l6EAM6BwgAEEcQsAM6BQgAEIAEOgYIABAWEB46BAgAEA06CAghEBYQHRAeOgUIIRCgAToECCEQFToHCCEQChCgAUoECEEYAEoECEYYAFDZA1jbI2CNKWgBcAJ4AIABswKIAeEYkgEIMC4xMi4zLjGYAQCgAQHIAQjAAQE&sclient=gws-wiz
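
For quick reference, here is a small sketch of both operations with `DataFrame.drop`, using a hand-typed three-row stand-in for the dataset. The names `sample`, `without_salary` and `without_row_1` are made up for this example. `drop` returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

# Stand-in for Data.csv: the first three rows of the dataset, typed in by hand
sample = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

without_salary = sample.drop(columns=["Salary"])  # remove a column by name
without_row_1 = sample.drop(index=[1])            # remove a row by its index label

print(without_salary.columns.tolist())
print(without_row_1.index.tolist())
```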

To learn more about handling missing values: https://towardsdatascience.com/data-cleaning-how-to-handle-missing-values-in-pandas-cc8570c446ec