# Data Preprocessing Tools

## Project Description
A small toy-robot company in the US purchases various electronic components from several suppliers. The company has decided to stop doing business with some of these suppliers. The file "SuppliersData.csv" summarizes these data.  
In this exercise, we learn how to import the data, review and check them, take care of missing data, deal with outliers, encode categorical data, etc. 

## Importing the Libraries

In [10]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the Dataset

In [11]:
dataset = pd.read_csv('SuppliersData.csv')

## Showing the Dataset in a Table

In [12]:
dataset  # already a DataFrame, so no pd.DataFrame(...) wrapper is needed

Unnamed: 0,Country,Number of Parts,Cost ($),Business Continues
0,Brazil,44.0,72000.0,No
1,Mexico,27.0,48000.0,Yes
2,Canada,30.0,54000.0,No
3,Mexico,38.0,61000.0,No
4,Canada,40.0,,Yes
5,Brazil,35.0,58000.0,Yes
6,Mexico,,52000.0,No
7,Brazil,48.0,79000.0,Yes
8,Canada,50.0,83000.0,No
9,Brazil,37.0,67000.0,Yes


## A Quick Review of the Data

In [13]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Country             10 non-null     object 
 1   Number of Parts     9 non-null      float64
 2   Cost ($)            9 non-null      float64
 3   Business Continues  10 non-null     object 
dtypes: float64(2), object(2)
memory usage: 448.0+ bytes
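
The non-null counts above indicate one missing value in each numeric column. As a minimal sketch, the missing entries can be located directly. The DataFrame below is a stand-in typed in from the table shown earlier (an assumption made so the example runs without the CSV file), and the names `missing_per_column` and `rows_with_missing` are illustrative.

```python
import numpy as np
import pandas as pd

# Stand-in for the CSV contents, copied from the table displayed earlier.
dataset = pd.DataFrame({
    "Country": ["Brazil", "Mexico", "Canada", "Mexico", "Canada",
                "Brazil", "Mexico", "Brazil", "Canada", "Brazil"],
    "Number of Parts": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Cost ($)": [72000, 48000, 54000, 61000, np.nan,
                 58000, 52000, 79000, 83000, 67000],
    "Business Continues": ["No", "Yes", "No", "No", "Yes",
                           "Yes", "No", "Yes", "No", "Yes"],
})

# Count the missing values in each column.
missing_per_column = dataset.isna().sum()
print(missing_per_column)

# Show only the rows that contain at least one missing value.
rows_with_missing = dataset[dataset.isna().any(axis=1)]
print(rows_with_missing)
```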


## Separating the Input and Output
Here, we put the independent variables (every column except the last) in X and the dependent variable (the last column) in y. 

In [31]:
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
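
To make the slicing concrete, here is a small sketch on a hypothetical three-row table (`toy`, `feature_a`, `feature_b`, and `label` are illustrative names, not part of the dataset). `iloc[:, :-1]` keeps all rows and every column except the last, while `iloc[:, -1]` keeps only the last column.

```python
import pandas as pd

# Hypothetical toy table: two feature columns followed by one label column.
toy = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0],
    "feature_b": [10.0, 20.0, 30.0],
    "label": ["No", "Yes", "No"],
})

X_toy = toy.iloc[:, :-1].values  # all rows, all columns except the last
y_toy = toy.iloc[:, -1].values   # all rows, last column only

print(X_toy.shape)  # two-dimensional: (rows, feature columns)
print(y_toy.shape)  # one-dimensional: (rows,)
```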

## Showing the Input Data in a Table

In [32]:
pd.DataFrame(X)

Unnamed: 0,0,1,2
0,Brazil,44.0,72000.0
1,Mexico,27.0,48000.0
2,Canada,30.0,54000.0
3,Mexico,38.0,61000.0
4,Canada,40.0,
5,Brazil,35.0,58000.0
6,Mexico,,52000.0
7,Brazil,48.0,79000.0
8,Canada,50.0,83000.0
9,Brazil,37.0,67000.0


## A Quick Check of the Output Data

In [34]:
pd.DataFrame(y)

Unnamed: 0,0
0,No
1,Yes
2,No
3,No
4,Yes
5,Yes
6,No
7,Yes
8,No
9,Yes


## Taking Care of Missing Data

In [35]:
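from sklearn.impute import SimpleImputer
# Replace each NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Columns 1 and 2 hold the numeric features (Number of Parts, Cost)
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])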

In [36]:
# A quick check

pd.DataFrame(X)


Unnamed: 0,0,1,2
0,Brazil,44.0,72000.0
1,Mexico,27.0,48000.0
2,Canada,30.0,54000.0
3,Mexico,38.0,61000.0
4,Canada,40.0,63777.777778
5,Brazil,35.0,58000.0
6,Mexico,38.777778,52000.0
7,Brazil,48.0,79000.0
8,Canada,50.0,83000.0
9,Brazil,37.0,67000.0
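
The two filled-in values are simply the means of the nine observed entries in each column: 349 / 9 ≈ 38.777778 parts and 574000 / 9 ≈ 63777.777778 dollars. The sketch below reproduces this on the numeric columns alone; the arrays are typed in from the table shown earlier, and the variable names are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Numeric columns copied from the table, with NaN marking the missing entries.
parts = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], dtype=float)
cost = np.array([72000, 48000, 54000, 61000, np.nan,
                 58000, 52000, 79000, 83000, 67000], dtype=float)
numeric = np.column_stack([parts, cost])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(numeric)

# The learned per-column means are stored in statistics_.
print(imputer.statistics_)

# The same means computed by hand, ignoring NaN.
print(np.nanmean(parts), np.nanmean(cost))
```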


## Encoding Categorical Data

### Encoding the Independent Variable

In [40]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
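# One-hot encode column 0 (Country) and pass the numeric columns through unchanged.
# Run this cell only once: a second run would encode the first indicator column again.
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = ct.fit_transform(X)
X = np.array(X)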

In [41]:
# A quick check
pd.DataFrame(X)

Unnamed: 0,0,1,2,3,4
0,1.0,0.0,0.0,44.0,72000.0
1,0.0,0.0,1.0,27.0,48000.0
2,0.0,1.0,0.0,30.0,54000.0
3,0.0,0.0,1.0,38.0,61000.0
4,0.0,1.0,0.0,40.0,63777.777778
5,1.0,0.0,0.0,35.0,58000.0
6,0.0,0.0,1.0,38.777778,52000.0
7,1.0,0.0,0.0,48.0,79000.0
8,0.0,1.0,0.0,50.0,83000.0
9,1.0,0.0,0.0,37.0,67000.0
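
A single pass of the encoder over the country column yields one indicator column per country (in alphabetical order: Brazil, Canada, Mexico), followed by the passthrough columns, for five columns in total. If the encoding cell is run a second time on the already-encoded array, column 0 is encoded again and a sixth column appears. The sketch below illustrates both cases on a three-row stand-in; `X_demo`, `make_transformer`, and `sparse_threshold=0` (used here only to force a dense result) are choices made for this example.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Three-row stand-in with the same column layout as X: country, parts, cost.
X_demo = np.array([["Brazil", 44.0, 72000.0],
                   ["Mexico", 27.0, 48000.0],
                   ["Canada", 30.0, 54000.0]], dtype=object)


def make_transformer():
    # sparse_threshold=0 always returns a dense array (an illustrative choice).
    return ColumnTransformer(
        transformers=[("encoder", OneHotEncoder(), [0])],
        remainder="passthrough",
        sparse_threshold=0,
    )


ct_demo = make_transformer()
encoded_once = np.array(ct_demo.fit_transform(X_demo))

# Categories are sorted, so the indicator columns are Brazil, Canada, Mexico.
print(ct_demo.named_transformers_["encoder"].categories_)
print(encoded_once.shape)

# Encoding the already-encoded array treats the 0/1 values in column 0 as
# categories and replaces that column with two new indicator columns.
encoded_twice = np.array(make_transformer().fit_transform(encoded_once))
print(encoded_twice.shape)
```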


### Encoding the Dependent Variable

In [44]:
from sklearn.preprocessing import LabelEncoder
# Map the class labels to integers in sorted order: 'No' -> 0, 'Yes' -> 1
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

In [45]:
# a quick check
pd.DataFrame(y)

Unnamed: 0,0
0,0
1,1
2,0
3,0
4,1
5,1
6,0
7,1
8,0
9,1
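
As a small sketch of what the encoder stores, the fitted `classes_` attribute lists the original labels in sorted order, and `inverse_transform` maps the integers back to them. The five-element array and the names `y_demo` and `le_demo` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# First five labels from the output column shown earlier.
y_demo = np.array(["No", "Yes", "No", "No", "Yes"])

le_demo = LabelEncoder()
y_encoded = le_demo.fit_transform(y_demo)

print(le_demo.classes_)                      # position in this array = integer code
print(y_encoded)                             # integer codes for y_demo
print(le_demo.inverse_transform(y_encoded))  # back to the original strings
```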


## Feature Scaling

In [46]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Columns 0-2 are the one-hot country indicators; standardize only the numeric columns after them
X[:, 3:] = sc.fit_transform(X[:, 3:])

In [47]:
# a quick check
pd.DataFrame(X)

Unnamed: 0,0,1,2,3,4
0,1.0,0.0,0.0,0.758874,0.749473
1,0.0,0.0,1.0,-1.711504,-1.438178
2,0.0,1.0,0.0,-1.275555,-0.891265
3,0.0,0.0,1.0,-0.113024,-0.2532
4,0.0,1.0,0.0,0.177609,0.0
5,1.0,0.0,0.0,-0.548973,-0.526657
6,0.0,0.0,1.0,0.0,-1.07357
7,1.0,0.0,0.0,1.34014,1.387538
8,0.0,1.0,0.0,1.630773,1.752147
9,1.0,0.0,0.0,-0.25834,0.293712
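
`StandardScaler` replaces each value x with (x - mean) / std, where std is the population standard deviation (dividing by n, not n - 1). The sketch below checks this by hand on the "Number of Parts" column after imputation; the array is typed in from the tables shown earlier, and the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# "Number of Parts" after mean imputation (the missing entry became 349 / 9).
parts = np.array([44, 27, 30, 38, 40, 35, 349 / 9, 48, 50, 37], dtype=float)

# Manual standardization; np.std uses ddof=0 (population std) by default.
manual = (parts - parts.mean()) / parts.std()

# The same computation through scikit-learn, which expects a 2-D input.
scaled = StandardScaler().fit_transform(parts.reshape(-1, 1)).ravel()

print(np.allclose(manual, scaled))
print(manual.round(6))
```

Because the imputed entry equals the column mean, it standardizes to exactly 0, which is why a 0.0 appears in that row of the scaled table.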


## Splitting the Dataset into the Training set and Test set

In [48]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23)

In [49]:
# Quick check: a cell displays only its last expression, so show the training inputs here
pd.DataFrame(X_train)

Unnamed: 0,0,1,2,3,4
0,0.0,1.0,0.0,-1.275555,-0.891265
1,1.0,0.0,0.0,-0.25834,0.293712
2,0.0,1.0,0.0,0.177609,0.0
3,1.0,0.0,0.0,1.34014,1.387538
4,0.0,0.0,1.0,-1.711504,-1.438178
5,1.0,0.0,0.0,0.758874,0.749473
6,0.0,0.0,1.0,0.0,-1.07357
7,0.0,0.0,1.0,-0.113024,-0.2532
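
In this notebook the imputer and scaler are fitted on all ten rows before the split, so the test rows influence the means and standard deviations used for training. A common alternative, sketched here under the assumption that only the two numeric columns are used, is to split first and fit the preprocessing on the training rows only. The names `numeric`, `labels`, and `prep` are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Numeric columns and encoded labels, typed in from the tables shown earlier.
numeric = np.array([[44, 72000], [27, 48000], [30, 54000], [38, 61000],
                    [40, np.nan], [35, 58000], [np.nan, 52000],
                    [48, 79000], [50, 83000], [37, 67000]], dtype=float)
labels = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# Split first: 20% of 10 rows gives 8 training rows and 2 test rows.
num_train, num_test, lab_train, lab_test = train_test_split(
    numeric, labels, test_size=0.2, random_state=23)

# Fit the imputer and scaler on the training rows only...
prep = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
num_train_prepared = prep.fit_transform(num_train)

# ...then apply the already-fitted transformations to the test rows.
num_test_prepared = prep.transform(num_test)

print(num_train_prepared.shape, num_test_prepared.shape)
```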


In [50]:
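# Quick check: fit a 1-nearest-neighbor classifier and score it on the test set

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

# Mean accuracy on the held-out test rows
knn.score(X_test, y_test)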


0.5
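
With `test_size=0.2` on ten rows, the test set holds only two samples, so the accuracy can take just three values and a score of 0.5 means one of the two test rows was classified correctly. The sketch below enumerates the possibilities; the two-element `y_true` is an illustrative stand-in, not the notebook's actual `y_test`.

```python
from itertools import product

from sklearn.metrics import accuracy_score

# Illustrative two-sample test set (stand-in labels, one per class).
y_true = [1, 0]

# Try every possible pair of predictions and collect the distinct accuracies.
possible_scores = sorted({
    float(accuracy_score(y_true, list(pred)))
    for pred in product([0, 1], repeat=2)
})
print(possible_scores)
```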