# Data pre-processing

### Data preprocessing is a technique used to convert raw data into a clean data set. Data gathered from different sources arrives in a raw format that is not suitable for analysis.

#### Pandas is an open-source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. See http://pandas.pydata.org/ for full documentation.

#### Matplotlib is a Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits. See https://matplotlib.org/ for full documentation.

#### NumPy is the fundamental package for scientific computing with Python. It contains, among other things, a powerful N-dimensional array object, sophisticated (broadcasting) functions, tools for integrating C/C++ and Fortran code, and useful linear algebra, Fourier transform, and random number capabilities. See https://www.numpy.org/ for full documentation.



Need for Data Preprocessing

• To achieve better results from the model applied in a machine learning project.

• Datasets should be formatted so that more than one machine learning or deep learning algorithm can be run on the same data set and the best of them chosen.

In [12]:
# Importing the necessary libraries
import numpy as np
import pandas as pd
import warnings
# Suppress all warnings (this also hides deprecation warnings from libraries)
warnings.filterwarnings('ignore')


######  Getting data into a DataFrame

In [2]:
# Instantiating an empty DataFrame
df = pd.DataFrame()

### <u>Loading data from CSV files</u>
To access specific columns and rows we can use <br>
<i> df["Jan"].head() </i> to select a column by name, or <i> df.iloc[:, 1:4].head() </i> to select columns by position.
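As a small illustration of the two access styles, the sketch below reads a few rows copied from the unemployment table out of an in-memory string instead of the CSV file. The names `csv_text`, `sample`, `jan`, `first_three` and `one_cell` are made up for this example.

```python
import io
import pandas as pd

# A few rows from the unemployment table, kept in memory so the
# example runs without the CSV file on disk.
csv_text = """Year,Jan,Feb,Mar,Apr
1950,7.6,7.9,7.1,6.0
1951,4.4,4.2,3.8,
1952,3.7,3.8,3.3,3.0
"""
sample = pd.read_csv(io.StringIO(csv_text))

jan = sample["Jan"]                 # one column selected by label (a Series)
first_three = sample.iloc[:, 1:4]   # all rows, columns at positions 1, 2, 3
one_cell = sample.loc[0, "Mar"]     # single value: row label 0, column "Mar"

print(jan.tolist())
print(list(first_three.columns))
print(one_cell)
```

Label-based selection keeps working if the column order changes, while `iloc` depends only on position.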

In [3]:
# A DataFrame can be loaded from several file formats.
# For instance, loading a DataFrame from a CSV file
# and displaying only the first 5 rows of the data frame:
df = pd.read_csv("Datas/unemployment1948.csv")
df.head()  

Unnamed: 0,Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec,Annual
0,1948,4.0,4.7,4.5,4.0,,3.9,3.9,3.6,3.4,2.9,3.3,3.6,3.8
1,1949,5.0,5.8,,5.4,5.7,6.4,7.0,6.3,5.9,6.1,5.7,6.0,5.9
2,1950,7.6,7.9,7.1,6.0,5.3,5.6,5.3,4.1,4.0,3.3,3.8,3.9,5.3
3,1951,4.4,4.2,3.8,,2.9,3.4,3.3,2.9,3.0,2.8,3.2,2.9,3.3
4,1952,3.7,3.8,3.3,3.0,2.9,3.2,3.3,3.1,2.7,2.4,2.5,2.5,3.0


In [4]:
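df["Mar"].max()  # or df.iloc[:, 3:4].max(), since "Mar" is the column at position 3
df.describe().transpose()  # summary statistics, one row per column
bool_filter = df["Jan"] > 6.  # Boolean mask; df[bool_filter] or df[df["Jan"] > 6.] selects the matching rows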
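The Boolean filter built in this cell is only a mask and has to be applied to the DataFrame to select rows. A minimal sketch, using the `Year` and `Jan` values of the five rows shown earlier (the name `high_jan` is illustrative):

```python
import pandas as pd

# Year and Jan values of the first five rows of the unemployment table.
sample = pd.DataFrame({
    "Year": [1948, 1949, 1950, 1951, 1952],
    "Jan": [4.0, 5.0, 7.6, 4.4, 3.7],
})

bool_filter = sample["Jan"] > 6.0   # Boolean Series, one entry per row
high_jan = sample[bool_filter]      # keeps only the rows where the mask is True

print(bool_filter.tolist())
print(high_jan)
```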

## Missing data
Replacing missing values in the dataset with an appropriate value

In [5]:
# Importing the necessary libraries
# Imputer (sklearn.preprocessing) was removed in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer

In [6]:
# Creating a SimpleImputer that replaces NaN with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean') 
df.head()

Unnamed: 0,Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec,Annual
0,1948,4.0,4.7,4.5,4.0,,3.9,3.9,3.6,3.4,2.9,3.3,3.6,3.8
1,1949,5.0,5.8,,5.4,5.7,6.4,7.0,6.3,5.9,6.1,5.7,6.0,5.9
2,1950,7.6,7.9,7.1,6.0,5.3,5.6,5.3,4.1,4.0,3.3,3.8,3.9,5.3
3,1951,4.4,4.2,3.8,,2.9,3.4,3.3,2.9,3.0,2.8,3.2,2.9,3.3
4,1952,3.7,3.8,3.3,3.0,2.9,3.2,3.3,3.1,2.7,2.4,2.5,2.5,3.0


In [7]:
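# Convert the DataFrame to a NumPy array
dfarray = df.iloc[:, :].values
# Learn the means of columns 3 to 6 (Mar to Jun) and fill their missing values
imputer = imputer.fit(dfarray[:, 3:7])
dfarray[:, 3:7] = imputer.transform(dfarray[:, 3:7])
pd.DataFrame(dfarray).head()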

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,1948.0,4.0,4.7,4.5,4.0,5.552308,3.9,3.9,3.6,3.4,2.9,3.3,3.6,3.8
1,1949.0,5.0,5.8,6.189231,5.4,5.7,6.4,7.0,6.3,5.9,6.1,5.7,6.0,5.9
2,1950.0,7.6,7.9,7.1,6.0,5.3,5.6,5.3,4.1,4.0,3.3,3.8,3.9,5.3
3,1951.0,4.4,4.2,3.8,5.712308,2.9,3.4,3.3,2.9,3.0,2.8,3.2,2.9,3.3
4,1952.0,3.7,3.8,3.3,3.0,2.9,3.2,3.3,3.1,2.7,2.4,2.5,2.5,3.0
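To see what the mean strategy does, here is a small sketch on a made-up array (`values` is invented for illustration and is not the unemployment data). Each `NaN` is replaced by the mean of the observed values in its own column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up array with one missing value per column.
values = np.array([[4.0, np.nan],
                   [6.0, 3.0],
                   [np.nan, 5.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(values)

# statistics_ holds the per-column means computed from the observed values:
# column 0 -> (4 + 6) / 2 = 5, column 1 -> (3 + 5) / 2 = 4
print(imputer.statistics_)
print(filled)
```

Calling `fit` and `transform` separately, as in the cell above, gives the same result as `fit_transform` when both use the same data.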


# Categorical data

In [8]:
# Importing the dataset
dataset = pd.read_csv('Datas/Data.csv')
X = dataset.iloc[:, :-1].values  # features: Country, Age, Salary
y = dataset.iloc[:, 3].values    # target: Purchased
dataset

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44,72000,No
1,Spain,27,48000,Yes
2,Germany,30,54000,No
3,Spain,38,61000,No
4,Germany,40,67720,Yes
5,France,35,58000,Yes
6,Spain,36,52000,No
7,France,48,79000,Yes
8,Germany,50,83000,No
9,France,37,67000,Yes


In [9]:
# Encoding categorical data
# Encoding the Independent Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
pd.DataFrame(X)

Unnamed: 0,0,1,2
0,0,44,72000
1,2,27,48000
2,1,30,54000
3,2,38,61000
4,1,40,67720
5,0,35,58000
6,2,36,52000
7,0,48,79000
8,1,50,83000
9,0,37,67000


In [13]:
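# Note: `categorical_features` was deprecated in scikit-learn 0.20 and removed in 0.22;
# newer versions need sklearn.compose.ColumnTransformer instead.
# Run this cell only once: re-running it one-hot encodes column 0 of the already encoded X again.
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
pd.DataFrame(X)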

Unnamed: 0,0,1,2,3,4,5
0,0.0,1.0,0.0,0.0,44.0,72000.0
1,1.0,0.0,0.0,1.0,27.0,48000.0
2,1.0,0.0,1.0,0.0,30.0,54000.0
3,1.0,0.0,0.0,1.0,38.0,61000.0
4,1.0,0.0,1.0,0.0,40.0,67720.0
5,0.0,1.0,0.0,0.0,35.0,58000.0
6,1.0,0.0,0.0,1.0,36.0,52000.0
7,0.0,1.0,0.0,0.0,48.0,79000.0
8,1.0,0.0,1.0,0.0,50.0,83000.0
9,0.0,1.0,0.0,0.0,37.0,67000.0
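On scikit-learn versions where `categorical_features` is unavailable, one possible alternative is `ColumnTransformer`. The sketch below is an assumption about how the same step could be written, not the notebook's original method. It uses the first three rows of the table above, and the names `X_raw`, `ct` and `X_encoded` are illustrative. `OneHotEncoder` accepts string categories directly, so no `LabelEncoder` step is needed here.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# First three rows of the Country / Age / Salary features.
X_raw = np.array([
    ["France", 44, 72000],
    ["Spain", 27, 48000],
    ["Germany", 30, 54000],
], dtype=object)

ct = ColumnTransformer(
    transformers=[("country", OneHotEncoder(), [0])],  # one-hot encode column 0 only
    remainder="passthrough",                            # keep Age and Salary unchanged
    sparse_threshold=0,                                 # always return a dense array
)
X_encoded = ct.fit_transform(X_raw)

# Categories are sorted alphabetically: France, Germany, Spain.
print(ct.named_transformers_["country"].categories_[0])
print(X_encoded)
```

The encoded columns come first, followed by the passed-through columns, so each row has three indicator columns and then Age and Salary.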


#### Rescale Data<br>
• When the data consists of attributes with varying scales, many machine learning algorithms can benefit from rescaling the attributes so that they all have the same scale.<br>
• This is useful for optimization algorithms used at the core of machine learning algorithms, such as gradient descent.<br>
• It is also useful for algorithms that weight inputs, such as regression and neural networks, and for algorithms that use distance measures, such as K-Nearest Neighbors.<br>
• We can rescale the data with scikit-learn's MinMaxScaler class.

In [51]:
from sklearn.preprocessing import MinMaxScaler 

In [57]:
# Rescale every column of dfarray to the range [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(dfarray)
pd.DataFrame(rescaledX).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,0.0,0.075,0.185185,0.202532,0.166667,0.418124,0.16,0.169014,0.166667,0.112676,0.066667,0.101266,0.1375,0.132353
1,0.015385,0.2,0.320988,0.416358,0.361111,0.438356,0.493333,0.605634,0.541667,0.464789,0.493333,0.405063,0.4375,0.441176
2,0.030769,0.525,0.580247,0.531646,0.444444,0.383562,0.386667,0.366197,0.236111,0.197183,0.12,0.164557,0.175,0.352941
3,0.046154,0.125,0.123457,0.113924,0.404487,0.054795,0.093333,0.084507,0.069444,0.056338,0.053333,0.088608,0.05,0.058824
4,0.061538,0.0375,0.074074,0.050633,0.027778,0.054795,0.066667,0.084507,0.097222,0.014085,0.0,0.0,0.0,0.014706
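With `feature_range=(0, 1)`, each column is rescaled as `(x - min) / (max - min)`, using that column's own minimum and maximum. The sketch below checks this by hand on a small array. It takes the `Year` and `Jan` values of three rows from the table, and the names `data`, `scaled` and `manual` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Year and Jan values for 1948, 1950 and 1952.
data = np.array([[1948.0, 4.0],
                 [1950.0, 7.6],
                 [1952.0, 3.7]])

scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(data)

# The same rescaling written out by hand, column by column.
col_min = data.min(axis=0)
col_max = data.max(axis=0)
manual = (data - col_min) / (col_max - col_min)

print(scaled)
print(np.allclose(scaled, manual))
```

Because the minimum and maximum are taken per column, the result depends on which rows are passed to `fit`, so the values here differ from those in the output above, which was fitted on the full data set.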


# <i> Well Done </i>