# Machine Learning
Machine learning (ML) is a subset of artificial intelligence that uses statistical methods to build computer systems that learn from the data available to them.
Systems built on machine learning can learn from past experience without being explicitly programmed. The primary aim is to let computer systems learn automatically, without human intervention.
ML is used in many fields, such as medical diagnosis, image processing, prediction, and classification.
    
## Types of ML algorithms
ML algorithms are mainly divided into four categories:
### Supervised Learning
* In supervised learning, every training example is "labeled" with its correct value, so the machine learns to predict that value for new inputs. The labeling is mostly done by hand, which makes this approach the easiest for the computer and the most laborious for humans.
* It is like giving the machine the standard answers. When the machine is later tested, it replies according to those answers.
* For example, to train a machine to distinguish elephants from giraffes, you can provide 100 labeled photos of elephants and giraffes. The machine detects the characteristics of each animal from the labeled photos and then identifies elephants and giraffes by those characteristics.
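
As a minimal sketch of this idea, assuming scikit-learn is installed: the numbers below are made up (height in metres, weight in tonnes) and stand in for characteristics extracted from photos, and `KNeighborsClassifier` is only one possible choice of classifier.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical, hand-made features: [height_m, weight_t]
features = [[3.2, 5.5], [3.0, 6.0], [2.8, 4.8],   # elephants
            [5.0, 1.1], [5.4, 1.3], [4.8, 0.9]]   # giraffes
labels = ["elephant", "elephant", "elephant",
          "giraffe", "giraffe", "giraffe"]

# Learn from the labeled examples (the "standard answers")
model = KNeighborsClassifier(n_neighbors=3)
model.fit(features, labels)

# Ask for the label of two animals the model has not seen
predictions = model.predict([[3.1, 5.0], [5.2, 1.0]])
print(predictions)
```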

### Unsupervised Learning
* In unsupervised learning, no data is labeled; the machine groups the data by detecting its characteristics on its own.
* No manual classification is done, which makes this approach the simplest for humans but the hardest for the computer, and it tends to produce more errors.
* If you ask a machine to tell elephants from giraffes, it must decide by itself which of the 100 photos show elephants and which show giraffes, grouping them as it goes. It can only group the animals by their characteristics, so its results can be wrong.
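
A rough sketch of grouping without labels, assuming scikit-learn is installed. The feature values are made up for illustration, and `KMeans` is just one clustering algorithm that could be used here. Note that the result is only a cluster number per example; the algorithm never learns the words "elephant" or "giraffe".

```python
from sklearn.cluster import KMeans

# Hypothetical, hand-made features: [height_m, weight_t]; no labels are given
features = [[3.2, 5.5], [3.0, 6.0], [2.8, 4.8],
            [5.0, 1.1], [5.4, 1.3], [4.8, 0.9]]

# Ask for two groups; random_state makes the run repeatable
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(features)

print(cluster_ids)   # one cluster number per example
```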

### Semi-supervised Learning
* In semi-supervised learning, only a small amount of the data is labeled. The computer finds features from the labeled data and then classifies the remaining data accordingly. This approach is commonly used when labeling all the data would be too costly.
* For example, out of 100 photos, only 10 are labeled as elephants or giraffes. From the characteristics of these 10 photos, the machine identifies and classifies the remaining 90.
* The results are generally more accurate than with unsupervised learning.
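
One simple way to act on this idea is pseudo-labeling (self-training), sketched below under the assumption that scikit-learn and NumPy are installed. The feature values are made up, `-1` is used here as a hypothetical marker for "unlabeled", and the nearest-neighbour classifier is only an illustrative choice.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical, hand-made features: [height_m, weight_t]
features = np.array([[3.2, 5.5], [3.0, 6.0], [2.8, 4.8], [3.1, 5.2],
                     [5.0, 1.1], [5.4, 1.3], [4.8, 0.9], [5.1, 1.0]])
# Only two examples carry a label; -1 marks an unlabeled example
labels = np.array([0, -1, -1, -1, 1, -1, -1, -1])   # 0 = elephant, 1 = giraffe

labeled = labels != -1

# Step 1: learn from the few labeled examples
model = KNeighborsClassifier(n_neighbors=1)
model.fit(features[labeled], labels[labeled])

# Step 2: give each unlabeled example the label the model predicts for it
pseudo_labels = labels.copy()
pseudo_labels[~labeled] = model.predict(features[~labeled])

# Step 3: retrain on all examples, now that each one has a label
model.fit(features, pseudo_labels)
print(pseudo_labels)
```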

### Reinforcement Learning
* Reinforcement learning is a feedback-based machine learning technique in which an agent learns to behave in an environment by performing actions and observing their results. For each good action the agent receives positive feedback (a reward), and for each bad action it receives negative feedback (a penalty).
* The agent's goal is to maximize the total reward.
* The agent learns automatically from this feedback, without any labeled data, so it has to learn from its own experience. The reward accumulated on the way to the final goal tells the agent which actions to prefer.
* For example, an agent that learns to play chess.
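
Chess is far too large for a short example, so the sketch below uses a much smaller stand-in: an agent on a line of five positions that is rewarded for reaching the rightmost one. The environment, the reward values, and the choice of tabular Q-learning are all assumptions made for illustration.

```python
import random

N_STATES = 5          # positions 0..4 on a line; position 4 is the goal
ACTIONS = [0, 1]      # 0 = step left, 1 = step right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2   # learning rate, discount, exploration rate

rng = random.Random(0)
q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action] = estimated value


def step(state, action):
    """Move the agent and return (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    if next_state == N_STATES - 1:
        return next_state, 1.0, True      # reward for reaching the goal
    return next_state, -0.01, False       # small penalty for every other move


for episode in range(500):
    state = 0
    for _ in range(100):                  # cap the episode length
        if rng.random() < EPSILON:
            action = rng.choice(ACTIONS)                      # explore
        else:
            action = max(ACTIONS, key=lambda a: q[state][a])  # exploit
        next_state, reward, done = step(state, action)
        # Q-learning update: move the estimate toward
        # reward + discounted value of the best next action
        target = reward + GAMMA * max(q[next_state])
        q[state][action] += ALPHA * (target - q[state][action])
        state = next_state
        if done:
            break

# Best action in each non-goal position according to the learned values
policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

No labels are involved at any point: the agent only sees rewards and penalties and adjusts its value estimates from them.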

## Data Preprocessing
Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model. It is the first step, and a crucial one, in creating a machine learning model.

Below are the steps of preprocessing:

## Importing the dataset

In [18]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

dataset = pd.read_csv(r"C:\Users\kanki\OneDrive\Documents\python\Data1.csv")
print(dataset)
print('\n')
x = dataset.iloc[:, :-1].values  # .values returns a NumPy array; without it the result keeps the column names and index
y = dataset.iloc[:, -1].values
print(x)   # matrix of features
print('\n')
print(y)   # dependent variable

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes


[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
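
Before moving on, it can help to count the missing entries per column. The sketch below assumes pandas and NumPy are installed and rebuilds the printed table by hand instead of reading the CSV, so it runs without the file.

```python
import numpy as np
import pandas as pd

# Same rows as the printed dataset, typed in by hand instead of read from the CSV
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

print(dataset.shape)                     # (rows, columns)
missing_per_column = dataset.isna().sum()
print(missing_per_column)                # number of missing entries in each column
```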


## Taking care of missing data

In [19]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# missing_values: the placeholder that marks a missing entry
# strategy: what to fill the missing entries with (here, the column mean)

imputer.fit(x[:, 1:3])
# fit computes the mean of each column passed in; pass only the numerical columns (Age and Salary)

x[:, 1:3] = imputer.transform(x[:, 1:3])   # store the transformed columns back in place
# transform replaces each missing value with the mean of its column

print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
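
The two filled-in values can be checked by hand: they should equal the mean of the non-missing entries in each column. A small sketch, assuming NumPy and scikit-learn are installed, with the Age and Salary columns typed in from the dataset above:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns from the dataset, including the two missing entries
numeric = np.array([[44, 72000], [27, 48000], [30, 54000], [38, 61000],
                    [40, np.nan], [35, 58000], [np.nan, 52000],
                    [48, 79000], [50, 83000], [37, 67000]], dtype=float)

# Mean of each column, ignoring the missing entries
column_means = np.nanmean(numeric, axis=0)

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = imputer.fit_transform(numeric)

print(column_means)
print(imputed[6, 0], imputed[4, 1])   # the two values the imputer filled in
```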


## Encoding categorical data

### Encoding the Independent Variable

In [20]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
# transformers: a list of (name, transformer, column indices) tuples; here column 0 (Country) is one-hot encoded
# remainder='passthrough' keeps the columns that are not encoded

x = np.array(ct.fit_transform(x))   # fit and transform in one step, returning the new matrix
# wrap the result in np.array so that later training steps receive a NumPy array

print(x)   # re-running this cell encodes the first column of the already-encoded x again, replacing it with a new pair of 0/1 columns

# in the output France is encoded as 1.0 0.0 0.0, Germany as 0.0 1.0 0.0 and Spain as 0.0 0.0 1.0
# where the categories have no natural order, apply ColumnTransformer with OneHotEncoder
# LabelEncoder (used below) is meant for the dependent variable; for feature categories whose order matters, scikit-learn provides OrdinalEncoder

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
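
To see which output column belongs to which country, the fitted encoder can be inspected. A minimal sketch, assuming scikit-learn and NumPy are installed, using `OneHotEncoder` on its own with a few country values from the dataset:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder = OneHotEncoder()
# The default output is a sparse matrix; convert it to a dense array for printing
encoded = encoder.fit_transform(countries).toarray()

print(encoder.categories_[0])   # category order = column order of the encoding
print(encoded)
```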


### Encoding the Dependent Variable

In [21]:
#encoding the dependent variable
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)   # encodes the Yes/No column (last column) as 1/0; the dependent variable does not need the np.array wrapper used above

print(y)

[0 1 0 0 1 1 0 1 0 1]
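
The mapping between labels and integers is stored on the fitted encoder and can be reversed. A short sketch, assuming scikit-learn is installed, with a few of the labels from the dataset:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
encoded = le.fit_transform(["No", "Yes", "No", "No", "Yes"])

print(le.classes_)     # position in this array = integer code
print(encoded)

# Map the integers back to the original labels
decoded = le.inverse_transform(encoded)
print(decoded)
```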


## Split dataset into training set and test set

In [25]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
# test_size=0.2 puts 20% of the observations in the test set and the remaining 80% in the training set, chosen at random
# random_state fixes the random seed, so the same split is produced on every run

print(x_train)
print('\n')
print(x_test)
print('\n')
print(y_train)
print('\n')
print(y_test)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


[0 1 0 0 1 1 0 1]


[0 1]
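
The effect of `test_size` and `random_state` can be checked on placeholder data. This sketch assumes scikit-learn and NumPy are installed; the feature matrix is a dummy array, not the dataset above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(20).reshape(10, 2)   # 10 placeholder observations, 2 features
y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# Split twice with the same random_state
first = train_test_split(x, y, test_size=0.2, random_state=1)
second = train_test_split(x, y, test_size=0.2, random_state=1)

x_train, x_test, y_train, y_test = first
print(x_train.shape, x_test.shape)   # 8 training rows, 2 test rows

# True when both calls produced exactly the same four arrays
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)
```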


Feature scaling is done after splitting the dataset:
* Feature scaling makes sure the features take values on the same scale, which prevents one feature from dominating the others. It computes the mean and standard deviation of the data in order to scale the features.
* The test set is the data on which you will evaluate your machine learning model, so it should be new to the model.
* If we scaled before splitting, the mean and standard deviation would be computed over the whole original dataset, test observations included. Using information from the test set, which stands in for new observations seen in production, amounts to leaking data.
* We scale after splitting to prevent this leakage from the test set, which we are not supposed to see until training is done.
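
The leakage can be made visible by comparing the statistics a scaler learns from the training rows alone with those it learns from all rows. A small sketch, assuming scikit-learn and NumPy are installed; it uses the nine known ages from the dataset and a simple, non-random split chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The nine known ages from the dataset, as a single feature column
ages = np.array([[44.0], [27.0], [30.0], [38.0], [40.0],
                 [35.0], [48.0], [50.0], [37.0]])
train, test = ages[:7], ages[7:]    # illustrative split, not a random one

scaler_train_only = StandardScaler().fit(train)   # statistics from the training rows only
scaler_all_rows = StandardScaler().fit(ages)      # statistics that also "see" the test rows

print(scaler_train_only.mean_, scaler_all_rows.mean_)
```

The two means differ, which shows that fitting on all rows lets information from the test rows influence how the training data is scaled.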

## Feature scaling

In [27]:
# fit the standardisation (a scaling technique) on x_train, then apply the same scaler to x_test; y is not scaled

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train[:, 3:] = sc.fit_transform(x_train[:, 3:])   # fit computes the mean and std-deviation of the training columns; transform applies the standardisation formula
x_test[:, 3:] = sc.transform(x_test[:, 3:])   # use only transform on the test set, so it is scaled with the training mean and std-deviation

print(x_train)
print('\n')
print(x_test)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
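
Standardisation replaces each value x with (x - mean) / std, where the mean and standard deviation come from the training column. The sketch below, assuming NumPy and scikit-learn are installed, applies that formula by hand to a few Age and Salary rows and compares the result with `StandardScaler`; the small train/test arrays are picked only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A few [Age, Salary] rows, split by hand for illustration
train = np.array([[38.0, 61000.0], [27.0, 48000.0],
                  [48.0, 79000.0], [35.0, 58000.0]])
test = np.array([[30.0, 54000.0]])

mean = train.mean(axis=0)
std = train.std(axis=0)              # population standard deviation (ddof=0)

manual_train = (train - mean) / std
manual_test = (test - mean) / std    # test rows reuse the training mean and std

sc = StandardScaler().fit(train)
train_matches = np.allclose(manual_train, sc.transform(train))
test_matches = np.allclose(manual_test, sc.transform(test))
print(train_matches, test_matches)
```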


Do we have to apply standardisation (scaling) to the dummy variables, i.e. the 0.0/1.0 columns? No. The aim of scaling is to bring all the features onto the same scale. Standardisation transforms the features so that they take values roughly between -2 and +2, and the country dummy variables, being 0 or 1, already lie within that range. You can apply standardisation to the dummy variables, but you will lose their interpretation, e.g. France as 1.0 0.0 0.0.
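
The loss of interpretation is easy to see on a single dummy column. A brief sketch, assuming NumPy and scikit-learn are installed, using the "France" column of the training set shown above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# "France" dummy column of the training set (first column of x_train)
france = np.array([[0.0], [0.0], [1.0], [0.0], [0.0], [1.0], [0.0], [1.0]])

scaled = StandardScaler().fit_transform(france)

print(np.unique(france))   # the original column holds only 0 and 1
print(np.unique(scaled))   # after scaling: two other values that no longer read as yes/no
```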