## [Essentials of Machine Learning Algorithms](https://www.analyticsvidhya.com/blog/2015/08/common-machine-learning-algorithms/)

Broadly, there are three types of machine learning algorithms.

* __Supervised Learning__: These algorithms work with a __target / outcome variable__ (or dependent variable) that is to be predicted from a given set of predictors (independent variables). Using this set of variables, we generate a function that maps inputs to desired outputs. The training process continues until the model achieves a __desired level of accuracy__ on the training data. Examples of Supervised Learning:
    * __Linear Regression__
    * __Logistic Regression__
    * __Decision Tree__
    * __SVM__
    * __Random Forest__
    * __KNN__



* __Unsupervised Learning__: Here we do not have any target or outcome variable to predict or estimate. These algorithms are used for __clustering a population__ into different groups, which is widely used for __segmenting customers__ into groups for specific interventions. Examples of Unsupervised Learning:
    * __K-means__


* __Reinforcement Learning__: Using this type of algorithm, the machine is trained to make specific decisions. It works this way: the machine is exposed to an environment where it trains itself continually using trial and error. __The machine learns from past experience__ and tries to capture the best possible knowledge to make accurate business decisions. Example of Reinforcement Learning:
    * __Markov Decision Process__
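
To make the difference between the first two categories concrete, here is a minimal sketch, assuming scikit-learn is installed and using a made-up toy dataset (the points and labels are not from the source). The supervised model is fitted with both the predictors `X` and the target `y`, while the clustering model only ever sees `X`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): two well-separated groups of 2-D points.
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])
y = np.array([0, 0, 0, 1, 1, 1])  # target variable, used only in the supervised case

# Supervised: learn a mapping from the predictors X to the known target y.
clf = LogisticRegression().fit(X, y)
supervised_pred = clf.predict(np.array([[1.2, 1.4], [8.8, 8.7]]))

# Unsupervised: no target is given; the algorithm groups the points on its own.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_

print(supervised_pred)  # predicted classes for the two new points
print(cluster_ids)      # cluster index assigned to each training point
```

Note that the cluster indices are arbitrary: K-means may call the first group `0` or `1`, since it has no notion of the original labels.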

# List of Common Machine Learning Algorithms

Here is the list of commonly used machine learning algorithms. These algorithms can be applied to almost any data problem:

* Linear Regression
* Logistic Regression
* Decision Tree
* SVM
* Naive Bayes
* kNN
* K-Means
* Random Forest
* Dimensionality Reduction Algorithms
* Gradient Boosting algorithms
    * GBM
    * XGBoost
    * LightGBM
    * CatBoost

# Linear Regression

Linear regression is used to estimate real values (cost of houses, number of calls, total sales, etc.) based on one or more continuous variables. We establish a relationship between the independent and dependent variables by fitting a best-fit line. This line is known as the regression line and is represented by the linear equation Y = a * X + b.

The best way to understand linear regression is to relive a childhood experience. Say you ask a child in fifth grade to arrange the people in their class in increasing order of weight, without asking them their weights. What do you think the child will do? They would likely look at (visually analyze) the height and build of each person and arrange them using a combination of these visible parameters. This is linear regression in real life! The child has figured out that height and build are correlated with weight by a relationship that looks like the equation above.

In this equation:

* Y – Dependent variable
* a – Slope
* X – Independent variable
* b – Intercept

The coefficients a and b are derived by minimizing the sum of squared differences between the observed values and the values predicted by the regression line (the vertical distances between the data points and the line).
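
For a single independent variable this minimization has a well-known closed-form solution: the slope is the covariance of X and Y divided by the variance of X, and the intercept follows from the fact that the line passes through the point of means. The following is a minimal sketch in plain Python; the function names and the sample points are made up for illustration.

```python
def fit_line(xs, ys):
    """Return slope a and intercept b of the least-squares line Y = a*X + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of X and Y divided by the variation of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    # Intercept: the fitted line passes through (mean_x, mean_y).
    b = mean_y - a * mean_x
    return a, b


def sum_squared_error(xs, ys, a, b):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


# Made-up points lying close to Y = 2*X + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
a, b = fit_line(xs, ys)
print(a, b, sum_squared_error(xs, ys, a, b))
```

Any other choice of slope and intercept, for example the "true" values 2 and 1, gives a sum of squared errors that is at least as large on these points.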

As an example, suppose the best-fit line for a set of height and weight measurements is y = 0.2811x + 13.9. Using this equation, we can estimate the weight of a person from their height.
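
Applying a fitted line is a single multiplication and addition. The sketch below wraps the equation from the text in a small helper; the function name and the sample heights are hypothetical, and the source does not state the units of height or weight.

```python
def predict_weight(height, slope=0.2811, intercept=13.9):
    """Estimate weight from height using the line y = 0.2811x + 13.9."""
    return slope * height + intercept


# Hypothetical heights; units are whatever the original data used.
for height in (150, 160, 170):
    print(height, round(predict_weight(height), 2))
```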

Linear Regression is mainly of two types: Simple Linear Regression and Multiple Linear Regression. Simple Linear Regression is characterized by one independent variable, while Multiple Linear Regression (as the name suggests) is characterized by multiple (more than one) independent variables. The best-fit relationship does not have to be a straight line: you can also fit a curve, which is known as polynomial or curvilinear regression.
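
The three variants can be sketched with NumPy alone. This is an illustration on made-up, noise-free data (so the recovered coefficients match the ones used to generate it), not the approach used in the notebook cells further down.

```python
import numpy as np

# Simple linear regression: one independent variable.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x + 2.0
slope, intercept = np.polyfit(x, y, deg=1)

# Multiple linear regression: two independent variables plus an intercept column.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y_multi = 1.5 * X[:, 0] - 0.5 * X[:, 1] + 4.0
design = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(design, y_multi, rcond=None)

# Polynomial regression: still linear in the coefficients, but uses powers of x.
x_poly = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_poly = 0.5 * x_poly**2 - x_poly + 1.0
poly_coef = np.polyfit(x_poly, y_poly, deg=2)  # highest power first

print(slope, intercept)
print(coef)       # [coefficient of X1, coefficient of X2, intercept]
print(poly_coef)  # [coefficient of x^2, coefficient of x, constant]
```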

* Estimates real values
* Models the relationship between independent and dependent variables
* Fits the linear equation Y = a * X + b
    * Y – Dependent variable
    * a – Slope (chosen by minimizing the sum of squared differences)
    * X – Independent variable
    * b – Intercept (chosen by minimizing the sum of squared differences)
* Two main types: Simple Linear Regression (one independent variable) and Multiple Linear Regression (more than one independent variable)

In [6]:
# Import libraries
from sklearn import linear_model
from sklearn import neighbors, datasets, preprocessing
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np
import pandas as pd

In [13]:
# Load Train and Test datasets
# Reading the dataset in a dataframe using Pandas
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv") 

In [14]:
# View the first 5 rows of the train data set
df_train.head()

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |


In [15]:
# View the first 5 rows of the test data set
df_test.head()

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001015 | Male | Yes | 0 | Graduate | No | 5720 | 0 | 110.0 | 360.0 | 1.0 | Urban |
| 1 | LP001022 | Male | Yes | 1 | Graduate | No | 3076 | 1500 | 126.0 | 360.0 | 1.0 | Urban |
| 2 | LP001031 | Male | Yes | 2 | Graduate | No | 5000 | 1800 | 208.0 | 360.0 | 1.0 | Urban |
| 3 | LP001035 | Male | Yes | 2 | Graduate | No | 2340 | 2546 | 100.0 | 360.0 | NaN | Urban |
| 4 | LP001051 | Male | No | 0 | Not Graduate | No | 3276 | 0 | 78.0 | 360.0 | 1.0 | Urban |


In [34]:
# Count the missing values in each column of the test data set
df_test.isnull().sum()

Loan_ID               0
Gender               11
Married               0
Dependents           10
Education             0
Self_Employed        23
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount            5
Loan_Amount_Term      6
Credit_History       29
Property_Area         0
dtype: int64
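
The same per-column count can be tried out on a small frame without the CSV files. This is a sketch on a hypothetical toy DataFrame that only borrows a few column names from the loan data; `isnull().sum()` counts missing entries per column and `isnull().mean()` gives the share of missing rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few gaps, mimicking some of the loan columns.
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [110.0, np.nan, 208.0, np.nan],
    "Property_Area": ["Urban", "Urban", "Rural", "Urban"],
})

missing_counts = toy.isnull().sum()        # missing values per column
missing_share = toy.isnull().mean() * 100  # the same, as a percentage of rows
print(missing_counts)
print(missing_share)
```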

In [36]:
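# Fill missing values with the column mean.
# Assign the result back: fillna(inplace=True) on a selected column may not update the DataFrame in recent pandas versions.
df_train['ApplicantIncome'] = df_train['ApplicantIncome'].fillna(df_train['ApplicantIncome'].mean())
df_train['LoanAmount'] = df_train['LoanAmount'].fillna(df_train['LoanAmount'].mean())
df_test['ApplicantIncome'] = df_test['ApplicantIncome'].fillna(df_test['ApplicantIncome'].mean())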
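
The cell above fills each frame with its own column mean. A common alternative, sketched here on hypothetical toy frames rather than the loan data, is to compute the mean on the training data only and reuse it for the test data, so that the test set is preprocessed with exactly the same statistic the model was trained with.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the train and test frames.
train = pd.DataFrame({"LoanAmount": [100.0, np.nan, 140.0, 120.0]})
test = pd.DataFrame({"LoanAmount": [np.nan, 90.0]})

# Compute the statistic on the training data only (NaN values are skipped by default).
train_mean = train["LoanAmount"].mean()

# Reuse it for both frames, assigning the result back to the column.
train["LoanAmount"] = train["LoanAmount"].fillna(train_mean)
test["LoanAmount"] = test["LoanAmount"].fillna(train_mean)

print(train)
print(test)
```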

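In [45]:
# Identify the feature and response variables; values must be numeric.
# scikit-learn expects the features as a 2-D array of shape (n_samples, n_features),
# so the column is selected with double brackets to get a one-column DataFrame.
x_train = df_train[['ApplicantIncome']]
y_train = df_train['LoanAmount']
x_test = df_test[['ApplicantIncome']]

In [46]:
# Create linear regression object
linear = linear_model.LinearRegression()

In [47]:
# Train the model using the training set and check the score (R^2 on the training data)
linear.fit(x_train, y_train)
linear.score(x_train, y_train)

# Note: with the feature passed as a 1-D Series (df_train['ApplicantIncome']),
# this cell originally failed with:
# ValueError: Found arrays with inconsistent numbers of samples: [  1 614]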
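
The `ValueError` quoted above is what older scikit-learn versions report when the feature is passed as a 1-D array: the 614 values were apparently interpreted as a single sample with 614 features, which does not match the 614 target values. Below is a minimal sketch of the shape requirement on made-up data (not the loan data), assuming scikit-learn and NumPy are installed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: y is exactly 2*x + 1, so the fit is easy to check.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # shape (5,): a 1-D array
y = 2.0 * x + 1.0

# scikit-learn expects features with shape (n_samples, n_features).
X = x.reshape(-1, 1)                      # shape (5, 1): 5 samples, 1 feature

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)      # slope a and intercept b
print(model.score(X, y))                  # R^2 on the training data
print(model.predict(np.array([[6.0]])))   # prediction for a new sample
```

With a pandas DataFrame, selecting the column as `df[['col']]` instead of `df['col']` yields the same 2-D shape without an explicit reshape.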