## Machine Learning Foundations

* Residual Sum of Squares [RSS] is the sum of the squared differences between the actual scatter plot points and the line y = mx + c. In other words, for each point we take the vertical distance from the point to the line, square it, and add up the results.

* Any number of lines with different slopes and y-intercepts can be drawn for a given scatter plot. To find the best line we use RSS: the smaller the RSS, the better the fit.

* One way of predicting a value is to read it off the line y = mx + c at the given x.

* The fitted curve does not have to be a straight line; it can be quadratic or any higher-degree polynomial. As the degree increases, the RSS on the fitted data decreases, and a polynomial of high enough degree passes through all the points of the scatter plot.

* Even if a higher-order polynomial is used, it is still called linear regression, because the model is linear in its coefficients: each higher power is just treated as another feature.

* A model whose RSS reaches zero does not necessarily give the best possible predictions.

* Fitting a curve that passes through all the points of the dataset is called overfitting. Once a model is overfitted, we lose the ability to predict anything, because the model is tailored to the given dataset only and not to any data arriving in the future.

* So the approach to preparing a predictive model is to divide the dataset into two parts:
    
    --> Training set
    
    --> Test set
    
* Now use the training data to fit the model, and once the model is ready, use the test data to measure the deviation (error) of its predictions from the actual values. Fine-tune the model to get better accuracy.
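
The RSS comparison described in the list above can be sketched in plain Python. The data points and the two candidate lines are made up for illustration; they are not taken from any dataset in these notes.

```python
def rss(xs, ys, slope, intercept):
    """Residual sum of squares of the line y = slope * x + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


def predict(x, slope, intercept):
    """Read the predicted value off the line at a given x."""
    return slope * x + intercept


# Hypothetical scatter plot points (illustrative values only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# Two candidate lines, each given as (slope, intercept).
candidates = [(2.0, 0.0), (1.0, 1.0)]

# The candidate with the smaller RSS is the better fit.
best = min(candidates, key=lambda line: rss(xs, ys, *line))
print(best)
print(rss(xs, ys, *best))
print(predict(6.0, *best))
```

Here the line with slope 2 and intercept 0 wins, because its residuals are all much smaller than those of the other candidate.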


### Training Error

We again calculate the residual sum of squares [RSS], this time on the training data with reference to the fitted line; this is the training error.
In other words, we take the distance between each training point and the line, square it, and sum the squares over all training points. The result is the training error.

### Test Error

We again calculate the residual sum of squares [RSS], this time on the test data with reference to the fitted line; this is the test error.
In other words, we take the distance between each test point and the line, square it, and sum the squares over all test points. The result is the test error.
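
As a rough sketch of the two error definitions, the snippet below splits a made-up dataset 80/20, fits a least-squares line on the training part only, and computes RSS on each part. The data, the seed, and the helper names `fit_line` and `rss` are assumptions for illustration.

```python
import random


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rss(points, slope, intercept):
    """Residual sum of squares of the line over a list of (x, y) points."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)


# Hypothetical data: roughly y = 3x + 2 with a small alternating wiggle.
points = [(float(x), 3.0 * x + 2.0 + (0.5 if x % 2 else -0.5)) for x in range(20)]

# Shuffle with a fixed seed, then cut into 80% training and 20% test.
rng = random.Random(0)
shuffled = points[:]
rng.shuffle(shuffled)
cut = int(0.8 * len(shuffled))
train, test = shuffled[:cut], shuffled[cut:]

# Fit on the training set only, then measure both errors.
slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
training_error = rss(train, slope, intercept)
test_error = rss(test, slope, intercept)
print(training_error)
print(test_error)
```

The key design point is that the test points play no part in fitting, so the test error estimates how the line behaves on data it has not seen.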
   

### Training/Test Error Vs Model Complexity

<img src="https://i.stack.imgur.com/alkeM.png">
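
A small numerical illustration of the trend in the figure, on made-up data: a straight line is compared with a degree-4 polynomial forced through all five training points (Lagrange interpolation is used here only as a convenient way to get a zero-training-error curve). All values and helper names are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def interpolate(xs, ys, x):
    """Evaluate the Lagrange polynomial through all (xs, ys) points at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def rss(model, xs, ys):
    """Residual sum of squares of a prediction function over the data."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: noisy training points around y = 2x, clean test points.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [0.0, 2.5, 3.5, 6.5, 8.0]
test_x = [0.5, 1.5, 2.5, 3.5]
test_y = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_line(train_x, train_y)


def line(x):
    return slope * x + intercept


def curve(x):
    return interpolate(train_x, train_y, x)


for name, model in [("line", line), ("degree-4 curve", curve)]:
    print(name, rss(model, train_x, train_y), rss(model, test_x, test_y))
```

On this data the curve has zero training error but a test error of roughly 1.8, while the line has a training error of about 0.70 and a test error of about 0.04: the more complex model fits the training set better and the unseen points worse.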

* So far the discussion has considered just one feature [column] of the dataset. Many other features can be included while creating the model.

* A regression coefficient is the weight given to each feature that is considered while creating the model, so one feature may be weighted more than others.
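
To make the role of the coefficients concrete, here is a minimal sketch of a prediction as an intercept plus a weighted sum of features. The weights and the house below are invented for illustration; a real model would learn the weights from training data. The second part shows powers of a single input used as separate features, which is why polynomial regression is still linear in its weights.

```python
def predict(features, weights, intercept):
    """Intercept plus the weighted sum of the feature values."""
    return intercept + sum(weights[name] * value for name, value in features.items())


# Hypothetical coefficients and house (not learned from any real data).
intercept = 50000.0
weights = {"sqft_living": 280.0, "bedrooms": -10000.0}
house = {"sqft_living": 2000.0, "bedrooms": 3.0}
print(predict(house, weights, intercept))


def polynomial_features(x, degree):
    """Turn one input into the features x^1, x^2, ..., x^degree."""
    return {"x^%d" % p: x ** p for p in range(1, degree + 1)}


# Hypothetical weights for a quadratic model: 1 + 2x + 0.5x^2.
poly_weights = {"x^1": 2.0, "x^2": 0.5}
print(predict(polynomial_features(3.0, 2), poly_weights, 1.0))
```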

 

### Load some house sales data
Dataset is from house sales in King County, the region where the city of Seattle, WA is located.

In [1]:
import graphlab
sales = graphlab.SFrame('home_data.gl')

In [None]:
sales

#### Exploring the data for housing sales
The house price is correlated with the number of square feet of living space.

In [None]:
graphlab.canvas.set_target('ipynb')
sales.show(view="Scatter Plot", x="sqft_living", y="price")

### Create a simple regression model of sqft_living to price
Split data into training and testing.

We use seed=0 so that everyone running this notebook gets the same results. In practice, you may set a random seed (or let GraphLab Create pick a random seed for you).

In [None]:
train_data,test_data = sales.random_split(.8,seed=0)

### Build the regression model using only sqft_living as a feature

In [None]:
sqft_model = graphlab.linear_regression.create(train_data, target='price', features=['sqft_living'],validation_set=None)

### Evaluate the simple model

In [None]:
print test_data['price'].mean()

In [None]:
print sqft_model.evaluate(test_data)

RMSE of about $255,170!
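
RMSE (root mean squared error) is the square root of the average squared residual, i.e. sqrt(RSS / N), so it is expressed in the same units as the target (dollars here). A minimal sketch with made-up prices, written in plain Python rather than with GraphLab Create:

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: sqrt(RSS / N)."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / float(len(squared)))


# Hypothetical actual and predicted values (illustrative only).
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 320.0]
print(rmse(actual, predicted))
```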

### Let's show what our predictions look like

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
plt.plot(test_data['sqft_living'],test_data['price'],'.',
        test_data['sqft_living'],sqft_model.predict(test_data),'-')

Above: blue dots are original data, green line is the prediction from the simple regression.

Below: we can view the learned regression coefficients.

In [None]:
sqft_model.get('coefficients')

### Explore other features in the data

To build a more elaborate model, we will explore using more features.

In [None]:
my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']

In [None]:
sales[my_features].show()

In [None]:
sales.show(view='BoxWhisker Plot', x='zipcode', y='price')

Pull the bar at the bottom to view more of the data.

98039 is the most expensive zip code.

### Build a regression model with more features

In [None]:
my_features_model = graphlab.linear_regression.create(train_data,target='price',features=my_features,validation_set=None)

In [None]:
print my_features

### Comparing the results of the simple model with adding more features

In [None]:
print sqft_model.evaluate(test_data)
print my_features_model.evaluate(test_data)

The RMSE goes down from $255,170 to $179,508 with more features.

### Apply learned models to predict prices of 3 houses

The first house we will use is considered an "average" house in Seattle.

In [None]:
house1 = sales[sales['id']=='5309101200']

In [None]:
house1

<img src="http://info.kingcounty.gov/Assessor/eRealProperty/MediaHandler.aspx?Media=2916871">

In [None]:
print house1['price']

In [None]:
print sqft_model.predict(house1)

In [None]:
print my_features_model.predict(house1)

In this case, the model with more features provides a worse prediction than the simpler model with only 1 feature. However, on average, the model with more features is better.

### Prediction for a second, fancier house

We will now examine the predictions for a fancier house. 

In [None]:
house2 = sales[sales['id']=='1925069082']

In [None]:
house2

<img src="https://ssl.cdn-redfin.com/photo/1/bigphoto/302/734302_0.jpg">

In [None]:
print sqft_model.predict(house2)

In [None]:
print my_features_model.predict(house2)

In this case, the model with more features provides a better prediction. This behavior is expected here, because this house is more differentiated by features that go beyond its square feet of living space, especially the fact that it's a waterfront house.

### Last house, super fancy

Our last house is a very large one owned by a famous Seattleite.

In [None]:
bill_gates = {'bedrooms':[8], 
              'bathrooms':[25], 
              'sqft_living':[50000], 
              'sqft_lot':[225000],
              'floors':[4], 
              'zipcode':['98039'], 
              'condition':[10], 
              'grade':[10],
              'waterfront':[1],
              'view':[4],
              'sqft_above':[37500],
              'sqft_basement':[12500],
              'yr_built':[1994],
              'yr_renovated':[2010],
              'lat':[47.627606],
              'long':[-122.242054],
              'sqft_living15':[5000],
              'sqft_lot15':[40000]}

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Bill_gates%27_house.jpg/2560px-Bill_gates%27_house.jpg">

In [None]:
print my_features_model.predict(graphlab.SFrame(bill_gates))

The model predicts a price of over $13M for this house! But we expect the house to cost much more. (There are very few samples in the dataset of houses that are this fancy, so we don't expect the model to capture a perfect prediction here.)

## Classification Modeling

For classification the initial steps remain the same, i.e. the dataset is divided into two sets:
* Training set
* Test Set

We design the classifier algorithm and feed the data to it; the classifier has predefined criteria for classification.
In the case of a sentiment analysis classifier, the classifier has a predefined set of words with a corresponding score for each word.

Each sentence is evaluated based on its total score.
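
The word-scoring idea can be sketched as follows. The word list, the scores, and the zero threshold are hypothetical; a trained classifier would learn its scores from data rather than use a hand-written table.

```python
# Hypothetical word scores (illustrative only).
WORD_SCORES = {
    "great": 1.0,
    "awesome": 1.5,
    "good": 0.5,
    "bad": -1.0,
    "terrible": -2.0,
}


def score(sentence):
    """Sum the scores of the known words in the sentence."""
    words = sentence.lower().split()
    return sum(WORD_SCORES.get(word.strip(".,!?"), 0.0) for word in words)


def classify(sentence):
    """Label a sentence positive if its total score is above zero."""
    return "positive" if score(sentence) > 0 else "negative"


print(classify("Awesome and good!"))
print(classify("The food was great, but the service was terrible."))
```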

We pass in the test data and count the number of times the classifier evaluates it correctly, to find the accuracy and error of the classifier.

* error = number of mistakes / total number of sentences

The best value for error is 0.0, and error lies between 0 and 1.

* accuracy = number of correct predictions / total number of sentences

The best value for accuracy is 1.0, and accuracy lies between 0 and 1.

* error = 1 - accuracy

Error and accuracy are the ways to evaluate the classifier.
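
A minimal sketch of both measures, using made-up labels (1 for positive, 0 for negative):

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / float(len(true_labels))


def error(true_labels, predicted_labels):
    """Fraction of predictions that do not match the true labels."""
    mistakes = sum(1 for t, p in zip(true_labels, predicted_labels) if t != p)
    return mistakes / float(len(true_labels))


# Hypothetical labels: 3 of 5 predictions are correct.
true_labels = [1, 1, 0, 0, 1]
predicted_labels = [1, 0, 0, 1, 1]
print(accuracy(true_labels, predicted_labels))
print(error(true_labels, predicted_labels))
```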

The accuracy of the classifier is judged against random guessing: with k classes, random guessing is correct 1/k of the time on average, so the classifier's accuracy must be greater than 1/k, i.e.
* For binary classification the accuracy should be greater than 0.5 (2 classes, so 1/2)
* For classification with 3 classes the accuracy should be greater than 0.333, which is the average accuracy of random guessing
* For classification with k classes the accuracy should be greater than 1/k, which is the average accuracy of random guessing
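
The 1/k baseline can be checked with a quick simulation. This is a sketch under the assumption that guesses are drawn uniformly at random over the k classes; the sample size and seed are arbitrary.

```python
import random


def random_guess_accuracy(true_labels, classes, rng):
    """Accuracy obtained by guessing a class uniformly at random."""
    guesses = [rng.choice(classes) for _ in true_labels]
    correct = sum(1 for g, t in zip(guesses, true_labels) if g == t)
    return correct / float(len(true_labels))


rng = random.Random(0)
results = {}
for k in (2, 3, 5):
    classes = list(range(k))
    # Hypothetical true labels, also drawn at random for the simulation.
    true_labels = [rng.choice(classes) for _ in range(100000)]
    results[k] = random_guess_accuracy(true_labels, classes, rng)
    print(k, results[k], 1.0 / k)
```

Each simulated accuracy should land close to 1/k, which is the number a useful classifier has to beat.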

#### Types of mistakes
* Here the true labels are tabulated against the predicted labels; the matrix thus formed is called the confusion matrix.

<img src = "http://rasbt.github.io/mlxtend/user_guide/evaluate/confusion_matrix_files/confusion_matrix_1.png">
<img src = "http://scikit-learn.org/stable/_images/sphx_glr_plot_confusion_matrix_001.png" >
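
For a binary classifier, the four cells of the confusion matrix can be counted as in the sketch below. The labels are made up (1 for positive, 0 for negative).

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(true_labels, predicted_labels))


# Hypothetical labels (illustrative only).
true_labels = [1, 1, 1, 0, 0, 0, 1, 0]
predicted_labels = [1, 0, 1, 0, 1, 0, 1, 0]

counts = confusion_matrix(true_labels, predicted_labels)
true_positives = counts[(1, 1)]
false_negatives = counts[(1, 0)]
false_positives = counts[(0, 1)]
true_negatives = counts[(0, 0)]

# Rows are true labels, columns are predicted labels.
print([[true_positives, false_negatives],
       [false_positives, true_negatives]])
```

The two off-diagonal cells are the two types of mistakes: false negatives and false positives.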


#### Learning Curve

<img src = "learning_curve.png">

* If the amount of data collected is very small, the test error is high. As the amount of training data increases, the test error decreases, but it does not reach 0; some bias remains.
* The gap between the blue line and the x-axis is called bias.
* The statements above refer to the blue curve, i.e. a classifier that considers single words.
* With bigrams we consider pairs of words, so accuracy increases and the test error decreases; the bias remains, but at a lower level, as shown by the red curve.
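
The difference between single-word and bigram features can be sketched as follows. The whitespace tokenization and the example review are assumptions for illustration.

```python
from collections import Counter


def unigrams(text):
    """Count single words."""
    return Counter(text.lower().split())


def bigrams(text):
    """Count pairs of adjacent words."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))


review = "not good not bad"
print(unigrams(review))
print(bigrams(review))
```

With single words, "not good" contributes the same "good" count as a positive review would; the bigram ("not", "good") keeps the negation attached, which is one reason pair features can carry more information.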

In [5]:
import graphlab

In [6]:
products = graphlab.SFrame('amazon_baby.gl/')

In [7]:
products['word_count'] = graphlab.text_analytics.count_words(products['review'])

In [8]:
products.head()

In [9]:
graphlab.canvas.set_target('ipynb')

In [10]:
products['name'].show()

In [11]:
giraffe_reviews = products[products['name'] == 'Vulli Sophie the Giraffe Teether']

In [12]:
len(giraffe_reviews)

In [13]:
giraffe_reviews['rating'].show(view='Categorical')

In [14]:
products['rating'].show(view='Categorical')

In [16]:
# ignore all 3* reviews
products = products[products['rating'] != 3]

In [17]:
# positive sentiment = 4* or 5* reviews
products['sentiment'] = products['rating'] >=4

In [18]:
products.head()

#### Let's train the sentiment classifier

In [21]:
train_data,test_data = products.random_split(.8, seed=0)

In [22]:
sentiment_model = graphlab.logistic_classifier.create(train_data,
                                                     target='sentiment',
                                                     features=['word_count'],
                                                     validation_set=test_data)

#### Evaluate the sentiment model

In [23]:
sentiment_model.evaluate(test_data, metric='roc_curve')

In [24]:
sentiment_model.show(view='Evaluation')

#### Applying the learned model to understand sentiment for Giraffe

In [27]:
giraffe_reviews['predicted_sentiment'] = sentiment_model.predict(giraffe_reviews, output_type='probability')

In [28]:
giraffe_reviews.head()

#### Sort the reviews based on the predicted sentiment and explore

In [29]:
giraffe_reviews = giraffe_reviews.sort('predicted_sentiment', ascending=False)

In [30]:
giraffe_reviews.head()

#### Most positive reviews for the giraffe

In [31]:
giraffe_reviews[0]['review']

In [32]:
giraffe_reviews[1]['review']

#### Show most negative reviews for giraffe

In [33]:
giraffe_reviews[-1]['review']

In [34]:
giraffe_reviews[-2]['review']
