# Machine Learning

### Euclidean distance

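Similarity metrics work by comparing a fixed set of numerical features (another word for attributes) between 2 observations, or living spaces in our case.
When trying to predict a continuous value, like price, the main similarity metric that's used is Euclidean distance. Here's the general formula for Euclidean distance, where q1 to qn are the feature values of one observation and p1 to pn are the feature values of the other:

d = sqrt((q1 - p1)^2 + (q2 - p2)^2 + ... + (qn - pn)^2)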

* The lowest value you can achieve is 0. This happens when the value for the feature is exactly the same for both observations you're comparing.
* The Euclidean distance equation expects numerical values.
* It expects a value for each observation and attribute, so deal with missing data first:
    * remove columns: `dc_listings = dc_listings.drop(drop_columns, axis=1)`
    * remove rows with missing data: `dc_listings = dc_listings.dropna(axis=0, subset=null_columns)`
    * check: `print(dc_listings.isnull().sum())`
* Ranking by Euclidean distance doesn't make sense unless every attribute is ordinal.


Instead of computing the formula by hand, you can use the distance.euclidean() function from scipy.spatial, which takes 2 vectors as parameters and calculates the Euclidean distance between them. The euclidean() function expects:
* both vectors to be represented using a list-like object (Python list, NumPy array, or pandas Series),
* both vectors to be 1-dimensional and have the same number of elements.

Here's a simple example:



In [None]:
from scipy.spatial import distance

first_listing = [-0.596544, -0.439151]
second_listing = [-0.596544, 0.412923]
dist = distance.euclidean(first_listing, second_listing)
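
As a cross-check, the same distance can be computed directly from the formula using only the standard library. This is a minimal sketch; the helper name `euclidean_distance` is hypothetical and is not part of SciPy.

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two equal-length sequences of numbers."""
    if len(p) != len(q):
        raise ValueError("both vectors must have the same number of elements")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

first_listing = [-0.596544, -0.439151]
second_listing = [-0.596544, 0.412923]

# The first feature is identical, so only the second feature contributes.
dist = euclidean_distance(first_listing, second_listing)
print(dist)
```

Because the first feature matches exactly, the result is simply the absolute difference of the second feature, roughly 0.852.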

## Normalisation/Standardization
To prevent any single column from having too much of an impact on the distance, we can normalize all of the columns to have a mean of 0 and a standard deviation of 1.
Normalizing the values in each column this way (often called standardizing) preserves the shape of the distribution of the values in each column while aligning the scales. To normalize the values in a column, you need to:
* from each value, subtract the mean of the column,
* divide each result by the standard deviation of the column.

NOTE: The pandas mean() and std() methods were written with mass column transformation in mind. When you call them on a DataFrame, the appropriate column means and column standard deviations are used for each value in the DataFrame.


In [None]:
# Subtract the column mean from each value in the column.
first_transform = dc_listings['maximum_nights'] - dc_listings['maximum_nights'].mean()

# Divide each shifted value by the standard deviation
# (shifting by the mean does not change the standard deviation).
normalized_col = first_transform / first_transform.std()

# To apply this transformation across all of the columns in a DataFrame,
# use the corresponding DataFrame methods mean() and std():
normalized_listings = (dc_listings - dc_listings.mean()) / (dc_listings.std())

# Another example of Normalizing the data
pga['distance'] = (pga['distance'] - pga['distance'].mean()) / pga['distance'].std()
pga['accuracy'] = (pga['accuracy'] - pga['accuracy'].mean()) / pga['accuracy'].std()
print(pga.head())
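
To see what the transformation does, here is a small sketch on a made-up list of values using only the standard library. `statistics.stdev()` computes the sample standard deviation, which matches the default behaviour of the pandas `std()` method; the values themselves are hypothetical.

```python
import statistics

values = [1.0, 2.0, 4.0, 7.0, 30.0]  # hypothetical column values

mean = statistics.mean(values)
std = statistics.stdev(values)  # sample standard deviation (divides by n - 1)

normalized = [(v - mean) / std for v in values]

# After the transformation the column has mean 0 and standard deviation 1
# (up to floating-point rounding).
print(round(statistics.mean(normalized), 10))
print(round(statistics.stdev(normalized), 10))
```

The relative spacing of the values is unchanged; only the scale and the centre move.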

# scikit-learn

Scikit-learn is the most popular machine learning library in Python.
Scikit-learn contains functions for all of the major machine learning algorithms and a simple, unified workflow. 
Both of these properties allow data scientists to be incredibly productive when training and testing different models on a new dataset.

The scikit-learn workflow consists of 4 main steps:
1. instantiate the specific machine learning model you want to use
2. fit the model to the training data
3. use the model to make predictions
4. evaluate the accuracy of the predictions


### 1. Instantiate the model

Scikit-learn uses a similar object-oriented style to Matplotlib and you need to instantiate an empty model first by calling the constructor:

In [None]:
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor()

### 2. Fit the model to the data
Use the fit method. For all models, the fit method takes in 2 required parameters:
* a matrix-like object, containing the feature columns we want to use from the training set,
* a list-like object, containing the correct target values.

In [None]:
# Split full dataset into train and test sets.
train_df = normalized_listings.iloc[0:2792]
test_df = normalized_listings.iloc[2792:]

# Matrix-like object, containing just the 2 columns of interest from training set.
train_features = train_df[['accommodates', 'bathrooms']]

# List-like object, containing just the target column, `price`.
# It must come from the training set so that it has the same rows as train_features.
train_target = train_df['price']

# Pass everything into the fit method.
knn.fit(train_features, train_target)

### 3. Make Predictions

We can use the predict method to make predictions on the test set. The predict method has only one required parameter: a matrix-like object containing the feature columns from the dataset we want to make predictions on.

The number of feature columns you use during both training and testing must match, or scikit-learn will raise an error.

The predict() method returns a NumPy array containing the predicted price values for the test set. 

In [None]:
predictions = knn.predict(test_df[['accommodates', 'bathrooms']])

### 4. Evaluating Accuracy

Always use cross-validation to make sure the error metrics you are getting from your model are accurate. The most common form of cross-validation, and the one we will be using, is called k-fold cross-validation.

cross_val_score() returns one score per fold. For a classifier, the default score is accuracy.

cv is the number of folds, which is also the number of train/test iterations.

In [None]:
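from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import numpy as np

# all_X (feature columns) and all_y (target column) are assumed to be defined already.
lr = LogisticRegression()
scores = cross_val_score(lr, all_X, all_y, cv=10)
accuracy = np.mean(scores)

print(scores, accuracy)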

# Error Metrics

An error metric quantifies how far our predictions on the test set were from the actual values.


## mean error:

Calculated by taking the difference between each predicted and actual value and then averaging these differences. It isn't an effective error metric for most cases: mean error treats a positive difference differently than a negative difference, so errors in opposite directions cancel each other out.



## mean absolute error:

Compute the absolute value of each error before averaging all the errors.



In [None]:
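from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# train, test, features and target are assumed to be defined already.
lr = LinearRegression()
lr.fit(train[features], train[target])
predictions = lr.predict(test[features])
mae = mean_absolute_error(test[target], predictions)
mae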

## mean squared error (MSE) :

Take the mean of the squared error values, which makes the gap between the predicted and actual values clearer. A prediction that's off by 100 dollars will have a squared error (of 10,000) that's 100 times larger than a prediction that's off by only 10 dollars (which will have a squared error of 100).



In [None]:
from sklearn.metrics import mean_squared_error

# compare the actual values with the predicted values
mean_squared_error(test_one['price'], predicted_price)

## Root mean squared error (RMSE):

RMSE is an error metric whose units are the same as the units of the target column (for example dollars rather than dollars squared). It is calculated by taking the square root of the MSE value.

In [None]:

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=5, algorithm='auto')
knn.fit(train_one[['accommodates']], train_one['price'])
predicted_price = knn.predict(test_one[['accommodates']])
test_mse = mean_squared_error(test_one['price'], predicted_price)
iteration_one_rmse = test_mse ** (1/2)

# or
test_rmse = np.sqrt(test_mse)

## MAE vs. RMSE:

In general, we should expect the MAE value to be less than or equal to the RMSE value, because squaring the errors before averaging gives large errors more weight. Looking at the ratio of MAE to RMSE can help us understand if there are large but infrequent errors: the further the ratio falls below 1, the more the error is dominated by a few large misses.
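
The sketch below computes the four metrics from this section on a made-up set of predictions, using only the standard library. The numbers are hypothetical and chosen so that one prediction is a large miss.

```python
import math

actual = [100, 100, 100, 100]
predicted = [110, 90, 105, 35]   # hypothetical predictions; the last one is a large miss

errors = [p - a for p, a in zip(predicted, actual)]   # [10, -10, 5, -65]

mean_error = sum(errors) / len(errors)                # positive and negative errors cancel
mae = sum(abs(e) for e in errors) / len(errors)
mse = sum(e ** 2 for e in errors) / len(errors)
rmse = math.sqrt(mse)

print(mean_error, mae, mse, rmse)
print(mae / rmse)   # well below 1 because the single large error dominates the RMSE
```

Here the mean error is -15 even though no prediction is within 5 of the true value, the MAE is 22.5, and the RMSE is about 33.4, which shows how the squared term amplifies the one large miss.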



## Hyperparameters:

Values that affect the behavior and performance of a model but are unrelated to the data that's used. The process of finding the optimal hyperparameter value is known as hyperparameter optimization. A simple but common hyperparameter optimization technique is known as grid search, which involves:
* selecting a subset of the possible hyperparameter values,
* training a model using each of these hyperparameter values,
* evaluating each model's performance,
* selecting the hyperparameter value that resulted in the lowest error value.


Grid search essentially boils down to evaluating the model performance at different k values and selecting the k value that resulted in the lowest error. 

In [None]:
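import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

hyper_params = list(range(1, 21))
mse_values = list()
features = ['accommodates', 'bedrooms', 'bathrooms', 'beds', 'minimum_nights', 'maximum_nights', 'number_of_reviews']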

for i in hyper_params:
   
    knn = KNeighborsRegressor(n_neighbors=i,algorithm='brute')
    knn.fit(train_df[features],train_df['price'])
    predictions = knn.predict(test_df[features])
    mse_values.append(mean_squared_error(test_df['price'],predictions))
    
plt.scatter(hyper_params,mse_values)
plt.show()

## Holdout validation 

Involves:
* splitting the full dataset into 2 partitions:
    * a training set
    * a test set
* training the model on the training set,
* using the trained model to predict labels on the test set,
* computing an error metric to understand the model's effectiveness,
* switching the training and test sets and repeating,
* averaging the errors.


In holdout validation, we usually use a 50/50 split instead of the 75/25 split from train/test validation. This way, we remove the number of observations as a potential source of variation in our model performance.
Holdout validation is a specific example of a larger class of validation techniques called k-fold cross-validation.

## K Value
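In k-nearest neighbors, the k value is the number of similar records (neighbors) to compare to. You can use a similarity metric, such as Euclidean distance, to find them. In k-fold cross-validation, k is the number of folds, which is what the cell below varies.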

In [None]:
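# Exploring different numbers of folds (the k in k-fold cross-validation)

import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.neighbors import KNeighborsRegressor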

num_folds = [3, 5, 7, 9, 10, 11, 13, 15, 17, 19, 21, 23]

for fold in num_folds:
    kf = KFold(fold, shuffle=True, random_state=1)
    model = KNeighborsRegressor()
    mses = cross_val_score(model, dc_listings[["accommodates"]], dc_listings["price"], scoring="neg_mean_squared_error", cv=kf)
    rmses = np.sqrt(np.absolute(mses))
    avg_rmse = np.mean(rmses)
    std_rmse = np.std(rmses)
    
    print(str(fold), "folds: ", "avg RMSE: ", str(avg_rmse), "std RMSE: ", str(std_rmse))

## K-fold cross validation

K-fold cross-validation takes advantage of a larger proportion of the data during training while still rotating through different subsets of the data to avoid the issues of train/test validation.
Here's the algorithm for k-fold cross-validation:
* splitting the full dataset into k equal length partitions,
    * selecting k-1 partitions as the training set and
    * selecting the remaining partition as the test set
* training the model on the training set,
* using the trained model to predict labels on the test fold,
* computing the test fold's error metric,
* repeating all of the above steps k-1 times, until each partition has been used as the test set for an iteration,
* calculating the mean of the k error values.
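
The partitioning and rotation steps above can be sketched without any libraries. The helper `k_fold_indices` is a hypothetical name used for illustration only; it does not shuffle, and in practice the scikit-learn KFold class plays this role.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) pairs for k roughly equal folds."""
    indices = list(range(n_rows))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

splits = list(k_fold_indices(n_rows=10, k=5))
for train, test in splits:
    print("train:", train, "test:", test)
```

Each row index appears in exactly one test fold, and each iteration trains on the other k-1 folds.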


Holdout validation is essentially a version of k-fold cross validation when k is equal to 2. Generally, 5 or 10 folds are used for k-fold cross-validation.
As you increase the number of folds, the number of observations in each fold decreases and the variance of the fold-by-fold errors increases. Let's start by manually partitioning the data set into 5 folds. Instead of splitting into 5 dataframes, let's add a column that specifies which fold the row belongs to. This way, we can easily select the rows that belong to each fold.


In [None]:
# The KFold class from sklearn.model_selection has this signature:
#     KFold(n_splits, shuffle=False, random_state=None)
#
# n_splits is the number of folds you want to use,
# shuffle is used to toggle shuffling of the ordering of the observations in the dataset,
# random_state is used to specify the random seed value if shuffle is set to True.

import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.neighbors import KNeighborsRegressor

n_splits = 5
kf = KFold(n_splits, shuffle=True, random_state=1)
model = KNeighborsRegressor()
mses = cross_val_score(model,dc_listings[['accommodates']],dc_listings['price'],scoring="neg_mean_squared_error", cv=kf)
rmses = np.sqrt(np.absolute(mses))
avg_rmse = np.mean(rmses)

## Cross Val Score

You'll notice that none of the KFold parameters depend on the data set at all. This is because a KFold instance only describes how to split the data; the train and test indices are generated when we use it in conjunction with the cross_val_score() function, also from sklearn.model_selection. Together, the KFold class and the cross_val_score() function allow us to compactly train and test using k-fold cross validation.

The cross_val_score() function returns an array of scores, one value for each fold, calculated using the scoring criteria you specify. Here's the general workflow for performing k-fold cross-validation using the classes we just described:
* instantiate the scikit-learn model class you want to fit,
* instantiate the KFold class and use its parameters to specify the k-fold cross-validation attributes you want,
* use the cross_val_score() function to return the scoring metric you're interested in.


In [None]:
from sklearn.model_selection import cross_val_score
# Signature: cross_val_score(estimator, X, y, scoring=None, cv=None)

# estimator is a sklearn model that implements the fit method (e.g. instance of KNeighborsRegressor),
# X is the list or 2D array containing the features you want to train on,
# y is a list containing the values you want to predict (target column),
# scoring is a string describing the scoring criteria (e.g. "neg_mean_squared_error"),
# cv describes the number of folds. Here are some examples of accepted values:
#     an instance of the KFold class,
#     an integer representing the number of folds.

## Bias & Variance

Bias and variance are the 2 observable sources of error in a model that we can indirectly control.

Bias describes error that results from bad assumptions about the learning algorithm. For example, assuming that only one feature, like a car's weight, relates to a car's fuel efficiency will lead you to fit a simple, univariate regression model that will result in high bias. The error rate will be high since a car's fuel efficiency is affected by many other factors besides just its weight.

Variance describes error that occurs because of the variability of a model's predicted values. If we were given a dataset with 1000 features on each car and used every single feature to train an incredibly complicated multivariate regression model, we would have low bias but high variance. In an ideal world, we want low bias and low variance, but in reality there's always a tradeoff.


**The standard deviation of the RMSE is a proxy for a model's variance.**

**The average RMSE is a proxy for a model's bias.**

## Linear function

The term linear equation is often used interchangeably with linear function. Many real world processes can be modeled using multiple, related linear equations.


A simple, straight line is more clearly defined as a linear function. All linear functions can be written in the following form:
y = mx + b
For a specific linear function, m and b are constant values while x and y are variables.
y = 3x + 1 and y = 5 are both examples of linear functions.


When m is equal to 0, the line is completely flat and is parallel to the x-axis. 
When m and b are both set to 0, the line is equivalent to the x-axis.
The m value controls a line's slope while the b value controls a line's y-intercept. 
Another way to think about slope is rate of change. The rate of change is how much the y value changes for a specific change in the x value:

m = (y1 - y2) / (x1 - x2)

When x1 and x2 are equivalent, the slope value is undefined (this is because division by 0 has no meaning in mathematical calculations).
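
A short sketch of the slope calculation for the linear function y = 3x + 1 mentioned above. The helper name `slope` is hypothetical.

```python
def slope(x1, y1, x2, y2):
    """Slope of the line through the points (x1, y1) and (x2, y2)."""
    if x1 == x2:
        raise ZeroDivisionError("slope is undefined when x1 equals x2")
    return (y1 - y2) / (x1 - x2)

def f(x):
    return 3 * x + 1

m = slope(0, f(0), 4, f(4))
print(m)  # 3.0, the m value of y = 3x + 1
```

For a linear function, any pair of distinct points gives the same slope.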

## Non Linear Functions:

Whenever x is raised to a power not equal to 1 , we have a non-linear function. 
When we calculate the slope between 2 points on a curve, we're really calculating the slope of the line that intersects both of those points.



## Secant line:

A line that intersects 2 points on a curve. The slope of a curve at a specific point, x1, is best understood as the slope of the secant line over increasingly smaller intervals [x1, x2]. The smaller the difference between x1 and x2, the more precisely the secant line approximates the slope at that point on our curve.

## Instantaneous rate of change:

Describes the slope at a particular point. 
For linear functions, the instantaneous rate of change at any point on the line is the same. For nonlinear functions, the instantaneous rate of change describes the slope of the line that just touches the nonlinear function at a specific point.

This line is known as the tangent, and, unlike the secant line, it only intersects our function at one point near that location. So far, we've been working with secant lines that connect 2 points that are increasingly close together. You can think of the tangent line as what the secant line becomes as the two points move together.
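
To make this concrete, here is a small sketch using the nonlinear function y = x^2 (an assumed example, not taken from the notes). The secant slope from x1 = 1 is computed over shrinking intervals.

```python
def f(x):
    return x ** 2

def secant_slope(x1, x2):
    """Slope of the secant line through (x1, f(x1)) and (x2, f(x2))."""
    return (f(x2) - f(x1)) / (x2 - x1)

x1 = 1.0
for h in [1.0, 0.1, 0.01, 0.001]:
    # As h shrinks, the secant slope approaches the tangent slope at x1.
    print(h, secant_slope(x1, x1 + h))
```

The printed slopes move from 3 toward 2, which is the slope of the tangent to y = x^2 at x = 1.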


## Limits:
A limit describes the value a function approaches when the input variable to the function approaches a specific value.
Defined limits: whenever the resulting value of a limit is defined at the value the input variable approaches, we say that the limit is defined.



# Feature Engineering

- Remove columns that are all nulls
- Remove rows containing missing values for specific columns AND/OR
- Impute (or replace) missing values using a descriptive statistic from the column (NB: many people instead use a 50% cutoff: if half the values in a column are missing, the column is automatically dropped)
- Convert categorical features to numeric (if text). The feature transformation process is the same if the numbers used in those categories have no numerical meaning.
- Dummy code the numeric categorical features (so that the codes don't imply a numerical meaning). NB: drop the original source column once the dummy columns have been created.

## Finding Missing Values ##

In [None]:
train_null_counts = train.isnull().sum()
train_null_counts = train_null_counts[(train_null_counts >0) & (train_null_counts<584)]

df_missing_values = train[train_null_counts.index]

print(df_missing_values.isnull().sum())
print(df_missing_values.dtypes)

## Imputing Missing Values ##

In [None]:
# Only select float columns from the df with missing values
float_cols = df_missing_values.select_dtypes(include=['float'])

# Return a data frame with missing values replaced with mean of that column.
float_cols = float_cols.fillna(float_cols.mean())

## Getting Categorical Columns


In [None]:
text_cols = df_no_mv.select_dtypes(include=['object']).columns

#show how many categorical values there are for each column.
for col in text_cols:
    print(col+":", len(train[col].unique()))
    
    # Convert all of the text columns in train to the categorical data type.
    train[col] = train[col].astype('category')
    
train['Utilities'].cat.codes.value_counts()

## Convert the categorical variables:
This involves assigning a number to each category label, then converting all of the labels in a column to the corresponding numbers.

One strategy is to convert the columns to a categorical type. Under this approach, pandas will display the labels as strings, but internally store them as numbers so we can do computations with them. The numbers aren't always compatible with other libraries like Scikit-learn, though, so it's easier to just do the conversion to numeric upfront. We can use the pandas.Categorical class to perform the conversion to numbers:

In [None]:
import pandas

# income is assumed to be a DataFrame that has already been loaded.
for name in ["workclass","education", "marital_status", "occupation", "relationship", "race", "sex", "native_country", "high_income"]:
    col = pandas.Categorical(income[name])
    income[name] = col.codes

## Rescaling Data:

Within scikit-learn, the preprocessing.minmax_scale() function allows us to quickly and easily rescale our data:

In [None]:
from sklearn.preprocessing import minmax_scale

columns = ["column one", "column two"]
data[columns] = minmax_scale(data[columns])
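
By default, `minmax_scale()` rescales each column to the range 0 to 1 by computing (x - min) / (max - min). The sketch below reproduces that arithmetic with the standard library on hypothetical values; the helper name `min_max_rescale` and the handling of a constant column are choices made for this illustration.

```python
def min_max_rescale(values):
    """Rescale values to the range 0 to 1 using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; return zeros to avoid dividing by 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

fares = [0.0, 10.0, 25.0, 100.0]   # hypothetical column values
scaled = min_max_rescale(fares)
print(scaled)  # [0.0, 0.1, 0.25, 1.0]
```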



## Checking Coefficient of Columns:

In order to select the best-performing features, we need a way to measure which of our features are relevant to our outcome - in this case, the survival of each passenger. One effective way is by training a logistic regression model using all of our features, and then looking at the coefficients of each feature.

The scikit-learn LogisticRegression class has an attribute in which coefficients are stored after the model is fit, LogisticRegression.coef_. We first need to train our model, after which we can access this attribute.



In [None]:
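import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

columns = [...]  # feature column names (elided in these notes)

lr = LogisticRegression()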
lr.fit(train[columns], train["Survived"])
coefficients = lr.coef_

#To make these easier to interpret, we can convert the coefficients to a pandas series, adding the column names as the index:
feature_importance = pd.Series(coefficients[0],index=train[columns].columns)

feature_importance.plot.barh()
plt.show()

## Binning (Feature Engineering Method):

Binning is when you take a continuous feature, like the fare a passenger paid for their ticket, and separate it out into several ranges (or 'bins'), turning it into a categorical variable.

In [None]:
def process_fare(df, cut_points, label_names):
    df["Fare_categories"] = pd.cut(df["Fare"],cut_points,labels=label_names)
    return df

cut_points = [0,12,50,100,1000]
label_names = ['0-12','12-50','50-100','100+']
    
train = process_fare(train, cut_points, label_names)
holdout = process_fare(holdout, cut_points, label_names)
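
One detail worth checking when binning with `pd.cut()`: by default each bin excludes its left edge, so a value equal to the lowest cut point is left unbinned. The sketch below uses the same cut points and labels on a handful of hypothetical fares and shows the effect of the `include_lowest` parameter.

```python
import pandas as pd

fares = pd.Series([0.0, 7.25, 12.0, 71.28, 512.33])  # hypothetical fares
cut_points = [0, 12, 50, 100, 1000]
label_names = ['0-12', '12-50', '50-100', '100+']

# Default behaviour: bins are (0, 12], (12, 50], ... so a fare of exactly 0 is not binned.
default_bins = pd.cut(fares, cut_points, labels=label_names)

# include_lowest=True makes the first bin include its left edge.
inclusive_bins = pd.cut(fares, cut_points, labels=label_names, include_lowest=True)

print(default_bins.tolist())
print(inclusive_bins.tolist())
```

In the first result the fare of 0 comes back as a missing value, while a fare of exactly 12 falls in the '0-12' bin in both cases.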

## Remove ordered relationship:

Don't imply any numeric relationship where there isn't one. If we think of the values in the Pclass column, we know they are 1, 2, and 3.
Class 2 isn't "worth" double what class 1 is, and class 3 isn't "worth" triple what class 1 is. In order to remove this relationship, we can create dummy columns for each unique value in Pclass (Pclass_1, Pclass_2 and Pclass_3).

Rather than doing this manually, we can use the pandas.get_dummies() function, which will generate these columns.
The following code creates a function to create the dummy columns for the Pclass column and add them back to the original dataframe. It then applies that function to the train and test dataframes.

NOTE: remember to remove one of each of our dummy variables to reduce the collinearity in each…


In [None]:
def create_dummies(df,column_name):
    dummies = pd.get_dummies(df[column_name],prefix=column_name)
    df = pd.concat([df,dummies],axis=1)
    return df

train = create_dummies(train,"Pclass")
test = create_dummies(test,"Pclass")
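
One way to drop one dummy column per feature, as the note above suggests, is the `drop_first` parameter of `pd.get_dummies()`. This is a sketch on a tiny hypothetical dataframe, not a change to the function above.

```python
import pandas as pd

df = pd.DataFrame({"Pclass": [1, 2, 3, 3, 1]})  # hypothetical rows

all_dummies = pd.get_dummies(df["Pclass"], prefix="Pclass")
reduced_dummies = pd.get_dummies(df["Pclass"], prefix="Pclass", drop_first=True)

print(list(all_dummies.columns))      # ['Pclass_1', 'Pclass_2', 'Pclass_3']
print(list(reduced_dummies.columns))  # ['Pclass_2', 'Pclass_3']
```

No information is lost: a row with 0 in both Pclass_2 and Pclass_3 must belong to class 1.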

## Extracting data from text columns (Feature Engineering Method):

In [None]:
#First character from a column:
train["Cabin"].str[0]

# Extract titles from full name:

titles = {
    "Mme":         "Mrs",
    "Ms":          "Mrs",
    "Mrs" :        "Mrs",
    "Countess":    "Royalty",
    "Lady" :       "Royalty"
}

# use extract and a regex to get titles out of the full name, e.g. "Beesley, Mr. Lawrence"
# (a raw string keeps the backslash in the pattern from being treated as a string escape)
extracted_titles = train["Name"].str.extract(r' ([A-Za-z]+)\.', expand=False)

# map the extracted titles against the predefined dictionary
train["Title"] = extracted_titles.map(titles)
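
Here is the same extract-then-map pattern on a few names, so the intermediate values are visible. Apart from the Beesley example, the names are hypothetical. Note that any title missing from the dictionary maps to a missing value.

```python
import pandas as pd

names = pd.Series([
    "Beesley, Mr. Lawrence",
    "Smith, Mme. Jane",   # hypothetical name
    "Doe, Lady. Anne",    # hypothetical name
    "Roe, Dr. Sam",       # hypothetical name
])
titles = {"Mme": "Mrs", "Ms": "Mrs", "Mrs": "Mrs", "Countess": "Royalty", "Lady": "Royalty"}

# Capture the word that follows a space and is followed by a full stop.
extracted = names.str.extract(r' ([A-Za-z]+)\.', expand=False)
mapped = extracted.map(titles)

print(extracted.tolist())  # ['Mr', 'Mme', 'Lady', 'Dr']
print(mapped.tolist())     # 'Mr' and 'Dr' are not in the dictionary, so they become NaN
```

If unmapped titles should be kept, the dictionary needs an entry for each of them.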

## Choosing highly correlated Columns

In [None]:
## 2. Correlating Feature Columns With Target Column ##

train_subset = train[full_cols_series.index]

results = train_subset.corr()
print(results)
sorted_corrs = abs(results['SalePrice']).sort_values()


## 3. Correlation Matrix Heatmap ##

import seaborn as sns
import matplotlib.pyplot as plt

# get the columns with a strong correlation from our list
strong_corrs = sorted_corrs[sorted_corrs > 0.3]

# get the correlation of these columns with each other from the original dataset
corrmat = train_subset[strong_corrs.index].corr()
print(corrmat)

# visualise on a heatmap
sns.heatmap(corrmat)

## Collinearity:

Occurs where more than one feature contains data that are similar.

The effect of collinearity is that your model will overfit - you may get great results on your test data set, but then the model performs worse on unseen data (like the holdout set).
One easy way to understand collinearity is with a simple binary variable like the Sex column in our dataset. Every passenger in our data is categorized as either male or female, so 'not male' is exactly the same as 'female'.
As a result, when we created our two dummy columns from the categorical Sex column, we actually created two columns that carry exactly the same information: each one is fully determined by the other.

**To check for Collinearity:**

Use the DataFrame.corr() method to produce a correlation matrix, and then use the Seaborn library's seaborn.heatmap() function to plot the values against each other.
The darker squares, whether the darker red or darker blue, indicate pairs of columns that have higher correlation and may lead to collinearity.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
correlations = train.corr()
sns.heatmap(correlations)
plt.show()


## Nicer version
def plot_correlation_heatmap(df):

    corr = df.corr()
    
    sns.set(style="white")
    mask = np.zeros_like(corr, dtype=bool)  # the builtin bool works across NumPy versions
    mask[np.triu_indices_from(mask)] = True

    f, ax = plt.subplots(figsize=(11, 9))
    cmap = sns.diverging_palette(220, 10, as_cmap=True)
    
    sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
    square=True, linewidths=.5, cbar_kws={"shrink": .5})
    plt.show()


## Recursive feature elimination with cross-validation (feature selection method):

We will be using the feature_selection.RFECV class which performs recursive feature elimination with cross-validation.

The RFECV class starts by training a model using all of your features and scores it using cross validation. It then uses the logit coefficients to eliminate the least important feature, and trains and scores a new model. At the end, the class looks at all the scores, and selects the set of features which scored highest.



In [None]:
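from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# all_X (feature columns) and all_y (target column) are assumed to be defined already.
lr = LogisticRegression()
selector = RFECV(lr, cv=10)
selector.fit(all_X, all_y)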

# Once the RFECV object has been fit, we can use the RFECV.support_ attribute to access a boolean mask of True and False values 
# which we can use to generate a list of optimized columns:

optimized_columns = all_X.columns[selector.support_]


# Models
## Regression Models

Any machine learning model that helps us predict numerical values

## Classification Models

Where we're trying to predict a label from a fixed set of labels (e.g. blood type or gender).



## Parametric machine learning approaches
Parametric approaches include linear regression and logistic regression. Unlike the k-nearest neighbors algorithm, the result of the training process for these machine learning algorithms is a mathematical function that best approximates the patterns in the training set. In machine learning, this function is often referred to as a model.

## Linear Regression Model:

Linear regression is one of the most commonly used machine learning models. The linear regression algorithm works by calculating linear relationships between the features and the target variable and using those to make predictions.

Linear regression works well when the target column we're trying to predict, the dependent variable, is ordered and continuous. If the target column instead contains discrete values, then linear regression isn't a good fit.



In [None]:
from sklearn.linear_model import LinearRegression
import numpy as np
from sklearn.metrics import mean_squared_error


#remove the feature we found to have a variance of <0.015
features = features.drop('Open Porch SF')

target = 'SalePrice'

lr = LinearRegression()
lr.fit(train[features],train[target])

train_predictions = lr.predict(train[features])
test_predictions = lr.predict(clean_test[features])

train_mse = mean_squared_error(train_predictions, train[target])
test_mse = mean_squared_error(test_predictions, clean_test[target])

train_rmse_2 = np.sqrt(train_mse)
test_rmse_2 = np.sqrt(test_mse)

## Residual sum of squares 

To find the optimal parameters for a linear regression model, we want to minimize the model's residual sum of squares (or RSS). If you recall, a residual (often referred to as an error) describes the difference between the predicted value for the target column (ŷ) and the true value (y). We want this difference to be as small as possible.
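
Written out, RSS is the sum of (y - ŷ)^2 over every row. A minimal standard-library sketch with hypothetical values:

```python
actual = [3.0, 5.0, 7.0]
predicted = [2.5, 5.0, 8.0]   # hypothetical model output

residuals = [y - y_hat for y, y_hat in zip(actual, predicted)]
rss = sum(r ** 2 for r in residuals)
mse = rss / len(actual)       # MSE is the RSS divided by the number of rows

print(residuals)  # [0.5, 0.0, -1.0]
print(rss)        # 1.25
```

Because MSE is just RSS divided by a constant, the parameters that minimize one also minimize the other.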


## Logistic Regression

If the target column instead contains discrete values, then linear regression isn't a good fit and we have a classification problem. In classification, our target column has a finite set of possible values which represent different categories a row can belong to. We use integers to represent the different categories so we can continue to use mathematical functions to describe how the independent variables map to the dependent variable.

While a linear regression model outputs a real number as the label, a logistic regression model outputs a probability value. In binary classification, if the probability value is larger than a certain threshold probability, we assign the label for that row to 1, and to 0 otherwise.

Logistic regression is really just an adapted version of linear regression for classification problems. Both logistic and linear regression are used to capture linear relationships between the independent variables and the dependent variable.


To return the predicted probability, use the predict_proba method. The only required parameter for this method is the num_samples by num_features matrix of observations we want scikit-learn to return predicted probabilities for. For each input row, scikit-learn will return a NumPy array with 2 probability values:
* the probability that the row should be labelled 0,
* the probability that the row should be labelled 1.

Since 0 and 1 are the only 2 possible categories and represent the entire outcome space, these 2 probabilities will always add up to 1.

	probabilities = logistic_model.predict_proba(admissions[["gpa"]])
	# Probability that the row belongs to label `0`.
	probabilities[:,0]
	# Probability that the row belongs to label `1`.
	probabilities[:,1]

Use the predict method to return the label predictions for each row in our training dataset.
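
The relationship between `predict_proba` and `predict` can be checked on a small made-up dataset. The data below is hypothetical (one feature named gpa and a binary label); the sketch only illustrates the shapes and the thresholding, not a realistic model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: one feature (gpa) and a binary admit label.
gpa = np.array([[2.0], [2.4], [2.8], [3.0], [3.2], [3.5], [3.8], [4.0]])
admit = np.array([0, 0, 0, 0, 1, 1, 1, 1])

logistic_model = LogisticRegression()
logistic_model.fit(gpa, admit)

probabilities = logistic_model.predict_proba(gpa)   # shape: (num_samples, 2)
labels = logistic_model.predict(gpa)

# Applying a 0.5 threshold to the label-1 probability by hand gives the same labels.
manual_labels = (probabilities[:, 1] > 0.5).astype(int)

print(probabilities.shape)
print(probabilities.sum(axis=1))   # each row adds up to 1
```

Choosing a threshold other than 0.5 means working from the `predict_proba` output directly, as in the `manual_labels` line.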

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

columns = ['Age_categories_Missing', 'Age_categories_Infant',
       'Age_categories_Child', 'Age_categories_Teenager',
       'Age_categories_Young Adult', 'Age_categories_Adult',
       'Age_categories_Senior', 'Pclass_1', 'Pclass_2', 'Pclass_3',
       'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S',
       'SibSp_scaled', 'Parch_scaled', 'Fare_scaled']

lr = LogisticRegression()
lr.fit(train[columns], train["Survived"])
coefficients = lr.coef_
feature_importance = pd.Series(coefficients[0],index=train[columns].columns)

feature_importance.plot.barh()
plt.show()


## Random forests Model:

A random forest is a specific type of decision tree algorithm that combines the predictions of many decision trees. Decision tree algorithms attempt to build the most efficient decision tree based on the training data, and then use that tree to make future predictions.



## k-nearest neighbors Model:
 
The k-nearest neighbors algorithm finds the observations in our training set most similar to the observation in our test set, and uses the average outcome of those 'neighbor' observations to make a prediction. The 'k' is the number of neighbor observations used to make the prediction.  K-nearest neighbors is known as an instance-based learning algorithm because it relies completely on previous instances to make predictions. The k-nearest neighbors algorithm doesn't try to understand or capture the relationship between the feature columns and the target column. Because the entire training dataset is used to find a new instance's nearest neighbors to make label predictions, this algorithm doesn't scale well to medium and larger datasets

Finding the best k value by trying a range of candidates (a simple form of grid search):


In [None]:
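import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Mean 10-fold cross-validation accuracy for each odd k from 1 to 49.
knn_scores = dict()

for k in range(1,50,2):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn,all_X,all_y,cv=10)
    knn_scores[k] = np.mean(scores)

print(knn_scores)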

## Built-in Grid Search:

Grid search trains a number of models across a 'grid' of values and then searches for the model that gives the highest accuracy.

Scikit-learn has a class to perform grid search, model_selection.GridSearchCV(). The 'CV' in the name indicates that we're performing both grid search and cross validation at the same time.
By creating a dictionary of parameters and possible values and passing it to the GridSearchCV object you can automate the process. Here's what the k-nearest neighbors loop above would look like when implemented using the GridSearchCV class.



In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

# Candidate values to try for each hyperparameter.
hyperparameters = {
    "n_neighbors": range(1,50,2)
}
grid = GridSearchCV(knn, param_grid=hyperparameters, cv=10)
grid.fit(all_X, all_y)

print(grid.best_params_)
print(grid.best_score_)

# Get the best estimator, i.e. the one with the best params/score:
best_knn = grid.best_estimator_

# Running this code will produce the following output:
 
#   {'n_neighbors': 19}
#   0.82379349046


## Naive Bayes classification algorithm:

A Naive Bayes classifier works by figuring out how likely data attributes are to be associated with a certain class. Let's say we still have one classification -- whether or not you were tired. And let's say we have two data points -- whether or not you ran, and whether or not you woke up early. Bayes' theorem in its basic form doesn't cover this case, because we have two data points instead of just one.

This is where Naive Bayes can help. Naive Bayes extends Bayes' theorem to handle this case by assuming that each data point is independent.
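
A small worked sketch of that independence assumption, using hypothetical observations (the `days` list and the helper `naive_bayes_score` are made up for illustration): the score for each class is the class prior multiplied by the per-feature conditional probabilities, and the two scores are then normalised.

```python
# Hypothetical observations: (ran, woke_early, tired), each 0 or 1.
days = [
    (1, 1, 1), (1, 0, 1), (0, 1, 1), (1, 1, 1),
    (0, 0, 0), (0, 1, 0), (1, 0, 0), (0, 0, 0),
]

def naive_bayes_score(days, ran, woke_early, tired):
    """Unnormalised P(tired) * P(ran | tired) * P(woke_early | tired)."""
    in_class = [d for d in days if d[2] == tired]
    prior = len(in_class) / len(days)
    # Independence assumption: each feature's probability is estimated separately.
    p_ran = sum(1 for d in in_class if d[0] == ran) / len(in_class)
    p_woke = sum(1 for d in in_class if d[1] == woke_early) / len(in_class)
    return prior * p_ran * p_woke

score_tired = naive_bayes_score(days, ran=1, woke_early=1, tired=1)
score_rested = naive_bayes_score(days, ran=1, woke_early=1, tired=0)

# Normalise so the two class scores sum to 1.
p_tired = score_tired / (score_tired + score_rested)
print(round(p_tired, 3))  # 0.9
```

With this toy data, someone who ran and woke up early gets a 0.9 probability of being tired. Real implementations usually add smoothing so that a feature value never seen in a class doesn't force the whole product to zero.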


In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics

# Generate counts from text using a vectorizer  
# We can choose from other available vectorizers, and set many different options
# This code performs our step of computing word counts
vectorizer = CountVectorizer(stop_words='english', max_df=.05)
train_features = vectorizer.fit_transform([r[0] for r in reviews])
test_features = vectorizer.transform([r[0] for r in test])

# Fit a Naive Bayes model to the training data
# This will train the model using the word counts we computed and the existing classifications in the training set
nb = MultinomialNB()
nb.fit(train_features, [int(r[1]) for r in reviews])

# Now we can use the model to predict classifications for our test features
predictions = nb.predict(test_features)

# Compute the error
# It's slightly different from our model because the internals of this process work differently from our implementation
# Assumption: each row of `test` is [text, label], like the rows of `reviews`
actual = [int(r[1]) for r in test]
fpr, tpr, thresholds = metrics.roc_curve(actual, predictions, pos_label=1)

print("Multinomial naive bayes AUC: {0}".format(metrics.auc(fpr, tpr)))

## Natural Language Processing:
Tokens that only occur once don't add anything to the model's prediction power, and removing them will make our algorithm run much more quickly.

There are two kinds of features that will reduce prediction accuracy. Features that occur only a few times will cause overfitting, because the model doesn't have enough information to accurately decide whether they're important. These features will probably correlate differently with upvotes in the test set and the training set.

Features that occur too many times can also cause issues. These are words like "and" and "to", which occur in nearly every headline. These words don't add any information, because they don't necessarily correlate with upvotes. These types of words are sometimes called stopwords.
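
A minimal sketch of filtering out both kinds of features by total count, using hypothetical tokenized headlines and a hypothetical helper `filter_tokens` (the thresholds are illustrative only):

```python
from collections import Counter

# Hypothetical tokenized headlines.
headlines = [
    ["python", "and", "pandas"],
    ["python", "to", "rust"],
    ["rust", "and", "go"],
    ["go", "and", "to", "zig"],
    ["and", "to", "numpy"],
    ["to", "and", "back"],
]

def filter_tokens(token_lists, min_count, max_count):
    """Keep tokens whose total count lies within [min_count, max_count]."""
    totals = Counter(token for tokens in token_lists for token in tokens)
    return sorted(t for t, n in totals.items() if min_count <= n <= max_count)

# "pandas", "zig", "numpy" and "back" are too rare; "and" and "to" are too common.
kept = filter_tokens(headlines, min_count=2, max_count=3)
print(kept)  # ['go', 'python', 'rust']
```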



In [None]:
## 2. Overview of the Data ##

import pandas as pd
submissions = pd.read_csv("sel_hn_stories.csv")
submissions.columns = ["submission_time", "upvotes", "url", "headline"]
submissions = submissions.dropna()

## 3. Tokenizing the Headlines ##

tokenized_headlines = []

for headline in submissions['headline']:
    # Split each headline on spaces to get its list of tokens.
    bag = headline.split(" ")
    tokenized_headlines.append(bag)

print(tokenized_headlines[3])

## 4. Preprocessing Tokens to Increase Accuracy ##

punctuation = [",", ":", ";", ".", "'", '"', "’", "?", "/", "-", "+", "&", "(", ")"]
clean_tokenized = []

for bag in tokenized_headlines:   
    clean_tokens = []
    
    for token in bag:
        token = token.lower()
        
        for char in punctuation:
            token = token.replace(char,"")
            
        clean_tokens.append(token)
     
    clean_tokenized.append(clean_tokens)   

## 5. Assembling a Matrix of Unique Words ##

import numpy as np
from collections import Counter

unique_tokens = []
single_tokens = []

for bag in clean_tokenized:
    for token in bag:
        if token not in single_tokens:
            # first time we've found the word, remember that we've seen it
            single_tokens.append(token)
        elif token not in unique_tokens:
            # word occurs more than once, add it to the real list (only once)
            unique_tokens.append(token)
            
# Create a data frame with 0 counts against each of the unique words as columns:
counts = pd.DataFrame(0, index=np.arange(len(clean_tokenized)), columns=unique_tokens)

## 6. Counting Token Occurrences ##
for idx, bag in enumerate(clean_tokenized):
    for token in bag:
        if token in unique_tokens:
            # Use .loc[row, column]; chained indexing may update a copy instead of counts
            counts.loc[idx, token] += 1
        

## 7. Removing Columns to Increase Accuracy ##
# To reduce the number of features and enable the linear regression model to make better predictions, 
# we'll remove any words that occur fewer than 5 times or more than 100 times;
word_counts = counts.sum(axis=0)
counts = counts.loc[:,(word_counts >= 5) & (word_counts <= 100)]

## 8. Splitting the Data Into Train and Test Sets ##
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(counts, submissions["upvotes"], test_size=0.2, random_state=1)

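## 9. Making Predictions ##
# This step is missing from the notes; assumed here: fit the linear regression model mentioned in step 7.
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)
predictions = lr.predict(X_test)

## 10. Calculating Prediction Error ##
# Mean squared error between predicted and actual upvotes
mse = (((predictions-y_test)**2).sum())/len(predictions)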

## Decision Trees

We can use trees for classification or regression problems. The decision tree algorithm enables us to automatically construct a decision tree that tells us what outcomes we should predict in certain situations.
The decision tree algorithm is a supervised learning algorithm -- we first construct the tree with historical data, and then use it to predict an outcome. One of the major advantages of decision trees is that they can pick up nonlinear interactions between variables in the data that linear regression can't.

Before we get started with decision trees, we need to convert the categorical variables in our data set to numeric variables. This involves assigning a number to each category label, then converting all of the labels in a column to the corresponding numbers.

One strategy is to convert the columns to a categorical type. Under this approach, pandas will display the labels as strings, but internally store them as numbers so we can do computations with them. The categorical type isn't always compatible with other libraries like Scikit-learn, though, so it's easier to just do the conversion to numeric upfront. We can use the pandas.Categorical class to perform the conversion to numbers:


In [None]:
import pandas

# Replace each categorical column with its integer category codes.
for name in ["workclass","education", "marital_status", "occupation", "relationship", "race", "sex", "native_country", "high_income"]:
    col = pandas.Categorical(income[name])
    income[name] = col.codes
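
To see what the conversion produces, here is a small sketch on a hypothetical column (the `workclass` values below are made up for illustration). The categories are sorted, and each row's code is the position of its label in that sorted list.

```python
import pandas as pd

# Hypothetical column of category labels.
workclass = pd.Series(["Private", "State-gov", "Private", "Self-emp"])

col = pd.Categorical(workclass)
print(list(col.categories))  # ['Private', 'Self-emp', 'State-gov']
print(list(col.codes))       # [0, 2, 0, 1]
```

Note that missing values get the code -1, so it is worth checking for nulls before relying on the codes.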

We'll need to continue splitting nodes until we get to a point where all of the rows in a node have the same value for the target column.

We use the DecisionTreeClassifier class for classification problems, and DecisionTreeRegressor for regression problems. The sklearn.tree package includes both of these classes.
In this case, we're predicting a binary outcome, so we'll use a classifier.

When the tree is allowed to grow more complex, our test set accuracy decreases to .691 while our training set accuracy increases to .975, which is a sign of overfitting.
One way to prevent overfitting is to block the tree from growing beyond a certain depth (we tried this before). Another technique is called pruning. Pruning involves building a full tree, and then removing the leaves that don't add to prediction accuracy. Pruning prevents a model from becoming overly complex. It can result in a simpler model that has higher accuracy on the testing set.
Data scientists use pruning less often than parameter optimization (what we just did) and ensembling. It's still an important technique, though, and we'll cover it in more depth down the line.
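
A hedged sketch comparing the two techniques on synthetic, noisy data (a stand-in from `make_classification`, not the income data). Depth is limited with `max_depth`; pruning is done here with scikit-learn's cost-complexity pruning parameter `ccp_alpha`, which is one particular pruning method, and the value 0.01 is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy stand-in data (assumption: not the dataset from these notes).
X, y = make_classification(n_samples=600, n_features=10, flip_y=0.2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

models = {
    "full tree": DecisionTreeClassifier(random_state=1),
    "max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=1),
    "pruned (ccp_alpha=0.01)": DecisionTreeClassifier(ccp_alpha=0.01, random_state=1),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = {
        "leaves": model.get_n_leaves(),
        "train_acc": model.score(X_train, y_train),
        "test_acc": model.score(X_test, y_test),
    }
    print(name, results[name])
```

Comparing the gap between `train_acc` and `test_acc` for each model gives a rough picture of how much each one overfits; in practice both `max_depth` and `ccp_alpha` would be chosen with cross-validation.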

Let's go over the main advantages and disadvantages of using decision trees. The main advantages of decision trees are that they're:
* Easy to interpret
* Relatively fast to fit and make predictions
* Able to handle multiple types of data
* Able to pick up nonlinearities in data, and usually fairly accurate
The main disadvantage of using decision trees is their tendency to overfit.
Decision trees are a good choice for tasks where it's important to be able to interpret and convey why the algorithm is doing what it's doing.
The most powerful way to reduce decision tree overfitting is to create ensembles of trees. The random forest algorithm is a popular choice for doing this. In cases where prediction accuracy is the most important consideration, random forests usually perform better.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

clf = DecisionTreeClassifier(random_state=1)
clf.fit(train[columns], train["high_income"])

predictions = clf.predict(test[columns])

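# Compute ROC AUC on the test set (a score: higher is better)
test_auc = roc_auc_score(test['high_income'], predictions)
print(test_auc)

## 5. Computing Error on the Training Set ##
# A much higher AUC here than on the test set is a sign of overfitting.
train_predictions = clf.predict(train[columns])
print(roc_auc_score(train['high_income'], train_predictions))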

## Random Forests

A random forest is a kind of ensemble model. Ensembles combine the predictions of multiple models to create a more accurate final prediction.

The models are approaching the same problem in slightly different ways, and building different trees because we used different parameters for each one. Each tree makes different predictions in different areas. Even though both trees have about the same accuracy, when we combine them, the result is stronger because it leverages the strengths of both approaches. The more "diverse" or dissimilar the models we use to construct an ensemble are, the stronger their combined predictions will be (assuming that all of the models have about the same accuracy). Ensembling a decision tree and a logistic regression model, for example, will result in stronger predictions than ensembling two decision trees with similar parameters. That's because those two models use very different approaches to arrive at their answers.

A random forest is an ensemble of decision trees. If we don't make any modifications to the trees, each tree will be exactly the same, so we'll get no boost when we ensemble them. In order to make ensembling effective, we have to introduce variation into each individual decision tree model.
If we introduce variation, each tree will be constructed slightly differently, and will therefore make different predictions. This variation is what puts the "random" in "random forest."
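
One way to picture that variation is to build a tiny forest by hand. The sketch below assumes the two sources of randomness commonly used by random forests: bagging (each tree is trained on rows sampled with replacement) and random feature subsets (each split considers only some columns, via `max_features`). The data is synthetic, and this is not how `RandomForestClassifier` is implemented internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the dataset from these notes).
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

rng = np.random.default_rng(1)
trees = []
for i in range(10):
    # Bagging: sample row indices with replacement, so each tree sees different data.
    rows = rng.integers(0, len(X), size=len(X))
    # Random feature subsets: each split only considers sqrt(n_features) columns.
    tree = DecisionTreeClassifier(max_features="sqrt", min_samples_leaf=2, random_state=i)
    tree.fit(X[rows], y[rows])
    trees.append(tree)

# Combine the trees by averaging their 0/1 votes and rounding to a label.
votes = np.mean([tree.predict(X) for tree in trees], axis=0)
ensemble_predictions = (votes >= 0.5).astype(int)
print(ensemble_predictions[:10])
```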

Similar to decision trees, we can tweak some of the parameters for random forests, including:
* min_samples_leaf
* min_samples_split
* max_depth
* max_leaf_nodes
These parameters apply to the individual trees in the model, and change how they are constructed. There are also parameters specific to the random forest that alter its overall construction:
* n_estimators
* bootstrap - "Bootstrap aggregation" is another name for bagging; this parameter indicates whether to turn it on (defaults to True)
Refer to the Scikit-learn documentation for a full list of parameters.
Tweaking parameters can increase the accuracy of the forest. The easiest tweak is to increase the number of estimators we use. This approach yields diminishing returns -- going from 10 trees to 100 will make a bigger difference than going from 100 to 500, which will make a bigger difference than going from 500 to 1000. Accuracy grows roughly logarithmically with the number of trees, so increasing the number of trees beyond a certain point (usually around 200) won't help much at all.
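
A quick way to check this on a given dataset is to score the forest at a few values of n_estimators. The sketch below uses synthetic data and arbitrary tree counts, so the exact numbers it prints say nothing about the income data; it only shows the shape of the experiment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: not the dataset from these notes).
X, y = make_classification(n_samples=400, n_features=10, flip_y=0.1, random_state=1)

scores = {}
for n in [5, 25, 100]:
    clf = RandomForestClassifier(n_estimators=n, min_samples_leaf=2, random_state=1)
    # Mean 3-fold cross-validation accuracy for this number of trees.
    scores[n] = cross_val_score(clf, X, y, cv=3).mean()

for n, score in scores.items():
    print(n, round(score, 3))
```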


One of the major advantages of random forests over single decision trees is that they tend to overfit less. The average of 100 or more trees will be more likely to home in on the signal and ignore the noise. The signal will be the same across all of the trees, whereas each tree will pick up the noise differently. This means that the average will discard the noise and keep the signal.

The main weaknesses of using a random forest are:
* They're difficult to interpret - Because we're averaging the results of many trees, it can be hard to figure out why a random forest is making predictions the way it is.
* They take longer to create - Making two trees takes twice as long as making one, making three takes three times as long, and so on. Fortunately, we can exploit multicore processors to parallelize tree construction. Scikit-learn allows us to do this through the n_jobs parameter on RandomForestClassifier. We'll discuss parallelization in greater detail later on.
Given these trade-offs, it makes sense to use random forests in situations where accuracy is of the utmost importance and being able to interpret or explain the decisions the model is making isn't key. In cases where time is of the essence or interpretability is important, a single decision tree may be a better choice.

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=150, random_state=1, min_samples_leaf=2)
clf.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])

print(roc_auc_score(test["high_income"], predictions))