##### 1.  Using the documentation for Recursive Feature Selection, apply this process to the crime dataset to create the best multivariate linear regression model https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html.  You can select what you’re trying to predict, but be sure to indicate what that is. Be sure to explain what RFE is in the markdown. You should be able to answer this using what’s on the documentation page + what you already know.

RFE, or Recursive Feature Elimination, is a feature-selection process that starts with all of the features in a training dataset, fits an estimator, ranks the features by importance (for example, by the size of their coefficients), and removes the least important one.  It then refits on the remaining features and repeats until the requested number of features is left.  The purpose of this cycle of fitting, eliminating, and refitting is to find a small set of features that best predicts the target, here for a multivariate linear regression model.  In this notebook the target being predicted is X1.
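To make the elimination loop concrete, here is a minimal sketch on synthetic data (an assumption, not the crime dataset) where only two of six features actually drive the target. The names `X_demo`, `y_demo`, and `demo_selector` are made up for illustration, and `LinearRegression` is used as the estimator only because it matches the final model type.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): 6 features, only features 0 and 2 drive the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(scale=0.1, size=200)

# RFE fits the estimator, drops the feature with the smallest coefficient
# magnitude, and repeats until n_features_to_select features remain.
demo_selector = RFE(LinearRegression(), n_features_to_select=2, step=1)
demo_selector.fit(X_demo, y_demo)

selected = np.flatnonzero(demo_selector.support_)
print(selected)                # indices of the features that were kept
print(demo_selector.ranking_)  # 1 = kept; larger numbers were eliminated earlier
```

Because the ranking is based on coefficient size, features on very different scales can be ranked unfairly, so scaling the features first is often worth trying.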

In [98]:
# Initial Set-Up
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_selection import RFE
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Import and view crime dataset
crime_df = pd.read_csv("../wk_13_hmwk/crime_data.csv")
crime_df.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7
0,478,184,40,74,11,31,20
1,494,213,32,72,11,43,18
2,643,347,57,70,18,16,16
3,341,565,31,71,11,25,19
4,773,327,67,72,9,29,24


In [79]:
# Run RFE to find the best 3 variables for a multi-variable linear regression
X = crime_df[['X2', 'X3', 'X4', 'X5', 'X6', 'X7']]
y = crime_df['X1']
estimator = SVR(kernel="linear")
selector = RFE(estimator, step=1)  # Since number of features to select is not designated, half of the features will be selected
selector = selector.fit(X, y)
# Feature rankings, in the column order X2..X7: 1 marks a selected feature; higher numbers were eliminated earlier.
selector.ranking_

array([4, 1, 1, 1, 2, 3])

In [80]:
# Define X using results of RFE with 3 features selected
X = crime_df[['X3', 'X4','X5']]
y = crime_df['X1']

# Separate data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=32)
regression = LinearRegression()
regression.fit(X_train, y_train)
y_pred = regression.predict(X_test)
r2 = regression.score(X_test, y_test)  # score() returns R^2 for a regressor, not classification accuracy
r2
# An R^2 of about 0.29 is weak, although this is not the only measure of the effectiveness of the model.  I will
# examine further...

0.2863700478358361

In [81]:
# Run RFE to find the best TWO variables for the multi-variable linear regression
X = crime_df[['X2', 'X3', 'X4', 'X5', 'X6', 'X7']]
y = crime_df['X1']
estimator = SVR(kernel="linear")
selector = RFE(estimator, n_features_to_select=2, step=1)
selector = selector.fit(X, y)
selector.ranking_

array([5, 1, 2, 1, 3, 4])

In [101]:
# Define X using results of RFE with 2 features selected
X = crime_df[['X3', 'X5']]
y = crime_df['X1']

# Separate data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=32)
regression = LinearRegression()
regression.fit(X_train, y_train)
y_pred = regression.predict(X_test)
r2 = regression.score(X_test, y_test)  # score() returns R^2 for a regressor, not classification accuracy
r2
# R^2 is noticeably higher (about 0.49 vs. 0.29) when only two features are selected, so I will use these two
# features for the model instead of the previous results.

0.49365232885912835
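Comparing two feature counts on a single train/test split can be noisy, especially on a small dataset. One possible alternative, sketched here under assumptions, is `RFECV`, which picks the number of features by cross-validated R^2. The helper name `choose_features_cv` and the stand-in data are hypothetical; with the notebook's data, `X` would be `crime_df[['X2', 'X3', 'X4', 'X5', 'X6', 'X7']]` and `y` would be `crime_df['X1']`.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def choose_features_cv(X, y, n_splits=5, seed=32):
    """Hypothetical helper: choose the feature count by cross-validated R^2."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    selector = RFECV(LinearRegression(), step=1, cv=cv, scoring="r2")
    selector.fit(X, y)
    return selector

# Stand-in data (assumption): only features 1 and 3 influence the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 6))
y_demo = 4.0 * X_demo[:, 1] + 2.0 * X_demo[:, 3] + rng.normal(scale=0.5, size=60)

cv_selector = choose_features_cv(X_demo, y_demo)
print(cv_selector.n_features_)  # feature count with the best mean CV score
print(cv_selector.support_)     # boolean mask of the kept features
```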

In [102]:
regression.coef_

array([9.41770881, 6.27333622])

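Based on the results here, the best regression model found to predict X1 in the crime dataset is: X1 = b0 + 9.42 * X3 + 6.27 * X5, where b0 is the fitted intercept (`regression.intercept_`, which is not displayed above).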
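A fitted `LinearRegression` stores an intercept alongside its coefficients, and both are needed to write out the prediction equation. The sketch below shows one way to assemble the full equation; `format_equation` is a hypothetical helper, and the stand-in data is built from a known equation purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def format_equation(model, feature_names, target="y"):
    """Hypothetical helper: render a fitted LinearRegression as a string."""
    terms = [f"{coef:.2f} * {name}" for coef, name in zip(model.coef_, feature_names)]
    return f"{target} = {model.intercept_:.2f} + " + " + ".join(terms)

# Stand-in data (assumption) generated from y = 10 + 2*a + 3*b with no noise.
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y_demo = 10 + 2 * X_demo[:, 0] + 3 * X_demo[:, 1]

demo_model = LinearRegression().fit(X_demo, y_demo)
equation = format_equation(demo_model, ["a", "b"])
print(equation)
```

With the notebook's fitted model, the equivalent call would be `format_equation(regression, ['X3', 'X5'], target='X1')`.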

##### 2. Create a list of preprocessing steps you should try when working to build a model. Briefly describe what each step is. Work with your group to come up with the most comprehensive list you can.  

#### PRE-PROCESSING:
    
1.  Drop columns that are largely or entirely comprised of null values, since they do not contribute significantly to the data set.


2.  Handle missing data using one or more of the following strategies: 
    
    a.  Fill in numerical data with the average value for that feature (column)
    
    b.  Eliminate observations (rows) that have null values (as long as they make up only a small percentage of the entire dataset).
    
    c.  Fill in categorical data with the most commonly occurring value.  
    
    d.  Develop a model to predict the missing values (such as a logistic regression model or KNN to impute missing values in categorical data).
    

3.  Normalize/Standardize the data.  Use tools like StandardScaler to put all of the features in the dataset on a common scale, standardizing them so that each has a mean of 0 and a standard deviation of 1.  Fit the scaler on the training set only, then apply it to the test set.


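4.  Use techniques such as one-hot encoding or LabelEncoder to convert categorical data to numerical.  (In scikit-learn, LabelEncoder is intended for the target column; OrdinalEncoder is the equivalent for feature columns.)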


5.  Create a correlation matrix and/or heatmap to identify which features in the dataset are strongly correlated with each other and with the target variable.


6.  Use the results of the heatmap along with methods such as Recursive Feature Elimination to select the features that are most likely to predict the target variable.
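
Several of the steps above can be chained together in scikit-learn. The following is a minimal sketch on a toy DataFrame (an assumption; the column names `age`, `income`, `city`, and `mostly_null` are made up), covering dropping mostly-null columns, imputing, scaling, one-hot encoding, and a correlation matrix. The 50% missing-value cutoff is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame (assumption): two numeric columns, one categorical, one mostly null.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [50.0, 62.0, np.nan, 48.0],
    "city": ["A", "B", np.nan, "A"],
    "mostly_null": [np.nan, np.nan, np.nan, 1.0],
})

# Step 1: drop columns where more than half of the values are missing.
df = df.loc[:, df.isna().mean() <= 0.5]

numeric_cols = ["age", "income"]
categorical_cols = ["city"]

# Steps 2-4: impute, scale numeric columns, one-hot encode categorical columns.
preprocess = ColumnTransformer(
    [
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="mean")),
            ("scale", StandardScaler()),
        ]), numeric_cols),
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("encode", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

features = preprocess.fit_transform(df)
print(features.shape)

# Step 5: correlation matrix of the numeric columns (NaNs are skipped pairwise).
print(df[numeric_cols].corr())
```

In a real workflow the transformer would be fit on the training split only and then applied to the test split, so that no information from the test data leaks into the imputed values or the scaling.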