## GroupActWeek14

## 1. Using the documentation for Recursive Feature Elimination (RFE), apply this process to the crime dataset to create the best multivariate linear regression model (https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html). You can select what you’re trying to predict, but be sure to indicate what that is. Be sure to explain what RFE is in the markdown. You should be able to answer this using what’s on the documentation page + what you already know.

### Recursive Feature Elimination (RFE)

  * Extracting valuable features from a dataset is an essential part of preparing data to train a machine learning model: if we give the model good inputs, it can give back good outputs. Otherwise, remember the famous saying, **<font color='blue'>Garbage In, Garbage Out</font>**.
    
    
  * The scikit-learn API provides the `RFE` class, which ranks features by recursive feature elimination in order to select the best ones. The method recursively eliminates the least important features, based on an importance attribute exposed by the estimator (`coef_` or `feature_importances_`).
    
**Why Estimator?** 
    
  * The `RFE` class needs an estimator, for example a linear model or a decision tree model. Linear models expose coefficients and decision tree models expose feature importances. The estimator is trained, the features are ranked by their coefficients or feature importances, and the least important features are removed. This process is repeated recursively on the remaining features until the requested number of features (`n_features_to_select`) is left.
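
The elimination loop described above can be sketched by hand. The snippet below is a minimal illustration, not scikit-learn's actual implementation: it uses plain NumPy least squares as the estimator, scores each feature by its squared coefficient, and drops one feature per round. The function name `manual_rfe` and the synthetic data are assumptions made for this example.

```python
import numpy as np

def manual_rfe(X, y, n_features_to_select):
    """Backward elimination driven by least-squares coefficient magnitudes."""
    remaining = list(range(X.shape[1]))   # indices of features still in play
    eliminated = []                       # indices in the order they were dropped
    while len(remaining) > n_features_to_select:
        # Fit ordinary least squares (with an intercept column) on the remaining features.
        A = np.column_stack([X[:, remaining], np.ones(len(X))])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        # Score each feature by its squared coefficient and drop the weakest one.
        weakest = int(np.argmin(coef[:-1] ** 2))
        eliminated.append(remaining.pop(weakest))
    return remaining, eliminated

# Synthetic data (an assumption for illustration): only features 0, 2 and 4 drive y.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
y_demo = (5 * X_demo[:, 0] + 3 * X_demo[:, 2] - 4 * X_demo[:, 4]
          + 0.1 * rng.normal(size=200))

selected, dropped = manual_rfe(X_demo, y_demo, n_features_to_select=3)
print("Selected feature indices:", selected)
print("Elimination order:", dropped)
```

With a signal this strong, the three informative features should survive the elimination, which is the behaviour the `RFE` class automates for any estimator that exposes coefficients or feature importances.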

In [34]:
import pandas as pd
import numpy as np
from numpy import array

The data (X1, X2, X3, X4, X5, X6, X7) are for each city.

X1 = total overall reported crime rate per 1 million residents

X2 = reported violent crime rate per 100,000 residents

X3 = annual police funding in $/resident

X4 = % of people 25 years+ with 4 yrs. of high school

X5 = % of 16 to 19 year-olds not in high school and not high school graduates

X6 = % of 18 to 24 year-olds in college

X7 = % of people 25 years+ with at least 4 years of college

Reference: Life In America's Small Cities, By G.S. Thomas

In [14]:
crime_df=pd.read_csv('crime_data.csv')
crime_df.head()

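|   | X1 | X2 | X3 | X4 | X5 | X6 | X7 |
|---|----|----|----|----|----|----|----|
| 0 | 478 | 184 | 40 | 74 | 11 | 31 | 20 |
| 1 | 494 | 213 | 32 | 72 | 11 | 43 | 18 |
| 2 | 643 | 347 | 57 | 70 | 18 | 16 | 16 |
| 3 | 341 | 565 | 31 | 71 | 11 | 25 | 19 |
| 4 | 773 | 327 | 67 | 72 | 9 | 29 | 24 |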


### Goal:

**Our dataset has 7 columns. Our goal is to reduce the inputs by selecting the 3 best-ranked columns to fit the model.** For this dataset the column we predict is X3 (annual police funding in $/resident), and all columns other than X3 are the candidate inputs.

In [54]:
X = crime_df.drop('X3', axis=1)
y = crime_df['X3']

# Split the data (this split is not used below: the selector is fit on the full X and y)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=15)

# RFE needs an estimator
from sklearn.linear_model import LinearRegression
estimator = LinearRegression()

# With the estimator in place, define the selector
from sklearn.feature_selection import RFE
selector = RFE(estimator, n_features_to_select=3)

# Fit the selector to the dataset
selector.fit(X, y)

# After fitting, we can read off the selected features and their ranks
print(selector.support_)
print('Ranking : ', selector.ranking_)


# To make this readable, print the feature names
features = array(crime_df.columns.drop('X3'))
print("All features: ")
print(features)

print("Selected features: ")
print(features[selector.support_])
  

[False False False  True  True  True]
Ranking :  [3 4 2 1 1 1]
All features: 
['X1' 'X2' 'X4' 'X5' 'X6' 'X7']
Selected features: 
['X5' 'X6' 'X7']
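
Because this selector ranks features by the size of the linear regression coefficients, the ranking can depend on the units of each column: a feature measured in large units gets a small coefficient even when it matters. The sketch below illustrates this on synthetic data (not the crime dataset; the scales and coefficients are made up for the example) by running `RFE` once on raw features and once after `StandardScaler`. Whether scaling changes the selection for the crime data was not checked here.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed for illustration): features 0 and 1 live on large scales
# and contribute most to y, but their raw coefficients are tiny.
rng = np.random.default_rng(0)
n = 2000
scales = np.array([1000.0, 100.0, 1.0, 1.0])
true_coef = np.array([0.003, 0.02, 1.0, 0.8])
X_raw = rng.normal(size=(n, 4)) * scales
y_syn = X_raw @ true_coef + 0.1 * rng.normal(size=n)

# RFE on the raw features: ranking follows raw coefficient size.
raw_selector = RFE(LinearRegression(), n_features_to_select=2).fit(X_raw, y_syn)

# RFE after standardizing: coefficients are now comparable across features.
X_scaled = StandardScaler().fit_transform(X_raw)
scaled_selector = RFE(LinearRegression(), n_features_to_select=2).fit(X_scaled, y_syn)

print("Selected on raw features:   ", raw_selector.support_)
print("Selected on scaled features:", scaled_selector.support_)
```

In this constructed case the two runs pick different feature pairs, which suggests standardizing the inputs before a coefficient-based RFE is worth trying.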


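### R² Score of the Multivariate Linear Regression on the RFE-Selected Columns

In [77]:
# Now we can fit our multivariate linear regression model on the selected columns X5, X6, X7.
# Note: regressor.score returns R² (the coefficient of determination), printed below as "Accuracy score".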
X=crime_df[['X5','X6','X7']]
y=crime_df['X3']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=77)
regressor=LinearRegression()
regressor.fit(X_train,y_train)
y_pred= regressor.predict(X_test)
print("Accuracy score: ", regressor.score(X_test,y_test))
print("Coefficients are :",regressor.coef_)

Accuracy score:  0.2812320383946695
Coefficients are : [ 0.62645727 -0.25158003  0.99308483]
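
The value returned by `LinearRegression.score` is the coefficient of determination, `R² = 1 - SS_res / SS_tot`, rather than a classification accuracy. As a small check on synthetic data (assumed for illustration, with made-up coefficients), the formula can be computed by hand and compared with `score`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression data (an assumption for this example).
rng = np.random.default_rng(1)
X_syn = rng.normal(size=(100, 3))
y_syn = X_syn @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

# Train on the first 70 rows, evaluate on the last 30.
model = LinearRegression().fit(X_syn[:70], y_syn[:70])
X_te, y_te = X_syn[70:], y_syn[70:]
y_hat = model.predict(X_te)

ss_res = np.sum((y_te - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_te - y_te.mean()) ** 2)    # total sum of squares around the mean
r2_manual = 1 - ss_res / ss_tot

print("Manual R²:", r2_manual)
print("model.score:", model.score(X_te, y_te))
```

An R² of 1 means perfect predictions, 0 means no better than predicting the mean of `y`, and negative values are possible on held-out data.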


### R² Score of the Multivariate Linear Regression on the Columns RFE Did Not Select

In [78]:
# Same model and split, fit on the columns that RFE did not select (X1, X2, X4).
X=crime_df[['X1','X2','X4']]
y=crime_df['X3']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=77)
regressor=LinearRegression()
regressor.fit(X_train,y_train)
y_pred= regressor.predict(X_test)
print("Accuracy score: ", regressor.score(X_test,y_test))
print("Coefficients are :",regressor.coef_)

Accuracy score:  0.13581107251140456
Coefficients are : [ 0.02298463 -0.00429206  0.13660579]


### Observation:

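The R² score of the model trained on the RFE-selected columns (about 0.28) is greater than the score of the model trained on the remaining columns (about 0.14) for the same train/test split. Both scores are low, so neither model explains much of the variation in police funding, but on this split the RFE-selected columns give the better model. So we can use Recursive Feature Elimination to help select the features for our best model.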
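
A single train/test split on a small dataset can be noisy, and the choice of three features was fixed in advance. One possible follow-up, sketched here on synthetic data rather than the crime dataset, is scikit-learn's `RFECV`, which runs RFE inside cross-validation and picks the number of features with the best average score. The data, the 5-fold setting, and the variable names are assumptions for this example.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed for illustration): only features 0, 2 and 4 drive y.
rng = np.random.default_rng(2)
X_syn = rng.normal(size=(200, 6))
y_syn = (5 * X_syn[:, 0] + 3 * X_syn[:, 2] - 4 * X_syn[:, 4]
         + rng.normal(size=200))

# Let 5-fold cross-validated R² decide how many features to keep.
cv_selector = RFECV(LinearRegression(), step=1, cv=5, scoring="r2")
cv_selector.fit(X_syn, y_syn)

print("Number of features chosen:", cv_selector.n_features_)
print("Selected feature indices:", np.flatnonzero(cv_selector.support_))
```

On the crime data the same pattern would apply with `X` and `y` as defined earlier, though with so few rows the cross-validated scores may still vary considerably between folds.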

## 2. Create a list of preprocessing steps you should try when working to build a model. Briefly describe what each step is. Work with your group to come up with the most comprehensive list you can.

Preprocessing Steps:

**1. Data cleaning:**

  * Cleaning and handling NaNs: The process of removing incorrect or incomplete data and replacing missing values. A common strategy for replacing NaNs is to use the mean value of that column.

**2. Data reduction:**
This process reduces the volume of the data, which makes the analysis easier while producing the same or almost the same result.

  * Dimensionality reduction
  * Removing redundant features

**3. Data transformation:**
A change made to the format or the structure of the data is called data transformation.

  * Encoding: Machine learning models use mathematical equations, so categorical data is not accepted directly; we convert it into numerical form.
  * Standardization/Normalization: Rescale the input data to a common scale, for example zero mean and unit variance (standardization) or a fixed range such as 0 to 1 (normalization).
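
Several of the steps above can be chained in scikit-learn. The following is a minimal sketch on a made-up DataFrame (the column names `income`, `age`, and `region` are hypothetical): it imputes NaNs with the column mean, standardizes the numeric columns, reduces them with PCA, and one-hot encodes the categorical column.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns (one with a NaN) and one categorical column.
toy_df = pd.DataFrame({
    "income": [40.0, 55.0, np.nan, 70.0, 62.0],
    "age": [25.0, 32.0, 47.0, 51.0, 38.0],
    "region": ["north", "south", "south", "west", "north"],
})

numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # data cleaning: fill NaNs with the column mean
    ("scale", StandardScaler()),                  # transformation: zero mean, unit variance
    ("reduce", PCA(n_components=1)),              # reduction: 2 numeric columns -> 1 component
])

preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, ["income", "age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),  # encoding
    ],
    sparse_threshold=0,  # always return a dense array
)

features = preprocess.fit_transform(toy_df)
print(features.shape)  # one PCA column plus three one-hot columns
```

Fitting the preprocessing on the training split only, and then applying it to the test split with `transform`, keeps information from the test data out of the model.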