# Question 1

Using the documentation for Recursive Feature Elimination (RFE), apply this process to the
crime dataset to create the best model:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html
Since this dataset is so small, you do not need to perform a train-test split. You can
select what you’re trying to predict. Be sure to explain what RFE is in the markdown.


In [1]:
import pandas as pd
import numpy as np

crime_df = pd.read_csv('../week_13/crime_data.csv')
crime_df.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7
0,478,184,40,74,11,31,20
1,494,213,32,72,11,43,18
2,643,347,57,70,18,16,16
3,341,565,31,71,11,25,19
4,773,327,67,72,9,29,24


RFE stands for recursive feature elimination. It is a feature selection method: it chooses which features the model should use for prediction. RFE fits the estimator on the full set of features, obtains an importance for each feature (for example, from the model's coefficients), removes the least important feature(s), and then repeats on the smaller set until the requested number of features remains.
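To make the elimination loop concrete, here is a minimal sketch of the idea using plain NumPy and a least-squares fit. It is not scikit-learn's implementation: the function name `manual_rfe`, the synthetic data, and the use of squared coefficients as the importance are all assumptions for illustration, and coefficient size is only a fair importance measure when the features are on comparable scales.

```python
import numpy as np

def manual_rfe(X, y, n_features_to_select):
    """Illustrative recursive feature elimination with a least-squares fit."""
    remaining = list(range(X.shape[1]))  # column indices still in the model
    eliminated = []                      # column indices in the order they were dropped
    while len(remaining) > n_features_to_select:
        # Fit a linear model (with an intercept column) on the remaining features.
        A = np.column_stack([X[:, remaining], np.ones(len(X))])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        # Treat the squared coefficient as the importance and drop the weakest feature.
        weakest = int(np.argmin(coef[:-1] ** 2))
        eliminated.append(remaining.pop(weakest))
    return remaining, eliminated

# Synthetic data (not the crime dataset): only columns 0 and 2 drive y.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(scale=0.1, size=200)

kept, dropped = manual_rfe(X_demo, y_demo, n_features_to_select=2)
print("kept:", kept, "dropped in order:", dropped)
```

On this synthetic data the two informative columns are the ones that survive, which is the behavior RFE is designed to produce.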

Each row of the data describes one city. The variables (X1 through X7) are:

X1 = total overall reported crime rate per 1 million residents

X2 = reported violent crime rate per 100,000 residents

X3 = annual police funding in $/resident

X4 = % of people 25 years+ with 4 years of high school

X5 = % of 16 to 19 year-olds not in high school and not high school graduates

X6 = % of 18 to 24 year-olds in college

X7 = % of people 25 years+ with at least 4 years of college

Reference: Life In America's Small Cities, By G.S. Thomas

In [17]:
crime_df.columns = ['crime_rate_mil','violent_crimes_per100k','police_funding', '%25+4yr_highschool', '%16to19_notinschool', '%18to24in_college', '%25+with4yr_college']

In [18]:
crime_df.head()

Unnamed: 0,crime_rate_mil,violent_crimes_per100k,police_funding,%25+4yr_highschool,%16to19_notinschool,%18to24in_college,%25+with4yr_college
0,478,184,40,74,11,31,20
1,494,213,32,72,11,43,18
2,643,347,57,70,18,16,16
3,341,565,31,71,11,25,19
4,773,327,67,72,9,29,24


In [93]:
crime_df.shape

(50, 7)

In [88]:
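# Import RFE and the estimator it wraps
from sklearn.feature_selection import RFE
from sklearn.svm import SVR

# Predict the overall crime rate (X1) from the other six columns
X = np.array(crime_df.drop('crime_rate_mil', axis=1))
y = np.array(crime_df['crime_rate_mil'])
estimator = SVR(kernel="linear")
# n_features_to_select=6 keeps all six features, so nothing is eliminated here
selector = RFE(estimator, n_features_to_select=6, step=1)
selector = selector.fit(X, y)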

In [89]:
selector.support_

array([ True,  True,  True,  True,  True,  True])

In [90]:
selector.ranking_

array([1, 1, 1, 1, 1, 1])

In [99]:
# Feature values from row 1 of the data (actual crime rate: 494), passed as numbers rather than strings
y_pred = selector.predict(np.array([213, 32, 72, 11, 43, 18]).reshape(1, -1))
print(y_pred)

[507.89552177]


In [100]:
# Feature values from row 3 of the data (actual crime rate: 341)
y_pred = selector.predict(np.array([565, 31, 71, 11, 25, 19]).reshape(1, -1))
print(y_pred)

[575.14821896]


In [80]:
# score() on a regressor returns R^2, not a classification accuracy
r2 = selector.score(X, y)

In [81]:
print(r2)

0.5841293980921582
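For a regressor, `score` returns the coefficient of determination R², not a classification accuracy. As a rough illustration of what that number means, the sketch below computes R² by hand from its definition, 1 - SS_res / SS_tot. The helper name `r_squared` and the toy values are made up for this example and are not from the crime data.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left over after prediction
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

y_toy = [3.0, 5.0, 7.0, 9.0]
print(r_squared(y_toy, y_toy))      # perfect predictions give 1.0
print(r_squared(y_toy, [6.0] * 4))  # always predicting the mean (6.0) gives 0.0
```

Read this way, a score of about 0.58 means the model explains a bit more than half of the variance in the crime rate on the data it was fit to.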


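I tested this model by changing n_features_to_select from 1 to 6 and got the best score when all 6 features are used. Note that `score` returns R² (not accuracy) for a regressor, and because there is no train-test split, it is measured on the same data the model was fit on.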
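The comparison across feature counts could be scripted roughly as follows. This is a sketch under assumptions: it uses synthetic data as a stand-in for the crime data (the file is not reproduced here), and the names `X_demo`, `y_demo`, `scores`, and `best_k` are illustrative only.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVR

# Synthetic stand-in with the same shape as the crime features: 50 rows, 6 columns.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 6))
y_demo = 4.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=50)

scores = {}
for k in range(1, X_demo.shape[1] + 1):
    # Refit RFE from scratch for each requested number of features.
    selector_k = RFE(SVR(kernel="linear"), n_features_to_select=k, step=1)
    selector_k.fit(X_demo, y_demo)
    # R^2 on the same data used for fitting (no train-test split).
    scores[k] = selector_k.score(X_demo, y_demo)
    print(k, selector_k.support_, round(scores[k], 3))

best_k = max(scores, key=scores.get)
print("best number of features:", best_k)
```

With a larger dataset, scoring each candidate on held-out data or with cross-validation would give a fairer comparison than the training score.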

# Question 2
Create a function called digital_root that takes in an integer. Digital root is the recursive
sum of all the digits in a number.
Given n, take the sum of the digits of n. If that value has more than one digit, continue
reducing in this way until a single-digit number is produced. The input will be a
non-negative integer.

In [109]:
def digital_root(n):
    # Replace n with the sum of its digits until a single digit remains.
    # (Avoids naming a variable `sum`, which would shadow the built-in.)
    while n > 9:
        n = sum(int(digit) for digit in str(n))
    return n

In [110]:
digital_root(16)

7

In [111]:
digital_root(942)

6

In [112]:
digital_root(132189)

6

In [113]:
digital_root(493193)

2
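Since the exercise describes the digital root as a recursive sum, here is a small recursive variant, cross-checked against the well-known closed form 1 + (n - 1) % 9 for positive n. The names `digital_root_recursive` and `digital_root_formula` are introduced only for this sketch.

```python
def digital_root_recursive(n):
    """Recursive digit-sum reduction for a non-negative integer."""
    if n < 10:
        return n  # base case: already a single digit
    return digital_root_recursive(sum(int(d) for d in str(n)))

def digital_root_formula(n):
    """Closed form: 0 for n == 0, otherwise 1 + (n - 1) % 9."""
    return 0 if n == 0 else 1 + (n - 1) % 9

# The two approaches agree on the inputs tested above, plus the edge case 0.
for n in (0, 16, 942, 132189, 493193):
    assert digital_root_recursive(n) == digital_root_formula(n)
    print(n, digital_root_recursive(n))
```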