<div style="font-size:15px; font-family:Roboto;background-color:#333333; color: white;">
    <h1><img width="121" height="40" src="http://www.radixeng.com.br/images/campaign-slogan-white.png" alt="Radix" class="css-0">  
    <center><b> US Observed Crime Rate </b> | <a href="https://www.linkedin.com/in/guimarotto/" target="_blank" style="font-family:Roboto; color: #90C820;">Guilherme Lima</a></center></h1>


<div style="font-size:20px; font-family:Roboto; background-color:#F5F5F5; color: black;">
    <h1>Summary</h1>

 * <a href="#ch1_1" target="_blank" style="font-family:Roboto; color: #90C820;"><b>Configuration and Import</b></a>
 * <a href="#ch2" target="_blank" style="font-family:verdana; color: #90C820;"><b>Exploratory Data Analysis</b></a>
     * <a href="#ch2_1" target="_blank" style="font-family:verdana; color: #90C820;"><b>Data Ingestion and Preparation</b></a>
     * <a href="#ch2_2" target="_blank" style="font-family:verdana; color: #90C820;"><b>Exploratory Analysis using Profile</b></a>
 * <a href="#ch3" target="_blank" style="font-family:verdana; color: #90C820;"><b>Feature Engineering</b></a>
     * <a href="#ch3_1" target="_blank" style="font-family:verdana; color: #90C820;"><b>Feature Preparation</b></a>
     * <a href="#ch3_3" target="_blank" style="font-family:verdana; color: #90C820;"><b>Feature Selection</b></a>  
 * <a href="#ch4" target="_blank" style="font-family:verdana; color: #90C820;"><b>Model Development</b></a>
     * <a href="#ch4_1" target="_blank" style="font-family:verdana; color: #90C820;"><b>Parameter Optimization</b></a>
     * <a href="#ch4_2" target="_blank" style="font-family:verdana; color: #90C820;"><b>Model Training and Tuning</b></a>
     * <a href="#ch4_3" target="_blank" style="font-family:verdana; color: #90C820;"><b>Model Validation</b></a>
 * <a href="#ch5" target="_blank" style="font-family:verdana; color: #90C820;"><b>Results</b></a>
 


<a id="ch1_1"></a>
<div style="font-family:verdana; word-spacing:1.5px;">
    <h1 id="italic">
        Configuration and Import
    </h1>
    <br>
<br>
</div>

In [1]:
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport
from matplotlib import pyplot as plt
from matplotlib.gridspec import GridSpec
import seaborn as sns
import warnings
warnings.filterwarnings("ignore") # ignoring annoying warnings

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression


<a id="ch2"></a>
<h1 id="basics" style="font-family:Roboto;background-color:#333333; color: white;"> 
    <center><b><br>Exploratory Data Analysis</b></center>
</h1>

<a id="ch2_1"></a>
<div style="font-family:verdana; word-spacing:1.5px;">
    <h1 id="italic">
        Data Ingestion and Preparation
    </h1>
</div>
Criminologists are interested in the effect of punishment regimes on crime rates. This has been studied using aggregate data on 47 states of the USA for 1960. The data set contains the following columns:

| Variable 	| Description 	|
|-	|-	|
| M 	| percentage of males aged 14–24 in total state population 	|
| So 	| indicator variable for a southern state 	|
| Ed 	| mean years of schooling of the population aged 25 years or over 	|
| Po1 	| per capita expenditure on police protection in 1960 	|
| Po2 	| per capita expenditure on police protection in 1959 	|
| LF 	| labour force participation rate of civilian urban males in the age-group 14-24 	|
| M.F 	| number of males per 100 females 	|
| Pop 	| state population in 1960 in hundred thousands 	|
| NW 	| percentage of nonwhites in the population 	|
| U1 	| unemployment rate of urban males 14–24 	|
| U2 	| unemployment rate of urban males 35–39 	|
| Wealth 	| wealth: median value of transferable assets or family income 	|
| Ineq 	| income inequality: percentage of families earning below half the median income 	|
| Prob 	| probability of imprisonment: ratio of number of commitments to number of offenses 	|
| Time 	| average time in months served by offenders in state prisons before their first release 	|
| Crime 	| crime rate: number of offenses per 100,000 population in 1960 	|
<div style="color:white;
       display:fill;
       border-radius:5px;
       background-color:#90C820;
       font-size:110%;
       font-family:Verdana;
       letter-spacing:0.5px">
    <p style="padding: 10px;
          color:#333337;">
        Since we are dealing with a very small dataset, this notebook explores a few techniques for handling small datasets in order to avoid overfitting.
We will (1) use simple models with barely any tuning, (2) watch for outliers, (3) select features, avoiding those with missing values or weak correlation, and (4) combine models for the final submission.
    </p>
</div>
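As a small illustration of point (2), the sketch below flags outliers with the 1.5 × IQR rule, which is the same rule a default boxplot uses to draw its whiskers. The helper name `iqr_outliers` and the toy values are hypothetical and are not taken from the crime dataset.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Hypothetical toy values, only to show the mechanics of the rule
toy = pd.Series([10, 12, 11, 13, 12, 11, 40], name="toy_feature")
mask = iqr_outliers(toy)
print(toy[mask])  # rows flagged as outliers
```

With only 47 rows, a flagged point is usually worth inspecting rather than dropping automatically.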


<h1 id="t7">References</h1>

**Overfitting:**
* IRIC's Bioinformatics Platform. Overfitting and Regularization. https://bioinfo.iric.ca/overfitting-and-regularization/
* Rants on Machine Learning. What to do with “small” data? https://medium.com/rants-on-machine-learning/what-to-do-with-small-data-d253254d1a89

**Models:**
* scikit-learn. LogisticRegression. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
* scikit-learn. Random Forest. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
* XGBoost. XGBoost Parameters. https://xgboost.readthedocs.io/en/latest/parameter.html

**Feature selection:**
* scikit-learn. Feature selection. https://scikit-learn.org/stable/modules/feature_selection.html

**Ensemble Models:**
* Tang, J., S. Alelyani, and H. Liu. "Data Classification: Algorithms and Applications." Data Mining and Knowledge Discovery Series, CRC Press (2015): pp. 498-500.


In [2]:
df_uscrime = pd.read_csv("../data/raw/uscrime.txt", delimiter= "\t")

In [3]:
df_uscrime.head()

Unnamed: 0,M,So,Ed,Po1,Po2,LF,M.F,Pop,NW,U1,U2,Wealth,Ineq,Prob,Time,Crime
0,15.1,1,9.1,5.8,5.6,0.51,95.0,33,30.1,0.108,4.1,3940,26.1,0.084602,26.2011,791
1,14.3,0,11.3,10.3,9.5,0.583,101.2,13,10.2,0.096,3.6,5570,19.4,0.029599,25.2999,1635
2,14.2,1,8.9,4.5,4.4,0.533,96.9,18,21.9,0.094,3.3,3180,25.0,0.083401,24.3006,578
3,13.6,0,12.1,14.9,14.1,0.577,99.4,157,8.0,0.102,3.9,6730,16.7,0.015801,29.9012,1969
4,14.1,0,12.1,10.9,10.1,0.591,98.5,18,3.0,0.091,2.0,5780,17.4,0.041399,21.2998,1234


In [4]:
pd.DataFrame({'df_uscrime': df_uscrime.dtypes})

Unnamed: 0,df_uscrime
M,float64
So,int64
Ed,float64
Po1,float64
Po2,float64
LF,float64
M.F,float64
Pop,int64
NW,float64
U1,float64


In [5]:
# Single sample to predict later, using the renamed column names
df_val = pd.DataFrame({
    'male_perc': 14.0,
    'southern_state': 0,
    'schooling_mean': 10.0,
    'exp_police_prot_1960': 12.0,
    'exp_police_prot_1959': 15.5,
    'labour_rate': 0.640,
    'males_vs_fem': 94.0,
    'population_hundreds': 150,
    'nonwhite_perc': 1.1,
    'unemployment_rate_14_24': 0.120,
    'unemployment_rate_35_39': 3.6,
    'wealth': 3200,
    'income_inequility': 20.1,
    'imprison_prob': 0.04,
    'time_srvd_avg': 39.0,
}, index=[0])
df_val

Unnamed: 0,male_perc,southern_state,schooling_mean,exp_police_prot_1960,exp_police_prot_1959,labour_rate,males_vs_fem,population_hundreds,nonwhite_perc,unemployment_rate_14_24,unemployment_rate_35_39,wealth,income_inequility,imprison_prob,time_srvd_avg
0,14.0,0,10.0,12.0,15.5,0.64,94.0,150,1.1,0.12,3.6,3200,20.1,0.04,39.0


<div style="color:white;
       display:fill;
       border-radius:5px;
       background-color:#90C820;
       font-size:110%;
       font-family:Verdana;
       letter-spacing:0.5px">
    <p style="padding: 10px;
          color:#333337;">
        After ingestion, the columns are renamed to make them more meaningful.
    </p>
</div>

In [6]:
# Rename the columns of df_uscrime
df_uscrime = df_uscrime.rename(columns = {'M': 'male_perc', 
                                          'So': 'southern_state',
                                          'Ed': 'schooling_mean', 
                                          'Po1': 'exp_police_prot_1960',
                                          'Po2': 'exp_police_prot_1959', 
                                          'LF': 'labour_rate',
                                          'M.F': 'males_vs_fem', 
                                          'Pop': 'population_hundreds',
                                          'NW': 'nonwhite_perc', 
                                          'U1': 'unemployment_rate_14_24',
                                          'U2': 'unemployment_rate_35_39', 
                                          'Wealth': 'wealth',
                                          'Ineq': 'income_inequility', 
                                          'Prob': 'imprison_prob',
                                          'Time': 'time_srvd_avg', 
                                          'Crime': 'crime_rate'
                                         }, inplace = False)

In [7]:
df_uscrime.describe(include=np.number)

Unnamed: 0,male_perc,southern_state,schooling_mean,exp_police_prot_1960,exp_police_prot_1959,labour_rate,males_vs_fem,population_hundreds,nonwhite_perc,unemployment_rate_14_24,unemployment_rate_35_39,wealth,income_inequility,imprison_prob,time_srvd_avg,crime_rate
count,47.0,47.0,47.0,47.0,47.0,47.0,47.0,47.0,47.0,47.0,47.0,47.0,47.0,47.0,47.0,47.0
mean,13.857447,0.340426,10.56383,8.5,8.023404,0.561191,98.302128,36.617021,10.112766,0.095468,3.397872,5253.829787,19.4,0.047091,26.597921,905.085106
std,1.256763,0.478975,1.1187,2.971897,2.796132,0.040412,2.946737,38.071188,10.282882,0.018029,0.844545,964.909442,3.989606,0.022737,7.086895,386.762697
min,11.9,0.0,8.7,4.5,4.1,0.48,93.4,3.0,0.2,0.07,2.0,2880.0,12.6,0.0069,12.1996,342.0
25%,13.0,0.0,9.75,6.25,5.85,0.5305,96.45,10.0,2.4,0.0805,2.75,4595.0,16.55,0.032701,21.60035,658.5
50%,13.6,0.0,10.8,7.8,7.3,0.56,97.7,25.0,7.6,0.092,3.4,5370.0,17.6,0.0421,25.8006,831.0
75%,14.6,1.0,11.45,10.45,9.7,0.593,99.2,41.5,13.25,0.104,3.85,5915.0,22.75,0.05445,30.45075,1057.5
max,17.7,1.0,12.2,16.6,15.7,0.641,107.1,168.0,42.3,0.142,5.8,6890.0,27.6,0.119804,44.0004,1993.0


<div style="color:white;
       display:fill;
       border-radius:5px;
       background-color:#90C820;
       font-size:110%;
       font-family:Verdana;
       letter-spacing:0.5px">
    <p style="padding: 10px;
          color:#333337;">
        Since no column works as an identifier for each sample, we will add a new column to serve as a state identifier.
    </p>
</div>


In [8]:
# Label the rows state_01 ... state_47 in their original order
df_uscrime['state_id'] = [f'state_{i:02d}' for i in range(1, len(df_uscrime) + 1)]

<div style="color:white;
       display:fill;
       border-radius:5px;
       background-color:#90C820;
       font-size:110%;
       font-family:Verdana;
       letter-spacing:0.5px">
    <p style="padding: 10px;
          color:#333337;">
      We can also turn the southern-state indicator into a categorical column named region, which makes the data easier to explore.
    </p>
</div>

In [9]:
def south_or_north(x):
    if x == 0:
        return 'North'
    else:
        return 'South'
df_uscrime['region'] = df_uscrime.southern_state.apply(south_or_north)   

<a id="ch2_2"></a>
<div style="font-family:verdana; word-spacing:1.5px;">
    <h1 id="italic">
        Exploratory Analysis
    </h1>
    <ol>
        <li>Missing Values
            <ol>
                <li>Drop columns with more than 60% missing values</li>
            </ol>
        </li>
        <li>Pearson correlation - a positive correlation indicates that when one variable increases, the other tends to increase as well; a negative correlation indicates the opposite.
            <ol>
                <li>0: no correlation</li>
                <li>from 0 to +/-0.2: weak correlation</li>
                <li>from +/-0.2 to +/-0.7: moderate correlation</li>
                <li>from +/-0.7 to +/-1: strong correlation</li>
            </ol>
        </li>
</ol>
</div>
<div style="color:white;
       display:fill;
       border-radius:5px;
       background-color:#90C820;
       font-size:110%;
       font-family:Verdana;
       letter-spacing:0.5px">
    <p style="padding: 10px;
          color:#333337;">
      Ideally we want to use features that have a moderate or strong correlation (positive or negative) with 'crime_rate' and less than 60% missing values.
    </p>
</div>
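The two screening rules above can be sketched in a few lines of pandas. This is only an illustration under stated assumptions: the toy frame and its column names (`target`, `strong`, `noise`) are hypothetical, and the bin edges simply restate the 0.2 and 0.7 thresholds listed above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are placeholders
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "target": x,
    "strong": x + rng.normal(scale=0.3, size=200),
    "noise": rng.normal(size=200),
})
toy.loc[:9, "noise"] = np.nan  # label-based slice: 10 of 200 rows become missing

# Rule 1: fraction of missing values per column, keep columns at or below 60%
missing_frac = toy.isna().mean()
keep = missing_frac[missing_frac <= 0.60].index

# Rule 2: bucket the absolute Pearson correlation with the target
corr = toy[keep].corr()["target"].drop("target")
strength = pd.cut(
    corr.abs(),
    bins=[0, 0.2, 0.7, 1.0],
    labels=["weak", "moderate", "strong"],
    include_lowest=True,
)
print(pd.DataFrame({"corr": corr, "strength": strength}))
```

Using the absolute value keeps strongly negative predictors, which would be lost if only positive correlations were considered.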

In [None]:
profile_train = ProfileReport(df_uscrime, title='Us Crime Dataset Report', explorative = True)
profile_train.to_file(output_file='profile_uscrime.html')
profile_train


<div style="color:white;
       display:fill;
       border-radius:5px;
       background-color:#90C820;
       font-size:110%;
       font-family:Verdana;
       letter-spacing:0.5px">
    <p style="padding: 10px;
          color:#333337;">
        No variable has missing values, so none will be dropped on that basis. Now let's take a closer look at the correlation matrix.
    </p>
</div>

In [None]:
sns.set(style="white")
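corr = df_uscrime.select_dtypes(include='number').corr()  # skip the text columns state_id and region
mask = np.triu(np.ones_like(corr, dtype=bool))  # np.bool was removed from recent NumPy releases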
f, ax = plt.subplots(figsize=(20, 15))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
plt.title('Correlation Matrix', fontsize=18)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True)

plt.show()

In [None]:
corr['crime_rate'].abs().sort_values()

***

<a id="ch3"></a>
<h1 id="basics" style="font-family:Roboto;background-color:#333333; color: white;"> 
    <center><b><br>Feature Engineering</b></center>
</h1>

<a id="ch3_1"></a>
<div style="font-family:verdana; word-spacing:1.5px;">
    <h1 id="italic">
        Feature Preprocessing
    </h1>
    
</div>


In [None]:
sns.set_theme(style="whitegrid")
plt.figure(figsize=(20,8))
sns.barplot(data=df_uscrime,x = 'state_id', y = 'crime_rate', hue = 'region', dodge=False)
plt.grid()
plt.xticks(np.arange(0, 47, step=1), rotation=45)
plt.title('Crime rate by state and its region', fontsize=18)
plt.ylabel('Crime Rate', fontsize=16)
plt.xlabel('State', fontsize=16)


plt.show()

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

fig.suptitle('Boxplot')

sns.set_theme(style="whitegrid")

sns.boxplot(ax=axes[0, 0], data=df_uscrime, x='region', y='males_vs_fem')
sns.boxplot(ax=axes[0, 1], data=df_uscrime, x='region', y='schooling_mean')
sns.boxplot(ax=axes[0, 2], data=df_uscrime, x='region', y='population_hundreds')
sns.boxplot(ax=axes[1, 0], data=df_uscrime, x='region', y='wealth')
sns.boxplot(ax=axes[1, 1], data=df_uscrime, x='region', y='imprison_prob')
sns.boxplot(ax=axes[1, 2], data=df_uscrime, x='region', y='exp_police_prot_1960')

In [None]:
fig = plt.figure(figsize=(20,8))
gs = GridSpec(1,2)
sns.boxplot(y=df_uscrime.crime_rate, x=df_uscrime.region, ax=fig.add_subplot(gs[0,0]))
plt.ylabel('Crime Rate', fontsize=16)
plt.xlabel('Region', fontsize=16)
sns.stripplot(y=df_uscrime.crime_rate, x=df_uscrime.region, ax=fig.add_subplot(gs[0,1]))
plt.ylabel('Crime Rate', fontsize=16)
plt.xlabel('Region', fontsize=16)
plt.show()

<div style="color:white;
       display:fill;
       border-radius:5px;
       background-color:#90C820;
       font-size:110%;
       font-family:Verdana;
       letter-spacing:0.5px">
    <p style="padding: 10px;
          color:#333337;">
        After the Exploratory Data Analysis we identified that:
'nonwhite_perc', 'unemployment_rate_14_24', 'male_perc', 'southern_state', 'time_srvd_avg', 'unemployment_rate_35_39', 'income_inequility' and 'labour_rate' have a weak correlation (absolute value below 0.2) with 'crime_rate', so they will be dropped. In addition, exp_police_prot_1959 is highly correlated with exp_police_prot_1960, so we will drop exp_police_prot_1959 to avoid redundancy.
</div>
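The selection described above was done by reading the correlation matrix. A programmatic version could look like the sketch below. The function name `select_features`, the 0.9 redundancy cutoff and the toy data are assumptions made for illustration; only the 0.2 threshold comes from the text.

```python
import numpy as np
import pandas as pd


def select_features(df, target, min_abs_corr=0.2, max_pair_corr=0.9):
    """Keep features correlated with the target, then drop redundant ones."""
    corr = df.select_dtypes(include="number").corr()
    target_corr = corr[target].drop(target).abs()
    # Step 1: keep features whose |correlation| with the target reaches the threshold,
    # ordered from strongest to weakest
    kept = target_corr[target_corr >= min_abs_corr].sort_values(ascending=False).index
    # Step 2: greedily skip any feature that is highly correlated with one already selected
    selected = []
    for col in kept:
        if all(abs(corr.loc[col, s]) < max_pair_corr for s in selected):
            selected.append(col)
    return selected


# Hypothetical deterministic toy data: a and b are near-duplicates, c is unrelated
t = np.arange(100, dtype=float)
toy = pd.DataFrame({
    "a": t,
    "b": t + 0.5 * np.cos(t),
    "c": (-1.0) ** t,
    "target": t + 3.0 * np.sin(t),
})
selected = select_features(toy, "target")
print(selected)
```

Because the loop visits features from strongest to weakest, the member of a redundant pair that survives is the one more correlated with the target.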

<a id="ch3_3"></a>
<div style="font-family:verdana; word-spacing:1.5px;">
    <h1 id="italic">
        Feature Selection
    </h1>
</div>
<div style="
       display:fill;
       border-radius:5px;
       background-color:#90C820;
       font-size:110%;
       font-family:Verdana;
       letter-spacing:0.5px">
    <p style="padding: 10px;
          color:#333337;">
        We will compare two hypotheses:
        <ol><li>H1 - All features:
            <ol>
                <li>male_perc, southern_state, schooling_mean, exp_police_prot_1960, exp_police_prot_1959, labour_rate, males_vs_fem, population_hundreds, nonwhite_perc, unemployment_rate_14_24, unemployment_rate_35_39, wealth, income_inequility , imprison_prob, time_srvd_avg</li>
            </ol>
        </li>
            <li>H2 - Selected Features using the criteria discussed during the preparation phase:
            <ol>
                <li> schooling_mean, exp_police_prot_1960,  males_vs_fem, population_hundreds, wealth, imprison_prob </li>
            </ol>
        </li>
        <li>Target
            <ol>
                <li>crime_rate</li>
            </ol>
        </li>
</ol><br></p>
</div>

In [None]:
# Weakly correlated features plus the redundant 1959 police expenditure
cols_to_drop = ['nonwhite_perc', 'unemployment_rate_14_24', 'male_perc', 'southern_state',
                'time_srvd_avg', 'unemployment_rate_35_39', 'income_inequility', 'labour_rate',
                'exp_police_prot_1959']
df_uscrime_lite = df_uscrime.drop(columns=cols_to_drop)
df_val_lite = df_val.drop(columns=cols_to_drop)

In [None]:
df_uscrime.head()

In [None]:
df_uscrime_lite.head()

***

<a id="ch4"></a>
<h1 id="basics" style="font-family:Roboto;background-color:#333333; color: white;"> 
    <center><b><br>Model Development</b></center>
</h1>

<a id="ch4_2"></a>
<div style="font-family:verdana; word-spacing:1.5px;">
    <h1 id="italic">
Model Training and Tuning -  All features
    </h1>
    <br>
    <p>
        Since we are dealing with a very small dataset, we will build an ensemble of models to get our prediction and avoid overfitting.<br>
</div>

In [None]:
X_train = df_uscrime.drop(['crime_rate', 'region', 'state_id'], axis= 1).values
y_train = df_uscrime.crime_rate.values

In [None]:
model = RandomForestRegressor(n_estimators=100, max_depth=27, max_features=6, min_samples_split=8, min_samples_leaf=1)
model.fit(X_train, y_train)
preds = model.predict(X_train)
rmse_rf = np.sqrt(mean_squared_error(y_train, preds))
print("RMSE: %f" % (rmse_rf))
del model

In [None]:
model = LogisticRegression()
model.fit(X_train, y_train)
preds = model.predict(X_train)
rmse_lr = np.sqrt(mean_squared_error(y_train, preds))
print("RMSE: %f" % (rmse_lr))
del model

In [None]:

model = XGBClassifier(learning_rate=0.1, max_depth=6, min_child_weight=7, n_estimators=100, nthread=1, objective='reg:squarederror', subsample=0.6500000000000001)
model.fit(X_train, y_train)
preds = model.predict(X_train)
rmse_xgb = np.sqrt(mean_squared_error(y_train, preds))
print("RMSE: %f" % (rmse_xgb))
del model

In [None]:
models = [
    RandomForestRegressor(n_estimators=100, max_depth=27, max_features=6, min_samples_split=8, min_samples_leaf=1),
    LogisticRegression(),
    XGBClassifier(learning_rate=0.1, max_depth=6, min_child_weight=7, n_estimators=100, nthread=1, objective='reg:squarederror', subsample=0.6500000000000001)
]

preds = pd.DataFrame()
for i, m in enumerate(models):
    m.fit(X_train, y_train)
    preds[i] = m.predict(X_train)

weights = [2, 1, 0.5]
preds['weighted_pred'] = (preds * weights).sum(axis=1) / sum(weights)
rmse_weighted = np.sqrt(mean_squared_error(y_train, preds['weighted_pred']))
print("RMSE: %f" % (rmse_weighted))
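The weighted prediction computed above is a plain weighted mean of the per-model predictions. The sketch below reproduces the arithmetic on hypothetical numbers (the prediction values are made up; the weights 2, 1 and 0.5 are the ones used in this notebook) and checks it against `np.average`.

```python
import numpy as np

# Hypothetical predictions: one row per sample, one column per model
preds = np.array([[900.0, 1000.0, 1100.0],
                  [500.0,  600.0,  400.0]])
weights = np.array([2.0, 1.0, 0.5])

# Same formula as in the cell above: sum(w_i * p_i) / sum(w_i)
manual = (preds * weights).sum(axis=1) / weights.sum()

# NumPy's built-in weighted mean gives the same result
builtin = np.average(preds, axis=1, weights=weights)

print(manual)
print(builtin)
```

Dividing by the sum of the weights keeps the result on the same scale as the individual predictions, so the weights do not need to add up to 1.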

<a id="ch4_2"></a>
<div style="font-family:verdana; word-spacing:1.5px;">
    <h1 id="italic">
Model Training and Tuning - Selected features
    </h1>
    <br>
    <p>
        Since we are dealing with a very small dataset, we will build an ensemble of models to get our prediction and avoid overfitting.<br>
</div>

In [None]:
X_train = df_uscrime_lite.drop(['crime_rate', 'region', 'state_id'], axis= 1).values
y_train = df_uscrime_lite.crime_rate.values

In [None]:
model = RandomForestRegressor(n_estimators=100, max_depth=27, max_features=6, min_samples_split=8, min_samples_leaf=1)
model.fit(X_train, y_train)
preds = model.predict(X_train)
rmse_rf = np.sqrt(mean_squared_error(y_train, preds))
print("RMSE: %f" % (rmse_rf))
del model

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
preds = model.predict(X_train)
rmse_lr = np.sqrt(mean_squared_error(y_train, preds))
print("RMSE: %f" % (rmse_lr))
del model

In [None]:
model = XGBClassifier(learning_rate=0.1, max_depth=6, min_child_weight=7, n_estimators=100, nthread=1, objective='reg:squarederror', subsample=0.6500000000000001)
model.fit(X_train, y_train)
preds = model.predict(X_train)
rmse_xgb = np.sqrt(mean_squared_error(y_train, preds))
print("RMSE: %f" % (rmse_xgb))
del model

In [None]:
models = [
    RandomForestRegressor(n_estimators=100, max_depth=27, max_features=6, min_samples_split=8, min_samples_leaf=1),
    LogisticRegression(),
    XGBClassifier(learning_rate=0.1, max_depth=6, min_child_weight=7, n_estimators=100, nthread=1, objective='reg:squarederror', subsample=0.6500000000000001)
]

preds = pd.DataFrame()
for i, m in enumerate(models):
    m.fit(X_train, y_train)
    preds[i] = m.predict(X_train)

weights = [2, 1, 0.5]
preds['weighted_pred'] = (preds * weights).sum(axis=1) / sum(weights)
rmse_weighted = np.sqrt(mean_squared_error(y_train, preds['weighted_pred']))
print("RMSE: %f" % (rmse_weighted))

<div style="color:white;
       display:fill;
       border-radius:5px;
       background-color:#90C820;
       font-size:110%;
       font-family:Verdana;
       letter-spacing:0.5px">
    <p style="padding: 10px;
          color:#333337;">
      H1 has a smaller training RMSE than H2, so we will use all the features in our answer.
    </p>
</div>
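The RMSE values compared above are computed on the same rows the models were fitted on. With 47 samples, a held-out estimate can give a different picture, and k-fold cross-validation is one common way to get it. The sketch below is an illustration only: it uses synthetic data from `make_regression` with the same shape as the crime table (47 rows, 15 features), and the 5-fold split and the random seeds are assumptions, not part of the original analysis.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in with the same shape as the crime data (47 rows, 15 features)
X, y = make_regression(n_samples=47, n_features=15, noise=10.0, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn reports negated RMSE so that larger is better; flip the sign back
scores = cross_val_score(model, X, y, cv=cv, scoring="neg_root_mean_squared_error")
cv_rmse = -scores.mean()

# In-sample RMSE for comparison, computed the same way as in the cells above
train_preds = model.fit(X, y).predict(X)
train_rmse = np.sqrt(np.mean((train_preds - y) ** 2))

print("Cross-validated RMSE: %f" % cv_rmse)
print("Training RMSE: %f" % train_rmse)
```

Running the same comparison for the H1 and H2 feature sets would show whether the ranking of the two hypotheses holds on rows the models have not seen.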

<a id="ch4_3"></a>
<div style="font-family:verdana; word-spacing:1.5px;">
    <h1 id="italic">
Model Validation
    </h1>
    <br>
    <p>
        Let's refit the ensemble on all features and predict the crime rate for the given sample (df_val).<br>
</div>

In [None]:
X_train = df_uscrime.drop(['crime_rate', 'region', 'state_id'], axis= 1).values
y_train = df_uscrime.crime_rate.values

models = [
    RandomForestRegressor(n_estimators=100, max_depth=27, max_features=6, min_samples_split=8, min_samples_leaf=1),
    LogisticRegression(),
    XGBClassifier(learning_rate=0.1, max_depth=6, min_child_weight=7, n_estimators=100, nthread=1, objective='reg:squarederror', subsample=0.6500000000000001)
]

preds = pd.DataFrame()
for i, m in enumerate(models):
    m.fit(X_train, y_train)
    preds[i] = m.predict(df_val.values)

weights = [2, 1, 0.5]
preds['weighted_pred'] = (preds * weights).sum(axis=1) / sum(weights)
preds.head()

***

<a id="ch5"></a>
<h1 id="ch5" style="font-family:Roboto;background-color:#333333; color: white;"> 
<center><b><br>Results</b></center>
</h1>

The final model is a weighted ensemble of three models (Random Forest, LogisticRegression and XGBoost), ordered and weighted according to their RMSE and trained on all available features.

For the given sample:
* M = 14.0
* So = 0
* Ed = 10.0
* Po1 = 12.0
* Po2 = 15.5
* LF = 0.640
* M.F = 94.0
* Pop = 150
* NW = 1.1
* U1 = 0.120
* U2 = 3.6
* Wealth = 3200
* Ineq = 20.1
* Prob = 0.04
* Time = 39.0        

The predicted crime rate is 1141.675241.
</div>
