The following actions should be performed:

- Identify the output variable.
- Understand the type of data.
- Check if there are any biases in your dataset.
- Check whether all members of the house have the same poverty level.
- Check if there is a house without a family head.
- Set poverty level of the members and the head of the house within a family.
- Count how many null values exist in each column.
- Remove null value rows of the target variable.
- Predict the accuracy using random forest classifier.
- Check the accuracy using random forest with cross validation.

In [1]:
# import libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
import gc
from warnings import filterwarnings
filterwarnings('ignore')

In [2]:
# read csv into dataframe

test_df = pd.read_csv('test.csv')

# Begin to explore data

In [3]:
test_df.head()

Unnamed: 0,Id,v2a1,hacdor,rooms,hacapo,v14a,refrig,v18q,v18q1,r4h1,...,age,SQBescolari,SQBage,SQBhogar_total,SQBedjefe,SQBhogar_nin,SQBovercrowding,SQBdependency,SQBmeaned,agesq
0,ID_2f6873615,,0,5,0,1,1,0,,1,...,4,0,16,9,0,1,2.25,0.25,272.25,16
1,ID_1c78846d2,,0,5,0,1,1,0,,1,...,41,256,1681,9,0,1,2.25,0.25,272.25,1681
2,ID_e5442cf6a,,0,5,0,1,1,0,,1,...,41,289,1681,9,0,1,2.25,0.25,272.25,1681
3,ID_a8db26a79,,0,14,0,1,1,1,1.0,0,...,59,256,3481,1,256,0,1.0,0.0,256.0,3481
4,ID_a62966799,175000.0,0,4,0,1,1,1,1.0,0,...,18,121,324,1,0,1,0.25,64.0,,324


In [4]:
test_df.shape

(23856, 142)

In [5]:
pd.set_option('display.max_rows', 200)
test_df.dtypes

Id                  object
v2a1               float64
hacdor               int64
rooms                int64
hacapo               int64
v14a                 int64
refrig               int64
v18q                 int64
v18q1              float64
r4h1                 int64
r4h2                 int64
r4h3                 int64
r4m1                 int64
r4m2                 int64
r4m3                 int64
r4t1                 int64
r4t2                 int64
r4t3                 int64
tamhog               int64
tamviv               int64
escolari             int64
rez_esc            float64
hhsize               int64
paredblolad          int64
paredzocalo          int64
paredpreb            int64
pareddes             int64
paredmad             int64
paredzinc            int64
paredfibras          int64
paredother           int64
pisomoscer           int64
pisocemento          int64
pisoother            int64
pisonatur            int64
pisonotiene          int64
pisomadera           int64
t

# 1. Identify the output variable:

Looking at the data dictionary, the only variables that are monetary in nature are the monthly rent payment (v2a1) and the household ownership status (tipovivi1 to tipovivi5). I will explore these two to see which one makes the better target variable.

In [6]:
# create dataframes for all levels of household ownership

own_df = test_df[test_df['tipovivi1'] == 1]
inst_df = test_df[test_df['tipovivi2'] == 1]
rent_df = test_df[test_df['tipovivi3'] == 1]
prec_df = test_df[test_df['tipovivi4'] == 1]
other_df = test_df[test_df['tipovivi5'] == 1]

- Now to check rental payment values for all levels of household ownership

In [7]:
# all rental payment values for those who own their house

own_df['v2a1'].unique()

array([nan])

In [8]:
# all rental payment values for those who own and are paying installments on their house

inst_df['v2a1'].unique()

array([ 400000.,  350000.,   50000.,  175000.,  200000.,  570540.,
        499222.,       0.,  180000.,  135000.,  500000.,  450000.,
        285270.,  100000.,  370000.,  300000.,   85000.,  150000.,
        105000.,  385000.,  257000.,  230000.,  130000.,  260000.,
        250000.,   63000.,  478000.,  545000.,  210000.,  340000.,
        600000., 1000000.,  650000.,  140000.,  800000.,  258750.,
        154000.,   70000.,   61000.,  320000.,  570000.,   90000.,
        900000.,  342324.,  235000.,   54000.,  352000.,  170000.,
        171162.,   60000.,  120000.,  220000.,  399378.,  456432.,
        375000.,  125000.,  127000.,  160000.,  190000.,  145000.,
        227000.,  380000.,  318000.,  173000.,  225000.,  280000.,
        290000.,   80000.,   95000.,   69000.,  182000.,  550000.,
        310000.,  215000.,   62000.,   33000.,   35000.,  103000.,
        286981.,  383000.,  303000.,   10000.,  365000.,  228216.,
        111000.,  174000.,  410788.,  912864.,   97000.,  6120

In [9]:
# all rental payment values for those who rent their house

rent_df['v2a1'].unique()

array([ 175000.,  300000.,   90000.,  200000.,  240000.,  210000.,
        275000.,  170000.,  230000.,   42183.,   27419.,  120000.,
        150000.,  135000.,  105000.,  145000.,  100000.,  245000.,
        325000.,  140000.,  220000.,  250000.,  160000.,  225000.,
        260000.,  130000.,   60000.,   50000.,  500000.,  350000.,
        370851., 2852700.,  399378.,  180000.,  190000.,  627594.,
        110000.,  270000.,  125000.,  550000.,  280000.,  342324.,
         85000.,   80000.,  165000.,  192000., 1026971., 1426350.,
        215000.,  352000.,  400000.,  115000.,   65000.,  155000.,
        285000.,   91573.,   92000.,   70000.,  450000.,  360000.,
         75000.,   95000.,  195000.,  513485.,  112000.,  253500.,
        328060.,  285270.,  266000.,  320000.,   86000.,  206540.,
        270008.,  460000.,  315000.,  290000.,  300008.,  340000.,
         78000.,   55000.,  167738.,  770229.,  370000.,  185000.,
        713175., 1141080.,       0.,   25000.,  484958.,  3137

In [10]:
# all rental payment values for those whose ownership status is precarious

prec_df['v2a1'].unique()

array([nan])

In [11]:
# all rental payment values for those who fall into "other"

other_df['v2a1'].unique()

array([nan])

 - Given the type of data collected for this set, the only true identifiers of monetary situation are the monthly rent payment and the status of house ownership. Since the rental payment is NaN for both extremes ("owned and fully paid house" on one end, and "precarious" or "other" on the other), I believe the strongest indicator of poverty level is the level of house ownership, so that will be our output variable.
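The per-level checks above can also be condensed into one table. The following is a minimal sketch on a toy stand-in for `test_df` (the toy values and the helper column name `ownership` are assumptions for illustration, not part of the original notebook). It counts, for each ownership level, how many rows exist and how many of them carry a non-null rent.

```python
import numpy as np
import pandas as pd

# Toy stand-in for test_df: one-hot ownership flags plus monthly rent (v2a1).
toy = pd.DataFrame({
    "tipovivi1": [1, 0, 0, 0, 0, 0],
    "tipovivi2": [0, 1, 0, 0, 0, 0],
    "tipovivi3": [0, 0, 1, 1, 0, 0],
    "tipovivi4": [0, 0, 0, 0, 1, 0],
    "tipovivi5": [0, 0, 0, 0, 0, 1],
    "v2a1": [np.nan, 120000.0, 90000.0, 150000.0, np.nan, np.nan],
})

ownership_cols = ["tipovivi1", "tipovivi2", "tipovivi3", "tipovivi4", "tipovivi5"]

# idxmax(axis=1) returns the name of the flag column that is set in each row.
toy["ownership"] = toy[ownership_cols].idxmax(axis=1)

# Per ownership level: total rows ("size") and rows with a non-null rent ("count").
rent_summary = toy.groupby("ownership")["v2a1"].agg(rows="size", with_rent="count")
print(rent_summary)
```

A level whose `with_rent` count is 0 corresponds to the `array([nan])` results seen above.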

# 2. Understand the type of data.

In [12]:
# see entire list of column names and data types

test_df.dtypes

Id                  object
v2a1               float64
hacdor               int64
rooms                int64
hacapo               int64
v14a                 int64
refrig               int64
v18q                 int64
v18q1              float64
r4h1                 int64
r4h2                 int64
r4h3                 int64
r4m1                 int64
r4m2                 int64
r4m3                 int64
r4t1                 int64
r4t2                 int64
r4t3                 int64
tamhog               int64
tamviv               int64
escolari             int64
rez_esc            float64
hhsize               int64
paredblolad          int64
paredzocalo          int64
paredpreb            int64
pareddes             int64
paredmad             int64
paredzinc            int64
paredfibras          int64
paredother           int64
pisomoscer           int64
pisocemento          int64
pisoother            int64
pisonatur            int64
pisonotiene          int64
pisomadera           int64
t

# 3. Check if there are any biases in your dataset.

 - To check for biases in the data, I want to see whether the data is sufficiently representative of the population. I decided to do this by checking the variables that are not indicative of the condition of the household or the number of residents, namely the ownership level of the house (to make sure we have enough of each level of ownership for modeling purposes) and the geographic region the individual lives in. Thus we will check the distribution of these variables.

In [13]:
# checking the distribution of house ownership levels

print(own_df.shape)
print(inst_df.shape)
print(rent_df.shape)
print(prec_df.shape)
print(other_df.shape)

(14933, 142)
(2537, 142)
(3916, 142)
(434, 142)
(2036, 142)


 - This distribution seems reasonable to me: a majority of respondents own their home, but there are enough rows in every ownership level to continue with building the model.

From the information given we will create small dataframes of the following columns and their corresponding geographic information and then check the shape of each to get a sense of the distribution of the data:

- lugar1 = Central
- lugar2 = Chorotega
- lugar3 = Pacífico central
- lugar4 = Brunca
- lugar5 = Huetar Atlántica
- lugar6 = Huetar Norte

- area1 = urban
- area2 = rural
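Because these are one-hot columns, the same distribution can be read off from column sums without building a dataframe per category. This is a small sketch on toy data (the toy values and the variable names `region_counts`, `region_share` and `urban_share` are illustrative assumptions).

```python
import pandas as pd

# Toy stand-in for test_df with a subset of the one-hot region and area flags.
toy = pd.DataFrame({
    "lugar1": [1, 1, 1, 0, 0],
    "lugar2": [0, 0, 0, 1, 0],
    "lugar3": [0, 0, 0, 0, 1],
    "area1": [1, 1, 0, 1, 0],
    "area2": [0, 0, 1, 0, 1],
})

region_cols = ["lugar1", "lugar2", "lugar3"]

# Summing a 0/1 column counts the rows where the flag is set.
region_counts = toy[region_cols].sum()

# Dividing by the number of rows gives each region's share of the data.
region_share = region_counts / len(toy)

# The mean of a 0/1 column is the fraction of rows where it equals 1.
urban_share = toy["area1"].mean()

print(region_counts)
print(region_share)
print(urban_share)
```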

In [14]:
# creating the 'lugar' dataframes and printing their respective shapes to see the distribution (given by the number of rows in each dataframe).

cent_df = test_df[test_df['lugar1'] == 1]
chor_df = test_df[test_df['lugar2'] == 1]
pac_df = test_df[test_df['lugar3'] == 1]
bru_df = test_df[test_df['lugar4'] == 1]
ha_df = test_df[test_df['lugar5'] == 1]
hn_df = test_df[test_df['lugar6'] == 1]

print(cent_df.shape)
print(chor_df.shape)
print(pac_df.shape)
print(bru_df.shape)
print(ha_df.shape)
print(hn_df.shape)

(13852, 142)
(1974, 142)
(1744, 142)
(2064, 142)
(2377, 142)
(1845, 142)


In [15]:
# creating the urban vs rural dataframes and printing their respective shapes to see the distribution.

urban_df = test_df[test_df['area1'] == 1]
rural_df = test_df[test_df['area2'] == 1]

print(urban_df.shape)
print(rural_df.shape)

(17216, 142)
(6640, 142)


In [16]:
# checking the percentage of our population who live in urban areas
# as that was the only real-life statistic I was able to find in searching on Google to check this dataset against for potential bias

17216/(17216+6640)

0.721663313212609

A quick Google search indicates that roughly 81% of Costa Ricans lived in urban areas between 2011 and 2021. About 72% of our dataset lives in urban areas, roughly 9 percentage points lower, which seems representative enough to me not to indicate a significant level of bias in the data. I would even go so far as to say that those who are impoverished are more likely to <ins>*not*</ins> be living in urban areas, so the larger rural share in our dataset should help identify more of the population who are in need of aid.

# 4. Check whether all members of the house have the same poverty level.

 - If all the members of a given household have the same poverty level (in our case, level of house ownership) then I can create a list of household IDs ('idhogar') for each of the five dataframes I have and check for any Household IDs that appear in multiple lists. If there are none, then every member of every house has been given the same poverty level.

First, create lists of the Household IDs ('idhogar') for each dataframe representing a given level of house ownership.

In [17]:
# create list for all homes that are owned

own_ids = own_df['idhogar'].tolist()

In [18]:
# confirm length of list matches with shape above

len(own_ids)

14933

In [19]:
# repeat the above for all other ownership levels

inst_ids = inst_df['idhogar'].tolist()
rent_ids = rent_df['idhogar'].tolist()
prec_ids = prec_df['idhogar'].tolist()
other_ids = other_df['idhogar'].tolist()

In [20]:
print(len(inst_ids))
print(len(rent_ids))
print(len(prec_ids))
print(len(other_ids))

2537
3916
434
2036


 - Check to see if any household ID shows up in multiple lists (which would indicate members of the same house have been classified differently based on my chosen metric).
 
With 5 lists and no regard for order, there are 10 pairs to check. The `set(a).intersection(b)` method returns the set of items that appear in both lists. If our data is consistent, every result should be an empty set.
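The count of 10 comes from choosing 2 lists out of 5, i.e. C(5, 2) = 10. As a sketch, the pairs can be generated with `itertools.combinations` rather than written out by hand (the toy ID lists and the `overlaps` dictionary are illustrative assumptions).

```python
from itertools import combinations

# Toy stand-ins for the five household-ID lists.
id_lists = {
    "own": ["a", "a", "b"],
    "inst": ["c"],
    "rent": ["d", "e"],
    "prec": ["f"],
    "other": ["g"],
}

# combinations(..., 2) yields every unordered pair exactly once.
overlaps = {}
for (name_a, ids_a), (name_b, ids_b) in combinations(id_lists.items(), 2):
    overlaps[(name_a, name_b)] = set(ids_a) & set(ids_b)

for pair, shared in overlaps.items():
    print(pair, shared)
```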

In [21]:
print(set(own_ids).intersection(inst_ids))
print(set(own_ids).intersection(rent_ids))
print(set(own_ids).intersection(prec_ids))
print(set(own_ids).intersection(other_ids))
print(set(inst_ids).intersection(rent_ids))
print(set(inst_ids).intersection(prec_ids))
print(set(inst_ids).intersection(other_ids))
print(set(rent_ids).intersection(prec_ids))
print(set(rent_ids).intersection(other_ids))
print(set(prec_ids).intersection(other_ids))

set()
set()
set()
set()
set()
set()
set()
set()
set()
set()


 - Since all sets are empty, we can conclude that all members of a given household have been given the same poverty level, i.e. the same household ownership level.
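As a cross-check, the same question can be asked per household with a groupby: if any ownership flag takes more than one distinct value inside a household, its members disagree. This is a sketch on toy data in which household `h3` is deliberately inconsistent (the toy values and variable names are assumptions for illustration).

```python
import pandas as pd

# Toy stand-in: h3 has one member flagged as owner and one as renter.
toy = pd.DataFrame({
    "idhogar": ["h1", "h1", "h2", "h3", "h3"],
    "tipovivi1": [1, 1, 0, 1, 0],
    "tipovivi3": [0, 0, 1, 0, 1],
})

flag_cols = ["tipovivi1", "tipovivi3"]

# Number of distinct values each flag takes within each household.
distinct_per_house = toy.groupby("idhogar")[flag_cols].nunique()

# A household is inconsistent if any flag has more than one distinct value.
inconsistent = distinct_per_house[(distinct_per_house > 1).any(axis=1)].index.tolist()
print(inconsistent)
```

An empty list here would correspond to the all-empty intersections above.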

# 5. Check if there is a house without a family head.

 - In order to do this, we can create a list of every household ID ('idhogar') and a list of the household ID for all rows that are heads of households ('parentesco1' = 1). If there are any households in the master list that don't appear in the second list, that means those households don't have a head of household.

In [22]:
# list of every unique household ID in the dataframe

all_houses_ids = test_df['idhogar'].unique().tolist()

In [23]:
len(all_houses_ids)

7352

This means there are 7352 unique households in the dataframe.

In [24]:
# make a dataframe of only those who are heads of households

house_with_heads_df = test_df[test_df['parentesco1'] == 1]

- Check new dataframe's shape to see the number of rows. The number of rows = the number of households with heads.

In [25]:
house_with_heads_df.shape

(7334, 142)

In [26]:
# check to see if any households have multiple people who are considered heads by checking how many unique house IDs there are in the list of 7334 household heads.

len(house_with_heads_df['idhogar'].unique().tolist())

7334

Since the number of unique household IDs in the heads-of-household dataframe (7334) equals its number of rows, every head of household is the head of a distinct house. Since there are 7352 unique household IDs and only 7334 heads of household, there are 7352 - 7334 = 18 households that don't have an identified head of household.
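To list which households lack a head, rather than only counting them, one option is to sum the `parentesco1` flag per household. This is a minimal sketch on toy data (the toy values and the names `heads_per_house`, `no_head` and `multi_head` are illustrative assumptions).

```python
import pandas as pd

# Toy stand-in: h2 has no member flagged as head of household.
toy = pd.DataFrame({
    "idhogar": ["h1", "h1", "h2", "h3", "h3"],
    "parentesco1": [1, 0, 0, 0, 1],
})

# Summing the 0/1 flag gives the number of heads in each household.
heads_per_house = toy.groupby("idhogar")["parentesco1"].sum()

# Households with zero heads, and households with more than one head.
no_head = heads_per_house[heads_per_house == 0].index.tolist()
multi_head = heads_per_house[heads_per_house > 1].index.tolist()

print(no_head)
print(multi_head)
```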

# 6. Set poverty level of the members and the head of the house within a family.

Since we are basing our poverty levels off of ownership status, we will need to make a new column and insert a number, 1-5, that indicates the level of ownership. The relationships will be as follows:
- Own house = 1
- Installment payments = 2
- Rent = 3
- Precarious = 4
- Other = 5

To do this, we can use NumPy's np.select() function.
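One detail worth knowing: `np.select` assigns its `default` value (0 unless specified) to any row that matches none of the conditions. The sketch below uses toy data and an explicit sentinel of -1, which is an assumption chosen for illustration, so that unmatched rows are easy to spot.

```python
import numpy as np
import pandas as pd

# Toy stand-in with two ownership flags; the third row matches neither.
toy = pd.DataFrame({
    "tipovivi1": [1, 0, 0],
    "tipovivi2": [0, 1, 0],
})

conditions = [toy["tipovivi1"] == 1, toy["tipovivi2"] == 1]
values = [1, 2]

# Rows that satisfy no condition receive the default value.
levels = np.select(conditions, values, default=-1)
print(levels)
```

Checking for the sentinel afterwards is a quick way to confirm that every row was assigned a level.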

In [27]:
# create a list of our conditions

conditions = [
    (test_df['tipovivi1'] == 1),
    (test_df['tipovivi2'] == 1),
    (test_df['tipovivi3'] == 1),
    (test_df['tipovivi4'] == 1),
    (test_df['tipovivi5'] == 1),
    ]

In [28]:
# create a list of the values we want to assign for each condition

values = [1, 2, 3, 4, 5]

In [29]:
# create a new column and use np.select to assign values to it using our lists as arguments

test_df['Poverty Level'] = np.select(conditions, values)

In [30]:
# display updated DataFrame

test_df.head()

Unnamed: 0,Id,v2a1,hacdor,rooms,hacapo,v14a,refrig,v18q,v18q1,r4h1,...,SQBescolari,SQBage,SQBhogar_total,SQBedjefe,SQBhogar_nin,SQBovercrowding,SQBdependency,SQBmeaned,agesq,Poverty Level
0,ID_2f6873615,,0,5,0,1,1,0,,1,...,0,16,9,0,1,2.25,0.25,272.25,16,1
1,ID_1c78846d2,,0,5,0,1,1,0,,1,...,256,1681,9,0,1,2.25,0.25,272.25,1681,1
2,ID_e5442cf6a,,0,5,0,1,1,0,,1,...,289,1681,9,0,1,2.25,0.25,272.25,1681,1
3,ID_a8db26a79,,0,14,0,1,1,1,1.0,0,...,256,3481,1,256,0,1.0,0.0,256.0,3481,1
4,ID_a62966799,175000.0,0,4,0,1,1,1,1.0,0,...,121,324,1,0,1,0.25,64.0,,324,3


In [31]:
# pull the columns I want to check and explore the data to confirm that the Poverty Level matches with the ownership level

pov_df = test_df[['tipovivi1' , 'tipovivi2', 'tipovivi3', 'tipovivi4', 'tipovivi5', 'Poverty Level']]

In [32]:
# check to confirm the assignment worked for each level of ownership

pov_df.loc[pov_df['tipovivi1'] == 1].head(3)

Unnamed: 0,tipovivi1,tipovivi2,tipovivi3,tipovivi4,tipovivi5,Poverty Level
0,1,0,0,0,0,1
1,1,0,0,0,0,1
2,1,0,0,0,0,1


In [33]:
pov_df.loc[pov_df['tipovivi2'] == 1].head(3)

Unnamed: 0,tipovivi1,tipovivi2,tipovivi3,tipovivi4,tipovivi5,Poverty Level
5,0,1,0,0,0,2
6,0,1,0,0,0,2
67,0,1,0,0,0,2


In [34]:
pov_df.loc[pov_df['tipovivi3'] == 1].head(3)

Unnamed: 0,tipovivi1,tipovivi2,tipovivi3,tipovivi4,tipovivi5,Poverty Level
4,0,0,1,0,0,3
7,0,0,1,0,0,3
8,0,0,1,0,0,3


In [35]:
pov_df.loc[pov_df['tipovivi4'] == 1].head(3)

Unnamed: 0,tipovivi1,tipovivi2,tipovivi3,tipovivi4,tipovivi5,Poverty Level
410,0,0,0,1,0,4
411,0,0,0,1,0,4
412,0,0,0,1,0,4


In [36]:
pov_df.loc[pov_df['tipovivi5'] == 1].head(3)

Unnamed: 0,tipovivi1,tipovivi2,tipovivi3,tipovivi4,tipovivi5,Poverty Level
22,0,0,0,0,1,5
23,0,0,0,0,1,5
28,0,0,0,0,1,5


# 7. Count how many null values exist in each column.

In [37]:
# return a list of the count of null values in every column in the dataframe

test_df.isna().sum()

Id                     0
v2a1               17403
hacdor                 0
rooms                  0
hacapo                 0
v14a                   0
refrig                 0
v18q                   0
v18q1              18126
r4h1                   0
r4h2                   0
r4h3                   0
r4m1                   0
r4m2                   0
r4m3                   0
r4t1                   0
r4t2                   0
r4t3                   0
tamhog                 0
tamviv                 0
escolari               0
rez_esc            19653
hhsize                 0
paredblolad            0
paredzocalo            0
paredpreb              0
pareddes               0
paredmad               0
paredzinc              0
paredfibras            0
paredother             0
pisomoscer             0
pisocemento            0
pisoother              0
pisonatur              0
pisonotiene            0
pisomadera             0
techozinc              0
techoentrepiso         0
techocane              0


- The columns with null values (v2a1, v18q1, rez_esc, meaneduc, SQBmeaned) all hold numeric quantities, so for this analysis their nulls can simply be replaced with 0.
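Since the full listing is long, it can help to show only the columns that actually contain nulls, and to fill just those columns instead of the whole dataframe. This is a sketch on toy data (the toy values and variable names are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Toy stand-in with nulls in two of three columns.
toy = pd.DataFrame({
    "v2a1": [np.nan, 100000.0, np.nan],
    "v18q1": [np.nan, 1.0, 2.0],
    "rooms": [3, 4, 5],
})

# Count nulls per column, then keep only the columns that have any.
null_counts = toy.isna().sum()
cols_with_nulls = null_counts[null_counts > 0]
print(cols_with_nulls)

# Passing a dict to fillna fills only the named columns.
filled = toy.fillna({"v2a1": 0, "v18q1": 0})
print(filled.isna().sum())
```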

In [38]:
# filling all null values with 0

test_df = test_df.fillna(0)

In [39]:
test_df.isna().sum()

Id                 0
v2a1               0
hacdor             0
rooms              0
hacapo             0
v14a               0
refrig             0
v18q               0
v18q1              0
r4h1               0
r4h2               0
r4h3               0
r4m1               0
r4m2               0
r4m3               0
r4t1               0
r4t2               0
r4t3               0
tamhog             0
tamviv             0
escolari           0
rez_esc            0
hhsize             0
paredblolad        0
paredzocalo        0
paredpreb          0
pareddes           0
paredmad           0
paredzinc          0
paredfibras        0
paredother         0
pisomoscer         0
pisocemento        0
pisoother          0
pisonatur          0
pisonotiene        0
pisomadera         0
techozinc          0
techoentrepiso     0
techocane          0
techootro          0
cielorazo          0
abastaguadentro    0
abastaguafuera     0
abastaguano        0
public             0
planpri            0
noelec       

# 8. Remove null value rows of the target variable.

In [40]:
# checking for null values in the Poverty Level column

test_df['Poverty Level'].isna().sum()

0

- None of the rows in my target variable (Poverty Level, derived from house ownership) are null, so there are no rows to remove.
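Had any target values been missing, the rows could be dropped with `dropna` restricted to the target column. This is a minimal sketch on toy data in which one target value is deliberately null (the toy values are an assumption for illustration).

```python
import numpy as np
import pandas as pd

# Toy stand-in: the second row has no target value.
toy = pd.DataFrame({
    "rooms": [3, 4, 5],
    "Poverty Level": [1.0, np.nan, 3.0],
})

# subset= limits the null check to the target column only.
cleaned = toy.dropna(subset=["Poverty Level"])
print(cleaned.shape)
```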

# 9. Predict the accuracy using random forest classifier.

- First I need to remove the house ownership columns, since the Poverty Level column is derived directly from them and replaces them.

In [41]:
# making a new dataframe by dropping the unneeded columns

poverty_df = test_df.drop(['tipovivi1', 'tipovivi2', 'tipovivi3', 'tipovivi4', 'tipovivi5'], axis=1)

In [42]:
# confirming the columns are gone

print(poverty_df.columns.values)

['Id' 'v2a1' 'hacdor' 'rooms' 'hacapo' 'v14a' 'refrig' 'v18q' 'v18q1'
 'r4h1' 'r4h2' 'r4h3' 'r4m1' 'r4m2' 'r4m3' 'r4t1' 'r4t2' 'r4t3' 'tamhog'
 'tamviv' 'escolari' 'rez_esc' 'hhsize' 'paredblolad' 'paredzocalo'
 'paredpreb' 'pareddes' 'paredmad' 'paredzinc' 'paredfibras' 'paredother'
 'pisomoscer' 'pisocemento' 'pisoother' 'pisonatur' 'pisonotiene'
 'pisomadera' 'techozinc' 'techoentrepiso' 'techocane' 'techootro'
 'cielorazo' 'abastaguadentro' 'abastaguafuera' 'abastaguano' 'public'
 'planpri' 'noelec' 'coopele' 'sanitario1' 'sanitario2' 'sanitario3'
 'sanitario5' 'sanitario6' 'energcocinar1' 'energcocinar2' 'energcocinar3'
 'energcocinar4' 'elimbasu1' 'elimbasu2' 'elimbasu3' 'elimbasu4'
 'elimbasu5' 'elimbasu6' 'epared1' 'epared2' 'epared3' 'etecho1' 'etecho2'
 'etecho3' 'eviv1' 'eviv2' 'eviv3' 'dis' 'male' 'female' 'estadocivil1'
 'estadocivil2' 'estadocivil3' 'estadocivil4' 'estadocivil5'
 'estadocivil6' 'estadocivil7' 'parentesco1' 'parentesco2' 'parentesco3'
 'parentesco4' 'paren

In [43]:
# dropping unnecessary object columns (ID and household ID); the remaining object columns are converted to numbers for modeling below

poverty_df = poverty_df.drop(['Id', 'idhogar'], axis = 1) 

- Now, convert remaining object columns to integers

The remaining object columns (dependency, edjefe, and edjefa) contain a mixture of numbers and 'yes' or 'no' responses. For the sake of modeling, I'm going to replace 'no' with 0, set aside the rows with 'yes', calculate the mean of the remaining numeric values, and then replace 'yes' with that mean to try to stay representative of the data.
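The same idea can be sketched in a few lines with `pd.to_numeric`, which also leaves the column with a numeric dtype. This is an alternative illustration on a toy series, not the cells used in this notebook (the toy values and variable names are assumptions).

```python
import pandas as pd

# Toy stand-in for a mixed column such as 'dependency'.
dependency = pd.Series(["no", ".5", "yes", "2", "yes", "1.5"])

# Map 'no' to "0"; errors="coerce" then turns the unparseable 'yes' into NaN.
numeric = pd.to_numeric(dependency.replace("no", "0"), errors="coerce")

# mean() skips NaN, so it is computed over the numeric answers only.
mean_value = numeric.mean()

# Fill the former 'yes' entries with that mean.
filled = numeric.fillna(mean_value)
print(filled.tolist())
print(filled.dtype)
```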

In [44]:
# for dependency column, replacing 'no' with 0

poverty_df["dependency"].replace({"no": 0}, inplace = True)

In [45]:
# keeping only the rows whose dependency value is not 'yes' (copy to avoid modifying a view of poverty_df)

mean_dep_df = poverty_df[poverty_df['dependency'] != 'yes'].copy()

In [46]:
# converting column to floats

mean_dep_df['dependency'] = mean_dep_df['dependency'].astype(float)

In [47]:
# mean of dependency column

mean_dep_df['dependency'].mean()

1.2342287072195222

In [48]:
# replacing 'yes' values in the dependency column with the mean

poverty_df["dependency"].replace({"yes": 1.2342287072195222}, inplace = True)

In [49]:
# confirm it worked

poverty_df['dependency'].unique()

array(['.5', 0, '8', 1.2342287072195222, '.25', '2', '.33333334', '.375',
       '.60000002', '1.5', '.2', '.75', '.66666669', '3', '.14285715',
       '.40000001', '.80000001', '1.6666666', '.2857143', '1.25', '2.5',
       '5', '.85714287', '1.3333334', '.16666667', '4', '.125',
       '.83333331', '2.3333333', '7', '1.2', '3.5', '2.25', '3.3333333',
       '6'], dtype=object)

- Repeat the same process for the 'edjefe' and 'edjefa' columns

In [50]:
# same process for 'edjefe' column

poverty_df["edjefe"].replace({"no": 0}, inplace = True)
mean_edjefe_df = poverty_df[poverty_df['edjefe'] != 'yes']
mean_edjefe_df['edjefe'] = mean_edjefe_df['edjefe'].astype(float)
mean_edjefe_df['edjefe'].mean()

5.25204770190553

In [51]:
poverty_df["edjefe"].replace({"yes": 5.25204770190553}, inplace = True)

In [52]:
# same process for 'edjefa' column

poverty_df["edjefa"].replace({"no": 0}, inplace = True)
mean_edjefa_df = poverty_df[poverty_df['edjefa'] != 'yes'].copy()
mean_edjefa_df['edjefa'] = mean_edjefa_df['edjefa'].astype(float)
mean_edjefa_df['edjefa'].mean()

2.8111846822150057

In [53]:
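# note: this fills 'yes' with the edjefe mean (5.25204770190553) rather than the edjefa mean computed in the previous cell (2.8111846822150057); the outputs below reflect the value used here
poverty_df["edjefa"].replace({"yes": 5.25204770190553}, inplace = True)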

In [54]:
# confirm all processes worked

print(poverty_df['edjefe'].unique())
print(poverty_df['edjefa'].unique())

[0 '16' '10' '6' '11' '8' '13' '14' '5' '3' '9' '17' '15' '7' '21' '4'
 '12' '2' '20' 5.25204770190553 '19' '18']
['17' 0 '11' '14' '10' '15' '9' '6' '8' '3' '2' '5' '16' '12'
 5.25204770190553 '7' '13' '21' '4' '19' '18' '20']


In [55]:
# isolating input and output columns

poverty_input = poverty_df.drop(['Poverty Level'], axis = 1)
poverty_output = poverty_df[['Poverty Level']]

In [56]:
# creating train and test splits

X_train, X_test, y_train, y_test = train_test_split(poverty_input, poverty_output, test_size=0.3, random_state = 13)

In [57]:
# checking the shapes of my data sets

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(16699, 135) (7157, 135) (16699, 1) (7157, 1)


In [58]:
# using ravel on the y vectors to make them a 1d array

y_train = np.ravel(y_train)
y_test = np.ravel(y_test)

In [59]:
# confirm shapes are now correct

print(y_train.shape)
print(y_test.shape)

(16699,)
(7157,)


In [60]:
# building a random forest classifier

classifier = RandomForestClassifier()

In [61]:
# fit the classifier

classifier.fit(X_train, y_train)

RandomForestClassifier()

In [62]:
# predict on the held-out split and score it (accuracy_score expects y_true first, then y_pred)
y_predict = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_predict)
print(accuracy)

0.9413161939360067
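To put this accuracy in context, it helps to compare it with the share of the largest class, which is what a model that always predicts "own house" would score. The sketch below uses the ownership counts printed in section 3. The dictionary keys are illustrative names, and the comparison assumes the test split has roughly the same class mix as the full data.

```python
# Row counts per ownership level, taken from the shapes printed in section 3.
counts = {"own": 14933, "inst": 2537, "rent": 3916, "prec": 434, "other": 2036}

total = sum(counts.values())

# Accuracy of always predicting the most common class.
baseline = max(counts.values()) / total

print(total)
print(round(baseline, 3))
```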


# 10. Check the accuracy using random forest with cross validation.

In [63]:
# build random forest classifier

rf_class = RandomForestClassifier()

In [64]:
# determine accuracy of the Random Forest model using 10-fold cross validation and print the mean score as a percentage

accuracy = cross_val_score(rf_class, X_train, y_train, scoring='accuracy', cv=10).mean()*100
print('Accuracy of Random Forests is: ', accuracy)

Accuracy of Random Forests is:  93.89186755309034
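Besides the mean, the spread across folds shows how stable the score is. The following sketch runs on synthetic data from `make_classification` rather than on the poverty data, and the fold count, seeds and `n_estimators` value are arbitrary choices for illustration. It uses stratified folds so each fold keeps the class proportions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 3-class data standing in for the real feature matrix and target.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=13
)

# Stratified folds preserve the class balance within every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=13)
model = RandomForestClassifier(n_estimators=50, random_state=13)

# One accuracy score per fold.
scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv)
print(f"mean accuracy: {scores.mean():.3f}, std across folds: {scores.std():.3f}")
```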
