Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency between 50% and 70%? If so, accuracy alone can be a reasonable metric. Outside that range, accuracy can be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?
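
For the regression case in the checklist, skew can be checked and reduced with a log transform. The following is a minimal sketch on made-up numbers; `toy_target` is a hypothetical series, not a column from the project data.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed target (not project data): each value doubles
toy_target = pd.Series([1, 2, 4, 8, 16, 32, 64, 128], name='toy_target')

# Positive skew means a long tail on the right
raw_skew = toy_target.skew()

# log1p computes log(1 + x), so it is also safe for targets that contain zeros
log_target = np.log1p(toy_target)
log_skew = log_target.skew()

# expm1 reverses log1p, which is needed to report predictions in original units
recovered = np.expm1(log_target)

print('Skew before log transform:', raw_skew)
print('Skew after log transform: ', log_skew)
```

If a model is trained on the log-transformed target, its predictions are in log units and need `np.expm1` before they are compared with the original values.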

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*
    !pip install eli5

# If you're working locally:
else:
    DATA_PATH = '../data/'


In [3]:
import pandas as pd

In [6]:
# DATA_PATH already ends with '/', so no extra slash is needed
df = pd.read_excel(DATA_PATH + 'Unit_2_project_data.xlsx')

In [7]:
df.head(0)

Unnamed: 0,2.1 Organization Name,2.2 Project Name,2.4 ProjectType,2.5 Utilization Tracking Method (Invalid),2.6 Federal Grant Programs,Enrollment Created By,3.1 FirstName,3.1 LastName,5.8 Personal ID,5.9 Household ID,...,4.09 Mental Health Problem,4.05 Physical Disability,CaseChildren,CaseAdults,Bed Nights During Report Period,Count of Bed Nights - Entire Episode,HEN-HP Referral Most Recent,HEN-RRH Referral Most Recent,WorkSource Referral Most Recent,YAHP Referral Most Recent


In [8]:
# drop columns that contain PPI to protect privacy

df = df.drop(columns=['3.1 FirstName', '3.1 LastName', '3.2 SocSecNo', 
                      '3.3 Birthdate', 'V5 Prior Address'])

In [20]:
# display all possible exit destinations and their totals
df['3.12 Exit Destination'].value_counts()

No exit interview completed                                                                                                      348
Client refused                                                                                                                   301
Emergency shelter, including hotel or motel paid for with emergency shelter voucher, or RHY-funded Host Home shelter             274
Rental by client with RRH or equivalent subsidy                                                                                  225
Rental by client, no ongoing housing subsidy                                                                                     176
Transitional Housing for homeless persons (including homeless youth)                                                             127
Rental by client, other ongoing housing subsidy                                                                                   72
Staying or living with family, permanent tenure                      

In [21]:
exit_reasons = ['Rental by client with RRH or equivalent subsidy', 
                'Rental by client, no ongoing housing subsidy', 
                'Staying or living with family, permanent tenure', 
                'Rental by client, other ongoing housing subsidy',
                'Permanent housing (other than RRH) for formerly homeless persons', 
                'Staying or living with friends, permanent tenure', 
                'Owned by client, with ongoing housing subsidy', 
                'Rental by client, VASH housing Subsidy'
               ]

In [22]:
# Count how often each exit destination occurs in the main data.
# `exits` is a Series of totals used in the calculations below.
exits = df['3.12 Exit Destination'].value_counts()

In [31]:
# Select rows whose exit destination contains one of the three keywords
# found in permanent housing destinations. na=False treats missing
# destinations as non-matches.
perm1 = df[df['3.12 Exit Destination'].str.contains('Rental', na=False)]
perm2 = df[df['3.12 Exit Destination'].str.contains('permanent', na=False)]
perm3 = df[df['3.12 Exit Destination'].str.contains('Permanent', na=False)]

# Combine the three dataframes into one to calculate the share of exits to perm.
# pd.concat is used because DataFrame.append was removed in pandas 2.0.
perm_final = pd.concat([perm1, perm2, perm3])
perm_final_sums = perm_final['3.12 Exit Destination'].value_counts()

# Boolean mask built from the explicit list of permanent destinations
perm_final2 = df['3.12 Exit Destination'].isin(exit_reasons)

# Calculate the fraction (0 to 1) of exits that went to perm
perm_final_percent = perm_final_sums.sum() / exits.sum()

In [24]:
df['perm_leaver'] = df['3.12 Exit Destination'].isin(exit_reasons)

In [25]:
df['perm_leaver'].value_counts(normalize=True)

False    0.713086
True     0.286914
Name: perm_leaver, dtype: float64
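
With a majority class of about 71%, the target falls just outside the 50% to 70% range in the checklist, so accuracy alone can mislead. The sketch below illustrates why, using synthetic labels that only mirror the proportions above (`y_baseline` and `X_baseline` are made up, not the project data): a model that always predicts the majority class scores about 71% accuracy while finding none of the positive cases.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels mirroring roughly a 71% / 29% split (not project data)
y_baseline = np.array([0] * 713 + [1] * 287)
# The features are irrelevant to a majority-class baseline
X_baseline = np.zeros((len(y_baseline), 1))

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_baseline, y_baseline)
y_pred = baseline.predict(X_baseline)

# Accuracy equals the majority class frequency
baseline_accuracy = accuracy_score(y_baseline, y_pred)
# Recall for the positive class: share of true positives that were found
baseline_recall = recall_score(y_baseline, y_pred)

print('Baseline accuracy:', baseline_accuracy)
print('Baseline recall for the positive class:', baseline_recall)
```

Any real model should be compared against this baseline, and a metric such as recall, precision, or ROC AUC shows what accuracy hides.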

In [33]:
perm_final2

0        True
1        True
2        True
3        True
4        True
        ...  
2020    False
2021    False
2022    False
2023    False
2024    False
Name: 3.12 Exit Destination, Length: 2025, dtype: bool

In [12]:
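import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

train = df

# Split train into train & val.
# NOTE: 'status_group' is not a column in this dataset, so this call raises
# the KeyError shown in this cell's output. The target created above is 'perm_leaver'.
train, val = train_test_split(train, train_size=0.80, test_size=0.20, 
                              stratify=train['status_group'], random_state=42)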

def wrangle(X):
    """Wrangle train, validate, and test sets in the same way"""
    
    # Prevent SettingWithCopyWarning
    X = X.copy()
    
    # About 3% of the time, latitude has small values near zero,
    # outside Tanzania, so we'll treat these values like zero.
    X['latitude'] = X['latitude'].replace(-2e-08, 0)
    
    # When columns have zeros and shouldn't, they are like null values.
    # So we will replace the zeros with nulls, and impute missing values later.
    # Also create a "missing indicator" column, because the fact that
    # values are missing may be a predictive signal.
    cols_with_zeros = ['longitude', 'latitude', 'construction_year', 
                       'gps_height', 'population']
    for col in cols_with_zeros:
        X[col] = X[col].replace(0, np.nan)
        X[col+'_MISSING'] = X[col].isnull()
            
    # Drop duplicate columns
    duplicates = ['quantity_group', 'payment_type']
    X = X.drop(columns=duplicates)
    
    # Drop recorded_by (never varies) and id (always varies, random)
    unusable_variance = ['recorded_by', 'id']
    X = X.drop(columns=unusable_variance)
    
    # Convert date_recorded to datetime
    # (infer_datetime_format is deprecated in pandas 2.x and no longer needed)
    X['date_recorded'] = pd.to_datetime(X['date_recorded'])
    
    # Extract components from date_recorded, then drop the original column
    X['year_recorded'] = X['date_recorded'].dt.year
    X['month_recorded'] = X['date_recorded'].dt.month
    X['day_recorded'] = X['date_recorded'].dt.day
    X = X.drop(columns='date_recorded')
    
    # Engineer feature: how many years from construction_year to date_recorded
    X['years'] = X['year_recorded'] - X['construction_year']
    X['years_MISSING'] = X['years'].isnull()
    
    # return the wrangled dataframe
    return X



KeyError: 'status_group'
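
The checklist asks whether to use a random split or a time-based split. The following is a minimal sketch of both on a made-up frame; `toy`, `entry_date`, and `target` are hypothetical names, not columns from the project data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 100 daily rows, 30% positive class (not project data)
toy = pd.DataFrame({
    'entry_date': pd.date_range('2019-01-01', periods=100, freq='D'),
    'target': ([True] * 3 + [False] * 7) * 10,
})

# Option 1: random split, stratified so every subset keeps the 30/70 ratio.
# First hold out 20% for test, then 25% of the remainder for validation.
toy_train_val, toy_test = train_test_split(
    toy, test_size=0.20, stratify=toy['target'], random_state=42)
toy_train, toy_val = train_test_split(
    toy_train_val, test_size=0.25, stratify=toy_train_val['target'],
    random_state=42)

# Option 2: time-based split, training on the earliest 80 rows only
toy_sorted = toy.sort_values('entry_date')
cutoff = toy_sorted['entry_date'].iloc[80]  # first date that is held out
toy_train_time = toy_sorted[toy_sorted['entry_date'] < cutoff]
toy_test_time = toy_sorted[toy_sorted['entry_date'] >= cutoff]

print(len(toy_train), len(toy_val), len(toy_test))
print(len(toy_train_time), len(toy_test_time))
```

A time-based split is usually the safer choice when the model will be used to predict future observations, because a random split lets the model train on rows that come after the ones it is tested on.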

In [None]:
train = wrangle(train)
val = wrangle(val)
test = wrangle(test)

In [None]:
# Arrange data into X features matrix and y target vector
target = 'status_group'
X_train = train.drop(columns=target)
y_train = train[target]
X_val = val.drop(columns=target)
y_val = val[target]
X_test = test
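
When a target is derived from another column, as `perm_leaver` is derived from `3.12 Exit Destination` earlier in this notebook, that source column leaks the answer and should be excluded from the features. The sketch below shows the idea on a made-up frame: the destination strings are taken from the output above, while `exit_destination`, `bed_nights`, and the numeric values are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical toy frame (not project data)
leak_demo = pd.DataFrame({
    'exit_destination': [
        'Rental by client, no ongoing housing subsidy',
        'Client refused',
        'No exit interview completed',
        'Rental by client with RRH or equivalent subsidy',
    ],
    'bed_nights': [30, 12, 45, 60],
})

# Derive the boolean target from the destination column
permanent_destinations = [
    'Rental by client, no ongoing housing subsidy',
    'Rental by client with RRH or equivalent subsidy',
]
leak_demo['perm_leaver'] = leak_demo['exit_destination'].isin(
    permanent_destinations)

# Drop the target and the column it was derived from,
# otherwise the model can read the answer straight from the features
demo_target = 'perm_leaver'
leaky_columns = ['exit_destination']
X_demo = leak_demo.drop(columns=[demo_target] + leaky_columns)
y_demo = leak_demo[demo_target]

print(X_demo.columns.tolist())
print(y_demo.tolist())
```

The same check applies to any feature that is only known at or after the moment being predicted.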

In [None]:
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='median'), 
    RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
)

# Fit on train, score on val
pipeline.fit(X_train, y_train)
print('Validation Accuracy', pipeline.score(X_val, y_val))
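
Validation accuracy can be reported alongside metrics that hold up better on imbalanced classes. The sketch below does this on synthetic data from `make_classification`, with class weights chosen only to resemble an imbalanced binary target. All names ending in `_syn` are made up, and a plain random forest stands in for the pipeline above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced binary problem (not project data)
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, weights=[0.71, 0.29], random_state=42)

# Stratified split keeps the class ratio the same in train and validation
X_train_syn, X_val_syn, y_train_syn, y_val_syn = train_test_split(
    X_syn, y_syn, test_size=0.20, stratify=y_syn, random_state=42)

model_syn = RandomForestClassifier(n_estimators=100, random_state=42)
model_syn.fit(X_train_syn, y_train_syn)

# Hard class predictions for accuracy and recall
y_pred_syn = model_syn.predict(X_val_syn)
# Predicted probability of the positive class, which ROC AUC requires
y_proba_syn = model_syn.predict_proba(X_val_syn)[:, 1]

val_accuracy = accuracy_score(y_val_syn, y_pred_syn)
val_recall = recall_score(y_val_syn, y_pred_syn)
val_auc = roc_auc_score(y_val_syn, y_proba_syn)

print('Validation Accuracy', val_accuracy)
print('Validation Recall  ', val_recall)
print('Validation ROC AUC ', val_auc)
```

ROC AUC is computed from predicted probabilities rather than hard labels, so it does not depend on the 0.5 decision threshold the way accuracy and recall do.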

In [None]:
# Get feature importances
rf = pipeline.named_steps['randomforestclassifier']
importances = pd.Series(rf.feature_importances_, X_train.columns)

# Plot feature importances
%matplotlib inline
import matplotlib.pyplot as plt

n = 20
plt.figure(figsize=(10,n/2))
plt.title(f'Top {n} features')
importances.sort_values()[-n:].plot.barh(color='grey');