BloomTech Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency >= 50% and < 70%? If so, accuracy alone may be enough. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?
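
The distribution questions in the checklist above can be answered with a few lines of pandas. This is a minimal sketch that uses toy Series in place of a real target; the helper name `majority_class_frequency` and all sample values are hypothetical.

```python
import numpy as np
import pandas as pd


def majority_class_frequency(y: pd.Series) -> float:
    """Return the relative frequency of the most common class."""
    return y.value_counts(normalize=True).max()


# Toy classification target (hypothetical values)
y_class = pd.Series(['a', 'a', 'a', 'b', 'b', 'c', 'a', 'b', 'a', 'a'])
freq = majority_class_frequency(y_class)
# Rule of thumb from the checklist: accuracy alone is reasonable in [0.5, 0.7)
accuracy_ok = 0.5 <= freq < 0.7
print(f'Majority class frequency: {freq:.2f}, accuracy alone OK: {accuracy_ok}')

# Toy regression target with a long right tail (hypothetical values)
y_reg = pd.Series([1, 2, 2, 3, 3, 4, 5, 8, 20, 100], dtype=float)
print('Skew before log transform:', y_reg.skew())
y_log = np.log1p(y_reg)  # log(1 + y), safe when y contains zeros
print('Skew after log transform:', y_log.skew())
```

If you train on a log-transformed target, `np.expm1` converts predictions back to the original units.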

If you haven't found a dataset yet, do that today. [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2) and choose your dataset.

Some students worry, ***what if my model isn't “good”?*** Then, [produce a detailed tribute to your wrongness. That is science!](https://twitter.com/nathanwpyle/status/1176860147223867393)

In [None]:
!pip install category_encoders==2.*
!pip install pandas-profiling==2.*
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from category_encoders import OneHotEncoder, OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from pandas_profiling import ProfileReport
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np



In [None]:
def wrangle(fm_path):
    df = pd.read_csv(fm_path).set_index('resultId')

    # Drop columns that will not be used as features
    df.drop(columns=['positionText', 'milliseconds', 'positionOrder',
                     'number', 'laps', 'fastestLap'], inplace=True)

    # Drop high-cardinality categorical columns (HCCCs)
    cutoff = 100
    drop_cols = [col for col in df.select_dtypes('object').columns
                 if df[col].nunique() > cutoff]
    df.drop(columns=drop_cols, inplace=True)

    # Fill missing values with 0
    df.fillna(0, inplace=True)

    # Keep only rows with statusId == 1; .copy() avoids a SettingWithCopyWarning below
    df = df[df.statusId == 1].copy()
    df['change_pos'] = df['position'] - df['grid']

    return df

# Rows with 0 in `position` or `statusId` can be removed because 0 means the driver
# didn't finish the race, and that's not something I can predict.

In [None]:
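result = wrangle('results.csv')
# Percentage of missing values per column; printed explicitly because a cell
# only displays its last expression
print(result.isnull().mean() * 100)
result.head()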

In [None]:
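# Classification
target = 'position'
# `change_pos` is computed from the target (position - grid), so it is dropped to avoid leakage
X = result.drop(columns=[target, 'change_pos'])
y = result[target]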
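
A feature derived from the target, such as `change_pos` (computed as `position - grid` in `wrangle`), would leak the target if it is kept among the features. A minimal sketch of how to check for this, using a toy frame with made-up values:

```python
import pandas as pd

# Toy frame mimicking the three relevant columns (values are hypothetical)
df = pd.DataFrame({'grid': [1, 5, 3, 10], 'position': [2, 3, 3, 7]})
df['change_pos'] = df['position'] - df['grid']

# If the target can be rebuilt from the features, the features leak it
reconstructed = df['grid'] + df['change_pos']
leaks_target = (reconstructed == df['position']).all()
print('change_pos + grid reproduces position:', leaks_target)
```

A model given both `grid` and `change_pos` could score almost perfectly without learning anything useful, which is why such columns are usually excluded.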

In [None]:
#skewed right
y.hist()

In [None]:
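# Split 60/20/20: hold out 20% for test, then take 25% of the remaining 80%
# (20% of the total) for validation
X_remain, X_test, y_remain, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_remain, y_remain, test_size=0.25, random_state=42)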
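
The split above is random. If the observations are ordered in time, a time-based split may be more realistic, because the model is then validated on data that comes later than its training data. A minimal sketch, assuming a hypothetical `year` column that does not appear in the code above; all values are made up.

```python
import pandas as pd

# Toy data with a hypothetical `year` column
df = pd.DataFrame({
    'year': [2015, 2016, 2017, 2018, 2019, 2020, 2015, 2017, 2019, 2020],
    'grid': [1, 4, 2, 8, 3, 5, 7, 6, 9, 10],
    'position': [1, 3, 2, 6, 4, 5, 8, 7, 9, 10],
})
target = 'position'

# Train on the earliest years, validate on the next, test on the latest
train = df[df['year'] <= 2017]
val = df[(df['year'] > 2017) & (df['year'] <= 2019)]
test = df[df['year'] > 2019]

X_train, y_train = train.drop(columns=target), train[target]
X_val, y_val = val.drop(columns=target), val[target]
X_test, y_test = test.drop(columns=target), test[target]
print(len(X_train), len(X_val), len(X_test))
```

The cutoff years here are arbitrary; in practice you would pick them so that each subset is large enough to be useful.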


In [None]:
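# Class frequencies in the training set; the largest value is the
# majority-class baseline accuracy
baseline = y_train.value_counts(normalize=True)
print(baseline)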
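
The class frequencies printed above can be reduced to a single baseline number: the accuracy of always predicting the most frequent training class. A small sketch on toy labels (made-up values standing in for `y_train` and `y_val`):

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Toy labels standing in for y_train and y_val (hypothetical values)
y_train = pd.Series([1, 1, 1, 2, 2, 3, 1, 2])
y_val = pd.Series([1, 2, 1, 3])

# Baseline: always predict the most frequent training class
majority_class = y_train.mode()[0]
baseline_train_acc = y_train.value_counts(normalize=True).max()
baseline_val_acc = accuracy_score(y_val, [majority_class] * len(y_val))

print('Majority class:', majority_class)
print('Baseline training accuracy:', baseline_train_acc)
print('Baseline validation accuracy:', baseline_val_acc)
```

A model is only worth keeping if its validation score clearly beats this baseline.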


In [None]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

In [None]:
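# `model` was not defined earlier in the notebook. This pipeline is an assumed
# definition whose step names match the keys in `param_grid`.
model = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(),
    RandomForestClassifier(random_state=42)
)

param_grid = {
    'simpleimputer__strategy': ['mean', 'median'],
    'randomforestclassifier__max_depth': range(5, 25, 5),
    'randomforestclassifier__n_estimators': range(25, 200, 25)
}
model_1 = RandomizedSearchCV(
    model,
    param_distributions=param_grid,
    n_jobs=-1,
    cv=5,
    verbose=1,
    n_iter=25,
    random_state=42  # make the sampled parameter combinations reproducible
)
model_1.fit(X_train, y_train)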
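
The keys in `param_grid` follow the naming scheme of `make_pipeline`, which names each step after its lowercased class name and joins the step and parameter names with a double underscore. A short sketch for looking up the valid names; the pipeline here is an assumption inferred from those keys, not necessarily the one used for `model`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Assumed pipeline; adjust the steps to match your own model
pipe = make_pipeline(SimpleImputer(), RandomForestClassifier(random_state=42))

# Step names are the lowercased class names
step_names = list(pipe.named_steps)
print(step_names)

# Every tunable parameter is exposed as '<step>__<parameter>'
tunable = [name for name in pipe.get_params() if '__' in name]
print(tunable[:5])
```

A key that does not appear in `get_params()` makes the search raise an error, so this check is a quick way to catch typos before a long fit.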

In [None]:
best_score = model_1.best_score_
best_params = model_1.best_params_

print('Best score for `model`:', best_score)
print('Best params for `model`:', best_params)

In [None]:
print('Training accuracy:', model_1.score(X_train, y_train))
print('Validation accuracy:', model_1.score(X_val, y_val))
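
With many classes, accuracy alone can hide poor performance on rare classes. A hedged sketch on toy labels (made-up values) of two alternatives available in scikit-learn, balanced accuracy and macro-averaged F1:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Toy labels (hypothetical): class 1 dominates, class 3 is rare
y_true = [1, 1, 1, 1, 1, 1, 2, 2, 3, 3]
y_pred = [1, 1, 1, 1, 1, 1, 2, 1, 1, 1]

acc = accuracy_score(y_true, y_pred)
# Mean of per-class recall, so every class counts equally
bal_acc = balanced_accuracy_score(y_true, y_pred)
# Unweighted mean of per-class F1; zero_division=0 scores never-predicted classes as 0
macro_f1 = f1_score(y_true, y_pred, average='macro', zero_division=0)

print('Accuracy:', acc)
print('Balanced accuracy:', bal_acc)
print('Macro F1:', macro_f1)
```

In this toy case the predictions never include class 3, which plain accuracy barely penalizes but the two class-averaged metrics do.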
