In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
%matplotlib inline

In [None]:
# Read in the raw data

tax_df = pd.read_csv("./data/raw/taxinfo.csv")

In [None]:
# Review structure of data frame

tax_df.info()

In [None]:
# Basic EDA - sweetviz

import sweetviz

In [None]:
report = sweetviz.analyze(tax_df)

In [None]:
report.show_html("output/sweetviz_report.html")

<b>After reviewing the sweetviz html output, here are some things I noticed about the data:</b>

-  Median household income is $154K
-  There is an association between "College Grads" and "Household Debt Level." This would make sense because a lot of people have student loans to pay off once they graduate from college and/or maybe mom and dad take out a loan to help pay for their kids' college education. It looks like 85% of respondents have one or more college graduates living in the house.
-  There is an association between "Cars" and "Average Age of People in Household." The average age is 60.6 with a minimum age of 18, so it can probably be assumed that all respondents have their license. Only 17% of respondents said there were zero cars at the household and all other responses noted 1-5 cars at the household (almost 50% of respondents have 3-5 cars).
-  The responses are split pretty evenly amongst political party - Independent, Democrat, and Republican were all pretty much at 33% so it will be interesting to see how the regression will use other factors to predict political party since one variable is not more dominant than the others.
-  It is interesting that 50% of people are not submitting their taxes to the IRS in each year...not sure how that happens lol.
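The class balance and the median income noted above can be double-checked directly in pandas. This is only a minimal sketch on a tiny made-up frame (`demo_df` and its values are hypothetical); on the real data the same calls would be run on `tax_df`.

```python
import pandas as pd

# Hypothetical stand-in for tax_df; the values are made up for illustration
demo_df = pd.DataFrame({
    "HHI": [50_000, 120_000, 154_000, 200_000, 90_000, 175_000],
    "PoliticalParty": ["Democrat", "Republican", "Independent",
                       "Democrat", "Republican", "Independent"],
})

# Share of rows in each class (the shares sum to 1.0)
party_share = demo_df["PoliticalParty"].value_counts(normalize=True)
print(party_share)

# Median of a numeric column
median_hhi = demo_df["HHI"].median()
print(median_hhi)
```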

<b>Now I am going to see which variables are currently numeric and which are categorical:</b>

In [None]:
# Prior to creating my regression model, I'll split tax_df into features (X) and target (y). This is not the train/test split; that comes later.
# I also do not want to transform my y-variable (PoliticalParty) so I'll only be working off of the "X" data frame

X = tax_df.iloc[:, 0:9]
y = tax_df.iloc[:, 9]

In [None]:
X.info()
y

In [None]:
# Listing numeric and categorical columns in data frame X to help with data processing

categorical_cols = X.select_dtypes(include=['object']).columns.tolist()
numeric_cols = X.select_dtypes(include=['number']).columns.tolist()

all_cols = X.columns.tolist()

In [None]:
numeric_cols

In [None]:
categorical_cols

<b>I think "Married" and the "Filed" years should be categorical. The "Filed" columns are binary responses. "Married" is not binary but it appears to be categorical; maybe 0 = Single, 1 = Married, 2 = Divorced, etc.</b>
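One quick way to back up that judgment is to count the distinct values in each column. A minimal sketch, assuming a small made-up frame (`demo_X` and the cutoff of 3 are hypothetical choices for illustration). A low count is only a hint: a column like `Cars` also has few distinct values but is still a true count.

```python
import pandas as pd

# Hypothetical mini version of X; the values are made up
demo_X = pd.DataFrame({
    "HHI": [50_000, 120_000, 154_000, 200_000, 90_000],
    "Married": [0, 1, 2, 1, 0],
    "Filed_2017": [0, 1, 1, 0, 1],
})

# Number of distinct values per column; low counts hint at categorical data
n_unique = demo_X.nunique()
print(n_unique)

# Columns with few distinct values are candidates for categorical treatment
# (the cutoff of 3 is an arbitrary choice for this toy frame)
candidates = n_unique[n_unique <= 3].index.tolist()
print(candidates)
```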

In [None]:
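# Change some variables in data frame X to categorical

X = X.astype({"Married":'object', "Filed_2015":'object', "Filed_2016":'object', "Filed_2017":'object'})
X.info()

# Rebuild the column lists; they were created before the conversion, so categorical_cols was empty
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()
numeric_cols = X.select_dtypes(include=['number']).columns.tolist()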

<b>I have my data frame(s) where I want them, so now I can begin with regression modeling...let's see how this goes :)</b>

In [None]:
# Loading necessary packages/transformers

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split

In [None]:
# Create a pipeline

# Create transformer objects
numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

# Combine transformers into a preprocessor step
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_cols),
        ('cat', categorical_transformer, categorical_cols)])

# Classifier model with C=1
clf_model = LogisticRegression(penalty='l2', C=1, solver='saga', max_iter=500)

# Append classifier to preprocessing pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', clf_model)])

<b>Let's see a picture of my pipeline:</b>

In [None]:
from sklearn import set_config

set_config(display='diagram')
clf

<b>Time to partition the data for further model fitting and testing</b>

In [None]:
# Using the partitioning in the HW instructions

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)

# Fit model on new training data
clf.fit(X_train, y_train)

print(f"Training score: {clf.score(X_train, y_train):.3f}")
print(f"Test score: {clf.score(X_test, y_test):.3f}")

**Well, that score sucks...let's create a new model with C = 0.01, which puts more weight on the regularization penalty term.**

In [None]:
# Classifier model with C = 0.01
clf_model_C01 = LogisticRegression(penalty='l2', C=0.01, solver='saga', max_iter=500)

# Append classifier to preprocessing pipeline.
clf_C01 = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', clf_model_C01)])

# Fit model on training data 
clf_C01.fit(X_train, y_train)

print(f"Training score: {clf_C01.score(X_train, y_train):.3f}")
print(f"Test score: {clf_C01.score(X_test, y_test):.3f}")

**Not much better. I'm going to try a few different models with varying values of C to determine which one is best.**

In [None]:
# Classifier model with C = 0.05
clf_model_C05 = LogisticRegression(penalty='l2', C=0.05, solver='saga', max_iter=500)

# Append classifier to preprocessing pipeline.
clf_C05 = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', clf_model_C05)])

# Fit model on training data 
clf_C05.fit(X_train, y_train)

print(f"Training score: {clf_C05.score(X_train, y_train):.3f}")
print(f"Test score: {clf_C05.score(X_test, y_test):.3f}")

In [None]:
# Classifier model with C = 5 (same value as the next cell, despite the C50 name)
clf_model_C50 = LogisticRegression(penalty='l2', C=5, solver='saga', max_iter=500)

# Append classifier to preprocessing pipeline.
clf_C50 = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', clf_model_C50)])

# Fit model on training data 
clf_C50.fit(X_train, y_train)

print(f"Training score: {clf_C50.score(X_train, y_train):.3f}")
print(f"Test score: {clf_C50.score(X_test, y_test):.3f}")

In [None]:
# Classifier model with C = 5
clf_model_C5 = LogisticRegression(penalty='l2', C=5, solver='saga', max_iter=500)

# Append classifier to preprocessing pipeline.
clf_C5 = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', clf_model_C5)])

# Fit model on training data 
clf_C5.fit(X_train, y_train)

print(f"Training score: {clf_C5.score(X_train, y_train):.3f}")
print(f"Test score: {clf_C5.score(X_test, y_test):.3f}")

In [None]:
# Classifier model with C = 10
clf_model_C10 = LogisticRegression(penalty='l2', C=10, solver='saga', max_iter=500)

# Append classifier to preprocessing pipeline.
clf_C10 = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', clf_model_C10)])

# Fit model on training data 
clf_C10.fit(X_train, y_train)

print(f"Training score: {clf_C10.score(X_train, y_train):.3f}")
print(f"Test score: {clf_C10.score(X_test, y_test):.3f}")

**As I run all these models with different C values, I'm realizing that the larger my C value, the worse my scores get (or they do not change at all). I am going to try running smaller values of C.**

In [None]:
# Classifier model with C = 0.001
clf_model_C001 = LogisticRegression(penalty='l2', C=0.001, solver='saga', max_iter=500)

# Append classifier to preprocessing pipeline.
clf_C001 = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', clf_model_C001)])

# Fit model on training data 
clf_C001.fit(X_train, y_train)

print(f"Training score: {clf_C001.score(X_train, y_train):.3f}")
print(f"Test score: {clf_C001.score(X_test, y_test):.3f}")

In [None]:
# Classifier model with C = 0.0005
clf_model_C0005 = LogisticRegression(penalty='l2', C=0.0005, solver='saga', max_iter=500)

# Append classifier to preprocessing pipeline.
clf_C0005 = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', clf_model_C0005)])

# Fit model on training data 
clf_C0005.fit(X_train, y_train)

print(f"Training score: {clf_C0005.score(X_train, y_train):.3f}")
print(f"Test score: {clf_C0005.score(X_test, y_test):.3f}")

**It looks like my best score based on the C values I've used to this point is when C = 0.001 (training = 0.405, test = 0.308). I am going to roll with this one to create my final prediction, random forest and confusion matrix. The score is not great; I would rather it had been closer to 80%. Based on what I am gathering, the larger the C value, the lower the score, so the model needs stronger regularization to improve results.**
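Trying each C value by hand and comparing test scores works, but it also means the test set ends up being used to pick the model. Here is a minimal sketch of an alternative, assuming synthetic data from `make_classification` in place of the tax data and a simplified pipeline (`pipe`, `X_demo`, `y_demo` and the other names are made up for illustration). `GridSearchCV` tries the same C values with cross-validation on the training portion only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the tax data (an assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=4, n_redundant=0,
                                     n_classes=3, random_state=21)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=21)

# Simplified pipeline: scaling followed by L2-regularized logistic regression
pipe = Pipeline(steps=[("scaler", StandardScaler()),
                       ("classifier", LogisticRegression(max_iter=1000))])

# The same C values that were tried by hand above
param_grid = {"classifier__C": [0.0005, 0.001, 0.01, 0.05, 1, 5, 10]}

# 5-fold cross-validation on the training data only
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(f"CV score: {search.best_score_:.3f}")
print(f"Test score: {search.score(X_te, y_te):.3f}")
```

The test score is only looked at once, after the value of C has been chosen.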

In [None]:
# Final logistic regression classifier model
clf_LR_model_final = LogisticRegression(penalty='l2', C=0.001, solver='saga', max_iter=500)

# Append classifier to preprocessing pipeline.
clf_LR_final = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', clf_LR_model_final)])

# Fit model on training data 
clf_LR_final.fit(X_train, y_train)
print("Training score: %.3f" % clf_LR_final.score(X_train, y_train))

# Make predictions on the test data
clf_LR_final_predictions = clf_LR_final.predict(X_test)
print(clf_LR_final_predictions[:10])  # Print out a few predictions just to see what they look like

**Overall it looks like we only reach about 40% accuracy on the training data (and about 31% on the test data) when guessing someone's Political Party based on the variables provided in the data set and high regularization. Let's create a random forest:**

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Append random forest classifier to preprocessing pipeline.
clf_rf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier(oob_score=True, random_state=0))])

# Use the HW-specified partition (random_state=21) so both models are scored on the same split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)

clf_rf.fit(X_train, y_train)

print(f"Training score: {clf_rf.score(X_train, y_train):.3f}")
print(f"Test score: {clf_rf.score(X_test, y_test):.3f}")

**Wow, higher accuracy on the training data but the test data is still in line with the results from the regression modeling; we may be overfitting the data. Let's see what a confusion matrix looks like:**
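The random forest above is created with `oob_score=True`, so the out-of-bag score is available as a second check on overfitting without touching the test set. A minimal sketch, assuming synthetic data in place of the tax data (`X_demo`, `y_demo`, `rf` and `rf_small` are made-up names, and `min_samples_leaf=10` is just one arbitrary way to make the trees less flexible):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the tax data (an assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=4, n_redundant=0,
                                     n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=0)

# Fully grown trees, as in the default settings
rf = RandomForestClassifier(oob_score=True, random_state=0)
rf.fit(X_tr, y_tr)
train_score = rf.score(X_tr, y_tr)

print(f"Training score: {train_score:.3f}")
print(f"OOB score: {rf.oob_score_:.3f}")
print(f"Test score: {rf.score(X_te, y_te):.3f}")

# Less flexible trees: each leaf must hold at least 10 training rows
rf_small = RandomForestClassifier(min_samples_leaf=10, oob_score=True,
                                  random_state=0)
rf_small.fit(X_tr, y_tr)

print(f"Training score (min_samples_leaf=10): {rf_small.score(X_tr, y_tr):.3f}")
print(f"OOB score (min_samples_leaf=10): {rf_small.oob_score_:.3f}")
```

A large gap between the training score and the OOB score points the same way as a large gap between the training and test scores.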

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay  # plot_confusion_matrix was removed in scikit-learn 1.2

# Creating confusion matrix on training data

titles_options = [("Confusion matrix for train, without normalization", None),
                  ("Normalized confusion matrix for train", 'true')]

class_names = clf_rf['classifier'].classes_

for title, normalize in titles_options:
    disp = ConfusionMatrixDisplay.from_estimator(clf_rf, X_train, y_train,
                                                 display_labels=class_names,
                                                 cmap=plt.cm.Blues,
                                                 normalize=normalize)
    disp.ax_.set_title(title)

    print(title)
    print(disp.confusion_matrix)

plt.show()

In [None]:
# Creating confusion matrix on test data

titles_options = [("Confusion matrix for test, without normalization", None),
                  ("Normalized confusion matrix for test", 'true')]

class_names = clf_rf['classifier'].classes_

for title, normalize in titles_options:
    disp = ConfusionMatrixDisplay.from_estimator(clf_rf, X_test, y_test,
                                                 display_labels=class_names,
                                                 cmap=plt.cm.Blues,
                                                 normalize=normalize)
    disp.ax_.set_title(title)

    print(title)
    print(disp.confusion_matrix)

plt.show()

**I think one of the issues is how evenly spread the data is; there is not one Political Party more dominant than the others. It looks like Democrat is the "easiest" to predict whereas Independent is the most difficult to predict. Let's do a final random forest classifier model.**
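With three evenly split classes, always guessing one party is right about a third of the time, so that is the number the model scores should be compared against. A minimal sketch of such a baseline using scikit-learn's `DummyClassifier`, assuming a made-up, perfectly balanced target (`X_demo` and `y_demo` are hypothetical):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up, perfectly balanced three-class target
y_demo = np.array(["Democrat", "Independent", "Republican"] * 100)
# DummyClassifier ignores the feature values, so zeros are enough here
X_demo = np.zeros((len(y_demo), 1))

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

baseline_score = baseline.score(X_demo, y_demo)
print(f"Baseline accuracy: {baseline_score:.3f}")
```

A test score near this baseline suggests the features carry little information about the target.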

In [None]:
# Final random forest classifier model
clf_RF_model_final = RandomForestClassifier(oob_score=True, random_state=0)

# Append classifier to preprocessing pipeline.
clf_RF_final = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', clf_RF_model_final)])

# Fit model on training data 
clf_RF_final.fit(X_train, y_train)
print("Training score: %.3f" % clf_RF_final.score(X_train, y_train))

# Make predictions on the test data
clf_RF_final_predictions = clf_RF_final.predict(X_test)
print(clf_RF_final_predictions[:10])  # Print out a few predictions just to see what they look like

**No changes here...**

<div class="alert alert-block alert-success">
<b>In summary, it looks like it is hard to predict Political Party. I think it is a combination of things, such as the training data set or how evenly spread the Political Party response was amongst the three options. I ended up creating a "Play" variable to try different values of C and I could not for the life of me improve my score; it either stayed at 40% or went below. I tried to replace LogisticRegression with LogisticRegressionCV but I kept getting error messages about things not being defined (it is 11PM on a Sunday as I type this, so maybe I'll try again another day). I think the random forest overfit the training data because the training accuracy was 100%...so...there's that.</b>
</div>
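For reference, here is a minimal sketch of how `LogisticRegressionCV` can sit in a pipeline as the final step. It assumes synthetic data and a simplified pipeline with only a scaler (`X_demo`, `y_demo`, `cs_to_try` and `clf_cv` are made-up names), so it is not a drop-in replacement for the cells above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the tax data (an assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=4, n_redundant=0,
                                     n_classes=3, random_state=21)

# Candidate C values; LogisticRegressionCV picks among them by cross-validation
cs_to_try = [0.0005, 0.001, 0.01, 0.05, 1, 5, 10]

clf_cv = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegressionCV(Cs=cs_to_try, cv=5, max_iter=1000)),
])
clf_cv.fit(X_demo, y_demo)

# C_ holds the selected regularization strength(s)
chosen_c = np.atleast_1d(clf_cv["classifier"].C_)
print(chosen_c)
print(f"Training score: {clf_cv.score(X_demo, y_demo):.3f}")
```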
    
Update 5/18 - Submitting as is :)

<div class="alert alert-block alert-info">
<b>***********************************************************************************HW INSTRUCTIONS / NOTES BEYOND THIS POINT**********************************************************************************</b>
</div>

# HW1 - Classification models in sklearn

You'll be building a few classifier models and using some of the tech tools we learned about in Modules 1 and 2. 

## The Data

The data is a relatively small and simple dataset of taxpayer data. I got it from:

https://www.kaggle.com/dmaillie/sample-us-taxpayer-dataset

As you'll see if you visit that page, this dataset was used in a series of YouTube tutorials on using R to build random forest models. 

I read it into a pandas dataframe and used `info()` to get:

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1004 entries, 0 to 1003
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   HHI             1004 non-null   int64 
 1   HHDL            1004 non-null   int64 
 2   Married         1004 non-null   int64 
 3   CollegGrads     1004 non-null   int64 
 4   AHHAge          1004 non-null   int64 
 5   Cars            1004 non-null   int64 
 6   Filed_2017      1004 non-null   int64 
 7   Filed_2016      1004 non-null   int64 
 8   Filed_2015      1004 non-null   int64 
 9   PoliticalParty  1004 non-null   object
dtypes: int64(9), object(1)
memory usage: 78.6+ KB
```

Some information about the fields:

* `HHI` - household income
* `HHDL` - household debt level
* `Married` - categorical with a few levels
* `CollegGrads` - number of college grads in the household
* `AHHAge` - average age of people in the household
* `Cars` - number of cars in the household
* `Filed_2017` - 1 means they filed a tax return with the IRS for 2017
* `Filed_2016` - 1 means they filed a tax return with the IRS for 2016
* `Filed_2015` - 1 means they filed a tax return with the IRS for 2015
* `PoliticalParty` - categorical with 3 levels

## The Problem

Our overall goal is to build classifier models to predict `PoliticalParty` using the other variables. You must use sklearn Pipelines that contain your preprocessing steps and your model estimation step.

You can do your work in a Jupyter Notebook(s) or in a Python script(s) (i.e. a ``.py`` file) or both. It's up to you.

### Task 1

Start by creating a new project folder structure with the cookiecutter-datascience-simple template that I covered in Module 1. Put the data file into its appropriate folder and put this notebook in the main project folder. Any additional notebooks and/or Python files you end up creating should go in the main project folder.

**HL 5/12/2021 - DONE**

### Task 2

Put your new project folder under version control using git. You should **NOT** track the data file. You must track any notebooks, Python scripts or additional text files you end up creating.

**HL 5/12/2021 - DONE, just need to do another commit when finished**

### Task 3

Build at least one logistic regression model (with regularization) and one random forest model to predict `PoliticalParty`. Yes, this is very similar to what we did for the Pump it Up project in Module 2. Some detailed requirements and additional information:

* I suggest you start by reading the csv file into a pandas dataframe. My dataframe is called ``tax_df``.
* Then start with some basic EDA. You can certainly use automated tools such as pandas-profiling or SweetViz as I showed in the class notes. Remember, when you run either of those, you **must** have your notebook open in the classic Jupyter Notebook interface (and **NOT** in Jupyter Lab). Once you've created the EDA reports you can close your notebook and reopen in Jupyter Lab if you wish. As we've seen, the reports get created as HTML documents. These should go in your output folder within your project.
* Since we are using regularization, all of the numeric variables should be rescaled using the `StandardScaler` - be careful, just because a variable has a numeric datatype in the pandas dataframe, it does not mean that it's necessarily a numeric variable in the context of the classification models. Think about each column and look at your EDA reports and decide whether or not it's truly numeric or needs to be treated as categorical data in the models.
* For any variables that you decide should be treated as categorical in your models, use the `OneHotEncoder` on them in the preprocessing stage.
* Even though our target variable, `PoliticalParty`, is categorical, you do **NOT** need to do any preprocessing on it. As I mentioned in our class notes, scikit-learn will automatically detect that and will do any encoding needed on its own (it uses the `LabelEncoder`).
* I broke up the ``tax_df`` into two separate dataframes that I called ``X`` and ``y``, to use in the models. Here's my code for that:

```
X = tax_df.iloc[:, 0:9]
y = tax_df.iloc[:, 9]
```

* Please use the following code for your data partitioning so that we all end up with the same training and test split:

```
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
```

* For each model you fit, you should compute its ``score`` and create a confusion matrix on both train and test data. I did all of this repeatedly in the class notes.
* For each model (the logistic regression model and the random forest) you should make some summary comments about how well the model fits and predicts and if there is evidence of overfitting. 

**IMPORTANT** You always should put summary comments in a markdown cell. Do **NOT** write them as comments in a code cell. The whole point of Jupyter notebooks is to be able to mix markdown cells with code cells. If you choose to do all of your Python work in a ``.py`` file(s), then simply create a Jupyter notebook in which you include your summary comments.

## Optional Hacker Extra tasks
I always like to include some extra credit tasks for those who want to push themselves a little further. For this problem, consider doing one or more of the following:

* Try out the Histogram based Gradient Boosting Classifier shown in the optional materials at the end of Module 2. Compare its performance to logistic regression and the random forest.
* Create a second set of models in which you treat ``Filed_2017`` as a binary target variable and use ``PoliticalParty`` as a categorical feature variable. Is it any easier to predict ``Filed_2017`` than it was to predict ``PoliticalParty``?

## Deliverables
You should simply compress your entire project folder as either a zip file or a tar.gz file (do **NOT** ever use WinRAR to create rar files in this class). Note that when you do this, your "hidden" ``.git`` folder will get included. So, I'll be able to tell that you put the project under version control and I'll be able to look at your project folder structure. Before compressing the project folder to submit it:

* make sure all of your notebooks and .py files are in the main project folder and have good filenames,
* make sure you've committed all of your changes (git),
* upload your compressed folder in Moodle.