<a href="https://colab.research.google.com/github/mcnewcp/kaggle-tabular-playground-series-sep21/blob/ajayi-begins/kaggle-sep2021-tab-playground.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tabular Playground | September 2021
This notebook builds a machine learning model that follows the rules and specifications of the [Tabular Playground Series](https://www.kaggle.com/c/tabular-playground-series-sep-2021/overview) September 2021 competition.

According to the overview page, the associated dataset was originally used to predict whether or not a customer filed an insurance claim. The data has been anonymized and generated through the [CTGAN](https://github.com/sdv-dev/CTGAN) deep learning synthetic data generation process.

Coy McNew and Evan Amway are collaborating with me on this competition.

In [None]:
# Formatting Colab Output to Wrap Text
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

## Data Setup
To get started, we will import the required data-related files through the [Kaggle API](https://www.kaggle.com/docs/api). However, first, we will connect to Google Drive where the data will be housed after it has been extracted from Kaggle.

### Connect to Google Drive


In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/My Drive/Colab Notebooks/Kaggle"

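### Point the working directory to the Kaggle directory in Google Drive
The Kaggle API reads its credentials from a .json file, which is stored in Google Drive. The environment variable set above tells the API where to find it, and the next cell changes into the same directory.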


In [None]:
%cd /content/gdrive/My Drive/Colab Notebooks/Kaggle

/content/gdrive/My Drive/Colab Notebooks/Kaggle
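Before running any `kaggle` command, it can help to confirm that the credentials file really is where `KAGGLE_CONFIG_DIR` points. The following is a minimal sketch, assuming the file uses the default name `kaggle.json`; the helper `has_kaggle_credentials` is a hypothetical name introduced here for illustration and is not part of the Kaggle API.

```python
import os

def has_kaggle_credentials(config_dir):
    """Return True if a kaggle.json file exists directly inside config_dir."""
    return os.path.isfile(os.path.join(config_dir, "kaggle.json"))

# Fall back to ~/.kaggle when the environment variable is not set.
config_dir = os.environ.get("KAGGLE_CONFIG_DIR", os.path.expanduser("~/.kaggle"))
print("kaggle.json found:", has_kaggle_credentials(config_dir))
```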


In [None]:
# Install Kaggle API
!pip install -q kaggle

#### Acquire dataset for competition
Take a moment to confirm you are pointed toward the correct working directory, and then take a glance at the available files provided by the competition.

In [None]:
# Confirm working directory
!pwd

/content/gdrive/My Drive/Colab Notebooks/Kaggle


In [None]:
# Examine the files available for the tabular playground competition
!kaggle competitions files tabular-playground-series-sep-2021

name                  size  creationDate         
-------------------  -----  -------------------  
train.csv            862MB  2021-08-26 14:16:48  
sample_solution.csv    6MB  2021-08-26 14:16:48  
test.csv             444MB  2021-08-26 14:16:48  


Download the .csv files through the competition API. Each file arrives as a zipped .csv file (.csv.zip). Fortunately, pandas can read zipped .csv files directly.

In [None]:
# Download the zipped .csv files from competition API to file path in Google Drive

## sample solution .csv
!kaggle competitions download tabular-playground-series-sep-2021 --file sample_solution.csv --path "/content/gdrive/My Drive/Colab Notebooks/Kaggle/competitions/tabular-playground-sep2021"

## training set
!kaggle competitions download tabular-playground-series-sep-2021 --file train.csv --path "/content/gdrive/My Drive/Colab Notebooks/Kaggle/competitions/tabular-playground-sep2021"

## test set
!kaggle competitions download tabular-playground-series-sep-2021 --file test.csv --path "/content/gdrive/My Drive/Colab Notebooks/Kaggle/competitions/tabular-playground-sep2021"

sample_solution.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
Downloading train.csv.zip to /content/gdrive/My Drive/Colab Notebooks/Kaggle/competitions/tabular-playground-sep2021
 99% 391M/394M [00:04<00:00, 109MB/s]
100% 394M/394M [00:04<00:00, 99.5MB/s]
Downloading test.csv.zip to /content/gdrive/My Drive/Colab Notebooks/Kaggle/competitions/tabular-playground-sep2021
 99% 201M/203M [00:02<00:00, 102MB/s]
100% 203M/203M [00:02<00:00, 95.1MB/s]
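The download step above leaves `.csv.zip` archives on disk. `pd.read_csv` infers compression from the file extension (its default is `compression="infer"`), so an archive holding a single CSV can be read without unzipping it first. The sketch below demonstrates this on a throwaway file in a temporary directory; the file name `toy.csv.zip` and its contents are made up for illustration.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Build a tiny zipped CSV to stand in for train.csv.zip (toy data only).
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, "toy.csv.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("toy.csv", "id,f1,claim\n0,0.5,1\n1,0.25,0\n")

# compression="infer" is the default; the .zip extension triggers decompression.
df_toy = pd.read_csv(zip_path)
print(df_toy.shape)
```

For a quick look at a large file, the `nrows` argument of `pd.read_csv` limits how many rows are read.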


### Import Necessary Preprocessing Modules


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import os

### Prepare training and test sets
Before EDA can be helpful, the training and test sets need to be cleaned and processed.

First, let's take a look at the size of the training and test sets. 

In [None]:
# Import training data from competition .csv
df_main_trian = pd.read_csv("/content/gdrive/My Drive/Colab Notebooks/Kaggle/competitions/tabular-playground-sep2021/train.csv.zip")

# Import test data from competition .csv
df_main_test = pd.read_csv("/content/gdrive/My Drive/Colab Notebooks/Kaggle/competitions/tabular-playground-sep2021/test.csv.zip")


In [None]:
# Take a look at the dimensions of the freshly imported dataframes.
print("There are {} rows and {} columns in the training set.".format(df_main_trian.shape[0], df_main_trian.shape[1]))
print("There are {} rows and {} columns in the test set.".format(df_main_test.shape[0], df_main_test.shape[1]))

There are 957919 rows and 120 columns in the training set.
There are 493474 rows and 119 columns in the test set.


Check for any differences between the columns of the training and test sets.

In [None]:
def listDiff(li1, li2):
    # Symmetric difference: items that appear in exactly one of the two lists
    return list(set(li1) - set(li2)) + list(set(li2) - set(li1))

coltrain = df_main_trian.columns
coltest = df_main_test.columns

colDiff = listDiff(coltrain, coltest)
print("The following column(s) differ between the training and test sets: {}".format(colDiff))

The following column(s) differ between the training and test sets: ['claim']
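The same comparison can also be written with pandas' own `Index.symmetric_difference`, which returns the labels present in exactly one of the two indexes. Here is a minimal sketch on toy frames; the frames and the column names `f1` and `f2` are illustrative stand-ins, not the competition data.

```python
import pandas as pd

# Toy frames: the "test" frame lacks the target column, as in the competition.
toy_train = pd.DataFrame({"id": [0], "f1": [0.1], "f2": [0.2], "claim": [1]})
toy_test = pd.DataFrame({"id": [1], "f1": [0.3], "f2": [0.4]})

# Labels that appear in only one of the two column indexes.
col_diff = toy_train.columns.symmetric_difference(toy_test.columns).tolist()
print(col_diff)
```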


According to this exercise, only one column differs between the training and test sets: `claim`. It is omitted from the test set because it is the target, which will be predicted as a probability and submitted to Kaggle for the competition. We will split the training set to evaluate the model before making predictions with the `test.csv` provided by the competition.
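The split of the training set mentioned above could look like the following. This is only a sketch, assuming a stratified 80/20 hold-out on `claim` done with pandas alone (`DataFrameGroupBy.sample` requires pandas 1.1 or later). The split ratio, the seed, and the helper name `stratified_holdout` are illustrative choices, not settings taken from this notebook.

```python
import numpy as np
import pandas as pd

def stratified_holdout(df, target, frac=0.2, seed=0):
    """Split df into (train, valid), drawing `frac` of each target class for valid."""
    valid = df.groupby(target).sample(frac=frac, random_state=seed)
    train = df.drop(index=valid.index)  # assumes the index labels are unique
    return train, valid

# Toy data standing in for the competition training set.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "id": range(100),
    "f1": rng.normal(size=100),
    "claim": [0] * 50 + [1] * 50,
})

train_part, valid_part = stratified_holdout(toy, "claim")
print(len(train_part), len(valid_part))
print(valid_part["claim"].value_counts().to_dict())
```

Stratifying keeps the share of each `claim` class the same in both parts, so the validation score is not skewed by an unlucky draw.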

Next, let's examine the training set more closely by looking at the datatype of each column.

In [None]:
lst_train_dtypes = df_main_trian.dtypes.value_counts()
print(lst_train_dtypes)

float64    118
int64        2
dtype: int64


The results show that all but two of the columns are floats. The two exceptions are the `id` and `claim` columns, which are integers. Thus, working with mixed datatypes will not be an issue for this notebook.
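One way to confirm which columns carry the integer dtype is `DataFrame.select_dtypes`. This minimal sketch uses a toy frame; the column names `f1` and `f2` are placeholders, and the integer columns are cast to `int64` explicitly so the result does not depend on the platform default.

```python
import pandas as pd

toy = pd.DataFrame(
    {"id": [0, 1], "f1": [0.1, 0.2], "f2": [0.3, 0.4], "claim": [1, 0]}
).astype({"id": "int64", "claim": "int64"})

# Keep only the int64 columns and list their names in frame order.
int_cols = toy.select_dtypes(include="int64").columns.tolist()
print(int_cols)
```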

The next step will be to visualize these features to get a better understanding of how they will behave in the model. Following that step, the missing data will be analyzed.
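For the missing-data step, a common first look is the fraction of missing values per column, along with the number of rows that contain at least one missing value. The sketch below uses toy data; the values and column names are illustrative only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "f1": [0.1, np.nan, 0.3, 0.4],
    "f2": [np.nan, np.nan, 0.5, 0.6],
    "claim": [0, 1, 0, 1],
})

# Fraction of missing values in each column, largest first.
missing_frac = toy.isna().mean().sort_values(ascending=False)

# Number of rows with at least one missing value.
rows_with_missing = int(toy.isna().any(axis=1).sum())

print(missing_frac)
print(rows_with_missing)
```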

EDA
- descriptive stats
- find correlations
  * heatmap
- anticorrelations
- PCA (?)
- Feature selection
  * Chi-Squared test
  * [SelectKBest](https://www.datacamp.com/community/tutorials/feature-selection-python)
  * [Recursive Feature Elimination](https://machinelearningmastery.com/feature-selection-machine-learning-python/)
  * Feature Importance: subsample a tiny amount of data, run a tree-based model, and assess feature importance
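As a starting point for the correlation items in the list above, the sketch below ranks features by absolute Pearson correlation with `claim` on a random subsample. It runs on synthetic data; the subsample size, the seeds, and the feature names `f1` and `f2` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000
claim = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "f1": claim + rng.normal(scale=0.1, size=n),  # strongly related to the target
    "f2": rng.normal(size=n),                     # pure noise
    "claim": claim,
})

# Work on a subsample to keep the computation cheap on a large frame.
sample = toy.sample(n=500, random_state=0)

# Absolute correlation of every feature with the target, strongest first.
corr_with_target = (
    sample.corr()["claim"].drop("claim").abs().sort_values(ascending=False)
)
print(corr_with_target)
```

The full matrix returned by `sample.corr()` can be passed to `sns.heatmap` to produce the heatmap item from the list.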


## Exploratory Data Analysis