This notebook is an outline for approaching Kaggle competitions, and machine learning and data analysis in general.

In [1]:
import pandas as pd
import numpy as np
import os
os.chdir('/Users/Nick/Desktop/Projects/Kaggle/Titanic')

<font size = "6">Step 1: Exploratory Data Analysis</font>

<font size = "5">Missing Data, Entry Errors, and Outliers</font>

<font size = "4">Descriptions, Types, and Entry Errors</font>

1. Start by looking at the top of the dataset, as well as a description of each column and the type of each column

In [16]:
df.head()
df.describe()
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

- Look for the following things:
    - What information is actually given in the columns? Do you think it's relevant to predicting the response?
    - Are there maximum or minimum values that don't make sense or are impossible? Does the spread of the data seem weird (e.g., a mean of 0 but a max of 500)?
    - Are there types that aren't correct? Is the timestamp a string? Are the numbers objects, etc?
   

2. Use lambda functions if you need to change the type of a column (note you'll need the datetime library for datetime)

In [None]:
import datetime
df['timestamp'] = df['timestamp'].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%dT%H:%M:%S.%fZ'))

If other values don't make sense or are impossible, set them to NaN and then handle them the same way you handle missing values
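As a minimal sketch of setting impossible values to NaN: the toy `Age` column and the 0 to 120 range below are assumptions chosen for illustration, not values taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: an age of -5 or 250 is impossible.
toy = pd.DataFrame({'Age': [22.0, -5.0, 38.0, 250.0, 29.0]})

# Keep values inside a plausible range; everything else becomes NaN.
valid = toy['Age'].between(0, 120)
toy['Age'] = toy['Age'].where(valid, np.nan)

n_missing = int(toy['Age'].isna().sum())
print(toy)
print(n_missing)
```

After this step the invalid entries show up in the same NaN counts as genuinely missing values, so one strategy can cover both.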

<font size = "4">Missing Data</font>

1. Figure out the number of NaNs in each column as a proportion of the total data

In [2]:
df = pd.read_csv('train.csv')

In [9]:
df.isna().sum()/df.shape[0]

PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.198653
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.771044
Embarked       0.002245
dtype: float64

2. Figure out where there are zeroes, empty strings, 999's, etc.

In [13]:
df[df == 0].count(axis = 0)/df.shape[0]

PassengerId    0.0
Survived       0.0
Pclass         0.0
Name           0.0
Sex            0.0
Age            0.0
SibSp          0.0
Parch          0.0
Ticket         0.0
Fare           0.0
Cabin          0.0
Embarked       0.0
dtype: float64
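Zeroes are only one kind of placeholder. A rough sketch of checking for the other kinds mentioned above (empty strings and 999s) might look like this; the toy columns and the choice of 999 as the sentinel are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data containing three kinds of placeholder values.
toy = pd.DataFrame({
    'Fare': [7.25, 0.0, 999.0, 8.05],
    'Cabin': ['C85', '', 'E46', ' '],
})

# Share of exact zeros and of the 999 sentinel in a numeric column.
zero_share = (toy['Fare'] == 0).mean()
sentinel_share = (toy['Fare'] == 999).mean()

# Share of empty or whitespace-only strings in a text column.
empty_share = toy['Cabin'].str.strip().eq('').mean()

print(zero_share, sentinel_share, empty_share)
```

Taking the mean of a boolean Series gives the proportion of `True` values directly, which avoids dividing by `df.shape[0]` by hand.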

3. Decide what to do with missing data

- If the percentage of missing values is relatively low (< 5%), you can probably just remove those rows

In [None]:
# drops all rows with na from the dataset
df.dropna(inplace = True) 

# drops all rows with na in a certain subset of the columns
df.dropna(subset = ['column'], inplace = True) 

# drop columns with missing values
df.dropna(axis = 1, inplace = True)

# keep rows with at least 4 non-na values
df.dropna(thresh = 4, inplace = True)

- Otherwise, for numerical data, you can replace the missing values with the mean or median

In [None]:
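# pick one: fill with the mean or with the median
# (assigning back avoids the chained inplace call, which newer pandas versions warn about)
df['column'] = df['column'].fillna(df['column'].mean())
df['column'] = df['column'].fillna(df['column'].median())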

- For categorical features, usually you replace the missing values with the most common value

In [None]:
df['column'] = df['column'].fillna(df['column'].mode()[0])

<font size = "5">Use Plots to Look at the Data</font>

1. Look at the spread (distribution) of continuous data.  Does anything stand out? Does it appear to be normally distributed?

In [None]:
import seaborn as sns
# histplot replaces the deprecated distplot; kde = True overlays a density estimate
sns.histplot(df['column'], kde = True)

2. Look at bivariate plots to see which variables might be associated with the response variable

In [None]:
# pairplot takes the DataFrame plus lists of column names
sns.pairplot(df, x_vars = ['column_1', 'column_2'], y_vars = ['response'])

3. Look at boxplots to see how a numerical response changes based on category

In [None]:
sns.boxplot(x = df['column'], y = df['response'])

4. Look at count plots to see which variables are even relevant

In [None]:
# countplot accepts x or y, not both; use hue to split the counts by the response
sns.countplot(x = 'column', hue = 'response', data = df)

5. Use a heatmap to look at pairwise correlations

In [None]:
df_num = df.select_dtypes(include = 'number') # numeric columns only
sns.heatmap(df_num.corr())

<font size = "6">Step n: Decide Which Algorithm To Use</font>

<font size = "5">1. Dimensionality Reduction</font>

<font size = "4">Remove variables based on their association with the response variable</font>

Does the dataset have a very large number of variables? If so, consider dimensionality reduction - the goal is to select a subset (p << 100) of the variables that capture as much information as possible using the following techniques:

- Remove variables that are highly correlated with one another by looking at pairwise correlations

In [None]:
# correlations are only defined for numeric columns
df.select_dtypes(include = 'number').corr()
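One possible way to act on the pairwise correlations is to drop one variable from every highly correlated pair. This is only a sketch: the toy data, the column names, and the 0.9 cutoff are assumptions, and the right threshold depends on the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)

# Hypothetical toy data: 'a_copy' is almost a rescaled duplicate of 'a'.
toy = pd.DataFrame({
    'a': a,
    'a_copy': a * 2 + rng.normal(scale=0.01, size=200),
    'b': rng.normal(size=200),
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any column that is strongly correlated with an earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = toy.drop(columns=to_drop)

print(to_drop)
```

Which member of a correlated pair gets dropped here depends only on column order, so it is worth checking whether one of the two is easier to interpret before removing the other.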

- Use Random Forest to look at feature importance

In [None]:
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X_train, y_train)
features = X_train.columns # assumes X_train is a DataFrame without the response column
importances = model.feature_importances_
indices = np.argsort(importances)[-10:] # top 10 features
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], align = 'center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

<font size = "4">Other techniques to reduce dimensionality</font>

1. Principal component analysis: project the features onto a space with fewer dimensions that accounts for as much of the variance as possible.

PCA should be used for variables which are strongly correlated, but not when the goal is to identify the factors that have an effect on the response variable (interpretability is sacrificed for speed and fit improvement)

The result of PCA will be decorrelated vectors that can be used in other machine learning algorithms

In [None]:
# PCA is affected by scale, so you need to scale the features before 
# applying PCA

from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X)

from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
principalComponents = pca.fit_transform(X_scaled) # fit on the scaled features
principalDf = pd.DataFrame(principalComponents)
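Instead of fixing the number of components in advance, it can help to look at how much variance each component explains. The sketch below uses synthetic data and a 95% variance target, both of which are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(300, 2))

# Hypothetical data: 4 features built from 2 underlying signals plus small noise.
X_toy = np.hstack([base, base + rng.normal(scale=0.05, size=(300, 2))])

X_scaled = StandardScaler().fit_transform(X_toy)

# A float n_components keeps just enough components to reach that share of variance.
pca = PCA(n_components=0.95, svd_solver='full')
X_reduced = pca.fit_transform(X_scaled)

# Cumulative share of variance explained by the retained components.
explained = pca.explained_variance_ratio_.cumsum()
print(pca.n_components_, explained)
```

Because the four features here are driven by two signals, two components are enough to pass the target; on real data the cumulative curve is usually less clear-cut.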

If you have labeled responses, choose a supervised learning method

<font size = "5">n. Supervised learning methods</font>

<font size = "4">Classification methods</font>

Note that there is no single answer about which classification method is best for a given dataset, and different kinds of classifiers should always be tried and compared for a given dataset.  But sometimes you can use characteristics of the data to hint at which methods you should try.

In general, start with "simpler" models (like linear or logistic regression) first, and then move on to more complex models, and ultimately to deep learning methods
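One way to compare several classifiers on the same data is cross-validation. The following is a sketch on synthetic data from `make_classification`; the three candidate models and the 5-fold setting are assumptions, not a recommendation for any particular dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a real feature matrix and binary response.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    'logistic regression': LogisticRegression(max_iter=1000),
    'naive Bayes': GaussianNB(),
    'decision tree': DecisionTreeClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate.
scores = {
    name: cross_val_score(model, X_toy, y_toy, cv=5).mean()
    for name, model in candidates.items()
}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name}: {score:.3f}')
```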

Naive Bayes
- More concerned with speed than accuracy
- Don't need model to be explainable
- Dataset is very large
- Assumes that predictors are independent of one another given the class

In [None]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)

Linear SVM
- More concerned with speed than accuracy
- Don't need model to be explainable
- Dataset is smaller

- Need to scale the data

In [None]:
from sklearn import svm
clf = svm.SVC(kernel = 'linear')
clf.fit(X_train, y_train)
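Since the SVM needs scaled data, one option is to bundle the scaler and the classifier in a pipeline so the scaling is learned from the training set only. This is a sketch on synthetic data; the names `X_tr`, `X_te`, and `svm_pipeline` are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a real dataset.
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

# The scaler is fit on the training data, then reused unchanged at prediction time.
svm_pipeline = make_pipeline(StandardScaler(), SVC(kernel='linear'))
svm_pipeline.fit(X_tr, y_tr)

accuracy = svm_pipeline.score(X_te, y_te)
print(accuracy)
```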

Decision Tree
- When you want your model to be more explainable
- Better if the features are categorical, otherwise they are discretized

In [None]:
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)

Logistic regression
- When you want your model to be more explainable
- Assumes the data is linearly (or curvilinearly) separable in space (if you're not sure if the data is linearly separable, use a decision tree)
- Better if predicting a binary response

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

Random Forest
- More concerned with accuracy than speed


In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators = 100, max_depth = 20, random_state = 42)
rf.fit(X_train, y_train)

Neural networks
- Need to scale and standardize the data


See the 'Neural net outline' for how to implement neural network code

<font size = "5">n. Split the data into a training set and testing set.  Use grid search to find the optimal parameters for the model(s) you've chosen</font>

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,
                                                   random_state = 0)

# This example is for SVM
parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

clf = GridSearchCV(SVC(), parameters)
clf.fit(X_train, y_train)

# Get the best parameters
clf.best_params_
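After the search, the held-out test set gives a check on how well the tuned model generalizes. Here is a sketch on synthetic data with a deliberately small grid; the grid values and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for a real dataset.
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# Small illustrative grid; a real search would usually cover more values.
param_grid = {'kernel': ['linear', 'rbf'], 'C': [1, 10]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_tr, y_tr)

# Cross-validated accuracy of the best setting on the training data.
cv_accuracy = search.best_score_

# Accuracy of the refit best model on data the search never saw.
test_accuracy = search.score(X_te, y_te)

print(search.best_params_, cv_accuracy, test_accuracy)
```

A test score well below the cross-validated score is a hint that the search has overfit to the training folds.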

<font size = "6">n. Create an ensemble</font>

Use several different models to predict the outcome or category, then combine their predictions: take the average for a numerical response, or the most popular prediction (majority vote) for a categorical one
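For classification, the majority-vote idea can be sketched with scikit-learn's `VotingClassifier`. The three base models and the synthetic data below are assumptions chosen for illustration; any set of fitted-together classifiers could be used instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for a real dataset.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# voting='hard' returns the class predicted by the most base models.
ensemble = VotingClassifier(
    estimators=[
        ('logreg', LogisticRegression(max_iter=1000)),
        ('nb', GaussianNB()),
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ],
    voting='hard',
)
ensemble.fit(X_tr, y_tr)

preds = ensemble.predict(X_te)
accuracy = ensemble.score(X_te, y_te)
print(accuracy)
```

Setting `voting='soft'` would average predicted probabilities instead of counting votes, which requires every base model to support `predict_proba`.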