## Foundations: Clean Data

This notebook uses the Titanic dataset from [this Kaggle competition](https://www.kaggle.com/c/titanic/overview).

This dataset contains information about 891 people who were on board the ship when it sank on April 15th, 1912. As noted in the description on Kaggle's website, some people aboard the ship were more likely to survive the wreck than others. There were not enough lifeboats for everybody, so women, children, and the upper class were prioritized. Using the information about these 891 passengers, the challenge is to build a model to predict which people would survive based on the following fields:

- **Name** (str) - Name of the passenger
- **Pclass** (int) - Ticket class
- **Sex** (str) - Sex of the passenger
- **Age** (float) - Age in years
- **SibSp** (int) - Number of siblings and spouses aboard
- **Parch** (int) - Number of parents and children aboard
- **Ticket** (str) - Ticket number
- **Fare** (float) - Passenger fare
- **Cabin** (str) - Cabin number
- **Embarked** (str) - Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

![Clean Data](../../img/clean_data.png)

### Read in Data

In [1]:
import pandas as pd
data = pd.read_csv("/media/user/A/github/ml-projects/linear-regression-advertising-dataset/input/advertising.csv")
data.head()

| Unnamed: 0 | TV | Radio | Newspaper | Sales |
|---|---|---|---|---|
| 0 | 230.1 | 37.8 | 69.2 | 22.1 |
| 1 | 44.5 | 39.3 | 45.1 | 10.4 |
| 2 | 17.2 | 45.9 | 69.3 | 12.0 |
| 3 | 151.5 | 41.3 | 58.5 | 16.5 |
| 4 | 180.8 | 10.8 | 58.4 | 17.9 |


### Clean continuous variables

#### Fill missing for `Age`
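
For the step this heading names, one common choice is to replace missing ages with the mean of the observed ages. The following is a minimal sketch under that assumption; the toy DataFrame stands in for the Titanic data (its values are illustrative, not real passengers), and mean imputation is not confirmed here as the method the notebook intends.

```python
import pandas as pd

# Toy stand-in for the Titanic data (illustrative values only).
titanic = pd.DataFrame({"Age": [22.0, float("nan"), 38.0, float("nan"), 30.0]})

# Assumption: impute with the mean of the non-missing ages.
mean_age = titanic["Age"].mean()  # NaN values are skipped by default
titanic["Age"] = titanic["Age"].fillna(mean_age)

print(titanic["Age"].isnull().sum())
```

Assigning the result back to the column, rather than calling `fillna` with `inplace=True` on a column selection, avoids chained-assignment surprises in recent pandas versions.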

In [2]:
def predict_sales(radio, weight, bias):
    return weight*radio + bias

In [3]:
def cost_function(radio, sales, weight, bias):
    # Mean squared error between actual sales and the predictions weight*radio + bias
    companies = len(radio)
    total_error = 0.0
    for i in range(companies):
        total_error += (sales[i] - (weight*radio[i] + bias))**2
    return total_error / companies

#### Combine `SibSp` & `Parch`
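
Since `SibSp` and `Parch` both count family members aboard, combining them can be sketched as a simple row-wise sum. This is a minimal illustration on a toy DataFrame; the new column name `Family_cnt` is a hypothetical choice, not a name taken from this notebook.

```python
import pandas as pd

# Toy stand-in for the Titanic data (illustrative values only).
titanic = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 2, 1]})

# `Family_cnt` is a hypothetical name for the combined family-size feature.
titanic["Family_cnt"] = titanic["SibSp"] + titanic["Parch"]

print(titanic)
```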

In [4]:
def update_weights(radio, sales, weight, bias, learning_rate):
    weight_deriv = 0
    bias_deriv = 0
    companies = len(radio)

    for i in range(companies):
        # Calculate partial derivatives
        # -2x(y - (mx + b))
        weight_deriv += -2*radio[i] * (sales[i] - (weight*radio[i] + bias))

        # -2(y - (mx + b))
        bias_deriv += -2*(sales[i] - (weight*radio[i] + bias))

    # We subtract because the derivatives point in direction of steepest ascent
    weight -= (weight_deriv / companies) * learning_rate
    bias -= (bias_deriv / companies) * learning_rate

    return weight, bias
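
Written out, the comments in `update_weights` correspond to the partial derivatives of the mean squared error. In the notation below (introduced here for illustration), $m$ is `weight`, $b$ is `bias`, $x_i$ is `radio[i]`, $y_i$ is `sales[i]`, $N$ is `companies`, and $\alpha$ is `learning_rate`.

$$
J(m, b) = \frac{1}{N} \sum_{i=1}^{N} \bigl(y_i - (m x_i + b)\bigr)^2
$$

$$
\frac{\partial J}{\partial m} = \frac{1}{N} \sum_{i=1}^{N} -2 x_i \bigl(y_i - (m x_i + b)\bigr),
\qquad
\frac{\partial J}{\partial b} = \frac{1}{N} \sum_{i=1}^{N} -2 \bigl(y_i - (m x_i + b)\bigr)
$$

$$
m \leftarrow m - \alpha \frac{\partial J}{\partial m},
\qquad
b \leftarrow b - \alpha \frac{\partial J}{\partial b}
$$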

In [5]:
def train(radio, sales, weight, bias, learning_rate, iters):
    cost_history = []

    for i in range(iters):
        weight, bias = update_weights(radio, sales, weight, bias, learning_rate)

        # Calculate cost for auditing purposes
        cost = cost_function(radio, sales, weight, bias)
        cost_history.append(cost)

        # Log Progress
        if i % 10 == 0:
            print("iter={:d}    weight={:.2f}    bias={:.4f}    cost={:.2}".format(i, weight, bias, cost))

    return weight, bias, cost_history

#### Drop unnecessary variables
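
Which continuous columns count as unnecessary is not stated here. One plausible reading, assuming `SibSp` and `Parch` have already been combined into a single column (called `Family_cnt` below, a hypothetical name), is to drop the two source columns. A minimal sketch on a toy DataFrame:

```python
import pandas as pd

# Toy stand-in for the Titanic data after the combine step (illustrative values only).
titanic = pd.DataFrame({
    "Age": [22.0, 38.0],
    "SibSp": [1, 0],
    "Parch": [0, 2],
    "Family_cnt": [1, 2],  # hypothetical combined column
})

# Assumption: the source columns are redundant once the combined column exists.
titanic = titanic.drop(columns=["SibSp", "Parch"])

print(list(titanic.columns))
```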

In [6]:
radio = data['Radio'].values
sales = data['Sales'].values
weight = 0
bias = 0
lr = 0.01
iters = 100
train(radio,sales,weight,bias,lr,iters)

iter=0    weight=7.59    bias=0.3026    cost=3.8e+04
iter=10    weight=2405173039799.18    bias=73599287290.5942    cost=4.4e+27
iter=20    weight=815974020723917184827392.00    bias=24969141670289308909568.0000    cost=5.1e+50
iter=30    weight=276825655152063131060260668006465536.00    bias=8470979254131427124399864041963520.0000    cost=5.8e+73
iter=40    weight=93915298041452327294731439888541830077428531200.00    bias=2873846865521575588505351860423181092761108480.0000    cost=6.7e+96
iter=50    weight=31861509372640639389418406280951599831087152017021203382272.00    bias=974975331504933409782714414364953867539709003923550371840.0000    cost=7.7e+119
iter=60    weight=10809269636292856088171973252592657331885255059040396045391077683757056.00    bias=330768110314963329617902546569290030461484127346338398066860529549312.0000    cost=8.9e+142
iter=70    weight=3667130414430807992123323032464152800471756619739632767020847353568919970574761984.00    bias=1122157035834481873806553989304

(-1.006614932547681e+115,
 -3.08028322224193e+113,
 [38364.16330524686,
  7749556.174937002,
  1568119390.2759905,
  317310985414.3346,
  64208290697878.26,
  1.2992631155844442e+16,
  2.629075817425043e+18,
  5.3199691200807895e+20,
  1.0765026725753641e+23,
  2.1783171629468084e+25,
  4.4078531185034627e+27,
  8.919348130194725e+29,
  1.804841697960622e+32,
  3.652120656295389e+34,
  7.390113661054424e+36,
  1.4953991136399633e+39,
  3.0259595611633992e+41,
  6.123068538878864e+43,
  1.2390108847784284e+46,
  2.5071546445248688e+48,
  5.0732600405578486e+50,
  1.0265807693724723e+53,
  2.0772995423461308e+55,
  4.203442649007765e+57,
  8.505720885848681e+59,
  1.7211436869500366e+62,
  3.482756642128388e+64,
  7.047403375010331e+66,
  1.4260512414027028e+69,
  2.8856332394954074e+71,
  5.839116401378538e+73,
  1.1815528003416633e+76,
  2.3908874631539007e+78,
  4.837991886451058e+80,
  9.789739523118512e+82,
  1.9809665286729534e+85,
  4.00851154257525e+87,
  8.111275255984789e+89,
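
The output above shows the cost growing at every iteration instead of shrinking, which is the usual sign that the learning rate is too large for the scale of the input. The following is a minimal sketch on synthetic data that reproduces the effect and compares it with a smaller rate. The data, the helper names `mse`, `step`, and `run`, and the rate `0.0005` are all assumptions for illustration, not values from this notebook.

```python
import random


def mse(xs, ys, weight, bias):
    """Mean squared error of the line weight*x + bias."""
    n = len(xs)
    return sum((y - (weight * x + bias)) ** 2 for x, y in zip(xs, ys)) / n


def step(xs, ys, weight, bias, learning_rate):
    """One gradient descent update, same derivatives as update_weights."""
    n = len(xs)
    grad_w = sum(-2 * x * (y - (weight * x + bias)) for x, y in zip(xs, ys)) / n
    grad_b = sum(-2 * (y - (weight * x + bias)) for x, y in zip(xs, ys)) / n
    return weight - learning_rate * grad_w, bias - learning_rate * grad_b


def run(xs, ys, learning_rate, iters):
    """Return the cost after each iteration, starting from weight=0, bias=0."""
    weight, bias = 0.0, 0.0
    history = []
    for _ in range(iters):
        weight, bias = step(xs, ys, weight, bias, learning_rate)
        history.append(mse(xs, ys, weight, bias))
    return history


# Synthetic inputs on a 0-50 scale (assumed, loosely similar to the Radio column).
rng = random.Random(0)
xs = [rng.uniform(0, 50) for _ in range(200)]
ys = [0.2 * x + 9 + rng.gauss(0, 1) for x in xs]

diverging = run(xs, ys, learning_rate=0.01, iters=20)     # cost grows each step
converging = run(xs, ys, learning_rate=0.0005, iters=20)  # cost shrinks each step

print("lr=0.01   first/last cost:", diverging[0], diverging[-1])
print("lr=0.0005 first/last cost:", converging[0], converging[-1])
```

Scaling the input column before training is another common way to make a rate like `0.01` workable; the sketch only changes the rate to keep the comparison simple.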
 

In [7]:
titanic.head()

NameError: name 'titanic' is not defined
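
The `NameError` above indicates that no variable named `titanic` exists in the session at this point. A minimal sketch of defining it follows. To stay self-contained it reads a small in-memory CSV with made-up rows; with the real data, the `io.StringIO` object would be replaced by the path to the downloaded Kaggle file. The `Survived` column is assumed to be present as the prediction target.

```python
import io

import pandas as pd

# Toy CSV text standing in for the Kaggle file (illustrative rows only).
csv_text = """Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,3,Passenger A,male,22,1,0,T1,7.25,,S
1,1,Passenger B,female,38,1,0,T2,71.28,C85,C
"""

# pd.read_csv accepts a file path or any file-like object.
titanic = pd.read_csv(io.StringIO(csv_text))

print(titanic.head())
```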

### Clean categorical variables

#### Fill in missing & create indicator for `Cabin`

In [None]:
titanic.isnull().sum()

In [None]:
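# Compare survival rates for missing vs. recorded cabins (assumes a `Survived` column)
titanic.groupby(titanic['Cabin'].isnull())['Survived'].mean()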
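
The indicator named in this heading can be sketched as a 0/1 flag for whether a cabin was recorded. This is a minimal illustration on a toy DataFrame; the column name `Cabin_ind` is a hypothetical choice.

```python
import pandas as pd

# Toy stand-in for the Titanic data (illustrative values only).
titanic = pd.DataFrame({"Cabin": ["C85", None, "E46", None]})

# `Cabin_ind` (hypothetical name): 1 when a cabin is recorded, 0 when it is missing.
titanic["Cabin_ind"] = titanic["Cabin"].notna().astype(int)

print(titanic)
```

Keeping only the indicator sidesteps the need to invent a fill value for the missing cabin strings.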

#### Convert `Sex` to numeric

In [None]:
gender_num = {'male': 0, 'female': 1}
titanic['Sex'] = titanic['Sex'].map(gender_num)

#### Drop unnecessary variables

In [None]:
titanic.drop(['Cabin', 'Embarked', 'Name', 'Ticket'], axis=1, inplace=True)

### Write out cleaned data

In [None]:
titanic.to_csv('../../../titanic_cleaned.csv', index=False)