### Titanic Dataset Preparation for Survival Prediction

The Titanic sank after colliding with an iceberg, killing 1502 of the 2224 people on board. By analysing how survival varied with a few attributes such as gender, age, and social status, we can predict fairly accurately which passengers survived. Some groups, such as women, children, and the upper class, were more likely to survive than others, which tells us something about society's priorities and privileges at the time.

#### Download and save data

In [12]:
import pandas as pd 
import numpy as np
import random

In [2]:
data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
data.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2,?,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11,?,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,?,135,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"


In [3]:
data.describe()

Unnamed: 0,pclass,survived,sibsp,parch
count,1309.0,1309.0,1309.0,1309.0
mean,2.294882,0.381971,0.498854,0.385027
std,0.837836,0.486055,1.041658,0.86556
min,1.0,0.0,0.0,0.0
25%,2.0,0.0,0.0,0.0
50%,3.0,0.0,0.0,0.0
75%,3.0,1.0,1.0,0.0
max,3.0,1.0,8.0,9.0


In [6]:
data.survived.value_counts()

0    809
1    500
Name: survived, dtype: int64
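
The two counts above imply that roughly 38% of the passengers in this file survived, which agrees with the mean of `survived` reported by `describe()`. The sketch below rebuilds the target column from those counts instead of downloading the file, then computes the class shares. On the real frame, `data.survived.value_counts(normalize=True)` should give the same figures.

```python
import pandas as pd

# Rebuild the target column from the counts printed above (809 zeros, 500 ones).
survived = pd.Series([0] * 809 + [1] * 500, name="survived")

# normalize=True returns shares instead of raw counts.
class_share = survived.value_counts(normalize=True)
survival_rate = class_share.loc[1]

print(class_share)
print(f"survival rate: {survival_rate:.4f}")
```

The classes are moderately imbalanced, which is worth keeping in mind when choosing an evaluation metric later.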

In [7]:
data = data.replace('?', np.nan)
data.isnull().sum()

pclass          0
survived        0
name            0
sex             0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
dtype: int64
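
Raw counts are easier to compare once they are expressed as a share of the rows. The sketch below copies the non-zero counts from the output above and divides by the 1309 rows reported by `describe()`. On the real frame, `data.isnull().mean()` should produce the same shares directly.

```python
import pandas as pd

n_rows = 1309  # number of rows reported by describe()

# Missing-value counts copied from the output above (columns without gaps omitted).
missing = pd.Series(
    {"age": 263, "fare": 1, "cabin": 1014, "embarked": 2,
     "boat": 823, "body": 1188, "home.dest": 564}
)

# Share of rows with a missing value, largest first.
missing_share = (missing / n_rows).sort_values(ascending=False)
print(missing_share.round(3))
```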

In [8]:
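def get_first_cabin(value):
    # keep only the first cabin when several are listed, e.g. 'C22 C26' -> 'C22'
    try:
        return value.split()[0]
    except (AttributeError, IndexError):
        # NaN (a float) has no split(); an empty string yields no pieces
        return np.nan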

In [9]:
data['cabin'] = data['cabin'].apply(get_first_cabin)

In [10]:
data.cabin

0        B5
1       C22
2       C22
3       C22
4       C22
       ... 
1304    NaN
1305    NaN
1306    NaN
1307    NaN
1308    NaN
Name: cabin, Length: 1309, dtype: object
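
The same first-cabin extraction can also be written without `apply`, using the pandas string accessor. This is an alternative sketch rather than the approach used above; the three sample values stand in for the real `cabin` column.

```python
import numpy as np
import pandas as pd

# Sample values: a single cabin, several cabins, and a missing entry.
cabin = pd.Series(["B5", "C22 C26", np.nan])

# .str.split() splits on whitespace; .str[0] takes the first piece.
# Missing entries stay missing.
first_cabin = cabin.str.split().str[0]
print(first_cabin.tolist())
```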

In [11]:
data.to_csv('../datasets/titanic.csv', index=False)

### Credit Approval UCI Data Preparation

To download the Credit Approval dataset from the UCI Machine Learning Repository, visit this [website](http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/) and save the data file (`crx.data`) in the `datasets` folder.

Citation:
Dua, D. and Graff, C. (2019). [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science.

In [13]:
# load the dataset
data = pd.read_csv('../datasets/crx.data', header=None)

# create variable names according to UCI Machine Learning
# Repo information
varnames = ['A'+str(s) for s in range(1,17)]

# add column names 
data.columns = varnames

data.head()

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


In [15]:
data.tail(100)

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
590,b,30.17,6.500,u,g,cc,v,3.125,t,t,8,f,g,00330,1200,+
591,b,27.00,0.750,u,g,c,h,4.250,t,t,3,t,g,00312,150,+
592,b,23.17,0.000,?,?,?,?,0.000,f,f,0,f,p,?,0,+
593,b,34.17,5.250,u,g,w,v,0.085,f,f,0,t,g,00290,6,+
594,b,38.67,0.210,u,g,k,v,0.085,t,f,0,t,g,00280,0,+
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,b,21.08,10.085,y,p,e,h,1.250,f,f,0,f,g,00260,0,-
686,a,22.67,0.750,u,g,c,v,2.000,f,t,2,t,g,00200,394,-
687,a,25.25,13.500,y,p,ff,ff,2.000,f,t,1,t,g,00200,1,-
688,b,17.92,0.205,u,g,aa,v,0.040,f,f,0,f,g,00280,750,-


The '?' entries mark missing values, so we replace them with explicit nulls.
We then cast the variables to their appropriate types and encode the target, A16, as 1 or 0.

In [16]:
# replace '?' with np.nan
data = data.replace('?',np.nan)

# re-cast some variables to the correct types 
data['A2'] = data['A2'].astype('float')
data['A14'] = data['A14'].astype('float')

# encode target to binary
data['A16'] = data['A16'].map({'+':1, '-':0})
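
As an alternative to replacing and re-casting after loading, the placeholder can be declared while parsing. The sketch below assumes the same 16-column layout and uses two rows copied from the `tail()` output above in place of `crx.data`. With `na_values='?'`, pandas should parse A2 and A14 as floats without an explicit cast.

```python
import io
import pandas as pd

# Two rows copied from the tail() output above, standing in for crx.data.
raw = (
    "b,30.17,6.500,u,g,cc,v,3.125,t,t,8,f,g,00330,1200,+\n"
    "b,23.17,0.000,?,?,?,?,0.000,f,f,0,f,p,?,0,+\n"
)

varnames = ["A" + str(s) for s in range(1, 17)]

# na_values='?' turns the placeholder into NaN while parsing.
sample = pd.read_csv(io.StringIO(raw), header=None, names=varnames, na_values="?")

# Encode the target as binary, as in the cell above.
sample["A16"] = sample["A16"].map({"+": 1, "-": 0})

print(sample[["A2", "A4", "A14", "A16"]])
print(sample.dtypes[["A2", "A14"]])
```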

In [17]:
data.tail(100)

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
590,b,30.17,6.500,u,g,cc,v,3.125,t,t,8,f,g,330.0,1200,1
591,b,27.00,0.750,u,g,c,h,4.250,t,t,3,t,g,312.0,150,1
592,b,23.17,0.000,,,,,0.000,f,f,0,f,p,,0,1
593,b,34.17,5.250,u,g,w,v,0.085,f,f,0,t,g,290.0,6,1
594,b,38.67,0.210,u,g,k,v,0.085,t,f,0,t,g,280.0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,b,21.08,10.085,y,p,e,h,1.250,f,f,0,f,g,260.0,0,0
686,a,22.67,0.750,u,g,c,v,2.000,f,t,2,t,g,200.0,394,0
687,a,25.25,13.500,y,p,ff,ff,2.000,f,t,1,t,g,200.0,1,0
688,b,17.92,0.205,u,g,aa,v,0.040,f,f,0,f,g,280.0,750,0


In [19]:
data.isnull().sum()

A1     12
A2     12
A3      0
A4      6
A5      6
A6      9
A7      9
A8      0
A9      0
A10     0
A11     0
A12     0
A13     0
A14    13
A15     0
A16     0
dtype: int64

In [20]:
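# add more missing values at random positions;
# this will help with the demos of the recipes

random.seed(9001)
# random.randint includes both endpoints, so discard any draw equal to
# len(data); .loc also needs a list rather than a set in recent pandas
values = {random.randint(0, len(data)) for _ in range(100)}
values = sorted(v for v in values if v < len(data))
for var in ['A3', 'A8', 'A9', 'A10']:
    data.loc[values, var] = np.nan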

In [21]:
data.isnull().sum()

A1     12
A2     12
A3     92
A4      6
A5      6
A6      9
A7      9
A8     92
A9     92
A10    92
A11     0
A12     0
A13     0
A14    13
A15     0
A16     0
dtype: int64
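
Drawing 100 positions with replacement is why only 92 distinct rows were blanked. If an exact number of missing rows is wanted, sampling without replacement is one option. The sketch below uses NumPy's generator on a hypothetical toy frame; it is not the procedure used above and selects different rows than `random.seed(9001)` does.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the credit data (hypothetical values).
toy = pd.DataFrame({
    "A3": np.arange(20, dtype=float),
    "A8": np.arange(20, dtype=float) * 2,
    "A11": np.arange(20),
})

rng = np.random.default_rng(9001)
# replace=False guarantees distinct, in-range row positions.
rows = rng.choice(len(toy), size=5, replace=False)

# Blank the chosen rows in the selected columns only.
cols = ["A3", "A8"]
toy.loc[toy.index[rows], cols] = np.nan

print(toy.isnull().sum())
```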

In [22]:
# save the data 
data.to_csv('../datasets/creditApprovalUci.csv', index=False)

In [23]:
# let's find the categorical columns
cat_cols = [c for c in data.columns if data[c].dtypes=='O']
data[cat_cols].head()

Unnamed: 0,A1,A4,A5,A6,A7,A9,A10,A12,A13
0,b,u,g,w,v,t,t,f,g
1,a,u,g,q,h,t,t,f,g
2,a,u,g,q,h,,,f,g
3,b,u,g,w,v,t,t,t,g
4,b,u,g,w,v,t,f,f,s


In [25]:
# let's find the numerical columns
numeric_cols = [n for n in data.columns if data[n].dtypes != 'O']
data[numeric_cols].head()

Unnamed: 0,A2,A3,A8,A11,A14,A15,A16
0,30.83,0.0,1.25,1,202.0,0,1
1,58.67,4.46,3.04,6,43.0,560,1
2,24.5,,,0,280.0,824,1
3,27.83,1.54,3.75,5,100.0,3,1
4,20.17,5.625,1.71,0,120.0,0,1
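
The same split can be obtained with `select_dtypes`, which avoids comparing dtypes by hand. The sketch below uses a small frame built from a few values shown in the outputs above; the choice of columns is illustrative only. Note that the numeric list includes the target A16, so it may need to be dropped before modelling.

```python
import pandas as pd

# A few columns and rows taken from the head() outputs above.
toy = pd.DataFrame({
    "A1": ["b", "a", "a"],
    "A2": [30.83, 58.67, 24.5],
    "A9": ["t", "t", None],
    "A16": [1, 1, 1],
})

# 'number' matches integer and float columns; everything else is treated
# as categorical here.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(cat_cols)
```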
