# Previous notebook
We covered a baseline model that we would like to beat: 78% accuracy on our training data. We built that classifier on the heuristic that if a passenger is female, she survived. 

The code used to test that is listed below. 
```
# Change the strings into numerals so the data can be recognized by the classifiers
df.Sex = df.Sex.str.replace('female', '0')
df.Sex = df.Sex.str.replace('male', '1')
df.Sex = df.Sex.astype('int')

df.loc[df.Sex == 0, 'pred'] = 1
df.loc[df.Sex == 1, 'pred'] = 0

from sklearn.metrics import accuracy_score
accuracy_score(df.Survived, df.pred)
```

We also created a few classifiers using Naive Bayes and Logistic Regression, two simple classification models, with only two variables, Pclass and Sex. Those variables are obviously not enough; we have many more to explore that can help improve our classifier's accuracy. We are aiming for 81% accuracy, which would put us around 300th place out of 10,000 people, or the 97th percentile. A step towards a better result would be to switch classifiers. Naive Bayes is less attractive going forward because its predictions come from prior and posterior probabilities calculated under the assumption that the features are independent of each other given the class, which is unlikely to hold as we add related variables. Logistic regression does not rely on that assumption: it weights the impact of each variable and applies a 0.5 threshold to the resulting probability. But in order to improve logistic regression, we need to explore the features in a polynomial space so the model can explain more of the variation and improve accuracy. 

In this notebook, we will look at different classifiers such as AdaBoost, decision trees, and random forest, and compare their results against the logistic regression training accuracy of 79%. (Training accuracy obviously does not guarantee that we will do well on the actual test set, but it is a measurable metric.)

What to do today
- Explore variables more deeply. Run a generic correlation graph
- Clean the cabin data so the results are usable
- Fix the age variable by manually inputting data from online
- Run a few different models with two more variables and compare.
- Create another heuristic model that will include a second pivot of class.

# Import Data

In [1]:
# Import training Dataset
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
print(os.listdir("../input"))

df = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
full = pd.concat([df, test], ignore_index = True)

# Recreate Previous Heuristic Model
I recreate the previous model in this notebook so you do not have to switch back and forth. 
I adapted the model so that a passenger is predicted to survive only if she is female AND in the upper or middle class.
### Code

In [2]:
# Change the strings into numerals so the data can be recognized by the classifiers
df.Sex = df.Sex.str.replace('female', '0')
df.Sex = df.Sex.str.replace('male', '1')
df.Sex = df.Sex.astype('int')

# Create "pred" column with survivability
df.loc[(df.Sex == 0) & (df.Pclass < 3), 'pred'] = 1
df.loc[(df.Sex == 1) | (df.Pclass == 3), 'pred'] = 0

from sklearn.metrics import accuracy_score
accuracy_score(df.Survived, df.pred)



In [3]:
# Remove df predictions
df = df.iloc[:, :-1]

### Results
It is interesting that this different heuristic gives the same accuracy as predicting survival for all women, even though it narrows the prediction from sex alone to sex combined with class. The ratio of right answers to wrong answers stayed the same. 
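
Two heuristics can share the same accuracy while making different mistakes. The sketch below illustrates that on a small made-up frame (the rows are invented for illustration and are not taken from the Titanic data); the column names `pred_sex` and `pred_sex_class` are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration only (not the Titanic training set).
toy = pd.DataFrame({
    "Sex":      [0, 0, 0, 0, 1, 1, 1, 1],   # 0 = female, 1 = male
    "Pclass":   [1, 2, 3, 3, 1, 2, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Heuristic 1: every female survives.
toy["pred_sex"] = (toy.Sex == 0).astype(int)
# Heuristic 2: only females in first or second class survive.
toy["pred_sex_class"] = ((toy.Sex == 0) & (toy.Pclass < 3)).astype(int)

accuracies = {}
for col in ["pred_sex", "pred_sex_class"]:
    accuracies[col] = (toy[col] == toy.Survived).mean()
    print(col, "accuracy:", accuracies[col])
    # Rows are the true labels, columns are the predictions.
    print(pd.crosstab(toy.Survived, toy[col]))
```

On this toy frame both heuristics score 0.75, but the cross-tabulations show that the errors fall on different passengers. Running the same `pd.crosstab` on the real `df` would show where the two heuristics actually disagree.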

# Get More Variables
We previously explored Passenger Class and Sex, with a quick touch on Age. We did not see many statistical patterns for age, but I will explore it again because I feel my previous exploration was not enough. We will create a correlation graph to quickly go over all precleaned variables, and explore any basic patterns that appear. 
### Code

In [4]:
df.head()

#### Patterns in the cabin
One possible pattern shows up just by looking at the head of the dataset. All three Null values for Cabin are in Pclass 3. Likewise, those passengers embarked from S, which is Southampton.

### Code

In [5]:
print(df.Cabin[df.Pclass == 3].value_counts())
print('            ')
print(df[(df.Cabin.isnull()) & (df.Pclass == 3)].info())
print('            ')
print(df.Embarked[(df.Cabin.isnull())].value_counts())

Of the lower-class passengers in the training set, 12 had a recorded cabin and the remaining 479 did not. Another 214 lower-class passengers in the test set have no recorded cabin. 
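
The same counts can be read off in one table per class instead of from `info()` output. This is a minimal sketch on a made-up frame; the `CabinMissing` column and the `cabin_summary` name are hypothetical helpers, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Cabin":  ["C85", np.nan, np.nan, "G6", np.nan, np.nan],
})

# Count missing and total cabin entries for each passenger class.
cabin_summary = (
    toy.assign(CabinMissing=toy.Cabin.isnull())
       .groupby("Pclass")["CabinMissing"]
       .agg(missing="sum", total="count")
)
cabin_summary["recorded"] = cabin_summary["total"] - cabin_summary["missing"]
print(cabin_summary)
```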

In [6]:
print(full[(full.Cabin.isnull()) & (full.Pclass == 3)].info())

This is tricky to deal with, because 3rd class passengers were located on all floors but were not assigned cabins. There are also passengers in higher classes who were not assigned cabins, so if all of these values stay Null, the learning algorithm may confuse Null with not surviving. Likewise, since the main ticket holder in the upper class is usually a man, the women who survived may also have Null cabin information. 

I am unsure what to do, so I will try two iterations, one with cabin information and one without. In the meantime, I will clean it to reflect which section each cabin was in, drop the numbers, then convert the section letters into categorical codes. 

### Code

In [7]:
df.Cabin[df.Cabin.notnull()].head(10)

In [8]:
df['Section'] = df.Cabin.str.extract(r'(\w)\d*', expand = False) # raw string so \w and \d are read as regex escapes
print(df.Section[df.Cabin.notnull()].head(10))

In [9]:
df.iloc[27,:]

Cabin extraction was successful. We also found a young adult who had sisters and did not survive; his father did not survive either. It would seem that a male traveling with female family members would not survive. Capturing this requires a lot of cleaning; hopefully we will not have to go there to get to the 97th percentile. 

We also need to do the same adjustments to the test set, which will be done later. 
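
One way to keep the training and test sets consistent is to wrap the cabin cleaning in a function and apply it to both. The sketch below is an assumption about how that could look, not the notebook's final method: `add_section`, `SECTIONS`, the `'U'` placeholder for missing cabins, and `SectionCode` are all hypothetical names, and the frames are made up.

```python
import numpy as np
import pandas as pd

# Assumed set of section letters; 'U' is a placeholder for a missing cabin.
SECTIONS = list("ABCDEFGTU")

def add_section(frame):
    """Return a copy of frame with a 'Section' letter and an integer 'SectionCode'."""
    out = frame.copy()
    # First letter found in the cabin string; missing cabins become 'U'.
    out["Section"] = out["Cabin"].str.extract(r"([A-Za-z])", expand=False).fillna("U")
    # A fixed category list gives the same code for the same letter in every frame.
    out["SectionCode"] = pd.Categorical(out["Section"], categories=SECTIONS).codes
    return out

# Made-up rows for illustration only.
train_toy = pd.DataFrame({"Cabin": ["C85", np.nan, "E46"]})
test_toy = pd.DataFrame({"Cabin": ["B45", np.nan]})

train_clean = add_section(train_toy)
test_clean = add_section(test_toy)
print(train_clean)
print(test_clean)
```

Fixing the category list up front avoids the situation where a letter that appears only in one set receives a different code in the other.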

## General Search

In [10]:
import matplotlib.pyplot as plt
import seaborn as sns

df.corr()

In [11]:
full.corr()

A few correlation coefficients stand out. 
- Survived vs Pclass, Sex, Fare
- Pclass vs Sex, Age, and Fare
- Sex vs SibSp, Parch, Fare
- Age vs SibSp, Parch
- SibSp vs Parch, Fare
- Parch vs Fare

We looked at Survived vs Pclass, Sex, and Fare already. We found Pclass to be a stronger determinant of survivability, but research would suggest that high Fare outliers stand as an exception.     
Higher Pclass values (that is, the lower classes) skew slightly male, as shown by a weak positive correlation with Sex, where male is encoded as 1. Older people tend to be in a higher class, and higher class people pay a higher fare. Something to explore is the distribution of gender across classes.     
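Since Sex is encoded as 0 for female and 1 for male, the negative coefficients here mean that females were more likely to travel with siblings or spouses and with parents or children, and paid more.    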
Older individuals had fewer siblings or spouses, and fewer parents or children, aboard.    
Passengers with siblings or spouses aboard were likely to have parents or children present as well, and paid a higher fare.    
Passengers with parents or children aboard paid a higher fare.     

This is repeated below in a heatmap and again in a scatter histogram matrix.     
The heatmap is only a visualization, and it does not do a good job because the weaker correlations are drowned out by the hues for the 1:1 correlation on the diagonal. It is unhelpful.      
The scatter histogram matrix shows the general distributions, but also that we are dealing with largely categorical variables, so the correlations are easily muddled. For instance, older individuals are more likely to be rich, but younger individuals are more likely to be saved with their mothers. Age brings information, but not so much in the form of a survived vs did not survive correlation. 
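
One common way to stop the diagonal from dominating a correlation heatmap is to mask the diagonal and the redundant upper triangle, then scale the colors to the largest remaining coefficient. The sketch below uses random numbers rather than the Titanic data, and the names `mask`, `masked`, and `limit` are illustrative.

```python
import numpy as np
import pandas as pd

# Random data for illustration only.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])
corr = toy.corr()

# True on the diagonal and above it, so each pair of variables appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool))
masked = corr.mask(mask)

# Largest remaining absolute correlation, used as the color scale limit.
limit = masked.abs().max().max()
print(masked.round(2))

# With seaborn imported as sns, the same mask can be passed to the plot:
# sns.heatmap(corr, mask=mask, vmin=-limit, vmax=limit, cmap="coolwarm", annot=True)
```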

What we can gather from this, as another area to explore, is that families really need to be grouped together. Fares are higher for passengers with SibSp and Parch, which suggests that one person is paying for everyone else, or that the fare is replicated across family members. 
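
If the fare really is replicated across everyone on a ticket (an assumption that would need checking against the data), a per-person fare can be sketched by dividing by the number of passengers sharing the ticket. The rows below are made up, and `TicketSize` and `FarePerPerson` are hypothetical column names.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Ticket": ["A1", "A1", "A1", "B2", "C3", "C3"],
    "Fare":   [30.0, 30.0, 30.0, 8.0, 20.0, 20.0],
})

# Number of passengers holding each ticket, aligned back to every row.
toy["TicketSize"] = toy.groupby("Ticket")["Ticket"].transform("count")
# Assumes the recorded Fare is the total for the whole ticket.
toy["FarePerPerson"] = toy["Fare"] / toy["TicketSize"]
print(toy)
```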

In [12]:
plt.figure(figsize=(15,10))
plt.subplot(231)
plt.scatter(full.Fare[full.Pclass==1], full.SibSp[full.Pclass==1], alpha=0.5)
plt.title('Pclass == 1 SibSp')
plt.subplot(232)
plt.scatter(full.Fare[full.Pclass==2], full.SibSp[full.Pclass==2], alpha=0.5)
plt.title('Pclass == 2 SibSp')
plt.subplot(233)
plt.scatter(full.Fare[full.Pclass==3], full.SibSp[full.Pclass==3], alpha=0.5)
plt.title('Pclass == 3 SibSp')

plt.subplot(234)
plt.scatter(full.Fare[full.Pclass==1], full.Parch[full.Pclass==1], alpha=0.5)
plt.title('Pclass == 1 ParCh')
plt.subplot(235)
plt.scatter(full.Fare[full.Pclass==2], full.Parch[full.Pclass==2], alpha=0.5)
plt.title('Pclass == 2 ParCh')
plt.subplot(236)
plt.scatter(full.Fare[full.Pclass==3], full.Parch[full.Pclass==3], alpha=0.5)
plt.title('Pclass == 3 ParCh')

plt.show()

We can see that, at least in the lower class, fare increases with SibSp and Parch. This does not hold in the higher classes, where most people brought 1 or 0 SibSp, which could suggest they were single, or traveling with a spouse and no siblings. It would be helpful to have this information split up, which is another task for feature extraction.

In [13]:
sns.heatmap(df.corr())

In [14]:
pd.plotting.scatter_matrix(df, alpha=0.2, figsize=(10, 10), diagonal='hist')
plt.show()

### Age
A different exploration that should give a clearer picture is the graph below. It went against my instincts to think that children were not spared along with women on the Titanic.

In [15]:
df.Age[df.Survived ==1].plot(kind='kde')
df.Age[df.Survived ==0].plot(kind='kde')
plt.legend(['Survived', 'Not'])
plt.show()

In [16]:
df.Age.isnull().value_counts()

In [17]:
full.Age.isnull().value_counts()

There are 263 missing age values in the full dataset. That is a lot to fill in. But data isn't easy. 

In [18]:
df.sort_values('Name', inplace=True)

In [19]:
# I used this function to cycle through the ranges
df[df.Age.isnull()].iloc[110:130,[2, 3, 4, 5]]
# full.sort_values('Name')

In [20]:
def CA(index, age):
    df.loc[index, "Age"] = age

In [21]:
indexmissing = list(df[df.Age.isnull()].index)

In [22]:
agelist = [48, 18, 40, 40, 37, 45, 28, 21, 18, 48, 40, 22, 34, 30, 22, 29, 26, 49, 42, 28, 48, 32, 49, 39, 39, 23, 'NaN', 18, 46, 28, 19, 21, 37, 29, 45, 33, 21, 22, 43, 20, 21, 26, 29, 7, 35, 19, 46, 23, 44, 20, 22, 40, 22, 22, 41, 41, 'NaN', 35, 23, 38, 23, 5, 3, 8, 12, 31, 20, 30, 21, 26, 18, 28, 29, 22, 19, 29, 24, 31, 20, 22, 19, 62, 32, 28, 25, 23, 23, 19, 28, 'NaN', 28, 30, 29, 45, 4, 35, 28, 21, 22, 18, 25, 32, 22, 27, 21, 27, 18, 17, 27, 24, 16, 21, 27, 27, 21, 30, 16, 24, 39, 2, 24, 'NaN', 29, 27, 27, 30, 18, 69, 45, 30, 49, 39, 30, 30, 47, 19, 5, 8, 14, 20, 18, 16, 19, 16, 22, 42, 55, 22, 40, 20, 42, 20, 37, 57, 'NaN', 57, 23, 64, 48, 37, 37, 33, 20, 23, 17, 19, 66, 21, 23, 28, 43, 54, 45, 23, 45, 19, 23]

In [23]:
# For the test set
test[test.Age.isnull()].iloc[60:,[1, 2, 3, 4]]

In [24]:
agelist2 = [32, 48, 17, 36, 26, 24, 37, 27, 47, 32, 32, 23, 23, 31, 29, 21, 28, 25, 20, 24, 37, 20, 25, 25, 24, 26, 40, 23, 44, 59, 30, 18, 31, 36, 26, 20, 17, 10, 43, 63, 31, 29, 41, 20, 26, 25, 32, 21, 34, 8, 25, 20, 'NaN', 44, 43, 31, 26, 28, 18, 20, 22, 46, 25, 23, 33, 20, 40, 25, 25, 17, 16, 10, 44, 10, 19, 21, 44, 28, 23, 64, 26, 22, 21, 23, 35, 4]
indexmissing2 = list(test[test.Age.isnull()].index)

The code to combine everything should be
```
for x in range(len(indexmissing)):
    CA(indexmissing[x], agelist[x])
```
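
The same assignment can be written without a loop by passing the whole index list to `.loc`. This is a sketch on a made-up frame; `missing_index` and `looked_up_ages` stand in for `indexmissing` and `agelist`. It assumes, as the loop does, that the ages are listed in the same order as the missing rows appear in the frame.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({"Age": [22.0, np.nan, 35.0, np.nan, np.nan]})

missing_index = list(toy[toy.Age.isnull()].index)
# np.nan (not the string 'NaN') marks ages that could not be found.
looked_up_ages = [40, np.nan, 18]

# Guard against the two lists drifting out of step.
assert len(missing_index) == len(looked_up_ages)
toy.loc[missing_index, "Age"] = looked_up_ages
print(toy)
```

Using `np.nan` rather than the string `'NaN'` keeps the `Age` column numeric.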

### Results
Through that long, tedious search for ages, I found out that Miss can indicate several things: a young girl, an unmarried woman, or a woman who is engaged but not using her husband's name. Master always refers to a young boy. I suppose I will also have to go through the names to collect information. On top of that, I found that people who traveled together had similar ticket numbers. Apparently tickets do matter, so I have to extract features from that variable as well. 
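
Titles such as Miss and Master can be pulled out of the Name column with a regular expression. The sketch below assumes names follow the `Surname, Title. Given names` pattern; the `titles` variable is a hypothetical name.

```python
import pandas as pd

# A few names in the "Surname, Title. Given names" format.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
])

# Capture the text between the comma and the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

A title column like this could later be used to separate young boys (Master) from adult men, or to help estimate missing ages.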

### Singles

In [25]:
df.loc[(df.SibSp == 0) & (df.Parch == 0), 'Alone'] = 1
df.loc[~((df.SibSp == 0) & (df.Parch == 0)), 'Alone'] = 0
df[df.Alone.notnull()].tail()

In [26]:
df.Alone.count()

In [27]:
plt.figure(figsize=(15,10))
plt.subplot(231)
df.Survived[df.Alone==1].value_counts().plot(kind='bar')
plt.title('Alone')
plt.subplot(232)
plt.title('With Family')
df.Survived[df.Alone==0].value_counts().plot(kind='bar')

In [28]:
plt.figure(figsize=(17,10))

plt.subplot(231)
plt.title('Female Fam')
df.Survived[(df.Alone==0) & (df.Sex==0)].value_counts().plot(kind='bar')
plt.subplot(232)
plt.title('Male Fam')
df.Survived[(df.Alone==0) & (df.Sex==1)].value_counts().plot(kind='bar')
plt.subplot(234)
plt.title('Female Alone')
df.Survived[(df.Alone==1) & (df.Sex==0)].value_counts().plot(kind='bar')
plt.subplot(235)
plt.title('Male Alone')
df.Survived[(df.Alone==1) & (df.Sex==1)].value_counts().plot(kind='bar')

plt.show()

These results suggest that being alone with no family members decreases the chances of survival for both genders proportionally. 
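
The bar charts show counts, so a table of survival rates makes the "proportionally" claim easier to check. This is a minimal sketch on made-up rows; `rates` is a hypothetical name.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Sex":      [0, 0, 0, 0, 1, 1, 1, 1],   # 0 = female, 1 = male
    "Alone":    [0, 0, 1, 1, 0, 0, 1, 1],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Survival rate and group size for each combination of Sex and Alone.
rates = toy.groupby(["Sex", "Alone"])["Survived"].agg(rate="mean", n="count")
print(rates)
```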

# Tickets
It came to my attention that tickets are a good indicator of survivability for groups. If a group has one survivor, it is likely to have another. 

This would imply that a Nearest Neighbor algorithm on the ticket is plausible. People who purchase two rooms together at the same time would have ticket numbers that are very close. In addition, if they embark from the same location, that increases the chance that they are a group. Similarly, if people share a ticket and one is of the opposite gender, it is likely that the man would not survive but the woman would. In addition, closely purchased tickets could stand in for the missing Cabin information. Finally, if one male did not survive in the training data, that could be generalized to the other males over 18 in the group also not surviving. 

You see a lot of patterns when looking at the tickets. I also found a mislabeled Embarked entry: index 604, Mr. Harry Homer, embarked from Southampton, not Cherbourg.
And there's more: some passengers have a zero fare.

In [29]:
full.sort_values('Ticket').tail(40)

Consecutiveness: subtract one ticket number from another to tell whether there is an adjacent ticket. Candidate features: 
- Has consecutive
- Has people on same ticket
- consecutive survived
- people on same ticket survived
- consecutive gender
- same ticket gender 

If females on same ticket are dead, everyone should be dead.      
If male in cabin, female likely to survive. 

At this stage, it seems logical to no longer use AdaBoost or a logistic classifier, but a clustering algorithm or a neural net. I would have to do a lot of feature extraction and create a variable that reflects "has people on the same ticket, and of which genders", where the main feature is the survival already known from the training labels. Since we have labels of survival, we should use them. I am unsure how logistic regression would be able to capture that. 

In [30]:
full[full.Fare==0].head(50)

I am looking at the fare because there are several people who paid 0, and they are spread across different classes. Almost all of them died. One survived, and two are unknown. Of the unknown, it is a hard guess. Mr. Ismay had many cabins, and like the richest woman on board (read from the link a while ago) he could have survived by himself, but because he has no fare and shares a ticket with two others who died, it would seem that he died as well. Mr. Chisholm, on the other hand, has no cabin and shares a consecutive ticket with Parr and Andrews, both of whom died, so I am much more confident that he died as well. 

Also, since Fare is normally a continuous variable, flagging 0 as its own discrete value lets us eliminate the erroneous prediction trend for these passengers with 0 fares. 

Maybe use -1, 0, 1

This is how I will try to clean it. 
- Convert 0 Fare to Null to not affect other fares
- Create Free dummy for 0 fare
- Create consecutive variable by taking tickets, sort by tickets, extract rightmost integer part of the string, convert to int, and calculate a difference. If difference is equal to 1 then create consecutive dummy 1
- Create an in-group dummy if a ticket matches someone else's 
- If ticket in test set is consecutive or in a group with trained set, then we can see if already trained people survived or not. So this is why we need to run Ticket as a variable. To create a consecutive variable, you then need to be able to compare to see if the ticket already exists. So maybe we should run the numbers through a set to see if you can get a difference. 
    - My struggle now is finding out how to create the same variables for the test set. 
- Also how do you create a variable that says, someone else in my group/consecutive survived. 
- Female with same ticket is deceased => everyone in that group is dead. 
- Male with same or consecutive ticket survived => everyone in same group or consecutive survived. 
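
One possible answer to the "someone else in my group survived" question in the list above is a grouped sum that subtracts the passenger's own label. The sketch below is an assumption about how that could be done, shown on made-up rows; `OthersSurvived` and `OthersKnown` are hypothetical column names, and unknown (test set) labels are represented as NaN.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; NaN marks an unknown (test set) label.
toy = pd.DataFrame({
    "Ticket":   ["A1", "A1", "A1", "B2", "C3", "C3"],
    "Survived": [1.0, 0.0, np.nan, 1.0, np.nan, np.nan],
})

by_ticket = toy.groupby("Ticket")["Survived"]

# Known survivors on the ticket, minus the passenger's own label when known.
toy["OthersSurvived"] = by_ticket.transform("sum") - toy["Survived"].fillna(0)
# Known labels on the ticket, excluding the passenger's own.
toy["OthersKnown"] = by_ticket.transform("count") - toy["Survived"].notnull().astype(int)
print(toy)
```

Subtracting the passenger's own label matters: without it, the feature would simply leak each training passenger's answer into their own row.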


In [31]:
# Create Free Dummy for 0 Fare, convert 0 fare to Null
full.loc[full.Fare == 0, 'Free'] = 1
full.loc[full.Fare != 0, 'Free'] = 0
full.loc[full.Fare == 0, 'Fare'] = np.nan # use a real missing value, not the string 'NaN', so Fare stays numeric

In [32]:
# Pull out tickets that multiple passengers hold
full.loc[full.Ticket.duplicated(keep=False), 'Group'] = 1
full.loc[~full.Ticket.duplicated(keep=False), 'Group'] = 0

In [33]:
# Create consecutive variables. It will follow the math as follows
# g = pd.Series([1, 2, 3, 5, 7, 8, 10]) # Two identical series, when one is offset bothways by 1, does it still exist in the original series.
# s = pd.Series([1, 2, 3, 5, 7, 8, 10]) # If yes in either direction then it is consecutive. 
# (g-1).isin(s) | (g+1).isin(s)

tempdf = pd.DataFrame(full.Ticket.value_counts().sort_index())
tempdf.reset_index(inplace=True)
# Extract integers
tempdf['ticketint'] = tempdf.iloc[:, 0].str.extract(r'.*?(\d*$)\s*?', expand = False)
tempdf.loc[tempdf.iloc[:, 0] == 'LINE', 'ticketint'] = 1 # To fix the broken ticket of LINE, which has no digits
tempdf['ticketint'] = tempdf['ticketint'].astype(int)
# Use consecutive test
tempdf.loc[((tempdf.ticketint-1).isin(tempdf.ticketint) | (tempdf.ticketint+1).isin(tempdf.ticketint)), 'hasCons'] = 1
tempdf.loc[~((tempdf.ticketint-1).isin(tempdf.ticketint) | (tempdf.ticketint+1).isin(tempdf.ticketint)), 'hasCons'] = 0
# For matching names, pull over the results back to dataframe by testing which ticket is in that Series
full.loc[full.Ticket.isin(tempdf.loc[tempdf.hasCons == 1, 'index']), 'hasCons'] = 1
full.loc[~(full.Ticket.isin(tempdf.loc[tempdf.hasCons == 1, 'index'])), 'hasCons'] = 0
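
The commented example at the top of this cell can be run on its own to see the consecutive test at work. The variable names `ticket_numbers` and `has_consecutive` are illustrative.

```python
import pandas as pd

ticket_numbers = pd.Series([1, 2, 3, 5, 7, 8, 10])

# A number is consecutive if the value one below or one above it is also present.
has_consecutive = (
    (ticket_numbers - 1).isin(ticket_numbers)
    | (ticket_numbers + 1).isin(ticket_numbers)
)
print(has_consecutive.tolist())
# [True, True, True, False, True, True, False]
```

Here 5 and 10 have no neighbor within one step, so they are the only values flagged as not consecutive.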

### Complex polynomial dummy variables
As a male: 
- female survived consecutive/group = ?
- male survived consecutive/group = likely to live
- female dead consecutive/group = dead
- male dead consecutive/group = likely to die
- male unknown consecutive/group = ?
- female unknown consecutive/group = ?

As a female: 
- female survived consecutive/group = likely to live
- male survived consecutive/group = alive
- female dead consecutive/group = likely to die
- male dead consecutive/group = ?
- male unknown consecutive/group = ?
- female unknown consecutive/group = ?

In [34]:
pf = (full.Sex == 0)
pm = (full.Sex == 1)
of = (full.Sex == 0)
om = (full.Sex == 1)

s = (full.Survived == 1)
d = (full.Survived == 0)
u = (full.Survived.isnull())
c = (full.hasCons == 1)
g = (full.Group == 1)


How do we find if another person other than that person is dead? We have to create a graph of relationships to find out who is consecutive or in the same group. 
SOUNDS LIKE GRAPH THEORY. 

If group = 1, find duplicates of tickets using duplicate function. Get index to find out if they survived, what their gender is.     
If hasCons = 1, we need to find a way to find as many consecutive people attached to that one, and repeat for each person. It's inefficient but we only have to run this code once for less than 300 samples. We need to use the tickets, increment by one in either direction, find all people on that ticket, check each of their gender and survival, relay information back as dummy variable, and check an additional increment and repeat. We stop when none are left in initial direction and reverse increments. 

Need a markov chain or something to show who you're friends with. I'm tired of this. 

^^^ That section was written two days ago.      
How I actually achieved attaching the dummy variables onto other rows of the data was by iterating through the full dataframe, checking if the passenger was in a group, extracting the ticket, creating a dataframe using that ticket, separating the index, and then iterating through the index for the ticket, and assigning the other passengers excluding the original the complex dummy variables. 

I went through a similar process for consecutive tickets, which would hopefully also yield information at a smaller scale. 
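
The chain-following idea described earlier (keep stepping by one ticket number until the run ends) can also be sketched without an explicit graph: sort the unique ticket numbers and start a new run wherever the gap is not exactly 1. This is an alternative illustration on made-up numbers, not the approach used in the cells below; `ConsRun` and `RunSize` are hypothetical column names.

```python
import pandas as pd

# Made-up ticket numbers for illustration only; 2 appears twice (a shared ticket).
toy = pd.DataFrame({"ticketint": [10, 3, 1, 2, 7, 8, 5, 2]})

unique_sorted = pd.Series(sorted(toy.ticketint.unique()))
# A new run starts wherever the gap to the previous number is not exactly 1.
run_id = (unique_sorted.diff() != 1).cumsum()
run_of = dict(zip(unique_sorted, run_id))

# Label every passenger with the run their ticket number belongs to.
toy["ConsRun"] = toy.ticketint.map(run_of)
# Number of passengers in each run, shared tickets included.
toy["RunSize"] = toy.groupby("ConsRun")["ticketint"].transform("count")
print(toy)
```

Every passenger in the same run shares a `ConsRun` label, so survival information could then be aggregated per run with a single `groupby`.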

In [35]:
# Set all members in the groups with a new dummy variable for condition of other members of group

a, _ = full.shape
for i in range(a):
    if full.loc[i, 'Group'] == 1:
        groupdf = full[full.Ticket == full.loc[i, 'Ticket']] # This is one row from this full dataframe that is a group
        b, _ = groupdf.shape
        indexlist = np.array(groupdf.index)
        for j in indexlist: # This grabs the rows in the group.
            if full.loc[j, 'Sex'] == 1: # Passenger is male
                if full.loc[j, 'Survived'] == 1: # Passenger Survived
                    full.loc[indexlist[indexlist != j], 'MSG'] = 1 # Others in group gets assigned a Male Survived Group Dummy
                elif full.loc[j, 'Survived'] == 0:
                    full.loc[indexlist[indexlist != j], 'MDG'] = 1 # Others in group gets assigned a Male Dead Group Dummy
                else: # This scenario is when survived is unknown. 
                    full.loc[indexlist[indexlist != j], 'MUG'] = 1 # Others in group gets assigned a Male Unknown Group Dummy
            else: # Passenger is female
                if full.loc[j, 'Survived'] == 1: # Passenger survived
                    full.loc[indexlist[indexlist != j], 'FSG'] = 1 # Others in group gets assigned a Female Survived Group Dummy
                elif full.loc[j, 'Survived'] == 0:
                    full.loc[indexlist[indexlist != j], 'FDG'] = 1 # Others in group gets assigned a Female Dead Group Dummy
                else:   
                    full.loc[indexlist[indexlist != j], 'FUG'] = 1 # Others in group gets assigned a Female Unknown Group Dummy

In [36]:
# Set all members with consecutive tickets with a new dummy variable for condition of statuses of other consecutive members

# Add new column with all the ticket integers
full['ticketint'] = full.loc[:, "Ticket"].str.extract(r'.*?(\d*$)\s*?', expand = False)
full.loc[full.Ticket == 'LINE', 'ticketint'] = 1 # Fix Line ticket integer
full['ticketint'] = full['ticketint'].astype(int) # Set ticketint to int from string

a, _ = full.shape
for i in range(a):
    if full.loc[i, 'hasCons'] == 1: # Passenger has a consecutive ticket
        consdf = full[(full.ticketint == full.loc[i, 'ticketint']+1) | (full.ticketint == full.loc[i, 'ticketint']-1)]
        indexlist = np.array(consdf.index)
        if full.loc[i, 'Sex'] == 1: # Passenger is male
            if full.loc[i, 'Survived'] == 1: # Passenger Survived
                full.loc[indexlist, 'MSC'] = 1 # Holders of adjacent tickets get a Male Survived Consecutive dummy
            elif full.loc[i, 'Survived'] == 0:
                full.loc[indexlist, 'MDC'] = 1 # Holders of adjacent tickets get a Male Dead Consecutive dummy
            else: # This scenario is when survived is unknown. 
                full.loc[indexlist, 'MUC'] = 1 # Holders of adjacent tickets get a Male Unknown Consecutive dummy
        else: # Passenger is female
            if full.loc[i, 'Survived'] == 1: # Passenger survived
                full.loc[indexlist, 'FSC'] = 1 # Holders of adjacent tickets get a Female Survived Consecutive dummy
            elif full.loc[i, 'Survived'] == 0:
                full.loc[indexlist, 'FDC'] = 1 # Holders of adjacent tickets get a Female Dead Consecutive dummy
            else:   
                full.loc[indexlist, 'FUC'] = 1 # Holders of adjacent tickets get a Female Unknown Consecutive dummy

In [37]:
full.head()

In [38]:
# # One of the first drafts I went through to pull these blocks of algorithms out. 
# # They took me two days to think of, repeatedly outline, and finally function the way I wanted them to. 

# # Show Series of consecutive tickets
# a = full.Ticket[full.hasCons == 1]
# # Series of extracted ticket integers
# b = a.str.extract('.*?(\d*$)\s*?', expand = False)
# # Series + 1
# c = b + 1
# # Series for tickets of all plus one consecutives
# d = c[c.isin(b)]
# # All the people with these tickets should have their name(unique identifier) gender, and survival extracted. 
# # Or we just find if there is one of the dummy variables true

# # When you have the increments, you can run the same is duplicated command. 
 

# Aggregating the cleaning functions
I only started using the full dataset at the end, because I realized after I started exploring that there was more information to capture and more features to extract. To preserve the workflow that I went through, I will keep the upper section the same and repeat the process down here so that the cleaning steps written throughout this notebook affect all parts of the datasets. In addition, gathering everything in this section will make future predictions simpler, as I will not have to go to each cleaning section and run the cells, nor run the entire notebook, to prepare the data for a machine learning algorithm. 
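
As a possible later refactor, each cleaning step can be written as a small function and chained with `DataFrame.pipe`, so the same sequence runs on any frame. The sketch below is not the code used in the cells that follow; the function names (`encode_sex`, `add_free`, `add_group`, `add_alone`, `clean`) are hypothetical and the rows are made up.

```python
import pandas as pd

def encode_sex(frame):
    out = frame.copy()
    out["Sex"] = out["Sex"].map({"female": 0, "male": 1})
    return out

def add_free(frame):
    out = frame.copy()
    out["Free"] = (out["Fare"] == 0).astype(int)  # 1 when the fare is zero
    return out

def add_group(frame):
    out = frame.copy()
    # 1 when at least one other passenger holds the same ticket
    out["Group"] = out["Ticket"].duplicated(keep=False).astype(int)
    return out

def add_alone(frame):
    out = frame.copy()
    out["Alone"] = ((out["SibSp"] == 0) & (out["Parch"] == 0)).astype(int)
    return out

def clean(frame):
    """Run every cleaning step in a fixed order."""
    return frame.pipe(encode_sex).pipe(add_free).pipe(add_group).pipe(add_alone)

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Fare": [0.0, 71.3, 71.3],
    "Ticket": ["LINE", "PC 17599", "PC 17599"],
    "SibSp": [0, 1, 1],
    "Parch": [0, 0, 0],
})
cleaned = clean(toy)
print(cleaned)
```

Because each step returns a copy, the raw frame is left untouched and the pipeline can be rerun safely.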

In [39]:
# Import Raw Datasets
df = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

In [40]:
# Fill missing ages by recollecting data
df.sort_values('Name', inplace=True)

def CA(dataframe, index, age):
    dataframe.loc[index, "Age"] = age
    
indexmissing = list(df[df.Age.isnull()].index)
agelist = agelist = [48, 18, 40, 40, 37, 45, 28, 21, 18, 48, 40, 22, 34, 30, 22, 29, 26, 49, 42, 28, 48, 32, 49, 39, 39, 23, np.nan, 18, 46, 28, 19, 21, 37, 29, 45, 33, 21, 22, 43, 20, 21, 26, 29, 7, 35, 19, 46, 23, 44, 20, 22, 40, 22, 22, 41, 41, np.nan, 35, 23, 38, 23, 5, 3, 8, 12, 31, 20, 30, 21, 26, 18, 28, 29, 22, 19, 29, 24, 31, 20, 22, 19, 62, 32, 28, 25, 23, 23, 19, 28, np.nan, 28, 30, 29, 45, 4, 35, 28, 21, 22, 18, 25, 32, 22, 27, 21, 27, 18, 17, 27, 24, 16, 21, 27, 27, 21, 30, 16, 24, 39, 2, 24, np.nan, 29, 27, 27, 30, 18, 69, 45, 30, 49, 39, 30, 30, 47, 19, 5, 8, 14, 20, 18, 16, 19, 16, 22, 42, 55, 22, 40, 20, 42, 20, 37, 57, np.nan, 57, 23, 64, 48, 37, 37, 33, 20, 23, 17, 19, 66, 21, 23, 28, 43, 54, 45, 23, 45, 19, 23]
# test[test.Age.isnull()].iloc[60:,[1, 2, 3, 4]] # inspection only; has no effect in the middle of a cell
agelist2 = [32, 48, 17, 36, 26, 24, 37, 27, 47, 32, 32, 23, 23, 31, 29, 21, 28, 25, 20, 24, 37, 20, 25, 25, 24, 26, 40, 23, 44, 59, 30, 18, 31, 36, 26, 20, 17, 10, 43, 63, 31, 29, 41, 20, 26, 25, 32, 21, 34, 8, 25, 20, np.nan, 44, 43, 31, 26, 28, 18, 20, 22, 46, 25, 23, 33, 20, 40, 25, 25, 17, 16, 10, 44, 10, 19, 21, 44, 28, 23, 64, 26, 22, 21, 23, 35, 4]
indexmissing2 = list(test[test.Age.isnull()].index)

for x in range(len(indexmissing)):
    CA(df, indexmissing[x], agelist[x])
for x in range(len(indexmissing2)):
    CA(test, indexmissing2[x], agelist2[x])
    
df.sort_values('PassengerId', inplace=True)

In [41]:
# Combine datasets
full = pd.concat([df, test], ignore_index = True)
full.Age = full.Age.bfill() # backfill remaining gaps; replaces the deprecated fillna(method='backfill')

In [42]:
# Convert strings to dummy variables
full.Sex = full.Sex.str.replace('female', '0')
full.Sex = full.Sex.str.replace('male', '1')
full.Sex = full.Sex.astype('int')

In [43]:
# Create Free dummy for 0 Fare anomalies; the conversion of 0 Fare to Null is commented out below
full.loc[full.Fare == 0, 'Free'] = 1
full.loc[full.Fare != 0, 'Free'] = 0
# full.loc[full.Fare == 0, 'Fare'] = 'NaN'

In [44]:
# Pull out tickets that multiple passengers hold, create dummy variable
full.loc[full.Ticket.duplicated(keep=False), 'Group'] = 1
full.loc[~full.Ticket.duplicated(keep=False), 'Group'] = 0

In [45]:
# Those who are traveling alone
full.loc[(full.SibSp == 0) & (full.Parch == 0), 'Alone'] = 1
full.loc[~((full.SibSp == 0) & (full.Parch == 0)), 'Alone'] = 0

In [46]:
# Create consecutive variables. It will follow the math as follows
# g = pd.Series([1, 2, 3, 5, 7, 8, 10]) # Two identical series, when one is offset bothways by 1, does it still exist in the original series.
# s = pd.Series([1, 2, 3, 5, 7, 8, 10]) # If yes in either direction then it is consecutive. 
# (g-1).isin(s) | (g+1).isin(s)

tempdf = pd.DataFrame(full.Ticket.value_counts().sort_index())
tempdf.reset_index(inplace=True)
# Extract integers
tempdf['ticketint'] = tempdf.iloc[:, 0].str.extract(r'.*?(\d*$)\s*?', expand = False)
tempdf.loc[tempdf.iloc[:, 0] == 'LINE', 'ticketint'] = 1 # To fix the broken ticket of LINE, which has no digits
tempdf['ticketint'] = tempdf['ticketint'].astype(int)
# Use consecutive test
tempdf.loc[((tempdf.ticketint-1).isin(tempdf.ticketint) | (tempdf.ticketint+1).isin(tempdf.ticketint)), 'hasCons'] = 1
tempdf.loc[~((tempdf.ticketint-1).isin(tempdf.ticketint) | (tempdf.ticketint+1).isin(tempdf.ticketint)), 'hasCons'] = 0
# For matching names, pull over the results back to dataframe by testing which ticket is in that Series
full.loc[full.Ticket.isin(tempdf.loc[tempdf.hasCons == 1, 'index']), 'hasCons'] = 1
full.loc[~(full.Ticket.isin(tempdf.loc[tempdf.hasCons == 1, 'index'])), 'hasCons'] = 0

# I understand that two chunks of this code are unnecessary, 
# but again, for the sake of preserving my current way of thinking, 
# I will leave it be and optimize the process in another notebook. 

In [47]:
# Set all members in the groups with a new dummy variable for condition of other members of group

a, _ = full.shape
for i in range(a):
    if full.loc[i, 'Group'] == 1:
        groupdf = full[full.Ticket == full.loc[i, 'Ticket']] # This is one row from this full dataframe that is a group
        b, _ = groupdf.shape
        indexlist = np.array(groupdf.index)
        for j in indexlist: # This grabs the rows in the group.
            if full.loc[j, 'Sex'] == 1: # Passenger is male
                if full.loc[j, 'Survived'] == 1: # Passenger Survived
                    full.loc[indexlist[indexlist != j], 'MSG'] = 1 # Others in group gets assigned a Male Survived Group Dummy
                elif full.loc[j, 'Survived'] == 0:
                    full.loc[indexlist[indexlist != j], 'MDG'] = 1 # Others in group gets assigned a Male Dead Group Dummy
                else: # This scenario is when survived is unknown. 
                    full.loc[indexlist[indexlist != j], 'MUG'] = 1 # Others in group gets assigned a Male Unknown Group Dummy
            else: # Passenger is female
                if full.loc[j, 'Survived'] == 1: # Passenger survived
                    full.loc[indexlist[indexlist != j], 'FSG'] = 1 # Others in group gets assigned a Female Survived Group Dummy
                elif full.loc[j, 'Survived'] == 0:
                    full.loc[indexlist[indexlist != j], 'FDG'] = 1 # Others in group gets assigned a Female Dead Group Dummy
                else:   
                    full.loc[indexlist[indexlist != j], 'FUG'] = 1 # Others in group gets assigned a Female Unknown Group Dummy

In [48]:
# Set all members with consecutive tickets with a new dummy variable for condition of statuses of other consecutive members

# Add new column with all the ticket integers
full['ticketint'] = full.loc[:, "Ticket"].str.extract(r'.*?(\d*$)\s*?', expand = False)
full.loc[full.Ticket == 'LINE', 'ticketint'] = 1 # Fix Line ticket integer
full['ticketint'] = full['ticketint'].astype(int) # Set ticketint to int from string

a, _ = full.shape
for i in range(a):
    if full.loc[i, 'hasCons'] == 1: # Passenger has a consecutive ticket
        consdf = full[(full.ticketint == full.loc[i, 'ticketint']+1) | (full.ticketint == full.loc[i, 'ticketint']-1)]
        indexlist = np.array(consdf.index)
        if full.loc[i, 'Sex'] == 1: # Passenger is male
            if full.loc[i, 'Survived'] == 1: # Passenger survived
                full.loc[indexlist, 'MSC'] = 1 # Ticket neighbors get the Male Survived Consecutive dummy
            elif full.loc[i, 'Survived'] == 0: # Passenger died
                full.loc[indexlist, 'MDC'] = 1 # Ticket neighbors get the Male Dead Consecutive dummy
            else: # Survival is unknown (passenger is in the test set)
                full.loc[indexlist, 'MUC'] = 1 # Ticket neighbors get the Male Unknown Consecutive dummy
        else: # Passenger is female
            if full.loc[i, 'Survived'] == 1: # Passenger survived
                full.loc[indexlist, 'FSC'] = 1 # Ticket neighbors get the Female Survived Consecutive dummy
            elif full.loc[i, 'Survived'] == 0: # Passenger died
                full.loc[indexlist, 'FDC'] = 1 # Ticket neighbors get the Female Dead Consecutive dummy
            else: # Survival is unknown (passenger is in the test set)
                full.loc[indexlist, 'FUC'] = 1 # Ticket neighbors get the Female Unknown Consecutive dummy
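
The ticket-number step above can be checked in isolation on a toy frame. This is only a minimal sketch: the `toy` frame and the `hasCons_demo` column are made up for illustration, and the flag is a simplified stand-in for the notebook's `hasCons` feature, which is built in an earlier cell.

```python
import pandas as pd

# Hypothetical tickets; 'LINE' has no digits, like the LINE tickets in the data
toy = pd.DataFrame({'Ticket': ['A/5 21171', 'PC 17599', '113803', '113804', 'LINE']})

# Pull out the trailing run of digits; tickets without digits come back as NaN
toy['ticketint'] = toy['Ticket'].str.extract(r'(\d+)\s*$', expand=False)
toy['ticketint'] = toy['ticketint'].fillna('1').astype(int)  # same fallback value as the LINE fix

# A ticket has a consecutive neighbor if its number +1 or -1 also appears
ticket_numbers = set(toy['ticketint'])
toy['hasCons_demo'] = [
    (n + 1 in ticket_numbers) or (n - 1 in ticket_numbers)
    for n in toy['ticketint']
]
print(toy)
```

Only the two `1138xx` tickets are flagged here, since they are the only pair whose numbers differ by one.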

In [49]:
full[['FSG', 'MUG', 'MDG', 'FDG', 'MSG', 'FUG', 
      'MDC', 'FSC', 'FDC', 'MSC', 'MUC', 'FUC']] = full[['FSG', 'MUG', 'MDG', 
      'FDG', 'MSG', 'FUG', 'MDC', 'FSC', 'FDC', 'MSC', 'MUC', 'FUC']].fillna(value=0)
full['Age'] = full['Age'].fillna(full['Age'].mean())
full['Fare'] = full['Fare'].fillna(full['Fare'].mean())

In [50]:
# Cast all dummy columns to int ('Alone' was listed twice before; once is enough)
flag_cols = ['Free', 'Group', 'hasCons', 'Alone', 'FSG', 'MUG', 'MDG',
             'FDG', 'MSG', 'FUG', 'MDC',
             'FSC', 'FDC', 'MSC', 'MUC', 'FUC']
full[flag_cols] = full[flag_cols].astype(int)

full['Fare'] = full['Fare'].astype(int)
full['Age'] = full['Age'].astype(int)


# Running Tests Again

In [51]:
# Splitters
from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.model_selection import train_test_split

# Optimizer
from sklearn.model_selection import GridSearchCV

# Classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

# Metrics
from sklearn.metrics import accuracy_score
from sklearn import metrics
from sklearn.model_selection import cross_validate

In [52]:
# Separate the full dataset that has new features into training and testing data. 
traindf = full.iloc[:891, :]
testdf = full.iloc[891:, :]

# Find out column names to further split data
full.columns

In [53]:
# How to use ticket as a category variable?
# Change Embarked to a category variable
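
One possible way to turn `Embarked` into a category variable is one-hot encoding with `pd.get_dummies`. The sketch below runs on a made-up `demo` frame rather than on `full`, and filling missing ports with the most common one is an assumption, not something the notebook does.

```python
import pandas as pd

# Hypothetical sample with one missing port of embarkation
demo = pd.DataFrame({'Embarked': ['S', 'C', 'Q', None, 'S']})

# Assumption: fill missing values with the most frequent port before encoding
demo['Embarked'] = demo['Embarked'].fillna(demo['Embarked'].mode()[0])

# One 0/1 column per port: Embarked_C, Embarked_Q, Embarked_S
embarked_dummies = pd.get_dummies(demo['Embarked'], prefix='Embarked').astype(int)
demo = pd.concat([demo, embarked_dummies], axis=1)
print(demo)
```

The same call applied to `full['Embarked']` would produce columns that could be appended to `cols`. Raw ticket strings have far more distinct values, so they would probably need grouping before this kind of encoding is useful.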

In [54]:
# Hold only training variables. This does not include Cabin, Name, PassengerId, or Survived. 
# cols = ['Age', 'Fare', 'Parch', 'Pclass', 'Sex', 'SibSp', 
#         'Free', 'Group', 'hasCons', 'Alone', 'FSG', 'MUG', 'MDG', 
#         'FDG', 'MSG', 'FUG', 'MDC',
#         'FSC', 'FDC', 'MSC', 'MUC', 'FUC', 'Alone']
cols = ['Age', 'Pclass', 'Sex', 
        'Free', 'Group', 'hasCons', 'Alone', 'FSG', 'MUG', 'MDG', 
        'FDG', 'MSG', 'FUG', 'MDC',
        'FSC', 'FDC', 'MSC', 'MUC', 'FUC'] # 'Alone' appears once so the feature is not duplicated

test = testdf.loc[:, cols]
features = traindf.loc[:, cols]
labels = traindf.loc[:, 'Survived']

In [55]:
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size = 0.3, random_state=42)
features_train = features_train.fillna(value=0)

In [56]:
clf = LogisticRegression()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print(accuracy_score(labels_test, pred))
cross_validate(clf, features, labels, return_train_score=False, cv=10)

In [57]:
clf = DecisionTreeClassifier()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print(accuracy_score(labels_test, pred))
cross_validate(clf, features, labels, return_train_score=False, cv=10)

In [58]:
clf = DecisionTreeClassifier()
clf = AdaBoostClassifier(clf, random_state=42)
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print(accuracy_score(labels_test, pred))
cross_validate(clf, features, labels, return_train_score=False, cv=10)

In [59]:
clf = RandomForestClassifier()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print(accuracy_score(labels_test, pred))
cross_validate(clf, features, labels, return_train_score=False, cv=10)

The scores from these four classifiers are similar: cross-validation averages about 84%, and no model stands out with a much higher accuracy. All I can do is pick one and hope for the best. I have the most faith that logistic regression and random forest are fit properly.
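
Comparing the models is easier when each `cross_validate` result is reduced to a mean and a standard deviation of its fold scores. The following is a minimal sketch on synthetic data: `X_demo` and `y_demo` are stand-ins for `features` and `labels`, the hyperparameters are arbitrary, and the printed numbers say nothing about the Titanic scores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); swap in `features` and `labels` to compare on the real set
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

candidates = {
    'logreg': LogisticRegression(max_iter=1000),
    'tree': DecisionTreeClassifier(random_state=42),
    'adaboost': AdaBoostClassifier(random_state=42),
    'forest': RandomForestClassifier(n_estimators=50, random_state=42),
}

cv_summary = {}
for name, model in candidates.items():
    # 'test_score' holds one accuracy value per fold
    scores = cross_validate(model, X_demo, y_demo, cv=10)['test_score']
    cv_summary[name] = (scores.mean(), scores.std())
    print(f'{name:10s} mean={scores.mean():.3f} std={scores.std():.3f}')
```

If the means differ by less than the fold-to-fold standard deviation, the ranking of the models is probably not meaningful, which matches the impression that no model clearly wins.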

In [60]:
clf = LogisticRegression()
clf.fit(features, labels)
pred = clf.predict(test).astype(int)  # Kaggle expects integer 0/1 labels

In [61]:
my_submission = pd.DataFrame({'PassengerId': testdf['PassengerId'], 'Survived': pred})
# Any filename works; we use logreg.csv here
my_submission.to_csv('logreg.csv', index=False)