## Import data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
train = pd.read_csv("../input/clean_train.csv")
test = pd.read_csv("../input/test.csv")
all_data = [train, test]
train.shape, test.shape

((881, 12), (418, 11))

## Extract Titles

In [3]:
train["Name"].head()

0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object

In [4]:
train["Title"] = train["Name"].apply(lambda x: x.split(',')[1].split('.')[0].strip())
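
The split-based lambda above relies on every name following the `Surname, Title. Given names` pattern. As a minimal sketch (the three names are copied from the output above; the `extract_title` helper and the regular expression are illustrative assumptions, not part of the notebook), the same extraction can be written as a named function and cross-checked against `Series.str.extract`:

```python
import pandas as pd

# Three names copied from the train["Name"].head() output above.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])

def extract_title(name: str) -> str:
    """Return the text between the first comma and the following period."""
    after_comma = name.split(",")[1]            # " Mr. Owen Harris"
    return after_comma.split(".")[0].strip()    # "Mr"

# Same logic as the notebook's lambda, just named.
titles_split = names.apply(extract_title)

# Hypothetical regex alternative: capture everything after the comma
# up to (but not including) the first period.
titles_regex = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(titles_split.tolist())
print(titles_regex.tolist())
```

Both approaches should agree on well-formed names. The regex version returns `NaN` instead of raising an error when a name has no comma, which may be preferable on messier data.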

In [5]:
train["Title"].value_counts()

Mr              513
Miss            177
Mrs             125
Master           39
Dr                7
Rev               6
Mlle              2
Major             2
Col               2
Sir               1
Capt              1
Jonkheer          1
Lady              1
Don               1
Ms                1
Mme               1
the Countess      1
Name: Title, dtype: int64

In [6]:
# Do the same for the test set
test["Title"] = test["Name"].apply(lambda x: x.split(',')[1].split('.')[0].strip())
test["Title"].value_counts()

Mr        240
Miss       78
Mrs        72
Master     21
Rev         2
Col         2
Dona        1
Ms          1
Dr          1
Name: Title, dtype: int64

The only title in the test set that the train set does not contain is "Dona".
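
This comparison can be checked with a set difference rather than by eye. The sketch below hard-codes the titles listed in the two `value_counts()` outputs above so that it runs on its own. In the notebook itself, `set(train["Title"])` and `set(test["Title"])` would presumably serve the same purpose.

```python
# Titles copied from the two value_counts() outputs above.
train_titles = {
    "Mr", "Miss", "Mrs", "Master", "Dr", "Rev", "Mlle", "Major", "Col",
    "Sir", "Capt", "Jonkheer", "Lady", "Don", "Ms", "Mme", "the Countess",
}
test_titles = {"Mr", "Miss", "Mrs", "Master", "Rev", "Col", "Dona", "Ms", "Dr"}

# Titles a model trained on the train set would never have seen.
only_in_test = test_titles - train_titles

# Titles that appear in train but never in test.
only_in_train = train_titles - test_titles

print(sorted(only_in_test))
print(sorted(only_in_train))
```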

In [7]:
print("NULL VALUES IN TRAIN SET: ", train["Sex"].isnull().sum())
print("NULL VALUES IN TEST SET: ", test["Sex"].isnull().sum())

NULL VALUES IN TRAIN SET:  0
NULL VALUES IN TEST SET:  0


## Meaning

*   Mr.     Used for both adult and young men.

*   Miss.   Used for unmarried women.

*   Mrs.    Used for married women.

*   Master.     Used only for boys under 18.

*   Dr.     Doctor.

*   Rev.    Reverend. Honorific for Christian clergy.

*   Mlle.   Mademoiselle. The French equivalent of Miss.

*   Major.  A military officer rank.

*   Col.    Colonel. A senior military officer rank.

*   Lady.   Similar to Miss or Mrs., but for women of high rank.

*   Don.    Honorific for men of high rank.

*   Jonkheer.   Nobility. It comes from the nobility of the Netherlands.

*   Mme.    Madame. French for a married woman.

*   Ms.  Used for older unmarried women or women whose marital status is unknown.

*   Capt.   Captain.

*   Sir.    Honorific address for men, similar to Mr.

## Understand Titles
Because almost all the titles are for adults, I want to know how children are referred to. For women, I can also tell whether they were married by checking whether their title is "Miss" (unmarried) or "Mrs" (married).

In [8]:
children  = train.query("Age < 18")
adults = train.query("Age > 17")
children.head(10)

| Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 |  | S | Master |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 |  | C | Mrs |
| 10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7 | G6 | S | Miss |
| 14 | 15 | 0 | 3 | Vestrom, Miss. Hulda Amanda Adolfina | female | 14.0 | 0 | 0 | 350406 | 7.8542 |  | S | Miss |
| 16 | 17 | 0 | 3 | Rice, Master. Eugene | male | 2.0 | 4 | 1 | 382652 | 29.125 |  | Q | Master |
| 22 | 23 | 1 | 3 | McGowan, Miss. Anna "Annie" | female | 15.0 | 0 | 0 | 330923 | 8.0292 |  | Q | Miss |
| 24 | 25 | 0 | 3 | Palsson, Miss. Torborg Danira | female | 8.0 | 3 | 1 | 349909 | 21.075 |  | S | Miss |
| 38 | 40 | 1 | 3 | Nicola-Yarred, Miss. Jamila | female | 14.0 | 1 | 0 | 2651 | 11.2417 |  | C | Miss |
| 42 | 44 | 1 | 2 | Laroche, Miss. Simonne Marie Anne Andree | female | 3.0 | 1 | 2 | SC/Paris 2123 | 41.5792 |  | C | Miss |
| 49 | 51 | 0 | 3 | Panula, Master. Juha Niilo | male | 7.0 | 4 | 1 | 3101295 | 39.6875 |  | S | Master |
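
One detail of the split in cell 8: `Age < 18` and `Age > 17` would both match a fractional age such as 17.5, and neither matches a missing age. Whether such rows exist in `clean_train.csv` is not shown here, so the following is only a sketch on made-up ages. It uses complementary masks so that each row lands in exactly one group.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including a fractional and a missing value.
df = pd.DataFrame({"Age": [2.0, 17.0, 17.5, 18.0, 35.0, np.nan]})

children = df[df["Age"] < 18]
adults = df[df["Age"] >= 18]      # exact complement of the children mask
unknown = df[df["Age"].isna()]    # comparisons with NaN are always False

# With the notebook's original pair of conditions, 17.5 matches both.
overlap = df[(df["Age"] < 18) & (df["Age"] > 17)]

print(len(children), len(adults), len(unknown), len(overlap))
```

If the three counts add up to `len(df)`, no passenger is counted twice or silently dropped.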


Something that surprises me, contrary to my assumptions: some girls (under 18) are actually married, like Mrs. Nicholas Nasser at index 9. Also, just by looking at the head of this subset, it seems that all the boys (male, under 18) are called "Master". I want to confirm that.

In [9]:
children.value_counts("Title")

Title
Miss      51
Master    36
Mr        22
Mrs        4
dtype: int64

As we can see, 4 girls are actually married! And contrary to my assumptions, some boys (22 of 58, more than a third) are called "Mr".
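
A cross-tabulation of sex against title makes this kind of check explicit. The rows below are made up for illustration and are not taken from the dataset. Only the last line uses real numbers, the `Mr` and `Master` counts from the output above.

```python
import pandas as pd

# Made-up rows standing in for the `children` subset.
children_toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female"],
    "Title": ["Master", "Mr", "Master", "Miss", "Mrs"],
})

# Rows: sex, columns: title, cells: number of passengers.
title_by_sex = pd.crosstab(children_toy["Sex"], children_toy["Title"])

# Share of boys in the toy data whose title is "Mr".
boys = children_toy[children_toy["Sex"] == "male"]
mr_share = (boys["Title"] == "Mr").mean()

# Same share computed from the counts reported above (22 Mr, 36 Master).
reported_share = 22 / (22 + 36)

print(title_by_sex)
print(round(mr_share, 3), round(reported_share, 3))
```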

In [10]:
adults.value_counts("Title")

Title
Mr              375
Mrs             104
Miss             93
Dr                6
Rev               6
Col               2
Major             2
Mlle              2
Capt              1
Don               1
Jonkheer          1
Lady              1
Mme               1
Ms                1
Sir               1
the Countess      1
dtype: int64

## Married girls

In [11]:
married_girls = children.query("Title == 'Mrs'")
married_girls.head()

| Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 |  | C | Mrs |
| 302 | 308 | 1 | 1 | Penasco y Castellana, Mrs. Victor de Satode (M... | female | 17.0 | 1 | 0 | PC 17758 | 108.9 | C65 | C | Mrs |
| 774 | 782 | 1 | 1 | Dick, Mrs. Albert Adrian (Vera Gillespie) | female | 17.0 | 1 | 0 | 17474 | 57.0 | B20 | S | Mrs |
| 822 | 831 | 1 | 3 | Yasbeck, Mrs. Antoni (Selini Alexander) | female | 15.0 | 1 | 0 | 2659 | 14.4542 |  | C | Mrs |


In [12]:
married_girls["Name"].head()

9                    Nasser, Mrs. Nicholas (Adele Achem)
302    Penasco y Castellana, Mrs. Victor de Satode (M...
774            Dick, Mrs. Albert Adrian (Vera Gillespie)
822              Yasbeck, Mrs. Antoni (Selini Alexander)
Name: Name, dtype: object

## Categorizing people
I want to know whether doctors, majors, and others with a distinguished title have a better chance of surviving.

In [13]:
#train.drop(["Category"], axis=1, inplace=True)
train["Category"] = train["Title"]
test["Category"] = test["Title"]

In [23]:
# Unmarried women: Miss, Mlle, Ms, Donna, Dona
train["Category"] = train["Category"].replace(["Miss", "Mlle", "Ms", "Donna", "Dona"], "Unmarried")
test["Category"] = test["Category"].replace(["Miss", "Mlle", "Ms", "Donna", "Dona"], "Unmarried")

# Married women: Mrs, Lady, Mme
train["Category"] = train["Category"].replace(["Mrs", "Lady", "Mme"], "Married")
test["Category"] = test["Category"].replace(["Mrs", "Lady", "Mme"], "Married")

# General men: Mr, Master, Don, Sir, the Countess, Jonkheer
# ("the Countess" is a woman's title; this grouping counts it under "Man")
train["Category"] = train["Category"].replace(["Mr", "Master", "Don", "Sir", "the Countess", "Jonkheer"], "Man")
test["Category"] = test["Category"].replace(["Mr", "Master", "Don", "Sir", "the Countess", "Jonkheer"], "Man")

# Doctors (and reverends): Dr, Rev
train["Category"] = train["Category"].replace(["Dr", "Rev"], "Doctor")
test["Category"] = test["Category"].replace(["Dr", "Rev"], "Doctor")

# Military: Major, Col, Capt
train["Category"] = train["Category"].replace(["Major", "Col", "Capt"], "Military")
test["Category"] = test["Category"].replace(["Major", "Col", "Capt"], "Military")
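
The ten `replace` calls above can also be expressed as a single lookup table applied to both frames. This is a sketch of an alternative, not the notebook's method. The `groups` dictionary copies the lists above unchanged (including "the Countess" under "Man"), and the `categorize` helper is a hypothetical name introduced here.

```python
import pandas as pd

# Category -> titles, copied from the replace() lists above.
groups = {
    "Unmarried": ["Miss", "Mlle", "Ms", "Donna", "Dona"],
    "Married": ["Mrs", "Lady", "Mme"],
    "Man": ["Mr", "Master", "Don", "Sir", "the Countess", "Jonkheer"],
    "Doctor": ["Dr", "Rev"],
    "Military": ["Major", "Col", "Capt"],
}

# Invert to title -> category for a direct lookup.
title_to_category = {
    title: category for category, titles in groups.items() for title in titles
}

def categorize(titles: pd.Series) -> pd.Series:
    """Map each title to its category; unmapped titles are kept as-is."""
    return titles.map(title_to_category).fillna(titles)

sample = pd.Series(["Mr", "Mlle", "Rev", "Capt", "Lady", "Unknown"])
result = categorize(sample)
print(result.tolist())
```

In the notebook this would be used as `train["Category"] = categorize(train["Title"])`, and likewise for `test`, so both frames are guaranteed to share one mapping.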

In [19]:
train.head()

| Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | Category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | S | Mr | Man |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | Mrs | Married |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S | Miss | Unmarried |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S | Mrs | Married |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 |  | S | Mr | Man |


In [16]:
train["Category"].value_counts()

Man          556
Unmarried    180
Married      127
Doctor        13
Military       5
Name: Category, dtype: int64
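
The question raised at the start of this section, whether some categories survived more often, could then be checked with a group-by on `Survived`. The rows below are made up, so the rates mean nothing. This only sketches the shape of the computation.

```python
import pandas as pd

# Made-up rows; in the notebook this would be train[["Category", "Survived"]].
toy = pd.DataFrame({
    "Category": ["Man", "Man", "Married", "Unmarried", "Doctor", "Doctor"],
    "Survived": [0, 1, 1, 1, 0, 1],
})

# Survival rate and group size per category, highest rate first.
survival = (
    toy.groupby("Category")["Survived"]
       .agg(rate="mean", passengers="count")
       .sort_values("rate", ascending=False)
)
print(survival)
```

Keeping the group size next to the rate matters here: with only 5 passengers in `Military` and 13 in `Doctor`, a single passenger shifts the rate noticeably.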

## Export data

In [17]:
train["Category"].to_csv("../input/train_values/name.csv", index=False)

In [25]:
test["Category"].to_csv("../input/test_values/name.csv", index=False)