## Group project - Titanic dataset
**For understanding the Titanic data:** https://www.kaggle.com/c/titanic/data. Skip the "Overview" (machine learning) section; read only __"Data Dictionary"__ and __"Variable Notes"__.

Use the questions below to get started. You don't have to do/code them all, but try to look for patterns related to survivorship. You can also come up with your own analyses and questions to answer.

You can use this notebook, or start a new one. Make sure you give the notebook a title, add your names, and also use markdown cells to explain what you did.

### Basic Questions:

1. Show that the ticket class is a good indication of socio-economic status (poor, middle-class, rich).

2. Who were the passengers (what were their characteristics)?

3. What deck were the passengers on and how did that relate to their class booking? FYI The letters in the cabin numbers indicate the deck (also see https://66.media.tumblr.com/f8eafe09144867b00c5a7f31a68c51f8/tumblr_mr65khDO0k1ql9hvko1_1280.jpg).

4. Where did the passengers come from (port of embarkation)?

5. Who was alone and who was with family?

### Deeper Analysis:

1. What factors helped people survive (probably)?

2. Did deck level matter? Or class? Or port of embarkation?

3. Did having a family member matter?

### Share
When you are done, export your notebook as an HTML or PDF document ("File" -> "Save And Export Notebook As..."), and upload it to Slack (#only-python channel).

### Answers to Basic Questions

#### Basics

In [2]:
import numpy as np
import pandas as pd

In [3]:
df = pd.read_csv("./data/titanic.csv")
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


#### Ticket Class and Socioeconomic Status

In [5]:
df.Survived.mean() # overall survival rate is 38.38%

0.3838383838383838

In [6]:
df.agg('Survived').mean() # overall survival rate is 38.38%
# this gives the same result as the cell above

0.3838383838383838

In [72]:
df.groupby('Pclass').Survived.mean() # Survival rate falls steadily with class (Pclass 1 > 2 > 3); Survived is coded 1 = survived, 0 = did not survive.

Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64

In [7]:
df.groupby('Pclass').mean(numeric_only=True)['Survived'] # Survival rate falls steadily with class (Pclass 1 > 2 > 3); Survived is coded 1 = survived, 0 = did not survive.
# this gives the same result as the cell above; numeric_only=True is needed in pandas 2.0+ because df has text columns

Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64

In [73]:
df.groupby('Pclass').Fare.mean() # There is a strong association between class and fare paid

Pclass
1    84.154687
2    20.662183
3    13.675550
Name: Fare, dtype: float64

In [9]:
df.groupby('Pclass').mean(numeric_only=True)['Fare'] # There is a strong association between class and fare paid
# this gives the same result as the cell above

Pclass
1    84.154687
2    20.662183
3    13.675550
Name: Fare, dtype: float64

In [74]:
df.groupby(['Survived', 'Pclass']).Fare.mean()

Survived  Pclass
0         1         64.684007
          2         19.412328
          3         13.669364
1         1         95.608029
          2         22.055700
          3         13.694887
Name: Fare, dtype: float64

In [64]:
df.groupby(['Survived', 'Pclass']).mean(numeric_only=True)

Survived,Pclass,PassengerId,Age,SibSp,Parch,Fare
0,1,410.3,43.695312,0.2875,0.3,64.684007
0,2,452.123711,33.544444,0.319588,0.14433,19.412328
0,3,453.580645,26.555556,0.672043,0.384409,13.669364
1,1,491.772059,35.368197,0.492647,0.389706,95.608029
1,2,439.08046,25.901566,0.494253,0.643678,22.0557
1,3,394.058824,20.646118,0.436975,0.420168,13.694887
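
The class means above can be complemented with the median and the group size, since a handful of very expensive tickets can pull a mean upward. A minimal sketch on made-up toy values (the `toy` frame and its numbers are illustrative assumptions, not rows from `titanic.csv`):

```python
import pandas as pd

# Toy data with made-up fares; in the notebook this would be df itself.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Fare":   [80.0, 60.0, 500.0, 20.0, 26.0, 8.0, 7.0, 9.0],
})

# Mean, median and number of passengers per class.
# The median is less sensitive to a few extreme fares than the mean.
fare_by_class = toy.groupby("Pclass")["Fare"].agg(["mean", "median", "count"])
print(fare_by_class)
```

If the median fare also drops from class 1 to class 3 in the real data, that would support reading ticket class as a proxy for socio-economic status.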


#### Association between gender (Sex), Class, and Survival

In [97]:
df.groupby('Sex').mean(numeric_only=True)['Survived'] # Females (74.2%) have a higher survival rate than males (18.9%).

Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64

In [13]:
df.groupby(['Sex', 'Pclass']).mean(numeric_only=True)['Survived'] # Females in class 1 and 2 survived at over 90%, versus 50% in class 3; among males only class 1 stands out (36.9% versus about 14-16%).

Sex     Pclass
female  1         0.968085
        2         0.921053
        3         0.500000
male    1         0.368852
        2         0.157407
        3         0.135447
Name: Survived, dtype: float64

In [14]:
df.groupby(['Sex', 'Survived']).mean(numeric_only=True)['Fare'] # In both females and males, survivors paid a higher mean fare than non-survivors.

Sex     Survived
female  0           23.024385
        1           51.938573
male    0           21.960993
        1           40.821484
Name: Fare, dtype: float64

In [103]:
df.groupby(['Sex', 'Pclass', 'Survived']).mean(numeric_only=True)['Fare'] # Within every class, male survivors paid a higher mean fare than male non-survivors. For females this holds only in class 2; in class 1 and 3 the female survivors paid less on average.

Sex     Pclass  Survived
female  1       0           110.604167
                1           105.978159
        2       0            18.250000
                1            22.288989
        3       0            19.773093
                1            12.464526
male    1       0            62.894910
                1            74.637320
        2       0            19.488965
                1            21.095100
        3       0            12.204469
                1            15.579696
Name: Fare, dtype: float64
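
One way to read sex and class together is a two-way table of survival rates. The sketch below uses `pivot_table` on a small made-up frame; `toy` and its values are assumptions for illustration only, not data from the notebook.

```python
import pandas as pd

# Toy data with made-up values; column names follow titanic.csv.
toy = pd.DataFrame({
    "Sex":      ["female", "female", "female", "male", "male", "male", "male", "female"],
    "Pclass":   [1, 1, 3, 1, 3, 3, 1, 3],
    "Survived": [1, 1, 0, 0, 0, 1, 1, 1],
})

# Rows = sex, columns = class, cells = mean of Survived (the survival rate).
survival_table = toy.pivot_table(
    index="Sex", columns="Pclass", values="Survived", aggfunc="mean"
)
print(survival_table)
```

The wide layout makes it easier to compare classes within one sex (along a row) and sexes within one class (down a column).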

#### Class, Cabin, and Survival

In [105]:
df.groupby('Pclass').Survived.mean()

Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64

In [39]:
df[df.Pclass == 1].Survived.value_counts(normalize=True) # by Dursun with Geral's help

1    0.62963
0    0.37037
Name: Survived, dtype: float64

In [40]:
df[df.Pclass == 2].Survived.value_counts(normalize=True) 

0    0.527174
1    0.472826
Name: Survived, dtype: float64

In [41]:
df[df.Pclass == 3].Survived.value_counts(normalize=True)

0    0.757637
1    0.242363
Name: Survived, dtype: float64

In [104]:
df.groupby('Cabin').Survived.mean()

Cabin
A10    0.0
A14    0.0
A16    1.0
A19    0.0
A20    1.0
      ... 
F33    1.0
F38    0.0
F4     1.0
G6     0.5
T      0.0
Name: Survived, Length: 147, dtype: float64

In [19]:
df.groupby(['Pclass', 'Cabin']).Survived.mean()

Pclass  Cabin
1       A10      0.0
        A14      0.0
        A16      1.0
        A19      0.0
        A20      1.0
                ... 
3       F E69    1.0
        F G63    0.0
        F G73    0.0
        F38      0.0
        G6       0.5
Name: Survived, Length: 147, dtype: float64

In [21]:
df.groupby(["Pclass", "Cabin", "Fare"]).Survived.mean()

Pclass  Cabin  Fare   
1       A10    40.1250    0.0
        A14    52.0000    0.0
        A16    39.6000    1.0
        A19    26.0000    0.0
        A20    56.9292    1.0
                         ... 
3       F G63  7.6500     0.0
        F G73  7.6500     0.0
        F38    7.7500     0.0
        G6     10.4625    0.0
               16.7000    1.0
Name: Survived, Length: 157, dtype: float64

In [22]:
df.groupby(["Survived", "Pclass"]).Fare.mean()

Survived  Pclass
0         1         64.684007
          2         19.412328
          3         13.669364
1         1         95.608029
          2         22.055700
          3         13.694887
Name: Fare, dtype: float64

In [23]:
df.Cabin.str.match('A.*')

0        NaN
1      False
2        NaN
3      False
4        NaN
       ...  
886      NaN
887    False
888      NaN
889    False
890      NaN
Name: Cabin, Length: 891, dtype: object
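
Since the letter in a cabin number indicates the deck, a deck column can be derived from `Cabin` once instead of matching one letter at a time. The sketch below assumes the first character of the cabin string is the deck, so an entry such as `F G63` would be counted as deck F; the `toy` frame and the `Deck` column name are illustrative assumptions.

```python
import pandas as pd

# Toy data: a few cabin strings in the style seen in the outputs above.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 3, 3, 2, 1],
    "Cabin":    ["C85", "B51 B53 B55", None, "G6", "F4", "A10"],
    "Survived": [1, 0, 0, 1, 1, 0],
})

# Assumption: the first character is the deck letter.
# Missing cabins stay missing and are left out of the tables below.
toy["Deck"] = toy["Cabin"].str[0]

# How many passengers of each class were on each deck.
deck_by_class = pd.crosstab(toy["Deck"], toy["Pclass"])

# Survival rate per deck.
deck_survival = toy.groupby("Deck")["Survived"].mean()

print(deck_by_class)
print(deck_survival)
```

With the real data, `deck_by_class` would address how deck relates to class booking, and `deck_survival` whether deck level mattered, keeping in mind that most passengers have no recorded cabin.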

In [24]:
df.isnull().sum() # missing values: Age for 177, Cabin for 687, and Embarked for 2 passengers

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
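
The counts above are easier to judge as shares of all rows; for example, 687 missing cabins out of 891 passengers is about 77%. A minimal sketch, using a made-up `toy` frame rather than the real data:

```python
import pandas as pd

# Toy data with made-up values and some missing entries.
toy = pd.DataFrame({
    "Age":   [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Fare":  [7.25, 71.28, 7.92, 8.05],
})

# isnull() gives True/False per cell; the mean of booleans is the share of True.
missing_share = toy.isnull().mean()
print(missing_share)
```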

In [26]:
df1 = df[pd.notna(df.Cabin)]
df1

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
...,...,...,...,...,...,...,...,...,...,...,...,...
871,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S
872,873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0000,B51 B53 B55,S
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S


In [77]:
df1.Survived.mean() # survival rate is higher among passengers with a recorded cabin (66.7%) than overall (38.4%)

0.6666666666666666

In [27]:
df1.groupby(["Survived", "Pclass", "Cabin"]).Fare.mean()

Survived  Pclass  Cabin
0         1       A10      40.1250
                  A14      52.0000
                  A19      26.0000
                  A24      50.4958
                  A32      50.0000
                            ...   
1         2       F4       39.0000
          3       E10       8.0500
                  E121     12.4750
                  F E69    22.3583
                  G6       16.7000
Name: Fare, Length: 163, dtype: float64

In [28]:
df2 = df1[df1.Cabin.str.contains('[A-G]')]
df2 # df2 keeps passengers whose Cabin contains a deck letter A-G: 203 of the 204 passengers with a recorded cabin. The one excluded passenger has Cabin "T", which contains none of these letters.

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
...,...,...,...,...,...,...,...,...,...,...,...,...
871,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S
872,873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0000,B51 B53 B55,S
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S


In [78]:
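df1.groupby(['Sex', 'Pclass']).mean(numeric_only=True)['Survived'] # Among passengers with a recorded cabin, females still survive more often than males in class 1 and 2. Male class 2 (66.7%) is above male class 1 (41.1%) here, unlike in the full data, so check the group sizes before reading much into it.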

Sex     Pclass
female  1         0.962963
        2         0.900000
        3         0.666667
male    1         0.410526
        2         0.666667
        3         0.333333
Name: Survived, dtype: float64

In [79]:
df2.groupby(['Sex', 'Pclass']).mean(numeric_only=True)['Survived'] # Restricting to cabins with a deck letter A-G barely changes the picture: only the male class 1 rate moves (41.5%).

Sex     Pclass
female  1         0.962963
        2         0.900000
        3         0.666667
male    1         0.414894
        2         0.666667
        3         0.333333
Name: Survived, dtype: float64

In [30]:
df1A = df1[df1.Cabin.str.contains('A')]
df1A

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
23,24,1,1,"Sloper, Mr. William Thompson",male,28.0,0,0,113788,35.5,A6,S
96,97,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C
174,175,0,1,"Smith, Mr. James Clinch",male,56.0,0,0,17764,30.6958,A7,C
185,186,0,1,"Rood, Mr. Hugh Roscoe",male,,0,0,113767,50.0,A32,S
209,210,1,1,"Blank, Mr. Henry",male,40.0,0,0,112277,31.0,A31,C
284,285,0,1,"Smith, Mr. Richard William",male,,0,0,113056,26.0,A19,S
445,446,1,1,"Dodge, Master. Washington",male,4.0,0,2,33638,81.8583,A34,S
475,476,0,1,"Clifford, Mr. George Quincy",male,,0,0,110465,52.0,A14,S
556,557,1,1,"Duff Gordon, Lady. (Lucille Christiana Sutherl...",female,48.0,1,0,11755,39.6,A16,C
583,584,0,1,"Ross, Mr. John Hugo",male,36.0,0,0,13049,40.125,A10,C


In [31]:
df1B = df1[df1.Cabin.str.contains('B')]
df1B

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
31,32,1,1,"Spencer, Mrs. William Augustus (Marie Eugenie)",female,,1,0,PC 17569,146.5208,B78,C
54,55,0,1,"Ostby, Mr. Engelhart Cornelius",male,65.0,0,1,113509,61.9792,B30,C
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,
118,119,0,1,"Baxter, Mr. Quigg Edmond",male,24.0,0,1,PC 17558,247.5208,B58 B60,C
139,140,0,1,"Giglio, Mr. Victor",male,24.0,0,0,PC 17593,79.2,B86,C
170,171,0,1,"Van der hoef, Mr. Wyckoff",male,61.0,0,0,111240,33.5,B19,S
194,195,1,1,"Brown, Mrs. James Joseph (Margaret Tobin)",female,44.0,0,0,PC 17610,27.7208,B4,C
195,196,1,1,"Lurette, Miss. Elise",female,58.0,0,0,PC 17569,146.5208,B80,C
257,258,1,1,"Cherry, Miss. Gladys",female,30.0,0,0,110152,86.5,B77,S
263,264,0,1,"Harrison, Mr. William",male,40.0,0,0,112059,0.0,B94,S


In [80]:
df.groupby(['Pclass', 'Survived']).Fare.mean()

Pclass  Survived
1       0           64.684007
        1           95.608029
2       0           19.412328
        1           22.055700
3       0           13.669364
        1           13.694887
Name: Fare, dtype: float64

In [37]:
df.groupby(["Pclass", "Survived"]).agg({'Fare': 'mean'})

Pclass,Survived,Fare
1,0,64.684007
1,1,95.608029
2,0,19.412328
2,1,22.0557
3,0,13.669364
3,1,13.694887


In [108]:
df.agg("Survived").mean()

0.3838383838383838

In [109]:
df1.agg("Survived").mean() # df1 = df[pd.notna(df.Cabin)] those whose cabin is not NaN (df1 --> 204 passengers)

0.6666666666666666

In [111]:
df2.agg("Survived").mean() # (df2 --> 203 passengers)

0.6699507389162561

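# Reference pattern from another dataset (the columns 'state', 'office_id' and 'sales' are not in titanic.csv):
state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
# Change: group state_office by its first index level and divide by each group's sum
state_pcts = state_office.groupby(level=0).apply(lambda x:
                                                 100 * x / float(x.sum()))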

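In [46]:
# Assumption: df_class_survived is the mean-fare table from In [37]
df_class_survived = df.groupby(["Pclass", "Survived"]).agg({'Fare': 'mean'})
# Mean fare of each group as a percentage of the overall mean fare
df_class_survived_percent = 100 * df_class_survived / df.Fare.mean()
df_class_survived_percent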

In [47]:
# Share of survivors and non-survivors within each class
df.groupby('Pclass').Survived.value_counts(normalize=True)
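
The within-group percentage pattern from the `state_office` snippet can be adapted to this kind of data with `transform`. A minimal sketch, assuming the goal is the share of survivors and non-survivors inside each class; the `toy` frame and its values are made up for illustration.

```python
import pandas as pd

# Toy data with made-up values; column names follow titanic.csv.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1, 0],
})

# Number of passengers in each (class, outcome) cell.
counts = toy.groupby(["Pclass", "Survived"]).size()

# Divide each cell by the total of its class, so that the
# percentages add up to 100 within every class.
class_totals = counts.groupby(level="Pclass").transform("sum")
percent_within_class = 100 * counts / class_totals
print(percent_within_class)
```

Using `transform("sum")` keeps the result aligned with the original index, which avoids the `apply` with a lambda used in the reference snippet.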

#### Port of Embarkation

In [112]:
df.groupby('Embarked').Survived.mean()  # Survival rate by port of embarkation: C (55.4%) > Q (39.0%) > S (33.7%)

Embarked
C    0.553571
Q    0.389610
S    0.336957
Name: Survived, dtype: float64

In [113]:
df.groupby(['Sex', 'Embarked']).Survived.mean()  # Survival rate by sex and port of embarkation: females C > Q > S; males C > S > Q. Even the lowest female rate (S, 69.0%) exceeds the highest male rate (C, 30.5%).

Sex     Embarked
female  C           0.876712
        Q           0.750000
        S           0.689655
male    C           0.305263
        Q           0.073171
        S           0.174603
Name: Survived, dtype: float64

In [114]:
df.groupby(['Pclass', 'Embarked']).Survived.mean() # Survival rate by class and port: the overall order C > Q > S holds only in class 3; class 1 is C > S > Q and class 2 is Q > C > S.

Pclass  Embarked
1       C           0.694118
        Q           0.500000
        S           0.582677
2       C           0.529412
        Q           0.666667
        S           0.463415
3       C           0.378788
        Q           0.375000
        S           0.189802
Name: Survived, dtype: float64

In [115]:
df.groupby(['Sex', 'Pclass', 'Embarked']).Survived.mean() 
# FEMALES: all survived in 1Q, 2C and 2Q; Q has the highest (or tied-highest) female survival rate in every class; the lowest rate is 3S (37.5%).
# MALES: none survived in 1Q and 2Q; C has the highest male survival rate in every class, peaking at 1C (40.5%).

Sex     Pclass  Embarked
female  1       C           0.976744
                Q           1.000000
                S           0.958333
        2       C           1.000000
                Q           1.000000
                S           0.910448
        3       C           0.652174
                Q           0.727273
                S           0.375000
male    1       C           0.404762
                Q           0.000000
                S           0.354430
        2       C           0.200000
                Q           0.000000
                S           0.154639
        3       C           0.232558
                Q           0.076923
                S           0.128302
Name: Survived, dtype: float64
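
Rates of exactly 0% or 100% in a breakdown this fine often come from very small groups, so it helps to show the group size next to each rate. A minimal sketch on made-up values; the `toy` frame, the `rate` and `n` column names, and the cut-off of 3 are illustrative assumptions.

```python
import pandas as pd

# Toy data with made-up values; column names follow titanic.csv.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 1, 3, 3, 3, 3],
    "Embarked": ["Q", "S", "S", "S", "S", "Q", "Q", "S", "S"],
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0, 0],
})

# Survival rate and number of passengers for each (class, port) group.
summary = toy.groupby(["Pclass", "Embarked"])["Survived"].agg(rate="mean", n="count")

# Keep only groups with at least 3 passengers (arbitrary cut-off for the toy data).
reliable = summary[summary["n"] >= 3]

print(summary)
print(reliable)
```

In the toy data the 100% rate for class 1 at port Q rests on a single passenger and is filtered out; the same check on the real data would show how much weight the 1Q and 2Q rates can bear.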

#### Survival association with SibSp (siblings and spouses) and Parch (parents and children; Parch = 0 for a child means they travelled with a nanny rather than a parent)

In [116]:
df.groupby('Parch').Survived.mean() # Parch 1 to 3 show a higher survival rate than Parch 0; Parch 4 and above show low rates.

Parch
0    0.343658
1    0.550847
2    0.500000
3    0.600000
4    0.000000
5    0.200000
6    0.000000
Name: Survived, dtype: float64

In [117]:
df.groupby('SibSp').Survived.mean() # SibSp 1 and 2 are associated with a higher survival rate than SibSp 0; the rate drops for SibSp 3 and above.

SibSp
0    0.345395
1    0.535885
2    0.464286
3    0.250000
4    0.166667
5    0.000000
8    0.000000
Name: Survived, dtype: float64

In [118]:
df.groupby(['SibSp', 'Parch']).Survived.mean()

SibSp  Parch
0      0        0.303538
       1        0.657895
       2        0.724138
       3        1.000000
       4        0.000000
       5        0.000000
1      0        0.520325
       1        0.596491
       2        0.631579
       3        0.333333
       4        0.000000
       5        0.333333
       6        0.000000
2      0        0.250000
       1        0.857143
       2        0.500000
       3        1.000000
3      0        1.000000
       1        0.000000
       2        0.285714
4      1        0.000000
       2        0.333333
5      2        0.000000
8      2        0.000000
Name: Survived, dtype: float64
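
To compare passengers travelling alone with those travelling with family, `SibSp` and `Parch` can be combined into one family-size column. A minimal sketch on made-up values; the `toy` frame and the `FamilySize` and `Company` column names are assumptions introduced here, not columns of `titanic.csv`.

```python
import pandas as pd

# Toy data with made-up values; column names follow titanic.csv.
toy = pd.DataFrame({
    "SibSp":    [0, 1, 0, 3, 0, 1],
    "Parch":    [0, 0, 2, 2, 0, 1],
    "Survived": [0, 1, 1, 0, 1, 1],
})

# Number of relatives aboard (the passenger is not counted).
toy["FamilySize"] = toy["SibSp"] + toy["Parch"]

# Label each passenger as travelling alone or with family.
toy["Company"] = toy["FamilySize"].eq(0).map({True: "alone", False: "with family"})

# Survival rate for each label.
company_survival = toy.groupby("Company")["Survived"].mean()
print(company_survival)
```

Note that a child with `Parch = 0` who travelled with a nanny would be labelled "alone" under this rule, so the label only reflects relatives recorded in the data.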