**Data Description and Background**

The RMS Titanic was a passenger liner completed in 1912 and, at the time, the largest sea-going vessel in the world. Its legacy, however, was forever defined on the morning of April 15, 1912, during its maiden voyage. While navigating the icy and treacherous North Atlantic waters off the coast of Newfoundland, the ship collided with an iceberg late on the night of April 14. The collision compromised the integrity of the vessel's hull, causing it to take on large amounts of water. Gorged with water and straining to stay afloat, the mighty vessel upended, heaving its stern high into the air. Not designed to handle the stresses of such a posture, the ship broke in two and sank to its current resting place on the ocean floor just hours after the fateful collision.

There were over 2,200 people on board for the voyage (885 crew and 1,317 passengers), yet the Titanic was equipped with only enough lifeboats to accommodate 1,178 people. Sadly, even fewer people were saved, since many lifeboats were launched at less than full capacity during the evacuation. Tragically, more than 1,500 people drowned or froze to death that day in the icy waters of the North Atlantic, making the Titanic's sinking one of history's most infamous maritime disasters. The dataset you will be analyzing is a listing of the passenger log and contains the following variables:

*Name: The name of the passenger

*Age: The passenger's age at the time of the sinking. With the exception of infants under 1 year old, the age is rounded to the nearest integer.

*Pclass: The class of accommodation the passenger was travelling in. 1st class accommodations were the most expensive and offered passengers the most luxurious quarters for the transoceanic trip. 2nd class still had many comforts but was less lavish and more affordable than 1st class, whereas 3rd class was the cheapest and provided only the most basic accommodations. Given the disparity in cost, only the wealthy and privileged could enjoy the luxuries of 1st class, leaving 2nd class to those who were well off but less wealthy. For those with little, 3rd class was the only affordable option. Consequently, the class of accommodation is often used as a proxy for a passenger's socio-economic standing at that time.

*Sex: The gender of the passenger.

*Survived: A 0-1 coded variable where 1 indicates the passenger survived and 0 indicates they perished.


Question: 

1. Focus on the passenger class. For each class, calculate and report the proportion of passengers who survived. Are there differences between the proportions of people who survived in each class? Is there any trend?

2. Since there were not enough life-boats to accommodate every passenger, when it came time to evacuate the ship the time-honored tradition of "women and children first" was practiced when determining who would be granted a spot on a life boat. Similar to part 1, calculate and report the proportion of women who survived and the proportion of men who survived and comment on your findings.

3. For age, create the following age groups:


*   Children: Less than 17 years old
*   Young Adults: 18 years old to 30 years old
*   Middle Aged: 31 to 50 years old
*   Elderly: Greater than 50 years old

Calculate and report the proportion of people who survived for each age group separately. Are there any differences/trends for the proportions of survivors in each age group?

4. Fit a logistic regression model where the response variable is survival and the explanatory variables are age (as a numerical variable) and passenger class and gender (as categorical variables). For incorporating the categorical variables, use a fixed slope model for each one (do not use any interaction terms).

5. Report the estimated coefficients and interpret them.

6. It is desired to know whether there was preferential treatment for those of higher socio-economic standing when deciding who would be able to board a life-boat and thus survive. Based on your findings from part 1, set up and execute the proper hypothesis tests to determine whether any preferential treatment was present. Recall there were three passenger classes, so one must do a separate test comparing each class pairwise. Perform the tests so that the overall Type I error is no greater than alpha = 0.1.

### Import libraries and read the data.

In [59]:
import pandas as pd
import numpy as np
df = pd.read_csv(r"C:\\Users\\91958\\Documents\\SEM2\\Applied_Stat_660\\titanic.csv")

In [60]:
df.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05


#### 1. Focus on the passenger class. For each class, calculate and report the proportion of passengers who survived. Are there differences between the proportions of people who survived in each class? Is there any trend?

In [61]:
class_survival = df.groupby("Pclass")["Survived"].mean()

# Print the results
print("Survival proportions by passenger class:")
print(class_survival)

Survival proportions by passenger class:
Pclass
1    0.629630
2    0.472826
3    0.244353
Name: Survived, dtype: float64


From the results, we can see that the proportion of passengers who survived in first class (62.96%) was considerably higher than the proportion in second class (47.28%) and third class (24.44%). This suggests that survival rates differed by passenger class; whether the differences are statistically significant is tested in part 6. The trend is that as the passenger class decreased (from first class to third class), the proportion of survivors also decreased.

#### 2. Since there were not enough life-boats to accommodate every passenger, when it came time to evacuate the ship the time-honored tradition of "women and children first" was practiced when determining who would be granted a spot on a life boat. Similar to part 1, calculate and report the proportion of women who survived and the proportion of men who survived and comment on your findings

In [62]:
# Calculate the total number of women and men on board
num_women = len(df[df['Sex'] == 'female'])
num_men = len(df[df['Sex'] == 'male'])

# Calculate the number of women and men who survived
num_women_survived = len(df[(df['Sex'] == 'female') & (df['Survived'] == 1)])
num_men_survived = len(df[(df['Sex'] == 'male') & (df['Survived'] == 1)])

# Calculate the proportion of women and men who survived
prop_women_survived = num_women_survived / num_women
prop_men_survived = num_men_survived / num_men

# Print the results
print('Proportion of women who survived:', prop_women_survived)
print('Proportion of men who survived:', prop_men_survived)

Proportion of women who survived: 0.7420382165605095
Proportion of men who survived: 0.19022687609075042


The code assumes that the gender of each passenger is stored in a column called Sex and that the survival status is stored in a column called Survived. It also assumes that the values in the Sex column are either 'male' or 'female', and that the values in the Survived column are either 0 or 1, with 1 indicating that the passenger survived and 0 indicating that they did not.

From this calculation, the survival rate is approximately 74% for women and 19% for men. Women were therefore almost four times as likely to survive as men, which is consistent with the "women and children first" practice described in the question.
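The same proportions can also be obtained in a single groupby step, since the mean of a 0-1 column is the proportion of 1s. The sketch below is only an illustration: it uses a small made-up frame called toy (not the Titanic data) so that it runs on its own, and only the column names Sex and Survived are taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data, NOT the Titanic passengers; used only to show the pattern.
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the proportion of 1s, so one groupby yields
# the number of survivors, the group size, and the survival proportion.
by_sex = toy.groupby("Sex")["Survived"].agg(
    survivors="sum", total="count", proportion="mean"
)
print(by_sex)
```

Applied to the real data, replacing toy with df should reproduce the two proportions printed above while also showing the counts behind them.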

#### 3. For age, create the following age groups:

*   Children: Less than 17 years old
*   Young Adults: 18 years old to 30 years old
*   Middle Aged: 31 to 50 years old
*   Elderly: Greater than 50 years old

Calculate and report the proportion of people who survived for each age group separately. Are there any differences/trends for the proportions of survivors in each age group?

In [63]:
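# Define a function for classifying passengers into age groups.
# The stated ranges leave age 17 and fractional ages (e.g. 30.5) unassigned,
# so the cut-offs here match the pd.cut bins used in the next cell:
# (0, 17], (17, 30], (30, 50], (50, max]. No age falls into 'Elderly' by accident.
def f(row):
    if row['Age'] <= 17:
        val = 'Children'
    elif row['Age'] <= 30:
        val = 'Young Adults'
    elif row['Age'] <= 50:
        val = 'Middle Aged'
    else:
        val = 'Elderly'
    return val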

#create new column 'Type' using the function above
df['Type'] = df.apply(f, axis=1)
df

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,Type
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.2500,Young Adults
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833,Middle Aged
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.9250,Young Adults
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1000,Middle Aged
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.0500,Middle Aged
...,...,...,...,...,...,...,...,...,...
882,0,2,Rev. Juozas Montvila,male,27.0,0,0,13.0000,Young Adults
883,1,1,Miss. Margaret Edith Graham,female,19.0,0,0,30.0000,Young Adults
884,0,3,Miss. Catherine Helen Johnston,female,7.0,1,2,23.4500,Children
885,1,1,Mr. Karl Howell Behr,male,26.0,0,0,30.0000,Young Adults


In [64]:
# Create age groups
age_groups = pd.cut(df["Age"], [0, 17, 30, 50, df["Age"].max()], labels=["Children", "Young Adults", "Middle Aged", "Elderly"])

# Calculate the proportion of survivors for each age group
age_survival = df.groupby(age_groups)["Survived"].mean()

# Print the results
print(age_survival)

Age
Children        0.500000
Young Adults    0.334177
Middle Aged     0.424138
Elderly         0.305556
Name: Survived, dtype: float64


From the results, we can see that the proportion of survivors in the Children age group (50.00%) was the highest among all age groups, while the proportion of survivors in the Elderly age group (30.56%) was the lowest. There is a clear difference in the proportion of survivors among the age groups, suggesting that age played a role in survival rates on the Titanic. The pattern is not strictly decreasing, however: the proportion drops from Children to Young Adults (33.42%), rises for the Middle Aged group (42.41%), and then falls again for the Elderly.
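Because the stated age ranges leave gaps (age 17, and fractional ages such as 30.5), it helps to see exactly where pd.cut places boundary values. The following is a minimal sketch with made-up ages, assuming the same cut-offs as the cell above; the open-ended upper edge float("inf") is our substitution for df["Age"].max().

```python
import pandas as pd

# Hypothetical ages chosen to sit on or near the group boundaries.
ages = pd.Series([0.5, 16, 17, 17.5, 18, 30, 30.5, 31, 50, 51, 80])

# With the default right=True, each bin is (left, right]:
# (0, 17], (17, 30], (30, 50], (50, inf]
bins = [0, 17, 30, 50, float("inf")]
labels = ["Children", "Young Adults", "Middle Aged", "Elderly"]
groups = pd.cut(ages, bins=bins, labels=labels)

print(pd.DataFrame({"Age": ages, "Group": groups}))
```

Under these bins, age 17 is counted with the Children and 17.5 with the Young Adults. If the assignment intends a different boundary, only the bins list needs to change.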

#### 4. Fit a logistic regression model where the response variable is survival and the explanatory variables are age (as a numerical variable) and passenger class and gender (as categorical variables). For incorporating the categorical variables, use a fixed slope model for each one (do not use any interaction terms).

In [65]:
import statsmodels.formula.api as smf

In [66]:
# Fit a logistic regression model with survival as the response variable
# and age, passenger class, and gender as explanatory variables
model = smf.logit("Survived ~ Age + C(Pclass) + C(Sex)", data=df).fit()

# Print the model summary
print(model.summary())

Optimization terminated successfully.
         Current function value: 0.451857
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:               Survived   No. Observations:                  887
Model:                          Logit   Df Residuals:                      882
Method:                           MLE   Df Model:                            4
Date:                Fri, 24 Mar 2023   Pseudo R-squ.:                  0.3223
Time:                        14:52:55   Log-Likelihood:                -400.80
converged:                       True   LL-Null:                       -591.38
Covariance Type:            nonrobust   LLR p-value:                 3.244e-81
                     coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept          3.6349      0.370      9.812      0.000       2.909       4.361
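C(Pclass)[T.2]    -1.1991      0.262     -4.584      0.000      -1.712      -0.686
C(Pclass)[T.3]    -2.4554      0.253     -9.697      0.000      -2.952      -1.959
C(Sex)[T.male]    -2.5887      0.187    -13.843      0.000      -2.955      -2.222
Age               -0.0343      0.007     -4.787      0.000      -0.048      -0.020

"Survived" is the response variable, and C(Pclass) and C(Sex) indicate that Pclass and Sex should be treated as categorical variables. The output includes the model coefficients, standard errors, p-values, and other diagnostic statistics.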

#### 5. Report the estimated coefficients and interpret them.

The coef column represents the estimated coefficients for each predictor variable. Here are interpretations of the estimated coefficients for each predictor:

*   Intercept: This is the estimated log odds of survival for a passenger who is female, in first class, and has an age of 0 (the reference levels of the categorical variables). The coefficient is positive (3.6349), so the model assigns this baseline group high odds of survival; every other coefficient is negative and lowers the log odds relative to it.
*   C(Pclass)[T.2] and C(Pclass)[T.3]: These are the estimated differences in log odds of survival between passengers in second class and first class, and between passengers in third class and first class, respectively. Both coefficients are negative (-1.1991 and -2.4554), which means that the log odds of survival are lower for passengers in the lower classes, with third class lowest.
*   C(Sex)[T.male]: This is the estimated difference in log odds of survival between male and female passengers. The coefficient is negative (-2.5887), which means that the log odds of survival are lower for male passengers compared to female passengers.
*   Age: This is the estimated change in log odds of survival for a one-year increase in age. The coefficient is negative (-0.0343), which means that the log odds of survival decrease with age.

It's important to note that the coefficients in a logistic regression model represent the effect of each predictor variable while holding all other variables constant. The coefficients can be exponentiated to obtain odds ratios, which can provide a more intuitive interpretation of the effect of each predictor on the odds of survival.

**Estimated coefficients:**

| Term | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 3.6349 | 0.370 | 9.812 | 0.000 | 2.909 | 4.361 |
| C(Pclass)[T.2] | -1.1991 | 0.262 | -4.584 | 0.000 | -1.712 | -0.686 |
| C(Pclass)[T.3] | -2.4554 | 0.253 | -9.697 | 0.000 | -2.952 | -1.959 |
| C(Sex)[T.male] | -2.5887 | 0.187 | -13.843 | 0.000 | -2.955 | -2.222 |
| Age | -0.0343 | 0.007 | -4.787 | 0.000 | -0.048 | -0.020 |
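To make the odds-ratio interpretation concrete, the sketch below exponentiates the reported coefficients and converts the log odds into a predicted probability. The coefficient values are copied from the table above; the helper predicted_probability and the two example passengers are hypothetical illustrations, not part of the original analysis.

```python
import math

# Coefficients copied from the fitted model's summary table.
coef = {
    "Intercept": 3.6349,
    "C(Pclass)[T.2]": -1.1991,
    "C(Pclass)[T.3]": -2.4554,
    "C(Sex)[T.male]": -2.5887,
    "Age": -0.0343,
}

# exp(coefficient) is the multiplicative change in the odds of survival
# relative to the reference level (or per additional year, for Age).
odds_ratios = {name: math.exp(b) for name, b in coef.items() if name != "Intercept"}
for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")

def predicted_probability(age, pclass, sex):
    """Survival probability implied by the coefficients above (hypothetical helper)."""
    log_odds = coef["Intercept"] + coef["Age"] * age
    if pclass in (2, 3):
        log_odds += coef[f"C(Pclass)[T.{pclass}]"]
    if sex == "male":
        log_odds += coef["C(Sex)[T.male]"]
    return 1.0 / (1.0 + math.exp(-log_odds))

# Illustrative passengers (hypothetical, not rows from the data).
print(predicted_probability(30, 1, "female"))
print(predicted_probability(30, 3, "male"))
```

An odds ratio below 1 means lower odds of survival than the reference group. For example, the odds ratio for C(Sex)[T.male] is the factor by which a man's odds of survival differ from those of a woman of the same age and class.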

#### 6. It is desired to know whether there was preferential treatment for those of higher socio-economic standing when deciding who would be able to board a life-boat and thus survive. Based on your findings from part 1, set up and execute the proper hypothesis tests to determine whether any preferential treatment was present. Recall there were three passenger classes, so one must do a separate test comparing each class pairwise. Perform the tests so that the overall Type I error is no greater than alpha = 0.1.

To test whether there was preferential treatment for passengers of higher socio-economic standing when deciding who would board a lifeboat and survive, we can perform pairwise comparisons between the different passenger classes. The null hypothesis for each test is that there is no difference in the survival rates between the two classes being compared. The alternative hypothesis is that the survival rates differ; preferential treatment corresponds to the higher class having the higher survival rate. Note that proportions_ztest() is two-sided by default, so the p-values reported here are for the two-sided alternative, and the direction of each difference is read from the proportions in part 1.

To control the overall Type I error rate at 0.1, we can use the Bonferroni correction to adjust the significance level for each pairwise comparison. With three classes, there are three possible pairwise comparisons, so we will need to test each comparison at a significance level of 0.1/3 = 0.0333.

Here's how we can perform the hypothesis tests using Python and the Titanic dataset:

In [69]:
import statsmodels.api as sm
from statsmodels.stats.proportion import proportions_ztest

# Create dataframes for each passenger class
first_class = df[df['Pclass'] == 1]
second_class = df[df['Pclass'] == 2]
third_class = df[df['Pclass'] == 3]

# Set the per-test significance level (Bonferroni: overall 0.1 split across 3 comparisons)
alpha = 0.1 / 3

# Test first class vs. second class
count = np.array([first_class['Survived'].sum(), second_class['Survived'].sum()])
nobs = np.array([len(first_class), len(second_class)])
stat, pval = proportions_ztest(count, nobs)
print(f"First class vs. second class: p-value = {pval:.4f}")
if pval < alpha:
    print("Reject null hypothesis (no difference in proportions)")
else:
    print("Fail to reject null hypothesis")

# Test first class vs. third class
count = np.array([first_class['Survived'].sum(), third_class['Survived'].sum()])
nobs = np.array([len(first_class), len(third_class)])
stat, pval = proportions_ztest(count, nobs)
print(f"First class vs. third class: p-value = {pval:.4f}")
if pval < alpha:
    print("Reject null hypothesis (no difference in proportions)")
else:
    print("Fail to reject null hypothesis")

# Test second class vs. third class
count = np.array([second_class['Survived'].sum(), third_class['Survived'].sum()])
nobs = np.array([len(second_class), len(third_class)])
stat, pval = proportions_ztest(count, nobs)
print(f"Second class vs. third class: p-value = {pval:.4f}")
if pval < alpha:
    print("Reject null hypothesis (no difference in proportions)")
else:
    print("Fail to reject null hypothesis")

First class vs. second class: p-value = 0.0017
Reject null hypothesis (no difference in proportions)
First class vs. third class: p-value = 0.0000
Reject null hypothesis (no difference in proportions)
Second class vs. third class: p-value = 0.0000
Reject null hypothesis (no difference in proportions)


In [67]:
import statsmodels.api as sm

# Calculate the survival rates for each passenger class
class_survival_rates = df.groupby('Pclass')['Survived'].mean()

# Pairwise comparisons between passenger classes
for i in range(1, 4):
    for j in range(i+1, 4):
        # Get the survival rates for the two classes being compared
        class_i_rate = class_survival_rates[i]
        class_j_rate = class_survival_rates[j]

        # Calculate the test statistic and p-value
        z_stat, p_val = sm.stats.proportions_ztest([class_i_rate * 100, class_j_rate * 100], [100, 100])

        # Print the results of the test
        print(f'Comparison between Class {i} and Class {j}: z = {z_stat:.3f}, p = {p_val:.4f}')

Comparison between Class 1 and Class 2: z = 2.229, p = 0.0258
Comparison between Class 1 and Class 3: z = 5.492, p = 0.0000
Comparison between Class 2 and Class 3: z = 3.369, p = 0.0008


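In the above code, we first calculate the survival rates for each passenger class using the groupby() method in Pandas. We then perform pairwise comparisons between the classes using a two-sample z-test for proportions, implemented using the proportions_ztest() function from the statsmodels library. The proportions_ztest() function takes two arrays as inputs: the number of successes (survivors) and the total number of trials (passengers) for each sample being compared. In this cell the survival rates are multiplied by 100 and a sample size of 100 is used for each group, which discards the actual class sizes. The resulting p-values therefore differ from those based on the actual counts (for example, p = 0.0258 for first vs. second class here, against p = 0.0017 in cell In [69]), and the count-based results are the ones that should be reported.

The output shows the test statistic and p-value for each pairwise comparison. If the p-value is less than 0.0333, we reject the null hypothesis and conclude that there is evidence of preferential treatment for passengers in the higher socio-economic classes. If the p-value is greater than or equal to 0.0333, we fail to reject the null hypothesis and conclude that there is not sufficient evidence of preferential treatment. All three count-based p-values (0.0017, 0.0000, and 0.0000) are below 0.0333, so we reject the null hypothesis for every pair: first class survived at a higher rate than second class, and second class at a higher rate than third class, which is consistent with preferential treatment for passengers of higher socio-economic standing.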

Note that we are testing for pairwise differences between the classes, rather than a global test across all three classes. This is because we want to identify specifically which classes were more likely to have preferential treatment, rather than simply determining whether there was any difference at all.
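Since the alternative of interest is directional (the higher class survives at a higher rate), a one-sided version of the pairwise test can also be worked out by hand. The sketch below is one way to do this using only the standard library. The survivor counts (136 of 216, 87 of 184, 119 of 487) are assumptions inferred from the proportions reported in part 1 and the 887 observations, so they should be recomputed from df in practice; one_sided_ztest is a hypothetical helper, not a statsmodels function.

```python
import math
from itertools import combinations

# Assumed (survivors, passengers) per class, inferred from the reported proportions.
counts = {1: (136, 216), 2: (87, 184), 3: (119, 487)}

alpha_family = 0.1
pairs = list(combinations(sorted(counts), 2))   # (1, 2), (1, 3), (2, 3)
alpha_per_test = alpha_family / len(pairs)      # Bonferroni: 0.1 / 3

def one_sided_ztest(x1, n1, x2, n2):
    """Pooled two-proportion z-test of H0: p1 = p2 against H1: p1 > p2."""
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return z, p_value

results = {}
for higher, lower in pairs:
    z, p = one_sided_ztest(*counts[higher], *counts[lower])
    results[(higher, lower)] = (z, p, p < alpha_per_test)
    print(f"Class {higher} vs. class {lower}: z = {z:.3f}, p = {p:.2e}, "
          f"reject H0: {p < alpha_per_test}")
```

Each comparison is judged against alpha_per_test rather than 0.1, which keeps the chance of at least one false rejection across the three tests at or below 0.1.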
