Library

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from statsmodels.stats.weightstats import ztest

Exercise 1
The RMS Titanic was a British passenger liner that sank in the North Atlantic Ocean
in the early morning hours of 15 April 1912, after it collided with an iceberg during its
maiden voyage from Southampton to New York City. There were an estimated 2,224 passengers
and crew aboard the ship, and more than 1,500 died, making it one of the deadliest commercial
peacetime maritime disasters in modern history.
Women and children first? The aim is to understand how survivors of the Titanic were
selected...

Import data

In [2]:
df = pd.read_csv("titanic.csv")
df

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


We focus on the features 'Survived', 'Sex' and 'Age' by defining a new dataframe

In [4]:
df_short=df[["Survived", "Sex" ,"Age"]]
df_short

,Survived,Sex,Age
0,0,male,22.0
1,1,female,38.0
2,1,female,26.0
3,1,female,35.0
4,0,male,35.0
...,...,...,...
886,0,male,27.0
887,1,female,19.0
888,0,female,
889,1,male,26.0


Describe the dataframe df_short

In [5]:
df_short.describe()

,Survived,Age
count,891.0,714.0
mean,0.383838,29.699118
std,0.486592,14.526497
min,0.0,0.42
25%,0.0,20.125
50%,0.0,28.0
75%,1.0,38.0
max,1.0,80.0


In [6]:
df_short.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Sex       891 non-null    object 
 2   Age       714 non-null    float64
dtypes: float64(1), int64(1), object(1)
memory usage: 21.0+ KB
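
The info() output above shows 714 non-null ages out of 891 rows, so 177 ages are missing. A minimal sketch of how the missing entries could be counted directly. The frame `toy` is a hypothetical stand-in built from a few rows shown above (rows 0-4 and 888), not the real file:

```python
import numpy as np
import pandas as pd

# Toy frame: rows 0-4 and 888 of df_short as displayed above (illustration only)
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Sex": ["male", "female", "female", "female", "male", "female"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
})

# Number of missing values per column
missing = toy.isna().sum()
print(missing)

# mean() (like describe()) skips NaN, so only the observed ages are used
age_mean = toy["Age"].mean()
print(age_mean)
```

This also explains why describe() reports count 714 for Age but 891 for Survived: the summary statistics ignore missing entries.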


Separate the dataset into men and women.

In [8]:
Titanic_men_passenger = df_short[df_short['Sex'] == 'male']
Titanic_woman_passenger = df_short[df_short['Sex'] == 'female']

print(Titanic_men_passenger)
print(Titanic_woman_passenger)

     Survived   Sex   Age
0           0  male  22.0
4           0  male  35.0
5           0  male   NaN
6           0  male  54.0
7           0  male   2.0
..        ...   ...   ...
883         0  male  28.0
884         0  male  25.0
886         0  male  27.0
889         1  male  26.0
890         0  male  32.0

[577 rows x 3 columns]
     Survived     Sex   Age
1           1  female  38.0
2           1  female  26.0
3           1  female  35.0
8           1  female  27.0
9           1  female  14.0
..        ...     ...   ...
880         1  female  25.0
882         0  female  22.0
885         0  female  39.0
887         1  female  19.0
888         0  female   NaN

[314 rows x 3 columns]


In [9]:
Titanic_men_passenger

,Survived,Sex,Age
0,0,male,22.0
4,0,male,35.0
5,0,male,
6,0,male,54.0
7,0,male,2.0
...,...,...,...
883,0,male,28.0
884,0,male,25.0
886,0,male,27.0
889,1,male,26.0


In [10]:
Titanic_woman_passenger

,Survived,Sex,Age
1,1,female,38.0
2,1,female,26.0
3,1,female,35.0
8,1,female,27.0
9,1,female,14.0
...,...,...,...
880,1,female,25.0
882,0,female,22.0
885,0,female,39.0
887,1,female,19.0


Compare the survival rate of men and women

Calculate the proportion of men (resp. women) who survived

In [35]:
men_survived = len(Titanic_men_passenger[Titanic_men_passenger['Survived']==1])

In [36]:
woman_survived = len(Titanic_woman_passenger[Titanic_woman_passenger['Survived']==1])

In [43]:
men_survived_proportion = men_survived/len(Titanic_men_passenger)
men_survived_proportion

0.18890814558058924

In [44]:
woman_survived_proportion = woman_survived/len(Titanic_woman_passenger)
woman_survived_proportion

0.7420382165605095
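
Since Survived is coded 0/1, its mean within each group is the survival proportion, so the cells above could be condensed into a single groupby. A sketch on a toy frame (the rows of `toy` are hypothetical, chosen only for illustration):

```python
import pandas as pd

# Hypothetical toy data: 4 men (1 survivor) and 4 women (3 survivors)
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0],
    "Sex": ["male", "female", "female", "female", "male", "male", "male", "female"],
})

# Mean of a 0/1 column = proportion of ones in each group
rates = toy.groupby("Sex")["Survived"].mean()

# Survivors ("sum") and group size ("count") side by side
counts = toy.groupby("Sex")["Survived"].agg(["sum", "count"])

print(rates)
print(counts)
```

Applied to df_short, the same groupby should reproduce the two proportions computed above.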

State the null hypothesis H0 and the alternative hypothesis HA:
H0: the proportions of men and women who survived are equal
HA: the proportions of men and women who survived are not equal

In [47]:
# Bernoulli variance p(1-p) of the survival indicator in each group
men_variance = men_survived_proportion*(1-men_survived_proportion)
women_variance = woman_survived_proportion*(1-woman_survived_proportion)

# Two-sample z statistic for the difference of proportions (unpooled variances)
z_test = (men_survived_proportion-woman_survived_proportion)/np.sqrt(men_variance/len(Titanic_men_passenger) + women_variance/len(Titanic_woman_passenger))
z_test

-18.697510317440955
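
The statistic computed above is the two-sample z statistic for a difference of proportions with unpooled variances. Written out, with the notation (introduced here, not in the original notebook) p̂_m, p̂_w for the observed survival proportions and n_m = 577, n_w = 314 for the group sizes:

```latex
\[
H_0 : p_m = p_w \qquad \text{vs.} \qquad H_A : p_m \neq p_w
\]
\[
z = \frac{\hat p_m - \hat p_w}
         {\sqrt{\dfrac{\hat p_m (1 - \hat p_m)}{n_m} + \dfrac{\hat p_w (1 - \hat p_w)}{n_w}}}
  = \frac{0.1889 - 0.7420}
         {\sqrt{\dfrac{0.1889 \cdot 0.8111}{577} + \dfrac{0.7420 \cdot 0.2580}{314}}}
  \approx -18.70
\]
```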

In [46]:
from scipy.stats import norm
# HA is two-sided, so alpha = 0.05 is split across both tails
zcritical=norm.ppf(1-0.05/2)
zcritical

1.959963984540054

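Reject H0: |z| ≈ 18.70 is far larger than the critical value, so the survival proportions of men and women differ significantly at the 5% level.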
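
As a cross-check, the test can be reproduced from the counts alone. This is a sketch, not the course's reference solution: the counts 109 of 577 men and 233 of 314 women are inferred from the proportions printed above, and the pooled variant is added only for comparison, since under H0 both groups share a single proportion.

```python
from math import erfc, sqrt
from statistics import NormalDist

# Counts inferred from the proportions above (109/577 and 233/314)
x_m, n_m = 109, 577
x_w, n_w = 233, 314
p_m, p_w = x_m / n_m, x_w / n_w

# Unpooled standard error, as in the notebook cell above
se_unpooled = sqrt(p_m * (1 - p_m) / n_m + p_w * (1 - p_w) / n_w)
z_unpooled = (p_m - p_w) / se_unpooled

# Pooled standard error: one common proportion estimated under H0
p_pool = (x_m + x_w) / (n_m + n_w)
se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_m + 1 / n_w))
z_pooled = (p_m - p_w) / se_pooled

# Two-sided critical value at alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - 0.05 / 2)

# Two-sided p-value 2 * (1 - Phi(|z|)); erfc stays accurate far in the tail
p_value = erfc(abs(z_unpooled) / sqrt(2))

reject_h0 = abs(z_unpooled) > z_crit
print(z_unpooled, z_pooled, z_crit, p_value, reject_h0)
```

Both variants lead to the same decision here; the p-value is so small that the choice of standard error does not matter for the conclusion.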

In [51]:
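# The original call on the index-filtered Series raised "KeyError: False".
# Workaround (assumed cause: label-based indexing on the Series): pass NumPy arrays.
ztest_Score, p_value = ztest(Titanic_men_passenger["Survived"].to_numpy(), Titanic_woman_passenger["Survived"].to_numpy(), value=0)
ztest_Score, p_value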

Exercise 2
We work with a dataset of credit applicants. The dataset can be downloaded from
the course website.

Import data

In [53]:
clustering = pd.read_csv("clustering.csv")

Extract only the two features ApplicantIncome and LoanAmount

In [54]:
df_short=clustering[["ApplicantIncome", "LoanAmount"]]

Use K-means to perform clustering on these new data with K = 2 clusters

In [55]:
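from sklearn.cluster import KMeans

# Fit K-means with K = 2; random_state makes the initialisation reproducible
kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(df_short)
kmeans.cluster_centers_
# inertia_: within-cluster sum of squared distances to the nearest centroid
kmeans.inertia_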

291148680.6268268

Use K-means to perform clustering on these new data with K = 3 clusters

In [56]:
from sklearn.cluster import KMeans

# Same fit with K = 3 clusters
kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(df_short)
kmeans.cluster_centers_
# inertia_: within-cluster sum of squared distances to the nearest centroid
kmeans.inertia_

151285948.94516143
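
Two caveats apply when comparing the inertia values above. Inertia generally decreases as K grows, so a lower value for K = 3 does not by itself mean three clusters fit better. In addition, the raw features are on very different scales, so the distances are likely dominated by ApplicantIncome. A sketch of standardising first and then tracing inertia over several K (an elbow plot in table form). The data below are synthetic stand-ins, not the course dataset, and all variable names are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (ApplicantIncome, LoanAmount); NOT the course data
rng = np.random.default_rng(0)
income = np.concatenate([rng.normal(3000, 500, 100), rng.normal(9000, 800, 100)])
loan = np.concatenate([rng.normal(100, 20, 100), rng.normal(250, 30, 100)])
X = np.column_stack([income, loan])

# Standardise so both features contribute comparably to Euclidean distances
X_scaled = StandardScaler().fit_transform(X)

# Inertia for K = 1..5; look for the K after which the decrease flattens
inertias = []
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias.append(model.inertia_)

print(inertias)
```

With the real data, the same loop could be run on the standardised df_short, provided it contains no missing values.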

Include the additional feature CoapplicantIncome

In [None]:
df_long=clustering[["CoapplicantIncome","ApplicantIncome", "LoanAmount"]]

Perform a PCA

In [2]:
np.random.seed(123)
X = np.random.randn(50,2)
X[0:25, 0] = X[0:25, 0] + 3
X[0:25, 1] = X[0:25, 1] - 4

NameError: name 'plt' is not defined
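
The cell above only generates synthetic data. A PCA on the three features of df_long could be sketched as follows, assuming scikit-learn's PCA and standardised inputs. The array `X` is a synthetic stand-in for df_long, not the course data, and the relationship between the columns is invented for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (CoapplicantIncome, ApplicantIncome, LoanAmount)
rng = np.random.default_rng(123)
applicant = rng.normal(5000, 2000, 200)
coapplicant = rng.normal(1500, 1000, 200)
loan = 0.02 * (applicant + coapplicant) + rng.normal(0, 15, 200)
X = np.column_stack([coapplicant, applicant, loan])

# PCA is scale-sensitive, so standardise each column first
X_scaled = StandardScaler().fit_transform(X)

# Keep all three components to inspect how the variance is distributed
pca = PCA(n_components=3)
scores = pca.fit_transform(X_scaled)

print(pca.explained_variance_ratio_)  # share of variance per component
print(pca.components_)                # loadings: one row per component
```

The first two columns of `scores` could then be used as input to K-means or for a two-dimensional scatter plot.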