Data Dictionary of original dataset
================
| Variable | Description | Details |
| -------- | ----------- | ------- |
| Survived | Survival | 0 = No; 1 = Yes |
| Pclass | Passenger Class | 1 = upper; 2 = middle; 3 = lower |
| Name | First and Last Name | |
| Sex | Sex | |
| Age | Age | Fractional if Age less than One (1); If the Age is Estimated, it is in the form xx.5 |
| SibSp | Number of Siblings/Spouses Aboard | |
| Parch | Number of Parents/Children Aboard | |
| Ticket | Ticket Number | |
| Fare | Passenger Fare | |
| Cabin | Cabin | |
| Embarked | Port of Embarkation | C = Cherbourg; Q = Queenstown; S = Southampton |
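
The coded columns in the dictionary above can be turned into readable labels when needed. The following is a minimal sketch on a tiny made-up sample; the helper `decode_codes` and the `*_label` column names are hypothetical and are not used elsewhere in this notebook.

```python
import pandas as pd

# Code-to-label mappings taken from the data dictionary above.
PCLASS_LABELS = {1: "upper", 2: "middle", 3: "lower"}
EMBARKED_LABELS = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}
SURVIVED_LABELS = {0: "No", 1: "Yes"}

def decode_codes(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with hypothetical *_label columns next to the coded ones."""
    out = frame.copy()
    out["Pclass_label"] = out["Pclass"].map(PCLASS_LABELS)
    out["Embarked_label"] = out["Embarked"].map(EMBARKED_LABELS)
    out["Survived_label"] = out["Survived"].map(SURVIVED_LABELS)
    return out

# Tiny made-up sample, not rows from train.csv.
sample = pd.DataFrame({"Survived": [0, 1], "Pclass": [3, 1], "Embarked": ["S", "C"]})
decoded = decode_codes(sample)
```

Keeping the original coded columns alongside the labels means numeric operations such as correlations still work on the codes.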


In [1]:

# For data manipulation and analysis
import pandas as pd 
# For mathematical operations
import numpy as np 

# For interactive plots
import plotly as py 
# For easy-to-use, high-level interface for creating expressive and interactive visualizations
import plotly.express as px 
# For creating complex and customized plots
import plotly.graph_objects as go 
# For creating subplots in a single figure
from plotly.subplots import make_subplots

# remove warnings from output
import warnings
warnings.filterwarnings('ignore')

# Set default template for plotly as dark theme
import plotly.io as pio
pio.templates.default = "plotly_dark"

In [2]:
# Load Dataset
train = pd.read_csv("train.csv")

# Make a deep copy of the train DataFrame and assign it to df
# So that we can always go back to the original dataset if we need to
df = train.copy(deep=True)

In [3]:
df.shape

(891, 12)

In [4]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [6]:
# Create new features
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
df["IsAlone"] = df.FamilySize == 1
df['CabinLetter'] = df['Cabin'].str[0]

In [7]:
# Obtain basic statistics for numerical variables
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,FamilySize
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208,1.904602
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429,1.613459
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0,1.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104,1.0
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542,1.0
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0,2.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292,11.0


---

Objective : Analyze the missing values in a dataframe<br>
About the code : Create a function that can analyze the missing values in a dataframe and display them in a table and a heatmap. <br> 
Justification : The heatmap was chosen because it can visually show the distribution and pattern of missing values across the dataframe. 

Steps to build the function :
- The first line defines a function called `missing_values` that takes a dataframe as an input parameter.
- The second line creates a new dataframe with the data types of each column in the input dataframe using the `pd.DataFrame` method and assigns it to the variable `missingInfo`.
- The third line adds a new column to the `missingInfo` dataframe with the number of missing values in each column of the input dataframe using the `df.isnull().sum()` method.
- The fourth line adds another new column to the `missingInfo` dataframe with the percentage of missing values in each column of the input dataframe using the `round((df.isnull().sum() / len(df)) * 100, 2)` expression.
- The fifth line sorts the `missingInfo` dataframe by the percentage of missing values in descending order using the `sort_values` method with the parameter `by="% missing"` and `ascending=False`.
- The sixth line creates a heatmap plot of the input dataframe showing which values are missing (True) or present (False) using the `px.imshow` method with the parameters `df.isnull()`, `width=500`, and `title="Missing Values"` and assigns it to the variable `fig_missing`.
- The seventh line displays the plot using the `show` method on the `fig_missing` variable.
- The eighth line returns the `missingInfo` dataframe as the output of the function.
- The ninth line calls the function with a dataframe as an argument and displays its output.

Result : The table lists the data type, the number of missing values, and the percentage of missing values for each column, sorted by the percentage of missing values in descending order. `Cabin` has the most missing values (77.10%), and `CabinLetter`, which is derived from `Cabin`, inherits the same gaps. `Age` follows with 19.87% and `Embarked` with 0.22%. The rest of the columns have no missing values. The table also shows that the columns have different data types, such as object, float64, and int64.

In [8]:
def missing_values(df):
    missingInfo = pd.DataFrame(df.dtypes, columns=["dtypes"])
    missingInfo["missing"] = df.isnull().sum()
    missingInfo["% missing"] = round((df.isnull().sum() / len(df)) * 100, 2)
    missingInfo = missingInfo.sort_values(by="% missing", ascending=False)
    fig_missing = px.imshow(df.isnull(), width=500, title="Missing Values")
    fig_missing.show()
    return missingInfo
missing_values(df)

Unnamed: 0,dtypes,missing,% missing
Cabin,object,687,77.1
CabinLetter,object,687,77.1
Age,float64,177,19.87
Embarked,object,2,0.22
PassengerId,int64,0,0.0
Survived,int64,0,0.0
Pclass,int64,0,0.0
Name,object,0,0.0
Sex,object,0,0.0
SibSp,int64,0,0.0


In [9]:
# fill the missing values in the Age column with the mean of the column because it is a numerical column
df["Age"] = df["Age"].fillna(df["Age"].mean())
# fill the missing values in the Embarked column with the most frequent value of the column because it is a categorical column
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
# fill the missing values in the Cabin column with the string "Unknown" instead of dropping the column because there are too many missing values
df["Cabin"] = df["Cabin"].fillna("Unknown")
# call the missing_values function defined earlier to see the updated information about the missing values
missing_values(df)

Unnamed: 0,dtypes,missing,% missing
CabinLetter,object,687,77.1
PassengerId,int64,0,0.0
Survived,int64,0,0.0
Pclass,int64,0,0.0
Name,object,0,0.0
Sex,object,0,0.0
Age,float64,0,0.0
SibSp,int64,0,0.0
Parch,int64,0,0.0
Ticket,object,0,0.0
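
The table above still lists `CabinLetter` with 687 missing values, because that column was derived from `Cabin` before the fill. One possible way to keep the two columns in sync is to fill first and derive afterwards. The sketch below uses a made-up stand-in for the `Cabin` column, and it assumes that no real cabin code starts with the letter U.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the Cabin column; NaN marks an unrecorded cabin.
toy = pd.DataFrame({"Cabin": ["C85", np.nan, "E46", np.nan]})

# Fill first, then derive the letter, so both columns agree.
toy["Cabin"] = toy["Cabin"].fillna("Unknown")
toy["CabinLetter"] = toy["Cabin"].str[0]

# "Unknown" starts with "U", which now acts as the label for a missing cabin.
# Assumption: no real cabin code starts with U.
missing_after = int(toy["CabinLetter"].isnull().sum())
```

Whether a placeholder letter is preferable to leaving `CabinLetter` as NaN depends on how the column is used later; the sketch only shows the ordering issue.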


---

Objective : Analyze the distribution of data in a dataframe<br>
About the code : Create a figure with multiple subplots that show the distribution of data from a dataframe called `df`. The dataframe contains information about the passengers on the Titanic, such as their gender, class, embarked port, survival status, age and family size. The code uses the `plotly` library to create interactive plots that can be displayed in a web browser.

Justification : The code uses bar plots and pie plots for the categorical variables (gender, class, embarked port and survival status) because they can show the frequency and proportion of each category easily. The code chooses to use histograms for the numerical variables (age and family size) because they can show the shape and density of the data distribution.

The code follows these steps to build the chart:

- Define a function called `plot_row` that takes three arguments: `column`, `row` and `pull`. The function is used to create a bar plot and a pie plot for a given column of the dataframe in a specified row of the figure. The `pull` argument is used to pull out some slices of the pie plot for emphasis.
- Inside the function, use the `add_trace` method of the figure object to add a bar plot trace to the first column of the row. The x-axis values are the unique values of the column, the y-axis values are the counts of each value, and the text labels are also the counts. The name and marker color are left empty.
- Use the same method to add a pie plot trace to the second column of the row. The labels are the unique values of the column, the values are the counts of each value, and the name is left empty. The pull argument is passed to the pie plot trace to pull out some slices. The marker colors are chosen from a qualitative palette from `plotly.express`.
- Create the subplots layout using the `make_subplots` function. The function takes several arguments, such as:
  - `rows` and `cols`, which specify how many rows and columns of subplots are in the figure.
  - `specs`, which is a list of lists that defines what type of plot is in each subplot. In this case, each row has a bar plot and a pie plot, except for rows 5 and 7, which have only one histogram plot that spans two columns.
  - `subplot_titles`, which is a list of strings that gives titles to each subplot.
- Plot the rows using the `plot_row` function defined earlier. Pass in the column name, row number and pull argument for each row. For example, to plot the gender bar and pie plots in row 1, use `plot_row('Sex', 1, pull=[0.1, 0])`. Here `pull=[0.1, 0]` pulls the first slice (male) out from the center of the pie by 0.1 of the pie's radius and leaves the second slice in place.
- Plot the age histogram using the `add_trace` method. Pass in a histogram trace with the age column of the dataframe as the x-axis values, an empty string as the name, `histnorm='density'` (to normalize the histogram), and a hex code as the marker color.
- Plot the family size histogram using the same method. Pass in a histogram trace with x-axis values as the family size column of the dataframe, name as empty string, histnorm as density, text labels as counts of each value, and marker colors as a diverging palette from `plotly.express`.
- Update the layout and traces using the `update_layout` and `update_traces` methods. Some of the layout options are:
  - `height` and `width`, which set the height and width of the figure in pixels.
  - `showlegend`, which controls whether to show legends or not.
  - `title_text`, which sets the title text of the figure.
  - `title_x`, which sets the position of the title along the x-axis.
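  - `titlefont`, which sets the font style of the title (`titlefont` is a deprecated spelling of `title_font`).
  - `paper_bgcolor`, which sets the background color of the whole figure, including the margins around the plots.
  - `plot_bgcolor`, which sets the background color of the plotting area inside the axes.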
  - `font_color`, which sets the font color.
- Apply figure-wide updates. `update_yaxes(showgrid=False)` removes y-axis gridlines from all subplots. `update_traces(marker_line_color='black', marker_line_width=2)` is called without a selector, so it sets the marker outline color and width on every trace (bars, pie slices, and histograms), not only on the histograms. `update_annotations(font={'color': '#6bddff'})` recolors the subplot titles, which Plotly stores as annotations.
- Display the figure using the `show` method. In a notebook the interactive figure renders below the cell; when run as a script it typically opens in a web browser.
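
One detail worth checking in the bar and pie steps is that labels and counts are in the same order. The sketch below, on a made-up column, illustrates why pairing `unique()` with `value_counts()` positionally can mislabel categories: `unique()` keeps the order of first appearance, while `value_counts()` sorts by frequency. The variable names are illustrative only.

```python
import pandas as pd

# Made-up column in which the first value seen is not the most frequent one.
ports = pd.Series(["Q", "S", "S", "C", "S", "C"])

# unique() keeps order of first appearance; value_counts() sorts by count.
appearance_order = list(ports.unique())
counts = ports.value_counts()

# Pairing the two positionally attaches S's count to the label Q.
mismatched = dict(zip(appearance_order, counts.values.tolist()))

# Taking labels and values from the same Series keeps them aligned.
aligned = dict(zip(counts.index.tolist(), counts.values.tolist()))
```

In the Titanic columns plotted here the two orders happen to coincide, so the reported counts are unaffected, but that is a property of this dataset rather than a guarantee.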

Results :

For gender, there were 577 male (64.8%) and 314 female (35.2%) passengers.

For passenger class, 216 people (24.2%) were in the upper class, 184 people (20.7%) were in the middle class, and 491 people (55.1%) were in the lower class.

In terms of where people boarded the ship, 646 people (72.5%) boarded in Southampton, 168 people (18.9%) boarded in Cherbourg, and 77 people (8.64%) boarded in Queenstown.
- Out of all the passengers, 342 (38.4%) survived, while 549 (61.6%) did not survive.

Now let's talk about the ages of the passengers. The figures below come from `histogram_summary`, which runs after the missing ages were filled with the mean and truncates every age to a whole number, so they differ slightly from the `describe()` output on the raw column (mean 29.70, median 28, standard deviation 14.53).
- The average age (mean) is about 29.54 years. A mean alone does not say that most passengers are close to 29; the spread figures below describe that.
- The middle age (median) is 29, which means that about half of the passengers are 29 or younger and about half are 29 or older.
- The most common age (mode) is 29. This is largely an effect of the imputation: the 177 missing ages were all filled with the mean (about 29.7), which truncates to 29.
- The range of ages is 80: after truncation the youngest age is 0 and the oldest is 80.
- The interquartile range (IQR) of ages is 13, which shows that the middle 50% of the ages fall within a span of 13 years.
- The standard deviation of ages is 13.01, so a typical age lies roughly 13 years away from the mean.

Now let's look at the family sizes. `FamilySize` is recorded per passenger (the passenger plus any siblings, spouses, parents, and children aboard), so these figures describe passengers rather than distinct families. On average, a passenger belongs to a group of about 1.9 people.
- The median family size is 1.0, which means that at least half of the passengers travelled alone.
- The most common family size is 1, so solo travellers form the largest group.
- The range of family sizes is 10: the smallest value is 1 and the largest is 11.
- The interquartile range (IQR) of family sizes is 1.0: the middle 50% of passengers have a family size between 1 and 2.
- The standard deviation of family sizes is approximately 1.6, which is large relative to the mean of 1.9 and reflects a small number of big families.

In [10]:
def plot_row(column, row, pull=None):
    # Take labels and counts from the same Series so they stay aligned;
    # unique() and value_counts() do not guarantee the same ordering.
    counts = df[column].value_counts()

    fig.add_trace(go.Bar(
        x=counts.index,
        y=counts.values,
        text=counts.values,
        name="",
        marker_color=None,
    ), row=row, col=1)

    fig.add_trace(go.Pie(
        labels=counts.index,
        values=counts.values,
        name="",
        pull=pull,
        marker_colors=px.colors.qualitative.G10
    ), row=row, col=2)

fig = make_subplots(
    rows=8, cols=2,
    specs=[[{}, {'type': 'domain'}],
           [{}, {'type': 'domain'}],
           [{}, {'type': 'domain'}],
           [{}, {'type': 'domain'}],
           [{"rowspan": 2, "colspan": 2}, None],
           [None, None],
           [{"rowspan": 2, "colspan": 2}, None],
           [None, None],
          ],
    subplot_titles=('Gender Bar', 'Gender Pie', 'Pclass Bar', 'Pclass Pie', 'Embarked Bar', 'Embarked Pie',
                    'Survived Bar', 'Survived Pie', 'Age Distribution', 'Family Size'),
)

plot_row('Sex', 1, pull=[0.1, 0])
plot_row('Pclass', 2, pull=[0.1, 0, 0])
plot_row('Embarked', 3, pull=[0.1, 0, 0])
plot_row('Survived', 4, pull=[0.1, 0])

fig.add_trace(go.Histogram(
    x=df['Age'],
    name="",
    histnorm='density',
    marker_color="#f4a582"
), row=5, col=1)

fig.add_trace(go.Histogram(
    x=df['FamilySize'],
    name="",
    histnorm='density',
    text=df['FamilySize'].value_counts(),
    marker_color=px.colors.diverging.RdBu
), row=7, col=1)

fig.update_layout(
    height=1600,
    width=800,
    showlegend=False,
    title_text="Distribution of data",
    title_x=0.5,
    title_font={'size': 25, 'family': 'Roboto', 'color': 'white'},  # `titlefont` is the deprecated spelling
    paper_bgcolor="black",
    plot_bgcolor="black",
    font_color="white"
)

fig.update_yaxes(showgrid=False)

fig.update_traces(marker_line_color='black', marker_line_width=2)

fig.update_annotations(font={'color': '#6bddff'})

fig.show()

In [11]:
# Summarize the center and spread of a numeric column of df
def histogram_summary(column):
    # Print the column name
    print("Summary for", column)
    
    # Extract the values from the specified column in the DataFrame
    data = df[column].values
    
    # Remove any NaN (missing) values and truncate to integers (np.bincount below needs non-negative ints)
    data = data[~np.isnan(data)].astype(int)

    # Calculate the mean, median, and mode of the data
    mean = np.mean(data)
    median = np.median(data)
    mode = np.argmax(np.bincount(data))

    # Calculate the range, interquartile range, and standard deviation of the data
    range1 = np.ptp(data)
    iqr = np.percentile(data, 75) - np.percentile(data, 25)
    std = np.std(data)

    # Print the calculated results
    print("Mean :", mean)
    print("Median:", median)
    print("Mode:", mode)
    print("Range:", range1)
    print("Interquartile range:", iqr)
    print("Standard deviation:", std, "\n")

# Call the histogram_summary function for the "Age" column
histogram_summary("Age")

# Call the histogram_summary function for the "FamilySize" column
histogram_summary("FamilySize")


Summary for Age
Mean : 29.544332210998878
Median: 29.0
Mode: 29
Range: 80
Interquartile range: 13.0
Standard deviation: 13.006473346327034 

Summary for FamilySize
Mean : 1.904601571268238
Median: 1.0
Mode: 1
Range: 10
Interquartile range: 1.0
Standard deviation: 1.6125528671095162 
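
The summary above is computed after the mean fill and after truncation to integers, both of which change the spread statistics. The sketch below, on made-up ages, compares the same statistics before and after mean imputation using pandas. The helper `spread_summary` is hypothetical, and it uses the sample standard deviation (pandas' default), whereas `np.std` in the cell above uses the population formula.

```python
import numpy as np
import pandas as pd

# Made-up ages with two missing entries; not values from train.csv.
ages = pd.Series([2.0, 20.0, 24.0, 30.0, 54.0, np.nan, np.nan])

def spread_summary(values: pd.Series) -> dict:
    """Median, IQR and sample standard deviation, skipping missing values."""
    clean = values.dropna()
    q1, q3 = clean.quantile([0.25, 0.75])
    return {"median": clean.median(), "iqr": q3 - q1, "std": clean.std()}

raw = spread_summary(ages)

# Mean imputation adds copies of the mean, which pulls the quartiles inward
# and moves the median toward the mean.
imputed = spread_summary(ages.fillna(ages.mean()))
```

On this toy series the IQR and the standard deviation both shrink after imputation, which is the same direction as the difference between `describe()` on the raw `Age` column and the summary printed above.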



---

Objective : To identify outliers in the Titanic dataset for certain variables. <br>
About the code : Creates a box plot visualization to identify outliers in the Titanic dataset. <br>
Justification : The chart was chosen because box plots are effective in visualizing the distribution and identifying outliers in numerical data.

Step-by-step explanation of the code:

1. Box traces are defined as a list of box plot objects using the `go.Box` function. Each box plot represents a different variable from the Titanic dataset: 'Fare', 'Age', 'SibSp', and 'Parch'. The color of each box plot is specified using the `marker` argument.

2. The layout of the chart is defined using the `go.Layout` function. The layout includes a title, a fixed height of 400 pixels, and a font size of 14.

3. A `go.Figure` object is created, which combines the box traces and the layout.

4. Finally, the `fig.show()` method is called to display the chart.

Results :

Outliers are special data points that are very different from the others. In the Titanic dataset, there are some passengers who have unusual information compared to the rest.

For 'Parch' (parents and children), 213 passengers have a value above 0. Because more than 75% of passengers have `Parch` = 0, the IQR is 0 and the rule flags every passenger travelling with at least one parent or child.

For 'SibSp' (siblings and spouses), 46 passengers have a value above 2.50, meaning they travelled with three or more siblings or spouses.

For 'Age', 24 passengers are younger than 2.50 years. These are infants and toddlers rather than data errors (the data dictionary notes that ages below one are fractional). Only the lower fence is checked for age in this code, and the fence is computed after the missing ages were filled with the mean.

For 'Fare', 116 passengers paid more than 65.63, well above what most passengers paid.
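
The thresholds quoted above follow the common 1.5 × IQR rule. A minimal sketch of that rule with both fences is shown below; the helpers `iqr_fences` and `iqr_outliers` are hypothetical and the data is made up to mimic a column that is mostly zeros, like `Parch`.

```python
import pandas as pd

def iqr_fences(values: pd.Series, k: float = 1.5) -> tuple:
    """Return (lower, upper) fences of the k * IQR rule for a numeric Series."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    spread = k * (q3 - q1)
    return q1 - spread, q3 + spread

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Values lying outside either fence."""
    lower, upper = iqr_fences(values, k)
    return values[(values < lower) | (values > upper)]

# Made-up, Parch-like data: mostly zeros, so Q1 = Q3 = 0 and the IQR is 0.
parch_like = pd.Series([0] * 10 + [1, 2])
fences = iqr_fences(parch_like)
flagged = iqr_outliers(parch_like)
```

When the IQR collapses to zero, both fences sit at the same value and every other value is flagged, which is why the rule is of limited use for heavily zero-inflated count columns.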

In [12]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'FamilySize', 'IsAlone',
       'CabinLetter'],
      dtype='object')

In [13]:
box_traces = [
    go.Box(x=df['Fare'], name='Fare', marker=dict(color='#87edff')),
    go.Box(x=df['Age'], name='Age', marker=dict(color='#f37a7a')),
    go.Box(x=df['SibSp'], name='SibSp', marker=dict(color='green')),
    go.Box(x=df['Parch'], name='Parch', marker=dict(color='orange'))
]

layout = go.Layout(
    title='Outliers in Titanic Dataset',
    height=400,
    font=dict(size=14)
)

fig = go.Figure(data=box_traces, layout=layout)

fig.show()

In [14]:
# Calculate summary statistics
fare_stats = df['Fare'].describe()
age_stats = df['Age'].describe()
sibsp_stats = df['SibSp'].describe()
parch_stats = df['Parch'].describe()

# Define the range for outliers (e.g., 1.5 times the interquartile range)
fare_outlier_range = 1.5 * (fare_stats['75%'] - fare_stats['25%'])
age_outlier_range = 1.5 * (age_stats['75%'] - age_stats['25%'])
sibsp_outlier_range = 1.5 * (sibsp_stats['75%'] - sibsp_stats['25%'])
parch_outlier_range = 1.5 * (parch_stats['75%'] - parch_stats['25%'])

# Identify outliers
fare_outliers = df[df['Fare'] > fare_stats['75%'] + fare_outlier_range]
age_outliers = df[df['Age'] < age_stats['25%'] - age_outlier_range]
sibsp_outliers = df[df['SibSp'] > sibsp_stats['75%'] + sibsp_outlier_range]
parch_outliers = df[df['Parch'] > parch_stats['75%'] + parch_outlier_range]

# Print outlier information
print("Outlier Information:")
print(f"Parch Outliers (Above {parch_stats['75%'] + parch_outlier_range:.2f}): {len(parch_outliers)} passengers")
print(f"SibSp Outliers (Above {sibsp_stats['75%'] + sibsp_outlier_range:.2f}): {len(sibsp_outliers)} passengers")
print(f"Age Outliers (Below {age_stats['25%'] - age_outlier_range:.2f}): {len(age_outliers)} passengers")
print(f"Fare Outliers (Above {fare_stats['75%'] + fare_outlier_range:.2f}): {len(fare_outliers)} passengers")


Outlier Information:
Parch Outliers (Above 0.00): 213 passengers
SibSp Outliers (Above 2.50): 46 passengers
Age Outliers (Below 2.50): 24 passengers
Fare Outliers (Above 65.63): 116 passengers


---

Objective : To analyze the correlation between the features in the Titanic dataset. <br>
About the code : Creates a correlation matrix chart to visualize the pairwise correlations among the features in a DataFrame. <br>
Justification : The chart is chosen because it provides a clear and concise representation of the correlations.

Step-by-step explanation of the code:

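1. Calculate the correlation matrix using the Pearson correlation method. `numeric_only=True` restricts the calculation to numeric and boolean columns; pandas 2.0 and later raise an error on text columns such as `Name` without it:
   <br>`corr = df.corr(method='pearson', numeric_only=True)`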

2. Create a boolean mask with `True` values in the upper triangular part of the correlation matrix:
   <br>`mask = np.triu(np.ones_like(corr, dtype=bool))`

3. Apply the mask to the correlation matrix, replacing the values in the upper triangular part with NaN (masked values):
   <br>`masked_corr = corr.mask(mask)`

4. Create a new Plotly figure using `px.imshow()` and specify the title, height, and width (the template comes from the default set in the first cell):
   <br>`fig = px.imshow(masked_corr, title='Correlations Among All Features', height=700, width=700)`

5. Set the text for each cell in the plot to the corresponding correlation value rounded to 2 decimal places:
   <br>`fig.update_traces(text=corr.values.round(2), hovertemplate='Feature 1: %{y}<br>Feature 2: %{x}<br>Correlation: %{text}')`

6. Add a colorbar to the chart with a title:
   <br>`fig.update_traces(colorbar=dict(title="Correlation"))`

7. Iterate over each cell in the masked correlation matrix:
   - If the cell is not on the diagonal (i != j) and it has a non-null value, do the following:
     - Determine the color for the annotation text based on the correlation value (white if < 0.5, black otherwise).
     - Add an annotation with the correlation value to the corresponding cell in the figure:
       - `fig.add_annotation(x=j, y=i, text=str(masked_corr.iloc[i, j].round(2)), showarrow=False, font=dict(color=color))` 
<br><br>
8. Update the x-axis to remove gridlines:
   <br>`fig.update_xaxes(showgrid=False)`

9. Update the y-axis to remove gridlines:
   <br>`fig.update_yaxes(showgrid=False)`

10. Display the final chart:
    <br>`fig.show()`

Results :
- highest positive correlation: 'SibSp' and 'Parch' (excluding pairs that involve the derived 'FamilySize')
- highest negative correlation: 'Pclass' and 'Fare'
- highest positive correlation with Survived: 'Fare'
- highest negative correlation with Survived: 'Pclass'
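
Reading the strongest pairs off a heatmap by eye is error-prone, so they can also be extracted programmatically. The sketch below does this on made-up data; the column names `a` to `d` are placeholders, not Titanic columns, and the approach (mask, stack, `idxmax`/`idxmin`) is one option among several.

```python
import numpy as np
import pandas as pd

# Made-up numeric data; the column names are placeholders.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 4, 3, 2, 1],
    "d": [1, 3, 2, 5, 4],
})
corr = toy.corr(method="pearson")

# Keep only the strict lower triangle so each pair appears once
# and the diagonal (always 1.0) is excluded.
lower = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
pairs = corr.where(lower).stack().dropna()  # dropna() removes the masked cells

# Each result is a (row label, column label) tuple.
strongest_positive = pairs.idxmax()
strongest_negative = pairs.idxmin()
```

Sorting `pairs` with `sort_values()` gives the full ranking, which makes it easy to exclude derived columns before picking the extremes.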


In [15]:
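# numeric_only=True skips text columns such as Name (needed on pandas 2.0 and later)
corr = df.corr(method='pearson', numeric_only=True)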

mask = np.triu(np.ones_like(corr, dtype=bool))

masked_corr = corr.mask(mask)

fig = px.imshow(masked_corr,
                title='Correlations Among All Features',
                height=700, width=700)

fig.update_traces(text=corr.values.round(2),
                  hovertemplate='Feature 1: %{y}<br>Feature 2: %{x}<br>Correlation: %{text}',
                  colorbar=dict(title="Correlation"))

for i in range(len(masked_corr)):
    for j in range(len(masked_corr)):
        if i != j and not pd.isnull(masked_corr.iloc[i, j]):
            color = 'white' if float(masked_corr.iloc[i, j]) < 0.5 else 'black'
            fig.add_annotation(x=j, y=i, text=str(masked_corr.iloc[i, j].round(2)),
                               showarrow=False, font=dict(color=color))

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)

fig.show()

---

Objective : To analyze the relationship between Age, Fare, Pclass, and Survival <br>
About the code : Creates a 3D scatter plot using Plotly, visualizing the relationship between age, fare, passenger class (Pclass), and survival (Survived) in a dataset (df). <br> 
Justification : A 3D scatter plot was chosen because it can place two numerical variables (age and fare) and one ordinal variable (passenger class) on its three axes while encoding survival as color. This type of chart can provide insights into potential patterns or correlations between these variables.

Here is a step-by-step explanation of each line and part of the code:

1. `color_sequence = ['#cd1e1e', '#00FF00']`: This line defines a list of two color codes, `'#cd1e1e'` (red) and `'#00FF00'` (green), which will be used to represent the two values of the "Survived" variable.

2. `fig = px.scatter_3d(df, x="Age", y="Fare", z="Pclass", color="Survived", color_continuous_scale=color_sequence, title="Relationship between Age, Fare, Pclass, and Survival", height=700, width=700)`: This line creates a 3D scatter plot using the Plotly Express (px) library. It specifies the dataset `df` and assigns the variables "Age", "Fare", and "Pclass" to the x, y, and z axes, respectively. The "Survived" variable determines the color of each data point. Because `Survived` is stored as the integers 0 and 1, Plotly Express treats it as a continuous variable, and `color_continuous_scale` maps 0 to the first color and 1 to the second. Additional parameters include the chart title ("Relationship between Age, Fare, Pclass, and Survival") and the dimensions of the chart (height=700, width=700).

3. `fig.show()`: This line displays the created figure (`fig`) in the output.


Results :

Green dots represent passengers who survived, while red dots represent passengers who did not survive.
Passengers in higher passenger classes had higher survival rates compared to lower classes. Those who paid higher fares also had higher chances of survival. Additionally, younger passengers had higher survival rates compared to older passengers. In summary, higher class, higher fare, and younger age were associated with increased chances of survival.
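
Claims such as "higher class had higher survival rates" are easier to verify with grouped survival rates than with a 3D scatter alone. The sketch below shows the calculation on a made-up table; the resulting rates describe only this toy data, and the two-bin fare split is an arbitrary choice for illustration.

```python
import pandas as pd

# Made-up passengers; the rates below describe this toy table only.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Fare":     [80.0, 60.0, 90.0, 20.0, 15.0, 8.0, 7.5, 9.0, 7.0, 8.5],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0, 0],
})

# The mean of a 0/1 column is the share of survivors in each group.
rate_by_class = toy.groupby("Pclass")["Survived"].mean()

# Split fares at the median and compare the same rate across the two halves.
fare_half = pd.qcut(toy["Fare"], q=2, labels=["lower half", "upper half"])
rate_by_fare = toy.groupby(fare_half, observed=True)["Survived"].mean()
```

Applying the same two `groupby` calls to `df` would give the actual rates behind the statements above.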

In [16]:
color_sequence = ['#cd1e1e', '#00FF00']
fig = px.scatter_3d(df, x="Age", y="Fare", z="Pclass", color="Survived", color_continuous_scale=color_sequence, title="Relationship between Age, Fare, Pclass, and Survival", height=700, width=700)
fig.show()

Objective: To analyze the passenger flow from different embarkation ports to different passenger classes and their survival status. <br>

About the code: Creates a sankey diagram that shows the passenger flow from embarkation to survival status in the Titanic dataset.

Justification: A Sankey diagram is chosen because it can effectively illustrate the flow and transition between different categories or states. In this case, it can show the movement of passengers from embarkation ports to different passenger classes and then to survival or death.

The code can be explained as follows:

- First, we define a dictionary called `nodes` that contains the attributes of the nodes in the sankey diagram. The nodes are the endpoints of the links and represent the categories or stages of the flow. We specify the following attributes for each node:
  - `pad`: the amount of padding around each node
  - `thickness`: the thickness of each node
  - `line`: the style of the border around each node
  - `label`: the text label for each node
  - `color`: the fill color for each node
- Then, we define another dictionary called `links` that contains the attributes of the links in the sankey diagram. The links are the paths that connect the nodes and represent the flow of passengers. We specify the following attributes for each link:
  - `source`: the index of the source node for each link
  - `target`: the index of the target node for each link
  - `value`: the magnitude or weight of each link
  - `color`: the color of each link
- Next, we define a function called `get_count` that takes a condition as an argument and returns the number of passengers that satisfy that condition in the dataframe. This function will help us to calculate the values for each link.
- Then, we loop through the embarkation ports (Southampton, Cherbourg, Queenstown) and passenger classes (3rd, 2nd, 1st) and add the source, target, value and color for each link that connects these nodes. We use the `get_count` function to get the number of passengers for each combination of port and class. We also use the color of the source node as the color of the link.
- Next, we loop through the passenger classes and survival status (survived, died) and add the source, target, value and color for each link that connects these nodes. We use the same logic as before to get these attributes.
- Finally, we create a figure object using plotly's `go.Figure` function and pass it a `go.Sankey` object that contains our nodes and links dictionaries. We also update the layout of the figure to add a title and a width; the dark theme comes from the default template set in the first cell.
- We show the figure using plotly's `fig.show` method.


Result:

The Sankey diagram shows the passenger flow from different embarkation ports (Southampton, Cherbourg, Queenstown) to different passenger classes (3rd Class, 2nd Class, 1st Class) and their survival status (Survived, Died).

By comparing the values in the diagram, we can make the following observations:

1. Embarkation Port:
   - Southampton had the highest number of passengers, followed by Cherbourg and Queenstown.
   - Most passengers from Southampton were in 3rd Class. About half of the Cherbourg passengers were in 1st Class, while nearly all Queenstown passengers were in 3rd Class.

2. Passenger Class:
   - The majority of passengers were in 3rd Class, followed by 1st Class and 2nd Class.
   - Among the different embarkation ports, Southampton had the highest number of passengers in all classes.

3. Survival Status:
   - Among the passengers in 3rd Class, a higher number died compared to those who survived.
   - In 2nd Class, the number of casualties was slightly higher than the number of survivors.
   - The highest number of survivors was in 1st Class, which was also the only class where survivors outnumbered casualties.
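
The link values behind a Sankey diagram are simply group counts, so they can be cross-checked with a `groupby`. The sketch below builds the port-to-class links that way on a made-up table; the `node_index` dictionary is a hypothetical helper that mirrors the node label order described above, and unlike nested loops this approach omits combinations with zero passengers.

```python
import pandas as pd

# Made-up passengers; counts here are illustrative only.
toy = pd.DataFrame({
    "Embarked": ["S", "S", "S", "C", "C", "Q"],
    "Pclass":   [3, 3, 1, 1, 2, 3],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# Node positions: ports first, then classes in the order 3rd, 2nd, 1st.
node_index = {"S": 0, "C": 1, "Q": 2, 3: 3, 2: 4, 1: 5}

# One row per (port, class) combination that actually occurs.
flows = toy.groupby(["Embarked", "Pclass"]).size().reset_index(name="count")

source = [node_index[port] for port in flows["Embarked"]]
target = [node_index[int(pclass)] for pclass in flows["Pclass"]]
value = flows["count"].tolist()

# Cross-check: every passenger is counted in exactly one link.
total = sum(value)
```

Comparing `total` with the number of rows is a quick sanity check that no passenger was dropped or double-counted when the links were built.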

In [17]:
# Define nodes
nodes = dict(
    pad=15,
    thickness=20,
    line=dict(color="black", width=0.5),
    label=["Southampton", "Cherbourg", "Queenstown", "3rd Class", "2nd Class", "1st Class", "Survived", "Died"],
    # Use a color scheme generated by ColorBrewer
    color=["#336699", "#FF9933", "#CC66CC", "#669966", "#FF99CC", "#FF99CC", "#00AA00", "#AA0000"]
)

# Define links
links = dict(
    source=[], # indices correspond to labels, eg Southampton, Cherbourg, ...
    target=[],
    value=[],
    color=[] # add a color attribute for links
)

# Define a function to get the number of passengers by condition
def get_count(condition):
    return df[condition].shape[0]

# Loop through the embarkation ports and passenger classes
for i, port in enumerate(["S", "C", "Q"]):
    for j, pclass in enumerate([3, 2, 1]):
        # Add the source, target and value for each link
        links["source"].append(i)
        links["target"].append(3 + j)
        links["value"].append(get_count((df["Embarked"] == port) & (df["Pclass"] == pclass)))
        # Add the color for each link based on the source node color
        links["color"].append(nodes["color"][i])

# Loop through the passenger classes and survival status
for k, pclass in enumerate([3, 2, 1]):
    for l, survived in enumerate([1, 0]):
        # Add the source, target and value for each link
        links["source"].append(3 + k)
        links["target"].append(6 + l)
        links["value"].append(get_count((df["Pclass"] == pclass) & (df["Survived"] == survived)))
        # Add the color for each link based on the source node color
        links["color"].append(nodes["color"][3 + k])

# Create figure
fig = go.Figure(data=[go.Sankey(
    node=nodes,
    link=links)])

# Add title
fig.update_layout(title_text="Passenger flow from embarkation to survival status", width = 900)

# Show figure
fig.show()
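
The nested loops above run one boolean filter per link. As a hedged alternative, the same `source`, `target` and `value` lists could be derived from `groupby(...).size()`. The sketch below uses a small made-up DataFrame (`toy`), a lookup table (`node_index`) and a helper (`build_links`); all three are illustrative assumptions and do not appear in the original notebook.

```python
import pandas as pd

# Toy data standing in for the Titanic frame (values are made up for illustration)
toy = pd.DataFrame({
    "Embarked": ["S", "S", "C", "Q", "S", "C"],
    "Pclass":   [3, 1, 1, 3, 3, 2],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Node order mirrors the label list used in the Sankey cell above
node_index = {
    ("Embarked", "S"): 0, ("Embarked", "C"): 1, ("Embarked", "Q"): 2,
    ("Pclass", 3): 3, ("Pclass", 2): 4, ("Pclass", 1): 5,
    ("Survived", 1): 6, ("Survived", 0): 7,
}

def build_links(frame, src_col, dst_col):
    """Hypothetical helper: count rows for each (src, dst) pair that occurs."""
    counts = frame.groupby([src_col, dst_col]).size()
    source, target, value = [], [], []
    for (src, dst), n in counts.items():
        source.append(node_index[(src_col, src)])
        target.append(node_index[(dst_col, dst)])
        value.append(int(n))
    return source, target, value

# First stage: port -> class; second stage: class -> survival status
s1, t1, v1 = build_links(toy, "Embarked", "Pclass")
s2, t2, v2 = build_links(toy, "Pclass", "Survived")
links_toy = dict(source=s1 + s2, target=t1 + t2, value=v1 + v2)
print(links_toy)
```

Unlike the loops, this variant only emits links for combinations that actually occur, so zero-width links are skipped. Each stage should still sum to the number of rows, which is a quick sanity check on the flow.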

Objective: Analyze the distribution of age by sex and survival status <br>
About the code: Creates a violin plot that shows the distribution of age by sex and survival status <br>
Justification: A violin plot is a type of chart that combines a box plot and a kernel density plot to show the shape and spread of the data. It is useful for comparing multiple groups and identifying outliers.

The code can be explained as follows:

- The first line creates a violin plot using the `px.violin` function. It takes the following arguments:
  - df: the dataframe to use
  - y: the name of the column to plot on the y-axis, in this case "Age"
  - x: the name of the column to plot on the x-axis, in this case "Sex"
  - color: the name of the column to use for coloring the violins, in this case "Survived"
  - box: a boolean value that determines whether to show a box plot inside each violin, in this case True
  - points: a string that determines whether to show individual data points inside each violin, in this case "all"
  - width: the width of the plot in pixels, in this case 800
  - height: the height of the plot in pixels, in this case 500
- The last line shows the plot using the `fig.show` method. In a Jupyter notebook the interactive figure is normally rendered inline below the cell.

Result:
- The mean age of female survivors was 29.0 years, while the mean age of female non-survivors was 26.0 years.
- The mean age of male survivors was 27.6 years, while the mean age of male non-survivors was 31.2 years.
- The median age was 29.7 years for female survivors and for both male groups, and 29.0 years for female non-survivors. The identical value 29.699118 in three groups suggests that missing ages were filled with a single constant (most likely the overall mean) earlier in the notebook, so these medians largely reflect the imputed value.
- Age varied most among male survivors (standard deviation 15.3 years); the other three groups had similar spreads (12.2 to 13.0 years).

This means that:

- Female passengers had a better chance of surviving than male passengers, no matter how old they were.
- Male passengers who were younger had a slightly better chance of surviving than male passengers who were older, while there was no clear difference for female passengers based on their age.
- Ages were more spread out among male survivors than in any other group.
- Most of the passengers were around 30 years old, with some younger and some older.

In [18]:
fig = px.violin(df, y="Age", x="Sex", color="Survived", box=True, points="all", width=800, height=500)

fig.show()


In [19]:
# Group the data by sex and survival status
grouped = df.groupby(["Sex", "Survived"])

# Calculate the mean, median and standard deviation of age
# Note: "quantile" uses the default q=0.5, so that column repeats the median
stats = grouped["Age"].agg(["mean", "median", "std", "quantile"])

# Print the results (stats is already a DataFrame)
print(stats)

                      mean     median        std   quantile
Sex    Survived                                            
female 0         26.023272  29.000000  12.234723  29.000000
       1         28.979263  29.699118  13.032597  29.699118
male   0         31.175224  29.699118  12.350532  29.699118
       1         27.631705  29.699118  15.257584  29.699118
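
Passing `"quantile"` as a string to `agg` calls `quantile()` with its default `q=0.5`, which is why the last column above repeats the median. A minimal sketch of how the first and third quartiles could be requested instead is shown below. It runs on made-up data; the frame `toy` and the helper functions `q1` and `q3` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Made-up ages, two passengers per (Sex, Survived) group
toy = pd.DataFrame({
    "Sex": ["female"] * 4 + ["male"] * 4,
    "Survived": [0, 0, 1, 1, 0, 0, 1, 1],
    "Age": [20.0, 30.0, 10.0, 50.0, 40.0, 60.0, 5.0, 25.0],
})

def q1(s):
    """First quartile (25th percentile) of a Series."""
    return s.quantile(0.25)

def q3(s):
    """Third quartile (75th percentile) of a Series."""
    return s.quantile(0.75)

# Callables in the list are labelled by their function name in the result
age_stats = toy.groupby(["Sex", "Survived"])["Age"].agg(
    ["mean", "median", "std", q1, q3]
)
print(age_stats)
```

The resulting frame has one row per (Sex, Survived) group and separate `q1` and `q3` columns alongside the median.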


Objective : Analyze the survival rate by title and passenger class

About the code : Creates a stacked bar chart that shows the survival rate of passengers based on their title and passenger class.

Justification : The stacked bar chart was chosen because it allows us to compare the survival rate across groups of passengers and to see how title and class relate to survival. Each segment of a bar is the survival rate for one combination of title and class. Note that the total height of a stacked bar is the sum of these per-class rates, not the overall survival rate for the title, so it can exceed 1. This way, we can see which titles and classes had higher or lower chances of surviving the Titanic disaster.

Steps to build the chart:

- `df['title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)`

This line extracts the title (such as Mr, Mrs, Miss, etc.) from the name column of the dataframe `df` and assigns it to a new column called `title`. The `str.extract` method uses a regular expression to match the pattern of a word followed by a dot in the name column and returns the word without the dot. The `expand=False` argument tells the method to return a series instead of a dataframe.

- `df_grouped = df.groupby(['title', 'Pclass'])['Survived'].mean().reset_index()`

This line groups the dataframe `df` by the columns `title` and `Pclass` and calculates the mean of the `Survived` column for each group. The `Survived` column contains 1 for passengers who survived and 0 for passengers who died. The mean of this column represents the survival rate for each group. The `mean` call returns a Series indexed by `title` and `Pclass`; the `reset_index` method converts it into a new dataframe called `df_grouped` with the columns `title`, `Pclass` and `Survived`.

- `fig = px.bar(df_grouped, x='title', y='Survived', color='Pclass', barmode='stack', labels={'title':'Title', 'Survived':'Survival Rate'}, title = "Survival Rate of Titanic Passengers by Title and Passenger Class")`

This line creates a stacked bar chart using the plotly express library. The `px.bar` function takes the dataframe `df_grouped` as the input and sets the x-axis to be the `title` column, the y-axis to be the `Survived` column, and the color of the bars to be based on the `Pclass` column. The `barmode='stack'` argument tells the function to stack the bars for each title based on the passenger class. The `labels` argument provides custom labels for the axes. The `title` argument adds a title for the chart.

- `fig.show()`

This line renders the interactive chart; in a Jupyter notebook it normally appears inline below the cell.
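
The extraction and grouping steps above can be tried on a few made-up names. The sketch below is illustrative only: the frame `toy` and the variables `per_class`, `overall` and `stacked_height` are assumptions introduced here. It also compares the overall survival rate per title with the height a stacked bar of per-class rates would reach.

```python
import pandas as pd

# Made-up names in the "Surname, Title. Given names" layout of the Titanic data
toy = pd.DataFrame({
    "Name": [
        "Doe, Mr. John",
        "Doe, Mrs. Jane (Mary Smith)",
        "Roe, Miss. Anna",
        "Poe, Mr. Edgar",
        "Moe, Mr. Sam",
    ],
    "Pclass": [1, 1, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0],
})

# Raw string so that "\." reaches the regex engine as an escaped dot
toy["title"] = toy["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# Survival rate per (title, class) cell, as in the steps above
per_class = toy.groupby(["title", "Pclass"])["Survived"].mean().reset_index()

# Overall survival rate per title versus the summed per-class rates
overall = toy.groupby("title")["Survived"].mean()
stacked_height = per_class.groupby("title")["Survived"].sum()

print(per_class)
print(overall["Mr"], stacked_height["Mr"])
```

In this toy data one of three passengers titled Mr survived, so the overall rate is about 0.33, while the stacked height is 1.0 (a rate of 1.0 in first class plus 0.0 in third class). The two quantities answer different questions and should not be read interchangeably.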

Results :

- Women (Miss, Mrs, Mlle, Mme, Lady, Countess) had a much higher survival rate than men (Mr, Capt, Col, Don, Jonkheer, etc), especially in the first and second class.

- Young boys (Master) also had a high survival rate, except in the third class where only about 39% survived.

- Professionals (Dr, Rev) had a mixed survival rate depending on their class. Passengers titled Dr had a 60% survival rate in the first class and no survivors in the second class. The grouping uses the title only, so it does not separate doctors by sex. Passengers titled Rev in the second class also had no survivors.

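- Rare honorifics (Sir, Lady, Countess, Don, Jonkheer) had either a very high or a very low survival rate. Sir, Lady and Countess each show a survival rate of 1.0 in the sorted output of cell 21, while Don and Jonkheer had no survivors. Each of these titles covers very few passengers, so the rates should be read with caution.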

In [20]:
# Raw string avoids the invalid "\." escape warning in newer Python versions
df['title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

df_grouped = df.groupby(['title', 'Pclass'])['Survived'].mean().reset_index()

fig = px.bar(df_grouped, x='title', y='Survived', color='Pclass', barmode='stack', labels={'title':'Title', 'Survived':'Survival Rate'}, title = "Survival Rate of Titanic Passengers by Title and Passenger Class")
fig.show()

In [21]:
df_grouped.sort_values(by='Survived', ascending=False)

Unnamed: 0,title,Pclass,Survived
25,Sir,1,1.0
10,Master,2,1.0
2,Countess,1,1.0
23,Ms,2,1.0
16,Mme,1,1.0
15,Mlle,1,1.0
7,Lady,1,1.0
9,Master,1,1.0
20,Mrs,1,0.97619
12,Miss,1,0.956522


Objective : Analyze the distribution of survivors and non-survivors

About the code : Creates a histogram plot with subplots for `Sex`, `Pclass`, `Embarked`, `Age`, `FamilySize`, `IsAlone`, `SibSp` and `Parch`

Justification : The histogram plot was chosen because it is a good way to visualize the distribution of numerical data and compare the frequency of different categories or groups. Subplots were chosen because they allow multiple histograms to be displayed in one figure and compared easily.

Here is a step-by-step explanation of the code:

- The first line defines a dictionary `color_map` that maps the values 0 and 1 to two different colors. This will be used later to color the bars of the histogram according to the `Survived` column in the dataframe.
- The second line defines a list `columns` that contains the names of the columns that will be plotted as subplots.
- The third line uses the `make_subplots` function from plotly to create a figure with 4 rows and 2 columns of subplots, and assigns it to the variable `fig`. The `subplot_titles` argument sets the title of each subplot to be the name of the corresponding column.
- The fourth line starts a for loop that iterates over the `columns` list and its indices. The `enumerate` function returns both the index and the value of each item in the list.
- The fifth line defines a boolean variable `text` that is True if the column name is one of ['Sex', 'Pclass', 'Embarked', 'IsAlone'], and False otherwise. This will be used later to control whether to show text labels on the bars of the histogram or not.
- The sixth line uses the `px.histogram` function from plotly express to create a histogram plot for each column, using the `df` dataframe as the data source, the column name as the x-axis, and the `Survived` column as the color. The `text_auto` argument sets whether to show text labels on the bars or not, based on the value of `text`. The `barmode` argument sets how to display multiple bars per bin, in this case "group" means to show them side by side. The `opacity` argument sets how transparent the bars are, in this case 0.7 means 70% opaque. The `color_discrete_map` argument sets how to map different values of `Survived` to different colors, using the `color_map` dictionary defined earlier.
- The seventh line updates the traces (the graphical objects that make up the plot) of the histogram plot by setting their text font color to white. This makes them more visible on top of the colored bars.
- The eighth and ninth lines add the traces of the histogram plot to the figure `fig`, using the `add_trace` method. The `row` and `col` arguments specify which subplot grid cell to place them in, based on the index of the column name in the loop. Note that two traces are added for each histogram plot, one for each value of `Survived`.
- The for loop ends after the second `add_trace` call. Python marks this by the return to the outer indentation level, not by a separate line.
- The `fig.update_layout` call that follows updates the layout (the appearance and style) of the figure `fig`. The `showlegend` argument sets whether to show a legend; `False` hides it. The `width` and `height` arguments set the size of the figure in pixels. The `title` argument sets the title of the figure.
- The final line shows the figure using the `show` method.

Result :

- Sex: Female passengers were more likely to survive than male passengers.
- Pclass: Passengers in the first class were more likely to survive than those in the lower classes.
- Embarked: Passengers who embarked from Cherbourg were more likely to survive than those who embarked from other ports.
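- AgeGroup: Children were the most likely to survive (61 of 113, about 54%), followed by adults (274 of 752, about 36%); seniors were the least likely (7 of 26, about 27%). The `AgeGroup` column is created in cell 23, while the figure itself plots `Age`.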
- FamilySize: Passengers who traveled alone were less likely to survive than those who traveled with family members.
- IsAlone: Passengers who were not alone were more likely to survive than those who were alone.
- SibSp: Passengers who had one or two siblings or spouses on board were more likely to survive than those who had none or more than two.
- Parch: Passengers who had one, two or three parents or children on board were more likely to survive than those who had none or more than three.
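
The statements above compare survival likelihoods, while the notebook prints raw counts. As a hedged worked example, the sketch below turns a few of those counts into percentages. The numbers are copied from the printed output of cell 23 for `Sex`, `Pclass` and `AgeGroup`; the frame `counts` and the column `rate_pct` are names introduced here for illustration.

```python
import pandas as pd

# (variable, category, died, survived) copied from the printed output of cell 23
counts = pd.DataFrame(
    [
        ("Sex", "female", 81, 233),
        ("Sex", "male", 468, 109),
        ("Pclass", "1", 80, 136),
        ("Pclass", "2", 97, 87),
        ("Pclass", "3", 372, 119),
        ("AgeGroup", "Adult", 478, 274),
        ("AgeGroup", "Child", 52, 61),
        ("AgeGroup", "Senior", 19, 7),
    ],
    columns=["variable", "category", "died", "survived"],
)

# Survival rate in percent = survivors / (survivors + non-survivors)
total = counts["died"] + counts["survived"]
counts["rate_pct"] = (100 * counts["survived"] / total).round(1)

print(counts.to_string(index=False))
```

Reading rates rather than counts matters here because the groups differ greatly in size, for example 752 adults versus 26 seniors.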

In [22]:
color_map = {0: '#ef553b', 1: '#00cc96'}
columns = ['Sex', 'Pclass', 'Embarked', 'Age', 'FamilySize', 'IsAlone', 'SibSp', 'Parch']
fig = make_subplots(rows=4, cols=2, subplot_titles=columns)

for i, col in enumerate(columns):
    text = col in ['Sex', 'Pclass', 'Embarked', 'IsAlone']  # show bar labels only for these columns
    hist = px.histogram(df, x=col, color='Survived', text_auto=text, barmode="group", opacity=0.7, color_discrete_map=color_map)
    hist.update_traces(textfont_color='white')  # Set text color to white
    fig.add_trace(hist.data[0], row=(i//2)+1, col=(i % 2)+1) 
    fig.add_trace(hist.data[1], row=(i//2)+1, col=(i % 2)+1)

fig.update_layout(
    showlegend=False,
    width=900,
    height=1200,
    title='Distribution of Survivors and Non-Survivors'  # Set the title
)

fig.show()

In [23]:
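def age_group(age):
    """Bucket an age in years into 'Child' (<18), 'Adult' (18-59) or 'Senior' (60+)."""
    # Note: a missing age (NaN) fails both comparisons and would fall into 'Senior'
    if age < 18:
        return 'Child'
    elif age < 60:
        return 'Adult'
    else:
        return 'Senior'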

# Apply the function to the age column and create a new column for age group
df['AgeGroup'] = df['Age'].apply(age_group)

# Define a list of variables to calculate the distribution
variables = ['Sex', 'Pclass', 'Embarked', 'AgeGroup', 'FamilySize', 'IsAlone', 'SibSp', 'Parch']

# Loop through the variables and print the distribution of survivors and non-survivors
for var in variables:
    # Group by the variable and the survived column and calculate the count
    dist = df.groupby([var, 'Survived'])['Survived'].count()
    # Print the raw counts per group
    # (the commented-out line below would print each group's share in percent instead)
    # print(dist / dist.groupby(level=0).sum() * 100)
    print(pd.DataFrame(dist))


                 Survived
Sex    Survived          
female 0               81
       1              233
male   0              468
       1              109
                 Survived
Pclass Survived          
1      0               80
       1              136
2      0               97
       1               87
3      0              372
       1              119
                   Survived
Embarked Survived          
C        0               75
         1               93
Q        0               47
         1               30
S        0              427
         1              219
                   Survived
AgeGroup Survived          
Adult    0              478
         1              274
Child    0               52
         1               61
Senior   0               19
         1                7
                     Survived
FamilySize Survived          
1          0              374
           1              163
2          0               72
           1               89
3       