# Data Visualisation Assignment

Total points: 80 + 20 (5 x 4) for conciseness

Total number of questions: 3 (45 + 30 + 25)

Dataset: [Titanic](https://www.kaggle.com/competitions/titanic)


In this assignment, we'll use data visualization to help us deal with missing values. Normally, you would think of a method to fill the `NaN` values, visualize the data to check whether your method is feasible, and then implement it.

Here, however, since this will be the first time many of you are doing this, I'll give a hint towards the method. You should work out the method from the hint, implement it, and then visualize the data only to justify it.


**Note:** for every subquestion, use a copy of the `titanic` dataframe and not the original.
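As a minimal sketch of what working on a copy looks like, the snippet below uses a small made-up dataframe (`toy` is a hypothetical stand-in for `titanic`, and the fill value `"S"` is arbitrary) to show that changes made to a copy do not touch the original.

```python
import pandas as pd

# Toy stand-in for the `titanic` dataframe (values are made up for illustration).
toy = pd.DataFrame({"Embarked": ["S", None, "C"]})

# Work on an independent copy so the original keeps its NaN values.
work = toy.copy()
work["Embarked"] = work["Embarked"].fillna("S")  # "S" is an arbitrary fill value here

print(toy["Embarked"].isna().sum())   # missing values left in the original
print(work["Embarked"].isna().sum())  # missing values left in the copy
```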

## Data Dictionary:

- `PassengerId`, `Name`, `Sex`, `Age` and `Fare` are self-explanatory
- `Survived` is the variable of whether the passenger survived or not:
  - `1`: Survived
  - `0`: Didn't survive
- `Pclass` (Passenger Class) is the ticket class booked by the passenger and it reflects the socio-economic status of the passenger:
  - `1`: Upper Class
  - `2`: Middle Class
  - `3`: Lower Class
- `SibSp` is the total number of the passenger's siblings and spouses aboard the ship
- `Parch` is the total number of the passenger's parents and children aboard the ship
- `Ticket` is the ticket number of the passenger
- `Cabin` is the cabin number of the passenger
- `Embarked` is the port of embarkation (boarding):
  - `C`: Cherbourg
  - `Q`: Queenstown
  - `S`: Southampton


In [1]:
# imports 

import pandas as pd


In [2]:
titanic = pd.read_csv("titanic.csv")
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### Dealing with missing values

We first find out which columns have missing values, and then deal with them one by one.

In [3]:
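# function to help you keep track of missing values in a dataframe
# note: the default argument always refers to the original `titanic` dataframe,
# so pass your copy explicitly when you want to check it
def count_na(df=titanic):
    """Return the count of missing values for each column that has any."""
    series = df.isna().sum()
    return series[series > 0]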

In [4]:
count_na()

Age         177
Cabin       687
Embarked      2
dtype: int64
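Raw counts are easier to judge next to the size of the dataframe. As a hedged sketch, the hypothetical helper `missing_share` below (not part of the assignment) reports the fraction of missing values per column instead, shown here on a small made-up dataframe.

```python
import pandas as pd

def missing_share(df):
    """Return the fraction of missing values per column, for columns that have any."""
    share = df.isna().mean()  # mean of booleans = fraction of True values
    return share[share > 0]

# Made-up data for illustration only.
toy = pd.DataFrame({"Age": [22.0, None, 26.0, None], "Sex": ["m", "f", "f", "m"]})
print(missing_share(toy))
```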

#### Embarked

We start with `Embarked`, as it has only 2 missing values.

In [5]:
titanic[titanic["Embarked"].isna()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,


- **Method 1:** We find the port from which most passengers embarked (the mode) and fill in the missing values with it.
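A minimal sketch of this mode-based fill, on a made-up column rather than the real data (the `toy` dataframe and its values are assumptions for illustration):

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({"Embarked": ["S", "S", "C", None, "Q", "S", None]})

filled = toy.copy()
# mode() ignores NaN by default and returns a Series; take its first entry.
most_common = filled["Embarked"].mode()[0]
filled["Embarked"] = filled["Embarked"].fillna(most_common)

print(most_common)
print(filled["Embarked"].isna().sum())
```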

This makes sense, but how would we show statistical evidence in a report?

----

Q1.1. Write a function that, given a dataframe and a column, visualizes the absolute and relative frequencies of the values in that column (not including `NaN`). (10)

The figure should be made up of two subplots (optional if you can't do it). (2)

Using this function, visualize and find out the confidence with which you can fill in the missing values in the `Embarked` column with its mode. (8)
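Before plotting, it can help to see the two quantities as plain numbers. The sketch below is not the plotting function the question asks for; it only tabulates absolute and relative frequencies with `value_counts`, using a hypothetical helper name (`frequency_table`) and made-up data.

```python
import pandas as pd

def frequency_table(df, column):
    """Tabulate absolute and relative frequencies of a column, ignoring NaN."""
    absolute = df[column].value_counts()                # NaN is excluded by default
    relative = df[column].value_counts(normalize=True)  # shares that sum to 1
    return pd.DataFrame({"absolute": absolute, "relative": relative})

# Made-up data for illustration only.
toy = pd.DataFrame({"Embarked": ["S", "S", "C", None, "Q", "S"]})
table = frequency_table(toy, "Embarked")
print(table)
```

Each column of such a table could then be drawn as its own bar chart, one per subplot.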

----


Q1.2. Find another method to fill in the missing values in the `Embarked` column. (10)

Hint: When you book train/plane tickets together with your friends, you get assigned berths/seats next to or near each other.

In this question, consider *nearby* to be ±5 cabins in the same section of the ship.

Again, justify the confidence of your answer with the help of the function defined in Q1.1. (5)
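One possible way to express the *nearby* rule in code is sketched below, under the assumption that each cabin value is a single letter followed by a number (entries that list several cabins would need extra handling). The helper `is_nearby` and the sample cabins are hypothetical.

```python
import pandas as pd

# Made-up cabin values for illustration only.
cabins = pd.Series(["B28", "B30", "B35", "C28", None], name="Cabin")

# Split each cabin into its section letter and its number.
parts = cabins.str.extract(r"^([A-Za-z])(\d+)")
parts.columns = ["section", "number"]
parts["number"] = pd.to_numeric(parts["number"])

def is_nearby(parts, section, number, radius=5):
    """Mark cabins in the same section whose number is within `radius`."""
    same_section = parts["section"] == section
    close = (parts["number"] - number).abs() <= radius
    return same_section & close

mask = is_nearby(parts, "B", 28)
print(cabins[mask])
```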

----

#### Cabin

The majority of the `Cabin` column is `NaN`. Does that mean we just drop the column?
We can, but we'd be losing data. Since most of the values we fill in here will be artificial (i.e., inferred rather than collected), we need a very strong justification for how we fill them.

----

Q2. Find a method to fill in the missing values in the `Cabin` column. (15)

Hint: It is impossible to find out exactly which cabin each passenger stayed in. But if we drop the number from the cabin and retain only the letter, we can classify the passengers into sections of the ship. Which ***categorical*** column from our dataframe would be best suited for classifying passengers into different sections of the ship?

Visualize the density of the various categories in each section of the ship, using [stacked bar charts](https://study.com/academy/lesson/what-is-a-stacked-bar-chart.html) to justify how you are filling in the missing values. (10)
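The numbers behind a stacked bar chart of this kind are the per-section shares of each category. A minimal sketch using `pd.crosstab` follows; the column names `section` and `group` and all values are placeholders, and `group` deliberately does not stand for any particular Titanic column.

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "section": ["A", "A", "B", "B", "B", "C"],
    "group":   ["x", "y", "x", "x", "y", "y"],
})

# normalize="index" turns each row into shares that sum to 1.
shares = pd.crosstab(toy["section"], toy["group"], normalize="index")
print(shares)
```

With matplotlib installed, `shares.plot(kind="bar", stacked=True)` would draw one stacked bar per section.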

----

#### Age

Again, like `Cabin`, `Age` isn't a categorical column, so we need to think before simply filling the missing values with a statistic like the mode.

Q3. Find a method to fill in the missing values in the `Age` column, increasing the confidence with which we can choose the correct value by grouping data by the `Sex` of the passenger. (10)

Choose one of the mean, median and mode of these grouped values for each passenger, depending on their sex, and justify your choice graphically. (10)
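The mechanics of a grouped fill can be sketched as follows, on made-up data. The median is used here purely for illustration; which statistic is appropriate for the real data is exactly what the question asks you to decide and justify.

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Age": [20.0, 30.0, None, 25.0, None, 35.0],
})

# Summary statistics per group, to compare candidates side by side.
stats = toy.groupby("Sex")["Age"].agg(["mean", "median"])

# transform() returns one value per row, aligned with the original index,
# so it can be passed straight to fillna().
filled = toy.copy()
group_median = filled.groupby("Sex")["Age"].transform("median")
filled["Age"] = filled["Age"].fillna(group_median)

print(stats)
print(filled)
```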