<a href="https://colab.research.google.com/github/pinkbcb/DS350_FA24_Ball_Beth/blob/main/notebooks/Exploration_03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Exploration 03

You're working on an exhibit for a local museum called "The Titanic Disaster". They've asked you to analyze the passenger manifests and see if you can find any interesting information for the exhibit.

The museum curator is particularly interested in why some people might have been more likely to survive than others.

## Part 1: Import Pandas and load the data

Remember to import Pandas the conventional way. If you've forgotten how, you may want to review [Data Exploration 01](https://byui-cse.github.io/cse450-course/module-01/exploration-01.html).

The dataset for this exploration is stored at the following url:

[https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/titanic.csv](https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/titanic.csv)

There are lots of ways to load data into your workspace. The easiest way in this case is to [ask Pandas to do it for you](https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html).

### Initial Data Analysis
Once you've loaded the data, it's a good idea to poke around a little bit to find out what you're dealing with.

Some questions you might ask include:

* What does the data look like?
* What kind of data is in each column?
* Do any of the columns have missing values?
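As a rough sketch, each of those three questions maps onto one pandas call. The frame below (`sample`) is hypothetical hand-made data, not the Titanic file, so the snippet runs without network access. The same calls should work on the frame loaded from the URL above.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the real dataset.
sample = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [22.0, None, 35.0],
    "Cabin": [None, "C85", None],
})

# What does the data look like? Show the first rows.
print(sample.head())

# What kind of data is in each column? Inspect the dtypes.
# (sample.info() prints the same information plus non-null counts.)
print(sample.dtypes)

# Do any columns have missing values? Count nulls per column.
missing_counts = sample.isna().sum()
print(missing_counts)
```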

In [2]:
# Part 1: Enter your code below to import Pandas according to the
# conventional method. Then load the dataset into a Pandas dataframe.
import pandas as pd

data = pd.read_csv('https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/titanic.csv')

# Write any code needed to explore the data by seeing what the first few
# rows look like. Then display a technical summary of the data to determine
# the data types of each column, and which columns have missing data.
data.head()

column_names = data.columns.tolist()
print(column_names)


['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']


In [4]:
data.head()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,No,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,Yes,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,Yes,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,Yes,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,No,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Part 2: Initial Exploration

Using your visualization library of choice, let's first look at some features in isolation. Generate visualizations showing:

- The number of passengers who survived compared with the number who died
- The number of male passengers compared with the number of female passengers
- A histogram showing the distribution of sibling/spouse counts
- A histogram showing the distribution of parent/child counts

In [3]:
# Part 2: Write the code needed to generate the visualizations specified.

import plotly.express as px

# Assuming your DataFrame is named 'data'

# Comparison of survival
fig = px.histogram(data, x='Survived', title='Comparison of Passengers who Survived vs. Died')
fig.show()

# Comparison of males and females
fig = px.histogram(data, x='Sex', title='Comparison of Male and Female Passengers')
fig.show()

# Histogram of sibling/spouse counts
fig = px.histogram(data, x='SibSp', title='Distribution of Sibling/Spouse Counts', nbins=10)
fig.show()

# Histogram of parent/child counts
fig = px.histogram(data, x='Parch', title='Distribution of Parent/Child Counts', nbins=10)
fig.show()

## Part 3: Pairwise Comparisons
Use your visualization library of choice to look at how the survival distribution varied across different groups.

- Choose some features that you think might have influenced the likelihood of a Titanic passenger surviving.

- For each of those features, generate a chart showing the survival distribution when that feature is taken into account.
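Alongside a chart, a row-normalised cross-tabulation gives the same comparison as numbers. The sketch below uses a hypothetical six-row frame (`toy`) with the same `'Yes'`/`'No'` survival labels as the dataset. It is one possible way to summarise a feature, not the required approach.

```python
import pandas as pd

# Hypothetical rows; 'Survived' uses the same 'Yes'/'No' labels as the dataset.
toy = pd.DataFrame({
    "Sex": ["male", "male", "female", "female", "male", "female"],
    "Survived": ["No", "No", "Yes", "Yes", "Yes", "No"],
})

# normalize="index" makes each row sum to 1, so every cell is the share of
# that survival outcome within the group named by the row label.
survival_share = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")
print(survival_share)
```

Swapping `"Sex"` for another column name produces the equivalent table for that feature.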

In [12]:
# Write the code to explore how different features affect the survival distribution

fig = px.histogram(data, x="Sex", color="Survived",
                   title="Survival Distribution by Gender",
                   barmode='group',  # Display bars side-by-side
                   labels={'Survived': 'Survival Status', 'Sex': 'Gender'})
fig.show()


fig = px.histogram(data, x="Age", color="Survived",
                   title="Survival Distribution by Age",
                   nbins=20,  # Adjust the number of bins as needed
                   labels={'Survived': 'Survival Status', 'Age': 'Age'})
fig.show()



# fig = px.line(data, x = "Age", y = "")


# Survival Distribution by Pclass (Box Plot)
fig = px.box(data, x="Pclass", y="Age", color="Survived",
             title="Survival Distribution by Passenger Class and Age",
             labels={'Pclass': 'Passenger Class', 'Age': 'Age', 'Survived': 'Survival Status'})
fig.show()



## Part 4: Feature Engineering

The museum curator wonders if the passenger's rank and title might have anything to do with whether or not they survived. Since this information is embedded in their name, we'll use "feature engineering" to create two new columns:

- Title: The passenger's title
- Rank: A boolean (true/false) indicating if a passenger was someone of rank.

For the first new column, you'll need to find a way to [extract the title portion of their name](https://pandas.pydata.org/docs/getting_started/intro_tutorials/10_text_data.html). Be sure to clean up any whitespace or extra punctuation.

For the second new column, you'll need to first look at a summary of your list of titles and decide what exactly constitutes a title of rank. Will you include military and ecclesiastical titles? Once you've made your decision, create the second column.

You may want to review prior Data Explorations for tips on creating new columns and checking for lists of values.
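One way to approach both columns is to split each name on the comma and the period, then test the result against a list. This is only a sketch: the three names come from the dataset preview, and the `rank_titles` list is a hypothetical choice, not a recommendation.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Butt, Major. Archibald Willingham",
])

# Take the text after the comma, keep what precedes the first period,
# and strip the surrounding whitespace.
titles = (
    names.str.split(",", regex=False).str[1]
         .str.split(".", regex=False).str[0]
         .str.strip()
)

# Hypothetical decision about which titles count as "rank".
rank_titles = ["Major", "Col", "Capt"]
has_rank = titles.isin(rank_titles)

print(titles.tolist())
print(has_rank.tolist())
```

With this approach, a name whose title portion contains more than one word keeps all of those words, so it is worth checking `value_counts()` on the result.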

In [23]:

# Raw string so that \. is passed to the regex engine as a literal period.
data['Title'] = data['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

print(data['Title'].value_counts())

rank_titles = ['Major', 'Sir', 'Col', 'Capt', 'Countess', 'Jonkheer']

data['Rank'] = data['Title'].apply(lambda title: title in rank_titles)

filtered_data = data[data['Title'].isin(rank_titles)]
filtered_data.head()

Title
Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Mlle          2
Major         2
Col           2
Countess      1
Capt          1
Ms            1
Sir           1
Lady          1
Mme           1
Don           1
Jonkheer      1
Name: count, dtype: int64


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title,Rank
449,450,Yes,1,"Peuchen, Major. Arthur Godfrey",male,52.0,0,0,113786,30.5,C104,S,Major,True
536,537,No,1,"Butt, Major. Archibald Willingham",male,45.0,0,0,113050,26.55,B38,S,Major,True
599,600,Yes,1,"Duff Gordon, Sir. Cosmo Edmund (""Mr Morgan"")",male,49.0,1,0,PC 17485,56.9292,A20,C,Sir,True
647,648,Yes,1,"Simonius-Blumer, Col. Oberst Alfons",male,56.0,0,0,13213,35.5,A26,C,Col,True
694,695,No,1,"Weir, Col. John",male,60.0,0,0,113800,26.55,,S,Col,True


### Revisit Visualizations
Now that you have the new columns in place, revisit the pairwise comparison plots to see if the new columns reveal any interesting relationships.

In [28]:

fig = px.histogram(filtered_data, x="Rank", color="Survived",
                   title="Survival Distribution by Rank (Filtered)",
                   barmode='group',  # Display bars side-by-side
                   labels={'Survived': 'Survival Status', 'Rank': 'Rank (True/False)'})
fig.show()

# Survival Distribution by Title (Filtered)
fig = px.histogram(filtered_data, x="Title", color="Survived",
                   title="Survival Distribution by Title (Filtered)",
                   barmode="group",
                   labels={"Survived": "Survival Status", "Title": "Title"},
                   category_orders={"Title": rank_titles})  # Order the titles
fig.show()



# Filter for passengers with rank titles
rank_passengers = data[data['Rank'] == True]

# Calculate dying percentage for rank titles.
# 'Survived' holds the strings 'Yes'/'No', so the comparison must use 'No';
# comparing against 0 matches no rows and always reports 0.00%.
rank_dying_percentage = len(rank_passengers[rank_passengers['Survived'] == 'No']) / len(rank_passengers) * 100

print(f"Dying percentage for passengers with rank titles: {rank_dying_percentage:.2f}%")


# Filter for passengers without rank titles
non_rank_passengers = data[data['Rank'] == False]

# Calculate dying percentage for non-rank titles
non_rank_dying_percentage = len(non_rank_passengers[non_rank_passengers['Survived'] == 'No']) / len(non_rank_passengers) * 100

print(f"Dying percentage for passengers without rank titles: {non_rank_dying_percentage:.2f}%")


In [31]:


# ... (previous code to create 'Rank' column) ...

# Calculate dying percentage for rank titles.
# Comparing against the 'No' label avoids positional indexing into
# value_counts(), which lists the most frequent value first rather than a
# fixed category (and triggers a deprecation warning for integer keys).
rank_dying_percentage = (data.loc[data['Rank'], 'Survived'] == 'No').mean() * 100

# Calculate dying percentage for non-rank titles
non_rank_dying_percentage = (data.loc[~data['Rank'], 'Survived'] == 'No').mean() * 100

# Create a DataFrame for the visualization
dying_percentages = pd.DataFrame({
    'Passenger Type': ['Rank Titles', 'Non-Rank Titles'],
    'Dying Percentage': [rank_dying_percentage, non_rank_dying_percentage]
})

# Create the bar chart
fig = px.bar(dying_percentages, x='Passenger Type', y='Dying Percentage',
             title='Comparison of Dying Percentages',
             labels={'Dying Percentage': 'Dying Percentage (%)'})
fig.show()



## Part 5: Encoding

The museum has partnered with a data science group to build some interactive predictive models using the Titanic passenger data.

Many machine learning algorithms require categorical features to be **encoded** as numbers.

There are two approaches to this, label encoding (sometimes called factorization), and "one-hot" encoding.

### Label Encoding

Label encoding creates numeric labels for each categorical value. For example, imagine we have a feature in the data called `Pet` with these values for the first five rows: `['Dog', 'Cat', 'Dog', 'Dog', 'Bird']`.

We could create a new feature called `Pet_Encoded` where those values are represented as `[0, 1, 0, 0, 2]`, with `0 = Dog`, `1 = Cat`, and `2 = Bird`.

In pandas there are two common ways to label encode a feature:

#### Method 1: factorize()

First, we could use pandas' [factorize() method](https://pandas.pydata.org/docs/reference/api/pandas.factorize.html). It takes the series you want to encode as an argument and returns a tuple of two items.

The first item is an array of encoded values. The second holds the unique original values, in order of first appearance.


    # The factorize() method returns the new values and the unique originals as a tuple.
    # So the [0] at the end indicates we want only the new values.
    myData['Pet_Encoded'] = pd.factorize(myData['Pet'])[0]
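To make the two return values concrete, here is a small sketch that unpacks both of them for the `Pet` example. The variable names (`pets`, `codes`, `uniques`) are illustrative only.

```python
import pandas as pd

pets = pd.Series(["Dog", "Cat", "Dog", "Dog", "Bird"])

# factorize() returns (codes, uniques); codes are assigned in order of
# first appearance, so 'Dog' gets 0, 'Cat' gets 1, and 'Bird' gets 2.
codes, uniques = pd.factorize(pets)

print(codes.tolist())   # [0, 1, 0, 0, 2]
print(list(uniques))    # ['Dog', 'Cat', 'Bird']
```

Keeping `uniques` around is handy for translating codes back to labels later, since `uniques[code]` returns the original value.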


#### Method 2: Category Data Type
Every column in a pandas dataframe is a certain datatype. Usually, pandas infers which datatype to use based on the values of the column. However, we can use the `astype()` method to convert a feature from one type to another.

If we first convert a feature to the `category` datatype, we can ask pandas to create a new column in the data frame based on the category codes:

    # Convert our column to the category type
    myData['Pet'] = myData['Pet'].astype('category')
    myData['Pet_Encoded'] = myData['Pet'].cat.codes


Whichever method we choose, our machine learning algorithm could use the new `Pet_Encoded` feature in place of the `Pet` feature.
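One detail worth checking before relying on the codes is that the two methods do not necessarily assign the same numbers. For plain string data, the category type sorts its categories, whereas `factorize()` numbers values in order of first appearance. A minimal sketch with the same `Pet` values (variable names are illustrative):

```python
import pandas as pd

pets = pd.Series(["Dog", "Cat", "Dog", "Dog", "Bird"])

# Converting to 'category' sorts the distinct values: Bird, Cat, Dog.
as_category = pets.astype("category")
category_codes = as_category.cat.codes.tolist()

# Build a lookup from code to label so the encoding can be documented.
code_to_label = dict(enumerate(as_category.cat.categories))

print(category_codes)   # [2, 1, 2, 2, 0]
print(code_to_label)    # {0: 'Bird', 1: 'Cat', 2: 'Dog'}
```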




In [33]:
# Create a new column in the dataset called "Sex_Encoded" containing the
# label encoded values of the "Sex" column


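# Option 1: factorize() numbers values in order of first appearance.
data['Sex_Encoded'] = pd.factorize(data['Sex'])[0]

# Option 2: category codes number values in sorted order (female = 0, male = 1).
# This assignment overwrites the factorize() result above.
data['Sex'] = data['Sex'].astype('category')
data['Sex_Encoded'] = data['Sex'].cat.codes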

### One-Hot Encoding

One problem with label encoding is that it can make a categorical variable appear as if it contains a quantitative relationship between its values.

In the example above, is Bird twice as important as Cat? Some algorithms might interpret those values that way.

One-Hot encoding avoids this problem by creating a new feature for each category. The value of the new feature is either `0` (is not this value) or `1` (is this value).

In pandas, we can use the [get_dummies()](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html) function to do this:

    myEncodedData = pd.get_dummies(myData, columns=['Pet'])

In the case of our `Pet` example, `get_dummies()` names each new column `<feature>_<value>` and orders the columns by sorted value, so the new features would be:

| Pet_Bird | Pet_Cat | Pet_Dog |
|:--------:|:-------:|:-------:|
|    0     |    0    |    1    |
|    0     |    1    |    0    |
|    0     |    0    |    1    |
|    0     |    0    |    1    |
|    1     |    0    |    0    |

Recent pandas versions display these values as `True`/`False` unless `dtype=int` is passed.

Notice that for our data, if `Pet_Bird` = 0 and `Pet_Cat` = 0, we know that the pet has to be a dog. So the `Pet_Dog` column contains redundant information. When this happens, we say that our data contains a _multicollinearity_ problem.

To avoid this, we can tell `get_dummies()` that we want to get rid of one of the columns using the `drop_first` parameter:

    myEncodedData = pd.get_dummies(myData, columns=['Pet'], drop_first=True)

The main disadvantage to One-Hot encoding is that if the feature you're encoding has a lot of different values, it can result in a lot of extra features. This can sometimes lead to poor performance with some types of algorithms.
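The effect of `drop_first` is easiest to see by comparing the column lists with and without it. The sketch below uses a hypothetical one-column frame (`my_data`). Passing `dtype=int` is an optional choice made here so the indicators print as 0/1.

```python
import pandas as pd

my_data = pd.DataFrame({"Pet": ["Dog", "Cat", "Dog", "Dog", "Bird"]})

# One indicator column per distinct value, named '<column>_<value>'.
full = pd.get_dummies(my_data, columns=["Pet"], dtype=int)

# drop_first=True omits the first indicator column (here 'Pet_Bird').
reduced = pd.get_dummies(my_data, columns=["Pet"], drop_first=True, dtype=int)

print(full.columns.tolist())     # ['Pet_Bird', 'Pet_Cat', 'Pet_Dog']
print(reduced.columns.tolist())  # ['Pet_Cat', 'Pet_Dog']
```

In the reduced frame, a row with 0 in every remaining column corresponds to the dropped value.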

In [47]:
# The prompt asks for the Embarked column; this cell one-hot encodes Sex instead.
# Passing columns=['Embarked'] to the same call would encode Embarked.
# dtype=int yields 0/1 indicator columns named 'Sex_female' and 'Sex_male'.
Sex_Encoded = pd.get_dummies(data, columns=['Sex'], dtype=int)

# Display the encoded DataFrame (optional)
Sex_Encoded

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title,Rank,Sex_Encoded,Sex_female,Sex_male
0,1,No,3,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.2500,,S,Mr,False,1,0,1
1,2,Yes,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,C,Mrs,False,0,1,0
2,3,Yes,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.9250,,S,Miss,False,0,1,0
3,4,Yes,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1000,C123,S,Mrs,False,0,1,0
4,5,No,3,"Allen, Mr. William Henry",35.0,0,0,373450,8.0500,,S,Mr,False,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,No,2,"Montvila, Rev. Juozas",27.0,0,0,211536,13.0000,,S,Rev,False,1,0,1
887,888,Yes,1,"Graham, Miss. Margaret Edith",19.0,0,0,112053,30.0000,B42,S,Miss,False,0,1,0
888,889,No,3,"Johnston, Miss. Catherine Helen ""Carrie""",,1,2,W./C. 6607,23.4500,,S,Miss,False,0,1,0
889,890,Yes,1,"Behr, Mr. Karl Howell",26.0,0,0,111369,30.0000,C148,C,Mr,False,1,0,1


## Part 6: Conclusions

Based on your analysis, what interesting relationships did you find? Write three interesting facts the museum can use in their exhibit.

## 🌟 Above and Beyond 🌟

1. There appear to be many variations of similar titles (such as abbreviations for Miss and Mademoiselle).

   Scan through the different titles to see which titles can be consolidated, then use what you know about data manipulation to simplify the distribution.

   Once you've finished, check the visualizations again to see if that made any difference.

2. The museum curator has room for a couple of nice visualizations for the exhibit. Create additional visualizations that are suitable for public display.
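For the first item, one possible way to consolidate titles is a dictionary passed to `Series.replace`. The mapping below is an assumption made for illustration (for instance, treating `Ms` as `Miss` is a judgment call), so review the real title counts before settling on one.

```python
import pandas as pd

titles = pd.Series(["Mr", "Mlle", "Miss", "Mme", "Ms", "Mrs"])

# Assumed mapping from variant titles to a common form; values not listed
# here are left unchanged by replace().
title_map = {"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"}
consolidated = titles.replace(title_map)

print(consolidated.value_counts())
```

Applied to the dataset, the same call on the `Title` column should shrink the number of distinct titles before the charts are regenerated.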
