

# Titanic tutorial using Python and a few popular machine learning libraries (pandas, scikit-learn, and numpy)

Prof. Sandra Avila,
Institute of Computing (IC/Unicamp)

MC886/MO444, August 15, 2024

---

This tutorial is based on the book "**Artificial Unintelligence: How Computers Misunderstand the World**" (2018) by [Meredith Broussard](https://merbroussard.github.io/), Chapter 7, "Machine Learning: The DL on ML".

**Goal**: Predict if a passenger survived the sinking of the Titanic or not.

We start by importing several libraries that we'll use for our analysis. We use the alias *pd* for pandas and the alias *np* for numpy, which gives us access to all of the functions in both libraries. From scikit-learn, we import only two modules. One is named *tree* and the other is named *preprocessing*.

In [None]:
import pandas as pd
import numpy as np
from sklearn import tree, preprocessing


The first thing we do is break our data into two sets: training data and test data. We're going to develop a model, train it on the training data, then run it on the test data. The Titanic data come already with a train/test split.

*pd.read_csv()* means "please invoke the read_csv() function, which lives in the pd (pandas) library". Technically, read_csv() is a pandas function that reads the file and returns a DataFrame object, which we store in a variable.
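As a small illustration of what read_csv() gives back, the sketch below parses a hypothetical three-row sample from an in-memory string instead of a URL. The sample rows and the variable names `csv_text` and `sample` are assumptions made for this example only.

```python
import io
import pandas as pd

# Hypothetical three-row sample laid out like the Titanic files
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,38
3,1,3,female,
"""

# read_csv accepts a path, a URL, or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample))   # the result is a pandas DataFrame
print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # Age is stored as float64 because one cell is empty
```

The empty Age cell in the last row is read as NaN, which is why the whole column becomes floating point.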

In [None]:
train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train = pd.read_csv(train_url)

test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test = pd.read_csv(test_url)

Let's see what's in the *head*, or the first few lines, of the training data:


In [None]:
train.head(3)
#print(train.head())
#train.dtypes

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",0,26.0,0,0,STON/O2. 3101282,7.925,,S


The data has twelve columns, labeled PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked.

```
Data Dictionary
Survived = Survival (0 = No, 1 = Yes)
Pclass = Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
Name = Name
Sex = Sex
Age = Age (in years)
SibSp = Number of siblings / spouses aboard the Titanic
Parch = Number of parents / children aboard the Titanic
Ticket = Ticket number
Fare = Passenger fare (pre-1970 British pound)
Cabin = Cabin number
Embarked = Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
```

For most of the columns, we have data. For some column values, we do not have data. For PassengerId 1, Mr. Owen Harris Braund, the value for Cabin is NaN. This means "not a number". NaN is different than zero; zero is a number. NaN means that there is no value for this variable.
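To see how many values are missing in each column, one option is to combine isna() with sum(). The sketch below uses a hypothetical four-row table (not the real data) to show that NaN entries are counted as missing while a zero is not.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-table; np.nan marks the missing entries
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 0.0],
})

# isna() flags each missing cell; sum() counts the flags per column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

The Fare column reports no missing values even though one fare is 0.0, because zero is a number and NaN is the absence of one.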



Let's see what's in the first few lines of the test dataset:

In [None]:
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


As you can see, *test* has the same type of data as *train*, minus the Survived column. Great! Our goal is to create a Survived column in the test data that contains a prediction for each passenger. (Of course, someone already knows which passengers in the test data set survived — but it wouldn't be much of a tutorial if the data set already contained the answers).

Next, we're going to run some basic summary statistics on the training dataset in order to get to know it a little better.

We can get to know our data a bit by running a function called describe that assembles some basic summary statistics and puts them into a handy table, as follows:


In [None]:
# Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
train.describe() #numerical data only

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


The training dataset has 891 records. Of these, only 714 records show the age of the passenger. For the data we have available, the average age of the passengers is 29.699118; normal people would say that the average age is thirty.

Now that we've gotten to know our data a little bit, it's time to do some analysis. Let's first look at the number of passengers. We can use a function called *value_counts* to do this. It shows how many rows there are for each distinct value in a column. In other words, how many passengers are traveling in each passenger class? Let's find out:








In [None]:
train["Pclass"].value_counts()

Pclass,count
3,491
1,216
2,184


The training data shows 491 passengers traveling third class, 184 passengers traveling second class, and 216 passengers traveling first class.

Let's look at the numbers of survival:

In [None]:
train["Survived"].value_counts()

Survived,count
0,549
1,342


The training data shows that 549 people died and 342 survived.

Let's see those numbers normalized:

In [None]:
train["Survived"].value_counts(normalize=True)

Survived,proportion
0,0.616162
1,0.383838


62% of passengers died, and 38% survived. If we were to make a prediction about whether a random passenger survived, we'd likely predict that they did not survive.
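This "always guess the majority" rule can serve as a baseline that any model should beat. As a rough sketch, scikit-learn's DummyClassifier with the most_frequent strategy implements it. The labels below are rebuilt from the class counts shown above (549 and 342), and the placeholder feature column is an assumption made only so the example is self-contained.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels with the same class counts as the training data: 549 died, 342 survived
y = np.array([0] * 549 + [1] * 342)
# DummyClassifier ignores the features, so a placeholder column is enough
X = np.zeros((len(y), 1))

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

baseline_accuracy = baseline.score(X, y)
print(round(baseline_accuracy, 4))
```

The baseline accuracy equals the share of the majority class, so a model is only useful if it scores noticeably higher than that.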

We could stop here if we wanted. We just drew a conclusion that would allow us to make a reasonable prediction. We can do better, however, so let's keep going. Are there any factors that might help improve the prediction? In addition to survival, we have some other columns in the data: Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked.

*Pclass* is a proxy for the socioeconomic class of the passengers. That might be useful as a predictor. We could guess that first-class passengers got on the boats before third-class passengers. We know that "women and children first" was a principle used during maritime disasters.

Now, let's do some comparisons to see if we can find variables that seem predictive:

In [None]:
#Females that survived vs females that passed away
train["Survived"][train["Sex"] == 'female'].value_counts()

Survived,count
1,233
0,81


In [None]:
#Normalized female survival
train["Survived"][train["Sex"] == 'female'].value_counts(normalize=True)

Survived,proportion
1,0.742038
0,0.257962


In [None]:
#Males that survived vs males that passed away
train["Survived"][train["Sex"] == 'male'].value_counts()

Survived,count
0,468
1,109


In [None]:
#Normalized male survival
train["Survived"][train["Sex"] == 'male'].value_counts(normalize=True)

Survived,proportion
0,0.811092
1,0.188908


We can see that 74% of females survived, and only 19% of males survived. Therefore, for a random person, we might adjust our guess to say that they survived if they were female, but not if they were male.

But we won't, because that would mean assigning probable outcomes randomly based only on gender. We know there are other factors in the data that influence the outcome. What about women traveling third class? Women traveling first class? Women traveling with children? This quickly becomes tedious to calculate manually, so we're going to train a model to do the guessing for us based on the factors that we know.
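To get a feel for why the manual route becomes tedious, here is one way to break survival down by two factors at once with groupby. The eight rows below are hypothetical toy data, not real Titanic counts, and the names `toy` and `rates` exist only for this sketch.

```python
import pandas as pd

# Hypothetical toy data, NOT the real Titanic counts
toy = pd.DataFrame({
    "Sex":      ["female", "female", "female", "female", "male", "male", "male", "male"],
    "Pclass":   [1, 1, 3, 3, 1, 1, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate
rates = toy.groupby(["Sex", "Pclass"])["Survived"].mean()
print(rates)
```

Two factors already produce four groups. Adding age bands, fare ranges, and family size multiplies the number of combinations, which is the kind of bookkeeping a decision tree automates.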

To construct the model, we're going to use a *decision tree*, a type of machine learning algorithm. Let's train the model on the training data. We know from our exploratory analysis that the features that matter are fare class and sex. We want to create a guess for survival. We already know whether the passengers in the training data survived or not. We're going to make the model guess, then compare the guesses to reality. Whatever the percentage is that we get right is our accuracy number.

Here's an open secret of the big data world: *all data is dirty*. All of it. **Data is made by people** going around and counting things or made by sensors that are made by people. In every seemingly orderly column of numbers, there is noise. There is mess. There is incompleteness. This is life. The problem is: dirty data doesn't compute. Therefore, in machine learning, sometimes we have to make things up to make the functions run smoothly.

We'll use a function called *fillna* to fill in all the missing values:

In [None]:
train["Age"] = train["Age"].fillna(train["Age"].median())

The algorithm can't run with missing values. Thus, we need to make up the missing values. Here, we recommend using the median.
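The effect of a median fill can be checked on a tiny example. The six ages below are hypothetical values chosen for illustration. The sketch shows that median() skips the NaN entries and that fillna() replaces only the missing cells.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 80.0])

# median() ignores NaN by default, so it uses 22, 35, 38 and 80
median_age = ages.median()

# fillna() returns a new Series with every NaN replaced by the given value
filled = ages.fillna(median_age)

print(median_age)
print(filled.tolist())
```

One reason to prefer the median over the mean here is that a single extreme value, such as the 80-year-old passenger, pulls the mean upward much more than it moves the median.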

Let's take a look at the data to see what's in there:

In [None]:
#pd.set_option('display.max_rows', None)
train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,891.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.361582,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,13.019697,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,22.0,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,35.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


Now that we've looked at the raw data, it's time to start working with it. Let's turn it into *arrays*.

In [None]:
#Create the target numpy array: target
target = train["Survived"].values

#Preprocess: LabelEncoder maps each distinct label to an integer
encoded_sex = preprocessing.LabelEncoder()

#Convert Sex into numbers (labels are sorted, so female = 0 and male = 1)
train["Sex"] = encoded_sex.fit_transform(train["Sex"])

#Create the features numpy array: features_one
features_one = train[["Pclass", "Sex", "Age", "Fare"]].values
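It can help to check which number the encoder assigned to each label. A minimal sketch, using a hypothetical four-item list rather than the real column: after fitting, the classes_ attribute holds the labels in sorted order, and the position of each label is its code.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["male", "female", "female", "male"])

print(codes)             # integer code for each input value
print(encoder.classes_)  # the position of a label in this array is its code

# Build a readable label-to-code dictionary
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
```

Because the labels are sorted alphabetically, "female" becomes 0 and "male" becomes 1.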

In [None]:
#print(features_one, target)
#print(target)

What we're doing is running a function called *fit* on the decision tree classifier called *my_tree_one*. The features we want to consider are Pclass, Sex, Age, and Fare.

In [None]:
#Fit the first decision tree: my_tree_one
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(features_one, target)

We're instructing the algorithm to figure out what relationship among these four predicts the values in the target field, which is Survived:

In [None]:
#Look at the importance and score of the included features
print(my_tree_one.feature_importances_)

[0.12717495 0.31274009 0.22252956 0.33755539]


The values are listed in the same order as the features: Pclass, Sex, Age, and Fare. The largest number in this group of values is considered the most important. Fare is the largest number. We can conclude that passenger fare is the most important factor in determining whether a passenger survived the *Titanic* disaster.
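A bare array of importances is easy to misread, so one option is to pair each value with its feature name. The sketch below does this on synthetic data in which, by construction, only the second column determines the label. The random data and the reuse of the tutorial's column names are assumptions made for illustration and say nothing about the real Titanic result.

```python
import numpy as np
import pandas as pd
from sklearn import tree

# Synthetic data: 200 rows, 4 random features (NOT the Titanic data)
rng = np.random.default_rng(0)
X = rng.random((200, 4))
# By construction, the label depends only on the second column
y = (X[:, 1] > 0.5).astype(int)

feature_names = ["Pclass", "Sex", "Age", "Fare"]
clf = tree.DecisionTreeClassifier(random_state=0).fit(X, y)

# Attach names to the importances and sort from most to least important
importances = pd.Series(clf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

The importances sum to 1, so each value can be read as that feature's share of the total impurity reduction in this particular tree.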

At this point in our data analysis, we can run a function to show exactly how accurate our calculation is within the mathematical constraints of the universe represented by this data. Let's use the score function to find the mean accuracy:

In [None]:
print(my_tree_one.score(features_one,target))

0.9775533108866442


Wow, 98%!
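One caveat worth keeping in mind: this score is computed on the same rows the tree was fitted to, and an unconstrained decision tree can memorize its training data. A common way to get a more honest estimate is to hold out part of the data before fitting. The sketch below illustrates the gap on synthetic, deliberately noisy data. The data, the 25% split, and all variable names are assumptions for this example and are not part of the original analysis.

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split

# Synthetic data: the label follows the first column plus random noise
rng = np.random.default_rng(42)
X = rng.random((500, 4))
y = ((X[:, 0] + rng.normal(0, 0.3, size=500)) > 0.5).astype(int)

# Keep 25% of the rows aside; the tree never sees them during fitting
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = tree.DecisionTreeClassifier(random_state=42).fit(X_fit, y_fit)

fit_accuracy = model.score(X_fit, y_fit)  # accuracy on rows used for fitting
val_accuracy = model.score(X_val, y_val)  # accuracy on held-out rows
print("fitted rows:  ", fit_accuracy)
print("held-out rows:", val_accuracy)
```

On the fitted rows the tree reaches perfect accuracy because it has memorized them, noise included. The held-out score is the better guide to how the model would do on new passengers.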

Next, we'll take this model (*my_tree_one*) and apply it to the set of test data. Remember: the test data doesn't have a Survived column. Our job is to use the model to predict whether each passenger in the test data survived or perished. We know that fare is the most important predictor according to this model, but age, sex and passenger class matter mathematically too. Let's apply the model to the test data and see what happens:

In [None]:
test.describe()
# Detect missing values
#test.isna().any()
#test.isna().sum()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare
count,418.0,418.0,332.0,418.0,418.0,417.0
mean,1100.5,2.26555,30.27259,0.447368,0.392344,35.627188
std,120.810458,0.841838,14.181209,0.89676,0.981429,55.907576
min,892.0,1.0,0.17,0.0,0.0,0.0
25%,996.25,1.0,21.0,0.0,0.0,7.8958
50%,1100.5,3.0,27.0,0.0,0.0,14.4542
75%,1204.75,3.0,39.0,1.0,0.0,31.5
max,1309.0,3.0,76.0,8.0,9.0,512.3292


In [None]:
#Fill any missing fare values with the median fare
test["Fare"] = test["Fare"].fillna(train["Fare"].median())

#Fill any missing age values with the median age
test["Age"] = test["Age"].fillna(train["Age"].median())

#Preprocess: reuse the encoder fitted on the training data so that
#female and male get the same numbers in both datasets
test["Sex"] = encoded_sex.transform(test["Sex"])

#Extract important features from the test set: Pclass, Sex, Age, and Fare
test_features = test[["Pclass", "Sex", "Age", "Fare"]].values

#print("These features are the features:\n")
#print(test_features)

In [None]:
#Make a prediction using the test set and print
my_prediction = my_tree_one.predict(test_features)
print("This is the prediction:\n", my_prediction)

This is the prediction:
 [0 0 1 1 1 0 0 0 1 0 0 0 1 1 1 1 0 1 1 0 0 1 1 0 1 0 1 1 1 0 0 0 1 0 1 0 0
 0 0 0 0 1 0 1 1 0 0 0 1 1 1 0 1 1 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 1 1 0 1 0
 1 0 0 1 0 1 1 0 0 0 0 0 1 1 1 1 1 0 1 0 0 0 1 1 1 0 1 0 0 0 1 0 0 0 0 0 0
 0 1 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0
 1 0 1 0 0 1 0 0 1 1 0 1 1 1 1 0 0 1 1 0 1 0 0 1 0 0 0 1 1 0 1 1 0 0 1 0 1
 0 1 0 0 0 0 0 1 0 1 0 1 0 0 0 1 1 0 1 0 0 0 0 1 0 1 0 0 1 0 0 1 0 1 0 1 0
 1 1 1 0 0 1 0 0 0 1 0 0 1 0 0 1 1 1 1 1 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1
 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 1 1 0 1 1 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 0
 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 1 1 0 0 0 1 0
 0 1 0 0 1 1 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1 1 0 0 0
 0 1 1 1 1 0 0 1 0 0 0]


In [None]:
#Create a data frame with two columns: PassengerId & Survived
#Survived contains the model's prediction
PassengerId = np.array(test["PassengerId"]).astype(int)
my_solution = pd.DataFrame(my_prediction, index = PassengerId, columns = ["Survived"])

print("This is the solution:\n", my_solution)

This is the solution:
       Survived
892          0
893          0
894          1
895          1
896          1
...        ...
1305         0
1306         1
1307         0
1308         0
1309         0

[418 rows x 1 columns]


In [None]:
#Check that the data frame has 418 entries
print("This is the solution shape:\n", my_solution.shape)

This is the solution shape:
 (418, 1)


In [None]:
#Write the solution to a CSV file with the name my_solution_one.csv
my_solution.to_csv("my_solution_one.csv", index_label = ["PassengerId"])
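Before uploading, it may be worth confirming that the file has the expected header and one row per passenger. The sketch below writes a hypothetical three-row solution to an in-memory buffer instead of a file, so the resulting text can be inspected directly. The predictions and ids are made up for illustration.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical predictions for three passengers
predictions = np.array([0, 1, 1])
passenger_ids = np.array([892, 893, 894])
solution = pd.DataFrame({"Survived": predictions}, index=passenger_ids)

# Write to an in-memory buffer; index_label names the index column in the header
buffer = io.StringIO()
solution.to_csv(buffer, index_label="PassengerId")
csv_text = buffer.getvalue()
print(csv_text)
```

The first line of the output is the header, followed by one line per passenger with the id and the predicted value.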

We can upload the file to Kaggle, and verify that our predictions were about 97% accurate. Ta-da! We just did machine learning.

When someone says they have "used artificial intelligence to make a decision," they usually mean "used machine learning", and usually they went through a process similar to the one we just worked through.

For a programmer, writing an algorithm is that easy. It gets made, it gets deployed, it seems to work. You maybe try turning the dials differently the next time to see if the accuracy seems to go up any. You try to get the highest number you can. Then, you move on to the next thing.

**Meanwhile, out in the world, these numbers have consequences.**

