<center>
<img src="../../img/ods_stickers.jpg" />
    
## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course 

Author: [Yury Kashnitsky](https://yorko.github.io). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose.

# <center> Topic 1. Exploratory data analysis with Pandas
## <center>Practice. Analyzing "Titanic" passengers

**Fill in the missing code ("Your code here") and choose answers in a [web form](https://docs.google.com/forms/d/16EfhpDGPrREry0gfDQdRPjoiQX9IumaL2mPR0rcj19k/edit).**

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
# Graphics in SVG format are sharper and more legible
%config InlineBackend.figure_format = 'svg'
pd.set_option("display.precision", 2)

**Read data into a Pandas DataFrame**

In [2]:
data = pd.read_csv('../../data/titanic_train.csv',
                  index_col='PassengerId')

**First 5 rows**

In [5]:
data.head(5)

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.28 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.92 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [6]:
data.describe()

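| | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 0.38 | 2.31 | 29.7 | 0.52 | 0.38 | 32.2 |
| std | 0.49 | 0.84 | 14.53 | 1.1 | 0.81 | 49.69 |
| min | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 2.0 | 20.12 | 0.0 | 0.0 | 7.91 |
| 50% | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.45 |
| 75% | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.33 |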


**Let's select the passengers who embarked in Cherbourg (Embarked = C) and paid more than 200 pounds for their ticket (Fare > 200).**

Make sure you understand how this construction actually works.

In [7]:
data[(data['Embarked'] == 'C') & (data.Fare > 200)].head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 119 | 0 | 1 | Baxter, Mr. Quigg Edmond | male | 24.0 | 0 | 1 | PC 17558 | 247.52 | B58 B60 | C |
| 259 | 1 | 1 | Ward, Miss. Anna | female | 35.0 | 0 | 0 | PC 17755 | 512.33 | NaN | C |
| 300 | 1 | 1 | Baxter, Mrs. James (Helene DeLaudeniere Chaput) | female | 50.0 | 0 | 1 | PC 17558 | 247.52 | B58 B60 | C |
| 312 | 1 | 1 | Ryerson, Miss. Emily Borie | female | 18.0 | 2 | 2 | PC 17608 | 262.38 | B57 B59 B63 B66 | C |
| 378 | 0 | 1 | Widener, Mr. Harry Elkins | male | 27.0 | 0 | 2 | 113503 | 211.5 | C82 | C |
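
To see what the filtering expression does, it can help to build it step by step. The sketch below uses a small made-up table (not the Titanic file); the names `toy`, `from_cherbourg`, `expensive`, and `mask` are illustrative only.

```python
import pandas as pd

# Toy data made up for illustration; not the Titanic file
toy = pd.DataFrame(
    {"Embarked": ["C", "S", "C", "Q"], "Fare": [250.0, 300.0, 80.0, 210.0]},
    index=pd.Index([1, 2, 3, 4], name="PassengerId"),
)

# Each comparison returns a boolean Series aligned with the DataFrame index
from_cherbourg = toy["Embarked"] == "C"
expensive = toy["Fare"] > 200

# `&` combines the two masks element-wise. In the one-line form the
# parentheses are required because `&` binds tighter than `==` and `>`.
mask = from_cherbourg & expensive

# Indexing with a boolean Series keeps only the rows where it is True
selected = toy[mask]

print(mask.tolist())
print(selected)
```

Only the first toy row satisfies both conditions, so `selected` contains a single row.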


**We can sort these people by Fare in descending order.**

In [8]:
data[(data['Embarked'] == 'C') & 
     (data['Fare'] > 200)].sort_values(by='Fare',
                               ascending=False).head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 259 | 1 | 1 | Ward, Miss. Anna | female | 35.0 | 0 | 0 | PC 17755 | 512.33 | NaN | C |
| 680 | 1 | 1 | Cardeza, Mr. Thomas Drake Martinez | male | 36.0 | 0 | 1 | PC 17755 | 512.33 | B51 B53 B55 | C |
| 738 | 1 | 1 | Lesurer, Mr. Gustave J | male | 35.0 | 0 | 0 | PC 17755 | 512.33 | B101 | C |
| 312 | 1 | 1 | Ryerson, Miss. Emily Borie | female | 18.0 | 2 | 2 | PC 17608 | 262.38 | B57 B59 B63 B66 | C |
| 743 | 1 | 1 | Ryerson, Miss. Susan Parker "Suzette" | female | 21.0 | 2 | 2 | PC 17608 | 262.38 | B57 B59 B63 B66 | C |


**Let's create a new feature.**

In [9]:
def age_category(age):
    '''
    < 30 -> 1
    >= 30, <55 -> 2
    >= 55 -> 3
    '''
    if age < 30:
        return 1
    elif age < 55:
        return 2
    elif age >= 55:
        return 3

In [10]:
age_categories = [age_category(age) for age in data.Age]
data['Age_category'] = age_categories

**Another way is to do it with `apply`.**

In [11]:
data['Age_category'] = data['Age'].apply(age_category)
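
A vectorized alternative is `pd.cut`, which bins a numeric column without calling a Python function per row. This is a sketch on toy ages, assuming the same boundaries as `age_category`; note that it returns a categorical column rather than the float column produced by `apply`.

```python
import numpy as np
import pandas as pd

# Toy ages chosen to hit the bin boundaries, plus one missing value
ages = pd.Series([22.0, 30.0, 54.9, 55.0, np.nan])

# right=False makes the bins left-closed: [-inf, 30), [30, 55), [55, inf)
age_cat = pd.cut(
    ages,
    bins=[-np.inf, 30, 55, np.inf],
    labels=[1, 2, 3],
    right=False,
)

print(age_cat.tolist())
```

Missing ages stay missing, which matches the behavior of `age_category` on `NaN` input.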

**1. How many men and women were on board?**
- 412 men and 479 women
- 314 men and 577 women
- 479 men and 412 women
- 577 men and 314 women X

In [22]:
len(data[data['Sex'] == 'male'])

577

In [23]:
len(data[data['Sex'] == 'female'])

314
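
Both counts can also be obtained in one call with `value_counts`. The sketch below runs on a made-up `Sex` column rather than the real data.

```python
import pandas as pd

# Toy column made up for illustration
toy_sex = pd.Series(["male", "female", "male", "male"], name="Sex")

counts = toy_sex.value_counts()                # absolute counts per value
shares = toy_sex.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```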

**2. Print the distribution of the `Pclass` feature. Then do the same for men and women separately. How many second-class men were on board?**
- 104
- 108 X
- 112
- 125

In [28]:
data['Pclass'].value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

In [30]:
pd.crosstab(data['Pclass'], data['Sex'])

| Pclass / Sex | female | male |
|---|---|---|
| 1 | 94 | 122 |
| 2 | 76 | 108 |
| 3 | 144 | 347 |
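
A single cell of such a table can be read with `.loc`, and the same number comes out of a `groupby`. This is a sketch on toy rows; the variable names are illustrative.

```python
import pandas as pd

# Toy data made up for illustration
toy = pd.DataFrame(
    {
        "Pclass": [1, 2, 2, 3, 2],
        "Sex": ["male", "male", "female", "male", "male"],
    }
)

table = pd.crosstab(toy["Pclass"], toy["Sex"])

# Row label 2 (second class), column label "male"
second_class_men = table.loc[2, "male"]

# Equivalent count via groupby: group sizes indexed by (Pclass, Sex)
sizes = toy.groupby(["Pclass", "Sex"]).size()

print(second_class_men, sizes.loc[(2, "male")])
```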


**3. What are the median and standard deviation of `Fare`? Round to two decimals.**
- median is 14.45, standard deviation is 49.69 X
- median is 15.1, standard deviation is 12.15
- median is 13.15, standard deviation is 35.3
- median is 17.43, standard deviation is 39.1

In [33]:
data['Fare'].median()

14.4542

In [34]:
data['Fare'].std()

49.693428597180905
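
Both statistics can be computed and rounded in one expression with `agg`. The sketch below uses a handful of toy fares, not the full column.

```python
import pandas as pd

# Toy fares made up for illustration
fares = pd.Series([7.25, 71.28, 7.92, 53.10, 8.05], name="Fare")

# agg accepts a list of function names and returns a Series indexed by them
summary = fares.agg(["median", "std"]).round(2)

print(summary)
```

Note that `Series.std` uses the sample standard deviation (`ddof=1`) by default.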

**4. Is it true that the mean age of survivors is higher than that of passengers who died?**
- Yes
- No X


In [41]:
data[data['Survived'] == 1]['Age'].mean()

28.343689655172415

In [42]:
data[data['Survived'] == 0]['Age'].mean()

30.62617924528302
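
The two means can be computed side by side with `groupby`. This is a sketch on toy rows; missing ages are skipped by `mean` by default.

```python
import numpy as np
import pandas as pd

# Toy data made up for illustration, with one missing age
toy = pd.DataFrame(
    {
        "Survived": [1, 0, 1, 0, 0],
        "Age": [20.0, 40.0, 30.0, np.nan, 50.0],
    }
)

# One mean per value of Survived; NaN ages are ignored
mean_age = toy.groupby("Survived")["Age"].mean()

print(mean_age)
```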

**5. Is it true that passengers younger than 30 survived more frequently than those older than 60? What are the shares of survivors among young and old passengers?**
- 22.7% among young and 40.6% among old
- 40.6% among young and 22.7% among old X
- 35.3% among young and 27.4% among old
- 27.4% among young and 35.3% among old

In [47]:
data[data['Age'] < 30]['Survived'].mean()

0.40625

In [48]:
data[data['Age'] > 60]['Survived'].mean()

0.22727272727272727
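
Two details make these one-liners work: the mean of a 0/1 column is the share of ones, and comparisons with `NaN` evaluate to `False`, so passengers with an unknown age fall into neither group. The sketch below illustrates both on toy rows.

```python
import numpy as np
import pandas as pd

# Toy data made up for illustration, with one missing age
toy = pd.DataFrame(
    {
        "Age": [25.0, np.nan, 65.0, 28.0, 70.0],
        "Survived": [1, 1, 0, 0, 1],
    }
)

young = toy["Age"] < 30  # NaN < 30 is False
old = toy["Age"] > 60    # NaN > 60 is False

# Mean of a 0/1 column equals the fraction of ones
young_share = toy.loc[young, "Survived"].mean()
old_share = toy.loc[old, "Survived"].mean()

# Rows that belong to neither group because Age is missing
unknown_age = int(toy["Age"].isna().sum())

print(young_share, old_share, unknown_age)
```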

**6. Is it true that women survived more frequently than men? What are the shares of survivors among men and women?**
- 30.2% among men and 46.2% among women
- 35.7% among men and 74.2% among women
- 21.1% among men and 46.2% among women
- 18.9% among men and 74.2% among women X

In [51]:
data[data['Sex'] == 'male']['Survived'].mean()

0.18890814558058924

In [52]:
data[data['Sex'] == 'female']['Survived'].mean()

0.7420382165605095
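
Both shares can be obtained at once by grouping on `Sex`. This is a sketch on toy rows, not the real data.

```python
import pandas as pd

# Toy data made up for illustration
toy = pd.DataFrame(
    {
        "Sex": ["male", "female", "male", "female", "male"],
        "Survived": [0, 1, 0, 1, 1],
    }
)

# Survival share per sex: mean of the 0/1 column within each group
survival_rate = toy.groupby("Sex")["Survived"].mean()

print(survival_rate)
```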

**7. What's the most popular first name among male passengers?**
- Charles
- Thomas
- William
- John X

In [81]:
data['First Name'] = data['Name'].apply(lambda name: name.split(' ')[-1].replace(')', ''))

In [82]:
data

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Age_category | First Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | 1.0 | Harris |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.28 | C85 | C | 2.0 | Thayer |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.92 | NaN | S | 1.0 | Laina |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.10 | C123 | S | 2.0 | Peel |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S | 2.0 | Henry |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.00 | NaN | S | 1.0 | Juozas |
| 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.00 | B42 | S | 1.0 | Edith |
| 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S | NaN | "Carrie" |
| 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.00 | C148 | C | 1.0 | Howell |


In [83]:
data[data['Sex'] == 'male']['First Name'].value_counts()

John        16
William     15
Henry       15
James       14
Jr           9
            ..
George"      1
Warner       1
Andre        1
Williams     1
Marsh        1
Name: First Name, Length: 375, dtype: int64
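
The expression above keeps the last whitespace-separated token of `Name`, which is why entries such as `Jr` and `Harris` show up in the counts. A possible alternative, sketched below, takes the first word after the title instead. It assumes names follow the pattern `Surname, Title. Given names`; the regular expression and variable names are illustrative, and the result on the full dataset is not shown here.

```python
import pandas as pd

# A few names taken from the tables above
names = pd.Series(
    [
        "Braund, Mr. Owen Harris",
        "Allen, Mr. William Henry",
        "Beane, Mr. Edward",
        "Minahan, Dr. William Edward",
    ]
)

# After the comma, skip the title up to its period,
# then capture the next run of word characters
first_names = names.str.extract(r",\s*[^.]+\.\s*(\w+)", expand=False)

print(first_names.tolist())
print(first_names.value_counts())
```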

In [73]:
data[data['Name'].apply(lambda name: 'Edward' in name)]

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Age_category | First Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 34 | 0 | 2 | Wheadon, Mr. Edward H | male | 66.0 | 0 | 0 | C.A. 24579 | 10.5 | NaN | S | 3.0 | H |
| 246 | 0 | 1 | Minahan, Dr. William Edward | male | 44.0 | 2 | 0 | 19928 | 90.0 | C78 | Q | 2.0 | Edward |
| 284 | 1 | 3 | Dorking, Mr. Edward Arthur | male | 19.0 | 0 | 0 | A/5. 10482 | 8.05 | NaN | S | 1.0 | Arthur |
| 333 | 0 | 1 | Graham, Mr. George Edward | male | 38.0 | 0 | 1 | PC 17582 | 153.46 | C91 | S | 2.0 | Edward |
| 488 | 0 | 1 | Kent, Mr. Edward Austin | male | 58.0 | 0 | 0 | 11771 | 29.7 | B37 | C | 3.0 | Austin |
| 495 | 0 | 3 | Stanley, Mr. Edward Roland | male | 21.0 | 0 | 0 | A/4 45380 | 8.05 | NaN | S | 1.0 | Roland |
| 544 | 1 | 2 | Beane, Mr. Edward | male | 32.0 | 1 | 0 | 2908 | 26.0 | NaN | S | 2.0 | Edward |
| 547 | 1 | 2 | Beane, Mrs. Edward (Ethel Clarke) | female | 19.0 | 1 | 0 | 2908 | 26.0 | NaN | S | 1.0 | Clarke) |
| 572 | 1 | 1 | Appleton, Mrs. Edward Dale (Charlotte Lamson) | female | 53.0 | 2 | 0 | 11769 | 51.48 | C101 | S | 2.0 | Lamson) |
| 649 | 0 | 3 | Willey, Mr. Edward | male | NaN | 0 | 0 | S.O./P.P. 751 | 7.55 | NaN | S | NaN | Edward |


**8. How does the average age of men and women depend on `Pclass`? Choose all correct statements:**
- On average, men of the 1st class are older than 40 X
- On average, women of the 1st class are older than 40
- Men of all classes are on average older than women of the same class X
- On average, passengers of the 1st class are older than those of the 2nd class, who are older than passengers of the 3rd class X

In [90]:
data[(data['Sex'] == 'male') & (data['Pclass'] == 1)]['Age'].mean()

41.28138613861386

In [91]:
data[(data['Sex'] == 'female') & (data['Pclass'] == 1)]['Age'].mean()

34.61176470588235

In [92]:
data[(data['Sex'] == 'male') & (data['Pclass'] == 2)]['Age'].mean()

30.74070707070707

In [94]:
data[(data['Sex'] == 'female') & (data['Pclass'] == 2)]['Age'].mean()

28.722972972972972

In [95]:
data[(data['Sex'] == 'male') & (data['Pclass'] == 3)]['Age'].mean()

26.507588932806325

In [96]:
data[(data['Sex'] == 'female') & (data['Pclass'] == 3)]['Age'].mean()

21.75

In [99]:
data[data['Pclass'] == 3]['Age'].mean()

25.14061971830986

In [100]:
data[data['Pclass'] == 2]['Age'].mean()

29.87763005780347

In [101]:
data[data['Pclass'] == 1]['Age'].mean()

38.233440860215055
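
The nine separate means above can be condensed into two `groupby` calls. This is a sketch on toy rows; `unstack` moves the `Sex` level into columns so the result reads like a class-by-sex table.

```python
import numpy as np
import pandas as pd

# Toy data made up for illustration, with one missing age
toy = pd.DataFrame(
    {
        "Pclass": [1, 1, 2, 2, 3, 3],
        "Sex": ["male", "female", "male", "female", "male", "female"],
        "Age": [40.0, 30.0, 35.0, 25.0, 20.0, np.nan],
    }
)

# Mean age per (class, sex) pair, reshaped to classes x sexes
by_class_sex = toy.groupby(["Pclass", "Sex"])["Age"].mean().unstack()

# Mean age per class, regardless of sex
by_class = toy.groupby("Pclass")["Age"].mean()

print(by_class_sex)
print(by_class)
```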

In [105]:
pd.crosstab(data['Pclass'], data['Survived'], normalize=True)

| Pclass / Survived | 0 | 1 |
|---|---|---|
| 1 | 0.09 | 0.15 |
| 2 | 0.11 | 0.1 |
| 3 | 0.42 | 0.13 |
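
With `normalize=True` every cell is divided by the total number of rows, so the whole table sums to 1. To read survival rates within each class, `normalize='index'` divides each row by its own total instead. The sketch below contrasts the two on toy rows.

```python
import pandas as pd

# Toy data made up for illustration
toy = pd.DataFrame(
    {
        "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
        "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
    }
)

# Shares of the whole sample: all cells together sum to 1
overall = pd.crosstab(toy["Pclass"], toy["Survived"], normalize=True)

# Shares within each class: every row sums to 1
per_class = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")

print(overall)
print(per_class)
```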


In [111]:
pd.crosstab(data['Pclass'], data['Sex'], values=data['Age'], aggfunc='mean')

| Pclass / Sex | female | male |
|---|---|---|
| 1 | 34.61 | 41.28 |
| 2 | 28.72 | 30.74 |
| 3 | 21.75 | 26.51 |


## Useful resources
* The same notebook as an interactive web-based [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-1-practice-analyzing-titanic-passengers) with a [solution](https://www.kaggle.com/kashnitsky/topic-1-practice-solution)
* Topic 1 "Exploratory Data Analysis with Pandas" as a [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-1-exploratory-data-analysis-with-pandas)
* Main course [site](https://mlcourse.ai), [course repo](https://github.com/Yorko/mlcourse.ai), and YouTube [channel](https://www.youtube.com/watch?v=QKTuw4PNOsU&list=PLVlY_7IJCMJeRfZ68eVfEcu-UcN9BbwiX)