<img src="https://miro.medium.com/max/481/1*n_ms1q5YoHAQXXUIfeADKQ.png" alt="350" width="400" align="left"/>

# TUTORIAL PANDAS - TITANIC DATASET

***By Loris Liusso***

# <mark>What is Pandas?</mark>

<ul>

<li>Pandas is a Python library used for working with data sets.</li>

<li>It has functions for analyzing, cleaning, exploring, and manipulating data.</li>

<li>The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.</li>
</ul>

# Why Use Pandas?

<ul>

<li>Pandas lets us analyze large data sets and draw conclusions from them using statistical methods.</li>

<li>Pandas can clean messy data sets, and make them readable and relevant.</li>

<li>Relevant data is very important in data science.</li>
</ul>

<img src="https://media.geeksforgeeks.org/wp-content/uploads/finallpandas.png" alt="500" width="550" align="left"/>

<img src="https://storage.googleapis.com/lds-media/images/series-and-dataframe.width-1200.png" alt="500" width="550" align="left"/>


<img src="https://miro.medium.com/max/1200/1*eE8DP4biqtaIK3aIy1S2zA.png" alt="500" width="650" align="left"/>

In [None]:
import numpy as np
import pandas as pd #Load Pandas library

In [None]:
#Load the dataset with the read_csv() function:

df = pd.read_csv('./data/titanic.csv')

df

In [None]:
df.shape #attribute shape (nrows, ncolumns)

**head() / tail()  :**

head() : Return the first n rows. (5 by default)

tail(): Return the last n rows. (5 by default)

In [None]:
df.head() #first 5 rows by DEFAULT

In [None]:
df.tail() #last 5 rows by DEFAULT
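
Both methods also accept an explicit `n`. The sketch below uses a small hypothetical DataFrame (`toy_df`, with made-up values that are not taken from the Titanic file) to show a few variations:

```python
import pandas as pd

# Hypothetical toy data, only used to illustrate head() and tail()
toy_df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid", "Dee", "Eve", "Fay", "Gus"],
    "Age": [22, 38, 26, 35, 54, 2, 27],
})

first_two = toy_df.head(2)      # first 2 rows
last_three = toy_df.tail(3)     # last 3 rows
all_but_last = toy_df.head(-1)  # negative n: every row except the last one
```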

In [None]:
df['Survived'] #Series

In [None]:
#Return a Series containing counts of unique values.

#The resulting object will be in descending order.

df['Survived'].value_counts()
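
When shares are more useful than raw counts, `value_counts()` can return relative frequencies instead. A minimal sketch on a hypothetical Series (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy column: 1 = survived, 0 = did not survive
survived = pd.Series([0, 1, 1, 0, 0, 1, 0, 0], name="Survived")

counts = survived.value_counts()                # absolute counts, most frequent first
shares = survived.value_counts(normalize=True)  # relative frequencies that sum to 1
```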

In [None]:
df[['Name', 'Survived']] #Df Subset

## <mark>loc / iloc</mark>:

## Selecting via a single value

<ul>Both loc and iloc accept a single value for the row and for the column. We can use the following syntax for data selection: <br><br>
    
<li>loc[row_label, column_label]</li>
<li>iloc[row_position, column_position]</li>
</ul>
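
The difference between labels and positions is easiest to see when the index is not the default 0, 1, 2, ... range. The sketch below assumes a hypothetical `toy_df` with made-up values and a custom integer index:

```python
import pandas as pd

# Hypothetical toy data with a non-default integer index,
# so that row labels and row positions differ
toy_df = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cid"], "Age": [22, 38, 26]},
    index=[10, 20, 30],
)

by_label = toy_df.loc[20, "Name"]  # row labelled 20, column labelled "Name"
by_position = toy_df.iloc[1, 0]    # second row, first column

# Slices differ too: loc includes the end label, iloc excludes the end position
loc_slice = toy_df.loc[10:20]      # rows labelled 10 and 20
iloc_slice = toy_df.iloc[0:1]      # row at position 0 only
```

On the Titanic DataFrame the default index happens to coincide with the row positions, which is why `df.loc[0, 'Name']` and `df.iloc[0, ...]` point at the same row there.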

<img src="https://miro.medium.com/max/1050/1*CgAWzayEQY8PQuMpRkSGfQ.png" alt="500" width="650" align="left"/>

In [None]:
df.head()

**<mark>df.iloc[where, where]</mark>: Select single row/column or subset of rows/columns from the Dataframe by integer position**

In [None]:
df.iloc[0, :] #row selection by position [0, :]; the ":" is implicit, so df.iloc[0] is equivalent

In [None]:
df.iloc[0:2]

In [None]:
df.iloc[:, 0] #[row, col] -> column selection by index FIRST COLUMN, ALL ROWS

In [None]:
df.iloc[:, -1] #LAST COLUMN, ALL ROWS

**<mark>df.loc[val1,val2]</mark>: Select single row/column or subset of rows/columns by label**

In [None]:
df.head()

In [None]:
df.loc[0, 'Name'] #First row, COLUMN= 'Name'

In [None]:
df.loc[:,'Name'] #ALL ROWS, COLUMN= 'Name'

### EXAMPLE: <mark>LABEL AS MAIN INDEX</mark>
<img src="https://miro.medium.com/max/867/1*10_I9N1oqs8cNYVhTALS3w.png" alt="500" width="400" align="left"/>
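
One way to get a label-based index like the one pictured is `set_index()`. This is a sketch on a hypothetical `toy_df` with made-up values, not a step of the tutorial itself:

```python
import pandas as pd

# Hypothetical toy data; the names are made up for illustration
toy_df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid"],
    "Age": [22, 38, 26],
    "Survived": [1, 0, 1],
})

by_name = toy_df.set_index("Name")   # returns a new DataFrame; toy_df is unchanged

bob_age = by_name.loc["Bob", "Age"]  # single value selected by row and column label
ann_row = by_name.loc["Ann"]         # whole row as a Series of the remaining columns
```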

**describe():**
    
Generate descriptive statistics.

Descriptive statistics include those that summarize the central tendency,
dispersion and shape of a dataset’s distribution, excluding NaN values.

In [None]:
df.describe() 
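
By default `describe()` only summarizes numeric columns. The sketch below, on a hypothetical `toy_df` with made-up values, shows that missing values are left out of the statistics and that `include="all"` also covers text columns:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age
toy_df = pd.DataFrame({
    "Age": [20.0, 30.0, np.nan, 40.0],
    "Sex": ["male", "female", "female", "female"],
})

numeric_summary = toy_df.describe()           # numeric columns only; NaN is ignored
full_summary = toy_df.describe(include="all") # adds unique / top / freq for text columns
```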

In [None]:
df.dtypes #attribute dtypes: Return the dtypes in the DataFrame.

In [None]:
df.isnull() #Detect missing values. 

In [None]:
df.isnull().sum() #Sum null values

The `Cabin` information is missing in 687 rows. This is not surprising: cabin numbers are recorded mostly for 1st class passengers and are largely missing for 2nd and 3rd class (`Pclass`).
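
Raw counts of missing values are easier to judge as a share of all rows. A minimal sketch, assuming a hypothetical `toy_df` with made-up values: since `True` counts as 1, the mean of `isnull()` is the fraction of missing entries per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in both columns
toy_df = pd.DataFrame({
    "Age": [22, np.nan, 26, np.nan],
    "Cabin": ["C85", None, None, None],
})

missing_count = toy_df.isnull().sum()   # number of missing values per column
missing_share = toy_df.isnull().mean()  # fraction of missing values per column
```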

❓ Use the [`pandas.DataFrame.drop`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) function to drop the `Cabin` column from `df`

In [None]:
df.drop('Cabin', axis=1, inplace=True) 

In [None]:
df.head()

In [None]:
#Other methods: 

#df= df.drop('Cabin', axis=1) 

#Drop several columns at once (columns= already implies axis=1):
    
#df.drop(columns=[col1, col2], inplace=True)
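
The variants above can be compared on a small hypothetical `toy_df` (made-up values). The sketch also uses `errors="ignore"`, which may be handy in a notebook because dropping a column that is already gone otherwise raises a `KeyError` when a cell is run twice:

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    "Name": ["Ann", "Bob"],
    "Cabin": ["C85", None],
    "Ticket": ["A1", "B2"],
})

without_cabin = toy_df.drop(columns="Cabin")        # new DataFrame, toy_df is unchanged
trimmed = toy_df.drop(columns=["Cabin", "Ticket"])  # several columns at once

# "Cabin" is already gone from trimmed; errors="ignore" skips it instead of raising
safe = trimmed.drop(columns=["Cabin"], errors="ignore")
```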

In [None]:
df= df.drop(columns=['Ticket', 'Embarked']) #we also remove these 2 columns because we don't need them for now

In [None]:
df.head()

**Correlation:**

In [None]:
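df.corr(numeric_only=True) #numeric_only=True skips text columns such as Name and Sex (needed in pandas >= 2.0)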
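
`corr()` computes pairwise Pearson correlations between numeric columns. The sketch below assumes pandas 1.5 or newer (where the `numeric_only` parameter exists) and a hypothetical `toy_df` with made-up fares:

```python
import pandas as pd

# Hypothetical toy data: higher class number (3rd class) goes with lower fares
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Fare": [80.0, 70.0, 30.0, 8.0, 7.5, 9.0],
    "Sex": ["male", "female", "male", "female", "male", "male"],
})

corr = toy_df.corr(numeric_only=True)     # the text column "Sex" is left out
pclass_fare = corr.loc["Pclass", "Fare"]  # negative: fares drop as Pclass grows
```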

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

corr = df.corr(numeric_only=True)

fig, ax = plt.subplots(figsize=(16, 6))

sns.heatmap(
    corr, 
    vmin=-1, vmax=1, center=0, annot=True,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True
)

## <mark>GROUPBY()</mark>

A groupby operation involves some combination of splitting the object, applying a function, and combining the results.

**This can be used to group large amounts of data and compute operations on these groups.**

<img src="https://www.datasciencemadesimple.com/wp-content/uploads/2020/05/Generic-Groupby-mean-1.png" alt="500" width="850" align="left"/>
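
The three steps can be spelled out by hand and compared with the one-line `groupby()` call. This is only an illustrative sketch on a hypothetical `toy_df` with made-up fares:

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Fare": [80.0, 60.0, 30.0, 8.0, 10.0],
})

# Split: iterate over one sub-DataFrame per class
# Apply: take the mean fare of each sub-DataFrame
# Combine: collect the results into a single Series
manual = pd.Series({
    pclass: group["Fare"].mean()
    for pclass, group in toy_df.groupby("Pclass")
})

# The same result in one expression
grouped = toy_df.groupby("Pclass")["Fare"].mean()
```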

## Classes Analysis

Let's have a look at how tickets are split across classes.

❓ Using a `groupby()`, create a `pclass_df` dataframe counting the number of tickets sold per class (1, 2 or 3)

In [None]:
df.head()

In [None]:
#How many tickets were sold for each class?

pclass_df = df.groupby("Pclass")["PassengerId"].count().to_frame(name="Tickets sold")
pclass_df

Raw numbers are hard to read at a glance, so let's make sense of the data with a plot.

❓ Plot the `pclass_df` dataframe built in the previous question as a barchart

**Pandas plot(): Make plots of Series or DataFrame.**

In [None]:
pclass_df.plot(kind="bar")

Let's now have a look at **survivors**.

❓ Plot a barchart of the *survival rate* per class. `0` means no one survived in the class, `1` means everyone survived.

In [None]:
#What share of passengers survived in each class?

df_survived= df[["Pclass","Survived"]].groupby('Pclass').mean()
df_survived

In [None]:
#reset_index():

#Moves the group labels (here Pclass) from the index back into a regular column and replaces the index with the default 0..n-1 range.

df_survived_2= df[["Pclass","Survived"]].groupby('Pclass').mean().reset_index()
df_survived_2
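
An alternative to calling `reset_index()` afterwards is passing `as_index=False` to `groupby()`. The sketch below compares both on a hypothetical `toy_df` with made-up values:

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 3, 3],
    "Survived": [1, 1, 0, 1],
})

indexed = toy_df.groupby("Pclass")["Survived"].mean()  # Pclass is the index
flat = indexed.reset_index()                           # Pclass becomes a regular column

# Same layout without the extra step
flat_direct = toy_df.groupby("Pclass", as_index=False)["Survived"].mean()
```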

In [None]:
df_survived.plot(kind='bar')

## Gender Analysis

Let's have a look at the `Sex` column.

❓ Plot a barchart of survival frequency of each gender. Who survived the most?

In [None]:
df[['Survived', 'Sex']].groupby('Sex').mean().plot(kind='bar')

Let's build a fancier bar chart that shows the total number of passengers next to the total number of survivors, for each gender.

❓ Build a `survivors_df` DataFrame with two columns: `Total` and `Survived`, and two rows (`male` and `female`). Plot it.

In [None]:
survivors_df = df[['Survived', 'Sex']].groupby('Sex').sum() #summing 0 (did not survive) / 1 (survived) gives the number of survivors per sex
survivors_df['Total'] = df.groupby('Sex')['Survived'].count() #total number of passengers per sex (male, female)
survivors_df.plot(kind='bar')

In [None]:
survivors_df['%']= (survivors_df['Survived']*100)/survivors_df['Total']
survivors_df
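
The same table can also be built in a single `agg()` call with named aggregation. This is an alternative sketch, not the approach used above, shown on a hypothetical `toy_df` with made-up values:

```python
import pandas as pd

# Hypothetical toy data: 1 = survived, 0 = did not survive
toy_df = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "male"],
    "Survived": [0, 1, 0, 1, 1],
})

# Each keyword is an output column: (input column, aggregation function)
summary = toy_df.groupby("Sex").agg(
    Survived=("Survived", "sum"),
    Total=("Survived", "count"),
)
summary["%"] = summary["Survived"] * 100 / summary["Total"]
```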

## Children

The former analysis did not take into account ages. We want to differentiate between a child and an adult and see how survival rates are affected.

❓ Use boolean indexing to create a `children_df` containing only rows of child passengers

In [None]:
children_df = df[df['Age'] <= 17] #BOOLEAN INDEXING
children_df.head()
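
A comparison such as `df['Age'] <= 17` produces a boolean Series, and only the rows where it is `True` are kept. The sketch below, on a hypothetical `toy_df` with made-up values, also shows that a missing age compares as `False`, so such passengers end up in neither the children nor the adults:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age
toy_df = pd.DataFrame({
    "Age": [4, 30, np.nan, 17, 45],
    "Sex": ["male", "female", "male", "female", "male"],
})

is_child = toy_df["Age"] <= 17   # boolean Series; NaN <= 17 evaluates to False

children = toy_df[is_child]
adults = toy_df[toy_df["Age"] > 17]
unknown_age = toy_df[toy_df["Age"].isnull()]

# Conditions are combined with & (and) or | (or); each one needs parentheses
girls = toy_df[is_child & (toy_df["Sex"] == "female")]
```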

❓ How many children were there on the Titanic?

In [None]:
total_children= children_df.shape[0] #number of rows
total_children

❓ How many children survived?

In [None]:
children_survived= children_df['Survived'].sum()
children_survived

❓ Plot a barchart of survivors / total for each category: `male`, `female`, `children`. Bear in mind that you need to **subtract** the boys from the `male` statistics, and the girls from the `female` statistics.

In [None]:
adults_df= df[df['Age'] > 17]
adults_df= adults_df.rename(columns={'Sex': 'People'})

survivors_df = adults_df[['Survived', 'People']].groupby('People').sum() #number of adult survivors per group (male, female)
survivors_df['Total'] = adults_df.groupby('People')['Survived'].count() #total number of adults per group (male, female)
survivors_df['%']= (survivors_df['Survived']*100)/survivors_df['Total']
survivors_df

In [None]:
#add a 'children' row with the children's values to survivors_df

survivors_df.loc['children'] = [children_survived, total_children, (children_survived*100)/total_children ] 
survivors_df

In [None]:
survivors_df[['Survived', 'Total']].plot(kind='bar')

## Big families:

❓ Was it harder for members of bigger families to survive?
  
First, create a new column in your `DataFrame` holding the family size of each passenger.

In [None]:
df['family_size'] = 1 + df['SibSp'] + df['Parch']
df.groupby('family_size')['Survived'].mean().plot(kind='bar');

## Distinguished titles

❓ Were passengers with distinguished titles preferred during the evacuation?
   
With some string manipulation, create a new column holding the title of each passenger.

In [None]:
df.head()

In [None]:
#split(): Split a string into a list where each word is a list item
#strip(): Remove spaces at the beginning and at the end of the string:

df['Title'] = df['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
df
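
The same extraction can be written with the vectorized `.str` accessor instead of `apply()`. This is an alternative sketch on hypothetical names that follow the same "Surname, Title. Given names" pattern (the names and the regular expression are illustrative assumptions, not part of the dataset):

```python
import pandas as pd

# Hypothetical names in the same format as the Name column
names = pd.Series([
    "Rossi, Mr. Mario",
    "Bianchi, Mrs. Anna (Maria Verdi)",
    "Smith, Miss. Jane",
], name="Name")

# Capture the text between the first comma and the following period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
```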

In [None]:
#logy=True: use log scaling on the y axis, so that rare titles stay visible next to common ones

#Count how many passengers hold each "Title":

df.groupby('Title')['PassengerId'].count().sort_values().plot(kind='bar', logy=True)

In [None]:
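#Survival rate per "Title":

#1) Countess: noblewoman (wife of a count or earl)
#2) Mlle: Mademoiselle, French for "Miss"
#3) Lady: title for a woman of rank
#4) Ms: woman, marital status not specified
#5) Sir: title for a knight or baronet
#.....

df.groupby('Title')['Survived'].mean().sort_values().plot(kind='bar')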

In [None]:
df.groupby('Title').count()[['PassengerId']]

<img src="https://media.istockphoto.com/vectors/congratulations-greeting-sign-congrats-graduated-vector-id1148641884?k=20&m=1148641884&s=170667a&w=0&h=UZvEyiD5nxDJiLz5n0i1jdvWn-MR6wt1nomiPV1wSDE=" alt="400" width="500" align="left"/>