# About me
Andrea Lelli

Data Scientist

email: andrea.lelli@vedrai.com

!pip install pandas==1.5.3

# What is NumPy?

NumPy (Numerical Python) (https://numpy.org/) is an open-source Python library designed for scientific computing. It provides support for multidimensional arrays and advanced mathematical functions, making operations on large datasets more efficient than using Python lists.

Key Features

- Multidimensional Arrays: NumPy introduces the ndarray type, which enables fast and optimized operations on arrays.

- Vectorized Operations: Supports element-wise mathematical operations and algebraic matrix computations.

- Broadcasting: Allows operations on arrays of different shapes without explicit looping.

- Advanced Mathematical Functions: Includes tools for linear algebra, Fourier transforms, and random number generation.

NumPy is an essential library for numerical computing in Python. 

With its capabilities for handling arrays and optimized mathematical operations, it is widely used in data science, machine learning, and scientific computing.
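
Broadcasting is the least intuitive of the features listed above, so a minimal sketch may help. The arrays `col` and `row` are made-up illustration values, not part of the original notebook; the point is only that NumPy stretches both operands to a common shape without an explicit loop.

```python
import numpy as np

# Hypothetical inputs: a column of shape (3, 1) and a row of shape (3,)
col = np.array([[0], [10], [20]])
row = np.array([1, 2, 3])

# Broadcasting expands both operands to shape (3, 3) before adding
table = col + row

print(table.shape)  # (3, 3)
print(table)
```

Each entry of `table` is the sum of one element of `col` and one element of `row`, which is what a nested loop over both arrays would compute.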

In [1]:
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)

[1 2 3 4 5]


In [2]:
mat = np.array([[1, 2, 3], [4, 5, 6]])
print(mat)



[[1 2 3]
 [4 5 6]]


In [3]:
arr = np.array([1, 2, 3, 4])
print(arr + 10)  # Adds 10 to each element
print(arr * 2)   # Multiplies each element by 2

[11 12 13 14]
[2 4 6 8]


In [4]:
print(arr[1])     # Second element
print(arr[1:3])   # Second to third elements

2
[2 3]


In [5]:
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(A @ B)  # Matrix multiplication

[[19 22]
 [43 50]]
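
A common source of confusion is the difference between `*` and `@`. The short sketch below reuses the same `A` and `B` values as the cell above to contrast the two; it is an illustration, not an additional cell from the original notebook.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

elementwise = A * B  # multiplies entries at matching positions
matmul = A @ B       # row-by-column matrix product

print(elementwise)  # [[ 5 12] [21 32]]
print(matmul)       # [[19 22] [43 50]]
```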


In [6]:
np.random.uniform(low=0,high=200,size=5)

array([  4.83311831, 114.7067472 , 115.28524968, 160.18812747,
        30.88693069])

In [7]:
np.random.normal(loc=0, scale=5, size=5)

array([ 4.44423583, -5.21696202,  5.82463524,  1.95995413, -5.1313468 ])

"loc" is the mean, "scale" is the standard deviation (not the variance), and "size" is the number of samples to draw.

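To check what `loc` and `scale` mean in `np.random.normal`, one option is to draw a large sample and inspect its statistics. The sketch below uses `np.random.default_rng` with an arbitrary seed (an assumption made here for reproducibility; the original cells use the legacy `np.random` functions without a seed).

```python
import numpy as np

# The seed value is arbitrary; it only makes the draw repeatable
rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=0, scale=5, size=100_000)

print(sample.shape)            # (100000,)
print(round(sample.mean(), 2)) # close to loc = 0
print(round(sample.std(), 2))  # close to scale = 5
print(round(sample.var(), 2))  # close to scale**2 = 25
```

The sample standard deviation lands near 5 while the variance lands near 25, which shows that `scale` is a standard deviation.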
# What is pandas?

pandas (https://pandas.pydata.org/) is an open-source Python library designed for data manipulation and analysis. It provides data structures and functions that make working with structured data (e.g., tables, time series) simple and efficient. pandas is built on top of NumPy and is widely used in data science, finance, and machine learning applications.

Key Features

- Data Structures: pandas provides two main data structures:

  - Series: A one-dimensional labeled array capable of holding any data type.

  - DataFrame: A two-dimensional table with labeled rows and columns, similar to a spreadsheet.

- Data Manipulation: pandas enables powerful data wrangling, including filtering, grouping, and aggregation.

- Handling Missing Data: Built-in functions help efficiently manage missing values.

- Data Import & Export: Read and write data from various formats (CSV, Excel, SQL, JSON, etc.).

- Time Series Analysis: Built-in support for handling and analyzing time-indexed data.

pandas is a versatile and powerful library for data analysis in Python. 

By mastering pandas, you can efficiently manipulate, clean, and analyze data, making it an essential tool for data professionals.


# Basic DataFrame Operations

In [8]:
# Setup and Required Libraries
import pandas as pd

In [9]:
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s)

a    10
b    20
c    30
d    40
dtype: int64


In [10]:

#Create a DataFrame
pd.DataFrame([[1,2],["a", "b"]], columns=["colonna1", "colonna2"])

Unnamed: 0,colonna1,colonna2
0,1,2
1,a,b


In [11]:
# Create a DataFrame from dict
df_dict = {"colonna1":[1,2,2], "colonna2":["a", "a", "b"]}
pd.DataFrame(df_dict)

Unnamed: 0,colonna1,colonna2
0,1,a
1,2,a
2,2,b


In [12]:
pd.DataFrame({"x": np.random.normal(loc=0, scale=10, size=1000), "y":np.random.uniform(low=10, high=20, size=1000)})

Unnamed: 0,x,y
0,-6.852105,19.542491
1,16.964920,13.797501
2,-13.122268,17.741248
3,5.037862,19.003744
4,9.063290,19.176594
...,...,...
995,-2.934803,19.629353
996,1.417409,17.599332
997,-2.872122,10.712147
998,5.714346,19.142224


In [13]:
# 1. Loading and First Look at Data
# ================================

# Read the Titanic data
df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')



In [14]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [15]:
# View the first few rows
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [16]:
#View the last few rows
df.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [17]:
# Get the DataFrame’s shape
print(df.shape)

(891, 12)


In [18]:
# List all columns
print(df.columns.tolist())

['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']


In [19]:
# Check data types for each column
print(df.dtypes)

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object


In [20]:
# Get a concise summary
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [21]:
# Get summary statistics
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [22]:
# Sort the DataFrame by a single column
df_sorted = df.sort_values(by='Age')
df_sorted.head()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
803,804,1,3,"Thomas, Master. Assad Alexander",male,0.42,0,1,2625,8.5167,,C
755,756,1,2,"Hamalainen, Master. Viljo",male,0.67,1,1,250649,14.5,,S
469,470,1,3,"Baclini, Miss. Helene Barbara",female,0.75,2,1,2666,19.2583,,C
644,645,1,3,"Baclini, Miss. Eugenie",female,0.75,2,1,2666,19.2583,,C
78,79,1,2,"Caldwell, Master. Alden Gates",male,0.83,0,2,248738,29.0,,S


In [23]:
# Sort by multiple columns (e.g., Fare descending, then Age ascending)

df_sorted_multi = df.sort_values(by=['Fare', 'Age'], ascending=[False, True])
df_sorted_multi.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
258,259,1,1,"Ward, Miss. Anna",female,35.0,0,0,PC 17755,512.3292,,C
737,738,1,1,"Lesurer, Mr. Gustave J",male,35.0,0,0,PC 17755,512.3292,B101,C
679,680,1,1,"Cardeza, Mr. Thomas Drake Martinez",male,36.0,0,1,PC 17755,512.3292,B51 B53 B55,C
27,28,0,1,"Fortune, Mr. Charles Alexander",male,19.0,3,2,19950,263.0,C23 C25 C27,S
88,89,1,1,"Fortune, Miss. Mabel Helen",female,23.0,3,2,19950,263.0,C23 C25 C27,S


In [24]:
# Reset the index after sorting
df_sorted_reset = df_sorted_multi.reset_index(drop=True)
df_sorted_reset.head()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,259,1,1,"Ward, Miss. Anna",female,35.0,0,0,PC 17755,512.3292,,C
1,738,1,1,"Lesurer, Mr. Gustave J",male,35.0,0,0,PC 17755,512.3292,B101,C
2,680,1,1,"Cardeza, Mr. Thomas Drake Martinez",male,36.0,0,1,PC 17755,512.3292,B51 B53 B55,C
3,28,0,1,"Fortune, Mr. Charles Alexander",male,19.0,3,2,19950,263.0,C23 C25 C27,S
4,89,1,1,"Fortune, Miss. Mabel Helen",female,23.0,3,2,19950,263.0,C23 C25 C27,S


In [25]:
# Set a column as the new index (e.g., PassengerId)
df_indexed = df.set_index('PassengerId')
df_indexed.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [26]:
# Rename one or more columns
df_renamed = df.rename(columns={'Sex': 'Gender'})
df_renamed.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [27]:
# Add a new column (e.g., a computed field)
df['Age_in_5_Years'] = df['Age'] + 5
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_in_5_Years
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,27.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,43.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,31.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,40.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,40.0


In [28]:
# Delete a column from the DataFrame
df_dropped = df.drop(columns=['Age_in_5_Years'])
df_dropped.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


# Data Selection and Filtering

In [29]:
# Select a single column (e.g., Age)
ages = df['Age']
ages.head()

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64

In [30]:
# Select multiple columns (e.g., Age and Sex)
age_gender = df[['Age', 'Sex']]
age_gender.head()

Unnamed: 0,Age,Sex
0,22.0,male
1,38.0,female
2,26.0,female
3,35.0,female
4,35.0,male


In [31]:
# Select rows where the index is between 10 and 20 (inclusive)
df.loc[10:20, ['Age', 'Sex']]

Unnamed: 0,Age,Sex
10,4.0,female
11,58.0,female
12,20.0,male
13,39.0,male
14,14.0,female
15,55.0,female
16,2.0,male
17,,male
18,31.0,female
19,,female


In [32]:
# Select the first 5 rows and columns 0 to 2
df.iloc[0:5, 0:3]

Unnamed: 0,PassengerId,Survived,Pclass
0,1,0,3
1,2,1,1
2,3,1,3
3,4,1,1
4,5,0,3
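
The two cells above differ in one detail that is easy to miss: `.loc` slices by label and includes the end label, while `.iloc` slices by integer position and excludes the end position. A minimal sketch on a made-up DataFrame (`toy` is a hypothetical name, not part of the Titanic data):

```python
import pandas as pd

# Hypothetical data with the default 0..4 integer index
toy = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

by_label = toy.loc[1:3]      # labels 1, 2 and 3: end label included
by_position = toy.iloc[1:3]  # positions 1 and 2: end position excluded

print(by_label["value"].tolist())     # [20, 30, 40]
print(by_position["value"].tolist())  # [20, 30]
```

With the default integer index the labels and positions coincide, which is why the same slice bounds return a different number of rows.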


In [33]:
# Filter rows with a condition (e.g., Age > 50)
older_than_50 = df[df['Age'] > 50]
older_than_50.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_in_5_Years
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,59.0
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S,63.0
15,16,1,2,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,0,0,248706,16.0,,S,60.0
33,34,0,2,"Wheadon, Mr. Edward H",male,66.0,0,0,C.A. 24579,10.5,,S,71.0
54,55,0,1,"Ostby, Mr. Engelhart Cornelius",male,65.0,0,1,113509,61.9792,B30,C,70.0


In [34]:
# Filter rows with multiple conditions
# Filter for females with Fare < 50
condition = (df['Sex'] == 'female') & (df['Fare'] < 50)
females_low_income = df[condition]
females_low_income.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_in_5_Years
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,31.0
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,32.0
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,19.0
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S,9.0
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S,63.0


In [35]:
# Use the .query() method for filtering
df_query = df.query("Age > 50 and Fare < 50")
df_query.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_in_5_Years
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S,63.0
15,16,1,2,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,0,0,248706,16.0,,S,60.0
33,34,0,2,"Wheadon, Mr. Edward H",male,66.0,0,0,C.A. 24579,10.5,,S,71.0
94,95,0,3,"Coxon, Mr. Daniel",male,59.0,0,0,364500,7.25,,S,64.0
96,97,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C,76.0


In [36]:
# Get the value at row index 5 in the 'Ticket' column
ticket_val = df.at[5, 'Ticket']
print(ticket_val)

330877


In [37]:
# Get the value at integer position (5, 2): the 6th row and 3rd column, since positions are zero-based
cell_val = df.iat[5, 2]
print(cell_val)

3


In [38]:
# Slice rows by position: rows 10 through 20 (the end of the slice is exclusive)
df_slice = df[10:21]
df_slice.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_in_5_Years
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S,9.0
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S,63.0
12,13,0,3,"Saundercock, Mr. William Henry",male,20.0,0,0,A/5. 2151,8.05,,S,25.0
13,14,0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.275,,S,44.0
14,15,0,3,"Vestrom, Miss. Hulda Amanda Adolfina",female,14.0,0,0,350406,7.8542,,S,19.0


In [39]:
# Create a new column 'Adult' based on an Age threshold (rows with a missing Age stay NaN)
df.loc[df["Age"] >= 18, 'Adult'] = True
df.loc[df['Age'] < 18, 'Adult'] = False
df[['Age', 'Adult']].head()

Unnamed: 0,Age,Adult
0,22.0,True
1,38.0,True
2,26.0,True
3,35.0,True
4,35.0,True


In [40]:
df_filtered = df[df['Age'] > 60].reset_index(drop=True)
df_filtered.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_in_5_Years,Adult
0,34,0,2,"Wheadon, Mr. Edward H",male,66.0,0,0,C.A. 24579,10.5,,S,71.0,True
1,55,0,1,"Ostby, Mr. Engelhart Cornelius",male,65.0,0,1,113509,61.9792,B30,C,70.0,True
2,97,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C,76.0,True
3,117,0,3,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.75,,Q,75.5,True
4,171,0,1,"Van der hoef, Mr. Wyckoff",male,61.0,0,0,111240,33.5,B19,S,66.0,True


In [41]:
# Filter rows where Embarked is either 'S' or 'C'
df_embarked = df[df['Embarked'].isin(['S', 'C'])]
df_embarked.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_in_5_Years,Adult
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,27.0,True
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,43.0,True
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,31.0,True
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,40.0,True
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,40.0,True


In [42]:
# BUG: Python's `and` cannot combine two boolean Series element-wise
# Incorrect: df['Age'] > 50 and df['Fare'] > 50  --> Raises a ValueError.
try:
    df_bug = df['Age'] > 50 and df['Fare'] > 50
except Exception as e:
    print("Error encountered:", e)

# Correct way using bitwise operators:
df_fixed = df[(df['Age'] > 50) & (df['Fare'] > 50)]
df_fixed.head()

Error encountered: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_in_5_Years,Adult
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,59.0,True
54,55,0,1,"Ostby, Mr. Engelhart Cornelius",male,65.0,0,1,113509,61.9792,B30,C,70.0,True
124,125,0,1,"White, Mr. Percival Wayland",male,54.0,0,1,35281,77.2875,D26,S,59.0,True
155,156,0,1,"Williams, Mr. Charles Duane",male,51.0,0,1,PC 17597,61.3792,,C,56.0,True
195,196,1,1,"Lurette, Miss. Elise",female,58.0,0,0,PC 17569,146.5208,B80,C,63.0,True
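
As a complement to the cell above, here is a small sketch of the three element-wise boolean operators on a made-up DataFrame. The `toy` values are hypothetical; each condition is wrapped in parentheses because `&`, `|` and `~` bind more tightly than comparisons such as `>`.

```python
import pandas as pd

# Hypothetical data, not taken from the Titanic dataset
toy = pd.DataFrame({"Age": [30, 55, 60, 70], "Fare": [80.0, 20.0, 90.0, 10.0]})

both = toy[(toy["Age"] > 50) & (toy["Fare"] > 50)]    # element-wise AND
either = toy[(toy["Age"] > 50) | (toy["Fare"] > 50)]  # element-wise OR
not_old = toy[~(toy["Age"] > 50)]                     # element-wise NOT

print(both.index.tolist())     # [2]
print(either.index.tolist())   # [0, 1, 2, 3]
print(not_old.index.tolist())  # [0]
```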


# Break 1:

Exercise: Create a DataFrame that contains information about a hypothetical list of Christmas presents you made.

Exercise: Create a DataFrame with 1000 rows and the columns "Gender" and "Height", with exactly 500 males and 500 females, such that the average female height is roughly 168 and the average male height is roughly 174.

In [43]:
df1_dict = {"Person":["Martina", "Valentina","Katia"], "Gift":["flower","book","backpack"]}
pd.DataFrame(df1_dict)

Unnamed: 0,Person,Gift
0,Martina,flower
1,Valentina,book
2,Katia,backpack


In [44]:
female_heights = np.random.normal(loc=168, scale=5, size=500)
male_heights = np.random.normal(loc=174, scale=5, size=500)

In [45]:
genders = ["Female"]*500 + ["Male"]*500

In [46]:
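# Concatenate the two arrays; `+` would add them element-wise and return only 500 values
heights = np.concatenate([female_heights, male_heights])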

In [47]:
df2 = pd.DataFrame({
    "Gender": genders,
    "Height":np.concatenate([female_heights,male_heights])
})


In [48]:
df2.head()

Unnamed: 0,Gender,Height
0,Female,174.244262
1,Female,171.064351
2,Female,171.639314
3,Female,167.208585
4,Female,165.245622
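
One way to check a solution to the second exercise is to group by gender and look at the counts and means. The sketch below rebuilds the data with a seeded generator (the seed and the names `people` and `summary` are assumptions made for this illustration) and also contrasts `+` with `np.concatenate`.

```python
import numpy as np
import pandas as pd

# Arbitrary seed so the check is repeatable
rng = np.random.default_rng(seed=42)
female = rng.normal(loc=168, scale=5, size=500)
male = rng.normal(loc=174, scale=5, size=500)

added = female + male                     # element-wise sum: still 500 values
stacked = np.concatenate([female, male])  # one array after the other: 1000 values

people = pd.DataFrame({
    "Gender": ["Female"] * 500 + ["Male"] * 500,
    "Height": stacked,
})

# Count and mean per gender to verify the exercise requirements
summary = people.groupby("Gender")["Height"].agg(["count", "mean"])
print(added.shape, stacked.shape)  # (500,) (1000,)
print(summary)
```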


In [49]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   PassengerId     891 non-null    int64  
 1   Survived        891 non-null    int64  
 2   Pclass          891 non-null    int64  
 3   Name            891 non-null    object 
 4   Sex             891 non-null    object 
 5   Age             714 non-null    float64
 6   SibSp           891 non-null    int64  
 7   Parch           891 non-null    int64  
 8   Ticket          891 non-null    object 
 9   Fare            891 non-null    float64
 10  Cabin           204 non-null    object 
 11  Embarked        889 non-null    object 
 12  Age_in_5_Years  714 non-null    float64
 13  Adult           714 non-null    object 
dtypes: float64(3), int64(5), object(6)
memory usage: 97.6+ KB


# Grouping and Aggregation


In [50]:
# Group by Sex and compute the mean Age
mean_age_by_gender = df.groupby('Sex')['Age'].mean()
print(mean_age_by_gender)

Sex
female    27.915709
male      30.726645
Name: Age, dtype: float64


In [51]:
# Group by Class and get aggregate statistics for Fare and Age
agg_class = df.groupby('Pclass')[['Fare', 'Age']].agg(['mean', 'median'])
print(agg_class)

             Fare                 Age       
             mean   median       mean median
Pclass                                      
1       84.154687  60.2875  38.233441   37.0
2       20.662183  14.2500  29.877630   29.0
3       13.675550   8.0500  25.140620   24.0


In [52]:
#Apply multiple aggregation functions on one group
agg_fare = df.groupby('Sex')['Fare'].agg(['min', 'max', 'mean'])
print(agg_fare)


         min       max       mean
Sex                              
female  6.75  512.3292  44.479818
male    0.00  512.3292  25.523893


In [53]:
# Reset index after groupby aggregation
grouped_reset = df.groupby('Sex')['Fare'].mean().reset_index()
grouped_reset.head()

Unnamed: 0,Sex,Fare
0,female,44.479818
1,male,25.523893


In [54]:
# Filter groups based on an aggregated value
# For example, select groups where the average Age is greater than 50
group_age = df.groupby('Sex')['Age'].mean()
filtered_groups = group_age[group_age > 50]
print(filtered_groups)

Series([], Name: Age, dtype: float64)


In [55]:
# Group by multiple columns (e.g., Sex and Pclass)
group_multi = df.groupby(['Sex', 'Pclass'])['Fare'].mean().reset_index()
group_multi.head()

Unnamed: 0,Sex,Pclass,Fare
0,female,1,106.125798
1,female,2,21.970121
2,female,3,16.11881
3,male,1,67.226127
4,male,2,19.741782


In [56]:
#Use .agg() with a custom function
def range_func(x):
    return x.max() - x.min()

custom_agg = df.groupby('Sex')['Fare'].agg(['mean', range_func])
print(custom_agg)

             mean  range_func
Sex                          
female  44.479818    505.5792
male    25.523893    512.3292


In [57]:
#Apply a lambda function within .agg()
lambda_agg = df.groupby('Sex')['Fare'].agg(lambda x: x.quantile(0.75) - x.quantile(0.25))
print(lambda_agg)

Sex
female    42.928125
male      18.654200
Name: Fare, dtype: float64


In [58]:
#Demonstrate transformation vs. aggregation
# Transformation: subtract group mean from each value
df['Fare_centered'] = df.groupby('Sex')['Fare'].transform(lambda x: x - x.mean())
df[['Sex', 'Fare', 'Fare_centered']].head()

Unnamed: 0,Sex,Fare,Fare_centered
0,male,7.25,-18.273893
1,female,71.2833,26.803482
2,female,7.925,-36.554818
3,female,53.1,8.620182
4,male,8.05,-17.473893
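
The difference between aggregation and transformation is mostly about the shape of the result: an aggregation returns one value per group, while a transformation returns one value per original row. A minimal sketch on made-up data (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical data with two groups of two rows each
toy = pd.DataFrame({"Sex": ["f", "f", "m", "m"], "Fare": [10.0, 30.0, 5.0, 15.0]})

# Aggregation: one mean per group, indexed by the group key
group_means = toy.groupby("Sex")["Fare"].mean()

# Transformation: the group mean repeated for every row, aligned with toy
row_means = toy.groupby("Sex")["Fare"].transform("mean")

toy["Fare_centered"] = toy["Fare"] - row_means

print(len(group_means), len(row_means))  # 2 4
print(toy["Fare_centered"].tolist())     # [-10.0, 10.0, -5.0, 5.0]
```

Because `row_means` has the same index as `toy`, it can be assigned or subtracted directly without a merge.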


In [59]:
#Group by with as_index=False
group_as_index = df.groupby('Pclass', as_index=False)['Fare'].mean()
group_as_index.head()


Unnamed: 0,Pclass,Fare
0,1,84.154687
1,2,20.662183
2,3,13.67555


In [60]:
# Example bug: Forgetting to reset the index after aggregation
# Bug: The group key ('Sex') ends up in the index instead of a regular column, which can cause issues when plotting or merging later.
grouped_bug = df.groupby('Sex')['Fare'].mean()
print(grouped_bug)
# Correct:
grouped_corrected = grouped_bug.reset_index()
print(grouped_corrected)


Sex
female    44.479818
male      25.523893
Name: Fare, dtype: float64
      Sex       Fare
0  female  44.479818
1    male  25.523893


In [61]:
# Create a pivot table (aggregation via pivoting)
pivot_table = pd.pivot_table(df, values='Fare', index='Sex', columns='Pclass', aggfunc='mean')
print(pivot_table)


Pclass           1          2          3
Sex                                     
female  106.125798  21.970121  16.118810
male     67.226127  19.741782  12.661633
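
A pivot table with `aggfunc='mean'` produces the same numbers as a two-key `groupby` followed by `unstack()`. The sketch below shows this on a small made-up DataFrame; the `toy` values are hypothetical and only the column names mirror the Titanic data.

```python
import pandas as pd

# Hypothetical data reusing the Titanic column names
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male"],
    "Pclass": [1, 1, 2, 1, 2],
    "Fare": [10.0, 30.0, 20.0, 40.0, 50.0],
})

# Route 1: pivot table with Sex as rows and Pclass as columns
pivoted = pd.pivot_table(toy, values="Fare", index="Sex", columns="Pclass", aggfunc="mean")

# Route 2: group by both keys, then move Pclass from the index to the columns
unstacked = toy.groupby(["Sex", "Pclass"])["Fare"].mean().unstack()

print(pivoted)
print(unstacked)
```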


In [62]:
# Count the number of records in each group
count_by_gender = df.groupby('Sex').size().reset_index(name='Count')
print(count_by_gender)


      Sex  Count
0  female    314
1    male    577


In [63]:
# Group by age bins
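# Create age bins; with right=False each bin is closed on the left,
# i.e. [18, 30), [30, 45), [45, 60), [60, 90), so ages below 18 get NaN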
bins = [18, 30, 45, 60, 90]
labels = ['18-30', '31-45', '46-60', '61-90']
df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)
age_group_agg = df.groupby('Age_Group')['Fare'].mean().reset_index()
print(age_group_agg)


  Age_Group       Fare
0     18-30  28.368094
1     31-45  39.684960
2     46-60  43.749956
3     61-90  43.467950
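
To see exactly where `pd.cut` places boundary values when `right=False`, a tiny sketch with hand-picked ages can help. The ages and the alternative labels below are assumptions chosen for illustration; the labels are written to match the half-open intervals.

```python
import pandas as pd

# Hand-picked ages sitting on or near the bin edges
ages = pd.Series([17, 18, 29.9, 30, 90])
bins = [18, 30, 45, 60, 90]
labels = ["18-29", "30-44", "45-59", "60-89"]  # hypothetical labels

# right=False makes each bin [left, right): left edge included, right edge excluded
groups = pd.cut(ages, bins=bins, labels=labels, right=False)

print(groups.tolist())  # [nan, '18-29', '18-29', '30-44', nan]
```

Values outside every bin (17 and 90 here) become NaN, which is why binning can introduce missing values into a new column.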



In [64]:
df[df["Age"]<1]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_in_5_Years,Adult,Fare_centered,Age_Group
78,79,1,2,"Caldwell, Master. Alden Gates",male,0.83,0,2,248738,29.0,,S,5.83,False,3.476107,
305,306,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S,5.92,False,126.026107,
469,470,1,3,"Baclini, Miss. Helene Barbara",female,0.75,2,1,2666,19.2583,,C,5.75,False,-25.221518,
644,645,1,3,"Baclini, Miss. Eugenie",female,0.75,2,1,2666,19.2583,,C,5.75,False,-25.221518,
755,756,1,2,"Hamalainen, Master. Viljo",male,0.67,1,1,250649,14.5,,S,5.67,False,-11.023893,
803,804,1,3,"Thomas, Master. Assad Alexander",male,0.42,0,1,2625,8.5167,,C,5.42,False,-17.007193,
831,832,1,2,"Richards, Master. George Sibley",male,0.83,1,1,29106,18.75,,S,5.83,False,-6.773893,


In [65]:
df["Age"].unique()

array([22.  , 38.  , 26.  , 35.  ,   nan, 54.  ,  2.  , 27.  , 14.  ,
        4.  , 58.  , 20.  , 39.  , 55.  , 31.  , 34.  , 15.  , 28.  ,
        8.  , 19.  , 40.  , 66.  , 42.  , 21.  , 18.  ,  3.  ,  7.  ,
       49.  , 29.  , 65.  , 28.5 ,  5.  , 11.  , 45.  , 17.  , 32.  ,
       16.  , 25.  ,  0.83, 30.  , 33.  , 23.  , 24.  , 46.  , 59.  ,
       71.  , 37.  , 47.  , 14.5 , 70.5 , 32.5 , 12.  ,  9.  , 36.5 ,
       51.  , 55.5 , 40.5 , 44.  ,  1.  , 61.  , 56.  , 50.  , 36.  ,
       45.5 , 20.5 , 62.  , 41.  , 52.  , 63.  , 23.5 ,  0.92, 43.  ,
       60.  , 10.  , 64.  , 13.  , 48.  ,  0.75, 53.  , 57.  , 80.  ,
       70.  , 24.5 ,  6.  ,  0.67, 30.5 ,  0.42, 34.5 , 74.  ])

In [66]:
# Use groupby with a lambda to count conditions
# Count how many passengers have Age < 18 in each Sex group
young_counts = df.groupby('Sex')['Age'].agg(lambda x: (x < 18).sum()).reset_index(name='Young_Count')
print(young_counts)

      Sex  Young_Count
0  female           55
1    male           58


# Handling Missing Data and Merging/Joining

In [67]:
# Check for missing values in the DataFrame
missing_values = df.isnull().sum()
print(missing_values)

PassengerId         0
Survived            0
Pclass              0
Name                0
Sex                 0
Age               177
SibSp               0
Parch               0
Ticket              0
Fare                0
Cabin             687
Embarked            2
Age_in_5_Years    177
Adult             177
Fare_centered       0
Age_Group         290
dtype: int64


In [68]:
# BUG: This only returns a DataFrame of booleans, not the counts.
missing_bug = df.isnull()
print(missing_bug.head())
# Correct: use .sum() to count missing values.
print(df.isnull().sum())

   PassengerId  Survived  Pclass   Name    Sex    Age  SibSp  Parch  Ticket  \
0        False     False   False  False  False  False  False  False   False   
1        False     False   False  False  False  False  False  False   False   
2        False     False   False  False  False  False  False  False   False   
3        False     False   False  False  False  False  False  False   False   
4        False     False   False  False  False  False  False  False   False   

    Fare  Cabin  Embarked  Age_in_5_Years  Adult  Fare_centered  Age_Group  
0  False   True     False           False  False          False      False  
1  False  False     False           False  False          False      False  
2  False   True     False           False  False          False      False  
3  False  False     False           False  False          False      False  
4  False   True     False           False  False          False      False  
PassengerId         0
Survived            0
Pclass             

In [69]:
# Fill missing values in a single column using .fillna()
# 'Age' has some missing values; replace them with the column mean
df['Age'] = df['Age'].fillna(df['Age'].mean())


Filling missing values with the column mean via "fillna" leaves the mean of the column unchanged.

In [70]:
np.mean([df["Age"]])

np.float64(29.69911764705882)
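
Mean imputation keeps the mean but not the spread. The following sketch on a made-up Series (`s` is hypothetical) shows that the mean is identical before and after filling, while the standard deviation shrinks because the filled values sit exactly at the centre.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
s = pd.Series([1.0, 2.0, np.nan, 3.0])
filled = s.fillna(s.mean())

print(s.mean(), filled.mean())  # 2.0 2.0
print(s.std(), filled.std())    # the second value is smaller
```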

In [71]:
# Fill missing values for the entire DataFrame
df_filled = df.ffill()  # Forward fill as an example; fillna(method='ffill') is deprecated in recent pandas
df_filled.head()




Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_in_5_Years,Adult,Fare_centered,Age_Group
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,27.0,True,-18.273893,18-30
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,43.0,True,26.803482,31-45
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,C85,S,31.0,True,-36.554818,18-30
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,40.0,True,8.620182,31-45
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,C123,S,40.0,True,-17.473893,31-45


In [72]:
# Drop rows with missing values using .dropna()
df_dropped = df.dropna()
print(df_dropped.shape)


(164, 16)


In [73]:
# Drop columns with too many missing values (e.g., threshold-based)
# Drop columns with more than 50% missing values
df_thresh = df.dropna(axis=1, thresh=int(0.5 * len(df)))
print(df_thresh.columns)


Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Embarked', 'Age_in_5_Years', 'Adult',
       'Fare_centered', 'Age_Group'],
      dtype='object')
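
The `thresh` argument of `dropna` is the minimum number of non-missing values a column needs in order to be kept, not the number of missing values allowed. A minimal sketch on made-up data (the `toy` DataFrame and its column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical columns with 0%, 50% and 75% missing values
toy = pd.DataFrame({
    "full": [1, 2, 3, 4],
    "half": [1.0, np.nan, 3.0, np.nan],
    "sparse": [np.nan, np.nan, np.nan, 4.0],
})

min_non_null = int(0.5 * len(toy))  # a column needs at least 2 non-null values
kept = toy.dropna(axis=1, thresh=min_non_null)

# Share of missing values per column, useful for choosing a threshold
missing_share = toy.isnull().mean()

print(kept.columns.tolist())  # ['full', 'half']
print(missing_share)
```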


In [74]:
# Replace missing values using a dictionary mapping for different columns
fill_values = {
    'Age': df['Age'].median(),
    'Embarked': 'Unknown'
}
df_replaced = df.fillna(value=fill_values)
df_replaced.head()



Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_in_5_Years,Adult,Fare_centered,Age_Group
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,27.0,True,-18.273893,18-30
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,43.0,True,26.803482,31-45
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,31.0,True,-36.554818,18-30
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,40.0,True,8.620182,31-45
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,40.0,True,-17.473893,31-45


In [75]:
# Conditional filling: fill missing values only for a subset of rows
# For first-class passengers (Pclass == 1), fill missing 'Age' with the first-class median age.
median_age = df.loc[df['Pclass'] == 1, 'Age'].median()
df.loc[(df['Pclass'] == 1) & (df['Age'].isnull()), 'Age'] = median_age



In [76]:
# Identify potential outliers using quantiles (a precursor to handling them)
# Example: Identify outliers in 'Fare' using the IQR method
Q1 = df['Fare'].quantile(0.25)
Q3 = df['Fare'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['Fare'] < (Q1 - 1.5 * IQR)) | (df['Fare'] > (Q3 + 1.5 * IQR))]
print("Outliers in Fare:")
print(outliers[['Fare']])


Outliers in Fare:
         Fare
1     71.2833
27   263.0000
31   146.5208
34    82.1708
52    76.7292
..        ...
846   69.5500
849   89.1042
856  164.8667
863   69.5500
879   83.1583

[116 rows x 1 columns]
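
The IQR rule used above can be wrapped in a small helper so it can be reused for other columns. The function name `iqr_bounds`, its parameter `k`, and the sample values are assumptions made for this sketch, not part of the original notebook.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences of the IQR rule for a numeric Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical values with one obvious outlier
fares = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
lower, upper = iqr_bounds(fares)
outliers = fares[(fares < lower) | (fares > upper)]

print(lower, upper)       # -1.0 7.0
print(outliers.tolist())  # [100.0]
```

Here Q1 is 2 and Q3 is 4, so the IQR is 2 and the fences sit at 2 - 3 = -1 and 4 + 3 = 7.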


In [77]:
# Merge two DataFrames using a common key
# Create a second DataFrame with a subset of columns and an additional key
df_extra = df[['Age', 'Sex', 'Fare']].copy()
df_extra['Key'] = np.arange(len(df_extra))
df['Key'] = np.arange(len(df))

merged_df = pd.merge(df, df_extra, on='Key', suffixes=('', '_extra'))
merged_df.head()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_in_5_Years,Adult,Fare_centered,Age_Group,Key,Age_extra,Sex_extra,Fare_extra
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,27.0,True,-18.273893,18-30,0,22.0,male,7.25
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,43.0,True,26.803482,31-45,1,38.0,female,71.2833
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,31.0,True,-36.554818,18-30,2,26.0,female,7.925
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,40.0,True,8.620182,31-45,3,35.0,female,53.1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,40.0,True,-17.473893,31-45,4,35.0,male,8.05


In [78]:
# Perform a left join vs. a right join
# Left join: all rows from df and matching rows from df_extra
left_join = pd.merge(df, df_extra, on='Key', how='left')
print("Left join shape:", left_join.shape)

# Right join: all rows from df_extra and matching rows from df
right_join = pd.merge(df, df_extra, on='Key', how='right')
print("Right join shape:", right_join.shape)


Left join shape: (891, 20)
Right join shape: (891, 20)
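
In the cell above every key matches, so the left and right joins have the same shape. A small sketch with partially overlapping keys (the `left` and `right` frames are invented for illustration) makes the difference visible:

```python
import pandas as pd

# Keys 1 and 2 appear on both sides; 0 only on the left; 3 and 4 only on the right
left = pd.DataFrame({"Key": [0, 1, 2], "Name": ["a", "b", "c"]})
right = pd.DataFrame({"Key": [1, 2, 3, 4], "Fare": [10.0, 20.0, 30.0, 40.0]})

# Left join: one row per key in `left`; missing matches become NaN
left_join = pd.merge(left, right, on="Key", how="left")

# Right join: one row per key in `right`; missing matches become NaN
right_join = pd.merge(left, right, on="Key", how="right")

print("Left join shape:", left_join.shape)
print("Right join shape:", right_join.shape)
```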


In [79]:
# Perform an outer join
outer_join = pd.merge(df, df_extra, on='Key', how='outer')
print("Outer join shape:", outer_join.shape)


Outer join shape: (891, 20)


An outer join keeps the union of the keys: every key found in either DataFrame appears in the result.

In [80]:
# Perform an inner join
inner_join = pd.merge(df, df_extra, on='Key', how='inner')
print("Inner join shape:", inner_join.shape)


Inner join shape: (891, 20)


An inner join keeps the intersection of the keys: only keys found in both DataFrames appear in the result.
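
The union/intersection behaviour only shows up when the keys differ between the two frames. The following sketch uses two invented frames and the `indicator=True` option of `pd.merge`, which adds a `_merge` column recording where each row came from:

```python
import pandas as pd

# Keys 1 and 2 are shared; 0 is only in `left`, 3 is only in `right`
left = pd.DataFrame({"Key": [0, 1, 2], "Name": ["a", "b", "c"]})
right = pd.DataFrame({"Key": [1, 2, 3], "Fare": [10.0, 20.0, 30.0]})

outer = pd.merge(left, right, on="Key", how="outer", indicator=True)
inner = pd.merge(left, right, on="Key", how="inner")

print(sorted(outer["Key"]))  # keys from either frame
print(sorted(inner["Key"]))  # keys present in both frames
print(outer["_merge"].value_counts())
```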

In [81]:
# Concatenate DataFrames using pd.concat()
# Assume you have two DataFrames with the same columns
df1 = df.iloc[:50]
df2 = df.iloc[50:]
concatenated = pd.concat([df1, df2])
print(concatenated.shape)


(891, 17)


The rows are split by position with iloc and then joined back together with pd.concat, so the original index is preserved.
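
When the pieces being concatenated were created independently, their index labels can repeat. A short sketch with two invented frames shows the default behaviour next to `ignore_index=True`:

```python
import pandas as pd

# Two small frames, each with its own default index 0, 1
df_a = pd.DataFrame({"x": [1, 2]})
df_b = pd.DataFrame({"x": [3, 4]})

# Default: the original labels are kept, so 0 and 1 appear twice
kept = pd.concat([df_a, df_b])

# ignore_index=True discards the old labels and renumbers from 0
renumbered = pd.concat([df_a, df_b], ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```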

In [82]:
# BUG: Attempting to merge on a column that doesn’t exist in one DataFrame.
try:
    pd.merge(df, df_extra, left_on='Nonexistent', right_on='Key')
except Exception as e:
    print("Merge error:", e)

# Correct merge:
correct_merge = pd.merge(df, df_extra, left_on='Key', right_on='Key')
correct_merge.head()


Merge error: 'Nonexistent'


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex_x,Age_x,SibSp,Parch,Ticket,Fare_x,Cabin,Embarked,Age_in_5_Years,Adult,Fare_centered,Age_Group,Key,Age_y,Sex_y,Fare_y
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,27.0,True,-18.273893,18-30,0,22.0,male,7.25
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,43.0,True,26.803482,31-45,1,38.0,female,71.2833
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,31.0,True,-36.554818,18-30,2,26.0,female,7.925
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,40.0,True,8.620182,31-45,3,35.0,female,53.1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,40.0,True,-17.473893,31-45,4,35.0,male,8.05
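
Beyond a missing column, a merge can also go wrong silently when the key is duplicated on one side. As a hedged sketch on invented frames, the key can be checked up front, and the `validate` argument of `pd.merge` can be used to make pandas raise a `MergeError` when the expected relationship does not hold:

```python
import pandas as pd

left = pd.DataFrame({"Key": [0, 1, 2], "Age": [22, 38, 26]})
# Key 1 is duplicated on the right side
right = pd.DataFrame({"Key": [0, 1, 1], "Fare": [7.25, 71.28, 53.10]})

# Check that the key column exists on both sides before merging
key = "Key"
has_key = key in left.columns and key in right.columns

caught = False
try:
    # validate="one_to_one" requires the key to be unique in both frames
    pd.merge(left, right, on=key, validate="one_to_one")
except pd.errors.MergeError as e:
    caught = True
    print("Merge error:", e)

# Without validation the merge succeeds and key 1 produces two rows
unchecked = pd.merge(left, right, on=key)
print(unchecked.shape)
```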


# Break 2:

Exercise: Use the Titanic dataframe to calculate the average Age in each Pclass

Exercise: Use the Titanic dataframe to count the survivors who are named John



In [86]:
# Read the Titanic data
df_titanic = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 17 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   PassengerId     891 non-null    int64   
 1   Survived        891 non-null    int64   
 2   Pclass          891 non-null    int64   
 3   Name            891 non-null    object  
 4   Sex             891 non-null    object  
 5   Age             891 non-null    float64 
 6   SibSp           891 non-null    int64   
 7   Parch           891 non-null    int64   
 8   Ticket          891 non-null    object  
 9   Fare            891 non-null    float64 
 10  Cabin           204 non-null    object  
 11  Embarked        889 non-null    object  
 12  Age_in_5_Years  714 non-null    float64 
 13  Adult           714 non-null    object  
 14  Fare_centered   891 non-null    float64 
 15  Age_Group       601 non-null    category
 16  Key             891 non-null    int64   
dtypes: category(1), 

In [94]:
age_class = df_titanic.groupby('Pclass')['Age'].mean()
age_class

Pclass
1    38.233441
2    29.877630
3    25.140620
Name: Age, dtype: float64
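
`mean()` skips missing ages, so it can help to look at how many values each group average is based on. A minimal sketch on an invented `toy` frame, using `agg` to compute several statistics at once:

```python
import numpy as np
import pandas as pd

# Invented data: class 2 has one missing age
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [40.0, 50.0, 30.0, np.nan, 20.0, 24.0],
})

# 'count' reports the number of non-missing ages used for each mean
summary = toy.groupby("Pclass")["Age"].agg(["mean", "median", "count"])
print(summary)
```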

In [95]:
df_titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [101]:
# Build the mask from df_titanic and apply it to the same DataFrame
survived_John = df_titanic[(df_titanic['Name'].str.contains(' John ')) & (df_titanic['Survived']==1)]
number_survived_John = len(survived_John)
number_survived_John

10
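
Matching `' John '` with spaces on both sides misses names where John is the last word or is followed by punctuation. One alternative, sketched here on invented names, is a regular expression with word boundaries (`\b`):

```python
import pandas as pd

# Invented names in the Titanic "Surname, Title. Given names" style
toy = pd.DataFrame({
    "Name": [
        "Smith, Mr. John",
        "Doe, Mr. John Henry",
        "Johnson, Miss. Anna",
        "Brown, Mrs. Peter (Mary John)",
    ],
    "Survived": [1, 1, 1, 0],
})

# Literal match: requires a space before and after "John"
spaces = toy["Name"].str.contains(" John ", regex=False)

# Word-boundary match: "John" as a whole word, wherever it appears
word = toy["Name"].str.contains(r"\bJohn\b", regex=True)

survived = toy["Survived"] == 1
print(int((spaces & survived).sum()))
print(int((word & survived).sum()))
```

The word-boundary pattern does not match "Johnson", but it does match "John" inside parentheses, so which rule is appropriate depends on how "named John" is defined.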

# Homework:

Exercise: Create a dataframe of at least 1000 rows describing a hypothetical list of employees of your company, then extract all employees who are in the IT department and have a Salary greater than 55000.

Exercise: Create a column that splits the data into Low, Medium, and High fare prices, and calculate the average Age for each group. Then fill the missing (NaN) Age values and calculate the average age again: how did it change? How can you fill Age so that the average does not change?