# Exercises

### 1. Load the Titanic Dataset
   - Load the Titanic dataset into a pandas DataFrame.
   - Inspect the dataset and gather as much information as you can:
     - What are the columns in the dataset, and how many rows are there?
        - columns: `PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked`
        - rows: train 891, test 418, merged 1309
     - How many missing values are present, and how are they distributed? How will you handle them – filling or dropping?
     - Which variables are relevant, and which are not?
        - Probably not relevant: `PassengerId`, `Ticket`, `Embarked`, `Cabin`
        - Unclear: `Fare`
     - What are the variable types? Categorize them (discrete, continuous).
        - `df.info()`
     - Obtain basic statistics.
        - `df.describe()`  
     - Create basic visualizations.
     - Is there any additional information you can think of?

In [15]:
import pandas as pd

train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

# Note: head() keeps only the first five rows, so this summarizes just those rows.
print(train_df.head().describe())

merged = pd.concat([train_df, test_df], ignore_index=True)
merged.head()

# What are the columns in the dataset, and how many rows are there?
print(train_df.columns)
print(f'number of rows: train {train_df.shape[0]}, test {test_df.shape[0]}, merged {merged.shape[0]}')

       PassengerId  Survived    Pclass       Age     SibSp  Parch       Fare
count     5.000000  5.000000  5.000000   5.00000  5.000000    5.0   5.000000
mean      3.000000  0.600000  2.200000  31.20000  0.600000    0.0  29.521660
std       1.581139  0.547723  1.095445   6.83374  0.547723    0.0  30.510029
min       1.000000  0.000000  1.000000  22.00000  0.000000    0.0   7.250000
25%       2.000000  0.000000  1.000000  26.00000  0.000000    0.0   7.925000
50%       3.000000  1.000000  3.000000  35.00000  1.000000    0.0   8.050000
75%       4.000000  1.000000  3.000000  35.00000  1.000000    0.0  53.100000
max       5.000000  1.000000  3.000000  38.00000  1.000000    0.0  71.283300
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
number of rows: train 891, test 418, merged 1309


In [23]:
# How many missing values are present, and how are they distributed? How will you handle them – filling or dropping?
merged.isna().sum()

PassengerId       0
Survived        418
Pclass            0
Name              0
Sex               0
Age             263
SibSp             0
Parch             0
Ticket            0
Fare              1
Cabin          1014
Embarked          2
dtype: int64
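
The counts above suggest one possible treatment, sketched here on a made-up toy frame rather than the real data: drop the mostly empty `Cabin` column, fill the numeric gaps (`Age`, `Fare`) with the median, and fill `Embarked` with its most frequent value. The helper name `handle_missing` and the choice of median/mode are assumptions for illustration, not a recommendation that these are the best imputations.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged Titanic frame (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Embarked": ["S", "C", None, "S"],
    "Cabin": [None, "C85", None, None],
})

def handle_missing(df):
    """Return a copy with the sparse column dropped and small gaps filled."""
    out = df.copy()
    # Cabin is mostly empty, so drop it rather than impute it.
    out = out.drop(columns=["Cabin"])
    # Numeric gaps: fill with the column median.
    for col in ["Age", "Fare"]:
        out[col] = out[col].fillna(out[col].median())
    # Categorical gap: fill with the most frequent value.
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode().iloc[0])
    return out

cleaned = handle_missing(toy)
print(cleaned.isna().sum())
```

Note that `Survived` is missing only for the test rows, so it should not be imputed at all.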

In [26]:
# Which variables are relevant, and which are not? info() lists dtypes and non-null counts.
merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1046 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1308 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1307 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB


In [11]:
merged.head()


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1.0 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0.0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


### 2. More Advanced Statistics
   - Retrieve statistics separated by sex and class.
   - Obtain statistics categorized by age.
   - Create visualizations of the statistics with relevant graphs.
   - Plot the cumulative survival rate as a function of ticket price.
   - Plot the cumulative survival rate as a function of passenger age.
   - Plot the cumulative survival rate as a function of ticket price *and* passenger age (does this make sense?).
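
One way to read "cumulative survival rate as a function of ticket price" is the survival rate among all passengers who paid at most a given fare. The sketch below follows that reading on a made-up toy frame; the helper name `cumulative_survival` and the output column `CumSurvivalRate` are hypothetical. The same function works for `Age` by passing `by="Age"`.

```python
import pandas as pd

# Toy data (made up); the real frame would be train_df.
toy = pd.DataFrame({
    "Fare": [53.10, 7.25, 71.28, 8.05],
    "Survived": [1, 0, 1, 0],
})

def cumulative_survival(df, by):
    """Survival rate among passengers whose `by` value is <= each observed value."""
    # Rows with a missing key or label cannot contribute to the curve.
    ordered = df.dropna(subset=[by, "Survived"]).sort_values(by)
    # Expanding mean = running share of survivors as `by` increases.
    rate = ordered["Survived"].expanding().mean()
    return pd.DataFrame({
        by: ordered[by].to_numpy(),
        "CumSurvivalRate": rate.to_numpy(),
    })

curve = cumulative_survival(toy, "Fare")
print(curve)
# curve.plot(x="Fare", y="CumSurvivalRate") would draw the curve (requires matplotlib).
```

When several passengers share the same fare, only the last point of that group reflects the full group, so the curve is best read at distinct fare values.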

In [54]:
print(train_df.groupby(['Sex', 'Pclass']).agg({'Survived': ['mean','count'], 'Age': ['mean', 'median'] }))
train_df.groupby(['Sex', 'Pclass']).describe()

               Survived              Age       
                   mean count       mean median
Sex    Pclass                                  
female 1       0.968085    94  34.611765   35.0
       2       0.921053    76  28.722973   28.0
       3       0.500000   144  21.750000   21.5
male   1       0.368852   122  41.281386   40.0
       2       0.157407   108  30.740707   30.0
       3       0.135447   347  26.507589   25.0


| Sex | Pclass | PassengerId count | PassengerId mean | PassengerId std | PassengerId min | PassengerId 25% | PassengerId 50% | PassengerId 75% | PassengerId max | Survived count | Survived mean | ... | Parch 75% | Parch max | Fare count | Fare mean | Fare std | Fare min | Fare 25% | Fare 50% | Fare 75% | Fare max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| female | 1 | 94.0 | 469.212766 | 247.476723 | 2.0 | 293.5 | 447.0 | 698.25 | 888.0 | 94.0 | 0.968085 | ... | 1.0 | 2.0 | 94.0 | 106.125798 | 74.259988 | 25.9292 | 57.2448 | 82.66455 | 134.5 | 512.3292 |
| female | 2 | 76.0 | 443.105263 | 243.627288 | 10.0 | 269.75 | 439.5 | 616.75 | 881.0 | 76.0 | 0.921053 | ... | 1.0 | 3.0 | 76.0 | 21.970121 | 10.891796 | 10.5 | 13.0 | 22.0 | 26.0625 | 65.0 |
| female | 3 | 144.0 | 399.729167 | 267.232416 | 3.0 | 165.25 | 376.0 | 636.0 | 889.0 | 144.0 | 0.5 | ... | 1.0 | 6.0 | 144.0 | 16.11881 | 11.690314 | 6.75 | 7.8542 | 12.475 | 20.221875 | 69.55 |
| male | 1 | 122.0 | 455.729508 | 247.026449 | 7.0 | 255.5 | 480.5 | 660.75 | 890.0 | 122.0 | 0.368852 | ... | 0.0 | 4.0 | 122.0 | 67.226127 | 77.548021 | 0.0 | 27.7281 | 41.2625 | 78.459375 | 512.3292 |
| male | 2 | 108.0 | 447.962963 | 256.922546 | 18.0 | 225.75 | 416.5 | 677.5 | 887.0 | 108.0 | 0.157407 | ... | 0.0 | 2.0 | 108.0 | 19.741782 | 14.922235 | 0.0 | 12.33125 | 13.0 | 26.0 | 73.5 |
| male | 3 | 347.0 | 455.51585 | 261.921251 | 1.0 | 209.5 | 466.0 | 687.5 | 891.0 | 347.0 | 0.135447 | ... | 0.0 | 5.0 | 347.0 | 12.661633 | 11.681696 | 0.0 | 7.75 | 7.925 | 10.0083 | 69.55 |


In [60]:
# Define age bins; right=False makes each bin left-closed: [0, 18), [18, 25), ...
# The labels are chosen to match those intervals.
age_bins = [0, 18, 25, 35, 50, 100]
age_labels = ["0-17", "18-24", "25-34", "35-49", "50+"]
train_df["AgeSlot"] = pd.cut(train_df["Age"], bins=age_bins, labels=age_labels, right=False)
train_df.groupby(['AgeSlot']).describe()[["Survived", 'Pclass']]


| AgeSlot | Survived count | Survived mean | Survived std | Survived min | Survived 25% | Survived 50% | Survived 75% | Survived max | Pclass count | Pclass mean | Pclass std | Pclass min | Pclass 25% | Pclass 50% | Pclass 75% | Pclass max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0-17 | 113.0 | 0.539823 | 0.500632 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 113.0 | 2.584071 | 0.677781 | 1.0 | 2.0 | 3.0 | 3.0 | 3.0 |
| 18-24 | 165.0 | 0.345455 | 0.476964 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 165.0 | 2.460606 | 0.761072 | 1.0 | 2.0 | 3.0 | 3.0 | 3.0 |
| 25-34 | 201.0 | 0.38806 | 0.488525 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 201.0 | 2.353234 | 0.75472 | 1.0 | 2.0 | 3.0 | 3.0 | 3.0 |
| 35-49 | 161.0 | 0.416149 | 0.494457 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 161.0 | 1.931677 | 0.888286 | 1.0 | 1.0 | 2.0 | 3.0 | 3.0 |
| 50+ | 74.0 | 0.364865 | 0.484678 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 74.0 | 1.554054 | 0.742854 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 |


### 3. Continue Exploring Your Dataset
   - Examine the number of unique values in each column.
   - Explore the possibility of pivoting the dataset.
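
Both points can be sketched with `DataFrame.nunique` and `DataFrame.pivot_table`. The toy frame below is made up; on the real data the same calls would be applied to `train_df`. Pivoting `Survived` by `Sex` and `Pclass` is just one plausible layout, chosen here as an assumption.

```python
import pandas as pd

# Toy data (made up); the real frame would be train_df.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Pclass": [3, 1, 3, 1],
    "Survived": [0, 1, 1, 0],
})

# Number of distinct values per column; low counts hint at categorical variables.
unique_counts = toy.nunique()
print(unique_counts)

# Mean survival with one row per sex and one column per passenger class.
pivot = toy.pivot_table(values="Survived", index="Sex", columns="Pclass", aggfunc="mean")
print(pivot)
```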

### 4. Make Decisions
   - Remove any irrelevant data.
   - Treat missing values.
   - Can you think of any interesting feature engineering?
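
As a starting point for feature engineering, two commonly tried ideas are extracting the title from `Name` and combining `SibSp` and `Parch` into a family size. The sketch below assumes names follow the "Surname, Title. Given names" pattern seen in the rows above; the helper `add_features` and the new column names `Title` and `FamilySize` are hypothetical.

```python
import pandas as pd

# Two names taken from the head() output above; the counts are from the same rows.
toy = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "SibSp": [1, 0],
    "Parch": [0, 0],
})

def add_features(df):
    """Return a copy with Title and FamilySize columns added."""
    out = df.copy()
    # The title sits between the comma and the first period.
    out["Title"] = out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    # Siblings/spouses + parents/children + the passenger themselves.
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    return out

features = add_features(toy)
print(features[["Title", "FamilySize"]])
```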

### 5. Are You Ready to Make Predictions?
   - Try it!
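
Before reaching for a model, a rule-based baseline gives a number to beat. The grouped statistics above show much higher survival for women than for men, so the sketch below simply predicts survival for every female passenger. The toy labels are made up and the helper `baseline_predict` is hypothetical; this is a sanity check, not a proposed final model.

```python
import pandas as pd

# Toy data (made up); the real frame would be train_df.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Survived": [0, 1, 1, 1, 0],
})

def baseline_predict(df):
    """Predict 1 (survived) for women and 0 for men."""
    return (df["Sex"] == "female").astype(int)

pred = baseline_predict(toy)
# Share of passengers for whom the rule matches the true label.
accuracy = (pred == toy["Survived"]).mean()
print(f"baseline accuracy: {accuracy:.2f}")
```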

### 6. Obtain the Names and Ticket Numbers of All First-Class Teenagers.


In [64]:
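# Note: this selects all first-class passengers; restricting the result to
# teenagers additionally requires a filter on Age.
train_df[train_df["Pclass"] == 1][['Name', 'Ticket']]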

| | Name | Ticket |
|---|---|---|
| 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | PC 17599 |
| 3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 113803 |
| 6 | McCarthy, Mr. Timothy J | 17463 |
| 11 | Bonnell, Miss. Elizabeth | 113783 |
| 23 | Sloper, Mr. William Thompson | 113788 |
| ... | ... | ... |
| 871 | Beckwith, Mrs. Richard Leonard (Sallie Monypeny) | 11751 |
| 872 | Carlsson, Mr. Frans Olof | 695 |
| 879 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | 11767 |
| 887 | Graham, Miss. Margaret Edith | 112053 |
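
The cell above filters on class only. A minimal sketch of the full query, assuming "teenager" means an age from 13 to 19 inclusive, combines the class condition with `Series.between`. The toy rows and the helper name `first_class_teenagers` are made up for illustration.

```python
import pandas as pd

# Toy data (made up); the real frame would be train_df.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Ticket": ["1", "2", "3", "4"],
    "Pclass": [1, 1, 3, 1],
    "Age": [17.0, 35.0, 15.0, None],
})

def first_class_teenagers(df, low=13, high=19):
    """Names and tickets of first-class passengers aged low..high (inclusive)."""
    # between() is inclusive on both ends and is False for missing ages.
    mask = (df["Pclass"] == 1) & df["Age"].between(low, high)
    return df.loc[mask, ["Name", "Ticket"]]

teens = first_class_teenagers(toy)
print(teens)
```

Passengers with a missing `Age` are excluded, since their age group cannot be determined.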


### 8. Load the Boeing Historical Airplane Orders & Deliveries Dataset
   - How many planes were in the building process for each month of the period covered by the dataset?

In [136]:
planes = pd.read_csv('OrdersandDeliveries.csv')

In [137]:
planes.describe()

| | Country | Customer Name | Delivery Year | Engine | Model Series | Order Month | Order Year | Region | Delivery Total | Order Total | Unfilled Orders |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 9073 | 9073 | 8048 | 9073 | 9073 | 9073 | 9073 | 9047 | 9073 | 9073 | 338 |
| unique | 132 | 570 | 66 | 7 | 59 | 13 | 69 | 14 | 34 | 69 | 60 |
| top | USA | United Airlines | 2018 | PW | 737-800 | Dec | 2007 | North America | 1 | 1 | 1 |
| freq | 3200 | 339 | 210 | 3077 | 1208 | 1314 | 349 | 3420 | 3170 | 3410 | 54 |


In [138]:
planes.columns

Index(['Country', 'Customer Name', 'Delivery Year ', 'Engine', 'Model Series',
       'Order Month', 'Order Year', 'Region', 'Delivery Total', 'Order Total',
       'Unfilled Orders'],
      dtype='object')
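
Note the trailing space in `'Delivery Year '`. One optional cleanup, sketched here on a made-up two-column frame, is to strip whitespace from all column names right after loading. This is an assumption about how one might tidy the data, not something the notebook does: the later cells keep using the original name with the space, so they would need `'Delivery Year'` instead if this step were applied.

```python
import pandas as pd

# Toy frame reproducing the trailing-space column name (values are made up).
toy = pd.DataFrame({"Delivery Year ": [1968, 1970], "Order Year": [1968, 1969]})

# rename() accepts a callable that is applied to every column label.
clean = toy.rename(columns=str.strip)
print(list(clean.columns))
```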

In [139]:
planes.head(15)

| | Country | Customer Name | Delivery Year | Engine | Model Series | Order Month | Order Year | Region | Delivery Total | Order Total | Unfilled Orders |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | Ariana Afghan Airlines | 1968.0 | PW | 727 | Mar | 1968 | Central Asia | 1 | 1 | NaN |
| 1 | Afghanistan | Ariana Afghan Airlines | 1970.0 | PW | 727 | Apr | 1969 | Central Asia | 1 | 1 | NaN |
| 2 | Afghanistan | Ariana Afghan Airlines | 1979.0 | GE | DC-10 | Sep | 1978 | Central Asia | 1 | 1 | NaN |
| 3 | Afghanistan | Ariana Afghan Airlines | NaN | CF | 737-700 | Nov | 2005 | Central Asia | 0 | 4 | NaN |
| 4 | Algeria | Air Algerie | 1974.0 | PW | 727 | Jan | 1974 | Africa | 1 | 1 | NaN |
| 5 | Algeria | Air Algerie | 1974.0 | PW | 737-200 | Jan | 1974 | Africa | 1 | 1 | NaN |
| 6 | Algeria | Air Algerie | 1975.0 | PW | 727 | Jan | 1974 | Africa | 1 | 1 | NaN |
| 7 | Algeria | Air Algerie | 1975.0 | PW | 737-200 | Jan | 1974 | Africa | 2 | 2 | NaN |
| 8 | Algeria | Air Algerie | 2015.0 | CF | 737-800 | Jan | 2014 | Africa | 2 | 2 | NaN |
| 9 | Algeria | Air Algerie | 2016.0 | CF | 737-800 | Jan | 2014 | Africa | 6 | 6 | NaN |


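In [141]:
# The dataset has no delivery month, so assume every delivery falls in December.
planes['Month Delivery'] = 'Dec'

In [87]:
planes.describe()

| | Country | Customer Name | Delivery Year | Engine | Model Series | Order Month | Order Year | Region | Delivery Total | Order Total | Unfilled Orders | Month Delivery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 9073 | 9073 | 8048 | 9073 | 9073 | 9073 | 9073 | 9047 | 9073 | 9073 | 338 | 9073 |
| unique | 132 | 570 | 66 | 7 | 59 | 13 | 69 | 14 | 34 | 69 | 60 | 1 |
| top | USA | United Airlines | 2018 | PW | 737-800 | Dec | 2007 | North America | 1 | 1 | 1 | Dec |
| freq | 3200 | 339 | 210 | 3077 | 1208 | 1314 | 349 | 3420 | 3170 | 3410 | 54 | 9073 |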

In [95]:
planes.columns

Index(['Country', 'Customer Name', 'Delivery Year ', 'Engine', 'Model Series',
       'Order Month', 'Order Year', 'Region', 'Delivery Total', 'Order Total',
       'Unfilled Orders', 'Month Delivery'],
      dtype='object')

In [147]:
df = planes.copy()
df = df.dropna(subset=['Order Year', 'Order Month', 'Delivery Year ', 'Month Delivery'])
    
# Convert dates to datetime with error handling
try:
    df['Order_Date'] = pd.to_datetime(
        df['Order Year'].astype(str) + ' ' + df['Order Month'], 
        format='%Y %b',
        errors='coerce'  # This will set invalid dates to NaT
    )
    
    df['Delivery_Date'] = pd.to_datetime(
        df['Delivery Year '].astype(str) + ' ' + df['Month Delivery'], 
        format='%Y %b',
        errors='coerce'  # This will set invalid dates to NaT
    )
    
    # Drop rows where date conversion failed
    df = df.dropna(subset=['Order_Date', 'Delivery_Date'])
    
except Exception as e:
    print(f"Error in date conversion: {e}")
    print("Problematic data:")
    print(df[['Order Year', 'Order Month', 'Delivery Year ', 'Month Delivery']].head())

# Ensure order counts are numeric so they can be summed later
df['Order Total'] = pd.to_numeric(df['Order Total'], errors='coerce')

In [148]:
df.head(15)

| | Country | Customer Name | Delivery Year | Engine | Model Series | Order Month | Order Year | Region | Delivery Total | Order Total | Unfilled Orders | Month Delivery | Order_Date | Delivery_Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | Ariana Afghan Airlines | 1968 | PW | 727 | Mar | 1968 | Central Asia | 1 | 1 | NaN | Dec | 1968-03-01 | 1968-12-01 |
| 1 | Afghanistan | Ariana Afghan Airlines | 1970 | PW | 727 | Apr | 1969 | Central Asia | 1 | 1 | NaN | Dec | 1969-04-01 | 1970-12-01 |
| 2 | Afghanistan | Ariana Afghan Airlines | 1979 | GE | DC-10 | Sep | 1978 | Central Asia | 1 | 1 | NaN | Dec | 1978-09-01 | 1979-12-01 |
| 4 | Algeria | Air Algerie | 1974 | PW | 727 | Jan | 1974 | Africa | 1 | 1 | NaN | Dec | 1974-01-01 | 1974-12-01 |
| 5 | Algeria | Air Algerie | 1974 | PW | 737-200 | Jan | 1974 | Africa | 1 | 1 | NaN | Dec | 1974-01-01 | 1974-12-01 |
| 6 | Algeria | Air Algerie | 1975 | PW | 727 | Jan | 1974 | Africa | 1 | 1 | NaN | Dec | 1974-01-01 | 1975-12-01 |
| 7 | Algeria | Air Algerie | 1975 | PW | 737-200 | Jan | 1974 | Africa | 2 | 2 | NaN | Dec | 1974-01-01 | 1975-12-01 |
| 8 | Algeria | Air Algerie | 2015 | CF | 737-800 | Jan | 2014 | Africa | 2 | 2 | NaN | Dec | 2014-01-01 | 2015-12-01 |
| 9 | Algeria | Air Algerie | 2016 | CF | 737-800 | Jan | 2014 | Africa | 6 | 6 | NaN | Dec | 2014-01-01 | 2016-12-01 |
| 10 | Algeria | Air Algerie | 1971 | PW | 727 | Feb | 1970 | Africa | 2 | 2 | NaN | Dec | 1970-02-01 | 1971-12-01 |


In [149]:
start_date = df['Order_Date'].min()
end_date = df['Delivery_Date'].max()
print(f'start date {start_date} end date {end_date}')

start date 1955-10-01 00:00:00 end date 2022-12-01 00:00:00


In [150]:
date_range = pd.date_range(start=start_date, end=end_date, freq='M')
print(date_range)

DatetimeIndex(['1955-10-31', '1955-11-30', '1955-12-31', '1956-01-31',
               '1956-02-29', '1956-03-31', '1956-04-30', '1956-05-31',
               '1956-06-30', '1956-07-31',
               ...
               '2022-02-28', '2022-03-31', '2022-04-30', '2022-05-31',
               '2022-06-30', '2022-07-31', '2022-08-31', '2022-09-30',
               '2022-10-31', '2022-11-30'],
              dtype='datetime64[ns]', length=806, freq='M')


In [151]:
# Create a DataFrame with one row per month
result_df = pd.DataFrame(index=date_range)
result_df['df in Production'] = 0

# For each month-end date, sum the ordered planes that are in production, i.e. where:
# 1. the order date is on or before the current date
# 2. the delivery date is on or after the current date
for date in date_range:
    mask = (df['Order_Date'] <= date) & (df['Delivery_Date'] >= date)
    result_df.loc[date, 'df in Production'] = df.loc[mask, 'Order Total'].sum()

# Format index for better readability
result_df.index = result_df.index.strftime('%Y-%m')
result_df

| | df in Production |
|---|---|
| 1955-10 | 61 |
| 1955-11 | 100 |
| 1955-12 | 143 |
| 1956-01 | 149 |
| 1956-02 | 158 |
| ... | ... |
| 2022-07 | 130 |
| 2022-08 | 130 |
| 2022-09 | 130 |
| 2022-10 | 130 |
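
The month-by-month loop above rescans the whole frame for every date. A loop-free alternative, sketched here under the same inclusion rule (ordered on or before the date, not delivered before it), uses cumulative sums and `numpy.searchsorted`. The helper name `in_production` and the toy rows are made up; the sketch assumes the date columns contain no `NaT`, which holds after the `dropna` step earlier.

```python
import numpy as np
import pandas as pd

# Toy orders (made up) with the same column names as the notebook's df.
toy = pd.DataFrame({
    "Order_Date": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-02-01"]),
    "Delivery_Date": pd.to_datetime(["2020-03-01", "2020-02-01", "2020-04-01"]),
    "Order Total": [2, 1, 3],
})
dates = pd.to_datetime(["2020-01-31", "2020-02-29", "2020-03-31", "2020-04-30"])

def in_production(df, dates):
    """Units ordered on or before each date and not delivered before it."""
    by_order = df.sort_values("Order_Date")
    by_delivery = df.sort_values("Delivery_Date")
    # Running totals with a leading 0 so that "no rows yet" maps to 0 units.
    opened_cum = np.concatenate([[0], by_order["Order Total"].cumsum().to_numpy()])
    closed_cum = np.concatenate([[0], by_delivery["Order Total"].cumsum().to_numpy()])
    # Number of rows with Order_Date <= date (side="right" includes ties).
    n_opened = np.searchsorted(by_order["Order_Date"].to_numpy(), dates.to_numpy(), side="right")
    # Number of rows with Delivery_Date < date (side="left" excludes ties).
    n_closed = np.searchsorted(by_delivery["Delivery_Date"].to_numpy(), dates.to_numpy(), side="left")
    return pd.Series(opened_cum[n_opened] - closed_cum[n_closed], index=dates)

counts = in_production(toy, dates)
print(counts)
```

On the toy rows this gives the same counts as applying the loop's mask date by date, while sorting the frame only twice.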


In [152]:
result_df.index = pd.to_datetime(result_df.index, format='%Y-%m')

# Extract year and month
result_df['Year'] = result_df.index.year
result_df['Month'] = result_df.index.month

# Create a pivot table with 'Month' as rows and 'Year' as columns
pivot_result_df = result_df.pivot_table(values='df in Production', index='Month', columns='Year', aggfunc='sum')

# Display the pivot table
pivot_result_df

| Month \ Year | 1955 | 1956 | 1957 | 1958 | 1959 | 1960 | 1961 | 1962 | 1963 | 1964 | ... | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NaN | 149.0 | 239.0 | 270.0 | 303.0 | 243.0 | 205.0 | 216.0 | 182.0 | 239.0 | ... | 2874.0 | 3091.0 | 2862.0 | 2564.0 | 2049.0 | 1477.0 | 883.0 | 548.0 | 412.0 | 130.0 |
| 2 | NaN | 158.0 | 239.0 | 273.0 | 303.0 | 251.0 | 223.0 | 218.0 | 188.0 | 260.0 | ... | 2981.0 | 3137.0 | 2890.0 | 2566.0 | 2083.0 | 1497.0 | 887.0 | 548.0 | 417.0 | 130.0 |
| 3 | NaN | 177.0 | 246.0 | 277.0 | 306.0 | 261.0 | 229.0 | 228.0 | 189.0 | 267.0 | ... | 3020.0 | 3211.0 | 2920.0 | 2593.0 | 2133.0 | 1544.0 | 890.0 | 552.0 | 437.0 | 130.0 |
| 4 | NaN | 196.0 | 247.0 | 277.0 | 306.0 | 261.0 | 241.0 | 231.0 | 209.0 | 269.0 | ... | 3021.0 | 3242.0 | 2956.0 | 2622.0 | 2137.0 | 1549.0 | 890.0 | 552.0 | 452.0 | 130.0 |
| 5 | NaN | 196.0 | 247.0 | 280.0 | 306.0 | 261.0 | 276.0 | 234.0 | 221.0 | 299.0 | ... | 3148.0 | 3279.0 | 2962.0 | 2630.0 | 2146.0 | 1565.0 | 890.0 | 555.0 | 461.0 | 130.0 |
| 6 | NaN | 196.0 | 247.0 | 280.0 | 317.0 | 266.0 | 283.0 | 241.0 | 222.0 | 320.0 | ... | 3373.0 | 3321.0 | 2997.0 | 2637.0 | 2177.0 | 1595.0 | 894.0 | 555.0 | 461.0 | 130.0 |
| 7 | NaN | 202.0 | 247.0 | 310.0 | 318.0 | 280.0 | 289.0 | 245.0 | 226.0 | 353.0 | ... | 3425.0 | 3361.0 | 3087.0 | 2655.0 | 2186.0 | 1600.0 | 904.0 | 555.0 | 466.0 | 130.0 |
| 8 | NaN | 205.0 | 247.0 | 310.0 | 318.0 | 293.0 | 317.0 | 248.0 | 248.0 | 385.0 | ... | 3439.0 | 3407.0 | 3101.0 | 2672.0 | 2192.0 | 1602.0 | 908.0 | 561.0 | 467.0 | 130.0 |
| 9 | NaN | 214.0 | 249.0 | 310.0 | 323.0 | 293.0 | 319.0 | 249.0 | 252.0 | 406.0 | ... | 3484.0 | 3440.0 | 3130.0 | 2676.0 | 2200.0 | 1634.0 | 922.0 | 561.0 | 467.0 | 130.0 |
| 10 | 61.0 | 230.0 | 259.0 | 310.0 | 332.0 | 298.0 | 331.0 | 263.0 | 255.0 | 409.0 | ... | 3509.0 | 3453.0 | 3146.0 | 2722.0 | 2211.0 | 1638.0 | 926.0 | 561.0 | 467.0 | 130.0 |
