# understanding the dataset

Quality data is fundamental to any data science engagement. To gain actionable insights, the appropriate data must be sourced and cleansed.

It is important at the beginning of a project to consider potential harms from your tool. These harms can be caused by designing for only a narrow group of users, having insufficient representation of sub-populations, or human labelers favoring a privileged group.

Machine learning discovers and generalizes patterns in the data and can therefore replicate any bias the data contains. If a group is under-represented, the model has fewer examples to learn from, which reduces its accuracy for individuals in that group.

When such models are deployed at scale, they can produce a large number of biased decisions and harm many people. Make sure you have evaluated these risks and have techniques in place to mitigate them.
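
One way to act on the under-representation concern is to measure how large each group is before any modeling starts. The sketch below is a minimal illustration rather than a full fairness audit: it uses five values copied from the dataset preview shown later in this notebook, and the `MIN_SHARE` threshold is a hypothetical value chosen only for the example.

```python
import pandas as pd

# Five values copied from the df.head() preview; a real check would use the full frame.
sample = pd.DataFrame({
    "Sex": ["male", "female", "male", "male", "female"],
})

# Share of rows that belong to each group.
shares = sample["Sex"].value_counts(normalize=True)

# Hypothetical threshold: flag any group that makes up less than 45% of the rows.
MIN_SHARE = 0.45
under_represented = shares[shares < MIN_SHARE]

print(shares)
print(under_represented)
```

The same two lines can be pointed at any column that identifies a sub-population; what counts as "too small" depends on the project and has to be decided case by case.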

In [4]:
# import pandas for loading and exploring the data
import pandas as pd

In [13]:
# load the Titanic dataset from Kaggle
# link: https://www.kaggle.com/datasets/brendan45774/test-file
df = pd.read_csv('tested.csv')

# how big is the dataset?

In [52]:
df.shape

(418, 12)

# what does the data look like?

In [24]:
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 0 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 1 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 0 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 0 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 1 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


# what are the data types of the columns?

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Survived     418 non-null    int64  
 2   Pclass       418 non-null    int64  
 3   Name         418 non-null    object 
 4   Sex          418 non-null    object 
 5   Age          332 non-null    float64
 6   SibSp        418 non-null    int64  
 7   Parch        418 non-null    int64  
 8   Ticket       418 non-null    object 
 9   Fare         417 non-null    float64
 10  Cabin        91 non-null     object 
 11  Embarked     418 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 39.3+ KB
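
The `+` in `memory usage: 39.3+ KB` signals a lower bound: by default pandas does not count the memory held by the values inside `object` columns. A minimal sketch of how to request the fuller figure, using a small stand-in frame built from rows in the preview above (`sample` is an illustrative name, not part of the original notebook):

```python
import pandas as pd

# Stand-in frame with one numeric and two text columns, copied from the preview.
sample = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Name": [
        "Kelly, Mr. James",
        "Wilkes, Mrs. James (Ellen Needs)",
        "Myles, Mr. Thomas Francis",
    ],
    "Sex": ["male", "female", "male"],
})

# Default measurement: may undercount text columns.
shallow_bytes = sample.memory_usage().sum()

# deep=True also inspects the stored values of object columns.
deep_bytes = sample.memory_usage(deep=True).sum()

print(shallow_bytes, deep_bytes)
```

On the full frame, `df.info(memory_usage="deep")` prints the same deeper measurement in the summary line.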


# are there any missing values?

In [27]:
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
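
Raw counts are easier to judge as a share of all rows. The sketch below recomputes that share from the counts printed above and the 418 rows reported by `df.shape`; the variable names are illustrative only.

```python
import pandas as pd

# Missing-value counts copied from the df.isna().sum() output above.
missing_counts = pd.Series({"Age": 86, "Fare": 1, "Cabin": 327})

# Number of rows reported by df.shape.
n_rows = 418

# Percentage of rows with a missing value in each column.
missing_pct = (missing_counts / n_rows * 100).round(1)
print(missing_pct)

# On the full frame the same figures come from: df.isna().mean() * 100
```

A column such as `Cabin`, where most values are missing, usually calls for a different treatment than `Fare`, where only a single value is absent.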

# are there any duplicate rows?

In [31]:
df.duplicated().sum()

0

# what does the data look like statistically?

In [32]:
df.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 418.0 | 418.0 | 418.0 | 332.0 | 418.0 | 418.0 | 417.0 |
| mean | 1100.5 | 0.363636 | 2.26555 | 30.27259 | 0.447368 | 0.392344 | 35.627188 |
| std | 120.810458 | 0.481622 | 0.841838 | 14.181209 | 0.89676 | 0.981429 | 55.907576 |
| min | 892.0 | 0.0 | 1.0 | 0.17 | 0.0 | 0.0 | 0.0 |
| 25% | 996.25 | 0.0 | 1.0 | 21.0 | 0.0 | 0.0 | 7.8958 |
| 50% | 1100.5 | 0.0 | 3.0 | 27.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 1204.75 | 1.0 | 3.0 | 39.0 | 1.0 | 0.0 | 31.5 |
| max | 1309.0 | 1.0 | 3.0 | 76.0 | 8.0 | 9.0 | 512.3292 |
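
By default `describe()` summarizes only the numeric columns of a mixed frame, which is why text columns such as `Sex` and `Embarked` are absent from the table above. A minimal sketch of summarizing text columns, run on five rows copied from the preview (the `sample` frame is a stand-in, so the counts do not describe the full dataset):

```python
import pandas as pd

# Text columns from the five preview rows.
sample = pd.DataFrame({
    "Sex": ["male", "female", "male", "male", "female"],
    "Embarked": ["Q", "S", "Q", "S", "S"],
})

# For text columns, describe() reports count, unique, top and freq.
text_summary = sample.describe()
print(text_summary)
```

Selecting the text columns first, as done here, avoids relying on dtype filters; on the full frame the equivalent would be `df[["Sex", "Embarked"]].describe()`.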


# how are the columns correlated?

In [50]:
df.corr(numeric_only=True)

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| PassengerId | 1.0 | -0.023245 | -0.026751 | -0.034102 | 0.003818 | 0.04308 | 0.008211 |
| Survived | -0.023245 | 1.0 | -0.108615 | -1.3e-05 | 0.099943 | 0.15912 | 0.191514 |
| Pclass | -0.026751 | -0.108615 | 1.0 | -0.492143 | 0.001087 | 0.018721 | -0.577147 |
| Age | -0.034102 | -1.3e-05 | -0.492143 | 1.0 | -0.091587 | -0.061249 | 0.337932 |
| SibSp | 0.003818 | 0.099943 | 0.001087 | -0.091587 | 1.0 | 0.306895 | 0.171539 |
| Parch | 0.04308 | 0.15912 | 0.018721 | -0.061249 | 0.306895 | 1.0 | 0.230046 |
| Fare | 0.008211 | 0.191514 | -0.577147 | 0.337932 | 0.171539 | 0.230046 | 1.0 |


In [49]:
df.corr(numeric_only=True)['PassengerId']

PassengerId    1.000000
Survived      -0.023245
Pclass        -0.026751
Age           -0.034102
SibSp          0.003818
Parch          0.043080
Fare           0.008211
Name: PassengerId, dtype: float64
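
`numeric_only=True` leaves text columns such as `Sex` out of the correlation matrix. One possible way to include such a column is to encode it as 0/1 first. The sketch below does this on the five preview rows; `is_female` is a hypothetical helper column, and a correlation computed on five rows says nothing reliable about the full dataset.

```python
import pandas as pd

# Three columns copied from the five preview rows.
sample = pd.DataFrame({
    "Survived": [0, 1, 0, 0, 1],
    "Sex": ["male", "female", "male", "male", "female"],
    "Fare": [7.8292, 7.0, 9.6875, 8.6625, 12.2875],
})

# Hypothetical helper column: 1 for female passengers, 0 otherwise.
sample["is_female"] = (sample["Sex"] == "female").astype(int)

# The text column is still dropped, but its encoded version now takes part.
corr = sample.corr(numeric_only=True)
print(corr["Survived"])
```

Indexing the result by a single column, as with `['PassengerId']` above, is a quick way to see how one variable relates to all the others.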

# other stats

In [48]:
df.mean(numeric_only=True)

PassengerId    1100.500000
Survived          0.363636
Pclass            2.265550
Age              30.272590
SibSp             0.447368
Parch             0.392344
Fare             35.627188
dtype: float64

In [47]:
df.median(numeric_only=True)

PassengerId    1100.5000
Survived          0.0000
Pclass            3.0000
Age              27.0000
SibSp             0.0000
Parch             0.0000
Fare             14.4542
dtype: float64

In [46]:
df.mode(numeric_only=True)

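| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| 0 | 892 | 0.0 | 3.0 | 21.0 | 0.0 | 0.0 | 7.75 |
| 1 | 893 | NaN | NaN | 24.0 | NaN | NaN | NaN |
| 2 | 894 | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 895 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 896 | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 413 | 1305 | NaN | NaN | NaN | NaN | NaN | NaN |
| 414 | 1306 | NaN | NaN | NaN | NaN | NaN | NaN |
| 415 | 1307 | NaN | NaN | NaN | NaN | NaN | NaN |
| 416 | 1308 | NaN | NaN | NaN | NaN | NaN | NaN |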
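
The long, mostly empty result follows from how `mode()` works: it returns every value tied for most frequent and pads the shorter columns with NaN. Since each `PassengerId` appears to occur only once, every ID counts as a mode and stretches the table. A small sketch of the same behaviour on five preview rows, with `first_mode` as an illustrative name for the usual workaround of keeping only the first row:

```python
import pandas as pd

# Two columns copied from the five preview rows.
sample = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895, 896],
    "Pclass": [3, 3, 2, 3, 3],
})

# Every PassengerId is unique, so all five are returned as modes;
# Pclass has a single mode and is padded with NaN below it.
modes = sample.mode()
print(modes)

# Keep only the first mode of each column.
first_mode = modes.iloc[0]
print(first_mode)
```

Dropping identifier columns before calling `mode()` is another option when only the descriptive columns matter.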


In [45]:
df.count(numeric_only=True)

PassengerId    418
Survived       418
Pclass         418
Age            332
SibSp          418
Parch          418
Fare           417
dtype: int64

In [44]:
df.std(numeric_only=True)

PassengerId    120.810458
Survived         0.481622
Pclass           0.841838
Age             14.181209
SibSp            0.896760
Parch            0.981429
Fare            55.907576
dtype: float64

In [43]:
df.max(numeric_only=True)

PassengerId    1309.0000
Survived          1.0000
Pclass            3.0000
Age              76.0000
SibSp             8.0000
Parch             9.0000
Fare            512.3292
dtype: float64

In [42]:
df.min(numeric_only=True)

PassengerId    892.00
Survived         0.00
Pclass           1.00
Age              0.17
SibSp            0.00
Parch            0.00
Fare             0.00
dtype: float64
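
The separate calls above can also be collected into a single table with `agg`. The sketch below is one possible way to do that, shown on two columns from the five preview rows so that it runs without the CSV file; the numbers therefore differ from the full-dataset results above.

```python
import pandas as pd

# Age and Fare copied from the five preview rows.
sample = pd.DataFrame({
    "Age": [34.5, 47.0, 62.0, 27.0, 22.0],
    "Fare": [7.8292, 7.0, 9.6875, 8.6625, 12.2875],
})

# One row per statistic, one column per variable.
summary = sample.agg(["mean", "median", "std", "min", "max"])
print(summary)
```

On the full frame the numeric columns would need to be selected first, for example with `df.select_dtypes("number")`, because some of these statistics are not defined for text columns.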