### Types of Values in Dataset
- 95% to 98% - Normal Values
-  1% to 3%  - Null Values
-  1% to 2%  - Outlying Values
---

### Importing the Required Modules

In [1]:
from sklearn import datasets # Sci-Kit Learn Module
import pandas as pd
import numpy as np

---

### Mounting Drive to Access the <b>titanic.csv</b> File

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


---

### Importing Dataset

In [2]:
# # Method 1 - Using Sci-Kit Learn Dataset
# data = datasets.load_boston

In [7]:
# Method 2 - Using Custom Dataset (csv file)
data = pd.read_csv('/content/drive/MyDrive/titanic.csv')
print(data.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


---
### 1. Removing/Repalcing Null values

Check the Dataset for Null values and then remove/drop them by using <b>dropna( )</b>. If there are many Null values in the dataset, then they are considered as <b>Noise</b>

Then we use <b>data.describe( )</b> to get the Average (50%) value for all the Column/Columns containing Null values, (here Age) and then replace the default Null values with this Average value using <b>data.fillna( )<b>

In [8]:
data.describe() # Used to get the Average Age i.e. 28.000000

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [9]:
print(dict(data['Cabin'].value_counts()))
print('--------------------------------------------')
print(dict(data['Embarked'].value_counts()))

{'B96 B98': 4, 'G6': 4, 'C23 C25 C27': 4, 'C22 C26': 3, 'F33': 3, 'F2': 3, 'E101': 3, 'D': 3, 'C78': 2, 'C93': 2, 'E8': 2, 'D36': 2, 'B77': 2, 'C123': 2, 'E121': 2, 'E44': 2, 'D35': 2, 'C125': 2, 'E67': 2, 'B35': 2, 'B18': 2, 'E24': 2, 'B49': 2, 'C65': 2, 'B20': 2, 'B5': 2, 'B57 B59 B63 B66': 2, 'C126': 2, 'B51 B53 B55': 2, 'F4': 2, 'C124': 2, 'F G73': 2, 'B58 B60': 2, 'C52': 2, 'D33': 2, 'C68': 2, 'D20': 2, 'D26': 2, 'B28': 2, 'C83': 2, 'E25': 2, 'D17': 2, 'B22': 2, 'C92': 2, 'C2': 2, 'E33': 2, 'C70': 1, 'E58': 1, 'A16': 1, 'C86': 1, 'D19': 1, 'D48': 1, 'A26': 1, 'B50': 1, 'A20': 1, 'C101': 1, 'A10': 1, 'A23': 1, 'E68': 1, 'D9': 1, 'B41': 1, 'D50': 1, 'C85': 1, 'B71': 1, 'D49': 1, 'B42': 1, 'C50': 1, 'A24': 1, 'E17': 1, 'D28': 1, 'C47': 1, 'E49': 1, 'B69': 1, 'B102': 1, 'A36': 1, 'B82 B84': 1, 'D6': 1, 'B3': 1, 'F38': 1, 'E77': 1, 'D11': 1, 'D30': 1, 'C46': 1, 'D45': 1, 'B101': 1, 'B38': 1, 'C45': 1, 'C90': 1, 'C62 C64': 1, 'F G63': 1, 'B39': 1, 'E10': 1, 'C95': 1, 'B86': 1, 'C99': 1,

In [10]:
print('Before Removing Null values:'+'\n')
print(data.isnull().sum()) # for getting total number of null values
print('\n','Dimentions of Dataframe Before: ',data.shape,'\n')
print('----------------------------------------------------')
# data = data.dropna()                                        # dropna() is for Removing Null/Missing Values
data = data.fillna({'Age':28, 'Cabin':'G6', 'Embarked':'S'})  # fillna() is for Replacing Null/Missing Values with Another Value
print('After Removing Null values:' + '\n')
print(data.isnull().sum())
print('\n','Dimentions of Dataframe After: ',data.shape,'\n')

Before Removing Null values:

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

 Dimentions of Dataframe Before:  (891, 12) 

----------------------------------------------------
After Removing Null values:

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64

 Dimentions of Dataframe After:  (891, 12) 



---
### 2. Identifying the Outliers

An Outlier is a value that is very distant as compared to the majority of the other values. It can simply be called as 'Odd One Out value'. There are usually very few Outliers in a Dataset

Our goal here is to replace as many the Outliers/Outlying values as possible in the Dataset with the Median/Average Values

This is Done by using the 1st Quartile, 3rd Quartile and Inter Quartile Range to return the Boolean values for all the Dataset values.

The values which are <b>True</b> in the Dataset are the Outliers

In [11]:
Q1 = data.quantile(0.25) # First Quartile of the Data
Q3 = data.quantile(0.75) # Third Quartile of the Data
IQR = Q3 - Q1            # Inter Quartile Range
print(data < (Q1 - 1.5 * IQR)) or (data > (Q3 + 1.5 * IQR)) # Printing values that satisfy Both Conditions

       Age  Cabin  Embarked   Fare   Name  Parch  PassengerId  Pclass    Sex  \
0    False  False     False  False  False  False        False   False  False   
1    False  False     False  False  False  False        False   False  False   
2    False  False     False  False  False  False        False   False  False   
3    False  False     False  False  False  False        False   False  False   
4    False  False     False  False  False  False        False   False  False   
..     ...    ...       ...    ...    ...    ...          ...     ...    ...   
886  False  False     False  False  False  False        False   False  False   
887  False  False     False  False  False  False        False   False  False   
888  False  False     False  False  False  False        False   False  False   
889  False  False     False  False  False  False        False   False  False   
890  False  False     False  False  False  False        False   False  False   

     SibSp  Survived  Ticket  
0    Fal

  after removing the cwd from sys.path.
  after removing the cwd from sys.path.


Unnamed: 0,Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Survived,Ticket
0,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,True,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...
886,False,False,False,False,False,False,False,False,False,False,False,False
887,False,False,False,False,False,False,False,False,False,False,False,False
888,False,False,False,False,False,True,False,False,False,False,False,False
889,False,False,False,False,False,False,False,False,False,False,False,False


In [12]:
print(data['Age'].quantile(0.5)) # Average age value
print(data['Age'].quantile(0.95)) # Outlying age value
print(data['Age'].quantile(0.75)) # Age value to replace the Outlying values

28.0
54.0
35.0


In [13]:
data['Age'] = np.where(data['Age'] > 60.0, 47.0, data['Age']) 
# Where there is an Age value greater than 60, replace it with 47
print(data['Age'].describe()) 
# Here we can observe that the Maximum age in the ship is fixed to be 60

count    891.000000
mean      28.891886
std       11.965069
min        0.420000
25%       22.000000
50%       28.000000
75%       35.000000
max       60.000000
Name: Age, dtype: float64


We need to repeat the same process for all the other columns of the Dataset so that all the Outliers in the Dataset can be Replaced.